#archiveteam-bs log for 2021-05-09


00:12:44		Ruthalas (Ruthalas) joins
00:31:18		Arcorann_ joins
00:41:31		Mineroboter joins
00:44:04		Mineroboter_ quits [Ping timeout: 250 seconds]
00:46:47		pcr leaves
00:46:49		pcr joins
01:01:12		howardad (howardad) joins
01:01:41		howardad quits [Client Quit]
01:02:10		howardad (howardad) joins
01:03:06		dm4v quits [Read error: Connection reset by peer]
01:03:55		dm4v joins
01:03:57		dm4v is now authenticated as dm4v
01:03:57		dm4v quits [Changing host]
01:03:58		dm4v (dm4v) joins
01:12:27		Wayward quits [Remote host closed the connection]
02:03:29		lennier2 joins
02:06:57		lennier1 quits [Quit: Going offline, see ya! (www.adiirc.com)]
03:14:11		Wayward (wayward) joins
03:40:19		qw3rty_ joins
03:43:58		qw3rty__ quits [Ping timeout: 258 seconds]
05:01:29		DogsRNice quits [Read error: Connection reset by peer]
05:25:33		nerdguy1138 quits [Ping timeout: 258 seconds]
05:40:34		nerdguy1138 (nerdguy1138) joins
07:43:52		duce1337 (duce1337) joins
08:04:07		bobbyb joins
08:04:38	<bobbyb>	Is there a place to submit whole websites for archiving?
08:04:55	<bobbyb>	I just came across https://www.tapeheads.net/forums.php that has little presence on the Wayback Machine
08:05:42	<bobbyb>	I just archived a very interesting thread on the merits of demagnetizing tape heads that had no backups
08:06:56	<bobbyb>	Oh, I think I may be in the wrong channel
08:07:16	<thuban>	bobbyb: no no, here is fine
08:11:14	<thuban>	JAA: vbulletin forum, just under 1M posts. is that too big for an archivebot job? qwarc?
08:22:04	<thuban>	in other news, older rthk videos stored on its internal servers (at least for _hong kong connection_) went down around the same time they started deleting youtube videos :(
08:22:06	<thuban>	streaming versions timed out for a while and are now 404ing; 'archive' versions just 500
08:23:18	<thuban>	the episode web pages are still up, weirdly (even though they're now broken...) so i'm just going to grab thumbnails and descriptions for as many of the remaining episodes as i can
08:56:32		BlueMaxima quits [Read error: Connection reset by peer]
09:00:13		shoghicp quits [Ping timeout: 258 seconds]
09:02:07		shoghicp (shoghicp) joins
09:17:05		treora quits [Ping timeout: 258 seconds]
09:48:48		sliccricc (sliccricc) joins
10:22:38		sliccricc quits [Remote host closed the connection]
10:22:57		sliccricc (sliccricc) joins
10:44:23		duce1337_ (duce1337) joins
10:44:23		duce1337 quits [Read error: Connection reset by peer]
11:26:33	<betamax>	I've now finished pre-processing the URLs related to the recent UK elections. Specifically, I now have:
11:26:37	<betamax>	* 3904 Twitter usernames of political parties / candidates
11:26:39	<betamax>	* 4175 Facebook pages (likely not possible to be archived due to FB's rate limiting?)
11:26:42	<betamax>	* 318 Instagram profiles (again, likely not possible due to rate limiting?)
11:26:45	<betamax>	* 89 Youtube channels (and a Dailymotion channel) used by parties / candidates
11:26:48	<betamax>	* 530 Political Party websites
11:26:51	<betamax>	* 1273 Candidate Websites
11:26:53	<betamax>	What is the best way to get these archived?
11:26:53	<betamax>	The candidate web pages can go in archivebot with "archiveonly".
11:26:53	<betamax>	Should the party / candidate websites be split into groups / chunks to avoid overloading a single archivebot job?
11:26:56	<betamax>	* 2503 Candidate Web pages (these have been linked to separately and include manifestos, etc... - there will likely be a large overlap between these pages and the candidate websites, but I think it is best to archive these as well).
11:27:00	<betamax>	Is there a way to feed ~4000 twitter usernames into socialbot (without overloading it!)
11:37:00	<Barto>	this belongs in a wiki page
11:37:10	<AK>	Maybe some of that could go into urls, means ab doesn't have to do it all
11:40:19		forkwhilefork quits [Remote host closed the connection]
11:40:39		forkwhilefork (forkwhilefork) joins
11:40:44	<betamax>	Barto: OK, I'll make up a wiki page with the information.
11:42:05	<Barto>	i mean, the idea is to at least write it somewhere
11:42:58	<betamax>	Yup, seems sensible :)
11:46:12	<@HCross>	betamax: can you share the list please? I can also get started on them
11:49:04	<Barto>	^ that's exactly why I think writing it somewhere is great
11:50:08	<betamax>	Yup, just about to. Can I upload txt files to the wiki, or is it best to host them on my own server, then wayback them and link to the archived list?
11:50:49	<betamax>	Ah, seems the wiki supports generic file upload.
11:51:02	<betamax>	lol, no it doesn't
11:58:28	<betamax>	wiki page (with links to lists): https://wiki.archiveteam.org/index.php/Elections/2021_UK_elections
11:58:38	<betamax>	HCross: ^
11:58:46		treora joins
12:00:51		sliccricc quits [Remote host closed the connection]
12:02:41		nuroten quits [Remote host closed the connection]
12:05:03	<masterX244>	i would have used a archive.org item for storing the files (did that with some other discovery results at my end earlier)
12:47:14		Daloader joins
14:12:05		treeplant quits [Ping timeout: 244 seconds]
14:25:39		nuroten joins
14:25:48	<thuban>	rthk update: streaming videos appear to be back up; we'll see how long that lasts. also, 2+-year-old episodes appear to be hosted on yet another subdomain, which is still working. i am cautiously optimistic...
14:27:08	<nuroten>	thuban: thanks for the update. not sure how long they'll be up, but given what happened to the youtube videos better safe than regretful
14:31:02	<nuroten>	yesterday I found another commercial news outlet had pulled most of their short clips from a documentary series off their website (they fired a bunch of staff in december last year, including the entire team behind production of said series)
14:33:02	<nuroten>	the series was called News Lancet and was a bit like Hong Kong Connection in that it explored social and political topics
14:37:18		britm0b quits [Read error: Connection reset by peer]
14:40:03		duce1337_ quits [Read error: Connection reset by peer]
14:40:18		duce1337 (duce1337) joins
14:40:23	<nuroten>	fingers crossed as much of the site can be saved before the purge
14:42:45	<nuroten>	thuban: looks like most are on archive.* and some older ones on podcast.* ... ipv6 is down on archive.*
14:44:11		britmob joins
14:45:43	<thuban>	nuroten: there's actually a second set of sources for the newer eps (visible through the web page but not in the feed), what i've been calling 'streaming'; they are higher quality so i've been getting them where i can
14:46:04	<nuroten>	the ones from akamai?
14:47:06	<thuban>	the _new_ new ones are on akamai, but the old new ones are self-hosted (stmw.rthk.hk)
14:48:23	<nuroten>	thanks, didn't know that one ... how are speeds on it, as slow as archive.* ?
14:48:55	<thuban>	similar, yeah
14:51:36		LeGoupil joins
14:52:54		etnguyen03 (etnguyen03) joins
14:53:25	<nuroten>	okay. I've been trying to download select ones, and do a pass for all items in a feed later. full pass is more systematic, but resorting to cherry-pick due to slow speeds and unknown expiry
14:53:40	<thuban>	mhm
14:55:37	<thuban>	you may want to check against https://archive.org/details/hk-connection-2016-2020 (and other stuff under the "Hong Kong Connection" subject; i think there are a bunch of individual episodes, but i don't read chinese) and/or the contents of the torrent linked here https://old.reddit.com/r/DataHoarder/comments/n3pvnm/hong_kong_broadcaster_rthk_to_delete_shows_over_a/
14:56:27	<nuroten>	the 2016-2020 if it is what it says would be great, that's halfway there
14:56:56	<thuban>	yeah, i'm going to try and get 2010-2016 if i can
14:58:04	<nuroten>	there's another playlist that is select episodes, not complete
14:58:34	<nuroten>	on IA that is, besides the 2016-2020
15:00:37	<thuban>	(more: https://lih.kg/sMMkrnX (zh), https://docs.google.com/spreadsheets/d/1JPyevWnxvoq_xzY4ptOgaTaTYSLE9oMva66kxm66K0k/edit#gid=1023959306)
15:04:11	<nuroten>	from the little I can understand, poster said they already downloaded most of the english version
15:04:34	<nuroten>	and the rest of the post is instructions on downloading (via youtube-dl)
15:04:49		LeGoupil quits [Client Quit]
15:07:45	<thuban>	there seems to be a looot of info in the spreadsheet and the google doc it links to, so if you or anyone with good enough chinese could read it and (a) recommend items for download or (b) get copies to go to ia, that would be incredibly helpful!
15:08:40	<nuroten>	there was a suggestion in the thread to grab The Pulse and another Chinese show, I'll pull up a link for the 2nd one
15:14:52	<thuban>	i am going afk for a while (download script will keep running), but ping me for anything i should look at when i get back
15:15:34	<nuroten>	okay, thanks, cheers ... I'll leave a few links here for shows
15:18:39		LeGoupil joins
15:19:02		icedice joins
15:19:19		icedice quits [Remote host closed the connection]
15:25:51		Arcorann_ quits [Ping timeout: 258 seconds]
15:30:25		HP_Archivist (HP_Archivist) joins
15:31:38	<nuroten>	The Pulse https://podcast.rthk.hk/podcast/item.php?pid=205&lang=en-US (English version, there's a Chinese one of the same name with only 3 videos, not all that interesting)
15:46:37	<nuroten>	五夜講場 (English title "Philosophy Night"), 2020 and 2021 page: https://podcast.rthk.hk/podcast/item.php?pid=1734&year=2020&lang=zh-CN
15:53:58	<nuroten>	the series is split across 13 rss feeds (by year and subtheme) about 30-40 videos each, not sure which one the poster wanted
16:11:34		icedice joins
16:12:35		spirit quits [Client Quit]
16:13:21	<nuroten>	about the google docs spreadsheet, check the 3rd tab (the 1st one in Chinese) for a list of titles they're looking to backup
16:18:55	<nuroten>	(rthk + the programme name in web search should pull up the podcast feed in most cases) ... another thing is someone apparently put the entire collection of HKC on torrent, but I don't know if it really is complete
16:21:09	<nuroten>	the other 2 tabs in Chinese are show categories (e.g. "current affairs" and "culture") and have links to mega of select episodes people have uploaded
16:28:33		pcr leaves
16:46:40		pcr joins
16:50:57		treeplant joins
17:05:35		rsn_ joins
17:08:12		rsn quits [Ping timeout: 258 seconds]
17:12:49		DogsRNice (Webuser299) joins
17:25:27		Ruthalas quits [Ping timeout: 258 seconds]
17:26:42		Ruthalas (Ruthalas) joins
18:10:51		duce1337_ (duce1337) joins
18:10:51		duce1337 quits [Read error: Connection reset by peer]
18:13:29	<@JAA>	thuban, bobbyb: 1M posts is fine with AB. It'll take a little while though and is around the point where I typically start ignoring the URLs for individual posts to not let it blow up too much.
18:18:29		webdownload joins
18:18:51	<webdownload>	I'm now working on archiving the TED Talks website.
18:25:58	<webdownload>	I figured that it would be more feasible to archive than Vimeo.
18:48:15		LeGoupil quits [Ping timeout: 258 seconds]
19:08:12		LeGoupil joins
19:19:54		Daloader quits [Ping timeout: 250 seconds]
19:31:11		LeGoupil quits [Ping timeout: 258 seconds]
19:34:38		HP_Archivist quits [Ping timeout: 258 seconds]
20:23:52		nuroten quits [Remote host closed the connection]
20:25:57		nuroten joins
20:37:55		LeGoupil joins
20:44:24		tzt quits [Ping timeout: 258 seconds]
21:09:31		LeGoupil quits [Client Quit]
21:25:35	<betamax>	I thought I'd "bump" my earlier question about archiving a large number of websites and twitter accounts (from the recent UK elections) - is there a good / recommended way of archiving ~4000 twitter users and ~2500 websites?
21:25:39	<betamax>	For the twitters, I figure either (a) socialbot (does it support that many accounts?) or (b) manually running snscrape then putting the list of tweets into AB separately.
21:25:45	<betamax>	For the websites, they should go in AB, but I don't know if / how I should split the list into more manageable chunks.
21:26:22	<betamax>	(oops, wrong numbers - ~1700 websites, not ~2500)
21:31:10	<jodizzle>	I believe what we did last time was run snscrape manually as in (b), split the results into a few chunks, and run all the websites one-by-one
21:32:24	<betamax>	I've done that before and can certainly do that again.
21:33:19	<jodizzle>	For the twitters, I would definitely do (b). Just make sure to run snscrape with options for getting outlinks.
21:34:01	<jodizzle>	~1700 websites is a lot, though. I wonder if we've ever done that much before for an election.
21:34:51	<jodizzle>	We could simplify the process by splitting into chunks and using '!a <', but I'm not sure if that's the best either.
21:35:37	<betamax>	'!a <' was my plan, and I wasn't sure if it should be split into chunks (and if so, what size)
21:36:11	<betamax>	Smaller allows for easier monitoring (e.g: applying custom ignoresets) but obviously increases the number of jobs required
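The chunking trade-off discussed above can be sketched in a few lines of Python. This is only an illustration: `chunk_urls` is a hypothetical helper, and the chunk size of 200 is an arbitrary assumption, not a size anyone in the channel recommended.

```python
def chunk_urls(urls, chunk_size):
    """Split a list of URLs into consecutive chunks of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    # Slicing past the end of a list is safe, so the last chunk is simply shorter.
    return [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]


if __name__ == "__main__":
    # Placeholder URLs standing in for the ~1700 party / candidate websites.
    sites = [f"https://example.org/site{n}" for n in range(1700)]
    chunks = chunk_urls(sites, 200)  # 200 is an assumed size, chosen only for illustration
    print(len(chunks), "chunks; last chunk has", len(chunks[-1]), "URLs")
```

Each chunk could then be written to its own text file and submitted as a separate job, so that ignores only need to be tuned for the sites in that chunk.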
21:38:19	<jodizzle>	One thing I recall helping with the last round of U.K. election work was that some of the candidate sites weren't independent domains, but rather just pages on a host site, e.g., https://www.partyof.wales/. In that circumstance, we could just grab the host site. Is that true this time around?
21:39:52	<betamax>	There's likely to be some of that, yes. I did try to filter out that as much as possible but with over a thousand sites I will have missed things (and I don't have a good way to filter to find pages on a domain)
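One possible way to find candidate pages that live on a shared host is to group the URL list by hostname and look at hosts that appear more than once. The sketch below uses only the Python standard library; `group_by_host` and `shared_hosts` are hypothetical helper names, the example paths are made up, and treating `www.example` and `example` as the same host is an assumption that will not hold for every site.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(urls):
    """Map each hostname (without a leading 'www.') to the URLs found on it."""
    groups = defaultdict(list)
    for url in urls:
        # hostname is None for URLs without a scheme; those land under "".
        host = urlsplit(url).hostname or ""
        if host.startswith("www."):
            host = host[4:]
        groups[host].append(url)
    return dict(groups)


def shared_hosts(urls, min_pages=2):
    """Return only the hosts that carry at least min_pages of the listed URLs."""
    return {
        host: pages
        for host, pages in group_by_host(urls).items()
        if len(pages) >= min_pages
    }


if __name__ == "__main__":
    candidates = [
        "https://www.partyof.wales/candidate-a",  # made-up path
        "https://partyof.wales/candidate-b",      # made-up path
        "https://example.org/",
    ]
    for host, pages in shared_hosts(candidates).items():
        print(host, len(pages))
```

Hosts reported by `shared_hosts` are candidates for grabbing the host site once instead of each page separately; the list still needs a manual look, since unrelated sites can share a hosting platform's domain.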
21:40:32	<betamax>	jodizzle: what's the option in snscrape to get twitter outlinks?
21:42:56	<jodizzle>	I use `snscrape --format '{url} {tcooutlinksss} {outlinksss}' twitter-user <username>`. That produces up to three links per line, which you then need to put newline breaks between. You could do that by piping it into `tr " " "\n" | sed '/^$/d'` or something similar.
21:43:43	<betamax>	Ah, thanks. Wasn't aware of the --format option
21:44:30	<jodizzle>	Also add to the top of each file a link to just the profile page, i.e., https://twitter.com/<username>. If you do that as well, you'll produce lists exactly like snscrape does.
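The post-processing described above (one link per line, no empty lines, profile page first) could also be done in Python instead of `tr` and `sed`. This is a minimal sketch that assumes the snscrape output has already been captured as a list of lines; `build_url_list` and the sample URLs are hypothetical.

```python
def build_url_list(username, snscrape_lines):
    """Return the profile URL followed by every link found in the scraped lines."""
    urls = [f"https://twitter.com/{username}"]
    for line in snscrape_lines:
        # Each line holds a tweet URL plus zero or more outlinks separated by spaces.
        # split() without arguments also drops the empty fields left by missing outlinks.
        urls.extend(line.split())
    return urls


if __name__ == "__main__":
    # Made-up sample lines in the '{url} {tcooutlinksss} {outlinksss}' layout.
    sample = [
        "https://twitter.com/example/status/1 https://t.co/abc https://example.org/page",
        "https://twitter.com/example/status/2  ",
    ]
    print("\n".join(build_url_list("example", sample)))
```

The printed list can be saved to a text file and used as the input for an '!ao <' job.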
21:46:51	<betamax>	I'll start that running now, and also feed in the list of candidate pages into archivebot (with "!archiveonly").
21:47:10	<jodizzle>	I think lists of a couple hundred-thousand twitter links are pretty good for '!ao <' jobs. An alternative would be to hand them to the URLs warrior project.
21:51:26	<@JAA>	I don't think the URLs project would get anything useful.
21:52:12	<@JAA>	We could do one (or a few) !a < job for the websites, but that's probably not going to end well.
21:52:48	<jodizzle>	Why wouldn't the URLs project be good for twitter links?
21:54:41	<@JAA>	I believe it'd get the new Twitter UI, which is useless without JS.
21:55:07	<@JAA>	AB still gets the old UI, which actually works.
21:55:28	<jodizzle>	Ah, good point
21:56:43		tzt joins
22:00:36	<betamax>	JAA: what are the main issues with running !a < on the party / candidate websites?
22:02:14	<@JAA>	betamax: Ignores are hell, there will be so many outlinks that Python's cookie jar will slow everything to a crawl after a while, and probably simply too big for one job in general.
22:03:17	<betamax>	do you mean outlinks to external sites (which could be turned off I believe) or links to different pages in the same site
22:04:02	<@JAA>	External ones. They could be turned off, yeah, but I think they'd also be great to capture.
22:04:38	<betamax>	They would, but if they stop the entire thing from being captured...
22:06:23	<betamax>	JAA: is there anything I could / should be doing with the facebook / instagram URLs? (e.g: running the scrape very slowly over a few months to avoid ratelimiting, etc..) or is it a lost cause?
22:06:49	<@JAA>	It's pretty much a lost cause at this point, unfortunately.
22:09:00	<betamax>	OK - would it be worth !ao < all the links just to record that the pages existed (or will that trigger rate limiting)
22:10:07	<@JAA>	All AB pipelines are banned from Facebook and Instagram as of last time I checked. So it would only grab redirects to the login pages.
22:11:18	<betamax>	Ah, I now see the scope of their rate limiting.... OK, that's not going to work!
22:13:15	<@JAA>	Yeah, you get banned from both services just by accessing like 2 or 3 pages at a perfectly humanly reasonable speed. It's ridiculous.
22:52:58		superkuh quits [Quit: the neuronal action potential is an electrical manipulation of reversible abrupt phase changes in the lipid bilayer]
23:04:16		HP_Archivist (HP_Archivist) joins
23:17:58		duce1337_ quits [Client Quit]
23:33:59		Arcorann_ joins
23:44:26		IKI quits [Remote host closed the connection]
23:49:39		BlueMaxima joins
