#archiveteam-bs log for 2021-06-21

00:04:15		wyatt8740 joins
00:45:04		britmob quits [Read error: Connection reset by peer]
00:54:26		MaxG-1 quits [Remote host closed the connection]
00:55:19		britmob joins
01:02:20		dm4v quits [Client Quit]
01:03:05		dm4v joins
01:03:07		dm4v is now authenticated as dm4v
01:03:07		dm4v quits [Changing host]
01:03:07		dm4v (dm4v) joins
01:30:04		britmob quits [Read error: Connection reset by peer]
01:42:30		britmob joins
01:44:23		Edsavoie_srv quits [Ping timeout: 244 seconds]
02:01:55		Iki quits [Read error: Connection reset by peer]
02:57:45		britmob quits [Read error: Connection reset by peer]
02:59:34		ThreeHM quits [Ping timeout: 250 seconds]
03:01:35		ThreeHM (ThreeHeadedMonkey) joins
03:05:42	<atphoenix>	this may or may not be at risk: https://viking.tv/ . To my knowledge it originated as part of Viking's response to the covid pandemic, and was regularly featured during pre-show ad rolls on PBS Masterpiece episodes. It has been replaced with a new pre-show ad roll.
03:22:24		Krownest quits [Read error: Connection reset by peer]
03:25:11		Krownest (Krownest) joins
03:27:31		teej (teej) joins
03:40:02		teej quits [Client Quit]
03:52:11		qw3rty_ joins
03:55:54		qw3rty__ quits [Ping timeout: 250 seconds]
03:58:30		DogsRNice quits [Read error: Connection reset by peer]
04:21:08		save_fn joins
04:21:53		save_fn quits [Client Quit]
04:22:02		AntiLiberal joins
04:23:23		CrasherTN joins
04:24:01		CrasherTN quits [Remote host closed the connection]
04:24:13		AntiLiberal is now known as save_fn
05:12:21		rsn quits [Ping timeout: 258 seconds]
05:38:06		forkwhilefork quits [Quit: The Lounge - https://irc.rekt.app]
06:13:03		nertzy_ joins
06:14:08		nertzy__ quits [Ping timeout: 250 seconds]
06:50:07		sec^nd quits [Ping timeout: 255 seconds]
06:50:49		sec^nd (second) joins
07:43:07		HP_Archivist (HP_Archivist) joins
07:45:36		wyatt8750 joins
07:46:00		wyatt8740 quits [Ping timeout: 250 seconds]
07:54:26		BlueMaxima quits [Read error: Connection reset by peer]
09:44:33		wizards_ joins
09:47:20		wizards quits [Ping timeout: 251 seconds]
09:54:01		LeGoupil joins
11:59:19		LeGoupil quits [Client Quit]
12:39:55		yano quits [Quit: WeeChat, the better IRC client, https://weechat.org/]
12:40:03		yanome quits [Quit: The Lounge - https://thelounge.chat]
12:40:14		ThreeHM quits [Ping timeout: 250 seconds]
12:41:12		yanome (yano) joins
12:41:49		yano (yano) joins
12:46:59		ThreeHM (ThreeHeadedMonkey) joins
12:50:40		KRG joins
12:50:40		KRG is now authenticated as KRG
13:17:18		KRG` joins
13:17:39		KRG quits [Ping timeout: 258 seconds]
13:18:59	<jodizzle>	(from #archiveteam) https://hk.appledaily.com/ has been running in AB for a few days, and I've now added some other domains
13:19:35	<jodizzle>	One important detail from the reddit post: "Many articles contain videos, but youtube-dl doesn't seem to work. I'm out of ideas on how to get them." (I haven't verified this myself.)
13:21:49		KRG` is now known as KRG
13:22:06		KRG is now authenticated as KRG
13:23:25	<Jake>	The videos on the pages are just m3u8s from a JS script tag
13:26:57		KRG quits [Remote host closed the connection]
13:31:39		achivarin (achivarin) joins
13:38:37		ats (ats) joins
13:48:05	<jodizzle>	Okay, thanks. Looks like it should be possible to script something to assemble the .ts files.
13:49:37	<@OrIdow6>	If it's normally set up, you should just be able to pass the m3u8s into ffmpeg
13:52:22	<@OrIdow6>	Well, give it the url
13:54:41	<Jake>	It is normally setup ;)
13:56:34		KRG joins
13:56:34		KRG is now authenticated as KRG
13:57:30	<jodizzle>	Trying it and it does seem to work. Thanks!
13:58:27	<jodizzle>	I usually like to get the raw files in AB as well, though, but that's not too much work.
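
For the "raw files" route mentioned above, a minimal sketch of listing the segment URLs referenced by an HLS playlist, so they can be queued individually. The function name and the example.org URLs are placeholders, not anything taken from the site. It assumes a plain media playlist; for a master playlist, the same loop would return the variant playlist URLs instead of .ts segments.

```python
from urllib.parse import urljoin


def segment_urls(playlist_url, playlist_text):
    """Return absolute URLs for every non-comment line of an m3u8 playlist.

    Lines starting with '#' are tags or comments in the HLS format;
    every other non-empty line is a URI, possibly relative to the playlist.
    """
    urls = []
    for line in playlist_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Resolve relative segment names against the playlist's own URL.
        urls.append(urljoin(playlist_url, line))
    return urls


if __name__ == "__main__":
    # Hypothetical playlist content, for illustration only.
    example = "#EXTM3U\n#EXTINF:10.0,\nseg0.ts\n#EXTINF:10.0,\nseg1.ts\n#EXT-X-ENDLIST\n"
    for url in segment_urls("https://example.org/video/index.m3u8", example):
        print(url)
```

The ffmpeg route discussed above needs none of this: something like `ffmpeg -i PLAYLIST_URL -c copy out.mp4` typically remuxes the stream directly, but it does not leave the individual segment files behind.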
13:59:04	<jodizzle>	One problem might be iterating all the articles. Sitemaps don't seem to be complete.
13:59:32		Mateon1 quits [Ping timeout: 250 seconds]
14:05:46	<Jake>	(I also wonder if the videos on the site are also on YouTube?)
14:08:24		ragu__ joins
14:09:03	<jodizzle>	Ah, maybe.
14:11:02		Jonboy345 quits [Read error: Connection reset by peer]
14:12:05		ragu_ quits [Ping timeout: 258 seconds]
14:12:53		ragu_ joins
14:14:00		ragu__ quits [Ping timeout: 258 seconds]
14:14:42		Mateon1 joins
14:14:54		ragu__ joins
14:15:18		Jonboy345 joins
14:18:59		ragu_ quits [Ping timeout: 258 seconds]
14:22:54		Iki joins
14:30:05		britmob joins
14:41:31	<@EggplantN>	hkdaily is AB job 2u3qbx8mpv42jxi76wq27fbb2
14:41:37	<@EggplantN>	but it seems quite slow
14:48:11	<@EggplantN>	cracks knuckles
14:48:15	<@EggplantN>	sends IA 6Gbit >_>
14:55:17	<rewby>	Is this where we need to acquire some china telecom transit?
14:56:55	<@EggplantN>	i think its just AB isnt the tool for the job
14:57:04	<@EggplantN>	or needs some more fine tuning
14:57:46	<@OrIdow6>	If you want to go really fast, 2 options I know of are Qwarc, and hackish backfeed warrior recursion
14:59:28	<rewby>	qwarc would go fast. But we'd need somewhere with good china connectivity. That said, I agree AB isn't the right tool for the job
14:59:56	<@EggplantN>	rewby sir?
15:00:00	<@EggplantN>	its akamai?
15:00:02	<@EggplantN>	not CT
15:00:14	<rewby>	Oh it's akamai?
15:00:18	<rewby>	I thought it was hosted in china
15:00:22	<rewby>	Did I look up the wrong domain
15:00:27	<@EggplantN>	no, even if it was it would be HK
15:00:31	<@EggplantN>	and HK is weird
15:00:38	<rewby>	That's a fair point
15:01:18	<@EggplantN>	does it have a usable sitemap or a way to find all the articles
15:01:20	<rewby>	If it's akamai, then yeah throw qwarc at the problem. I personally don't have amazing throughput to them but I'm sure someone here has
15:02:06	<rewby>	It appears to have a sitemap https://hk.appledaily.com/robots.txt
15:02:52	<@OrIdow6>	curl 'https://hk.appledaily.com/sitemap002.xml' | grep loc | wc -l gets me 14664, scale seems reasonable
15:03:28		Jonboy345 quits [Read error: Connection reset by peer]
15:03:35	<rewby>	OrIdow6: try the sitemap-index
15:03:41	<rewby>	It's got a ton of additional sitemaps listed
15:04:17	<@OrIdow6>	I noticed
15:04:40	<@OrIdow6>	You know, if we really want to panic grab
15:04:55	<@OrIdow6>	We can generate an URL list from the sitemaps and feed them into #//
15:05:41	<rewby>	I have scripts to do this...
15:05:58	<@OrIdow6>	To get the sitemaps?
15:06:03		Jonboy345 joins
15:06:05	<rewby>	Yeah, to take sitemaps and turn them into urls
15:06:16	<rewby>	Including recursive sitemaps
15:06:22	<@OrIdow6>	Nice
15:06:50	<@OrIdow6>	Do we have an idea for the timescale for this?
15:07:01	<rewby>	It's not very fast, only single threaded. But give it like an hour or two and it'll extract the urls
15:07:06	<rewby>	*from a map this size ish
15:07:09	<@OrIdow6>	For the shutdown, I mean
15:07:12	<@OrIdow6>	Hm
15:07:49	<@OrIdow6>	Would it work to extract from the big list with grep, and then do the others in parallel from a bash script or something?
15:08:06	<@OrIdow6>	Cheap parallelization with &, I mean
15:08:15	<@EggplantN>	if you grabbed all onsite content via a warrior project + offsite links to #// ?
15:08:21	<@EggplantN>	we could scale up to like
15:08:24	<@EggplantN>	infinity
15:08:31	<@EggplantN>	have this done quick AF
15:08:36	<rewby>	OrIdow6: Not quite with grep. The format is a bit weird so you have to actually python parse it
15:08:40	<@OrIdow6>	Warrior projects still take a while to set up
15:08:42	<@OrIdow6>	rewby: Oh
15:08:50	<rewby>	I do actual xml parsing because spacing is not consistent
15:08:58	<rewby>	And sometimes there's multiple items on one line
15:09:02	<rewby>	And other times there's weird encodings
15:09:11	<rewby>	Better to just let lxml deal with it
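
A minimal sketch of the approach described above: parse each sitemap as real XML rather than grepping it, and follow sitemap indexes recursively. This is not rewby's script. It swaps lxml for the standard library's `xml.etree.ElementTree` so it runs without dependencies, and the names `parse_sitemap`, `enumerate_urls` and the `fetch` callable are made up for illustration. `fetch` is assumed to take a URL and return the response body as bytes.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol for both index and urlset files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_bytes):
    """Return (kind, locs) where kind is 'index' or 'urlset'."""
    root = ET.fromstring(xml_bytes)
    kind = "index" if root.tag == SITEMAP_NS + "sitemapindex" else "urlset"
    locs = []
    for el in root.iter(SITEMAP_NS + "loc"):
        text = (el.text or "").strip()  # tolerate inconsistent spacing
        if text:
            locs.append(text)
    return kind, locs


def enumerate_urls(start, fetch):
    """Walk a sitemap tree; return (sorted page URLs, sorted sitemap URLs)."""
    seen_maps, urls, queue = set(), set(), [start]
    while queue:
        sitemap_url = queue.pop()
        if sitemap_url in seen_maps:
            continue  # guard against indexes that reference each other
        seen_maps.add(sitemap_url)
        kind, locs = parse_sitemap(fetch(sitemap_url))
        if kind == "index":
            queue.extend(locs)  # sub-sitemaps: parse these too
        else:
            urls.update(locs)  # leaf sitemap: these are page URLs
    return sorted(urls), sorted(seen_maps)
```

Because the parser works on elements rather than lines, several `<loc>` entries on one line or odd whitespace make no difference, which is the failure mode of the grep approach. Returning the sitemap URLs alongside the page URLs makes it easy to archive the sitemaps themselves as well.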
15:09:40	<@OrIdow6>	EggplantN: Maybe do an initial pass of the sitemap with #//, and then set up something more complicated?
15:09:58	<rewby>	The problem with #// is that it selects items from the queue randomly, doesn't it?
15:10:08	<rewby>	so we wouldn't guarantee it actually finishes those urls in time
15:10:11	<@OrIdow6>	The last-~hour increase in people talking about this sounds to me like social media panic, but you never know
15:10:47	<@EggplantN>	uh
15:10:55	<@EggplantN>	rewby we can make it do it quickly-ish
15:10:55	<@OrIdow6>	I haven't been paying much attention to #// recently, is it at capacity?
15:11:03	<@EggplantN>	no but its not in an amazing state
15:11:26	<rewby>	Can we just duplicate the urls project code and run it on its own tracker with just these urls?
15:11:32	<rewby>	That's an "easy" way to scale to the moon quickly
15:11:45	<achivarin>	Hi. For the sitemap idea, check out my reddit thread for more information: https://www.reddit.com/r/DataHoarder/comments/o4r4jv/help_wanted_hong_kongs_prodemocracy_newspaper_in/
15:12:31	<rewby>	Oh hm paywall.
15:12:38	<rewby>	Well, we can just put in the right cookie and that's fine
15:13:01	<@EggplantN>	oh
15:13:06	<@EggplantN>	i wonder if thats fucked AB over
15:13:18	<rewby>	I think it depends on whether you have the cookie or not?
15:13:21	<rewby>	Not sure
15:13:39	<@EggplantN>	it is a 200 response code
15:13:49	<rewby>	Oh I hate websites that do this
15:15:22	<achivarin>	Maybe try the googlebot user agent? It's also mentioned in the reddit thread.
15:15:31	<@EggplantN>	yeah that exact one didnt work for me
15:15:37	<@EggplantN>	plus it looks like a JS based paywall
15:16:09	<@OrIdow6>	So the correct text still gets sent in the response?
15:16:28	<rewby>	EggplantN: If I have adblock turned on I don't get paywalled
15:16:36	<@EggplantN>	i have adblock
15:17:03	<@OrIdow6>	Yeah, browsing around with Noscript gets me the right material
15:17:28	<achivarin>	Strange. In my testing I found that reading mode in Firefox just bypasses the wall, and if you go into developer console to delete the paywall box you can read the whole article.
15:17:41	<rewby>	Yeah so as long as we don't store cookies we should be fine
15:17:46	<@OrIdow6>	achivarin: Yes, that's what would be expected from a JS paywall
15:17:55	<rewby>	Heck, I think most of our tools ignore JS anyway
15:18:10	<achivarin>	Oh my bad
15:19:07	<@arkiver>	i believe JAA is able to archive lists of URLs at high speed into WARCs
15:19:18	<@OrIdow6>	Looks like the cookie may be set through JS anyway
15:19:25		Jonboy345 quits [Read error: Connection reset by peer]
15:19:35	<@OrIdow6>	Though I didn't really check that thoroughly
15:20:23	<@OrIdow6>	arkiver: We were talking about doing it with #//
15:20:30	<@arkiver>	right
15:20:37	<@OrIdow6>	For a rough pass in case it shuts down next few hrs
15:20:48	<@arkiver>	how do they check if you read the first article without paywall?
15:20:51	<@arkiver>	do they set a cookie?
15:21:13	<@EggplantN>	yes but the paywall is JS based anyway
15:21:24	<@arkiver>	right so content is there anyway
15:21:31	<@arkiver>	in that case - what is the cookie talk about/
15:21:31	<@arkiver>	?
15:22:04	<rewby>	I'm enumerating the urls, FYI
15:22:09	<rewby>	From the sitemap, at least
15:22:09	<@arkiver>	rewby: thanks
15:22:13	<@OrIdow6>	Well, first we had to figure out if it was JS-based
15:22:21		Jonboy345 joins
15:22:24	<@OrIdow6>	But now it's just been reduced to a playback concern
15:22:27	<@arkiver>	rewby: make sure to get the sitemap URLs themselves as well :)
15:22:37		Arcorann quits [Ping timeout: 258 seconds]
15:22:55	<rewby>	arkiver: Working on it!
15:24:33	<rewby>	There's a lot of submaps
15:24:41	<rewby>	So my scripts are having to send a ton of requests
15:30:25	<achivarin>	After this fire is hopefully put out, other independent media outlets are in the crosshairs too: https://thestandnews.com/ https://www.hkcnews.com/ https://hongkongfp.com/
15:30:42	<achivarin>	And they also have large YouTube channels
15:33:56	<@JAA>	arkiver: qwarc isn't currently very good at archiving lists of URLs because it quickly gets bogged down by the DB locking. Can't go much beyond a couple thousand items per minute or so, and one item per URL would be the most reasonable approach with URL lists. Can be worked around though.
15:34:05	<rewby>	If these urls are structured the way I'm expecting, it's somewhere in 2008's sitemaps working its way up to today
15:34:17	<rewby>	Doing about a month every 4 seconds
15:34:53		Mateon1 quits [Ping timeout: 258 seconds]
15:38:42		Mateon1 joins
15:41:38	<@EggplantN>	is there much documentation on qwarc
15:42:13	<rewby>	I'm midway through 2016 with extracting things
15:42:29	<@EggplantN>	just extracting for now?
15:42:36	<rewby>	Just extracting URLs from sitemaps
15:42:39	<@JAA>	EggplantN: How much is zero?
15:43:58	<@EggplantN>	ok so everything i've made
15:54:29		missmega joins
15:54:32	<missmega>	Yo
15:54:40	<missmega>	So the megaupload archive was a failure?
15:55:09	<missmega>	I'm trying to find some content linked in this video at this moment: https://youtu.be/fD7X9SCn0To?t=398 It's all megaupload links. Sadly I can't find it with archive.org
15:55:35	<Jake>	https://wiki.archiveteam.org/index.php/MegaUpload The status is listed as Lost, so I imagine we didn't do a project.
15:56:05	<missmega>	That really sucks
15:56:07	<missmega>	:|
16:00:19		ragu_ joins
16:03:28	<missmega>	OrIdow6
16:03:30	<missmega>	I am here
16:04:01		ragu__ quits [Ping timeout: 258 seconds]
16:04:12	<@OrIdow6>	Oh
16:04:24		ragu__ joins
16:05:09	<@OrIdow6>	I didn't notice you were the same person
16:05:10		ragu_ quits [Ping timeout: 258 seconds]
16:06:58	<@OrIdow6>	Sorry
16:08:09	<@OrIdow6>	EggplantN: "everything I've made"?
16:13:45		Jonboy345 quits [Remote host closed the connection]
16:14:01		Jonboy345 joins
16:20:34	<Jake>	I believe it was a joke on a lack of documentation on what he codes
16:23:02	<rewby>	OrIdow6, EggplantN, arkiver: I've parsed the hk.appledaily.com sitemaps. Here's the urls extracted: https://transfer.archivete.am/fzhO1/sorted_urls.txt and here's the urls of the sitemaps I pulled them from: https://transfer.archivete.am/14ZEQ8/sitemaps.txt
16:23:13	<rewby>	I've got 3221709 urls from that
16:23:16	<rewby>	That's quite a lot
16:28:42	<@EggplantN>	rewby feeding into urls now
16:30:31	<@EggplantN>	backfeed go brrrrr
16:31:18	<@JAA>	It looks like the AB job for AppleDaily will be incomplete. I'm seeing countless parsing warnings in the log, for example: 2021-06-20 21:53:40,365 - wpull.scraper.html - WARNING - Failed to read document at ‘https://hk.appledaily.com/racing/20190505/JX6MZ2JBWZR4BXTPPXLS6A4DLQ/’: 'utf-8' codec can't decode byte 0xe8 in position 46: unexpected end of data
16:31:38	<@EggplantN>	yeah we're doing an emergency via #//
16:31:49	<rewby>	EggplantN: Are you somehow giving the urls priority over the rest of the queue?
16:31:52	<@EggplantN>	nope
16:32:03	<@EggplantN>	i'm just gonna scale up instead
16:32:39	<rewby>	Mkay
16:32:42	<rewby>	Good luck
16:32:44	<@EggplantN>	hrm
16:32:45	<@EggplantN>	okay
16:32:49	<@EggplantN>	i've found an issue perhaps
16:32:57	<rewby>	Oh no
16:33:40	<@JAA>	502
16:33:44	<@JAA>	?
16:33:48	<@EggplantN>	nah
16:33:53	<@EggplantN>	its backfeed related
16:34:14	<@JAA>	Ah, AB Job started 502ing a bit the same moment you said you'd start queueing. lol
16:34:35	<rewby>	Are you hug-of-death-ing it Eggplant
16:34:56	<@JAA>	Yay, hugs.
16:35:27	<@JAA>	But yeah, let's try not to murder it.
16:35:39	<@EggplantN>	FYI
16:35:49	<@EggplantN>	i've also removed the :maxtries from #//
16:35:57	<@EggplantN>	until i can verify we've grabbed everything
16:37:34	<missmega>	So
16:38:03	<missmega>	megaupload's data is gone
16:38:17	<@EggplantN>	link?
16:39:10	<@arkiver>	rewby: does this include embedded images?
16:40:22	<Jake>	missmega: I believe so, yes. Unless someone out there has a copy.
16:40:44	<@arkiver>	from what i see no embedded images from pages
16:41:04	<@arkiver>	EggplantN: i can quickly turn on getting embedded images
16:41:12	<@EggplantN>	sure if you want
16:41:24	<@EggplantN>	i forgot that was enabled now
16:41:29	<@EggplantN>	🤦
16:41:37	<rewby>	arkiver: No, I think it's just links to stories.
16:42:15	<rewby>	Also, I'm seeing 502s while I'm trying to tweak my scripts. So we're going a smidge too fast for them I think
16:42:23	<@arkiver>	EggplantN: its not enabled now
16:42:28	<@arkiver>	i'm enabling it now
16:42:30	<@EggplantN>	sorry *supported
16:42:34	<@EggplantN>	wrong word
16:44:56	<@arkiver>	this site doesnt embed images like most sites do
16:45:12	<@arkiver>	Wget-AT cant easily extract them (without us parsing the HTML)
16:45:22	<@arkiver>	that is custom extraction
16:46:16	<@arkiver>	let's just finish this run now
16:46:25	<@arkiver>	we can maybe do a second run later to get the images
16:48:14	<@arkiver>	that would mean duplicating HTML pages, but thats fine with me
17:05:14	<@EggplantN>	yeah that was my thinking
17:05:17	<@EggplantN>	lets just get what we can
17:08:59	<rewby>	arkiver: In case it's useful to you, I put my sitemap scraper on github. https://github.com/rewbycraft/sitemap-enumerator
17:09:06	<rewby>	It's mostly a wrapper/partial reimplementation of a library
17:09:10	<rewby>	But this one can go nyooom
17:13:46		britm0b joins
17:15:32		Webuser431 joins
17:15:42		britmob quits [Ping timeout: 258 seconds]
17:19:15	<Ryz>	On the subject on looking out other Hong Kong journalist/news websites, https://en.wikipedia.org/wiki/List_of_newspapers_in_Hong_Kong could be a good starting point
17:24:45	<@JAA>	Looks like en.appledaily.com needs some work as well. It was run through AB twice (once a few days ago, once today), but those jobs finished surprisingly quickly. It has infinite scrolling, and the sitemap only seems to cover the past month. :-/
17:25:16		Daloader joins
17:32:11	<achivarin>	Ryz: You have the right idea but most papers on that list are pro-Beijing rags. Reposting what I said above:
17:32:16	<achivarin>	After this fire is hopefully put out, other independent media outlets are in the crosshairs too: https://thestandnews.com/ https://www.hkcnews.com/ https://hongkongfp.com/ And they also have large YouTube channels
18:29:16		CZ joins
18:36:01		CZ quits [Remote host closed the connection]
18:36:23		Cz joins
19:10:25		Jonboy345 quits [Read error: Connection reset by peer]
19:13:46		Jonboy345 joins
19:16:27		Jonboy345 quits [Read error: Connection reset by peer]
19:18:52		Jonboy345 joins
19:20:12		Daloader quits [Ping timeout: 250 seconds]
19:29:14		missmega quits [Ping timeout: 244 seconds]
19:30:10		Cz quits [Remote host closed the connection]
19:34:12		Jonboy3451 joins
19:34:12		Jonboy345 quits [Read error: Connection reset by peer]
19:51:52		AlsoHP_Archivist joins
19:51:52		HP_Archivist quits [Read error: Connection reset by peer]
20:03:48		lennier1 quits [Quit: Going offline, see ya! (www.adiirc.com)]
20:04:49		lennier1 (lennier1) joins
20:29:58		ThreeHM quits [Ping timeout: 250 seconds]
20:33:16		ThreeHM (ThreeHeadedMonkey) joins
20:47:44		lunik1 quits [Ping timeout: 250 seconds]
21:05:30		lunik1 joins
21:18:06		EdSavoie joins
21:20:42		AlsoHP_Archivist quits [Read error: Connection reset by peer]
21:21:10		AlsoHP_Archivist joins
21:38:46	<@JAA>	Bethesda forum and mod comment API data retrieval is running. Specifically, I'm fetching /community/api/topic/TOPICID + /community/api/topic/SLUG?page=PAGE for topics and https://api.bethesda.net/mods/ugc-workshop/content/get?content_id=MODID + /community/comments/get/mods/mods_MODID/0 (and .../1 etc. for pagination) for mod comments. /community/* URLs are on bethesda.net.
21:39:35	<@JAA>	The list of mods also comes from the API, namely https://api.bethesda.net/mods/ugc-workshop/list/?number_results=20&order=desc&page=PAGE&platform=&product=&sort=published&text=
21:49:55		@EggplantN quits [Quit: Ping timeout (120 seconds)]
21:50:15		EggplantN joins
21:53:14	<@JAA>	Uh, looks like I'm getting zero mod comments. Maybe they already disabled that part without another announcement. :-(
21:53:41	<@JAA>	https://bethesda.net/en/mods/fallout4/mod-detail/911793 had a lot of comments as of a month ago, for example.
22:37:14		DogsRNice (Webuser299) joins
22:42:44		EggplantN is now authenticated as EggplantN
22:42:44		EggplantN quits [Changing host]
22:42:44		EggplantN (EggplantN) joins
22:42:44		@ChanServ sets mode: +o EggplantN
22:44:09	<@EggplantN>	when tf did my connection die
22:44:23	<@JAA>	21:49:55
22:44:36	<@EggplantN>	is it 22:44 now for you
22:44:47	<@JAA>	One time zone to rule them all (UTC). Yes
22:45:02	<@EggplantN>	that
22:45:05	<@EggplantN>	kinda doesnt make sense
22:45:07	<@EggplantN>	but ok
22:45:10		Muad_Dib quits [Ping timeout: 250 seconds]
22:55:36		Muad-Dib joins
23:21:47	<@JAA>	The forums part of the Bethesda crawl is done. I'm sending them an email about the mod comments.
23:23:40		AlsoHP_Archivist quits [Client Quit]
23:23:56		HP_Archivist (HP_Archivist) joins
23:42:40		BlueMaxima joins
23:52:05		Specular joins
23:53:04		Arcorann (Arcorann) joins
23:55:21		Specular quits [Client Quit]
23:55:39		nerdguy1138 quits [Quit: Leaving.]
23:58:25	<@JAA>	Bethesda forum stats in my WARC: 297121 topics, 363778 topic pages, 2851308 posts. The topic numbers on the forum homepage add up to 299359; close enough. Post IDs go to 3.3 million, but there are plenty of deleted topics (IDs go to just over 456789), so that sounds good enough as well.