#archiveteam-bs log for 2021-12-29

00:35:35		sonick quits [Quit: Connection closed for inactivity]
00:37:04		sonick (sonick) joins
01:00:02		dm4v quits [Client Quit]
01:01:34		dm4v joins
01:01:37		dm4v is now authenticated as dm4v
01:01:37		dm4v quits [Changing host]
01:01:37		dm4v (dm4v) joins
01:16:59		fionera_ quits [Quit: fionera_]
01:18:03		fionera (Fionera) joins
01:35:39	<@OrIdow6>	https://www.nytimes.com/2021/12/28/obituaries/john-madden-dead.html https://www.nytimes.com/2021/12/28/obituaries/harry-reid-dead.html deaths - can be inferred from the URLs
01:57:29	<jodizzle>	Got https://twitter.com/SenatorReid. Don't see anything else obvious for Reid.
01:57:35	<jodizzle>	Don't see anything obvious for Madden either.
02:01:43	<@OrIdow6>	Thanks
02:02:38		dm4v quits [Read error: Connection reset by peer]
02:04:21		dm4v joins
02:04:23		dm4v is now authenticated as dm4v
02:04:23		dm4v quits [Changing host]
02:04:23		dm4v (dm4v) joins
02:45:35		sonick quits [Client Quit]
02:51:13	<mgrandi>	@Ryz: did you get a chance to look at the memorial wiki thing yet?
03:40:15		Somebody2 (Somebody2) joins
03:41:14		xkey quits [Client Quit]
03:41:50		xkey (eyo) joins
04:14:08		qw3rty__ joins
04:17:53		qw3rty_ quits [Ping timeout: 265 seconds]
04:33:23		Stellarator quits [Ping timeout: 258 seconds]
04:56:56		DogsRNice quits [Read error: Connection reset by peer]
07:01:54		godane1 quits [Client Quit]
07:57:02		qwertyasdfuiopghjkl joins
08:10:36		guest01911 joins
08:10:53		guest01911 leaves
08:26:28	<jodizzle>	(from #archiveteam) https://www.thestandnews.com/ was last grabbed back in June, but I queued it again given the situation. https://twitter.com/ezracheungtoto/status/1475989778893996034
09:04:42	<IDK>	I got 503s
09:24:08		qwertyasdfuiopghjkl25 joins
09:27:01		qwertyasdfuiopghjkl quits [Ping timeout: 258 seconds]
09:27:15		Stellarator joins
09:41:44		achivarin (achivarin) joins
09:50:40	<achivarin>	The government has moved to shut down one of Hong Kong's last independent media outlets. https://www.thestandnews.com/ and https://www.youtube.com/c/StandNewsHK/videos urgently need help.
09:59:11	<jodizzle>	achivarin: Yes, it's been mentioned. I think we've queued the relevant sites in the relevant places.
09:59:43	<jodizzle>	https://www.thestandnews.com/ and some subdomains are running in #archivebot
10:00:15	<jodizzle>	https://www.thestandnews.com/ and some subdomains were also fully grabbed back in June
10:03:46	<thuban>	jodizzle: have we requeued their channel (UCGe96mv2FcdfQXmtSXYF_fA) in #youtubearchive?
10:04:04	<jodizzle>	Yeah.
10:04:32	<jodizzle>	In principle, this could be worth tubeup-ing, but I'm never sure of the rules/guidelines on that.
10:04:44	<thuban>	cool, ty
10:06:09	<thuban>	(i don't think the hong kong media wiki page has been updated in some months; might be worth rechecking the status of those orgs--iirc there's a lot of stuff we didn't get around to abing in the first place)
10:17:51		qwertyasdfuiopghjkl25 is now known as qwertyasdfuiopghjkl
11:33:19		BlueMaxima quits [Client Quit]
11:35:50		march_happy (march_happy) joins
11:42:27		OverhaulDeskwork joins
12:16:17	<achivarin>	Thanks for your reply. Can you check if the social media pages are being worked on? https://www.facebook.com/standnewshk/ https://www.instagram.com/thestandnews/ https://twitter.com/standnewshk https://t.me/thestandnews
12:19:51	<thuban>	achivarin: we're working on their twitter, but unfortunately instagram and to a lesser extent facebook have strict rate limiting that make them difficult to archive
12:48:50		britmob25636477 joins
13:00:48	<h2ibot>	PixelAlpha edited URLTeam (+232, /* Alive */ intip.in): https://wiki.archiveteam.org/?diff=48064&oldid=47853
13:00:49	<h2ibot>	Fidel edited Hong Kong media (+137): https://wiki.archiveteam.org/?diff=48065&oldid=47976
13:00:50	<h2ibot>	Fidel created Stand News (+1522, Create page, copy some from…): https://wiki.archiveteam.org/?title=Stand%20News
13:00:51	<h2ibot>	Taka edited Deathwatch (+197, /* 2022 */ Added about "excite friends"): https://wiki.archiveteam.org/?diff=48067&oldid=48062
13:03:43	<@arkiver>	is anyone here who was involved with this effort able to update https://wiki.archiveteam.org/index.php/Hong_Kong_media ?
13:13:13		MattieTK joins
13:15:25	<thuban>	i kept updating the page up to the time i stepped away; in a couple of hours i can go through the archivebot viewer and check up on (a) the status of listed jobs and (b) whether additional jobs were run on the sites marked 'not saved yet'
13:16:17	<thuban>	but anything that happened that's not in those records, i won't know about
13:16:39		MattieTK quits [Client Quit]
13:19:29		goose joins
13:20:57	<goose>	hey, guessing so, has anyone started archiving the stand?
13:28:53		Arcorann quits [Ping timeout: 265 seconds]
13:34:15	<thuban>	goose: yes, we're working on it
13:34:44	<goose>	nice, i've done a quick script to extract next.js' json data from the html
13:35:36	<@arkiver>	goose: what is this next.js stuff?
13:35:39	<@arkiver>	got an example webpage?
13:35:58	<goose>	next.js is the framework, it encodes the data of the article as json
13:36:02	<goose>	article example: https://www.thestandnews.com/society/%E9%A6%99%E6%B8%AF%E4%BF%9D%E8%AD%B7%E5%85%92%E7%AB%A5%E6%9C%83%E7%AB%A5%E6%A8%82%E5%B1%85%E5%86%8D%E5%A4%9A-4-%E8%81%B7%E5%93%A1%E6%B6%89%E9%AB%94%E7%BD%B0%E5%B9%BC%E5%85%92-%E5%B7%B2%E8%A2%AB%E5%81%9C%E8%81%B7-%E6%A9%9F%E6%A7%8B%E5%B7%B2%E5%A0%B1%E8%AD%A6
13:36:13	<goose>	run `__NEXT_DATA__.props.pageProps.article` in console to see
13:36:16	<@arkiver>	ah right
13:36:24	<goose>	you can extract from html too, i did it with some dirty regex atm
13:36:24	<@arkiver>	yeah we're getting that
13:36:29	<@arkiver>	all data will be in the wayback machine
13:36:33	<@arkiver>	the raw responses
13:36:36	<goose>	ah nice
13:36:43	<goose>	good work!
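
The extraction goose describes above can be sketched in Python without regex. This is a minimal sketch, not goose's script: it assumes the page follows the usual Next.js convention of embedding its data in a `<script id="__NEXT_DATA__">` element, and that the article sits under `props.pageProps.article` as in the console command quoted above. The names `NextDataParser`, `extract_next_data`, and `extract_article`, and the sample HTML, are hypothetical.

```python
import json
from html.parser import HTMLParser


class NextDataParser(HTMLParser):
    """Collects the text content of the <script id="__NEXT_DATA__"> element."""

    def __init__(self):
        super().__init__()
        self._in_target = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Only the script element carrying the Next.js data id is of interest.
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._in_target = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_target = False

    def handle_data(self, data):
        if self._in_target:
            self.chunks.append(data)


def extract_next_data(html):
    """Return the parsed __NEXT_DATA__ JSON, or None if the element is absent."""
    parser = NextDataParser()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.chunks)
    return json.loads(raw) if raw else None


def extract_article(html):
    """Return props.pageProps.article (path taken from the log), or None."""
    data = extract_next_data(html)
    if data is None:
        return None
    return data.get("props", {}).get("pageProps", {}).get("article")


if __name__ == "__main__":
    # Hypothetical sample page; a real article page would be read from disk
    # or from an archived response instead.
    sample = (
        '<html><body><script id="__NEXT_DATA__" type="application/json">'
        '{"props": {"pageProps": {"article": {"title": "example"}}}}'
        "</script></body></html>"
    )
    print(extract_article(sample))
```

Using the standard-library HTML parser instead of a regex avoids depending on attribute order or whitespace inside the script tag. As arkiver notes, the raw responses are archived anyway, so a step like this would only be a convenience for reading article data back out of saved HTML.
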
13:40:26		goose quits [Client Quit]
13:43:31		hker joins
13:46:19		hker leaves
14:09:15		Dj-Wawa quits [Quit: Dj-Wawa]
14:09:27		Dj-Wawa joins
14:09:28		Dj-Wawa is now authenticated as Dj-Wawa
14:09:30		Dj-Wawa quits [Client Quit]
14:10:13		Dj-Wawa joins
14:10:14		Dj-Wawa is now authenticated as Dj-Wawa
14:18:52		guy joins
14:25:43		guy quits [Client Quit]
14:47:18		guy joins
14:56:18		guy quits [Ping timeout: 258 seconds]
14:58:59		mutantmnky quits [Ping timeout: 258 seconds]
14:59:16		mutantmnky (mutantmonkey) joins
15:28:07	<@arkiver>	thestandnews.com is down
15:28:26	<@arkiver>	we should have gotten nearly all of the 158k articles (including embedded images) just in time
15:29:31	<achivarin>	Can we tell which ones are missing?
15:31:57	<@arkiver>	probably yeah, but i cant look into it now
15:33:08	<achivarin>	I've been crawling with browsertrix and archivebox myself. I'd like to contribute if I have the ones that are missing. Who should I talk to about this?
15:33:37	<@arkiver>	are you going to stick around on IRC for the next day or two?
15:33:47	<@arkiver>	if not, email me at arkiver@protonmail.com please
15:34:36	<achivarin>	Yeah I'll try to stick around
15:34:56	<achivarin>	Is it possible to pull them from CDN that might still be up?
15:35:01		OverhaulDeskwork quits [Ping timeout: 258 seconds]
15:38:51		Stellarator quits [Ping timeout: 258 seconds]
15:40:42		Stellarator joins
15:49:47	<achivarin>	Did anyone grab Stand News's Facebook videos? I completely forgot them
15:52:43	<achivarin>	Or could we use google's web cache to get the missing articles?
15:56:52		Iki1 quits [Ping timeout: 258 seconds]
16:08:23		Megame (Megame) joins
16:11:26		raskolnikovgirl joins
16:12:37		raskolnikovgirl quits [Client Quit]
16:16:54	<Sanqui>	achivarin: thanks for mentioning browsertrix, this is useful for my research. There's a lot of cutting edge crawling, scraping, and archival tools popping up. I've just started learning Puppeteer myself and was going to attempt to integrate it with warcprox myself.
16:22:26	<IDK>	IA's URL maximum request length limit is 4096 right?
16:23:35	<IDK>	URL was so long that I had to shorten it: http://gg.gg/xe3ij
16:24:05	<IDK>	The original: http://gg.gg/xe3ip
16:24:25	<IDK>	request on this URL will just fail and the item will not be displayed I think
16:27:42	<IDK>	is there a way where I can still view this type of URL without bad request?
16:32:50	<@rewby>	Sanqui, achivarin: I seem to remember browsertrix producing some wonky "not quite warc" file standard and there was something wrong with it. I just don't remember what. I think it might suffer from the same issue Chromebot did where not all headers are right
16:33:02	<@rewby>	But I honestly don't entirely remember
16:34:19	<Sanqui>	rewby: browsertrix claims to produce WARC files via pywb. I think JAA spoke against it and recommended warcprox instead
16:34:51	<@rewby>	Yeah, there's something screwy about it and JAA would be the one who knows exactly what the problem ia
16:34:54	<@rewby>	*is
16:35:51	<@rewby>	I generally trust JAA as an expert on warc files and their integrity
16:36:02	<Sanqui>	same, but I am taking notes :)
16:36:08	<Sanqui>	now
16:36:22	<@rewby>	Same. I just don't have my laptop on hand
16:39:55		OverhaulDeskwork joins
16:41:28	<Sanqui>	234:[20:15:22] <JAA> FWIW, webrecorder/pywb also has data integrity issues.
16:41:28	<Sanqui>	:P
16:41:54	<Sanqui>	that's unfortunate that browsertrix is still using it -- might raise an issue with them once I have a better understanding of the relevant problems.
16:53:09	<@OrIdow6>	https://github.com/webrecorder/warcio/issues/129
16:53:55	<@OrIdow6>	May be more
17:08:56		march_happy quits [Ping timeout: 258 seconds]
17:13:14	<pcr>	https://github.com/webrecorder/warcio/issues?q=is%3Aopen+author%3AJustAnotherArchivist+
17:26:39	<achivarin>	Sanqui: Sorry I didn't know the issue with browsertrix. But if you are looking for archiving tools here's a massive list to check out: https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community
17:27:44	<Sanqui>	Massive lists are nice and useful but it's also good to hear what people use in practice and be able to exchange knowledge and opinions. Thanks :>
17:33:37	<achivarin>	I'm a complete noob trying everything on that list, so I might not be a good person to learn from lol
17:43:42	<britmob|m>	grab-site is my goto personally, but that isn’t JS ofc
17:46:41	<Sanqui>	yeah, I'm getting into the whole js business
17:47:06	<Sanqui>	well - and also sites with weird behavior around cookies or the session
17:54:28		nick joins
17:54:46		nick quits [Client Quit]
17:56:52		standwithhk joins
17:57:58	<standwithhk>	Hello
17:58:11	<standwithhk>	Is there a status update for Stand News’ archive?
17:58:21	<standwithhk>	Sites offline now :(
17:59:14		standwithhk quits [Client Quit]
18:06:24		guest01911 joins
18:07:21	<thuban>	why don't people stay!
18:49:58		jspiros (jspiros) joins
19:00:40	<achivarin>	Perhaps the wiki page could be updated?
19:08:20	<thuban>	it could indeed, and i am working on it, but i doubt that will solve this particular problem
19:20:15		DogsRNice (Webuser299) joins
19:22:11	<achivarin>	I think younger people, including me, are not as adept at using IRC
19:23:11	<Sanqui>	now you can use the matrix bridge to stay connected without having to set up a bouncer (and now I say this as a person who's been using irc for "only" 15 years.)
19:27:41	<achivarin>	I heard that had performance problems so I never tried it. Maybe I should
19:57:13	<achivarin>	https://www.inmediahk.net/ needs urgent help as well
19:58:02	<achivarin>	https://www.facebook.com/inmediahknet/ https://www.instagram.com/inmediahknet/ https://twitter.com/inmediahk?lang=en https://telegram.me/inmediahknet https://mewe.com/p/%E7%8D%A8%E7%AB%8B%E5%AA%92%E9%AB%94wwwinmediahknet1
19:59:29	<achivarin>	https://www.youtube.com/c/inmediahkvideo YouTube channel should be prioritised
20:02:48	<achivarin>	My bad, the website https://www.inmediahk.net/ should be prioritised.
20:10:32		tzt quits [Ping timeout: 265 seconds]
20:11:34	<mgrandi>	are the youtube videos in danger?
20:11:37	<jodizzle>	It's been queued, but getting a number of cloudflare errors. According to the logs something like this also happened back when it was run in June/July.
20:11:55	<mgrandi>	would feel those would....stay up even after the site goes down, china can't really force youtube to take down stuff right?
20:12:34	<jodizzle>	Telegram, twitter and youtube queued; facebook, instagram, mewe have various issues.
20:12:38	<mgrandi>	either way i'm doing a JSON export of the telegram channel + files (<8mb) using my telegram client
20:22:54	<IDK>	Made change to deathwatch, added the stand news information
20:39:38	<thuban>	mgrandi: china can force chinese channel owners to take stuff down.
20:55:33	<thuban>	anyone happen to have the job id of the most recent standnewshk twitter scrape? scrolled off the dashboard, not yet on the viewer...
21:02:52		tzt (tzt) joins
21:07:53	<Megame>	thuban, 1pzlmuacmgtgl5ipygf5fh4uo
21:08:05	<thuban>	Megame: ty
21:08:54	<Megame>	You can check http://dashboard.at.ninjawedding.org/finished for recent jobs
21:09:17	<thuban>	oho!
21:10:57	<thuban>	what does the bold pink in the 'Remaining' column indicate? an abort?
21:13:44	<thuban>	... also, why is the job id for the main site listed as 17rh8v1zlqhwn6dchvr34b370, which is the same id as the june grab?
21:36:37		qwertyasdfuiopghjkl quits [Client Quit]
21:37:15		Arcorann (Arcorann) joins
21:44:22		qwertyasdfuiopghjkl joins
21:47:20		OverhaulDeskwork5 joins
21:50:21		march_happy (march_happy) joins
21:51:04		OverhaulDeskwork quits [Ping timeout: 258 seconds]
21:51:04		OverhaulDeskwork5 is now known as OverhaulDeskwork
22:09:33		Megame quits [Client Quit]
22:10:36	<h2ibot>	Switchnode edited Hong Kong media (+2008, update stand news jobs): https://wiki.archiveteam.org/?diff=48068&oldid=48065
22:29:40	<h2ibot>	Switchnode edited Hong Kong media (-32, make this nightmare table slightly more legible): https://wiki.archiveteam.org/?diff=48069&oldid=48068
22:36:28	<Ryz>	mgrandi, regarding the 'memorial wiki thing ', please refer to the people at #wikiteam - they do the dumps, I just make requests and find non-standard mediawikis oo;
22:41:42	<h2ibot>	Switchnode edited Hong Kong media (+377, update inmediahk jobs): https://wiki.archiveteam.org/?diff=48070&oldid=48069
22:48:05	<thuban>	i gotta go, but i'll come back and do some more early tomorrow
22:50:14	<thuban>	in the meantime i request people argue over how to handle re-runs of the june/july grabs--which (if any) we should do and how (if at all) we should annotate them on the wiki page
22:51:21	<thuban>	the inmediahk regrab also has the same job id as the previous one. is this intentional? seems problematic
23:07:44		BlueMaxima joins
23:38:01		march_happy quits [Ping timeout: 258 seconds]
23:50:10		guest01911 quits [Client Quit]
23:53:21		Stellarator quits [Ping timeout: 258 seconds]
23:53:34		Stellarator joins
23:54:36		Stellarator quits [Client Quit]
