00:35:35sonick quits [Quit: Connection closed for inactivity]
00:37:04sonick (sonick) joins
01:00:02dm4v quits [Client Quit]
01:01:34dm4v joins
01:01:37dm4v quits [Changing host]
01:01:37dm4v (dm4v) joins
01:16:59fionera_ quits [Quit: fionera_]
01:18:03fionera (Fionera) joins
01:35:39<@OrIdow6>https://www.nytimes.com/2021/12/28/obituaries/john-madden-dead.html https://www.nytimes.com/2021/12/28/obituaries/harry-reid-dead.html deaths - can be inferred from the URLs
01:57:29<jodizzle>Got https://twitter.com/SenatorReid. Don't see anything else obvious for Reid.
01:57:35<jodizzle>Don't see anything obvious for Madden either.
02:01:43<@OrIdow6>Thanks
02:02:38dm4v quits [Read error: Connection reset by peer]
02:04:21dm4v joins
02:04:23dm4v quits [Changing host]
02:04:23dm4v (dm4v) joins
02:45:35sonick quits [Client Quit]
02:51:13<mgrandi>@Ryz: did you get a chance to look at the memorial wiki thing yet?
03:40:15Somebody2 (Somebody2) joins
03:41:14xkey quits [Client Quit]
03:41:50xkey (eyo) joins
04:14:08qw3rty__ joins
04:17:53qw3rty_ quits [Ping timeout: 265 seconds]
04:33:23Stellarator quits [Ping timeout: 258 seconds]
04:56:56DogsRNice quits [Read error: Connection reset by peer]
07:01:54godane1 quits [Client Quit]
07:57:02qwertyasdfuiopghjkl joins
08:10:36guest01911 joins
08:10:53guest01911 leaves
08:26:28<jodizzle>(from #archiveteam) https://www.thestandnews.com/ was last grabbed back in June, but I queued it again given the situation. https://twitter.com/ezracheungtoto/status/1475989778893996034
09:04:42<IDK>I got 503s
09:24:08qwertyasdfuiopghjkl25 joins
09:27:01qwertyasdfuiopghjkl quits [Ping timeout: 258 seconds]
09:27:15Stellarator joins
09:41:44achivarin (achivarin) joins
09:50:40<achivarin>The government has moved to shut down one of Hong Kong's last independent media outlets. https://www.thestandnews.com/ and https://www.youtube.com/c/StandNewsHK/videos urgently need help.
09:59:11<jodizzle>achivarin: Yes, it's been mentioned. I think we've queued the relevant sites in the relevant places.
09:59:43<jodizzle>https://www.thestandnews.com/ and some subdomains are running in #archivebot
10:00:15<jodizzle>https://www.thestandnews.com/ and some subdomains were also fully grabbed back in June
10:03:46<thuban>jodizzle: have we requeued their channel (UCGe96mv2FcdfQXmtSXYF_fA) in #youtubearchive?
10:04:04<jodizzle>Yeah.
10:04:32<jodizzle>In principle, this could be worth tubeup-ing, but I'm never sure of the rules/guidelines on that.
10:04:44<thuban>cool, ty
10:06:09<thuban>(i don't think the hong kong media wiki page has been updated in some months; might be worth rechecking the status of those orgs--iirc there's a lot of stuff we didn't get around to abing in the first place)
10:17:51qwertyasdfuiopghjkl25 is now known as qwertyasdfuiopghjkl
11:33:19BlueMaxima quits [Client Quit]
11:35:50march_happy (march_happy) joins
11:42:27OverhaulDeskwork joins
12:16:17<achivarin>Thanks for your reply. Can you check if the social media pages are being worked on? https://www.facebook.com/standnewshk/ https://www.instagram.com/thestandnews/ https://twitter.com/standnewshk https://t.me/thestandnews
12:19:51<thuban>achivarin: we're working on their twitter, but unfortunately instagram and to a lesser extent facebook have strict rate limiting that makes them difficult to archive
12:48:50britmob25636477 joins
13:00:48<h2ibot>PixelAlpha edited URLTeam (+232, /* Alive */ intip.in): https://wiki.archiveteam.org/?diff=48064&oldid=47853
13:00:49<h2ibot>Fidel edited Hong Kong media (+137): https://wiki.archiveteam.org/?diff=48065&oldid=47976
13:00:50<h2ibot>Fidel created Stand News (+1522, Create page, copy some from…): https://wiki.archiveteam.org/?title=Stand%20News
13:00:51<h2ibot>Taka edited Deathwatch (+197, /* 2022 */ Added about "excite friends"): https://wiki.archiveteam.org/?diff=48067&oldid=48062
13:03:43<@arkiver>is anyone here who was involved with this effort able to update https://wiki.archiveteam.org/index.php/Hong_Kong_media ?
13:13:13MattieTK joins
13:15:25<thuban>i kept updating the page up to the time i stepped away; in a couple of hours i can go through the archivebot viewer and check up on (a) the status of listed jobs and (b) whether additional jobs were run on the sites marked 'not saved yet'
13:16:17<thuban>but anything that happened that's not in those records, i won't know about
13:16:39MattieTK quits [Client Quit]
13:19:29goose joins
13:20:57<goose>hey, guessing so, has anyone started archiving the stand?
13:28:53Arcorann quits [Ping timeout: 265 seconds]
13:34:15<thuban>goose: yes, we're working on it
13:34:44<goose>nice, i've done a quick script to extract next.js' json data from the html
13:35:36<@arkiver>goose: what is this next.js stuff?
13:35:39<@arkiver>got an example webpage?
13:35:58<goose>next.js is the framework, it encodes the data of the article as json
13:36:02<goose>article example: https://www.thestandnews.com/society/%E9%A6%99%E6%B8%AF%E4%BF%9D%E8%AD%B7%E5%85%92%E7%AB%A5%E6%9C%83%E7%AB%A5%E6%A8%82%E5%B1%85%E5%86%8D%E5%A4%9A-4-%E8%81%B7%E5%93%A1%E6%B6%89%E9%AB%94%E7%BD%B0%E5%B9%BC%E5%85%92-%E5%B7%B2%E8%A2%AB%E5%81%9C%E8%81%B7-%E6%A9%9F%E6%A7%8B%E5%B7%B2%E5%A0%B1%E8%AD%A6
13:36:13<goose>run `__NEXT_DATA__.props.pageProps.article` in console to see
13:36:16<@arkiver>ah right
13:36:24<goose>you can extract from html too, i did it with some dirty regex atm
13:36:24<@arkiver>yeah we're getting that
13:36:29<@arkiver>all data will be in the wayback machine
13:36:33<@arkiver>the raw responses
13:36:36<goose>ah nice
13:36:43<goose>good work!
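
The extraction goose describes above can be sketched roughly as follows. This is a minimal illustration, not goose's script: it assumes the page embeds its data in a `<script id="__NEXT_DATA__">` tag (the usual Next.js layout) and that the article sits under `props.pageProps.article`, as in the console expression quoted above. The names `NextDataParser`, `extract_article`, and `SAMPLE_HTML` are hypothetical, and the sample HTML is made up for the example. It uses the standard-library HTML parser rather than a regex.

```python
import json
from html.parser import HTMLParser


class NextDataParser(HTMLParser):
    """Collects the text inside <script id="__NEXT_DATA__"> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_next_data = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Assumption: Next.js puts the page data in a script tag with this id.
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._in_next_data = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_next_data = False

    def handle_data(self, data):
        if self._in_next_data:
            self._chunks.append(data)

    @property
    def payload(self):
        return "".join(self._chunks)


def extract_article(html):
    """Return props.pageProps.article from a page's HTML, or None if absent."""
    parser = NextDataParser()
    parser.feed(html)
    parser.close()
    if not parser.payload:
        return None
    data = json.loads(parser.payload)
    return data.get("props", {}).get("pageProps", {}).get("article")


# Made-up sample page, only to show the shape of the input.
SAMPLE_HTML = (
    '<html><body><div id="__next"></div>'
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"article": {"title": "Example headline"}}}}'
    "</script></body></html>"
)

if __name__ == "__main__":
    print(extract_article(SAMPLE_HTML))
```

Since the raw responses are being archived, the same function could in principle be run later over HTML retrieved from the Wayback Machine, assuming the script tag is preserved in the stored response.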
13:40:26goose quits [Client Quit]
13:43:31hker joins
13:46:19hker leaves
14:09:15Dj-Wawa quits [Quit: Dj-Wawa]
14:09:27Dj-Wawa joins
14:09:30Dj-Wawa quits [Client Quit]
14:10:13Dj-Wawa joins
14:18:52guy joins
14:25:43guy quits [Client Quit]
14:47:18guy joins
14:56:18guy quits [Ping timeout: 258 seconds]
14:58:59mutantmnky quits [Ping timeout: 258 seconds]
14:59:16mutantmnky (mutantmonkey) joins
15:28:07<@arkiver>thestandnews.com is down
15:28:26<@arkiver>we should have gotten nearly all of the 158k articles (including embedded images) just in time
15:29:31<achivarin>Can we tell which ones are missing?
15:31:57<@arkiver>probably yeah, but i can't look into it now
15:33:08<achivarin>I've been crawling with browsertrix and archivebox myself. I'd like to contribute if I have the ones that are missing. Who should I talk to about this?
15:33:37<@arkiver>are you going to stick around on IRC for the next day or two?
15:33:47<@arkiver>if not, email me at arkiver@protonmail.com please
15:34:36<achivarin>Yeah I'll try to stick around
15:34:56<achivarin>Is it possible to pull them from CDN that might still be up?
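
One possible way to answer "which ones are missing", sketched under assumptions: given a list of article URLs that should exist and a list of URLs known to be archived, compare them after light normalization. The function names `normalize_url` and `missing_urls` and the example paths are hypothetical. The normalization rules (ignore scheme, host case, trailing slash, percent-encoding case, and query string) are an illustrative choice, not a description of how ArchiveBot or the Wayback Machine match URLs.

```python
from urllib.parse import unquote, urlsplit


def normalize_url(url):
    """Reduce a URL to host + decoded path so trivial variants compare equal.

    Assumption: scheme, query string, and trailing slashes do not
    distinguish articles on this site.
    """
    parts = urlsplit(url.strip())
    path = unquote(parts.path).rstrip("/")
    return parts.netloc.lower() + path


def missing_urls(expected, archived):
    """Return the expected URLs (in order) with no match in the archived list."""
    have = {normalize_url(u) for u in archived}
    return [u for u in expected if normalize_url(u) not in have]


if __name__ == "__main__":
    # Made-up paths for illustration only.
    expected = [
        "https://www.thestandnews.com/society/example-a",
        "https://www.thestandnews.com/politics/example-b/",
    ]
    archived = ["http://www.thestandnews.com/society/example-a/"]
    for url in missing_urls(expected, archived):
        print(url)
```

A list produced this way could then be checked against a separate crawl, such as a browsertrix or ArchiveBox collection, to see whether it fills any of the gaps.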
15:35:01OverhaulDeskwork quits [Ping timeout: 258 seconds]
15:38:51Stellarator quits [Ping timeout: 258 seconds]
15:40:42Stellarator joins
15:49:47<achivarin>Did anyone grab Stand News's Facebook videos? I completely forgot them
15:52:43<achivarin>Or could we use google's web cache to get the missing articles?
15:56:52Iki1 quits [Ping timeout: 258 seconds]
16:08:23Megame (Megame) joins
16:11:26raskolnikovgirl joins
16:12:37raskolnikovgirl quits [Client Quit]
16:16:54<Sanqui>achivarin: thanks for mentioning browsertrix, this is useful for my research. There are a lot of cutting-edge crawling, scraping, and archival tools popping up. I've just started learning Puppeteer and was going to attempt to integrate it with warcprox myself.
16:22:26<IDK>IA's maximum request URL length is 4096, right?
16:23:35<IDK>URL was so long that I had to shorten it: http://gg.gg/xe3ij
16:24:05<IDK>The original: http://gg.gg/xe3ip
16:24:25<IDK>a request to this URL will just fail and the item will not be displayed, I think
16:27:42<IDK>is there a way I can still view this type of URL without getting a bad request?
16:32:50<@rewby>Sanqui, achivarin: I seem to remember browsertrix producing some wonky "not quite warc" file standard and there was something wrong with it. I just don't remember what. I think it might suffer from the same issue Chromebot did where not all headers are right
16:33:02<@rewby>But I honestly don't entirely remember
16:34:19<Sanqui>rewby: browsertrix claims to produce WARC files via pywb. I think JAA spoke against it and recommended warcprox instead
16:34:51<@rewby>Yeah, there's something screwy about it and JAA would be the one who knows exactly what the problem is
16:35:51<@rewby>I generally trust JAA as an expert on warc files and their integrity
16:36:02<Sanqui>same, but I am taking notes :)
16:36:08<Sanqui>now
16:36:22<@rewby>Same. I just don't have my laptop on hand
16:39:55OverhaulDeskwork joins
16:41:28<Sanqui>234:[20:15:22] <JAA> FWIW, webrecorder/pywb also has data integrity issues.
16:41:28<Sanqui>:P
16:41:54<Sanqui>that's unfortunate that browsertrix is still using it -- might raise an issue with them once I have a better understanding of the relevant problems.
16:53:09<@OrIdow6>https://github.com/webrecorder/warcio/issues/129
16:53:55<@OrIdow6>May be more
17:08:56march_happy quits [Ping timeout: 258 seconds]
17:13:14<pcr>https://github.com/webrecorder/warcio/issues?q=is%3Aopen+author%3AJustAnotherArchivist+
17:26:39<achivarin>Sanqui: Sorry I didn't know the issue with browsertrix. But if you are looking for archiving tools here's a massive list to check out: https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community
17:27:44<Sanqui>Massive lists are nice and useful but it's also good to hear what people use in practice and be able to exchange knowledge and opinions. Thanks :>
17:33:37<achivarin>I'm a complete noob trying everything on that list, so I might not be a good person to learn from lol
17:43:42<britmob|m>grab-site is my go-to personally, but that isn’t JS ofc
17:46:41<Sanqui>yeah, I'm getting into the whole js business
17:47:06<Sanqui>well - and also sites with weird behavior around cookies or the session
17:54:28nick joins
17:54:46nick quits [Client Quit]
17:56:52standwithhk joins
17:57:58<standwithhk>Hello
17:58:11<standwithhk>Is there a status update for Stand News’ archive?
17:58:21<standwithhk>Site's offline now :(
17:59:14standwithhk quits [Client Quit]
18:06:24guest01911 joins
18:07:21<thuban>why don't people stay!
18:49:58jspiros (jspiros) joins
19:00:40<achivarin>Perhaps the wiki page could be updated?
19:08:20<thuban>it could indeed, and i am working on it, but i doubt that will solve this particular problem
19:20:15DogsRNice (Webuser299) joins
19:22:11<achivarin>I think younger people, including me, are not as adept at using IRC
19:23:11<Sanqui>now you can use the matrix bridge to stay connected without having to set up a bouncer (and now I say this as a person who's been using irc for "only" 15 years.)
19:27:41<achivarin>I heard that had performance problems so I never tried it. Maybe I should
19:57:13<achivarin>https://www.inmediahk.net/ needs urgent help as well
19:58:02<achivarin>https://www.facebook.com/inmediahknet/ https://www.instagram.com/inmediahknet/ https://twitter.com/inmediahk?lang=en https://telegram.me/inmediahknet https://mewe.com/p/%E7%8D%A8%E7%AB%8B%E5%AA%92%E9%AB%94wwwinmediahknet1
19:59:29<achivarin>https://www.youtube.com/c/inmediahkvideo YouTube channel should be prioritised
20:02:48<achivarin>My bad, the website https://www.inmediahk.net/ should be prioritised.
20:10:32tzt quits [Ping timeout: 265 seconds]
20:11:34<mgrandi>are the youtube videos in danger?
20:11:37<jodizzle>It's been queued, but getting a number of cloudflare errors. According to the logs something like this also happened back when it was run in June/July.
20:11:55<mgrandi>i'd feel those would... stay up even after the site goes down, china can't really force youtube to take down stuff, right?
20:12:34<jodizzle>Telegram, twitter and youtube queued; facebook, instagram, mewe have various issues.
20:12:38<mgrandi>either way i'm doing a JSON export of the telegram channel + files (<8mb) using my telegram client
20:22:54<IDK>Made change to deathwatch, added the stand news information
20:39:38<thuban>mgrandi: china can force chinese channel owners to take stuff down.
20:55:33<thuban>anyone happen to have the job id of the most recent standnewshk twitter scrape? scrolled off the dashboard, not yet on the viewer...
21:02:52tzt (tzt) joins
21:07:53<Megame>thuban, 1pzlmuacmgtgl5ipygf5fh4uo
21:08:05<thuban>Megame: ty
21:08:54<Megame>You can check http://dashboard.at.ninjawedding.org/finished for recent jobs
21:09:17<thuban>oho!
21:10:57<thuban>what does the bold pink in the 'Remaining' column indicate? an abort?
21:13:44<thuban>... also, why is the job id for the main site listed as 17rh8v1zlqhwn6dchvr34b370, which is the same id as the june grab?
21:36:37qwertyasdfuiopghjkl quits [Client Quit]
21:37:15Arcorann (Arcorann) joins
21:44:22qwertyasdfuiopghjkl joins
21:47:20OverhaulDeskwork5 joins
21:50:21march_happy (march_happy) joins
21:51:04OverhaulDeskwork quits [Ping timeout: 258 seconds]
21:51:04OverhaulDeskwork5 is now known as OverhaulDeskwork
22:09:33Megame quits [Client Quit]
22:10:36<h2ibot>Switchnode edited Hong Kong media (+2008, update stand news jobs): https://wiki.archiveteam.org/?diff=48068&oldid=48065
22:29:40<h2ibot>Switchnode edited Hong Kong media (-32, make this nightmare table slightly more legible): https://wiki.archiveteam.org/?diff=48069&oldid=48068
22:36:28<Ryz>mgrandi, regarding the 'memorial wiki thing', please refer to the people at #wikiteam - they do the dumps, I just make requests and find non-standard mediawikis oo;
22:41:42<h2ibot>Switchnode edited Hong Kong media (+377, update inmediahk jobs): https://wiki.archiveteam.org/?diff=48070&oldid=48069
22:48:05<thuban>i gotta go, but i'll come back and do some more early tomorrow
22:50:14<thuban>in the meantime i request people argue over how to handle re-runs of the june/july grabs--which (if any) we should do and how (if at all) we should annotate them on the wiki page
22:51:21<thuban>the inmediahk regrab also has the same job id as the previous one. is this intentional? seems problematic
23:07:44BlueMaxima joins
23:38:01march_happy quits [Ping timeout: 258 seconds]
23:50:10guest01911 quits [Client Quit]
23:53:21Stellarator quits [Ping timeout: 258 seconds]
23:53:34Stellarator joins
23:54:36Stellarator quits [Client Quit]