| 00:35:35 | | sonick quits [Quit: Connection closed for inactivity] |
| 00:37:04 | | sonick (sonick) joins |
| 01:00:02 | | dm4v quits [Client Quit] |
| 01:01:34 | | dm4v joins |
| 01:01:37 | | dm4v is now authenticated as dm4v |
| 01:01:37 | | dm4v quits [Changing host] |
| 01:01:37 | | dm4v (dm4v) joins |
| 01:16:59 | | fionera_ quits [Quit: fionera_] |
| 01:18:03 | | fionera (Fionera) joins |
| 01:35:39 | <@OrIdow6> | https://www.nytimes.com/2021/12/28/obituaries/john-madden-dead.html https://www.nytimes.com/2021/12/28/obituaries/harry-reid-dead.html deaths - can be inferred from the URLs |
| 01:57:29 | <jodizzle> | Got https://twitter.com/SenatorReid. Don't see anything else obvious for Reid. |
| 01:57:35 | <jodizzle> | Don't see anything obvious for Madden either. |
| 02:01:43 | <@OrIdow6> | Thanks |
| 02:02:38 | | dm4v quits [Read error: Connection reset by peer] |
| 02:04:21 | | dm4v joins |
| 02:04:23 | | dm4v is now authenticated as dm4v |
| 02:04:23 | | dm4v quits [Changing host] |
| 02:04:23 | | dm4v (dm4v) joins |
| 02:45:35 | | sonick quits [Client Quit] |
| 02:51:13 | <mgrandi> | @Ryz: did you get a chance to look at the memorial wiki thing yet? |
| 03:40:15 | | Somebody2 (Somebody2) joins |
| 03:41:14 | | xkey quits [Client Quit] |
| 03:41:50 | | xkey (eyo) joins |
| 04:14:08 | | qw3rty__ joins |
| 04:17:53 | | qw3rty_ quits [Ping timeout: 265 seconds] |
| 04:33:23 | | Stellarator quits [Ping timeout: 258 seconds] |
| 04:56:56 | | DogsRNice quits [Read error: Connection reset by peer] |
| 07:01:54 | | godane1 quits [Client Quit] |
| 07:57:02 | | qwertyasdfuiopghjkl joins |
| 08:10:36 | | guest01911 joins |
| 08:10:53 | | guest01911 leaves |
| 08:26:28 | <jodizzle> | (from #archiveteam) https://www.thestandnews.com/ was last grabbed back in June, but I queued it again given the situation. https://twitter.com/ezracheungtoto/status/1475989778893996034 |
| 09:04:42 | <IDK> | I got 503s |
| 09:24:08 | | qwertyasdfuiopghjkl25 joins |
| 09:27:01 | | qwertyasdfuiopghjkl quits [Ping timeout: 258 seconds] |
| 09:27:15 | | Stellarator joins |
| 09:41:44 | | achivarin (achivarin) joins |
| 09:50:40 | <achivarin> | The government has moved to shut down one of Hong Kong's last independent media outlets. https://www.thestandnews.com/ and https://www.youtube.com/c/StandNewsHK/videos urgently need help. |
| 09:59:11 | <jodizzle> | achivarin: Yes, it's been mentioned. I think we've queued the relevant sites in the relevant places. |
| 09:59:43 | <jodizzle> | https://www.thestandnews.com/ and some subdomains are running in #archivebot |
| 10:00:15 | <jodizzle> | https://www.thestandnews.com/ and some subdomains were also fully grabbed back in June |
| 10:03:46 | <thuban> | jodizzle: have we requeued their channel (UCGe96mv2FcdfQXmtSXYF_fA) in #youtubearchive? |
| 10:04:04 | <jodizzle> | Yeah. |
| 10:04:32 | <jodizzle> | In principle, this could be worth tubeup-ing, but I'm never sure of the rules/guidelines on that. |
| 10:04:44 | <thuban> | cool, ty |
| 10:06:09 | <thuban> | (i don't think the hong kong media wiki page has been updated in some months; might be worth rechecking the status of those orgs--iirc there's a lot of stuff we didn't get around to abing in the first place) |
| 10:17:51 | | qwertyasdfuiopghjkl25 is now known as qwertyasdfuiopghjkl |
| 11:33:19 | | BlueMaxima quits [Client Quit] |
| 11:35:50 | | march_happy (march_happy) joins |
| 11:42:27 | | OverhaulDeskwork joins |
| 12:16:17 | <achivarin> | Thanks for your reply. Can you check if the social media pages are being worked on? https://www.facebook.com/standnewshk/ https://www.instagram.com/thestandnews/ https://twitter.com/standnewshk https://t.me/thestandnews |
| 12:19:51 | <thuban> | achivarin: we're working on their twitter, but unfortunately instagram and to a lesser extent facebook have strict rate limiting that makes them difficult to archive |
| 12:48:50 | | britmob25636477 joins |
| 13:00:48 | <h2ibot> | PixelAlpha edited URLTeam (+232, /* Alive */ intip.in): https://wiki.archiveteam.org/?diff=48064&oldid=47853 |
| 13:00:49 | <h2ibot> | Fidel edited Hong Kong media (+137): https://wiki.archiveteam.org/?diff=48065&oldid=47976 |
| 13:00:50 | <h2ibot> | Fidel created Stand News (+1522, Create page, copy some from…): https://wiki.archiveteam.org/?title=Stand%20News |
| 13:00:51 | <h2ibot> | Taka edited Deathwatch (+197, /* 2022 */ Added about "excite friends"): https://wiki.archiveteam.org/?diff=48067&oldid=48062 |
| 13:03:43 | <@arkiver> | is anyone here who was involved with this effort able to update https://wiki.archiveteam.org/index.php/Hong_Kong_media ? |
| 13:13:13 | | MattieTK joins |
| 13:15:25 | <thuban> | i kept updating the page up to the time i stepped away; in a couple of hours i can go through the archivebot viewer and check up on (a) the status of listed jobs and (b) whether additional jobs were run on the sites marked 'not saved yet' |
| 13:16:17 | <thuban> | but anything that happened that's not in those records, i won't know about |
| 13:16:39 | | MattieTK quits [Client Quit] |
| 13:19:29 | | goose joins |
| 13:20:57 | <goose> | hey, guessing so, has anyone started archiving the stand? |
| 13:28:53 | | Arcorann quits [Ping timeout: 265 seconds] |
| 13:34:15 | <thuban> | goose: yes, we're working on it |
| 13:34:44 | <goose> | nice, i've done a quick script to extract next.js' json data from the html |
| 13:35:36 | <@arkiver> | goose: what is this next.js stuff? |
| 13:35:39 | <@arkiver> | got an example webpage? |
| 13:35:58 | <goose> | next.js is the framework, it encodes the data of the article as json |
| 13:36:02 | <goose> | article example: https://www.thestandnews.com/society/%E9%A6%99%E6%B8%AF%E4%BF%9D%E8%AD%B7%E5%85%92%E7%AB%A5%E6%9C%83%E7%AB%A5%E6%A8%82%E5%B1%85%E5%86%8D%E5%A4%9A-4-%E8%81%B7%E5%93%A1%E6%B6%89%E9%AB%94%E7%BD%B0%E5%B9%BC%E5%85%92-%E5%B7%B2%E8%A2%AB%E5%81%9C%E8%81%B7-%E6%A9%9F%E6%A7%8B%E5%B7%B2%E5%A0%B1%E8%AD%A6 |
| 13:36:13 | <goose> | run `__NEXT_DATA__.props.pageProps.article` in console to see |
| 13:36:16 | <@arkiver> | ah right |
| 13:36:24 | <goose> | you can extract from html too, i did it with some dirty regex atm |
| 13:36:24 | <@arkiver> | yeah we're getting that |
| 13:36:29 | <@arkiver> | all data will be in the wayback machine |
| 13:36:33 | <@arkiver> | the raw responses |
| 13:36:36 | <goose> | ah nice |
| 13:36:43 | <goose> | good work! |
| 13:40:26 | | goose quits [Client Quit] |
| 13:43:31 | | hker joins |
| 13:46:19 | | hker leaves |
| 14:09:15 | | Dj-Wawa quits [Quit: Dj-Wawa] |
| 14:09:27 | | Dj-Wawa joins |
| 14:09:28 | | Dj-Wawa is now authenticated as Dj-Wawa |
| 14:09:30 | | Dj-Wawa quits [Client Quit] |
| 14:10:13 | | Dj-Wawa joins |
| 14:10:14 | | Dj-Wawa is now authenticated as Dj-Wawa |
| 14:18:52 | | guy joins |
| 14:25:43 | | guy quits [Client Quit] |
| 14:47:18 | | guy joins |
| 14:56:18 | | guy quits [Ping timeout: 258 seconds] |
| 14:58:59 | | mutantmnky quits [Ping timeout: 258 seconds] |
| 14:59:16 | | mutantmnky (mutantmonkey) joins |
| 15:28:07 | <@arkiver> | thestandnews.com is down |
| 15:28:26 | <@arkiver> | we should have gotten nearly all of the 158k articles (including embedded images) just in time |
| 15:29:31 | <achivarin> | Can we tell which ones are missing? |
| 15:31:57 | <@arkiver> | probably yeah, but i cant look into it now |
| 15:33:08 | <achivarin> | I've been crawling with browsertrix and archivebox myself. I'd like to contribute if I have the ones that are missing. Who should I talk to about this? |
| 15:33:37 | <@arkiver> | are you going to stick around on IRC for the next day or two? |
| 15:33:47 | <@arkiver> | if not, email me at arkiver@protonmail.com please |
| 15:34:36 | <achivarin> | Yeah I'll try to stick around |
| 15:34:56 | <achivarin> | Is it possible to pull them from CDN that might still be up? |
| 15:35:01 | | OverhaulDeskwork quits [Ping timeout: 258 seconds] |
| 15:38:51 | | Stellarator quits [Ping timeout: 258 seconds] |
| 15:40:42 | | Stellarator joins |
| 15:49:47 | <achivarin> | Did anyone grab Stand News's Facebook videos? I completely forgot them |
| 15:52:43 | <achivarin> | Or could we use google's web cache to get the missing articles? |
| 15:56:52 | | Iki1 quits [Ping timeout: 258 seconds] |
| 16:08:23 | | Megame (Megame) joins |
| 16:11:26 | | raskolnikovgirl joins |
| 16:12:37 | | raskolnikovgirl quits [Client Quit] |
| 16:16:54 | <Sanqui> | achivarin: thanks for mentioning browsertrix, this is useful for my research. There's a lot of cutting edge crawling, scraping, and archival tools popping up. I've just started learning Puppeteer myself and was going to attempt to integrate it with warcprox myself. |
| 16:22:26 | <IDK> | IA's URL maximum request length limit is 4096 right? |
| 16:23:35 | <IDK> | URL was so long that I had to shorten it: http://gg.gg/xe3ij |
| 16:24:05 | <IDK> | The original: http://gg.gg/xe3ip |
| 16:24:25 | <IDK> | request on this URL will just fail and the item will not be displayed I think |
| 16:27:42 | <IDK> | is there a way where I can still view this type of URL without bad request? |
| 16:32:50 | <@rewby> | Sanqui, achivarin: I seem to remember browsertrix producing some wonky "not quite warc" file standard and there was something wrong with it. I just don't remember what. I think it might suffer from the same issue Chromebot did where not all headers are right |
| 16:33:02 | <@rewby> | But I honestly don't entirely remember |
| 16:34:19 | <Sanqui> | rewby: browsertrix claims to produce WARC files via pywb. I think JAA spoke against it and recommended warcprox instead |
| 16:34:51 | <@rewby> | Yeah, there's something screwy about it and JAA would be the one who knows exactly what the problem ia |
| 16:34:54 | <@rewby> | *is |
| 16:35:51 | <@rewby> | I generally trust JAA as an expert on warc files and their integrity |
| 16:36:02 | <Sanqui> | same, but I am taking notes :) |
| 16:36:08 | <Sanqui> | now |
| 16:36:22 | <@rewby> | Same. I just don't have my laptop on hand |
| 16:39:55 | | OverhaulDeskwork joins |
| 16:41:28 | <Sanqui> | 234:[20:15:22] <JAA> FWIW, webrecorder/pywb also has data integrity issues. |
| 16:41:28 | <Sanqui> | :P |
| 16:41:54 | <Sanqui> | that's unfortunate that browsertrix is still using it -- might raise an issue with them once I have a better understanding of the relevant problems. |
| 16:53:09 | <@OrIdow6> | https://github.com/webrecorder/warcio/issues/129 |
| 16:53:55 | <@OrIdow6> | May be more |
| 17:08:56 | | march_happy quits [Ping timeout: 258 seconds] |
| 17:13:14 | <pcr> | https://github.com/webrecorder/warcio/issues?q=is%3Aopen+author%3AJustAnotherArchivist+ |
| 17:26:39 | <achivarin> | Sanqui: Sorry I didn't know the issue with browsertrix. But if you are looking for archiving tools here's a massive list to check out: https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community |
| 17:27:44 | <Sanqui> | Massive lists are nice and useful but it's also good to hear what people use in practice and be able to exchange knowledge and opinions. Thanks :> |
| 17:33:37 | <achivarin> | I'm a complete noob trying everything on that list, so I might not be a good person to learn from lol |
| 17:43:42 | <britmob|m> | grab-site is my goto personally, but that isn’t JS ofc |
| 17:46:41 | <Sanqui> | yeah, I'm getting into the whole js business |
| 17:47:06 | <Sanqui> | well - and also sites with weird behavior around cookies or the session |
| 17:54:28 | | nick joins |
| 17:54:46 | | nick quits [Client Quit] |
| 17:56:52 | | standwithhk joins |
| 17:57:58 | <standwithhk> | Hello |
| 17:58:11 | <standwithhk> | Is there a status update for Stand News’ archive? |
| 17:58:21 | <standwithhk> | Sites offline now :( |
| 17:59:14 | | standwithhk quits [Client Quit] |
| 18:06:24 | | guest01911 joins |
| 18:07:21 | <thuban> | why don't people stay! |
| 18:49:58 | | jspiros (jspiros) joins |
| 19:00:40 | <achivarin> | Perhaps the wiki page could be updated? |
| 19:08:20 | <thuban> | it could indeed, and i am working on it, but i doubt that will solve this particular problem |
| 19:20:15 | | DogsRNice (Webuser299) joins |
| 19:22:11 | <achivarin> | I think younger people, including me, are not as adept at using IRC |
| 19:23:11 | <Sanqui> | now you can use the matrix bridge to stay connected without having to set up a bouncer (and now I say this as a person who's been using irc for "only" 15 years.) |
| 19:27:41 | <achivarin> | I heard that had performance problems so I never tried it. Maybe I should |
| 19:57:13 | <achivarin> | https://www.inmediahk.net/ needs urgent help as well |
| 19:58:02 | <achivarin> | https://www.facebook.com/inmediahknet/ https://www.instagram.com/inmediahknet/ https://twitter.com/inmediahk?lang=en https://telegram.me/inmediahknet https://mewe.com/p/%E7%8D%A8%E7%AB%8B%E5%AA%92%E9%AB%94wwwinmediahknet1 |
| 19:59:29 | <achivarin> | https://www.youtube.com/c/inmediahkvideo YouTube channel should be prioritised |
| 20:02:48 | <achivarin> | My bad, the website https://www.inmediahk.net/ should be prioritised. |
| 20:10:32 | | tzt quits [Ping timeout: 265 seconds] |
| 20:11:34 | <mgrandi> | are the youtube videos in danger? |
| 20:11:37 | <jodizzle> | It's been queued, but getting a number of cloudflare errors. According to the logs something like this also happened back when it was run in June/July. |
| 20:11:55 | <mgrandi> | would feel those would....stay up even after the site goes down, china can't really force youtube to take down stuff right? |
| 20:12:34 | <jodizzle> | Telegram, twitter and youtube queued; facebook, instagram, mewe have various issues. |
| 20:12:38 | <mgrandi> | either way i'm doing a JSON export of the telegram channel + files (<8mb) using my telegram client |
| 20:22:54 | <IDK> | Made change to deathwatch, added the stand news information |
| 20:39:38 | <thuban> | mgrandi: china can force chinese channel owners to take stuff down. |
| 20:55:33 | <thuban> | anyone happen to have the job id of the most recent standnewshk twitter scrape? scrolled off the dashboard, not yet on the viewer... |
| 21:02:52 | | tzt (tzt) joins |
| 21:07:53 | <Megame> | thuban, 1pzlmuacmgtgl5ipygf5fh4uo |
| 21:08:05 | <thuban> | Megame: ty |
| 21:08:54 | <Megame> | You can check http://dashboard.at.ninjawedding.org/finished for recent jobs |
| 21:09:17 | <thuban> | oho! |
| 21:10:57 | <thuban> | what does the bold pink in the 'Remaining' column indicate? an abort? |
| 21:13:44 | <thuban> | ... also, why is the job id for the main site listed as 17rh8v1zlqhwn6dchvr34b370, which is the same id as the june grab? |
| 21:36:37 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 21:37:15 | | Arcorann (Arcorann) joins |
| 21:44:22 | | qwertyasdfuiopghjkl joins |
| 21:47:20 | | OverhaulDeskwork5 joins |
| 21:50:21 | | march_happy (march_happy) joins |
| 21:51:04 | | OverhaulDeskwork quits [Ping timeout: 258 seconds] |
| 21:51:04 | | OverhaulDeskwork5 is now known as OverhaulDeskwork |
| 22:09:33 | | Megame quits [Client Quit] |
| 22:10:36 | <h2ibot> | Switchnode edited Hong Kong media (+2008, update stand news jobs): https://wiki.archiveteam.org/?diff=48068&oldid=48065 |
| 22:29:40 | <h2ibot> | Switchnode edited Hong Kong media (-32, make this nightmare table slightly more legible): https://wiki.archiveteam.org/?diff=48069&oldid=48068 |
| 22:36:28 | <Ryz> | mgrandi, regarding the 'memorial wiki thing', please refer to the people at #wikiteam - they do the dumps, I just make requests and find non-standard mediawikis oo; |
| 22:41:42 | <h2ibot> | Switchnode edited Hong Kong media (+377, update inmediahk jobs): https://wiki.archiveteam.org/?diff=48070&oldid=48069 |
| 22:48:05 | <thuban> | i gotta go, but i'll come back and do some more early tomorrow |
| 22:50:14 | <thuban> | in the meantime i request people argue over how to handle re-runs of the june/july grabs--which (if any) we should do and how (if at all) we should annotate them on the wiki page |
| 22:51:21 | <thuban> | the inmediahk regrab also has the same job id as the previous one. is this intentional? seems problematic |
| 23:07:44 | | BlueMaxima joins |
| 23:38:01 | | march_happy quits [Ping timeout: 258 seconds] |
| 23:50:10 | | guest01911 quits [Client Quit] |
| 23:53:21 | | Stellarator quits [Ping timeout: 258 seconds] |
| 23:53:34 | | Stellarator joins |
| 23:54:36 | | Stellarator quits [Client Quit] |
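
Note on goose's messages from 13:34 to 13:36 about extracting the Next.js JSON from the HTML: the sketch below shows one way that could be done without regex. It is not goose's script (which was not posted). It assumes the page carries its data in a `<script id="__NEXT_DATA__">` element, as Next.js pages normally do, and that the article sits under `props.pageProps.article` as stated in the log. The names `NextDataParser` and `extract_article` are made up for illustration.

```python
import json
from html.parser import HTMLParser


class NextDataParser(HTMLParser):
    """Collects the text inside <script id="__NEXT_DATA__"> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Start collecting only when the script tag has the Next.js data id.
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    @property
    def payload(self):
        return "".join(self._chunks)


def extract_article(html):
    """Return props.pageProps.article from a saved page, or None if absent."""
    parser = NextDataParser()
    parser.feed(html)
    parser.close()
    if not parser.payload:
        return None
    data = json.loads(parser.payload)
    # The key path is the one quoted in the log; other sites may differ.
    return data.get("props", {}).get("pageProps", {}).get("article")
```

Since the raw responses are what ends up in the Wayback Machine (per arkiver at 13:36), a function like this could in principle be run over saved HTML after the fact rather than against the live site.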