00:04:52lennier1 quits [Client Quit]
00:06:52BlueMaxima joins
00:07:31lennier1 (lennier1) joins
00:10:09nerdguy1138 (nerdguy1138) joins
00:10:38nerdguy1138 quits [Client Quit]
00:11:26wizards_ is now known as wizards
00:27:48<@arkiver>JAA: i like the explanation with "classical" in it better :)
00:28:11<@arkiver>:P*
00:38:37<@OrIdow6>kpcyrd: What is sks?
00:50:48<thuban>ugh, i'm at the 'staring at packet dumps and recompiling curl' stage of reverse engineering and i'm not getting anywhere
00:51:02<thuban>cloudflare lets me through every time on two different browsers _and_ javascript's fetch/xhr but is batting a thousand 403ing command-line tools, and i don't know how, because every header is _the same_. it can't be one-time keys, because i can replay the same request in a browser, cache-free, after failing from curl, and it'll _still_ work.
00:51:09<thuban>what are they snooping? the handshake protocol? the http2 settings? the frame batching???
00:51:38<@OrIdow6>thuban: What is this in the context of?
00:53:15<thuban>OrIdow6: a browser game i wanted to enumerate some asset urls from
00:59:33<@OrIdow6>thuban: Oh
00:59:42<@OrIdow6>Yeah, those sound like good ideas
01:00:00<@OrIdow6>Also timings
01:00:42<thuban>i thought about suggesting that but it sounded excessive even as a joke :<
01:00:59<wizards>is the tool making a request to robots.txt
01:01:07<thuban>haha no
01:02:09<thuban>i guess i might finally be forced into trying selenium. o this age of brass
01:02:11<@OrIdow6>Could try to narrow it down by trying some more of the few non-Chrome-based browsers still around
01:02:23<Jake>I'd give it a try if you want to DM me the game
01:02:50<wizards>would you be willing to share a link to the game publicly?
01:02:56KRG quits [Ping timeout: 250 seconds]
01:04:59<thuban>i'm gonna see if i can get some decrypted dumps from another browser first
01:06:30KRG joins
01:17:43<thuban>yes, i can; no, no obvious answers. i think i'll think about this some more
01:26:26Larsenv (Larsenv) joins
02:35:57<pcr>thuban: doesn't look like it's been updated recently but this could be a little help https://github.com/Anorov/cloudflare-scrape
02:54:55Petchea joins
03:26:59DogsRNice quits [Read error: Connection reset by peer]
03:50:09qw3rty_ joins
03:53:45qw3rty__ quits [Ping timeout: 258 seconds]
04:29:15<wizards>pcr: that's not maintained anymore https://github.com/Anorov/cloudflare-scrape/issues/406
04:34:37<pcr>Yeah, it'll need an update to work, but it's a starting point
04:49:33sec^nd quits [Remote host closed the connection]
05:02:10sec^nd (second) joins
05:27:22HP_Archivist (HP_Archivist) joins
05:46:55BlueMaxima quits [Read error: Connection reset by peer]
06:51:19Arcorann (Arcorann) joins
06:58:17KRG` joins
06:59:34KRG quits [Ping timeout: 250 seconds]
07:17:09G4te_Keep3r quits [Quit: Ping timeout (120 seconds)]
07:17:28G4te_Keep3r joins
07:53:46Joat joins
07:57:10HP_Archivist quits [Ping timeout: 258 seconds]
08:09:52Joat quits [Remote host closed the connection]
08:21:30<masterX244>Good that china can't mess with AT pulling a backup of stuff they don't like....
08:40:24dfdsdsdasfgh joins
09:28:43girst_ is now known as girst
09:30:13Vista2003 joins
09:30:19<Vista2003>https://www.nextdigital.com.hk/investor/download/Press%20Release%20(Sat%20Cease).pdf.cd8933f1b8326db4f3a382bb95b07c0a
09:30:54<Vista2003>RIP Apple Daily 1995 - 2021
09:36:00dfdsdsdasfgh quits [Ping timeout: 244 seconds]
09:44:48<Vista2003>https://www.bbc.co.uk/news/world-asia-china-57578926
09:45:23<Vista2003>Upcoming deadlines:
09:45:34<Vista2003>Tomorrow - end of updates from Apple Daily
09:46:04<Vista2003>"No later" than the 26th - end of Apple Daily's site
09:51:20<kpcyrd>Jake, OrIdow6: https://sks-keyservers.net/
09:59:59Petchea leaves
10:02:21gf joins
10:08:46Vista2003 quits [Remote host closed the connection]
10:13:37gf quits [Remote host closed the connection]
10:23:44Vista2003 joins
10:30:35<h3ndr1k>I don't think we can do anything about sks-keyservers. To my knowledge you cannot list all keys from a keyserver, and it is unlikely they will provide a dump, since it seems they shut down because of too many GDPR requests.
10:35:18<grawity>hmm, actually, I was fairly sure a few operators *do* provide dumps as a way to bootstrap a new server
10:36:32<grawity>but I'm not sure if it's really at risk -- it's not the keyservers that were shut down, and most of them aren't run by the pool's operator
10:36:58<grawity>but for example (outdated): https://gist.github.com/mattrude/84aa65d1bb2bbf9bd81370ec6cdbf91a#download-the-needed-database-files
10:38:02<grawity>a more recent one https://pgp.key-server.io/sks-dump
10:39:52<grawity>as well as https://keys.niif.hu/keydump/
10:42:20Webuser164 joins
10:42:44Webuser164 leaves
10:59:45<h3ndr1k>huh, interesting. So just the pool operator shut down? I could not find much information about the thing. The website's SSL certificate expired in April or so and there is only a notice that some pool DNS records were removed.
11:00:44<h3ndr1k>maybe someone can just run these urls and the website through archivebot.
11:02:29noteness quits [Remote host closed the connection]
11:02:35<@EggplantN>done
11:02:48noteness (noteness) joins
11:03:37<Vista2003>https://hongkongfp.com/2021/06/23/breaking-last-edition-of-apple-daily-to-print-no-later-than-sat-as-board-forced-to-halt-all-hong-kong-operations/
11:03:44<@EggplantN>yep
11:03:47<Vista2003>What's the status of the Apple Daily backup?
11:03:51<@EggplantN>"enough"
11:04:17<Vista2003>And what does "enough" include?
11:04:57<@EggplantN>https://wiki.archiveteam.org/index.php/Apple_Daily
11:05:16<Vista2003>ah ok
11:05:19noteness quits [Remote host closed the connection]
11:05:38noteness (noteness) joins
11:07:59dfdsdsdasfgh joins
11:13:51<Vista2003>https://hk.appledaily.com/local/20210623/WSI6PSB2EFCO5JAUMLLZOP4RGM/ The website shutdown date is today at 23:59 HKT or 15:59UTC
11:17:28dfdsdsdasfgh quits [Remote host closed the connection]
11:18:07Andrewyyol17 joins
11:21:07FatPenguin62 joins
11:21:12FatPenguin62 quits [Remote host closed the connection]
11:26:16Andrewyyol17 quits [Remote host closed the connection]
11:32:55HackMii quits [Remote host closed the connection]
11:33:18HackMii (hacktheplanet) joins
11:34:23silsha joins
11:35:32checker18in joins
11:41:49silsha quits [Remote host closed the connection]
11:45:21<Jake>https://bunny.net/blog/the-stack-overflow-of-death-dns-collapse/
11:49:54archiveapple joins
11:50:27archiveapple quits [Remote host closed the connection]
11:57:00bbsky joins
11:59:58<achivarin>Have the outlinks on this page been saved? https://hk.appledaily.com/member/ Some of those are on different domains like nextdigital.com.hk
12:06:33jjleung joins
12:07:08jjleung leaves
12:10:14<@EggplantN>Yep Jake. That was why we had an outage yesterday
12:10:27<@EggplantN>trackerproxy relies on BunnyCDN
12:11:24@HCross ducks
12:11:26<@HCross>and runs away
12:22:11KRG joins
12:23:50KRG` quits [Read error: Connection reset by peer]
12:48:06<Jake>thought so! interesting postmortem.
12:52:54<@EggplantN>We used CloudFlare but they’re not up to what we need
13:05:42<@JAA>thuban: No idea if Cloudflare does it as well, but I believe Google analyses TLS at the bit level to detect different implementations. Since browsers use their own libraries, they behave ever so slightly differently than curl, wget, etc. with OpenSSL or GnuTLS.
13:06:54<@JAA>Just as another idea of what could be going on.
13:12:20yano quits [Remote host closed the connection]
13:12:51yanome quits [Quit: Ping timeout (120 seconds)]
13:13:01yano (yano) joins
13:13:02yanome (yano) joins
13:13:17noteness quits [Remote host closed the connection]
13:13:49noteness (noteness) joins
13:14:38WhatIsLove joins
13:14:57HackMii quits [Ping timeout: 258 seconds]
13:15:35WhatIsLove quits [Remote host closed the connection]
13:16:27mutantmonkey quits [Remote host closed the connection]
13:16:44HackMii (hacktheplanet) joins
13:17:45checker18in quits [Remote host closed the connection]
13:21:20<kpcyrd>h3ndr1k: there's more context here: https://lists.nongnu.org/archive/html/sks-devel/2021-06/msg00000.html
13:28:25mutantmonkey (mutantmonkey) joins
13:29:38Wing30 joins
13:29:59hyle joins
13:36:28<grawity>thuban: cloudflare does profile the TLS handshake, and might block you if yours is significantly different than what it expects from the User-Agent
13:38:27<grawity>thuban: e.g. I've discovered that if you're using python-requests, it deliberately disables "session tickets" for its TLS connections, and together with a fake User-Agent it might trip the block -- as in https://github.com/upbit/pixivpy/issues/171#issuecomment-860264788
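[Editor's note] The session-ticket point above can be illustrated with Python's standard `ssl` module. This is a minimal sketch, assuming only that the goal is to leave session tickets enabled on a client context; the helper names `browser_like_context` and `tickets_disabled` are hypothetical. It changes a single handshake parameter and is not claimed to make a client's TLS fingerprint match any browser, nor to get past Cloudflare.

```python
import ssl


def browser_like_context() -> ssl.SSLContext:
    """Build a client TLS context that leaves session tickets enabled.

    Hypothetical helper for illustration; it only touches OP_NO_TICKET.
    """
    ctx = ssl.create_default_context()
    # Clear OP_NO_TICKET in case a library default set it. A client that
    # never offers session tickets can look different from a browser.
    ctx.options &= ~ssl.OP_NO_TICKET
    return ctx


def tickets_disabled(ctx: ssl.SSLContext) -> bool:
    """Report whether session tickets are switched off on this context."""
    return bool(ctx.options & ssl.OP_NO_TICKET)
```

A context built this way could be handed to any HTTP client that accepts an `ssl.SSLContext`; whether that alone changes the server's verdict is untested here.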
13:40:55Wing30 quits [Remote host closed the connection]
13:41:24<hexa->whew
13:58:06Gemma joins
13:58:50<@JAA>OrIdow6: 16 hours remaining until GREE does whatever they'll be doing.
14:01:44Gemma leaves
14:06:00Wing joins
14:17:42jcjl joins
14:18:36jcjl quits [Remote host closed the connection]
14:43:44ieh joins
14:45:44hyle leaves
14:47:08<h3ndr1k>kpcyrd: Thanks, might read it later
15:11:57appleguy joins
15:12:21cc25 joins
15:12:47LeGoupil joins
15:12:47cc25 quits [Remote host closed the connection]
15:17:01appleguy quits [Remote host closed the connection]
15:20:37ieh quits [Ping timeout: 244 seconds]
15:26:45ieh joins
15:31:54Wing quits [Remote host closed the connection]
15:35:08Charlie joins
15:35:15Arcorann quits [Ping timeout: 258 seconds]
15:36:31Charlie quits [Remote host closed the connection]
15:42:17orly joins
15:49:50ieh quits [Remote host closed the connection]
16:05:56nuroten joins
16:06:47<nuroten>hi, is archiveteam already aware of the Hong Kong-based Apple Daily newspaper closing?
16:07:42<grawity>looks like it, based on them having a whole wiki page https://wiki.archiveteam.org/index.php/Apple_Daily
16:07:59<nuroten>grawity: great, thanks :)
16:09:30<nuroten>"June 21 2021, the newspaper announced that it is likely to shut down soon" - fwiw, it has been announced the last print edition is this thursday
16:10:06<Jake>yup, we haven't updated the wiki page yet.
16:10:16<nuroten>I don't know how long the online version will be up after that, given the asset freeze means they are having trouble paying their vendors
16:11:46<nuroten>thanks archiveteam!
16:13:01<orly>Apple Daily's youtube channel just 404'd.
16:15:53<nuroten>hope the account wasn't ... compromised
16:24:57spirit joins
16:26:22Daloader joins
16:27:57<rewby>I think we've already run them through some emergency archiving
16:35:04<nuroten>is it possible / okay to suggest a youtube channel or a website (mainly for text and images) for contingency archiving? I have 2 websites in mind that are safe for now, but after Apple Daily, they may eventually be targeted
16:37:46<rewby>You can suggest them
16:40:05<rewby>I think we're watching the last moments of appledaily's online presence: https://en.appledaily.com/
16:41:08n joins
16:41:14<nuroten>Youtube channel: D100 - they are a listener/donor-supported public radio in Cantonese, d100.net is the website. they frequently interview local political commentators, professionals, pro-democracy legislators (well, former now) and activists
16:41:18<rewby>Yep, hk. also just went down
16:41:35<nuroten>rip Apple Daily
16:42:57<nuroten>English-speaking digital news: Hong Kong Free Press https://hongkongfp.com
16:44:07Joesh joins
16:44:36<nuroten>Apple Daily was basically the last major pro-democracy news outlet ... online media would most likely be the next targets
16:46:25<nuroten>alongside HKFP, there's Stand News https://www.thestandnews.com/english/
16:47:25<rewby>You don't have to pick just english things
16:47:29<rewby>We archive pretty much anything
16:48:03<orly>If I may, a few suggestions: Stand News, another big target (website, youtube, and facebook: thestandnews.com); SocREC, loads of in-the-crowd livestreams with little commentary (multiple youtube channels, one per reporter, e.g. UCg1-HnZBBnpB82g6saKc5FQ); Polymer, one of the bigger publications of the more localist side of the spectrum (website and
16:48:03<orly>facebook: polymerhk.com); Hong Kong Free Press, formed from people leaving SCMP (website and facebook: hongkongfp.com)
16:49:51<@arkiver>nuroten: orly: list all you know!
16:50:10<orly>Do you want full URLs or would just names suffice?
16:50:23<rewby>Ah neat, I can get a sitemap out of hongkongfp. Generating a list of urls right now...
16:50:45<@arkiver>URLs to websites, easier than names. if the list is large, you can upload a txt file to transfer.archivete.am and post the URL here
16:50:52<@arkiver>thanks rewby
16:51:52<nuroten>okay :) though I also suggested the English ones because more people can read and understand the contents
16:54:02n quits [Remote host closed the connection]
16:54:42spirit quits [Client Quit]
16:56:44<nuroten>a number of former journalists/radio show hosts, pro-democracy legislators, etc. have youtube channels and patreon accounts. as orly pointed out, facebook is where a lot of it is (I don't do facebook myself, but maybe I can look up some youtube channels if that's of interest?)
17:01:59<nuroten>D100 youtube channel: https://www.youtube.com/user/D100HK and specifically this show: https://www.youtube.com/playlist?list=PLm30xDjDCFYWi-g5Nt6NaZW1wrTOKkSpr
17:04:43<rewby>thestandnews: https://transfer.archivete.am/4s2s9/thestandnews.tar.gz https://transfer.archivete.am/104ZOF/urls.txt https://transfer.archivete.am/4oSP5/valid_sitemaps.txt
17:04:59<rewby>hongkongfp: https://transfer.archivete.am/7TuOG/hongkongfp.tar.gz https://transfer.archivete.am/erAdf/urls.txt https://transfer.archivete.am/JUNLy/valid_sitemaps.txt
17:05:18<rewby>polymerhk: https://transfer.archivete.am/4GjEC/polymerhk.tar.gz https://transfer.archivete.am/FQu0u/urls.txt https://transfer.archivete.am/AzjyX/valid_sitemaps.txt
17:05:22<rewby>That's the urls I get out of sitemaps
17:05:28<rewby>I should really turn this into an IRC bot
17:05:43<nuroten>it's a current affairs show, the first 2/3 talks about the day's big news/topics and usually they interview 1-2 guests for commentary/industry insight depending on topic, the last 1/3 is a listener phone-in segment
17:06:23<nuroten>rewby: fantastic :)
17:07:03<rewby>I don't claim these are complete
17:07:10<rewby>I just claim that's what I get out of sitemaps
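[Editor's note] rewby's tooling is not shown in the log. As a rough sketch of the general technique (pulling URLs out of sitemaps), the following uses only the standard library and the sitemaps.org XML namespace. The function name `parse_sitemap` is hypothetical, and fetching, gzip handling, and robots.txt discovery are deliberately left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_bytes: bytes) -> tuple[list[str], list[str]]:
    """Return (page_urls, child_sitemap_urls) from one sitemap document.

    A <sitemapindex> lists further sitemaps to fetch; a <urlset> lists pages.
    """
    root = ET.fromstring(xml_bytes)
    locs = [el.text.strip() for el in root.iter(f"{SITEMAP_NS}loc") if el.text]
    if root.tag == f"{SITEMAP_NS}sitemapindex":
        return [], locs
    return locs, []
```

A crawler would call this repeatedly, queueing the child sitemaps it returns until only page URLs remain. As rewby notes, the result is only as complete as the sitemaps themselves.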
17:07:22gohill4652 joins
17:08:48m350n joins
17:09:20<rewby>These urls are a fun mess of unicode
17:09:28<rewby>I wonder if that's gonna break anything
17:12:47<@EggplantN>want these yeeting into urls?
17:12:57<rewby>Maybe just in case?
17:13:02<rewby>Can't hurt to have them archived
17:14:34<@arkiver>yeah lets put them in #//
17:22:17<nuroten>Wall-fare is a group formed to help incarcerated people and their families, they provide services and raise awareness of poor prison conditions. mentioning it as a historical thing as it was formed in response to the influx of pro-democracy people being incarcerated and handles letters sent to them by the public
17:22:18<nuroten>https://www.facebook.com/wallfarelimited
17:24:56<nuroten>I haven't checked, but their facebook page posts may have some perspective on prison conditions, what happened to the activists who were convicted and so on
17:28:32<orly>Right. I've got a long list. Including online press, online radio, and student press from various universities.
17:28:36<orly>https://transfer.archivete.am/J1GF3/Links%20for%20archive%20-%20Hong%20Kong%20press.txt
17:30:44<nuroten>Civil Human Rights Front - not sure how long their website will be up, it was disbanded recently after being investigated for potential NSL violation, the main convener himself has a few lawsuits ongoing https://www.civilhrfront.org/
17:32:23<nuroten>(this is an alliance of individuals and groups that organise the annual July 1 marches)
17:34:22<nuroten>sorry, not lawsuits, cases/charges rather. the CHRF organised peaceful protests to petition for democracy and all that
17:35:15<nuroten>orly: you're very organised :) ... I only have a bunch of names and things in my head
17:36:00<rewby>I'll go through the list to find some urls
17:38:06m350n leaves
17:42:10<rewby>More URLs!
17:42:21<rewby>myradio: https://transfer.archivete.am/YnptP/myradio.tar.gz https://transfer.archivete.am/7zuGc/urls.txt https://transfer.archivete.am/xkMF5/valid_sitemaps.txt
17:42:41<rewby>thehousenewsblogger: https://transfer.archivete.am/Dv4oz/thehousenewsbloggers.tar.gz https://transfer.archivete.am/KAnzf/urls.txt https://transfer.archivete.am/emBLc/valid_sitemaps.txt
17:42:56<rewby>undergrad: https://transfer.archivete.am/NQfQF/undergrad.tar.gz https://transfer.archivete.am/YQxLX/urls.txt https://transfer.archivete.am/Bdx8o/valid_sitemaps.txt
17:43:02<rewby>EggplantN: Can you yeet those into //?
17:43:50nuroten quits [Remote host closed the connection]
17:43:58Tansuke joins
17:45:00nuroten joins
17:46:31<nuroten>http://www.alliance.org.hk/ this org organises the annual June 4th vigil, leader was arrested shortly before June 4th this year and released on bail, no idea how long the org will be around
17:50:05<nuroten>https://64museum.blogspot.com/ the website of the museum they run. museum was forced to close (temporarily?) on account of not having a permit or something. one on wordpress and one on blogspot, may be safe but don't know if they will force a takedown like what happened with the hkcharter website and wix
17:53:22<nuroten>https://www.2021hkcharter.com/ the website in question, taken down "by mistake" but came back up https://hongkongfp.com/2021/06/04/hong-kong-democracy-site-pulled-by-mistake/
17:54:52<@EggplantN>rewby what are the tar.gz files?
17:55:04<nuroten>that's all for now, thanks! orly's list is a pretty good one to go on already
17:55:13<rewby>EggplantN: raw output from my tools. You only care about the urls.txt files
17:55:44<@EggplantN>ok plz dont name them all urls.txt in future
17:55:44<@EggplantN>lol
17:56:37<@EggplantN>ok added
17:56:51<@EggplantN>waiting for backfeed/the websocket to show me they went through
17:57:21<@EggplantN>done rewby orly nuroten :)
17:57:28<@EggplantN>apart from those last 3
17:57:48<nuroten>EggplantN: thanks muchly! :D
17:57:53<rewby>It's not every site you mentioned
17:57:54<@EggplantN>http://tracker.archiveteam.org/urls/
17:57:56<rewby>Just the ones with sitemaps
17:57:57<@EggplantN>go brrrrrrrrrrrrrr
17:57:59<@EggplantN>aight
18:06:34DogsRNice (Webuser299) joins
18:09:49<jodizzle>Did anyone put the articles from tw.appledaily.com through #//?
18:26:52dm4v quits [Client Quit]
18:27:56dm4v joins
18:27:58dm4v quits [Changing host]
18:27:58dm4v (dm4v) joins
18:29:44<thuban>JAA, grawity: thanks for the comments
18:30:39<nuroten>jodizzle, in progress according to the wiki page
18:31:25<@JAA>No, the AB crawl is in progress. Not aware of anyone having thrown it into #//.
18:32:19<thuban>seems like maintaining a bypass might be a full-time project. i don't need speed for this particular application, so i punted and used selenium/chromedriver (need to strip "Headless" from the user agent if you're running headless, but that's all)
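[Editor's note] The user-agent fix thuban mentions can be sketched as a plain string rewrite. Headless Chrome reports itself as `HeadlessChrome/<version>` in its user agent; the helper below (hypothetical name `strip_headless`) turns that back into `Chrome/<version>`. How thuban actually applied it is not stated; one option, assumed here, is passing the result to Chrome through its `--user-agent=` command-line switch.

```python
def strip_headless(user_agent: str) -> str:
    """Rewrite a headless Chrome user agent to look like regular Chrome."""
    # Only the product token differs; the rest of the string is untouched.
    return user_agent.replace("HeadlessChrome", "Chrome")


# Assumed usage with Selenium (not taken from the log):
#   options.add_argument(f"--user-agent={strip_headless(default_ua)}")
```

The function leaves non-headless user agents unchanged, so it is safe to apply unconditionally.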
18:32:27<nuroten>oh sorry, misread
18:32:44Mateon1 quits [Ping timeout: 258 seconds]
18:36:18<jodizzle>tw.appledaily.com doesn't appear to have sitemaps, annoyingly. Would have to get the articles from /archive/, maybe.
18:36:36<Frogging101>youtube is really kicking the goose lately
18:36:45<Frogging101>First the age gating, now this unlisted→private thing
18:36:51<Frogging101>bullshit
18:38:00<Ryz>RIP that particular website that keeps track of YouTube unlisted videos s:
18:39:59<thuban>nuroten and/or orly: i still have the downloader i wrote for RTHK podcasts; any i should be working on now?
18:40:16<@EggplantN>Google seems to be on a general cleanup right now
18:40:35Mateon1 joins
18:41:39<nuroten>thuban: is it specifically tailored for RTHK podcasts?
18:42:46<thuban>nuroten: yes
18:43:14<thuban>(i can of course write scrapers for other publishers' media if it's important, but this is what i happen to have on hand)
18:44:46<nuroten>Headliner, that show may or may not disappear soon. production's been axed and the producers on contract didn't get a renewal (read: fired) https://podcast.rthk.hk/podcast/item.php?pid=272&lang=zh-CN
18:45:22<nuroten>it's a parody current affairs show, but apparently the team does rigorous fact-checking
18:46:03<thuban>nuroten: do you have aurl for the rss feed? i've forgotten where it's hidden
18:46:07<thuban>*a url
18:46:15<nuroten>https://podcast.rthk.hk/podcast/headliner_i.xml
18:46:22<thuban>thanks!
18:47:53<nuroten>(it's in a menu after pressing the orange button under the show title/square icon)
18:50:45<nuroten>Hong Kong Letters (CN version) was another one someone requested in that spreadsheet from a while back https://podcast.rthk.hk/podcast/item.php?pid=42&lang=zh-CN https://podcast.rthk.hk/podcast/hkletter.xml
18:52:40<thuban>running Headliner now
18:55:30<nuroten>those are the two main ones aside from HK Connection, if you think it's valuable, there's also the news in sign language https://podcast.rthk.hk/podcast/tv_newsreview_i.xml
18:56:10<thuban>looks like video download is working fine out of the box, but i'll keep an eye on it in case older eps use yet another format
18:56:30dm4v quits [Client Quit]
18:57:31dm4v joins
18:57:33dm4v quits [Changing host]
18:57:33dm4v (dm4v) joins
18:58:22<thuban>the only potential issue is that i'm getting the title from the rss xml and the description from the episode page html, and there seems to be an encoding difference... fortunately i can just leave it running and re-grab the metadata later
18:58:59<nuroten>sounds good
19:03:22<nuroten>this mini-site might be worth backing up, it has video clips of history from 50s to present. it follows a different format and so on from the RTHK Podcasts section so may be better to download as a regular site https://app4.rthk.hk/special/rthkmemory/
19:04:51<thuban>fixed the encoding issue! it was my bad
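[Editor's note] thuban does not say what the encoding bug was, and the downloader itself is not shown. As a hedged sketch of the RSS side only, the following reads episode titles and media URLs from a feed, assuming a standard RSS 2.0 layout with `<item>`, `<title>`, and `<enclosure url="...">`. The name `feed_episodes` is hypothetical. Parsing the raw bytes, rather than a string decoded by hand, lets the XML parser honour the encoding declared in the feed, which avoids one common source of mangled non-ASCII titles.

```python
import xml.etree.ElementTree as ET


def feed_episodes(rss_bytes: bytes) -> list[tuple[str, str]]:
    """Return (title, media_url) pairs for each <item> in an RSS 2.0 feed."""
    # Passing bytes lets the parser use the encoding from the XML declaration.
    root = ET.fromstring(rss_bytes)
    episodes = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        enclosure = item.find("enclosure")
        url = enclosure.get("url", "") if enclosure is not None else ""
        episodes.append((title, url))
    return episodes
```

Items without an enclosure come back with an empty URL, so a caller can skip them or fall back to scraping the episode page.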
19:05:44<thuban>looks like that site is pretty js-heavy, so would not work well in ab. i can take a look at it later though
19:11:54Daloader quits [Ping timeout: 250 seconds]
19:12:28<nuroten>yeah, whatever you can pull is fine, specifically the clips in the Major Events category. there are some other cultural things that might be nice to have but maybe not the first thing I would grab personally
19:15:19<nuroten>https://app4.rthk.hk/special/rthkmemory/category/major-events and https://app4.rthk.hk/special/rthkmemory/category/innovation/programme/ (this part is about the beginnings of the HK Connection show: https://app4.rthk.hk/special/rthkmemory/programme/6 )
19:16:21<nuroten>thanks :)
19:20:29<nuroten>clips from the first 3 episodes of Headliner ever https://app4.rthk.hk/special/rthkmemory/programme/34
19:23:23<nuroten>for music fans, top 10 popular songs starting from the 70s and 80s https://app4.rthk.hk/special/rthkmemory/programme/33 (these last 3 links are from the innovation/programme category)
19:26:34<@OrIdow6>Would it be possible to have a very quick project for GREE set up in 20 minutes or so?
19:27:21<@HCross>GREE?
19:27:33<@OrIdow6>Japanese social network
19:27:47<@OrIdow6>Among other things
19:27:57<@OrIdow6>See deathwatch, date was apparently moved up
19:28:09<@OrIdow6>So less time than I thought
19:35:04<AK>https://twitter.com/textfiles/status/1407782416039690241
19:35:18<AK>Time to archive the madman
19:35:45<thuban>why, what happened now
19:37:35<AK>"Spanish media reporting that John McAfee comitted suicide in a spanish jail cell after he was cleared to be extradited to the U.S."
19:37:35<AK>That
19:37:36<AK>Umm
19:37:49<AK>Didn't expect it to be that honestly
19:37:54<AK>I was expecting bitcoin and cocaine
19:38:32<AK>Seems confirmed by US justice department https://twitter.com/InvestorsLive/status/1407780136188002304
19:40:37<@OrIdow6>HCross: Am I right in saying that arkiver is needed to do backfeed? It's not essential here since it seems that most/all of the publicly-accessible pages are in robots and the list page anyhow, but nice to have
19:40:55<@HCross>yes
19:41:28<@OrIdow6>Ok
19:42:58<@arkiver>OrIdow6: are you already working on this?
19:43:06<@OrIdow6>arkiver: Yes, mostly done
19:43:07<@arkiver>else I will try to setup a project for that now
19:43:10<@OrIdow6>Since it's fairly simple
19:43:14<@arkiver>alright ping me when it's somewhere
19:43:18<@OrIdow6>Ok
19:46:55<@arkiver>OrIdow6: from what i see, all posts are under a username
19:47:02<nuroten>thuban: there's a series called Hong Kong Stories (CN: 香港故事) with a different theme each season. it's about everyday people, some of them artisans, farmers, small business owners, etc. there are 10+ of them if you search the CN title, but here's one about food (the subtitle translates roughly to "thinking of the taste of home")
19:47:02<nuroten>https://podcast.rthk.hk/podcast/item.php?pid=1635&lang=en-US https://podcast.rthk.hk/podcast/tv_hkstories44_i.xml
19:47:06<@arkiver>so i guess just discovery of users while crawling is needed
19:47:25<@arkiver>i see some accounts are behind a login
19:47:56<@OrIdow6>arkiver: Most users seem to be private
19:47:57dm4v quits [Read error: Connection reset by peer]
19:48:06<@OrIdow6>We have 2 lists of what may or may not be all public users
19:48:17dm4v joins
19:48:18dm4v quits [Changing host]
19:48:18dm4v (dm4v) joins
19:48:29<@arkiver>AK: damn, didnt expect that either
19:48:37<@arkiver>OrIdow6: perfect
19:48:49<@arkiver>will get it setup and started as soon as you have it ready
19:50:33<@Kaz>john mcafee.
19:50:39<AK>Yep
19:51:33<KRG>extradition seemed to be too much for him
19:51:44<@Kaz>understandable
19:52:34<@HCross>let me know when
19:52:35<@HCross>will go hard
19:52:36<@HCross>and fast
19:52:45<@arkiver>HCross: will ping
19:53:03<@arkiver>OrIdow6: do you have the list of users somewhere?
19:53:20<@OrIdow6>arkiver: It was in the form of URLs
19:53:24<@OrIdow6>Let me find them
19:55:00<@OrIdow6>https://transfer.archivete.am/2Bw3b/gree_all.txt https://transfer.archivete.am/HYWQ4/gree.txt , URLs from sitemap and from scraping the user list page, respectively, neither done by me, still need to be parsed
19:55:14<@OrIdow6>If someone other than me wants to do it, format is user:username
19:55:21<@arkiver>yeah i'll parse them
19:55:29<@arkiver>thanks
19:55:33<@OrIdow6>Thank you
19:56:37<AK>Can someone with voice in ab do mcafees twitter?
19:59:33dm4v_ joins
19:59:33dm4v quits [Read error: Connection reset by peer]
19:59:45dm4v_ is now known as dm4v
19:59:47dm4v quits [Changing host]
19:59:47dm4v (dm4v) joins
20:01:04<AK>Do we archive articles about peoples death?
20:01:12dm4v quits [Read error: Connection reset by peer]
20:01:21<@arkiver>yes
20:01:38<@arkiver>or is this a 'how' question?
20:01:59<AK>Naah it was a do we, I think I've worked out what to do now
20:02:41<@arkiver>right, so policy question. answer is yes!
20:04:07dm4v joins
20:04:10dm4v quits [Changing host]
20:04:10dm4v (dm4v) joins
20:10:30<@arkiver>OrIdow6: if gree can handle high load (we'll know when HCross is on the project), i'll put the URLs in #// as well most likely
20:10:34<@arkiver>though warrior project first
20:10:45<@HCross>warrior first please
20:10:54<@HCross>I don't like running // unless I have to
20:11:02<@arkiver>like i said :)
20:11:04<@OrIdow6>arkiver: Feeling is that it'll be rickety
20:11:19<@OrIdow6>I think this was something that had its heyday about 13 years ago or so
20:11:21<@arkiver>well warrior first, so we'll see
20:11:35<@HCross>are we talking pentium 4 servers in a closet somewhere
20:11:36<@HCross>in Japan
20:11:41<@HCross>cc rewby
20:12:01<@OrIdow6>Alright arkiver https://github.com/OrIdow6/gree-grab
20:12:05<@arkiver>thanks OrIdow6
20:12:07<AK>If you want to put stuff in #// I can spin up some stuff on that
20:12:14<@arkiver>we can do a channel, not sure if it's needed
20:13:03<@OrIdow6>A good test item is user:kakei_toshio
20:13:31<@arkiver>OrIdow6: is it just me or is there a lot of wikidot stuff in there
20:13:34<@arkiver>will filter that out now
20:13:40<@arkiver>should be running in a few
20:13:58<@HCross>arkiver: let me know when code is ready
20:14:00<@OrIdow6>Yeah, looks like I did leave a bit in
20:14:02<@HCross>and I'll get underway
20:14:20<@arkiver>OrIdow6: no worries, checking it now
20:14:30<@arkiver>HCross: yeah, rewby EggplantN for target
20:14:36Joesh quits [Ping timeout: 244 seconds]
20:14:37<@arkiver>archiveteam_gree_
20:14:40<@arkiver>gree_
20:14:51<@arkiver>Archive Team GREE:
20:15:11<Jake>do we have a channel for GREE or are we just sticking to -bs?
20:15:27<@arkiver>may be good yeah
20:15:29<@arkiver>ideas?
20:15:33<rewby>We need targets?
20:15:39<AK>#greedy
20:15:42<@arkiver>nice
20:26:34LeGoupil quits [Client Quit]
20:28:20Vista2003 leaves
20:30:55HP_Archivist (HP_Archivist) joins
20:48:20nuroten quits [Remote host closed the connection]
20:51:40<@EggplantN>Rewby or deploy FMT
20:51:42<@EggplantN>I’m afk
20:52:07<rewby>EggplantN: nvme is overloaded and I've not got SSH on any of your boxes
20:52:19<rewby>We've deployed two CPX31s
20:52:22<rewby>Hopefully enough
20:58:32<SketchTheCow>Hey, people
20:58:32<SketchTheCow>I'm doing this game event thing today, then I turn back to general high focus.
20:58:32<SketchTheCow>Arkiver's getting most of the IA-Archiveteam integration/work done these days, but I'm around.
21:05:42nuroten joins
21:07:09fuzzy8021 quits [Read error: Connection reset by peer]
21:07:31fuzzy8021 (fuzzy8021) joins
21:07:42@EggplantN quits [Client Quit]
21:07:45ave quits [Quit: Ping timeout (120 seconds)]
21:07:45lun4 quits [Quit: Ping timeout (120 seconds)]
21:08:04ave (ave) joins
21:08:04lun4 (lun4) joins
21:08:22KRG quits [Changing host]
21:08:22KRG (KRG) joins
21:08:41EggplantN joins
21:09:54aaa joins
21:10:42<aaa>is this channel for back up apple daily?
21:11:01<rewby>Apple Daily is mostly down already. We grabbed what we could.
21:12:16<aaa>I think there are still some links that are up
21:12:23<aaa>That can still be backed up
21:12:40<@JAA>Any examples?
21:12:50<rewby>If you list them we'll do our best to get 'em
21:14:40<aaa>1 min
21:16:11<aaa>were you guys able to download the vids in this txt file here?
21:16:11<aaa>Videos (M3U8 and TS segments) from article pages extracted by User:Jodizzle in several parts: Part 1 Saved! with ArchiveBot on 2021-06-22: https://transfer.archivete.am/15b9yl/hk.appledaily.com-m3u8s-expanded.1.txt and job:atm5u7fjmgegw508c90ty32wi Part 2 Saved! with ArchiveBot on 2021-06-22/23:
21:16:12<aaa>https://transfer.archivete.am/RZHFJ/hk.appledaily.com-m3u8s-expanded.2.txt and job:183qpki4h8e40cswj2035wqmf Part 3 Saved! with ArchiveBot on 2021-06-23: https://transfer.archivete.am/OIkBX/hk.appledaily.com-m3u8s-expanded.3.txt and job:5ue8wjnyg1gbg1g7x420b5gpg Part 4 Saved! with ArchiveBot on 2021-06-23:
21:16:12<aaa>https://transfer.archivete.am/CKU9J/hk.appledaily.com-m3u8s-expanded.4.txt and job:91rs9mykjwyxmj5vekc8ol1qf Part 5 In progress... with ArchiveBot on 2021-06-23: https://transfer.archivete.am/11wYuN/hk.appledaily.com-m3u8s-expanded.5.txt and job:cxp0gi0dive8hio7t156y9o3r More parts Upcoming...
21:16:27<aaa>thanks a lot for your support btw
21:16:35<@JAA>I mean, it says 'Saved!' and 'In progress...' there.
21:17:24<aaa>Yeah I was confused if it meant the links are saved or the actual vid .ts files are saved.
21:17:39<@JAA>The videos are.
21:17:42<aaa>oh wonderful
21:17:49<@JAA>Part 5 actually finished by now as far as I can see.
21:22:15<nuroten>orly: FactWire https://www.factwire.org/?lang=en and HKPORI https://www.pori.hk/?lang=en will probably be safe for longer, but who knows ... in case you want to add it to your list
21:28:12<nuroten>612 Humanitarian Fund offers legal advice, financial assistance, etc. for people charged from the 2019 protests https://612fund.hk/en/home https://www.facebook.com/612Fund/reviews/
21:33:11<nuroten>after what happened with CHRF some of these orgs on the front lines of providing assistance, organising vigils, etc. might be at risk
21:33:48<AK>Added the website now, gonna look at facebook too
21:34:45<nuroten>AK: fantastic, thanks
22:00:23<aaa>This apple daily site is still up: https://www.nextdigital.com.hk/
22:00:54<aaa>JAA rewby ^
22:01:20<rewby>Didn't we throw that into archivebot?
22:02:16<rewby>Ah it appears we have not done so
22:02:21<rewby>But I don't have the perms to do it
22:02:30<rewby>So leaving that for JAA / other AB operators
22:03:11<rewby>Oh wait no we did put it in
22:03:14<rewby>and it's finished
22:03:28<rewby>Helps if I don't typo into /grep
22:07:52<@JAA>Yup, added to the wiki page.
22:08:43<aaa>This one too, I think, still works: https://appledaily-hk-appledaily-prod.cdn.arcpublishing.com/
22:12:41<@JAA>Hmm
22:13:26Tansuke quits [Ping timeout: 244 seconds]
22:13:54HP_Archivist quits [Ping timeout: 250 seconds]
22:15:53HP_Archivist (HP_Archivist) joins
22:16:50<AK>nuroten, website is done :)
22:22:39EggplantN quits [Changing host]
22:22:39EggplantN (EggplantN) joins
22:22:39@ChanServ sets mode: +o EggplantN
22:23:53HP_Archivist quits [Ping timeout: 258 seconds]
22:25:58DogsRNice quits [Read error: Connection reset by peer]
22:27:10DogsRNice (Webuser299) joins
22:30:41<nuroten>AK: whoa, thanks :D
22:41:19<@JAA>jodizzle: I should soon have a complete list of all videos on hk.
22:41:46<@JAA>aaa: Nice find, thanks. That's the complete site it seems. :-)
22:45:40<aaa>JAA any help you guys need to archive that site? happy to help as much as I can
22:47:16<@arkiver>if someone here knows about physics material in Hong Kong being removed, please ping me!
22:47:28<@arkiver>especially if you are able to get access to the material that will be discarded
22:47:32<@JAA>I'm crawling through the /archive section right now to collect all articles and videos. Apparently nobody did that before, or at least I haven't seen anyone mention it.
22:47:35<@arkiver>of course, safety first
22:47:39<@JAA>That's on HK Apple Daily.
22:48:01<aaa>arkiver what physics material are you referring to?
22:48:12<@arkiver>ugh
22:48:14<@arkiver>physical
22:48:30<@arkiver>newspapers, books, DVDs, whatever
22:48:38<@arkiver>but it's probably too late to get stuff like that out of the country
22:49:09<@arkiver>but if you know of that, please let me know
22:49:11<aaa>Yeah good point, haven't heard anything about that happening yet, but cannot rule anything out of course with the current situation
22:49:47<@arkiver>i can imagine that with various institutions closing, there would be archives and small (company) libraries closing as well
22:50:05<aaa>Yeah it's definitely in the realm of (high) possibilities
22:50:29<aaa>Is there any hope to download YT videos that are still online, but private?
22:50:34<@arkiver>ping me if you find out anything!
22:51:01<@arkiver>also nuroten orly on physical material ^
22:53:34<nuroten>I don't have access, maybe people on LIHKG would know or have something https://lihkg.com
22:54:25<nuroten>I heard a lot of people will be grabbing the last print edition of Apple Daily first thing on thursday, 1 M copies to be printed
22:55:30<aaa>JAA you're crawling through /archive section of the arcpublishing site?
22:55:40<@JAA>aaa: Yes
22:55:44<aaa>thank you!
22:55:51<@JAA>Half-way done or so.
22:56:01<aaa>Do you guys have a guide on how you do these archives? for future reference haha
22:56:02<@arkiver>nuroten: i'm hoping we may still be able to take archives out of hong kong
22:56:16<@JAA>I'm using qwarc for this, which is completely undocumented.
22:56:39<@arkiver>nuroten: oof, yeah lihkg is a bit hard to read
22:56:56<@arkiver>what is lihkg?
22:57:08<aaa>arkiver popular hk forum, think hks version of reddit
22:58:06<jodizzle>JAA: Great! I'm probably not iterating the articles at all fast enough on my setup.
22:58:22<@JAA>jodizzle: The /archive/YYYYMMDD/ pages have all the M3U8 URLs. :-)
22:58:36<@JAA>So no need to retrieve each article page.
22:59:10aaa quits [Remote host closed the connection]
22:59:12<@JAA>How many videos did you discover?
23:00:56<jodizzle>Ahh okay. I think I had noticed there being a 'digest' on tw.appledaily.com /archive/YYYYMMDD/ pages, but I didn't think to look in detail for videos, or on hk.appledaily.com. Nice!
23:02:49<jodizzle>It looks like my five lists have 27,438 m3u8s. But many of those are different qualities of the same video.
23:04:04gohill4652 quits [Ping timeout: 244 seconds]
23:05:01xit quits [Quit: Ping timeout (120 seconds)]
23:05:20xit joins
23:05:57<jodizzle>I'll leave my thing collecting until you confirm that you have everything.
23:12:14<@JAA>Well, I'm not sure I trust them that /archive/ lists everything. But I can get you a list of article IDs listed there to compare.
23:12:37<jodizzle>Sounds good
23:12:54wessel15129 joins
23:12:57wessel1512 quits [Read error: Connection reset by peer]
23:12:57wessel15129 is now known as wessel1512
23:13:56aaa joins
23:15:26orly quits [Ping timeout: 244 seconds]
23:15:35<aaa>There may be some things here that can still be downloaded: ml-welcome01.nxtdig.com.hk/stage/
23:16:20<aaa>Ex: ml-welcome01.nxtdig.com.hk/stage/f54332d768439dfbf720661800624b42e7244fb2
23:17:05<@arkiver>aaa: if you happen to see something interesting on lihkg, can you please let us know here? also regarding physical materials
23:17:06<@JAA>Surprisingly large. Just a WARC of the /archive/ pages from 2000 to today is 3.8 GB after compression.
23:17:36<aaa>arkiver sure
23:18:01<thuban>aaa: you mentioned lihkg; do you read chinese?
23:18:07<aaa>yh
23:19:09<nuroten>did someone download this playlist yet? https://www.youtube.com/playlist?list=PLiY6wtxjK6QPPh4cSFBgcjhDK3XTDAV2F
23:19:42<nuroten>it's a food show by NEXT TV apparently
23:19:53<aaa>nuroten that is not apple daily hk
23:20:06<aaa>that is a taiwan outlet, so no need to backup
23:20:20<thuban>aaa: they have some threads and documents of rthk stuff they're trying to save (since the deleting-old-material warning went up a little while back)
23:20:29<thuban>i wrote a high-quality scraper and i'd like to coordinate, but the lists are pretty hard to navigate through google translate. one sec, i'll grab the links
23:20:49<nuroten>aaa: oh okay, good to know, thanks ... someone on LIHKG suggested two other channels to back up, that was one of them
23:21:56<thuban>aaa: https://docs.google.com/spreadsheets/d/1JPyevWnxvoq_xzY4ptOgaTaTYSLE9oMva66kxm66K0k/edit ; https://docs.google.com/document/d/1I3yYU2CTjlDt39xOZaSf7Qjj87eArrqEDp4897SaaSw/edit
23:22:31<nuroten>(and yeah, it's not Apple Daily, I wasn't sure how long the tw-produced content will stay up, even with tw-based partner)
23:23:48<aaa>thuban those spreadsheets are for RTHK, some controversial RTHK content has already been deleted as of 1 mth ago
23:24:42<thuban>aaa: yeah, i got the entire english-language hong kong connection backlog at that time.
23:24:49<aaa>nice
23:25:44<thuban>i realize that it's not as time-sensitive as apple daily content now, but i still have the downloader and i'd like to get anything else that may be in danger in the future if i can
23:26:41<nuroten>aaa: thuban helped save some of the RTHK podcasts back in May when RTHK announced it was taking down content older than 1 year so their website and social media are aligned (whatever that meant)
23:27:28<nuroten>a few of us were worried at the time it will also affect their archives, not just youtube
23:27:29<aaa>yeah it was just a bullshit excuse lol nuroten
23:28:56<nuroten>yeah ... so now if they decide to quietly drop content, we will hopefully be more prepared
23:29:20<@JAA>Ewwwww. These /archive/ pages on hk.appledaily.com each contain a ~5 MB JSON object. And some of the keys are themselves JSON strings. Disgusting...
23:30:18<thuban>yo dawg, i heard you like...
23:30:39<@JAA>{ ..., "{\"feedOffset\":0,\"feedQuery\":\"taxonomy.primary_section._id:\\\"%2Fdaily%2Fentertainment\\\"+AND+type:story+AND+(editor_note:\\\"20180111\\\"+OR+display_date:[2018-01-11T16:00:00Z||-24h+TO+2018-01-11T16:00:00Z])\",\"feedSize\":100,\"sort\":\"location:asc\"}": { ... } }
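The structure JAA describes (an object whose keys are themselves JSON-encoded strings) can be unpacked by trying to parse each key a second time. This is only a minimal sketch under stated assumptions: the sample page below is made up, reusing the `feedOffset`, `feedSize` and `sort` fields visible in the message above, with a simplified `feedQuery` and a hypothetical `items` payload. The function name `decode_json_keys` is illustrative, and none of this is the method qwarc or JAA actually used.

```python
import json


def decode_json_keys(obj):
    """Return (decoded_key, value) pairs for keys that are JSON-encoded objects.

    Keys that are not valid JSON, or that decode to something other than
    an object (e.g. a bare number), are treated as ordinary keys and skipped.
    """
    decoded = []
    for key, value in obj.items():
        try:
            parsed = json.loads(key)
        except json.JSONDecodeError:
            continue  # ordinary string key
        if isinstance(parsed, dict):
            decoded.append((parsed, value))
    return decoded


# Hypothetical sample mimicking the shape quoted above: one plain key and
# one key that is itself a JSON string describing a feed query.
inner_key = json.dumps({
    "feedOffset": 0,
    "feedQuery": "type:story",  # simplified stand-in for the real query
    "feedSize": 100,
    "sort": "location:asc",
})
page = {"plainKey": 1, inner_key: {"items": []}}  # "items" is an assumption
raw = json.dumps(page)

for query, payload in decode_json_keys(json.loads(raw)):
    print(query["feedSize"], query["sort"], payload)
```

Serializing the inner key with `json.dumps` and then serializing the whole page again is what produces the escaped quotes seen in the quoted object; for `feedQuery`, which is itself a string containing quotes, the escaping doubles up into the `\\\"` runs.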
23:31:31aaaa joins
23:32:21aaa quits [Remote host closed the connection]
23:32:50aaaa quits [Remote host closed the connection]
23:35:07<Jake>I wonder if zstd would compress it nicely...
23:36:57aaaaa joins
23:37:11<@JAA>I'm sure I could train a dict that would absolutely shred it.
23:44:32<thuban>hm, looks like someone else has uploaded hkc (the entire run? not sure, but would bet) to ia as individual episodes
23:44:50<thuban>i was planning to upload everything as one item; should i still?
23:45:01HP_Archivist (HP_Archivist) joins
23:45:07<@arkiver>thuban: how many videos?
23:45:55<thuban>arkiver: have i? 297 (with thumbnails)
23:46:39<@arkiver>from appledaily?
23:47:04<nuroten>are the ones uploaded also the English version?
23:47:19<thuban>no, this is the rthk stuff we were doing last month
23:47:23<thuban>they appear to be
23:49:34<nuroten>I kind of like the idea of them as one item where it generates a playlist and can be viewed sequentially, is that also available if they're all individual inside a series collection?
23:50:38<thuban>no idea. they're not currently in a collection, though
23:50:57<nuroten>but not picky as long as there are copies ... same resolution?
23:51:05<thuban>looks like, yeah
23:51:40<nuroten>okay ... I remembered there was some weird thing with some of them having different res depending on whether they were from akamai or archive
23:52:47<nuroten>they make more sense to me as 1 item (or some way to group them as a set) but yeah, up to you, thanks for saving those :)
23:53:04<thuban>yeah. i did a convenience sample of one very new one and one very old one; both were the same size as mine. i think there was an intermediate phase between thos ebut don't recall the details (and it looks like we used very similar methods)
23:53:12<thuban>*those but
23:53:18<@arkiver>thuban: if it's 297 videos, then let's do individual items if it's not too much more work
23:53:28<@arkiver>let me know when they're up and I'll put them in some AT collection
23:53:53<@arkiver>make sure you have good/correct metadata
23:54:31<thuban>arkiver: someone else already upped them as individual episodes https://archive.org/details/@kwc114
23:54:59<thuban>metadata looks good but they're not in a collection--iirc only the uploader can put them in one; is that right?
23:55:20<@arkiver>or I can
23:55:31<thuban>oh, cool
23:55:31<@arkiver>or someone else at IA
23:56:42<thuban>i do note that they don't seem to have grabbed the original thumbnails. ia generates its own, so not a big deal from a usability perspective, but maybe i should upload them for preservation?
23:56:54<@arkiver>yes
23:57:00<@arkiver>but, in one item then
23:57:17<thuban>ok, will do
23:57:24<thuban>once i remember how the cli works lol