| 00:04:52 | | lennier1 quits [Client Quit] |
| 00:06:52 | | BlueMaxima joins |
| 00:07:31 | | lennier1 (lennier1) joins |
| 00:10:09 | | nerdguy1138 (nerdguy1138) joins |
| 00:10:38 | | nerdguy1138 quits [Client Quit] |
| 00:11:26 | | wizards_ is now known as wizards |
| 00:27:48 | <@arkiver> | JAA: i like the explanation with "classical" in it better :) |
| 00:28:11 | <@arkiver> | :P* |
| 00:38:37 | <@OrIdow6> | kpcyrd: What is sks? |
| 00:50:48 | <thuban> | ugh, i'm at the 'staring at packet dumps and recompiling curl' stage of reverse engineering and i'm not getting anywhere |
| 00:51:02 | <thuban> | cloudflare lets me through every time on two different browsers _and_ javascript's fetch/xhr but is batting a thousand 403ing command-line tools, and i don't know how, because every header is _the same_. it can't be one-time keys, because i can replay the same request in a browser, cache-free, after failing from curl, and it'll _still_ work. |
| 00:51:09 | <thuban> | what are they snooping? the handshake protocol? the http2 settings? the frame batching??? |
| 00:51:38 | <@OrIdow6> | thuban: What is this in the context of? |
| 00:53:15 | <thuban> | OrIdow6: a browser game i wanted to enumerate some asset urls from |
| 00:59:33 | <@OrIdow6> | thuban: Oh |
| 00:59:42 | <@OrIdow6> | Yeah, those sound like good ideas |
| 01:00:00 | <@OrIdow6> | Also timings |
| 01:00:42 | <thuban> | i thought about suggesting that but it sounded excessive even as a joke :< |
| 01:00:59 | <wizards> | is the tool making a request to robots.txt |
| 01:01:07 | <thuban> | haha no |
| 01:02:09 | <thuban> | i guess i might finally be forced into trying selenium. o this age of brass |
| 01:02:11 | <@OrIdow6> | Could try to narrow it down by trying some more of the few non-Chrome-based browsers still around |
| 01:02:23 | <Jake> | I'd give it a try if you want to DM me the game |
| 01:02:50 | <wizards> | would you be willing to share a link to the game publicly? |
| 01:02:56 | | KRG quits [Ping timeout: 250 seconds] |
| 01:04:59 | <thuban> | i'm gonna see if i can get some decrypted dumps from another browser first |
| 01:06:30 | | KRG joins |
| 01:06:30 | | KRG is now authenticated as KRG |
| 01:17:43 | <thuban> | yes, i can; no, no obvious answers. i think i'll think about this some more |
| 01:26:26 | | Larsenv (Larsenv) joins |
| 02:35:57 | <pcr> | thuban: doesn't look like it's been updated recently but this could be a little help https://github.com/Anorov/cloudflare-scrape |
| 02:54:55 | | Petchea joins |
| 03:26:59 | | DogsRNice quits [Read error: Connection reset by peer] |
| 03:50:09 | | qw3rty_ joins |
| 03:53:45 | | qw3rty__ quits [Ping timeout: 258 seconds] |
| 04:29:15 | <wizards> | pcr: that's not maintained anymore https://github.com/Anorov/cloudflare-scrape/issues/406 |
| 04:34:37 | <pcr> | Yeah, it'll need an update to work, but it's a starting point |
| 04:49:33 | | sec^nd quits [Remote host closed the connection] |
| 05:02:10 | | sec^nd (second) joins |
| 05:27:22 | | HP_Archivist (HP_Archivist) joins |
| 05:46:55 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 06:51:19 | | Arcorann (Arcorann) joins |
| 06:58:17 | | KRG` joins |
| 06:59:34 | | KRG quits [Ping timeout: 250 seconds] |
| 07:17:09 | | G4te_Keep3r quits [Quit: Ping timeout (120 seconds)] |
| 07:17:28 | | G4te_Keep3r joins |
| 07:53:46 | | Joat joins |
| 07:57:10 | | HP_Archivist quits [Ping timeout: 258 seconds] |
| 08:09:52 | | Joat quits [Remote host closed the connection] |
| 08:21:30 | <masterX244> | Good that china can't mess with AT pulling a backup of stuff they don't like.... |
| 08:40:24 | | dfdsdsdasfgh joins |
| 09:28:43 | | girst_ is now known as girst |
| 09:30:13 | | Vista2003 joins |
| 09:30:19 | <Vista2003> | https://www.nextdigital.com.hk/investor/download/Press%20Release%20(Sat%20Cease).pdf.cd8933f1b8326db4f3a382bb95b07c0a |
| 09:30:54 | <Vista2003> | RIP Apple Daily 1995 - 2021 |
| 09:36:00 | | dfdsdsdasfgh quits [Ping timeout: 244 seconds] |
| 09:44:48 | <Vista2003> | https://www.bbc.co.uk/news/world-asia-china-57578926 |
| 09:45:23 | <Vista2003> | Upcoming deadlines: |
| 09:45:34 | <Vista2003> | Tomorrow - end of updates from Apple Daily |
| 09:46:04 | <Vista2003> | "No later" than the 26th - end of Apple Daily's site |
| 09:51:20 | <kpcyrd> | Jake, OrIdow6: https://sks-keyservers.net/ |
| 09:59:59 | | Petchea leaves |
| 10:02:21 | | gf joins |
| 10:08:46 | | Vista2003 quits [Remote host closed the connection] |
| 10:13:37 | | gf quits [Remote host closed the connection] |
| 10:23:44 | | Vista2003 joins |
| 10:30:35 | <h3ndr1k> | I don't think we can do anything about sks-keyservers. To my knowledge you cannot list all keys from a keyserver, and it is unlikely they will provide a dump, as it seems they shut down because of too many GDPR requests. |
| 10:35:18 | <grawity> | hmm, actually, I was fairly sure a few operators *do* provide dumps as a way to bootstrap a new server |
| 10:36:32 | <grawity> | but I'm not sure if it's really at risk -- it's not the keyservers that were shut down, and most of them aren't run by the pool's operator |
| 10:36:58 | <grawity> | but for example (outdated): https://gist.github.com/mattrude/84aa65d1bb2bbf9bd81370ec6cdbf91a#download-the-needed-database-files |
| 10:38:02 | <grawity> | a more recent one https://pgp.key-server.io/sks-dump |
| 10:39:52 | <grawity> | as well as https://keys.niif.hu/keydump/ |
| 10:42:20 | | Webuser164 joins |
| 10:42:44 | | Webuser164 leaves |
| 10:59:45 | <h3ndr1k> | huh, interesting. So just the pool operator shut down? I could not find much information about the thing. The website's ssl certificate expired in april or so and there is only a notice that some pool dns records were removed. |
| 11:00:44 | <h3ndr1k> | maybe someone can just run these urls and the website through archivebot. |
| 11:02:29 | | noteness quits [Remote host closed the connection] |
| 11:02:35 | <@EggplantN> | done |
| 11:02:48 | | noteness (noteness) joins |
| 11:03:37 | <Vista2003> | https://hongkongfp.com/2021/06/23/breaking-last-edition-of-apple-daily-to-print-no-later-than-sat-as-board-forced-to-halt-all-hong-kong-operations/ |
| 11:03:44 | <@EggplantN> | yep |
| 11:03:47 | <Vista2003> | What's the status of the Apple Daily backup? |
| 11:03:51 | <@EggplantN> | "enough" |
| 11:04:17 | <Vista2003> | And what does "enough" include? |
| 11:04:57 | <@EggplantN> | https://wiki.archiveteam.org/index.php/Apple_Daily |
| 11:05:16 | <Vista2003> | ah ok |
| 11:05:19 | | noteness quits [Remote host closed the connection] |
| 11:05:38 | | noteness (noteness) joins |
| 11:07:59 | | dfdsdsdasfgh joins |
| 11:13:51 | <Vista2003> | https://hk.appledaily.com/local/20210623/WSI6PSB2EFCO5JAUMLLZOP4RGM/ The website shutdown date is today at 23:59 HKT or 15:59 UTC |
| 11:17:28 | | dfdsdsdasfgh quits [Remote host closed the connection] |
| 11:18:07 | | Andrewyyol17 joins |
| 11:21:07 | | FatPenguin62 joins |
| 11:21:12 | | FatPenguin62 quits [Remote host closed the connection] |
| 11:26:16 | | Andrewyyol17 quits [Remote host closed the connection] |
| 11:32:55 | | HackMii quits [Remote host closed the connection] |
| 11:33:18 | | HackMii (hacktheplanet) joins |
| 11:34:23 | | silsha joins |
| 11:35:32 | | checker18in joins |
| 11:41:49 | | silsha quits [Remote host closed the connection] |
| 11:45:21 | <Jake> | https://bunny.net/blog/the-stack-overflow-of-death-dns-collapse/ |
| 11:49:54 | | archiveapple joins |
| 11:50:27 | | archiveapple quits [Remote host closed the connection] |
| 11:57:00 | | bbsky joins |
| 11:59:58 | <achivarin> | Have the outlinks on this page been saved? https://hk.appledaily.com/member/ Some of those are on different domains like nextdigital.com.hk |
| 12:06:33 | | jjleung joins |
| 12:07:08 | | jjleung leaves |
| 12:10:14 | <@EggplantN> | Yep Jake. That was why we had an outage yesterday |
| 12:10:27 | <@EggplantN> | trackerproxy relies on BunnyCDN |
| 12:11:24 | | @HCross ducks |
| 12:11:26 | <@HCross> | and runs away |
| 12:22:11 | | KRG joins |
| 12:22:11 | | KRG is now authenticated as KRG |
| 12:23:50 | | KRG` quits [Read error: Connection reset by peer] |
| 12:48:06 | <Jake> | thought so! interesting postmortem. |
| 12:52:54 | <@EggplantN> | We used CloudFlare but they’re not up to what we need |
| 13:05:42 | <@JAA> | thuban: No idea if Cloudflare does it as well, but I believe Google analyses TLS at the bit level to detect different implementations. Since browsers use their own libraries, they behave ever so slightly differently than curl, wget, etc. with OpenSSL or GnuTLS. |
| 13:06:54 | <@JAA> | Just as another idea of what could be going on. |
| 13:12:20 | | yano quits [Remote host closed the connection] |
| 13:12:51 | | yanome quits [Quit: Ping timeout (120 seconds)] |
| 13:13:01 | | yano (yano) joins |
| 13:13:02 | | yanome (yano) joins |
| 13:13:17 | | noteness quits [Remote host closed the connection] |
| 13:13:49 | | noteness (noteness) joins |
| 13:14:38 | | WhatIsLove joins |
| 13:14:57 | | HackMii quits [Ping timeout: 258 seconds] |
| 13:15:35 | | WhatIsLove quits [Remote host closed the connection] |
| 13:16:27 | | mutantmonkey quits [Remote host closed the connection] |
| 13:16:44 | | HackMii (hacktheplanet) joins |
| 13:17:45 | | checker18in quits [Remote host closed the connection] |
| 13:21:20 | <kpcyrd> | h3ndr1k: there's more context here: https://lists.nongnu.org/archive/html/sks-devel/2021-06/msg00000.html |
| 13:28:25 | | mutantmonkey (mutantmonkey) joins |
| 13:29:38 | | Wing30 joins |
| 13:29:59 | | hyle joins |
| 13:36:28 | <grawity> | thuban: cloudflare does profile the TLS handshake, and might block you if yours is significantly different than what it expects from the User-Agent |
| 13:38:27 | <grawity> | thuban: e.g. I've discovered that if you're using python-requests, it deliberately disables "session tickets" for its TLS connections, and together with a fake User-Agent it might trip the block -- as in https://github.com/upbit/pixivpy/issues/171#issuecomment-860264788 |
| 13:40:55 | | Wing30 quits [Remote host closed the connection] |
| 13:41:24 | <hexa-> | whew |
| 13:58:06 | | Gemma joins |
| 13:58:50 | <@JAA> | OrIdow6: 16 hours remaining until GREE does whatever they'll be doing. |
| 14:01:44 | | Gemma leaves |
| 14:06:00 | | Wing joins |
| 14:17:42 | | jcjl joins |
| 14:18:36 | | jcjl quits [Remote host closed the connection] |
| 14:43:44 | | ieh joins |
| 14:45:44 | | hyle leaves |
| 14:47:08 | <h3ndr1k> | kpcyrd: Thanks, might read it later |
| 15:11:57 | | appleguy joins |
| 15:12:21 | | cc25 joins |
| 15:12:47 | | LeGoupil joins |
| 15:12:47 | | cc25 quits [Remote host closed the connection] |
| 15:17:01 | | appleguy quits [Remote host closed the connection] |
| 15:20:37 | | ieh quits [Ping timeout: 244 seconds] |
| 15:26:45 | | ieh joins |
| 15:31:54 | | Wing quits [Remote host closed the connection] |
| 15:35:08 | | Charlie joins |
| 15:35:15 | | Arcorann quits [Ping timeout: 258 seconds] |
| 15:36:31 | | Charlie quits [Remote host closed the connection] |
| 15:42:17 | | orly joins |
| 15:49:50 | | ieh quits [Remote host closed the connection] |
| 16:05:56 | | nuroten joins |
| 16:06:47 | <nuroten> | hi, is archiveteam already aware of the Hong Kong-based Apple Daily newspaper closing? |
| 16:07:42 | <grawity> | looks like it, based on them having a whole wiki page https://wiki.archiveteam.org/index.php/Apple_Daily |
| 16:07:59 | <nuroten> | grawity: great, thanks :) |
| 16:09:30 | <nuroten> | "June 21 2021, the newspaper announced that it is likely to shut down soon" - fwiw, it has been announced the last print edition is this thursday |
| 16:10:06 | <Jake> | yup, we haven't updated the wiki page yet. |
| 16:10:16 | <nuroten> | I don't know how long the online version will be up after that, given the asset freeze means they are having trouble paying their vendors |
| 16:11:46 | <nuroten> | thanks archiveteam! |
| 16:13:01 | <orly> | Apple Daily's youtube channel just 404'd. |
| 16:15:53 | <nuroten> | hope the account wasn't ... compromised |
| 16:24:57 | | spirit joins |
| 16:26:22 | | Daloader joins |
| 16:27:57 | <rewby> | I think we've already run them through some emergency archiving |
| 16:35:04 | <nuroten> | is it possible / okay to suggest a youtube channel or a website (mainly for text and images) for contingency archiving? I have 2 websites in mind that are safe for now, but after Apple Daily, they may eventually be targeted |
| 16:37:46 | <rewby> | You can suggest them |
| 16:40:05 | <rewby> | I think we're watching the last moments of appledaily's online presence: https://en.appledaily.com/ |
| 16:41:08 | | n joins |
| 16:41:14 | <nuroten> | Youtube channel: D100 - they are a listener/donor-supported public radio in Cantonese, d100.net is the website. they frequently interview local political commentators, professionals, pro-democracy legislators (well, former now) and activists |
| 16:41:18 | <rewby> | Yep, hk. also just went down |
| 16:41:35 | <nuroten> | rip Apple Daily |
| 16:42:57 | <nuroten> | English-speaking digital news: Hong Kong Free Press https://hongkongfp.com |
| 16:44:07 | | Joesh joins |
| 16:44:36 | <nuroten> | Apple Daily was basically the last major pro-democracy news outlet ... online media would most likely be the next targets |
| 16:46:25 | <nuroten> | alongside HKFP, there's Stand News https://www.thestandnews.com/english/ |
| 16:47:25 | <rewby> | You don't have to pick just english things |
| 16:47:29 | <rewby> | We archive pretty much anything |
| 16:48:03 | <orly> | If I may, a few suggestions: Stand News, another big target (website, youtube, and facebook: thestandnews.com); SocREC, loads of in-the-crowd livestreams with little commentary (multiple youtube channels, one per reporter, e.g. UCg1-HnZBBnpB82g6saKc5FQ); Polymer, one of the bigger publications of the more localist side of the spectrum (website and facebook: polymerhk.com); Hong Kong Free Press, formed from people leaving SCMP (website and facebook: hongkongfp.com) |
| 16:49:51 | <@arkiver> | nuroten: orly: list all you know! |
| 16:50:10 | <orly> | Do you want full URLs or would just names suffice? |
| 16:50:23 | <rewby> | Ah neat, I can get a sitemap out of hongkongfp. Generating a list of urls right now... |
| 16:50:45 | <@arkiver> | URLs to websites, easier than names. if the list is large, you can upload a txt file to transfer.archivete.am and post the URL here |
| 16:50:52 | <@arkiver> | thanks rewby |
| 16:51:52 | <nuroten> | okay :) though I also suggested the English ones because more people can read and understand the contents |
| 16:54:02 | | n quits [Remote host closed the connection] |
| 16:54:42 | | spirit quits [Client Quit] |
| 16:56:44 | <nuroten> | a number of former journalists/radio show hosts, pro-democracy legislators, etc. have youtube channels and patreon accounts. as orly pointed out, facebook is where a lot of it is (I don't do facebook myself, but maybe I can look up some youtube channels if that's of interest?) |
| 17:01:59 | <nuroten> | D100 youtube channel: https://www.youtube.com/user/D100HK and specifically this show: https://www.youtube.com/playlist?list=PLm30xDjDCFYWi-g5Nt6NaZW1wrTOKkSpr |
| 17:04:43 | <rewby> | thestandnews: https://transfer.archivete.am/4s2s9/thestandnews.tar.gz https://transfer.archivete.am/104ZOF/urls.txt https://transfer.archivete.am/4oSP5/valid_sitemaps.txt |
| 17:04:59 | <rewby> | hongkongfp: https://transfer.archivete.am/7TuOG/hongkongfp.tar.gz https://transfer.archivete.am/erAdf/urls.txt https://transfer.archivete.am/JUNLy/valid_sitemaps.txt |
| 17:05:18 | <rewby> | polymerhk: https://transfer.archivete.am/4GjEC/polymerhk.tar.gz https://transfer.archivete.am/FQu0u/urls.txt https://transfer.archivete.am/AzjyX/valid_sitemaps.txt |
| 17:05:22 | <rewby> | That's the urls I get out of sitemaps |
| 17:05:28 | <rewby> | I should really turn this into an IRC bot |
| 17:05:43 | <nuroten> | it's a current affairs show, the first 2/3 talks about the day's big news/topics and usually they interview 1-2 guests for commentary/industry insight depending on topic, the last 1/3 is a listener phone-in segment |
| 17:06:23 | <nuroten> | rewby: fantastic :) |
| 17:07:03 | <rewby> | I don't claim these are complete |
| 17:07:10 | <rewby> | I just claim that's what I get out of sitemaps |
| 17:07:22 | | gohill4652 joins |
| 17:08:48 | | m350n joins |
| 17:09:20 | <rewby> | These urls are a fun mess of unicode |
| 17:09:28 | <rewby> | I wonder if that's gonna break anything |
| 17:12:47 | <@EggplantN> | want these yeeting into urls? |
| 17:12:57 | <rewby> | Maybe just in case? |
| 17:13:02 | <rewby> | Can't hurt to have them archived |
| 17:14:34 | <@arkiver> | yeah lets put them in #// |
| 17:22:17 | <nuroten> | Wall-fare is a group formed to help incarcerated people and their families, they provide services and raise awareness of poor prison conditions. mentioning it as a historical thing as it was formed in response to the influx of pro-democracy people being incarcerated and handles letters sent to them by the public |
| 17:22:18 | <nuroten> | https://www.facebook.com/wallfarelimited |
| 17:24:56 | <nuroten> | I haven't checked, but their facebook page posts may have some perspective on prison conditions, what happened to the activists who were convicted and so on |
| 17:28:32 | <orly> | Right. I've got a long list. Including online press, online radio, and student press from various universities. |
| 17:28:36 | <orly> | https://transfer.archivete.am/J1GF3/Links%20for%20archive%20-%20Hong%20Kong%20press.txt |
| 17:30:44 | <nuroten> | Civil Human Rights Front - not sure how long their website will be up, it was disbanded recently after being investigated for potential NSL violation, the main convener himself has a few lawsuits ongoing https://www.civilhrfront.org/ |
| 17:32:23 | <nuroten> | (this is an alliance of individuals and groups that organise the annual July 1 marches) |
| 17:34:22 | <nuroten> | sorry, not lawsuits, cases/charges rather. the CHRF organised peaceful protests to petition for democracy and all that |
| 17:35:15 | <nuroten> | orly: you're very organised :) ... I only have a bunch of names and things in my head |
| 17:36:00 | <rewby> | I'll go through the list to find some urls |
| 17:38:06 | | m350n leaves |
| 17:42:10 | <rewby> | More URLs! |
| 17:42:21 | <rewby> | myradio: https://transfer.archivete.am/YnptP/myradio.tar.gz https://transfer.archivete.am/7zuGc/urls.txt https://transfer.archivete.am/xkMF5/valid_sitemaps.txt |
| 17:42:41 | <rewby> | thehousenewsblogger: https://transfer.archivete.am/Dv4oz/thehousenewsbloggers.tar.gz https://transfer.archivete.am/KAnzf/urls.txt https://transfer.archivete.am/emBLc/valid_sitemaps.txt |
| 17:42:56 | <rewby> | undergrad: https://transfer.archivete.am/NQfQF/undergrad.tar.gz https://transfer.archivete.am/YQxLX/urls.txt https://transfer.archivete.am/Bdx8o/valid_sitemaps.txt |
| 17:43:02 | <rewby> | EggplantN: Can you yeet those into //? |
| 17:43:50 | | nuroten quits [Remote host closed the connection] |
| 17:43:58 | | Tansuke joins |
| 17:45:00 | | nuroten joins |
| 17:46:31 | <nuroten> | http://www.alliance.org.hk/ this org organises the annual June 4th vigil, leader was arrested shortly before June 4th this year and released on bail, no idea how long the org will be around |
| 17:50:05 | <nuroten> | https://64museum.blogspot.com/ the website of the museum they run. museum was forced to close (temporarily?) on account of not having a permit or something. one on wordpress and one on blogspot, may be safe but don't know if they will force a takedown like what happened with the hkcharter website and wix |
| 17:53:22 | <nuroten> | https://www.2021hkcharter.com/ the website in question, taken down "by mistake" but came back up https://hongkongfp.com/2021/06/04/hong-kong-democracy-site-pulled-by-mistake/ |
| 17:54:52 | <@EggplantN> | rewby what are the tar.gz files |
| 17:55:04 | <nuroten> | that's all for now, thanks! orly's list is a pretty good one to go on already |
| 17:55:13 | <rewby> | EggplantN: raw output from my tools. You only care about the urls.txt files |
| 17:55:44 | <@EggplantN> | ok plz dont name them all urls.txt in future |
| 17:55:44 | <@EggplantN> | lol |
| 17:56:37 | <@EggplantN> | ok added |
| 17:56:51 | <@EggplantN> | waiting for backfeed/the websocket to show me they went through |
| 17:57:21 | <@EggplantN> | done rewby orly nuroten :) |
| 17:57:28 | <@EggplantN> | apart from those last 3 |
| 17:57:48 | <nuroten> | EggplantN: thanks muchly! :D |
| 17:57:53 | <rewby> | It's not every site you mentioned |
| 17:57:54 | <@EggplantN> | http://tracker.archiveteam.org/urls/ |
| 17:57:56 | <rewby> | Just the ones with sitemaps |
| 17:57:57 | <@EggplantN> | go brrrrrrrrrrrrrr |
| 17:57:59 | <@EggplantN> | aight |
| 18:06:34 | | DogsRNice (Webuser299) joins |
| 18:09:49 | <jodizzle> | Did anyone put the articles from tw.appledaily.com through #//? |
| 18:26:52 | | dm4v quits [Client Quit] |
| 18:27:56 | | dm4v joins |
| 18:27:58 | | dm4v is now authenticated as dm4v |
| 18:27:58 | | dm4v quits [Changing host] |
| 18:27:58 | | dm4v (dm4v) joins |
| 18:29:44 | <thuban> | JAA, grawity: thanks for the comments |
| 18:30:39 | <nuroten> | jodizzle, in progress according to the wiki page |
| 18:31:25 | <@JAA> | No, the AB crawl is in progress. Not aware of anyone having thrown it into #//. |
| 18:32:19 | <thuban> | seems like maintaining a bypass might be a full-time project. i don't need speed for this particular application, so i punted and used selenium/chromedriver (need to strip "Headless" from the user agent if you're running headless, but that's all) |
| 18:32:27 | <nuroten> | oh sorry, misread |
| 18:32:44 | | Mateon1 quits [Ping timeout: 258 seconds] |
| 18:36:18 | <jodizzle> | tw.appledaily.com doesn't appear to have sitemaps, annoyingly. Would have to get the articles from /archive/, maybe. |
| 18:36:36 | <Frogging101> | youtube is really kicking the goose lately |
| 18:36:45 | <Frogging101> | First the age gating, now this unlisted→private thing |
| 18:36:51 | <Frogging101> | bullshit |
| 18:38:00 | <Ryz> | RIP that particular website that keeps track of YouTube unlisted videos s: |
| 18:39:59 | <thuban> | nuroten and/or orly: i still have the downloader i wrote for RTHK podcasts; any i should be working on now? |
| 18:40:16 | <@EggplantN> | Google seems to be on a general cleanup right now |
| 18:40:35 | | Mateon1 joins |
| 18:41:39 | <nuroten> | thuban: is it specifically tailored for RTHK podcasts? |
| 18:42:46 | <thuban> | nuroten: yes |
| 18:43:14 | <thuban> | (i can of course write scrapers for other publishers' media if it's important, but this is what i happen to have on hand) |
| 18:44:46 | <nuroten> | Headliner, that show may or may not disappear soon. production's been axed and the producers on contract didn't get a renewal (read: fired) https://podcast.rthk.hk/podcast/item.php?pid=272&lang=zh-CN |
| 18:45:22 | <nuroten> | it's a parody current affairs show, but apparently the team does rigorous fact-checking |
| 18:46:03 | <thuban> | nuroten: do you have aurl for the rss feed? i've forgotten where it's hidden |
| 18:46:07 | <thuban> | *a url |
| 18:46:15 | <nuroten> | https://podcast.rthk.hk/podcast/headliner_i.xml |
| 18:46:22 | <thuban> | thanks! |
| 18:47:53 | <nuroten> | (it's in a menu after pressing the orange button under the show title/square icon) |
| 18:50:45 | <nuroten> | Hong Kong Letters (CN version) was another one someone requested in that spreadsheet from a while back https://podcast.rthk.hk/podcast/item.php?pid=42&lang=zh-CN https://podcast.rthk.hk/podcast/hkletter.xml |
| 18:52:40 | <thuban> | running Headliner now |
| 18:55:30 | <nuroten> | those are the two main ones aside from HK Connection, if you think it's valuable, there's also the news in sign language https://podcast.rthk.hk/podcast/tv_newsreview_i.xml |
| 18:56:10 | <thuban> | looks like video download is working fine out of the box, but i'll keep an eye on it in case older eps use yet another format |
| 18:56:30 | | dm4v quits [Client Quit] |
| 18:57:31 | | dm4v joins |
| 18:57:33 | | dm4v is now authenticated as dm4v |
| 18:57:33 | | dm4v quits [Changing host] |
| 18:57:33 | | dm4v (dm4v) joins |
| 18:58:22 | <thuban> | the only potential issue is that i'm getting the title from the rss xml and the description from the episode page html, and there seems to be an encoding difference... fortunately i can just leave it running and re-grab the metadata later |
| 18:58:59 | <nuroten> | sounds good |
| 19:03:22 | <nuroten> | this mini-site might be worth backing up, it has video clips of history from 50s to present. it follows a different format and so on from the RTHK Podcasts section so may be better to download as a regular site https://app4.rthk.hk/special/rthkmemory/ |
| 19:04:51 | <thuban> | fixed the encoding issue! it was my bad |
| 19:05:44 | <thuban> | looks like that site is pretty js-heavy, so would not work well in ab. i can take a look at it later though |
| 19:11:54 | | Daloader quits [Ping timeout: 250 seconds] |
| 19:12:28 | <nuroten> | yeah, whatever you can pull is fine, specifically the clips in the Major Events category. there are some other cultural things that might be nice to have but maybe not the first thing I would grab personally |
| 19:15:19 | <nuroten> | https://app4.rthk.hk/special/rthkmemory/category/major-events and https://app4.rthk.hk/special/rthkmemory/category/innovation/programme/ (this part is about the beginnings of the HK Connection show: https://app4.rthk.hk/special/rthkmemory/programme/6 ) |
| 19:16:21 | <nuroten> | thanks :) |
| 19:20:29 | <nuroten> | clips from the first 3 episodes of Headliner ever https://app4.rthk.hk/special/rthkmemory/programme/34 |
| 19:23:23 | <nuroten> | for music fans, top 10 popular songs starting from the 70s and 80s https://app4.rthk.hk/special/rthkmemory/programme/33 (these last 3 links are from the innovation/programme category) |
| 19:26:34 | <@OrIdow6> | Would it be possible to have a very quick project for GREE set up in 20 minutes or so? |
| 19:27:21 | <@HCross> | GREE? |
| 19:27:33 | <@OrIdow6> | Japanese social network |
| 19:27:47 | <@OrIdow6> | Among other things |
| 19:27:57 | <@OrIdow6> | See deathwatch, date was apparently moved up |
| 19:28:09 | <@OrIdow6> | So less time than I thought |
| 19:35:04 | <AK> | https://twitter.com/textfiles/status/1407782416039690241 |
| 19:35:18 | <AK> | Time to archive the madman |
| 19:35:45 | <thuban> | why, what happened now |
| 19:37:35 | <AK> | "Spanish media reporting that John McAfee committed suicide in a Spanish jail cell after he was cleared to be extradited to the U.S." |
| 19:37:35 | <AK> | That |
| 19:37:36 | <AK> | Umm |
| 19:37:49 | <AK> | Didn't expect it to be that honestly |
| 19:37:54 | <AK> | I was expecting bitcoin and cocaine |
| 19:38:32 | <AK> | Seems confirmed by US justice department https://twitter.com/InvestorsLive/status/1407780136188002304 |
| 19:40:37 | <@OrIdow6> | HCross: Am I right in saying that arkiver is needed to do backfeed? It's not essential here since it seems that most/all of the publicly-accessible pages are in robots and the list page anyhow, but nice to have |
| 19:40:55 | <@HCross> | yes |
| 19:41:28 | <@OrIdow6> | Ok |
| 19:42:58 | <@arkiver> | OrIdow6: are you already working on this? |
| 19:43:06 | <@OrIdow6> | arkiver: Yes, mostly done |
| 19:43:07 | <@arkiver> | else I will try to setup a project for that now |
| 19:43:10 | <@OrIdow6> | Since it's fairly simple |
| 19:43:14 | <@arkiver> | alright ping me when it's somewhere |
| 19:43:18 | <@OrIdow6> | Ok |
| 19:46:55 | <@arkiver> | OrIdow6: from what i see, all posts are under a username |
| 19:47:02 | <nuroten> | thuban: there's a series called Hong Kong Stories (CN: 香港故事) with a different theme each season. it's about everyday people, some of them artisans, farmers, small business owners, etc. there are 10+ of them if you search the CN title, but here's one about food (the subtitle translates roughly to "thinking of the taste of home") |
| 19:47:02 | <nuroten> | https://podcast.rthk.hk/podcast/item.php?pid=1635&lang=en-US https://podcast.rthk.hk/podcast/tv_hkstories44_i.xml |
| 19:47:06 | <@arkiver> | so i guess just discovery of users while crawling is needed |
| 19:47:25 | <@arkiver> | i see some accounts are behind a login |
| 19:47:56 | <@OrIdow6> | arkiver: Most users seem to be private |
| 19:47:57 | | dm4v quits [Read error: Connection reset by peer] |
| 19:48:06 | <@OrIdow6> | We have 2 lists of what may or may not be all public users |
| 19:48:17 | | dm4v joins |
| 19:48:18 | | dm4v is now authenticated as dm4v |
| 19:48:18 | | dm4v quits [Changing host] |
| 19:48:18 | | dm4v (dm4v) joins |
| 19:48:29 | <@arkiver> | AK: damn, didnt expect that either |
| 19:48:37 | <@arkiver> | OrIdow6: perfect |
| 19:48:49 | <@arkiver> | will get it setup and started as soon as you have it ready |
| 19:50:33 | <@Kaz> | john mcafee. |
| 19:50:39 | <AK> | Yep |
| 19:51:33 | <KRG> | extradition seemed to be too much for him |
| 19:51:44 | <@Kaz> | understandable |
| 19:52:34 | <@HCross> | let me know when |
| 19:52:35 | <@HCross> | will go hard |
| 19:52:36 | <@HCross> | and fast |
| 19:52:45 | <@arkiver> | HCross: will ping |
| 19:53:03 | <@arkiver> | OrIdow6: do you have the list of users somewhere? |
| 19:53:20 | <@OrIdow6> | arkiver: It was in the form of URLs |
| 19:53:24 | <@OrIdow6> | Let me find them |
| 19:55:00 | <@OrIdow6> | https://transfer.archivete.am/2Bw3b/gree_all.txt https://transfer.archivete.am/HYWQ4/gree.txt , URLs from sitemap and from scraping the user list page, respectively, neither done by me, still need to be parsed |
| 19:55:14 | <@OrIdow6> | If someone other than me wants to do it, format is user:username |
| 19:55:21 | <@arkiver> | yeah i'll parse them |
| 19:55:29 | <@arkiver> | thanks |
| 19:55:33 | <@OrIdow6> | Thank you |
| 19:56:37 | <AK> | Can someone with voice in ab do McAfee's twitter? |
| 19:59:33 | | dm4v_ joins |
| 19:59:33 | | dm4v quits [Read error: Connection reset by peer] |
| 19:59:45 | | dm4v_ is now known as dm4v |
| 19:59:47 | | dm4v is now authenticated as dm4v |
| 19:59:47 | | dm4v quits [Changing host] |
| 19:59:47 | | dm4v (dm4v) joins |
| 20:01:04 | <AK> | Do we archive articles about people's deaths? |
| 20:01:12 | | dm4v quits [Read error: Connection reset by peer] |
| 20:01:21 | <@arkiver> | yes |
| 20:01:38 | <@arkiver> | or is this a 'how' question? |
| 20:01:59 | <AK> | Naah it was a do we, I think I've worked out what to do now |
| 20:02:41 | <@arkiver> | right, so policy question. answer is yes! |
| 20:04:07 | | dm4v joins |
| 20:04:10 | | dm4v is now authenticated as dm4v |
| 20:04:10 | | dm4v quits [Changing host] |
| 20:04:10 | | dm4v (dm4v) joins |
| 20:10:30 | <@arkiver> | OrIdow6: if gree can handle high load (we'll know when HCross is on the project), i'll put the URLs in #// as well most likely |
| 20:10:34 | <@arkiver> | though warrior project first |
| 20:10:45 | <@HCross> | warrior first please |
| 20:10:54 | <@HCross> | I don't like running // unless I have to |
| 20:11:02 | <@arkiver> | like i said :) |
| 20:11:04 | <@OrIdow6> | arkiver: Feeling is that it'll be rickety |
| 20:11:19 | <@OrIdow6> | I think this was something that had its heyday about 13 years ago or so |
| 20:11:21 | <@arkiver> | well warrior first, so we'll see |
| 20:11:35 | <@HCross> | are we talking pentium 4 servers in a closet somewhere |
| 20:11:36 | <@HCross> | in Japan |
| 20:11:41 | <@HCross> | cc rewby |
| 20:12:01 | <@OrIdow6> | Alright arkiver https://github.com/OrIdow6/gree-grab |
| 20:12:05 | <@arkiver> | thanks OrIdow6 |
| 20:12:07 | <AK> | If you want to put stuff in #// I can spin up some stuff on that |
| 20:12:14 | <@arkiver> | we can do a channel, not sure if it's needed |
| 20:13:03 | <@OrIdow6> | A good test item is user:kakei_toshio |
| 20:13:31 | <@arkiver> | OrIdow6: is it just me or is there a lot of wikidot stuff in there |
| 20:13:34 | <@arkiver> | will filter that out now |
| 20:13:40 | <@arkiver> | should be running in a few |
| 20:13:58 | <@HCross> | arkiver: let me know when code is ready |
| 20:14:00 | <@OrIdow6> | Yeah, looks like I did leave a bit in |
| 20:14:02 | <@HCross> | and I'll get underway |
| 20:14:20 | <@arkiver> | OrIdow6: no worries, checking it now |
| 20:14:30 | <@arkiver> | HCross: yeah, rewby EggplantN for target |
| 20:14:36 | | Joesh quits [Ping timeout: 244 seconds] |
| 20:14:37 | <@arkiver> | archiveteam_gree_ |
| 20:14:40 | <@arkiver> | gree_ |
| 20:14:51 | <@arkiver> | Archive Team GREE: |
| 20:15:11 | <Jake> | do we have a channel for GREE or are we just sticking to -bs? |
| 20:15:27 | <@arkiver> | may be good yeah |
| 20:15:29 | <@arkiver> | ideas? |
| 20:15:33 | <rewby> | We need targets? |
| 20:15:39 | <AK> | #greedy |
| 20:15:42 | <@arkiver> | nice |
| 20:26:34 | | LeGoupil quits [Client Quit] |
| 20:28:20 | | Vista2003 leaves |
| 20:30:55 | | HP_Archivist (HP_Archivist) joins |
| 20:48:20 | | nuroten quits [Remote host closed the connection] |
| 20:51:40 | <@EggplantN> | Rewby or deploy FMT |
| 20:51:42 | <@EggplantN> | I’m afk |
| 20:52:07 | <rewby> | EggplantN: nvme is overloaded and I've not got SSH on any of your boxes |
| 20:52:19 | <rewby> | We've deployed two CPX31s |
| 20:52:22 | <rewby> | Hopefully enough |
| 20:58:32 | <SketchTheCow> | Hey, people |
| 20:58:32 | <SketchTheCow> | I'm doing this game event thing today, then I turn back to general high focus. |
| 20:58:32 | <SketchTheCow> | Arkiver's getting most of the IA-Archiveteam integration/work done these days, but I'm around. |
| 21:05:42 | | nuroten joins |
| 21:07:09 | | fuzzy8021 quits [Read error: Connection reset by peer] |
| 21:07:31 | | fuzzy8021 (fuzzy8021) joins |
| 21:07:42 | | @EggplantN quits [Client Quit] |
| 21:07:45 | | ave quits [Quit: Ping timeout (120 seconds)] |
| 21:07:45 | | lun4 quits [Quit: Ping timeout (120 seconds)] |
| 21:08:04 | | ave (ave) joins |
| 21:08:04 | | lun4 (lun4) joins |
| 21:08:22 | | KRG quits [Changing host] |
| 21:08:22 | | KRG (KRG) joins |
| 21:08:41 | | EggplantN joins |
| 21:09:54 | | aaa joins |
| 21:10:42 | <aaa> | is this channel for backing up apple daily? |
| 21:11:01 | <rewby> | Apple Daily is mostly down already. We grabbed what we could. |
| 21:12:16 | <aaa> | I think there are still some links that are up |
| 21:12:23 | <aaa> | That can still be backed up |
| 21:12:40 | <@JAA> | Any examples? |
| 21:12:50 | <rewby> | If you list them we'll do our best to get 'em |
| 21:14:40 | <aaa> | 1 min |
| 21:16:11 | <aaa> | were you guys able to download the vids in this txt file here? |
| 21:16:11 | <aaa> | Videos (M3U8 and TS segments) from article pages extracted by User:Jodizzle in several parts: Part 1 Saved! with ArchiveBot on 2021-06-22: https://transfer.archivete.am/15b9yl/hk.appledaily.com-m3u8s-expanded.1.txt and job:atm5u7fjmgegw508c90ty32wi Part 2 Saved! with ArchiveBot on 2021-06-22/23: |
| 21:16:12 | <aaa> | https://transfer.archivete.am/RZHFJ/hk.appledaily.com-m3u8s-expanded.2.txt and job:183qpki4h8e40cswj2035wqmf Part 3 Saved! with ArchiveBot on 2021-06-23: https://transfer.archivete.am/OIkBX/hk.appledaily.com-m3u8s-expanded.3.txt and job:5ue8wjnyg1gbg1g7x420b5gpg Part 4 Saved! with ArchiveBot on 2021-06-23: |
| 21:16:12 | <aaa> | https://transfer.archivete.am/CKU9J/hk.appledaily.com-m3u8s-expanded.4.txt and job:91rs9mykjwyxmj5vekc8ol1qf Part 5 In progress... with ArchiveBot on 2021-06-23: https://transfer.archivete.am/11wYuN/hk.appledaily.com-m3u8s-expanded.5.txt and job:cxp0gi0dive8hio7t156y9o3r More parts Upcoming... |
| 21:16:27 | <aaa> | thanks a lot for your support btw |
| 21:16:35 | <@JAA> | I mean, it says 'Saved!' and 'In progress...' there. |
| 21:17:24 | <aaa> | Yeah I was confused if it meant the links are saved or the actual vid .ts files are saved. |
| 21:17:39 | <@JAA> | The videos are. |
| 21:17:42 | <aaa> | oh wonderful |
| 21:17:49 | <@JAA> | Part 5 actually finished by now as far as I can see. |
| 21:22:15 | <nuroten> | orly: FactWire https://www.factwire.org/?lang=en and HKPORI https://www.pori.hk/?lang=en will probably be safe for longer, but who knows ... in case you want to add it to your list |
| 21:28:12 | <nuroten> | 612 Humanitarian Fund offers legal advice, financial assistance, etc. for people charged from the 2019 protests https://612fund.hk/en/home https://www.facebook.com/612Fund/reviews/ |
| 21:33:11 | <nuroten> | after what happened with CHRF some of these orgs on the front lines of providing assistance, organising vigils, etc. might be at risk |
| 21:33:48 | <AK> | Added the website now, gonna look at facebook too |
| 21:34:45 | <nuroten> | AK: fantastic, thanks |
| 22:00:23 | <aaa> | This apple daily site is still up: https://www.nextdigital.com.hk/ |
| 22:00:54 | <aaa> | JAA rewby ^ |
| 22:01:20 | <rewby> | Didn't we throw that into archivebot? |
| 22:02:16 | <rewby> | Ah it appears we have not done so |
| 22:02:21 | <rewby> | But I don't have the perms to do it |
| 22:02:30 | <rewby> | So leaving that for JAA / other AB operators |
| 22:03:11 | <rewby> | Oh wait no we did put it in |
| 22:03:14 | <rewby> | and it's finished |
| 22:03:28 | <rewby> | Helps if I don't typo into /grep |
| 22:07:52 | <@JAA> | Yup, added to the wiki page. |
| 22:08:43 | <aaa> | This one too, I think, still works: https://appledaily-hk-appledaily-prod.cdn.arcpublishing.com/ |
| 22:12:41 | <@JAA> | Hmm |
| 22:13:26 | | Tansuke quits [Ping timeout: 244 seconds] |
| 22:13:54 | | HP_Archivist quits [Ping timeout: 250 seconds] |
| 22:15:53 | | HP_Archivist (HP_Archivist) joins |
| 22:16:50 | <AK> | nuroten, website is done :) |
| 22:22:39 | | EggplantN is now authenticated as EggplantN |
| 22:22:39 | | EggplantN quits [Changing host] |
| 22:22:39 | | EggplantN (EggplantN) joins |
| 22:22:39 | | @ChanServ sets mode: +o EggplantN |
| 22:23:53 | | HP_Archivist quits [Ping timeout: 258 seconds] |
| 22:25:58 | | DogsRNice quits [Read error: Connection reset by peer] |
| 22:27:10 | | DogsRNice (Webuser299) joins |
| 22:30:41 | <nuroten> | AK: whoa, thanks :D |
| 22:41:19 | <@JAA> | jodizzle: I should soon have a complete list of all videos on hk. |
| 22:41:46 | <@JAA> | aaa: Nice find, thanks. That's the complete site it seems. :-) |
| 22:45:40 | <aaa> | JAA any help you guys need to archive that site? happy to help as much as I can |
| 22:47:16 | <@arkiver> | if someone here knows about physics material in Hong Kong being removed, please ping me! |
| 22:47:28 | <@arkiver> | especially if you are able to get access to the material that will be discarded |
| 22:47:32 | <@JAA> | I'm crawling through the /archive section right now to collect all articles and videos. Apparently nobody did that before, or at least I haven't seen anyone mention it. |
| 22:47:35 | <@arkiver> | of course, safety first |
| 22:47:39 | <@JAA> | That's on HK Apple Daily. |
| 22:48:01 | <aaa> | arkiver what physics material are you referring to? |
| 22:48:12 | <@arkiver> | ugh |
| 22:48:14 | <@arkiver> | physical |
| 22:48:30 | <@arkiver> | newspapers, books, DVDs, whatever |
| 22:48:38 | <@arkiver> | but it's probably too late to get stuff like that out of the country |
| 22:49:09 | <@arkiver> | but if you know of that, please let me know |
| 22:49:11 | <aaa> | Yeah good point, haven't heard anything about that happening yet, but cannot rule anything out of course with the current situation |
| 22:49:47 | <@arkiver> | i can imagine that with various institutions closing, there would be archives and small (company) libraries closing as well |
| 22:50:05 | <aaa> | Yeah it's definitely in the realm of (high) possibilities |
| 22:50:29 | <aaa> | Is there any hope to download YT videos that are still online, but private? |
| 22:50:34 | <@arkiver> | ping me if you find out anything! |
| 22:51:01 | <@arkiver> | also nuroten orly on physical material ^ |
| 22:53:34 | <nuroten> | I don't have access, maybe people on LIHKG would know or have something https://lihkg.com |
| 22:54:25 | <nuroten> | I heard a lot of people will be grabbing the last print edition of Apple Daily first thing on thursday, 1 M copies to be printed |
| 22:55:30 | <aaa> | JAA you're crawling through /archive section of the arcpublishing site? |
| 22:55:40 | <@JAA> | aaa: Yes |
| 22:55:44 | <aaa> | thank you! |
| 22:55:51 | <@JAA> | Half-way done or so. |
| 22:56:01 | <aaa> | Do you guys have a guide on how you do these archives? for future reference haha |
| 22:56:02 | <@arkiver> | nuroten: im hoping we may still be able to take archives out of hong kong |
| 22:56:16 | <@JAA> | I'm using qwarc for this, which is completely undocumented. |
| 22:56:39 | <@arkiver> | nuroten: oof, yeah lihkg is a bit hard to read |
| 22:56:56 | <@arkiver> | what is lihkg? |
| 22:57:08 | <aaa> | arkiver popular hk forum, think hks version of reddit |
| 22:58:06 | <jodizzle> | JAA: Great! I'm probably not iterating the articles at all fast enough on my setup. |
| 22:58:22 | <@JAA> | jodizzle: The /archive/YYYYMMDD/ pages have all the M3U8 URLs. :-) |
| 22:58:36 | <@JAA> | So no need to retrieve each article page. |
| 22:59:10 | | aaa quits [Remote host closed the connection] |
| 22:59:12 | <@JAA> | How many videos did you discover? |
| 23:00:56 | <jodizzle> | Ahh okay. I think I had noticed there being a 'digest' on tw.appledaily.com /archive/YYYYMMDD/ pages, but I didn't think to look in detail for videos, or on hk.appledaily.com. Nice! |
| 23:02:49 | <jodizzle> | It looks like my five lists have 27,438 m3u8s. But many of those are different qualities of the same video. |
| 23:04:04 | | gohill4652 quits [Ping timeout: 244 seconds] |
| 23:05:01 | | xit quits [Quit: Ping timeout (120 seconds)] |
| 23:05:20 | | xit joins |
| 23:05:57 | <jodizzle> | I'll leave my thing collecting until you confirm that you have everything. |
| 23:12:14 | <@JAA> | Well, I'm not sure I trust them that /archive/ lists everything. But I can get you a list of article IDs listed there to compare. |
| 23:12:37 | <jodizzle> | Sounds good |
| 23:12:54 | | wessel15129 joins |
| 23:12:57 | | wessel1512 quits [Read error: Connection reset by peer] |
| 23:12:57 | | wessel15129 is now known as wessel1512 |
| 23:13:56 | | aaa joins |
| 23:15:26 | | orly quits [Ping timeout: 244 seconds] |
| 23:15:35 | <aaa> | There may be some things here that can still be downloaded: ml-welcome01.nxtdig.com.hk/stage/ |
| 23:16:20 | <aaa> | Ex: ml-welcome01.nxtdig.com.hk/stage/f54332d768439dfbf720661800624b42e7244fb2 |
| 23:17:05 | <@arkiver> | aaa: if you happen to see something interesting on lihkg, can you please let us know here? also regarding physical materials |
| 23:17:06 | <@JAA> | Surprisingly large. Just a WARC of the /archive/ pages from 2000 to today is 3.8 GB after compression. |
| 23:17:36 | <aaa> | arkiver sure |
| 23:18:01 | <thuban> | aaa: you mentioned lihkg; do you read chinese? |
| 23:18:07 | <aaa> | yh |
| 23:19:09 | <nuroten> | did someone download this playlist yet? https://www.youtube.com/playlist?list=PLiY6wtxjK6QPPh4cSFBgcjhDK3XTDAV2F |
| 23:19:42 | <nuroten> | it's a food show by NEXT TV apparently |
| 23:19:53 | <aaa> | nuroten that is not apple daily hk |
| 23:20:06 | <aaa> | that is a taiwan outlet, so no need to backup |
| 23:20:20 | <thuban> | aaa: they have some threads and documents of rthk stuff they're trying to save (since the deleting-old-material warning went up a little while back) |
| 23:20:29 | <thuban> | i wrote a high-quality scraper and i'd like to coordinate, but the lists are pretty hard to navigate through google translate. one sec, i'll grab the links |
| 23:20:49 | <nuroten> | aaa: oh okay, good to know, thanks ... someone on LIHKG suggested two other channels to back up, that was one of them |
| 23:21:56 | <thuban> | aaa: https://docs.google.com/spreadsheets/d/1JPyevWnxvoq_xzY4ptOgaTaTYSLE9oMva66kxm66K0k/edit ; https://docs.google.com/document/d/1I3yYU2CTjlDt39xOZaSf7Qjj87eArrqEDp4897SaaSw/edit |
| 23:22:31 | <nuroten> | (and yeah, it's not Apple Daily, I wasn't sure how long the tw-produced content will stay up, even with tw-based partner) |
| 23:23:48 | <aaa> | thuban those spreadsheets are for RTHK, some controversial RTHK content has already been deleted as of 1 mth ago |
| 23:24:42 | <thuban> | aaa: yeah, i got the entire english-language hong kong connection backlog at that time. |
| 23:24:49 | <aaa> | nice |
| 23:25:44 | <thuban> | i realize that it's not as time-sensitive as apple daily content now, but i still have the downloader and i'd like to get anything else that may be in danger in the future if i can |
| 23:26:41 | <nuroten> | aaa: thuban helped save some of the RTHK podcasts back in May when RTHK announced it was taking down content older than 1 year so their website and social media are aligned (whatever that meant) |
| 23:27:28 | <nuroten> | a few of us were worried at the time it will also affect their archives, not just youtube |
| 23:27:29 | <aaa> | yeah it was just a bullshit excuse lol nuroten |
| 23:28:56 | <nuroten> | yeah ... so now hopefully if they decide to quietly drop content, we will hopefully be more prepared |
| 23:29:20 | <@JAA> | Ewwwww. These /archive/ pages on hk.appledaily.com each contain a ~5 MB JSON object. And some of the keys are themselves JSON strings. Disgusting... |
| 23:30:18 | <thuban> | yo dawg, i heard you like... |
| 23:30:39 | <@JAA> | { ..., "{\"feedOffset\":0,\"feedQuery\":\"taxonomy.primary_section._id:\\\"%2Fdaily%2Fentertainment\\\"+AND+type:story+AND+(editor_note:\\\"20180111\\\"+OR+display_date:[2018-01-11T16:00:00Z||-24h+TO+2018-01-11T16:00:00Z])\",\"feedSize\":100,\"sort\":\"location:asc\"}": { ... } } |
| 23:31:31 | | aaaa joins |
| 23:32:21 | | aaa quits [Remote host closed the connection] |
| 23:32:50 | | aaaa quits [Remote host closed the connection] |
| 23:35:07 | <Jake> | I wonder if zstd would compress it nicely... |
| 23:36:57 | | aaaaa joins |
| 23:37:11 | <@JAA> | I'm sure I could train a dict that would absolutely shred it. |
| 23:44:32 | <thuban> | hm, looks like someone else has uploaded hkc (the entire run? not sure, but would bet) to ia as individual episodes |
| 23:44:50 | <thuban> | i was planning to upload everything as one item; should i still? |
| 23:45:01 | | HP_Archivist (HP_Archivist) joins |
| 23:45:07 | <@arkiver> | thuban: how many videos? |
| 23:45:55 | <thuban> | arkiver: have i? 297 (with thumbnails) |
| 23:46:39 | <@arkiver> | from appledaily? |
| 23:47:04 | <nuroten> | are the ones uploaded also the English version? |
| 23:47:19 | <thuban> | no, this is the rthk stuff we were doing last month |
| 23:47:23 | <thuban> | they appear to be |
| 23:49:34 | <nuroten> | I kind of like the idea of them as one item where it generates a playlist and can be viewed sequentially, is that also available if they're all individual inside a series collection? |
| 23:50:38 | <thuban> | no idea. they're not currently in a collection, though |
| 23:50:57 | <nuroten> | but not picky as long as there are copies ... same resolution? |
| 23:51:05 | <thuban> | looks like, yeah |
| 23:51:40 | <nuroten> | okay ... I remembered there was some weird thing with some of them having different res depending on whether they were from akamai or archive |
| 23:52:47 | <nuroten> | they make more sense to me as 1 item (or some way to group them as a set) but yeah, up to you, thanks for saving those :) |
| 23:53:04 | <thuban> | yeah. i did a convenience sample of one very new one and one very old one; both were the same size as mine. i think there was an intermediate phase between thos ebut don't recall the details (and it looks like we used very similar methods) |
| 23:53:12 | <thuban> | *those but |
| 23:53:18 | <@arkiver> | thuban: if it's 297 videos, then let's do individual items if it's not too much more work |
| 23:53:28 | <@arkiver> | let me know when they're up and I'll put them in some AT collection |
| 23:53:53 | <@arkiver> | make sure you have good/correct metadata |
| 23:54:31 | <thuban> | arkiver: someone else already upped them as individual episodes https://archive.org/details/@kwc114 |
| 23:54:59 | <thuban> | metadata looks good but they're not in a collection--iirc only the uploader can put them in one; is that right? |
| 23:55:20 | <@arkiver> | or I can |
| 23:55:31 | <thuban> | oh, cool |
| 23:55:31 | <@arkiver> | or someone else at IA |
| 23:56:42 | <thuban> | i do note that they don't seem to have grabbed the original thumbnails. ia generates its own, so not a big deal from a usability perspective, but maybe i should upload them for preservation? |
| 23:56:54 | <@arkiver> | yes |
| 23:57:00 | <@arkiver> | but, in one item then |
| 23:57:17 | <thuban> | ok, will do |
| 23:57:24 | <thuban> | once i remember how the cli works lol |
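
JAA's note at 23:29, that each /archive/ page on hk.appledaily.com embeds a JSON object in which some keys are themselves JSON strings, can be illustrated with a minimal Python sketch. The sample data, the `items` field, and the function name `iter_feed_entries` are hypothetical; the log does not show the real page layout, and this is not the qwarc code JAA was running.

```python
import json


def iter_feed_entries(page_obj):
    """Yield (decoded_key, value) for every key that is itself a JSON object.

    Keys that are not valid JSON, or that decode to something other than
    an object (a number, a string, null), are skipped.
    """
    for key, value in page_obj.items():
        try:
            decoded = json.loads(key)
        except json.JSONDecodeError:
            continue
        if isinstance(decoded, dict):
            yield decoded, value


# Hypothetical sample shaped like the key quoted in the log: the outer
# object has one key that is a JSON-encoded query and one ordinary key.
query = {
    "feedOffset": 0,
    "feedQuery": "type:story",
    "feedSize": 100,
    "sort": "location:asc",
}
sample = json.dumps({json.dumps(query): {"items": []}, "plainKey": 1})

page = json.loads(sample)
entries = list(iter_feed_entries(page))
for decoded_key, value in entries:
    print(decoded_key["feedSize"], decoded_key["sort"], value)
```

Decoding the key a second time recovers the query parameters (`feedOffset`, `feedSize`, and so on) as ordinary fields, which is usually easier than matching on the escaped string.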