| 00:04:25 | <thuban> | (oh, the other thing is that i included external identifiers so that videos/thumbnails could be matched without relying on titles or filenames, which i don't think the other uploader did. i don't think that's a big deal, though) |
| 00:05:05 | | Arcorann (Arcorann) joins |
| 00:06:23 | <thuban> | oh FUCK i forgot to set the mediatype |
| 00:08:24 | <thuban> | arkiver: is there any way to get around this? i'd like to be able to use the identifier i originally wanted :/ |
| 00:14:40 | <@JAA> | I *love* working with 40 GB of JSON. Just *wonderful*! |
| 00:15:26 | | Doranwen quits [Ping timeout: 258 seconds] |
| 00:16:03 | <@arkiver> | thuban: i'll change it for you |
| 00:16:15 | <@arkiver> | ping me the item when it's uploaded! |
| 00:16:20 | <thuban> | arkiver: ah, thank you so much :) |
| 00:16:27 | <@arkiver> | change the mediatype that is, not the identifier |
| 00:16:38 | <thuban> | https://archive.org/details/rthk-podcast-hkconnection-en-thumbnails |
| 00:16:51 | <thuban> | i interrupted the upload when i realized what i'd done |
| 00:18:41 | <@JAA> | Oh, only 35.7 GB in the end. Phew, that's much better... Please shoot me. |
| 00:23:38 | | BlueMaxima joins |
| 00:34:29 | <aaaaa> | JAA any luck on the arcpublishing backup? it sounds like it worked? :) |
| 00:35:46 | <@JAA> | I'm still shooting myself in the foot. :-) |
| 00:37:42 | <@JAA> | jodizzle: https://transfer.archivete.am/TV0oq/hk.appledaily.com-archive-article-urls.zst |
| 00:38:29 | <@JAA> | Ignore the video-leaf URLs, I think. Still need to look into that more closely. But otherwise, that should be all articles from the /archive/ listings. |
| 00:42:20 | | ZizzyDizzyMC joins |
| 00:43:31 | <@JAA> | The video-leaf stuff are unique IDs for the videos, not paths on the website, it seems. Not all video-leaf 'URLs' are in that list. |
| 00:45:21 | | Ajay1 joins |
| 00:47:15 | | Ajay quits [Ping timeout: 258 seconds] |
| 00:47:50 | | orly (orly) joins |
| 00:50:06 | <@JAA> | jodizzle: 201466 videos according to my extraction (which is definitely crude). |
| 00:50:42 | <@JAA> | And almost all of them, namely 193142, are MP4. |
| 00:52:29 | <@JAA> | They also conveniently provide a 'filesize' field in the JSON. Not sure how accurate it is for M3U8, but summing that up gets me to 1.93 TB. |
| 00:54:53 | <@JAA> | rewby: I love how your HK Apple Daily article URL list is called sorted_urls.txt and is anything but sorted. :-P |
| 00:56:17 | | dav3 joins |
| 00:57:59 | | Doranwen (Doranwen) joins |
| 00:58:01 | <@JAA> | Looks like the /archive/ iteration is incomplete. :-( For example, /sports/20150707/UPVFKXHDPUZPUU4UJTHHQCSFCI/ does not appear on https://appledaily-hk-appledaily-prod.cdn.arcpublishing.com/archive/20150707/ |
| 00:58:46 | <orly> | Hello. Sorry, was night time in Hong Kong, so probably missed a lot after I posted that txt of websites. |
| 00:59:26 | <@JAA> | orly: https://hackint.logs.kiska.pw/archiveteam-bs/20210623#c292224 |
| 01:00:09 | <orly> | JAA, thanks |
| 01:03:21 | | dm4v quits [Ping timeout: 258 seconds] |
| 01:03:32 | | dm4v joins |
| 01:03:34 | | dm4v is now authenticated as dm4v |
| 01:03:34 | | dm4v quits [Changing host] |
| 01:03:34 | | dm4v (dm4v) joins |
| 01:07:37 | <@JAA> | 1.65M URLs in my article URLs list, 3.22M in rewby's list from the sitemaps. :-| |
| 01:23:40 | | dav3 quits [Remote host closed the connection] |
| 01:24:33 | | orly quits [Client Quit] |
| 01:34:41 | | Tansuke joins |
| 01:42:57 | <@JAA> | But 98k (90k without video-leaf) URLs in my list that aren't in the sitemaps. |
| 01:54:14 | | satoru1126 joins |
| 01:55:36 | | SomeRando quits [Ping timeout: 244 seconds] |
| 01:57:18 | | satoru1126 leaves |
| 02:15:14 | | Tansuke quits [Ping timeout: 244 seconds] |
| 02:31:28 | <aaaaa> | Folks, is there any way to scrape this? https://playboard.co/en/channel/UCeqUUXaM75wrK5Aalo6UorQ/videos |
| 02:31:46 | <aaaaa> | A lot of the videos aren't up anymore, but scraping metadata too is still good |
| 02:36:20 | | pm5 joins |
| 02:43:07 | <thuban> | pagination is js-triggered and server-side by repeated POSTs, so would not work in ab/wbm |
| 02:43:31 | <thuban> | but a custom scraper could do the pagination, then generate video urls to feed to archivebot |
| 02:44:20 | <thuban> | between the json and the video pages should get all the metadata |
| 02:46:53 | <thuban> | JAA (or others): is that worth doing or would it be purely duplicative of the youtube archive work on this channel? i haven't been in #down-the-tube |
| 03:05:12 | <aaaaa> | were you guys able to download all the youtube videos from apple daily? |
| 03:05:40 | | tansuke joins |
| 03:05:45 | <thuban> | aaaaa: definitely a lot of them, but 'all' is unclear https://wiki.archiveteam.org/index.php/Apple_Daily#Others |
| 03:10:43 | | tansuke quits [Remote host closed the connection] |
| 03:13:50 | | Tansuke joins |
| 03:14:25 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 03:21:59 | <thuban> | i wrote that pager just in case; it's running now |
| 03:31:53 | | Tansuke quits [Remote host closed the connection] |
| 03:36:42 | <thuban> | https://transfer.archivete.am/P369h/appledaily_videos_playboard_urls.txt.zst |
| 03:48:48 | | qw3rty__ joins |
| 03:52:24 | | qw3rty_ quits [Ping timeout: 258 seconds] |
| 04:30:10 | <Ajay1> | https://workspaceupdates.googleblog.com/2021/06/drive-file-link-updates.html |
| 04:33:10 | | DogsRNice quits [Read error: Connection reset by peer] |
| 04:38:01 | | ZizzyDizzyMC quits [Remote host closed the connection] |
| 04:46:44 | | Webuser431 joins |
| 04:52:53 | | orly (orly) joins |
| 04:58:16 | | G4te_Keep3r quits [Client Quit] |
| 04:58:53 | | Viniter69 (Viniter) joins |
| 05:02:06 | | Viniter6 quits [Ping timeout: 250 seconds] |
| 05:02:06 | | Viniter69 is now known as Viniter6 |
| 05:03:35 | | G4te_Keep3r joins |
| 05:03:37 | | ZizzyDizzyMC joins |
| 05:05:33 | | orly quits [Client Quit] |
| 06:20:26 | <jodizzle> | JAA: If I'm doing this right, it looks like my m3u8 collector only actually found m3u8s from 9097 articles. Of those, 2219 are present in the hk.appledaily.com-archive-article-urls list you sent. |
| 07:17:06 | | orly (orly) joins |
| 07:20:07 | | leo60228 quits [Quit: ZNC 1.8.1 - https://znc.in] |
| 07:21:48 | | sonick quits [Client Quit] |
| 07:23:09 | <orly> | Hiya. I just realised I missed one university students' press when I posted that txt of Hong Kong stuff yesterday. It's the Chinese University Campus Radio. Facebook: cuhkcampusradio Youtube: UCg-D5uUTXfTSolC_zY5KXCg |
| 07:23:29 | <Ajay1> | Is there a channel for Google drive currently? |
| 07:25:03 | <thuban> | orly: mention the youtube channel in #down-the-tube ? |
| 07:25:45 | | ieh joins |
| 07:25:45 | <orly> | Sure thing. But they mainly upload politically sensitive stuff on their Facebook. Their Youtube is basically just an archive of their meeting recordings. |
| 07:30:44 | | HP_Archivist quits [Ping timeout: 250 seconds] |
| 07:31:26 | | Webuser431 quits [Ping timeout: 244 seconds] |
| 07:33:11 | | leo60228 (leo60228) joins |
| 07:34:10 | | ieh quits [Remote host closed the connection] |
| 07:50:28 | | channel13y4 joins |
| 08:03:28 | | ZizzyDizzyMC quits [Ping timeout: 244 seconds] |
| 08:30:25 | <jodizzle> | I started another appledaily video crawl via article URLs pointed at https://appledaily-hk-appledaily-prod.cdn.arcpublishing.com/, this time getting mp4s as well. |
| 08:45:30 | <achivarin> | Could you make sure to get https://appledaily-hk-appledaily-prod.cdn.arcpublishing.com/video/lifestyle/ first? We missed that YouTube channel. Thanks. |
| 08:46:02 | | sonick (sonick) joins |
| 09:24:18 | | Hackerpcs quits [Quit: Hackerpcs] |
| 09:25:37 | | Hackerpcs (Hackerpcs) joins |
| 09:32:30 | <achivarin> | jodizzle: How did you get mp4s? Can you give me a few pointers? |
| 09:33:04 | | nuroten quits [Remote host closed the connection] |
| 10:01:56 | | dm4v_ joins |
| 10:01:56 | | dm4v quits [Read error: Connection reset by peer] |
| 10:02:08 | | dm4v_ is now known as dm4v |
| 10:02:10 | | dm4v is now authenticated as dm4v |
| 10:02:10 | | dm4v quits [Changing host] |
| 10:02:10 | | dm4v (dm4v) joins |
| 10:06:44 | | dm4v quits [Read error: Connection reset by peer] |
| 10:06:57 | | dm4v joins |
| 10:06:59 | | dm4v is now authenticated as dm4v |
| 10:06:59 | | dm4v quits [Changing host] |
| 10:06:59 | | dm4v (dm4v) joins |
| 10:08:58 | | dm4v quits [Read error: Connection reset by peer] |
| 10:09:02 | | dm4v_ joins |
| 10:09:19 | | dm4v_ quits [Client Quit] |
| 10:11:19 | | dm4v joins |
| 10:11:21 | | dm4v is now authenticated as dm4v |
| 10:11:21 | | dm4v quits [Changing host] |
| 10:11:21 | | dm4v (dm4v) joins |
| 10:17:47 | | dm4v_ joins |
| 10:19:34 | | dm4v quits [Ping timeout: 258 seconds] |
| 10:19:34 | | dm4v_ is now known as dm4v |
| 10:19:34 | | dm4v is now authenticated as dm4v |
| 10:19:34 | | dm4v quits [Changing host] |
| 10:19:34 | | dm4v (dm4v) joins |
| 10:21:54 | | dm4v quits [Read error: Connection reset by peer] |
| 10:22:42 | | dm4v joins |
| 10:22:45 | | dm4v is now authenticated as dm4v |
| 10:22:45 | | dm4v quits [Changing host] |
| 10:22:45 | | dm4v (dm4v) joins |
| 10:22:58 | | bbsky quits [Ping timeout: 244 seconds] |
| 10:31:32 | | marked is now known as marked1 |
| 10:51:11 | <wessel1512> | avoozl it's going well with viva forum grab and will probably be done before the deadline |
| 10:59:12 | | CTL joins |
| 10:59:32 | | CTL quits [Remote host closed the connection] |
| 10:59:55 | | wessel1512 is now authenticated as wessel1512 |
| 11:16:22 | <alard> | To add to that: The quick threads-only grab of the viva forum finished yesterday (https://archive.fart.website/archivebot/viewer/job/34m7c), so all messages should be in there. wessel's grab will be more complete with outgoing links, redirects from direct links to individual posts etc. |
| 11:24:45 | <avoozl> | wessel1512: cool, are there any warcs I can pick up and take a look at? |
| 11:25:09 | <avoozl> | alard: awesome |
| 11:26:40 | <avoozl> | I'll kick off some downloads and see if my parser still works |
| 11:26:57 | <wessel1512> | i don't know if they have been uploaded to IA yet |
| 11:27:08 | <avoozl> | the links in the website alard pasted are working |
| 11:27:13 | | orly quits [Client Quit] |
| 11:27:36 | <avoozl> | it'll take a day or so to get those here, but at least things are moving |
| 11:27:46 | | orly (orly) joins |
| 11:28:28 | <wessel1512> | you can download the warcs from https://archive.fart.website/archivebot/viewer/job/ade35 |
| 11:28:56 | <avoozl> | I'll add them to the list |
| 11:29:21 | <wessel1512> | your list ? |
| 11:29:37 | <avoozl> | of things to download so I can start parsing |
| 11:29:54 | <avoozl> | I've been building a warc parser that extracts structured posts/threads from it so that it can be easily searched |
| 11:30:36 | <avoozl> | it produces json chunks like this https://archive.org/details/warceater_yahooanswers and has a little tool for building search indices on top of them and host them with a generic forum skin |
| 11:30:47 | <wessel1512> | then it's better to use https://archive.fart.website/archivebot/viewer/job/34m7c |
| 11:31:04 | <avoozl> | the second one has outlinks the first one is just the threads, right? |
| 11:31:24 | <wessel1512> | in reverse |
| 11:31:46 | <avoozl> | check |
| 11:31:51 | <wessel1512> | the first one has outlinks |
| 11:31:54 | <avoozl> | ok looking forward to having some time to play with this |
| 11:31:58 | <avoozl> | thanks |
| 11:46:09 | | channel13y4 quits [Ping timeout: 244 seconds] |
| 12:18:32 | | yano quits [Client Quit] |
| 12:18:42 | | yanome quits [Client Quit] |
| 12:19:12 | | VerifiedJ quits [Quit: The Lounge - https://thelounge.chat] |
| 12:19:45 | | yanome (yano) joins |
| 12:20:00 | | VerifiedJ (VerifiedJ) joins |
| 12:20:47 | | yano (yano) joins |
| 12:56:47 | | orly quits [Client Quit] |
| 13:12:30 | | Webuser431 joins |
| 13:34:57 | | Daloader joins |
| 13:45:04 | | y joins |
| 13:45:49 | | y quits [Remote host closed the connection] |
| 13:52:13 | | LeGoupil joins |
| 14:15:54 | | Daloader quits [Ping timeout: 250 seconds] |
| 14:44:04 | | LeGoupil quits [Ping timeout: 258 seconds] |
| 15:02:25 | | sec^nd quits [Ping timeout: 255 seconds] |
| 15:11:10 | | LeGoupil joins |
| 15:17:16 | | Daloader joins |
| 15:17:26 | | britm0b quits [Ping timeout: 250 seconds] |
| 15:22:03 | | britmob joins |
| 15:22:12 | | Arcorann quits [Ping timeout: 250 seconds] |
| 15:34:15 | | sec^nd (second) joins |
| 16:04:52 | | Larsenv quits [Client Quit] |
| 16:08:53 | | spirit joins |
| 16:21:16 | | LeGoupil quits [Client Quit] |
| 16:28:30 | <aaaaa> | Are there any tools that can scan whether your file contains any personally identifiable metadata (ex: IP address, location, browser fingerprint, etc)? Would like to use one before uploading things |
| 16:29:17 | <aaaaa> | For example, if you download YT metadata files through yt-dl, your ip is shown in the metadata .json files |
| 16:32:33 | | britmob quits [Ping timeout: 258 seconds] |
| 16:39:04 | | mutantmonkey quits [Remote host closed the connection] |
| 16:39:20 | | mutantmonkey (mutantmonkey) joins |
| 17:01:38 | | HP_Archivist (HP_Archivist) joins |
| 17:03:59 | | monoxane quits [Ping timeout: 258 seconds] |
| 17:10:53 | | monoxane (monoxane) joins |
| 17:30:15 | | wyatt8750 quits [Remote host closed the connection] |
| 17:34:39 | | atphoenix quits [Ping timeout: 258 seconds] |
| 17:36:19 | | wyatt8740 joins |
| 17:48:29 | | Larsenv (Larsenv) joins |
| 17:50:53 | <Ryz> | ...Oh, ooh, Windows 11 to support Android apps via Amazon's appstore (wondering how it would affect archiving operations): https://www.theverge.com/2021/6/24/22548428/microsoft-windows-11-android-apps-support-amazon-store |
| 18:48:54 | | HP_Archivist quits [Ping timeout: 250 seconds] |
| 18:49:38 | | C4K3 quits [Remote host closed the connection] |
| 18:53:52 | <@JAA> | jodizzle: Oops, forgot to upload this last night, here's my list of video streams from the archive pagination: https://transfer.archivete.am/IJGRu/hk.appledaily.com-archive-videos.zst |
| 18:54:16 | <@JAA> | I'll throw the MP4s into AB now but will let you handle the M3U8. |
| 18:56:15 | <@JAA> | (There are some duplicates in this list.) |
| 18:58:19 | | ZizzyDizzyMC joins |
| 19:00:32 | <thuban> | JAA: thoughts on the playboard metadata? (conversation between me and aaaaa above) |
| 19:00:54 | | C4K3 joins |
| 19:00:54 | | C4K3 is now authenticated as C4K3 |
| 19:14:28 | <@arkiver> | thuban: fixed your item to mediatype image |
| 19:15:11 | <thuban> | arkiver: thank you! |
| 19:17:38 | <thuban> | if you're still planning on creating a collection for the show, it may be a good idea to move the thumbs there as well as the episode items |
| 19:18:34 | <thuban> | (the same user has also uploaded runs of a couple of other shows, which likewise don't have collections) |
| 19:18:41 | <@JAA> | HK Apple Daily MP4s are now running through AB, should be about 1.5 TB. |
| 19:19:58 | <@JAA> | thuban: Playboard is a metadata aggregator for YouTube it seems? Certainly can't hurt to grab the metadata, although we likely already have much of it in #youtubearchive. |
| 19:20:36 | <@JAA> | Metadata is also generally tiny, so duplication isn't much of a problem there. |
| 19:22:01 | <thuban> | JAA: that's right. the zst i uploaded (https://transfer.archivete.am/P369h/appledaily_videos_playboard_urls.txt.zst) has all the page urls and should be archivebot-ready. |
| 19:23:30 | <@JAA> | Hmm, only 9.8k videos? |
| 19:24:04 | <thuban> | to all appearances that's as many as playboard knew about |
| 19:24:25 | <@JAA> | Yeah, apparently. Weak. :-P |
| 19:25:05 | <thuban> | i'm checking now to see whether there's anything in the pagination json that isn't in the video page source |
| 19:25:46 | <@JAA> | Running through AB now. |
| 19:32:39 | <thuban> | ^ the only thing missing is the channel's profile image's youtube url: https://yt3.ggpht.com/ytc/AKedOLSXcaGJFYW3dY0xIp9WOOx1JJtrDHyj909W38XbQw |
| 19:34:55 | | Matthww8 quits [Quit: Ping timeout (120 seconds)] |
| 19:35:18 | | Matthww8 joins |
| 19:38:05 | | Larsenv quits [Client Quit] |
| 19:54:12 | | spirit quits [Client Quit] |
| 19:57:22 | | Daloader quits [Ping timeout: 250 seconds] |
| 20:14:17 | | Larsenv (Larsenv) joins |
| 20:28:43 | | britmob joins |
| 20:29:55 | | HP_Archivist (HP_Archivist) joins |
| 22:36:42 | | omni quits [Read error: Connection reset by peer] |
| 22:36:43 | | omni joins |
| 22:36:47 | <@Kaz> | man |
| 22:36:53 | | xit quits [Client Quit] |
| 22:36:57 | <@Kaz> | I was just writing '#archiveteam-bs before we get shouted at' |
| 22:37:00 | <@Kaz> | and THERE WE GO |
| 22:37:04 | <@JAA> | :-) |
| 22:37:52 | | noteness quits [Ping timeout: 258 seconds] |
| 22:38:48 | | noteness (noteness) joins |
| 22:39:00 | | ThreeHM quits [Ping timeout: 250 seconds] |
| 22:39:01 | | mutantmonkey quits [Ping timeout: 258 seconds] |
| 22:39:01 | | Suika quits [Ping timeout: 258 seconds] |
| 22:39:29 | | xit joins |
| 22:39:47 | | luckcolors quits [Ping timeout: 258 seconds] |
| 22:39:57 | | Suika joins |
| 22:40:11 | | luckcolors (luckcolors) joins |
| 22:40:18 | | xkey quits [Ping timeout: 250 seconds] |
| 22:40:18 | | @AlsoJAA quits [Ping timeout: 250 seconds] |
| 22:40:53 | | C4K3 quits [Remote host closed the connection] |
| 22:40:55 | | C4K3 joins |
| 22:40:55 | | C4K3 is now authenticated as C4K3 |
| 22:41:01 | | AlsoJAA (JAA) joins |
| 22:41:01 | | @ChanServ sets mode: +o AlsoJAA |
| 22:41:12 | | xkey (eyo) joins |
| 22:51:18 | | mutantmonkey (mutantmonkey) joins |
| 23:03:25 | | lorwp quits [Quit: ZNC - https://znc.in] |
| 23:09:23 | | ThreeHM (ThreeHeadedMonkey) joins |
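
aaaaa's question at 16:28 (checking a file for personally identifiable metadata before uploading) went unanswered in the channel. As a rough illustration only, the sketch below walks a parsed JSON document and reports every string that contains something parsing as an IPv4 address, together with its location. It is not an existing tool, and it is far from a complete check: it ignores IPv6, location data, browser fingerprints and anything outside string values. The names `find_ipv4` and `scan_json` and the sample fields are made up for this example.

```python
import ipaddress
import re

# Loose candidate pattern: four dot-separated groups of 1-3 digits that are not
# part of a longer dotted number. The ipaddress module does the real validation.
IPV4_CANDIDATE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def find_ipv4(text):
    """Return the valid IPv4 addresses found in a string, in order of appearance."""
    hits = []
    for match in IPV4_CANDIDATE.finditer(text):
        try:
            ipaddress.IPv4Address(match.group())
        except ValueError:
            continue  # e.g. 999.1.1.1 is not an address
        hits.append(match.group())
    return hits


def scan_json(value, path="$"):
    """Yield (path, address) for each IPv4 address in any string value of parsed JSON."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from scan_json(child, f"{path}.{key}")
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from scan_json(child, f"{path}[{index}]")
    elif isinstance(value, str):
        for address in find_ipv4(value):
            yield path, address


# Hypothetical metadata: the field names and URL are invented for illustration.
sample = {
    "title": "some video",
    "formats": [
        {"url": "https://example.com/videoplayback?ip=203.0.113.7&id=1"},
        {"note": "encoder 1.2.3.4.5"},  # five groups, so not reported
    ],
}

for location, address in scan_json(sample):
    print(location, address)
```

For a real file, load it with `json.load` and pass the result to `scan_json`; any hit is only a prompt to inspect that field by hand before uploading.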