00:04:25<thuban>(oh, the other thing is that i included external identifiers so that videos/thumbnails could be matched without relying on titles or filenames, which i don't think the other uploader did. i don't think that's a big deal, though)
00:05:05Arcorann (Arcorann) joins
00:06:23<thuban>oh FUCK i forgot to set the mediatype
00:08:24<thuban>arkiver: is there any way to get around this? i'd like to be able to use the identifier i originally wanted :/
00:14:40<@JAA>I *love* working with 40 GB of JSON. Just *wonderful*!
00:15:26Doranwen quits [Ping timeout: 258 seconds]
00:16:03<@arkiver>thuban: i'll change it for you
00:16:15<@arkiver>ping me the item when it's uploaded!
00:16:20<thuban>arkiver: ah, thank you so much :)
00:16:27<@arkiver>change the mediatype that is, not the identifier
00:16:38<thuban>https://archive.org/details/rthk-podcast-hkconnection-en-thumbnails
00:16:51<thuban>i interrupted the upload when i realized what i'd done
00:18:41<@JAA>Oh, only 35.7 GB in the end. Phew, that's much better... Please shoot me.
00:23:38BlueMaxima joins
00:34:29<aaaaa>JAA any luck on the arcpublishing backup? it sounds like it worked? :)
00:35:46<@JAA>I'm still shooting myself in the foot. :-)
00:37:42<@JAA>jodizzle: https://transfer.archivete.am/TV0oq/hk.appledaily.com-archive-article-urls.zst
00:38:29<@JAA>Ignore the video-leaf URLs, I think. Still need to look into that more closely. But otherwise, that should be all articles from the /archive/ listings.
00:42:20ZizzyDizzyMC joins
00:43:31<@JAA>The video-leaf stuff are unique IDs for the videos, not paths on the website, it seems. Not all video-leaf 'URLs' are in that list.
00:45:21Ajay1 joins
00:47:15Ajay quits [Ping timeout: 258 seconds]
00:47:50orly (orly) joins
00:50:06<@JAA>jodizzle: 201466 videos according to my extraction (which is definitely crude).
00:50:42<@JAA>And almost all of them, namely 193142, are MP4.
00:52:29<@JAA>They also conveniently provide a 'filesize' field in the JSON. Not sure how accurate it is for M3U8, but summing that up gets me to 1.93 TB.
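[editor's note] The size estimate above comes from summing a 'filesize' field across the video JSON. A minimal sketch of that kind of sum follows, assuming each video is a JSON object with an optional numeric 'filesize' key; the actual record layout is not shown in the log, so the sample data and the function name total_filesize are hypothetical.

```python
import json


def total_filesize(records):
    """Sum the 'filesize' field over an iterable of dicts.

    Records with a missing or non-numeric 'filesize' are skipped
    rather than treated as errors.
    """
    total = 0
    for rec in records:
        size = rec.get("filesize")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(size, (int, float)) and not isinstance(size, bool):
            total += size
    return total


# Hypothetical sample; the real JSON structure is an assumption.
sample = json.loads(
    '[{"filesize": 1000}, {"filesize": 2500}, {"title": "no size here"}]'
)
print(total_filesize(sample))
```

As JAA notes, the field may not be accurate for M3U8 streams, so a total like this is only an estimate.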
00:54:53<@JAA>rewby: I love how your HK Apple Daily article URL list is called sorted_urls.txt and is anything but sorted. :-P
00:56:17dav3 joins
00:57:59Doranwen (Doranwen) joins
00:58:01<@JAA>Looks like the /archive/ iteration is incomplete. :-( For example, /sports/20150707/UPVFKXHDPUZPUU4UJTHHQCSFCI/ does not appear on https://appledaily-hk-appledaily-prod.cdn.arcpublishing.com/archive/20150707/
00:58:46<orly>Hello. Sorry, was night time in Hong Kong, so probably missed a lot after I posted that txt of websites.
00:59:26<@JAA>orly: https://hackint.logs.kiska.pw/archiveteam-bs/20210623#c292224
01:00:09<orly>JAA, thanks
01:03:21dm4v quits [Ping timeout: 258 seconds]
01:03:32dm4v joins
01:03:34dm4v quits [Changing host]
01:03:34dm4v (dm4v) joins
01:07:37<@JAA>1.65M URLs in my article URLs list, 3.22M in rewby's list from the sitemaps. :-|
01:23:40dav3 quits [Remote host closed the connection]
01:24:33orly quits [Client Quit]
01:34:41Tansuke joins
01:42:57<@JAA>But 98k (90k without video-leaf) URLs in my list that aren't in the sitemaps.
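[editor's note] The comparison of the two URL lists above (URLs present in one list but not the other, with and without video-leaf entries) is a set difference. A minimal sketch, assuming both lists are plain text with one URL per line; the function name compare_url_lists and the sample URLs are hypothetical, and this is not necessarily how JAA computed the figures.

```python
def compare_url_lists(archive_urls, sitemap_urls):
    """Compare two iterables of URL strings (e.g. lines read from files)."""
    archive = {u.strip() for u in archive_urls if u.strip()}
    sitemap = {u.strip() for u in sitemap_urls if u.strip()}
    only_archive = archive - sitemap
    return {
        "only_archive": only_archive,
        # Same set, but ignoring the video-leaf pseudo-URLs.
        "only_archive_no_video_leaf": {
            u for u in only_archive if "video-leaf" not in u
        },
        "only_sitemap": sitemap - archive,
        "both": archive & sitemap,
    }


# Hypothetical sample data, not taken from the real lists.
archive_list = [
    "https://example.invalid/news/a/\n",
    "https://example.invalid/news/b/\n",
    "https://example.invalid/video-leaf/xyz\n",
]
sitemap_list = [
    "https://example.invalid/news/b/\n",
    "https://example.invalid/news/c/\n",
]
result = compare_url_lists(archive_list, sitemap_list)
print({name: len(urls) for name, urls in result.items()})
```

For lists of a few million URLs, holding both sets in memory is usually fine; for much larger inputs, sorting the files and comparing them line by line would be the alternative.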
01:54:14satoru1126 joins
01:55:36SomeRando quits [Ping timeout: 244 seconds]
01:57:18satoru1126 leaves
02:15:14Tansuke quits [Ping timeout: 244 seconds]
02:31:28<aaaaa>Folks, is there any way to scrape this? https://playboard.co/en/channel/UCeqUUXaM75wrK5Aalo6UorQ/videos
02:31:46<aaaaa>A lot of the videos aren't up anymore, but scraping metadata too is still good
02:36:20pm5 joins
02:43:07<thuban>pagination is js-triggered and done server-side via repeated POSTs, so it would not work in ab/wbm
02:43:31<thuban>but a custom scraper could do the pagination, then generate video urls to feed to archivebot
02:44:20<thuban>between the json and the video pages, that should get all the metadata
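[editor's note] The approach thuban outlines (follow the pagination with a custom scraper, then generate video URLs for ArchiveBot) can be sketched roughly as below. Playboard's actual request and response format is not described in the log, so the network call is abstracted behind a caller-supplied fetch_page function; the cursor scheme, the URL template, and all names here are assumptions for illustration only, not the pager thuban actually wrote.

```python
def paginate(fetch_page):
    """Yield unique item IDs by following cursors until none remain.

    fetch_page(cursor) must return (list_of_ids, next_cursor), where
    next_cursor is None on the last page. In a real scraper it would
    issue the POST request; here it is injected so the loop stays testable.
    """
    cursor = None
    seen = set()
    while True:
        items, cursor = fetch_page(cursor)
        for item in items:
            if item not in seen:
                seen.add(item)
                yield item
        # Stop on the last page, or if a page comes back empty.
        if cursor is None or not items:
            break


def to_urls(ids, template):
    """Turn IDs into page URLs using a format string with an {id} slot."""
    return [template.format(id=i) for i in ids]


# Fake three-page response standing in for the real server.
_pages = {None: (["a", "b"], 1), 1: (["b", "c"], 2), 2: (["d"], None)}


def fake_fetch(cursor):
    return _pages[cursor]


ids = list(paginate(fake_fetch))
# Placeholder template; the real video page URL pattern is an assumption.
urls = to_urls(ids, "https://example.invalid/video/{id}")
print("\n".join(urls))
```

Writing the resulting URLs one per line to a text file gives the kind of list that can be handed to ArchiveBot.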
02:46:53<thuban>JAA (or others): is that worth doing or would it be purely duplicative of the youtube archive work on this channel? i haven't been in #down-the-tube
03:05:12<aaaaa>were you guys able to download all the youtube videos from apple daily?
03:05:40tansuke joins
03:05:45<thuban>aaaaa: definitely a lot of them, but 'all' is unclear https://wiki.archiveteam.org/index.php/Apple_Daily#Others
03:10:43tansuke quits [Remote host closed the connection]
03:13:50Tansuke joins
03:14:25BlueMaxima quits [Read error: Connection reset by peer]
03:21:59<thuban>i wrote that pager just in case; it's running now
03:31:53Tansuke quits [Remote host closed the connection]
03:36:42<thuban>https://transfer.archivete.am/P369h/appledaily_videos_playboard_urls.txt.zst
03:48:48qw3rty__ joins
03:52:24qw3rty_ quits [Ping timeout: 258 seconds]
04:30:10<Ajay1>https://workspaceupdates.googleblog.com/2021/06/drive-file-link-updates.html
04:33:10DogsRNice quits [Read error: Connection reset by peer]
04:38:01ZizzyDizzyMC quits [Remote host closed the connection]
04:46:44Webuser431 joins
04:52:53orly (orly) joins
04:58:16G4te_Keep3r quits [Client Quit]
04:58:53Viniter69 (Viniter) joins
05:02:06Viniter6 quits [Ping timeout: 250 seconds]
05:02:06Viniter69 is now known as Viniter6
05:03:35G4te_Keep3r joins
05:03:37ZizzyDizzyMC joins
05:05:33orly quits [Client Quit]
06:20:26<jodizzle>JAA: If I'm doing this right, it looks like my m3u8 collector only actually found m3u8s from 9097 articles. Of those, 2219 are present in the hk.appledaily.com-archive-article-urls list you sent.
07:17:06orly (orly) joins
07:20:07leo60228 quits [Quit: ZNC 1.8.1 - https://znc.in]
07:21:48sonick quits [Client Quit]
07:23:09<orly>Hiya. I just realised I missed one university students' press when I posted that txt of Hong Kong stuff yesterday. It's the Chinese University Campus Radio. Facebook: cuhkcampusradio Youtube: UCg-D5uUTXfTSolC_zY5KXCg
07:23:29<Ajay1>Is there a channel for Google drive currently?
07:25:03<thuban>orly: mention the youtube channel in #down-the-tube ?
07:25:45ieh joins
07:25:45<orly>Sure thing. But they mainly upload politically sensitive stuff on their Facebook. Their Youtube is basically just an archive of their meeting recordings.
07:30:44HP_Archivist quits [Ping timeout: 250 seconds]
07:31:26Webuser431 quits [Ping timeout: 244 seconds]
07:33:11leo60228 (leo60228) joins
07:34:10ieh quits [Remote host closed the connection]
07:50:28channel13y4 joins
08:03:28ZizzyDizzyMC quits [Ping timeout: 244 seconds]
08:30:25<jodizzle>I started another appledaily video crawl via article URLs pointed at https://appledaily-hk-appledaily-prod.cdn.arcpublishing.com/, this time getting mp4s as well.
08:45:30<achivarin>Could you make sure to get https://appledaily-hk-appledaily-prod.cdn.arcpublishing.com/video/lifestyle/ first? We missed that YouTube channel. Thanks.
08:46:02sonick (sonick) joins
09:24:18Hackerpcs quits [Quit: Hackerpcs]
09:25:37Hackerpcs (Hackerpcs) joins
09:32:30<achivarin>jodizzle: How did you get mp4s? Can you give me a few pointers?
09:33:04nuroten quits [Remote host closed the connection]
10:01:56dm4v_ joins
10:01:56dm4v quits [Read error: Connection reset by peer]
10:02:08dm4v_ is now known as dm4v
10:02:10dm4v quits [Changing host]
10:02:10dm4v (dm4v) joins
10:06:44dm4v quits [Read error: Connection reset by peer]
10:06:57dm4v joins
10:06:59dm4v quits [Changing host]
10:06:59dm4v (dm4v) joins
10:08:58dm4v quits [Read error: Connection reset by peer]
10:09:02dm4v_ joins
10:09:19dm4v_ quits [Client Quit]
10:11:19dm4v joins
10:11:21dm4v quits [Changing host]
10:11:21dm4v (dm4v) joins
10:17:47dm4v_ joins
10:19:34dm4v quits [Ping timeout: 258 seconds]
10:19:34dm4v_ is now known as dm4v
10:19:34dm4v quits [Changing host]
10:19:34dm4v (dm4v) joins
10:21:54dm4v quits [Read error: Connection reset by peer]
10:22:42dm4v joins
10:22:45dm4v quits [Changing host]
10:22:45dm4v (dm4v) joins
10:22:58bbsky quits [Ping timeout: 244 seconds]
10:31:32marked is now known as marked1
10:51:11<wessel1512>avoozl it's going well with the viva forum grab and it will probably be done before the deadline
10:59:12CTL joins
10:59:32CTL quits [Remote host closed the connection]
11:16:22<alard>To add to that: The quick threads-only grab of the viva forum finished yesterday (https://archive.fart.website/archivebot/viewer/job/34m7c), so all messages should be in there. wessel's grab will be more complete with outgoing links, redirects from direct links to individual posts etc.
11:24:45<avoozl>wessel1512: cool, are there any warcs I can pick up and take a look at?
11:25:09<avoozl>alard: awesome
11:26:40<avoozl>I'll kick off some downloads and see if my parser still works
11:26:57<wessel1512>i don't know if they have been uploaded to IA yet
11:27:08<avoozl>the links in the website alard pasted are working
11:27:13orly quits [Client Quit]
11:27:36<avoozl>it'll take a day or so to get those here, but at least things are moving
11:27:46orly (orly) joins
11:28:28<wessel1512>you can download the warcs from https://archive.fart.website/archivebot/viewer/job/ade35
11:28:56<avoozl>I'll add them to the list
11:29:21<wessel1512>your list ?
11:29:37<avoozl>of things to download so I can start parsing
11:29:54<avoozl>I've been building a warc parser that extracts structured posts/threads from it so that it can be easily searched
11:30:36<avoozl>it produces json chunks like this https://archive.org/details/warceater_yahooanswers and has a little tool for building search indices on top of them and host them with a generic forum skin
11:30:47<wessel1512>then it's better to use https://archive.fart.website/archivebot/viewer/job/34m7c
11:31:04<avoozl>the second one has outlinks the first one is just the threads, right?
11:31:24<wessel1512>in reverse
11:31:46<avoozl>check
11:31:51<wessel1512>the first one has outlinks
11:31:54<avoozl>ok looking forward to having some time to play with this
11:31:58<avoozl>thanks
11:46:09channel13y4 quits [Ping timeout: 244 seconds]
12:18:32yano quits [Client Quit]
12:18:42yanome quits [Client Quit]
12:19:12VerifiedJ quits [Quit: The Lounge - https://thelounge.chat]
12:19:45yanome (yano) joins
12:20:00VerifiedJ (VerifiedJ) joins
12:20:47yano (yano) joins
12:56:47orly quits [Client Quit]
13:12:30Webuser431 joins
13:34:57Daloader joins
13:45:04y joins
13:45:49y quits [Remote host closed the connection]
13:52:13LeGoupil joins
14:15:54Daloader quits [Ping timeout: 250 seconds]
14:44:04LeGoupil quits [Ping timeout: 258 seconds]
15:02:25sec^nd quits [Ping timeout: 255 seconds]
15:11:10LeGoupil joins
15:17:16Daloader joins
15:17:26britm0b quits [Ping timeout: 250 seconds]
15:22:03britmob joins
15:22:12Arcorann quits [Ping timeout: 250 seconds]
15:34:15sec^nd (second) joins
16:04:52Larsenv quits [Client Quit]
16:08:53spirit joins
16:21:16LeGoupil quits [Client Quit]
16:28:30<aaaaa>Are there any tools that can scan whether your file contains any personally identifiable metadata (ex: IP address, location, browser fingerprint, etc)? Would like to use one before uploading things
16:29:17<aaaaa>For example, if you download YT metadata files through yt-dl, your ip is shown in the metadata .json files
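[editor's note] No specific tool is named in the log. As one narrow illustration of the kind of check aaaaa asks about, the sketch below scans text (for example the contents of a metadata .json file) for IPv4 addresses using only the Python standard library. It covers IPv4 only, not location data or browser fingerprints, and the sample string is hypothetical.

```python
import ipaddress
import re

# Four dot-separated groups of 1-3 digits, not embedded in a longer
# run of digits and dots (which avoids matching most version strings).
IPV4_CANDIDATE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def find_ipv4(text):
    """Return all syntactically valid IPv4 addresses found in text."""
    found = []
    for match in IPV4_CANDIDATE.finditer(text):
        candidate = match.group()
        try:
            # Rejects out-of-range octets such as 999.1.1.1.
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue
        found.append(candidate)
    return found


# Hypothetical metadata snippet; 203.0.113.7 is a documentation address.
sample_text = (
    '{"url": "https://host.invalid/videoplayback?ip=203.0.113.7&id=1", '
    '"version": "2021.06.06"}'
)
print(find_ipv4(sample_text))
```

Running find_ipv4 over each file before upload and reviewing any hits by hand is a reasonable first pass; an empty result does not prove that a file is free of identifying data.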
16:32:33britmob quits [Ping timeout: 258 seconds]
16:39:04mutantmonkey quits [Remote host closed the connection]
16:39:20mutantmonkey (mutantmonkey) joins
17:01:38HP_Archivist (HP_Archivist) joins
17:03:59monoxane quits [Ping timeout: 258 seconds]
17:10:53monoxane (monoxane) joins
17:30:15wyatt8750 quits [Remote host closed the connection]
17:34:39atphoenix quits [Ping timeout: 258 seconds]
17:36:19wyatt8740 joins
17:48:29Larsenv (Larsenv) joins
17:50:53<Ryz>...Oh, ooh, Windows 11 to support Android apps via Amazon's appstore (wondering how it would affect archiving operations): https://www.theverge.com/2021/6/24/22548428/microsoft-windows-11-android-apps-support-amazon-store
18:48:54HP_Archivist quits [Ping timeout: 250 seconds]
18:49:38C4K3 quits [Remote host closed the connection]
18:53:52<@JAA>jodizzle: Oops, forgot to upload this last night, here's my list of video streams from the archive pagination: https://transfer.archivete.am/IJGRu/hk.appledaily.com-archive-videos.zst
18:54:16<@JAA>I'll throw the MP4s into AB now but will let you handle the M3U8.
18:56:15<@JAA>(There are some duplicates in this list.)
18:58:19ZizzyDizzyMC joins
19:00:32<thuban>JAA: thoughts on the playboard metadata? (conversation between me and aaaaa above)
19:00:54C4K3 joins
19:14:28<@arkiver>thuban: fixed your item to mediatype image
19:15:11<thuban>arkiver: thank you!
19:17:38<thuban>if you're still planning on creating a collection for the show, it may be a good idea to move the thumbs there as well as the episode items
19:18:34<thuban>(the same user has also uploaded runs of a couple of other shows, which likewise don't have collections)
19:18:41<@JAA>HK Apple Daily MP4s are now running through AB, should be about 1.5 TB.
19:19:58<@JAA>thuban: Playboard is a metadata aggregator for YouTube it seems? Certainly can't hurt to grab the metadata, although we likely already have much of it in #youtubearchive.
19:20:36<@JAA>Metadata is also generally tiny, so duplication isn't much of a problem there.
19:22:01<thuban>JAA: that's right. the zst i uploaded (https://transfer.archivete.am/P369h/appledaily_videos_playboard_urls.txt.zst) has all the page urls and should be archivebot-ready.
19:23:30<@JAA>Hmm, only 9.8k videos?
19:24:04<thuban>to all appearances that's as many as playboard knew about
19:24:25<@JAA>Yeah, apparently. Weak. :-P
19:25:05<thuban>i'm checking now to see whether there's anything in the pagination json that isn't in the video page source
19:25:46<@JAA>Running through AB now.
19:32:39<thuban>^ the only thing missing is the channel's profile image's youtube url: https://yt3.ggpht.com/ytc/AKedOLSXcaGJFYW3dY0xIp9WOOx1JJtrDHyj909W38XbQw
19:34:55Matthww8 quits [Quit: Ping timeout (120 seconds)]
19:35:18Matthww8 joins
19:38:05Larsenv quits [Client Quit]
19:54:12spirit quits [Client Quit]
19:57:22Daloader quits [Ping timeout: 250 seconds]
20:14:17Larsenv (Larsenv) joins
20:28:43britmob joins
20:29:55HP_Archivist (HP_Archivist) joins
22:36:42omni quits [Read error: Connection reset by peer]
22:36:43omni joins
22:36:47<@Kaz>man
22:36:53xit quits [Client Quit]
22:36:57<@Kaz>I was just writing '#archiveteam-bs before we get shouted at'
22:37:00<@Kaz>and THERE WE GO
22:37:04<@JAA>:-)
22:37:52noteness quits [Ping timeout: 258 seconds]
22:38:48noteness (noteness) joins
22:39:00ThreeHM quits [Ping timeout: 250 seconds]
22:39:01mutantmonkey quits [Ping timeout: 258 seconds]
22:39:01Suika quits [Ping timeout: 258 seconds]
22:39:29xit joins
22:39:47luckcolors quits [Ping timeout: 258 seconds]
22:39:57Suika joins
22:40:11luckcolors (luckcolors) joins
22:40:18xkey quits [Ping timeout: 250 seconds]
22:40:18@AlsoJAA quits [Ping timeout: 250 seconds]
22:40:53C4K3 quits [Remote host closed the connection]
22:40:55C4K3 joins
22:41:01AlsoJAA (JAA) joins
22:41:01@ChanServ sets mode: +o AlsoJAA
22:41:12xkey (eyo) joins
22:51:18mutantmonkey (mutantmonkey) joins
23:03:25lorwp quits [Quit: ZNC - https://znc.in]
23:09:23ThreeHM (ThreeHeadedMonkey) joins