00:30:07DogsRNice (Webuser299) joins
00:51:16lorwp (lorwp) joins
01:02:48dm4v quits [Client Quit]
01:05:28dm4v joins
01:05:30dm4v quits [Changing host]
01:05:30dm4v (dm4v) joins
01:13:48fuzzy8021 quits [Read error: Connection reset by peer]
01:14:15fuzzy8021 (fuzzy8021) joins
01:16:06lorwp quits [Client Quit]
01:22:23<@JAA>https://appledaily-hk-appledaily-prod.cdn.arcpublishing.com/ is now 503ing.
01:23:11<thuban>F
01:34:33lorwp (lorwp) joins
01:34:55systwi quits [Ping timeout: 250 seconds]
01:36:05lorwp quits [Client Quit]
01:39:45<mgrandi>how much of it did we get?
01:41:44systwi (systwi) joins
01:42:56<@JAA>Don't think we grabbed anything from that host except for my /archive/ grab and jodizzle's attempt to collect more videos from article pages.
01:43:03lorwp (lorwp) joins
01:45:31<@JAA>On another note, I found that a web search for site:img.appledaily.com.tw brings up a bunch of interesting-looking things. PDFs, DOCs, 'classified' (i.e. ads) JS flipping book thingies, etc. Would be good to look into that at some point. Probably not in danger though since it's the Taiwanese Apple Daily.
01:48:50Webuser431 quits [Ping timeout: 244 seconds]
02:07:33Barto quits [Ping timeout: 258 seconds]
02:20:00Barto (Barto) joins
02:29:27BlueMaxima joins
02:49:54HP_Archivist quits [Ping timeout: 250 seconds]
02:54:42ThreeHM quits [Ping timeout: 258 seconds]
02:56:48ThreeHM (ThreeHeadedMonkey) joins
03:31:23atphoenix (atphoenix) joins
03:47:54qw3rty_ joins
03:51:26qw3rty__ quits [Ping timeout: 250 seconds]
04:00:00treora quits [Quit: blub blub.]
04:01:17treora joins
04:17:53Soulflare joins
04:20:30DogsRNice quits [Read error: Connection reset by peer]
04:34:56Soulflare quits [Client Quit]
04:35:08Soulflare (Soulflare) joins
04:47:09HP_Archivist (HP_Archivist) joins
06:21:12Arcorann (Arcorann) joins
06:39:59HP_Archivist quits [Client Quit]
07:49:24BlueMaxima quits [Read error: Connection reset by peer]
08:32:59<achivarin>thuban: Can you make sure to get Apple Daily's other channels on Playboard as well? Like https://playboard.co/en/channel/UCCzKM7UMxGCPAgUDXmFw5Gg/videos https://playboard.co/en/channel/UC-8CVMKt5Zlju_i07zhkC-Q/videos https://www.youtube.com/user/eatravel Thank you!
08:38:36<achivarin>Also, how do we piece the ts and m3u8 segments back together? Do we have the accompanying metadata?
08:43:02britmob quits [Remote host closed the connection]
08:43:10benjinsmith joins
08:43:27britmob joins
08:43:35lukash7 quits [Quit: Ping timeout (120 seconds)]
08:43:54lukash7 joins
08:45:27benjins quits [Ping timeout: 258 seconds]
08:48:12Krownest quits [Read error: Connection reset by peer]
08:48:15atphoenix_ (atphoenix) joins
08:48:17mrfooooo quits [Quit: Ping timeout (120 seconds)]
08:48:33Justin[home] joins
08:48:36jtagcat0 (jtagcat) joins
08:48:37chriscoffee_ (chriscoffee) joins
08:48:46swebb_ joins
08:48:48G4te_Keep3r quits [Client Quit]
08:48:48billy549 quits [Read error: Connection reset by peer]
08:48:49@Kaz quits [Quit: Ping timeout (120 seconds)]
08:48:51monoxane quits [Read error: Connection reset by peer]
08:48:51noteness quits [Client Quit]
08:48:51PlsNoJava4 (ROpdebee) joins
08:48:51starship_8601 quits [Quit: Ping timeout (120 seconds)]
08:48:52Arcorann quits [Remote host closed the connection]
08:48:54lennier1 quits [Read error: Connection reset by peer]
08:48:54taavi quits [Quit: Ping timeout (120 seconds)]
08:48:56<jodizzle>I have some metadata from the scraping process that could be used to piece them back together, yes. I should probably organize that. However, if you were willing to analyze them in bulk, you could also do it from the m3u8 files themselves. (Ideally, the m3u8s and ts files would all be ordered in the uploaded text files, but my scraping process screwed up the order a little bit.)
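Note on recovering segment order from the m3u8 files themselves, as jodizzle suggests: an HLS media playlist lists its segment URIs in playback order, so the order can be read straight out of the playlist. A minimal sketch, assuming plain, unencrypted media playlists; the function name and the example.org URLs are hypothetical and not taken from the scraped data.

```python
from urllib.parse import urljoin


def segment_urls(playlist_text: str, playlist_url: str) -> list[str]:
    """Return the URIs listed in an HLS playlist, in playlist order.

    Relative URIs are resolved against the playlist's own URL.
    """
    urls = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Lines starting with '#' are tags or comments; any other
        # non-empty line is a URI.
        if not line or line.startswith("#"):
            continue
        urls.append(urljoin(playlist_url, line))
    return urls


# Hypothetical playlist, for illustration only.
SAMPLE = """#EXTM3U
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
seg0.ts
#EXTINF:10.0,
seg1.ts
#EXT-X-ENDLIST
"""

ordered = segment_urls(SAMPLE, "https://example.org/video/index.m3u8")
```

For a master playlist the same loop returns the variant playlists rather than .ts segments, so those would need a second pass.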
08:48:57phiresky quits [Quit: Ping timeout (120 seconds)]
08:48:57Arcorann (Arcorann) joins
08:48:57benjinss joins
08:48:58Krownest (Krownest) joins
08:48:58coderobe quits [Quit: Ping timeout (120 seconds)]
08:48:59swebb quits [Quit: ZNC 1.7.1+deb1+xenial1 - https://znc.in]
08:49:00lun4 quits [Client Quit]
08:49:00vela0 (vela) joins
08:49:00ave2 (ave) joins
08:49:02starship_8601 (starship_8601) joins
08:49:02Kaz (Kaz) joins
08:49:02@ChanServ sets mode: +o Kaz
08:49:03mrfooooo joins
08:49:04taavi (taavi) joins
08:49:04monoxane (monoxane) joins
08:49:05G4te_Keep3r joins
08:49:06phiresky joins
08:49:09noteness (noteness) joins
08:49:09jtagcat quits [Read error: Connection reset by peer]
08:49:09jtagcat0 is now known as jtagcat
08:49:10kiska quits [Quit: Ping timeout (120 seconds)]
08:49:13lennier1 (lennier1) joins
08:49:14lun4 (lun4) joins
08:49:15lunik1 quits [Quit: Ping timeout (120 seconds)]
08:49:18coderobe (coderobe) joins
08:49:21xit quits [Client Quit]
08:49:27lunik1 joins
08:49:38xit joins
08:49:42<jodizzle>To answer your earlier question achivarin, I got the mp4s by just using a pretty lazy regex that appeared to work well.
08:49:43Ryz quits [Quit: Ping timeout (120 seconds)]
08:50:01Ryz (Ryz) joins
08:50:02PlsNoJava quits [Read error: Connection reset by peer]
08:50:02PlsNoJava4 is now known as PlsNoJava
08:50:03DopefishJustin quits [Ping timeout: 258 seconds]
08:50:06kiska (kiska) joins
08:50:35ave quits [Read error: Connection reset by peer]
08:50:35ave2 is now known as ave
08:51:12atphoenix quits [Ping timeout: 258 seconds]
08:51:35vela quits [Ping timeout: 258 seconds]
08:51:35vela0 is now known as vela
08:51:54<jodizzle>JAA: I've uploaded the last of my m3u8s (I think). Later, I'll dedupe against your list and process any additional ones.
08:51:58benjinsmith quits [Ping timeout: 258 seconds]
08:53:07chriscoffee quits [Ping timeout: 622 seconds]
08:53:07chriscoffee_ is now known as chriscoffee
08:54:32<jodizzle>I did go ahead and diff the mp4s I collected against your list, and found 74959 unique to mine. I made an AB job for them.
08:54:53billy549 (Billy549) joins
09:22:39benjinss is now known as benjins
10:41:50Matthww83 joins
10:42:45Matthww8 quits [Ping timeout: 258 seconds]
10:42:45Matthww83 is now known as Matthww8
10:45:34Matthww85 joins
10:47:26Matthww8 quits [Ping timeout: 250 seconds]
10:47:26Matthww85 is now known as Matthww8
11:01:05<@JAA>Nice! It's weird that /archive/ is so incomplete.
11:01:39<@JAA>Or perhaps I screwed up my extraction. It was very crude.
11:19:33wizards quits [Ping timeout: 258 seconds]
11:21:24wizards joins
11:33:44benjins quits [Ping timeout: 258 seconds]
11:51:33atphoenix_ is now known as atphoenix
12:21:14ZizzyDizzyMC quits [Ping timeout: 244 seconds]
14:41:47channel13y4 joins
14:41:52channel13y4 quits [Remote host closed the connection]
15:29:52Arcorann quits [Ping timeout: 258 seconds]
15:39:32dav3 joins
15:43:50<@JAA>jodizzle: We could go through the AB job WARCs to perhaps find more videos as well. Bit of a pain though since that's 550 GiB (so far)... :-|
15:44:15<@JAA>rewby: Want to do that?
15:44:44<dav3>hi, i have about 2TB of hk-appledaily data from 2014-2021 scraped using the /archive/ pages, mostly videos, images and html. the data is on 2 1TB servers, what is the best way of transferring the data?
15:45:30<@JAA>Hi dav3. What data format? WARCs? Plain files? Something else?
15:45:59<dav3>plain files, mp4, html, jpg/gif
15:49:04<@JAA>Not entirely sure. arkiver, any advice?
15:49:41<@JAA>dav3: Do you still have the URLs for the MP4s? We also grabbed about 1.5 TB of those last night, but would be nice to compare.
15:55:38<dav3>sure, i can generate a list of video urls
16:19:48Doranwen quits [Ping timeout: 250 seconds]
16:33:16<dav3>https://mega.nz/file/UgNU2DTD#mycIn1NqzWHEHhnPi2q45cM7PqiGyzHRC-Aw_OPUI0g
16:34:26<@JAA>Thanks! Rehosted for future reference: https://transfer.archivete.am/rtGCF/hk%20appledaily%20video%20data%202014-2021.zip
16:34:53<@arkiver>getting reports that thestandnews.com and beta.thestandnews.com may be next
16:35:19<@arkiver>at least thestandnews.com has sitemaps
16:35:31<@JAA>jodizzle: ^ Also some M3U8 in dav3's list.
16:35:32<@arkiver>did we already get these, and if not, can we get these?
16:35:44<@arkiver>(ping rewby as well on thestandnews)
16:36:00<rewby>I already sent them
16:36:09<rewby>Check the logs
16:36:09Doranwen (Doranwen) joins
16:36:23<@arkiver>yes, just saw
16:36:25<@JAA>Looks like we had a job for a bunch of *.thestandnews.com stuff in Aug 2019.
16:36:31<@arkiver>EggplantN: have all lists from rewby been queued?
16:36:38<@JAA>s/a job/AB jobs/
16:36:41<@EggplantN>uh
16:36:42<@EggplantN>yes
16:36:46<@EggplantN>they had
16:37:14<@arkiver>the lists from rewby for thestandnews hongkongfp and polymerkhk
16:37:19<@arkiver>alright, good
16:37:25<rewby>I didn't do beta yet, I'll run that after I've had a shower
16:37:32<@arkiver>thanks rewby
16:38:51<@arkiver>EggplantN: have the archives for the lists from rewby already been uploaded to IA, or are they stashed somewhere?
16:39:12<@EggplantN>they were through #//
16:39:16<@EggplantN>so they should be uploaded
16:45:24<@JAA>dav3: Hmm, line 569 in the 2014-2018 file is malformed.
16:46:20<@JAA>Guessing that should be https://d2i91erehhsxi2.cloudfront.net/appledaily/2020/04/02/5e8641adc9e77c0009709acb/08012014_ent_05_w.mp4
16:48:06<dav3>oops not sure what happened there. i will make a new file
16:49:28<@JAA>Looks like you found 16781 videos that aren't in my list from /archive/.
16:49:36<@JAA>Comparing with the other lists we have now.
16:52:48<@JAA>Filtering out all the video lists mentioned on the wiki brings it to 11661 videos.
16:54:16<@JAA>https://transfer.archivete.am/4O5K7/hk.appledaily.com-dav3-video-urls-filtered.zst
16:54:59<@JAA>That's dav3's list from the ZIP above minus https://transfer.archivete.am/IJGRu/hk.appledaily.com-archive-videos.zst minus jodizzle's nine lists on the wiki page.
16:55:25<@JAA>jodizzle: ^ More M3U8 for you. :-)
16:56:33<@JAA>Throwing the `grep '\.mp4$'` of that into ArchiveBot now.
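Note on the filtering described above (dav3's list minus the /archive/ list minus jodizzle's lists, then `grep '\.mp4$'`): it amounts to a set difference followed by a suffix filter. A minimal sketch, assuming each list is plain text with one URL per line once decompressed; the function names and example URLs are hypothetical, and this is not the command that was actually run.

```python
def filter_new(candidates, *known_lists):
    """Return candidate URLs that appear in none of the known lists.

    Order of first appearance is kept and duplicates are dropped.
    """
    known = set()
    for lst in known_lists:
        known.update(u.strip() for u in lst)
    seen = set()
    result = []
    for url in candidates:
        url = url.strip()
        if url and url not in known and url not in seen:
            seen.add(url)
            result.append(url)
    return result


def only_mp4(urls):
    """Keep URLs ending in .mp4, like grep '\\.mp4$'."""
    return [u for u in urls if u.endswith(".mp4")]


# Hypothetical lists, for illustration only.
candidate_list = [
    "https://example.org/a.mp4",
    "https://example.org/b.m3u8",
    "https://example.org/c.mp4",
    "https://example.org/c.mp4",
]
archive_list = ["https://example.org/a.mp4"]
other_list = ["https://example.org/d.mp4"]

remaining = filter_new(candidate_list, archive_list, other_list)
for_archivebot = only_mp4(remaining)
```

Exact string comparison treats URLs that differ only in scheme or query string as different, so a normalisation step may be needed before comparing lists from different scrapers.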
17:00:47<@arkiver>EggplantN: do you happen to have an easy list of all of them?
17:00:59<@EggplantN>not on hand
17:01:06<@EggplantN>you can always try and queue again if you wanna check
17:01:18<@arkiver>mostly thinking about embedded images now
17:01:31<@EggplantN>dont need to use urls.js as the ones i did all had valid URLs
17:01:36<@arkiver>i remember one appledaily had images which could not be extracted without custom code
17:01:48<@arkiver>yeah
17:02:28<@arkiver>so for next sites, if we want the embedded image, let's make sure the domain is in the https://github.com/ArchiveTeam/urls-grab/blob/master/extract-outlinks-patterns.txt list
17:02:44<@arkiver>the HTML alone is already very important of course
17:06:59<@JAA>Hmm, I just remembered something... We can probably still grab Apple Daily images. Checking now.
17:08:14<@JAA>Most article images were using a resizing server thingy, but the original image URL is in that resize URL and on another server and still up for now.
17:08:29<@JAA>E.g. https://hk.appledaily.com/resizer/4NRo9yeFwLnX468wDNUDZpiEhvs=/494x/filters:quality(100)/cloudfront-ap-northeast-1.images.arcpublishing.com/appledaily/VXECD5EO7BR3AW7JAJRQB6OJZI.jpg → https://cloudfront-ap-northeast-1.images.arcpublishing.com/appledaily/VXECD5EO7BR3AW7JAJRQB6OJZI.jpg
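Note on the resizer-to-original mapping in JAA's example: the original image host and path are embedded at the end of the resize URL, so they can be pulled out with a pattern match. A minimal sketch, assuming every resize URL has the same shape as the one example above (a `/resizer/` prefix, some options, then a `*.images.arcpublishing.com` host and path); the function name is hypothetical and the pattern has only been checked against that single example.

```python
import re

# Matches the embedded "<host>.images.arcpublishing.com/<path>" part
# that follows the signature, size and filter segments.
_ORIGINAL = re.compile(
    r"/resizer/.*?/([a-z0-9.-]+\.images\.arcpublishing\.com/.+)$"
)


def original_image_url(resize_url: str):
    """Return the original image URL embedded in a resize URL, or None."""
    match = _ORIGINAL.search(resize_url)
    if match is None:
        return None
    return "https://" + match.group(1)


example = (
    "https://hk.appledaily.com/resizer/4NRo9yeFwLnX468wDNUDZpiEhvs=/494x/"
    "filters:quality(100)/cloudfront-ap-northeast-1.images.arcpublishing.com/"
    "appledaily/VXECD5EO7BR3AW7JAJRQB6OJZI.jpg"
)
original = original_image_url(example)
```

Returning None for anything that does not match makes it easy to collect and inspect URLs that follow a different layout instead of silently mangling them.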
17:10:05<dav3>https://mega.nz/file/V18g3B6b#JXAAje1bTPNox-z-vdIEN5cPLdZbQWu48K4ePfybs1w V2 2014-2018
17:10:06<rewby>arkiver: beta.thestandnews.com just redirects sitemaps to www.'s sitemap.
17:18:11<dav3>i downloaded ~533,000 images. i will output a url and file list for those..
17:22:09<@JAA>Rehosted: https://transfer.archivete.am/14x3GF/video_data_2014-2018_V2.zip
17:27:01<@JAA>Huh, that unearthed another 36 MP4s that aren't elsewhere. Nice. :-)
17:31:28<@JAA>(No extra M3U8)
17:31:55<jodizzle>Ugh, so many m3u8s to unpack…
17:32:22<@JAA>Yeah, and it's getting messy to see what we have and haven't covered despite the wiki page. :-|
17:35:07<@hook54321>!a https://www.cesletter.info/
17:35:10<@hook54321>woops
17:40:54<dav3>2014-2018 images urls https://mega.nz/file/40cVHKJa#oK1FOJibRxhB4ZjdgMyfh2_RWR3nZ0U38SBsGuQI08U
17:42:12<@JAA>Rehosted: https://transfer.archivete.am/xSRbu/2014-2018_image_data.zip
17:48:03<@JAA>I'll process this later if no one beats me to it.
17:48:13Daloader joins
17:49:25<dav3>2019-2021 images urls https://mega.nz/file/NoNDASRL#NCG5tNkoh2kC9U4MqEQBALePvTOI8q9kLq07YB_ZcNA
17:49:35<@JAA>Cheers
17:49:58<@JAA>https://transfer.archivete.am/DtK5q/2019-2021_image_data.zip
17:57:22DogsRNice (Webuser299) joins
17:59:56<AK>JAA: for the two above, any cleanup needing doing or can I just throw all the urls into ab? (I can manage getting the urls out and into a list)
18:04:13<@JAA>AK: Not sure yet if AB or #// is better for these.
18:04:28<@JAA>Depends on how many more there are. I'll get the ones from the AB jobs as well.
18:04:42<@JAA>But these are the full images, so no processing needed in that sense.
18:04:54<AK>Alright, I'll turn them both into one list, then upload it here and it can go somewhere else
18:05:27<@JAA>Upload as hk.appledaily.com-dav3-image-urls.zst please. :-)
18:05:47<@EggplantN>Are they the same domain per URL? If so turn on page reqs
18:06:13<@JAA>EggplantN: They're just direct URLs for images. The pages they were on aren't available anymore.
18:06:43<@EggplantN>Ah okie. Sure #// them that box can take up to 11Gbit peaks inbound
18:06:59<AK>Now I've gotta work out how I zst them lmao
18:08:43<@JAA>Nice thing about AB is that it produces a single grouped dataset.
18:09:07<@JAA>But yeah, will get the ones from the two AB job DBs later and then decide.
18:12:54<AK>https://transfer.archivete.am/EWVa4/hk.appledaily.com-dav3-image-urls.zst
18:13:00<AK>Fuck knows if I managed that correctly
18:13:45<@JAA>Thanks
18:14:06<AK>I'm not sure I did
18:14:07<AK>lol
18:14:49<AK>Actually I think I did
18:14:58<@arkiver>zstd filename
18:15:01<@arkiver>did you do that?
18:15:05<@arkiver>then it should be fine
18:15:07<@arkiver>:P
18:15:36<@JAA>I usually increase the level a bit. If I feel patient, I go for `--ultra -22`.
18:15:45<@JAA>But yeah, that'll do.
18:15:58<@JAA>And will still beat `gzip -9` in virtually all cases.
18:16:13<AK>Don't get too mad, but I used 7zip gui lol
18:16:20<AK>Figured 11 was better than the default of 1
18:16:33<@JAA>lol
18:16:36<AK>Used https://github.com/mcmilk/7-Zip-zstd
18:16:54<AK>Powershell for the extraction, then used a nice gui for the compressing back
18:17:04<AK>Can you tell I'm a windows admin ;)
18:17:12<@JAA>Yes, my condolences.
18:17:54<@JAA>I'd probably have done `awk -F, '{print $3}' input | zstd -8 -o out.zst`.
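Note on the `awk -F, '{print $3}'` step: it prints the third comma-separated field of every input line. A rough Python equivalent follows, assuming (as JAA's command implies, but the log does not confirm) that the URL sits in the third column and that fields contain no quoted commas; the function name and sample rows are hypothetical. Compression is left to the `zstd` command-line tool.

```python
def third_field(lines):
    """Yield the third comma-separated field of each line.

    Like awk, a line with fewer than three fields yields an empty string.
    """
    for line in lines:
        fields = line.rstrip("\n").split(",")
        yield fields[2] if len(fields) >= 3 else ""


# Hypothetical rows, for illustration only.
rows = [
    "1,cover.jpg,https://example.org/img/cover.jpg\n",
    "2,photo.gif,https://example.org/img/photo.gif\n",
    "malformed line\n",
]
urls = list(third_field(rows))
```

Like `awk -F,`, this splits on every comma; if any field could itself contain a comma, the `csv` module would be the safer choice.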
18:17:54<@arkiver>damn
18:18:09<@JAA>But I won't kinkshame. :-)
18:18:19<@arkiver>if it works it works :P
18:18:26<@arkiver>AK will upgrade to Linux at some points :)
18:18:32<@arkiver>point*
18:23:54<@HCross>_stares intently at arkiver_
18:33:08<AK>arkiver I use linux loads, how else do you download your Windows 10 iso?
18:55:19benjins joins
19:22:33omni quits [Ping timeout: 258 seconds]
19:25:18HP_Archivist (HP_Archivist) joins
20:01:24simon816 quits [Read error: Connection reset by peer]
20:01:38benjinsmith joins
20:04:43benjins quits [Ping timeout: 258 seconds]
20:10:50simon816 (simon816) joins
20:15:09alard quits [Quit: ZNC 1.7.2+deb3 - https://znc.in]
20:27:55AlsoHP_Archivist joins
20:29:04benjinsmith is now known as benjins
20:30:01HP_Archivist quits [Ping timeout: 258 seconds]
20:35:27HP_Archivist (HP_Archivist) joins
20:35:56nuroten joins
20:36:33wyatt8750 joins
20:37:38AlsoHP_Archivist quits [Ping timeout: 250 seconds]
20:37:38wyatt8740 quits [Ping timeout: 250 seconds]
20:45:31<thuban>achivarin: yep, running
21:04:52<thuban>would someone please queue https://transfer.archivete.am/J0QXS/playboard_UCCzKM7UMxGCPAgUDXmFw5Gg_urls.txt.zst and https://transfer.archivete.am/dDyMC/playboard_UC-8CVMKt5Zlju_i07zhkC-Q_urls.txt.zst ?
21:05:05<AK>Can do
21:05:13<thuban>thanks!
21:07:24<thuban>achivarin: i am not sure whether https://www.youtube.com/user/eatravel corresponds to one of the playboard channel urls you linked or to something else. youtube says the page is not available; can you explain?
22:03:00Daloader quits [Ping timeout: 250 seconds]
22:17:16AlsoHP_Archivist joins
22:19:02HP_Archivist quits [Ping timeout: 250 seconds]
22:55:26Larsenv quits [Ping timeout: 250 seconds]
23:01:49<AK>BT Community Webkit closed on the 24th May 2021.
23:01:53<AK>https://appletonlemoorshistory.chessck.co.uk/SiteClosed.aspx
23:02:02<AK>Did anyone know about BT Community Webkit?
23:02:26<@JAA>Yeah, it was known.
23:02:35<AK>Damn, ahh well
23:02:46<@EggplantN>AK it was done
23:02:55<@EggplantN>see archiveteam_inbox
23:02:59<@EggplantN>HCross did it
23:03:14BlueMaxima joins
23:03:45<AK>Awesome
23:04:04<@EggplantN>https://archive.org/details/archiveteam_inbox?query=BT&sin=