| 00:30:07 | | DogsRNice (Webuser299) joins |
| 00:51:16 | | lorwp (lorwp) joins |
| 01:02:48 | | dm4v quits [Client Quit] |
| 01:05:28 | | dm4v joins |
| 01:05:30 | | dm4v is now authenticated as dm4v |
| 01:05:30 | | dm4v quits [Changing host] |
| 01:05:30 | | dm4v (dm4v) joins |
| 01:13:48 | | fuzzy8021 quits [Read error: Connection reset by peer] |
| 01:14:15 | | fuzzy8021 (fuzzy8021) joins |
| 01:16:06 | | lorwp quits [Client Quit] |
| 01:22:23 | <@JAA> | https://appledaily-hk-appledaily-prod.cdn.arcpublishing.com/ is now 503ing. |
| 01:23:11 | <thuban> | F |
| 01:34:33 | | lorwp (lorwp) joins |
| 01:34:55 | | systwi quits [Ping timeout: 250 seconds] |
| 01:36:05 | | lorwp quits [Client Quit] |
| 01:39:45 | <mgrandi> | how much of it did we get? |
| 01:41:44 | | systwi (systwi) joins |
| 01:42:56 | <@JAA> | Don't think we grabbed anything from that host except for my /archive/ grab and jodizzle's attempt to collect more videos from article pages. |
| 01:43:03 | | lorwp (lorwp) joins |
| 01:45:31 | <@JAA> | On another note, I found that a web search for site:img.appledaily.com.tw brings up a bunch of interesting-looking things. PDFs, DOCs, 'classified' (i.e. ads) JS flipping book thingies, etc. Would be good to look into that at some point. Probably not in danger though since it's the Taiwanese Apple Daily. |
| 01:48:50 | | Webuser431 quits [Ping timeout: 244 seconds] |
| 02:07:33 | | Barto quits [Ping timeout: 258 seconds] |
| 02:20:00 | | Barto (Barto) joins |
| 02:29:27 | | BlueMaxima joins |
| 02:49:54 | | HP_Archivist quits [Ping timeout: 250 seconds] |
| 02:54:42 | | ThreeHM quits [Ping timeout: 258 seconds] |
| 02:56:48 | | ThreeHM (ThreeHeadedMonkey) joins |
| 03:31:23 | | atphoenix (atphoenix) joins |
| 03:47:54 | | qw3rty_ joins |
| 03:51:26 | | qw3rty__ quits [Ping timeout: 250 seconds] |
| 04:00:00 | | treora quits [Quit: blub blub.] |
| 04:01:17 | | treora joins |
| 04:17:53 | | Soulflare joins |
| 04:20:30 | | DogsRNice quits [Read error: Connection reset by peer] |
| 04:30:51 | | Soulflare is now authenticated as Soulflare |
| 04:34:56 | | Soulflare quits [Client Quit] |
| 04:35:08 | | Soulflare (Soulflare) joins |
| 04:47:09 | | HP_Archivist (HP_Archivist) joins |
| 06:21:12 | | Arcorann (Arcorann) joins |
| 06:39:59 | | HP_Archivist quits [Client Quit] |
| 07:49:24 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 08:32:59 | <achivarin> | thuban: Can you make sure to get Apple Daily's other channels on Playboard as well? Like https://playboard.co/en/channel/UCCzKM7UMxGCPAgUDXmFw5Gg/videos https://playboard.co/en/channel/UC-8CVMKt5Zlju_i07zhkC-Q/videos https://www.youtube.com/user/eatravel Thank you! |
| 08:38:36 | <achivarin> | Also, how do we piece the ts and m3u8 segments back together? Do we have the accompanying metadata? |
| 08:43:02 | | britmob quits [Remote host closed the connection] |
| 08:43:10 | | benjinsmith joins |
| 08:43:27 | | britmob joins |
| 08:43:35 | | lukash7 quits [Quit: Ping timeout (120 seconds)] |
| 08:43:54 | | lukash7 joins |
| 08:45:27 | | benjins quits [Ping timeout: 258 seconds] |
| 08:48:12 | | Krownest quits [Read error: Connection reset by peer] |
| 08:48:15 | | atphoenix_ (atphoenix) joins |
| 08:48:17 | | mrfooooo quits [Quit: Ping timeout (120 seconds)] |
| 08:48:33 | | Justin[home] joins |
| 08:48:33 | | Justin[home] is now authenticated as DopefishJustin |
| 08:48:36 | | jtagcat0 (jtagcat) joins |
| 08:48:37 | | chriscoffee_ (chriscoffee) joins |
| 08:48:46 | | swebb_ joins |
| 08:48:48 | | G4te_Keep3r quits [Client Quit] |
| 08:48:48 | | billy549 quits [Read error: Connection reset by peer] |
| 08:48:49 | | @Kaz quits [Quit: Ping timeout (120 seconds)] |
| 08:48:51 | | monoxane quits [Read error: Connection reset by peer] |
| 08:48:51 | | noteness quits [Client Quit] |
| 08:48:51 | | PlsNoJava4 (ROpdebee) joins |
| 08:48:51 | | starship_8601 quits [Quit: Ping timeout (120 seconds)] |
| 08:48:52 | | Arcorann quits [Remote host closed the connection] |
| 08:48:54 | | lennier1 quits [Read error: Connection reset by peer] |
| 08:48:54 | | taavi quits [Quit: Ping timeout (120 seconds)] |
| 08:48:56 | <jodizzle> | I have some metadata from the scraping process that could be used to piece them back together, yes. I should probably organize that. However, if you were willing to analyze them in bulk, you could also do it from the m3u8 files themselves. (Ideally, the m3u8s and ts files would all be ordered in the uploaded text files, but my scraping process screwed up the order a little bit.) |
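A minimal sketch of recovering segment order "from the m3u8 files themselves", as jodizzle suggests. It assumes a plain HLS media playlist in which every non-blank line that does not start with `#` is a segment URI; a master playlist would list variant playlists instead. The function name `segment_urls` and the example.com URLs are hypothetical and not part of the actual scraping tooling.

```python
from urllib.parse import urljoin


def segment_urls(playlist_text: str, playlist_url: str) -> list[str]:
    """Return the segment URLs of a media playlist in playback order."""
    urls = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Tags and comments start with '#'; every other non-blank line is a URI.
        if not line or line.startswith("#"):
            continue
        # Relative URIs are resolved against the playlist's own URL.
        urls.append(urljoin(playlist_url, line))
    return urls


if __name__ == "__main__":
    sample = (
        "#EXTM3U\n"
        "#EXT-X-TARGETDURATION:10\n"
        "#EXTINF:10.0,\n"
        "seg0.ts\n"
        "#EXTINF:10.0,\n"
        "seg1.ts\n"
        "#EXT-X-ENDLIST\n"
    )
    for url in segment_urls(sample, "https://example.com/video/index.m3u8"):
        print(url)
```

The ordered list can then be matched against the downloaded ts files. Whether simply concatenating the segments in that order yields a playable file for this data has not been checked here.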
| 08:48:57 | | phiresky quits [Quit: Ping timeout (120 seconds)] |
| 08:48:57 | | Arcorann (Arcorann) joins |
| 08:48:57 | | benjinss joins |
| 08:48:58 | | Krownest (Krownest) joins |
| 08:48:58 | | coderobe quits [Quit: Ping timeout (120 seconds)] |
| 08:48:59 | | swebb quits [Quit: ZNC 1.7.1+deb1+xenial1 - https://znc.in] |
| 08:49:00 | | lun4 quits [Client Quit] |
| 08:49:00 | | vela0 (vela) joins |
| 08:49:00 | | ave2 (ave) joins |
| 08:49:02 | | starship_8601 (starship_8601) joins |
| 08:49:02 | | Kaz (Kaz) joins |
| 08:49:02 | | @ChanServ sets mode: +o Kaz |
| 08:49:03 | | mrfooooo joins |
| 08:49:04 | | taavi (taavi) joins |
| 08:49:04 | | monoxane (monoxane) joins |
| 08:49:05 | | G4te_Keep3r joins |
| 08:49:06 | | phiresky joins |
| 08:49:09 | | noteness (noteness) joins |
| 08:49:09 | | jtagcat quits [Read error: Connection reset by peer] |
| 08:49:09 | | jtagcat0 is now known as jtagcat |
| 08:49:10 | | kiska quits [Quit: Ping timeout (120 seconds)] |
| 08:49:13 | | lennier1 (lennier1) joins |
| 08:49:14 | | lun4 (lun4) joins |
| 08:49:15 | | lunik1 quits [Quit: Ping timeout (120 seconds)] |
| 08:49:18 | | coderobe (coderobe) joins |
| 08:49:21 | | xit quits [Client Quit] |
| 08:49:27 | | lunik1 joins |
| 08:49:38 | | xit joins |
| 08:49:42 | <jodizzle> | To answer your earlier question achivarin, I got the mp4s by just using a pretty lazy regex that appeared to work well. |
| 08:49:43 | | Ryz quits [Quit: Ping timeout (120 seconds)] |
| 08:50:01 | | Ryz (Ryz) joins |
| 08:50:02 | | PlsNoJava quits [Read error: Connection reset by peer] |
| 08:50:02 | | PlsNoJava4 is now known as PlsNoJava |
| 08:50:03 | | DopefishJustin quits [Ping timeout: 258 seconds] |
| 08:50:06 | | kiska (kiska) joins |
| 08:50:35 | | ave quits [Read error: Connection reset by peer] |
| 08:50:35 | | ave2 is now known as ave |
| 08:51:12 | | atphoenix quits [Ping timeout: 258 seconds] |
| 08:51:35 | | vela quits [Ping timeout: 258 seconds] |
| 08:51:35 | | vela0 is now known as vela |
| 08:51:54 | <jodizzle> | JAA: I've uploaded the last of my m3u8s (I think). Later, I'll dedupe against your list and process any additional ones. |
| 08:51:58 | | benjinsmith quits [Ping timeout: 258 seconds] |
| 08:53:07 | | chriscoffee quits [Ping timeout: 622 seconds] |
| 08:53:07 | | chriscoffee_ is now known as chriscoffee |
| 08:54:32 | <jodizzle> | I did go ahead and diff the mp4s I collected against your list, and found 74959 unique to mine. I made an AB job for them. |
| 08:54:53 | | billy549 (Billy549) joins |
| 09:22:39 | | benjinss is now known as benjins |
| 09:22:41 | | benjins is now authenticated as benjins |
| 10:41:50 | | Matthww83 joins |
| 10:42:45 | | Matthww8 quits [Ping timeout: 258 seconds] |
| 10:42:45 | | Matthww83 is now known as Matthww8 |
| 10:45:34 | | Matthww85 joins |
| 10:47:26 | | Matthww8 quits [Ping timeout: 250 seconds] |
| 10:47:26 | | Matthww85 is now known as Matthww8 |
| 11:01:05 | <@JAA> | Nice! It's weird that /archive/ is so incomplete. |
| 11:01:39 | <@JAA> | Or perhaps I screwed up my extraction. It was very crude. |
| 11:19:33 | | wizards quits [Ping timeout: 258 seconds] |
| 11:21:24 | | wizards joins |
| 11:33:44 | | benjins quits [Ping timeout: 258 seconds] |
| 11:51:33 | | atphoenix_ is now known as atphoenix |
| 12:21:14 | | ZizzyDizzyMC quits [Ping timeout: 244 seconds] |
| 14:41:47 | | channel13y4 joins |
| 14:41:52 | | channel13y4 quits [Remote host closed the connection] |
| 15:29:52 | | Arcorann quits [Ping timeout: 258 seconds] |
| 15:39:32 | | dav3 joins |
| 15:43:50 | <@JAA> | jodizzle: We could go through the AB job WARCs to perhaps find more videos as well. Bit of a pain though since that's 550 GiB (so far)... :-| |
| 15:44:15 | <@JAA> | rewby: Want to do that? |
| 15:44:44 | <dav3> | hi, i have about 2TB of hk-appledaily data from 2014-2021 scraped using the /archive/ pages, mostly videos, images and html. the data is on 2 1TB servers, what is the best way of transferring the data? |
| 15:45:30 | <@JAA> | Hi dav3. What data format? WARCs? Plain files? Something else? |
| 15:45:59 | <dav3> | plain files, mp4, html, jpg/gif |
| 15:49:04 | <@JAA> | Not entirely sure. arkiver, any advice? |
| 15:49:41 | <@JAA> | dav3: Do you still have the URLs for the MP4s? We also grabbed about 1.5 TB of those last night, but would be nice to compare. |
| 15:55:38 | <dav3> | sure, i can generate a list of video urls |
| 16:19:48 | | Doranwen quits [Ping timeout: 250 seconds] |
| 16:33:16 | <dav3> | https://mega.nz/file/UgNU2DTD#mycIn1NqzWHEHhnPi2q45cM7PqiGyzHRC-Aw_OPUI0g |
| 16:34:26 | <@JAA> | Thanks! Rehosted for future reference: https://transfer.archivete.am/rtGCF/hk%20appledaily%20video%20data%202014-2021.zip |
| 16:34:53 | <@arkiver> | getting reports that thestandnews.com and beta.thestandnews.com may be next |
| 16:35:19 | <@arkiver> | at least thestandnews.com has sitemaps |
| 16:35:31 | <@JAA> | jodizzle: ^ Also some M3U8 in dav3's list. |
| 16:35:32 | <@arkiver> | did we already get these, and if not, can we get these? |
| 16:35:44 | <@arkiver> | (ping rewby as well on thestandnews) |
| 16:36:00 | <rewby> | I already sent them |
| 16:36:09 | <rewby> | Check the logs |
| 16:36:09 | | Doranwen (Doranwen) joins |
| 16:36:23 | <@arkiver> | yes, just saw |
| 16:36:25 | <@JAA> | Looks like we had a job for a bunch of *.thestandnews.com stuff in Aug 2019. |
| 16:36:31 | <@arkiver> | EggplantN: have all lists from rewby been queued? |
| 16:36:38 | <@JAA> | s/a job/AB jobs/ |
| 16:36:41 | <@EggplantN> | uh |
| 16:36:42 | <@EggplantN> | yes |
| 16:36:46 | <@EggplantN> | they had |
| 16:37:14 | <@arkiver> | the lists from rewby for thestandnews, hongkongfp and polymerkhk |
| 16:37:19 | <@arkiver> | alright, good |
| 16:37:25 | <rewby> | I didn't do beta yet, I'll run that after I've had a shower |
| 16:37:32 | <@arkiver> | thanks rewby |
| 16:38:51 | <@arkiver> | EggplantN: have the archives for the lists from rewby already been uploaded to IA, or are they stashed somewhere? |
| 16:39:12 | <@EggplantN> | they were through #// |
| 16:39:16 | <@EggplantN> | so they should be uploaded |
| 16:45:24 | <@JAA> | dav3: Hmm, line 569 in the 2014-2018 file is malformed. |
| 16:46:20 | <@JAA> | Guessing that should be https://d2i91erehhsxi2.cloudfront.net/appledaily/2020/04/02/5e8641adc9e77c0009709acb/08012014_ent_05_w.mp4 |
| 16:48:06 | <dav3> | oops not sure what happened there. i will make a new file |
| 16:49:28 | <@JAA> | Looks like you found 16781 videos that aren't in my list from /archive/. |
| 16:49:36 | <@JAA> | Comparing with the other lists we have now. |
| 16:52:48 | <@JAA> | Filtering out all the video lists mentioned on the wiki brings it to 11661 videos. |
| 16:54:16 | <@JAA> | https://transfer.archivete.am/4O5K7/hk.appledaily.com-dav3-video-urls-filtered.zst |
| 16:54:59 | <@JAA> | That's dav3's list from the ZIP above minus https://transfer.archivete.am/IJGRu/hk.appledaily.com-archive-videos.zst minus jodizzle's nine lists on the wiki page. |
| 16:55:25 | <@JAA> | jodizzle: ^ More M3U8 for you. :-) |
| 16:56:33 | <@JAA> | Throwing the `grep '\.mp4$'` of that into ArchiveBot now. |
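The filtering JAA describes (dav3's list minus the lists already known, then keeping only the `.mp4` URLs) amounts to a set difference. A rough sketch, assuming each list is an iterable of one URL per line and already decompressed; `filter_new` and `only_mp4` are hypothetical helper names, not the commands actually used.

```python
def filter_new(candidates, *known_lists):
    """Return candidate URLs that appear in none of the known lists.

    Order of first appearance is kept and duplicates are dropped.
    """
    known = set()
    for urls in known_lists:
        known.update(u.strip() for u in urls)
    seen = set()
    fresh = []
    for url in candidates:
        url = url.strip()
        if url and url not in known and url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


def only_mp4(urls):
    """Keep URLs ending in .mp4, like grep '\\.mp4$'."""
    return [u for u in urls if u.endswith(".mp4")]


if __name__ == "__main__":
    new = filter_new(
        ["https://example.com/a.mp4\n", "https://example.com/b.m3u8\n"],
        ["https://example.com/a.mp4\n"],
    )
    print(new)
```

Comparison is by exact string, so two URLs that differ only in scheme or host casing would count as different videos.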
| 17:00:47 | <@arkiver> | EggplantN: do you happen to have an easy list of all of them? |
| 17:00:59 | <@EggplantN> | not on hand |
| 17:01:06 | <@EggplantN> | you can always try and queue again if you wanna check |
| 17:01:18 | <@arkiver> | mostly thinking about embedded images now |
| 17:01:31 | <@EggplantN> | don't need to use urls.js as the ones i did all had valid URLs |
| 17:01:36 | <@arkiver> | i remember one appledaily had images which could not be extracted without custom code |
| 17:01:48 | <@arkiver> | yeah |
| 17:02:28 | <@arkiver> | so for next sites, if we want the embedded image, let's make sure the domain is in the https://github.com/ArchiveTeam/urls-grab/blob/master/extract-outlinks-patterns.txt list |
| 17:02:44 | <@arkiver> | the HTML alone is already very important of course |
| 17:06:59 | <@JAA> | Hmm, I just remembered something... We can probably still grab Apple Daily images. Checking now. |
| 17:08:14 | <@JAA> | Most article images were using a resizing server thingy, but the original image URL is in that resize URL and on another server and still up for now. |
| 17:08:29 | <@JAA> | E.g. https://hk.appledaily.com/resizer/4NRo9yeFwLnX468wDNUDZpiEhvs=/494x/filters:quality(100)/cloudfront-ap-northeast-1.images.arcpublishing.com/appledaily/VXECD5EO7BR3AW7JAJRQB6OJZI.jpg → https://cloudfront-ap-northeast-1.images.arcpublishing.com/appledaily/VXECD5EO7BR3AW7JAJRQB6OJZI.jpg |
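The resize-URL mapping in JAA's example can be sketched as below. This assumes every resize URL has the shape shown above, with the original `*.images.arcpublishing.com` host and path embedded at the end, and that the original is served over https. Only the single example from the log was checked against it; `original_image_url` is a hypothetical name.

```python
import re

# Captures the embedded "<host>.images.arcpublishing.com/<path>" tail of a resize URL.
_ORIGIN = re.compile(r"/resizer/.*?/([a-z0-9.-]+\.images\.arcpublishing\.com/.+)$")


def original_image_url(resize_url):
    """Return the original image URL embedded in a resize URL, or None."""
    match = _ORIGIN.search(resize_url)
    if match is None:
        return None
    return "https://" + match.group(1)


if __name__ == "__main__":
    print(original_image_url(
        "https://hk.appledaily.com/resizer/4NRo9yeFwLnX468wDNUDZpiEhvs=/494x/"
        "filters:quality(100)/cloudfront-ap-northeast-1.images.arcpublishing.com/"
        "appledaily/VXECD5EO7BR3AW7JAJRQB6OJZI.jpg"
    ))
```

URLs that do not match the pattern return `None` rather than a guess, so they can be collected and inspected separately.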
| 17:10:05 | <dav3> | https://mega.nz/file/V18g3B6b#JXAAje1bTPNox-z-vdIEN5cPLdZbQWu48K4ePfybs1w V2 2014-2018 |
| 17:10:06 | <rewby> | arkiver: beta.thestandnews.com just redirects sitemaps to www.'s sitemap. |
| 17:18:11 | <dav3> | i downloaded ~533,000 images. i will output a url and file list for those.. |
| 17:22:09 | <@JAA> | Rehosted: https://transfer.archivete.am/14x3GF/video_data_2014-2018_V2.zip |
| 17:27:01 | <@JAA> | Huh, that unearthed another 36 MP4s that aren't elsewhere. Nice. :-) |
| 17:31:28 | <@JAA> | (No extra M3U8) |
| 17:31:55 | <jodizzle> | Ugh, so many m3u8s to unpack… |
| 17:32:22 | <@JAA> | Yeah, and it's getting messy to see what we have and haven't covered despite the wiki page. :-| |
| 17:35:07 | <@hook54321> | !a https://www.cesletter.info/ |
| 17:35:10 | <@hook54321> | woops |
| 17:40:54 | <dav3> | 2014-2018 images urls https://mega.nz/file/40cVHKJa#oK1FOJibRxhB4ZjdgMyfh2_RWR3nZ0U38SBsGuQI08U |
| 17:42:12 | <@JAA> | Rehosted: https://transfer.archivete.am/xSRbu/2014-2018_image_data.zip |
| 17:48:03 | <@JAA> | I'll process this later if no one beats me to it. |
| 17:48:13 | | Daloader joins |
| 17:49:25 | <dav3> | 2019-2021 images urls https://mega.nz/file/NoNDASRL#NCG5tNkoh2kC9U4MqEQBALePvTOI8q9kLq07YB_ZcNA |
| 17:49:35 | <@JAA> | Cheers |
| 17:49:58 | <@JAA> | https://transfer.archivete.am/DtK5q/2019-2021_image_data.zip |
| 17:57:22 | | DogsRNice (Webuser299) joins |
| 17:59:56 | <AK> | JAA: for the two above, any cleanup needing doing or can I just throw all the urls into ab? (I can manage getting the urls out and into a list) |
| 18:04:13 | <@JAA> | AK: Not sure yet if AB or #// is better for these. |
| 18:04:28 | <@JAA> | Depends on how many more there are. I'll get the ones from the AB jobs as well. |
| 18:04:42 | <@JAA> | But these are the full images, so no processing needed in that sense. |
| 18:04:54 | <AK> | Alright, I'll turn them both into one list, then upload it here and it can go somewhere else |
| 18:05:27 | <@JAA> | Upload as hk.appledaily.com-dav3-image-urls.zst please. :-) |
| 18:05:47 | <@EggplantN> | Are they the same domain per URL? If so turn on page reqs |
| 18:06:13 | <@JAA> | EggplantN: They're just direct URLs for images. The pages they were on aren't available anymore. |
| 18:06:43 | <@EggplantN> | Ah okie. Sure, #// them; that box can take up to 11Gbit peaks inbound |
| 18:06:59 | <AK> | Now I've gotta work out how I zst them lmao |
| 18:08:43 | <@JAA> | Nice thing about AB is that it produces a single grouped dataset. |
| 18:09:07 | <@JAA> | But yeah, will get the ones from the two AB job DBs later and then decide. |
| 18:12:54 | <AK> | https://transfer.archivete.am/EWVa4/hk.appledaily.com-dav3-image-urls.zst |
| 18:13:00 | <AK> | Fuck knows if I managed that correctly |
| 18:13:45 | <@JAA> | Thanks |
| 18:14:06 | <AK> | I'm not sure I did |
| 18:14:07 | <AK> | lol |
| 18:14:49 | <AK> | Actually I think I did |
| 18:14:58 | <@arkiver> | zstd filename |
| 18:15:01 | <@arkiver> | did you do that? |
| 18:15:05 | <@arkiver> | then it should be fine |
| 18:15:07 | <@arkiver> | :P |
| 18:15:36 | <@JAA> | I usually increase the level a bit. If I feel patient, I go for `--ultra -22`. |
| 18:15:45 | <@JAA> | But yeah, that'll do. |
| 18:15:58 | <@JAA> | And will still beat `gzip -9` in virtually all cases. |
| 18:16:13 | <AK> | Don't get too mad, but I used 7zip gui lol |
| 18:16:20 | <AK> | Figured 11 was better than the default of 1 |
| 18:16:33 | <@JAA> | lol |
| 18:16:36 | <AK> | Used https://github.com/mcmilk/7-Zip-zstd |
| 18:16:54 | <AK> | Powershell for the extraction, then used a nice gui for the compressing back |
| 18:17:04 | <AK> | Can you tell I'm a windows admin ;) |
| 18:17:12 | <@JAA> | Yes, my condolences. |
| 18:17:54 | <@JAA> | I'd probably have done `awk -F, '{print $3}' input | zstd -8 -o out.zst`. |
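For anyone without awk at hand, the extraction half of that pipeline can be approximated in Python. This sketch mirrors `awk -F, '{print $3}'` by splitting naively on commas, so quoted fields containing commas are not handled. That the URL sits in the third column of dav3's files is taken from JAA's command, not verified. `third_field` is a hypothetical name.

```python
import sys


def third_field(lines, sep=",", index=2):
    """Yield the third separator-delimited field of each line.

    Like awk, a line with fewer fields yields an empty string.
    """
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        yield fields[index] if len(fields) > index else ""


if __name__ == "__main__":
    # Reads the input on stdin and writes one field per line to stdout.
    for value in third_field(sys.stdin):
        print(value)
```

The output can then be piped into `zstd` as in the original command; compression is left out of the sketch because it would need a third-party module on most Python versions.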
| 18:17:54 | <@arkiver> | damn |
| 18:18:09 | <@JAA> | But I won't kinkshame. :-) |
| 18:18:19 | <@arkiver> | if it works it works :P |
| 18:18:26 | <@arkiver> | AK will upgrade to Linux at some points :) |
| 18:18:32 | <@arkiver> | point* |
| 18:23:54 | <@HCross> | _stares intently at arkiver_ |
| 18:33:08 | <AK> | arkiver I use linux loads, how else do you download your Windows 10 iso? |
| 18:55:19 | | benjins joins |
| 18:57:07 | | benjins is now authenticated as benjins |
| 19:22:33 | | omni quits [Ping timeout: 258 seconds] |
| 19:25:18 | | HP_Archivist (HP_Archivist) joins |
| 20:01:24 | | simon816 quits [Read error: Connection reset by peer] |
| 20:01:38 | | benjinsmith joins |
| 20:04:43 | | benjins quits [Ping timeout: 258 seconds] |
| 20:10:50 | | simon816 (simon816) joins |
| 20:15:09 | | alard quits [Quit: ZNC 1.7.2+deb3 - https://znc.in] |
| 20:27:55 | | AlsoHP_Archivist joins |
| 20:29:04 | | benjinsmith is now known as benjins |
| 20:29:07 | | benjins is now authenticated as benjins |
| 20:30:01 | | HP_Archivist quits [Ping timeout: 258 seconds] |
| 20:35:27 | | HP_Archivist (HP_Archivist) joins |
| 20:35:56 | | nuroten joins |
| 20:36:33 | | wyatt8750 joins |
| 20:37:38 | | AlsoHP_Archivist quits [Ping timeout: 250 seconds] |
| 20:37:38 | | wyatt8740 quits [Ping timeout: 250 seconds] |
| 20:45:31 | <thuban> | achivarin: yep, running |
| 21:04:52 | <thuban> | would someone please queue https://transfer.archivete.am/J0QXS/playboard_UCCzKM7UMxGCPAgUDXmFw5Gg_urls.txt.zst and https://transfer.archivete.am/dDyMC/playboard_UC-8CVMKt5Zlju_i07zhkC-Q_urls.txt.zst ? |
| 21:05:05 | <AK> | Can do |
| 21:05:13 | <thuban> | thanks! |
| 21:07:24 | <thuban> | achivarin: i am not sure whether https://www.youtube.com/user/eatravel corresponds to one of the playboard channel urls you linked or to something else. youtube says the page is not available; can you explain? |
| 22:03:00 | | Daloader quits [Ping timeout: 250 seconds] |
| 22:17:16 | | AlsoHP_Archivist joins |
| 22:19:02 | | HP_Archivist quits [Ping timeout: 250 seconds] |
| 22:55:26 | | Larsenv quits [Ping timeout: 250 seconds] |
| 23:01:49 | <AK> | BT Community Webkit closed on the 24th May 2021. |
| 23:01:53 | <AK> | https://appletonlemoorshistory.chessck.co.uk/SiteClosed.aspx |
| 23:02:02 | <AK> | Did anyone know about BT Community Webkit? |
| 23:02:26 | <@JAA> | Yeah, it was known. |
| 23:02:35 | <AK> | Damn, ahh well |
| 23:02:46 | <@EggplantN> | AK it was done |
| 23:02:55 | <@EggplantN> | see archiveteam_inbox |
| 23:02:59 | <@EggplantN> | HCross did it |
| 23:03:14 | | BlueMaxima joins |
| 23:03:45 | <AK> | Awesome |
| 23:04:04 | <@EggplantN> | https://archive.org/details/archiveteam_inbox?query=BT&sin= |