| 00:03:46 | | sec^nd quits [Ping timeout: 255 seconds] |
| 00:18:38 | | sec^nd (second) joins |
| 00:34:13 | | wyatt8750 quits [Ping timeout: 250 seconds] |
| 00:34:49 | | wyatt8740 joins |
| 00:49:39 | | Arcorann (Arcorann) joins |
| 01:04:08 | | dm4v quits [Ping timeout: 250 seconds] |
| 01:04:09 | | dm4v_ joins |
| 01:04:35 | | dm4v_ is now known as dm4v |
| 01:04:35 | | dm4v is now authenticated as dm4v |
| 01:04:35 | | dm4v quits [Changing host] |
| 01:04:35 | | dm4v (dm4v) joins |
| 01:29:03 | <@JAA> | Ok, here we go, the lists of images after post-processing from the two ArchiveBot jobs: https://transfer.archivete.am/UmdL5/hk.appledaily.com-images.zst https://transfer.archivete.am/PSAGY/twitter-@appledaily_hk-images.zst |
| 01:29:20 | <@JAA> | The job that's running through AB now is progressing very slowly. I think I'll abort it, merge everything, and split it into a few jobs. |
| 01:31:20 | <@JAA> | These two lists increase the count to nearly a million images! |
| 01:31:26 | <@JAA> | dav3: ^ |
| 01:36:30 | <@JAA> | Only 10k duplicates between the three lists. That's a bit surprising. |
| 01:39:31 | <@JAA> | Ok, found one reason: https://arc-photo-appledaily.s3.amazonaws.com/ap-ne-1-prod/public/ and https://cloudfront-ap-northeast-1.images.arcpublishing.com/appledaily/ serve the same files. So there's a lot of duplication in that. |
| 01:40:04 | <@JAA> | Over 120k filenames are duplicates but unique URLs. |
| 01:44:41 | | dav3 quits [Remote host closed the connection] |
| 02:20:16 | | Docti joins |
| 02:23:33 | <Frogging101> | Ryz: People in this thread are talking about it https://www.virtualteen.org/forums/showthread.php?t=2056688 |
| 02:23:33 | <@JAA> | Docti: That thread is behind a login wall. :-/ |
| 02:23:56 | <Ryz> | Big welp on the login wall... :c |
| 02:24:07 | <Docti> | Sorry, I will find another one |
| 02:24:59 | <@JAA> | Yeah, that thread above is where most of the public discussion appears to have taken place. Statements from staff members in there as well. |
| 02:25:15 | <Docti> | Exactly |
| 02:25:15 | <Frogging101> | fuuck, I remember this site. I even posted on it I think |
| 02:25:33 | <Frogging101> | haven't been on it for a good 10 years but the other week I was wondering if it was still around :P |
| 02:25:41 | <Docti> | For real ? wow |
| 02:26:04 | <Docti> | Perhaps you can find again your account and say hello one last time :) |
| 02:26:50 | <@JAA> | People in that thread are also mentioning that the only other major forum of that sort is Teenhut (https://www.teenhut.net/ I think?) and is in a very sorry state itself. |
| 02:31:07 | <Docti> | Unfortunately, Virtual Teen, which was still active, is closing down very soon, whereas Teen Hut, which survives, is not used anymore |
| 02:45:58 | | AlsoHP_Archivist quits [Ping timeout: 250 seconds] |
| 02:46:53 | | Docti24 joins |
| 02:48:43 | | Docti quits [Ping timeout: 244 seconds] |
| 02:53:20 | | ThreeHM quits [Ping timeout: 250 seconds] |
| 02:55:13 | | ThreeHM (ThreeHeadedMonkey) joins |
| 02:55:38 | <Docti24> | I will have to leave soon. Do you think you can do something for Virtualteen? I'm sad to see this website go, so many teens grew with it for years, it's a whole life of virtual and later IRL friends |
| 02:59:49 | <@JAA> | Yeah, we'll try to get what we can. |
| 03:01:13 | | AntiLiberal joins |
| 03:03:09 | <thuban> | Docti24: that won't include pages behind login walls, so if you feel really strongly about them you might want to make your own personal archive. this may be useful: https://github.com/archiveteam/grab-site |
| 03:04:28 | <thuban> | do we have a shutdown date? |
| 03:05:02 | <@JAA> | 30 Jun |
| 03:05:56 | <@JAA> | Yeah, a fair bit of content is behind a login wall, but based on the numbers on the homepage, it looks like the majority of posts isn't. |
| 03:06:42 | <Docti24> | Thank you |
| 03:07:00 | <Docti24> | Unfortunately I can't use that tool, I have Windows 7 |
| 03:13:15 | <Docti24> | By the way, if possible, avoid saving the VT Arcade section, or do it once you have saved the rest. It's not interesting |
| 03:14:28 | <Docti24> | VT Arcade makes up ~740,000 posts out of 3.6 million total |
| 03:14:40 | <@JAA> | Ah yes, forum counting games and all that stuff. Damn, I miss this. |
| 03:15:09 | <@JAA> | But we probably can't really avoid it. |
| 03:15:37 | <Docti24> | Yeah, sadly. And it's also part of the history of the Internet |
| 03:15:49 | <thuban> | Docti24: an option in that case is wget for windows (https://eternallybored.org/misc/wget/), although it will take more work to get the exact set of pages you want. you could also try grab-site in a vm. (all of these options require the user to be somewhat technical i'm afraid) |
| 03:19:36 | <@JAA> | I've started a crawl with ArchiveBot and will explore other ways on the weekend. It's vBulletin, so pretty straightforward in principle. Depends on how much load the server can handle, but should be fine in 3-4 days. |
| 03:20:09 | <Docti24> | Thank you very much! |
| 03:20:19 | <@JAA> | As thuban said, no logged-in content because that gets very messy very quickly, both technically and ethically. |
| 03:20:20 | <thuban> | JAA: i believe the 'likes' functionality is an extension, so may need extra ignores |
| 03:21:26 | <@JAA> | As far as I can see, it just adds a list of user profile links (at least when you're not logged in). Don't see anything that would need ignores. |
| 03:21:35 | <Docti24> | Let's suppose I manage to save content for logged-in users: would it be possible to upload it to archive.org ? |
| 03:22:20 | <@JAA> | Yes, anyone can upload to the Internet Archive, but it wouldn't be included in the Wayback Machine. |
| 03:22:56 | <@JAA> | Also, you may accidentally leak details about your account and the like. |
| 03:23:59 | <Docti24> | Ok, I see. Is the work of ArchiveBot included in the wayback machine ? (well, I wouldn't care about my account, nothing is really private) |
| 03:24:11 | <@JAA> | Yes, AB data goes into the WBM. |
| 03:26:10 | <@JAA> | The other thing I want to try would be included in the WBM as well and should guarantee (VT server power permitting) that all thread contents are safe at least, even if the pages may not be fully covered otherwise (e.g. images or page design). But no guarantees on that yet. |
| 03:26:31 | <thuban> | qwarc? |
| 03:26:32 | <@JAA> | The ArchiveBot job will likely not be fast enough to get everything in 3-4 days. |
| 03:26:37 | <@JAA> | Yup |
| 03:34:28 | <Docti24> | Thanks. I have to go now. I will try to save pages for users only. You can contact me on the wiki if necessary, username: Docti. Good night. |
| 03:34:45 | | gazorpazorp quits [Ping timeout: 258 seconds] |
| 03:34:57 | <thuban> | thanks for the report, and good night! |
| 03:36:40 | | Docti24 quits [Remote host closed the connection] |
| 03:46:25 | | qw3rty__ joins |
| 03:50:05 | | qw3rty_ quits [Ping timeout: 258 seconds] |
| 04:21:22 | <AntiLiberal> | Did anyone manage to archive Myspace? |
| 04:54:57 | | AlsoHP_Archivist joins |
| 05:01:37 | | DogsRNice quits [Read error: Connection reset by peer] |
| 05:26:49 | | achivarin quits [Remote host closed the connection] |
| 06:00:47 | <jodizzle> | I think I've handled all the appledaily m3u8s at this point. |
| 06:01:59 | <jodizzle> | While I was doing that, I realized that there are unfortunately a pretty large number of dupe URLs among my original 8 lists. Must've screwed something up when preparing them. |
| 06:06:48 | <jodizzle> | JAA: On getting more videos: I think ideally, we'd process articles saved by #//, right? Since I think that probably got more coverage than the AB job. Or are the archives produced by that harder to deal with? |
| 07:16:06 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 07:31:46 | | apache2 joins |
| 07:32:02 | <apache2> | Hi, can someone help me with an upload? |
| 07:32:14 | <apache2> | (I need a place to drop the files) |
| 07:36:35 | <apache2> | please /msg for details |
| 07:42:02 | <AK> | Apache2: what kind of files are you trying to upload? And to where? |
| 07:49:58 | <apache2> | I wrote you a message |
| 08:08:33 | | TigerbotHesh quits [Quit: ZNC - https://znc.in] |
| 08:08:42 | | Pingerfowder joins |
| 08:17:03 | | AlsoHP_Archivist quits [Client Quit] |
| 08:17:21 | | HP_Archivist (HP_Archivist) joins |
| 08:40:12 | <@HCross> | Please detail in here instead of DM |
| 08:40:18 | <@HCross> | Other people may be able to help here |
| 09:32:45 | | nuroten quits [Remote host closed the connection] |
| 09:57:58 | | Pingerfowder is now authenticated as Pingerfowder |
| 11:04:03 | | justcool393 quits [Quit: Connection closed for inactivity] |
| 11:11:40 | | shogchips quits [Ping timeout: 250 seconds] |
| 11:26:42 | | dav3 joins |
| 11:40:03 | | Matthww83 joins |
| 11:41:12 | | Matthww8 quits [Ping timeout: 258 seconds] |
| 11:41:12 | | Matthww83 is now known as Matthww8 |
| 11:45:57 | | Matthww88 joins |
| 11:47:43 | | Matthww8 quits [Ping timeout: 258 seconds] |
| 11:47:43 | | Matthww88 is now known as Matthww8 |
| 11:54:03 | | cpina quits [Quit: Bye!] |
| 11:56:06 | | cpina joins |
| 12:00:00 | | dav3 quits [Ping timeout: 244 seconds] |
| 12:15:01 | <@arkiver> | JAA: did we recrawl everything from dav3, or does dav3 still have data in that 2 TB that we haven't archived yet? |
| 12:54:42 | | dav3 joins |
| 13:05:27 | <dav3> | here are a few more video and image urls from the last few days of the site: https://mega.nz/file/wp10zBaY#8QQ21fqaFl_b1JRdOSx4Gy4axJSG97hX5RTpZQvuYLs i dont think there was an archive page for the 23rd |
| 13:48:05 | | Mateon1 quits [Ping timeout: 258 seconds] |
| 13:57:30 | | Mateon1 joins |
| 14:02:48 | | achivarin (achivarin) joins |
| 14:03:10 | <Jake> | https://transfer.archivete.am/UnV7C/20210620-20210622_image_video_urls.zip (reupped) |
| 14:14:19 | <@JAA> | jodizzle: Yes, processing the #// archives would be great, but it's mixed in with all the other crap that goes through that project, so it's difficult. Probably requires downloading a couple hundred GB of WARCs from IA etc. |
| 14:16:40 | <@JAA> | arkiver: We should have all the videos and images (though that's unverified) but are likely missing some of the articles, which can no longer be grabbed. Also, dav3 is here now. :-) |
| 14:18:13 | <@JAA> | dav3: Thanks! I can confirm that the archive page for the 23rd was empty as of the early UTC hours of the 24th. |
| 14:21:38 | <@JAA> | And thanks, Jake! :-) |
| 14:37:26 | | user1 joins |
| 14:37:31 | | user1 is now known as gazorpazorp |
| 14:39:52 | | scara joins |
| 14:40:01 | | scara leaves |
| 14:53:05 | | dav3 quits [Ping timeout: 244 seconds] |
| 15:14:20 | | Arcorann quits [Ping timeout: 258 seconds] |
| 15:23:54 | | dav3 joins |
| 15:25:20 | | nyany (nyany) joins |
| 16:01:16 | <rewby> | JAA: if you give me a list of urls items to crawl I can throw my scanner at it and do it at like gig linerate |
| 16:03:20 | | Niklink joins |
| 16:12:39 | | Niklink quits [Ping timeout: 244 seconds] |
| 16:13:27 | | Niklink joins |
| 16:32:00 | | Niklink quits [Remote host closed the connection] |
| 17:46:33 | <Ryz> | In regards to the shutdown from https://site.nicovideo.jp/ch/userblomaga_thanks/ - tis was when I was rechecking my list of pending stuff, being https://ch.nicovideo.jp/katakuti_tdb and http://ch.nicovideo.jp/takeyabu/blomaga/ar1144933 - from which I saw the announcement on top of the page |
| 17:49:02 | <@OrIdow6> | HCross: It shouldn't be as hard as the last one, I think |
| 17:49:38 | <@OrIdow6> | Because in that case, the important data was accessed in a complicated way by JS |
| 17:49:40 | <AK> | At least this time we have lots of time |
| 17:49:57 | <@OrIdow6> | Whereas I think these are mostly static web pages |
| 17:51:18 | | Daloader joins |
| 17:52:02 | <@OrIdow6> | yes |
| 17:52:44 | <@HCross> | Ah right, so we don't have to go full speed |
| 18:25:14 | | DogsRNice (Webuser299) joins |
| 19:08:21 | <@JAA> | rewby: So the issue is that I don't really know how long it took to run them through since the graphs are broken etc. Which means I have no idea how many items they're spread over. But here are the items that could contain them: https://transfer.archivete.am/G7PVb/archiveteam_urls-hk.appledaily.com-candidate-items |
| 19:09:39 | <@JAA> | I'd start from the top and count the number of hk.appledaily.com records that you encounter, I suppose. (There were 3221709 in your list, for reference.) |
| 19:10:05 | <@JAA> | The video and image URLs are in JS blobs, I believe, *not* HTML. |
| 19:14:22 | <@JAA> | Videos are on d2i91erehhsxi2.cloudfront.net, images are under a number of hosts and URLs: cloudfront-ap-northeast-1.images.arcpublishing.com, arc-photo-appledaily.s3.amazonaws.com, d87urpdhi5rdo.cloudfront.net, and https://hk.appledaily.com/resizer/* (those last ones need post-processing) |
| 19:34:37 | | dewdrop quits [Ping timeout: 258 seconds] |
| 19:49:33 | <@JAA> | jodizzle: https://transfer.archivete.am/UnV7C/20210620-20210622_image_video_urls.zip has only M3U8 videos in the video part. (I'm taking care of the images now.) |
| 19:51:58 | | Vukky (Vukky) joins |
| 20:10:34 | <AK> | Someday I'd love to be able to say we've archived every single gov.uk domain: https://www.gov.uk/government/publications/list-of-gov-uk-domain-names |
| 20:11:35 | <Vukky> | "This preview only shows the first 1,000 rows and 50 columns" |
| 20:11:54 | <@JAA> | Sadly, gov sites are among the worst on the internet, so a lot of those would probably require black magic to properly/fully archive. |
| 20:27:12 | | Daloader quits [Ping timeout: 250 seconds] |
| 20:32:09 | <russss> | gov.uk is better than .gov I think |
| 20:32:10 | <@HCross> | JAA: The UK is doing somewhat of a standardisation drive |
| 20:32:21 | <@HCross> | So I suspect it may be a lot better than what you’re comparing it to in America |
| 20:33:16 | <russss> | but also .gov.uk is archived by the UK National Archives, at least in theory |
| 20:33:44 | <@JAA> | The US ones are horrible, yes, but I was really talking about gov sites globally. Virtually every country I encountered during my election stuff etc. had absolutely horrible sites. |
| 20:34:07 | <@JAA> | But may well be that the UK ones are better. A surprise to be sure, but a welcome one. :-) |
| 20:34:52 | <AK> | russss, I think so yeah, but it might be good to do the ones that might not get archived at some point |
| 20:40:16 | <russss> | yeah definitely |
| 22:31:08 | | driib6 (driib) joins |
| 22:35:02 | | driib quits [Ping timeout: 250 seconds] |
| 22:35:02 | | driib6 is now known as driib |
| 23:48:21 | | Larsenv (Larsenv) joins |