00:03:46sec^nd quits [Ping timeout: 255 seconds]
00:18:38sec^nd (second) joins
00:34:13wyatt8750 quits [Ping timeout: 250 seconds]
00:34:49wyatt8740 joins
00:49:39Arcorann (Arcorann) joins
01:04:08dm4v quits [Ping timeout: 250 seconds]
01:04:09dm4v_ joins
01:04:35dm4v_ is now known as dm4v
01:04:35dm4v quits [Changing host]
01:04:35dm4v (dm4v) joins
01:29:03<@JAA>Ok, here we go, the lists of images after post-processing from the two ArchiveBot jobs: https://transfer.archivete.am/UmdL5/hk.appledaily.com-images.zst https://transfer.archivete.am/PSAGY/twitter-@appledaily_hk-images.zst
01:29:20<@JAA>The job that's running through AB now is progressing very slowly. I think I'll abort it, merge everything, and split it into a few jobs.
01:31:20<@JAA>These two lists increase the count to nearly a million images!
01:31:26<@JAA>dav3: ^
01:36:30<@JAA>Only 10k duplicates between the three lists. That's a bit surprising.
01:39:31<@JAA>Ok, found one reason: https://arc-photo-appledaily.s3.amazonaws.com/ap-ne-1-prod/public/ and https://cloudfront-ap-northeast-1.images.arcpublishing.com/appledaily/ serve the same files. So there's a lot of duplication in that.
01:40:04<@JAA>Over 120k filenames are duplicates but unique URLs.
01:44:41dav3 quits [Remote host closed the connection]
02:20:16Docti joins
02:23:33<Frogging101>Ryz: People in this thread are talking about it https://www.virtualteen.org/forums/showthread.php?t=2056688
02:23:33<@JAA>Docti: That thread is behind a login wall. :-/
02:23:56<Ryz>Big welp on the login wall... :c
02:24:07<Docti>Sorry, I will find another one
02:24:59<@JAA>Yeah, that thread above is where most of the public discussion appears to have taken place. Statements from staff members in there as well.
02:25:15<Docti>Exactly
02:25:15<Frogging101>fuuck, I remember this site. I even posted on it I think
02:25:33<Frogging101>haven't been on it for a good 10 years but the other week I was wondering if it was still around :P
02:25:41<Docti>For real ? wow
02:26:04<Docti>Perhaps you can find again your account and say hello one last time :)
02:26:50<@JAA>People in that thread are also mentioning that the only other major forum of that sort is Teenhut (https://www.teenhut.net/ I think?) and is in a very sorry state itself.
02:31:07<Docti>Unfortunately Virtual Teen that was still active is closing down very soon, whereas Teen Hut that survives is not used anymore
02:45:58AlsoHP_Archivist quits [Ping timeout: 250 seconds]
02:46:53Docti24 joins
02:48:43Docti quits [Ping timeout: 244 seconds]
02:53:20ThreeHM quits [Ping timeout: 250 seconds]
02:55:13ThreeHM (ThreeHeadedMonkey) joins
02:55:38<Docti24>I will have to leave soon. Do you think you can do something for Virtualteen? I'm sad to see this website go, so many teens grew with it for years, it's a whole life of virtual and later IRL friends
02:59:49<@JAA>Yeah, we'll try to get what we can.
03:01:13AntiLiberal joins
03:03:09<thuban>Docti24: that won't include pages behind login walls, so if you feel really strongly about them you might want to make your own personal archive. this may be useful: https://github.com/archiveteam/grab-site
03:04:28<thuban>do we have a shutdown date?
03:05:02<@JAA>30 Jun
03:05:56<@JAA>Yeah, a fair bit of content is behind a login wall, but based on the numbers on the homepage, it looks like the majority of posts isn't.
03:06:42<Docti24>Thank you
03:07:00<Docti24>Unfortunately I can't use that tool, I have Windows 7
03:13:15<Docti24>By the way, if possible, avoid saving the VT Arcade section, or do it once you have saved the rest. It's not interesting
03:14:28<Docti24>VT Arcade makes up ~740,000 posts out of 3.6 million total
03:14:40<@JAA>Ah yes, forum counting games and all that stuff. Damn, I miss this.
03:15:09<@JAA>But we probably can't really avoid it.
03:15:37<Docti24>Yeah, sadly. And it's also part of the history of Internet
03:15:49<thuban>Docti24: an option in that case is wget for windows (https://eternallybored.org/misc/wget/), although it will take more work to get the exact set of pages you want. you could also try grab-site in a vm. (all of these options require the user to be somewhat technical i'm afraid)
03:19:36<@JAA>I've started a crawl with ArchiveBot and will explore other ways on the weekend. It's vBulletin, so pretty straightforward in principle. Depends on how much load the server can handle, but should be fine in 3-4 days.
03:20:09<Docti24>Thank you very much!
03:20:19<@JAA>As thuban said, no logged-in content because that gets very messy very quickly, both technically and ethically.
03:20:20<thuban>JAA: i believe the 'likes' functionality is an extension, so may need extra ignores
03:21:26<@JAA>As far as I can see, it just adds a list of user profile links (at least when you're not logged in). Don't see anything that would need ignores.
03:21:35<Docti24>Let's suppose I manage to save content for logged-in users: would it be possible to upload it to archive.org ?
03:22:20<@JAA>Yes, anyone can upload to the Internet Archive, but it wouldn't be included in the Wayback Machine.
03:22:56<@JAA>Also, you may accidentally leak details about your account and the like.
03:23:59<Docti24>Ok, I see. Is the work of ArchiveBot included in the wayback machine ? (well, I wouldn't care about my account, nothing is really private)
03:24:11<@JAA>Yes, AB data goes into the WBM.
03:26:10<@JAA>The other thing I want to try would be included in the WBM as well and should guarantee (VT server power permitting) that all thread contents are safe at least, even if the pages may not be fully covered otherwise (e.g. images or page design). But no guarantees on that yet.
03:26:31<thuban>qwarc?
03:26:32<@JAA>The ArchiveBot job will likely not be fast enough to get everything in 3-4 days.
03:26:37<@JAA>Yup
03:34:28<Docti24>Thanks. I have to go now. i will try to save pages for users only. You can contact me on the wiki if necessary, username: Docti. Good night.
03:34:45gazorpazorp quits [Ping timeout: 258 seconds]
03:34:57<thuban>thanks for the report, and good night!
03:36:40Docti24 quits [Remote host closed the connection]
03:46:25qw3rty__ joins
03:50:05qw3rty_ quits [Ping timeout: 258 seconds]
04:21:22<AntiLiberal>Did anyone manage to archive Myspace?
04:54:57AlsoHP_Archivist joins
05:01:37DogsRNice quits [Read error: Connection reset by peer]
05:26:49achivarin quits [Remote host closed the connection]
06:00:47<jodizzle>I think I've handled all the appledaily m3u8s at this point.
06:01:59<jodizzle>While I was doing that, I realized that there are unfortunately a pretty large number of dupe URLs among my original 8 lists. Must've screwed something up when preparing them.
06:06:48<jodizzle>JAA: On getting more videos: I think ideally, we'd process articles saved by #//, right? Since I think that probably got more coverage than the AB job. Or are the archives produced by that harder to deal with?
07:16:06BlueMaxima quits [Read error: Connection reset by peer]
07:31:46apache2 joins
07:32:02<apache2>Hi, can someone help me with an upload?
07:32:14<apache2>(I need a place to drop the files)
07:36:35<apache2>please /msg for details
07:42:02<AK>Apache2: what kind of files are you trying to upload? And to where?
07:49:58<apache2>I wrote you a message
08:08:33TigerbotHesh quits [Quit: ZNC - https://znc.in]
08:08:42Pingerfowder joins
08:17:03AlsoHP_Archivist quits [Client Quit]
08:17:21HP_Archivist (HP_Archivist) joins
08:40:12<@HCross>Please detail in here instead of DM
08:40:18<@HCross>Other people may be able to help here
09:32:45nuroten quits [Remote host closed the connection]
11:04:03justcool393 quits [Quit: Connection closed for inactivity]
11:11:40shogchips quits [Ping timeout: 250 seconds]
11:26:42dav3 joins
11:40:03Matthww83 joins
11:41:12Matthww8 quits [Ping timeout: 258 seconds]
11:41:12Matthww83 is now known as Matthww8
11:45:57Matthww88 joins
11:47:43Matthww8 quits [Ping timeout: 258 seconds]
11:47:43Matthww88 is now known as Matthww8
11:54:03cpina quits [Quit: Bye!]
11:56:06cpina joins
12:00:00dav3 quits [Ping timeout: 244 seconds]
12:15:01<@arkiver>JAA: did we recrawl everything from dav3, or does dav3 still have data in that 2 TB that we haven't archived yet?
12:54:42dav3 joins
13:05:27<dav3>here are a few more video and image urls from the last few days of the site: https://mega.nz/file/wp10zBaY#8QQ21fqaFl_b1JRdOSx4Gy4axJSG97hX5RTpZQvuYLs i dont think there was an archive page for the 23rd
13:48:05Mateon1 quits [Ping timeout: 258 seconds]
13:57:30Mateon1 joins
14:02:48achivarin (achivarin) joins
14:03:10<Jake>https://transfer.archivete.am/UnV7C/20210620-20210622_image_video_urls.zip (reupped)
14:14:19<@JAA>jodizzle: Yes, processing the #// archives would be great, but it's mixed in with all the other crap that goes through that project, so it's difficult. Probably requires downloading a couple hundred GB of WARCs from IA etc.
14:16:40<@JAA>arkiver: We should have all the videos and images (though that's unverified) but are likely missing some of the articles, which can no longer be grabbed. Also, dav3 is here now. :-)
14:18:13<@JAA>dav3: Thanks! I can confirm that the archive page for the 23rd was empty as of the early UTC hours of the 24th.
14:21:38<@JAA>And thanks, Jake! :-)
14:37:26user1 joins
14:37:31user1 is now known as gazorpazorp
14:39:52scara joins
14:40:01scara leaves
14:53:05dav3 quits [Ping timeout: 244 seconds]
15:14:20Arcorann quits [Ping timeout: 258 seconds]
15:23:54dav3 joins
15:25:20nyany (nyany) joins
16:01:16<rewby>JAA: if you give me a list of urls items to crawl I can throw my scanner at it and do it at like gig linerate
16:03:20Niklink joins
16:12:39Niklink quits [Ping timeout: 244 seconds]
16:13:27Niklink joins
16:32:00Niklink quits [Remote host closed the connection]
17:46:33<Ryz>In regards to the shutdown from https://site.nicovideo.jp/ch/userblomaga_thanks/ - this was when I was rechecking my list of pending stuff, being https://ch.nicovideo.jp/katakuti_tdb and http://ch.nicovideo.jp/takeyabu/blomaga/ar1144933 - from which I saw the announcement on top of the page
17:49:02<@OrIdow6>HCross: It shouldn't be as hard as the last one, I think
17:49:38<@OrIdow6>Because in that case, the important data was accessed in a complicated way by JS
17:49:40<AK>At least this time we have lots of time
17:49:57<@OrIdow6>Whereas I think these are mostly static web pages
17:51:18Daloader joins
17:52:02<@OrIdow6>yes
17:52:44<@HCross>Ah right, so we don't have to go full speed
18:25:14DogsRNice (Webuser299) joins
19:08:21<@JAA>rewby: So the issue is that I don't really know how long it took to run them through since the graphs are broken etc. Which means I have no idea how many items they're spread over. But here are the items that could contain them: https://transfer.archivete.am/G7PVb/archiveteam_urls-hk.appledaily.com-candidate-items
19:09:39<@JAA>I'd start from the top and count the number of hk.appledaily.com records that you encounter, I suppose. (There were 3221709 in your list, for reference.)
19:10:05<@JAA>The video and image URLs are in JS blobs, I believe, *not* HTML.
19:14:22<@JAA>Videos are on d2i91erehhsxi2.cloudfront.net, images are under a number of hosts and URLs: cloudfront-ap-northeast-1.images.arcpublishing.com, arc-photo-appledaily.s3.amazonaws.com, d87urpdhi5rdo.cloudfront.net, and https://hk.appledaily.com/resizer/* (those last ones need post-processing)
19:34:37dewdrop quits [Ping timeout: 258 seconds]
19:49:33<@JAA>jodizzle: https://transfer.archivete.am/UnV7C/20210620-20210622_image_video_urls.zip has only M3U8 videos in the video part. (I'm taking care of the images now.)
19:51:58Vukky (Vukky) joins
20:10:34<AK>Someday I'd love to be able to say we've archived every single gov.uk domain: https://www.gov.uk/government/publications/list-of-gov-uk-domain-names
20:11:35<Vukky>"This preview only shows the first 1,000 rows and 50 columns"
20:11:54<@JAA>Sadly, gov sites are among the worst on the internet, so a lot of those would probably require black magic to properly/fully archive.
20:27:12Daloader quits [Ping timeout: 250 seconds]
20:32:09<russss>gov.uk is better than .gov I think
20:32:10<@HCross>JAA: The UK is doing somewhat of a standardisation drive
20:32:21<@HCross>So I suspect it may be a lot better than what you’re comparing it to in America
20:33:16<russss>but also .gov.uk is archived by the UK National Archives, at least in theory
20:33:44<@JAA>The US ones are horrible, yes, but I was really talking about gov sites globally. Virtually every country I encountered during my election stuff etc. had absolutely horrible sites.
20:34:07<@JAA>But may well be that the UK ones are better. A surprise to be sure, but a welcome one. :-)
20:34:52<AK>russss, I think so yeah, but it might be good to do the ones that might not get archived at some point
20:40:16<russss>yeah definitely
22:31:08driib6 (driib) joins
22:35:02driib quits [Ping timeout: 250 seconds]
22:35:02driib6 is now known as driib
23:48:21Larsenv (Larsenv) joins