| 00:00:12 | <fireonlive> | thank you, directory listings on by default |
| 00:00:16 | <fireonlive> | ^o^ |
| 00:07:09 | <nicolas17> | ...I just now realized this is Knowledge Adventure |
| 00:07:19 | | jtagcat quits [Quit: Bye!] |
| 00:07:33 | | jtagcat (jtagcat) joins |
| 00:08:10 | <nicolas17> | I have several 90s-era KA games |
| 00:08:11 | <@JAA> | I was wondering what 'ka' stands for. Thanks. |
| 00:08:27 | <nicolas17> | there's JumpStart stuff in the bucket too |
| 00:08:53 | | dumbgoy_ quits [Read error: Connection reset by peer] |
| 00:08:53 | <nicolas17> | all the JS* dirs are probably for https://en.wikipedia.org/wiki/JumpStart |
| 00:09:45 | <@JAA> | Looks like this is the bucket behind http://media1.knowledgeadventure.com/ then. |
| 00:10:10 | <nicolas17> | let's find out |
| 00:10:27 | <@JAA> | It matches. I checked a couple random games from the KA site. |
| 00:10:28 | <nicolas17> | http://media1.knowledgeadventure.com/DWADragonsUnity/MAC/3.24.0/High/movies/StoickMemorial.mp4 |
| 00:11:44 | <nicolas17> | (my kingdom for an S3 listing of the bucket behind updates.cdn-apple.com!) |
| 00:12:20 | <fireonlive> | i like how http://media1.knowledgeadventure.com/DWADragonsUnity/ says "Key: DWADragonsUnity/DoNotDelete.jpg" but they're going to purge it all |
| 00:13:25 | | vegbrasil joins |
| 00:16:47 | <nicolas17> | JAA: https://media.jumpstart.com/ uses the same bucket |
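A minimal sketch of the kind of anonymous bucket listing being discussed, in Python with boto3, assuming the bucket allows unauthenticated ListObjectsV2 (the bucket name comes from the URL JAA posted above):

```python
# List a public S3 bucket without credentials. Unsigned requests only work
# if the bucket policy allows anonymous s3:ListBucket, as origin.ka.cdn
# apparently does.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
for page in s3.get_paginator("list_objects_v2").paginate(Bucket="origin.ka.cdn"):
    for obj in page.get("Contents", []):
        print(obj["LastModified"], obj["Size"], obj["Key"])
```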
| 00:17:11 | <@JAA> | Great |
| 00:17:31 | <nicolas17> | are you going to archive the whole bucket? it seems the SODWebsite/ prefix has lots of assets used by schoolofdragons.com |
| 00:18:05 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 00:18:06 | <@JAA> | betamax: Can you ask your friend which of https://s3.amazonaws.com/origin.ka.cdn/ http://media1.knowledgeadventure.com/ https://media.jumpstart.com/ was actually used by the game? |
| 00:18:14 | <@JAA> | Or is, I suppose. |
| 00:18:58 | <@JAA> | nicolas17: DWADragonsUnity only at the moment, but yes, depending on total size, I'd like to grab it all. |
| 00:19:40 | <nicolas17> | I got the complete list by getting lists of every version in parallel |
| 00:19:46 | <nicolas17> | the other directories don't have such a consistent structure |
| 00:20:07 | <nicolas17> | might be a very lopsided tree :) |
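A hedged sketch of the parallel approach nicolas17 describes: enumerate the per-version "directories" with the delimiter parameter, then list each prefix concurrently. The DWADragonsUnity/MAC/ prefix layout is an assumption inferred from the URLs above.

```python
# Discover version prefixes under one platform directory, then list each
# prefix in parallel threads.
from concurrent.futures import ThreadPoolExecutor

import boto3
from botocore import UNSIGNED
from botocore.config import Config

BUCKET = "origin.ka.cdn"
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

def list_prefix(prefix):
    keys = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=prefix):
        keys.extend(o["Key"] for o in page.get("Contents", []))
    return keys

# With Delimiter set, CommonPrefixes acts like a directory listing.
resp = s3.list_objects_v2(Bucket=BUCKET, Delimiter="/", Prefix="DWADragonsUnity/MAC/")
versions = [p["Prefix"] for p in resp.get("CommonPrefixes", [])]
with ThreadPoolExecutor(max_workers=8) as pool:
    keys = [k for part in pool.map(list_prefix, versions) for k in part]
print(f"{len(keys)} objects across {len(versions)} versions")
```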
| 00:20:42 | <@JAA> | I might also grab everything under all three URLs above with dedupe. |
| 00:20:52 | <@JAA> | Let me know if you find any further domains like that. |
| 00:21:06 | <@JAA> | I can't actually start pulling tonight anyway. |
| 00:21:12 | <nicolas17> | 6TB x 3 download then? |
| 00:21:45 | <nicolas17> | er 7TB+ |
| 00:21:59 | <@JAA> | Sure, if there's enough time, might as well. |
| 00:22:16 | <@JAA> | That's only 10 MB/s over the remaining 20 days. |
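(The arithmetic holds: 3 × 6 TB = 18 TB over 20 days is 18 × 10¹² B ÷ 1,728,000 s ≈ 10.4 MB/s; with the revised 7 TB+ figure it is closer to 12 MB/s.)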
| 00:22:46 | <nicolas17> | if I tried that I'm sure my ISP would yell at me... and I don't have 3TB free disk anyway :D |
| 00:23:08 | <@JAA> | OVH won't even blink. :-P |
| 00:24:49 | <fireonlive> | all i have is a measly 300/20Mbps (at best) at home lol |
| 00:25:03 | <fireonlive> | they always fuck you on the upload |
| 00:25:07 | <fireonlive> | but that's HFC for you |
| 00:25:20 | <@JAA> | I could get symmetric 25G here if I wanted. But meh. |
| 00:25:28 | <fireonlive> | oooh |
| 00:25:35 | <@JAA> | Also, -ot :-) |
| 00:25:40 | <fireonlive> | ye |
| 00:36:08 | | BearFortress joins |
| 00:39:50 | <@JAA> | I'll let it battle through these timeouts overnight. |
| 01:29:45 | | Abacus6427 joins |
| 01:34:32 | | tzt quits [Ping timeout: 252 seconds] |
| 01:43:25 | | Abacus6427 quits [Ping timeout: 265 seconds] |
| 01:45:49 | | dumbgoy joins |
| 02:01:53 | | Naruyoko quits [Remote host closed the connection] |
| 02:02:15 | | Naruyoko joins |
| 02:07:28 | | imer quits [Client Quit] |
| 02:08:05 | | imer (imer) joins |
| 02:16:35 | | tzt (tzt) joins |
| 02:16:53 | | Mateon1 quits [Remote host closed the connection] |
| 02:16:53 | | diggan quits [Remote host closed the connection] |
| 02:16:59 | | Mateon1 joins |
| 02:40:12 | | imer quits [Client Quit] |
| 02:43:17 | | imer (imer) joins |
| 02:49:53 | | HP_Archivist quits [Ping timeout: 252 seconds] |
| 02:59:14 | | tzt quits [Ping timeout: 252 seconds] |
| 03:00:58 | | tzt (tzt) joins |
| 03:04:41 | <pabs> | https://torrentfreak.com/youtube-orders-invidious-privacy-software-to-shut-down-in-7-days-230609 |
| 03:20:39 | | Mateon1 quits [Remote host closed the connection] |
| 03:20:59 | | Mateon1 joins |
| 03:34:10 | | taggart quits [Client Quit] |
| 03:39:42 | <pabs> | that_lurker: re Tor Forums migration, I am doing an AB job for the Tor forums, will do an AB !ao < afterwards to capture the redirects too |
| 03:40:15 | <pabs> | (https://blog.torproject.org/tor-project-forum-migration/) |
| 03:46:01 | | Elizabeth quits [Remote host closed the connection] |
| 03:46:11 | | elizabeth joins |
| 03:46:44 | | elizabeth is now authenticated as Elizabeth |
| 03:56:35 | | elizabeth quits [Client Quit] |
| 03:56:39 | | Elizabeth joins |
| 03:57:40 | | Elizabeth is now authenticated as Elizabeth |
| 03:57:52 | <h2ibot> | PaulWise edited Mailman (+142, mention that forum-dl supports pipermail archives): https://wiki.archiveteam.org/?diff=49893&oldid=21240 |
| 03:57:59 | <pabs> | mikolaj|m: ^ |
| 03:58:10 | | Elizabeth quits [Client Quit] |
| 03:58:12 | | Elizabeth (Elizabeth) joins |
| 03:58:52 | <h2ibot> | PaulWise edited Mailman (-2, fix link): https://wiki.archiveteam.org/?diff=49894&oldid=49893 |
| 03:58:55 | | Elizabeth quits [Changing host] |
| 03:58:55 | | Elizabeth (Elizabeth) joins |
| 04:17:55 | <h2ibot> | PaulWise edited Mailman (+159, mention that mailman 2 is now EOL): https://wiki.archiveteam.org/?diff=49895&oldid=49894 |
| 04:28:10 | | emberquill (emberquill) joins |
| 04:33:04 | | GNU_world joins |
| 04:35:29 | | nicolas17 quits [Ping timeout: 252 seconds] |
| 04:37:59 | <h2ibot> | PaulWise created Mailman2 (+43214, start a page about mailman2 archiving): https://wiki.archiveteam.org/?title=Mailman2 |
| 04:38:44 | <pabs> | JAA: if you have mailman2/pipermail URLs locally, please add them to ^ |
| 04:38:51 | <pabs> | (same for anyone else here) |
| 04:40:13 | | nicolas17 joins |
| 04:47:05 | | dumbgoy quits [Ping timeout: 265 seconds] |
| 04:55:02 | <h2ibot> | PaulWise edited Mailman2 (+1588, add more tips, sites that were already done by me): https://wiki.archiveteam.org/?diff=49897&oldid=49896 |
| 05:02:26 | | vegbrasil joins |
| 05:06:54 | | vegbrasil quits [Ping timeout: 265 seconds] |
| 05:21:40 | | futawe joins |
| 05:24:09 | <futawe> | fyi, StackExchange: "The job that uploads the data dump to Archive.org was disabled on 28 March, and marked to not be re-enabled without approval of senior leadership. Had it run as scheduled, it would have completed on the first Monday after the first Sunday in June" |
| 05:24:09 | <futawe> | https://meta.stackexchange.com/questions/389922/june-2023-data-dump-is-missing/390023#390023 |
| 05:25:47 | | futawe quits [Remote host closed the connection] |
| 05:26:00 | | futawe joins |
| 05:26:12 | | futawe quits [Remote host closed the connection] |
| 05:30:01 | <nicolas17> | "organizations looking to profit from the work of our community" that sounds like... stackexchange itself? |
| 05:32:16 | <nicolas17> | JAA: purely out of curiosity (since I know you want the exact bytes for preservation etc etc) I looked into how much I can deduplicate/compress this KA/DWADragonsUnity data |
| 05:33:21 | <nicolas17> | it seems .unity3d files have internal LZMA compression, and if I decompress that I'd probably get far better deduplication, but I can't compress it back to the same data, I guess Unity doesn't use the standard LZMA library? :/ |
| 05:33:49 | <nicolas17> | I tried all compression levels and none matches |
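A rough sketch of the experiment nicolas17 describes: locate the LZMA stream inside a .unity3d bundle and decompress it with Python's lzma module. The header handling and brute-force offset search are assumptions; Unity's encoder evidently emits LZMA streams that standard settings do not reproduce byte-for-byte, which matches the result above.

```python
# Find and decompress the LZMA payload in a UnityWeb bundle. Offsets are
# guessed rather than parsed from the (undocumented-here) bundle header.
import lzma
import sys

data = open(sys.argv[1], "rb").read()
assert data[:8] in (b"UnityWeb", b"UnityRaw"), "not a Unity web bundle?"

# LZMA "alone" streams start with a 13-byte header (props + dict size +
# uncompressed size); try successive offsets until one decodes.
for off in range(8, min(len(data), 512)):
    try:
        dec = lzma.LZMADecompressor(format=lzma.FORMAT_ALONE)
        out = dec.decompress(data[off:])
        print(f"decompressed {len(out)} bytes from offset {off}")
        break
    except lzma.LZMAError:
        continue
```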
| 05:51:11 | <h2ibot> | PaulWise edited ArchiveBot (+1956, add section on alternative dashboard clients): https://wiki.archiveteam.org/?diff=49898&oldid=49790 |
| 05:52:48 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 05:53:19 | | railen63 quits [Remote host closed the connection] |
| 05:53:31 | | vegbrasil joins |
| 05:53:35 | | railen63 joins |
| 05:56:12 | <h2ibot> | PaulWise edited ArchiveBot (-10, fix formatting): https://wiki.archiveteam.org/?diff=49899&oldid=49898 |
| 06:00:13 | <h2ibot> | PaulWise edited ArchiveBot (-21, fix command): https://wiki.archiveteam.org/?diff=49900&oldid=49899 |
| 06:00:34 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 06:05:37 | | nicolas17 quits [Read error: Connection reset by peer] |
| 06:06:13 | | a joins |
| 06:06:14 | | a quits [Remote host closed the connection] |
| 06:06:33 | | nicolas17 joins |
| 07:01:27 | | lk quits [Client Quit] |
| 07:01:32 | | lk joins |
| 07:53:00 | | manu|m quits [Quit: issued !quit command] |
| 07:55:34 | | manu|m joins |
| 07:58:34 | <h2ibot> | OrIdow6 uploaded File:Egloos logo.gif: https://wiki.archiveteam.org/?title=File%3AEgloos%20logo.gif |
| 08:06:42 | | pabs quits [Ping timeout: 265 seconds] |
| 08:15:37 | <h2ibot> | OrIdow6 created Egloos (+400, Created page with "{{Infobox project | title =…): https://wiki.archiveteam.org/?title=Egloos |
| 08:23:34 | | Dango360 quits [Ping timeout: 252 seconds] |
| 08:23:41 | | Dango360 (Dango360) joins |
| 08:34:35 | | pabs (pabs) joins |
| 08:37:11 | | pabs quits [Remote host closed the connection] |
| 08:39:34 | | pabs (pabs) joins |
| 09:13:18 | | T31M quits [Client Quit] |
| 09:13:37 | | T31M joins |
| 09:14:46 | | T31M is now authenticated as T31M |
| 09:21:04 | <@OrIdow6> | arkiver: https://github.com/OrIdow6/egloos-grab https://github.com/OrIdow6/egloos-items - Framework/backfeed.lua in the former needs keys - long tail so I have not bothered with an exact estimate, but it feels like < 5 TB fare to me, and there's stuff to cut out if it gets too high |
| 09:36:57 | <flashfire42> | https://tracker.archiveteam.org/google-sites/ is this still supposed to be spitting out items? |
| 09:50:19 | | railen63 quits [Remote host closed the connection] |
| 09:53:48 | | railen63 joins |
| 09:55:03 | | railen63 quits [Remote host closed the connection] |
| 09:55:16 | | railen63 joins |
| 10:00:01 | | railen63 quits [Remote host closed the connection] |
| 10:00:16 | | railen63 joins |
| 10:13:49 | | Ruthalas5 quits [Ping timeout: 265 seconds] |
| 10:23:42 | | Ruthalas5 (Ruthalas) joins |
| 11:22:29 | | icedice quits [Client Quit] |
| 11:28:42 | | vegbrasil joins |
| 11:32:46 | | nicolas17 quits [Ping timeout: 252 seconds] |
| 11:33:29 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 11:36:24 | | nicolas17 joins |
| 11:39:17 | | nicolas17 quits [Read error: Connection reset by peer] |
| 11:39:42 | | nicolas17 joins |
| 11:44:48 | | decky_e quits [Remote host closed the connection] |
| 11:54:53 | | diggan joins |
| 11:57:08 | | systwi__ (systwi) joins |
| 11:58:14 | | systwi quits [Ping timeout: 252 seconds] |
| 12:13:41 | | diggan quits [Ping timeout: 265 seconds] |
| 12:21:22 | | vegbrasil joins |
| 12:21:53 | | railen63 quits [Remote host closed the connection] |
| 12:22:48 | | diggan joins |
| 12:30:07 | | vegbrasil quits [Ping timeout: 265 seconds] |
| 12:40:35 | | railen63 joins |
| 12:41:46 | | railen63 quits [Remote host closed the connection] |
| 12:42:00 | | railen63 joins |
| 12:43:57 | | icedice (icedice) joins |
| 13:07:41 | | IDK quits [Quit: Connection closed for inactivity] |
| 13:14:04 | | vegbrasil joins |
| 13:18:32 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 13:22:37 | | dumbgoy joins |
| 13:32:18 | | vegbrasil joins |
| 13:37:03 | | JohnnyJ joins |
| 13:37:18 | | vegbrasil quits [Ping timeout: 265 seconds] |
| 13:39:06 | | TunaLobster joins |
| 13:43:48 | | vegbrasil joins |
| 13:52:05 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 14:01:17 | | vegbrasil joins |
| 14:04:33 | | HP_Archivist (HP_Archivist) joins |
| 14:04:44 | | beario_ quits [Ping timeout: 252 seconds] |
| 14:05:23 | | systwi__ is now known as systwi |
| 14:06:24 | | Pichu0102 quits [Ping timeout: 252 seconds] |
| 14:06:27 | | Pichu0202 joins |
| 14:08:14 | | vegbrasil quits [Ping timeout: 265 seconds] |
| 14:12:37 | | vegbrasil joins |
| 14:12:40 | | graham quits [Quit: The Lounge - https://thelounge.chat] |
| 14:15:39 | | railen63 quits [Remote host closed the connection] |
| 14:17:19 | | graham joins |
| 14:19:04 | | Naruyoko quits [Read error: Connection reset by peer] |
| 14:21:14 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 14:26:48 | | HP_Archivist quits [Client Quit] |
| 14:27:09 | | HP_Archivist (HP_Archivist) joins |
| 14:28:19 | | graham quits [Client Quit] |
| 14:28:32 | | nicolas17 quits [Ping timeout: 265 seconds] |
| 14:32:22 | | nicolas17 joins |
| 14:34:35 | | geezabiscuit quits [Quit: ZNC - https://znc.in] |
| 14:47:19 | | icedice quits [Client Quit] |
| 14:52:07 | | vegbrasil joins |
| 14:53:47 | | geezabiscuit joins |
| 14:53:47 | | geezabiscuit is now authenticated as geezabiscuit |
| 14:53:47 | | geezabiscuit quits [Changing host] |
| 14:53:47 | | geezabiscuit (geezabiscuit) joins |
| 14:56:26 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 15:26:32 | | diggan quits [Ping timeout: 265 seconds] |
| 15:35:27 | | diggan joins |
| 15:36:36 | | AmAnd0A quits [Ping timeout: 252 seconds] |
| 15:36:39 | | AmAnd0A joins |
| 15:43:40 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 15:44:03 | | AmAnd0A joins |
| 15:55:56 | <diggan> | where can I find the dictionary for a archive? For example, the dictionary used for zst files in https://archive.org/download/archiveteam_reddit_20230610072320_4ab81500 |
| 15:56:07 | <diggan> | the zst dictionary used for the compression, just to be clear |
| 15:58:28 | | lk quits [Client Quit] |
| 16:00:09 | | lk joins |
| 16:07:28 | | graham joins |
| 16:13:24 | | Dalek (Dalek) joins |
| 16:15:16 | | graham quits [Client Quit] |
| 16:26:17 | <imer> | diggan: the zst isn't actually one archive, its multiple I believe, JAA has a script for this (havent used myself, so no idea how): https://gitea.arpa.li/JustAnotherArchivist/little-things/src/branch/master/zstdwarccat |
| 16:27:16 | <diggan> | I see. Thanks imer! Regardless, they're compressed with a custom dictionary as far as I understand, which means I'd have to use the same dictionary to extract it |
| 16:27:54 | <imer> | the dict is included in the warc as a custom record or something like that, script should handle it |
| 16:31:18 | | geezabiscuit quits [Ping timeout: 265 seconds] |
| 16:35:10 | <diggan> | aha, I see. Thanks again imer |
| 16:37:08 | <diggan> | seems like that handy little utility did the trick, awesome |
| 16:46:07 | | railen64 joins |
| 16:47:05 | | railen64 quits [Remote host closed the connection] |
| 16:47:21 | | railen64 joins |
| 16:56:56 | | graham joins |
| 17:00:21 | | railen64 quits [Remote host closed the connection] |
| 17:07:02 | | geezabiscuit (geezabiscuit) joins |
| 17:09:44 | | graham quits [Client Quit] |
| 17:10:00 | | graham joins |
| 17:14:29 | | HP_Archivist quits [Ping timeout: 252 seconds] |
| 17:19:23 | | graham quits [Client Quit] |
| 17:21:50 | <that_lurker> | Could someone grab https://www.lpga.com/ and https://www.kpmgwomenspgachampionship.com/? The current merger of PGA and LIV could have an effect on those too |
| 17:26:47 | <fireonlive> | might be worth asking in #archivebot |
| 17:33:10 | | lk quits [Client Quit] |
| 17:40:36 | | railen64 joins |
| 17:40:51 | | beario quits [Remote host closed the connection] |
| 17:42:09 | | railen64 quits [Remote host closed the connection] |
| 17:42:26 | | beario joins |
| 17:43:37 | <dave> | Probably an obvious question, but why is 6 the max concurrency per warrior? The reddit project suggests up to 10 may work, and empirically I'm pulling with 8 archivers spread across two VMs no problem. It'd be nicer to crank one up to 8 so I can dedicate the other VM to a different project |
| 17:45:09 | <dave> | The warriors are consuming maybe 10Mbps out of my symmetric 1G, so looking for how to contribute more firepower without tripping IP rate limits |
| 17:47:44 | <Maakuth|m> | You can run more projects manually with docker https://wiki.archiveteam.org/index.php/Running_Archive_Team_Projects_with_Docker |
| 17:48:04 | | lk joins |
| 17:48:20 | <dave> | yeah, although for boring reasons that's annoying to get running on my infrastructure vs. spawning the warrior VM image |
| 17:49:39 | <Maakuth|m> | How about multiple vms and manually picking projects in warrior? |
| 17:50:30 | <dave> | yup, that's what I'm doing, but for that the limiting factor ends up being the RAM budget for each VM. Being able to pack more workers into one VM for projects that can sustain it would be more efficient. |
| 17:50:45 | <dave> | But that's just optimization, in practice yeah I'm spinning more VMs |
| 17:51:22 | <that_lurker> | If you are running the warrior you can go to the webportals settings and check the advanced options and increase it |
| 17:52:13 | <dave> | only up to 6 though, so for reddit I end up needing multiple VMs to get up to the right rate. That's why I was wondering why the limit is set at 6. Maybe to not overwhelm smaller projects with a ton of workers? I dunno. |
| 17:52:37 | <that_lurker> | oh yeah forgot that was max 6 |
| 17:55:11 | <dave> | looking at the instructions for running coordinated containers by hand, that does look kinda like what I'm looking for, so I should just go fix my infra to make that work. |
| 17:56:47 | <fireonlive> | #warrior would |
| 17:56:51 | <fireonlive> | be the better place for such |
| 17:57:17 | <dave> | ah, thx |
| 17:57:26 | <fireonlive> | =] |
| 18:06:34 | | AmAnd0A quits [Ping timeout: 252 seconds] |
| 18:07:12 | | AmAnd0A joins |
| 18:09:21 | | lk is now authenticated as lk |
| 18:10:30 | | railen63 joins |
| 18:10:34 | <mikolaj|m> | pabs: thank you. Just note that forum-dl's support for Pipermail (and Hyperkitty) archives currently works only by HTML scraping. I haven't implemented getting it from the mbox files, I'll get this working eventually. Also note that there's a tool called Perceval that has some support for Pipermail |
| 18:12:43 | | railen63 quits [Remote host closed the connection] |
| 18:12:58 | <fireonlive> | oh hey mikolaj|m are you the creator of forum-dl? |
| 18:13:16 | <mikolaj|m> | fireonlive: yes |
| 18:13:26 | <fireonlive> | ah! :) it looks quite cool thanks |
| 18:13:40 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 18:13:48 | | lk quits [Changing host] |
| 18:13:48 | | lk (lk) joins |
| 18:13:51 | <fireonlive> | I don't have a GitHub account for... me.. yet but I tried pointing it at https://forums.tomshardware.com but it doesn't seem to register that it's xenforo |
| 18:13:57 | | AmAnd0A joins |
| 18:14:01 | | vegbrasil joins |
| 18:14:04 | <fireonlive> | was wondering if there was a way to force a certain scraper |
| 18:14:41 | <@JAA> | pabs: Thanks for handling the Tor forums! :-) Re Mailman, there are definitely some in [[Internet infrastructure]] that could be listed there. I don't have any local lists, I think. |
| 18:14:46 | <mikolaj|m> | fireonlive: I'll fix this |
| 18:14:55 | <fireonlive> | thanks :) |
| 18:14:56 | <@JAA> | nicolas17: Yeah, I won't decompress or anything like that. |
| 18:15:14 | <fireonlive> | i'll get around to creating a new GitHub account shortly; sorry for short-circuiting the issues system |
| 18:15:26 | | nicolas17 is still on-and-off experimenting with compressing Apple updates |
| 18:15:49 | | lk quits [Client Quit] |
| 18:15:57 | | lk (lk) joins |
| 18:16:51 | <nicolas17> | last week's WWDC released 223GB of betas |
| 18:17:27 | <fireonlive> | oh damn |
| 18:17:29 | <mikolaj|m> | fireonlive: forum themes break forum-dl frequently, but fortunately each can also be fixed extremely quickly (it usually just requires tweaking selectors a bit) |
| 18:17:39 | <fireonlive> | ah! i see |
| 18:18:01 | <nicolas17> | next week they'll probably post beta 2 and it will be a similar size |
| 18:18:03 | <fireonlive> | custom themeing can be a bit of a bane |
| 18:18:08 | | railen63 joins |
| 18:18:09 | <mikolaj|m> | and I haven't had enough time to write more tests for each extractor |
| 18:18:15 | <@JAA> | imer, diggan: To be precise, the dictionary is stored in a skippable zstd frame. Standard tooling knows nothing about that frame, so as the name suggests, it just skips over it (and then fails to decompress because it doesn't have the right dict). An attempt to upstream this format led to feature creep (support multiple dictionaries, individual frames compressed without a dict, indexing, etc.) until it ground to a halt. |
| 18:18:32 | <mikolaj|m> | but I expect to be able to reliably cover at least 90% of online forums |
| 18:18:40 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 18:19:06 | <fireonlive> | if i have a `$ grab-site https://labs-web-bay.vercel.app/` and promise I'm a good boy can i upload that to IA and get it moved to the 'wayback machine loves me' collection? lol |
| 18:19:10 | | railen63 quits [Remote host closed the connection] |
| 18:19:23 | | railen63 joins |
| 18:19:46 | <nicolas17> | fireonlive: would we conclude you're a good boy if we saw your browser history? ;) |
| 18:19:48 | <mikolaj|m> | the remaining 10% will be covered by making it easy to override selectors, and it would be interesting to try generating the selectors automatically, maybe using an LLM |
| 18:20:12 | <mikolaj|m> | (it may be less than 10%, hopefully) |
| 18:20:15 | <fireonlive> | nicolas17: 😇 depends which browser profile |
| 18:20:34 | <fireonlive> | ooh that does sound cool |
| 18:23:11 | <mikolaj|m> | I found PhpBB to be the worst when it comes to forum-dl breaking due to theming |
| 18:23:32 | <@rewby> | Awful idea. ChatGPT to create selectors. Lol |
| 18:24:23 | <nicolas17> | hey chatgpt how do I change my forum html to break mikolaj's crawler |
| 18:24:27 | <nicolas17> | it's an arms race :D |
| 18:24:51 | <mikolaj|m> | actually there's a ChatGPT-powered library for scraping already: https://github.com/jamesturk/scrapeghost |
| 18:24:51 | <fireonlive> | (it's also very small so AB could have it done in like 5 mins) |
| 18:25:40 | <mikolaj|m> | but I'm afraid it would be quite expensive scraping ;) |
| 18:28:02 | | lk quits [Client Quit] |
| 18:28:06 | | lk (lk) joins |
| 18:28:41 | <mikolaj|m> | auto-generating just the selectors somehow would be much cheaper |
| 18:45:25 | | railen63 quits [Remote host closed the connection] |
| 18:48:05 | | railen64 joins |
| 18:48:57 | | railen64 quits [Remote host closed the connection] |
| 18:49:11 | | railen64 joins |
| 18:52:48 | | decky_e (decky_e) joins |
| 18:54:28 | | HP_Archivist (HP_Archivist) joins |
| 19:07:29 | | graham joins |
| 19:12:00 | | railen64 quits [Remote host closed the connection] |
| 19:12:50 | | graham quits [Client Quit] |
| 19:14:59 | | graham joins |
| 19:24:15 | | hitgrr8 joins |
| 19:30:27 | | graham quits [Client Quit] |
| 19:33:58 | | Abacus6427 joins |
| 19:37:14 | | vegbrasil joins |
| 19:41:32 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 19:43:00 | | vegbrasil joins |
| 19:47:23 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 19:55:39 | | railen63 joins |
| 19:55:39 | | railen63 quits [Remote host closed the connection] |
| 19:56:45 | | railen64 joins |
| 20:00:31 | | vegbrasil joins |
| 20:05:00 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 20:08:30 | | vegbrasil joins |
| 20:16:32 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 20:32:21 | | vegbrasil joins |
| 20:39:28 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 20:51:26 | | vegbrasil joins |
| 20:56:06 | | sorch joins |
| 20:59:16 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 21:07:00 | | fireonlive quits [Client Quit] |
| 21:07:27 | | fireonlive (fireonlive) joins |
| 21:13:06 | | vegbrasil joins |
| 21:16:08 | | AmAnd0A quits [Ping timeout: 252 seconds] |
| 21:16:32 | | AmAnd0A joins |
| 21:19:15 | | graham joins |
| 21:19:38 | <@arkiver> | OrIdow6: i'm getting that one started tomorrow |
| 21:20:49 | | vegbrasil quits [Ping timeout: 265 seconds] |
| 21:24:06 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 21:24:50 | | AmAnd0A joins |
| 21:25:39 | | graham quits [Client Quit] |
| 21:30:56 | | graham joins |
| 21:37:23 | <andrew> | I'm trying to get wpull to resume an unfinished grab-site crawl but it seems to be spending a lot of time doing absolutely nothing: |
| 21:37:26 | <andrew> | ❯ /nix/store/9rrr3q95w3zqwp97b66mxxn5kfxah9zl-python3.8-ludios_wpull-3.0.9/bin/wpull3 -U 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:103.0) Gecko/20100101 Firefox/103.0' --header 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' --header 'Accept-Language: en-US,en;q=0.5' --no-check-certificate --no-robots --inet4-only --dns-timeout 20 --connect-timeout 20 --read-timeout 900 --session-timeout 172800 --tries 3 --waitretry 5 --max-redirect 8 --append-output wpull.log --database wpull.db --save-cookies cookies.txt --delete-after --page-requisites --no-parent --concurrent 6 --warc-file ztank-archive-ddosecrets-search2-queue.txt-2023-06-10-a0f7d115 --warc-max-size 5368709120 --warc-cdx --strip-session-id --escaped-fragment --level inf --page-requisites-level 5 --span-hosts-allow page-requisites --sitemaps --recursive --warc-append --domains domains.txt -v https://search.ddosecrets.com/ |
| 21:37:26 | <andrew> | psutil: No module named 'psutil'. Resource monitoring will be unavailable. |
| 21:37:26 | <andrew> | INFO FINISHED. |
| 21:37:26 | <andrew> | INFO Duration: 2:16:36. Speed: -- B/s. |
| 21:37:26 | <andrew> | INFO Downloaded: 0 files, 0.0 B. |
| 21:37:53 | <andrew> | domains.txt includes a list of domains that I want to be included in the crawl |
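One hedged way to diagnose the instant FINISHED: if everything in wpull.db is already marked completed, a resumed crawl exits immediately with nothing downloaded. The table and column names below are guesses at wpull's SQLite schema; verify with .schema first.

```python
# Inspect the crawl state in wpull.db. 'urls' and 'status' are assumed
# names; adjust after checking the actual schema.
import sqlite3

con = sqlite3.connect("wpull.db")
print([r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")])
for status, n in con.execute("SELECT status, COUNT(*) FROM urls GROUP BY status"):
    print(status, n)
```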
| 21:38:31 | | gwetchen|m joins |
| 21:47:00 | | Muad-Dib quits [Remote host closed the connection] |
| 21:47:17 | | Muad-Dib joins |
| 21:49:36 | | graham quits [Client Quit] |
| 21:51:19 | | lk quits [Client Quit] |
| 21:51:31 | | lk (lk) joins |
| 21:54:19 | | lk quits [Client Quit] |
| 21:54:30 | | fangfufu quits [Quit: ZNC 1.8.2+deb2+b1 - https://znc.in] |
| 21:59:47 | | fangfufu joins |
| 22:00:03 | | fangfufu is now authenticated as fangfufu |
| 22:00:06 | | Hajdar quits [Remote host closed the connection] |
| 22:00:22 | | Hajdar (Hajdar) joins |
| 22:01:23 | | lk (lk) joins |
| 22:21:27 | | vegbrasil joins |
| 22:27:18 | <betamax> | JAA: to answer your question from ~24 hours ago, the URL the game uses is http://media.schoolofdragons.com/ |
| 22:27:30 | <@JAA> | Ah, of course there's another one. :-) |
| 22:27:32 | <@JAA> | Thanks! |
| 22:27:44 | <betamax> | (I realise there's a lot of scrollback and I haven't caught up yet - feel free to mention / ping me for specific things) |
| 22:27:47 | <nicolas17> | augh the duplicate downloads >< |
| 22:27:55 | | hitgrr8 quits [Client Quit] |
| 22:28:32 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 22:28:39 | <fireonlive> | what an odd way they chose to organize things |
| 22:29:28 | <nicolas17> | fireonlive: in this case it's 3 CDN URLs pointing at the same underlying S3 bucket |
| 22:29:57 | <fireonlive> | ah i mainly just meant all the duplicate files spread across platform directories |
| 22:30:03 | <betamax> | the bucket seems to be used for all / many of JumpStart's games; only School of Dragons is confirmed to be shutting down, but JumpStart is being sued, so it might be a good idea to grab everything |
| 22:30:10 | <nicolas17> | yeah that's another issue |
| 22:30:14 | <fireonlive> | i think i read the same assets being everywhere |
| 22:30:18 | <betamax> | but Schoolofdragons has a hard deadline of 30 June |
| 22:30:59 | <nicolas17> | fireonlive: I think JAA intends to download them from all 4 URLs (raw bucket, schoolofdragons, jumpstart, knowledgeadventure) so that they all work in the WBM |
| 22:31:10 | <nicolas17> | that will get nicely deduplicated in the WARCs |
| 22:31:30 | <nicolas17> | but it means 24TB of downloading for him and it pains me xD |
| 22:31:52 | <fireonlive> | dedupe is best dupe |
| 22:32:03 | <nicolas17> | *in addition* there's a ton of duplication in the data itself yeah |
| 22:32:08 | <fireonlive> | but haha yeah |
| 22:32:15 | <h2ibot> | Usernam edited List of websites excluded from the Wayback Machine (+31): https://wiki.archiveteam.org/?diff=49903&oldid=49890 |
| 22:32:16 | <h2ibot> | Yts98 uploaded File:LINE BLOG icon.png: https://wiki.archiveteam.org/?title=File%3ALINE%20BLOG%20icon.png |
| 22:32:38 | <nicolas17> | in DWADragonsUnity: |
| 22:32:39 | <nicolas17> | Total data: 6916.14 GiB in 7757546 files |
| 22:32:41 | <nicolas17> | Unique data: 3037.20 GiB in 1143186 files |
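For reference, stats like these can be produced with a straightforward hash-and-count pass over the tree; a minimal sketch (the hash choice and path handling are illustrative):

```python
# Hash every file under a root directory and count each distinct digest's
# size once to get total vs. unique data.
import hashlib
import os
import sys

total = unique = total_n = 0
seen = set()
for root, _, files in os.walk(sys.argv[1]):
    for name in files:
        path = os.path.join(root, name)
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        size = os.path.getsize(path)
        total += size
        total_n += 1
        if h.digest() not in seen:
            seen.add(h.digest())
            unique += size
print(f"Total data: {total / 2**30:.2f} GiB in {total_n} files")
print(f"Unique data: {unique / 2**30:.2f} GiB in {len(seen)} files")
```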
| 22:33:33 | <fireonlive> | all that S3 space they're needlessly paying for |
| 22:33:34 | <fireonlive> | lol |
| 22:35:07 | <Elizabeth> | if my brain isn't dead, it's only ~$160/mo there assuming it's all in standard tier; was probably cheaper than figuring out what needed to go. |
| 22:35:32 | <betamax> | OK, brief update from my friend: |
| 22:35:36 | <betamax> | 1.) They're apparently still updating the game. There's ~2 devs still assigned to this, and they're trying to push final updates even though the game will be shut down in <1 month |
| 22:35:48 | <betamax> | 2.) My friend is in contact with one of the devs (that's how they heard the above). So getting a list of all the updated assets closer to shutdown date may be possible. |
| 22:36:51 | <dumbgoy> | betamax: what are ya talking about? interested |
| 22:36:52 | <@JAA> | nicolas17: Looks like the vast majority of data is under DWADragonsUnity, yeah. |
| 22:37:05 | <nicolas17> | JAA: oh I simply didn't check the rest of the bucket yet |
| 22:37:25 | <@JAA> | Total stats: 8.06 TiB in 11656937 objects, 3.21 unique TiB in 1673341 objects |
| 22:37:54 | <fireonlive> | ah ok, 160/mo is peanuts to them |
| 22:38:04 | <betamax> | dumbgoy: the School of Dragons game (https://www.schoolofdragons.com/) |
| 22:38:39 | <dumbgoy> | gotcha |
| 22:38:47 | <nicolas17> | WAIT |
| 22:38:48 | <dumbgoy> | keep on rockin, i just popped in and was wondering |
| 22:38:51 | <nicolas17> | data has already been deleted?! |
| 22:39:08 | <nicolas17> | no, ffs I was on the wrong prefix |
| 22:39:09 | <@JAA> | Hm? |
| 22:39:12 | <nicolas17> | god I panicked |
| 22:39:13 | <@JAA> | Ah |
| 22:39:23 | <nicolas17> | that's what I get for relying on shell history |
| 22:39:27 | <nicolas17> | DWADragonsCodingUnity is a different folder >.< |
| 22:39:32 | <betamax> | apparently old data for previous versions of the game was in the DWAStandAlone subfolder on the bucket |
| 22:39:40 | <dumbgoy> | love ya guys, keep up the good work.! |
| 22:39:52 | <@JAA> | betamax: I'll just grab all of it since it doesn't make a huge difference. |
| 22:40:13 | <betamax> | great! (I'm just copy-pasting messages from my friend into here, so apologies if I'm mentioning stuff that's already covered) |
| 22:40:50 | <@JAA> | A bit over 24 TiB to download into something like 3.3 TiB of WARCs. Sounds fun. :-) |
| 22:41:14 | <@JAA> | There'll be a little bit of duplication in my data most likely, but I'll try to keep it to a minimum. |
| 22:41:15 | <betamax> | JAA: the stats you posted above (8.06 TiB) - is that the entire bucket or just DWADragonsUnity? |
| 22:41:21 | <@JAA> | Entire bucket |
| 22:41:32 | <@JAA> | nicolas17 posted the ones for just DWADragonsUnity. |
| 22:41:42 | <@JAA> | (I didn't rerun that analysis.) |
| 22:42:18 | <betamax> | thanks! |
| 22:42:45 | <dumbgoy> | anyone know about the feasibility and data requirements to pull all of the Wayback Machine from archive.org? I imagine it would be MASSIVE, and not sure about the process, maybe wget? |
| 22:42:46 | <@JAA> | I'm getting a very similar number for the number of files in that though. |
| 22:42:51 | | sorch quits [Client Quit] |
| 22:42:56 | <dumbgoy> | i'm worried about them going down |
| 22:43:11 | <@JAA> | dumbgoy: Do you have too much money? |
| 22:43:29 | <dumbgoy> | not really, i wish i did |
| 22:43:33 | <@JAA> | Infeasible, especially through the WBM. |
| 22:43:35 | <nicolas17> | betamax: fun fact, if I look *only* at *.mp4 files, there's 1247 MiB unique data (92 files) + 514697 MiB duplicate (48357 files) |
| 22:43:39 | <dumbgoy> | i kind of figured, just wondered |
| 22:44:06 | <@JAA> | Data requirements are in the dozens of TB per day of data. |
| 22:44:24 | <dumbgoy> | ouch, they refuse to even put residential fiber here, and i don't make too much money |
| 22:44:45 | <@JAA> | 1.8 PiB of web items in the past month |
| 22:44:59 | <dumbgoy> | lordy. well thanks for answering, love ya guys |
| 22:44:59 | <@JAA> | (Not all of those are publicly accessible though.) |
| 22:45:15 | <betamax> | JAA / nicolas17: if you notice any data in the bucket that looks like user posts, pls let me know. they shut down the user forum with no warning a couple of months ago, and a lot of that data is not in wayback |
| 22:45:50 | <nicolas17> | dumbgoy: archiveteam's reddit archival project alone is at 2.85 PiB |
| 22:45:56 | <dumbgoy> | wow |
| 22:46:18 | <nicolas17> | adding 27GiB per minute right now |
| 22:46:33 | <dumbgoy> | let me know if there's ANYTHING I can do to help with any archival. |
| 22:46:51 | <dumbgoy> | I only have around 8~tb free right now, but if i can help you folks lemme know |
| 22:47:18 | <@JAA> | betamax: Very unlikely that that would be on their CDN S3 bucket. I'll dump the complete bucket listing onto IA anyway. |
| 22:47:48 | <@JAA> | It's only ~300 MiB as .jsonl.zst. :-) |
| 22:47:50 | <nicolas17> | dumbgoy: the warrior stuff doesn't need much disk space, it downloads from (eg.) reddit and immediately uploads elsewhere, it won't accumulate much data on your disk |
| 22:48:02 | <nicolas17> | JAA: do you have your own script for that? |
| 22:48:18 | <dumbgoy> | if ya elaborate a bit, and tell me what's needed, not informed about warrior stuff |
| 22:48:43 | <nicolas17> | "aws s3 ls" outputs text with "timestamp size filename" |
| 22:48:45 | <@JAA> | nicolas17: Yep, https://gitea.arpa.li/JustAnotherArchivist/little-things/src/branch/master/s3-bucket-list (warning, may cause brain death) |
| 22:49:07 | <@JAA> | It has no retries, so due to those weird timeouts, I had to script around it in Bash. :-) |
| 22:49:08 | <nicolas17> | "aws s3api list-objects" seems to collect everything in memory and when it finishes it outputs a single JSON array |
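For contrast, a paginated listing can be streamed to JSONL with roughly constant memory, which is presumably how a bucket of this size ends up as a ~300 MiB .jsonl.zst; a sketch (field names are illustrative):

```python
# Stream a bucket listing to JSONL, one object per line, roughly one page
# (~1000 keys) in memory at a time.
import json

import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
with open("listing.jsonl", "w") as out:
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket="origin.ka.cdn"):
        for obj in page.get("Contents", []):
            out.write(json.dumps({
                "key": obj["Key"],
                "size": obj["Size"],
                "mtime": obj["LastModified"].isoformat(),
                "etag": obj["ETag"],
            }) + "\n")
```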
| 22:52:08 | <fireonlive> | <stackoverflow parsing xml with regex post> |
| 22:52:21 | | NIC007a83 joins |
| 22:52:22 | <fireonlive> | (lol i don't care about that) |
| 22:53:13 | <@JAA> | The regex is only used for crude validation of the beginning of the response. The parsing happens with string slicing etc. :-P |
| 22:53:36 | <@JAA> | But yes, I like Tony the pony. |
| 22:55:57 | <nicolas17> | how long does it take you to list the bucket? |
| 22:55:59 | <@JAA> | I also converted it to a script that can invoke qwarc to do it fast and archive the responses as WARC. It's even more ridiculous: https://gitea.arpa.li/JustAnotherArchivist/little-things/src/branch/master/s3-bucket-list-qwarc |
| 22:57:14 | <TheTechRobo> | JAA: Why did you do all that readarray stuff instead of just `cat > ${prefix}.py << EOF` ? |
| 22:57:24 | <TheTechRobo> | (i think its called a heredoc?) |
| 22:57:29 | <@JAA> | This one took 8 hours due to all the timeouts. |
| 22:57:46 | <@JAA> | Don't have good enough logs to subtract those. |
| 22:57:55 | <TheTechRobo> | Oh indentation. |
| 22:57:58 | <nicolas17> | ew |
| 22:57:59 | <TheTechRobo> | Didnt check that comment yet. |
| 22:58:38 | | Braven joins |
| 22:58:40 | <@JAA> | I can relist the bucket much faster now with the qwarc version. |
| 22:59:01 | <@JAA> | Since I can split it into pretty equal parts and process those in parallel. |
| 23:00:29 | <@JAA> | When a bucket has nice patterns, that can be used from the start of course. |
| 23:00:38 | <@JAA> | But this one is a slight mess. |
| 23:08:47 | <fireonlive> | love the bash script |
| 23:12:26 | <fireonlive> | is there a good place to shove google drive links |
| 23:22:17 | <@JAA> | #googlecrash in theory, but that's been dormant for a long time. |
| 23:22:28 | <@JAA> | We should revive it in #mediaonfire style. |
| 23:27:55 | | TunaLobster_extra joins |
| 23:30:50 | | TunaLobster quits [Ping timeout: 265 seconds] |
| 23:36:12 | | NIC007a83 quits [Client Quit] |
| 23:39:30 | | BlueMaxima joins |
| 23:43:11 | | lolesports joins |
| 23:43:58 | <masterX244> | True, got a batch of links from a crawl, too |
| 23:45:28 | <fireonlive> | yee :) |
| 23:49:25 | <pabs> | mikolaj|m: are you able to update the Mailman wiki page to add those two things? |
| 23:52:19 | <mikolaj|m> | pabs: what two things? |
| 23:52:33 | <pabs> | the ones you mentioned above :) |
| 23:52:44 | <pabs> | <mikolaj|m> pabs: thank you. Just note that forum-dl's support for Pipermail (and Hyperkitty) archives currently works only by HTML scraping. I haven't implemented getting it from the mbox files, I'll get this working eventually. Also note that there's a tool called Perceval that has some support for Pipermail |
| 23:53:07 | <pabs> | (please link to Perceval too) |
| 23:53:56 | <@JAA> | Hmm, what does an ETag value of '96a106ed73262892740656e84c5437b2-1' mean on AWS S3? |
| 23:54:29 | <mikolaj|m> | pabs: I would need to make an account on your wiki, I might not have enough spoons for that today (executive dysfunction) |
| 23:55:14 | <@JAA> | Ah, multi-part uploads, apparently. |
| 23:55:23 | <@JAA> | https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html#large-object-checksums |
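Per the linked docs, a multipart ETag is the MD5 of the concatenated per-part MD5 digests, suffixed with -<part count>; so a value ending in -1 is a multipart upload with a single part. A sketch of reproducing it locally (the part size has to be guessed; 8 MiB is a common CLI default):

```python
# Compute the ETag AWS assigns to a multipart upload: MD5 over the raw
# per-part MD5 digests, plus "-<number of parts>".
import hashlib

def multipart_etag(path, part_size=8 * 1024 * 1024):
    digests = []
    with open(path, "rb") as f:
        while chunk := f.read(part_size):
            digests.append(hashlib.md5(chunk).digest())
    return hashlib.md5(b"".join(digests)).hexdigest() + f"-{len(digests)}"

# A match against "96a106ed73262892740656e84c5437b2-1" would confirm the
# object was uploaded as a single-part multipart upload.
```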
| 23:56:56 | | geezabiscuit quits [Ping timeout: 265 seconds] |