00:00:12<fireonlive>thank you, on by default directory listings
00:00:16<fireonlive>^o^
00:07:09<nicolas17>...I just now realized this is Knowledge Adventure
00:07:19jtagcat quits [Quit: Bye!]
00:07:33jtagcat (jtagcat) joins
00:08:10<nicolas17>I have several 90s-era KA games
00:08:11<@JAA>I was wondering what 'ka' stands for. Thanks.
00:08:27<nicolas17>there's JumpStart stuff in the bucket too
00:08:53dumbgoy_ quits [Read error: Connection reset by peer]
00:08:53<nicolas17>all the JS* dirs are probably for https://en.wikipedia.org/wiki/JumpStart
00:09:45<@JAA>Looks like this is the bucket behind http://media1.knowledgeadventure.com/ then.
00:10:10<nicolas17>let's find out
00:10:27<@JAA>It matches. I checked a couple random games from the KA site.
00:10:28<nicolas17>http://media1.knowledgeadventure.com/DWADragonsUnity/MAC/3.24.0/High/movies/StoickMemorial.mp4
00:11:44<nicolas17>(my kingdom for an S3 listing of the bucket behind updates.cdn-apple.com!)
00:12:20<fireonlive>i like how http://media1.knowledgeadventure.com/DWADragonsUnity/ says "Key: DWADragonsUnity/DoNotDelete.jpg" but they're going to purge it all
00:13:25vegbrasil joins
00:16:47<nicolas17>JAA: https://media.jumpstart.com/ uses the same bucket
00:17:11<@JAA>Great
00:17:31<nicolas17>are you going to archive the whole bucket? it seems the SODWebsite/ prefix has lots of assets used by schoolofdragons.com
00:18:05vegbrasil quits [Ping timeout: 252 seconds]
00:18:06<@JAA>betamax: Can you ask your friend which of https://s3.amazonaws.com/origin.ka.cdn/ http://media1.knowledgeadventure.com/ https://media.jumpstart.com/ was actually used by the game?
00:18:14<@JAA>Or is, I suppose.
00:18:58<@JAA>nicolas17: DWADragonsUnity only at the moment, but yes, depending on total size, I'd like to grab it all.
00:19:40<nicolas17>I got the complete list by getting lists of every version in parallel
00:19:46<nicolas17>the other directories don't have such a consistent structure
00:20:07<nicolas17>might be a very lopsided tree :)
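(For context on the listing approach nicolas17 describes: a public bucket that allows anonymous ListObjectsV2 can be walked one prefix per worker. A minimal sketch, assuming the bucket permits unauthenticated listing and that the version prefixes have already been discovered; the two prefixes below are illustrative.)

```python
# List a public bucket by prefix, several prefixes in parallel, using
# anonymous ListObjectsV2 over plain HTTP. Assumes the bucket allows
# unauthenticated listing.
import concurrent.futures
import xml.etree.ElementTree as ET

import requests

BUCKET_URL = 'https://s3.amazonaws.com/origin.ka.cdn/'
NS = '{http://s3.amazonaws.com/doc/2006-03-01/}'

def list_prefix(prefix):
    """Yield (key, size) for every object under one prefix."""
    token = None
    while True:
        params = {'list-type': '2', 'prefix': prefix}
        if token:
            params['continuation-token'] = token
        root = ET.fromstring(requests.get(BUCKET_URL, params=params).content)
        for obj in root.iter(f'{NS}Contents'):
            yield obj.find(f'{NS}Key').text, int(obj.find(f'{NS}Size').text)
        token_el = root.find(f'{NS}NextContinuationToken')
        if token_el is None:  # absent when the listing is not truncated
            return
        token = token_el.text

# Illustrative prefixes; the real version tree is discovered first with delimiter='/'.
prefixes = ['DWADragonsUnity/MAC/3.24.0/', 'DWADragonsUnity/WIN/3.24.0/']
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for results in pool.map(lambda p: list(list_prefix(p)), prefixes):
        for key, size in results:
            print(size, key)
```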
00:20:42<@JAA>I might also grab everything under all three URLs above with dedupe.
00:20:52<@JAA>Let me know if you find any further domains like that.
00:21:06<@JAA>I can't actually start pulling tonight anyway.
00:21:12<nicolas17>6TB x 3 download then?
00:21:45<nicolas17>er 7TB+
00:21:59<@JAA>Sure, if there's enough time, might as well.
00:22:16<@JAA>That's only 10 MB/s over the remaining 20 days.
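(A quick back-of-envelope check of that rate, taking ~7 TB per copy across the three hostnames discussed above:)

```python
# Sanity check: three copies of a ~7 TB bucket over the ~20 remaining days.
print(f'{3 * 7e12 / (20 * 86400) / 1e6:.1f} MB/s')  # ~12 MB/s, on the order of 10 MB/s
```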
00:22:46<nicolas17>if I tried that I'm sure my ISP would yell at me... and I don't have 3TB free disk anyway :D
00:23:08<@JAA>OVH won't even blink. :-P
00:24:49<fireonlive>all i have is a measly 300/20Mbps (at best) at home lol
00:25:03<fireonlive>they always fuck you on the upload
00:25:07<fireonlive>but that's HFC for you
00:25:20<@JAA>I could get symmetric 25G here if I wanted. But meh.
00:25:28<fireonlive>oooh
00:25:35<@JAA>Also, -ot :-)
00:25:40<fireonlive>ye
00:36:08BearFortress joins
00:39:50<@JAA>I'll let it battle through these timeouts overnight.
01:29:45Abacus6427 joins
01:34:32tzt quits [Ping timeout: 252 seconds]
01:43:25Abacus6427 quits [Ping timeout: 265 seconds]
01:45:49dumbgoy joins
02:01:53Naruyoko quits [Remote host closed the connection]
02:02:15Naruyoko joins
02:07:28imer quits [Client Quit]
02:08:05imer (imer) joins
02:16:35tzt (tzt) joins
02:16:53Mateon1 quits [Remote host closed the connection]
02:16:53diggan quits [Remote host closed the connection]
02:16:59Mateon1 joins
02:40:12imer quits [Client Quit]
02:43:17imer (imer) joins
02:49:53HP_Archivist quits [Ping timeout: 252 seconds]
02:59:14tzt quits [Ping timeout: 252 seconds]
03:00:58tzt (tzt) joins
03:04:41<pabs>https://torrentfreak.com/youtube-orders-invidious-privacy-software-to-shut-down-in-7-days-230609
03:20:39Mateon1 quits [Remote host closed the connection]
03:20:59Mateon1 joins
03:34:10taggart quits [Client Quit]
03:39:42<pabs>that_lurker: re Tor Forums migration, I am doing an AB job for the Tor forums, will do an AB !ao < afterwards to capture the redirects too
03:40:15<pabs>(https://blog.torproject.org/tor-project-forum-migration/)
03:46:01Elizabeth quits [Remote host closed the connection]
03:46:11elizabeth joins
03:56:35elizabeth quits [Client Quit]
03:56:39Elizabeth joins
03:57:52<h2ibot>PaulWise edited Mailman (+142, mention that forum-dl supports pipermail archives): https://wiki.archiveteam.org/?diff=49893&oldid=21240
03:57:59<pabs>mikolaj|m: ^
03:58:10Elizabeth quits [Client Quit]
03:58:12Elizabeth (Elizabeth) joins
03:58:52<h2ibot>PaulWise edited Mailman (-2, fix link): https://wiki.archiveteam.org/?diff=49894&oldid=49893
03:58:55Elizabeth quits [Changing host]
03:58:55Elizabeth (Elizabeth) joins
04:17:55<h2ibot>PaulWise edited Mailman (+159, mention that mailman 2 is now EOL): https://wiki.archiveteam.org/?diff=49895&oldid=49894
04:28:10emberquill (emberquill) joins
04:33:04GNU_world joins
04:35:29nicolas17 quits [Ping timeout: 252 seconds]
04:37:59<h2ibot>PaulWise created Mailman2 (+43214, start a page about mailman2 archiving): https://wiki.archiveteam.org/?title=Mailman2
04:38:44<pabs>JAA: if you have mailman2/pipermail URLs locally, please add them to ^
04:38:51<pabs>(same for anyone else here)
04:40:13nicolas17 joins
04:47:05dumbgoy quits [Ping timeout: 265 seconds]
04:55:02<h2ibot>PaulWise edited Mailman2 (+1588, add more tips, sites that were already done by me): https://wiki.archiveteam.org/?diff=49897&oldid=49896
05:02:26vegbrasil joins
05:06:54vegbrasil quits [Ping timeout: 265 seconds]
05:21:40futawe joins
05:24:09<futawe>fyi, StackExchange: "The job that uploads the data dump to Archive.org was disabled on 28 March, and marked to not be re-enabled without approval of senior leadership. Had it run as scheduled, it would have completed on the first Monday after the first Sunday in June"
05:24:09<futawe>https://meta.stackexchange.com/questions/389922/june-2023-data-dump-is-missing/390023#390023
05:25:47futawe quits [Remote host closed the connection]
05:26:00futawe joins
05:26:12futawe quits [Remote host closed the connection]
05:30:01<nicolas17>"organizations looking to profit from the work of our community" that sounds like... stackexchange itself?
05:32:16<nicolas17>JAA: purely out of curiosity (since I know you want the exact bytes for preservation etc etc) I looked into how much I can deduplicate/compress this KA/DWADragonsUnity data
05:33:21<nicolas17>it seems .unity3d files have internal LZMA compression, and if I decompress that I'd probably get far better deduplication, but I can't compress it back to the same data, I guess Unity doesn't use the standard LZMA library? :/
05:33:49<nicolas17>I tried all compression levels and none matches
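(A sketch of the round-trip experiment nicolas17 describes, using Python's reference LZMA bindings. It assumes the bare .lzma stream has already been extracted from the .unity3d container, which is not shown here; `asset.lzma` is a hypothetical file name.)

```python
# Decompress a bare .lzma stream with the reference library, then try to
# reproduce the original bytes at every standard preset.
import lzma

with open('asset.lzma', 'rb') as f:  # hypothetical extracted stream
    stream = f.read()
raw = lzma.decompress(stream, format=lzma.FORMAT_ALONE)

for preset in range(10):
    if lzma.compress(raw, format=lzma.FORMAT_ALONE, preset=preset) == stream:
        print(f'preset {preset} reproduces the original')
        break
else:
    # nicolas17's result: no standard preset matches, suggesting Unity's
    # encoder makes different choices than the reference implementation.
    print('no preset matches')
```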
05:51:11<h2ibot>PaulWise edited ArchiveBot (+1956, add section on alternative dashboard clients): https://wiki.archiveteam.org/?diff=49898&oldid=49790
05:52:48BlueMaxima quits [Read error: Connection reset by peer]
05:53:19railen63 quits [Remote host closed the connection]
05:53:31vegbrasil joins
05:53:35railen63 joins
05:56:12<h2ibot>PaulWise edited ArchiveBot (-10, fix formatting): https://wiki.archiveteam.org/?diff=49899&oldid=49898
06:00:13<h2ibot>PaulWise edited ArchiveBot (-21, fix command): https://wiki.archiveteam.org/?diff=49900&oldid=49899
06:00:34vegbrasil quits [Ping timeout: 252 seconds]
06:05:37nicolas17 quits [Read error: Connection reset by peer]
06:06:13a joins
06:06:14a quits [Remote host closed the connection]
06:06:33nicolas17 joins
07:01:27lk quits [Client Quit]
07:01:32lk joins
07:53:00manu|m quits [Quit: issued !quit command]
07:55:34manu|m joins
07:58:34<h2ibot>OrIdow6 uploaded File:Egloos logo.gif: https://wiki.archiveteam.org/?title=File%3AEgloos%20logo.gif
08:06:42pabs quits [Ping timeout: 265 seconds]
08:15:37<h2ibot>OrIdow6 created Egloos (+400, Created page with "{{Infobox project | title =…): https://wiki.archiveteam.org/?title=Egloos
08:23:34Dango360 quits [Ping timeout: 252 seconds]
08:23:41Dango360 (Dango360) joins
08:34:35pabs (pabs) joins
08:37:11pabs quits [Remote host closed the connection]
08:39:34pabs (pabs) joins
09:13:18T31M quits [Client Quit]
09:13:37T31M joins
09:21:04<@OrIdow6>arkiver: https://github.com/OrIdow6/egloos-grab https://github.com/OrIdow6/egloos-items - Framework/backfeed.lua in the former needs keys - long tail so I have not bothered with an exact estimate, but it feels like < 5 TB fair to me, and there's stuff to cut out if it gets too high
09:36:57<flashfire42>https://tracker.archiveteam.org/google-sites/ is this still supposed to be spitting out items?
09:50:19railen63 quits [Remote host closed the connection]
09:53:48railen63 joins
09:55:03railen63 quits [Remote host closed the connection]
09:55:16railen63 joins
10:00:01railen63 quits [Remote host closed the connection]
10:00:16railen63 joins
10:13:49Ruthalas5 quits [Ping timeout: 265 seconds]
10:23:42Ruthalas5 (Ruthalas) joins
11:22:29icedice quits [Client Quit]
11:28:42vegbrasil joins
11:32:46nicolas17 quits [Ping timeout: 252 seconds]
11:33:29vegbrasil quits [Ping timeout: 252 seconds]
11:36:24nicolas17 joins
11:39:17nicolas17 quits [Read error: Connection reset by peer]
11:39:42nicolas17 joins
11:44:48decky_e quits [Remote host closed the connection]
11:54:53diggan joins
11:57:08systwi__ (systwi) joins
11:58:14systwi quits [Ping timeout: 252 seconds]
12:13:41diggan quits [Ping timeout: 265 seconds]
12:21:22vegbrasil joins
12:21:53railen63 quits [Remote host closed the connection]
12:22:48diggan joins
12:30:07vegbrasil quits [Ping timeout: 265 seconds]
12:40:35railen63 joins
12:41:46railen63 quits [Remote host closed the connection]
12:42:00railen63 joins
12:43:57icedice (icedice) joins
13:07:41IDK quits [Quit: Connection closed for inactivity]
13:14:04vegbrasil joins
13:18:32vegbrasil quits [Ping timeout: 252 seconds]
13:22:37dumbgoy joins
13:32:18vegbrasil joins
13:37:03JohnnyJ joins
13:37:18vegbrasil quits [Ping timeout: 265 seconds]
13:39:06TunaLobster joins
13:43:48vegbrasil joins
13:52:05vegbrasil quits [Ping timeout: 252 seconds]
14:01:17vegbrasil joins
14:04:33HP_Archivist (HP_Archivist) joins
14:04:44beario_ quits [Ping timeout: 252 seconds]
14:05:23systwi__ is now known as systwi
14:06:24Pichu0102 quits [Ping timeout: 252 seconds]
14:06:27Pichu0202 joins
14:08:14vegbrasil quits [Ping timeout: 265 seconds]
14:12:37vegbrasil joins
14:12:40graham quits [Quit: The Lounge - https://thelounge.chat]
14:15:39railen63 quits [Remote host closed the connection]
14:17:19graham joins
14:19:04Naruyoko quits [Read error: Connection reset by peer]
14:21:14vegbrasil quits [Ping timeout: 252 seconds]
14:26:48HP_Archivist quits [Client Quit]
14:27:09HP_Archivist (HP_Archivist) joins
14:28:19graham quits [Client Quit]
14:28:32nicolas17 quits [Ping timeout: 265 seconds]
14:32:22nicolas17 joins
14:34:35geezabiscuit quits [Quit: ZNC - https://znc.in]
14:47:19icedice quits [Client Quit]
14:52:07vegbrasil joins
14:53:47geezabiscuit joins
14:53:47geezabiscuit quits [Changing host]
14:53:47geezabiscuit (geezabiscuit) joins
14:56:26vegbrasil quits [Ping timeout: 252 seconds]
15:26:32diggan quits [Ping timeout: 265 seconds]
15:35:27diggan joins
15:36:36AmAnd0A quits [Ping timeout: 252 seconds]
15:36:39AmAnd0A joins
15:43:40AmAnd0A quits [Read error: Connection reset by peer]
15:44:03AmAnd0A joins
15:55:56<diggan>where can I find the dictionary for an archive? For example, the dictionary used for zst files in https://archive.org/download/archiveteam_reddit_20230610072320_4ab81500
15:56:07<diggan>the zst dictionary used for the compression, just to be clear
15:58:28lk quits [Client Quit]
16:00:09lk joins
16:07:28graham joins
16:13:24Dalek (Dalek) joins
16:15:16graham quits [Client Quit]
16:26:17<imer>diggan: the zst isn't actually one archive, it's multiple I believe, JAA has a script for this (haven't used it myself, so no idea how): https://gitea.arpa.li/JustAnotherArchivist/little-things/src/branch/master/zstdwarccat
16:27:16<diggan>I see. Thanks imer! Regardless, they're compressed with a custom dictionary as far as I understand, which means I'd have to use the same dictionary to extract it
16:27:54<imer>the dict is included in the warc as a custom record or something like that, script should handle it
16:31:18geezabiscuit quits [Ping timeout: 265 seconds]
16:35:10<diggan>aha, I see. Thanks again imer
16:37:08<diggan>seems like that handy little utility did the trick, awesome
16:46:07railen64 joins
16:47:05railen64 quits [Remote host closed the connection]
16:47:21railen64 joins
16:56:56graham joins
17:00:21railen64 quits [Remote host closed the connection]
17:07:02geezabiscuit (geezabiscuit) joins
17:09:44graham quits [Client Quit]
17:10:00graham joins
17:14:29HP_Archivist quits [Ping timeout: 252 seconds]
17:19:23graham quits [Client Quit]
17:21:50<that_lurker>Could someone grab https://www.lpga.com/ and https://www.kpmgwomenspgachampionship.com/ The current merger of pga and liv could have an effect on those too
17:26:47<fireonlive>might be worth asking in #archivebot
17:33:10lk quits [Client Quit]
17:40:36railen64 joins
17:40:51beario quits [Remote host closed the connection]
17:42:09railen64 quits [Remote host closed the connection]
17:42:26beario joins
17:43:37<dave>Probably an obvious question, but why is 6 the max concurrency per warrior? The reddit project suggests up to 10 may work, and empirically I'm pulling with 8 archivers spread across two VMs no problem. It'd be nicer to crank one up to 8 so I can dedicate the other VM to a different project
17:45:09<dave>The warriors are consuming maybe 10Mbps out of my symmetric 1G, so looking for how to contribute more firepower without tripping IP rate limits
17:47:44<Maakuth|m>You can run more projects manually with docker https://wiki.archiveteam.org/index.php/Running_Archive_Team_Projects_with_Docker
17:48:04lk joins
17:48:20<dave>yeah, although for boring reasons that's annoying to get running on my infrastructure vs. spawning the warrior VM image
17:49:39<Maakuth|m>How about multiple vms and manually picking projects in warrior?
17:50:30<dave>yup, that's what I'm doing, but for that the limiting factor ends up being the RAM budget for each VM. Being able to pack more workers into one VM for projects that can sustain it would be more efficient.
17:50:45<dave>But that's just optimization, in practice yeah I'm spinning more VMs
17:51:22<that_lurker>If you are running the warrior, you can go to the web portal's settings, check the advanced options, and increase it
17:52:13<dave>only up to 6 though, so for reddit I end up needing multiple VMs to get up to the right rate. That's why I was wondering why the limit is set at 6. Maybe to not overwhelm smaller projects with a ton of workers? I dunno.
17:52:37<that_lurker>oh yeah forgot that was max 6
17:55:11<dave>looking at the instructions for running coordinated containers by hand, that does look kinda like what I'm looking for, so I should just go fix my infra to make that work.
17:56:47<fireonlive>#warrior would
17:56:51<fireonlive>be the better place for such
17:57:17<dave>ah, thx
17:57:26<fireonlive>=]
18:06:34AmAnd0A quits [Ping timeout: 252 seconds]
18:07:12AmAnd0A joins
18:10:30railen63 joins
18:10:34<mikolaj|m>pabs: thank you. Just note that forum-dl's support for Pipermail (and Hyperkitty) archives currently works only by HTML scraping. I haven't implemented getting it from the mbox files, I'll get this working eventually. Also note that there's a tool called Perceval that has some support for Pipermail
18:12:43railen63 quits [Remote host closed the connection]
18:12:58<fireonlive>oh hey mikolaj|m are you the creator of forum-dl?
18:13:16<mikolaj|m>fireonlive: yes
18:13:26<fireonlive>ah! :) it looks quite cool thanks
18:13:40AmAnd0A quits [Read error: Connection reset by peer]
18:13:48lk quits [Changing host]
18:13:48lk (lk) joins
18:13:51<fireonlive>I don't have a GitHub account for... me.. yet but I tried pointing it at https://forums.tomshardware.com but it doesn't seem to register that it's xenforo
18:13:57AmAnd0A joins
18:14:01vegbrasil joins
18:14:04<fireonlive>was wondering if there was a way to force a certain scraper
18:14:41<@JAA>pabs: Thanks for handling the Tor forums! :-) Re Mailman, there are definitely one in [[Internet infrastructure]] that could be listed there. I don't have any local lists, I think.
18:14:46<mikolaj|m>fireonlive: I'll fix this
18:14:55<fireonlive>thanks :)
18:14:56<@JAA>nicolas17: Yeah, I won't decompress or anything like that.
18:15:14<fireonlive>i'll get around to creating a new GitHub account shortly; sorry for short-circuiting the issues system
18:15:26nicolas17 is still on-and-off experimenting with compressing Apple updates
18:15:49lk quits [Client Quit]
18:15:57lk (lk) joins
18:16:51<nicolas17>last week's WWDC released 223GB of betas
18:17:27<fireonlive>oh damn
18:17:29<mikolaj|m>fireonlive: forum themes break forum-dl frequently, but fortunately each can also be fixed extremely quickly (it usually just requires tweaking selectors a bit)
18:17:39<fireonlive>ah! i see
18:18:01<nicolas17>next week they'll probably post beta 2 and it will be a similar size
18:18:03<fireonlive>custom themeing can be a bit of a bane
18:18:08railen63 joins
18:18:09<mikolaj|m>and I haven't had enough time to write more tests for each extractor
18:18:15<@JAA>imer, diggan: To be precise, the dictionary is stored in a skippable zstd frame. Standard tooling knows nothing about that frame, so as the name suggests, it just skips over it (and then fails to decompress because it doesn't have the right dict). An attempt to upstream this format led to feature creep (support multiple dictionaries, individual frames compressed without a dict, indexing, etc.) until it ground to a halt.
18:18:32<mikolaj|m>but I expect to be able to reliably cover at least 90% of online forums
18:18:40vegbrasil quits [Ping timeout: 252 seconds]
18:19:06<fireonlive>if i have a `$ grab-site https://labs-web-bay.vercel.app/` and promise I'm a good boy can i upload that to IA and get it moved to the 'wayback machine loves me' collection? lol
18:19:10railen63 quits [Remote host closed the connection]
18:19:23railen63 joins
18:19:46<nicolas17>fireonlive: would we conclude you're a good boy if we saw your browser history? ;)
18:19:48<mikolaj|m>the remaining 10% will be covered by making it easy to override selectors, and it would be interesting to try generating the selectors automatically, maybe using an LLM
18:20:12<mikolaj|m>(it may be less than 10%, hopefully)
18:20:15<fireonlive>nicolas17: 😇 depends which browser profile
18:20:34<fireonlive>ooh that does sound cool
18:23:11<mikolaj|m>I found PhpBB to be the worst when it comes to forum-dl breaking due to theming
18:23:32<@rewby>Awful idea. ChatGPT to create selectors. Lol
18:24:23<nicolas17>hey chatgpt how do I change my forum html to break mikolaj's crawler
18:24:27<nicolas17>it's an arms race :D
18:24:51<mikolaj|m>actually there's a ChatGPT-powered library for scraping already: https://github.com/jamesturk/scrapeghost
18:24:51<fireonlive>(it's also very small so AB could have it done in like 5 mins)
18:25:40<mikolaj|m>but I'm afraid it would be quite expensive scraping ;)
18:28:02lk quits [Client Quit]
18:28:06lk (lk) joins
18:28:41<mikolaj|m>auto-generating just the selectors somehow would be much cheaper
18:45:25railen63 quits [Remote host closed the connection]
18:48:05railen64 joins
18:48:57railen64 quits [Remote host closed the connection]
18:49:11railen64 joins
18:52:48decky_e (decky_e) joins
18:54:28HP_Archivist (HP_Archivist) joins
19:07:29graham joins
19:12:00railen64 quits [Remote host closed the connection]
19:12:50graham quits [Client Quit]
19:14:59graham joins
19:24:15hitgrr8 joins
19:30:27graham quits [Client Quit]
19:33:58Abacus6427 joins
19:37:14vegbrasil joins
19:41:32vegbrasil quits [Ping timeout: 252 seconds]
19:43:00vegbrasil joins
19:47:23vegbrasil quits [Ping timeout: 252 seconds]
19:55:39railen63 joins
19:55:39railen63 quits [Remote host closed the connection]
19:56:45railen64 joins
20:00:31vegbrasil joins
20:05:00vegbrasil quits [Ping timeout: 252 seconds]
20:08:30vegbrasil joins
20:16:32vegbrasil quits [Ping timeout: 252 seconds]
20:32:21vegbrasil joins
20:39:28vegbrasil quits [Ping timeout: 252 seconds]
20:51:26vegbrasil joins
20:56:06sorch joins
20:59:16vegbrasil quits [Ping timeout: 252 seconds]
21:07:00fireonlive quits [Client Quit]
21:07:27fireonlive (fireonlive) joins
21:13:06vegbrasil joins
21:16:08AmAnd0A quits [Ping timeout: 252 seconds]
21:16:32AmAnd0A joins
21:19:15graham joins
21:19:38<@arkiver>OrIdow6: i'm getting that one started tomorrow
21:20:49vegbrasil quits [Ping timeout: 265 seconds]
21:24:06AmAnd0A quits [Read error: Connection reset by peer]
21:24:50AmAnd0A joins
21:25:39graham quits [Client Quit]
21:30:56graham joins
21:37:23<andrew>I'm trying to get wpull to resume an unfinished grab-site crawl but it seems to be spending a lot of time doing absolutely nothing:
21:37:26<andrew>❯ /nix/store/9rrr3q95w3zqwp97b66mxxn5kfxah9zl-python3.8-ludios_wpull-3.0.9/bin/wpull3 -U 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:103.0) Gecko/20100101 Firefox/103.0' --header 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' --header 'Accept-Language: en-US,en;q=0.5' --no-check-certificate --no-robots --inet4-only --dns-timeout 20 --connect-timeout 20 --read-timeout 900 --session-timeout 172800 --tries 3 --waitretry 5 --max-redirect 8 --append-output wpull.log --database wpull.db --save-cookies cookies.txt --delete-after --page-requisites --no-parent --concurrent 6 --warc-file ztank-archive-ddosecrets-search2-queue.txt-2023-06-10-a0f7d115 --warc-max-size 5368709120 --warc-cdx --strip-session-id --escaped-fragment --level inf --page-requisites-level 5 --span-hosts-allow page-requisites --sitemaps --recursive --warc-append --domains domains.txt -v https://search.ddosecrets.com/
21:37:26<andrew>psutil: No module named 'psutil'. Resource monitoring will be unavailable.
21:37:26<andrew>INFO FINISHED.
21:37:26<andrew>INFO Duration: 2:16:36. Speed: -- B/s.
21:37:26<andrew>INFO Downloaded: 0 files, 0.0 B.
21:37:53<andrew>domains.txt includes a list of domains that I want to be included in the crawl
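(One way to diagnose an instant "FINISHED" like the above is to look at what the on-disk queue actually contains. A generic sketch that discovers wpull's tables and columns at runtime rather than assuming any particular schema:)

```python
# Inspect wpull's SQLite database: list tables and, where a 'status' column
# exists, count rows per status. A resumable crawl should show some URLs
# still in a to-do-like state; if none are, wpull has nothing left to fetch
# and exits immediately.
import sqlite3

con = sqlite3.connect('wpull.db')
for (table,) in con.execute("SELECT name FROM sqlite_master WHERE type = 'table'"):
    cols = [row[1] for row in con.execute(f'PRAGMA table_info("{table}")')]
    print(table, cols)
    if 'status' in cols:
        for status, n in con.execute(
                f'SELECT status, COUNT(*) FROM "{table}" GROUP BY status'):
            print(' ', status, n)
```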
21:38:31gwetchen|m joins
21:47:00Muad-Dib quits [Remote host closed the connection]
21:47:17Muad-Dib joins
21:49:36graham quits [Client Quit]
21:51:19lk quits [Client Quit]
21:51:31lk (lk) joins
21:54:19lk quits [Client Quit]
21:54:30fangfufu quits [Quit: ZNC 1.8.2+deb2+b1 - https://znc.in]
21:59:47fangfufu joins
22:00:06Hajdar quits [Remote host closed the connection]
22:00:22Hajdar (Hajdar) joins
22:01:23lk (lk) joins
22:21:27vegbrasil joins
22:27:18<betamax>JAA: to answer your question from ~24 hours ago, the URL the game uses is http://media.schoolofdragons.com/
22:27:30<@JAA>Ah, of course there's another one. :-)
22:27:32<@JAA>Thanks!
22:27:44<betamax>(I realise there's a lot of scrollback and I haven't caught up yet - feel free to mention / ping me for specific things)
22:27:47<nicolas17>augh the duplicate downloads ><
22:27:55hitgrr8 quits [Client Quit]
22:28:32vegbrasil quits [Ping timeout: 252 seconds]
22:28:39<fireonlive>what an odd way they chose to organize things
22:29:28<nicolas17>fireonlive: in this case it's 3 CDN URLs pointing at the same underlying S3 bucket
22:29:57<fireonlive>ah i mainly just meant all the duplicate files spread across platform directories
22:30:03<betamax>the bucket seems to be used for all / many of JumpStart's games; only School of Dragons is confirmed to be shutting down, but JumpStart is being sued, so it might be a good idea to grab everything
22:30:10<nicolas17>yeah that's another issue
22:30:14<fireonlive>i think i read the same assets being everywhere
22:30:18<betamax>but Schoolofdragons has a hard deadline of 30 June
22:30:59<nicolas17>fireonlive: I think JAA intends to download them from all 4 URLs (raw bucket, schoolofdragons, jumpstart, knowledgeadventure) so that they all work in the WBM
22:31:10<nicolas17>that will get nicely deduplicated in the WARCs
22:31:30<nicolas17>but it means 24TB of downloading for him and it pains me xD
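(For context on how that deduplication usually works: repeat downloads of an identical payload are written as small WARC revisit records that point back at the first full response, so only the headers of the duplicates are stored. A sketch with warcio; the URLs and digest below are illustrative placeholders, not actual records from this grab.)

```python
# Write a revisit record: a pointer to the original capture instead of a
# second copy of the payload. All values below are placeholders.
from warcio.warcwriter import WARCWriter

with open('dedupe-example.warc.gz', 'wb') as out:
    writer = WARCWriter(out, gzip=True)
    record = writer.create_revisit_record(
        'https://media.jumpstart.com/DWADragonsUnity/MAC/3.24.0/High/movies/StoickMemorial.mp4',
        digest='sha1:PLACEHOLDERPAYLOADDIGEST',  # payload digest of the original response
        refers_to_uri='http://media1.knowledgeadventure.com/DWADragonsUnity/MAC/3.24.0/High/movies/StoickMemorial.mp4',
        refers_to_date='2023-06-12T00:00:00Z')
    writer.write_record(record)
```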
22:31:52<fireonlive>dedupe is best dupe
22:32:03<nicolas17>*in addition* there's a ton of duplication in the data itself yeah
22:32:08<fireonlive>but haha yeah
22:32:15<h2ibot>Usernam edited List of websites excluded from the Wayback Machine (+31): https://wiki.archiveteam.org/?diff=49903&oldid=49890
22:32:16<h2ibot>Yts98 uploaded File:LINE BLOG icon.png: https://wiki.archiveteam.org/?title=File%3ALINE%20BLOG%20icon.png
22:32:38<nicolas17>in DWADragonsUnity:
22:32:39<nicolas17>Total data: 6916.14 GiB in 7757546 files
22:32:41<nicolas17>Unique data: 3037.20 GiB in 1143186 files
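(Stats of this shape can be produced by hashing every file and counting each distinct digest once. A sketch, not necessarily nicolas17's actual method:)

```python
# Walk a directory tree, hash each file in chunks, and tally total vs.
# distinct-content bytes and file counts.
import hashlib
from pathlib import Path

def file_digest(path: Path) -> bytes:
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        while chunk := f.read(1 << 20):  # chunked, so large files fit in memory
            h.update(chunk)
    return h.digest()

total_bytes = total_files = 0
unique = {}  # digest -> size of one representative copy
for path in Path('DWADragonsUnity').rglob('*'):
    if path.is_file():
        size = path.stat().st_size
        total_bytes += size
        total_files += 1
        unique.setdefault(file_digest(path), size)

print(f'Total data: {total_bytes / 2**30:.2f} GiB in {total_files} files')
print(f'Unique data: {sum(unique.values()) / 2**30:.2f} GiB in {len(unique)} files')
```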
22:33:33<fireonlive>all that S3 space they're needlessly paying for
22:33:34<fireonlive>lol
22:35:07<Elizabeth>if my brain isn't dead, it's only 160/mo there assuming it's all in standard tier; it was probably cheaper than figuring out what needed to go.
22:35:32<betamax>OK, brief update from my friend:
22:35:36<betamax>1.) They're apparently still updating the game. There's ~2 devs still assigned to this, and they're trying to push final updates even though the game will be shut down in <1 month
22:35:48<betamax>2.) My friend is in contact with one of the devs (that's how they heard the above). So getting a list of all the updated assets closer to shutdown date may be possible.
22:36:51<dumbgoy>betamax: what are ya talking about? interested
22:36:52<@JAA>nicolas17: Looks like the vast majority of data is under DWADragonsUnity, yeah.
22:37:05<nicolas17>JAA: oh I simply didn't check the rest of the bucket yet
22:37:25<@JAA>Total stats: 8.06 TiB in 11656937 objects, 3.21 unique TiB in 1673341 objects
22:37:54<fireonlive>ah ok, 160/mo is peanuts to them
22:38:04<betamax>dumbgoy: the School of Dragons game (https://www.schoolofdragons.com/)
22:38:39<dumbgoy>gotcha
22:38:47<nicolas17>WAIT
22:38:48<dumbgoy>keep on rockin, i just popped in and was wondering
22:38:51<nicolas17>data has already been deleted?!
22:39:08<nicolas17>no, ffs I was on the wrong prefix
22:39:09<@JAA>Hm?
22:39:12<nicolas17>god I panicked
22:39:13<@JAA>Ah
22:39:23<nicolas17>that's what I get for relying on shell history
22:39:27<nicolas17>DWADragonsCodingUnity is a different folder >.<
22:39:32<betamax>apparently old data for previous versions of the game was in the DWAStandAlone subfolder on the bucket
22:39:40<dumbgoy>love ya guys, keep up the good work.!
22:39:52<@JAA>betamax: I'll just grab all of it since it doesn't make a huge difference.
22:40:13<betamax>great! (I'm just copy-pasting messages from my friend into here, so apologies if I'm mentioning stuff that's already covered)
22:40:50<@JAA>A bit over 24 TiB to download into something like 3.3 TiB of WARCs. Sounds fun. :-)
22:41:14<@JAA>There'll be a little bit of duplication in my data most likely, but I'll try to keep it to a minimum.
22:41:15<betamax>JAA: the stats you posted above (8.06 TiB) - is that the entire bucket or just DWADragonsUnity?
22:41:21<@JAA>Entire bucket
22:41:32<@JAA>nicolas17 posted the ones for just DWADragonsUnity.
22:41:42<@JAA>(I didn't rerun that analysis.)
22:42:18<betamax>thanks!
22:42:45<dumbgoy>anyone know about the feasibility and data requirements to pull all of waybackmachine from archive.org? I imagine it would be MASSIVE, and not sure about the process, maybe wget?
22:42:46<@JAA>I'm getting a very similar number for the number of files in that though.
22:42:51sorch quits [Client Quit]
22:42:56<dumbgoy>i'm worried about them going down
22:43:11<@JAA>dumbgoy: Do you have too much money?
22:43:29<dumbgoy>not really, i wish i did
22:43:33<@JAA>Infeasible, especially through the WBM.
22:43:35<nicolas17>betamax: fun fact, if I look *only* at *.mp4 files, there's 1247 MiB unique data (92 files) + 514697 MiB duplicate (48357 files)
22:43:39<dumbgoy>i kind of figured, just wondered
22:44:06<@JAA>Data requirements are in the dozens of TB per day of data.
22:44:24<dumbgoy>ouch, they refuse to even put residential fiber here, and i don't make too much money
22:44:45<@JAA>1.8 PiB of web items in the past month
22:44:59<dumbgoy>lordy. well thanks for answering, love ya guys
22:44:59<@JAA>(Not all of those are publicly accessible though.)
22:45:15<betamax>JAA / nicolas17: if you notice any data in the bucket that looks like user posts, pls let me know. they shut down the user forum with no warning a couple of months ago, and a lot of that data is not in wayback
22:45:50<nicolas17>dumbgoy: archiveteam's reddit archival project alone is at 2.85 PiB
22:45:56<dumbgoy>wow
22:46:18<nicolas17>adding 27GiB per minute right now
22:46:33<dumbgoy>let me know if there's ANYTHING I can do to help with any archival.
22:46:51<dumbgoy>I only have around 8~tb free right now, but if i can help you folks lemme know
22:47:18<@JAA>betamax: Very unlikely that that would be on their CDN S3 bucket. I'll dump the complete bucket listing onto IA anyway.
22:47:48<@JAA>It's only ~300 MiB as .jsonl.zst. :-)
22:47:50<nicolas17>dumbgoy: the warrior stuff doesn't need much disk space, it downloads from (eg.) reddit and immediately uploads elsewhere, it won't accumulate much data on your disk
22:48:02<nicolas17>JAA: do you have your own script for that?
22:48:18<dumbgoy>if ya elaborate a bit, and tell me what's needed, not informed about warrior stuff
22:48:43<nicolas17>"aws s3 ls" outputs text with "timestamp size filename"
22:48:45<@JAA>nicolas17: Yep, https://gitea.arpa.li/JustAnotherArchivist/little-things/src/branch/master/s3-bucket-list (warning, may cause brain death)
22:49:07<@JAA>It has no retries, so due to those weird timeouts, I had to script around it in Bash. :-)
22:49:08<nicolas17>"aws s3api list-objects" seems to collect everything in memory and when it finishes it outputs a single JSON array
22:52:08<fireonlive><stackoverflow parsing xml with regex post>
22:52:21NIC007a83 joins
22:52:22<fireonlive>(lol i don't care about that)
22:53:13<@JAA>The regex is only used for crude validation of the beginning of the response. The parsing happens with string slicing etc. :-P
22:53:36<@JAA>But yes, I like Tony the pony.
22:55:57<nicolas17>how long does it take you to list the bucket?
22:55:59<@JAA>I also converted it to a script that can invoke qwarc to do it fast and archive the responses as WARC. It's even more ridiculous: https://gitea.arpa.li/JustAnotherArchivist/little-things/src/branch/master/s3-bucket-list-qwarc
22:57:14<TheTechRobo>JAA: Why did you do all that readarray stuff instead of just `cat > ${prefix}.py << EOF` ?
22:57:24<TheTechRobo>(i think its called a heredoc?)
22:57:29<@JAA>This one took 8 hours due to all the timeouts.
22:57:46<@JAA>Don't have good enough logs to subtract those.
22:57:55<TheTechRobo>Oh indentation.
22:57:58<nicolas17>ew
22:57:59<TheTechRobo>Didn't check that comment yet.
22:58:38Braven joins
22:58:40<@JAA>I can relist the bucket much faster now with the qwarc version.
22:59:01<@JAA>Since I can split it into pretty equal parts and process those in parallel.
23:00:29<@JAA>When a bucket has nice patterns, that can be used from the start of course.
23:00:38<@JAA>But this one is a slight mess.
23:08:47<fireonlive>love the bash script
23:12:26<fireonlive>is there a good place to shove google drive links
23:22:17<@JAA>#googlecrash in theory, but that's been dormant for a long time.
23:22:28<@JAA>We should revive it in #mediaonfire style.
23:27:55TunaLobster_extra joins
23:30:50TunaLobster quits [Ping timeout: 265 seconds]
23:36:12NIC007a83 quits [Client Quit]
23:39:30BlueMaxima joins
23:43:11lolesports joins
23:43:58<masterX244>True, got a batch of links from a crawl, too
23:45:28<fireonlive>yee :)
23:49:25<pabs>mikolaj|m: are you able to update the Mailman wiki page to add those two things?
23:52:19<mikolaj|m>pabs: what two things?
23:52:33<pabs>the ones you mentioned above :)
23:52:44<pabs><mikolaj|m> pabs: thank you. Just note that forum-dl's support for Pipermail (and Hyperkitty) archives currently works only by HTML scraping. I haven't implemented getting it from the mbox files, I'll get this working eventually. Also note that there's a tool called Perceval that has some support for Pipermail
23:53:07<pabs>(please link to Perceval too)
23:53:56<@JAA>Hmm, what does an ETag value of '96a106ed73262892740656e84c5437b2-1' mean on AWS S3?
23:54:29<mikolaj|m>pabs: I would need to make an account on your wiki, I might not have enough spoons for that today (executive dysfunction)
23:55:14<@JAA>Ah, multi-part uploads, apparently.
23:55:23<@JAA>https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html#large-object-checksums
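(Per those docs, a multipart ETag is the MD5 of the concatenated per-part MD5 digests with "-&lt;part count&gt;" appended, so it can be reproduced locally if the part size is known. A sketch; the 8 MiB part size and the file name are assumptions.)

```python
# Recompute a multipart-style ETag for a local file. The part size must
# match whatever the uploader used; 8 MiB is a common default.
import hashlib

def multipart_etag(path: str, part_size: int = 8 * 1024 * 1024) -> str:
    part_digests = []
    with open(path, 'rb') as f:
        while chunk := f.read(part_size):
            part_digests.append(hashlib.md5(chunk).digest())
    combined = hashlib.md5(b''.join(part_digests)).hexdigest()
    return f'{combined}-{len(part_digests)}'

# A "-1" suffix (as in JAA's example) means a single-part multipart upload,
# so the value still differs from the plain MD5 of the file.
print(multipart_etag('some-object.bin'))  # hypothetical local copy
```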
23:56:56geezabiscuit quits [Ping timeout: 265 seconds]