| 00:00:12 | <fireonlive> | thank you, directory listings on by default |
| 00:00:16 | <fireonlive> | ^o^ |
| 00:07:09 | <nicolas17> | ...I just now realized this is Knowledge Adventure |
| 00:07:19 | | jtagcat quits [Quit: Bye!] |
| 00:07:33 | | jtagcat (jtagcat) joins |
| 00:08:10 | <nicolas17> | I have several 90s-era KA games |
| 00:08:11 | <@JAA> | I was wondering what 'ka' stands for. Thanks. |
| 00:08:27 | <nicolas17> | there's JumpStart stuff in the bucket too |
| 00:08:53 | | dumbgoy_ quits [Read error: Connection reset by peer] |
| 00:08:53 | <nicolas17> | all the JS* dirs are probably for https://en.wikipedia.org/wiki/JumpStart |
| 00:09:45 | <@JAA> | Looks like this is the bucket behind http://media1.knowledgeadventure.com/ then. |
| 00:10:10 | <nicolas17> | let's find out |
| 00:10:27 | <@JAA> | It matches. I checked a couple random games from the KA site. |
| 00:10:28 | <nicolas17> | http://media1.knowledgeadventure.com/DWADragonsUnity/MAC/3.24.0/High/movies/StoickMemorial.mp4 |
| 00:11:44 | <nicolas17> | (my kingdom for an S3 listing of the bucket behind updates.cdn-apple.com!) |
| 00:12:20 | <fireonlive> | i like how http://media1.knowledgeadventure.com/DWADragonsUnity/ says "Key: DWADragonsUnity/DoNotDelete.jpg" but they're going to purge it all |
| 00:13:25 | | vegbrasil joins |
| 00:16:47 | <nicolas17> | JAA: https://media.jumpstart.com/ uses the same bucket |
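A minimal sketch of the kind of anonymous bucket listing being discussed, in Python with boto3, assuming the bucket allows unauthenticated ListObjectsV2 (the bucket name comes from the URL JAA posted above):

```python
# List a public S3 bucket without credentials. Unsigned requests only work
# if the bucket policy allows anonymous s3:ListBucket, as origin.ka.cdn
# apparently does.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
for page in s3.get_paginator("list_objects_v2").paginate(Bucket="origin.ka.cdn"):
    for obj in page.get("Contents", []):
        print(obj["LastModified"], obj["Size"], obj["Key"])
```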
| 00:17:11 | <@JAA> | Great |
| 00:17:31 | <nicolas17> | are you going to archive the whole bucket? it seems the SODWebsite/ prefix has lots of assets used by schoolofdragons.com |
| 00:18:05 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 00:18:06 | <@JAA> | betamax: Can you ask your friend which of https://s3.amazonaws.com/origin.ka.cdn/ http://media1.knowledgeadventure.com/ https://media.jumpstart.com/ was actually used by the game? |
| 00:18:14 | <@JAA> | Or is, I suppose. |
| 00:18:58 | <@JAA> | nicolas17: DWADragonsUnity only at the moment, but yes, depending on total size, I'd like to grab it all. |
| 00:19:40 | <nicolas17> | I got the complete list by getting lists of every version in parallel |
| 00:19:46 | <nicolas17> | the other directories don't have such a consistent structure |
| 00:20:07 | <nicolas17> | might be a very lopsided tree :) |
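A hedged sketch of the parallel approach nicolas17 describes: enumerate the per-version "directories" with the delimiter parameter, then list each prefix concurrently. The DWADragonsUnity/MAC/ prefix layout is an assumption inferred from the URLs above.

```python
# Discover version prefixes under one platform directory, then list each
# prefix in parallel threads.
from concurrent.futures import ThreadPoolExecutor

import boto3
from botocore import UNSIGNED
from botocore.config import Config

BUCKET = "origin.ka.cdn"
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

def list_prefix(prefix):
    keys = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=prefix):
        keys.extend(o["Key"] for o in page.get("Contents", []))
    return keys

# With Delimiter set, CommonPrefixes acts like a directory listing.
resp = s3.list_objects_v2(Bucket=BUCKET, Delimiter="/", Prefix="DWADragonsUnity/MAC/")
versions = [p["Prefix"] for p in resp.get("CommonPrefixes", [])]
with ThreadPoolExecutor(max_workers=8) as pool:
    keys = [k for part in pool.map(list_prefix, versions) for k in part]
print(f"{len(keys)} objects across {len(versions)} versions")
```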
| 00:20:42 | <@JAA> | I might also grab everything under all three URLs above with dedupe. |
| 00:20:52 | <@JAA> | Let me know if you find any further domains like that. |
| 00:21:06 | <@JAA> | I can't actually start pulling tonight anyway. |
| 00:21:12 | <nicolas17> | 6TB x 3 download then? |
| 00:21:45 | <nicolas17> | er 7TB+ |
| 00:21:59 | <@JAA> | Sure, if there's enough time, might as well. |
| 00:22:16 | <@JAA> | That's only 10 MB/s over the remaining 20 days. |
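(The arithmetic holds: 3 × 6 TB = 18 TB over 20 days is 18 × 10¹² B ÷ 1,728,000 s ≈ 10.4 MB/s; with the revised 7 TB+ figure it is closer to 12 MB/s.)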
| 00:22:46 | <nicolas17> | if I tried that I'm sure my ISP would yell at me... and I don't have 3TB free disk anyway :D |
| 00:23:08 | <@JAA> | OVH won't even blink. :-P |
| 00:24:49 | <fireonlive> | all i have is a measly 300/20Mbps (at best) at home lol |
| 00:25:03 | <fireonlive> | they always fuck you on the upload |
| 00:25:07 | <fireonlive> | but that's HFC for you |
| 00:25:20 | <@JAA> | I could get symmetric 25G here if I wanted. But meh. |
| 00:25:28 | <fireonlive> | oooh |
| 00:25:35 | <@JAA> | Also, -ot :-) |
| 00:25:40 | <fireonlive> | ye |
| 00:36:08 | | BearFortress joins |
| 00:39:50 | <@JAA> | I'll let it battle through these timeouts overnight. |
| 01:29:45 | | Abacus6427 joins |
| 01:34:32 | | tzt quits [Ping timeout: 252 seconds] |
| 01:43:25 | | Abacus6427 quits [Ping timeout: 265 seconds] |
| 01:45:49 | | dumbgoy joins |
| 02:01:53 | | Naruyoko quits [Remote host closed the connection] |
| 02:02:15 | | Naruyoko joins |
| 02:07:28 | | imer quits [Client Quit] |
| 02:08:05 | | imer (imer) joins |
| 02:16:35 | | tzt (tzt) joins |
| 02:16:53 | | Mateon1 quits [Remote host closed the connection] |
| 02:16:53 | | diggan quits [Remote host closed the connection] |
| 02:16:59 | | Mateon1 joins |
| 02:40:12 | | imer quits [Client Quit] |
| 02:43:17 | | imer (imer) joins |
| 02:49:53 | | HP_Archivist quits [Ping timeout: 252 seconds] |
| 02:59:14 | | tzt quits [Ping timeout: 252 seconds] |
| 03:00:58 | | tzt (tzt) joins |
| 03:04:41 | <pabs> | https://torrentfreak.com/youtube-orders-invidious-privacy-software-to-shut-down-in-7-days-230609 |
| 03:20:39 | | Mateon1 quits [Remote host closed the connection] |
| 03:20:59 | | Mateon1 joins |
| 03:34:10 | | taggart quits [Client Quit] |
| 03:39:42 | <pabs> | that_lurker: re Tor Forums migration, I am doing an AB job for the Tor forums, will do an AB !ao < afterwards to capture the redirects too |
| 03:40:15 | <pabs> | (https://blog.torproject.org/tor-project-forum-migration/) |
| 03:46:01 | | Elizabeth quits [Remote host closed the connection] |
| 03:46:11 | | elizabeth joins |
| 03:46:44 | | elizabeth is now authenticated as Elizabeth |
| 03:56:35 | | elizabeth quits [Client Quit] |
| 03:56:39 | | Elizabeth joins |
| 03:57:40 | | Elizabeth is now authenticated as Elizabeth |
| 03:57:52 | <h2ibot> | PaulWise edited Mailman (+142, mention that forum-dl supports pipermail archives): https://wiki.archiveteam.org/?diff=49893&oldid=21240 |
| 03:57:59 | <pabs> | mikolaj|m: ^ |
| 03:58:10 | | Elizabeth quits [Client Quit] |
| 03:58:12 | | Elizabeth (Elizabeth) joins |
| 03:58:52 | <h2ibot> | PaulWise edited Mailman (-2, fix link): https://wiki.archiveteam.org/?diff=49894&oldid=49893 |
| 03:58:55 | | Elizabeth quits [Changing host] |
| 03:58:55 | | Elizabeth (Elizabeth) joins |
| 04:17:55 | <h2ibot> | PaulWise edited Mailman (+159, mention that mailman 2 is now EOL): https://wiki.archiveteam.org/?diff=49895&oldid=49894 |
| 04:28:10 | | emberquill (emberquill) joins |
| 04:33:04 | | GNU_world joins |
| 04:35:29 | | nicolas17 quits [Ping timeout: 252 seconds] |
| 04:37:59 | <h2ibot> | PaulWise created Mailman2 (+43214, start a page about mailman2 archiving): https://wiki.archiveteam.org/?title=Mailman2 |
| 04:38:44 | <pabs> | JAA: if you have mailman2/pipermail URLs locally, please add them to ^ |
| 04:38:51 | <pabs> | (same for anyone else here) |
| 04:40:13 | | nicolas17 joins |
| 04:47:05 | | dumbgoy quits [Ping timeout: 265 seconds] |
| 04:55:02 | <h2ibot> | PaulWise edited Mailman2 (+1588, add more tips, sites that were already done by me): https://wiki.archiveteam.org/?diff=49897&oldid=49896 |
| 05:02:26 | | vegbrasil joins |
| 05:06:54 | | vegbrasil quits [Ping timeout: 265 seconds] |
| 05:21:40 | | futawe joins |
| 05:24:09 | <futawe> | fyi, StackExchange: "The job that uploads the data dump to Archive.org was disabled on 28 March, and marked to not be re-enabled without approval of senior leadership. Had it run as scheduled, it would have completed on the first Monday after the first Sunday in June" |
| 05:24:09 | <futawe> | https://meta.stackexchange.com/questions/389922/june-2023-data-dump-is-missing/390023#390023 |
| 05:25:47 | | futawe quits [Remote host closed the connection] |
| 05:26:00 | | futawe joins |
| 05:26:12 | | futawe quits [Remote host closed the connection] |
| 05:30:01 | <nicolas17> | "organizations looking to profit from the work of our community" that sounds like... stackexchange itself? |
| 05:32:16 | <nicolas17> | JAA: purely out of curiosity (since I know you want the exact bytes for preservation etc etc) I looked into how much I can deduplicate/compress this KA/DWADragonsUnity data |
| 05:33:21 | <nicolas17> | it seems .unity3d files have internal LZMA compression, and if I decompress that I'd probably get far better deduplication, but I can't compress it back to the same data, I guess Unity doesn't use the standard LZMA library? :/ |
| 05:33:49 | <nicolas17> | I tried all compression levels and none matches |
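A rough sketch of the experiment nicolas17 describes: locate the LZMA stream inside a .unity3d bundle and decompress it with Python's lzma module. The header handling and brute-force offset search are assumptions; Unity's encoder evidently emits LZMA streams that standard settings do not reproduce byte-for-byte, which matches the result above.

```python
# Find and decompress the LZMA payload in a UnityWeb bundle. Offsets are
# guessed rather than parsed from the (undocumented-here) bundle header.
import lzma
import sys

data = open(sys.argv[1], "rb").read()
assert data[:8] in (b"UnityWeb", b"UnityRaw"), "not a Unity web bundle?"

# LZMA "alone" streams start with a 13-byte header (props + dict size +
# uncompressed size); try successive offsets until one decodes.
for off in range(8, min(len(data), 512)):
    try:
        dec = lzma.LZMADecompressor(format=lzma.FORMAT_ALONE)
        out = dec.decompress(data[off:])
        print(f"decompressed {len(out)} bytes from offset {off}")
        break
    except lzma.LZMAError:
        continue
```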
| 05:51:11 | <h2ibot> | PaulWise edited ArchiveBot (+1956, add section on alternative dashboard clients): https://wiki.archiveteam.org/?diff=49898&oldid=49790 |
| 05:52:48 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 05:53:19 | | railen63 quits [Remote host closed the connection] |
| 05:53:31 | | vegbrasil joins |
| 05:53:35 | | railen63 joins |
| 05:56:12 | <h2ibot> | PaulWise edited ArchiveBot (-10, fix formatting): https://wiki.archiveteam.org/?diff=49899&oldid=49898 |
| 06:00:13 | <h2ibot> | PaulWise edited ArchiveBot (-21, fix command): https://wiki.archiveteam.org/?diff=49900&oldid=49899 |
| 06:00:34 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 06:05:37 | | nicolas17 quits [Read error: Connection reset by peer] |
| 06:06:13 | | a joins |
| 06:06:14 | | a quits [Remote host closed the connection] |
| 06:06:33 | | nicolas17 joins |
| 07:01:27 | | lk quits [Client Quit] |
| 07:01:32 | | lk joins |
| 07:53:00 | | manu|m quits [Quit: issued !quit command] |
| 07:55:34 | | manu|m joins |
| 07:58:34 | <h2ibot> | OrIdow6 uploaded File:Egloos logo.gif: https://wiki.archiveteam.org/?title=File%3AEgloos%20logo.gif |
| 08:06:42 | | pabs quits [Ping timeout: 265 seconds] |
| 08:15:37 | <h2ibot> | OrIdow6 created Egloos (+400, Created page with "{{Infobox project | title =…): https://wiki.archiveteam.org/?title=Egloos |
| 08:23:34 | | Dango360 quits [Ping timeout: 252 seconds] |
| 08:23:41 | | Dango360 (Dango360) joins |
| 08:34:35 | | pabs (pabs) joins |
| 08:37:11 | | pabs quits [Remote host closed the connection] |
| 08:39:34 | | pabs (pabs) joins |
| 09:13:18 | | T31M quits [Client Quit] |
| 09:13:37 | | T31M joins |
| 09:14:46 | | T31M is now authenticated as T31M |
| 09:21:04 | <@OrIdow6> | arkiver: https://github.com/OrIdow6/egloos-grab https://github.com/OrIdow6/egloos-items - Framework/backfeed.lua in the former needs keys - long tail so I have not bothered with an exact estimate, but it feels like < 5 TB fare to me, and there's stuff to cut out if it gets too high |
| 09:36:57 | <flashfire42> | https://tracker.archiveteam.org/google-sites/ is this still supposed to be spitting out items? |
| 09:50:19 | | railen63 quits [Remote host closed the connection] |
| 09:53:48 | | railen63 joins |
| 09:55:03 | | railen63 quits [Remote host closed the connection] |
| 09:55:16 | | railen63 joins |
| 10:00:01 | | railen63 quits [Remote host closed the connection] |
| 10:00:16 | | railen63 joins |
| 10:13:49 | | Ruthalas5 quits [Ping timeout: 265 seconds] |
| 10:23:42 | | Ruthalas5 (Ruthalas) joins |
| 11:22:29 | | icedice quits [Client Quit] |
| 11:28:42 | | vegbrasil joins |
| 11:32:46 | | nicolas17 quits [Ping timeout: 252 seconds] |
| 11:33:29 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 11:36:24 | | nicolas17 joins |
| 11:39:17 | | nicolas17 quits [Read error: Connection reset by peer] |
| 11:39:42 | | nicolas17 joins |
| 11:44:48 | | decky_e quits [Remote host closed the connection] |
| 11:54:53 | | diggan joins |
| 11:57:08 | | systwi__ (systwi) joins |
| 11:58:14 | | systwi quits [Ping timeout: 252 seconds] |
| 12:13:41 | | diggan quits [Ping timeout: 265 seconds] |
| 12:21:22 | | vegbrasil joins |
| 12:21:53 | | railen63 quits [Remote host closed the connection] |
| 12:22:48 | | diggan joins |
| 12:30:07 | | vegbrasil quits [Ping timeout: 265 seconds] |
| 12:40:35 | | railen63 joins |
| 12:41:46 | | railen63 quits [Remote host closed the connection] |
| 12:42:00 | | railen63 joins |
| 12:43:57 | | icedice (icedice) joins |
| 13:07:41 | | IDK quits [Quit: Connection closed for inactivity] |
| 13:14:04 | | vegbrasil joins |
| 13:18:32 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 13:22:37 | | dumbgoy joins |
| 13:32:18 | | vegbrasil joins |
| 13:37:03 | | JohnnyJ joins |
| 13:37:18 | | vegbrasil quits [Ping timeout: 265 seconds] |
| 13:39:06 | | TunaLobster joins |
| 13:43:48 | | vegbrasil joins |
| 13:52:05 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 14:01:17 | | vegbrasil joins |
| 14:04:33 | | HP_Archivist (HP_Archivist) joins |
| 14:04:44 | | beario_ quits [Ping timeout: 252 seconds] |
| 14:05:23 | | systwi__ is now known as systwi |
| 14:06:24 | | Pichu0102 quits [Ping timeout: 252 seconds] |
| 14:06:27 | | Pichu0202 joins |
| 14:08:14 | | vegbrasil quits [Ping timeout: 265 seconds] |
| 14:12:37 | | vegbrasil joins |
| 14:12:40 | | graham quits [Quit: The Lounge - https://thelounge.chat] |
| 14:15:39 | | railen63 quits [Remote host closed the connection] |
| 14:17:19 | | graham joins |
| 14:19:04 | | Naruyoko quits [Read error: Connection reset by peer] |
| 14:21:14 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 14:26:48 | | HP_Archivist quits [Client Quit] |
| 14:27:09 | | HP_Archivist (HP_Archivist) joins |
| 14:28:19 | | graham quits [Client Quit] |
| 14:28:32 | | nicolas17 quits [Ping timeout: 265 seconds] |
| 14:32:22 | | nicolas17 joins |
| 14:34:35 | | geezabiscuit quits [Quit: ZNC - https://znc.in] |
| 14:47:19 | | icedice quits [Client Quit] |
| 14:52:07 | | vegbrasil joins |
| 14:53:47 | | geezabiscuit joins |
| 14:53:47 | | geezabiscuit is now authenticated as geezabiscuit |
| 14:53:47 | | geezabiscuit quits [Changing host] |
| 14:53:47 | | geezabiscuit (geezabiscuit) joins |
| 14:56:26 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 15:26:32 | | diggan quits [Ping timeout: 265 seconds] |
| 15:35:27 | | diggan joins |
| 15:36:36 | | AmAnd0A quits [Ping timeout: 252 seconds] |
| 15:36:39 | | AmAnd0A joins |
| 15:43:40 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 15:44:03 | | AmAnd0A joins |
| 15:55:56 | <diggan> | where can I find the dictionary for a archive? For example, the dictionary used for zst files in https://archive.org/download/archiveteam_reddit_20230610072320_4ab81500 |
| 15:56:07 | <diggan> | the zst dictionary used for the compression, just to be clear |
| 15:58:28 | | lk quits [Client Quit] |
| 16:00:09 | | lk joins |
| 16:07:28 | | graham joins |
| 16:13:24 | | Dalek (Dalek) joins |
| 16:15:16 | | graham quits [Client Quit] |
| 16:26:17 | <imer> | diggan: the zst isn't actually one archive, its multiple I believe, JAA has a script for this (havent used myself, so no idea how): https://gitea.arpa.li/JustAnotherArchivist/little-things/src/branch/master/zstdwarccat |
| 16:27:16 | <diggan> | I see. Thanks imer! Regardless, they're compressed with a custom dictionary as far as I understand, which means I'd have to use the same dictionary to extract it |
| 16:27:54 | <imer> | the dict is included in the warc as a custom record or something like that, script should handle it |
| 16:31:18 | | geezabiscuit quits [Ping timeout: 265 seconds] |
| 16:35:10 | <diggan> | aha, I see. Thanks again imer |
| 16:37:08 | <diggan> | seems like that handy little utility did the trick, awesome |
| 16:46:07 | | railen64 joins |
| 16:47:05 | | railen64 quits [Remote host closed the connection] |
| 16:47:21 | | railen64 joins |
| 16:56:56 | | graham joins |
| 17:00:21 | | railen64 quits [Remote host closed the connection] |
| 17:07:02 | | geezabiscuit (geezabiscuit) joins |
| 17:09:44 | | graham quits [Client Quit] |
| 17:10:00 | | graham joins |
| 17:14:29 | | HP_Archivist quits [Ping timeout: 252 seconds] |
| 17:19:23 | | graham quits [Client Quit] |
| 17:21:50 | <that_lurker> | Could someone grab https://www.lpga.com/ and https://www.kpmgwomenspgachampionship.com/? The current merger of PGA and LIV could have an effect on those too |
| 17:26:47 | <fireonlive> | might be worth asking in #archivebot |
| 17:33:10 | | lk quits [Client Quit] |
| 17:40:36 | | railen64 joins |
| 17:40:51 | | beario quits [Remote host closed the connection] |
| 17:42:09 | | railen64 quits [Remote host closed the connection] |
| 17:42:26 | | beario joins |
| 17:43:37 | <dave> | Probably an obvious question, but why is 6 the max concurrency per warrior? The reddit project suggests up to 10 may work, and empirically I'm pulling with 8 archivers spread across two VMs no problem. It'd be nicer to crank one up to 8 so I can dedicate the other VM to a different project |
| 17:45:09 | <dave> | The warriors are consuming maybe 10Mbps out of my symmetric 1G, so looking for how to contribute more firepower without tripping IP rate limits |
| 17:47:44 | <Maakuth|m> | You can run more projects manually with docker https://wiki.archiveteam.org/index.php/Running_Archive_Team_Projects_with_Docker |
| 17:48:04 | | lk joins |
| 17:48:20 | <dave> | yeah, although for boring reasons that's annoying to get running on my infrastructure vs. spawning the warrior VM image |
| 17:49:39 | <Maakuth|m> | How about multiple vms and manually picking projects in warrior? |
| 17:50:30 | <dave> | yup, that's what I'm doing, but for that the limiting factor ends up being the RAM budget for each VM. Being able to pack more workers into one VM for projects that can sustain it would be more efficient. |
| 17:50:45 | <dave> | But that's just optimization, in practice yeah I'm spinning more VMs |
| 17:51:22 | <that_lurker> | If you are running the warrior you can go to the webportals settings and check the advanced options and increase it |
| 17:52:13 | <dave> | only up to 6 though, so for reddit I end up needing multiple VMs to get up to the right rate. That's why I was wondering why the limit is set at 6. Maybe to not overwhelm smaller projects with a ton of workers? I dunno. |
| 17:52:37 | <that_lurker> | oh yeah forgot that was max 6 |
| 17:55:11 | <dave> | looking at the instructions for running coordinated containers by hand, that does look kinda like what I'm looking for, so I should just go fix my infra to make that work. |
| 17:56:47 | <fireonlive> | #warrior would |
| 17:56:51 | <fireonlive> | be the better place for such |
| 17:57:17 | <dave> | ah, thx |
| 17:57:26 | <fireonlive> | =] |
| 18:06:34 | | AmAnd0A quits [Ping timeout: 252 seconds] |
| 18:07:12 | | AmAnd0A joins |
| 18:09:21 | | lk is now authenticated as lk |
| 18:10:30 | | railen63 joins |
| 18:10:34 | <mikolaj|m> | pabs: thank you. Just note that forum-dl's support for Pipermail (and Hyperkitty) archives currently works only by HTML scraping. I haven't implemented getting it from the mbox files, I'll get this working eventually. Also note that there's a tool called Perceval that has some support for Pipermail |
| 18:12:43 | | railen63 quits [Remote host closed the connection] |
| 18:12:58 | <fireonlive> | oh hey mikolaj|m are you the creator of forum-dl? |
| 18:13:16 | <mikolaj|m> | fireonlive: yes |
| 18:13:26 | <fireonlive> | ah! :) it looks quite cool thanks |
| 18:13:40 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 18:13:48 | | lk quits [Changing host] |
| 18:13:48 | | lk (lk) joins |
| 18:13:51 | <fireonlive> | I don't have a GitHub account for... me.. yet but I tried pointing it at https://forums.tomshardware.com but it doesn't seem to register that it's xenforo |
| 18:13:57 | | AmAnd0A joins |
| 18:14:01 | | vegbrasil joins |
| 18:14:04 | <fireonlive> | was wondering if there was a way to force a certain scraper |
| 18:14:41 | <@JAA> | pabs: Thanks for handling the Tor forums! :-) Re Mailman, there are definitely some in [[Internet infrastructure]] that could be listed there. I don't have any local lists, I think. |
| 18:14:46 | <mikolaj|m> | fireonlive: I'll fix this |
| 18:14:55 | <fireonlive> | thanks :) |
| 18:14:56 | <@JAA> | nicolas17: Yeah, I won't decompress or anything like that. |
| 18:15:14 | <fireonlive> | i'll get around to creating a new GitHub account shortly; sorry for short-circuiting the issues system |
| 18:15:26 | | nicolas17 is still on-and-off experimenting with compressing Apple updates |
| 18:15:49 | | lk quits [Client Quit] |
| 18:15:57 | | lk (lk) joins |
| 18:16:51 | <nicolas17> | last week's WWDC released 223GB of betas |
| 18:17:27 | <fireonlive> | oh damn |
| 18:17:29 | <mikolaj|m> | fireonlive: forum themes break forum-dl frequently, but fortunately each can also be fixed extremely quickly (it usually just requires tweaking selectors a bit) |
| 18:17:39 | <fireonlive> | ah! i see |
| 18:18:01 | <nicolas17> | next week they'll probably post beta 2 and it will be a similar size |
| 18:18:03 | <fireonlive> | custom themeing can be a bit of a bane |
| 18:18:08 | | railen63 joins |
| 18:18:09 | <mikolaj|m> | and I haven't had enough time to write more tests for each extractor |
| 18:18:15 | <@JAA> | imer, diggan: To be precise, the dictionary is stored in a skippable zstd frame. Standard tooling knows nothing about that frame, so as the name suggests, it just skips over it (and then fails to decompress because it doesn't have the right dict). An attempt to upstream this format led to feature creep (support multiple dictionaries, individual frames compressed without a dict, indexing, etc.) until it ground to a halt. |
| 18:18:32 | <mikolaj|m> | but I expect to be able to reliably cover at least 90% of online forums |
| 18:18:40 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 18:19:06 | <fireonlive> | if i have a `$ grab-site https://labs-web-bay.vercel.app/` and promise I'm a good boy can i upload that to IA and get it moved to the 'wayback machine loves me' collection? lol |
| 18:19:10 | | railen63 quits [Remote host closed the connection] |
| 18:19:23 | | railen63 joins |
| 18:19:46 | <nicolas17> | fireonlive: would we conclude you're a good boy if we saw your browser history? ;) |
| 18:19:48 | <mikolaj|m> | the remaining 10% will be covered by making it easy to override selectors, and it would be interesting to try generating the selectors automatically, maybe using an LLM |
| 18:20:12 | <mikolaj|m> | (it may be less than 10%, hopefully) |
| 18:20:15 | <fireonlive> | nicolas17: 😇 depends which browser profile |
| 18:20:34 | <fireonlive> | ooh that does sound cool |
| 18:23:11 | <mikolaj|m> | I found PhpBB to be the worst when it comes to forum-dl breaking due to theming |
| 18:23:32 | <@rewby> | Awful idea. ChatGPT to create selectors. Lol |
| 18:24:23 | <nicolas17> | hey chatgpt how do I change my forum html to break mikolaj's crawler |
| 18:24:27 | <nicolas17> | it's an arms race :D |
| 18:24:51 | <mikolaj|m> | actually there's a ChatGPT-powered library for scraping already: https://github.com/jamesturk/scrapeghost |
| 18:24:51 | <fireonlive> | (it's also very small so AB could have it done in like 5 mins) |
| 18:25:40 | <mikolaj|m> | but I'm afraid it would be quite expensive scraping ;) |
| 18:28:02 | | lk quits [Client Quit] |
| 18:28:06 | | lk (lk) joins |
| 18:28:41 | <mikolaj|m> | auto-generating just the selectors somehow would be much cheaper |
| 18:45:25 | | railen63 quits [Remote host closed the connection] |
| 18:48:05 | | railen64 joins |
| 18:48:57 | | railen64 quits [Remote host closed the connection] |
| 18:49:11 | | railen64 joins |
| 18:52:48 | | decky_e (decky_e) joins |
| 18:54:28 | | HP_Archivist (HP_Archivist) joins |
| 19:07:29 | | graham joins |
| 19:12:00 | | railen64 quits [Remote host closed the connection] |
| 19:12:50 | | graham quits [Client Quit] |
| 19:14:59 | | graham joins |
| 19:24:15 | | hitgrr8 joins |
| 19:30:27 | | graham quits [Client Quit] |
| 19:33:58 | | Abacus6427 joins |
| 19:37:14 | | vegbrasil joins |
| 19:41:32 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 19:43:00 | | vegbrasil joins |
| 19:47:23 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 19:55:39 | | railen63 joins |
| 19:55:39 | | railen63 quits [Remote host closed the connection] |
| 19:56:45 | | railen64 joins |
| 20:00:31 | | vegbrasil joins |
| 20:05:00 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 20:08:30 | | vegbrasil joins |
| 20:16:32 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 20:32:21 | | vegbrasil joins |
| 20:39:28 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 20:51:26 | | vegbrasil joins |
| 20:56:06 | | sorch joins |
| 20:59:16 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 21:07:00 | | fireonlive quits [Client Quit] |
| 21:07:27 | | fireonlive (fireonlive) joins |
| 21:13:06 | | vegbrasil joins |
| 21:16:08 | | AmAnd0A quits [Ping timeout: 252 seconds] |
| 21:16:32 | | AmAnd0A joins |
| 21:19:15 | | graham joins |
| 21:19:38 | <@arkiver> | OrIdow6: i'm getting that one started tomorrow |
| 21:20:49 | | vegbrasil quits [Ping timeout: 265 seconds] |
| 21:24:06 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 21:24:50 | | AmAnd0A joins |
| 21:25:39 | | graham quits [Client Quit] |
| 21:30:56 | | graham joins |
| 21:37:23 | <andrew> | I'm trying to get wpull to resume an unfinished grab-site crawl but it seems to be spending a lot of time doing absolutely nothing: |
| 21:37:26 | <andrew> | ❯ /nix/store/9rrr3q95w3zqwp97b66mxxn5kfxah9zl-python3.8-ludios_wpull-3.0.9/bin/wpull3 -U 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:103.0) Gecko/20100101 Firefox/103.0' --header 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' --header 'Accept-Language: en-US,en;q=0.5' --no-check-certificate --no-robots --inet4-only --dns-timeout 20 --connect-timeout 20 --read-timeout 900 --session-timeout 172800 --tries 3 --waitretry 5 --max-redirect 8 --append-output wpull.log --database wpull.db --save-cookies cookies.txt --delete-after --page-requisites --no-parent --concurrent 6 --warc-file ztank-archive-ddosecrets-search2-queue.txt-2023-06-10-a0f7d115 --warc-max-size 5368709120 --warc-cdx --strip-session-id --escaped-fragment --level inf --page-requisites-level 5 --span-hosts-allow page-requisites --sitemaps --recursive --warc-append --domains domains.txt -v https://search.ddosecrets.com/ |
| 21:37:26 | <andrew> | psutil: No module named 'psutil'. Resource monitoring will be unavailable. |
| 21:37:26 | <andrew> | INFO FINISHED. |
| 21:37:26 | <andrew> | INFO Duration: 2:16:36. Speed: -- B/s. |
| 21:37:26 | <andrew> | INFO Downloaded: 0 files, 0.0 B. |
| 21:37:53 | <andrew> | domains.txt includes a list of domains that I want to be included in the crawl |
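One hedged way to diagnose the instant FINISHED: if everything in wpull.db is already marked completed, a resumed crawl exits immediately with nothing downloaded. The table and column names below are guesses at wpull's SQLite schema; verify with .schema first.

```python
# Inspect the crawl state in wpull.db. 'urls' and 'status' are assumed
# names; adjust after checking the actual schema.
import sqlite3

con = sqlite3.connect("wpull.db")
print([r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")])
for status, n in con.execute("SELECT status, COUNT(*) FROM urls GROUP BY status"):
    print(status, n)
```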
| 21:38:31 | | gwetchen|m joins |
| 21:47:00 | | Muad-Dib quits [Remote host closed the connection] |
| 21:47:17 | | Muad-Dib joins |
| 21:49:36 | | graham quits [Client Quit] |
| 21:51:19 | | lk quits [Client Quit] |
| 21:51:31 | | lk (lk) joins |
| 21:54:19 | | lk quits [Client Quit] |
| 21:54:30 | | fangfufu quits [Quit: ZNC 1.8.2+deb2+b1 - https://znc.in] |
| 21:59:47 | | fangfufu joins |
| 22:00:03 | | fangfufu is now authenticated as fangfufu |
| 22:00:06 | | Hajdar quits [Remote host closed the connection] |
| 22:00:22 | | Hajdar (Hajdar) joins |
| 22:01:23 | | lk (lk) joins |
| 22:21:27 | | vegbrasil joins |
| 22:27:18 | <betamax> | JAA: to answer your question from ~24 hours ago, the URL the game uses is http://media.schoolofdragons.com/ |
| 22:27:30 | <@JAA> | Ah, of course there's another one. :-) |
| 22:27:32 | <@JAA> | Thanks! |
| 22:27:44 | <betamax> | (I realise there's a lot of scrollback and I haven't caught up yet - feel free to mention / ping me for specific things) |
| 22:27:47 | <nicolas17> | augh the duplicate downloads >< |
| 22:27:55 | | hitgrr8 quits [Client Quit] |
| 22:28:32 | | vegbrasil quits [Ping timeout: 252 seconds] |
| 22:28:39 | <fireonlive> | what an odd way they chose to organize things |
| 22:29:28 | <nicolas17> | fireonlive: in this case it's 3 CDN URLs pointing at the same underlying S3 bucket |
| 22:29:57 | <fireonlive> | ah i mainly just meant all the duplicate files spread across platform directories |
| 22:30:03 | <betamax> | the bucket seems to be used for all / many of JumpStart's games; only School of Dragons is confirmed to be shutting down, but JumpStart is being sued, so it might be a good idea to grab everything |
| 22:30:10 | <nicolas17> | yeah that's another issue |
| 22:30:14 | <fireonlive> | i think i read the same assets being everywhere |
| 22:30:18 | <betamax> | but Schoolofdragons has a hard deadline of 30 June |
| 22:30:59 | <nicolas17> | fireonlive: I think JAA intends to download them from all 4 URLs (raw bucket, schoolofdragons, jumpstart, knowledgeadventure) so that they all work in the WBM |
| 22:31:10 | <nicolas17> | that will get nicely deduplicated in the WARCs |
| 22:31:30 | <nicolas17> | but it means 24TB of downloading for him and it pains me xD |
| 22:31:52 | <fireonlive> | dedupe is best dupe |
| 22:32:03 | <nicolas17> | *in addition* there's a ton of duplication in the data itself yeah |
| 22:32:08 | <fireonlive> | but haha yeah |
| 22:32:15 | <h2ibot> | Usernam edited List of websites excluded from the Wayback Machine (+31): https://wiki.archiveteam.org/?diff=49903&oldid=49890 |
| 22:32:16 | <h2ibot> | Yts98 uploaded File:LINE BLOG icon.png: https://wiki.archiveteam.org/?title=File%3ALINE%20BLOG%20icon.png |
| 22:32:38 | <nicolas17> | in DWADragonsUnity: |
| 22:32:39 | <nicolas17> | Total data: 6916.14 GiB in 7757546 files |
| 22:32:41 | <nicolas17> | Unique data: 3037.20 GiB in 1143186 files |
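For reference, stats like these can be produced with a straightforward hash-and-count pass over the tree; a minimal sketch (the hash choice and path handling are illustrative):

```python
# Hash every file under a root directory and count each distinct digest's
# size once to get total vs. unique data.
import hashlib
import os
import sys

total = unique = total_n = 0
seen = set()
for root, _, files in os.walk(sys.argv[1]):
    for name in files:
        path = os.path.join(root, name)
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        size = os.path.getsize(path)
        total += size
        total_n += 1
        if h.digest() not in seen:
            seen.add(h.digest())
            unique += size
print(f"Total data: {total / 2**30:.2f} GiB in {total_n} files")
print(f"Unique data: {unique / 2**30:.2f} GiB in {len(seen)} files")
```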
| 22:33:33 | <fireonlive> | all that S3 space they're needlessly paying for |
| 22:33:34 | <fireonlive> | lol |
| 22:35:07 | <Elizabeth> | if my brain isn't dead, it's only ~$160/mo there assuming it's all in standard tier; was probably cheaper than figuring out what needed to go. |
| 22:35:32 | <betamax> | OK, brief update from my friend: |
| 22:35:36 | <betamax> | 1.) They're apparently still updating the game. There's ~2 devs still assigned to this, and they're trying to push final updates even though the game will be shut down in <1 month |
| 22:35:48 | <betamax> | 2.) My friend is in contact with one of the devs (that's how they heard the above). So getting a list of all the updated assets closer to shutdown date may be possible. |
| 22:36:51 | <dumbgoy> | betamax: what are ya talking about? interested |
| 22:36:52 | <@JAA> | nicolas17: Looks like the vast majority of data is under DWADragonsUnity, yeah. |
| 22:37:05 | <nicolas17> | JAA: oh I simply didn't check the rest of the bucket yet |
| 22:37:25 | <@JAA> | Total stats: 8.06 TiB in 11656937 objects, 3.21 unique TiB in 1673341 objects |
| 22:37:54 | <fireonlive> | ah ok, 160/mo is peanuts to them |
| 22:38:04 | <betamax> | dumbgoy: the School of Dragons game (https://www.schoolofdragons.com/) |
| 22:38:39 | <dumbgoy> | gotcha |
| 22:38:47 | <nicolas17> | WAIT |
| 22:38:48 | <dumbgoy> | keep on rockin, i just popped in and was wondering |
| 22:38:51 | <nicolas17> | data has already been deleted?! |
| 22:39:08 | <nicolas17> | no, ffs I was on the wrong prefix |
| 22:39:09 | <@JAA> | Hm? |
| 22:39:12 | <nicolas17> | god I panicked |
| 22:39:13 | <@JAA> | Ah |
| 22:39:23 | <nicolas17> | that's what I get for relying on shell history |
| 22:39:27 | <nicolas17> | DWADragonsCodingUnity is a different folder >.< |
| 22:39:32 | <betamax> | apparently old data for previous versions of the game was in the DWAStandAlone subfolder on the bucket |
| 22:39:40 | <dumbgoy> | love ya guys, keep up the good work.! |
| 22:39:52 | <@JAA> | betamax: I'll just grab all of it since it doesn't make a huge difference. |
| 22:40:13 | <betamax> | great! (I'm just copy-pasting messages from my friend into here, so apologies if I'm mentioning stuff that's already covered) |
| 22:40:50 | <@JAA> | A bit over 24 TiB to download into something like 3.3 TiB of WARCs. Sounds fun. :-) |
| 22:41:14 | <@JAA> | There'll be a little bit of duplication in my data most likely, but I'll try to keep it to a minimum. |
| 22:41:15 | <betamax> | JAA: the stats you posted above (8.06 TiB) - is that the entire bucket or just DWADragonsUnity? |
| 22:41:21 | <@JAA> | Entire bucket |
| 22:41:32 | <@JAA> | nicolas17 posted the ones for just DWADragonsUnity. |
| 22:41:42 | <@JAA> | (I didn't rerun that analysis.) |
| 22:42:18 | <betamax> | thanks! |
| 22:42:45 | <dumbgoy> | anyone know about the feasibility and data requirements to pull all of the Wayback Machine from archive.org? I imagine it would be MASSIVE, and not sure about the process, maybe wget? |
| 22:42:46 | <@JAA> | I'm getting a very similar number for the number of files in that though. |
| 22:42:51 | | sorch quits [Client Quit] |
| 22:42:56 | <dumbgoy> | i'm worried about them going down |
| 22:43:11 | <@JAA> | dumbgoy: Do you have too much money? |
| 22:43:29 | <dumbgoy> | not really, i wish i did |
| 22:43:33 | <@JAA> | Infeasible, especially through the WBM. |
| 22:43:35 | <nicolas17> | betamax: fun fact, if I look *only* at *.mp4 files, there's 1247 MiB unique data (92 files) + 514697 MiB duplicate (48357 files) |
| 22:43:39 | <dumbgoy> | i kind of figured, just wondered |
| 22:44:06 | <@JAA> | Data requirements are in the dozens of TB per day of data. |
| 22:44:24 | <dumbgoy> | ouch, they refuse to even put residential fiber here, and i don't make too much money |
| 22:44:45 | <@JAA> | 1.8 PiB of web items in the past month |
| 22:44:59 | <dumbgoy> | lordy. well thanks for answering, love ya guys |
| 22:44:59 | <@JAA> | (Not all of those are publicly accessible though.) |
| 22:45:15 | <betamax> | JAA / nicolas17: if you notice any data in the bucket that looks like user posts, pls let me know. they shut down the user forum with no warning a couple of months ago, and a lot of that data is not in wayback |
| 22:45:50 | <nicolas17> | dumbgoy: archiveteam's reddit archival project alone is at 2.85 PiB |
| 22:45:56 | <dumbgoy> | wow |
| 22:46:18 | <nicolas17> | adding 27GiB per minute right now |
| 22:46:33 | <dumbgoy> | let me know if there's ANYTHING I can do to help with any archival. |
| 22:46:51 | <dumbgoy> | I only have around 8~tb free right now, but if i can help you folks lemme know |
| 22:47:18 | <@JAA> | betamax: Very unlikely that that would be on their CDN S3 bucket. I'll dump the complete bucket listing onto IA anyway. |
| 22:47:48 | <@JAA> | It's only ~300 MiB as .jsonl.zst. :-) |
| 22:47:50 | <nicolas17> | dumbgoy: the warrior stuff doesn't need much disk space, it downloads from (eg.) reddit and immediately uploads elsewhere, it won't accumulate much data on your disk |
| 22:48:02 | <nicolas17> | JAA: do you have your own script for that? |
| 22:48:18 | <dumbgoy> | if ya elaborate a bit, and tell me what's needed, not informed about warrior stuff |
| 22:48:43 | <nicolas17> | "aws s3 ls" outputs text with "timestamp size filename" |
| 22:48:45 | <@JAA> | nicolas17: Yep, https://gitea.arpa.li/JustAnotherArchivist/little-things/src/branch/master/s3-bucket-list (warning, may cause brain death) |
| 22:49:07 | <@JAA> | It has no retries, so due to those weird timeouts, I had to script around it in Bash. :-) |
| 22:49:08 | <nicolas17> | "aws s3api list-objects" seems to collect everything in memory and when it finishes it outputs a single JSON array |
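For contrast, a paginated listing can be streamed to JSONL with roughly constant memory, which is presumably how a bucket of this size ends up as a ~300 MiB .jsonl.zst; a sketch (field names are illustrative):

```python
# Stream a bucket listing to JSONL, one object per line, roughly one page
# (~1000 keys) in memory at a time.
import json

import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
with open("listing.jsonl", "w") as out:
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket="origin.ka.cdn"):
        for obj in page.get("Contents", []):
            out.write(json.dumps({
                "key": obj["Key"],
                "size": obj["Size"],
                "mtime": obj["LastModified"].isoformat(),
                "etag": obj["ETag"],
            }) + "\n")
```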
| 22:52:08 | <fireonlive> | <stackoverflow parsing xml with regex post> |
| 22:52:21 | | NIC007a83 joins |
| 22:52:22 | <fireonlive> | (lol i don't care about that) |
| 22:53:13 | <@JAA> | The regex is only used for crude validation of the beginning of the response. The parsing happens with string slicing etc. :-P |
| 22:53:36 | <@JAA> | But yes, I like Tony the pony. |
| 22:55:57 | <nicolas17> | how long does it take you to list the bucket? |
| 22:55:59 | <@JAA> | I also converted it to a script that can invoke qwarc to do it fast and archive the responses as WARC. It's even more ridiculous: https://gitea.arpa.li/JustAnotherArchivist/little-things/src/branch/master/s3-bucket-list-qwarc |
| 22:57:14 | <TheTechRobo> | JAA: Why did you do all that readarray stuff instead of just `cat > ${prefix}.py << EOF` ? |
| 22:57:24 | <TheTechRobo> | (i think its called a heredoc?) |
| 22:57:29 | <@JAA> | This one took 8 hours due to all the timeouts. |
| 22:57:46 | <@JAA> | Don't have good enough logs to subtract those. |
| 22:57:55 | <TheTechRobo> | Oh indentation. |
| 22:57:58 | <nicolas17> | ew |
| 22:57:59 | <TheTechRobo> | Didnt check that comment yet. |
| 22:58:38 | | Braven joins |
| 22:58:40 | <@JAA> | I can relist the bucket much faster now with the qwarc version. |
| 22:59:01 | <@JAA> | Since I can split it into pretty equal parts and process those in parallel. |
| 23:00:29 | <@JAA> | When a bucket has nice patterns, that can be used from the start of course. |
| 23:00:38 | <@JAA> | But this one is a slight mess. |
| 23:08:47 | <fireonlive> | love the bash script |
| 23:12:26 | <fireonlive> | is there a good place to shove google drive links |
| 23:22:17 | <@JAA> | #googlecrash in theory, but that's been dormant for a long time. |
| 23:22:28 | <@JAA> | We should revive it in #mediaonfire style. |
| 23:27:55 | | TunaLobster_extra joins |
| 23:30:50 | | TunaLobster quits [Ping timeout: 265 seconds] |
| 23:36:12 | | NIC007a83 quits [Client Quit] |
| 23:39:30 | | BlueMaxima joins |
| 23:43:11 | | lolesports joins |
| 23:43:58 | <masterX244> | True, got a batch of links from a crawl, too |
| 23:45:28 | <fireonlive> | yee :) |
| 23:49:25 | <pabs> | mikolaj|m: are you able to update the Mailman wiki page to add those two things? |
| 23:52:19 | <mikolaj|m> | pabs: what two things? |
| 23:52:33 | <pabs> | the ones you mentioned above :) |
| 23:52:44 | <pabs> | <mikolaj|m> pabs: thank you. Just note that forum-dl's support for Pipermail (and Hyperkitty) archives currently works only by HTML scraping. I haven't implemented getting it from the mbox files, I'll get this working eventually. Also note that there's a tool called Perceval that has some support for Pipermail |
| 23:53:07 | <pabs> | (please link to Perceval too) |
| 23:53:56 | <@JAA> | Hmm, what does an ETag value of '96a106ed73262892740656e84c5437b2-1' mean on AWS S3? |
| 23:54:29 | <mikolaj|m> | pabs: I would need to make an account on your wiki, I might not have enough spoons for that today (executive dysfunction) |
| 23:55:14 | <@JAA> | Ah, multi-part uploads, apparently. |
| 23:55:23 | <@JAA> | https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html#large-object-checksums |
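Per the linked docs, a multipart ETag is the MD5 of the concatenated per-part MD5 digests, suffixed with -<part count>; so a value ending in -1 is a multipart upload with a single part. A sketch of reproducing it locally (the part size has to be guessed; 8 MiB is a common CLI default):

```python
# Compute the ETag AWS assigns to a multipart upload: MD5 over the raw
# per-part MD5 digests, plus "-<number of parts>".
import hashlib

def multipart_etag(path, part_size=8 * 1024 * 1024):
    digests = []
    with open(path, "rb") as f:
        while chunk := f.read(part_size):
            digests.append(hashlib.md5(chunk).digest())
    return hashlib.md5(b"".join(digests)).hexdigest() + f"-{len(digests)}"

# A match against "96a106ed73262892740656e84c5437b2-1" would confirm the
# object was uploaded as a single-part multipart upload.
```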
| 23:56:56 | | geezabiscuit quits [Ping timeout: 265 seconds] |