00:00:11<Pedrosso>Awesome. I do have about 1.5 million of the smaller files downloaded, although since those are only the creature files and not the entire pages, I'm unsure of the relevance
00:00:11<pokechu22>Looks like the stuff that was run before was https://spore-cr.ucoz.com/ and some stuff on staging.spore.com (https://transfer.archivete.am/inline/xEMox/staging.spore.com_seed_urls.txt specifically)
00:00:51<pokechu22>hmm, http://www.spore.com/sporepedia#qry=pg-220 looks to be handled via POST which archivebot can't do
00:03:25<pokechu22>sporepedia itself says 191,397,848 creations to date, but the browse tab says 1,769 newest creations - http://www.spore.com/sporepedia#qry=st-sc looks like it goes on through 503,904 things though
00:04:55<Pedrosso>What I believe is the case is that it starts indexing at 500,000
00:05:26<Pedrosso>or- wait. 500,000,000,000
00:07:13Larsenv (Larsenv) joins
00:10:16<pokechu22>some interesting URLs: http://www.spore.com/view/myspore/erokuma http://static.spore.com/static/thumb/501/110/199/501110199667.png http://static.spore.com/static/image/501/110/199/501110199667_lrg.png
00:10:50<pokechu22>It also says you can drag the thumbnail into the spore creator app but I'm not sure how that works (if there's an additional URL for extra data or they're hiding it in the image somehow)
00:11:05<Pedrosso>They're hiding it in the image somehow
00:12:50<Pedrosso>Just to verify that, I'm gonna go into spore, turn off my internet, and pull one in
00:14:26<pokechu22>Looks like I found an article about it: https://nedbatchelder.com/blog/200806/spore_creature_creator_and_steganography.html
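The linked article describes the creature definition being embedded steganographically in the thumbnail PNG itself. As a rough illustration of the general technique only (not Spore's actual layout), a minimal Pillow sketch that pulls least-significant bits out of a PNG might look like this; treating the alpha channel as the carrier and MSB-first packing are assumptions here:

    # Generic LSB-extraction sketch -- NOT Spore's exact format; the carrier
    # channel, bit order, and any length header are assumptions.
    from PIL import Image

    def extract_lsb_bytes(path, max_bytes=65536):
        img = Image.open(path).convert("RGBA")
        bits = []
        for _r, _g, _b, a in img.getdata():  # assume payload is in the alpha channel
            bits.append(a & 1)               # take the least-significant bit
            if len(bits) >= max_bytes * 8:
                break
        data = bytearray()
        for i in range(0, len(bits) - len(bits) % 8, 8):
            byte = 0
            for bit in bits[i:i + 8]:        # assume MSB-first bit packing
                byte = (byte << 1) | bit
            data.append(byte)
        return bytes(data)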
00:15:42<Pedrosso>I'll post other possibly relevant URLs. https://www.spore.com/comm/developer/ https://www.spore.com/comm/samples (the latter has a list of possibly relevant urls)
00:18:10<pokechu22>there's also e.g. http://www.spore.com/sporepedia#qry=sast-501110199667%3Apg-220 which does a POST to http://www.spore.com/jsserv/call/plaincall/assetService.fetchComments.dwr (a POST to http://www.spore.com/jsserv/call/plaincall/assetService.listAssets.dwr also exists). POSTs don't work with web.archive.org unfortunately
00:18:52<pokechu22>The API docs there are helpful
00:19:31<Pedrosso>That is unfortunate. However, as the API docs show, the files can be accessed directly, although that will miss out on users, comments, etc.
00:20:25<Pedrosso>actually, disregard that last statement about missing out, as I don't know how to read the XML files
00:21:40<pokechu22>Theoretically we could generate WARCs containing the POST data if a whole custom crawl were done, they just wouldn't allow navigating the site directly on web.archive.org as it stands today (theoretically it could be implemented in the future, but I think there are technical complications)
00:22:22<nicolas17>Pedrosso: if I understand correctly, we *can* get the user and comment, we just can't display it as a functional website on web.archive.org
00:22:30<nicolas17>user and comment data*
00:22:41<Pedrosso>That is awesome, thank you
00:23:06<nicolas17>so it will be a pile of xml or whatever waiting for someone to make a tool to read it
00:24:45<pokechu22>We *can* if something custom was implemented - archivebot wouldn't work for it (though giving archivebot millions of images as a list also isn't easy since it needs the list ahead of time; you can't just tell it the pattern the images follow)
00:29:36<Pedrosso>I've no clue how I didn't find it before, but there's a page on the archiveteam wiki with more possibly relevant info: https://wiki.archiveteam.org/index.php/Spore
00:31:41etnguyen03 (etnguyen03) joins
00:31:50<pokechu22>I started an archivebot job for http://www.spore.com/ but that's not going to recurse into anything that's accessed via javascript only (so it's not going to find everything on the character creator)
00:34:50<h2ibot>Pokechu22 edited Spore (+211, mention that the thumbnails include data): https://wiki.archiveteam.org/?diff=51106&oldid=51087
00:43:49wickedplayer494 quits [Ping timeout: 272 seconds]
00:46:48<Pedrosso>pokechu22: You say archivebot needs the list ahead of time, could you elaborate on that? Because I mean, making a very long list full of urls following the pattern is possible, no?
00:46:49wickedplayer494 joins
00:47:43<pokechu22>Yeah, it's definitely possible, not too difficult even, but if http://static.spore.com/static/thumb/501/110/210/501110210233.png implies there are 1,110,210,233 images, I think that exceeds some of the reasonable limits :)
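For what it's worth, the thumbnail URLs follow a simple pattern: the first nine digits of the 12-digit asset ID, split into groups of three, form the directory path. A small sketch (function name hypothetical):

    def thumb_url(asset_id):
        s = str(asset_id)  # asset IDs seen so far are 12 digits starting with 50...
        return ("http://static.spore.com/static/thumb/"
                f"{s[0:3]}/{s[3:6]}/{s[6:9]}/{s}.png")

    # thumb_url(501110199667)
    # -> http://static.spore.com/static/thumb/501/110/199/501110199667.png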
00:48:51<pokechu22>it's possible to upload zst-compressed text files to transfer.archivete.am and then remove the zst extension to download it decompressed, which helps a bit, but archivebot still downloads it decompressed (and ends up uploading that decompressed list to archive.org without any other compression)
00:49:05<Pedrosso>You'll have to forgive me as I've no real basis on what's reasonable
00:50:14<pokechu22>Yeah, I'm trying to dig up an example of when I last did this
00:50:22<Pedrosso>Thank you
00:50:34<pokechu22>(the info on the archivebot article on the wiki is fairly out of date - we can and regularly do jobs a lot larger than it recommends there)
00:51:15<@JAA>A billion images? Oh dear...
00:52:12<@JAA>Request rates of something like 25/s are possible in AB, but then we'd still be looking at something like 1.5 years...
00:52:46<Pedrosso>If I interpret the information correctly, many URLs in that pattern could be pointing to nothing
00:53:07<@JAA>That is likely, given that the site itself says there are only 191 million creations.
00:53:35<@JAA>So roughly every 6th URL will work.
00:54:12<@JAA>But it doesn't matter for this purpose since we'd still have to try the full billion.
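(Rough numbers behind that: 501,110,210,233 − 500,000,000,000 ≈ 1.11 billion candidate IDs; at 25 requests/s that is about 1.11e9 / 25 ≈ 44 million seconds ≈ 514 days, i.e. roughly 1.5 years; and 1.11e9 / 191,397,848 ≈ 5.8, so roughly every 6th URL should resolve.)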
00:54:31<Pedrosso>That's true. Unless there's any way to check if it exists beforehand
00:54:46<Pedrosso>Well, any reasonable way
00:55:58<@JAA>Yeah, maybe the API has some bulk lookup endpoint. Otherwise, probably not.
00:56:54icedice quits [Client Quit]
00:57:35<Pedrosso>which API? Sporepedia's?
00:58:45<@JAA>Yeah
01:01:10<pokechu22>OK, right, the example I had was https://wwii.germandocsinrussia.org/ of which there were 54533607 URLs related to map tiles (e.g. https://wwii.germandocsinrussia.org/system/pages/000/015/45/map/8f3b4796a50501d2550bad6385f57cf65d78ca736f78d93dbfe7fc063bf0d396/2/2_0.jpg - but at a bunch of zoom levels), which I generated by a script. I split the list into 5 lists of 11000000
01:01:13<pokechu22>URLs, which ended up being about 118.1 GiB of data per list. I ran those lists one at a time (starting the next one after the previous one finished); it took about 2 hours for archivebot to download each list of 11M URLs and queue it (as that process isn't very optimized), and it took about 5 days for it to actually download the URLs in that list (though I don't think that's
01:01:15<pokechu22>representative of actual speeds for downloading...)
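A sketch of that splitting step: stream candidate URLs into files of at most N lines each, so each file can be queued in archivebot as a separate list (file naming arbitrary):

    import itertools

    def write_chunks(urls, chunk_size=11_000_000, prefix="urls"):
        it = iter(urls)
        for part in itertools.count(1):
            chunk = list(itertools.islice(it, chunk_size))
            if not chunk:
                break
            with open(f"{prefix}_{part:02d}.txt", "w") as f:
                f.write("\n".join(chunk) + "\n")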
01:01:52<pokechu22>In other cases (which I can't find) I did parallelize the process between a few AB pipelines, and each pipeline downloads multiple files at once, but it's still not ideal
01:02:24<pokechu22>That job is still fairly comparable though because it's downloading a bunch of low-resolution images
01:03:29<Pedrosso>even the "large" images (same item, different image, same info for the character, just higher res I believe) are approx 60 kB
01:05:10<pokechu22>The storage space for downloaded images probably isn't an issue overall (as that can be uploaded to web.archive.org in 5GB chunks), it's more the storage space used for the list of URLs and such
01:05:33<Pedrosso>(quite ironic)
01:06:23<pokechu22>Similarly, I'm not sure how useful it'd be to save the "large" images, as it seems like they don't have the embedded data, unlike the "thumb" images; so presumably it'd be possible to regenerate the large images in-game from the data in the thumb images, which is the opposite of how thumbnails/high-resolution images usually work
01:07:26<Pedrosso>That's fair
01:11:39<pokechu22>Assuming 20kB for thumbnails and the listed 191,397,848 creatures, that's about 4TB, which is a reasonable amount (on the large side, but still reasonable)
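(For the arithmetic: 191,397,848 thumbnails × 20 kB ≈ 3.83 × 10^12 bytes, i.e. roughly 3.8 TB, or about 3.5 TiB.)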
01:12:33<Pedrosso>Would it be relevant to save comments as well? I'd suggest users but that process is far less iterable
01:13:27<pokechu22>It looks like comments requires POST so archivebot can't do that, but those would be nice to save
01:13:45<Pedrosso>This? https://www.spore.com/rest/comments/500226147573/0/5000
01:13:52<Pedrosso>5000 is just an arbitrarily big value I put there
01:14:41<pokechu22>for what it's worth https://www.spore.com/ gives me an expired certificate error though http://www.spore.com/ works - I'm guessing you dismissed that error beforehand?
01:15:22<Pedrosso>I don't recall, so assume that I have
01:15:35<pokechu22>Looks like that API works: http://www.spore.com/rest/comments/500447019787/0/5000 - it's not the one used on http://www.spore.com/sporepedia#qry=sast-500447019787%3Aview-top_rated though
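A minimal sketch for pulling comments via that REST endpoint; interpreting the path as /rest/comments/<assetId>/<start>/<count> follows the URLs tried above and is an assumption:

    import requests

    def fetch_comments_xml(asset_id, count=5000):
        url = f"http://www.spore.com/rest/comments/{asset_id}/0/{count}"
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        return resp.text  # XML, per the API docs mentioned earlier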
01:15:56<Pedrosso>As long as it's the same information it's all good, right?
01:16:22<Pedrosso>As for users though, it does seem like there's a "userid"; however, I can't see anywhere you can put it to get the URL for the user page
01:16:39<pokechu22>Yeah, at least for having the information - it wouldn't make the first URL work on web.archive.org but that's not as important
01:19:20<@JAA>Ah, nedbat wrote that thumbnail data article, nice.
01:19:58<Pedrosso>So, what'd need to be done is to get that URL list and split it in reasonable chunks?
01:22:53<pokechu22>I should also note that archivebot isn't the only possible tool
01:23:22<Pedrosso>That is good to note, yes. Though I'm not really aware of many of the others
01:24:51<@JAA>If there are really no rate limits, qwarc could get through this in no time.
01:25:35<@JAA>I've done 2k requests per second before with qwarc.
01:25:51<@JAA>That'd work out to a week for 1.1 billion.
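(That estimate checks out: 1.1e9 requests / 2,000 per second ≈ 550,000 seconds ≈ 6.4 days.)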
01:26:52<Pedrosso>I wouldn't say there are none, but they may not be too limiting. I have nothing against running it on my own machine, but I'm not really aware of how to use it properly as of now
01:27:33<@JAA>Well, 'not too limiting' and 'allowing 2k/s' are two very different things. :-)
01:27:34<pokechu22>archiveteam also has https://wiki.archiveteam.org/index.php/DPoS where you have a bunch of tasks distributed to other users, and a lua script that handles it. So creature:500226147573 could be one task and that would fetch http://www.spore.com/rest/comments/500226147573/0/5000 and http://www.spore.com/rest/creature/500226147573 and
01:27:37<pokechu22>http://static.spore.com/static/image/500/226/147/500226147573_lrg.png and http://static.spore.com/static/thumb/500/226/147/500226147573.png and http://www.spore.com/rest/asset/500226147573, and could queue additional stuff based on that (e.g. /rest/asset gives the author, which could be queued)
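As a sketch of that per-item expansion (in Python rather than the project Lua, and with the exact URL set being an assumption based on the endpoints mentioned so far):

    def urls_for_asset(asset_id):
        s = str(asset_id)
        path = f"{s[0:3]}/{s[3:6]}/{s[6:9]}/{s}"
        return [
            f"http://www.spore.com/rest/asset/{s}",
            f"http://www.spore.com/rest/creature/{s}",
            f"http://www.spore.com/rest/comments/{s}/0/5000",
            f"http://static.spore.com/static/thumb/{path}.png",
            f"http://static.spore.com/static/image/{path}_lrg.png",
        ]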
01:28:06<pokechu22>It's fully scriptable... but that means you need to write the full script :)
01:28:15<pokechu22>So it's a lot more difficult to actually do it
01:28:30<@JAA>Yeah, same with qwarc.
01:28:33<pokechu22>(oh, and DPoS projects can also record POST requests, though they still won't play back properly)
01:28:35<@JAA>Frameworks for archiving things at scale.
01:29:34<pokechu22>Yeah, qwarc being a single-machine system instead of distributed
01:30:45<Pedrosso>I wouldn't mind trying to run qwarc, Anything I should be aware of?
01:32:03<@JAA>Beware of the dragons.
01:32:23<@JAA>:-)
01:32:25<fireonlive>😳
01:33:56<@JAA>There's no documentation, and there are some quirks to running it, especially memory-related. There's a memory 'leak' somewhere that I haven't been able to locate. With a large crawl like this, you're going to run into that.
01:35:14Pedrosso57 joins
01:35:24<Pedrosso57>What a convenient time for my internet to drop, hah
01:35:55<pokechu22>https://hackint.logs.kiska.pw/archiveteam-bs/20231108
01:37:36<Pedrosso57>?
01:37:41Pedrosso quits [Ping timeout: 265 seconds]
01:38:07<pokechu22>(message history, in case you missed anything)
01:38:11<Pedrosso57>Thank you
01:38:30Pedrosso57 is now known as Pedrosso
02:03:37etnguyen03 quits [Ping timeout: 272 seconds]
02:04:34JohnnyJ quits [Quit: Ping timeout (120 seconds)]
02:10:53lun4 quits [Quit: Ping timeout (120 seconds)]
02:11:09lun4 (lun4) joins
02:21:20etnguyen03 (etnguyen03) joins
02:26:13HP_Archivist (HP_Archivist) joins
02:32:15_Dango360 joins
02:33:43lun4 quits [Client Quit]
02:33:43igloo22225 quits [Quit: Ping timeout (120 seconds)]
02:34:09ave quits [Quit: Ping timeout (120 seconds)]
02:34:24decky quits [Read error: Connection reset by peer]
02:34:46BlueMaxima quits [Read error: Connection reset by peer]
02:34:51dumbgoy quits [Read error: Connection reset by peer]
02:35:00BlueMaxima joins
02:35:10decky joins
02:35:12dumbgoy joins
02:35:19lennier1 quits [Read error: Connection reset by peer]
02:35:31lennier1 (lennier1) joins
02:35:41Ryz quits [Quit: Ping timeout (120 seconds)]
02:35:56Bleo1 quits [Client Quit]
02:36:10@rewby quits [Ping timeout: 265 seconds]
02:36:15igloo22225 (igloo22225) joins
02:36:15lun4 (lun4) joins
02:36:39Ryz (Ryz) joins
02:36:42ave (ave) joins
02:36:56nepeat quits [Quit: ZNC - https://znc.in]
02:38:17rewby (rewby) joins
02:38:17@ChanServ sets mode: +o rewby
02:38:35MetaNova quits [Ping timeout: 265 seconds]
02:39:04anarcat quits [Ping timeout: 265 seconds]
02:40:46Bleo1 joins
02:41:52MetaNova (MetaNova) joins
02:45:32decky_e_ joins
02:45:39Larsenv0 (Larsenv) joins
02:46:10qwertyasdfuiopghjkl quits [Ping timeout: 252 seconds]
02:46:10shreyasminocha quits [Ping timeout: 252 seconds]
02:46:10thehedgeh0g quits [Ping timeout: 252 seconds]
02:46:10evan quits [Ping timeout: 252 seconds]
02:46:10kallsyms quits [Ping timeout: 252 seconds]
02:46:10IRC2DC quits [Ping timeout: 252 seconds]
02:46:10parfait_ quits [Remote host closed the connection]
02:46:11Larsenv quits [Client Quit]
02:46:12Ryz quits [Client Quit]
02:46:12decky quits [Remote host closed the connection]
02:46:12TheTechRobo quits [Client Quit]
02:46:12jarfeh quits [Client Quit]
02:46:12Pedrosso quits [Client Quit]
02:46:12Mateon1 quits [Remote host closed the connection]
02:46:12T31M quits [Quit: ZNC - https://znc.in]
02:46:12evan joins
02:46:12Mateon1 joins
02:46:12Larsenv0 is now known as Larsenv
02:46:14T31M joins
02:46:15thehedgeh0g (mrHedgehog0) joins
02:46:16shreyasminocha (shreyasminocha) joins
02:46:16IRC2DC joins
02:46:18parfait_ joins
02:46:21kallsyms joins
02:46:38TheTechRobo (TheTechRobo) joins
02:47:08Ryz (Ryz) joins
02:51:33TheTechRobo quits [Excess Flood]
02:51:33Mateon1 quits [Remote host closed the connection]
02:51:33Ryz quits [Client Quit]
02:51:37Mateon1 joins
02:51:52nepeat (nepeat) joins
02:52:00TheTechRobo (TheTechRobo) joins
02:52:15Ryz (Ryz) joins
02:53:34parfait_ quits [Ping timeout: 247 seconds]
02:55:42jarfeh joins
02:56:35anarcat (anarcat) joins
03:03:10pabs (pabs) joins
03:16:27qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
03:20:52pabs quits [Client Quit]
03:21:23pabs (pabs) joins
03:31:01etnguyen03 quits [Ping timeout: 272 seconds]
03:38:35pabs quits [Remote host closed the connection]
03:38:54pabs (pabs) joins
03:42:05etnguyen03 (etnguyen03) joins
03:59:43jarfeh quits [Remote host closed the connection]
04:03:02nicolas17 quits [Client Quit]
04:25:12jwn joins
04:30:14<jwn>This is where I can notify people of a website shutting down, right? Just making sure I have the right channel.
04:34:58<pabs>yes
04:35:08<pabs>which website, when is the shutdown?
04:38:33<jwn>The website is brick-hill.com. I'm unsure of any exact shutdown date, but I do know of plans to re-launch the site, due to ownership issues, without accounts and, I assume, without forum posts by extension.
04:41:29<@JAA>Huh. Brickset shut down their forums the other day. Is that just a coincidence?
04:42:18<jwn>Never heard of it, so probably. I also assume it wasn't as messy.
04:43:06<@JAA>That was https://forum.brickset.com/ (a few pages are still in their server-side cache).
04:43:13<@JAA>And yeah, not very messy.
04:44:36<@JAA>Just funny that we go years without any LEGO-related shutdowns (that I remember), and then there's two in quick succession.
04:45:34<@JAA>https://www.brick-hill.com/ does seem to work fairly well without JS, so that's nice.
04:45:58<jwn>Technically Brick-Hill is a Roblox clone but resemblances to Lego Island weren't accidental.
04:46:23<pabs>looks like the www/blog/merch subdomains have been captured previously
04:46:29<pabs>https://archive.fart.website/archivebot/viewer/?q=brick-hill.com
04:46:41<pabs>oh, some of them relatively recently
04:46:53<pabs>20230904
04:47:04<@JAA>Hmm, https://archive.fart.website/archivebot/viewer/job/202309041546573rfz7 seems very small for well over 2 million forum threads.
04:47:59<DogsRNice>could be one of those forums that won't let you view some boards if you aren't signed in
04:48:20<jwn>I'm pretty sure you don't need to be logged in to see the forums.
04:48:25<@JAA>80% or so are in a single forum that is publicly viewable.
04:49:45<@JAA>Hmm, no, the job did go deep there, too: https://web.archive.org/web/20230913003824/https://www.brick-hill.com/forum/2/40000
04:52:16<@JAA>Ryz: ^ You started that job.
04:52:47<@JAA>Ok yeah, it got 200s from 2352033 unique thread IDs. I guess that should be reasonably close to the total.
05:02:13dumbgoy quits [Ping timeout: 272 seconds]
05:20:59etnguyen03 quits [Ping timeout: 265 seconds]
05:21:51etnguyen03 (etnguyen03) joins
05:25:57DogsRNice quits [Read error: Connection reset by peer]
05:40:27ragra joins
05:47:54BlueMaxima quits [Read error: Connection reset by peer]
05:52:09etnguyen03 quits [Client Quit]
06:03:47jwn quits [Remote host closed the connection]
06:04:58Church quits [Ping timeout: 265 seconds]
06:16:52null joins
06:17:40rktk quits [Client Quit]
06:17:40TheTechRobo quits [Client Quit]
06:17:40Mateon1 quits [Remote host closed the connection]
06:17:40Mateon1 joins
06:18:07TheTechRobo (TheTechRobo) joins
06:21:53ragra quits [Ping timeout: 265 seconds]
06:22:13Church (Church) joins
06:23:06Arcorann (Arcorann) joins
06:27:54Barto (Barto) joins
07:04:07<vokunal|m>I'm definitely completely wrong about this, but if we had #Y working, would we need dedicated projects for sites anymore or could they be run through that with modification? Would we need AB?
07:36:37BornOn420 quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
07:48:34<thuban>vokunal|m: we would still need dedicated projects for sites that couldn't be crawled by generic spidering logic (because eg they depend on javascript api interactions).
07:51:03<thuban>theoretically it could do anything that archivebot could, but between overhead and the increased complexity of 'live' configuration in a distributed environment, we'd want to keep ab around anyway
07:53:30decky_e joins
07:55:45decky_e_ quits [Ping timeout: 272 seconds]
08:01:37Island quits [Read error: Connection reset by peer]
08:07:21<pabs>from #archivebot <mannie> the energy company ENSTROGA has been declared bankrupt. Here is the court announcement: https://insolventies.rechtspraak.nl/#!/details/03.lim.23.189.F.1300.1.23 and here the official website: https://enstroga.nl
08:22:53BornOn420 (BornOn420) joins
08:30:10<pabs><mannie> A12 taxi Zoetermeer: https://www.taxizoetermeer.nl has been declared bankrupt. Court files: https://insolventies.rechtspraak.nl/#!/details/09.dha.23.294.F.1300.1.23
08:40:05Church quits [Ping timeout: 272 seconds]
08:56:33Church (Church) joins
09:50:23katocala quits [Ping timeout: 272 seconds]
09:50:27katocala joins
10:00:02Bleo1 quits [Client Quit]
10:01:16Bleo1 joins
10:03:03Wohlstand (Wohlstand) joins
10:06:26T31M_ joins
10:06:31decky joins
10:08:04neggles quits [Quit: bye friends - ZNC - https://znc.in]
10:08:12ThetaDev_ joins
10:08:14Bleo1 quits [Client Quit]
10:08:14BearFortress quits [Client Quit]
10:08:14T31M quits [Client Quit]
10:08:14TheTechRobo quits [Client Quit]
10:08:14ThetaDev quits [Client Quit]
10:08:14decky_e quits [Remote host closed the connection]
10:08:14Wohlstand quits [Remote host closed the connection]
10:08:14Mateon1 quits [Remote host closed the connection]
10:08:14katocala quits [Remote host closed the connection]
10:08:14h3ndr1k quits [Quit: No Ping reply in 180 seconds.]
10:08:14jodizzle quits [Quit: ...]
10:08:14T31M_ is now known as T31M
10:08:14Mateon1 joins
10:08:15Bleo1 joins
10:08:15katocala joins
10:08:18Wohlstand (Wohlstand) joins
10:08:20neggles (neggles) joins
10:08:40TheTechRobo (TheTechRobo) joins
10:09:11BearFortress joins
10:09:28TheTechRobo quits [Excess Flood]
10:09:37h3ndr1k (h3ndr1k) joins
10:10:06TheTechRobo (TheTechRobo) joins
10:10:40jodizzle (jodizzle) joins
10:10:55TheTechRobo quits [Excess Flood]
10:19:06<masterX244>jwn: Brickset is a LEGO fansite; the forum got too expensive and activity declined. Luckily we caught it just before the shredders started
11:34:29sd quits [Quit: sd]
11:49:56sd (sd) joins
11:53:58sd quits [Client Quit]
11:54:26sd (sd) joins
12:03:18luffytam joins
12:03:25luffytam quits [Remote host closed the connection]
12:13:16Mateon1 quits [Ping timeout: 265 seconds]
12:17:55mindstrut quits [Quit: Leaving]
12:42:15Dango360_ joins
12:46:27_Dango360 quits [Ping timeout: 272 seconds]
12:57:30TheTechRobo (TheTechRobo) joins
12:57:51Arcorann quits [Ping timeout: 272 seconds]
13:14:50Matthww1197 joins
13:17:29Matthww119 quits [Ping timeout: 272 seconds]
13:17:29Matthww1197 is now known as Matthww119
13:23:43Wohlstand quits [Remote host closed the connection]
13:23:51Wohlstand (Wohlstand) joins
13:26:14Dango360_ quits [Remote host closed the connection]
13:26:38Dango360 (Dango360) joins
13:29:54Matthww1191 joins
13:30:01Matthww119 quits [Client Quit]
13:30:01TheTechRobo quits [Client Quit]
13:30:01Bleo1 quits [Client Quit]
13:30:01Wohlstand quits [Remote host closed the connection]
13:30:01Matthww1191 is now known as Matthww119
13:30:06Bleo1 joins
13:30:09Wohlstand (Wohlstand) joins
13:30:28TheTechRobo (TheTechRobo) joins
13:36:49kiryu_ joins
13:40:00Wohlstand quits [Remote host closed the connection]
13:40:00katocala quits [Ping timeout: 334 seconds]
13:40:00kiryu quits [Ping timeout: 276 seconds]
13:40:05Adrmcr (Adrmcr) joins
13:40:10katocala joins
13:40:12Wohlstand (Wohlstand) joins
13:43:00<Adrmcr>On the note of Spore, I found that https://staging.spore.com/ has its "static" and "www_static" subdirectories open; as far as I can tell, everything on there is also on the regular non-staging website, so it may be safe to extract everything, strip "staging.", and archive the main file links
13:58:54etnguyen03 (etnguyen03) joins
14:27:56dumbgoy joins
14:34:36celestial quits [Quit: ZNC 1.8.0 - https://znc.in]
14:34:54celestial joins
14:48:03treora quits [Ping timeout: 272 seconds]
14:51:39treora joins
15:18:45Adrmcr quits [Remote host closed the connection]
15:40:58balrog quits [Quit: Bye]
15:45:14balrog (balrog) joins
16:06:24Island joins
16:44:21T31M quits [Client Quit]
16:44:41T31M joins
16:55:32etnguyen03 quits [Ping timeout: 265 seconds]
17:19:55HP_Archivist quits [Client Quit]
17:45:25Sam joins
17:46:14Edel69 joins
17:49:29Sam quits [Remote host closed the connection]
17:50:50<Edel69>Hello. I have a quick question. Is it possible to find a deleted private Imgur album among the huge archive dump with an album URL link or a link to a single image that was within the album?
17:52:57<@JAA>Edel69: If it was archived, it's in the Wayback Machine. Album or image page or direct image link should all work.
17:57:53<Edel69>Thanks for the response. I was under the impression that the team behind the archive job has actual access to files that were downloaded and backed up before the May 2023 TOS change went into effect.
17:58:51dumbgoy quits [Ping timeout: 265 seconds]
18:01:13<@JAA>Well, the raw data is all publicly accessible, but trust me, you don't want to work with that. :-)
18:02:07<Edel69>I wouldn't even know what to do with all of that. lol
18:02:09<@JAA>The WBM index should contain all of them, and that's the far more convenient way of accessing it.
18:06:29<Ryz>Regarding https://www.brick-hill.com/forum/ - JAA, hmm, I'm a bit iffy on how much of it is covered, because I recall the last couple of times it errored out from overloading or something...?
18:16:05dumbgoy joins
18:18:19etnguyen03 (etnguyen03) joins
18:23:51<Edel69>So I tried multiple album and separate image URLs in the Wayback Machine and I get no hits at all. I don't think any of my deleted account's uploads have been archived on there. None of my albums were public, so it wouldn't have been possible for there to be Web archives maybe? My decade old account was abruptly deleted with no warnings just a few
18:23:51<Edel69>days ago, so if there's nothing at all I guess that means my data was somehow not archived.
18:26:46dumbgoy_ joins
18:27:01<@JAA>I think we should've grabbed virtually all 5-char image IDs. But beyond that, it would've been mostly things that were publicly shared in one of the sources we scraped.
18:29:47dumbgoy quits [Ping timeout: 265 seconds]
18:34:09etnguyen03 quits [Ping timeout: 272 seconds]
18:41:12<Edel69>I finally got a hit from one of the limited URLs I have. https://i.imgur.com/eClDaR3.jpg - An image from a Resident Evil album. I guess this wouldn't help in finding anything else that was in the same album though.
18:42:22etnguyen03 (etnguyen03) joins
18:52:27<Edel69>Isn't the image ID in the URL link? If so, they're all 7 characters.
18:53:11<vokunal|m>Yeah
18:53:35<vokunal|m>really old urls can be 5 characters though
18:54:01<vokunal|m>they went through all the 5 character ids before upping their ids to 7 characters
18:55:09<Edel69>Ah, so the 7 character IDs were also backed up. I was thinking he was saying that they only grabbed the 5 character IDs.
18:55:55<imer>We didn't get all 5-char albums unfortunately; virtually all 5-char images should be saved, and then most 7-char ones we found
18:56:57<vokunal|m>We grabbed basically all 900M 5-char images, and around 1 billion 7-character images, I think
18:57:27<vokunal|m>we brute forced a lot, but there are 3.5 trillion possible IDs in the 7-character space
18:58:34<vokunal|m>just guessing on the 5-char, because I think that's what that would pan out to with our total done
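(For reference on those ID-space figures, assuming case-sensitive alphanumeric IDs: 62^5 = 916,132,832 ≈ 0.9 billion possible 5-character IDs, and 62^7 ≈ 3.52 × 10^12 ≈ 3.5 trillion possible 7-character IDs.)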
19:02:49<Edel69>That's a lot of downloading you all did. With that massive amount of data it would be like looking for a needle in a haystack to find anything specific, let alone a specific album collection. I'm just going to cut my losses and forget about it lol. Thanks for the help and information though.
19:04:07etnguyen03 quits [Ping timeout: 265 seconds]
19:04:07<imer>Edel69: for 5char albums there is metadata which might be easier to search through https://archive.org/details/imgur_album5_api_dump
19:04:23<imer>still a lot of data though
19:05:46<imer>I don't have these local anymore unfortunately, could've done a quick search otherwise :(
19:08:54<@JAA>→ #imgone for further discussion please
19:26:46Edel69 quits [Remote host closed the connection]
19:44:16immibis_ is now known as immibis
19:47:37lennier1 quits [Ping timeout: 272 seconds]
20:00:41icedice (icedice) joins
20:33:20Pedrosso joins
20:36:15etnguyen03 (etnguyen03) joins
20:40:20<h2ibot>Manu edited Political parties/Germany (+63, /* CDU: more */): https://wiki.archiveteam.org/?diff=51107&oldid=48436
21:15:01etnguyen03 quits [Ping timeout: 272 seconds]
21:20:59<Pedrosso>It appears that, according to others, ArchiveBot is putting pressure on spore.com, hence I'm not planning to do that archive instantly using qwarc. I am going to keep looking into it though.
21:21:40<Pedrosso>On another note, what kind of motivations (if any) are needed for using the ArchiveBot? I've got a few small sites in mind, but I've mostly no good reason other than "I want em archived lol"
21:29:33<pokechu22>For small sites that's probably good enough of a motivation right now as there's nothing urgent that needs to be run
21:29:36DogsRNice joins
21:30:12BlueMaxima joins
21:30:45<pokechu22>I'm not entirely sure about the amount of pressure - archivebot was slowed to one request per second and it seems like the site is giving a response basically instantly (and it didn't look like things were bad when it was running at con=3, d=250-375)
21:31:11<pokechu22>but I'm also not monitoring the site directly and am not an expert
21:31:49<Pedrosso>Alright, that's good. As for offsite links, if I'm understanding this correctly, it goes recursively within a website, but doesn't do so with outlinks?
21:32:03<@JAA>The intent is to provide ~~players~~ archivists with a sense of pride and accomplishment for ~~unlocking different heroes~~ slowing down the servers.
21:32:40<pokechu22>It does do outlinks by default; the outlink and any of its resources (e.g. embedded images, scripts, audio, or video (if it's done in a way that can be parsed automatically)) will be saved
21:32:50<pokechu22>There is a --no-offsite option to disable that but it's generally fine to include them
21:33:01<@JAA>I also wouldn't expect AB to make a difference for a website by a major game publisher, but you never know.
21:33:08<@JAA>It was already slow last night before we started the job.
21:33:43<Pedrosso>Haha
21:36:00<Pedrosso>also, I was not informed of any restrictions of commands lol. Makes sense to not let people just randomly do it but I didn't find anything on the wiki like that
21:40:45<@JAA>> Note that you will need channel operator (@) or voice (+) permissions in order to issue archiving jobs
21:41:34<@JAA>It's not mentioned in the command docs though, only on the wiki page.
21:43:01<Pedrosso>Thank you
21:47:16<@JAA>I've been trying to separate the docs of 'ArchiveBot the software' from 'ArchiveBot the AT instance'. But the permissions part should be in the former, too.
21:56:25etnguyen03 (etnguyen03) joins
22:01:14HP_Archivist (HP_Archivist) joins
22:06:38<h2ibot>Manu created Political parties/Germany/Hamburg (+7072, Beginn collection political parties for…): https://wiki.archiveteam.org/?title=Political%20parties/Germany/Hamburg
22:34:10Wohlstand quits [Remote host closed the connection]
22:34:15Wohlstand (Wohlstand) joins
22:55:01Dango360_ joins
22:56:14_Dango360 joins
22:56:21Matthww119 quits [Ping timeout: 272 seconds]
22:57:02Matthww119 joins
22:58:53Dango360 quits [Ping timeout: 272 seconds]
22:59:58Dango360_ quits [Ping timeout: 265 seconds]
23:00:49<h2ibot>JAABot edited CurrentWarriorProject (-4): https://wiki.archiveteam.org/?diff=51109&oldid=51000
23:03:21ragra joins
23:03:39ragra quits [Remote host closed the connection]
23:04:16ragra joins
23:09:01etnguyen03 quits [Ping timeout: 272 seconds]
23:48:00TastyWiener95 quits [Quit: So long, farewell, auf wiedersehen, good night]
23:53:55TastyWiener95 (TastyWiener95) joins