00:18:28lennier1 quits [Client Quit]
00:19:06lennier1 (lennier1) joins
00:34:16Megame quits [Client Quit]
00:36:06sec^nd quits [Remote host closed the connection]
00:37:16sec^nd (second) joins
01:02:42dm4v_ joins
01:02:48dm4v quits [Read error: Connection reset by peer]
01:02:54dm4v_ is now known as dm4v
01:02:57dm4v quits [Changing host]
01:02:57dm4v (dm4v) joins
01:22:23datechnoman quits [Client Quit]
01:23:12datechnoman (datechnoman) joins
01:51:47wyatt8750 joins
01:52:54wyatt8740 quits [Ping timeout: 265 seconds]
01:59:50user__ quits [Read error: Connection reset by peer]
02:00:17user__ (gazorpazorp) joins
02:06:49@Fusl quits [Excess Flood]
02:07:05Fusl (Fusl) joins
02:07:05@ChanServ sets mode: +o Fusl
02:41:07<h2ibot>NightHnh099 edited Alive... OR ARE THEY (+238): https://wiki.archiveteam.org/?diff=48444&oldid=47018
02:56:54lennier1 quits [Client Quit]
02:57:16lennier1 (lennier1) joins
03:10:11<h2ibot>JustAnotherArchivist edited Deathwatch (+15, /* 2022 */ BPO's date shifted again): https://wiki.archiveteam.org/?diff=48445&oldid=48428
03:36:50Mateon1 quits [Remote host closed the connection]
03:37:03Mateon1 joins
04:13:25qwertyasdfuiopghjkl quits [Client Quit]
04:16:23qwertyasdfuiopghjkl joins
04:28:32BlueMaxima quits [Read error: Connection reset by peer]
04:35:04drexler joins
04:35:14<drexler>> What's the best way to archive large files at their original URL to ensure the maximum chance of people finding it?
04:35:25<@JAA>Hello again. :-) What is this about?
04:36:27<drexler>Oh, I'm trying to archive publicly available AI models.
04:36:55<drexler>Because they're large files that tend to be considered 'incidental' and disappear once someone decides to stop hosting them, but if they're available you can go straight back to a particular 'era' of AI art.
04:37:37<drexler>That is, they're large files with short lived popularity but likely long tail value that is not being recognized right this moment but probably will be later.
04:38:27<drexler>And because of this their hosting tends to be odd? People will put them up on their personal server or a university lab workstation or something then one day that goes down and never comes back up.
04:39:45<@JAA>Right. What order of magnitude are we talking about here? As in, if this were done continuously, how much new data per time would you expect?
04:40:59<drexler>Oh you could probably fit every historically interesting AI model on a few terabytes maybe? That's being conservative/assuming it takes more memory than it probably does.
04:41:18<drexler>Uh, at least for the AI art ones
04:41:22<drexler>Which is all I'm interested in at the moment
04:42:12<@JAA>That doesn't sound too horrible. I was expecting more.
04:46:09<drexler>Oh, no.
04:46:20<drexler>StyleGAN models are probably something like 700mb each iirc
04:46:23<@JAA>These are normally distributed over HTTP(S), right? I remember mirroring some stuff from rsync servers before.
04:46:29<drexler>Yeah, they usually are
04:46:59<drexler>And then there's bigger models like the ones I'm training right now, which are 10gb each. But people don't train very many of those.
04:47:05<drexler>Yet.
04:47:16<drexler>This early period is probably going to be of the most historical interest, and it uses the least storage space.
04:47:51<@JAA>Ok, if you have a list, we can probably just run it through ArchiveBot. Shouldn't take more than a few weeks to maybe a couple months for that size, though it obviously depends on the servers as well.
04:47:55<@JAA>arkiver: ^ Thoughts?
04:50:42<drexler>Yeah, the alternative is I just download them all slowly and then reupload as an Internet Archive collection, but I think it'd be ideal if they were findable at their original URLs, dunno. I guess the flip side of it is that people might forget what their original URLs even are, given how weird and obscure the hosting often is in the first place.
04:51:49<@JAA>Yeah, documenting those URLs would definitely be a good idea. If it isn't an endless list, this could go onto our wiki.
04:53:57<thuban>if you do end up submitting a list to archivebot, it might be a good idea to include urls of 'about' pages and/or whatever you used to discover the model urls in the first place (since that discovery would then be replicable within the wbm)
04:56:11drexler thumbs up
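A minimal sketch of the "download them all slowly and then reupload" fallback drexler describes, using the Internet Archive's `internetarchive` Python package. It assumes `requests` and `internetarchive` are installed and an IA account has been configured with `ia configure`; the model URL and item identifier are hypothetical placeholders, and recording the original URLs in the item metadata is one way to keep them documented as suggested above.

    # Mirror each model file locally, then upload the mirror as one IA item.
    import os
    import requests
    from internetarchive import upload

    MODEL_URLS = [
        "https://example.org/models/stylegan2-ffhq.pkl",  # hypothetical URL
    ]

    def mirror(url, dest_dir="models"):
        """Stream one large file to disk and return its local path."""
        os.makedirs(dest_dir, exist_ok=True)
        path = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
        with requests.get(url, stream=True, timeout=60) as r:
            r.raise_for_status()
            with open(path, "wb") as f:
                for chunk in r.iter_content(chunk_size=1 << 20):
                    f.write(chunk)
        return path

    files = [mirror(u) for u in MODEL_URLS]
    # Keep the original URLs in the metadata so they stay findable later.
    upload("ai-art-models-mirror",  # hypothetical item identifier
           files=files,
           metadata={"mediatype": "data",
                     "originalurl": "; ".join(MODEL_URLS)})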
05:15:34<drexler>Hm, now that you mention it.
05:15:36<drexler>For about/context
05:15:44<drexler>You ever archived Google CoLab before?
05:17:21wyatt8750 quits [Ping timeout: 265 seconds]
05:18:31<h2ibot>JustAnotherArchivist edited DPoS (-35, FOS hasn't been used in years; HTTPS for tracker): https://wiki.archiveteam.org/?diff=48446&oldid=48431
05:19:08wyatt8740 joins
06:05:39<Jake>(colab if I remember correctly is _tons_ of JavaScript, so the actual page isn't easily archivable, from what I remember, but I think you can export pretty easily?)
06:10:31<drexler>Jake, Yeah
06:10:42<drexler>You don't need to archive the page
06:10:48<drexler>You just need to archive/export the notebook itself
06:11:07<drexler>Because that's how ML people tend to distribute the prototype/public use version of their programs
06:11:39<drexler>So if you're saving the models at an expensive storage cost, it would be foolish not to also save the notebooks, which are a fraction of a fraction of the size and are what allows you to actually use the models to do something.
06:12:01Hackerpcs quits [Client Quit]
06:12:14<drexler>They also contain the URLs the models are stored at, which is why I thought of them.
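A minimal sketch of pulling model URLs out of an exported Colab notebook, since a notebook saved via File > Download .ipynb is plain JSON with a list of cells. Only that format is relied on here; the filename is a hypothetical example.

    import json
    import re

    URL_RE = re.compile(r"https?://[^\s'\"()<>]+")

    def notebook_urls(path):
        """Return every http(s) URL that appears in the notebook's cells."""
        with open(path, encoding="utf-8") as f:
            nb = json.load(f)
        urls = set()
        for cell in nb.get("cells", []):
            source = cell.get("source", [])
            text = "".join(source) if isinstance(source, list) else source
            urls.update(URL_RE.findall(text))
        return sorted(urls)

    if __name__ == "__main__":
        for url in notebook_urls("stylegan_demo.ipynb"):  # hypothetical file
            print(url)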
06:15:50Hackerpcs (Hackerpcs) joins
06:37:06lennier1 quits [Ping timeout: 265 seconds]
06:42:02lennier1 (lennier1) joins
07:18:27CatBatHat joins
07:23:44CatBatHat quits [Remote host closed the connection]
07:27:14march_happy quits [Ping timeout: 265 seconds]
07:28:12march_happy (march_happy) joins
07:37:52march_happy quits [Ping timeout: 265 seconds]
07:38:09march_happy (march_happy) joins
07:40:37@dxrt quits [Quit: ZNC - http://znc.sourceforge.net]
07:41:19dxrt joins
07:41:22dxrt quits [Changing host]
07:41:22dxrt (dxrt) joins
07:41:22@ChanServ sets mode: +o dxrt
08:09:25march_happy quits [Ping timeout: 265 seconds]
08:09:38march_happy (march_happy) joins
08:30:27nepeat quits [Client Quit]
08:39:52qwertyasdfuiopghjkl quits [Remote host closed the connection]
08:48:45nepeat (nepeat) joins
08:55:24nepeat quits [Client Quit]
08:56:56[42] quits [Max SendQ exceeded]
08:57:47[42] (N4Y) joins
09:01:03nepeat (nepeat) joins
09:29:36Megame (Megame) joins
09:53:38sonick quits [Client Quit]
10:28:40sonick (sonick) joins
10:52:37AK quits [Quit: Ping timeout (120 seconds)]
10:57:58JackThompson joins
11:00:35JackThompson quits [Client Quit]
11:04:52march_happy quits [Ping timeout: 265 seconds]
11:05:07march_happy (march_happy) joins
11:24:41march_happy quits [Ping timeout: 265 seconds]
11:25:01march_happy (march_happy) joins
11:29:43AK (AK) joins
11:30:40<AK>arkiver, I know we spoke at one point about doing nslookups of domains and not attempting to grab if the domain resolves to an internal address
11:30:45<AK>My ban reason by hetzner was "since the IPs listed in the log are not routed in the Internet at the moment they are not reachable and therefore this is seen as an abuse."
11:31:19<AK>So I'm either going to ask please can we add something like that on the grab side, or I'll need to look into how I can null route outbound requests to private addresses
11:53:41march_happy quits [Ping timeout: 265 seconds]
11:54:03march_happy (march_happy) joins
11:57:48pokes quits [Remote host closed the connection]
12:10:36march_happy quits [Ping timeout: 265 seconds]
12:11:22march_happy (march_happy) joins
12:26:33jacobk quits [Ping timeout: 265 seconds]
12:44:20CatBatHat joins
12:55:10jacobk joins
13:29:23CatBatHat quits [Ping timeout: 265 seconds]
13:32:46jacobk quits [Ping timeout: 265 seconds]
13:41:26LeGoupil joins
13:46:01binzyboi quits [Quit: Leaving]
14:08:24Arcorann quits [Ping timeout: 265 seconds]
14:25:23LeGoupil quits [Client Quit]
14:41:17<@arkiver>AK: let's take that to #//
14:48:39march_happy quits [Ping timeout: 265 seconds]
14:49:25march_happy (march_happy) joins
15:01:12Megame quits [Client Quit]
15:08:28march_happy quits [Ping timeout: 265 seconds]
15:08:42march_happy (march_happy) joins
15:17:01jacobk joins
15:30:21JackThompson joins
15:39:45JackThompson quits [Client Quit]
15:46:33JackThompson joins
15:54:23rsn quits [Ping timeout: 265 seconds]
16:05:10qwertyasdfuiopghjkl joins
17:33:05swety-lis joins
17:33:22swety-lis quits [Remote host closed the connection]
17:47:18AnotherIki joins
17:50:52Iki1 quits [Ping timeout: 265 seconds]
18:15:31march_happy quits [Ping timeout: 265 seconds]
18:26:59Church quits [Ping timeout: 265 seconds]
18:42:06jacobk quits [Ping timeout: 265 seconds]
18:44:55march_happy (march_happy) joins
18:46:20Church (Church) joins
18:47:47Megame (Megame) joins
19:07:35march_happy quits [Ping timeout: 265 seconds]
19:10:00march_happy (march_happy) joins
19:14:12rsn joins
19:23:10LeGoupil joins
19:46:15thetechrobo_ joins
19:49:46TheTechRobo quits [Ping timeout: 265 seconds]
20:26:19thetechrobo_ is now known as TheTechRobo
20:50:33jacobk joins
21:06:00yyyy joins
21:06:20yyyy quits [Remote host closed the connection]
21:08:45spirit quits [Client Quit]
21:22:42Megame quits [Client Quit]
21:33:06eroc1990 quits [Client Quit]
21:39:20eroc1990 (eroc1990) joins
21:51:35LeGoupil quits [Client Quit]
21:58:31BlueMaxima joins
22:02:20eroc1990 quits [Client Quit]
22:23:27eroc1990 (eroc1990) joins
22:37:15datechnoman quits [Client Quit]
22:37:38datechnoman (datechnoman) joins
22:43:11Minkafighter quits [Quit: The Lounge - https://thelounge.chat]
22:43:31Minkafighter joins
22:51:33Arcorann (Arcorann) joins
23:19:03geezabiscuit quits [Ping timeout: 265 seconds]
23:19:12drin joins
23:19:46drin is now known as geezabiscuit
23:22:33binzyboi joins
23:29:41march_happy quits [Ping timeout: 265 seconds]