00:18:28lennier1 quits [Client Quit]
00:19:06lennier1 (lennier1) joins
00:34:16Megame quits [Client Quit]
00:36:06sec^nd quits [Remote host closed the connection]
00:37:16sec^nd (second) joins
01:02:42dm4v_ joins
01:02:48dm4v quits [Read error: Connection reset by peer]
01:02:54dm4v_ is now known as dm4v
01:02:57dm4v quits [Changing host]
01:02:57dm4v (dm4v) joins
01:22:23datechnoman quits [Client Quit]
01:23:12datechnoman (datechnoman) joins
01:51:47wyatt8750 joins
01:52:54wyatt8740 quits [Ping timeout: 265 seconds]
01:59:50user__ quits [Read error: Connection reset by peer]
02:00:17user__ (gazorpazorp) joins
02:06:49@Fusl quits [Excess Flood]
02:07:05Fusl (Fusl) joins
02:07:05@ChanServ sets mode: +o Fusl
02:41:07<h2ibot>NightHnh099 edited Alive... OR ARE THEY (+238): https://wiki.archiveteam.org/?diff=48444&oldid=47018
02:56:54lennier1 quits [Client Quit]
02:57:16lennier1 (lennier1) joins
03:10:11<h2ibot>JustAnotherArchivist edited Deathwatch (+15, /* 2022 */ BPO's date shifted again): https://wiki.archiveteam.org/?diff=48445&oldid=48428
03:36:50Mateon1 quits [Remote host closed the connection]
03:37:03Mateon1 joins
04:13:25qwertyasdfuiopghjkl quits [Client Quit]
04:16:23qwertyasdfuiopghjkl joins
04:28:32BlueMaxima quits [Read error: Connection reset by peer]
04:35:04drexler joins
04:35:14<drexler>> What's the best way to archive large files at their original URL to ensure the maximum chance of people finding it?
04:35:25<@JAA>Hello again. :-) What is this about?
04:36:27<drexler>Oh, I'm trying to archive publicly available AI models.
04:36:55<drexler>Because they're large files that tend to be considered 'incidental' and disappear once someone decides to stop hosting them, but if they're available you can go straight back to a particular 'era' of AI art.
04:37:37<drexler>That is, they're large files with short lived popularity but likely long tail value that is not being recognized right this moment but probably will be later.
04:38:27<drexler>And because of this their hosting tends to be odd? People will put them up on their personal server or a university lab workstation or something then one day that goes down and never comes back up.
04:39:45<@JAA>Right. What order of magnitude are we talking about here? As in, if this were done continuously, how much new data per time would you expect?
04:40:59<drexler>Oh you could probably fit every historically interesting AI model on a few terabytes maybe? That's being conservative/assuming it takes more memory than it probably does.
04:41:18<drexler>Uh, at least for the AI art ones
04:41:22<drexler>Which is all I'm interested in at the moment
04:42:12<@JAA>That doesn't sound too horrible. I was expecting more.
04:46:09<drexler>Oh, no.
04:46:20<drexler>StyleGAN models are probably something like 700mb each iirc
04:46:23<@JAA>These are normally distributed over HTTP(S), right? I remember mirroring some stuff from rsync servers before.
04:46:29<drexler>Yeah, they usually are
04:46:59<drexler>And then there's bigger models like the ones I'm training right now, which are 10gb each. But people don't train very many of those.
04:47:05<drexler>Yet.
04:47:16<drexler>This early period is probably going to be of the most historical interest, and it uses the least storage space.
04:47:51<@JAA>Ok, if you have a list, we can probably just run it through ArchiveBot. Shouldn't take more than a few weeks to maybe a couple months for that size, though it obviously depends on the servers as well.
04:47:55<@JAA>arkiver: ^ Thoughts?
04:50:42<drexler>Yeah, the alternative is I just download them all slowly and then reupload as an Internet Archive collection, but I think it'd be ideal if they were findable at their original URLs, dunno. I guess the flip side of it is that people might forget what their original URLs even are, given how weird and obscure the hosting often is in the first place.
04:51:49<@JAA>Yeah, documenting those URLs would definitely be a good idea. If it isn't an endless list, this could go onto our wiki.
04:53:57<thuban>if you do end up submitting a list to archivebot, it might be a good idea to include urls of 'about' pages and/or whatever you used to discover the model urls in the first place (since that discovery would then be replicable within the wbm)
04:56:11drexler thumbs up
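A minimal sketch of the "download them all slowly and then reupload" fallback drexler describes, using the Internet Archive's `internetarchive` Python package. It assumes `requests` and `internetarchive` are installed and an IA account has been configured with `ia configure`; the model URL and item identifier are hypothetical placeholders, and recording the original URLs in the item metadata is one way to keep them documented as suggested above.

    # Mirror each model file locally, then upload the mirror as one IA item.
    import os
    import requests
    from internetarchive import upload

    MODEL_URLS = [
        "https://example.org/models/stylegan2-ffhq.pkl",  # hypothetical URL
    ]

    def mirror(url, dest_dir="models"):
        """Stream one large file to disk and return its local path."""
        os.makedirs(dest_dir, exist_ok=True)
        path = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
        with requests.get(url, stream=True, timeout=60) as r:
            r.raise_for_status()
            with open(path, "wb") as f:
                for chunk in r.iter_content(chunk_size=1 << 20):
                    f.write(chunk)
        return path

    files = [mirror(u) for u in MODEL_URLS]
    # Keep the original URLs in the metadata so they stay findable later.
    upload("ai-art-models-mirror",  # hypothetical item identifier
           files=files,
           metadata={"mediatype": "data",
                     "originalurl": "; ".join(MODEL_URLS)})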
05:15:34<drexler>Hm, now that you mention it.
05:15:36<drexler>For about/context
05:15:44<drexler>You ever archived Google CoLab before?
05:17:21wyatt8750 quits [Ping timeout: 265 seconds]
05:18:31<h2ibot>JustAnotherArchivist edited DPoS (-35, FOS hasn't been used in years; HTTPS for tracker): https://wiki.archiveteam.org/?diff=48446&oldid=48431
05:19:08wyatt8740 joins
06:05:39<Jake>(colab if I remember correctly is _tons_ of JavaScript, so the actual page isn't easily archivable, from what I remember, but I think you can export pretty easily?)
06:10:31<drexler>Jake, Yeah
06:10:42<drexler>You don't need to archive the page
06:10:48<drexler>You just need to archive/export the notebook itself
06:11:07<drexler>Because that's how ML people tend to distribute the prototype/public use version of their programs
06:11:39<drexler>So if you're saving the models at an expensive storage cost, it would be foolish not to also save the notebooks, which are a fraction of a fraction of the size and are what allows you to actually use the models to do something.
06:12:01Hackerpcs quits [Client Quit]
06:12:14<drexler>They also contain the URLs the models are stored at, which is why I thought of them.
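A minimal sketch of pulling model URLs out of an exported Colab notebook, since a notebook saved via File > Download .ipynb is plain JSON with a list of cells. Only that format is relied on here; the filename is a hypothetical example.

    import json
    import re

    URL_RE = re.compile(r"https?://[^\s'\"()<>]+")

    def notebook_urls(path):
        """Return every http(s) URL that appears in the notebook's cells."""
        with open(path, encoding="utf-8") as f:
            nb = json.load(f)
        urls = set()
        for cell in nb.get("cells", []):
            source = cell.get("source", [])
            text = "".join(source) if isinstance(source, list) else source
            urls.update(URL_RE.findall(text))
        return sorted(urls)

    if __name__ == "__main__":
        for url in notebook_urls("stylegan_demo.ipynb"):  # hypothetical file
            print(url)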
06:15:50Hackerpcs (Hackerpcs) joins
06:37:06lennier1 quits [Ping timeout: 265 seconds]
06:42:02lennier1 (lennier1) joins
07:18:27CatBatHat joins
07:23:44CatBatHat quits [Remote host closed the connection]
07:27:14march_happy quits [Ping timeout: 265 seconds]
07:28:12march_happy (march_happy) joins
07:37:52march_happy quits [Ping timeout: 265 seconds]
07:38:09march_happy (march_happy) joins
07:40:37@dxrt quits [Quit: ZNC - http://znc.sourceforge.net]
07:41:19dxrt joins
07:41:22dxrt quits [Changing host]
07:41:22dxrt (dxrt) joins
07:41:22@ChanServ sets mode: +o dxrt
08:09:25march_happy quits [Ping timeout: 265 seconds]
08:09:38march_happy (march_happy) joins
08:30:27nepeat quits [Client Quit]
08:39:52qwertyasdfuiopghjkl quits [Remote host closed the connection]
08:48:45nepeat (nepeat) joins
08:55:24nepeat quits [Client Quit]
08:56:56[42] quits [Max SendQ exceeded]
08:57:47[42] (N4Y) joins
09:01:03nepeat (nepeat) joins
09:29:36Megame (Megame) joins
09:53:38sonick quits [Client Quit]
10:28:40sonick (sonick) joins
10:52:37AK quits [Quit: Ping timeout (120 seconds)]
10:57:58JackThompson joins
11:00:35JackThompson quits [Client Quit]
11:04:52march_happy quits [Ping timeout: 265 seconds]
11:05:07march_happy (march_happy) joins
11:24:41march_happy quits [Ping timeout: 265 seconds]
11:25:01march_happy (march_happy) joins
11:29:43AK (AK) joins
11:30:40<AK>arkiver, I know we spoke at one point about doing nslookups of domains and not attempting to grab if the domain resolves to an internal address
11:30:45<AK>My ban reason by hetzner was "since the IPs listed in the log are not routed in the Internet at the moment they are not reachable and therefore this is seen as an abuse."
11:31:19<AK>So I'm either going to ask please can we add something like that on the grab side, or I'll need to look into how I can null route outbound requests to private addresses
11:53:41march_happy quits [Ping timeout: 265 seconds]
11:54:03march_happy (march_happy) joins
11:57:48pokes quits [Remote host closed the connection]
12:10:36march_happy quits [Ping timeout: 265 seconds]
12:11:22march_happy (march_happy) joins
12:26:33jacobk quits [Ping timeout: 265 seconds]
12:44:20CatBatHat joins
12:55:10jacobk joins
13:29:23CatBatHat quits [Ping timeout: 265 seconds]
13:32:46jacobk quits [Ping timeout: 265 seconds]
13:41:26LeGoupil joins
13:46:01binzyboi quits [Quit: Leaving]
14:08:24Arcorann quits [Ping timeout: 265 seconds]
14:25:23LeGoupil quits [Client Quit]
14:41:17<@arkiver>AK: let's take that to #//
14:48:39march_happy quits [Ping timeout: 265 seconds]
14:49:25march_happy (march_happy) joins
15:01:12Megame quits [Client Quit]
15:08:28march_happy quits [Ping timeout: 265 seconds]
15:08:42march_happy (march_happy) joins
15:17:01jacobk joins
15:30:21JackThompson joins
15:39:45JackThompson quits [Client Quit]
15:46:33JackThompson joins
15:54:23rsn quits [Ping timeout: 265 seconds]
16:05:10qwertyasdfuiopghjkl joins
17:33:05swety-lis joins
17:33:22swety-lis quits [Remote host closed the connection]
17:47:18AnotherIki joins
17:50:52Iki1 quits [Ping timeout: 265 seconds]
18:15:31march_happy quits [Ping timeout: 265 seconds]
18:26:59Church quits [Ping timeout: 265 seconds]
18:42:06jacobk quits [Ping timeout: 265 seconds]
18:44:55march_happy (march_happy) joins
18:46:20Church (Church) joins
18:47:47Megame (Megame) joins
19:07:35march_happy quits [Ping timeout: 265 seconds]
19:10:00march_happy (march_happy) joins
19:14:12rsn joins
19:23:10LeGoupil joins
19:46:15thetechrobo_ joins
19:49:46TheTechRobo quits [Ping timeout: 265 seconds]
20:26:19thetechrobo_ is now known as TheTechRobo
20:50:33jacobk joins
21:06:00yyyy joins
21:06:20yyyy quits [Remote host closed the connection]
21:08:45spirit quits [Client Quit]
21:22:42Megame quits [Client Quit]
21:33:06eroc1990 quits [Client Quit]
21:39:20eroc1990 (eroc1990) joins
21:51:35LeGoupil quits [Client Quit]
21:58:31BlueMaxima joins
22:02:20eroc1990 quits [Client Quit]
22:23:27eroc1990 (eroc1990) joins
22:37:15datechnoman quits [Client Quit]
22:37:38datechnoman (datechnoman) joins
22:43:11Minkafighter quits [Quit: The Lounge - https://thelounge.chat]
22:43:31Minkafighter joins
22:51:33Arcorann (Arcorann) joins
23:19:03geezabiscuit quits [Ping timeout: 265 seconds]
23:19:12drin joins
23:19:46drin is now known as geezabiscuit
23:22:33binzyboi joins
23:29:41march_happy quits [Ping timeout: 265 seconds]