00:09:11jtagcat quits [Quit: Bye!]
00:09:42jtagcat (jtagcat) joins
00:12:03Gooshka joins
00:14:42<Gooshka>pokechu22: Can files from this https://archive.org/details/lurkmore-2018 and this https://archive.org/details/lurkmore.to-warc items be useful for Wikiteam as they are the archives of one wiki (lurkmore.to and now lurkmore.wtf)?
00:15:03<Gooshka>If so they can be moved to wikiteam collection
00:15:16Gooshka quits [Remote host closed the connection]
00:16:14<pokechu22>Those probably don't have the information needed to actually create a wikiteam-style dump from them so they probably don't belong in the wikiteam collection. One is a warc that could theoretically be used on web.archive.org (but I don't know if it is currently there), the other is a CHM file (which is a very odd choice)
00:16:30<pokechu22>that's not to say that they're not useful files of course, just not useful for restoring the wiki into a new instance
00:20:14<@JAA>WARCZone = not in the WBM
00:23:29<pokechu22>I thought some stuff in warczone was in the WBM, it just depended?
00:23:48<pokechu22>That's what it says at https://archive.org/details/warczone?tab=about
00:29:35<@JAA>I guess there might be exceptions, but usually, it shouldn't be in the WBM.
05:28:16jtagcat quits [Killed (ing.hackint.org (Nickname regained by services))]
05:28:23jtagcat (jtagcat) joins
05:29:32OrIdow6^2 (OrIdow6) joins
05:29:32@ChanServ sets mode: +o OrIdow6^2
05:32:03@OrIdow6 quits [Excess Flood]
05:32:03qwertyasdfuiopghjkl quits [Client Quit]
05:39:17tzt_ (tzt) joins
05:39:22tzt quits [Remote host closed the connection]
05:40:12qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
05:43:01qwertyasdfuiopghjkl quits [Remote host closed the connection]
05:43:40qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
06:11:34qwertyasdfuiopghjkl quits [Client Quit]
06:21:38DiscantX joins
07:59:11hitgrr8 joins
08:05:59<pabs>http://www.oyranos.org/wiki/
10:33:02qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
12:03:45HackMii (hacktheplanet) joins
13:20:01systwi quits [Ping timeout: 252 seconds]
13:24:15systwi (systwi) joins
14:57:59<michaelblob_>pabs: i'm having trouble archiving oyranos.org/wiki but it looks like it's pulling (at least partially) from openicc.info?
14:58:41<pabs>plausible, they would be related
15:15:47qwertyasdfuiopghjkl quits [Client Quit]
15:16:21qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
16:43:45<pokechu22>http://www.oyranos.org/wiki/Special:Version - looks like it's not really a wiki anymore
16:44:35<pokechu22>http://www.oyranos.org/wiki/index.php%3Ftitle=ColourWiki:About.html is also dubious, though I did see another wiki that had a similarly cursed set of URLs
17:05:31kdqep__ quits [Ping timeout: 252 seconds]
17:13:34qwertyasdfuiopghjkl quits [Client Quit]
17:19:11hitgrr8_ joins
17:22:40hitgrr8 quits [Remote host closed the connection]
17:33:21qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
20:00:22kdqep (kdqep) joins
21:11:08qwertyasdfuiopghjkl quits [Client Quit]
21:12:01qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
22:04:42Sanqui_ is now known as Sanqui
22:04:42Sanqui quits [Changing host]
22:04:42Sanqui (Sanqui) joins
22:04:42@ChanServ sets mode: +o Sanqui
22:43:25tzt_ is now known as tzt
23:01:15<pabs>when AB archiving mediawikis, do you usually ignore old revisions of pages?
23:02:00<@JAA>I wouldn't say 'usually', but if a wiki is slow or broken or strongly rate-limited, we sometimes do that, yes.
23:03:54<pabs>the iphonewiki is up to 677GB and still going on old pages, so I was thinking about it
23:04:20pabs wonders if AB can deprioritise URLs on a domain
23:04:40<schwarzkatz|m>holy cow, that's a lot
23:05:35<pabs>might be things it links to, .ipa files and such
23:06:00<@JAA>Yes, most of it is that, and we ignored a bunch of Apple CDN URLs already.
23:06:22<@JAA>And no, (de)prioritisation is not a thing yet.
23:07:29<schwarzkatz|m>makes sense. can you see the size of the wiki without outlinks too, by chance?
23:07:48<@JAA>Negative
23:08:14<@JAA>*Maybe* it can be grepped from the log file, but that's painful. Otherwise, it would require parsing the WARCs.
23:08:21<@JAA>Or the CDXs.
23:08:28<schwarzkatz|m>a thingie for the TODO list then? :D
23:09:29<@JAA>Eh, kind of. Structured logging would be nice.
23:10:16<@JAA>But the actual issue is that wpull only logs the Content-Length header, not the actual data size retrieved.
23:10:38<@JAA>I guess most of those Apple CDN responses would have the header, but in the general case, it's often enough missing due to Transfer-Encoding.
23:10:53<pokechu22>pabs: most of that 600GB is from external links
23:11:00<pokechu22>oh, already noted
23:15:37<@JAA>(→ -dev for the wpull stuff)
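[Editor's note] The idea above of getting a wiki-only size from the CDXs could be sketched roughly as follows. This is a minimal illustration, not how ArchiveBot does it. It assumes the CDX file starts with a space-separated header legend such as ` CDX N b a m s k r M S V g`, where `a` is the original URL and `S` is the compressed record size. The function name `sizes_by_host` is hypothetical. Note that `S` counts compressed WARC record bytes, so the totals only approximate how much data was retrieved.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def sizes_by_host(cdx_lines):
    """Sum the CDX 'S' field (compressed record size) per host of the 'a' field (original URL)."""
    lines = iter(cdx_lines)
    header = next(lines).split()
    # Assumption: the first line is a legend like " CDX N b a m s k r M S V g".
    if not header or header[0] != "CDX":
        raise ValueError("missing CDX header line")
    fields = header[1:]
    url_col = fields.index("a")
    size_col = fields.index("S")

    totals = defaultdict(int)
    for line in lines:
        parts = line.split()
        if len(parts) != len(fields):
            continue  # skip lines that do not match the legend
        size = parts[size_col]
        if not size.isdigit():
            continue  # size may be "-" when unknown
        host = urlsplit(parts[url_col]).hostname or "(unknown)"
        totals[host] += int(size)
    return dict(totals)
```

Comparing the total for the wiki's own host with the totals for all other hosts would give a rough split between the wiki and its outlinks.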
23:16:21qwertyasdfuiopghjkl quits [Client Quit]
23:19:25<pokechu22>That said, https://github.com/ArchiveTeam/ArchiveBot/blob/master/db/ignore_patterns/nomediawikihistory.json exists if you need it - it's just probably not needed in that case
23:45:17qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins