00:09:11 | | jtagcat quits [Quit: Bye!] |
00:09:42 | | jtagcat (jtagcat) joins |
00:12:03 | | Gooshka joins |
00:14:42 | <Gooshka> | pokechu22: Can files from this https://archive.org/details/lurkmore-2018 and this https://archive.org/details/lurkmore.to-warc items be useful for Wikiteam as they are the archives of one wiki (lurkmore.to and now lurkmore.wtf)? |
00:15:03 | <Gooshka> | If so they can be moved to wikiteam collection |
00:15:16 | | Gooshka quits [Remote host closed the connection] |
00:16:14 | <pokechu22> | Those probably don't have the information needed to actually create a wikiteam-style dump from them so they probably don't belong in the wikiteam collection. One is a warc that could theoretically be used on web.archive.org (but I don't know if it is currently there), the other is a CHM file (which is a very odd choice) |
00:16:30 | <pokechu22> | that's not to say that they're not useful files of course, just not useful for restoring the wiki into a new instance |
00:20:14 | <@JAA> | WARCZone = not in the WBM |
00:23:29 | <pokechu22> | I thought some stuff in warczone was in the WBM, it just depended? |
00:23:48 | <pokechu22> | That's what it says at https://archive.org/details/warczone?tab=about |
00:29:35 | <@JAA> | I guess there might be exceptions, but usually, it shouldn't be in the WBM. |
05:28:16 | | jtagcat is now authenticated as * |
05:28:16 | | jtagcat quits [Killed (ing.hackint.org (Nickname regained by services))] |
05:28:23 | | jtagcat (jtagcat) joins |
05:29:32 | | OrIdow6^2 (OrIdow6) joins |
05:29:32 | | @ChanServ sets mode: +o OrIdow6^2 |
05:32:03 | | @OrIdow6 quits [Excess Flood] |
05:32:03 | | qwertyasdfuiopghjkl quits [Client Quit] |
05:39:17 | | tzt_ (tzt) joins |
05:39:22 | | tzt quits [Remote host closed the connection] |
05:40:12 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
05:43:01 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
05:43:40 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
06:11:34 | | qwertyasdfuiopghjkl quits [Client Quit] |
06:21:38 | | DiscantX joins |
07:59:11 | | hitgrr8 joins |
08:05:59 | <pabs> | http://www.oyranos.org/wiki/ |
10:33:02 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
12:03:45 | | HackMii (hacktheplanet) joins |
13:20:01 | | systwi quits [Ping timeout: 252 seconds] |
13:24:15 | | systwi (systwi) joins |
14:57:59 | <michaelblob_> | pabs: i'm having trouble archiving oyranos.org/wiki but it looks like it's pulling (at least partially) from openicc.info? |
14:58:41 | <pabs> | plausible, they would be related |
15:15:47 | | qwertyasdfuiopghjkl quits [Client Quit] |
15:16:21 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
16:43:45 | <pokechu22> | http://www.oyranos.org/wiki/Special:Version - looks like it's not really a wiki anymore |
16:44:35 | <pokechu22> | http://www.oyranos.org/wiki/index.php%3Ftitle=ColourWiki:About.html is also dubious, though I did see another wiki that had a similarly cursed set of URLs |
17:05:31 | | kdqep__ quits [Ping timeout: 252 seconds] |
17:13:34 | | qwertyasdfuiopghjkl quits [Client Quit] |
17:19:11 | | hitgrr8_ joins |
17:22:40 | | hitgrr8 quits [Remote host closed the connection] |
17:33:21 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
20:00:22 | | kdqep (kdqep) joins |
21:11:08 | | qwertyasdfuiopghjkl quits [Client Quit] |
21:12:01 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
22:04:42 | | Sanqui_ is now known as Sanqui |
22:04:42 | | Sanqui is now authenticated as Sanqui |
22:04:42 | | Sanqui quits [Changing host] |
22:04:42 | | Sanqui (Sanqui) joins |
22:04:42 | | @ChanServ sets mode: +o Sanqui |
22:43:25 | | tzt_ is now known as tzt |
23:01:15 | <pabs> | when AB archiving mediawikis, do you usually ignore old revisions of pages? |
23:02:00 | <@JAA> | I wouldn't say 'usually', but if a wiki is slow or broken or strongly rate-limited, we sometimes do that, yes. |
23:03:54 | <pabs> | the iphonewiki is up to 677GB and still going on old pages, so I was thinking about it |
23:04:20 | | pabs wonders if AB can deprioritise URLs on a domain |
23:04:40 | <schwarzkatz|m> | holy cow, that's a lot |
23:05:35 | <pabs> | might be things it links to, .ipa files and such |
23:06:00 | <@JAA> | Yes, most of it is that, and we ignored a bunch of Apple CDN URLs already. |
23:06:22 | <@JAA> | And no, (de)prioritisation is not a thing yet. |
23:07:29 | <schwarzkatz|m> | makes sense. can you see the size of the wiki without outlinks too, by chance? |
23:07:48 | <@JAA> | Negative |
23:08:14 | <@JAA> | *Maybe* it can be grepped from the log file, but that's painful. Otherwise, it would require parsing the WARCs. |
23:08:21 | <@JAA> | Or the CDXs. |
23:08:28 | <schwarzkatz|m> | a thingie for the TODO list then? :D |
23:09:29 | <@JAA> | Eh, kind of. Structured logging would be nice. |
23:10:16 | <@JAA> | But the actual issue is that wpull only logs the Content-Length header, not the actual data size retrieved. |
23:10:38 | <@JAA> | I guess most of those Apple CDN responses would have the header, but in the general case, it's often enough missing due to Transfer-Encoding. |
23:10:53 | <pokechu22> | pabs: most of that 600GB is from external links |
23:11:00 | <pokechu22> | oh, already noted |
23:15:37 | <@JAA> | (→ -dev for the wpull stuff) |
23:16:21 | | qwertyasdfuiopghjkl quits [Client Quit] |
23:19:25 | <pokechu22> | That said, https://github.com/ArchiveTeam/ArchiveBot/blob/master/db/ignore_patterns/nomediawikihistory.json exists if you need it - it's just probably not needed in that case |
23:45:17 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |