| 00:09:11 | | jtagcat quits [Quit: Bye!] |
| 00:09:42 | | jtagcat (jtagcat) joins |
| 00:12:03 | | Gooshka joins |
| 00:14:42 | <Gooshka> | pokechu22: Can files from this https://archive.org/details/lurkmore-2018 and this https://archive.org/details/lurkmore.to-warc items be useful for Wikiteam as they are the archives of one wiki (lurkmore.to and now lurkmore.wtf)? |
| 00:15:03 | <Gooshka> | If so they can be moved to wikiteam collection |
| 00:15:16 | | Gooshka quits [Remote host closed the connection] |
| 00:16:14 | <pokechu22> | Those probably don't have the information needed to actually create a wikiteam-style dump from them so they probably don't belong in the wikiteam collection. One is a warc that could theoretically be used on web.archive.org (but I don't know if it is currently there), the other is a CHM file (which is a very odd choice) |
| 00:16:30 | <pokechu22> | that's not to say that they're not useful files of course, just not useful for restoring the wiki into a new instance |
| 00:20:14 | <@JAA> | WARCZone = not in the WBM |
| 00:23:29 | <pokechu22> | I thought some stuff in warczone was in the WBM, it just depended? |
| 00:23:48 | <pokechu22> | That's what it says at https://archive.org/details/warczone?tab=about |
| 00:29:35 | <@JAA> | I guess there might be exceptions, but usually, it shouldn't be in the WBM. |
| 05:28:16 | | jtagcat is now authenticated as * |
| 05:28:16 | | jtagcat quits [Killed (ing.hackint.org (Nickname regained by services))] |
| 05:28:23 | | jtagcat (jtagcat) joins |
| 05:29:32 | | OrIdow6^2 (OrIdow6) joins |
| 05:29:32 | | @ChanServ sets mode: +o OrIdow6^2 |
| 05:32:03 | | @OrIdow6 quits [Excess Flood] |
| 05:32:03 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 05:39:17 | | tzt_ (tzt) joins |
| 05:39:22 | | tzt quits [Remote host closed the connection] |
| 05:40:12 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
| 05:43:01 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
| 05:43:40 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
| 06:11:34 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 06:21:38 | | DiscantX joins |
| 07:59:11 | | hitgrr8 joins |
| 08:05:59 | <pabs> | http://www.oyranos.org/wiki/ |
| 10:33:02 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
| 12:03:45 | | HackMii (hacktheplanet) joins |
| 13:20:01 | | systwi quits [Ping timeout: 252 seconds] |
| 13:24:15 | | systwi (systwi) joins |
| 14:57:59 | <michaelblob_> | pabs: i'm having trouble archiving oyranos.org/wiki but it looks like it's pulling (at least partially) from openicc.info? |
| 14:58:41 | <pabs> | plausible, they would be related |
| 15:15:47 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 15:16:21 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
| 16:43:45 | <pokechu22> | http://www.oyranos.org/wiki/Special:Version - looks like it's not really a wiki anymore |
| 16:44:35 | <pokechu22> | http://www.oyranos.org/wiki/index.php%3Ftitle=ColourWiki:About.html is also dubious, though I did see another wiki that had a similarly cursed set of URLs |
| 17:05:31 | | kdqep__ quits [Ping timeout: 252 seconds] |
| 17:13:34 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 17:19:11 | | hitgrr8_ joins |
| 17:22:40 | | hitgrr8 quits [Remote host closed the connection] |
| 17:33:21 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
| 20:00:22 | | kdqep (kdqep) joins |
| 21:11:08 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 21:12:01 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
| 22:04:42 | | Sanqui_ is now known as Sanqui |
| 22:04:42 | | Sanqui is now authenticated as Sanqui |
| 22:04:42 | | Sanqui quits [Changing host] |
| 22:04:42 | | Sanqui (Sanqui) joins |
| 22:04:42 | | @ChanServ sets mode: +o Sanqui |
| 22:43:25 | | tzt_ is now known as tzt |
| 23:01:15 | <pabs> | when AB archiving mediawikis, do you usually ignore old revisions of pages? |
| 23:02:00 | <@JAA> | I wouldn't say 'usually', but if a wiki is slow or broken or strongly rate-limited, we sometimes do that, yes. |
| 23:03:54 | <pabs> | the iphonewiki is up to 677GB and still going on old pages, so I was thinking about it |
| 23:04:20 | | pabs wonders if AB can deprioritise URLs on a domain |
| 23:04:40 | <schwarzkatz|m> | holy cow, that's a lot |
| 23:05:35 | <pabs> | might be things it links to, .ipa files and such |
| 23:06:00 | <@JAA> | Yes, most of it is that, and we ignored a bunch of Apple CDN URLs already. |
| 23:06:22 | <@JAA> | And no, (de)prioritisation is not a thing yet. |
| 23:07:29 | <schwarzkatz|m> | makes sense. can you see the size of the wiki without outlinks too, by chance? |
| 23:07:48 | <@JAA> | Negative |
| 23:08:14 | <@JAA> | *Maybe* it can be grepped from the log file, but that's painful. Otherwise, it would require parsing the WARCs. |
| 23:08:21 | <@JAA> | Or the CDXs. |
| 23:08:28 | <schwarzkatz|m> | a thingie for the TODO list then? :D |
| 23:09:29 | <@JAA> | Eh, kind of. Structured logging would be nice. |
| 23:10:16 | <@JAA> | But the actual issue is that wpull only logs the Content-Length header, not the actual data size retrieved. |
| 23:10:38 | <@JAA> | I guess most of those Apple CDN responses would have the header, but in the general case, it's often enough missing due to Transfer-Encoding. |
| 23:10:53 | <pokechu22> | pabs: most of that 600GB is from external links |
| 23:11:00 | <pokechu22> | oh, already noted |
| 23:15:37 | <@JAA> | (→ -dev for the wpull stuff) |
| 23:16:21 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 23:19:25 | <pokechu22> | That said, https://github.com/ArchiveTeam/ArchiveBot/blob/master/db/ignore_patterns/nomediawikihistory.json exists if you need it - it's just probably not needed in that case |
| 23:45:17 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
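
Note on the 23:08 suggestion that the wiki-only size could be derived from the CDXs: a minimal sketch of that idea follows. It is not an ArchiveBot or wpull feature. It assumes the common 11-field, space-separated CDX layout (`N b a m s k r M S V g`), where the third field is the original URL and the ninth is the compressed record size. The function names and the hostnames in the usage note are hypothetical. The totals are compressed WARC record sizes, not the payload sizes discussed at 23:10.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def sizes_by_host(cdx_lines):
    """Sum the compressed record size per hostname from CDX lines.

    Assumes the 11-field layout "N b a m s k r M S V g"; lines that do not
    match (including the " CDX ..." header line) are skipped.
    """
    totals = defaultdict(int)
    for line in cdx_lines:
        fields = line.strip().split(" ")
        if len(fields) != 11:
            continue  # header, blank line, or a different CDX layout
        url, size = fields[2], fields[8]
        if not size.isdigit():
            continue  # size recorded as "-" or otherwise unusable
        host = urlsplit(url).hostname or ""
        totals[host] += int(size)
    return dict(totals)


def split_wiki_vs_outlinks(totals, wiki_host):
    """Return (wiki_bytes, outlink_bytes) given per-host totals."""
    wiki = sum(
        size
        for host, size in totals.items()
        if host == wiki_host or host.endswith("." + wiki_host)
    )
    return wiki, sum(totals.values()) - wiki
```

For example, `split_wiki_vs_outlinks(sizes_by_host(open("job.cdx")), "wiki.example.org")` would return the bytes attributed to the wiki host and the bytes attributed to everything else, where `job.cdx` and `wiki.example.org` are placeholders.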