#wikiteam log for 2023-03-25


00:09:11		jtagcat quits [Quit: Bye!]
00:09:42		jtagcat (jtagcat) joins
00:12:03		Gooshka joins
00:14:42	<Gooshka>	pokechu22: Can files from this https://archive.org/details/lurkmore-2018 and this https://archive.org/details/lurkmore.to-warc items be useful for Wikiteam as they are the archives of one wiki (lurkmore.to and now lurkmore.wtf)?
00:15:03	<Gooshka>	If so they can be moved to wikiteam collection
00:15:16		Gooshka quits [Remote host closed the connection]
00:16:14	<pokechu22>	Those probably don't have the information needed to actually create a wikiteam-style dump from them so they probably don't belong in the wikiteam collection. One is a warc that could theoretically be used on web.archive.org (but I don't know if it is currently there), the other is a CHM file (which is a very odd choice)
00:16:30	<pokechu22>	that's not to say that they're not useful files of course, just not useful for restoring the wiki into a new instance
00:20:14	<@JAA>	WARCZone = not in the WBM
00:23:29	<pokechu22>	I thought some stuff in warczone was in the WBM, it just depended?
00:23:48	<pokechu22>	That's what it says at https://archive.org/details/warczone?tab=about
00:29:35	<@JAA>	I guess there might be exceptions, but usually, it shouldn't be in the WBM.
05:28:16		jtagcat is now authenticated as *
05:28:16		jtagcat quits [Killed (ing.hackint.org (Nickname regained by services))]
05:28:23		jtagcat (jtagcat) joins
05:29:32		OrIdow6^2 (OrIdow6) joins
05:29:32		@ChanServ sets mode: +o OrIdow6^2
05:32:03		@OrIdow6 quits [Excess Flood]
05:32:03		qwertyasdfuiopghjkl quits [Client Quit]
05:39:17		tzt_ (tzt) joins
05:39:22		tzt quits [Remote host closed the connection]
05:40:12		qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
05:43:01		qwertyasdfuiopghjkl quits [Remote host closed the connection]
05:43:40		qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
06:11:34		qwertyasdfuiopghjkl quits [Client Quit]
06:21:38		DiscantX joins
07:59:11		hitgrr8 joins
08:05:59	<pabs>	http://www.oyranos.org/wiki/
10:33:02		qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
12:03:45		HackMii (hacktheplanet) joins
13:20:01		systwi quits [Ping timeout: 252 seconds]
13:24:15		systwi (systwi) joins
14:57:59	<michaelblob_>	pabs: i'm having trouble archiving oyranos.org/wiki but it looks like it's pulling (at least partially) from openicc.info?
14:58:41	<pabs>	plausible, they would be related
15:15:47		qwertyasdfuiopghjkl quits [Client Quit]
15:16:21		qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
16:43:45	<pokechu22>	http://www.oyranos.org/wiki/Special:Version - looks like it's not really a wiki anymore
16:44:35	<pokechu22>	http://www.oyranos.org/wiki/index.php%3Ftitle=ColourWiki:About.html is also dubious, though I did see another wiki that had a similarly cursed set of URLs
17:05:31		kdqep__ quits [Ping timeout: 252 seconds]
17:13:34		qwertyasdfuiopghjkl quits [Client Quit]
17:19:11		hitgrr8_ joins
17:22:40		hitgrr8 quits [Remote host closed the connection]
17:33:21		qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
20:00:22		kdqep (kdqep) joins
21:11:08		qwertyasdfuiopghjkl quits [Client Quit]
21:12:01		qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
22:04:42		Sanqui_ is now known as Sanqui
22:04:42		Sanqui is now authenticated as Sanqui
22:04:42		Sanqui quits [Changing host]
22:04:42		Sanqui (Sanqui) joins
22:04:42		@ChanServ sets mode: +o Sanqui
22:43:25		tzt_ is now known as tzt
23:01:15	<pabs>	when AB archiving mediawikis, do you usually ignore old revisions of pages?
23:02:00	<@JAA>	I wouldn't say 'usually', but if a wiki is slow or broken or strongly rate-limited, we sometimes do that, yes.
23:03:54	<pabs>	the iphonewiki is up to 677GB and still going on old pages, so I was thinking about it
23:04:20		pabs wonders if AB can deprioritise URLs on a domain
23:04:40	<schwarzkatz|m>	holy cow, that's a lot
23:05:35	<pabs>	might be things it links to, .ipa files and such
23:06:00	<@JAA>	Yes, most of it is that, and we ignored a bunch of Apple CDN URLs already.
23:06:22	<@JAA>	And no, (de)prioritisation is not a thing yet.
23:07:29	<schwarzkatz|m>	makes sense. can you see the size of the wiki without outlinks too, by chance?
23:07:48	<@JAA>	Negative
23:08:14	<@JAA>	Maybe it can be grepped from the log file, but that's painful. Otherwise, it would require parsing the WARCs.
23:08:21	<@JAA>	Or the CDXs.
23:08:28	<schwarzkatz|m>	a thingie for the TODO list then? :D
23:09:29	<@JAA>	Eh, kind of. Structured logging would be nice.
23:10:16	<@JAA>	But the actual issue is that wpull only logs the Content-Length header, not the actual data size retrieved.
23:10:38	<@JAA>	I guess most of those Apple CDN responses would have the header, but in the general case, it's often enough missing due to Transfer-Encoding.
23:10:53	<pokechu22>	pabs: most of that 600GB is from external links
23:11:00	<pokechu22>	oh, already noted
23:15:37	<@JAA>	(→ -dev for the wpull stuff)
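
The CDX route mentioned above could look roughly like the sketch below. It assumes the common space-separated CDX layout with a ` CDX N b a m s k r M S V g` header line, where `a` is the original URL and `S` is the compressed record size. The function name `bytes_per_host` and the sample lines are made up for illustration and are not part of ArchiveBot or wpull.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical sample data; real CDX files come from the job's WARCs.
SAMPLE = """\
 CDX N b a m s k r M S V g
com,example)/wiki/main_page 20230325000000 https://example.com/wiki/Main_Page text/html 200 AAAA - - 1200 0 a.warc.gz
com,example)/wiki/foo 20230325000001 https://example.com/wiki/Foo text/html 200 BBBB - - 800 1200 a.warc.gz
net,example,cdn)/app.ipa 20230325000002 https://cdn.example.net/app.ipa application/octet-stream 200 CCCC - - 500000 2000 a.warc.gz
"""


def bytes_per_host(cdx_lines):
    """Sum the CDX 'S' field (compressed record size) per host."""
    totals = defaultdict(int)
    fields = None
    for line in cdx_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("CDX "):
            # Header line: the letters after "CDX" name the columns.
            fields = line.split()[1:]
            continue
        if fields is None:
            continue  # no header seen yet, column order unknown
        parts = line.split()
        if len(parts) != len(fields):
            continue  # malformed or differently shaped line
        record = dict(zip(fields, parts))
        size = record.get("S", "-")
        if not size.isdigit():
            continue  # size missing ("-") or not a number
        host = urlsplit(record.get("a", "")).hostname or "unknown"
        totals[host] += int(size)
    return dict(totals)


if __name__ == "__main__":
    for host, size in sorted(bytes_per_host(SAMPLE.splitlines()).items()):
        print(host, size)
```

Note that `S` counts compressed WARC record bytes, including headers, so it only approximates the amount of data retrieved per host. It does avoid depending on a Content-Length header being present.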
23:16:21		qwertyasdfuiopghjkl quits [Client Quit]
23:19:25	<pokechu22>	That said, https://github.com/ArchiveTeam/ArchiveBot/blob/master/db/ignore_patterns/nomediawikihistory.json exists if you need it - it's just probably not needed in that case
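
As a rough illustration of what skipping old revisions means at the URL level, the sketch below flags MediaWiki URLs that carry the standard `oldid`, `diff`, or `action=history` query parameters. This is an assumption-based stand-in, not the contents of the nomediawikihistory ignore set linked above, and `is_history_url` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlsplit


def is_history_url(url):
    """Return True if a MediaWiki URL points at an old revision, a diff, or a history page."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    # oldid= selects a specific past revision; diff= compares revisions.
    if "oldid" in query or "diff" in query:
        return True
    # action=history is the page history listing.
    return query.get("action", [""])[0] == "history"


if __name__ == "__main__":
    # Hypothetical example URLs on a placeholder domain.
    for url in (
        "https://wiki.example/index.php?title=Foo&oldid=123",
        "https://wiki.example/index.php?title=Foo&action=history",
        "https://wiki.example/wiki/Foo",
    ):
        print(url, is_history_url(url))
```

A real ignore set is expressed as regular expressions over the full URL, so a query-parameter check like this is only a simplified view of the same idea.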
23:45:17		qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
