00:45:29 | <pabs> | a DokuWiki https://wiki.linuxfoundation.org/ |
00:45:44 | <pabs> | also, the page links to a bunch of other Linux Foundation related wikis |
02:31:08 | | systwi__ (systwi) joins |
02:32:58 | | systwi quits [Ping timeout: 265 seconds] |
02:34:12 | | rktk (rktk) joins |
02:51:16 | <rktk> | Eh so I'm sure someone noticed Miraheze is shutting down now |
02:51:17 | <rktk> | right? |
02:51:47 | <rktk> | if not, https://meta.miraheze.org/wiki/Board/Policies/20230615-Statement, and specifically important note: |
02:51:50 | <rktk> | >We intend to wind down operations in September, and will no longer be providing wikis by October 1st. However, the domain will remain in operation for as long as funds allow to provide users access to full dumps of their wikis. |
02:52:10 | <rktk> | "provide users access" most likely means the *admins* of the wikis get access to full dumps; people like us would have to hit the API instead |
03:09:44 | <pabs> | yep. there are also former Miraheze folks trying to rejoin and revive the project https://news.ycombinator.com/item?id=36362547 |
03:10:31 | <pabs> | but either way there are plans to save the wikis in both dump form and to the WBM |
03:11:00 | | systwi__ is now known as systwi |
03:22:02 | | dan- joins |
03:28:38 | <rktk> | sweet :) |
03:29:20 | | dan- is now authenticated as dan- |
03:36:50 | | BigBrain quits [Remote host closed the connection] |
03:37:07 | | BigBrain (bigbrain) joins |
04:58:07 | <Naruyoko> | Found the "WikiTide" and "Telepedia" projects mentioned in the abovementioned thread; I wonder how long they are going to last? |
05:24:31 | | hitgrr8 joins |
08:22:19 | <Nemo_bis> | Naruyoko: Hard to tell, they're still being formed. |
08:28:25 | <Nemo_bis> | pokechu22: Yes, not sure what I was thinking there. https://github.com/WikiTeam/wikiteam/issues/467#issuecomment-1596031036 |
08:28:39 | <Nemo_bis> | arkiver: Actually I just finished dumping all wikis so do what you please. :) |
09:17:41 | | @Sanqui quits [Read error: Connection reset by peer] |
09:18:09 | | Sanqui joins |
09:18:11 | | Sanqui is now authenticated as Sanqui |
09:18:11 | | Sanqui quits [Changing host] |
09:18:11 | | Sanqui (Sanqui) joins |
09:18:11 | | @ChanServ sets mode: +o Sanqui |
10:01:02 | | Matthww1 quits [Read error: Connection reset by peer] |
10:02:00 | | Matthww1 joins |
11:28:40 | <@arkiver> | Nemo_bis: woah that was fast! |
11:40:23 | | sepro quits [Ping timeout: 252 seconds] |
13:38:21 | | sepro (sepro) joins |
16:57:20 | | that_lurker quits [Client Quit] |
16:57:44 | | that_lurker (that_lurker) joins |
17:17:48 | | BigBrain quits [Remote host closed the connection] |
17:18:05 | | BigBrain (bigbrain) joins |
17:19:48 | <pokechu22> | Nemo_bis: Dumping https://fiction.miraheze.org/ finished just now; I had to resume it 3 times. I dumped with --xml but not --xmlrevisions, which does mean I'm able to resume (--xmlrevisions only allows restarting the whole namespace IIRC) |
17:27:35 | | @rewby quits [Ping timeout: 258 seconds] |
17:33:01 | | pino_p joins |
17:33:17 | <pino_p> | Good news: Wiki host Miraheze isn't shutting down after all https://meta.miraheze.org/wiki/Miraheze_is_Not_Shutting_Down |
17:34:55 | | rewby (rewby) joins |
17:34:55 | | @ChanServ sets mode: +o rewby |
17:40:10 | <Nemo_bis> | pokechu22: thanks, good news. Any insight on what might have caused the failures? Just temporary overload? |
17:41:07 | <pokechu22> | Yeah, probably. I got 2 failures from 502 and 1 failure from 503 but was able to resume after each one just fine |
17:43:20 | <pokechu22> | Lots of random read time outs before then (but only the first 502 had a read time out before it on the same page) |
17:46:26 | <yzqzss|m> | pokechu22: MW-Scraper saves the arvcontinue value as an attribute of the <page> element in the XML dump. That way, we get resumable --xmlrevisions. 🫠|
17:46:51 | <pokechu22> | Yeah, that seems like a good idea |
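A minimal sketch of the resumability trick described above: persist the allrevisions continuation token (arvcontinue) so a dump interrupted by a 502/503 can pick up where it left off. For brevity it writes the token to a sidecar file rather than onto the <page> element as MW-Scraper does; the wiki URL and file name are only examples, and this is not MW-Scraper's actual code.

```python
# Hedged sketch: resume a MediaWiki "allrevisions" crawl by persisting the
# API continuation token (arvcontinue). Names like CHECKPOINT are illustrative.
import json
import os
import requests

API = "https://fiction.miraheze.org/w/api.php"   # example wiki from the log
CHECKPOINT = "arvcontinue.json"                  # hypothetical sidecar file

def load_checkpoint():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f).get("arvcontinue")
    return None

def dump_revisions():
    params = {
        "action": "query",
        "list": "allrevisions",
        "arvprop": "ids|timestamp|content|sha1",
        "arvlimit": "50",
        "format": "json",
    }
    cont = load_checkpoint()
    if cont:
        params["arvcontinue"] = cont
    while True:
        r = requests.get(API, params=params, timeout=60)
        r.raise_for_status()
        data = r.json()
        for page in data["query"]["allrevisions"]:
            # ... append <page>/<revision> elements to the XML dump here,
            # tagging each <page> with the current arvcontinue ...
            pass
        if "continue" not in data:
            break
        params["arvcontinue"] = data["continue"]["arvcontinue"]
        # Persist the token so a crash (502/503, read timeout) is resumable.
        with open(CHECKPOINT, "w") as f:
            json.dump({"arvcontinue": params["arvcontinue"]}, f)

if __name__ == "__main__":
    dump_revisions()
```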
17:59:06 | | pino_p quits [Remote host closed the connection] |
18:13:26 | | eroc1990 quits [Quit: The Lounge - https://thelounge.chat] |
18:13:28 | <yzqzss|m> | Also, I would like to know whether the <revision> elements in the XML dumps generated by wikiteam (py2) using --xmlrevisions contain a <sha1> element? |
18:13:54 | <yzqzss|m> | * MediaWiki uses <sha1> for de-duplication when importing XML. If <sha1> is missing, MediaWiki doesn't know whether the corresponding revision has already been imported, which can lead to duplicate imports. |
18:15:03 | <yzqzss|m> | * This is common: when importing a large wiki you can easily be forced to abort the import because of module/configuration issues. When you fix the problem and re-import, the previously imported revisions get imported again, which is very annoying. |
18:15:17 | <yzqzss|m> | * For this reason, I also don't recommend using Special:Export to make XML dumps, since it doesn't output <sha1>. |
18:15:23 | | eroc1990 (eroc1990) joins |
18:16:34 | <pokechu22> | I think it attempts to replicate Special:Export, so it won't have that |
18:16:51 | <yzqzss|m> | * You may also see sha1 appearing as an attribute of <text>. But I asked a MediaWiki developer, and currently the import tool doesn't recognize attributes on <text>, only the separate <sha1> element. |
18:17:27 | <pokechu22> | As in <text sha1="1234">blabla</text>? |
18:17:54 | <pokechu22> | or <text><sha1>1234</sha1>blabla</text>? |
18:23:06 | <yzqzss|m> | <revision><text bytes="1234" sha1="1234" xml:space="preserve">...</text><sha1>1234</sha1></revision> |
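For reference, a rough sketch of emitting that structure with the separate <sha1> element the importer dedupes on. It assumes MediaWiki's usual convention of base-36-encoding the SHA-1 of the revision text, padded to 31 characters as seen in Wikimedia's own dumps; verify against a known-good dump before relying on it, since this is not wikiteam's actual code.

```python
# Hedged sketch: build the <sha1> element MediaWiki's importer dedupes on.
# Assumption: the value is the SHA-1 of the revision text re-encoded in
# base 36 and left-padded to 31 characters.
import hashlib
from xml.sax.saxutils import escape

def mw_sha1(text: str) -> str:
    """SHA-1 of the revision text as a 31-character base-36 string."""
    n = int(hashlib.sha1(text.encode("utf-8")).hexdigest(), 16)
    alphabet = "0123456789abcdefghijklmnopqrstuvwxyz"
    out = ""
    while n:
        n, rem = divmod(n, 36)
        out = alphabet[rem] + out
    return out.rjust(31, "0")

def revision_xml(text: str) -> str:
    sha1 = mw_sha1(text)
    nbytes = len(text.encode("utf-8"))
    return (
        "<revision>"
        '<text bytes="%d" sha1="%s" xml:space="preserve">%s</text>'
        "<sha1>%s</sha1>"
        "</revision>" % (nbytes, sha1, escape(text), sha1)
    )

if __name__ == "__main__":
    print(revision_xml("Hello, wiki!"))
```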
18:23:26 | <Nemo_bis> | As usual I'm confused how my upload speed to IA can possibly be capped at https://paste.debian.net/plain/1283386 |
18:23:32 | <Nemo_bis> | * capped at 1 MiB/s |
18:27:08 | <yzqzss|m> | https://www.mediawiki.org/xml/export-0.11.xsd |
18:28:13 | <Nemo_bis> | We do create sha1 elements in the usual way but whether they work or not is another matter https://github.com/WikiTeam/wikiteam/blob/5b0e98afe629d1b05d60fca0b776c8e588f13a29/dumpgenerator.py#L1102 |
18:28:26 | <Nemo_bis> | Test imports very welcome :) |
18:28:47 | <yzqzss|m> | :) |
18:34:38 | <yzqzss|m> | Oh, the <contentformat> and <minor> elements seem to be missing. I fixed it in MW-Scraper before, but forgot to backport it. |
18:34:45 | <yzqzss|m> | https://github.com/mediawiki-client-tools/mediawiki-scraper/pull/152/commits/049efc8b36bcd65ae33afcc3d27ca26e5a86c904 |
18:42:00 | <Nemo_bis> | Ah yes, that's good. |
18:42:25 | <Nemo_bis> | Nowadays Wikia's XML seems ok, maybe we don't even need to strip these sha1 codes any more https://github.com/WikiTeam/wikiteam/blob/5b0e98afe629d1b05d60fca0b776c8e588f13a29/dumpgenerator.py#L631 |
18:45:59 | | Megame (Megame) joins |
18:46:26 | | tzt quits [Remote host closed the connection] |
18:46:48 | | tzt (tzt) joins |
18:47:29 | <@JAA> | Nemo_bis: IA doesn't limit the uploads as far as I know, but throughput from far away has never been great (though there are sysctl things you can do to help it somewhat), and in the past couple weeks, IA seems rather busy. My own uploads have also not been going well, even with those sysctl tweaks. |
18:48:02 | | Barto (Barto) joins |
18:49:17 | <Nemo_bis> | JAA: Thanks for confirming. I forgot to apply my usual sysctl tweaks. IIRC they helped mostly in the case of high latency/jitter, which doesn't seem an issue right now. |
18:49:54 | <Nemo_bis> | For bigger uploads/downloads I usually create an ad hoc VM in DigitalOcean SFO DC, but these miraheze dumps are very small so it's just a minor nuisance. |
18:56:37 | <Nemo_bis> | Aah the old MobileMe times when I used to upload from a university room sitting almost next to MIX... I had forgotten https://wiki.archiveteam.org/index.php/User:Nemo_bis |
19:06:43 | <@JAA> | I've managed a couple dozen MB/s to IA before, but beyond that, only parallel connections seem to help. |
19:07:29 | <@JAA> | That was from a server that probably has a line of sight to IA, so that helps. |
19:16:22 | | tzt quits [Ping timeout: 265 seconds] |
19:23:12 | | tzt (tzt) joins |
19:27:17 | <Nemo_bis> | Yes, it's either parallelism or throwing money at someone with good SFMIX connections. I spent 80 $ for https://archive.org/details/wikia_dump_20200214 and it was worth it. |
19:29:43 | <Nemo_bis> | But something like https://archive.org/details/wikimediacommons?sort=-addeddate would cost a fortune to update nowadays, with both Linode and DigitalOcean enforcing egress limits. |
19:34:52 | <Nemo_bis> | I guess I could try https://ifog.ch/en/vps/vps-fremont |
19:51:13 | | nix78 joins |
19:52:50 | <@JAA> | Ah yeah, wikimediacommons is still somewhere on my todo list. |
19:53:20 | <@JAA> | I don't think it'd be too horrible. |
19:53:31 | <@JAA> | Compared to the DPoS projects, it's a drop in the bucket really. |
20:00:05 | <Nemo_bis> | If IA is ok with hosting some 350 TiB of files which are sometimes duplicated from IA itself, I'd probably try to coordinate with WMF SRE so that we upload directly from WMF peering to SFMIX |
20:07:41 | <Nemo_bis> | Though with all the millions WMF is spending on dubious projects, I've said before that we should just throw a few million at IA for the service instead |
20:08:45 | | andrew quits [Quit: ] |
20:13:08 | <@JAA> | Right, I was only thinking of new/updated files, not the whole backlog. |
20:13:13 | | BigBrain quits [Remote host closed the connection] |
20:13:33 | | BigBrain (bigbrain) joins |
20:13:45 | <@JAA> | It'd also be nice to have it as WARC under the proper URLs. But that'd need a fair bit of additional engineering. |
20:19:03 | <Nemo_bis> | The main advantage of going the WARC way would be to avoid duplicating URLs which are already archived by the wayback machine (as people would be able to use the same interface to download everything), but I'm not sure how much overlap there is. |
20:20:22 | <Nemo_bis> | I think the use case for the Commons mirror is "WMF goes rogue and we (people who relied on Commons) need to set up a new Commons elsewhere", which is never going to be a painless operation. |
20:20:56 | | taavi quits [Remote host closed the connection] |
20:21:07 | <Nemo_bis> | I missed Taavi was here :) |
20:21:44 | <Nemo_bis> | masterX244: You could start an image dump of https://wiki.creaturathegame.com/wiki/Special:MediaStatistics |
20:22:08 | <Nemo_bis> | With 650k images at an average of 1047 B each, I'm never going to be able to finish it one API call at a time |
20:22:32 | <masterX244> | where's the latest wikiteam tooling for that? |
20:22:40 | | taavi (taavi) joins |
20:23:01 | <Nemo_bis> | I'm just using the standard launcher.py and dumpgenerator.py on Debian 11 https://github.com/WikiTeam/wikiteam |
20:24:29 | | Megame quits [Client Quit] |
20:25:20 | <masterX244> | doing a --xml --images pull |
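Roughly, the --images pass first has to enumerate every file before downloading anything; a hedged sketch of doing that through list=allimages (dumpgenerator.py's real enumeration differs in detail, and the api.php path below is a guess):

```python
# Hedged sketch: enumerate all files on a MediaWiki via list=allimages,
# roughly what an --images dump needs before it can download anything.
# Not dumpgenerator.py's actual code; the API URL is an assumption.
import requests

API = "https://wiki.creaturathegame.com/w/api.php"  # wiki mentioned above

def iter_images(api=API):
    params = {
        "action": "query",
        "list": "allimages",
        "aiprop": "url|sha1|size",
        "ailimit": "500",
        "format": "json",
    }
    while True:
        data = requests.get(api, params=params, timeout=60).json()
        for img in data["query"]["allimages"]:
            yield img["name"], img["url"], img.get("sha1")
        if "continue" not in data:
            break
        params.update(data["continue"])

if __name__ == "__main__":
    for name, url, sha1 in iter_images():
        print(name, url)
```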
20:25:34 | <@JAA> | Deduping against the WBM is horrible and simply impossible at scale. But deduping within a separate 'let's archive all of Wikimedia Commons' WARC thing is possible with the right software. |
20:26:12 | <masterX244> | and as a regular user it's nearly impossible, no access to a big chunk of the WARCs |
20:26:59 | <@JAA> | Even as a special user, it's basically impossible. We tried. |
20:27:49 | <@JAA> | Maybe IA staff could make it happen, but I'm doubtful even about that. |
20:29:19 | <Nemo_bis> | What do you mean by deduping here? I'm thinking of things like DjVu and PDF files from IA re-hosted on Commons |
20:29:49 | <Nemo_bis> | masterX244: thanks. If you get many timeouts, please try commenting these lines which attempt to create the .desc files https://github.com/WikiTeam/wikiteam/blob/5b0e98afe629d1b05d60fca0b776c8e588f13a29/dumpgenerator.py#L1521-L1549 |
20:31:18 | <@JAA> | Right, I doubt that'll be possible at scale as well. I mean deduping identical files from Commons, including on regular reretrievals. |
20:34:21 | <Nemo_bis> | Ok. I don't think we have that many bit-identical duplicates. Only 360 mostly small files at the moment. https://commons.wikimedia.org/w/index.php?title=Special:ListDuplicatedFiles&limit=500&offset=0 |
20:36:20 | <@JAA> | That's lower than I expected, but makes sense that it's being monitored. |
20:36:27 | <Nemo_bis> | There are more duplicates across different wikis and across revision histories of files. WMF stopped caring about these long ago because they're deduplicated by the object storage backend anyway. |
20:36:36 | <@JAA> | Right |
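A hedged sketch of what that per-file dedup could look like: index payload digests and turn repeat fetches of an identical file into a pointer to the first copy, which real WARC tooling would express as a revisit record keyed on WARC-Payload-Digest. All names and URLs below are illustrative.

```python
# Hedged sketch of per-file dedup for a hypothetical "all of Commons as WARC"
# job: keep an index of payload digests; identical later fetches become
# references to the first stored copy instead of a second full body.
import hashlib
import sqlite3

db = sqlite3.connect("dedup-index.sqlite3")
db.execute("CREATE TABLE IF NOT EXISTS seen"
           " (digest TEXT PRIMARY KEY, first_url TEXT, first_date TEXT)")

def record_for(url: str, date: str, payload: bytes) -> dict:
    digest = "sha1:" + hashlib.sha1(payload).hexdigest()
    row = db.execute("SELECT first_url, first_date FROM seen WHERE digest = ?",
                     (digest,)).fetchone()
    if row:
        # Identical payload seen before: store only a pointer, not the bytes.
        return {"type": "revisit", "digest": digest,
                "refers_to": row[0], "refers_to_date": row[1]}
    db.execute("INSERT INTO seen VALUES (?, ?, ?)", (digest, url, date))
    db.commit()
    return {"type": "response", "digest": digest}

# Example: the second fetch of an identical payload becomes a "revisit".
print(record_for("https://commons.wikimedia.org/x/A.png", "2023-06-18", b"PNG..."))
print(record_for("https://commons.wikimedia.org/y/B.png", "2023-06-19", b"PNG..."))
```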
20:36:37 | | TTN joins |
20:37:27 | <@JAA> | What I'd love to happen is every page revision on Wikimedia projects getting archived as WARC in near-realtime. |
20:39:55 | <Nemo_bis> | Hmm. For that you'd probably want to shard the WARCs by title so that you compress the highly duplicated HTML. |
20:41:03 | <masterX244> | WARCs are compressed record-by-record |
20:41:23 | <masterX244> | aka redundancy between pages can't be used unless the compression allows a pre-seeded dictionary |
20:41:42 | <masterX244> | the AT zstd ones are optimized that way, using a tailored per-project dictionary |
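To illustrate the dictionary point: with record-by-record compression, cross-record redundancy is only recovered through a shared, pre-seeded dictionary. A toy sketch with the python-zstandard package, training a small dictionary on made-up wiki-ish HTML; the AT megaWARC tooling's real per-project training is separate from this.

```python
# Toy sketch of why a pre-seeded dictionary matters for record-by-record zstd:
# each record is compressed independently, so redundancy shared across records
# is only exploitable via a shared dictionary. Uses python-zstandard.
import zstandard as zstd

# Fake, highly repetitive "page revisions" sharing the same skin/boilerplate.
boiler = ("<html><head><title>Example wiki</title></head><body>"
          + "<div class='nav'>Main Page | Recent changes | Random page | Help</div>" * 3)
samples = []
for i in range(2000):
    body = "<p>Revision %d of some article, edited at minute %d.</p>" % (i, i % 60) * (1 + i % 5)
    samples.append((boiler + body + "</body></html>").encode())

dictionary = zstd.train_dictionary(8 * 1024, samples)  # shared 8 KiB dictionary

plain = zstd.ZstdCompressor(level=19)
seeded = zstd.ZstdCompressor(level=19, dict_data=dictionary)

record = samples[0]
print("record alone:   ", len(plain.compress(record)), "bytes compressed")
print("with dictionary:", len(seeded.compress(record)), "bytes compressed")

# Decompression needs the same dictionary, which is why .warc.zst embeds it.
restored = zstd.ZstdDecompressor(dict_data=dictionary).decompress(seeded.compress(record))
assert restored == record
```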
20:45:27 | <Nemo_bis> | Ah. Is that enforced by megaWARC? |
20:46:59 | <Nemo_bis> | AFAIK it's not mandatory https://archive-access.sourceforge.net/warc/warc_file_format-0.16.html#anchor43 |
20:47:22 | <@JAA> | It is mandatory for zstd-compressed WARCs: http://iipc.github.io/warc-specifications/specifications/warc-zstd/ |
20:47:47 | <@JAA> | (We defined this format.) |
20:49:03 | <Nemo_bis> | Ok, makes sense. |
20:49:56 | <Nemo_bis> | Then you'd need a per-page dictionary for the most frequently updated pages. |
20:50:20 | <@JAA> | zstd is magic. It'd probably work fine with just one dict. |
20:50:28 | <Nemo_bis> | We used to have something of the sort in openzip to increase compression of extremely repetitive HTML like certain templates |
20:50:44 | <@JAA> | But also, .warc.zst does not support multiple dicts, so it'd have to be separate files, which is ... silly. |
20:51:00 | <Nemo_bis> | And you'd update the index every now and then based on the entire corpus? |
20:51:15 | <Nemo_bis> | The dict I mean |
20:51:33 | <Nemo_bis> | and openzim, not openzip. sigh, I need to go to bed |
20:51:34 | <@JAA> | Yeah, or the most recent WARCs or similar. Code for that already exists since we use it in the DPoS projects. |
20:51:42 | <Nemo_bis> | oki |
20:52:36 | <@JAA> | Do you know what order of magnitude the number of page edits per second is? |
20:59:24 | <Nemo_bis> | Excluding Wikidata, I assume? |
21:00:54 | | andrew (andrew) joins |
21:02:13 | | tzt quits [Ping timeout: 265 seconds] |
21:02:15 | <@JAA> | With and without would be interesting, but Wikidata would certainly need special handling, yeah. |
21:02:47 | <@JAA> | Something like archiving items ten minutes after their last edit or similar. |
21:03:24 | | tzt (tzt) joins |
21:06:43 | <Nemo_bis> | Found it... Should be about 20 edits per second https://grafana.wikimedia.org/d/O_OXJyTVk/home-w-wiki-status?orgId=1&refresh=30s&viewPanel=10&from=now-30d&to=now |
21:08:48 | <Nemo_bis> | And Wikidata is about 8 per second https://grafana.wikimedia.org/d/000000170/wikidata-edits?orgId=1&refresh=1m&viewPanel=7 |
21:09:04 | <@JAA> | Huh |
21:09:05 | | tzt quits [Ping timeout: 252 seconds] |
21:09:39 | <Nemo_bis> | The most edited articles are also likely to be bigger than average though... |
21:09:41 | <@JAA> | So that'd be easily doable, even with a little potato server somewhere. |
21:10:27 | <@JAA> | (I'm assuming WMF wouldn't mind ~30 req/s, which I'd expect to not even be a rounding error in the request rates.) |
21:10:40 | <Nemo_bis> | Don't tell the Wikimedia Enterprise people who charge Google millions for the service. ;) |
21:10:58 | <@JAA> | Heh |
21:11:45 | <Nemo_bis> | You'd use EventStreams, not the normal API. Not sure how much traffic it sees https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams |
21:11:58 | <@JAA> | Oh yeah, I played with that once. |
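A minimal consumer of that stream, following the recentchange SSE example from the EventStreams documentation (assumes the sseclient package); a near-realtime WARC archiver would then fetch and write each new revision it sees:

```python
# Hedged sketch: follow Wikimedia EventStreams (Server-Sent Events) and print
# each edit's page URI and new revision id, the trigger a near-realtime
# "archive every revision as WARC" job would act on.
import json
from sseclient import SSEClient as EventSource

STREAM = "https://stream.wikimedia.org/v2/stream/recentchange"

for event in EventSource(STREAM):
    if event.event != "message" or not event.data:
        continue  # skip keep-alives and non-message events
    try:
        change = json.loads(event.data)
    except ValueError:
        continue
    if change.get("type") != "edit":
        continue
    # e.g. https://en.wikipedia.org/wiki/Foo plus the new revision id
    print(change["wiki"], change["meta"]["uri"], change["revision"]["new"])
```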
21:12:44 | <Nemo_bis> | I think IA is still using EventStreams, but they supposedly switched some loads to the Wikimedia Enterprise offering https://wikimediafoundation.org/news/2022/06/21/wikimedia-enterprise-announces-google-and-internet-archive-first-customers/ |
21:13:17 | <@JAA> | Ah right, for the outlink archival. |
21:13:56 | <Nemo_bis> | The main thing is that in order to generate the list of URLs the entire wikitext needs to be cached and the cache stored somewhere, so requesting the HTML at the same time shouldn't generate significant strain (but what do I know) |
21:14:28 | <Nemo_bis> | * the wikitext needs to be parsed and the results cached somewhere |
21:15:48 | <Nemo_bis> | I've updated https://wiki.archiveteam.org/index.php?title=Wikimedia_Commons&type=revision&diff=49963&oldid=48033 with some dose of realism |
21:17:16 | <@JAA> | :-) |
21:33:07 | | hitgrr8 quits [Client Quit] |
21:46:47 | <dan-> | morning all! feels good to be doing some wiki backups again :) |
22:00:42 | <fireonlive> | very interesting reading :) |
22:09:07 | <Nemo_bis> | Hello dan- |
22:25:04 | | nix78 quits [Remote host closed the connection] |
22:31:28 | <pokechu22> | Isn't there also a project that saves outlinks from Wikipedia edits? Not sure who does that |
22:33:14 | <Nemo_bis> | pokechu22: IA does https://timotijhof.net/posts/2022/internet-archive-crawling/ |
22:34:19 | <pokechu22> | ah, and the collection is https://archive.org/details/wikipedia-eventstream |
23:27:19 | | rocketdive joins |
23:40:31 | | rocketdive quits [Client Quit] |
23:40:53 | | quackifi joins |
23:47:19 | | quackifi quits [Read error: Connection reset by peer] |
23:48:32 | | quackifi joins |