00:45:29 | <pabs> | a DokuWiki https://wiki.linuxfoundation.org/ |
00:45:44 | <pabs> | also, the page links to a bunch of other Linux Foundation related wikis |
02:31:08 | | systwi__ (systwi) joins |
02:32:58 | | systwi quits [Ping timeout: 265 seconds] |
02:34:12 | | rktk (rktk) joins |
02:51:16 | <rktk> | Eh so I'm sure someone noticed Miraheze is shutting down now |
02:51:17 | <rktk> | right? |
02:51:47 | <rktk> | if not, https://meta.miraheze.org/wiki/Board/Policies/20230615-Statement, and specifically important note: |
02:51:50 | <rktk> | >We intend to wind down operations in September, and will no longer be providing wikis by October 1st. However, the domain will remain in operation for as long as funds allow to provide users access to full dumps of their wikis. |
02:52:10 | <rktk> | "provide users access" most likely means the *admins* of the wikis get access to full dumps; people like us would have to hit the API instead |
03:09:44 | <pabs> | yep. there are also former Miraheze folks trying to rejoin and revive the project https://news.ycombinator.com/item?id=36362547 |
03:10:31 | <pabs> | but either way there are plans to save the wikis in both dump form and to the WBM |
03:11:00 | | systwi__ is now known as systwi |
03:22:02 | | dan- joins |
03:28:38 | <rktk> | sweet :) |
03:29:20 | | dan- is now authenticated as dan- |
03:36:50 | | BigBrain quits [Remote host closed the connection] |
03:37:07 | | BigBrain (bigbrain) joins |
04:58:07 | <Naruyoko> | Found the "WikiTide" and "Telepedia" projects mentioned in the abovementioned thread; I wonder how long they are going to last? |
05:24:31 | | hitgrr8 joins |
08:22:19 | <Nemo_bis> | Naruyoko: Hard to tell, they're still being formed. |
08:28:25 | <Nemo_bis> | pokechu22: Yes, not sure what I was thinking there. https://github.com/WikiTeam/wikiteam/issues/467#issuecomment-1596031036 |
08:28:39 | <Nemo_bis> | arkiver: Actually I just finished dumping all wikis so do what you please. :) |
09:17:41 | | @Sanqui quits [Read error: Connection reset by peer] |
09:18:09 | | Sanqui joins |
09:18:11 | | Sanqui is now authenticated as Sanqui |
09:18:11 | | Sanqui quits [Changing host] |
09:18:11 | | Sanqui (Sanqui) joins |
09:18:11 | | @ChanServ sets mode: +o Sanqui |
10:01:02 | | Matthww1 quits [Read error: Connection reset by peer] |
10:02:00 | | Matthww1 joins |
11:28:40 | <@arkiver> | Nemo_bis: woah that was fast! |
11:40:23 | | sepro quits [Ping timeout: 252 seconds] |
13:38:21 | | sepro (sepro) joins |
16:57:20 | | that_lurker quits [Client Quit] |
16:57:44 | | that_lurker (that_lurker) joins |
17:17:48 | | BigBrain quits [Remote host closed the connection] |
17:18:05 | | BigBrain (bigbrain) joins |
17:19:48 | <pokechu22> | Nemo_bis: Dumping https://fiction.miraheze.org/ finished just now; I had to resume it 3 times. I dumped with --xml but not --xmlrevisions, which does mean I'm able to resume (--xmlrevisions only allows restarting the whole namespace IIRC) |
17:27:35 | | @rewby quits [Ping timeout: 258 seconds] |
17:33:01 | | pino_p joins |
17:33:17 | <pino_p> | Good news: Wiki host Miraheze isn't shutting down after all https://meta.miraheze.org/wiki/Miraheze_is_Not_Shutting_Down |
17:34:55 | | rewby (rewby) joins |
17:34:55 | | @ChanServ sets mode: +o rewby |
17:40:10 | <Nemo_bis> | pokechu22: thanks, good news. Any insight on what might have caused the failures? Just temporary overload? |
17:41:07 | <pokechu22> | Yeah, probably. I got 2 failures from 502 and 1 failure from 503 but was able to resume after each one just fine |
17:43:20 | <pokechu22> | Lots of random read time outs before then (but only the first 502 had a read time out before it on the same page) |
17:46:26 | <yzqzss|m> | pokechu22: MW-Scraper saves the arvcontinue value as an attribute of the <page> element in the XML dump. That way, we get resumable --xmlrevisions. 🫠|
17:46:51 | <pokechu22> | Yeah, that seems like a good idea |
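A minimal sketch of the resumability trick described above: persist the allrevisions continuation token (arvcontinue) so a dump interrupted by a 502/503 can pick up where it left off. For brevity it writes the token to a sidecar file rather than onto the <page> element as MW-Scraper does; the wiki URL and file name are only examples, and this is not MW-Scraper's actual code.

```python
# Hedged sketch: resume a MediaWiki "allrevisions" crawl by persisting the
# API continuation token (arvcontinue). Names like CHECKPOINT are illustrative.
import json
import os
import requests

API = "https://fiction.miraheze.org/w/api.php"   # example wiki from the log
CHECKPOINT = "arvcontinue.json"                  # hypothetical sidecar file

def load_checkpoint():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f).get("arvcontinue")
    return None

def dump_revisions():
    params = {
        "action": "query",
        "list": "allrevisions",
        "arvprop": "ids|timestamp|content|sha1",
        "arvlimit": "50",
        "format": "json",
    }
    cont = load_checkpoint()
    if cont:
        params["arvcontinue"] = cont
    while True:
        r = requests.get(API, params=params, timeout=60)
        r.raise_for_status()
        data = r.json()
        for page in data["query"]["allrevisions"]:
            # ... append <page>/<revision> elements to the XML dump here,
            # tagging each <page> with the current arvcontinue ...
            pass
        if "continue" not in data:
            break
        params["arvcontinue"] = data["continue"]["arvcontinue"]
        # Persist the token so a crash (502/503, read timeout) is resumable.
        with open(CHECKPOINT, "w") as f:
            json.dump({"arvcontinue": params["arvcontinue"]}, f)

if __name__ == "__main__":
    dump_revisions()
```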
17:59:06 | | pino_p quits [Remote host closed the connection] |
18:13:26 | | eroc1990 quits [Quit: The Lounge - https://thelounge.chat] |
18:13:28 | <yzqzss|m> | Also, I would like to know whether the <revision> elements in the XML dumps generated by wikiteam (py2) using --xmlrevisions contain a <sha1> element? |
18:13:54 | <yzqzss|m> | * MediaWiki uses <sha1> for de-duplication when importing XML. If <sha1> is missing, MediaWiki doesn't know whether the corresponding revision has already been imported, which can lead to duplicate imports. |
18:15:03 | <yzqzss|m> | * This is common: when importing a large wiki you can easily be forced to abort the import because of module/configuration issues. When you fix the problem and re-import, the previously imported revisions get imported again, which is very annoying. |
18:15:17 | <yzqzss|m> | * For this reason, I also don't recommend using Special:Export to make XML dumps, since it doesn't output <sha1>. |
18:15:23 | | eroc1990 (eroc1990) joins |
18:16:34 | <pokechu22> | I think it attempts to replicate Special:Export, so it won't have that |
18:16:51 | <yzqzss|m> | * You may also see sha1 appearing as an attribute of <text>. But I asked a MediaWiki developer, and currently the import tool doesn't recognize attributes on <text>, only the separate <sha1> element. |
18:17:27 | <pokechu22> | As in <text sha1="1234">blabla</text>? |
18:17:54 | <pokechu22> | or <text><sha1>1234</sha1>blabla</text>? |
18:23:06 | <yzqzss|m> | <revision><text bytes="1234" sha1="1234" xml:space="preserve">...</text><sha1>1234</sha1></revision> |
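For reference, a rough sketch of emitting that structure with the separate <sha1> element the importer dedupes on. It assumes MediaWiki's usual convention of base-36-encoding the SHA-1 of the revision text, padded to 31 characters as seen in Wikimedia's own dumps; verify against a known-good dump before relying on it, since this is not wikiteam's actual code.

```python
# Hedged sketch: build the <sha1> element MediaWiki's importer dedupes on.
# Assumption: the value is the SHA-1 of the revision text re-encoded in
# base 36 and left-padded to 31 characters.
import hashlib
from xml.sax.saxutils import escape

def mw_sha1(text: str) -> str:
    """SHA-1 of the revision text as a 31-character base-36 string."""
    n = int(hashlib.sha1(text.encode("utf-8")).hexdigest(), 16)
    alphabet = "0123456789abcdefghijklmnopqrstuvwxyz"
    out = ""
    while n:
        n, rem = divmod(n, 36)
        out = alphabet[rem] + out
    return out.rjust(31, "0")

def revision_xml(text: str) -> str:
    sha1 = mw_sha1(text)
    nbytes = len(text.encode("utf-8"))
    return (
        "<revision>"
        '<text bytes="%d" sha1="%s" xml:space="preserve">%s</text>'
        "<sha1>%s</sha1>"
        "</revision>" % (nbytes, sha1, escape(text), sha1)
    )

if __name__ == "__main__":
    print(revision_xml("Hello, wiki!"))
```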
18:23:26 | <Nemo_bis> | As usual I'm confused how my upload speed to IA can possibly be capped at https://paste.debian.net/plain/1283386 |
18:23:32 | <Nemo_bis> | * capped at 1 MiB/s |
18:27:08 | <yzqzss|m> | https://www.mediawiki.org/xml/export-0.11.xsd |
18:28:13 | <Nemo_bis> | We do create sha1 elements in the usual way but whether they work or not is another matter https://github.com/WikiTeam/wikiteam/blob/5b0e98afe629d1b05d60fca0b776c8e588f13a29/dumpgenerator.py#L1102 |
18:28:26 | <Nemo_bis> | Test imports very welcome :) |
18:28:47 | <yzqzss|m> | :) |
18:34:38 | <yzqzss|m> | Oh, the <contentformat> and <minor> elements seem to be missing. I fixed it in MW-Scraper before, but forgot to backport it. |
18:34:45 | <yzqzss|m> | https://github.com/mediawiki-client-tools/mediawiki-scraper/pull/152/commits/049efc8b36bcd65ae33afcc3d27ca26e5a86c904 |
18:42:00 | <Nemo_bis> | Ah yes, that's good. |
18:42:25 | <Nemo_bis> | Nowadays Wikia's XML seems ok, maybe we don't even need to strip these sha1 codes any more https://github.com/WikiTeam/wikiteam/blob/5b0e98afe629d1b05d60fca0b776c8e588f13a29/dumpgenerator.py#L631 |
18:45:59 | | Megame (Megame) joins |
18:46:26 | | tzt quits [Remote host closed the connection] |
18:46:48 | | tzt (tzt) joins |
18:47:29 | <@JAA> | Nemo_bis: IA doesn't limit the uploads as far as I know, but throughput from far away has never been great (though there are sysctl things you can do to help it somewhat), and in the past couple weeks, IA seems rather busy. My own uploads have also not been going well, even with those sysctl tweaks. |
18:48:02 | | Barto (Barto) joins |
18:49:17 | <Nemo_bis> | JAA: Thanks for confirming. I forgot to apply my usual sysctl tweaks. IIRC they helped mostly in the case of high latency/jitter, which doesn't seem an issue right now. |
18:49:54 | <Nemo_bis> | For bigger uploads/downloads I usually create an ad hoc VM in DigitalOcean SFO DC, but these miraheze dumps are very small so it's just a minor nuisance. |
18:56:37 | <Nemo_bis> | Aah the old MobileMe times when I used to upload from a university room sitting almost next to MIX... I had forgotten https://wiki.archiveteam.org/index.php/User:Nemo_bis |
19:06:43 | <@JAA> | I've managed a couple dozen MB/s to IA before, but beyond that, only parallel connections seem to help. |
19:07:29 | <@JAA> | That was from a server that probably has a line of sight to IA, so that helps. |
19:16:22 | | tzt quits [Ping timeout: 265 seconds] |
19:23:12 | | tzt (tzt) joins |
19:27:17 | <Nemo_bis> | Yes, it's either parallelism or throwing money at someone with good SFMIX connections. I spent 80 $ for https://archive.org/details/wikia_dump_20200214 and it was worth it. |
19:29:43 | <Nemo_bis> | But something like https://archive.org/details/wikimediacommons?sort=-addeddate would cost a fortune to update nowadays, with both Linode and DigitalOcean enforcing egress limits. |
19:34:52 | <Nemo_bis> | I guess I could try https://ifog.ch/en/vps/vps-fremont |
19:51:13 | | nix78 joins |
19:52:50 | <@JAA> | Ah yeah, wikimediacommons is still somewhere on my todo list. |
19:53:20 | <@JAA> | I don't think it'd be too horrible. |
19:53:31 | <@JAA> | Compared to the DPoS projects, it's a drop in the bucket really. |
20:00:05 | <Nemo_bis> | If IA is ok with hosting some 350 TiB of files which are sometimes duplicated from IA itself, I'd probably try to coordinate with WMF SRE so that we upload directly from WMF peering to SFMIX |
20:07:41 | <Nemo_bis> | Though with all the millions WMF is spending on dubious projects, I've said before that we should just throw a few million at IA for the service instead |
20:08:45 | | andrew quits [Quit: ] |
20:13:08 | <@JAA> | Right, I was only thinking of new/updated files, not the whole backlog. |
20:13:13 | | BigBrain quits [Remote host closed the connection] |
20:13:33 | | BigBrain (bigbrain) joins |
20:13:45 | <@JAA> | It'd also be nice to have it as WARC under the proper URLs. But that'd need a fair bit of additional engineering. |
20:19:03 | <Nemo_bis> | The main advantage of going the WARC way would be to avoid duplicating URLs which are already archived by the wayback machine (as people would be able to use the same interface to download everything), but I'm not sure how much overlap there is. |
20:20:22 | <Nemo_bis> | I think the use case for the Commons mirror is "WMF goes rogue and we (people who relied on Commons) need to set up a new Commons elsewhere", which is never going to be a painless operation. |
20:20:56 | | taavi quits [Remote host closed the connection] |
20:21:07 | <Nemo_bis> | I missed Taavi was here :) |
20:21:44 | <Nemo_bis> | masterX244: You could start an image dump of https://wiki.creaturathegame.com/wiki/Special:MediaStatistics |
20:22:08 | <Nemo_bis> | With 650k images at an average of 1047 B each, I'm never going to be able to finish it one API call at a time |
20:22:32 | <masterX244> | where's the latest wikiteam tooling for that? |
20:22:40 | | taavi (taavi) joins |
20:23:01 | <Nemo_bis> | I'm just using the standard launcher.py and dumpgenerator.py on Debian 11 https://github.com/WikiTeam/wikiteam |
20:24:29 | | Megame quits [Client Quit] |
20:25:20 | <masterX244> | doing a --xml --images pull |
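Roughly, the --images pass first has to enumerate every file before downloading anything; a hedged sketch of doing that through list=allimages (dumpgenerator.py's real enumeration differs in detail, and the api.php path below is a guess):

```python
# Hedged sketch: enumerate all files on a MediaWiki via list=allimages,
# roughly what an --images dump needs before it can download anything.
# Not dumpgenerator.py's actual code; the API URL is an assumption.
import requests

API = "https://wiki.creaturathegame.com/w/api.php"  # wiki mentioned above

def iter_images(api=API):
    params = {
        "action": "query",
        "list": "allimages",
        "aiprop": "url|sha1|size",
        "ailimit": "500",
        "format": "json",
    }
    while True:
        data = requests.get(api, params=params, timeout=60).json()
        for img in data["query"]["allimages"]:
            yield img["name"], img["url"], img.get("sha1")
        if "continue" not in data:
            break
        params.update(data["continue"])

if __name__ == "__main__":
    for name, url, sha1 in iter_images():
        print(name, url)
```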
20:25:34 | <@JAA> | Deduping against the WBM is horrible and simply impossible at scale. But deduping within a separate 'let's archive all of Wikimedia Commons' WARC thing is possible with the right software. |
20:26:12 | <masterX244> | and as a regular user it's nearly impossible, no access to a big chunk of the WARCs |
20:26:59 | <@JAA> | Even as a special user, it's basically impossible. We tried. |
20:27:49 | <@JAA> | Maybe IA staff could make it happen, but I'm doubtful even about that. |
20:29:19 | <Nemo_bis> | What do you mean by deduping here? I'm thinking of things like DjVu and PDF files from IA re-hosted on Commons |
20:29:49 | <Nemo_bis> | masterX244: thanks. If you get many timeouts, please try commenting these lines which attempt to create the .desc files https://github.com/WikiTeam/wikiteam/blob/5b0e98afe629d1b05d60fca0b776c8e588f13a29/dumpgenerator.py#L1521-L1549 |
20:31:18 | <@JAA> | Right, I doubt that'll be possible at scale as well. I mean deduping identical files from Commons, including on regular reretrievals. |
20:34:21 | <Nemo_bis> | Ok. I don't think we have that many bit-identical duplicates. Only 360 mostly small files at the moment. https://commons.wikimedia.org/w/index.php?title=Special:ListDuplicatedFiles&limit=500&offset=0 |
20:36:20 | <@JAA> | That's lower than I expected, but makes sense that it's being monitored. |
20:36:27 | <Nemo_bis> | There are more duplicates across different wikis and across revision histories of files. WMF stopped caring about these long ago because they're deduplicated by the object storage backend anyway. |
20:36:36 | <@JAA> | Right |
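A hedged sketch of what that per-file dedup could look like: index payload digests and turn repeat fetches of an identical file into a pointer to the first copy, which real WARC tooling would express as a revisit record keyed on WARC-Payload-Digest. All names and URLs below are illustrative.

```python
# Hedged sketch of per-file dedup for a hypothetical "all of Commons as WARC"
# job: keep an index of payload digests; identical later fetches become
# references to the first stored copy instead of a second full body.
import hashlib
import sqlite3

db = sqlite3.connect("dedup-index.sqlite3")
db.execute("CREATE TABLE IF NOT EXISTS seen"
           " (digest TEXT PRIMARY KEY, first_url TEXT, first_date TEXT)")

def record_for(url: str, date: str, payload: bytes) -> dict:
    digest = "sha1:" + hashlib.sha1(payload).hexdigest()
    row = db.execute("SELECT first_url, first_date FROM seen WHERE digest = ?",
                     (digest,)).fetchone()
    if row:
        # Identical payload seen before: store only a pointer, not the bytes.
        return {"type": "revisit", "digest": digest,
                "refers_to": row[0], "refers_to_date": row[1]}
    db.execute("INSERT INTO seen VALUES (?, ?, ?)", (digest, url, date))
    db.commit()
    return {"type": "response", "digest": digest}

# Example: the second fetch of an identical payload becomes a "revisit".
print(record_for("https://commons.wikimedia.org/x/A.png", "2023-06-18", b"PNG..."))
print(record_for("https://commons.wikimedia.org/y/B.png", "2023-06-19", b"PNG..."))
```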
20:36:37 | | TTN joins |
20:37:27 | <@JAA> | What I'd love to happen is every page revision on Wikimedia projects getting archived as WARC in near-realtime. |
20:39:55 | <Nemo_bis> | Hmm. For that you'd probably want to shard the WARCs by title so that you compress the highly duplicated HTML. |
20:41:03 | <masterX244> | WARCs are compressed record-by-record |
20:41:23 | <masterX244> | aka redundancy between pages can't be used unless the compression allows a pre-seeded dictionary |
20:41:42 | <masterX244> | the AT zstd ones are optimized that way, using a tailored per-project dictionary |
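To illustrate the dictionary point: with record-by-record compression, cross-record redundancy is only recovered through a shared, pre-seeded dictionary. A toy sketch with the python-zstandard package, training a small dictionary on made-up wiki-ish HTML; the AT megaWARC tooling's real per-project training is separate from this.

```python
# Toy sketch of why a pre-seeded dictionary matters for record-by-record zstd:
# each record is compressed independently, so redundancy shared across records
# is only exploitable via a shared dictionary. Uses python-zstandard.
import zstandard as zstd

# Fake, highly repetitive "page revisions" sharing the same skin/boilerplate.
boiler = ("<html><head><title>Example wiki</title></head><body>"
          + "<div class='nav'>Main Page | Recent changes | Random page | Help</div>" * 3)
samples = []
for i in range(2000):
    body = "<p>Revision %d of some article, edited at minute %d.</p>" % (i, i % 60) * (1 + i % 5)
    samples.append((boiler + body + "</body></html>").encode())

dictionary = zstd.train_dictionary(8 * 1024, samples)  # shared 8 KiB dictionary

plain = zstd.ZstdCompressor(level=19)
seeded = zstd.ZstdCompressor(level=19, dict_data=dictionary)

record = samples[0]
print("record alone:   ", len(plain.compress(record)), "bytes compressed")
print("with dictionary:", len(seeded.compress(record)), "bytes compressed")

# Decompression needs the same dictionary, which is why .warc.zst embeds it.
restored = zstd.ZstdDecompressor(dict_data=dictionary).decompress(seeded.compress(record))
assert restored == record
```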
20:45:27 | <Nemo_bis> | Ah. Is that enforced by megaWARC? |
20:46:59 | <Nemo_bis> | AFAIK it's not mandatory https://archive-access.sourceforge.net/warc/warc_file_format-0.16.html#anchor43 |
20:47:22 | <@JAA> | It is mandatory for zstd-compressed WARCs: http://iipc.github.io/warc-specifications/specifications/warc-zstd/ |
20:47:47 | <@JAA> | (We defined this format.) |
20:49:03 | <Nemo_bis> | Ok, makes sense. |
20:49:56 | <Nemo_bis> | Then you'd need a per-page dictionary for the most frequently updated pages. |
20:50:20 | <@JAA> | zstd is magic. It'd probably work fine with just one dict. |
20:50:28 | <Nemo_bis> | We used to have something of the sort in openzip to increase compression of extremely repetitive HTML like certain templates |
20:50:44 | <@JAA> | But also, .warc.zst does not support multiple dicts, so it'd have to be separate files, which is ... silly. |
20:51:00 | <Nemo_bis> | And you'd update the index every now and then based on the entire corpus? |
20:51:15 | <Nemo_bis> | The dict I mean |
20:51:33 | <Nemo_bis> | and openzim, not openzip. sigh, I need to go to bed |
20:51:34 | <@JAA> | Yeah, or the most recent WARCs or similar. Code for that already exists since we use it in the DPoS projects. |
20:51:42 | <Nemo_bis> | oki |
20:52:36 | <@JAA> | Do you know what order of magnitude the number of page edits per second is? |
20:59:24 | <Nemo_bis> | Excluding Wikidata, I assume? |
21:00:54 | | andrew (andrew) joins |
21:02:13 | | tzt quits [Ping timeout: 265 seconds] |
21:02:15 | <@JAA> | With and without would be interesting, but Wikidata would certainly need special handling, yeah. |
21:02:47 | <@JAA> | Something like archiving items ten minutes after their last edit or similar. |
21:03:24 | | tzt (tzt) joins |
21:06:43 | <Nemo_bis> | Found it... Should be about 20 edits per second https://grafana.wikimedia.org/d/O_OXJyTVk/home-w-wiki-status?orgId=1&refresh=30s&viewPanel=10&from=now-30d&to=now |
21:08:48 | <Nemo_bis> | And Wikidata is about 8 per second https://grafana.wikimedia.org/d/000000170/wikidata-edits?orgId=1&refresh=1m&viewPanel=7 |
21:09:04 | <@JAA> | Huh |
21:09:05 | | tzt quits [Ping timeout: 252 seconds] |
21:09:39 | <Nemo_bis> | The most edited articles are also likely to be bigger than average though... |
21:09:41 | <@JAA> | So that'd be easily doable, even with a little potato server somewhere. |
21:10:27 | <@JAA> | (I'm assuming WMF wouldn't mind ~30 req/s, which I'd expect to not even be a rounding error in the request rates.) |
21:10:40 | <Nemo_bis> | Don't tell the Wikimedia Enterprise people who charge Google millions for the service. ;) |
21:10:58 | <@JAA> | Heh |
21:11:45 | <Nemo_bis> | You'd use EventStreams, not the normal API. Not sure how much traffic it sees https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams |
21:11:58 | <@JAA> | Oh yeah, I played with that once. |
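A minimal consumer of that stream, following the recentchange SSE example from the EventStreams documentation (assumes the sseclient package); a near-realtime WARC archiver would then fetch and write each new revision it sees:

```python
# Hedged sketch: follow Wikimedia EventStreams (Server-Sent Events) and print
# each edit's page URI and new revision id, the trigger a near-realtime
# "archive every revision as WARC" job would act on.
import json
from sseclient import SSEClient as EventSource

STREAM = "https://stream.wikimedia.org/v2/stream/recentchange"

for event in EventSource(STREAM):
    if event.event != "message" or not event.data:
        continue  # skip keep-alives and non-message events
    try:
        change = json.loads(event.data)
    except ValueError:
        continue
    if change.get("type") != "edit":
        continue
    # e.g. https://en.wikipedia.org/wiki/Foo plus the new revision id
    print(change["wiki"], change["meta"]["uri"], change["revision"]["new"])
```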
21:12:44 | <Nemo_bis> | I think IA is still using EventStreams, but they supposedly switched some loads to the Wikimedia Enterprise offering https://wikimediafoundation.org/news/2022/06/21/wikimedia-enterprise-announces-google-and-internet-archive-first-customers/ |
21:13:17 | <@JAA> | Ah right, for the outlink archival. |
21:13:56 | <Nemo_bis> | The main thing is that in order to generate the list of URLs the entire wikitext needs to be cached and the cache stored somewhere, so requesting the HTML at the same time shouldn't generate significant strain (but what do I know) |
21:14:28 | <Nemo_bis> | * the wikitext needs to be parsed and the results cached somewhere |
21:15:48 | <Nemo_bis> | I've updated https://wiki.archiveteam.org/index.php?title=Wikimedia_Commons&type=revision&diff=49963&oldid=48033 with some dose of realism |
21:17:16 | <@JAA> | :-) |
21:33:07 | | hitgrr8 quits [Client Quit] |
21:46:47 | <dan-> | morning all! feels good to be doing some wiki backups again :) |
22:00:42 | <fireonlive> | very interesting reading :) |
22:09:07 | <Nemo_bis> | Hello dan- |
22:25:04 | | nix78 quits [Remote host closed the connection] |
22:31:28 | <pokechu22> | Isn't there also a project that saves outlinks from Wikipedia edits? Not sure who does that |
22:33:14 | <Nemo_bis> | pokechu22: IA does https://timotijhof.net/posts/2022/internet-archive-crawling/ |
22:34:19 | <pokechu22> | ah, and the collection is https://archive.org/details/wikipedia-eventstream |
23:27:19 | | rocketdive joins |
23:40:31 | | rocketdive quits [Client Quit] |
23:40:53 | | quackifi joins |
23:47:19 | | quackifi quits [Read error: Connection reset by peer] |
23:48:32 | | quackifi joins |