03:51:28 | | TheTechRobo quits [Remote host closed the connection] |
03:51:50 | | TheTechRobo (TheTechRobo) joins |
03:52:29 | | Craigle quits [Quit: The Lounge - https://thelounge.chat] |
03:54:13 | | Craigle (Craigle) joins |
04:07:58 | | TheTechRobo quits [Remote host closed the connection] |
04:08:20 | | TheTechRobo (TheTechRobo) joins |
04:32:24 | | TheTechRobo quits [Remote host closed the connection] |
04:32:50 | | TheTechRobo (TheTechRobo) joins |
09:39:05 | | igloo22225 quits [Client Quit] |
09:39:40 | | igloo22225 (igloo22225) joins |
09:44:56 | | tech_exorcist (tech_exorcist) joins |
12:07:59 | | tech_exorcist quits [Remote host closed the connection] |
12:08:24 | | tech_exorcist (tech_exorcist) joins |
12:38:50 | | Iki1 joins |
12:41:46 | | Iki quits [Ping timeout: 240 seconds] |
13:14:12 | | Matthww1 quits [Ping timeout: 265 seconds] |
13:29:48 | | Matthww1 joins |
16:04:16 | | kdqep quits [Ping timeout: 240 seconds] |
21:30:30 | | qwertyasdfuiopghjkl joins |
22:00:20 | | kdqep (kdqep) joins |
22:10:51 | <@JAA> | Could someone dump Rodovid please? It consists of 25 wikis. Seven of them have been dumped before, but only once, a decade ago (Apr 2012)... Links are at the bottom of https://www.rodovid.org/ |
22:11:43 | <@JAA> | It's an ancient version of MW, so that might cause some trouble. |
22:12:02 | | tech_exorcist quits [Client Quit] |
22:12:04 | <@JAA> | 1.9.3 from 2007 |
22:14:41 | <@JAA> | Rate limiting is in place as well, it appears. |
22:39:40 | <pokechu22> | "Crawl-delay:9" according to robots.txt |
22:41:27 | <pokechu22> | https://en.rodovid.org/wk/Special:Statistics says 186,265 total pages... and also "a total of 284 page views" which is good. Also https://en.rodovid.org/wk/Special:Version gives MediaWiki 1.9.3 from circa 2008... |
22:44:21 | <pokechu22> | Special:Export is restricted, https://en.rodovid.org/api.php seems to exist. Though https://en.rodovid.org/wk/Special:Recentchanges indicates that there's also a https://en.rodovid.org/wk/Special:Changedrecords which seems to be specialized... example https://en.rodovid.org/wk/Person:1445866 |
22:51:49 | <@JAA> | I believe those are just specially formatted pages. The API seems to indicate they use XML. |
22:52:05 | <@JAA> | See e.g. https://en.rodovid.org/api.php?action=query&prop=revisions&pageids=81887&rvprop=timestamp|user|comment|content |
22:52:18 | <@JAA> | But yeah, I wonder whether the existing tooling can handle that. |
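(Editor's sketch, not part of the conversation: the revision query JAA links above could be assembled programmatically roughly like this. The function name `revisions_url` is hypothetical; the parameters simply mirror the quoted URL. Whether this ancient API version accepts further parameters, such as an explicit output format, is not confirmed in the log, so none are added. Any real fetching would presumably also need to respect the crawl delay mentioned earlier.)

```python
from urllib.parse import urlencode


def revisions_url(api_base, page_id,
                  props=("timestamp", "user", "comment", "content")):
    """Build an api.php URL asking for one page's revision data.

    Mirrors the query string quoted in the log; nothing is fetched here.
    """
    params = {
        "action": "query",
        "prop": "revisions",
        "pageids": str(page_id),
        "rvprop": "|".join(props),
    }
    # safe="|" keeps the pipe separators unescaped, as in the linked URL.
    return api_base + "?" + urlencode(params, safe="|")


if __name__ == "__main__":
    print(revisions_url("https://en.rodovid.org/api.php", 81887))
```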
22:57:46 | <pokechu22> | https://www.mediawiki.org/wiki/API:Userinfo is from 1.11, seems like mwclient depends on that (and also wants 1.16?) |
23:00:15 | <pokechu22> | https://github.com/mwclient/mwclient/blame/6e664d98c24ce783d67a40efebda5e6c9e3379e8/README.md makes me think that getting it to work with wikiteam tools isn't going to be possible; there doesn't seem to be *any* version that supports mediawiki that old |
23:01:19 | <pokechu22> | (see also https://github.com/mwclient/mwclient/commit/74624b19597cec73f3196ba6e484d832d7243b5a) |
23:04:27 | | igloo22225 quits [Read error: Connection reset by peer] |
23:04:50 | | igloo22225 joins |
23:05:54 | <pokechu22> | Nothing in https://archive.org/search.php?query=rodovid%20wikiteam seems to be a good archive - they only contain a titles.txt file |
23:12:22 | <@JAA> | Ah :-( |
23:17:51 | <pokechu22> | If you want to try to do an !a < list, https://en.rodovid.org/wk?title=Special%3AAllpages&from=&namespace=0 with the other namespaces should probably capture everything... but not the page source, since e.g. https://en.rodovid.org/edit/Family:10000 is restricted |
23:19:22 | <pokechu22> | I'll build a list of all namespaces, one sec |
23:20:14 | <@JAA> | Yeah, page source would need to be fetched via the API. |
23:21:50 | <@JAA> | I think I'll put this on my list of todos. I'd want to do a proper complete archive. |
23:22:05 | <@JAA> | An !a < job for decent coverage would still be nice though. |
23:23:11 | <pokechu22> | https://en.rodovid.org/api.php nicely gives a list of valid namespaces near list=allpages |
23:24:11 | <@JAA> | Right, so does https://en.rodovid.org/wk/Special:Allpages with a bit of grepping. :-P |
23:24:58 | <pokechu22> | The first time I did it I just manually copied URLs after changing the dropdown... it's nice to have a clean comma-separated list |
23:25:26 | <pokechu22> | https://transfer.archivete.am/jCnmi/en.rodovid.org_all_pages_seed.txt |
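(Editor's sketch, not part of the conversation: a seed list of this kind could be generated from a list of namespace IDs roughly as follows. The helper name `allpages_seed_urls` is hypothetical, and the namespace IDs in the usage line are examples only; the real list would come from api.php or Special:Allpages as discussed above. Each URL is only the first page of a namespace listing, so a crawler would still need to follow the pagination links.)

```python
from urllib.parse import urlencode


def allpages_seed_urls(wiki_base, namespace_ids):
    """Return one Special:Allpages URL per namespace ID."""
    urls = []
    for ns in namespace_ids:
        # Same query shape as the Special:Allpages link quoted in the log.
        query = urlencode({"title": "Special:Allpages", "from": "", "namespace": ns})
        urls.append(f"{wiki_base}/wk?{query}")
    return urls


if __name__ == "__main__":
    # Example namespace IDs only; substitute the wiki's actual list.
    for url in allpages_seed_urls("https://en.rodovid.org", [0, 1000]):
        print(url)
```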
23:27:07 | <pokechu22> | You'd want to ignore /edit/ but not /history/ probably... |
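(Editor's sketch, not part of the conversation: the ignore rule described here, skipping /edit/ while keeping /history/, could be expressed as a simple path check. The function name `should_fetch` is hypothetical, and this is plain Python for illustration, not the ignore syntax of any particular crawler.)

```python
from urllib.parse import urlsplit


def should_fetch(url):
    """Return False for restricted /edit/ URLs, True for everything else."""
    path = urlsplit(url).path
    # /history/ and /wk/ pages fall through and are kept.
    return not path.startswith("/edit/")


if __name__ == "__main__":
    for candidate in (
        "https://en.rodovid.org/edit/Family:10000",
        "https://en.rodovid.org/wk/Person:1445866",
    ):
        print(candidate, should_fetch(candidate))
```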
23:28:30 | <pokechu22> | Hmm, based on https://en.rodovid.org/wk/Special:Changedrecords vs https://en.rodovid.org/wk?title=Special:Changedrecords&onlylocal=1 doing the english one should cover most of the content, but maybe not in a super useful manner |
23:29:11 | <@JAA> | Some pages don't have /history/. :-( |
23:29:53 | <@JAA> | Ah, those are pages associated with another wiki. |
23:30:00 | <pokechu22> | Hmm, I think ones that aren't local don't have /history/ - interesting. |
23:30:04 | <@JAA> | E.g. https://en.rodovid.org/wk/Person:1390527 = https://ru.rodovid.org/wk/%D0%97%D0%B0%D0%BF%D0%B8%D1%81%D1%8C:1390527 |
23:31:44 | <@JAA> | 'Its data is periodically archived by the WikiTeam project at the Internet Archive.' per English Wikipedia... lol |
23:31:47 | <pokechu22> | The "In other languages" sidebar link there is both useful and probably problematic since it'd result in lots of duplication for those records that aren't native to a specific wiki |
23:32:42 | <@JAA> | Also stored on servers in Kyiv per enwiki. If that's true... Well, let's just say this just got a bit more interesting and important than I thought a few minutes ago. |
23:34:43 | <pokechu22> | https://en.rodovid.org/wk/About_Rodovid does say the maintainers are from Ukraine |
23:37:17 | <@JAA> | Seems to be hosted at AWS, unless they just proxy it through. |
23:37:23 | <pokechu22> | https://engine.rodovid.org/wk/Special:Statistics is probably a good starting point. It also looks like Special:AllPages only lists entries that exist locally... as https://engine.rodovid.org/wk?title=Special:Allpages&namespace=1000 has 9 pages of 1000 entries each, which seems close to 12,555 total pages |
23:38:17 | <@JAA> | Yeah, the Russian one is by far the largest, 777k pages. |
23:38:56 | <@JAA> | AB would still retrieve the pages in other languages as outlinks unless ignored, but when ignored, the browsability will suffer significantly. |
23:39:20 | <pokechu22> | Hmm, I'm less sure about that actually, https://es.rodovid.org/wk/Persona:326677 and https://en.rodovid.org/wk/Person:326677 both exist but https://engine.rodovid.org/wk/Person:326677 doesn't |
23:40:57 | <pokechu22> | https://en.rodovid.org/wk/Person:1419167 / https://ru.rodovid.org/wk/%D0%97%D0%B0%D0%BF%D0%B8%D1%81%D1%8C:1419167 - the ru Special:AllPages does have a longer list than the en Special:AllPages. Maybe engine.rodovid.org uses a separate database since it acts like a test site? |
23:41:49 | <pokechu22> | I should try and get wikiteam tools to work - the place where mwclient fails is for user info which is probably irrelevant for us since we don't log in to make edits. Maybe things'll work if I just rip out that check? No way that can end poorly :P |
23:42:18 | <@JAA> | It implies at the top that mwclient is only required for --xmlrevisions also. |
23:43:25 | <pokechu22> | --help says --xmlrevisions is MediaWiki 1.27+ only (now I'm confused as to what makes that different from --xml...) |
23:43:46 | | Sluggs quits [Ping timeout: 240 seconds] |
23:47:51 | <pokechu22> | Oh boy, https://www.mediawiki.org/wiki/API:Siteinfo is MW 1.8, even more stuff to rip out... |
23:48:42 | <@JAA> | But this has 1.9.3...? |
23:49:38 | <@JAA> | And meta=siteinfo exists on its API. |
23:49:41 | <pokechu22> | ... hm... oh, my attempt at editing it locally failed |
23:49:59 | <pokechu22> | I do like how https://engine.rodovid.org/api.php?version=1 shows "ApiQueryRevisions: $Id$" |
23:50:49 | <@JAA> | :-) |
23:51:02 | <pokechu22> | (it = mwclient, and my attempt was copying /home/pokechu22/python2-env/lib/python2.7/site-packages/mwclient/ so that there was a mwclient/ directory next to dumpgenerator.py... which seemed to work at first, but now isn't working?) |
23:52:00 | <pokechu22> | No, wait, that worked, I'm just confused now |
23:52:18 | <@JAA> | That should work, yeah. |
23:52:46 | <@JAA> | Python prefers the cwd over the site and dist package dirs on imports. |
23:53:23 | <@JAA> | The details are a bit complicated though as it depends on how you invoke Python. And I have no idea how it works exactly in Python 2. I closed that chapter a long time ago and erased any information of it from my brain. :-) |
23:54:48 | <@JAA> | But yeah, it might depend on whether you do `python script.py`, `python -c 'import mwclient'`, or just `python` for the REPL. |
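(Editor's sketch, not part of the conversation: the underlying rule is that Python imports the first match found while walking `sys.path` in order, and what sits at the front of that list depends on how the interpreter was started. The demo below is Python 3 only and uses a made-up module name, `shadow_demo_mod`, placed in two temporary directories; it says nothing about Python 2 specifics, which the log leaves open.)

```python
import importlib
import os
import sys
import tempfile


def first_on_path_wins(module_name="shadow_demo_mod"):
    """Create two same-named modules and report which one gets imported."""
    with tempfile.TemporaryDirectory() as first, \
            tempfile.TemporaryDirectory() as second:
        for directory, label in ((first, "local copy"),
                                 (second, "site-packages copy")):
            with open(os.path.join(directory, module_name + ".py"), "w") as fh:
                fh.write(f"ORIGIN = {label!r}\n")

        saved_path = list(sys.path)
        # Put both directories at the front; `first` is searched before `second`.
        sys.path[:0] = [first, second]
        importlib.invalidate_caches()
        try:
            module = importlib.import_module(module_name)
            return module.ORIGIN
        finally:
            # Restore the interpreter state so the demo leaves no trace.
            sys.path[:] = saved_path
            sys.modules.pop(module_name, None)


if __name__ == "__main__":
    print(first_on_path_wins())
```

(In the same spirit, printing `mwclient.__file__` after importing it would show which copy was actually picked up.)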
23:55:28 | <pokechu22> | https://en.rodovid.org/api.php?action=query&meta=siteinfo&siprop=general|namespaces works in my browser and *should* be what I'm sending, but I'm getting mwclient.errors.APIError: (u'unknown_meta', u"Unrecognised value for parameter 'meta'", u'incredibly long info message skipped') |