00:49:26 | | Ryz quits [Quit: Ping timeout (120 seconds)] |
00:49:55 | | Ryz (Ryz) joins |
00:52:45 | | eroc1990 quits [Client Quit] |
00:53:35 | | eroc1990 (eroc1990) joins |
01:31:20 | | HackMii quits [Write error: Connection reset by peer] |
01:35:01 | | HackMii (hacktheplanet) joins |
04:53:00 | | HackMii quits [Remote host closed the connection] |
04:54:21 | | HackMii (hacktheplanet) joins |
08:51:18 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
09:24:35 | | qwertyasdfuiopghjkl joins |
09:27:08 | | tech_exorcist (tech_exorcist) joins |
10:11:32 | | tech_exorcist_ (tech_exorcist) joins |
10:11:46 | | tech_exorcist quits [Ping timeout: 240 seconds] |
11:14:15 | | tech_exorcist_ quits [Remote host closed the connection] |
11:16:08 | | tech_exorcist_ (tech_exorcist) joins |
11:29:46 | | tech_exorcist_ quits [Ping timeout: 240 seconds] |
11:30:44 | | tech_exorcist (tech_exorcist) joins |
12:00:11 | | tech_exorcist quits [Remote host closed the connection] |
12:00:37 | | tech_exorcist (tech_exorcist) joins |
12:25:15 | | tech_exorcist quits [Remote host closed the connection] |
12:27:39 | | tech_exorcist (tech_exorcist) joins |
12:53:45 | | tech_exorcist quits [Client Quit] |
18:16:35 | | Iki1 joins |
18:20:22 | | AnotherIki quits [Ping timeout: 265 seconds] |
19:49:47 | | qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds] |
20:55:07 | <pokechu22> | I seem to be able to get revisions to load with --xmlrevisions as long as I make getXmlHeader always `return "<mediawiki><!-- no header available -->\n", config` and make a few minor changes to getXMLRevisions (which already supports prop=revisions) and makeXmlFromPage... currently trying to export engine.rodovid.org as a proof of concept. |
20:57:42 | <pokechu22> | (in particular: rvlimit is commented out originally, which breaks pagination (no continuation is provided unless a limit is set); I've just enabled it at 50. And invalid rvprop values result in an error on this version instead of a warning, but that's easy enough to fix... makeXmlFromPage needed to be changed to use `for rev in page['revisions'].itervalues()` instead of `for rev in
20:57:44 | <pokechu22> | page['revisions']`, not sure if this is a difference in MW output or something else but modern MW does warn about a format difference in some cases) |
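A minimal sketch of the two changes described above, in the Python 2 style of the wikiteam dumpgenerator.py of the time; the function bodies and the revision dict layout are assumptions based on this log, not the upstream code:

```python
def getXmlHeader(config={}, session=None):
    # Assumed simplification: always fall back to a stub header instead of
    # trying Special:Export, which is unusable on this old MediaWiki.
    return "<mediawiki><!-- no header available -->\n", config

def makeXmlFromPage(page):
    # Hypothetical, much-reduced version of the real function: on this wiki
    # prop=revisions returns 'revisions' as a dict keyed by revision ID
    # rather than a list, hence .itervalues() instead of plain iteration.
    chunks = []
    for rev in page['revisions'].itervalues():
        chunks.append('    <revision><id>%s</id><text>%s</text></revision>'
                      % (rev.get('revid', ''), rev.get('*', '')))
    return '  <page>\n%s\n  </page>\n' % '\n'.join(chunks)
```

The rvlimit change amounts to passing rvlimit=50 with the prop=revisions request so the API returns continuation data.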
20:59:50 | <pokechu22> | So far there doesn't seem to be any rate-limiting on the API, unlike what was previously seen with archivebot on regular pages |
21:01:32 | <pokechu22> | OK, I guess clarification: I *do* get a 429 when I try to load the HTML pages while doing this export, so the rate-limiting exists, it's just that the rate-limiting only rejects index.php and not api.php or something like that? |
21:03:28 | <pokechu22> | That's good, because trying to use a crawl delay of 9 seconds would be bad: there are ~10000 pages on the engine site (which is more of a test instance for them), and at 9 seconds per page that's 90000 seconds, which is over a day
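A quick standalone probe (hypothetical, Python 3 with requests; the /wk/ article path and the User-Agent contact are assumptions) to check the observation that 429s come back for index.php-served HTML pages but not for api.php:

```python
import time
import requests

BASE = 'https://engine.rodovid.org'  # host taken from this log
session = requests.Session()
session.headers['User-Agent'] = 'wikiteam-test (contact: you@example.org)'

for i in range(20):
    api = session.get(BASE + '/api.php',
                      params={'action': 'query', 'meta': 'siteinfo',
                              'format': 'json'})
    html = session.get(BASE + '/wk/Main_Page')  # regular index.php-served page
    print(i, 'api.php:', api.status_code, 'html:', html.status_code)
    time.sleep(0.5)
```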
21:03:32 | <michaelblob_> | i believe there is no hard limit on read requests through the API |
21:04:32 | <michaelblob_> | see https://www.mediawiki.org/wiki/API:Etiquette |
21:06:51 | <pokechu22> | See also https://www.mediawiki.org/wiki/API:Etiquette#The_User-Agent_header... what's also odd is that there is a --delay option but it doesn't seem to be used between pages (I did figure out where to add that if it's needed, but it doesn't seem to be needed on this site)
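A sketch of where a per-page delay could go if it did turn out to be needed, under the assumption that dumpgenerator.py's delay() helper sleeps for the --delay value; the names here are illustrative, not the upstream code:

```python
import time

def delay(config, session=None):
    # Minimal stand-in for dumpgenerator.py's delay() helper; assumed to
    # sleep for the value passed via --delay (here config['delay']).
    time.sleep(config.get('delay', 0))

def export_all_pages(page_xml_chunks, config):
    # Hypothetical loop: yield each page's XML, then honour --delay before
    # moving on to the next page.
    for xml in page_xml_chunks:
        yield xml
        delay(config)
```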
21:11:17 | <pokechu22> | https://github.com/Pokechu22/wikiteam/commits/rodovid is what I've got FWIW |
21:38:49 | <pokechu22> | Is there any reason why action=query&list=allimages or action=query&generator=allpages&prop=imageinfo is needed? Neither imageinfo nor allimages is supported here, but I feel like just using the list of page titles would be enough...
21:39:29 | <pokechu22> | Oh, wait, it's to get the image filename/URL, isn't it? Just having a page name wouldn't be enough without scraping (and scraping would invoke the rate limit, but is probably doable)
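For reference, this is roughly the kind of list=allimages query that provides image names and URLs on wikis new enough to support it (which this one is not); a hypothetical standalone Python 3 sketch, with the api.php path assumed:

```python
import requests

API = 'https://engine.rodovid.org/api.php'  # path is an assumption

def list_images(api=API):
    """Yield (name, url) pairs from action=query&list=allimages."""
    params = {'action': 'query', 'list': 'allimages',
              'ailimit': '50', 'format': 'json'}
    while True:
        data = requests.get(api, params=params).json()
        for img in data['query']['allimages']:
            yield img['name'], img['url']
        # Old MediaWiki signals continuation via 'query-continue',
        # newer versions via 'continue'.
        cont = (data.get('continue')
                or data.get('query-continue', {}).get('allimages'))
        if not cont:
            break
        params.update(cont)

if __name__ == '__main__':
    for name, url in list_images():
        print(name, url)
```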
21:46:02 | | igloo22225 (igloo22225) joins |
21:46:06 | <pokechu22> | Oh, that's interesting but also a bit annoying: https://engine.rodovid.org/images/4/4e/01_common.gif is a 301 redirect to https://*.rodovid.org/4/4e/01_common.gif which is... not going to work. The actual location per https://engine.rodovid.org/wk/Image:01_common.gif is https://rodvoid.org/4/4e/01_common.gif (note: rodvoid.org, not rodovid.org!) |
21:47:15 | <@JAA> | Yeah, found that the other day as well on other wikis. |
21:47:26 | <@JAA> | This entire thing is a mess. |
21:53:36 | <pokechu22> | I should be able to special-case it at least; there's already a curateImageURL function that massages the URL to something nicer
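A hypothetical special case along the lines of what curateImageURL could do: rewrite the broken engine.rodovid.org/images/... form, whose redirect points at a literal *.rodovid.org wildcard, to the working rodvoid.org host seen above (whether the same host works for the other subdomains is an assumption):

```python
import re

def curate_rodovid_image_url(url):
    # Files under /images/ are actually served from a different domain
    # (rodvoid.org, not rodovid.org), per the Image: description page.
    m = re.match(r'https?://[^/]+\.rodovid\.org/images/(.+)$', url)
    if m:
        return 'https://rodvoid.org/%s' % m.group(1)
    return url

# Example from this log:
#   https://engine.rodovid.org/images/4/4e/01_common.gif
#   -> https://rodvoid.org/4/4e/01_common.gif
```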
22:27:05 | <pokechu22> | Does getXMLFileDesc / the .desc file generated with an image provide any additional value if there's a separate full XML history dump? getXMLFileDesc uses getXMLPage, which doesn't work (and it seems tricky to adapt it to getXMLRevisions)
23:06:32 | <pokechu22> | How does this look? https://transfer.archivete.am/6LHQ0/arrodovidorg-20220907-wikidump.7z |
23:08:47 | <pokechu22> | (it looks mostly reasonable to me, but I don't really have a reference to compare it to; there definitely is history in it including for people pages, though) |
23:18:49 | <@JAA> | (I'm not all that familiar with the details of the wikidump format, so I can't really comment on it.) |
23:24:44 | <pokechu22> | Well, I've started exporting the other subdomains now (not sure how long it'll take, probably several hours, maybe more than a day), so hopefully this is good enough. |
23:50:28 | | @rewby quits [Ping timeout: 240 seconds] |
23:50:46 | | HackMii quits [Ping timeout: 240 seconds] |
23:52:32 | | HackMii (hacktheplanet) joins |
23:52:41 | | nepeat_ (nepeat) joins |
23:52:54 | | nepeat quits [Ping timeout: 265 seconds] |
23:53:46 | | rewby (rewby) joins |
23:53:46 | | @ChanServ sets mode: +o rewby |