00:49:26Ryz quits [Quit: Ping timeout (120 seconds)]
00:49:55Ryz (Ryz) joins
00:52:45eroc1990 quits [Client Quit]
00:53:35eroc1990 (eroc1990) joins
01:31:20HackMii quits [Write error: Connection reset by peer]
01:35:01HackMii (hacktheplanet) joins
04:53:00HackMii quits [Remote host closed the connection]
04:54:21HackMii (hacktheplanet) joins
08:51:18qwertyasdfuiopghjkl quits [Remote host closed the connection]
09:24:35qwertyasdfuiopghjkl joins
09:27:08tech_exorcist (tech_exorcist) joins
10:11:32tech_exorcist_ (tech_exorcist) joins
10:11:46tech_exorcist quits [Ping timeout: 240 seconds]
11:14:15tech_exorcist_ quits [Remote host closed the connection]
11:16:08tech_exorcist_ (tech_exorcist) joins
11:29:46tech_exorcist_ quits [Ping timeout: 240 seconds]
11:30:44tech_exorcist (tech_exorcist) joins
12:00:11tech_exorcist quits [Remote host closed the connection]
12:00:37tech_exorcist (tech_exorcist) joins
12:25:15tech_exorcist quits [Remote host closed the connection]
12:27:39tech_exorcist (tech_exorcist) joins
12:53:45tech_exorcist quits [Client Quit]
18:16:35Iki1 joins
18:20:22AnotherIki quits [Ping timeout: 265 seconds]
19:49:47qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds]
20:55:07<pokechu22>I seem to be able to get revisions to load with --xmlrevisions as long as I make getXmlHeader always `return "<mediawiki><!-- no header available -->\n", config` and make a few minor changes to getXMLRevisions (which already supports prop=revisions) and makeXmlFromPage... currently trying to export engine.rodovid.org as a proof of concept.
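A minimal sketch of that header workaround, assuming the getXmlHeader(config, session) signature from wikiteam's Python 2 dumpgenerator.py (the actual change in the rodovid branch may look different):

```python
# Sketch only: on this old MediaWiki no usable header can be fetched, so
# always fall back to a stub that still opens the <mediawiki> element the
# rest of the dump expects.
def getXmlHeader(config={}, session=None):
    return '<mediawiki><!-- no header available -->\n', config
```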
20:57:42<pokechu22>(in particular: rvlimit is commented out originally, which breaks pagination (no continuation is provided unless a limit is set); I've just enabled it at 50. And, invalid rvprop values result in an error on this version instead of a warning, but that's easy enough to fix.. makeXmlFromPage needed to be changed to use `for rev in page['revisions'].itervalues()` instead of `for rev in
20:57:44<pokechu22>page['revisions']`, not sure if this is a difference in MW output or something else but modern MW does warn about a format difference in some cases)
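Roughly what those two changes amount to, as a sketch: the API parameter names are standard MediaWiki, the surrounding dumpgenerator.py context is paraphrased, and the rvprop value shown is only an example of sticking to values this version accepts.

```python
# In getXMLRevisions: set an explicit rvlimit so the old API hands back a
# continuation token; with no limit set it never paginates on this wiki.
params = {
    'action': 'query',
    'prop': 'revisions',
    'rvlimit': 50,      # re-enabled; it was commented out upstream
    'rvprop': 'ids|timestamp|user|comment|content',  # this version errors (not warns) on unknown values
    'titles': title,    # `title` comes from the page-title list
}

# In makeXmlFromPage: here page['revisions'] comes through as a dict keyed by
# revision id rather than a list (whether that's a MW-version difference or
# something else is unclear), so iterate over its values (Python 2 idiom).
for rev in page['revisions'].itervalues():
    pass  # build each <revision> block here, as the existing code already does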
20:59:50<pokechu22>So far there doesn't seem to be any rate-limiting on the API, unlike what was previously seen with archivebot on regular pages
21:01:32<pokechu22>OK, I guess clarification: I *do* get a 429 when I try to load the HTML pages while doing this export, so the rate-limiting exists, it's just that the rate-limiting only rejects index.php and not api.php or something like that?
21:03:28<pokechu22>That's good because trying to use a crawl-delay of 9 seconds would be bad; there are ~10000 pages on the engine site (which is more of a test instance of theirs), and 10000 × 9 = 90000 seconds is over a day
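A standalone sketch of how that asymmetry could be double-checked; the host and the /wk/ article path are taken from the discussion here, the User-Agent contact address is a placeholder, and Main_Page is just a guess at an existing title.

```python
# Illustrative probe: fire a short burst of requests at page views and at
# api.php and compare status codes (the observation above was 429s on the
# former but not the latter).
import requests

session = requests.Session()
session.headers['User-Agent'] = 'wikiteam test (contact: example@example.org)'

for i in range(5):
    page = session.get('https://engine.rodovid.org/wk/Special:Version')
    api = session.get('https://engine.rodovid.org/api.php',
                      params={'action': 'query', 'prop': 'revisions',
                              'titles': 'Main_Page', 'format': 'json'})
    print(i, 'page view:', page.status_code, 'api.php:', api.status_code)
```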
21:03:32<michaelblob_>i believe there is no hard limit on read requests through the API
21:04:32<michaelblob_>see https://www.mediawiki.org/wiki/API:Etiquette
21:06:51<pokechu22>See also https://www.mediawiki.org/wiki/API:Etiquette#The_User-Agent_header... what's also odd is that there is a --delay option, but it doesn't seem to be used between pages (I did figure out where to add that if it's needed, but it doesn't seem to be needed on this site)
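For reference, the sort of place a per-page --delay could be wired in; this assumes wikiteam's existing delay() helper (which sleeps for config['delay'] seconds) and is only a sketch of the idea, not the actual change in the branch linked below.

```python
# Hypothetical placement: inside the per-title loop of getXMLRevisions, sleep
# for config['delay'] seconds between titles. Not needed on this site, per the
# discussion above, but this is roughly where it would go.
for title in titles:
    xml = exportRevisionsForTitle(title)   # stand-in for the real per-title export code
    delay(config=config, session=session)  # wikiteam helper: honours the --delay option
```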
21:11:17<pokechu22>https://github.com/Pokechu22/wikiteam/commits/rodovid is what I've got FWIW
21:38:49<pokechu22>Is there any reason why action=query&list=allimages or action=query&generator=allpages&prop=imageinfo is needed? imageinfo and allimages are both not supported, but I feel like just using the list of page titles would be enough...
21:39:29<pokechu22>Oh, wait, it's to get the image filename/URL, isn't it? Just having a page name wouldn't be enough without scraping (and scraping would invoke the rate-limit but is probably doable)
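On a MediaWiki that supports it, list=allimages returns the direct file URL alongside each name, which is what a bare page title cannot provide without scraping; engine.rodovid.org rejects the module, per the message above, so this sketch only shows what the dump code normally relies on.

```python
# Illustrative only: what list=allimages provides on a wiki that supports it.
# Each entry carries both the file name and a ready-to-fetch 'url' field.
import requests

r = requests.get('https://engine.rodovid.org/api.php', params={
    'action': 'query',
    'list': 'allimages',
    'ailimit': 50,
    'aiprop': 'url',
    'format': 'json',
})
for img in r.json().get('query', {}).get('allimages', []):
    print(img['name'], img['url'])
```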
21:46:02igloo22225 (igloo22225) joins
21:46:06<pokechu22>Oh, that's interesting but also a bit annoying: https://engine.rodovid.org/images/4/4e/01_common.gif is a 301 redirect to https://*.rodovid.org/4/4e/01_common.gif which is... not going to work. The actual location per https://engine.rodovid.org/wk/Image:01_common.gif is https://rodvoid.org/4/4e/01_common.gif (note: rodvoid.org, not rodovid.org!)
21:47:15<@JAA>Yeah, found that the other day as well on other wikis.
21:47:26<@JAA>This entire thing is a mess.
21:53:36<pokechu22>I should be able to special-case it at least; there's already a curateImageURL function that massages the URL to something nicer
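A possible shape for that special case; curateImageURL is mentioned above as the existing hook, but this exact rewrite rule is only inferred from the redirect behaviour described a few lines earlier.

```python
# Hypothetical helper for the rodovid case: the /images/ URLs 301-redirect to a
# literal "*.rodovid.org" host, while the files actually live on rodvoid.org,
# so rewrite the host and strip the /images/ prefix directly.
import re

def curateRodovidImageURL(url):
    # https://engine.rodovid.org/images/4/4e/01_common.gif
    #   -> https://rodvoid.org/4/4e/01_common.gif
    return re.sub(r'^https?://[^/]+\.rodovid\.org/images/', 'https://rodvoid.org/', url)
```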
22:27:05<pokechu22>Does getXMLFileDesc / the .desc file generated with an image provide any additional value if there's a separate full xml history dump? getXMLFileDesc uses getXMLPage which doesn't work (and it seems tricky to adapt it to getXMLRevisions)
23:06:32<pokechu22>How does this look? https://transfer.archivete.am/6LHQ0/arrodovidorg-20220907-wikidump.7z
23:08:47<pokechu22>(it looks mostly reasonable to me, but I don't really have a reference to compare it to; there definitely is history in it including for people pages, though)
23:18:49<@JAA>(I'm not all that familiar with the details of the wikidump format, so I can't really comment on it.)
23:24:44<pokechu22>Well, I've started exporting the other subdomains now (not sure how long it'll take, probably several hours, maybe more than a day), so hopefully this is good enough.
23:50:28@rewby quits [Ping timeout: 240 seconds]
23:50:46HackMii quits [Ping timeout: 240 seconds]
23:52:32HackMii (hacktheplanet) joins
23:52:41nepeat_ (nepeat) joins
23:52:54nepeat quits [Ping timeout: 265 seconds]
23:53:46rewby (rewby) joins
23:53:46@ChanServ sets mode: +o rewby