01:01:52tzt quits [Remote host closed the connection]
01:02:14tzt (tzt) joins
04:05:04andrew (andrew) joins
04:05:13Ivan226 joins
06:18:52<yzqzss|m>pokechu22: I can use --xmlapiexport; I'm downloading...
06:20:56<yzqzss|m>Oh, it failed after downloading ~100 pages. Trying with --curonly ...
07:08:09AnotherIki quits [Remote host closed the connection]
07:08:20AnotherIki joins
07:08:40hitgrr8 joins
09:47:35kdqep (kdqep) joins
10:17:38qwertyasdfuiopghjkl quits [Remote host closed the connection]
11:33:43<yzqzss|m>https://archive.org/details/wiki-urbanculturelol
12:56:37qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
15:05:03qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds]
15:22:42<pokechu22>yzqzss|m: The allrevisions API doesn't work, but the prop=revisions one does (where you ask for revisions one page at a time). The implementation of that is fairly borked, but I was able to fix it (at least for this wiki; I remember different wikis behaving differently) and get all revisions out
15:43:29<pokechu22>https://github.com/WikiTeam/wikiteam/pull/462
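For reference, a minimal sketch of the per-page prop=revisions approach described above, assuming a plain requests session; the function name and structure here are illustrative, not the code in the linked PR:

```python
# Illustrative sketch (not the wikiteam code): pull the full history of one page
# via prop=revisions, following the modern "continue" format until exhausted.
# Older wikis return the legacy "query-continue" format instead (see the sketch
# further down for handling that).
import requests

def fetch_page_revisions(api_url, title, session=None):
    session = session or requests.Session()
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": title,   # one title per request; batching titles only yields the latest revision
        "rvlimit": 50,
        "rvprop": "ids|timestamp|user|comment|content",
    }
    revisions = []
    while True:
        data = session.get(api_url, params=params).json()
        for page in data["query"]["pages"].values():
            revisions.extend(page.get("revisions", []))
        if "continue" not in data:
            break
        params.update(data["continue"])
    return revisions
```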
16:04:12kdqep quits [Ping timeout: 252 seconds]
17:38:42<Nemo_bis>Nice pokechu22 !
17:39:15<Nemo_bis>At the time the main focus was getting data out of Wikia, the rest was collateral damage
17:40:16<Nemo_bis>I still have a directory with dumps from 9000 wikis I made in early 2020... I think I uploaded them all to archive.org but I'm not super sure
17:49:43<pokechu22>For wikia I doubt that path was actually being hit; I'd expect special:export and the regular allrevisions path to work for all wikia wikis the same (since they'd all be on the same mediawiki version)
17:55:03<Nemo_bis>No, Special:Export was not an option
17:55:58<pokechu22>Hmm, ok, I'll need to test that further at some point then
17:56:14<pokechu22>But I'm pretty sure the code as it existed before only gave the most recent revision; that's what it was doing for urbanculture.lol at least
17:57:33<Nemo_bis>Yes, I also accidentally uploaded some history dumps which actually contained only the latest revisions, but I tried to correct those. I can't remember the details now.
17:58:23<Nemo_bis>Listing the history.xml.7z files from the wikiteam collection and looking for suspicious size drops is one way to find debugging targets.
17:58:35<Nemo_bis>Otherwise I can provide a list of wikis which finished in an error back then
17:59:55<pokechu22>That'd be helpful, but I won't have time to look into it for a few days
18:00:50<Nemo_bis>They've waited for 3 years. :)
18:24:23<yzqzss|m>pokechu22: I just investigated and wikiteam3 fails after downloading ~100 pages with --xmlapiexport (API:Revisions) because it doesn't handle the special query-continue format for MW 1.21~1.25. I also fixed it.
18:24:32<yzqzss|m>https://github.com/mediawiki-client-tools/mediawiki-scraper/pull/144
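The fix being discussed concerns continuation handling: roughly MW 1.21-1.25 (when continue= is not passed in the request) answers with a legacy "query-continue" block rather than the modern "continue" block. A hedged sketch of handling both formats, not the actual patch in either PR:

```python
# Illustrative only: update the request params from whichever continuation
# format the wiki returned, and report whether another request is needed.
def apply_continuation(params, data):
    if "continue" in data:                # modern format (MW 1.26+, or earlier with continue= passed)
        params.update(data["continue"])
        return True
    if "query-continue" in data:          # legacy format used by roughly MW 1.21-1.25 and older
        for module_cont in data["query-continue"].values():
            params.update(module_cont)    # e.g. {"rvcontinue": "..."} for prop=revisions
        return True
    return False
```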
18:25:40<pokechu22>If it fails on the page Kiss that's the same issue I ran into and fixed
18:25:57<pokechu22>though I changed different code, so that's a bit odd
18:26:12<yzqzss|m>Yes, page Kiss
18:30:58<pokechu22>Another small fix: https://github.com/WikiTeam/wikiteam/pull/463
18:48:12<yzqzss|m>Thanks, wikiteam3 has fixed this before. https://github.com/saveweb/wikiteam3/blob/fd3bb2ee37f914a50a29518ddbfc71dccd106ecc/wikiteam3/dumpgenerator/dump/page/xmlrev/xml_revisions.py#L286
18:49:17<pokechu22>Hmm, interesting, you still have the rvlimit: 50 line commented out: https://github.com/saveweb/wikiteam3/blob/00b47646e83871ff6d117bf53eae034cde7540ba/wikiteam3/dumpgenerator/dump/page/xmlrev/xml_revisions.py#L274
18:49:43<pokechu22>actually, you have the same issue as I fixed in my PR here too: https://github.com/saveweb/wikiteam3/blob/00b47646e83871ff6d117bf53eae034cde7540ba/wikiteam3/dumpgenerator/dump/page/xmlrev/xml_revisions.py#L329
18:49:56<pokechu22>(though you fixed the pparams one)
19:11:07<pokechu22>Oh, and one other thing to consider: right now, if you have a case-insensitive filesystem (which I do, because of Windows jank), images can get clobbered, which both loses data and breaks resuming. We probably should detect this case and rename images, like is done for names that are too long. Perhaps images.txt should include both the original name and the adjusted name? (And while we're at it, maybe also add other metadata from the API such as size and hash?)
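One possible shape for the collision handling proposed above, assuming filenames are tracked in memory during the dump; the renaming scheme and the original-name/adjusted-name mapping are suggestions here, not existing wikiteam behaviour:

```python
# Sketch: detect filenames that collide on a case-insensitive filesystem and
# rename later arrivals, keeping an original -> on-disk mapping that images.txt
# could record alongside other metadata.
import hashlib

def resolve_collision(filename, seen_lowercase, name_map):
    if filename in name_map:              # already resolved (e.g. when resuming)
        return name_map[filename]
    key = filename.lower()
    if key in seen_lowercase:
        # A different-case variant was already written: disambiguate with a short hash.
        suffix = hashlib.sha1(filename.encode("utf-8")).hexdigest()[:8]
        stem, dot, ext = filename.rpartition(".")
        adjusted = f"{stem}_{suffix}.{ext}" if dot else f"{filename}_{suffix}"
    else:
        adjusted = filename
    seen_lowercase.add(key)
    name_map[filename] = adjusted
    return adjusted
```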
19:14:31<Nemo_bis>We shouldn't be uploading such image dumps to archive.org, as there is no way to restore them into MediaWiki
19:15:00<Nemo_bis>One option would be to write directly to a compressed archive which preserves case
19:15:18<pokechu22>Do the ones where files get renamed because their name is too long work when importing into mediawiki?
19:15:24<yzqzss|m>wikiteam3 is already using size and sha1 for file deduplication and download resume. see: wikiteam3/dumpgenerator/dump/image/image.py#L45
19:15:54<Nemo_bis>Yes, the truncation also makes the image dumps unusable
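The size/sha1 resume check yzqzss|m mentions might look roughly like this (the real logic lives in wikiteam3/dumpgenerator/dump/image/image.py; this standalone version is only an approximation):

```python
# Skip re-downloading a file when the copy on disk already matches the size
# and SHA-1 that the MediaWiki API reports for the image.
import hashlib, os

def already_downloaded(path, expected_size, expected_sha1):
    if not os.path.exists(path) or os.path.getsize(path) != expected_size:
        return False
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha1.update(chunk)
    return sha1.hexdigest() == expected_sha1
```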
19:16:38<pokechu22>Yeah, in that case, it's probably better to save something even if it's not directly usable, instead of saving nothing... but still not great :/
19:17:10<pokechu22>The silent clobbering with case-insensitive filesystems is a pretty big problem though; I think it affects a decent number of my dumps
19:18:13<Nemo_bis>It would probably be best to not do image dumps on Windows for now
19:18:33<Nemo_bis>Unless it's a last minute archival of a wiki which we know is about to die
19:19:25<Nemo_bis>Or we could mark them differently, perhaps with their own filename, so that people know there might be surprises inside
19:19:50<yzqzss|m>Perhaps a MediaWiki-style HashedUploadDirectory layout could be used to solve this problem.
19:19:59<yzqzss|m>https://www.mediawiki.org/wiki/Manual:$wgHashedUploadDirectory
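For context, MediaWiki's hashed upload layout places a file under directories derived from the MD5 of its (underscored) name, e.g. images/a/ab/Foo.jpg; a small sketch of computing that path:

```python
# Compute the $wgHashedUploadDirectory-style relative path for a filename.
import hashlib

def hashed_upload_path(filename):
    name = filename.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return f"{digest[0]}/{digest[:2]}/{name}"
```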
19:20:39<Nemo_bis>It reduces the clobbering but not sure it helps with restoring
19:21:10<pokechu22>Based on what I see at https://www.mediawiki.org/wiki/Manual:ImportImages.php it seems like restoring is a giant pain anyways
19:21:47<Nemo_bis>no doubt :)
19:23:15<Nemo_bis>yzqzss|m: are you uploading dumps with deduplicated images to the wikiteam collection?
19:24:09<pokechu22>I do try to add notes to things I upload when I've done something weird, but I don't always remember, because I tend to upload things in groups a few days after the fact (note also that I'm uploading manually, not using the bulk upload tool, though I probably should switch to that at some point)
19:25:36<pokechu22>Also, if I have to change the way tools work (e.g. hacking in different retry behavior), I try to include a patch in the thing I upload, and I also include a copy of the output in the upload (most of the time), though that doesn't include information about clobbered images :/
19:25:43<yzqzss|m>My earlier wording was off. I just pre-copied the old wikidump's images directory into the new wikidump with cp --reflink on btrfs. That way the files have the same hash and won't be downloaded again, and it doesn't consume extra disk space either.
19:27:53<Nemo_bis>pokechu22: Using launcher.py and uploader.py helps find common issues like broken archives and it's how I remained relatively sane while archiving thousands of wikis, so I agree you probably should try to do that (and codify some of your tricks into their code).
19:28:19<Nemo_bis>otoh if you're focusing on wikis which previously failed an artisanal approach is also needed
19:29:01<Nemo_bis>yzqzss|m: Ah. Yes, handling the deduplication at the fs level is the best approach (sorry, I haven't read the code yet)
19:29:32<pokechu22>Yeah, a lot of the ones I've looked at previously failed - random Russian and other self-hosted wikis in particular. If I just had a giant list of Fandom links where everything behaved the same, automation would be a lot easier
19:30:13<Nemo_bis>The wikidump.7z is created with 7z -mx=1 in the hope that the 7z scanning and ordering will manage to put duplicates next to each other and save some space on compression
19:31:02<Nemo_bis>Indeed. If you archive Wikia wikis, they tend to fail in similar ways.
19:33:03Craigle quits [Quit: The Lounge - https://thelounge.chat]
19:33:34<Nemo_bis>It's also worth poking Wikia devs every now and then, they do try to fix the dumps when they have some extra time
19:34:43Craigle (Craigle) joins
19:35:58<Nemo_bis>And I don't want to be always the one pestering them ;)
19:50:49<yzqzss|m>Try printing the response of item.modify_metadata() in the wikiteam (py2) uploader; you may see that it fails. <https://github.com/saveweb/wikiteam3/blob/python3/wikiteam3/uploader.py#L334~L340>
19:51:55<pokechu22>Just to clarify, I haven't actually tried using the uploader, as I haven't looked into setting up IA access keys yet. It's definitely something I should do though
20:39:09hitgrr8 quits [Client Quit]
20:47:07<yzqzss|m>Also, after passing `rvlimit`, do not pass multiple titles to pparams (keep `batch=False`), otherwise you will only get the latest rev. <https://www.mediawiki.org/wiki/API:Revisions#Possible_errors> --> `rvmultpages`
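To illustrate the constraint above (the second title below is made up): prop=revisions only accepts rvlimit for a single page, so batching titles either errors with rvmultpages or leaves you with only the latest revision per page.

```python
# OK: one title, full history allowed.
params_ok = {"action": "query", "format": "json", "prop": "revisions",
             "titles": "Kiss", "rvlimit": "max", "rvprop": "ids|timestamp|content"}

# Not OK: multiple titles together with rvlimit -> "rvmultpages" error.
params_bad = {"action": "query", "format": "json", "prop": "revisions",
              "titles": "Kiss|Hug", "rvlimit": "max", "rvprop": "ids|timestamp|content"}
```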
21:23:46vitzli (vitzli) joins
21:46:40Bedivere quits [Ping timeout: 252 seconds]
22:01:04Bedivere joins
22:29:43tzt quits [Ping timeout: 265 seconds]
22:30:47tzt (tzt) joins
22:43:20Ivan226 quits [Remote host closed the connection]
22:44:35Ivan226 joins
22:51:06qw3rty_ joins
22:51:30atphoenix_ (atphoenix) joins
22:52:25Ivan22666 joins
22:54:51Ivan226 quits [Ping timeout: 265 seconds]
22:55:30Ivan22666 is now known as Ivan226
22:56:18qw3rty quits [Ping timeout: 265 seconds]
22:56:18atphoenix quits [Ping timeout: 265 seconds]
23:02:06Matthww1 quits [Quit: Ping timeout (120 seconds)]
23:03:00Matthww1 joins
23:23:22vokunal|m joins
23:29:10Matthww1 quits [Client Quit]
23:30:06Matthww1 joins