01:01:52 | | tzt quits [Remote host closed the connection] |
01:02:14 | | tzt (tzt) joins |
04:05:04 | | andrew (andrew) joins |
04:05:13 | | Ivan226 joins |
06:18:52 | <yzqzss|m> | pokechu22: I can use --xmlapiexport; I'm downloading...
06:20:56 | <yzqzss|m> | Oh, it failed after downloading ~100 pages. Trying with --curonly ...
07:08:09 | | AnotherIki quits [Remote host closed the connection] |
07:08:20 | | AnotherIki joins |
07:08:40 | | hitgrr8 joins |
09:47:35 | | kdqep (kdqep) joins |
10:17:38 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
11:33:43 | <yzqzss|m> | https://archive.org/details/wiki-urbanculturelol |
12:56:37 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
15:05:03 | | qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds] |
15:22:42 | <pokechu22> | yzqzss|m: The allrevisions API doesn't work, but the prop=revisions one does (where you ask for revisions one page at a time). But the implementation of that is fairly borked. I was able to fix it (at least for this wiki; I remember different wikis behaving differently) and get all revisions out
15:43:29 | <pokechu22> | https://github.com/WikiTeam/wikiteam/pull/462 |
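A minimal sketch of the per-page prop=revisions approach pokechu22 describes, assuming a standard MediaWiki api.php endpoint; the URL, parameter set, and function name are placeholders rather than the actual wikiteam code:

```python
# Fetch the full history of one page at a time via prop=revisions,
# following continuation parameters, instead of relying on list=allrevisions.
import requests

API = "https://example.org/w/api.php"  # placeholder endpoint

def fetch_page_history(title, session=None):
    s = session or requests.Session()
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|comment|content",
        "rvlimit": 50,        # up to 50 revisions per request (non-bot limit with content)
        "rvdir": "newer",     # oldest revision first
        "format": "json",
    }
    revisions = []
    while True:
        data = s.get(API, params=params).json()
        for page in data["query"]["pages"].values():
            revisions.extend(page.get("revisions", []))
        cont = data.get("continue") or data.get("query-continue", {}).get("revisions")
        if not cont:
            break
        params.update(cont)   # rvcontinue / rvstartid, depending on MW version
    return revisions
```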
16:04:12 | | kdqep quits [Ping timeout: 252 seconds] |
17:38:42 | <Nemo_bis> | Nice pokechu22 ! |
17:39:15 | <Nemo_bis> | At the time the main focus was getting data out of Wikia, the rest was collateral damage |
17:40:16 | <Nemo_bis> | I still have a directory with dumps from 9000 wikis I made in early 2020... I think I uploaded them all to archive.org but I'm not super sure |
17:49:43 | <pokechu22> | For Wikia I doubt that path was actually being hit; I'd expect Special:Export and the regular allrevisions path to work the same for all Wikia wikis (since they'd all be on the same MediaWiki version)
17:55:03 | <Nemo_bis> | No, Special:Export was not an option |
17:55:58 | <pokechu22> | Hmm, ok, I'll need to test that further at some point then |
17:56:14 | <pokechu22> | But I'm pretty sure the code as it existed before only gave the most recent revision; that's what it was doing for urbanculture.lol at least |
17:57:33 | <Nemo_bis> | Yes I also accidentally uploaded some history dumps which actually were only latest revisions, but I tried to correct these. Can't remember the details now. |
17:58:23 | <Nemo_bis> | Listing the history.xml.7z files from the wikiteam collection and looking for suspicious size drops is one way to find debugging targets. |
17:58:35 | <Nemo_bis> | Otherwise I can provide a list of wikis which finished in an error back then |
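A rough sketch of the size-drop scan Nemo_bis suggests, assuming the `internetarchive` Python library; the collection query, the `-history.xml.7z` suffix check, and the size threshold are illustrative choices, and flagging implausibly small files is only a crude proxy for spotting real drops:

```python
# List *-history.xml.7z files in the wikiteam collection and flag tiny ones
# as candidates for "history dumps that only contain latest revisions".
from internetarchive import search_items, get_item

SUSPICIOUS_BYTES = 100 * 1024  # threshold is a guess; tune as needed

for result in search_items("collection:wikiteam"):
    item = get_item(result["identifier"])
    for f in item.files:
        name = f.get("name", "")
        if name.endswith("-history.xml.7z") and int(f.get("size", 0)) < SUSPICIOUS_BYTES:
            print(f"{item.identifier}\t{name}\t{f.get('size')} bytes")
```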
17:59:55 | <pokechu22> | That'd be helpful, but I won't have time to look into it for a few days |
18:00:50 | <Nemo_bis> | They've waited for 3 years. :) |
18:24:23 | <yzqzss|m> | pokechu22: I just investigated and wikiteam3 fails after downloading ~100 pages with --xmlapiexport (API:Revisions) because it doesn't handle the special query-continue format for MW 1.21-1.25. I also fixed it.
18:24:32 | <yzqzss|m> | https://github.com/mediawiki-client-tools/mediawiki-scraper/pull/144 |
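A sketch of the continuation handling that fix is about, assuming the caller keeps a mutable params dict; the function name is invented for illustration. MediaWiki 1.26+ returns a top-level `continue` object, while 1.21-1.25 use the legacy `query-continue` layout that tripped up --xmlapiexport here:

```python
def apply_continuation(params, data):
    """Update request params in place; return False when there is no more data."""
    if "continue" in data:                      # modern format (MW >= 1.26)
        params.update(data["continue"])
        return True
    if "query-continue" in data:                # legacy format (MW 1.21-1.25 and older)
        for module_cont in data["query-continue"].values():
            params.update(module_cont)          # e.g. {"rvcontinue": ...} or {"rvstartid": ...}
        return True
    return False
```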
18:25:40 | <pokechu22> | If it fails on the page Kiss that's the same issue I ran into and fixed |
18:25:57 | <pokechu22> | though I changed different code, so that's a bit odd |
18:26:12 | <yzqzss|m> | Yes, page Kiss |
18:30:58 | <pokechu22> | Another small fix: https://github.com/WikiTeam/wikiteam/pull/463 |
18:48:12 | <yzqzss|m> | Thanks, wikiteam3 already fixed this earlier. https://github.com/saveweb/wikiteam3/blob/fd3bb2ee37f914a50a29518ddbfc71dccd106ecc/wikiteam3/dumpgenerator/dump/page/xmlrev/xml_revisions.py#L286
18:49:17 | <pokechu22> | Hmm, interesting, you still have the rvlimit: 50 line commented out: https://github.com/saveweb/wikiteam3/blob/00b47646e83871ff6d117bf53eae034cde7540ba/wikiteam3/dumpgenerator/dump/page/xmlrev/xml_revisions.py#L274 |
18:49:43 | <pokechu22> | actually, you have the same issue as I fixed in my PR here too: https://github.com/saveweb/wikiteam3/blob/00b47646e83871ff6d117bf53eae034cde7540ba/wikiteam3/dumpgenerator/dump/page/xmlrev/xml_revisions.py#L329 |
18:49:56 | <pokechu22> | (though you fixed the pparams one) |
19:11:07 | <pokechu22> | Oh, and one other thing to consider: right now, if you have a case-insensitive filesystem (which I do, because Windows jank) images can get clobbered, which both loses data and breaks resuming. We probably should detect this case and rename images like is done for ones that are too long. Perhaps images.txt should include both the original name and the adjusted name? (And while |
19:11:09 | <pokechu22> | we're at it maybe also add other metadata from the API such as size and hash?) |
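A sketch of the collision handling pokechu22 floats, assuming image downloads go through a single helper; `seen_lower`, the `_caseN` suffix, and the images.txt layout are invented for illustration, not an existing wikiteam feature:

```python
# Before saving an image, check whether a name differing only in case was
# already written; if so, store the file under an adjusted name and record
# both names so nothing gets silently clobbered on case-insensitive filesystems.
import os

def safe_image_path(images_dir, filename, seen_lower):
    """Return (path, adjusted_name); adjusted_name is None if unchanged."""
    key = filename.lower()
    if key not in seen_lower:
        seen_lower[key] = filename
        return os.path.join(images_dir, filename), None
    # Collision: pick a deterministic rename, e.g. a counter before the extension.
    base, ext = os.path.splitext(filename)
    n = 2
    while f"{base}_case{n}{ext}".lower() in seen_lower:
        n += 1
    adjusted = f"{base}_case{n}{ext}"
    seen_lower[adjusted.lower()] = filename
    return os.path.join(images_dir, adjusted), adjusted

# images.txt could then carry "original_name<TAB>adjusted_name" so the mapping
# survives for anyone trying to restore the dump later.
```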
19:14:31 | <Nemo_bis> | We shouldn't be uploading such image dumps to archive.org, as there is no way to restore them into MediaWiki |
19:15:00 | <Nemo_bis> | One option would be to write directly to a compressed archive which preserves case
19:15:18 | <pokechu22> | Do the ones where files get renamed because their name is too long work when importing into mediawiki? |
19:15:24 | <yzqzss|m> | wikiteam3 already uses size and sha1 for file deduplication and download resume. See: wikiteam3/dumpgenerator/dump/image/image.py#L45
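For reference, an illustrative size+sha1 resume check along those lines; this is a standalone sketch, not the actual code at image.py#L45:

```python
# Skip re-downloading a file whose on-disk size and SHA-1 already match
# what the API reported for that image.
import hashlib, os

def already_downloaded(path, expected_size, expected_sha1):
    if not os.path.isfile(path) or os.path.getsize(path) != expected_size:
        return False
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected_sha1
```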
19:15:54 | <Nemo_bis> | Yes the truncation also makes the image dumps unusable |
19:16:38 | <pokechu22> | Yeah, in that case, it's probably better to save something even if it's not directly usable, instead of saving nothing... but still not great :/ |
19:17:10 | <pokechu22> | The silent clobbering on case-insensitive filesystems is a pretty big problem though; I think it affects a decent number of my dumps
19:18:13 | <Nemo_bis> | It would probably be best to not do image dumps on Windows for now |
19:18:33 | <Nemo_bis> | Unless it's a last minute archival of a wiki which we know is about to die |
19:19:25 | <Nemo_bis> | Or we could mark them differently, perhaps with their own filename, so that people know there might be surprises inside |
19:19:50 | <yzqzss|m> | Perhaps a MediaWiki-style $wgHashedUploadDirectory layout could be used to solve this problem.
19:19:59 | <yzqzss|m> | https://www.mediawiki.org/wiki/Manual:$wgHashedUploadDirectory |
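A sketch of the $wgHashedUploadDirectory layout, which with the default two hash levels places a file under the first one and two hex digits of the md5 of its underscore-normalized name; the function name is illustrative:

```python
# MediaWiki stores "Foo bar.png" as <d>/<dd>/Foo_bar.png, where <d> and <dd>
# are prefixes of md5("Foo_bar.png") and spaces are replaced by underscores.
import hashlib, os

def hashed_upload_path(filename, levels=2):
    name = filename.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    parts = [digest[: i + 1] for i in range(levels)]   # e.g. ["a", "ab"]
    return os.path.join(*parts, name)

print(hashed_upload_path("Foo bar.png"))   # something like "x/xy/Foo_bar.png"
```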
19:20:39 | <Nemo_bis> | It reduces the clobbering, but I'm not sure it helps with restoring
19:21:10 | <pokechu22> | Based on what I see at https://www.mediawiki.org/wiki/Manual:ImportImages.php it seems like restoring is a giant pain anyways |
19:21:47 | <Nemo_bis> | no doubt :) |
19:23:15 | <Nemo_bis> | yzqzss|m: are you uploading dumps with deduplicated images to the wikiteam collection? |
19:24:09 | <pokechu22> | I do try to add notes to things I upload when I've done something weird, but I don't always remember because I tend to upload things in groups a few days after I did it (note also that I'm manually uploading, not using the bulk upload tool, though I probably should switch to that at some point) |
19:25:36 | <pokechu22> | also, if I have to change the way tools work (e.g. hacking in different retry behavior) I try to include a patch in the thing I upload, and I also include a copy of the output in the upload (most of the time), though that doesn't include information about clobbered images :/
19:25:43 | <yzqzss|m> | My earlier wording was misleading: I just pre-copied the old wikidump's images directory to the new wikidump with cp --reflink on btrfs. That way they have the same hash and won't be downloaded again, and it doesn't consume extra disk space either.
19:27:53 | <Nemo_bis> | pokechu22: Using launcher.py and uploader.py helps find common issues like broken archives and it's how I remained relatively sane while archiving thousands of wikis, so I agree you probably should try to do that (and codify some of your tricks into their code). |
19:28:19 | <Nemo_bis> | otoh if you're focusing on wikis which previously failed, an artisanal approach is also needed
19:29:01 | <Nemo_bis> | yzqzss|m: Ah. Yes, handling the deduplication at fs level is the best (sorry I didn't read the code yet) |
19:29:32 | <pokechu22> | Yeah, a lot of the ones I've looked at previously failed - random Russian and other self-hosted wikis in particular. If I just had a giant list of Fandom links where everything should behave the same, automation would be a lot easier
19:30:13 | <Nemo_bis> | The wikidump.7z is created with 7z -mx=1 in the hope that the 7z scanning and ordering will manage to put duplicates next to each other and save some space on compression |
19:31:02 | <Nemo_bis> | Indeed. If you archive Wikia wikis, they tend to fail in similar ways. |
19:33:03 | | Craigle quits [Quit: The Lounge - https://thelounge.chat] |
19:33:34 | <Nemo_bis> | It's also worth poking Wikia devs every now and then; they do try to fix the dumps when they have some extra time
19:34:43 | | Craigle (Craigle) joins |
19:35:58 | <Nemo_bis> | And I don't want to be always the one pestering them ;) |
19:50:49 | <yzqzss|m> | Try printing the Response from item.modify_metadata() in the wikiteam (py2) uploader; you may see that it fails. <https://github.com/saveweb/wikiteam3/blob/python3/wikiteam3/uploader.py#L334~L340>
19:51:55 | <pokechu22> | Just to clarify, I haven't actually tried using the uploader, as I haven't looked into setting up IA access keys yet. It's definitely something I should do though |
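A sketch of the check yzqzss|m suggests, assuming the `internetarchive` Python library, where item.modify_metadata() returns a requests.Response; the identifier and metadata values are placeholders:

```python
# Surface failures from modify_metadata() instead of ignoring its return value.
from internetarchive import get_item

item = get_item("wiki-example-dump")                # placeholder identifier
resp = item.modify_metadata({"subject": "wikiteam"})
print(resp.status_code, resp.text)                  # non-2xx status or an error body means it failed
if not resp.ok:
    raise RuntimeError(f"metadata update failed: {resp.text}")
```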
20:39:09 | | hitgrr8 quits [Client Quit] |
20:47:07 | <yzqzss|m> | Also, after passing `rvlimit`, do not pass multiple titles to pparams (keep `batch=False`); otherwise you will only get the latest rev. <https://www.mediawiki.org/wiki/API:Revisions#Possible_errors> --> `rvmultpages`
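A small illustration of that constraint, with placeholder parameter values: once `rvlimit` is set, each request must carry exactly one title, or the API answers with the `rvmultpages` error (or, as noted above, you end up with only the latest revision):

```python
# Placeholder sketch: build prop=revisions parameters for exactly one title.
def revision_params(title):
    return {
        "action": "query",
        "prop": "revisions",
        "titles": title,       # one title per request once rvlimit is set
        "rvlimit": 50,
        "rvprop": "ids|timestamp|user|comment|content",
        "format": "json",
    }

# Not this: {"titles": "Kiss|Main_Page", "rvlimit": 50, ...}
# -> "rvmultpages" error (or only the latest revision on older MediaWiki)
```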
21:23:46 | | vitzli (vitzli) joins |
21:46:40 | | Bedivere quits [Ping timeout: 252 seconds] |
22:01:04 | | Bedivere joins |
22:29:43 | | tzt quits [Ping timeout: 265 seconds] |
22:30:47 | | tzt (tzt) joins |
22:43:20 | | Ivan226 quits [Remote host closed the connection] |
22:44:35 | | Ivan226 joins |
22:51:06 | | qw3rty_ joins |
22:51:30 | | atphoenix_ (atphoenix) joins |
22:52:25 | | Ivan22666 joins |
22:54:51 | | Ivan226 quits [Ping timeout: 265 seconds] |
22:55:30 | | Ivan22666 is now known as Ivan226 |
22:56:18 | | qw3rty quits [Ping timeout: 265 seconds] |
22:56:18 | | atphoenix quits [Ping timeout: 265 seconds] |
23:02:06 | | Matthww1 quits [Quit: Ping timeout (120 seconds)] |
23:03:00 | | Matthww1 joins |
23:23:22 | | vokunal|m joins |
23:29:10 | | Matthww1 quits [Client Quit] |
23:30:06 | | Matthww1 joins |