00:40:07sralracer quits [Client Quit]
00:49:23wickedplayer494 quits [Remote host closed the connection]
00:49:40wickedplayer494 joins
01:30:34<@JAA>Barto++
01:30:34<eggdrop>[karma] 'Barto' now has 12 karma!
01:51:22<nicolas17>JAA: https://transfer.archivete.am/inline/ew4AR/qwarc.txt now-what.gif
01:51:48<@JAA>nicolas17: For starters, don't install the master branch.
01:51:48<nicolas17>do I have too old python or
01:52:20nicolas17 tries last tag instead
01:52:44<@JAA>But the issue comes from a backwards-incompatible change in an aiohttp dependency that broke everything a while back.
01:52:58<nicolas17>sounds damn familiar
01:53:04<nicolas17>like I fought this exact issue before
01:53:12<@JAA>async-timeout==3.0.1
01:53:37<nicolas17>would it be possible to pin it in qwarc's setup.py to avoid the issue?
01:54:25<nicolas17>okay that works, and now it asks me for a specfile
01:54:33<nicolas17>which I guess is a python script using an undocumented API
01:56:00<@JAA>Correct
01:56:14<@JAA>And yeah, I should add a pin.
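A minimal sketch of the pin being discussed, assuming a setuptools-based setup.py; the package metadata and the rest of qwarc's dependency list below are placeholders, not the real file:

    # sketch only -- qwarc's actual setup.py carries more metadata and dependencies
    from setuptools import setup

    setup(
        name='qwarc',
        install_requires=[
            'aiohttp',               # placeholder; the real aiohttp pin may differ
            'async-timeout==3.0.1',  # pins the version that predates the backwards-incompatible change
        ],
    )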
01:56:33<nicolas17>...does nothing else support dedup?
01:57:13<@JAA>wget does (but writes faulty WARCs, although that seems to be getting fixed soon, finally).
01:57:36<@JAA>You could try wget-at, if you're not on ARM.
01:58:11<nicolas17>are there qwarc examples somewhere at least?
01:58:13<@JAA>Or rather: if you're on x86-64
01:58:34<@JAA>In my IA uploads. qwarc is self-documenting and writes the spec file to the meta WARC.
01:58:40nicolas17 puts RISC-V machine back down
02:05:37<@JAA>... it's been 3 years since the last qwarc version? Wat?
02:05:50<nicolas17>time is fake
02:06:29<@JAA>Yeah, must be.
02:09:41<nicolas17>so far cloning gnulib git repo has been the slowest step
02:16:30<@JAA>Cloning what and why?
02:16:47<nicolas17>wget-at has gnulib as a submodule
02:17:23<@JAA>Ah, you're trying wget-at now, right.
02:18:12<@OrIdow6>Yeah it's super slow/times out sometimes, don't know why
02:18:15<nicolas17>using the Dockerfile built an image that can't run because it *only* has wget-at and its libraries, but no libc or anything, because it uses "FROM scratch"? what?
02:18:20<@OrIdow6>Good thing is you only have to do it once
02:18:45<@JAA>Yeah, the wget-at image isn't really meant to be used directly.
02:19:24<@JAA>Or wget-lua image, I suppose.
02:19:27decky quits [Ping timeout: 252 seconds]
02:19:31<@JAA>Yay, legacy naming.
02:19:44<nicolas17>and if I change "FROM scratch" to "FROM debian:bookworm-slim" it has to download gnulib and compile wget again, because *any* change to Dockerfile changes the effects of the previous statement "COPY . /tmp/wget"
02:20:24<@JAA>You could just pull the grab-base image from atdr.
02:20:26<nicolas17>why are computers
02:20:58<@JAA>It's meant for DPoS, but it should be possible to just use it to run wget-at.
02:21:01<nicolas17>also fun how c-ares took much longer to compile than wget, if we ignore the time needed to clone gnulib and the autoconf nonsense
02:22:18decky_e joins
02:41:26<nicolas17>the --progress option seems broken
02:42:29<nicolas17>--progress=bar or --progress=dot:mega etc, none changes anything, it keeps flooding my console with one dot per KB
02:45:50HP_Archivist quits [Quit: Leaving]
02:53:31<nicolas17>hm this initial test seems fine, but I should probably run this for real on my other computer so that I can actually fit a whole InstallAssistant in RAM
02:57:29cow_2001 quits [Quit: ✡]
02:58:37cow_2001 joins
02:59:25<nicolas17>wtf it's appending to the -O file?
03:01:56nulldata quits [Quit: Ping timeout (120 seconds)]
03:02:02<nicolas17>"wget-at -O wget.tmp -i list.txt--warc-file=test" downloaded the 12GB file into wget.tmp, then wrote the 12GB data into test.warc, and now wget.tmp is growing more (with the content of the second URL?), I expected it to get truncated for the second download
03:02:49nulldata (nulldata) joins
03:06:02<@JAA>Yes, it should truncate. That's clearly not your exact command since there's no space between txt and --.
03:06:15<nicolas17>...pipelines pass --truncate-output
03:06:16<nicolas17>I see
03:07:01<@JAA>Oh, right, that's a wget-at thing, I think.
03:08:38<nicolas17>I think I can also just use -O /dev/null and save some temporary disk space
03:08:48<nicolas17>since it uses a separate temporary file for warc purposes anyway
03:10:51<@JAA>Since you don't need to extract links or similar, that sounds plausible, yeah.
03:11:31<nicolas17>hm the .cdx only has the first URL
03:30:34pete joins
03:31:04pete quits [Client Quit]
03:39:22nic8693102004 (nic) joins
03:53:57<TheTechRobo>nicolas17: You can do `FROM <wget-at image> as wget` and then `COPY --from=wget /path/to/wget-lua /usr/bin/wget-lua`, for the record.
03:55:03<TheTechRobo>You don't need a runnable container to copy data out of it.
03:55:13<nicolas17>on a separate dockerfile you mean?
03:55:20<TheTechRobo>Yeah
03:55:29<TheTechRobo>IIRC that's what I did
03:55:53<nicolas17>(editing this dockerfile in any way caused a rebuild of wget because it was considered part of the source code being copied in a build step)
03:55:54<TheTechRobo>Although I haven't used wget-at outside of docker for probably at least a year now.
03:56:02etnguyen03 quits [Remote host closed the connection]
03:57:54wyatt8740 quits [Ping timeout: 252 seconds]
03:58:15loug8318142 quits [Quit: The Lounge - https://thelounge.chat]
03:58:57<nicolas17>augh
03:59:10<nicolas17>JAA: "VEILPUTFJNJQAAAAAAAAAAAAAAAAAAAA" is this the bug mentioned recently in #archiveteam-dev?
03:59:19<@JAA>Sure looks like it.
04:00:17<nicolas17>it's joever
04:00:21<nicolas17>every warc tool is broken
04:01:49<TheTechRobo>Enjoy the broken digests. :-)
04:02:02wyatt8740 joins
04:03:04<TheTechRobo>JAA: Do you think it's reasonable to add an option to test for the broken digest bug in warc-tiny so it doesn't flood the logs so much?
04:03:48<TheTechRobo>Similar to how it detects the broken handling of transfer-encoding, but as an option since only Wget-AT is affected
04:04:31<nicolas17>this is a 104-byte download btw
04:04:48ducky quits [Ping timeout: 260 seconds]
04:05:29<TheTechRobo>nicolas17: Not sure what the exact odds of hitting the bug are, but I think you just got unlucky.
04:05:36<nicolas17>https://swcdn.apple.com/content/downloads/13/38/072-11038-A_8VILF7KGLR/ekuwqfer80bkta2a6l6hn9flavknip4edt/MajorOSInfo.pkg.integrityDataV1
04:06:17ducky (ducky) joins
04:07:47<@OrIdow6><nicolas17> every warc tool is broken
04:07:49<h2ibot>OrIdow6 edited The WARC Ecosystem (+155, /* Tools */ wget-lua to yellow): https://wiki.archiveteam.org/?diff=53840&oldid=53766
04:07:49<@OrIdow6>Good point...
04:08:03<nicolas17>ffs
04:08:14<TheTechRobo>Oh, right, it affects dedupe too.
04:08:30<@OrIdow6>I don't think anyone's tested that but apparently that's what it looked like
04:08:45<TheTechRobo>Yeah, the copied hash is what's passed to the deduplication function.
04:09:01TheTechRobo wonders how many records have been incorrectly deduplicated
04:09:27nulldata quits [Client Quit]
04:09:52<@OrIdow6>Mmmm, probably a lot
04:10:09<TheTechRobo>I'm not so sure
04:10:14<TheTechRobo>URL-agnostic dedupe is per-process
04:10:20nulldata (nulldata) joins
04:10:22<@JAA>1 of 256 hashes will be all 0 bytes with this bug.
04:10:46<TheTechRobo>Oh, those odds are significantly worse than I thought. :-(
04:10:55<@OrIdow6>Like apparently a 100-request item (or multi-item batch) will have a 5% chance of having a false dedup
04:10:55<@JAA>And that's just the biggest chunk of it.
04:11:12<@OrIdow6>No
04:11:14<@JAA>Specifically the probability of the first byte of the has being NUL.
04:11:25<@JAA>hash*
04:11:32<@OrIdow6>Yes, actually, I copied the wrong number but it was about the same as the right one
04:11:40<@OrIdow6>(Replying to myself not to J A A)
04:11:45<@JAA>Collisions on later NUL bytes are also possible but not as likely.
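A back-of-the-envelope check of that ~5% figure, assuming independent uniform SHA-1 bytes and counting only the dominant case where the first digest byte is NUL (which collapses the stored digest to all zero bytes, i.e. all A's in base32):

    p = 1 / 256                          # P(first digest byte is NUL) for one response
    n = 100                              # responses in the hypothetical item
    p_none = (1 - p) ** n                # no collapsed digest at all
    p_one = n * p * (1 - p) ** (n - 1)   # exactly one collapsed digest
    p_false_dedup = 1 - p_none - p_one   # two or more collapse to the same value and falsely dedupe
    print(f"{p_false_dedup:.1%}")        # roughly 5.9%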
04:12:38<@OrIdow6>I'll turn off dedup on Cohost in a bit
04:12:53<nicolas17>is it strcpy'ing the binary hash?
04:12:57<TheTechRobo>Yes
04:13:00<TheTechRobo>strncpy specifically
04:13:02<@JAA>strncpy, but yes
04:13:11<nicolas17>let's quit computers and start a farm
04:13:16<TheTechRobo>:-)
04:13:16<@JAA>Sounds good to me.
04:13:31<@JAA>Oh wait, modern farm equipment is all computers. D:
04:13:42<nicolas17>and fighting tractor DRM
04:16:53<nicolas17>so wtf
04:16:56<nicolas17>isn't this "easy" to fix
04:17:02<TheTechRobo>Yes
04:17:11<TheTechRobo>strncmp -> memcmp, theoretically
04:17:15<TheTechRobo>Er
04:17:23<TheTechRobo>strncpy -> memcpy, theoretically
04:18:53TheTechRobo has just noticed that there is now only one green row on The WARC Ecosystem :-(
04:19:43<nicolas17>let's go find bugs in it
04:21:59<@JAA>And even that green row has bugs inherited from wpull. :-(
04:24:44<nicolas17>also indeed running this on a machine with 24GB RAM was so much better
04:24:54<nicolas17>whole file stays in disk cache
04:27:52<nicolas17>https://archive.org/details/macos-installassistant-24C5073e-warc see you in two hours
04:40:18Guest54 quits [Quit: My MacBook has gone to sleep. ZZZzzz…]
04:41:40Unholy2361924645377131 quits [Ping timeout: 260 seconds]
04:41:55<nicolas17>does this affect archivebot or does that use different software?
04:43:07<nicolas17>it seems to be 1 of the 2 uses of strncpy in the whole codebase so yes this seems easy to fix
04:43:21<nicolas17>src/warc.c:2085: strncpy(sha1_res_payload, sha1_payload, SHA1_DIGEST_SIZE);
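A small Python illustration (not the wget-at code itself) of why the bug surfaces as a trailing run of A's: strncpy() stops copying the 20-byte binary SHA-1 at the first NUL byte and zero-pads the rest, and base32 encodes zero bytes as 'A' characters:

    import base64
    import hashlib

    # brute-force a payload whose digest happens to contain a NUL byte (~7.5% of inputs do)
    i = 0
    while b"\x00" not in hashlib.sha1(str(i).encode()).digest():
        i += 1
    digest = hashlib.sha1(str(i).encode()).digest()

    # what the buggy copy effectively stores: truncated at the first NUL, zero-padded to 20 bytes
    truncated = digest.split(b"\x00")[0].ljust(20, b"\x00")

    print("correct  :", base64.b32encode(digest).decode())
    print("buggy    :", base64.b32encode(truncated).decode())  # note the trailing A's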
05:27:00<nicolas17>JAA: I finished uploading the item
05:27:36<nicolas17>I left the .warc uncompressed because it's mainly a giant already-compressed file, is that a problem? are there tools that expect .warc to always be .gz/.zstd?
05:36:25HP_Archivist (HP_Archivist) joins
06:06:34ave quits [Quit: Ping timeout (120 seconds)]
06:06:54ave (ave) joins
06:27:12night quits [Remote host closed the connection]
06:27:23night joins
07:03:52BlueMaxima quits [Quit: Leaving]
07:05:51Unholy2361924645377131 (Unholy2361) joins
07:08:04Pedrosso5 joins
07:08:09ScenarioPlanet2 (ScenarioPlanet) joins
07:08:10TheTechRobo2 (TheTechRobo) joins
07:10:25ScenarioPlanet quits [Ping timeout: 255 seconds]
07:10:25Pedrosso quits [Ping timeout: 255 seconds]
07:10:25ScenarioPlanet2 is now known as ScenarioPlanet
07:10:26Pedrosso5 is now known as Pedrosso
07:10:57TheTechRobo quits [Ping timeout: 252 seconds]
07:10:57TheTechRobo2 is now known as TheTechRobo
07:12:45@rewby quits [Ping timeout: 260 seconds]
07:48:13ducky quits [Ping timeout: 260 seconds]
08:18:05<upperbody321|m>So-net Blog (SS Blog), the former Sony Communications blogging business, will end its services on 31 March 2025.
08:18:05<upperbody321|m>https://blog-wn.blog.ss-blog.jp/2024-11-15
08:18:05<upperbody321|m>Sorry if it has already been posted
08:37:05rewby (rewby) joins
08:37:05@ChanServ sets mode: +o rewby
08:42:53Island quits [Read error: Connection reset by peer]
08:54:12xarph quits [Read error: Connection reset by peer]
08:54:30xarph joins
09:02:15BennyOtt (BennyOtt) joins
09:21:48BennyOtt quits [Client Quit]
09:37:14ducky (ducky) joins
09:41:28BennyOtt (BennyOtt) joins
10:12:07<@arkiver>the bug in Wget-AT is now fixed with https://github.com/ArchiveTeam/wget-lua/commit/8adeb442e256ca8c737da19cc0224e1ca09ef266
10:13:26<@arkiver>i will make sure it is propagated to all projects
10:27:29<@arkiver>how i believe this would work out in the Wayback Machine is the following. upon indexing of WARCs (creating the .cdx.gz file from the .warc.gz), records have their hashes recalculated. this means they would end up in the CDX file with the correct hash.
10:28:37<@arkiver>however, hashes used in revisit records are of course not being recalculated - those are assumed to be correct. when a revisit record is resolved, i believe the nearest record matching advertised URL+hash is found and redirected to. if hashes do not match, i believe no redirect would happen.
10:28:43<@arkiver>i will confirm that
10:36:21sralracer joins
10:40:18<@arkiver>but... let's see, is this fixable? to some degree it is, but it would require parsing a ton of data
10:40:52<@arkiver>we do not have any kind of 'global' deduplication with some central collection of hashes against which we deduplicate. deduplication only happens within a single session.
10:41:43<@arkiver>a single session produces one WARC, which always ends up in one megaWARC (it's never split over multiple), and every record in the WARC is clearly associated with the session (or item) it was produced with.
10:43:24<@arkiver>together with the WARC-Refers-To-Date and WARC-Refers-To-Target-URI WARC headers, it is possible with a very high degree of certainty to match records together, and fix hashes in that way.
10:50:15<@arkiver>thinking more about this. while the ideal would be to fix the megaWARCs themselves, it may also be possible to create a second WARC next to the megaWARC with the fixed revisit records. this would require only using the CDX to find "maybe bad revisit records" (all those ending with one or more A's?), then writing fixed versions of these to a WARC and placing this smaller WARC in the item.
10:51:02<@arkiver>of course, this would not fix the records in the megaWARC, but those that are revisit records will already have their digests recalculated upon creating the CDX file.
10:51:34<@arkiver>i'm looking into this, i think there's a possibility
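A rough sketch in Python of the CDX filtering step described above, assuming an uncompressed space-separated CDX with the payload digest in the sixth field (field order varies between CDX flavours) and an arbitrary trailing-'A' threshold:

    import sys

    THRESHOLD = 6  # minimum trailing-'A' run to flag; a genuine digest can end in a few A's by chance

    def trailing_a_run(digest: str) -> int:
        return len(digest) - len(digest.rstrip("A"))

    for line in sys.stdin:
        fields = line.split()
        if len(fields) < 6:
            continue
        digest = fields[5]
        if trailing_a_run(digest) >= THRESHOLD:
            print(line, end="")  # candidate "maybe bad" record, to be re-checked against the megaWARC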
10:58:57Wohlstand quits [Remote host closed the connection]
11:10:52<@arkiver>err that is not the complete story, that is for payloads that have been correctly deduplicated from each other. payloads that have been wrongly deduplicated cannot be fixed
11:12:22@arkiver is not thinking well at the moment :/
11:13:08nulldata0 (nulldata) joins
11:14:46nulldata quits [Ping timeout: 255 seconds]
11:14:47nulldata0 is now known as nulldata
12:00:02Bleo182600722719623 quits [Quit: The Lounge - https://thelounge.chat]
12:02:43Bleo182600722719623 joins
12:05:45LddPotato quits [Ping timeout: 252 seconds]
12:38:44SkilledAlpaca41896 quits [Quit: SkilledAlpaca41896]
12:44:43SkilledAlpaca41896 joins
12:51:21th3z0l4_ quits [Read error: Connection reset by peer]
12:52:17th3z0l4 joins
13:06:48th3z0l4 quits [Ping timeout: 252 seconds]
13:07:23th3z0l4 joins
13:16:03lennier2_ joins
13:19:05lennier2 quits [Ping timeout: 260 seconds]
13:56:34<BennyOtt>What is actually the best option? The "warrior-dockerfile" with the interface, or running each project you want to support individually?
14:07:35Guest54 joins
14:17:52<that_lurker>each project individually would be the best way to go.
14:23:50FartWithFury (FartWithFury) joins
14:25:29<nstrom|m>Yeah primarily because you can run multiple projects at once that way
14:26:46<nstrom|m>Usually individual projects are limited to some extent on how much you can run on a single IP, so even if you have plenty of bandwidth/CPU/RAM you can't usually devote it all to a single project, since the site usually blocks/throttles on their end
14:37:43<FartWithFury>what are you all archiving with and where to?
14:38:24<nstrom|m>Standalone docker images, each project has one
14:39:11<nstrom|m>And from where, a number of VPS/cloud servers and some Linux hardware at home
14:39:18<nstrom|m>For me at least
14:42:45<imer>^same
14:50:40<FartWithFury><3
14:51:22<that_lurker>FartWithFury: grab-site for my own website archives, tubeup or yt-dlp for videos, wikiteam3 for mediawikis... almost everything goes to archive.org
14:51:48<that_lurker>chat_downloader for live chats (youtube, twitch..)
14:54:38<FartWithFury>i'm using httrack, wget (python) and the downloadthemall addon for firefox :) and same up to archive
14:55:09<FartWithFury>then up to archive.org*
15:02:57<BennyOtt>ok, thanks @that_lurker and @nstrom. then I might change something a bit on my side, since "warrior-dockerfile" hasn't received an update yet, even though it's easy to manage when new projects come along.
15:04:30<that_lurker>you can also just run multiple instances of the warrior-docker if you want a gui experience
15:06:33<BennyOtt>yes, I did that before too
15:31:05Mateon1 quits [Quit: Mateon1]
15:41:30VerifiedJ9 quits [Quit: The Lounge - https://thelounge.chat]
15:41:32Mateon1 joins
15:42:00thuban quits [Ping timeout: 260 seconds]
15:42:10VerifiedJ9 (VerifiedJ) joins
15:46:02thuban (thuban) joins
15:55:41Raithmir joins
16:04:20ducky quits [Read error: Connection reset by peer]
16:04:40ducky (ducky) joins
16:07:05pabs quits [Ping timeout: 260 seconds]
16:10:23pabs (pabs) joins
16:15:07Raithmir quits [Client Quit]
16:25:45wyatt8740 quits [Ping timeout: 260 seconds]
16:29:51<katia>pabs, kokos asked me about Gemini and the possibilities of archiving it. I know you looked into it at some point, are you doing anything with it?
16:37:12wyatt8740 joins
17:01:06katocala quits [Ping timeout: 252 seconds]
17:01:32katocala joins
17:19:20wessel1512 joins
17:19:30bladem quits [Read error: Connection reset by peer]
17:32:27katocala quits [Ping timeout: 252 seconds]
17:33:02katocala joins
17:56:45pabs quits [Ping timeout: 260 seconds]
17:59:10sralracer quits [Client Quit]
18:00:16pabs (pabs) joins
18:01:11Naruyoko quits [Read error: Connection reset by peer]
18:01:28Naruyoko joins
18:13:47sralracer (sralracer) joins
18:20:38<h2ibot>JustAnotherArchivist edited Template:IRC channel (-2, Update for new web chat based on…): https://wiki.archiveteam.org/?diff=53841&oldid=47317
18:25:13<@JAA>nicolas17: Uncompressed WARC is fine, I think.
18:27:23<@JAA>arkiver: Identifying potentially faulty records should be possible from the CDX and the megawarc JSON. The former should contain the payload digests, and the latter allows identifying boundaries between mini-WARCs to eliminate those false positives.
18:28:29<@JAA>Still a lot of data though. And we'd have to check what hash the CDX contains exactly; I think IA recalculates it rather than relying on what's in the WARC, but not entirely sure.
19:06:52ducky_ (ducky) joins
19:07:13ducky quits [Ping timeout: 260 seconds]
19:07:28ducky_ is now known as ducky
19:12:45Webuser074404 joins
19:13:29Webuser074404 quits [Client Quit]
19:28:40BlueMaxima joins
19:33:27FartWithFury quits [Read error: Connection reset by peer]
20:49:29Naruyoko5 joins
20:50:03Naruyoko quits [Read error: Connection reset by peer]
21:03:57Arachnophine quits [Quit: Ping timeout (120 seconds)]
21:04:14Arachnophine (Arachnophine) joins
21:06:58BornOn420 quits [Remote host closed the connection]
21:07:14alexlehm quits [Remote host closed the connection]
21:07:34Sluggs quits [Quit: ZNC - http://znc.in]
21:07:38alexlehm (alexlehm) joins
21:08:36Barto quits [Quit: WeeChat 4.4.3]
21:09:15katia_ quits [Ping timeout: 260 seconds]
21:09:50@JAA quits [Ping timeout: 260 seconds]
21:09:50kokos| quits [Ping timeout: 260 seconds]
21:11:09JAA (JAA) joins
21:11:09@ChanServ sets mode: +o JAA
21:12:09loug8318142 joins
21:14:50kokos- joins
21:16:29katia_ (katia) joins
21:18:59Sluggs joins
21:21:52BornOn420 (BornOn420) joins
21:34:33Island joins
21:43:06sralracer quits [Quit: Ooops, wrong browser tab.]
21:43:41Barto (Barto) joins
22:04:35@JAA quits [Remote host closed the connection]
22:05:11JAA (JAA) joins
22:05:11@ChanServ sets mode: +o JAA
22:14:22<TheTechRobo>Is using a VPN with Warrior projects OK if I control the VPN? It's wireguard.
22:20:43BlueMaxima quits [Read error: Connection reset by peer]
22:22:32<@OrIdow6>arkiver: The sidecar WARC idea would work I think; also your explanation of the WBM's behavior seems to fit what I can see - e.g. https://web.archive.org/web/https://t.nhentai.net/galleries/73599/170t.jpg is a revisit with all A's in https://archive.org/details/archiveteam_nhentai_20240920192453_7ede7e27 , but in playback it redirects to a date but then says there are no WBM captures (live URL NSFW)
22:26:22<@OrIdow6>If we wanted to do *only* a CDX/sidecar-warc approach with no reading/writing of the original we could have a threshold of entropy of the hash (number of A's) - say "even if this ends in 20 A's, the first 20 digits are the same, there's only a 1/(whatever) chance this would've happened by coincidence"
22:29:45Unholy2361924645377131 quits [Ping timeout: 260 seconds]
22:29:46<@OrIdow6>More broadly I think we could come up with some kind of "score" for how likely the 2 are to be true equals?
04:31:04<@OrIdow6>Like, # of bits that match before the A's start + (5 if the URLs are the same else 0) + (5 if the Etag and HTTP content-length headers are the same else 0)
22:31:28<@OrIdow6>And check if that's greater than 20
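A sketch of that score in Python; the digest comparison, the 5-point bonuses and the 20-bit cutoff follow the numbers floated above and are untested guesses, not a validated heuristic:

    def score(suspect_digest: str, candidate_digest: str,
              same_url: bool, same_etag_and_length: bool) -> int:
        prefix = suspect_digest.rstrip("A")  # the part of the base32 digest that survived truncation
        bits = 0
        for a, b in zip(prefix, candidate_digest):
            if a != b:
                break
            bits += 5                        # each base32 character carries 5 bits
        return bits + (5 if same_url else 0) + (5 if same_etag_and_length else 0)

    def plausible_match(suspect_digest: str, candidate_digest: str,
                        same_url: bool, same_etag_and_length: bool) -> bool:
        return score(suspect_digest, candidate_digest, same_url, same_etag_and_length) > 20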
22:32:35<nicolas17>for non-revisit records you could also just calculate the correct hash
22:32:46<@OrIdow6>But better be cautious with this because if done wrongly (if the heuristic measures have too much weight) it could veer into "faking data" territory
22:34:03<nicolas17>if it has *any* 00 at the end, calculate the correct payload hash, if it doesn't match then anything with that hash is suspect
22:35:46<@OrIdow6>arkiver: Also in addition to the above I think that ASAP we should go thru uploaded collections, as well as the temporary storage, find all revisits that we think might be false-positive-revisits (i.e., those with above some set number of A's), and throw those into URLs
22:36:18<@OrIdow6>Which is doable from the CDXs from what's on IA already
22:37:49<@OrIdow6>nicolas17: For stuff already on the IA they already calculate the correct hash (due to the issue with chunked encoding or whatever it was) so that's what can be matched against
22:38:31<@OrIdow6>But the fact that the IA already assumes that the WARC-Payload-Digest values in the WARC are garbage means our big issue here is revisits, not the values of that header per se
22:41:06<@OrIdow6>"and throw those into URLs" - or maybe AB because URLs might DDOS some stuff
22:43:20<@JAA>Depends on how many there are, I'd say.
22:44:25<@JAA>And throwing into #// would possibly need to bypass backfeed.
22:47:55<@OrIdow6>(On the topic of revisits, I need to check again if those sha1-collision-attack PDFs are mixed up in the WBM...)
22:48:51<nicolas17>oh no
22:56:03etnguyen03 (etnguyen03) joins
22:59:07Wohlstand (Wohlstand) joins
23:07:08sralracer (sralracer) joins
23:13:41pixel leaves [Error from remote client]
23:14:12peo joins
23:26:37<peo>Hi all! Anyone awake with knowledge about the grab-site tool? Need to resume an interrupted download if possible
23:27:23<@JAA>Hi peo. That's not supported: https://github.com/ArchiveTeam/grab-site/issues/58
23:28:16<peo>yep, as I have read.. just wondered if anyone had a work-around apart from using the "pause when diskspace is low" trick
23:29:17<peo>It filled up in the datadir's temp folder because it stumbled on a large file while downloading the whole world..
23:30:31<@JAA>Ah, I bet it's planet-latest.osm.pbf. Classic.
23:31:37<nicolas17>x_x
23:35:01loug8318142 quits [Client Quit]
23:36:31<@OrIdow6>I am running a very slow scan thru some CDXs in collection:archiveteam that are accessible to me for false revisits, if you want the scripts (just a pipeline + GNU parallel, but took a while since the latter still has pretty bad documentation) contact me
23:37:49<that_lurker>would distributing that job help speed it up?
23:38:06<nicolas17>I assume IA would become the bottleneck quickly
23:40:02sralracer quits [Client Quit]
23:41:28<@OrIdow6>Parallel thinks there are 20h left on the set I'm doing (very reduced, limiting it to stuff I can access + stuff I think is likely to be at temporally closer risk)