01:25:00<pokechu22>JAA: I got the resumption key thing to work properly (i.e. give more URLs than the maximum limit I ran into of ~100,000 at which point things were behaving weirdly) for one of the Finnish ISP things, though I don't remember exactly what I did
01:26:18<@JAA>pokechu22: Well yeah, it 'works', or at least it makes it seem that it does, but it just cuts off at some point, and depending on your query, you might also get zero results even though there are hits.
01:26:27<@JAA>(Context see -bs)
01:26:51<@JAA>The page number pagination is the only approach that actually works reliably.
01:27:08<pokechu22>It at least gave more results than I got from raising the limit (even if I had the limit higher)... I'm pretty sure the pagination one didn't work reliably for me?
01:27:46<@JAA>I've listed all of youtube.com etc. with the pagination. It certainly works.
01:28:06<@JAA>Or much, much better than the resumption key, anyway.
01:28:32<@JAA>I forgot what the actual condition for the resumption key issues was. OrIdow6 figured that one out a while ago.
01:29:16<pokechu22>ah, no, I was thinking of the offset parameter; that's the one that doesn't give more results
01:29:20<@JAA>In any case, I recommend using the ia-cdx-search script from my little-things repo. It has parallelism and handles the pagination as well as rate limits and other weirdness.
01:29:27<pokechu22>as documented on https://archive.org/developers/wayback-cdx-server.html#query-result-limits
01:32:05<pokechu22>The query I had was more or less https://web.archive.org/cdx/search/cdx?url=pp.fi&matchType=domain&collapse=urlkey&fl=original&limit=100000&showResumeKey=true originally; it ended up going into 13 pages (at 100k URLs each), and I didn't run into any rate limiting issues from that (but I imagine e.g. YouTube would run into other issues)
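For reference, a rough sketch of how the resume-key flow in the query above could be driven from Python. It assumes the behaviour described in the linked CDX documentation: with `showResumeKey=true`, the key is printed on the last line of the response after an empty line, already percent-encoded, and is passed back via `resumeKey`. The function names `build_resume_query` and `split_resume_key` are made up for illustration, and no network request is made here.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_resume_query(resume_key=None, limit=100000):
    """Build the pp.fi query from the log, optionally continuing from a resume key."""
    params = {
        "url": "pp.fi",
        "matchType": "domain",
        "collapse": "urlkey",
        "fl": "original",
        "limit": limit,
        "showResumeKey": "true",
    }
    query = CDX_ENDPOINT + "?" + urlencode(params)
    if resume_key is not None:
        # Assumption: the key is returned already percent-encoded,
        # so it is appended verbatim instead of being encoded again.
        query += "&resumeKey=" + resume_key
    return query


def split_resume_key(body):
    """Split a response body into (rows, resume_key).

    resume_key is None when the body does not end with
    an empty line followed by a key.
    """
    lines = body.rstrip("\n").split("\n")
    if len(lines) >= 2 and lines[-2] == "":
        return lines[:-2], lines[-1]
    return [line for line in lines if line], None
```

A caller would loop: fetch `build_resume_query(key)`, split the body, and stop once no key comes back. Note that a missing key does not prove the listing is complete, which is the failure mode discussed in this conversation.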
01:34:55<@JAA>Yeah, that seems to match what I'm getting. IIRC, it depends on the number of results and also the number of duplicates. For unpopular sites, the resume key should usually work, but for anything with a lot of snapshots, it'll break, and you'll have no way of telling that it broke.
01:35:21<@JAA>So in my eyes, it's best to ignore that it exists at all and use the method that works reliably.
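A minimal sketch of the page-number approach recommended here, assuming the `showNumPages` and `page` parameters of the CDX server's pagination API and assuming pages are numbered from 0. The helper names `num_pages_url` and `page_urls` are hypothetical; this only builds URLs and is not how ia-cdx-search itself is implemented.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def num_pages_url(url, **params):
    """URL whose response should be the number of pages for this query."""
    query = {"url": url, **params, "showNumPages": "true"}
    return CDX_ENDPOINT + "?" + urlencode(query)


def page_urls(url, num_pages, **params):
    """One URL per page, assuming zero-based page numbers."""
    return [
        CDX_ENDPOINT + "?" + urlencode({"url": url, **params, "page": page})
        for page in range(num_pages)
    ]
```

Usage would be to request `num_pages_url("pp.fi", matchType="domain")` once, parse the integer it returns, and then fetch every entry of `page_urls("pp.fi", n, matchType="domain")`.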
01:45:40<@JAA>A note on the parallel use of ia-cdx-search: I've found that anything beyond 4 concurrent retrievals is mostly useless, at least from a Canadian server. IA's rate limits will trigger too quickly beyond that.
01:46:01<@JAA>I usually use `--concurrency 4 --tries 10`.
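To illustrate what a concurrency of 4 with up to 10 tries per page might look like in a hand-rolled client, here is a generic sketch. It is not ia-cdx-search's code: `fetch_with_retries` and `fetch_all_pages` are hypothetical names, the `fetch` callable that retrieves one page is supplied by the caller, and the linear backoff is an arbitrary choice.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_with_retries(fetch, page, tries=10, delay=0.0):
    """Call fetch(page), retrying on any exception up to `tries` attempts."""
    for attempt in range(1, tries + 1):
        try:
            return fetch(page)
        except Exception:
            if attempt == tries:
                raise  # out of attempts: surface the last error
            time.sleep(delay * attempt)  # simple linear backoff


def fetch_all_pages(fetch, num_pages, concurrency=4, tries=10, delay=0.0):
    """Fetch pages 0..num_pages-1 with a bounded number of worker threads.

    Results are returned in page order.
    """
    def worker(page):
        return fetch_with_retries(fetch, page, tries, delay)

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(worker, range(num_pages)))
```

Against the real service a non-zero `delay` would be advisable, since the rate limits mentioned above are what the retries are meant to ride out.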
01:51:51<pabs>is there an official CLI for the CDX API, or is ia-cdx-search the one to use?
01:55:14<@JAA>Nothing official that I could find when I searched for it.
01:55:48<@JAA>Nor could I find anything decent unofficial, which is why I wrote ia-cdx-search. :-P
02:04:20<OrIdow6>Yeah use the paginated version
02:04:40<OrIdow6>I just have a pipeline that's "saved" in my shell history that does the trick
02:05:36<OrIdow6>And alongside that... beware that the paginated version may not have recent additions to the WBM
02:06:35<OrIdow6>Honestly it amazes me that CDX is still up, I would've imagined that people would have exploited it into the ground by now
02:10:40G4te_Keep3r349 quits [Client Quit]
02:10:57G4te_Keep3r349 joins
03:59:26Stilett0 joins
04:01:13Stiletto quits [Ping timeout: 252 seconds]
05:03:26Stiletto joins
05:04:38Stilett0 quits [Ping timeout: 265 seconds]
06:20:57Arcorann (Arcorann) joins
07:44:57G4te_Keep3r349 quits [Client Quit]
07:45:12G4te_Keep3r349 joins
07:56:46G4te_Keep3r349 quits [Client Quit]
07:56:56G4te_Keep3r3494 joins
08:05:14G4te_Keep3r3494 quits [Ping timeout: 252 seconds]
08:06:31G4te_Keep3r349 joins
09:02:14sknebel (sknebel) joins
10:11:45G4te_Keep3r349 quits [Client Quit]
10:11:45qwertyasdfuiopghjkl quits [Remote host closed the connection]
10:12:03G4te_Keep3r349 joins
10:18:13qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
10:32:40G4te_Keep3r349 quits [Client Quit]
10:32:59G4te_Keep3r349 joins
10:34:56qwertyasdfuiopghjkl quits [Client Quit]
12:47:34Arcorann quits [Ping timeout: 252 seconds]
12:54:59igloo22225 quits [Quit: Ping timeout (120 seconds)]
12:55:19igloo22225 (igloo22225) joins
14:21:36balrog quits [Quit: Bye]
15:06:16balrog (balrog) joins
16:24:25G4te_Keep3r3492 joins
16:25:01igloo22225 quits [Client Quit]
16:25:03G4te_Keep3r349 quits [Client Quit]
16:25:03G4te_Keep3r3492 is now known as G4te_Keep3r349
16:25:19igloo22225 (igloo22225) joins
19:07:15AnotherIki joins
19:10:55Iki1 quits [Ping timeout: 252 seconds]
19:35:01igloo22225 quits [Client Quit]
19:35:16igloo22225 (igloo22225) joins
19:57:18pabs quits [Ping timeout: 252 seconds]
19:57:53pabs (pabs) joins
20:11:08igloo222252 joins
20:12:03igloo22225 quits [Client Quit]
20:12:03igloo222252 is now known as igloo22225
21:34:59qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
21:49:19spirit quits [Quit: Leaving]
23:07:08igloo22225 quits [Client Quit]
23:07:28igloo22225 (igloo22225) joins
23:31:07tbc1887 (tbc1887) joins