| 01:25:00 | <pokechu22> | JAA: I got the resumption key thing to work properly (i.e. give more URLs than the maximum limit I ran into of ~100,000 at which point things were behaving weirdly) for one of the Finnish ISP things, though I don't remember exactly what I did |
| 01:26:18 | <@JAA> | pokechu22: Well yeah, it 'works', or at least it makes it seem that it does, but it just cuts off at some point, and depending on your query, you might also get zero results even though there are hits. |
| 01:26:27 | <@JAA> | (Context see -bs) |
| 01:26:51 | <@JAA> | The page number pagination is the only approach that actually works reliably. |
| 01:27:08 | <pokechu22> | It at least gave more results than I got from raising the limit (even if I had the limit higher)... I'm pretty sure the pagination one didn't work reliably for me? |
| 01:27:46 | <@JAA> | I've listed all of youtube.com etc. with the pagination. It certainly works. |
| 01:28:06 | <@JAA> | Or much, much better than the resumption key, anyway. |
| 01:28:32 | <@JAA> | I forgot what the actual condition for the resumption key issues was. OrIdow6 figured that one out a while ago. |
| 01:29:16 | <pokechu22> | ah, no, I was thinking of the offset parameter; that's the one that doesn't give more results |
| 01:29:20 | <@JAA> | In any case, I recommend using the ia-cdx-search script from my little-things repo. It has parallelism and handles the pagination as well as rate limits and other weirdness. |
| 01:29:27 | <pokechu22> | as documented on https://archive.org/developers/wayback-cdx-server.html#query-result-limits |
| 01:32:05 | <pokechu22> | The query I had was more or less https://web.archive.org/cdx/search/cdx?url=pp.fi&matchType=domain&collapse=urlkey&fl=original&limit=100000&showResumeKey=true originally; it ended up going into 13 pages (at 100k URLs each), and I didn't run into any rate limiting issues from that (but I imagine e.g. YouTube would run into other issues) |
| 01:34:55 | <@JAA> | Yeah, that seems to match what I'm getting. IIRC, it depends on the number of results and also the number of duplicates. For unpopular sites, the resume key should usually work, but for anything with a lot of snapshots, it'll break, and you'll have no way of telling that it broke. |
| 01:35:21 | <@JAA> | So in my eyes, it's best to ignore that it exists at all and use the method that works reliably. |
| 01:45:40 | <@JAA> | A note on the parallel use of ia-cdx-search: I've found that anything beyond 4 concurrent retrievals is mostly useless, at least from a Canadian server. IA's rate limits will trigger too quickly beyond that. |
| 01:46:01 | <@JAA> | I usually use `--concurrency 4 --tries 10`. |
| 01:51:51 | <pabs> | is there an official CLI for the CDX API, or is ia-cdx-search the one to use? |
| 01:55:14 | <@JAA> | Nothing official that I could find when I searched for it. |
| 01:55:48 | <@JAA> | Nor could I find anything decent unofficial, which is why I wrote ia-cdx-search. :-P |
| 02:04:20 | <OrIdow6> | Yeah use the paginated version |
| 02:04:40 | <OrIdow6> | I just have a pipeline that's "saved" in my shell history that does the trick |
| 02:05:36 | <OrIdow6> | And alongside that... beware that the paginated version may not have recent additions to the WBM |
| 02:06:35 | <OrIdow6> | Honestly it amazes me that CDX is still up, I would've imagined that people would have exploited it into the ground by now |
| 02:10:40 | | G4te_Keep3r349 quits [Client Quit] |
| 02:10:57 | | G4te_Keep3r349 joins |
| 03:59:26 | | Stilett0 joins |
| 04:01:13 | | Stiletto quits [Ping timeout: 252 seconds] |
| 05:03:26 | | Stiletto joins |
| 05:04:38 | | Stilett0 quits [Ping timeout: 265 seconds] |
| 06:20:57 | | Arcorann (Arcorann) joins |
| 07:44:57 | | G4te_Keep3r349 quits [Client Quit] |
| 07:45:12 | | G4te_Keep3r349 joins |
| 07:56:46 | | G4te_Keep3r349 quits [Client Quit] |
| 07:56:56 | | G4te_Keep3r3494 joins |
| 08:05:14 | | G4te_Keep3r3494 quits [Ping timeout: 252 seconds] |
| 08:06:31 | | G4te_Keep3r349 joins |
| 09:02:14 | | sknebel (sknebel) joins |
| 10:11:45 | | G4te_Keep3r349 quits [Client Quit] |
| 10:11:45 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
| 10:12:03 | | G4te_Keep3r349 joins |
| 10:18:13 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
| 10:32:40 | | G4te_Keep3r349 quits [Client Quit] |
| 10:32:59 | | G4te_Keep3r349 joins |
| 10:34:56 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 12:47:34 | | Arcorann quits [Ping timeout: 252 seconds] |
| 12:54:59 | | igloo22225 quits [Quit: Ping timeout (120 seconds)] |
| 12:55:19 | | igloo22225 (igloo22225) joins |
| 14:21:36 | | balrog quits [Quit: Bye] |
| 15:06:16 | | balrog (balrog) joins |
| 16:24:25 | | G4te_Keep3r3492 joins |
| 16:25:01 | | igloo22225 quits [Client Quit] |
| 16:25:03 | | G4te_Keep3r349 quits [Client Quit] |
| 16:25:03 | | G4te_Keep3r3492 is now known as G4te_Keep3r349 |
| 16:25:19 | | igloo22225 (igloo22225) joins |
| 19:07:15 | | AnotherIki joins |
| 19:10:55 | | Iki1 quits [Ping timeout: 252 seconds] |
| 19:35:01 | | igloo22225 quits [Client Quit] |
| 19:35:16 | | igloo22225 (igloo22225) joins |
| 19:57:18 | | pabs quits [Ping timeout: 252 seconds] |
| 19:57:53 | | pabs (pabs) joins |
| 20:11:08 | | igloo222252 joins |
| 20:12:03 | | igloo22225 quits [Client Quit] |
| 20:12:03 | | igloo222252 is now known as igloo22225 |
| 21:34:59 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
| 21:49:19 | | spirit quits [Quit: Leaving] |
| 23:07:08 | | igloo22225 quits [Client Quit] |
| 23:07:28 | | igloo22225 (igloo22225) joins |
| 23:31:07 | | tbc1887 (tbc1887) joins |
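
The page-number pagination recommended in the discussion above can be sketched roughly as follows. This is a minimal illustration, not how ia-cdx-search is implemented: it assumes the `showNumPages=true` and `page=N` parameters described in the Wayback CDX server documentation, it has no concurrency, retries, or rate-limit handling, and the helper names (`cdx_url`, `http_get`, `iter_cdx_lines`) are made up for this example. The query parameters are a simplified version of the pp.fi query from the log, with `limit` and `showResumeKey` dropped.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_url(params, **extra):
    """Build a CDX query URL from base parameters plus extras such as page=N."""
    return CDX_ENDPOINT + "?" + urlencode({**params, **extra})


def http_get(url):
    """Fetch a URL and return the body as text (no retries, no rate limiting)."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def iter_cdx_lines(params, fetch=http_get):
    """Yield result lines page by page using page-number pagination.

    `fetch` is injectable so the paging logic can be exercised offline.
    """
    # Assumption: with showNumPages=true the server answers with a bare
    # integer, the number of pages for this query.
    num_pages = int(fetch(cdx_url(params, showNumPages="true")).strip())
    # Pages are requested one at a time, starting from page 0.
    for page in range(num_pages):
        body = fetch(cdx_url(params, page=page))
        for line in body.splitlines():
            if line:
                yield line


# Example query, adapted from the one quoted in the log (not executed here):
# for line in iter_cdx_lines({"url": "pp.fi", "matchType": "domain", "fl": "original"}):
#     print(line)
```

For anything large, the advice in the log still applies: ia-cdx-search with `--concurrency 4 --tries 10` handles parallelism and rate limits, which this sketch deliberately leaves out.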