| 00:00:49 | <audrooku|m> | I was having a conversation with JAA about the WBM CDX API, I've been having issues with getting an exhaustive list of urls, I typically have to fuzz the url to get a broader list |
| 00:02:00 | <audrooku|m> | for example I will get more results for |
| 00:02:00 | <audrooku|m> | `soundcloud.com/oe*` |
| 00:02:00 | <audrooku|m> | than for |
| 00:02:00 | <audrooku|m> | `soundcloud.com/*` |
| 00:02:00 | <audrooku|m> | due to soundcloud having a very popular oembed feature at `soundcloud.com/oembed` |
| 00:06:17 | <@JAA> | I'm trying a listing equivalent to soundcloud.com/* myself now. How many results did you get exactly there? |
| 00:09:03 | <audrooku|m> | I get 1,432,078 rows with `wget "https://web.archive.org/cdx/search/cdx?url=https://soundcloud.com/*&matchType=domain"` |
| 00:09:03 | <audrooku|m> | I will have to re-run without matchType=domain to test the effect of fuzzing, will get back to you in a sec |
| 00:09:12 | <@JAA> | I'm about 10% done and got just over 20 million unique URLs so far. |
| 00:09:44 | <audrooku|m> | what is the exact url you're querying? |
| 00:10:15 | <@JAA> | /cdx/search/cdx?url=soundcloud.com&matchType=domain&collapse=urlkey&fl=original&output=json with pagination |
| 00:11:11 | <@JAA> | Which pagination are you using? |
| 00:12:06 | <@JAA> | I do showNumPages=true + page, because the resumption key pagination is known not to work well. |
| 00:13:16 | <audrooku|m> | None, I wasn't under the impression that this was necessary unless you set a limit, is this necessary? |
| 00:13:30 | <@JAA> | It is. |
| 00:15:56 | <audrooku|m> | does matchType=domain also return more results if you're querying for all results under a domain? |
| 00:17:57 | <@JAA> | There's some sort of 'sharding' involved here, or whatever it's called in this context. If you don't do pagination, you only ever get results from the first 'shard'. Or something similar to that. |
| 00:18:26 | <@JAA> | With the 'page' param, you can iterate over the entire CDX store. -ish |
| 00:19:11 | <@JAA> | resumeKey behaves weirdly for queries with lots of results, which is why I switched (back, I think) to the page pagination in ia-cdx-search some time ago. |
| 00:19:18 | <audrooku|m> | wow, that's incredible, I wish this was more emphasized in the docs, when the paginated query with your tool is finished I'll let you know |
| 00:19:28 | <@JAA> | matchType=domain means matching the domain and any subdomains. |
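[Editor's note: a tiny sketch of the `matchType=domain` rule just stated, the domain itself plus any of its subdomains. The helper `matches_domain` is my own illustration, not part of the CDX API.]

```python
def matches_domain(host: str, domain: str) -> bool:
    # matchType=domain: matches the domain itself and any subdomain of it.
    return host == domain or host.endswith("." + domain)
```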
| 00:23:56 | <@JAA> | Welp, my run crashed due to 503s after 3.3k pages. I understand you're running it as well anyway? If so, I won't resume. |
| 00:24:07 | <@JAA> | `ia-cdx-search --concurrency 4 --tries 10 'url=soundcloud.com&matchType=domain&collapse=urlkey&fl=original'` is what I used to be precise. |
| 00:24:09 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
| 00:24:16 | <nicolas17> | crashed? |
| 00:24:25 | <nicolas17> | so 503s continued despite 10 retries? |
| 00:24:30 | <@JAA> | Yep |
| 00:24:43 | <@JAA> | The WBM is very stable, you must know. |
| 00:24:44 | <@JAA> | :-) |
| 00:25:53 | <@JAA> | It doesn't sleep between retries though except for 429s, so it might not have persisted for long. |
| 00:25:58 | <audrooku|m> | > Welp, my run crashed due to 503s after 3.3k pages. I understand you're running it as well anyway? If so, I won't resume. |
| 00:25:58 | <audrooku|m> | Yeah mine crashed as well and I accidentally overwrote the file I was piping to haha, I'll run this query until it finishes, since I need the cdx data anyway, will ping you when it finishes |
| 00:30:39 | <fireonlive> | tries 9999999999 |
| 01:34:26 | | Dango360 quits [Ping timeout: 252 seconds] |
| 02:31:47 | | AnotherIki joins |
| 02:35:36 | | Iki1 quits [Ping timeout: 265 seconds] |
| 03:30:38 | <@JAA> | > This URL has been already captured 7 times today. |
| 03:30:43 | <@JAA> | > Wayback Machine has not archived that URL. |
| 03:30:45 | <@JAA> | Sigh |
| 03:31:00 | <@JAA> | But yeah, apparently the limit is 7 captures per day now. |
| 03:38:39 | <audrooku|m> | Interesting |
| 03:40:27 | <audrooku|m> | BTW JAA do you know how the WBM handles captures with the same url and timestamp? given that the url and timestamp are the only variables for accessing a capture, I doubt both would be accessible, but I'm curious if you know what would happen in that scenario, I assume they at least allow conflicting url/timestamps between collections or warcs ? |
| 03:40:31 | <audrooku|m> | just curious |
| 03:41:31 | <@JAA> | I believe it prefers 200s over other status codes, which is how it makes www → non-www redirects work, for example. But beyond that, no idea. |
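[Editor's note: a hypothetical sketch of the preference JAA describes, not the WBM's actual logic. When several captures share a url/timestamp key, pick the one with HTTP status 200 over redirects or errors.]

```python
def pick_capture(captures: list[dict]) -> dict:
    # Hypothetical tie-break for captures sharing a url/timestamp key:
    # prefer a 200 over other status codes, which is the behaviour
    # described above for making www -> non-www redirects resolve.
    return min(captures, key=lambda c: c["status"] != 200)
```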
| 04:42:27 | | Stiletto joins |
| 05:03:13 | <audrooku|m> | audrey is still waiting on the cdx grab, page 6242 atm |
| 07:42:01 | | Arcorann (Arcorann) joins |
| 08:23:31 | <audrooku|m> | How does the fakeurl work for youtube videos? is it just an alias for the googlevideos videoplayback urls? |
| 12:24:22 | | Matthww11 quits [Quit: Ping timeout (120 seconds)] |
| 12:25:12 | | Matthww11 joins |
| 13:38:44 | | Arcorann quits [Ping timeout: 265 seconds] |
| 14:45:53 | | IDK (IDK) joins |
| 14:53:10 | | HP_Archivist quits [Ping timeout: 265 seconds] |
| 15:56:46 | | BearFortress quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.] |
| 16:26:54 | | HP_Archivist (HP_Archivist) joins |
| 16:40:36 | | Dango360 (Dango360) joins |
| 18:25:20 | | HP_Archivist quits [Ping timeout: 252 seconds] |
| 19:34:01 | | Lord_Nightmare quits [Quit: ZNC - http://znc.in] |
| 19:37:42 | | Lord_Nightmare (Lord_Nightmare) joins |
| 19:47:17 | | balrog quits [Quit: Bye] |
| 19:55:44 | | balrog (balrog) joins |
| 21:30:50 | | HP_Archivist (HP_Archivist) joins |
| 21:47:32 | | Barto quits [Read error: Connection reset by peer] |
| 21:47:38 | | Barto (Barto) joins |
| 22:14:16 | | andrew quits [Quit: ] |
| 22:40:28 | | imer quits [Quit: Oh no] |
| 22:41:24 | | imer (imer) joins |
| 23:03:10 | | andrew (andrew) joins |
| 23:42:17 | <@JAA> | > FATAL ERROR: server must neither be primary nor secondary as it has no mate! |
| 23:42:25 | <@JAA> | Aw, poor server. |
| 23:45:34 | <nicolas17> | I'm uploading Xcode 15 and it seems to be saturating my upstream |
| 23:45:55 | <nicolas17> | last year upload speeds were slow and inconsistent |
| 23:59:54 | <fireonlive> | JAA: me_irl |