02:22:14BearFortress_ joins
02:24:27BearFortress quits [Ping timeout: 272 seconds]
02:25:05nicolas17 quits [Ping timeout: 272 seconds]
02:25:20nicolas17 (nicolas17) joins
02:27:22BearFortress joins
02:31:25BearFortress_ quits [Ping timeout: 272 seconds]
02:54:10tzt quits [Read error: Connection reset by peer]
02:55:05tzt (tzt) joins
06:17:58BearFortress_ joins
06:19:26BearFortress__ joins
06:20:19BearFortress___ joins
06:21:57BearFortress quits [Ping timeout: 272 seconds]
06:23:13Pedrosso quits [Ping timeout: 272 seconds]
06:23:13balrog quits [Ping timeout: 272 seconds]
06:23:20nicolas17 quits [Remote host closed the connection]
06:23:44nicolas17 (nicolas17) joins
06:23:47Pedrosso joins
06:23:51BearFortress_ quits [Ping timeout: 272 seconds]
06:24:30BearFortress__ quits [Ping timeout: 272 seconds]
06:27:15balrog (balrog) joins
06:49:06DogsRNice quits [Read error: Connection reset by peer]
07:27:49zhongfu quits [Ping timeout: 272 seconds]
07:31:28Sluggs quits [Ping timeout: 256 seconds]
07:33:19zhongfu (zhongfu) joins
09:50:46tzt quits [Quit: tzt]
09:50:59tzt (tzt) joins
10:01:43X-Scale quits [Ping timeout: 272 seconds]
10:13:25zhongfu_ (zhongfu) joins
10:13:30zhongfu quits [Read error: Connection reset by peer]
11:08:31X-Scale joins
11:44:51<datechnoman>Hey all. Got a simple one. What's the most efficient way to query the WBM CDX for all URLs for a specific site? For example, let's use Imgur. "https://web.archive.org/cdx/search/cdx?url=https://imgur.com/gallery*&output=json&limit=2500&from=20250201&to=20250230&filter=statuscode:200&collapse=urlkey"
11:45:21<datechnoman>I could create a script that runs "chunks" of dates, but it's very slow to query. Is this the most efficient way, or should I be using the IA command line tool?
11:51:26<Jake>datechnoman: I believe JAA's utility works rather well https://gitea.arpa.li/JustAnotherArchivist/little-things/src/branch/master/ia-cdx-search
12:39:19<justauser|m>https://wiki.archiveteam.org/index.php/Site_exploration has a different snippet, but the same basic idea. I'll probably link the JAA one.
12:52:06AK quits [Quit: AK]
15:47:19Dango360 quits [Quit: The Lounge - https://thelounge.chat]
15:54:11Dango360 (Dango360) joins
19:45:42SootBector quits [Remote host closed the connection]
19:46:30SootBector (SootBector) joins
20:51:06<@JAA>datechnoman: I'd suggest dropping from/to and listing the whole range in one go. It's purely an output filter, so if you split it up, the server ends up reading the same underlying data over and over.
20:52:35<@JAA>`ia-cdx-search --concurrency 4 --tries 10 'url=https://imgur.com/gallery*&filter=statuscode:200&collapse=urlkey'`
20:52:50<@JAA>If you want only the URL, you can also add `fl=original`.
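For readers who want to build the same kind of query by hand, here is a minimal Python sketch of the single-pass request described above: no `from`/`to`, one status filter, and `fl=original` to return only the URL. The helper name `build_cdx_url` and its arguments are made up for illustration and are not part of `ia-cdx-search`; the sketch only constructs the URL and does not handle paging or retries, which is what the tool is for.

```python
from urllib.parse import urlencode

# Endpoint taken from the URLs quoted in the log.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(url_pattern, filters=(), collapse=None, fields=None):
    """Build a CDX query URL (hypothetical helper, for illustration only).

    No from/to parameters are added: per the discussion above, they only
    filter the output, so one query over the whole range avoids re-reading
    the same underlying data.
    """
    params = [("url", url_pattern)]
    params += [("filter", f) for f in filters]
    if collapse:
        params.append(("collapse", collapse))
    if fields:
        params.append(("fl", ",".join(fields)))
    # Keep ':', '/', '*' and ',' readable instead of percent-encoding them.
    return CDX_ENDPOINT + "?" + urlencode(params, safe=":/*,")


query_url = build_cdx_url(
    "https://imgur.com/gallery*",
    filters=["statuscode:200"],
    fields=["original"],
)
print(query_url)
```

Whether to pass `collapse` is left to the caller, since it changes which rows come back.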
20:54:29<@JAA>Oh yeah, also, `collapse=urlkey` will collapse case collisions. You probably don't want that on Imgur.
20:55:43<@JAA>E.g. https://web.archive.org/cdx/search/cdx?url=https://i.imgur.com/fPIGA.* vs https://web.archive.org/cdx/search/cdx?url=https://i.imgur.com/fPIGA.*&collapse=urlkey
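The case-collision point can be illustrated with a toy example. This is only a sketch of the idea, assuming the collapse key behaves like a case-insensitive form of the URL; it does not reproduce how the CDX server computes `urlkey`. The two URLs and the `.jpg` extension are made up for illustration.

```python
from itertools import groupby


def collapse_adjacent(urls, key):
    """Keep the first URL from each run of adjacent URLs that share a key."""
    return [next(group) for _, group in groupby(urls, key=key)]


# Hypothetical captures: on a case-sensitive site these are different images.
urls = [
    "https://i.imgur.com/fPIGA.jpg",
    "https://i.imgur.com/fpiga.jpg",
]

# Assumption: a lowercased key stands in for a case-insensitive collapse key.
collapsed_by_lowercase = collapse_adjacent(urls, key=str.lower)

# Collapsing on the exact URL keeps both entries.
collapsed_by_exact_url = collapse_adjacent(urls, key=lambda u: u)

print(len(collapsed_by_lowercase), len(collapsed_by_exact_url))
```

With a case-insensitive key the second URL is dropped, which is why collapsing is risky for a site whose IDs are case-sensitive.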
21:37:45zhongfu_ quits [Ping timeout: 272 seconds]
21:39:19zhongfu (zhongfu) joins
21:47:42DogsRNice joins
22:22:06<datechnoman>Thank you very much, Jake, JAA and justauser|m
22:22:11<datechnoman>Exactly what I'm after :)
23:28:34DopefishJustin quits [Ping timeout: 256 seconds]
23:29:53<pabs>wow "The capture will start in ~9 hours, 34 minutes because our service is currently overloaded."
23:30:19<@JAA>Yeah, way down from yesterday :-P
23:35:14<pabs>huh, must have been quite some backlog
23:35:31<pabs>wonder if someone was spamming the save API or something
23:50:13DopefishJustin joins