#internetarchive log for 2025-11-23


02:22:14		BearFortress_ joins
02:24:27		BearFortress quits [Ping timeout: 272 seconds]
02:25:05		nicolas17 quits [Ping timeout: 272 seconds]
02:25:20		nicolas17 (nicolas17) joins
02:27:22		BearFortress joins
02:31:25		BearFortress_ quits [Ping timeout: 272 seconds]
02:54:10		tzt quits [Read error: Connection reset by peer]
02:55:05		tzt (tzt) joins
06:17:58		BearFortress_ joins
06:19:26		BearFortress__ joins
06:20:19		BearFortress___ joins
06:21:57		BearFortress quits [Ping timeout: 272 seconds]
06:23:13		Pedrosso quits [Ping timeout: 272 seconds]
06:23:13		balrog quits [Ping timeout: 272 seconds]
06:23:20		nicolas17 quits [Remote host closed the connection]
06:23:44		nicolas17 (nicolas17) joins
06:23:47		Pedrosso joins
06:23:51		BearFortress_ quits [Ping timeout: 272 seconds]
06:24:30		BearFortress__ quits [Ping timeout: 272 seconds]
06:27:15		balrog (balrog) joins
06:49:06		DogsRNice quits [Read error: Connection reset by peer]
07:27:49		zhongfu quits [Ping timeout: 272 seconds]
07:31:28		Sluggs quits [Ping timeout: 256 seconds]
07:33:19		zhongfu (zhongfu) joins
09:50:46		tzt quits [Quit: tzt]
09:50:59		tzt (tzt) joins
10:01:43		X-Scale quits [Ping timeout: 272 seconds]
10:13:25		zhongfu_ (zhongfu) joins
10:13:30		zhongfu quits [Read error: Connection reset by peer]
11:08:31		X-Scale joins
11:44:51	<datechnoman>	Hey all. Got a simple one. What's the most efficient way to query the WBM CDX for all URLs for a specific site? For example, let's use Imgur. "https://web.archive.org/cdx/search/cdx?url=https://imgur.com/gallery*&output=json&limit=2500&from=20250201&to=20250230&filter=statuscode:200&collapse=urlkey"
11:45:21	<datechnoman>	I could create a script that runs "chunks" of dates, but it's very slow to query. Is this the most efficient approach, or should I be using the IA command-line tool?
11:51:26	<Jake>	datechnoman: I believe JAA's utility works rather well https://gitea.arpa.li/JustAnotherArchivist/little-things/src/branch/master/ia-cdx-search
12:39:19	<justauser|m>	https://wiki.archiveteam.org/index.php/Site_exploration has a different snippet, but the same basic idea. I'll probably link the JAA one.
12:52:06		AK quits [Quit: AK]
15:47:19		Dango360 quits [Quit: The Lounge - https://thelounge.chat]
15:54:11		Dango360 (Dango360) joins
19:45:42		SootBector quits [Remote host closed the connection]
19:46:30		SootBector (SootBector) joins
20:51:06	<@JAA>	datechnoman: I'd suggest dropping from/to and listing the whole range in one go. It's purely an output filter, so if you split it up, the server ends up reading the same underlying data over and over.
20:52:35	<@JAA>	`ia-cdx-search --concurrency 4 --tries 10 'url=https://imgur.com/gallery*&filter=statuscode:200&collapse=urlkey'`
20:52:50	<@JAA>	If you want only the URL, you can also add `fl=original`.
20:54:29	<@JAA>	Oh yeah, also, `collapse=urlkey` will collapse case collisions. You probably don't want that on Imgur.
20:55:43	<@JAA>	E.g. https://web.archive.org/cdx/search/cdx?url=https://i.imgur.com/fPIGA.* vs https://web.archive.org/cdx/search/cdx?url=https://i.imgur.com/fPIGA.*&collapse=urlkey
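
Editorial sketch of the advice above, not JAA's ia-cdx-search and not an official client: it builds one whole-range CDX query (no from/to, no collapse=urlkey, fl=original) and de-duplicates the returned URLs case-sensitively on the client side. The function names build_cdx_url, fetch_originals and unique_originals are hypothetical, and the assumption that the plain-text response holds one URL per line when fl=original is set should be checked against real output. Pagination and retries, which ia-cdx-search handles, are left out.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(url_pattern, status="200", fields=("original",)):
    """Build a single whole-range CDX query URL (hypothetical helper).

    No from/to is added, so the server is asked for the full range once.
    No collapse=urlkey is added, so URLs differing only in case stay separate.
    """
    params = [("url", url_pattern)]
    if status:
        params.append(("filter", f"statuscode:{status}"))
    if fields:
        params.append(("fl", ",".join(fields)))
    # Keep ':', '/', '*' and ',' readable, matching the URLs quoted in the log.
    return CDX_ENDPOINT + "?" + urlencode(params, safe=":/*,")


def unique_originals(lines):
    """Case-sensitive de-duplication that preserves first-seen order."""
    seen = set()
    result = []
    for line in lines:
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            result.append(url)
    return result


def fetch_originals(url_pattern, timeout=300):
    """Fetch and de-duplicate original URLs (assumes one URL per line)."""
    with urlopen(build_cdx_url(url_pattern), timeout=timeout) as response:
        text = response.read().decode("utf-8", errors="replace")
    return unique_originals(text.splitlines())
```

Calling fetch_originals("https://imgur.com/gallery*") would issue the same query as the ia-cdx-search command above, minus collapse=urlkey, in a single unpaginated request; for a site of Imgur's size the dedicated tool is likely the better choice.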
21:37:45		zhongfu_ quits [Ping timeout: 272 seconds]
21:39:19		zhongfu (zhongfu) joins
21:47:42		DogsRNice joins
22:22:06	<datechnoman>	Thank you very much Jake, JAA and justauser|m
22:22:11	<datechnoman>	Exactly what I'm after :)
23:28:34		DopefishJustin quits [Ping timeout: 256 seconds]
23:29:53	<pabs>	wow "The capture will start in ~9 hours, 34 minutes because our service is currently overloaded."
23:30:19	<@JAA>	Yeah, way down from yesterday :-P
23:35:14	<pabs>	huh, must have been quite some backlog
23:35:31	<pabs>	wonder if someone was spamming the save API or something
23:50:13		DopefishJustin joins
23:50:13		DopefishJustin is now authenticated as DopefishJustin
