00:00:49<audrooku|m>I was having a conversation with JAA about the WBM CDX API. I've been having issues getting an exhaustive list of urls; I typically have to fuzz the url to get a broader list
00:02:00<audrooku|m>for example I will get more results for
00:02:00<audrooku|m>`soundcloud.com/oe*`
00:02:00<audrooku|m>than for
00:02:00<audrooku|m>`soundcloud.com/*`
00:02:00<audrooku|m>due to soundcloud having a very popular oembed feature at `soundcloud.com/oembed`
00:06:17<@JAA>I'm trying a listing equivalent to soundcloud.com/* myself now. How many results did you get exactly there?
00:09:03<audrooku|m>I get 1,432,078 rows with `wget "https://web.archive.org/cdx/search/cdx?url=https://soundcloud.com/*&matchType=domain"`
00:09:03<audrooku|m>I will have to re-run without matchType=domain to test the effect of fuzzing, will get back to you in a sec
00:09:12<@JAA>I'm about 10% done and got just over 20 million unique URLs so far.
00:09:44<audrooku|m>what is the exact url you're querying?
00:10:15<@JAA>/cdx/search/cdx?url=soundcloud.com&matchType=domain&collapse=urlkey&fl=original&output=json with pagination
00:11:11<@JAA>Which pagination are you using?
00:12:06<@JAA>I do showNumPages=true + page, because the resumption key pagination is known not to work well.
00:13:16<audrooku|m>None, I wasn't under the impression that this was necessary unless you set a limit. Is it necessary?
00:13:30<@JAA>It is.
00:15:56<audrooku|m>does matchType=domain also return more results if you're querying for all results under a domain?
00:17:57<@JAA>There's some sort of 'sharding' involved here, or whatever it's called in this context. If you don't do pagination, you only ever get results from the first 'shard'. Or something similar to that.
00:18:26<@JAA>With the 'page' param, you can iterate over the entire CDX store. -ish
00:19:11<@JAA>resumeKey behaves weirdly for queries with lots of results, which is why I switched (back, I think) to the page pagination in ia-cdx-search some time ago.
00:19:18<audrooku|m>wow, that's incredible, I wish this was more emphasized in the docs. When the paginated query with your tool finishes, I'll let you know
00:19:28<@JAA>matchType=domain means matching the domain and any subdomains.
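(A minimal sketch of the page-based pagination described above, using the same parameters as JAA's query from 00:24:07. The use of Python's `requests` and the output filename are illustrative assumptions, and retry handling is omitted; see the sketch further down.)

```python
# Page-based pagination over the WBM CDX API: first ask how many pages
# the query spans, then walk them. Without the 'page' param you only
# ever get results from the first shard.
import requests

CDX = "https://web.archive.org/cdx/search/cdx"
params = {
    "url": "soundcloud.com",
    "matchType": "domain",   # match the domain and any subdomains
    "collapse": "urlkey",    # collapse duplicate captures of the same URL
    "fl": "original",        # return only the original URL field
}

# showNumPages=true returns a single integer (the page count).
num_pages = int(requests.get(CDX, params={**params, "showNumPages": "true"}).text)

# Pages are 0-indexed; fetch each one and append its rows to a file.
with open("soundcloud-urls.txt", "w") as f:  # hypothetical output file
    for page in range(num_pages):
        resp = requests.get(CDX, params={**params, "page": str(page)})
        resp.raise_for_status()
        f.write(resp.text)
```

Since showNumPages returns the full page count up front, the whole query space can be walked in one pass, unlike resumeKey.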
00:23:56<@JAA>Welp, my run crashed due to 503s after 3.3k pages. I understand you're running it as well anyway? If so, I won't resume.
00:24:07<@JAA>`ia-cdx-search --concurrency 4 --tries 10 'url=soundcloud.com&matchType=domain&collapse=urlkey&fl=original'` is what I used to be precise.
00:24:09qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
00:24:16<nicolas17>crashed?
00:24:25<nicolas17>so 503s continued despite 10 retries?
00:24:30<@JAA>Yep
00:24:43<@JAA>The WBM is very stable, you must know.
00:24:44<@JAA>:-)
00:25:53<@JAA>It doesn't sleep between retries though except for 429s, so it might not have persisted for long.
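(A rough sketch of what retry handling around a single page fetch might look like, given the 503s above. The specific backoff policy here, honouring Retry-After on 429 and sleeping briefly on 5xx, is an assumption for illustration, not how ia-cdx-search actually behaves.)

```python
import time
import requests

def fetch_page(url, params, tries=10):
    """Fetch one CDX page, retrying transient errors with backoff."""
    for attempt in range(tries):
        resp = requests.get(url, params=params)
        if resp.status_code == 200:
            return resp.text
        if resp.status_code == 429:
            # Rate limited: honour Retry-After if present, else wait a while.
            time.sleep(int(resp.headers.get("Retry-After", 30)))
        elif resp.status_code in (502, 503, 504):
            # Transient server errors: back off a little more each attempt.
            time.sleep(5 * (attempt + 1))
        else:
            resp.raise_for_status()
    raise RuntimeError(f"gave up after {tries} tries: HTTP {resp.status_code}")
```

Plugging something like this into the page loop above would let a long run survive intermittent 503s instead of crashing a few thousand pages in.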
00:25:58<audrooku|m>> Welp, my run crashed due to 503s after 3.3k pages. I understand you're running it as well anyway? If so, I won't resume.
00:25:58<audrooku|m>Yeah, mine crashed as well, and I accidentally overwrote the file I was piping to, haha. I'll run this query until it finishes since I need the cdx data anyway; will ping you when it's done
00:30:39<fireonlive>tries 9999999999
01:34:26Dango360 quits [Ping timeout: 252 seconds]
02:31:47AnotherIki joins
02:35:36Iki1 quits [Ping timeout: 265 seconds]
03:30:38<@JAA>> This URL has been already captured 7 times today.
03:30:43<@JAA>> Wayback Machine has not archived that URL.
03:30:45<@JAA>Sigh
03:31:00<@JAA>But yeah, apparently the limit is 7 captures per day now.
03:38:39<audrooku|m>Interesting
03:40:27<audrooku|m>BTW JAA do you know how the WBM handles captures of the same url and timestamp? Given that the url and timestamp are the only variables for accessing a capture, I doubt both would be accessible, but I'm curious if you know what would happen in that scenario. I assume they at least allow conflicting url/timestamps between collections or warcs?
03:40:31<audrooku|m>just curious
03:41:31<@JAA>I believe it prefers 200s over other status codes, which is how it makes www → non-www redirects work, for example. But beyond that, no idea.
04:42:27Stiletto joins
05:03:13<audrooku|m>audrey is still waiting on the cdx grab, page 6242 atm
07:42:01Arcorann (Arcorann) joins
08:23:31<audrooku|m>How does the fakeurl work for youtube videos? Is it just an alias for the googlevideo videoplayback urls?
12:24:22Matthww11 quits [Quit: Ping timeout (120 seconds)]
12:25:12Matthww11 joins
13:38:44Arcorann quits [Ping timeout: 265 seconds]
14:45:53IDK (IDK) joins
14:53:10HP_Archivist quits [Ping timeout: 265 seconds]
15:56:46BearFortress quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
16:26:54HP_Archivist (HP_Archivist) joins
16:40:36Dango360 (Dango360) joins
18:25:20HP_Archivist quits [Ping timeout: 252 seconds]
19:34:01Lord_Nightmare quits [Quit: ZNC - http://znc.in]
19:37:42Lord_Nightmare (Lord_Nightmare) joins
19:47:17balrog quits [Quit: Bye]
19:55:44balrog (balrog) joins
21:30:50HP_Archivist (HP_Archivist) joins
21:47:32Barto quits [Read error: Connection reset by peer]
21:47:38Barto (Barto) joins
22:14:16andrew quits [Quit: ]
22:40:28imer quits [Quit: Oh no]
22:41:24imer (imer) joins
23:03:10andrew (andrew) joins
23:42:17<@JAA>> FATAL ERROR: server must neither be primary nor secondary as it has no mate!
23:42:25<@JAA>Aw, poor server.
23:45:34<nicolas17>I'm uploading Xcode 15 and it seems to be saturating my upstream
23:45:55<nicolas17>last year upload speeds were slow and inconsistent
23:59:54<fireonlive>JAA: me_irl