| 00:00:55 | <thalia> | Is there a way to order PDFs in the book viewer when there's multiple, besides just filename lexical ordering? |
| 00:01:23 | <thalia> | And is there some metadata to mark that a PDF should be opened in the viewer with the 1-page view by default? |
| 01:15:04 | | xkey quits [Quit: WeeChat 4.7.2] |
| 01:16:16 | | xkey (xkey) joins |
| 02:39:05 | | nulldata-alt9 (nulldata) joins |
| 02:40:51 | | nulldata-alt quits [Ping timeout: 272 seconds] |
| 02:40:51 | | nulldata-alt9 is now known as nulldata-alt |
| 03:40:24 | | AlsoHP_Archivist joins |
| 03:40:30 | | PredatorIWD255 joins |
| 03:42:18 | | PredatorIWD25 quits [Ping timeout: 256 seconds] |
| 03:42:18 | | PredatorIWD255 is now known as PredatorIWD25 |
| 03:44:34 | | HP_Archivist quits [Ping timeout: 256 seconds] |
| 03:55:55 | | AlsoHP_Archivist quits [Client Quit] |
| 03:56:10 | | HP_Archivist (HP_Archivist) joins |
| 04:15:19 | <pabs> | anyone know what could cause SPN2 (email API) to give "Error! Capture timed out" for a URL? |
| 04:23:39 | <nicolas17> | I just uploaded a zip to archive.org that I didn't mean to upload compressed, and I used --delete so it was gone locally |
| 04:24:20 | <nicolas17> | I re-downloaded it from IA, unzipped it, deleted it from IA using 'ia rm --no-backup', and uploaded the folder |
| 04:24:39 | <nicolas17> | which uploaded the unzipped contents *and the zip again* because I forgot to delete it after unzip :/ |
| 05:15:22 | | DogsRNice quits [Read error: Connection reset by peer] |
| 07:06:32 | | DopefishJustin quits [Remote host closed the connection] |
| 07:14:44 | | DopefishJustin joins |
| 07:14:45 | | DopefishJustin is now authenticated as DopefishJustin |
| 07:17:46 | | dendory3 joins |
| 07:19:31 | | dendory quits [Ping timeout: 272 seconds] |
| 12:16:18 | <cruller> | According to https://help.archive.org/help/using-the-wayback-machine/, “Alexa Internet, in cooperation with the Internet Archive, has designed a three dimensional index that allows browsing of web documents over multiple time periods, and turned this unique feature into the Wayback Machine.” |
| 12:16:24 | <cruller> | What exactly are these “three dimensions”? |
| 12:29:37 | <vics> | I suppose that two dimensions are browsing a web site as usual, and the third one is time. |
| 12:42:31 | | pabs quits [Ping timeout: 272 seconds] |
| 12:47:25 | <cruller> | You mean vertical, horizontal, and time? I also think that's the most plausible theory, but I'm not certain. |
| 12:49:51 | <cruller> | At least, time must be included. |
| 13:06:37 | | pabs (pabs) joins |
| 13:37:31 | | lexikiq quits [Quit: Ping timeout (120 seconds)] |
| 13:37:56 | | lexikiq joins |
| 14:03:47 | | dendory3 is now known as dendory |
| 14:03:56 | | dendory is now authenticated as dendory |
| 14:03:56 | | dendory quits [Changing host] |
| 14:03:56 | | dendory (dendory) joins |
| 14:47:08 | <nstrom|m> | I mean it could be a VR virtualization where you fly through file folders like in Hackers. but I doubt it |
| 15:08:12 | | SootBector quits [Remote host closed the connection] |
| 15:09:24 | | SootBector (SootBector) joins |
| 15:54:35 | | rewby quits [Quit: WeeChat 4.4.2] |
| 16:30:44 | | rewby (rewby) joins |
| 21:11:09 | <datechnoman> | JAA - Is this still the best way to query / list subdomains using IA's CDX? - https://gitea.arpa.li/JustAnotherArchivist/little-things/raw/branch/master/ia-cdx-search-subdomains |
| 21:11:29 | <datechnoman> | I haven't tried configuring the requirements yet. Wasn't sure if this is still valid or not |
| 21:16:19 | <@JAA> | datechnoman: Hmm, I haven't actually used that script in a long while, only ia-cdx-search directly. |
| 21:16:42 | <@JAA> | But I think it should work, yeah. |
| 21:24:54 | <datechnoman> | Got an easier way / one liner otherwise? |
| 21:25:21 | <datechnoman> | All I want to do is be able to query IA's CDX for all subdomains of a domain, eg: *.foo.com |
| 21:25:56 | <datechnoman> | Reading the CDX documentation, https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server, it should be a simple command but I can't get it to work the way it should... |
| 21:26:00 | <datechnoman> | It's most likely just me :/ |
| 21:27:30 | <datechnoman> | ChatGPT keeps telling me to run the same things and they don't seem to work either :/ |
| 21:30:25 | <@JAA> | Do you want the list of domains or the list of URLs? |
| 21:31:08 | <pokechu22> | I'm not aware of a way that's better than listing all URLs and then collapsing it to a list of subdomains afterwards. The CDX server *does* let you collapse to a substring of URLs, but that doesn't match with subdomains directly - you need to pick a length that you expect substrings to be under, and you'll still get multiple results for most subdomains |
| 21:32:34 | <datechnoman> | Happy to do cleanup post pulling the data |
| 21:32:53 | <datechnoman> | In this case I'm wanting the domains and don't care about the URLs, BUT, it would be nice to know how to use that in the future |
| 21:33:17 | <datechnoman> | Let's say a list of URLs, as I can then process it to be just the subdomains and have both capabilities |
| 21:34:56 | <pokechu22> | I think ia-cdx-search does that. The way I personally do it (which has some other limitations regarding rate-limiting/getting multiple pages at a time) is plugging it in to https://web.archive.org/cdx/search/cdx?url=example.com&collapse=urlkey&matchType=domain&fl=original&limit=100000&showResumeKey=true&resumeKey= and then as needed plugging the resume key at the end of the list back into that, but I generally do that by hand and it's kinda slow |
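The resume-key loop described above can be sketched in Python. This is a minimal, hedged sketch: the endpoint and query parameters are copied from the message, but the helper names (`split_resume_key`, `fetch_all`) are made up here, and the exact response framing (results, a blank line, then the resume key when `showResumeKey=true`) is an assumption based on the CDX server documentation.

```python
import urllib.request

# CDX query from the conversation above; resumeKey is appended each round.
CDX = ("https://web.archive.org/cdx/search/cdx"
       "?url=example.com&collapse=urlkey&matchType=domain"
       "&fl=original&limit=100000&showResumeKey=true&resumeKey=")

def split_resume_key(page: str):
    """Split a CDX response into (result lines, resume key).

    With showResumeKey=true, the server is assumed to append a blank line
    followed by the resume key after the results; the final page has no key.
    """
    lines = page.rstrip("\n").split("\n")
    if len(lines) >= 2 and lines[-2] == "":
        return lines[:-2], lines[-1]
    return lines, None

def fetch_all(base_url: str = CDX):
    """Keep requesting pages until no resume key is returned (needs network)."""
    results, key = [], ""
    while key is not None:
        with urllib.request.urlopen(base_url + key) as resp:
            page = resp.read().decode("utf-8", "replace")
        rows, key = split_resume_key(page)
        results.extend(rows)
    return results
```

As noted in the chat, this by-hand pagination is slow and can hit 502/503 errors; ia-cdx-search is likely the more robust option for scripting.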
| 21:38:29 | <pokechu22> | ia-cdx-search is probably faster/more reliable for scripting, and it looks like ia-cdx-search-subdomains just uses ia-cdx-search and then post-processes it to a list of subdomains. The way I do it works but sometimes gets 502s (I think, might be 503s?) |
| 21:41:27 | <datechnoman> | Only issue is that it is only listing URLs matching "example.com*" where I want "subdomain1.example.com*", "subdomain2.example.com*" if you know what I mean? - https://web.archive.org/cdx/search/cdx?url=example.com&collapse=urlkey&matchType=domain&fl=original&limit=100000&showResumeKey=true&resumeKey=
| 21:42:00 | <@JAA> | matchType=domain matches subdomains, too. |
| 21:43:34 | <datechnoman> | Hmmm maybe I need to let it run longer, results are streaming through. See if any subdomains start popping up |
| 21:45:23 | <@JAA> | url=example.org&matchType=domain&collapse=urlkey&fl=original is what I normally use for this use case. And I post-process it into a list of domains later when needed. |
| 21:45:52 | <@JAA> | Which is exactly what ia-cdx-search-subdomains does as well, just with the post-processing immediately. |
| 21:46:13 | <@JAA> | It does strip ports as well, in case that matters. |
| 21:46:20 | <@JAA> | So really only domains, not hosts. |
| 21:46:41 | <pokechu22> | Note that you'll get both www and non-www at the start because the urlkey suppresses www; other subdomains appear after those |
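The post-processing step JAA describes (reduce the `fl=original` URL list to a port-stripped list of hostnames, as ia-cdx-search-subdomains reportedly does) can be sketched as below. The function name `hosts_from_urls` is hypothetical, not from the script itself.

```python
from urllib.parse import urlsplit

def hosts_from_urls(urls):
    """Collapse a list of CDX 'original' URLs to a sorted, deduplicated
    list of hostnames. urlsplit().hostname lowercases the host and drops
    any port, matching the port-stripping behavior mentioned in the chat."""
    hosts = set()
    for u in urls:
        host = urlsplit(u).hostname
        if host:
            hosts.add(host)
    return sorted(hosts)
```

For example, feeding it `["http://WWW.Example.com:8080/a", "https://sub.example.com/b"]` yields `["sub.example.com", "www.example.com"]` — note that, as pokechu22 says, both www and non-www forms survive because the collapse happens on the urlkey, not the host.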
| 22:45:56 | <datechnoman> | Ahhh so that's working. Turns out it was always working, I just needed to let it output the root domain URLs first... *sigh* |
| 22:46:39 | <datechnoman> | Thank you very much all |
| 22:52:38 | <pokechu22> | Is there more information on the indexing issue (mentioned in #archiveteam-bs recently)? My understanding is that it's related to indexing but I'd appreciate something public/official I can point to |
| 23:16:54 | | atphoenix_ (atphoenix) joins |
| 23:19:39 | | atphoenix__ quits [Ping timeout: 272 seconds] |