| 00:00:55 | <thalia> | Is there a way to order PDFs in the book viewer when there's multiple, besides just filename lexical ordering? |
| 00:01:23 | <thalia> | And is there some metadata to mark that a PDF should be opened in the viewer with the 1-page view by default? |
| 01:15:04 | | xkey quits [Quit: WeeChat 4.7.2] |
| 01:16:16 | | xkey (xkey) joins |
| 02:39:05 | | nulldata-alt9 (nulldata) joins |
| 02:40:51 | | nulldata-alt quits [Ping timeout: 272 seconds] |
| 02:40:51 | | nulldata-alt9 is now known as nulldata-alt |
| 03:40:24 | | AlsoHP_Archivist joins |
| 03:40:30 | | PredatorIWD255 joins |
| 03:42:18 | | PredatorIWD25 quits [Ping timeout: 256 seconds] |
| 03:42:18 | | PredatorIWD255 is now known as PredatorIWD25 |
| 03:44:34 | | HP_Archivist quits [Ping timeout: 256 seconds] |
| 03:55:55 | | AlsoHP_Archivist quits [Client Quit] |
| 03:56:10 | | HP_Archivist (HP_Archivist) joins |
| 04:15:19 | <pabs> | anyone know what could cause SPN2 (email API) to give "Error! Capture timed out" for a URL? |
| 04:23:39 | <nicolas17> | I just uploaded a zip to archive.org that I didn't mean to upload compressed, and I used --delete so it was gone locally |
| 04:24:20 | <nicolas17> | I re-downloaded it from IA, unzipped it, deleted it from IA using 'ia rm --no-backup', and uploaded the folder |
| 04:24:39 | <nicolas17> | which uploaded the unzipped contents *and the zip again* because I forgot to delete it after unzip :/ |
| 05:15:22 | | DogsRNice quits [Read error: Connection reset by peer] |
| 07:06:32 | | DopefishJustin quits [Remote host closed the connection] |
| 07:14:44 | | DopefishJustin joins |
| 07:14:45 | | DopefishJustin is now authenticated as DopefishJustin |
| 07:17:46 | | dendory3 joins |
| 07:19:31 | | dendory quits [Ping timeout: 272 seconds] |
| 12:16:18 | <cruller> | According to https://help.archive.org/help/using-the-wayback-machine/, “Alexa Internet, in cooperation with the Internet Archive, has designed a three dimensional index that allows browsing of web documents over multiple time periods, and turned this unique feature into the Wayback Machine.” |
| 12:16:24 | <cruller> | What exactly are these “three dimensions”? |
| 12:29:37 | <vics> | I suppose that two dimensions are browsing a web site as usual, and the third one is time. |
| 12:42:31 | | pabs quits [Ping timeout: 272 seconds] |
| 12:47:25 | <cruller> | You mean vertical, horizontal, and time? I also think that's the most plausible theory, but I'm not certain. |
| 12:49:51 | <cruller> | At least, time must be included. |
| 13:06:37 | | pabs (pabs) joins |
| 13:37:31 | | lexikiq quits [Quit: Ping timeout (120 seconds)] |
| 13:37:56 | | lexikiq joins |
| 14:03:47 | | dendory3 is now known as dendory |
| 14:03:56 | | dendory is now authenticated as dendory |
| 14:03:56 | | dendory quits [Changing host] |
| 14:03:56 | | dendory (dendory) joins |
| 14:47:08 | <nstrom|m> | I mean it could be a VR virtualization where you fly through file folders like in Hackers. but I doubt it |
| 15:08:12 | | SootBector quits [Remote host closed the connection] |
| 15:09:24 | | SootBector (SootBector) joins |
| 15:54:35 | | rewby quits [Quit: WeeChat 4.4.2] |
| 16:30:44 | | rewby (rewby) joins |
| 21:11:09 | <datechnoman> | JAA - Is this still the best way to query / list subdomains using IA's CDX? - https://gitea.arpa.li/JustAnotherArchivist/little-things/raw/branch/master/ia-cdx-search-subdomains |
| 21:11:29 | <datechnoman> | I haven't tried configuring the requirements yet. Wasn't sure if this is still valid or not |
| 21:16:19 | <@JAA> | datechnoman: Hmm, I haven't actually used that script in a long while, only ia-cdx-search directly. |
| 21:16:42 | <@JAA> | But I think it should work, yeah. |
| 21:24:54 | <datechnoman> | Got an easier way / one liner otherwise? |
| 21:25:21 | <datechnoman> | All I want to do is be able to query IA's CDX for all subdomains of a domain, eg: *.foo.com |
| 21:25:56 | <datechnoman> | Reading the CDX documentation, https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server, it should be a simple command but I can't get it to work the way it should... |
| 21:26:00 | <datechnoman> | It's most likely just me :/ |
| 21:27:30 | <datechnoman> | ChatGPT keeps telling me to run the same things and they don't seem to work either :/ |
| 21:30:25 | <@JAA> | Do you want the list of domains or the list of URLs? |
| 21:31:08 | <pokechu22> | I'm not aware of a way that's better than listing all URLs and then collapsing it to a list of subdomains afterwards. The CDX server *does* let you collapse to a substring of URLs, but that doesn't match with subdomains directly - you need to pick a length that you expect substrings to be under, and you'll still get multiple results for most subdomains |
| 21:32:34 | <datechnoman> | Happy to do cleanup post pulling the data |
| 21:32:53 | <datechnoman> | In this case I'm wanting the domains and don't care about the URLs, BUT, it would be nice to know how to use that in the future |
| 21:33:17 | <datechnoman> | Let's say a list of URLs, as I can then process it to be just the subdomains and have both capabilities |
| 21:34:56 | <pokechu22> | I think ia-cdx-search does that. The way I personally do it (which has some other limitations regarding rate-limiting/getting multiple pages at a time) is plugging it in to https://web.archive.org/cdx/search/cdx?url=example.com&collapse=urlkey&matchType=domain&fl=original&limit=100000&showResumeKey=true&resumeKey= and then as needed plugging the resume key at the end of the list back into that, but I generally do that by hand and it's kinda slow |
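The resume-key loop described above can be sketched in Python. This is a minimal, hedged sketch: the endpoint and query parameters are copied from the message, but the helper names (`split_resume_key`, `fetch_all`) are made up here, and the exact response framing (results, a blank line, then the resume key when `showResumeKey=true`) is an assumption based on the CDX server documentation.

```python
import urllib.request

# CDX query from the conversation above; resumeKey is appended each round.
CDX = ("https://web.archive.org/cdx/search/cdx"
       "?url=example.com&collapse=urlkey&matchType=domain"
       "&fl=original&limit=100000&showResumeKey=true&resumeKey=")

def split_resume_key(page: str):
    """Split a CDX response into (result lines, resume key).

    With showResumeKey=true, the server is assumed to append a blank line
    followed by the resume key after the results; the final page has no key.
    """
    lines = page.rstrip("\n").split("\n")
    if len(lines) >= 2 and lines[-2] == "":
        return lines[:-2], lines[-1]
    return lines, None

def fetch_all(base_url: str = CDX):
    """Keep requesting pages until no resume key is returned (needs network)."""
    results, key = [], ""
    while key is not None:
        with urllib.request.urlopen(base_url + key) as resp:
            page = resp.read().decode("utf-8", "replace")
        rows, key = split_resume_key(page)
        results.extend(rows)
    return results
```

As noted in the chat, this by-hand pagination is slow and can hit 502/503 errors; ia-cdx-search is likely the more robust option for scripting.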
| 21:38:29 | <pokechu22> | ia-cdx-search is probably faster/more reliable for scripting, and it looks like ia-cdx-search-subdomains just uses ia-cdx-search and then post-processes it to a list of subdomains. The way I do it works but sometimes gets 502s (I think, might be 503s?) |
| 21:41:27 | <datechnoman> | Only issue is that it is only listing URLs matching "example.com*" where I want "subdomain1.example.com*", "subdomain2.example.com*" if you know what I mean? - https://web.archive.org/cdx/search/cdx?url=example.com&collapse=urlkey&matchType=domain&fl=original&limit=100000&showResumeKey=true&resumeKey=
| 21:42:00 | <@JAA> | matchType=domain matches subdomains, too. |
| 21:43:34 | <datechnoman> | Hmmm maybe I need to let it run longer, results are streaming through. See if any subdomains start popping up |
| 21:45:23 | <@JAA> | url=example.org&matchType=domain&collapse=urlkey&fl=original is what I normally use for this use case. And I post-process it into a list of domains later when needed. |
| 21:45:52 | <@JAA> | Which is exactly what ia-cdx-search-subdomains does as well, just with the post-processing immediately. |
| 21:46:13 | <@JAA> | It does strip ports as well, in case that matters. |
| 21:46:20 | <@JAA> | So really only domains, not hosts. |
| 21:46:41 | <pokechu22> | Note that you'll get both www and non-www at the start because the urlkey suppresses www; other subdomains appear after those |
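The post-processing step JAA describes (reduce the `fl=original` URL list to a port-stripped list of hostnames, as ia-cdx-search-subdomains reportedly does) can be sketched as below. The function name `hosts_from_urls` is hypothetical, not from the script itself.

```python
from urllib.parse import urlsplit

def hosts_from_urls(urls):
    """Collapse a list of CDX 'original' URLs to a sorted, deduplicated
    list of hostnames. urlsplit().hostname lowercases the host and drops
    any port, matching the port-stripping behavior mentioned in the chat."""
    hosts = set()
    for u in urls:
        host = urlsplit(u).hostname
        if host:
            hosts.add(host)
    return sorted(hosts)
```

For example, feeding it `["http://WWW.Example.com:8080/a", "https://sub.example.com/b"]` yields `["sub.example.com", "www.example.com"]` — note that, as pokechu22 says, both www and non-www forms survive because the collapse happens on the urlkey, not the host.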
| 22:45:56 | <datechnoman> | Ahhh so that's working. Turns out it was always working, I just needed to let it output the root domain URLs first... *sigh* |
| 22:46:39 | <datechnoman> | Thank you very much all |
| 22:52:38 | <pokechu22> | Is there more information on the indexing issue (mentioned in #archiveteam-bs recently)? My understanding is that it's related to indexing but I'd appreciate something public/official I can point to |
| 23:16:54 | | atphoenix_ (atphoenix) joins |
| 23:19:39 | | atphoenix__ quits [Ping timeout: 272 seconds] |