02:31:15 | | DogsRNice joins |
02:41:19 | | qw3rty__ quits [Ping timeout: 272 seconds] |
02:47:48 | | qw3rty__ joins |
03:09:15 | | pabs quits [Read error: Connection reset by peer] |
03:10:00 | | pabs (pabs) joins |
03:56:03 | | DogsRNice quits [Read error: Connection reset by peer] |
03:57:34 | | fuzzy80211 quits [Read error: Connection reset by peer] |
03:58:05 | | fuzzy80211 (fuzzy80211) joins |
04:15:13 | | fuzzy80211 quits [Read error: Connection reset by peer] |
04:15:43 | | fuzzy80211 (fuzzy80211) joins |
05:01:52 | | fuzzy80211 quits [Read error: Connection reset by peer] |
05:02:22 | | fuzzy80211 (fuzzy80211) joins |
05:51:41 | | Sanqui joins |
05:51:43 | | Sanqui is now authenticated as Sanqui |
05:51:43 | | Sanqui quits [Changing host] |
05:51:43 | | Sanqui (Sanqui) joins |
06:07:01 | | JaffaCakes118 quits [Remote host closed the connection] |
06:07:24 | | JaffaCakes118 (JaffaCakes118) joins |
06:24:24 | | nulldata1 (nulldata) joins |
06:25:25 | | nulldata quits [Ping timeout: 255 seconds] |
06:25:25 | | nulldata1 is now known as nulldata |
06:51:30 | <IDK> | I think they are tweaking something, now elements of some pages are returning 503 |
08:49:09 | | Arcorann (Arcorann) joins |
09:39:57 | | nulldata quits [Ping timeout: 272 seconds] |
09:48:49 | | Arcorann quits [Ping timeout: 272 seconds] |
09:50:21 | | nulldata (nulldata) joins |
09:55:12 | <@arkiver> | IDK: got examples? |
09:55:15 | <@arkiver> | examples are always important |
10:02:28 | | MrMcNuggets (MrMcNuggets) joins |
10:02:47 | | MrMcNuggets quits [Remote host closed the connection] |
10:02:58 | | MrMcNuggets (MrMcNuggets) joins |
10:11:35 | | MrMcNuggets quits [Remote host closed the connection] |
10:12:01 | | MrMcNuggets (MrMcNuggets) joins |
11:53:23 | <IDK> | Not anymore, for both cases |
11:54:09 | <IDK> | It's like one day some go down and are back in a few hours, and then others go down
12:35:49 | | MrMcNugg1 (MrMcNuggets) joins |
12:36:13 | | MrMcNugg1 quits [Client Quit] |
12:39:49 | | MrMcNuggets quits [Ping timeout: 272 seconds] |
12:40:05 | | MrMcNuggets (MrMcNuggets) joins |
13:58:34 | | BearFortress_ quits [Ping timeout: 255 seconds] |
14:32:46 | | f_ quits [Ping timeout: 255 seconds] |
14:47:23 | | f_ (funderscore) joins |
15:28:20 | | MrMcNugg1 (MrMcNuggets) joins |
15:32:05 | | MrMcNuggets quits [Ping timeout: 272 seconds] |
15:54:47 | | MrMcNugg1 quits [Remote host closed the connection] |
15:55:37 | | MrMcNuggets (MrMcNuggets) joins |
15:56:11 | <audrooku|m> | interesting |
15:58:46 | <audrooku|m> | is there a working way to dump the cdx pages for a domain at the moment? |
15:59:02 | <audrooku|m> | I need all the urls but shownumpages is still broken and I'm getting lots of 504 errors |
16:18:48 | | MrMcNuggets quits [Client Quit] |
16:29:53 | <Dango360> | audR |
16:30:06 | <Dango360> | oops, had ahk on |
16:30:55 | <Dango360> | audrooku|m: you can use the cdx api: "https://web.archive.org/cdx/search?url=http://example.com/&matchType=prefix&collapse=urlkey&filter=statuscode%3A200&fl=original" |
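A minimal sketch of building the CDX query above programmatically. The parameter names (`url`, `matchType`, `collapse`, `filter`, `fl`) come straight from the query string in the message; the endpoint constant and helper name are just illustrative.

```python
from urllib.parse import urlencode

# Base endpoint from the URL in the message above.
CDX_ENDPOINT = "https://web.archive.org/cdx/search"

def build_cdx_query(target_url, **params):
    """Return a CDX search URL for target_url with extra query parameters."""
    query = {"url": target_url, **params}
    return CDX_ENDPOINT + "?" + urlencode(query)

url = build_cdx_query(
    "http://example.com/",
    matchType="prefix",
    collapse="urlkey",
    filter="statuscode:200",
    fl="original",
)
print(url)
```

`urlencode` handles the percent-encoding (e.g. `statuscode:200` becomes `statuscode%3A200`), which matches the hand-encoded URL quoted above.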
16:31:34 | <Dango360> | oh wait, are you already using this? |
16:33:45 | <Dango360> | if it's not working the way it should you can try contacting info@archive.org and see if they can fix it |
16:36:48 | <audrooku|m> | I am sure ia is well aware, a lot of people have brought it up and it's been broken for at least a month
16:36:54 | <audrooku|m> | But sure I will email them |
16:46:55 | <audrooku|m> | I believe ark.iver has been mentioned a few times about this, but I'm not sure what he said or if he passed it along |
16:58:14 | | BearFortress joins |
17:12:11 | <audrooku|m> | hmm ok I don't think he actually saw it: arkiver |
17:12:29 | <audrooku|m> | example url: https://web.archive.org/cdx/search/cdx?url=https://goo.gl/*&matchType=domain&showNumPages=true |
17:41:26 | <audrooku|m> | *the asterisk causes the query to not return any results normally, nix that part |
17:55:42 | | DLoader quits [Quit: The Lounge - https://thelounge.chat] |
18:06:53 | <audrooku|m> | huh weird, this query gave me 1.17GiB of urls (7.6M) for the first page, maybe the issue is with pagination? https://web.archive.org/cdx/search/cdx?format=json&page=0&url=soundcloud.app.goo.gl&matchType=domain |
18:08:38 | | DLoader (DLoader) joins |
18:18:42 | <@JAA> | The problem is with the page-based pagination, yes. |
18:19:21 | <@JAA> | I didn't poke the resume key pagination much, but my attempts simply returned all results instead of a key. |
18:21:38 | <audrooku|m> | Yes, I'm noticing this as well. If I tweak the page size then shownumpages will simply return 1
18:24:28 | <audrooku|m> | I think I have a script behaving with this query format https://web.archive.org/cdx/search/cdx?url=soundcloud.app.goo.gl&matchType=domain&output=json&pageSize=1&page=3 |
18:24:43 | <audrooku|m> | but I will update and share the script when I have the data |
18:26:10 | <audrooku|m> | at least this will work for dumping all urls, I can live without the server side filtering |
18:29:34 | <@JAA> | Yeah, the resume key thing was somewhat broken with server-side filtering anyway last time I tried. That's why my script uses the page pagination. |
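The page-based pagination workaround being discussed can be sketched as follows: query `showNumPages` first with an explicit `pageSize` (since the server's default appears broken), then fetch `page=0..N-1`. This only generates the per-page URLs; the function name and structure are assumptions, not anyone's actual script.

```python
from urllib.parse import urlencode

# Endpoint from the example queries above.
CDX = "https://web.archive.org/cdx/search/cdx"

def page_urls(domain, num_pages, page_size=50):
    """Yield one CDX URL per result page for a domain-wide dump.

    num_pages would come from a prior &showNumPages=1 query made with
    the same pageSize; pageSize=50 matches the documented default that
    the discussion suggests must now be passed explicitly.
    """
    for page in range(num_pages):
        yield CDX + "?" + urlencode({
            "url": domain,
            "matchType": "domain",
            "output": "json",
            "pageSize": page_size,
            "page": page,
        })

urls = list(page_urls("soundcloud.app.goo.gl", num_pages=3))
```

Each yielded URL has the same shape as the working query quoted above, differing only in the `page` parameter.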
18:30:44 | <audrooku|m> | I didn't even know there was actual resume key pagination, odd |
18:30:51 | | DLoader quits [Client Quit] |
18:30:59 | <audrooku|m> | https://web.archive.org/cdx/search/cdx?url=soundcloud.app.goo.gl&matchType=domain&output=json&pageSize=1&page=6&showNumPages=1 |
18:31:01 | <audrooku|m> | this works |
18:31:11 | <audrooku|m> | I'm wondering if maybe this is related to the order of the query string? |
18:32:00 | <audrooku|m> | or maybe ia saw my email 😅... apologies for the stream of consciousness by the way
18:36:04 | | DLoader (DLoader) joins |
18:36:06 | <audrooku|m> | setting pagesize to 1 and/or 50 seems to cause it to work, maybe it will fail on larger domains |
18:37:31 | <audrooku|m> | https://web.archive.org/cdx/search/cdx?url=youtube.com&matchType=domain&pageSize=50&showNumPages=1 |
18:37:31 | <audrooku|m> | `52479` |
18:38:13 | <audrooku|m> | I scraped this previously and there was >200k pages, but maybe the default pagesize was not 50 (the docs say it is) |
18:59:09 | <OrIdow6> | If you use curl -v |
18:59:31 | <OrIdow6> | It complains that "x-archive-wayback-runtime-error: positive pageSize is required" |
19:02:13 | <audrooku|m> | it sounds to me like they deleted or mangled the default config |
19:02:39 | <audrooku|m> | but thanks for pointing that out, that's something I didn't notice and definitely a really relevant error message
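The `x-archive-wayback-runtime-error` header mentioned above only shows up when you inspect response headers (hence `curl -v`). A small sketch of surfacing it from any HTTP client's header mapping; the header name is from the message above, everything else is an assumption.

```python
def cdx_runtime_error(headers):
    """Return the CDX server's runtime error message, if present.

    headers is any mapping of header name -> value; the lookup is
    case-insensitive because clients normalize header names differently.
    """
    for name, value in headers.items():
        if name.lower() == "x-archive-wayback-runtime-error":
            return value
    return None

err = cdx_runtime_error(
    {"X-Archive-Wayback-Runtime-Error": "positive pageSize is required"}
)
```

Checking this header on every response makes the "empty result vs. server-side failure" ambiguity discussed above explicit.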
19:04:10 | <OrIdow6> | If someone wants to dig thru a very enterprise Java program I believe an old version of the CDX server source is online somewhere |
19:05:00 | <audrooku|m> | https://github.com/internetarchive/wayback |
19:11:46 | <audrooku|m> | I wrote my own ia-cdx-search script and have finished validating that it behaves as properly as it can until this gets fully sorted out. JAA's should work by just specifying pageSize=50 in the query, though I haven't tested that https://github.com/tntmod54321/audrey-ia-cdx-search
19:17:04 | | ifconfig joins |
19:53:22 | | ifconfig quits [Client Quit] |
20:02:10 | | BearFortress quits [Ping timeout: 255 seconds] |
21:06:28 | | BearFortress joins |
21:29:47 | | yarrow quits [Read error: Connection reset by peer] |
21:31:29 | | yarrow (yarrow) joins |
22:06:19 | | DogsRNice joins |
22:20:00 | | yarr0w joins |
22:22:34 | | yarrow quits [Ping timeout: 255 seconds] |
22:25:31 | | yarr0w quits [Client Quit] |
22:25:52 | | yarrow (yarrow) joins |
22:57:13 | | tzt quits [Ping timeout: 255 seconds] |
23:49:36 | | tzt (tzt) joins |
23:54:19 | | qwertyasdfuiopghjkl quits [Ping timeout: 272 seconds] |