02:55:22 | | DogsRNice quits [Read error: Connection reset by peer] |
03:32:50 | <pabs> | does POST /save/ still do a different thing to GET /save/URL? |
03:33:11 | <BlankEclair> | i've seen /save/_embed/ before lmfao |
03:33:19 | <BlankEclair> | so that's three paths to consider |
03:34:08 | <pabs> | hmm, haven't seen that one |
04:16:02 | <pokechu22> | IIRC /save/_embed/ is used as a redirect target for embedded images that haven't been captured before |
04:16:09 | <pokechu22> | it's not something you load directly otherwise |
05:12:06 | | atphoenix quits [Ping timeout: 260 seconds] |
08:25:57 | | BornOn420_ quits [Remote host closed the connection] |
08:26:41 | | BornOn420 (BornOn420) joins |
09:25:24 | | atphoenix (atphoenix) joins |
11:09:41 | | magmaus3 quits [Ping timeout: 260 seconds] |
11:20:01 | | magmaus3 (magmaus3) joins |
11:48:20 | | BearFortress quits [] |
12:32:06 | | BearFortress joins |
15:21:04 | <@JAA> | Yep |
15:22:31 | <@JAA> | It appears to me that there is a pre-check and then an actual archival run. The pre-check returns 404 in this case. But with POST, you can tell it to archive that anyway, and that goes through something else, which gets the proper 200 response. |
15:22:43 | <@JAA> | Not the first time I've seen such differences. Just the first time with GitHub, I think. |
15:38:04 | | atphoenix quits [Ping timeout: 258 seconds] |
15:40:57 | | atphoenix (atphoenix) joins |
18:19:16 | | PredatorIWD25 quits [Read error: Connection reset by peer] |
18:24:18 | | PredatorIWD25 joins |
21:39:37 | <TheTechRobo> | Brozzler does a HEAD check of the URL to see the status code and if it's HTML before running the page. I'm guessing maybe that's what that is? |
22:04:02 | <@JAA> | Ah yeah, that's what I was thinking of. And there might be different headers and different TLS fingerprints, too? I'm getting 200s from GitHub on HEAD with curl, anyway. |
22:24:54 | <TheTechRobo> | Headers, yes. It was previously requests but they switched to urllib3 on 2025-02-13 (not sure if their systems have been updated to that new brozzler version). |
22:25:06 | <TheTechRobo> | TLS fingerprint should be the same since it's done through warcprox. |
22:25:33 | <TheTechRobo> | And it actually does GET but ignores the response. |
22:27:54 | <TheTechRobo> | As it turns out, the part of the code that checks for the headers is different than the part of the code that fetches them. The former still uses requests. https://github.com/internetarchive/brozzler/blob/master/brozzler/worker.py#L384 |
22:28:19 | <@JAA> | Huh, so that 'pre-flight' request is also archived? |
22:28:32 | <TheTechRobo> | Wait, no, it isn't done through warcprox. |
22:28:42 | <TheTechRobo> | That might explain it then. |
22:28:45 | <@JAA> | Right, yeah |
22:29:51 | <TheTechRobo> | But if it determines that it doesn't need browsing (the Content-Type doesn't include "html") it will run urllib3 to fetch it through warcprox. (It doesn't actually need to wait for the response in that case, since warcprox will continue downloading even if the client hangs up.) |
22:30:26 | <@JAA> | Interesting |
23:32:58 | | DogsRNice joins |
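
For reference, the pre-check decision discussed above (look at the status code and whether the Content-Type mentions "html", then either browse the page or fetch it directly through the proxy) can be sketched roughly as follows. This is only an illustration, not brozzler's actual code: the names `needs_browsing` and `plan_fetch` are hypothetical, and both the handling of error statuses ("skip") and of a missing Content-Type header ("browse") are assumptions made for the sketch.

```python
from typing import Optional


def needs_browsing(content_type: Optional[str]) -> bool:
    """Return True when the page should be loaded in a browser.

    Assumption for this sketch: a missing Content-Type header is
    treated as "might be HTML", so the page is browsed.
    """
    if content_type is None:
        return True
    return "html" in content_type.lower()


def plan_fetch(status: int, content_type: Optional[str]) -> str:
    """Decide what to do with a URL based on the pre-check response.

    Returns one of:
      "skip"         - pre-check got an error status (assumed behavior)
      "browse"       - response looks like HTML, load it in the browser
      "direct-fetch" - not HTML, fetch it once through the archiving proxy
    """
    if status >= 400:
        return "skip"
    if needs_browsing(content_type):
        return "browse"
    return "direct-fetch"


if __name__ == "__main__":
    print(plan_fetch(200, "text/html; charset=utf-8"))
    print(plan_fetch(200, "image/png"))
    print(plan_fetch(404, "text/html"))
```

Under these assumptions, a pre-check that receives a 404 (as in the GitHub case above) never reaches the archival step, even if a later request made with different headers or through the proxy would have received a 200.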