#internetarchive log for 2025-05-06


02:55:22		DogsRNice quits [Read error: Connection reset by peer]
03:32:50	<pabs>	does POST /save/ still do a different thing to GET /save/URL ?
03:33:11	<BlankEclair>	i've seen /save/_embed/ before lmfao
03:33:19	<BlankEclair>	so that's three paths to consider
03:34:08	<pabs>	hmm, haven't seen that one
04:16:02	<pokechu22>	IIRC /save/_embed/ is used as a redirect target for embedded images that haven't been captured before
04:16:09	<pokechu22>	it's not something you load directly otherwise
05:12:06		atphoenix quits [Ping timeout: 260 seconds]
08:25:57		BornOn420_ quits [Remote host closed the connection]
08:26:41		BornOn420 (BornOn420) joins
09:25:24		atphoenix (atphoenix) joins
11:09:41		magmaus3 quits [Ping timeout: 260 seconds]
11:20:01		magmaus3 (magmaus3) joins
11:48:20		BearFortress quits []
12:32:06		BearFortress joins
15:21:04	<@JAA>	Yep
15:22:31	<@JAA>	It appears to me that there is a pre-check and then an actual archival run. The pre-check returns 404 in this case. But with POST, you can tell it to archive that anyway, and that goes through something else, which gets the proper 200 response.
15:22:43	<@JAA>	Not the first time I've seen such differences. Just the first time with GitHub, I think.
15:38:04		atphoenix quits [Ping timeout: 258 seconds]
15:40:57		atphoenix (atphoenix) joins
18:19:16		PredatorIWD25 quits [Read error: Connection reset by peer]
18:24:18		PredatorIWD25 joins
21:39:37	<TheTechRobo>	Brozzler does a HEAD check of the URL to see the status code and if it's HTML before running the page. I'm guessing maybe that's what that is?
22:04:02	<@JAA>	Ah yeah, that's what I was thinking of. And there might be different headers and different TLS fingerprints, too? I'm getting 200s from GitHub on HEAD with curl, anyway.
22:24:54	<TheTechRobo>	Headers, yes. It was previously requests but they switched to urllib3 on 2025-02-13 (not sure if their systems have been updated to that new brozzler version).
22:25:06	<TheTechRobo>	TLS fingerprint should be the same since it's done through warcprox.
22:25:33	<TheTechRobo>	And it actually does GET but ignores the response.
22:27:54	<TheTechRobo>	As it turns out, the part of the code that checks for the headers is different than the part of the code that fetches them. The former still uses requests. https://github.com/internetarchive/brozzler/blob/master/brozzler/worker.py#L384
22:28:19	<@JAA>	Huh, so that 'pre-flight' request is also archived?
22:28:32	<TheTechRobo>	Wait, no, it isn't done through warcprox.
22:28:42	<TheTechRobo>	That might explain it then.
22:28:45	<@JAA>	Right, yeah
22:29:51		<TheTechRobo>	But if it determines that it doesn't need browsing (the Content-Type doesn't include "html") it will run urllib3 to fetch it through warcprox. (It doesn't actually need to wait for the response in that case, since warcprox will continue downloading even if the client hangs up.)
22:30:26	<@JAA>	Interesting
23:32:58		DogsRNice joins
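
Editor's note: the pre-check that TheTechRobo describes (look at the response headers, then either run the page in a browser or fetch it directly through the proxy) can be sketched roughly as below. This is a minimal illustration and not brozzler's actual code. The names `needs_browsing` and `plan_fetch` and the returned labels are hypothetical, and treating a missing Content-Type as "browse" is an assumption made for this sketch only.

```python
from typing import Mapping


def needs_browsing(headers: Mapping[str, str]) -> bool:
    """Decide from response headers whether a page should be run in a browser.

    Hypothetical helper: returns False only when a Content-Type header is
    present and does not contain "html". A missing header falls back to
    browsing, which is an assumption of this sketch.
    """
    for name, value in headers.items():
        # HTTP header names are case-insensitive, so compare in lower case.
        if name.lower() == "content-type":
            return "html" in value.lower()
    return True


def plan_fetch(headers: Mapping[str, str]) -> str:
    """Map the pre-check result to one of two hypothetical action labels."""
    if needs_browsing(headers):
        return "browse"
    # Non-HTML resource: a plain HTTP fetch through the archiving proxy is
    # enough, and the client does not have to wait for the full response.
    return "fetch_via_proxy"


if __name__ == "__main__":
    print(plan_fetch({"Content-Type": "text/html; charset=utf-8"}))
    print(plan_fetch({"Content-Type": "image/png"}))
```

Because the decision depends only on what the pre-check request receives, a site that answers that request differently (for example with a 404, as in the GitHub case discussed above) can send the capture down a different path than a request made through the proxy would.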
