00:00:00<eggdrop>[remind] arkiver: pabs3 account issue :)
00:23:15xkey quits [Quit: WeeChat 4.9.1]
00:23:27xkey (xkey) joins
00:28:36<pabs>JAA: what are revisit records? and why is the first text/html 301 on 20231222143008 *after* the three revisit records?
00:30:22<@JAA>pabs: A revisit record in a WARC is a deduplicated reference to a previous response, almost always when the payload (decoded HTTP body) is identical.
00:31:40<@JAA>In this case, at least now, http://git.exotic.sh/ returns a standard nginx response for 301s, so it was deduped against some other similar record on a different URL in those first captures.
00:32:14<pabs>huh, so dedups are archive-wide?
00:32:30<pabs>or just around the same time or in the same WARC?
00:32:45<@JAA>No, it's something the software writing those WARCs did.
00:32:53<pabs>oh
00:32:57<pabs>thanks
00:33:00<pabs>JAA++
00:33:01<eggdrop>[karma] 'JAA' now has 396 karma!
00:33:26<@JAA>wget-at and qwarc dedupe within the WARC(s) written by a single process, for example.
00:34:16<@JAA>wget also has a mechanism for loading a CDX for deduping across multiple runs, but IIRC, that's somewhat broken, and it's not really used by anything at AT at least.
00:34:37<@JAA>Other tools may have everything backed by a database or similar.
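[editor's note] The per-process deduplication JAA describes above can be sketched roughly as follows. This is a minimal illustration, not the actual wget-at or qwarc code: the class `PayloadDeduper`, its method `record_for`, and the example URLs and dates are all hypothetical. Only the WARC header names and the revisit profile URI come from the WARC 1.1 specification. The index lives in memory, so it only dedupes within one run, and it keys on the payload digest alone, which is why a response can be deduped against a record for a different URL.

```python
import base64
import hashlib
import uuid


class PayloadDeduper:
    """Hypothetical in-memory dedupe index, scoped to a single process."""

    def __init__(self):
        # payload digest -> (record ID, target URI, date) of the first capture
        self._seen = {}

    def record_for(self, uri, date, payload):
        """Return WARC headers for a capture: 'response' or 'revisit'."""
        sha1 = hashlib.sha1(payload).digest()
        digest = "sha1:" + base64.b32encode(sha1).decode("ascii")
        record_id = f"<urn:uuid:{uuid.uuid4()}>"
        headers = {
            "WARC-Record-ID": record_id,
            "WARC-Target-URI": uri,
            "WARC-Date": date,
            "WARC-Payload-Digest": digest,
        }
        original = self._seen.get(digest)
        if original is None:
            # First time this payload is seen: store it as a full response.
            self._seen[digest] = (record_id, uri, date)
            headers["WARC-Type"] = "response"
        else:
            # Identical payload seen before: write a revisit pointing back.
            orig_id, orig_uri, orig_date = original
            headers["WARC-Type"] = "revisit"
            headers["WARC-Profile"] = (
                "http://netpreserve.org/warc/1.1/revisit/identical-payload-digest"
            )
            headers["WARC-Refers-To"] = orig_id
            headers["WARC-Refers-To-Target-URI"] = orig_uri
            headers["WARC-Refers-To-Date"] = orig_date
        return headers


# Two different (made-up) URLs returning the same stock redirect body.
deduper = PayloadDeduper()
body = b"<html><body>301 Moved Permanently</body></html>"
first = deduper.record_for("http://example.com/a", "2023-01-01T00:00:00Z", body)
second = deduper.record_for("http://example.org/b", "2023-01-01T00:00:05Z", body)
print(first["WARC-Type"])   # response
print(second["WARC-Type"])  # revisit
```

A tool that dedupes across runs would need to persist this index somewhere, for example by loading it from a CDX file or keeping it in a database, as mentioned above.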
01:23:23<pabs>Jake: repeated the SPN of https://savannah.gnu.org/news/ and the favicon snapshot thing didn't happen this time, hmm