| 00:00:00 | <eggdrop> | [remind] arkiver: pabs3 account issue :) |
| 00:23:15 | | xkey quits [Quit: WeeChat 4.9.1] |
| 00:23:27 | | xkey (xkey) joins |
| 00:28:36 | <pabs> | JAA: what are revisit records? and why is the first text/html 301 on 20231222143008 *after* the three revisit records? |
| 00:30:22 | <@JAA> | pabs: A revisit record in a WARC is a deduplicated reference to a previous response, almost always when the payload (decoded HTTP body) is identical. |
| 00:31:40 | <@JAA> | In this case, at least now, http://git.exotic.sh/ returns a standard nginx response for 301s, so it was deduped against some other similar record on a different URL in those first captures. |
| 00:32:14 | <pabs> | huh, so dedups are archive-wide? |
| 00:32:30 | <pabs> | or just around the same time or in the same WARC? |
| 00:32:45 | <@JAA> | No, it's something the software writing those WARCs did. |
| 00:32:53 | <pabs> | oh |
| 00:32:57 | <pabs> | thanks |
| 00:33:00 | <pabs> | JAA++ |
| 00:33:01 | <eggdrop> | [karma] 'JAA' now has 396 karma! |
| 00:33:26 | <@JAA> | wget-at and qwarc dedupe within the WARC(s) written by a single process, for example. |
| 00:34:16 | <@JAA> | wget also has a mechanism for loading a CDX for deduping across multiple runs, but IIRC, that's somewhat broken, and it's not really used by anything at AT at least. |
| 00:34:37 | <@JAA> | Other tools may have everything backed by a database or similar. |
| 01:23:23 | <pabs> | Jake: repeated the SPN of https://savannah.gnu.org/news/ and the favicon snapshot thing didn't happen this time, hmm |
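
The revisit-record discussion above (00:30 to 00:34) can be illustrated with a minimal sketch of process-local deduplication keyed on the payload digest. This is not the actual wget-at or qwarc code; the `Capture` and `ProcessLocalDeduper` names, the field layout, and the example.org URLs are all hypothetical, and the SHA-1/base32 digest format is an assumption based on what WARC writers commonly emit. It shows why a capture can be stored as a revisit that points at a record for a different URL: only the body bytes are compared.

```python
import base64
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Capture:
    """Hypothetical stand-in for the metadata of one WARC record."""
    url: str
    timestamp: str                       # e.g. "20231222143008"
    record_type: str                     # "response" or "revisit"
    payload_digest: str
    refers_to_url: Optional[str] = None  # set only on revisit records
    refers_to_timestamp: Optional[str] = None


def payload_digest(payload: bytes) -> str:
    """SHA-1 of the decoded HTTP body, base32-encoded (assumed format)."""
    return "sha1:" + base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")


class ProcessLocalDeduper:
    """Remembers payload digests only for the lifetime of this object,
    mimicking dedupe within the WARC(s) written by a single process."""

    def __init__(self) -> None:
        self._seen: Dict[str, Capture] = {}

    def record(self, url: str, timestamp: str, payload: bytes) -> Capture:
        digest = payload_digest(payload)
        original = self._seen.get(digest)
        if original is None:
            # First time this body is seen: keep a full response record.
            capture = Capture(url, timestamp, "response", digest)
            self._seen[digest] = capture
            return capture
        # Identical body seen before, possibly on a different URL:
        # emit a revisit that refers back to the original response.
        return Capture(url, timestamp, "revisit", digest,
                       original.url, original.timestamp)
```

With this sketch, two different URLs that return the same stock nginx 301 body within one run would yield one `response` and one `revisit`, while a fresh `ProcessLocalDeduper` (a new process) would store a full `response` again, since nothing is shared archive-wide.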