01:33:57KoalaBear84 joins
01:37:25KoalaBear quits [Ping timeout: 255 seconds]
02:43:50<HP_Archivist>Would the hash values of _files.xml change over time? Not the hash values listed in the file, but the hash of the file itself.
02:44:32<HP_Archivist>I keep getting mismatches when going over item's individual _files.xml
02:45:10<HP_Archivist>*When comparing the copy listed on IA vs the copy downloaded locally.
03:20:22<HP_Archivist>Nvm.
03:46:32<HP_Archivist>Is it safe to assume that IA uses Linux to calculate file hash files?
03:47:12<HP_Archivist>file hash values*
04:13:35DogsRNice quits [Read error: Connection reset by peer]
04:18:04pabs quits [Remote host closed the connection]
04:56:56pabs (pabs) joins
07:46:10JaffaCakes118 quits [Remote host closed the connection]
07:57:53JaffaCakes118 (JaffaCakes118) joins
08:31:52qwertyasdfuiopghjkl quits [Ping timeout: 255 seconds]
08:57:00MrMcNuggets (MrMcNuggets) joins
09:04:57qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
11:06:33Matthww quits [Ping timeout: 256 seconds]
11:27:29Matthww joins
12:18:58<@JAA>HP_Archivist: Hashes don't depend on the OS.
12:19:28<@JAA>_files.xml can change over time if anything in the item changes. That includes things like metadata and reviews.
14:41:48MrMcNuggets quits [Quit: WeeChat 4.3.2]
14:41:56<HP_Archivist>JAA: I was under the impression that if I calculated hashes in a Windows file structure and then compared those against hashes calculated in a Linux environment, there might be some conflict.
14:43:01<HP_Archivist>Rn I'm calculating MD5 hashes for all that data I just downloaded. Once that's done, I want to compare against the copies on IA under my account.
14:43:19<HP_Archivist>Windows file system* not structure
14:43:37<@JAA>HP_Archivist: No, hashes operate on binary data, which doesn't change by OS. If you try to do 'text mode' for hashes (which does LF vs CRLF conversion stuff), you're doing it wrong. :-)
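JAA's point can be demonstrated with a short Python sketch (not part of the log; the sample bytes are illustrative): the digest depends only on the raw bytes, and line-ending conversion changes those bytes, which is exactly why 'text mode' hashing goes wrong.

```python
import hashlib

# The MD5 digest is computed over raw bytes, so it is identical on any OS.
digest = hashlib.md5(b"hello world").hexdigest()
print(digest)  # -> 5eb63bbbe01eeed093cb22bb8f5acdc3, on Windows and Linux alike

# CRLF vs LF line endings are *different bytes*, so any 'text mode'
# conversion changes the hash. Always open files in binary mode ("rb").
assert hashlib.md5(b"line\n").hexdigest() != hashlib.md5(b"line\r\n").hexdigest()
```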
14:44:40<@JAA>The metadata API at /metadata/$IDENTIFIER returns the hashes as JSON, by the way, which I find easier to work with than _files.xml.
14:45:37<HP_Archivist>Ahh, yeah I think I confused myself when thinking up a way to make this as seamless as possible
14:46:34myself used Confuse. It was super effective.
14:47:11<HP_Archivist>Heh
14:47:55<@arkiver>pabs: this will run into next week at least
14:50:34<HP_Archivist>JAA: If I use /metadata/$IDENTIFIER instead, how would I script that? I was going to have a script reference each _files.xml, but I think that's honestly too much of a headache (as you're implying)
14:52:47<HP_Archivist>When the MD5 calculations are done locally, I'll have an md5_checksums.txt for every item/file.
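A checksum file in the two-column `md5sum` format that the commands below consume could be produced with a sketch like this (function and file names are hypothetical, not from the log):

```python
import hashlib
import os

def md5_file(path, chunk_size=1 << 20):
    """Hash a file in binary mode, reading in chunks to bound memory use."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def write_checksums(item_dir, out_path):
    """Write 'md5  name' lines like md5sum's output.
    out_path should live outside item_dir so it isn't hashed itself."""
    names = sorted(os.listdir(item_dir))  # snapshot before writing
    with open(out_path, "w") as out:
        for name in names:
            full = os.path.join(item_dir, name)
            if os.path.isfile(full):
                out.write(f"{md5_file(full)}  {name}\n")
```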
14:52:48<@JAA>HP_Archivist: You can do it from the XML, but it's ugly. See little-things/iasha1check for reference.
14:54:36<@JAA>You can do something like `diff <(curl -s https://archive.org/metadata/$IDENTIFIER | jq -r '.files[] | "\(.md5) \(.name)"') /path/to/${IDENTIFIER}/md5_checksums.txt`.
14:55:13<@JAA>Although the API returns all files, of course, so you'll have some extra files in that. Maybe `comm` is a better option.
14:55:41<@JAA>You'd also need to filter out the _files.xml file because its hash is never correct.
14:56:42<HP_Archivist>^^ This explains my confusion. I kept seeing different values for _files.xml
14:57:04<@JAA>You can construct a file that contains its own hash, but it's not easy. :-P
14:58:00<HP_Archivist>I also need to filter out any derivative files, as I excluded those on the local downloads
14:58:14<HP_Archivist>Unless I just ignore it erroring out on missing files, etc
14:58:21<HP_Archivist>'missing files'
14:58:56<HP_Archivist>Unless 'comm' takes care of that?
14:59:56<@JAA>Yes, that's what I meant by 'extra files'.
15:00:15<HP_Archivist>^^
15:00:25<@JAA>`comm` lets you distinguish lines appearing only in the first input, or only in the second input, or in both.
15:01:05<@JAA>In this case, you'd want to check for lines that only appear in your local checksum files.
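The `comm`-style check can be sketched in Python as a set difference over checksum lines (the hashes and file names here are made up): entries that exist only locally are the mismatches or files missing remotely.

```python
def only_local(remote_lines, local_lines):
    """Return checksum lines present only in the local list.
    Equivalent to `comm -13 remote.txt local.txt` on sorted input."""
    return sorted(set(local_lines) - set(remote_lines))

remote = ["aaa  a.txt", "bbb  b.txt", "ccc  c_derived.jpg"]
local = ["aaa  a.txt", "ddd  b.txt"]  # b.txt hashes differently on disk
print(only_local(remote, local))  # -> ['ddd  b.txt']
```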
15:02:53<HP_Archivist>I see, this is why working with XML files is much more problematic; they include the derivative lines too, not just 'source="original"'
15:03:21<HP_Archivist>If I understand all this correctly
15:04:15<HP_Archivist>I will give your example a try once everything is done hashing :)
15:04:54<@JAA>Hmm
15:05:04<@JAA>Actually...
15:05:12<@JAA>The metadata API also returns the source, so you could filter for that.
15:06:16<@JAA>`jq -r '.files[] | select(.source != "derivative") | "\(.md5) \(.name)"'`
15:06:46<@JAA>Or `.source == "original"`, I don't recall which other sources exist.
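Putting the filtering together (non-derivatives only, _files.xml excluded as noted above), the same selection can be sketched in Python against a saved copy of the /metadata/$IDENTIFIER response. The JSON below is a minimal stand-in; the exact set of source values is not confirmed in the log.

```python
import json

# Minimal stand-in for a /metadata/$IDENTIFIER response (structure assumed).
metadata = json.loads("""
{"files": [
  {"name": "a.txt",          "source": "original",   "md5": "aaa"},
  {"name": "a.gif",          "source": "derivative", "md5": "bbb"},
  {"name": "item_files.xml", "source": "metadata",   "md5": "ccc"}
]}
""")

# Mirror of the jq filter: drop derivatives, and drop _files.xml
# because its listed hash is never correct.
expected = {
    f["name"]: f["md5"]
    for f in metadata["files"]
    if f.get("source") != "derivative" and not f["name"].endswith("_files.xml")
}
print(expected)  # -> {'a.txt': 'aaa'}
```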
15:09:00<HP_Archivist>I think it's just those two, no?
15:12:45<HP_Archivist>Gonna step away for a moment
15:15:42<@JAA>Looks like the 'Item Size' info field was changed sometime in the past few days to couple of weeks and is now rendered like '28.4M' to mean '28.4 MiB'. It was previously in bytes.
15:16:48<@JAA>That matches what `ls -h` or `du -h` output, and at least they didn't conflate MiB and MB this time. I'd still prefer the full unit though.
15:43:05<HP_Archivist>^^
15:43:51<HP_Archivist>I assume there's not much value in cross-referencing against all 3 hash calculations - MD5, CRC, SHA-1 as opposed to just MD5?
15:44:55<HP_Archivist>My concern is: Did the data download correctly. If so, ensure bit integrity. But I guess the latter is what ZFS is for :P
15:59:00<@JAA>Just checking a single hash is good enough to catch any random errors.
16:00:00<HP_Archivist>Yup, just thought I'd ask anyway :)
16:00:29<@JAA>If you do find a bad file with a matching MD5, that'll be major cryptography news. :-)
16:01:05<HP_Archivist>Heh ^^
16:01:55<@JAA>It'd be a preimage attack, which is still infeasible for most hash functions, even otherwise very insecure and entirely obsolete ones like MD4. I'm not even sure a practical one for MD2 exists.
16:03:44<@JAA>The best known preimage attack on MD2 apparently has time and space complexity 2^73. Yeah, that's not going to happen either.
16:05:11<HP_Archivist>I have read about collisions being reported though for MD5, iirc, but I think that was in the context of security rather than integrity
16:05:23<HP_Archivist>Or maybe that was SHA-1
16:05:32<@JAA>Collision attacks are much easier to mount than preimage attacks, yes.
16:05:43<@JAA>Collision attack: find two inputs that result in the same hash
16:06:09<@JAA>Preimage attack: find an input that results in the same hash as another input (that the attacker doesn't control)
16:07:04<@JAA>See the birthday paradox for why collisions are 'easy'.
16:07:44<@JAA>In this case, the 'other input' would be the original file, and the 'attacker' would be random noise.
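The 'collisions are easy' intuition can be quantified with the birthday bound: a ~50% chance of some collision on an ideal n-bit hash needs roughly sqrt(2 ln 2) * 2^(n/2) random inputs, while a generic preimage needs on the order of 2^n. A quick sketch (generic bounds only; real MD5 collision attacks are far cheaper than this):

```python
import math

def birthday_bound(bits):
    """Approximate number of random inputs for a ~50% chance of *some*
    collision on an ideal bits-wide hash (birthday bound)."""
    return math.sqrt(2 * math.log(2)) * 2 ** (bits / 2)

# MD5 is 128 bits: generic collisions at roughly 2^64 work,
# generic preimages at roughly 2^128.
print(f"collision: ~2^{math.log2(birthday_bound(128)):.1f} inputs")
print("preimage:  ~2^128 inputs")
```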
16:54:04DogsRNice joins
19:07:33magmaus3 quits [Quit: :3]
19:08:52magmaus3 (magmaus3) joins
19:19:00Matthww quits [Ping timeout: 256 seconds]
19:19:33Matthww joins
20:46:32driib (driib) joins
21:04:28driib quits [Client Quit]
21:10:31driib (driib) joins
21:25:08driib quits [Client Quit]
21:29:09driib (driib) joins
21:30:50driib quits [Client Quit]
21:36:34driib (driib) joins
22:32:01qwertyasdfuiopghjkl quits [Ping timeout: 255 seconds]
23:53:12<tzt>why is St. Petersburg Internet portal (Piter.tv) excluded entirely from the Wayback Machine?
23:56:05<@JAA>Most likely because the domain owner asked for it to be excluded.