00:04:03 | | fuzzy80211 quits [Read error: Connection reset by peer] |
00:04:04 | <HP_Archivist> | Hey JAA - I fixed that directory issue I was having. But the arg you gave me the other day for verifying hash values doesn't list the results in the output txt in the same order as the txt it's reading the hashes from.
00:04:27 | | fuzzy80211 (fuzzy80211) joins |
00:04:37 | <HP_Archivist> | I recalculated hashes and it's now all in one txt |
00:05:14 | <HP_Archivist> | https://transfer.archivete.am/Cekt7/all_items_md5_checksums.txt |
00:05:15 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/Cekt7/all_items_md5_checksums.txt |
00:05:30 | <@JAA> | HP_Archivist: Yes, that's why I sorted both inputs. |
00:05:54 | <HP_Archivist> | Ohh |
00:06:10 | <@JAA> | It does sort by the hash, but that doesn't really matter as long as the sorting is the same on both. |
00:06:59 | <HP_Archivist> | Either I'm reading the results wrong, or it's failing on the hash compare? https://transfer.archivete.am/Dddi8/hashes_compared_results.txt |
00:07:00 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/Dddi8/hashes_compared_results.txt |
00:07:32 | <HP_Archivist> | Manually comparing several of them earlier indicated they matched. + indicates they don't, right? |
00:07:51 | <@JAA> | Hmm, that's not one comparison per item anymore. |
00:08:11 | <@JAA> | The output is a diff. You should see pairs of - and + lines for mismatches. |
00:08:42 | <HP_Archivist> | Yes, I changed the calculation script to avoid the extra file paths
00:09:12 | <@JAA> | Mhm, but putting them all in one file rather than one file per item makes a mess. |
00:09:32 | <@JAA> | And in theory, it could mask problems with your download. |
00:09:36 | <HP_Archivist> | Oops |
00:10:00 | <HP_Archivist> | Which is probably why it's coming back saying none match (when they do) |
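A minimal sketch of that kind of comparison, assuming one `md5sum`-style line per file in both lists (the file names below are placeholders, not the actual files linked above):

    # sort both hash lists the same way, then diff; paired -/+ lines mark mismatching entries
    diff <(sort local_md5s.txt) <(sort ia_md5s.txt)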
00:10:06 | <@JAA> | If you can rehash everything in a reasonable amount of time, you could just use iasha1check. |
00:10:43 | <HP_Archivist> | Yeah, I tried using that in my initial attempt at this and got stuck. Don't remember on what, though.
00:11:07 | <HP_Archivist> | How do you usually create hashes? I know it varies based on who you ask |
00:11:12 | | JaffaCakes118 quits [Remote host closed the connection] |
00:11:22 | <@JAA> | It would be unhappy about the _files.xml and _meta.* files, but that's easily ignored/checked afterwards. |
00:11:41 | | JaffaCakes118 (JaffaCakes118) joins |
00:12:01 | <@JAA> | I calculate SHA-256 of all files before the upload, and I use iasha1check after IA finishes processing to confirm everything's fine. |
00:12:31 | <HP_Archivist> | SHA256 - Doesn't that take a really long time? |
00:13:01 | <HP_Archivist> | And I guess my question is, what do you use to do the calculations, a script or a dedicated piece of software |
00:13:45 | <nicolas17> | the bottleneck is usually your hard disk and not the hashing algorithm |
00:13:56 | <@JAA> | It is significantly slower than MD5 or SHA-1, yeah. |
00:14:16 | <@JAA> | SSDs <3 |
00:16:14 | <@JAA> | I use `sha256sum`. I've experimented with calculating all three hashes in parallel (the IA upload also needs MD5 for the Content-MD5 header), but that didn't work particularly well, so I abandoned that idea. |
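One way such a one-pass, three-hash run can be sketched in bash, assuming process substitution is available and `$f` holds the file path (the output file names are placeholders; this is not the abandoned setup JAA refers to):

    # read the file once and feed it to three hashers in parallel
    < "$f" tee >(md5sum > "$f.md5") >(sha1sum > "$f.sha1") | sha256sum > "$f.sha256"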
00:16:14 | <nicolas17> | seems my laptop can sha1 at 330MiB/s and sha256 at 180MiB/s |
00:16:16 | <nicolas17> | roughly |
00:16:43 | <@JAA> | Yeah, it depends on CPU generation etc., but you can expect very roughly a factor 2. |
00:17:09 | <HP_Archivist> | I'd do SHA256, but my daily driver machine is the only one I have; I imagine SHA-256 calculations on 1000s of files would still slow other tasks down a lot, even though I'm running on SSDs, too.
00:17:41 | <HP_Archivist> | Yeah, CPU would be my bottleneck I think. It barely breaks the 3 GHz barrier, not exactly great
00:18:15 | <HP_Archivist> | I maxed out an older Inspiron with 64GB of memory and 2 internal 4TB nvmes Lol |
00:18:32 | <HP_Archivist> | There are still bottlenecks though. So, meh. |
00:18:38 | <nicolas17> | JAA: huh looks like CPU instructions to accelerate SHA-1 and SHA-256 were added *together* in Intel CPUs |
00:18:52 | <@JAA> | Huh |
00:19:47 | <@JAA> | The SHA-256 hashes aren't of any use for validating files on IA because IA doesn't calculate them. I keep them for my own sake and to possibly publish them in the future as an independent data integrity check. |
00:20:04 | <HP_Archivist> | JAA why use 256 if IA needs MD5 anyway? |
00:20:11 | <HP_Archivist> | Oh, nvm ^^ |
00:20:17 | <HP_Archivist> | Gotcha, makes sense |
00:20:29 | <@JAA> | MD5 and SHA-1 are fairly broken by now, so that's why I went with an SHA-2 function. |
00:20:56 | <nicolas17> | hm are they broken in ways that matter for this? |
00:21:04 | <@JAA> | SHA-256 isn't perfect for this because it's vulnerable to length extension attacks. If I started over, I'd probably do SHA-512/256 instead, or something more modern. |
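For reference, SHA-512/256 is available via OpenSSL (1.1.1 and newer); coreutils has no dedicated tool for it. The file name below is just a placeholder:

    openssl dgst -sha512-256 somefile.bin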
00:22:55 | <HP_Archivist> | Creating hashes with a more robust standard to keep on the side for data integrity validation is actually a smart idea. I've never bothered because I know it would likely slow everything else to a crawl. The perils of only working with one machine.
00:23:16 | <steering> | JAA: lots of stuff still does sha2+size :) gentoo, debian |
00:23:25 | <@JAA> | nicolas17: Eh, depends on how you look at it. I could replace files on IA without it being visible in the hashes if I used one of those broken ones. |
00:23:34 | <steering> | works well enough against length extension after all |
00:23:34 | <@JAA> | steering: Yeah, that's where I messed up, I didn't keep the file sizes. :-) |
00:23:45 | <steering> | yeah, could always add them on for new stuff :) |
00:23:51 | <@JAA> | Aye |
00:24:14 | <steering> | (if you don't want to completely switch to a different hash function to retain compatibility) |
00:25:16 | <nicolas17> | oh right, length extension |
00:25:27 | <nicolas17> | I was just thinking collisions |
00:26:08 | <HP_Archivist> | JAA: You've given me something to think about. You validate before you upload. Not after the fact. But what I do is upload, then ia download (still having that original source locally, too) and then I want to validate what ia download pulls down since it remains how everything is IA-side. |
00:26:19 | <@JAA> | Length extension's probably not critical in this case. Whoever uses those hashes would have to trust the source (i.e. me) anyway that the hashes are correct. |
00:26:23 | <HP_Archivist> | since it retains* |
00:26:36 | <@JAA> | So it is mostly about collisions (and, in theory, preimages, but not going to happen). |
00:26:59 | <nicolas17> | afaik even MD4 doesn't have a practical preimage attack |
00:27:12 | <@JAA> | HP_Archivist: Nothing to validate before upload. But I calculate my local hashes before, yes. And then I validate after the upload before deleting the local copy. |
00:27:45 | <@JAA> | Even MD2 is still safe in that regard, I believe. |
00:27:58 | <HP_Archivist> | Erm, yeah, I meant you calculate local hashes before upload* |
00:28:44 | <nicolas17> | MD4 turned out to have significantly worse collision resistance than MD2 |
00:29:17 | <HP_Archivist> | I don't know what I should consider source - the actual source files, or what ia download pulls down assuming hashes match. The only reason I use ia download is because it keeps everything neat in its own folder, mirroring the directory structure of how it is on the site. |
00:29:49 | <steering> | interestingly, sha256sum doesn't appear to use CPU instructions for it, even on my riced out gentoo box |
00:30:03 | <nicolas17> | steering: might have been added later |
00:30:13 | <steering> | (I mean, doesn't use any of the SHA Extensions) |
00:30:18 | <nicolas17> | try "openssl sha256"? |
00:30:20 | <@JAA> | Hmm, that would explain the performance difference. IIRC, OpenSSL performs better. |
00:30:49 | <steering> | I disassembled sha256sum and grep'd through it. I don't think I can do that very well for openssl sha256 :) |
00:31:00 | <nicolas17> | oh I meant compare perf :P |
00:31:23 | | steering picks a big movie |
00:31:24 | <@JAA> | `openssl speed sha1 sha256` |
00:31:37 | <steering> | should i copy 73.3GB onto tmpfs? probably not. |
00:31:52 | <steering> | 28GB? sounds better xD |
00:32:46 | <@JAA> | 828 MB/s SHA-1, 372 MB/s SHA-256 on my test server |
00:32:54 | | nulldata quits [Quit: So long and thanks for all the fish!] |
00:33:01 | <@JAA> | 750 MB/s MD5, for comparison |
00:33:22 | <steering> | looks like that's single-threaded so *cores
00:34:22 | <HP_Archivist> | JAA - guess I need to rehash *again* to individual txts and give iasha1check another try. Also, now my upload process will take a bit longer; gonna start doing what you do to make the verification easier :P
00:34:25 | | nulldata (nulldata) joins |
00:34:26 | <@JAA> | Yeah, I should probably write something that runs `sha256sum` in parallel. |
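A rough sketch of what such a parallel run could look like (directory layout and output file name are hypothetical):

    # hash every file under the current directory, four sha256sum processes at a time
    find . -type f -print0 | xargs -0 -P4 -n1 sha256sum > all_files.sha256
    # parallel output interleaves, so restore a stable order by file name afterwards
    sort -k2 all_files.sha256 -o all_files.sha256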
00:34:29 | <steering> | let's see, 1.2GB/s and 556MB/s for me |
00:34:48 | <steering> | 1219269.97k, 555985.58k @ 16384B |
00:34:54 | <@JAA> | Nice |
00:34:55 | <nicolas17> | testing, but discord is eating some CPU in the background |
00:35:04 | <nicolas17> | discord-- |
00:35:05 | <eggdrop> | [karma] 'discord' now has -18 karma! |
00:35:07 | <@JAA> | What CPU is that? |
00:35:33 | <nicolas17> | sha1 faster than md5 huh |
00:35:36 | <steering> | oops my /tmp is limited to 4GB, oh well, that should be enough |
00:36:05 | <steering> | i5-9600k (no OC, although maybe XMP something something for RAM) |
00:36:28 | <@JAA> | Could also just use /dev/zero. :-P |
00:36:38 | <steering> | hmm fair |
00:36:57 | <@JAA> | `time dd if=/dev/zero bs=4M count=1024 | sha1sum` |
00:36:58 | <nicolas17> | hm yes |
00:37:30 | <nicolas17> | md5 675MB/s, sha1 868MB/s, that does look like specialized CPU instructions |
00:37:36 | <nicolas17> | sha256 436MB/s |
00:38:02 | <steering> | I get about the same speed from both of them |
00:38:12 | <nicolas17> | intel i5-8250U |
00:38:14 | <steering> | `pv /dev/zero | openssl sha256` vs |sha256sum, 500MB/s |
00:38:19 | <steering> | and 515-526MB/s dd| |
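One caveat on those numbers: on some OpenSSL versions, the bare algorithm names in `openssl speed` benchmark the low-level implementation rather than the EVP path, so hardware acceleration (SHA-NI, ARMv8 crypto) may not show up; `-evp` is worth trying for comparison:

    openssl speed -evp sha256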
00:39:51 | <nicolas17> | oh now I remember the weird one I knew |
00:39:57 | <nicolas17> | Apple M2 |
00:40:09 | <nicolas17> | md5: 596MB/s |
00:40:15 | <nicolas17> | sha1: 728MB/s |
00:40:24 | <nicolas17> | sha256: 2574MB/s |
00:40:48 | <nicolas17> | just... how |
00:41:19 | <steering> | hm, makes sense tbh |
00:41:36 | <nicolas17> | does it have acceleration for sha256 and not for sha1? |
00:41:38 | <steering> | only bother to optimize sha256 because it's the most used |
00:42:11 | <nicolas17> | aaaa why are wikipedia's IA bots so dumb |
00:42:35 | <nicolas17> | "archived from the original" and it points me at the wayback machine's capture of a 404 page |
00:42:37 | <steering> | especially with it being RISC |
00:42:54 | <@JAA> | Is it a 404 served as a 200? |
00:43:51 | <nicolas17> | probably |
00:43:56 | <nicolas17> | oh framesets, what year is it |
00:44:29 | <nicolas17> | JAA: it's a 301 to /documentation/404 |
00:44:46 | <@JAA> | Ah |
00:46:05 | <steering> | also wow Intel's SHA extensions are only from 2013? I didn't realize they were so recent. |
00:46:51 | <@JAA> | Yep, just in time for being deprecated. lol |
00:48:52 | | nulldata quits [Client Quit] |
00:49:11 | <steering> | wow, that's embarrassing |
00:49:47 | <steering> | I went and compared my 'home server' (nuc8i7be) to my desktop and... it basically did the exact same |
00:49:52 | <steering> | despite being clocked much lower |
00:50:55 | | nulldata (nulldata) joins |
00:50:57 | | nulldata quits [Client Quit] |
00:50:59 | <steering> | https://transfer.archivete.am/inline/jMQDK/openssl-speed.txt |
00:51:30 | <steering> | (and being a mobile part) |
00:52:25 | <@JAA> | Marginally higher throughput even, lol |
00:52:26 | <steering> | Oh I guess they actually have similar turbo boost speeds they're probably hitting |
00:52:54 | <steering> | 4.5 to 4.6 |
00:53:07 | <@JAA> | Yeah, I was just looking for the spec sheets. |
00:54:08 | | nulldata (nulldata) joins |
00:54:37 | <steering> | they're both coffee lake so same clock speed, same internals, same performance |
00:54:48 | <@JAA> | My N100 manages 2.6 GB/s SHA-1 and 2.2 GB/s SHA-256. |
00:55:02 | <HP_Archivist> | What do you think about using rclone to create hashes, JAA? |
00:55:46 | <@JAA> | That's with OpenSSL though. sha256sum does ... not. |
00:55:58 | <steering> | wow, and that's a 6W chip, meanwhile mine's 95W |
00:56:39 | <@JAA> | Well, it has the SHA extensions. The spec was created in 2013, but only quite recent CPUs actually have it. |
00:57:12 | <steering> | I... hmm |
00:57:31 | <@JAA> | Yours are two generations too old. |
00:57:52 | <@JAA> | Or one, depending on what series you look at exactly. |
00:58:10 | <@JAA> | > Intel Goldmont[3] (2016) and later Atom microarchitecture processors. |
00:58:13 | <@JAA> | > Intel Cannon Lake[4] (2018/2019), Ice Lake[5] (2019) and later processors for laptops ("mainstream mobile"). |
00:58:16 | <@JAA> | > Intel Rocket Lake (2021) and later processors for desktop computers. |
00:58:32 | <steering> | ahhh urite |
00:59:08 | <@JAA> | The N100 is great. Highly recommended. :-) |
00:59:25 | <@JAA> | HP_Archivist: I haven't used rclone, so no opinion. |
01:00:18 | <@JAA> | It appears as sha_ni in /proc/cpuinfo flags, it seems. |
01:00:20 | <steering> | yeah, I get 2.3 and 2.1 on a i7-1260P |
01:02:48 | <steering> | still, 28-64W for that (no TDP listed?? ugh intel at least be consistent in your useless units) compared to your 6W |
01:04:11 | <steering> | I'm surprised the N100 doesn't have E-cores. Then again I have no idea what Intel is doing with their processor lineups these days. |
01:19:12 | <@JAA> | The N100 has only E-cores, no P-cores. |
01:19:51 | <steering> | ah ok |
01:35:57 | | JaffaCakes118_2 (JaffaCakes118) joins |
01:36:00 | | JaffaCakes118 quits [Remote host closed the connection] |
03:38:42 | | JaffaCakes118_2 quits [Remote host closed the connection] |
03:39:25 | | JaffaCakes118_2 (JaffaCakes118) joins |
03:46:02 | | DogsRNice quits [Read error: Connection reset by peer] |
03:49:01 | | DogsRNice joins |
03:53:46 | | DogsRNice quits [Client Quit] |
03:55:14 | | igloo22225 quits [Quit: The Lounge - https://thelounge.chat] |
04:18:17 | | DogsRNice joins |
04:19:33 | | DogsRNice quits [Remote host closed the connection] |
04:32:41 | <HP_Archivist> | RE: rclone. No worries, JAA. Using iasha1check requires SHA-1, but IA requires MD5 |
04:32:45 | <HP_Archivist> | So I guess I'll do both |
04:32:58 | <@JAA> | HP_Archivist: You misunderstood. |
04:33:41 | <@JAA> | The IA S3 API supports MD5 in that you can send them the correct hash in a header and the upload will fail if it doesn't match. |
04:34:05 | <@JAA> | IA still calculates both MD5 and SHA-1 for each file after that. |
04:34:21 | <@JAA> | And the latter is what iasha1check uses for independently verifying that the local files match the item. |
04:37:09 | <HP_Archivist> | Ahh, that makes sense ^^ |
04:40:54 | <HP_Archivist> | So if I want to start hashing before uploading, I need to get in the habit of calculating MD5 values for files for use when uploading. During upload, if the hash doesn't match, the upload fails as you said.
04:41:04 | <HP_Archivist> | But I essentially still need to calculate both locally, no? |
04:42:30 | <HP_Archivist> | *Per what you mentioned about independently verifying with iasha1check |
04:42:51 | <HP_Archivist> | Apologies it's taken this long for me to get this |
04:45:06 | <HP_Archivist> | I don't think I've ever tried using the S3 API, not sure if that's necessary at this point. But I like the idea of providing hashes upon uploading. |
04:46:09 | <@JAA> | If you use `ia upload` (or `ia-upload-stream`, but I doubt that), you don't need to think about this. It happens automatically. |
04:46:47 | <@JAA> | Hmm, actually, `ia upload` requires `--verify` to do that. |
04:47:21 | <nicolas17> | -c calculates the checksum in order to know if it has to upload the file at all or not |
04:47:26 | <nicolas17> | I wonder if that also implies --verify |
04:47:55 | <@JAA> | It does not IIRC. |
04:48:40 | <nicolas17> | to me the reason to *not* use --verify would be to avoid spending time calculating the checksum |
04:48:54 | <nicolas17> | so if you have to calculate it anyway... hmm |
04:49:08 | <HP_Archivist> | It would just make the upload process longer, it seems |
04:49:35 | <@JAA> | Well, it has to calculate the hash, yes. If you verify it afterwards anyway, you can safely skip that. |
04:49:48 | <HP_Archivist> | And yeah, I already use ia upload JAA, so I guess no reason for me to bother with the API
04:51:11 | <HP_Archivist> | So: 'ia upload identifier file --verify' ? |
04:51:33 | <@JAA> | That would do the extra MD5 thing, yes. |
04:52:19 | <HP_Archivist> | Simple enough. It would only slow things down on largish files, no big deal I suppose
04:53:08 | <HP_Archivist> | But still have to locally calculate hashes with SHA-1 if I want to use iasha1check |
04:56:34 | <@JAA> | Yes, not manually though. |
04:56:45 | <@JAA> | You just go to the directory with the item data and run `iasha1check IDENTIFIER`. |
04:57:26 | <HP_Archivist> | Thank you |
04:57:40 | <HP_Archivist> | Can't do them all at once? |
04:57:41 | <@JAA> | It retrieves the SHA-1 hashes from IA and checks them against the local files with `sha1sum -c`. (It also compares the file lists and reports differences in those separately for convenience.) |
04:58:03 | <@JAA> | A simple shell loop can take care of that. |
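A sketch of such a loop, assuming one directory per item, each named after its IA identifier (which is the layout `ia download` produces):

    for dir in */; do
        echo "Checking $dir"
        ( cd "$dir" && iasha1check "${dir%/}" )
    done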
05:03:51 | <HP_Archivist> | Oh, by calculate I meant create SHA1 hashes first, so then iasha1check can automatically check with sha1sum -c |
05:04:09 | <HP_Archivist> | e.g. I don't have txts of sha1 hashes for the files yet, I'll need to generate those now. |
05:04:24 | <@JAA> | iasha1check doesn't support that. |
05:05:47 | <HP_Archivist> | Does iasha1check generate SHA-1 hash values for files or no? |
05:06:00 | <HP_Archivist> | Not just verify |
05:06:39 | <@JAA> | It only checks the hashes. It does not produce any hash output. |
05:07:15 | <@JAA> | It takes the IA item file metadata, generates the equivalent of `sha1sum *` output, and then feeds that to `sha1sum -c`. |
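A rough manual equivalent of that flow (not the script itself; assumes the `ia` CLI and `jq` are installed, and IDENTIFIER is a placeholder):

    # build sha1sum-style lines from the item's file metadata and check the local files against them
    ia metadata IDENTIFIER | jq -r '.files[] | select(.sha1) | "\(.sha1)  \(.name)"' | sha1sum -c -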
05:10:02 | <HP_Archivist> | Oh okay, I get it now. So sha1sum -c does the actual calculation part |
05:10:31 | <HP_Archivist> | Or generation or whatever you wanna call it. I'm tired and I'm going round in circles at this point, heh |
05:10:56 | <@JAA> | > -c, --check |
05:10:56 | <@JAA> | > read SHA1 sums from the FILEs and check them |
05:11:36 | <@JAA> | Here, the FILE is what's generated from IA's data. Normally, you'd use it like `sha1sum foo >foo.sha1` and later `sha1sum -c foo.sha1` to verify that `foo` is intact. |
05:11:57 | <@JAA> | It calculates the hashes internally but doesn't report them. |
05:12:03 | <@JAA> | Just match/mismatch |
05:12:38 | <HP_Archivist> | Yeah, JAA. I get it now. I think sometimes I overthink things a bit too much. Thanks! |
05:14:22 | <@JAA> | :-) |
05:51:46 | | igloo22225 (igloo22225) joins |
05:55:15 | <HP_Archivist> | Just noticed you keep it updated here? https://gitea.arpa.li/JustAnotherArchivist/little-things/src/branch/master/iasha1check |
05:55:32 | <@JAA> | That is the repo, yes. |
06:03:56 | <HP_Archivist> | I cloned the repo and it's installed, but running: iasha1check dans-les-coulisses-des-jeux-video-harry-potter-book
06:04:43 | <HP_Archivist> | Displays a list of all items from the parent folder of all items and then: IA item files that are not in the local directory:
06:04:43 | <HP_Archivist> | Dans les coulisses des Jeux Vidéo Harry Potter.pdf |
06:04:43 | <HP_Archivist> | __ia_thumb.jpg |
06:04:43 | <HP_Archivist> | SHA-1 comparison: |
06:04:43 | <HP_Archivist> | sha1sum: 'Dans les coulisses des Jeux Vidéo Harry Potter.pdf': No such file or directory |
06:04:45 | <HP_Archivist> | Dans les coulisses des Jeux Vidéo Harry Potter.pdf: FAILED open or read |
06:04:47 | <HP_Archivist> | sha1sum: __ia_thumb.jpg: No such file or directory |
06:04:48 | <HP_Archivist> | __ia_thumb.jpg: FAILED open or read |
06:04:50 | <HP_Archivist> | sha1sum: WARNING: 2 listed files could not be read |
06:05:02 | <HP_Archivist> | That was with the command run from the folder where the data is
06:05:10 | <@JAA> | 04:56:45 <@JAA> You just go to the directory with the item data and run `iasha1check IDENTIFIER`. |
06:05:27 | <HP_Archivist> | Ah, the actual folder |
06:05:41 | <@JAA> | The argument isn't a directory name. The script doesn't care what the dir is named. |
06:05:56 | <@JAA> | It checks that the current dir matches the item. |
06:06:45 | <HP_Archivist> | Yup, that worked. I'm just tired :) |
06:06:50 | <HP_Archivist> | Finally |
06:06:59 | <HP_Archivist> | Took me long enough, heh |
13:03:14 | | SootBector quits [Remote host closed the connection] |
13:03:35 | | SootBector (SootBector) joins |
13:10:44 | | igloo22225 quits [Read error: Connection reset by peer] |
13:11:02 | | igloo22225 (igloo22225) joins |
15:05:47 | | MrMcNuggets (MrMcNuggets) joins |
15:19:58 | <HP_Archivist> | JAA - was too tired to continue last night, but my arg: original_dir=$(pwd); cd /mnt/g/iapiisource && for dir in */; do if [ -d "$dir" ]; then echo "Checking directory: $dir"; (cd "$dir" && $original_dir/iasha1check -d .); fi; done; cd "$original_dir" |
15:20:19 | <HP_Archivist> | Keeps failing, saying no such file or directory. What am I doing wrong? |
15:21:02 | <HP_Archivist> | e.g. I want it to cycle through each item / item folder in the parent folder iapiisource as if it was verifying just one item |
15:36:44 | <HP_Archivist> | Nvm. Got it |
15:36:52 | <HP_Archivist> | Using this: cd /mnt/g/iapiisource && for dir in */; do |
15:36:52 | <HP_Archivist> | echo "Checking directory: $dir" |
15:36:52 | <HP_Archivist> | (cd "$dir" && iasha1check "${dir%/}") |
15:36:52 | <HP_Archivist> | done |
15:46:14 | <HP_Archivist> | It will fail, or rather, say these ia files aren't there: _archive.torrent, _files.xml, _meta.sqlite, _meta.xml, but that's to be expected. Actual source files are coming back read as OK. :)
16:15:22 | <@arkiver> | https://blog.archive.org/2024/09/11/new-feature-alert-access-archived-webpages-directly-through-google-search/ |
16:20:00 | <HP_Archivist> | ^^ Wow this is impressive |
16:20:35 | <@JAA> | HP_Archivist: Yep, those are the expected errors I mentioned last night. It should be easy to filter the output afterwards with a bit of `grep` to make sure there are no other errors. |
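A hypothetical filter along those lines (the patterns just drop the expected _files.xml/_meta.*/_archive.torrent noise and the OK lines, leaving any real problems visible; `$identifier` is a placeholder):

    iasha1check "$identifier" 2>&1 | grep -vE '_(files\.xml|meta\.(xml|sqlite)|archive\.torrent)' | grep -v ': OK$'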
16:21:02 | <nicolas17> | that's cool, but I'm still annoyed by Google removing its "cached" feature |
16:21:28 | <nicolas17> | https://twitter.com/nicolas09F9/status/1754520745118466153 |
16:21:31 | <@JAA> | nicolas17: The cache still exists. |
16:33:59 | <HP_Archivist> | Yup, JAA. Thanks again. |
16:34:39 | <HP_Archivist> | I think this is a super important step forward - this puts the power of the WBM right in ordinary users' hands. |
16:37:36 | <@arkiver> | i very much hope so! so many people do not know about the Wayback Machine... |
16:37:56 | <@arkiver> | it is usually really only known to tech people (and often to journalists, etc.) |
16:44:36 | <xkey> | arkiver: wowii, is it publicly known if archive.org gets financial rewards from that cooperation?
16:52:53 | <@arkiver> | xkey: i have no idea. IA gets traffic at least |
16:55:49 | <xkey> | jup |
17:00:33 | | nyuuzyou joins |
17:18:27 | <rewby> | congestion++ |
17:18:28 | <eggdrop> | [karma] 'congestion' now has 1 karma! |
17:18:59 | <@arkiver> | :) |
17:51:00 | | kokos- joins |
18:01:04 | | MrMcNuggets quits [Client Quit] |
18:08:23 | | katia_ (katia) joins |
18:17:12 | | katia_ quits [Client Quit] |
18:17:36 | | kokos- quits [Client Quit] |
18:18:21 | | kokos- joins |
18:27:42 | | katia_ (katia) joins |
18:50:24 | <nicolas17> | https://web.archive.org/web/2/https://example.com/ |
18:50:50 | <nicolas17> | I have seen (and used) this a few times but what does it mean? is the 2 some kind of version number? or is it interpreted as a timestamp? |
18:51:15 | <@JAA> | The latter. |
18:51:43 | <nicolas17> | a very low timestamp, so it redirects to the "nearest" one which is the oldest? |
18:51:46 | <@JAA> | You can shorten the timestamp to the desired precision: https://web.archive.org/web/20240801/https://example.com/ |
18:51:54 | <nicolas17> | ah hm |
18:52:45 | <@JAA> | I think it picks the most recent snapshot that matches, effectively filling it up with 9s and 5s. |
18:52:57 | <@JAA> | So it'll be the most recent snapshot until the year 3000. |
18:53:06 | <@JAA> | Y3K BUG!!1! |
18:54:07 | <@JAA> | Or actually, it fills it up like that and then jumps to the closest snapshot, probably. |
18:54:23 | <@JAA> | So 20240801 becomes 20240801235959, and the closest snapshot is two seconds later. |
18:54:48 | <@JAA> | While 2 becomes 29991231235959, and the closest snapshot is effectively always the most recent one. |
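A quick way to see where a short timestamp resolves, assuming curl is available (example.com is just a placeholder):

    # the Location header points at the concrete capture the Wayback Machine picked
    curl -sI 'https://web.archive.org/web/2/https://example.com/' | grep -i '^location:'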
18:55:34 | <nicolas17> | I think I have seen cases where 404s got archived |
18:56:06 | <nicolas17> | can that happen? (depending on the tooling) |
18:56:30 | <nicolas17> | ah but then I would get redirected to a concrete capture that happens to be a 404, rather than /2/ returning a 404 directly |
18:56:42 | <@JAA> | Of course, the SPN has a checkbox for it, and external sources don't have such restrictions anyway. |
18:56:47 | <@JAA> | Yes |
18:57:12 | <nicolas17> | so if /2/ returns 404 then I can be sure there are no captures for that file |
19:18:14 | <nicolas17> | hm interesting |
19:18:50 | <nicolas17> | it also makes a request to the origin server to see if it still exists |
19:19:03 | <nicolas17> | if the origin returns 404 then WBM returns 404 |
19:19:15 | <nicolas17> | if the origin returns 200 then WBM returns a 302 redirect to /save/_embed/ |
19:21:21 | <@JAA> | Yes, cf. the 'this page is available on the web' message thing. |
19:29:29 | <pokechu22> | Also note that /1/ becomes 1999 which is usually good for finding the oldest capture (though it'll still prefer 1998 over 1997, etc) |
19:30:49 | <nicolas17> | hm I guess I need the oldest capture here |
19:31:09 | <nicolas17> | because the newest one may well be a 403 if it got captured after it was deleted from the source |
19:31:53 | <nicolas17> | https://web.archive.org/web/*/https://swcdn.apple.com/content/downloads/22/50/002-32829-A_D6OB9130EQ/ypmki63xkjh2hoksrdnpd372z5kmpqn3vm/InstallAssistant.pkg one 404 and one 403 yay -_- |
19:32:40 | <@JAA> | /10/ through /18/ |
19:34:15 | <pokechu22> | The CDX api might be better for your purposes - you could list all of them that got 200s that way |
19:36:35 | <nicolas17> | for starters I'll collect all those that redirect to save/_embed (meaning it's definitely not archived but it can be) |
19:37:21 | <@JAA> | The CDX API can give you a complete list of what's captured in a single request, yeah. |
19:37:35 | <@JAA> | Although you'd still need to check for truncated responses, I guess. |
19:39:17 | <nicolas17> | so far it's looking like every single version of macOS Rosetta is unarchived, still available, and only ~200KB :D |
19:50:20 | <pokechu22> | The CDX API can also tell you the size IIRC so you should be able to spot truncated responses that way |
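A sketch of that kind of CDX query (the URL and limit are placeholders; `length` is the size of the stored WARC record, so unusually small values hint at truncated captures):

    curl -s 'https://web.archive.org/cdx/search/cdx?url=https://swcdn.apple.com/&matchType=prefix&output=json&fl=timestamp,original,statuscode,length&filter=statuscode:200&limit=100'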
19:50:26 | <nicolas17> | oh even worse |
19:50:36 | <nicolas17> | https://web.archive.org/web/20240715092345/http://swcdn.apple.com/content/downloads/56/34/041-88557/sm2i2d444udypgi46bsi57h6aa0cq4pmm4/SafariTechPreviewElCapitan.pkg I forgot SPN does this shit |
19:51:19 | <nicolas17> | dunno if it's SPN's fault or swcdn's fault when it gets SPN's request |
19:52:20 | <@JAA> | Huh |
19:54:30 | <pokechu22> | https://web.archive.org/cdx/search/cdx?url=http%3A%2F%2Fswcdn.apple.com%2Fcontent%2Fdownloads%2F56%2F34%2F041-88557%2Fsm2i2d444udypgi46bsi57h6aa0cq4pmm4%2F&matchType=prefix&output=json&fl=original%2Cmimetype%2Ctimestamp%2Cstatuscode%2Clength&limit=10000 shows 2 captured 400s |
19:56:01 | <pokechu22> | also possibly useful: https://web.archive.org/cdx/search/cdx?url=http%3A%2F%2Fswcdn.apple.com%2F&matchType=prefix&collapse=urlkey&output=json&fl=original%2Cmimetype%2Ctimestamp%2Cendtimestamp%2Cgroupcount%2Cuniqcount&filter=!statuscode%3A%5B45%5D..&limit=10000&showResumeKey=true&resumeKey= + https://archive.org/developers/wayback-cdx-server.html (not that many successful |
19:56:03 | <pokechu22> | captures it seems) |
20:14:30 | <@arkiver> | JAA: so on shownumpages, i believe you need to set a pageSize parameter for it to function |
20:14:36 | <@arkiver> | and others ^ |
20:14:46 | <@JAA> | Well, that's new. |
20:14:50 | <@arkiver> | yeah |
20:14:54 | <@arkiver> | or well not sure |
20:17:25 | <@JAA> | https://web.archive.org/cdx/search/cdx?url=wiki.archiveteam.org&collapse=urlkey&fl=original&matchType=domain&pageSize=1000&showNumPages=true returns - (a dash). |
20:18:36 | <@JAA> | The pageSize value doesn't seem to matter. |
20:19:36 | <@arkiver> | well hm |
20:32:34 | | qwertyasdfuiopghjkl quits [Quit: Client closed] |
20:56:01 | <@arkiver> | i hope to have more on this soon |
21:06:51 | <@arkiver> | JAA: &fl=original may be the problem |
21:08:22 | <@arkiver> | without that i get `1` |
21:08:30 | <@JAA> | https://web.archive.org/cdx/search/cdx?url=wiki.archiveteam.org&collapse=urlkey&matchType=domain&pageSize=1000&showNumPages=true returns 1, https://web.archive.org/cdx/search/cdx?url=wiki.archiveteam.org&collapse=urlkey&fl=original&matchType=domain returns 286547 results... |
21:09:45 | <@JAA> | https://web.archive.org/cdx/search/cdx?url=wiki.archiveteam.org&collapse=urlkey&fl=original&matchType=domain&pageSize=1000&page=0 also returns the full result set (actually slightly more, 286641), completely ignoring the pageSize. |
21:10:10 | <@JAA> | (page=0 is the first page according to the documentation.) |
21:11:33 | <@arkiver> | well... more soon again |
21:21:47 | | DLoader quits [Ping timeout: 256 seconds] |
21:22:22 | | DLoader (DLoader) joins |
21:29:45 | <nicolas17> | does /save/_embed/ return an html form or does it actually immediately trigger saving? |
21:29:55 | | Nemo_bis quits [Ping timeout: 255 seconds] |
21:30:08 | | Nemo_bis (Nemo_bis) joins |
21:30:25 | <@JAA> | It immediately saves (or sometimes goes into an infinite loop because caching is hard). |
21:31:16 | <nicolas17> | so "wget https://web.archive.org/web/2/$url" would cause a save if that url isn't captured? oof |
21:31:48 | <@JAA> | It normally shouldn't. You should get the 404 with 'not saved yet' and possibly 'available on the web'. |
21:33:06 | <@JAA> | Appending im_ to the timestamp probably triggers a /save/_embed/ though. |
21:33:18 | <nicolas17> | maybe wget user-agent causes im_ behavior? |
21:33:22 | <pokechu22> | Yeah, the normal HTML one doesn't trigger a save, but various embed ones do |
21:33:36 | <pokechu22> | But /web/2/ _will_ try to fetch the page to see if it exists |
21:33:49 | <nicolas17> | I ran "wget https://web.archive.org/web/2/https://swcdn.apple.com/content/downloads/05/59/062-54078-A_B16PVEE8JJ/7xpklgkn5wzlb7fzavwy7c8oeir2gjs5uf/SafariTechPreview.pkg", it redirected to "https://web.archive.org/save/_embed/https://swcdn.apple.com/content/downloads/05/59/062-54078-A_B16PVEE8JJ/7xpklgkn5wzlb7fzavwy7c8oeir2gjs5uf/SafariTechPreview.pkg" and got stuck for a minute until I Ctrl-C'd it |
21:33:55 | <pokechu22> | not sure if that's a HEAD or a full GET or what |
21:33:56 | <nicolas17> | now it takes me to the failed capture saying "Cycle Prohibited" |
21:34:05 | <pokechu22> | OK, that's different from what I've seen, interesting |
21:34:20 | <@JAA> | Yeah, not something I've seen either. |
21:34:45 | <pokechu22> | It might be that wget gets served the original form instead of the iframe with the timeline and stuff, and doing that triggers saves |
21:34:46 | <nicolas17> | maybe browser UAs redirect to the web form but wget redirects to _embed |
21:53:11 | <steering> | oof, the google link to WBM is ... not very useful? |
21:56:50 | <steering> | https://imgur.com/a/Hy2lbiq |
21:57:03 | <steering> | maybe it's more prominent sometimes... |
22:09:12 | | KoalaBear84 quits [Read error: Connection reset by peer] |
22:12:13 | <imer> | thats quite hidden |
22:54:59 | <nicolas17> | can we block swcdn.apple.com from being archived in SPN? it always ends up in that weird proxy error |
22:57:54 | <nicolas17> | worst is when there is already a valid capture |
22:58:11 | <nicolas17> | and the "latest capture" ends up being the proxy error instead |
22:58:32 | | corentin joins |