00:04:03 | | fuzzy80211 quits [Read error: Connection reset by peer] |
00:04:04 | <HP_Archivist> | Hey JAA - I fixed that directory issue I was having. But the arg you gave me the other day for verifying hash values doesn't list the results in the output txt in the same order as the txt it's reading the hashes from.
00:04:27 | | fuzzy80211 (fuzzy80211) joins |
00:04:37 | <HP_Archivist> | I recalculated hashes and it's now all in one txt |
00:05:14 | <HP_Archivist> | https://transfer.archivete.am/Cekt7/all_items_md5_checksums.txt |
00:05:15 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/Cekt7/all_items_md5_checksums.txt |
00:05:30 | <@JAA> | HP_Archivist: Yes, that's why I sorted both inputs. |
00:05:54 | <HP_Archivist> | Ohh |
00:06:10 | <@JAA> | It does sort by the hash, but that doesn't really matter as long as the sorting is the same on both. |
00:06:59 | <HP_Archivist> | Either I'm reading the results wrong, or it's failing on the hash compare? https://transfer.archivete.am/Dddi8/hashes_compared_results.txt |
00:07:00 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/Dddi8/hashes_compared_results.txt |
00:07:32 | <HP_Archivist> | Manually comparing several of them earlier indicated they matched. + indicates they don't, right? |
00:07:51 | <@JAA> | Hmm, that's not one comparison per item anymore. |
00:08:11 | <@JAA> | The output is a diff. You should see pairs of - and + lines for mismatches. |
00:08:42 | <HP_Archivist> | Yes, I changed the calculation script to avoid the extra file paths
00:09:12 | <@JAA> | Mhm, but putting them all in one file rather than one file per item makes a mess. |
00:09:32 | <@JAA> | And in theory, it could mask problems with your download. |
00:09:36 | <HP_Archivist> | Oops |
00:10:00 | <HP_Archivist> | Which is probably why it's coming back saying none match (when they do) |
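A minimal sketch of that kind of comparison, assuming one `md5sum`-style line per file in both lists (the file names below are placeholders, not the actual files linked above):

    # sort both hash lists the same way, then diff; paired -/+ lines mark mismatching entries
    diff <(sort local_md5s.txt) <(sort ia_md5s.txt)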
00:10:06 | <@JAA> | If you can rehash everything in a reasonable amount of time, you could just use iasha1check. |
00:10:43 | <HP_Archivist> | Yeah, I tried using that in my initial attempt at this and got stuck. Don't remember on what, though.
00:11:07 | <HP_Archivist> | How do you usually create hashes? I know it varies based on who you ask |
00:11:12 | | JaffaCakes118 quits [Remote host closed the connection] |
00:11:22 | <@JAA> | It would be unhappy about the _files.xml and _meta.* files, but that's easily ignored/checked afterwards. |
00:11:41 | | JaffaCakes118 (JaffaCakes118) joins |
00:12:01 | <@JAA> | I calculate SHA-256 of all files before the upload, and I use iasha1check after IA finishes processing to confirm everything's fine. |
00:12:31 | <HP_Archivist> | SHA256 - Doesn't that take a really long time? |
00:13:01 | <HP_Archivist> | And I guess my question is, what do you use to do the calculations, a script or a dedicated piece of software |
00:13:45 | <nicolas17> | the bottleneck is usually your hard disk and not the hashing algorithm |
00:13:56 | <@JAA> | It is significantly slower than MD5 or SHA-1, yeah. |
00:14:16 | <@JAA> | SSDs <3 |
00:16:14 | <@JAA> | I use `sha256sum`. I've experimented with calculating all three hashes in parallel (the IA upload also needs MD5 for the Content-MD5 header), but that didn't work particularly well, so I abandoned that idea. |
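One way such a one-pass, three-hash run can be sketched in bash, assuming process substitution is available and `$f` holds the file path (the output file names are placeholders; this is not the abandoned setup JAA refers to):

    # read the file once and feed it to three hashers in parallel
    < "$f" tee >(md5sum > "$f.md5") >(sha1sum > "$f.sha1") | sha256sum > "$f.sha256"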
00:16:14 | <nicolas17> | seems my laptop can sha1 at 330MiB/s and sha256 at 180MiB/s |
00:16:16 | <nicolas17> | roughly |
00:16:43 | <@JAA> | Yeah, it depends on CPU generation etc., but you can expect very roughly a factor 2. |
00:17:09 | <HP_Archivist> | I'd do SHA256, but my daily driver machine is the only one I have; I imagine SHA-256 calculations on 1000s of files would still slow other tasks down a lot, even though I'm running on SSDs, too.
00:17:41 | <HP_Archivist> | Yeah, CPU would be my bottleneck I think. It barely breaks the 3 GHz barrier, not exactly great
00:18:15 | <HP_Archivist> | I maxed out an older Inspiron with 64GB of memory and 2 internal 4TB nvmes Lol |
00:18:32 | <HP_Archivist> | There are still bottlenecks though. So, meh. |
00:18:38 | <nicolas17> | JAA: huh looks like CPU instructions to accelerate SHA-1 and SHA-256 were added *together* in Intel CPUs |
00:18:52 | <@JAA> | Huh |
00:19:47 | <@JAA> | The SHA-256 hashes aren't of any use for validating files on IA because IA doesn't calculate them. I keep them for my own sake and to possibly publish them in the future as an independent data integrity check. |
00:20:04 | <HP_Archivist> | JAA why use 256 if IA needs MD5 anyway? |
00:20:11 | <HP_Archivist> | Oh, nvm ^^ |
00:20:17 | <HP_Archivist> | Gotcha, makes sense |
00:20:29 | <@JAA> | MD5 and SHA-1 are fairly broken by now, so that's why I went with an SHA-2 function. |
00:20:56 | <nicolas17> | hm are they broken in ways that matter for this? |
00:21:04 | <@JAA> | SHA-256 isn't perfect for this because it's vulnerable to length extension attacks. If I started over, I'd probably do SHA-512/256 instead, or something more modern. |
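For reference, SHA-512/256 is available via OpenSSL (1.1.1 and newer); coreutils has no dedicated tool for it. The file name below is just a placeholder:

    openssl dgst -sha512-256 somefile.bin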
00:22:55 | <HP_Archivist> | Creating hashes with a more robust standard to keep on the side for data integrity validation is actually a smart idea. I've never bothered because I know it would likely slow everything else to a crawl. The perils of only working with one machine.
00:23:16 | <steering> | JAA: lots of stuff still does sha2+size :) gentoo, debian |
00:23:25 | <@JAA> | nicolas17: Eh, depends on how you look at it. I could replace files on IA without it being visible in the hashes if I used one of those broken ones. |
00:23:34 | <steering> | works well enough against length extension after all |
00:23:34 | <@JAA> | steering: Yeah, that's where I messed up, I didn't keep the file sizes. :-) |
00:23:45 | <steering> | yeah, could always add them on for new stuff :) |
00:23:51 | <@JAA> | Aye |
00:24:14 | <steering> | (if you don't want to completely switch to a different hash function to retain compatibility) |
00:25:16 | <nicolas17> | oh right, length extension |
00:25:27 | <nicolas17> | I was just thinking collisions |
00:26:08 | <HP_Archivist> | JAA: You've given me something to think about. You validate before you upload. Not after the fact. But what I do is upload, then ia download (still having that original source locally, too) and then I want to validate what ia download pulls down since it remains how everything is IA-side. |
00:26:19 | <@JAA> | Length extension's probably not critical in this case. Whoever uses those hashes would have to trust the source (i.e. me) anyway that the hashes are correct. |
00:26:23 | <HP_Archivist> | since it retains* |
00:26:36 | <@JAA> | So it is mostly about collisions (and, in theory, preimages, but not going to happen). |
00:26:59 | <nicolas17> | afaik even MD4 doesn't have a practical preimage attack |
00:27:12 | <@JAA> | HP_Archivist: Nothing to validate before upload. But I calculate my local hashes before, yes. And then I validate after the upload before deleting the local copy. |
00:27:45 | <@JAA> | Even MD2 is still safe in that regard, I believe. |
00:27:58 | <HP_Archivist> | Erm, yeah, I meant you calculate local hashes before upload* |
00:28:44 | <nicolas17> | MD4 turned out to have significantly worse collision resistance than MD2 |
00:29:17 | <HP_Archivist> | I don't know what I should consider source - the actual source files, or what ia download pulls down assuming hashes match. The only reason I use ia download is because it keeps everything neat in its own folder, mirroring the directory structure of how it is on the site. |
00:29:49 | <steering> | interestingly, sha256sum doesn't appear to use CPU instructions for it, even on my riced out gentoo box |
00:30:03 | <nicolas17> | steering: might have been added later |
00:30:13 | <steering> | (I mean, doesn't use any of the SHA Extensions) |
00:30:18 | <nicolas17> | try "openssl sha256"? |
00:30:20 | <@JAA> | Hmm, that would explain the performance difference. IIRC, OpenSSL performs better. |
00:30:49 | <steering> | I disassembled sha256sum and grep'd through it. I don't think I can do that very well for openssl sha256 :) |
00:31:00 | <nicolas17> | oh I meant compare perf :P |
00:31:23 | | steering picks a big movie |
00:31:24 | <@JAA> | `openssl speed sha1 sha256` |
00:31:37 | <steering> | should i copy 73.3GB onto tmpfs? probably not. |
00:31:52 | <steering> | 28GB? sounds better xD |
00:32:46 | <@JAA> | 828 MB/s SHA-1, 372 MB/s SHA-256 on my test server |
00:32:54 | | nulldata quits [Quit: So long and thanks for all the fish!] |
00:33:01 | <@JAA> | 750 MB/s MD5, for comparison |
00:33:22 | <steering> | looks like that's single-threaded so *cores
00:34:22 | <HP_Archivist> | JAA - guess I need to rehash *again* to individual txts and give iasha1check another try. Also, now my upload process will take a bit longer; gonna start doing what you do to make the verification easier :P
00:34:25 | | nulldata (nulldata) joins |
00:34:26 | <@JAA> | Yeah, I should probably write something that runs `sha256sum` in parallel. |
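A rough sketch of what such a parallel run could look like (directory layout and output file name are hypothetical):

    # hash every file under the current directory, four sha256sum processes at a time
    find . -type f -print0 | xargs -0 -P4 -n1 sha256sum > all_files.sha256
    # parallel output interleaves, so restore a stable order by file name afterwards
    sort -k2 all_files.sha256 -o all_files.sha256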
00:34:29 | <steering> | let's see, 1.2GB/s and 556MB/s for me |
00:34:48 | <steering> | 1219269.97k, 555985.58k @ 16384B |
00:34:54 | <@JAA> | Nice |
00:34:55 | <nicolas17> | testing, but discord is eating some CPU in the background |
00:35:04 | <nicolas17> | discord-- |
00:35:05 | <eggdrop> | [karma] 'discord' now has -18 karma! |
00:35:07 | <@JAA> | What CPU is that? |
00:35:33 | <nicolas17> | sha1 faster than md5 huh |
00:35:36 | <steering> | oops my /tmp is limited to 4GB, oh well, that should be enough |
00:36:05 | <steering> | i5-9600k (no OC, although maybe XMP something something for RAM) |
00:36:28 | <@JAA> | Could also just use /dev/zero. :-P |
00:36:38 | <steering> | hmm fair |
00:36:57 | <@JAA> | `time dd if=/dev/zero bs=4M count=1024 | sha1sum` |
00:36:58 | <nicolas17> | hm yes |
00:37:30 | <nicolas17> | md5 675MB/s, sha1 868MB/s, that does look like specialized CPU instructions |
00:37:36 | <nicolas17> | sha256 436MB/s |
00:38:02 | <steering> | I get about the same speed from both of them |
00:38:12 | <nicolas17> | intel i5-8250U |
00:38:14 | <steering> | `pv /dev/zero | openssl sha256` vs |sha256sum, 500MB/s |
00:38:19 | <steering> | and 515-526MB/s dd| |
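One caveat on those numbers: on some OpenSSL versions, the bare algorithm names in `openssl speed` benchmark the low-level implementation rather than the EVP path, so hardware acceleration (SHA-NI, ARMv8 crypto) may not show up; `-evp` is worth trying for comparison:

    openssl speed -evp sha256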
00:39:51 | <nicolas17> | oh now I remember the weird one I knew |
00:39:57 | <nicolas17> | Apple M2 |
00:40:09 | <nicolas17> | md5: 596MB/s |
00:40:15 | <nicolas17> | sha1: 728MB/s |
00:40:24 | <nicolas17> | sha256: 2574MB/s |
00:40:48 | <nicolas17> | just... how |
00:41:19 | <steering> | hm, makes sense tbh |
00:41:36 | <nicolas17> | does it have acceleration for sha256 and not for sha1? |
00:41:38 | <steering> | only bother to optimize sha256 because it's the most used |
00:42:11 | <nicolas17> | aaaa why are wikipedia's IA bots so dumb |
00:42:35 | <nicolas17> | "archived from the original" and it points me at the wayback machine's capture of a 404 page |
00:42:37 | <steering> | especially with it being RISC |
00:42:54 | <@JAA> | Is it a 404 served as a 200? |
00:43:51 | <nicolas17> | probably |
00:43:56 | <nicolas17> | oh framesets, what year is it |
00:44:29 | <nicolas17> | JAA: it's a 301 to /documentation/404 |
00:44:46 | <@JAA> | Ah |
00:46:05 | <steering> | also wow Intel's SHA extensions are only from 2013? I didn't realize they were so recent. |
00:46:51 | <@JAA> | Yep, just in time for being deprecated. lol |
00:48:52 | | nulldata quits [Client Quit] |
00:49:11 | <steering> | wow, that's embarrassing |
00:49:47 | <steering> | I went and compared my 'home server' (nuc8i7be) to my desktop and... it basically did the exact same |
00:49:52 | <steering> | despite being clocked much lower |
00:50:55 | | nulldata (nulldata) joins |
00:50:57 | | nulldata quits [Client Quit] |
00:50:59 | <steering> | https://transfer.archivete.am/inline/jMQDK/openssl-speed.txt |
00:51:30 | <steering> | (and being a mobile part) |
00:52:25 | <@JAA> | Marginally higher throughput even, lol |
00:52:26 | <steering> | Oh I guess they actually have similar turbo boost speeds they're probably hitting |
00:52:54 | <steering> | 4.5 to 4.6 |
00:53:07 | <@JAA> | Yeah, I was just looking for the spec sheets. |
00:54:08 | | nulldata (nulldata) joins |
00:54:37 | <steering> | they're both coffee lake so same clock speed, same internals, same performance |
00:54:48 | <@JAA> | My N100 manages 2.6 GB/s SHA-1 and 2.2 GB/s SHA-256. |
00:55:02 | <HP_Archivist> | What do you think about using rclone to create hashes, JAA? |
00:55:46 | <@JAA> | That's with OpenSSL though. sha256sum does ... not. |
00:55:58 | <steering> | wow, and that's a 6W chip, meanwhile mine's 95W |
00:56:39 | <@JAA> | Well, it has the SHA extensions. The spec was created in 2013, but only quite recent CPUs actually have it. |
00:57:12 | <steering> | I... hmm |
00:57:31 | <@JAA> | Yours are two generations too old. |
00:57:52 | <@JAA> | Or one, depending on what series you look at exactly. |
00:58:10 | <@JAA> | > Intel Goldmont[3] (2016) and later Atom microarchitecture processors. |
00:58:13 | <@JAA> | > Intel Cannon Lake[4] (2018/2019), Ice Lake[5] (2019) and later processors for laptops ("mainstream mobile"). |
00:58:16 | <@JAA> | > Intel Rocket Lake (2021) and later processors for desktop computers. |
00:58:32 | <steering> | ahhh urite |
00:59:08 | <@JAA> | The N100 is great. Highly recommended. :-) |
00:59:25 | <@JAA> | HP_Archivist: I haven't used rclone, so no opinion. |
01:00:18 | <@JAA> | It appears as sha_ni in /proc/cpuinfo flags, it seems. |
01:00:20 | <steering> | yeah, I get 2.3 and 2.1 on a i7-1260P |
01:02:48 | <steering> | still, 28-64W for that (no TDP listed?? ugh intel at least be consistent in your useless units) compared to your 6W |
01:04:11 | <steering> | I'm surprised the N100 doesn't have E-cores. Then again I have no idea what Intel is doing with their processor lineups these days. |
01:19:12 | <@JAA> | The N100 has only E-cores, no P-cores. |
01:19:51 | <steering> | ah ok |
01:35:57 | | JaffaCakes118_2 (JaffaCakes118) joins |
01:36:00 | | JaffaCakes118 quits [Remote host closed the connection] |
03:38:42 | | JaffaCakes118_2 quits [Remote host closed the connection] |
03:39:25 | | JaffaCakes118_2 (JaffaCakes118) joins |
03:46:02 | | DogsRNice quits [Read error: Connection reset by peer] |
03:49:01 | | DogsRNice joins |
03:53:46 | | DogsRNice quits [Client Quit] |
03:55:14 | | igloo22225 quits [Quit: The Lounge - https://thelounge.chat] |
04:18:17 | | DogsRNice joins |
04:19:33 | | DogsRNice quits [Remote host closed the connection] |
04:32:41 | <HP_Archivist> | RE: rclone. No worries, JAA. Using iasha1check requires SHA-1, but IA requires MD5 |
04:32:45 | <HP_Archivist> | So I guess I'll do both |
04:32:58 | <@JAA> | HP_Archivist: You misunderstood. |
04:33:41 | <@JAA> | The IA S3 API supports MD5 in that you can send them the correct hash in a header and the upload will fail if it doesn't match. |
04:34:05 | <@JAA> | IA still calculates both MD5 and SHA-1 for each file after that. |
04:34:21 | <@JAA> | And the latter is what iasha1check uses for independently verifying that the local files match the item. |
04:37:09 | <HP_Archivist> | Ahh, that makes sense ^^ |
04:40:54 | <HP_Archivist> | So if I want to start hashing before uploading, I need to get in the habit of calculating MD5 values for files for use when uploading. During upload, if the hash doesn't match, the upload fails as you said.
04:41:04 | <HP_Archivist> | But I essentially still need to calculate both locally, no? |
04:42:30 | <HP_Archivist> | *Per what you mentioned about independently verifying with iasha1check |
04:42:51 | <HP_Archivist> | Apologies it's taken this long for me to get this |
04:45:06 | <HP_Archivist> | I don't think I've ever tried using the S3 API, not sure if that's necessary at this point. But I like the idea of providing hashes upon uploading. |
04:46:09 | <@JAA> | If you use `ia upload` (or `ia-upload-stream`, but I doubt that), you don't need to think about this. It happens automatically. |
04:46:47 | <@JAA> | Hmm, actually, `ia upload` requires `--verify` to do that. |
04:47:21 | <nicolas17> | -c calculates the checksum in order to know if it has to upload the file at all or not |
04:47:26 | <nicolas17> | I wonder if that also implies --verify |
04:47:55 | <@JAA> | It does not IIRC. |
04:48:40 | <nicolas17> | to me the reason to *not* use --verify would be to avoid spending time calculating the checksum |
04:48:54 | <nicolas17> | so if you have to calculate it anyway... hmm |
04:49:08 | <HP_Archivist> | It would just make the upload process longer, it seems |
04:49:35 | <@JAA> | Well, it has to calculate the hash, yes. If you verify it afterwards anyway, you can safely skip that. |
04:49:48 | <HP_Archivist> | And yeah, I already use ia upload JAA, so I guess no reason for me to bother with the API
04:51:11 | <HP_Archivist> | So: 'ia upload identifier file --verify' ? |
04:51:33 | <@JAA> | That would do the extra MD5 thing, yes. |
04:52:19 | <HP_Archivist> | Simple enough. It would only slow things down on largish files, no big deal I suppose
04:53:08 | <HP_Archivist> | But still have to locally calculate hashes with SHA-1 if I want to use iasha1check |
04:56:34 | <@JAA> | Yes, not manually though. |
04:56:45 | <@JAA> | You just go to the directory with the item data and run `iasha1check IDENTIFIER`. |
04:57:26 | <HP_Archivist> | Thank you |
04:57:40 | <HP_Archivist> | Can't do them all at once? |
04:57:41 | <@JAA> | It retrieves the SHA-1 hashes from IA and checks them against the local files with `sha1sum -c`. (It also compares the file lists and reports differences in those separately for convenience.) |
04:58:03 | <@JAA> | A simple shell loop can take care of that. |
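A sketch of such a loop, assuming one directory per item, each named after its IA identifier (which is the layout `ia download` produces):

    for dir in */; do
        echo "Checking $dir"
        ( cd "$dir" && iasha1check "${dir%/}" )
    done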
05:03:51 | <HP_Archivist> | Oh, by calculate I meant create SHA1 hashes first, so then iasha1check can automatically check with sha1sum -c |
05:04:09 | <HP_Archivist> | e.g. I don't have txts of sha1 hashes for the files yet, I'll need to generate those now. |
05:04:24 | <@JAA> | iasha1check doesn't support that. |
05:05:47 | <HP_Archivist> | Does iasha1check generate SHA-1 hash values for files or no? |
05:06:00 | <HP_Archivist> | Not just verify |
05:06:39 | <@JAA> | It only checks the hashes. It does not produce any hash output. |
05:07:15 | <@JAA> | It takes the IA item file metadata, generates the equivalent of `sha1sum *` output, and then feeds that to `sha1sum -c`. |
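A rough manual equivalent of that flow (not the script itself; assumes the `ia` CLI and `jq` are installed, and IDENTIFIER is a placeholder):

    # build sha1sum-style lines from the item's file metadata and check the local files against them
    ia metadata IDENTIFIER | jq -r '.files[] | select(.sha1) | "\(.sha1)  \(.name)"' | sha1sum -c -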
05:10:02 | <HP_Archivist> | Oh okay, I get it now. So sha1sum -c does the actual calculation part |
05:10:31 | <HP_Archivist> | Or generation or whatever you wanna call it. I'm tired and I'm going round in circles at this point, heh |
05:10:56 | <@JAA> | > -c, --check |
05:10:56 | <@JAA> | > read SHA1 sums from the FILEs and check them |
05:11:36 | <@JAA> | Here, the FILE is what's generated from IA's data. Normally, you'd use it like `sha1sum foo >foo.sha1` and later `sha1sum -c foo.sha1` to verify that `foo` is intact. |
05:11:57 | <@JAA> | It calculates the hashes internally but doesn't report them. |
05:12:03 | <@JAA> | Just match/mismatch |
05:12:38 | <HP_Archivist> | Yeah, JAA. I get it now. I think sometimes I overthink things a bit too much. Thanks! |
05:14:22 | <@JAA> | :-) |
05:51:46 | | igloo22225 (igloo22225) joins |
05:55:15 | <HP_Archivist> | Just noticed you keep it updated here? https://gitea.arpa.li/JustAnotherArchivist/little-things/src/branch/master/iasha1check |
05:55:32 | <@JAA> | That is the repo, yes. |
06:03:56 | <HP_Archivist> | I cloned the repo and it's installed, but running: iasha1check dans-les-coulisses-des-jeux-video-harry-potter-book
06:04:43 | <HP_Archivist> | Displays a list of all items from the parent folder of all items and then: IA item files that are not in the local directory:
06:04:43 | <HP_Archivist> | Dans les coulisses des Jeux Vidéo Harry Potter.pdf |
06:04:43 | <HP_Archivist> | __ia_thumb.jpg |
06:04:43 | <HP_Archivist> | SHA-1 comparison: |
06:04:43 | <HP_Archivist> | sha1sum: 'Dans les coulisses des Jeux Vidéo Harry Potter.pdf': No such file or directory |
06:04:45 | <HP_Archivist> | Dans les coulisses des Jeux Vidéo Harry Potter.pdf: FAILED open or read |
06:04:47 | <HP_Archivist> | sha1sum: __ia_thumb.jpg: No such file or directory |
06:04:48 | <HP_Archivist> | __ia_thumb.jpg: FAILED open or read |
06:04:50 | <HP_Archivist> | sha1sum: WARNING: 2 listed files could not be read |
06:05:02 | <HP_Archivist> | That was with the command run from the folder where the data is
06:05:10 | <@JAA> | 04:56:45 <@JAA> You just go to the directory with the item data and run `iasha1check IDENTIFIER`. |
06:05:27 | <HP_Archivist> | Ah, the actual folder |
06:05:41 | <@JAA> | The argument isn't a directory name. The script doesn't care what the dir is named. |
06:05:56 | <@JAA> | It checks that the current dir matches the item. |
06:06:45 | <HP_Archivist> | Yup, that worked. I'm just tired :) |
06:06:50 | <HP_Archivist> | Finally |
06:06:59 | <HP_Archivist> | Took me long enough, heh |
13:03:14 | | SootBector quits [Remote host closed the connection] |
13:03:35 | | SootBector (SootBector) joins |
13:10:44 | | igloo22225 quits [Read error: Connection reset by peer] |
13:11:02 | | igloo22225 (igloo22225) joins |
15:05:47 | | MrMcNuggets (MrMcNuggets) joins |
15:19:58 | <HP_Archivist> | JAA - was too tired to continue last night, but my arg: original_dir=$(pwd); cd /mnt/g/iapiisource && for dir in */; do if [ -d "$dir" ]; then echo "Checking directory: $dir"; (cd "$dir" && $original_dir/iasha1check -d .); fi; done; cd "$original_dir" |
15:20:19 | <HP_Archivist> | Keeps failing, saying no such file or directory. What am I doing wrong? |
15:21:02 | <HP_Archivist> | e.g. I want it to cycle through each item / item folder in the parent folder iapiisource as if it was verifying just one item |
15:36:44 | <HP_Archivist> | Nvm. Got it |
15:36:52 | <HP_Archivist> | Using this: cd /mnt/g/iapiisource && for dir in */; do |
15:36:52 | <HP_Archivist> | echo "Checking directory: $dir" |
15:36:52 | <HP_Archivist> | (cd "$dir" && iasha1check "${dir%/}") |
15:36:52 | <HP_Archivist> | done |
15:46:14 | <HP_Archivist> | It will fail, or rather, say these ia files aren't there: _archive.torrent, _files.xml, _meta.sqlite, _meta.xml, but that's to be expected. Actual source files are coming back read as OK. :)
16:15:22 | <@arkiver> | https://blog.archive.org/2024/09/11/new-feature-alert-access-archived-webpages-directly-through-google-search/ |
16:20:00 | <HP_Archivist> | ^^ Wow this is impressive |
16:20:35 | <@JAA> | HP_Archivist: Yep, those are the expected errors I mentioned last night. It should be easy to filter the output afterwards with a bit of `grep` to make sure there are no other errors. |
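A hypothetical filter along those lines (the patterns just drop the expected _files.xml/_meta.*/_archive.torrent noise and the OK lines, leaving any real problems visible; `$identifier` is a placeholder):

    iasha1check "$identifier" 2>&1 | grep -vE '_(files\.xml|meta\.(xml|sqlite)|archive\.torrent)' | grep -v ': OK$'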
16:21:02 | <nicolas17> | that's cool, but I'm still annoyed by Google removing its "cached" feature |
16:21:28 | <nicolas17> | https://twitter.com/nicolas09F9/status/1754520745118466153 |
16:21:31 | <@JAA> | nicolas17: The cache still exists. |
16:33:59 | <HP_Archivist> | Yup, JAA. Thanks again. |
16:34:39 | <HP_Archivist> | I think this is a super important step forward - this puts the power of the WBM right in ordinary users' hands. |
16:37:36 | <@arkiver> | i very much hope so! so many people do not know about the Wayback Machine... |
16:37:56 | <@arkiver> | it is usually really only known to tech people (and often to journalists, etc.) |
16:44:36 | <xkey> | arkiver: wowii, is it publicly known if archive.org gets financial rewards from that cooperation?
16:52:53 | <@arkiver> | xkey: i have no idea. IA gets traffic at least |
16:55:49 | <xkey> | jup |
17:00:33 | | nyuuzyou joins |
17:18:27 | <rewby> | congestion++ |
17:18:28 | <eggdrop> | [karma] 'congestion' now has 1 karma! |
17:18:59 | <@arkiver> | :) |
17:51:00 | | kokos- joins |
18:01:04 | | MrMcNuggets quits [Client Quit] |
18:08:23 | | katia_ (katia) joins |
18:17:12 | | katia_ quits [Client Quit] |
18:17:36 | | kokos- quits [Client Quit] |
18:18:21 | | kokos- joins |
18:27:42 | | katia_ (katia) joins |
18:50:24 | <nicolas17> | https://web.archive.org/web/2/https://example.com/ |
18:50:50 | <nicolas17> | I have seen (and used) this a few times but what does it mean? is the 2 some kind of version number? or is it interpreted as a timestamp? |
18:51:15 | <@JAA> | The latter. |
18:51:43 | <nicolas17> | a very low timestamp, so it redirects to the "nearest" one which is the oldest? |
18:51:46 | <@JAA> | You can shorten the timestamp to the desired precision: https://web.archive.org/web/20240801/https://example.com/ |
18:51:54 | <nicolas17> | ah hm |
18:52:45 | <@JAA> | I think it picks the most recent snapshot that matches, effectively filling it up with 9s and 5s. |
18:52:57 | <@JAA> | So it'll be the most recent snapshot until the year 3000. |
18:53:06 | <@JAA> | Y3K BUG!!1! |
18:54:07 | <@JAA> | Or actually, it fills it up like that and then jumps to the closest snapshot, probably. |
18:54:23 | <@JAA> | So 20240801 becomes 20240801235959, and the closest snapshot is two seconds later. |
18:54:48 | <@JAA> | While 2 becomes 29991231235959, and the closest snapshot is effectively always the most recent one. |
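A quick way to see where a short timestamp resolves, assuming curl is available (example.com is just a placeholder):

    # the Location header points at the concrete capture the Wayback Machine picked
    curl -sI 'https://web.archive.org/web/2/https://example.com/' | grep -i '^location:'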
18:55:34 | <nicolas17> | I think I have seen cases where 404s got archived |
18:56:06 | <nicolas17> | can that happen? (depending on the tooling) |
18:56:30 | <nicolas17> | ah but then I would get redirected to a concrete capture that happens to be a 404, rather than /2/ returning a 404 directly |
18:56:42 | <@JAA> | Of course, the SPN has a checkbox for it, and external sources don't have such restrictions anyway. |
18:56:47 | <@JAA> | Yes |
18:57:12 | <nicolas17> | so if /2/ returns 404 then I can be sure there are no captures for that file |
19:18:14 | <nicolas17> | hm interesting |
19:18:50 | <nicolas17> | it also makes a request to the origin server to see if it still exists |
19:19:03 | <nicolas17> | if the origin returns 404 then WBM returns 404 |
19:19:15 | <nicolas17> | if the origin returns 200 then WBM returns a 302 redirect to /save/_embed/ |
19:21:21 | <@JAA> | Yes, cf. the 'this page is available on the web' message thing. |
19:29:29 | <pokechu22> | Also note that /1/ becomes 1999 which is usually good for finding the oldest capture (though it'll still prefer 1998 over 1997, etc) |
19:30:49 | <nicolas17> | hm I guess I need the oldest capture here |
19:31:09 | <nicolas17> | because the newest one may well be a 403 if it got captured after it was deleted from the source |
19:31:53 | <nicolas17> | https://web.archive.org/web/*/https://swcdn.apple.com/content/downloads/22/50/002-32829-A_D6OB9130EQ/ypmki63xkjh2hoksrdnpd372z5kmpqn3vm/InstallAssistant.pkg one 404 and one 403 yay -_- |
19:32:40 | <@JAA> | /10/ through /18/ |
19:34:15 | <pokechu22> | The CDX api might be better for your purposes - you could list all of them that got 200s that way |
19:36:35 | <nicolas17> | for starters I'll collect all those that redirect to save/_embed (meaning it's definitely not archived but it can be) |
19:37:21 | <@JAA> | The CDX API can give you a complete list of what's captured in a single request, yeah. |
19:37:35 | <@JAA> | Although you'd still need to check for truncated responses, I guess. |
19:39:17 | <nicolas17> | so far it's looking like every single version of macOS Rosetta is unarchived, still available, and only ~200KB :D |
19:50:20 | <pokechu22> | The CDX API can also tell you the size IIRC so you should be able to spot truncated responses that way |
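A sketch of that kind of CDX query (the URL and limit are placeholders; `length` is the size of the stored WARC record, so unusually small values hint at truncated captures):

    curl -s 'https://web.archive.org/cdx/search/cdx?url=https://swcdn.apple.com/&matchType=prefix&output=json&fl=timestamp,original,statuscode,length&filter=statuscode:200&limit=100'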
19:50:26 | <nicolas17> | oh even worse |
19:50:36 | <nicolas17> | https://web.archive.org/web/20240715092345/http://swcdn.apple.com/content/downloads/56/34/041-88557/sm2i2d444udypgi46bsi57h6aa0cq4pmm4/SafariTechPreviewElCapitan.pkg I forgot SPN does this shit |
19:51:19 | <nicolas17> | dunno if it's SPN's fault or swcdn's fault when it gets SPN's request |
19:52:20 | <@JAA> | Huh |
19:54:30 | <pokechu22> | https://web.archive.org/cdx/search/cdx?url=http%3A%2F%2Fswcdn.apple.com%2Fcontent%2Fdownloads%2F56%2F34%2F041-88557%2Fsm2i2d444udypgi46bsi57h6aa0cq4pmm4%2F&matchType=prefix&output=json&fl=original%2Cmimetype%2Ctimestamp%2Cstatuscode%2Clength&limit=10000 shows 2 captured 400s |
19:56:01 | <pokechu22> | also possibly useful: https://web.archive.org/cdx/search/cdx?url=http%3A%2F%2Fswcdn.apple.com%2F&matchType=prefix&collapse=urlkey&output=json&fl=original%2Cmimetype%2Ctimestamp%2Cendtimestamp%2Cgroupcount%2Cuniqcount&filter=!statuscode%3A%5B45%5D..&limit=10000&showResumeKey=true&resumeKey= + https://archive.org/developers/wayback-cdx-server.html (not that many successful |
19:56:03 | <pokechu22> | captures it seems) |
20:14:30 | <@arkiver> | JAA: so on shownumpages, i believe you need to set a pageSize parameter for it to function |
20:14:36 | <@arkiver> | and others ^ |
20:14:46 | <@JAA> | Well, that's new. |
20:14:50 | <@arkiver> | yeah |
20:14:54 | <@arkiver> | or well not sure |
20:17:25 | <@JAA> | https://web.archive.org/cdx/search/cdx?url=wiki.archiveteam.org&collapse=urlkey&fl=original&matchType=domain&pageSize=1000&showNumPages=true returns - (a dash). |
20:18:36 | <@JAA> | The pageSize value doesn't seem to matter. |
20:19:36 | <@arkiver> | well hm |
20:32:34 | | qwertyasdfuiopghjkl quits [Quit: Client closed] |
20:56:01 | <@arkiver> | i hope to have more on this soon |
21:06:51 | <@arkiver> | JAA: &fl=original may be the problem |
21:08:22 | <@arkiver> | without that i get `1` |
21:08:30 | <@JAA> | https://web.archive.org/cdx/search/cdx?url=wiki.archiveteam.org&collapse=urlkey&matchType=domain&pageSize=1000&showNumPages=true returns 1, https://web.archive.org/cdx/search/cdx?url=wiki.archiveteam.org&collapse=urlkey&fl=original&matchType=domain returns 286547 results... |
21:09:45 | <@JAA> | https://web.archive.org/cdx/search/cdx?url=wiki.archiveteam.org&collapse=urlkey&fl=original&matchType=domain&pageSize=1000&page=0 also returns the full result set (actually slightly more, 286641), completely ignoring the pageSize. |
21:10:10 | <@JAA> | (page=0 is the first page according to the documentation.) |
21:11:33 | <@arkiver> | well... more soon again |
21:21:47 | | DLoader quits [Ping timeout: 256 seconds] |
21:22:22 | | DLoader (DLoader) joins |
21:29:45 | <nicolas17> | does /save/_embed/ return an html form or does it actually immediately trigger saving? |
21:29:55 | | Nemo_bis quits [Ping timeout: 255 seconds] |
21:30:08 | | Nemo_bis (Nemo_bis) joins |
21:30:25 | <@JAA> | It immediately saves (or sometimes goes into an infinite loop because caching is hard). |
21:31:16 | <nicolas17> | so "wget https://web.archive.org/web/2/$url" would cause a save if that url isn't captured? oof |
21:31:48 | <@JAA> | It normally shouldn't. You should get the 404 with 'not saved yet' and possibly 'available on the web'. |
21:33:06 | <@JAA> | Appending im_ to the timestamp probably triggers a /save/_embed/ though. |
21:33:18 | <nicolas17> | maybe wget user-agent causes im_ behavior? |
21:33:22 | <pokechu22> | Yeah, the normal HTML one doesn't trigger a save, but various embed ones do |
21:33:36 | <pokechu22> | But /web/2/ _will_ try to fetch the page to see if it exists |
21:33:49 | <nicolas17> | I ran "wget https://web.archive.org/web/2/https://swcdn.apple.com/content/downloads/05/59/062-54078-A_B16PVEE8JJ/7xpklgkn5wzlb7fzavwy7c8oeir2gjs5uf/SafariTechPreview.pkg", it redirected to "https://web.archive.org/save/_embed/https://swcdn.apple.com/content/downloads/05/59/062-54078-A_B16PVEE8JJ/7xpklgkn5wzlb7fzavwy7c8oeir2gjs5uf/SafariTechPreview.pkg" and got stuck for a minute until I Ctrl-C'd it |
21:33:55 | <pokechu22> | not sure if that's a HEAD or a full GET or what |
21:33:56 | <nicolas17> | now it takes me to the failed capture saying "Cycle Prohibited" |
21:34:05 | <pokechu22> | OK, that's different from what I've seen, interesting |
21:34:20 | <@JAA> | Yeah, not something I've seen either. |
21:34:45 | <pokechu22> | It might be that wget gets served the original form instead of the iframe with the timeline and stuff, and doing that triggers saves |
21:34:46 | <nicolas17> | maybe browser UAs redirect to the web form but wget redirects to _embed |
21:53:11 | <steering> | oof, the google link to WBM is ... not very useful? |
21:56:50 | <steering> | https://imgur.com/a/Hy2lbiq |
21:57:03 | <steering> | maybe it's more prominent sometimes... |
22:09:12 | | KoalaBear84 quits [Read error: Connection reset by peer] |
22:12:13 | <imer> | thats quite hidden |
22:54:59 | <nicolas17> | can we block swcdn.apple.com from being archived in SPN? it always ends up in that weird proxy error |
22:57:54 | <nicolas17> | worst is when there is already a valid capture |
22:58:11 | <nicolas17> | and the "latest capture" ends up being the proxy error instead |
22:58:32 | | corentin joins |