01:15:16rellem quits [Read error: Connection reset by peer]
01:15:37rellem joins
01:34:11M--mlv|m joins
01:35:55superkuh_ joins
01:35:59superkuh quits [Ping timeout: 252 seconds]
01:44:49Arcorann (Arcorann) joins
02:00:09BlueMaxima_ joins
02:01:58TastyWiener95 (TastyWiener95) joins
02:04:02BlueMaxima quits [Ping timeout: 252 seconds]
02:22:44rellem quits [Ping timeout: 252 seconds]
02:26:02superkuh_ quits [Ping timeout: 252 seconds]
02:33:09nosamu joins
02:34:50Meli quits [Ping timeout: 252 seconds]
02:35:14icedice quits [Client Quit]
02:36:19Meli (Meli) joins
03:00:08HP_Archivist quits [Ping timeout: 252 seconds]
03:08:46sec^nd quits [Ping timeout: 245 seconds]
03:19:35sec^nd (second) joins
03:21:58wickedplayer494 quits [Ping timeout: 252 seconds]
03:36:03sonick quits [Client Quit]
03:36:26Meli quits [Ping timeout: 252 seconds]
03:39:11Meli (Meli) joins
03:40:02wickedplayer494 joins
03:57:23<PredatorIWD>Few quick questions: 1. Can I set the Warrior to 6 concurrent items and "ArchiveTeam’s Choice" project and not worry about that being too much and getting IP banned from some sites? Does the Warrior get limited by the central server based on the project, or not? 2. The Warrior worked a day on the imgur project and the VirtualBox .vdi grew from ~1GB to ~4GB+ in size; what exactly is being stored here? Thanks
03:59:53<@JAA>PredatorIWD: #warrior
04:08:30<flashfire42>https://server8.kiska.pw/uploads/9adb26713c2b52ba/image.png hmmmmmmmmmm
04:34:03superkuh joins
04:38:29wickedplayer494 quits [Ping timeout: 265 seconds]
04:40:23wickedplayer494 joins
05:01:37Hans5958 quits [Quit: Reconnecting]
05:01:45Hans5958 (Hans5958) joins
05:04:20BlueMaxima_ quits [Read error: Connection reset by peer]
05:35:31Meli quits [Ping timeout: 265 seconds]
05:39:28Meli (Meli) joins
06:18:15hitgrr8 joins
06:24:55<nicolas17>JAA: curl -sk -H 'Content-Type: application/json' -H 'Accept: application/json' https://gdmf.apple.com/v2/assets --data '{ "AssetAudience": "341f2a17-0024-46cd-968d-b4444ec3699f", "ClientVersion": 2, "AssetType": "com.apple.MobileAsset.SoftwareUpdate", "ProductVersion": "9.0", "BuildVersion": "20D47", "HWModelStr": "N141sAP", "ProductType": "Watch4,3" }'
06:25:44<nicolas17>it's a JWT, if you pipe it into "cut -d. -f2 | tr _- /+ | base64 -d" you get JSON
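nicolas17's pipe works, but base64url payloads usually omit the `=` padding, which GNU `base64 -d` complains about when the field length isn't a multiple of 4. A padded variant of the same idea (the sample token below is made up; real tokens come from the gdmf response):

```shell
#!/bin/sh
# Decode the payload (second dot-separated field) of a JWT.
# Sample token with payload {"a":1}; the signature part is a dummy.
token='eyJhbGciOiJIUzI1NiJ9.eyJhIjoxfQ.c2ln'
payload=$(printf '%s' "$token" | cut -d. -f2 | tr '_-' '/+')
# base64url drops '=' padding; restore it so base64 -d accepts the input
case $(( ${#payload} % 4 )) in
  2) payload="$payload==" ;;
  3) payload="$payload=" ;;
esac
printf '%s' "$payload" | base64 -d
```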
06:26:12<@JAA>Fun
06:26:39<nicolas17>and if you do it twice in a row, you get the same response, except for this:
06:26:41<nicolas17>- "PallasNonce": "C7949F9A-F7AB-4CC5-9225-750EA9765553",
06:26:42<nicolas17>+ "PallasNonce": "9744E86C-D73B-4389-9820-076C4B2FD72B",
06:27:42<nicolas17>which obviously changes the entire signature in the JWT too
06:28:53<@JAA>Mhm
06:30:32<Jake>disgusting
06:31:38<nicolas17>I could archive it the way I'm archiving other Apple stuff: decode the JWT, prettify the JSON, remove PallasNonce entirely, commit it to a git repo, get nice readable diffs when it changes
06:31:46<nicolas17>but that's not very data-preservation-y
06:34:23<nicolas17>https://gitlab.com/nicolas17/mesu-archive/-/commit/982e83ceb0b33c787d1f09dfff929817b2a7ab8e eg. here I'm masking out bag-expiry-timestamp so the file doesn't change on every single update
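The normalize-before-commit step can be sketched with `jq` (file names and the sample payload here are invented; nicolas17's actual pipeline may differ):

```shell
#!/bin/sh
# assets.raw.json stands in for the base64-decoded JWT payload from gdmf.
printf '%s' '{"PallasNonce":"C7949F9A","Assets":[{"Build":"20D47"}]}' > assets.raw.json
# -S sorts keys for stable output; del() drops the per-request nonce,
# so a fresh fetch only changes the file when the assets actually changed
jq -S 'del(.PallasNonce)' assets.raw.json > assets.json
```

A `git add`/`git commit` after this then produces no commit when only the nonce and signature rotated.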
06:38:48<nicolas17>Jake: XML in base64 in XML is more disgusting imo http://init.ess.apple.com/WebObjects/VCInit.woa/wa/getBag?ix=4
06:38:55<Jake>ewwwww
06:39:34<@JAA>lol
06:40:28<nicolas17>here too I extract the inner XML and patch out some stuff to avoid causing a git commit every single time I check for updates
06:40:46<nicolas17>for example there's something that seems to be the version number of the server
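The inner-XML extraction can be sketched with standard tools. The document shape below is invented (the real getBag response presumably differs), so treat the `<data>` element as a stand-in for wherever the base64 blob actually lives:

```shell
#!/bin/sh
# Hypothetical outer XML wrapping a base64-encoded inner XML document.
cat > bag.xml <<'EOF'
<?xml version="1.0"?>
<plist version="1.0"><dict>
  <key>bag</key>
  <data>
  PGlubmVyLz4=
  </data>
</dict></plist>
EOF
# grab the <data> block, drop the tag lines and whitespace, decode the rest
sed -n '/<data>/,/<\/data>/p' bag.xml | grep -v data | tr -d ' \n\t' | base64 -d
```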
06:40:56<@JAA>How often are you fetching these, and what's the total size per fetch?
06:41:30<nicolas17>and it seems like they sometimes update half the servers, and the version changes back and forth as load balancing sends me to one or another server
06:43:40<nicolas17>JAA: once an hour, and the largest chunk is probably the XMLs in mesu.apple.com which add up to 160MB; revisit records would work great for those though
06:43:55<nicolas17>and there's no magic to it
06:44:07<nicolas17>https://gitlab.com/nicolas17/mesu-archive/-/raw/master/urls.txt
06:44:20<nicolas17>just static files
06:44:21<@JAA>Mhm
06:44:30<@JAA>How much unique data per fetch due to nonces etc.?
06:46:01<nicolas17>yeah I mean the mesu.apple.com stuff has no changing nonces or anything, it's just static files and they change when they actually change :)
06:46:31<@JAA>Yeah, but that's not all of what you fetch hourly, which is why I ask.
06:46:32<nicolas17>the other ad-hoc crap I fetch, I don't know how much changes...
06:47:05<nicolas17>I guess I could actually try it in wget-at
06:47:20<nicolas17>and see how big it gets
06:50:18<nicolas17>if wget-at downloads two URLs and the responses are identical, it can do a revisit record, but how does it work when you download one URL, then download it again later? I'd need to pass it the previous WARC I guess?
06:51:09<@JAA>You'd need to enable writing a CDX and then pass that, I believe.
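The CDX-based flow JAA describes might look like this (file names are made up; `--warc-cdx` and `--warc-dedup` exist in upstream wget too, but per the discussion below, wget-at/wget-lua is the one to actually use, and `--truncate-output` is a wget-at addition):

```shell
# First pass: write the WARC plus a CDX index of everything captured.
wget --warc-file=mesu-1 --warc-cdx --truncate-output -O tmp.out -i urls.txt

# Later pass: feed the previous CDX back in; responses whose payload digest
# matches an indexed record become revisit records instead of full records.
wget --warc-file=mesu-2 --warc-cdx --warc-dedup=mesu-1.cdx --truncate-output -O tmp.out -i urls.txt
```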
06:51:22<nicolas17>oh cdx, cool
06:52:01<nicolas17>upstream wget writes broken WARCs and I shouldn't even try it and I should go straight to github.com/ArchiveTeam/wget-lua right?
06:53:04<@JAA>Yeah. Upstream wget also doesn't dedupe across different URLs, and just days ago, someone reported a bug in the deduper as well that leads to stuff not getting deduped.
06:53:24<@JAA>Also, all of this is dangerously on-topic for this channel.
06:53:41<nicolas17>lol, I just wanted to get it out of #imgone :P
06:54:41<nicolas17>as for gdmf.apple.com, I started looking into WARCs because it looked like the best way to archive stuff that has a *request* body
06:56:00<nicolas17>none of this implies I intend to submit WARCs to WBM
06:56:58<@JAA>Sure, but writing proper WARCs rather than some manipulated mess is still the right way to go. :-)
06:57:56<nicolas17>the main goal was knowing when things change and knowing what changed, so yeah I do stuff like sorting files to make the diffs readable
06:58:01<andrew>alternatively, you could apply a specialized compression algorithm on the WARCs :P
06:58:10<nicolas17>*also* storing WARCs may be a good idea though
06:59:09sonick (sonick) joins
06:59:40<@JAA>Yeah, I have some tools as well that consume API stuff. I write WARCs to get a record of the raw data sent and received, and then I do whatever with it. If something goes wrong, I can always go back to the WARC to figure out what happened.
07:00:31<nicolas17>also I had a cursed idea but I need to stop adding more cursed ideas to my todo list
07:00:44<nicolas17>add support for WARCs to Wireshark :P
07:01:01<andrew>but why
07:01:40<@JAA>I'd rather see a pcap + TLS key to WARC conversion tool.
07:03:26<nicolas17>huh I thought "wget --warc-file=foo.warc -O wget-temp" would repeatedly overwrite that temp file, and leave all the data in the warc... but it seems it repeatedly *appended* to the temp file which is now 150MB :|
07:04:20<@JAA>Yup, there's an option for truncation.
07:07:04<nicolas17>--truncate-output? looks like that's an -at addition
07:07:36<@JAA>Is there another wget?
07:07:39<@JAA>;-)
07:08:40<nicolas17>damn that XML is repetitive
07:09:01<@JAA>--delete-after isn't exactly the same thing but exists in upstream.
07:09:18<@JAA>Repetitiveness is perfect for compression. :-)
07:09:22<nicolas17>$ wc -c mesu.warc.gz
07:09:23<nicolas17>9322787 mesu.warc.gz
07:09:25<nicolas17>$ zcat mesu.warc.gz | wc -c
07:09:26<nicolas17>158553124
07:09:36<@JAA>Heh
07:10:04<@JAA>zstd with a custom dict might well get it much smaller still.
07:10:30<@JAA>There's a dict trainer somewhere on our GitHub org I believe.
07:10:56<Jake>https://github.com/ArchiveTeam/zstd-dictionary-trainer
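The dictionary idea in miniature (the generated samples are synthetic stand-ins for the mesu.apple.com XMLs, and the dictionary size is chosen arbitrarily small for the toy corpus):

```shell
#!/bin/sh
# Train a zstd dictionary on many small, similar XML files, then compress
# one of them with and without it for comparison.
mkdir -p samples
i=1
while [ "$i" -le 200 ]; do
  {
    printf '<plist version="1.0"><dict>'
    j=1
    while [ "$j" -le 30 ]; do
      printf '<key>Build</key><string>20D%02d</string>' "$j"
      j=$((j + 1))
    done
    printf '<key>Serial</key><string>%d</string></dict></plist>' "$i"
  } > "samples/$i.xml"
  i=$((i + 1))
done
zstd -q --train samples/*.xml -o mesu.dict --maxdict=4096
zstd -q -f -D mesu.dict samples/1.xml -o with-dict.zst
zstd -q -f samples/1.xml -o without-dict.zst
```

Since the shared structure lives in the dictionary, the per-file `.zst` should come out well below the no-dict size on a corpus like this.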
07:12:23<nicolas17>also yes I just tried a deduplicating fetch with upstream wget and it failed to dedup some
08:29:02Explo joins
11:31:51monoxane quits [Remote host closed the connection]
11:36:16HP_Archivist (HP_Archivist) joins
12:12:30Meroje joins
12:37:52monoxane (monoxane) joins
12:41:23monoxane quits [Remote host closed the connection]
12:46:53monoxane (monoxane) joins
12:49:25<masterX244>interesting that commoncrawl managed to stumble over some of my nplusc.de links, too...
12:57:10HackMii quits [Remote host closed the connection]
12:57:34HackMii (hacktheplanet) joins
13:16:55Meroje quits [Client Quit]
13:17:14Meroje joins
13:17:29michaelblob_ quits [Read error: Connection reset by peer]
13:40:02<Hans5958>What does -bs mean? bulls**t?
13:41:58<@kaz>probably
13:53:12AlsoHP_Archivist joins
13:57:13HP_Archivist quits [Ping timeout: 265 seconds]
14:00:39Meroje quits [Changing host]
14:00:39Meroje (Meroje) joins
14:08:24Arcorann quits [Ping timeout: 252 seconds]
15:06:42zhongfu quits [Ping timeout: 252 seconds]
15:14:46AlsoHP_Archivist quits [Ping timeout: 252 seconds]
15:18:12rellem joins
15:20:12rellem quits [Read error: Connection reset by peer]
15:28:33chrismeller (chrismeller) joins
15:29:02<immibis>Hans5958: maybe bikeshed
15:29:10<immibis>is there a python library that works like urllib but also saves warcs?
15:51:41HackMii quits [Ping timeout: 245 seconds]
15:52:48HackMii (hacktheplanet) joins
16:01:39icedice2 joins
16:01:43icedice (icedice) joins
16:05:18icedice2 quits [Client Quit]
16:28:50andrew quits [Quit: ]
16:29:11andrew (andrew) joins
16:38:58chrismeller quits [Client Quit]
16:41:01chrismeller (chrismeller) joins
16:41:34chrismeller quits [Client Quit]
16:42:16chrismeller (chrismeller) joins
16:45:15<spirit>arkiver: ++! Thanks for your post :)
17:17:01lukash joins
17:19:47nicolas17 quits [Ping timeout: 252 seconds]
17:24:20nicolas17 joins
17:45:41fishingforsoup joins
17:47:15spirit quits [Quit: Leaving]
17:55:22HP_Archivist (HP_Archivist) joins
18:31:39spirit joins
20:27:42pikablu joins
20:37:53hitgrr8 quits [Client Quit]
20:42:47sonick quits [Client Quit]
21:39:55thedudedude joins
21:56:32BlueMaxima joins
22:45:52Jake quits [Quit: Leaving for a bit!]
22:46:14Jake (Jake) joins
22:57:28Danielle joins
23:00:58Dalister quits [Ping timeout: 265 seconds]
23:11:30sonick (sonick) joins