01:15:16rellem quits [Read error: Connection reset by peer]
01:15:37rellem joins
01:34:11M--mlv|m joins
01:35:55superkuh_ joins
01:35:59superkuh quits [Ping timeout: 252 seconds]
01:44:49Arcorann (Arcorann) joins
02:00:09BlueMaxima_ joins
02:01:58TastyWiener95 (TastyWiener95) joins
02:04:02BlueMaxima quits [Ping timeout: 252 seconds]
02:22:44rellem quits [Ping timeout: 252 seconds]
02:26:02superkuh_ quits [Ping timeout: 252 seconds]
02:33:09nosamu joins
02:34:50Meli quits [Ping timeout: 252 seconds]
02:35:14icedice quits [Client Quit]
02:36:19Meli (Meli) joins
03:00:08HP_Archivist quits [Ping timeout: 252 seconds]
03:08:46sec^nd quits [Ping timeout: 245 seconds]
03:19:35sec^nd (second) joins
03:21:58wickedplayer494 quits [Ping timeout: 252 seconds]
03:36:03sonick quits [Client Quit]
03:36:26Meli quits [Ping timeout: 252 seconds]
03:39:11Meli (Meli) joins
03:40:02wickedplayer494 joins
03:57:23<PredatorIWD>Few quick questions: 1. Can I set the Warrior to 6 concurrent items and "ArchiveTeam’s Choice" project and not worry about that being too much and getting IP banned from some sites? Does the Warrior get limited by the central server based on the project, or not? 2. The Warrior worked a day on the imgur project and the VirtualBox .vdi grew from ~1GB to ~4GB+ in size; what exactly is being stored here? Thanks
03:59:53<@JAA>PredatorIWD: #warrior
04:08:30<flashfire42>https://server8.kiska.pw/uploads/9adb26713c2b52ba/image.png hmmmmmmmmmm
04:34:03superkuh joins
04:38:29wickedplayer494 quits [Ping timeout: 265 seconds]
04:40:23wickedplayer494 joins
05:01:37Hans5958 quits [Quit: Reconnecting]
05:01:45Hans5958 (Hans5958) joins
05:04:20BlueMaxima_ quits [Read error: Connection reset by peer]
05:35:31Meli quits [Ping timeout: 265 seconds]
05:39:28Meli (Meli) joins
06:18:15hitgrr8 joins
06:24:55<nicolas17>JAA: curl -sk -H 'Content-Type: application/json' -H 'Accept: application/json' https://gdmf.apple.com/v2/assets --data '{ "AssetAudience": "341f2a17-0024-46cd-968d-b4444ec3699f", "ClientVersion": 2, "AssetType": "com.apple.MobileAsset.SoftwareUpdate", "ProductVersion": "9.0", "BuildVersion": "20D47", "HWModelStr": "N141sAP", "ProductType": "Watch4,3" }'
06:25:44<nicolas17>it's a JWT, if you pipe it into "cut -d. -f2 | tr _- /+ | base64 -d" you get JSON
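nicolas17's pipe works, but base64url payloads usually omit the `=` padding, which GNU `base64 -d` complains about when the field length isn't a multiple of 4. A padded variant of the same idea (the sample token below is made up; real tokens come from the gdmf response):

```shell
#!/bin/sh
# Decode the payload (second dot-separated field) of a JWT.
# Sample token with payload {"a":1}; the signature part is a dummy.
token='eyJhbGciOiJIUzI1NiJ9.eyJhIjoxfQ.c2ln'
payload=$(printf '%s' "$token" | cut -d. -f2 | tr '_-' '/+')
# base64url drops '=' padding; restore it so base64 -d accepts the input
case $(( ${#payload} % 4 )) in
  2) payload="$payload==" ;;
  3) payload="$payload=" ;;
esac
printf '%s' "$payload" | base64 -d
```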
06:26:12<@JAA>Fun
06:26:39<nicolas17>and if you do it twice in a row, you get the same response, except for this:
06:26:41<nicolas17>- "PallasNonce": "C7949F9A-F7AB-4CC5-9225-750EA9765553",
06:26:42<nicolas17>+ "PallasNonce": "9744E86C-D73B-4389-9820-076C4B2FD72B",
06:27:42<nicolas17>which obviously changes the entire signature in the JWT too
06:28:53<@JAA>Mhm
06:30:32<Jake>disgusting
06:31:38<nicolas17>I could archive it the way I'm archiving other Apple stuff: decode the JWT, prettify the JSON, remove PallasNonce entirely, commit it to a git repo, get nice readable diffs when it changes
06:31:46<nicolas17>but that's not very data-preservation-y
06:34:23<nicolas17>https://gitlab.com/nicolas17/mesu-archive/-/commit/982e83ceb0b33c787d1f09dfff929817b2a7ab8e eg. here I'm masking out bag-expiry-timestamp so the file doesn't change on every single update
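The normalize-before-commit step can be sketched with `jq` (file names and the sample payload here are invented; nicolas17's actual pipeline may differ):

```shell
#!/bin/sh
# assets.raw.json stands in for the base64-decoded JWT payload from gdmf.
printf '%s' '{"PallasNonce":"C7949F9A","Assets":[{"Build":"20D47"}]}' > assets.raw.json
# -S sorts keys for stable output; del() drops the per-request nonce,
# so a fresh fetch only changes the file when the assets actually changed
jq -S 'del(.PallasNonce)' assets.raw.json > assets.json
```

A `git add`/`git commit` after this then produces no commit when only the nonce and signature rotated.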
06:38:48<nicolas17>Jake: XML in base64 in XML is more disgusting imo http://init.ess.apple.com/WebObjects/VCInit.woa/wa/getBag?ix=4
06:38:55<Jake>ewwwww
06:39:34<@JAA>lol
06:40:28<nicolas17>here too I extract the inner XML and patch out some stuff to avoid causing a git commit every single time I check for updates
06:40:46<nicolas17>for example there's something that seems to be the version number of the server
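The inner-XML extraction can be sketched with standard tools. The document shape below is invented (the real getBag response presumably differs), so treat the `<data>` element as a stand-in for wherever the base64 blob actually lives:

```shell
#!/bin/sh
# Hypothetical outer XML wrapping a base64-encoded inner XML document.
cat > bag.xml <<'EOF'
<?xml version="1.0"?>
<plist version="1.0"><dict>
  <key>bag</key>
  <data>
  PGlubmVyLz4=
  </data>
</dict></plist>
EOF
# grab the <data> block, drop the tag lines and whitespace, decode the rest
sed -n '/<data>/,/<\/data>/p' bag.xml | grep -v data | tr -d ' \n\t' | base64 -d
```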
06:40:56<@JAA>How often are you fetching these, and what's the total size per fetch?
06:41:30<nicolas17>and it seems like they sometimes update half the servers, and the version changes back and forth as load balancing sends me to one or another server
06:43:40<nicolas17>JAA: once an hour, and the largest chunk is probably the XMLs in mesu.apple.com which add up to 160MB; revisit records would work great for those though
06:43:55<nicolas17>and there's no magic to it
06:44:07<nicolas17>https://gitlab.com/nicolas17/mesu-archive/-/raw/master/urls.txt
06:44:20<nicolas17>just static files
06:44:21<@JAA>Mhm
06:44:30<@JAA>How much unique data per fetch due to nonces etc.?
06:46:01<nicolas17>yeah I mean the mesu.apple.com stuff has no changing nonces or anything, it's just static files and they change when they actually change :)
06:46:31<@JAA>Yeah, but that's not all of what you fetch hourly, which is why I ask.
06:46:32<nicolas17>the other ad-hoc crap I fetch, I don't know how much changes...
06:47:05<nicolas17>I guess I could actually try it in wget-at
06:47:20<nicolas17>and see how big it gets
06:50:18<nicolas17>if wget-at downloads two URLs and the responses are identical, it can do a revisit record, but how does it work when you download one URL, then download it again later? I'd need to pass it the previous WARC I guess?
06:51:09<@JAA>You'd need to enable writing a CDX and then pass that, I believe.
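The CDX-based flow JAA describes might look like this (file names are made up; `--warc-cdx` and `--warc-dedup` exist in upstream wget too, but per the discussion below, wget-at/wget-lua is the one to actually use, and `--truncate-output` is a wget-at addition):

```shell
# First pass: write the WARC plus a CDX index of everything captured.
wget --warc-file=mesu-1 --warc-cdx --truncate-output -O tmp.out -i urls.txt

# Later pass: feed the previous CDX back in; responses whose payload digest
# matches an indexed record become revisit records instead of full records.
wget --warc-file=mesu-2 --warc-cdx --warc-dedup=mesu-1.cdx --truncate-output -O tmp.out -i urls.txt
```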
06:51:22<nicolas17>oh cdx, cool
06:52:01<nicolas17>upstream wget writes broken WARCs and I shouldn't even try it and I should go straight to github.com/ArchiveTeam/wget-lua right?
06:53:04<@JAA>Yeah. Upstream wget also doesn't dedupe across different URLs, and just days ago, someone reported a bug in the deduper as well that leads to stuff not getting deduped.
06:53:24<@JAA>Also, all of this is dangerously on-topic for this channel.
06:53:41<nicolas17>lol, I just wanted to get it out of #imgone :P
06:54:41<nicolas17>as for gdmf.apple.com, I started looking into WARCs because it looked like the best way to archive stuff that has a *request* body
06:56:00<nicolas17>none of this implies I intend to submit WARCs to WBM
06:56:58<@JAA>Sure, but writing proper WARCs rather than some manipulated mess is still the right way to go. :-)
06:57:56<nicolas17>the main goal was knowing when things change and knowing what changed, so yeah I do stuff like sorting files to make the diffs readable
06:58:01<andrew>alternatively, you could apply a specialized compression algorithm on the WARCs :P
06:58:10<nicolas17>*also* storing WARCs may be a good idea though
06:59:09sonick (sonick) joins
06:59:40<@JAA>Yeah, I have some tools as well that consume API stuff. I write WARCs to get a record of the raw data sent and received, and then I do whatever with it. If something goes wrong, I can always go back to the WARC to figure out what happened.
07:00:31<nicolas17>also I had a cursed idea but I need to stop adding more cursed ideas to my todo list
07:00:44<nicolas17>add support for WARCs to Wireshark :P
07:01:01<andrew>but why
07:01:40<@JAA>I'd rather see a pcap + TLS key to WARC conversion tool.
07:03:26<nicolas17>huh I thought "wget --warc-file=foo.warc -O wget-temp" would repeatedly overwrite that temp file, and leave all the data in the warc... but it seems it repeatedly *appended* to the temp file which is now 150MB :|
07:04:20<@JAA>Yup, there's an option for truncation.
07:07:04<nicolas17>--truncate-output? looks like that's an -at addition
07:07:36<@JAA>Is there another wget?
07:07:39<@JAA>;-)
07:08:40<nicolas17>damn that XML is repetitive
07:09:01<@JAA>--delete-after isn't exactly the same thing but exists in upstream.
07:09:18<@JAA>Repetitiveness is perfect for compression. :-)
07:09:22<nicolas17>$ wc -c mesu.warc.gz
07:09:23<nicolas17>9322787 mesu.warc.gz
07:09:25<nicolas17>$ zcat mesu.warc.gz | wc -c
07:09:26<nicolas17>158553124
07:09:36<@JAA>Heh
07:10:04<@JAA>zstd with a custom dict might well get it much smaller still.
07:10:30<@JAA>There's a dict trainer somewhere on our GitHub org I believe.
07:10:56<Jake>https://github.com/ArchiveTeam/zstd-dictionary-trainer
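The dictionary idea in miniature (the generated samples are synthetic stand-ins for the mesu.apple.com XMLs, and the dictionary size is chosen arbitrarily small for the toy corpus):

```shell
#!/bin/sh
# Train a zstd dictionary on many small, similar XML files, then compress
# one of them with and without it for comparison.
mkdir -p samples
i=1
while [ "$i" -le 200 ]; do
  {
    printf '<plist version="1.0"><dict>'
    j=1
    while [ "$j" -le 30 ]; do
      printf '<key>Build</key><string>20D%02d</string>' "$j"
      j=$((j + 1))
    done
    printf '<key>Serial</key><string>%d</string></dict></plist>' "$i"
  } > "samples/$i.xml"
  i=$((i + 1))
done
zstd -q --train samples/*.xml -o mesu.dict --maxdict=4096
zstd -q -f -D mesu.dict samples/1.xml -o with-dict.zst
zstd -q -f samples/1.xml -o without-dict.zst
```

Since the shared structure lives in the dictionary, the per-file `.zst` should come out well below the no-dict size on a corpus like this.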
07:12:23<nicolas17>also yes I just tried a deduplicating fetch with upstream wget and it failed to dedup some
08:29:02Explo joins
11:31:51monoxane quits [Remote host closed the connection]
11:36:16HP_Archivist (HP_Archivist) joins
12:12:30Meroje joins
12:37:52monoxane (monoxane) joins
12:41:23monoxane quits [Remote host closed the connection]
12:46:53monoxane (monoxane) joins
12:49:25<masterX244>interesting that commoncrawl managed to stumble over some of my nplusc.de links, too...
12:57:10HackMii quits [Remote host closed the connection]
12:57:34HackMii (hacktheplanet) joins
13:16:55Meroje quits [Client Quit]
13:17:14Meroje joins
13:17:29michaelblob_ quits [Read error: Connection reset by peer]
13:40:02<Hans5958>What does -bs mean? bulls**t?
13:41:58<@kaz>probably
13:53:12AlsoHP_Archivist joins
13:57:13HP_Archivist quits [Ping timeout: 265 seconds]
14:00:39Meroje quits [Changing host]
14:00:39Meroje (Meroje) joins
14:08:24Arcorann quits [Ping timeout: 252 seconds]
15:06:42zhongfu quits [Ping timeout: 252 seconds]
15:14:46AlsoHP_Archivist quits [Ping timeout: 252 seconds]
15:18:12rellem joins
15:20:12rellem quits [Read error: Connection reset by peer]
15:28:33chrismeller (chrismeller) joins
15:29:02<immibis>Hans5958: maybe bikeshed
15:29:10<immibis>is there a python library that works like urllib but also saves warcs?
15:51:41HackMii quits [Ping timeout: 245 seconds]
15:52:48HackMii (hacktheplanet) joins
16:01:39icedice2 joins
16:01:43icedice (icedice) joins
16:05:18icedice2 quits [Client Quit]
16:28:50andrew quits [Quit: ]
16:29:11andrew (andrew) joins
16:38:58chrismeller quits [Client Quit]
16:41:01chrismeller (chrismeller) joins
16:41:34chrismeller quits [Client Quit]
16:42:16chrismeller (chrismeller) joins
16:45:15<spirit>arkiver: ++! Thanks for your post :)
17:17:01lukash joins
17:19:47nicolas17 quits [Ping timeout: 252 seconds]
17:24:20nicolas17 joins
17:45:41fishingforsoup joins
17:47:15spirit quits [Quit: Leaving]
17:55:22HP_Archivist (HP_Archivist) joins
18:31:39spirit joins
20:27:42pikablu joins
20:37:53hitgrr8 quits [Client Quit]
20:42:47sonick quits [Client Quit]
21:39:55thedudedude joins
21:56:32BlueMaxima joins
22:45:52Jake quits [Quit: Leaving for a bit!]
22:46:14Jake (Jake) joins
22:57:28Danielle joins
23:00:58Dalister quits [Ping timeout: 265 seconds]
23:11:30sonick (sonick) joins