| 01:15:16 | | rellem quits [Read error: Connection reset by peer] |
| 01:15:37 | | rellem joins |
| 01:34:11 | | M--mlv|m joins |
| 01:35:55 | | superkuh_ joins |
| 01:35:59 | | superkuh quits [Ping timeout: 252 seconds] |
| 01:44:49 | | Arcorann (Arcorann) joins |
| 02:00:09 | | BlueMaxima_ joins |
| 02:01:58 | | TastyWiener95 (TastyWiener95) joins |
| 02:04:02 | | BlueMaxima quits [Ping timeout: 252 seconds] |
| 02:22:44 | | rellem quits [Ping timeout: 252 seconds] |
| 02:26:02 | | superkuh_ quits [Ping timeout: 252 seconds] |
| 02:33:09 | | nosamu joins |
| 02:34:50 | | Meli quits [Ping timeout: 252 seconds] |
| 02:35:14 | | icedice quits [Client Quit] |
| 02:36:19 | | Meli (Meli) joins |
| 03:00:08 | | HP_Archivist quits [Ping timeout: 252 seconds] |
| 03:08:46 | | sec^nd quits [Ping timeout: 245 seconds] |
| 03:19:35 | | sec^nd (second) joins |
| 03:21:58 | | wickedplayer494 quits [Ping timeout: 252 seconds] |
| 03:36:03 | | sonick quits [Client Quit] |
| 03:36:26 | | Meli quits [Ping timeout: 252 seconds] |
| 03:39:11 | | Meli (Meli) joins |
| 03:40:02 | | wickedplayer494 joins |
| 03:57:23 | <PredatorIWD> | Few quick questions: 1. Can I set the Warrior to 6 concurrent items and the "ArchiveTeam’s Choice" project without worrying that it's too much and getting IP-banned from some sites? Does the warrior get limited by the central server based on the project, or something else? 2. The Warrior worked a day on the imgur project and the VirtualBox .vdi grew from ~1GB to ~4GB+ in size; what exactly is being stored here? Thanks |
| 03:59:53 | <@JAA> | PredatorIWD: #warrior |
| 04:08:30 | <flashfire42> | https://server8.kiska.pw/uploads/9adb26713c2b52ba/image.png hmmmmmmmmmm |
| 04:34:03 | | superkuh joins |
| 04:38:29 | | wickedplayer494 quits [Ping timeout: 265 seconds] |
| 04:40:23 | | wickedplayer494 joins |
| 05:01:37 | | Hans5958 quits [Quit: Reconnecting] |
| 05:01:45 | | Hans5958 (Hans5958) joins |
| 05:04:20 | | BlueMaxima_ quits [Read error: Connection reset by peer] |
| 05:35:31 | | Meli quits [Ping timeout: 265 seconds] |
| 05:39:28 | | Meli (Meli) joins |
| 06:18:15 | | hitgrr8 joins |
| 06:24:55 | <nicolas17> | JAA: curl -sk -H 'Content-Type: application/json' -H 'Accept: application/json' https://gdmf.apple.com/v2/assets --data '{ "AssetAudience": "341f2a17-0024-46cd-968d-b4444ec3699f", "ClientVersion": 2, "AssetType": "com.apple.MobileAsset.SoftwareUpdate", "ProductVersion": "9.0", "BuildVersion": "20D47", "HWModelStr": "N141sAP", "ProductType": "Watch4,3" }' |
| 06:25:44 | <nicolas17> | it's a JWT, if you pipe it into "cut -d. -f2 | tr _- /+ | base64 -d" you get JSON |
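nicolas17's shell pipe converts the JWT's base64url alphabet back to standard base64 before decoding. The same unverified payload decode can be sketched in Python; the padding fix-up is needed because JWT segments strip the trailing `=` (toy token below, not a real Apple response):

```python
import base64
import json

def jwt_payload(token: str) -> dict:
    """Decode the payload (second segment) of a JWT without verifying it."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url-encoded with padding stripped; re-add it.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# Demo with a toy header.payload.signature token:
demo = ".".join([
    base64.urlsafe_b64encode(b'{"alg":"none"}').decode().rstrip("="),
    base64.urlsafe_b64encode(b'{"PallasNonce":"C7949F9A"}').decode().rstrip("="),
    "sig",
])
print(jwt_payload(demo))  # {'PallasNonce': 'C7949F9A'}
```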
| 06:26:12 | <@JAA> | Fun |
| 06:26:39 | <nicolas17> | and if you do it twice in a row, you get the same response, except for this: |
| 06:26:41 | <nicolas17> | - "PallasNonce": "C7949F9A-F7AB-4CC5-9225-750EA9765553", |
| 06:26:42 | <nicolas17> | + "PallasNonce": "9744E86C-D73B-4389-9820-076C4B2FD72B", |
| 06:27:42 | <nicolas17> | which obviously changes the entire signature in the JWT too |
| 06:28:53 | <@JAA> | Mhm |
| 06:30:32 | <Jake> | disgusting |
| 06:31:38 | <nicolas17> | I could archive it the way I'm archiving other Apple stuff: decode the JWT, prettify the JSON, remove PallasNonce entirely, commit it to a git repo, get nice readable diffs when it changes |
| 06:31:46 | <nicolas17> | but that's not very data-preservation-y |
| 06:34:23 | <nicolas17> | https://gitlab.com/nicolas17/mesu-archive/-/commit/982e83ceb0b33c787d1f09dfff929817b2a7ab8e eg. here I'm masking out bag-expiry-timestamp so the file doesn't change on every single update |
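The normalize-then-commit workflow nicolas17 describes (drop the volatile field, pretty-print, let git show the diff) can be sketched in a few lines; key and function names here are mine, not from the mesu-archive repo:

```python
import json

VOLATILE_KEYS = {"PallasNonce"}  # fields that change on every single fetch

def normalize(raw: str) -> str:
    """Drop volatile fields and pretty-print with sorted keys for stable diffs."""
    doc = json.loads(raw)
    for key in VOLATILE_KEYS:
        doc.pop(key, None)
    return json.dumps(doc, indent=2, sort_keys=True) + "\n"

a = normalize('{"PallasNonce": "C7949F9A", "Build": "20D47"}')
b = normalize('{"PallasNonce": "9744E86C", "Build": "20D47"}')
print(a == b)  # True: only the nonce differed, so no git commit results
```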
| 06:38:48 | <nicolas17> | Jake: XML in base64 in XML is more disgusting imo http://init.ess.apple.com/WebObjects/VCInit.woa/wa/getBag?ix=4 |
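Unwrapping XML-in-base64-in-XML like that getBag response takes one extra decode hop. A toy sketch with a synthetic bag (the real thing is an Apple plist with more structure, but the shape is the same):

```python
import base64
import xml.etree.ElementTree as ET

# Outer XML carrying the inner XML as a base64 <data> blob, getBag-style.
outer = """<?xml version="1.0"?>
<plist version="1.0">
  <dict>
    <key>bag</key>
    <data>{}</data>
  </dict>
</plist>""".format(base64.b64encode(b"<bag><urls/></bag>").decode())

# Pull the base64 blob out of the outer document and decode the inner XML.
blob = ET.fromstring(outer).find(".//data").text
inner = base64.b64decode(blob)
print(inner.decode())  # <bag><urls/></bag>
```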
| 06:38:55 | <Jake> | ewwwww |
| 06:39:34 | <@JAA> | lol |
| 06:40:28 | <nicolas17> | here too I extract the inner XML and patch out some stuff to avoid causing a git commit every single time I check for updates |
| 06:40:46 | <nicolas17> | for example there's something that seems to be the version number of the server |
| 06:40:56 | <@JAA> | How often are you fetching these, and what's the total size per fetch? |
| 06:41:30 | <nicolas17> | and it seems like they sometimes update half the servers, and the version changes back and forth as load balancing sends me to one or another server |
| 06:43:40 | <nicolas17> | JAA: once an hour, and the largest chunk is probably the XMLs in mesu.apple.com which add up to 160MB; revisit records would work great for those though |
| 06:43:55 | <nicolas17> | and there's no magic to it |
| 06:44:07 | <nicolas17> | https://gitlab.com/nicolas17/mesu-archive/-/raw/master/urls.txt |
| 06:44:20 | <nicolas17> | just static files |
| 06:44:21 | <@JAA> | Mhm |
| 06:44:30 | <@JAA> | How much unique data per fetch due to nonces etc.? |
| 06:46:01 | <nicolas17> | yeah I mean the mesu.apple.com stuff has no changing nonces or anything, it's just static files and they change when they actually change :) |
| 06:46:31 | <@JAA> | Yeah, but that's not all of what you fetch hourly, which is why I ask. |
| 06:46:32 | <nicolas17> | the other ad-hoc crap I fetch, I don't know how much changes... |
| 06:47:05 | <nicolas17> | I guess I could actually try it in wget-at |
| 06:47:20 | <nicolas17> | and see how big it gets |
| 06:50:18 | <nicolas17> | if wget-at downloads two URLs and the responses are identical, it can do a revisit record, but how does it work when you download one URL, then download it again later? I'd need to pass it the previous WARC I guess? |
| 06:51:09 | <@JAA> | You'd need to enable writing a CDX and then pass that, I believe. |
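The CDX suggestion works because WARC dedup keys on the payload digest: a CDX from the earlier crawl tells the next run which digests are already stored, and matching responses become revisit records instead of full responses. A toy model of that bookkeeping (not wget-at's actual code; the dict stands in for a parsed CDX):

```python
import base64
import hashlib

def payload_digest(body: bytes) -> str:
    """SHA-1 payload digest in the base32 form WARC tooling uses."""
    return "sha1:" + base64.b32encode(hashlib.sha1(body).digest()).decode()

seen = {}  # digest -> URL of the earlier capture, as a CDX would provide

def record_type(url: str, body: bytes) -> str:
    """Emit a revisit record when the payload matches an earlier capture."""
    digest = payload_digest(body)
    if digest in seen:
        return "revisit"
    seen[digest] = url
    return "response"

print(record_type("https://mesu.apple.com/a.xml", b"<xml/>"))  # response
print(record_type("https://mesu.apple.com/a.xml", b"<xml/>"))  # revisit
```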
| 06:51:22 | <nicolas17> | oh cdx, cool |
| 06:52:01 | <nicolas17> | upstream wget writes broken WARCs and I shouldn't even try it and I should go straight to github.com/ArchiveTeam/wget-lua right? |
| 06:53:04 | <@JAA> | Yeah. Upstream wget also doesn't dedupe across different URLs, and just days ago, someone reported a bug in the deduper as well that leads to stuff not getting deduped. |
| 06:53:24 | <@JAA> | Also, all of this is dangerously on-topic for this channel. |
| 06:53:41 | <nicolas17> | lol, I just wanted to get it out of #imgone :P |
| 06:54:41 | <nicolas17> | as for gdmf.apple.com, I started looking into WARCs because it looked like the best way to archive stuff that has a *request* body |
| 06:56:00 | <nicolas17> | none of this implies I intend to submit WARCs to WBM |
| 06:56:58 | <@JAA> | Sure, but writing proper WARCs rather than some manipulated mess is still the right way to go. :-) |
| 06:57:56 | <nicolas17> | the main goal was knowing when things change and knowing what changed, so yeah I do stuff like sorting files to make the diffs readable |
| 06:58:01 | <andrew> | alternatively, you could apply a specialized compression algorithm on the WARCs :P |
| 06:58:10 | <nicolas17> | *also* storing WARCs may be a good idea though |
| 06:59:09 | | sonick (sonick) joins |
| 06:59:40 | <@JAA> | Yeah, I have some tools as well that consume API stuff. I write WARCs to get a record of the raw data sent and received, and then I do whatever with it. If something goes wrong, I can always go back to the WARC to figure out what happened. |
| 07:00:31 | <nicolas17> | also I had a cursed idea but I need to stop adding more cursed ideas to my todo list |
| 07:00:44 | <nicolas17> | add support for WARCs to Wireshark :P |
| 07:01:01 | <andrew> | but why |
| 07:01:40 | <@JAA> | I'd rather see a pcap + TLS key to WARC conversion tool. |
| 07:03:26 | <nicolas17> | huh I thought "wget --warc-file=foo.warc -O wget-temp" would repeatedly overwrite that temp file, and leave all the data in the warc... but it seems it repeatedly *appended* to the temp file which is now 150MB :| |
| 07:04:20 | <@JAA> | Yup, there's an option for truncation. |
| 07:07:04 | <nicolas17> | --truncate-output? looks like that's an -at addition |
| 07:07:36 | <@JAA> | Is there another wget? |
| 07:07:39 | <@JAA> | ;-) |
| 07:08:40 | <nicolas17> | damn that XML is repetitive |
| 07:09:01 | <@JAA> | --delete-after isn't exactly the same thing but exists in upstream. |
| 07:09:18 | <@JAA> | Repetitiveness is perfect for compression. :-) |
| 07:09:22 | <nicolas17> | $ wc -c mesu.warc.gz |
| 07:09:23 | <nicolas17> | 9322787 mesu.warc.gz |
| 07:09:25 | <nicolas17> | $ zcat mesu.warc.gz | wc -c |
| 07:09:26 | <nicolas17> | 158553124 |
| 07:09:36 | <@JAA> | Heh |
| 07:10:04 | <@JAA> | zstd with a custom dict might well get it much smaller still. |
| 07:10:30 | <@JAA> | There's a dict trainer somewhere on our GitHub org I believe. |
| 07:10:56 | <Jake> | https://github.com/ArchiveTeam/zstd-dictionary-trainer |
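The custom-dictionary idea can be illustrated with the stdlib's zlib, which supports preset dictionaries in the same spirit as a trained zstd dictionary: the compressor can back-reference bytes it never emitted, which pays off most on many small, similar documents (toy sketch; the trainer Jake linked builds real zstd dictionaries from sample data):

```python
import zlib

# One repetitive record standing in for an entry of the mesu.apple.com XML.
record = b"<asset><url>https://mesu.apple.com/x.xml</url></asset>\n"

# Compressing a single small, similar document on its own:
small = b"<asset><url>https://mesu.apple.com/y.xml</url></asset>"
plain = zlib.compress(small, 9)

# Seeding the compressor with a preset dictionary built from sample records:
comp = zlib.compressobj(level=9, zdict=record)
with_dict = comp.compress(small) + comp.flush()

print(len(plain), len(with_dict))  # the dictionary version is smaller
```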
| 07:12:23 | <nicolas17> | also yes I just tried a deduplicating fetch with upstream wget and it failed to dedup some |
| 08:29:02 | | Explo joins |
| 11:31:51 | | monoxane quits [Remote host closed the connection] |
| 11:36:16 | | HP_Archivist (HP_Archivist) joins |
| 12:12:30 | | Meroje joins |
| 12:37:52 | | monoxane (monoxane) joins |
| 12:41:23 | | monoxane quits [Remote host closed the connection] |
| 12:46:53 | | monoxane (monoxane) joins |
| 12:49:25 | <masterX244> | interesting that commoncrawl managed to stumble over some of my nplusc.de links, too... |
| 12:57:10 | | HackMii quits [Remote host closed the connection] |
| 12:57:34 | | HackMii (hacktheplanet) joins |
| 13:15:26 | | Meroje is now authenticated as Meroje |
| 13:16:55 | | Meroje quits [Client Quit] |
| 13:17:14 | | Meroje joins |
| 13:17:14 | | Meroje is now authenticated as Meroje |
| 13:17:29 | | michaelblob_ quits [Read error: Connection reset by peer] |
| 13:40:02 | <Hans5958> | What does -bs mean? bulls**t? |
| 13:41:58 | <@kaz> | probably |
| 13:53:12 | | AlsoHP_Archivist joins |
| 13:57:13 | | HP_Archivist quits [Ping timeout: 265 seconds] |
| 14:00:39 | | Meroje quits [Changing host] |
| 14:00:39 | | Meroje (Meroje) joins |
| 14:08:24 | | Arcorann quits [Ping timeout: 252 seconds] |
| 15:06:42 | | zhongfu quits [Ping timeout: 252 seconds] |
| 15:14:46 | | AlsoHP_Archivist quits [Ping timeout: 252 seconds] |
| 15:18:12 | | rellem joins |
| 15:20:12 | | rellem quits [Read error: Connection reset by peer] |
| 15:28:33 | | chrismeller (chrismeller) joins |
| 15:29:02 | <immibis> | Hans5958: maybe bikeshed |
| 15:29:10 | <immibis> | is there a python library that works like urllib but also saves warcs? |
| 15:51:41 | | HackMii quits [Ping timeout: 245 seconds] |
| 15:52:48 | | HackMii (hacktheplanet) joins |
| 16:01:39 | | icedice2 joins |
| 16:01:43 | | icedice (icedice) joins |
| 16:05:18 | | icedice2 quits [Client Quit] |
| 16:28:50 | | andrew quits [Quit: ] |
| 16:29:11 | | andrew (andrew) joins |
| 16:38:58 | | chrismeller quits [Client Quit] |
| 16:41:01 | | chrismeller (chrismeller) joins |
| 16:41:34 | | chrismeller quits [Client Quit] |
| 16:42:16 | | chrismeller (chrismeller) joins |
| 16:45:15 | <spirit> | arkiver: ++! Thanks for your post :) |
| 17:17:01 | | lukash joins |
| 17:19:47 | | nicolas17 quits [Ping timeout: 252 seconds] |
| 17:24:20 | | nicolas17 joins |
| 17:45:41 | | fishingforsoup joins |
| 17:47:15 | | spirit quits [Quit: Leaving] |
| 17:55:22 | | HP_Archivist (HP_Archivist) joins |
| 18:31:39 | | spirit joins |
| 20:27:42 | | pikablu joins |
| 20:29:56 | | wickedplayer494 is now authenticated as wickedplayer494 |
| 20:37:53 | | hitgrr8 quits [Client Quit] |
| 20:42:47 | | sonick quits [Client Quit] |
| 21:39:55 | | thedudedude joins |
| 21:56:32 | | BlueMaxima joins |
| 22:45:52 | | Jake quits [Quit: Leaving for a bit!] |
| 22:46:14 | | Jake (Jake) joins |
| 22:57:28 | | Danielle joins |
| 23:00:58 | | Dalister quits [Ping timeout: 265 seconds] |
| 23:11:30 | | sonick (sonick) joins |