00:24:46sludge quits [Read error: Connection reset by peer]
00:24:52sludge joins
00:37:00sludge quits [Remote host closed the connection]
00:57:36sludge joins
01:13:04etnguyen03 (etnguyen03) joins
01:34:21Webuser603710 joins
01:36:24andrew quits [Quit: ]
01:43:15BornOn420_ quits [Client Quit]
01:43:41BornOn420 (BornOn420) joins
01:58:51<Webuser603710>very new to all this and I'm having trouble decompressing a .zst archive. Every time I try I get an error, PeaZip says:
01:58:51<Webuser603710>"Possible causes of the error may be non readable input files (locked, not accessible, corrupted, missing volumes in multipart archive...), or full or not accessible output path. The archive may require a different password for the current operation. For more details please see "Report" tab for the full task log, and "Console" tab for task
01:58:51<Webuser603710>definition as command line."
01:58:51<Webuser603710>When I checked, the "Report" tab was blank and the "Console" tab just had a zstd -d command and nothing after.
01:59:31<nicolas17>.warc.zst files are compressed with a custom dictionary
01:59:43<Webuser603710>I also tried 7-Zip ZS, which just says "unspecified error"
02:00:12<nicolas17>the dictionary is inside the .zst file, but the way it's stored is not standardized and generic zstd doesn't support it
02:00:35<nicolas17>there's tools out there to extract the dictionary so you can pass it to unzstd -D
02:00:48<nicolas17>but I don't remember what those tools are
02:00:52<nicolas17>JAA: ?
02:04:16<Webuser603710>oh okay, the WARC section on the wiki just says "There is an exception: if the WARC file ends in .warc.zst, you will need to decompress it with zstd first. If it says "Dictionary mismatch" or a similar error message, try this Python script." so I was starting to wonder if that was the error I was getting
02:05:26<Webuser603710>it would be nice if it specified that I may need to use specific tools and that generic zstd won't always work
02:05:46<nicolas17>yeah that's probably the error
02:06:00<nicolas17>and PeaZip/7zip show less specific error messages...
02:12:35<@JAA>Yeah, that links to my script. Unixoid OS only, or at least I have no idea about (and can't support) anything else.
02:14:58<Webuser603710>okay thank you I was getting confused
02:17:18<@JAA>NB, .warc.zst doesn't *have to* use a custom dictionary; it's optional. All .warc.zst files that we've produced so far do though, I think.
02:17:25andrew (andrew) joins
02:17:41<nicolas17>JAA: oh, a while ago I uploaded some warcs that I compressed with gzip as a whole
02:17:55<Webuser603710>I'm on windows which would explain why I kept having issues when I tried it
02:17:58<@JAA>That's sadly allowed by the specification. The zstd one does not.
02:18:07<@OrIdow6>Bad idea: have a non-dictionary meta record at the beginning with text explaining how to read it
02:18:22<@OrIdow6>So hopefully if someone does zstdcat | they'll see that
02:19:12<nicolas17>oh wait no I used .warc.zst, compressed as a whole, which is worse
02:19:23<nicolas17>JAA: so is there any tool to convert this into per-record compression?
02:20:14<@JAA>OrIdow6: Yeah... Incompatible with the specification, too, since dicts are only allowed at the front, and all frames must be compressed with that same dict.
02:20:27@JAA slaps nicolas17 around a bit with a large trout
02:20:40<@JAA>I'm not aware of such a tool.
02:20:53<nicolas17>I basically did "zcat *.warc.gz | zstd > combined.warc.zst"
02:20:56<@JAA>I've been working on something that could eventually do it, but work is slow.
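[Editor's note: the reason per-record .warc.gz works at all is that gzip members can be concatenated and still form one valid stream; a quick stdlib illustration (the record bytes are made up):]

```python
import gzip

# Two hypothetical WARC records, each compressed as its own gzip member.
rec1 = b"WARC/1.0\r\nWARC-Type: warcinfo\r\n\r\n"
rec2 = b"WARC/1.0\r\nWARC-Type: response\r\n\r\n"
combined = gzip.compress(rec1) + gzip.compress(rec2)

# gzip reads concatenated members back as a single stream, so a
# per-record .warc.gz is still one valid gzip file, yet each record
# can also be seeked to and decompressed on its own.
assert gzip.decompress(combined) == rec1 + rec2
```

This is what "per-record compression" above refers to; whole-file compression (one big member) loses the ability to start decompressing at an individual record.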
02:22:48<@JAA>OrIdow6: When we tried to upstream the in-stream dictionary format, they wanted to allow specifying multiple dicts in a stream, using multiple in adjacent frames, etc. In that case, you could have a frame at the front which is compressed with the standard dict (is there one?). But the .warc.zst specification is much narrower.
02:23:16<nicolas17>I imagine the dict starts empty by default
02:23:19<@JAA>(The upstream efforts died in discussions about that additional complexity.)
02:23:32nic8693102004 quits [Ping timeout: 260 seconds]
02:24:15<@JAA>nicolas17: Yeah, I may be thinking of Brotli, which I think has a built-in dictionary because it's Google and it was built for HTTP.
02:25:15<nicolas17>hm whole-file zstd got me 23.9GiB -> 3.7GiB and records are small, I doubt it will be anywhere near as good with per-record compression
02:25:51<@JAA>You'd probably want to train a dict on the data first. There is a tool for that somewhere (because we do that), but I'm not sure where or how user-friendly it is.
02:25:56<Webuser603710>Would Windows Subsystem for Linux work for running the script? Or am I SOL unless I switch to a UNIX-based OS?
02:26:32<nicolas17>wait what the heck are these Yahoo Groups records, they are not HTTP requests/responses :|
02:26:51<nicolas17>WARC-Type: resource
02:26:52<@JAA>Webuser603710: That might suffice (though I've heard things about WSL1 vs WSL2 mattering for many things). You still need zstd/unzstd though.
02:26:52<nicolas17>WARC-Target-URI: org.archive.yahoogroups:v1/group/00-aridbiologicalsciences/message/9/info
02:27:10<@JAA>nicolas17: wat
02:28:56<@OrIdow6>See the Yahoo Groups wiki page
02:29:07<@OrIdow6>There was a crawl done with a Python script that didn't
02:29:20<@OrIdow6>capture HTTP and put everything into Resource records
02:29:33<@OrIdow6>Oh, you just said that
02:29:36<@JAA>Yeah, but what is that URI and why does it use archive.org?
02:29:42<nicolas17>JAA: example https://archive.org/download/yahoo-groups-2018-02-13T00-16-05Z-5c98f8/24th_bcs_pwd.EBjAP56.warc.gz
02:30:28<@JAA>Mhm
02:31:01<@JAA>Oh, and hex WARC-Payload-Digest, too.
02:31:25<@JAA>I still think that it was a mistake not to specify the format of the digest fields.
02:31:31<nicolas17>skull emoji
02:31:38<nicolas17>the warc spec doesn't specify it?
02:31:41<@JAA>Nope
02:32:18<@JAA>It just calls it a `labelled-digest` and specifies the semantics (e.g. for segmented records).
02:32:27<nicolas17>https://archive.org/details/yahoo-groups-64k I no longer remember why I did this
02:32:39<@JAA>The examples are all base32 SHA-1, and that's what's used in most places, but it's not specified anywhere.
02:32:46<nicolas17>but I guess most if not all of the sub-64KB warcs will be those weird API captures
02:33:12<@JAA>And `labelled-digest = algorithm ":" digest-value`
02:33:24<TheTechRobo><@JAA> You'd probably want to train a dict on the data first. There is a tool for that somewhere (because we do that), but I'm not sure where or how user-friendly it is.
02:33:24<TheTechRobo>I'm pretty sure it's this, and in true Archive Team fashion, no docs :D https://github.com/ArchiveTeam/zstd-dictionary-trainer
02:33:24<TheTechRobo>From skimming the code, though, I think it samples from stuff in a given IA collection, so wouldn't work standalone.
02:33:44<@JAA>Ah yeah, sounds about right.
02:33:44<nicolas17>the spec doesn't specify any 'algorithm' value?
02:33:55<@JAA>No
02:34:02<@JAA>> No particular algorithm is recommended.
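[Editor's note: the de-facto convention JAA describes (base32-encoded SHA-1 in `algorithm ":" digest-value` form, used "in most places" but not mandated by the WARC spec) is easy to reproduce with the stdlib; a sketch, function name mine:]

```python
import base64
import hashlib


def labelled_digest(payload: bytes) -> str:
    """Build a WARC-style labelled-digest the way most tools do:
    base32-encoded SHA-1. The spec only fixes the
    `algorithm ":" digest-value` shape, not the algorithm or encoding.
    """
    digest = base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")
    return f"sha1:{digest}"
```

The hex digests nicolas17 found in those Yahoo Groups WARCs are equally spec-compliant, which is exactly the interoperability problem being complained about here.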
02:37:42<nicolas17>HTTP digest headers use base64
02:37:58<nicolas17>Content-Digest: sha-256=:RK/0qy18MlBSVnWgjwz6lZEWjP/lF5HF9bvEF8FabDg=:
02:38:50<nicolas17>https://www.iana.org/assignments/http-digest-hash-alg/
02:38:59<@JAA>TIL, but it's extremely new.
02:39:26<@JAA>RFC 9530 from February
02:39:52<nicolas17>yep
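[Editor's note: the RFC 9530 header quoted above encodes the digest as a structured-field byte sequence, i.e. colon-delimited base64; a stdlib sketch for sha-256, function name mine:]

```python
import base64
import hashlib


def content_digest(body: bytes) -> str:
    """Build an RFC 9530 Content-Digest header value for sha-256.

    Structured-field byte sequences are wrapped in colons, hence the
    `sha-256=:<base64>:` shape in the example above.
    """
    b64 = base64.b64encode(hashlib.sha256(body).digest()).decode("ascii")
    return f"sha-256=:{b64}:"
```

So HTTP settled on base64 while the WARC world mostly uses base32: same hashes, incompatible spellings.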
02:40:39<@JAA>There was 'Digest' from RFC 3230 before, and MDN has this to say:
02:40:40<@JAA>> A Digest header was defined in previous specifications, but it proved problematic as the scope of what the digest applied to was not clear. Specifically, it was difficult to distinguish whether a digest applied to the entire resource representation or to the specific content of a HTTP message. As such, two separate headers were specified (Content-Digest and Repr-Digest) to convey HTTP message
02:40:46<@JAA>content digests and resource representation digests, respectively.
02:40:50<@JAA>Lovely
02:41:14<@JAA>So it's basically like WARC-Block-Digest and WARC-Payload-Digest.
02:41:32<@JAA>Ah no, not quite.
02:41:34<@JAA>> A content digest will differ based on Content-Encoding and Content-Range, but not Transfer-Encoding.
02:41:52<TheTechRobo>transfer encoding++
02:41:52<eggdrop>[karma] 'transfer encoding' now has 1 karma!
02:48:54DigitalDragons quits [Quit: Ping timeout (120 seconds)]
02:48:54Exorcism quits [Quit: Ping timeout (120 seconds)]
02:49:24Exorcism (exorcism) joins
02:49:24DigitalDragons (DigitalDragons) joins
02:52:40BennyOtt_ joins
02:54:27BennyOtt quits [Ping timeout: 260 seconds]
02:54:27BennyOtt_ is now known as BennyOtt
03:34:05etnguyen03 quits [Client Quit]
03:40:18etnguyen03 (etnguyen03) joins
04:21:51etnguyen03 quits [Remote host closed the connection]
04:53:27pabs quits [Ping timeout: 260 seconds]
04:58:19Naruyoko joins
05:05:04pabs (pabs) joins
05:54:14pabs quits [Read error: Connection reset by peer]
05:55:09Webuser603710 quits [Quit: Ooops, wrong browser tab.]
05:55:57quartermaster quits [Quit: Connection closed for inactivity]
06:18:40i_have_n0_idea quits [Quit: The Lounge - https://thelounge.chat]
06:20:10pabs (pabs) joins
07:05:52Unholy23619246453771312 (Unholy2361) joins
07:07:39BlueMaxima quits [Quit: Leaving]
08:05:05Webuser415804 joins
08:07:41lennier2_ joins
08:10:37lennier2 quits [Ping timeout: 260 seconds]
08:11:57i_have_n0_idea (i_have_n0_idea) joins
09:00:58i_have_n0_idea quits [Ping timeout: 265 seconds]
09:00:58Island quits [Read error: Connection reset by peer]
09:14:09i_have_n0_idea (i_have_n0_idea) joins
09:38:07Doomaholic quits [Read error: Connection reset by peer]
09:38:24Doomaholic (Doomaholic) joins
09:38:35cow_2001 joins
09:50:16nulldata quits [Ping timeout: 265 seconds]
09:53:52collat quits [Ping timeout: 260 seconds]
09:55:45collat joins
10:02:07collat quits [Client Quit]
11:14:22tzt quits [Ping timeout: 260 seconds]
11:26:01MrMcNuggets (MrMcNuggets) joins
11:36:11tzt (tzt) joins
11:38:59deedan06 joins
12:00:04Bleo182600722719623 quits [Quit: The Lounge - https://thelounge.chat]
12:01:30deedan06 quits [Client Quit]
12:02:55Bleo182600722719623 joins
12:04:38Webuser415804 quits [Quit: Ooops, wrong browser tab.]
12:31:01<@OrIdow6>JAA: On the "hint in the zstd warc" idea, it would work less well, but what about a meta/informational record within the initial data frame, encoded with Raw_Block so that it is viewable in e.g. a hexdump?
12:34:39<@OrIdow6>But I suspect zstd is widely known enough now that people might not bother to do that
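[Editor's note: a sketch of the Raw_Block idea: hand-assembling a minimal zstd frame whose single block is of type Raw, so the payload bytes sit verbatim in the file and show up in a hexdump. Layout follows my reading of the zstd frame format (magic, Frame_Header_Descriptor 0x00 so a Window_Descriptor byte follows, then a 3-byte block header with Last_Block=1 and Block_Type=Raw); it has not been validated against a reference decoder here.]

```python
import struct


def raw_block_frame(text: bytes) -> bytes:
    """Assemble a minimal zstd frame containing one Raw_Block.

    Raw blocks are stored uncompressed, so `text` appears byte-for-byte
    in the output -- the property OrIdow6's hint record would rely on.
    """
    magic = struct.pack("<I", 0xFD2FB528)          # zstd frame magic
    frame_header = bytes([0x00, 0x00])             # descriptor + window descriptor
    # 3-byte block header: bit 0 = Last_Block, bits 1-2 = Block_Type
    # (0 = Raw), bits 3..23 = Block_Size.
    block_header = struct.pack("<I", 1 | (0 << 1) | (len(text) << 3))[:3]
    return magic + frame_header + block_header + text


note = raw_block_frame(b"This .warc.zst needs its embedded dictionary; see the wiki")
assert b"embedded dictionary" in note  # the hint text survives verbatim
```

Per the downside already noted above: once piped through a dict-less zstdcat the frame simply fails to decode, so the hint only helps people who look at the raw bytes.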
13:54:32etnguyen03 (etnguyen03) joins
13:58:28Wohlstand (Wohlstand) joins
13:59:01ducky quits [Read error: Connection reset by peer]
13:59:15ducky (ducky) joins
14:10:17etnguyen03 quits [Client Quit]
14:20:39SkilledAlpaca418962 joins
14:34:08Wohlstand quits [Remote host closed the connection]
14:41:41Chris5010 (Chris5010) joins
14:55:22<c3manu>arkiver: currently trying to dig up what i can for the militant groups on the regime side. there's quite a lot.
15:03:40bf_ joins
15:13:44<@arkiver>c3manu: is there something i can do to help?
15:31:10<c3manu>maybe throw in the ones i didn't get yet? i need a break >.<
15:54:14bf_ quits [Remote host closed the connection]
16:18:17etnguyen03 (etnguyen03) joins
16:28:05PredatorIWD2 quits [Read error: Connection reset by peer]
16:28:33pabs quits [Remote host closed the connection]
16:36:13pabs (pabs) joins
16:44:52etnguyen03 quits [Client Quit]
16:49:09etnguyen03 (etnguyen03) joins
16:59:02adryd quits [Quit: Ping timeout (120 seconds)]
16:59:18adryd (adryd) joins
17:01:35i_have_n0_idea9 (i_have_n0_idea) joins
17:02:03i_have_n0_idea quits [Read error: Connection reset by peer]
17:02:03i_have_n0_idea9 is now known as i_have_n0_idea
17:22:41i_have_n0_idea quits [Client Quit]
17:48:23bf_ joins
17:50:49etnguyen03 quits [Client Quit]
18:09:01etnguyen03 (etnguyen03) joins
18:12:40Czechball joins
18:24:50PredatorIWD2 joins
18:41:21etnguyen03 quits [Client Quit]
18:58:05etnguyen03 (etnguyen03) joins
19:05:27Czechball quits [Client Quit]
19:22:53Mist8kenGAS quits [Remote host closed the connection]
19:25:17etnguyen03 quits [Client Quit]
19:30:20Czechball joins
19:30:23BlueMaxima joins
19:34:47SkilledAlpaca418962 quits [Quit: SkilledAlpaca418962]
19:41:38MrMcNuggets quits [Quit: WeeChat 4.3.2]
19:42:45Wohlstand (Wohlstand) joins
19:44:16PredatorIWD2 quits [Read error: Connection reset by peer]
19:44:30PredatorIWD2 joins
19:45:15xarph quits [Ping timeout: 265 seconds]
19:46:52xarph joins
20:02:21etnguyen03 (etnguyen03) joins
20:03:47SkilledAlpaca418962 joins
20:05:47fuzzy80211 quits [Ping timeout: 260 seconds]
20:39:34<that_lurker>Trying to monitor Liveuamap MiddleEast through their twitter/x at #livemap
20:43:07riteo quits [Ping timeout: 260 seconds]
20:49:54etnguyen03 quits [Client Quit]
20:56:12etnguyen03 (etnguyen03) joins
21:26:30ShastaTheDog joins
21:43:11rktk quits [Ping timeout: 265 seconds]
21:49:03<nicolas17>man, rrpicturearchives still down, I should have archived it months ago
21:51:00fuzzy80211 (fuzzy80211) joins
22:02:11rktk (rktk) joins
22:03:52ShastaTheDog quits [Client Quit]
22:39:23hackbug quits [Remote host closed the connection]
22:44:52hackbug (hackbug) joins