00:24:46 | | sludge quits [Read error: Connection reset by peer] |
00:24:52 | | sludge joins |
00:29:20 | | sludge is now authenticated as sludge |
00:37:00 | | sludge quits [Remote host closed the connection] |
00:57:36 | | sludge joins |
00:58:28 | | sludge is now authenticated as sludge |
01:13:04 | | etnguyen03 (etnguyen03) joins |
01:34:21 | | Webuser603710 joins |
01:36:24 | | andrew quits [Quit: ] |
01:43:15 | | BornOn420_ quits [Client Quit] |
01:43:41 | | BornOn420 (BornOn420) joins |
01:58:51 | <Webuser603710> | very new to all this and I'm having trouble decompressing a .zst archive. Every time I try I get an error, PeaZip says: |
01:58:51 | <Webuser603710> | "Possible causes of the error may be non readable input files (locked, not accessible, corrupted, missing volumes in multipart archive...), or full or not accessible output path. The archive may require a different password for the current operation. For more details please see "Report" tab for the full task log, and "Console" tab for task |
01:58:51 | <Webuser603710> | definition as command line." |
01:58:51 | <Webuser603710> | When I checked, the "Report" tab was blank and the "Console" tab just had a zstd -d command and nothing after.
01:59:31 | <nicolas17> | .warc.zst files are compressed with a custom dictionary |
01:59:43 | <Webuser603710> | I also tried 7-Zip ZS, which just says unspecified error
02:00:12 | <nicolas17> | the dictionary is inside the .zst file, but the way that's stored is not standardized and generic zstd doesn't support it
02:00:35 | <nicolas17> | there's tools out there to extract the dictionary so you can pass it to unzstd -D |
02:00:48 | <nicolas17> | but I don't remember what those tools are |
02:00:52 | <nicolas17> | JAA: ? |
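A rough sketch of the extraction step nicolas17 describes, not JAA's actual script: it assumes the Python zstandard package, placeholder file names, and that the dictionary sits in a zstd skippable frame at the very start of the file (possibly itself zstd-compressed), which is how the .warc.zst convention discussed below stores it. Once the dictionary is written out, plain unzstd -D handles the rest, since the reference decoder skips skippable frames on its own.

    # Rough sketch, not JAA's script. Assumptions: the dictionary is stored in a
    # zstd skippable frame at the start of the .warc.zst and may itself be
    # zstd-compressed; file names are placeholders.
    import io
    import struct
    import subprocess
    import zstandard

    IN_PATH = "example.warc.zst"     # placeholder
    DICT_PATH = "example.zstdict"    # placeholder

    with open(IN_PATH, "rb") as f:
        magic, size = struct.unpack("<II", f.read(8))
        # Skippable frames use magic numbers 0x184D2A50 through 0x184D2A5F.
        if not 0x184D2A50 <= magic <= 0x184D2A5F:
            raise SystemExit("no skippable dictionary frame at the start of the file")
        payload = f.read(size)

    # The embedded dictionary may itself be a zstd frame (magic 28 B5 2F FD).
    if payload[:4] == b"\x28\xb5\x2f\xfd":
        payload = zstandard.ZstdDecompressor().stream_reader(io.BytesIO(payload)).read()

    with open(DICT_PATH, "wb") as f:
        f.write(payload)

    # The reference decoder skips the skippable frame, so unzstd -D works directly.
    subprocess.run(["unzstd", "-D", DICT_PATH, IN_PATH], check=True)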
02:04:16 | <Webuser603710> | oh okay, the WARC section on the wiki just says "There is an exception: if the WARC file ends in .warc.zst, you will need to decompress it with zstd first. If it says "Dictionary mismatch" or a similar error message, try this Python script." so I was starting to wonder if that was the error I was getting
02:05:26 | <Webuser603710> | it would be nice if it specified that I may need to use specific tools and that generic zstd won't always work
02:05:46 | <nicolas17> | yeah that's probably the error |
02:06:00 | <nicolas17> | and PeaZip/7zip show less specific error messages... |
02:12:35 | <@JAA> | Yeah, that links to my script. Unixoid OS only, or at least I have no idea about (and can't support) anything else. |
02:14:58 | <Webuser603710> | okay thank you I was getting confused |
02:17:18 | <@JAA> | NB, .warc.zst doesn't *have to* use a custom dictionary; it's optional. All .warc.zst files that we've produced so far do though, I think. |
02:17:25 | | andrew (andrew) joins |
02:17:41 | <nicolas17> | JAA: oh, a while ago I uploaded some warcs that I compressed with gzip as a whole |
02:17:55 | <Webuser603710> | I'm on windows which would explain why I kept having issues when I tried it |
02:17:58 | <@JAA> | That's sadly allowed by the specification. The zstd one does not. |
02:18:07 | <@OrIdow6> | Bad idea: have a non-dictionary meta record at the beginning with text explaining how to read it |
02:18:22 | <@OrIdow6> | So hopefully if someone does zstdcat | they'll see that |
02:19:12 | <nicolas17> | oh wait no I used .warc.zst, compressed as a whole, which is worse |
02:19:23 | <nicolas17> | JAA: so is there any tool to convert this into per-record compression? |
02:20:14 | <@JAA> | OrIdow6: Yeah... Incompatible with the specification, too, since dicts are only allowed at the front, and all frames must be compressed with that same dict. |
02:20:27 | | @JAA slaps nicolas17 around a bit with a large trout |
02:20:40 | <@JAA> | I'm not aware of such a tool. |
02:20:53 | <nicolas17> | I basically did "zcat *.warc.gz | zstd > combined.warc.zst" |
02:20:56 | <@JAA> | I've been working on something that could eventually do it, but work is slow. |
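Purely as an illustration of what per-record recompression would mean (each WARC record in its own zstd frame, with no custom dictionary, which the spec allows per JAA above), here is a rough sketch; it assumes the warcio and zstandard Python packages, placeholder file names, and an already-decompressed WARC as input. It is not an existing Archive Team tool.

    # Illustration only (no such tool exists per the discussion above): write
    # each WARC record as its own zstd frame, without a custom dictionary.
    # Assumes the warcio and zstandard packages; file names are placeholders,
    # and the input is the already-decompressed WARC.
    import io
    import zstandard
    from warcio.archiveiterator import ArchiveIterator
    from warcio.warcwriter import WARCWriter

    cctx = zstandard.ZstdCompressor(level=19)

    with open("combined.warc", "rb") as inp, open("per-record.warc.zst", "wb") as out:
        for record in ArchiveIterator(inp):
            buf = io.BytesIO()
            # Serialize one record uncompressed, then compress it into its own frame.
            WARCWriter(buf, gzip=False).write_record(record)
            out.write(cctx.compress(buf.getvalue()))

Per-record frames are what make individual records independently decompressible; the trade-off is the compression-ratio hit nicolas17 brings up a few lines later.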
02:22:48 | <@JAA> | OrIdow6: When we tried to upstream the in-stream dictionary format, they wanted to allow specifying multiple dicts in a stream, using multiple in adjacent frames, etc. In that case, you could have a frame at the front which is compressed with the standard dict (is there one?). But the .warc.zst specification is much narrower. |
02:23:16 | <nicolas17> | I imagine the dict starts empty by default |
02:23:19 | <@JAA> | (The upstream efforts died in discussions about that additional complexity.) |
02:23:32 | | nic8693102004 quits [Ping timeout: 260 seconds] |
02:24:15 | <@JAA> | nicolas17: Yeah, I may be thinking of Brotli, which I think has a built-in dictionary because it's Google and it was built for HTTP. |
02:25:15 | <nicolas17> | hm whole-file zstd got me 23.9GiB -> 3.7GiB and records are small, I doubt it will be anywhere near as good with per-record compression |
02:25:51 | <@JAA> | You'd probably want to train a dict on the data first. There is a tool for that somewhere (because we do that), but I'm not sure where or how user-friendly it is. |
02:25:56 | <Webuser603710> | Would Windows Subsystem for Linux work for running the script? Or am I SOL unless I switch to a UNIX-based OS?
02:26:32 | <nicolas17> | wait what the heck are these Yahoo Groups records, they are not HTTP requests/responses :| |
02:26:51 | <nicolas17> | WARC-Type: resource |
02:26:52 | <@JAA> | Webuser603710: That might suffice (though I've heard things about WSL1 vs WSL2 mattering for many things). You still need zstd/unzstd though. |
02:26:52 | <nicolas17> | WARC-Target-URI: org.archive.yahoogroups:v1/group/00-aridbiologicalsciences/message/9/info |
02:27:10 | <@JAA> | nicolas17: wat |
02:28:56 | <@OrIdow6> | See the Yahoo Groups wiki page |
02:29:07 | <@OrIdow6> | There was a crawl done with a Python script that didn't
02:29:20 | <@OrIdow6> | capture HTTP and put everything into Resource records
02:29:33 | <@OrIdow6> | Oh, you just said that |
02:29:36 | <@JAA> | Yeah, but what is that URI and why does it use archive.org? |
02:29:42 | <nicolas17> | JAA: example https://archive.org/download/yahoo-groups-2018-02-13T00-16-05Z-5c98f8/24th_bcs_pwd.EBjAP56.warc.gz |
02:30:28 | <@JAA> | Mhm |
02:31:01 | <@JAA> | Oh, and hex WARC-Payload-Digest, too. |
02:31:25 | <@JAA> | I still think that it was a mistake not to specify the format of the digest fields. |
02:31:31 | <nicolas17> | skull emoji |
02:31:38 | <nicolas17> | the warc spec doesn't specify it? |
02:31:41 | <@JAA> | Nope |
02:32:18 | <@JAA> | It just calls it a `labelled-digest` and specifies the semantics (e.g. for segmented records). |
02:32:27 | <nicolas17> | https://archive.org/details/yahoo-groups-64k I no longer remember why I did this |
02:32:39 | <@JAA> | The examples are all base32 SHA-1, and that's what's used in most places, but it's not specified anywhere. |
02:32:46 | <nicolas17> | but I guess most if not all of the sub-64KB warcs will be those weird API captures |
02:33:12 | <@JAA> | And `labelled-digest = algorithm ":" digest-value` |
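To make that grammar concrete, a small stdlib-only sketch of the two encodings that come up here: the usual base32 SHA-1 form and the hex form seen in those Yahoo Groups records. Neither is mandated by the spec, and the payload below is arbitrary.

    # labelled-digest = algorithm ":" digest-value -- the spec doesn't pin down
    # either part, so both of these are legal. The payload is arbitrary.
    import base64
    import hashlib

    payload = b"example record payload"
    sha1 = hashlib.sha1(payload)

    common = "sha1:" + base64.b32encode(sha1.digest()).decode("ascii")  # base32, the usual form
    hex_form = "sha1:" + sha1.hexdigest()                               # hex, like the Yahoo Groups captures

    print("WARC-Payload-Digest: " + common)
    print("WARC-Payload-Digest: " + hex_form)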
02:33:24 | <TheTechRobo> | <@JAA> You'd probably want to train a dict on the data first. There is a tool for that somewhere (because we do that), but I'm not sure where or how user-friendly it is. |
02:33:24 | <TheTechRobo> | I'm pretty sure it's this, and in true Archive Team fashion, no docs :D https://github.com/ArchiveTeam/zstd-dictionary-trainer |
02:33:24 | <TheTechRobo> | From skimming the code, though, I think it samples from stuff in a given IA collection, so wouldn't work standalone. |
02:33:44 | <@JAA> | Ah yeah, sounds about right. |
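As a standalone counterpart to that trainer (which samples from an IA collection), zstd's training API can be pointed at local sample records instead; the zstd CLI offers the same thing via zstd --train. A rough sketch assuming the Python zstandard package, with arbitrary sizes and placeholder paths:

    # Rough standalone sketch; the linked trainer samples from an IA collection
    # instead. Assumes the zstandard package; sizes and paths are placeholders.
    import glob
    import zstandard

    # Samples should resemble the data to be compressed, e.g. individual WARC
    # records dumped to files.
    samples = [open(p, "rb").read() for p in glob.glob("samples/*.bin")]

    dictionary = zstandard.train_dictionary(112640, samples)  # ~110 KiB dictionary
    with open("warc.zstdict", "wb") as f:
        f.write(dictionary.as_bytes())

    # Compress with the trained dictionary afterwards:
    cctx = zstandard.ZstdCompressor(dict_data=dictionary, level=19)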
02:33:44 | <nicolas17> | the spec doesn't specify any 'algorithm' value? |
02:33:55 | <@JAA> | No |
02:34:02 | <@JAA> | > No particular algorithm is recommended. |
02:37:42 | <nicolas17> | HTTP digest headers use base64 |
02:37:58 | <nicolas17> | Content-Digest: sha-256=:RK/0qy18MlBSVnWgjwz6lZEWjP/lF5HF9bvEF8FabDg=: |
02:38:50 | <nicolas17> | https://www.iana.org/assignments/http-digest-hash-alg/ |
02:38:59 | <@JAA> | TIL, but it's extremely new. |
02:39:26 | <@JAA> | RFC 9530 from February |
02:39:52 | <nicolas17> | yep |
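For comparison, the RFC 9530 value quoted above is just the base64 of a SHA-256 digest wrapped in colons (a structured-field byte sequence). A tiny stdlib sketch of that shape, with an arbitrary body, so the value will not match the example above:

    # Reproduces the shape of the RFC 9530 header quoted above; the body here is
    # arbitrary, so the digest value differs from that example.
    import base64
    import hashlib

    body = b"hello world"
    digest = base64.b64encode(hashlib.sha256(body).digest()).decode("ascii")
    print("Content-Digest: sha-256=:" + digest + ":")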
02:40:39 | <@JAA> | There was 'Digest' from RFC 3230 before, and MDN has this to say: |
02:40:40 | <@JAA> | > A Digest header was defined in previous specifications, but it proved problematic as the scope of what the digest applied to was not clear. Specifically, it was difficult to distinguish whether a digest applied to the entire resource representation or to the specific content of a HTTP message. As such, two separate headers were specified (Content-Digest and Repr-Digest) to convey HTTP message |
02:40:46 | <@JAA> | content digests and resource representation digests, respectively. |
02:40:50 | <@JAA> | Lovely |
02:41:14 | <@JAA> | So it's basically like WARC-Block-Digest and WARC-Payload-Digest. |
02:41:32 | <@JAA> | Ah no, not quite. |
02:41:34 | <@JAA> | > A content digest will differ based on Content-Encoding and Content-Range, but not Transfer-Encoding. |
02:41:52 | <TheTechRobo> | transfer encoding++ |
02:41:52 | <eggdrop> | [karma] 'transfer encoding' now has 1 karma! |
02:48:54 | | DigitalDragons quits [Quit: Ping timeout (120 seconds)] |
02:48:54 | | Exorcism quits [Quit: Ping timeout (120 seconds)] |
02:49:24 | | Exorcism (exorcism) joins |
02:49:24 | | DigitalDragons (DigitalDragons) joins |
02:52:40 | | BennyOtt_ joins |
02:54:27 | | BennyOtt quits [Ping timeout: 260 seconds] |
02:54:27 | | BennyOtt_ is now known as BennyOtt |
02:54:27 | | BennyOtt is now authenticated as BennyOtt |
03:34:05 | | etnguyen03 quits [Client Quit] |
03:40:18 | | etnguyen03 (etnguyen03) joins |
04:21:51 | | etnguyen03 quits [Remote host closed the connection] |
04:53:27 | | pabs quits [Ping timeout: 260 seconds] |
04:58:19 | | Naruyoko joins |
05:05:04 | | pabs (pabs) joins |
05:54:14 | | pabs quits [Read error: Connection reset by peer] |
05:55:09 | | Webuser603710 quits [Quit: Ooops, wrong browser tab.] |
05:55:57 | | quartermaster quits [Quit: Connection closed for inactivity] |
06:18:40 | | i_have_n0_idea quits [Quit: The Lounge - https://thelounge.chat] |
06:20:10 | | pabs (pabs) joins |
07:05:52 | | Unholy23619246453771312 (Unholy2361) joins |
07:07:39 | | BlueMaxima quits [Quit: Leaving] |
08:05:05 | | Webuser415804 joins |
08:07:41 | | lennier2_ joins |
08:10:37 | | lennier2 quits [Ping timeout: 260 seconds] |
08:11:57 | | i_have_n0_idea (i_have_n0_idea) joins |
09:00:58 | | i_have_n0_idea quits [Ping timeout: 265 seconds] |
09:00:58 | | Island quits [Read error: Connection reset by peer] |
09:14:09 | | i_have_n0_idea (i_have_n0_idea) joins |
09:38:07 | | Doomaholic quits [Read error: Connection reset by peer] |
09:38:24 | | Doomaholic (Doomaholic) joins |
09:38:35 | | cow_2001 joins |
09:50:16 | | nulldata quits [Ping timeout: 265 seconds] |
09:53:52 | | collat quits [Ping timeout: 260 seconds] |
09:55:45 | | collat joins |
10:02:07 | | collat quits [Client Quit] |
11:14:22 | | tzt quits [Ping timeout: 260 seconds] |
11:26:01 | | MrMcNuggets (MrMcNuggets) joins |
11:36:11 | | tzt (tzt) joins |
11:38:59 | | deedan06 joins |
12:00:04 | | Bleo182600722719623 quits [Quit: The Lounge - https://thelounge.chat] |
12:01:30 | | deedan06 quits [Client Quit] |
12:02:55 | | Bleo182600722719623 joins |
12:04:38 | | Webuser415804 quits [Quit: Ooops, wrong browser tab.] |
12:31:01 | <@OrIdow6> | JAA: On the "hint in the zstd warc" idea: it would work less well, but what about a meta/informational record within the initial data frame, encoded with Raw_Block so that it's viewable in e.g. a hexdump?
12:34:39 | <@OrIdow6> | But I suspect zstd is widely known enough now that people might not bother to do that |
13:54:32 | | etnguyen03 (etnguyen03) joins |
13:58:28 | | Wohlstand (Wohlstand) joins |
13:59:01 | | ducky quits [Read error: Connection reset by peer] |
13:59:15 | | ducky (ducky) joins |
14:10:17 | | etnguyen03 quits [Client Quit] |
14:20:39 | | SkilledAlpaca418962 joins |
14:34:08 | | Wohlstand quits [Remote host closed the connection] |
14:41:41 | | Chris5010 (Chris5010) joins |
14:55:22 | <c3manu> | arkiver: currently trying to dig up what i can for the militant groups on the regime side. there's quite a lot. |
15:03:40 | | bf_ joins |
15:13:44 | <@arkiver> | c3manu: is there something i can do to help? |
15:31:10 | <c3manu> | maybe throw in the ones i didn't get yet? i need a break >.< |
15:54:14 | | bf_ quits [Remote host closed the connection] |
16:18:17 | | etnguyen03 (etnguyen03) joins |
16:28:05 | | PredatorIWD2 quits [Read error: Connection reset by peer] |
16:28:33 | | pabs quits [Remote host closed the connection] |
16:36:13 | | pabs (pabs) joins |
16:44:52 | | etnguyen03 quits [Client Quit] |
16:49:09 | | etnguyen03 (etnguyen03) joins |
16:59:02 | | adryd quits [Quit: Ping timeout (120 seconds)] |
16:59:18 | | adryd (adryd) joins |
17:01:35 | | i_have_n0_idea9 (i_have_n0_idea) joins |
17:02:03 | | i_have_n0_idea quits [Read error: Connection reset by peer] |
17:02:03 | | i_have_n0_idea9 is now known as i_have_n0_idea |
17:22:41 | | i_have_n0_idea quits [Client Quit] |
17:48:23 | | bf_ joins |
17:50:49 | | etnguyen03 quits [Client Quit] |
18:09:01 | | etnguyen03 (etnguyen03) joins |
18:12:40 | | Czechball joins |
18:24:50 | | PredatorIWD2 joins |
18:41:21 | | etnguyen03 quits [Client Quit] |
18:58:05 | | etnguyen03 (etnguyen03) joins |
19:05:27 | | Czechball quits [Client Quit] |
19:22:53 | | Mist8kenGAS quits [Remote host closed the connection] |
19:25:17 | | etnguyen03 quits [Client Quit] |
19:30:20 | | Czechball joins |
19:30:23 | | BlueMaxima joins |
19:34:47 | | SkilledAlpaca418962 quits [Quit: SkilledAlpaca418962] |
19:41:38 | | MrMcNuggets quits [Quit: WeeChat 4.3.2] |
19:42:45 | | Wohlstand (Wohlstand) joins |
19:44:16 | | PredatorIWD2 quits [Read error: Connection reset by peer] |
19:44:30 | | PredatorIWD2 joins |
19:45:15 | | xarph quits [Ping timeout: 265 seconds] |
19:46:52 | | xarph joins |
20:02:21 | | etnguyen03 (etnguyen03) joins |
20:03:47 | | SkilledAlpaca418962 joins |
20:05:47 | | fuzzy80211 quits [Ping timeout: 260 seconds] |
20:39:34 | <that_lurker> | Trying to monitor Liveuamap MiddleEast through their twitter/x at #livemap |
20:43:07 | | riteo quits [Ping timeout: 260 seconds] |
20:49:54 | | etnguyen03 quits [Client Quit] |
20:56:12 | | etnguyen03 (etnguyen03) joins |
21:26:30 | | ShastaTheDog joins |
21:43:11 | | rktk quits [Ping timeout: 265 seconds] |
21:49:03 | <nicolas17> | man, rrpicturearchives still down, I should have archived it months ago |
21:51:00 | | fuzzy80211 (fuzzy80211) joins |
22:02:11 | | rktk (rktk) joins |
22:03:52 | | ShastaTheDog quits [Client Quit] |
22:39:23 | | hackbug quits [Remote host closed the connection] |
22:44:52 | | hackbug (hackbug) joins |