00:38:16<systwi_>Regarding [megawarc](https://github.com/ArchiveTeam/megawarc), is it possible to `megawarc restore` if the previously-created megawarc.warc.gz file is gone (the archive was decompressed and tossed; only the megawarc.warc remains)?
00:40:08<systwi_>Doing some tests, even recompressing by simply using `gzip megawarc.warc` gives errors upon restoring: `Exception: End of file: x bytes expected, but y bytes read.`
00:40:42<@JAA>No. The .megawarc.json.gz contains offsets of each input WARC.
00:40:53<systwi_>I thought it might decompress the .gz _first_, then restore the original structure, instead of recreating it on the fly (which is what it sounds like it's doing).
00:41:18<@JAA>So if those offsets no longer match what you have, it will break, either with a clear error or in hilarious ways.
00:41:52<systwi_>...of each input WARC in the original gzipped output (if I'm not mistaken).
00:42:02<@JAA>Correct
00:42:24<@JAA>The input WARCs are simply appended to the megawarc, and that offset plus size gets written to the JSON file.
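The append-and-record scheme JAA describes can be sketched in plain shell. This is a toy illustration, not megawarc's actual code; the file names and JSON field names here are made up, and the "input WARCs" are stand-in strings:

```shell
# Create two stand-in "input WARCs" (just bytes, for illustration):
printf 'AAAA'   > a.warc.gz
printf 'BBBBBB' > b.warc.gz

# Append each input to the megawarc, recording offset + size as we go.
offset=0
: > mega.warc.gz                       # start an empty megawarc
for w in a.warc.gz b.warc.gz; do
  size=$(wc -c < "$w")
  cat "$w" >> mega.warc.gz
  printf '{"file":"%s","offset":%d,"size":%d}\n' \
    "$w" "$offset" "$size" >> mega.json
  offset=$((offset + size))
done
cat mega.json
```

Restoring is then just reading each recorded byte range back out of the megawarc, which is why the offsets must match the file exactly.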
00:42:29Webuser734 joins
00:42:35<Webuser734>Hello
00:43:35<Webuser734>I'm trying to retrieve an archived listserv on Syracuse University's servers called ANI-L. I know someone who still has access to it, but he is hesitant to give me his login information for obvious reasons. Is there another thing I can do?
00:43:36<systwi_>JAA: Ahh. Is it not possible for the .json to have the WARC offsets referencing the uncompressed megawarc?
00:43:52<systwi_>I'm sure forking it and modifying it myself is the alternative. :-P
00:44:11<@JAA>systwi_: Not with standard megawarc, but I'm sure it's possible to implement it, yeah.
00:44:20<Webuser734>The list owners are either hard to reach or may not have access anymore
00:46:57<@JAA>Webuser734: Define 'archived'? I'm guessing that's a different definition than ours.
00:47:10<systwi_>Webuser734: I'm not familiar with their servers. Unfortunately, unless somebody else here has login credentials there or mirrored/saved the file(s) elsewhere there likely isn't much we can do. :-/
00:47:36<Webuser734>The list has been deactivated but it is still available for viewing by people who joined it when it was active
00:48:07<Webuser734>It was a private listserv
00:49:00Arcorann (Arcorann) joins
00:50:07<@JAA>Yeah, your only chance then is the actual people there, either a subscriber or an admin.
00:50:19<Webuser734>Ok
00:50:36<@JAA>There are a few snapshots in the Wayback Machine, but they're just login walls as expected.
00:50:56<Webuser734>One of the list owners is active on FB and I'm trying to see if he could let me in but otherwise I'll need to brainstorm how they can compile the data and send it to me
00:52:12<@JAA>Just in case: If they're interested in an actual archive getting created for future reference, we can take care of that. We'd need credentials for it, and the data wouldn't be easily accessible (if public at all).
00:53:01<Webuser734>I want it to be public
00:53:42<Webuser734>Or at least for there to be some way I can screencap it. I've been putting together a public archive of the Neurodiversity Movement and ANI-L was very important to its foundations
00:53:43<@JAA>Well, that's something the admins or list moderators need to decide. I'm guessing there's a reason why they kept it private for the past decades.
00:54:14<Webuser734>They did it because they wanted to be careful about who joined it
00:54:28<@JAA>But in general, the more important thing is to preserve the data, whether it be accessible now or only in several decades is kind of unrelated to that.
00:54:30<Webuser734>But people don't use it anymore so the only reason to join it would be for archival purposes
00:54:41<@JAA>Right
00:55:10<@JAA>It's a sensitive topic though. People might have been posting on the list with the knowledge that it's not public, for example.
00:55:58<Webuser734>I contacted Syracuse's IT department a few months ago and they said someone who owns the list or still has access would be the ones to consult iirc
00:56:09<@JAA>That makes sense, yeah.
00:56:24<Webuser734>I've kinda procrastinated for a few months and now wanna see if I can do something. I just have the sudden urge to.
00:57:53<Webuser734>So what are the next steps I should take?
01:04:20thetechrobo_ joins
01:04:57<systwi_>I guess, try sending emails/messages to anybody you know that might still have, or currently has, access to the server. See if you can gain access to the data; if you do, save what you need and ask administrators about full archival afterwards. Best of luck.
01:05:19<Webuser734>Thanks
01:05:21TheTechRobo quits [Ping timeout: 265 seconds]
01:05:35Webuser734 quits [Remote host closed the connection]
01:05:36<systwi_>You're welcome, hope that helps.
01:06:12systwi is now known as TheTechRobo
01:06:23TheTechRobo is now known as systwi
01:17:33thetechrobo_ quits [Remote host closed the connection]
01:17:55thetechrobo_ joins
01:20:59TheTechRobo joins
01:22:04thetechrobo_ quits [Ping timeout: 240 seconds]
01:41:45thetechrobo_ joins
01:42:30HP_Archivist (HP_Archivist) joins
01:43:46TheTechRobo quits [Ping timeout: 240 seconds]
01:50:21thetechrobo_ is now known as TheTechRobo
02:00:27<systwi_>JAA: re: megawarc creating/restoring, as an alternative, does a setup like this (general idea) look good?:
02:00:31<systwi_>https://paste.debian.net/plain/1251072
02:00:38<systwi_>(simulated shell I/O)
02:00:44<systwi_>It doesn't retain the original .gz information, unfortunately, as the .warc.gz files would need to be decompressed (to .warc) before starting.
02:02:17<@JAA>systwi_: Uh, I'm not sure what you're trying to do exactly (why not just keep them compressed?), but ... yeah, I guess?
02:03:13<@JAA>The dd command should be fun if you can't break the WARC size into a reasonable block but it's also too large to just fit into RAM easily. (Bonus points if the file size is prime.)
02:03:36<@JAA>I'd probably do that with head/tail instead, but not sure about performance.
02:04:11<@JAA>E.g. `tail -c+1234 | head -c2345` to get the 2345 bytes starting at offset 1234.
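A quick sanity check of that `tail | head` trick, using a throwaway file. Note that `tail -c +N` is 1-indexed, so a 0-based offset needs N = offset + 1:

```shell
# Extract a byte range with tail|head.
printf '0123456789' > sample.bin
# tail -c +4 starts at the 4th byte (1-indexed), i.e. 0-based offset 3;
# head -c 3 then keeps 3 bytes, so this prints the characters "345".
tail -c +4 sample.bin | head -c 3
```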
02:06:46<systwi_>I don't want the WARCs compressed as I plan on compressing them, along with wpull logs from that job, etc., into one compressed tarball (or likely, alternatively, one larger archive containing several other types of data - out of the scope of this topic).
02:07:45<systwi_>Thanks for the tip re: `tail', I'll have to experiment with that.
02:08:20<systwi_>Oh, `skip=z` should be included in that `dd` command.
02:15:35<systwi_>Related: is there a specific `gzip` syntax necessary in order to perfectly recreate the .warc.gz archive `megawarc` expects? `gzip example.com.warc` and `gzip -1 example.com.warc` (and varying compression effort) do not seem to work.
02:16:02<systwi_>They're either larger or smaller than the original archive.
02:16:56<Jake>WARC records are gzipped per record (I believe), so gzipping an uncompressed .warc wouldn't create the same file.
02:17:27<@JAA>Correct
02:18:16<@JAA>And gzip also has a few knobs. Compression level is the most obvious one, but there are other internal ones as well I think.
02:19:03<systwi_>I see. So I'm basically out of luck if I want to perfectly restore it after the fact. :-/
02:19:35<systwi_>It looks like -N/--name are also used, if I'm not mistaken.
02:19:59<systwi_>*in the original creation of the gzip archive
02:21:04<systwi_>Maybe _that's_ why `megawarc` creates the JSON based on gzip archives and not the uncompressed WARC.
02:21:25<@JAA>Yes, and a timestamp I think.
02:21:26<systwi_>s/creates the JSON/records the offsets/
02:21:45<systwi_>Ah, right, that looks to be correct as well.
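The per-record gzip structure can be demonstrated with plain shell tools. Toy strings stand in for real WARC records here; the point is only that a stream of concatenated gzip members decompresses to the same data, but is not byte-identical to gzipping everything in one pass:

```shell
# Compress two "records" as separate gzip members, then concatenate.
printf 'record1' | gzip > rec1.gz
printf 'record2' | gzip > rec2.gz
cat rec1.gz rec2.gz > per-record.gz

# Compress the same data in a single pass for comparison.
printf 'record1record2' | gzip > whole.gz

# Concatenated members decompress as one stream: prints record1record2
gzip -dc per-record.gz
# ...but the compressed bytes differ from the single-pass file:
cmp -s per-record.gz whole.gz || echo 'different compressed bytes'
```

This is why a re-`gzip`ped .warc can never match the original .warc.gz byte-for-byte, regardless of compression level.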
02:22:16<@JAA>I'm not sure the JSON is actually relevant here. If you keep things compressed, you can perfectly extract the input WARCs again just from offset+size.
02:24:53<systwi_>...as `megawarc` simply `cat`s the archives together while simultaneously recording their size and outfile position.
02:25:02<systwi_>If I'm following correctly.
02:26:03TheTechRobo quits [Client Quit]
02:26:20TheTechRobo (TheTechRobo) joins
02:28:13TheTechRobo quits [Remote host closed the connection]
02:28:29TheTechRobo (TheTechRobo) joins
02:28:54TheTechRobo quits [Remote host closed the connection]
02:29:11TheTechRobo (TheTechRobo) joins
02:32:39<@JAA>More or less, yeah.
02:43:07<systwi_>Oh, re: the `dd' blocksize and count, in the past I had gone the easiest, yet probably slowest route:
02:43:38<systwi_>count=7388830594 bs=1
02:44:10<systwi_>I'm sure that's a lousy way of going about it.
02:49:45<@JAA>Yeah, that destroys any semblance of performance.
02:50:03<@JAA>tail|head is certainly more efficient than that. :-)
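For reference, GNU `dd` can also take byte-granular skip/count while keeping a large block size, which avoids the `bs=1` crawl entirely (the `skip_bytes`/`count_bytes` input flags are GNU coreutils extensions, not POSIX):

```shell
# Byte-accurate extraction with a sane block size (GNU dd only).
printf '0123456789' > sample.bin
# skip and count are interpreted as bytes thanks to iflag; bs stays
# large for throughput. Prints the characters "345".
dd if=sample.bin bs=64k skip=3 count=3 \
   iflag=skip_bytes,count_bytes status=none
```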
03:10:03qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds]
04:11:46dunger quits [Ping timeout: 240 seconds]
05:48:16systwi quits [Ping timeout: 240 seconds]
05:49:03systwi (systwi) joins
06:28:05BlueMaxima quits [Client Quit]
07:43:30Barto quits [Read error: Connection reset by peer]
08:29:32tzt quits [Ping timeout: 265 seconds]
09:20:50qwertyasdfuiopghjkl joins
09:37:02Barto (Barto) joins
10:55:00Megame (Megame) joins
11:01:44tzt (tzt) joins
11:28:09@Fusl quits [Excess Flood]
11:28:25Fusl (Fusl) joins
11:28:25@ChanServ sets mode: +o Fusl
11:36:28LeGoupil joins
11:56:00qwertyasdfuiopghjkl quits [Client Quit]
11:56:27qwertyasdfuiopghjkl joins
11:56:30LeGoupil quits [Read error: Connection reset by peer]
12:04:18LeGoupil joins
12:20:03TheTechRobo quits [Remote host closed the connection]
12:20:25TheTechRobo (TheTechRobo) joins
13:03:46march_happy quits [Ping timeout: 240 seconds]
13:04:38march_happy (march_happy) joins
13:33:51Megame quits [Remote host closed the connection]
13:34:09Megame (Megame) joins
13:39:45LeGoupil quits [Client Quit]
14:13:46march_happy quits [Ping timeout: 240 seconds]
14:14:24march_happy (march_happy) joins
14:18:46march_happy quits [Ping timeout: 240 seconds]
14:19:46march_happy (march_happy) joins
14:37:16Arcorann quits [Ping timeout: 240 seconds]
15:11:15Megame quits [Client Quit]
15:43:01sec^nd quits [Remote host closed the connection]
15:43:30AlsoHP_Archivist joins
15:43:32sec^nd (second) joins
15:46:46HP_Archivist quits [Ping timeout: 240 seconds]
16:00:16sec^nd quits [Ping timeout: 240 seconds]
16:01:20sec^nd (second) joins
17:06:49elsagatearchive joins
17:07:17<elsagatearchive>Can I archive large amounts of Elsagate scandal videos to the IA?
17:07:37<elsagatearchive>youtube is starting to ban it
17:08:19<elsagatearchive>But how much is too much before IA gets mad?
17:08:30elsagatearchive quits [Remote host closed the connection]
17:09:28elsagatearchive joins
17:09:31elsagatearchive quits [Remote host closed the connection]
17:20:28tech_exorcist (tech_exorcist) joins
17:28:27march_happy quits [Ping timeout: 265 seconds]
17:28:49tech_exorcist quits [Remote host closed the connection]
17:38:14<Jake>I know you are gone, but email IA and see what they have to say. Throwing YT content into IA is usually frowned upon.
17:43:51@Fusl quits [Excess Flood]
17:44:06Fusl (Fusl) joins
17:44:06@ChanServ sets mode: +o Fusl
17:47:32<@JAA>Ah, bonga's at it again.
17:48:18<TheTechRobo>Out of the loop, what is Elsagate?
17:48:56<TheTechRobo>Oh, this is horrifying.
17:49:40tech_exorcist (tech_exorcist) joins
17:54:30<Jake>Ah, didn't even realize it was bonga.
18:16:44tech_exorcist quits [Remote host closed the connection]
18:17:16tech_exorcist (tech_exorcist) joins
18:47:28tech_exorcist quits [Remote host closed the connection]
18:48:14tech_exorcist (tech_exorcist) joins
19:05:24tech_exorcist quits [Remote host closed the connection]
19:19:16sec^nd quits [Ping timeout: 240 seconds]
19:24:26Ryz quits [Read error: Connection reset by peer]
19:24:36Ryz9 (Ryz) joins
19:25:01Ryz9 is now known as Ryz
19:25:25sec^nd (second) joins
19:27:16Stiletto quits [Ping timeout: 240 seconds]
19:27:30Stiletto joins
19:33:29tech_exorcist (tech_exorcist) joins
20:06:46<Frogging101>>Throwing YT content into IA is usually frowned upon.
20:06:48<Frogging101>what?
20:08:29tech_exorcist quits [Read error: Connection reset by peer]
20:09:34tech_exorcist (tech_exorcist) joins
20:13:05<systwi_>Frogging101: IA generally aren't very fond of receiving mass amounts of YT videos, especially redundant content or content of little to no value (e.g. ~5hr clickbait rubbish videos).
20:13:13<Frogging101>ah
20:13:39<systwi_>Most Elsagate content I've seen appear on YT would fall under the "content of little to no value" category, IMO.
20:13:57<systwi_>I understand saving a handful of high-profile ones, but certainly not all of them.
20:19:51tech_exorcist_ (tech_exorcist) joins
20:23:26tech_exorcist quits [Remote host closed the connection]
20:29:46sec^nd quits [Ping timeout: 240 seconds]
20:35:17sec^nd (second) joins
20:41:17C4K3 joins
21:09:16AlsoHP_Archivist quits [Ping timeout: 240 seconds]
21:14:26HackMii_ quits [Remote host closed the connection]
21:14:59HackMii_ (hacktheplanet) joins
21:15:27tech_exorcist_ quits [Client Quit]
21:24:08HP_Archivist (HP_Archivist) joins
21:34:57sec^nd quits [Remote host closed the connection]
21:35:39sec^nd (second) joins
21:49:52<@arkiver>stuff from youtube that we know is going to be deleted can be archived with #down-the-tube
21:50:08<@arkiver>it will get the videos into the Wayback Machine, definitely the best way to save them
22:01:26march_happy (march_happy) joins
22:24:44qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds]
22:31:36driib5 quits [Quit: The Lounge - https://thelounge.chat]
22:31:53driib5 (driib) joins
22:46:58march_happy quits [Ping timeout: 265 seconds]
22:47:38march_happy (march_happy) joins
23:00:58nepeat quits [Quit: ZNC - https://znc.in]
23:00:58HP_Archivist quits [Remote host closed the connection]
23:01:11HP_Archivist (HP_Archivist) joins
23:01:19nepeat (nepeat) joins
23:12:53march_happy quits [Read error: Connection reset by peer]
23:13:22march_happy (march_happy) joins
23:47:23march_happy quits [Ping timeout: 265 seconds]
23:47:31march_happy (march_happy) joins
23:51:46march_happy quits [Ping timeout: 240 seconds]
23:52:00march_happy (march_happy) joins