| 00:38:16 | <systwi_> | Regarding [megawarc](https://github.com/ArchiveTeam/megawarc), is it possible to `megawarc restore` if the previously created megawarc.warc.gz file is gone (the archive was decompressed and tossed; only the megawarc.warc remains)? |
| 00:40:08 | <systwi_> | Doing some tests, even recompressing by simply using `gzip megawarc.warc` gives errors upon restoring: Exception: End of file: x bytes expected, but y bytes read. |
| 00:40:42 | <@JAA> | No. The .megawarc.json.gz contains offsets of each input WARC. |
| 00:40:53 | <systwi_> | I thought it might decompress the .gz _first_, then restore the original structure, instead of recreating it on the fly (which is what it sounds like it's doing). |
| 00:41:18 | <@JAA> | So if those offsets no longer match what you have, it will break, either with a clear error or in hilarious ways. |
| 00:41:52 | <systwi_> | ...of each input WARC in the original gzipped output (if I'm not mistaken). |
| 00:42:02 | <@JAA> | Correct |
| 00:42:24 | <@JAA> | The input WARCs are simply appended to the megawarc, and that offset plus size gets written to the JSON file. |
| 00:42:29 | | Webuser734 joins |
| 00:42:35 | <Webuser734> | Hello |
| 00:43:35 | <Webuser734> | I'm trying to retrieve an archived listserv on Syracuse University's servers called ANI-L. I know someone who still has access to it, but he is hesitant to give me his login information for obvious reasons. Is there another thing I can do? |
| 00:43:36 | <systwi_> | JAA: Ahh. Is it not possible for the .json to have the WARC offsets referencing the uncompressed megawarc? |
| 00:43:52 | <systwi_> | I'm sure forking it and modifying it myself is the alternative. :-P |
| 00:44:11 | <@JAA> | systwi_: Not with standard megawarc, but I'm sure it's possible to implement it, yeah. |
| 00:44:20 | <Webuser734> | The list owners are either hard to reach or may not have access anymore |
| 00:46:57 | <@JAA> | Webuser734: Define 'archived'? I'm guessing that's a different definition than ours. |
| 00:47:10 | <systwi_> | Webuser734: I'm not familiar with their servers. Unfortunately, unless somebody else here has login credentials there or mirrored/saved the file(s) elsewhere there likely isn't much we can do. :-/ |
| 00:47:36 | <Webuser734> | The list has been deactivated but it is still available for viewing by people who joined it when it was active |
| 00:48:07 | <Webuser734> | It was a private listserv |
| 00:49:00 | | Arcorann (Arcorann) joins |
| 00:50:07 | <@JAA> | Yeah, your only chance is the actual people there then, either a subscriber or an admin. |
| 00:50:19 | <Webuser734> | Ok |
| 00:50:36 | <@JAA> | There are a few snapshots in the Wayback Machine, but they're just login walls as expected. |
| 00:50:56 | <Webuser734> | One of the list owners is active on FB and I'm trying to see if he could let me in but otherwise I'll need to brainstorm how they can compile the data and send it to me |
| 00:52:12 | <@JAA> | Just in case: If they're interested in an actual archive getting created for future reference, we can take care of that. We'd need credentials for it, and the data wouldn't be easily accessible (if public at all). |
| 00:53:01 | <Webuser734> | I want it to be public |
| 00:53:42 | <Webuser734> | Or at least for there to be some way I can screencap it. I've been putting together a public archive of the Neurodiversity Movement and ANI-L was very important to its foundations |
| 00:53:43 | <@JAA> | Well, that's something the admins or list moderators need to decide. I'm guessing there's a reason why they kept it private for the past decades. |
| 00:54:14 | <Webuser734> | They did it because they wanted to be careful about who joined it |
| 00:54:28 | <@JAA> | But in general, the more important thing is to preserve the data, whether it be accessible now or only in several decades is kind of unrelated to that. |
| 00:54:30 | <Webuser734> | But people don't use it anymore so the only reason to join it would be for archival purposes |
| 00:54:41 | <@JAA> | Right |
| 00:55:10 | <@JAA> | It's a sensitive topic though. People might have been posting on the list with the knowledge that it's not public, for example. |
| 00:55:58 | <Webuser734> | I contacted Syracuse's IT department a few months ago and they said someone who owns the list or still has access would be the ones to consult iirc |
| 00:56:09 | <@JAA> | That makes sense, yeah. |
| 00:56:24 | <Webuser734> | I've kinda procrastinated for a few months and now wanna see if I can do something. I just have the sudden urge to. |
| 00:57:53 | <Webuser734> | So what are the next steps I should take? |
| 01:04:20 | | thetechrobo_ joins |
| 01:04:57 | <systwi_> | I guess, try sending emails/messages to anybody you know that might still have, or currently has, access to the server. See if you can gain access to the data; if you do, save what you need and ask administrators about full archival afterwards. Best of luck. |
| 01:05:19 | <Webuser734> | Thanks |
| 01:05:21 | | TheTechRobo quits [Ping timeout: 265 seconds] |
| 01:05:35 | | Webuser734 quits [Remote host closed the connection] |
| 01:05:36 | <systwi_> | You're welcome, hope that helps. |
| 01:06:12 | | systwi is now known as TheTechRobo |
| 01:06:23 | | TheTechRobo is now known as systwi |
| 01:17:33 | | thetechrobo_ quits [Remote host closed the connection] |
| 01:17:55 | | thetechrobo_ joins |
| 01:20:59 | | TheTechRobo joins |
| 01:21:14 | | TheTechRobo is now authenticated as TheTechRobo |
| 01:22:04 | | thetechrobo_ quits [Ping timeout: 240 seconds] |
| 01:41:45 | | thetechrobo_ joins |
| 01:42:30 | | HP_Archivist (HP_Archivist) joins |
| 01:43:46 | | TheTechRobo quits [Ping timeout: 240 seconds] |
| 01:50:16 | | thetechrobo_ is now authenticated as TheTechRobo |
| 01:50:21 | | thetechrobo_ is now known as TheTechRobo |
| 02:00:27 | <systwi_> | JAA: re: megawarc creating/restoring, as an alternative, does a setup like this (general idea) look good?: |
| 02:00:31 | <systwi_> | https://paste.debian.net/plain/1251072 |
| 02:00:38 | <systwi_> | (simulated shell I/O) |
| 02:00:44 | <systwi_> | It doesn't retain the original .gz information, unfortunately, as the .warc.gz files would need to be decompressed (to .warc) before starting. |
| 02:02:17 | <@JAA> | systwi_: Uh, I'm not sure what you're trying to do exactly (why not just keep them compressed?), but ... yeah, I guess? |
| 02:03:13 | <@JAA> | The dd command should be fun if you can't break the WARC size into a reasonable block but it's also too large to just fit into RAM easily. (Bonus points if the file size is prime.) |
| 02:03:36 | <@JAA> | I'd probably do that with head/tail instead, but not sure about performance. |
| 02:04:11 | <@JAA> | E.g. `tail -c+1234 | head -c2345` to get the 2345 bytes starting at offset 1234. |
| 02:06:46 | <systwi_> | I don't want the WARCs compressed as I plan on compressing them, along with wpull logs from that job, et al., into one compressed tarball (or likely, alternatively, one larger archive containing several other types of data - out of the scope of this topic). |
| 02:07:45 | <systwi_> | Thanks for the tip re: `tail`, I'll have to experiment with that. |
| 02:08:20 | <systwi_> | Oh, `skip=z` should be included in that `dd` command. |
| 02:15:35 | <systwi_> | Related: is there a specific `gzip` syntax necessary in order to perfectly recreate the .warc.gz archive `megawarc` expects? `gzip example.com.warc` and `gzip -1 example.com.warc` (and varying compression effort) do not seem to work. |
| 02:16:02 | <systwi_> | They're either larger or smaller than the original archive. |
| 02:16:46 | <Jake> | WARC records are gzipped per record (I believe), so gzipping an uncompressed .warc wouldn't create the same file. |
| 02:17:27 | <@JAA> | Correct |
| 02:18:16 | <@JAA> | And gzip also has a few knobs. Compression level is the most obvious one, but there are other internal ones as well I think. |
| 02:19:03 | <systwi_> | I see. So I'm basically out of luck if I want to perfectly restore it after the fact. :-/ |
| 02:19:35 | <systwi_> | It looks like -N/--name are also used, if I'm not mistaken. |
| 02:19:59 | <systwi_> | *in the original creation of the gzip archive |
| 02:21:04 | <systwi_> | Maybe _that's_ why `megawarc` creates the JSON based on gzip archives and not the uncompressed WARC. |
| 02:21:25 | <@JAA> | Yes, and a timestamp I think. |
| 02:21:26 | <systwi_> | s/creates the JSON/records the offsets/ |
| 02:21:45 | <systwi_> | Ah, right, that looks to be correct as well. |
| 02:22:16 | <@JAA> | I'm not sure the JSON is actually relevant here. If you keep things compressed, you can perfectly extract the input WARCs again just from offset+size. |
| 02:24:53 | <systwi_> | ...as `megawarc` simply `cat`s the archives together while simultaneously recording their size and outfile position. |
| 02:25:02 | <systwi_> | If I'm following correctly. |
| 02:26:03 | | TheTechRobo quits [Client Quit] |
| 02:26:20 | | TheTechRobo (TheTechRobo) joins |
| 02:28:13 | | TheTechRobo quits [Remote host closed the connection] |
| 02:28:29 | | TheTechRobo (TheTechRobo) joins |
| 02:28:54 | | TheTechRobo quits [Remote host closed the connection] |
| 02:29:11 | | TheTechRobo (TheTechRobo) joins |
| 02:32:39 | <@JAA> | More or less, yeah. |
| 02:43:07 | <systwi_> | Oh, re: the `dd` blocksize and count, in the past I had gone the easiest, yet probably slowest route: |
| 02:43:38 | <systwi_> | count=7388830594 bs=1 |
| 02:44:10 | <systwi_> | I'm sure that's a lousy way of going about it. |
| 02:49:45 | <@JAA> | Yeah, that destroys any semblance of performance. |
| 02:50:03 | <@JAA> | tail|head is certainly more efficient than that. :-) |
| 03:10:03 | | qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds] |
| 04:11:46 | | dunger quits [Ping timeout: 240 seconds] |
| 05:48:16 | | systwi quits [Ping timeout: 240 seconds] |
| 05:49:03 | | systwi (systwi) joins |
| 06:28:05 | | BlueMaxima quits [Client Quit] |
| 07:43:30 | | Barto quits [Read error: Connection reset by peer] |
| 08:29:32 | | tzt quits [Ping timeout: 265 seconds] |
| 09:20:50 | | qwertyasdfuiopghjkl joins |
| 09:37:02 | | Barto (Barto) joins |
| 10:55:00 | | Megame (Megame) joins |
| 11:01:44 | | tzt (tzt) joins |
| 11:28:09 | | @Fusl quits [Excess Flood] |
| 11:28:25 | | Fusl (Fusl) joins |
| 11:28:25 | | @ChanServ sets mode: +o Fusl |
| 11:36:28 | | LeGoupil joins |
| 11:56:00 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 11:56:27 | | qwertyasdfuiopghjkl joins |
| 11:56:30 | | LeGoupil quits [Read error: Connection reset by peer] |
| 12:04:18 | | LeGoupil joins |
| 12:20:03 | | TheTechRobo quits [Remote host closed the connection] |
| 12:20:25 | | TheTechRobo (TheTechRobo) joins |
| 13:03:46 | | march_happy quits [Ping timeout: 240 seconds] |
| 13:04:38 | | march_happy (march_happy) joins |
| 13:33:51 | | Megame quits [Remote host closed the connection] |
| 13:34:09 | | Megame (Megame) joins |
| 13:39:45 | | LeGoupil quits [Client Quit] |
| 14:13:46 | | march_happy quits [Ping timeout: 240 seconds] |
| 14:14:24 | | march_happy (march_happy) joins |
| 14:18:46 | | march_happy quits [Ping timeout: 240 seconds] |
| 14:19:46 | | march_happy (march_happy) joins |
| 14:37:16 | | Arcorann quits [Ping timeout: 240 seconds] |
| 15:11:15 | | Megame quits [Client Quit] |
| 15:43:01 | | sec^nd quits [Remote host closed the connection] |
| 15:43:30 | | AlsoHP_Archivist joins |
| 15:43:32 | | sec^nd (second) joins |
| 15:46:46 | | HP_Archivist quits [Ping timeout: 240 seconds] |
| 16:00:16 | | sec^nd quits [Ping timeout: 240 seconds] |
| 16:01:20 | | sec^nd (second) joins |
| 17:06:49 | | elsagatearchive joins |
| 17:07:17 | <elsagatearchive> | Can I archive large amounts of Elsagate scandal videos to the IA? |
| 17:07:37 | <elsagatearchive> | youtube is starting to ban it |
| 17:08:19 | <elsagatearchive> | But it is how much until IA gets mad |
| 17:08:30 | | elsagatearchive quits [Remote host closed the connection] |
| 17:09:28 | | elsagatearchive joins |
| 17:09:31 | | elsagatearchive quits [Remote host closed the connection] |
| 17:20:28 | | tech_exorcist (tech_exorcist) joins |
| 17:28:27 | | march_happy quits [Ping timeout: 265 seconds] |
| 17:28:49 | | tech_exorcist quits [Remote host closed the connection] |
| 17:38:14 | <Jake> | I know you are gone, but email IA and see what they have to say. Throwing YT content into IA is usually frowned upon. |
| 17:43:51 | | @Fusl quits [Excess Flood] |
| 17:44:06 | | Fusl (Fusl) joins |
| 17:44:06 | | @ChanServ sets mode: +o Fusl |
| 17:47:32 | <@JAA> | Ah, bonga's at it again. |
| 17:48:18 | <TheTechRobo> | Out of the loop, what is Elsagate? |
| 17:48:56 | <TheTechRobo> | Oh, this is horrifying. |
| 17:49:40 | | tech_exorcist (tech_exorcist) joins |
| 17:53:36 | | Nemo_bis is now authenticated as Nemo_bis |
| 17:54:30 | <Jake> | Ah, didn't even realize it was bonga. |
| 18:16:44 | | tech_exorcist quits [Remote host closed the connection] |
| 18:17:16 | | tech_exorcist (tech_exorcist) joins |
| 18:47:28 | | tech_exorcist quits [Remote host closed the connection] |
| 18:48:14 | | tech_exorcist (tech_exorcist) joins |
| 19:05:24 | | tech_exorcist quits [Remote host closed the connection] |
| 19:19:16 | | sec^nd quits [Ping timeout: 240 seconds] |
| 19:24:26 | | Ryz quits [Read error: Connection reset by peer] |
| 19:24:36 | | Ryz9 (Ryz) joins |
| 19:25:01 | | Ryz9 is now known as Ryz |
| 19:25:25 | | sec^nd (second) joins |
| 19:27:16 | | Stiletto quits [Ping timeout: 240 seconds] |
| 19:27:30 | | Stiletto joins |
| 19:33:29 | | tech_exorcist (tech_exorcist) joins |
| 20:06:46 | <Frogging101> | >Throwing YT content into IA is usually frowned upon. |
| 20:06:48 | <Frogging101> | what? |
| 20:08:29 | | tech_exorcist quits [Read error: Connection reset by peer] |
| 20:09:34 | | tech_exorcist (tech_exorcist) joins |
| 20:13:05 | <systwi_> | Frogging101: IA generally aren't very fond of receiving mass amounts of YT videos, especially redundant content or content of little to no value (e.g. ~5hr clickbait rubbish videos). |
| 20:13:13 | <Frogging101> | ah |
| 20:13:39 | <systwi_> | Most Elsagate content I've seen appear on YT would fall under the "content of little to no value" category, IMO. |
| 20:13:57 | <systwi_> | I understand saving a handful of high-profile ones, but certainly not all of them. |
| 20:19:51 | | tech_exorcist_ (tech_exorcist) joins |
| 20:23:26 | | tech_exorcist quits [Remote host closed the connection] |
| 20:29:46 | | sec^nd quits [Ping timeout: 240 seconds] |
| 20:35:17 | | sec^nd (second) joins |
| 20:41:17 | | C4K3 joins |
| 20:41:17 | | C4K3 is now authenticated as C4K3 |
| 21:09:16 | | AlsoHP_Archivist quits [Ping timeout: 240 seconds] |
| 21:14:26 | | HackMii_ quits [Remote host closed the connection] |
| 21:14:59 | | HackMii_ (hacktheplanet) joins |
| 21:15:27 | | tech_exorcist_ quits [Client Quit] |
| 21:24:08 | | HP_Archivist (HP_Archivist) joins |
| 21:34:57 | | sec^nd quits [Remote host closed the connection] |
| 21:35:39 | | sec^nd (second) joins |
| 21:49:52 | <@arkiver> | stuff from youtube that we know is going to be deleted can be archived with #down-the-tube |
| 21:50:08 | <@arkiver> | it will get the videos into the Wayback Machine, definitely the best way to save them |
| 22:01:26 | | march_happy (march_happy) joins |
| 22:24:44 | | qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds] |
| 22:31:36 | | driib5 quits [Quit: The Lounge - https://thelounge.chat] |
| 22:31:53 | | driib5 (driib) joins |
| 22:46:58 | | march_happy quits [Ping timeout: 265 seconds] |
| 22:47:38 | | march_happy (march_happy) joins |
| 23:00:58 | | nepeat quits [Quit: ZNC - https://znc.in] |
| 23:00:58 | | HP_Archivist quits [Remote host closed the connection] |
| 23:01:11 | | HP_Archivist (HP_Archivist) joins |
| 23:01:19 | | nepeat (nepeat) joins |
| 23:12:53 | | march_happy quits [Read error: Connection reset by peer] |
| 23:13:22 | | march_happy (march_happy) joins |
| 23:47:23 | | march_happy quits [Ping timeout: 265 seconds] |
| 23:47:31 | | march_happy (march_happy) joins |
| 23:51:46 | | march_happy quits [Ping timeout: 240 seconds] |
| 23:52:00 | | march_happy (march_happy) joins |
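
The offset-plus-size idea from the megawarc discussion above (00:40 to 02:50) can be sketched in Python. This is an illustration only, not megawarc's actual code or JSON layout: the helpers `build_megawarc`, `copy_range`, and `extract`, and the index fields `name`, `offset`, and `size`, are hypothetical names. The sketch appends two small gzip files to one buffer, restores one of them byte for byte from its recorded range, and then shows that decompressing and recompressing the whole buffer leaves the old offsets pointing at the wrong bytes.

```python
import gzip
import io


def build_megawarc(inputs):
    """Append each input blob to one buffer and record where it landed.

    Hypothetical stand-in for the behaviour described in the chat,
    not megawarc's real implementation or index format.
    """
    mega = io.BytesIO()
    index = []
    for name, data in inputs:
        index.append({"name": name, "offset": mega.tell(), "size": len(data)})
        mega.write(data)
    return mega.getvalue(), index


def copy_range(src, dst, offset, size, chunk_size=1 << 20):
    """Copy `size` bytes starting at zero-based `offset`, in large chunks.

    Does the same job as `tail -c +$((offset + 1)) | head -c $size`
    (tail's +N is 1-based), without the cost of `dd bs=1`.
    """
    src.seek(offset)
    remaining = size
    while remaining > 0:
        chunk = src.read(min(chunk_size, remaining))
        if not chunk:
            done = size - remaining
            raise EOFError(f"{size} bytes expected, but {done} bytes read")
        dst.write(chunk)
        remaining -= len(chunk)


def extract(mega_bytes, entry):
    """Return the bytes of one indexed input from the combined buffer."""
    out = io.BytesIO()
    copy_range(io.BytesIO(mega_bytes), out, entry["offset"], entry["size"])
    return out.getvalue()


# Two tiny stand-ins for compressed input WARCs (mtime=0 keeps output stable).
a = gzip.compress(b"record one\n", mtime=0)
b = gzip.compress(b"record two\n", mtime=0)
mega, index = build_megawarc([("a.warc.gz", a), ("b.warc.gz", b)])

# Offsets into the still-compressed buffer give back byte-identical inputs.
restored_ok = extract(mega, index[1]) == b

# Decompressing a multi-member gzip yields the concatenated payloads ...
plain = gzip.decompress(mega)
# ... but recompressing them as a single stream produces different bytes,
# so the recorded offsets no longer describe the file.
recompressed = gzip.compress(plain, mtime=0)
try:
    extract(recompressed, index[1])
    stale_index_error = None
except EOFError as exc:
    stale_index_error = str(exc)

print("restored byte-identical:", restored_ok)
print("recompressed equals original:", recompressed == mega)
print("stale index result:", stale_index_error)
```

Reading in chunks of up to 1 MiB sidesteps the block-size problem raised for `dd`: the range does not need to divide evenly into blocks, and memory use stays bounded no matter how large the range is.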