#archiveteam-bs log for 2022-08-20

00:38:16	<systwi_>	Regarding [megawarc](https://github.com/ArchiveTeam/megawarc), is it possible to `megawarc restore` if the previously created megawarc.warc.gz file is gone (the archive was decompressed and tossed; only the megawarc.warc remains)?
00:40:08	<systwi_>	Doing some tests, even recompressing by simply using `gzip megawarc.warc` gives errors upon restoring: Exception: End of file: x bytes expected, but y bytes read.
00:40:42	<@JAA>	No. The .megawarc.json.gz contains offsets of each input WARC.
00:40:53	<systwi_>	I thought it might decompress the .gz _first_, then restore the original structure, instead of recreating it on the fly (which is what it sounds like it's doing).
00:41:18	<@JAA>	So if those offsets no longer match what you have, it will break, either with a clear error or in hilarious ways.
00:41:52	<systwi_>	...of each input WARC in the original gzipped output (if I'm not mistaken).
00:42:02	<@JAA>	Correct
00:42:24	<@JAA>	The input WARCs are simply appended to the megawarc, and that offset plus size gets written to the JSON file.
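
The append-and-record scheme described above can be sketched in a few lines. This is a minimal illustration, not megawarc's actual code: the function names `pack` and `restore` and the index fields `name`, `offset`, and `size` are hypothetical and do not reflect the real `.megawarc.json.gz` schema. It shows why the index only works against the exact bytes it was built from.

```python
import io
import json


def pack(inputs):
    """Concatenate named byte blobs; return (combined bytes, index)."""
    out = io.BytesIO()
    index = []
    for name, data in inputs:
        # Record where this input starts in the combined stream and its length.
        index.append({"name": name, "offset": out.tell(), "size": len(data)})
        out.write(data)
    return out.getvalue(), index


def restore(combined, index):
    """Slice each input back out of the combined bytes using offset + size."""
    restored = {}
    for entry in index:
        start = entry["offset"]
        end = start + entry["size"]
        if end > len(combined):
            got = max(0, len(combined) - start)
            raise ValueError(
                f"{entry['name']}: {entry['size']} bytes expected, but {got} bytes read"
            )
        restored[entry["name"]] = combined[start:end]
    return restored


# Hypothetical stand-ins for compressed input WARCs.
inputs = [("a.warc.gz", b"first input"), ("b.warc.gz", b"second, longer input")]
combined, index = pack(inputs)
index_json = json.dumps(index)  # the index survives a JSON round trip
restored = restore(combined, json.loads(index_json))
```

If `combined` is replaced by anything with a different byte layout (for example a decompressed or recompressed copy), the recorded offsets no longer point at input boundaries, so `restore` either raises or silently returns the wrong slices.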
00:42:29		Webuser734 joins
00:42:35	<Webuser734>	Hello
00:43:35	<Webuser734>	I'm trying to retrieve an archived listserv on Syracuse University's servers called ANI-L. I know someone who still has access to it, but he is hesitant to give me his login information for obvious reasons. Is there another thing I can do?
00:43:36	<systwi_>	JAA: Ahh. Is it not possible for the .json to have the WARC offsets referencing the uncompressed megawarc?
00:43:52	<systwi_>	I'm sure forking it and modifying it myself is the alternative. :-P
00:44:11	<@JAA>	systwi_: Not with standard megawarc, but I'm sure it's possible to implement it, yeah.
00:44:20	<Webuser734>	The list owners are either hard to reach or may not have access anymore
00:46:57	<@JAA>	Webuser734: Define 'archived'? I'm guessing that's a different definition than ours.
00:47:10	<systwi_>	Webuser734: I'm not familiar with their servers. Unfortunately, unless somebody else here has login credentials there or mirrored/saved the file(s) elsewhere there likely isn't much we can do. :-/
00:47:36	<Webuser734>	The list has been deactivated but it is still available for viewing by people who joined it when it was active
00:48:07	<Webuser734>	It was a private listserv
00:49:00		Arcorann (Arcorann) joins
00:50:07	<@JAA>	Yeah, your only chance is the actual people there then, either a subscriber or an admin.
00:50:19	<Webuser734>	Ok
00:50:36	<@JAA>	There are a few snapshots in the Wayback Machine, but they're just login walls as expected.
00:50:56	<Webuser734>	One of the list owners is active on FB and I'm trying to see if he could let me in but otherwise I'll need to brainstorm how they can compile the data and send it to me
00:52:12	<@JAA>	Just in case: If they're interested in an actual archive getting created for future reference, we can take care of that. We'd need credentials for it, and the data wouldn't be easily accessible (if public at all).
00:53:01	<Webuser734>	I want it to be public
00:53:42	<Webuser734>	Or at least for there to be some way I can screencap it. I've been putting together a public archive of the Neurodiversity Movement and ANI-L was very important to its foundations
00:53:43	<@JAA>	Well, that's something the admins or list moderators need to decide. I'm guessing there's a reason why they kept it private for the past decades.
00:54:14	<Webuser734>	They did it because they wanted to be careful about who joined it
00:54:28	<@JAA>	But in general, the more important thing is to preserve the data, whether it be accessible now or only in several decades is kind of unrelated to that.
00:54:30	<Webuser734>	But people don't use it anymore so the only reason to join it would be for archival purposes
00:54:41	<@JAA>	Right
00:55:10	<@JAA>	It's a sensitive topic though. People might have been posting on the list with the knowledge that it's not public, for example.
00:55:58	<Webuser734>	I contacted Syracuse's IT department a few months ago and they said someone who owns the list or still has access would be the ones to consult iirc
00:56:09	<@JAA>	That makes sense, yeah.
00:56:24	<Webuser734>	I've kinda procrastinated for a few months and now wanna see if I can do something. I just have the sudden urge to.
00:57:53	<Webuser734>	So what are the next steps I should take?
01:04:20		thetechrobo_ joins
01:04:57	<systwi_>	I guess, try sending emails/messages to anybody you know that might still have, or currently has, access to the server. See if you can gain access to the data; if you do, save what you need and ask administrators about full archival afterwards. Best of luck.
01:05:19	<Webuser734>	Thanks
01:05:21		TheTechRobo quits [Ping timeout: 265 seconds]
01:05:35		Webuser734 quits [Remote host closed the connection]
01:05:36	<systwi_>	You're welcome, hope that helps.
01:06:12		systwi is now known as TheTechRobo
01:06:23		TheTechRobo is now known as systwi
01:17:33		thetechrobo_ quits [Remote host closed the connection]
01:17:55		thetechrobo_ joins
01:20:59		TheTechRobo joins
01:21:14		TheTechRobo is now authenticated as TheTechRobo
01:22:04		thetechrobo_ quits [Ping timeout: 240 seconds]
01:41:45		thetechrobo_ joins
01:42:30		HP_Archivist (HP_Archivist) joins
01:43:46		TheTechRobo quits [Ping timeout: 240 seconds]
01:50:16		thetechrobo_ is now authenticated as TheTechRobo
01:50:21		thetechrobo_ is now known as TheTechRobo
02:00:27	<systwi_>	JAA: re: megawarc creating/restoring, as an alternative, does a setup like this (general idea) look good?:
02:00:31	<systwi_>	https://paste.debian.net/plain/1251072
02:00:38	<systwi_>	(simulated shell I/O)
02:00:44	<systwi_>	It doesn't retain the original .gz information, unfortunately, as the .warc.gz files would need to be decompressed (to .warc) before starting.
02:02:17	<@JAA>	systwi_: Uh, I'm not sure what you're trying to do exactly (why not just keep them compressed?), but ... yeah, I guess?
02:03:13	<@JAA>	The dd command should be fun if you can't break the WARC size into a reasonable block but it's also too large to just fit into RAM easily. (Bonus points if the file size is prime.)
02:03:36	<@JAA>	I'd probably do that with head/tail instead, but not sure about performance.
02:04:11	<@JAA>	E.g. `tail -c+1234 | head -c2345` to get the 2345 bytes starting at offset 1234.
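
The same byte-range extraction can be sketched in Python for a seekable file, reading in bounded chunks so that neither a one-byte block size nor holding the whole range in RAM is needed. The helper name `copy_range` and its parameters are hypothetical. Note that `tail -c +N` counts bytes from 1, so a zero-based offset of 1234 corresponds to `tail -c +1235`; the sketch below uses zero-based offsets.

```python
import io


def copy_range(src, dst, offset, size, chunk_size=1024 * 1024):
    """Copy `size` bytes starting at zero-based `offset` from src to dst."""
    src.seek(offset)
    remaining = size
    while remaining > 0:
        # Never read more than one chunk, and never past the requested range.
        chunk = src.read(min(chunk_size, remaining))
        if not chunk:
            raise EOFError(f"{remaining} bytes still expected at end of file")
        dst.write(chunk)
        remaining -= len(chunk)


# In-memory stand-in for a large file; real use would pass open(path, "rb").
data = bytes(range(256)) * 4
src = io.BytesIO(data)
dst = io.BytesIO()
copy_range(src, dst, offset=100, size=300, chunk_size=64)
extracted = dst.getvalue()
```

The chunk size is independent of the range size, so an awkward (even prime) length costs nothing extra.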
02:06:46	<systwi_>	I don't want the WARCs compressed as I plan on compressing them, along with wpull logs from that job, et al., into one compressed tarball (or likely, alternatively, one larger archive containing several other types of data - out of the scope of this topic).
02:07:45	<systwi_>	Thanks for the tip re: `tail`, I'll have to experiment with that.
02:08:20	<systwi_>	Oh, `skip=z` should be included in that `dd` command.
02:15:35	<systwi_>	Related: is there a specific `gzip` syntax necessary in order to perfectly recreate the .warc.gz archive `megawarc` expects? `gzip example.com.warc` and `gzip -1 example.com.warc` (and varying compression effort) do not seem to work.
02:16:02	<systwi_>	They're either larger or smaller than the original archive.
02:16:56	<Jake>	WARC records are gzipped per record (I believe), so gzipping an uncompressed .warc wouldn't create the same file, I believe.
02:17:27	<@JAA>	Correct
02:18:16	<@JAA>	And gzip also has a few knobs. Compression level is the most obvious one, but there are other internal ones as well I think.
02:19:03	<systwi_>	I see. So I'm basically out of luck if I want to perfectly restore it after the fact. :-/
02:19:35	<systwi_>	It looks like -N/--name are also used, if I'm not mistaken.
02:19:59	<systwi_>	*in the original creation of the gzip archive
02:21:04	<systwi_>	Maybe _that's_ why `megawarc` creates the JSON based on gzip archives and not the uncompressed WARC.
02:21:25	<@JAA>	Yes, and a timestamp I think.
02:21:26	<systwi_>	s/creates the JSON/records the offsets/
02:21:45	<systwi_>	Ah, right, that looks to be correct as well.
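
A small experiment with Python's `gzip` module illustrates the points above. It is only a sketch with made-up record contents, and it does not reproduce what any particular WARC writer does: it shows that per-record compression (concatenated gzip members) and whole-file compression decompress to the same data but are different byte streams, and that the header timestamp and stored filename also change the output.

```python
import gzip
import io

# Made-up stand-ins for two WARC records.
records = [b"WARC/1.0 record one\r\n" * 50, b"WARC/1.0 record two\r\n" * 50]
payload = b"".join(records)

# One gzip member per record, concatenated, versus one member for everything.
per_record = b"".join(gzip.compress(r, mtime=0) for r in records)
whole = gzip.compress(payload, mtime=0)

# Both decompress to the same payload, yet the compressed bytes differ.
same_payload = gzip.decompress(per_record) == gzip.decompress(whole) == payload
same_bytes = per_record == whole

# The header timestamp alone changes the compressed output.
mtime_matters = gzip.compress(payload, mtime=1) != gzip.compress(payload, mtime=2)


def gz_named(data, name):
    """Compress with a given original filename stored in the gzip header."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename=name, mode="wb", fileobj=buf, mtime=0) as f:
        f.write(data)
    return buf.getvalue()


# So does the stored original filename.
name_matters = gz_named(payload, "example.com.warc") != gz_named(payload, "")
```

Since offsets are recorded against the compressed bytes, any of these differences is enough to invalidate them after recompression.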
02:22:16	<@JAA>	I'm not sure the JSON is actually relevant here. If you keep things compressed, you can perfectly extract the input WARCs again just from offset+size.
02:24:53	<systwi_>	...as `megawarc` simply `cat`s the archives together while simultaneously recording their size and outfile position.
02:25:02	<systwi_>	If I'm following correctly.
02:26:03		TheTechRobo quits [Client Quit]
02:26:20		TheTechRobo (TheTechRobo) joins
02:28:13		TheTechRobo quits [Remote host closed the connection]
02:28:29		TheTechRobo (TheTechRobo) joins
02:28:54		TheTechRobo quits [Remote host closed the connection]
02:29:11		TheTechRobo (TheTechRobo) joins
02:32:39	<@JAA>	More or less, yeah.
02:43:07	<systwi_>	Oh, re: the `dd` blocksize and count, in the past I had gone the easiest, yet probably slowest route:
02:43:38	<systwi_>	count=7388830594 bs=1
02:44:10	<systwi_>	I'm sure that's a lousy way of going about it.
02:49:45	<@JAA>	Yeah, that destroys any semblance of performance.
02:50:03	<@JAA>	tail|head is certainly more efficient than that. :-)
03:10:03		qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds]
04:11:46		dunger quits [Ping timeout: 240 seconds]
05:48:16		systwi quits [Ping timeout: 240 seconds]
05:49:03		systwi (systwi) joins
06:28:05		BlueMaxima quits [Client Quit]
07:43:30		Barto quits [Read error: Connection reset by peer]
08:29:32		tzt quits [Ping timeout: 265 seconds]
09:20:50		qwertyasdfuiopghjkl joins
09:37:02		Barto (Barto) joins
10:55:00		Megame (Megame) joins
11:01:44		tzt (tzt) joins
11:28:09		@Fusl quits [Excess Flood]
11:28:25		Fusl (Fusl) joins
11:28:25		@ChanServ sets mode: +o Fusl
11:36:28		LeGoupil joins
11:56:00		qwertyasdfuiopghjkl quits [Client Quit]
11:56:27		qwertyasdfuiopghjkl joins
11:56:30		LeGoupil quits [Read error: Connection reset by peer]
12:04:18		LeGoupil joins
12:20:03		TheTechRobo quits [Remote host closed the connection]
12:20:25		TheTechRobo (TheTechRobo) joins
13:03:46		march_happy quits [Ping timeout: 240 seconds]
13:04:38		march_happy (march_happy) joins
13:33:51		Megame quits [Remote host closed the connection]
13:34:09		Megame (Megame) joins
13:39:45		LeGoupil quits [Client Quit]
14:13:46		march_happy quits [Ping timeout: 240 seconds]
14:14:24		march_happy (march_happy) joins
14:18:46		march_happy quits [Ping timeout: 240 seconds]
14:19:46		march_happy (march_happy) joins
14:37:16		Arcorann quits [Ping timeout: 240 seconds]
15:11:15		Megame quits [Client Quit]
15:43:01		sec^nd quits [Remote host closed the connection]
15:43:30		AlsoHP_Archivist joins
15:43:32		sec^nd (second) joins
15:46:46		HP_Archivist quits [Ping timeout: 240 seconds]
16:00:16		sec^nd quits [Ping timeout: 240 seconds]
16:01:20		sec^nd (second) joins
17:06:49		elsagatearchive joins
17:07:17	<elsagatearchive>	Can I archive large amounts of Elsagate scandal videos to the IA?
17:07:37	<elsagatearchive>	youtube is starting to ban it
17:08:19	<elsagatearchive>	But it is how much until IA gets mad
17:08:30		elsagatearchive quits [Remote host closed the connection]
17:09:28		elsagatearchive joins
17:09:31		elsagatearchive quits [Remote host closed the connection]
17:20:28		tech_exorcist (tech_exorcist) joins
17:28:27		march_happy quits [Ping timeout: 265 seconds]
17:28:49		tech_exorcist quits [Remote host closed the connection]
17:38:14	<Jake>	I know you are gone, but email IA and see what they have to say. Throwing YT content into IA is usually frowned upon.
17:43:51		@Fusl quits [Excess Flood]
17:44:06		Fusl (Fusl) joins
17:44:06		@ChanServ sets mode: +o Fusl
17:47:32	<@JAA>	Ah, bonga's at it again.
17:48:18	<TheTechRobo>	Out of the loop, what is Elsagate?
17:48:56	<TheTechRobo>	Oh, this is horrifying.
17:49:40		tech_exorcist (tech_exorcist) joins
17:53:36		Nemo_bis is now authenticated as Nemo_bis
17:54:30	<Jake>	Ah, didn't even realize it was bonga.
18:16:44		tech_exorcist quits [Remote host closed the connection]
18:17:16		tech_exorcist (tech_exorcist) joins
18:47:28		tech_exorcist quits [Remote host closed the connection]
18:48:14		tech_exorcist (tech_exorcist) joins
19:05:24		tech_exorcist quits [Remote host closed the connection]
19:19:16		sec^nd quits [Ping timeout: 240 seconds]
19:24:26		Ryz quits [Read error: Connection reset by peer]
19:24:36		Ryz9 (Ryz) joins
19:25:01		Ryz9 is now known as Ryz
19:25:25		sec^nd (second) joins
19:27:16		Stiletto quits [Ping timeout: 240 seconds]
19:27:30		Stiletto joins
19:33:29		tech_exorcist (tech_exorcist) joins
20:06:46	<Frogging101>	>Throwing YT content into IA is usually frowned upon.
20:06:48	<Frogging101>	what?
20:08:29		tech_exorcist quits [Read error: Connection reset by peer]
20:09:34		tech_exorcist (tech_exorcist) joins
20:13:05	<systwi_>	Frogging101: IA generally aren't very fond of receiving mass amounts of YT videos, especially redundant content or content of little to no value (e.g. ~5hr clickbait rubbish videos).
20:13:13	<Frogging101>	ah
20:13:39	<systwi_>	Most Elsagate content I've seen appear on YT would fall under the "content of little to no value" category, IMO.
20:13:57	<systwi_>	I understand saving a handful of high-profile ones, but certainly not all of them.
20:19:51		tech_exorcist_ (tech_exorcist) joins
20:23:26		tech_exorcist quits [Remote host closed the connection]
20:29:46		sec^nd quits [Ping timeout: 240 seconds]
20:35:17		sec^nd (second) joins
20:41:17		C4K3 joins
20:41:17		C4K3 is now authenticated as C4K3
21:09:16		AlsoHP_Archivist quits [Ping timeout: 240 seconds]
21:14:26		HackMii_ quits [Remote host closed the connection]
21:14:59		HackMii_ (hacktheplanet) joins
21:15:27		tech_exorcist_ quits [Client Quit]
21:24:08		HP_Archivist (HP_Archivist) joins
21:34:57		sec^nd quits [Remote host closed the connection]
21:35:39		sec^nd (second) joins
21:49:52	<@arkiver>	stuff from youtube that we know is going to be deleted can be archived with #down-the-tube
21:50:08	<@arkiver>	it will get the videos into the Wayback Machine, definitely the best way to save them
22:01:26		march_happy (march_happy) joins
22:24:44		qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds]
22:31:36		driib5 quits [Quit: The Lounge - https://thelounge.chat]
22:31:53		driib5 (driib) joins
22:46:58		march_happy quits [Ping timeout: 265 seconds]
22:47:38		march_happy (march_happy) joins
23:00:58		nepeat quits [Quit: ZNC - https://znc.in]
23:00:58		HP_Archivist quits [Remote host closed the connection]
23:01:11		HP_Archivist (HP_Archivist) joins
23:01:19		nepeat (nepeat) joins
23:12:53		march_happy quits [Read error: Connection reset by peer]
23:13:22		march_happy (march_happy) joins
23:47:23		march_happy quits [Ping timeout: 265 seconds]
23:47:31		march_happy (march_happy) joins
23:51:46		march_happy quits [Ping timeout: 240 seconds]
23:52:00		march_happy (march_happy) joins
