00:04:50 | | Carnildo quits [Read error: Connection reset by peer] |
00:04:58 | | Carnildo joins |
00:14:37 | | Earendil7 quits [Ping timeout: 255 seconds] |
00:15:24 | | Earendil7 (Earendil7) joins |
00:18:30 | | Carnildo quits [Read error: Connection reset by peer] |
00:18:52 | | Carnildo joins |
00:21:41 | | etnguyen03 quits [Client Quit] |
00:22:29 | | etnguyen03 (etnguyen03) joins |
00:32:54 | | etnguyen03 quits [Client Quit] |
00:33:36 | | etnguyen03 (etnguyen03) joins |
00:38:27 | | Carnildo quits [Read error: Connection reset by peer] |
00:38:39 | | Carnildo joins |
00:47:28 | | Bleo1826007227 quits [Client Quit] |
00:47:45 | | Bleo1826007227 joins |
00:48:34 | | Carnildo_again joins |
00:48:42 | | Carnildo quits [Remote host closed the connection] |
01:02:26 | | dsadasd joins |
01:02:53 | | dsadasd leaves |
01:05:28 | | Barto quits [Ping timeout: 255 seconds] |
01:07:44 | | Carnildo_again quits [Remote host closed the connection] |
01:07:51 | | Carnildo joins |
01:11:49 | | Carnildo quits [Read error: Connection reset by peer] |
01:11:58 | | Carnildo joins |
01:20:57 | | skyrocket quits [Client Quit] |
01:21:54 | | skyrocket joins |
01:22:11 | | Carnildo_again joins |
01:22:37 | | Carnildo quits [Read error: Connection reset by peer] |
01:25:41 | | kiryu joins |
01:25:41 | | kiryu is now authenticated as kiryu |
01:25:41 | | kiryu quits [Changing host] |
01:25:41 | | kiryu (kiryu) joins |
01:27:50 | | etnguyen03 quits [Client Quit] |
01:28:31 | | etnguyen03 (etnguyen03) joins |
01:30:15 | | Carnildo_again quits [Read error: Connection reset by peer] |
01:30:38 | | Carnildo joins |
01:34:02 | | Carnildo quits [Remote host closed the connection] |
01:34:07 | | Carnildo joins |
01:36:20 | | Carnildo quits [Remote host closed the connection] |
01:36:34 | | Carnildo joins |
01:38:18 | | etnguyen03 quits [Client Quit] |
01:48:24 | | Carnildo_again joins |
01:48:24 | | Carnildo quits [Read error: Connection reset by peer] |
01:52:33 | | Carnildo_again quits [Read error: Connection reset by peer] |
01:52:35 | | Carnildo joins |
01:54:01 | | etnguyen03 (etnguyen03) joins |
01:57:56 | | eroc19908 quits [Quit: The Lounge - https://thelounge.chat] |
01:58:24 | | eroc1990 (eroc1990) joins |
02:08:48 | <eightthree> | JAA: https://github.com/internetarchive/brozzler this brozzler? in python, is it quick enough? anything in rust or go or ideally a memory-safe and type-safe language, works yet? I know none of these are going to be perfect reproductions of a js heavy site, as mentioned in your comment below the one I replied to... |
02:10:21 | <eightthree> | https://github.com/spider-rs/spider seems the most popular when I search https://github.com/search?q=crawler+lang%3Arust&ref=opensearch&type=repositories&s=stars&o=desc |
02:11:01 | | Carnildo quits [Remote host closed the connection] |
02:11:13 | | Carnildo joins |
02:26:12 | <@JAA> | eightthree: I'm not aware of any software written in Go or Rust that produces WARCs and has been verified to work correctly. And I have no experience with running brozzler. |
02:27:11 | <fireonlive> | https://dl.fireon.live/irc/aed6d285fea3f806/image.png not encouraging |
02:28:30 | <@JAA> | I wouldn't trust it anyway until verified. Lots of software writes incorrect WARCs, and most HTTP libraries don't make it easy to write correct ones since they usually don't expose the low-level byte stream. |
02:29:21 | <@JAA> | So unless you do the I/O yourself and use a sans-I/O parser, it's more likely than not going to be wrong. |
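The "do the I/O yourself" approach JAA describes can be sketched roughly as follows: keep the exact bytes the server sent and wrap them verbatim in a WARC/1.1 response record, with no header parsing or re-serialising in between. This is an illustrative sketch only, not any existing tool's record writer; the `captured` bytes are hardcoded here as a stand-in for data read straight off a socket.

```python
import base64
import hashlib
import uuid
from datetime import datetime, timezone

def build_warc_response_record(target_uri: str, raw_http: bytes) -> bytes:
    """Wrap raw HTTP/1.1 response bytes in a minimal WARC/1.1 response record.

    The payload goes in verbatim: no parsing, no re-serialising, which is
    the only way the stored record can match what the server actually sent.
    """
    # WARC block digests are conventionally base32-encoded SHA-1.
    digest = base64.b32encode(hashlib.sha1(raw_http).digest()).decode("ascii")
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http;msgtype=response"),
        ("WARC-Block-Digest", f"sha1:{digest}"),
        ("Content-Length", str(len(raw_http))),
    ]
    head = "WARC/1.1\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers)
    # A record is terminated by two CRLFs after the block.
    return head.encode("ascii") + b"\r\n" + raw_http + b"\r\n\r\n"

# Stand-in for bytes read straight off the socket; note the preserved
# header capitalisation and ordering, which parsed HTTP APIs would destroy.
captured = (b"HTTP/1.1 200 OK\r\n"
            b"content-TYPE: text/plain\r\n"
            b"Content-Length: 2\r\n"
            b"\r\n"
            b"hi")
record = build_warc_response_record("http://example.com/", captured)
```

A real implementation would additionally need a matching request record, WARC-Payload-Digest, and careful handling of transfer encoding, which is exactly where most libraries go wrong.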
02:35:26 | <pabs> | are there tools for validating warc files are spec-conformant and not weird in other ways? |
02:35:51 | | pabs wonders how long this change will last https://en.wikipedia.org/w/index.php?title=WARC_(file_format)&diff=prev&oldid=1222815751 |
02:36:35 | <pabs> | (added a link on the wikipedia WARC page to the AT WARC ecosystem page) |
02:37:11 | | Carnildo_again joins |
02:37:21 | | Carnildo quits [Read error: Connection reset by peer] |
02:37:42 | <@JAA> | Not that I'm aware of. Someone was working on one in the context of warcio several years ago, but I don't think that ever landed. I've been working on my own, but not ready yet. |
02:42:32 | | Carnildo_again quits [Remote host closed the connection] |
02:42:33 | | Still_Carnildo joins |
02:44:32 | <pabs> | -rss/#hackernews- Microsoft closes several large Bethesda affiliated game studios: https://www.ign.com/articles/microsoft-closes-redfall-developer-arkane-austin-hifi-rush-developer-tango-gameworks-and-more-in-devastating-cuts-at-bethesda https://news.ycombinator.com/item?id=40285476 |
02:44:35 | | HP_Archivist (HP_Archivist) joins |
02:58:39 | | Carnildo joins |
02:58:39 | | Still_Carnildo quits [Read error: Connection reset by peer] |
03:04:27 | | Carnildo quits [Read error: Connection reset by peer] |
03:05:02 | | Carnildo joins |
03:05:24 | <@OrIdow6> | JAA: Doesn't the IA have a crawler in Go? |
03:06:54 | <@JAA> | OrIdow6: Hmm right, Zeno. |
03:22:13 | <pabs> | "After 16 years online, Feedbooks will soon close down." https://www.feedbooks.com/ |
03:24:23 | <@JAA> | Yeah, it's been running through AB since late March, but that won't get it done. |
03:24:47 | <pabs> | ah |
03:25:50 | <@JAA> | Various filters etc. let the queue explode. One job had to be aborted already. |
03:28:34 | <fireonlive> | i read that as facebook and got a flash of excitement |
03:28:37 | <fireonlive> | :[ |
03:48:49 | | Carnildo quits [Remote host closed the connection] |
03:48:51 | | Carnildo joins |
04:05:17 | <Vokun> | The amount of family photos that would disappear from the planet when facebook shuts down. Woah
04:05:46 | <Vokun> | It'd be interesting if they decided to sell all their hardware though. Imagine how cheap used servers would go for
04:07:05 | <Vokun> | We need like a solid few months without any emergencies so that there's time to actually get #Y up and running |
04:08:16 | <Vokun> | A few months without an emergency? |
04:08:20 | | Vokun uploaded an image: (71KiB) < https://matrix.hackint.org/_matrix/media/v3/download/matrix.org/EgJNhFnPIUVWCXCqlDbYzulM/image.png > |
04:11:05 | | Carnildo quits [Remote host closed the connection] |
04:11:16 | | Carnildo joins |
04:32:23 | | Carnildo quits [Read error: Connection reset by peer] |
04:32:45 | | Carnildo joins |
04:35:19 | | Carnildo quits [Remote host closed the connection] |
04:35:22 | | Carnildo joins |
04:37:05 | | Carnildo quits [Read error: Connection reset by peer] |
04:37:07 | | Carnildo joins |
04:41:01 | | Island quits [Read error: Connection reset by peer] |
04:42:24 | | Carnildo quits [Remote host closed the connection] |
04:42:33 | | Carnildo joins |
04:51:40 | | Carnildo quits [Remote host closed the connection] |
04:51:42 | | Carnildo joins |
05:03:04 | | Lord_Nightmare quits [Ping timeout: 255 seconds] |
05:04:47 | | Lord_Nightmare (Lord_Nightmare) joins |
05:09:31 | | etnguyen03 quits [Client Quit] |
05:10:12 | | etnguyen03 (etnguyen03) joins |
05:25:51 | | etnguyen03 quits [Client Quit] |
05:26:32 | | etnguyen03 (etnguyen03) joins |
05:28:18 | | etnguyen03 quits [Remote host closed the connection] |
05:32:27 | | Barto (Barto) joins |
05:43:34 | | Bleo1826007227 quits [Ping timeout: 255 seconds] |
05:44:01 | | @dxrt quits [Ping timeout: 255 seconds] |
05:46:37 | | Bleo1826007227 joins |
05:51:21 | | dxrt joins |
05:51:23 | | dxrt is now authenticated as dxrt |
05:51:23 | | dxrt quits [Changing host] |
05:51:23 | | dxrt (dxrt) joins |
05:51:23 | | @ChanServ sets mode: +o dxrt |
06:21:11 | <@arkiver> | fireonlive: hahahaha, now that would be something! |
06:21:42 | <fireonlive> | xD for sure! |
06:22:09 | <@arkiver> | JAA: do we need a project for feedbooks? |
06:44:45 | | Carnildo quits [Read error: Connection reset by peer] |
06:44:47 | | Carnildo_again joins |
07:03:34 | | PredatorIWD joins |
07:06:11 | | Unholy2361924645 (Unholy2361) joins |
07:07:30 | | Carnildo_again quits [Remote host closed the connection] |
07:07:32 | | Carnildo joins |
07:18:16 | | PredatorIWD quits [Client Quit] |
07:21:13 | | BearFortress quits [Ping timeout: 255 seconds] |
07:22:23 | | BearFortress joins |
07:25:03 | | Carnildo quits [Read error: Connection reset by peer] |
07:25:14 | | Carnildo joins |
07:30:13 | | Carnildo quits [Remote host closed the connection] |
07:30:20 | | Carnildo joins |
07:50:25 | | PredatorIWD joins |
08:01:31 | | Carnildo_again joins |
08:02:37 | | Carnildo quits [Ping timeout: 255 seconds] |
08:02:39 | | PredatorIWD5 joins |
08:02:52 | | PredatorIWD quits [Client Quit] |
08:02:52 | | PredatorIWD5 is now known as PredatorIWD |
08:41:44 | <h2ibot> | Bear edited List of websites excluded from the Wayback Machine/Partial exclusions/Twitter accounts (+37, twitter.com/TheEuropeanMan1 - now he is…): https://wiki.archiveteam.org/?diff=52215&oldid=52035 |
08:51:46 | <h2ibot> | Bear edited List of websites excluded from the Wayback Machine (+363, More URLs that are part of the TRS.com empire.): https://wiki.archiveteam.org/?diff=52216&oldid=52204 |
08:54:47 | <h2ibot> | Bear uploaded File:Abload - upload form.png ([[Abload]] before they disabled uploading.): https://wiki.archiveteam.org/?title=File%3AAbload%20-%20upload%20form.png |
08:58:47 | <h2ibot> | Bear edited Abload (+103, [[:File:Abload - upload form.png]]): https://wiki.archiveteam.org/?diff=52218&oldid=52202 |
09:00:02 | | Bleo1826007227 quits [Client Quit] |
09:01:26 | | Bleo1826007227 joins |
09:16:22 | | monika quits [Quit: Zzz] |
09:19:58 | | monika (boom) joins |
09:22:08 | | monika quits [Client Quit] |
09:30:33 | | monika (boom) joins |
09:41:45 | | Carnildo_again quits [Remote host closed the connection] |
09:41:51 | | Carnildo_again joins |
09:45:43 | | Carnildo_again quits [Remote host closed the connection] |
09:45:50 | | Carnildo_again joins |
09:46:04 | | BornOn420 leaves [Textual IRC Client: www.textualapp.com] |
09:56:34 | | Carnildo_again quits [Read error: Connection reset by peer] |
09:56:38 | | beastbg8_ quits [Read error: Connection reset by peer] |
09:56:46 | | Carnildo joins |
10:02:00 | | igloo22225 quits [Quit: The Lounge - https://thelounge.chat] |
10:02:26 | | igloo22225 (igloo22225) joins |
10:03:36 | | monika quits [Client Quit] |
10:03:53 | | monika (boom) joins |
10:04:37 | | Carnildo quits [Remote host closed the connection] |
10:04:43 | | Carnildo joins |
10:05:00 | | BlueMaxima quits [Client Quit] |
10:15:37 | | PeterandLukas joins |
10:15:57 | | PeterandLukas quits [Client Quit] |
10:24:39 | | f_ (funderscore) joins |
10:29:25 | | f_ quits [Remote host closed the connection] |
10:29:59 | | f_ (funderscore) joins |
10:32:30 | | Carnildo quits [Read error: Connection reset by peer] |
10:33:05 | | Carnildo joins |
10:38:19 | | Carnildo quits [Ping timeout: 255 seconds] |
10:38:40 | | Carnildo joins |
10:41:30 | | Carnildo quits [Read error: Connection reset by peer] |
10:41:33 | | Carnildo joins |
10:43:25 | | PredatorIWD quits [Read error: Connection reset by peer] |
11:00:59 | | Carnildo quits [Read error: Connection reset by peer] |
11:01:11 | | Carnildo joins |
11:01:20 | | f_ quits [Client Quit] |
11:16:17 | | Carnildo quits [Remote host closed the connection] |
11:16:25 | | Carnildo joins |
11:16:47 | | PredatorIWD joins |
11:17:36 | | PredatorIWD quits [Read error: Connection reset by peer] |
11:21:13 | | PredatorIWD joins |
11:27:19 | | PredatorIWD quits [Read error: Connection reset by peer] |
11:30:15 | | Carnildo quits [Read error: Connection reset by peer] |
11:30:19 | | Carnildo joins |
11:31:44 | | PredatorIWD joins |
11:52:13 | | Carnildo quits [Remote host closed the connection] |
11:52:21 | | Carnildo joins |
12:32:09 | | Carnildo quits [Read error: Connection reset by peer] |
12:32:17 | | Carnildo joins |
12:40:49 | | Carnildo quits [Read error: Connection reset by peer] |
12:41:05 | | Carnildo joins |
12:58:43 | | etnguyen03 (etnguyen03) joins |
13:08:38 | | Carnildo_again joins |
13:08:49 | | Carnildo quits [Read error: Connection reset by peer] |
13:15:30 | | Carnildo_again quits [Read error: Connection reset by peer] |
13:15:43 | | Carnildo joins |
13:25:44 | | Carnildo quits [Read error: Connection reset by peer] |
13:25:53 | | Carnildo joins |
13:28:18 | | Carnildo quits [Remote host closed the connection] |
13:28:23 | | Carnildo joins |
13:31:38 | | shgaqnyrjp quits [Remote host closed the connection] |
13:31:41 | | shgaqnyrjp_ (shgaqnyrjp) joins |
13:32:43 | | Carnildo quits [Remote host closed the connection] |
13:32:50 | | Carnildo joins |
13:45:22 | | Carnildo quits [Remote host closed the connection] |
13:45:34 | | Carnildo joins |
14:03:00 | | Carnildo quits [Read error: Connection reset by peer] |
14:03:03 | | Carnildo joins |
14:38:46 | | shgaqnyrjp_ is now known as shgaqnyrjp |
15:10:56 | | kiryu quits [Remote host closed the connection] |
15:11:13 | | ]SaRgE[ (sarge) joins |
15:14:57 | | sarge quits [Ping timeout: 272 seconds] |
15:15:34 | <@JAA> | arkiver: Good question, not sure. It'd just be a catalogue of books they offer, I think. The actual interesting part is behind a login wall and would require automating loaning and stuff. |
15:19:33 | | Carnildo quits [Read error: Connection reset by peer] |
15:19:46 | | Carnildo joins |
15:19:49 | | kiryu joins |
15:19:49 | | kiryu is now authenticated as kiryu |
15:19:49 | | kiryu quits [Changing host] |
15:19:49 | | kiryu (kiryu) joins |
15:24:43 | <ScenarioPlanet> | What are the main conditions for getting voiced in several AT channels that are used to operate archival bots (example: #archivebot / #wikibot)? |
15:32:27 | | kiryu quits [Remote host closed the connection] |
15:34:57 | | loug joins |
15:42:51 | | Carnildo quits [Read error: Connection reset by peer] |
15:46:48 | | Medowar quits [Quit: ZNC 1.9.0 - https://znc.in] |
15:47:02 | | f_ (funderscore) joins |
15:50:53 | | f_ quits [Remote host closed the connection] |
15:51:12 | | f_ (funderscore) joins |
15:52:30 | | f_ is now known as funderscore |
15:52:38 | | funderscore is now known as f_ |
16:15:39 | | etnguyen03 quits [Client Quit] |
16:16:20 | | etnguyen03 (etnguyen03) joins |
16:26:07 | | etnguyen03 quits [Client Quit] |
16:26:48 | | etnguyen03 (etnguyen03) joins |
16:34:15 | <pokechu22> | The main one is understanding how to operate the bot mainly (including things like ignores and noticing when a site's gotten mad at us). |
16:36:21 | | loug quits [Client Quit] |
16:36:35 | | etnguyen03 quits [Client Quit] |
16:37:16 | | etnguyen03 (etnguyen03) joins |
16:37:18 | | loug joins |
16:39:04 | <pokechu22> | You're not currently in #wikibot but I'd say that one is easier to operate as there's only a few commands |
16:40:09 | <ScenarioPlanet> | That doesn't seem to be hard to understand (especially in #wikibot case), but some details like pipeline operations are kinda off-putting for me, maybe because they don't have any public documentation |
16:40:25 | <pokechu22> | Yeah :/ |
16:41:16 | <ScenarioPlanet> | I mean things like pipeline notes (which is cloudflared or not, which is closer to the server that holds the website being archived, local censorships & more) |
16:43:22 | | marto_ quits [Client Quit] |
16:43:28 | | marto_ (marto_) joins |
16:47:02 | | etnguyen03 quits [Client Quit] |
17:18:52 | <eightthree> | JAA: what about in typescript using node.js ( or perhaps in vue or other typescript-written tools/langs/frameworks) |
17:20:21 | <eightthree> | https://github.com/webrecorder/browsertrix - in typescript |
17:27:58 | <@JAA> | eightthree: Stay away from anything webrecorder until proven it's not outputting rubbish. |
17:28:42 | <eightthree> | JAA https://github.com/webrecorder/pywb/issues/294 I noticed you linked to this, but don't know how relevant it is, given it's one of the non-ts projects of theirs... |
17:28:46 | <@JAA> | At least two of their tools do not produce valid WARCs. They have known this for years and made no attempt to fix it that I'm aware of. |
17:29:02 | <@JAA> | I have no reason to trust any of their tooling that produces WARCs. |
17:29:21 | <eightthree> | https://github.com/ArchiveTeam/ArchiveBot/issues/70 otherwise, this is still the open issue on using webrecorder... |
17:29:37 | <eightthree> | with hardly anything said... |
17:29:52 | <@JAA> | I never saw this issue before, and it predates my presence here. |
17:30:04 | <@JAA> | I guess the problems weren't known at the time. |
17:30:21 | <eightthree> | damn, how do i get proof? by ...not staying away and trying it (or seeing others comment on reddit, github, etc.)?
17:31:06 | <@JAA> | You produce WARCs with them and verify that they are compliant. This requires intimate knowledge of the WARC and HTTP specs. |
17:31:31 | <@JAA> | Or you look at the code and immediately see why they can't possibly be compliant. |
17:34:03 | <@JAA> | E.g. their browser extension thing can't ever work because the browser doesn't make the necessary data available to extensions. |
17:34:19 | <@JAA> | ArchiveWeb.Page |
17:36:07 | | datechnoman quits [Client Quit] |
17:36:30 | | datechnoman (datechnoman) joins |
17:50:42 | | f_ is now known as f_|afk |
17:51:04 | | f_|afk is now known as f_ |
17:54:17 | <nicolas17> | JAA: what would the browser need to expose, the raw network data without TLS? |
17:55:07 | <@JAA> | nicolas17: Yes, headers and transfer encoding as sent by the server. You only get a parsed representation of the former (losing capitalisation, whitespace, order) and TE is stripped. |
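The information loss JAA mentions can be demonstrated with a toy parse/re-serialise round trip: once headers land in a dict-like structure (as browser extension APIs expose them), the original capitalisation, whitespace, and ordering are gone, and the raw bytes cannot be reconstructed. The parser below is hand-rolled purely for illustration.

```python
def parse_headers(raw: bytes) -> dict:
    """Parse header lines the way a typical HTTP API exposes them:
    names case-folded, whitespace stripped, original bytes discarded."""
    out = {}
    for line in raw.split(b"\r\n"):
        if b":" in line:
            name, _, value = line.partition(b":")
            out[name.strip().lower()] = value.strip()
    return out

raw = b"Server: Apache\r\ncontent-TYPE:  text/html\r\nX-Odd-Case: a\r\n"
parsed = parse_headers(raw)

# Re-serialising from the parsed form cannot recover the original bytes:
# casing, the double space, and any ordering guarantees are already lost.
rebuilt = b"".join(k + b": " + v + b"\r\n" for k, v in parsed.items())
```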
17:55:14 | <nicolas17> | it seems to me that would immediately hit the problem of WARC not supporting HTTP2 :P |
17:55:22 | <@JAA> | Yes, that as well. |
17:55:38 | <@JAA> | You can only write HTTP/1.1 to WARC. Technically, even 0.9 and 1.0 is incompatible. |
17:56:09 | <@JAA> | 1.0 *might* work with a bit of generous interpretation. 0.9 definitely doesn't. |
17:56:39 | <nicolas17> | and translating HTTP2 to HTTP1.1 would likely be frowned upon |
17:56:57 | <nicolas17> | so now you have to disable HTTP2 browser-wide |
17:57:07 | <@JAA> | I've read somewhere that that's what webrecorder do, and yes, that's also bad. |
17:57:30 | <@JAA> | Or really, worse, because it even misrepresents which HTTP version was used. |
17:58:26 | <eightthree> | btw, i found this https://github.com/archivetheweb/archiver in rust, but no new commits in over a year, and it's at v0.3; it does focus on warc 1.1 though
17:58:43 | <eightthree> | JAA: what other "places" than here might have reliable enough experts (in warc i guess..or maybe also wacz?) that I could just ask, (and to avoid asking in a specific tool's chatroom/forum, as they are more likely biased). |
17:58:46 | <eightthree> | I found these 2 awesome lists https://github.com/ruarxive/awesome-digital-preservation https://github.com/iipc/awesome-web-archiving, but the first noted webrecorder as high-fidelity, so I don't know if those listmakers or any of the projects mentioned are reliable enough by your standards...
17:59:34 | <@JAA> | eightthree: I'm not aware of any. Even my discussions in the IIPC about this (which is the organisation where the WARC specification is written) were not entirely fruitful. I should revive them though. |
18:00:17 | <@JAA> | It seems that very few people care about spec compliance, which is wild given this data is supposed to survive decades or more. |
18:01:09 | <@JAA> | So far, my only rough rule is that if a software was written by IA, it's probably doing it correctly. |
18:02:08 | | lennier2 quits [Read error: Connection reset by peer] |
18:02:27 | | lennier2 joins |
18:11:45 | <eightthree> | so I searched for "better than warc" and stumbled upon someone saying HAR is better than warc, and capturing a HAR of the current page is implemented in the F12 devtools I believe... Do you have any link that I can read to see why warc is best of all the archiving formats? When the IIPC itself isn't reliable... |
18:12:36 | <@JAA> | IIPC is reliable, and there are people there that care about accuracy, just many don't. |
18:12:41 | <eightthree> | or maybe roughly tell me what to google if itll take longer to find link |
18:13:08 | <@JAA> | HAR doesn't preserve the exact HTTP traffic either, just a parsed version of it. |
18:13:21 | <@JAA> | So roughly the same as what webrecorder's tooling produces. |
18:14:13 | <@JAA> | You can always transform a (correct) WARC to a HAR, but the opposite is not possible. |
18:16:50 | <@JAA> | HAR is also awkward to use for anything larger than a single page or maybe a few. There's no concept of compression, and it's a single large JSON object, so appending is hard. I don't recall how binary data is stored, but I believe that's a mess, too. (Maybe base64?) |
18:17:52 | <eightthree> | JAA: so technically one could make a proper "extension" but compile it right into a browser, if ever a firefox or chromium derivative browser would be willing to do this? I've been annoyed by how singlefile is indeed not always a proper representation and TIL it's not entirely their fault... |
18:18:06 | <@JAA> | Yep, it is base64 indeed. |
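The HAR shape JAA describes (one large JSON object, base64-encoded binary bodies) can be sketched as below. This uses a pared-down subset of the HAR 1.2 entry fields for illustration; a real HAR carries many more (timings, cookies, pages, etc.). It also shows why appending is awkward: there is no record-concatenation like WARC has, so adding one capture means re-parsing and rewriting the whole document.

```python
import base64
import json

def har_entry(url: str, status: int, body: bytes) -> dict:
    """A pared-down HAR 1.2 entry; binary bodies are stored as base64 text."""
    return {
        "request": {"method": "GET", "url": url},
        "response": {
            "status": status,
            "content": {
                "size": len(body),
                "encoding": "base64",
                "text": base64.b64encode(body).decode("ascii"),
            },
        },
    }

har = {"log": {"version": "1.2",
               "entries": [har_entry("http://example.com/", 200, b"\x89PNG")]}}
serialized = json.dumps(har)

# "Appending" one more capture means parsing and rewriting the entire
# object, unlike WARC, where new records are simply concatenated (and can
# even be individually gzipped).
doc = json.loads(serialized)
doc["log"]["entries"].append(har_entry("http://example.com/2", 200, b"hi"))
serialized = json.dumps(doc)
```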
18:18:28 | <@JAA> | Sure, if you modified the browser, it could definitely be done. |
18:18:45 | <@JAA> | But that's obviously not an easy task. |
18:19:02 | <@JAA> | And it's probably why brozzler uses an MITM proxy instead. |
18:19:49 | <nicolas17> | in theory you could use SSLKEYLOGFILE and capture the SSL'd traffic and decrypt it |
18:19:59 | <nicolas17> | but that still has the problem of HTTP2 |
18:20:53 | <nicolas17> | would need to turn http2/3 off |
18:21:14 | <@JAA> | Yeah |
18:21:15 | <nicolas17> | or do MITM so you can tamper with the list of supported protocols |
18:21:32 | <eightthree> | JAA: tor browser baked in noscript, the grapheneos browser also incorporated...a content filter, not sure if it's noscript. I have noticed first hand many times how extensions can stop working when ram use is too high or something, and I guess that's why they didn't want to compromise on security. |
18:23:39 | <nicolas17> | I have also seen cases where Wireshark/dumpcap loses a packet even though the recipient ack'd it so it wasn't lost in the network, and then the entire stream is fucked |
18:24:57 | | wessel1512 quits [Ping timeout: 272 seconds] |
18:25:23 | | yano quits [Quit: WeeChat, the better IRC client, https://weechat.org/] |
18:27:58 | | yano (yano) joins |
18:28:02 | | wessel1512 joins |
18:36:13 | | etnguyen03 (etnguyen03) joins |
18:50:53 | | RealPerson leaves |
19:06:00 | | Island joins |
19:12:25 | | etnguyen03 quits [Client Quit] |
19:13:07 | | etnguyen03 (etnguyen03) joins |
19:31:29 | | pixel leaves |
19:31:30 | | pixel (pixel) joins |
19:37:35 | | wyatt8740 joins |
19:43:53 | | wyatt8750 joins |
19:45:04 | | wyatt8740 quits [Ping timeout: 255 seconds] |
20:04:10 | | Jens quits [] |
20:04:56 | | Jens (JensRex) joins |
20:05:25 | | DJ joins |
20:06:33 | <lea> | nicolas17: why is http2/3 a problem here? SSLKEYLOGFILE should work for them as well |
20:06:52 | | f_ quits [Client Quit] |
20:10:15 | <DJ> | Hey, has anyone checked out https://github.com/aliparlakci/bulk-downloader-for-reddit or https://github.com/RedditDownloader/redditdownloader.github.io? I think archivebot is limited in terms of crawling subreddits so I was wondering if anyone was aware of these or if they work. |
20:38:56 | <nicolas17> | lea: the WARC format has no way to represent captured HTTP2 requests/responses |
20:39:30 | <nicolas17> | and if you synthesize HTTP1.1-looking syntax from the HTTP2 data, that's not a pristine capture |
20:39:34 | <lea> | nicolas17: curl has a way of representing http2/3 responses in a format similar to http1.1, could that not be used? |
20:39:41 | <lea> | ah |
20:39:52 | | tapos joins |
20:39:58 | <nicolas17> | <JAA> Or really, worse, because it even misrepresents which HTTP version was used. |
20:41:13 | <lea> | can it not say HTTP/2.0 200 OK in the header instead of HTTP/1.1 200 OK? |
20:41:25 | <DJ> | Nvm the first one has the 1000 posts API limit, I don't know if the second one does but probably. |
20:41:42 | <lea> | or is that the thing that is not supported? |
20:42:06 | <@JAA> | lea: WARC captures exactly what was sent over the network (at the application layer). And yes, the spec only supports HTTP/1.1 specifically. |
20:51:00 | | fireonlive wonders if IIPC(?) will update it soonish |
20:55:24 | | etnguyen03 quits [Client Quit] |
21:01:18 | | BearFortress_ joins |
21:01:19 | | tapos2 joins |
21:01:23 | | qw3rty_ joins |
21:01:24 | | ymgve_ joins |
21:01:39 | | AlsoHP_Archivist joins |
21:03:31 | | DJ58 joins |
21:03:39 | | DJ58 leaves |
21:04:42 | | ymgve quits [Ping timeout: 265 seconds] |
21:04:42 | | qw3rty quits [Ping timeout: 265 seconds] |
21:05:11 | | HP_Archivist quits [Ping timeout: 265 seconds] |
21:05:40 | | tapos quits [Ping timeout: 265 seconds] |
21:05:40 | | DJ quits [Ping timeout: 265 seconds] |
21:05:40 | | BearFortress quits [Ping timeout: 265 seconds] |
21:05:40 | | Guest quits [Ping timeout: 265 seconds] |
21:13:18 | <eightthree> | fireonlive: Am I too much of a conspiracy theorist for thinking that it might be intentional that proper copies of websites are so hard to do, spec not updated? Either to fingerprint copycat websites (for google and others to programmatically punish them in their algo (or not show them at all), but also to ensure that MITMs can be noticed? |
21:14:22 | <@JAA> | Yes, you are. |
21:14:29 | <eightthree> | like google has 20bill usd to send to Apple...I think each year, but they can't keep any archiving standard up to date with any of the other standards they spend heavily to influence and update? |
21:14:37 | | Earendil7 quits [Ping timeout: 255 seconds] |
21:14:54 | <@JAA> | The problem is that there's probably less than a dozen people worldwide who have worked on the WARC spec occasionally over the past two decades. |
21:15:17 | <@JAA> | It's very much a niche, and there's no funding for doing these things. |
21:16:04 | <@JAA> | Google et al. don't care about WARC or even archival. |
21:17:59 | <@JAA> | It's so much of a niche that I was apparently the first person to try to implement a WARC parser based on just the spec, since I ran into several inconsistencies that *have* to come up on every implementation. |
21:18:26 | <eightthree> | JAA: you seem to not see that the funding money amounts could exist, but don't. ~What does google and other search engines use to show an archived copy of a page?~ Ok, forget that, those arent js copies for secu reasons likely, but say, google translate shows a page with js, what does that use? Why don't they care to make it as close to the original as possible? The companies have the money, they choose not to spend on this.... |
21:18:32 | <eightthree> | ... There could be 100s or 1000s working on this long term if they wanted it. |
21:19:18 | <eightthree> | govts and library/archiving orgs likewise have some budget... |
21:19:34 | <@JAA> | They're companies. They care about making money. Spending effort on exact preservation when it doesn't matter to their service isn't something they will do. |
21:20:21 | | Earendil7 (Earendil7) joins |
21:20:26 | <@JAA> | At best, they care about storing an HTTP-equivalent copy of the data, i.e. with headers normalised, transfer encoding stripped, etc., since it's much easier to work with that. |
21:20:27 | <eightthree> | "inconsistencies that _have_ to come up on every implementation"
21:20:27 | <eightthree> | what do you mean by this? other warc parsers always have inconsistencies, but why would they *have* to?
21:20:52 | <@JAA> | Not other parsers have inconsistencies, the spec has inconsistencies, and anyone implementing a parser based on the spec would have to run into them. |
21:21:01 | <@JAA> | Since they weren't reported previously, apparently nobody did that. |
21:21:17 | <@JAA> | Enjoy: https://github.com/iipc/warc-specifications/issues/created_by/JustAnotherArchivist |
21:21:58 | <@JAA> | #71 and #72 in particular are unavoidable if you look at the grammar. |
21:22:02 | <eightthree> | JAA: maybe this is why, as you were complaining earlier, most people don't care about coding to spec? If the spec has issues...
21:22:40 | <@JAA> | If the spec has issues and you care about preservation, you raise the issue and get the spec fixed. |
21:27:16 | <eightthree> | have you seen anything about funding for the people contributing, from iipc or other? who/how many, if anyone is paid to contribute? |
21:31:01 | <@JAA> | So the IIPC is a consortium, and many of the people there are actually employed at various institutions doing digital preservation stuff. That would include some people from IA, from the British Library, etc. I imagine that the part of their employment dedicated to contributing to the IIPC is tiny to nonexistent though. |
21:31:47 | <@JAA> | IIPC does fund some projects in a narrow scope. There's an annual call for proposals. |
21:32:17 | <@JAA> | Or there's supposed to be, anyway, I think it hasn't happened in a couple years now for a reason they haven't communicated publicly I think. |
21:33:21 | <@JAA> | (I have considered submitting a proposal in this area before.) |
21:38:18 | | Bleo18260072271 joins |
21:40:28 | | Bleo1826007227 quits [Ping timeout: 265 seconds] |
21:40:28 | | Bleo18260072271 is now known as Bleo1826007227 |
21:40:58 | <eightthree> | Call for proposals is now closed |
21:40:58 | <eightthree> | Proposals due: 15 September 2021 |
21:40:58 | <eightthree> | Projects start: 1 January 2022 |
21:40:58 | <eightthree> | Final report due: by 31 December 2022 |
21:41:06 | <eightthree> | from https://netpreserve.org/projects/funding/ |
21:43:00 | <@JAA> | Aye |
22:01:57 | <eightthree> | hmm so if funding dried up (from only a trickle) at iipc, perhaps the solution is to improve HAR since it's so much more widely deployed? the link on the wikip article links to a draft, |
22:01:59 | <eightthree> | https://w3c.github.io/web-performance/specs/HAR/Overview.html |
22:01:59 | <eightthree> | with fat warning |
22:01:59 | <eightthree> | > _DO NOT USE_ |
22:01:59 | <eightthree> | > This document was never published by the W3C Web Performance Working Group and has been abandoned. |
22:01:59 | <eightthree> | but the document lists itself as the latest, |
22:02:00 | <eightthree> | > Historical Draft August 14, 2012 |
22:02:01 | <eightthree> | > This version: |
22:02:02 | <eightthree> | > https://w3c.github.io/web-performance/specs/HAR/Overview.html |
22:02:04 | <eightthree> | > Latest version: |
22:02:05 | <eightthree> | > https://w3c.github.io/web-performance/specs/HAR/Overview.html |
22:02:07 | <eightthree> | and searching https://www.w3.org/TR/?filter-tr-name=har shows nothing... |
22:02:08 | <eightthree> | why was a draft with a fat warning so widely deployed over WARC???
22:05:26 | <eightthree> | *deployed instead of warc... |
22:05:26 | <eightthree> | and do browsers etc all implement their own modified way of capturing HAR file from the current page in the browser? Am I going to have to go on a long hunt on each git/mailing list of each of these |
22:05:29 | <eightthree> | > The HAR format is supported by various software, including: |
22:05:29 | <eightthree> | > Charles Proxy |
22:05:29 | <eightthree> | > Fiddler |
22:05:29 | <eightthree> | > Firebug |
22:05:29 | <eightthree> | > Firefox |
22:05:30 | <eightthree> | > Fluxzy Desktop |
22:05:32 | <eightthree> | > Google Chrome |
22:05:33 | <eightthree> | > Internet Explorer 9 |
22:05:35 | <eightthree> | > Microsoft Edge |
22:05:36 | <eightthree> | > Mitmproxy |
22:05:38 | <eightthree> | > Postman |
22:05:39 | <eightthree> | > OWASP ZAP |
22:05:41 | <eightthree> | > Safari |
22:05:42 | <eightthree> | to find out how they implement it? |
22:06:06 | <@JAA> | Likely, because I doubt there's any documentation on what they do in detail. |
22:06:49 | <eightthree> | im using matrix btw and its showing me the scissors icon, so tell me if my message is hard to read...maybe I'll pastebin it...
22:06:56 | <@JAA> | HAR being JSON is both great and horrible. |
22:07:02 | <@JAA> | Well, you pasted like a dozen messages, yeah. |
22:07:27 | | SootBector quits [Ping timeout: 250 seconds] |
22:09:29 | <eightthree> | HTTP/2, published in 2015 - does anyone implement HAR beyond 1.1? |
22:09:52 | | SootBector (SootBector) joins |
22:12:23 | <eightthree> | https://huggingface.co/spaces/ehristoforu/mixtral-46.7b-chat - this AI says har is neutral to http version... |
22:12:47 | <@JAA> | Keep AI nonsense out of here. |
22:13:24 | <nicolas17> | eightthree: where do you think it got that information from? |
22:13:38 | | flotwig_ joins |
22:14:47 | | flotwig quits [Ping timeout: 265 seconds] |
22:14:47 | | flotwig_ is now known as flotwig |
22:15:14 | <nicolas17> | if there isn't any good information on HAR on websites, then there's nowhere the AI could have learned it from and it's just guessing/hallucinating |
22:15:25 | <nicolas17> | if there is, then look at those websites instead :P |
22:16:30 | <eightthree> | JAA: sorry |
22:17:21 | <@JAA> | To answer the question, at least Firefox can put HTTP/2 into HAR. Probably HTTP/3 and WebSocket, too. |
22:17:51 | <nicolas17> | by putting the decoded normalized headers in there? |
22:18:03 | <@JAA> | Yes |
22:18:13 | <@JAA> | It's a massive JSON object. |
22:42:31 | <eightthree> | https://searchfox.org/mozilla-central/search?q=har&path=&case=false&regexp=false - 112 results when I check the checkbox for "whole words" when I ctrl-f for har
22:44:43 | <eightthree> | JAA: where can I find a detailed "what's missing from HAR that WARC has"?
22:45:07 | <@JAA> | eightthree: By comparing the specs of HAR and WARC in detail. I doubt it's been done before. |
22:46:30 | <eightthree> | JAA: like, the github repo you linked to earlier for warc, with the 2012 draft I linked for HAR? |
22:47:09 | <@JAA> | Probably? I never looked into what documentation exists on HAR. There might be stuff on MDN or in browsers' documentations, too. |
22:47:44 | <eightthree> | I guess there's no comparison with firefox's implementation yet...I'll see if I absolutely need to decipher code or if there's something in bugzilla or elsewhere... |
22:51:13 | <eightthree> | https://firefox-source-docs.mozilla.org/devtools-user/network_monitor/index.html#copysave-all-as-har https://github.com/mdn/content/blob/main/files/en-us/mozilla/firefox/releases/41/index.md seemingly the button was added in ffx 41 |
22:52:19 | <eightthree> | https://bugzilla.mozilla.org/show_bug.cgi?id=859058 |
22:57:35 | <eightthree> | [Links (documentation, blog post, etc)]: 'HAR' can be linked to http://www.softwareishard.com/blog/har-12-spec/; |
22:57:37 | <eightthree> | from https://bugzilla.mozilla.org/show_bug.cgi?id=859058#c38, even though they had found the w3c link too, the last mention of what to link to officially as the spec was the above line |
22:58:08 | | tapos2 quits [Client Quit] |
22:58:54 | | Overlordz_ quits [Quit: Leaving] |
23:01:17 | | BlueMaxima joins |
23:01:48 | | tapos joins |
23:03:17 | | etnguyen03 (etnguyen03) joins |
23:05:22 | <eightthree> | https://bugzilla.mozilla.org/show_bug.cgi?id=859058#c7 this honza guy seems to be knowledgeable enough to propose writing a draft update in case extra features were needed. Said 10 years ago though :) |
23:11:55 | <eightthree> | https://searchfox.org/mozilla-central/source/devtools/client/netmonitor/src/har/README.md |
23:16:39 | | etnguyen03 quits [Client Quit] |
23:22:00 | <eightthree> | https://addons.mozilla.org/en-US/firefox/addon/har-export-trigger/ - made by the same guy,
23:22:00 | <eightthree> | Jan Honza Odvarko, who implemented HAR in Firefox... but per the readme there seems to be a way, with a user.js setting and then a browser restart, to automatically save each page without needing the extension, at least that's how it seems...
23:44:14 | | etnguyen03 (etnguyen03) joins |
23:44:41 | | Guest joins |
23:48:06 | | Wohlstand (Wohlstand) joins |