00:04:50Carnildo quits [Read error: Connection reset by peer]
00:04:58Carnildo joins
00:14:37Earendil7 quits [Ping timeout: 255 seconds]
00:15:24Earendil7 (Earendil7) joins
00:18:30Carnildo quits [Read error: Connection reset by peer]
00:18:52Carnildo joins
00:21:41etnguyen03 quits [Client Quit]
00:22:29etnguyen03 (etnguyen03) joins
00:32:54etnguyen03 quits [Client Quit]
00:33:36etnguyen03 (etnguyen03) joins
00:38:27Carnildo quits [Read error: Connection reset by peer]
00:38:39Carnildo joins
00:47:28Bleo1826007227 quits [Client Quit]
00:47:45Bleo1826007227 joins
00:48:34Carnildo_again joins
00:48:42Carnildo quits [Remote host closed the connection]
01:02:26dsadasd joins
01:02:53dsadasd leaves
01:05:28Barto quits [Ping timeout: 255 seconds]
01:07:44Carnildo_again quits [Remote host closed the connection]
01:07:51Carnildo joins
01:11:49Carnildo quits [Read error: Connection reset by peer]
01:11:58Carnildo joins
01:20:57skyrocket quits [Client Quit]
01:21:54skyrocket joins
01:22:11Carnildo_again joins
01:22:37Carnildo quits [Read error: Connection reset by peer]
01:25:41kiryu joins
01:25:41kiryu quits [Changing host]
01:25:41kiryu (kiryu) joins
01:27:50etnguyen03 quits [Client Quit]
01:28:31etnguyen03 (etnguyen03) joins
01:30:15Carnildo_again quits [Read error: Connection reset by peer]
01:30:38Carnildo joins
01:34:02Carnildo quits [Remote host closed the connection]
01:34:07Carnildo joins
01:36:20Carnildo quits [Remote host closed the connection]
01:36:34Carnildo joins
01:38:18etnguyen03 quits [Client Quit]
01:48:24Carnildo_again joins
01:48:24Carnildo quits [Read error: Connection reset by peer]
01:52:33Carnildo_again quits [Read error: Connection reset by peer]
01:52:35Carnildo joins
01:54:01etnguyen03 (etnguyen03) joins
01:57:56eroc19908 quits [Quit: The Lounge - https://thelounge.chat]
01:58:24eroc1990 (eroc1990) joins
02:08:48<eightthree>JAA: https://github.com/internetarchive/brozzler this brozzler? It's in Python; is it quick enough? Does anything in Rust or Go, or ideally any memory-safe and type-safe language, work yet? I know none of these are going to be perfect reproductions of a JS-heavy site, as mentioned in your comment below the one I replied to...
02:10:21<eightthree>https://github.com/spider-rs/spider seems the most popular when I search https://github.com/search?q=crawler+lang%3Arust&ref=opensearch&type=repositories&s=stars&o=desc
02:11:01Carnildo quits [Remote host closed the connection]
02:11:13Carnildo joins
02:26:12<@JAA>eightthree: I'm not aware of any software written in Go or Rust that produces WARCs and has been verified to work correctly. And I have no experience with running brozzler.
02:27:11<fireonlive>https://dl.fireon.live/irc/aed6d285fea3f806/image.png not encouraging
02:28:30<@JAA>I wouldn't trust it anyway until verified. Lots of software writes incorrect WARCs, and most HTTP libraries don't make it easy to write correct ones since they usually don't expose the low-level byte stream.
02:29:21<@JAA>So unless you do the I/O yourself and use a sans-I/O parser, it's more likely than not going to be wrong.
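A minimal sketch of what "doing the I/O yourself" might look like, in Python with only the standard library; example.org is a placeholder host, and a real tool would still need a sans-I/O HTTP parser on top of the captured bytes to find message boundaries:

    import socket
    import ssl

    # Speak HTTP/1.1 over a raw TLS socket so the exact bytes the server
    # sends are available; an HTTP library would normally parse and
    # normalise them before we ever see them.
    host = "example.org"  # placeholder
    request = b"GET / HTTP/1.1\r\nHost: example.org\r\nConnection: close\r\n\r\n"

    ctx = ssl.create_default_context()
    with socket.create_connection((host, 443)) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            tls.sendall(request)
            raw_response = b""
            while chunk := tls.recv(65536):
                raw_response += chunk

    # raw_response now holds the status line, the headers with their
    # original casing, order, and whitespace, and the body with its
    # transfer encoding intact -- i.e. the byte stream a WARC "response"
    # record block is supposed to contain verbatim.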
02:35:26<pabs>are there tools for validating warc files are spec-conformant and not weird in other ways?
02:35:51pabs wonders how long this change will last https://en.wikipedia.org/w/index.php?title=WARC_(file_format)&diff=prev&oldid=1222815751
02:36:35<pabs>(added a link on the wikipedia WARC page to the AT WARC ecosystem page)
02:37:11Carnildo_again joins
02:37:21Carnildo quits [Read error: Connection reset by peer]
02:37:42<@JAA>Not that I'm aware of. Someone was working on one in the context of warcio several years ago, but I don't think that ever landed. I've been working on my own, but not ready yet.
02:42:32Carnildo_again quits [Remote host closed the connection]
02:42:33Still_Carnildo joins
02:44:32<pabs>-rss/#hackernews- Microsoft closes several large Bethesda affiliated game studios: https://www.ign.com/articles/microsoft-closes-redfall-developer-arkane-austin-hifi-rush-developer-tango-gameworks-and-more-in-devastating-cuts-at-bethesda https://news.ycombinator.com/item?id=40285476
02:44:35HP_Archivist (HP_Archivist) joins
02:58:39Carnildo joins
02:58:39Still_Carnildo quits [Read error: Connection reset by peer]
03:04:27Carnildo quits [Read error: Connection reset by peer]
03:05:02Carnildo joins
03:05:24<@OrIdow6>JAA: Doesn't the IA have a crawler in Go?
03:06:54<@JAA>OrIdow6: Hmm right, Zeno.
03:22:13<pabs>"After 16 years online, Feedbooks will soon close down." https://www.feedbooks.com/
03:24:23<@JAA>Yeah, it's been running through AB since late March, but that won't get it done.
03:24:47<pabs>ah
03:25:50<@JAA>Various filters etc. let the queue explode. One job had to be aborted already.
03:28:34<fireonlive>i read that as facebook and got a flash of excitement
03:28:37<fireonlive>:[
03:48:49Carnildo quits [Remote host closed the connection]
03:48:51Carnildo joins
04:05:17<Vokun>The amount of family photos that would disappear from the planet when Facebook shuts down. Woah
04:05:46<Vokun>It'd be interesting if they decided to sell all their hardware though. Imagine how cheap used servers would go for
04:07:05<Vokun>We need like a solid few months without any emergencies so that there's time to actually get #Y up and running
04:08:16<Vokun>A few months without an emergency?
04:08:20Vokun uploaded an image: (71KiB) < https://matrix.hackint.org/_matrix/media/v3/download/matrix.org/EgJNhFnPIUVWCXCqlDbYzulM/image.png >
04:11:05Carnildo quits [Remote host closed the connection]
04:11:16Carnildo joins
04:32:23Carnildo quits [Read error: Connection reset by peer]
04:32:45Carnildo joins
04:35:19Carnildo quits [Remote host closed the connection]
04:35:22Carnildo joins
04:37:05Carnildo quits [Read error: Connection reset by peer]
04:37:07Carnildo joins
04:41:01Island quits [Read error: Connection reset by peer]
04:42:24Carnildo quits [Remote host closed the connection]
04:42:33Carnildo joins
04:51:40Carnildo quits [Remote host closed the connection]
04:51:42Carnildo joins
05:03:04Lord_Nightmare quits [Ping timeout: 255 seconds]
05:04:47Lord_Nightmare (Lord_Nightmare) joins
05:09:31etnguyen03 quits [Client Quit]
05:10:12etnguyen03 (etnguyen03) joins
05:25:51etnguyen03 quits [Client Quit]
05:26:32etnguyen03 (etnguyen03) joins
05:28:18etnguyen03 quits [Remote host closed the connection]
05:32:27Barto (Barto) joins
05:43:34Bleo1826007227 quits [Ping timeout: 255 seconds]
05:44:01@dxrt quits [Ping timeout: 255 seconds]
05:46:37Bleo1826007227 joins
05:51:21dxrt joins
05:51:23dxrt quits [Changing host]
05:51:23dxrt (dxrt) joins
05:51:23@ChanServ sets mode: +o dxrt
06:21:11<@arkiver>fireonlive: hahahaha, now that would be something!
06:21:42<fireonlive>xD for sure!
06:22:09<@arkiver>JAA: do we need a project for feedbooks?
06:44:45Carnildo quits [Read error: Connection reset by peer]
06:44:47Carnildo_again joins
07:03:34PredatorIWD joins
07:06:11Unholy2361924645 (Unholy2361) joins
07:07:30Carnildo_again quits [Remote host closed the connection]
07:07:32Carnildo joins
07:18:16PredatorIWD quits [Client Quit]
07:21:13BearFortress quits [Ping timeout: 255 seconds]
07:22:23BearFortress joins
07:25:03Carnildo quits [Read error: Connection reset by peer]
07:25:14Carnildo joins
07:30:13Carnildo quits [Remote host closed the connection]
07:30:20Carnildo joins
07:50:25PredatorIWD joins
08:01:31Carnildo_again joins
08:02:37Carnildo quits [Ping timeout: 255 seconds]
08:02:39PredatorIWD5 joins
08:02:52PredatorIWD quits [Client Quit]
08:02:52PredatorIWD5 is now known as PredatorIWD
08:41:44<h2ibot>Bear edited List of websites excluded from the Wayback Machine/Partial exclusions/Twitter accounts (+37, twitter.com/TheEuropeanMan1 - now he is…): https://wiki.archiveteam.org/?diff=52215&oldid=52035
08:51:46<h2ibot>Bear edited List of websites excluded from the Wayback Machine (+363, More URLs that are part of the TRS.com empire.): https://wiki.archiveteam.org/?diff=52216&oldid=52204
08:54:47<h2ibot>Bear uploaded File:Abload - upload form.png ([[Abload]] before they disabled uploading.): https://wiki.archiveteam.org/?title=File%3AAbload%20-%20upload%20form.png
08:58:47<h2ibot>Bear edited Abload (+103, [[:File:Abload - upload form.png]]): https://wiki.archiveteam.org/?diff=52218&oldid=52202
09:00:02Bleo1826007227 quits [Client Quit]
09:01:26Bleo1826007227 joins
09:16:22monika quits [Quit: Zzz]
09:19:58monika (boom) joins
09:22:08monika quits [Client Quit]
09:30:33monika (boom) joins
09:41:45Carnildo_again quits [Remote host closed the connection]
09:41:51Carnildo_again joins
09:45:43Carnildo_again quits [Remote host closed the connection]
09:45:50Carnildo_again joins
09:46:04BornOn420 leaves [Textual IRC Client: www.textualapp.com]
09:56:34Carnildo_again quits [Read error: Connection reset by peer]
09:56:38beastbg8_ quits [Read error: Connection reset by peer]
09:56:46Carnildo joins
10:02:00igloo22225 quits [Quit: The Lounge - https://thelounge.chat]
10:02:26igloo22225 (igloo22225) joins
10:03:36monika quits [Client Quit]
10:03:53monika (boom) joins
10:04:37Carnildo quits [Remote host closed the connection]
10:04:43Carnildo joins
10:05:00BlueMaxima quits [Client Quit]
10:15:37PeterandLukas joins
10:15:57PeterandLukas quits [Client Quit]
10:24:39f_ (funderscore) joins
10:29:25f_ quits [Remote host closed the connection]
10:29:59f_ (funderscore) joins
10:32:30Carnildo quits [Read error: Connection reset by peer]
10:33:05Carnildo joins
10:38:19Carnildo quits [Ping timeout: 255 seconds]
10:38:40Carnildo joins
10:41:30Carnildo quits [Read error: Connection reset by peer]
10:41:33Carnildo joins
10:43:25PredatorIWD quits [Read error: Connection reset by peer]
11:00:59Carnildo quits [Read error: Connection reset by peer]
11:01:11Carnildo joins
11:01:20f_ quits [Client Quit]
11:16:17Carnildo quits [Remote host closed the connection]
11:16:25Carnildo joins
11:16:47PredatorIWD joins
11:17:36PredatorIWD quits [Read error: Connection reset by peer]
11:21:13PredatorIWD joins
11:27:19PredatorIWD quits [Read error: Connection reset by peer]
11:30:15Carnildo quits [Read error: Connection reset by peer]
11:30:19Carnildo joins
11:31:44PredatorIWD joins
11:52:13Carnildo quits [Remote host closed the connection]
11:52:21Carnildo joins
12:32:09Carnildo quits [Read error: Connection reset by peer]
12:32:17Carnildo joins
12:40:49Carnildo quits [Read error: Connection reset by peer]
12:41:05Carnildo joins
12:58:43etnguyen03 (etnguyen03) joins
13:08:38Carnildo_again joins
13:08:49Carnildo quits [Read error: Connection reset by peer]
13:15:30Carnildo_again quits [Read error: Connection reset by peer]
13:15:43Carnildo joins
13:25:44Carnildo quits [Read error: Connection reset by peer]
13:25:53Carnildo joins
13:28:18Carnildo quits [Remote host closed the connection]
13:28:23Carnildo joins
13:31:38shgaqnyrjp quits [Remote host closed the connection]
13:31:41shgaqnyrjp_ (shgaqnyrjp) joins
13:32:43Carnildo quits [Remote host closed the connection]
13:32:50Carnildo joins
13:45:22Carnildo quits [Remote host closed the connection]
13:45:34Carnildo joins
14:03:00Carnildo quits [Read error: Connection reset by peer]
14:03:03Carnildo joins
14:38:46shgaqnyrjp_ is now known as shgaqnyrjp
15:10:56kiryu quits [Remote host closed the connection]
15:11:13]SaRgE[ (sarge) joins
15:14:57sarge quits [Ping timeout: 272 seconds]
15:15:34<@JAA>arkiver: Good question, not sure. It'd just be a catalogue of books they offer, I think. The actual interesting part is behind a login wall and would require automating loaning and stuff.
15:19:33Carnildo quits [Read error: Connection reset by peer]
15:19:46Carnildo joins
15:19:49kiryu joins
15:19:49kiryu quits [Changing host]
15:19:49kiryu (kiryu) joins
15:24:43<ScenarioPlanet>What are the main conditions for getting voiced in several AT channels that are used to operate archival bots (example: #archivebot / #wikibot)?
15:32:27kiryu quits [Remote host closed the connection]
15:34:57loug joins
15:42:51Carnildo quits [Read error: Connection reset by peer]
15:46:48Medowar quits [Quit: ZNC 1.9.0 - https://znc.in]
15:47:02f_ (funderscore) joins
15:50:53f_ quits [Remote host closed the connection]
15:51:12f_ (funderscore) joins
15:52:30f_ is now known as funderscore
15:52:38funderscore is now known as f_
16:15:39etnguyen03 quits [Client Quit]
16:16:20etnguyen03 (etnguyen03) joins
16:26:07etnguyen03 quits [Client Quit]
16:26:48etnguyen03 (etnguyen03) joins
16:34:15<pokechu22>The main one is understanding how to operate the bot (including things like ignores and noticing when a site's gotten mad at us).
16:36:21loug quits [Client Quit]
16:36:35etnguyen03 quits [Client Quit]
16:37:16etnguyen03 (etnguyen03) joins
16:37:18loug joins
16:39:04<pokechu22>You're not currently in #wikibot, but I'd say that one is easier to operate as there are only a few commands
16:40:09<ScenarioPlanet>That doesn't seem to be hard to understand (especially in #wikibot case), but some details like pipeline operations are kinda off-putting for me, maybe because they don't have any public documentation
16:40:25<pokechu22>Yeah :/
16:41:16<ScenarioPlanet>I mean things like pipeline notes (which ones are behind Cloudflare or not, which is closer to the server that holds the website being archived, local censorship & more)
16:43:22marto_ quits [Client Quit]
16:43:28marto_ (marto_) joins
16:47:02etnguyen03 quits [Client Quit]
17:18:52<eightthree>JAA: what about TypeScript using Node.js (or perhaps Vue or other TypeScript-based tools/langs/frameworks)?
17:20:21<eightthree>https://github.com/webrecorder/browsertrix - in typescript
17:27:58<@JAA>eightthree: Stay away from anything webrecorder until it's proven it's not outputting rubbish.
17:28:42<eightthree>JAA https://github.com/webrecorder/pywb/issues/294 I noticed you linked to this, but I don't know how relevant it is, given it's one of their non-TS projects...
17:28:46<@JAA>At least two of their tools do not produce valid WARCs. They have known this for years and made no attempt to fix it that I'm aware of.
17:29:02<@JAA>I have no reason to trust any of their tooling that produces WARCs.
17:29:21<eightthree>https://github.com/ArchiveTeam/ArchiveBot/issues/70 otherwise, this is still the open issue on using webrecorder...
17:29:37<eightthree>with hardly anything said...
17:29:52<@JAA>I never saw this issue before, and it predates my presence here.
17:30:04<@JAA>I guess the problems weren't known at the time.
17:30:21<eightthree>damn, how do I get proof? By... not staying away and trying it (or seeing others comment on Reddit, GitHub, etc.)?
17:31:06<@JAA>You produce WARCs with them and verify that they are compliant. This requires intimate knowledge of the WARC and HTTP specs.
17:31:31<@JAA>Or you look at the code and immediately see why they can't possibly be compliant.
17:34:03<@JAA>E.g. their browser extension thing can't ever work because the browser doesn't make the necessary data available to extensions.
17:34:19<@JAA>ArchiveWeb.Page
17:36:07datechnoman quits [Client Quit]
17:36:30datechnoman (datechnoman) joins
17:50:42f_ is now known as f_|afk
17:51:04f_|afk is now known as f_
17:54:17<nicolas17>JAA: what would the browser need to expose, the raw network data without TLS?
17:55:07<@JAA>nicolas17: Yes, headers and transfer encoding as sent by the server. You only get a parsed representation of the former (losing capitalisation, whitespace, order) and TE is stripped.
17:55:14<nicolas17>it seems to me that would immediately hit the problem of WARC not supporting HTTP2 :P
17:55:22<@JAA>Yes, that as well.
17:55:38<@JAA>You can only write HTTP/1.1 to WARC. Technically, even 0.9 and 1.0 are incompatible.
17:56:09<@JAA>1.0 *might* work with a bit of generous interpretation. 0.9 definitely doesn't.
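For reference, a schematic WARC 1.1 response record (all values made up; lengths are illustrative but consistent). The block after the blank line is the verbatim HTTP/1.1 message, which is why other HTTP versions have no place to go:

    WARC/1.1
    WARC-Type: response
    WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>
    WARC-Date: 2024-05-08T02:28:30Z
    WARC-Target-URI: https://example.org/
    Content-Type: application/http;msgtype=response
    Content-Length: 77

    HTTP/1.1 200 OK
    Content-Type: text/html
    Content-Length: 13

    Hello, world!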
17:56:39<nicolas17>and translating HTTP2 to HTTP1.1 would likely be frowned upon
17:56:57<nicolas17>so now you have to disable HTTP2 browser-wide
17:57:07<@JAA>I've read somewhere that that's what webrecorder do, and yes, that's also bad.
17:57:30<@JAA>Or really, worse, because it even misrepresents which HTTP version was used.
17:58:26<eightthree>btw, I found this https://github.com/archivetheweb/archiver in Rust, but no new commits in a year, and it's at v0.3; it does focus on WARC 1.1 though
17:58:43<eightthree>JAA: what other "places" than here might have reliable enough experts (in WARC I guess... or maybe also WACZ?) that I could just ask (and to avoid asking in a specific tool's chatroom/forum, as they are more likely biased)?
17:58:46<eightthree>I found these 2 awesome lists https://github.com/ruarxive/awesome-digital-preservation https://github.com/iipc/awesome-web-archiving, but the first lists webrecorder as high-fidelity, so I don't know if those list makers or any of the projects mentioned are reliable enough by your standards...
17:59:34<@JAA>eightthree: I'm not aware of any. Even my discussions in the IIPC about this (which is the organisation where the WARC specification is written) were not entirely fruitful. I should revive them though.
18:00:17<@JAA>It seems that very few people care about spec compliance, which is wild given this data is supposed to survive decades or more.
18:01:09<@JAA>So far, my only rough rule is that if a software was written by IA, it's probably doing it correctly.
18:02:08lennier2 quits [Read error: Connection reset by peer]
18:02:27lennier2 joins
18:11:45<eightthree>so I searched for "better than warc" and stumbled upon someone saying HAR is better than WARC, and capturing a HAR of the current page is implemented in the F12 devtools I believe... Do you have any link I can read to see why WARC is the best of all the archiving formats? When the IIPC itself isn't reliable...
18:12:36<@JAA>IIPC is reliable, and there are people there that care about accuracy, just many don't.
18:12:41<eightthree>or maybe roughly tell me what to google if it'll take longer to find a link
18:13:08<@JAA>HAR doesn't preserve the exact HTTP traffic either, just a parsed version of it.
18:13:21<@JAA>So roughly the same as what webrecorder's tooling produces.
18:14:13<@JAA>You can always transform a (correct) WARC to a HAR, but the opposite is not possible.
18:16:50<@JAA>HAR is also awkward to use for anything larger than a single page or maybe a few. There's no concept of compression, and it's a single large JSON object, so appending is hard. I don't recall how binary data is stored, but I believe that's a mess, too. (Maybe base64?)
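A sketch of the direction that does work, WARC to a HAR-like structure, using the warcio library; capture.warc.gz is a hypothetical input, and the HAR fields are trimmed to the essentials. By the time the JSON is built, the raw header bytes and transfer encoding are already gone, which is also why the reverse direction can't reconstruct a byte-exact WARC:

    import base64
    import json
    from warcio.archiveiterator import ArchiveIterator

    entries = []
    with open("capture.warc.gz", "rb") as stream:  # hypothetical input
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            body = record.content_stream().read()  # decoded; TE already stripped
            entries.append({
                "request": {
                    "url": record.rec_headers.get_header("WARC-Target-URI"),
                },
                "response": {
                    "status": int(record.http_headers.get_statuscode()),
                    # Parsed name/value pairs; the exact whitespace and raw
                    # byte layout of the wire format are not preserved in HAR.
                    "headers": [{"name": n, "value": v}
                                for n, v in record.http_headers.headers],
                    "content": {
                        "encoding": "base64",
                        "text": base64.b64encode(body).decode("ascii"),
                    },
                },
            })
    print(json.dumps({"log": {"version": "1.2", "entries": entries}}))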
18:17:52<eightthree>JAA: so technically one could make a proper "extension" but compile it right into a browser, if a Firefox or Chromium derivative browser were ever willing to do this? I've been annoyed by how SingleFile is indeed not always a proper representation, and TIL it's not entirely their fault...
18:18:06<@JAA>Yep, it is base64 indeed.
18:18:28<@JAA>Sure, if you modified the browser, it could definitely be done.
18:18:45<@JAA>But that's obviously not an easy task.
18:19:02<@JAA>And it's probably why brozzler uses an MITM proxy instead.
18:19:49<nicolas17>in theory you could use SSLKEYLOGFILE and capture the SSL'd traffic and decrypt it
18:19:59<nicolas17>but that still has the problem of HTTP2
18:20:53<nicolas17>would need to turn http2/3 off
18:21:14<@JAA>Yeah
18:21:15<nicolas17>or do MITM so you can tamper with the list of supported protocols
18:21:32<eightthree>JAA: Tor Browser baked in NoScript; the GrapheneOS browser also incorporated... a content filter, not sure if it's NoScript. I've noticed firsthand many times how extensions can stop working when RAM use is too high or something, and I guess that's why they didn't want to compromise on security.
18:23:39<nicolas17>I have also seen cases where Wireshark/dumpcap loses a packet even though the recipient ack'd it so it wasn't lost in the network, and then the entire stream is fucked
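A rough sketch of that pipeline, assuming the browser honours SSLKEYLOGFILE (Firefox and Chromium generally do) and using Wireshark's dumpcap/tshark; all paths and the interface name are placeholders:

    # Make the browser log TLS session keys while browsing.
    SSLKEYLOGFILE=/tmp/tls-keys.log firefox &

    # Capture the encrypted traffic separately...
    dumpcap -i eth0 -w /tmp/capture.pcapng

    # ...then decrypt and inspect it afterwards with the logged keys.
    tshark -r /tmp/capture.pcapng -o tls.keylog_file:/tmp/tls-keys.log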
18:24:57wessel1512 quits [Ping timeout: 272 seconds]
18:25:23yano quits [Quit: WeeChat, the better IRC client, https://weechat.org/]
18:27:58yano (yano) joins
18:28:02wessel1512 joins
18:36:13etnguyen03 (etnguyen03) joins
18:50:53RealPerson leaves
19:06:00Island joins
19:12:25etnguyen03 quits [Client Quit]
19:13:07etnguyen03 (etnguyen03) joins
19:31:29pixel leaves
19:31:30pixel (pixel) joins
19:37:35wyatt8740 joins
19:43:53wyatt8750 joins
19:45:04wyatt8740 quits [Ping timeout: 255 seconds]
20:04:10Jens quits []
20:04:56Jens (JensRex) joins
20:05:25DJ joins
20:06:33<lea>nicolas17: why is http2/3 a problem here? SSLKEYLOGFILE should work for them as well
20:06:52f_ quits [Client Quit]
20:10:15<DJ>Hey, has anyone checked out https://github.com/aliparlakci/bulk-downloader-for-reddit or https://github.com/RedditDownloader/redditdownloader.github.io? I think archivebot is limited in terms of crawling subreddits so I was wondering if anyone was aware of these or if they work.
20:38:56<nicolas17>lea: the WARC format has no way to represent captured HTTP2 requests/responses
20:39:30<nicolas17>and if you synthesize HTTP1.1-looking syntax from the HTTP2 data, that's not a pristine capture
20:39:34<lea>nicolas17: curl has a way of representing http2/3 responses in a format similar to http1.1, could that not be used?
20:39:41<lea>ah
20:39:52tapos joins
20:39:58<nicolas17><JAA> Or really, worse, because it even misrepresents which HTTP version was used.
20:41:13<lea>can it not say HTTP/2.0 200 OK in the header instead of HTTP/1.1 200 OK?
20:41:25<DJ>Nvm, the first one has the 1000-post API limit; I don't know if the second one does, but probably.
20:41:42<lea>or is that the thing that is not supported?
20:42:06<@JAA>lea: WARC captures exactly what was sent over the network (at the application layer). And yes, the spec only supports HTTP/1.1 specifically.
20:51:00fireonlive wonders if IIPC(?) will update it soonish
20:55:24etnguyen03 quits [Client Quit]
21:01:18BearFortress_ joins
21:01:19tapos2 joins
21:01:23qw3rty_ joins
21:01:24ymgve_ joins
21:01:39AlsoHP_Archivist joins
21:03:31DJ58 joins
21:03:39DJ58 leaves
21:04:42ymgve quits [Ping timeout: 265 seconds]
21:04:42qw3rty quits [Ping timeout: 265 seconds]
21:05:11HP_Archivist quits [Ping timeout: 265 seconds]
21:05:40tapos quits [Ping timeout: 265 seconds]
21:05:40DJ quits [Ping timeout: 265 seconds]
21:05:40BearFortress quits [Ping timeout: 265 seconds]
21:05:40Guest quits [Ping timeout: 265 seconds]
21:13:18<eightthree>fireonlive: Am I too much of a conspiracy theorist for thinking that it might be intentional that proper copies of websites are so hard to make and the spec goes un-updated? Either to fingerprint copycat websites (for Google and others to programmatically punish them in their algo, or not show them at all), or to ensure that MITMs can be noticed?
21:14:22<@JAA>Yes, you are.
21:14:29<eightthree>like Google has $20 billion USD to send to Apple... each year, I think, but they can't keep any archiving standard up to date with any of the other standards they spend heavily to influence and update?
21:14:37Earendil7 quits [Ping timeout: 255 seconds]
21:14:54<@JAA>The problem is that there's probably less than a dozen people worldwide who have worked on the WARC spec occasionally over the past two decades.
21:15:17<@JAA>It's very much a niche, and there's no funding for doing these things.
21:16:04<@JAA>Google et al. don't care about WARC or even archival.
21:17:59<@JAA>It's so much of a niche that I was apparently the first person to try to implement a WARC parser based on just the spec, since I ran into several inconsistencies that *have* to come up on every implementation.
21:18:26<eightthree>JAA: you seem to not see that the funding money could exist, but doesn't. ~What do Google and other search engines use to show an archived copy of a page?~ Ok, forget that, those aren't JS copies, likely for security reasons, but say Google Translate shows a page with JS; what does that use? Why don't they care to make it as close to the original as possible? The companies have the money; they choose not to spend it on this....
21:18:32<eightthree>... There could be 100s or 1000s working on this long-term if they wanted it.
21:19:18<eightthree>govts and library/archiving orgs likewise have some budget...
21:19:34<@JAA>They're companies. They care about making money. Spending effort on exact preservation when it doesn't matter to their service isn't something they will do.
21:20:21Earendil7 (Earendil7) joins
21:20:26<@JAA>At best, they care about storing an HTTP-equivalent copy of the data, i.e. with headers normalised, transfer encoding stripped, etc., since it's much easier to work with that.
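A made-up example of the difference: the same response as sent on the wire with chunked transfer encoding, and the HTTP-equivalent copy most tooling stores instead:

    As sent by the server (what a WARC response record must preserve):

        HTTP/1.1 200 OK
        Transfer-Encoding: chunked

        6
        Hello,
        5
         worl
        2
        d!
        0

    HTTP-equivalent copy (headers normalised, transfer encoding stripped):

        HTTP/1.1 200 OK
        Content-Length: 13

        Hello, world!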
21:20:27<eightthree>> inconsistencies that *have* to come up on every implementation
21:20:27<eightthree>what do you mean by this? Other WARC parsers always have inconsistencies, but why do they have to?
21:20:52<@JAA>It's not that other parsers have inconsistencies; the spec has inconsistencies, and anyone implementing a parser based on the spec would have to run into them.
21:21:01<@JAA>Since they weren't reported previously, apparently nobody did that.
21:21:17<@JAA>Enjoy: https://github.com/iipc/warc-specifications/issues/created_by/JustAnotherArchivist
21:21:58<@JAA>#71 and #72 in particular are unavoidable if you look at the grammar.
21:22:02<eightthree>JAA: maybe this is why, as you were complaining earlier, most people don't care about coding to spec? If the spec has issues...
21:22:40<@JAA>If the spec has issues and you care about preservation, you raise the issue and get the spec fixed.
21:27:16<eightthree>have you seen anything about funding for the people contributing, from the IIPC or others? Who/how many, if anyone, is paid to contribute?
21:31:01<@JAA>So the IIPC is a consortium, and many of the people there are actually employed at various institutions doing digital preservation stuff. That would include some people from IA, from the British Library, etc. I imagine that the part of their employment dedicated to contributing to the IIPC is tiny to nonexistent though.
21:31:47<@JAA>IIPC does fund some projects in a narrow scope. There's an annual call for proposals.
21:32:17<@JAA>Or there's supposed to be, anyway; I think it hasn't happened in a couple of years now, for a reason they haven't communicated publicly.
21:33:21<@JAA>(I have considered submitting a proposal in this area before.)
21:38:18Bleo18260072271 joins
21:40:28Bleo1826007227 quits [Ping timeout: 265 seconds]
21:40:28Bleo18260072271 is now known as Bleo1826007227
21:40:58<eightthree>> Call for proposals is now closed
21:40:58<eightthree>> Proposals due: 15 September 2021
21:40:58<eightthree>> Projects start: 1 January 2022
21:40:58<eightthree>> Final report due: by 31 December 2022
21:41:06<eightthree>from https://netpreserve.org/projects/funding/
21:43:00<@JAA>Aye
22:01:57<eightthree>hmm, so if funding dried up (from only a trickle) at the IIPC, perhaps the solution is to improve HAR since it's so much more widely deployed? The link on the Wikipedia article points to a draft,
22:01:59<eightthree>https://w3c.github.io/web-performance/specs/HAR/Overview.html
22:01:59<eightthree>with fat warning
22:01:59<eightthree>> _DO NOT USE_
22:01:59<eightthree>> This document was never published by the W3C Web Performance Working Group and has been abandoned.
22:01:59<eightthree>but the document lists itself as the latest,
22:02:00<eightthree>> Historical Draft August 14, 2012
22:02:01<eightthree>> This version:
22:02:02<eightthree>> https://w3c.github.io/web-performance/specs/HAR/Overview.html
22:02:04<eightthree>> Latest version:
22:02:05<eightthree>> https://w3c.github.io/web-performance/specs/HAR/Overview.html
22:02:07<eightthree>and searching https://www.w3.org/TR/?filter-tr-name=har shows nothing...
22:02:08<eightthree>why was a draft with a fat warning so widely deployed over WARC???
22:05:26<eightthree>*deployed instead of WARC...
22:05:26<eightthree>and do browsers etc. all implement their own modified way of capturing a HAR file from the current page in the browser? Am I going to have to go on a long hunt through the repo/mailing list of each of these
22:05:29<eightthree>> The HAR format is supported by various software, including:
22:05:29<eightthree>> Charles Proxy
22:05:29<eightthree>> Fiddler
22:05:29<eightthree>> Firebug
22:05:29<eightthree>> Firefox
22:05:30<eightthree>> Fluxzy Desktop
22:05:32<eightthree>> Google Chrome
22:05:33<eightthree>> Internet Explorer 9
22:05:35<eightthree>> Microsoft Edge
22:05:36<eightthree>> Mitmproxy
22:05:38<eightthree>> Postman
22:05:39<eightthree>> OWASP ZAP
22:05:41<eightthree>> Safari
22:05:42<eightthree>to find out how they implement it?
22:06:06<@JAA>Likely, because I doubt there's any documentation on what they do in detail.
22:06:49<eightthree>I'm using Matrix btw and it's showing me the scissors icon, so tell me if my message is hard to read... maybe I'll pastebin it...
22:06:56<@JAA>HAR being JSON is both great and horrible.
22:07:02<@JAA>Well, you pasted like a dozen messages, yeah.
22:07:27SootBector quits [Ping timeout: 250 seconds]
22:09:29<eightthree>HTTP/2 was published in 2015 - does anyone implement HAR beyond HTTP/1.1?
22:09:52SootBector (SootBector) joins
22:12:23<eightthree>https://huggingface.co/spaces/ehristoforu/mixtral-46.7b-chat - this AI says har is neutral to http version...
22:12:47<@JAA>Keep AI nonsense out of here.
22:13:24<nicolas17>eightthree: where do you think it got that information from?
22:13:38flotwig_ joins
22:14:47flotwig quits [Ping timeout: 265 seconds]
22:14:47flotwig_ is now known as flotwig
22:15:14<nicolas17>if there isn't any good information on HAR on websites, then there's nowhere the AI could have learned it from and it's just guessing/hallucinating
22:15:25<nicolas17>if there is, then look at those websites instead :P
22:16:30<eightthree>JAA: sorry
22:17:21<@JAA>To answer the question, at least Firefox can put HTTP/2 into HAR. Probably HTTP/3 and WebSocket, too.
22:17:51<nicolas17>by putting the decoded normalized headers in there?
22:18:03<@JAA>Yes
22:18:13<@JAA>It's a massive JSON object.
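For illustration, a heavily trimmed, hypothetical HAR entry of the kind Firefox exports (version string and values invented; "..." marks omitted fields). Note the parsed header list, the base64 body, and the httpVersion field that can carry "HTTP/2":

    {
      "log": {
        "version": "1.2",
        "creator": {"name": "Firefox", "version": "125.0"},
        "entries": [{
          "startedDateTime": "2024-05-08T22:18:13.000Z",
          "request": {"method": "GET", "url": "https://example.org/",
                      "httpVersion": "HTTP/2", "headers": [...]},
          "response": {"status": 200, "statusText": "OK",
                       "httpVersion": "HTTP/2",
                       "headers": [{"name": "content-type", "value": "image/png"}],
                       "content": {"mimeType": "image/png",
                                   "encoding": "base64",
                                   "text": "iVBORw0KGgo..."}}
        }]
      }
    }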
22:42:31<eightthree>https://searchfox.org/mozilla-central/search?q=har&path=&case=false&regexp=false - 112 results when I check the "whole words" checkbox and ctrl-f for har
22:44:43<eightthree>JAA: where can I find a detailed "what's missing from HAR that WARC has"?
22:45:07<@JAA>eightthree: By comparing the specs of HAR and WARC in detail. I doubt it's been done before.
22:46:30<eightthree>JAA: like, the github repo you linked to earlier for warc, with the 2012 draft I linked for HAR?
22:47:09<@JAA>Probably? I never looked into what documentation exists on HAR. There might be stuff on MDN or in browsers' documentations, too.
22:47:44<eightthree>I guess there's no comparison with Firefox's implementation yet... I'll see if I absolutely need to decipher code or if there's something in Bugzilla or elsewhere...
22:51:13<eightthree>https://firefox-source-docs.mozilla.org/devtools-user/network_monitor/index.html#copysave-all-as-har https://github.com/mdn/content/blob/main/files/en-us/mozilla/firefox/releases/41/index.md seemingly the button was added in Firefox 41
22:52:19<eightthree>https://bugzilla.mozilla.org/show_bug.cgi?id=859058
22:57:35<eightthree>> [Links (documentation, blog post, etc)]: 'HAR' can be linked to http://www.softwareishard.com/blog/har-12-spec/
22:57:37<eightthree>from https://bugzilla.mozilla.org/show_bug.cgi?id=859058#c38; even though they had found the W3C link too, the last mention of what to officially link to as the spec was the above line
22:58:08tapos2 quits [Client Quit]
22:58:54Overlordz_ quits [Quit: Leaving]
23:01:17BlueMaxima joins
23:01:48tapos joins
23:03:17etnguyen03 (etnguyen03) joins
23:05:22<eightthree>https://bugzilla.mozilla.org/show_bug.cgi?id=859058#c7 this Honza guy seems knowledgeable enough to have proposed writing a draft update in case extra features were needed. Said 10 years ago, though :)
23:11:55<eightthree>https://searchfox.org/mozilla-central/source/devtools/client/netmonitor/src/har/README.md
23:16:39etnguyen03 quits [Client Quit]
23:22:00<eightthree>https://addons.mozilla.org/en-US/firefox/addon/har-export-trigger/ - made by the same guy,
23:22:00<eightthree>Jan Honza Odvarko, who implemented HAR in Firefox... but per the README there seems to be a way, with a user.js setting and then a browser restart, to automatically save each page without needing the extension, at least that's how it seems...
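If the README in question is the searchfox one linked above, the user.js line would presumably look something like the following (pref name taken from the devtools HAR documentation; verify against the README before relying on it):

    // user.js -- enable automatic HAR export without the extension
    user_pref("devtools.netmonitor.har.enableAutoExportToFile", true);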
23:44:14etnguyen03 (etnguyen03) joins
23:44:41Guest joins
23:48:06Wohlstand (Wohlstand) joins