00:04:50 | | Carnildo quits [Read error: Connection reset by peer] |
00:04:58 | | Carnildo joins |
00:14:37 | | Earendil7 quits [Ping timeout: 255 seconds] |
00:15:24 | | Earendil7 (Earendil7) joins |
00:18:30 | | Carnildo quits [Read error: Connection reset by peer] |
00:18:52 | | Carnildo joins |
00:21:41 | | etnguyen03 quits [Client Quit] |
00:22:29 | | etnguyen03 (etnguyen03) joins |
00:32:54 | | etnguyen03 quits [Client Quit] |
00:33:36 | | etnguyen03 (etnguyen03) joins |
00:38:27 | | Carnildo quits [Read error: Connection reset by peer] |
00:38:39 | | Carnildo joins |
00:47:28 | | Bleo1826007227 quits [Client Quit] |
00:47:45 | | Bleo1826007227 joins |
00:48:34 | | Carnildo_again joins |
00:48:42 | | Carnildo quits [Remote host closed the connection] |
01:02:26 | | dsadasd joins |
01:02:53 | | dsadasd leaves |
01:05:28 | | Barto quits [Ping timeout: 255 seconds] |
01:07:44 | | Carnildo_again quits [Remote host closed the connection] |
01:07:51 | | Carnildo joins |
01:11:49 | | Carnildo quits [Read error: Connection reset by peer] |
01:11:58 | | Carnildo joins |
01:20:57 | | skyrocket quits [Client Quit] |
01:21:54 | | skyrocket joins |
01:22:11 | | Carnildo_again joins |
01:22:37 | | Carnildo quits [Read error: Connection reset by peer] |
01:25:41 | | kiryu joins |
01:25:41 | | kiryu is now authenticated as kiryu |
01:25:41 | | kiryu quits [Changing host] |
01:25:41 | | kiryu (kiryu) joins |
01:27:50 | | etnguyen03 quits [Client Quit] |
01:28:31 | | etnguyen03 (etnguyen03) joins |
01:30:15 | | Carnildo_again quits [Read error: Connection reset by peer] |
01:30:38 | | Carnildo joins |
01:34:02 | | Carnildo quits [Remote host closed the connection] |
01:34:07 | | Carnildo joins |
01:36:20 | | Carnildo quits [Remote host closed the connection] |
01:36:34 | | Carnildo joins |
01:38:18 | | etnguyen03 quits [Client Quit] |
01:48:24 | | Carnildo_again joins |
01:48:24 | | Carnildo quits [Read error: Connection reset by peer] |
01:52:33 | | Carnildo_again quits [Read error: Connection reset by peer] |
01:52:35 | | Carnildo joins |
01:54:01 | | etnguyen03 (etnguyen03) joins |
01:57:56 | | eroc19908 quits [Quit: The Lounge - https://thelounge.chat] |
01:58:24 | | eroc1990 (eroc1990) joins |
02:08:48 | <eightthree> | JAA: https://github.com/internetarchive/brozzler this brozzler? in python, is it quick enough? anything in rust or go or ideally a memory-safe and type-safe language, works yet? I know none of these are going to be perfect reproductions of a js heavy site, as mentioned in your comment below the one I replied to... |
02:10:21 | <eightthree> | https://github.com/spider-rs/spider seems the most popular when I search https://github.com/search?q=crawler+lang%3Arust&ref=opensearch&type=repositories&s=stars&o=desc |
02:11:01 | | Carnildo quits [Remote host closed the connection] |
02:11:13 | | Carnildo joins |
02:26:12 | <@JAA> | eightthree: I'm not aware of any software written in Go or Rust that produces WARCs and has been verified to work correctly. And I have no experience with running brozzler. |
02:27:11 | <fireonlive> | https://dl.fireon.live/irc/aed6d285fea3f806/image.png not encouraging |
02:28:30 | <@JAA> | I wouldn't trust it anyway until verified. Lots of software writes incorrect WARCs, and most HTTP libraries don't make it easy to write correct ones since they usually don't expose the low-level byte stream. |
02:29:21 | <@JAA> | So unless you do the I/O yourself and use a sans-I/O parser, it's more likely than not going to be wrong. |
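The "do the I/O yourself" approach JAA describes can be sketched roughly as follows: keep the exact bytes the server sent and wrap them verbatim in a WARC/1.1 response record, with no header parsing or re-serialising in between. This is an illustrative sketch only, not any existing tool's record writer; the `captured` bytes are hardcoded here as a stand-in for data read straight off a socket.

```python
import base64
import hashlib
import uuid
from datetime import datetime, timezone

def build_warc_response_record(target_uri: str, raw_http: bytes) -> bytes:
    """Wrap raw HTTP/1.1 response bytes in a minimal WARC/1.1 response record.

    The payload goes in verbatim: no parsing, no re-serialising, which is
    the only way the stored record can match what the server actually sent.
    """
    # WARC block digests are conventionally base32-encoded SHA-1.
    digest = base64.b32encode(hashlib.sha1(raw_http).digest()).decode("ascii")
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http;msgtype=response"),
        ("WARC-Block-Digest", f"sha1:{digest}"),
        ("Content-Length", str(len(raw_http))),
    ]
    head = "WARC/1.1\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers)
    # A record is terminated by two CRLFs after the block.
    return head.encode("ascii") + b"\r\n" + raw_http + b"\r\n\r\n"

# Stand-in for bytes read straight off the socket; note the preserved
# header capitalisation and ordering, which parsed HTTP APIs would destroy.
captured = (b"HTTP/1.1 200 OK\r\n"
            b"content-TYPE: text/plain\r\n"
            b"Content-Length: 2\r\n"
            b"\r\n"
            b"hi")
record = build_warc_response_record("http://example.com/", captured)
```

A real implementation would additionally need a matching request record, WARC-Payload-Digest, and careful handling of transfer encoding, which is exactly where most libraries go wrong.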
02:35:26 | <pabs> | are there tools for validating warc files are spec-conformant and not weird in other ways? |
02:35:51 | | pabs wonders how long this change will last https://en.wikipedia.org/w/index.php?title=WARC_(file_format)&diff=prev&oldid=1222815751 |
02:36:35 | <pabs> | (added a link on the wikipedia WARC page to the AT WARC ecosystem page) |
02:37:11 | | Carnildo_again joins |
02:37:21 | | Carnildo quits [Read error: Connection reset by peer] |
02:37:42 | <@JAA> | Not that I'm aware of. Someone was working on one in the context of warcio several years ago, but I don't think that ever landed. I've been working on my own, but not ready yet. |
02:42:32 | | Carnildo_again quits [Remote host closed the connection] |
02:42:33 | | Still_Carnildo joins |
02:44:32 | <pabs> | -rss/#hackernews- Microsoft closes several large Bethesda affiliated game studios: https://www.ign.com/articles/microsoft-closes-redfall-developer-arkane-austin-hifi-rush-developer-tango-gameworks-and-more-in-devastating-cuts-at-bethesda https://news.ycombinator.com/item?id=40285476 |
02:44:35 | | HP_Archivist (HP_Archivist) joins |
02:58:39 | | Carnildo joins |
02:58:39 | | Still_Carnildo quits [Read error: Connection reset by peer] |
03:04:27 | | Carnildo quits [Read error: Connection reset by peer] |
03:05:02 | | Carnildo joins |
03:05:24 | <@OrIdow6> | JAA: Doesn't the IA have a crawler in Go? |
03:06:54 | <@JAA> | OrIdow6: Hmm right, Zeno. |
03:22:13 | <pabs> | "After 16 years online, Feedbooks will soon close down." https://www.feedbooks.com/ |
03:24:23 | <@JAA> | Yeah, it's been running through AB since late March, but that won't get it done. |
03:24:47 | <pabs> | ah |
03:25:50 | <@JAA> | Various filters etc. let the queue explode. One job had to be aborted already. |
03:28:34 | <fireonlive> | i read that as facebook and got a flash of excitement |
03:28:37 | <fireonlive> | :[ |
03:48:49 | | Carnildo quits [Remote host closed the connection] |
03:48:51 | | Carnildo joins |
04:05:17 | <Vokun> | The amount of family photos that would disappear from the planet when facebook shuts down. Woah
04:05:46 | <Vokun> | It'd be interesting if they decided to sell all their hardware though. Imagine how cheap used servers would go for
04:07:05 | <Vokun> | We need like a solid few months without any emergencies so that there's time to actually get #Y up and running |
04:08:16 | <Vokun> | A few months without an emergency? |
04:08:20 | | Vokun uploaded an image: (71KiB) < https://matrix.hackint.org/_matrix/media/v3/download/matrix.org/EgJNhFnPIUVWCXCqlDbYzulM/image.png > |
04:11:05 | | Carnildo quits [Remote host closed the connection] |
04:11:16 | | Carnildo joins |
04:32:23 | | Carnildo quits [Read error: Connection reset by peer] |
04:32:45 | | Carnildo joins |
04:35:19 | | Carnildo quits [Remote host closed the connection] |
04:35:22 | | Carnildo joins |
04:37:05 | | Carnildo quits [Read error: Connection reset by peer] |
04:37:07 | | Carnildo joins |
04:41:01 | | Island quits [Read error: Connection reset by peer] |
04:42:24 | | Carnildo quits [Remote host closed the connection] |
04:42:33 | | Carnildo joins |
04:51:40 | | Carnildo quits [Remote host closed the connection] |
04:51:42 | | Carnildo joins |
05:03:04 | | Lord_Nightmare quits [Ping timeout: 255 seconds] |
05:04:47 | | Lord_Nightmare (Lord_Nightmare) joins |
05:09:31 | | etnguyen03 quits [Client Quit] |
05:10:12 | | etnguyen03 (etnguyen03) joins |
05:25:51 | | etnguyen03 quits [Client Quit] |
05:26:32 | | etnguyen03 (etnguyen03) joins |
05:28:18 | | etnguyen03 quits [Remote host closed the connection] |
05:32:27 | | Barto (Barto) joins |
05:43:34 | | Bleo1826007227 quits [Ping timeout: 255 seconds] |
05:44:01 | | @dxrt quits [Ping timeout: 255 seconds] |
05:46:37 | | Bleo1826007227 joins |
05:51:21 | | dxrt joins |
05:51:23 | | dxrt is now authenticated as dxrt |
05:51:23 | | dxrt quits [Changing host] |
05:51:23 | | dxrt (dxrt) joins |
05:51:23 | | @ChanServ sets mode: +o dxrt |
06:21:11 | <@arkiver> | fireonlive: hahahaha, now that would be something! |
06:21:42 | <fireonlive> | xD for sure! |
06:22:09 | <@arkiver> | JAA: do we need a project for feedbooks? |
06:44:45 | | Carnildo quits [Read error: Connection reset by peer] |
06:44:47 | | Carnildo_again joins |
07:03:34 | | PredatorIWD joins |
07:06:11 | | Unholy2361924645 (Unholy2361) joins |
07:07:30 | | Carnildo_again quits [Remote host closed the connection] |
07:07:32 | | Carnildo joins |
07:18:16 | | PredatorIWD quits [Client Quit] |
07:21:13 | | BearFortress quits [Ping timeout: 255 seconds] |
07:22:23 | | BearFortress joins |
07:25:03 | | Carnildo quits [Read error: Connection reset by peer] |
07:25:14 | | Carnildo joins |
07:30:13 | | Carnildo quits [Remote host closed the connection] |
07:30:20 | | Carnildo joins |
07:50:25 | | PredatorIWD joins |
08:01:31 | | Carnildo_again joins |
08:02:37 | | Carnildo quits [Ping timeout: 255 seconds] |
08:02:39 | | PredatorIWD5 joins |
08:02:52 | | PredatorIWD quits [Client Quit] |
08:02:52 | | PredatorIWD5 is now known as PredatorIWD |
08:41:44 | <h2ibot> | Bear edited List of websites excluded from the Wayback Machine/Partial exclusions/Twitter accounts (+37, twitter.com/TheEuropeanMan1 - now he is…): https://wiki.archiveteam.org/?diff=52215&oldid=52035 |
08:51:46 | <h2ibot> | Bear edited List of websites excluded from the Wayback Machine (+363, More URLs that are part of the TRS.com empire.): https://wiki.archiveteam.org/?diff=52216&oldid=52204 |
08:54:47 | <h2ibot> | Bear uploaded File:Abload - upload form.png ([[Abload]] before they disabled uploading.): https://wiki.archiveteam.org/?title=File%3AAbload%20-%20upload%20form.png |
08:58:47 | <h2ibot> | Bear edited Abload (+103, [[:File:Abload - upload form.png]]): https://wiki.archiveteam.org/?diff=52218&oldid=52202 |
09:00:02 | | Bleo1826007227 quits [Client Quit] |
09:01:26 | | Bleo1826007227 joins |
09:16:22 | | monika quits [Quit: Zzz] |
09:19:58 | | monika (boom) joins |
09:22:08 | | monika quits [Client Quit] |
09:30:33 | | monika (boom) joins |
09:41:45 | | Carnildo_again quits [Remote host closed the connection] |
09:41:51 | | Carnildo_again joins |
09:45:43 | | Carnildo_again quits [Remote host closed the connection] |
09:45:50 | | Carnildo_again joins |
09:46:04 | | BornOn420 leaves [Textual IRC Client: www.textualapp.com] |
09:56:34 | | Carnildo_again quits [Read error: Connection reset by peer] |
09:56:38 | | beastbg8_ quits [Read error: Connection reset by peer] |
09:56:46 | | Carnildo joins |
10:02:00 | | igloo22225 quits [Quit: The Lounge - https://thelounge.chat] |
10:02:26 | | igloo22225 (igloo22225) joins |
10:03:36 | | monika quits [Client Quit] |
10:03:53 | | monika (boom) joins |
10:04:37 | | Carnildo quits [Remote host closed the connection] |
10:04:43 | | Carnildo joins |
10:05:00 | | BlueMaxima quits [Client Quit] |
10:15:37 | | PeterandLukas joins |
10:15:57 | | PeterandLukas quits [Client Quit] |
10:24:39 | | f_ (funderscore) joins |
10:29:25 | | f_ quits [Remote host closed the connection] |
10:29:59 | | f_ (funderscore) joins |
10:32:30 | | Carnildo quits [Read error: Connection reset by peer] |
10:33:05 | | Carnildo joins |
10:38:19 | | Carnildo quits [Ping timeout: 255 seconds] |
10:38:40 | | Carnildo joins |
10:41:30 | | Carnildo quits [Read error: Connection reset by peer] |
10:41:33 | | Carnildo joins |
10:43:25 | | PredatorIWD quits [Read error: Connection reset by peer] |
11:00:59 | | Carnildo quits [Read error: Connection reset by peer] |
11:01:11 | | Carnildo joins |
11:01:20 | | f_ quits [Client Quit] |
11:16:17 | | Carnildo quits [Remote host closed the connection] |
11:16:25 | | Carnildo joins |
11:16:47 | | PredatorIWD joins |
11:17:36 | | PredatorIWD quits [Read error: Connection reset by peer] |
11:21:13 | | PredatorIWD joins |
11:27:19 | | PredatorIWD quits [Read error: Connection reset by peer] |
11:30:15 | | Carnildo quits [Read error: Connection reset by peer] |
11:30:19 | | Carnildo joins |
11:31:44 | | PredatorIWD joins |
11:52:13 | | Carnildo quits [Remote host closed the connection] |
11:52:21 | | Carnildo joins |
12:32:09 | | Carnildo quits [Read error: Connection reset by peer] |
12:32:17 | | Carnildo joins |
12:40:49 | | Carnildo quits [Read error: Connection reset by peer] |
12:41:05 | | Carnildo joins |
12:58:43 | | etnguyen03 (etnguyen03) joins |
13:08:38 | | Carnildo_again joins |
13:08:49 | | Carnildo quits [Read error: Connection reset by peer] |
13:15:30 | | Carnildo_again quits [Read error: Connection reset by peer] |
13:15:43 | | Carnildo joins |
13:25:44 | | Carnildo quits [Read error: Connection reset by peer] |
13:25:53 | | Carnildo joins |
13:28:18 | | Carnildo quits [Remote host closed the connection] |
13:28:23 | | Carnildo joins |
13:31:38 | | shgaqnyrjp quits [Remote host closed the connection] |
13:31:41 | | shgaqnyrjp_ (shgaqnyrjp) joins |
13:32:43 | | Carnildo quits [Remote host closed the connection] |
13:32:50 | | Carnildo joins |
13:45:22 | | Carnildo quits [Remote host closed the connection] |
13:45:34 | | Carnildo joins |
14:03:00 | | Carnildo quits [Read error: Connection reset by peer] |
14:03:03 | | Carnildo joins |
14:38:46 | | shgaqnyrjp_ is now known as shgaqnyrjp |
15:10:56 | | kiryu quits [Remote host closed the connection] |
15:11:13 | | ]SaRgE[ (sarge) joins |
15:14:57 | | sarge quits [Ping timeout: 272 seconds] |
15:15:34 | <@JAA> | arkiver: Good question, not sure. It'd just be a catalogue of books they offer, I think. The actual interesting part is behind a login wall and would require automating loaning and stuff. |
15:19:33 | | Carnildo quits [Read error: Connection reset by peer] |
15:19:46 | | Carnildo joins |
15:19:49 | | kiryu joins |
15:19:49 | | kiryu is now authenticated as kiryu |
15:19:49 | | kiryu quits [Changing host] |
15:19:49 | | kiryu (kiryu) joins |
15:24:43 | <ScenarioPlanet> | What are the main conditions for getting voiced in several AT channels that are used to operate archival bots (example: #archivebot / #wikibot)? |
15:32:27 | | kiryu quits [Remote host closed the connection] |
15:34:57 | | loug joins |
15:42:51 | | Carnildo quits [Read error: Connection reset by peer] |
15:46:48 | | Medowar quits [Quit: ZNC 1.9.0 - https://znc.in] |
15:47:02 | | f_ (funderscore) joins |
15:50:53 | | f_ quits [Remote host closed the connection] |
15:51:12 | | f_ (funderscore) joins |
15:52:30 | | f_ is now known as funderscore |
15:52:38 | | funderscore is now known as f_ |
16:15:39 | | etnguyen03 quits [Client Quit] |
16:16:20 | | etnguyen03 (etnguyen03) joins |
16:26:07 | | etnguyen03 quits [Client Quit] |
16:26:48 | | etnguyen03 (etnguyen03) joins |
16:34:15 | <pokechu22> | The main one is understanding how to operate the bot mainly (including things like ignores and noticing when a site's gotten mad at us). |
16:36:21 | | loug quits [Client Quit] |
16:36:35 | | etnguyen03 quits [Client Quit] |
16:37:16 | | etnguyen03 (etnguyen03) joins |
16:37:18 | | loug joins |
16:39:04 | <pokechu22> | You're not currently in #wikibot but I'd say that one is easier to operate as there's only a few commands |
16:40:09 | <ScenarioPlanet> | That doesn't seem to be hard to understand (especially in #wikibot case), but some details like pipeline operations are kinda off-putting for me, maybe because they don't have any public documentation |
16:40:25 | <pokechu22> | Yeah :/ |
16:41:16 | <ScenarioPlanet> | I mean things like pipeline notes (which is cloudflared or not, which is closer to the server that holds the website being archived, local censorships & more) |
16:43:22 | | marto_ quits [Client Quit] |
16:43:28 | | marto_ (marto_) joins |
16:47:02 | | etnguyen03 quits [Client Quit] |
17:18:52 | <eightthree> | JAA: what about in typescript using node.js ( or perhaps in vue or other typescript-written tools/langs/frameworks) |
17:20:21 | <eightthree> | https://github.com/webrecorder/browsertrix - in typescript |
17:27:58 | <@JAA> | eightthree: Stay away from anything webrecorder until proven it's not outputting rubbish. |
17:28:42 | <eightthree> | JAA https://github.com/webrecorder/pywb/issues/294 I noticed you linked to this, but don't know how relevant it is, given it's one of the non-ts projects of theirs... |
17:28:46 | <@JAA> | At least two of their tools do not produce valid WARCs. They have known this for years and made no attempt to fix it that I'm aware of. |
17:29:02 | <@JAA> | I have no reason to trust any of their tooling that produces WARCs. |
17:29:21 | <eightthree> | https://github.com/ArchiveTeam/ArchiveBot/issues/70 otherwise, this is still the open issue on using webrecorder... |
17:29:37 | <eightthree> | with hardly anything said... |
17:29:52 | <@JAA> | I never saw this issue before, and it predates my presence here. |
17:30:04 | <@JAA> | I guess the problems weren't known at the time. |
17:30:21 | <eightthree> | damn, how do i get proof? by ...not staying away and trying it (or seeing others comment on reddit, github, etc.)?
17:31:06 | <@JAA> | You produce WARCs with them and verify that they are compliant. This requires intimate knowledge of the WARC and HTTP specs. |
17:31:31 | <@JAA> | Or you look at the code and immediately see why they can't possibly be compliant. |
17:34:03 | <@JAA> | E.g. their browser extension thing can't ever work because the browser doesn't make the necessary data available to extensions. |
17:34:19 | <@JAA> | ArchiveWeb.Page |
17:36:07 | | datechnoman quits [Client Quit] |
17:36:30 | | datechnoman (datechnoman) joins |
17:50:42 | | f_ is now known as f_|afk |
17:51:04 | | f_|afk is now known as f_ |
17:54:17 | <nicolas17> | JAA: what would the browser need to expose, the raw network data without TLS? |
17:55:07 | <@JAA> | nicolas17: Yes, headers and transfer encoding as sent by the server. You only get a parsed representation of the former (losing capitalisation, whitespace, order) and TE is stripped. |
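The information loss JAA mentions can be demonstrated with a toy parse/re-serialise round trip: once headers land in a dict-like structure (as browser extension APIs expose them), the original capitalisation, whitespace, and ordering are gone, and the raw bytes cannot be reconstructed. The parser below is hand-rolled purely for illustration.

```python
def parse_headers(raw: bytes) -> dict:
    """Parse header lines the way a typical HTTP API exposes them:
    names case-folded, whitespace stripped, original bytes discarded."""
    out = {}
    for line in raw.split(b"\r\n"):
        if b":" in line:
            name, _, value = line.partition(b":")
            out[name.strip().lower()] = value.strip()
    return out

raw = b"Server: Apache\r\ncontent-TYPE:  text/html\r\nX-Odd-Case: a\r\n"
parsed = parse_headers(raw)

# Re-serialising from the parsed form cannot recover the original bytes:
# casing, the double space, and any ordering guarantees are already lost.
rebuilt = b"".join(k + b": " + v + b"\r\n" for k, v in parsed.items())
```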
17:55:14 | <nicolas17> | it seems to me that would immediately hit the problem of WARC not supporting HTTP2 :P |
17:55:22 | <@JAA> | Yes, that as well. |
17:55:38 | <@JAA> | You can only write HTTP/1.1 to WARC. Technically, even 0.9 and 1.0 is incompatible. |
17:56:09 | <@JAA> | 1.0 *might* work with a bit of generous interpretation. 0.9 definitely doesn't. |
17:56:39 | <nicolas17> | and translating HTTP2 to HTTP1.1 would likely be frowned upon |
17:56:57 | <nicolas17> | so now you have to disable HTTP2 browser-wide |
17:57:07 | <@JAA> | I've read somewhere that that's what webrecorder do, and yes, that's also bad. |
17:57:30 | <@JAA> | Or really, worse, because it even misrepresents which HTTP version was used. |
17:58:26 | <eightthree> | btw, i found this https://github.com/archivetheweb/archiver in rust, but no new commits in over a year, and it's at v0.3; it does focus on warc 1.1 though
17:58:43 | <eightthree> | JAA: what other "places" than here might have reliable enough experts (in warc i guess..or maybe also wacz?) that I could just ask, (and to avoid asking in a specific tool's chatroom/forum, as they are more likely biased). |
17:58:46 | <eightthree> | I found these 2 awesome lists https://github.com/ruarxive/awesome-digital-preservation https://github.com/iipc/awesome-web-archiving, but the first noted webrecorder as high-fidelity, so I don't know if those listmakers or any of the projects mentioned are reliable enough by your standards...
17:59:34 | <@JAA> | eightthree: I'm not aware of any. Even my discussions in the IIPC about this (which is the organisation where the WARC specification is written) were not entirely fruitful. I should revive them though. |
18:00:17 | <@JAA> | It seems that very few people care about spec compliance, which is wild given this data is supposed to survive decades or more. |
18:01:09 | <@JAA> | So far, my only rough rule is that if a software was written by IA, it's probably doing it correctly. |
18:02:08 | | lennier2 quits [Read error: Connection reset by peer] |
18:02:27 | | lennier2 joins |
18:11:45 | <eightthree> | so I searched for "better than warc" and stumbled upon someone saying HAR is better than warc, and capturing a HAR of the current page is implemented in the F12 devtools I believe... Do you have any link that I can read to see why warc is best of all the archiving formats? When the IIPC itself isn't reliable... |
18:12:36 | <@JAA> | IIPC is reliable, and there are people there that care about accuracy, just many don't. |
18:12:41 | <eightthree> | or maybe roughly tell me what to google if itll take longer to find link |
18:13:08 | <@JAA> | HAR doesn't preserve the exact HTTP traffic either, just a parsed version of it. |
18:13:21 | <@JAA> | So roughly the same as what webrecorder's tooling produces. |
18:14:13 | <@JAA> | You can always transform a (correct) WARC to a HAR, but the opposite is not possible. |
18:16:50 | <@JAA> | HAR is also awkward to use for anything larger than a single page or maybe a few. There's no concept of compression, and it's a single large JSON object, so appending is hard. I don't recall how binary data is stored, but I believe that's a mess, too. (Maybe base64?) |
18:17:52 | <eightthree> | JAA: so technically one could make a proper "extension" but compile it right into a browser, if ever a firefox or chromium derivative browser would be willing to do this? I've been annoyed by how singlefile is indeed not always a proper representation and TIL it's not entirely their fault... |
18:18:06 | <@JAA> | Yep, it is base64 indeed. |
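The HAR shape JAA describes (one large JSON object, base64-encoded binary bodies) can be sketched as below. This uses a pared-down subset of the HAR 1.2 entry fields for illustration; a real HAR carries many more (timings, cookies, pages, etc.). It also shows why appending is awkward: there is no record-concatenation like WARC has, so adding one capture means re-parsing and rewriting the whole document.

```python
import base64
import json

def har_entry(url: str, status: int, body: bytes) -> dict:
    """A pared-down HAR 1.2 entry; binary bodies are stored as base64 text."""
    return {
        "request": {"method": "GET", "url": url},
        "response": {
            "status": status,
            "content": {
                "size": len(body),
                "encoding": "base64",
                "text": base64.b64encode(body).decode("ascii"),
            },
        },
    }

har = {"log": {"version": "1.2",
               "entries": [har_entry("http://example.com/", 200, b"\x89PNG")]}}
serialized = json.dumps(har)

# "Appending" one more capture means parsing and rewriting the entire
# object, unlike WARC, where new records are simply concatenated (and can
# even be individually gzipped).
doc = json.loads(serialized)
doc["log"]["entries"].append(har_entry("http://example.com/2", 200, b"hi"))
serialized = json.dumps(doc)
```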
18:18:28 | <@JAA> | Sure, if you modified the browser, it could definitely be done. |
18:18:45 | <@JAA> | But that's obviously not an easy task. |
18:19:02 | <@JAA> | And it's probably why brozzler uses an MITM proxy instead. |
18:19:49 | <nicolas17> | in theory you could use SSLKEYLOGFILE and capture the SSL'd traffic and decrypt it |
18:19:59 | <nicolas17> | but that still has the problem of HTTP2 |
18:20:53 | <nicolas17> | would need to turn http2/3 off |
18:21:14 | <@JAA> | Yeah |
18:21:15 | <nicolas17> | or do MITM so you can tamper with the list of supported protocols |
18:21:32 | <eightthree> | JAA: tor browser baked in noscript, the grapheneos browser also incorporated...a content filter, not sure if it's noscript. I have noticed first hand many times how extensions can stop working when ram use is too high or something, and I guess that's why they didn't want to compromise on security. |
18:23:39 | <nicolas17> | I have also seen cases where Wireshark/dumpcap loses a packet even though the recipient ack'd it so it wasn't lost in the network, and then the entire stream is fucked |
18:24:57 | | wessel1512 quits [Ping timeout: 272 seconds] |
18:25:23 | | yano quits [Quit: WeeChat, the better IRC client, https://weechat.org/] |
18:27:58 | | yano (yano) joins |
18:28:02 | | wessel1512 joins |
18:36:13 | | etnguyen03 (etnguyen03) joins |
18:50:53 | | RealPerson leaves |
19:06:00 | | Island joins |
19:12:25 | | etnguyen03 quits [Client Quit] |
19:13:07 | | etnguyen03 (etnguyen03) joins |
19:31:29 | | pixel leaves |
19:31:30 | | pixel (pixel) joins |
19:37:35 | | wyatt8740 joins |
19:43:53 | | wyatt8750 joins |
19:45:04 | | wyatt8740 quits [Ping timeout: 255 seconds] |
20:04:10 | | Jens quits [] |
20:04:56 | | Jens (JensRex) joins |
20:05:25 | | DJ joins |
20:06:33 | <lea> | nicolas17: why is http2/3 a problem here? SSLKEYLOGFILE should work for them as well |
20:06:52 | | f_ quits [Client Quit] |
20:10:15 | <DJ> | Hey, has anyone checked out https://github.com/aliparlakci/bulk-downloader-for-reddit or https://github.com/RedditDownloader/redditdownloader.github.io? I think archivebot is limited in terms of crawling subreddits so I was wondering if anyone was aware of these or if they work. |
20:38:56 | <nicolas17> | lea: the WARC format has no way to represent captured HTTP2 requests/responses |
20:39:30 | <nicolas17> | and if you synthesize HTTP1.1-looking syntax from the HTTP2 data, that's not a pristine capture |
20:39:34 | <lea> | nicolas17: curl has a way of representing http2/3 responses in a format similar to http1.1, could that not be used? |
20:39:41 | <lea> | ah |
20:39:52 | | tapos joins |
20:39:58 | <nicolas17> | <JAA> Or really, worse, because it even misrepresents which HTTP version was used. |
20:41:13 | <lea> | can it not say HTTP/2.0 200 OK in the header instead of HTTP/1.1 200 OK? |
20:41:25 | <DJ> | Nvm the first one has the 1000 posts API limit, I don't know if the second one does but probably. |
20:41:42 | <lea> | or is that the thing that is not supported? |
20:42:06 | <@JAA> | lea: WARC captures exactly what was sent over the network (at the application layer). And yes, the spec only supports HTTP/1.1 specifically. |
20:51:00 | | fireonlive wonders if IIPC(?) will update it soonish |
20:55:24 | | etnguyen03 quits [Client Quit] |
21:01:18 | | BearFortress_ joins |
21:01:19 | | tapos2 joins |
21:01:23 | | qw3rty_ joins |
21:01:24 | | ymgve_ joins |
21:01:39 | | AlsoHP_Archivist joins |
21:03:31 | | DJ58 joins |
21:03:39 | | DJ58 leaves |
21:04:42 | | ymgve quits [Ping timeout: 265 seconds] |
21:04:42 | | qw3rty quits [Ping timeout: 265 seconds] |
21:05:11 | | HP_Archivist quits [Ping timeout: 265 seconds] |
21:05:40 | | tapos quits [Ping timeout: 265 seconds] |
21:05:40 | | DJ quits [Ping timeout: 265 seconds] |
21:05:40 | | BearFortress quits [Ping timeout: 265 seconds] |
21:05:40 | | Guest quits [Ping timeout: 265 seconds] |
21:13:18 | <eightthree> | fireonlive: Am I too much of a conspiracy theorist for thinking that it might be intentional that proper copies of websites are so hard to do, spec not updated? Either to fingerprint copycat websites (for google and others to programmatically punish them in their algo (or not show them at all), but also to ensure that MITMs can be noticed? |
21:14:22 | <@JAA> | Yes, you are. |
21:14:29 | <eightthree> | like google has 20bill usd to send to Apple...I think each year, but they can't keep any archiving standard up to date with any of the other standards they spend heavily to influence and update? |
21:14:37 | | Earendil7 quits [Ping timeout: 255 seconds] |
21:14:54 | <@JAA> | The problem is that there's probably less than a dozen people worldwide who have worked on the WARC spec occasionally over the past two decades. |
21:15:17 | <@JAA> | It's very much a niche, and there's no funding for doing these things. |
21:16:04 | <@JAA> | Google et al. don't care about WARC or even archival. |
21:17:59 | <@JAA> | It's so much of a niche that I was apparently the first person to try to implement a WARC parser based on just the spec, since I ran into several inconsistencies that *have* to come up on every implementation. |
21:18:26 | <eightthree> | JAA: you seem to not see that the funding money amounts could exist, but don't. ~What does google and other search engines use to show an archived copy of a page?~ Ok, forget that, those arent js copies for secu reasons likely, but say, google translate shows a page with js, what does that use? Why don't they care to make it as close to the original as possible? The companies have the money, they choose not to spend on this.... |
21:18:32 | <eightthree> | ... There could be 100s or 1000s working on this long term if they wanted it. |
21:19:18 | <eightthree> | govts and library/archiving orgs likewise have some budget... |
21:19:34 | <@JAA> | They're companies. They care about making money. Spending effort on exact preservation when it doesn't matter to their service isn't something they will do. |
21:20:21 | | Earendil7 (Earendil7) joins |
21:20:26 | <@JAA> | At best, they care about storing an HTTP-equivalent copy of the data, i.e. with headers normalised, transfer encoding stripped, etc., since it's much easier to work with that. |
21:20:27 | <eightthree> | "inconsistencies that _have_ to come up on every implementation"
21:20:27 | <eightthree> | what do you mean by this? other warc parsers always have inconsistencies, but why would they *have* to?
21:20:52 | <@JAA> | Not other parsers have inconsistencies, the spec has inconsistencies, and anyone implementing a parser based on the spec would have to run into them. |
21:21:01 | <@JAA> | Since they weren't reported previously, apparently nobody did that. |
21:21:17 | <@JAA> | Enjoy: https://github.com/iipc/warc-specifications/issues/created_by/JustAnotherArchivist |
21:21:58 | <@JAA> | #71 and #72 in particular are unavoidable if you look at the grammar. |
21:22:02 | <eightthree> | JAA: maybe this is why, as you were complaining earlier, most people don't care about coding to spec? If the spec has issues...
21:22:40 | <@JAA> | If the spec has issues and you care about preservation, you raise the issue and get the spec fixed. |
21:27:16 | <eightthree> | have you seen anything about funding for the people contributing, from iipc or other? who/how many, if anyone is paid to contribute? |
21:31:01 | <@JAA> | So the IIPC is a consortium, and many of the people there are actually employed at various institutions doing digital preservation stuff. That would include some people from IA, from the British Library, etc. I imagine that the part of their employment dedicated to contributing to the IIPC is tiny to nonexistent though. |
21:31:47 | <@JAA> | IIPC does fund some projects in a narrow scope. There's an annual call for proposals. |
21:32:17 | <@JAA> | Or there's supposed to be, anyway, I think it hasn't happened in a couple years now for a reason they haven't communicated publicly I think. |
21:33:21 | <@JAA> | (I have considered submitting a proposal in this area before.) |
21:38:18 | | Bleo18260072271 joins |
21:40:28 | | Bleo1826007227 quits [Ping timeout: 265 seconds] |
21:40:28 | | Bleo18260072271 is now known as Bleo1826007227 |
21:40:58 | <eightthree> | Call for proposals is now closed |
21:40:58 | <eightthree> | Proposals due: 15 September 2021 |
21:40:58 | <eightthree> | Projects start: 1 January 2022 |
21:40:58 | <eightthree> | Final report due: by 31 December 2022 |
21:41:06 | <eightthree> | from https://netpreserve.org/projects/funding/ |
21:43:00 | <@JAA> | Aye |
22:01:57 | <eightthree> | hmm so if funding dried up (from only a trickle) at iipc, perhaps the solution is to improve HAR since it's so much more widely deployed? the link on the wikip article links to a draft, |
22:01:59 | <eightthree> | https://w3c.github.io/web-performance/specs/HAR/Overview.html |
22:01:59 | <eightthree> | with fat warning |
22:01:59 | <eightthree> | > _DO NOT USE_ |
22:01:59 | <eightthree> | > This document was never published by the W3C Web Performance Working Group and has been abandoned. |
22:01:59 | <eightthree> | but the document lists itself as the latest, |
22:02:00 | <eightthree> | > Historical Draft August 14, 2012 |
22:02:01 | <eightthree> | > This version: |
22:02:02 | <eightthree> | > https://w3c.github.io/web-performance/specs/HAR/Overview.html |
22:02:04 | <eightthree> | > Latest version: |
22:02:05 | <eightthree> | > https://w3c.github.io/web-performance/specs/HAR/Overview.html |
22:02:07 | <eightthree> | and searching https://www.w3.org/TR/?filter-tr-name=har shows nothing... |
22:02:08 | <eightthree> | why was a draft with a fat warning so widely deployed over WARC???
22:05:26 | <eightthree> | *deployed instead of warc... |
22:05:26 | <eightthree> | and do browsers etc all implement their own modified way of capturing HAR file from the current page in the browser? Am I going to have to go on a long hunt on each git/mailing list of each of these |
22:05:29 | <eightthree> | > The HAR format is supported by various software, including: |
22:05:29 | <eightthree> | > Charles Proxy |
22:05:29 | <eightthree> | > Fiddler |
22:05:29 | <eightthree> | > Firebug |
22:05:29 | <eightthree> | > Firefox |
22:05:30 | <eightthree> | > Fluxzy Desktop |
22:05:32 | <eightthree> | > Google Chrome |
22:05:33 | <eightthree> | > Internet Explorer 9 |
22:05:35 | <eightthree> | > Microsoft Edge |
22:05:36 | <eightthree> | > Mitmproxy |
22:05:38 | <eightthree> | > Postman |
22:05:39 | <eightthree> | > OWASP ZAP |
22:05:41 | <eightthree> | > Safari |
22:05:42 | <eightthree> | to find out how they implement it? |
22:06:06 | <@JAA> | Likely, because I doubt there's any documentation on what they do in detail. |
22:06:49 | <eightthree> | im using matrix btw and its showing me the scissors icon, so tell me if my message is hard to read...maybe I'll pastebin it...
22:06:56 | <@JAA> | HAR being JSON is both great and horrible. |
22:07:02 | <@JAA> | Well, you pasted like a dozen messages, yeah. |
22:07:27 | | SootBector quits [Ping timeout: 250 seconds] |
22:09:29 | <eightthree> | HTTP/2, published in 2015 - does anyone implement HAR beyond 1.1? |
22:09:52 | | SootBector (SootBector) joins |
22:12:23 | <eightthree> | https://huggingface.co/spaces/ehristoforu/mixtral-46.7b-chat - this AI says har is neutral to http version... |
22:12:47 | <@JAA> | Keep AI nonsense out of here. |
22:13:24 | <nicolas17> | eightthree: where do you think it got that information from? |
22:13:38 | | flotwig_ joins |
22:14:47 | | flotwig quits [Ping timeout: 265 seconds] |
22:14:47 | | flotwig_ is now known as flotwig |
22:15:14 | <nicolas17> | if there isn't any good information on HAR on websites, then there's nowhere the AI could have learned it from and it's just guessing/hallucinating |
22:15:25 | <nicolas17> | if there is, then look at those websites instead :P |
22:16:30 | <eightthree> | JAA: sorry |
22:17:21 | <@JAA> | To answer the question, at least Firefox can put HTTP/2 into HAR. Probably HTTP/3 and WebSocket, too. |
22:17:51 | <nicolas17> | by putting the decoded normalized headers in there? |
22:18:03 | <@JAA> | Yes |
22:18:13 | <@JAA> | It's a massive JSON object. |
22:42:31 | <eightthree> | https://searchfox.org/mozilla-central/search?q=har&path=&case=false&regexp=false - 112 results when I check the checkbox for "whole words" when I ctrl-f for har
22:44:43 | <eightthree> | JAA: where can I find a detailed "what's missing from HAR that WARC has"?
22:45:07 | <@JAA> | eightthree: By comparing the specs of HAR and WARC in detail. I doubt it's been done before. |
22:46:30 | <eightthree> | JAA: like, the github repo you linked to earlier for warc, with the 2012 draft I linked for HAR? |
22:47:09 | <@JAA> | Probably? I never looked into what documentation exists on HAR. There might be stuff on MDN or in browsers' documentations, too. |
22:47:44 | <eightthree> | I guess there's no comparison with firefox's implementation yet...I'll see if I absolutely need to decipher code or if there's something in bugzilla or elsewhere... |
22:51:13 | <eightthree> | https://firefox-source-docs.mozilla.org/devtools-user/network_monitor/index.html#copysave-all-as-har https://github.com/mdn/content/blob/main/files/en-us/mozilla/firefox/releases/41/index.md seemingly the button was added in ffx 41 |
22:52:19 | <eightthree> | https://bugzilla.mozilla.org/show_bug.cgi?id=859058 |
22:57:35 | <eightthree> | [Links (documentation, blog post, etc)]: 'HAR' can be linked to http://www.softwareishard.com/blog/har-12-spec/; |
22:57:37 | <eightthree> | from https://bugzilla.mozilla.org/show_bug.cgi?id=859058#c38, even though they had found the w3c link too, the last mention of what to link to officially as the spec was the above line |
22:58:08 | | tapos2 quits [Client Quit] |
22:58:54 | | Overlordz_ quits [Quit: Leaving] |
23:01:17 | | BlueMaxima joins |
23:01:48 | | tapos joins |
23:03:17 | | etnguyen03 (etnguyen03) joins |
23:05:22 | <eightthree> | https://bugzilla.mozilla.org/show_bug.cgi?id=859058#c7 this honza guy seems to be knowledgeable enough to propose writing a draft update in case extra features were needed. Said 10 years ago though :) |
23:11:55 | <eightthree> | https://searchfox.org/mozilla-central/source/devtools/client/netmonitor/src/har/README.md |
23:16:39 | | etnguyen03 quits [Client Quit] |
23:22:00 | <eightthree> | https://addons.mozilla.org/en-US/firefox/addon/har-export-trigger/ - made by the same guy,
23:22:00 | <eightthree> | Jan Honza Odvarko, who implemented HAR in Firefox... but per the readme there seems to be a way, with a user.js setting and then a browser restart, to automatically save each page without needing the extension, at least that's how it seems...
23:44:14 | | etnguyen03 (etnguyen03) joins |
23:44:41 | | Guest joins |
23:48:06 | | Wohlstand (Wohlstand) joins |