00:07:03cyanbox joins
00:22:43Shard11 quits [Read error: Connection reset by peer]
00:27:14Shard11 (Shard) joins
00:35:59Wake quits [Quit: The Lounge - https://thelounge.chat]
00:37:20Wake joins
00:40:20Wake quits [Client Quit]
00:40:40Wake joins
00:41:09Wake quits [Client Quit]
00:41:26Wake joins
00:43:05Wake quits [Client Quit]
00:44:15Wake joins
00:45:37etnguyen03 (etnguyen03) joins
00:47:42Sk1d joins
00:57:34Sk1d quits [Read error: Connection reset by peer]
00:58:33TastyWiener95 quits [Ping timeout: 272 seconds]
01:02:12notarobot17 joins
01:04:18etnguyen03 quits [Client Quit]
01:05:45TastyWiener95 (TastyWiener95) joins
01:16:20Arcorann_ joins
01:20:05Arcorann quits [Ping timeout: 272 seconds]
02:05:03nine quits [Ping timeout: 272 seconds]
02:12:03etnguyen03 (etnguyen03) joins
02:12:42nine joins
02:12:42nine quits [Changing host]
02:12:42nine (nine) joins
02:16:40<triplecamera|m>justauser: What does SFDW stand for?
02:18:26<triplecamera|m><cruller> "You might be able to do it using..." <- Indeed.
02:41:08Kotomind joins
03:26:22<h2ibot>PaulWise edited Discourse/archived (+109, universal-blue.discourse.group): https://wiki.archiveteam.org/?diff=60420&oldid=60419
03:37:27onetruth joins
03:56:27<cruller>I'm surprised that https://web.archive.org/cdx/search/cdx?url=https://get.microsoft.com/installer/download/* is only 744 lines long. Even “3D Viewer” has never been saved...
03:57:35onetruth quits [Client Quit]
04:02:16<cruller>Presumably, these are actually “installer downloaders”, and there's no point in saving them, but even so.
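For reference, that query hits the Wayback Machine's CDX server API, so roughly the same check can be scripted from a shell. A minimal sketch (filter/collapse/limit are documented CDX parameters; the values here are just illustrative):

    # count captures under the prefix, keeping only 200s and one row per unique URL
    curl -s 'https://web.archive.org/cdx/search/cdx?url=get.microsoft.com/installer/download/*&filter=statuscode:200&collapse=urlkey' | wc -l
    # peek at the first few rows as JSON
    curl -s 'https://web.archive.org/cdx/search/cdx?url=get.microsoft.com/installer/download/*&output=json&limit=5'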
04:11:05Kotomind quits [Ping timeout: 272 seconds]
04:13:56etnguyen03 quits [Remote host closed the connection]
04:26:30<h2ibot>Cooljeanius edited US Government (+41, /* See also */ Put category here too, to make…): https://wiki.archiveteam.org/?diff=60421&oldid=60180
04:32:30<h2ibot>Cooljeanius edited US Government (+70, additional sentence about the [[US…): https://wiki.archiveteam.org/?diff=60422&oldid=60421
04:53:59Kotomind joins
04:59:09SootBector quits [Remote host closed the connection]
05:00:20SootBector (SootBector) joins
05:04:55n9nes quits [Ping timeout: 272 seconds]
05:06:22n9nes joins
05:17:53DogsRNice quits [Read error: Connection reset by peer]
05:57:37sec^nd quits [Remote host closed the connection]
05:57:53sec^nd (second) joins
06:15:17nexussfan quits [Quit: Konversation terminated!]
06:24:35SootBector quits [Remote host closed the connection]
06:25:46SootBector (SootBector) joins
07:02:42<pabs>ericgallager: re that wesnoth forums post, there are a few non-404 .sf2 files on IA, found using the little-things ia-cdx-search tool https://transfer.archivete.am/cqUq9/www.freesf2.com-non-404-sf2-files.txt
07:02:42<eggdrop>inline (for browser viewing): https://transfer.archivete.am/inline/cqUq9/www.freesf2.com-non-404-sf2-files.txt
07:16:23<ericgallager>ok, how do I turn that into actual links?
07:22:02mannie (nannie) joins
07:23:34<mannie>Since new 'AI' companies are popping up and shutting down every day at the moment, I've been looking for a list of all/most .ai domains, because a good part of these companies use this TLD.
07:24:29<mannie>I found this daily-updated list https://domainmetadata.com/ai-domain-list with 940,581 active domains. I know that this is too much to do with ArchiveBot. Could we do it with the urls team?
07:25:02<@JAA>URLTeam is about URL shorteners. But we could do it with the URLs project.
07:25:04<mannie>The idea is to grab them before the AI bubble pops and we no longer have the capacity and resources to track them down and archive them
07:25:21<mannie>JAA: That is the project I mean.
07:45:50mannie quits [Client Quit]
08:04:27Ointment8862 (Ointment8862) joins
09:01:08Dada joins
09:52:27midou quits [Ping timeout: 272 seconds]
11:02:55evergreen5 quits [Quit: Bye]
11:03:20evergreen5 joins
11:06:16beastbg8_ joins
11:09:43beastbg8__ quits [Ping timeout: 272 seconds]
11:29:29<h2ibot>Bear edited List of websites excluded from the Wayback Machine/Partial exclusions (+55, The article Burger King doesn't want you to read.): https://wiki.archiveteam.org/?diff=60423&oldid=60372
11:30:06Wake quits [Quit: The Lounge - https://thelounge.chat]
11:42:30<h2ibot>Bear edited List of websites excluded from the Wayback Machine/Partial exclusions (+258, + context): https://wiki.archiveteam.org/?diff=60424&oldid=60423
11:56:56nyakase quits [Quit: @ERROR: max connections (-1) reached -- try again later]
11:57:20nyakase (nyakase) joins
12:00:02Bleo182600722719623455222 quits [Quit: The Lounge - https://thelounge.chat]
12:02:49Bleo182600722719623455222 joins
13:11:14Island quits [Read error: Connection reset by peer]
13:23:16cyanbox quits [Read error: Connection reset by peer]
13:33:32Arcorann_ quits [Ping timeout: 256 seconds]
13:52:20Washuu joins
13:58:54Wohlstand1 (Wohlstand) joins
14:01:17Wohlstand1 is now known as Wohlstand
14:05:02Island joins
14:13:05midou joins
14:14:57ducky_ (ducky) joins
14:16:33ducky quits [Ping timeout: 272 seconds]
14:19:43ducky_ quits [Ping timeout: 272 seconds]
14:31:49ducky (ducky) joins
14:47:53<pabs>ericgallager: just prefix with the WBM and they should find them from there
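A minimal way to do that prefixing, assuming the transfer.archivete.am list is one URL per line (web.archive.org/web/2/<url> asks the WBM to redirect to its closest capture):

    curl -s https://transfer.archivete.am/cqUq9/www.freesf2.com-non-404-sf2-files.txt |
      sed 's|^|https://web.archive.org/web/2/|' > wbm-links.txt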
14:49:49<pabs>JAA: another ping from Canonical about the Moin wikis, they are going to upload tarballs of the source data to IA
15:09:23<HP_Archivist>What is everyone using to save HTML locally? I've been using the Chrome extension SingleFile, but it... is a bit lackluster
15:12:02<HP_Archivist>A few years back, I was using "SingleFilez", but I understand it was merged into the aforementioned at some point
15:13:21<Hans5958>I still use SingleFile
15:14:03<HP_Archivist>Do you edit any of the settings or do you just let it rip on its defaults?
15:15:24<Hans5958>Default settings for me
15:21:09Kotomind quits [Ping timeout: 272 seconds]
15:25:14<HP_Archivist>I prefer it to append a YYYY-MM-DD-mm-ss timestamp at the end of the file name. I also edited the settings JSON file to remove the percent-encoding so it doesn't have %C3 nonsense
15:25:44<HP_Archivist>But this also turns any non-English characters into English. So, trade-off I guess
15:26:22<HP_Archivist>I remember that SingleFilez used to wrap it all neatly into its own folder, IIRC. I don't see an option for that, only ZIP.
15:29:22@arkiver quits [Remote host closed the connection]
15:29:46arkiver (arkiver) joins
15:29:46@ChanServ sets mode: +o arkiver
15:35:13<Hans5958>IIRC it was on the "Self-extracting ZIP" option?
15:39:17<HP_Archivist>Hm, I'll have another look
15:42:31<cruller>Is there no standard for saving page states (after rendering and/or interaction)?
15:43:12<HP_Archivist>cruller you mean across options out there? standard as in ISO?
15:48:01<cruller>If anything, I meant the latter, but AFAIK the former doesn't exist either.
15:49:25<HP_Archivist>Well, yeah, WARC is the de facto
15:49:49<HP_Archivist>But for self contained, I don't think so/not sure
15:49:59<HP_Archivist>Oh, WACZ
15:50:28<HP_Archivist>But I don't think there's much formality around WACZ yet (Web Archive Collection Zipped)
15:51:15andybak joins
15:51:57<HP_Archivist>I like Perma.cc, as it not only captures the page, but also a screenshot, and it lets you download as a WARC. But it's not free.
15:58:26<cruller>WARC is the most standard and best way to save HTTP messages, but it is insufficient when, for example, I also want to store form input data.
15:59:05<cruller>That is primarily when I use SingleFile.
16:01:07<HP_Archivist>Hm, there's also ArchiveBox. I always forget about that.
16:01:10<cruller>WBM also uses screenshots, but I prefer HTML-based formats.
16:01:29<HP_Archivist>WBM does both, html and static screenshots
16:01:52andybak quits [Client Quit]
16:03:47<HP_Archivist>I might actually setup Archivebox now that I'm looking at this
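For anyone following along, the usual ArchiveBox bootstrap is only a few commands; a sketch from memory of its documented CLI (paths and URL are placeholders):

    pip install archivebox
    mkdir ~/archivebox-data && cd ~/archivebox-data
    archivebox init                     # creates the collection layout in the current directory
    archivebox add 'https://example.com/some/page'
    archivebox server 0.0.0.0:8000      # optional web UI for browsing snapshots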
16:08:30<cruller>WBM certainly saves both WARC and image screenshots, but not DOM snapshots or anything like that?
16:09:07<HP_Archivist>No, not DOM. By static, I just meant still images, etc. You're correct.
16:09:21<cruller>I saw this topic recently on #internetarchive.
16:10:47<HP_Archivist>^^ thanks
16:11:42<cruller>It's good that mnbot saves DOM snapshots.
16:19:19<cruller>Preserving page states is likely of particular interest to those who actually make citations—journalists, legal professionals, scholars, and others.
16:20:08<cruller>I believe this is also related to why https://cit.is/ chose ArchiveBox.
16:22:28<HP_Archivist>Then if Archivebox is the gold standard, I'm going to set it up. Esp. since it's free.
16:22:35<HP_Archivist>I would consider https://webrecorder.net/browsertrix/ but it's not free.
16:26:28<cruller>If only ArchiveBox could create proper WARCs, I'd recommend it to my friends...
16:27:43klea wonders if cruller would make a softfork of AB that uses wget-at instead of wget.
16:28:49<klea>> Artifacts will be hosted on cit.is for at least 3 years. After that, we may ask you to upgrade to preserve older archives.
16:29:36<cruller>TBH I've never even used ArchiveBox :(
16:31:57<klea>I did somewhat, but I guess I could throw all of the 30.8GiB of data into /dev/null.
16:32:48<cruller>A major Japanese VPS provider offers Archivebox templates (https://doc.conoha.jp/products/vps-v3/startupscripts-v3/archivebox/), so I think it's particularly psychologically accessible for my followers.
16:37:18<HP_Archivist>cruller, what do you mean by 'proper WARCs'?
16:37:21<HP_Archivist>How does it not?
16:38:17<klea>wget.
16:39:28<HP_Archivist>I'm confused; what is cit.is and how does that related to Archivebox?
16:39:32<HP_Archivist>relate*
16:41:02<klea>JAA: I think that the warcs made by wget (at least you told me) had some errors, so maybe we should add a big warning to [[Wget#You_can_even_create_a_function]].
16:41:46<klea>2026-02-05 16:39:28 <HP_Archivist> I'm confused; what is cit.is and how does that related to Archivebox? < https://github.com/sij-ai/citis#:~:text=ArchiveBox,-%29
16:42:53<HP_Archivist>Ah, gotcha
16:42:59<HP_Archivist>So then this isn't free either, technically?
16:43:48<HP_Archivist>I guess, for me, I was looking for a robust solution to simply archive snapshots of html pages locally. Perhaps it's beyond my goals.
16:44:29<HP_Archivist>Archiving them locally, and then archiving them into cit.is for permanency are two different goals.
16:44:59<klea>I think if you don't need WARC-grade archives, SingleFile should give you files that you should, in theory, be able to replay later even if the website goes offline.
16:45:15<plcp>HP_Archivist: have you looked at https://webrecorder.net/archivewebpage/ ?
16:45:52<plcp>or just enter a URL at https://webrecorder.net/ in the "Enter a URL to archive" slot
16:45:58<HP_Archivist>plcp, yeah, I have, but klea makes a good point about them being WARC grade. I'm wondering if that's what I should aim for.
16:46:04<plcp>it spits out a WACZ that you can keep forever locally
16:46:10<plcp>WACZ has a WARC in it
16:46:23<plcp>with some JSON that has signatures and trusted timestamping, for free
16:46:23<HP_Archivist>Hmm
16:47:04<plcp>https://specs.webrecorder.net/wacz/1.1.1/#directory-layout
16:47:38<plcp>it's just a WARC with a CDX file and a couple of JSON files for book-keeping
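Per the spec plcp linked, the inside of a WACZ looks roughly like this (exact WARC/CDX file names vary by tool):

    unzip -l example.wacz
    #   archive/data.warc.gz          the raw WARC records
    #   indexes/index.cdx.gz          CDX index into the WARC
    #   pages/pages.jsonl             list of captured pages / entry points
    #   datapackage.json              file list, hashes and metadata
    #   datapackage-digest.json       digest/signature over datapackage.json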
16:48:42<HP_Archivist>Thanks for this, plcp. I think Webrecorder meets my goals, because I'm not about to go down a whole rabbit hole. I guess my long-term goal would be to offload saved WACZ files into individual items on IA. But locally, to have them as reference and for "you never know" purposes.
16:48:46<plcp>you can even setup a local https://github.com/webrecorder/pywb instance if you want to have your own at-home "wayback machine"
16:49:03<HP_Archivist>^^ Neat
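A sketch of the usual pywb workflow for that at-home wayback machine, assuming pip and a folder of existing WARC files (the collection name is arbitrary):

    pip install pywb
    wb-manager init my-collection
    wb-manager add my-collection captures/*.warc.gz
    wayback                             # then browse http://localhost:8080/my-collection/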
16:49:34<plcp>HP_Archivist: tbh you can create an account on IA and put your WACZ there, as long as they don't look too spammy or heavy, they'll be kept forever
16:49:55<plcp>the story changes if you have a gazillion gigabytes
16:50:13<plcp>but for whatever small-scale usage it works™
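For that small-scale case, the internetarchive CLI is enough; a hedged sketch (the item identifier is a made-up placeholder, and mediatype:web is just the conventional choice for web captures):

    pip install internetarchive
    ia configure                        # one-time login with your IA account
    ia upload my-wacz-captures-2026 example.wacz --metadata="mediatype:web"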
16:51:12<HP_Archivist>Oh, I already have an account. Several in fact, for different materials. But hm. I could see this snowballing into large-ish data over time. As it stands now, whatever I throw into AB, I also throw into SPN and Archive.is, too.
16:51:35<HP_Archivist>But yeah, I'll have another look at Webrecorder for now. You saved me some time, I was about to set up ArchiveBox, heh.
16:52:23<HP_Archivist>All of this is really from the mindset of not depending on any single point of failure for saved pages by having my own copy, too.
16:52:46<klea>https://irclogs.archivete.am/archiveteam-bs/2024-05-08#lb3cc1de2
16:53:37<cruller>HP_Archivist: I can't explain what "proper warc" is. I think only a few people, including JAA, can explain it.
16:53:42<plcp>browsertrix is heavier machinery that crawls whole websites and has nice automation; if you can manage making the snapshots by hand, archivewebpage/webrecorder is nice
16:53:51<cruller>However, I believe that only the software recommended at https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem can generate it.
16:54:11<HP_Archivist>klea just burst my bubble, heh
16:54:11<klea>https://irclogs.archivete.am/archiveteam-bs/2024-01-27#l8bcff9bf
16:54:40<HP_Archivist>In archival, nothing is ever easy. sighs.
16:55:44<HP_Archivist>plcp, what do you think re: https://irclogs.archivete.am/archiveteam-bs/2024-05-08#lb3cc1de2
16:56:19<cruller>koichi: In addition, I personally also trust Zeno, Heritrix, and warcprox.
16:56:50<klea>https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem#:~:text=Browsertrix
16:58:16<HP_Archivist>^^
16:59:14<plcp>https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem // looking at the comments made on the webrecorder tools
16:59:17<plcp>it cites https://github.com/webrecorder/archiveweb.page/blob/5431064ead4c8245b5b58cbe9233664e525302d9/README.md#architecture
17:00:08<HP_Archivist>Ideally, I capture a local instance that faithfully replays the page as it was, live. 1:1
17:00:39<HP_Archivist>I'm looking at that chart (thank you, klea) and I'm wondering why there are so many options :|
17:01:09<HP_Archivist>Maybe Zeno?
17:01:11<klea>time to setup SSLKEYLOGFILE and dump all raw traffic your browser makes when accessing those webpages
17:01:29<klea>That should allow you to later implement something to parse the data /s
17:01:48<HP_Archivist>Funny ^^ I was trying to *avoid* rabbit holes, heh
17:01:49<plcp>from the usability point of view, if you don't care about faithful reproduction of the bytes the server transferred nor the exact bytes / encoding / etc
17:01:57<plcp>me likey webrecorder
17:02:12<klea>fireonlive++
17:02:13<eggdrop>[karma] 'fireonlive' now has 1156 karma!
17:02:44klea doesn't have money so can't pay you to do that tool nicolas17. re: https://irclogs.archivete.am/archiveteam-bs/2024-01-27#ld3925144
17:02:46<HP_Archivist>plcp Guess it's a personal decision then, and what scope I want to have
17:02:51<plcp>i've used https://github.com/ArchiveTeam/ludios_wpull which is the thing inside grab-site of AT
17:03:31<HP_Archivist>Partly err on the side of "doing it once, do it right"
17:03:37<plcp>yeah
17:04:03<HP_Archivist>But I see there isn't a "right" solution, from what it looks like here
17:04:10<HP_Archivist>Several different flavors I guess
17:05:05<klea>The best solution is to just go make an entire copy of the server on the other end, of your client machine, and of any server that server relies on, and create an intranet so it can always be poked at.
17:05:28<HP_Archivist>^^
17:05:33<klea>Sadly that will very likely exceed your budget, and require you to already have access to those servers.
17:05:39^ quits [Ping timeout: 272 seconds]
17:06:24<HP_Archivist>As I have found out over the years in other aspects of archival (imaging, digitization), doing it "right" is often cost-prohibitive.
17:07:00<HP_Archivist>I think I'm gonna stick with Webrecorder as "good enough" for now
17:07:17<HP_Archivist>Maybe I'll have a look at Zeno, too
17:07:25Shard11 quits [Read error: Connection reset by peer]
17:07:47<plcp>at least your next-door broken WARC implementation spits out stuff that may be understood by most future tools :o)
17:07:54<plcp>just don't use wget WARC implementation
17:07:58<plcp>this one is too broken
17:08:10Shard11 (Shard) joins
17:08:22<HP_Archivist>^^ noted
17:08:25<HP_Archivist>And yup :)
17:08:25^ (^) joins
17:08:32<HP_Archivist>Thanks, everyone, for the input. I appreciate it.
17:08:46<plcp>> that may be understood by most future tools // given that most ppl with money or institutional backers are using WARC for everything everywhere all the time
17:08:58<plcp>from what i understood, i'm no expert here
17:09:40<HP_Archivist>Looking at the AT logs that klea shared, it doesn't sound like there are many WARC experts(?)
17:14:02<plcp>i've been reading https://github.com/iipc/warc-specifications/issues from time to time
17:14:22<plcp>last one is fun https://github.com/iipc/warc-specifications/issues/106
17:14:48Webuser159024 joins
17:15:25<plcp>i guess that ppl writing the tools and operating the archives will be the ones deciding what will be passed to future generations :o)
17:15:45<HP_Archivist>That's....yeah, that's not exactly best-practice, is it. Heh.
17:16:35<HP_Archivist>I mean, I'm a complete novice when it comes to any of this. So, I can only speak to it a little. But sounds like there needs to be a consortium for the spec. Sorta like for the ICC (Intl Color Consortium)
17:17:59<cruller>Some people might say that “improper WARC is worse than PDF,” but this is a point of contention. Personally, I avoid Webrecorder software whenever possible (for capture purposes).
17:19:24<cruller>In any case, proper WARC is better than improper WARC.
17:25:12<HP_Archivist>cruller so then if not webrecorder, then what?
17:27:07<cruller>In most cases, grab-site works well. Not only for crawling, but also for saving single pages.
17:27:32<HP_Archivist>Like what?
17:36:21<cruller>Any page that doesn't require JavaScript can be easily saved with grab-site --1 https://example.com. Other complex options are generally fine with the defaults.
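For context, that grab-site flow looks roughly like this end to end (--no-offsite-links is just a commonly used extra, not required; the output directory name is approximate):

    pip install grab-site
    gs-server &                         # optional dashboard on http://127.0.0.1:29000/
    grab-site --1 --no-offsite-links https://example.com/some/page
    # the finished .warc.gz ends up in a directory named after the URL and date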
17:38:33<cruller>When JavaScript is required, I use Brozzler. Additionally, although still experimental, Zeno's --headless option might be nice.
17:43:16<cruller>I don't verify the generated WARC files (though I'd like to)... I only do a visual check on https://replayweb.page/ when I feel like it.
17:50:03Webuser159024 quits [Client Quit]
17:51:00<@arkiver>have we been able to archive https://historyhub.history.gov/ well?
17:52:36Dj-Wawa quits []
17:52:57<klea>arkiver: Maybe ask in #UncleSamsArchive too?
17:53:29Dj-Wawa joins
17:54:39<h2ibot>Manu edited Discourse/archived (+97, Queued discuss.gradle.org): https://wiki.archiveteam.org/?diff=60425&oldid=60420
17:56:22<HP_Archivist>cruller I meant an example of a grab-site
17:57:30Washuu quits [Quit: Ooops, wrong browser tab.]
17:59:02<klea>A WARC?
18:05:08<pokechu22>arkiver: no, https://historyhub.history.gov has gotten blocked due to incapsula
18:05:10<cruller>"a grab-site"? grab-site is a software that generates WARC files. https://github.com/ArchiveTeam/grab-site
18:05:27<@arkiver>pokechu22: i see
18:05:41<@arkiver>i'll look into it as well this weekend (don't have time on before then)
18:05:44<@arkiver>thanks pokechu22
18:15:21<nicolas17>klea: a pcap to warc converter seems theoretically possible but very hard...
18:16:09<nicolas17>would probably need to use Wireshark (or steal its code) as a base
18:19:20SootBector quits [Remote host closed the connection]
18:20:32SootBector (SootBector) joins
18:20:54<nicolas17>since you need to reassemble TCP packets that may be out of order or repeated... before you even get to the TLS part
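Not a converter, but the decryption half is already doable with stock tools when the browser was started with SSLKEYLOGFILE set (as klea joked earlier); the WARC-writing half is the missing piece. A rough sketch, assuming an HTTP/1.x capture plus a matching key log:

    # capture while browsing, with the browser launched as: SSLKEYLOGFILE=keys.log firefox
    tcpdump -i any -w capture.pcap 'tcp port 443'
    # let tshark reassemble TCP and decrypt TLS with the key log, then dump the
    # reconstructed HTTP objects -- this is roughly where a pcap->WARC tool would hook in
    tshark -r capture.pcap -o tls.keylog_file:keys.log --export-objects http,exported/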
18:22:49<cruller>This might be a very amateurish question, but is it possible to “replay” a pcap file and then recapture it using something like warcprox?
18:28:05SootBector quits [Remote host closed the connection]
18:29:12SootBector (SootBector) joins
18:30:05Bleo182600722719623455222 quits [Quit: Ping timeout (120 seconds)]
18:30:06allani quits [Quit: Ping timeout (120 seconds)]
18:30:08igloo222255 (igloo22225) joins
18:30:28jspiros quits [Ping timeout: 256 seconds]
18:32:10igloo22225 quits [Ping timeout: 256 seconds]
18:32:35jspiros (jspiros) joins
18:32:40igloo22225 (igloo22225) joins
18:33:54igloo222255 quits [Read error: Connection reset by peer]
18:41:17Dj-Wawa quits [Client Quit]
18:43:27Shard117 (Shard) joins
18:44:01Dj-Wawa joins
18:45:46Shard11 quits [Ping timeout: 256 seconds]
18:45:46Shard117 is now known as Shard11
18:52:38igloo222259 (igloo22225) joins
18:53:42igloo22225 quits [Read error: Connection reset by peer]
18:53:42igloo222259 is now known as igloo22225
18:58:49cyanbox joins
19:03:31Shard11 quits [Read error: Connection reset by peer]
19:05:02Shard11 (Shard) joins
19:08:26Shard11 quits [Read error: Connection reset by peer]
19:08:30Shard115 (Shard) joins
19:11:07Shard11 (Shard) joins
19:12:53Shard115 quits [Read error: Connection reset by peer]
19:16:07Shard11 quits [Ping timeout: 272 seconds]
19:28:43khaoohs__ joins
19:28:43khaoohs_ quits [Read error: Connection reset by peer]
19:51:29<Yakov>emulate the packet transactions to a headless browser (/s, i know it doesn't work like that)
20:21:39DogsRNice joins
20:41:56xarph quits [Ping timeout: 256 seconds]
20:42:10xarph joins
21:13:42Sk1d joins
21:19:32Sk1d quits [Client Quit]
21:46:27Sidpatchy2 (Sidpatchy) joins
21:47:22Sidpatchy2 quits [Client Quit]
21:54:48Sidpatchy2 (Sidpatchy) joins
21:57:25Wohlstand quits [Quit: Wohlstand]