00:11:14Webuser330629 joins
00:13:32Webuser330629 quits [Client Quit]
00:25:23etnguyen03 (etnguyen03) joins
00:29:57APOLLO03 quits [Quit: Leaving]
00:47:53TastyWiener95 (TastyWiener95) joins
01:17:09beastbg8__ joins
01:20:19beardicus (beardicus) joins
01:20:23beastbg8_ quits [Ping timeout: 260 seconds]
01:32:59BlueMaxima quits [Read error: Connection reset by peer]
01:46:35etnguyen03 quits [Client Quit]
01:51:28beardicus quits [Ping timeout: 250 seconds]
01:53:34beardicus (beardicus) joins
02:06:07bladem quits [Read error: Connection reset by peer]
02:16:59etnguyen03 (etnguyen03) joins
02:28:03beardicus quits [Ping timeout: 260 seconds]
02:33:01beardicus (beardicus) joins
03:28:42etnguyen03 quits [Remote host closed the connection]
04:55:38beardicus quits [Ping timeout: 250 seconds]
05:05:46JayEmbee (JayEmbee) joins
05:07:51opl quits [Quit: bye]
05:23:19opl joins
06:14:52BornOn420 quits [Remote host closed the connection]
06:15:51BornOn420 (BornOn420) joins
06:26:38jspiros_ quits [Ping timeout: 260 seconds]
06:42:00jspiros (jspiros) joins
06:51:08jspiros quits [Ping timeout: 260 seconds]
07:58:27SootBector quits [Remote host closed the connection]
07:58:50SootBector (SootBector) joins
08:16:27jspiros (jspiros) joins
08:24:24Shjosan quits [Quit: Am sleepy (-, – )…zzzZZZ]
08:25:45Shjosan (Shjosan) joins
09:14:19loug8318142 joins
09:18:07APOLLO03 joins
09:57:32Sluggs quits [Excess Flood]
10:03:55Sluggs joins
10:28:42Sluggs quits [Excess Flood]
10:35:34Sluggs joins
10:39:16<@OrIdow6>Looking at it, congrats on progress (not yet merged but looks like you're further along) on the mastodon thing pabs
10:53:47<@OrIdow6>Also, for all - I'm looking into this because STWP has had an incident kinda similar to the berries.space incident (they SPN'd some Fediverse posts without asking) and I think it would be useful to save the Mastodon thread where people get angry at them, good illustration of people reacting to archiving
10:58:24<@OrIdow6>I'm thinking of going forward with it, not like there's anything sensitive there anyway, just people complaining
10:58:45<@arkiver>what is this about OrIdow6 ?
11:09:58ArchivalEfforts quits [Quit: No Ping reply in 180 seconds.]
11:13:46Sluggs quits [Excess Flood]
11:17:42Sluggs joins
11:19:50<@arkiver>mostly asking since i can't find in my logs what this related to :P
11:20:03<@arkiver>i think i'm not missing logs but not sure
11:25:27mannie (nannie) joins
11:25:28<eggdrop>[tell] mannie: [2025-01-21T08:16:56Z] <OrIdow6> Could you please provide more context to this? For instance: what date range does this list cover? Also rather than a long list of references it would be better to have just one or two links per company, to establish that they are going bankrupt; e.g., I cannot find a source for the fact that Sarvision is going bankrupt skimming through the URLs in the list, instead most just seem to establish t
11:27:19<@OrIdow6>arkiver: Umm, the context is a long long conversation machine-translated from Chinese in #stwp-chat , basically I'm asking whether we want to archive a kerfuffle over archiving, just being safe and asking other people cuz there's a distant possibility we could get involved in the blowback as well and I don't want to make that decision solo
11:27:47<c3manu>OrIdow6: looks like the message you left mannie got cut off
11:28:01<@OrIdow6>c3manu: Oh shoot, you're right
11:28:25<mannie>OrIdow6: I use https://insolventies.rechtspraak.nl/#!/ as a source. I filter on type bankrupt and then today
11:28:41<@OrIdow6>"... that the companies exist" or something like that was what was left out
11:28:42<c3manu>i haven’t been following yesterday, have there been any new ideas? where’s that discussion currently at?
11:29:48<mannie>Here is the list for court annoucements of the last 7 days: https://insolventies.rechtspraak.nl/#!/resultaat?periode=Laatste%20week&rechtbank=all&publicatiesoort=uitspraken%20faillissement&publicatiekenmerk=&insolventienummer=&startDate=2025-01-15T11:30:42.875Z&type=kenmerk
11:30:08<c3manu>mannie: the thing is, we don’t really have time to go through all of that by ourselves and were thinking about solutions for automating part of it, or thinking about ways for people to throw in their stuff without having to learn all of AB, essentially
11:30:14<mannie>I don't include it in the references list because it's not possible to archive it with archivebot
11:30:51<c3manu>mannie: if that makes it any better, the German insolvency list seems to be even more of a mess ^^"
11:31:03<@arkiver>OrIdow6: i guess it can be archived, after data is donated to IA, they can always contact IA with requests
11:31:11<mannie>I can make a separate list for each company
11:32:03<@arkiver>mannie: was it the case that previously you didn't want to control AB to queue these sites?
11:32:12<@arkiver>if you want to, you're very welcome to get +v and put them in there
11:33:27<mannie>I don't know if I want to control AB. I am still in doubt. Maybe I just need to try it for a week or so.
11:33:28<@OrIdow6>arkiver: Thanks, that's along the lines of what I was thinking too, just didn't want to do this completely unilaterally
11:35:08<@arkiver>there was some talk about headless browser based archiving?
11:35:37<@OrIdow6>!remindme 15h save that mastodon thing
11:35:37<@arkiver>i believe we don't have anything in place at the moment that does that, it would be nice to have something similar to AB, but perhaps for headless browser based archiving
11:35:38<eggdrop>[remind] ok, i'll remind you at 2025-01-24T02:35:37Z
11:36:02<@arkiver>there's always a few web one-off pages that are a huge problem in AB, and for which there's not really time to create something custom
11:36:14<@arkiver>headless browser based archiving would be really useful there
11:36:20<@arkiver>there was that tool from IA
11:36:34<@arkiver>this one https://github.com/internetarchive/brozzler
11:36:45<@OrIdow6>Yeah it'd be really nice
11:36:56<@arkiver>i have never tried it out though, but it seems to have activity
11:37:39<@OrIdow6>Wasn't there discussion in -dev 2 months or so back about a different tool that used the Chrome debugging API to intercept requests?
11:37:56<@OrIdow6>Written in Go? Unless I'm mixing up Brozzler and Zeno
11:38:28<@arkiver>i am not sure at all
11:40:11<@OrIdow6>Think it was https://github.com/iipc/warcaroo but that is indeed not Go
11:44:53<@OrIdow6>mannie: I do encourage you to try it at least!
11:45:05<@arkiver>+1 from me mannie
11:45:07<@OrIdow6>No credit card required :)
11:51:16<mannie>I think I will try it for this month.
11:52:24<mannie>I also need to say that there is more real-life stuff coming soon, so on those days I will not be active at all, only the daily list for #down-the-tube
11:53:32<szczot3k>OrIdow6: what do you mean no CC required. I already gave mine to arkiver, and still no +v :( /s
12:00:01Bleo18260072271962345 quits [Quit: The Lounge - https://thelounge.chat]
12:02:54Bleo18260072271962345 joins
12:04:38benjins3 quits [Ping timeout: 250 seconds]
12:21:27benjins3 joins
12:42:40mannie quits [Ping timeout: 276 seconds]
12:46:44<TheTechRobo>arkiver, OrIdow6: I have something WIP in #jseater. No recursion yet, though, and there are some kinks I still need to iron out.
12:47:08<TheTechRobo>And yeah, it wraps brozzler (perhaps Zeno in the future)
12:48:45Webuser391030 joins
12:51:14gatagoto (gatagoto) joins
12:54:45beardicus (beardicus) joins
13:06:31SkilledAlpaca418962 quits [Quit: SkilledAlpaca418962]
13:11:31<masterx244|m>posted a list 2 days ago with a bunch of firmware files. haven't seen that one run through archivebot. list was intended for a !ao< and total size of the entire list is approx 23GB (according to my local copy of the files, which i used to rederive the URLs after verifying them with an ugly bash command)
13:13:48loug8318142 quits [Ping timeout: 260 seconds]
13:15:43SkilledAlpaca418962 joins
13:16:58loug8318142 joins
13:17:02mannie (nannie) joins
13:17:57Webuser391030 quits [Client Quit]
13:20:13Snivy quits [Ping timeout: 260 seconds]
13:23:14cow_2001 quits [Quit: ✡]
13:23:56loug8318142 quits [Ping timeout: 250 seconds]
13:25:10cow_2001 joins
13:31:40loug8318142 joins
13:37:08loug8318142 quits [Ping timeout: 260 seconds]
13:38:06Snivy (Snivy) joins
13:44:15<@arkiver>TheTechRobo: nice!
13:46:09mannie quits [Client Quit]
13:48:14<DigitalDragons>OrIdow6: I remember talk of CDP being used to replace Chrome's HTTP stack with the corentinb/warc library as well
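As a rough illustration of the headless-browser archiving idea discussed above, the sketch below loads a page with headless Chromium and writes the captured responses to a WARC, assuming the Python playwright and warcio libraries. It is only an approximation of the concept, not how brozzler, warcaroo, or Zeno are actually implemented, and the target URL is a placeholder.

```python
from io import BytesIO

from playwright.sync_api import sync_playwright
from warcio.statusandheaders import StatusAndHeaders
from warcio.warcwriter import WARCWriter


def capture(url: str, warc_path: str = "capture.warc.gz") -> None:
    responses = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Collect Response objects as they arrive; read the bodies only after
        # the load settles, to avoid blocking inside the event handler.
        page.on("response", responses.append)
        page.goto(url, wait_until="networkidle")

        with open(warc_path, "wb") as fh:
            writer = WARCWriter(fh, gzip=True)
            for resp in responses:
                try:
                    body = resp.body()
                except Exception:
                    continue  # e.g. redirects have no retrievable body
                http_headers = StatusAndHeaders(
                    f"{resp.status} {resp.status_text}".strip(),
                    list(resp.headers.items()),
                    protocol="HTTP/1.1",
                )
                record = writer.create_warc_record(
                    resp.url,
                    "response",
                    payload=BytesIO(body),
                    http_headers=http_headers,
                )
                writer.write_record(record)
        browser.close()


if __name__ == "__main__":
    capture("https://example.com/")  # placeholder URL
```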
13:49:57loug8318142 joins
14:04:33loug8318142 quits [Ping timeout: 260 seconds]
14:07:37eroc1990 quits [Quit: The Lounge - https://thelounge.chat]
14:08:06eroc1990 (eroc1990) joins
14:12:05loug8318142 joins
14:31:26atluxity joins
14:32:48<atluxity>I came to wonder... with an optimistic and naive world view, at what rate (items per second) would archiveteam be able to pull tiny chunks of data over http? Do we have any experience with that?
14:33:23<atluxity>this is just food for a beer-discussion, nothing serious
14:36:16mannie (nannie) joins
14:36:49mannie quits [Client Quit]
14:38:17<@imer>atluxity: if we're talking DPoS we usually aren't limited by workers - either the site gives out (or has strict limiting) or uploading to IA can't keep up. multiple gbit/s is quite feasible
14:39:57<atluxity>It looks to me like Telegram is currently doing about 600 items per second
14:40:31<atluxity>I looked at done_counter, noted the time... waited some time, repeated, calculated
14:40:48<atluxity>so I can use that number for now
14:40:51<@imer>telegram is one of those cases where they have rate limiting
14:41:27<@imer>urls is doing ~3k/s currently
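A minimal sketch of the estimate atluxity describes above (read a project's done counter, wait, read it again, divide by the elapsed time). The stats URL and the "done_counter" field name are placeholders for illustration, not a documented tracker API.

```python
import time

import requests

# Hypothetical stats endpoint and field name, purely for illustration.
TRACKER_STATS = "https://tracker.example/telegram/stats.json"


def items_per_second(interval_seconds: float = 60.0) -> float:
    """Sample the done counter twice and divide by the elapsed time."""
    first = requests.get(TRACKER_STATS, timeout=30).json()["done_counter"]
    t0 = time.monotonic()
    time.sleep(interval_seconds)
    second = requests.get(TRACKER_STATS, timeout=30).json()["done_counter"]
    return (second - first) / (time.monotonic() - t0)


if __name__ == "__main__":
    print(f"~{items_per_second():.0f} items/s")
```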
14:42:29bladem (bladem) joins
15:36:32icedice quits [Ping timeout: 250 seconds]
15:49:48icedice (icedice) joins
15:56:00<AK>kiska, you able to get a max for done items/s on #// from your stats? 🤔
16:01:03kansei quits [Quit: ZNC 1.9.1 - https://znc.in]
16:02:08kansei (kansei) joins
16:05:18holbrooke quits [Ping timeout: 260 seconds]
16:05:44nulldata6 (nulldata) joins
16:07:03nulldata quits [Ping timeout: 260 seconds]
16:07:04nulldata6 is now known as nulldata
16:10:56<LddPotato>Hey
16:13:31<LddPotato>Hey guys, I wanted to spin up a cloud instance somewhere to run some docker containers on. Would you advise getting a single instance with a little bit more memory and then adding in some IPs, or would you go for a handful of smaller instances that already come with their own IP? And which host is the go-to one for you guys? One that has IPs that are in good standing and preferably is not too expensive..
16:19:50holbrooke joins
16:31:17<Blueacid>LddPotato: From what I've seen running locally, you don't need much memory, but CPU can quickly get bogged down with a few wget-at threads on the go! I suspect that many of the cheaper hosting options (DigitalOcean, Hetzner, OVH, Linode, Vultr and so on) will likely have "mixed" IP standings, thanks largely to the low prices on offer there =(
16:32:46<szczot3k>Depends on what you're trying to do. Any hoster, as opposed to a residential IP, will have a bad enough reputation for a lot of things, just because geolocation/reputation databases know it's a hoster.
16:36:21<LddPotato>Well, at home I have added a tunnel with extra IP addresses, but the tunnel is saturated. To be able to contribute and improve my statistics a little I wanted to host a couple more bandwidth-intensive instances somewhere with a cloud hoster, but I am not sure what to get and from whom...
16:36:49<@imer>what project are you intending to run?
16:36:58<szczot3k>I'm running an OVH dedi with additional IPs, and so far I haven't seen any major issues
16:37:18<szczot3k>(Actually two dedis, with two subnets between them)
16:37:30<LddPotato>livestream and telegram use the most data for me at the moment...
16:37:48<pabs>OrIdow6: zygolophodon got support for some additional non-Mastodon Fediverse JS based software, which made the code more complicated, so I need to rebase/rewrite the archiving stuff on top of that
16:38:15<szczot3k>livestream's api will bonk you with anything over ~2 effective concurrency, telegram is a bit better in this regard
16:38:34<pabs>OrIdow6: I also didn't verify that things work in the WBM afterwards
16:39:02<szczot3k>But telegram might also one day wake up, and start bonking archiving efforts hard
16:40:20<LddPotato>szczot3k: the dedicated servers you run, what kind of specs do they have? how big are the subnets you have assigned to them? and how many workers are you running on them?
16:42:52<szczot3k>KS-LE-E, KS-LE-1, each with 4 vms each running 3 projects right now (urls, clyps, livestream), all of them on dedicated IPs. LE-E's CPU is ~40% used by archiveteam vms, LE-1's is at ~70%
16:42:56<@imer>livestream should be finishing or close to in a week or so (fingers crossed), so don't commit to anything long term for just that - hetzner cloud vms might be good since it's pay as you go - might have to watch your traffic allowance though (prices are very reasonable fwiw)
16:43:51<szczot3k>But - neither of those dedis were bought _for_ archiving efforts, they have just grown to this size by accident. Most likely will scale back one day
16:49:13<AK>I've got 3 dedis archiving, plus a bunch of Hetzner cloud I scale up when needed. 1 dedi is an AX51-NVMe with 4 ips which runs a bit from every project to collect logs. Then the other 2 dedis run ArchiveBot pipelines
16:51:05<@imer>hetzner is nice unless you try to run #// hard :D
16:51:53<szczot3k>On additional IPs on OVH you get to handle abuse yourself (or at least get a whois object with your abuse contact)
16:52:41<AK>Weirdly the hetzner emails have stopped for me. No idea what I did but not had any abuse reports for ages
16:53:06<AK>(Late as always, but livestream containers with logs are up in the usual place if needed)
16:54:29<szczot3k>I've been thinking of running every project, and pushing the logs to ELK, and then setting up some error reporting from this, but... the procrastination is real
16:54:35<AK>https://share.aktheknight.co.uk/riJe0/qamiwewA17.png/raw Oh boy, I might need to reduce some of the containers on this
16:54:53<LddPotato>it sounds like the preference is to go for a bigger instance or dedicated server and then add more IPs to it... The unfortunate thing is I have my own server colocated somewhere, but the bandwidth is quite expensive there..
16:55:15<AK>szczot3k, I did that for a while, but the issue is the amount of logs, it was easy to churn out multiple TB per day. Now I just run Dozzle which gives me logs but no monitoring
16:56:41<szczot3k>AK: https://i.imgur.com/HGczakQ.png I clear the logs daily, and it's not that bad (24 containers push the logs there)
16:57:07Webuser821963 quits [Quit: Ooops, wrong browser tab.]
16:57:47<AK>Yeah doesn't look too bad
16:57:48<AK>I have 79 containers at the mo, at peak I was running upwards of 200 via swarm with all logs being grabbed in. Was chaos 😂
16:58:27<szczot3k>Running a single instance of every project might be feasible for me
16:58:50<szczot3k>Making good rules for reporting is harder
17:00:30<AK>I spent ages on rules, and then every time a new project came along the rules would need to be different, so I just settled on this: https://logs.hel1.aktheknight.co.uk/
17:00:59<AK>Now if someone goes "it isn't working" people can see my live logs and work out if it's a widespread issue or just one person reporting it
17:01:54beardicus quits [Ping timeout: 250 seconds]
17:01:56<Blueacid>Out of interest, how many projects are there running which aren't in the warrior? Should I try to run some of those, too?
17:02:17<Blueacid>(I've got gigabit at home, so happy to use a chunk of capacity for archival!)
17:04:20beardicus (beardicus) joins
17:04:56pedantic-darwin quits [Ping timeout: 250 seconds]
17:06:11<AK>I would advise being very careful running AT projects on a home ISP, some projects (#// especially) can result in lots of abuse reports and some ISPs get super unhappy about that
17:07:25<szczot3k>It's the less hardcore version of running a TOR Exit Relay at home
17:24:37<LddPotato>The reason i stayed away from running #// sofar...
17:35:08beardicus quits [Ping timeout: 260 seconds]
17:38:24<Blueacid>Is #// shorthand for the random links project (The one that has a warning about URL Blocklists in its title) ?
17:38:34<Blueacid>'Cos if so, yeah, I've avoided that one too!
17:39:05<Blueacid>But yeah, my ISP hasn't breathed a word to me so far - so happy to carry on until they get in touch, at which point I'll say "oops, sorry"
17:40:43beardicus (beardicus) joins
17:42:20BornOn420 quits [Ping timeout: 276 seconds]
17:50:24<@imer>Blueacid: #// is the channel for the urls project. so sorta yes :)
17:54:18SkilledAlpaca418962 quits [Quit: SkilledAlpaca418962]
17:54:35BornOn420 (BornOn420) joins
18:38:48SkilledAlpaca418962 joins
19:32:04emily quits [Quit: ZNC 1.9.1 - https://znc.in]
19:33:19pseudorizer (pseudorizer) joins
19:44:50nicolas17 quits [Ping timeout: 250 seconds]
19:48:26nicolas17 joins
20:06:04@Fusl quits [Excess Flood]
20:06:20Fusl (Fusl) joins
20:06:20@ChanServ sets mode: +o Fusl
20:08:51Webuser915700 joins
20:12:15<Webuser915700>Hi all. Hope you are all well. I'm quite new to website archiving and in need of some good suggestions. I have looked through tons of pages, including ArchiveTeam.org. They compared a few WARC-supporting options against each other, but a few important options (Heritrix, Apache Nutch etc) were never properly compared. Before I start archiving I need to
20:12:15<Webuser915700>know a few things (overall page reliability, ability to include comments, expanded comments, and a few pros and cons that do not get mentioned very often). Does anyone use Heritrix, Apache Nutch, or Grab-site?
20:13:45<szczot3k>What exactly are you trying to do? Do you want to help "the internet community" as a whole, and get stuff into the web.archive?
20:18:43<Webuser915700>Myself, and the internet community. So there might be differences in which way files are accepted or not (I'm sure there must be some sort of guideline to make them eligible for upload). I backed up tons of pages before manually in simple .mhtml format. However, for many pages with many subdirectories it's obviously not really feasible. While my own
20:18:43<Webuser915700>personal interest is more focused on health, nutrition, and biochemistry, there is a dire need to archive all sorts of information.
20:19:31<TheTechRobo>WARCs you create yourself won't be added to the Wayback Machine, for what it's worth. You can request sites to be submitted to ArchiveBot if you'd like them to be archived there
20:19:53<TheTechRobo>If you're not worried about the Wayback Machine, grab-site is probably the nicest option in terms of user friendliness
20:20:37<TheTechRobo>Heritrix is a lot of setup; never tried Apache Nutch so can't say much about that
20:23:12<Webuser915700>TheTechRobo: If it gets added to the Wayback Machine it would be nice as an extra backup. I have a server with around 120 TB, which is quite small but it's enough to start with. Do you know if either Heritrix or Grab-site comments and auto expanding comments?
20:23:59<Webuser915700>support comments and auto expanding comments*
20:25:12<nicolas17>that depends on the specific website
20:40:33HP_Archivist (HP_Archivist) joins
20:50:18Webuser915700 quits [Client Quit]
20:57:12beardicus quits [Ping timeout: 250 seconds]
20:58:50Webuser044028 joins
21:09:57<TheTechRobo>If it requires JavaScript, then probably not
21:10:12kansei quits [Ping timeout: 250 seconds]
21:10:34kansei (kansei) joins
21:18:03beardicus (beardicus) joins
21:31:58h2ibot quits [Ping timeout: 260 seconds]
21:38:09APOLLO03 quits [Quit: Leaving]
21:38:41Webuser044028 quits [Client Quit]
21:49:12beardicus quits [Ping timeout: 250 seconds]
21:53:13beardicus (beardicus) joins
22:01:00qwertyasdfuiopghjkl2 quits [Quit: Leaving.]
22:01:34qwertyasdfuiopghjkl2 joins
22:02:01qwertyasdfuiopghjkl2 quits [Max SendQ exceeded]
22:02:13qwertyasdfuiopghjkl2 joins
22:02:40qwertyasdfuiopghjkl2 quits [Max SendQ exceeded]
22:02:59sec^nd quits [Remote host closed the connection]
22:03:17sec^nd (second) joins
22:05:23APOLLO03 joins
22:10:59BlueMaxima joins
22:12:48beardicus quits [Ping timeout: 260 seconds]
22:16:45APOLLO_03 joins
22:18:38APOLLO03 quits [Ping timeout: 260 seconds]
22:32:41h2ibot (h2ibot) joins
22:41:12etnguyen03 (etnguyen03) joins
22:44:57Dango360 quits [Read error: Connection reset by peer]
22:59:42Dango360 (Dango360) joins
23:01:33Webuser880118 joins
23:02:03Overlordz joins
23:02:25Webuser880118 quits [Client Quit]
23:08:48Ryz quits [Ping timeout: 260 seconds]
23:11:29Ryz (Ryz) joins
23:26:47<kiska>AK the explore function is available, so you can go make your own query
23:27:39qwertyasdfuiopghjkl2 (qwertyasdfuiopghjkl2) joins
23:29:04<kiska>Otherwise I can get one soon(tm)
23:38:39etnguyen03 quits [Client Quit]
23:41:06<AK>kiska, Happy to explore, didn't want to overload your side if I wrote really bad queries. Will take a look as I'm interested now 🤔