00:11:14 | | Webuser330629 joins |
00:13:32 | | Webuser330629 quits [Client Quit] |
00:25:23 | | etnguyen03 (etnguyen03) joins |
00:29:57 | | APOLLO03 quits [Quit: Leaving] |
00:47:53 | | TastyWiener95 (TastyWiener95) joins |
01:17:09 | | beastbg8__ joins |
01:20:19 | | beardicus (beardicus) joins |
01:20:23 | | beastbg8_ quits [Ping timeout: 260 seconds] |
01:32:59 | | BlueMaxima quits [Read error: Connection reset by peer] |
01:46:35 | | etnguyen03 quits [Client Quit] |
01:51:28 | | beardicus quits [Ping timeout: 250 seconds] |
01:53:34 | | beardicus (beardicus) joins |
02:06:07 | | bladem quits [Read error: Connection reset by peer] |
02:16:59 | | etnguyen03 (etnguyen03) joins |
02:28:03 | | beardicus quits [Ping timeout: 260 seconds] |
02:33:01 | | beardicus (beardicus) joins |
03:28:42 | | etnguyen03 quits [Remote host closed the connection] |
04:55:38 | | beardicus quits [Ping timeout: 250 seconds] |
05:05:46 | | JayEmbee (JayEmbee) joins |
05:07:51 | | opl quits [Quit: bye] |
05:23:19 | | opl joins |
06:14:52 | | BornOn420 quits [Remote host closed the connection] |
06:15:51 | | BornOn420 (BornOn420) joins |
06:26:38 | | jspiros_ quits [Ping timeout: 260 seconds] |
06:42:00 | | jspiros (jspiros) joins |
06:51:08 | | jspiros quits [Ping timeout: 260 seconds] |
07:58:27 | | SootBector quits [Remote host closed the connection] |
07:58:50 | | SootBector (SootBector) joins |
08:16:27 | | jspiros (jspiros) joins |
08:24:24 | | Shjosan quits [Quit: Am sleepy (-, – )…zzzZZZ] |
08:25:45 | | Shjosan (Shjosan) joins |
09:14:19 | | loug8318142 joins |
09:18:07 | | APOLLO03 joins |
09:57:32 | | Sluggs quits [Excess Flood] |
10:03:55 | | Sluggs joins |
10:28:42 | | Sluggs quits [Excess Flood] |
10:35:34 | | Sluggs joins |
10:39:16 | <@OrIdow6> | Looking at it, congrats on progress (not yet merged but looks like you're further along) on the mastodon thing pabs |
10:53:47 | <@OrIdow6> | Also, for all - I'm looking into this because STWP has had an incident kinda similar to the berries.space incident (they SPN'd some Fediverse posts without asking) and I think it would be useful to save the Mastodon thread where people get angry at them, good illustration of people reacting to archiving |
10:58:24 | <@OrIdow6> | I'm thinking of going forward with it, not like there's anything sensitive there anyway, just people complaining
10:58:45 | <@arkiver> | what is this about OrIdow6 ? |
11:09:58 | | ArchivalEfforts quits [Quit: No Ping reply in 180 seconds.] |
11:13:46 | | Sluggs quits [Excess Flood] |
11:17:42 | | Sluggs joins |
11:19:50 | <@arkiver> | mostly asking since i can't find in my logs what this related to :P |
11:20:03 | <@arkiver> | i think i'm not missing logs but not sure |
11:25:27 | | mannie (nannie) joins |
11:25:28 | <eggdrop> | [tell] mannie: [2025-01-21T08:16:56Z] <OrIdow6> Could you please provide more context to this? For instance: what date range does this list cover? Also rather than a long list of references it would be better to have just one or two links per company, to establish that they are going bankrupt; e.g., I cannot find a source for the fact that Sarvision is going bankrupt skimming through the URLs in the list, instead most just seem to establish t |
11:27:19 | <@OrIdow6> | arkiver: Umm, the context is a long long conversation machine-translated from Chinese in #stwp-chat , basically I'm asking whether we want to archive a kerfuffle over archiving, just being safe and asking other people cuz there's a distant possibility we could get involved in the blowback as well and I don't want to make that decision solo |
11:27:47 | <c3manu> | OrIdow6: looks like the message you left mannie got cut off |
11:28:01 | <@OrIdow6> | c3manu: Oh shoot, you're right |
11:28:25 | <mannie> | OrIdow6: I use https://insolventies.rechtspraak.nl/#!/ as a source. I filter on type bankrupt and then today |
11:28:41 | <@OrIdow6> | "... that the companies exist" or something like that was what was left out |
11:28:42 | <c3manu> | i wasn’t following yesterday, have there been any new ideas? where’s that discussion currently at?
11:29:48 | <mannie> | Here is the list of court announcements from the last 7 days: https://insolventies.rechtspraak.nl/#!/resultaat?periode=Laatste%20week&rechtbank=all&publicatiesoort=uitspraken%20faillissement&publicatiekenmerk=&insolventienummer=&startDate=2025-01-15T11:30:42.875Z&type=kenmerk
11:30:08 | <c3manu> | mannie: the thing is, we don’t really have time to go through all of that by ourselves and were thinking about solutions for automating part of it, or thinking about ways for people to throw in their stuff without having to learn all of AB, essentially |
11:30:14 | <mannie> | I don't include it in the references list because it's not possible to archive it with archivebot
11:30:51 | <c3manu> | mannie: if that makes it any better, the German insolvency list seems to be even more of a mess ^^" |
11:31:03 | <@arkiver> | OrIdow6: i guess it can be archived, after data is donated to IA, they can always contact IA with requests |
11:31:11 | <mannie> | I can make a separate list for each company
11:32:03 | <@arkiver> | mannie: was it the case that previously you didn't want to control AB to queue these sites? |
11:32:12 | <@arkiver> | if you want to, you're very welcome to get +v and put them in there |
11:33:27 | <mannie> | I don't know if I want to control AB. I am still in doubt. Maybe I just need to try it for a week or so.
11:33:28 | <@OrIdow6> | arkiver: Thanks, that's along the lines of what I was thinking too, just didn't want to do this completely unilaterally |
11:35:08 | <@arkiver> | there was some talk about headless browser based archiving? |
11:35:37 | <@OrIdow6> | !remindme 15h save that mastodon thing |
11:35:37 | <@arkiver> | i believe we don't have anything in place at the moment that does that, it would be nice to have something similar to AB, but perhaps for headless browser based archiving |
11:35:38 | <eggdrop> | [remind] ok, i'll remind you at 2025-01-24T02:35:37Z |
11:36:02 | <@arkiver> | there's always a few web one-off pages that are a huge problem in AB, and for which there's not really time to create something custom |
11:36:14 | <@arkiver> | headless browser based archiving would be really useful there |
11:36:20 | <@arkiver> | there was that tool from IA |
11:36:34 | <@arkiver> | this one https://github.com/internetarchive/brozzler |
11:36:45 | <@OrIdow6> | Yeah it'd be really nice |
11:36:56 | <@arkiver> | i have never tried it out though, but it seems to have activity |
11:37:39 | <@OrIdow6> | Wasn't there discussion in -dev 2 months or so back about a different tool that used the Chrome debugging API to intercept requests? |
11:37:56 | <@OrIdow6> | Written in Go? Unless I'm mixing up Brozzler and Zeno |
11:38:28 | <@arkiver> | i am not sure at all |
11:40:11 | <@OrIdow6> | Think it was https://github.com/iipc/warcaroo but that is indeed not Go |
11:44:53 | <@OrIdow6> | mannie: I do encourage you to try it at least! |
11:45:05 | <@arkiver> | +1 from me mannie |
11:45:07 | <@OrIdow6> | No credit card required :) |
11:51:16 | <mannie> | I think I will try it for this month. |
11:52:24 | <mannie> | I also need to say that there is more real-life stuff coming soon, so on those days I will not be active at all; only the daily list for #down-the-tube
11:53:32 | <szczot3k> | OrIdow6: what do you mean no CC required. I already gave mine to arkiver, and still no +v :( /s |
12:00:01 | | Bleo18260072271962345 quits [Quit: The Lounge - https://thelounge.chat] |
12:02:54 | | Bleo18260072271962345 joins |
12:04:38 | | benjins3 quits [Ping timeout: 250 seconds] |
12:21:27 | | benjins3 joins |
12:42:40 | | mannie quits [Ping timeout: 276 seconds] |
12:46:44 | <TheTechRobo> | arkiver, OrIdow6: I have something WIP in #jseater. No recursion yet, though, and there are some kinks I still need to iron out. |
12:47:08 | <TheTechRobo> | And yeah, it wraps brozzler (perhaps Zeno in the future) |
12:48:45 | | Webuser391030 joins |
12:51:14 | | gatagoto (gatagoto) joins |
12:54:45 | | beardicus (beardicus) joins |
13:06:31 | | SkilledAlpaca418962 quits [Quit: SkilledAlpaca418962] |
13:11:31 | <masterx244|m> | posted a list 2 days ago with a bunch of firmware files. haven't seen that one run through archivebot. list was intended for a !ao< and the total size of the entire list is approx 23GB (according to my local copy of the files, which I used to rederive the URLs after verifying them with an ugly bash command)
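The "rederive the URLs from a local copy" trick masterx244 describes can be sketched as below. This is a hypothetical reconstruction, not the actual command used: it assumes the files were mirrored under `./mirror/<host>/<path>`, and the example host and filename are made up.

```shell
# Hypothetical sketch: rebuild download URLs from a local file mirror.
# Assumes files live under ./mirror/<host>/<path>; paths are invented.
mkdir -p mirror/example.com/firmware
printf 'dummy' > mirror/example.com/firmware/fw-1.0.bin

BASE=./mirror
# Turn each local path back into an https:// URL, one per line.
find "$BASE" -type f | sed "s|^$BASE/|https://|" | sort > urls.txt
cat urls.txt
# Total size of the local copies, to sanity-check the list's footprint.
du -sh "$BASE"
```

The resulting `urls.txt` is the kind of flat list that can be handed to an `!ao <` job.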
13:13:48 | | loug8318142 quits [Ping timeout: 260 seconds] |
13:15:43 | | SkilledAlpaca418962 joins |
13:16:58 | | loug8318142 joins |
13:17:02 | | mannie (nannie) joins |
13:17:57 | | Webuser391030 quits [Client Quit] |
13:20:13 | | Snivy quits [Ping timeout: 260 seconds] |
13:23:14 | | cow_2001 quits [Quit: ✡] |
13:23:56 | | loug8318142 quits [Ping timeout: 250 seconds] |
13:25:10 | | cow_2001 joins |
13:31:40 | | loug8318142 joins |
13:37:08 | | loug8318142 quits [Ping timeout: 260 seconds] |
13:38:06 | | Snivy (Snivy) joins |
13:44:15 | <@arkiver> | TheTechRobo: nice! |
13:46:09 | | mannie quits [Client Quit] |
13:48:14 | <DigitalDragons> | OrIdow6: I remember talk of CDP being used to replace Chrome's HTTP stack with the corentinb/warc library as well |
13:49:57 | | loug8318142 joins |
14:04:33 | | loug8318142 quits [Ping timeout: 260 seconds] |
14:07:37 | | eroc1990 quits [Quit: The Lounge - https://thelounge.chat] |
14:08:06 | | eroc1990 (eroc1990) joins |
14:12:05 | | loug8318142 joins |
14:31:26 | | atluxity joins |
14:32:48 | <atluxity> | I came to wonder... with an optimistic and naive world view, at what rate would archiveteam be able to pull tiny chunks of data over HTTP? Do we have any experience with that?
14:33:23 | <atluxity> | this is just food for a beer-discussion, nothing serious |
14:36:16 | | mannie (nannie) joins |
14:36:49 | | mannie quits [Client Quit] |
14:38:17 | <@imer> | atluxity: if we're talking DPoS we usually aren't limited by workers - either the site gives out (or has strict limiting) or uploading to IA can't keep up. multiple gbit/s is quite feasible
14:39:57 | <atluxity> | It looks to me like Telegram is currently doing about 600 items per second |
14:40:31 | <atluxity> | I look at done_counter, noted the time... waited some time, repeat, calculated |
14:40:48 | <atluxity> | so I can use that number for now |
14:40:51 | <@imer> | telegram is one of those cases where they have rate limiting |
14:41:27 | <@imer> | urls is doing ~3k/s currently |
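atluxity's back-of-envelope method above is just a delta over elapsed time; a minimal sketch, where the two counter readings are invented numbers chosen to reproduce the ~600 items/s figure mentioned:

```shell
# Sketch of estimating item throughput from two tracker readings:
# sample done_counter twice, divide the delta by the elapsed seconds.
# The values below are made-up stand-ins for real tracker readings.
c1=1803000; t1=0     # first reading of done_counter (items), at t=0s
c2=1839000; t2=60    # second reading, 60 seconds later
rate=$(( (c2 - c1) / (t2 - t1) ))
echo "${rate} items/s"   # -> 600 items/s
```

Longer sampling intervals smooth out rate-limit bursts and give a steadier estimate.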
14:42:29 | | bladem (bladem) joins |
15:36:32 | | icedice quits [Ping timeout: 250 seconds] |
15:49:48 | | icedice (icedice) joins |
15:56:00 | <AK> | kiska, you able to get a max for done items/s on #// from your stats? 🤔 |
16:01:03 | | kansei quits [Quit: ZNC 1.9.1 - https://znc.in] |
16:02:08 | | kansei (kansei) joins |
16:05:18 | | holbrooke quits [Ping timeout: 260 seconds] |
16:05:44 | | nulldata6 (nulldata) joins |
16:07:03 | | nulldata quits [Ping timeout: 260 seconds] |
16:07:04 | | nulldata6 is now known as nulldata |
16:10:56 | <LddPotato> | Hey |
16:13:31 | <LddPotato> | Hey guys, I wanted to spin up a cloud instance somewhere to run some docker containers on. Would you advise getting a single instance with a bit more memory and then adding some IPs, or would you go for a handful of smaller instances that each come with their own IP? And which host is the go-to one for you guys? One that has IPs in good standing and preferably isn't too expensive...
16:19:50 | | holbrooke joins |
16:31:17 | <Blueacid> | LddPotato: From what I've seen running locally, you don't need much memory, but CPU can quickly get bogged down with a few wget-at threads on the go! I suspect that many of the cheaper hosting options (DigitalOcean, Hetzner, OVH, Linode, Vultr and so on) will likely have "mixed" IP standings, thanks largely to the low prices on offer there =( |
16:32:46 | <szczot3k> | Depends on what you're trying to do. Any hoster's IP, as opposed to a residential one, will have a bad enough reputation for a lot of things, simply because geolocation/reputation databases know it's a hoster.
16:36:21 | <LddPotato> | Well, at home I have added a tunnel with extra IP addresses, but the tunnel is saturated. To be able to contribute and improve my statistics a little I wanted to host a couple more bandwidth-intensive instances with a cloud hoster, but I am not sure what to get and from whom...
16:36:49 | <@imer> | what project are you intending to run? |
16:36:58 | <szczot3k> | I'm running an OVH dedi with additional IPs, and so far I haven't seen any major issues |
16:37:18 | <szczot3k> | (Actually two dedis, with two subnets between them) |
16:37:30 | <LddPotato> | livestream and telegram use the most data for me at the moment...
16:37:48 | <pabs> | OrIdow6: zygolophodon got support for some additional non-Mastodon Fediverse JS based software, which made the code more complicated, so I need to rebase/rewrite the archiving stuff on top of that |
16:38:15 | <szczot3k> | livestream's api will bonk you with anything over ~2 effective concurrency; telegram is a bit better in this regard
16:38:34 | <pabs> | OrIdow6: I also didn't verify that things work in the WBM afterwards |
16:39:02 | <szczot3k> | But telegram might also one day wake up, and start bonking archiving efforts hard |
16:40:20 | <LddPotato> | szczot3k: the dedicated servers you run, what kind of specs do they have? how big are the subnets you have assigned to them? and how many workers are you running on them?
16:42:52 | <szczot3k> | KS-LE-E, KS-LE-1, each with 4 vms each running 3 projects right now (urls, clyps, livestream), all of them on dedicated IPs. LE-E's CPU is ~40% used by archiveteam vms, LE-1's is at ~70% |
16:42:56 | <@imer> | livestream should be finishing or close to in a week or so (fingers crossed), so don't commit to anything long term for just that - hetzner cloud vms might be good since it's pay as you go - might have to watch your traffic allowance though (prices are very reasonable fwiw) |
16:43:51 | <szczot3k> | But - neither of those dedis were bought _for_ archiving efforts, they have just grown to this size by accident. Most likely will scale back one day |
16:49:13 | <AK> | I've got 3 dedis archiving, plus a bunch of Hetzner cloud I scale up when needed. 1 dedi is an AX51-NVMe with 4 ips which runs a bit from every project to collect logs. Then the other 2 dedis run ArchiveBot pipelines |
16:51:05 | <@imer> | hetzner is nice unless you try to run #// hard :D |
16:51:53 | <szczot3k> | On additional IPs on OVH you get to handle abuse yourself (or at least get a whois object with your abuse contact) |
16:52:41 | <AK> | Weirdly the hetzner emails have stopped for me. No idea what I did but not had any abuse reports for ages |
16:53:06 | <AK> | (Late as always, but livestream containers with logs are up in the usual place if needed) |
16:54:29 | <szczot3k> | I've been thinking of running every project, and pushing the logs to ELK, and then setting up some error reporting from this, but... the procrastination is real |
16:54:35 | <AK> | https://share.aktheknight.co.uk/riJe0/qamiwewA17.png/raw Oh boy, I might need to reduce some of the containers on this |
16:54:53 | <LddPotato> | it sounds like the preference is to go for a bigger instance or dedicated server and then add more IPs to it... The unfortunate thing is I have my own server colocated somewhere, but the bandwidth is quite expensive there...
16:55:15 | <AK> | szczot3k, I did that for a while, but the issue is the amount of logs, it was easy to churn out multiple TB per day. Now I just run Dozzle which gives me logs but no monitoring |
16:56:41 | <szczot3k> | AK: https://i.imgur.com/HGczakQ.png I clear the logs daily, and it's not that bad (24 containers push the logs there) |
16:57:07 | | Webuser821963 quits [Quit: Ooops, wrong browser tab.] |
16:57:47 | <AK> | Yeah doesn't look too bad |
16:57:48 | <AK> | I have 79 containers at the mo, at peak I was running upwards of 200 via swarm with all logs being grabbed in. Was chaos 😂 |
16:58:27 | <szczot3k> | Running a single instance of every project might be feasible for me |
16:58:50 | <szczot3k> | Making good rules for reporting is harder |
17:00:30 | <AK> | I spent ages on rules, and then every time a new project came along the rules would need to be different, so I just settled on this: https://logs.hel1.aktheknight.co.uk/ |
17:00:59 | <AK> | Now if someone goes "it isn't working" people can see my live logs and work out if it's a widespread issue or just one person reporting it |
17:01:54 | | beardicus quits [Ping timeout: 250 seconds] |
17:01:56 | <Blueacid> | Out of interest, how many projects are there running which aren't in the warrior? Should I try to run some of those, too? |
17:02:17 | <Blueacid> | (I've got gigabit at home, so happy to use a chunk of capacity for archival!) |
17:04:20 | | beardicus (beardicus) joins |
17:04:56 | | pedantic-darwin quits [Ping timeout: 250 seconds] |
17:06:11 | <AK> | I would advise being very careful running AT projects on a home ISP, some projects (#// especially) can result in lots of abuse reports and some ISPs get super unhappy about that |
17:07:25 | <szczot3k> | It's the less hardcore version of running a TOR Exit Relay at home |
17:24:37 | <LddPotato> | The reason I stayed away from running #// so far...
17:35:08 | | beardicus quits [Ping timeout: 260 seconds] |
17:38:24 | <Blueacid> | Is #// shorthand for the random links project (The one that has a warning about URL Blocklists in its title) ? |
17:38:34 | <Blueacid> | 'Cos if so, yeah, I've avoided that one too! |
17:39:05 | <Blueacid> | But yeah, my ISP hasn't breathed a word to me so far - so happy to carry on until they get in touch, at which point I'll say "oops, sorry" |
17:40:43 | | beardicus (beardicus) joins |
17:42:20 | | BornOn420 quits [Ping timeout: 276 seconds] |
17:50:24 | <@imer> | Blueacid: #// is the channel for the urls project. so sorta yes :) |
17:54:18 | | SkilledAlpaca418962 quits [Quit: SkilledAlpaca418962] |
17:54:35 | | BornOn420 (BornOn420) joins |
18:38:48 | | SkilledAlpaca418962 joins |
19:32:04 | | emily quits [Quit: ZNC 1.9.1 - https://znc.in] |
19:33:19 | | pseudorizer (pseudorizer) joins |
19:44:50 | | nicolas17 quits [Ping timeout: 250 seconds] |
19:48:26 | | nicolas17 joins |
20:06:04 | | @Fusl quits [Excess Flood] |
20:06:20 | | Fusl (Fusl) joins |
20:06:20 | | @ChanServ sets mode: +o Fusl |
20:08:51 | | Webuser915700 joins |
20:12:15 | <Webuser915700> | Hi all. Hope you are all well. I'm quite new to website archiving and in need of some good suggestions. I have looked through tons of pages, including ArchiveTeam.org. They compared a few WARC-supporting options against each other, but a few important options (Heritrix, Apache Nutch etc.) were never properly compared. Before I start archiving I need to
20:12:15 | <Webuser915700> | know a few things (overall page reliability, ability to include comments and expanded comments, and a few pros and cons that don't get mentioned very often). Does anyone use either Heritrix, Apache Nutch or Grab-site?
20:13:45 | <szczot3k> | What exactly are you trying to do? Do you want to help "the internet community" as a whole, and get stuff into the web.archive? |
20:18:43 | <Webuser915700> | Myself, and the internet community. So there might be differences in which ways files are accepted or not (I'm sure there must be some sort of guideline to make them eligible for upload). I have backed up tons of pages before manually in simple .mhtml format. However, for many pages with many subdirectories that's obviously not really feasible. While my own
20:18:43 | <Webuser915700> | personal interest is more focused on health, nutrition, and biochemistry, there is a dire need to archive all sorts of information.
20:19:31 | <TheTechRobo> | WARCs you create yourself won't be added to the Wayback Machine, for what it's worth. You can request sites to be submitted to ArchiveBot if you'd like them to be archived there
20:19:53 | <TheTechRobo> | If you're not worried about the Wayback Machine, grab-site is probably the nicest option in terms of user-friendliness
20:20:37 | <TheTechRobo> | Heritrix is a lot of setup; never tried Apache Nutch so can't say much about that
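For context on the user-friendliness point: grab-site is driven by a single CLI call per site. A sketch of typical invocations follows; the commands are echoed rather than executed, since a real crawl needs network access and a target site, and example.com is a placeholder. Flags are taken from grab-site's README.

```shell
# Typical grab-site invocations (echoed, not executed; example.com is
# a placeholder). Writes WARCs to a timestamped directory when run.
cmd_full='grab-site https://example.com/ --concurrency=2 --igsets=blogs,forums'
cmd_single='grab-site https://example.com/page.html --1'  # --1: save only that page, no recursion
echo "$cmd_full"
echo "$cmd_single"
```

`--igsets` applies bundled ignore patterns (e.g. calendar traps on forums), which is much of what makes it friendlier than hand-configuring Heritrix.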
20:23:12 | <Webuser915700> | TheTechRobo If it gets added to the Wayback Machine it would be nice as an extra backup. I have a server with around 120 TB, which is quite small but it's enough to start with. Do you know if either Heritrix or Grab-site comments and auto expanding comments?
20:23:59 | <Webuser915700> | support comments and auto expanding comments* |
20:25:12 | <nicolas17> | that depends on the specific website |
20:40:33 | | HP_Archivist (HP_Archivist) joins |
20:50:18 | | Webuser915700 quits [Client Quit] |
20:57:12 | | beardicus quits [Ping timeout: 250 seconds] |
20:58:50 | | Webuser044028 joins |
21:09:57 | <TheTechRobo> | If it requires JavaScript, then probably not |
21:10:12 | | kansei quits [Ping timeout: 250 seconds] |
21:10:34 | | kansei (kansei) joins |
21:18:03 | | beardicus (beardicus) joins |
21:31:58 | | h2ibot quits [Ping timeout: 260 seconds] |
21:38:09 | | APOLLO03 quits [Quit: Leaving] |
21:38:41 | | Webuser044028 quits [Client Quit] |
21:49:12 | | beardicus quits [Ping timeout: 250 seconds] |
21:53:13 | | beardicus (beardicus) joins |
22:01:00 | | qwertyasdfuiopghjkl2 quits [Quit: Leaving.] |
22:01:34 | | qwertyasdfuiopghjkl2 joins |
22:01:34 | | qwertyasdfuiopghjkl2 is now authenticated as qwertyasdfuiopghjkl2 |
22:02:01 | | qwertyasdfuiopghjkl2 quits [Max SendQ exceeded] |
22:02:13 | | qwertyasdfuiopghjkl2 joins |
22:02:13 | | qwertyasdfuiopghjkl2 is now authenticated as qwertyasdfuiopghjkl2 |
22:02:40 | | qwertyasdfuiopghjkl2 quits [Max SendQ exceeded] |
22:02:59 | | sec^nd quits [Remote host closed the connection] |
22:03:17 | | sec^nd (second) joins |
22:05:23 | | APOLLO03 joins |
22:10:59 | | BlueMaxima joins |
22:12:48 | | beardicus quits [Ping timeout: 260 seconds] |
22:16:06 | | nicolas17 is now authenticated as nicolas17 |
22:16:45 | | APOLLO_03 joins |
22:18:38 | | APOLLO03 quits [Ping timeout: 260 seconds] |
22:32:41 | | h2ibot (h2ibot) joins |
22:41:12 | | etnguyen03 (etnguyen03) joins |
22:44:57 | | Dango360 quits [Read error: Connection reset by peer] |
22:59:42 | | Dango360 (Dango360) joins |
23:01:33 | | Webuser880118 joins |
23:02:03 | | Overlordz joins |
23:02:25 | | Webuser880118 quits [Client Quit] |
23:08:48 | | Ryz quits [Ping timeout: 260 seconds] |
23:11:29 | | Ryz (Ryz) joins |
23:26:47 | <kiska> | AK the explore function is available, so you can go make your own query |
23:27:39 | | qwertyasdfuiopghjkl2 (qwertyasdfuiopghjkl2) joins |
23:29:04 | <kiska> | Otherwise I can get one soon(tm) |
23:38:39 | | etnguyen03 quits [Client Quit] |
23:41:06 | <AK> | kiska, Happy to explore, didn't want to overload your side if I wrote really bad queries. Will take a look as I'm interested now 🤔 |