00:11:14 | | Webuser330629 joins |
00:13:32 | | Webuser330629 quits [Client Quit] |
00:25:23 | | etnguyen03 (etnguyen03) joins |
00:29:57 | | APOLLO03 quits [Quit: Leaving] |
00:47:53 | | TastyWiener95 (TastyWiener95) joins |
01:17:09 | | beastbg8__ joins |
01:20:19 | | beardicus (beardicus) joins |
01:20:23 | | beastbg8_ quits [Ping timeout: 260 seconds] |
01:32:59 | | BlueMaxima quits [Read error: Connection reset by peer] |
01:46:35 | | etnguyen03 quits [Client Quit] |
01:51:28 | | beardicus quits [Ping timeout: 250 seconds] |
01:53:34 | | beardicus (beardicus) joins |
02:06:07 | | bladem quits [Read error: Connection reset by peer] |
02:16:59 | | etnguyen03 (etnguyen03) joins |
02:28:03 | | beardicus quits [Ping timeout: 260 seconds] |
02:33:01 | | beardicus (beardicus) joins |
03:28:42 | | etnguyen03 quits [Remote host closed the connection] |
04:55:38 | | beardicus quits [Ping timeout: 250 seconds] |
05:05:46 | | JayEmbee (JayEmbee) joins |
05:07:51 | | opl quits [Quit: bye] |
05:23:19 | | opl joins |
06:14:52 | | BornOn420 quits [Remote host closed the connection] |
06:15:51 | | BornOn420 (BornOn420) joins |
06:26:38 | | jspiros_ quits [Ping timeout: 260 seconds] |
06:42:00 | | jspiros (jspiros) joins |
06:51:08 | | jspiros quits [Ping timeout: 260 seconds] |
07:58:27 | | SootBector quits [Remote host closed the connection] |
07:58:50 | | SootBector (SootBector) joins |
08:16:27 | | jspiros (jspiros) joins |
08:24:24 | | Shjosan quits [Quit: Am sleepy (-, – )…zzzZZZ] |
08:25:45 | | Shjosan (Shjosan) joins |
09:14:19 | | loug8318142 joins |
09:18:07 | | APOLLO03 joins |
09:57:32 | | Sluggs quits [Excess Flood] |
10:03:55 | | Sluggs joins |
10:28:42 | | Sluggs quits [Excess Flood] |
10:35:34 | | Sluggs joins |
10:39:16 | <@OrIdow6> | Looking at it, congrats on progress (not yet merged but looks like you're further along) on the mastodon thing pabs |
10:53:47 | <@OrIdow6> | Also, for all - I'm looking into this because STWP has had an incident kinda similar to the berries.space incident (they SPN'd some Fediverse posts without asking) and I think it would be useful to save the Mastodon thread where people get angry at them, good illustration of people reacting to archiving |
10:58:24 | <@OrIdow6> | I'm thinking of going forward with it, not like there's anything sensitive there anyway, just people complaining
10:58:45 | <@arkiver> | what is this about OrIdow6 ? |
11:09:58 | | ArchivalEfforts quits [Quit: No Ping reply in 180 seconds.] |
11:13:46 | | Sluggs quits [Excess Flood] |
11:17:42 | | Sluggs joins |
11:19:50 | <@arkiver> | mostly asking since i can't find in my logs what this related to :P |
11:20:03 | <@arkiver> | i think i'm not missing logs but not sure |
11:25:27 | | mannie (nannie) joins |
11:25:28 | <eggdrop> | [tell] mannie: [2025-01-21T08:16:56Z] <OrIdow6> Could you please provide more context to this? For instance: what date range does this list cover? Also rather than a long list of references it would be better to have just one or two links per company, to establish that they are going bankrupt; e.g., I cannot find a source for the fact that Sarvision is going bankrupt skimming through the URLs in the list, instead most just seem to establish t |
11:27:19 | <@OrIdow6> | arkiver: Umm, the context is a long long conversation machine-translated from Chinese in #stwp-chat , basically I'm asking whether we want to archive a kerfuffle over archiving, just being safe and asking other people cuz there's a distant possibility we could get involved in the blowback as well and I don't want to make that decision solo |
11:27:47 | <c3manu> | OrIdow6: looks like the message you left mannie got cut off |
11:28:01 | <@OrIdow6> | c3manu: Oh shoot, you're right |
11:28:25 | <mannie> | OrIdow6: I use https://insolventies.rechtspraak.nl/#!/ as a source. I filter on type bankrupt and then today |
11:28:41 | <@OrIdow6> | "... that the companies exist" or something like that was what was left out |
11:28:42 | <c3manu> | i wasn’t following yesterday, have there been any new ideas? where’s that discussion currently at?
11:29:48 | <mannie> | Here is the list of court announcements from the last 7 days: https://insolventies.rechtspraak.nl/#!/resultaat?periode=Laatste%20week&rechtbank=all&publicatiesoort=uitspraken%20faillissement&publicatiekenmerk=&insolventienummer=&startDate=2025-01-15T11:30:42.875Z&type=kenmerk
11:30:08 | <c3manu> | mannie: the thing is, we don’t really have time to go through all of that by ourselves and were thinking about solutions for automating part of it, or thinking about ways for people to throw in their stuff without having to learn all of AB, essentially |
11:30:14 | <mannie> | I don't include it in the references list because it's not possible to archive it with archivebot
11:30:51 | <c3manu> | mannie: if that makes it any better, the German insolvency list seems to be even more of a mess ^^" |
11:31:03 | <@arkiver> | OrIdow6: i guess it can be archived, after data is donated to IA, they can always contact IA with requests |
11:31:11 | <mannie> | I can make a separate list for each company
11:32:03 | <@arkiver> | mannie: was it the case that previously you didn't want to control AB to queue these sites? |
11:32:12 | <@arkiver> | if you want to, you're very welcome to get +v and put them in there |
11:33:27 | <mannie> | I don't know if I want to control AB. I am still in doubt. Maybe I just need to try it for a week or so.
11:33:28 | <@OrIdow6> | arkiver: Thanks, that's along the lines of what I was thinking too, just didn't want to do this completely unilaterally |
11:35:08 | <@arkiver> | there was some talk about headless browser based archiving? |
11:35:37 | <@OrIdow6> | !remindme 15h save that mastodon thing |
11:35:37 | <@arkiver> | i believe we don't have anything in place at the moment that does that, it would be nice to have something similar to AB, but perhaps for headless browser based archiving |
11:35:38 | <eggdrop> | [remind] ok, i'll remind you at 2025-01-24T02:35:37Z |
11:36:02 | <@arkiver> | there's always a few web one-off pages that are a huge problem in AB, and for which there's not really time to create something custom |
11:36:14 | <@arkiver> | headless browser based archiving would be really useful there |
11:36:20 | <@arkiver> | there was that tool from IA |
11:36:34 | <@arkiver> | this one https://github.com/internetarchive/brozzler |
11:36:45 | <@OrIdow6> | Yeah it'd be really nice |
11:36:56 | <@arkiver> | i have never tried it out though, but it seems to have activity |
11:37:39 | <@OrIdow6> | Wasn't there discussion in -dev 2 months or so back about a different tool that used the Chrome debugging API to intercept requests? |
11:37:56 | <@OrIdow6> | Written in Go? Unless I'm mixing up Brozzler and Zeno |
11:38:28 | <@arkiver> | i am not sure at all |
11:40:11 | <@OrIdow6> | Think it was https://github.com/iipc/warcaroo but that is indeed not Go |
11:44:53 | <@OrIdow6> | mannie: I do encourage you to try it at least! |
11:45:05 | <@arkiver> | +1 from me mannie |
11:45:07 | <@OrIdow6> | No credit card required :) |
11:51:16 | <mannie> | I think I will try it for this month. |
11:52:24 | <mannie> | I also need to say that there is more real-life stuff coming soon, so on those days I will not be active at all; only the daily list for #down-the-tube
11:53:32 | <szczot3k> | OrIdow6: what do you mean no CC required. I already gave mine to arkiver, and still no +v :( /s |
12:00:01 | | Bleo18260072271962345 quits [Quit: The Lounge - https://thelounge.chat] |
12:02:54 | | Bleo18260072271962345 joins |
12:04:38 | | benjins3 quits [Ping timeout: 250 seconds] |
12:21:27 | | benjins3 joins |
12:42:40 | | mannie quits [Ping timeout: 276 seconds] |
12:46:44 | <TheTechRobo> | arkiver, OrIdow6: I have something WIP in #jseater. No recursion yet, though, and there are some kinks I still need to iron out. |
12:47:08 | <TheTechRobo> | And yeah, it wraps brozzler (perhaps Zeno in the future) |
12:48:45 | | Webuser391030 joins |
12:51:14 | | gatagoto (gatagoto) joins |
12:54:45 | | beardicus (beardicus) joins |
13:06:31 | | SkilledAlpaca418962 quits [Quit: SkilledAlpaca418962] |
13:11:31 | <masterx244|m> | posted a list 2 days ago with a bunch of firmware files. haven't seen that one run through archivebot. list was intended for a !ao< and the total size of the entire list is approx 23GB (according to my local copy of the files, which I used to rederive the URLs after verifying them with an ugly bash command)
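The "rederive the URLs from a local copy" trick masterx244 describes can be sketched as below. This is a hypothetical reconstruction, not the actual command used: it assumes the files were mirrored under `./mirror/<host>/<path>`, and the example host and filename are made up.

```shell
# Hypothetical sketch: rebuild download URLs from a local file mirror.
# Assumes files live under ./mirror/<host>/<path>; paths are invented.
mkdir -p mirror/example.com/firmware
printf 'dummy' > mirror/example.com/firmware/fw-1.0.bin

BASE=./mirror
# Turn each local path back into an https:// URL, one per line.
find "$BASE" -type f | sed "s|^$BASE/|https://|" | sort > urls.txt
cat urls.txt
# Total size of the local copies, to sanity-check the list's footprint.
du -sh "$BASE"
```

The resulting `urls.txt` is the kind of flat list that can be handed to an `!ao <` job.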
13:13:48 | | loug8318142 quits [Ping timeout: 260 seconds] |
13:15:43 | | SkilledAlpaca418962 joins |
13:16:58 | | loug8318142 joins |
13:17:02 | | mannie (nannie) joins |
13:17:57 | | Webuser391030 quits [Client Quit] |
13:20:13 | | Snivy quits [Ping timeout: 260 seconds] |
13:23:14 | | cow_2001 quits [Quit: ✡] |
13:23:56 | | loug8318142 quits [Ping timeout: 250 seconds] |
13:25:10 | | cow_2001 joins |
13:31:40 | | loug8318142 joins |
13:37:08 | | loug8318142 quits [Ping timeout: 260 seconds] |
13:38:06 | | Snivy (Snivy) joins |
13:44:15 | <@arkiver> | TheTechRobo: nice! |
13:46:09 | | mannie quits [Client Quit] |
13:48:14 | <DigitalDragons> | OrIdow6: I remember talk of CDP being used to replace Chrome's HTTP stack with the corentinb/warc library as well |
13:49:57 | | loug8318142 joins |
14:04:33 | | loug8318142 quits [Ping timeout: 260 seconds] |
14:07:37 | | eroc1990 quits [Quit: The Lounge - https://thelounge.chat] |
14:08:06 | | eroc1990 (eroc1990) joins |
14:12:05 | | loug8318142 joins |
14:31:26 | | atluxity joins |
14:32:48 | <atluxity> | I came to wonder... with an optimistic and naive world view, at what rate would archiveteam be able to pull tiny chunks of data over HTTP? Do we have any experience with that?
14:33:23 | <atluxity> | this is just food for a beer-discussion, nothing serious |
14:36:16 | | mannie (nannie) joins |
14:36:49 | | mannie quits [Client Quit] |
14:38:17 | <@imer> | atluxity: if we're talking DPoS we usually aren't limited by workers - either the site gives out (or has strict limiting) or uploading to IA can't keep up. multiple gbit/s is quite feasible
14:39:57 | <atluxity> | It looks to me like Telegram is currently doing about 600 items per second |
14:40:31 | <atluxity> | I look at done_counter, noted the time... waited some time, repeat, calculated |
14:40:48 | <atluxity> | so I can use that number for now |
14:40:51 | <@imer> | telegram is one of those cases where they have rate limiting |
14:41:27 | <@imer> | urls is doing ~3k/s currently |
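atluxity's back-of-envelope method above is just a delta over elapsed time; a minimal sketch, where the two counter readings are invented numbers chosen to reproduce the ~600 items/s figure mentioned:

```shell
# Sketch of estimating item throughput from two tracker readings:
# sample done_counter twice, divide the delta by the elapsed seconds.
# The values below are made-up stand-ins for real tracker readings.
c1=1803000; t1=0     # first reading of done_counter (items), at t=0s
c2=1839000; t2=60    # second reading, 60 seconds later
rate=$(( (c2 - c1) / (t2 - t1) ))
echo "${rate} items/s"   # -> 600 items/s
```

Longer sampling intervals smooth out rate-limit bursts and give a steadier estimate.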
14:42:29 | | bladem (bladem) joins |
15:36:32 | | icedice quits [Ping timeout: 250 seconds] |
15:49:48 | | icedice (icedice) joins |
15:56:00 | <AK> | kiska, you able to get a max for done items/s on #// from your stats? 🤔 |
16:01:03 | | kansei quits [Quit: ZNC 1.9.1 - https://znc.in] |
16:02:08 | | kansei (kansei) joins |
16:05:18 | | holbrooke quits [Ping timeout: 260 seconds] |
16:05:44 | | nulldata6 (nulldata) joins |
16:07:03 | | nulldata quits [Ping timeout: 260 seconds] |
16:07:04 | | nulldata6 is now known as nulldata |
16:10:56 | <LddPotato> | Hey |
16:13:31 | <LddPotato> | Hey guys, I wanted to spin up a cloud instance somewhere to run some docker containers on. Would you advise getting a single instance with a bit more memory and then adding some IPs, or would you go for a handful of smaller instances that each come with their own IP? And which host is the go-to one for you guys? One that has IPs in good standing and preferably isn't too expensive...
16:19:50 | | holbrooke joins |
16:31:17 | <Blueacid> | LddPotato: From what I've seen running locally, you don't need much memory, but CPU can quickly get bogged down with a few wget-at threads on the go! I suspect that many of the cheaper hosting options (DigitalOcean, Hetzner, OVH, Linode, Vultr and so on) will likely have "mixed" IP standings, thanks largely to the low prices on offer there =( |
16:32:46 | <szczot3k> | Depends on what you're trying to do. Any hoster's IP, as opposed to a residential one, will have a bad enough reputation for a lot of things, simply because geolocation/reputation databases know it's a hoster.
16:36:21 | <LddPotato> | Well, at home I have added a tunnel with extra IP addresses, but the tunnel is saturated. To be able to contribute and improve my statistics a little I wanted to host a couple more bandwidth-intensive instances with a cloud hoster, but I am not sure what to get and from whom...
16:36:49 | <@imer> | what project are you intending to run? |
16:36:58 | <szczot3k> | I'm running an OVH dedi with additional IPs, and so far I haven't seen any major issues |
16:37:18 | <szczot3k> | (Actually two dedis, with two subnets between them) |
16:37:30 | <LddPotato> | livestream and telegram use the most data for me at the moment...
16:37:48 | <pabs> | OrIdow6: zygolophodon got support for some additional non-Mastodon Fediverse JS based software, which made the code more complicated, so I need to rebase/rewrite the archiving stuff on top of that |
16:38:15 | <szczot3k> | livestream's api will bonk you with anything over ~2 effective concurrency; telegram is a bit better in this regard
16:38:34 | <pabs> | OrIdow6: I also didn't verify that things work in the WBM afterwards |
16:39:02 | <szczot3k> | But telegram might also one day wake up, and start bonking archiving efforts hard |
16:40:20 | <LddPotato> | szczot3k: the dedicated servers you run, what kind of specs do they have? how big are the subnets you have assigned to them? and how many workers are you running on them?
16:42:52 | <szczot3k> | KS-LE-E, KS-LE-1, each with 4 vms each running 3 projects right now (urls, clyps, livestream), all of them on dedicated IPs. LE-E's CPU is ~40% used by archiveteam vms, LE-1's is at ~70% |
16:42:56 | <@imer> | livestream should be finishing or close to in a week or so (fingers crossed), so don't commit to anything long term for just that - hetzner cloud vms might be good since it's pay as you go - might have to watch your traffic allowance though (prices are very reasonable fwiw) |
16:43:51 | <szczot3k> | But - neither of those dedis were bought _for_ archiving efforts, they have just grown to this size by accident. Most likely will scale back one day |
16:49:13 | <AK> | I've got 3 dedis archiving, plus a bunch of Hetzner cloud I scale up when needed. 1 dedi is an AX51-NVMe with 4 ips which runs a bit from every project to collect logs. Then the other 2 dedis run ArchiveBot pipelines |
16:51:05 | <@imer> | hetzner is nice unless you try to run #// hard :D |
16:51:53 | <szczot3k> | On additional IPs on OVH you get to handle abuse yourself (or at least get a whois object with your abuse contact) |
16:52:41 | <AK> | Weirdly the hetzner emails have stopped for me. No idea what I did but not had any abuse reports for ages |
16:53:06 | <AK> | (Late as always, but livestream containers with logs are up in the usual place if needed) |
16:54:29 | <szczot3k> | I've been thinking of running every project, and pushing the logs to ELK, and then setting up some error reporting from this, but... the procrastination is real |
16:54:35 | <AK> | https://share.aktheknight.co.uk/riJe0/qamiwewA17.png/raw Oh boy, I might need to reduce some of the containers on this |
16:54:53 | <LddPotato> | it sounds like the preference is to go for a bigger instance or dedicated server and then add more IPs to it... The unfortunate thing is I have my own server colocated somewhere, but the bandwidth is quite expensive there...
16:55:15 | <AK> | szczot3k, I did that for a while, but the issue is the amount of logs, it was easy to churn out multiple TB per day. Now I just run Dozzle which gives me logs but no monitoring |
16:56:41 | <szczot3k> | AK: https://i.imgur.com/HGczakQ.png I clear the logs daily, and it's not that bad (24 containers push the logs there) |
16:57:07 | | Webuser821963 quits [Quit: Ooops, wrong browser tab.] |
16:57:47 | <AK> | Yeah doesn't look too bad |
16:57:48 | <AK> | I have 79 containers at the mo, at peak I was running upwards of 200 via swarm with all logs being grabbed in. Was chaos 😂 |
16:58:27 | <szczot3k> | Running a single instance of every project might be feasible for me |
16:58:50 | <szczot3k> | Making good rules for reporting is harder |
17:00:30 | <AK> | I spent ages on rules, and then every time a new project came along the rules would need to be different, so I just settled on this: https://logs.hel1.aktheknight.co.uk/ |
17:00:59 | <AK> | Now if someone goes "it isn't working" people can see my live logs and work out if it's a widespread issue or just one person reporting it |
17:01:54 | | beardicus quits [Ping timeout: 250 seconds] |
17:01:56 | <Blueacid> | Out of interest, how many projects are there running which aren't in the warrior? Should I try to run some of those, too? |
17:02:17 | <Blueacid> | (I've got gigabit at home, so happy to use a chunk of capacity for archival!) |
17:04:20 | | beardicus (beardicus) joins |
17:04:56 | | pedantic-darwin quits [Ping timeout: 250 seconds] |
17:06:11 | <AK> | I would advise being very careful running AT projects on a home ISP, some projects (#// especially) can result in lots of abuse reports and some ISPs get super unhappy about that |
17:07:25 | <szczot3k> | It's the less hardcore version of running a TOR Exit Relay at home |
17:24:37 | <LddPotato> | The reason I stayed away from running #// so far...
17:35:08 | | beardicus quits [Ping timeout: 260 seconds] |
17:38:24 | <Blueacid> | Is #// shorthand for the random links project (The one that has a warning about URL Blocklists in its title) ? |
17:38:34 | <Blueacid> | 'Cos if so, yeah, I've avoided that one too! |
17:39:05 | <Blueacid> | But yeah, my ISP hasn't breathed a word to me so far - so happy to carry on until they get in touch, at which point I'll say "oops, sorry" |
17:40:43 | | beardicus (beardicus) joins |
17:42:20 | | BornOn420 quits [Ping timeout: 276 seconds] |
17:50:24 | <@imer> | Blueacid: #// is the channel for the urls project. so sorta yes :) |
17:54:18 | | SkilledAlpaca418962 quits [Quit: SkilledAlpaca418962] |
17:54:35 | | BornOn420 (BornOn420) joins |
18:38:48 | | SkilledAlpaca418962 joins |
19:32:04 | | emily quits [Quit: ZNC 1.9.1 - https://znc.in] |
19:33:19 | | pseudorizer (pseudorizer) joins |
19:44:50 | | nicolas17 quits [Ping timeout: 250 seconds] |
19:48:26 | | nicolas17 joins |
20:06:04 | | @Fusl quits [Excess Flood] |
20:06:20 | | Fusl (Fusl) joins |
20:06:20 | | @ChanServ sets mode: +o Fusl |
20:08:51 | | Webuser915700 joins |
20:12:15 | <Webuser915700> | Hi all. Hope you are all well. I'm quite new to website archiving and in need of some good suggestions. I have looked through tons of pages, including ArchiveTeam.org. They compared a few WARC-supporting options against each other, but a few important options (Heritrix, Apache Nutch etc.) were never properly compared. Before I start archiving I need to
20:12:15 | <Webuser915700> | know a few things (overall page reliability, ability to include comments and expanded comments, and a few pros and cons that don't get mentioned very often). Does anyone use either Heritrix, Apache Nutch or Grab-site?
20:13:45 | <szczot3k> | What exactly are you trying to do? Do you want to help "the internet community" as a whole, and get stuff into the web.archive? |
20:18:43 | <Webuser915700> | Myself, and the internet community. So there might be differences in which ways files are accepted or not (I'm sure there must be some sort of guideline to make them eligible for upload). I have backed up tons of pages before manually in simple .mhtml format. However, for many pages with many subdirectories that's obviously not really feasible. While my own
20:18:43 | <Webuser915700> | personal interest is more focused on health, nutrition, and biochemistry, there is a dire need to archive all sorts of information.
20:19:31 | <TheTechRobo> | WARCs you create yourself won't be added to the Wayback Machine, for what it's worth. You can request sites to be submitted to ArchiveBot if you'd like them to be archived there
20:19:53 | <TheTechRobo> | If you're not worried about the Wayback Machine, grab-site is probably the nicest option in terms of user-friendliness
20:20:37 | <TheTechRobo> | Heritrix is a lot of setup; never tried Apache Nutch so can't say much about that
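For context on the user-friendliness point: grab-site is driven by a single CLI call per site. A sketch of typical invocations follows; the commands are echoed rather than executed, since a real crawl needs network access and a target site, and example.com is a placeholder. Flags are taken from grab-site's README.

```shell
# Typical grab-site invocations (echoed, not executed; example.com is
# a placeholder). Writes WARCs to a timestamped directory when run.
cmd_full='grab-site https://example.com/ --concurrency=2 --igsets=blogs,forums'
cmd_single='grab-site https://example.com/page.html --1'  # --1: save only that page, no recursion
echo "$cmd_full"
echo "$cmd_single"
```

`--igsets` applies bundled ignore patterns (e.g. calendar traps on forums), which is much of what makes it friendlier than hand-configuring Heritrix.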
20:23:12 | <Webuser915700> | TheTechRobo If it gets added to the Wayback Machine it would be nice as an extra backup. I have a server with around 120 TB, which is quite small but it's enough to start with. Do you know if either Heritrix or Grab-site comments and auto expanding comments?
20:23:59 | <Webuser915700> | support comments and auto expanding comments* |
20:25:12 | <nicolas17> | that depends on the specific website |
20:40:33 | | HP_Archivist (HP_Archivist) joins |
20:50:18 | | Webuser915700 quits [Client Quit] |
20:57:12 | | beardicus quits [Ping timeout: 250 seconds] |
20:58:50 | | Webuser044028 joins |
21:09:57 | <TheTechRobo> | If it requires JavaScript, then probably not |
21:10:12 | | kansei quits [Ping timeout: 250 seconds] |
21:10:34 | | kansei (kansei) joins |
21:18:03 | | beardicus (beardicus) joins |
21:31:58 | | h2ibot quits [Ping timeout: 260 seconds] |
21:38:09 | | APOLLO03 quits [Quit: Leaving] |
21:38:41 | | Webuser044028 quits [Client Quit] |
21:49:12 | | beardicus quits [Ping timeout: 250 seconds] |
21:53:13 | | beardicus (beardicus) joins |
22:01:00 | | qwertyasdfuiopghjkl2 quits [Quit: Leaving.] |
22:01:34 | | qwertyasdfuiopghjkl2 joins |
22:01:34 | | qwertyasdfuiopghjkl2 is now authenticated as qwertyasdfuiopghjkl2 |
22:02:01 | | qwertyasdfuiopghjkl2 quits [Max SendQ exceeded] |
22:02:13 | | qwertyasdfuiopghjkl2 joins |
22:02:13 | | qwertyasdfuiopghjkl2 is now authenticated as qwertyasdfuiopghjkl2 |
22:02:40 | | qwertyasdfuiopghjkl2 quits [Max SendQ exceeded] |
22:02:59 | | sec^nd quits [Remote host closed the connection] |
22:03:17 | | sec^nd (second) joins |
22:05:23 | | APOLLO03 joins |
22:10:59 | | BlueMaxima joins |
22:12:48 | | beardicus quits [Ping timeout: 260 seconds] |
22:16:06 | | nicolas17 is now authenticated as nicolas17 |
22:16:45 | | APOLLO_03 joins |
22:18:38 | | APOLLO03 quits [Ping timeout: 260 seconds] |
22:32:41 | | h2ibot (h2ibot) joins |
22:41:12 | | etnguyen03 (etnguyen03) joins |
22:44:57 | | Dango360 quits [Read error: Connection reset by peer] |
22:59:42 | | Dango360 (Dango360) joins |
23:01:33 | | Webuser880118 joins |
23:02:03 | | Overlordz joins |
23:02:25 | | Webuser880118 quits [Client Quit] |
23:08:48 | | Ryz quits [Ping timeout: 260 seconds] |
23:11:29 | | Ryz (Ryz) joins |
23:26:47 | <kiska> | AK the explore function is available, so you can go make your own query |
23:27:39 | | qwertyasdfuiopghjkl2 (qwertyasdfuiopghjkl2) joins |
23:29:04 | <kiska> | Otherwise I can get one soon(tm) |
23:38:39 | | etnguyen03 quits [Client Quit] |
23:41:06 | <AK> | kiska, Happy to explore, didn't want to overload your side if I wrote really bad queries. Will take a look as I'm interested now 🤔 |