00:23:03BlueMaxima quits [Client Quit]
00:41:06omglolbah quits [Ping timeout: 265 seconds]
00:46:48omglolbah joins
01:13:50omglolbah quits [Ping timeout: 264 seconds]
01:14:17omglolbah joins
01:24:30omglolbah_ joins
01:25:34omglolbah quits [Ping timeout: 265 seconds]
02:22:58BlueMaxima joins
02:33:16monoxane (monoxane) joins
02:34:49<pabs>JAA: sounds about right. TBH this is the first time I heard of Freed-ora
02:35:20pabs quits [Remote host closed the connection]
02:35:24<monoxane>yo so how hard would it be to get some more targets online if we had the storage + network to provide for it
02:35:59<monoxane>if you've seen pixiv over the last 2 days you may have seen that me and a few friends have thrown some 100g boxes at it and are currently bottlenecked by the 2 online targets
02:36:26<monoxane>we know the targets need to offload to IA at an appropriate speed, but have quite a bit of available storage to buffer ourselves with
02:38:02<monoxane>at a point we were hitting 7.5gbps from the source but are now limited by the targets' disks filling up and stopping connections 😔
02:39:13<monoxane>its less of a thing for this in particular but we're trying to work out how we can provide some infra for the next "oh shit its going down in 24 hours" site scrape
02:39:58<monoxane>500gbit of bandwidth, a /24, and 100tbit of local storage will help some of those a fair bit 😉
02:41:54<@JAA>Pinging some relevant people: rewby HCross arkiver ^
02:43:38<monoxane>we're also working on rewriting an api compatible warrior that will scale much higher
02:44:19<monoxane>for reference last night we had 3328 warrior threads running across 13 nodes for shits n gigs, and were nowhere near capacity
02:45:20pabs (pabs) joins
02:45:47<monoxane>also considering rolling a new version of the megawarc factory with some improvements, the real question is how does it get from the targets to IA and what do we need to do to facilitate that
02:46:37<monoxane>and yes, aware that IA only has ~20gbps S3 capacity, we'd be egress shaping down to about 5gbps, hence the fuck off massive target cache to hold it for a bit
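The buffering claim above works out as a quick back-of-envelope (figures are the ones quoted in the messages above; only the net fill rate matters):

```python
# Back-of-envelope: how long a target-side cache survives when ingest
# outpaces the shaped egress toward IA. All figures are from the chat.
ingest_gbps = 7.5      # peak pull rate from the source
egress_gbps = 5.0      # shaped upload rate toward IA's S3 endpoint
cache_bytes = 100e12   # ~100 TB of local buffer

net_fill_bytes_per_s = (ingest_gbps - egress_gbps) * 1e9 / 8
seconds_until_full = cache_bytes / net_fill_bytes_per_s
print(f"buffer fills in ~{seconds_until_full / 3600:.0f} hours")
```

At those rates the cache absorbs roughly three and a half days of sustained overrun before backpressure kicks in.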
03:15:25fishingforsoup_ quits [Quit: Leaving]
03:15:42fishingforsoup_ joins
03:20:58godane (godane) joins
03:21:34godane1 quits [Ping timeout: 265 seconds]
04:06:27Aizith joins
04:09:25Aizith quits [Remote host closed the connection]
05:23:27igloo22225 quits [Client Quit]
05:23:39igloo22225 (igloo22225) joins
06:02:05Arcorann_ joins
06:04:27hackbug quits [Ping timeout: 265 seconds]
06:06:31BlueMaxima quits [Client Quit]
06:18:15sonick quits [Client Quit]
06:29:13godane1 joins
06:31:02godane quits [Ping timeout: 265 seconds]
06:32:36jacksonchen666 quits [Ping timeout: 245 seconds]
06:33:06jacksonchen666 (jacksonchen666) joins
06:35:46Island quits [Read error: Connection reset by peer]
07:28:41<monika>monoxane could you clarify on the "api compatible" warrior? are you modifying the existing warrior or writing one from scratch
07:30:14<monika>i believe modifying warrior code is a big no no
07:31:36<monoxane>new one that does the same thing with the same apis just less jank and some more options to allow us to vertically scale easier and with an updated docker image
07:32:11<monika>JAA what's your opinion ^
07:32:34<nepeat>i'd be interested in learning more and supporting this warrior improvement
07:33:04<nepeat>personally, i'd love to add on prom metrics and getting the logging to fit the structlog format to work with my systems
07:33:07<monoxane>im not the guy doing that so i might be wrong on whats actually happening, but we've found that one of the main limiting factors of the warrior is its concurrency settings and the inability to disable things like the web ui
07:33:28<monoxane>and also the fact that some of the python libs used in it are effectively vaporware that havnt been updated since 2017
07:33:28<monika>if you run the bare project containers the UI is already disabled
07:33:44<monika>atdr.meo.ws/archiveteam/<PROJECT>-grab
07:33:58<monika>allows for 20 concurrency too
07:34:01<monoxane>ooo we did not know that
07:34:07<monika>go crazy
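monika's bare-container tip, as a concrete invocation (sketch only: the project name "example" and the exact flags are illustrative; check each project's README for the real arguments):

```shell
# Run a bare project container directly instead of through the warrior.
# Image registry path is from the chat; project name and flags are
# illustrative, not taken from any specific project's docs.
IMAGE="atdr.meo.ws/archiveteam/example-grab"
NICK="yournick"       # your downloader name
CONCURRENCY=20        # the per-container maximum monika mentions
CMD="docker run -d --restart unless-stopped $IMAGE --concurrent $CONCURRENCY $NICK"
echo "$CMD"
```

Running N of these side by side is how the "hardcore" setups discussed later in the log scale past the warrior VM.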
07:34:12<monoxane>that is going to make a massive difference
07:34:56<monoxane>aight the warrior isnt being changed anymore :)
07:35:26<monoxane>but we are gonna write our own cluster agent and c2 implementation :P
07:36:02<nepeat>ditching k8s already?
07:36:29<monoxane>no, still using k8s, just writing a controller that handles the deployment and configuration of those bare images instead of the warrior
07:36:46<monoxane>we are already working on that but via the warrior, knowing about the bare images is a massive game changer
07:38:54<monoxane>hm these dont seem to actually contain anything though 😔
07:39:06<monika>huh?
07:39:08<monoxane>at least the pixiv-2-grab one's dockerfile literally only has a from line in it
07:39:38<nepeat>it does fucky ONBUILD magic https://github.com/ArchiveTeam/grab-base-df/blob/master/Dockerfile
07:39:45<nepeat>this is the dockerfile to refer to
07:39:52<nepeat>the base is on https://github.com/ArchiveTeam/wget-lua/blob/v1.21.3-at/Dockerfile
07:40:00@OrIdow6 leaves [Leaving.]
07:40:10OrIdow6 (OrIdow6) joins
07:40:10@ChanServ sets mode: +o OrIdow6
07:40:35<monoxane>ah okay, thats some fucky shit i havnt seen before :P
07:40:45<monoxane>will play around with it after i finish my actual job for the day lol
07:41:23<@OrIdow6>arkiver: See above, they have dropped their plan to modify "the warrior"
07:42:08<monoxane>yes now we're just gonna bypass it 😆
07:43:02<monoxane>we dont wanna do anything that will screw anyone else here but there are definitely challenges with scaling warrior to 3000+ instances over 10+ nodes and actually managing it
07:43:11<nepeat>eh https://github.com/general-programming/megarepo/blob/mainline/common/nomad_jobs/job_at_vlive.hcl
07:43:31<nepeat>i like nomad, it's simple and has scaled up with my 100-300 instances well
07:43:57<monoxane>yea the other thing is the nodes we're using already have k3s and are running some other workloads, so we cant just jump to nomad
07:44:13<nepeat>ah, preexisting prod
07:44:54<monoxane>yes, if you knew what these nodes usually do you'd be absolutely shocked that we can run AT workloads on them, and also absolutely not surprised at all that we can pin 500gbps
07:45:05<monoxane>but dont worry its all approved by the owners :)
07:46:28<@OrIdow6>I haven't been following this conversation enough to know the meaning of "bypass it", but basically, the hard rules are: -don't modify wget-lua/wget-at, including messing with the build process to get it to accept wider ranges of library versions -don't modify Seesaw or the other libraries it uses -don't modify the project scripts -keep a clean, vanilla connection from wget and the project scripts to the Internet
07:48:23sepro quits [Read error: Connection reset by peer]
07:48:52<monoxane>understood we’ll definitely be sticking to that
07:49:10<monoxane>i mean we’ll be running the project containers directly not managed through warrior
07:51:45sepro (sepro) joins
07:52:29michaelblob quits [Read error: Connection reset by peer]
07:53:14<nepeat>that's what most of us hardcore users do
07:53:39<nepeat>you're definitely on the right path to hauling top rates
07:58:56<monoxane>we dont care about the leaderboards lmao, even considered randomising the DOWNLOADER ids so other people dont get discouraged by 1 name munching 10tb a day
07:59:16<monoxane>its more, if we can help in an "oh fuck" situation where theres 24 hours to get an entire site archived, we'll put in everything we've got
07:59:46<monoxane>because i've been part of some of those where even with all the capacity we've had, some content is still lost, and in a couple of cases it was a fair bit of content
08:07:25<schwarzkatz|m>Appreciate the work you guys do, monoxane!
08:11:34<Jake>(also related to earlier conversation, it's easier if you use a known downloader name so that you can be contacted)
08:13:58<monoxane>yea we're gonna use some sort of team name when its all up and running
08:14:07<monoxane>instead of just my nick lol
08:16:28Hackerpcs quits [Quit: Hackerpcs]
08:18:43Hackerpcs (Hackerpcs) joins
08:47:53<nepeat>kinda wondering, how up to date are all of the archive team repos?
09:08:50hitgrr8 joins
09:15:26sknebel is now known as sknebel_m
09:15:35sknebel_m is now known as sknebel
09:17:43sknebel is now known as sknebel2
09:17:49sknebel2 is now known as sknebel
09:23:03<neggles>"don't modify Seesaw or the other libraries it uses" aww
09:27:39<neggles>I believe the current plan was to use MagnusInstitute or possibly MagnusArchivist as downloader name, TBC though
09:32:35<neggles>OrIdow6: would it be OK to rework the warrior docker image somewhat so it's a bit more... modern, for lack of a better way to put it? I was digging through repos and whatnot last night piecing together how it all works and... oof.
09:34:42<@OrIdow6>neggles: I don't know what that implies exactly
09:35:04<@OrIdow6>The core that you shouldn't modify is in the READMEs under "Distribution-specific setup"
09:35:25<@OrIdow6>And to my understanding the warrior, Docker images, etc. are basically just wrappers around a preconfigured version of this
09:35:52<@OrIdow6>But I don't know the details of those, and if you want specifics you should wait around for someone who does
09:35:58<neggles>OK, no problem
09:40:34<neggles>don't want to step on anyone's toes; I have a local 3/4-ish-complete copy of what I'm talking about, it's mostly a slightly cleaner build process (same steps, same sources, similar end result) just with bullseye underneath, theoretically arm64 support, and a few more things configured through environment variables (webui port, UID/GID)
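The env-var configuration neggles describes could look like this in an image entrypoint (hypothetical variable names; this is not the actual warrior image's script):

```shell
# Hypothetical entrypoint fragment: fall back to sane defaults when the
# env vars neggles mentions (webui port, UID/GID) are left unset.
# Variable names are invented for illustration.
WEBUI_PORT="${WEBUI_PORT:-8001}"
RUN_UID="${RUN_UID:-1000}"
RUN_GID="${RUN_GID:-1000}"
echo "webui=$WEBUI_PORT uid=$RUN_UID gid=$RUN_GID"
```

The `${VAR:-default}` expansion keeps the image working unchanged for people who never set the variables.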
09:50:33<@rewby>So the thing is: don't run custom builds of wget-at. It causes issues
09:51:06<@rewby>Mostly around compression and or file integrity
09:51:42<@rewby>And upgrading the base distro changes lib versions, which then causes the above
09:54:01<@rewby>As for targets, we don't generally accept them from just anyone who shows up randomly. Once data is on a target it is really hard to figure out what needs to be redone if that target disappears.
09:55:50<@rewby>Notably, we only accept targets in the form of bare metal or vms. We have provisioning playbooks for them
09:55:57<@rewby>Also, they destroy ssds
09:56:08<@rewby>And HDDs are not gonna keep up
09:56:43<@rewby>Also, 100T isn't much
09:56:57<@rewby>I have targets with that much sitting around as well
09:57:25<@rewby>I can look into reshuffling a few things
09:59:24<@rewby>Also, monoxane, do *NOT* use team names. That is forbidden. We will ban you if we discover this.
09:59:32<monoxane>oop okay
09:59:34<monoxane>will not
10:00:09<@rewby>We have had many issues with this before
10:02:10<@rewby>In the past, people have used team names and then one member's infra fucks up and we need then to stop. Inevitably that person is unreachable and the other members can't get to that specific bit of infra. We end up banning the whole thing because that's the most granular tool we have.
10:02:18<@rewby>This has happened multiple times.
10:02:31<@rewby>So we prohibit team names in general now
10:05:37<monoxane>yea that makes heaps of sense i dont know why i didnt think about it
10:06:18michaelblob (michaelblob) joins
10:10:11<@rewby>Yeah, each person's infra needs a unique uploader name
10:10:30<@rewby>If you wanna do TeamBlah-monoxane then by all means go for it
10:10:59<neggles>does it qualify as one person's infra if all the workers are being managed from a central point, and go idle if they can't talk to it?
10:11:12<@rewby>Uh. Unsure.
10:12:00<monoxane>we'll take that as a no then
10:12:04<monoxane>dont wanna antagonise
10:12:09<@rewby>Basically, each uploader name should be associated with the person who can run sudo poweroff
10:12:10<monoxane>may have come in a bit too hot with the ideas
10:12:19<monoxane>copy
10:12:40<neggles>the whole "we need any one of us to be able to hit the kill switch" thing did occur to ud
10:12:44<neggles>s/ud/us
10:12:46<@rewby>Even if you don't have the skills to fix the issue, you can at least shut the thing down, you feel?
10:13:06<@rewby>Yeah so, that "any one of us" idea has been tried before
10:13:11<@rewby>It never works out in practice
10:13:59<@rewby>So we want to be able to tracker-ban one control domain
10:14:32<BPCZ>rewby: would a target with 5PiB of flash and 100PiB of hdd be of much use?
10:14:56<monoxane>BPCZ isnt that just the IA :P
10:15:14<BPCZ>Though if a target going missing is an issue that might be an issue since that system is my testing ground :/
10:15:24<@rewby>BPCZ: Depends on the networking, how much abuse against cpu and flash you're willing to take and how long it's available for
10:15:36<@rewby>Yeah, no testing grounds
10:15:53<@rewby>Targets going missing without >24h notice is a Big Problem
10:15:54<monoxane>someone buy a VAST cluster already
10:16:04<BPCZ>VAST is dog shit
10:16:13<@rewby>Just give me bare metal tbh
10:16:19<@rewby>That usually works best
10:16:27<BPCZ>Understandable
10:16:41<@rewby>I have a whole Ansible system to provision and manage metal
10:17:23<@rewby>Not OSS because, like a lot of AT code, it was all written with -2 hours of notice/planning
10:17:50<BPCZ>:( but we could clean it up
10:17:52<@rewby>Specifically, it just hardcodes a ton of secrets in it because I had a deadline of a few hours
10:17:58<BPCZ>lol
10:18:21<@rewby>It's the secrets bit that's the issue
10:20:17<BPCZ>Wish I could contribute hardware but that’s a big nono, I can chuck ungodly amounts of compute and ephemeral storage around, but most OSS projects get annoyed when you show up, do 5x the work they’ve done in 3 years, then disappear
10:20:44<@rewby>Our problem is ephemeral is a big no for targets
10:20:47<@rewby>Workers, sure
10:20:59<@rewby>And i can scale targets up if need be.
10:21:19<@rewby>I've just been sick for the last 3 weeks and haven't been able to babysit them like I usually do
10:21:30<nepeat>oooooo ansible scripts
10:21:50<nepeat>i've been trying to research the backend infra and a lot of the stuff seems stale for that
10:22:30<BPCZ>Paperclips and chewing gum
10:22:31<@rewby>Targets aren't that complicated tbh, it's mostly OSS except for my provisioning code
10:22:39<@rewby>Tracker...
10:22:54<@rewby>Talk to Kaz. He's been on a journey to RE that thing
10:23:03<nepeat>heh
10:23:10<nepeat>is the current tracker code open sourced?
10:23:12<@rewby>Only F.usl really knows how that thing works.
10:23:25<@rewby>You assume all of it even has a source code repo
10:23:27<@rewby>Bold
10:23:31<nepeat>HAHA OH GOD
10:24:03<nepeat>my inner sre cries a little
10:24:07<BPCZ>>ruby
10:24:11<@rewby>Same
10:24:13<BPCZ>Off to a terrible start
10:24:14<nepeat>ruby is cool!
10:24:18<@rewby>Oh trackerproxy isn't ruby
10:24:27<@rewby>It's all redis and nix+lua
10:24:33<@rewby>*nginx
10:24:38<@rewby>Damn autocorrect
10:25:10<BPCZ>I wish there was better docs on the infrastructure, seems neat
10:25:24<nepeat>+1
10:25:40<@rewby>Same here
10:25:41<nepeat>i'd love to make some changes that would improve my quality of life with my infra
10:25:42<monoxane>+1
10:25:59<monoxane>i’ll just make my own with blackjack and hookers and an ia s3 key /s
10:26:01<nepeat>hell yeah prom exporters and structlogs
10:26:05<monoxane>too much work
10:26:17<@rewby>Using your own S3 key wouldn't work btw
10:26:35<monoxane>yea i know
10:26:39<nepeat>spicy
10:26:42<@rewby>You don't have access to the magical collections where we drop things.
10:26:47<BPCZ>IA is using S3
10:26:49<monoxane>it only lets you upload via the site doesn’t it
10:26:51<BPCZ>Now?
10:26:57<BPCZ>Sadage
10:26:58<nepeat>s3 compatible, not actual s3
10:27:01<monoxane>the web ui upload from ia is an s3 thing
10:27:03<@rewby>It's an S3 "compatible" endpoint
10:27:06<@rewby>We call it s3
10:27:10<monoxane>and yea not s3 from amazon, just the protocol
10:27:12<BPCZ>Thank god ok
10:27:13<nepeat>everyone implements s3 compatible apis
10:27:15<neggles>S3 =/= AWS S3
10:27:17<@rewby>It's cursed
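For reference, the shape of an upload against IA's S3-compatible endpoint looks like this (sketch: item identifier and filename are placeholders, and as rewby notes below, a personal key only reaches your own items, not the restricted AT collections):

```shell
# Shape of an upload to IA's S3-*compatible* endpoint (not AWS S3).
# Item identifier and filename are placeholders; keys come from your
# own IA account and won't reach the special AT collections.
ENDPOINT="https://s3.us.archive.org"
ITEM="example-item"
FILE="example.warc.gz"
CMD="curl --location --upload-file $FILE --header 'authorization: LOW ACCESSKEY:SECRET' $ENDPOINT/$ITEM/$FILE"
echo "$CMD"
```

The `LOW key:secret` authorization header is the IA-specific part that distinguishes it from real AWS signing.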
10:27:44<BPCZ>I don’t even know if IA has multiple tape libraries yet
10:27:56<@rewby>It's all hdds
10:28:03<@rewby>Afaik
10:28:15<nepeat>i've heard they're running ceph these days?
10:28:19<monoxane>yea i think it’s hdd with a little bit of flash in front for web stuff
10:28:35<BPCZ>Probably too much effort to keep a library alive, those bastards always have issues
10:28:36<monoxane>there’s a page on the site talking about petabox
10:28:54<monoxane>somewhere else talks about s3 on top of it too
10:29:07<monoxane>which is where i got the idea to just ask for a key from :P
10:29:15<@rewby>Also, re SRE cries. You really don't wanna know the tracker. Some of it is Debian wheezy
10:29:15<monoxane>they’d absolutely say no though
10:29:16<BPCZ>If it’s Ceph then S3 is just gratis
10:29:25<monoxane>tell me to piss right off and never come back
10:29:42<@rewby>You can get keys piss easy
10:29:57<@rewby>Make an account on the IA and go to your profile
10:30:03<monoxane>not long lasting ones though
10:30:04<@rewby>It'll give them
10:30:12<monoxane>oh interesting
10:30:24<@rewby>They're just account creds iirc
10:30:44<@rewby>The thing is, we have collections with special flags that make the wbm index them
10:30:44<monoxane>yea and they probably revoke them if i uploaded at 10gbps
10:30:52<monoxane>yeap
10:31:02<@rewby>Randos cant just upload warcs and have them show up in the wbm
10:31:49<nepeat>reliability and automation would be great things to look at
10:31:56<neggles>most of what struck me as I was digging through code piecing together how this stuff works was, idk, disappointment? but the existential kind
10:31:59<nepeat>not pure brute force...
10:32:52jtagcat quits [Client Quit]
10:33:01jtagcat (jtagcat) joins
10:33:03<@rewby>But our collections are special
10:33:03<@rewby>And have restricted uploader access
10:33:03<@rewby>But all of the IA side is managed by ark.iver
10:33:03<@rewby>I get a set of S3 creds and a collection to shove stuff into
10:33:06<@rewby>If you see us discuss vars, that's our slang for the info I need from him to interface with IA
10:33:32<@rewby>Oh trust me, I wanna replace so much of it
10:33:41<nepeat>kinda curious, has something like vault been looked at for keeping the secrets outside of env files?
10:33:43<@rewby>But there's only so many hours in a day and I'm overworked as is
10:34:04<neggles>IA is important, AT is important, but it seems like there's... can't find the right way to say it but "oh come on, companies spend tens of millions on <next stupid internet fad> but *none* of them feel like giving any real resources to something that actually does some good?"
10:34:04<@rewby>Looked at? Sure. But time is limited for most of us.
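On nepeat's Vault question: even without Vault, hardcoded secrets can be pulled out of a playbook with a plain environment lookup (illustrative task and variable names, not the actual provisioning code):

```yaml
# Illustrative Ansible task: inject target credentials from the
# environment (or swap the lookup for a Vault plugin) instead of
# hardcoding them in the playbook. Names are invented.
- name: Template target config with injected secrets
  ansible.builtin.template:
    src: target.conf.j2
    dest: /etc/target.conf
    mode: "0600"
  vars:
    s3_access_key: "{{ lookup('env', 'TARGET_S3_ACCESS_KEY') }}"
    s3_secret_key: "{{ lookup('env', 'TARGET_S3_SECRET_KEY') }}"
```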
10:34:18<@rewby>Note that we have 0 budget
10:34:24<@rewby>We fund this ourselves
10:34:32<neggles>yeah, absolutely not having a go at anyone here
10:34:53<@rewby>Target costs are split between me and like 4-5 other people who all pay for the hardware they donate
10:35:24<@rewby>But importantly, I have names, phone numbers, addresses etc
10:35:42<@rewby>We know where to send goons if someone fucks off
10:36:27<neggles>I guess i'm just kinda surprised none of the tech giants have decided to get themselves some positive press by throwing a (for them) miniscule amount of funding and resources at this
10:36:41<@rewby>We don't have an org
10:36:47<neggles>surprised isn't the right word, disappointed
10:36:48<@rewby>Which makes that hard
10:36:49<nepeat>some of us work for the tech giants ;)
10:37:18<BPCZ>Some of us would prefer dirty money not get involved
10:37:33<nepeat>i wouldn't say the money's dirty
10:37:52<@rewby>Money would be nice to finance proper target hw.
10:37:58<@rewby>Or at least pay hosting bills
10:37:59<nepeat>it's what makes it possible for people like me to spin up a lot of instances for the warrior IPs
10:38:05<neggles>all money is dirty depending on how you look at it, but that's a whole other question, and if it doesn't come with any strings attached other than "tell people we did this" that's fine
10:39:04<@rewby>From archiveteam.org: Archive Team is a loose collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage.
10:39:08<@rewby>This makes money hard
10:39:13<neggles>(that sounded wrong, s/that's fine/i wouldn't have a problem with it at least/)
10:40:19<neggles>nepeat: the org whose resources we are making use of do have a /22 or so available
10:40:23<BPCZ>I’m kind of surprised IA can’t provided a reasonable set of targets
10:40:37<nepeat>BPCZ: this isn't the IA
10:40:45<@rewby>We're not the IA
10:41:09<@rewby>They graciously deal with the storage and retrieval parts of web archiving for us
10:41:17<neggles>(and they don't get nearly enough funding either, hence the relatively low amount of ingress they can handle)
10:41:20<@rewby>Which is more than we could ask for anyways
10:43:22<neggles>yeah
10:43:49<nepeat>wondering, how can i help with some of the infra and client code?
10:44:06<nepeat>me putting out my thoughts is one thing for the overburdened team but i like to get my hands dirty and implement said thoughts
10:44:08<neggles>well, to say what we probably should've opened wit- heh nepeat that's p much what I was about to say
10:44:49<neggles>monoxane builds k8s-based application orchestration stacks for a living
10:45:08<@rewby>I have a decently interesting design for new target software. But not had the time to implement it.
10:46:19<@rewby>Also, F.usl has been working on a new tracker for years, might need help
10:47:32<monoxane>yea i’m kubelord, 80% of my job is building kube applications to orchestrate hundreds of gbit of traffic and the orchestration for the orchestrators to make it all manageable from a unified web interface
10:49:32<@rewby>I personally don't trust kube for targets
10:49:47<@rewby>This data is very persistent and not redundant
10:50:08<monoxane>replacing the warrior with a kubernetes controller that runs the direct job containers is gonna be a 3 day job at most, will look at it over christmas
10:50:21<monoxane>oh yea for targets it’s absolutely not the right tool
10:50:34@rewby is the target person
10:50:53<nepeat>containerized targets would be very fucky, storage would have to be separated to force that to work...
10:51:04<nepeat>pretty much creating target2.0 if you are doing that
10:51:05<neggles>that's not particularly difficult if you're running on baremetal
10:51:15<neggles>but it's probably not worth the effort
10:51:26<@rewby>I have plans for new target software
10:51:29<nepeat>agreed, given targets aren't disposable
10:51:30<monoxane>but for collection at scale? kube, a 100gbe host, and a /24 will give up to 4000 concurrent downloads across an entire public ip range in seconds
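The controller monoxane describes would ultimately stamp out manifests along these lines (hypothetical: name, image, and args are illustrative, matching the bare-container invocation discussed earlier in the log):

```yaml
# Hypothetical Deployment a custom controller might manage: N replicas
# of a bare project container, scaled horizontally instead of via the
# warrior VM. All names and numbers are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: at-example-grab
spec:
  replicas: 200                 # 200 pods x 20 concurrent = 4000 downloads
  selector:
    matchLabels: {app: at-example-grab}
  template:
    metadata:
      labels: {app: at-example-grab}
    spec:
      containers:
        - name: grab
          image: atdr.meo.ws/archiveteam/example-grab
          args: ["--concurrent", "20", "yournick"]
```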
10:51:40<@rewby>To: not destroy ssds as much and go faster
10:51:50<monoxane>ramdisk time :P
10:51:55<@rewby>NO
10:52:00<@rewby>Data loss
10:52:04<nepeat>:openeyescryinglaughing:
10:52:19<@rewby>Again, if we lose uploaded data, it's gone
10:52:25<monoxane>yea ik
10:52:30<@rewby>And we have no good way of figuring out what was lost
10:52:37<monoxane>1pb of zeusrams when
10:53:05<monoxane>actually a bluefield2 and some nvmeof would make a wonderful target
10:53:05<neggles>"if we lose data after it hits the target we can't tell what we lost" seems like a problem worth solving
10:53:15<@rewby>Also, one of my servers is under 1.5 years old. Its ssds have 3.5PiB written
10:53:26<@rewby>neggles: again, i have plans
10:53:30<BPCZ>Hahah I happen to know if a project trying to do multi tbps persisted storage via kube
10:53:33<@rewby>I just need to write it down
10:53:36<BPCZ>It’s going poorly
10:53:38<monoxane>also that, maybe it’s worth adding another step to the tracker for “egresses to ia”
10:53:40<neggles>oh yeah no i'm not suggesting it's easy
10:54:01<neggles>cause doing what mono just suggested doubles tracker load (and it sounds like the tracker is a bit of a black box at the moment?)
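The "egresses to ia" step monoxane floats amounts to tracking each file's state until IA confirms it; a minimal sketch of the idea (hypothetical design, not how the tracker or megawarc pipeline works today):

```python
# Sketch: a per-target manifest so "what was on the lost disk?" is
# answerable. Hypothetical design, not the current AT pipeline.
import hashlib
import json

manifest = {}  # filename -> {"sha256": ..., "state": ...}

def record_received(name: str, data: bytes) -> None:
    """Log a file the moment it lands on the target."""
    manifest[name] = {
        "sha256": hashlib.sha256(data).hexdigest(),
        "state": "on-target",
    }

def record_uploaded(name: str) -> None:
    """Flip the state once IA's S3 endpoint acknowledges the upload."""
    manifest[name]["state"] = "egressed-to-ia"

def lost_items() -> list[str]:
    """Everything not yet confirmed at IA: redo these if the target dies."""
    return [n for n, m in manifest.items() if m["state"] != "egressed-to-ia"]

record_received("a.warc.gz", b"warc-a")
record_received("b.warc.gz", b"warc-b")
record_uploaded("a.warc.gz")
print(json.dumps(lost_items()))  # -> ["b.warc.gz"]
```

The cost neggles points out is real: every state flip is extra tracker (or tracker-adjacent) load.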
10:55:28<schwarzkatz|m>are there even any good news regarding that site lately
10:55:28<schwarzkatz|m>why is it so awfully quiet here currently, where is everybody :c
10:56:19<@rewby>schwarzkatz|m: It's not quiet?
10:56:25<joepie91|m>that way you optimize for scraping the high-result-count ones first
10:56:38<joepie91|m>I believe that this is part of Google's n-gram dataset somewhere
10:57:10<joepie91|m>hm, I thought there was a letter dataset also
10:57:11<joepie91|m>(which afaik is used in google's language detection thingem)
10:57:50<madpro|m><rewby> "Also, F.usl has been working..." <- 🥲
11:00:40<BPCZ>monoxane: how does one become a kubelord
11:01:54<monoxane>a lot of "wtf how the fuck does that work" and reading golang code
11:02:08<neggles>if my own attempts are anything to go by, the first step involves creating & recreating your cluster 27 times in 3 different configurations before you find one that doesn't have a showstopping problem that rears its head after you're 3/4 done
11:02:32<neggles>(assuming you don't want to pay <cloud provider> half a kidney)
11:03:00<monoxane>lmao also that
11:03:03<@rewby>That tracks with my experience
11:03:14<monoxane>it took me 8 tries to make a kube cluster, now i can do it in 10 min from bare os
11:03:19<neggles>oh the other option is to pay red hat $texas for openshift
11:03:28<nepeat>boring
11:03:52<neggles>or go dig up all the OSS components of openshift and do it yourself
11:05:24<BPCZ>Oh ok so I’m most of the way there then. I write go for work and write kube oci providers and modify core kube crap to pass in hardware that’s not supposed to be passed in just yet
11:06:49<BPCZ>Just need to get to the standing up a cluster part … most of the time I barely figure out a process once and just have an ansible playbook for next time
11:07:38<neggles>having spent the better part of this year attempting to stand up a cluster that doesn't have some incredibly stupid limitation that makes me throw my hands up in defeat and forget about it for a month
11:07:43<neggles>good luck >.>
11:07:47<nepeat>this is overcomplicating the overcomplicated setup
11:08:54<BPCZ>neggles: I mean all clusters have limitations. I work in distributed systems and clusters professionally. Kube just isn’t used heavily for the big stuff
11:08:54<monoxane>BPCZ if you really wanna get standing up clusters down, do kubernetes-the-hard-way, like 4 times over, and you will know everything about the internals and why things are like they are
11:09:05<monoxane>https://github.com/kelseyhightower/kubernetes-the-hard-way
11:09:11<BPCZ>Thanks!
11:09:14<neggles>the problem with k8s related stuff, from where i'm standing anyway, is it's all focused on "too big" or "too small"
11:09:19<nepeat>keep it simple. for my AT stuff, i got nomad (containers) + vault (mtls certs) + loki (logs!)
11:09:22<monoxane>(doesnt have to be gcloud, its just what they use as demo env)
11:09:30<neggles>there are a lot of ways to spin it up on single hosts that work quite well, are very straightforward, and behave
11:09:52<neggles>and a lot of ways to spin it up on <cloud provider> that work very well, are easy to manage, and cost an unpredictably-large fortune
11:09:52<monoxane>and yea, 1 node: easy, 2 to 6: incredibly painful, 6 to 1000: easy af
11:10:30<BPCZ>Did kube ever grow network topology knowledge? I recall that being a sticking point a while back
11:10:36<neggles>still a big problem.
11:10:42<BPCZ>Figures
11:10:47<neggles>there are several potential solutions, no clear winner
11:10:53<neggles>the frontrunner seems to be cilium
11:10:59<monoxane>its a problem but its got a whole lot better now, you can do l3 super easy with stuff like cilium or kube-router that dont rely on internal tunnels between nodes
11:11:26<monoxane>big thing about cilium is it does ebpf offloading so all the inter-pod stuff is done in the kernel and offloaded to the nic, instead of in userspace like the older CNIs
11:11:50<neggles>and you can handle rerouting traffic to the 'correct' node without overwriting the source address
11:12:15<monoxane>and also yknow just use bird to advertise everything between the nodes over bgp instead of cry when the vxlan is broken for no reason
11:12:27<monoxane>looking at you flannel
11:12:31<nepeat>+1 to using bgp lol
11:12:40<neggles>tl;dr it's getting a lot better, rapidly, but it's still not there yet
11:12:50<neggles>bit of an xkcd competing standards problem
11:12:52<nepeat>i just have wireguard tunnels and bgp to route my throwaway networks when i spin it up
11:13:01<BPCZ>Yeah I don’t trust kube for the workloads, and neither does google. Iirc they use nomad for some stuff, but the devs I’ve talked to over there say everything falls over when you get into the high hundred thousand messages a second range with nodes
11:13:05<nepeat>been looking at netmaker and got it rolled out for this iteration of my cluster
11:13:11<monoxane>i am currently working on standing up a cluster with 3 nodes in 3 locations connected via ipsec tunnels + bgp + kube-router for shits n gigs
11:13:14<neggles>google use borg, which is not k8s, but is not not k8s
11:13:35<BPCZ>Specific workload, they don’t use borg for it
11:13:41<nepeat>kinda curious, any of you got dashboards for the AT stuff yet?
11:13:43<monoxane>the best one to look at for implementaion and scale imo is spotify
11:13:44<nepeat>https://catgirl.host/i/c6s8s.png
11:13:50<BPCZ>They use a few thousand node nomad cluster
11:14:33<schwarzkatz|m>rewby: what do you mean, quiet?
11:14:49<monoxane>they run 98% of workloads in 13 globally distributed clusters with the capability to hard failover any cluster's traffic to any other site in under 5 seconds, they manage it all with an internal tool they're making open source called BackStage
11:15:19<BPCZ>Sounds cool
11:16:10<monoxane>my work's clusters are a fair bit smaller and a completely different ballpark, we just have 6 nodes running ~110 pods total but the application stack is designed to be entirely fault tolerant internally so any service or any node goes down and we're still good
11:16:24<monoxane>most of the clusters are completely offline most of their life too
11:17:16<neggles>there was a big sportsball event you might've heard about recently; i will not elaborate further
11:17:30<monoxane>yea and another, and another :P
11:17:39<nepeat>i can't say anything about what i do but it's reinforced some good ideas for my personal setups, this included
11:17:47<nepeat>oh god not the world cup
11:18:02<monoxane>kube runs the video routing for the superbowl and 80% of global live sports tv
11:18:09<BPCZ>I wish companies would actually rework applications they chuck into kube. I had to de-kube something recently because the company wrapped a stateful system into kube and washed their hands like that would be fine and kube would recover things better than other options
11:18:27<monoxane>oh yea ours is kube from the ground up, you cannot forklift existing workloads into kube and expect it to go well
11:20:52<BPCZ>nepeat: you can just say you professionally scan everyone’s butthole while they sleep it’s ok we get it. Companies just really like to know what our bowels are doing
11:21:21<monoxane>lmao
11:21:26<nepeat>lmao
11:21:29<monoxane>but which type of scan
11:21:42<monoxane>optical or something more exotic like ground penetrating radar through the roof?
11:21:53<nepeat>i just work for a place that inspires creativity and brings joy...
11:22:02<monoxane>narrator: it does not
11:22:46<BPCZ>All the scans, WiFi, roomba radar, brain wave from your sexual partners. If it could detect butthole the kube workload nepeat works on tries to collect it
11:23:58<BPCZ>Mousewitz?
11:24:04<@rewby>schwarzkatz|m: In response to: 11:54 <schwarzkatz|m> why is it so awfully quiet here currently, where is everybody :c
11:27:46<BPCZ>This whole conversation reminds me I need to be planning my next job and figuring out where to live next.
11:28:10<BPCZ>SF or Seattle seem to be the two big options
11:30:49<neggles>nepeat: so bytedance :v
11:30:50le0n quits [Quit: see you later, alligator]
11:32:52le0n (le0n) joins
11:35:16<@rewby>Anyone happen to know the deadline for pixiv?
11:35:29<schwarzkatz|m>What is happening with this dumb matrix thing, sorry for posting duplicate messages
11:36:07<schwarzkatz|m>Deadline was 2022-12-15, that’s when their TOS changed
11:36:09<@rewby>schwarzkatz|m: Yeah your matrix stuff is bork. I tried checking my matrix alt and it's delayed like mad.
11:36:14<@rewby>*Ah*
11:36:18<@rewby>Right okay
11:36:21<@rewby>I'm gonna move some stuff around
11:37:04<schwarzkatz|m>I think it’s only happening in the mobile app though, I have the same problem with discord sometimes
11:40:07<@rewby>monoxane: You were complaining about target limits? Right?
11:42:24<@rewby>Gas gas gas https://s3.services.ams.aperture-laboratories.science/rewby/public/1760112e-f24d-442c-9af6-0dca05f9d9ff/1671622931.615019.png
11:43:05<nepeat>oh neat
11:43:07<neggles>rewby: wound some more capacity in?
11:43:14<neggles>lets see how it looks this side...
11:43:48<@rewby>It's provisioning
11:43:52<@rewby>Just hit it as hard as you can
11:43:55<@rewby>I'll scale it up to meet
11:44:07<@rewby>I've hit the *deploy hetzner cloud* buttons
11:44:43<neggles>"just hit it as hard as you can" <- you may live to regret that
11:45:12<@rewby>Trust me, I've seen worse
11:45:39<monoxane>rewby we have 1.4tbps online right now lol
11:45:48<@rewby>And I have backpressure
11:46:00<@rewby>The system doesn't accept more data than it can take
11:46:13<schwarzkatz|m>Argh I hate mobile apps
11:46:17<@rewby>If you hit a target too hard, it'll just shut off inbound and process what it has on disk
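The behaviour rewby describes is classic watermark-based backpressure. A minimal sketch of the idea (thresholds, names, and the hysteresis pattern are all assumptions for illustration, not the real target code):

```python
# Toy model of a target that sheds inbound load when its disk fills,
# then re-opens once the offload to IA has drained the backlog.
HIGH_WATERMARK = 0.90  # assumed: stop accepting uploads above 90% disk usage
LOW_WATERMARK = 0.70   # assumed: resume once usage drains below 70%


class Target:
    def __init__(self):
        self.accepting = True

    def tick(self, usage):
        """Update inbound state from the current disk-usage fraction.

        Hysteresis between the two watermarks stops the target from
        flapping open/closed right at a single threshold: workers back
        off at the high mark and only return once real room exists.
        """
        if self.accepting and usage >= HIGH_WATERMARK:
            self.accepting = False   # shut off inbound, process what's on disk
        elif not self.accepting and usage <= LOW_WATERMARK:
            self.accepting = True    # backlog drained, accept workers again
        return self.accepting
```

With this shape, a target that briefly dips to 85% usage after closing stays closed until it actually drains, which matches "fill up, back off, catch up" as described above.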
11:46:33<@rewby>I can easily scale this into 16-20 gbps
11:46:33<neggles>more the "scale up to meet" heh
11:46:44<@rewby>Yea I guess
11:46:50<@rewby>I can scale as high as IA has inbound on s3
11:46:56<neggles>looks like the source is now the limiter
11:48:07<@rewby>Here's your reminder: There's several projects with deadlines in the next 10 days
11:48:18<@rewby>pixiv, uploadir, vlive and buzzvideo are the main ones I know of
11:48:24<@rewby>So throw spare capacity at those
11:48:59<BPCZ>Can’t believe I went home for Christmas and can’t even do this during the holiday
11:49:51<neggles>sounds like we should spin up some more workers pointed at those other projects then
11:50:05<neggles>~3000 of them seems to be about all pixiv can handle
11:50:32<nepeat>oh man, i see the pixiv spike
11:51:23<nepeat>https://snapshots.raintank.io/dashboard/snapshot/FnHLJL9J0Stas95ZVy4565XdQLv7rJxs
11:51:48<@rewby>I'm not done scaling everything
11:52:03<monoxane>ill have you know we're currently doing 20gbps
11:52:15<@rewby>I'm well aware yes
11:52:17<neggles>pixiv has definitely run out of outbound, can't pull more than 300mbit or so from it on top of what we're hitting
11:52:17<@rewby>I have metrics
11:52:21<neggles>so
11:52:28<neggles>time to point some at the others?
11:52:39<@rewby>Yes
12:01:18apache2 quits [Remote host closed the connection]
12:01:18fishingforsoup quits [Remote host closed the connection]
12:01:18user_ quits [Remote host closed the connection]
12:01:19superkuh__ quits [Remote host closed the connection]
12:01:23<monoxane>uploadir has 0 tasks available, so id say we should focus on the others
12:01:25<@rewby>I have three separate ansibles going on trying to move stuff around
12:01:25<neggles>workin' on it
12:01:25apache2_ joins
12:01:32user_ (gazorpazorp) joins
12:01:35superkuh__ joins
12:01:35fishingforsoup joins
12:02:52<@rewby>211k out
12:02:54<@rewby>Hm
12:02:57<@rewby>Lemme flush that
12:05:37<@rewby>If you refresh you'll see uploadir tasks
12:06:29<monoxane>cool got em
12:08:27<neggles>just provisioning some more VMs :)
12:22:02<@rewby>Hm. I think I'm hitting an IA bottleneck
12:22:05<@rewby>Lemme investigate
12:24:57<monoxane>IA's only doing 15gbps inbound https://monitor.archive.org/weathermap/weathermap.html
12:25:08<monoxane>its likely its saturated s3 ingress though
12:25:37<@rewby>There's 2 lbs
12:25:41<@rewby>I think 10g each?
12:25:46<monoxane>yeap
12:25:48<@rewby>I need permission to override and use the other one though
12:25:59<@rewby>I have asked, but can't do much until I hear back
12:26:25<monoxane>valid
12:26:33<monoxane>bit of fun eh? :P
12:27:43<@rewby>Just let the targets fill up and workers back off, when I hear back I can get the throughput up
12:28:00<@rewby>My inbound on targets was like 16gbps
12:28:11<@rewby>Right up until disks started filling up
12:28:54<monoxane>we were doing 20.02gbps at peak
12:29:00<monoxane>from the core network
12:34:02<@rewby>On the upside, pixiv and vlive are looking to be done in ~24 hours according to my numbers
12:34:12<monoxane>sweet
12:34:50<@rewby>vlive in <6 hours
12:34:56<neggles>vlive seems to have the most source capacity
12:35:11<@rewby>We also don't have too many items there
12:35:35<@rewby>Either way, when arkiver wakes up he's gonna have a field day finding more items and things to archive
12:36:34<neggles>there are six more spare boxes - much smaller, "only" 8 core, but with 10G links - i've just been handed keys to
12:36:43<neggles>used to be minecraft servers
12:37:35<@rewby>I hit a spicy 18.6gbps inbound just a few ago
12:37:57<neggles>is uploadir stalled/already down?
12:38:12<neggles>seeing basically zero action out of those
12:40:42<@rewby>Shouldn't be. But IIRC there's a speed limit on that
12:43:43<neggles>ah
12:44:05treora quits [Remote host closed the connection]
12:44:06treora joins
12:46:58<monoxane>IA just cracked 20gbps inbound
12:48:02Arcorann_ quits [Ping timeout: 264 seconds]
12:49:38<Doomaholic>Holy crap
12:49:45<@rewby>Over half of that is us
12:49:58<@rewby>tbh, we've done better
12:50:12<neggles>um, maybe silly question but where do the -grab containers store the files between pull/push?
12:50:16<neggles>internally
12:50:26<@rewby>In /grab
12:50:32<neggles>not in a subdir?
12:50:36<@rewby>I don't remember
12:50:39<neggles>fairo
12:50:45<neggles>i guess i opened one that hasn't started yet
12:51:01<@rewby>Data storage is synchronous
12:51:13<@rewby>So it always does download -> upload -> download -> upload
12:51:23<@rewby>So if it's waiting for work, it won't have any data stored
12:51:26<neggles>ah
12:51:39<neggles>some of these video files are big enough for kube to go "hey, you didn't ask for disk" and boot the pods
12:52:00<neggles>ah all under /grab/data excellent
12:52:12<monoxane>yea im currently looking at a cluster that's half evicted pods because they used 15gb of ephemeral storage 😆
12:52:34<@rewby>Yeah, the video files are big
12:54:34<neggles>ok it's happy now
12:55:38<@rewby>For people who were asking to donate target hw: This is what we do to disks: Data Units Written: 6,793,004,499 [3.47 PB]
12:55:41<@rewby>That's in 1.5 years
12:55:47<madpro|m>I mean, Archive Team cannot be the only people making software for this nowadays. Can it?
12:55:47<madpro|m>There are tons of companies that do crawling for a business, surely they have open-sourced some more robust trackers by now?
12:55:47<madpro|m>Not that I know of; I have been searching myself for the past 2 years or so.
12:56:13<neggles>anyone with a functional wide-scale web crawler / ripper is not going to hand that out for free
12:56:24<neggles>that's a surefire way to stop it working
12:57:17<madpro|m>I cannot say I'm nearly as skeptical, seeing other projects like Hadoop in distributed computing
12:57:30hackbug (hackbug) joins
12:57:52<neggles>hadoop is not the expensive/proprietary/"magic" part of a hadoop-based workflow though
12:58:13<neggles>it's worthless without the rules and flows and transforms etc
12:58:37<neggles>while the value of a crawler (on a commercial level anyway) comes from being able to skip around things trying to block crawlers
12:58:42<@rewby>The thing with these kind of trackers: They are tied closely to your workflow.
12:58:47<@rewby>They are very specific
12:59:13<neggles>same goes for hadoop setup
12:59:18<@rewby>If you tried to make an end-all-be-all tracker you'd end up with something as complex as kube
12:59:22<neggles>or SAP
12:59:27<@rewby>With a much smaller market
12:59:28<madpro|m>Well there you go
12:59:51<@rewby>So instead people make trackers that are good enough for their workflow
12:59:59<neggles>ERP systems are the perfect example; they do everything for everyone, but they do it by having 27,000 different modules that can be wired together in practically infinite ways
13:00:00<@rewby>But then you end up being very very tied to your company
13:02:21<monoxane>i think pixiv might need a purge too, there's 1m out but it never goes below 99.95k and surely there's not a million jobs being processed rn lmao
13:02:37<@rewby>I'll give it a look in a sec
13:02:44<madpro|m>Better close this tangent, before discussion shifts back to pixiv.
13:02:44<@rewby>I'm actually disappointed in my upload rate
13:02:51<@rewby>I've done 25G to them before
13:03:10<@rewby>I'm not too worried about pixiv
13:04:00<@rewby>Tldr: It doesn't recycle jobs from the out-list until todo is empty
13:04:00<@rewby>And there's 8M in todo too
13:04:04<monoxane>im not either its just a bit of a high number
13:04:07<monoxane>ah okay i didnt know that
13:04:11<@rewby>Also, monoxane, the bare -grab containers will do concurrency up to 20
13:04:20<monoxane>yes we know
13:04:23<@rewby>kk
13:04:33<madpro|m>For now, in terms of tracker development we should look to making do with what we have. The IA wiki and GitHub have a long way to go in terms of documentation.
13:04:42<madpro|m>Exploiting our own resources and all that.
13:04:43<monoxane>every single one of the 3000 containers currently running across 20 nodes are on max concurrency
13:04:56<@rewby>Ah okay
13:05:08<monoxane>we are entirely restricted by IA's ingest right now
13:06:14<neggles>"haha kubernetes go brrr"
13:07:26<@rewby>I've redirected vlive to a pile of spinning rust
13:07:32<@rewby>To sink your data into
13:07:48qwertyasdfuiopghjkl quits [Remote host closed the connection]
13:08:38<Doomaholic>Bless
13:08:46<monoxane>excellent
13:09:18<neggles>ooooh one of these is a 5900X
13:10:10<neggles>https://lounge.neggl.es/uploads/e1333322a07ac379/bueno.png
13:10:45<@rewby>mmmm data https://s3.services.ams.aperture-laboratories.science/rewby/public/09c34280-fa59-4879-aaad-bd50f9a499e3/1671628237.819004.png
13:11:57<Doomaholic>Delicious
13:13:45DavidSaguna joins
13:41:54<monoxane>i think the next bottleneck might actually be rsync connections on the targets
13:42:08<monoxane>like 70% of my pods are sitting here idle waiting to retry dumping
13:42:20<monoxane>it's error 400, not -1, so it's not the disk-full cutoff
13:44:40qwertyasdfuiopghjkl joins
13:56:05sonick (sonick) joins
14:29:55G4te_Keep3r349 joins
14:32:11sec^nd quits [Ping timeout: 245 seconds]
14:38:18<@arkiver>thanks for the ping OrIdow6
14:38:23<@arkiver>still reading some backlog
14:38:51<@arkiver>monoxane: are there several people running under a single 'team name'?
14:39:20<@arkiver>neggles: feel free to make a PR on the warrior docker image
14:42:35sec^nd (second) joins
14:59:43katocala quits [Remote host closed the connection]
15:06:44<@arkiver>monoxane: if you have a ton of IPs available - telegram could definitely benefit from that, we got quite some backlog to work through
15:07:07<@arkiver>on uploadir - roughly half of the items was 404
15:16:25MrSolid joins
15:16:35<MrSolid>hi guys
15:16:43<MrSolid>can you please help me archive the website ac-web.org
15:17:06MrSolid leaves
15:17:08MrSolid joins
15:17:55<@arkiver>MrSolid: what is the reason?
15:18:26<@arkiver>ac-web.org is not loading for me
15:18:34<MrSolid>site's been down for months since being sold to new owners; trying to migrate to a new community so the information isn't lost
15:18:47<MrSolid>https://ac-web.org/index.php
15:18:49<@arkiver>well if the site is down, we can't archive it
15:18:53<MrSolid>it's up for me, that's odd
15:19:05<MrSolid>maybe someone's crawling it right now haha
15:19:25<@arkiver>in any case, sounds like a site we should archive yes
15:19:35programmerq quits [Remote host closed the connection]
15:19:44<MrSolid>thank you arkiver
15:20:27<@arkiver>loading very slowly now
15:21:44<MrSolid>i just hope the new owner doesn't shut the site down again before it's archived
15:33:15Island joins
15:41:34MrSolid quits [Remote host closed the connection]
16:08:52DLoader quits [Remote host closed the connection]
16:11:38fishingforsoup_ quits [Remote host closed the connection]
16:11:38DLoader joins
16:11:51fishingforsoup_ joins
16:15:03<monoxane>arkiver not any more, for a little bit yesterday there were a couple people using one name but after we got a harsh no we split and each person controlling a set of nodes is using a different name
16:15:23<monoxane>will switch some of them to telegram in the morning
16:23:29Megame (Megame) joins
16:23:44<@arkiver>monoxane: sounds good - separate names is definitely better for keeping track of who's doing what
16:24:03<@arkiver>and yeah as rewby said, feel free to prepend something to the names to show people as being part of the same group
16:25:22<monoxane>yea will probably do that at some point, the team name thing was more because some people don’t want to be identified so those people are just using team name suffixed with country, identifiable enough for someone to tell ‘em to stop if it’s broken but not to be worked out from the leaderboard
16:26:03<@arkiver>right yeah
16:26:12<@arkiver>so what is this group of people?
16:27:26<monoxane>friends, some of which work at a tier 1 global isp and have some resources at their disposal
16:27:38<@arkiver>pretty awesome
16:28:20<monoxane>the 20gbps we were pulling today didn’t even make a single pixel increase in their usage charts (outside of the routes going to targets and the sources)
16:29:04<@arkiver>watch out that if you were to run a project like the URLs project (outlinks from various sources), it may contain any URL you can find online
16:29:12<monoxane>it was approved because it’s just a fun little load test on their links 😆
16:29:16<@arkiver>though I'd say it is one of our most valuable projects
16:29:28<@arkiver>telegram is likely very safe to run
16:29:32<@arkiver>hah :) sounds good
16:29:42<monoxane>we will likely only run at full tilt when there’s an “oh fuck” event where we have 24 hours to pull an entire site
16:29:58<@arkiver>alright
16:30:01<monoxane>and just leave a couple nodes running on a range of projects
16:30:07<@arkiver>yeah we have some end of year shutdown going on at the moment
16:30:11<monoxane>>just a couple nodes
16:30:36<monoxane>i say this as if they’re not 100gbe directly attached to an isp core
16:30:49<@arkiver>for the current short term projects, bandwidth is the bottleneck somewhere along the way
16:30:58<@arkiver>but for the long term projects, IPs are the bottleneck
16:31:16<monoxane>yea, we have some potential solutions to the ip bottleneck
16:31:38<monoxane>one of which involves giving a single node an entire /24 😅
16:32:41<@arkiver>"couple nodes" with each a different /24?
16:32:47<@arkiver>that'd be pretty awesome :)
16:35:52<monoxane>the only problem with that is burning /24s is less justifiable than burning 20gbps out of a 20+tbit network
16:38:00<@arkiver>yeah, which is likely also the reason why our long term projects have IPs as the bottleneck rather than bandwidth
16:38:05<@rewby>I think I'm currently still burning two /24s on telegram
16:38:26<@rewby>Or rather, I'm burning someone else's /24s
16:38:38<@arkiver>rewby: and it is really making a difference!
16:39:56<@arkiver>we're slowly working through the huge telegram backlog
16:40:17<@arkiver>note though that we currently cannot keep up with newly discovered group posts (we can only keep up with newly discovered channel posts)
16:40:59<@arkiver>i'm stashing the group posts at another project at the moment, https://tracker.archiveteam.org/telegram-groups-temp/ , which now has 4 billion items
16:41:10<@arkiver>so we'll just feed that in slowly whenever there is room
16:41:35<@arkiver>it's already very good we can keep up with channel posts however, we're discovering and archiving many of them
16:45:59<Jake>(I missed quite the night here!)
16:52:47DavidSaguna quits [Read error: Connection reset by peer]
17:14:18atphoenix_ quits [Remote host closed the connection]
17:14:18superkuh__ quits [Remote host closed the connection]
17:14:18<mgrandi>@arkiver: how are you guys doing telegram? The web view of groups ?
17:14:36superkuh__ joins
17:14:43<mgrandi>Also, update on the FA forums, I'm pretty sure that's not what the GDPR means , and also lol, like that's going to stop anyone https://forums.furaffinity.net/threads/forum-closure-fa-discord-coming-soon.1682702/post-7381985
17:14:50atphoenix_ (atphoenix) joins
17:15:08<@arkiver>mgrandi: yes
17:15:11<@arkiver>on telegram
17:16:56h2ibot quits [Remote host closed the connection]
17:17:09h2ibot (h2ibot) joins
17:17:24<ivan>"I dont know how that works or if it can take as many messages or forum pages this site has." haha
17:18:39<mgrandi>@arkiver: that is the easiest way yeah, I have a lot of experience with tdlib but it's daunting how many things to support so the web view probably is the easiest way for now!
17:23:59systwi quits [Ping timeout: 250 seconds]
17:29:46systwi (systwi) joins
17:47:39<schwarzkatz|m>what is their concern with GDPR on an archived website
17:47:46<schwarzkatz|m>I don't really get it
17:50:20<@arkiver>mgrandi: yeah, and the web view can go into the Wayback Machine
17:50:25Jonimus quits [Ping timeout: 250 seconds]
17:50:49<schwarzkatz|m>according to deathwatch, https://www.zhihu.com/club/explore will stop working on 12-26. it's probably a good idea to archive/grab all links from these pages beforehand
17:51:53<@arkiver>schwarzkatz|m: as in, https://www.zhihu.com/ is shutting down?
17:51:58<@arkiver>hmm
17:52:01<@arkiver>i missed that on deathwatch
17:52:08<schwarzkatz|m>it says only /explore
17:53:07<@arkiver>hmm yeah but I see the entire thing (zhihu.com) is shutting down next year?
17:53:35<schwarzkatz|m>looks like it
17:53:40<@arkiver>fun project
17:58:10Jonimus joins
18:02:29<@arkiver>any idea where 'Circles' comes from in Zhihu Circles?
18:02:45<@arkiver>JAA: do you know if anything was done for the furaffinity forums?
18:03:00<schwarzkatz|m>looks like a bunch of api calls to get, I'll try to grab them from /explore
18:03:39Jonimus quits [Ping timeout: 265 seconds]
18:03:40<@arkiver>the deathwatch page says "Zhihu Circles" is removing that public access, is Zhihu Circles all of zhihu.com ?
18:05:05<schwarzkatz|m>a circle seems to be a /club/[0-9]+
18:05:34<schwarzkatz|m>so all items on /explore are circles
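Given that a circle is just a `/club/[0-9]+` path, the circle IDs could be pulled out of the /explore HTML with a small regex (a sketch assuming you already fetched the page; the attribute layout is hypothetical):

```python
import re

# A Zhihu circle is identified by a numeric /club/<id> path.
CLUB_RE = re.compile(r"/club/(\d+)")


def extract_club_ids(html):
    """Return the distinct circle IDs referenced anywhere in a chunk of
    HTML, sorted numerically. Duplicates (a circle linked several times
    on the explore page) collapse to one entry."""
    return sorted(set(CLUB_RE.findall(html)), key=int)
```

Feeding every /explore page (or the API responses behind it) through this yields the item list for an archiving job.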
18:05:47<@arkiver>i see. thank you
18:11:45Jonimus joins
18:22:03Jonimus quits [Ping timeout: 250 seconds]
18:36:46sec^nd quits [Ping timeout: 245 seconds]
18:36:50<schwarzkatz|m>here is a list with all api calls for /explore https://transfer.archivete.am/aqUwF/zhihu.explore-api-calls.txt
18:42:39sec^nd (second) joins
18:50:40<mgrandi>schwarzkatz|m: the guy probably doesn't know what he is talking about or is thinking that we would be taking the legit forum database
19:22:14Megame quits [Ping timeout: 264 seconds]
19:25:19Jonimus joins
19:28:45<@rewby>Well then, today was an *experience*
19:28:58<@rewby>We're aware pixiv, vlive, etc are having target issues
19:29:17<@rewby>It's actually intentional
19:54:15<@rewby>HCross and I have paused high activity projects at the moment. We have too much backlogged data to process and we need to be careful with the IA.
19:54:32<@rewby>Announcements in project specific channels in a few.
20:38:26sec^nd quits [Ping timeout: 245 seconds]
20:39:33sec^nd (second) joins
21:11:46sec^nd quits [Ping timeout: 245 seconds]
21:13:20sec^nd (second) joins
21:30:02braindancer joins
21:36:23<neggles>rewby: ah, sounds like we might've sent it a bit too hard
21:36:35spirit quits [Quit: Leaving]
21:47:04leo60228- quits [Quit: ZNC 1.8.2 - https://znc.in]
21:47:26leo60228 (leo60228) joins
21:48:41<@JAA>arkiver: I thought Fur Affinity was thrown into AB, but apparently not. I'll take a look later. Might also qwarc it.
21:53:37<h2ibot>Arcorann edited Deathwatch (+292, /* 2023 */): https://wiki.archiveteam.org/?diff=49262&oldid=49255
21:53:39<schwarzkatz|m>JAA: would collecting all thread & subforum urls be helpful?
21:53:40<datechnoman>Worst case throw Fur Affinity in #Y project to be trawled through
21:55:02<@JAA>schwarzkatz|m: Not needed, with qwarc, I'd probably just bruteforce thread IDs anyway.
21:55:42<schwarzkatz|m>with pagination? :O
21:55:53<@JAA>Of course.
21:56:20<@JAA>I've archived XenForo forums before with qwarc, so just need to adjust domains and am probably good to go.
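The bruteforce-with-pagination approach JAA describes amounts to generating candidate URLs over sequential thread IDs. A sketch of that generator (the XenForo-style URL shape and the fixed page count are illustrative assumptions; a real crawl would discover each thread's page count from page 1):

```python
def thread_urls(base, max_thread_id, pages_per_thread=1):
    """Yield candidate XenForo thread URLs for IDs 1..max_thread_id.

    Page 1 is the bare thread URL; later pages use the /page-N suffix
    XenForo forums conventionally expose. Nonexistent IDs simply 404
    when fetched, which is the cost of bruteforcing.
    """
    for tid in range(1, max_thread_id + 1):
        yield f"{base}/threads/{tid}/"
        for page in range(2, pages_per_thread + 1):
            yield f"{base}/threads/{tid}/page-{page}"
```

For example, `thread_urls("https://forums.example.com", 2, pages_per_thread=2)` enumerates both threads with their second pages, in ID order.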
21:56:30<schwarzkatz|m>okay then, let me know if I could help otherwise :)
21:57:07<@JAA>If they can take the load, I could grab it all in hours. Chances are they can't though. :-)
22:00:32<datechnoman>Fur Affinity appears to be Cloudflare backed, including their images, so they will be able to process high throughput I'd say
22:04:28<@arkiver>JAA: alright, sounds good
22:04:38<@arkiver>and with bruteforcing thread IDs, you could still somehow get the pages?
22:04:45<@arkiver>outlink can of course go into #// :)
22:04:48<@JAA>datechnoman: That doesn't really mean much as it just depends on what the backend server is. Could be a RPi in someone's closet for all we know.
22:05:44<@JAA>arkiver: Yes, I will get thread pagination. No images etc., but those can be extracted later and fed to #// along with the outlinks, yeah.
22:08:02<datechnoman>Fair call JAA. I guess the CDN helps with throughput and load but the backend processing of the requests is a different story. You can tell my main focus is the #// which is everything all over the place
22:09:51<@JAA>Yeah, it certainly helps with things cached on the CDN. When bruteforcing threads, most won't be in the cache.
22:12:48<@arkiver>sounds good
22:14:16sec^nd quits [Ping timeout: 245 seconds]
22:14:47sec^nd (second) joins
22:21:46<@JAA>schwarzkatz|m: So forum.lacartoonerie.com is NXDOMAIN now. It was down since end of November anyway, but I guess that means it definitely won't be coming back.
22:22:25<schwarzkatz|m>good that we got it then :)
22:22:37<@JAA>Your grab is on IA?
22:22:53<@JAA>The ArchiveBot job didn't get far.
22:23:39<schwarzkatz|m>I thought you got it all :/
22:24:38<@JAA>No, it got errors pretty soon after I started it. That's why I asked about whether you had also seen timeouts in your crawl.
22:24:46<@JAA>I don't think it managed to retrieve much more after that.
22:26:14<@JAA>Ah, I see https://archive.org/details/forum.lacartoonerie.com-2022-11-11-24a72456-00000.warc
22:26:22<@JAA>Missing the -meta.warc.gz though, do you still have that?
22:26:36<schwarzkatz|m>that's unfortunate then
22:26:36<schwarzkatz|m>my grab is partially in WBM since I at first used SPN exclusively
22:27:19<@JAA>Looks like someone else also did something in September, but it's in WARCZone: https://archive.org/details/warc_forum_lacartoonerie_com_20220927
22:28:18<schwarzkatz|m>I have deleted all files after I uploaded that, looks like I didn't see that one
22:28:25<@JAA>Oof
22:28:49<schwarzkatz|m>what's in there?
22:29:01<@JAA>Log
22:29:30<@JAA>Less important than the data I suppose, but yeah, please upload it on future grabs.
22:29:52<schwarzkatz|m>will do
22:35:10<@JAA>Do we know of any list of projects on SourceHut that will be removed? If not, can someone try to compile one? https://sourcehut.org/blog/2022-10-31-tos-update-cryptocurrency/
22:38:55hitgrr8 quits [Client Quit]
22:41:25<schwarzkatz|m>searching for related words turns up maybe less than 20 public repos in total. maybe it's a good idea to get these and then archive all 1058 repos?
22:45:55BlueMaxima joins
22:50:31spirit joins
22:52:54<@JAA>Sounds reasonable.
22:53:30<@JAA>Not sure about archiving all repos actually, but sounds like it shouldn't be too big. Unless there are a dozen copies of Linux and Chromium on it. :-|
22:53:43<@arkiver>"how about we just get everything?" "sounds reasonable" :P
22:54:08<@JAA>https://transfer.archivete.am/inline/bG4mu/aatt.png
22:54:20<@arkiver>hahaha yeah!
22:54:45ThreeHM_ is now known as ThreeHM
22:55:03<@JAA>I will grab all of sr.ht eventually anyway (when that bot is ready), I'm just not entirely certain it's worth doing that now.
22:59:25<schwarzkatz|m>https://transfer.archivete.am/WZZDz/sourcehut.crypto-related.txt
22:59:26<@JAA>Yeah, as expected, there are at least a couple copies of the Linux repo. Those would be duplicated.
22:59:40<schwarzkatz|m>it also contains non-cryptocurrency stuff, didn't sort that out
22:59:53<@JAA>Is it 1058 repos or 1058 projects? Projects can have multiple repos, I think.
23:00:13<schwarzkatz|m>projects then :D
23:00:28<@JAA>Thanks for the list, will do the magic later.
23:00:56<schwarzkatz|m>great
23:01:17<@JAA>And I might just throw https://sr.ht/ into AB and add aggressive ignores to get a general record of what's on there.
23:02:21<@JAA>The project pages should have some records of the (short) commit IDs, too, which could be used to verify mirrors, for example.
23:02:44mikesteven joins
23:03:01<@JAA>arkiver: Heard anything from GeoLog?
23:03:06mikesteven leaves
23:18:20<@JAA>Ah, the repos are on a separate domain anyway, right. So it'd grab those and not recurse further, which is even better.
23:18:46<@JAA>SourceHut does also support unlisted repos, which would be tricky to find.
23:24:56<@arkiver>JAA: no, nothing
23:25:24<@arkiver>ACTUALLY
23:25:37<@arkiver>got a reply literally few hours ago
23:25:40<@arkiver>:)
23:25:45<@JAA>:-)
23:29:45Ketchup901 quits [Quit: No Ping reply in 180 seconds.]
23:31:29Ketchup901 (Ketchup901) joins
23:37:06<Ryz>Ooo, reply? O:
23:37:38<Ryz>arkiver?
23:39:49<pabs>JAA: #swh folks pointed me at this rejection of an API to list all SourceHut repos: https://lists.sr.ht/~sircmpwn/sr.ht-dev/patches/4859
23:40:25<pabs>JAA: btw, could you pastebin a link of the sr.ht repos you archive into #swh (libera) so they can grab them too?