| 00:23:03 | | BlueMaxima quits [Client Quit] |
| 00:41:06 | | omglolbah quits [Ping timeout: 265 seconds] |
| 00:46:48 | | omglolbah joins |
| 01:13:50 | | omglolbah quits [Ping timeout: 264 seconds] |
| 01:14:17 | | omglolbah joins |
| 01:24:30 | | omglolbah_ joins |
| 01:25:34 | | omglolbah quits [Ping timeout: 265 seconds] |
| 02:22:58 | | BlueMaxima joins |
| 02:33:16 | | monoxane (monoxane) joins |
| 02:34:49 | <pabs> | JAA: sounds about right. TBH this is the first time I heard of Freed-ora |
| 02:35:20 | | pabs quits [Remote host closed the connection] |
| 02:35:24 | <monoxane> | yo so how hard would it be to get some more targets online if we had the storage + network to provide for it |
| 02:35:59 | <monoxane> | if you've seen pixiv over the last 2 days you may have seen that me and a few friends have thrown some 100g boxes at it and are currently bottlenecked by the 2 online targets |
| 02:36:26 | <monoxane> | we know the targets need to offload to IA at an appropriate speed, but have quite a bit of available storage to buffer ourselves with |
| 02:38:02 | <monoxane> | at a point we were hitting 7.5gbps from the source but are now limited by the targets disks filling up and stopping connections 😔 |
| 02:39:13 | <monoxane> | its less of a thing for this in particular but we're trying to work out how we can provide some infra for the next "oh shit its going down in 24 hours" site scrape |
| 02:39:58 | <monoxane> | 500gbit of bandwidth, a /24, and 100TB of local storage will help some of those a fair bit 😉 |
| 02:41:54 | <@JAA> | Pinging some relevant people: rewby HCross arkiver ^ |
| 02:43:38 | <monoxane> | we're also working on rewriting an api compatible warrior that will scale much higher |
| 02:44:19 | <monoxane> | for reference last night we had 3328 warrior threads running across 13 nodes for shits n gigs, and were nowhere near capacity |
| 02:45:20 | | pabs (pabs) joins |
| 02:45:47 | <monoxane> | also considering rolling a new version of the megawarc factory with some improvements, the real question is how does it get from the targets to IA and what do we need to do to facilitate that |
| 02:46:37 | <monoxane> | and yes, aware that IA only has ~20gbps S3 capacity, we'd be egress shaping down to about 5gbps, hence the fuck off massive target cache to hold it for a bit |
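A minimal sketch of the kind of egress shaping described above, assuming a Linux target shaping its upload interface with a token-bucket qdisc. The interface name, rate, and burst/latency values are placeholder assumptions, not actual ArchiveTeam configuration:

```shell
# Hypothetical: cap egress toward IA at ~5 Gbit/s using tc's token
# bucket filter. "eth0" and the burst/latency numbers are assumptions.
tc qdisc add dev eth0 root tbf rate 5gbit burst 1m latency 50ms
```

In practice a target would likely shape only the upload flows (e.g. with a classful qdisc and filters) rather than the whole interface, but tbf shows the idea in one line.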
| 03:15:25 | | fishingforsoup_ quits [Quit: Leaving] |
| 03:15:42 | | fishingforsoup_ joins |
| 03:20:58 | | godane (godane) joins |
| 03:21:34 | | godane1 quits [Ping timeout: 265 seconds] |
| 04:06:27 | | Aizith joins |
| 04:09:25 | | Aizith quits [Remote host closed the connection] |
| 05:23:27 | | igloo22225 quits [Client Quit] |
| 05:23:39 | | igloo22225 (igloo22225) joins |
| 06:02:05 | | Arcorann_ joins |
| 06:04:27 | | hackbug quits [Ping timeout: 265 seconds] |
| 06:06:31 | | BlueMaxima quits [Client Quit] |
| 06:18:15 | | sonick quits [Client Quit] |
| 06:29:13 | | godane1 joins |
| 06:31:02 | | godane quits [Ping timeout: 265 seconds] |
| 06:32:36 | | jacksonchen666 quits [Ping timeout: 245 seconds] |
| 06:33:06 | | jacksonchen666 (jacksonchen666) joins |
| 06:35:46 | | Island quits [Read error: Connection reset by peer] |
| 07:28:41 | <monika> | monoxane could you clarify on the "api compatible" warrior? are you modifying the existing warrior or writing one from scratch |
| 07:30:14 | <monika> | i believe modifying warrior code is a big no no |
| 07:31:36 | <monoxane> | new one that does the same thing with the same apis just less jank and some more options to allow us to vertically scale easier and with an updated docker image |
| 07:32:11 | <monika> | JAA what's your opinion ^ |
| 07:32:34 | <nepeat> | i'd be interested in learning more and supporting this warrior improvement |
| 07:33:04 | <nepeat> | personally, i'd love to add on prom metrics and getting the logging to fit the structlog format to work with my systems |
| 07:33:07 | <monoxane> | im not the guy doing that so i might be wrong on whats actually happening, but we've found that one of the main limiting factors of the warrior is its concurrency settings and the inability to disable things like the web ui |
| 07:33:28 | <monoxane> | and also the fact that some of the python libs used in it are effectively vaporware that havnt been updated since 2017 |
| 07:33:28 | <monika> | if you run the bare project containers the UI is already disabled |
| 07:33:44 | <monika> | atdr.meo.ws/archiveteam/<PROJECT>-grab |
| 07:33:58 | <monika> | allows for 20 concurrency too |
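The bare-container invocation being described looks roughly like this; a sketch based on the image path given above, where `<PROJECT>` and `YOURNICK` are placeholders to fill in:

```shell
# Hypothetical: run a project grab container directly (no warrior, no
# web UI) with 20 concurrent items. <PROJECT> and YOURNICK are
# placeholders; exact arguments may differ per project.
docker run --detach --restart=unless-stopped \
  atdr.meo.ws/archiveteam/<PROJECT>-grab \
  --concurrent 20 YOURNICK
```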
| 07:34:01 | <monoxane> | ooo we did not know that |
| 07:34:07 | <monika> | go crazy |
| 07:34:12 | <monoxane> | that is going to make a massive difference |
| 07:34:56 | <monoxane> | aight the warrior isnt being changed anymore :) |
| 07:35:26 | <monoxane> | but we are gonna write our own cluster agent and c2 implementation :P |
| 07:36:02 | <nepeat> | ditching k8s already? |
| 07:36:29 | <monoxane> | no, still using k8s, just writing a controller that handles the deployment and configuration of those bare images instead of the warrior |
| 07:36:46 | <monoxane> | we are already working on that but via the warrior, knowing about the bare images is a massive game changer |
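A sketch of what such a controller might render per project: a plain Deployment of a bare grab image, scaled horizontally instead of running warriors. The image path pattern comes from earlier in the log; the project name, args, and replica count are placeholder assumptions, not real controller output:

```yaml
# Hypothetical Deployment a custom controller could emit for one project.
# <PROJECT>, YOURNICK, and the replica count are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: project-grab
spec:
  replicas: 256
  selector:
    matchLabels:
      app: project-grab
  template:
    metadata:
      labels:
        app: project-grab
    spec:
      containers:
        - name: grab
          image: atdr.meo.ws/archiveteam/<PROJECT>-grab
          args: ["--concurrent", "20", "YOURNICK"]
```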
| 07:38:54 | <monoxane> | hm these dont seem to actually contain anything though 😔 |
| 07:39:06 | <monika> | huh? |
| 07:39:08 | <monoxane> | at least the pixiv-2-grab one's dockerfile literally only has a from line in it |
| 07:39:38 | <nepeat> | it does fucky ONBUILD magic https://github.com/ArchiveTeam/grab-base-df/blob/master/Dockerfile |
| 07:39:45 | <nepeat> | this is the dockerfile to refer to |
| 07:39:52 | <nepeat> | the base is on https://github.com/ArchiveTeam/wget-lua/blob/v1.21.3-at/Dockerfile |
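For reference, the ONBUILD trick works roughly like this (a simplified sketch, not the actual grab-base-df file): instructions marked `ONBUILD` are deferred and executed when a downstream image is built `FROM` the base, which is why a project Dockerfile can consist of a single `FROM` line.

```dockerfile
# --- Base image (simplified sketch; not the real grab-base-df) ---
# ONBUILD steps do nothing here; they run when a child image builds.
FROM debian:buster-slim
ONBUILD COPY . /grab/
ONBUILD RUN pip install --no-cache-dir -r /grab/requirements.txt

# --- Project image: one line, inherits the ONBUILD steps above ---
# FROM atdr.meo.ws/archiveteam/grab-base
```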
| 07:40:00 | | @OrIdow6 leaves [Leaving.] |
| 07:40:10 | | OrIdow6 (OrIdow6) joins |
| 07:40:10 | | @ChanServ sets mode: +o OrIdow6 |
| 07:40:35 | <monoxane> | ah okay, thats some fucky shit i havnt seen before :P |
| 07:40:45 | <monoxane> | will play around with it after i finish my actual job for the day lol |
| 07:41:23 | <@OrIdow6> | arkiver: See above, they have dropped their plan to modify "the warrior" |
| 07:42:08 | <monoxane> | yes now we're just gonna bypass it 😆 |
| 07:43:02 | <monoxane> | we dont wanna do anything that will screw anyone else here but there are definitely challenges with scaling warrior to 3000+ instances over 10+ nodes and actually managing it |
| 07:43:11 | <nepeat> | eh https://github.com/general-programming/megarepo/blob/mainline/common/nomad_jobs/job_at_vlive.hcl |
| 07:43:31 | <nepeat> | i like nomad, it's simple and has scaled up with my 100-300 instances well |
| 07:43:57 | <monoxane> | yea the other thing is the nodes we're using already have k3s and are running some other workloads, so we cant just jump to nomad |
| 07:44:13 | <nepeat> | ah, preexisting prod |
| 07:44:54 | <monoxane> | yes, if you knew what these nodes usually do you'd be absolutely shocked that we can run AT workloads on them, and also absolutely not surprised at all that we can pin 500gbps |
| 07:45:05 | <monoxane> | but dont worry its all approved by the owners :) |
| 07:46:28 | <@OrIdow6> | I haven't been following this conversation enough to know the meaning of "bypass it", but basically, the hard rules are: -don't modify wget-lua/wget-at, including messing with the build process to get it to accept wider ranges of library versions -don't modify Seesaw or the other libraries it uses -don't modify the project scripts -keep a clean, vanilla connection from wget and the project scripts to the Internet |
| 07:48:23 | | sepro quits [Read error: Connection reset by peer] |
| 07:48:52 | <monoxane> | understood we’ll definitely be sticking to that |
| 07:49:10 | <monoxane> | i mean we’ll be running the project containers directly not managed through warrior |
| 07:51:45 | | sepro (sepro) joins |
| 07:52:29 | | michaelblob quits [Read error: Connection reset by peer] |
| 07:53:14 | <nepeat> | that's what most of us hardcore users do |
| 07:53:39 | <nepeat> | you're definitely on the right path to hauling top rates |
| 07:58:56 | <monoxane> | we dont care about the leaderboards lmao, even considered randomising the DOWNLOADER ids so other people dont get discouraged by 1 name munching 10tb a day |
| 07:59:16 | <monoxane> | its more, if we can help in an "oh fuck" situation where theres 24 hours to get an entire site archived, we'll put in everything we've got |
| 07:59:46 | <monoxane> | because i've been part of some of those where even with all the capacity we've had, some content is still lost, and in a couple of cases it was a fair bit of content |
| 08:07:25 | <schwarzkatz|m> | Appreciate the work you guys do, monoxane! |
| 08:11:34 | <Jake> | (also related to earlier conversation, it's easier if you use a known downloader name so that you can be contacted) |
| 08:13:58 | <monoxane> | yea we're gonna use some sort of team name when its all up and running |
| 08:14:07 | <monoxane> | instead of just my nick lol |
| 08:16:28 | | Hackerpcs quits [Quit: Hackerpcs] |
| 08:18:43 | | Hackerpcs (Hackerpcs) joins |
| 08:47:53 | <nepeat> | kinda wondering, how up to date are all of the archive team repos? |
| 09:08:50 | | hitgrr8 joins |
| 09:15:26 | | sknebel is now known as sknebel_m |
| 09:15:35 | | sknebel_m is now known as sknebel |
| 09:17:43 | | sknebel is now known as sknebel2 |
| 09:17:49 | | sknebel2 is now known as sknebel |
| 09:23:03 | <neggles> | "don't modify Seesaw or the other libraries it uses" aww |
| 09:27:39 | <neggles> | I believe the current plan was to use MagnusInstitute or possibly MagnusArchivist as downloader name, TBC though |
| 09:32:35 | <neggles> | OrIdow6: would it be OK to rework the warrior docker image somewhat so it's a bit more... modern, for lack of a better way to put it? I was digging through repos and whatnot last night piecing together how it all works and... oof. |
| 09:34:42 | <@OrIdow6> | neggles: I don't know what that implies exactly |
| 09:35:04 | <@OrIdow6> | The core that you shouldn't modify is in the READMEs under "Distribution-specific setup" |
| 09:35:25 | <@OrIdow6> | And to my understanding the warrior, Docker images, etc. are basically just wrappers around a preconfigured version of this |
| 09:35:52 | <@OrIdow6> | But I don't know the details of those, and if you want specifics you should wait around for someone who does |
| 09:35:58 | <neggles> | OK, no problem |
| 09:40:34 | <neggles> | don't want to step on anyone's toes; I have a local 3/4-ish-complete copy of what I'm talking about, it's mostly a slightly cleaner build process (same steps, same sources, similar end result) just with bullseye underneath, theoretically arm64 support, and a few more things configured through environment variables (webui port, UID/GID) |
| 09:50:33 | <@rewby> | So the thing is: don't run custom builds of wget-at. It causes issues |
| 09:51:06 | <@rewby> | Mostly around compression and or file integrity |
| 09:51:42 | <@rewby> | And upgrading the base distro changes lib versions, which then causes the above |
| 09:54:01 | <@rewby> | As for targets, we don't generally accept them from just anyone who shows up randomly. Once data is on a target it is really hard to figure out what needs to be redone if that target disappears. |
| 09:55:50 | <@rewby> | Notably, we only accept targets in the form of bare metal or vms. We have provisioning playbooks for them |
| 09:55:57 | <@rewby> | Also, they destroy ssds |
| 09:56:08 | <@rewby> | And HDDs are not gonna keep up |
| 09:56:43 | <@rewby> | Also, 100T isn't much |
| 09:56:57 | <@rewby> | I have targets with that much sitting around as well |
| 09:57:25 | <@rewby> | I can look into reshuffling a few things |
| 09:59:24 | <@rewby> | Also, monoxane, do *NOT* use team names. That is forbidden. We will ban you if we discover this. |
| 09:59:32 | <monoxane> | oop okay |
| 09:59:34 | <monoxane> | will not |
| 10:00:09 | <@rewby> | We have had many issues with this before |
| 10:02:10 | <@rewby> | In the past, people have used team names and then one member's infra fucks up and we need them to stop. Inevitably that person is unreachable and the other members can't get to that specific bit of infra. We end up banning the whole thing because that's the most granular tool we have. |
| 10:02:18 | <@rewby> | This has happened multiple times. |
| 10:02:31 | <@rewby> | So we prohibit team names in general now |
| 10:05:37 | <monoxane> | yea that makes heaps of sense i dont know why i didnt think about it |
| 10:06:18 | | michaelblob (michaelblob) joins |
| 10:10:11 | <@rewby> | Yeah, each person's infra needs a unique uploader name |
| 10:10:30 | <@rewby> | If you wanna do TeamBlah-monoxane then by all means go for it |
| 10:10:59 | <neggles> | does it qualify as one person's infra if all the workers are being managed from a central point, and go idle if they can't talk to it? |
| 10:11:12 | <@rewby> | Uh. Unsure. |
| 10:12:00 | <monoxane> | we'll take that as a no then |
| 10:12:04 | <monoxane> | dont wanna antagonise |
| 10:12:09 | <@rewby> | Basically, each uploader name should be associated with the person who can run sudo poweroff |
| 10:12:10 | <monoxane> | may have come in a bit too hot with the ideas |
| 10:12:19 | <monoxane> | copy |
| 10:12:40 | <neggles> | the whole "we need any one of us to be able to hit the kill switch" thing did occur to us |
| 10:12:46 | <@rewby> | Even if you don't have the skills to fix the issue, you can at least shut the thing down, you feel? |
| 10:13:06 | <@rewby> | Yeah so, that "any one of us" idea has been tried before |
| 10:13:11 | <@rewby> | It never works out in practice |
| 10:13:59 | <@rewby> | So we want to be able to tracker-ban one control domain |
| 10:14:32 | <BPCZ> | rewby: would a target with 5PiB of flash and 100PiB of hdd be of much use? |
| 10:14:56 | <monoxane> | BPCZ isnt that just the IA :P |
| 10:15:14 | <BPCZ> | Though if a target going missing is an issue that might be an issue since that system is my testing ground :/ |
| 10:15:24 | <@rewby> | BPCZ: Depends on the networking, how much abuse against cpu and flash you're willing to take and how long it's available for |
| 10:15:36 | <@rewby> | Yeah, no testing grounds |
| 10:15:53 | <@rewby> | Targets going missing without >24h notice is a Big Problem |
| 10:15:54 | <monoxane> | someone buy a VAST cluster already |
| 10:16:04 | <BPCZ> | VAST is dog shit |
| 10:16:13 | <@rewby> | Just give me bare metal tbh |
| 10:16:19 | <@rewby> | That usually works best |
| 10:16:27 | <BPCZ> | Understandable |
| 10:16:41 | <@rewby> | I have a whole Ansible system to provision and manage metal |
| 10:17:23 | <@rewby> | Not OSS because, like a lot of AT code, it was all written with -2 hours of notice/planning |
| 10:17:50 | <BPCZ> | :( but we could clean it up |
| 10:17:52 | <@rewby> | Specifically, it just hardcodes a ton of secrets because I had a deadline of a few hours |
| 10:17:58 | <BPCZ> | lol |
| 10:18:21 | <@rewby> | It's the secrets bit that's the issue |
| 10:20:17 | <BPCZ> | Wish I could contribute hardware but that’s a big nono, I can chuck ungodly amounts of compute and ephemeral storage around but most OSS projects get annoyed when you show up do 5x the work they’ve done in 3 years then disappear |
| 10:20:44 | <@rewby> | Our problem is ephemeral is a big no for targets |
| 10:20:47 | <@rewby> | Workers, sure |
| 10:20:59 | <@rewby> | And i can scale targets up if need be. |
| 10:21:19 | <@rewby> | I've just been sick for the last 3 weeks and haven't been able to babysit them like I usually do |
| 10:21:30 | <nepeat> | oooooo ansible scripts |
| 10:21:50 | <nepeat> | i've been trying to research the backend infra and a lot of the stuff seems stale for that |
| 10:22:30 | <BPCZ> | Paperclips and chewing gum |
| 10:22:31 | <@rewby> | Targets aren't that complicated tbh, it's mostly OSS except for my provisioning code |
| 10:22:39 | <@rewby> | Tracker... |
| 10:22:54 | <@rewby> | Talk to Kaz. He's been on a journey to RE that thing |
| 10:23:03 | <nepeat> | heh |
| 10:23:10 | <nepeat> | is the current tracker code open sourced? |
| 10:23:12 | <@rewby> | Only F.usl really knows how that thing works. |
| 10:23:25 | <@rewby> | You assume all of it even has a source code repo |
| 10:23:27 | <@rewby> | Bold |
| 10:23:31 | <nepeat> | HAHA OH GOD |
| 10:24:03 | <nepeat> | my inner sre cries a little |
| 10:24:07 | <BPCZ> | >ruby |
| 10:24:11 | <@rewby> | Same |
| 10:24:13 | <BPCZ> | Off to a terrible start |
| 10:24:14 | <nepeat> | ruby is cool! |
| 10:24:18 | <@rewby> | Oh trackerproxy isn't ruby |
| 10:24:27 | <@rewby> | It's all redis and nix+lua |
| 10:24:33 | <@rewby> | *nginx |
| 10:24:38 | <@rewby> | Damn autocorrect |
| 10:25:10 | <BPCZ> | I wish there was better docs on the infrastructure, seems neat |
| 10:25:24 | <nepeat> | +1 |
| 10:25:40 | <@rewby> | Same here |
| 10:25:41 | <nepeat> | i'd love to make some changes that would improve my quality of life with my infra |
| 10:25:42 | <monoxane> | +1 |
| 10:25:59 | <monoxane> | i’ll just make my own with blackjack and hookers and an ia s3 key /s |
| 10:26:01 | <nepeat> | hell yeah prom exporters and structlogs |
| 10:26:05 | <monoxane> | too much work |
| 10:26:17 | <@rewby> | Using your own S3 key wouldn't work btw |
| 10:26:35 | <monoxane> | yea i know |
| 10:26:39 | <nepeat> | spicy |
| 10:26:42 | <@rewby> | You don't have access to the magical collections where we drop things. |
| 10:26:47 | <BPCZ> | IA is using S3 |
| 10:26:49 | <monoxane> | it only lets you upload via the site doesn’t it |
| 10:26:51 | <BPCZ> | Now? |
| 10:26:57 | <BPCZ> | Sadage |
| 10:26:58 | <nepeat> | s3 compatible, not actual s3 |
| 10:27:01 | <monoxane> | the web ui upload from ia is an s3 thing |
| 10:27:03 | <@rewby> | It's an S3 "compatible" endpoint |
| 10:27:06 | <@rewby> | We call it s3 |
| 10:27:10 | <monoxane> | and yea not s3 from amazon, just the protocol |
| 10:27:12 | <BPCZ> | Thank god ok |
| 10:27:13 | <nepeat> | everyone implements s3 compatible apis |
| 10:27:15 | <neggles> | S3 =/= AWS S3 |
| 10:27:17 | <@rewby> | It's cursed |
| 10:27:44 | <BPCZ> | I don’t even know if IA has multiple tape libraries yet |
| 10:27:56 | <@rewby> | It's all hdds |
| 10:28:03 | <@rewby> | Afaik |
| 10:28:15 | <nepeat> | i've heard they're running ceph these days? |
| 10:28:19 | <monoxane> | yea i think it’s hdd with a little bit of flash in front for web stuff |
| 10:28:35 | <BPCZ> | Probably too much effort to keep a library alive, those bastards always have issues |
| 10:28:36 | <monoxane> | there’s a page on the site talking about petabox |
| 10:28:54 | <monoxane> | somewhere else talks about s3 on top of it too |
| 10:29:07 | <monoxane> | which is where i got the idea to just ask for a key :P |
| 10:29:15 | <@rewby> | Also, re SRE cries. You really don't wanna know the tracker. Some of it is Debian wheezy |
| 10:29:15 | <monoxane> | they’d absolutely say no though |
| 10:29:16 | <BPCZ> | If it’s Ceph then S3 is just gratis |
| 10:29:25 | <monoxane> | tell me to piss right off and never come back |
| 10:29:42 | <@rewby> | You can get keys piss easy |
| 10:29:57 | <@rewby> | Make an account on the IA and go to your profile |
| 10:30:03 | <monoxane> | not long lasting ones though |
| 10:30:04 | <@rewby> | It'll give them |
| 10:30:12 | <monoxane> | oh interesting |
| 10:30:24 | <@rewby> | They're just account creds iirc |
| 10:30:44 | <@rewby> | The thing is, we have collections with special flags that make the wbm index them |
| 10:30:46 | <monoxane> | yea and they probably revolve them if i uploaded at 10gbps |
| 10:30:52 | <monoxane> | yeap |
| 10:31:02 | <@rewby> | Randos cant just upload warcs and have them show up in the wbm |
| 10:31:49 | <nepeat> | reliability and automation would be great things to look at |
| 10:31:56 | <neggles> | most of what struck me as I was digging through code piecing together how this stuff works was, idk, disappointment? but the existential kind |
| 10:31:59 | <nepeat> | not pure brute force... |
| 10:32:52 | | jtagcat quits [Client Quit] |
| 10:33:01 | | jtagcat (jtagcat) joins |
| 10:33:03 | <@rewby> | But our collections are special |
| 10:33:03 | <@rewby> | And have restricted uploader access |
| 10:33:03 | <@rewby> | But all of the IA side is managed by ark.iver |
| 10:33:03 | <@rewby> | I get a set of S3 creds and a collection to shove stuff into |
| 10:33:06 | <@rewby> | If you see us discuss vars, that's our slang for the info I need from him to interface with IA |
| 10:33:32 | <@rewby> | Oh trust me, I wanna replace so much of it |
| 10:33:41 | <nepeat> | kinda curious, has something like vault been looked at for keeping the secrets outside of env files? |
| 10:33:43 | <@rewby> | But there's only so many hours in a day and I'm overworked as is |
| 10:34:04 | <neggles> | IA is important, AT is important, but it seems like there's... can't find the right way to say it but "oh come on, companies spend tens of millions on <next stupid internet fad> but *none* of them feel like giving any real resources to something that actually does some good?" |
| 10:34:04 | <@rewby> | Looked at? Sure. But time is limited for most of us. |
| 10:34:18 | <@rewby> | Note that we have 0 budget |
| 10:34:24 | <@rewby> | We fund this ourselves |
| 10:34:32 | <neggles> | yeah, absolutely not having a go at anyone here |
| 10:34:53 | <@rewby> | Target costs are split between me and like 4-5 other people who all pay for the hardware they donate |
| 10:35:24 | <@rewby> | But importantly, I have names, phone numbers, addresses etc |
| 10:35:42 | <@rewby> | We know where to send goons if someone fucks off |
| 10:36:27 | <neggles> | I guess i'm just kinda surprised none of the tech giants have decided to get themselves some positive press by throwing a (for them) miniscule amount of funding and resources at this |
| 10:36:41 | <@rewby> | We don't have an org |
| 10:36:47 | <neggles> | surprised isn't the right word, disappointed |
| 10:36:48 | <@rewby> | Which makes that hard |
| 10:36:49 | <nepeat> | some of us work for the tech giants ;) |
| 10:37:18 | <BPCZ> | Some of us would prefer dirty money not get involved |
| 10:37:33 | <nepeat> | i wouldn't say the money's dirty |
| 10:37:52 | <@rewby> | Money would be nice to finance proper target hw. |
| 10:37:58 | <@rewby> | Or at least pay hosting bills |
| 10:37:59 | <nepeat> | it's what makes it possible for people like me to spin up a lot of instances for the warrior IPs |
| 10:38:05 | <neggles> | all money is dirty depending on how you look at it, but that's a whole other question, and if it doesn't come with any strings attached other than "tell people we did this" that's fine |
| 10:39:04 | <@rewby> | From archiveteam.org: Archive Team is a loose collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage. |
| 10:39:08 | <@rewby> | This makes money hard |
| 10:39:13 | <neggles> | (that sounded wrong, s/that's fine/i wouldn't have a problem with it at least/) |
| 10:40:19 | <neggles> | nepeat: the org whose resources we are making use of does have a /22 or so available |
| 10:40:23 | <BPCZ> | I’m kind of surprised IA can’t provided a reasonable set of targets |
| 10:40:37 | <nepeat> | BPCZ: this isn't the IA |
| 10:40:45 | <@rewby> | We're not the IA |
| 10:41:09 | <@rewby> | They graciously deal with the storage and retrieval parts of web archiving for us |
| 10:41:17 | <neggles> | (and they don't get nearly enough funding either, hence the relatively low amount of ingress they can handle) |
| 10:41:20 | <@rewby> | Which is more than we could ask for anyways |
| 10:43:22 | <neggles> | yeah |
| 10:43:49 | <nepeat> | wondering, how can i help with some of the infra and client code? |
| 10:44:06 | <nepeat> | me putting out my thoughts is one thing for the overburdened team but i like to get my hands dirty and implement said thoughts |
| 10:44:08 | <neggles> | well, to say what we probably should've opened wit- heh nepeat that's p much what I was about to say |
| 10:44:49 | <neggles> | monoxane builds k8s-based application orchestration stacks for a living |
| 10:45:08 | <@rewby> | I have a decently interesting design for new target software. But not had the time to implement it. |
| 10:46:19 | <@rewby> | Also, F.usl has been working on a new tracker for years, might need help |
| 10:47:32 | <monoxane> | yea i’m kubelord, 80% of my job is building kube applications to orchestrate hundreds of gbit of traffic and the orchestration for the orchestrators to make it all manageable from a unified web interface |
| 10:49:32 | <@rewby> | I personally don't trust kube for targets |
| 10:49:47 | <@rewby> | This data is very persistent and not redundant |
| 10:50:08 | <monoxane> | replacing the warrior with a kubernetes controller that runs the direct job containers is gonna be a 3 day job at most, will look at it over christmas |
| 10:50:21 | <monoxane> | oh yea for targets it’s absolutely not the right tool |
| 10:50:34 | | @rewby is the target person |
| 10:50:53 | <nepeat> | containerized targets would be very fucky, storage would have to be separated to force that to work... |
| 10:51:04 | <nepeat> | pretty much creating target2.0 if you are doing that |
| 10:51:05 | <neggles> | that's not particularly difficult if you're running on baremetal |
| 10:51:15 | <neggles> | but it's probably not worth the effort |
| 10:51:26 | <@rewby> | I have plans for new target software |
| 10:51:29 | <nepeat> | agreed, given targets aren't disposable |
| 10:51:30 | <monoxane> | but for collection at scale? kube, a 100gbe host, and a /24 will give up to 4000 concurrent downloads across an entire public ip range in seconds |
| 10:51:40 | <@rewby> | To: not destroy ssds as much and go faster |
| 10:51:50 | <monoxane> | ramdisk time :P |
| 10:51:55 | <@rewby> | NO |
| 10:52:00 | <@rewby> | Data loss |
| 10:52:04 | <nepeat> | :openeyescryinglaughing: |
| 10:52:19 | <@rewby> | Again, if we lose uploaded data, it's gone |
| 10:52:25 | <monoxane> | yea ik |
| 10:52:30 | <@rewby> | And we have no good way of figuring out what was lost |
| 10:52:37 | <monoxane> | 1pb of zeusrams when |
| 10:53:05 | <monoxane> | actually a bluefield2 and some nvmeof would make a wonderful target |
| 10:53:05 | <neggles> | "if we lose data after it hits the target we can't tell what we lost" seems like a problem worth solving |
| 10:53:15 | <@rewby> | Also, one of my servers is under 1.5 years old. Its ssds have 3.5PiB written |
| 10:53:26 | <@rewby> | neggles: again, i have plans |
| 10:53:30 | <BPCZ> | Hahah I happen to know of a project trying to do multi tbps persisted storage via kube |
| 10:53:33 | <@rewby> | I just need to write it down |
| 10:53:36 | <BPCZ> | It’s going poorly |
| 10:53:38 | <monoxane> | also that, maybe it’s worth adding another step to the tracker for “egresses to ia” |
| 10:53:40 | <neggles> | oh yeah no i'm not suggesting it's easy |
| 10:54:01 | <neggles> | cause doing what mono just suggested doubles tracker load (and it sounds like the tracker is a bit of a black box at the moment?) |
| 10:55:28 | <schwarzkatz|m> | are there even any good news regarding that site lately |
| 10:55:28 | <schwarzkatz|m> | why is it so awfully quiet here currently, where is everybody :c |
| 10:56:19 | <@rewby> | schwarzkatz|m: It's not quiet? |
| 10:56:25 | <joepie91|m> | that way you optimize for scraping the high-result-count ones first |
| 10:56:38 | <joepie91|m> | I believe that this is part of Google's n-gram dataset somewhere |
| 10:57:10 | <joepie91|m> | hm, I thought there was a letter dataset also |
| 10:57:11 | <joepie91|m> | (which afaik is used in google's language detection thingem) |
| 10:57:50 | <madpro|m> | <rewby> "Also, F.usl has been working..." <- 🥲 |
| 11:00:40 | <BPCZ> | monoxane: how does one become a kubelord |
| 11:01:54 | <monoxane> | a lot of "wtf how the fuck does that work" and reading golang code |
| 11:02:08 | <neggles> | if my own attempts are anything to go by, the first step involves creating & recreating your cluster 27 times in 3 different configurations before you find one that doesn't have a showstopping problem that rears its head after you're 3/4 done |
| 11:02:32 | <neggles> | (assuming you don't want to pay <cloud provider> half a kidney) |
| 11:03:00 | <monoxane> | lmao also that |
| 11:03:03 | <@rewby> | That tracks with my experience |
| 11:03:14 | <monoxane> | it took me 8 tries to make a kube cluster, now i can do it in 10 min from bare os |
| 11:03:19 | <neggles> | oh the other option is to pay red hat $texas for openshift |
| 11:03:28 | <nepeat> | boring |
| 11:03:52 | <neggles> | or go dig up all the OSS components of openshift and do it yourself |
| 11:05:24 | <BPCZ> | Oh ok so I’m most of the way there then. I write go for work and write kube oci providers and modify core kube crap to pass in hardware that’s not supposed to be passed in just yet |
| 11:06:49 | <BPCZ> | Just need to get to the standing up a cluster part … most of the time I barely figure out a process once and just have an ansible playbook for next time |
| 11:07:38 | <neggles> | having spent the better part of this year attempting to stand up a cluster that doesn't have some incredibly stupid limitation that makes me throw my hands up in defeat and forget about it for a month |
| 11:07:43 | <neggles> | good luck >.> |
| 11:07:47 | <nepeat> | this is overcomplicating the overcomplicated setup |
| 11:08:54 | <BPCZ> | neggles: I mean all clusters have limitations. I work in distributed systems and clusters professionally. Kube just isn’t used heavily for the big stuff |
| 11:08:54 | <monoxane> | BPCZ if you really wanna get standing up clusters down, do kubernetes-the-hard-way, like 4 times over, and you will know everything about the internals and why things are like they are |
| 11:09:05 | <monoxane> | https://github.com/kelseyhightower/kubernetes-the-hard-way |
| 11:09:11 | <BPCZ> | Thanks! |
| 11:09:14 | <neggles> | the problem with k8s related stuff, from where i'm standing anyway, is it's all focused on "too big" or "too small" |
| 11:09:19 | <nepeat> | keep it simple. for my AT stuff, i got nomad (containers) + vault (mtls certs) + loki (logs!) |
| 11:09:22 | <monoxane> | (doesnt have to be gcloud, its just what they use as demo env) |
| 11:09:30 | <neggles> | there are a lot of ways to spin it up on single hosts that work quite well, are very straightforward, and behave |
| 11:09:52 | <neggles> | and a lot of ways to spin it up on <cloud provider> that work very well, are easy to manage, and cost an unpredictably-large fortune |
| 11:09:52 | <monoxane> | and yea, 1 node: easy, 2 to 6: incredibly painful, 6 to 1000: easy af |
| 11:10:30 | <BPCZ> | Did kube ever grow network topology knowledge? I recall that being a sticking point a while back |
| 11:10:36 | <neggles> | still a big problem. |
| 11:10:42 | <BPCZ> | Figures |
| 11:10:47 | <neggles> | there are several potential solutions, no clear winner |
| 11:10:53 | <neggles> | the frontrunner seems to be cilium |
| 11:10:59 | <monoxane> | its a problem but its got a whole lot better now, you can do l3 super easy with stuff like cilium or kube-router that dont rely on internal tunnels between nodes |
| 11:11:26 | <monoxane> | big thing about cilium is it does ebpf offloading so all the inter-pod stuff is done in the kernel and offloaded to the nic, instead of in userspace like the older CNIs |
| 11:11:50 | <neggles> | and you can handle rerouting traffic to the 'correct' node without overwriting the source address |
| 11:12:15 | <monoxane> | and also yknow just use bird to advertise everything between the nodes over bgp instead of cry when the vxlan is broken for no reason |
| 11:12:27 | <monoxane> | looking at you flannel |
| 11:12:31 | <nepeat> | +1 to using bgp lol |
| 11:12:40 | <neggles> | tl;dr it's getting a lot better, rapidly, but it's still not there yet |
| 11:12:50 | <neggles> | bit of an xkcd competing standards problem |
| 11:12:52 | <nepeat> | i just have wireguard tunnels and bgp to route my throwaway networks when i spin it up |
| 11:13:01 | <BPCZ> | Yeah I don’t trust kube for the workloads, and neither does google. Iirc they use nomad for some stuff, but the devs I’ve talked to over there say everything falls over when you get into the high hundred thousand messages a second range with nodes |
| 11:13:05 | <nepeat> | been looking at netmaker and got it rolled out for this iteration of my cluster |
| 11:13:11 | <monoxane> | i am currently working on standing up a cluster with 3 nodes in 3 locations connected via ipsec tunnels + bgp + kube-router for shits n gigs |
| 11:13:14 | <neggles> | google use borg, which is not k8s, but is not not k8s |
| 11:13:35 | <BPCZ> | Specific workload, they don’t use borg for it |
| 11:13:41 | <nepeat> | kinda curious, any of you got dashboards for the AT stuff yet? |
| 11:13:43 | <monoxane> | the best one to look at for implementation and scale imo is spotify |
| 11:13:44 | <nepeat> | https://catgirl.host/i/c6s8s.png |
| 11:13:50 | <BPCZ> | They use a few thousand node nomad cluster |
| 11:14:33 | <schwarzkatz|m> | rewby: what do you mean, quiet? |
| 11:14:49 | <monoxane> | they run 98% of workloads in 13 globally distributed clusters with the capability to hard failover any cluster's traffic to any other site in under 5 seconds, they manage it all with an internal tool they're making open source called Backstage |
| 11:15:19 | <BPCZ> | Sounds cool |
| 11:16:10 | <monoxane> | my work's clusters are a fair bit smaller and a completely different ballpark, we just have 6 nodes running ~110 pods total but the application stack is designed to be entirely fault tolerant internally so any service or any node goes down and we're still good |
| 11:16:24 | <monoxane> | most of the clusters are completely offline most of their life too |
| 11:17:16 | <neggles> | there was a big sportsball event you might've heard about recently; i will not elaborate further |
| 11:17:30 | <monoxane> | yea and another, and another :P |
| 11:17:39 | <nepeat> | i can't say anything about what i do but it's reinforced some good ideas for my personal setups, this included |
| 11:17:47 | <nepeat> | oh god not the world cup |
| 11:18:02 | <monoxane> | kube runs the video routing for the superbowl and 80% of global live sports tv |
| 11:18:09 | <BPCZ> | I wish companies would actually rework applications they chuck into kube. I had to de-kube something recently because the company wrapped a stateful system into kube and washed their hands like that would be fine and kube would recover things better than other options |
| 11:18:27 | <monoxane> | oh yea ours is kube from the ground up, you cannot forklift existing workloads into kube and expect it to go well |
| 11:20:52 | <BPCZ> | nepeat: you can just say you professionally scan everyone’s butthole while they sleep it’s ok we get it. Companies just really like to know what our bowels are doing |
| 11:21:21 | <monoxane> | lmao |
| 11:21:26 | <nepeat> | lmao |
| 11:21:29 | <monoxane> | but which type of scan |
| 11:21:42 | <monoxane> | optical or something more exotic like ground penetrating radar through the roof? |
| 11:21:53 | <nepeat> | i just work for a place that inspires creativity and brings joy... |
| 11:22:02 | <monoxane> | narrator: it does not |
| 11:22:46 | <BPCZ> | All the scans, WiFi, roomba radar, brain wave from your sexual partners. If it could detect butthole the kube workload nepeat works on tries to collect it |
| 11:23:58 | <BPCZ> | Mousewitz? |
| 11:24:04 | <@rewby> | schwarzkatz|m: In response to: 11:54 <schwarzkatz|m> why is it so awfully quiet here currently, where is everybody :c |
| 11:27:46 | <BPCZ> | This whole conversation reminds me I need to be planning my next job and figuring out where to live next. |
| 11:28:10 | <BPCZ> | SF or Seattle seem to be the two big options |
| 11:30:49 | <neggles> | nepeat: so bytedance :v |
| 11:30:50 | | le0n quits [Quit: see you later, alligator] |
| 11:32:52 | | le0n (le0n) joins |
| 11:35:16 | <@rewby> | Anyone happen to know the deadline for pixiv? |
| 11:35:29 | <schwarzkatz|m> | What is happening with this dumb matrix thing, sorry for posting duplicate messages |
| 11:36:07 | <schwarzkatz|m> | Deadline was 2022-12-15, that’s when their TOS changed |
| 11:36:09 | <@rewby> | schwarzkatz|m: Yeah your matrix stuff is bork. I tried checking my matrix alt and it's delayed like mad. |
| 11:36:14 | <@rewby> | *Ah* |
| 11:36:18 | <@rewby> | Right okay |
| 11:36:21 | <@rewby> | I'm gonna move some stuff around |
| 11:37:04 | <schwarzkatz|m> | I think it’s only happening in the mobile app though, I have the same problem with discord sometimes |
| 11:40:07 | <@rewby> | monoxane: You were complaining about target limits? Right? |
| 11:42:24 | <@rewby> | Gas gas gas https://s3.services.ams.aperture-laboratories.science/rewby/public/1760112e-f24d-442c-9af6-0dca05f9d9ff/1671622931.615019.png |
| 11:43:05 | <nepeat> | oh neat |
| 11:43:07 | <neggles> | rewby: wound some more capacity in? |
| 11:43:14 | <neggles> | lets see how it looks this side... |
| 11:43:48 | <@rewby> | It's provisioning |
| 11:43:52 | <@rewby> | Just hit it as hard as you can |
| 11:43:55 | <@rewby> | I'll scale it up to meet |
| 11:44:07 | <@rewby> | I've hit the *deploy hetzner cloud* buttons |
| 11:44:43 | <neggles> | "just hit it as hard as you can" <- you may live to regret that |
| 11:45:12 | <@rewby> | Trust me, I've seen worse |
| 11:45:39 | <monoxane> | rewby we have 1.4tbps online right now lol |
| 11:45:48 | <@rewby> | And I have backpressure |
| 11:46:00 | <@rewby> | The system doesn't accept more data than it can take |
| 11:46:13 | <schwarzkatz|m> | Argh I hate mobile apps |
| 11:46:17 | <@rewby> | If you hit a target too hard, it'll just shut off inbound and process what it has on disk |
| 11:46:33 | <@rewby> | I can easily scale this into 16-20 gbps |
| 11:46:33 | <neggles> | more the "scale up to meet" heh |
| 11:46:44 | <@rewby> | Yea I guess |
| 11:46:50 | <@rewby> | I can scale as high as IA has inbound on s3 |
| 11:46:56 | <neggles> | looks like the source is now the limiter |
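The backpressure rewby describes — a target that shuts off inbound once its disks pass a high-water mark and resumes after flushing to IA — boils down to a disk check before accepting a connection. A minimal sketch; the threshold, path, and function name are illustrative, not the actual target code:

```python
import shutil

# Illustrative high-water mark: refuse new rsync connections once the
# spool disk is more than 90% full, resume as uploads free space.
HIGH_WATERMARK = 0.90

def accepting_inbound(spool_path: str = "/") -> bool:
    """Return False when the spool disk is too full to take more data."""
    usage = shutil.disk_usage(spool_path)
    used_fraction = (usage.total - usage.free) / usage.total
    return used_fraction < HIGH_WATERMARK
```

With this shape, workers never need to coordinate globally: they hit a full target, get refused, back off, and retry, so "just hit it as hard as you can" stays safe.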
| 11:48:07 | <@rewby> | Here's your reminder: There's several projects with deadlines in the next 10 days |
| 11:48:18 | <@rewby> | pixiv, uploadir, vlive and buzzvideo are the main ones I know of |
| 11:48:24 | <@rewby> | So throw spare capacity at those |
| 11:48:59 | <BPCZ> | Can’t believe I went home for Christmas and can’t even do this during the holiday |
| 11:49:51 | <neggles> | sounds like we should spin up some more workers pointed at those other projects then |
| 11:50:05 | <neggles> | ~3000 of them seems to be about all pixiv can handle |
| 11:50:32 | <nepeat> | oh man, i see the pixiv spike |
| 11:51:23 | <nepeat> | https://snapshots.raintank.io/dashboard/snapshot/FnHLJL9J0Stas95ZVy4565XdQLv7rJxs |
| 11:51:48 | <@rewby> | I'm not done scaling everything |
| 11:52:03 | <monoxane> | ill have you know we're currently doing 20gbps |
| 11:52:15 | <@rewby> | I'm well aware yes |
| 11:52:17 | <neggles> | pixiv has definitely run out of outbound, can't pull more than 300mbit or so from it on top of what we're hitting |
| 11:52:17 | <@rewby> | I have metrics |
| 11:52:21 | <neggles> | so |
| 11:52:28 | <neggles> | time to point some at the others? |
| 11:52:39 | <@rewby> | Yes |
| 12:01:18 | | apache2 quits [Remote host closed the connection] |
| 12:01:18 | | fishingforsoup quits [Remote host closed the connection] |
| 12:01:18 | | user_ quits [Remote host closed the connection] |
| 12:01:19 | | superkuh__ quits [Remote host closed the connection] |
| 12:01:23 | <monoxane> | uploadir has 0 tasks available, so id say we should focus on the others |
| 12:01:25 | <@rewby> | I have three separate ansibles going on trying to move stuff around |
| 12:01:25 | <neggles> | workin' on it |
| 12:01:25 | | apache2_ joins |
| 12:01:32 | | user_ (gazorpazorp) joins |
| 12:01:35 | | superkuh__ joins |
| 12:01:35 | | fishingforsoup joins |
| 12:02:52 | <@rewby> | 211k out |
| 12:02:54 | <@rewby> | Hm |
| 12:02:57 | <@rewby> | Lemme flush that |
| 12:05:37 | <@rewby> | If you refresh you'll see uploadir tasks |
| 12:06:29 | <monoxane> | cool got em |
| 12:08:27 | <neggles> | just provisioning some more VMs :) |
| 12:22:02 | <@rewby> | Hm. I think I'm hitting an IA bottleneck |
| 12:22:05 | <@rewby> | Lemme investigate |
| 12:24:57 | <monoxane> | IA's only doing 15gbps inbound https://monitor.archive.org/weathermap/weathermap.html |
| 12:25:08 | <monoxane> | its likely its saturated s3 ingress though |
| 12:25:37 | <@rewby> | There's 2 lbs |
| 12:25:41 | <@rewby> | I think 10g each? |
| 12:25:46 | <monoxane> | yeap |
| 12:25:48 | <@rewby> | I need permission to override and use the other one though |
| 12:25:59 | <@rewby> | I have asked, but can't do much until I hear back |
| 12:26:25 | <monoxane> | valid |
| 12:26:33 | <monoxane> | bit of fun eh? :P |
| 12:27:43 | <@rewby> | Just let the targets fill up and workers back off, when I hear back I can get the throughput up |
| 12:28:00 | <@rewby> | My inbound on targets was like 16gbps |
| 12:28:11 | <@rewby> | Right up until disks started filling up |
| 12:28:54 | <monoxane> | we were doing 20.02gbps at peak |
| 12:29:00 | <monoxane> | from the core network |
| 12:34:02 | <@rewby> | On the upside, pixiv and vlive are looking to be done in ~24 hours according to my numbers |
| 12:34:12 | <monoxane> | sweet |
| 12:34:50 | <@rewby> | vlive in <6 hours |
| 12:34:56 | <neggles> | vlive seems to have the most source capacity |
| 12:35:11 | <@rewby> | We also don't have too many items there |
| 12:35:35 | <@rewby> | Either way, when arkiver wakes up he's gonna have a field day finding more items and things to archive |
| 12:36:34 | <neggles> | there are six more spare boxes - much smaller, "only" 8 core, but with 10G links - i've just been handed keys to |
| 12:36:43 | <neggles> | used to be minecraft servers |
| 12:37:35 | <@rewby> | I hit a spicy 18.6gbps inbound just a few ago |
| 12:37:57 | <neggles> | is uploadir stalled/already down? |
| 12:38:12 | <neggles> | seeing basically zero action out of those |
| 12:40:42 | <@rewby> | Shouldn't be. But IIRC there's a speed limit on that |
| 12:43:43 | <neggles> | ah |
| 12:44:05 | | treora quits [Remote host closed the connection] |
| 12:44:06 | | treora joins |
| 12:46:58 | <monoxane> | IA just cracked 20gbps inbound |
| 12:48:02 | | Arcorann_ quits [Ping timeout: 264 seconds] |
| 12:49:38 | <Doomaholic> | Holy crap |
| 12:49:45 | <@rewby> | Over half of that is us |
| 12:49:58 | <@rewby> | tbh, we've done better |
| 12:50:12 | <neggles> | um, maybe silly question but where do the -grab containers store the files between pull/push? |
| 12:50:16 | <neggles> | internally |
| 12:50:26 | <@rewby> | In /grab |
| 12:50:32 | <neggles> | not in a subdir? |
| 12:50:36 | <@rewby> | I don't remember |
| 12:50:39 | <neggles> | fairo |
| 12:50:45 | <neggles> | i guess i opened one that hasn't started yet |
| 12:51:01 | <@rewby> | Data storage is synchronous |
| 12:51:13 | <@rewby> | So it always does download -> upload -> download -> upload |
| 12:51:23 | <@rewby> | So if it's waiting for work, it won't have any data stored |
| 12:51:26 | <neggles> | ah |
| 12:51:39 | <neggles> | some of these video files are big enough for kube to go "hey, you didn't ask for disk" and boot the pods |
| 12:52:00 | <neggles> | ah all under /grab/data excellent |
| 12:52:12 | <monoxane> | yea im currently looking at a cluster thats half evicted pods because they used 15gb of ephemeral storage 😆 |
| 12:52:34 | <@rewby> | Yeah, the video files are big |
| 12:54:34 | <neggles> | ok it's happy now |
| 12:55:38 | <@rewby> | For people who were asking to donate target hw: This is what we do to disks: Data Units Written: 6,793,004,499 [3.47 PB] |
| 12:55:41 | <@rewby> | That's in 1.5 years |
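As a rough sanity check on that endurance figure, 3.47 PB over 1.5 years averages out to several terabytes written per day per disk set:

```python
# rewby's SMART figure: 3.47 PB of data units written in ~1.5 years.
petabytes_written = 3.47
days = 1.5 * 365

tb_per_day = petabytes_written * 1000 / days
print(f"~{tb_per_day:.1f} TB written per day on average")  # ~6.3 TB/day
```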
| 12:55:47 | <madpro|m> | I mean, Archive Team cannot be the only people making software for this nowadays. Can it? |
| 12:55:47 | <madpro|m> | There are tons of companies that do crawling for a business, surely they have open-sourced some more robust trackers by now? |
| 12:55:47 | <madpro|m> | Not that I know, as I have been searching for myself as well for the past 2 years or so. |
| 12:56:13 | <neggles> | anyone with a functional wide-scale web crawler / ripper is not going to hand that out for free |
| 12:56:24 | <neggles> | that's a surefire way to stop it working |
| 12:57:17 | <madpro|m> | I cannot say I'm nearly as skeptical, seeing other projects like Hadoop in distributed computing |
| 12:57:30 | | hackbug (hackbug) joins |
| 12:57:52 | <neggles> | hadoop is not the expensive/proprietary/"magic" part of a hadoop-based workflow though |
| 12:58:13 | <neggles> | it's worthless without the rules and flows and transforms etc |
| 12:58:37 | <neggles> | while the value of a crawler (on a commercial level anyway) comes from being able to skip around things trying to block crawlers |
| 12:58:42 | <@rewby> | The thing with these kind of trackers: They are tied closely to your workflow. |
| 12:58:47 | <@rewby> | They are very specific |
| 12:59:13 | <neggles> | same goes for hadoop setup |
| 12:59:18 | <@rewby> | If you tried to make an end-all-be-all tracker you'd end up with something as complex as kube |
| 12:59:22 | <neggles> | or SAP |
| 12:59:27 | <@rewby> | With a much smaller market |
| 12:59:28 | <madpro|m> | Well there you go |
| 12:59:51 | <@rewby> | So instead people make trackers that are good enough for their workflow |
| 12:59:59 | <neggles> | ERP systems are the perfect example; they do everything for everyone, but they do it by having 27,000 different modules that can be wired together in practically infinite ways |
| 13:00:00 | <@rewby> | But then you end up being very very tied to your company |
| 13:02:21 | <monoxane> | i think pixiv might need a purge too, theres 1m out but it never goes below 99.95k and surely theres not a million jobs being processed rn lmao |
| 13:02:37 | <@rewby> | I'll give it a look in a sec |
| 13:02:44 | <madpro|m> | Better close this tangent, before discussion shifts back to pixiv. |
| 13:02:44 | <@rewby> | I'm actually disappointed in my upload rate |
| 13:02:51 | <@rewby> | I've done 25G to them before |
| 13:03:10 | <@rewby> | I'm not too worried about pixiv |
| 13:04:00 | <@rewby> | Tldr: It doesn't recycle jobs from the out-list until todo is empty |
| 13:04:00 | <@rewby> | And there's 8M in todo |
| 13:04:04 | <monoxane> | im not either its just a bit of a high number |
| 13:04:07 | <monoxane> | ah okay i didnt know that |
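rewby's recycling rule — claimed ("out") items are only handed back out once the todo queue is drained, which is why a large out count is harmless while 8M items remain in todo — might look like this in a minimal tracker (a hypothetical structure, not the actual tracker code):

```python
from collections import deque

class MiniTracker:
    """Items move todo -> out -> done; 'out' only recycles when todo is empty."""
    def __init__(self, items):
        self.todo = deque(items)
        self.out = deque()
        self.done = set()

    def request_item(self):
        if self.todo:
            item = self.todo.popleft()
        elif self.out:
            # Todo drained: re-issue a possibly-stalled claim.
            item = self.out.popleft()
        else:
            return None
        self.out.append(item)
        return item

    def finish_item(self, item):
        if item in self.out:
            self.out.remove(item)
        self.done.add(item)
```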
| 13:04:11 | <@rewby> | Also, monoxane, the bare -grab containers will do concurrency up to 20 |
| 13:04:20 | <monoxane> | yes we know |
| 13:04:23 | <@rewby> | kk |
| 13:04:33 | <madpro|m> | For now, in terms of tracker development we should look to making do with what we have. The IA wiki and GitHub have a long way to go in terms of documentation. |
| 13:04:42 | <madpro|m> | Exploiting our own resources and all that. |
| 13:04:43 | <monoxane> | every single one of the 3000 containers currently running across 20 nodes are on max concurrency |
| 13:04:56 | <@rewby> | Ah okay |
| 13:05:08 | <monoxane> | we are entirely restricted by IA's ingest right now |
| 13:06:14 | <neggles> | "haha kubernetes go brrr" |
| 13:07:26 | <@rewby> | I've redirected vlive to a pile of spinning rust |
| 13:07:32 | <@rewby> | To sink your data into |
| 13:07:48 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
| 13:08:38 | <Doomaholic> | Bless |
| 13:08:46 | <monoxane> | excellent |
| 13:09:18 | <neggles> | ooooh one of these is a 5900X |
| 13:10:10 | <neggles> | https://lounge.neggl.es/uploads/e1333322a07ac379/bueno.png |
| 13:10:45 | <@rewby> | mmmm data https://s3.services.ams.aperture-laboratories.science/rewby/public/09c34280-fa59-4879-aaad-bd50f9a499e3/1671628237.819004.png |
| 13:11:57 | <Doomaholic> | Delicious |
| 13:13:45 | | DavidSaguna joins |
| 13:41:54 | <monoxane> | i think the next bottleneck might actually be rsync connections on the targets |
| 13:42:08 | <monoxane> | like 70% of my pods are sitting here idle waiting to retry dumping |
| 13:42:20 | <monoxane> | it's error 400, not -1, so it's not the disk-full cutoff |
| 13:44:40 | | qwertyasdfuiopghjkl joins |
| 13:56:05 | | sonick (sonick) joins |
| 14:29:55 | | G4te_Keep3r349 joins |
| 14:32:11 | | sec^nd quits [Ping timeout: 245 seconds] |
| 14:38:18 | <@arkiver> | thanks for the ping OrIdow6 |
| 14:38:23 | <@arkiver> | still reading some backlog |
| 14:38:51 | <@arkiver> | monoxane: are there several people running under a single 'team name'? |
| 14:39:20 | <@arkiver> | neggles: feel free to make a PR on the warrior docker image |
| 14:42:35 | | sec^nd (second) joins |
| 14:59:43 | | katocala quits [Remote host closed the connection] |
| 15:06:44 | <@arkiver> | monoxane: if you have a ton of IPs available - telegram could definitely benefit from that, we got quite some backlog to work through |
| 15:07:07 | <@arkiver> | on uploadir - roughly half of the items was 404 |
| 15:16:25 | | MrSolid joins |
| 15:16:35 | <MrSolid> | hi guys |
| 15:16:43 | <MrSolid> | can you please help me archive the website ac-web.org |
| 15:17:06 | | MrSolid leaves |
| 15:17:08 | | MrSolid joins |
| 15:17:55 | <@arkiver> | MrSolid: what is the reason? |
| 15:18:26 | <@arkiver> | ac-web.org is not loading for me |
| 15:18:34 | <MrSolid> | site's been down for months since being sold to new owners and we're trying to migrate to a new community so information isn't lost |
| 15:18:47 | <MrSolid> | https://ac-web.org/index.php |
| 15:18:49 | <@arkiver> | well if the site is down, we can't archive it |
| 15:18:53 | <MrSolid> | its up for me thats odd |
| 15:19:05 | <MrSolid> | maybe someones crawling it right now haha |
| 15:19:25 | <@arkiver> | in any case, sounds like a site we should archive yes |
| 15:19:35 | | programmerq quits [Remote host closed the connection] |
| 15:19:44 | <MrSolid> | thank you arkiver |
| 15:20:27 | <@arkiver> | loading very slowly now |
| 15:21:44 | <MrSolid> | i just hope the new owner doesnt shut the site down again before its archived |
| 15:33:15 | | Island joins |
| 15:41:34 | | MrSolid quits [Remote host closed the connection] |
| 16:08:52 | | DLoader quits [Remote host closed the connection] |
| 16:11:38 | | fishingforsoup_ quits [Remote host closed the connection] |
| 16:11:38 | | DLoader joins |
| 16:11:51 | | fishingforsoup_ joins |
| 16:15:03 | <monoxane> | arkiver not any more, for a little bit yesterday there were a couple people using one name but after we got a harsh no we split and each person controlling a set of nodes is using a different name |
| 16:15:23 | <monoxane> | will switch some of them to telegram in the morning |
| 16:23:29 | | Megame (Megame) joins |
| 16:23:44 | <@arkiver> | monoxane: sounds good - separate names is definitely better for keeping track of who's doing what |
| 16:24:03 | <@arkiver> | and yeah as rewby said, feel free to prepend something to the names to show people as being part of the same group |
| 16:25:22 | <monoxane> | yea will probably do that at some point, the team name thing was more because some people don’t want to be identified so those people are just using team name suffixed with country, identifiable enough for someone to tell ‘em to stop if it’s broken but not to be worked out from the leaderboard |
| 16:26:03 | <@arkiver> | right yeah |
| 16:26:12 | <@arkiver> | so what is this group of people? |
| 16:27:26 | <monoxane> | friends, some of which work at a tier 1 global isp and have some resources at their disposal |
| 16:27:38 | <@arkiver> | pretty awesome |
| 16:28:20 | <monoxane> | the 20gbps we were pulling today didn’t even make a single pixel increase in their usage charts (outside of the routes going to targets and the sources) |
| 16:29:04 | <@arkiver> | watch out that if you were to run a project like the URLs project (outlinks from various sources), it may contain any URL you can find online |
| 16:29:12 | <monoxane> | it was approved because it’s just a fun little load test on their links 😆 |
| 16:29:16 | <@arkiver> | though I'd say it is one of our most valuable projects |
| 16:29:28 | <@arkiver> | telegram is likely very safe to run |
| 16:29:32 | <@arkiver> | hah :) sounds good |
| 16:29:42 | <monoxane> | we will likely only run at full tilt when there’s an “oh fuck” event where we have 24 hours to pull an entire site |
| 16:29:58 | <@arkiver> | alright |
| 16:30:01 | <monoxane> | and just leave a couple nodes running on a range of projects |
| 16:30:07 | <@arkiver> | yeah we have some end of year shutdown going on at the moment |
| 16:30:11 | <monoxane> | >just a couple nodes |
| 16:30:36 | <monoxane> | i say this as if they’re not 100gbe directly attached to an isp core |
| 16:30:49 | <@arkiver> | for the current short term projects, bandwidth is the bottleneck somewhere along the way |
| 16:30:58 | <@arkiver> | but for the long term projects, IPs are the bottleneck |
| 16:31:16 | <monoxane> | yea, we have some potential solutions to the ip bottleneck |
| 16:31:38 | <monoxane> | one of which involves giving a single node an entire /24 😅 |
| 16:32:41 | <@arkiver> | "couple nodes" with each a different /24? |
| 16:32:47 | <@arkiver> | that'd be pretty awesome :) |
| 16:35:52 | <monoxane> | the only problem with that is burning /24s is less justifiable than burning 20gbps out of a 20+tbit network |
| 16:38:00 | <@arkiver> | yeah which is likely the reason as well why our long term projects have IPs as the bottleneck rather than bandwidth |
| 16:38:05 | <@rewby> | I think I'm currently still burning two /24s on telegram |
| 16:38:26 | <@rewby> | Or rather, I'm burning someone else's /24s |
| 16:38:38 | <@arkiver> | rewby: and it is really making a difference! |
| 16:39:56 | <@arkiver> | we're slowly working through the huge telegram backlog |
| 16:40:17 | <@arkiver> | note though that we currently cannot keep up with new discovered group posts (we can only keep up with new discovered channel posts) |
| 16:40:59 | <@arkiver> | i'm stashing the group posts at another project at the moment, https://tracker.archiveteam.org/telegram-groups-temp/ , which now has 4 billion items |
| 16:41:10 | <@arkiver> | so we'll just feed that in slowly whenever there is room |
| 16:41:35 | <@arkiver> | it's already very good we can keep up with channel posts however, we're discovering and archiving many of them |
| 16:45:59 | <Jake> | (I missed quite the night here!) |
| 16:52:47 | | DavidSaguna quits [Read error: Connection reset by peer] |
| 17:14:18 | | atphoenix_ quits [Remote host closed the connection] |
| 17:14:18 | | superkuh__ quits [Remote host closed the connection] |
| 17:14:18 | <mgrandi> | @arkiver: how are you guys doing telegram? The web view of groups ? |
| 17:14:36 | | superkuh__ joins |
| 17:14:43 | <mgrandi> | Also, update on the FA forums, I'm pretty sure that's not what the GDPR means , and also lol, like that's going to stop anyone https://forums.furaffinity.net/threads/forum-closure-fa-discord-coming-soon.1682702/post-7381985 |
| 17:14:50 | | atphoenix_ (atphoenix) joins |
| 17:15:08 | <@arkiver> | mgrandi: yes |
| 17:15:11 | <@arkiver> | on telegram |
| 17:16:56 | | h2ibot quits [Remote host closed the connection] |
| 17:17:09 | | h2ibot (h2ibot) joins |
| 17:17:24 | <ivan> | "I dont know how that works or if it can take as many messages or forum pages this site has." haha |
| 17:18:39 | <mgrandi> | @arkiver: that is the easiest way yeah, I have a lot of experience with tdlib but it's daunting how many things to support so the web view probably is the easiest way for now! |
| 17:23:59 | | systwi quits [Ping timeout: 250 seconds] |
| 17:29:46 | | systwi (systwi) joins |
| 17:47:39 | <schwarzkatz|m> | what is their concern with GDPR on an archived website |
| 17:47:46 | <schwarzkatz|m> | I don't really get it |
| 17:50:20 | <@arkiver> | mgrandi: yeah, and the web view can go into the Wayback Machine |
| 17:50:25 | | Jonimus quits [Ping timeout: 250 seconds] |
| 17:50:49 | <schwarzkatz|m> | according to deathwatch, https://www.zhihu.com/club/explore will stop working on 12-26. it's probably a good idea to archive/grab all links from these pages beforehand |
| 17:51:53 | <@arkiver> | schwarzkatz|m: as in, https://www.zhihu.com/ is shutting down? |
| 17:51:58 | <@arkiver> | hmm |
| 17:52:01 | <@arkiver> | i missed that on deathwatch |
| 17:52:08 | <schwarzkatz|m> | it says only /explore |
| 17:53:07 | <@arkiver> | hmm yeah but I see the entire thing (zhihu.com) is shutting down next year? |
| 17:53:35 | <schwarzkatz|m> | looks like it |
| 17:53:40 | <@arkiver> | fun project |
| 17:58:10 | | Jonimus joins |
| 18:02:29 | <@arkiver> | any idea where 'Circles' comes from in Zhihu Circles? |
| 18:02:45 | <@arkiver> | JAA: do you know if anything was done for the furaffinity forums? |
| 18:03:00 | <schwarzkatz|m> | looks like a bunch of api calls to get, I'll try to grab them from /explore |
| 18:03:39 | | Jonimus quits [Ping timeout: 265 seconds] |
| 18:03:40 | <@arkiver> | the deathwatch page says "Zhihu Circles" is removing that public access, is Zhihu Circles all of zhihu.com ? |
| 18:05:05 | <schwarzkatz|m> | a circle seems to be a /club/[0-9]+ |
| 18:05:34 | <schwarzkatz|m> | so all items on /explore are circles |
| 18:05:47 | <@arkiver> | i see. thank you |
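Given schwarzkatz|m's observation that a circle lives at /club/[0-9]+, discovering items from the /explore pages reduces to a URL match. A sketch under that assumption, not the actual grab code:

```python
import re

# Assumes schwarzkatz|m's pattern holds: circles are /club/<numeric id>.
CLUB_RE = re.compile(r"https?://www\.zhihu\.com/club/(\d+)")

def extract_circle_ids(html: str) -> set[str]:
    """Pull unique circle IDs out of a fetched /explore page or API response."""
    return set(CLUB_RE.findall(html))
```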
| 18:11:45 | | Jonimus joins |
| 18:22:03 | | Jonimus quits [Ping timeout: 250 seconds] |
| 18:36:46 | | sec^nd quits [Ping timeout: 245 seconds] |
| 18:36:50 | <schwarzkatz|m> | here is a list with all api calls for /explore https://transfer.archivete.am/aqUwF/zhihu.explore-api-calls.txt |
| 18:42:39 | | sec^nd (second) joins |
| 18:50:40 | <mgrandi> | schwarzkatz|m: the guy probably doesn't know what he is talking about or is thinking that we would be taking the legit forum database |
| 19:22:14 | | Megame quits [Ping timeout: 264 seconds] |
| 19:25:19 | | Jonimus joins |
| 19:28:45 | <@rewby> | Well then, today was an *experience* |
| 19:28:58 | <@rewby> | We're aware pixiv, vlive, etc are having target issues |
| 19:29:17 | <@rewby> | It's actually intentional |
| 19:54:15 | <@rewby> | HCross and me have paused high activity projects at the moment. We have too much backlogged data to process and we need to be careful with the IA. |
| 19:54:32 | <@rewby> | Announcements in project specific channels in a few. |
| 20:38:26 | | sec^nd quits [Ping timeout: 245 seconds] |
| 20:39:33 | | sec^nd (second) joins |
| 21:11:46 | | sec^nd quits [Ping timeout: 245 seconds] |
| 21:13:20 | | sec^nd (second) joins |
| 21:30:02 | | braindancer joins |
| 21:30:42 | | benjins is now authenticated as benjins |
| 21:36:23 | <neggles> | rewby: ah, sounds like we might've sent it a bit too hard |
| 21:36:35 | | spirit quits [Quit: Leaving] |
| 21:47:04 | | leo60228- quits [Quit: ZNC 1.8.2 - https://znc.in] |
| 21:47:26 | | leo60228 (leo60228) joins |
| 21:48:41 | <@JAA> | arkiver: I thought Fur Affinity was thrown into AB, but apparently not. I'll take a look later. Might also qwarc it. |
| 21:53:37 | <h2ibot> | Arcorann edited Deathwatch (+292, /* 2023 */): https://wiki.archiveteam.org/?diff=49262&oldid=49255 |
| 21:53:39 | <schwarzkatz|m> | JAA: would collecting all thread & subforum urls be helpful? |
| 21:53:40 | <datechnoman> | Worst case throw Fur Affinity in #Y project to be trawled through |
| 21:55:02 | <@JAA> | schwarzkatz|m: Not needed, with qwarc, I'd probably just bruteforce thread IDs anyway. |
| 21:55:42 | <schwarzkatz|m> | with pagination? :O |
| 21:55:53 | <@JAA> | Of course. |
| 21:56:20 | <@JAA> | I've archived XenForo forums before with qwarc, so just need to adjust domains and am probably good to go. |
| 21:56:30 | <schwarzkatz|m> | okay then, let me know if I could help otherwise :) |
| 21:57:07 | <@JAA> | If they can take the load, I could grab it all in hours. Chances are they can't though. :-) |
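JAA's approach — bruteforce thread IDs and walk each thread's pagination — amounts to generating candidate URLs like the following. The domain, ID range, and page cap are placeholders; a real crawl (qwarc or otherwise) would stop paginating per thread based on the responses rather than a fixed cap:

```python
def thread_urls(base, max_thread_id, pages_per_thread):
    """Candidate URLs for a XenForo thread bruteforce: /threads/<id>/
    plus numbered pages; real crawls stop on 404s/redirects instead of
    a fixed pages_per_thread."""
    for tid in range(1, max_thread_id + 1):
        yield f"{base}/threads/{tid}/"
        for page in range(2, pages_per_thread + 1):
            yield f"{base}/threads/{tid}/page-{page}"
```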
| 22:00:32 | <datechnoman> | Fur Affinity appears to be Cloudflare backed including their images so they will be able to process high throughput I'd say |
| 22:04:28 | <@arkiver> | JAA: alright, sounds good |
| 22:04:38 | <@arkiver> | and with bruteforcing thread IDs, you could still somehow get the pages? |
| 22:04:45 | <@arkiver> | outlink can of course go into #// :) |
| 22:04:48 | <@JAA> | datechnoman: That doesn't really mean much as it just depends on what the backend server is. Could be a RPi in someone's closet for all we know. |
| 22:05:44 | <@JAA> | arkiver: Yes, I will get thread pagination. No images etc., but those can be extracted later and fed to #// along with the outlinks, yeah. |
| 22:08:02 | <datechnoman> | Fair call JAA. I guess the CDN helps with throughput and load but the backend processing of the requests is a different story. You can tell my main focus is the #// which is everything all over the place |
| 22:09:51 | <@JAA> | Yeah, it certainly helps with things cached on the CDN. When bruteforcing threads, most won't be in the cache. |
| 22:12:48 | <@arkiver> | sounds good |
| 22:14:16 | | sec^nd quits [Ping timeout: 245 seconds] |
| 22:14:47 | | sec^nd (second) joins |
| 22:21:46 | <@JAA> | schwarzkatz|m: So forum.lacartoonerie.com is NXDOMAIN now. It was down since end of November anyway, but I guess that means it definitely won't be coming back. |
| 22:22:25 | <schwarzkatz|m> | good that we got it then :) |
| 22:22:37 | <@JAA> | Your grab is on IA? |
| 22:22:53 | <@JAA> | The ArchiveBot job didn't get far. |
| 22:23:39 | <schwarzkatz|m> | I thought you got it all :/ |
| 22:24:38 | <@JAA> | No, it got errors pretty soon after I started it. That's why I asked about whether you had also seen timeouts in your crawl. |
| 22:24:46 | <@JAA> | I don't think it managed to retrieve much more after that. |
| 22:26:14 | <@JAA> | Ah, I see https://archive.org/details/forum.lacartoonerie.com-2022-11-11-24a72456-00000.warc |
| 22:26:22 | <@JAA> | Missing the -meta.warc.gz though, do you still have that? |
| 22:26:36 | <schwarzkatz|m> | that's unfortunate then |
| 22:26:36 | <schwarzkatz|m> | my grab is partially in WBM since I at first used SPN exclusively |
| 22:27:19 | <@JAA> | Looks like someone else also did something in September, but it's in WARCZone: https://archive.org/details/warc_forum_lacartoonerie_com_20220927 |
| 22:28:18 | <schwarzkatz|m> | I have deleted all files after I uploaded that, looks like I didn't see that one |
| 22:28:25 | <@JAA> | Oof |
| 22:28:49 | <schwarzkatz|m> | what's in there? |
| 22:29:01 | <@JAA> | Log |
| 22:29:30 | <@JAA> | Less important than the data I suppose, but yeah, please upload it on future grabs. |
| 22:29:52 | <schwarzkatz|m> | will do |
| 22:35:10 | <@JAA> | Do we know of any list of projects on SourceHut that will be removed? If not, can someone try to compile one? https://sourcehut.org/blog/2022-10-31-tos-update-cryptocurrency/ |
| 22:38:55 | | hitgrr8 quits [Client Quit] |
| 22:41:25 | <schwarzkatz|m> | searching for related words turns up maybe less than 20 public repos in total. maybe it's a good idea to get these and then archive all 1058 repos? |
| 22:45:55 | | BlueMaxima joins |
| 22:50:31 | | spirit joins |
| 22:52:54 | <@JAA> | Sounds reasonable. |
| 22:53:30 | <@JAA> | Not sure about archiving all repos actually, but sounds like it shouldn't be too big. Unless there are a dozen copies of Linux and Chromium on it. :-| |
| 22:53:43 | <@arkiver> | "how about we just get everything?" "sounds reasonable" :P |
| 22:54:08 | <@JAA> | https://transfer.archivete.am/inline/bG4mu/aatt.png |
| 22:54:20 | <@arkiver> | hahaha yeah! |
| 22:54:45 | | ThreeHM_ is now known as ThreeHM |
| 22:55:03 | <@JAA> | I will grab all of sr.ht eventually anyway (when that bot is ready), I'm just not entirely certain it's worth doing that now. |
| 22:59:25 | <schwarzkatz|m> | https://transfer.archivete.am/WZZDz/sourcehut.crypto-related.txt |
| 22:59:26 | <@JAA> | Yeah, as expected, there are at least a couple copies of the Linux repo. Those would be duplicated. |
| 22:59:40 | <schwarzkatz|m> | contains also non cryptocurrency stuff, didn't sort that out |
| 22:59:53 | <@JAA> | Is it 1058 repos or 1058 projects? Projects can have multiple repos, I think. |
| 23:00:13 | <schwarzkatz|m> | projects then :D |
| 23:00:28 | <@JAA> | Thanks for the list, will do the magic later. |
| 23:00:56 | <schwarzkatz|m> | great |
| 23:01:17 | <@JAA> | And I might just throw https://sr.ht/ into AB and add aggressive ignores to get a general record of what's on there. |
| 23:02:21 | <@JAA> | The project pages should have some records of the (short) commit IDs, too, which could be used to verify mirrors, for example. |
| 23:02:44 | | mikesteven joins |
| 23:03:01 | <@JAA> | arkiver: Heard anything from GeoLog? |
| 23:03:06 | | mikesteven leaves |
| 23:18:20 | <@JAA> | Ah, the repos are on a separate domain anyway, right. So it'd grab those and not recurse further, which is even better. |
| 23:18:46 | <@JAA> | SourceHut does also support unlisted repos, which would be tricky to find. |
| 23:24:56 | <@arkiver> | JAA: no, nothing |
| 23:25:24 | <@arkiver> | ACTUALLY |
| 23:25:37 | <@arkiver> | got a reply literally few hours ago |
| 23:25:40 | <@arkiver> | :) |
| 23:25:45 | <@JAA> | :-) |
| 23:29:45 | | Ketchup901 quits [Quit: No Ping reply in 180 seconds.] |
| 23:31:29 | | Ketchup901 (Ketchup901) joins |
| 23:37:06 | <Ryz> | Ooo, reply? O: |
| 23:37:38 | <Ryz> | arkiver? |
| 23:39:49 | <pabs> | JAA: #swh folks pointed me at this rejection of an API to list all SourceHut repos: https://lists.sr.ht/~sircmpwn/sr.ht-dev/patches/4859 |
| 23:40:25 | <pabs> | JAA: btw, could you pastebin a link of the sr.ht repos you archive into #swh (libera) so they can grab them too? |