00:01:51<szczot3k>Captcha is only one thing, then you get to things like attacks on the tracker (i.e. getting items, just so they understand what is in the queue, and base their response on that), analyzing the grabbers themselves (limited UA combos, 'byte-level' differences between Firefox's network code and Python's), to I don't even know what
00:02:12<TheTechRobo>They absolutely would not go to that extent
00:02:53<szczot3k>Seen their "AI Audit" function? That's... pretty similar
00:02:57<TheTechRobo>And I believe TLS fingerprinting is something that ark.iver has been working on for Wget-AT
00:04:51<imer>it was an issue for #shreddit but thats no more, so might've been put on the back burner
00:04:56<szczot3k>https://www.cloudflare.com/press-releases/2024/cloudflare-helps-content-creators-regain-control-of-content-from-ai/ replace AI in this article with "Archive", and we're pretty fucked
00:08:12<szczot3k>While the end goal is vastly different, the tools archive team and AI scrapers share are similar. What's changing is the destination, and the motives behind it.
00:08:47<szczot3k>AB could easily be an AI scraper if you'd change the destination to openai's ingest servers
00:09:01<@arkiver>yeah
00:09:47<szczot3k>So Cloudflare IS developing tools to stop these kinds of things, and I really think it should be counted as one of the major threats. Maybe not in 2024, but in some years.
00:10:03<szczot3k>This AI Audit could easily be repurposed as "Scrape Audit"
00:10:06<steering>lots of people have had tools to stop this kind of things for a long time ;)
00:10:59<steering>I think what's really changing is, there's a spectrum from "block legitimate users" to "allow bad bots", and web sites are moving closer to the "block legitimate users" side
00:11:06<szczot3k>But for a long time scraping didn't have a really strong financial motive (AI)
00:11:55<joepie91>in response to some of the comments above, I understand the temptation to go "we should standardize and formalize everything" (coding styles, NGO, whatever) but this really isn't a good fit for what's ultimately a radical project like archiveteam. diversity in approaches and structure is a useful property, it's what allows us to deal with the complexity of the work on a shoestring budget. once you go too hard on the processes and
00:11:55<joepie91>formalizations, you just end up making it feel more like work, and that's going to burn out contributors and scare away new ones if you don't have reliable funding (which we don't)
00:12:09<steering>^
00:12:45<szczot3k>joepie91, sure, but there's a spectrum from 'Don't formalize anything' to 'Be a corporation'
00:12:56<szczot3k>Getting the right balance is important
00:13:01<joepie91>this implied equivalence between "we need to solve problems" and "we need to formalize things" is a false one; it's an equivalence that exists in corporate and institutional environments but it is not fundamentally equivalent in any way, it's just an approach that makes sense in those specific environments (which is not ours)
00:13:59<joepie91>szczot3k: sure, but that means that standardization needs to be argued on its merits, not on the implicit value of standardization and formalization, as has been happening here - you actually need to be able to motivate in what way NGO incorporation materially improves the situation (and that includes evaluating the downsides), or how code style rules solve a concrete problem that we can confirm actually exists
00:14:06<joepie91>and so on
00:14:26<szczot3k>Go too much for 'Don't formalize anything' and you get code that only one person has access to and doesn't really understand, and it fails, and go too much for 'Be a corporation' and Archiveteam isn't archiveteam anymore, and it was bought by OpenAI a year ago ;)
00:14:38<joepie91>you can't just say "we should formalize things to make it easier" because the "to make it easier" is an unsupported assumption that doesn't hold true outside of institutional environments
00:14:50etnguyen03 quits [Client Quit]
00:14:56<nicolas17>damn that's a lot of words I missed
00:15:19<szczot3k>Can we get an archivebot to archive those messages? /s
00:15:50<joepie91>szczot3k: I'm not saying that we should not have structure or contingency plans either, I don't think anyone here is arguing that. the point is that these things must be argued on merits.
00:16:13<joepie91>and things like "we need code style rules to make it easier" are... not that
00:16:21<steering>nicolas17: and all of them are on-topic :)
00:16:35<szczot3k>Also - that's kinda ironic that this whole discussion, which also touched on the topic of "People are scared to message on #at, AND on -bs" happened on -ot
00:16:42<joepie91>while there is very clearly a reason to have contingency plans, like "what if the guy running the tracker gets hit by a bus tomorrow" :)
00:17:02<steering>szczot3k: be a corporation - >insert freenode scandal here
00:17:26<joepie91>ah freenode was a 'fun' one to sort out...
00:18:38<@arkiver>nicolas17: good luck
00:20:12<szczot3k>It goes both ways. There are projects that die because of lack of an organized entity, and there are projects that die because of an organized entity. I'd just like to reiterate - I'm not advocating for ArchiveTeam Foundation, Inc., I'm advocating for finding the right balance
00:20:24<joepie91>szczot3k: on the point of 'reaching out', historically it's been very common for people involved in an archiveteam project to e-mail operators with something to the effect of "I am with archiveteam" and proposing to talk about how to archive things without being a nuisance. the success rate for that is fairly decent and to my knowledge nobody ever got questioned on what "with archiveteam" means exacxtly
00:20:27<joepie91>exactly*
00:21:10<szczot3k>Sure, don't have a lot of experience with that to be fair
00:22:29<joepie91>so I think that this is mostly a case of "start a project, make a channel and a wiki page, and send an e-mail with a link to it"
00:25:00<imer>joepie91: re code style: that's just asking for trouble though if it's a free-for-all and you're actually wanting random people to contribute code. (and guidelines can be very loose & added to when there's issues of course)
00:25:00<imer>its certainly secondary - the first problem being actually having something that people can actively contribute to of course
00:25:00<joepie91>szczot3k: a broader note on that, if you're interested in starting a project, there's a fair amount of questions I can probably answer so feel free to ask. with the caveat that my knowledge is a bit rusty and I am not up to date on current conventions/policies :)
00:25:09<joepie91>but for the procedural stuff that may still be helpful
00:25:43<joepie91>imer: "make the code structured roughly like the existing code" is an entirely viable strategy that I have used and found to work fine
00:26:03fuzzy80211 quits [Read error: Connection reset by peer]
00:26:12<imer>thats a code style guideline of sorts at that point though :)
00:26:28fuzzy80211 (fuzzy80211) joins
00:26:31<joepie91>and give it an extra review with feedback for long-term infra, or don't for archival scripts since those are temporary anyway
00:26:40<joepie91>imer: in a very literal sense yes, but basically nobody considers it one
00:26:55<joepie91>if you say "code style guideline" then people are going to start looking for exact rules :)
00:27:05<joepie91>(and/or arguing exact rules, which is usually worse)
00:27:09<imer>words are hard.
00:27:12<imer>yeah
00:27:13<szczot3k>well, mpcforum is 18 years old, and is kinda stagnating - "Najnowsze tematy" (Recent topics) has one topic per day, while a couple of years back it was more like tens of topics per hour
00:27:52<joepie91>szczot3k: does it use standard-ish forum software or is it some custom janky JS stuff?
00:28:41<szczot3k>Invision forum behind cloudflare
00:29:22<joepie91>ah. cloudflare
00:29:35<joepie91>depending on cloudflare settings I think archivebot may be able to deal with it
00:29:43<szczot3k>Kinda fueled my cloudflare rant ;)
00:30:19<joepie91>I'd give archivebot a shot first, see what it does with it
00:30:38<szczot3k>Any AB gods here willing to try it?
00:32:20joepie91 is not currently in archivebot channel for technical reasons
00:33:40<szczot3k>I'm not +v nor @ on there, so can't try to do so
00:34:44<sralracer>you can always ask in #archivebot with a short explanation
00:35:39<katia>joepie91: what technical reasons?
00:36:06<joepie91>Synapse-related
00:36:39<katia>ah, i feel almost sorry
00:39:13superkuh joins
00:42:54<szczot3k><@pokechu22> szczot3k: hmm, I got an immediate cloudflare challenge on it, which would prevent us from running it :|
00:43:22<szczot3k>Welp, just adding to cloudflare as a threat
00:44:02sralracer quits [Quit: Ooops, wrong browser tab.]
00:50:05<joepie91>szczot3k: your best option would then probably be to reach out. I would recommend describing it as "wanting to work with them to make sure the posts are preserved in case something happens" or something along those lines, and making clear upfront that you're flexible on the exact approach, and so if they could just let you know what they would need to make it happen. or something along those lines
00:50:31etnguyen03 (etnguyen03) joins
00:50:41<thuban>< TheTechRobo> My 2cents: The lack of openness and maintenance on the infrastructure has been really discouraging for me in the past.
00:50:44<joepie91>if you can get even a tacit approval from the operators to archive it, that's a good start
00:50:46<thuban>ditto. the tracker being closed-source. dev being stalled _everywhere_ because of ci being broken. (testing on arm, deploying archivebot fixes, getting seesaw past 3.9, maybe even tracking down that mysterious race condition instead of throwing containers at the problem...)
00:50:52<steering>> Powered by Invision Community
00:50:56<thuban>~nobody except arkiver writes project code, because the existing project code is a totally uncommented accretion and the lack of public tracker code means that even people willing to "put in the time and energy" _can't stand up a test env_ to start figuring it out or cleaning it up, so projects start late or don't get done if arkiver is busy.
00:51:02<thuban>~nobody except JAA writes qwarc code, because even if qwarc were documented other people would still not be whitelisted for its output, so qwarc jobs start late or don't get done if JAA is busy.
00:51:10<steering>is this cloud crap? wonder if the site admins are even the ones running the cloudflare
00:51:28<thuban>i'd love to increase cooperation with stwp, maybe even integrate our dev work and/or get their projects whitelisted for the wbm, but how can i suggest that they use our infra when i can't use our infra?
00:51:34<steering>ah, no, looks like it's just new name for same old invision forum
00:51:34<thuban>i think it is simultaneously true that at does more projects, and more volume, now than it did previously, as arkiver suggests, and that that throughput is insanely fragile
00:51:47<thuban>imo, an irc interface for managing projects is merely nice to have, whereas getting ci working is _vital_, because it's the only way at can safely accept contributions from outside the 1-digit number of people already running everything
00:51:52<joepie91>szczot3k: I'd say that things to avoid are a) being demanding, or b) implying it is about to die. and that it's good to emphasize that this is about *public* data only. but that's just how I would personally approach it
00:52:42<@arkiver>thuban: thanks for that
00:52:48Notrealname1234 (Notrealname1234) joins
00:53:01<@arkiver>note OrIdow6 also writes some projects
00:54:30<thuban>thanks for listening. and yeah, hence "~"
00:55:03<szczot3k>Creating the project for it scares me, I'm here for just a week ;)
01:00:08imer wouldn't mind having a go at a project at some point, they're usually not simple/beginner friendly though. too much js these days
01:01:25<joepie91>szczot3k: I need to go to bed now but may be able to provide some guidance later if wanted
01:26:04szczot3k4 (szczot3k) joins
01:27:12etnguyen03 quits [Client Quit]
01:27:50szczot3k quits [Killed (NickServ (GHOST command used by szczot3k4))]
01:27:53szczot3k4 is now known as szczot3k
01:29:04Notrealname1234 quits [Client Quit]
01:29:14etnguyen03 (etnguyen03) joins
01:58:57<@OrIdow6>Thank you all for your comments
01:59:20<@OrIdow6>thuban TheTechRobo: agree/sympathize pretty closely with what you two say
01:59:26<@OrIdow6>And several others
02:01:06<@OrIdow6>It's true that ArchiveTeam is working now, but it's brittle in a way that I think will have a hard time adapting to the future
02:01:57<@OrIdow6>I'm going to sound like a company here but - AI scraping, increasing use of JS, the partitioning of the public Internet by governments and big cloud providers, all of these may change what happens here and none we are really equipped to deal with
02:03:52<@OrIdow6>What exists now is super effective - a huge amount of scale on warriors and trackers, and petabytes of data that has been for enormous public benefit
02:04:17<@OrIdow6>And capable of getting fairly good and fast coverage of things
02:06:03<@OrIdow6>But it's all basically early-2010s technology (ArchiveTeam started off without a tracker - people manually "tracked" items on the wiki pages!) and I think it might work *so* well that we're hyper-optimizing into a particular set of cases and one day some law will get passed or the IA will make a change in policy or something like that and it'll dissolve
02:07:08<@OrIdow6>A large part of ArchiveTeam is effectively a branch of the Internet Archive and this has caused a kind of institutionalization
02:08:50<@OrIdow6>And that's also the biggest issue because to some extent archiveteam IS a branch of the IA - it would not have survived for nearly this long without that distinguishing feature I think
02:09:11<@OrIdow6>The ability to get things into the Wayback Machine
02:11:51<@OrIdow6>I don't know what the solution is; I have had fantasies of rewriting basically all systems from the ground up and letting the existing stuff continue to run separately
02:15:13<@OrIdow6>It's true that has its own difficulties
02:17:49<@OrIdow6>And there may be a better solution
02:18:02<@OrIdow6>But my impression reading this group's history is that its original advantage wasn't infrastructure scale, but the way it concentrated so much energy into one place, and I would like to see that continue to be the case
02:50:44<steering>hmmmmm... I can't sign up for OVH because US isn't an option \o/
03:00:15<steering>hey there we go, had to go to a different site because obviously
03:03:39etnguyen03 quits [Client Quit]
03:07:45etnguyen03 (etnguyen03) joins
03:08:49tzt quits [Remote host closed the connection]
03:09:08tzt (tzt) joins
03:10:46<Ryz>I've been reading and skimming through the conversation going on here for a bit; as someone who is so specialized in #archivebot activities (in terms of finding content, archiving content, managing existing ongoing archives, etc), but still keeps a lookout of ongoing activity within ArchiveTeam, from my experience being in this group over the past
03:10:47<Ryz>5 years since joining, for the Archive Team in general, it has its ebbs and flows of activity, but admittedly it is kinda slowing down more and more
03:13:48<Ryz>As for #archivebot side, archiving over there has increased a lot, but damn a lot, especially within the past year or two; there's constant archiving happening nearly every hour of a day now as of recently; the amount of people though on helping out finding out stuff and being recruited in, it's been slowed down too gradually, as personally, I
03:13:48<Ryz>think a wide range of people specializing in different interests is needed to be able to archive less known content much more thoroughly
03:14:22<Ryz>It doesn't help with this year being so maddening with pipeline crashes, pipeline losses and stuff, probably moreso than last year's
03:17:16<Ryz>I feel there's an increased need for programmers to help things out within the ArchiveTeam
03:18:39<Ryz>As for more reachability, I do recall one attempt of an official ArchiveTeam Discord server years ago, but I believe something happened on that front
03:22:33<Ryz>When is CI going to start running by the way?
03:27:38etnguyen03 quits [Client Quit]
03:29:31etnguyen03 (etnguyen03) joins
03:30:49<Ryz>I'm wondering to what extent the wiki needs polishing or revamping or something~
03:45:13<nicolas17>Ryz: "It doesn't help with this year being so maddening with pipeline crashes"
03:45:16<nicolas17>and the IA outage
03:46:41<Ryz>...Oh yes, dearly yes, the IA outage Z:
03:53:28<Flashfire42>I have my areas of expertise but trust me you don't want me touching code Ryz
03:57:07<Ryz>I know you well enough that code is not a thing for you, it's your esoteric knowledge of goodies and looties in that particular realm you've been in >#<;
04:14:00etnguyen03 quits [Remote host closed the connection]
06:29:35BlueMaxima quits [Quit: Leaving]
08:21:16pixel leaves [Error from remote client]
09:00:53ScenarioPlanet (ScenarioPlanet) joins
09:57:10pixel (pixel) joins
10:02:35sralracer (sralracer) joins
10:53:40tzt quits [Ping timeout: 260 seconds]
11:00:29tzt (tzt) joins
11:18:38ducky quits [Ping timeout: 260 seconds]
11:20:40ducky (ducky) joins
11:47:18sec^nd quits [Remote host closed the connection]
11:47:31sec^nd (second) joins
12:00:03Bleo182600722719623 quits [Quit: The Lounge - https://thelounge.chat]
12:02:44Bleo182600722719623 joins
12:13:52MrMcNuggets (MrMcNuggets) joins
12:17:37HP_Archivist quits [Read error: Connection reset by peer]
12:17:50HP_Archivist (HP_Archivist) joins
12:55:58etnguyen03 (etnguyen03) joins
13:20:24etnguyen03 quits [Client Quit]
13:23:44i_have_n0_idea quits [Read error: Connection reset by peer]
13:24:13i_have_n0_idea (i_have_n0_idea) joins
13:27:58etnguyen03 (etnguyen03) joins
14:24:44i_have_n0_idea quits [Read error: Connection reset by peer]
14:25:05i_have_n0_idea (i_have_n0_idea) joins
14:42:55katocala quits [Ping timeout: 260 seconds]
14:43:07katocala joins
14:49:55katocala quits [Ping timeout: 260 seconds]
14:50:35katocala joins
14:56:46etnguyen03 quits [Client Quit]
15:27:44i_have_n0_idea quits [Read error: Connection reset by peer]
15:28:07i_have_n0_idea (i_have_n0_idea) joins
16:07:57PredatorIWD2 quits [Read error: Connection reset by peer]
16:10:25PredatorIWD2 joins
17:02:10fangfufu quits [Quit: ZNC 1.8.2+deb3.1+deb12u1 - https://znc.in]
17:06:29Notrealname1234 (Notrealname1234) joins
17:07:09Notrealname1234 quits [Client Quit]
17:16:22Notrealname1234 (Notrealname1234) joins
17:19:39etnguyen03 (etnguyen03) joins
17:33:40Notrealname1234 quits [Client Quit]
17:52:28pixel leaves [Error from remote client]
17:52:32pixel (pixel) joins
18:04:19yasomi quits [Quit: ZNC 1.9.1 - https://znc.in]
18:05:45yasomi (yasomi) joins
18:30:41MrMcNuggets quits [Quit: WeeChat 4.3.2]
18:43:40etnguyen03 quits [Client Quit]
19:15:20etnguyen03 (etnguyen03) joins
20:27:09sralracer quits [Quit: Ooops, wrong browser tab.]
21:09:34BennyOtt_ joins
21:10:50BennyOtt quits [Ping timeout: 260 seconds]
21:13:54BennyOtt_ is now known as BennyOtt
21:26:38<ymgve_>you did it! https://old.reddit.com/r/DataHoarder/comments/1h3kjg0/tomorrow_netflix_is_nuking_2024_remaining/
21:26:59<ymgve_>or maybe not archiveteam
21:38:27BlueMaxima joins
22:14:32BlueMaxima quits [Read error: Connection reset by peer]
22:14:58szczot3k0 (szczot3k) joins
22:17:55szczot3k quits [Ping timeout: 260 seconds]
22:17:55szczot3k0 is now known as szczot3k
22:49:54f_ is now known as funderscore
22:49:56funderscore is now known as f_
23:00:37pixel leaves [Disconnected: Replaced by new connection]
23:00:41pixel (pixel) joins
23:08:50pie_ quits []
23:08:55pie_ joins
23:20:57kpcyrd quits [Changing host]
23:20:57kpcyrd (kpcyrd) joins
23:53:05fangfufu joins
23:53:35wickedplayer494 quits [Ping timeout: 260 seconds]
23:53:54wickedplayer494 joins
23:55:27etnguyen03 quits [Quit: Konversation terminated!]