00:01:51 | <szczot3k> | Captcha is only one thing, then you get to things like attacks on the tracker (i.e. getting items, just so they understand what is in the queue, and base their response on that), analyzing the grabbers themselves (limited UA combos, 'byte-level' differences between firefox's network code and python's), to I don't even know what |
00:02:12 | <TheTechRobo> | They absolutely would not go to that extent |
00:02:53 | <szczot3k> | Seen their "AI Audit" function? That's... pretty similar |
00:02:57 | <TheTechRobo> | And I believe TLS fingerprinting is something that ark.iver has been working on for Wget-AT |
00:04:51 | <imer> | it was an issue for #shreddit but thats no more, so might've been put on the back burner |
00:04:56 | <szczot3k> | https://www.cloudflare.com/press-releases/2024/cloudflare-helps-content-creators-regain-control-of-content-from-ai/ replace AI in this article with "Archive", and we're pretty fucked |
00:08:12 | <szczot3k> | While the end goal is vastly different, the tools archive team and ai scrapers share are similar. What's changing is the destination, and motives behind it. |
00:08:47 | <szczot3k> | AB could easily be an AI scraper if you'd change the destination to openai's ingest servers |
00:09:01 | <@arkiver> | yeah |
00:09:47 | <szczot3k> | So Cloudflare IS developing tools to stop these kinds of things, and I really think it should be counted as one of the major threats. Maybe not in 2024, but in some years. |
00:10:03 | <szczot3k> | This AI Audit could easily be repurposed as "Scrape Audit" |
00:10:06 | <steering> | lots of people have had tools to stop this kind of thing for a long time ;) |
00:10:59 | <steering> | I think what's really changing is, there's a spectrum from "block legitimate users" to "allow bad bots", and web sites are moving closer to the "block legitimate users" side |
00:11:06 | <szczot3k> | But for a long time scraping didn't have a really strong financial motive (AI) |
00:11:55 | <joepie91> | in response to some of the comments above, I understand the temptation to go "we should standardize and formalize everything" (coding styles, NGO, whatever) but this really isn't a good fit for what's ultimately a radical project like archiveteam. diversity in approaches and structure is a useful property, it's what allows us to deal with the complexity of the work on a shoestring budget. once you go too hard on the processes and |
00:11:55 | <joepie91> | formalizations, you just end up making it feel more like work, and that's going to burn out contributors and scare away new ones if you don't have reliable funding (which we don't) |
00:12:09 | <steering> | ^ |
00:12:45 | <szczot3k> | joepie91, sure, but there's a spectrum from 'Don't formalize anything' to 'Be a corporation' |
00:12:56 | <szczot3k> | Getting the right balance is important |
00:13:01 | <joepie91> | this implied equivalence between "we need to solve problems" and "we need to formalize things" is a false one; it's an equivalence that exists in corporate and institutional environments but it is not fundamentally equivalent in any way, it's just an approach that makes sense in those specific environments (which is not ours) |
00:13:59 | <joepie91> | szczot3k: sure, but that means that standardization needs to be argued on its merits, not on the implicit value of standardization and formalization, as has been happening here - you actually need to be able to motivate in what way NGO incorporation materially improves the situation (and that includes evaluating the downsides), or how code style rules solve a concrete problem that we can confirm actually exists |
00:14:06 | <joepie91> | and so on |
00:14:26 | <szczot3k> | Go too much for 'Don't formalize anything' and you get code that only one person has access to, who doesn't really understand it, and it fails, and go too much for 'Be a corporation' and Archiveteam isn't archiveteam anymore, and it was bought by OpenAI a year ago ;) |
00:14:38 | <joepie91> | you can't just say "we should formalize things to make it easier" because the "to make it easier" is an unsupported assumption that doesn't hold true outside of institutional environments |
00:14:50 | | etnguyen03 quits [Client Quit] |
00:14:56 | <nicolas17> | damn that's a lot of words I missed |
00:15:19 | <szczot3k> | Can we get an archivebot to archive those messages? /s |
00:15:50 | <joepie91> | szczot3k: I'm not saying that we should not have structure or contingency plans either, I don't think anyone here is arguing that. the point is that these things must be argued on merits. |
00:16:13 | <joepie91> | and things like "we need code style rules to make it easier" are... not that |
00:16:21 | <steering> | nicolas17: and all of them are on-topic :) |
00:16:35 | <szczot3k> | Also - that's kinda ironic that this whole discussion, which also touched on the topic of "People are scared to message on #at, AND on -bs" happened on -ot |
00:16:42 | <joepie91> | while there is very clearly a reason to have contingency plans, like "what if the guy running the tracker gets hit by a bus tomorrow" :) |
00:17:02 | <steering> | szczot3k: be a corporation - >insert freenode scandal here |
00:17:26 | <joepie91> | ah freenode was a 'fun' one to sort out... |
00:18:38 | <@arkiver> | nicolas17: good luck |
00:20:12 | <szczot3k> | It goes both ways. There are projects that die because of lack of an organized entity, and there are projects that die because of an organized entity. I'd just like to reiterate - I'm not advocating for ArchiveTeam Foundation, Inc., I'm advocating for finding the right balance |
00:20:24 | <joepie91> | szczot3k: on the point of 'reaching out', historically it's been very common for people involved in an archiveteam project to e-mail operators with something to the effect of "I am with archiveteam" and proposing to talk about how to archive things without being a nuisance. the success rate for that is fairly decent and to my knowledge nobody ever got questioned on what "with archiveteam" means exacxtly |
00:20:27 | <joepie91> | exactly* |
00:21:10 | <szczot3k> | Sure, don't have a lot of experience with that to be fair |
00:22:29 | <joepie91> | so I think that this is mostly a case of "start a project, make a channel and a wiki page, and send an e-mail with a link to it" |
00:25:00 | <imer> | joepie91: re code style: thats just asking for trouble though if its a free-for-all and you're actually wanting random people to contribute code. (and guidelines can be very loose&added to when theres issues of course) |
00:25:00 | <imer> | its certainly secondary - the first problem being actually having something that people can actively contribute to of course |
00:25:00 | <joepie91> | szczot3k: a broader note on that, if you're interested in starting a project, there's a fair amount of questions I can probably answer so feel free to ask. with the caveat that my knowledge is a bit rusty and I am not up to date on current conventions/policies :) |
00:25:09 | <joepie91> | but for the procedural stuff that may still be helpful |
00:25:43 | <joepie91> | imer: "make the code structured roughly like the existing code" is an entirely viable strategy that I have used and found to work fine |
00:26:03 | | fuzzy80211 quits [Read error: Connection reset by peer] |
00:26:12 | <imer> | thats a code style guideline of sorts at that point though :) |
00:26:28 | | fuzzy80211 (fuzzy80211) joins |
00:26:31 | <joepie91> | and give it an extra review with feedback for long-term infra, or don't for archival scripts since those are temporary anyway |
00:26:40 | <joepie91> | imer: in a very literal sense yes, but basically nobody considers it one |
00:26:55 | <joepie91> | if you say "code style guideline" then people are going to start looking for exact rules :) |
00:27:05 | <joepie91> | (and/or arguing exact rules, which is usually worse) |
00:27:09 | <imer> | words are hard. |
00:27:12 | <imer> | yeah |
00:27:13 | <szczot3k> | well, mpcforum is 18 years old, and is kinda stagnating - "Najnowsze tematy" (Recent topics) has one topic per day, while a couple of years back it was more like tens of topics per hour |
00:27:52 | <joepie91> | szczot3k: does it use standard-ish forum software or is it some custom janky JS stuff? |
00:28:41 | <szczot3k> | Invision forum behind cloudflare |
00:29:22 | <joepie91> | ah. cloudflare |
00:29:35 | <joepie91> | depending on cloudflare settings I think archivebot may be able to deal with it |
00:29:43 | <szczot3k> | Kinda fueled my cloudflare rant ;) |
00:30:19 | <joepie91> | I'd give archivebot a shot first, see what it does with it |
00:30:38 | <szczot3k> | Any AB gods here willing to try it? |
00:32:20 | | joepie91 is not currently in archivebot channel for technical reasons |
00:33:40 | <szczot3k> | I'm not +v nor @ on there, so can't try to do so |
00:34:44 | <sralracer> | you can always ask in #archivebot with a short explanation |
00:35:39 | <katia> | joepie91: what technical reasons? |
00:36:06 | <joepie91> | Synapse-related |
00:36:39 | <katia> | ah, i feel almost sorry |
00:39:13 | | superkuh joins |
00:42:54 | <szczot3k> | <@pokechu22> szczot3k: hmm, I got an immediate cloudflare challenge on it, which would prevent us from running it :| |
00:43:22 | <szczot3k> | Welp, just adding to cloudflare as a threat |
00:44:02 | | sralracer quits [Quit: Ooops, wrong browser tab.] |
00:50:05 | <joepie91> | szczot3k: your best option would then probably be to reach out. I would recommend describing it as "wanting to work with them to make sure the posts are preserved in case something happens" or something along those lines, and making clear upfront that you're flexible on the exact approach, and so if they could just let you know what they would need to make it happen. or something along those lines |
00:50:31 | | etnguyen03 (etnguyen03) joins |
00:50:41 | <thuban> | < TheTechRobo> My 2cents: The lack of openness and maintenance on the infrastructure has been really discouraging for me in the past. |
00:50:44 | <joepie91> | if you can get even a tacit approval from the operators to archive it, that's a good start |
00:50:46 | <thuban> | ditto. the tracker being closed-source. dev being stalled _everywhere_ because of ci being broken. (testing on arm, deploying archivebot fixes, getting seesaw past 3.9, maybe even tracking down that mysterious race condition instead of throwing containers at the problem...) |
00:50:52 | <steering> | > Powered by Invision Community |
00:50:56 | <thuban> | ~nobody except arkiver writes project code, because the existing project code is a totally uncommented accretion and the lack of public tracker code means that even people willing to "put in the time and energy" _can't stand up a test env_ to start figuring it out or cleaning it up, so projects start late or don't get done if arkiver is busy. |
00:51:02 | <thuban> | ~nobody except JAA writes qwarc code, because even if qwarc were documented other people would still not be whitelisted for its output, so qwarc jobs start late or don't get done if JAA is busy. |
00:51:10 | <steering> | is this cloud crap? wonder if the site admins are even the ones running the cloudflare |
00:51:28 | <thuban> | i'd love to increase cooperation with stwp, maybe even integrate our dev work and/or get their projects whitelisted for the wbm, but how can i suggest that they use our infra when i can't use our infra? |
00:51:34 | <steering> | ah, no, looks like it's just new name for same old invision forum |
00:51:34 | <thuban> | i think it is simultaneously true that at does more projects, and more volume, now than it did previously, as arkiver suggests, and that that throughput is insanely fragile |
00:51:47 | <thuban> | imo, an irc interface for managing projects is merely nice to have, whereas getting ci working is _vital_, because it's the only way at can safely accept contributions from outside the 1-digit number of people already running everything |
00:51:52 | <joepie91> | szczot3k: I'd say that things to avoid are a) being demanding, or b) implying it is about to die. and that it's good to emphasize that this is about *public* data only. but that's just how I would personally approach it |
00:52:42 | <@arkiver> | thuban: thanks for that |
00:52:48 | | Notrealname1234 (Notrealname1234) joins |
00:53:01 | <@arkiver> | note OrIdow6 also writes some projects |
00:54:30 | <thuban> | thanks for listening. and yeah, hence "~" |
00:55:03 | <szczot3k> | Creating the project for it scares me, I'm here for just a week ;) |
01:00:08 | | imer wouldn't mind having a go at a project at some point, they're usually not simple/beginner friendly though. too much js these days |
01:01:25 | <joepie91> | szczot3k: I need to go to bed now but may be able to provide some guidance later if wanted |
01:26:04 | | szczot3k4 (szczot3k) joins |
01:27:12 | | etnguyen03 quits [Client Quit] |
01:27:50 | | szczot3k quits [Killed (NickServ (GHOST command used by szczot3k4))] |
01:27:53 | | szczot3k4 is now known as szczot3k |
01:29:04 | | Notrealname1234 quits [Client Quit] |
01:29:14 | | etnguyen03 (etnguyen03) joins |
01:58:57 | <@OrIdow6> | Thank you all for your comments |
01:59:20 | <@OrIdow6> | thuban TheTechRobo: agree/sympathize pretty closely with what you two say |
01:59:26 | <@OrIdow6> | And several others |
02:01:06 | <@OrIdow6> | It's true that ArchiveTeam is working now, but it's brittle in a way that I think will have a hard time adapting to the future |
02:01:57 | <@OrIdow6> | I'm going to sound like a company here but - AI scraping, increasing use of JS, the partitioning of the public Internet by governments and big cloud providers, all of these may change what happens here and none of them are things we are really equipped to deal with |
02:03:52 | <@OrIdow6> | What exists now is super effective - a huge amount of scale on warriors and trackers, and petabytes of data that has been for enormous public benefit |
02:04:17 | <@OrIdow6> | And capable of getting fairly good and fast coverage of things |
02:06:03 | <@OrIdow6> | But it's all basically early-2010s technology (ArchiveTeam started off without a tracker - people manually "tracked" items on the wiki pages!) and I think it might work *so* well that we're hyper-optimizing into a particular set of cases and one day some law will get passed or the IA will make a change in policy or something like that and it'll dissolve |
02:07:08 | <@OrIdow6> | A large part of ArchiveTeam is effectively a branch of the Internet Archive and this has caused a kind of institutionalization |
02:08:50 | <@OrIdow6> | And that's also the biggest issue because to some extent archiveteam IS a branch of the IA - it would not have survived for nearly this long without that distinguishing feature I think |
02:09:11 | <@OrIdow6> | The ability to get things into the Wayback Machine |
02:11:51 | <@OrIdow6> | I don't know what the solution is; I have had fantasies of rewriting basically all systems from the ground up and letting the existing stuff continue to run separately |
02:15:13 | <@OrIdow6> | It's true that has its own difficulties |
02:17:49 | <@OrIdow6> | And there may be a better solution |
02:18:02 | <@OrIdow6> | But my impression reading this group's history is that its original advantage wasn't infrastructure scale, but the way it concentrated so much energy into one place, and I would like to see that continue to be the case |
02:50:44 | <steering> | hmmmmm... I can't sign up for OVH because US isn't an option \o/ |
03:00:15 | <steering> | hey there we go, had to go to a different site because obviously |
03:03:39 | | etnguyen03 quits [Client Quit] |
03:07:45 | | etnguyen03 (etnguyen03) joins |
03:08:49 | | tzt quits [Remote host closed the connection] |
03:09:08 | | tzt (tzt) joins |
03:10:46 | <Ryz> | I've been reading and skimming through the conversation going on here for a bit; as someone who is so specialized in #archivebot activities (in terms of finding content, archiving content, managing existing ongoing archives, etc), but still keeps a lookout of ongoing activity within ArchiveTeam, from my experience being in this group over the past |
03:10:47 | <Ryz> | 5 years since joining, for the Archive Team in general, it has its ebbs and flows of activity, but admittedly it is kinda slowing down more and more |
03:13:48 | <Ryz> | As for #archivebot side, archiving over there has increased a lot, but damn a lot, especially within the past year or two; there's constant archiving happening nearly every hour of the day now as of recently; the number of people helping to find stuff and being recruited in, though, has gradually slowed down too, as personally, I |
03:13:48 | <Ryz> | think a wide range of people specializing in different interests would be able to archive lesser-known content much more thoroughly |
03:14:22 | <Ryz> | It doesn't help with this year being so maddening with pipeline crashes, pipeline losses and stuff, probably moreso than last year's |
03:17:16 | <Ryz> | I feel there's an increased need for programmers to help things out within the ArchiveTeam |
03:18:39 | <Ryz> | As for more reachability, I do recall one attempt of an official ArchiveTeam Discord server years ago, but I believe something happened on that front |
03:22:33 | <Ryz> | When is CI going to start running by the way? |
03:27:38 | | etnguyen03 quits [Client Quit] |
03:29:31 | | etnguyen03 (etnguyen03) joins |
03:30:49 | <Ryz> | I'm wondering to what extent the wiki needs polishing or revamping or something~ |
03:45:13 | <nicolas17> | Ryz: "It doesn't help with this year being so maddening with pipeline crashes" |
03:45:16 | <nicolas17> | and the IA outage |
03:46:41 | <Ryz> | ...Oh yes, dearly yes, the IA outage Z: |
03:53:28 | <Flashfire42> | I have my expertises but trust me you dont want me touching code Ryz |
03:57:07 | <Ryz> | I know you well enough that code is not a thing for you, it's your esoteric knowledge of goodies and looties in that particular realm you've been in >#<; |
04:14:00 | | etnguyen03 quits [Remote host closed the connection] |
06:29:35 | | BlueMaxima quits [Quit: Leaving] |
08:21:16 | | pixel leaves [Error from remote client] |
09:00:53 | | ScenarioPlanet (ScenarioPlanet) joins |
09:57:10 | | pixel (pixel) joins |
10:02:35 | | sralracer (sralracer) joins |
10:53:40 | | tzt quits [Ping timeout: 260 seconds] |
11:00:29 | | tzt (tzt) joins |
11:18:38 | | ducky quits [Ping timeout: 260 seconds] |
11:20:40 | | ducky (ducky) joins |
11:47:18 | | sec^nd quits [Remote host closed the connection] |
11:47:31 | | sec^nd (second) joins |
12:00:03 | | Bleo182600722719623 quits [Quit: The Lounge - https://thelounge.chat] |
12:02:44 | | Bleo182600722719623 joins |
12:13:52 | | MrMcNuggets (MrMcNuggets) joins |
12:17:37 | | HP_Archivist quits [Read error: Connection reset by peer] |
12:17:50 | | HP_Archivist (HP_Archivist) joins |
12:55:58 | | etnguyen03 (etnguyen03) joins |
13:20:24 | | etnguyen03 quits [Client Quit] |
13:23:44 | | i_have_n0_idea quits [Read error: Connection reset by peer] |
13:24:13 | | i_have_n0_idea (i_have_n0_idea) joins |
13:27:58 | | etnguyen03 (etnguyen03) joins |
14:24:44 | | i_have_n0_idea quits [Read error: Connection reset by peer] |
14:25:05 | | i_have_n0_idea (i_have_n0_idea) joins |
14:42:55 | | katocala quits [Ping timeout: 260 seconds] |
14:43:07 | | katocala joins |
14:49:55 | | katocala quits [Ping timeout: 260 seconds] |
14:50:35 | | katocala joins |
14:56:46 | | etnguyen03 quits [Client Quit] |
15:27:44 | | i_have_n0_idea quits [Read error: Connection reset by peer] |
15:28:07 | | i_have_n0_idea (i_have_n0_idea) joins |
16:07:57 | | PredatorIWD2 quits [Read error: Connection reset by peer] |
16:10:25 | | PredatorIWD2 joins |
17:02:10 | | fangfufu quits [Quit: ZNC 1.8.2+deb3.1+deb12u1 - https://znc.in] |
17:06:29 | | Notrealname1234 (Notrealname1234) joins |
17:07:09 | | Notrealname1234 quits [Client Quit] |
17:16:22 | | Notrealname1234 (Notrealname1234) joins |
17:19:39 | | etnguyen03 (etnguyen03) joins |
17:33:40 | | Notrealname1234 quits [Client Quit] |
17:52:28 | | pixel leaves [Error from remote client] |
17:52:32 | | pixel (pixel) joins |
18:04:19 | | yasomi quits [Quit: ZNC 1.9.1 - https://znc.in] |
18:05:45 | | yasomi (yasomi) joins |
18:30:41 | | MrMcNuggets quits [Quit: WeeChat 4.3.2] |
18:43:40 | | etnguyen03 quits [Client Quit] |
19:15:20 | | etnguyen03 (etnguyen03) joins |
20:27:09 | | sralracer quits [Quit: Ooops, wrong browser tab.] |
21:09:34 | | BennyOtt_ joins |
21:10:50 | | BennyOtt quits [Ping timeout: 260 seconds] |
21:13:54 | | BennyOtt_ is now known as BennyOtt |
21:13:54 | | BennyOtt is now authenticated as BennyOtt |
21:26:38 | <ymgve_> | you did it! https://old.reddit.com/r/DataHoarder/comments/1h3kjg0/tomorrow_netflix_is_nuking_2024_remaining/ |
21:26:59 | <ymgve_> | or maybe not archiveteam |
21:38:27 | | BlueMaxima joins |
22:14:32 | | BlueMaxima quits [Read error: Connection reset by peer] |
22:14:58 | | szczot3k0 (szczot3k) joins |
22:17:55 | | szczot3k quits [Ping timeout: 260 seconds] |
22:17:55 | | szczot3k0 is now known as szczot3k |
22:49:54 | | f_ is now known as funderscore |
22:49:56 | | funderscore is now known as f_ |
23:00:37 | | pixel leaves [Disconnected: Replaced by new connection] |
23:00:41 | | pixel (pixel) joins |
23:08:50 | | pie_ quits [] |
23:08:55 | | pie_ joins |
23:20:57 | | kpcyrd is now authenticated as kpcyrd |
23:20:57 | | kpcyrd quits [Changing host] |
23:20:57 | | kpcyrd (kpcyrd) joins |
23:53:05 | | fangfufu joins |
23:53:35 | | wickedplayer494 quits [Ping timeout: 260 seconds] |
23:53:54 | | wickedplayer494 joins |
23:54:39 | | wickedplayer494 is now authenticated as wickedplayer494 |
23:55:27 | | etnguyen03 quits [Quit: Konversation terminated!] |