| 00:00:50 | | superkuh joins |
| 00:17:51 | | michaelblob (michaelblob) joins |
| 00:22:03 | | michaelblob_ quits [Ping timeout: 268 seconds] |
| 01:50:33 | | mut4ntm0nkey quits [Ping timeout: 255 seconds] |
| 01:51:52 | | mut4ntm0nkey (mutantmonkey) joins |
| 02:19:17 | | michaelblob_ (michaelblob) joins |
| 02:23:08 | | michaelblob quits [Ping timeout: 265 seconds] |
| 03:26:31 | | michaelblob (michaelblob) joins |
| 03:30:45 | | michaelblob_ quits [Ping timeout: 268 seconds] |
| 04:43:31 | | march_happy quits [Ping timeout: 268 seconds] |
| 04:44:22 | | march_happy (march_happy) joins |
| 04:48:02 | <pabs> | woa person who died: https://twitter.com/galagrrr https://gallaghersmash.com/ https://en.wikipedia.org/wiki/Gallagher_(comedian) |
| 04:48:11 | <pabs> | er s/woa/a/ |
| 05:34:27 | | holbrooke quits [Client Quit] |
| 06:24:55 | | BlueMaxima quits [Client Quit] |
| 08:53:09 | <mgrandi> | betamax: neat! That going up on wbm when done? |
| 08:53:47 | <mgrandi> | OrIdow6: and I think it's for their own ai generation software, third parties will just ignore it |
| 09:13:48 | | michaelblob_ (michaelblob) joins |
| 09:17:56 | | michaelblob quits [Ping timeout: 268 seconds] |
| 09:29:02 | | march_happy quits [Ping timeout: 268 seconds] |
| 09:29:43 | | march_happy (march_happy) joins |
| 10:02:20 | | janh quits [Ping timeout: 268 seconds] |
| 10:30:17 | | h3ndr1k quits [Quit: ] |
| 10:31:48 | | Hackerpcs quits [Quit: Hackerpcs] |
| 10:34:13 | | Hackerpcs (Hackerpcs) joins |
| 10:47:31 | | T31M is now authenticated as T31M |
| 10:51:54 | | yawkat quits [Ping timeout: 255 seconds] |
| 10:57:30 | <Barto> | pabs: i covered him earlier |
| 11:16:04 | | tech_exorcist (tech_exorcist) joins |
| 12:17:53 | | yawkat (yawkat) joins |
| 12:20:08 | <betamax> | mgrandi: unfortunately as a third-party generated WARC it can't go on the wbm |
| 12:20:36 | <betamax> | all the WARCs will go to an item on archive.org, similar to what I did for the 2018 US midterms: https://archive.org/details/2018_us_midterm_campaign_site_archive |
| 12:20:54 | <schwarzkatz|m> | wait, grab-site WARCs cannot go into the WBM? |
| 12:21:04 | | sonick (sonick) joins |
| 12:21:15 | <schwarzkatz|m> | *does that mean |
| 12:37:22 | | h3ndr1k (h3ndr1k) joins |
| 12:37:28 | <@JAA> | They can, but not when uploaded by random users. |
| 12:53:46 | | Arcorann quits [Ping timeout: 268 seconds] |
| 12:53:54 | | tech_exorcist quits [Remote host closed the connection] |
| 12:54:40 | | tech_exorcist (tech_exorcist) joins |
| 12:58:08 | | Ketchup901 quits [Remote host closed the connection] |
| 13:04:20 | | Ketchup901 (Ketchup901) joins |
| 13:18:26 | | dasineura2 quits [Ping timeout: 268 seconds] |
| 13:25:01 | | dasineura2 (dasineura) joins |
| 13:31:09 | | Ketchup901 quits [Remote host closed the connection] |
| 13:31:29 | | Ketchup901 (Ketchup901) joins |
| 13:35:42 | | dasineura2 quits [Ping timeout: 268 seconds] |
| 13:44:22 | | holbrooke joins |
| 13:45:06 | | holbrooke quits [Client Quit] |
| 14:03:37 | | holbrooke joins |
| 14:04:28 | | holbrooke quits [Client Quit] |
| 14:11:57 | | dasineura2 (dasineura) joins |
| 15:00:25 | | holbrooke joins |
| 15:01:24 | | holbrooke quits [Client Quit] |
| 15:38:08 | <schwarzkatz|m> | <JAA> "They can, but not when uploaded..." <- I suppose I would be considered a random user then, right? I just finished grabbing forum.lacartoonerie.com yesterday and it's currently uploading. prior to that, I used a script to automate SPN2 with all the urls... If the archive I made isn't accepted, grabbing it is kind of a waste :/ |
| 15:41:59 | <@JAA> | schwarzkatz|m: Yes, you would. If the WBM just accepted random WARCs, anyone could falsify history there, and it'd be useless. |
| 15:42:25 | <@JAA> | That said, those WARCs can still be useful. Anyone can download them and view them locally with something like pywb. |
| 15:42:48 | <@JAA> | It's up to the downloader then to judge whether they trust the uploader or not. |
| 15:43:43 | <schwarzkatz|m> | that's understandable. so I'll continue to use my script then for stuff I want to have in the WBM |
| 15:44:48 | <@JAA> | But then there aren't any downloadable WARCs. :-/ |
| 15:49:59 | <schwarzkatz|m> | if it's not in the wbm it's useless to me lol |
| 15:55:07 | | holbrooke joins |
| 15:55:47 | <@JAA> | Yeah, that's why it's best to let a trusted person run the actual archival so that the WARCs do end up in the WBM but are also accessible. |
| 15:57:01 | <schwarzkatz|m> | yep, agreed |
| 16:03:15 | | dasineura2 quits [Remote host closed the connection] |
| 16:03:30 | | dasineura2 (dasineura) joins |
| 16:07:20 | | michaelblob (michaelblob) joins |
| 16:07:25 | | michaelblob_ quits [Remote host closed the connection] |
| 16:07:35 | | holbrooke quits [Client Quit] |
| 16:14:48 | | march_happy quits [Ping timeout: 268 seconds] |
| 16:15:39 | | holbrooke joins |
| 16:15:44 | | march_happy (march_happy) joins |
| 16:27:04 | | holbrooke quits [Client Quit] |
| 16:33:03 | | holbrooke joins |
| 16:40:05 | | holbrooke quits [Ping timeout: 268 seconds] |
| 16:46:01 | | mut4ntm0nkey quits [Remote host closed the connection] |
| 16:46:01 | | HackMii_ quits [Read error: Connection reset by peer] |
| 16:46:39 | | HackMii_ (hacktheplanet) joins |
| 16:46:52 | | mut4ntm0nkey (mutantmonkey) joins |
| 17:32:41 | <@Sanqui> | oh no... wpull is not python 3.10 compatible |
| 17:33:17 | <@rewby> | Sounds on brand for our code |
| 17:33:39 | <@Sanqui> | seems like an easy fix though |
| 17:33:46 | <@JAA> | It's not compatible with *any* supported Python version. |
| 17:33:50 | <@JAA> | 3.6 is the max. |
| 17:34:09 | <@Sanqui> | nvm then |
| 17:34:15 | <@Sanqui> | the 3.10 import would have been an easy fix |
| 17:34:28 | <@JAA> | Yeah, there's more stuff, like the 'async' keyword. |
| 17:34:30 | <@Sanqui> | conveniently, I have 3.5 installed |
| 17:34:30 | <@JAA> | Not sure what else. |
| 17:34:33 | <schwarzkatz|m> | grab-site runs wpull on 3.8.13 tho |
| 17:34:44 | <@JAA> | Yes, because it doesn't run wpull but ludios-wpull. |
| 17:35:16 | <schwarzkatz|m> | Yeah, but isn’t that the one that is to be used anyways? |
| 17:35:29 | <@JAA> | But hey, the WikiTeam tools are still Python 2 only, so this isn't that bad! |
| 17:35:32 | <@JAA> | :-) |
| 17:37:20 | <@Sanqui> | we really oughta get paid for this stuff |
| 17:37:49 | <@JAA> | Sign me up. |
| 17:39:05 | <@Sanqui> | you should be first in line indeed |
| 17:39:23 | <@JAA> | wpull: Python 3.6 maximum, 2.0.x is basically unusable standalone, lots of fun bugs |
| 17:40:02 | <@JAA> | ludios_wpull: newer Python, integrated into grab-site, not packaged on PyPI or similar, not sure about which bugs were fixed or how usable it is standalone |
| 17:41:03 | <@JAA> | I should resume my wpull work, but there have always been more urgent things in the past few years. :-| |
| 17:41:48 | <@Sanqui> | it's understandable |
| 17:42:49 | <@Sanqui> | honestly, I'm beginning to think that for sustainability's sake, we should start avoiding relying on projects that need continuous maintenance that we are the sole users of. I wonder what wpull could be replaced with. |
| 17:43:52 | <@JAA> | Not sure that's possible really. There just isn't much software in this niche. |
| 17:45:20 | | march_happy quits [Ping timeout: 265 seconds] |
| 17:45:23 | <@Sanqui> | there's a lot of software for working with web technologies |
| 17:46:01 | <@JAA> | Sure, but WARCs not so much, and half the software around it that does exist sucks in major ways. |
| 17:46:10 | <@Sanqui> | obviously I don't want to sound like I have a solution on hand though when I don't (would have to enumerate everything that wpull *is* even useful for first) |
| 17:46:42 | <@JAA> | wget 1 produces buggy WARCs, wget 2 doesn't support writing them at all. warcio is fucked. And so on. |
| 17:46:55 | <@Sanqui> | I'd drop a "WARCs suck in major ways" but I need to try to stop being negative |
| 17:47:28 | <@JAA> | WARCs in general are fine, although there are some very sharp edges in the specs. |
| 17:48:00 | <@JAA> | Unfortunately, they borrow quite a bit from the HTTP specs, which have their own downsides and ambiguities. |
| 17:48:11 | <@JAA> | And some parts of the WARC spec are ridiculously overengineered. |
| 17:50:25 | <@JAA> | When I started writing my own implementation of it, I also realised that apparently nobody ever did that based on the spec (as opposed to existing implementations) before, nor were the specs apparently reviewed as critically as I would expect from an ISO standard. There are a few fairly glaring ambiguities in the syntax. |
| 17:51:32 | <@JAA> | Obviously, such things would contribute to poor implementations. |
| 17:51:38 | <@JAA> | Cf. the whole angle brackets mess. |
| 17:53:59 | | Ketchup901 quits [Remote host closed the connection] |
| 17:54:22 | | Ketchup901 (Ketchup901) joins |
| 17:55:55 | <@JAA> | But I do agree with you; maintaining an entire HTTP and FTP implementation, for example, seems silly. |
| 17:56:47 | <@JAA> | Perhaps something like curl/wget/whatever + warcprox would be a better route. (Assuming warcprox is solid, never looked at it much so far.) |
| 17:57:37 | <@JAA> | There are performance aspects as well though. Not sure how performant warcprox is, but I kind of doubt it'll beat qwarc. |
| 18:01:52 | <@Sanqui> | I've used puppeteer with warcprox :) |
| 18:02:14 | <@Sanqui> | at the point you're running a whole chrome browser, performance becomes less critical |
| 18:04:34 | | Ketchup901 quits [Remote host closed the connection] |
| 18:04:54 | | Ketchup901 (Ketchup901) joins |
| 18:22:21 | | Church quits [Ping timeout: 255 seconds] |
| 18:24:11 | <@JAA> | Yeah, sure, and brozzler's a thing as well. For that kind of archival, the WARC writing is the least of your worries runtime-wise. |
| 18:49:02 | | tech_exorcist quits [Client Quit] |
| 18:49:04 | <TheTechRobo> | Warcprox is fairly fast for me, at least as used in #burnthetwitch . I haven't noticed any slowdown caused by it. |
| 18:57:01 | | Church (Church) joins |
| 19:05:30 | | tech_exorcist (tech_exorcist) joins |
| 20:05:40 | | Minkafighter7 quits [Quit: The Lounge - https://thelounge.chat] |
| 20:06:22 | | Minkafighter7 joins |
| 20:10:03 | | sonick quits [Client Quit] |
| 20:38:59 | <lennier1> | Does anyone know the details of Twitter's plans to hide tweets from unverified accounts? I've been skeptical of an actual exodus from the site, but that would probably do it. Maybe this is already out of date since they've suspended signups to Twitter Blue due to scam accounts, but this is what I'm talking about. "Twitter will eventually default to displaying tweets from Twitter Blue subscribers, while tweets from users |
| 20:38:59 | <lennier1> | who do not pay for a blue check mark, he said, would be relegated to a separate page on the site and effectively buried unless viewers sought out that material." https://www.cnn.com/2022/11/09/tech/musk-twitter-brands-interview |
| 20:41:17 | <schwarzkatz|m> | while I don't know the specifics, reading this hot garbage brings this to mind: |
| 20:41:17 | <schwarzkatz|m> | I hope someone goes out of their way to make a userscript to filter all tweets from checkmark accounts out :D |
| 20:50:02 | <joepie91|m> | nftwat filter 2.0 |
| 20:54:26 | | onetruth joins |
| 21:08:07 | <@OrIdow6> | Looks like this is where that was said https://twitter.com/i/spaces/1RDGlabMNOgJL , hour long and no clear way to download transcripts |
| 21:12:27 | | mut4ntm0nkey quits [Ping timeout: 255 seconds] |
| 21:12:38 | <@OrIdow6> | It would be nice for us to at least develop some kind of plan for Twitter, or a series of plans based on the various scenarios that could result from this |
| 21:15:28 | | mut4ntm0nkey (mutantmonkey) joins |
| 21:17:17 | <@OrIdow6> | Like, dimension 1: {Users stay there, user exodus, Twitter shuts down} |
| 21:18:28 | <@OrIdow6> | Dimension 2: {Data retained publicly as currently, data stays public but locations change (and may cause playback/discoverability issues), data is deleted or taken out of public view} |
| 21:22:12 | <@OrIdow6> | Really what concerns me is a short-notice move between one of the categories of dimension 2 |
| 21:22:18 | <@OrIdow6> | s/one of // |
| 21:25:11 | <@OrIdow6> | Or that a gradual dimension-1 death should be as much a circus as this thing has been so far, and we're unable to recognize it until (again) some final you-have-3-days announcement |
| 21:28:31 | <JTL> | Would be bad if the infra starts crumbling in subtle ways (e.g. "old" tweets start erroring out or whatever) |
| 21:37:48 | <@OrIdow6> | I guess? Though I'm usually of the opinion that technical indicators like that add little information to what can be gotten from more normal ones |
| 21:39:27 | <@OrIdow6> | Unless you mean as part of some deliberate plot to get rid of old posts |
| 21:50:15 | | march_happy (march_happy) joins |
| 22:08:55 | | tech_exorcist quits [Remote host closed the connection] |
| 22:09:15 | | tech_exorcist (tech_exorcist) joins |
| 22:15:41 | <IDK> | could we create a dedicated channel for twitter, like #twatter |
| 22:21:05 | <@Sanqui> | sweb.cz update: I have 141 thousand domains & many more urls per domain, many of them are already dead though. 3 weeks before demise, any ideas other than starting archivebot jobs? I'm thinking 2-4k domains per file, ~4 jobs running, no offsite should do it. |
| 22:21:55 | <lennier1> | Myspace would be an example of a company cost cutting to the point they accidentally lost data, though I don't think that's the most likely outcome. But yes, I could imagine Twitter changes that break the search functionality snscrape relies on, or put some posts behind a login wall, or even a paywall. Search is already broken for old posts by private accounts even if you follow them, and at some point in the last few years |
| 22:21:55 | <lennier1> | they started requiring a login to view some adult content. In general they're much worse about nagging you to log in than they used to be. |
| 22:23:51 | <JTL> | iirc nitter somehow gets the URL for loginwalled adult media |
| 22:26:16 | <@Sanqui> | anyways, feel free to PM me thoughts, I'm going to sleep now, will start the jobs tomorrow morning when I can monitor / adjust them |
| 22:30:43 | <@JAA> | TheTechRobo: Do you have a rough number for the request rate you've thrown at it? |
| 22:40:04 | | tech_exorcist quits [Remote host closed the connection] |
| 22:40:13 | | tech_exorcist_ (tech_exorcist) joins |
| 22:44:15 | | mut4ntm0nkey quits [Ping timeout: 255 seconds] |
| 22:47:53 | | mut4ntm0nkey (mutantmonkey) joins |
| 22:49:45 | <@Sanqui|m> | P.s. If you feel like deriving some interesting crawls for sweb.cz and *.sweb.cz urls please send them my way, thanks! |
| 22:55:36 | | tech_exorcist_ quits [Client Quit] |
| 22:56:26 | <@OrIdow6> | Deriving? |
| 23:00:26 | <TheTechRobo> | JAA: Admittedly not a high load. One or two requests per second. |
| 23:01:01 | <TheTechRobo> | (Trying not to get Twitch people knocking down my door.) |
| 23:08:35 | | march_happy quits [Ping timeout: 268 seconds] |
| 23:09:00 | | march_happy (march_happy) joins |
| 23:10:05 | | lunik17 quits [Quit: Ping timeout (120 seconds)] |
| 23:11:07 | | @Sanqui quits [Quit: .] |
| 23:14:16 | | lunik17 joins |
| 23:15:08 | | Sanqui joins |
| 23:15:10 | | Sanqui is now authenticated as Sanqui |
| 23:15:10 | | Sanqui quits [Changing host] |
| 23:15:10 | | Sanqui (Sanqui) joins |
| 23:15:10 | | @ChanServ sets mode: +o Sanqui |
| 23:18:08 | <@JAA> | TheTechRobo: Ah right, yeah, there'd be something seriously wrong if warcprox couldn't handle that. qwarc easily does about two orders of magnitude more. |
| 23:18:58 | <TheTechRobo> | qwarc documentation when :-) |
| 23:22:43 | <@JAA> | When it reaches version 1.0. :-P |
| 23:23:03 | <TheTechRobo> | qwarc reaching 1.0 when :P |
| 23:23:29 | <@JAA> | When I write the documentation. :-) |
| 23:24:05 | <TheTechRobo> | RecursionError: maximum recursion depth exceeded |
| 23:24:53 | <@JAA> | sys.setrecursionlimit(float('Infinity')) |
| 23:25:27 | <TheTechRobo> | qwarc documentation when :-) |
| 23:25:31 | | michaelblob segfaults |
| 23:25:38 | | Arcorann (Arcorann) joins |
| 23:26:06 | | mut4ntm0nkey quits [Ping timeout: 255 seconds] |
| 23:37:23 | | mut4ntm0nkey (mutantmonkey) joins |
| 23:40:46 | | mut4ntm0nkey quits [Remote host closed the connection] |
| 23:41:05 | | mut4ntm0nkey (mutantmonkey) joins |
| 23:42:51 | <@JAA> | More serious answer: I want to do several fundamental changes to qwarc before it can be considered anything close to stable. The current WARC implementation is acceptable (since I replaced warcio), but the plan is to replace it with pywarc once that's ready. And the hacky aiohttp stack needs to go. I've been working on the former, but it's slow-going. |
| 23:48:27 | | BlueMaxima joins |
| 23:52:39 | | Nulo quits [Ping timeout: 255 seconds] |
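
Note on the 17:34 exchange about wpull and the 'async' keyword: `async` and `await` became reserved keywords in Python 3.7, so code that uses `async` as an ordinary name no longer parses on newer interpreters. The sketch below only illustrates that point. The snippets in it are hypothetical and not taken from wpull, and the `parses` helper is an assumption made for this example.

```python
import keyword
import sys


def parses(source: str) -> bool:
    """Return True if `source` compiles on the running interpreter."""
    try:
        # compile() only parses the text; nothing is imported or executed.
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


# Hypothetical snippets (not from wpull) that use `async` as a plain name.
as_variable = "async = 1"
as_attribute = "import asyncio\ntask = asyncio.async"
# The same attribute access with a non-reserved name, for comparison.
replacement = "import asyncio\ntask = asyncio.ensure_future"

print("Python", sys.version_info[:2])
print("async is a keyword:", keyword.iskeyword("async"))
print("variable named async parses:", parses(as_variable))
print("asyncio.async parses:", parses(as_attribute))
print("asyncio.ensure_future parses:", parses(replacement))
```

On Python 3.7 or later the two `async` snippets should be rejected by the parser while the `ensure_future` one is accepted. This covers only the keyword issue; as noted in the log, there may be other incompatibilities.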