00:00:50superkuh joins
00:17:51michaelblob (michaelblob) joins
00:22:03michaelblob_ quits [Ping timeout: 268 seconds]
01:50:33mut4ntm0nkey quits [Ping timeout: 255 seconds]
01:51:52mut4ntm0nkey (mutantmonkey) joins
02:19:17michaelblob_ (michaelblob) joins
02:23:08michaelblob quits [Ping timeout: 265 seconds]
03:26:31michaelblob (michaelblob) joins
03:30:45michaelblob_ quits [Ping timeout: 268 seconds]
04:43:31march_happy quits [Ping timeout: 268 seconds]
04:44:22march_happy (march_happy) joins
04:48:02<pabs>woa person who died: https://twitter.com/galagrrr https://gallaghersmash.com/ https://en.wikipedia.org/wiki/Gallagher_(comedian)
04:48:11<pabs>er s/woa/a/
05:34:27holbrooke quits [Client Quit]
06:24:55BlueMaxima quits [Client Quit]
08:53:09<mgrandi>betamax: neat! That going up on wbm when done?
08:53:47<mgrandi>OrIdow6: and I think it's for their own ai generation software, third parties will just ignore it
09:13:48michaelblob_ (michaelblob) joins
09:17:56michaelblob quits [Ping timeout: 268 seconds]
09:29:02march_happy quits [Ping timeout: 268 seconds]
09:29:43march_happy (march_happy) joins
10:02:20janh quits [Ping timeout: 268 seconds]
10:30:17h3ndr1k quits [Quit: ]
10:31:48Hackerpcs quits [Quit: Hackerpcs]
10:34:13Hackerpcs (Hackerpcs) joins
10:51:54yawkat quits [Ping timeout: 255 seconds]
10:57:30<Barto>pabs: i covered him earlier
11:16:04tech_exorcist (tech_exorcist) joins
12:17:53yawkat (yawkat) joins
12:20:08<betamax>mgrandi: unfortunately as a third-party generated WARC it can't go on the wbm
12:20:36<betamax>all the WARCs will go to an item on archive.org, similar to what I did for the 2018 US midterms: https://archive.org/details/2018_us_midterm_campaign_site_archive
12:20:54<schwarzkatz|m>wait, grab-site WARCs cannot go into the WBM?
12:21:04sonick (sonick) joins
12:21:15<schwarzkatz|m>*does that mean
12:37:22h3ndr1k (h3ndr1k) joins
12:37:28<@JAA>They can, but not when uploaded by random users.
12:53:46Arcorann quits [Ping timeout: 268 seconds]
12:53:54tech_exorcist quits [Remote host closed the connection]
12:54:40tech_exorcist (tech_exorcist) joins
12:58:08Ketchup901 quits [Remote host closed the connection]
13:04:20Ketchup901 (Ketchup901) joins
13:18:26dasineura2 quits [Ping timeout: 268 seconds]
13:25:01dasineura2 (dasineura) joins
13:31:09Ketchup901 quits [Remote host closed the connection]
13:31:29Ketchup901 (Ketchup901) joins
13:35:42dasineura2 quits [Ping timeout: 268 seconds]
13:44:22holbrooke joins
13:45:06holbrooke quits [Client Quit]
14:03:37holbrooke joins
14:04:28holbrooke quits [Client Quit]
14:11:57dasineura2 (dasineura) joins
15:00:25holbrooke joins
15:01:24holbrooke quits [Client Quit]
15:38:08<schwarzkatz|m><JAA> "They can, but not when uploaded..." <- I suppose I would be considered a random user then, right? I just finished grabbing forum.lacartoonerie.com yesterday and it's currently uploading. prior to that, I used a script to automate SPN2 with all the urls... If the archive I made isn't accepted, grabbing it is kind of a waste :/
15:41:59<@JAA>schwarzkatz|m: Yes, you would. If the WBM just accepted random WARCs, anyone could falsify history there, and it'd be useless.
15:42:25<@JAA>That said, those WARCs can still be useful. Anyone can download them and view them locally with something like pywb.
15:42:48<@JAA>It's up to the downloader then to judge whether they trust the uploader or not.
15:43:43<schwarzkatz|m>that's understandable. so I'll continue to use my script then for stuff I want to have in the WBM
15:44:48<@JAA>But then there aren't any downloadable WARCs. :-/
15:49:59<schwarzkatz|m>if it's not in the wbm it's useless to me lol
15:55:07holbrooke joins
15:55:47<@JAA>Yeah, that's why it's best to let a trusted person run the actual archival so that the WARCs do end up in the WBM but are also accessible.
15:57:01<schwarzkatz|m>yep, agreed
16:03:15dasineura2 quits [Remote host closed the connection]
16:03:30dasineura2 (dasineura) joins
16:07:20michaelblob (michaelblob) joins
16:07:25michaelblob_ quits [Remote host closed the connection]
16:07:35holbrooke quits [Client Quit]
16:14:48march_happy quits [Ping timeout: 268 seconds]
16:15:39holbrooke joins
16:15:44march_happy (march_happy) joins
16:27:04holbrooke quits [Client Quit]
16:33:03holbrooke joins
16:40:05holbrooke quits [Ping timeout: 268 seconds]
16:46:01mut4ntm0nkey quits [Remote host closed the connection]
16:46:01HackMii_ quits [Read error: Connection reset by peer]
16:46:39HackMii_ (hacktheplanet) joins
16:46:52mut4ntm0nkey (mutantmonkey) joins
17:32:41<@Sanqui>oh no... wpull is not python 3.10 compatible
17:33:17<@rewby>Sounds on brand for our code
17:33:39<@Sanqui>seems like an easy fix though
17:33:46<@JAA>It's not compatible with *any* supported Python version.
17:33:50<@JAA>3.6 is the max.
17:34:09<@Sanqui>nvm then
17:34:15<@Sanqui>the 3.10 import would have been an easy fix
17:34:28<@JAA>Yeah, there's more stuff, like the 'async' keyword.
17:34:30<@Sanqui>conveniently, I have 3.5 installed
17:34:30<@JAA>Not sure what else.
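For context on the 'async' point above: `async` and `await` became reserved keywords in Python 3.7, so older code that uses `async` as an ordinary name no longer parses on newer interpreters. A minimal sketch for checking this on whichever interpreter is at hand; the helper name `parses` and the sample snippet are made up for illustration and are not taken from wpull:

```python
import keyword
import sys


def parses(source: str) -> bool:
    """Return True if `source` compiles on the running interpreter."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


# Hypothetical legacy code: 'async' used as a plain variable name.
legacy_snippet = "async = 1"

print("Python", ".".join(map(str, sys.version_info[:2])))
print("'async' is a reserved keyword:", keyword.iskeyword("async"))
print("legacy snippet parses:", parses(legacy_snippet))
```

On an interpreter where `async` is reserved, the snippet fails to compile, which is the class of breakage mentioned here; it says nothing about what else in wpull would need changing.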
17:34:33<schwarzkatz|m>grab-site runs wpull on 3.8.13 tho
17:34:44<@JAA>Yes, because it doesn't run wpull but ludios-wpull.
17:35:16<schwarzkatz|m>Yeah, but isn’t that the one that is to be used anyways?
17:35:29<@JAA>But hey, the WikiTeam tools are still Python 2 only, so this isn't that bad!
17:35:32<@JAA>:-)
17:37:20<@Sanqui>we really oughta get paid for this stuff
17:37:49<@JAA>Sign me up.
17:39:05<@Sanqui>you should be first in line indeed
17:39:23<@JAA>wpull: Python 3.6 maximum, 2.0.x is basically unusable standalone, lots of fun bugs
17:40:02<@JAA>ludios_wpull: newer Python, integrated into grab-site, not packaged on PyPI or similar, not sure about which bugs were fixed or how usable it is standalone
17:41:03<@JAA>I should resume my wpull work, but there have always been more urgent things in the past few years. :-|
17:41:48<@Sanqui>it's understandable
17:42:49<@Sanqui>honestly, I'm beginning to think that for sustainability's sake, we should start avoiding relying on projects that need continuous maintenance that we are the sole users of. I wonder what wpull could be replaced with.
17:43:52<@JAA>Not sure that's possible really. There just isn't much software in this niche.
17:45:20march_happy quits [Ping timeout: 265 seconds]
17:45:23<@Sanqui>there's a lot of software for working with web technologies
17:46:01<@JAA>Sure, but WARCs not so much, and half the software around it that does exist sucks in major ways.
17:46:10<@Sanqui>obviously I don't want to sound like I have a solution on hand though when I don't (would have to enumerate everything that wpull *is* even useful for first)
17:46:42<@JAA>wget 1 produces buggy WARCs, wget 2 doesn't support writing them at all. warcio is fucked. And so on.
17:46:55<@Sanqui>I'd drop a "WARCs suck in major ways" but I need to try to stop being negative
17:47:28<@JAA>WARCs in general are fine, although there are some very sharp edges in the specs.
17:48:00<@JAA>Unfortunately, they borrow quite a bit from the HTTP specs, which have their own downsides and ambiguities.
17:48:11<@JAA>And some parts of the WARC spec are ridiculously overengineered.
17:50:25<@JAA>When I started writing my own implementation of it, I also realised that apparently nobody ever did that based on the spec (as opposed to existing implementations) before, nor were the specs apparently reviewed as critically as I would expect from an ISO standard. There are a few fairly glaring ambiguities in the syntax.
17:51:32<@JAA>Obviously, such things would contribute to poor implementations.
17:51:38<@JAA>Cf. the whole angle brackets mess.
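The chat does not spell out the angle-bracket issue. One reading, taken here purely as an assumption, is that some WARC writers wrap the `WARC-Target-URI` value in angle brackets while others write it bare, so a tolerant reader has to accept both forms. A minimal sketch of that normalisation; the function name is hypothetical and not part of any library mentioned in this log:

```python
def strip_angle_brackets(value: str) -> str:
    """Return a header value without one pair of surrounding angle brackets.

    Assumption: the value is a URI-like WARC header such as WARC-Target-URI,
    which may appear either as '<http://example.org/>' or 'http://example.org/'.
    """
    value = value.strip()
    if len(value) >= 2 and value.startswith("<") and value.endswith(">"):
        return value[1:-1]
    return value


for raw in ("<http://example.org/>", "http://example.org/", " <urn:uuid:1234> "):
    print(repr(raw), "->", repr(strip_angle_brackets(raw)))
```

Only a matched outer pair is removed, so a value with a stray bracket on one side is passed through unchanged rather than guessed at.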
17:53:59Ketchup901 quits [Remote host closed the connection]
17:54:22Ketchup901 (Ketchup901) joins
17:55:55<@JAA>But I do agree with you; maintaining an entire HTTP and FTP implementation, for example, seems silly.
17:56:47<@JAA>Perhaps something like curl/wget/whatever + warcprox would be a better route. (Assuming warcprox is solid, never looked at it much so far.)
17:57:37<@JAA>There are performance aspects as well though. Not sure how performant warcprox is, but I kind of doubt it'll beat qwarc.
18:01:52<@Sanqui>I've used puppeteer with warcprox :)
18:02:14<@Sanqui>at the point you're running a whole chrome browser, performance becomes less critical
18:04:34Ketchup901 quits [Remote host closed the connection]
18:04:54Ketchup901 (Ketchup901) joins
18:22:21Church quits [Ping timeout: 255 seconds]
18:24:11<@JAA>Yeah, sure, and brozzler's a thing as well. For that kind of archival, the WARC writing is the least of your worries runtime-wise.
18:49:02tech_exorcist quits [Client Quit]
18:49:04<TheTechRobo>Warcprox is fairly fast for me, at least as used in #burnthetwitch . I haven't noticed any slowdown caused by it.
18:57:01Church (Church) joins
19:05:30tech_exorcist (tech_exorcist) joins
20:05:40Minkafighter7 quits [Quit: The Lounge - https://thelounge.chat]
20:06:22Minkafighter7 joins
20:10:03sonick quits [Client Quit]
20:38:59<lennier1>Does anyone know the details of Twitter's plans to hide tweets from unverified accounts? I've been skeptical of an actual exodus from the site, but that would probably do it. Maybe this is already out of date since they've suspended signups to Twitter Blue due to scam accounts, but this is what I'm talking about. "Twitter will eventually default to displaying tweets from Twitter Blue subscribers, while tweets from users
20:38:59<lennier1>who do not pay for a blue check mark, he said, would be relegated to a separate page on the site and effectively buried unless viewers sought out that material." https://www.cnn.com/2022/11/09/tech/musk-twitter-brands-interview
20:41:17<schwarzkatz|m>while I don't know the specifics, reading this hot garbage brings this to mind:
20:41:17<schwarzkatz|m>I hope someone goes out of their way to make a userscript to filter all tweets from checkmark accounts out :D
20:50:02<joepie91|m>nftwat filter 2.0
20:54:26onetruth joins
21:08:07<@OrIdow6>Looks like this is where that was said https://twitter.com/i/spaces/1RDGlabMNOgJL , hour long and no clear way to download transcripts
21:12:27mut4ntm0nkey quits [Ping timeout: 255 seconds]
21:12:38<@OrIdow6>It would be nice for us to at least develop some kind of plan for Twitter, or a series of plans based on the various scenarios that could result from this
21:15:28mut4ntm0nkey (mutantmonkey) joins
21:17:17<@OrIdow6>Like, dimension 1: {Users stay there, user exodus, Twitter shuts down}
21:18:28<@OrIdow6>Dimension 2: {Data retained publicly as currently, data stays public but locations change (and may cause playback/discoverability issues), data is deleted or taken out of public view}
21:22:12<@OrIdow6>Really what concerns me is a short-notice move between one of the categories of dimension 2
21:22:18<@OrIdow6>s/one of //
21:25:11<@OrIdow6>Or that a gradual dimension-1 death should be as much a circus as this thing has been so far, and we're unable to recognize it until (again) some final you-have-3-days announcement
21:28:31<JTL>Would be bad if the infra starts crumbling in subtle ways (e.g. "old" tweets start erroring out or whatever)
21:37:48<@OrIdow6>I guess? Though I'm usually of the opinion that technical indicators like that add little information to what can be gotten from more normal ones
21:39:27<@OrIdow6>Unless you mean as part of some deliberate plot to get rid of old posts
21:50:15march_happy (march_happy) joins
22:08:55tech_exorcist quits [Remote host closed the connection]
22:09:15tech_exorcist (tech_exorcist) joins
22:15:41<IDK>could we create a dedicated channel for twitter, like #twatter
22:21:05<@Sanqui>sweb.cz update: I have 141 thousand domains & many more urls per domain, many of them are already dead though. 3 weeks before demise, any ideas other than starting archivebot jobs? I'm thinking 2-4k domains per file, ~4 jobs running, no offsite should do it.
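The batching idea above (a few thousand domains per file) can be sketched as follows. This is an illustration under assumptions, not the script actually used: the file naming, the `http://<domain>/` URL form, and the default batch size of 3000 are all made up here. At 3000 per file, 141 thousand domains would come to 47 files.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Iterator, List


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def write_batches(domains: Iterable[str], out_dir: Path,
                  size: int = 3000, prefix: str = "sweb-batch") -> List[Path]:
    """Write one URL list per batch of domains and return the file paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    for index, batch in enumerate(chunked(domains, size)):
        path = out_dir / f"{prefix}-{index:04d}.txt"
        # Assumption: one root URL per domain, one URL per line.
        path.write_text("".join(f"http://{d}/\n" for d in batch), encoding="utf-8")
        paths.append(path)
    return paths


# Small demonstration with placeholder domains in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo = [f"site{i}.sweb.cz" for i in range(10)]
    for p in write_batches(demo, Path(tmp), size=4):
        print(p.name, len(p.read_text(encoding="utf-8").splitlines()))
```

Each resulting file could then be fed to a separate job; how many to run at once is a separate decision from how the list is split.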
22:21:55<lennier1>Myspace would be an example of a company cost cutting to the point they accidentally lost data, though I don't think that's the most likely outcome. But yes, I could imagine Twitter changes that break the search functionality snscrape relies on, or put some posts behind a login wall, or even a paywall. Search is already broken for old posts by private accounts even if you follow them, and at some point in the last few years
22:21:55<lennier1>they started requiring a login to view some adult content. In general they're much worse about nagging you to log in than they used to be.
22:23:51<JTL>iirc nitter somehow gets the URL for loginwalled adult media
22:26:16<@Sanqui>anyways, feel free to PM me thoughts, I'm going to sleep now, will start the jobs tomorrow morning when I can monitor / adjust them
22:30:43<@JAA>TheTechRobo: Do you have a rough number for the request rate you've thrown at it?
22:40:04tech_exorcist quits [Remote host closed the connection]
22:40:13tech_exorcist_ (tech_exorcist) joins
22:44:15mut4ntm0nkey quits [Ping timeout: 255 seconds]
22:47:53mut4ntm0nkey (mutantmonkey) joins
22:49:45<@Sanqui|m>P.s. If you feel like deriving some interesting crawls for sweb.cz and *.sweb.cz urls please send them my way, thanks!
22:55:36tech_exorcist_ quits [Client Quit]
22:56:26<@OrIdow6>Deriving?
23:00:26<TheTechRobo>JAA: Admittedly not a high load. One or two requests per second.
23:01:01<TheTechRobo>(Trying not to get Twitch people knocking down my door.)
23:08:35march_happy quits [Ping timeout: 268 seconds]
23:09:00march_happy (march_happy) joins
23:10:05lunik17 quits [Quit: Ping timeout (120 seconds)]
23:11:07@Sanqui quits [Quit: .]
23:14:16lunik17 joins
23:15:08Sanqui joins
23:15:10Sanqui quits [Changing host]
23:15:10Sanqui (Sanqui) joins
23:15:10@ChanServ sets mode: +o Sanqui
23:18:08<@JAA>TheTechRobo: Ah right, yeah, there'd be something seriously wrong if warcprox couldn't handle that. qwarc easily does about two orders of magnitude more.
23:18:58<TheTechRobo>qwarc documentation when :-)
23:22:43<@JAA>When it reaches version 1.0. :-P
23:23:03<TheTechRobo>qwarc reaching 1.0 when :P
23:23:29<@JAA>When I write the documentation. :-)
23:24:05<TheTechRobo>RecursionError: maximum recursion depth exceeded
23:24:53<@JAA>sys.setrecursionlimit(float('Infinity'))
23:25:27<TheTechRobo>qwarc documentation when :-)
23:25:31michaelblob segfaults
23:25:38Arcorann (Arcorann) joins
23:26:06mut4ntm0nkey quits [Ping timeout: 255 seconds]
23:37:23mut4ntm0nkey (mutantmonkey) joins
23:40:46mut4ntm0nkey quits [Remote host closed the connection]
23:41:05mut4ntm0nkey (mutantmonkey) joins
23:42:51<@JAA>More serious answer: I want to make several fundamental changes to qwarc before it can be considered anything close to stable. The current WARC implementation is acceptable (since I replaced warcio), but the plan is to replace it with pywarc once that's ready. And the hacky aiohttp stack needs to go. I've been working on the former, but it's slow-going.
23:48:27BlueMaxima joins
23:52:39Nulo quits [Ping timeout: 255 seconds]