00:27:59<aarchi>I've added processing of these dumps to URLHero, though URLHero isn't ready as a service yet. https://github.com/andrewarchi/urlhero/tree/main/wwiki
00:57:10mutantmonkey quits [Remote host closed the connection]
00:57:29mutantmonkey (mutantmonkey) joins
01:21:14<teej>aarchi: Nice!
01:37:36<teej>aarchi: So does the URLHero program download the dumps periodically? Like weekly?
01:57:51<teej>I found some more info on the w.wiki short-links: See https://meta.wikimedia.org/wiki/Wikimedia_URL_Shortener
01:58:37<teej>https://meta.wikimedia.org/wiki/Wikimedia_URL_Shortener#Existing_short_links explains the purpose of the links.
01:58:59<teej>So it's not purely incremental.
02:01:10<teej>Never mind! It seems to be incremental. Disregard what I just said.
02:02:25<teej>Well, there are just a few special links.
02:03:21<teej>According to https://www.mediawiki.org/wiki/Help:UrlShortener, "Short urls can also be deleted by users holding the urlshortener-manage-url user right."
02:04:02<aarchi>@teej URLHero can currently download URLTeam dumps (via Internet Archive torrents for faster speeds) and w.wiki dumps. I'm currently designing a bespoke index for regex search over the long URLs, but it may be infeasible due to large size (>400GB, xz-compressed). The goal is to have an API for querying URLs by shortcode or long URL, along with a browser extension that replaces shortener URLs with ones resolved by URLHero, for privacy and archival reasons.
02:04:21<aarchi>The downloading is not currently automatic.
02:07:01<teej>I also found some implementation details. https://www.mediawiki.org/wiki/Extension:UrlShortener
02:07:44<teej>It says `The first character in the list is treated as a leading zero; no shortcodes beginning with that character will be created, and it is ignored when used at the start of the shortcode in a URL (e.g. https://w.wiki/22222222w is the same as https://w.wiki/w).`
02:08:17<aarchi>That's weird
02:08:44<teej>Yeah, that's unique. I haven't encountered a url-shortener that does that.
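[Editor's sketch] The leading-zero rule quoted above can be illustrated with a small normalizer. This is only a sketch, not the extension's actual code: the function name `normalize_shortcode` is hypothetical, and the default `zero_char="2"` is an assumption inferred from the quoted example (https://w.wiki/22222222w resolving the same as https://w.wiki/w).

```python
def normalize_shortcode(code: str, zero_char: str = "2") -> str:
    """Strip leading 'zero' characters from a shortcode.

    zero_char is assumed to be the first character of the shortener's
    configured character set; "2" is inferred from the quoted example.
    """
    # Only leading occurrences are ignored; the same character elsewhere
    # in the shortcode is significant and is kept.
    return code.lstrip(zero_char)


def same_shortlink(a: str, b: str, zero_char: str = "2") -> bool:
    """Return True if two shortcodes should resolve to the same entry."""
    return normalize_shortcode(a, zero_char) == normalize_shortcode(b, zero_char)
```

An archive that keys entries by shortcode would probably want to normalize before lookup, so that padded variants of one link are not stored as separate records.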
02:10:16<teej>aarchi: Regarding URLHero, would a database be suited for the job?
02:18:12<aarchi>I don't have extensive background with databases, so I don't know if there's one that has regex search and compression
02:25:13<teej>aarchi: I think the right type of database needs to be chosen for the job, but here is some info: https://dataschool.com/how-to-teach-people-sql/how-regex-works-in-sql/
02:26:09<teej>I also don't have an extensive background in databases, but I do think using one is the right way forward.
02:28:17<teej>There are a lot of databases to choose from. See https://db-engines.com/
02:30:05<teej>Here's a comparison of 3 open source databases: https://db-engines.com/en/system/MariaDB%3bPostgreSQL%3bSQLite and https://opensource.com/article/19/1/open-source-databases
02:31:29<aarchi>Yeah. For now, I forked google/codesearch, which creates a trigram index for regex search. I'm planning on bucketing the URLs by shortcode into groups of, say, 32MB of URLs, then compressing the buckets. We'll see how that goes and if it scales
02:31:48<aarchi>Postgres and SQLite won't scale for this magnitude of data
02:34:27<@JAA>SQLite certainly won't, but PostgreSQL should be fine.
02:35:10<@JAA>If tuned properly, that is.
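[Editor's sketch] The trigram-index idea mentioned above can be shown in miniature. This is a toy in-memory version, not google/codesearch or URLHero code: the class `TrigramIndex` and its methods are hypothetical, and the caller supplies a literal substring that every match must contain, whereas a real implementation would derive the trigram query from the regex itself.

```python
import re
from collections import defaultdict


def trigrams(s: str) -> set:
    """Return the set of all 3-character substrings of s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}


class TrigramIndex:
    """Toy inverted index mapping each trigram to the URLs containing it."""

    def __init__(self):
        self.urls = []
        self.postings = defaultdict(set)  # trigram -> set of URL ids

    def add(self, url: str) -> None:
        doc_id = len(self.urls)
        self.urls.append(url)
        for t in trigrams(url):
            self.postings[t].add(doc_id)

    def search(self, pattern: str, required_literal: str) -> list:
        """Regex search, pre-filtered by a literal every match must contain."""
        candidates = None
        for t in trigrams(required_literal):
            ids = self.postings.get(t, set())
            candidates = ids if candidates is None else candidates & ids
        if candidates is None:
            # Literal shorter than 3 characters: no trigram filter, scan all.
            candidates = range(len(self.urls))
        rx = re.compile(pattern)
        # The trigram filter only narrows candidates; the regex confirms them.
        return [self.urls[i] for i in sorted(candidates) if rx.search(self.urls[i])]
```

The filter can return false positives (URLs that contain all the trigrams but not the literal itself), which is why each candidate is still checked against the regex. The bucketing and compression described in the chat would sit beneath this, with posting lists pointing at compressed buckets instead of in-memory strings.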
04:42:26qw3rty__ joins
04:46:19qw3rty_ quits [Ping timeout: 264 seconds]
09:46:22Arcorann_ joins
09:59:41ragu_ joins
10:03:07ragu__ quits [Ping timeout: 264 seconds]
10:06:13Arcorann (Arcorann) joins
10:07:55Arcorann_ quits [Ping timeout: 264 seconds]
10:24:18ragu_ quits [Client Quit]
10:31:55Arcorann quits [Ping timeout: 264 seconds]
11:19:55HackMii quits [Ping timeout: 264 seconds]
13:15:36Arcorann (Arcorann) joins
13:48:52Zerote_ joins
13:51:43Zerote quits [Ping timeout: 264 seconds]
14:25:49Arcorann quits [Ping timeout: 260 seconds]
16:28:53britmob quits [Read error: Connection reset by peer]
16:43:20britmob joins
20:33:09Ryz quits [Remote host closed the connection]
20:35:00Ryz (Ryz) joins
23:14:21seatsea8 joins
23:15:15jodizzle quits [Remote host closed the connection]
23:15:30chfoo_ (chfoo) joins
23:15:30@ChanServ sets mode: +o chfoo_
23:15:39billy549- quits [Ping timeout: 244 seconds]
23:15:54Hecz- joins
23:16:03jodizzle (jodizzle) joins
23:16:10nothere_ quits [Ping timeout: 244 seconds]
23:16:10Terbium quits [Ping timeout: 244 seconds]
23:16:42@chfoo quits [Ping timeout: 244 seconds]
23:16:42Hecz quits [Ping timeout: 244 seconds]
23:16:42seatsea quits [Ping timeout: 244 seconds]
23:16:42seatsea8 is now known as seatsea
23:16:49Hecz- is now known as Hecz
23:16:53Hecz quits [Changing host]
23:16:53Hecz (Hecz) joins
23:16:57Terbium joins
23:19:19nothere joins
23:22:58billy549 (Billy549) joins