| 00:27:59 | <aarchi> | I've added processing of these dumps to URLHero, though URLHero isn't ready as a service yet. https://github.com/andrewarchi/urlhero/tree/main/wwiki |
| 00:57:10 | | mutantmonkey quits [Remote host closed the connection] |
| 00:57:29 | | mutantmonkey (mutantmonkey) joins |
| 01:21:14 | <teej> | aarchi: Nice! |
| 01:37:36 | <teej> | aarchi: So does the URLHero program download the dumps periodically? Like weekly? |
| 01:57:51 | <teej> | I found some more info on the w.wiki short-links: See https://meta.wikimedia.org/wiki/Wikimedia_URL_Shortener |
| 01:58:37 | <teej> | https://meta.wikimedia.org/wiki/Wikimedia_URL_Shortener#Existing_short_links explains the purpose of the links. |
| 01:58:59 | <teej> | So it's not purely incremental. |
| 02:01:10 | <teej> | Never mind! It seems to be incremental. Disregard what I just said. |
| 02:02:25 | <teej> | Well, there are just a few special links. |
| 02:03:21 | <teej> | According to https://www.mediawiki.org/wiki/Help:UrlShortener, "Short urls can also be deleted by users holding the urlshortener-manage-url user right." |
| 02:04:02 | <aarchi> | @teej URLHero can currently download URLTeam dumps (via Internet Archive torrents for faster speeds) and w.wiki dumps. I'm currently designing a bespoke index for regex search over the long URLs, but it may be infeasible due to large size (>400GB, xz-compressed). The goal is to have an API for querying URLs by shortcode or long URL, along with a browser extension that replaces shortener URLs with ones resolved by URLHero, for privacy and archival reasons. |
| 02:04:21 | <aarchi> | The downloading is not currently automatic. |
| 02:07:01 | <teej> | I also found some implementation details. https://www.mediawiki.org/wiki/Extension:UrlShortener |
| 02:07:44 | <teej> | It says `The first character in the list is treated as a leading zero; no shortcodes beginning with that character will be created, and it is ignored when used at the start of the shortcode in a URL (e.g. https://w.wiki/22222222w is the same as https://w.wiki/w).` |
| 02:08:17 | <aarchi> | That's weird |
| 02:08:44 | <teej> | Yeah, that's unique. I haven't encountered a url-shortener that does that. |
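
The leading-zero rule quoted above can be sketched in a few lines. This is a minimal illustration, not the extension's actual code: the function name `normalize_shortcode` is hypothetical, and the leading-zero character `2` is inferred only from the quoted example (`22222222w` being the same as `w`). How the extension treats a code made up entirely of that character is not stated in the quote, so the fallback here is an assumption.

```python
# Assumption: "2" is the first character of the shortcode alphabet,
# inferred from the quoted example (w.wiki/22222222w == w.wiki/w).
LEADING_ZERO = "2"


def normalize_shortcode(code: str) -> str:
    """Strip leading-zero characters so equivalent shortcodes compare equal."""
    stripped = code.lstrip(LEADING_ZERO)
    # Assumption: a code consisting only of leading zeros collapses to one
    # character. The quoted documentation does not cover this case.
    return stripped if stripped else code[:1]


if __name__ == "__main__":
    for code in ("22222222w", "w", "w2"):
        print(code, "->", normalize_shortcode(code))
```

Normalizing before lookup would let an index keyed by shortcode treat `22222222w` and `w` as the same entry.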
| 02:10:16 | <teej> | aarchi: Regarding URLHero, would a database be suited for the job? |
| 02:18:12 | <aarchi> | I don't have extensive background with databases, so I don't know if there's one that has regex search and compression |
| 02:25:13 | <teej> | aarchi: I think the right type of database needs to be chosen for the job, but here is some info: https://dataschool.com/how-to-teach-people-sql/how-regex-works-in-sql/ |
| 02:26:09 | <teej> | I also don't have an extensive background in databases, but I do think using one is the right way forward. |
| 02:28:17 | <teej> | There are a lot of databases to choose from. See https://db-engines.com/ |
| 02:30:05 | <teej> | Here's a comparison of 3 open source databases: https://db-engines.com/en/system/MariaDB%3bPostgreSQL%3bSQLite and https://opensource.com/article/19/1/open-source-databases |
| 02:31:29 | <aarchi> | Yeah. For now, I forked google/codesearch, which creates a trigram index for regex search. I'm planning on bucketing the URLs by shortcode and into groups of, say, 32MB of URLs, then compressing the buckets. We'll see how that goes and if it scales |
| 02:31:48 | <aarchi> | Postgres and SQLite won't scale for this magnitude of data |
| 02:34:27 | <@JAA> | SQLite certainly won't, but PostgreSQL should be fine. |
| 02:35:10 | <@JAA> | If tuned properly, that is. |
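
The trigram-index idea mentioned in the discussion above can be illustrated roughly as follows. This is a simplified sketch, not the google/codesearch implementation and not URLHero's code: all function names are hypothetical, the index lives in memory, and the caller passes a literal substring that every match must contain, whereas a real trigram search tool derives the required trigrams from the regex itself. The index only narrows the candidates; the regex is still run on each candidate to confirm the match.

```python
import re
from collections import defaultdict


def trigrams(s: str) -> set[str]:
    """Return the set of all 3-character substrings of s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}


def build_index(urls: list[str]) -> dict[str, set[int]]:
    """Map each trigram to the positions of the URLs that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for i, url in enumerate(urls):
        for t in trigrams(url):
            index[t].add(i)
    return index


def search(urls: list[str], index: dict[str, set[int]],
           literal: str, pattern: str) -> list[str]:
    """Find URLs matching pattern, given a literal every match must contain."""
    required = trigrams(literal)
    if required:
        # Only URLs containing every trigram of the literal can match.
        candidates = set.intersection(*(index.get(t, set()) for t in required))
    else:
        # Literal shorter than 3 characters: fall back to a full scan.
        candidates = set(range(len(urls)))
    rx = re.compile(pattern)
    # Verify each candidate with the real regex.
    return [urls[i] for i in sorted(candidates) if rx.search(urls[i])]


if __name__ == "__main__":
    urls = [
        "https://example.org/wiki/Foo",
        "https://example.com/bar",
        "https://archive.org/details/foo",
    ]
    index = build_index(urls)
    print(search(urls, index, "example", r"example\.(org|com)/"))
```

At the scale discussed (hundreds of gigabytes of compressed URLs), the posting lists would need to live on disk and point at compressed buckets rather than individual in-memory strings; this sketch does not attempt that.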
| 04:42:26 | | qw3rty__ joins |
| 04:46:19 | | qw3rty_ quits [Ping timeout: 264 seconds] |
| 09:46:22 | | Arcorann_ joins |
| 09:59:41 | | ragu_ joins |
| 10:03:07 | | ragu__ quits [Ping timeout: 264 seconds] |
| 10:06:13 | | Arcorann (Arcorann) joins |
| 10:07:55 | | Arcorann_ quits [Ping timeout: 264 seconds] |
| 10:24:18 | | ragu_ quits [Client Quit] |
| 10:31:55 | | Arcorann quits [Ping timeout: 264 seconds] |
| 11:19:55 | | HackMii quits [Ping timeout: 264 seconds] |
| 13:15:36 | | Arcorann (Arcorann) joins |
| 13:48:52 | | Zerote_ joins |
| 13:51:43 | | Zerote quits [Ping timeout: 264 seconds] |
| 14:25:49 | | Arcorann quits [Ping timeout: 260 seconds] |
| 16:28:53 | | britmob quits [Read error: Connection reset by peer] |
| 16:43:20 | | britmob joins |
| 20:33:09 | | Ryz quits [Remote host closed the connection] |
| 20:35:00 | | Ryz (Ryz) joins |
| 23:14:21 | | seatsea8 joins |
| 23:15:15 | | jodizzle quits [Remote host closed the connection] |
| 23:15:30 | | chfoo_ (chfoo) joins |
| 23:15:30 | | @ChanServ sets mode: +o chfoo_ |
| 23:15:39 | | billy549- quits [Ping timeout: 244 seconds] |
| 23:15:54 | | Hecz- joins |
| 23:16:03 | | jodizzle (jodizzle) joins |
| 23:16:10 | | nothere_ quits [Ping timeout: 244 seconds] |
| 23:16:10 | | Terbium quits [Ping timeout: 244 seconds] |
| 23:16:42 | | @chfoo quits [Ping timeout: 244 seconds] |
| 23:16:42 | | Hecz quits [Ping timeout: 244 seconds] |
| 23:16:42 | | seatsea quits [Ping timeout: 244 seconds] |
| 23:16:42 | | seatsea8 is now known as seatsea |
| 23:16:49 | | Hecz- is now known as Hecz |
| 23:16:53 | | Hecz is now authenticated as Hecz |
| 23:16:53 | | Hecz quits [Changing host] |
| 23:16:53 | | Hecz (Hecz) joins |
| 23:16:57 | | Terbium joins |
| 23:19:19 | | nothere joins |
| 23:22:58 | | billy549 (Billy549) joins |