#urlteam log for 2021-02-25


00:27:59	<aarchi>	I've added processing of these dumps to URLHero, though URLHero isn't ready as a service yet. https://github.com/andrewarchi/urlhero/tree/main/wwiki
00:57:10		mutantmonkey quits [Remote host closed the connection]
00:57:29		mutantmonkey (mutantmonkey) joins
01:21:14	<teej>	aarchi: Nice!
01:37:36	<teej>	aarchi: So does the URLHero program download the dumps periodically? Like weekly?
01:57:51	<teej>	I found some more info on the w.wiki short-links: See https://meta.wikimedia.org/wiki/Wikimedia_URL_Shortener
01:58:37	<teej>	https://meta.wikimedia.org/wiki/Wikimedia_URL_Shortener#Existing_short_links explains the purpose of the links.
01:58:59	<teej>	So it's not purely incremental.
02:01:10	<teej>	Never mind! It seems to be incremental. Disregard what I just said.
02:02:25	<teej>	Well, there are just a few special links.
02:03:21	<teej>	According to https://www.mediawiki.org/wiki/Help:UrlShortener, "Short urls can also be deleted by users holding the urlshortener-manage-url user right."
02:04:02	<aarchi>	@teej URLHero can currently download URLTeam dumps (via Internet Archive torrents for faster speeds) and w.wiki dumps. I'm currently designing a bespoke index for regex search over the long URLs, but it may be infeasible due to large size (>400GB, xz-compressed). The goal is to have an API for querying URLs by shortcode or long URL, along with a browser extension that replaces shortener URLs with ones resolved by URLHero, for privacy and archival reasons.
02:04:21	<aarchi>	The downloading is not currently automatic.
02:07:01	<teej>	I also found some implementation details. https://www.mediawiki.org/wiki/Extension:UrlShortener
02:07:44	<teej>	It says `The first character in the list is treated as a leading zero; no shortcodes beginning with that character will be created, and it is ignored when used at the start of the shortcode in a URL (e.g. https://w.wiki/22222222w is the same as https://w.wiki/w).`
02:08:17	<aarchi>	That's weird
02:08:44	<teej>	Yeah, that's unique. I haven't encountered a url-shortener that does that.
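
The leading-zero rule quoted above can be illustrated with a minimal sketch. It assumes that `2` is the first character of the w.wiki shortcode alphabet, which is inferred only from the quoted example (https://w.wiki/22222222w being the same as https://w.wiki/w). The function name `canonical_shortcode` is hypothetical and is not part of the UrlShortener extension.

```python
from urllib.parse import urlsplit

# Assumption: "2" is the first character of the w.wiki shortcode alphabet.
# This is inferred only from the example quoted in the log.
LEADING_ZERO = "2"


def canonical_shortcode(url: str, zero: str = LEADING_ZERO) -> str:
    """Return the shortcode of a short URL with leading 'zero' characters removed."""
    # Take the path component and drop the leading slash.
    code = urlsplit(url).path.lstrip("/")
    # Leading zero characters are ignored, so strip all of them.
    # A code made only of zero characters reduces to an empty string here;
    # how the extension treats that case is not stated in the log.
    return code.lstrip(zero)


if __name__ == "__main__":
    # Both forms from the quoted example reduce to the same shortcode.
    print(canonical_shortcode("https://w.wiki/22222222w"))
    print(canonical_shortcode("https://w.wiki/w"))
```

Normalizing shortcodes this way before storing them would keep equivalent short URLs from being indexed as separate entries.
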
02:10:16	<teej>	aarchi: Regarding URLHero, would a database be suited for the job?
02:18:12	<aarchi>	I don't have extensive background with databases, so I don't know if there's one that has regex search and compression
02:25:13	<teej>	aarchi: I think the right type of database needs to be chosen for the job, but here is some info: https://dataschool.com/how-to-teach-people-sql/how-regex-works-in-sql/
02:26:09	<teej>	I also don't have an extensive background in databases, but I do think using one is the right way forward.
02:28:17	<teej>	There are a lot of databases to choose from. See https://db-engines.com/
02:30:05	<teej>	Here's a comparison of 3 open source databases: https://db-engines.com/en/system/MariaDB%3bPostgreSQL%3bSQLite and https://opensource.com/article/19/1/open-source-databases
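
As a small-scale illustration of regex search inside a database, the sketch below uses Python's built-in `sqlite3` module. SQLite only supports the `REGEXP` operator when the application registers a `regexp` function, so the sketch registers one backed by Python's `re`. The table name `links`, the shortcodes `demo1` and `demo2`, and the helper names are hypothetical; the long URLs are taken from the log. This is a toy example and says nothing about how such a query would behave at the data sizes discussed here.

```python
import re
import sqlite3


def _regexp(pattern, value):
    # SQLite evaluates "X REGEXP Y" as regexp(Y, X): pattern first, value second.
    if value is None:
        return 0
    return 1 if re.search(pattern, value) else 0


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical shortcode table."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("regexp", 2, _regexp)
    conn.execute("CREATE TABLE links (shortcode TEXT PRIMARY KEY, long_url TEXT)")
    conn.executemany(
        "INSERT INTO links VALUES (?, ?)",
        [
            # Shortcodes are made up for illustration.
            ("demo1", "https://meta.wikimedia.org/wiki/Wikimedia_URL_Shortener"),
            ("demo2", "https://db-engines.com/"),
        ],
    )
    return conn


def find_shortcodes(conn: sqlite3.Connection, pattern: str) -> list:
    """Return shortcodes whose long URL matches the regular expression."""
    rows = conn.execute(
        "SELECT shortcode FROM links WHERE long_url REGEXP ? ORDER BY shortcode",
        (pattern,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    print(find_shortcodes(conn, r"wikimedia\.org/wiki/"))
```

A query like this scans every row, since a plain regex match cannot use an ordinary index.
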
02:31:29	<aarchi>	Yeah. For now, I forked google/codesearch, which creates a trigram index for regex search. I'm planning on bucketing the URLs by shortcode and into groups of, say, 32 MB of URLs, then compressing the buckets. We'll see how that goes and if it scales
02:31:48	<aarchi>	Postgres and SQLite won't scale for this magnitude of data
02:34:27	<@JAA>	SQLite certainly won't, but PostgreSQL should be fine.
02:35:10	<@JAA>	If tuned properly, that is.
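
The trigram-index idea mentioned above can be sketched in a few lines. This is a simplified illustration, not the google/codesearch implementation: codesearch derives the trigram query from the regex itself, whereas this sketch assumes the caller supplies a literal substring (`required_literal`) that every match must contain. The class and method names are hypothetical.

```python
import re
from collections import defaultdict


def trigrams(s: str) -> set:
    """Return the set of all 3-character substrings of s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}


class TrigramIndex:
    def __init__(self):
        self.urls = []                    # doc id -> URL
        self.postings = defaultdict(set)  # trigram -> set of doc ids

    def add(self, url: str) -> None:
        doc_id = len(self.urls)
        self.urls.append(url)
        for t in trigrams(url):
            self.postings[t].add(doc_id)

    def search(self, pattern: str, required_literal: str) -> list:
        """Narrow candidates by the literal's trigrams, then verify with the regex."""
        candidates = None
        for t in trigrams(required_literal):
            ids = self.postings.get(t, set())
            candidates = ids if candidates is None else candidates & ids
        if candidates is None:
            # Literal shorter than 3 characters: no trigrams, fall back to a full scan.
            candidates = range(len(self.urls))
        rx = re.compile(pattern)
        return [self.urls[i] for i in sorted(candidates) if rx.search(self.urls[i])]


if __name__ == "__main__":
    idx = TrigramIndex()
    idx.add("https://meta.wikimedia.org/wiki/Wikimedia_URL_Shortener")
    idx.add("https://db-engines.com/")
    idx.add("https://www.mediawiki.org/wiki/Help:UrlShortener")
    print(idx.search(r"/wiki/\w+:Url", "/wiki/"))
```

The index only filters candidates; the regex is still run on each surviving URL, so the result is exact as long as the supplied literal really is required by the pattern.
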
04:42:26		qw3rty__ joins
04:46:19		qw3rty_ quits [Ping timeout: 264 seconds]
09:46:22		Arcorann_ joins
09:59:41		ragu_ joins
10:03:07		ragu__ quits [Ping timeout: 264 seconds]
10:06:13		Arcorann (Arcorann) joins
10:07:55		Arcorann_ quits [Ping timeout: 264 seconds]
10:24:18		ragu_ quits [Client Quit]
10:31:55		Arcorann quits [Ping timeout: 264 seconds]
11:19:55		HackMii quits [Ping timeout: 264 seconds]
13:15:36		Arcorann (Arcorann) joins
13:48:52		Zerote_ joins
13:51:43		Zerote quits [Ping timeout: 264 seconds]
14:25:49		Arcorann quits [Ping timeout: 260 seconds]
16:28:53		britmob quits [Read error: Connection reset by peer]
16:43:20		britmob joins
20:33:09		Ryz quits [Remote host closed the connection]
20:35:00		Ryz (Ryz) joins
23:14:21		seatsea8 joins
23:15:15		jodizzle quits [Remote host closed the connection]
23:15:30		chfoo_ (chfoo) joins
23:15:30		@ChanServ sets mode: +o chfoo_
23:15:39		billy549- quits [Ping timeout: 244 seconds]
23:15:54		Hecz- joins
23:16:03		jodizzle (jodizzle) joins
23:16:10		nothere_ quits [Ping timeout: 244 seconds]
23:16:10		Terbium quits [Ping timeout: 244 seconds]
23:16:42		@chfoo quits [Ping timeout: 244 seconds]
23:16:42		Hecz quits [Ping timeout: 244 seconds]
23:16:42		seatsea quits [Ping timeout: 244 seconds]
23:16:42		seatsea8 is now known as seatsea
23:16:49		Hecz- is now known as Hecz
23:16:53		Hecz is now authenticated as Hecz
23:16:53		Hecz quits [Changing host]
23:16:53		Hecz (Hecz) joins
23:16:57		Terbium joins
23:19:19		nothere joins
23:22:58		billy549 (Billy549) joins
