03:19:21klg quits [Ping timeout: 265 seconds]
03:19:51klg (klg) joins
03:26:27<aarchi>I'm building a privacy-friendly unshortener and Web Extension that grabs the daily URLTeam releases (https://github.com/andrewarchi/urlhero), but the data of all Tiny Town releases, along with the final TinyBack release, is about 480GB compressed. I also want to index the URLs such that queries like /mediafire.com|mfi.re/ are easy. How have you set up databases for the dumps so the size isn't insane?
03:47:50<OrIdow6>I don't know if there's any big preexisting database of it
03:48:06<OrIdow6>Reply last time you were here, by the way, after you left the channel: <treora> aarchi: I once proposed a similar project, which never went further than this one-pager: <https://web.archive.org/web/20180203184012id_/https://directer.org/> — I’d be curious to hear more about your work!
03:49:23<OrIdow6>At some point you will hit a limit in the form of entropy in the URLs, which is fundamentally a problem of compression rather than how the database is structured
03:51:57<OrIdow6>Best trick there might be seeing that you can put like URLs together so that redundancy is eliminated - e.g. put the URLs themselves into blocks of a few MB and compress those, and have the first stage of the resolution process map the shortened URL to a number or something like that, and the second stage map the number to the URL (inside one of these "blocks")
03:52:29<OrIdow6>Maybe there's a better way, I didn't put much thought into that
03:53:35<OrIdow6>But again, this can only reduce the size so much
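
[Editor's sketch] The two-stage idea OrIdow6 describes above might look roughly like the following. This is a minimal illustration, not anyone's actual implementation: the function names, the `example.com` sample data, and the block size (counted in entries here, rather than "a few MB") are all assumptions. Sorting the target URLs puts similar ones next to each other so each compressed block can exploit their shared prefixes.

```python
import lzma

def build_index(pairs, block_size=4):
    """pairs: iterable of (shortcode, url). URLs are assumed to contain no newlines.
    Returns (code_to_number, blocks)."""
    pairs = list(pairs)
    # Sort the distinct target URLs so similar URLs land in the same block.
    urls = sorted({url for _, url in pairs})
    number_of = {url: n for n, url in enumerate(urls)}
    # Stage 1 table: shortcode -> position of its URL in the sorted list.
    code_to_number = {code: number_of[url] for code, url in pairs}
    # Stage 2 storage: fixed-size runs of sorted URLs, each xz-compressed.
    blocks = [
        lzma.compress("\n".join(urls[i:i + block_size]).encode("utf-8"))
        for i in range(0, len(urls), block_size)
    ]
    return code_to_number, blocks

def resolve(code_to_number, blocks, code, block_size=4):
    """Look up a shortcode; decompresses only the one block that holds its URL."""
    n = code_to_number.get(code)
    if n is None:
        return None
    block_index, offset = divmod(n, block_size)
    lines = lzma.decompress(blocks[block_index]).decode("utf-8").split("\n")
    return lines[offset]

# Hypothetical sample data for illustration only.
sample = [
    ("a1", "https://example.com/files/report-2019.pdf"),
    ("a2", "https://example.com/files/report-2020.pdf"),
    ("b7", "https://example.org/"),
    ("c3", "https://example.com/files/report-2019.pdf"),
    ("d9", "https://example.net/page?id=5"),
    ("e0", "https://example.com/about"),
]
code_to_number, blocks = build_index(sample, block_size=2)
```

With real dumps the block size would be tuned by bytes, and the stage 1 table would itself need a compact on-disk form; as noted in the conversation, none of this gets around the entropy of the URLs themselves.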
03:57:47<OrIdow6>Incidentally, this format (which is, not coincidentally, very similar to CDX files used for web archive replaying) is good for doing prefix searches on URLs for specific domains etc. - but the only plausible way to do general regex searches is with text
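
[Editor's sketch] To illustrate why a sorted, CDX-like layout makes per-domain prefix searches cheap, here is a small example using binary search over sorted keys. The key format is a simplified, SURT-like reversal of the hostname and is an assumption for illustration; real CDX tooling canonicalizes URLs more carefully. The helper names and sample URLs are hypothetical.

```python
import bisect
from urllib.parse import urlsplit

def sort_key(url):
    """Build a simplified SURT-like key: reversed host labels, then the path.
    Assumes an absolute URL with a scheme and hostname."""
    parts = urlsplit(url)
    host = ",".join(reversed(parts.hostname.split(".")))
    return f"{host}){parts.path}"

def prefix_search(sorted_keys, prefix):
    """Return all keys starting with prefix, using binary search to find the start."""
    i = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        matches.append(sorted_keys[i])
        i += 1
    return matches

# Hypothetical sample URLs for illustration only.
urls = [
    "https://mediafire.com/file/a",
    "https://example.com/x",
    "https://mediafire.com/file/b",
    "https://mfi.re/c",
]
keys = sorted(sort_key(u) for u in urls)
mediafire_hits = prefix_search(keys, "com,mediafire)")
```

A query like /mediafire.com|mfi.re/ then becomes two prefix lookups, one per domain. An arbitrary regex has no such shortcut and still needs a scan over the text.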
03:58:37<OrIdow6>By the way, there's an ArchiveTeam project for mediafire, #mediaonfire - someone said they were going to extract URLteam data and feed it into there, don't know if that happened or if it was you
04:01:46<aarchi>Thanks for the repeat from last time. I didn’t have an IRC client before, so had no scrollback.
04:02:11qw3rty_ joins
04:04:13<aarchi>If the URLTeam mention on #mediaonfire was @aarchi or @hook54321, it was me. I posted the links I grabbed yesterday.
04:05:53qw3rty__ quits [Ping timeout: 244 seconds]
04:16:39<OrIdow6>Looks like it was you, though last thing mentioned about it there was that you "said it'll take about a month"
04:18:52<aarchi>I did some optimization on the traversal. I was formerly using a pure Go xz library, but it was very slow, so I replaced it with system xz. A full search now takes about 6 hours instead of the estimated month.
04:25:11<OrIdow6>Oh, haha
04:27:43<OrIdow6>Only a 12075% speedup
04:35:10<aarchi>Yeah and I’m hoping to replace the buggy and slow torrent lib I’m using with Transmission bindings
04:37:36<aarchi>anacrolix/torrent has mediocre web seeding support, which is problematic because IA torrents use webseed almost exclusively and no one seeds them
04:40:23<aarchi>On compression: I figure for most shorteners, I could represent the shortcodes as integers in their respective bases and just store the dumps as arrays of indices into the compressed content. Sounds like a lot of engineering though.
04:49:31<aarchi>I'll need to determine how sparse the non-sequential shortcode assignments are. If they're packed tightly, that'll simplify things
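
[Editor's sketch] The integer representation aarchi describes, plus a rough way to measure how tightly packed a shortener's codes are, could be sketched as follows. The base-62 alphabet and its ordering are assumptions; each shortener uses its own alphabet, and the function names are hypothetical.

```python
# Assumed alphabet; real shorteners differ in both characters and ordering.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def code_to_int(code, alphabet=ALPHABET):
    """Interpret a shortcode as a number in base len(alphabet)."""
    base = len(alphabet)
    n = 0
    for ch in code:
        n = n * base + alphabet.index(ch)
    return n

def int_to_code(n, alphabet=ALPHABET):
    """Inverse of code_to_int for codes that do not start with alphabet[0]."""
    if n == 0:
        return alphabet[0]
    base = len(alphabet)
    digits = []
    while n > 0:
        n, r = divmod(n, base)
        digits.append(alphabet[r])
    return "".join(reversed(digits))

def density(codes, alphabet=ALPHABET):
    """Fraction of the integer range [min, max] that is actually occupied."""
    values = {code_to_int(c, alphabet) for c in codes}
    return len(values) / (max(values) - min(values) + 1)
```

One caveat: codes with leading "zero" characters (such as "0a" and "a" under this alphabet) collapse to the same integer, so a real scheme would need to store the code length as well or use a bijective numbering. A density near 1.0 suggests a plain array indexed by integer would work; a low density points toward a sparse structure instead.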
04:59:14qw3rty__ joins
05:03:01qw3rty_ quits [Ping timeout: 252 seconds]
07:12:36Pichu0102 joins
07:15:16<Pichu0102>Found two URL shorteners with examples: https://edsy.org/s/vgH9Y3z https://s.orbis.zone/cbrc
08:13:59sliccricc_ (sliccricc) joins
08:14:04sliccricc quits [Remote host closed the connection]
09:53:31sliccricc_ quits [Ping timeout: 264 seconds]
10:38:14Ryz quits [Remote host closed the connection]
10:39:41Ryz (Ryz) joins
11:05:19Arcorann_ joins
11:06:27Arcorann quits [Ping timeout: 244 seconds]
11:22:42Arcorann__ joins
11:25:25Arcorann_ quits [Ping timeout: 240 seconds]
11:37:27Arcorann__ quits [Ping timeout: 244 seconds]
12:44:53kpcyrd quits [Quit: kpcyrd]
12:45:25kpcyrd (kpcyrd) joins
13:10:08ragu_ joins
13:13:37ragu__ quits [Ping timeout: 252 seconds]
13:19:28yano quits [Quit: WeeChat, The Better IRC Client, https://weechat.org/]
13:21:40yanome (yano) joins
13:22:32yano (yano) joins
13:26:23tech234a quits [Read error: Connection reset by peer]
13:26:34aarchi quits [Read error: No route to host]
13:26:36janpaul123 quits [Read error: Connection reset by peer]
13:26:36@EggplantN2 quits [Read error: Connection reset by peer]
13:28:03tech234a (tech234a) joins
13:28:12janpaul123 (janpaul123) joins
13:28:12EggplantN2 joins
13:30:59aarchi (aarchi) joins
13:32:43EggplantN2 quits [Changing host]
13:32:43EggplantN2 (EggplantN) joins
13:32:43@ChanServ sets mode: +o EggplantN2
14:38:56sliccricc_ (sliccricc) joins
14:41:47sliccricc_ quits [Remote host closed the connection]
14:42:13sliccricc_ (sliccricc) joins
15:46:14sliccricc_ quits [Remote host closed the connection]
15:52:43justaguy is now known as AltroskyAS207616_Mystique
15:56:44Pichu0102 quits [Remote host closed the connection]
15:58:04sliccricc (sliccricc) joins
15:58:39sliccricc quits [Remote host closed the connection]
15:58:55sliccricc (sliccricc) joins
19:47:09britmob quits [Read error: Connection reset by peer]
19:48:06britmob joins
19:50:14sliccricc quits [Remote host closed the connection]
19:50:31sliccricc (sliccricc) joins
20:43:55sliccricc quits [Ping timeout: 264 seconds]
21:32:35britmob quits [Read error: Connection reset by peer]
21:44:55britmob joins
22:44:32mutantmonkey quits [Remote host closed the connection]
22:44:58mutantmonkey (mutantmonkey) joins
23:04:26janpaul123 quits [Read error: Connection reset by peer]
23:04:41janpaul123 (janpaul123) joins
23:18:34treora quits [Ping timeout: 244 seconds]
23:28:41treora joins
23:51:40@kiska quits [Quit: Ping timeout (120 seconds)]
23:53:27kiska (kiska) joins
23:53:27@ChanServ sets mode: +o kiska