| 03:19:21 | | klg quits [Ping timeout: 265 seconds] |
| 03:19:51 | | klg (klg) joins |
| 03:26:27 | <aarchi> | I'm building a privacy-friendly unshortener and Web Extension that grabs the daily URLTeam releases (https://github.com/andrewarchi/urlhero), but the data of all Tiny Town releases, along with the final TinyBack release, is about 480GB compressed. I also want to index the URLs such that queries like /mediafire.com|mfi.re/ are easy. How have you set up databases for the dumps so the size isn't insane? |
| 03:47:50 | <OrIdow6> | I don't know if there's any big preexisting database of it |
| 03:48:06 | <OrIdow6> | Reply last time you were here, by the way, after you left the channel: <treora> aarchi: I once proposed a similar project, which never went further than this one-pager: <https://web.archive.org/web/20180203184012id_/https://directer.org/> — I’d be curious to hear more about your work! |
| 03:49:23 | <OrIdow6> | At some point you will hit a limit in the form of entropy in the URLs, which is fundamentally a problem of compression rather than how the database is structured |
| 03:51:57 | <OrIdow6> | Best trick there might be to group similar URLs together so that redundancy is eliminated - e.g. put the URLs themselves into blocks of a few MB and compress those, then have the first stage of the resolution process map the shortened URL to a number or something like that, and the second stage map the number to the URL (inside one of these "blocks") |
| 03:52:29 | <OrIdow6> | Maybe there's a better way, I didn't put much thought into that |
| 03:53:35 | <OrIdow6> | But again, this can only reduce the size so much |
| 03:57:47 | <OrIdow6> | Incidentally, this format (which is, not coincidentally, very similar to CDX files used for web archive replaying) is good for doing prefix searches on URLs for specific domains etc. - but the only plausible way to do general regex searches is with text |
| 03:58:37 | <OrIdow6> | By the way, there's an ArchiveTeam project for mediafire, #mediaonfire - someone said they were going to extract URLteam data and feed it into there, don't know if that happened or if it was you |
| 04:01:46 | <aarchi> | Thanks for the repeat from last time. I didn’t have an IRC client before, so had no scrollback. |
| 04:02:11 | | qw3rty_ joins |
| 04:04:13 | <aarchi> | If the URLTeam mention on #mediaonfire was @aarchi or @hook54321, it was me. I posted the links I grabbed yesterday. |
| 04:05:53 | | qw3rty__ quits [Ping timeout: 244 seconds] |
| 04:16:39 | <OrIdow6> | Looks like it was you, though last thing mentioned about it there was that you "said it'll take about a month" |
| 04:18:52 | <aarchi> | I did some optimization on the traversal. I was formerly using a pure Go xz library, but it was so slow that I replaced it with system xz. A full search now takes about 6 hours instead of the estimated month. |
| 04:25:11 | <OrIdow6> | Oh, haha |
| 04:27:43 | <OrIdow6> | Only a 12075% speedup |
| 04:35:10 | <aarchi> | Yeah and I’m hoping to replace the buggy and slow torrent lib I’m using with Transmission bindings |
| 04:37:36 | <aarchi> | anacrolix/torrent has mediocre web seeding support, which is problematic because IA torrents use webseed almost exclusively and no one seeds them |
| 04:40:23 | <aarchi> | On compression: I figure for most shorteners, I could represent the shortcodes as integers in their respective bases and just store the dumps as arrays of indices into the compressed content. Sounds like a lot of engineering though. |
| 04:49:31 | <aarchi> | I'll need to determine how sparse the non-sequential shortcode assignments are. If they're packed tightly, that'll simplify things |
| 04:59:14 | | qw3rty__ joins |
| 05:03:01 | | qw3rty_ quits [Ping timeout: 252 seconds] |
| 07:12:36 | | Pichu0102 joins |
| 07:15:16 | <Pichu0102> | Found two URL shorteners with examples: https://edsy.org/s/vgH9Y3z https://s.orbis.zone/cbrc |
| 08:13:59 | | sliccricc_ (sliccricc) joins |
| 08:14:04 | | sliccricc quits [Remote host closed the connection] |
| 09:53:31 | | sliccricc_ quits [Ping timeout: 264 seconds] |
| 10:38:14 | | Ryz quits [Remote host closed the connection] |
| 10:39:41 | | Ryz (Ryz) joins |
| 11:05:19 | | Arcorann_ joins |
| 11:06:27 | | Arcorann quits [Ping timeout: 244 seconds] |
| 11:22:42 | | Arcorann__ joins |
| 11:25:25 | | Arcorann_ quits [Ping timeout: 240 seconds] |
| 11:37:27 | | Arcorann__ quits [Ping timeout: 244 seconds] |
| 12:44:53 | | kpcyrd quits [Quit: kpcyrd] |
| 12:45:25 | | kpcyrd (kpcyrd) joins |
| 13:10:08 | | ragu_ joins |
| 13:13:37 | | ragu__ quits [Ping timeout: 252 seconds] |
| 13:19:28 | | yano quits [Quit: WeeChat, The Better IRC Client, https://weechat.org/] |
| 13:21:40 | | yanome (yano) joins |
| 13:22:32 | | yano (yano) joins |
| 13:26:23 | | tech234a quits [Read error: Connection reset by peer] |
| 13:26:34 | | aarchi quits [Read error: No route to host] |
| 13:26:36 | | janpaul123 quits [Read error: Connection reset by peer] |
| 13:26:36 | | @EggplantN2 quits [Read error: Connection reset by peer] |
| 13:28:03 | | tech234a (tech234a) joins |
| 13:28:12 | | janpaul123 (janpaul123) joins |
| 13:28:12 | | EggplantN2 joins |
| 13:30:59 | | aarchi (aarchi) joins |
| 13:32:43 | | EggplantN2 is now authenticated as EggplantN |
| 13:32:43 | | EggplantN2 quits [Changing host] |
| 13:32:43 | | EggplantN2 (EggplantN) joins |
| 13:32:43 | | @ChanServ sets mode: +o EggplantN2 |
| 14:38:56 | | sliccricc_ (sliccricc) joins |
| 14:41:47 | | sliccricc_ quits [Remote host closed the connection] |
| 14:42:13 | | sliccricc_ (sliccricc) joins |
| 15:46:14 | | sliccricc_ quits [Remote host closed the connection] |
| 15:52:43 | | justaguy is now known as AltroskyAS207616_Mystique |
| 15:56:44 | | Pichu0102 quits [Remote host closed the connection] |
| 15:58:04 | | sliccricc (sliccricc) joins |
| 15:58:39 | | sliccricc quits [Remote host closed the connection] |
| 15:58:55 | | sliccricc (sliccricc) joins |
| 19:47:09 | | britmob quits [Read error: Connection reset by peer] |
| 19:48:06 | | britmob joins |
| 19:50:14 | | sliccricc quits [Remote host closed the connection] |
| 19:50:31 | | sliccricc (sliccricc) joins |
| 20:43:55 | | sliccricc quits [Ping timeout: 264 seconds] |
| 21:32:35 | | britmob quits [Read error: Connection reset by peer] |
| 21:44:55 | | britmob joins |
| 22:44:32 | | mutantmonkey quits [Remote host closed the connection] |
| 22:44:58 | | mutantmonkey (mutantmonkey) joins |
| 23:04:26 | | janpaul123 quits [Read error: Connection reset by peer] |
| 23:04:41 | | janpaul123 (janpaul123) joins |
| 23:18:34 | | treora quits [Ping timeout: 244 seconds] |
| 23:28:41 | | treora joins |
| 23:51:40 | | @kiska quits [Quit: Ping timeout (120 seconds)] |
| 23:53:27 | | kiska (kiska) joins |
| 23:53:27 | | @ChanServ sets mode: +o kiska |
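
Editor's note: the two-stage lookup OrIdow6 describes at 03:51:57, combined with aarchi's idea at 04:40:23 of treating shortcodes as integers in their base, could look roughly like the sketch below. This is only an illustration under assumptions that do not come from the log: the base-62 alphabet, the tiny block size, the helper names (`shortcode_to_int`, `build_index`, `resolve`), and the use of zlib in place of xz are all hypothetical choices, and the sample URLs are made up.

```python
import zlib

# Assumed alphabet; real shorteners differ, and this is not from the log.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def shortcode_to_int(code, alphabet=ALPHABET):
    """Interpret a shortcode as a number in base len(alphabet).

    Caveat: codes that differ only by leading "zero" characters collide,
    so a real index would also need to record the code length.
    """
    n = 0
    for ch in code:
        n = n * len(alphabet) + alphabet.index(ch)
    return n


def build_index(pairs, block_size=2):
    """Build both stages from a list of (shortcode, url) pairs.

    Stage 1 maps the shortcode integer to a URL id. Stage 2 stores the
    URLs in sorted order, so similar URLs sit next to each other, and
    compresses them in blocks of block_size entries.
    """
    urls = sorted({url for _, url in pairs})
    url_id = {url: i for i, url in enumerate(urls)}
    stage1 = {shortcode_to_int(code): url_id[url] for code, url in pairs}
    blocks = [
        zlib.compress("\n".join(urls[i:i + block_size]).encode("utf-8"))
        for i in range(0, len(urls), block_size)
    ]
    return stage1, blocks


def resolve(code, stage1, blocks, block_size=2):
    """Resolve a shortcode by decompressing only the block that holds it."""
    uid = stage1.get(shortcode_to_int(code))
    if uid is None:
        return None
    lines = zlib.decompress(blocks[uid // block_size]).decode("utf-8").split("\n")
    return lines[uid % block_size]


pairs = [
    ("a1", "https://www.mediafire.com/file/abc"),
    ("b2", "https://example.org/page"),
    ("Zz", "https://www.mediafire.com/file/abd"),
]
stage1, blocks = build_index(pairs)
print(resolve("a1", stage1, blocks))
```

In practice the blocks would be a few MB each, as suggested in the log, and stage 1 would be a packed array rather than a dict if the shortcode integers turn out to be dense. As noted at 03:49:23, the entropy of the URLs themselves still sets a floor on the total size.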