| 00:19:54 | <@JAA> | Oh yeah, that'd be neat. :-) |
| 00:24:00 | <mikolaj|m> | pabs: yeah that's one of the goals |
| 00:24:34 | <mikolaj|m> | but not yet, too early |
| 00:25:21 | <pabs> | great! once it can do Maildir and Hacker News (and LWN), I'd love to use it locally |
| 00:25:57 | | pabs needs to start packaging archiving related things for Debian |
| 00:26:00 | <anarcat> | a readme file would be great :) |
| 00:26:07 | <mikolaj|m> | good to know |
| 02:36:57 | <pabs> | will forum-dl move into the ArchiveTeam namespace? |
| 02:37:17 | <pabs> | "Google Groups has been left to die" https://ahelwer.ca/post/2023-03-08-google-groups/ |
| 02:57:21 | | TheTechRobo quits [Client Quit] |
| 02:58:19 | <pabs> | "Tell HN: Freenom (the operator of .tk, .ml, .ga, .cf, .gq TLDs) is falling apart" https://news.ycombinator.com/item?id=34194555 https://krebsonsecurity.com/2023/03/sued-by-meta-freenom-halts-domain-registrations/ https://news.ycombinator.com/item?id=35062806 |
| 02:58:41 | | TheTechRobo (TheTechRobo) joins |
| 03:02:33 | | TheTechRobo quits [Remote host closed the connection] |
| 04:12:10 | | DLoader quits [Client Quit] |
| 05:02:46 | | datechnoman quits [Ping timeout: 252 seconds] |
| 05:08:27 | <h2ibot> | Notpushkin edited Talk:ArchiveTeam Warrior (+776, /* Limiting Warrior bandwidth with Docker */…): https://wiki.archiveteam.org/?diff=49534&oldid=48318 |
| 05:13:57 | | datechnoman (datechnoman) joins |
| 05:34:11 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 06:03:49 | | hackbug quits [Ping timeout: 252 seconds] |
| 06:14:43 | | Arcorann (Arcorann) joins |
| 06:45:04 | | Shjosan quits [Read error: Connection reset by peer] |
| 06:45:53 | | Shjosan (Shjosan) joins |
| 07:03:14 | | Island quits [Read error: Connection reset by peer] |
| 07:22:28 | | hitgrr8 joins |
| 07:42:04 | | DLoader joins |
| 08:00:16 | | LeGoupil joins |
| 08:16:07 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
| 08:45:43 | <@JAA> | Re TotalBiscuit, I threw a few things into socialbot, #down-the-tube, and #youtubearchive, but just the obvious stuff. There's certainly more. |
| 09:19:33 | <SketchCow> | The fuck with that, frankly |
| 09:20:48 | <SketchCow> | When your widow is a moron |
| 09:21:10 | <@JAA> | I haven't followed much of her activities, but ... yeah. |
| 09:21:24 | <JTL> | I suspect grief does strange things to people |
| 10:08:53 | <masterX244> | And the fact that "AI" can be used for faking stuff based on that. Multiple factors mixing up at that spot |
| 11:19:20 | | LeGoupil quits [Ping timeout: 252 seconds] |
| 11:20:26 | | Mateon1 quits [Ping timeout: 252 seconds] |
| 11:21:04 | | lennier2 joins |
| 11:24:28 | | lennier1 quits [Ping timeout: 274 seconds] |
| 11:24:37 | | lennier2 is now known as lennier1 |
| 12:31:32 | <pabs> | is there a project for archiving Tor Onion Sites? |
| 12:33:42 | | LeGoupil joins |
| 12:57:39 | | hackbug (hackbug) joins |
| 13:31:09 | | Mateon1 joins |
| 13:33:09 | | Mateon1 quits [Remote host closed the connection] |
| 13:35:30 | | Mateon1 joins |
| 13:41:21 | | LeGoupil quits [Remote host closed the connection] |
| 13:41:27 | | LeGoupil joins |
| 13:45:10 | | LeGoupil quits [Remote host closed the connection] |
| 13:45:14 | | LeGoupil joins |
| 13:52:43 | | Mateon1 quits [Remote host closed the connection] |
| 13:54:28 | | Mateon1 joins |
| 13:58:17 | <kpcyrd> | I'd be interested in that, enumeration became very difficult with v3 onions tho |
| 14:05:51 | <pabs> | I was thinking just ArchiveBot support, so archiving individual sites |
| 14:20:23 | | HP_Archivist (HP_Archivist) joins |
| 14:45:42 | | HP_Archivist quits [Client Quit] |
| 15:03:55 | | Arcorann quits [Ping timeout: 252 seconds] |
| 16:37:52 | | Island joins |
| 17:11:12 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
| 17:39:29 | | ehmry joins |
| 17:49:31 | | Mateon2 joins |
| 17:49:33 | | Island_ joins |
| 17:50:48 | | LeGoupil quits [Remote host closed the connection] |
| 17:50:48 | | Mateon1 quits [Remote host closed the connection] |
| 17:50:48 | | Island quits [Remote host closed the connection] |
| 17:50:48 | | Mateon2 is now known as Mateon1 |
| 17:50:49 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 18:13:23 | <@JAA> | pabs: We used to have a special AB pipeline with Tor. Its data went to a separate collection and not into the WBM. Hasn't been running in years though. |
| 18:49:19 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
| 18:49:50 | <avoozl> | mikolaj|m: oh i'm kind of AFK'ish on these channels, but yeah I did build something but it was in a very early stage |
| 18:50:41 | <avoozl> | mikolaj|m: basically I did this once in golang a few years back, and kind of abandoned that project. The repo is a bit of a mess, it was set up to do one thing only and it did work but it needed serious modularization in order to be applied to other dumps |
| 18:50:59 | <avoozl> | mikolaj|m: I recently started testing the same thing in rust, but that is not in a usable state |
| 18:54:25 | <mikolaj|m> | avoozl: so, no public repo? I wanted to take a look and maybe steal some ideas. My work isn't in a usable state either |
| 19:03:36 | <avoozl> | mikolaj|m: feel free to take a look, I can also explain what my concept was |
| 19:04:44 | <mikolaj|m> | avoozl: sure, but I'd need to know where the repo is (haven't managed to find it) |
| 19:05:14 | <avoozl> | mikolaj|m: it isn't public yet, but feel free to tell me a github name and I'll invite you |
| 19:05:44 | <mikolaj|m> | avoozl: thanks, my name on GitHub is "mikwielgus" |
| 19:08:19 | <avoozl> | oh apparently I can only give you read&write access, but not just read. Let me check if I should just make this public instead. |
| 19:12:42 | <avoozl> | mikolaj|m: here you go https://github.com/fairuse/warceater |
| 19:13:11 | <mikolaj|m> | avoozl: thanks! |
| 19:13:35 | <avoozl> | mikolaj|m: it consists mostly of commandline tools, but the documentation is of course missing in action :P |
| 19:14:43 | <avoozl> | mikolaj|m: I recall I got it working for both yahoo answers as well as for the league of legends forums |
| 19:15:06 | <avoozl> | mikolaj|m: Then I started implementing a generic parser and that is where I took a break |
| 19:15:47 | <avoozl> | mikolaj|m: the core idea for the parsers is that you get a document and you need to return posts, https://github.com/fairuse/warceater/blob/main/pkg/parsers/league.go |
| 19:17:35 | <avoozl> | mikolaj|m: I recall having to fork the warc reader in go, and put in some effort to support the zstd-compressed 'megawarc' files because that was broken by default.. I went through the same trouble in rust later on but that is in an even more brittle state |
| 19:28:37 | <mikolaj|m> | avoozl: looks great to take inspiration from. I have somewhat more complicated parsing (the entire site is first scanned, subboards are determined, there's separate datatypes for threads and posts), but I'm downloading directly from sites, no WARC support so far |
| 19:52:44 | <avoozl> | mikolaj|m: the neat thing about WARC is there is a lot of stuff already available on archive.org, in varying sizes, and you are not dependent on building a scraper first. But I can see the need in doing both :) |
| 19:53:47 | <avoozl> | My idea was to make the post structure in the data store as simple as possible, and then extend it with other fields (like board), as I grew the project. Bleve/bluge allowed indexing on these fields so there was no need to keep the hierarchical structure, because it could be easily reconstructed on the fly during retrieval |
| 19:54:29 | <avoozl> | My purpose was also more as a self-hosted system, so not for at-scale hosting. Otherwise I would have picked the performance tradeoffs differently. I mainly optimized for storage (storage size) and flexibility, keeping a reasonable single user retrieval experience |
| 19:57:02 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 21:20:35 | | umgr036 joins |
| 21:21:23 | | umgr036 quits [Remote host closed the connection] |
| 21:21:36 | | umgr036 joins |
| 21:45:06 | | umgr036 quits [Remote host closed the connection] |
| 21:45:21 | | umgr036 joins |
| 21:48:09 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
| 21:56:25 | | hitgrr8 quits [Client Quit] |
| 22:09:16 | | umgr036 quits [Remote host closed the connection] |
| 22:09:18 | | umgr036 joins |
| 22:11:38 | | pabs quits [Ping timeout: 252 seconds] |
| 22:12:15 | | pabs (pabs) joins |
| 22:34:50 | | umgr036 quits [Remote host closed the connection] |
| 22:35:05 | | umgr036 joins |
| 22:38:34 | | pabs quits [Ping timeout: 265 seconds] |
| 22:38:51 | <h2ibot> | Finnless edited Deathwatch (+273, /* 2023 */): https://wiki.archiveteam.org/?diff=49535&oldid=49528 |
| 22:39:10 | | pabs (pabs) joins |
| 22:46:18 | | pabs quits [Ping timeout: 265 seconds] |
| 22:46:55 | | pabs (pabs) joins |
| 22:47:36 | | Arcorann (Arcorann) joins |
| 23:15:37 | | myself quits [Read error: Connection reset by peer] |
| 23:15:40 | | myself9 joins |
| 23:19:54 | | myself9 quits [Read error: Connection reset by peer] |
| 23:20:05 | | myself joins |
| 23:59:47 | | finnless (finnless) joins |
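
Note on the parser/storage idea avoozl describes at 19:15:47 and 19:53:47 (a parser gets a document and returns posts; posts are stored flat with extra fields such as board, and the hierarchy is rebuilt at retrieval time): a minimal Python sketch of that shape follows. It is not code from warceater, which is written in Go. The `Post` fields, the `<div class="post" data-author="...">` markup, and the function names `parse_document` and `rebuild_hierarchy` are all assumptions made up for illustration, and the in-memory grouping stands in for whatever an index like Bleve/bluge would do.

```python
from collections import defaultdict
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass(frozen=True)
class Post:
    """Flat post record; board and thread are plain fields, not nesting."""
    board: str
    thread: str
    author: str
    body: str


class _PostExtractor(HTMLParser):
    """Collects text from <div class="post" data-author="..."> elements.

    The markup is hypothetical, and posts are assumed not to contain
    nested <div> elements.
    """

    def __init__(self):
        super().__init__()
        self.posts = []       # (author, body) tuples in document order
        self._author = None   # set while inside a post element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "post":
            self._author = attrs.get("data-author", "")
            self._chunks = []

    def handle_data(self, data):
        if self._author is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._author is not None:
            self.posts.append((self._author, "".join(self._chunks).strip()))
            self._author = None


def parse_document(html, board, thread):
    """Parser contract: take one document, return the posts found in it."""
    extractor = _PostExtractor()
    extractor.feed(html)
    extractor.close()
    return [Post(board, thread, author, body) for author, body in extractor.posts]


def rebuild_hierarchy(posts):
    """Reconstruct board -> thread -> posts from the flat records on the fly."""
    tree = defaultdict(lambda: defaultdict(list))
    for post in posts:
        tree[post.board][post.thread].append(post)
    return tree
```

Because every record carries its own board and thread, a new field can be added later without restructuring what is already stored, which matches the "start simple, extend as the project grows" approach described in the log.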