00:19:54<@JAA>Oh yeah, that'd be neat. :-)
00:24:00<mikolaj|m>pabs: yeah that's one of the goals
00:24:34<mikolaj|m>but not yet, too early
00:25:21<pabs>great! once it can do Maildir and Hacker News (and LWN), I'd love to use it locally
00:25:57pabs needs to start packaging archiving related things for Debian
00:26:00<anarcat>a readme file would be great :)
00:26:07<mikolaj|m>good to know
02:36:57<pabs>will forum-dl move into the ArchiveTeam namespace?
02:37:17<pabs>"Google Groups has been left to die" https://ahelwer.ca/post/2023-03-08-google-groups/
02:57:21TheTechRobo quits [Client Quit]
02:58:19<pabs>"Tell HN: Freenom (the operator of .tk, .ml, .ga, .cf, .gq TLDs) is falling apart" https://news.ycombinator.com/item?id=34194555 https://krebsonsecurity.com/2023/03/sued-by-meta-freenom-halts-domain-registrations/ https://news.ycombinator.com/item?id=35062806
02:58:41TheTechRobo (TheTechRobo) joins
03:02:33TheTechRobo quits [Remote host closed the connection]
04:12:10DLoader quits [Client Quit]
05:02:46datechnoman quits [Ping timeout: 252 seconds]
05:08:27<h2ibot>Notpushkin edited Talk:ArchiveTeam Warrior (+776, /* Limiting Warrior bandwidth with Docker */…): https://wiki.archiveteam.org/?diff=49534&oldid=48318
05:13:57datechnoman (datechnoman) joins
05:34:11BlueMaxima quits [Read error: Connection reset by peer]
06:03:49hackbug quits [Ping timeout: 252 seconds]
06:14:43Arcorann (Arcorann) joins
06:45:04Shjosan quits [Read error: Connection reset by peer]
06:45:53Shjosan (Shjosan) joins
07:03:14Island quits [Read error: Connection reset by peer]
07:22:28hitgrr8 joins
07:42:04DLoader joins
08:00:16LeGoupil joins
08:16:07qwertyasdfuiopghjkl quits [Remote host closed the connection]
08:45:43<@JAA>Re TotalBiscuit, I threw a few things into socialbot, #down-the-tube, and #youtubearchive, but just the obvious stuff. There's certainly more.
09:19:33<SketchCow>The fuck with that, frankly
09:20:48<SketchCow>When your widow is a moron
09:21:10<@JAA>I haven't followed much of her activities, but ... yeah.
09:21:24<JTL>I suspect grief does strange things to people
10:08:53<masterX244>And the fact that "AI" can be used for faking stuff based on that. Multiple factors mixing up at that spot
11:19:20LeGoupil quits [Ping timeout: 252 seconds]
11:20:26Mateon1 quits [Ping timeout: 252 seconds]
11:21:04lennier2 joins
11:24:28lennier1 quits [Ping timeout: 274 seconds]
11:24:37lennier2 is now known as lennier1
12:31:32<pabs>is there a project for archiving Tor Onion Sites?
12:33:42LeGoupil joins
12:57:39hackbug (hackbug) joins
13:31:09Mateon1 joins
13:33:09Mateon1 quits [Remote host closed the connection]
13:35:30Mateon1 joins
13:41:21LeGoupil quits [Remote host closed the connection]
13:41:27LeGoupil joins
13:45:10LeGoupil quits [Remote host closed the connection]
13:45:14LeGoupil joins
13:52:43Mateon1 quits [Remote host closed the connection]
13:54:28Mateon1 joins
13:58:17<kpcyrd>I'd be interested in that, enumeration became very difficult with v3 onions tho
14:05:51<pabs>I was thinking just ArchiveBot support, so archiving individual sites
14:20:23HP_Archivist (HP_Archivist) joins
14:45:42HP_Archivist quits [Client Quit]
15:03:55Arcorann quits [Ping timeout: 252 seconds]
16:37:52Island joins
17:11:12qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
17:39:29ehmry joins
17:49:31Mateon2 joins
17:49:33Island_ joins
17:50:48LeGoupil quits [Remote host closed the connection]
17:50:48Mateon1 quits [Remote host closed the connection]
17:50:48Island quits [Remote host closed the connection]
17:50:48Mateon2 is now known as Mateon1
17:50:49qwertyasdfuiopghjkl quits [Client Quit]
18:13:23<@JAA>pabs: We used to have a special AB pipeline with Tor. Its data went to a separate collection and not into the WBM. Hasn't been running in years though.
18:49:19qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
18:49:50<avoozl>mikolaj|m: oh i'm kind of AFK'ish on these channels, but yeah I did build something but it was in a very early stage
18:50:41<avoozl>mikolaj|m: basically I did this once in golang a few years back, and kind of abandoned that project. The repo is a bit of a mess, it was set up to do one thing only and it did work but it needed serious modularization in order to be applied to other dumps
18:50:59<avoozl>mikolaj|m: I recently started testing the same thing in rust, but that is not in a usable state
18:54:25<mikolaj|m>avoozl: so, no public repo? I wanted to take a look and maybe steal some ideas. My work isn't in a usable state either
19:03:36<avoozl>mikolaj|m: feel free to take a look, I can also explain what my concept was
19:04:44<mikolaj|m>avoozl: sure, but I'd need to know where the repo is (haven't managed to find it)
19:05:14<avoozl>mikolaj|m: it isn't public yet, but feel free to tell me a github name and I'll invite you
19:05:44<mikolaj|m>avoozl: thanks, my name on GitHub is "mikwielgus"
19:08:19<avoozl>oh apparently I can only give you read&write access, but not just read. Let me check if I should just make this public instead.
19:12:42<avoozl>mikolaj|m: here you go https://github.com/fairuse/warceater
19:13:11<mikolaj|m>avoozl: thanks!
19:13:35<avoozl>mikolaj|m: it consists mostly of commandline tools, but the documentation is of course missing in action :P
19:14:43<avoozl>mikolaj|m: I recall I got it working for both yahoo answers as well as for the league of legends forums
19:15:06<avoozl>mikolaj|m: Then I started implementing a generic parser and that is where I took a break
19:15:47<avoozl>mikolaj|m: the core idea for the parsers is that you get a document and you need to return posts, https://github.com/fairuse/warceater/blob/main/pkg/parsers/league.go
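A minimal sketch of the parser contract described above (a document goes in, a list of posts comes out), written in Python rather than Go. The `Post` fields, the `parse_posts` name, and the `<div class="post" data-author="...">` markup are assumptions for illustration only; none of them are taken from warceater's league.go.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Post:
    # Hypothetical fields; a real parser may return different ones.
    author: str
    body: str


class _PostExtractor(HTMLParser):
    """Collects text inside <div class="post" data-author="..."> elements."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._author = None  # author of the post being read, None when outside a post
        self._chunks = []    # text fragments of the current post
        self._depth = 0      # <div> nesting depth inside the current post

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attrs = dict(attrs)
        if self._author is None and attrs.get("class") == "post":
            # Start of a new post; a missing author becomes an empty string.
            self._author = attrs.get("data-author") or ""
            self._chunks = []
            self._depth = 1
        elif self._author is not None:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag != "div" or self._author is None:
            return
        self._depth -= 1
        if self._depth == 0:
            # Closing tag of the post itself: normalise whitespace and emit it.
            body = " ".join("".join(self._chunks).split())
            self.posts.append(Post(self._author, body))
            self._author = None

    def handle_data(self, data):
        if self._author is not None:
            self._chunks.append(data)


def parse_posts(document: str) -> list[Post]:
    """Parser contract: one document in, a list of posts out."""
    extractor = _PostExtractor()
    extractor.feed(document)
    extractor.close()
    return extractor.posts
```

Under these assumptions, supporting another forum would mean writing another function with the same document-to-posts signature, while the code that reads archives and stores posts stays unchanged.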
19:17:35<avoozl>mikolaj|m: I recall having to fork the warc reader in go, and put in some effort to support the zstd-compressed 'megawarc' files because that was broken by default. I went through the same trouble in rust later on but that is in an even more brittle state
19:28:37<mikolaj|m>avoozl: looks great to take inspiration from. I have somewhat more complicated parsing (the entire site is first scanned, subboards are determined, there's separate datatypes for threads and posts), but I'm downloading directly from sites, no WARC support so far
19:52:44<avoozl>mikolaj|m: the neat thing about WARC is there is a lot of stuff already available on archive.org, in varying sizes, and you are not dependent on building a scraper first. But I can see the need in doing both :)
19:53:47<avoozl>My idea was to make the post structure in the data store as simple as possible, and then extend it with other fields (like board), as I grew the project. Bleve/bluge allowed indexing on these fields so there was no need to keep the hierarchical structure, because it could be easily reconstructed on the fly during retrieval
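A small sketch of the flat-store idea described above: posts are kept as flat records with plain fields, and the board/thread hierarchy is rebuilt at retrieval time. This uses an in-memory list in plain Python as a stand-in for a Bleve/Bluge index; the field names (`board`, `thread`, `seq`, `body`) and the `build_tree` helper are assumptions for illustration, not the project's actual schema.

```python
from collections import defaultdict

# Flat records: every post is one document with plain fields, no nesting.
# The field names (board, thread, seq, body) are assumptions for illustration.
POSTS = [
    {"board": "general", "thread": "t1", "seq": 2, "body": "second"},
    {"board": "help", "thread": "t9", "seq": 1, "body": "question"},
    {"board": "general", "thread": "t1", "seq": 1, "body": "first"},
    {"board": "general", "thread": "t2", "seq": 1, "body": "other thread"},
]


def build_tree(posts):
    """Rebuild board -> thread -> ordered post bodies from flat records."""
    tree = defaultdict(lambda: defaultdict(list))
    # Sorting by (board, thread, seq) restores the order within each thread.
    for post in sorted(posts, key=lambda p: (p["board"], p["thread"], p["seq"])):
        tree[post["board"]][post["thread"]].append(post["body"])
    # Convert to plain dicts so the result compares and prints cleanly.
    return {board: dict(threads) for board, threads in tree.items()}
```

Calling `build_tree(POSTS)` groups the records by board and thread without the store ever having held a nested structure, so adding a new field later only means adding a key to the flat records.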
19:54:29<avoozl>My purpose was also more as a self-hosted system, so not for at-scale hosting. Otherwise I would have picked the performance tradeoffs differently. I mainly optimized for storage (storage size) and flexibility, keeping a reasonable single user retrieval experience
19:57:02qwertyasdfuiopghjkl quits [Client Quit]
21:20:35umgr036 joins
21:21:23umgr036 quits [Remote host closed the connection]
21:21:36umgr036 joins
21:45:06umgr036 quits [Remote host closed the connection]
21:45:21umgr036 joins
21:48:09qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
21:56:25hitgrr8 quits [Client Quit]
22:09:16umgr036 quits [Remote host closed the connection]
22:09:18umgr036 joins
22:11:38pabs quits [Ping timeout: 252 seconds]
22:12:15pabs (pabs) joins
22:34:50umgr036 quits [Remote host closed the connection]
22:35:05umgr036 joins
22:38:34pabs quits [Ping timeout: 265 seconds]
22:38:51<h2ibot>Finnless edited Deathwatch (+273, /* 2023 */): https://wiki.archiveteam.org/?diff=49535&oldid=49528
22:39:10pabs (pabs) joins
22:46:18pabs quits [Ping timeout: 265 seconds]
22:46:55pabs (pabs) joins
22:47:36Arcorann (Arcorann) joins
23:15:37myself quits [Read error: Connection reset by peer]
23:15:40myself9 joins
23:19:54myself9 quits [Read error: Connection reset by peer]
23:20:05myself joins
23:59:47finnless (finnless) joins