| 00:19:56 | | BlueMaxima joins |
| 00:24:38 | | brgtt quits [Client Quit] |
| 00:24:53 | <billy549> | Is it allowed to slightly modify ArchiveTeam's logo? |
| 00:25:25 | <billy549> | i cant really go into specifics without it being obvious so please DM me and i can explain |
| 00:25:29 | <billy549> | in fact, ill just |
| 00:27:09 | <@OrIdow6> | ? |
| 00:28:55 | <billy549> | it's fine, i dmed JAA instead - please ignore me :P |
| 00:29:16 | <@OrIdow6> | Oh |
| 00:38:53 | | vela quits [Quit: Ping timeout (120 seconds)] |
| 00:39:22 | | vela (vela) joins |
| 00:53:42 | | Dragnog quits [Client Quit] |
| 00:54:10 | | onetruth joins |
| 01:02:24 | | dm4v quits [Read error: Connection reset by peer] |
| 01:03:30 | | dm4v joins |
| 01:03:33 | | dm4v is now authenticated as dm4v |
| 01:03:33 | | dm4v quits [Changing host] |
| 01:03:33 | | dm4v (dm4v) joins |
| 01:14:47 | | Arcorann (Arcorann) joins |
| 01:38:02 | | Mineroboter_ joins |
| 01:38:51 | | Mineroboter quits [Ping timeout: 258 seconds] |
| 01:41:45 | | Arcorann_ joins |
| 01:43:50 | | Arcorann quits [Ping timeout: 258 seconds] |
| 02:23:11 | | EggplantN5 joins |
| 02:23:26 | | @EggplantN quits [Read error: Connection reset by peer] |
| 02:23:26 | | EggplantN5 is now known as EggplantN |
| 02:24:14 | | ThreeHeadedMonkey quits [Ping timeout: 250 seconds] |
| 02:26:25 | | ThreeHeadedMonkey (ThreeHeadedMonkey) joins |
| 02:43:38 | | EggplantN quits [Ping timeout: 258 seconds] |
| 02:52:26 | <purplebot> | Frequently Asked Questions edited by JustAnotherArchivist (+8, /* halp pls halp */ Fix WBM inclusion …) just now -- https://www.archiveteam.org/?diff=46532&oldid=45717 |
| 02:53:02 | | EggplantN joins |
| 03:23:19 | | captbaritone (captbaritone) joins |
| 03:34:37 | | DogsRNice quits [Read error: Connection reset by peer] |
| 03:46:08 | | lennier2 joins |
| 03:46:23 | | lennier1 quits [Read error: Connection reset by peer] |
| 03:46:28 | | lennier2 is now known as lennier1 |
| 03:51:31 | | qw3rty__ joins |
| 03:55:19 | | qw3rty_ quits [Ping timeout: 258 seconds] |
| 04:02:10 | | captbaritone quits [Remote host closed the connection] |
| 04:05:31 | | etnguyen03 quits [Client Quit] |
| 04:21:57 | | iki joins |
| 04:49:00 | | nerdguy1138 quits [Client Quit] |
| 04:49:40 | | nerdguy1138 (nerdguy1138) joins |
| 05:03:38 | | Zopolis4 (Zopolis4) joins |
| 05:31:39 | | Zopolis4 quits [Ping timeout: 244 seconds] |
| 05:36:33 | | Zopolis4 (Zopolis4) joins |
| 05:41:24 | | Zopolis4 quits [Remote host closed the connection] |
| 06:12:00 | | pcr leaves |
| 06:31:36 | | pcr joins |
| 06:32:58 | | HackMii quits [Client Quit] |
| 07:30:52 | | HackMii (hacktheplanet) joins |
| 07:35:56 | | Zopolis4 (Zopolis4) joins |
| 07:45:51 | | HP_Archivist quits [Client Quit] |
| 08:27:31 | | hooway joins |
| 08:36:20 | | Dragnog joins |
| 08:54:05 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 10:17:30 | | Arcorann_ quits [Ping timeout: 258 seconds] |
| 10:38:12 | | Terbium quits [Client Quit] |
| 10:38:33 | | Terbium joins |
| 10:46:23 | | spirit joins |
| 11:02:40 | | Zopolis4 quits [Remote host closed the connection] |
| 11:08:37 | | Arcorann_ joins |
| 11:13:28 | | Arcorann_ quits [Ping timeout: 258 seconds] |
| 11:25:48 | | Arcorann_ joins |
| 12:52:26 | <purplebot> | Template:Czech websites edited by Sanqui (+27, Add szm.com and call it Slovak …), Sanqui (-4, two lines) just now -- https://www.archiveteam.org/?diff=46534&oldid=46222 |
| 13:05:50 | | LeighR (LeighR) joins |
| 13:13:33 | | HackMii_ (hacktheplanet) joins |
| 13:15:05 | | HackMii quits [Remote host closed the connection] |
| 13:23:21 | <avoozl1> | how big is the yahoo_answers collection getting? would I need all files within the archiveteam_yahooanswers collection to be complete? |
| 13:23:47 | <avoozl1> | There's 789 parts at the moment, but I'm not sure where to check status |
| 13:24:15 | <@arkiver> | avoozl1: you want to get a copy? |
| 13:24:19 | <@arkiver> | it'll be tens of TBs |
| 13:24:23 | <@arkiver> | maybe 100+ |
| 13:24:28 | <avoozl1> | I would. I'm building a forum indexer and this seems like a good testcase |
| 13:24:35 | <avoozl1> | tens of TBs I can handle, 100+ not so much |
| 13:24:52 | <@arkiver> | the archiveteam_yahooanswers contains data from an old yahooanswers project as well |
| 13:24:55 | | LeighR quits [Ping timeout: 244 seconds] |
| 13:25:08 | <avoozl1> | ahh yeah I see, there's something from 2016 in there |
| 13:25:15 | <avoozl1> | and 2017 |
| 13:25:22 | <@arkiver> | yeah all from 2021+ is from the current project |
| 13:25:29 | <avoozl1> | I can pick every id that starts with archiveteam_yahooanswers_2021 |
| 13:25:32 | <avoozl1> | thanks |
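The prefix selection described above can be sketched as a plain string filter. This is only a minimal sketch: the identifiers in the example list are made up for illustration, and only the prefix `archiveteam_yahooanswers_2021` comes from the conversation.

```python
def select_items(identifiers, prefix="archiveteam_yahooanswers_2021"):
    """Return the identifiers that start with the given prefix, sorted."""
    return sorted(i for i in identifiers if i.startswith(prefix))


# Hypothetical identifiers, for illustration only.
example = [
    "archiveteam_yahooanswers_20160101000000",
    "archiveteam_yahooanswers_20210421000000_123456",
    "archiveteam_yahooanswers_20210420123456_abcdef",
    "archiveteam_yahooanswers_dictionary_1600000000",
]

# Only the two 2021 entries match; the 2016 entry and the differently
# prefixed dictionary entry are left out by this filter.
selected = select_items(example)
```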
| 13:25:42 | <@arkiver> | back then yahoo answers had a different structure (no horrible PUT requests for pagination) |
| 13:26:11 | <avoozl1> | is the archiveteam_yahooanswers_dictionary_16* part of the new run? |
| 13:26:22 | <@arkiver> | yes |
| 13:26:25 | | etnguyen03 (etnguyen03) joins |
| 13:26:42 | <@arkiver> | everything is compressed with ZSTD with a dictionary |
| 13:26:50 | <@arkiver> | that dictionary is stored there for safekeeping |
| 13:26:56 | <avoozl1> | yeah I had some issues missing these in earlier downloads :) good to have these dicts |
| 13:27:03 | <avoozl1> | otherwise the data is useless :) |
| 13:27:22 | <@arkiver> | i mean the dictionary is in the ZST megaWARC as well |
| 13:27:34 | <avoozl1> | ohh ok. good |
| 13:28:04 | <@arkiver> | details here on how it's stored in the ZST WARC https://github.com/ArchiveTeam/wget-lua/releases/tag/v1.20.3-at.20200401.01 |
| 13:28:13 | <avoozl1> | I'm mostly using my own go-based tools, so I have to take care of some of these things every once in a while by myself |
| 13:28:27 | <@arkiver> | also contains details on deduplication |
| 13:31:51 | <avoozl1> | current size seems <10TB but I'm not sure how far it has gotten |
| 13:33:16 | <avoozl1> | I'll grab a few and start coding, it will take a very long time for this to trickle down into my local machine :) |
| 13:33:21 | <avoozl1> | Thanks! |
| 13:43:49 | <rewby> | avoozl1: If you're interested, I have a golang package to read the special zstd format |
| 13:56:17 | <avoozl1> | rewby: that'd be awesome. I'm currently using github.com/CorentinB/warc |
| 13:56:36 | | avoozl1 is now known as avoozl |
| 14:00:12 | <rewby> | avoozl1: here is my library. It's not the best library out there but it does one thing I wasn't able to find in any other golang one. Well, two things. It does our zstd files and it does streaming io. That is, I don't need to load an entire record into memory to process it. My own code uses that to skip past big media since I only care about html content |
| 14:00:24 | <rewby> | https://gitlab.roelf.org/warcscan/warcreader |
| 14:02:37 | <avoozl> | Thanks, I'll take a look |
| 14:03:21 | <rewby> | I use this to pull multiple gigabits out of the IA and analyze the warcs for urls |
| 14:04:51 | <avoozl> | Looks awesome. I could probably drop this into my codebase almost as is |
| 14:05:25 | <avoozl> | I'm working on a bit of a hobby project to consume forum grabs in warc format and turn them into a searchable forum (using bleve/bluge as on-disk index format) |
| 14:05:34 | <rewby> | Yeah, feel free to poke me if you need support for it |
| 14:05:41 | <avoozl> | I haven't given it the time it requires lately, but it is in a functioning state |
| 14:05:54 | <rewby> | Cool! |
| 14:06:22 | <avoozl> | basically parses all the response bodies, runs them through goquery so you can use css selectors or xpath to extract the necessary parts at a thread/post level, and then builds a giant index with a bit of a simple web interface on top |
| 14:06:36 | <rewby> | That's neat! |
| 14:06:51 | <avoozl> | it requires custom work for each forum of course, but the customization is fairly minimal.. I started from the league of legends forums |
| 14:07:04 | <rewby> | If you have any licencing concerns with my library, feel free to ask and we'll work it out |
| 14:07:15 | <avoozl> | awesome, will do |
| 14:07:32 | <rewby> | That's really neat. I quite like doing things with previously archived data. |
| 14:07:42 | <Sanqui> | I'm a big fan too! |
| 14:08:08 | <Sanqui> | I've been thinking about how useful a "personalized forum search engine" would be to me |
| 14:08:18 | <rewby> | I've got this compute cluster building a huge urls database to help with discovery for new projects |
| 14:08:26 | <Sanqui> | oftentimes when searching for something, google is garbage and I'd rather know "what are people saying about this" |
| 14:08:28 | <Sanqui> | y'know? |
| 14:09:02 | <Sanqui> | so if I can help with this, perhaps contribute some forums, that'd be lovely |
| 14:09:15 | <Sanqui> | I'm planning on doing something similar with my Discord archives later |
| 14:09:32 | <avoozl> | I'll get the code cleaned up a bit so it can be pushed out somewhere. I still have to complete some parts of the bleve->bluge switch |
| 14:09:43 | <avoozl> | (the ingest is done but the search is still bleve-only) |
| 14:09:50 | <rewby> | Cool |
| 14:10:03 | <rewby> | I'd love to see the code |
| 14:10:38 | <Sanqui> | my Discord archiver is still in progress, but it's looking like it will scale to a few 100s of servers |
| 14:11:13 | <avoozl> | rewby: just a peek right now, but this is what I implement at a forum level.. https://paste.ofcode.org/DKszjr3eNAy4HjEjzJSSp9 |
| 14:11:27 | <Sanqui> | this is important because web forums are dying and "public discords" fill the same cultural niche today... |
| 14:11:40 | <rewby> | Neat |
| 14:11:50 | <avoozl> | rewby: so this is just 'parse the response into an array of Bodies' and then the rest of the pipeline takes care of the indexing |
| 14:12:32 | <rewby> | Yeah, if you want "free" zstd support, feel free to use my code. My gitlab does have like 3-4 mins of downtime every few days while auto updates run, just as a warning |
| 14:12:56 | <avoozl> | one of the big todo's is to modify bluge so that the search index can be hosted as a remote file instead of a local one. Search performance would suffer, but that'd mean I could just put a large index somewhere http-accessible (or s3) and have people query it without any special server side logic |
| 14:13:01 | <masterX244> | pushing up a 2019 dump of forum.brickset.com atm (did that back then due to forum being in danger of shutdown) |
| 14:13:50 | <rewby> | Yeah, that makes lots of sense. A remote index would make sense. Especially if there's no need to do server side compute |
| 14:14:08 | <avoozl> | yeah it'll just be a bunch of range-bytes style retrievals |
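The "range-bytes style retrievals" mentioned above correspond to HTTP `Range` requests. A minimal sketch of how a client might ask for one slice of a remotely hosted index file follows; the URL and the offsets are placeholders, and nothing here reflects how bluge itself reads its files.

```python
import urllib.request


def range_header(offset, length):
    """Build an HTTP Range header value for `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # Byte ranges are inclusive on both ends, hence the -1.
    return f"bytes={offset}-{offset + length - 1}"


def build_range_request(url, offset, length):
    """Create (but do not send) a GET request for one slice of a remote file."""
    req = urllib.request.Request(url)
    req.add_header("Range", range_header(offset, length))
    return req


# Hypothetical index location; example.org is a placeholder host.
req = build_range_request("https://example.org/forum.index", 4096, 8192)
```

Passing `req` to `urllib.request.urlopen` would send it; a server that honours ranges answers with status 206 and only the requested bytes, so no server-side logic beyond static file hosting is needed.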
| 14:14:58 | <rewby> | Makes a lot of sense. I would personally avoid s3 or similar due to the bandwidth/per request costs. If the files are small enough a vps or something could do the trick |
| 14:15:39 | <avoozl> | typically I'm finding the file will be about as large as, or slightly larger than, the gz compressed input text, so they are fairly large |
| 14:16:00 | <rewby> | Ah hm. |
| 14:16:10 | <rewby> | Try zstd. It does wonders |
| 14:16:15 | <avoozl> | rewby: you can always opt to do requester-pays on s3, but I guess for some users that'd be a hurdle |
| 14:16:31 | <rewby> | Yeah, I can imagine |
| 14:16:51 | <rewby> | I'd almost wonder if the IA would be willing to host the index files |
| 14:16:55 | <avoozl> | rewby: yeah zstd compresses better, but the index itself can't really be compressed well. at least not without sacrificing a ton of performance.. I've been chatting with the author of bluge but it wasn't really on their radar as a usecase so it takes some time to look at the options there |
| 14:17:25 | <rewby> | Ah that sucks. I'm personally lucky my warc work compresses well |
| 14:17:38 | <rewby> | 20billion urls in less than 200G |
| 14:17:38 | <avoozl> | (if I compress it I loose proper seek performance, and it needs pretty fine grained retrieval.. I tried 8k or 64k block based compression but it wasn't that great for this usecase) |
| 14:18:04 | <avoozl> | s/loose/lose/ |
| 14:18:10 | <rewby> | Yeah no, makes sense |
| 14:18:56 | <avoozl> | my json stuff also compresses like crazy with zstd |
| 14:19:11 | <rewby> | Yeah, zstd is bloody magic |
| 14:19:18 | <Sanqui> | I've been compressing jsonl with gzip |
| 14:19:21 | <Sanqui> | should I look into zstd? |
| 14:19:30 | <rewby> | I'd say you should |
| 14:19:41 | <rewby> | I got like 3-4x better compression out of zstd |
| 14:19:44 | <Sanqui> | does zstd have better indexing support? |
| 14:19:47 | <rewby> | With no more overhead |
| 14:20:08 | <rewby> | I dunno about indexing. I mostly do streaming IO |
| 14:20:17 | <EggplantN> | Zstd is magic we use it at work for VM backups/snapshots |
| 14:22:10 | <avoozl> | I've used zstd with a 64kb block size, then train a dict on that, and then compress each block so we can effectively seek (with an offset table next to it).. That still gives pretty good compression for some of our data files (better than compressing each block individually).. but only when the file is pretty monotonous |
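The block-plus-offset-table layout described above can be sketched with the Python standard library. Note the substitution: this uses zlib with a preset dictionary (`zdict`) in place of zstd with a trained dictionary, and the "dictionary" is simply a hand-picked sample record, since zlib has no trainer. The function names and the sample data are made up for illustration.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KiB blocks, as in the message above


def compress_blocks(data, zdict, block_size=BLOCK_SIZE):
    """Compress data block by block; return (blob, offsets).

    offsets[i] is where compressed block i starts in blob; the final
    entry marks the end of the blob.
    """
    parts, offsets, pos = [], [0], 0
    for start in range(0, len(data), block_size):
        # A fresh compressor per block keeps blocks independently decodable,
        # while the shared dictionary recovers some cross-block redundancy.
        comp = zlib.compressobj(level=6, zdict=zdict)
        chunk = comp.compress(data[start:start + block_size]) + comp.flush()
        parts.append(chunk)
        pos += len(chunk)
        offsets.append(pos)
    return b"".join(parts), offsets


def read_at(blob, offsets, zdict, offset, length, block_size=BLOCK_SIZE):
    """Read `length` uncompressed bytes at `offset`, touching only the blocks needed."""
    if length <= 0:
        return b""
    first = offset // block_size
    last = min((offset + length - 1) // block_size, len(offsets) - 2)
    out = []
    for i in range(first, last + 1):
        decomp = zlib.decompressobj(zdict=zdict)
        out.append(decomp.decompress(blob[offsets[i]:offsets[i + 1]]))
    skip = offset - first * block_size
    return b"".join(out)[skip:skip + length]


# Made-up, fairly monotonous JSON lines; the dictionary is one sample record.
zdict = b'{"user": "example", "text": "post 0"}\n'
data = b"".join(b'{"user": "example", "text": "post %d"}\n' % i for i in range(20000))
blob, offsets = compress_blocks(data, zdict)
piece = read_at(blob, offsets, zdict, 65530, 20)  # spans two blocks
```

The offset table is what makes seeking possible: a read only decompresses the blocks that overlap the requested range, at the cost of a worse ratio than compressing the file as one stream.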
| 14:22:59 | <rewby> | I generally leave it set to default and it works loads better than gzip |
| 14:23:45 | <avoozl> | well xz compresses my json best, but it is also slow to compress and decompress |
| 14:24:02 | <avoozl> | zstd is faster and compresses a bit less.. though the speed for me makes up for it |
| 14:25:12 | <Sanqui> | incidentally, the upcoming release of fedora comes with btrfs with zstd compression enabled by default |
| 14:25:21 | <Sanqui> | so i guess it's *really* good |
| 14:25:45 | <avoozl> | I do have btrfs, but have compression disabled as I use dedup a lot and they don't seem to mingle well |
| 14:26:45 | <avoozl> | as a quick compression comparison for my json (which contains quite a bit of plain text content): https://paste.ofcode.org/q7PYhs6D8spbgc8KHm6San |
| 14:28:23 | <Sanqui> | 303217943234215948.jsonl : 10.17% (146585063 => 14903289 bytes, 303217943234215948.jsonl.zst) |
| 14:28:51 | <Sanqui> | original is 146.6 MB, gzipped 17.4 MB, zst'd 14.9 MB |
| 14:29:05 | <Sanqui> | I guess that adds up |
| 14:30:15 | <Sanqui> | xz is waaaay slower but nets 11.2 MB |
| 14:39:18 | <@JAA> | All of these comparisons are useless unless you mention the compression level. :-) |
| 14:40:30 | <Sanqui> | the defaults matter :) |
| 14:40:32 | <@JAA> | zstd's defaults are more in favour of speed than of compression ratio compared to the other tools. |
| 14:40:55 | <Sanqui> | I understand that |
| 14:40:57 | | brgtt joins |
| 14:41:12 | <Sanqui> | I also tried zstd with -T0 --ultra -20 and it got to the same size that xz did |
| 14:41:24 | <@JAA> | I've found that zstd -10 is about comparable in runtime to gzip -9. Depends a lot on the data of course. |
| 14:42:35 | <@JAA> | And zstd -2 yielded similarly sized files as gzip -9 at a tenfold shorter runtime. |
| 14:43:05 | <@JAA> | This was using log files and SQLite databases from ArchiveBot, FWIW. |
| 14:44:41 | | brgtt quits [Read error: Connection reset by peer] |
| 14:45:44 | <lunik1> | for absolute best compression lrzip+zpaq gives me the best ratios, too slow to be practical in most circumstances though |
| 14:47:43 | <@JAA> | Ultimately, one just needs to test all compression levels (and possibly extra settings like threading on zstd) and then analyse at what point the additional compression ratio is no longer worth the extra computational effort. The results will differ wildly depending on what you're compressing, what the bottleneck is, etc. |
| 14:48:10 | <lunik1> | and your definition of "no longer worth" |
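The "test all compression levels" approach can be sketched as a small sweep that records size and runtime per level. This sketch uses zlib from the Python standard library as a stand-in; zstd has its own, wider level range and extra settings such as threading, so the numbers do not carry over. The sample data is invented log-like text.

```python
import time
import zlib


def sweep_levels(data, levels=range(1, 10)):
    """Compress data at each level; return (level, compressed_size, seconds) tuples."""
    results = []
    for level in levels:
        start = time.perf_counter()
        size = len(zlib.compress(data, level))
        results.append((level, size, time.perf_counter() - start))
    return results


# Made-up log lines; real measurements should use a sample of the actual data.
sample = b"".join(
    b"2021-04-22 12:00:%02d GET /question/%d 200\n" % (i % 60, i)
    for i in range(50000)
)
results = sweep_levels(sample)
for level, size, seconds in results:
    print(f"level {level}: {size / len(sample):.2%} of original, {seconds * 1000:.1f} ms")
```

Where the extra ratio stops being worth the extra time is then a judgment made on the printed table, which is why the answer differs per dataset and per bottleneck.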
| 14:53:32 | | trinsic joins |
| 14:58:47 | <avoozl> | Just a check, the yahoo answers warc, that does exclude images and other frills, right? Because that is some serious amount of text |
| 14:59:30 | | trinsic quits [Client Quit] |
| 15:05:36 | | betamax quits [Ping timeout: 250 seconds] |
| 15:07:10 | | betamax (betamax) joins |
| 15:17:18 | | Arcorann_ quits [Ping timeout: 250 seconds] |
| 15:26:20 | | Hyenadae quits [Ping timeout: 244 seconds] |
| 15:42:41 | | pcr leaves |
| 15:43:05 | | pcr joins |
| 16:05:32 | | Dragnog quits [Client Quit] |
| 16:17:33 | | Dragnog joins |
| 17:03:01 | | blub joins |
| 17:13:40 | | nathan quits [Quit: Konversation terminated!] |
| 17:25:29 | <avoozl> | rewby: did you have any reasons for picking github.com/DataDog/zstd over github.com/valyala/gozstd ? I'm using the latter one |
| 17:43:33 | | Ravenloft (Ravenloft) joins |
| 17:59:02 | | EggplantN is now authenticated as EggplantN |
| 17:59:02 | | EggplantN quits [Changing host] |
| 17:59:02 | | EggplantN (EggplantN) joins |
| 17:59:02 | | @ChanServ sets mode: +o EggplantN |
| 18:05:47 | | bit joins |
| 19:02:46 | | nertzy quits [Client Quit] |
| 19:10:26 | | limb quits [Read error: Connection reset by peer] |
| 19:40:32 | | blub quits [Ping timeout: 244 seconds] |
| 20:14:27 | | colona quits [Quit: leaving] |
| 21:02:01 | | Hyenadae joins |
| 21:19:36 | | Ravenloft quits [Remote host closed the connection] |
| 21:33:36 | | user777_ quits [Remote host closed the connection] |
| 21:34:05 | | LeighR (LeighR) joins |
| 21:34:42 | <masterX244> | avoozl: bricksetforum still uploading, 31/38 warcs up. total 190GB (since i captured embedded and referenced media, too) |
| 21:42:09 | <@HCross> | avoozl: the warcs contain everything |
| 21:42:31 | <@HCross> | The entire point of them is to give you enough data to recreate the experience of being on the site without being online |
| 21:43:03 | <@HCross> | also: look into AWS S3 but using Requester Pays |
| 21:44:44 | <@HCross> | https://docs.aws.amazon.com/AmazonS3/latest/userguide/RequesterPaysBuckets.html |
| 21:46:00 | <tech234a> | Side note: one of us should update the Docker Warrior tutorial to ensure that configuration is persisted across Watchtower updates |
| 21:51:03 | | godane2 quits [Client Quit] |
| 22:13:16 | | hooway quits [Client Quit] |
| 22:16:03 | | LeighR quits [Ping timeout: 244 seconds] |
| 22:18:10 | | Jonboy345 quits [Ping timeout: 258 seconds] |
| 22:37:55 | | colona joins |
| 22:45:16 | | rsn joins |
| 22:53:26 | | BlueMaxima joins |