00:12:05sonick (sonick) joins
00:15:56sec^nd quits [Ping timeout: 245 seconds]
00:18:37wyatt8740 quits [Ping timeout: 250 seconds]
00:18:54sec^nd (second) joins
00:25:57wyatt8740 joins
00:29:02wyatt8750 joins
00:32:03wyatt8740 quits [Ping timeout: 250 seconds]
00:33:52Island quits [Ping timeout: 265 seconds]
00:36:11Island joins
01:25:21Megame quits [Client Quit]
01:27:31fl0w joins
01:30:38fl0w_ quits [Ping timeout: 264 seconds]
01:45:24yawkat quits [Ping timeout: 265 seconds]
01:56:14<h2ibot>JustAnotherArchivist edited Deathwatch (+282, /* 2022 */ Add KSCO): https://wiki.archiveteam.org/?diff=49351&oldid=49348
02:24:03sec^nd quits [Remote host closed the connection]
02:24:32sec^nd (second) joins
02:32:48<tech234a>Is there a place to put older projects on the wiki? yuku and devport are long offline and can probably be removed
02:32:54<tech234a>from the homepage
02:59:28Doran is now known as Doranwen
03:10:40mskiptr joins
03:11:41<mskiptr>Hello, how should I go about archiving a blogspot.com blog, like this one?
03:11:44<mskiptr>https://milosniczka-slodyczy.blogspot.com/
03:12:29<pokechu22>That should be easy to run via archivebot; I'll do it now
03:12:34<mskiptr>I'd normally find a sitemap that lists links to all posts, or something like that, but there's none as far as I can see
03:12:49<pokechu22>Try view-source:https://milosniczka-slodyczy.blogspot.com/sitemap.xml
03:13:15<pokechu22>though I'm not sure how complete it is, and blogspot is a bit annoying with its pagination-but-not-really system
03:14:05<mskiptr>Thanks pokechu22
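The sitemap approach discussed above can be sketched in Python. This is a minimal, hypothetical helper (function names are mine, not from the chat); note pokechu22's caveat that completeness isn't guaranteed, and that blogspot's sitemap.xml may be a sitemap index whose `<loc>` entries are sub-sitemaps to fetch in turn:

```python
# Sketch: list the post URLs from a blogspot sitemap.xml.
# The <urlset>/<url>/<loc> layout follows the standard sitemaps.org schema.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def post_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap document.
    If the document is a sitemap index, these are sub-sitemap URLs."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text for loc in root.iter(f"{SITEMAP_NS}loc")]

def fetch_sitemap(blog_url: str) -> str:
    """Download a blog's sitemap.xml (e.g. https://example.blogspot.com)."""
    with urllib.request.urlopen(blog_url.rstrip("/") + "/sitemap.xml") as resp:
        return resp.read().decode("utf-8")
```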
04:21:31Mental quits [Ping timeout: 265 seconds]
04:34:25lukeman quits [Quit: Connection closed for inactivity]
05:22:04<sonick>Is there a way to extract image files, etc. in wp-content from a single warc.gz file?
05:26:01<pokechu22>I'm not aware of an automated solution, but you can probably do something with the cdx file to get a list of offsets and then extract them that way
05:48:02<@OrIdow6>sonick: Use warcfilter from https://github.com/internetarchive/warctools to create a WARC containing only the URLs you want. Then use "warcat extract" from https://github.com/chfoo/warcat
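OrIdow6's warcfilter/warcat route is the tooling answer; pokechu22's CDX-offset idea can also be sketched with the stdlib alone. Records in a .warc.gz are individual concatenated gzip members, so seeking to a record's CDX offset and decompressing exactly that many bytes yields one record. A minimal sketch (which CDX columns hold the compressed offset and length varies by CDX format, so treat those as assumptions):

```python
# Sketch: extract a single record from a .warc.gz by byte offset,
# as listed in the accompanying CDX index.
import gzip

def extract_record(warc_path: str, offset: int, length: int) -> bytes:
    """Decompress the single gzip member starting at `offset`,
    spanning `length` compressed bytes (CDX offset/length fields)."""
    with open(warc_path, "rb") as f:
        f.seek(offset)
        member = f.read(length)
    return gzip.decompress(member)
```

From the decompressed record you'd then strip the WARC and HTTP headers to recover the payload (e.g. an image from wp-content).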
06:21:00fuzzy8021 quits [Killed (NickServ (GHOST command used by fuzzy802!~fuzzy8021@173.224.25.67))]
06:21:06fuzzy8021 (fuzzy8021) joins
06:33:21BlueMaxima quits [Client Quit]
06:57:09Island quits [Ping timeout: 265 seconds]
07:00:08mskiptr quits [Remote host closed the connection]
07:10:06sec^nd quits [Ping timeout: 245 seconds]
07:13:46sec^nd (second) joins
07:17:46hitgrr8 joins
08:10:21<madpro|m>How do you folks get IP ranges for your VM-fleets?
08:10:58<madpro|m>My solution until now has just been to sign up on more cloud providers whenever I cannot parallelize locally
08:11:39<madpro|m>As the saying goes: it's not foolish to not know, but it is foolish to not ask better :)
08:30:12fishingforsoup quits [Read error: Connection reset by peer]
09:04:02yawkat (yawkat) joins
10:21:57sonick quits [Client Quit]
10:42:26Icyelut (Icyelut) joins
10:58:16Icyelut quits [Changing host]
10:58:16Icyelut (Icyelut) joins
10:59:22Icyelut quits [Changing host]
10:59:22Icyelut (Icyelut) joins
11:01:22Icyelut quits [Client Quit]
11:12:38Icyelut (Icyelut) joins
11:16:01Icyelut quits [Client Quit]
11:16:13Icyelut (Icyelut) joins
11:38:23Icyelut quits [Client Quit]
11:44:13Icyelut (Icyelut) joins
13:03:04Megame (Megame) joins
14:28:01AlsoTheTechRobo is now known as TheTechRobo
14:33:02hackbug quits [Quit: Lost terminal]
14:35:28hackbug (hackbug) joins
14:49:09jacksonchen666 quits [Remote host closed the connection]
14:51:32jacksonchen666 (jacksonchen666) joins
15:21:02<kiska>madpro|m: Lots and lots of cloud for me :D
15:32:40Island joins
15:40:36<madpro|m>It's all cloud?!
15:41:32<@JAA>πŸŒπŸ‘¨β€πŸš€πŸ”«πŸ‘¨β€πŸš€πŸŒŒ
15:46:08<madpro|m>kiska: You're on Hetzner, right? Do you just rent a bunch of small machines in bulk, or?
15:46:51<phuz-test>madpro|m: When I was running a lot of pipelines, I would get dozens of $5/mo DO instances and deploy them all with Ansible.
15:47:47phuz-test is now known as phuzion
15:51:28<kiska>madpro|m: When I want to do a project I just spin 200 hetzner VMs, connect them to docker swarm and let it rip
15:54:37fl0w quits [Ping timeout: 265 seconds]
16:05:14datechnoman quits [Quit: Ping timeout (120 seconds)]
16:05:34datechnoman (datechnoman) joins
17:07:34jacksonchen666 quits [Remote host closed the connection]
17:21:40Megame quits [Client Quit]
17:34:42<madpro|m>It cannot be that dead simple
17:35:03<madpro|m>Horribly expensive, but still super simple :|
17:36:02<madpro|m>But thanks for the details :)
18:09:30<TheTechRobo>Sider.review is shutting down: https://siderlabs.com/blog/sider-review-is-shutting-down/
18:38:15<pokechu22>Copying this over from #archivebot (unfortunately they didn't actually say what the site was):
18:38:17<pokechu22>12:32 <ikkoup> Hi hi, I have a little question regarding crawling a site that's a bit too big imo (over 1 million pages)
18:38:19<pokechu22>12:34 <ikkoup> it offers free "blogs" to users and has a directory of user posts (helps with link crawling and discovery), it's been running since 2004 with funding from United Nations and Egyptian Ministry of Communication.
18:38:21<pokechu22>12:35 <ikkoup> recently they stopped accepting new users, I guess due to Egypt's economic situation and lack of funding? so I'm kinda worried about it and the history it has regarding Arabic web content.
18:38:23<pokechu22>12:36 <ikkoup> should I try to archive it on my own or is it something you may be interested in?
18:57:06<h2ibot>Bzc6p edited Kepfeltoltes.eu (+34, /* Archiving */ Forgot to note, but saved until…): https://wiki.archiveteam.org/?diff=49353&oldid=48076
19:07:45jacksonchen666 (jacksonchen666) joins
19:58:52<@JAA>I made a little search thingy for DemoDrop. HTML, JS and a compressed index file. Search happens entirely in the browser (powered by MiniSearch), i.e. no server-side component is necessary, just static file hosting. The disadvantage is that loading the index seems to require 1-2 GiB of RAM in my testing. I'm using a pure JS implementation of zstd for the decompression, could be sped up significantly
19:58:58<@JAA>with WASM.
20:00:16<h2ibot>Bzc6p edited Legalja.hu (+163, website in coma): https://wiki.archiveteam.org/?diff=49354&oldid=47953
20:03:54<@JAA>https://transfer.archivete.am/inline/bfnbG/demodrop_search.png
20:06:10<Maakuth|m>JAA, zstd packed index? Maybe the index would be better uncompressed at the server, accessed page by page through range queries
20:07:12<Maakuth|m>Maybe compression makes it necessary to extract the whole thing in memory and thus burns through lots of ram
20:13:18<h2ibot>Bzc6p edited TVN.hu (+312, Now, video.xfree.hu was apparently a…): https://wiki.archiveteam.org/?diff=49355&oldid=45779
20:14:17<@JAA>Maakuth|m: Yeah, or splitting the index into multiple smaller indices by prefix. Those things seem like a lot more work than this though.
20:16:58<@JAA>The compression is pretty much unrelated to the memory use. The vast majority of that is simply the index objects/structure.
20:17:45<Maakuth|m>https://github.com/phiresky/sql.js-httpvfs this one does the range query trick with sqlite dbs
20:18:00<@JAA>The index is 136 MiB without compression, 44 with zstd -10.
20:19:23<Maakuth|m>Ok, the minisearch index structures are very big in memory indeed
20:26:07<@JAA>Looks like the index on MiniSearch is an array of `[word, {fieldId: {documentId: count}}]` when really all I care about is `[word, [documentId...]]`. But of course the other info is useful for more complex searches than what I'm trying to have here.
20:27:45<@JAA>That's on the serialised version, anyway. Not sure what it does in memory.
20:29:09<@JAA>(I'm guessing it might build a prefix tree.)
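The slimmer index shape JAA describes — `[word, [documentId...]]` instead of MiniSearch's `[word, {fieldId: {documentId: count}}]` — can be illustrated with a toy inverted index. This is a hypothetical sketch of the idea, not MiniSearch's actual code or in-memory layout (which, per the guess above, may be a prefix tree):

```python
# Sketch: a plain "which documents contain this word" inverted index,
# dropping the per-field and per-count detail MiniSearch carries.
from collections import defaultdict

def build_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each lowercased word to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def search(index: dict[str, list[int]], word: str) -> list[int]:
    """Exact-word lookup; returns matching document IDs."""
    return index.get(word.lower(), [])
```

The trade-off stands as noted: the extra field/count structure enables ranking and field-scoped queries, which this stripped-down form gives up.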
20:44:44wyatt8740 joins
20:47:02wyatt8750 quits [Ping timeout: 265 seconds]
20:49:36<@JAA>Well, if anyone would like to mess with this further, be my guest. I've spent enough time on it for now. Might revisit at some point though.
20:55:56sec^nd quits [Ping timeout: 245 seconds]
20:56:19sec^nd (second) joins
21:08:33Ruthalas5 quits [Quit: Ping timeout (120 seconds)]
21:08:51Ruthalas5 (Ruthalas) joins
21:38:32<h2ibot>JustAnotherArchivist edited DemoDrop (+81, Add search): https://wiki.archiveteam.org/?diff=49359&oldid=47825
21:56:59<TheTechRobo>MiniSearch doesn't sound very mini :P
22:15:39Ketchup901 quits [Client Quit]
22:16:52Ketchup901 (Ketchup901) joins
22:32:34lennier1 quits [Quit: Going offline, see ya! (www.adiirc.com)]
22:33:24lennier1 (lennier1) joins
22:37:12hitgrr8 quits [Client Quit]
23:50:23Megame (Megame) joins
23:53:01f0x2 quits [Quit: So long, and thanks for all the fish]
23:59:55<h2ibot>JustAnotherArchivist edited Win-Raid Forum (+244): https://wiki.archiveteam.org/?diff=49360&oldid=48512