| 00:12:05 | | sonick (sonick) joins |
| 00:15:56 | | sec^nd quits [Ping timeout: 245 seconds] |
| 00:18:37 | | wyatt8740 quits [Ping timeout: 250 seconds] |
| 00:18:54 | | sec^nd (second) joins |
| 00:25:57 | | wyatt8740 joins |
| 00:29:02 | | wyatt8750 joins |
| 00:32:03 | | wyatt8740 quits [Ping timeout: 250 seconds] |
| 00:33:52 | | Island quits [Ping timeout: 265 seconds] |
| 00:36:11 | | Island joins |
| 01:25:21 | | Megame quits [Client Quit] |
| 01:27:31 | | fl0w joins |
| 01:30:38 | | fl0w_ quits [Ping timeout: 264 seconds] |
| 01:45:24 | | yawkat quits [Ping timeout: 265 seconds] |
| 01:56:14 | <h2ibot> | JustAnotherArchivist edited Deathwatch (+282, /* 2022 */ Add KSCO): https://wiki.archiveteam.org/?diff=49351&oldid=49348 |
| 02:24:03 | | sec^nd quits [Remote host closed the connection] |
| 02:24:32 | | sec^nd (second) joins |
| 02:32:48 | <tech234a> | Is there a place to put older projects on the wiki? yuku and devport are long offline and can probably be removed |
| 02:32:54 | <tech234a> | from the homepage |
| 02:59:28 | | Doran is now known as Doranwen |
| 03:10:40 | | mskiptr joins |
| 03:11:41 | <mskiptr> | Hello, how should I go about archiving a blogspot.com blog, like this one? |
| 03:11:44 | <mskiptr> | https://milosniczka-slodyczy.blogspot.com/ |
| 03:12:29 | <pokechu22> | That should be easy to run via archivebot; I'll do it now |
| 03:12:34 | <mskiptr> | I'd normally find a sitemap that lists links to all posts, or something like that, but there's none as far as I can see |
| 03:12:49 | <pokechu22> | Try view-source:https://milosniczka-slodyczy.blogspot.com/sitemap.xml |
| 03:13:15 | <pokechu22> | though I'm not sure how complete it is, and blogspot is a bit annoying with its pagination-but-not-really system |
| 03:14:05 | <mskiptr> | Thanks pokechu22 |
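A Blogger sitemap.xml is typically a sitemap index that points at paged sub-sitemaps, so collecting post URLs means pulling the `<loc>` entries out of each document. A minimal, hypothetical sketch (stdlib only; fetching the XML over HTTP is left out):

```python
import xml.etree.ElementTree as ET

# Namespace used by both sitemap indexes and urlsets per sitemaps.org.
SITEMAP_NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

def extract_locs(xml_text):
    """Return every <loc> URL in a sitemap or sitemap-index document.

    For a sitemap index, these are sub-sitemap URLs to fetch next;
    for a urlset, they are the actual page (post) URLs.
    """
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(SITEMAP_NS + 'loc')]
```

For a full crawl one would fetch the index, call `extract_locs` on it, then fetch and parse each sub-sitemap the same way.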
| 04:21:31 | | Mental quits [Ping timeout: 265 seconds] |
| 04:34:25 | | lukeman quits [Quit: Connection closed for inactivity] |
| 05:22:04 | <sonick> | Is there a way to extract image files, etc. in wp-content from a single warc.gz file? |
| 05:26:01 | <pokechu22> | I'm not aware of an automated solution, but you can probably do something with the cdx file to get a list of offsets and then extract them that way |
| 05:48:02 | <@OrIdow6> | sonick: Use warcfilter from https://github.com/internetarchive/warctools to create a WARC containing only the URLs you want. Then use "warcat extract" from https://github.com/chfoo/warcat |
| 06:21:00 | | fuzzy8021 quits [Killed (NickServ (GHOST command used by fuzzy802!~fuzzy8021@173.224.25.67))] |
| 06:21:06 | | fuzzy8021 (fuzzy8021) joins |
| 06:33:21 | | BlueMaxima quits [Client Quit] |
| 06:57:09 | | Island quits [Ping timeout: 265 seconds] |
| 07:00:08 | | mskiptr quits [Remote host closed the connection] |
| 07:10:06 | | sec^nd quits [Ping timeout: 245 seconds] |
| 07:13:46 | | sec^nd (second) joins |
| 07:17:46 | | hitgrr8 joins |
| 08:10:21 | <madpro|m> | How do you folks get IP ranges for your VM-fleets? |
| 08:10:58 | <madpro|m> | My solution until now has just been to sign up on more cloud providers whenever I cannot parallelize locally |
| 08:11:39 | <madpro|m> | As the saying goes: there's no shame in not knowing, but there is in not asking :) |
| 08:30:12 | | fishingforsoup quits [Read error: Connection reset by peer] |
| 09:04:02 | | yawkat (yawkat) joins |
| 10:21:57 | | sonick quits [Client Quit] |
| 10:42:26 | | Icyelut (Icyelut) joins |
| 10:58:16 | | Icyelut quits [Changing host] |
| 10:58:16 | | Icyelut (Icyelut) joins |
| 10:59:22 | | Icyelut quits [Changing host] |
| 10:59:22 | | Icyelut (Icyelut) joins |
| 11:01:22 | | Icyelut quits [Client Quit] |
| 11:12:38 | | Icyelut (Icyelut) joins |
| 11:16:01 | | Icyelut quits [Client Quit] |
| 11:16:13 | | Icyelut (Icyelut) joins |
| 11:38:23 | | Icyelut quits [Client Quit] |
| 11:44:13 | | Icyelut (Icyelut) joins |
| 13:03:04 | | Megame (Megame) joins |
| 14:28:01 | | AlsoTheTechRobo is now known as TheTechRobo |
| 14:33:02 | | hackbug quits [Quit: Lost terminal] |
| 14:35:28 | | hackbug (hackbug) joins |
| 14:49:09 | | jacksonchen666 quits [Remote host closed the connection] |
| 14:51:32 | | jacksonchen666 (jacksonchen666) joins |
| 15:21:02 | <kiska> | madpro|m: Lots and lots of cloud for me :D |
| 15:32:40 | | Island joins |
| 15:40:36 | <madpro|m> | It's all cloud?! |
| 15:41:32 | <@JAA> | 🌍👨‍🚀🔫👨‍🚀🌍 |
| 15:46:08 | <madpro|m> | kiska: You're on Hetzner, right? Do you just rent a bunch of small machines in bulk, or? |
| 15:46:51 | <phuz-test> | madpro|m: When I was running a lot of pipelines, I would get dozens of $5/mo DO instances and deploy them all with Ansible. |
| 15:47:47 | | phuz-test is now known as phuzion |
| 15:51:28 | <kiska> | madpro|m: When I want to do a project I just spin up 200 Hetzner VMs, connect them to Docker Swarm and let it rip |
| 15:54:37 | | fl0w quits [Ping timeout: 265 seconds] |
| 16:05:14 | | datechnoman quits [Quit: Ping timeout (120 seconds)] |
| 16:05:34 | | datechnoman (datechnoman) joins |
| 17:07:34 | | jacksonchen666 quits [Remote host closed the connection] |
| 17:21:40 | | Megame quits [Client Quit] |
| 17:34:42 | <madpro|m> | It cannot be that dead simple |
| 17:35:03 | <madpro|m> | Horribly expensive, but still super simple :| |
| 17:36:02 | <madpro|m> | But thanks for the details :) |
| 18:09:30 | <TheTechRobo> | Sider.review is shutting down: https://siderlabs.com/blog/sider-review-is-shutting-down/ |
| 18:38:15 | <pokechu22> | Copying this over from #archivebot (unfortunately they didn't actually say what the site was): |
| 18:38:17 | <pokechu22> | 12:32 <ikkoup> Hi hi, I have a little question regarding crawling a site that's a bit too big imo (over 1 million pages) |
| 18:38:19 | <pokechu22> | 12:34 <ikkoup> it offers free "blogs" to users and has a directory of user posts (helps with link crawling and discovery), it's been running since 2004 with funding from the United Nations and the Egyptian Ministry of Communication. |
| 18:38:21 | <pokechu22> | 12:35 <ikkoup> recently they stopped accepting new users, I guess due to Egypt's economic situation and lack of funding? so I'm kinda worried about it and the history it has regarding Arabic web content. |
| 18:38:23 | <pokechu22> | 12:36 <ikkoup> should I try to archive it on my own or is it something you may be interested in? |
| 18:57:06 | <h2ibot> | Bzc6p edited Kepfeltoltes.eu (+34, /* Archiving */ Forgot to note, but saved untilβ¦): https://wiki.archiveteam.org/?diff=49353&oldid=48076 |
| 19:07:45 | | jacksonchen666 (jacksonchen666) joins |
| 19:58:52 | <@JAA> | I made a little search thingy for DemoDrop. HTML, JS and a compressed index file. Search happens entirely in the browser (powered by MiniSearch), i.e. no server-side component is necessary, just static file hosting. The disadvantage is that loading the index seems to require 1-2 GiB of RAM in my testing. I'm using a pure JS implementation of zstd for the decompression, could be sped up significantly |
| 19:58:58 | <@JAA> | with WASM. |
| 20:00:16 | <h2ibot> | Bzc6p edited Legalja.hu (+163, website in coma): https://wiki.archiveteam.org/?diff=49354&oldid=47953 |
| 20:03:54 | <@JAA> | https://transfer.archivete.am/inline/bfnbG/demodrop_search.png |
| 20:06:10 | <Maakuth|m> | JAA, zstd-packed index? Maybe the index would be better left uncompressed on the server, accessed page by page through range queries |
| 20:07:12 | <Maakuth|m> | Maybe compression makes it necessary to extract the whole thing in memory and thus burns through lots of RAM |
| 20:13:18 | <h2ibot> | Bzc6p edited TVN.hu (+312, Now, video.xfree.hu was apparently aβ¦): https://wiki.archiveteam.org/?diff=49355&oldid=45779 |
| 20:14:17 | <@JAA> | Maakuth|m: Yeah, or splitting the index into multiple smaller indices by prefix. Those things seem like a lot more work than this though. |
| 20:16:58 | <@JAA> | The compression is pretty much unrelated to the memory use. The vast majority of that is simply the index objects/structure. |
| 20:17:45 | <Maakuth|m> | https://github.com/phiresky/sql.js-httpvfs this one does the range query trick with sqlite dbs |
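The range-query trick Maakuth|m mentions boils down to asking the server for one fixed-size page of the uncompressed index at a time via an HTTP `Range` header, so the client never downloads the whole file. A hypothetical sketch of the header arithmetic:

```python
def range_header(page, page_size=4096):
    """Build the HTTP Range header for the given zero-based page of a
    remote file, so only page_size bytes are transferred per request."""
    start = page * page_size
    end = start + page_size - 1  # HTTP ranges are inclusive on both ends
    return {'Range': f'bytes={start}-{end}'}
```

A client like sql.js-httpvfs issues many such requests lazily as the query touches different parts of the file, which only works if the bytes on the server are addressable, i.e. not behind whole-file compression.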
| 20:18:00 | <@JAA> | The index is 136 MiB without compression, 44 with zstd -10. |
| 20:19:23 | <Maakuth|m> | Ok, the minisearch index structures are very big in memory indeed |
| 20:26:07 | <@JAA> | Looks like the index on MiniSearch is an array of `[word, {fieldId: {documentId: count}}]` when really all I care about is `[word, [documentId...]]`. But of course the other info is useful for more complex searches than what I'm trying to have here. |
| 20:27:45 | <@JAA> | That's on the serialised version, anyway. Not sure what it does in memory. |
| 20:29:09 | <@JAA> | (I'm guessing it might build a prefix tree.) |
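If the serialised MiniSearch index really is `[word, {fieldId: {documentId: count}}]` per entry, as described above, flattening it to `[word, [documentId...]]` is a small transformation. A hypothetical sketch (the input shape is an assumption from the chat, not the documented MiniSearch format):

```python
def compact_index(entries):
    """Collapse postings of shape [word, {fieldId: {docId: count}}]
    into [word, sorted unique docIds]. Field and term-frequency info
    is discarded, which is fine if all you need is membership search."""
    out = []
    for word, fields in entries:
        doc_ids = set()
        for postings in fields.values():
            doc_ids.update(postings.keys())
        out.append([word, sorted(doc_ids)])
    return out
```

The trade-off is exactly the one noted above: ranking and per-field search need the counts, so this only helps when a plain word-to-documents lookup is enough.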
| 20:44:44 | | wyatt8740 joins |
| 20:47:02 | | wyatt8750 quits [Ping timeout: 265 seconds] |
| 20:49:36 | <@JAA> | Well, if anyone would like to mess with this further, be my guest. I've spent enough time on it for now. Might revisit at some point though. |
| 20:55:56 | | sec^nd quits [Ping timeout: 245 seconds] |
| 20:56:19 | | sec^nd (second) joins |
| 21:08:33 | | Ruthalas5 quits [Quit: Ping timeout (120 seconds)] |
| 21:08:51 | | Ruthalas5 (Ruthalas) joins |
| 21:38:32 | <h2ibot> | JustAnotherArchivist edited DemoDrop (+81, Add search): https://wiki.archiveteam.org/?diff=49359&oldid=47825 |
| 21:56:59 | <TheTechRobo> | MiniSearch doesn't sound very mini :P |
| 22:15:39 | | Ketchup901 quits [Client Quit] |
| 22:16:52 | | Ketchup901 (Ketchup901) joins |
| 22:32:34 | | lennier1 quits [Quit: Going offline, see ya! (www.adiirc.com)] |
| 22:33:24 | | lennier1 (lennier1) joins |
| 22:37:12 | | hitgrr8 quits [Client Quit] |
| 23:50:23 | | Megame (Megame) joins |
| 23:53:01 | | f0x2 quits [Quit: So long, and thanks for all the fish] |
| 23:59:55 | <h2ibot> | JustAnotherArchivist edited Win-Raid Forum (+244): https://wiki.archiveteam.org/?diff=49360&oldid=48512 |