| 00:12:05 | | sonick (sonick) joins |
| 00:15:56 | | sec^nd quits [Ping timeout: 245 seconds] |
| 00:18:37 | | wyatt8740 quits [Ping timeout: 250 seconds] |
| 00:18:54 | | sec^nd (second) joins |
| 00:25:57 | | wyatt8740 joins |
| 00:29:02 | | wyatt8750 joins |
| 00:32:03 | | wyatt8740 quits [Ping timeout: 250 seconds] |
| 00:33:52 | | Island quits [Ping timeout: 265 seconds] |
| 00:36:11 | | Island joins |
| 01:25:21 | | Megame quits [Client Quit] |
| 01:27:31 | | fl0w joins |
| 01:30:38 | | fl0w_ quits [Ping timeout: 264 seconds] |
| 01:45:24 | | yawkat quits [Ping timeout: 265 seconds] |
| 01:56:14 | <h2ibot> | JustAnotherArchivist edited Deathwatch (+282, /* 2022 */ Add KSCO): https://wiki.archiveteam.org/?diff=49351&oldid=49348 |
| 02:24:03 | | sec^nd quits [Remote host closed the connection] |
| 02:24:32 | | sec^nd (second) joins |
| 02:32:48 | <tech234a> | Is there a place to put older projects on the wiki? yuku and devport are long offline and can probably be removed |
| 02:32:54 | <tech234a> | from the homepage |
| 02:59:28 | | Doran is now known as Doranwen |
| 03:10:40 | | mskiptr joins |
| 03:11:41 | <mskiptr> | Hello, how should I go about archiving a blogspot.com blog, like this one? |
| 03:11:44 | <mskiptr> | https://milosniczka-slodyczy.blogspot.com/ |
| 03:12:29 | <pokechu22> | That should be easy to run via archivebot; I'll do it now |
| 03:12:34 | <mskiptr> | I'd normally find a sitemap that lists links to all posts, or something like that, but there's none as far as I can see |
| 03:12:49 | <pokechu22> | Try view-source:https://milosniczka-slodyczy.blogspot.com/sitemap.xml |
| 03:13:15 | <pokechu22> | though I'm not sure how complete it is, and blogspot is a bit annoying with its pagination-but-not-really system |
| 03:14:05 | <mskiptr> | Thanks pokechu22 |
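A Blogger sitemap.xml is typically a sitemap index that points at paged sub-sitemaps, so collecting post URLs means pulling the `<loc>` entries out of each document. A minimal, hypothetical sketch (stdlib only; fetching the XML over HTTP is left out):

```python
import xml.etree.ElementTree as ET

# Namespace used by both sitemap indexes and urlsets per sitemaps.org.
SITEMAP_NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

def extract_locs(xml_text):
    """Return every <loc> URL in a sitemap or sitemap-index document.

    For a sitemap index, these are sub-sitemap URLs to fetch next;
    for a urlset, they are the actual page (post) URLs.
    """
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(SITEMAP_NS + 'loc')]
```

For a full crawl one would fetch the index, call `extract_locs` on it, then fetch and parse each sub-sitemap the same way.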
| 04:21:31 | | Mental quits [Ping timeout: 265 seconds] |
| 04:34:25 | | lukeman quits [Quit: Connection closed for inactivity] |
| 05:22:04 | <sonick> | Is there a way to extract image files, etc. in wp-content from a single warc.gz file? |
| 05:26:01 | <pokechu22> | I'm not aware of an automated solution, but you can probably do something with the cdx file to get a list of offsets and then extract them that way |
| 05:48:02 | <@OrIdow6> | sonick: Use warcfilter from https://github.com/internetarchive/warctools to create a WARC containing only the URLs you want. Then use "warcat extract" from https://github.com/chfoo/warcat |
| 06:21:00 | | fuzzy8021 quits [Killed (NickServ (GHOST command used by fuzzy802!~fuzzy8021@173.224.25.67))] |
| 06:21:06 | | fuzzy8021 (fuzzy8021) joins |
| 06:33:21 | | BlueMaxima quits [Client Quit] |
| 06:57:09 | | Island quits [Ping timeout: 265 seconds] |
| 07:00:08 | | mskiptr quits [Remote host closed the connection] |
| 07:10:06 | | sec^nd quits [Ping timeout: 245 seconds] |
| 07:13:46 | | sec^nd (second) joins |
| 07:17:46 | | hitgrr8 joins |
| 08:10:21 | <madpro|m> | How do you folks get IP ranges for your VM-fleets? |
| 08:10:58 | <madpro|m> | My solution until now has just been to sign up on more cloud providers whenever I cannot parallelize locally |
| 08:11:39 | <madpro|m> | As the saying goes: there's no shame in not knowing, but there is in not asking :) |
| 08:30:12 | | fishingforsoup quits [Read error: Connection reset by peer] |
| 09:04:02 | | yawkat (yawkat) joins |
| 10:21:57 | | sonick quits [Client Quit] |
| 10:42:26 | | Icyelut (Icyelut) joins |
| 10:58:16 | | Icyelut quits [Changing host] |
| 10:58:16 | | Icyelut (Icyelut) joins |
| 10:59:22 | | Icyelut quits [Changing host] |
| 10:59:22 | | Icyelut (Icyelut) joins |
| 11:01:22 | | Icyelut quits [Client Quit] |
| 11:12:38 | | Icyelut (Icyelut) joins |
| 11:16:01 | | Icyelut quits [Client Quit] |
| 11:16:13 | | Icyelut (Icyelut) joins |
| 11:38:23 | | Icyelut quits [Client Quit] |
| 11:44:13 | | Icyelut (Icyelut) joins |
| 13:03:04 | | Megame (Megame) joins |
| 14:28:01 | | AlsoTheTechRobo is now known as TheTechRobo |
| 14:33:02 | | hackbug quits [Quit: Lost terminal] |
| 14:35:28 | | hackbug (hackbug) joins |
| 14:49:09 | | jacksonchen666 quits [Remote host closed the connection] |
| 14:51:32 | | jacksonchen666 (jacksonchen666) joins |
| 15:21:02 | <kiska> | madpro|m: Lots and lots of cloud for me :D |
| 15:32:40 | | Island joins |
| 15:40:36 | <madpro|m> | It's all cloud?! |
| 15:41:32 | <@JAA> | 🌍👨‍🚀🔫👨‍🚀🌍 |
| 15:46:08 | <madpro|m> | kiska: You're on Hetzner, right? Do you just rent a bunch of small machines in bulk, or? |
| 15:46:51 | <phuz-test> | madpro|m: When I was running a lot of pipelines, I would get dozens of $5/mo DO instances and deploy them all with Ansible. |
| 15:47:47 | | phuz-test is now known as phuzion |
| 15:51:28 | <kiska> | madpro|m: When I want to do a project I just spin up 200 Hetzner VMs, connect them to Docker Swarm and let it rip |
| 15:54:37 | | fl0w quits [Ping timeout: 265 seconds] |
| 16:05:14 | | datechnoman quits [Quit: Ping timeout (120 seconds)] |
| 16:05:34 | | datechnoman (datechnoman) joins |
| 17:07:34 | | jacksonchen666 quits [Remote host closed the connection] |
| 17:21:40 | | Megame quits [Client Quit] |
| 17:34:42 | <madpro|m> | It cannot be that dead simple |
| 17:35:03 | <madpro|m> | Horribly expensive, but still super simple :| |
| 17:36:02 | <madpro|m> | But thanks for the details :) |
| 18:09:30 | <TheTechRobo> | Sider.review is shutting down: https://siderlabs.com/blog/sider-review-is-shutting-down/ |
| 18:38:15 | <pokechu22> | Copying this over from #archivebot (unfortunately they didn't actually say what the site was): |
| 18:38:17 | <pokechu22> | 12:32 <ikkoup> Hi hi, I have a little question regarding crawling a site that's a bit too big imo (over 1 million pages) |
| 18:38:19 | <pokechu22> | 12:34 <ikkoup> it offers free "blogs" to users and has a directory of user posts (helps with link crawling and discovery), it's been running since 2004 with funding from the United Nations and the Egyptian Ministry of Communication. |
| 18:38:21 | <pokechu22> | 12:35 <ikkoup> recently they stopped accepting new users, I guess due to Egypt's economic situation and lack of funding? so I'm kinda worried about it and the history it has regarding Arabic web content. |
| 18:38:23 | <pokechu22> | 12:36 <ikkoup> should I try to archive it on my own or is it something you may be interested in? |
| 18:57:06 | <h2ibot> | Bzc6p edited Kepfeltoltes.eu (+34, /* Archiving */ Forgot to note, but saved untilβ¦): https://wiki.archiveteam.org/?diff=49353&oldid=48076 |
| 19:07:45 | | jacksonchen666 (jacksonchen666) joins |
| 19:58:52 | <@JAA> | I made a little search thingy for DemoDrop. HTML, JS and a compressed index file. Search happens entirely in the browser (powered by MiniSearch), i.e. no server-side component is necessary, just static file hosting. The disadvantage is that loading the index seems to require 1-2 GiB of RAM in my testing. I'm using a pure JS implementation of zstd for the decompression, could be sped up significantly |
| 19:58:58 | <@JAA> | with WASM. |
| 20:00:16 | <h2ibot> | Bzc6p edited Legalja.hu (+163, website in coma): https://wiki.archiveteam.org/?diff=49354&oldid=47953 |
| 20:03:54 | <@JAA> | https://transfer.archivete.am/inline/bfnbG/demodrop_search.png |
| 20:06:10 | <Maakuth|m> | JAA, zstd-packed index? Maybe the index would be better left uncompressed on the server, accessed page by page through range queries |
| 20:07:12 | <Maakuth|m> | Maybe compression makes it necessary to extract the whole thing in memory and thus burns through lots of RAM |
| 20:13:18 | <h2ibot> | Bzc6p edited TVN.hu (+312, Now, video.xfree.hu was apparently aβ¦): https://wiki.archiveteam.org/?diff=49355&oldid=45779 |
| 20:14:17 | <@JAA> | Maakuth|m: Yeah, or splitting the index into multiple smaller indices by prefix. Those things seem like a lot more work than this though. |
| 20:16:58 | <@JAA> | The compression is pretty much unrelated to the memory use. The vast majority of that is simply the index objects/structure. |
| 20:17:45 | <Maakuth|m> | https://github.com/phiresky/sql.js-httpvfs this one does the range query trick with sqlite dbs |
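The range-query trick Maakuth|m mentions boils down to asking the server for one fixed-size page of the uncompressed index at a time via an HTTP `Range` header, so the client never downloads the whole file. A hypothetical sketch of the header arithmetic:

```python
def range_header(page, page_size=4096):
    """Build the HTTP Range header for the given zero-based page of a
    remote file, so only page_size bytes are transferred per request."""
    start = page * page_size
    end = start + page_size - 1  # HTTP ranges are inclusive on both ends
    return {'Range': f'bytes={start}-{end}'}
```

A client like sql.js-httpvfs issues many such requests lazily as the query touches different parts of the file, which only works if the bytes on the server are addressable, i.e. not behind whole-file compression.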
| 20:18:00 | <@JAA> | The index is 136 MiB without compression, 44 with zstd -10. |
| 20:19:23 | <Maakuth|m> | Ok, the minisearch index structures are very big in memory indeed |
| 20:26:07 | <@JAA> | Looks like the index on MiniSearch is an array of `[word, {fieldId: {documentId: count}}]` when really all I care about is `[word, [documentId...]]`. But of course the other info is useful for more complex searches than what I'm trying to have here. |
| 20:27:45 | <@JAA> | That's on the serialised version, anyway. Not sure what it does in memory. |
| 20:29:09 | <@JAA> | (I'm guessing it might build a prefix tree.) |
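If the serialised MiniSearch index really is `[word, {fieldId: {documentId: count}}]` per entry, as described above, flattening it to `[word, [documentId...]]` is a small transformation. A hypothetical sketch (the input shape is an assumption from the chat, not the documented MiniSearch format):

```python
def compact_index(entries):
    """Collapse postings of shape [word, {fieldId: {docId: count}}]
    into [word, sorted unique docIds]. Field and term-frequency info
    is discarded, which is fine if all you need is membership search."""
    out = []
    for word, fields in entries:
        doc_ids = set()
        for postings in fields.values():
            doc_ids.update(postings.keys())
        out.append([word, sorted(doc_ids)])
    return out
```

The trade-off is exactly the one noted above: ranking and per-field search need the counts, so this only helps when a plain word-to-documents lookup is enough.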
| 20:44:44 | | wyatt8740 joins |
| 20:47:02 | | wyatt8750 quits [Ping timeout: 265 seconds] |
| 20:49:36 | <@JAA> | Well, if anyone would like to mess with this further, be my guest. I've spent enough time on it for now. Might revisit at some point though. |
| 20:55:56 | | sec^nd quits [Ping timeout: 245 seconds] |
| 20:56:19 | | sec^nd (second) joins |
| 21:08:33 | | Ruthalas5 quits [Quit: Ping timeout (120 seconds)] |
| 21:08:51 | | Ruthalas5 (Ruthalas) joins |
| 21:38:32 | <h2ibot> | JustAnotherArchivist edited DemoDrop (+81, Add search): https://wiki.archiveteam.org/?diff=49359&oldid=47825 |
| 21:56:59 | <TheTechRobo> | MiniSearch doesn't sound very mini :P |
| 22:15:39 | | Ketchup901 quits [Client Quit] |
| 22:16:52 | | Ketchup901 (Ketchup901) joins |
| 22:32:34 | | lennier1 quits [Quit: Going offline, see ya! (www.adiirc.com)] |
| 22:33:24 | | lennier1 (lennier1) joins |
| 22:37:12 | | hitgrr8 quits [Client Quit] |
| 23:50:23 | | Megame (Megame) joins |
| 23:53:01 | | f0x2 quits [Quit: So long, and thanks for all the fish] |
| 23:59:55 | <h2ibot> | JustAnotherArchivist edited Win-Raid Forum (+244): https://wiki.archiveteam.org/?diff=49360&oldid=48512 |