00:00:52nexussfan (nexussfan) joins
00:11:14mr_sarge quits [Read error: Connection reset by peer]
00:11:47StarletCharlotte joins
00:11:58<StarletCharlotte>What's the best way to upload large files to the Internet Archive?
00:12:22<StarletCharlotte>Because I'm trying to upload an archive of ftp://ftp.funcom.com and it's stuck at 4.9 GB. It's been several hours.
00:12:39<StarletCharlotte>My internet isn't the best but I don't think it's that bad.
00:14:01<Yakov>reading some of AB's source, I think it supports ftp..?
00:14:08<@imer>StarletCharlotte: there's some tips here (if you've not seen it yet) https://wiki.archiveteam.org/index.php/Internet_Archive#Upload_speed
00:17:20<pokechu22>ab doesn't interact nicely with ftp - there's some code for it but it crashes most of the time and as such is mostly disabled at this point
00:18:11<StarletCharlotte>imer I'll take a look
00:26:25<@OrIdow6>What's the current tracker architecture? I found old logs talking about it being an Nginx(Lua) proxy that talks to the original tracker, but doesn't directly talk to Redis - is that still the case?
00:28:36<nicolas17>StarletCharlotte: are you on Linux?
00:28:46<StarletCharlotte>yeah
00:29:47<nicolas17>in my experience "sudo sysctl net.ipv4.tcp_congestion_control=bbr" makes uploads to archive.org significantly faster
00:29:50<nicolas17>won't help with ongoing connections/uploads though, you'd have to start over
00:30:21<StarletCharlotte>Got it. Should I turn it off after though?
00:30:46<nicolas17>I didn't notice any negative effects on the rest of my internet use tbh
00:30:57<StarletCharlotte>got it
00:31:01<nicolas17>but you could run "sudo sysctl net.ipv4.tcp_congestion_control" to see what your current value is
00:31:05<nicolas17>and restore it afterwards
00:31:07<StarletCharlotte>whoops
00:31:12<StarletCharlotte>i uh... already did it
00:31:17<StarletCharlotte>oh well
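An editorial aside on the tip above: the usual pattern is to record the current algorithm before switching, so it can be restored afterwards (the change also reverts on reboot). BBR needs the `tcp_bbr` kernel module, which recent kernels include. A minimal sketch, Linux only:

```shell
# 1. Record the current congestion control algorithm so it can be restored:
orig=$(sysctl -n net.ipv4.tcp_congestion_control)
echo "current algorithm: $orig"

# 2. Switch to BBR for the duration of the upload (needs root):
sudo sysctl net.ipv4.tcp_congestion_control=bbr

# 3. ...run the upload to archive.org...

# 4. Restore the previous value:
sudo sysctl "net.ipv4.tcp_congestion_control=$orig"
```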
00:32:59<BlankEclair>out of curiosity, does that only make IA uploads fast, or does it make all tcp connections go faster
00:34:26<nicolas17>there's *something* in archive.org's networking that doesn't interact well with the default congestion control algorithm, I don't understand the details
00:35:34<klea>https://www.kernel.org/doc/html/latest/networking/ip-sysctl.html#:~:text=tcp%5Fcongestion%5Fcontrol%20%2D%20STRING isn't very clear about what that does.
00:40:24<nicolas17>https://en.wikipedia.org/wiki/TCP_congestion_control
00:40:44<StarletCharlotte>> /usr/bin/python: Error while finding module specification for 'ia-upload-stream.py' (ModuleNotFoundError: __path__ attribute not found on 'ia-upload-stream' while trying to find 'ia-upload-stream.py'). Try using 'ia-upload-stream' instead of 'ia-upload-stream.py' as the module name.
00:40:49<StarletCharlotte>not sure what's going on here
00:41:20<StarletCharlotte>Removing .py also fails
00:41:42<StarletCharlotte>same with removing -m
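An aside on that traceback (my reading, not something stated in the log): `python -m` resolves its argument as an importable module name, so both the `.py` suffix and the hyphen in `ia-upload-stream` trip it up; running the file as a script path avoids the import machinery entirely.

```shell
# -m wants a module name, so a hyphenated or path-like argument fails
# with a "No module named ..." error instead of running the file:
python3 -m no-such-module 2>&1 || true
# Running the script by path works regardless of its filename, e.g.:
#   python3 ./ia-upload-stream.py --help
```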
00:47:47<klea>nicolas17: what version does setting the variable to bbr set it to, BBRv1, BBRv2 or BBRv3?
00:49:12<StarletCharlotte>yeah I can't figure out how to run this. The example on the wiki just doesn't work for some reason
00:50:00<@JAA>Why is that command trying to run it as a module? (I either never knew or forgot that my uploader is even listed there.)
00:51:16<StarletCharlotte>Good question, but not running it as a module also fails.
00:51:18<@JAA>And what's that bit about installing the ia package? ia-upload-stream only depends on requests.
00:51:53<StarletCharlotte>https://pastebin.com/s88c8eJr
00:53:03<@JAA>Hmm yeah, I suppose.
00:53:22<@JAA>That does run the script correctly though.
00:53:46<@JAA>You can specify the S3 credentials via IA_S3_ACCESS and IA_S3_SECRET environment variables as well.
00:54:14<@JAA>And ia-s3-auth can get you those values without `ia configure`.
00:55:52etnguyen03 quits [Client Quit]
00:56:42<StarletCharlotte>S3?
00:57:16<StarletCharlotte>Okay I guess
00:57:24<klea>They're available on the web at https://archive.org/account/s3.php too
00:57:32<klea>It's an S3-like API
00:57:34<StarletCharlotte>oh okay thanks
00:59:27<StarletCharlotte>Tried again, same error. It's asking about a config file or something?
00:59:43<@JAA>To explain that error referencing `ia configure`: `ia-upload-stream` reads ia's config file if it's available (and not overridden by the environment variable). There's no actual dependency on `ia`.
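A sketch of that precedence (the config file path and the `access` key name here are assumptions on my part, not taken from the log): the environment variable is checked first, and the file is only a fallback.

```shell
# Returns the S3 access key: environment variable wins; ia's config file
# (path and key name assumed; see `ia configure`) is only the fallback.
ia_access() {
    if [ -n "${IA_S3_ACCESS:-}" ]; then
        printf '%s\n' "$IA_S3_ACCESS"
    else
        awk -F' *= *' '$1 == "access" {print $2; exit}' \
            "$HOME/.config/internetarchive/ia.ini" 2>/dev/null
    fi
}
```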
01:00:29<StarletCharlotte>I assume ia is from python-internetarchive?
01:00:55<StarletCharlotte>I set the environment variables for the S3 credentials so it's not that.
01:01:27<TheTechRobo>StarletCharlotte: the sysctl option should go back to what it was before after a reboot, FWIW, so don't worry about losing it
01:01:39<StarletCharlotte>got it
01:01:41<@JAA>Sounds like you didn't set them correctly then. It won't even reach that code when they're set.
01:02:03<TheTechRobo>(ia comes from https://pypi.org/project/internetarchive BTW)
01:02:40<StarletCharlotte>Huh, I guess set just sets the shell variables and not environment variables? I think?
01:02:48<@JAA>Yes
01:02:50<klea>try to export.
01:02:53<TheTechRobo>export IA_S3_ACCESS=...
01:03:07<@JAA>Either run it as `IA_S3_ACCESS=... IA_S3_SECRET=... ./ia-upload-stream ...` or `export` them.
01:03:49<StarletCharlotte>There it goes. thank you
01:03:50<@JAA>And `set` sets the arguments, not variables.
01:03:54<StarletCharlotte>that explains a lot
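To spell out the distinction from the exchange above: a plain assignment creates a shell variable that child processes never see, `export` copies it into the environment, and a `VAR=value command` prefix sets it for that one command only.

```shell
FOO=bar                                  # shell variable only
sh -c 'printf "[%s]\n" "$FOO"'           # child prints [] - FOO is invisible to it
export FOO                               # promote FOO to an environment variable
sh -c 'printf "[%s]\n" "$FOO"'           # child prints [bar]
BAZ=qux sh -c 'printf "[%s]\n" "$BAZ"'   # one-shot: set only for this command
```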
01:04:34StarletCharlotte quits [Client Quit]
01:11:50pabs (pabs) joins
01:13:49LddPotato quits [Read error: Connection reset by peer]
01:14:30LddPotato (LddPotato) joins
01:15:12roverinexile joins
01:17:41rover quits [Ping timeout: 272 seconds]
01:18:31etnguyen03 (etnguyen03) joins
01:24:27LddPotato quits [Read error: Connection reset by peer]
01:25:09LddPotato (LddPotato) joins
01:34:57LddPotato quits [Read error: Connection reset by peer]
01:35:51LddPotato (LddPotato) joins
01:36:03petrichor quits [Ping timeout: 272 seconds]
01:44:13fangfufu quits [Client Quit]
01:45:53LddPotato quits [Read error: Connection reset by peer]
01:46:34LddPotato (LddPotato) joins
01:50:08fangfufu joins
01:50:28kansei- (kansei) joins
01:51:52kansei quits [Ping timeout: 256 seconds]
02:03:57LddPotato quits [Read error: Connection reset by peer]
02:05:31LddPotato (LddPotato) joins
02:29:50pokechu22 quits [Ping timeout: 256 seconds]
02:40:35pokechu22 (pokechu22) joins
02:52:14ducky_ (ducky) joins
02:53:04ducky quits [Ping timeout: 256 seconds]
02:53:04ducky_ is now known as ducky
02:53:29thalia quits [Quit: Connection closed for inactivity]
03:06:40ducky quits [Ping timeout: 256 seconds]
03:08:16ducky (ducky) joins
03:30:58nexussfan quits [Quit: Konversation terminated!]
03:36:42Godzfire quits [Quit: Ooops, wrong browser tab.]
03:47:30nexussfan (nexussfan) joins
04:08:05etnguyen03 quits [Remote host closed the connection]
04:08:17fireatseaparks quits [Quit: Textual IRC Client: www.textualapp.com]
04:16:13fireatseaparks (fireatseaparks) joins
04:39:57Island quits [Read error: Connection reset by peer]
04:46:18cyanbox joins
04:55:14DogsRNice quits [Read error: Connection reset by peer]
05:04:32n9nes quits [Ping timeout: 256 seconds]
05:05:03khaoohs quits [Ping timeout: 272 seconds]
05:06:01n9nes joins
05:06:36khaoohs joins
05:08:58nexussfan quits [Client Quit]
05:15:33steering wonders how thoroughly wikipedia links have been archived
05:23:34<steering>i know there's bots that try and point links to archives when they're dead but is there stuff going through and SPN'ing links for example
05:24:26<BlankEclair>wikipedia-eventstream or something
05:24:55<BlankEclair>https://archive.org/details/wikipedia-eventstream?tab=about
05:27:59<pokechu22>Yeah, my understanding is that there's a project that does that (that isn't by archiveteam). Looking at https://archive.org/details/wikipedia-eventstream?tab=collection&sort=-publicdate it seems like stuff is run weeklyish?
05:35:01<steering>ah good :)
06:08:23Snivy quits [Ping timeout: 272 seconds]
06:15:57petrichor (petrichor) joins
06:25:00fionera quits [Ping timeout: 256 seconds]
06:29:23BennyOtt (BennyOtt) joins
06:40:59Wohlstand1 (Wohlstand) joins
06:43:24Wohlstand1 is now known as Wohlstand
06:51:24Wohlstand quits [Client Quit]
07:12:09Snivy (Snivy) joins
08:30:53rohvani quits [Ping timeout: 272 seconds]
08:55:44ducky quits [Ping timeout: 256 seconds]
08:57:25<ericgallager>https://en.wikipedia.org/wiki/User:GreenC_bot does archiving of Wikipedia links
08:57:40<ericgallager>https://en.wikipedia.org/wiki/User:GreenC/WaybackMedic
08:59:51<ericgallager>oh and this one too: https://en.wikipedia.org/wiki/User:InternetArchiveBot
09:14:57ducky (ducky) joins
09:32:19sec^nd quits [Ping timeout: 244 seconds]
09:34:36sec^nd (second) joins
09:58:55BornOn420 quits [Ping timeout: 272 seconds]
10:41:42TheEnbyperor quits [Ping timeout: 256 seconds]
10:41:59TheEnbyperor_ quits [Ping timeout: 272 seconds]
10:46:16TheEnbyperor (TheEnbyperor) joins
10:51:29TheEnbyperor quits [Ping timeout: 272 seconds]
10:57:47TheEnbyperor joins
10:59:34TheEnbyperor_ (TheEnbyperor) joins
11:02:13Dada joins
11:05:11Dada quits [Remote host closed the connection]
11:40:27APOLLO03a joins
11:42:54APOLLO03 quits [Ping timeout: 256 seconds]
11:59:46StarletCharlotte joins
12:00:03Bleo1826007227196234552220 quits [Quit: The Lounge - https://thelounge.chat]
12:02:12<StarletCharlotte>Good news: ia-upload-stream.py works! Bad news: I can't edit the metadata to say I finished uploading the actual file instead of the placeholder, because it turns out the Internet Archive REALLY doesn't like it when an item identifier has dots in it. But it only tells you that breaks things AFTER you've made it the name of your item, when you try to edit the item. https://archive.org/details/ftp.funcom.com
12:02:15<StarletCharlotte>Not sure what to do.
12:02:48Bleo1826007227196234552220 joins
13:03:07StarletCharlotte quits [Client Quit]
13:19:32Webuser302981 joins
13:19:39<Webuser302981>What
13:20:06Webuser302981 quits [Client Quit]
13:20:22@imer nods
13:22:51Arcorann_ quits [Ping timeout: 272 seconds]
13:42:17ice quits [Quit: WeeChat 4.7.1]
13:42:29oxtyped quits [Ping timeout: 272 seconds]
13:54:00mgrytbak8 joins
13:54:50ice joins
13:55:09mgrytbak quits [Ping timeout: 272 seconds]
13:55:09mgrytbak8 is now known as mgrytbak
14:15:02oxtyped joins
14:34:13Webuser247771 joins
14:34:57Webuser247771 quits [Client Quit]
14:40:07oxtyped quits [Ping timeout: 272 seconds]
14:49:40oxtyped joins
14:51:57GodzFire joins
14:58:28<GodzFire>pokechu22 I was watching the crawler and noticed it was seemingly scraping some production websites, so I checked the productionmusic.fandom.com_articles_and_outlinks.txt list. There's a crap ton that should be removed. I went through and took out 17000 links. Here's an updated txt that only has ProdMusic Wiki stuff; could you restart it with this?: https://litter.catbox.moe/gke9wfo08aoe2dpx.txt
15:00:18<GodzFire>I was wondering why it pulled 111 GB when the site is only 12 GB total.
15:04:21FiTheArchiver joins
15:04:39FiTheArchiver quits [Remote host closed the connection]
15:14:18Dada joins
15:19:02Webuser963758 joins
15:19:30Webuser963758 quits [Client Quit]
15:20:51<aaq|m>That would compress down well at least
15:21:51<justauser>GodzFire: That's fine, our motto is "Archive All The Things".
15:22:29<justauser>IA is willing to store the junk.
15:24:28<justauser>However, it only pulled 7GB so far - where is your number from?
15:26:33<justauser>Oh, nevermind - it's my number that came from a frozen dashboard.
15:33:39<GodzFire>justauser well, judging from what the other 17000 links are, it's all collections of actual music files on big production music websites, which are licensed and could get us in trouble. I would really prefer it if the job could please just be restarted with only the ProdMusic Wiki links.
15:34:53<GodzFire>It doesn't feel right otherwise.
15:37:21BornOn420 (BornOn420) joins
15:43:27<klea>IA excludes stuff from WBM, so that's fine I believe?
16:02:02Island joins
16:02:45polduran joins
16:08:38Wohlstand (Wohlstand) joins
16:10:14Dada quits [Remote host closed the connection]
16:15:45AK quits [Quit: AK]
16:30:07Boppen_ quits [Read error: Connection reset by peer]
16:34:18Boppen (Boppen) joins
16:52:00Dada joins
16:53:16janos777 joins
16:53:23janos778 joins
17:00:23<polduran>hello there. Normally I only stumble in here when I hear of the approaching end of a website, so it gets queued for the archivebot; my understanding of your projects and how you work is therefore very limited. Anyway, you may have already heard that archive.today is apparently using site visits for DDoS attacks, which makes its already endangered future even worse. On the English Wikipedia and the German Wikipedia (and probably several other languages as well) discussions have started about banning the URL, as it is obviously bad to link to a malicious website. The problem is that there are almost 700'000 existing links to pages archived on archive.today and its mirrors, most often websites that have not been archived (properly) in the Wayback Machine. I was hoping some of you might be interested in joining the discussion and might offer some ideas for how to preserve the archived information somewhere else. Here is the link to the English discussion: https://en.wikipedia.org/wiki/Wikipedia:Requests_for_comment/Archive.is_RFC_5 By any chance, is anyone already working on or planning a project to somehow rescue data from there? After all, there are 700k existing links, and I cannot imagine how many dead links there must be in the Wikipedias which are only archived on archive.today.
17:27:04<@arkiver>archive.is is definitely on my radar
17:30:58lennier2_ joins
17:33:06lennier2 quits [Ping timeout: 256 seconds]
17:46:19ThreeHM quits [Ping timeout: 272 seconds]
17:47:59ThreeHM (ThreeHeadedMonkey) joins
17:48:10<pokechu22>GodzFire: I chose to scrape the outlinks because that's what we would have done for a recursive job - I included them so that if you clicked on links on the site, those would also be saved... and given that most of them seem to have previously *not* been saved, it feels like it's useful to save them
17:49:41<pokechu22>It should only be files that are public previews - if archivebot is somehow finding music that you're supposed to pay for, then they've done something really weird
17:53:19<@JAA>!tell StarletCharlotte Dots in IA item names are perfectly fine; I use them all the time. And there's a script in little-things for metadata as well. Feel free to ask in #internetarchive if you have more questions.
17:53:19<eggdrop>[tell] ok, I'll tell StarletCharlotte when they join next
17:54:36<pokechu22>(I was expecting it to be outlinks with text information about albums only, but if they have previews, might as well get them too)
18:05:19sg72 quits [Ping timeout: 272 seconds]
18:08:37sg72 joins
18:26:41Webuser043121 joins
18:27:41SootBector quits [Remote host closed the connection]
18:28:49SootBector (SootBector) joins
18:35:04Webuser043121 quits [Client Quit]
18:36:18<GodzFire>pokechu22 everything from ProdMusic Wiki seems to be erroring
18:37:10<pokechu22>Yeah, that's something fandom does - it has a lot of things that look like relative links to scripts to archivebot, but actually aren't
18:37:51<pokechu22>I'll add an ignore for some of them, but it's annoying to deal with (which is one reason why I did an !ao < list job like this instead of a fully recursive !a job)
18:38:14<GodzFire>Is there any way to see how far along it is and how many it's done/have left?
18:41:34<pokechu22>Yes, though the information is presented in a more obvious way on http://archivebot.com/3. It's processed 225k URLs and has another 275k URLs to go (though some are ignored or otherwise not relevant). That means it's saved the HTML for everything in my original list of 53k URLs, and is now saving images/scripts/media files embedded in those pages
18:42:14<GodzFire>I truly appreciate what you're doing and helping, but I really do have a worry about all the sound files. If you wanted to separate this into two jobs, where one is all the ProdMusicWiki stuff and another is all the links to other sites, that would make me feel a lot better. One so the ProdMusicWiki can get done by itself, but the other is because the sheer amount of filesize and music those others link to is insanity.
18:43:05Webuser567384 joins
18:43:29<GodzFire>For example one music library alone is easily 50 gigs of samples and thousands of pages, and currently it's trying to pull probably a hundred different music libraries
18:43:37<pokechu22>Yeah, I probably should have split it off initially and would have if I'd thought of the media files... but archivebot doesn't have a good way of doing that without aborting the job and starting from scratch (which would mean I'd duplicate the 131 GB of media already saved)
18:47:10<pokechu22>I could add ignores for the media files and then !ao < list them afterwards, but I don't really feel like that makes much of a difference (archivebot jobs get split into 5GB chunks that are uploaded as they finish, so just because it's downloaded over 100 GB doesn't mean there's a single 100 GB file sitting on the machine or any risk of running out of disk space)
18:47:31<GodzFire>Would it be possible to just do a separate additional job for just the ProdMusicWiki stuff since it's only 12 gigs? That way this other one can keep going. Then it will just see the ProdMusicWiki stuff is already uploaded to wayback and skip it.
18:47:43<nicolas17>it does not skip stuff that way
18:48:32<nicolas17>note that 85GB was *already* uploaded to archive.org
19:11:17ericgallager quits [Read error: Connection reset by peer]
19:14:57ericgallager joins
19:19:25mls quits [Ping timeout: 272 seconds]
19:20:35mls (mls) joins
19:34:16UwU quits [Remote host closed the connection]
19:34:52UwU joins
19:46:27UwU quits [Remote host closed the connection]
19:47:46UwU joins
19:53:08polduran quits [Quit: Ooops, wrong browser tab.]
19:57:11UwU quits [Remote host closed the connection]
19:58:21UwU joins
20:09:47UwU quits [Remote host closed the connection]
20:10:22UwU joins
20:15:08<klea>https://codeberg.org/lindenii/sethrawall - sethrawall is a small HTTP reverse proxy with SSH-based authentication.
20:26:39UwU quits [Remote host closed the connection]
20:27:16UwU joins
20:44:01UwU quits [Remote host closed the connection]
20:44:36UwU joins
21:10:54GodzFire quits [Quit: Ooops, wrong browser tab.]
21:11:39UwU quits [Remote host closed the connection]
21:12:16UwU joins
21:20:05Webuser567384 quits [Client Quit]
21:30:32UwU quits [Remote host closed the connection]
21:31:13UwU joins
21:32:13Dj-Wawa quits []
21:34:58Dj-Wawa joins
21:35:04Dada quits [Ping timeout: 256 seconds]
21:43:18fionera joins
21:43:18fionera quits [Changing host]
21:43:18fionera (Fionera) joins
21:45:27Hackerpcs quits [Quit: Hackerpcs]
21:46:22Hackerpcs (Hackerpcs) joins
21:57:23Dj-Wawa quits [Client Quit]
21:57:30Dj-Wawa joins
22:00:45Dj-Wawa quits [Client Quit]
22:00:53Dj-Wawa joins
22:11:35UwU quits [Remote host closed the connection]
22:12:20UwU joins
22:31:39Webuser851055 joins
22:42:30G4te_Keep3r34924156 quits [Ping timeout: 256 seconds]
22:44:20UwU quits [Client Quit]
22:44:50G4te_Keep3r34924156 joins
22:44:59UwU joins
22:46:55thedude joins
22:47:48etnguyen03 (etnguyen03) joins
22:48:22<thedude>I'm trying to recover a webpage from archive.today archives. Are there any tools out there that can do this?
22:50:00<thedude>I'd rather not try to hack something together in selenium myself
22:50:36<klea>What's the current approach around archiving Dropbox links? (i'm interested in archiving https://www.dropbox.com/s/l8yoah76t7nq04y/mueller-report.pdf from a url list I found somewhere on the web)
22:52:44<pokechu22>https://www.dropbox.com/s/l8yoah76t7nq04y/mueller-report.pdf?dl=1 and https://dl.dropboxusercontent.com/s/l8yoah76t7nq04y/mueller-report.pdf - note that www.dropbox.com is excluded from WBM
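The two rewrites pokechu22 lists can be scripted when feeding a longer URL list to AB; the transformation below is a sketch based only on the pattern shown here, not on documented Dropbox behaviour.

```shell
url="https://www.dropbox.com/s/l8yoah76t7nq04y/mueller-report.pdf"

# Variant 1: append ?dl=1 to force a download (assumes no existing query string):
dl1="${url}?dl=1"

# Variant 2: swap the host for the direct-content domain:
direct=$(printf '%s\n' "$url" | sed 's#^https://www\.dropbox\.com/#https://dl.dropboxusercontent.com/#')

printf '%s\n%s\n' "$dl1" "$direct"
```

Both forms matter because, as noted, www.dropbox.com itself is excluded from the WBM.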
22:59:13<klea>so shoving those two into AB?
23:00:18<pokechu22>Yeah, I'll do that
23:01:58Arcorann_ (Arcorann) joins
23:02:07<klea>Thanks
23:02:56thedude quits [Client Quit]
23:16:45atphoenix__ (atphoenix) joins
23:19:20atphoenix_ quits [Ping timeout: 256 seconds]
23:23:32Webuser851055 quits [Client Quit]
23:46:41ericgallager quits [Ping timeout: 272 seconds]
23:47:57nicolas17 quits [Ping timeout: 272 seconds]
23:48:30nicolas17 (nicolas17) joins
23:57:45rohvani joins