#internetarchive log for 2023-09-10


00:42:51	<pabs>	download speeds from IA seem quite slow around the world, are their connections overloaded or something?
00:53:48	<flashfire42>	Yeah
00:56:47	<nicolas17>	sorry my fault, still downloading the yahoo videos stuff and my 4 connections are clearly overwhelming poor IA (?)
01:04:57	<pabs>	:)
01:08:00	<pabs>	hmm, rtorrent finds 0 peers for IA torrents like https://archive.org/download/fossy2023_Breaking_the_Chains_of_Trustin/fossy2023_Breaking_the_Chains_of_Trustin_archive.torrent
01:08:49	<pabs>	transmission-cli too
01:09:09	<pabs>	rtorrent also says DHT search unsuccessful
01:10:43	<nicolas17>	pabs: they use webseeding, which rtorrent doesn't support
01:15:28	<pabs>	ah :(
01:16:02	<pabs>	transmission-cli does download from webseeds much faster than a regular web download, interesting
01:17:24	<pabs>	what are these .____padding_file directories in the torrent?
01:19:24		Arcorann (Arcorann) joins
01:20:24	<nicolas17>	the torrent points at both IA servers having the file as webseeds, so you download from both simultaneously, which mitigates one being slow
01:22:11	<nicolas17>	multi-file torrents work as if it was a single blob with all files concatenated, so a chunk hash can span multiple files, which causes some issues, those zero-filled padding files align the beginning of each file with chunk hash boundaries
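The alignment nicolas17 describes can be sketched with a little arithmetic. This is only an illustration: the function name, the file sizes, and the piece length are made up, and it assumes no padding is needed after the last file.

```python
def padding_sizes(file_sizes, piece_length):
    """For each file, return how many zero bytes of padding follow it so that
    the next file starts exactly on a piece (chunk hash) boundary."""
    pads = []
    offset = 0  # position inside the concatenated "single blob"
    last = len(file_sizes) - 1
    for index, size in enumerate(file_sizes):
        offset += size
        # Assumption: the final file is not padded, since nothing follows it.
        pad = 0 if index == last else (-offset) % piece_length
        pads.append(pad)
        offset += pad
    return pads


# Hypothetical sizes in bytes, with a hypothetical 8-byte piece length.
print(padding_sizes([5, 16, 3], 8))
```

With padding in place, every piece covers data from one real file only, so a client can verify or skip a file without touching its neighbours.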
01:25:47	<pabs>	so the IA download problem isn't the uplink, but individual servers?
01:38:32	<pabs>	hmm, browsers need better download mechanisms
02:33:12		threedeeitguy39 (threedeeitguy) joins
02:33:25		threedeeitguy3 quits [Ping timeout: 265 seconds]
02:33:25		threedeeitguy39 is now known as threedeeitguy3
02:47:35		threedeeitguy3 quits [Client Quit]
02:54:13		threedeeitguy39 (threedeeitguy) joins
04:00:26		BigBrain_ quits [Ping timeout: 245 seconds]
04:02:12	<pabs>	arkiver: in principle, do you think IA might be interested in sponsoring servers/storage/network/hosting/etc for live replicas of archive.softwareheritage.org (source code, all of GitHub/Debian/etc, will or does use Ceph) and/or snapshot.debian.org (~152TB all of Debian's software history, both source/binaries, bespoke content-addressed filesystem layout)?
04:02:26		BigBrain_ (bigbrain) joins
06:31:08		nicolas17 quits [Ping timeout: 252 seconds]
06:31:25	<Exorcism>	update: https://irc.digitaldragon.dev/uploads/6b75bb3f2cb4020c/image.png
06:32:03	<Exorcism>	and because of this it's impossible for me to publish on ia at the moment oof
07:26:18		themadpro (themadpro) joins
08:07:23		nulldata quits [Ping timeout: 252 seconds]
08:11:02		nulldata (nulldata) joins
09:00:44		nulldata quits [Ping timeout: 252 seconds]
09:04:23		nulldata (nulldata) joins
09:17:49		Exorcism is now known as Exorcism_
09:35:31		themadpro quits [Client Quit]
09:46:48		Exorcism (exorcism) joins
11:57:07		Exorcism uploaded an image: (99KiB) < https://matrix.hackint.org/_matrix/media/v3/download/matrix.fedibird.com/UQEukxNbokaThmsFmxsEzEVw/1000017637.jpg >
11:57:14		Exorcism uploaded an image: (225KiB) < https://matrix.hackint.org/_matrix/media/v3/download/matrix.fedibird.com/skOoTTrwCoeaXEFZWbsZvoKM/1000017636.jpg >
11:57:30	<Exorcism>	hum
11:57:33	<Exorcism>	wrong channel 💀
13:26:56		Arcorann quits [Ping timeout: 252 seconds]
14:07:34		driib quits [Quit: The Lounge - https://thelounge.chat]
14:12:28		driib (driib) joins
15:43:05	<@arkiver>	pabs: interesting question, are you connected with software heritage?
15:43:10	<@arkiver>	i have no idea how big software heritage is
15:43:18	<@arkiver>	definitely not saying no to that
15:44:28	<pabs>	kind of. I know the Debian folks who started it, I've been submitting code there for about a year and am about to start a contract or two with them about expanding it
15:44:49	<pabs>	for Debian and snapshot.debian.org, I'm one of the Debian sysadmin team
15:45:54	<pabs>	I forget how big SWH is either, but the video/slides for this talk will contain that once published https://debconf23.debconf.org/talks/44-software-heritage-building-a-community-to-safeguard-the-software-commons/
15:46:53	<pabs>	SWH are also working on mirroring it, they have two orgs in Europe doing that now
15:47:31	<@arkiver>	very nice!
15:47:40	<@arkiver>	pabs: do you have an estimate of the size of software heritage?
15:47:51	<@arkiver>	~250 TB?
15:48:31	<pabs>	I think more, but I can't remember sorry. definitely in the slides, I can ask olasd to publish them
15:48:38	<@arkiver>	and what is the yearly growth of both debian and software heritage like?
15:49:08	<pabs>	snapshot.d.o at least 5TB, probably more these days
15:49:18	<@arkiver>	i can have the numbers a bit later - will likely not have an answer for you yet next week, and we have the current problems at IA (which are being fixed - various factors :/ )
15:49:35	<pabs>	it was 5TB/year in 2014-06-01
15:49:45	<@arkiver>	ah, so probably more like 20 TB/year nowadays
15:50:15	<pabs>	maybe, not sure. might be some data in our munin instance
15:50:31	<@arkiver>	would an assumption of yearly growth at 50 TB for both debian and software heritage be realistic?
15:50:40	<@arkiver>	(again, we can figure out these numbers in the coming weeks)
15:50:52		@arkiver if afk for ~40 minutes
15:50:54	<@arkiver>	is*
15:51:10	<pabs>	anyways, this is just exploration, only briefly mentioned to SWH during an interview and to Debian sysadmins on IRC
16:04:04		zhongfu quits [Ping timeout: 258 seconds]
16:05:49		zhongfu (zhongfu) joins
16:26:14		qw3rty quits [Ping timeout: 252 seconds]
16:30:00	<pabs>	arkiver: the slides https://annex.softwareheritage.org/public/talks/2023/2023-09-10-DebConf23.pdf
16:31:09	<pabs>	More than 1PB of source code files
16:31:09	<pabs>	(replicated 3 times by Software
16:31:09	<pabs>	Heritage)
16:31:09	<pabs>	More than 100 TB used for (resilient)
16:31:09	<pabs>	storage of the graph
16:31:10	<pabs>	Infrastructure support for mirroring:
16:31:14	<pabs>	100 TB kafka deployment (~30TB of
16:31:16	<pabs>	data used)
17:27:00	<@arkiver>	pabs: thank you. a PB is quite a bit of data
17:27:16	<@arkiver>	it's not an impossible amount, but might have to check in with some people
18:09:56		AlsoHP_Archivist joins
18:10:26	<HP_Archivist>	JAA: You helped with this script for downloading items from the CLI
18:10:31	<HP_Archivist>	ia search --itemlist 'uploader:harrypotterarchival@gmail.com -collection:game_replays -collection:videogame_videos -collection:speed_runs' | xargs -P 8 -n 1 ia download
18:10:50	<HP_Archivist>	How would I edit this to not download IA derived files? and only the files I upload?
18:40:12		AlsoHP_Archivist quits [Client Quit]
18:50:36	<@JAA>	HP_Archivist: Support for that was only added a few months ago: https://github.com/jjjake/internetarchive/issues/365
18:51:14	<@JAA>	You'll need 3.4.0 or higher, then add --exclude-source=derivative at the end of the command, probably.
19:04:31		driib quits [Client Quit]
19:16:34		qwertyasdfuiopghjkl quits [Client Quit]
19:21:05		driib (driib) joins
19:34:05		qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
19:35:01		BigBrain_ quits [Ping timeout: 245 seconds]
19:37:18		BigBrain_ (bigbrain) joins
21:03:18	<HP_Archivist>	JAA: Thanks. ia search --itemlist 'uploader:harrypotterarchival@gmail.com -collection:game_replays -collection:videogame_videos -collection:speed_runs' | xargs -P 8 -n 1 ia download --exclude-source=derivative ?
21:03:29	<HP_Archivist>	That doesn't look right, though
21:12:54	<imer>	HP_Archivist: Before | xargs most likely
21:25:40	<HP_Archivist>	imer: Tried that, it's actually: ia search --itemlist 'uploader:harrypotterarchival@gmail.com -collection:game_replays -collection:videogame_videos -collection:speed_runs' | xargs -P 8 -n 1 ia download --exclude-source=derivative
21:25:49	<HP_Archivist>	That's what seemed to work just now
21:28:54	<@JAA>	<futurama_squint.png>
21:29:01	<@JAA>	Those two commands look the same...?
21:29:50	<HP_Archivist>	JAA: They are. I didn't try it the first time (when I said it didn't look right). I tried imer's suggestion first, and that didn't work. But my original assumption was correct
21:31:13	<@JAA>	Yeah, adding --exclude-source=derivative at the end is what I said. :-)
21:31:32	<HP_Archivist>	:)
21:31:46	<HP_Archivist>	I should know not to second guess you, heh
21:31:49	<@JAA>	Technically, `ia download -h` says that the options must come after the identifier, but that's actually not true.
21:32:03	<@JAA>	And it's annoying to do with xargs, so whatever. :-P
21:32:39	<HP_Archivist>	Hm. Is there any way to add to this command a way to speed up the downloads, a la aria2?
21:32:43	<@JAA>	And it's also contrary to the vast majority of CLIs out there. Optional --x/-x arguments normally always come before positional ones.
21:33:19	<@JAA>	I'm not aware of one. You can make `ia download` print file URLs instead somehow, I think.
21:33:28	<@JAA>	Otherwise, just download more items in parallel by tweaking -P.
21:33:40	<@JAA>	But also, IA is struggling, so if it isn't urgent, maybe don't hammer them too hard.
21:34:24	<HP_Archivist>	JAA: Yeah, I thought about that. I'll just deal with the speed I'm getting rn.
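For anyone who would rather not fight xargs over argument order, the same idea can be sketched in Python. This is a rough illustration, not the official way to use the tool: the function names are invented, the identifiers would come from the output of `ia search --itemlist`, and the only `ia` option used is the `--exclude-source=derivative` one JAA mentions (which needs 3.4.0 or higher). Here the option goes after the identifier, as `ia download -h` asks.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_download_command(identifier, exclude_source="derivative"):
    """Build the argv for one item, with the option placed after the identifier."""
    return ["ia", "download", identifier, f"--exclude-source={exclude_source}"]


def download_all(identifiers, workers=8, runner=subprocess.run):
    """Run one download per identifier, at most `workers` at a time.

    `workers` plays the role of `xargs -P`; `runner` is swappable so the
    function can be tried out without actually calling `ia`.
    """
    commands = [build_download_command(identifier) for identifier in identifiers]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(runner, commands))
```

Calling `download_all(identifiers, workers=2)` would be a gentler setting while IA is struggling.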
21:37:25		fireonlive quits [Excess Flood]
21:39:01		fireonlive (fireonlive) joins
21:56:03		nicolas17 joins
22:52:52		BearFortress quits [Ping timeout: 265 seconds]
23:47:58		systwi quits [Ping timeout: 265 seconds]
23:48:25		systwi__ (systwi) joins
