01:15:16 | <nicolas17> | 6790/8739 [5:08:32<2:18:05, 4.25s/MiB] |
01:15:20 | <nicolas17> | archive.org pls |
01:16:02 | <@JAA> | Can confirm. |
01:17:17 | <nicolas17> | ok last megabyte took 21 seconds |
01:52:16 | | pabs quits [Ping timeout: 255 seconds] |
01:54:28 | | pabs (pabs) joins |
02:39:56 | <HP_Archivist> | Posted in the IA Discord. But will post here as well in case someone can assist |
02:40:38 | <HP_Archivist> | I decided I want to use aria2 to help speed up what is essentially downloading all items from one of my accounts as a backup |
02:41:50 | <HP_Archivist> | The command I have: `ia download --search 'uploader:email@example.org' --exclude-source derivative`, but then I need to add the syntax for aria, and it looks like I need to pass a list of the URLs to it.
02:42:01 | <HP_Archivist> | Not sure what to do next |
02:42:30 | <HP_Archivist> | JAA - and yes, I know what I said about not wanting to use it earlier. But then I realized this is going to take forever :P |
02:42:50 | <@JAA> | `ia download` has a `--dry-run` option that gets you a list of URLs. Then you can pass that to your download tool of choice. aria2c will by default try to download individual files in parallel, too, which isn't of much use with IA; you want to parallelise over different files (and ideally items/servers). |
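A minimal sketch of the dry-run step JAA describes, assuming the same uploader query from above; the output file name `urls.txt` is arbitrary:

```sh
# Print the download URLs for every matching item (skipping derivative files)
# instead of fetching them, and save the list for a separate download tool.
ia download --search 'uploader:email@example.org' --exclude-source derivative --dry-run > urls.txt
```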
02:43:16 | <@JAA> | I'd probably do it with `xargs` and `wget`, but there are many options. |
02:47:43 | <HP_Archivist> | Hmm, thanks. Didn't know --dry-run was an option. That's printing out now, it'll probably take a while. If I wanted to use xargs instead, what would that look like? |
02:50:40 | <@JAA> | xargs is just a tool that can do the parallelism thing. For example, `xargs -P 16 wget </path/to/list-of-urls` will set up 16 wget processes in parallel, each with some number of URLs from the list. Might need some twiddling with other options. xargs options go before, wget options after the 'wget'. |
02:51:25 | <@JAA> | Console output with the progress bars will probably be a mess.
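A rough sketch of the xargs/wget approach, assuming the dry-run list is in `urls.txt` and that the URLs have the usual `https://archive.org/download/<item>/<file>` shape; the parallelism and batch size are arbitrary, and `-q` is one way to avoid the messy interleaved progress bars:

```sh
# 16 parallel wget processes, each handed up to 100 URLs from the list.
# -x -nH --cut-dirs=1 recreates an <item>/<file> directory layout instead of
# dumping every file into the current directory; -q suppresses progress output.
xargs -P 16 -n 100 wget -q -x -nH --cut-dirs=1 < urls.txt
```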
03:04:02 | <HP_Archivist> | I think I might use xargs + wget instead because even though aria speeds things up, that's a good point about spreading it out over different files / items / servers. |
03:04:46 | <HP_Archivist> | Yeah - that .txt list is still being generated.
03:05:09 | <@JAA> | You can probably configure aria2c to spread across files instead if you prefer that. It has a lot of knobs. |
03:07:54 | <datechnoman> | How does one get an invite to the IA Discord? I didn't know there was one (even an unofficial one).
03:07:57 | <HP_Archivist> | Yeah I could create a config file for aria2c. I'll see. Whenever this list finishes generating, heh
03:08:07 | <datechnoman> | Would be great to get on there |
03:08:21 | <HP_Archivist> | It's an unofficial Discord - there aren't any official IA Discord servers that I am aware of |
03:08:53 | <HP_Archivist> | datechnoman: https://discord.gg/GPxCGYp2z3 |
03:08:56 | <@JAA> | Discord-- |
03:08:57 | <eggdrop> | [karma] 'Discord' now has -17 karma! |
03:09:30 | <HP_Archivist> | I know JAA hates Discord, but you're always welcome to join :) |
03:13:41 | <HP_Archivist> | Oh and JAA - I am assuming that configuring aria2c to work across files will not download all files into a single folder or anything crazy like that. Assuming the folder/file structure is tied to an item's individual folder. |
03:13:59 | <HP_Archivist> | *As is the case normally when using `ia download`
03:16:55 | <nicolas17> | if you give aria2 URLs to different files (whether with -i filelist.txt or with -Z url1 url2 url3), it will still download them all to the current directory |
03:17:17 | <nicolas17> | the list can specify directories though |
03:17:30 | <nicolas17> | http://example.com/foo/file.dat |
03:17:33 | <nicolas17> | dir=foo |
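If aria2c is preferred, something along these lines should parallelise across different files rather than splitting individual files; the flag values are illustrative, and the input file uses the indented per-download options nicolas17 shows:

```sh
# list.txt: one URL per line, per-download options indented below it, e.g.
#   https://archive.org/download/some-item/some-file.mp4
#     dir=some-item
#
# -j 16       up to 16 downloads running in parallel
# -s 1 -x 1   don't split any single file across multiple connections
aria2c -i list.txt -j 16 -s 1 -x 1
```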
03:18:32 | <@JAA> | Yeah, sounds like wget is probably the better option then. |
03:18:44 | <nicolas17> | also |
03:19:03 | <HP_Archivist> | Ugh, I'm glad I asked ^ |
03:19:12 | <nicolas17> | you're saying downloading files that are on different servers at the same time would be faster, right? since downloading two files from the same server won't speed things up much |
03:20:21 | <nicolas17> | if there are multiple files in the same item, by default you would have them all together in your URL list, and those are on the same server...
03:20:26 | <nicolas17> | but you could just shuffle the list :) |
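The shuffle is a one-liner, assuming GNU coreutils' `shuf` and the `urls.txt` list from the dry run:

```sh
# Randomise the URL order so concurrent downloads tend to hit different
# items (and therefore different storage servers).
shuf urls.txt > urls-shuffled.txt
```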
03:20:35 | <@JAA> | Going the xargs/wget route has the advantage that the URLs from the same item will (usually) be within a single wget process, and wget can then reuse the connection to the server storing the item. Although that'll mostly matter when you have a lot of files in an item. |
03:21:35 | <nicolas17> | JAA: yeah but you would also end up with multiple processes downloading from the same server, when it would be better to spread it out |
03:21:46 | <nicolas17> | if the files are large, that may matter more than the latency of connecting |
03:22:12 | <@JAA> | You probably wouldn't since many files are retrieved by one wget process. |
03:22:40 | <nicolas17> | how many files in total? how many files per wget invocation? |
03:22:43 | <@JAA> | xargs takes the first N URLs and starts a wget process for those, then the next batch, etc. |
03:23:37 | <@JAA> | Unless you tend to have more files per item than that batch size, you'll usually only have the sequential download by one wget process from a particular item. |
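A quick way to see the batching behaviour JAA describes, with `echo` standing in for wget; `-n` controls how many URLs each child process receives:

```sh
# Each output line is one child process and the batch of arguments it got.
seq 1 12 | xargs -n 4 echo
# 1 2 3 4
# 5 6 7 8
# 9 10 11 12
```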
03:24:02 | <nicolas17> | oh I was assuming a single giant list, not one list per item :P |
03:24:15 | <@JAA> | It is a single giant list, but that doesn't matter. |
03:25:23 | <HP_Archivist> | To give you an idea: https://transfer.archivete.am/xplGW/urls.txt |
03:25:23 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/xplGW/urls.txt |
03:25:33 | <nicolas17> | 8413/8739 [7:18:50<33:23, 6.15s/MiB] |
03:25:34 | <nicolas17> | come ON |
03:25:54 | <HP_Archivist> | That's what it's compiled thus far. And it's still going. I have 22k items under said account |
03:25:55 | <datechnoman> | Thank you so much HP_Archivist ! |
03:26:54 | <HP_Archivist> | nicolas17: The list also excludes any derivative files. So, it's grabbing original files and IA metadata files. |
03:27:23 | <HP_Archivist> | No problem, datechnoman |
03:28:44 | <HP_Archivist> | The majority of items, 20k or so, are archived YouTube videos. Some are from other video sharing sites like Vimeo, DailyMotion, etc. But the majority is YouTube. And then there are other items like software, text, and audio.
03:44:08 | <nicolas17> | ok uploading from my VPS isn't any better but at least I can shut down my noisy desktop PC now |
03:54:48 | <HP_Archivist> | nicolas17 - that list is still being compiled. It will be tens of thousands of files. I'm wondering if it's even worth trying parallelism across files since there will be that many?
03:55:30 | <nicolas17> | if it's small files, you will be affected by the latency of making requests |
04:09:22 | <HP_Archivist> | So, in effect, maybe it's not worth it then? |
04:10:33 | <nicolas17> | it is |
04:10:40 | <nicolas17> | while one file is waiting to connect to the server, send the request, etc., your other wget instance will be using your bandwidth productively, downloading something else
04:23:01 | | ScenarioPlanet quits [Ping timeout: 255 seconds] |
04:23:01 | | Pedrosso quits [Ping timeout: 255 seconds] |
04:23:05 | | TheTechRobo quits [Ping timeout: 272 seconds] |
04:34:26 | <HP_Archivist> | Hmm alright, thanks |
04:34:35 | <HP_Archivist> | I was just reading this, fun read: https://blog.decryption.net.au/t/how-to-download-2400-items-off-the-internet-archive/99 |
04:34:56 | <HP_Archivist> | That guy had 2400 items. I have 22k. So, I guess this will be a while no matter what |
05:00:39 | | Pedrosso joins |
05:00:48 | | ScenarioPlanet (ScenarioPlanet) joins |
05:02:07 | | TheTechRobo (TheTechRobo) joins |
05:19:16 | | TheTechRobo quits [Ping timeout: 255 seconds] |
05:19:16 | | Pedrosso quits [Ping timeout: 255 seconds] |
05:19:43 | | ScenarioPlanet quits [Ping timeout: 255 seconds] |
05:21:31 | | nicolas17 quits [Ping timeout: 255 seconds] |
05:22:33 | | nicolas17 joins |
05:31:38 | | Pedrosso joins |
05:31:44 | | ScenarioPlanet (ScenarioPlanet) joins |
05:33:15 | | TheTechRobo (TheTechRobo) joins |
06:19:47 | | michaelblob quits [Read error: Connection reset by peer] |
06:21:08 | | michaelblob (michaelblob) joins |
07:41:39 | | nicolas17 quits [Read error: Connection reset by peer] |
07:41:57 | | nicolas17 joins |
12:23:35 | | qwertyasdfuiopghjkl quits [Client Quit] |
12:28:28 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
14:10:52 | | Flashfire42 quits [Client Quit] |
14:11:47 | | Flashfire42 joins |
14:32:55 | <HP_Archivist> | The list finished compiling. It's a 14 MB .txt file.
14:55:29 | | pabs quits [Read error: Connection reset by peer] |
14:56:22 | | pabs (pabs) joins |
15:09:44 | <HP_Archivist> | JAA: Well, I have xargs + wget set up using a script. It's running through each download URL. Seems to be working so far.
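One possible shape for such a script, tying the earlier pieces together; the file names, batch size, and parallelism here are assumptions, not what HP_Archivist actually used:

```sh
#!/bin/sh
# Hypothetical wrapper: shuffle the dry-run URL list, then fan it out to
# parallel wget processes that recreate an <item>/<file> directory layout.
set -eu

URLS=urls.txt                      # output of `ia download ... --dry-run`
shuf "$URLS" > urls-shuffled.txt   # spread consecutive downloads across servers

# -c resumes partial downloads if the script is re-run.
xargs -P 16 -n 100 wget -q -c -x -nH --cut-dirs=1 < urls-shuffled.txt
```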
15:10:22 | <@JAA> | :-) |
15:10:31 | <HP_Archivist> | :) |
16:04:42 | <@JAA> | > 9.79s/MiB |
16:04:43 | <@JAA> | IA pls |
16:19:52 | | DLoader quits [Ping timeout: 255 seconds] |
17:46:30 | | DLoader4 joins |
17:51:51 | | DLoader4 quits [Ping timeout: 272 seconds] |
18:04:01 | | balrog quits [Quit: Bye] |
18:12:32 | | balrog (balrog) joins |
19:11:02 | | DLoader4 joins |
19:11:32 | | DLoader4 quits [Client Quit] |
19:14:01 | | DLoader4 joins |
19:14:22 | | DLoader4 quits [Client Quit] |
19:15:08 | | DLoader4 joins |
19:15:41 | | DLoader4 quits [Client Quit] |
19:17:01 | | DLoader (DLoader) joins |
19:17:04 | | DLoader43 joins |
19:18:12 | | DLoader43 quits [Client Quit] |
19:18:12 | | DLoader quits [Client Quit] |
19:24:33 | | DLoader (DLoader) joins |
19:30:41 | | pabs quits [Read error: Connection reset by peer] |
19:31:29 | | pabs (pabs) joins |
22:53:05 | | yarrow quits [Client Quit] |