01:15:16 | <nicolas17> | 6790/8739 [5:08:32<2:18:05, 4.25s/MiB] |
01:15:20 | <nicolas17> | archive.org pls |
01:16:02 | <@JAA> | Can confirm. |
01:17:17 | <nicolas17> | ok last megabyte took 21 seconds |
01:52:16 | | pabs quits [Ping timeout: 255 seconds] |
01:54:28 | | pabs (pabs) joins |
02:39:56 | <HP_Archivist> | Posted in the IA Discord. But will post here as well in case someone can assist |
02:40:38 | <HP_Archivist> | I decided I want to use aria2 to help speed up what is essentially downloading all items from one of my accounts as a backup |
02:41:50 | <HP_Archivist> | The command I have: `ia download --search 'uploader:email@example.org' --exclude-source derivative`, but then I need to add the syntax for aria, and it looks like I need to pass a list of the URLs to it.
02:42:01 | <HP_Archivist> | Not sure what to do next |
02:42:30 | <HP_Archivist> | JAA - and yes, I know what I said about not wanting to use it earlier. But then I realized this is going to take forever :P |
02:42:50 | <@JAA> | `ia download` has a `--dry-run` option that gets you a list of URLs. Then you can pass that to your download tool of choice. aria2c will by default try to download individual files in parallel, too, which isn't of much use with IA; you want to parallelise over different files (and ideally items/servers). |
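A minimal sketch of the dry-run step JAA describes, assuming the same uploader query from above; the output file name `urls.txt` is arbitrary:

```sh
# Print the download URLs for every matching item (skipping derivative files)
# instead of fetching them, and save the list for a separate download tool.
ia download --search 'uploader:email@example.org' --exclude-source derivative --dry-run > urls.txt
```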
02:43:16 | <@JAA> | I'd probably do it with `xargs` and `wget`, but there are many options. |
02:47:43 | <HP_Archivist> | Hmm, thanks. Didn't know --dry-run was an option. That's printing out now, it'll probably take a while. If I wanted to use xargs instead, what would that look like? |
02:50:40 | <@JAA> | xargs is just a tool that can do the parallelism thing. For example, `xargs -P 16 wget </path/to/list-of-urls` will set up 16 wget processes in parallel, each with some number of URLs from the list. Might need some twiddling with other options. xargs options go before, wget options after the 'wget'. |
02:51:25 | <@JAA> | Console output with the progress bars will probably be a mess.
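A rough sketch of the xargs/wget approach, assuming the dry-run list is in `urls.txt` and that the URLs have the usual `https://archive.org/download/<item>/<file>` shape; the parallelism and batch size are arbitrary, and `-q` is one way to avoid the messy interleaved progress bars:

```sh
# 16 parallel wget processes, each handed up to 100 URLs from the list.
# -x -nH --cut-dirs=1 recreates an <item>/<file> directory layout instead of
# dumping every file into the current directory; -q suppresses progress output.
xargs -P 16 -n 100 wget -q -x -nH --cut-dirs=1 < urls.txt
```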
03:04:02 | <HP_Archivist> | I think I might use xargs + wget instead because even though aria speeds things up, that's a good point about spreading it out over different files / items / servers. |
03:04:46 | <HP_Archivist> | Yeah - that .txt list is still being generated.
03:05:09 | <@JAA> | You can probably configure aria2c to spread across files instead if you prefer that. It has a lot of knobs. |
03:07:54 | <datechnoman> | How does one get an invite to the IA Discord? I didn't know there was one (even an unofficial one).
03:07:57 | <HP_Archivist> | Yeah I could create a config file for aria2c. I'll see. Whenever this list finishes generating, heh
03:08:07 | <datechnoman> | Would be great to get on there |
03:08:21 | <HP_Archivist> | It's an unofficial Discord - there aren't any official IA Discord servers that I am aware of |
03:08:53 | <HP_Archivist> | datechnoman: https://discord.gg/GPxCGYp2z3 |
03:08:56 | <@JAA> | Discord-- |
03:08:57 | <eggdrop> | [karma] 'Discord' now has -17 karma! |
03:09:30 | <HP_Archivist> | I know JAA hates Discord, but you're always welcome to join :) |
03:13:41 | <HP_Archivist> | Oh and JAA - I am assuming that configuring aria2c to work across files will not download all files into a single folder or anything crazy like that. Assuming the folder/file structure is tied to an item's individual folder. |
03:13:59 | <HP_Archivist> | *As is the case normally when using `ia download`
03:16:55 | <nicolas17> | if you give aria2 URLs to different files (whether with -i filelist.txt or with -Z url1 url2 url3), it will still download them all to the current directory |
03:17:17 | <nicolas17> | the list can specify directories though |
03:17:30 | <nicolas17> | http://example.com/foo/file.dat |
03:17:33 | <nicolas17> | dir=foo |
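If aria2c is preferred, something along these lines should parallelise across different files rather than splitting individual files; the flag values are illustrative, and the input file uses the indented per-download options nicolas17 shows:

```sh
# list.txt: one URL per line, per-download options indented below it, e.g.
#   https://archive.org/download/some-item/some-file.mp4
#     dir=some-item
#
# -j 16       up to 16 downloads running in parallel
# -s 1 -x 1   don't split any single file across multiple connections
aria2c -i list.txt -j 16 -s 1 -x 1
```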
03:18:32 | <@JAA> | Yeah, sounds like wget is probably the better option then. |
03:18:44 | <nicolas17> | also |
03:19:03 | <HP_Archivist> | Ugh, I'm glad I asked ^ |
03:19:12 | <nicolas17> | you're saying downloading files that are on different servers at the same time would be faster, right? since downloading two files from the same server won't speed things up much |
03:20:21 | <nicolas17> | if there are multiple files in the same item, by default you would have them all together in your URL list, and those are on the same server...
03:20:26 | <nicolas17> | but you could just shuffle the list :) |
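The shuffle is a one-liner, assuming GNU coreutils' `shuf` and the `urls.txt` list from the dry run:

```sh
# Randomise the URL order so concurrent downloads tend to hit different
# items (and therefore different storage servers).
shuf urls.txt > urls-shuffled.txt
```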
03:20:35 | <@JAA> | Going the xargs/wget route has the advantage that the URLs from the same item will (usually) be within a single wget process, and wget can then reuse the connection to the server storing the item. Although that'll mostly matter when you have a lot of files in an item. |
03:21:35 | <nicolas17> | JAA: yeah but you would also end up with multiple processes downloading from the same server, when it would be better to spread it out |
03:21:46 | <nicolas17> | if the files are large, that may matter more than the latency of connecting |
03:22:12 | <@JAA> | You probably wouldn't since many files are retrieved by one wget process. |
03:22:40 | <nicolas17> | how many files in total? how many files per wget invocation? |
03:22:43 | <@JAA> | xargs takes the first N URLs and starts a wget process for those, then the next batch, etc. |
03:23:37 | <@JAA> | Unless you tend to have more files per item than that batch size, you'll usually only have the sequential download by one wget process from a particular item. |
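A quick way to see the batching behaviour JAA describes, with `echo` standing in for wget; `-n` controls how many URLs each child process receives:

```sh
# Each output line is one child process and the batch of arguments it got.
seq 1 12 | xargs -n 4 echo
# 1 2 3 4
# 5 6 7 8
# 9 10 11 12
```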
03:24:02 | <nicolas17> | oh I was assuming a single giant list, not one list per item :P |
03:24:15 | <@JAA> | It is a single giant list, but that doesn't matter. |
03:25:23 | <HP_Archivist> | To give you an idea: https://transfer.archivete.am/xplGW/urls.txt |
03:25:23 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/xplGW/urls.txt |
03:25:33 | <nicolas17> | 8413/8739 [7:18:50<33:23, 6.15s/MiB] |
03:25:34 | <nicolas17> | come ON |
03:25:54 | <HP_Archivist> | That's what it's compiled thus far. And it's still going. I have 22k items under said account |
03:25:55 | <datechnoman> | Thank you so much HP_Archivist ! |
03:26:54 | <HP_Archivist> | nicolas17: The list also excludes any derivative files. So, it's grabbing original files and IA metadata files. |
03:27:23 | <HP_Archivist> | No problem, datechnoman |
03:28:44 | <HP_Archivist> | The majority of items, 20k or so, are archived YouTube videos. Some are from other video sharing sites like Vimeo, DailyMotion, etc. But the majority is YouTube. And then there are other items like software, text, and audio.
03:44:08 | <nicolas17> | ok uploading from my VPS isn't any better but at least I can shut down my noisy desktop PC now |
03:54:48 | <HP_Archivist> | nicolas17 - that list is still being compiled. It will be tens of thousands of files. I'm wondering if it's even worth trying parallelism across files since there will be that many?
03:55:30 | <nicolas17> | if it's small files, you will be affected by the latency of making requests |
04:09:22 | <HP_Archivist> | So, in effect, maybe it's not worth it then? |
04:10:33 | <nicolas17> | it is |
04:10:40 | <nicolas17> | while one file is waiting to connect to the server, send the request, etc., your other wget instance will be using your bandwidth productively, downloading something else
04:23:01 | | ScenarioPlanet quits [Ping timeout: 255 seconds] |
04:23:01 | | Pedrosso quits [Ping timeout: 255 seconds] |
04:23:05 | | TheTechRobo quits [Ping timeout: 272 seconds] |
04:34:26 | <HP_Archivist> | Hmm alright, thanks |
04:34:35 | <HP_Archivist> | I was just reading this, fun read: https://blog.decryption.net.au/t/how-to-download-2400-items-off-the-internet-archive/99 |
04:34:56 | <HP_Archivist> | That guy had 2400 items. I have 22k. So, I guess this will be a while no matter what |
05:00:39 | | Pedrosso joins |
05:00:48 | | ScenarioPlanet (ScenarioPlanet) joins |
05:02:07 | | TheTechRobo (TheTechRobo) joins |
05:19:16 | | TheTechRobo quits [Ping timeout: 255 seconds] |
05:19:16 | | Pedrosso quits [Ping timeout: 255 seconds] |
05:19:43 | | ScenarioPlanet quits [Ping timeout: 255 seconds] |
05:21:31 | | nicolas17 quits [Ping timeout: 255 seconds] |
05:22:33 | | nicolas17 joins |
05:31:38 | | Pedrosso joins |
05:31:44 | | ScenarioPlanet (ScenarioPlanet) joins |
05:33:15 | | TheTechRobo (TheTechRobo) joins |
06:19:47 | | michaelblob quits [Read error: Connection reset by peer] |
06:21:08 | | michaelblob (michaelblob) joins |
07:41:39 | | nicolas17 quits [Read error: Connection reset by peer] |
07:41:57 | | nicolas17 joins |
12:23:35 | | qwertyasdfuiopghjkl quits [Client Quit] |
12:28:28 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
14:10:52 | | Flashfire42 quits [Client Quit] |
14:11:47 | | Flashfire42 joins |
14:32:55 | <HP_Archivist> | The list finished compiling. It's a 14 MB .txt file.
14:55:29 | | pabs quits [Read error: Connection reset by peer] |
14:56:22 | | pabs (pabs) joins |
15:09:44 | <HP_Archivist> | JAA: Well, I have xargs + wget set up using a script. It's running through each download URL. Seems to be working so far.
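One possible shape for such a script, tying the earlier pieces together; the file names, batch size, and parallelism here are assumptions, not what HP_Archivist actually used:

```sh
#!/bin/sh
# Hypothetical wrapper: shuffle the dry-run URL list, then fan it out to
# parallel wget processes that recreate an <item>/<file> directory layout.
set -eu

URLS=urls.txt                      # output of `ia download ... --dry-run`
shuf "$URLS" > urls-shuffled.txt   # spread consecutive downloads across servers

# -c resumes partial downloads if the script is re-run.
xargs -P 16 -n 100 wget -q -c -x -nH --cut-dirs=1 < urls-shuffled.txt
```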
15:10:22 | <@JAA> | :-) |
15:10:31 | <HP_Archivist> | :) |
16:04:42 | <@JAA> | > 9.79s/MiB |
16:04:43 | <@JAA> | IA pls |
16:19:52 | | DLoader quits [Ping timeout: 255 seconds] |
17:46:30 | | DLoader4 joins |
17:51:51 | | DLoader4 quits [Ping timeout: 272 seconds] |
18:04:01 | | balrog quits [Quit: Bye] |
18:12:32 | | balrog (balrog) joins |
19:11:02 | | DLoader4 joins |
19:11:32 | | DLoader4 quits [Client Quit] |
19:14:01 | | DLoader4 joins |
19:14:22 | | DLoader4 quits [Client Quit] |
19:15:08 | | DLoader4 joins |
19:15:41 | | DLoader4 quits [Client Quit] |
19:17:01 | | DLoader (DLoader) joins |
19:17:04 | | DLoader43 joins |
19:18:12 | | DLoader43 quits [Client Quit] |
19:18:12 | | DLoader quits [Client Quit] |
19:24:33 | | DLoader (DLoader) joins |
19:30:41 | | pabs quits [Read error: Connection reset by peer] |
19:31:29 | | pabs (pabs) joins |
22:53:05 | | yarrow quits [Client Quit] |