00:01:19 | | jtagcat quits [Client Quit] |
00:01:40 | | jtagcat (jtagcat) joins |
00:04:35 | | katia quits [Remote host closed the connection] |
00:04:52 | | katia (katia) joins |
01:18:53 | | eggdrop leaves |
01:24:11 | | eggdrop (eggdrop) joins |
01:24:34 | | eggdrop quits [Client Quit] |
01:27:42 | | eggdrop (eggdrop) joins |
01:38:04 | <audrooku|m> | Yeah, but I feel like that's a problem they'll have to solve at some point; I assume the issue is purely one of letting the user navigate to the specific record
01:41:20 | <@JAA> | It requires indexing the request record, and associating it with the corresponding response record. There's nothing saying there has to be a 1-to-1 correspondence between these either, and they can appear in any order. It's definitely not a trivial problem. |
01:51:45 | <audrooku|m> | Yeah, that's not what I meant to imply |
02:42:53 | | BearFortress quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.] |
03:17:03 | | BearFortress joins |
03:28:27 | | BearFortress_ joins |
03:31:53 | | BearFortress quits [Ping timeout: 272 seconds] |
04:32:02 | | DogsRNice quits [Read error: Connection reset by peer] |
06:16:40 | <nicolas17> | audrooku|m: one of the things sent in the POST request to get the file is "download purpose" https://transfer.archivete.am/inline/HMehz/samsung-opensource.png |
06:17:30 | <nicolas17> | would I need to download every file under all 7 options so that WBM playback works with all 7 options? |
06:17:59 | <nicolas17> | (in a hypothetical world where WBM supports matching requests based on POST request body) |
06:18:45 | <@JAA> | Probably, yeah. |
06:19:27 | <@JAA> | There are also ways this can quickly fall apart, e.g. if the token is dynamically calculated with JS involving the current timestamp or similar. |
06:19:59 | <nicolas17> | no, the token is sent by the server |
06:20:03 | <nicolas17> | but... it's one-time use |
06:20:14 | <nicolas17> | so if I capture a download picking the first Download Purpose |
06:20:47 | <nicolas17> | that token no longer works, and I need to get a new one (by loading the form again) if I want to re-download with the second Download Purpose, but then each capture of the form would only work for one option |
06:21:36 | <@JAA> | Ah, right, so that'll never work 'properly', fun. |
06:22:05 | <@JAA> | The solution there would be to ignore the token on indexing, but I bet that'll never happen for specific sites. |
06:22:26 | <@JAA> | At least not for something that isn't one of the largest sites on the web. |
06:22:39 | <nicolas17> | if you're gonna do site-specific hacks like that, you might as well ignore the "download purpose" parameter too :P |
06:23:50 | <@JAA> | Yeah |
06:46:18 | <audrooku|m> | You would have to allow users to search the POST requests to find the matching response |
06:46:50 | <audrooku|m> | But at that point not having it in the WBM makes sense, I suppose
06:47:47 | <@JAA> | The way the webrecorder people have solved it is by encoding the POST data somehow and yeeting it into a _postdata parameter or something like that. That way, it's just part of the URL and would naturally show up in the WBM as well. |
06:49:15 | <audrooku|m> | Sounds pretty hacky; obviously POST request data can be very large and contain binary data
06:57:47 | <@JAA> | I want to say base64 was involved to handle the latter and the URLs did get huge, but I'm not sure. |
06:57:58 | <@JAA> | And well, it's webrecorder, of course it'll be hacky. |
06:59:39 | <@JAA> | Here's the gory details, enjoy: https://github.com/webrecorder/pywb/commit/626da99899865e7f9bf9bfdd775218b36d6a2567 |
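For context, the approach in the linked pywb commit works roughly as described above: the POST body is serialized (base64 for binary payloads) and appended to the URL as extra query parameters, so the request can be indexed and replayed like a GET. A minimal shell sketch of the idea; the URL and body are made up, and the parameter names reflect pywb's convention as best remembered:

```sh
# Hypothetical request: fold the POST body into the URL so it can be
# indexed like a GET, which is what pywb does during canonicalization.
url='https://example.com/download'
body='download_purpose=personal&token=abc123'
# URL-safe base64 of the raw body, padding stripped
encoded=$(printf '%s' "$body" | base64 | tr '+/' '-_' | tr -d '=')
echo "${url}?__wb_method=POST&__wb_post_data=${encoded}"
```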
06:59:50 | <fireonlive> | gotta love webrecorder |
07:58:27 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
11:17:03 | | zhongfu (zhongfu) joins |
12:46:37 | | jodizzle quits [Remote host closed the connection] |
12:48:57 | | jodizzle (jodizzle) joins |
13:13:55 | | Arcorann quits [Ping timeout: 272 seconds] |
13:24:36 | | Barto quits [Quit: WeeChat 4.1.2] |
13:25:31 | | Barto (Barto) joins |
16:38:04 | | HP_Archivist (HP_Archivist) joins |
18:16:24 | | that_lurker quits [Quit: I am most likely running a system update] |
18:17:22 | | that_lurker (that_lurker) joins |
18:23:16 | | igloo22225 quits [Remote host closed the connection] |
18:27:58 | | igloo22225 (igloo22225) joins |
18:31:26 | | igloo22225 quits [Client Quit] |
18:41:18 | | igloo22225 (igloo22225) joins |
18:45:38 | | Skylion joins |
18:45:56 | <Skylion> | What CLI do people use for downloading torrent files from the Internet Archive? I usually use rtorrent but realized it sadly doesn't support web seeding.
18:46:36 | <nicolas17> | rtorrent not supporting webseeding is awful |
18:46:40 | <nicolas17> | try aria2c |
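A minimal aria2c invocation for an IA item's torrent; `IDENTIFIER` is a placeholder item name, and `--seed-time=0` makes aria2c exit once the download finishes instead of seeding:

```sh
# Fetch the item's torrent, then download it; aria2c can use the
# HTTP webseeds embedded in IA torrents alongside regular peers.
wget 'https://archive.org/download/IDENTIFIER/IDENTIFIER_archive.torrent'
aria2c --seed-time=0 IDENTIFIER_archive.torrent
```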
18:54:46 | | katia quits [Remote host closed the connection] |
18:55:36 | | katia (katia) joins |
19:16:24 | <Nemo_bis> | Skylion: it depends how many torrents you need to download; for thousands, I've found only deluge works |
19:17:10 | <nicolas17> | I think aria2 defaults to downloading 5 files at a time (whether torrents or plain HTTP URLs) |
19:17:19 | <nicolas17> | which you can increase but probably not too much :P |
19:17:23 | <Nemo_bis> | for tens of thousands, it's better to download with parallel wget and then let rtorrent verify the local files |
19:17:37 | <nicolas17> | Nemo_bis: downloading thousands of torrents *at the same time* would definitely kill my router |
19:17:44 | <Nemo_bis> | not mine ^_^ |
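A sketch of the bulk approach Nemo_bis describes, assuming a prepared list of file URLs (the filename and parallelism here are arbitrary): fetch the payloads with parallel wget, then let rtorrent hash-check the files on disk instead of downloading them.

```sh
# Download many files in parallel; -c resumes partial files on rerun.
xargs -P16 -n1 wget -c < file-urls.txt
# Then load the matching .torrent files into rtorrent pointed at the same
# download directory, so it verifies the local data rather than re-fetching it.
```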
19:18:13 | <Skylion> | Also, one last question. What is people's preferred method for downloading a large collection? Using the Python API is great, but the download feature is slow and not robust. Ultimately, my workflow is to download the metadata via that API, then construct an rclone union filesystem over that and use rclone to handle the download + checksumming,
19:18:13 | <Skylion> | but that seems less efficient than torrenting (download speeds are way faster) now that I have that working. For the future, what are people's preferred workflows for downloading entire collections?
19:18:31 | <Nemo_bis> | Define "large collection" |
19:18:38 | <fireonlive> | we need to get nicolas17 a new router |
19:18:41 | <nicolas17> | also, for thousands of items you can take advantage of parallel downloads yeah |
19:18:49 | <nicolas17> | but if there's a few huge items you want to download faster |
19:18:50 | <Skylion> | I did 1800 items over the summer adding up to about 1 Petabyte. |
19:19:00 | <fireonlive> | nicolas17: have you asked your ISP if they have something different since you've signed up at least |
19:19:03 | <Nemo_bis> | Ok, so they're few items of big size |
19:19:13 | <nicolas17> | the best way is to get the torrent, and ask a friend with good IA peering to also get the torrent :D |
19:19:43 | <Skylion> | That's an extreme case though. I am now just trying to download a 1.7TB dataset of about 27 items, but each item is massive. |
19:20:20 | <nicolas17> | fireonlive: in theory I could ask them to switch the modem to bridge mode (no NAT or Wi-Fi or anything) so I can connect my own Wi-Fi router behind it |
19:20:48 | <fireonlive> | might be a good idea if it keeps crashing because of nat table issues |
19:20:57 | <nicolas17> | well that's just my theory of why it crashes |
19:21:09 | <Skylion> | Example: I just tried to download the whole collection with the Python IA API, and it's slow. I just tried downloading one of the large items via torrent and it went from 500 KB/s to over 30 MB/s
19:21:22 | <Nemo_bis> | nicolas17: do you have access to check whether the router is running out of RAM? |
19:21:46 | <Nemo_bis> | Skylion: well that's nice, means that latency isn't too bad |
19:21:48 | <nicolas17> | Nemo_bis: what do you mean by "access", a shell? |
19:22:04 | <Nemo_bis> | nicolas17: for example... but sometimes it's in a GUI too |
19:22:30 | <nicolas17> | I should feel fortunate they didn't lock me out of the web interface and I can at least forward ports |
19:23:29 | <Nemo_bis> | Skylion: 30*27 MB/s is about 6.5 Gbps, can you really go faster than that?
19:24:09 | <Skylion> | Haven't tried it, but I am on a university compute cluster. The theoretical limit is 10 Gbps on the server.
19:24:14 | <nicolas17> | ohh looks like I can set bridge mode from the web UI myself, neat |
19:24:23 | <Skylion> | 30*500 KB/s was painfully slow though
19:24:43 | <Nemo_bis> | I'd expect you to hit I/O limits before you hit 10 Gbps |
19:25:13 | <Skylion> | Yeah, the NFS server usually craps out once I fill the SSD RAID buffer. Before then, it's fine.
19:25:15 | <Nemo_bis> | if you manage to maintain 30 MB/s per torrent you should consider yourself lucky enough |
19:25:26 | <Skylion> | Yeah, just orchestrating that is a bit of a pain.
19:25:27 | <nicolas17> | at those speeds the torrent client starts to matter a lot |
19:25:47 | <Skylion> | It's really just fancy web-download multiplexing, only connecting to seeds on IA |
19:25:58 | <Nemo_bis> | if you get much slower than that and you're in a hurry, you can try a digitalocean vps in SFO and download the same torrent there with deluge |
19:26:14 | <nicolas17> | sometimes you get CPU bottlenecks |
19:26:31 | <Nemo_bis> | I think I posted an example deluge config somewhere |
19:26:38 | <nicolas17> | but it's more noticeable over actual bittorrent protocol rather than webseeds |
19:27:18 | <Nemo_bis> | or maybe I didn't https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps/Torrents#Deluge_alternative |
19:27:47 | <nicolas17> | for example if I decide "actually it's better if I download this on computer B", and I start the torrent on computer B, and computer A starts seeding to it over the LAN, I'll catch rtorrent using 100% CPU |
19:27:57 | <nicolas17> | it can't keep up with the network |
19:28:36 | <Nemo_bis> | rtorrent has a bunch of single-threaded bottlenecks |
19:28:38 | <@JAA> | I usually do an `ia search ... --itemlist | xargs ia list --location | xargs -P... wget`. Most of the time, I don't actually download to disk but rather have a script that takes one (WARC) URL as an argument and then does some `curl $url | zstdcat | grep` nonsense on it. :-) |
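Spelled out with hypothetical query, parallelism, and pattern values, that pipeline looks something like the following (`-n1` keeps `ia list` to one identifier per invocation):

```sh
# Build an item list from a search, resolve each item's file URLs,
# then fetch them with 8 parallel wget processes.
ia search 'collection:examplecollection' --itemlist \
  | xargs -n1 ia list --location \
  | xargs -P8 -n1 wget
# Or stream a single WARC without writing it to disk:
curl -sL "$url" | zstdcat | grep -a 'pattern'
```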
19:28:52 | <nicolas17> | oh oh |
19:29:09 | <nicolas17> | you can also use aria2c to do parallel HTTP downloads instead of using xargs -P wget |
19:29:45 | <Nemo_bis> | ...but that doesn't help too much when each of the 5 connections is stuck waiting on the same slow disk on IA side |
19:30:05 | <Nemo_bis> | it's better to have N ongoing wget processes downloading different things; you can usually then continue with wget -c
19:30:13 | <fireonlive> | how can we move IA to all SSD storage... 🤔 |
19:30:22 | <nicolas17> | Nemo_bis: no no I don't mean 5 connections to the same server |
19:30:26 | <@JAA> | 5 connections is cute. :-) |
19:30:26 | <Nemo_bis> | Assault fort knox? |
19:30:34 | <fireonlive> | x3 |
19:30:41 | <@JAA> | But these would go to different servers usually. |
19:30:55 | <nicolas17> | I mean feed the entire list to a single aria2 instance, and let it do only one connection per file, but multiple files at a time |
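That usage, roughly; the input filename and concurrency are placeholders:

```sh
# One connection per file, many files in flight at once; aria2c resumes
# partial downloads and shows a single consolidated progress display.
aria2c --input-file=urls.txt \
       --max-concurrent-downloads=10 \
       --max-connection-per-server=1 \
       --continue=true
```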
19:30:57 | <Nemo_bis> | randomly bump into some billionaire who gives out stupid amounts of money to random charities? https://yieldgiving.com/gifts/
19:31:14 | <fireonlive> | 'hey jeffy, you know that amazon thing?' |
19:31:18 | <fireonlive> | 'what if you gave back' |
19:31:21 | <@JAA> | The items I usually touch like this are from our DPoS projects, where there's only one big file per item, and most items are on different servers. |
19:31:21 | <Nemo_bis> | nicolas17: how is that better than wget? I've forgotten; I haven't used aria2 in perhaps a decade
19:31:40 | <nicolas17> | the progress indicator will be more readable? :D |
19:31:53 | <Nemo_bis> | fireonlive: let's hope for a second expensive divorce
19:32:04 | <Nemo_bis> | nicolas17: ah! :) I'd just use df |
19:32:26 | <fireonlive> | Nemo_bis: ah yes, one of us can try to marry him haha |
19:32:56 | <nicolas17> | and it can preallocate disk space, which can be useful to avoid HDD fragmentation |
19:33:01 | | nicolas17 tries to think of another advantage >.> |
19:33:34 | <fireonlive> | you can put a web ui on it! |
19:33:36 | <fireonlive> | lol |
19:33:45 | <fireonlive> | wait does wget support torrents? |
19:33:48 | <fireonlive> | if so TIL |
19:33:50 | <nicolas17> | no |
19:33:51 | <nicolas17> | aria2 does |
19:33:53 | <fireonlive> | ah ok |
19:34:17 | <nicolas17> | I was suggesting aria2 instead of wget for http downloads |
19:35:01 | <fireonlive> | ah :3 |
19:35:06 | <fireonlive> | i was thinking torrent+webseed |
19:35:19 | <nicolas17> | JAA: btw when doing "curl $url | zstdcat | grep" I prefer to use "wget -O -" instead, as it can resume downloads on dropped connections |
19:35:33 | <nicolas17> | (and following redirects by default is handy) |
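That variant of the streaming pipeline, assuming the same hypothetical URL and pattern as above; wget retries dropped connections on its own and follows redirects by default:

```sh
# Stream to stdout; -q silences the progress noise so only grep output remains.
wget -qO - "$url" | zstdcat | grep -a 'pattern'
```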
19:35:35 | <Nemo_bis> | fireonlive: what dataset is this? if it's something sufficiently popular you can also try https://academictorrents.com/ and advertise it a bit |
19:36:03 | <fireonlive> | curl has '-C -' for continue, but I can't remember if it can restart in the same process
19:36:17 | <Nemo_bis> | that's what I did with the https://academictorrents.com/browse.php?search=generalindex and it worked (for a few weeks while people were still excited about it) |
19:36:27 | <@JAA> | nicolas17: My WARC parsers usually fall over before I get any network issues, so resuming isn't really needed for me, but yeah, good advice nonetheless. |
19:36:30 | <fireonlive> | ah i have none myself |
19:36:41 | <@JAA> | I download from a server that basically has a line of sight to IA, for reference. |
19:36:45 | <fireonlive> | i was playing with the reddit one a while ago for purely SFW purposes |
19:37:09 | <nicolas17> | fireonlive: if you're downloading into a file on disk, and the download fails, you can run curl -C - again to resume where it left off, but yeah that won't work for the same process and/or for stdout
19:37:14 | <Nemo_bis> | I think there's a new torrent for the reddit stuff btw |
19:37:15 | <fireonlive> | ah ok |
19:37:24 | <fireonlive> | oh right; someone released one |
19:37:35 | <fireonlive> | https://pullpush.io/ seems to be going as well still |
19:37:43 | <nicolas17> | Nemo_bis: "what dataset is this?" what were you replying to? >.> |
19:37:45 | <fireonlive> | i should update my uh |
19:37:49 | <fireonlive> | ... |
19:37:59 | <fireonlive> | <that other reddit api service that died> scripts |
19:38:05 | <fireonlive> | to try out pullpush instead |
19:38:20 | <fireonlive> | how quickly we forget |
19:38:23 | <@JAA> | Pushshift? |
19:38:27 | <Nemo_bis> | fireonlive: so if the dataset you're interested in is reddit comments I could just suggest an rsync target for you, write me in private |
19:38:27 | <fireonlive> | ah yes! |
19:38:48 | <fireonlive> | Nemo_bis: :) i'm no creator myself but thanks |
19:39:18 | <Nemo_bis> | hmmmm |
19:39:43 | <fireonlive> | Skylion was doing the 1PB though |
19:56:44 | | eggdrop quits [Excess Flood] |
19:59:29 | | eggdrop (eggdrop) joins |
20:46:07 | | DopefishJustin quits [Ping timeout: 272 seconds] |
21:09:44 | | DogsRNice joins |
22:50:34 | | DopefishJustin joins |
22:50:34 | | DopefishJustin is now authenticated as DopefishJustin |