00:01:19 | | jtagcat quits [Client Quit] |
00:01:40 | | jtagcat (jtagcat) joins |
00:04:35 | | katia quits [Remote host closed the connection] |
00:04:52 | | katia (katia) joins |
01:18:53 | | eggdrop leaves |
01:24:11 | | eggdrop (eggdrop) joins |
01:24:34 | | eggdrop quits [Client Quit] |
01:27:42 | | eggdrop (eggdrop) joins |
01:38:04 | <audrooku|m> | Yeah, but I feel like that's a problem they'll have to solve at some point; I assume the issue is purely one of letting the user navigate to the specific record
01:41:20 | <@JAA> | It requires indexing the request record, and associating it with the corresponding response record. There's nothing saying there has to be a 1-to-1 correspondence between these either, and they can appear in any order. It's definitely not a trivial problem. |
01:51:45 | <audrooku|m> | Yeah, that's not what I meant to imply |
02:42:53 | | BearFortress quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.] |
03:17:03 | | BearFortress joins |
03:28:27 | | BearFortress_ joins |
03:31:53 | | BearFortress quits [Ping timeout: 272 seconds] |
04:32:02 | | DogsRNice quits [Read error: Connection reset by peer] |
06:16:40 | <nicolas17> | audrooku|m: one of the things sent in the POST request to get the file is "download purpose" https://transfer.archivete.am/inline/HMehz/samsung-opensource.png |
06:17:30 | <nicolas17> | would I need to download every file under all 7 options so that WBM playback works with all 7 options? |
06:17:59 | <nicolas17> | (in a hypothetical world where WBM supports matching requests based on POST request body) |
06:18:45 | <@JAA> | Probably, yeah. |
06:19:27 | <@JAA> | There are also ways this can quickly fall apart, e.g. if the token is dynamically calculated with JS involving the current timestamp or similar. |
06:19:59 | <nicolas17> | no, the token is sent by the server |
06:20:03 | <nicolas17> | but... it's one-time use |
06:20:14 | <nicolas17> | so if I capture a download picking the first Download Purpose |
06:20:47 | <nicolas17> | that token no longer works, and I need to get a new one (by loading the form again) if I want to re-download with the second Download Purpose, but then each capture of the form would only work for one option |
06:21:36 | <@JAA> | Ah, right, so that'll never work 'properly', fun. |
06:22:05 | <@JAA> | The solution there would be to ignore the token on indexing, but I bet that'll never happen for specific sites. |
06:22:26 | <@JAA> | At least not for something that isn't one of the largest sites on the web. |
06:22:39 | <nicolas17> | if you're gonna do site-specific hacks like that, you might as well ignore the "download purpose" parameter too :P |
06:23:50 | <@JAA> | Yeah |
06:46:18 | <audrooku|m> | You would have to allow users to search the POST requests to find the matching response |
06:46:50 | <audrooku|m> | But at that point not having it in the WBM makes sense, I suppose
06:47:47 | <@JAA> | The way the webrecorder people have solved it is by encoding the POST data somehow and yeeting it into a _postdata parameter or something like that. That way, it's just part of the URL and would naturally show up in the WBM as well. |
06:49:15 | <audrooku|m> | Sounds pretty hacky; obviously POST request data can be very large and contain binary data
06:57:47 | <@JAA> | I want to say base64 was involved to handle the latter and the URLs did get huge, but I'm not sure. |
06:57:58 | <@JAA> | And well, it's webrecorder, of course it'll be hacky. |
06:59:39 | <@JAA> | Here's the gory details, enjoy: https://github.com/webrecorder/pywb/commit/626da99899865e7f9bf9bfdd775218b36d6a2567 |
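For context, the approach in the linked pywb commit works roughly as described above: the POST body is serialized (base64 for binary payloads) and appended to the URL as extra query parameters, so the request can be indexed and replayed like a GET. A minimal shell sketch of the idea; the URL and body are made up, and the parameter names reflect pywb's convention as best remembered:

```sh
# Hypothetical request: fold the POST body into the URL so it can be
# indexed like a GET, which is what pywb does during canonicalization.
url='https://example.com/download'
body='download_purpose=personal&token=abc123'
# URL-safe base64 of the raw body, padding stripped
encoded=$(printf '%s' "$body" | base64 | tr '+/' '-_' | tr -d '=')
echo "${url}?__wb_method=POST&__wb_post_data=${encoded}"
```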
06:59:50 | <fireonlive> | gotta love webrecorder |
07:58:27 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
11:17:03 | | zhongfu (zhongfu) joins |
12:46:37 | | jodizzle quits [Remote host closed the connection] |
12:48:57 | | jodizzle (jodizzle) joins |
13:13:55 | | Arcorann quits [Ping timeout: 272 seconds] |
13:24:36 | | Barto quits [Quit: WeeChat 4.1.2] |
13:25:31 | | Barto (Barto) joins |
16:38:04 | | HP_Archivist (HP_Archivist) joins |
18:16:24 | | that_lurker quits [Quit: I am most likely running a system update] |
18:17:22 | | that_lurker (that_lurker) joins |
18:23:16 | | igloo22225 quits [Remote host closed the connection] |
18:27:58 | | igloo22225 (igloo22225) joins |
18:31:26 | | igloo22225 quits [Client Quit] |
18:41:18 | | igloo22225 (igloo22225) joins |
18:45:38 | | Skylion joins |
18:45:56 | <Skylion> | What CLI do people use for downloading torrent files from the Internet Archive? I usually use rtorrent but realized it sadly doesn't support web seeding.
18:46:36 | <nicolas17> | rtorrent not supporting webseeding is awful |
18:46:40 | <nicolas17> | try aria2c |
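A minimal aria2c invocation for an IA item's torrent; `IDENTIFIER` is a placeholder item name, and `--seed-time=0` makes aria2c exit once the download finishes instead of seeding:

```sh
# Fetch the item's torrent, then download it; aria2c can use the
# HTTP webseeds embedded in IA torrents alongside regular peers.
wget 'https://archive.org/download/IDENTIFIER/IDENTIFIER_archive.torrent'
aria2c --seed-time=0 IDENTIFIER_archive.torrent
```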
18:54:46 | | katia quits [Remote host closed the connection] |
18:55:36 | | katia (katia) joins |
19:16:24 | <Nemo_bis> | Skylion: it depends how many torrents you need to download; for thousands, I've found only deluge works |
19:17:10 | <nicolas17> | I think aria2 defaults to downloading 5 files at a time (whether torrents or plain HTTP URLs) |
19:17:19 | <nicolas17> | which you can increase but probably not too much :P |
19:17:23 | <Nemo_bis> | for tens of thousands, it's better to download with parallel wget and then let rtorrent verify the local files |
19:17:37 | <nicolas17> | Nemo_bis: downloading thousands of torrents *at the same time* would definitely kill my router |
19:17:44 | <Nemo_bis> | not mine ^_^ |
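A sketch of the bulk approach Nemo_bis describes, assuming a prepared list of file URLs (the filename and parallelism here are arbitrary): fetch the payloads with parallel wget, then let rtorrent hash-check the files on disk instead of downloading them.

```sh
# Download many files in parallel; -c resumes partial files on rerun.
xargs -P16 -n1 wget -c < file-urls.txt
# Then load the matching .torrent files into rtorrent pointed at the same
# download directory, so it verifies the local data rather than re-fetching it.
```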
19:18:13 | <Skylion> | Also, one last question. What is people's preferred method for downloading a large collection? Using the Python API is great, but the download feature is slow and not robust. Ultimately, my workflow is to download the metadata via that API, then construct an rclone union filesystem over that and use rclone to handle the download + checksumming,
19:18:13 | <Skylion> | but that seems less efficient than torrenting (download speeds are way faster) now that I have that working. For the future, what are people's preferred workflows for downloading entire collections?
19:18:31 | <Nemo_bis> | Define "large collection" |
19:18:38 | <fireonlive> | we need to get nicolas17 a new router |
19:18:41 | <nicolas17> | also, for thousands of items you can take advantage of parallel downloads yeah |
19:18:49 | <nicolas17> | but if there's a few huge items you want to download faster |
19:18:50 | <Skylion> | I did 1800 items over the summer adding up to about 1 Petabyte. |
19:19:00 | <fireonlive> | nicolas17: have you asked your ISP if they have something different since you've signed up at least |
19:19:03 | <Nemo_bis> | Ok, so they're few items of big size |
19:19:13 | <nicolas17> | the best way is to get the torrent, and ask a friend with good IA peering to also get the torrent :D |
19:19:43 | <Skylion> | That's an extreme case though. I am now just trying to download a 1.7TB dataset of about 27 items, but each item is massive. |
19:20:20 | <nicolas17> | fireonlive: in theory I could ask them to switch the modem to bridge mode (no NAT or Wi-Fi or anything) so I can connect my own Wi-Fi router behind it |
19:20:48 | <fireonlive> | might be a good idea if it keeps crashing because of nat table issues |
19:20:57 | <nicolas17> | well that's just my theory of why it crashes |
19:21:09 | <Skylion> | Example: I just tried to download the whole collection with the Python IA API, and it's slow. I just tried downloading one of the large items via torrent and it went from 500 KB/s to over 30 MB/s
19:21:22 | <Nemo_bis> | nicolas17: do you have access to check whether the router is running out of RAM? |
19:21:46 | <Nemo_bis> | Skylion: well that's nice, means that latency isn't too bad |
19:21:48 | <nicolas17> | Nemo_bis: what do you mean by "access", a shell? |
19:22:04 | <Nemo_bis> | nicolas17: for example... but sometimes it's in a GUI too |
19:22:30 | <nicolas17> | I should feel fortunate they didn't lock me out of the web interface and I can at least forward ports |
19:23:29 | <Nemo_bis> | Skylion: 30*27 MB/s is about 6.5 Gbps, can you really go faster than that?
19:24:09 | <Skylion> | Haven't tried it, but I am on a university compute cluster. The theoretical limit is 10 Gbps on the server.
19:24:14 | <nicolas17> | ohh looks like I can set bridge mode from the web UI myself, neat |
19:24:23 | <Skylion> | 30*500 KB/s was painfully slow though
19:24:43 | <Nemo_bis> | I'd expect you to hit I/O limits before you hit 10 Gbps |
19:25:13 | <Skylion> | Yeah, the NFS server usually craps out once I fill the SSD RAID buffer. Before then, it's fine.
19:25:15 | <Nemo_bis> | if you manage to maintain 30 MB/s per torrent you should consider yourself lucky enough |
19:25:26 | <Skylion> | Yeah, just orchestrating that is a bit of a pain.
19:25:27 | <nicolas17> | at those speeds the torrent client starts to matter a lot |
19:25:47 | <Skylion> | It's really just fancy web-download multiplexing, only connecting to seeds on IA |
19:25:58 | <Nemo_bis> | if you get much slower than that and you're in a hurry, you can try a digitalocean vps in SFO and download the same torrent there with deluge |
19:26:14 | <nicolas17> | sometimes you get CPU bottlenecks |
19:26:31 | <Nemo_bis> | I think I posted an example deluge config somewhere |
19:26:38 | <nicolas17> | but it's more noticeable over actual bittorrent protocol rather than webseeds |
19:27:18 | <Nemo_bis> | or maybe I didn't https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps/Torrents#Deluge_alternative |
19:27:47 | <nicolas17> | for example if I decide "actually it's better if I download this on computer B", and I start the torrent on computer B, and computer A starts seeding to it over the LAN, I'll catch rtorrent using 100% CPU |
19:27:57 | <nicolas17> | it can't keep up with the network |
19:28:36 | <Nemo_bis> | rtorrent has a bunch of single-threaded bottlenecks |
19:28:38 | <@JAA> | I usually do an `ia search ... --itemlist | xargs ia list --location | xargs -P... wget`. Most of the time, I don't actually download to disk but rather have a script that takes one (WARC) URL as an argument and then does some `curl $url | zstdcat | grep` nonsense on it. :-) |
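Spelled out with hypothetical query, parallelism, and pattern values, that pipeline looks something like the following (`-n1` keeps `ia list` to one identifier per invocation):

```sh
# Build an item list from a search, resolve each item's file URLs,
# then fetch them with 8 parallel wget processes.
ia search 'collection:examplecollection' --itemlist \
  | xargs -n1 ia list --location \
  | xargs -P8 -n1 wget
# Or stream a single WARC without writing it to disk:
curl -sL "$url" | zstdcat | grep -a 'pattern'
```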
19:28:52 | <nicolas17> | oh oh |
19:29:09 | <nicolas17> | you can also use aria2c to do parallel HTTP downloads instead of using xargs -P wget |
19:29:45 | <Nemo_bis> | ...but that doesn't help too much when each of the 5 connections is stuck waiting on the same slow disk on IA side |
19:30:05 | <Nemo_bis> | it's better to have N ongoing wget processes downloading different things; you can usually then continue with wget -c
19:30:13 | <fireonlive> | how can we move IA to all SSD storage... 🤔 |
19:30:22 | <nicolas17> | Nemo_bis: no no I don't mean 5 connections to the same server |
19:30:26 | <@JAA> | 5 connections is cute. :-) |
19:30:26 | <Nemo_bis> | Assault fort knox? |
19:30:34 | <fireonlive> | x3 |
19:30:41 | <@JAA> | But these would go to different servers usually. |
19:30:55 | <nicolas17> | I mean feed the entire list to a single aria2 instance, and let it do only one connection per file, but multiple files at a time |
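That usage, roughly; the input filename and concurrency are placeholders:

```sh
# One connection per file, many files in flight at once; aria2c resumes
# partial downloads and shows a single consolidated progress display.
aria2c --input-file=urls.txt \
       --max-concurrent-downloads=10 \
       --max-connection-per-server=1 \
       --continue=true
```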
19:30:57 | <Nemo_bis> | randomly bump into some billionaire who gives out stupid amounts of money to random charities? https://yieldgiving.com/gifts/
19:31:14 | <fireonlive> | 'hey jeffy, you know that amazon thing?' |
19:31:18 | <fireonlive> | 'what if you gave back' |
19:31:21 | <@JAA> | The items I usually touch like this are from our DPoS projects, where there's only one big file per item, and most items are on different servers. |
19:31:21 | <Nemo_bis> | nicolas17: how is that better than wget? I've forgotten; I haven't used aria2 in perhaps a decade
19:31:40 | <nicolas17> | the progress indicator will be more readable? :D |
19:31:53 | <Nemo_bis> | fireonlive: let's hope for a second expensive divorce
19:32:04 | <Nemo_bis> | nicolas17: ah! :) I'd just use df |
19:32:26 | <fireonlive> | Nemo_bis: ah yes, one of us can try to marry him haha |
19:32:56 | <nicolas17> | and it can preallocate disk space, which can be useful to avoid HDD fragmentation |
19:33:01 | | nicolas17 tries to think of another advantage >.> |
19:33:34 | <fireonlive> | you can put a web ui on it! |
19:33:36 | <fireonlive> | lol |
19:33:45 | <fireonlive> | wait does wget support torrents? |
19:33:48 | <fireonlive> | if so TIL |
19:33:50 | <nicolas17> | no |
19:33:51 | <nicolas17> | aria2 does |
19:33:53 | <fireonlive> | ah ok |
19:34:17 | <nicolas17> | I was suggesting aria2 instead of wget for http downloads |
19:35:01 | <fireonlive> | ah :3 |
19:35:06 | <fireonlive> | i was thinking torrent+webseed |
19:35:19 | <nicolas17> | JAA: btw when doing "curl $url | zstdcat | grep" I prefer to use "wget -O -" instead, as it can resume downloads on dropped connections |
19:35:33 | <nicolas17> | (and following redirects by default is handy) |
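That variant of the streaming pipeline, assuming the same hypothetical URL and pattern as above; wget retries dropped connections on its own and follows redirects by default:

```sh
# Stream to stdout; -q silences the progress noise so only grep output remains.
wget -qO - "$url" | zstdcat | grep -a 'pattern'
```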
19:35:35 | <Nemo_bis> | fireonlive: what dataset is this? if it's something sufficiently popular you can also try https://academictorrents.com/ and advertise it a bit |
19:36:03 | <fireonlive> | curl has '-C -' for continue, but I can't remember if it can restart in the same process
19:36:17 | <Nemo_bis> | that's what I did with the https://academictorrents.com/browse.php?search=generalindex and it worked (for a few weeks while people were still excited about it) |
19:36:27 | <@JAA> | nicolas17: My WARC parsers usually fall over before I get any network issues, so resuming isn't really needed for me, but yeah, good advice nonetheless. |
19:36:30 | <fireonlive> | ah i have none myself |
19:36:41 | <@JAA> | I download from a server that basically has a line of sight to IA, for reference. |
19:36:45 | <fireonlive> | i was playing with the reddit one a while ago for purely SFW purposes |
19:37:09 | <nicolas17> | fireonlive: if you're downloading into a file on disk, and the download fails, you can run curl -C - again to resume where it left off, but yeah that won't work for the same process and/or for stdout
19:37:14 | <Nemo_bis> | I think there's a new torrent for the reddit stuff btw |
19:37:15 | <fireonlive> | ah ok |
19:37:24 | <fireonlive> | oh right; someone released one |
19:37:35 | <fireonlive> | https://pullpush.io/ seems to be going as well still |
19:37:43 | <nicolas17> | Nemo_bis: "what dataset is this?" what were you replying to? >.> |
19:37:45 | <fireonlive> | i should update my uh |
19:37:49 | <fireonlive> | ... |
19:37:59 | <fireonlive> | <that other reddit api service that died> scripts |
19:38:05 | <fireonlive> | to try out pullpush instead |
19:38:20 | <fireonlive> | how quickly we forget |
19:38:23 | <@JAA> | Pushshift? |
19:38:27 | <Nemo_bis> | fireonlive: so if the dataset you're interested in is reddit comments I could just suggest an rsync target for you, write me in private |
19:38:27 | <fireonlive> | ah yes! |
19:38:48 | <fireonlive> | Nemo_bis: :) i'm no creator myself but thanks |
19:39:18 | <Nemo_bis> | hmmmm |
19:39:43 | <fireonlive> | Skylion was doing the 1PB though |
19:56:44 | | eggdrop quits [Excess Flood] |
19:59:29 | | eggdrop (eggdrop) joins |
20:46:07 | | DopefishJustin quits [Ping timeout: 272 seconds] |
21:09:44 | | DogsRNice joins |
22:50:34 | | DopefishJustin joins |
22:50:34 | | DopefishJustin is now authenticated as DopefishJustin |