#internetarchive log for 2024-09-13


00:01:55		Matthww quits [Client Quit]
00:02:40		Matthww joins
00:08:38		phaeton quits [Client Quit]
00:08:51		hlgs|m quits [Client Quit]
00:08:51		qyxojzh|m quits [Client Quit]
00:09:02		s-crypt|m|m quits [Client Quit]
00:09:03		Nulo|m quits [Client Quit]
00:09:03		mikolaj|m quits [Client Quit]
00:09:05		yzqzss quits [Client Quit]
01:57:52	<nicolas17>	2461/3597 [3:46:59<1:25:45, 4.53s/MiB]
01:57:54	<nicolas17>	woe is me
02:00:46	<@JAA>	Can relate. I uploaded a few GB of data from a host last night and it took forever at very similar speeds.
02:01:30	<@JAA>	Come to think of it, I probably didn't do the TCP magic on that host, so maybe that's why.
02:09:17	<imer>	tcp magic has been a solid "meh" for me so far, but ymmv. maybe defaults are just better nowadays
04:05:33		briankrebs quits [Remote host closed the connection]
04:05:50		vinnytroia joins
04:08:57		DogsRNice quits [Read error: Connection reset by peer]
04:22:00		nyuuzyou quits [Client Quit]
04:22:56		Vokun quits [Client Quit]
04:27:41		Terbium joins
05:42:46	<nicolas17>	if I upload a .torrent to IA, supposedly "when that item is derived, we will instantiate a BitTorrent client (Transmission) and attempt to retrieve the Torrent"
05:43:09	<nicolas17>	does anyone know how long that runs? like if there's no seeders, I assume it will eventually give up?
05:47:12	<@JAA>	IIRC, it's an 'abort if there's not at least X% progress within Y hours' thing, and the Y is something like 24 for a small X (<5?). The total download can definitely run for at least a couple days if there's reasonable progress.
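
The "abort if there's not at least X% progress within Y hours" rule described above can be sketched as follows. This is a minimal illustration, not IA's actual derive code: `ProgressSample`, `should_abort`, and the default thresholds (5% within 24 hours) are hypothetical names and values chosen to match the rough figures recalled in the message.

```python
from dataclasses import dataclass


@dataclass
class ProgressSample:
    hours_elapsed: float  # time since the download started
    percent_done: float   # 0.0 to 100.0


def should_abort(samples, min_gain_percent=5.0, window_hours=24.0):
    """Return True if progress over the last window is below the minimum gain.

    `samples` must be sorted by hours_elapsed. The thresholds are assumed
    values for illustration only.
    """
    if not samples:
        return False
    latest = samples[-1]
    if latest.percent_done >= 100.0:
        return False  # finished, nothing to abort
    if latest.hours_elapsed < window_hours:
        return False  # not enough history to judge yet
    window_start = latest.hours_elapsed - window_hours
    # Use the most recent sample taken at or before the start of the window.
    baseline = None
    for sample in samples:
        if sample.hours_elapsed <= window_start:
            baseline = sample
        else:
            break
    if baseline is None:
        return False
    return latest.percent_done - baseline.percent_done < min_gain_percent
```

Unlike a global timeout, a rule of this shape lets a slow but steadily progressing download keep running for days.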
05:50:08	<nicolas17>	oh, that's smarter than I expected
05:50:14	<nicolas17>	I thought there would be a global timeout or something
05:50:35	<nicolas17>	also if that happens I could trigger a re-derive right?
05:55:57	<@JAA>	Not sure if you can, but I could try, or someone at IA. I think it causes the task to error out.
06:04:08	<@JAA>	TIL when you try to SPN something on a TLD that doesn't exist, it tells you that the 'URL syntax is not valid': https://web.archive.org/save/https://invalid.tld/
06:04:21	<nicolas17>	https://archive.org/details/iPad_64bit_TouchID_11.4.1_15G5077a_Restore.ipsw let's see if this works
06:04:53	<nicolas17>	I asked one of the apple data hoarders to seed this one (I don't have the actual ipsw file myself), if tomorrow I wake up to a fully uploaded item it will be great
08:09:23		Doranwen quits [Ping timeout: 256 seconds]
08:21:27		Doranwen (Doranwen) joins
08:39:42	<@JAA>	arkiver: Next one! :-P
08:39:52	<@JAA>	/cdx/search/cdx?url=tracker.archiveteam.org&collapse=urlkey&fl=original&matchType=domain&pageSize=100&output=json&page=0 returns 1179 results.
08:40:21	<@JAA>	Setting pageSize to 1 and iterating over the 9 pages, merging, and deduping yields 1184.
08:41:43	<@JAA>	Looks like the difference is all additional HTTPS URLs that are already in both lists as HTTP.
08:42:05	<@JAA>	Maybe it's just collapsing per page and returning some dupes on the page boundaries.
08:42:16	<@JAA>	Dupes with a different scheme, that is.
08:44:52	<@JAA>	I'm getting two actual dupes on pageSize=1.
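
The merge-and-dedupe check described above can be sketched like this. It works on already-parsed JSON pages and makes two assumptions: that each `output=json` page is a list of rows whose first row is a header, and that dropping the scheme is a good enough stand-in for the real `urlkey` canonicalisation (it is only an approximation). The function names are hypothetical.

```python
from urllib.parse import urlsplit


def merge_pages(pages):
    """Merge parsed CDX JSON pages into one list of unique URLs, in order."""
    seen = set()
    merged = []
    for rows in pages:
        for row in rows[1:]:  # assumed: row 0 of every page is the header
            url = row[0]      # only one field requested (fl=original)
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


def scheme_insensitive_key(url):
    """Rough approximation of a collapse key: ignore http vs https."""
    parts = urlsplit(url)
    return (parts.netloc.lower(), parts.path, parts.query)


def cross_scheme_dupes(urls):
    """Group URLs that differ only by scheme; return groups with more than one URL."""
    by_key = {}
    for url in urls:
        by_key.setdefault(scheme_insensitive_key(url), []).append(url)
    return [group for group in by_key.values() if len(group) > 1]
```

Running `cross_scheme_dupes(merge_pages(pages))` on the small-page results would list the HTTP/HTTPS pairs that a per-page collapse lets through at page boundaries.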
08:51:01	<@JAA>	But ia-cdx-search finally works again. :-)
09:15:13		nulldata quits [Quit: So long and thanks for all the fish!]
09:16:32		nulldata (nulldata) joins
09:38:20		driib quits [Quit: The Lounge - https://thelounge.chat]
09:38:49		driib (driib) joins
10:07:50		Terbium quits [Ping timeout: 256 seconds]
12:33:32		mattwright324|m joins
12:36:47		igneousx (igneousx) joins
12:36:47		tomodachi94 (tomodachi94) joins
12:36:47		theblazehen|m joins
12:36:47		thermospheric (Thermospheric) joins
12:36:47		audrooku|m joins
12:36:47		schwarzkatz|m joins
12:36:47		DigitalDragon (DigitalDragon) joins
12:36:47		Sanqui|m (Sanqui) joins
12:36:47		britmob|m joins
12:36:47		x9fff00 (x9fff00) joins
12:36:47		Thibaultmol joins
12:36:47		Vokun (Vokun) joins
12:36:47		nyuuzyou joins
12:36:47		yzqzss (yzqzss) joins
12:36:47		Exorcism|m joins
12:36:47		mikolaj|m joins
12:36:55		hlgs|m joins
12:36:55		s-crypt|m|m joins
12:36:55		qyxojzh|m joins
12:36:55		Nulo|m joins
12:36:55		phaeton (phaeton) joins
12:57:57		c3manu quits [Read error: Connection reset by peer]
13:11:59		nicolas17 quits [Ping timeout: 256 seconds]
14:05:56		c3manu (c3manu) joins
14:11:00		nicolas17 joins
14:13:52	<nicolas17>	JAA: so far the torrent thing seems to be working (it downloaded 15% off me and stalled because that's all I have, and my friend didn't see my msg and start seeding yet)
14:32:43	<nicolas17>	hmm I assume IA's torrent client can't receive incoming connections
14:37:14	<nicolas17>	he's behind CGNAT :/ if IA's client is also behind NAT this won't work unless the data goes through a third torrent client (like mine)
14:50:27	<nicolas17>	ok with intermediaries this works
15:14:30	<nicolas17>	https://catalogd.archive.org/log/4537558939 idk what it's doing now
15:26:43	<nicolas17>	other than the NAT issue, this IA feature is damn flawless
15:46:11	<@arkiver>	nicolas17: looks like all went fine?
16:19:57		nicolas17 is now authenticated as nicolas17
16:49:35	<HP_Archivist>	General PSA - if you ia download <identifier> --exclude-source derivative on a particular item such as an item of WARCs or WACZ (where those are source files) the exclude command will miss those.
16:50:43	<HP_Archivist>	*Based on behavior I'm seeing while re-running ia search --itemlist 'uploader:user@foo.com mediatype:data' | xargs -P 8 -n 1 ia download --exclude-source=derivative
16:51:36	<HP_Archivist>	To be super clear: by 'miss those' I mean they will be excluded and not downloaded. So, the arg does its job where it technically should not.
16:53:22	<HP_Archivist>	Also: To really make sure one grabs everything under an account, it seems like it might be best to do per mediatype: etc to be extra thorough. I have items already downloaded, but re-running mediatype: commands indicates I missed files here/there that are not derivatives
16:54:17	<HP_Archivist>	Which I suppose comes down to how the items are categorized / sorted in collection and mediatype, erm, sets
17:01:14		nyuuzyou is now authenticated as nyuuzyou
17:34:34		Medowar quits [Ping timeout: 255 seconds]
17:38:30		Medowar joins
18:02:31	<@JAA>	wat
18:02:55	<@JAA>	If that's true, that'd definitely be a bug. Those files are not source derivative.
18:06:55	<HP_Archivist>	I probably overlooked something on my end, which is more likely the case than it being a bug.
18:08:03	<HP_Archivist>	Also, running: ia search --itemlist 'uploader:foo@email.org' | xargs -P 8 -n 1 ia download --exclude-source=derivative on an account doesn't really grab everything
18:09:09	<@JAA>	I'm interested in specific examples of what's being missed. (Feel free to PM if you don't want to share publicly.)
18:09:16	<HP_Archivist>	Example: That arg on my account grabbed this item https://archive.org/details/epson-perfection-v850-scanner-drivers but it missed some files like dmg's and deb.tar.gz's
18:09:53	<HP_Archivist>	I'm fine here, JAA. Anything I do is essentially public anyway.
18:10:41	<HP_Archivist>	I mean, it could be a connection timeout, but I've been triple checking the cli output and didn't see anything. This morning out of curiosity I ran:
18:11:01	<@JAA>	`ia download -d --exclude-source derivative epson-perfection-v850-scanner-drivers` includes the .dmg and .deb.tar.gz files for me.
18:11:10	<HP_Archivist>	ia search --itemlist 'uploader:archivist.goals@gmail.com mediatype:data' | xargs -P 8 -n 1 ia download --exclude-source=derivative and running over that Epson item was pulling down files that it missed before (?)
18:11:22	<@JAA>	It returns 32 files, to be precise.
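
One way to check a finished download against what `--exclude-source derivative` should have fetched is to compare the item's file list with the local directory. The sketch below assumes the file list has already been loaded (for example from the `files` array of the item's metadata JSON, where each entry has a `name` and a `source` of `original`, `derivative`, or `metadata`). The function names are hypothetical, and the filter is only an approximation of what the `ia` tool selects.

```python
from pathlib import Path


def non_derivative_names(files):
    """Names of files whose source is anything other than 'derivative'."""
    return [f["name"] for f in files if f.get("source") != "derivative"]


def missing_locally(files, item_dir):
    """Non-derivative files from the item's file list that are absent on disk."""
    item_dir = Path(item_dir)
    return [
        name
        for name in non_derivative_names(files)
        if not (item_dir / name).is_file()
    ]
```

An empty result from `missing_locally` means every expected file exists locally. It does not verify sizes or checksums, so a truncated file would still pass.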
18:11:48	<HP_Archivist>	Maybe a connection issue then
18:11:59	<@JAA>	That sounds plausible, yes.
18:12:18	<@JAA>	The parallel processes would make the output quite unreadable.
18:12:38	<@JAA>	Could be fixed by logging each process to its own file, but that's slightly messy.
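
A minimal sketch of the "log each process to its own file" idea, using Python in place of `xargs -P 8`. The `ia download --exclude-source=derivative <identifier>` command line is the one quoted in this log. The helper names, the log file naming, and the `make_command` hook (there so the runner can be tried without the `ia` tool installed) are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def ia_download_command(identifier):
    """Command line as quoted in the log, for a single item."""
    return ["ia", "download", "--exclude-source=derivative", identifier]


def run_logged(identifier, log_dir, make_command=ia_download_command):
    """Run one download, sending stdout and stderr to <log_dir>/<identifier>.log."""
    log_path = Path(log_dir) / f"{identifier}.log"
    with open(log_path, "wb") as log:
        result = subprocess.run(
            make_command(identifier), stdout=log, stderr=subprocess.STDOUT
        )
    return identifier, result.returncode


def download_all(identifiers, log_dir, workers=8, make_command=ia_download_command):
    """Run downloads in parallel; return a dict of identifier -> exit code."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda ident: run_logged(ident, log_dir, make_command), identifiers
        )
        return dict(results)
```

Each item's output then stays readable on its own, and any non-zero exit code in the returned dict points at the log file worth inspecting.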
18:12:47	<HP_Archivist>	Yeah I have to make the font pretty small to get a full view of the output
18:13:34	<HP_Archivist>	The WACZ/WARC issue was from this item: https://archive.org/download/symphony-of-science-facebook-fan-page-warc-wacz-2010-2023
18:13:54	<HP_Archivist>	Again, downloaded most, but it missed a few when running mediatype:data on it
18:14:15	<HP_Archivist>	Erm, I mean it picked up once that were missed previously*
18:14:20	<HP_Archivist>	ones
18:16:56	<@JAA>	Hmm, not sure about _datapackage.json, but otherwise, the output is what I would expect.
18:17:34	<@JAA>	I had forgotten that the item CDX and IDX are considered original.
18:17:47	<@JAA>	But that'd download too much, not too little.
18:19:11	<HP_Archivist>	Probably something on my end like connection timeout or something then
18:20:22	<HP_Archivist>	I wish IA offered a service where I could get a drive shipped to me full of the data I uploaded. I'd pay for that :P
18:20:39	<@JAA>	I guess the _datapackage.json is pulled out of the WACZ or something.
18:21:16	<HP_Archivist>	Yeah, I used webrecorded to create those
18:21:23	<HP_Archivist>	webrecorder*
18:23:07	<HP_Archivist>	From now on though I'm going to not only use the former args, but also:
18:23:25	<HP_Archivist>	mediatype: args, too, just so I don't miss anything
18:23:43	<HP_Archivist>	I remember we talked about before how 'audio' items were different from 'community_audio' iirc
18:24:20	<HP_Archivist>	Don't necessarily trust that one would grab everything over the other
18:31:23	<@JAA>	I don't think additional filters (like mediatype) should make any difference at all.
18:32:03	<@JAA>	As in, the list produced by `ia search --itemlist 'uploader:archivist.goals@gmail.com'` should be identical to the combined list from runs with all mediatypes.
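
The claim that the unfiltered item list should equal the combined per-mediatype lists can be checked with a set comparison. This sketch assumes the identifier lists have already been collected, for example from `ia search --itemlist` output. `compare_itemlists` is a hypothetical helper.

```python
def compare_itemlists(full_list, per_mediatype_lists):
    """Compare the unfiltered item list with the union of per-mediatype lists.

    `per_mediatype_lists` maps a mediatype name to an iterable of identifiers.
    Returns (only_in_full, only_in_combined), both sorted.
    """
    full = set(full_list)
    combined = set().union(*per_mediatype_lists.values())
    return sorted(full - combined), sorted(combined - full)
```

If both returned lists are empty, the extra `mediatype:` filters add nothing at the item level, and any missing files must come from the download step instead.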
18:33:12	<@JAA>	You said it was missing individual files, not entire items.
18:33:19	<@JAA>	The search command would be irrelevant for the former.
18:33:53	<HP_Archivist>	Correct - not missing items, but individual files
18:33:57	<HP_Archivist>	Hm
18:34:10	<HP_Archivist>	Then it's definitely connection related
18:34:17	<@JAA>	Yeah, certainly sounds like it.
18:35:23		Dango360 quits [Read error: Connection reset by peer]
18:35:56	<HP_Archivist>	Welp, alrighty then. I'm probably running too many parallel connections in too many cli windows heh
18:36:45	<HP_Archivist>	I'm glad it's not a bug
18:49:32		DogsRNice joins
19:13:40		katia quits [Read error: Connection reset by peer]
19:14:12		kokos quits [Read error: Connection reset by peer]
19:15:20		Dango360 (Dango360) joins
19:56:11		kokos joins
20:04:15		katia_ quits [Quit: ZNC - https://znc.in]
20:08:19		katia_ (katia) joins
20:09:20		katia (katia) joins
20:24:21		corentin quits [Ping timeout: 256 seconds]
20:28:55		katia is now known as katia__
20:38:15		corentin joins
