00:01:55Matthww quits [Client Quit]
00:02:40Matthww joins
00:08:38phaeton quits [Client Quit]
00:08:51hlgs|m quits [Client Quit]
00:08:51qyxojzh|m quits [Client Quit]
00:09:02s-crypt|m|m quits [Client Quit]
00:09:03Nulo|m quits [Client Quit]
00:09:03mikolaj|m quits [Client Quit]
00:09:05yzqzss quits [Client Quit]
01:57:52<nicolas17>2461/3597 [3:46:59<1:25:45, 4.53s/MiB]
01:57:54<nicolas17>woe is me
02:00:46<@JAA>Can relate. I uploaded a few GB of data from a host last night and it took forever at very similar speeds.
02:01:30<@JAA>Come to think of it, I probably didn't do the TCP magic on that host, so maybe that's why.
02:09:17<imer>tcp magic has been a solid "meh" for me so far, but ymmv. maybe defaults are just better nowadays
04:05:33briankrebs quits [Remote host closed the connection]
04:05:50vinnytroia joins
04:08:57DogsRNice quits [Read error: Connection reset by peer]
04:22:00nyuuzyou quits [Client Quit]
04:22:56Vokun quits [Client Quit]
04:27:41Terbium joins
05:42:46<nicolas17>if I upload a .torrent to IA, supposedly "when that item is derived, we will instantiate a BitTorrent client (Transmission) and attempt to retrieve the Torrent"
05:43:09<nicolas17>does anyone know how long that runs? like if there's no seeders, I assume it will eventually give up?
05:47:12<@JAA>IIRC, it's an 'abort if there's not at least X% progress within Y hours' thing, and the Y is something like 24 for a small X (<5?). The total download can definitely run for at least a couple days if there's reasonable progress.
05:50:08<nicolas17>oh, that's smarter than I expected
05:50:14<nicolas17>I thought there would be a global timeout or something
05:50:35<nicolas17>also if that happens I could trigger a re-derive right?
05:55:57<@JAA>Not sure if *you* can, but I could try, or someone at IA. I think it causes the task to error out.
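[Editor's sketch] The "abort if there's not at least X% progress within Y hours" rule JAA recalls could look roughly like the following. This is not IA's actual code: the class name `StallWatchdog`, its method, and the default values (5% and 24 hours, taken from JAA's hedged recollection) are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class StallWatchdog:
    """Hypothetical progress watchdog: abort when too little was gained recently."""

    min_gain_pct: float = 5.0    # assumed X: minimum progress per window, in percent
    window_hours: float = 24.0   # assumed Y: length of the sliding window, in hours

    def __post_init__(self):
        # (hours since start, percent complete); the download starts at 0%.
        self._samples = [(0.0, 0.0)]

    def should_abort(self, hours, percent):
        """Record a progress sample and report whether the download looks stalled."""
        self._samples.append((hours, percent))
        if hours < self.window_hours:
            return False  # a full window has not elapsed yet
        cutoff = hours - self.window_hours
        # Progress at the most recent sample taken at or before the window start.
        baseline = max(
            (s for s in self._samples if s[0] <= cutoff), key=lambda s: s[0]
        )[1]
        return percent - baseline < self.min_gain_pct
```

Unlike a global timeout, a rule of this shape lets a slow but steadily progressing torrent run for days, which matches the behaviour described above.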
06:04:08<@JAA>TIL when you try to SPN something on a TLD that doesn't exist, it tells you that the 'URL syntax is not valid': https://web.archive.org/save/https://invalid.tld/
06:04:21<nicolas17>https://archive.org/details/iPad_64bit_TouchID_11.4.1_15G5077a_Restore.ipsw let's see if this works
06:04:53<nicolas17>I asked one of the apple data hoarders to seed this one (I *don't* have the actual ipsw file myself), if tomorrow I wake up to a fully uploaded item it will be great
08:09:23Doranwen quits [Ping timeout: 256 seconds]
08:21:27Doranwen (Doranwen) joins
08:39:42<@JAA>arkiver: Next one! :-P
08:39:52<@JAA>/cdx/search/cdx?url=tracker.archiveteam.org&collapse=urlkey&fl=original&matchType=domain&pageSize=100&output=json&page=0 returns 1179 results.
08:40:21<@JAA>Setting pageSize to 1 and iterating over the 9 pages, merging, and deduping yields 1184.
08:41:43<@JAA>Looks like the difference is all additional HTTPS URLs that are already in both lists as HTTP.
08:42:05<@JAA>Maybe it's just collapsing per page and returning some dupes on the page boundaries.
08:42:16<@JAA>Dupes with a different scheme, that is.
08:44:52<@JAA>I'm getting two actual dupes on pageSize=1.
08:51:01<@JAA>But ia-cdx-search finally works again. :-)
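[Editor's sketch] The merge-and-dedupe comparison JAA describes can be reproduced offline once the pages have been fetched. The sketch below assumes each page is already a list of `original` URL strings; `scheme_insensitive_key` is a hypothetical, much simplified stand-in for the CDX server's real `urlkey` canonicalisation, and is only meant to show how HTTP/HTTPS pairs inflate an exact-match count.

```python
from urllib.parse import urlsplit


def scheme_insensitive_key(url):
    """Simplified key that ignores the scheme (NOT the real CDX urlkey)."""
    parts = urlsplit(url)
    key = parts.netloc.lower() + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


def merge_pages(pages):
    """Merge result pages; return (exact URL set, first URL seen per key)."""
    exact = set()
    by_key = {}
    for page in pages:
        for url in page:
            exact.add(url)
            # Keep only the first URL seen for each scheme-insensitive key.
            by_key.setdefault(scheme_insensitive_key(url), url)
    return exact, by_key
```

If `len(exact)` exceeds `len(by_key)`, the surplus consists of URLs that differ only by scheme, which is the pattern reported for the 1184 versus 1179 results.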
09:15:13nulldata quits [Quit: So long and thanks for all the fish!]
09:16:32nulldata (nulldata) joins
09:38:20driib quits [Quit: The Lounge - https://thelounge.chat]
09:38:49driib (driib) joins
10:07:50Terbium quits [Ping timeout: 256 seconds]
12:33:32mattwright324|m joins
12:36:47igneousx (igneousx) joins
12:36:47tomodachi94 (tomodachi94) joins
12:36:47theblazehen|m joins
12:36:47thermospheric (Thermospheric) joins
12:36:47audrooku|m joins
12:36:47schwarzkatz|m joins
12:36:47DigitalDragon (DigitalDragon) joins
12:36:47Sanqui|m (Sanqui) joins
12:36:47britmob|m joins
12:36:47x9fff00 (x9fff00) joins
12:36:47Thibaultmol joins
12:36:47Vokun (Vokun) joins
12:36:47nyuuzyou joins
12:36:47yzqzss (yzqzss) joins
12:36:47Exorcism|m joins
12:36:47mikolaj|m joins
12:36:55hlgs|m joins
12:36:55s-crypt|m|m joins
12:36:55qyxojzh|m joins
12:36:55Nulo|m joins
12:36:55phaeton (phaeton) joins
12:57:57c3manu quits [Read error: Connection reset by peer]
13:11:59nicolas17 quits [Ping timeout: 256 seconds]
14:05:56c3manu (c3manu) joins
14:11:00nicolas17 joins
14:13:52<nicolas17>JAA: so far the torrent thing seems to be working (it downloaded 15% off me and stalled because that's all I have, and my friend didn't see my msg and start seeding yet)
14:32:43<nicolas17>hmm I assume IA's torrent client can't receive incoming connections
14:37:14<nicolas17>he's behind CGNAT :/ if IA's client is also behind NAT this won't work unless the data goes through a third torrent client (like mine)
14:50:27<nicolas17>ok with intermediaries this works
15:14:30<nicolas17>https://catalogd.archive.org/log/4537558939 idk what it's doing now
15:26:43<nicolas17>other than the NAT issue, this IA feature is damn flawless
15:46:11<@arkiver>nicolas17: looks like all went fine?
16:49:35<HP_Archivist>General PSA - if you ia download <identifier> --exclude-source derivative on a particular item such as an item of WARCs or WACZ (where those are source files) the exclude command will miss those.
16:50:43<HP_Archivist>*Based on behavior I'm seeing while re-running ia search --itemlist 'uploader:user@foo.com mediatype:data' | xargs -P 8 -n 1 ia download --exclude-source=derivative
16:51:36<HP_Archivist>To be super clear: by 'miss those' I mean they will be excluded and not downloaded. So, the arg does its job where it technically should not.
16:53:22<HP_Archivist>Also: To really make sure one grabs *everything* under an account, it seems like it might be best to do per mediatype: etc to be extra thorough. I have items already downloaded, but re-running mediatype: commands indicates I missed files here/there that are not derivatives
16:54:17<HP_Archivist>Which I suppose comes down to how the items are categorized / sorted in collection and mediatype, erm, sets
17:34:34Medowar quits [Ping timeout: 255 seconds]
17:38:30Medowar joins
18:02:31<@JAA>wat
18:02:55<@JAA>If that's true, that'd definitely be a bug. Those files are not source derivative.
18:06:55<HP_Archivist>I probably overlooked something on my end, which is more likely the case than it being a bug.
18:08:03<HP_Archivist>Also, running: ia search --itemlist 'uploader:foo@email.org' | xargs -P 8 -n 1 ia download --exclude-source=derivative on an account doesn't really grab *everything*
18:09:09<@JAA>I'm interested in specific examples of what's being missed. (Feel free to PM if you don't want to share publicly.)
18:09:16<HP_Archivist>Example: That arg on my account grabbed this item https://archive.org/details/epson-perfection-v850-scanner-drivers but it missed some files like dmg's and deb.tar.gz's
18:09:53<HP_Archivist>I'm fine here, JAA. Anything I do is essentially public anyway.
18:10:41<HP_Archivist>I mean, it could be a connection timeout, but I've been triple checking the cli output and didn't see anything. This morning out of curiosity I ran:
18:11:01<@JAA>`ia download -d --exclude-source derivative epson-perfection-v850-scanner-drivers` includes the .dmg and .deb.tar.gz files for me.
18:11:10<HP_Archivist>ia search --itemlist 'uploader:archivist.goals@gmail.com mediatype:data' | xargs -P 8 -n 1 ia download --exclude-source=derivative and running over that Epson item was pulling down files that it missed before (?)
18:11:22<@JAA>It returns 32 files, to be precise.
18:11:48<HP_Archivist>Maybe a connection issue then
18:11:59<@JAA>That sounds plausible, yes.
18:12:18<@JAA>The parallel processes would make the output quite unreadable.
18:12:38<@JAA>Could be fixed by logging each process to its own file, but that's slightly messy.
18:12:47<HP_Archivist>Yeah I have to make the font pretty small to get a full view of the output
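[Editor's sketch] One way to do the "log each process to its own file" idea is a small wrapper instead of `xargs -P 8`. The function names and the `logs` directory layout are hypothetical; the `ia download --exclude-source derivative` invocation is the one quoted in this log, and treating a non-zero exit status as a failed download is an assumption.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def ia_download_command(identifier):
    """Command line as used in the chat above."""
    return ["ia", "download", "--exclude-source", "derivative", identifier]


def run_logged(identifier, log_dir, build_command):
    """Run one download, sending stdout and stderr to <log_dir>/<identifier>.log."""
    log_path = Path(log_dir) / f"{identifier}.log"
    with open(log_path, "wb") as log:
        result = subprocess.run(
            build_command(identifier), stdout=log, stderr=subprocess.STDOUT
        )
    return identifier, result.returncode


def download_all(identifiers, log_dir, build_command=ia_download_command, workers=8):
    """Run downloads in parallel; return {identifier: exit status}."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda i: run_logged(i, log_dir, build_command), identifiers
        )
        return dict(results)
```

Calling `download_all(identifiers, "logs")` and then listing the identifiers with a non-zero status would give a short retry list without reading interleaved terminal output.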
18:13:34<HP_Archivist>The WACZ/WARC issue was from this item: https://archive.org/download/symphony-of-science-facebook-fan-page-warc-wacz-2010-2023
18:13:54<HP_Archivist>Again, downloaded most, but it missed a few when running mediatype:data on it
18:14:15<HP_Archivist>Erm, I mean it picked up once that were missed previously*
18:14:20<HP_Archivist>ones
18:16:56<@JAA>Hmm, not sure about _datapackage.json, but otherwise, the output is what I would expect.
18:17:34<@JAA>I had forgotten that the item CDX and IDX are considered original.
18:17:47<@JAA>But that'd download too much, not too little.
18:19:11<HP_Archivist>Probably something on my end like connection timeout or something then
18:20:22<HP_Archivist>I wish IA offered a service where I could get a drive shipped to me full of the data I uploaded. I'd pay for that :P
18:20:39<@JAA>I guess the _datapackage.json is pulled out of the WACZ or something.
18:21:16<HP_Archivist>Yeah, I used webrecorded to create those
18:21:23<HP_Archivist>webrecorder*
18:23:07<HP_Archivist>From now on though I'm going to not only use the former args, but also:
18:23:25<HP_Archivist>mediatype: args, too, just so I don't miss anything
18:23:43<HP_Archivist>I remember we talked before about how 'audio' items were different from 'community_audio' iirc
18:24:20<HP_Archivist>Don't necessarily trust that one would grab everything over the other
18:31:23<@JAA>I don't think additional filters (like mediatype) should make any difference at all.
18:32:03<@JAA>As in, the list produced by `ia search --itemlist 'uploader:archivist.goals@gmail.com'` should be identical to the combined list from runs with all mediatypes.
18:33:12<@JAA>You said it was missing individual files, not entire items.
18:33:19<@JAA>The search command would be irrelevant for the former.
18:33:53<HP_Archivist>Correct - not missing items, but individual files
18:33:57<HP_Archivist>Hm
18:34:10<HP_Archivist>Then it's definitely connection related
18:34:17<@JAA>Yeah, certainly sounds like it.
18:35:23Dango360 quits [Read error: Connection reset by peer]
18:35:56<HP_Archivist>Welp, alrighty then. I'm probably running too many parallel connections in too many cli windows heh
18:36:45<HP_Archivist>I'm glad it's not a bug
18:49:32DogsRNice joins
19:13:40katia quits [Read error: Connection reset by peer]
19:14:12kokos quits [Read error: Connection reset by peer]
19:15:20Dango360 (Dango360) joins
19:56:11kokos joins
20:04:15katia_ quits [Quit: ZNC - https://znc.in]
20:08:19katia_ (katia) joins
20:09:20katia (katia) joins
20:24:21corentin quits [Ping timeout: 256 seconds]
20:28:55katia is now known as katia__
20:38:15corentin joins