00:01:55 | | Matthww quits [Client Quit] |
00:02:40 | | Matthww joins |
00:08:38 | | phaeton quits [Client Quit] |
00:08:51 | | hlgs|m quits [Client Quit] |
00:08:51 | | qyxojzh|m quits [Client Quit] |
00:09:02 | | s-crypt|m|m quits [Client Quit] |
00:09:03 | | Nulo|m quits [Client Quit] |
00:09:03 | | mikolaj|m quits [Client Quit] |
00:09:05 | | yzqzss quits [Client Quit] |
01:57:52 | <nicolas17> | 2461/3597 [3:46:59<1:25:45, 4.53s/MiB] |
01:57:54 | <nicolas17> | woe is me |
02:00:46 | <@JAA> | Can relate. I uploaded a few GB of data from a host last night and it took forever at very similar speeds. |
02:01:30 | <@JAA> | Come to think of it, I probably didn't do the TCP magic on that host, so maybe that's why. |
02:09:17 | <imer> | tcp magic has been a solid "meh" for me so far, but ymmv. maybe defaults are just better nowadays |
04:05:33 | | briankrebs quits [Remote host closed the connection] |
04:05:50 | | vinnytroia joins |
04:08:57 | | DogsRNice quits [Read error: Connection reset by peer] |
04:22:00 | | nyuuzyou quits [Client Quit] |
04:22:56 | | Vokun quits [Client Quit] |
04:27:41 | | Terbium joins |
05:42:46 | <nicolas17> | if I upload a .torrent to IA, supposedly "when that item is derived, we will instantiate a BitTorrent client (Transmission) and attempt to retrieve the Torrent" |
05:43:09 | <nicolas17> | does anyone know how long that runs? like if there's no seeders, I assume it will eventually give up? |
05:47:12 | <@JAA> | IIRC, it's an 'abort if there's not at least X% progress within Y hours' thing, and the Y is something like 24 for a small X (<5?). The total download can definitely run for at least a couple days if there's reasonable progress. |
05:50:08 | <nicolas17> | oh, that's smarter than I expected |
05:50:14 | <nicolas17> | I thought there would be a global timeout or something |
05:50:35 | <nicolas17> | also if that happens I could trigger a re-derive right? |
05:55:57 | <@JAA> | Not sure if *you* can, but I could try, or someone at IA. I think it causes the task to error out. |
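
The "abort if there's not at least X% progress within Y hours" rule described above can be sketched as follows. This is only an illustration of a sliding-window stall check, not IA's actual implementation; the function name, the sample format, and the default values (5% and 24 hours, taken loosely from the "IIRC" figures above) are all assumptions.

```python
def should_abort(samples, min_progress_pct=5.0, window_hours=24.0):
    """Decide whether a download looks stalled.

    samples: list of (hours_elapsed, percent_done) tuples, oldest first.
    Returns True when progress over the last `window_hours` is below
    `min_progress_pct`. Both thresholds are hypothetical values.
    """
    if not samples:
        return False
    now_h, now_pct = samples[-1]
    if now_pct >= 100.0:
        return False  # finished, nothing to abort
    if now_h < window_hours:
        return False  # not enough history to judge yet

    # Use the latest sample taken at or before the start of the window
    # as the baseline to compare against.
    cutoff = now_h - window_hours
    baseline = None
    for hours, pct in samples:
        if hours <= cutoff:
            baseline = pct
        else:
            break
    if baseline is None:
        return False  # no sample old enough to compare against
    return (now_pct - baseline) < min_progress_pct
```

Because the check only looks at the most recent window, a slow but steadily progressing download can keep running for days, which matches the behaviour described above; there is no global timeout in this sketch.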
06:04:08 | <@JAA> | TIL when you try to SPN something on a TLD that doesn't exist, it tells you that the 'URL syntax is not valid': https://web.archive.org/save/https://invalid.tld/ |
06:04:21 | <nicolas17> | https://archive.org/details/iPad_64bit_TouchID_11.4.1_15G5077a_Restore.ipsw let's see if this works |
06:04:53 | <nicolas17> | I asked one of the apple data hoarders to seed this one (I *don't* have the actual ipsw file myself), if tomorrow I wake up to a fully uploaded item it will be great |
08:09:23 | | Doranwen quits [Ping timeout: 256 seconds] |
08:21:27 | | Doranwen (Doranwen) joins |
08:39:42 | <@JAA> | arkiver: Next one! :-P |
08:39:52 | <@JAA> | /cdx/search/cdx?url=tracker.archiveteam.org&collapse=urlkey&fl=original&matchType=domain&pageSize=100&output=json&page=0 returns 1179 results. |
08:40:21 | <@JAA> | Setting pageSize to 1 and iterating over the 9 pages, merging, and deduping yields 1184. |
08:41:43 | <@JAA> | Looks like the difference is all additional HTTPS URLs that are already in both lists as HTTP. |
08:42:05 | <@JAA> | Maybe it's just collapsing per page and returning some dupes on the page boundaries. |
08:42:16 | <@JAA> | Dupes with a different scheme, that is. |
08:44:52 | <@JAA> | I'm getting two actual dupes on pageSize=1. |
08:51:01 | <@JAA> | But ia-cdx-search finally works again. :-) |
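
A client-side workaround for the per-page collapsing described above might look like the sketch below: merge all pages and dedupe on a scheme-insensitive key. It assumes each page has already been fetched and parsed from `output=json` into a list of rows, that a page may start with a header row such as `["original"]`, and that the URL is the first field. The helper names are hypothetical, and the key used here is only a crude stand-in for the real `urlkey`.

```python
from urllib.parse import urlsplit


def scheme_insensitive_key(url):
    """Rough approximation of a urlkey: drop the scheme, lowercase the host."""
    parts = urlsplit(url)
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.netloc.lower()}{parts.path or '/'}{query}"


def merge_pages(pages, header=("original",)):
    """Merge CDX result pages, keeping the first URL seen for each key.

    pages: iterable of pages, each a list of rows (lists of strings).
    Rows equal to `header` and empty rows are skipped.
    """
    seen = set()
    merged = []
    for page in pages:
        for row in page:
            if not row or tuple(row) == tuple(header):
                continue
            url = row[0]
            key = scheme_insensitive_key(url)
            if key not in seen:
                seen.add(key)
                merged.append(url)
    return merged
```

With this, an HTTPS URL that shows up on a later page after its HTTP twin has already been seen is dropped, so the merged count should not depend on `pageSize`.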
09:15:13 | | nulldata quits [Quit: So long and thanks for all the fish!] |
09:16:32 | | nulldata (nulldata) joins |
09:38:20 | | driib quits [Quit: The Lounge - https://thelounge.chat] |
09:38:49 | | driib (driib) joins |
10:07:50 | | Terbium quits [Ping timeout: 256 seconds] |
12:33:32 | | mattwright324|m joins |
12:36:47 | | igneousx (igneousx) joins |
12:36:47 | | tomodachi94 (tomodachi94) joins |
12:36:47 | | theblazehen|m joins |
12:36:47 | | thermospheric (Thermospheric) joins |
12:36:47 | | audrooku|m joins |
12:36:47 | | schwarzkatz|m joins |
12:36:47 | | DigitalDragon (DigitalDragon) joins |
12:36:47 | | Sanqui|m (Sanqui) joins |
12:36:47 | | britmob|m joins |
12:36:47 | | x9fff00 (x9fff00) joins |
12:36:47 | | Thibaultmol joins |
12:36:47 | | Vokun (Vokun) joins |
12:36:47 | | nyuuzyou joins |
12:36:47 | | yzqzss (yzqzss) joins |
12:36:47 | | Exorcism|m joins |
12:36:47 | | mikolaj|m joins |
12:36:55 | | hlgs|m joins |
12:36:55 | | s-crypt|m|m joins |
12:36:55 | | qyxojzh|m joins |
12:36:55 | | Nulo|m joins |
12:36:55 | | phaeton (phaeton) joins |
12:57:57 | | c3manu quits [Read error: Connection reset by peer] |
13:11:59 | | nicolas17 quits [Ping timeout: 256 seconds] |
14:05:56 | | c3manu (c3manu) joins |
14:11:00 | | nicolas17 joins |
14:13:52 | <nicolas17> | JAA: so far the torrent thing seems to be working (it downloaded 15% off me and stalled because that's all I have, and my friend didn't see my msg and start seeding yet) |
14:32:43 | <nicolas17> | hmm I assume IA's torrent client can't receive incoming connections |
14:37:14 | <nicolas17> | he's behind CGNAT :/ if IA's client is also behind NAT this won't work unless the data goes through a third torrent client (like mine) |
14:50:27 | <nicolas17> | ok with intermediaries this works |
15:14:30 | <nicolas17> | https://catalogd.archive.org/log/4537558939 idk what it's doing now |
15:26:43 | <nicolas17> | other than the NAT issue, this IA feature is damn flawless |
15:46:11 | <@arkiver> | nicolas17: looks like all went fine? |
16:19:57 | | nicolas17 is now authenticated as nicolas17 |
16:49:35 | <HP_Archivist> | General PSA - if you ia download <identifier> --exclude-source derivative on a particular item such as an item of WARCs or WACZ (where those are source files) the exclude command will miss those. |
16:50:43 | <HP_Archivist> | *Based on behavior I'm seeing while re-running ia search --itemlist 'uploader:user@foo.com mediatype:data' | xargs -P 8 -n 1 ia download --exclude-source=derivative |
16:51:36 | <HP_Archivist> | To be super clear: by 'miss those' I mean they will be excluded and not downloaded. So, the arg does its job where it technically should not. |
16:53:22 | <HP_Archivist> | Also: To really make sure one grabs *everything* under an account, it seems like it might be best to do per mediatype: etc to be extra thorough. I have items already downloaded, but re-running mediatype: commands indicates I missed files here/there that are not derivatives |
16:54:17 | <HP_Archivist> | Which I suppose comes down to how the items are categorized / sorted in collection and mediatype, erm, sets |
17:01:14 | | nyuuzyou is now authenticated as nyuuzyou |
17:34:34 | | Medowar quits [Ping timeout: 255 seconds] |
17:38:30 | | Medowar joins |
18:02:31 | <@JAA> | wat |
18:02:55 | <@JAA> | If that's true, that'd definitely be a bug. Those files are not source derivative. |
18:06:55 | <HP_Archivist> | I probably overlooked something on my end, which is more likely the case than it being a bug. |
18:08:03 | <HP_Archivist> | Also, running: ia search --itemlist 'uploader:foo@email.org' | xargs -P 8 -n 1 ia download --exclude-source=derivative on an account doesn't really grab *everything* |
18:09:09 | <@JAA> | I'm interested in specific examples of what's being missed. (Feel free to PM if you don't want to share publicly.) |
18:09:16 | <HP_Archivist> | Example: That arg on my account grabbed this item https://archive.org/details/epson-perfection-v850-scanner-drivers but it missed some files like dmg's and deb.tar.gz's |
18:09:53 | <HP_Archivist> | I'm fine here, JAA. Anything I do is essentially public anyway. |
18:10:41 | <HP_Archivist> | I mean, it could be a connection timeout, but I've been triple checking the cli output and didn't see anything. This morning out of curiosity I ran: |
18:11:01 | <@JAA> | `ia download -d --exclude-source derivative epson-perfection-v850-scanner-drivers` includes the .dmg and .deb.tar.gz files for me. |
18:11:10 | <HP_Archivist> | ia search --itemlist 'uploader:archivist.goals@gmail.com mediatype:data' | xargs -P 8 -n 1 ia download --exclude-source=derivative and running over that Epson item was pulling down files that it missed before (?) |
18:11:22 | <@JAA> | It returns 32 files, to be precise. |
18:11:48 | <HP_Archivist> | Maybe a connection issue then |
18:11:59 | <@JAA> | That sounds plausible, yes. |
18:12:18 | <@JAA> | The parallel processes would make the output quite unreadable. |
18:12:38 | <@JAA> | Could be fixed by logging each process to its own file, but that's slightly messy. |
18:12:47 | <HP_Archivist> | Yeah I have to make the font pretty small to get a full view of the output |
18:13:34 | <HP_Archivist> | The WACZ/WARC issue was from this item: https://archive.org/download/symphony-of-science-facebook-fan-page-warc-wacz-2010-2023 |
18:13:54 | <HP_Archivist> | Again, downloaded most, but it missed a few when running mediatype:data on it |
18:14:15 | <HP_Archivist> | Erm, I mean it picked up ones that were missed previously |
18:16:56 | <@JAA> | Hmm, not sure about _datapackage.json, but otherwise, the output is what I would expect. |
18:17:34 | <@JAA> | I had forgotten that the item CDX and IDX are considered original. |
18:17:47 | <@JAA> | But that'd download too much, not too little. |
18:19:11 | <HP_Archivist> | Probably something on my end like connection timeout or something then |
18:20:22 | <HP_Archivist> | I wish IA offered a service where I could get a drive shipped to me full of the data I uploaded. I'd pay for that :P |
18:20:39 | <@JAA> | I guess the _datapackage.json is pulled out of the WACZ or something. |
18:21:16 | <HP_Archivist> | Yeah, I used webrecorder to create those |
18:23:07 | <HP_Archivist> | From now on though I'm going to not only use the former args, but also: |
18:23:25 | <HP_Archivist> | mediatype: args, too, just so I don't miss anything |
18:23:43 | <HP_Archivist> | I remember we talked before about how 'audio' items were different from 'community_audio' iirc |
18:24:20 | <HP_Archivist> | Don't necessarily trust that one would grab everything over the other |
18:31:23 | <@JAA> | I don't think additional filters (like mediatype) should make any difference at all. |
18:32:03 | <@JAA> | As in, the list produced by `ia search --itemlist 'uploader:archivist.goals@gmail.com'` should be identical to the combined list from runs with all mediatypes. |
18:33:12 | <@JAA> | You said it was missing individual files, not entire items. |
18:33:19 | <@JAA> | The search command would be irrelevant for the former. |
18:33:53 | <HP_Archivist> | Correct - not missing items, but individual files |
18:33:57 | <HP_Archivist> | Hm |
18:34:10 | <HP_Archivist> | Then it's definitely connection related |
18:34:17 | <@JAA> | Yeah, certainly sounds like it. |
18:35:23 | | Dango360 quits [Read error: Connection reset by peer] |
18:35:56 | <HP_Archivist> | Welp, alrighty then. I'm probably running too many parallel connections in too many cli windows heh |
18:36:45 | <HP_Archivist> | I'm glad it's not a bug |
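
The idea mentioned above of logging each parallel process to its own file could be sketched like this, as an alternative to `xargs -P 8`. The function names and the log directory layout are hypothetical. The sketch also assumes that the command exits non-zero when something fails, which has not been verified for `ia download`.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_logged(command, item, log_dir):
    """Run `command + [item]`, writing stdout and stderr to <log_dir>/<item>.log."""
    log_path = Path(log_dir) / f"{item}.log"
    with open(log_path, "wb") as log:
        result = subprocess.run(
            [*command, item], stdout=log, stderr=subprocess.STDOUT
        )
    return item, result.returncode


def run_all(command, items, log_dir, workers=8):
    """Run the command once per item in parallel; return {item: exit code}."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda item: run_logged(command, item, log_dir), items)
        return dict(results)
```

Something like `run_all(["ia", "download", "--exclude-source=derivative"], identifiers, "logs")` would then leave one readable log per item, and any item with a non-zero exit code could be retried on its own instead of scanning interleaved terminal output.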
18:49:32 | | DogsRNice joins |
19:13:40 | | katia quits [Read error: Connection reset by peer] |
19:14:12 | | kokos quits [Read error: Connection reset by peer] |
19:15:20 | | Dango360 (Dango360) joins |
19:56:11 | | kokos joins |
20:04:15 | | katia_ quits [Quit: ZNC - https://znc.in] |
20:08:19 | | katia_ (katia) joins |
20:09:20 | | katia (katia) joins |
20:24:21 | | corentin quits [Ping timeout: 256 seconds] |
20:28:55 | | katia is now known as katia__ |
20:38:15 | | corentin joins |