00:03:50 | | fuzzy8021 quits [Read error: Connection reset by peer] |
00:03:59 | | fuzzy80211 (fuzzy80211) joins |
00:15:17 | | systwi_ joins |
02:00:13 | <nicolas17> | hm, 13GB file in WBM, trying to download it gets 15MB and closes connection |
02:00:24 | <nicolas17> | that usually suggests the URL was only partially archived |
02:00:47 | <pokechu22> | I think there's a header with more information about if a file was truncated |
02:00:52 | <nicolas17> | but I remember seeing a header that says "download aborted due to 'time'" or something like that and I'm not seeing it here |
02:01:19 | <nicolas17> | https://web.archive.org/web/20240724054156/https://swcdn.apple.com/content/downloads/61/25/062-37076-A_IBQ9B6BPOH/d6qqf4pb3wagb0zgsdbd9kgscpurrjky5x/InstallAssistant.pkg |
02:02:56 | <pokechu22> | Hmm, yeah, that has an actual content-length: 13495235447 header too - I'd expect that to be different in this case |
04:08:00 | <nicolas17> | https://web.archive.org/web/20240924153134/https://swcdn.apple.com/content/downloads/48/28/062-80466-A_IX9VBQPBBE/8jcyvbnr93ho7qbzvnbw1jv9vtnmo76ifl/InstallAssistant.pkg and this one gives gateway timeout / bad gateway and never starts downloading |
04:26:28 | <nicolas17> | pokechu22: I wonder if IA removed that warning header saying the capture was truncated |
04:26:46 | <nicolas17> | because I didn't see it in any of these captures, not even those that fail to download after 15MB |
04:35:20 | <@JAA> | No, the warning header still exists as of a couple weeks ago: https://web.archive.org/web/20240911063246/http://nbg1-speed.hetzner.com/10GB.bin |
04:36:56 | <@JAA> | Or well, I'd expect the warning header to be based on the WARC-Truncated header, and I'm sure they won't change *that*. |
04:52:25 | | DogsRNice quits [Read error: Connection reset by peer] |
06:09:41 | | IDK (IDK) joins |
06:17:27 | | Lord_Nightmare quits [Quit: ZNC - http://znc.in] |
06:21:11 | | Lord_Nightmare (Lord_Nightmare) joins |
08:15:28 | <qwertyasdfuiopghjkl> | nicolas17: According to the x-archive-src header on your first URL it's in spn-cloudflare-20240724061613/spn-cloudflare-20240723125001-wwwb-front8.us.archive.org-8011.warc.gz, which https://archive.org/download/spn-cloudflare-20240724061613 says is only 953.8MiB, so unless that file somehow compressed very well it's truncated. |
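The lookup described in the message above (reading the `x-archive-src` header, then checking the size of the containing item on archive.org) could be sketched like this. It is a minimal sketch: the function name `split_archive_src` is hypothetical, and it assumes the header value always has the form `<item identifier>/<file path>`, as in the value quoted above.

```python
from urllib.parse import quote


def split_archive_src(header_value):
    """Split an x-archive-src value into (item, filename, item_url).

    Assumption: the value looks like "<item>/<path to WARC inside the item>".
    """
    item, _, filename = header_value.strip().partition("/")
    if not item or not filename:
        raise ValueError(f"unexpected x-archive-src value: {header_value!r}")
    # The download listing for the item shows each file's size.
    item_url = f"https://archive.org/download/{quote(item)}"
    return item, filename, item_url
```

With the header value quoted in the chat, this yields the item `spn-cloudflare-20240724061613` and the same `https://archive.org/download/...` URL that was checked by hand.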
08:53:16 | | IDK quits [Client Quit] |
09:15:14 | | nulldata quits [Quit: So long and thanks for all the fish!] |
09:16:40 | | nulldata (nulldata) joins |
09:51:08 | | driib quits [Quit: The Lounge - https://thelounge.chat] |
09:51:34 | | driib (driib) joins |
09:52:39 | <@JAA> | I'd love to see what that WARC record looks like. |
09:54:03 | <@JAA> | But the CDX API confirms the truncation to about 15.5 MB (after compression?): https://web.archive.org/cdx/search/cdx?url=https://swcdn.apple.com/content/downloads/61/25/062-37076-A_IBQ9B6BPOH/d6qqf4pb3wagb0zgsdbd9kgscpurrjky5x/InstallAssistant.pkg |
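The check described above, comparing the record length reported by the CDX API with the `content-length` the origin server sent, could be sketched as follows. This assumes the default plain-text CDX output with seven space-separated fields (urlkey, timestamp, original, mimetype, statuscode, digest, length). The helper names and the `min_ratio` threshold are hypothetical, and because the CDX length refers to the compressed WARC record, the result is only a heuristic.

```python
CDX_FIELDS = (
    "urlkey", "timestamp", "original",
    "mimetype", "statuscode", "digest", "length",
)


def parse_cdx_line(line):
    """Parse one line of default CDX text output into a dict of strings."""
    parts = line.split()
    if len(parts) != len(CDX_FIELDS):
        raise ValueError(f"expected {len(CDX_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CDX_FIELDS, parts))


def probably_truncated(cdx_length, content_length, min_ratio=0.5):
    """Heuristic: flag a capture whose stored record is far smaller than
    the Content-Length reported by the origin server.

    min_ratio is an arbitrary assumption; highly compressible payloads
    could legitimately fall below it.
    """
    return cdx_length < content_length * min_ratio
```

For the capture discussed here, a record of roughly 15.5 MB against a `content-length` of 13495235447 is flagged by a wide margin.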
10:52:14 | <@arkiver> | nicolas17: i think it may really be better to rely on #archivebot for this |
12:59:29 | | IDK (IDK) joins |
13:35:37 | | KoalaBear84 joins |
13:39:04 | | KoalaBear quits [Ping timeout: 260 seconds] |
13:40:56 | | qw3rty__ joins |
13:44:12 | | qw3rty_ quits [Ping timeout: 258 seconds] |
13:58:50 | | qw3rty_ joins |
14:02:14 | | qw3rty__ quits [Ping timeout: 258 seconds] |
14:40:57 | | qw3rty__ joins |
14:44:46 | | qw3rty_ quits [Ping timeout: 258 seconds] |
15:39:52 | <nicolas17> | arkiver: I am using archivebot for archiving things, these were pre-existing captures (no idea where they came from, maybe someone fed them into SPN or some crawling thing) |
15:41:22 | <nicolas17> | from the warc filename, I saw mentions of "SPNOUT", "GDELT" and "WDRP" |
15:59:08 | | IDK quits [Client Quit] |
16:06:30 | | JaffaCakes118_2 (JaffaCakes118) joins |
16:06:46 | | JaffaCakes118 quits [Remote host closed the connection] |
18:35:44 | | DogsRNice joins |
18:36:22 | | DogsRNice quits [Remote host closed the connection] |
19:12:54 | <nicolas17> | JAA: I was downloading the whole thing from WBM but the CDX API already answers that (via the hash) :/ |
19:13:46 | <nicolas17> | that will be so much faster |
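One way to use the hash from the CDX API instead of downloading from the WBM is to compare it with the digest of a copy obtained elsewhere. The sketch below assumes the CDX `digest` field is the Base32-encoded SHA-1 of the response payload. The function names are hypothetical. For a truncated capture the digest would presumably cover only the bytes that were stored.

```python
import base64
import hashlib


def sha1_base32(path, chunk_size=1024 * 1024):
    """Return the Base32-encoded SHA-1 of a local file, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return base64.b32encode(h.digest()).decode("ascii")


def matches_cdx_digest(path, cdx_digest):
    """Compare a local file against a digest string taken from CDX output."""
    return sha1_base32(path) == cdx_digest.strip().upper()
```

Reading in chunks keeps memory use flat, which matters for files in the 13 GB range mentioned earlier.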
19:26:06 | | DogsRNice joins |
19:53:37 | <TheTechRobo> | JAA: How do you resume an interrupted ia-upload-stream? It requires --parts, right? What data structure is it expecting? |
20:19:55 | <TheTechRobo> | Ah, it's printed if the upload fails |
23:16:59 | | thalia joins |
23:17:01 | | thalia is now authenticated as thalia |