00:40:07 | | sralracer quits [Client Quit] |
00:49:23 | | wickedplayer494 quits [Remote host closed the connection] |
00:49:40 | | wickedplayer494 joins |
00:49:46 | | wickedplayer494 is now authenticated as wickedplayer494 |
01:30:34 | <@JAA> | Barto++ |
01:30:34 | <eggdrop> | [karma] 'Barto' now has 12 karma! |
01:51:22 | <nicolas17> | JAA: https://transfer.archivete.am/inline/ew4AR/qwarc.txt now-what.gif |
01:51:48 | <@JAA> | nicolas17: For starters, don't install the master branch. |
01:51:48 | <nicolas17> | do I have too old python or |
01:52:20 | | nicolas17 tries last tag instead |
01:52:44 | <@JAA> | But the issue comes from a backwards-incompatible change in an aiohttp dependency that broke everything a while back. |
01:52:58 | <nicolas17> | sounds damn familiar |
01:53:04 | <nicolas17> | like I fought this exact issue before |
01:53:12 | <@JAA> | async-timeout==3.0.1 |
01:53:37 | <nicolas17> | would it be possible to pin it in qwarc's setup.py to avoid the issue? |
01:54:25 | <nicolas17> | okay that works, and now it asks me for a specfile |
01:54:33 | <nicolas17> | which I guess is a python script using an undocumented API |
01:56:00 | <@JAA> | Correct |
01:56:14 | <@JAA> | And yeah, I should add a pin. |
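For reference, a pin like the one discussed could be expressed in qwarc's setup.py roughly as follows. This is only a sketch; apart from the async-timeout==3.0.1 pin mentioned above, the package names and the rest of the metadata are placeholders, not qwarc's actual dependency list.

    # Sketch of a setup.py dependency pin (illustrative only; the other
    # requirements are placeholders, not qwarc's real dependency list).
    from setuptools import setup

    setup(
        name='qwarc',
        install_requires=[
            'aiohttp',               # whichever version qwarc already requires
            'async-timeout==3.0.1',  # the pin mentioned above
        ],
    )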
01:56:33 | <nicolas17> | ...does nothing else support dedup? |
01:57:13 | <@JAA> | wget does (but writes faulty WARCs, although that seems to be getting fixed soon, finally). |
01:57:36 | <@JAA> | You could try wget-at, if you're not on ARM. |
01:58:11 | <nicolas17> | are there qwarc examples somewhere at least? |
01:58:13 | <@JAA> | Or rather: if you're on x86-64 |
01:58:34 | <@JAA> | In my IA uploads. qwarc is self-documenting and writes the spec file to the meta WARC. |
01:58:40 | | nicolas17 puts RISC-V machine back down |
02:05:37 | <@JAA> | ... it's been 3 years since the last qwarc version? Wat? |
02:05:50 | <nicolas17> | time is fake |
02:06:29 | <@JAA> | Yeah, must be. |
02:09:41 | <nicolas17> | so far cloning gnulib git repo has been the slowest step |
02:16:30 | <@JAA> | Cloning what and why? |
02:16:47 | <nicolas17> | wget-at has gnulib as a submodule |
02:17:23 | <@JAA> | Ah, you're trying wget-at now, right. |
02:18:12 | <@OrIdow6> | Yeah it's super slow/times out sometimes, don't know why |
02:18:15 | <nicolas17> | using the Dockerfile built an image that can't run because it *only* has wget-at and its libraries, but no libc or anything, because it uses "FROM scratch"? what? |
02:18:20 | <@OrIdow6> | Good thing is you only have to do it once |
02:18:45 | <@JAA> | Yeah, the wget-at image isn't really meant to be used directly. |
02:19:24 | <@JAA> | Or wget-lua image, I suppose. |
02:19:27 | | decky quits [Ping timeout: 252 seconds] |
02:19:31 | <@JAA> | Yay, legacy naming. |
02:19:44 | <nicolas17> | and if I change "FROM scratch" to "FROM debian:bookworm-slim" it has to download gnulib and compile wget again, because *any* change to Dockerfile changes the effects of the previous statement "COPY . /tmp/wget" |
02:20:24 | <@JAA> | You could just pull the grab-base image from atdr. |
02:20:26 | <nicolas17> | why are computers |
02:20:58 | <@JAA> | It's meant for DPoS, but it should be possible to just use it to run wget-at. |
02:21:01 | <nicolas17> | also fun how c-ares took much longer to compile than wget, if we ignore the time needed to clone gnulib and the autoconf nonsense
02:22:18 | | decky_e joins |
02:41:26 | <nicolas17> | the --progress option seems broken |
02:42:29 | <nicolas17> | --progress=bar or --progress=dot:mega etc, none changes anything, it keeps flooding my console with one dot per KB |
02:45:50 | | HP_Archivist quits [Quit: Leaving] |
02:53:31 | <nicolas17> | hm this initial test seems fine, but I should probably run this for real on my other computer so that I can actually fit a whole InstallAssistant in RAM |
02:57:29 | | cow_2001 quits [Quit: ✡] |
02:58:37 | | cow_2001 joins |
02:59:25 | <nicolas17> | wtf it's appending to the -O file? |
03:01:56 | | nulldata quits [Quit: Ping timeout (120 seconds)] |
03:02:02 | <nicolas17> | "wget-at -O wget.tmp -i list.txt--warc-file=test" downloaded the 12GB file into wget.tmp, then wrote the 12GB data into test.warc, and now wget.tmp is growing more (with the content of the second URL?), I expected it to get truncated for the second download |
03:02:49 | | nulldata (nulldata) joins |
03:06:02 | <@JAA> | Yes, it should truncate. That's clearly not your exact command since there's no space between txt and --. |
03:06:15 | <nicolas17> | ...pipelines pass --truncate-output |
03:06:16 | <nicolas17> | I see |
03:07:01 | <@JAA> | Oh, right, that's a wget-at thing, I think. |
03:08:38 | <nicolas17> | I think I can also just use -O /dev/null and save some temporary disk space |
03:08:48 | <nicolas17> | since it uses a separate temporary file for warc purposes anyway |
03:10:51 | <@JAA> | Since you don't need to extract links or similar, that sounds plausible, yeah. |
03:11:31 | <nicolas17> | hm the .cdx only has the first URL |
03:30:34 | | pete joins |
03:31:04 | | pete quits [Client Quit] |
03:39:22 | | nic8693102004 (nic) joins |
03:53:57 | <TheTechRobo> | nicolas17: You can do `FROM <wget-at image> as wget` and then `COPY --from=wget /path/to/wget-lua /usr/bin/wget-lua`, for the record. |
03:55:03 | <TheTechRobo> | You don't need a runnable container to copy data out of it. |
03:55:13 | <nicolas17> | on a separate dockerfile you mean? |
03:55:20 | <TheTechRobo> | Yeah |
03:55:29 | <TheTechRobo> | IIRC that's what I did |
03:55:53 | <nicolas17> | (editing this dockerfile in any way caused a rebuild of wget because it was considered part of the source code being copied in a build step) |
03:55:54 | <TheTechRobo> | Although I haven't used wget-at outside of docker for probably at least a year now. |
03:56:02 | | etnguyen03 quits [Remote host closed the connection] |
03:57:54 | | wyatt8740 quits [Ping timeout: 252 seconds] |
03:58:15 | | loug8318142 quits [Quit: The Lounge - https://thelounge.chat] |
03:58:57 | <nicolas17> | augh |
03:59:10 | <nicolas17> | JAA: "VEILPUTFJNJQAAAAAAAAAAAAAAAAAAAA" is this the bug mentioned recently in #archiveteam-dev? |
03:59:19 | <@JAA> | Sure looks like it. |
04:00:17 | <nicolas17> | it's joever |
04:00:21 | <nicolas17> | every warc tool is broken |
04:01:49 | <TheTechRobo> | Enjoy the broken digests. :-) |
04:02:02 | | wyatt8740 joins |
04:03:04 | <TheTechRobo> | JAA: Do you think it's reasonable to add an option to test for the broken digest bug in warc-tiny so it doesn't flood the logs so much? |
04:03:48 | <TheTechRobo> | Similar to how it detects the broken handling of transfer-encoding, but as an option since only Wget-AT is affected |
04:04:31 | <nicolas17> | this is a 104-byte download btw |
04:04:48 | | ducky quits [Ping timeout: 260 seconds] |
04:05:29 | <TheTechRobo> | nicolas17: Not sure what the exact odds of hitting the bug are, but I think you just got unlucky. |
04:05:36 | <nicolas17> | https://swcdn.apple.com/content/downloads/13/38/072-11038-A_8VILF7KGLR/ekuwqfer80bkta2a6l6hn9flavknip4edt/MajorOSInfo.pkg.integrityDataV1 |
04:06:17 | | ducky (ducky) joins |
04:07:47 | <@OrIdow6> | <nicolas17> every warc tool is broken |
04:07:49 | <h2ibot> | OrIdow6 edited The WARC Ecosystem (+155, /* Tools */ wget-lua to yellow): https://wiki.archiveteam.org/?diff=53840&oldid=53766 |
04:07:49 | <@OrIdow6> | Good point... |
04:08:03 | <nicolas17> | ffs |
04:08:14 | <TheTechRobo> | Oh, right, it affects dedupe too. |
04:08:30 | <@OrIdow6> | I don't think anyone's tested that but apparently that's what it looked like |
04:08:45 | <TheTechRobo> | Yeah, the copied hash is what's passed to the deduplication function. |
04:09:01 | | TheTechRobo wonders how many records have been incorrectly deduplicated |
04:09:27 | | nulldata quits [Client Quit] |
04:09:52 | <@OrIdow6> | Mmmm, probably a lot |
04:10:09 | <TheTechRobo> | I'm not so sure |
04:10:14 | <TheTechRobo> | URL-agnostic dedupe is per-process |
04:10:20 | | nulldata (nulldata) joins |
04:10:22 | <@JAA> | 1 of 256 hashes will be all 0 bytes with this bug. |
04:10:46 | <TheTechRobo> | Oh, those odds are significantly worse than I thought. :-( |
04:10:55 | <@OrIdow6> | Like apparently a 100-request item (/multiitem batch) will have a 5% chance of having a false dedup |
04:10:55 | <@JAA> | And that's just the biggest chunk of it. |
04:11:12 | <@OrIdow6> | No |
04:11:14 | <@JAA> | Specifically the probability of the first byte of the has being NUL. |
04:11:25 | <@JAA> | hash* |
04:11:32 | <@OrIdow6> | Yes, actually, I copied the wrong number but it was about the same as the right one |
04:11:40 | <@OrIdow6> | (Replying to myself not to J A A) |
04:11:45 | <@JAA> | Collisions on later NUL bytes are also possible but not as likely. |
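A rough back-of-the-envelope check of those odds, counting only the all-zero-hash case described above (a first NUL byte in the digest) and assuming dedup happens within one 100-request session:

    # Chance that at least two of 100 records in one session get the all-zero
    # truncated digest (first digest byte is NUL, p = 1/256 per record), which
    # is the main way a false dedup inside that session can happen.
    p = 1 / 256
    n = 100
    p_at_most_one = (1 - p) ** n + n * p * (1 - p) ** (n - 1)
    print(f"P(>=2 all-zero digests in {n} requests) ~= {1 - p_at_most_one:.1%}")
    # prints about 5.9%, in the same ballpark as the figure quoted above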
04:12:38 | <@OrIdow6> | I'll turn off dedup on Cohost in a bit |
04:12:53 | <nicolas17> | is it strcpy'ing the binary hash? |
04:12:57 | <TheTechRobo> | Yes |
04:13:00 | <TheTechRobo> | strncpy specifically |
04:13:02 | <@JAA> | strncpy, but yes |
04:13:11 | <nicolas17> | let's quit computers and start a farm |
04:13:16 | <TheTechRobo> | :-) |
04:13:16 | <@JAA> | Sounds good to me. |
04:13:31 | <@JAA> | Oh wait, modern farm equipment is all computers. D: |
04:13:42 | <nicolas17> | and fighting tractor DRM |
04:16:53 | <nicolas17> | so wtf |
04:16:56 | <nicolas17> | isn't this "easy" to fix |
04:17:02 | <TheTechRobo> | Yes |
04:17:11 | <TheTechRobo> | strncmp -> memcmp, theoretically |
04:17:15 | <TheTechRobo> | Er |
04:17:23 | <TheTechRobo> | strncpy -> memcpy, theoretically |
04:18:53 | | TheTechRobo has just noticed that there is now only one green row on The WARC Ecosystem :-( |
04:19:43 | <nicolas17> | let's go find bugs in it |
04:21:59 | <@JAA> | And even that green row has bugs inherited from wpull. :-( |
04:24:44 | <nicolas17> | also indeed running this on a machine with 24GB RAM was so much better |
04:24:54 | <nicolas17> | whole file stays in disk cache |
04:27:52 | <nicolas17> | https://archive.org/details/macos-installassistant-24C5073e-warc see you in two hours |
04:40:18 | | Guest54 quits [Quit: My MacBook has gone to sleep. ZZZzzz…] |
04:41:40 | | Unholy2361924645377131 quits [Ping timeout: 260 seconds] |
04:41:55 | <nicolas17> | does this affect archivebot or does that use different software? |
04:43:07 | <nicolas17> | it seems to be 1 of the 2 uses of strncpy in the whole codebase so yes this seems easy to fix |
04:43:21 | <nicolas17> | src/warc.c:2085: strncpy(sha1_res_payload, sha1_payload, SHA1_DIGEST_SIZE); |
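For illustration, the effect of that strncpy on a raw SHA-1 digest can be re-enacted in Python (a sketch, not the actual C code): the copy stops at the first zero byte and the rest of the destination stays zero-filled, which is what produces base32 digests ending in a run of A's like the one seen earlier.

    # Re-enactment of the truncation: the copy stops at the first NUL byte and
    # the remainder of the destination buffer stays zero-filled.
    import base64
    import hashlib

    def strncpy_like(src: bytes, n: int) -> bytes:
        cut = src.find(b'\x00')
        copied = src if cut == -1 else src[:cut]
        return copied[:n].ljust(n, b'\x00')

    # pick some payload whose SHA-1 digest happens to have an early NUL byte
    payload = next(str(i).encode() for i in range(100000)
                   if b'\x00' in hashlib.sha1(str(i).encode()).digest()[:10])
    digest = hashlib.sha1(payload).digest()
    print('correct:', base64.b32encode(digest).decode())
    print('copied :', base64.b32encode(strncpy_like(digest, len(digest))).decode())
    # the second digest ends in a run of A's: the base32 encoding of the zero padding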
05:27:00 | <nicolas17> | JAA: I finished uploading the item |
05:27:36 | <nicolas17> | I left the .warc uncompressed because it's mainly a giant already-compressed file, is that a problem? are there tools that expect .warc to always be .gz/.zstd? |
05:36:25 | | HP_Archivist (HP_Archivist) joins |
06:06:34 | | ave quits [Quit: Ping timeout (120 seconds)] |
06:06:54 | | ave (ave) joins |
06:27:12 | | night quits [Remote host closed the connection] |
06:27:23 | | night joins |
06:27:23 | | night is now authenticated as night |
07:03:52 | | BlueMaxima quits [Quit: Leaving] |
07:05:51 | | Unholy2361924645377131 (Unholy2361) joins |
07:08:04 | | Pedrosso5 joins |
07:08:09 | | ScenarioPlanet2 (ScenarioPlanet) joins |
07:08:10 | | TheTechRobo2 (TheTechRobo) joins |
07:10:25 | | ScenarioPlanet quits [Ping timeout: 255 seconds] |
07:10:25 | | Pedrosso quits [Ping timeout: 255 seconds] |
07:10:25 | | ScenarioPlanet2 is now known as ScenarioPlanet |
07:10:26 | | Pedrosso5 is now known as Pedrosso |
07:10:57 | | TheTechRobo quits [Ping timeout: 252 seconds] |
07:10:57 | | TheTechRobo2 is now known as TheTechRobo |
07:12:45 | | @rewby quits [Ping timeout: 260 seconds] |
07:48:13 | | ducky quits [Ping timeout: 260 seconds] |
08:18:05 | <upperbody321|m> | So-net Blog (SS Blog), the former Sony Communications blogging business, will end its services on 31 March 2025. |
08:18:05 | <upperbody321|m> | https://blog-wn.blog.ss-blog.jp/2024-11-15 |
08:18:05 | <upperbody321|m> | Sorry if it has already been posted |
08:37:05 | | rewby (rewby) joins |
08:37:05 | | @ChanServ sets mode: +o rewby |
08:42:53 | | Island quits [Read error: Connection reset by peer] |
08:54:12 | | xarph quits [Read error: Connection reset by peer] |
08:54:30 | | xarph joins |
09:02:15 | | BennyOtt (BennyOtt) joins |
09:21:48 | | BennyOtt quits [Client Quit] |
09:37:14 | | ducky (ducky) joins |
09:41:28 | | BennyOtt (BennyOtt) joins |
10:12:07 | <@arkiver> | the bug in Wget-AT is now fixed with https://github.com/ArchiveTeam/wget-lua/commit/8adeb442e256ca8c737da19cc0224e1ca09ef266 |
10:13:26 | <@arkiver> | i will make sure it is propagated to all projects |
10:27:29 | <@arkiver> | how i believe this would work out in the Wayback Machine is the following: upon indexing of WARCs (creating the .cdx.gz file from the .warc.gz), records have their hashes recalculated. this means they would end up in the CDX file with the correct hash.
10:28:37 | <@arkiver> | however, hashes used in revisit records are of course not being recalculated - those are assumed to be correct. when a revisit record is resolved, i believe the nearest record matching the advertised URL+hash is found and redirected to. if the hashes do not match, i believe no redirect would happen.
10:28:43 | <@arkiver> | i will confirm that |
10:36:21 | | sralracer joins |
10:36:42 | | sralracer is now authenticated as sralracer |
10:40:18 | <@arkiver> | but... let's see, is this fixable? to some degree it is, but it would require parsing a ton of data |
10:40:52 | <@arkiver> | we do not have any kind of 'global' deduplication with some central collection of hashes against which we deduplicate. deduplication only happens within a single session.
10:41:43 | <@arkiver> | a single session produces one WARC, which always ends up in one megaWARC (it's never split over multiple), and every record in the WARC is clearly associated with the session (or item) it was produced with.
10:43:24 | <@arkiver> | together with the WARC-Refers-To-Date and WARC-Refers-To-Target-URI WARC headers, it is possible with a very high degree of certainty to match records together, and fix hashes in that way. |
10:50:15 | <@arkiver> | thinking more about this. while the ideal would be to fix the megaWARCs themselves, it may also be possible to create a second WARC next to the megaWARC with the fixed revisit records. this would only require using the CDX to find the "maybe bad revisit records" (all those ending with one or more A's?), then writing fixed versions of these to a WARC and placing this smaller WARC in the item.
10:51:02 | <@arkiver> | of course, this would not fix the records in the megaWARC, but those that are revisit records will already have their digests recalculated upon creating the CDX file. |
10:51:34 | <@arkiver> | i'm looking into this, i think there are possibilities
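A sketch of the CDX pass described above, flagging digests that end in a suspicious run of A's. The digest column position and the threshold are assumptions here; real CDX files state their field order in the header line, so that should be checked first.

    # Flag CDX lines whose base32 SHA-1 digest ends in many A's, i.e. likely
    # zero-padded by the truncation bug. Column index and threshold are guesses.
    import sys

    DIGEST_FIELD = 5      # assumed position of the digest column; check the CDX header
    MIN_TRAILING_A = 8    # assumed threshold for "probably truncated"

    def suspicious(digest: str) -> bool:
        return len(digest) - len(digest.rstrip('A')) >= MIN_TRAILING_A

    with open(sys.argv[1]) as cdx:
        for line in cdx:
            fields = line.split()
            if len(fields) > DIGEST_FIELD and suspicious(fields[DIGEST_FIELD]):
                print(line, end='')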
10:58:57 | | Wohlstand quits [Remote host closed the connection] |
11:10:52 | <@arkiver> | err that is not the complete story, that is for payloads that have been correctly deduplicated from each other. payloads that have been wrongly deduplicated cannot be fixed
11:12:22 | | @arkiver is not thinking well at the moment :/ |
11:13:08 | | nulldata0 (nulldata) joins |
11:14:46 | | nulldata quits [Ping timeout: 255 seconds] |
11:14:47 | | nulldata0 is now known as nulldata |
12:00:02 | | Bleo182600722719623 quits [Quit: The Lounge - https://thelounge.chat] |
12:02:43 | | Bleo182600722719623 joins |
12:05:45 | | LddPotato quits [Ping timeout: 252 seconds] |
12:38:44 | | SkilledAlpaca41896 quits [Quit: SkilledAlpaca41896] |
12:44:43 | | SkilledAlpaca41896 joins |
12:51:21 | | th3z0l4_ quits [Read error: Connection reset by peer] |
12:52:17 | | th3z0l4 joins |
13:06:48 | | th3z0l4 quits [Ping timeout: 252 seconds] |
13:07:23 | | th3z0l4 joins |
13:16:03 | | lennier2_ joins |
13:19:05 | | lennier2 quits [Ping timeout: 260 seconds] |
13:56:34 | <BennyOtt> | What is actually the best option? The "warrior-dockerfile" with the interface, or running each project you want to support individually?
14:07:35 | | Guest54 joins |
14:17:52 | <that_lurker> | each project individually would be the best way to go. |
14:23:50 | | FartWithFury (FartWithFury) joins |
14:25:29 | <nstrom|m> | Yeah primarily because you can run multiple projects at once that way |
14:26:46 | <nstrom|m> | Usually individual projects are limited to some extent in how much you can run on a single IP, so even if you have plenty of bandwidth/CPU/RAM you usually can't devote it all to a single project, since the site usually blocks/throttles on their end
14:37:43 | <FartWithFury> | what are you all archiving with and where too? |
14:38:24 | <nstrom|m> | Standalone docker images, each project has one |
14:39:11 | <nstrom|m> | And from where, a number of VPS/cloud servers and some Linux hardware at home |
14:39:18 | <nstrom|m> | For me at least |
14:42:45 | <imer> | ^same |
14:50:40 | <FartWithFury> | <3 |
14:51:22 | <that_lurker> | FartWithFury: grab-site for my own website archives, tubeup for videos or yt-dlp, wikiteam3 for mediawikis... almost everything goes to archive.org
14:51:48 | <that_lurker> | chat_downloader for live chats (youtube, twitch..) |
14:54:38 | <FartWithFury> | i'm using httrack, wget (python) and the downloadthemall addon for firefox :) and same up to archive
14:55:09 | <FartWithFury> | then up to archive.org* |
15:02:57 | <BennyOtt> | ok, thanks @that_lurker and @nstrom. then I might change something a bit on my side, since "warrior-dockerfile" hasn't received an update yet, even though it's easy to manage when new projects come along. |
15:04:30 | <that_lurker> | you can also just run multiple instances of the warrior-docker if you want a gui experience |
15:06:33 | <BennyOtt> | yes, I did that before too |
15:31:05 | | Mateon1 quits [Quit: Mateon1] |
15:41:30 | | VerifiedJ9 quits [Quit: The Lounge - https://thelounge.chat] |
15:41:32 | | Mateon1 joins |
15:42:00 | | thuban quits [Ping timeout: 260 seconds] |
15:42:10 | | VerifiedJ9 (VerifiedJ) joins |
15:46:02 | | thuban (thuban) joins |
15:55:41 | | Raithmir joins |
16:04:20 | | ducky quits [Read error: Connection reset by peer] |
16:04:40 | | ducky (ducky) joins |
16:07:05 | | pabs quits [Ping timeout: 260 seconds] |
16:10:23 | | pabs (pabs) joins |
16:15:07 | | Raithmir quits [Client Quit] |
16:25:45 | | wyatt8740 quits [Ping timeout: 260 seconds] |
16:29:51 | <katia> | pabs, kokos asked me about Gemini and the possibilities of archiving it. I know you looked into it at some point, are you doing anything with it? |
16:37:12 | | wyatt8740 joins |
17:01:06 | | katocala quits [Ping timeout: 252 seconds] |
17:01:32 | | katocala joins |
17:19:20 | | wessel1512 joins |
17:19:30 | | bladem quits [Read error: Connection reset by peer] |
17:32:27 | | katocala quits [Ping timeout: 252 seconds] |
17:33:02 | | katocala joins |
17:56:45 | | pabs quits [Ping timeout: 260 seconds] |
17:59:10 | | sralracer quits [Client Quit] |
18:00:16 | | pabs (pabs) joins |
18:01:11 | | Naruyoko quits [Read error: Connection reset by peer] |
18:01:28 | | Naruyoko joins |
18:13:47 | | sralracer (sralracer) joins |
18:20:38 | <h2ibot> | JustAnotherArchivist edited Template:IRC channel (-2, Update for new web chat based on…): https://wiki.archiveteam.org/?diff=53841&oldid=47317 |
18:25:13 | <@JAA> | nicolas17: Uncompressed WARC is fine, I think. |
18:27:23 | <@JAA> | arkiver: Identifying potentially faulty records should be possible from the CDX and the megawarc JSON. The former should contain the payload digests, and the latter allows identifying boundaries between mini-WARCs to eliminate those false positives. |
18:28:29 | <@JAA> | Still a lot of data though. And we'd have to check what hash the CDX contains exactly; I think IA recalculates it rather than relying on what's in the WARC, but not entirely sure. |
18:29:52 | | katocala is now authenticated as katocala |
18:45:45 | | nicolas17 is now authenticated as nicolas17 |
19:06:52 | | ducky_ (ducky) joins |
19:07:13 | | ducky quits [Ping timeout: 260 seconds] |
19:07:28 | | ducky_ is now known as ducky |
19:12:45 | | Webuser074404 joins |
19:13:29 | | Webuser074404 quits [Client Quit] |
19:28:40 | | BlueMaxima joins |
19:33:27 | | FartWithFury quits [Read error: Connection reset by peer] |
20:49:29 | | Naruyoko5 joins |
20:50:03 | | Naruyoko quits [Read error: Connection reset by peer] |
21:03:57 | | Arachnophine quits [Quit: Ping timeout (120 seconds)] |
21:04:14 | | Arachnophine (Arachnophine) joins |
21:06:58 | | BornOn420 quits [Remote host closed the connection] |
21:07:14 | | alexlehm quits [Remote host closed the connection] |
21:07:34 | | Sluggs quits [Quit: ZNC - http://znc.in] |
21:07:38 | | alexlehm (alexlehm) joins |
21:08:36 | | Barto quits [Quit: WeeChat 4.4.3] |
21:09:15 | | katia_ quits [Ping timeout: 260 seconds] |
21:09:50 | | @JAA quits [Ping timeout: 260 seconds] |
21:09:50 | | kokos| quits [Ping timeout: 260 seconds] |
21:11:09 | | JAA (JAA) joins |
21:11:09 | | @ChanServ sets mode: +o JAA |
21:12:09 | | loug8318142 joins |
21:14:50 | | kokos- joins |
21:16:29 | | katia_ (katia) joins |
21:18:59 | | Sluggs joins |
21:21:52 | | BornOn420 (BornOn420) joins |
21:34:33 | | Island joins |
21:43:06 | | sralracer quits [Quit: Ooops, wrong browser tab.] |
21:43:41 | | Barto (Barto) joins |
22:04:35 | | @JAA quits [Remote host closed the connection] |
22:05:11 | | JAA (JAA) joins |
22:05:11 | | @ChanServ sets mode: +o JAA |
22:14:22 | <TheTechRobo> | Is using a VPN with Warrior projects OK if I control the VPN? It's wireguard. |
22:20:43 | | BlueMaxima quits [Read error: Connection reset by peer] |
22:22:32 | <@OrIdow6> | arkiver: The sidecar WARC idea would work I think; also your explanation of the WBM's behavior seems to fit what I can see - e.g. https://web.archive.org/web/https://t.nhentai.net/galleries/73599/170t.jpg is a revisit with all A's in https://archive.org/details/archiveteam_nhentai_20240920192453_7ede7e27 , but in playback it redirects to a date but then says there are no WBM captures (live URL NSFW) |
22:26:22 | <@OrIdow6> | If we wanted to do *only* a CDX/sidecar-warc approach with no reading/writing of the original we could have a threshold of entropy of the hash (number of A's) - say "even if this ends in 20 A's, the first 20 digits are the same, there's only a 1/(whatever) chance this would've happened by coincidence" |
22:29:45 | | Unholy2361924645377131 quits [Ping timeout: 260 seconds] |
22:29:46 | <@OrIdow6> | More broadly I think we could come up with some kind of "score" for how likely the 2 are to be true equals? |
22:31:04 | <@OrIdow6> | Like, # of bits that match before the A's start + (5 if the URLs are the same else 0) + (5 if the Etag and HTTP content-length headers are the same else 0)
22:31:28 | <@OrIdow6> | And check if that's greater than 20 |
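Written out literally, that heuristic might look something like the following. This is purely a sketch of the idea as stated: the 5-point bonuses and the threshold of 20 are the numbers floated above, nothing settled, and each base32 character of the digest carries 5 bits.

    # Sketch of the proposed score between a suspect truncated digest and a
    # candidate record's recomputed digest (both base32 SHA-1 strings).
    def match_score(truncated_b32: str, candidate_b32: str,
                    same_url: bool, same_etag_and_length: bool) -> int:
        surviving = len(truncated_b32.rstrip('A'))   # characters before the A padding
        bits = surviving * 5 if candidate_b32.startswith(truncated_b32[:surviving]) else 0
        return bits + (5 if same_url else 0) + (5 if same_etag_and_length else 0)

    def probably_same_payload(truncated_b32, candidate_b32, same_url, same_etag_and_length):
        return match_score(truncated_b32, candidate_b32, same_url, same_etag_and_length) > 20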
22:32:35 | <nicolas17> | for non-revisit records you could also just calculate the correct hash |
22:32:46 | <@OrIdow6> | But better be cautious with this because if done wrongly (if the heuristic measures have too much weight) it could veer into "faking data" territory |
22:34:03 | <nicolas17> | if it has *any* 00 at the end, calculate the correct payload hash, if it doesn't match then anything with that hash is suspect |
22:35:46 | <@OrIdow6> | arkiver: Also in addition to the above I think that ASAP we should go thru uploaded collections, as well as the temporary storage, find all revisits that we think might be false-positive-revisits (i.e., those with above some set number of A's), and throw those into URLs |
22:36:18 | <@OrIdow6> | Which is doable from the CDXs from what's on IA already |
22:37:49 | <@OrIdow6> | nicolas17: For stuff already on the IA they already calculate the correct hash (due to the issue with chunked encoding or whatever it was) so that's what can be matched against |
22:38:31 | <@OrIdow6> | But the fact that the IA already assumes that the WARC-Payload-Digest values in the WARC are garbage means our big issue here is revisits, not the values of that header per se |
22:41:06 | <@OrIdow6> | "and throw those into URLs" - or maybe AB because URLs might DDOS some stuff |
22:43:20 | <@JAA> | Depends on how many there are, I'd say. |
22:44:25 | <@JAA> | And throwing into #// would possibly need to bypass backfeed. |
22:47:55 | <@OrIdow6> | (On the topic of revisits, I need to check again if those sha1-collision-attack PDFs are mixed up in the WBM...) |
22:48:51 | <nicolas17> | oh no |
22:56:03 | | etnguyen03 (etnguyen03) joins |
22:59:07 | | Wohlstand (Wohlstand) joins |
23:07:08 | | sralracer (sralracer) joins |
23:13:41 | | pixel leaves [Error from remote client] |
23:14:12 | | peo joins |
23:26:37 | <peo> | Hi all! Anyone awake who has knowledge about the grab-site tool? Need to resume an interrupted download if possible
23:27:23 | <@JAA> | Hi peo. That's not supported: https://github.com/ArchiveTeam/grab-site/issues/58 |
23:28:16 | <peo> | yep, as I have read.. just wondered if anyone had a work-around apart from using the "pause when diskspace is low" trick
23:29:17 | <peo> | It filled up the datadir's temp folder because it stumbled on a large file while downloading the whole world..
23:30:31 | <@JAA> | Ah, I bet it's planet-latest.osm.pbf. Classic. |
23:31:37 | <nicolas17> | x_x |
23:35:01 | | loug8318142 quits [Client Quit] |
23:36:31 | <@OrIdow6> | I am running a very slow scan thru some CDXs in collection:archiveteam that are accessible to me for false revisits; if you want the scripts (just a pipeline + GNU parallel, but it took a while since the latter still has pretty bad documentation), contact me
23:37:49 | <that_lurker> | would distributing that job help speed it up? |
23:38:06 | <nicolas17> | I assume IA would become the bottleneck quickly |
23:40:02 | | sralracer quits [Client Quit] |
23:41:28 | <@OrIdow6> | Parallel thinks there are 20h left on the set I'm doing (very reduced, limiting it to stuff I can access + stuff I think is at more imminent risk)