00:26:44 | | etnguyen03 quits [Client Quit] |
00:29:24 | | Webuser814419 joins |
00:30:23 | <Webuser814419> | Another old forum closed, "For a limited time, it is still available in read-only mode": https://www.criticker.com/forum/ |
00:32:07 | | Webuser814419 quits [Client Quit] |
00:40:42 | | holbrooke quits [Client Quit] |
00:46:01 | | holbrooke joins |
00:54:19 | | holbrooke quits [Client Quit] |
01:02:47 | | etnguyen03 (etnguyen03) joins |
01:21:41 | | beardicus (beardicus) joins |
01:44:05 | | BornOn420 quits [Ping timeout: 276 seconds] |
01:44:36 | | BornOn420 (BornOn420) joins |
02:17:00 | <wickedplayer494> | FWIW: ArtemR (former Android Police owner) is also enforcing the TikTok ban-that-might-or-might-not-be on APKMirror for US visitors |
02:17:03 | <wickedplayer494> | https://twitter.com/APKMirror/status/1881509961660068099?mx=1 |
02:17:04 | <eggdrop> | nitter: https://xcancel.com/APKMirror/status/1881509961660068099 |
02:21:26 | | wickedplayer494 quits [Ping timeout: 250 seconds] |
02:22:36 | | wickedplayer494 joins |
02:26:33 | | beardicus quits [Remote host closed the connection] |
02:26:53 | | beardicus (beardicus) joins |
02:29:05 | | etnguyen03 quits [Client Quit] |
02:30:25 | <h2ibot> | Wickedplayer494 edited APKMirror (+418, Can't get around the…): https://wiki.archiveteam.org/?diff=54267&oldid=53414 |
02:30:57 | | wickedplayer494 is now authenticated as wickedplayer494 |
02:31:09 | | pedantic-darwin joins |
02:31:15 | | pedantic-darwin quits [Client Quit] |
02:31:42 | | pedantic-darwin joins |
02:37:04 | | beardicus quits [Read error: Connection reset by peer] |
02:38:32 | | beardicus (beardicus) joins |
02:42:05 | | etnguyen03 (etnguyen03) joins |
02:42:29 | <h2ibot> | JustAnotherArchivist created UK Online Safety Act 2023 (+1322, Created page with "The '''Online Safety Act…): https://wiki.archiveteam.org/?title=UK%20Online%20Safety%20Act%202023 |
02:43:10 | | holbrooke joins |
02:51:21 | <Ryz> | Fuji TV encountering controversy: https://unseen-japan.com/fuji-tv-nakai-masahiro-scandal/ |
02:51:52 | <Ryz> | For those with specialized knowledge of Japan or Japanese culture: help archiving the company's internet presence and related content via #archivebot would be much appreciated
03:03:48 | | opl quits [Quit: Ping timeout (120 seconds)] |
03:04:04 | | opl joins |
03:15:58 | | beardicus quits [Ping timeout: 260 seconds] |
03:19:26 | | beardicus (beardicus) joins |
03:25:24 | | etnguyen03 quits [Remote host closed the connection] |
03:55:04 | | holbrooke quits [Client Quit] |
04:07:34 | | Webuser264426 joins |
04:07:56 | | Webuser264426 quits [Client Quit] |
04:28:35 | | holbrooke joins |
05:01:28 | <@OrIdow6> | !remindme 1d machine |
05:01:29 | <eggdrop> | [remind] ok, i'll remind you at 2025-01-22T05:01:28Z |
05:03:18 | | wickedplayer494 quits [Ping timeout: 260 seconds] |
05:04:14 | | wickedplayer494 joins |
05:04:23 | | wickedplayer494 is now authenticated as wickedplayer494 |
05:21:25 | | klg quits [Quit: bbl] |
05:36:22 | | holbrooke quits [Client Quit] |
05:56:48 | | beardicus quits [Ping timeout: 250 seconds] |
06:19:40 | | klg (klg) joins |
06:44:23 | | klg quits [Client Quit] |
06:52:35 | | beardicus (beardicus) joins |
06:57:03 | | beardicus quits [Ping timeout: 260 seconds] |
07:04:43 | | BlueMaxima quits [Quit: Leaving] |
07:08:40 | | klg (klg) joins |
07:15:29 | <h2ibot> | OrIdow6 edited Archiveteam:IRC (+228, /* EFnet (mostly historical, October 2020 and…): https://wiki.archiveteam.org/?diff=54269&oldid=53944 |
07:31:40 | | Webuser616943 joins |
07:33:04 | | Webuser616943 quits [Client Quit] |
07:41:10 | | loug8318142 joins |
07:59:39 | | mannie (nannie) joins |
07:59:47 | <that_lurker> | And reproductiverights.gov got nuked |
07:59:49 | <mannie> | Yesterday's lists are not visible in the viewer, so I'm sharing them again: main: https://transfer.archivete.am/L0vN2/bankruptcies-NL-2025-jan20-main.txt other references: https://transfer.archivete.am/LZa1F/bankruptcies-NL-2025-jan20-ref.txt ssl-error: https://transfer.archivete.am/U3hdP/bankruptcies-NL-2025-jan20-ssl-error.txt
08:01:18 | <@OrIdow6> | mannie: What are these lists of? |
08:01:58 | <mannie> | All companies that went bankrupt yesterday
08:07:10 | | mannie quits [Remote host closed the connection] |
08:07:45 | <@OrIdow6> | Welp away they go |
08:08:33 | <@OrIdow6> | Something I'd like to do *long-term* would be some way for people like this to do their own thing, in a sandbox, with approval |
08:08:40 | <@OrIdow6> | Some kind of overall data limit etc |
08:09:23 | | BornOn420 quits [Remote host closed the connection] |
08:09:54 | | BornOn420 (BornOn420) joins |
08:16:56 | <@OrIdow6> | !tell mannie Could you please provide more context to this? For instance: what date range does this list cover? Also rather than a long list of references it would be better to have just one or two links per company, to establish that they are going bankrupt; e.g., I cannot find a source for the fact that Sarvision is going bankrupt skimming through the URLs in the list, instead most just seem to establish that the company exists. |
08:16:57 | <eggdrop> | [tell] ok, I'll tell mannie when they join next |
08:17:14 | <@OrIdow6> | Might be too harsh, not sure |
08:20:32 | <@OrIdow6> | JAA: Nice to hear |
08:21:55 | | flotwig_ joins |
08:22:48 | | flotwig quits [Ping timeout: 260 seconds] |
08:22:49 | | flotwig_ is now known as flotwig |
08:23:30 | <@OrIdow6> | Also JAA, I've suggested to myself learning qwarc as an educational exercise for this Niconico thing, so, uh, how to get started? Which branch is the correct latest one to use, is there documentation besides example usage in ia collections, etc? |
08:39:08 | <@arkiver> | it would be great if we could have a little AB pipeline with a Japanese IP |
09:15:53 | | Hackerpcs quits [Ping timeout: 260 seconds] |
09:38:22 | <katia> | OrIdow6, i have a little box in .jp |
09:42:53 | <katia> | pm'd you ip/root pass |
09:43:49 | | Island quits [Read error: Connection reset by peer] |
09:48:15 | | Hackerpcs (Hackerpcs) joins |
10:05:33 | | qwertyasdfuiopghjkl2 (qwertyasdfuiopghjkl2) joins |
10:05:35 | | qwertyasdfuiopghjkl2 quits [Max SendQ exceeded] |
10:11:18 | | qwertyasdfuiopghjkl2 (qwertyasdfuiopghjkl2) joins |
10:35:48 | | emphatic quits [Ping timeout: 260 seconds] |
11:11:15 | <@OrIdow6> | Thanks katia! |
11:11:33 | <@OrIdow6> | nstrom|m: See above, seems you won't need to spend any money after all |
11:31:12 | <flashfire42|m> | https://au.finance.yahoo.com/news/popular-aussie-online-retailer-shut-233233284.html |
11:54:10 | <c3manu> | OrIdow6: mannie is usually checking https://insolventies.rechtspraak.nl/ for new entries of companies that went bankrupt in the Netherlands, then throws domains mentioned there (or googled ones, which might not be the right ones all the time) into something self-programmed to generate a list of subdomains. |
11:56:17 | <c3manu> | those lists are a mess though, and as of december i have stopped doing them in favor of using AB to grab other things. i’ve had some help doing them here and there, but i don’t think many have been done since i stopped running them. |
11:57:24 | <c3manu> | the '-ref' lists are usually references to those places that filed for bankruptcy like opening hour pages, stuff like that. i assume those are compiled via manual web searches. |
12:00:02 | | Bleo18260072271962345 quits [Quit: The Lounge - https://thelounge.chat] |
12:02:52 | | Bleo18260072271962345 joins |
12:22:47 | <@OrIdow6> | c3manu: Ah, thank you for your prior service |
12:23:09 | <@OrIdow6> | Think I have a chance of making it better if I chastise them like I did up there...?
12:23:30 | <@OrIdow6> | My excuse for not wanting to run them myself is not being an ABer |
12:23:44 | <@OrIdow6> | But I do appreciate that they are trying to save these pages |
12:24:18 | <@OrIdow6> | *Do you think I |
12:24:21 | <murb> | OrIdow6: you're not an estuary? |
12:25:34 | <@OrIdow6> | murb: Haha, I meant ArchiveBotter |
12:26:47 | <pabs> | OrIdow6: mannie has been doing those bankruptcies for at least a year, posting them in #archivebot and getting folks to do them. there are a lot, and I personally burned out on doing them due to the volume. I think others may have too |
12:27:10 | <c3manu> | OrIdow6: hm. i think i like the "sandbox" idea of yours, but i have no idea how that would work. |
12:27:22 | <pabs> | the references are links to the companies and are the easy part, just !ao < them |
12:27:37 | <c3manu> | yeah i do that too when i see it |
12:27:44 | <c3manu> | but the other ones require so much manual work |
12:28:05 | <pabs> | personally I thought mannie should probably get AB access, but I think they may not want the extra work |
12:28:51 | <c3manu> | if they were "pre-vetted", like one list of "these work, you can just quickly check and queue them", and then one of "these are broken subdomains or logins" that can be run using '!a <'..
12:29:16 | <pabs> | and of course mannie's lists are just the tip of the iceberg, there are many many more countries where we don't have bankruptcy visibility
12:30:16 | <c3manu> | but the way it is now, all the non-resolving ones are included, ones that are broken on https:// aren’t checked on http://, if the "main website" is 'www.company.com' then 'company.com' is usually missing... |
12:30:23 | <@OrIdow6> | c3manu: It's too advanced for technology to do today sadly |
12:31:14 | <c3manu> | it would actually be easier to reduce them to the domain names and do the subdomain discovery yourself. it’s less effort than working through that list, but it’s still effort |
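A rough sketch of the cleanup c3manu describes (all function names here are made up for illustration): collapse the list to unique hostnames, add the apex domain whenever only `www.` was listed, and emit both `https://` and `http://` candidates so hosts broken on one scheme still get checked on the other.

```python
from urllib.parse import urlparse

def normalize_hosts(lines):
    """Reduce a messy URL/hostname list to unique hostnames,
    adding the apex domain whenever only 'www.' was listed."""
    hosts = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Accept both bare hostnames and full URLs.
        parsed = urlparse(line if "://" in line else "//" + line)
        host = (parsed.hostname or "").lower()
        if not host:
            continue
        hosts.add(host)
        if host.startswith("www."):
            hosts.add(host[len("www."):])  # 'company.com' alongside 'www.company.com'
    return sorted(hosts)

def candidate_urls(host):
    """Try https first, then fall back to http, since the lists often
    contain only whichever scheme happened to be pasted."""
    return [f"https://{host}/", f"http://{host}/"]
```

Resolving which candidates actually respond would still need a DNS/HTTP pass on top of this.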
12:31:47 | <@OrIdow6> | pabs: From what I'm hearing mannie doesn't strike me as malicious or destructive so I'd be in support of it, but like I say that's not my area and not my decision to make
12:31:51 | <pabs> | I was only doing it before mannie started doing the subdomains too... |
12:32:31 | <pabs> | I asked mannie about it a while ago and they only said they would think about it |
12:32:49 | <@OrIdow6> | It's a shame but true that there are more websites shutting down than we have volunteer labor here |
12:33:11 | <@OrIdow6> | I think *some* stuff can be automated eventually but that's the state of things now |
12:33:41 | <c3manu> | well..sometimes there are mistakes. like some small "F5 Logistics" company went bankrupt and www.f5.com was added to the list. and me not really knowing, and being overwhelmed by the list already, i just queued that >.<
12:34:04 | <c3manu> | yeah, i think we’re a little understaffed here as well :D |
12:34:16 | <pabs> | the subdomain stuff is why I think we need an automated DPoS based subdomain/URL enumerator with software/service type detection and AB/wikibot/Y/jseater/etc job command generation |
12:35:01 | <@OrIdow6> | Maybe I'm just in a weird mood today but I think an ultimatum to them might be the answer, "no one has enough free time/patience/mental bandwidth to do this, you're going to need to be your own advocate here" |
12:35:07 | <@OrIdow6> | pabs: Yeah would be nice |
12:35:40 | <@OrIdow6> | Maybe also something that compares how a headless browser behaves to a js-oblivious one, to try to tell if it can be AB'd effectively |
12:35:52 | <pabs> | indeed |
12:36:25 | <pabs> | lots of other tricks we could code too, like detecting if /pipermail or /pipermail/ work on Mailman 2 sites |
12:36:33 | <c3manu> | true |
12:36:35 | | SkilledAlpaca418962 quits [Quit: SkilledAlpaca418962] |
12:37:14 | <pabs> | I expect everyone has their favourite site scouting techniques |
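pabs' Mailman 2 example can be sketched like this; `probe` uses only stdlib urllib, and a 200 on either path variant is merely a hint, not proof, of a browsable pipermail archive:

```python
import urllib.request
import urllib.error

def pipermail_candidates(base):
    """Both path variants mentioned above; Mailman 2 installs differ
    on whether the trailing-slash form works."""
    base = base.rstrip("/")
    return [base + "/pipermail", base + "/pipermail/"]

def probe(url, timeout=10):
    """Return the HTTP status code for url, or None on network errors.
    (A 200 here *suggests* a browsable Mailman 2 archive.)"""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code
    except (urllib.error.URLError, OSError):
        return None
```

A real scanner would also want to distinguish a soft-404 landing page from an actual archive index.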
12:37:47 | | SkilledAlpaca418962 joins |
12:38:53 | <@OrIdow6> | Huh, on the "sandbox" thing I guess you could have some bot that proxies commands for them, and denies them if they're on the wrong job ID/if they request too high a rate limit/etc... |
12:40:06 | <c3manu> | i think crafting those restrictions would be really much effort actually |
12:41:57 | <c3manu> | like "please do not queue wikipedia.org", "that huge website should be archived, but not on *that* pipeline"... |
12:42:34 | <c3manu> | "this one has a session ID so needs to be run as https://random-forum.com/?archiveteam" |
12:43:26 | <c3manu> | "please make sure to run Shopify pages only with one worker, and only on pipelines without any other active Shopify jobs" |
12:43:45 | <steering> | surely there's some open source "web scanner" to do that -> < pabs> the subdomain stuff is why I think we need an automated DPoS based subdomain/URL enumerator with >>software/service type detection<< |
12:44:47 | <steering> | (although it would be nice to have it tuned for "look for the stuff that archiveteam has tooling for") |
12:45:31 | <pabs> | maybe, but probably none that could download to WARC or read sites from a WARC |
12:45:42 | <pabs> | or be used in a DPoS setup |
12:46:51 | <pabs> | on AT-related scanners, I've written a Python thing for finding wikis and generating #wikibot commands, and that_lurker has a Blogger detector |
12:49:18 | | IRC2DC joins |
12:55:14 | <that_lurker> | I do? o_O |
12:57:11 | <pabs> | woops, it was <thuban> https://transfer.archivete.am/inline/PUhGC/blogspot-checker.py |
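One common trick such scanners rely on (a generic illustration, not necessarily what thuban's script does) is reading the `<meta name="generator">` tag that many CMSes and wiki engines emit:

```python
from html.parser import HTMLParser

class GeneratorSniffer(HTMLParser):
    """Pull the <meta name="generator"> value out of a page; many platforms
    (MediaWiki, WordPress, Discourse, ...) announce themselves there."""
    def __init__(self):
        super().__init__()
        self.generator = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "generator":
            self.generator = a.get("content")

def detect_generator(html):
    sniffer = GeneratorSniffer()
    sniffer.feed(html)
    return sniffer.generator
```

Sites that strip the tag would need fallback heuristics (paths like `/wp-content/`, cookie names, etc.).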
13:01:46 | | NF885 (NF885) joins |
13:01:59 | | beardicus (beardicus) joins |
13:02:00 | | NF885 quits [Client Quit] |
13:05:58 | | NF885 (NF885) joins |
13:06:02 | | eroc1990 quits [Quit: Ping timeout (120 seconds)] |
13:06:22 | | eroc1990 (eroc1990) joins |
13:06:40 | <NF885> | FYI looks like /sitemap.xml no longer redirects to the sitemap index for some reason on the Biden White House archive site (and /robots.txt doesn't exist) |
13:06:50 | | NF885 quits [Client Quit] |
13:08:30 | | NF885 (NF885) joins |
13:08:38 | <NF885> | the sitemap indexes are still at https://bidenwhitehouse.archives.gov/sitemap_index.xml and https://bidenwhitehouse.archives.gov/es/sitemap_index.xml, though |
13:09:30 | <NF885> | (also I probably shouldn't be trying to send this on mobile data) |
13:10:11 | | NF885 quits [Client Quit] |
13:25:23 | <h2ibot> | Manu edited Discourse/archived (+99, queued community.openenergymonitor.org): https://wiki.archiveteam.org/?diff=54270&oldid=54191 |
14:17:10 | <masterx244|m> | crunching some data atm to verify links to some firmware files (approx 20GB) for a !ao< run. sussing out dead links with some wget prodding right now |
14:20:52 | <TheTechRobo> | OrIdow6: Re qwarc, I believe the correct branch is 0.2. Writing a spec file didn't seem too difficult last time I looked into it. The general idea seems to be that you define subclasses of `qwarc.Item`. qwarc will call the `generate` function on each of them to create the initial set of items. Then you can call `add_subitem` to queue any new items |
14:20:52 | <TheTechRobo> | (i.e. backfeed). |
14:21:47 | <TheTechRobo> | The `process` callback is where you actually do stuff. |
14:22:22 | <TheTechRobo> | And of course, you're on your own for parsing |
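From that description alone, a spec file might be shaped roughly like this. The `Item` base class below is a stand-in so the sketch runs without qwarc installed; the real `qwarc.Item` API is assumed from TheTechRobo's summary, so method names and signatures may well differ:

```python
# Stand-in for qwarc.Item so this sketch is self-contained; with qwarc
# installed you would subclass qwarc.Item instead.
class Item:
    def __init__(self):
        self.subitems = []

    def add_subitem(self, item_type, value):
        # In qwarc this queues a new item (backfeed); here we just record it.
        self.subitems.append((item_type, value))

class Page(Item):
    itemType = "page"

    @classmethod
    def generate(cls):
        # qwarc calls this on each Item subclass to create the initial
        # set of item values.
        yield "https://example.org/"

    def process(self):
        # The per-item callback where the actual fetching happens; a real
        # spec would retrieve the item's URL here, parse it itself, and
        # backfeed each discovered link:
        for link in ("https://example.org/about",):
            self.add_subitem("page", link)
```

Reading one of the spec files in the IA collections alongside this skeleton is probably the fastest way to learn the real API.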
14:28:05 | <@arkiver> | a distributed Warrior project that could be compared to AB will be introduced at #Y , but will still take some time |
14:29:17 | <@arkiver> | however that will still require work as well with maintaining jobs, ignoring stuff, etc. |
14:30:04 | <masterx244|m> | dishing out the ignores is the main PITA. loops and other traps usually only appear further down in an archiving job
14:30:35 | <@arkiver> | yeah |
14:30:52 | <masterx244|m> | (had to grab something for personal archival recently, too: a closed-off page where i pulled myself a backup, and a few traps waited there too that needed a few dirty ignores to squish)
14:31:07 | <@arkiver> | as for access to AB for mannie - i think it may be fine? but i believe that is largely up to JAA , though i'm not sure on the state of AB at the moment when it comes to this |
14:32:22 | <masterx244|m> | switched to Arch at the start of this year to finally wall off the last remaining windows on my HW. much easier now that i got the same toolings that i got on my server on my main computer, quickly spun up grab-site for that one job |
14:45:21 | <masterx244|m> | murphy's law, just when you need transfer.archiveteam.org its dead |
14:46:30 | <masterx244|m> | *transfer.archivete.am |
14:47:57 | <masterx244|m> | nope, somehow the bash snippet bricked itself.... |
14:48:07 | <masterx244|m> | list processed https://transfer.archivete.am/Cu73P/senafirmware_ab.lst |
14:48:07 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/Cu73P/senafirmware_ab.lst |
14:54:54 | <masterx244|m> | need to figure out why 3 files went missingno. even though i got them in my local archive (some dev-versions that they never intended to be caught but my automatic monitoring got that stuff even though it was just online for a short time) |
15:17:40 | | holbrooke joins |
15:42:06 | | BornOn420 quits [Remote host closed the connection] |
15:42:34 | | BornOn420 (BornOn420) joins |
15:59:09 | <@JAA> | arkiver: IIRC, they've been offered access before but didn't want it. Something about compiling the list taking long enough and not wanting to invest the time into understanding AB and keeping an eye on the jobs. |
15:59:49 | <@JAA> | c3manu or pokechu22 might be able to confirm. |
16:05:50 | <c3manu> | i think i vaguely remember something like that, yeah |
16:08:05 | <@JAA> | OrIdow6: Re qwarc, there is no documentation. You want the latest tag, v0.2.8. v0.2.6 and v0.2.7 are also fine; the only changes are support for HEAD requests and overriding the default 1-minute timeout. Anything before that is based on warcio and shouldn't be used. `pip install git+https://gitea.arpa.li/JustAnotherArchivist/qwarc.git@v0.2.8` on its own doesn't work because a dependency of aiohttp |
16:08:11 | <@JAA> | changed something long after the release; you need `pip install --upgrade async-timeout==3.0.1` afterwards to fix that. |
16:09:58 | <@JAA> | For the spec file, yep, what TheTechRobo wrote. |
16:10:14 | <@JAA> | The examples on IA should help with that. |
16:13:00 | | beardicus quits [Ping timeout: 250 seconds] |
16:15:47 | <@JAA> | The other thing to be aware of is that qwarc's memory usage grows over time. I've never been able to figure out what causes that; the best guess is heap fragmentation. There's a magic environment variable that helps but doesn't fix it; I don't have it handy right now. It's why --memorylimit is a thing. I normally run qwarc in a `while [[ ( ! -e "qwarc.db" || $(sqlite3 "qwarc.db" 'SELECT COUNT(id) FROM |
16:15:53 | <@JAA> | items WHERE status != 2') -gt 0 ) && ! -e STOP ]]` loop for that reason. |
16:16:37 | <@JAA> | (The environment variable technically lowers the performance a bit due to less efficient memory allocations, but I haven't noticed it in practice.) |
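The condition in that shell loop can also be checked from Python with the stdlib `sqlite3` module; the table and column names (`items`, `status != 2` meaning done) are taken from JAA's snippet, everything else is illustrative:

```python
import os
import sqlite3

def qwarc_unfinished(db_path="qwarc.db"):
    """True if the qwarc database doesn't exist yet (first run) or still
    contains items whose status isn't 2 (i.e. not yet done)."""
    if not os.path.exists(db_path):
        return True
    conn = sqlite3.connect(db_path)
    try:
        (count,) = conn.execute(
            "SELECT COUNT(id) FROM items WHERE status != 2"
        ).fetchone()
    finally:
        conn.close()
    return count > 0
```

A supervisor script would re-launch qwarc while this returns True (and a STOP sentinel file is absent), mirroring the bash loop above.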
16:19:49 | | holbrooke quits [Client Quit] |
16:33:37 | | beardicus (beardicus) joins |
17:08:54 | | lflare quits [Ping timeout: 250 seconds] |
17:11:34 | | emphatic joins |
17:29:19 | | sec^nd quits [Remote host closed the connection] |
17:29:42 | | sec^nd (second) joins |
17:33:56 | <@imer> | Blueacid: no worries, happens :) |
17:36:00 | <Blueacid> | No worries! Just seeing familiar filenames flying past when watching the warrior doing its thing, and I wondered whether there was any point in (trying to?) do a hash-based dedupe. But I surmise that yes, we might save maybe a few terabytes, which is great, but then the Blogger job alone has stored 1.5PB, so... drop in the ocean |
17:37:56 | | beardicus quits [Ping timeout: 250 seconds] |
17:45:43 | | balrog quits [Ping timeout: 260 seconds] |
17:49:27 | <szczot3k> | Running a dedup on IA would probably need more resources than it'd save (in terms of disk space)
17:55:03 | | le0n quits [Ping timeout: 260 seconds] |
17:56:28 | | le0n (le0n) joins |
17:57:08 | | balrog (balrog) joins |
18:06:15 | | beardicus (beardicus) joins |
18:12:48 | | balrog quits [Client Quit] |
18:13:11 | | icedice (icedice) joins |
18:18:55 | <yzqzss> | I'm archiving niconico shunga
18:19:36 | | balrog (balrog) joins |
18:23:26 | <TheTechRobo> | Blueacid: Warrior projects tend to do some dedup, where if the same thing is fetched multiple times in a Wget process (there is generally one process per handful of items), it will be deduped. Also, they use Zstandard with custom dictionaries that are trained on the WARC files, which means that redundant data is basically compressed to nonexistence |
18:24:00 | <TheTechRobo> | As sz.czot3k said, deduplicating over every WARC would probably be more trouble than it's worth
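The effect of those trained dictionaries can be illustrated with zlib's preset-dictionary support from the Python stdlib; zstd's custom dictionaries work the same way conceptually, except they are trained on sampled WARC data rather than hand-picked as here:

```python
import zlib

# Boilerplate shared across many captures; in a Warrior project the zstd
# dictionary is *trained* on real WARC data rather than hand-picked.
shared = (b"<!DOCTYPE html><html><head><meta charset='utf-8'>"
          b"<title>Company page</title></head><body><div class='header'>"
          b"<nav><a href='/'>Home</a><a href='/about'>About</a></nav></div>")
record = shared + b"<p>Unique text for this one capture.</p></body></html>"

def deflate(data, zdict=None):
    """Compress with an optional preset dictionary (zlib's analogue of
    zstd's custom dictionaries)."""
    comp = (zlib.compressobj(level=9, zdict=zdict) if zdict
            else zlib.compressobj(level=9))
    return comp.compress(data) + comp.flush()
```

With the dictionary, the shared boilerplate is encoded as back-references into it instead of being stored again, so `deflate(record, zdict=shared)` comes out noticeably smaller than `deflate(record)`.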
18:25:54 | | aninternettroll quits [Remote host closed the connection] |
18:28:01 | | aninternettroll (aninternettroll) joins |
18:44:54 | | qinplus_phone joins |
19:04:28 | | th3z0l4 quits [Ping timeout: 260 seconds] |
19:04:52 | | th3z0l4 joins |
19:05:24 | | lennier2_ joins |
19:19:20 | | beardicus quits [Ping timeout: 250 seconds] |
19:41:48 | | lennier2_ quits [Ping timeout: 260 seconds] |
19:42:17 | | lennier2_ joins |
19:42:40 | | beardicus (beardicus) joins |
19:43:42 | | lennier2 joins |
19:46:45 | | ` joins
19:47:03 | | beardicus quits [Ping timeout: 260 seconds] |
19:47:30 | | lennier2_ quits [Ping timeout: 250 seconds] |
19:48:20 | <nicolas17> | `: that's annoying |
19:48:43 | <`> | nicolas17: what is? |
19:49:23 | <nicolas17> | your nickname >:o |
19:49:38 | <`> | ur face is annoying |
19:51:30 | | Radzig2 joins |
19:53:08 | | Radzig quits [Ping timeout: 250 seconds] |
19:53:08 | | Radzig2 is now known as Radzig |
19:54:13 | | steering tries to get the speck of dust off his monitor |
20:03:55 | <Ryz> | arkiver/JAA, regarding #Y - yeah, the ignores are really going to be a main stickler; like damn, I do my rounds on doing ignores on existing jobs whenever I can, but it's oofy and a bit draining over time |
20:04:29 | <Ryz> | Also on having to get some intuition on when something seems to go wrong or might go in a bad direction |
20:05:14 | <masterx244|m> | addendum that i forgot to tell on the list i posted earlier: the 20GB mentioned earlier is the total size of the entire fileset together. the files are 2 to 6MB each |
20:05:57 | | beardicus (beardicus) joins |
20:06:22 | <masterx244|m> | true on that. some rabbitholes are really unintuitive until the crawl is really deep inside them. and if they happen when nobody is around they can waste a good bunch of crawltime |
20:07:33 | <masterx244|m> | vanillaforums-based forums have a nasty one, for example: when a badly formatted link appears in a post, the URL gets added as a relative one to the post URL, but the post page doesn't redirect to the canonical form and instead keeps that extra part; the same page of the topic loads again, another segment gets added, and the game repeats
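An ignore pattern for that kind of growing-URL trap might flag any URL whose path immediately repeats a chunk of itself. This is an illustration only, not a battle-tested ArchiveBot ignore; paths like `/en/en/` can occur legitimately, so it would need per-site tuning:

```python
import re

# Any path chunk (starting with '/') that is immediately repeated,
# e.g. /discussion/123/topic/discussion/123/topic
LOOP_TRAP = re.compile(r"(/[^?#]+?)\1")

def looks_like_loop(url):
    """Heuristic: True if the URL path repeats a chunk of itself."""
    return LOOP_TRAP.search(url) is not None
```

In practice you'd run candidates through this before queueing, and hand-check anything it flags rather than dropping it blindly.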
20:22:38 | | trix (trix) joins |
20:24:05 | | lennier2_ joins |
20:27:18 | | lennier2 quits [Ping timeout: 260 seconds] |
20:31:25 | | beardicus quits [Remote host closed the connection] |
20:31:45 | | beardicus (beardicus) joins |
20:34:12 | | BlueMaxima joins |
20:44:48 | | th3z0l4_ joins |
20:45:22 | | th3z0l4 quits [Remote host closed the connection] |
21:04:44 | | qinplus_phone quits [Client Quit] |
21:08:00 | | Gadelhas5628737 quits [Quit: Ping timeout (120 seconds)] |
21:30:18 | | beardicus quits [Ping timeout: 260 seconds] |
21:31:48 | | beardicus (beardicus) joins |
21:37:05 | | SootBector quits [Remote host closed the connection] |
21:37:25 | | SootBector (SootBector) joins |
21:39:56 | | holbrooke joins |
21:45:28 | | Webuser663884 joins |
21:45:38 | | Webuser663884 quits [Client Quit] |
21:46:04 | <Blueacid> | szczot3k and TheTechRobo - cheers for the answers, appreciated :) |
21:46:16 | <Blueacid> | Very much as I suspected, but wanted to ask :) |
21:47:47 | <szczot3k> | Cost of any dedup will most likely be much more than adding "another disk"
21:50:02 | <Blueacid> | Yeah, I figured so! |
21:50:13 | <@JAA> | Yeah, especially when you include the non-technical costs: developing this thing, making sure it doesn't accidentally delete data, etc.
21:54:38 | <Blueacid> | Yeah the risk is probably the hugest thing - how can you guarantee you've not fouled something up? |
21:54:48 | <Blueacid> | Cheers for humouring me thinking out loud! |
21:55:05 | | etnguyen03 (etnguyen03) joins |
21:55:30 | <szczot3k> | Doing dedup currently would mean (at least) reading every file and checksumming it
21:55:33 | <szczot3k> | Which is a huge task |
21:57:51 | <Blueacid> | I agree it's a silly endeavour now - with that said, we wouldn't need to read every file; just the first few bytes of every file with a non-unique filesize. And if those match, then & only then would you checksum them |
21:57:58 | <Blueacid> | that'd reduce the disk / cpu somewhat? |
21:58:36 | <szczot3k> | Then you get false positives |
21:59:19 | <szczot3k> | At this scale, doing so, would lead to false positives |
21:59:22 | <Blueacid> | How so? If you've just checked the first few bytes and they match, then you proceed to fully checksum those files to prove if they're byte-for-byte. But if the first few bytes differ, clearly different file contents; move on |
22:00:38 | <szczot3k> | You'd first want to build a full map of the archive, with checksums. You can't go with an approach of "let's look at this one file and compare it to every other file"; you'd end up reading every file (at least the start of it) billions of times
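The scheme being discussed, cheap passes first (size, then a short prefix), full hashes only for the survivors, computed over one full map rather than pairwise, might look like this sketch (the `files` mapping stands in for reading from disk):

```python
import hashlib
from collections import defaultdict

def find_duplicates(files):
    """files: mapping of name -> bytes (a stand-in for disk reads).
    Returns groups of names whose contents are byte-identical.
    Cheap passes first (size, then first 64 bytes), a full hash only
    for the survivors."""
    by_size = defaultdict(list)
    for name, data in files.items():
        by_size[len(data)].append(name)

    by_prefix = defaultdict(list)
    for names in by_size.values():
        if len(names) < 2:
            continue  # unique size: cannot be a duplicate
        for name in names:
            by_prefix[(len(files[name]), files[name][:64])].append(name)

    groups = defaultdict(list)
    for names in by_prefix.values():
        if len(names) < 2:
            continue  # unique prefix: cannot be a duplicate
        for name in names:
            groups[hashlib.sha256(files[name]).hexdigest()].append(name)
    return [g for g in groups.values() if len(g) > 1]
```

At IA scale the map itself is the expensive part; this just shows how the passes prune the work.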
22:04:18 | | beardicus quits [Remote host closed the connection] |
22:04:39 | | beardicus (beardicus) joins |
22:08:07 | <Blueacid> | Ah yeah, shoot, you're right |
22:08:10 | <Blueacid> | Mea culpa |
22:12:32 | | HP_Archivist (HP_Archivist) joins |
22:15:48 | | beardicus quits [Ping timeout: 260 seconds] |
22:16:59 | | icedice quits [Quit: Leaving] |
22:34:52 | <Barto> | https://i.imgur.com/99suCYs.png internet goes brrr :-) |
22:37:22 | <@JAA> | Most archives use SHA-1 hashes, and there are *definitely* collisions in the collection. |
22:38:05 | <@JAA> | The digests already exist in the index, so that part is kind of done, but you still need to deal with the collisions. |
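Building on that: with digests already in the index, candidate groups come for free, but same-SHA-1 files still have to be byte-compared before any merge, exactly because of collisions like the SHAttered PDFs. A sketch (the `index`/`read` interfaces here are made up for illustration):

```python
from collections import defaultdict

def confirmed_duplicates(index, read):
    """index: iterable of (name, sha1_hex) pairs, as an existing index
    might provide; read: callable returning a file's bytes.
    Groups by digest, then byte-compares within each bucket so that a
    SHA-1 collision is never merged."""
    buckets = defaultdict(list)
    for name, digest in index:
        buckets[digest].append(name)

    groups = []
    for names in buckets.values():
        if len(names) < 2:
            continue
        by_content = defaultdict(list)
        for name in names:
            by_content[read(name)].append(name)  # full byte comparison
        groups.extend(g for g in by_content.values() if len(g) > 1)
    return groups
```

The byte-compare step is what makes the scheme safe; the digest only narrows down where to look.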
22:45:16 | | beardicus (beardicus) joins |
22:46:14 | | loug8318142 quits [Quit: The Lounge - https://thelounge.chat] |
22:48:23 | | holbrooke quits [Client Quit] |
22:57:31 | <szczot3k> | Yeah, at this scale collisions are something that actually matters
23:01:25 | <@JAA> | I mean, one WARC of the SHAttered website is enough. :-) |
23:07:20 | | Island joins |
23:15:30 | | beardicus quits [Ping timeout: 250 seconds] |
23:18:45 | | nicolas17 is now authenticated as nicolas17 |
23:23:46 | <@OrIdow6> | yzqzss: Details? Would I have known this if I was in the other stwp chat? |
23:24:11 | | beardicus (beardicus) joins |
23:31:58 | | beardicus quits [Ping timeout: 250 seconds] |
23:36:55 | | beardicus (beardicus) joins |
23:42:58 | | holbrooke joins |
23:44:54 | | holbrooke quits [Client Quit] |
23:48:33 | | yasomi quits [Ping timeout: 260 seconds] |
23:50:37 | | yasomi (yasomi) joins |
23:55:33 | | beardicus quits [Ping timeout: 260 seconds] |