00:26:44 | | etnguyen03 quits [Client Quit] |
00:29:24 | | Webuser814419 joins |
00:30:23 | <Webuser814419> | Another old forum closed, "For a limited time, it is still available in read-only mode": https://www.criticker.com/forum/ |
00:32:07 | | Webuser814419 quits [Client Quit] |
00:40:42 | | holbrooke quits [Client Quit] |
00:46:01 | | holbrooke joins |
00:54:19 | | holbrooke quits [Client Quit] |
01:02:47 | | etnguyen03 (etnguyen03) joins |
01:21:41 | | beardicus (beardicus) joins |
01:44:05 | | BornOn420 quits [Ping timeout: 276 seconds] |
01:44:36 | | BornOn420 (BornOn420) joins |
02:17:00 | <wickedplayer494> | FWIW: ArtemR (former Android Police owner) is also enforcing the TikTok ban-that-might-or-might-not-be on APKMirror for US visitors |
02:17:03 | <wickedplayer494> | https://twitter.com/APKMirror/status/1881509961660068099?mx=1 |
02:17:04 | <eggdrop> | nitter: https://xcancel.com/APKMirror/status/1881509961660068099 |
02:21:26 | | wickedplayer494 quits [Ping timeout: 250 seconds] |
02:22:36 | | wickedplayer494 joins |
02:26:33 | | beardicus quits [Remote host closed the connection] |
02:26:53 | | beardicus (beardicus) joins |
02:29:05 | | etnguyen03 quits [Client Quit] |
02:30:25 | <h2ibot> | Wickedplayer494 edited APKMirror (+418, Can't get around the…): https://wiki.archiveteam.org/?diff=54267&oldid=53414 |
02:30:57 | | wickedplayer494 is now authenticated as wickedplayer494 |
02:31:09 | | pedantic-darwin joins |
02:31:15 | | pedantic-darwin quits [Client Quit] |
02:31:42 | | pedantic-darwin joins |
02:37:04 | | beardicus quits [Read error: Connection reset by peer] |
02:38:32 | | beardicus (beardicus) joins |
02:42:05 | | etnguyen03 (etnguyen03) joins |
02:42:29 | <h2ibot> | JustAnotherArchivist created UK Online Safety Act 2023 (+1322, Created page with "The '''Online Safety Act…): https://wiki.archiveteam.org/?title=UK%20Online%20Safety%20Act%202023 |
02:43:10 | | holbrooke joins |
02:51:21 | <Ryz> | Fuji TV encountering controversy: https://unseen-japan.com/fuji-tv-nakai-masahiro-scandal/ |
02:51:52 | <Ryz> | For those with specialized knowledge of Japan or Japanese culture: help archiving the company's internet presence and related content via #archivebot would be much appreciated
03:03:48 | | opl quits [Quit: Ping timeout (120 seconds)] |
03:04:04 | | opl joins |
03:15:58 | | beardicus quits [Ping timeout: 260 seconds] |
03:19:26 | | beardicus (beardicus) joins |
03:25:24 | | etnguyen03 quits [Remote host closed the connection] |
03:55:04 | | holbrooke quits [Client Quit] |
04:07:34 | | Webuser264426 joins |
04:07:56 | | Webuser264426 quits [Client Quit] |
04:28:35 | | holbrooke joins |
05:01:28 | <@OrIdow6> | !remindme 1d machine |
05:01:29 | <eggdrop> | [remind] ok, i'll remind you at 2025-01-22T05:01:28Z |
05:03:18 | | wickedplayer494 quits [Ping timeout: 260 seconds] |
05:04:14 | | wickedplayer494 joins |
05:04:23 | | wickedplayer494 is now authenticated as wickedplayer494 |
05:21:25 | | klg quits [Quit: bbl] |
05:36:22 | | holbrooke quits [Client Quit] |
05:56:48 | | beardicus quits [Ping timeout: 250 seconds] |
06:19:40 | | klg (klg) joins |
06:44:23 | | klg quits [Client Quit] |
06:52:35 | | beardicus (beardicus) joins |
06:57:03 | | beardicus quits [Ping timeout: 260 seconds] |
07:04:43 | | BlueMaxima quits [Quit: Leaving] |
07:08:40 | | klg (klg) joins |
07:15:29 | <h2ibot> | OrIdow6 edited Archiveteam:IRC (+228, /* EFnet (mostly historical, October 2020 and…): https://wiki.archiveteam.org/?diff=54269&oldid=53944 |
07:31:40 | | Webuser616943 joins |
07:33:04 | | Webuser616943 quits [Client Quit] |
07:41:10 | | loug8318142 joins |
07:59:39 | | mannie (nannie) joins |
07:59:47 | <that_lurker> | And reproductiverights.gov got nuked |
07:59:49 | <mannie> | Yesterday's lists are not visible in the viewer, so I'm sharing them again: main: https://transfer.archivete.am/L0vN2/bankruptcies-NL-2025-jan20-main.txt other references: https://transfer.archivete.am/LZa1F/bankruptcies-NL-2025-jan20-ref.txt ssl-error: https://transfer.archivete.am/U3hdP/bankruptcies-NL-2025-jan20-ssl-error.txt
08:01:18 | <@OrIdow6> | mannie: What are these lists of? |
08:01:58 | <mannie> | All companies that went bankrupt yesterday
08:07:10 | | mannie quits [Remote host closed the connection] |
08:07:45 | <@OrIdow6> | Welp away they go |
08:08:33 | <@OrIdow6> | Something I'd like to do *long-term* would be some way for people like this to do their own thing, in a sandbox, with approval |
08:08:40 | <@OrIdow6> | Some kind of overall data limit etc |
08:09:23 | | BornOn420 quits [Remote host closed the connection] |
08:09:54 | | BornOn420 (BornOn420) joins |
08:16:56 | <@OrIdow6> | !tell mannie Could you please provide more context to this? For instance: what date range does this list cover? Also rather than a long list of references it would be better to have just one or two links per company, to establish that they are going bankrupt; e.g., I cannot find a source for the fact that Sarvision is going bankrupt skimming through the URLs in the list, instead most just seem to establish that the company exists. |
08:16:57 | <eggdrop> | [tell] ok, I'll tell mannie when they join next |
08:17:14 | <@OrIdow6> | Might be too harsh, not sure |
08:20:32 | <@OrIdow6> | JAA: Nice to hear |
08:21:55 | | flotwig_ joins |
08:22:48 | | flotwig quits [Ping timeout: 260 seconds] |
08:22:49 | | flotwig_ is now known as flotwig |
08:23:30 | <@OrIdow6> | Also JAA, I've suggested to myself learning qwarc as an educational exercise for this Niconico thing, so, uh, how to get started? Which branch is the correct latest one to use, is there documentation besides example usage in ia collections, etc? |
08:39:08 | <@arkiver> | it would be great if we could have a little AB pipeline with a Japanese IP |
09:15:53 | | Hackerpcs quits [Ping timeout: 260 seconds] |
09:38:22 | <katia> | OrIdow6, i have a little box in .jp |
09:42:53 | <katia> | pm'd you ip/root pass |
09:43:49 | | Island quits [Read error: Connection reset by peer] |
09:48:15 | | Hackerpcs (Hackerpcs) joins |
10:05:33 | | qwertyasdfuiopghjkl2 (qwertyasdfuiopghjkl2) joins |
10:05:35 | | qwertyasdfuiopghjkl2 quits [Max SendQ exceeded] |
10:11:18 | | qwertyasdfuiopghjkl2 (qwertyasdfuiopghjkl2) joins |
10:35:48 | | emphatic quits [Ping timeout: 260 seconds] |
11:11:15 | <@OrIdow6> | Thanks katia! |
11:11:33 | <@OrIdow6> | nstrom|m: See above, seems you won't need to spend any money after all |
11:31:12 | <flashfire42|m> | https://au.finance.yahoo.com/news/popular-aussie-online-retailer-shut-233233284.html |
11:54:10 | <c3manu> | OrIdow6: mannie is usually checking https://insolventies.rechtspraak.nl/ for new entries of companies that went bankrupt in the Netherlands, then throws domains mentioned there (or googled ones, which might not be the right ones all the time) into something self-programmed to generate a list of subdomains. |
11:56:17 | <c3manu> | those lists are a mess though, and as of december i have stopped doing them in favor of using AB to grab other things. i’ve had some help doing them here and there, but i don’t think many have been done since i stopped running them. |
11:57:24 | <c3manu> | the '-ref' lists are usually references to those places that filed for bankruptcy like opening hour pages, stuff like that. i assume those are compiled via manual web searches. |
12:00:02 | | Bleo18260072271962345 quits [Quit: The Lounge - https://thelounge.chat] |
12:02:52 | | Bleo18260072271962345 joins |
12:22:47 | <@OrIdow6> | c3manu: Ah, thank you for your prior service |
12:23:09 | <@OrIdow6> | Think I have a chance of making it better if I chastise them like I did up there...?
12:23:30 | <@OrIdow6> | My excuse for not wanting to run them myself is not being an ABer |
12:23:44 | <@OrIdow6> | But I do appreciate that they are trying to save these pages |
12:24:18 | <@OrIdow6> | *Do you think I |
12:24:21 | <murb> | OrIdow6: you're not an estuary? |
12:25:34 | <@OrIdow6> | murb: Haha, I meant ArchiveBotter |
12:26:47 | <pabs> | OrIdow6: mannie has been doing those bankruptcies for at least a year, posting them in #archivebot and getting folks to do them. there are a lot, and I personally burned out on doing them due to the volume. I think others may have too |
12:27:10 | <c3manu> | OrIdow6: hm. i think i like the "sandbox" idea of yours, but i have no idea how that would work. |
12:27:22 | <pabs> | the references are links to the companies and are the easy part, just !ao < them |
12:27:37 | <c3manu> | yeah i do that too when i see it |
12:27:44 | <c3manu> | but the other ones require so much manual work |
12:28:05 | <pabs> | personally I thought mannie should probably get AB access, but I think they may not want the extra work |
12:28:51 | <c3manu> | if they were "pre-vetted", like one list of "these work, you can just quickly check and queue them", and then one of "these are broken subdomains or logins" that can be run using '!a <'..
12:29:16 | <pabs> | and of course mannie's lists are just the tip of the iceberg, there are many many more countries where we don't have bankruptcy visibility
12:30:16 | <c3manu> | but the way it is now, all the non-resolving ones are included, ones that are broken on https:// aren’t checked on http://, if the "main website" is 'www.company.com' then 'company.com' is usually missing... |
12:30:23 | <@OrIdow6> | c3manu: It's too advanced for technology to do today sadly |
12:31:14 | <c3manu> | it would actually be easier to reduce them to the domain names and do the subdomain discovery yourself. it’s less effort than working through that list, but it’s still effort |
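A rough sketch of the cleanup c3manu describes (all function names here are made up for illustration): collapse the list to unique hostnames, add the apex domain whenever only `www.` was listed, and emit both `https://` and `http://` candidates so hosts broken on one scheme still get checked on the other.

```python
from urllib.parse import urlparse

def normalize_hosts(lines):
    """Reduce a messy URL/hostname list to unique hostnames,
    adding the apex domain whenever only 'www.' was listed."""
    hosts = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Accept both bare hostnames and full URLs.
        parsed = urlparse(line if "://" in line else "//" + line)
        host = (parsed.hostname or "").lower()
        if not host:
            continue
        hosts.add(host)
        if host.startswith("www."):
            hosts.add(host[len("www."):])  # 'company.com' alongside 'www.company.com'
    return sorted(hosts)

def candidate_urls(host):
    """Try https first, then fall back to http, since the lists often
    contain only whichever scheme happened to be pasted."""
    return [f"https://{host}/", f"http://{host}/"]
```

Resolving which candidates actually respond would still need a DNS/HTTP pass on top of this.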
12:31:47 | <@OrIdow6> | pabs: From what I'm hearing mannie doesn't strike me as malicious or destructive so I'd be in support of it, but like I say that's not my area and not my decision to make
12:31:51 | <pabs> | I was only doing it before mannie started doing the subdomains too... |
12:32:31 | <pabs> | I asked mannie about it a while ago and they only said they would think about it |
12:32:49 | <@OrIdow6> | It's a shame but true that there are more websites shutting down than we have volunteer labor here |
12:33:11 | <@OrIdow6> | I think *some* stuff can be automated eventually but that's the state of things now |
12:33:41 | <c3manu> | well..sometimes there are mistakes. like some small "F5 Logistics" company went bankrupt and www.f5.com was added to the list. and me not really knowing, and being overwhelmed by the list already, i just queued that >.<
12:34:04 | <c3manu> | yeah, i think we’re a little understaffed here as well :D |
12:34:16 | <pabs> | the subdomain stuff is why I think we need an automated DPoS based subdomain/URL enumerator with software/service type detection and AB/wikibot/Y/jseater/etc job command generation |
12:35:01 | <@OrIdow6> | Maybe I'm just in a weird mood today but I think an ultimatum to them might be the answer, "no one has enough free time/patience/mental bandwidth to do this, you're going to need to be your own advocate here" |
12:35:07 | <@OrIdow6> | pabs: Yeah would be nice |
12:35:40 | <@OrIdow6> | Maybe also something that compares how a headless browser behaves to a js-oblivious one, to try to tell if it can be AB'd effectively |
12:35:52 | <pabs> | indeed |
12:36:25 | <pabs> | lots of other tricks we could code too, like detecting if /pipermail or /pipermail/ work on Mailman 2 sites |
12:36:33 | <c3manu> | true |
12:36:35 | | SkilledAlpaca418962 quits [Quit: SkilledAlpaca418962] |
12:37:14 | <pabs> | I expect everyone has their favourite site scouting techniques |
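pabs' Mailman 2 example can be sketched like this; `probe` uses only stdlib urllib, and a 200 on either path variant is merely a hint, not proof, of a browsable pipermail archive:

```python
import urllib.request
import urllib.error

def pipermail_candidates(base):
    """Both path variants mentioned above; Mailman 2 installs differ
    on whether the trailing-slash form works."""
    base = base.rstrip("/")
    return [base + "/pipermail", base + "/pipermail/"]

def probe(url, timeout=10):
    """Return the HTTP status code for url, or None on network errors.
    (A 200 here *suggests* a browsable Mailman 2 archive.)"""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code
    except (urllib.error.URLError, OSError):
        return None
```

A real scanner would also want to distinguish a soft-404 landing page from an actual archive index.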
12:37:47 | | SkilledAlpaca418962 joins |
12:38:53 | <@OrIdow6> | Huh, on the "sandbox" thing I guess you could have some bot that proxies commands for them, and denies them if they're on the wrong job ID/if they request too high a rate limit/etc... |
12:40:06 | <c3manu> | i think crafting those restrictions would be really much effort actually |
12:41:57 | <c3manu> | like "please do not queue wikipedia.org", "that huge website should be archived, but not on *that* pipeline"... |
12:42:34 | <c3manu> | "this one has a session ID so needs to be run as https://random-forum.com/?archiveteam" |
12:43:26 | <c3manu> | "please make sure to run Shopify pages only with one worker, and only on pipelines without any other active Shopify jobs" |
12:43:45 | <steering> | surely there's some open source "web scanner" to do that -> < pabs> the subdomain stuff is why I think we need an automated DPoS based subdomain/URL enumerator with >>software/service type detection<< |
12:44:47 | <steering> | (although it would be nice to have it tuned for "look for the stuff that archiveteam has tooling for") |
12:45:31 | <pabs> | maybe, but probably none that could download to WARC or read sites from a WARC |
12:45:42 | <pabs> | or be used in a DPoS setup |
12:46:51 | <pabs> | on AT-related scanners, I've written a Python thing for finding wikis and generating #wikibot commands, and that_lurker has a Blogger detector |
12:49:18 | | IRC2DC joins |
12:55:14 | <that_lurker> | I do? o_O |
12:57:11 | <pabs> | woops, it was <thuban> https://transfer.archivete.am/inline/PUhGC/blogspot-checker.py |
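One common trick such scanners rely on (a generic illustration, not necessarily what thuban's script does) is reading the `<meta name="generator">` tag that many CMSes and wiki engines emit:

```python
from html.parser import HTMLParser

class GeneratorSniffer(HTMLParser):
    """Pull the <meta name="generator"> value out of a page; many platforms
    (MediaWiki, WordPress, Discourse, ...) announce themselves there."""
    def __init__(self):
        super().__init__()
        self.generator = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "generator":
            self.generator = a.get("content")

def detect_generator(html):
    sniffer = GeneratorSniffer()
    sniffer.feed(html)
    return sniffer.generator
```

Sites that strip the tag would need fallback heuristics (paths like `/wp-content/`, cookie names, etc.).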
13:01:46 | | NF885 (NF885) joins |
13:01:59 | | beardicus (beardicus) joins |
13:02:00 | | NF885 quits [Client Quit] |
13:05:58 | | NF885 (NF885) joins |
13:06:02 | | eroc1990 quits [Quit: Ping timeout (120 seconds)] |
13:06:22 | | eroc1990 (eroc1990) joins |
13:06:40 | <NF885> | FYI looks like /sitemap.xml no longer redirects to the sitemap index for some reason on the Biden White House archive site (and /robots.txt doesn't exist) |
13:06:50 | | NF885 quits [Client Quit] |
13:08:30 | | NF885 (NF885) joins |
13:08:38 | <NF885> | the sitemap indexes are still at https://bidenwhitehouse.archives.gov/sitemap_index.xml and https://bidenwhitehouse.archives.gov/es/sitemap_index.xml, though |
13:09:30 | <NF885> | (also I probably shouldn't be trying to send this on mobile data) |
13:10:11 | | NF885 quits [Client Quit] |
13:25:23 | <h2ibot> | Manu edited Discourse/archived (+99, queued community.openenergymonitor.org): https://wiki.archiveteam.org/?diff=54270&oldid=54191 |
14:17:10 | <masterx244|m> | crunching some data atm to verify links to some firmware files (approx 20GB) for a !ao< run. sussing out dead links with some wget prodding right now |
14:20:52 | <TheTechRobo> | OrIdow6: Re qwarc, I believe the correct branch is 0.2. Writing a spec file didn't seem too difficult last time I looked into it. The general idea seems to be that you define subclasses of `qwarc.Item`. qwarc will call the `generate` function on each of them to create the initial set of items. Then you can call `add_subitem` to queue any new items |
14:20:52 | <TheTechRobo> | (i.e. backfeed). |
14:21:47 | <TheTechRobo> | The `process` callback is where you actually do stuff. |
14:22:22 | <TheTechRobo> | And of course, you're on your own for parsing |
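From that description alone, a spec file might be shaped roughly like this. The `Item` base class below is a stand-in so the sketch runs without qwarc installed; the real `qwarc.Item` API is assumed from TheTechRobo's summary, so method names and signatures may well differ:

```python
# Stand-in for qwarc.Item so this sketch is self-contained; with qwarc
# installed you would subclass qwarc.Item instead.
class Item:
    def __init__(self):
        self.subitems = []

    def add_subitem(self, item_type, value):
        # In qwarc this queues a new item (backfeed); here we just record it.
        self.subitems.append((item_type, value))

class Page(Item):
    itemType = "page"

    @classmethod
    def generate(cls):
        # qwarc calls this on each Item subclass to create the initial
        # set of item values.
        yield "https://example.org/"

    def process(self):
        # The per-item callback where the actual fetching happens; a real
        # spec would retrieve the item's URL here, parse it itself, and
        # backfeed each discovered link:
        for link in ("https://example.org/about",):
            self.add_subitem("page", link)
```

Reading one of the spec files in the IA collections alongside this skeleton is probably the fastest way to learn the real API.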
14:28:05 | <@arkiver> | a distributed Warrior project that could be compared to AB will be introduced at #Y , but will still take some time |
14:29:17 | <@arkiver> | however that will still require work as well with maintaining jobs, ignoring stuff, etc. |
14:30:04 | <masterx244|m> | dishing out the ignores is the main PITA. loops and other traps usually only appear further down in an archiving job
14:30:35 | <@arkiver> | yeah |
14:30:52 | <masterx244|m> | (had to grab something for personal archival recently, too: a closed-off page where i pulled myself a backup, and a few traps waited there too that needed a few dirty ignores to squish)
14:31:07 | <@arkiver> | as for access to AB for mannie - i think it may be fine? but i believe that is largely up to JAA , though i'm not sure on the state of AB at the moment when it comes to this |
14:32:22 | <masterx244|m> | switched to Arch at the start of this year to finally wall off the last remaining windows on my HW. much easier now that i got the same toolings that i got on my server on my main computer, quickly spun up grab-site for that one job |
14:45:21 | <masterx244|m> | murphy's law, just when you need transfer.archiveteam.org its dead |
14:46:30 | <masterx244|m> | *transfer.archivete.am |
14:47:57 | <masterx244|m> | nope, somehow the bash snippet bricked itself.... |
14:48:07 | <masterx244|m> | list processed https://transfer.archivete.am/Cu73P/senafirmware_ab.lst |
14:48:07 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/Cu73P/senafirmware_ab.lst |
14:54:54 | <masterx244|m> | need to figure out why 3 files went missingno. even though i got them in my local archive (some dev-versions that they never intended to be caught but my automatic monitoring got that stuff even though it was just online for a short time) |
15:17:40 | | holbrooke joins |
15:42:06 | | BornOn420 quits [Remote host closed the connection] |
15:42:34 | | BornOn420 (BornOn420) joins |
15:59:09 | <@JAA> | arkiver: IIRC, they've been offered access before but didn't want it. Something about compiling the list taking long enough and not wanting to invest the time into understanding AB and keeping an eye on the jobs. |
15:59:49 | <@JAA> | c3manu or pokechu22 might be able to confirm. |
16:05:50 | <c3manu> | i think i vaguely remember something like that, yeah |
16:08:05 | <@JAA> | OrIdow6: Re qwarc, there is no documentation. You want the latest tag, v0.2.8. v0.2.6 and v0.2.7 are also fine; the only changes are support for HEAD requests and overriding the default 1-minute timeout. Anything before that is based on warcio and shouldn't be used. `pip install git+https://gitea.arpa.li/JustAnotherArchivist/qwarc.git@v0.2.8` on its own doesn't work because a dependency of aiohttp |
16:08:11 | <@JAA> | changed something long after the release; you need `pip install --upgrade async-timeout==3.0.1` afterwards to fix that. |
16:09:58 | <@JAA> | For the spec file, yep, what TheTechRobo wrote. |
16:10:14 | <@JAA> | The examples on IA should help with that. |
16:13:00 | | beardicus quits [Ping timeout: 250 seconds] |
16:15:47 | <@JAA> | The other thing to be aware of is that qwarc's memory usage grows over time. I've never been able to figure out what causes that; the best guess is heap fragmentation. There's a magic environment variable that helps but doesn't fix it; I don't have it handy right now. It's why --memorylimit is a thing. I normally run qwarc in a `while [[ ( ! -e "qwarc.db" || $(sqlite3 "qwarc.db" 'SELECT COUNT(id) FROM |
16:15:53 | <@JAA> | items WHERE status != 2') -gt 0 ) && ! -e STOP ]]` loop for that reason. |
16:16:37 | <@JAA> | (The environment variable technically lowers the performance a bit due to less efficient memory allocations, but I haven't noticed it in practice.) |
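The condition in that shell loop can also be checked from Python with the stdlib `sqlite3` module; the table and column names (`items`, `status != 2` meaning done) are taken from JAA's snippet, everything else is illustrative:

```python
import os
import sqlite3

def qwarc_unfinished(db_path="qwarc.db"):
    """True if the qwarc database doesn't exist yet (first run) or still
    contains items whose status isn't 2 (i.e. not yet done)."""
    if not os.path.exists(db_path):
        return True
    conn = sqlite3.connect(db_path)
    try:
        (count,) = conn.execute(
            "SELECT COUNT(id) FROM items WHERE status != 2"
        ).fetchone()
    finally:
        conn.close()
    return count > 0
```

A supervisor script would re-launch qwarc while this returns True (and a STOP sentinel file is absent), mirroring the bash loop above.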
16:19:49 | | holbrooke quits [Client Quit] |
16:33:37 | | beardicus (beardicus) joins |
17:08:54 | | lflare quits [Ping timeout: 250 seconds] |
17:11:34 | | emphatic joins |
17:29:19 | | sec^nd quits [Remote host closed the connection] |
17:29:42 | | sec^nd (second) joins |
17:33:56 | <@imer> | Blueacid: no worries, happens :) |
17:36:00 | <Blueacid> | No worries! Just seeing familiar filenames flying past when watching the warrior doing its thing, and I wondered whether there was any point in (trying to?) do a hash-based dedupe. But I surmise that yes, we might save maybe a few terabytes, which is great, but then the Blogger job alone has stored 1.5PB, so... drop in the ocean |
17:37:56 | | beardicus quits [Ping timeout: 250 seconds] |
17:45:43 | | balrog quits [Ping timeout: 260 seconds] |
17:49:27 | <szczot3k> | Running a dedup on IA would probably need more resources than it'd save (in terms of disk space)
17:55:03 | | le0n quits [Ping timeout: 260 seconds] |
17:56:28 | | le0n (le0n) joins |
17:57:08 | | balrog (balrog) joins |
18:06:15 | | beardicus (beardicus) joins |
18:12:48 | | balrog quits [Client Quit] |
18:13:11 | | icedice (icedice) joins |
18:18:55 | <yzqzss> | I'm archiving niconico shunga
18:19:36 | | balrog (balrog) joins |
18:23:26 | <TheTechRobo> | Blueacid: Warrior projects tend to do some dedup, where if the same thing is fetched multiple times in a Wget process (there is generally one process per handful of items), it will be deduped. Also, they use Zstandard with custom dictionaries that are trained on the WARC files, which means that redundant data is basically compressed to nonexistence |
18:24:00 | <TheTechRobo> | As sz.czot3k said, deduplicating over every WARC would probably be more trouble than it's worth
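The effect of those trained dictionaries can be illustrated with zlib's preset-dictionary support from the Python stdlib; zstd's custom dictionaries work the same way conceptually, except they are trained on sampled WARC data rather than hand-picked as here:

```python
import zlib

# Boilerplate shared across many captures; in a Warrior project the zstd
# dictionary is *trained* on real WARC data rather than hand-picked.
shared = (b"<!DOCTYPE html><html><head><meta charset='utf-8'>"
          b"<title>Company page</title></head><body><div class='header'>"
          b"<nav><a href='/'>Home</a><a href='/about'>About</a></nav></div>")
record = shared + b"<p>Unique text for this one capture.</p></body></html>"

def deflate(data, zdict=None):
    """Compress with an optional preset dictionary (zlib's analogue of
    zstd's custom dictionaries)."""
    comp = (zlib.compressobj(level=9, zdict=zdict) if zdict
            else zlib.compressobj(level=9))
    return comp.compress(data) + comp.flush()
```

With the dictionary, the shared boilerplate is encoded as back-references into it instead of being stored again, so `deflate(record, zdict=shared)` comes out noticeably smaller than `deflate(record)`.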
18:25:54 | | aninternettroll quits [Remote host closed the connection] |
18:28:01 | | aninternettroll (aninternettroll) joins |
18:44:54 | | qinplus_phone joins |
19:04:28 | | th3z0l4 quits [Ping timeout: 260 seconds] |
19:04:52 | | th3z0l4 joins |
19:05:24 | | lennier2_ joins |
19:19:20 | | beardicus quits [Ping timeout: 250 seconds] |
19:41:48 | | lennier2_ quits [Ping timeout: 260 seconds] |
19:42:17 | | lennier2_ joins |
19:42:40 | | beardicus (beardicus) joins |
19:43:42 | | lennier2 joins |
19:46:45 | | ` joins
19:47:03 | | beardicus quits [Ping timeout: 260 seconds] |
19:47:30 | | lennier2_ quits [Ping timeout: 250 seconds] |
19:48:20 | <nicolas17> | `: that's annoying |
19:48:43 | <`> | nicolas17: what is? |
19:49:23 | <nicolas17> | your nickname >:o |
19:49:38 | <`> | ur face is annoying |
19:51:30 | | Radzig2 joins |
19:53:08 | | Radzig quits [Ping timeout: 250 seconds] |
19:53:08 | | Radzig2 is now known as Radzig |
19:54:13 | | steering tries to get the speck of dust off his monitor |
20:03:55 | <Ryz> | arkiver/JAA, regarding #Y - yeah, the ignores are really going to be a main stickler; like damn, I do my rounds on doing ignores on existing jobs whenever I can, but it's oofy and a bit draining over time |
20:04:29 | <Ryz> | Also on having to get some intuition on when something seems to go wrong or might go in a bad direction |
20:05:14 | <masterx244|m> | addendum that i forgot to tell on the list i posted earlier: the 20GB mentioned earlier is the total size of the entire fileset together. the files are 2 to 6MB each |
20:05:57 | | beardicus (beardicus) joins |
20:06:22 | <masterx244|m> | true on that. some rabbitholes are really unintuitive until the crawl is really deep inside them. and if they happen when nobody is around they can waste a good bunch of crawltime |
20:07:33 | <masterx244|m> | vanillaforums-based forums have a nasty one, for example: when a badly formatted link appears in a post, the URL gets added as a relative one to the post URL, but the post page doesn't redirect to the canonical form and instead keeps that extra part; the same page of the topic loads again, another segment gets added, and the game repeats
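An ignore pattern for that kind of growing-URL trap might flag any URL whose path immediately repeats a chunk of itself. This is an illustration only, not a battle-tested ArchiveBot ignore; paths like `/en/en/` can occur legitimately, so it would need per-site tuning:

```python
import re

# Any path chunk (starting with '/') that is immediately repeated,
# e.g. /discussion/123/topic/discussion/123/topic
LOOP_TRAP = re.compile(r"(/[^?#]+?)\1")

def looks_like_loop(url):
    """Heuristic: True if the URL path repeats a chunk of itself."""
    return LOOP_TRAP.search(url) is not None
```

In practice you'd run candidates through this before queueing, and hand-check anything it flags rather than dropping it blindly.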
20:22:38 | | trix (trix) joins |
20:24:05 | | lennier2_ joins |
20:27:18 | | lennier2 quits [Ping timeout: 260 seconds] |
20:31:25 | | beardicus quits [Remote host closed the connection] |
20:31:45 | | beardicus (beardicus) joins |
20:34:12 | | BlueMaxima joins |
20:44:48 | | th3z0l4_ joins |
20:45:22 | | th3z0l4 quits [Remote host closed the connection] |
21:04:44 | | qinplus_phone quits [Client Quit] |
21:08:00 | | Gadelhas5628737 quits [Quit: Ping timeout (120 seconds)] |
21:30:18 | | beardicus quits [Ping timeout: 260 seconds] |
21:31:48 | | beardicus (beardicus) joins |
21:37:05 | | SootBector quits [Remote host closed the connection] |
21:37:25 | | SootBector (SootBector) joins |
21:39:56 | | holbrooke joins |
21:45:28 | | Webuser663884 joins |
21:45:38 | | Webuser663884 quits [Client Quit] |
21:46:04 | <Blueacid> | szczot3k and TheTechRobo - cheers for the answers, appreciated :) |
21:46:16 | <Blueacid> | Very much as I suspected, but wanted to ask :) |
21:47:47 | <szczot3k> | Cost of any dedup will most likely be much more than adding "another disk"
21:50:02 | <Blueacid> | Yeah, I figured so! |
21:50:13 | <@JAA> | Yeah, especially when you include the non-technical costs: developing this thing, making sure it doesn't accidentally delete data, etc.
21:54:38 | <Blueacid> | Yeah the risk is probably the hugest thing - how can you guarantee you've not fouled something up? |
21:54:48 | <Blueacid> | Cheers for humouring me thinking out loud! |
21:55:05 | | etnguyen03 (etnguyen03) joins |
21:55:30 | <szczot3k> | Doing dedup currently would mean (at least) reading every file and checksumming it
21:55:33 | <szczot3k> | Which is a huge task |
21:57:51 | <Blueacid> | I agree it's a silly endeavour now - with that said, we wouldn't need to read every file; just the first few bytes of every file with a non-unique filesize. And if those match, then & only then would you checksum them |
21:57:58 | <Blueacid> | that'd reduce the disk / cpu somewhat? |
21:58:36 | <szczot3k> | Then you get false positives |
21:59:19 | <szczot3k> | At this scale, doing so, would lead to false positives |
21:59:22 | <Blueacid> | How so? If you've just checked the first few bytes and they match, then you proceed to fully checksum those files to prove if they're byte-for-byte. But if the first few bytes differ, clearly different file contents; move on |
22:00:38 | <szczot3k> | You'd first want to build a full map of the archive, with checksums. You can't go with an approach of "let's look at this one file and compare it to every other file"; you'd end up reading every file (at least the start of it) billions of times
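The scheme being discussed, cheap passes first (size, then a short prefix), full hashes only for the survivors, computed over one full map rather than pairwise, might look like this sketch (the `files` mapping stands in for reading from disk):

```python
import hashlib
from collections import defaultdict

def find_duplicates(files):
    """files: mapping of name -> bytes (a stand-in for disk reads).
    Returns groups of names whose contents are byte-identical.
    Cheap passes first (size, then first 64 bytes), a full hash only
    for the survivors."""
    by_size = defaultdict(list)
    for name, data in files.items():
        by_size[len(data)].append(name)

    by_prefix = defaultdict(list)
    for names in by_size.values():
        if len(names) < 2:
            continue  # unique size: cannot be a duplicate
        for name in names:
            by_prefix[(len(files[name]), files[name][:64])].append(name)

    groups = defaultdict(list)
    for names in by_prefix.values():
        if len(names) < 2:
            continue  # unique prefix: cannot be a duplicate
        for name in names:
            groups[hashlib.sha256(files[name]).hexdigest()].append(name)
    return [g for g in groups.values() if len(g) > 1]
```

At IA scale the map itself is the expensive part; this just shows how the passes prune the work.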
22:04:18 | | beardicus quits [Remote host closed the connection] |
22:04:39 | | beardicus (beardicus) joins |
22:08:07 | <Blueacid> | Ah yeah, shoot, you're right |
22:08:10 | <Blueacid> | Mea culpa |
22:12:32 | | HP_Archivist (HP_Archivist) joins |
22:15:48 | | beardicus quits [Ping timeout: 260 seconds] |
22:16:59 | | icedice quits [Quit: Leaving] |
22:34:52 | <Barto> | https://i.imgur.com/99suCYs.png internet goes brrr :-) |
22:37:22 | <@JAA> | Most archives use SHA-1 hashes, and there are *definitely* collisions in the collection. |
22:38:05 | <@JAA> | The digests already exist in the index, so that part is kind of done, but you still need to deal with the collisions. |
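Building on that: with digests already in the index, candidate groups come for free, but same-SHA-1 files still have to be byte-compared before any merge, exactly because of collisions like the SHAttered PDFs. A sketch (the `index`/`read` interfaces here are made up for illustration):

```python
from collections import defaultdict

def confirmed_duplicates(index, read):
    """index: iterable of (name, sha1_hex) pairs, as an existing index
    might provide; read: callable returning a file's bytes.
    Groups by digest, then byte-compares within each bucket so that a
    SHA-1 collision is never merged."""
    buckets = defaultdict(list)
    for name, digest in index:
        buckets[digest].append(name)

    groups = []
    for names in buckets.values():
        if len(names) < 2:
            continue
        by_content = defaultdict(list)
        for name in names:
            by_content[read(name)].append(name)  # full byte comparison
        groups.extend(g for g in by_content.values() if len(g) > 1)
    return groups
```

The byte-compare step is what makes the scheme safe; the digest only narrows down where to look.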
22:45:16 | | beardicus (beardicus) joins |
22:46:14 | | loug8318142 quits [Quit: The Lounge - https://thelounge.chat] |
22:48:23 | | holbrooke quits [Client Quit] |
22:57:31 | <szczot3k> | Yeah, at this scale collisions are something that actually matters
23:01:25 | <@JAA> | I mean, one WARC of the SHAttered website is enough. :-) |
23:07:20 | | Island joins |
23:15:30 | | beardicus quits [Ping timeout: 250 seconds] |
23:18:45 | | nicolas17 is now authenticated as nicolas17 |
23:23:46 | <@OrIdow6> | yzqzss: Details? Would I have known this if I was in the other stwp chat? |
23:24:11 | | beardicus (beardicus) joins |
23:31:58 | | beardicus quits [Ping timeout: 250 seconds] |
23:36:55 | | beardicus (beardicus) joins |
23:42:58 | | holbrooke joins |
23:44:54 | | holbrooke quits [Client Quit] |
23:48:33 | | yasomi quits [Ping timeout: 260 seconds] |
23:50:37 | | yasomi (yasomi) joins |
23:55:33 | | beardicus quits [Ping timeout: 260 seconds] |