00:40:07 | | sralracer quits [Client Quit] |
00:49:23 | | wickedplayer494 quits [Remote host closed the connection] |
00:49:40 | | wickedplayer494 joins |
00:49:46 | | wickedplayer494 is now authenticated as wickedplayer494 |
01:30:34 | <@JAA> | Barto++ |
01:30:34 | <eggdrop> | [karma] 'Barto' now has 12 karma! |
01:51:22 | <nicolas17> | JAA: https://transfer.archivete.am/inline/ew4AR/qwarc.txt now-what.gif |
01:51:48 | <@JAA> | nicolas17: For starters, don't install the master branch. |
01:51:48 | <nicolas17> | do I have too old python or |
01:52:20 | | nicolas17 tries last tag instead |
01:52:44 | <@JAA> | But the issue comes from a backwards-incompatible change in an aiohttp dependency that broke everything a while back. |
01:52:58 | <nicolas17> | sounds damn familiar |
01:53:04 | <nicolas17> | like I fought this exact issue before |
01:53:12 | <@JAA> | async-timeout==3.0.1 |
01:53:37 | <nicolas17> | would it be possible to pin it in qwarc's setup.py to avoid the issue? |
01:54:25 | <nicolas17> | okay that works, and now it asks me for a specfile |
01:54:33 | <nicolas17> | which I guess is a python script using an undocumented API |
01:56:00 | <@JAA> | Correct |
01:56:14 | <@JAA> | And yeah, I should add a pin. |
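For reference, a pin like the one discussed could be expressed in qwarc's setup.py roughly as follows. This is only a sketch; apart from the async-timeout==3.0.1 pin mentioned above, the package names and the rest of the metadata are placeholders, not qwarc's actual dependency list.

    # Sketch of a setup.py dependency pin (illustrative only; the other
    # requirements are placeholders, not qwarc's real dependency list).
    from setuptools import setup

    setup(
        name='qwarc',
        install_requires=[
            'aiohttp',               # whichever version qwarc already requires
            'async-timeout==3.0.1',  # the pin mentioned above
        ],
    )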
01:56:33 | <nicolas17> | ...does nothing else support dedup? |
01:57:13 | <@JAA> | wget does (but writes faulty WARCs, although that seems to be getting fixed soon, finally). |
01:57:36 | <@JAA> | You could try wget-at, if you're not on ARM. |
01:58:11 | <nicolas17> | are there qwarc examples somewhere at least? |
01:58:13 | <@JAA> | Or rather: if you're on x86-64 |
01:58:34 | <@JAA> | In my IA uploads. qwarc is self-documenting and writes the spec file to the meta WARC. |
01:58:40 | | nicolas17 puts RISC-V machine back down |
02:05:37 | <@JAA> | ... it's been 3 years since the last qwarc version? Wat? |
02:05:50 | <nicolas17> | time is fake |
02:06:29 | <@JAA> | Yeah, must be. |
02:09:41 | <nicolas17> | so far cloning gnulib git repo has been the slowest step |
02:16:30 | <@JAA> | Cloning what and why? |
02:16:47 | <nicolas17> | wget-at has gnulib as a submodule |
02:17:23 | <@JAA> | Ah, you're trying wget-at now, right. |
02:18:12 | <@OrIdow6> | Yeah it's super slow/times out sometimes, don't know why |
02:18:15 | <nicolas17> | using the Dockerfile built an image that can't run because it *only* has wget-at and its libraries, but no libc or anything, because it uses "FROM scratch"? what? |
02:18:20 | <@OrIdow6> | Good thing is you only have to do it once |
02:18:45 | <@JAA> | Yeah, the wget-at image isn't really meant to be used directly. |
02:19:24 | <@JAA> | Or wget-lua image, I suppose. |
02:19:27 | | decky quits [Ping timeout: 252 seconds] |
02:19:31 | <@JAA> | Yay, legacy naming. |
02:19:44 | <nicolas17> | and if I change "FROM scratch" to "FROM debian:bookworm-slim" it has to download gnulib and compile wget again, because *any* change to Dockerfile changes the effects of the previous statement "COPY . /tmp/wget" |
02:20:24 | <@JAA> | You could just pull the grab-base image from atdr. |
02:20:26 | <nicolas17> | why are computers |
02:20:58 | <@JAA> | It's meant for DPoS, but it should be possible to just use it to run wget-at. |
02:21:01 | <nicolas17> | also fun how c-ares took much longer to compile than wget, if we ignore the time needed to clone gnulib and the autoconf nonsense
02:22:18 | | decky_e joins |
02:41:26 | <nicolas17> | the --progress option seems broken |
02:42:29 | <nicolas17> | --progress=bar or --progress=dot:mega etc, none changes anything, it keeps flooding my console with one dot per KB |
02:45:50 | | HP_Archivist quits [Quit: Leaving] |
02:53:31 | <nicolas17> | hm this initial test seems fine, but I should probably run this for real on my other computer so that I can actually fit a whole InstallAssistant in RAM |
02:57:29 | | cow_2001 quits [Quit: ✡] |
02:58:37 | | cow_2001 joins |
02:59:25 | <nicolas17> | wtf it's appending to the -O file? |
03:01:56 | | nulldata quits [Quit: Ping timeout (120 seconds)] |
03:02:02 | <nicolas17> | "wget-at -O wget.tmp -i list.txt--warc-file=test" downloaded the 12GB file into wget.tmp, then wrote the 12GB data into test.warc, and now wget.tmp is growing more (with the content of the second URL?), I expected it to get truncated for the second download |
03:02:49 | | nulldata (nulldata) joins |
03:06:02 | <@JAA> | Yes, it should truncate. That's clearly not your exact command since there's no space between txt and --. |
03:06:15 | <nicolas17> | ...pipelines pass --truncate-output |
03:06:16 | <nicolas17> | I see |
03:07:01 | <@JAA> | Oh, right, that's a wget-at thing, I think. |
03:08:38 | <nicolas17> | I think I can also just use -O /dev/null and save some temporary disk space |
03:08:48 | <nicolas17> | since it uses a separate temporary file for warc purposes anyway |
03:10:51 | <@JAA> | Since you don't need to extract links or similar, that sounds plausible, yeah. |
03:11:31 | <nicolas17> | hm the .cdx only has the first URL |
03:30:34 | | pete joins |
03:31:04 | | pete quits [Client Quit] |
03:39:22 | | nic8693102004 (nic) joins |
03:53:57 | <TheTechRobo> | nicolas17: You can do `FROM <wget-at image> as wget` and then `COPY --from=wget /path/to/wget-lua /usr/bin/wget-lua`, for the record. |
03:55:03 | <TheTechRobo> | You don't need a runnable container to copy data out of it. |
03:55:13 | <nicolas17> | on a separate dockerfile you mean? |
03:55:20 | <TheTechRobo> | Yeah |
03:55:29 | <TheTechRobo> | IIRC that's what I did |
03:55:53 | <nicolas17> | (editing this dockerfile in any way caused a rebuild of wget because it was considered part of the source code being copied in a build step) |
03:55:54 | <TheTechRobo> | Although I haven't used wget-at outside of docker for probably at least a year now. |
03:56:02 | | etnguyen03 quits [Remote host closed the connection] |
03:57:54 | | wyatt8740 quits [Ping timeout: 252 seconds] |
03:58:15 | | loug8318142 quits [Quit: The Lounge - https://thelounge.chat] |
03:58:57 | <nicolas17> | augh |
03:59:10 | <nicolas17> | JAA: "VEILPUTFJNJQAAAAAAAAAAAAAAAAAAAA" is this the bug mentioned recently in #archiveteam-dev? |
03:59:19 | <@JAA> | Sure looks like it. |
04:00:17 | <nicolas17> | it's joever |
04:00:21 | <nicolas17> | every warc tool is broken |
04:01:49 | <TheTechRobo> | Enjoy the broken digests. :-) |
04:02:02 | | wyatt8740 joins |
04:03:04 | <TheTechRobo> | JAA: Do you think it's reasonable to add an option to test for the broken digest bug in warc-tiny so it doesn't flood the logs so much? |
04:03:48 | <TheTechRobo> | Similar to how it detects the broken handling of transfer-encoding, but as an option since only Wget-AT is affected |
04:04:31 | <nicolas17> | this is a 104-byte download btw |
04:04:48 | | ducky quits [Ping timeout: 260 seconds] |
04:05:29 | <TheTechRobo> | nicolas17: Not sure what the exact odds of hitting the bug are, but I think you just got unlucky. |
04:05:36 | <nicolas17> | https://swcdn.apple.com/content/downloads/13/38/072-11038-A_8VILF7KGLR/ekuwqfer80bkta2a6l6hn9flavknip4edt/MajorOSInfo.pkg.integrityDataV1 |
04:06:17 | | ducky (ducky) joins |
04:07:47 | <@OrIdow6> | <nicolas17> every warc tool is broken |
04:07:49 | <h2ibot> | OrIdow6 edited The WARC Ecosystem (+155, /* Tools */ wget-lua to yellow): https://wiki.archiveteam.org/?diff=53840&oldid=53766 |
04:07:49 | <@OrIdow6> | Good point... |
04:08:03 | <nicolas17> | ffs |
04:08:14 | <TheTechRobo> | Oh, right, it affects dedupe too. |
04:08:30 | <@OrIdow6> | I don't think anyone's tested that but apparently that's what it looked like |
04:08:45 | <TheTechRobo> | Yeah, the copied hash is what's passed to the deduplication function. |
04:09:01 | | TheTechRobo wonders how many records have been incorrectly deduplicated |
04:09:27 | | nulldata quits [Client Quit] |
04:09:52 | <@OrIdow6> | Mmmm, probably a lot |
04:10:09 | <TheTechRobo> | I'm not so sure |
04:10:14 | <TheTechRobo> | URL-agnostic dedupe is per-process |
04:10:20 | | nulldata (nulldata) joins |
04:10:22 | <@JAA> | 1 of 256 hashes will be all 0 bytes with this bug. |
04:10:46 | <TheTechRobo> | Oh, those odds are significantly worse than I thought. :-( |
04:10:55 | <@OrIdow6> | Like apparently a 100-request item (/multiitem batch) will have a 5% chance of having a false dedup |
04:10:55 | <@JAA> | And that's just the biggest chunk of it. |
04:11:12 | <@OrIdow6> | No |
04:11:14 | <@JAA> | Specifically the probability of the first byte of the has being NUL. |
04:11:25 | <@JAA> | hash* |
04:11:32 | <@OrIdow6> | Yes, actually, I copied the wrong number but it was about the same as the right one |
04:11:40 | <@OrIdow6> | (Replying to myself not to J A A) |
04:11:45 | <@JAA> | Collisions on later NUL bytes are also possible but not as likely. |
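A rough back-of-the-envelope check of those odds, counting only the all-zero-hash case described above (a first NUL byte in the digest) and assuming dedup happens within one 100-request session:

    # Chance that at least two of 100 records in one session get the all-zero
    # truncated digest (first digest byte is NUL, p = 1/256 per record), which
    # is the main way a false dedup inside that session can happen.
    p = 1 / 256
    n = 100
    p_at_most_one = (1 - p) ** n + n * p * (1 - p) ** (n - 1)
    print(f"P(>=2 all-zero digests in {n} requests) ~= {1 - p_at_most_one:.1%}")
    # prints about 5.9%, in the same ballpark as the figure quoted above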
04:12:38 | <@OrIdow6> | I'll turn off dedup on Cohost in a bit |
04:12:53 | <nicolas17> | is it strcpy'ing the binary hash? |
04:12:57 | <TheTechRobo> | Yes |
04:13:00 | <TheTechRobo> | strncpy specifically |
04:13:02 | <@JAA> | strncpy, but yes |
04:13:11 | <nicolas17> | let's quit computers and start a farm |
04:13:16 | <TheTechRobo> | :-) |
04:13:16 | <@JAA> | Sounds good to me. |
04:13:31 | <@JAA> | Oh wait, modern farm equipment is all computers. D: |
04:13:42 | <nicolas17> | and fighting tractor DRM |
04:16:53 | <nicolas17> | so wtf |
04:16:56 | <nicolas17> | isn't this "easy" to fix |
04:17:02 | <TheTechRobo> | Yes |
04:17:11 | <TheTechRobo> | strncmp -> memcmp, theoretically |
04:17:15 | <TheTechRobo> | Er |
04:17:23 | <TheTechRobo> | strncpy -> memcpy, theoretically |
04:18:53 | | TheTechRobo has just noticed that there is now only one green row on The WARC Ecosystem :-( |
04:19:43 | <nicolas17> | let's go find bugs in it |
04:21:59 | <@JAA> | And even that green row has bugs inherited from wpull. :-( |
04:24:44 | <nicolas17> | also indeed running this on a machine with 24GB RAM was so much better |
04:24:54 | <nicolas17> | whole file stays in disk cache |
04:27:52 | <nicolas17> | https://archive.org/details/macos-installassistant-24C5073e-warc see you in two hours |
04:40:18 | | Guest54 quits [Quit: My MacBook has gone to sleep. ZZZzzz…] |
04:41:40 | | Unholy2361924645377131 quits [Ping timeout: 260 seconds] |
04:41:55 | <nicolas17> | does this affect archivebot or does that use different software? |
04:43:07 | <nicolas17> | it seems to be 1 of the 2 uses of strncpy in the whole codebase so yes this seems easy to fix |
04:43:21 | <nicolas17> | src/warc.c:2085: strncpy(sha1_res_payload, sha1_payload, SHA1_DIGEST_SIZE); |
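For illustration, the effect of that strncpy on a raw SHA-1 digest can be re-enacted in Python (a sketch, not the actual C code): the copy stops at the first zero byte and the rest of the destination stays zero-filled, which is what produces base32 digests ending in a run of A's like the one seen earlier.

    # Re-enactment of the truncation: the copy stops at the first NUL byte and
    # the remainder of the destination buffer stays zero-filled.
    import base64
    import hashlib

    def strncpy_like(src: bytes, n: int) -> bytes:
        cut = src.find(b'\x00')
        copied = src if cut == -1 else src[:cut]
        return copied[:n].ljust(n, b'\x00')

    # pick some payload whose SHA-1 digest happens to have an early NUL byte
    payload = next(str(i).encode() for i in range(100000)
                   if b'\x00' in hashlib.sha1(str(i).encode()).digest()[:10])
    digest = hashlib.sha1(payload).digest()
    print('correct:', base64.b32encode(digest).decode())
    print('copied :', base64.b32encode(strncpy_like(digest, len(digest))).decode())
    # the second digest ends in a run of A's: the base32 encoding of the zero padding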
05:27:00 | <nicolas17> | JAA: I finished uploading the item |
05:27:36 | <nicolas17> | I left the .warc uncompressed because it's mainly a giant already-compressed file, is that a problem? are there tools that expect .warc to always be .gz/.zstd? |
05:36:25 | | HP_Archivist (HP_Archivist) joins |
06:06:34 | | ave quits [Quit: Ping timeout (120 seconds)] |
06:06:54 | | ave (ave) joins |
06:27:12 | | night quits [Remote host closed the connection] |
06:27:23 | | night joins |
06:27:23 | | night is now authenticated as night |
07:03:52 | | BlueMaxima quits [Quit: Leaving] |
07:05:51 | | Unholy2361924645377131 (Unholy2361) joins |
07:08:04 | | Pedrosso5 joins |
07:08:09 | | ScenarioPlanet2 (ScenarioPlanet) joins |
07:08:10 | | TheTechRobo2 (TheTechRobo) joins |
07:10:25 | | ScenarioPlanet quits [Ping timeout: 255 seconds] |
07:10:25 | | Pedrosso quits [Ping timeout: 255 seconds] |
07:10:25 | | ScenarioPlanet2 is now known as ScenarioPlanet |
07:10:26 | | Pedrosso5 is now known as Pedrosso |
07:10:57 | | TheTechRobo quits [Ping timeout: 252 seconds] |
07:10:57 | | TheTechRobo2 is now known as TheTechRobo |
07:12:45 | | @rewby quits [Ping timeout: 260 seconds] |
07:48:13 | | ducky quits [Ping timeout: 260 seconds] |
08:18:05 | <upperbody321|m> | So-net Blog (SS Blog), the former Sony Communications blogging business, will end its services on 31 March 2025. |
08:18:05 | <upperbody321|m> | https://blog-wn.blog.ss-blog.jp/2024-11-15 |
08:18:05 | <upperbody321|m> | Sorry if it has already been posted |
08:37:05 | | rewby (rewby) joins |
08:37:05 | | @ChanServ sets mode: +o rewby |
08:42:53 | | Island quits [Read error: Connection reset by peer] |
08:54:12 | | xarph quits [Read error: Connection reset by peer] |
08:54:30 | | xarph joins |
09:02:15 | | BennyOtt (BennyOtt) joins |
09:21:48 | | BennyOtt quits [Client Quit] |
09:37:14 | | ducky (ducky) joins |
09:41:28 | | BennyOtt (BennyOtt) joins |
10:12:07 | <@arkiver> | the bug in Wget-AT is now fixed with https://github.com/ArchiveTeam/wget-lua/commit/8adeb442e256ca8c737da19cc0224e1ca09ef266 |
10:13:26 | <@arkiver> | i will make sure it is propagated to all projects |
10:27:29 | <@arkiver> | how i believe this would work out in the Wayback Machine is the following: upon indexing of WARCs (creating the .cdx.gz file from the .warc.gz), records have their hashes recalculated. this means they would end up in the CDX file with the correct hash.
10:28:37 | <@arkiver> | however, hashes used in revisit records are of course not being recalculated - those are assumed to be correct. when a revisit record is resolved, i believe the nearest record matching the advertised URL+hash is found and redirected to. if the hashes do not match, i believe no redirect would happen.
10:28:43 | <@arkiver> | i will confirm that |
10:36:21 | | sralracer joins |
10:36:42 | | sralracer is now authenticated as sralracer |
10:40:18 | <@arkiver> | but... let's see, is this fixable? to some degree it is, but it would require parsing a ton of data |
10:40:52 | <@arkiver> | we do not have any kind of 'global' deduplication with some central collection of hashes against which we deduplicate. deduplication only happens within a single session.
10:41:43 | <@arkiver> | a single session produces one WARC, which always ends up in one megaWARC (it's never split over multiple), and every record in the WARC is clearly associated with the session (or item) it was produced with.
10:43:24 | <@arkiver> | together with the WARC-Refers-To-Date and WARC-Refers-To-Target-URI WARC headers, it is possible with a very high degree of certainty to match records together, and fix hashes in that way. |
10:50:15 | <@arkiver> | thinking more about this. while the ideal would be to fix the megaWARCs themselves, it may also be possible to create a second WARC next to the megaWARC with the fixed revisit records. this would only require using the CDX to find the "maybe bad revisit records" (all those ending with one or more A's?), then writing fixed versions of these to a WARC and placing this smaller WARC in the item.
10:51:02 | <@arkiver> | of course, this would not fix the records in the megaWARC, but those that are revisit records will already have their digests recalculated upon creating the CDX file. |
10:51:34 | <@arkiver> | i'm looking into this, i think there are possibilities
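A sketch of the CDX pass described above, flagging digests that end in a suspicious run of A's. The digest column position and the threshold are assumptions here; real CDX files state their field order in the header line, so that should be checked first.

    # Flag CDX lines whose base32 SHA-1 digest ends in many A's, i.e. likely
    # zero-padded by the truncation bug. Column index and threshold are guesses.
    import sys

    DIGEST_FIELD = 5      # assumed position of the digest column; check the CDX header
    MIN_TRAILING_A = 8    # assumed threshold for "probably truncated"

    def suspicious(digest: str) -> bool:
        return len(digest) - len(digest.rstrip('A')) >= MIN_TRAILING_A

    with open(sys.argv[1]) as cdx:
        for line in cdx:
            fields = line.split()
            if len(fields) > DIGEST_FIELD and suspicious(fields[DIGEST_FIELD]):
                print(line, end='')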
10:58:57 | | Wohlstand quits [Remote host closed the connection] |
11:10:52 | <@arkiver> | err that is not the complete story, that is for payloads that have been correctly deduplicated from each other. payloads that have been wrongly deduplicated cannot be fixed
11:12:22 | | @arkiver is not thinking well at the moment :/ |
11:13:08 | | nulldata0 (nulldata) joins |
11:14:46 | | nulldata quits [Ping timeout: 255 seconds] |
11:14:47 | | nulldata0 is now known as nulldata |
12:00:02 | | Bleo182600722719623 quits [Quit: The Lounge - https://thelounge.chat] |
12:02:43 | | Bleo182600722719623 joins |
12:05:45 | | LddPotato quits [Ping timeout: 252 seconds] |
12:38:44 | | SkilledAlpaca41896 quits [Quit: SkilledAlpaca41896] |
12:44:43 | | SkilledAlpaca41896 joins |
12:51:21 | | th3z0l4_ quits [Read error: Connection reset by peer] |
12:52:17 | | th3z0l4 joins |
13:06:48 | | th3z0l4 quits [Ping timeout: 252 seconds] |
13:07:23 | | th3z0l4 joins |
13:16:03 | | lennier2_ joins |
13:19:05 | | lennier2 quits [Ping timeout: 260 seconds] |
13:56:34 | <BennyOtt> | What is actually the best option? The "warrior-dockerfile" with the interface, or running each project you want to support individually?
14:07:35 | | Guest54 joins |
14:17:52 | <that_lurker> | each project individually would be the best way to go. |
14:23:50 | | FartWithFury (FartWithFury) joins |
14:25:29 | <nstrom|m> | Yeah primarily because you can run multiple projects at once that way |
14:26:46 | <nstrom|m> | Usually individual projects are limited to some extent in how much you can run on a single IP, so even if you have plenty of bandwidth/CPU/RAM you usually can't devote it all to a single project, since the site usually blocks/throttles on their end
14:37:43 | <FartWithFury> | what are you all archiving with and where too? |
14:38:24 | <nstrom|m> | Standalone docker images, each project has one |
14:39:11 | <nstrom|m> | And from where, a number of VPS/cloud servers and some Linux hardware at home |
14:39:18 | <nstrom|m> | For me at least |
14:42:45 | <imer> | ^same |
14:50:40 | <FartWithFury> | <3 |
14:51:22 | <that_lurker> | FartWithFury: grab-site for my own website archives, tubeup for videos or yt-dlp, wikiteam3 for mediawikis... almost everything goes to archive.org
14:51:48 | <that_lurker> | chat_downloader for live chats (youtube, twitch..) |
14:54:38 | <FartWithFury> | i'm using httrack, wget (python) and the downloadthemall addon for firefox :) and same up to archive
14:55:09 | <FartWithFury> | then up to archive.org* |
15:02:57 | <BennyOtt> | ok, thanks @that_lurker and @nstrom. then I might change something a bit on my side, since "warrior-dockerfile" hasn't received an update yet, even though it's easy to manage when new projects come along. |
15:04:30 | <that_lurker> | you can also just run multiple instances of the warrior-docker if you want a gui experience |
15:06:33 | <BennyOtt> | yes, I did that before too |
15:31:05 | | Mateon1 quits [Quit: Mateon1] |
15:41:30 | | VerifiedJ9 quits [Quit: The Lounge - https://thelounge.chat] |
15:41:32 | | Mateon1 joins |
15:42:00 | | thuban quits [Ping timeout: 260 seconds] |
15:42:10 | | VerifiedJ9 (VerifiedJ) joins |
15:46:02 | | thuban (thuban) joins |
15:55:41 | | Raithmir joins |
16:04:20 | | ducky quits [Read error: Connection reset by peer] |
16:04:40 | | ducky (ducky) joins |
16:07:05 | | pabs quits [Ping timeout: 260 seconds] |
16:10:23 | | pabs (pabs) joins |
16:15:07 | | Raithmir quits [Client Quit] |
16:25:45 | | wyatt8740 quits [Ping timeout: 260 seconds] |
16:29:51 | <katia> | pabs, kokos asked me about Gemini and the possibilities of archiving it. I know you looked into it at some point, are you doing anything with it? |
16:37:12 | | wyatt8740 joins |
17:01:06 | | katocala quits [Ping timeout: 252 seconds] |
17:01:32 | | katocala joins |
17:19:20 | | wessel1512 joins |
17:19:30 | | bladem quits [Read error: Connection reset by peer] |
17:32:27 | | katocala quits [Ping timeout: 252 seconds] |
17:33:02 | | katocala joins |
17:56:45 | | pabs quits [Ping timeout: 260 seconds] |
17:59:10 | | sralracer quits [Client Quit] |
18:00:16 | | pabs (pabs) joins |
18:01:11 | | Naruyoko quits [Read error: Connection reset by peer] |
18:01:28 | | Naruyoko joins |
18:13:47 | | sralracer (sralracer) joins |
18:20:38 | <h2ibot> | JustAnotherArchivist edited Template:IRC channel (-2, Update for new web chat based on…): https://wiki.archiveteam.org/?diff=53841&oldid=47317 |
18:25:13 | <@JAA> | nicolas17: Uncompressed WARC is fine, I think. |
18:27:23 | <@JAA> | arkiver: Identifying potentially faulty records should be possible from the CDX and the megawarc JSON. The former should contain the payload digests, and the latter allows identifying boundaries between mini-WARCs to eliminate those false positives. |
18:28:29 | <@JAA> | Still a lot of data though. And we'd have to check what hash the CDX contains exactly; I think IA recalculates it rather than relying on what's in the WARC, but not entirely sure. |
18:29:52 | | katocala is now authenticated as katocala |
18:45:45 | | nicolas17 is now authenticated as nicolas17 |
19:06:52 | | ducky_ (ducky) joins |
19:07:13 | | ducky quits [Ping timeout: 260 seconds] |
19:07:28 | | ducky_ is now known as ducky |
19:12:45 | | Webuser074404 joins |
19:13:29 | | Webuser074404 quits [Client Quit] |
19:28:40 | | BlueMaxima joins |
19:33:27 | | FartWithFury quits [Read error: Connection reset by peer] |
20:49:29 | | Naruyoko5 joins |
20:50:03 | | Naruyoko quits [Read error: Connection reset by peer] |
21:03:57 | | Arachnophine quits [Quit: Ping timeout (120 seconds)] |
21:04:14 | | Arachnophine (Arachnophine) joins |
21:06:58 | | BornOn420 quits [Remote host closed the connection] |
21:07:14 | | alexlehm quits [Remote host closed the connection] |
21:07:34 | | Sluggs quits [Quit: ZNC - http://znc.in] |
21:07:38 | | alexlehm (alexlehm) joins |
21:08:36 | | Barto quits [Quit: WeeChat 4.4.3] |
21:09:15 | | katia_ quits [Ping timeout: 260 seconds] |
21:09:50 | | @JAA quits [Ping timeout: 260 seconds] |
21:09:50 | | kokos| quits [Ping timeout: 260 seconds] |
21:11:09 | | JAA (JAA) joins |
21:11:09 | | @ChanServ sets mode: +o JAA |
21:12:09 | | loug8318142 joins |
21:14:50 | | kokos- joins |
21:16:29 | | katia_ (katia) joins |
21:18:59 | | Sluggs joins |
21:21:52 | | BornOn420 (BornOn420) joins |
21:34:33 | | Island joins |
21:43:06 | | sralracer quits [Quit: Ooops, wrong browser tab.] |
21:43:41 | | Barto (Barto) joins |
22:04:35 | | @JAA quits [Remote host closed the connection] |
22:05:11 | | JAA (JAA) joins |
22:05:11 | | @ChanServ sets mode: +o JAA |
22:14:22 | <TheTechRobo> | Is using a VPN with Warrior projects OK if I control the VPN? It's wireguard. |
22:20:43 | | BlueMaxima quits [Read error: Connection reset by peer] |
22:22:32 | <@OrIdow6> | arkiver: The sidecar WARC idea would work I think; also your explanation of the WBM's behavior seems to fit what I can see - e.g. https://web.archive.org/web/https://t.nhentai.net/galleries/73599/170t.jpg is a revisit with all A's in https://archive.org/details/archiveteam_nhentai_20240920192453_7ede7e27 , but in playback it redirects to a date but then says there are no WBM captures (live URL NSFW) |
22:26:22 | <@OrIdow6> | If we wanted to do *only* a CDX/sidecar-warc approach with no reading/writing of the original we could have a threshold of entropy of the hash (number of A's) - say "even if this ends in 20 A's, the first 20 digits are the same, there's only a 1/(whatever) chance this would've happened by coincidence" |
22:29:45 | | Unholy2361924645377131 quits [Ping timeout: 260 seconds] |
22:29:46 | <@OrIdow6> | More broadly I think we could come up with some kind of "score" for how likely the 2 are to be true equals? |
22:31:04 | <@OrIdow6> | Like, # of bits that match before the A's start + (5 if the URLs are the same else 0) + (5 if the Etag and HTTP content-length headers are the same else 0)
22:31:28 | <@OrIdow6> | And check if that's greater than 20 |
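Written out literally, that heuristic might look something like the following. This is purely a sketch of the idea as stated: the 5-point bonuses and the threshold of 20 are the numbers floated above, nothing settled, and each base32 character of the digest carries 5 bits.

    # Sketch of the proposed score between a suspect truncated digest and a
    # candidate record's recomputed digest (both base32 SHA-1 strings).
    def match_score(truncated_b32: str, candidate_b32: str,
                    same_url: bool, same_etag_and_length: bool) -> int:
        surviving = len(truncated_b32.rstrip('A'))   # characters before the A padding
        bits = surviving * 5 if candidate_b32.startswith(truncated_b32[:surviving]) else 0
        return bits + (5 if same_url else 0) + (5 if same_etag_and_length else 0)

    def probably_same_payload(truncated_b32, candidate_b32, same_url, same_etag_and_length):
        return match_score(truncated_b32, candidate_b32, same_url, same_etag_and_length) > 20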
22:32:35 | <nicolas17> | for non-revisit records you could also just calculate the correct hash |
22:32:46 | <@OrIdow6> | But better be cautious with this because if done wrongly (if the heuristic measures have too much weight) it could veer into "faking data" territory |
22:34:03 | <nicolas17> | if it has *any* 00 at the end, calculate the correct payload hash, if it doesn't match then anything with that hash is suspect |
22:35:46 | <@OrIdow6> | arkiver: Also in addition to the above I think that ASAP we should go thru uploaded collections, as well as the temporary storage, find all revisits that we think might be false-positive-revisits (i.e., those with above some set number of A's), and throw those into URLs |
22:36:18 | <@OrIdow6> | Which is doable from the CDXs from what's on IA already |
22:37:49 | <@OrIdow6> | nicolas17: For stuff already on the IA they already calculate the correct hash (due to the issue with chunked encoding or whatever it was) so that's what can be matched against |
22:38:31 | <@OrIdow6> | But the fact that the IA already assumes that the WARC-Payload-Digest values in the WARC are garbage means our big issue here is revisits, not the values of that header per se |
22:41:06 | <@OrIdow6> | "and throw those into URLs" - or maybe AB because URLs might DDOS some stuff |
22:43:20 | <@JAA> | Depends on how many there are, I'd say. |
22:44:25 | <@JAA> | And throwing into #// would possibly need to bypass backfeed. |
22:47:55 | <@OrIdow6> | (On the topic of revisits, I need to check again if those sha1-collision-attack PDFs are mixed up in the WBM...) |
22:48:51 | <nicolas17> | oh no |
22:56:03 | | etnguyen03 (etnguyen03) joins |
22:59:07 | | Wohlstand (Wohlstand) joins |
23:07:08 | | sralracer (sralracer) joins |
23:13:41 | | pixel leaves [Error from remote client] |
23:14:12 | | peo joins |
23:26:37 | <peo> | Hi all! Anyone awake who has knowledge about the grab-site tool? Need to resume an interrupted download if possible
23:27:23 | <@JAA> | Hi peo. That's not supported: https://github.com/ArchiveTeam/grab-site/issues/58 |
23:28:16 | <peo> | yep, as I have read.. just wondered if anyone had a work-around apart from using the "pause when diskspace is low" trick
23:29:17 | <peo> | It filled up the datadir's temp folder because it stumbled on a large file while downloading the whole world..
23:30:31 | <@JAA> | Ah, I bet it's planet-latest.osm.pbf. Classic. |
23:31:37 | <nicolas17> | x_x |
23:35:01 | | loug8318142 quits [Client Quit] |
23:36:31 | <@OrIdow6> | I am running a very slow scan thru some CDXs in collection:archiveteam that are accessible to me for false revisits; if you want the scripts (just a pipeline + GNU parallel, but it took a while since the latter still has pretty bad documentation), contact me
23:37:49 | <that_lurker> | would distributing that job help speed it up? |
23:38:06 | <nicolas17> | I assume IA would become the bottleneck quickly |
23:40:02 | | sralracer quits [Client Quit] |
23:41:28 | <@OrIdow6> | Parallel thinks there are 20h left on the set I'm doing (very reduced, limiting it to stuff I can access + stuff I think is at more imminent risk)