| 00:00:44 | <pokechu22> | The meta one is the job log and should be uploaded. The .cdx file is normally derived from the WARC by IA itself, though I don't know if that always happens or only happens for items that get indexed by web.archive.org. |
| 00:00:47 | | tekulvw quits [Ping timeout: 272 seconds] |
| 00:04:14 | <klea> | I think it only happens for items that get indexed by web.archive.org? https://archive.org/download/limewire.com_d_7xNKB_NfXjrIqBWo |
| 00:04:32 | <klea> | Tho, maybe it was me not running the derive thing after every file |
| 00:04:35 | <klea> | lemme make it derive. |
| 00:04:43 | <klea> | (if i remember howto) |
| 00:06:24 | <klea> | It seems if you have IA derive (which is the default I believe on the web uploader?), it will make a cdx. <https://archive.org/log/5191197716> claims it will do a CDXIndex. |
| 00:06:54 | <klea> | Huh |
| 00:06:58 | <klea> | [ PST: 2026-02-16 16:05:08 ] Executing: ulimit -v 1048576 && PYTHONPATH=/petabox/sw/lib/python timeout 600 /petabox/sw/bin/cdx_writer.pex 'WARCPROX-20260216205304743-00000-y1i40ow9.warc.gz' --file-prefix='limewire.com_d_7xNKB_NfXjrIqBWo' --exclude-list='/petabox/sw/wayback/web_excludes.txt' --stats-file='/f/_limewire.com_d_7xNKB_NfXjrIqBWo/cdxstats.json' > '/t/_limewire.com_d_7xNKB_NfXjrIqBWo/cdx.txt' |
| 00:07:09 | <klea> | Wait a second. |
| 00:07:35 | <klea> | Couldn't that be a way to bulk check lots of urls by making a warc with records of lots of data, and then getting the cdx and seeing what apparently is missing? |
| 00:07:50 | <klea> | Then you'd request deletion of all that crap, because nobody wants it. |
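The bulk-check idea above (derive a CDX from a WARC, then read off which URLs were captured) can be sketched with a small CDX parser. This is a hedged sketch assuming the common space-separated CDX layout in which the third field is the original URL; real cdx_writer output may use a different field order, so check the format header line first.

```python
# Sketch: list the original URLs recorded in a derived CDX file,
# assuming the common layout where the third field is the URL.
def cdx_urls(lines):
    urls = []
    for line in lines:
        if line.startswith(" CDX") or not line.strip():
            continue  # skip the format header and blank lines
        fields = line.split()
        if len(fields) >= 3:
            urls.append(fields[2])  # third field: original URL
    return urls

# Illustrative input only; field values are made up.
sample = [
    " CDX N b a m s k r M S V g",
    "com,example)/ 20260216205304 http://example.com/ text/html 200 AAAA - - 1234 0 test.warc.gz",
]
print(cdx_urls(sample))  # -> ['http://example.com/']
```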
| 00:19:07 | | nine quits [Quit: See ya!] |
| 00:19:20 | | nine joins |
| 00:19:20 | | nine is now authenticated as nine |
| 00:19:20 | | nine quits [Changing host] |
| 00:19:20 | | nine (nine) joins |
| 00:23:56 | <cruller> | TheoH7: I uploaded the entire output directory. https://archive.org/details/community.jisc.ac.uk-2026-02-16-35e53623-00000 |
| 00:32:50 | | etnguyen03 quits [Client Quit] |
| 00:40:14 | | SootBector quits [Remote host closed the connection] |
| 00:41:22 | | SootBector (SootBector) joins |
| 00:42:30 | <TheoH7> | cruller: Thanks, have downloaded it. |
| 00:43:16 | <TheoH7> | It looks like I also managed to do one where hard-coded links to https://community.ja.net (old address from years ago) are clickable in the WARC. I will upload that to IA likely in a few hours. |
| 00:44:16 | <TheoH7> | To upload the whole directory, is the best way to zip and upload, or can you select a whole folder for upload? |
| 00:48:36 | <pokechu22> | You can upload multiple files at once within a directory (uploading directories/subdirectories might also be possible but I think is more complicated?) |
| 00:50:57 | <TheoH7> | pokechu22: Great, will do that. |
| 00:51:48 | <TheoH7> | Seems one of my crawls has somehow managed to start crawling old versions of this site stored on the Wayback Machine, which is odd. I've added the pattern to ignores, but I'm just curious how grab-site would've found and started crawling such URLs. |
| 00:52:11 | <TheoH7> | I do already have 1 crawl without that done though, and will only upload the 2nd one if it contains materially more content |
| 01:02:20 | | tekulvw (tekulvw) joins |
| 01:03:29 | | ducky quits [Ping timeout: 272 seconds] |
| 01:04:19 | | etnguyen03 (etnguyen03) joins |
| 01:07:21 | | tekulvw quits [Ping timeout: 268 seconds] |
| 01:21:59 | | tekulvw (tekulvw) joins |
| 01:25:36 | | Webuser614729 joins |
| 01:26:29 | | Webuser614729 quits [Client Quit] |
| 01:27:43 | | wotd joins |
| 01:41:28 | | pokechu22 quits [Quit: System maintenance] |
| 02:22:10 | | sec^nd quits [Remote host closed the connection] |
| 02:22:35 | | sec^nd (second) joins |
| 02:36:40 | <nexussfan> | There's a site dedicated to archiving Iranian series and films <https://nostalgik-tv.com/> which says they have 4 terabytes of videos. Would it be a good idea to archive it, or not now? |
| 02:44:47 | | APOLLO03 quits [Ping timeout: 268 seconds] |
| 02:47:44 | | ducky (ducky) joins |
| 02:56:13 | | nine quits [Ping timeout: 272 seconds] |
| 02:58:38 | | nine joins |
| 02:58:40 | | nine is now authenticated as nine |
| 02:58:40 | | nine quits [Changing host] |
| 02:58:40 | | nine (nine) joins |
| 03:13:09 | | iPwnedYourIOTSmartdog quits [Ping timeout: 268 seconds] |
| 03:13:46 | | iPwnedYourIOTSmartdog joins |
| 04:01:44 | | etnguyen03 quits [Remote host closed the connection] |
| 04:11:07 | | tekulvw quits [Ping timeout: 268 seconds] |
| 04:14:04 | | tekulvw (tekulvw) joins |
| 04:23:37 | | tekulvw quits [Ping timeout: 272 seconds] |
| 04:26:20 | | Island quits [Read error: Connection reset by peer] |
| 04:28:43 | | tekulvw (tekulvw) joins |
| 04:33:45 | | tekulvw quits [Ping timeout: 272 seconds] |
| 05:04:47 | | n9nes quits [Ping timeout: 272 seconds] |
| 05:08:15 | | n9nes joins |
| 05:14:14 | | tekulvw (tekulvw) joins |
| 05:24:25 | | tekulvw quits [Ping timeout: 272 seconds] |
| 05:32:43 | | sec^nd quits [Remote host closed the connection] |
| 05:33:05 | | sec^nd (second) joins |
| 05:42:42 | <ericgallager> | I forget, did this make it here? https://www.theregister.com/2026/02/12/polyglot_notebooks_deprecation/ |
| 05:51:53 | | tekulvw (tekulvw) joins |
| 05:56:34 | | tekulvw quits [Ping timeout: 268 seconds] |
| 05:56:53 | | tekulvw (tekulvw) joins |
| 06:01:30 | | tekulvw quits [Ping timeout: 268 seconds] |
| 06:16:19 | | nexussfan quits [Quit: Konversation terminated!] |
| 06:17:13 | | ArchivalEfforts quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.] |
| 06:17:22 | | ArchivalEfforts joins |
| 06:22:50 | | tekulvw (tekulvw) joins |
| 06:27:45 | | tekulvw quits [Ping timeout: 272 seconds] |
| 06:57:16 | | tekulvw (tekulvw) joins |
| 07:05:37 | | pokechu22 (pokechu22) joins |
| 08:52:56 | | ducky quits [Ping timeout: 268 seconds] |
| 08:53:09 | | ducky (ducky) joins |
| 08:54:25 | | Dango360 quits [Quit: The Lounge - https://thelounge.chat] |
| 09:29:34 | | TheEnbyperor_ quits [Read error: Connection reset by peer] |
| 09:30:09 | | cipherrot quits [Ping timeout: 272 seconds] |
| 09:30:09 | | TheEnbyperor quits [Ping timeout: 272 seconds] |
| 09:37:48 | | Snivy quits [Quit: The Lounge - https://thelounge.chat] |
| 09:38:00 | | TheEnbyperor joins |
| 09:38:11 | | petrichor (petrichor) joins |
| 09:38:17 | | Snivy (Snivy) joins |
| 09:38:23 | | Snivy quits [Remote host closed the connection] |
| 09:38:36 | | TheEnbyperor_ (TheEnbyperor) joins |
| 09:39:42 | | Snivy (Snivy) joins |
| 09:42:53 | | tekulvw quits [Ping timeout: 268 seconds] |
| 10:03:54 | | rohvani quits [Quit: The Lounge - https://thelounge.chat] |
| 10:09:54 | | @arkiver quits [Remote host closed the connection] |
| 10:10:21 | | arkiver (arkiver) joins |
| 10:10:21 | | @ChanServ sets mode: +o arkiver |
| 10:14:12 | | fireatseaparks quits [Remote host closed the connection] |
| 10:14:48 | | fireatseaparks (fireatseaparks) joins |
| 10:26:37 | | APOLLO03 joins |
| 10:47:11 | | Webuser505408 joins |
| 10:47:32 | | Webuser505408 quits [Client Quit] |
| 11:37:03 | | tekulvw (tekulvw) joins |
| 11:39:37 | | Cornelius7 (Cornelius) joins |
| 11:41:15 | | Cornelius quits [Ping timeout: 272 seconds] |
| 11:41:15 | | Cornelius7 is now known as Cornelius |
| 11:41:53 | | tekulvw quits [Ping timeout: 272 seconds] |
| 11:47:35 | | irisfreckles13 joins |
| 11:58:52 | | APOLLO03 quits [Read error: Connection reset by peer] |
| 11:59:51 | | APOLLO03 joins |
| 12:00:03 | | Bleo1826007227196234552220 quits [Quit: The Lounge - https://thelounge.chat] |
| 12:02:44 | | Bleo1826007227196234552220 joins |
| 12:37:37 | | petrichor quits [Client Quit] |
| 12:51:47 | <irisfreckles13> | how do i request yt video to be archived? |
| 12:51:47 | <irisfreckles13> | possible? |
| 12:53:37 | <klea> | irisfreckles13: Depends if it's in scope, see https://wiki.archiveteam.org/index.php/YouTube#Scope and if it's in scope you can query it to #down-the-tube |
| 13:02:28 | <h2ibot> | Bear created Philips (+1114, Philips - more like Sorryps): https://wiki.archiveteam.org/?oldid=60489 |
| 13:17:48 | | Shard111 quits [Quit: Im doing something rq. Il brb] |
| 13:19:14 | | Shard1115 (Shard) joins |
| 13:23:15 | | petrichor (petrichor) joins |
| 13:37:50 | | Arcorann quits [Ping timeout: 268 seconds] |
| 13:38:29 | <justauser> | ericgallager: Doesn't look too actionable... |
| 13:41:34 | | Webuser660697 joins |
| 14:03:07 | | irisfreckles13 quits [Ping timeout: 272 seconds] |
| 14:12:44 | | Dada joins |
| 14:18:20 | | Dango360 (Dango360) joins |
| 14:25:45 | <h2ibot> | Justauser edited Discourse/active (+148, Added https://forums.kicksecure.com/…): https://wiki.archiveteam.org/?diff=60490&oldid=60465 |
| 14:31:32 | | tekulvw (tekulvw) joins |
| 14:36:25 | | tekulvw quits [Ping timeout: 268 seconds] |
| 14:52:01 | <@arkiver> | imer: are you able to see something in your logs that is queuing the googleapis.com URLs? |
| 14:54:54 | <@imer> | arkiver: (assuming #//) no, don't think its related to the other spam though |
| 14:56:09 | <@arkiver> | right, sorry, this was for #// |
| 15:09:01 | | irisfreckles13 joins |
| 15:11:35 | | Webuser116786 joins |
| 15:11:58 | | Webuser116786 quits [Client Quit] |
| 16:02:33 | | tekulvw (tekulvw) joins |
| 16:07:15 | | tekulvw quits [Ping timeout: 272 seconds] |
| 16:13:57 | | Island joins |
| 16:28:01 | <h2ibot> | Bear edited Mortis (+17, Provided by [[User:BouleBoule]] but not…): https://wiki.archiveteam.org/?diff=60491&oldid=58254 |
| 16:30:01 | <h2ibot> | Bear edited Mortis (-3, misplaced pipes): https://wiki.archiveteam.org/?diff=60492&oldid=60491 |
| 16:37:02 | <h2ibot> | Bear edited List of websites excluded from the Wayback Machine (+356, More details on Philips.com ([[Philips]])): https://wiki.archiveteam.org/?diff=60493&oldid=60371 |
| 16:40:37 | | Goofybally9 quits [Quit: The Lounge - https://thelounge.chat] |
| 16:41:23 | | Goofybally joins |
| 16:42:12 | | Goofybally quits [Client Quit] |
| 16:42:43 | | Goofybally (Goofybally) joins |
| 16:46:03 | <h2ibot> | Bear edited List of websites excluded from the Wayback Machine (+181, steampunkal.com excluded between 2013-11-12 and…): https://wiki.archiveteam.org/?diff=60494&oldid=60493 |
| 16:52:03 | | DogsRNice joins |
| 17:03:31 | | tekulvw (tekulvw) joins |
| 17:08:07 | | tekulvw quits [Ping timeout: 268 seconds] |
| 17:26:02 | <HP_Archivist> | RE: WARC captures. JAA apologies I'm just now responding to this. But thank you. I have used webrecorder for captures before, a few years ago. SingleFilez was merged into just SingleFile now, I think. |
| 17:26:39 | <HP_Archivist> | I used Webrecorder for these individual captures https://archive.org/details/@archivist_goals?query=warc |
| 17:27:04 | <HP_Archivist> | But going forward, I am leaning on trying browsertrix, to do things right. |
| 17:27:56 | <HP_Archivist> | Oh, but you said warcprox, too. Hm. |
| 17:30:11 | <HP_Archivist> | SingleFile's options are a little obtuse, though I remember using that too a few years back. |
| 17:33:50 | | tekulvw (tekulvw) joins |
| 17:42:02 | | tekulvw quits [Ping timeout: 268 seconds] |
| 17:44:31 | <justauser> | Fun https://pomf.lain.la/robots.txt |
| 17:45:04 | <justauser> | Apparently Google interpreted it as Disallow: /, but whatever is behind DDG didn't. |
| 17:54:22 | | corentin quits [Ping timeout: 268 seconds] |
| 17:54:27 | | tekulvw (tekulvw) joins |
| 17:59:01 | <justauser> | https://transfer.archivete.am/IjrDe/pomf.lain.la_ddg_nitter.txt |
| 17:59:02 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/IjrDe/pomf.lain.la_ddg_nitter.txt |
| 17:59:19 | <justauser> | Google, Yandex: nothing; Bing: unrelated websites. |
| 17:59:21 | | tekulvw quits [Ping timeout: 272 seconds] |
| 18:07:57 | <klea> | justauser: did you get pomf*.lain.la too?, iirc i've seen some pomf urls with pomf2 instead. |
| 18:08:18 | <justauser> | GitHub code, CDX: nothing |
| 18:08:27 | <justauser> | No, will check. |
| 18:08:47 | <klea> | ok, I think pomf3 and maybe check pomf[0-9] too I guess. |
| 18:09:32 | <klea> | apparently only pomf2 has valid tls. |
| 18:09:56 | | Webuser660697 quits [Quit: Ooops, wrong browser tab.] |
| 18:10:35 | | Webuser810542 joins |
| 18:13:36 | <justauser> | pomf2 is so much more abundant on Nitter, but only 1 link in DDG. |
| 18:14:38 | <klea> | huh |
| 18:27:13 | | @rewby quits [Ping timeout: 272 seconds] |
| 18:29:22 | | rewby (rewby) joins |
| 18:29:22 | | @ChanServ sets mode: +o rewby |
| 19:04:40 | | APOLLO03 quits [Ping timeout: 268 seconds] |
| 19:04:43 | | APOLLO03 joins |
| 19:05:32 | | tekulvw (tekulvw) joins |
| 19:17:37 | | Wohlstand1 (Wohlstand) joins |
| 19:18:14 | | tekulvw quits [Ping timeout: 268 seconds] |
| 19:19:59 | | Wohlstand1 is now known as Wohlstand |
| 19:27:56 | <justauser> | https://transfer.archivete.am/51vfP/pomf.lain.la_ddg_nitter_gharchive_2.txt |
| 19:27:56 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/51vfP/pomf.lain.la_ddg_nitter_gharchive_2.txt |
| 19:29:49 | <justauser> | pomf.lain.la is Wayback-excluded, but pomf2 is not. |
| 19:30:20 | <justauser> | So I suggest saving everything as pomf2 no matter which URL it has originally? |
| 19:31:31 | <justauser> | pomf2.lain.la has huge CDX records, in fact. Does it make sense to scrape them? |
| 19:31:50 | <justauser> | One URL = one file, so everything available in CDX is already saved. |
| 19:32:17 | <justauser> | But making a copy as AB WARC could help if pomf2 gets excluded too. |
| 19:32:28 | <klea> | yeah. |
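The "huge CDX records" mentioned above can be enumerated through the public Wayback CDX server API. A minimal sketch building such a query (the endpoint and parameters are the documented CDX API; the host prefix is from the discussion, and an excluded host typically returns an error or nothing rather than records):

```python
# Sketch: build a Wayback CDX API query for all captures under a prefix.
from urllib.parse import urlencode

def cdx_query(prefix, limit=1000):
    params = {
        "url": prefix,
        "matchType": "prefix",  # everything under this URL prefix
        "output": "json",
        "limit": str(limit),
        "fl": "timestamp,original,statuscode",  # fields to return
    }
    return "https://web.archive.org/cdx/search/cdx?" + urlencode(params)

print(cdx_query("pomf2.lain.la/"))
```

Fetching that URL (e.g. with urllib) returns a JSON array of capture rows; paging via `limit` plus resume keys is advisable for very large hosts.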
| 19:32:59 | <klea> | Also, whilst at it, it might be neat to archive philips stuff (re Bear's wiki page) |
| 19:33:02 | | Wohlstand quits [Ping timeout: 268 seconds] |
| 19:33:43 | | twiswist quits [Ping timeout: 272 seconds] |
| 19:54:53 | | APOLLO03a joins |
| 19:57:42 | | APOLLO03 quits [Ping timeout: 268 seconds] |
| 20:07:02 | | tekulvw (tekulvw) joins |
| 20:10:38 | | Doranwen quits [Read error: Connection reset by peer] |
| 20:11:01 | | Doranwen (Doranwen) joins |
| 20:11:05 | | rohvani joins |
| 20:11:43 | | tekulvw quits [Ping timeout: 272 seconds] |
| 20:14:01 | <TheoH7> | I'm in the process of uploading one of the crawls I did of https://community.jisc.ac.uk to IA. I can't see the websites collection. Is the best match "Community data" or do I need to request permissions? |
| 20:15:44 | <@JAA> | HP_Archivist: Browsertrix has the same issues of writing incorrect WARCs as far as I know. Haven't actually tested it though. |
| 20:17:10 | <@JAA> | (The readme of Browsertrix Crawler explicitly mentions capturing data through CDP, which can't possibly be correct because CDP doesn't expose the necessary data to write valid WARCs.) |
| 20:21:51 | | iPwnedYourIOTSmartdog quits [Ping timeout: 272 seconds] |
| 20:32:34 | <h2ibot> | Cooljeanius edited ArchiveBot/Ignore (+4, /* Substack */ red link to create article from): https://wiki.archiveteam.org/?diff=60495&oldid=59094 |
| 20:34:34 | <h2ibot> | Cooljeanius created Substack (+154, Created page with "{{stub}} '''Substack''' is…): https://wiki.archiveteam.org/?oldid=60496 |
| 20:41:53 | <nicolas17> | TheoH7: make sure you set mediatype to "web" |
| 20:42:10 | <nicolas17> | someone with more permissions can change the collection later |
| 20:42:19 | <nicolas17> | it still won't appear in wayback machine |
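nicolas17's advice (set mediatype to "web") can also be done from the `internetarchive` Python library rather than the web uploader. A hedged sketch; the identifier, filename, and title below are placeholders, not real items:

```python
# Sketch: upload a WARC to archive.org with mediatype "web", using the
# internetarchive library (pip install internetarchive).
metadata = {
    "mediatype": "web",                      # makes IA treat it as a web archive
    "title": "community.jisc.ac.uk crawl",   # illustrative placeholder
}

def do_upload():
    # Deferred import so the metadata above can be inspected without
    # the library installed; upload() is the library's documented entry point.
    from internetarchive import upload
    upload("some-identifier",                # placeholder identifier
           files=["crawl-00000.warc.gz"],    # placeholder filename
           metadata=metadata)

print(metadata["mediatype"])
```

As noted above, mediatype alone does not put the item into the Wayback Machine; that requires separate indexing on IA's side.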
| 20:49:10 | | DlugasnyPL joins |
| 20:50:28 | <DlugasnyPL> | HI, does archiveteam has any team which downloading rendered pages ? |
| 20:51:15 | | Wohlstand1 (Wohlstand) joins |
| 20:53:37 | | Wohlstand1 is now known as Wohlstand |
| 20:58:09 | | Webuser081957 joins |
| 20:58:26 | | Webuser081957 quits [Client Quit] |
| 20:58:49 | | twiswist (twiswist) joins |
| 21:02:27 | <pokechu22> | What do you mean by "rendered pages"? |
| 21:04:06 | | tekulvw (tekulvw) joins |
| 21:11:15 | | tekulvw quits [Ping timeout: 272 seconds] |
| 21:11:37 | <DlugasnyPL> | As I see in the documentation, the Warrior uses wget to download pages. wget is OK, but it cannot execute JavaScript, for example. Rendered pages = Browsertrix? |
| 21:13:08 | <DlugasnyPL> | For a few months I've been creating archives using Browsertrix. That's why I'm asking if you have any group here working with this kind of archiving |
| 21:18:08 | <pokechu22> | Yes, https://wiki.archiveteam.org/index.php/User:TheTechRobo/Mnbot (though this is not suitable for super large sites). Warrior projects also use lua so they can do things slightly smarter (e.g. look for specific strings in the page source and generate new requests off of those), though for particularly complicated sites that's insufficient. |
| 21:25:01 | <DlugasnyPL> | Interesting. But as I see, it's still in the development phase. I will keep an eye on it. |
| 21:25:21 | | Island_ joins |
| 21:25:49 | | Ryz quits [Ping timeout: 272 seconds] |
| 21:25:53 | | Ryz (Ryz) joins |
| 21:27:14 | <DlugasnyPL> | I have started my first Docker container with the Warrior on one of my servers. Is there any parameter to increase parallel downloads/uploads? I have a lot of resources, but I do not know how to set it up smartly. |
| 21:28:58 | | Island quits [Ping timeout: 268 seconds] |
| 21:29:51 | <klea> | --concurrent iirc, but also keep in mind if the limit you set is too high the site may ban you. |
| 21:41:00 | | tekulvw (tekulvw) joins |
| 21:41:13 | <DlugasnyPL> | I do not want to open parallel connections to one site. I would like to increase the number of pages my Warrior will crawl: 1-2 connections per site, multiple different domains. How do I set that up? |
| 21:42:16 | <DlugasnyPL> | 1-2 connections per domain, just to avoid a ban, multiple domains in parallel |
| 21:45:37 | | tekulvw quits [Ping timeout: 268 seconds] |
| 21:45:43 | <klea> | You probably want to run project workers directly instead of the docker warrior then. |
| 21:45:47 | <klea> | I believe? |
| 21:46:14 | <klea> | If you haven't set a choice, IIRC AT's default choice is normally telegram. |
| 21:48:48 | <DlugasnyPL> | project workers - yes, that sounds better ;) |
| 21:50:34 | <klea> | I don't know how people typically automate that process tho. |
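Running project workers directly, as suggested above, is usually done with the seesaw toolkit rather than the Warrior container. A hedged command sketch: the repository name is a placeholder, some projects use `run-pipeline3` instead of `run-pipeline`, and each project's README is authoritative.

```shell
# Sketch: run an ArchiveTeam project's pipeline directly with seesaw.
pip install seesaw
git clone https://github.com/ArchiveTeam/example-grab   # placeholder repo
cd example-grab
# --concurrent controls how many items this worker processes in parallel;
# keep it low to avoid getting banned by the target site.
run-pipeline pipeline.py YOURNICKNAME --concurrent 2
```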
| 21:54:28 | <DlugasnyPL> | I have asked ChatGPT but it also does not know |
| 21:56:56 | | Dada quits [Remote host closed the connection] |
| 22:03:30 | | DlugasnyPL quits [Remote host closed the connection] |
| 22:11:52 | | Dada joins |
| 22:44:28 | | Dada quits [Remote host closed the connection] |
| 22:47:19 | | DlugasnyPL joins |
| 22:50:30 | <DlugasnyPL> | If I create multiple instances of the Warrior on my server, you said I will end up with multiple Telegram crawlers... Is there any chance to download a list of my pages using this tool? |
| 22:53:55 | <DlugasnyPL> | In my opinion the end user should have a choice of what his instance will archive. I know that Archive Team is an organised group with a process, but is there any chance to do archiving (properly) as a "lone wolf", using a standalone version of the Warrior with a user-supplied list of domains? |
| 22:56:52 | <klea> | You can choose the project, you can't choose what specific items you'll get for a project. |
| 22:59:33 | | APOLLO03a quits [Ping timeout: 272 seconds] |
| 22:59:33 | | ^ quits [Ping timeout: 272 seconds] |
| 23:00:39 | <DlugasnyPL> | I do not understand the context of the word "project". What does it mean? Does it mean that somebody must set up a project for a specified domain, for example for telegram.org? That is a huge site, so I believe the task must be divided into small pieces and delegated to the end users. A project is something like the configuration required to download a specified site, correct? |
| 23:01:37 | | tekulvw (tekulvw) joins |
| 23:02:06 | <klea> | A project is normally, but not always, a specific website, or a combination of websites that work in the same manner. |
| 23:02:49 | <klea> | The website https://tracker.archiveteam.org/ shows some projects which are on the list of things people running Warriors can choose to do. |
| 23:03:26 | <klea> | Things like URLTeam2 or YouTube have some recommendations against running them at home because they leak IP addresses. |
| 23:03:56 | | ^ (^) joins |
| 23:04:46 | | Wohlstand quits [Client Quit] |
| 23:04:53 | | nexussfan (nexussfan) joins |
| 23:06:23 | <DlugasnyPL> | leak IP ? what do You mean ? |
| 23:06:24 | | tekulvw quits [Ping timeout: 268 seconds] |
| 23:06:56 | <DlugasnyPL> | archive team system trying to hide end user ips ? |
| 23:07:01 | <klea> | URLTeam2 also does lots of requests per second to urls that may not be fully vetted against. |
| 23:07:56 | <DlugasnyPL> | That's why you are spreading the job across multiple end users |
| 23:07:58 | <DlugasnyPL> | ok |
| 23:08:48 | <klea> | no, also I'm not doing things right now. |
| 23:09:21 | <@JAA> | klea is confusing URLTeam with URLs again. |
| 23:09:34 | <klea> | Sorry. |
| 23:09:40 | <klea> | The webui is confusing. |
| 23:09:48 | <klea> | it'd be neat to show the warrior readmes on the webpage too. |
| 23:10:23 | <klea> | s/URLTeam2 or/URLs or/ |
| 23:11:07 | <DlugasnyPL> | that would be perfect |
| 23:11:40 | <DlugasnyPL> | but discussion here is also nice :) |
| 23:13:51 | | APOLLO03 joins |
| 23:14:26 | | Dada joins |
| 23:15:52 | <DlugasnyPL> | I have a list of 36,000 domains not archived yet, Polish sites. I come from Poland. Is there any chance to create a project for them? |
| 23:17:11 | <klea> | depends on what kind of sites, and how they work. |
| 23:17:34 | <klea> | it could be done slowly on #archivebot most likely if they're not using javascript too much. |
| 23:17:49 | <DlugasnyPL> | they are using a lot js |
| 23:18:02 | <DlugasnyPL> | formulars, search pages |
| 23:18:04 | <DlugasnyPL> | etc.etc. |
| 23:18:51 | | Arcorann (Arcorann) joins |
| 23:19:11 | <DlugasnyPL> | We have started archiving once per month on most of those sites using Browsertrix with crawl depth control, but even with this it is a very time-consuming process |
| 23:19:57 | <@JAA> | Maybe it could slowly be run through #jseater but we don't currently have a distributed setup for JS-heavy things. |
| 23:20:13 | <@JAA> | No recursion there though. |
| 23:22:10 | <klea> | Should I add my jseater url list thingy to the wiki page? |
| 23:22:20 | | etnguyen03 (etnguyen03) joins |
| 23:22:23 | <klea> | maybe I should make it not give who queued stuff? |
| 23:22:59 | <DlugasnyPL> | Do you know what exactly, which problem, discredits Browsertrix? |
| 23:24:02 | <@JAA> | Same problem as ArchiveWeb.Page and anything else that uses Chrome Debugging Protocol to produce WARCs: it can't accurately capture the actual HTTP data received from the server, only a parsed version of it. |
| 23:24:11 | <@JAA> | See https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem |
| 23:28:04 | <DlugasnyPL> | parsed version, you mean output from browser ? |
| 23:28:50 | <@JAA> | Parsed representation of the HTTP responses, e.g. key-value pair of headers instead of the raw bytes. |
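JAA's point can be illustrated with stdlib Python: once a DevTools-style API hands you headers as parsed key-value pairs, the original wire bytes (casing, ordering, exact spacing) are gone, so a naive re-serialization cannot reproduce what the server actually sent. The response below is made up for illustration.

```python
# Illustration: parsed header representation vs. raw HTTP bytes.
raw = (b"HTTP/1.1 200 OK\r\n"
       b"content-TYPE: text/html\r\n"  # unusual casing, as a server might send
       b"X-First: 1\r\n"
       b"x-second: 2\r\n"
       b"\r\n"
       b"<html></html>")

# Parse into a key-value mapping, roughly as a debugging API exposes it.
head, _, body = raw.partition(b"\r\n\r\n")
status, *header_lines = head.split(b"\r\n")
parsed = {}
for line in header_lines:
    k, _, v = line.partition(b": ")
    parsed[k.title()] = v  # casing normalized; original form is lost

# Naive re-serialization from the parsed form:
rebuilt = status + b"\r\n" + b"\r\n".join(
    k + b": " + v for k, v in parsed.items()) + b"\r\n\r\n" + body

print(rebuilt != raw)  # -> True: the exact wire bytes cannot be recovered
```

A WARC writer that only sees `parsed` therefore records a reconstruction, not the true response, which is why a proxy like warcprox (which sees the raw bytes) produces more faithful WARCs.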
| 23:33:57 | <DlugasnyPL> | What kind of impact does it have on the final result? When a user opens an archive from Browsertrix and one from wget-at, what will be the difference? For example, in a WARC from the Warrior a search field with JS will not work and will not display any items, correct? Browsertrix will show the "rendered" JS page, right, and wget-at will not, but wget-at will have something which is not visible to the end user yet important for the WARC standard, correct? |
| 23:35:12 | | Dada quits [Remote host closed the connection] |
| 23:35:16 | <klea> | WARC is a raw capture of the raw bytes of the HTTP requests and responses, not of rendered webpages. |
| 23:35:57 | <DlugasnyPL> | just trying to understand the idea of WARC - visual effect, archive to browse by people or dry standard to collect raw data from the http servers |
| 23:36:34 | <DlugasnyPL> | ok |
| 23:37:02 | <@JAA> | Capturing the data and playing back pages from it are two entirely separate issues. |
| 23:38:22 | <DlugasnyPL> | So if data saved in a WARC is friendly to the human eye but not friendly for machines, then it is not the right standard. So you are trying to record the raw output from the servers and write it in a form which can be used afterwards to "render" the page? |
| 23:47:07 | <DlugasnyPL> | I think that's all for today. I will keep my Warrior running and help you crawl US pages, but anyway I would like to start the archiving process for plenty of Polish pages. Thank you for this nice introduction. |
| 23:48:31 | <@JAA> | WARC is both very human- and machine-readable, in the same way as HTTP. |
| 23:50:07 | <@JAA> | If you capture all HTTP requests/responses that occur during a page load, it should be possible to play that back again later as well. There are a lot of edge cases though. |
| 23:50:22 | <@JAA> | You can still store a screenshot or a DOM dump in the WARC as well if you want to do that. |
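To make JAA's "human- and machine-readable" point concrete, a single WARC response record is just named headers followed by the captured HTTP bytes. The values below are illustrative (the UUID is elided), and the Content-Length values assume CRLF line endings in the HTTP block:

```
WARC/1.1
WARC-Type: response
WARC-Target-URI: http://example.com/
WARC-Date: 2026-02-16T20:53:04Z
WARC-Record-ID: <urn:uuid:...>
Content-Type: application/http; msgtype=response
Content-Length: 77

HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 13

<html></html>
```

A WARC file is a sequence of such records (plus request, metadata, and warcinfo records), usually gzip-compressed per record.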
| 23:50:55 | <klea> | It'd be fun to make something to take tcpdump/wireshark output and make warcs out of it, but I don't have time. |
| 23:51:51 | <DlugasnyPL> | One more question before I go to sleep: where does the Warrior upload all the files? Directly to archive.org? |