#archiveteam<efnet> log for 2012-12-04

00:20:00	<SmileyG>	urgh, my cousin in law was once on a telly program called crazy cottage
00:20:00	<SmileyG>	need to try and see if i can find it ¬_¬
01:53:00	<godane>	uploaded: http://archive.org/details/arstechnica.com-articles-1998-2004-mirror
02:03:00	<godane>	uploaded: http://archive.org/details/arstechnica.com-articles-2005-mirror
02:05:00	<godane>	1998 to 2004 is not much bigger than the full 2005 article mirror
02:37:00	<godane>	uploaded: http://archive.org/details/arstechnica.com-articles-2006-mirror
05:55:00	<godane>	uploaded: http://archive.org/details/arstechnica.com-articles-2007-mirror
05:55:00	<godane>	uploaded: http://archive.org/details/arstechnica.com-articles-2008-mirror
08:03:00	<hiker1>	Hi. What is the easiest way to access .warc file contents on Windows?
08:06:00	<SketchCow>	Never ask if you should archive something. Archive it and ask if any of us assholes want a copy
08:06:00	<SketchCow>	and then keep it yourself
08:08:00	<Coderjoe>	SketchCow: want a copy of ftp.cavedog.com? uncompressed tarball is 1.6GB, xz-compressed is 1GB
08:09:00	<SketchCow>	Duh
08:10:00	<SketchCow>	What was it?
08:10:00	<Coderjoe>	the game developer that made games like Total Annihilation
08:12:00	<Coderjoe>	(that is, Total Annihilation, and another similar RTS with a more medieval theme)
08:13:00	<SketchCow>	Approved.
08:13:00	<Coderjoe>	it also includes updates for their parent company, Humongous Entertainment
08:13:00	<SketchCow>	Do you have a place for me to download from or do I need to give you a slot?
08:15:00	<Coderjoe>	I can set up a download in a moment
08:20:00	<godane>	uploaded: http://archive.org/details/arstechnica.com-articles-2009-mirror
08:25:00	<nova>	archiving feels so good
08:25:00	<nova>	especially when the original disappears
08:26:00	<hiker1>	Can anyone help me access the contents of a .warc file on Windows?
08:43:00	<ersi>	Coderjoe: Oooh, I want that as well
08:48:00	<Coderjoe>	SketchCow: rsync path sent via PM
08:48:00	<Coderjoe>	ersi: I only have so much upstream bandwidth :-\
08:49:00	<Coderjoe>	looks like I probably did the last mirroring pass in 11/2005
09:02:00	<godane>	so i'm starting to do image grabs for each of my arstechnica dumps
09:09:00	<hiker1>	godane: What programs are you using for the archival process?
09:14:00	<godane>	wget
09:16:00	<hiker1>	thanks
09:17:00	<ersi>	wget-1.14 (the latest) has support for writing to WARC files, thanks to alard
09:18:00	<hiker1>	I'm still trying to get stuff out of the WARC files.
09:19:00	<ersi>	Hmm, there's warc2zip, that might help you since you're on windows - hold on a moment
09:20:00	<ersi>	hiker1: http://warctozip.archive.org/
09:20:00	<hiker1>	What are non-Windows users using?
09:22:00	<hiker1>	ersi: That website requires you upload the entire warc file to the server. In some cases that's hundreds and hundreds of MB.
09:22:00	<Coderjoe>	you know what would be really crazy? implementing that warctozip using javascript, so you don't actually need to upload the file to a server
09:23:00	<hiker1>	But isn't there a reason stuff is stored using warc instead of zip?
09:23:00	<ersi>	I never really need to open up WARCs; when I do, I just `less` or `zless` them and read them straight. But there's a bunch of tools, like warc-tools from hanzoarchives (i.e. tef)
09:24:00	<hiker1>	How else do you read the contents of archived websites?
09:24:00	<ersi>	https://github.com/tef/warctools
09:24:00	<ersi>	The reason for WARC is metadata. You'll know from WHERE and WHEN the data was downloaded, because you have the HTTP headers for both the request and the response
09:24:00	<Coderjoe>	there is, for archives. the warc includes metadata like the original URL, request headers, response headers, date and time of the request, etc
09:25:00		ersi nods
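
The metadata ersi and Coderjoe describe sits in a plain-text header block at the start of each WARC record. The following is a minimal sketch, not any particular library's API: `read_warc_headers` is a hypothetical helper, and the sample record is made up for illustration. It reads the version line and the named header fields of the first record in a binary stream.

```python
import io


def read_warc_headers(stream):
    """Read the header block of the first WARC record in a binary stream.

    Returns (version_line, headers) where headers maps field names to values.
    """
    version = stream.readline().decode("utf-8").strip()
    if not version.startswith("WARC/"):
        raise ValueError("not a WARC record: %r" % version)
    headers = {}
    while True:
        line = stream.readline().decode("utf-8").rstrip("\r\n")
        if line == "":  # a blank line (or end of file) ends the header block
            break
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return version, headers


# Hypothetical sample record, only for demonstration.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/folder/file\r\n"
    b"WARC-Date: 2012-12-04T09:24:00Z\r\n"
    b"Content-Length: 0\r\n"
    b"\r\n"
)
version, headers = read_warc_headers(io.BytesIO(sample))
print(headers["WARC-Target-URI"], headers["WARC-Date"])
```

For a gzip-compressed file, a stream from the standard library's `gzip.open(path, "rb")` should work in place of the `io.BytesIO` object; the path is whatever .warc.gz file you have at hand.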
09:25:00	<ersi>	The most common interface to actually view WARCs is, the Internet Archive Wayback Machine. But you can't use that for your own WARCs though ;)
09:25:00	<Coderjoe>	if you just need to get files out of it, converting it to zip is fine (provided you don't delete the warc)
09:25:00	<Coderjoe>	well, there is the open-source wayback codebase
09:25:00	<ersi>	true, but it's a pain in the ass to setup
09:25:00	<Coderjoe>	iirc, yipdw had an instance of that set up
09:26:00	<hiker1>	I would have thought there would be a program which hosts the warc file on a web server, or directly explores the contents without requiring a conversion.
09:26:00	<hiker1>	I'm trying out IA's warc library for python https://github.com/internetarchive/warc
09:26:00	<Coderjoe>	that's the wayback
09:27:00	<ersi>	IMO warc-tools from tef is better than IA warc
09:27:00	<SketchCow>	Coderjoe: Absorbing your cavedog as we speak.
09:27:00	<ersi>	Om nom nom
09:27:00	<hiker1>	Can you use the wayback machine to view warc's that are uploaded to IA?
09:27:00	<SketchCow>	After I get this, ersi, it'll be on archive.org in seconds. No worries.
09:28:00	<ersi>	SketchCow: I'll nom on it then
09:28:00	<Coderjoe>	I noticed (but only because I was watching the log)
09:28:00	<Coderjoe>	mmm
09:28:00	<Coderjoe>	traffic shaping and prioritizing really takes the pain away
09:40:00	<hiker1>	So is wget outputting to warc the preferred method for archiving sites? Not HTTrack?
09:40:00	<Coderjoe>	depends
09:41:00	<Coderjoe>	though I don't know about httrack
09:41:00	<Coderjoe>	IA has a tool called Heritrix for their normal crawls
09:42:00	<Coderjoe>	we use wget here because we can make it ignore robots.txt. and with the lua scripting, we can specialize things for each site.
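
As a rough illustration of the wget setup Coderjoe mentions, the sketch below assembles a command line that mirrors a site into a WARC file while ignoring robots.txt. `build_wget_command`, the example URL and the WARC name are assumptions made for this example; the flags (`--mirror`, `--page-requisites`, `-e robots=off`, `--warc-file`) are standard wget 1.14 options. The Lua scripting is not covered here.

```python
import shlex


def build_wget_command(url, warc_name):
    """Build an argument list for a WARC-writing wget mirror run (hypothetical helper)."""
    return [
        "wget",
        "--mirror",                   # recursive download with timestamping
        "--page-requisites",          # also fetch images, CSS, etc.
        "-e", "robots=off",           # do not honour robots.txt
        "--warc-file=" + warc_name,   # wget appends .warc.gz to this name
        url,
    ]


cmd = build_wget_command("http://example.com/", "example.com-2012-12-04")
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(part) for part in cmd))
```

The list can be handed to `subprocess.run(cmd)` on a machine that has wget 1.14 or newer installed.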
09:42:00	<chronomex>	yup
09:58:00	<alard>	Or try viewing the warc with this: https://github.com/alard/warc-proxy :)
09:59:00	<hiker1>	That looks a lot closer to what I want
10:00:00	<ersi>	Oops, thought about writing that one out as well :)
10:00:00	<hiker1>	alard: Why does it use a proxy instead of just running a web server?
10:02:00	<alard>	hiker1: The thing is the proxy. That's the easiest way to do it -- from a technical perspective, that is -- since you don't have to rewrite any urls.
10:02:00	<alard>	The wayback web interface has to replace the URLs in every web page it serves. The warc-proxy addon just configures its little web server as a proxy, and it's done.
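
The difference alard describes can be sketched in a few lines. This is an illustrative toy, not the code of warc-proxy or the wayback software: `ARCHIVE`, both functions and the `/web/` prefix are hypothetical. In proxy mode the browser sends the original absolute URL, so a lookup is enough; in rewrite mode every link in the served page has to be pointed back at the replay server.

```python
import re

# Hypothetical archived responses, keyed by their original URL.
ARCHIVE = {
    "http://example.com/": '<a href="http://example.com/folder/file">file</a>',
    "http://example.com/folder/file": "hello",
}


def serve_via_proxy(requested_url):
    """Proxy style: the request already carries the original URL; return the body untouched."""
    return ARCHIVE.get(requested_url)


def serve_via_rewrite(path, prefix="http://localhost:8000/web/"):
    """Rewrite style: the original URL is embedded in the path, and links must be rewritten."""
    original_url = path[len("/web/"):]
    body = ARCHIVE.get(original_url)
    if body is None:
        return None
    # Point every absolute href/src back at the replay server.
    return re.sub(
        r'(href|src)="(https?://[^"]+)"',
        lambda m: '%s="%s%s"' % (m.group(1), prefix, m.group(2)),
        body,
    )


print(serve_via_proxy("http://example.com/"))
print(serve_via_rewrite("/web/http://example.com/"))
```

Even this toy version only catches absolute links in `href` and `src` attributes; relative links, CSS and JavaScript would all need their own handling, which is the extra work the proxy approach avoids.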
10:03:00	<hiker1>	well, the nice thing about rewriting is then you can serve the files to other people through the web.
10:04:00	<hiker1>	with the proxy method, only a local user can access them, unless you make the proxy public which would not be easy for most users to access
10:04:00	<ersi>	the non-nice thing is that it's a pain in the ass
10:04:00	<alard>	Yes, but that's not what this tool is for. If you want to do that there's the wayback tool.
10:04:00	<hiker1>	What is the wayback tool?
10:04:00	<ersi>	Wayback Machine
10:04:00	<hiker1>	but that won't serve private warc files
10:04:00	<alard>	https://github.com/internetarchive/wayback
10:04:00	<ersi>	https://github.com/internetarchive/wayback
10:04:00	<ersi>	damn it
10:04:00	<alard>	Heh.
10:05:00	<alard>	But as you can see it's much harder to get that running than the warc-proxy + firefox addon.
10:05:00	<hiker1>	does warcproxy just grab whatever .warc files it sees?
10:06:00	<hiker1>	ah, nvm, it has a neat interface!
10:07:00	<hiker1>	wow, this is really impressive work
10:13:00	<ersi>	+1 alard
10:16:00	<norbert79>	alard: Holy-moly, this goes to my favourites
10:22:00	<godane>	alard: the urls in menu for warc-proxy don't work for me for some reason
10:22:00	<godane>	it doesn't take in the baseurl
10:22:00	<hiker1>	The base url didn't work for me, but the other ones did
10:23:00	<godane>	so it will go to folder/file instead of example.com/folder/file or something like that
10:23:00	<godane>	and so it would error
10:23:00	<alard>	That's strange.
10:24:00	<godane>	also when testing my eff.org grab it would just go to real site
10:24:00	<alard>	(Whether the base url works depends on the contents of your warc file. If the base url isn't in there it won't be visible.)
10:25:00	<alard>	godane: Is that an https site?
10:25:00	<godane>	yes
10:33:00	<alard>	godane: The https doesn't work yet. For some reason those requests aren't proxied. I've added it to the list: https://github.com/alard/warc-proxy/issues/2
10:44:00	<ats>	is there an Internet Archive IRC channel somewhere, or is this the best bet?
10:45:00	<ersi>	#internetarchive unofficial/semi-official channel
10:45:00	<ats>	cheers :)
10:45:00	<ersi>	mostly just to get IA shizzle out of this channel :)
10:46:00	<chronomex>	yes, same people here and there mostly
11:38:00	<SketchCow>	More hugs here
11:41:00	<SketchCow>	Hey, someone's using the warrior, it spent 45 minutes on "setting up data partition".
11:41:00	<SketchCow>	And he stopped it.
11:41:00	<SketchCow>	Any ideas?
11:42:00	<ersi>	scrap and start it again?
12:12:00	<SmileyG>	did you give it like a 10tb partition for /data?
12:24:00	<tuabkiet>	10TB???
12:48:00	<hiker1>	tuabkiet: You don't have 10 TB of RAID space lying around?
12:49:00	<tuabkiet>	I don't use RAID, and my hard disk is 10 times smaller
12:53:00	<hiker1>	How do I get wget 1.14?
12:58:00	<ersi>	hiker1: It's not in many repositories. You'll probably have to compile it yourself
12:59:00	<hiker1>	damn. I'm downloading Linux Mint Debian Edition which uses Debian Testing. I hope it's in there... Is there a compile guide by ArchiveTeam?
13:00:00	<ersi>	No, but I can probably help
13:00:00	<hiker1>	how long are you going to be on? I'm still downloading the Mint dvd.
13:01:00	<ersi>	debian testing has wget 1.13.4-3
13:01:00	<hiker1>	How did you find that out?
13:01:00	<hiker1>	I was looking for a package listing but couldn't find one
13:01:00	<ersi>	debian sid has wget 1.14
13:01:00	<ersi>	http://packages.debian.org bro
13:01:00	<hiker1>	they hid it on their packages subdomain! those sneaky...
13:02:00	<ersi>	you can probably install that .deb and everything will be fine
13:02:00	<hiker1>	I think there was an aptosid...
13:03:00	<ersi>	you can probably just dpkg -i the .deb if you're inclined
13:03:00	<hiker1>	ersi: Do you use a linux distro? if so, which?
13:04:00	<ersi>	Ubuntu, Red Hat Enterprise Server, Gentoo, crappy version of SuSE and I've used Debian
13:04:00	<hiker1>	oh.
13:04:00	<hiker1>	no mint?
13:04:00	<ersi>	nope. But it's just another Debian derivative
13:06:00	<SketchCow>	http://archive.org/details/ftp_cavedog.com now up
13:07:00	<hiker1>	Where are archives of known dead sites kept?
13:07:00	<hiker1>	I only saw the just in time captures
13:09:00	<hiker1>	SketchCow: Any chance you could post a file listing along with the FTP Snapshot? It would be nice to know what I'm getting before grabbing 1.5 GB.
13:10:00	<ersi>	Most are up on archive.org
13:10:00	<ersi>	SketchCow: thx~
13:11:00	<hiker1>	Does http://archive.org/details/archiveteam-fire include known dead sites?
13:11:00	<alard>	hiker1: http://archive.org/download/ftp_cavedog.com/ftp.cavedog.com.tar/
13:12:00	<alard>	(a slash at the end of the .tar usually gives you an index)
13:12:00	<hiker1>	alard: oh, wow, that is handy. Thank you.
13:12:00	<SketchCow>	Also, you should trust me
13:12:00	<SketchCow>	Everything I upload is awesome
13:12:00	<hiker1>	hah
13:13:00	<ersi>	Indeed
13:26:00	<hiker1>	I downloaded a forum about 3 years ago. The place is gone now. IA has some of the forum archived, but I'm pretty sure my archive has everything. Can I distribute it through ArchiveTeam?
13:28:00	<hiker1>	The forum had a few thousand posts. It was the official forum for a video game called Lord of the Rings Online TCG. The whole archive is only 11 MB.
13:43:00	<tuabkiet>	hiker1: Up it to Internet Archive NOW!
13:43:00	<hiker1>	I am not sure how
13:44:00	<ersi>	Create an account first and foremost
16:44:00	<SketchCow>	Bagger 288! Bagger 288!
16:46:00	<soultcer>	SketchCow: Did you find the two Dailybooth warc files I asked for?
16:47:00	<schbiridi>	the tracker thing eg used at http://tracker.archiveteam.org/webshots/ could use a "Wanna join? http://www.archiveteam.org/index.php?title=ArchiveTeam_Warrior" link
16:48:00	<SketchCow>	Agreed on wanna join.
16:48:00	<SketchCow>	soultcer: No, I've been working on my presentation.
16:48:00	<SketchCow>	E-mail me. jason@textfiles.com.
16:48:00	<soultcer>	Will do
19:00:00	<alard>	Has someone saved the http://blog.webshots.com/ ?
23:27:00	<Nemo_bis>	slowly redoing wikia dumps mirror: https://archive.org/details/wikia_dump_20121204
23:28:00	<Nemo_bis>	now 5704 wikis beginning by "a" vs. 872 in previous snapshot
23:29:00	<Nemo_bis>	still, looks like dumps are not generated for 80 % of wikis they have even if requested
23:39:00	<alard>	---------------------------------------------------------------------------
23:39:00	<alard>	Hi all. Webshots is done. 109 TB saved by 134 downloaders. Thanks!
23:39:00	<alard>	It's available on the projects tab of your warrior.
23:39:00	<alard>	Next station: DailyBooth.com, closing at the end of the year.
23:39:00	<alard>	If you want to run it yourself: https://github.com/ArchiveTeam/dailybooth-grab
23:39:00	<alard>	(All very similar to WebShots and previous projects.)
23:39:00	<alard>	Join #dailybooth for more detailed discussions.
23:39:00	<alard>	---------------------------------------------------------------------------