00:20:00<SmileyG>urgh, my cousin in law was once on a telly program called crazy cottage
00:20:00<SmileyG>need to try and see if i can find it ¬_¬
01:53:00<godane>uploaded: http://archive.org/details/arstechnica.com-articles-1998-2004-mirror
02:03:00<godane>uploaded: http://archive.org/details/arstechnica.com-articles-2005-mirror
02:05:00<godane>1998 to 2004 is not much bigger than the full 2005 article mirror
02:37:00<godane>uploaded: http://archive.org/details/arstechnica.com-articles-2006-mirror
05:55:00<godane>uploaded: http://archive.org/details/arstechnica.com-articles-2007-mirror
05:55:00<godane>uploaded: http://archive.org/details/arstechnica.com-articles-2008-mirror
08:03:00<hiker1>Hi. What is the easiest way to access .warc file contents on Windows?
08:06:00<SketchCow>Never ask if you should archive something. Archive it and ask if any of us assholes want a copy
08:06:00<SketchCow>and then keep it yourself
08:08:00<Coderjoe>SketchCow: want a copy of ftp.cavedog.com? uncompressed tarball is 1.6GB, xz-compressed is 1GB
08:09:00<SketchCow>Duh
08:10:00<SketchCow>What was it?
08:10:00<Coderjoe>the game developer that made games like Total Annihilation
08:12:00<Coderjoe>(that is, Total Annihilation, and another similar RTS with a more medieval theme)
08:13:00<SketchCow>Approved.
08:13:00<Coderjoe>it also includes updates for their parent company, Humongous Entertainment
08:13:00<SketchCow>Do you have a place for me to download from or do I need to give you a slot?
08:15:00<Coderjoe>I can set up a download in a moment
08:20:00<godane>uploaded: http://archive.org/details/arstechnica.com-articles-2009-mirror
08:25:00<nova>archiving feels so good
08:25:00<nova>especially when the original disappears
08:26:00<hiker1>Can anyone help me access the contents of a .warc file on Windows?
08:43:00<ersi>Coderjoe: Oooh, I want that as well
08:48:00<Coderjoe>SketchCow: rsync path sent via PM
08:48:00<Coderjoe>ersi: I only have so much upstream bandwidth :-\
08:49:00<Coderjoe>looks like I probably did the last mirroring pass in 11/2005
09:02:00<godane>so i'm starting to do image grabs for each of my arstechnica dumps
09:09:00<hiker1>godane: What programs are you using for the archival process?
09:14:00<godane>wget
09:16:00<hiker1>thanks
09:17:00<ersi>wget-1.14 (the latest) has support for writing to WARC files, thanks to alard
09:18:00<hiker1>I'm still trying to get stuff out of the WARC files.
09:19:00<ersi>Hmm, there's warc2zip, that might help you since you're on windows - hold on a moment
09:20:00<ersi>hiker1: http://warctozip.archive.org/
09:20:00<hiker1>What are non-Windows users using?
09:22:00<hiker1>ersi: That website requires you to upload the entire warc file to the server. In some cases that's hundreds and hundreds of MB.
09:22:00<Coderjoe>you know what would be really crazy? implementing that warctozip using javascript, so you don't actually need to upload the file to a server
09:23:00<hiker1>But isn't there a reason stuff is stored using warc instead of zip?
09:23:00<ersi>I never really need to open up WARCs; when I do, I just `less` or `zless` them and read them straight. But there's a bunch of tools, like warc-tools from hanzoarchives (i.e. tef)
09:24:00<hiker1>How else do you read the contents of archived websites?
09:24:00<ersi>https://github.com/tef/warctools
09:24:00<ersi>The reason for WARC is that; Metadata. You'll know from WHERE and WHEN the data was downloaded. Because you have the HTTP Headers for both the Request and Response
09:24:00<Coderjoe>there is, for archives. the warc includes metadata like the original URL, request headers, response headers, date and time of the request, etc
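The metadata ersi and Coderjoe describe sits in plain-text headers at the start of every WARC record, which is why `less` and `zless` are enough to read one. A minimal sketch of pulling those headers out with only the Python standard library follows. The function names and the sample record are made up for illustration, and the parser assumes well-formed records that each carry a Content-Length header; it is not a replacement for warctools or the IA warc library.

```python
import gzip
import io


def read_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The record block (e.g. the raw HTTP response) is exactly Content-Length bytes.
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body


def open_warc(path):
    """Open a .warc or .warc.gz file as a byte stream (hypothetical helper)."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


# A hand-written sample record, so the sketch runs without a real WARC file.
SAMPLE_HTTP = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
SAMPLE_WARC = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"WARC-Date: 2012-12-04T09:24:00Z\r\n"
    b"Content-Length: " + str(len(SAMPLE_HTTP)).encode("ascii") + b"\r\n"
    b"\r\n" + SAMPLE_HTTP + b"\r\n\r\n"
)

for headers, body in read_warc_records(io.BytesIO(SAMPLE_WARC)):
    # WHERE and WHEN the data was downloaded, straight from the record headers.
    print(headers["WARC-Type"], headers["WARC-Target-URI"], headers["WARC-Date"])
```

For a file on disk, the same loop would read from `open_warc("some-file.warc.gz")` instead of the in-memory sample.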
09:25:00* ersi nods
09:25:00<ersi>The most common interface to actually view WARCs is, the Internet Archive Wayback Machine. But you can't use that for your own WARCs though ;)
09:25:00<Coderjoe>if you just need to get files out of it, converting it to zip is fine (provided you don't delete the warc)
09:25:00<Coderjoe>well, there is the open-source wayback codebase
09:25:00<ersi>true, but it's a pain in the ass to setup
09:25:00<Coderjoe>iirc, yipdw had an instance of that set up
09:26:00<hiker1>I would have thought there would be a program which hosts the warc file on a web server, or directly explores the contents without requiring a conversion.
09:26:00<hiker1>I'm trying out IA's warc library for python https://github.com/internetarchive/warc
09:26:00<Coderjoe>that's the wayback
09:27:00<ersi>IMO warc-tools from tef is better than IA warc
09:27:00<SketchCow>Coderjoe: Absorbing your cavedog as we speak.
09:27:00<ersi>Om nom nom
09:27:00<hiker1>Can you use the wayback machine to view warc's that are uploaded to IA?
09:27:00<SketchCow>After I get this, ersi, it'll be on archive.org in seconds. No worries.
09:28:00<ersi>SketchCow: I'll nom on it then
09:28:00<Coderjoe>I noticed (but only because I was watching the log)
09:28:00<Coderjoe>mmm
09:28:00<Coderjoe>traffic shaping and prioritizing really takes the pain away
09:40:00<hiker1>So is wget outputting to warc the preferred method for archiving sites? Not HTTrack?
09:40:00<Coderjoe>depends
09:41:00<Coderjoe>though I don't know about httrack
09:41:00<Coderjoe>IA has a tool called Heritrix for their normal crawls
09:42:00<Coderjoe>we use wget here because we can make it ignore robots.txt. and with the lua scripting, we can specialize things for each site.
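A rough sketch of the kind of wget invocation being discussed, assembled in Python so it can be inspected before anything is run. The flags are standard options in wget 1.14 (`--warc-file`, `--mirror`, `--page-requisites`, `-e robots=off`); the function name, the example URL, and the WARC base name are placeholders. The Lua scripting Coderjoe mentions is not covered here.

```python
def build_wget_warc_command(url, warc_name):
    """Return a wget argument list that mirrors `url` and also writes a WARC.

    `warc_name` is the base name; wget appends the .warc.gz extension itself.
    """
    return [
        "wget",
        "--mirror",                   # recursive download with timestamping
        "--page-requisites",          # also fetch images, CSS and scripts
        "-e", "robots=off",           # ignore robots.txt, as described above
        "--warc-file=" + warc_name,   # write requests and responses to a WARC
        url,
    ]


# Placeholder URL and name, for illustration only.
command = build_wget_warc_command("http://example.com/", "example.com-mirror")
print(" ".join(command))
```

Passing the list to `subprocess.run(command, check=True)` would start the crawl; that step is left out so the sketch has no side effects.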
09:42:00<chronomex>yup
09:58:00<alard>Or try viewing the warc with this: https://github.com/alard/warc-proxy :)
09:59:00<hiker1>That looks a lot closer to what I want
10:00:00<ersi>Oops, thought about writing that one out as well :)
10:00:00<hiker1>alard: Why does it use a proxy instead of just running a web server?
10:02:00<alard>hiker1: The thing *is* the proxy. That's the easiest way to do it -- from a technical perspective, that is -- since you don't have to rewrite any urls.
10:02:00<alard>The wayback web interface has to replace the URLs in every web page it serves. The warc-proxy addon just configures its little web server as a proxy, and it's done.
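The difference alard describes can be sketched in a few lines of Python. In proxy mode the browser sends the full original URL, so the archive is just a lookup table keyed by that URL. In rewriting mode every link in every page has to be changed to point back at the replay server. Both functions, the `archive` dictionary, and the replay prefix are hypothetical; this is not how warc-proxy or the wayback code is actually written, and a real rewriter also has to handle CSS, JavaScript, redirects and more.

```python
import re
from urllib.parse import urljoin


def proxy_lookup(archive, requested_url):
    """Proxy style: the requested URL is already the original URL, so look it up as-is."""
    return archive.get(requested_url)


def rewrite_links(html, page_url, replay_prefix):
    """Rewriting style: make every href/src absolute and route it through the replay server."""
    def replace(match):
        attr, quote, link = match.group(1), match.group(2), match.group(3)
        absolute = urljoin(page_url, link)  # resolve relative links against the page
        return "%s=%s%s%s%s" % (attr, quote, replay_prefix, absolute, quote)

    # Deliberately naive: only quoted href and src attributes are handled.
    return re.sub(r"""(href|src)=(["'])(.*?)\2""", replace, html)


# A one-page stand-in for the contents of a WARC file.
archive = {"http://example.com/a.html": '<a href="b.html">next</a>'}

page = proxy_lookup(archive, "http://example.com/a.html")
print(page)  # served untouched in proxy mode
print(rewrite_links(page, "http://example.com/a.html", "http://localhost:8080/replay/"))
```

The proxy path returns the stored bytes unchanged, which is why it is the simpler tool to build; the rewriting path is what makes the pages reachable by other people through an ordinary web address.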
10:03:00<hiker1>well, the nice thing about rewriting is then you can serve the files to other people through the web.
10:04:00<hiker1>with the proxy method, only a local user can access them, unless you make the proxy public which would not be easy for most users to access
10:04:00<ersi>the non-nice thing is that it's a pain in the ass
10:04:00<alard>Yes, but that's not what this tool is for. If you want to do that there's the wayback tool.
10:04:00<hiker1>What is the wayback tool?
10:04:00<ersi>Wayback Machine
10:04:00<hiker1>but that won't serve private warc files
10:04:00<alard>https://github.com/internetarchive/wayback
10:04:00<ersi>https://github.com/internetarchive/wayback
10:04:00<ersi>damn it
10:04:00<alard>Heh.
10:05:00<alard>But as you can see it's much harder to get that running than the warc-proxy + firefox addon.
10:05:00<hiker1>does warcproxy just grab whatever .warc files it sees?
10:06:00<hiker1>ah, nvm, it has a neat interface!
10:07:00<hiker1>wow, this is really impressive work
10:13:00<ersi>+1 alard
10:16:00<norbert79>alard: Holy-moly, this goes to my favourites
10:22:00<godane>alard: the urls in menu for warc-proxy don't work for me for some reason
10:22:00<godane>it doesn't take in the baseurl
10:22:00<hiker1>The base url didn't work for me, but the other ones did
10:23:00<godane>so it will go to folder/file instead of example.com/folder/file or something like that
10:23:00<godane>and so it would error
10:23:00<alard>That's strange.
10:24:00<godane>also when testing my eff.org grab it would just go to real site
10:24:00<alard>(Whether the base url works depends on the contents of your warc file. If the base url isn't in there it won't be visible.)
10:25:00<alard>godane: Is that an https site?
10:25:00<godane>yes
10:33:00<alard>godane: The https doesn't work yet. For some reason those requests aren't proxied. I've added it to the list: https://github.com/alard/warc-proxy/issues/2
10:44:00<ats>is there an Internet Archive IRC channel somewhere, or is this the best bet?
10:45:00<ersi>#internetarchive is the unofficial/semi-official channel
10:45:00<ats>cheers :)
10:45:00<ersi>mostly just to get IA shizzle out of this channel :)
10:46:00<chronomex>yes, same people here and there mostly
11:38:00<SketchCow>More hugs here
11:41:00<SketchCow>Hey, someone's using the warrior, it spent 45 minutes on "setting up data partition".
11:41:00<SketchCow>And he stopped it.
11:41:00<SketchCow>Any ideas?
11:42:00<ersi>scrap and start it again?
12:12:00<SmileyG>did you give it like a 10tb partition for /data?
12:24:00<tuabkiet>10TB???
12:48:00<hiker1>tuabkiet: You don't have 10 TB of RAID space lying around?
12:49:00<tuabkiet>I don't use RAID, and my hard disk is 10 times smaller
12:53:00<hiker1>How do I get wget 1.14?
12:58:00<ersi>hiker1: It's not in many repositories. You'll probably have to compile it yourself
12:59:00<hiker1>damn. I'm downloading Linux Mint Debian Edition which uses Debian Testing. I hope it's in there... Is there a compile guide by ArchiveTeam?
13:00:00<ersi>No, but I can probably help
13:00:00<hiker1>how long are you going to be on? I'm still downloading the Mint dvd.
13:01:00<ersi>debian testing has wget 1.13.4-3
13:01:00<hiker1>How did you find that out?
13:01:00<hiker1>I was looking for a package listing but couldn't find one
13:01:00<ersi>debian sid has wget 1.14
13:01:00<ersi>http://packages.debian.org bro
13:01:00<hiker1>they hid it on their packages subdomain! those sneaky...
13:02:00<ersi>you can probably install that .deb and everything will be fine
13:02:00<hiker1>I think there was an aptosid...
13:03:00<ersi>you can probably just dpkg -i the .deb if you're inclined
13:03:00<hiker1>ersi: Do you use a linux distro? if so, which?
13:04:00<ersi>Ubuntu, Red Hat Enterprise Server, Gentoo, crappy version of SuSE and I've used Debian
13:04:00<hiker1>oh.
13:04:00<hiker1>no mint?
13:04:00<ersi>nope. But it's just another Debian derivative
13:06:00<SketchCow>http://archive.org/details/ftp_cavedog.com now up
13:07:00<hiker1>Where are archives of known dead sites kept?
13:07:00<hiker1>I only saw the just in time captures
13:09:00<hiker1>SketchCow: Any chance you could post a file listing along with the FTP Snapshot? It would be nice to know what I'm getting before grabbing 1.5 GB.
13:10:00<ersi>Most are up on archive.org
13:10:00<ersi>SketchCow: thx~
13:11:00<hiker1>Does http://archive.org/details/archiveteam-fire include known dead sites?
13:11:00<alard>hiker1: http://archive.org/download/ftp_cavedog.com/ftp.cavedog.com.tar/
13:12:00<alard>(a slash at the end of the .tar usually gives you an index)
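The trailing-slash index is served by archive.org itself. Once a tarball is on disk, a comparable listing can be produced with Python's standard `tarfile` module. The sketch below builds a tiny tar in memory so it runs without downloading anything; the helper names and the sample file are made up for illustration.

```python
import io
import tarfile


def list_tar_members(fileobj):
    """Return (name, size) for every regular file in a tar archive."""
    # "r:*" lets tarfile detect gzip/bz2/xz compression on its own.
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        return [(m.name, m.size) for m in archive.getmembers() if m.isfile()]


def make_sample_tar():
    """Build a one-file tar in memory (stand-in for a real FTP snapshot)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        data = b"hello"
        info = tarfile.TarInfo(name="pub/readme.txt")
        info.size = len(data)
        archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


for name, size in list_tar_members(make_sample_tar()):
    print(size, name)
```

For a downloaded file, `list_tar_members(open("ftp.cavedog.com.tar", "rb"))` would give the same kind of listing.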
13:12:00<hiker1>alard: oh, wow, that is handy. Thank you.
13:12:00<SketchCow>Also, you should trust me
13:12:00<SketchCow>Everything I upload is awesome
13:12:00<hiker1>hah
13:13:00<ersi>Indeed
13:26:00<hiker1>I downloaded a forum about 3 years ago. The place is gone now. IA has some of the forum archived, but I'm pretty sure my archive has everything. Can I distribute it through ArchiveTeam?
13:28:00<hiker1>The forum had a few thousand posts. It was the official forum for a video game called Lord of the Rings Online TCG. The whole archive is only 11 MB.
13:43:00<tuabkiet>hiker1: Up it to Internet Archive NOW!
13:43:00<hiker1>I am not sure how
13:44:00<ersi>Create an account first and foremost
16:44:00<SketchCow>Bagger 288! Bagger 288!
16:46:00<soultcer>SketchCow: Did you find the two Dailybooth warc files I asked for?
16:47:00<schbiridi>the tracker thing, e.g. as used at http://tracker.archiveteam.org/webshots/, could use a "Wanna join? http://www.archiveteam.org/index.php?title=ArchiveTeam_Warrior" link
16:48:00<SketchCow>Agreed on wanna join.
16:48:00<SketchCow>soultcer: No, I've been working on my presentation.
16:48:00<SketchCow>E-mail me. jason@textfiles.com.
16:48:00<soultcer>Will do
19:00:00<alard>Has someone saved the http://blog.webshots.com/ ?
23:27:00<Nemo_bis>slowly redoing wikia dumps mirror: https://archive.org/details/wikia_dump_20121204
23:28:00<Nemo_bis>now 5704 wikis beginning with "a" vs. 872 in previous snapshot
23:29:00<Nemo_bis>still, looks like dumps are not generated for 80 % of wikis they have even if requested
23:39:00<alard>---------------------------------------------------------------------------
23:39:00<alard>Hi all. Webshots is done. 109 TB saved by 134 downloaders. Thanks!
23:39:00<alard>It's available on the projects tab of your warrior.
23:39:00<alard>Next station: DailyBooth.com, closing at the end of the year.
23:39:00<alard>If you want to run it yourself: https://github.com/ArchiveTeam/dailybooth-grab
23:39:00<alard>(All very similar to WebShots and previous projects.)
23:39:00<alard>Join #dailybooth for more detailed discussions.
23:39:00<alard>---------------------------------------------------------------------------