00:20:00 | <SmileyG> | urgh, my cousin in law was once on a telly program called crazy cottage |
00:20:00 | <SmileyG> | need to try and see if i can find it ¬_¬ |
01:53:00 | <godane> | uploaded: http://archive.org/details/arstechnica.com-articles-1998-2004-mirror |
02:03:00 | <godane> | uploaded: http://archive.org/details/arstechnica.com-articles-2005-mirror |
02:05:00 | <godane> | 1998 to 2004 is not much bigger than the full 2005 article mirror |
02:37:00 | <godane> | uploaded: http://archive.org/details/arstechnica.com-articles-2006-mirror |
05:55:00 | <godane> | uploaded: http://archive.org/details/arstechnica.com-articles-2007-mirror |
05:55:00 | <godane> | uploaded: http://archive.org/details/arstechnica.com-articles-2008-mirror |
08:03:00 | <hiker1> | Hi. What is the easiest way to access .warc file contents on Windows? |
08:06:00 | <SketchCow> | Never ask if you should archive something. Archive it and ask if any of us assholes want a copy |
08:06:00 | <SketchCow> | and then keep it yourself |
08:08:00 | <Coderjoe> | SketchCow: want a copy of ftp.cavedog.com? uncompressed tarball is 1.6GB, xz-compressed is 1GB |
08:09:00 | <SketchCow> | Duh |
08:10:00 | <SketchCow> | What was it? |
08:10:00 | <Coderjoe> | the game developer that made games like Total Annihilation |
08:12:00 | <Coderjoe> | (that is, Total Annihilation, and another similar RTS with a more medieval theme) |
08:13:00 | <SketchCow> | Approved. |
08:13:00 | <Coderjoe> | it also includes updates for their parent company, Humongous Entertainment |
08:13:00 | <SketchCow> | Do you have a place for me to download from or do I need to give you a slot? |
08:15:00 | <Coderjoe> | I can set up a download in a moment |
08:20:00 | <godane> | uploaded: http://archive.org/details/arstechnica.com-articles-2009-mirror |
08:25:00 | <nova> | archiving feels so good |
08:25:00 | <nova> | especially when the original disappears |
08:26:00 | <hiker1> | Can anyone help me access the contents of a .warc file on Windows? |
08:43:00 | <ersi> | Coderjoe: Oooh, I want that as well |
08:48:00 | <Coderjoe> | SketchCow: rsync path sent via PM |
08:48:00 | <Coderjoe> | ersi: I only have so much upstream bandwidth :-\ |
08:49:00 | <Coderjoe> | looks like I probably did the last mirroring pass in 11/2005 |
09:02:00 | <godane> | so i'm starting to do image grabs for each of my arstechnica dumps |
09:09:00 | <hiker1> | godane: What programs are you using for the archival process? |
09:14:00 | <godane> | wget |
09:16:00 | <hiker1> | thanks |
09:17:00 | <ersi> | wget-1.14 (the latest) has support for writing to WARC files, thanks to alard |
09:18:00 | <hiker1> | I'm still trying to get stuff out of the WARC files. |
09:19:00 | <ersi> | Hmm, there's warc2zip, that might help you since you're on windows - hold on a moment |
09:20:00 | <ersi> | hiker1: http://warctozip.archive.org/ |
09:20:00 | <hiker1> | What are non-Windows users using? |
09:22:00 | <hiker1> | ersi: That website requires you to upload the entire warc file to the server. In some cases that's hundreds and hundreds of MB. |
09:22:00 | <Coderjoe> | you know what would be really crazy? implementing that warctozip using javascript, so you don't actually need to upload the file to a server |
09:23:00 | <hiker1> | But isn't there a reason stuff is stored using warc instead of zip? |
09:23:00 | <ersi> | I never really need to open up WARCs; when I do, I just `less` or `zless` them and read them straight. But there's a bunch of tools, like warc-tools from hanzoarchives (i.e. tef) |
09:24:00 | <hiker1> | How else do you read the contents of archived websites? |
09:24:00 | <ersi> | https://github.com/tef/warctools |
09:24:00 | <ersi> | The reason for WARC is this: metadata. You'll know from WHERE and WHEN the data was downloaded, because you have the HTTP headers for both the request and the response |
09:24:00 | <Coderjoe> | there is, for archives. the warc includes metadata like the original URL, request headers, response headers, date and time of the request, etc |
09:25:00 | | ersi nods |
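The metadata ersi and Coderjoe describe sits in plain-text headers at the start of every WARC record, which is why `less` is enough to read one. As a rough illustration, the sketch below parses a single hand-written record using only the Python standard library. The sample record, the `read_records` name, and the example.com URL are assumptions made up for this illustration; it is not how warctools or the IA warc library are implemented, and compressed .warc.gz files would need decompressing first.

```python
import io

# A tiny hand-written WARC "response" record (illustrative data only,
# not taken from a real crawl).
HTTP_PAYLOAD = b"".join([
    b"HTTP/1.1 200 OK\r\n",
    b"Content-Type: text/html\r\n",
    b"\r\n",
    b"<html>hello</html>",
])
SAMPLE_WARC = b"".join([
    b"WARC/1.0\r\n",
    b"WARC-Type: response\r\n",
    b"WARC-Target-URI: http://example.com/\r\n",
    b"WARC-Date: 2012-12-04T09:24:00Z\r\n",
    b"Content-Length: " + str(len(HTTP_PAYLOAD)).encode("ascii") + b"\r\n",
    b"\r\n",            # empty line ends the WARC headers
    HTTP_PAYLOAD,       # the captured HTTP response, headers included
    b"\r\n\r\n",        # two CRLFs close the record
])


def read_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # empty line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The body length is given by the record's own Content-Length header.
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body


for headers, body in read_records(io.BytesIO(SAMPLE_WARC)):
    # WHERE and WHEN the data was fetched, straight from the record headers.
    print(headers["WARC-Type"], headers["WARC-Target-URI"], headers["WARC-Date"])
```

The body of a response record is the raw HTTP response, so the original response headers come along with the payload. A zip of extracted files would not carry them.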
09:25:00 | <ersi> | The most common interface to actually view WARCs is the Internet Archive Wayback Machine. But you can't use that for your own WARCs though ;) |
09:25:00 | <Coderjoe> | if you just need to get files out of it, converting it to zip is fine (provided you don't delete the warc) |
09:25:00 | <Coderjoe> | well, there is the open-source wayback codebase |
09:25:00 | <ersi> | true, but it's a pain in the ass to set up |
09:25:00 | <Coderjoe> | iirc, yipdw had an instance of that set up |
09:26:00 | <hiker1> | I would have thought there would be a program which hosts the warc file on a web server, or directly explores the contents without requiring a conversion. |
09:26:00 | <hiker1> | I'm trying out IA's warc library for python https://github.com/internetarchive/warc |
09:26:00 | <Coderjoe> | that's the wayback |
09:27:00 | <ersi> | IMO warc-tools from tef is better than IA warc |
09:27:00 | <SketchCow> | Coderjoe: Absorbing your cavedog as we speak. |
09:27:00 | <ersi> | Om nom nom |
09:27:00 | <hiker1> | Can you use the wayback machine to view warc's that are uploaded to IA? |
09:27:00 | <SketchCow> | After I get this, ersi, it'll be on archive.org in seconds. No worries. |
09:28:00 | <ersi> | SketchCow: I'll nom on it then |
09:28:00 | <Coderjoe> | I noticed (but only because I was watching the log) |
09:28:00 | <Coderjoe> | mmm |
09:28:00 | <Coderjoe> | traffic shaping and prioritizing really takes the pain away |
09:40:00 | <hiker1> | So is wget outputting to warc the preferred method for archiving sites? Not HTTrack? |
09:40:00 | <Coderjoe> | depends |
09:41:00 | <Coderjoe> | though I don't know about httrack |
09:41:00 | <Coderjoe> | IA has a tool called Heritrix for their normal crawls |
09:42:00 | <Coderjoe> | we use wget here because we can make it ignore robots.txt. and with the lua scripting, we can specialize things for each site. |
09:42:00 | <chronomex> | yup |
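For anyone following along, a wget invocation of the kind described above might be assembled as in this sketch. `--mirror`, `--page-requisites`, `--warc-file` (wget 1.14 and later) and `-e robots=off` are standard wget options. The `build_wget_command` helper, the example.com URL, and the output name are hypothetical. The Lua hooks Coderjoe mentions are left out because they depend on a patched wget build rather than stock wget.

```python
import shlex


def build_wget_command(url, warc_name, ignore_robots=True):
    """Return an argv list for a WARC-writing wget mirror run (hypothetical helper)."""
    args = [
        "wget",
        "--mirror",                    # recursive download with timestamping
        "--page-requisites",           # also fetch images, CSS, etc. needed by pages
        "--warc-file=" + warc_name,    # write a WARC next to the mirrored files
    ]
    if ignore_robots:
        args += ["-e", "robots=off"]   # tell wget not to honour robots.txt
    args.append(url)
    return args


cmd = build_wget_command("http://example.com/", "example.com-mirror")
# Print the command in a copy-pasteable form; pass cmd to subprocess.run to execute it.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building the argument list first and printing it keeps the sketch safe to run: nothing is downloaded until the list is handed to something like `subprocess.run(cmd)`.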
09:58:00 | <alard> | Or try viewing the warc with this: https://github.com/alard/warc-proxy :) |
09:59:00 | <hiker1> | That looks a lot closer to what I want |
10:00:00 | <ersi> | Oops, thought about writing that one out as well :) |
10:00:00 | <hiker1> | alard: Why does it use a proxy instead of just running a web server? |
10:02:00 | <alard> | hiker1: The thing *is* the proxy. That's the easiest way to do it -- from a technical perspective, that is -- since you don't have to rewrite any urls. |
10:02:00 | <alard> | The wayback web interface has to replace the URLs in every web page it serves. The warc-proxy addon just configures its little web server as a proxy, and it's done. |
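The difference alard describes can be sketched in a few lines of Python. When a browser talks to an HTTP proxy, the request line carries the full original URL, so a lookup by URL is enough. A Wayback-style replay server instead has to rewrite every link in the page so that it points back at itself. Everything here (`ARCHIVE`, `proxy_lookup`, `rewrite_links`, the localhost prefix) is a hypothetical illustration, not code from warc-proxy or the wayback codebase.

```python
import re

# Hypothetical index built from a WARC: original URL -> archived body.
ARCHIVE = {
    "http://example.com/": b'<a href="http://example.com/folder/file">file</a>',
    "http://example.com/folder/file": b"archived file contents",
}


def proxy_lookup(request_line, archive):
    """Proxy style: the browser sends the absolute URL, so look it up as-is."""
    _method, url, _version = request_line.split(" ", 2)
    return archive.get(url)  # None means the URL is not in the archive


def rewrite_links(html, prefix):
    """Wayback style: point every absolute href/src back at the replay server."""
    return re.sub(
        rb'(href|src)="(https?://[^"]+)"',
        lambda m: m.group(1) + b'="' + prefix + m.group(2) + b'"',
        html,
    )


# The proxy serves the stored page untouched.
print(proxy_lookup("GET http://example.com/folder/file HTTP/1.1", ARCHIVE))

# The replay server has to edit the page before serving it.
print(rewrite_links(ARCHIVE["http://example.com/"], b"http://localhost:8080/web/"))
```

The regular expression only catches double-quoted absolute `href` and `src` attributes. Real pages also build URLs in CSS, JavaScript, and relative links, which is a large part of why the rewriting approach is harder to get right.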
10:03:00 | <hiker1> | well, the nice thing about rewriting is then you can serve the files to other people through the web. |
10:04:00 | <hiker1> | with the proxy method, only a local user can access them, unless you make the proxy public which would not be easy for most users to access |
10:04:00 | <ersi> | the non-nice thing is that it's a pain in the ass |
10:04:00 | <alard> | Yes, but that's not what this tool is for. If you want to do that there's the wayback tool. |
10:04:00 | <hiker1> | What is the wayback tool? |
10:04:00 | <ersi> | Wayback Machine |
10:04:00 | <hiker1> | but that won't serve private warc files |
10:04:00 | <alard> | https://github.com/internetarchive/wayback |
10:04:00 | <ersi> | https://github.com/internetarchive/wayback |
10:04:00 | <ersi> | damn it |
10:04:00 | <alard> | Heh. |
10:05:00 | <alard> | But as you can see it's much harder to get that running than the warc-proxy + firefox addon. |
10:05:00 | <hiker1> | does warcproxy just grab whatever .warc files it sees? |
10:06:00 | <hiker1> | ah, nvm, it has a neat interface! |
10:07:00 | <hiker1> | wow, this is really impressive work |
10:13:00 | <ersi> | +1 alard |
10:16:00 | <norbert79> | alard: Holy-moly, this goes to my favourites |
10:22:00 | <godane> | alard: the urls in menu for warc-proxy don't work for me for some reason |
10:22:00 | <godane> | it doesn't take in the baseurl |
10:22:00 | <hiker1> | The base url didn't work for me, but the other ones did |
10:23:00 | <godane> | so it will go to folder/file instead of example.com/folder/file or something like that |
10:23:00 | <godane> | and so it would error |
10:23:00 | <alard> | That's strange. |
10:24:00 | <godane> | also when testing my eff.org grab it would just go to real site |
10:24:00 | <alard> | (Whether the base url works depends on the contents of your warc file. If the base url isn't in there it won't be visible.) |
10:25:00 | <alard> | godane: Is that an https site? |
10:25:00 | <godane> | yes |
10:33:00 | <alard> | godane: The https doesn't work yet. For some reason those requests aren't proxied. I've added it to the list: https://github.com/alard/warc-proxy/issues/2 |
10:44:00 | <ats> | is there an Internet Archive IRC channel somewhere, or is this the best bet? |
10:45:00 | <ersi> | #internetarchive is the unofficial/semi-official channel |
10:45:00 | <ats> | cheers :) |
10:45:00 | <ersi> | mostly just to get IA shizzle out of this channel :) |
10:46:00 | <chronomex> | yes, same people here and there mostly |
11:38:00 | <SketchCow> | More hugs here |
11:41:00 | <SketchCow> | Hey, someone's using the warrior, it spent 45 minutes on "setting up data partition". |
11:41:00 | <SketchCow> | And he stopped it. |
11:41:00 | <SketchCow> | Any ideas? |
11:42:00 | <ersi> | scrap and start it again? |
12:12:00 | <SmileyG> | did you give it like a 10tb partition for /data? |
12:24:00 | <tuabkiet> | 10TB??? |
12:48:00 | <hiker1> | tuabkiet: You don't have 10 TB of RAID space lying around? |
12:49:00 | <tuabkiet> | I don't use RAID, and my hard disk is 10 times smaller |
12:53:00 | <hiker1> | How do I get wget 1.14? |
12:58:00 | <ersi> | hiker1: It's not in many repositories. You'll probably have to compile it yourself |
12:59:00 | <hiker1> | damn. I'm downloading Linux Mint Debian Edition which uses Debian Testing. I hope it's in there... Is there a compile guide by ArchiveTeam? |
13:00:00 | <ersi> | No, but I can probably help |
13:00:00 | <hiker1> | how long are you going to be on? I'm still downloading the Mint dvd. |
13:01:00 | <ersi> | debian testing has wget 1.13.4-3 |
13:01:00 | <hiker1> | How did you find that out? |
13:01:00 | <hiker1> | I was looking for a package listing but couldn't find one |
13:01:00 | <ersi> | debian sid has wget 1.14 |
13:01:00 | <ersi> | http://packages.debian.org bro |
13:01:00 | <hiker1> | they hid it on their packages subdomain! those sneaky... |
13:02:00 | <ersi> | you can probably install that .deb and everything will be fine |
13:02:00 | <hiker1> | I think there was an aptosid... |
13:03:00 | <ersi> | you can probably just dpkg -i the .deb if you're inclined |
13:03:00 | <hiker1> | ersi: Do you use a linux distro? if so, which? |
13:04:00 | <ersi> | Ubuntu, Red Hat Enterprise Server, Gentoo, crappy version of SuSE and I've used Debian |
13:04:00 | <hiker1> | oh. |
13:04:00 | <hiker1> | no mint? |
13:04:00 | <ersi> | nope. But it's just another Debian derivative |
13:06:00 | <SketchCow> | http://archive.org/details/ftp_cavedog.com now up |
13:07:00 | <hiker1> | Where are archives of known dead sites kept? |
13:07:00 | <hiker1> | I only saw the just in time captures |
13:09:00 | <hiker1> | SketchCow: Any chance you could post a file listing along with the FTP Snapshot? It would be nice to know what I'm getting before grabbing 1.5 GB. |
13:10:00 | <ersi> | Most are up on archive.org |
13:10:00 | <ersi> | SketchCow: thx~ |
13:11:00 | <hiker1> | Does http://archive.org/details/archiveteam-fire include known dead sites? |
13:11:00 | <alard> | hiker1: http://archive.org/download/ftp_cavedog.com/ftp.cavedog.com.tar/ |
13:12:00 | <alard> | (a slash at the end of the .tar usually gives you an index) |
13:12:00 | <hiker1> | alard: oh, wow, that is handy. Thank you. |
13:12:00 | <SketchCow> | Also, you should trust me |
13:12:00 | <SketchCow> | Everything I upload is awesome |
13:12:00 | <hiker1> | hah |
13:13:00 | <ersi> | Indeed |
13:26:00 | <hiker1> | I downloaded a forum about 3 years ago. The place is gone now. IA has some of the forum archived, but I'm pretty sure my archive has everything. Can I distribute it through ArchiveTeam? |
13:28:00 | <hiker1> | The forum had a few thousand posts. It was the official forum for a video game called Lord of the Rings Online TCG. The whole archive is only 11 MB. |
13:43:00 | <tuabkiet> | hiker1: Up it to Internet Archive NOW! |
13:43:00 | <hiker1> | I am not sure how |
13:44:00 | <ersi> | Create an account first and foremost |
16:44:00 | <SketchCow> | Bagger 288! Bagger 288! |
16:46:00 | <soultcer> | SketchCow: Did you find the two Dailybooth warc files I asked for? |
16:47:00 | <schbiridi> | the tracker thing, e.g. the one used at http://tracker.archiveteam.org/webshots/, could use a "Wanna join? http://www.archiveteam.org/index.php?title=ArchiveTeam_Warrior" link |
16:48:00 | <SketchCow> | Agreed on wanna join. |
16:48:00 | <SketchCow> | soultcer: No, I've been working on my presentation. |
16:48:00 | <SketchCow> | E-mail me. jason@textfiles.com. |
16:48:00 | <soultcer> | Will do |
19:00:00 | <alard> | Has someone saved the http://blog.webshots.com/ ? |
23:27:00 | <Nemo_bis> | slowly redoing wikia dumps mirror: https://archive.org/details/wikia_dump_20121204 |
23:28:00 | <Nemo_bis> | now 5704 wikis beginning with "a" vs. 872 in previous snapshot |
23:29:00 | <Nemo_bis> | still, looks like dumps are not generated for 80% of the wikis they have, even if requested |
23:39:00 | <alard> | --------------------------------------------------------------------------- |
23:39:00 | <alard> | Hi all. Webshots is done. 109 TB saved by 134 downloaders. Thanks! |
23:39:00 | <alard> | It's available on the projects tab of your warrior. |
23:39:00 | <alard> | Next station: DailyBooth.com, closing at the end of the year. |
23:39:00 | <alard> | If you want to run it yourself: https://github.com/ArchiveTeam/dailybooth-grab |
23:39:00 | <alard> | (All very similar to WebShots and previous projects.) |
23:39:00 | <alard> | Join #dailybooth for more detailed discussions. |
23:39:00 | <alard> | --------------------------------------------------------------------------- |