01:45:00<SketchCow>FortuneCity is now 100% converted into the format for the Wayback Machine.
01:46:00<SketchCow>No idea if they've done the final sweep yet; the power outage definitely set some projects and activities back.
01:46:00<SketchCow>But this basically leaves PicPlz.
01:47:00<SketchCow>I've got that project going now in two windows (one ingesting, one uploading)
01:47:00<SketchCow>Will definitely take a day or two, since it's 3.5 TB of pictures.
01:51:00<SketchCow>For MobileMe, we're going to have to do flat-out pull-downs, conversion, and REPLACEMENT, so I want to hold off on that for a bit.
01:52:00<SketchCow>There's just no other way - too much data.
01:54:00<chronomex>yeah, we oughtn't double it unnecessarily
01:54:00<SketchCow>So first I want to see all this data we just did make the full journey into Wayback and completely live.
01:55:00<SketchCow>When it does, I want to then start dialing down/removing the doubled Fortunecity and other large collections, like Picplz, to not be doubles.
02:02:00<SketchCow>On the good side, I'm getting between 15-21mb a second off of the pipe.
02:02:00<SketchCow>So I think whatever was going on before the power outage is now in good shape.
02:03:00<Patt>SketchCow, so it really went out?
02:03:00<chronomex>no, it was staged, just like the moon landing
02:03:00<chronomex>of course the power went out
02:04:00<SketchCow>Richmond district of SF had a power outage.
02:04:00<SketchCow>Took Internet Archive with it.
02:04:00<SketchCow>Exciting.
02:04:00<SketchCow>That's a lot of stuff out.
02:04:00<Patt>sorry
02:04:00<SketchCow>Stuff came back slowly, but it did come back.
02:04:00<SketchCow>RIGHT during their big party to celebrate 10 petabytes of web historical data in the Wayback machine.
02:05:00<SketchCow>http://www.flickr.com/photos/mlinksva/8126312466/
02:05:00<SketchCow>So as you can see, they got out emergency lights and power, put the laptop on it, and just kept going.
02:06:00<chronomex>sucko
02:06:00<chronomex>funny tho
02:06:00<chronomex>was the livestream interrupted?
02:06:00<SketchCow>Well, this was the party - it had no livestream.
02:07:00<chronomex>ah
02:07:00<SketchCow>The whole Books in Browsers went fine - they lost power at, like, 8pm.
02:09:00<shaqfu>Is H-Net at any sort of risk?
02:09:00<SketchCow>What's H-Net in this context?
02:09:00<chronomex>hurricane electric?
02:10:00<shaqfu>The humanities mailing list
02:10:00<SketchCow>I mean, never trust any mailing list is my rule.
02:10:00<SketchCow>And it's text and trivial to grab.
02:10:00<shaqfu>Did a very quick survey and saw a lot of defunct lists, but it still seems to be in some use
02:10:00<SketchCow>I assume you don't mean http://dhhumanist.org/text.html
02:11:00<chronomex>I'm on several mailing lists more or less solely so I have my own archive of it
02:11:00<chronomex>for example, it's why I will never unsubscribe from a yahoo list
02:11:00<shaqfu>No, http://www.h-net.org/lists/
02:11:00<shaqfu>DHHumanist is what made me look at it
02:16:00<shaqfu>And yeah, it's trivial, but no need if it's still under active watch
02:27:00<godane>hey shaqfu
02:27:00<shaqfu>Yo
02:28:00<godane>i'm grabbing another magazine
02:28:00<godane>called ce lifestyles
02:40:00<SketchCow>https://docs.google.com/a/textfiles.com/spreadsheet/ccc?key=0ApQeH7pQrcBWdDZIUEVjR3d1UmRoU0lPSWZYX0Q1Ync#gid=0
02:40:00<SketchCow>Lot of great stuff - thanks, team.
02:57:00<flaushy>alard: 2 instances running with different ips
03:23:00<SketchCow>Could someone please WARC http://www.wikipediareview.com/ up?
03:29:00<flaushy>is all the info how to do it on the wiki? then i might try :)
03:29:00<SketchCow>SHOULD be. If not, let me know
03:30:00<flaushy>then i'll try. Thx
03:35:00<flaushy>SketchCow: src/wget "http://www.archiveteam.org/" --mirror --warc-file="at" <- is the command to use, right? anything else i need to watch out for?
03:35:00<flaushy>via http://archiveteam.org/index.php?title=Wget_with_WARC_output
03:38:00<tef>flaushy: with a different starting url, of course :)
03:38:00<flaushy>tef yeah :)
03:52:00<SketchCow>http://projects.metafilter.com/3766/Just-Solve-the-Problem-Month-Solve-File-Formats
07:09:00<DFJustin>looks like we have an admirer https://archive.org/details/virtualitera.freeweb7.com
07:14:00<joepie91>yay, wrote a javascript unpacker
07:14:00<joepie91>... in python
09:24:00<alard>flaushy: Thanks. The ask-crawl produced 100 new usernames overnight.
13:32:00<flaushy>alard: i think i got blocked at ask
13:32:00<flaushy>Your client does not have permission to access this site.
13:54:00<flaushy>I'm getting HTTP 400s on wikipediareview after a while, and i feel like it is too small to be a complete rip. Any suggestions on wget parameters to "crawl gently" and avoid blocks? And any suggestions on "checking" complete rips?
14:20:00<alard>flaushy: You can use gunzip -c *.warc.gz | grep WARC-Target-URI to get a list of the URLs in the warc file.
14:21:00<alard>wget has options to have a delay between requests (there are multiple, look in wget --help).
14:22:00<alard>If you want to download the images, you probably need --page-requisites (I think that's not included in --mirror).
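A minimal sketch of the same URL listing in Python, for cases where a quick completeness check should feed a script rather than a terminal. The function name `iter_target_uris` is hypothetical, and like the grep approach it is a plain line scan, so a payload line that happens to start with the header name would also match. It assumes the URI may or may not be wrapped in angle brackets, depending on the tool that wrote the file.

```python
import gzip
from typing import Iterator


def iter_target_uris(path: str) -> Iterator[str]:
    """Yield the value of every WARC-Target-URI header line in a .warc.gz file.

    Hypothetical helper: a simple line scan, not a real WARC parser.
    """
    prefix = b"WARC-Target-URI:"
    # gzip.open reads through concatenated gzip members, which is how
    # per-record compressed WARC files are laid out.
    with gzip.open(path, "rb") as fh:
        for line in fh:
            if line.startswith(prefix):
                value = line[len(prefix):].strip().decode("utf-8", "replace")
                # Assumption: some writers wrap the URI in <...>; strip them.
                yield value.strip("<>")
```

Collecting the results into a set and comparing it against the pages you expect (for example, a range of forum topic URLs) gives a rough idea of whether a rip is complete.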
14:23:00<flaushy>ah cool
14:23:00<flaushy>humanizing as well?
14:23:00<flaushy>what delays would be good? (i once mirrored a wiki with 5 sec average, painful...)
14:25:00<flaushy>alard: do we want to keep crawling at ask?
14:25:00<alard>What do you mean by "humanizing"? I don't know about delays, it really depends on the purpose.
14:26:00<alard>Well, I'm not sure if it's worth it. It is producing some usernames, very slowly, and they block reasonably quick.
14:26:00<flaushy>you have a random backoff and it is on average Xsecs
14:26:00<flaushy>eg so you keep a window of 0 - 10 secs
14:27:00<flaushy>ah --random-wait was its name in wget :)
14:27:00<alard>Yes, that's it.
14:27:00<alard>You might want to set the --user-agent to something other than Wget.
14:28:00<flaushy>okie thx
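The options mentioned above can be combined into one command line. The sketch below only builds the argument list; the helper names, the default wait of 5 seconds, and the user-agent string are illustrative assumptions, not recommendations from the channel. According to the wget manual, `--random-wait` varies the delay between 0.5 and 1.5 times the `--wait` value, so a 5 second wait gives a 2.5 to 7.5 second window rather than 0 to 10.

```python
import shlex
from typing import List, Tuple


def polite_wget_args(
    url: str,
    warc_name: str,
    wait_seconds: float = 5,  # assumed default, tune per site
    user_agent: str = "Mozilla/5.0 (compatible; archival crawl)",  # placeholder
    wget_binary: str = "wget",  # e.g. "src/wget" for a local WARC-enabled build
) -> List[str]:
    """Build a wget command that mirrors a site into a WARC with delays."""
    return [
        wget_binary,
        url,
        "--mirror",
        "--page-requisites",          # also fetch images/CSS needed by pages
        f"--warc-file={warc_name}",
        f"--wait={wait_seconds:g}",   # base delay between requests
        "--random-wait",              # vary the delay around the base value
        f"--user-agent={user_agent}",
    ]


def random_wait_window(wait_seconds: float) -> Tuple[float, float]:
    """Delay range wget uses with --random-wait (0.5x to 1.5x of --wait)."""
    return (0.5 * wait_seconds, 1.5 * wait_seconds)


def as_shell_command(args: List[str]) -> str:
    """Render the argument list as a copy-pasteable shell command."""
    return shlex.join(args)
```

Passing the list to `subprocess.run(args, check=False)` would start the crawl; printing `as_shell_command(args)` first is a cheap way to review the flags before hitting the site.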
14:28:00<flaushy>btw did yacy crawl over bt?
14:28:00<flaushy>maybe we could get data out of their index
14:31:00<alard>Yacy, never heard of before. Does it contain any data? (I just used the demo to search for "archive team", but that didn't produce results.)
14:32:00<flaushy>it is a p2p search engine attempt
14:32:00<flaushy>it didn't scale a couple of years ago, lost interest in it after a while
14:34:00<alard>http://search.yacy.net/HostBrowser.html?path=www.btinternet.com&list=Browse+Host
14:34:00<flaushy>okie screw that ;(
14:35:00<alard>I now see that http://wikipediareview.com/ is a forum. They're hard to archive.
14:39:00<SketchCow>Agreed
14:40:00<SketchCow>There's not a RUSH on this. They're just on the skids
14:40:00<SketchCow>They've been up and down over the past couple years. Unpaid bills, etc.
14:43:00<alard>Well, by 'hard to archive' I only meant that it probably needs something more structured than just wget --mirror.
14:43:00<alard>A Lua script could be a solution to do a structured download. What type of forum software is it?
14:44:00<alard>It may be time for a new script in the forum download library.
14:49:00<alard>Which reminds me: should we do a second run of boards.cityofheroes.com?
14:53:00<SketchCow>I am not against it.
14:53:00<SketchCow>I think people are probably going pretty nuts towards the end as this idiocy goes down.
14:54:00<alard>Oh, wait, it's less urgent than I thought: "The City of Heroes® servers will shut off on November 30, 2012" http://na.cityofheroes.com/en/news/news_archive/city_of_heroes_sunset_faq.php
14:57:00<SketchCow>Well, that's still urgent. And I mean obviously we wait until closer to the end, like the 20th or later.
14:57:00<SketchCow>Set an alarm!
14:58:00<SketchCow>Someone is working on a file format of motion picture film.
14:58:00<SketchCow>Agreed, they encode audio and visual data right into the film.
15:39:00<closure>ha! http://source.git-annex.branchable.com/?p=source.git;a=commitdiff;h=b72d04988f767dd4b8dab3e1267c03b7f80d4c2c;hp=6633a5158d4d3a6f0bdf9fa5c2c8725e47b051cc
15:59:00<underscor>http://monitor.us.archive.org/weathermap/weathermap.html
16:11:00<alard>"Paul" is very important.
16:18:00<alard>So here's a script to download the Invision Powerboard forums on wikipediareview.com:
16:18:00<alard>https://github.com/ArchiveTeam/wikipediareview-grab/blob/master/invpowerboard.lua
16:22:00<alard>It could be a small warrior project. (I think it might be too big for one single wget run.)
16:38:00<flaushy>alard: thx. i will put it on my nas later on :)
20:16:00<ivan`>is all of ftp.scene.org backed up?
20:31:00<DFJustin>it has multiple mirrors right
20:51:00<godane>i think i found something interesting
20:51:00<godane>a magazine called hebdogiciel
20:52:00<godane>it's a french magazine from 1983 to 1987
20:52:00<godane>collection item is not viewable even though i can see the magazines just fine
20:53:00<godane>anyways archive.org only has 13 issues
20:55:00<godane>i have found the rest of them
22:17:00<DFJustin>godane: the french site that has all that stuff got really butthurt when jason put it up so that's why it all went dark
22:48:00<godane>DFJustin: Only 13 issues are visible
22:49:00<godane>I don't think jason uploaded the full set
23:11:00<SketchCow>Back.
23:11:00<SketchCow>Had to visit underscor
23:11:00<SketchCow>Yes.
23:12:00<SketchCow>All the french magazines are dark unless we digitize them ourselves.
23:13:00<SketchCow>They got butthurt like the Al Qaeda gets butthurt
23:13:00<balrog->:(
23:14:00<balrog->yeah I was digging around about early PC programming books ... quite a bit of that on IA, but dark :/
23:15:00<SketchCow>Anyway, you're all missing the important point
23:15:00<SketchCow>John Romero got married.
23:15:00<SketchCow>Off the market
23:15:00<SketchCow>Now, this is a major blow to the group but I think we can recover
23:15:00<SketchCow>If we stick together
23:16:00<SketchCow>"Sandy could potentially be an unprecedented threat to the masses in its path, a massive storm that hasn't been rivaled in generations."
23:16:00<SketchCow>Whew, way to couch it carefully
23:36:00<godane>SketchCow: So you can only undark them if archive.org digitizes them?
23:38:00<godane>SketchCow: that sort of would defeat the point of darking it then
23:40:00<SketchCow>No.
23:40:00<SketchCow>I can undark them if SOMEONE I KNOW digitizes them.
23:40:00<SketchCow>This is a specific situation, to those magazines.
23:41:00<godane>that's just weird
23:42:00<godane>anyway there is no seed to torrent collection right now
23:42:00<godane>no point in downloading it