01:45:00 | <SketchCow> | FortuneCity is now 100% into the format for the Wayback Machine. |
01:46:00 | <SketchCow> | No idea if they've done the final sweep yet; the power outage definitely set some projects and activities back. |
01:46:00 | <SketchCow> | But this basically leaves PicPlz. |
01:47:00 | <SketchCow> | I've got that project going now in two windows (one ingesting, one uploading) |
01:47:00 | <SketchCow> | Will definitely take a day or two, since it's 3.5TB of pictures. |
01:51:00 | <SketchCow> | For MobileMe, we're going to have to do flat-out pull-downs, conversion, and REPLACEMENT, so I want to hold off on that for a bit. |
01:52:00 | <SketchCow> | There's just no other way - too much data. |
01:54:00 | <chronomex> | yeah, we oughtn't double it unnecessarily |
01:54:00 | <SketchCow> | So first I want to see all this data we just did make the full journey into Wayback and completely live. |
01:55:00 | <SketchCow> | When it does, I want to then start dialing down/removing the doubled FortuneCity and other large collections, like PicPlz, so they're not doubles. |
02:02:00 | <SketchCow> | On the good side, I'm getting between 15-21mb a second off of the pipe. |
02:02:00 | <SketchCow> | So I think whatever was going on before the power outage is now in good shape. |
02:03:00 | <Patt> | SketchCow, so it really went out? |
02:03:00 | <chronomex> | no, it was staged, just like the moon landing |
02:03:00 | <chronomex> | of course the power went out |
02:04:00 | <SketchCow> | Richmond district of SF had a power outage. |
02:04:00 | <SketchCow> | Took Internet Archive with it. |
02:04:00 | <SketchCow> | Exciting. |
02:04:00 | <SketchCow> | That's a lot of stuff out. |
02:04:00 | <Patt> | sorry |
02:04:00 | <SketchCow> | Stuff came back slowly, but it did come back. |
02:04:00 | <SketchCow> | RIGHT during their big party to celebrate 10 petabytes of web historical data in the Wayback Machine. |
02:05:00 | <SketchCow> | http://www.flickr.com/photos/mlinksva/8126312466/ |
02:05:00 | <SketchCow> | So as you can see, they got out emergency lights and power, put the laptop on it, and just kept going. |
02:06:00 | <chronomex> | sucko |
02:06:00 | <chronomex> | funny tho |
02:06:00 | <chronomex> | was the livestream interrupted? |
02:06:00 | <SketchCow> | Well, this was the party - it had no livestream. |
02:07:00 | <chronomex> | ah |
02:07:00 | <SketchCow> | The whole Books in Browsers went fine - they lost power at, like, 8pm. |
02:09:00 | <shaqfu> | Is H-Net at any sort of risk? |
02:09:00 | <SketchCow> | What's H-Net in this context? |
02:09:00 | <chronomex> | hurricane electric? |
02:10:00 | <shaqfu> | The humanities mailing list |
02:10:00 | <SketchCow> | I mean, never trust any mailing list is my rule. |
02:10:00 | <SketchCow> | And it's text and trivial to grab. |
02:10:00 | <shaqfu> | Did a very quick survey and saw a lot of defunct lists, but it still seems to be in some use |
02:10:00 | <SketchCow> | I assume you don't mean http://dhhumanist.org/text.html |
02:11:00 | <chronomex> | I'm on several mailing lists more or less solely so I have my own archive of it |
02:11:00 | <chronomex> | for example, it's why I will never unsubscribe from a yahoo list |
02:11:00 | <shaqfu> | No, http://www.h-net.org/lists/ |
02:11:00 | <shaqfu> | DHHumanist is what made me look at it |
02:16:00 | <shaqfu> | And yeah, it's trivial, but no need if it's still under active watch |
02:27:00 | <godane> | hey shaqfu |
02:27:00 | <shaqfu> | Yo |
02:28:00 | <godane> | i'm grabbing another magazine |
02:28:00 | <godane> | called ce lifestyles |
02:40:00 | <SketchCow> | https://docs.google.com/a/textfiles.com/spreadsheet/ccc?key=0ApQeH7pQrcBWdDZIUEVjR3d1UmRoU0lPSWZYX0Q1Ync#gid=0 |
02:40:00 | <SketchCow> | Lot of great stuff - thanks, team. |
02:57:00 | <flaushy> | alard: 2 instances running with different ips |
03:23:00 | <SketchCow> | Could someone please WARC http://www.wikipediareview.com/ up? |
03:29:00 | <flaushy> | is all the info how to do it on the wiki? then i might try :) |
03:29:00 | <SketchCow> | SHOULD be. If not, let me know |
03:30:00 | <flaushy> | then i'll try. Thx |
03:35:00 | <flaushy> | SketchCow: src/wget "http://www.archiveteam.org/" --mirror --warc-file="at" <- is the command to use, right? anything else i need to watch out for? |
03:35:00 | <flaushy> | via http://archiveteam.org/index.php?title=Wget_with_WARC_output |
03:38:00 | <tef> | flaushy: with a different starting url, of course :) |
03:38:00 | <flaushy> | tef yeah :) |
03:52:00 | <SketchCow> | http://projects.metafilter.com/3766/Just-Solve-the-Problem-Month-Solve-File-Formats |
07:09:00 | <DFJustin> | looks like we have an admirer https://archive.org/details/virtualitera.freeweb7.com |
07:14:00 | <joepie91> | yay, wrote a javascript unpacker |
07:14:00 | <joepie91> | ... in python |
09:24:00 | <alard> | flaushy: Thanks. The ask-crawl produced 100 new usernames overnight. |
13:32:00 | <flaushy> | alard: i think i got blocked at ask |
13:32:00 | <flaushy> | Your client does not have permission to access this site. |
13:54:00 | <flaushy> | I'm getting HTTP 400s on wikipediareview after a while, and i feel like it is too small to be a complete rip. Any suggestions on wget parameters to "crawl gently" and avoid blocks? And any suggestions on "checking" complete rips? |
14:20:00 | <alard> | flaushy: You can use gunzip -c *.warc.gz | grep Target-URI to get a list of the URLs in the warc file. |
14:21:00 | <alard> | wget has options to have a delay between requests (there are multiple, look in wget --help). |
14:22:00 | <alard> | If you want to download the images, you probably need --page-requisites (I think that's not included in --mirror). |
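A minimal Python sketch of the check alard describes above: scan a gzipped WARC for `WARC-Target-URI` header lines and list the URLs. The function name `list_target_uris` and the example file name are illustrative assumptions, not something from the log or the wiki. Like the grep approach, this is a naive line scan rather than a real WARC parser, so a payload line that happens to start with the same text would also be listed.

```python
import gzip
import sys


def list_target_uris(path):
    """Return the WARC-Target-URI values found in a gzipped WARC file.

    Naive line scan (the same idea as grepping the decompressed file),
    not a full WARC parser.
    """
    prefix = b"WARC-Target-URI:"
    uris = []
    # gzip.open reads through concatenated gzip members, which is how
    # .warc.gz files are commonly written (one member per record).
    with gzip.open(path, "rb") as warc:
        for line in warc:
            if line.startswith(prefix):
                value = line[len(prefix):].strip()
                # Some writers wrap the URI in angle brackets; drop them.
                uris.append(value.decode("utf-8", "replace").strip("<>"))
    return uris


if __name__ == "__main__":
    # Usage (file name is only an example): python list_uris.py at.warc.gz
    for warc_path in sys.argv[1:]:
        for uri in list_target_uris(warc_path):
            print(uri)
```

Comparing the length of this list against a rough page count for the site is one crude way to judge whether a rip looks complete.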
14:23:00 | <flaushy> | ah cool |
14:23:00 | <flaushy> | humanizing as well? |
14:23:00 | <flaushy> | what delays would be good? (i once mirrored a wiki with 5 sec average, painful...) |
14:25:00 | <flaushy> | alard: do we want to keep crawling at ask? |
14:25:00 | <alard> | What do you mean by "humanizing"? I don't know about delays, it really depends on the purpose. |
14:26:00 | <alard> | Well, I'm not sure if it's worth it. It is producing some usernames, very slowly, and they block reasonably quick. |
14:26:00 | <flaushy> | you have a random backoff and it is on average Xsecs |
14:26:00 | <flaushy> | eg so you keep a window of 0 - 10 secs |
14:27:00 | <flaushy> | ah --random-wait was its name in wget :) |
14:27:00 | <alard> | Yes, that's it. |
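A small Python sketch of the randomized delay flaushy describes. The wget manual documents `--random-wait` as varying the pause between 0.5 and 1.5 times the `--wait` value; the function below mimics that range. The name `jittered_delay` and the 5 second base are assumptions chosen for illustration, not wget internals.

```python
import random


def jittered_delay(base_wait, rng=random):
    """Pick a delay between 0.5x and 1.5x of base_wait, in seconds."""
    return rng.uniform(0.5 * base_wait, 1.5 * base_wait)


if __name__ == "__main__":
    # With a 5 second base, each delay falls between 2.5 and 7.5 seconds,
    # so the average stays close to the base value.
    for _ in range(3):
        print(round(jittered_delay(5.0), 2))
```

The point of the jitter is that requests no longer arrive at a fixed interval, which is the pattern simple rate-limit checks tend to look for.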
14:27:00 | <alard> | You might want to set the --user-agent to something other than Wget. |
14:28:00 | <flaushy> | okie thx |
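Pulling the suggestions above together, here is a hedged Python sketch that assembles a gentler wget invocation. The flags (`--mirror`, `--page-requisites`, `--wait`, `--random-wait`, `--user-agent`, `--warc-file`) are real wget options, but the function name `build_wget_command`, the 2 second wait, and the user-agent string are assumptions for illustration. The wget binary must have WARC support (the patched `src/wget` build from the wiki, or a release that includes `--warc-file`).

```python
import shlex


def build_wget_command(start_url, warc_name, wait_seconds=2,
                       user_agent="Mozilla/5.0 (compatible; example-archiver)",
                       wget_path="wget"):
    """Build an argument list for a polite, WARC-writing wget crawl."""
    return [
        wget_path,
        start_url,
        "--mirror",                       # recursive, timestamped crawl
        "--page-requisites",              # also fetch images, CSS, etc.
        f"--wait={wait_seconds}",         # base delay between requests
        "--random-wait",                  # jitter the delay around the base
        f"--user-agent={user_agent}",     # something other than "Wget"
        f"--warc-file={warc_name}",       # writes <warc_name>.warc.gz
    ]


if __name__ == "__main__":
    cmd = build_wget_command("http://www.wikipediareview.com/",
                             "wikipediareview")
    # Print the command for inspection instead of running it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Passing the returned list to `subprocess.run` would start the crawl; printing it first makes it easy to check the options before hitting the site.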
14:28:00 | <flaushy> | btw did yacy crawl over bt? |
14:28:00 | <flaushy> | maybe we could get data out of their index |
14:31:00 | <alard> | Yacy, never heard of before. Does it contain any data? (I just used the demo to search for "archive team", but that didn't produce results.) |
14:32:00 | <flaushy> | it is a p2p search engine attempt |
14:32:00 | <flaushy> | it didn't scale a couple of years ago, lost interest in it after a while |
14:34:00 | <alard> | http://search.yacy.net/HostBrowser.html?path=www.btinternet.com&list=Browse+Host |
14:34:00 | <flaushy> | okie screw that ;( |
14:35:00 | <alard> | I now see that http://wikipediareview.com/ is a forum. They're hard to archive. |
14:39:00 | <SketchCow> | Agreed |
14:40:00 | <SketchCow> | There's not a RUSH on this. They're just on the skids |
14:40:00 | <SketchCow> | They've been up and down over the past couple years. Unpaid bills, etc. |
14:43:00 | <alard> | Well, with 'hard to archive' I only meant that it probably needs something more structured than just wget --mirror. |
14:43:00 | <alard> | A Lua script could be a solution to do a structured download. What type of forum software is it? |
14:44:00 | <alard> | It may be time for a new script in the forum download library. |
14:49:00 | <alard> | Which reminds me: should we do a second run of boards.cityofheroes.com? |
14:53:00 | <SketchCow> | I am not against it. |
14:53:00 | <SketchCow> | I think people are probably going pretty nuts towards the end as this idiocy goes down. |
14:54:00 | <alard> | Oh, wait, it's less urgent than I thought: "The City of Heroes® servers will shut off on November 30, 2012" http://na.cityofheroes.com/en/news/news_archive/city_of_heroes_sunset_faq.php |
14:57:00 | <SketchCow> | Well, that's still urgent. And I mean obviously we wait until closer to the end, like the 20th or later. |
14:57:00 | <SketchCow> | Set an alarm! |
14:58:00 | <SketchCow> | Someone is working on a file format of motion picture film. |
14:58:00 | <SketchCow> | Agreed, they encode audio and visual data right into the film. |
15:39:00 | <closure> | ha! http://source.git-annex.branchable.com/?p=source.git;a=commitdiff;h=b72d04988f767dd4b8dab3e1267c03b7f80d4c2c;hp=6633a5158d4d3a6f0bdf9fa5c2c8725e47b051cc |
15:59:00 | <underscor> | http://monitor.us.archive.org/weathermap/weathermap.html |
16:11:00 | <alard> | "Paul" is very important. |
16:18:00 | <alard> | So here's a script to download the Invision Powerboard forums on wikipediareview.com: |
16:18:00 | <alard> | https://github.com/ArchiveTeam/wikipediareview-grab/blob/master/invpowerboard.lua |
16:22:00 | <alard> | It could be a small warrior project. (I think it might be too big for one single wget run.) |
16:38:00 | <flaushy> | alard: thx. i will put it on my nas later on :) |
20:16:00 | <ivan`> | is all of ftp.scene.org backed up? |
20:31:00 | <DFJustin> | it has multiple mirrors right |
20:51:00 | <godane> | i think i found something interesting |
20:51:00 | <godane> | a magazine called hebdogiciel |
20:52:00 | <godane> | its a french magazine from 1983 to 1987 |
20:52:00 | <godane> | collection item is not viewable even though i can see the magazines just fine |
20:53:00 | <godane> | anyways archive.org only has 13 issues |
20:55:00 | <godane> | i have found the rest of them |
22:17:00 | <DFJustin> | godane: the french site that has all that stuff got really butthurt when jason put it up so that's why it all went dark |
22:48:00 | <godane> | DFJustin: Only 13 issues are visible |
22:49:00 | <godane> | I don't think jason uploaded the full set |
23:11:00 | <SketchCow> | Back. |
23:11:00 | <SketchCow> | Had to visit underscor |
23:11:00 | <SketchCow> | Yes. |
23:12:00 | <SketchCow> | All the french magazines are dark unless we digitize them ourselves. |
23:13:00 | <SketchCow> | They got butthurt like the Al Qaeda gets butthurt |
23:13:00 | <balrog-> | :( |
23:14:00 | <balrog-> | yeah I was digging around about early PC programming books ... quite a bit of that on IA, but dark :/ |
23:15:00 | <SketchCow> | Anyway, you're all missing the important point |
23:15:00 | <SketchCow> | John Romero got married. |
23:15:00 | <SketchCow> | Off the market |
23:15:00 | <SketchCow> | Now, this is a major blow to the group but I think we can recover |
23:15:00 | <SketchCow> | If we stick together |
23:16:00 | <SketchCow> | "Sandy could potentially be an unprecedented threat to the masses in its path, a massive storm that hasn't been rivaled in generations." |
23:16:00 | <SketchCow> | Whew, way to couch it carefully |
23:36:00 | <godane> | SketchCow: So you can only undark them if archive.org digitizes them? |
23:38:00 | <godane> | SketchCow: that sort of would defeat the point of darking it then |
23:40:00 | <SketchCow> | No. |
23:40:00 | <SketchCow> | I can undark them if SOMEONE I KNOW digitizes them. |
23:40:00 | <SketchCow> | This is a specific situation, to those magazines. |
23:41:00 | <godane> | thats just weird |
23:42:00 | <godane> | anyway there is no seed to torrent collection right now |
23:42:00 | <godane> | no point in downloading it |