01:09:00<nintendud>hot damn, I just noticed the warrior is uploading to archive.org now
01:09:00<nintendud>and multiple uploads at once
01:09:00<nintendud>I like
01:45:00<flaushy>great talk at defcon SketchCow
01:55:00<underscor>http://ia600109.us.archive.org:8088/mrtg/networkv2.html
01:55:00<underscor>Guess where webshots started
01:55:00<underscor>xD
01:59:00<nintendud>hmmmmmmmmm
02:57:00<SketchCow>Things are slowing down on the webshots side for FOS, which is good.
02:59:00<flaushy>if i happen to have an ip-address change in the rsync process, is that a problem?
03:04:00<Sue>SketchCow: it's slowing down because a few of us are having issues running the script
03:06:00<SketchCow>No, no.
03:06:00<SketchCow>They should not be going to FOS anymore.
03:06:00<SketchCow>There's some stragglers.
03:07:00<Sue>oh
03:50:00<godane>uploaded: http://archive.org/details/cdrom-linuxformatmagazine-130
04:09:00<flaushy>SketchCow: do you keep FOS running for rsync over the weekend?
04:10:00<flaushy>since i need to leave here soon and won't be able to fix stuff until monday, and currently getting some nasty errors with the new version of the script
04:11:00<flaushy>(it's a small node, nooon)
04:14:00<flaushy>ah nevermind, it won't let me run outdated code :/
04:23:00<S[h]O[r]T>fos should be fine to accept pending rsyncs afaik
04:23:00<S[h]O[r]T>but you cant get new items from the tracker now
04:23:00<S[h]O[r]T>on the old code
04:24:00<flaushy>yep, just realized
04:24:00<flaushy>and thx :)
05:04:00<Sue>webshots standalone users getting async error: do which curl
05:13:00<SketchCow>http://archive.org/details/archiveteam-city-of-heroes-forums-megawarc-5
05:13:00<SketchCow>Gaze upon the future!
05:14:00<chronomex>MEGAWARC
05:14:00<chronomex>I am chronomex and I approve this message.
05:16:00<SketchCow>Watching Breaking Bad
05:16:00<chronomex>I also approve of Breaking Bad.
05:17:00<SketchCow>What I like is that it used/uses all this music that, it is later reported, the bands had no say in it going in.
05:18:00<SketchCow>So you get someone getting his face shot off or some prostitute ruining her life, and a sad little musician has to see it was paired with their music.
05:19:00<godane>i found the glenn beck forums
05:20:00<chronomex>hahaha, I didn't know that
05:20:00<godane>what is funny is that there are 'pirated' copies of old glenn beck radio shows on it
05:20:00<chronomex>lol
05:22:00<godane>i found 3 days from 2007
05:26:00<SketchCow>Uploading the SOPA blackout collection.
05:26:00<SketchCow>That should be interesting.
05:28:00<SketchCow>It'll be a nice complete grab.
05:29:00<SketchCow>I think we've about hit the end of the uploads through FOS of webshots
05:32:00<godane>looks like i need to grab the glenn beck forums
05:32:00<godane>based on the wayback machine there are only grabs from 2009 and 2010
05:34:00<SketchCow>Well, I'll give you an inside tip, godane.
05:34:00<SketchCow>By the end of October, the Wayback machine will have doubled its content.
05:34:00<SketchCow>Subsequently, it might be worth it to just wait to see what flies in first.
05:34:00<godane>ok
05:38:00<SketchCow>I've just killed webshots uploading on FOS
05:38:00<SketchCow>Since it's going to be replaced with a much more powerful system
05:38:00<SketchCow>And I need to dump a bunch of stuff through FOS to get it into Wayback
05:39:00<SketchCow>City of Heroes Forum is getting prepped for wayback.
05:39:00<SketchCow>That should be exciting in the extreme for them.
05:40:00<SketchCow>Wayback access
06:20:00<SketchCow>Breaking Bad.... perfect background for archiving work.
06:31:00<DFJustin>all glory to the megawarc
06:34:00<Cameron_D>Yay, all my BT users are done
07:57:00<godane>i know why there is no real archive for the glenn beck forums
07:58:00<chronomex>why?
07:58:00<godane>you have to pay for access to it
07:58:00<godane>i think thats the case
07:58:00<godane>the link is only on the archive mp3 page anyway so it sort of makes sense
07:59:00<chronomex>hmm
08:02:00<godane>so i may be the only hope to archive this
08:03:00<godane>thats what i tell myself cause stuff like dl.tv and crankygeeks would have been lost if i left it up to you guys
08:04:00<godane>good thing i archived when i did
08:17:00<chronomex>Go for it!
08:25:00<alard>We're now out of btinternet usernames. 4 hard cases left, but that's it.
09:00:00<SmileyG>hurrar
09:01:00<SmileyG>Haha I was last upload? \o/
11:04:00<C-Keen>hm, how can I add a warc file to the internet archive?
11:20:00<alard>C-Keen: http://archive.org/create/
11:20:00<C-Keen>alard: and then just upload the warc?
11:20:00<alard>Yes, make a new item.
11:21:00<C-Keen>alard: will that get put in the wayback machine?
11:21:00<alard>Ah, that I do not know.
11:21:00<alard>Although if you upload it there and put it on SketchCow's list there might be a chance.
11:22:00<C-Keen>alard: alright, will do
11:30:00<C-Keen>alard: hm, should a website mirror that contains mainly educational texts and source code be put in the Community Texts collection? I am unsure what to pick here
11:43:00<C-Keen>hah uploaded my first item to the archive...
11:53:00<alard>I think Community Texts is the only collection you can pick. The others are protected (SketchCow can move items).
12:00:00<SmileyG>errrr
12:00:00<SmileyG>why has my webshots stopped generating new processes
12:00:00<SmileyG>Oh, restarting project ¬_¬
13:41:00<SketchCow>I HAVE to start heading south.
13:42:00<SketchCow>But we just had a CDX derive fail off of a megawarc generator.
13:43:00<SketchCow>alard: http://www.us.archive.org/log_show.php?task_id=127674813
13:46:00<alard>SketchCow: Is the original tar somewhere?
13:47:00<SketchCow>Sadly, no.
13:47:00<SketchCow>I should have left it.
13:47:00<SketchCow>BOARDS-COH-01.tar.megawarc.json.gz BOARDS-COH-02.tar.megawarc.warc.gz BOARDS-COH-04.tar.megawarc.tar
13:47:00<SketchCow>root@teamarchive-1:/2/CITY# ls
13:47:00<SketchCow>BOARDS-COH-01.tar.megawarc.tar BOARDS-COH-03.tar.megawarc.json.gz BOARDS-COH-04.tar.megawarc.warc.gz
13:47:00<SketchCow>BOARDS-COH-01.tar.megawarc.warc.gz BOARDS-COH-03.tar.megawarc.tar megawarc
13:47:00<SketchCow>BOARDS-COH-02.tar.megawarc.json.gz BOARDS-COH-03.tar.megawarc.warc.gz
13:47:00<SketchCow>I'm going to gunzip one myself while I get dressed here.
13:47:00<SketchCow>BOARDS-COH-02.tar.megawarc.tar BOARDS-COH-04.tar.megawarc.json.gz
13:48:00<alard>I'm downloading that failed .warc.gz now, but that will take a while.
13:48:00<SketchCow>I wouldn't do that.
13:48:00<SketchCow>root@teamarchive-1:/2/CITY# gunzip BOARDS-COH-01.tar.megawarc.warc.gz
13:49:00<SketchCow>More critical, MUCH more critical, is http://www.us.archive.org/log_show.php?task_id=127728961
13:49:00<SketchCow>Watch it, and if it fails, THEN we have some testing to do.
13:51:00<SketchCow>http://www.us.archive.org/log_show.php?task_id=127728688 or this one.
13:51:00<SketchCow>That'll happen faster.
13:52:00<alard>These are converted tars? Or original megawarcs?
13:52:00<SketchCow>Converted tars.
13:52:00<SketchCow>Wait no
13:52:00<SketchCow>I took three different tars, uncompressed them into a file directory.
13:53:00<SketchCow>Then megawarc'd the file directory
13:54:00<dragondon>greetings all! is there a way to set the upload speed? This 0.4kB/s is ridiculous....
13:55:00<alard>The BOARDS-COH-01 too? That would be strange, since it contains a directory, and the pack option isn't supposed to add directories.
13:55:00<alard>dragondon: Which project?
13:55:00<dragondon>webshots
13:55:00<alard>Does it say CurlUpload?
13:55:00<dragondon>yes
13:56:00<alard>Hmm. No, there isn't any limit.
13:56:00<dragondon>I did see higher speeds earlier but now it's dragging...
13:56:00<dragondon>been doing so for a few hours now
13:58:00<SketchCow>alard: So:
13:58:00<SketchCow>if http://www.us.archive.org/log_show.php?task_id=127728688 doesn't work, warning sign.
13:58:00<SketchCow>http://www.us.archive.org/log_show.php?task_id=127728961 is the critical one.
13:58:00<SketchCow>If that doesn't work, we have real issues, that's a webshots generator.
13:59:00<SketchCow>I have to start driving to NYC now.
14:00:00<SketchCow>If it doesn't work, for whatever reason (webshots), underscor needs to go back to .tar generation until we figure it out.
14:00:00<SketchCow>Otherwise, just assume BOARDS-COH is me doing something fucked up
14:00:00<alard>Yes. (I can't reach archive.org now. www.us.archive.org works.)
14:00:00<godane>see
14:00:00<godane>i'm not crazy when i couldn't get to archive.org
14:00:00<dragondon>same here (South Korea) "Iceweasel can't establish a connection to the server at archive.org"
14:01:00<godane>i got the same error
14:01:00<godane>:-D
14:01:00<dragondon>I can ping it thought
14:01:00<dragondon>though
14:01:00<godane>its not just me
14:01:00<SketchCow>Just alerted them
14:03:00<alard>The last gzip record in BOARDS-COH-01.tar.megawarc.warc.gz is fine.
14:04:00<alard>As is the first.
14:04:00<dragondon>with this version of the VM, will it lose everything if I force the machine to shutdown? I need to figure out some hardware issues here.
14:05:00<alard>Yes.
14:05:00<dragondon>I'm hoping that future updates will have a buffer to prevent that :)
14:05:00<SketchCow>http://www.us.archive.org/log_show.php?task_id=127728961 - task failed.
14:06:00<SketchCow>Luckily (?) it's a mysql error.
14:06:00<alard>dragondon: Wget can't resume, and upload resuming is complicated, so it's unlikely.
14:06:00<dragondon>:(
14:07:00<dragondon>alard, is there no way to generate the files first, then send, and have some sort of check/confirm/then resume?
14:08:00<alard>It's complicated, and you don't have to restart that often. Resume things would also complicate error recovery (if the warrior has a problem now, you can reboot and start again).
14:09:00<dragondon>it's not a warrior issue, for some reason my system is reporting only half my physical memory....kinda don't like killing all the work it did, hence why I was asking for any speed mods. Guess I'll have to force restart
14:26:00<SketchCow>the sopa item failed
14:28:00<SketchCow>in both of these cases, I generated it from a set of directories.
14:28:00<alard>I suspect there's an undetected invalid warc in there.
14:30:00<SketchCow>For webshots, I am going to suggest that we go back to generating large tar files.
14:31:00<alard>Yes.
14:31:00<SketchCow>it sounds like we need to do a few more additional tests.
14:31:00<alard>We didn't do enough.
14:32:00<SketchCow>that is just because I think of you as an unstoppable code juggernaut
14:34:00<SketchCow>however, there is a whole range of code you have absolutely no access to.
14:37:00<alard>It would be handy if these error messages included a byte position. That would make it easier to find the problem.
14:43:00<SketchCow>I am in the car and can't look this up easily, but I do believe there is a public repository of all this code.
14:45:00<alard>If the gzip is invalid (that's what the error message suggests, at least) that just needs to be fixed. There's nothing wrong with the indexer.
14:45:00<alard>./megawarc --verbose pack test.tar data/infiles/
14:45:00<alard>Checking data/infiles/bad.warc.gz
14:45:00<alard>Checking data/infiles/good.warc.gz
14:45:00<alard>Copying data/infiles/good.warc.gz to warc
14:45:00<alard>Copying data/infiles/bad.warc.gz to warc
14:46:00<alard>That's wrong: bad.warc.gz isn't complete (I chopped off the last 1000 bytes) so it shouldn't go in the warc.
14:49:00<alard>The megawarc gzip-testing doesn't work, it seems. (The good news is that the positions in the json are correct, so the current megawarcs can be repaired.)
15:00:00<SketchCow>old versions are kept.
15:00:00<alard>For webshots?
15:11:00<SketchCow>all I had not to sure did wait 1 moment
15:12:00<alard>You shouldn't text while driving. :)
15:12:00<chronomex>Watch out SketchCow is using voice recognition
15:12:00<SketchCow>let's try again. Any archive that I was given are still in car for bad. The new batch was being tested, but we have not fully committed to it, instead we are just feelings disks on the round robin machine.
15:13:00<alard>Sure.
15:14:00<SketchCow>we were going to suck my nuts off
15:15:00<SketchCow>I let that 1 go because what I said was sign off.
15:15:00<SketchCow>obviously, voice recognition has a way to go
15:17:00<SketchCow>although, if something with my computer end up sucking my nuts off, dad hey, what's a little problem here and there with voice recognition?
15:18:00<SketchCow>maybe that's how the algorithm got the job in the first place
15:29:00<SmileyG>Fatal error: Uncaught exception 'Exception' with message 'WARNING-OR-ERROR: [2] [mysql_connect(): Too many connections] [/usr/local/petabox/www/common/DB.inc] [269]' in /usr/local/petabox/deriver/derive.php:46
15:29:00<SmileyG>Stack trace:#
15:29:00<SmileyG>It died.
15:29:00<SmileyG>Though that doesn't appear to be an issue with the megawarc itself which seems good.
15:42:00<alard>I think it works better now:
15:42:00<alard>CRC check failed 0x5cdcbe41 != 0x30788a20L
15:42:00<alard>Checking data/infiles/bad-extra.warc.gz
15:42:00<alard>Checking data/infiles/good.warc.gz
15:42:00<alard>Copying data/infiles/good.warc.gz to warc
15:42:00<alard>Invalid gzip data/infiles/bad-extra.warc.gz
15:42:00<alard>Copying data/infiles/bad-extra.warc.gz to tar
15:42:00<alard>Checking data/infiles/bad.warc.gz
15:42:00<alard>CRC check failed 0xdcbe4175 != 0xc21fb9ffL
15:42:00<alard>Invalid gzip data/infiles/bad.warc.gz
15:42:00<alard>Copying data/infiles/bad.warc.gz to tar
15:42:00<alard>https://github.com/alard/megawarc/commit/fb0ba014ff4df76411cdd426a15764695a33c59e
15:51:00<joepie91>oh look
15:51:00<joepie91>http://catalysthost.com/clientarea/cart.php?gid=4
15:51:00<joepie91>:P
15:51:00<joepie91>>Unmetered 1gbit
17:02:00<sankin1>for $7 a month? doesn't sound like a bad deal
17:21:00<underscor>SketchCow: http://www.us.archive.org/log_show.php?task_id=127728961 failed due to DB problems, rerunning.
17:22:00<underscor>(DB problems that are unrelated to megawarc)
17:25:00<underscor>oh shit, boxes are almost full
17:25:00<underscor>better start bailing
17:26:00<SmileyG>underscor: herp
17:26:00<SmileyG>weren't they already doing so :S
17:27:00<underscor>hm?
17:27:00<underscor>There
17:27:00<SmileyG>Why are they running outta room?
17:27:00<SmileyG>:/
17:27:00<underscor>Oh
17:27:00<underscor>There's no auto-ingest to IA
17:27:00<SmileyG>Do they not automatically pump to IA?
17:27:00<SmileyG>Ah ok.
17:28:00<underscor>Jason (and I) want a human to eyeball them
17:28:00<underscor>For now
17:28:00<SmileyG>Understandable.
17:28:00<SmileyG>So, do YOU work at IA?
17:31:00<godane>all of my theregister.co.uk warc dumps are up
17:31:00<godane>i have not done 2011 yet
17:31:00<godane>but its up to 2010
17:31:00<godane>which is all i have right now
17:40:00<underscor>SmileyG: Yeah
17:40:00<underscor>I'm part time, though
17:40:00<SmileyG>o
17:40:00<underscor>(I'm a student the rest of the time)
17:40:00<SmileyG>Still, awesome.
17:40:00<underscor>In upstate NY
17:40:00<underscor>hehe, thanks :D
17:41:00<SmileyG>IA should have some more DC's ;)
17:41:00<SmileyG>Like one in coventry, hahah here, its cheap(yeah right)
17:43:00<DFJustin>looks like that megawarc has gz issues as well
17:48:00<SmileyG>Awww
17:56:00<alard>There must be quite a few invalid warc files then.
18:05:00<underscor>alard: I thought megawarc checked them out?
18:06:00<SmileyG>:<
18:06:00<SmileyG>hmmm this worries me
18:06:00<underscor>Is there a way to check the validity of a gz on the command line?
18:06:00<underscor>(besides just extracting it)
18:12:00<SmileyG>gunzip -t file.tar.gz
18:12:00<underscor>thx
18:12:00<underscor>alard: should we switch to tars for now, or what do you think?
18:12:00<SmileyG>for test :)
18:13:00<SmileyG>also hmmm
18:13:00<SmileyG>if you're worried about the tars, you can check them too
18:13:00<SmileyG>gunzip -c file.tar.gz | tar t > /dev/null
18:13:00<underscor>I am beginning to get close to drowning, so I need to figure out the exit strategy
18:13:00<SmileyG>herp
18:14:00<underscor>(people should save slower!)
18:14:00<underscor>xD
18:14:00<SmileyG>can you just do what the warrior would do?
18:14:00<SmileyG>but just the upload bit (and direct it to FOS/IA ?
18:14:00<SmileyG>You said there's... 12? servers?
18:16:00<underscor>SmileyG: No, no, I *am* fos/IA
18:16:00underscor is the servers warriors are uploading to
18:16:00<underscor>Those servers are nearing full
18:16:00<SmileyG>yeah
18:16:00<SmileyG>but originally all the warriors were uploading to 1 location, which is now a number of locations?
18:16:00<underscor>FOS is full/not accessible for this project
18:16:00<underscor>Yes
18:16:00<SmileyG>What was the plan for the orignal server?
18:17:00<underscor>It was to upload tars, which had been happening
18:17:00<SmileyG>can you not replicate that process over to the other servers?
18:17:00<underscor>now the plan was doing the megawarcing with the script from alard, which I've been doing
18:17:00<underscor>but if they're corrupt, then maybe we should go back to tars for now
18:17:00<SmileyG>yeah
18:18:00<underscor>I may just make a Command Decision(tm) since SketchCow is on the road
18:18:00<underscor>and deal with the fallout later
18:19:00<SmileyG>well if you don't, all archiving basically stops
18:19:00<SmileyG>unless you've got his number?
18:19:00<underscor>yeah, I may call him after class
18:19:00<underscor>http://p.defau.lt/?Fy8RdcZOojTsFPlt6Yyzcg
18:19:00<underscor>uh oh
18:19:00<underscor>cc alard
18:23:00<DFJustin><underscor> alard: I thought megawarc checked them out? <-- I think it did, but the check wasn't working? (until he fixed it just now)
18:23:00<DFJustin><alard> https://github.com/alard/megawarc/commit/fb0ba014ff4df76411cdd426a15764695a33c59e
18:26:00<underscor>aha
18:34:00<underscor>gunzip -t webshots-20121012070021.megawarc.warc.gz
18:34:00<underscor>gzip: webshots-20121012070021.megawarc.warc.gz: invalid compressed data--crc error
18:34:00<underscor>sigh
18:35:00<underscor>so I guess this one is fucked
18:38:00<underscor>sigh
18:38:00<underscor>gunzip -t webshots-20121012070358.megawarc.warc.gz
18:38:00<underscor>gzip: webshots-20121012070358.megawarc.warc.gz: invalid compressed data--crc error
18:40:00<underscor>Rebuilding using new code
18:40:00<underscor>alard: what does the script do if it encounters a "bad" .warc.gz?
18:44:00<underscor>DFJustin: Where did alard say that link btw?
18:46:00<S[h]O[r]T>underscor
18:46:00<S[h]O[r]T>do you not have history in here from earlier
18:46:00<S[h]O[r]T>alard and SketchCow were talking about the corruption. i can paste in pm if you need
18:47:00<underscor>woah, there we go
18:47:00<underscor>what the hell, quassel
18:48:00<SmileyG>how long does it take to rebuild :S
18:49:00<underscor>S[h]O[r]T: Found it. Not sure what quassel was doing saying there wasn't more scrollback >:(
18:49:00<underscor>SmileyG: Uh, I haven't timed them, actually
18:50:00<underscor>I assume if a set passes gunzip -t, then it's probably safe to upload
18:51:00<SmileyG>I *believe* so, the only better check is physically unpacking it and checking.
18:51:00<SmileyG>which kind of negates the point.
18:58:00<alard>underscor: It turned out that the gzip check I had in megawarc didn't really check anything.
18:59:00<alard>So if there was an invalid warc, it was added to the big warc, which then became unreadable.
18:59:00<alard>I think it is fixed in the latest megawarc version (it works on my test files, at least).
18:59:00<underscor>Is there a way to easyclean from the json?
18:59:00<underscor>Ouch.
19:00:00<alard>Before that fix, SketchCow suggested that we keep using tar until the megawarc is somewhat more stable and tested.
19:00:00<alard>Yes.
19:00:00<underscor>schweet
19:00:00<alard>The positions of the warcs in the json are correct.
19:00:00<alard>So it's possible to untangle them.
19:01:00<alard>So it might be an idea to keep using the latest megawarc script for webshots.
19:01:00<alard>I think it works, it's a good test. We also don't lose data if it does not, it just means rebuilding things.
19:02:00<alard>(To answer your question about what happens to the invalid gzips: they're added to the tar file.)
19:02:00<SmileyG>-Die in a Fire ?
19:02:00<underscor>ah, the "extras" tar file
19:02:00<underscor>?
19:03:00<alard>Yes. So if the tar file is not empty, that means there were things that couldn't be saved in the warc.
19:03:00<underscor>What about the ones that say "extra field of 10 bytes ignored"?
19:04:00<underscor>(ones = warc.gz, when testing with gunzip -t)
19:07:00<underscor>Uploading the first new set
19:08:00<alard>That's the warc format: it has an extra gzip field with the length of the compressed warc record.
19:08:00<SmileyG>hmmm
19:08:00<SmileyG>gzip patch needed at some point then? :S
19:08:00<alard>That's handy if you want to skip through the warc, but the gzip utility doesn't know how to use it.
19:08:00<alard>Well, it does what it says: it sees an extra field and ignores it.
19:08:00<SmileyG>:D
19:08:00<SmileyG>least it doesn't blow up I guess
19:09:00<SmileyG>Wonder if you can tell the test to ignore it (so it only raises errors on _real_ errors)
19:09:00<SmileyG>I smell dinner.
19:13:00<underscor>SmileyG: It still returns $? = 0
19:13:00<underscor>so it's not really a big deal
19:14:00<alard>Is this a new one? http://www.us.archive.org/catalog.php?history=1&identifier=webshots-freeze-frame-20121012103518
19:15:00<underscor>Yes
19:15:00<underscor>Only the json is up though
19:15:00<underscor>the warc is still uploading
19:16:00<alard>Ah. Was there a tar?
19:17:00<underscor>0 bytes
19:17:00<alard>So it's exciting to see if this one passes the test.
19:17:00<underscor>It passed gunzip -t too
19:17:00<SmileyG>underscor: Ah ok ! I presumed it'd return some non-fatal error code
19:17:00<SmileyG>but if it's not showing it other than the stdout output.... no worries
19:19:00<underscor>alard: Is the procedure to fix these to "create" the tar backwards, and repack, or will you be able to write a "fixme" thing? :)
19:19:00<alard>I think it will be a fixme thing.
19:19:00<underscor>rad
19:20:00<underscor>another one finished!
19:20:00<underscor>-rw-r--r-- 1 abuie users 50G Oct 12 19:18 webshots-20121012070358.megawarc.warc.gz
19:20:00<underscor>-rw-r--r-- 1 abuie users 103K Oct 12 19:18 webshots-20121012070358.megawarc.json.gz
19:20:00<underscor>-rw-r--r-- 1 abuie users 388M Oct 12 19:18 webshots-20121012070358.megawarc.tar
19:20:00<alard>Hey, a tar.
19:20:00<SmileyG>working and looking correct now?
19:20:00<alard>That's both good and bad news.
19:20:00<alard>Good for megawarc, bad for webshots.
19:21:00<SmileyG>o_O
19:22:00<underscor>We can extract out the "bad" users and requeue them, though, right?
19:23:00<SmileyG>AH, the tars are failed users getting left over?
19:24:00<underscor>SmileyG: Yeah
19:24:00<alard>The invalid warcs end up in the tar.
19:24:00<underscor>Well, faulty warc.gz
19:24:00<underscor>mhm
19:25:00<alard>We could make a list of the users that have made it to archive.org and compare that with the full list of users.
19:25:00<alard>But for the moment we have enough new users.
19:26:00<underscor>Lots of limestone networks hosts, wonder who that is
19:26:00<underscor>They're pumping a lot of data :D
19:27:00<SmileyG>Is it Sue?
19:27:00<SmileyG>She was saying shes gonna hit her cap shortly in #webshots
19:29:00<underscor>alard: http://archive.org/catalog.php?history=1&identifier=webshots-freeze-frame-20121012103518 Here we go!
19:31:00<chronomex>wooooo
19:31:00chronomex parties
19:34:00<underscor>ugh, moving 50GB takes so long
19:39:00<underscor>http://www.us.archive.org/catalog.php?history=1&identifier=webshots-freeze-frame-20121012070358 is getting its replacement uploaded
19:42:00<underscor>schweet
19:42:00<underscor>Every box is now megawarcing
19:43:00<underscor>Although these take quite a bit of time
19:43:00<underscor>Wonder if I can keep up with the inflow
19:43:00<chronomex>INTERFLOW
19:44:00<underscor>http://ia600109.us.archive.org:8088/mrtg/networkv2.html http://ia601104.us.archive.org:8088/mrtg/networkv2.html http://ia700106.us.archive.org:8088/mrtg/networkv2.html
19:44:00<underscor>Y'all have been keeping them pretty nice and busy
19:49:00<SmileyG>- Downloaded 18400 URLs got another nice one :D
21:02:00<underscor>http://www.us.archive.org/log_show.php?task_id=127779327 It's cdxing now!
21:02:00<underscor>Cross fingers!!!! :D
21:12:00<alard>The CDX indexer is already running twice as long as the previous time.
21:12:00<underscor>'s a good sign :D
21:15:00<SketchCow>hooray.
21:15:00<SmileyG>\o/
21:15:00SmileyG waits for underscor to start groveling.
21:15:00<underscor>yay, Jason's back
21:16:00<chronomex>25 seconds!
21:16:00<underscor>I have like 2TB processing
21:16:00<underscor>:P
21:16:00<underscor>megawarcing takes a fair bit of time/work, though
21:16:00<underscor>still can't quite tell if I'm filling faster than I'm dumping
21:17:00<SmileyG>hope not :S
21:17:00<underscor>Also, just the sheer (super awesome!) scale of moving 50gb bricks is... interesting
21:17:00<SmileyG>:)
21:20:00<SketchCow>[6~[6~[6~[6~[6~[6~[6~[6~
21:20:00<SmileyG>what he said.
21:23:00<alard>The voice recognition gets more and more interesting.
21:25:00<underscor>hahahha
21:25:00<underscor>It was better when it was sucking his nuts off or whatever
21:30:00<SketchCow>so, status update please
21:31:00<SketchCow>this android ssh client has no pgup
21:32:00<SketchCow>also. comiccon is hell on earth.
21:34:00<underscor>SketchCow: We (think) we patched the bugs
21:35:00<underscor>Test derive is still running
21:35:00<underscor>but it got further than any of them have
21:35:00<underscor>so (probably) good
21:35:00<underscor>I have like 2.5TB to ingest
21:35:00<underscor>once we see how this goes
21:44:00<alard>Looks good? http://www.us.archive.org/log_show.php?task_id=127779327
21:47:00<underscor>alard: you are the freakin' man
21:47:00<underscor>we need to set you up on gittip :D
21:48:00<SmileyG>i am so jeli
21:49:00<underscor>SketchCow: IT WORKED IT WORKED IT WORKED
21:50:00<alard>This isn't actually that much better than before: there's no tar with invalid warcs. That already worked.
21:50:00<SmileyG>http://archive.org/details/webshots-freeze-frame-20121012173401 the latest one lacks a warc?
21:50:00<underscor>SmileyG: still uploading
21:51:00<underscor>alard: oh
21:51:00<SmileyG>o
21:51:00<SmileyG>:D
21:51:00<SketchCow>find one where the problem is fixed versus before.
21:51:00<SketchCow>sopa is a good one
21:51:00<underscor>alard: 3 finished, no tar
21:51:00<underscor>:D
21:52:00<alard>The two that crashed with the zlib.error problem should have tars.
21:52:00<underscor>I just had to restart them
21:52:00<underscor>yes, those haven't finished
21:52:00<alard>I'm now testing my fix script on the sopa files.
21:52:00<alard>(Takes a while.)
21:53:00<underscor>Ah, this will be the one that lets us fix a bad megawarc.warc.gz
21:53:00<underscor>?
21:53:00<SketchCow>underscor. ask hank when the last load in of the wayback happens, please.
21:54:00<alard>Yes. It reads the megawarc, checks every warc.gz, sorts them into new warc/tar files, and saves the locations in a new json file.
21:54:00<alard>It worked on my tiny test file, but 15GB takes a little longer.
21:54:00<SketchCow>alard, call it megarepair and add it to the repository. :)
21:58:00<alard>Too late, it's already called megawarc-fix. It's now in the repository.
22:03:00<underscor>SketchCow: asking
22:04:00<underscor>alard: pulling :D
22:04:00<underscor>you're amazing
22:04:00<alard>Might need some testing first, though.
22:11:00<alard>Hmm. Apparently not every tar header is exactly 512 bytes long.
22:12:00<alard>There's a 'gnu tar' type that has headers of 1024, 1536 etc bytes long, if there are long filenames.
22:12:00<alard>As there are in the SOPA file.
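[The 1024/1536-byte headers alard mentions come from GNU tar's LongName ('L') pseudo-entry: a name over 100 characters gets an extra header block plus the name itself, rounded up to 512 bytes, before the real header. A small sketch using Python's tarfile module; the helper function is mine.]

```python
import io
import tarfile


def blocks_before_data(name, payload=b"DATA"):
    """Write one GNU-format tar entry and count the 512-byte blocks
    that precede the file data (i.e. the total header size)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue().index(payload) // 512

print(blocks_before_data("short.warc.gz"))         # 1 block (512 bytes)
print(blocks_before_data("x" * 150 + ".warc.gz"))  # 3 blocks (1536 bytes)
```

[A fixer walking a megawarc built from such a tar has to account for these variable-length headers, which is what made the SOPA file hard to repair.]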
22:35:00<underscor>-rw-r--r-- 1 abuie users 50G Oct 12 21:37 webshots-20121012183139.megawarc.warc.gz
22:35:00<underscor>-rw-r--r-- 1 abuie users 73K Oct 12 21:37 webshots-20121012183139.megawarc.json.gz
22:35:00<underscor>-rw-r--r-- 1 abuie users 639M Oct 12 21:37 webshots-20121012183139.megawarc.tar
22:35:00<underscor>alard: One of the ones with tars finished, fyi
22:35:00<underscor>Uploading now
23:20:00<alard>So, which date is *only* available in a faulty megawarc?
23:20:00<alard>*data
23:20:00<alard>The SOPA megawarc is really hard to fix, since it has these ridiculously long file names.
23:23:00<alard>So I'd like to suggest that we 1. make new megawarcs from scratch, and test with gunzip -tv / tar -tv / megawarc restore, if we have the original data; and 2. use megawarc-fix to fix the megawarcs that we don't have in another form, such as webshots.
23:23:00<underscor>alard: I'm currently running the fixer on 20121012070021
23:23:00<alard>webshots doesn't have long filenames, so the fixer should work for those files.
23:27:00<alard>There are no long filenames in 20121012070021, so that should work, I hope: curl -s -L http://archive.org/download/webshots-freeze-frame-20121012070021/webshots-20121012070021.megawarc.json.gz | gunzip | grep LongLink
23:28:00<alard>CoH can also be fixed: curl -s -L http://archive.org/download/archiveteam-city-of-heroes-forums-megawarc-1/BOARDS-COH-01.tar.megawarc.json.gz | gunzip | grep LongLink
23:29:00<alard>But SOPA can not: curl -s -L http://archive.org/download/archiveteam-sopa-blackout/2012-sopa-day-collection.megawarc.json.gz | gunzip | grep LongLink
23:35:00<underscor>I'll defer to SketchCow before fixing CoH, but I assume he'll want it to be
23:56:00<tef_>how are they corrupt ?