00:02:00<corobo>running it on a couple machines; if you could add something like an optional argument to bind to an IP, I could run it off a few IPs :)
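(A minimal sketch of what such an option might look like client-side, assuming the Python requests stack these scripts tend to use; the SourceAddressAdapter class and the 192.0.2.10 address are illustrative, not part of any actual Archive Team script:)

    import requests
    from requests.adapters import HTTPAdapter

    class SourceAddressAdapter(HTTPAdapter):
        # Hypothetical helper: bind outgoing connections to one local IP.
        def __init__(self, source_ip, **kwargs):
            self.source_ip = source_ip
            super(SourceAddressAdapter, self).__init__(**kwargs)

        def init_poolmanager(self, connections, maxsize, block=False, **kwargs):
            # urllib3 takes source_address as a (host, port) pair; port 0 = any.
            kwargs['source_address'] = (self.source_ip, 0)
            super(SourceAddressAdapter, self).init_poolmanager(
                connections, maxsize, block=block, **kwargs)

    session = requests.Session()
    session.mount('http://', SourceAddressAdapter('192.0.2.10'))
    session.mount('https://', SourceAddressAdapter('192.0.2.10'))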
05:32:00<tuankiet>@alard Hello!
10:43:00<alard>tuankiet: Hi.
10:44:00<alard>kennethre: No help needed, yet, other than running one of those repository-discovery scripts, perhaps. Once we have a nice list we can look at downloading the downloads.
10:46:00<alard>tuankiet: You were after the Google scraper script for the Yahoo blogs, but I haven't had time to look at it yet (the version I have is really personalized, so it basically only works for me, at the moment).
11:56:00<tuankiet>@alard: Ok
11:57:00<tuankiet>I am running the Yahoo and Github script
12:34:00<alard>tuankiet: Very good.
12:35:00<alard>It's a pity that Dailybooth is so slow. We're working on too many projects.
13:23:00<Nemo_bis>At last, it looks like Wikia is generating a dump per minute instead of one every 5, since the 10th http://wikistats.wikia.com/c/ca/
13:48:00<SketchCow>OK, so.
13:48:00<SketchCow>I have to say.
13:48:00<SketchCow>When you checked the github-grab code into github, for us to turn around and download github
13:48:00<SketchCow>Oh man
13:48:00<SketchCow>I almost died
13:53:00<GLaDOS>So uuh, I heard you like github..
13:55:00<SketchCow>At this exact moment, archive.org has one petabyte of disk space
13:56:00<SketchCow>free
13:56:00<Nemo_bis>SketchCow: are you saying because you plan to reduce it vastly and very soon? :p
13:56:00<SketchCow>Yes
13:56:00<Nemo_bis>:)
13:56:00<SketchCow>I'd like to understand.... do we need more archiveteam warriors on the dailybooth project?
13:57:00<Nemo_bis>I also have to admit that it's not so obvious what one has to do to help the archiveteam
13:57:00<Nemo_bis>too many projects and we're too lazy to update the wiki
14:03:00<SketchCow>We're not too lazy.
14:03:00<SketchCow>The wiki's choked because of the spam. I will fix it.
14:04:00<Nemo_bis>Speaking of which, can you make me sysop
14:04:00<Nemo_bis>it's weird not to have the delete button on a wiki
14:04:00<Nemo_bis>and frustrating for me :)
14:18:00<alard>No, I don't think more warriors would help with dailybooth. It's dailybooth that's too slow.
14:19:00<alard>Perhaps we should consider giving the warriors something else to do (github!), since we have more than enough non-warriors to keep dailybooth busy.
14:20:00<alard>We're doing 12 / 7 / 8 / 14 / 16 dailybooth users per minute (and that includes 404's).
14:21:00<alard>Do we want the Github downloads in warc format?
14:36:00<SketchCow>I personally think no.
14:38:00<alard>You don't want to go for maximum inaccessibility?
14:40:00<alard>If not warc, then what? A .tar?
14:40:00<alard>(What to do with the /downloads HTML page?)
14:42:00<alard>We could also just rsync the files as-is. The url structure is tidy enough (user/repo/download).
14:43:00<alard>The downloads page has the download count, everything else exists in other forms: https://github.com/ArchiveTeam/mobileme-grab/downloads
15:18:00<SketchCow>I think in this case, we're rescuing a filesystem, not an experience
15:19:00<SketchCow>A .txt file accompanying the files indicating the download count, if you're being completist.
15:19:00<SketchCow>And personally, I think that assessment could be in a single .txt file
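(A sketch of what the rescued filesystem could look like on disk, following the tidy user/repo/download structure alard mentions above; the user, repo, and file names are illustrative:)

    exampleuser/
        examplerepo/
            index.txt            <- download counts scraped from the /downloads page
            release-1.0.tar.gz
            release-1.1.zip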
16:18:00<alard>SketchCow: Could you have a look at alardland/github?
16:36:00closure perks up his ears hearing about plans to do something with github
16:36:00<closure>is this about archiving the git repos, or some of their other data?
16:36:00<alard>The downloads.
16:37:00<closure>hmm, not familiar with that
16:37:00<alard>https://github.com/blog/1302-goodbye-uploads
16:37:00<closure>aha, thanks
16:38:00<alard>We're making a list of repositories, so that could be used for other things in the future.
16:38:00<closure>so there's a guy who has been using their API to find all new repositories for a while.. I forget the link to his site
16:40:00<SketchCow>alard - Looks good.
16:40:00<SketchCow>I suspect this won't be a LOT of data
16:41:00<alard>You *hope* it's not a lot of data.
16:42:00<closure>for a lot of data, see sourceforge downloads :P
16:42:00<SketchCow>I don't actually (hope)
16:42:00<SketchCow>Because once again the Compass Has Swung and archive.org has tons of disk space.
16:42:00<SketchCow>I mean, we still should help raise funds because it helps
16:43:00<SketchCow>But 1 petabyte of free disk space right now
16:43:00<SketchCow>So yeah, let's do it.
16:43:00<SketchCow>I'll e-mail a hug to my github buddies
16:46:00<closure>ah, I see you already found githubarchive.org
16:48:00<alard>SketchCow: Want to say hi in the User-Agent header as well?
16:52:00<SketchCow>Sure.
16:52:00<SketchCow>"Archive Team Loves GitHub"
16:55:00<alard>https://github.com/ArchiveTeam/github-download-grab/commit/e3073ec5573a6d9b1e9508ad283168358019aae3
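(For reference, a friendly User-Agent is a one-liner in a requests-based grab script; a sketch, using the downloads page linked earlier as the example URL:)

    import requests

    headers = {'User-Agent': 'Archive Team Loves GitHub'}
    r = requests.get('https://github.com/ArchiveTeam/mobileme-grab/downloads',
                     headers=headers)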
17:07:00<alard>Heh, the tracker might not like this: http://tracker.archiveteam.org/github/
17:08:00<closure>have you already pulled in the api dump data? If not, I might try some massaging
17:09:00<alard>No, I haven't. We're well on our way with the API exploration, though: http://tracker.archiveteam.org:8125/
17:09:00<alard>(I think the highest ID is in the 7,000,000 range.)
17:12:00<closure>I'm running the scraper for that, so if there's time to plow through the whole range, that's fine
17:44:00<SketchCow>What is our HQ url again?
17:45:00<nitro2k01>What? Headquarters? http://archiveteam.org/ ?
17:49:00<SketchCow>No, got it.
17:49:00<SketchCow>http://warriorhq.archiveteam.org/
17:49:00<nitro2k01>Ah, that
17:50:00<godane>burning a bluray of gbtv/theblaze episodes
17:50:00<godane>the rest of november and election coverage is on this one
18:43:00<Nemo_bis>SketchCow: can I buy another 50 kg of magazines to send you? :D
18:43:00<Nemo_bis>"PC Professionale" 110-189
18:43:00<Nemo_bis>shipping will cost about three times as much as buying
18:44:00<DFJustin>I like how kg is our standard unit for magazines now
18:45:00<Nemo_bis>DFJustin: what other unit could I choose for transatlantic cooperation? :p
18:53:00<Nemo_bis>I don't remember if ias3upload.pl overwrites existing files with same name or not
18:57:00<godane>i uploaded august of 2011 episodes of x-play today
19:11:00<SketchCow>At current trends, github data will be about 200gb
19:14:00<DFJustin>*yawn*
19:17:00<chronomex>*slurp*
19:31:00<chronomex>alard: did we already finish the API grabbing?
19:32:00<chronomex>my discoverer died last night with requests.exceptions.ConnectionError: HTTPSConnectionPool(host=u'api.github.com', port=443): Max retries exceeded with url: /repositories?since=1295141
19:32:00<Deewiant>I'm still running the github repo explorer, it seems to come up with some new tasks every couple of minutes
19:33:00<Deewiant>(I put my auth info in there so it can do 5000 instead of 60 per hour)
19:33:00<chronomex>neato
19:34:00<Deewiant>(At first I accidentally put them on a tracker HTTP request, had to change the password then >_<)
19:34:00<chronomex>hah, woops
19:34:00<chronomex>probably nobody's looking at those ... except the NSA watches them in transit
19:35:00<Deewiant>Yep, I think it was an unencrypted request too
19:35:00<chronomex>you're fucked
19:35:00<Deewiant>Well, I managed to change the password without any trouble
19:36:00<Deewiant>Maybe somebody defaced all my repos in the interim ;-P
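(A sketch of the authenticated discovery request against the GitHub v3 API of the time; the credentials are placeholders, and 1295141 is simply the resume point from chronomex's traceback above:)

    import requests

    AUTH = ('your-github-user', 'your-password-or-token')  # placeholder
    since = 1295141  # last repository ID seen

    resp = requests.get('https://api.github.com/repositories',
                        params={'since': since}, auth=AUTH)
    resp.raise_for_status()
    repos = resp.json()
    # Authenticated clients get 5000 requests/hour instead of 60:
    print(resp.headers.get('X-RateLimit-Remaining'))
    # The highest 'id' in this page becomes the next 'since' value.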
19:45:00<chronomex>you seem to be sucking the job queue dry
19:45:00<chronomex>good work
19:46:00<Deewiant>Where does it get jobs from?
20:10:00<kennethre>chronomex: sorry :)
20:11:00<kennethre>I'd recommend using something like celery
20:23:00<chronomex>erp, what?
20:27:00<kennethre>re: requests.exceptions.ConnectionError
20:27:00<kennethre>to spread them across different machines, handle exceptions, etc
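(Short of a full celery deployment, even a minimal retry wrapper would absorb transient errors like the ConnectionError above; a sketch, not the actual discoverer code:)

    import time
    import requests

    def get_with_retries(url, tries=5, backoff=2, **kwargs):
        # Retry transient connection errors with exponential backoff.
        for attempt in range(tries):
            try:
                return requests.get(url, timeout=60, **kwargs)
            except requests.exceptions.ConnectionError:
                if attempt == tries - 1:
                    raise
                time.sleep(backoff ** attempt)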
20:27:00<SketchCow>62 BBC R&D Descriptions left!
20:27:00<SketchCow>Poor github
20:30:00<balrog_>yeah, I'm getting no tasks.
20:30:00<balrog_>actually I am getting one in a while
20:33:00<alard>http://zeppelin.xrtc.net/corp.xrtc.net/shilling.corp.xrtc.net/index.html
20:42:00<godane>SketchCow: Thanks for putting up x-play episodes in the collection
20:54:00<SketchCow>No problem.
20:54:00<SketchCow>More soon
21:04:00<godane>i will do 2012 episodes in 2013 so i don't get this stuff darked
21:05:00<godane>when the network is dead there shouldn't be fear of nbc sending dmca notices, i hope
21:07:00<soultcer>there are so many people fetching github repo lists that it is hard to actually get a task assigned
21:24:00<sankin1>the leaderboard is flying
21:26:00<soultcer>Whoa there's already a project to download
21:28:00<alard>Perhaps I should ask: what is an acceptable number of requests to send to GitHub?
21:28:00<alard>We're currently doing over 50 requests per second.
21:31:00<soultcer>As long as Github doesn't show elevated error response rates, keep it up :D
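(For scale: at 50 requests/second, a sweep of all ~7,000,000 repository IDs mentioned earlier works out to 7,000,000 / 50 = 140,000 seconds, roughly 39 hours.)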
21:32:00<alard>Apparently underscor has joined us.
21:32:00<Deewiant>The non-Warrioring cheater.
21:33:00<soultcer>Well dailybooth is kind of boring with its low download speed and timeouts
21:33:00underscor pads in drearily, rubbing sleep out of his eyes
21:33:00<underscor>what oh yes hi
21:33:00<kennethre>alard: to the api?
21:33:00<alard>No, to the /downloads page.
21:33:00<kennethre>i wouldn't worry about it
21:34:00<kennethre>unless you get 500s
21:37:00<soultcer>The actual downloads are from cloudfront and probably s3-backed
21:37:00<kennethre>yep
21:38:00<alard>(The precise thing to say would be: 50 r/s to the /downloads pages.)
21:38:00<SketchCow>Just for the record, godane - you are cutting it way close to the edge.
21:39:00<SketchCow>I realize a lot of these safe times and cooldown periods are fake and wishful thinking, but putting up stuff that is less than a year old is inviting the scorpion to reflexively sting even though it is "dead"
21:39:00<SketchCow>I'd be happier if we were downloading and putting up stuff from the 1980s, like you were doing with tv shows and older material.
21:39:00<SketchCow>Even the 90s
21:39:00<SketchCow>I mean, if you have a choice.
21:44:00<SketchCow>In other news, this test looks successful. http://archive.org/details/creativecomputingv11n11-tiffcbz
21:46:00<underscor>Boy, my browser really hates the tracker
21:47:00<Nemo_bis>underscor: isn't it cute to see the top downloaded items in http://archive.org/details/philosophicaltransactions a year later :)
21:47:00<Deewiant>Pause your scripts, it'll be much more palatable ;-)
21:48:00<underscor>Nemo_bis: Wow, I'd forgotten about that
21:48:00<Nemo_bis>:D
21:48:00<underscor>Damn, that is cool :D
21:48:00<underscor>Deewiant: <:B <:B
21:51:00<Nemo_bis>Experiments on the Refrangibility of the Invisible Rays of the Sun. By William Herschel, LL. D. F. R. S. 602 downloads
21:58:00<SketchCow>http://archive.org/details/creativecomputingv11n11-tiffcbz
21:59:00<DFJustin>same url is same
22:00:00<Nemo_bis>I love how ias3upload smartly retries
22:01:00<Nemo_bis>SketchCow: would you create a collection for these 106 magazine issues I uploaded? https://archive.org/search.php?query=subject%3A"Hacker+Journal"
22:06:00<SketchCow>If I do it right now, it'll explode. Let them derive and settle and I'll do it in 5 seconds.
22:06:00<SketchCow>They're still deriving
22:08:00<godane>looks like archive.org is having problems
22:10:00<godane>also everything is waiting to be archived
22:11:00<Nemo_bis>SketchCow: ah ok sorry, yes there are about 6000 items in the derive queue
22:12:00<Nemo_bis>Also, I miss ocrcount
22:19:00<chronomex>oh jesus I just loaded the github tracker
22:19:00<chronomex>I don't think I've ever seen a tracker go this fast
22:19:00<chronomex>zoooom
22:25:00<SketchCow>Is there a total repository count somewhere on github?
22:25:00<SketchCow>I'm looking for it.
22:25:00<SketchCow>I see press release saying 3.7 million
22:26:00<balrog_>that's in Sep 13
22:28:00<SketchCow>https://twitter.com/textfiles/status/279350174541819905
22:28:00<balrog_>this is just downloading file listings, right?
22:29:00<balrog_>or is that part finished?
22:29:00<balrog_>SketchCow: also note that there are many private github repos
22:29:00<balrog_>since you can pay for private ones
22:30:00<balrog_>3.7 million would include those
22:30:00<chronomex>I think that number includes gists as well
22:30:00<balrog_>doubt it, but maybe
22:31:00<balrog_>I liked github downloads because you could post binaries and hotlink them from elsewhere
22:31:00<balrog_>sucks that they're going away
22:34:00<balrog_>are you guys sure the downloads contain data?
22:34:00<balrog_>or is this just listings?
22:35:00<DFJustin>I saw one that was 7mb
22:36:00<balrog_>some should be 20-50 or more
22:36:00<balrog_>DFJustin: are all the lists retrieved?
22:36:00<SketchCow>First, realize what these are.
22:36:00<balrog_>so it's now downloading files, right?
22:37:00<SketchCow>These are NOT the code repositories.
22:37:00<balrog_>most of them will be under 1mb
22:37:00<balrog_>yes, I understand
22:37:00<SketchCow>Like github/boner-muncher is code
22:37:00<balrog_>however, some projects have posted fairly large files
22:37:00<balrog_>I've used this service myself for some of my code.
22:37:00<SketchCow>The /downloads are JUST the separate files.
22:37:00<balrog_>yes
22:37:00<DFJustin>just watched it for a couple seconds and some are dozens of mb so I think it's ok
22:37:00<SketchCow>Well, conclusively, we're finding the vast vast vast majority of the 3.7 million never used this feature
22:37:00<SketchCow>VAST majority.
22:37:00<balrog_>that is correct
22:37:00<balrog_>ahh, so the warrior lists those who didn't use it.
22:37:00<DFJustin>also that is cartoonishly fast
22:37:00<SketchCow>root@teamarchive-1:/1/ALARD/warrior/github# du -sh .
22:37:00<SketchCow>55G .
22:38:00<balrog_>hopefully wget-lua compiles before this is done :P
22:38:00<SketchCow>root@teamarchive-1:/1/ALARD/warrior/github# find . -type f | wc -l
22:38:00<SketchCow>18303
22:38:00<SketchCow>Remember, that's including index.txt
22:38:00<balrog_>and index.txt is generated for all repos?
22:38:00<SketchCow>root@teamarchive-1:/1/ALARD/warrior/github# find . -name index.txt | wc -l
22:38:00<SketchCow>2717
22:38:00<SketchCow>See? Yes
22:39:00<SketchCow>Yes, it is.
22:39:00<SketchCow>Just to keep the download counts
22:45:00<balrog_>how do I set this up to run without the warrior?
22:46:00<soultcer>I assume same as all other warrior projects that use wget-lua
22:46:00<balrog_>just python ./pipeline.py?
22:47:00<soultcer>1) Install python, python tornado (> v2), python tornadio, python argparse, openssl headers, lua headers
22:47:00<soultcer>2) git clone github.com/archiveteam/seesaw-kit.git
22:47:00<SketchCow>Poor github, they just want to do the right thing.
22:47:00<SketchCow>OK, separate channel
22:48:00<SketchCow>#gothub
22:48:00<soultcer>3) git clone github.com/archiveteam/github-download.git
22:51:00<balrog_>soultcer: yeah, I have all that, and I have wget-lua; just how do I start it?
22:51:00<soultcer>with run-pipeline, as usual
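(Putting soultcer's steps together, the usual standalone invocation looks roughly like this, assuming seesaw-kit is installed and its run-pipeline script is on the PATH; YOURNICK is whatever name you want on the leaderboard, and --concurrent is optional:)

    git clone https://github.com/ArchiveTeam/github-download.git
    cd github-download
    run-pipeline pipeline.py YOURNICK --concurrent 2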
22:52:00<SketchCow>Please redirect people over to #gothub.
22:52:00<SketchCow>We're back to the Usual Crap again
23:10:00<SketchCow>alard: Please come to #gothub - possible bug