00:02:00 | <corobo> | running it on a couple of machines; if you could add something like an optional argument to bind to an IP, I could run it off a few IPs :) |
05:32:00 | <tuankiet> | @alard Hello! |
10:43:00 | <alard> | tuankiet: Hi. |
10:44:00 | <alard> | kennethre: No help needed, yet, other than running one of those repository-discovery scripts, perhaps. Once we have a nice list we can look at downloading the downloads. |
10:46:00 | <alard> | tuankiet: You were after the Google scraper script for the Yahoo blogs, but I haven't had time to look at it yet (the version I have is really personalized, so it basically only works for me, at the moment). |
11:56:00 | <tuankiet> | @alard: Ok |
11:57:00 | <tuankiet> | I am running the Yahoo and Github script |
12:34:00 | <alard> | tuankiet: Very good. |
12:35:00 | <alard> | It's a pity that Dailybooth is so slow. We're working on too many projects. |
13:23:00 | <Nemo_bis> | At last, it looks like Wikia is generating a dump per minute instead of one every 5, since the 10th http://wikistats.wikia.com/c/ca/ |
13:48:00 | <SketchCow> | OK, so. |
13:48:00 | <SketchCow> | I have to say. |
13:48:00 | <SketchCow> | When you checked in the github content for us to turn around and download github |
13:48:00 | <SketchCow> | Oh man |
13:48:00 | <SketchCow> | I almost died |
13:53:00 | <GLaDOS> | So uuh, I heard you like github.. |
13:55:00 | <SketchCow> | At this exact moment, archive.org has one petabyte of disk space |
13:56:00 | <SketchCow> | free |
13:56:00 | <Nemo_bis> | SketchCow: are you saying because you plan to reduce it vastly and very soon? :p |
13:56:00 | <SketchCow> | Yes |
13:56:00 | <Nemo_bis> | :) |
13:56:00 | <SketchCow> | I'd like to understand.... do we need more archiveteam warriors on the dailybooth project? |
13:57:00 | <Nemo_bis> | I also have to admit that it's not so obvious what one has to do to help the archiveteam |
13:57:00 | <Nemo_bis> | too many projects and we're too lazy to update the wiki |
14:03:00 | <SketchCow> | We're not too lazy. |
14:03:00 | <SketchCow> | The wiki's choked because of the spam. I will fix it. |
14:04:00 | <Nemo_bis> | Speaking of which, can you make me sysop |
14:04:00 | <Nemo_bis> | it's weird not to have the delete button on a wiki |
14:04:00 | <Nemo_bis> | and frustrating for me :) |
14:18:00 | <alard> | No, I don't think more warriors would help with dailybooth. It's dailybooth that's too slow. |
14:19:00 | <alard> | Perhaps we should consider giving the warriors something else to do (github!), since we have more than enough non-warriors to keep dailybooth busy. |
14:20:00 | <alard> | We're doing 12 / 7 / 8 / 14 / 16 dailybooth users per minute (and that includes 404's). |
14:21:00 | <alard> | Do we want the Github downloads in warc format? |
14:36:00 | <SketchCow> | I personally think no. |
14:38:00 | <alard> | You don't want to go for maximum inaccessibility? |
14:40:00 | <alard> | If not warc, then what? A .tar? |
14:40:00 | <alard> | (What to do with the /downloads HTML page?) |
14:42:00 | <alard> | We could also just rsync the files as-is. The url structure is tidy enough (user/repo/download). |
14:43:00 | <alard> | The downloads page has the download count, everything else exists in other forms: https://github.com/ArchiveTeam/mobileme-grab/downloads |
15:18:00 | <SketchCow> | I think in this case, we're rescuing a filesystem, not an experience |
15:19:00 | <SketchCow> | A .txt file accompanying the files indicating the download count, if you're being completist. |
15:19:00 | <SketchCow> | And personally, I think that assessment could be in a single .txt file |
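A minimal sketch of the "filesystem, not an experience" approach being settled on here: fetch a repo's /downloads page, save each linked file under user/repo/, and keep an accompanying index.txt. The function name and the href regex are illustrative assumptions; the real grab used wget-lua, and the /downloads markup may differ.

```python
# Illustrative sketch only: the actual project used wget-lua, and the
# href pattern is an assumption about GitHub's /downloads page markup.
import os
import re
import requests

def grab_downloads(user, repo, dest="github"):
    page = requests.get("https://github.com/%s/%s/downloads" % (user, repo))
    page.raise_for_status()
    outdir = os.path.join(dest, user, repo)
    os.makedirs(outdir, exist_ok=True)
    # Collect links to the hosted files (assumed markup).
    urls = re.findall(r'href="(https?://[^"]*/downloads/[^"]+)"', page.text)
    with open(os.path.join(outdir, "index.txt"), "w") as index:
        for url in urls:
            name = url.rsplit("/", 1)[-1]
            blob = requests.get(url)
            blob.raise_for_status()
            with open(os.path.join(outdir, name), "wb") as f:
                f.write(blob.content)
            # The per-file download counts live in the page HTML;
            # recording the URL list here is a placeholder for that
            # completist .txt.
            index.write(url + "\n")
```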
16:18:00 | <alard> | SketchCow: Could you have a look at alardland/github? |
16:36:00 | | closure perks up his ears hearing about plans to do something with github |
16:36:00 | <closure> | is this about archiving the git repos, or some of their other data? |
16:36:00 | <alard> | The downloads. |
16:37:00 | <closure> | hmm, not familiar with that |
16:37:00 | <alard> | https://github.com/blog/1302-goodbye-uploads |
16:37:00 | <closure> | aha, thanks |
16:38:00 | <alard> | We're making a list of repositories, so that could be used for other things in the future. |
16:38:00 | <closure> | so there's a guy who has been using their API to find all new repositories for a while.. I forget the link to his site |
16:40:00 | <SketchCow> | alard - Looks good. |
16:40:00 | <SketchCow> | I suspect this won't be a LOT of data |
16:41:00 | <alard> | You *hope* it's not a lot of data. |
16:42:00 | <closure> | for a lot of data, see sourceforge downloads :P |
16:42:00 | <SketchCow> | I don't actually (hope) |
16:42:00 | <SketchCow> | Because once again the Compass Has Swung and archive.org has tons of disk space. |
16:42:00 | <SketchCow> | I mean, we still should help raise funds because it helps |
16:43:00 | <SketchCow> | But 1 petabyte of free disk space right now |
16:43:00 | <SketchCow> | So yeah, let's do it. |
16:43:00 | <SketchCow> | I'll e-mail a hug to my github buddies |
16:46:00 | <closure> | ah, I see you already found githubarchive.org |
16:48:00 | <alard> | SketchCow: Want to say hi in the User-Agent header as well? |
16:52:00 | <SketchCow> | Sure. |
16:52:00 | <SketchCow> | "Archive Team Loves GitHub" |
16:55:00 | <alard> | https://github.com/ArchiveTeam/github-download-grab/commit/e3073ec5573a6d9b1e9508ad283168358019aae3 |
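For reference, the requests equivalent of saying hi in the User-Agent; the actual change is in the linked commit to the project's wget-lua invocation, so this snippet is purely illustrative.

```python
import requests

# Send the agreed greeting header on every fetch (illustrative; the
# real project sets this on its wget-lua command line).
r = requests.get("https://github.com/ArchiveTeam/mobileme-grab/downloads",
                 headers={"User-Agent": "Archive Team Loves GitHub"})
```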
17:07:00 | <alard> | Heh, the tracker might not like this: http://tracker.archiveteam.org/github/ |
17:08:00 | <closure> | have you already pulled in the api dump data? If not, I might try some massaging |
17:09:00 | <alard> | No, I haven't. We're well on our way with the API exploration, though: http://tracker.archiveteam.org:8125/ |
17:09:00 | <alard> | (I think the highest ID is in the 7,000,000 range.) |
17:12:00 | <closure> | I'm running the scraper for that, so if there's time to plow through the whole range, that's fine |
17:44:00 | <SketchCow> | What is our HQ url again? |
17:45:00 | <nitro2k01> | What? Headquarters? http://archiveteam.org/ ? |
17:49:00 | <SketchCow> | No, got it. |
17:49:00 | <SketchCow> | http://warriorhq.archiveteam.org/ |
17:49:00 | <nitro2k01> | Ah, that |
17:50:00 | <godane> | burning a bluray of gbtv/theblaze episodes |
17:50:00 | <godane> | the rest of november and election coverage is on this one |
18:43:00 | <Nemo_bis> | SketchCow: can I buy other 50 kg of magazines to send you? :D |
18:43:00 | <Nemo_bis> | "PC Professionale" 110-189 |
18:43:00 | <Nemo_bis> | shipping will cost about three times as much as buying them |
18:44:00 | <DFJustin> | I like how kg is our standard unit for magazines now |
18:45:00 | <Nemo_bis> | DFJustin: what other unit could I choose for transatlantic cooperation? :p |
18:53:00 | <Nemo_bis> | I don't remember if ias3upload.pl overwrites existing files with same name or not |
18:57:00 | <godane> | i uploaded august of 2011 episodes of x-play today |
19:11:00 | <SketchCow> | At current trends, github data will be about 200gb |
19:14:00 | <DFJustin> | *yawn* |
19:17:00 | <chronomex> | *slurp* |
19:31:00 | <chronomex> | alard: did we already finish the API grabbing? |
19:32:00 | <chronomex> | my discoverer died last night with requests.exceptions.ConnectionError: HTTPSConnectionPool(host=u'api.github.com', port=443): Max retries exceeded with url: /repositories?since=1295141 |
19:32:00 | <Deewiant> | I'm still running the github repo explorer, it seems to come up with some new tasks every couple of minutes |
19:33:00 | <Deewiant> | (I put my auth info in there so it can do 5000 instead of 60 per hour) |
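The discovery walk chronomex and Deewiant describe can be sketched like this: page through the /repositories endpoint (the one in the traceback above) with ?since=<last id seen>, authenticating so GitHub's 5,000 requests/hour credentialed limit applies instead of the anonymous 60. The function name is ours, not from the project script.

```python
# Sketch of the repository-discovery walk, assuming the public
# /repositories endpoint seen in the traceback above.
import requests

def discover(since=0, auth=None):  # pass auth=("user", "pass") for 5000/hr
    while True:
        r = requests.get("https://api.github.com/repositories",
                         params={"since": since}, auth=auth)
        r.raise_for_status()
        batch = r.json()
        if not batch:  # empty page means we've walked past the last repo
            break
        for repo in batch:
            yield repo["full_name"]
        since = batch[-1]["id"]  # resume from the highest id seen
```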
19:33:00 | <chronomex> | neato |
19:34:00 | <Deewiant> | (At first I accidentally put them on a tracker HTTP request, had to change the password then >_<) |
19:34:00 | <chronomex> | hah, woops |
19:34:00 | <chronomex> | probably nobody's looking at those ... except the NSA watches them in transit |
19:35:00 | <Deewiant> | Yep, I think it was an unencrypted request too |
19:35:00 | <chronomex> | you're fucked |
19:35:00 | <Deewiant> | Well, I managed to change the password without any trouble |
19:36:00 | <Deewiant> | Maybe somebody defaced all my repos in the interim ;-P |
19:45:00 | <chronomex> | you seem to be sucking the job queue dry |
19:45:00 | <chronomex> | good work |
19:46:00 | <Deewiant> | Where does it get jobs from? |
20:10:00 | <kennethre> | chronomex: sorry :) |
20:11:00 | <kennethre> | I'd recommend using something like celery |
20:23:00 | <chronomex> | erp, what? |
20:27:00 | <kennethre> | re: requests.exceptions.ConnectionError |
20:27:00 | <kennethre> | to spread them across different machines, handle exceptions, etc |
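Short of a full celery setup, a minimal stand-in for that advice is to catch the ConnectionError from the traceback above and back off before retrying. This helper is illustrative, not from the actual discovery script.

```python
# Minimal retry wrapper for the transient ConnectionError above;
# celery would handle distribution and retries more robustly.
import time
import requests

def get_with_retries(url, tries=5, **kwargs):
    for attempt in range(tries):
        try:
            return requests.get(url, **kwargs)
        except requests.exceptions.ConnectionError:
            if attempt == tries - 1:
                raise  # out of retries, propagate the error
            time.sleep(2 ** attempt)  # 1, 2, 4, 8 second backoff
```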
20:27:00 | <SketchCow> | 62 BBC R&D Descriptions left! |
20:27:00 | <SketchCow> | Poor github |
20:30:00 | <balrog_> | yeah, I'm getting no tasks. |
20:30:00 | <balrog_> | actually I am getting one once in a while |
20:33:00 | <alard> | http://zeppelin.xrtc.net/corp.xrtc.net/shilling.corp.xrtc.net/index.html |
20:42:00 | <godane> | SketchCow: Thanks for putting up x-play episodes in collection |
20:54:00 | <SketchCow> | No problem. |
20:54:00 | <SketchCow> | More soon |
21:04:00 | <godane> | i will do 2012 episodes in 2013 so i don't get this stuff darked |
21:05:00 | <godane> | when the network is dead there shouldn't be fear of nbc sending dmca notices i hope |
21:07:00 | <soultcer> | there are so many people fetching github repo lists that it is hard to actually get a task assigned |
21:24:00 | <sankin1> | the leaderboard is flying |
21:26:00 | <soultcer> | Whoa there's already a project to download |
21:28:00 | <alard> | Perhaps I should ask: what is an acceptable number of requests to send to GitHub? |
21:28:00 | <alard> | We're currently doing over 50 requests per second. |
21:31:00 | <soultcer> | As long as Github doesn't show elevated error response rates, keep it up :D |
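The 50 requests/second is the aggregate across all downloaders; a single client wanting to cap its own share could use a trivial limiter like the one below (illustrative only; the warrior scripts are not confirmed to throttle this way).

```python
# Simple per-client rate limiter; call wait() before each request.
# Usage: limiter = RateLimiter(5.0); limiter.wait()
import time

class RateLimiter:
    def __init__(self, per_second):
        self.interval = 1.0 / per_second
        self.next_ok = time.monotonic()

    def wait(self):
        now = time.monotonic()
        if now < self.next_ok:
            time.sleep(self.next_ok - now)  # stay under the cap
        self.next_ok = max(now, self.next_ok) + self.interval
```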
21:32:00 | <alard> | Apparently underscor has joined us. |
21:32:00 | <Deewiant> | The non-Warrioring cheater. |
21:33:00 | <soultcer> | Well dailybooth is kind of boring with its low download speed and timeouts |
21:33:00 | | underscor pads in drearily, rubbing sleep out of his eyes |
21:33:00 | <underscor> | what oh yes hi |
21:33:00 | <kennethre> | alard: to the api? |
21:33:00 | <alard> | No, to the /downloads page. |
21:33:00 | <kennethre> | i wouldn't worry about it |
21:34:00 | <kennethre> | unless you get 500s |
21:37:00 | <soultcer> | The actual downloads are from cloudfront and probably s3-backed |
21:37:00 | <kennethre> | yep |
21:38:00 | <alard> | (The precise thing to say would be: 50 r/s to the /downloads pages.) |
21:38:00 | <SketchCow> | Just for the record, godane - you are cutting it way close to the edge. |
21:39:00 | <SketchCow> | I realize a lot of these safe times and cooldown periods are fake and wishful thinking, but putting up stuff that is less than a year old is inviting the scorpion to reflexively sting even though it is "dead" |
21:39:00 | <SketchCow> | I'd be happier if we were downloading and putting up stuff from the 1980s, like you were doing with tv shows and older material. |
21:39:00 | <SketchCow> | Even the 90s |
21:39:00 | <SketchCow> | I mean, if you have a choice. |
21:44:00 | <SketchCow> | In other news, this test looks successful. http://archive.org/details/creativecomputingv11n11-tiffcbz |
21:46:00 | <underscor> | Boy, my browser really hates the tracker |
21:47:00 | <Nemo_bis> | underscor: isn't it cute to see the top downloaded items in http://archive.org/details/philosophicaltransactions a year later :) |
21:47:00 | <Deewiant> | Pause your scripts, it'll be much more palatable ;-) |
21:48:00 | <underscor> | Nemo_bis: Wow, I'd forgotten about that |
21:48:00 | <Nemo_bis> | :D |
21:48:00 | <underscor> | Damn, that is cool :D |
21:48:00 | <underscor> | Deewiant: <:B <:B |
21:51:00 | <Nemo_bis> | Experiments on the Refrangibility of the Invisible Rays of the Sun. By William Herschel, LL. D. F. R. S. 602 downloads |
21:58:00 | <SketchCow> | http://archive.org/details/creativecomputingv11n11-tiffcbz |
21:59:00 | <DFJustin> | same url is same |
22:00:00 | <Nemo_bis> | I love how ias3upload smartly retries |
22:01:00 | <Nemo_bis> | SketchCow: would you create a collection for these 106 magazine issues I uploaded? https://archive.org/search.php?query=subject%3A"Hacker+Journal" |
22:06:00 | <SketchCow> | If I do it right now, it'll explode. Let them derive and settle and I'll do it in 5 seconds. |
22:06:00 | <SketchCow> | They're still deriving |
22:08:00 | <godane> | looks like archive.org is having problems |
22:10:00 | <godane> | also everything is waiting to be archived |
22:11:00 | <Nemo_bis> | SketchCow: ah ok sorry, yes there are about 6000 items in the derive queue |
22:12:00 | <Nemo_bis> | Also, I miss ocrcount |
22:19:00 | <chronomex> | oh jesus I just loaded the github tracker |
22:19:00 | <chronomex> | I don't think I've ever seen a tracker go this fast |
22:19:00 | <chronomex> | zoooom |
22:25:00 | <SketchCow> | Is there a total repository count somewhere on github? |
22:25:00 | <SketchCow> | I'm looking for it. |
22:25:00 | <SketchCow> | I see a press release saying 3.7 million |
22:26:00 | <balrog_> | that's in Sep 13 |
22:28:00 | <SketchCow> | https://twitter.com/textfiles/status/279350174541819905 |
22:28:00 | <balrog_> | this is just downloading file listings, right? |
22:29:00 | <balrog_> | or is that part finished? |
22:29:00 | <balrog_> | SketchCow: also note that there are many private github repos |
22:29:00 | <balrog_> | since you can pay for private ones |
22:30:00 | <balrog_> | 3.7 million would include those |
22:30:00 | <chronomex> | I think that number includes gists as well |
22:30:00 | <balrog_> | doubt it, but maybe |
22:31:00 | <balrog_> | I liked github downloads because you could post binaries and hotlink them from elsewhere |
22:31:00 | <balrog_> | sucks that they're going away |
22:34:00 | <balrog_> | are you guys sure the downloads contain data? |
22:34:00 | <balrog_> | or is this just listings? |
22:35:00 | <DFJustin> | I saw one that was 7mb |
22:36:00 | <balrog_> | some should be 20-50 mb or more |
22:36:00 | <balrog_> | DFJustin: are all the lists retrieved? |
22:36:00 | <SketchCow> | First, realize what these are. |
22:36:00 | <balrog_> | so it's now downloading files, right? |
22:37:00 | <SketchCow> | These are NOT the code repositories. |
22:37:00 | <balrog_> | most of them will be under 1mb |
22:37:00 | <balrog_> | yes, I understand |
22:37:00 | <SketchCow> | Like github/boner-muncher is code |
22:37:00 | <balrog_> | however, some projects have posted fairly large files |
22:37:00 | <balrog_> | I've used this service myself for some of my code. |
22:37:00 | <SketchCow> | The /downloads are JUST the separate files. |
22:37:00 | <balrog_> | yes |
22:37:00 | <DFJustin> | just watched it for a couple seconds and some are dozens of mb so I think it's ok |
22:37:00 | <SketchCow> | Well, conclusively, we're finding the vast vast vast majority of the 3.7 million never used this feature |
22:37:00 | <SketchCow> | VAST majority. |
22:37:00 | <balrog_> | that is correct |
22:37:00 | <balrog_> | ahh, so the warrior lists those who didn't use it. |
22:37:00 | <DFJustin> | also that is cartoonishly fast |
22:37:00 | <SketchCow> | root@teamarchive-1:/1/ALARD/warrior/github# du -sh . |
22:37:00 | <SketchCow> | 55G . |
22:38:00 | <balrog_> | hopefully wget-lua compiles before this is done :P |
22:38:00 | <SketchCow> | root@teamarchive-1:/1/ALARD/warrior/github# find . -type f | wc -l |
22:38:00 | <SketchCow> | 18303 |
22:38:00 | <SketchCow> | Remember, that's including index.txt |
22:38:00 | <balrog_> | and index.txt is generated for all repos? |
22:38:00 | <SketchCow> | root@teamarchive-1:/1/ALARD/warrior/github# find . -name index.txt | wc -l |
22:38:00 | <SketchCow> | 2717 |
22:38:00 | <SketchCow> | See? Yes |
22:39:00 | <SketchCow> | Yes, it is. |
22:39:00 | <SketchCow> | Just to keep the download counts |
22:45:00 | <balrog_> | how do I set this up to work without the warrior? |
22:46:00 | <soultcer> | I assume same as all other warrior projects that use wget-lua |
22:46:00 | <balrog_> | just python ./pipeline.py? |
22:47:00 | <soultcer> | 1) Install python, python tornado (> v2), python tornadio, python argparse, openssl headers, lua headers |
22:47:00 | <soultcer> | 2) git clone github.com/archiveteam/seesaw-kit.git |
22:47:00 | <SketchCow> | Poor github, they just want to do the right thing. |
22:47:00 | <SketchCow> | OK, separate channel |
22:48:00 | <SketchCow> | #gothub |
22:48:00 | <soultcer> | 3) git clone github.com/archiveteam/github-download.git |
22:51:00 | <balrog_> | soultcer: yeah I have all that, I have wget-lua, just how do I start it? |
22:51:00 | <soultcer> | with run-pipeline, as usual |
22:52:00 | <SketchCow> | Please redirect people over to #gothub. |
22:52:00 | <SketchCow> | We're back to the Usual Crap again |
23:10:00 | <SketchCow> | alard: Please come to #gothub - possible bug |