00:02:00<corobo>running it on a couple machines; if you could add something like an optional argument to bind to an IP, I could run it off a few IPs :)
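(A minimal sketch of what such an option might look like client-side, assuming the Python requests stack these scripts tend to use; the SourceAddressAdapter class and the 192.0.2.10 address are illustrative, not part of any actual Archive Team script:)

    import requests
    from requests.adapters import HTTPAdapter

    class SourceAddressAdapter(HTTPAdapter):
        # Hypothetical helper: bind outgoing connections to one local IP.
        def __init__(self, source_ip, **kwargs):
            self.source_ip = source_ip
            super(SourceAddressAdapter, self).__init__(**kwargs)

        def init_poolmanager(self, connections, maxsize, block=False, **kwargs):
            # urllib3 takes source_address as a (host, port) pair; port 0 = any.
            kwargs['source_address'] = (self.source_ip, 0)
            super(SourceAddressAdapter, self).init_poolmanager(
                connections, maxsize, block=block, **kwargs)

    session = requests.Session()
    session.mount('http://', SourceAddressAdapter('192.0.2.10'))
    session.mount('https://', SourceAddressAdapter('192.0.2.10'))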
05:32:00<tuankiet>@alard Hello!
10:43:00<alard>tuankiet: Hi.
10:44:00<alard>kennethre: No help needed, yet, other than running one of those repository-discovery scripts, perhaps. Once we have a nice list we can look at downloading the downloads.
10:46:00<alard>tuankiet: You were after the Google scraper script for the Yahoo blogs, but I haven't had time to look at it yet (the version I have is really personalized, so it basically only works for me, at the moment).
11:56:00<tuankiet>@alard: Ok
11:57:00<tuankiet>I am running the Yahoo and Github script
12:34:00<alard>tuankiet: Very good.
12:35:00<alard>It's a pity that Dailybooth is so slow. We're working on too many projects.
13:23:00<Nemo_bis>At last, it looks like Wikia is generating a dump per minute instead of one every 5, since the 10th http://wikistats.wikia.com/c/ca/
13:48:00<SketchCow>OK, so.
13:48:00<SketchCow>I have to say.
13:48:00<SketchCow>When you checked the github-grab code into github, for us to turn around and download github
13:48:00<SketchCow>Oh man
13:48:00<SketchCow>I almost died
13:53:00<GLaDOS>So uuh, I heard you like github..
13:55:00<SketchCow>At this exact moment, archive.org has one petabyte of disk space
13:56:00<SketchCow>free
13:56:00<Nemo_bis>SketchCow: are you saying because you plan to reduce it vastly and very soon? :p
13:56:00<SketchCow>Yes
13:56:00<Nemo_bis>:)
13:56:00<SketchCow>I'd like to understand.... do we need more archiveteam warriors on the dailybooth project?
13:57:00<Nemo_bis>I also have to admit that it's not so obvious what one has to do to help the archiveteam
13:57:00<Nemo_bis>too many projects and we're too lazy to update the wiki
14:03:00<SketchCow>We're not too lazy.
14:03:00<SketchCow>The wiki's choked because of the spam. I will fix it.
14:04:00<Nemo_bis>Speaking of which, can you make me sysop
14:04:00<Nemo_bis>it's weird not to have the delete button on a wiki
14:04:00<Nemo_bis>and frustrating for me :)
14:18:00<alard>No, I don't think more warriors would help with dailybooth. It's dailybooth that's too slow.
14:19:00<alard>Perhaps we should consider giving the warriors something else to do (github!), since we have more than enough non-warriors to keep dailybooth busy.
14:20:00<alard>We're doing 12 / 7 / 8 / 14 / 16 dailybooth users per minute (and that includes 404's).
14:21:00<alard>Do we want the Github downloads in warc format?
14:36:00<SketchCow>I personally think no.
14:38:00<alard>You don't want to go for maximum inaccessibility?
14:40:00<alard>If not warc, then what? A .tar?
14:40:00<alard>(What to do with the /downloads HTML page?)
14:42:00<alard>We could also just rsync the files as-is. The url structure is tidy enough (user/repo/download).
14:43:00<alard>The downloads page has the download count, everything else exists in other forms: https://github.com/ArchiveTeam/mobileme-grab/downloads
15:18:00<SketchCow>I think in this case, we're rescuing a filesystem, not an experience
15:19:00<SketchCow>A .txt file accompanying the files indicating the download count, if you're being completist.
15:19:00<SketchCow>And personally, I think that assessment could be in a single .txt file
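(A sketch of what the rescued filesystem could look like on disk, following the tidy user/repo/download structure alard mentions above; the user, repo, and file names are illustrative:)

    exampleuser/
        examplerepo/
            index.txt            <- download counts scraped from the /downloads page
            release-1.0.tar.gz
            release-1.1.zip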
16:18:00<alard>SketchCow: Could you have a look at alardland/github?
16:36:00closure perks up his ears hearing about plans to do something with github
16:36:00<closure>is this about archiving the git repos, or some of their other data?
16:36:00<alard>The downloads.
16:37:00<closure>hmm, not familiar with that
16:37:00<alard>https://github.com/blog/1302-goodbye-uploads
16:37:00<closure>aha, thanks
16:38:00<alard>We're making a list of repositories, so that could be used for other things in the future.
16:38:00<closure>so there's a guy who has been using their API to find all new repositories for a while.. I forget the link to his site
16:40:00<SketchCow>alard - Looks good.
16:40:00<SketchCow>I suspect this won't be a LOT of data
16:41:00<alard>You *hope* it's not a lot of data.
16:42:00<closure>for a lot of data, see sourceforge downloads :P
16:42:00<SketchCow>I don't actually (hope)
16:42:00<SketchCow>Because once again the Compass Has Swung and archive.org has tons of disk space.
16:42:00<SketchCow>I mean, we still should help raise funds because it helps
16:43:00<SketchCow>But 1 petabyte of free disk space right now
16:43:00<SketchCow>So yeah, let's do it.
16:43:00<SketchCow>I'll e-mail a hug to my github buddies
16:46:00<closure>ah, I see you already found githubarchive.org
16:48:00<alard>SketchCow: Want to say hi in the User-Agent header as well?
16:52:00<SketchCow>Sure.
16:52:00<SketchCow>"Archive Team Loves GitHub"
16:55:00<alard>https://github.com/ArchiveTeam/github-download-grab/commit/e3073ec5573a6d9b1e9508ad283168358019aae3
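(For reference, a friendly User-Agent is a one-liner in a requests-based grab script; a sketch, using the downloads page linked earlier as the example URL:)

    import requests

    headers = {'User-Agent': 'Archive Team Loves GitHub'}
    r = requests.get('https://github.com/ArchiveTeam/mobileme-grab/downloads',
                     headers=headers)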
17:07:00<alard>Heh, the tracker might not like this: http://tracker.archiveteam.org/github/
17:08:00<closure>have you already pulled in the api dump data? If not, I might try some massaging
17:09:00<alard>No, I haven't. We're well on our way with the API exploration, though: http://tracker.archiveteam.org:8125/
17:09:00<alard>(I think the highest ID is in the 7,000,000 range.)
17:12:00<closure>I'm running the scraper for that, so if there's time to plow through the whole range, that's fine
17:44:00<SketchCow>What is our HQ url again?
17:45:00<nitro2k01>What? Headquarters? http://archiveteam.org/ ?
17:49:00<SketchCow>No, got it.
17:49:00<SketchCow>http://warriorhq.archiveteam.org/
17:49:00<nitro2k01>Ah, that
17:50:00<godane>burning a bluray of gbtv/theblaze episodes
17:50:00<godane>the rest of november and election coverage is on this one
18:43:00<Nemo_bis>SketchCow: can I buy another 50 kg of magazines to send you? :D
18:43:00<Nemo_bis>"PC Professionale" 110-189
18:43:00<Nemo_bis>shipping will cost about three times as much as buying
18:44:00<DFJustin>I like how kg is our standard unit for magazines now
18:45:00<Nemo_bis>DFJustin: what other unit could I choose for transatlantic cooperation? :p
18:53:00<Nemo_bis>I don't remember if ias3upload.pl overwrites existing files with same name or not
18:57:00<godane>i uploaded august of 2011 episodes of x-play today
19:11:00<SketchCow>At current trends, github data will be about 200gb
19:14:00<DFJustin>*yawn*
19:17:00<chronomex>*slurp*
19:31:00<chronomex>alard: did we already finish the API grabbing?
19:32:00<chronomex>my discoverer died last night with requests.exceptions.ConnectionError: HTTPSConnectionPool(host=u'api.github.com', port=443): Max retries exceeded with url: /repositories?since=1295141
19:32:00<Deewiant>I'm still running the github repo explorer, it seems to come up with some new tasks every couple of minutes
19:33:00<Deewiant>(I put my auth info in there so it can do 5000 instead of 60 per hour)
19:33:00<chronomex>neato
19:34:00<Deewiant>(At first I accidentally put them on a tracker HTTP request, had to change the password then >_<)
19:34:00<chronomex>hah, woops
19:34:00<chronomex>probably nobody's looking at those ... except the NSA watches them in transit
19:35:00<Deewiant>Yep, I think it was an unencrypted request too
19:35:00<chronomex>you're fucked
19:35:00<Deewiant>Well, I managed to change the password without any trouble
19:36:00<Deewiant>Maybe somebody defaced all my repos in the interim ;-P
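(A sketch of the authenticated discovery request against the GitHub v3 API of the time; the credentials are placeholders, and 1295141 is simply the resume point from chronomex's traceback above:)

    import requests

    AUTH = ('your-github-user', 'your-password-or-token')  # placeholder
    since = 1295141  # last repository ID seen

    resp = requests.get('https://api.github.com/repositories',
                        params={'since': since}, auth=AUTH)
    resp.raise_for_status()
    repos = resp.json()
    # Authenticated clients get 5000 requests/hour instead of 60:
    print(resp.headers.get('X-RateLimit-Remaining'))
    # The highest 'id' in this page becomes the next 'since' value.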
19:45:00<chronomex>you seem to be sucking the job queue dry
19:45:00<chronomex>good work
19:46:00<Deewiant>Where does it get jobs from?
20:10:00<kennethre>chronomex: sorry :)
20:11:00<kennethre>I'd recommend using something like celery
20:23:00<chronomex>erp, what?
20:27:00<kennethre>re: requests.exceptions.ConnectionError
20:27:00<kennethre>to spread them across different machines, handle exceptions, etc
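(Short of a full celery deployment, even a minimal retry wrapper would absorb transient errors like the ConnectionError above; a sketch, not the actual discoverer code:)

    import time
    import requests

    def get_with_retries(url, tries=5, backoff=2, **kwargs):
        # Retry transient connection errors with exponential backoff.
        for attempt in range(tries):
            try:
                return requests.get(url, timeout=60, **kwargs)
            except requests.exceptions.ConnectionError:
                if attempt == tries - 1:
                    raise
                time.sleep(backoff ** attempt)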
20:27:00<SketchCow>62 BBC R&D Descriptions left!
20:27:00<SketchCow>Poor github
20:30:00<balrog_>yeah, I'm getting no tasks.
20:30:00<balrog_>actually I am getting one in a while
20:33:00<alard>http://zeppelin.xrtc.net/corp.xrtc.net/shilling.corp.xrtc.net/index.html
20:42:00<godane>SketchCow: Thanks for putting up x-play episodes in the collection
20:54:00<SketchCow>No problem.
20:54:00<SketchCow>More soon
21:04:00<godane>i will do 2012 episodes in 2013 so i don't get this stuff darked
21:05:00<godane>when the network is dead there shouldn't be fear of nbc sending dmca notices, i hope
21:07:00<soultcer>there are so many people fetching github repo lists that it is hard to actually get a task assigned
21:24:00<sankin1>the leaderboard is flying
21:26:00<soultcer>Whoa there's already a project to download
21:28:00<alard>Perhaps I should ask: what is an acceptable number of requests to send to GitHub?
21:28:00<alard>We're currently doing over 50 requests per second.
21:31:00<soultcer>As long as Github doesn't show elevated error response rates, keep it up :D
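(For scale: at 50 requests/second, a sweep of all ~7,000,000 repository IDs mentioned earlier works out to 7,000,000 / 50 = 140,000 seconds, roughly 39 hours.)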
21:32:00<alard>Apparently underscor has joined us.
21:32:00<Deewiant>The non-Warrioring cheater.
21:33:00<soultcer>Well dailybooth is kind of boring with its low download speed and timeouts
21:33:00underscor pads in drearily, rubbing sleep out of his eyes
21:33:00<underscor>what oh yes hi
21:33:00<kennethre>alard: to the api?
21:33:00<alard>No, to the /downloads page.
21:33:00<kennethre>i wouldn't worry about it
21:34:00<kennethre>unless you get 500s
21:37:00<soultcer>The actual downloads are from cloudfront and probably s3-backed
21:37:00<kennethre>yep
21:38:00<alard>(The precise thing to say would be: 50 r/s to the /downloads pages.)
21:38:00<SketchCow>Just for the record, godane - you are cutting it way close to the edge.
21:39:00<SketchCow>I realize a lot of these safe times and cooldown periods are fake and wishful thinking, but putting up stuff that is less than a year old is inviting the scorpion to reflexively sting even though it is "dead"
21:39:00<SketchCow>I'd be happier if we were downloading and putting up stuff from the 1980s, like you were doing with tv shows and older material.
21:39:00<SketchCow>Even the 90s
21:39:00<SketchCow>I mean, if you have a choice.
21:44:00<SketchCow>In other news, this test looks successful. http://archive.org/details/creativecomputingv11n11-tiffcbz
21:46:00<underscor>Boy, my browser really hates the tracker
21:47:00<Nemo_bis>underscor: isn't it cute to see the top downloaded items in http://archive.org/details/philosophicaltransactions a year later :)
21:47:00<Deewiant>Pause your scripts, it'll be much more palatable ;-)
21:48:00<underscor>Nemo_bis: Wow, I'd forgotten about that
21:48:00<Nemo_bis>:D
21:48:00<underscor>Damn, that is cool :D
21:48:00<underscor>Deewiant: <:B <:B
21:51:00<Nemo_bis>Experiments on the Refrangibility of the Invisible Rays of the Sun. By William Herschel, LL. D. F. R. S. 602 downloads
21:58:00<SketchCow>http://archive.org/details/creativecomputingv11n11-tiffcbz
21:59:00<DFJustin>same url is same
22:00:00<Nemo_bis>I love how ias3upload smartly retries
22:01:00<Nemo_bis>SketchCow: would you create a collection for these 106 magazine issues I uploaded? https://archive.org/search.php?query=subject%3A"Hacker+Journal"
22:06:00<SketchCow>If I do it right now, it'll explode. Let them derive and settle and I'll do it in 5 seconds.
22:06:00<SketchCow>They're still deriving
22:08:00<godane>looks like archive.org is having problems
22:10:00<godane>also everything is waiting to be archived
22:11:00<Nemo_bis>SketchCow: ah ok sorry, yes there are about 6000 items in the derive queue
22:12:00<Nemo_bis>Also, I miss ocrcount
22:19:00<chronomex>oh jesus I just loaded the github tracker
22:19:00<chronomex>I don't think I've ever seen a tracker go this fast
22:19:00<chronomex>zoooom
22:25:00<SketchCow>Is there a total repository count somewhere on github?
22:25:00<SketchCow>I'm looking for it.
22:25:00<SketchCow>I see press release saying 3.7 million
22:26:00<balrog_>that's in Sep 13
22:28:00<SketchCow>https://twitter.com/textfiles/status/279350174541819905
22:28:00<balrog_>this is just downloading file listings, right?
22:29:00<balrog_>or is that part finished?
22:29:00<balrog_>SketchCow: also note that there are many private github repos
22:29:00<balrog_>since you can pay for private ones
22:30:00<balrog_>3.7 million would include those
22:30:00<chronomex>I think that number includes gists as well
22:30:00<balrog_>doubt it, but maybe
22:31:00<balrog_>I liked github downloads because you could post binaries and hotlink them from elsewhere
22:31:00<balrog_>sucks that they're going away
22:34:00<balrog_>are you guys sure the downloads contain data?
22:34:00<balrog_>or is this just listings?
22:35:00<DFJustin>I saw one that was 7mb
22:36:00<balrog_>some should be 20-50 or more
22:36:00<balrog_>DFJustin: are all the lists retrieved?
22:36:00<SketchCow>First, realize what these are.
22:36:00<balrog_>so it's now downloading files, right?
22:37:00<SketchCow>These are NOT the code repositories.
22:37:00<balrog_>most of them will be under 1mb
22:37:00<balrog_>yes, I understand
22:37:00<SketchCow>Like github/boner-muncher is code
22:37:00<balrog_>however, some projects have posted fairly large files
22:37:00<balrog_>I've used this service myself for some of my code.
22:37:00<SketchCow>The /downloads are JUST the separate files.
22:37:00<balrog_>yes
22:37:00<DFJustin>just watched it for a couple seconds and some are dozens of mb so I think it's ok
22:37:00<SketchCow>Well, conclusively, we're finding the vast vast vast majority of the 3.7 million never used this feature
22:37:00<SketchCow>VAST majority.
22:37:00<balrog_>that is correct
22:37:00<balrog_>ahh, so the warrior lists those who didn't use it.
22:37:00<DFJustin>also that is cartoonishly fast
22:37:00<SketchCow>root@teamarchive-1:/1/ALARD/warrior/github# du -sh .
22:37:00<SketchCow>55G .
22:38:00<balrog_>hopefully wget-lua compiles before this is done :P
22:38:00<SketchCow>root@teamarchive-1:/1/ALARD/warrior/github# find . -type f | wc -l
22:38:00<SketchCow>18303
22:38:00<SketchCow>Remember, that's including index.txt
22:38:00<balrog_>and index.txt is generated for all repos?
22:38:00<SketchCow>root@teamarchive-1:/1/ALARD/warrior/github# find . -name index.txt | wc -l
22:38:00<SketchCow>2717
22:38:00<SketchCow>See? Yes
22:39:00<SketchCow>Yes, it is.
22:39:00<SketchCow>Just to keep the download counts
22:45:00<balrog_>how do I set this up to run without the warrior?
22:46:00<soultcer>I assume same as all other warrior projects that use wget-lua
22:46:00<balrog_>just python ./pipeline.py?
22:47:00<soultcer>1) Install python, python tornado (> v2), python tornadio, python argparse, openssl headers, lua headers
22:47:00<soultcer>2) git clone github.com/archiveteam/seesaw-kit.git
22:47:00<SketchCow>Poor github, they just want to do the right thing.
22:47:00<SketchCow>OK, separate channel
22:48:00<SketchCow>#gothub
22:48:00<soultcer>3) git clone github.com/archiveteam/github-download.git
22:51:00<balrog_>soultcer: yeah, I have all that, and I have wget-lua; just how do I start it?
22:51:00<soultcer>with run-pipeline, as usual
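(Putting soultcer's steps together, the usual standalone invocation looks roughly like this, assuming seesaw-kit is installed and its run-pipeline script is on the PATH; YOURNICK is whatever name you want on the leaderboard, and --concurrent is optional:)

    git clone https://github.com/ArchiveTeam/github-download.git
    cd github-download
    run-pipeline pipeline.py YOURNICK --concurrent 2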
22:52:00<SketchCow>Please redirect people over to #gothub.
22:52:00<SketchCow>We're back to the Usual Crap again
23:10:00<SketchCow>alard: Please come to #gothub - possible bug