04:20:00 | <xk_id> | Hmm... I suppose doing a reduce with a set of pages from the same server is not very polite. |
06:45:00 | <Nemo_bis> | norbert79: http://sourceforge.net/projects/xowa |
08:06:00 | <norbert79> | Nemo_bis: Thanks, while I don't quite understand how this might help me, but it's a good start! Thanks again :) |
08:16:00 | <Nemo_bis> | norbert79: looks like it parses the wikitext on demand |
08:24:00 | <norbert79> | Nemo_bis: Yes, I realized what you are trying to tell me with it. Might be useful indeed |
10:54:00 | <xk_id> | Somebody here recommended that I wait 1s between two requests to the same host. Shall I measure the delay between the requests, or between the reply and the next request? |
10:55:00 | <turnkit> | Lord_Nigh if you are dumping mixed mode, try imgburn as you can set it to generate a .cue, .ccd, and .mds file -- I think doing so will make it more easily re-burnable. |
10:56:00 | <Lord_Nigh> | does imgburn know how to deal with offsets on drives? the particular cd drive i have has a LARGE offset which actually ends up an entire sector away from where the audio data starts; based on gibberish that appears in the 'right' place it's likely a drive firmware off-by-one bug in the sector counting |
10:57:00 | <Lord_Nigh> | seems to affect the entire line of usb dvd drives made by that manufacturer |
10:59:00 | <turnkit> | (scratches head) |
10:59:00 | <Lord_Nigh> | offsets meaning when digitally reading the audio areas |
11:00:00 | <Lord_Nigh> | theres a block of gibberish which appears before the audio starts |
11:00:00 | <Lord_Nigh> | and that has to be cut out |
11:01:00 | <turnkit> | http://forum.imgburn.com/index.php?showtopic=5974 |
11:06:00 | <turnkit> | I guess the answer is "no" w/ imgburn... the suggestion here for 'exact' duping (?) is a two-pass burn... http://www.hydrogenaudio.org/forums/index.php?showtopic=31989 |
11:06:00 | <turnkit> | But I am not sure... do you think it's necessary... will the difference in ripping make any difference -- i.e. would anyone be aware of the difference? I do very much like the idea of a perfect clone if possible, though. |
11:10:00 | <alard> | xk_id: I'd choose the simplest solution. (send request -> process response -> sleep 1 -> repeat, perhaps?) |
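A minimal Python sketch of the loop alard describes; the process() step and the URL list are placeholders, not anything from the log:

    import time
    import urllib.request

    def process(body):
        pass  # placeholder: parse/store the page body

    def fetch_politely(urls, delay=1.0):
        # send request -> process response -> sleep -> repeat,
        # the simple loop suggested above
        for url in urls:
            with urllib.request.urlopen(url) as resp:
                body = resp.read()
            process(body)
            time.sleep(delay)  # wait before hitting the same host again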
11:10:00 | <turnkit> | Linux tool for multisession CD-EXTRA discs... http://www.phong.org/extricate/ ? |
11:11:00 | <xk_id> | alard: simplest maybe, but not most convenient :P |
11:11:00 | <xk_id> | alard: need to speed things up a bit here... |
11:11:00 | <xk_id> | alard: so I'm capturing when each request is being made and delay based on that |
11:11:00 | <alard> | In that case, why wait? |
11:11:00 | <ersi> | xk_id: Depending on how nice you want to be - the longer the wait the better - since it was like, 4M requests you needed to do? I usually opt for "race to the finish", but not with that amount. So I guess 1s, if you're okay with that |
11:11:00 | <xk_id> | alard: cause, morals |
11:12:00 | <ersi> | And it was like, a total of 4M~+ requests to be done, right? |
11:12:00 | <xk_id> | ersi: I found that in their robots.txt they allow a spider to wait 0.5s between requests. I'm basing mine on that. On the other hand, they ask a different spider to wait 4s |
11:12:00 | | xk_id nods |
11:12:00 | <xk_id> | maximum 4M. between 3 and 4M |
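If useful, a per-agent Crawl-delay like the one xk_id found can be read with Python's standard urllib.robotparser; a sketch (the robots.txt URL and the "SomeSpider" agent name are placeholders):

    from urllib import robotparser

    rp = robotparser.RobotFileParser("http://example.com/robots.txt")
    rp.read()
    # crawl_delay() returns the Crawl-delay value for a given user agent,
    # or None if the file doesn't set one (Python 3.6+)
    print(rp.crawl_delay("*"))
    print(rp.crawl_delay("SomeSpider"))  # hypothetical agent name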
11:12:00 | <alard> | Is it a small site? Or one with more than enough capacity? |
11:13:00 | <xk_id> | It's an online social network. |
11:13:00 | <xk_id> | http://gather.com |
11:13:00 | <ersi> | is freshness important? (Like, if it's susceptible to.. falling over and dying) If not, 1-2s sounds pretty fair |
11:14:00 | <xk_id> | I need to finish quickly. less than a week. |
11:14:00 | <ersi> | ouch |
11:14:00 | <ersi> | with 1s wait, it'll take 46 days |
11:14:00 | <xk_id> | I'm doing it in a distributed way |
11:15:00 | <ersi> | (46 days basing on single thread - wait 1s between request) |
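As a quick sanity check on ersi's estimate (taking the 4M figure from above):

    4,000,000 requests × 1 s/request = 4,000,000 s ≈ 1,111 hours ≈ 46.3 days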
11:15:00 | | xk_id nods |
11:15:00 | <alard> | But, erm, if you're running a number of threads, why wait at all? |
11:15:00 | <xk_id> | good question... I suppose one reason would be to avoid being identified as an attacker |
11:15:00 | <alard> | For the moral dimension, it's probably the number of requests that you send to the site that counts, not the number of requests per thread. |
11:16:00 | <alard> | Are you using a disposable IP address? |
11:16:00 | <xk_id> | EC2 |
11:16:00 | <alard> | You could just try without a delay and see what happens. |
11:16:00 | <ersi> | And keep a watch, for when the banhammer falls and switch over to some new ones :) |
11:16:00 | <alard> | Switch to more threads with a delay if they block you. |
11:17:00 | <ersi> | Maybe randomise your useragent a little as well |
11:17:00 | <xk_id> | I thought of randomising the user agent :D |
11:17:00 | <ersi> | goodie |
11:17:00 | <alard> | So you've got rid of the morals? :) |
11:17:00 | | ersi thanks chrome for updating all the freakin' time - plenty of useragent variants to go around |
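A sketch of the user-agent rotation being discussed; the strings below are illustrative Chrome variants, not taken from the log:

    import random

    # a small pool of plausible browser user agents; rotate per request
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11",
    ]

    def random_headers():
        return {"User-Agent": random.choice(USER_AGENTS)}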
11:18:00 | <xk_id> | I'm not yet decided how to go about this. |
11:18:00 | <xk_id> | There are several factors; morals is one of them. Another is the fact that it's part of a dissertation project, so I actually need to have morals. On the other hand, the many scholarly articles concerned with crawling that I've reviewed seem very careless about this. |
11:20:00 | <xk_id> | but I also cannot go completely mental about this |
11:20:00 | <xk_id> | and finish my crawl in a day :P |
11:22:00 | <xk_id> | so basically: a) be polite. b) not be identified as an attacker. c) finish quickly. |
11:24:00 | <xk_id> | from the pov of the website, it's better to have 5 machines each delaying their requests (individually) than not delaying their requests, right? |
11:26:00 | <alard> | Delaying is better than not delaying? |
11:26:00 | <xk_id> | well, certainly, if I only had one machine. |
11:27:00 | <ersi> | The "Be polite"-coin has two sides. Side A: "Finish as fast as possible" and Side B: "Load as little as possible, over a long time" |
11:27:00 | <alard> | I'm not sure that 5 machines is better than 3 if the total requests/second is equal. |
11:28:00 | <alard> | The site doesn't seem very fast, by the way. At least the groups pages take a long time to come up. (But that could be my location.) |
11:28:00 | <ersi> | maybe only US hosting, seems slow for me as well |
11:28:00 | <ersi> | and maybe crummy infra |
11:29:00 | <xk_id> | it's a bit slow for me too |
11:30:00 | <xk_id> | so is it pointless to delay between requests if I'm using more than one machine? |
11:30:00 | <xk_id> | probably not, right? |
11:31:00 | <alard> | It may be useful to avoid detection, but otherwise I think it's the number of requests that count. |
11:32:00 | <xk_id> | detection-wise, yes. I was curious politeness-wise.. |
11:32:00 | <xk_id> | hmm |
11:33:00 | <alard> | Politeness-wise I'd say it's the difference between the politeness of a DOS and a DDOS. |
11:33:00 | <xk_id> | hah |
11:34:00 | <alard> | Given that it's slow already, it's probably good to keep an eye on the site while you're crawling. |
11:34:00 | <xk_id> | do you think I could actually harm it? |
11:34:00 | <alard> | Yes. |
11:35:00 | <xk_id> | what would follow from that? |
11:35:00 | <alard> | You're asking a lot of difficult questions. |
11:35:00 | <xk_id> | hmm |
11:36:00 | <alard> | It's also hard on their cache, since you're asking for a different page each time. |
11:36:00 | <xk_id> | I think I'll do some delays across the cluster as well. |
11:37:00 | <alard> | Can you pause and resume easily? |
11:37:00 | <xk_id> | and perhaps I should do it during US night-time; I think most of their userbase is from there. |
11:37:00 | <xk_id> | yes, I know the job queue in advance. |
11:37:00 | <xk_id> | and I will work with that |
11:38:00 | <alard> | Start slow and then increase your speed if you find that you can? |
11:39:00 | <xk_id> | ok |
11:44:00 | <xk_id> | You said that: "I'm not sure that 5 machines is better than 3 if the total requests/second is equal.". However, I reckon that if I implement delays in the code for each worker, I may achieve a more polite crawl than otherwise. Am I wrong? |
11:45:00 | <alard> | No, I think you're right. 5 workers with delays is more polite than 5 workers without delays. |
11:45:00 | <xk_id> | This is an interesting Maths problem. |
11:45:00 | <xk_id> | even if the delays don't take into account the operations of the other workers? |
11:46:00 | <alard> | Of course: adding a delay means each worker is sending fewer requests per second. |
11:46:00 | <xk_id> | but does this decrease the requests per second of the entire cluster? :) |
11:47:00 | <xk_id> | hmm |
11:47:00 | <alard> | Yes. So you can use a smaller cluster to get the same result. |
11:47:00 | <alard> | number of workers * (1 / (request length + delay)) = total requests per second |
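A worked instance of alard's formula, with illustrative numbers (the 0.5 s request time is an assumption, not measured):

    def cluster_rate(workers, request_time, delay):
        # alard's formula: workers * (1 / (request length + delay))
        return workers * (1.0 / (request_time + delay))

    print(cluster_rate(5, 0.5, 1.0))  # ~3.33 requests/second across the cluster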
11:48:00 | <xk_id> | but what matters here is also the rate between two consecutive requests, regardless of which machine inside the cluster they come from. |
11:48:00 | <xk_id> | *rate/time |
11:48:00 | <alard> | Why does that matter? |
11:49:00 | <xk_id> | because if I do 5 requests in 5 seconds, all five in the first second, that is more stressful than if I do one request per second |
11:49:00 | <xk_id> | no? |
11:50:00 | <alard> | That is true, but it's unlikely that your workers remain synchronized. |
11:50:00 | <alard> | (Unless you're implementing something difficult to ensure that they are.) |
11:50:00 | <xk_id> | will adding delays in the code of each worker improve this, or will it remain practically neutral? |
11:52:00 | <alard> | I think that if each worker sends a request, waits for the response, then sends the next request, the workers will get out of sync pretty soon. |
11:52:00 | <xk_id> | So, it is not possible to say in advance whether delays in the code will improve things with respect to this issue, if I don't synchronise the workers. |
11:53:00 | <alard> | I don't think delays will help. |
11:53:00 | <xk_id> | yes. I think you're right |
11:53:00 | <alard> | They'll only increase your cost: you need a larger cluster, because workers are sleeping part of the time. |
11:54:00 | <xk_id> | let's just hope my workers won't randomly synchronise haha |
11:54:00 | <xk_id> | :P |
11:55:00 | <xk_id> | but, interesting situation. |
11:56:00 | <xk_id> | alard: thank you |
12:00:00 | <alard> | xk_id: Good luck. (They will synchronize if the site synchronizes its responses, of course. :) |
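For what it's worth, a common trick to keep workers from locking into step is to randomise the delay a little; a minimal sketch, with an assumed jitter range:

    import random
    import time

    def polite_sleep(base_delay=1.0, jitter=0.5):
        # sleep the base delay plus up to `jitter` extra seconds, so workers
        # that happen to start in step drift apart instead of staying in sync
        time.sleep(base_delay + random.uniform(0, jitter))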
12:22:00 | <Smiley> | http://wikemacs.org/wiki/Main_Page - going away shortly |
12:22:00 | <ersi> | mediawikigrab that shizzle |
12:22:00 | <Smiley> | yeah I'm looking how now. |
12:23:00 | <Smiley> | errrr |
12:23:00 | <Smiley> | yeah how ? :S |
12:23:00 | | Smiley needs to go to lunch :< |
12:23:00 | <Smiley> | http://code.google.com/p/wikiteam/ |
12:25:00 | <Smiley> | Is that still up to date? |
12:26:00 | <Smiley> | it doesn't say anything about warc :/ |
12:29:00 | <ersi> | It's not doing WARC at all |
12:29:00 | <ersi> | If I'm not mistaken, emijrp hacked it together |
12:31:00 | <Smiley> | hmmm there is no index.php :S |
12:32:00 | <Smiley> | ./dumpgenerator.py --index=http://wikemacs.org/index.php --xml --images --delay=3 |
12:32:00 | <Smiley> | Checking index.php... http://wikemacs.org/index.php |
12:32:00 | <Smiley> | Error in index.php, please, provide a correct path to index.php |
12:32:00 | <Smiley> | :< |
12:32:00 | <Smiley> | tried with /wiki/main_page too |
12:33:00 | <alard> | http://wikemacs.org/w/index.php ? |
12:37:00 | <ersi> | You need to find the api.php I think. I know the infoz is on the wikiteam page |
12:47:00 | <Smiley> | http://code.google.com/p/wikiteam/wiki/NewTutorial#I_have_no_shell_access_to_server |
13:04:00 | <Nemo_bis> | yes, python dumpgenerator.py --api=http://wikemacs.org/w/api.php --xml --images |
13:06:00 | <Smiley> | woah :/ |
13:06:00 | <Smiley> | what's with the /w/ directory then? |
13:08:00 | <Smiley> | yey running now :) |
13:24:00 | <GLaDOS> | Smiley: they use it if they're using rewrite rules. |
13:33:00 | <Smiley> | ah ok |
13:45:00 | <Smiley> | ok so I have the dump.... heh |
13:45:00 | <Smiley> | What's next ¬_¬ |
13:53:00 | <db48x> | Smiley: keep it safe |
13:59:00 | <Smiley> | :) |
13:59:00 | <Smiley> | should I tar it or something and upload it to the archive? |
14:00:00 | <db48x> | that'd be one way to keep it safe |
14:02:00 | | Smiley ponders what to name it |
14:02:00 | <Smiley> | wikemacsorg.... |
14:03:00 | <ersi> | domaintld-yearmonthdate.something |
14:03:00 | <Smiley> | wikemacsorg29012013.tgz |
14:03:00 | <Smiley> | awww close :D |
14:06:00 | <Smiley> | And upload it to the archive under the same name? |
14:12:00 | <Smiley> | going up now |
14:25:00 | <Nemo_bis> | Smiley: there's also an uploader.py |
14:25:00 | <Smiley> | doh! :D |
15:50:00 | <schbiridi> | German video site http://de.sevenload.com/ will delete all user-"generated" videos on 28.02.2013 |
15:50:00 | <Smiley> | :< |
15:50:00 | <Smiley> | "We are sorry but sevenload doesn't offer its service in your country." |
15:50:00 | <Smiley> | herp. |
15:55:00 | <xk_id> | what follows if I crash the webserver I'm crawling? |
15:56:00 | <alard> | You won't be able to get more data from it. |
15:57:00 | <xk_id> | ever? |
15:57:00 | <xk_id> | hmm. |
15:57:00 | <alard> | No, just as long as it's down. |
15:58:00 | <xk_id> | oh, but surely it will resume after a short while. |
15:58:00 | <alard> | And you're more likely to get blocked and cause frowns. |
15:58:00 | <xk_id> | will there be serious frowns? |
15:59:00 | <alard> | Heh. Who knows? Look, if I were you I would just crawl as fast as I could without causing any visible trouble. Start slow -- with one thread, for example -- then add more if it goes well and you want to go a little faster. |
16:00:00 | <xk_id> | good idea |
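A rough sketch of that start-slow ramp-up in Python; the worker loop, health check, and settle interval are all hypothetical placeholders:

    import threading
    import time

    def crawl_worker(job_queue):
        pass  # placeholder: the per-thread fetch loop sketched earlier

    def site_looks_healthy():
        return True  # hypothetical check, e.g. on recent error rates or response times

    def ramp_up(job_queue, max_threads=5, settle=60):
        # start one worker, then add another every `settle` seconds
        # as long as the site keeps responding normally
        threads = []
        for _ in range(max_threads):
            t = threading.Thread(target=crawl_worker, args=(job_queue,), daemon=True)
            t.start()
            threads.append(t)
            time.sleep(settle)
            if not site_looks_healthy():
                break
        return threads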
16:49:00 | <alard> | SketchCow: The Xanga trial has run out of usernames. Current estimate is still 35TB for everything. Do you want to continue? |
16:51:00 | <SketchCow> | Continue like download it? |
16:51:00 | <SketchCow> | Not download it. |
16:51:00 | <SketchCow> | But a mapping of all the people is something we should upload to archive.org |
16:52:00 | <SketchCow> | And we should have their crawlers put it on the roster |
16:55:00 | <balrog_> | just exactly how big was mobileme? |
16:57:00 | <alard> | mobileme was 200TB. |
16:58:00 | <alard> | The archive.org crawlers probably won't download the audio and video, by the way. |
17:00:00 | <alard> | The list of users is here, http://archive.org/details/archiveteam-xanga-userlist-20130142 (not sure if I had linked that before) |
17:01:00 | <balrog_> | how much has the trial downloaded? |
17:02:00 | <alard> | 81GB. Part of that was without logging in and with earlier versions of the script. |
17:02:00 | <DFJustin> | "one file per user" or one line per user |
17:03:00 | <alard> | Hmm. As you may have noticed from the item name, this is the dyslexia edition. |
19:11:00 | <chronomex> | lol |
19:25:00 | <godane> | i'm uploading a 35min interview with the guy that made Mario |
19:27:00 | <godane> | filename: E3Interview_Miyamoto_G4700_flv.flv |
19:35:00 | <omf__> | godane, do you know what year that video came from? |
19:36:00 | <Smiley> | hmmm |
19:36:00 | <Smiley> | whats in it? |
19:36:00 | | Smiley bets 2007 |
19:41:00 | <DFJustin> | if you like that this may be worth a watch https://archive.org/details/history-of-zelda-courtesy-of-zentendo |
19:52:00 | <godane> | omf__: i think 2005 |
19:53:00 | <godane> | also found an interview with Bruce Campbell |
19:53:00 | <godane> | problem is, 11 mins into the clip it goes very slow with no sound |
19:54:00 | <godane> | it's very sad that it was not fixed |
19:55:00 | <godane> | it's also 619MB |
19:55:00 | <godane> | this will be archived anyway, even though they screwed it up |
19:59:00 | <godane> | http://abandonware-magazines.org/ |