04:20:00<xk_id>Hmm... I suppose doing a reduce with a set of pages from the same server is not very polite.
06:45:00<Nemo_bis>norbert79: http://sourceforge.net/projects/xowa
08:06:00<norbert79>Nemo_bis: Thanks, while I don't quite understand how this might help me, but it's a good start! Thanks again :)
08:16:00<Nemo_bis>norbert79: looks like it parses the wikitext on demand
08:24:00<norbert79>Nemo_bis: Yes, I realized what you are trying to tell me with it. Might be useful indeed
10:54:00<xk_id>Somebody here recommended that I wait 1s between two requests to the same host. Shall I measure the delay between the requests, or between the reply and the next request?
10:55:00<turnkit>Lord_Nigh if you are dumping mixed mode, try imgburn as you can set it to generate a .cue, .ccd, and .mds file -- I think doing so will make it more easily re-burnable.
10:56:00<Lord_Nigh>does imgburn know how to deal with offsets on drives? the particular cd drive i have has a LARGE offset which actually ends up an entire sector away from where the audio data starts; based on the gibberish that appears in the 'right' place, it's likely a drive firmware off-by-one bug in the sector count
10:57:00<Lord_Nigh>seems to affect the entire line of usb dvd drives made by that manufacturer
10:59:00<turnkit>(scratches head)
10:59:00<Lord_Nigh>offsets meaning when digitally reading the audio areas
11:00:00<Lord_Nigh>there's a block of gibberish which appears before the audio starts
11:00:00<Lord_Nigh>and that has to be cut out
11:01:00<turnkit>http://forum.imgburn.com/index.php?showtopic=5974
11:06:00<turnkit>I guess the answer is "no" w/ imgburn... the suggestion here for 'exact' duping (?) is two pass burn... http://www.hydrogenaudio.org/forums/index.php?showtopic=31989
11:06:00<turnkit>But I am not sure... do you think it's necessary... will the difference in ripping make any difference -- i.e. would anyone be aware of the difference? I do rather like the idea of a perfect clone if possible though.
11:10:00<alard>xk_id: I'd choose the simplest solution. (send request -> process response -> sleep 1 -> repeat, perhaps?)
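A minimal sketch of the loop alard describes, assuming the Python requests library and a hypothetical list of page URLs; the 1-second sleep starts only after the response has been handled, which answers xk_id's question about where to measure the delay:

```python
import time
import requests

# Hypothetical work list; in practice this would be the pages to crawl.
urls = ["http://gather.com/page-one", "http://gather.com/page-two"]

for url in urls:
    response = requests.get(url)          # send request
    print(url, response.status_code)      # stand-in for processing the response
    time.sleep(1)                         # wait 1 s before the next request to the same host
```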
11:10:00<turnkit>Linux tool for multisession CD-EXTRA discs... http://www.phong.org/extricate/ ?
11:11:00<xk_id>alard: simplest maybe, but not most convenient :P
11:11:00<xk_id>alard: need to speed things up a bit here...
11:11:00<xk_id>alard: so I'm capturing when each request is being made and delay based on that
11:11:00<alard>In that case, why wait?
11:11:00<ersi>xk_id: Depending on how nice you want to be - the longer the wait the better - since it was like, 4M requests you needed to do? I usually opt for "race to the finish", but not with that amount. So I guess 1s, if you're okay with that
11:11:00<xk_id>alard: cause, morals
11:12:00<ersi>And it was like, a total of 4M~+ requests to be done, right?
11:12:00<xk_id>ersi: I found that in their robots.txt they allow a spider to wait 0.5s between requests. I'm basing mine on that. On the other hand, they ask a different spider to wait 4s
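Crawl-delay values like the 0.5 s and 4 s xk_id mentions can be read programmatically; a small sketch using Python's urllib.robotparser, where the user-agent strings are placeholders rather than the ones gather.com's robots.txt actually names:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://gather.com/robots.txt")
rp.read()

# crawl_delay() returns the Crawl-delay directive for a given user agent,
# or None if robots.txt does not specify one for it.
print(rp.crawl_delay("SomeSpider"))     # hypothetical agent name
print(rp.crawl_delay("AnotherSpider"))  # hypothetical agent name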
11:12:00xk_id nods
11:12:00<xk_id>maximum 4M. between 3 and 4M
11:12:00<alard>Is it a small site? Or one with more than enough capacity?
11:13:00<xk_id>It's an online social network.
11:13:00<xk_id>http://gather.com
11:13:00<ersi>is freshness important? (Like, if it's suspecteble to.. fall over and die) If not, 1-2s sounds pretty fair
11:14:00<xk_id>I need to finish quickly. less than a week.
11:14:00<ersi>ouch
11:14:00<ersi>with 1s wait, it'll take 46 days
11:14:00<xk_id>I'm doing it in a distributed way
11:15:00<ersi>(46 days basing on single thread - wait 1s between request)
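(For reference, ersi's 46-day figure follows directly from the numbers: 4 million requests at one per second, ignoring response time, is 4,000,000 seconds, and 4,000,000 / 86,400 ≈ 46.3 days.)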
11:15:00xk_id nods
11:15:00<alard>But, erm, if you're running a number of threads, why wait at all?
11:15:00<xk_id>good question... I suppose one reason would be to avoid being identified as an attacker
11:15:00<alard>For the moral dimension, it's probably the number of requests that you send to the site that counts, not the number of requests per thread.
11:16:00<alard>Are you using a disposable IP address?
11:16:00<xk_id>EC2
11:16:00<alard>You could just try without a delay and see what happens.
11:16:00<ersi>And keep a watch, for when the banhammer falls and switch over to some new ones :)
11:16:00<alard>Switch to more threads with a delay if they block you.
11:17:00<ersi>Maybe randomise your useragent a little as well
11:17:00<xk_id>I thought of randomising the user agent :D
11:17:00<ersi>goodie
11:17:00<alard>So you've got rid of the morals? :)
11:17:00ersi thanks chrome for updating all the freakin' time - plenty of useragent variants to go around
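A rough sketch of the user-agent randomisation ersi and xk_id are discussing; the agent strings below are illustrative examples of era-appropriate Chrome builds, not an exhaustive or authoritative list:

```python
import random
import requests

# Small pool of plausible browser user-agent strings (illustrative only).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17",
]

def fetch(url):
    # Pick a different user agent for each request.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers)
```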
11:18:00<xk_id>I'm not yet decided how to go about this.
11:18:00<xk_id>There are several factors. Morals is one of them. Another is the fact that it's part of a dissertation project, so I actually need to have morals. On the other hand, after reviewing many scholarly articles concerned with crawling, they seem very careless about this.
11:20:00<xk_id>but I also cannot go completely mental about this
11:20:00<xk_id>and finish my crawl in a day :P
11:22:00<xk_id>so basically: a) be polite. b) not be identified as an attacker. c) finish quickly.
11:24:00<xk_id>from the pov of the website, it's better to have 5 machines each delaying their requests (individually), than not delaying their requests, right?
11:26:00<alard>Delaying is better than not delaying?
11:26:00<xk_id>well, certainly, if I only had one machine.
11:27:00<ersi>The "Be polite"-coin has two sides. Side A: "Finish as fast as possible" and Side B: "Load as little as possible, over a long time"
11:27:00<alard>I'm not sure that 5 machines is better than 3 if the total requests/second is equal.
11:28:00<alard>The site doesn't seem very fast, by the way. At least the groups pages take a long time to come up. (But that could be my location.)
11:28:00<ersi>maybe only US hosting, seems slow for me as well
11:28:00<ersi>and maybe crummy infra
11:29:00<xk_id>it's a bit slow for me too
11:30:00<xk_id>so is it pointless to delay between requests if I'm using more than one machine?
11:30:00<xk_id>probably not, right?
11:31:00<alard>It may be useful to avoid detection, but otherwise I think it's the number of requests that count.
11:32:00<xk_id>detection-wise, yes. I was curious politeness-wise..
11:32:00<xk_id>hmm
11:33:00<alard>Politeness-wise I'd say it's the difference between the politeness of a DOS and a DDOS.
11:33:00<xk_id>hah
11:34:00<alard>Given that it's slow already, it's probably good to keep an eye on the site while you're crawling.
11:34:00<xk_id>do you think I could actually harm it?
11:34:00<alard>Yes.
11:35:00<xk_id>what would follow from that?
11:35:00<alard>You're sending a lot of difficult questions.
11:35:00<xk_id>hmm
11:36:00<alard>It's also hard on their cache, since you're asking for a different page each time.
11:36:00<xk_id>I think I'll do some delays across the cluster as well.
11:37:00<alard>Can you pause and resume easily?
11:37:00<xk_id>and perhaps I should do it during US-time night, I think most of their userbase is from there.
11:37:00<xk_id>yes, I know the job queue in advance.
11:37:00<xk_id>and I will work with that
11:38:00<alard>Start slow and then increase your speed if you find that you can?
11:39:00<xk_id>ok
11:44:00<xk_id>You said that: "I'm not sure that 5 machines is better than 3 if the total requests/second is equal.". However, I reckon that if I implement delays in the code for each worker, I may achieve a more polite crawl than otherwise. Am I wrong?
11:45:00<alard>No, I think you're right. 5 workers with delays is more polite than 5 workers without delays.
11:45:00<xk_id>This is an interesting Maths problem.
11:45:00<xk_id>even if the delays don't take into account the operations of the other workers?
11:46:00<alard>Of course: adding a delay means each worker is sending fewer requests per second.
11:46:00<xk_id>but does this decrease the requests per second of the entire cluster? :)
11:47:00<xk_id>hmm
11:47:00<alard>Yes. So you can use a smaller cluster to get the same result.
11:47:00<alard>number of workers * (1 / (request length + delay)) = total requests per second
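(Plugging illustrative numbers into alard's formula: with 5 workers, a 0.5 s request time and a 1 s delay, the cluster does 5 * (1 / (0.5 + 1)) ≈ 3.3 requests per second; dropping the delay gives 5 * (1 / 0.5) = 10 requests per second, so the same total rate could come from a smaller cluster without delays.)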
11:48:00<xk_id>but what matters here is also the rate between two consecutive requests, regardless of where they come from inside the cluster.
11:48:00<xk_id>*rate/time
11:48:00<alard>Why does that matter?
11:49:00<xk_id>because if I do 5 requests in 5 seconds, all five in the first second, that is more stressful than if I do one request per second
11:49:00<xk_id>no?
11:50:00<alard>That is true, but it's unlikely that your workers remain synchronized.
11:50:00<alard>(Unless you're implementing something difficult to ensure that they are.)
11:50:00<xk_id>will adding delays in the code of each worker improve this, or will it remain practically neutral?
11:52:00<alard>I think that if each worker sends a request, waits for the response, then sends the next request, the workers will get out of sync pretty soon.
11:52:00<xk_id>So, it is not possible to say in advance if delays in the code will improve things with respect to this issue, if I don't synchronise the workers.
11:53:00<alard>I don't think delays will help.
11:53:00<xk_id>yes. I think you're right
11:53:00<alard>They'll only increase your cost, because you need a larger cluster when workers are sleeping part of the time.
11:54:00<xk_id>let's just hope my workers won't randomly synchronise haha
11:54:00<xk_id>:P
11:55:00<xk_id>but, interesting situation.
11:56:00<xk_id>alard: thank you
12:00:00<alard>xk_id: Good luck. (They will synchronize if the site synchronizes its responses, of course. :)
12:22:00<Smiley>http://wikemacs.org/wiki/Main_Page - going away shortly
12:22:00<ersi>mediawikigrab that shizzle
12:22:00<Smiley>yeah I'm looking how now.
12:23:00<Smiley>errrr
12:23:00<Smiley>yeah how ? :S
12:23:00Smiley needs to go to lunch :<
12:23:00<Smiley>http://code.google.com/p/wikiteam/
12:25:00<Smiley>Is that still up to date?
12:26:00<Smiley>it doesn't say anything about warc :/
12:29:00<ersi>It's not doing WARC at all
12:29:00<ersi>If I'm not mistaken, emijrp hacked it together
12:31:00<Smiley>hmmm there is no index.php :S
12:32:00<Smiley>./dumpgenerator.py --index=http://wikemacs.org/index.php --xml --images --delay=3
12:32:00<Smiley>Checking index.php... http://wikemacs.org/index.php
12:32:00<Smiley>Error in index.php, please, provide a correct path to index.php
12:32:00<Smiley>:<
12:32:00<Smiley>tried with /wiki/main_page too
12:33:00<alard>http://wikemacs.org/w/index.php ?
12:37:00<ersi>You need to find the api.php I think. I know the infoz is on the wikiteam page
12:47:00<Smiley>http://code.google.com/p/wikiteam/wiki/NewTutorial#I_have_no_shell_access_to_server
13:04:00<Nemo_bis>yes, python dumpgenerator.py --api=http://wikemacs.org/w/api.php --xml --images
13:06:00<Smiley>woah :/
13:06:00<Smiley>what's with the /w/ directory then?
13:08:00<Smiley>yey running now :)
13:24:00<GLaDOS>Smiley: they use it if they're using rewrite rules.
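For wikis where the api.php location isn't obvious, a small sketch that probes the usual MediaWiki paths; this is a guess at what such a wiki exposes and is not part of the WikiTeam scripts:

```python
import requests

def find_api(base):
    # Common locations for MediaWiki's api.php, depending on rewrite rules.
    for path in ("/w/api.php", "/api.php", "/wiki/api.php"):
        url = base.rstrip("/") + path
        try:
            r = requests.get(url,
                             params={"action": "query", "meta": "siteinfo", "format": "json"},
                             timeout=10)
            if r.ok and "query" in r.json():
                return url
        except (requests.RequestException, ValueError):
            continue
    return None

print(find_api("http://wikemacs.org"))  # expected to report http://wikemacs.org/w/api.php
```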
13:33:00<Smiley>ah ok
13:45:00<Smiley>ok so I have the dump.... heh
13:45:00<Smiley>What's next ¬_¬
13:53:00<db48x>Smiley: keep it safe
13:59:00<Smiley>:)
13:59:00<Smiley>should I tar it or something and upload it to the archive?
14:00:00<db48x>that'd be one way to keep it safe
14:02:00Smiley ponders what to name it
14:02:00<Smiley>wikemacsorg....
14:03:00<ersi>domaintld-yearmonthdate.something
14:03:00<Smiley>wikemacsorg29012013.tgz
14:03:00<Smiley>awww close :D
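A minimal sketch of ersi's domaintld-yearmonthdate naming convention, assuming the dump sits in a local directory named wikemacsorg:

```python
import tarfile
from datetime import date

dump_dir = "wikemacsorg"  # hypothetical directory holding the dump
archive_name = "{}-{}.tgz".format(dump_dir, date.today().strftime("%Y%m%d"))

with tarfile.open(archive_name, "w:gz") as tar:
    tar.add(dump_dir)

print(archive_name)  # e.g. wikemacsorg-20130129.tgz
```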
14:06:00<Smiley>And upload it to the archive under the same name?
14:12:00<Smiley>going up now
14:25:00<Nemo_bis>Smiley: there's also an uploader.py
14:25:00<Smiley>doh! :D
15:50:00<schbiridi>German video site http://de.sevenload.com/ will delete all user "generated" videos on 28.02.2013
15:50:00<Smiley>:<
15:50:00<Smiley>"We are sorry but sevenload doesn't offer its service in your country."
15:50:00<Smiley>herp.
15:55:00<xk_id>what happens if I crash the webserver I'm crawling?
15:56:00<alard>You won't be able to get more data from it.
15:57:00<xk_id>ever?
15:57:00<xk_id>hmm.
15:57:00<alard>No, as long as it's down.
15:58:00<xk_id>oh, but surely it will resume after a short while.
15:58:00<alard>And you're more likely to get blocked and cause frowns.
15:58:00<xk_id>will there be serious frowns?
15:59:00<alard>Heh. Who knows? Look, if I were you I would just crawl as fast as I could without causing any visible trouble. Start slow -- with one thread, for example -- then add more if it goes well and you want to go a little faster.
16:00:00<xk_id>good idea
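A rough sketch of alard's "start slow, then add threads" advice, assuming a shared queue of URLs; the queue contents and the results dict are placeholders, and the worker count is something you would raise by hand once the site looks healthy:

```python
import threading
import time
from queue import Queue

import requests

NUM_WORKERS = 1   # start with a single thread; raise this only if the site copes
DELAY = 1.0       # per-worker pause between finishing a response and the next request

url_queue = Queue()   # hypothetical queue, filled elsewhere with the pages to fetch
results = {}          # url -> response body, as a stand-in for real storage

def worker():
    while True:
        url = url_queue.get()
        try:
            response = requests.get(url, timeout=30)
            results[url] = response.text   # stand-in for whatever processing is needed
        finally:
            url_queue.task_done()
        time.sleep(DELAY)

for _ in range(NUM_WORKERS):
    threading.Thread(target=worker, daemon=True).start()

url_queue.join()  # block until every queued page has been fetched
```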
16:49:00<alard>SketchCow: The Xanga trial has run out of usernames. Current estimate is still 35TB for everything. Do you want to continue?
16:51:00<SketchCow>Continue like download it?
16:51:00<SketchCow>Not download it.
16:51:00<SketchCow>But a mapping of all the people is something we should upload to archive.org
16:52:00<SketchCow>And we should have their crawlers put it on the roster
16:55:00<balrog_>just exactly how big was mobileme?
16:57:00<alard>mobileme was 200TB.
16:58:00<alard>The archive.org crawlers probably won't download the audio and video, by the way.
17:00:00<alard>The list of users is here, http://archive.org/details/archiveteam-xanga-userlist-20130142 (not sure if I had linked that before)
17:01:00<balrog_>how much has the trial downloaded?
17:02:00<alard>81GB. Part of that was without logging in and with earlier versions of the script.
17:02:00<DFJustin>"one file per user" or one line per user
17:03:00<alard>Hmm. As you may have noticed from the item name, this is the dyslexia edition.
19:11:00<chronomex>lol
19:25:00<godane>i'm uploading a 35min interview with the guy who made Mario
19:27:00<godane>filename: E3Interview_Miyamoto_G4700_flv.flv
19:35:00<omf__>godane, do you know what year that video came from?
19:36:00<Smiley>hmmm
19:36:00<Smiley>what's in it?
19:36:00Smiley bets 2007
19:41:00<DFJustin>if you like that this may be worth a watch https://archive.org/details/history-of-zelda-courtesy-of-zentendo
19:52:00<godane>omf__: i think 2005
19:53:00<godane>also found an interview with bruce campbell
19:53:00<godane>problem is 11 mins into the clip it goes very slow with no sound
19:54:00<godane>it's very sad that it was not fixed
19:55:00<godane>it's also 619MB
19:55:00<godane>this will be archived anyway, even though it's screwed up
19:59:00<godane>http://abandonware-magazines.org/