04:20:00 | <xk_id> | Hmm... I suppose doing a reduce with a set of pages from the same server is not very polite. |
06:45:00 | <Nemo_bis> | norbert79: http://sourceforge.net/projects/xowa |
08:06:00 | <norbert79> | Nemo_bis: Thanks, while I don't quite understand how this might help me, but it's a good start! Thanks again :) |
08:16:00 | <Nemo_bis> | norbert79: looks like it parses the wikitext on demand |
08:24:00 | <norbert79> | Nemo_bis: Yes, I realized what you are trying to tell me with it. Might be useful indeed |
10:54:00 | <xk_id> | Somebody here recommended that I wait 1s between two requests to the same host. Shall I measure the delay between the requests, or between the reply and the next request? |
10:55:00 | <turnkit> | Lord_Nigh if you are dumping mixed mode, try imgburn as you can set it to generate a .cue, .ccd, and .mds file -- I think doing so will make it more easily re-burnable. |
10:56:00 | <Lord_Nigh> | does imgburn know how to deal with offsets on drives? the particular cd drive i have has a LARGE offset which actually ends up an entire sector away from where the audio data starts; based on gibberish that appears in the 'right' place it's likely a drive firmware off-by-one bug in the sector counting |
10:57:00 | <Lord_Nigh> | seems to affect the entire line of usb dvd drives made by that manufacturer |
10:59:00 | <turnkit> | (scratches head) |
10:59:00 | <Lord_Nigh> | offsets meaning when digitally reading the audio areas |
11:00:00 | <Lord_Nigh> | theres a block of gibberish which appears before the audio starts |
11:00:00 | <Lord_Nigh> | and that has to be cut out |
11:01:00 | <turnkit> | http://forum.imgburn.com/index.php?showtopic=5974 |
11:06:00 | <turnkit> | I guess the answer is "no" w/ imgburn... the suggestion here for 'exact' duping (?) is a two-pass burn... http://www.hydrogenaudio.org/forums/index.php?showtopic=31989 |
11:06:00 | <turnkit> | But I am not sure... do you think it's necessary... will the difference in ripping make any difference -- i.e. would anyone be aware of the difference? I do very much like the idea of a perfect clone if possible, though. |
11:10:00 | <alard> | xk_id: I'd choose the simplest solution. (send request -> process response -> sleep 1 -> repeat, perhaps?) |
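A minimal Python sketch of the loop alard describes; the process() step and the URL list are placeholders, not anything from the log:

    import time
    import urllib.request

    def process(body):
        pass  # placeholder: parse/store the page body

    def fetch_politely(urls, delay=1.0):
        # send request -> process response -> sleep -> repeat,
        # the simple loop suggested above
        for url in urls:
            with urllib.request.urlopen(url) as resp:
                body = resp.read()
            process(body)
            time.sleep(delay)  # wait before hitting the same host again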
11:10:00 | <turnkit> | Linux tool for multisession CD-EXTRA discs... http://www.phong.org/extricate/ ? |
11:11:00 | <xk_id> | alard: simplest maybe, but not most convenient :P |
11:11:00 | <xk_id> | alard: need to speed things up a bit here... |
11:11:00 | <xk_id> | alard: so I'm capturing when each request is being made and delay based on that |
11:11:00 | <alard> | In that case, why wait? |
11:11:00 | <ersi> | xk_id: Depending on how nice you want to be - the longer the wait the better - since it was like, 4M requests you needed to do? I usually opt for "race to the finish", but not with that amount. So I guess 1s, if you're okay with that |
11:11:00 | <xk_id> | alard: cause, morals |
11:12:00 | <ersi> | And it was like, a total of 4M~+ requests to be done, right? |
11:12:00 | <xk_id> | ersi: I found that in their robots.txt they allow a spider to wait 0.5s between requests. I'm basing mine on that. On the other hand, they ask a different spider to wait 4s |
11:12:00 | | xk_id nods |
11:12:00 | <xk_id> | maximum 4M. between 3 and 4M |
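If useful, a per-agent Crawl-delay like the one xk_id found can be read with Python's standard urllib.robotparser; a sketch (the robots.txt URL and the "SomeSpider" agent name are placeholders):

    from urllib import robotparser

    rp = robotparser.RobotFileParser("http://example.com/robots.txt")
    rp.read()
    # crawl_delay() returns the Crawl-delay value for a given user agent,
    # or None if the file doesn't set one (Python 3.6+)
    print(rp.crawl_delay("*"))
    print(rp.crawl_delay("SomeSpider"))  # hypothetical agent name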
11:12:00 | <alard> | Is it a small site? Or one with more than enough capacity? |
11:13:00 | <xk_id> | It's an online social network. |
11:13:00 | <xk_id> | http://gather.com |
11:13:00 | <ersi> | is freshness important? (Like, if it's susceptible to.. falling over and dying) If not, 1-2s sounds pretty fair |
11:14:00 | <xk_id> | I need to finish quickly. less than a week. |
11:14:00 | <ersi> | ouch |
11:14:00 | <ersi> | with 1s wait, it'll take 46 days |
11:14:00 | <xk_id> | I'm doing it in a distributed way |
11:15:00 | <ersi> | (46 days basing on single thread - wait 1s between request) |
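As a quick sanity check on ersi's estimate (taking the 4M figure from above):

    4,000,000 requests × 1 s/request = 4,000,000 s ≈ 1,111 hours ≈ 46.3 days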
11:15:00 | | xk_id nods |
11:15:00 | <alard> | But, erm, if you're running a number of threads, why wait at all? |
11:15:00 | <xk_id> | good question... I suppose one reason would be to avoid being identified as an attacker |
11:15:00 | <alard> | For the moral dimension, it's probably the number of requests that you send to the site that counts, not the number of requests per thread. |
11:16:00 | <alard> | Are you using a disposable IP address? |
11:16:00 | <xk_id> | EC2 |
11:16:00 | <alard> | You could just try without a delay and see what happens. |
11:16:00 | <ersi> | And keep a watch, for when the banhammer falls and switch over to some new ones :) |
11:16:00 | <alard> | Switch to more threads with a delay if they block you. |
11:17:00 | <ersi> | Maybe randomise your useragent a little as well |
11:17:00 | <xk_id> | I thought of randomising the user agent :D |
11:17:00 | <ersi> | goodie |
11:17:00 | <alard> | So you've got rid of the morals? :) |
11:17:00 | | ersi thanks chrome for updating all the freakin' time - plenty of useragent variants to go around |
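A sketch of the user-agent rotation being discussed; the strings below are illustrative Chrome variants, not taken from the log:

    import random

    # a small pool of plausible browser user agents; rotate per request
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11",
    ]

    def random_headers():
        return {"User-Agent": random.choice(USER_AGENTS)}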
11:18:00 | <xk_id> | I'm not yet decided how to go about this. |
11:18:00 | <xk_id> | There are several factors; morals is one of them. Another is the fact that it's part of a dissertation project, so I actually need to have morals. On the other hand, the many scholarly articles concerned with crawling that I've reviewed seem very careless about this. |
11:20:00 | <xk_id> | but I also cannot go completely mental about this |
11:20:00 | <xk_id> | and finish my crawl in a day :P |
11:22:00 | <xk_id> | so basically: a) be polite. b) not be identified as an attacker. c) finish quickly. |
11:24:00 | <xk_id> | from the pov of the website, it's better to have 5 machines each delaying their requests (individually) than not delaying their requests, right? |
11:26:00 | <alard> | Delaying is better than not delaying? |
11:26:00 | <xk_id> | well, certainly, if I only had one machine. |
11:27:00 | <ersi> | The "Be polite"-coin has two sides. Side A: "Finish as fast as possible" and Side B: "Load as little as possible, over a long time" |
11:27:00 | <alard> | I'm not sure that 5 machines is better than 3 if the total requests/second is equal. |
11:28:00 | <alard> | The site doesn't seem very fast, by the way. At least the groups pages take a long time to come up. (But that could be my location.) |
11:28:00 | <ersi> | maybe only US hosting, seems slow for me as well |
11:28:00 | <ersi> | and maybe crummy infra |
11:29:00 | <xk_id> | it's a bit slow for me too |
11:30:00 | <xk_id> | so is it pointless to delay between requests if I'm using more than one machine? |
11:30:00 | <xk_id> | probably not, right? |
11:31:00 | <alard> | It may be useful to avoid detection, but otherwise I think it's the number of requests that count. |
11:32:00 | <xk_id> | detection-wise, yes. I was curious politeness-wise.. |
11:32:00 | <xk_id> | hmm |
11:33:00 | <alard> | Politeness-wise I'd say it's the difference between the politeness of a DOS and a DDOS. |
11:33:00 | <xk_id> | hah |
11:34:00 | <alard> | Given that it's slow already, it's probably good to keep an eye on the site while you're crawling. |
11:34:00 | <xk_id> | do you think I could actually harm it? |
11:34:00 | <alard> | Yes. |
11:35:00 | <xk_id> | what would follow from that? |
11:35:00 | <alard> | You're asking a lot of difficult questions. |
11:35:00 | <xk_id> | hmm |
11:36:00 | <alard> | It's also hard on their cache, since you're asking for a different page each time. |
11:36:00 | <xk_id> | I think I'll do some delays across the cluster as well. |
11:37:00 | <alard> | Can you pause and resume easily? |
11:37:00 | <xk_id> | and perhaps I should do it during US night-time; I think most of their userbase is from there. |
11:37:00 | <xk_id> | yes, I know the job queue in advance. |
11:37:00 | <xk_id> | and I will work with that |
11:38:00 | <alard> | Start slow and then increase your speed if you find that you can? |
11:39:00 | <xk_id> | ok |
11:44:00 | <xk_id> | You said that: "I'm not sure that 5 machines is better than 3 if the total requests/second is equal.". However, I reckon that if I implement delays in the code for each worker, I may achieve a more polite crawl than otherwise. Am I wrong? |
11:45:00 | <alard> | No, I think you're right. 5 workers with delays is more polite than 5 workers without delays. |
11:45:00 | <xk_id> | This is an interesting Maths problem. |
11:45:00 | <xk_id> | even if the delays don't take into account the operations of the other workers? |
11:46:00 | <alard> | Of course: adding a delay means each worker is sending fewer requests per second. |
11:46:00 | <xk_id> | but does this decrease the requests per second of the entire cluster? :) |
11:47:00 | <xk_id> | hmm |
11:47:00 | <alard> | Yes. So you can use a smaller cluster to get the same result. |
11:47:00 | <alard> | number of workers * (1 / (request length + delay)) = total requests per second |
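A worked instance of alard's formula, with illustrative numbers (the 0.5 s request time is an assumption, not measured):

    def cluster_rate(workers, request_time, delay):
        # alard's formula: workers * (1 / (request length + delay))
        return workers * (1.0 / (request_time + delay))

    print(cluster_rate(5, 0.5, 1.0))  # ~3.33 requests/second across the cluster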
11:48:00 | <xk_id> | but what matters here is also the rate between two consecutive requests, regardless of which machine inside the cluster they come from. |
11:48:00 | <xk_id> | *rate/time |
11:48:00 | <alard> | Why does that matter? |
11:49:00 | <xk_id> | because if I do 5 requests in 5 seconds, all five in the first second, that is more stressful than if I do one request per second |
11:49:00 | <xk_id> | no? |
11:50:00 | <alard> | That is true, but it's unlikely that your workers remain synchronized. |
11:50:00 | <alard> | (Unless you're implementing something difficult to ensure that they are.) |
11:50:00 | <xk_id> | will adding delays in the code of each worker improve this, or will it remain practically neutral? |
11:52:00 | <alard> | I think that if each worker sends a request, waits for the response, then sends the next request, the workers will get out of sync pretty soon. |
11:52:00 | <xk_id> | So, it is not possible to say in advance whether delays in the code will improve things with respect to this issue, if I don't synchronise the workers. |
11:53:00 | <alard> | I don't think delays will help. |
11:53:00 | <xk_id> | yes. I think you're right |
11:53:00 | <alard> | They'll only increase your cost: you need a larger cluster, because workers are sleeping part of the time. |
11:54:00 | <xk_id> | let's just hope my workers won't randomly synchronise haha |
11:54:00 | <xk_id> | :P |
11:55:00 | <xk_id> | but, interesting situation. |
11:56:00 | <xk_id> | alard: thank you |
12:00:00 | <alard> | xk_id: Good luck. (They will synchronize if the site synchronizes its responses, of course. :) |
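For what it's worth, a common trick to keep workers from locking into step is to randomise the delay a little; a minimal sketch, with an assumed jitter range:

    import random
    import time

    def polite_sleep(base_delay=1.0, jitter=0.5):
        # sleep the base delay plus up to `jitter` extra seconds, so workers
        # that happen to start in step drift apart instead of staying in sync
        time.sleep(base_delay + random.uniform(0, jitter))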
12:22:00 | <Smiley> | http://wikemacs.org/wiki/Main_Page - going away shortly |
12:22:00 | <ersi> | mediawikigrab that shizzle |
12:22:00 | <Smiley> | yeah I'm looking how now. |
12:23:00 | <Smiley> | errrr |
12:23:00 | <Smiley> | yeah how ? :S |
12:23:00 | | Smiley needs to go to lunch :< |
12:23:00 | <Smiley> | http://code.google.com/p/wikiteam/ |
12:25:00 | <Smiley> | Is that still up to date? |
12:26:00 | <Smiley> | it doesn't say anything about warc :/ |
12:29:00 | <ersi> | It's not doing WARC at all |
12:29:00 | <ersi> | If I'm not mistaken, emijrp hacked it together |
12:31:00 | <Smiley> | hmmm there is no index.php :S |
12:32:00 | <Smiley> | ./dumpgenerator.py --index=http://wikemacs.org/index.php --xml --images --delay=3 |
12:32:00 | <Smiley> | Checking index.php... http://wikemacs.org/index.php |
12:32:00 | <Smiley> | Error in index.php, please, provide a correct path to index.php |
12:32:00 | <Smiley> | :< |
12:32:00 | <Smiley> | tried with /wiki/main_page too |
12:33:00 | <alard> | http://wikemacs.org/w/index.php ? |
12:37:00 | <ersi> | You need to find the api.php I think. I know the infoz is on the wikiteam page |
12:47:00 | <Smiley> | http://code.google.com/p/wikiteam/wiki/NewTutorial#I_have_no_shell_access_to_server |
13:04:00 | <Nemo_bis> | yes, python dumpgenerator.py --api=http://wikemacs.org/w/api.php --xml --images |
13:06:00 | <Smiley> | woah :/ |
13:06:00 | <Smiley> | what's with the /w/ directory then? |
13:08:00 | <Smiley> | yey running now :) |
13:24:00 | <GLaDOS> | Smiley: they use it if they're using rewrite rules. |
13:33:00 | <Smiley> | ah ok |
13:45:00 | <Smiley> | ok so I have the dump.... heh |
13:45:00 | <Smiley> | What's next ¬_¬ |
13:53:00 | <db48x> | Smiley: keep it safe |
13:59:00 | <Smiley> | :) |
13:59:00 | <Smiley> | should I tar it or something and upload it to the archive? |
14:00:00 | <db48x> | that'd be one way to keep it safe |
14:02:00 | | Smiley ponders what to name it |
14:02:00 | <Smiley> | wikemacsorg.... |
14:03:00 | <ersi> | domaintld-yearmonthdate.something |
14:03:00 | <Smiley> | wikemacsorg29012013.tgz |
14:03:00 | <Smiley> | awww close :D |
14:06:00 | <Smiley> | And upload it to the archive under the same name? |
14:12:00 | <Smiley> | going up now |
14:25:00 | <Nemo_bis> | Smiley: there's also an uploader.py |
14:25:00 | <Smiley> | doh! :D |
15:50:00 | <schbiridi> | German video site http://de.sevenload.com/ will delete all user-"generated" videos on 28.02.2013 |
15:50:00 | <Smiley> | :< |
15:50:00 | <Smiley> | "We are sorry but sevenload doesn't offer its service in your country." |
15:50:00 | <Smiley> | herp. |
15:55:00 | <xk_id> | what follows if I crash the webserver I'm crawling? |
15:56:00 | <alard> | You won't be able to get more data from it. |
15:57:00 | <xk_id> | ever? |
15:57:00 | <xk_id> | hmm. |
15:57:00 | <alard> | No, just as long as it's down. |
15:58:00 | <xk_id> | oh, but surely it will resume after a short while. |
15:58:00 | <alard> | And you're more likely to get blocked and cause frowns. |
15:58:00 | <xk_id> | will there be serious frowns? |
15:59:00 | <alard> | Heh. Who knows? Look, if I were you I would just crawl as fast as I could without causing any visible trouble. Start slow -- with one thread, for example -- then add more if it goes well and you want to go a little faster. |
16:00:00 | <xk_id> | good idea |
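A rough sketch of that start-slow ramp-up in Python; the worker loop, health check, and settle interval are all hypothetical placeholders:

    import threading
    import time

    def crawl_worker(job_queue):
        pass  # placeholder: the per-thread fetch loop sketched earlier

    def site_looks_healthy():
        return True  # hypothetical check, e.g. on recent error rates or response times

    def ramp_up(job_queue, max_threads=5, settle=60):
        # start one worker, then add another every `settle` seconds
        # as long as the site keeps responding normally
        threads = []
        for _ in range(max_threads):
            t = threading.Thread(target=crawl_worker, args=(job_queue,), daemon=True)
            t.start()
            threads.append(t)
            time.sleep(settle)
            if not site_looks_healthy():
                break
        return threads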
16:49:00 | <alard> | SketchCow: The Xanga trial has run out of usernames. Current estimate is still 35TB for everything. Do you want to continue? |
16:51:00 | <SketchCow> | Continue like download it? |
16:51:00 | <SketchCow> | Not download it. |
16:51:00 | <SketchCow> | But a mapping of all the people is something we should upload to archive.org |
16:52:00 | <SketchCow> | And we should have their crawlers put it on the roster |
16:55:00 | <balrog_> | just exactly how big was mobileme? |
16:57:00 | <alard> | mobileme was 200TB. |
16:58:00 | <alard> | The archive.org crawlers probably won't download the audio and video, by the way. |
17:00:00 | <alard> | The list of users is here, http://archive.org/details/archiveteam-xanga-userlist-20130142 (not sure if I had linked that before) |
17:01:00 | <balrog_> | how much has the trial downloaded? |
17:02:00 | <alard> | 81GB. Part of that was without logging in and with earlier versions of the script. |
17:02:00 | <DFJustin> | "one file per user" or one line per user |
17:03:00 | <alard> | Hmm. As you may have noticed from the item name, this is the dyslexia edition. |
19:11:00 | <chronomex> | lol |
19:25:00 | <godane> | i'm uploading a 35min interview with the guy that made Mario |
19:27:00 | <godane> | filename: E3Interview_Miyamoto_G4700_flv.flv |
19:35:00 | <omf__> | godane, do you know what year that video came from? |
19:36:00 | <Smiley> | hmmm |
19:36:00 | <Smiley> | whats in it? |
19:36:00 | | Smiley bets 2007 |
19:41:00 | <DFJustin> | if you like that this may be worth a watch https://archive.org/details/history-of-zelda-courtesy-of-zentendo |
19:52:00 | <godane> | omf__: i think 2005 |
19:53:00 | <godane> | also found an interview with Bruce Campbell |
19:53:00 | <godane> | problem is, 11 mins into the clip it goes very slow with no sound |
19:54:00 | <godane> | it's very sad that it was not fixed |
19:55:00 | <godane> | it's also 619MB |
19:55:00 | <godane> | this will be archived anyway, even though they screwed it up |
19:59:00 | <godane> | http://abandonware-magazines.org/ |