09:10:00<ersi>Any rsyncers around?
09:12:00<ersi>nevermind
10:20:00<SmileyG>ARUGH MY WARRIOR HAS GONE BONKERS
10:21:00<norbert79>What's wrong?
10:21:00<SmileyG>Tried to run urlteam and errrr wow, 2012-11-30 10:20:48,117 tinyback.Tracker INFO: Initializing tracker at http://tracker.tinyarchive.org/v1/
10:21:00<SmileyG>Traceback (most recent call last):
10:21:00<SmileyG>File "./single_task.py", line 36, in <module>
10:21:00<SmileyG>task = tracker.fetch()
10:21:00<SmileyG>File "/data/data/projects/URLTeam-0937f7a/tinyback/tracker.py", line 28, in fetch
10:21:00<SmileyG>raise Exception("Unexpected status %i" % status)
10:21:00<SmileyG>Exception: Unexpected status 502
10:21:00<SmileyG>Process TinyBack returned exit code 1 for Item
10:21:00<SmileyG>Failed TinyBack for Item
10:22:00<Deewiant>Tracker's down apparently
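The traceback above shows the client giving up as soon as the tracker answers with a 502. A minimal sketch of retrying with exponential backoff instead. `TrackerError`, `fetch_with_retry`, and the delay values are hypothetical names and numbers chosen for illustration; the tinyback code in the traceback raises a plain `Exception` and may be structured differently.

```python
import time


class TrackerError(Exception):
    """Hypothetical error type for an unexpected HTTP status from the tracker."""


def fetch_with_retry(fetch, retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure.

    fetch is any zero-argument callable that raises TrackerError on failure.
    It stands in for tracker.fetch(); that mapping is an assumption.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except TrackerError:
            if attempt == retries - 1:
                raise  # out of attempts, let the caller see the error
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
```

Passing `sleep` as a parameter keeps the function easy to exercise without real waiting.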
10:48:00<chronomex>pity it's not on tracker.archiveteam.org, else my drunk ass could reset it right now
10:59:00<Muad-Dib>yay, this 16k-picture Webshots profile is almost done :D
11:04:00<tuankiet>SmileyG: Yeah, I have the same thing
11:05:00<SmileyG>k
11:05:00<tuankiet>Nothing to run now =))
11:08:00<chronomex>Muad-Dib: what's your name on the tracker?
11:17:00<Muad-Dib>same as irc
11:18:00<Muad-Dib>almost to 100 GB now
11:18:00<chronomex>nice
12:01:00<DoubleJ>chronomex: Do you have access to the webshots user list? I'm assuming there are some that have been checked out forever that we could retry.
12:04:00<tuankiet>SmileyG: the URLTeam tracker is down. HTTP 502
12:05:00<SmileyG>:(
12:07:00<soultcer>tuankiet: Tracker should be back up now
12:08:00<tuankiet>soultcer: OK :D
12:08:00<soultcer>I ran a webshots downloader on the same machine as the tracker, and somehow the wget-lua process of that crawler stole all the memory and the incredibly smart oom-killer decided that it is best to kill small processes instead of the one memory-hogging one.
12:09:00<tuankiet>Ah, bad thing
12:10:00<soultcer>I should have run the dailybooth downloader as a separate user with resource limits turned on
12:13:00<tuankiet>How much RAM do you have?
12:16:00<soultcer>1 GB + 1 GB swap
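One way to get the resource limits soultcer mentions is to cap a child process's memory before it starts. A rough sketch, assuming a Unix system and Python's standard `resource` module; `run_limited` is a hypothetical helper, not part of any ArchiveTeam script, and the command and limit in the usage note are made up.

```python
import resource
import subprocess


def run_limited(cmd, max_bytes, **kwargs):
    """Run cmd as a child process with its address space capped at max_bytes.

    Extra keyword arguments are passed through to subprocess.run().
    """
    def set_limits():
        # Runs in the child just before exec, so the parent is not limited.
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))

    return subprocess.run(cmd, preexec_fn=set_limits, **kwargs)
```

For example, `run_limited(["./downloader.sh"], 512 * 1024 * 1024)` would make allocations beyond roughly 512 MB fail inside the downloader instead of leaving the choice of victim to the OOM killer. Running it under a separate user account, as suggested above, is a separate step not shown here.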
12:17:00<soultcer>alard: Good to see you again ;-)
12:18:00<alard>Hi.
12:22:00<SmileyG>it's alive!
12:24:00<chronomex>alard is alive!
12:25:00<chronomex>We all were worried
12:30:00<tuankiet>Ah, where were you these last few days, alard?
12:31:00<alard>Oh, in a very good place, actually. I was in Istanbul.
12:31:00<ersi>Neat
12:32:00<tuankiet>Travel alard?
12:32:00<alard>chronomex: In the meantime, you've apparently downloaded all of webshots etc. Good. (And repaired the tracker as well, it seems.)
12:32:00<soultcer>alard: We did what any computer expert does when software is running slow. We threw more hardware at it ;-)
12:33:00<ersi>Or rather, restarted it
12:33:00<alard>tuankiet: Some sort of holiday. There's a lot to see there. (But that's -bs stuff.)
12:33:00<alard>Ah, that always works.
12:39:00<alard>There's still a bit of webshots, cityofheroes left. I've requeued it now.
12:40:00<ersi>I still got workers running webshots, so should be picking it up
12:40:00<Deewiant>ersi: I think the linode did actually get upgraded
12:42:00<alard>Dailybooth: there are more users, the tracker only has the first batch.
12:42:00<alard>What's up with www.archiveteam.org? Is that a known problem?
12:43:00<Deewiant>alard: cityofheroes should be the "archive team's choice" now since it's going down something like today, right?
12:43:00<soultcer>Oh, luckily we still have a month for dailybooth
12:43:00<alard>Oh, it's back now.
12:44:00<Muad-Dib>Are you guys aware that some of your archive.org torrents are being improperly webseeded?
12:44:00<alard>Deewiant: Is it? Last time I looked it was all very vague. We have 217 items left, so that's not too much.
12:45:00<Deewiant>alard: I'm not sure, but at least a few people have been worried about it here in the past few days.
12:45:00<Muad-Dib>me too
12:46:00<Deewiant>November 30th is the game's "planned end of services" date, at least
12:47:00<ersi>People are generally very worried about things here
12:47:00<SmileyG>:D
12:47:00<SmileyG>I'm worried about the end of the universe.
12:48:00<Nemo_bis>the Russian government issued an official statement according to which you shouldn't
12:49:00<alard>Deewiant: CoH is now the "archive team's choice" project.
12:49:00<Deewiant>Cool, that seems like the right call to me :-)
12:50:00<Muad-Dib>switching to CoH too
12:50:00<alard>Muad-Dib: If you switch, please switch to "archiveteam's choice". CoH will be done pretty soon, I think.
12:51:00<Muad-Dib>after that I'll switch back to webshots, see if I can break the 125GB mark
12:53:00<Muad-Dib>alard: I'll switch to choice, will it switch to webshots after CoH forums?
12:53:00<Muad-Dib>has the user discovery for webshots already finished?
12:53:00<ersi>It'll switch to the next AT Choice, when it's changed. Depends on what it'll be changed to.
12:54:00<alard>Yes, I think webshots would be next.
12:55:00<alard>Webshots user discovery isn't finished yet. We've explored about half of it.
12:55:00<Muad-Dib>maybe more people need to be put on it :/
12:57:00<Muad-Dib>put on the user discovery*
12:58:00<alard>The user discovery is included in the user downloading, so everyone who downloads looks for a few new users first.
13:01:00<Muad-Dib>yeah, but maybe we need some people running only the user discovery, so it isn't slowed by the gallery downloading
13:02:00<ersi>>_>
13:02:00<Muad-Dib>on those peers*
13:06:00<Muad-Dib>going home, brb
13:30:00<tuankiet>alard: Will CoH end after this?
13:31:00<tuankiet>2 million items on Dailybooth o_0 0_o
13:39:00<ersi>Why do you think alard would know? He's not CoH :)
13:53:00<soultcer>alard: Dailybooth is not downloading any users, just getting 404s
13:55:00<alard>soultcer: Oh. Any idea why?
13:56:00<soultcer>Maybe there are no users with those IDs
13:56:00<soultcer>Finding username for ID 495912: not found (response code 404).
13:56:00<soultcer>Received item '495912' from tracker
13:58:00<alard>They changed the API.
13:58:00<tuankiet>Sorry. I am running CoH now
13:58:00<soultcer>Those bastards
13:59:00<soultcer>haha just great, http://blog.dailybooth.com/category/api/ is a known malware site according to google
14:00:00<alard>Yes, you get funny popups if you go to http://developers.dailybooth.com/
14:00:00<ersi>soultcer: yeah, been like that pretty long :) pretty funny
14:01:00<alard>Apparently https://api.dailybooth.com/v1/users/1.json still works (the HTTPS version)
14:01:00<soultcer>Makes me really want to trust them with my precious family pictures
14:03:00<alard>{"error":{"error":"rate_limit","error_description":"Rate limit exceeded.","error_code":412}}
14:04:00<soultcer>guess they finally figured out it was wget-lua and not some macos webkit :D
14:04:00<alard>Probably.
14:05:00<soultcer>Is it just me or is dailybooth mostly girls doing mirror-shots?
14:05:00<ersi>So? :)
14:05:00<alard>I think that was the original idea: take a picture of yourself every day.
14:07:00<alard>"Users seem to be predominantly teenagers — many of whom seem very bored" http://paidcontent.org/2011/03/09/419-photo-social-network-dailybooth-raises-6-million/
14:09:00<soultcer>Anyway, I guess we will need a new way to go from user id to username, or we have to start boring old username discovery using search engines
14:11:00<alard>The current version of the script also uses the API to discover the photos, comments etc.
14:13:00<alard>We could try it via https://api.dailybooth.com first, see how that works. (I'm not sure when you get the rate limit message.)
14:13:00<alard>https://api.dailybooth.com/v1/status/rate_limit.json isn't very informative.
14:19:00<soultcer>It seems that dailybooth is returning 404 for all API requests, even from new IPs?
14:22:00<alard>It redirects from api.dailybooth.com to dailybooth.com, I think. This still works: https://api.dailybooth.com/v1/users/495911.json
14:30:00<tuankiet>alard: Let's help them
14:32:00<tuankiet>Webshots should listen to One More Night =)). There is 1 night before the old Webshots dies =))
14:34:00<balrog_>will we make it?
14:35:00<tuankiet>DailyBooth API is working again: https://api.dailybooth.com/v1/users/495911.json
14:35:00<tuankiet>Output: {"user_id":495911,"username":"Krystofao_O","picture_count":9,"followers_count":67,"following_count":22,"private":false,"details":{"name":"Krystofa Hill","gender":"male","age":16,"country":"United Kingdom","relationship_status":"single","about":"Hello. Im 17.Bisexual. I go to college :). First Year. From England. I absolutley love different cultures. Im a very talkative person, but when meeting peop
14:35:00<tuankiet>Pretty Little Liars, One Tree Hill, The O.C, Ghost Whisperer, Gilmore Girls, Dr Quinn Medicine Woman, The Tribe.\n\n Tamna Island. Boys Over Flowers. You&#039;re Beautiful. Tree Of Heaven. Stairway To Heaven. Sad Love Story. Truth. Forever Yours. My Girlfriend is a gumiho. Prosecutor Princess. Secret Garden. City Hunter. Scent Of A Woman. Brilliant Legacy. Dae Jang Geum. Damo. Palace. ","music":"M
14:35:00<tuankiet>Se7en, B2ST, , T.O.P. Block B. Infinite. SHINee, Super Junior. \n\n","websites":"www.facebook.com\/krystofahill\nwww.twitter.com\/krystofao_O","movies":null,"books":null},"avatars":{"tiny":"http:\/\/d1oi94rh653f1l.cloudfront.net\/16\/avatars\/tiny\/495911_26931560.jpg","small":"http:\/\/d1oi94rh653f1l.cloudfront.net\/16\/avatars\/small\/495911_26931560.jpg","medium":"http:\/\/d1oi94rh653f1l.cloudf
14:35:00<tuankiet>arina &amp; The Diamonds. Imogen Heap. Paloma Faith. David Guetta. Adele. Lana Del Rey. Panic! At The Disco. 30 Seconds to Mars. My Chemical Romance. Rihanna. Nicki Minah. Greyson Chance. Kelly Clarkson. Oh Land. Natalia Kills. Cher Lloyd. VV Brown. Katy Perry. Eminem.Professor Green. Pink. Poets of the Fall. Cute is what we Aim For.\n\nBIGBANG , U-kiss, 2NE1, SNSD,G- DRAGON, C.N. Blue,FT Island, ,
14:35:00<tuankiet>le for the first time im very shy ^.^","interests":"The only historic thing i am intrested in is the Egyptians, ooo and also ancient South Korea :).\r\nHmm, i do like lollypops in class, since it occupys me and helps me concentrate :L\r\nMOOSIQUE, Love that stuff.\r\nMOOVIES, i also love them\r\nBOOKIES, i am quite a book fantastic, lol.","tv":"Supernatural, Smallville, Vampire Diaries, True Blood,
14:35:00<tuankiet>ront.net\/16\/avatars\/medium\/495911_26931560.jpg","large":"http:\/\/d1oi94rh653f1l.cloudfront.net\/16\/avatars\/large\/495911_26931560.jpg"}}
14:35:00<balrog_>that seems to work
14:36:00<ersi>Uh.. Dude.
14:36:00<SmileyG>D:
14:37:00<balrog_>hm?
14:37:00<balrog_>honestly I wasn't looking at any of the content
14:43:00<tuankiet>Can we rescue Webshots?
14:45:00<ersi>What kind of answer are you expecting? Also, why are you asking - after we've downloaded 50+TB of data?
14:46:00<tuankiet>Just curious ;) What I'm asking is, can we rescue all the data?
14:46:00<ersi>First project for you? Just curious
14:48:00<tuankiet>I started during MobileMe, but I found my upload speed was crazy, so I stopped and only started running again later. So you can consider this my first project ;)
14:50:00<ersi>Basically, I'd say "Yes, we've saved Webshots" to your initial question, considering we've downloaded 55TB of Webshots data - which is more than if we hadn't done anything. I understand the idea of trying to save something perfectly, and I'm sure everyone here wants to do that for the most part - but it's most often impossible
14:51:00<tuankiet>Thanks
14:51:00<ersi>that's how I think about it at least
15:12:00<DFJustin>don't we need to do a second pass on CoH to get new posts?
21:55:00<soultcer>alard, chronomex: It appears the webshots user discovery tracker is down
21:56:00<soultcer>So each webshots download will be slowed down by waiting for a timeout on the username discovery part
21:57:00<chronomex>are you sure? I'm not sure what's going on but there's some scrolling logs in a screen session named webshots-adder
21:57:00<chronomex>and webshots-tracker
21:58:00<soultcer>Yeah, trying to reach it results in a timeout. The webshots user discovery tracker was listening on HTTP port 8123
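A downloader can avoid stalling on a dead discovery tracker by probing it with a short timeout and skipping the discovery step when the probe fails. A minimal sketch, assuming a plain TCP connect is a good enough health check; `tracker_reachable` and the 3-second default are hypothetical, and the tracker's hostname is not given in the log.

```python
import socket


def tracker_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

With the port mentioned above, a call would look like `tracker_reachable(discovery_host, 8123)`, where `discovery_host` is whatever host the scripts are configured to use.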
22:02:00<alard>Strange.
22:02:00<chronomex>alard: can you take this from here?
22:03:00<alard>chronomex: I'm taking a look.
22:03:00<chronomex>cool, thanks
22:13:00<alard>soultcer: Better now?
22:15:00<soultcer>alard: I am now getting HTTP 500 when submitting results
22:20:00<alard>soultcer: Ah, yes. Fixed now? (I'm now running the discovery tracker in Nginx, instead of as a standalone application. There were too many requests for a single-threaded app.)
22:21:00<soultcer>alard: Yes, working fine now. Great work ;-)
22:22:00<alard>Good.
22:46:00<alard>Dailybooth is now completely unavailable, it seems.
22:47:00<alard>A temporary lapse, it's back now.
22:48:00<godane>i'm grabbing http://crypto.stanford.edu
22:49:00<godane>looks like there are very few crawls
23:12:00<SketchCow>OK, back.
23:13:00<godane>hey SketchCow
23:13:00<godane>looks like i'm finding vm dumps on crypto.stanford.edu
23:28:00<SketchCow>I'm going to Stuttgart Germany tomorrow. I assume nobody here is from that region.
23:33:00<alard>The latest set of DailyBooth scripts fixes the api url and can handle the rate limiting.
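This is not alard's actual script, only a sketch of what handling the two response shapes seen earlier in the log could look like: the rate-limit error body and a normal user record with a `username` field. The URL pattern is the HTTPS one quoted above; `user_url`, `parse_user_response`, and the returned tuples are hypothetical.

```python
import json

# HTTPS endpoint pattern quoted in the log.
API_USER_URL = "https://api.dailybooth.com/v1/users/{user_id}.json"


def user_url(user_id):
    """Build the API URL for a numeric user ID."""
    return API_USER_URL.format(user_id=user_id)


def parse_user_response(body):
    """Classify an API response body.

    Returns ("ok", username), ("rate_limited", None) or ("error", error_code).
    """
    data = json.loads(body)
    error = data.get("error")
    if error is not None:
        if error.get("error") == "rate_limit":
            return ("rate_limited", None)  # caller should wait and retry
        return ("error", error.get("error_code"))
    return ("ok", data["username"])
```

A caller would sleep and retry on `"rate_limited"` and hand the username to the downloader on `"ok"`; how long to wait is left open, since the log does not say when the limit resets.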
23:41:00<SketchCow>alard: The wayback machine loaded all the data.
23:41:00<SketchCow>I mean, I don't know how much of ours it grabbed in, but it's all up on the beta wayback.