00:26:00<SketchCow>Huh.
00:31:00<godane>hey SketchCow
00:33:00<BlueMax>I assume we know about Webshots already?
00:36:00<BlueMax>http://www.webshots.com/ turning into another service, deleting all user photos that do not conform to said service
00:56:00<SketchCow>Yeah. I'm thinking we might need to go after this. Maybe.
00:57:00<godane>closes on December 1, 2012.
00:57:00<godane>so that may give us some time to decide to grab it
00:59:00<godane>it uses flash to display images
01:01:00<godane>i'm grabbing theregister.co.uk by year
01:02:00<godane>there is also wget.log file
01:07:00<BlueMax>how many photos does it have, 600 million?
01:14:00<SketchCow>It says that.
01:19:00<joepie91>godane: where do you see it use flash?
01:20:00<godane>http://www.webshots.com/pro/photo/3334729?path=/artist-kevin-mcneal
01:21:00<godane>the picture is blocked by my flash blocker
01:21:00<joepie91>oh indeed
01:21:00<joepie91>that's weird
01:21:00<joepie91>ooooo
01:21:00<joepie91>that's for the pro part of the site
01:21:00<joepie91>the members part doesn't use flash it seems
01:22:00<godane>ok
01:22:00<joepie91>also, the picture URLs are very easy to extract from the page source so that's not a problem
01:22:00<godane>ok
01:23:00<joepie91>SketchCow: want me to have a run and collect as many usernames as possible?
01:23:00<joepie91>from webshots
01:24:00<joepie91>they seem to have a fairly parse-able index, but it seems limited to showing 10k users per category
02:15:00<joepie91>hm. webshots is pretty big.
02:20:00<Nintendud>how i shot web
02:34:00<joepie91>hmhm: http://i.imgur.com/X7qwe.png
02:38:00<joepie91>well, looks like it started fetching users
02:38:00<joepie91>http://i.imgur.com/zlFdz.png
02:41:00<swagstaff>Y'all on top of this? http://www.buzzfeed.com/katienotopoulos/your-internet-photos-are-already-starting-to-die
02:43:00<swagstaff>"... However, buried deep within their http://help.getsmileapp.com/customer/portal/articles/708519-what-if-i-don%E2%80%99t-do-anything- is the bad news. The bad news is that if you don’t log into your old Webshots account and confirm it as a new Smile account, all your photos will be deleted. ...."
02:44:00<joepie91>swagstaff: http://i.imgur.com/zlFdz.png :)
02:44:00<joepie91>I'm not sure if there are any plans to archive everything
02:44:00<joepie91>but I'm already generating a list of all users, just in case
02:47:00<swagstaff>Glad to see the Team is Ever Alert. Good luck if you decide to archive.
02:49:00<joepie91>alright, time to sleep... with a bit of luck it's done gathering usernames by tomorrow :D
02:50:00<Nintendud>joepie91: nice.
02:50:00<Nintendud>my warrior has been bored recently.
02:52:00<joepie91>heh
02:52:00<joepie91>also, not that it's much use since it's not really distributed, but if anyone wants the source of said script, git clone http://git.cryto.net/repo/projects/joepie91/webshots
02:53:00<joepie91>very hacky and simple, but it works :P
02:57:00<arkhive>690 million photos apparently.
02:58:00<joepie91>yup
02:59:00<joepie91>time for sleep
02:59:00<joepie91>goodnight all
03:28:00DFJustin anticipates the cathedral of butthurt as photographers realize that flash doesn't keep you from saving the photos
05:18:00<SketchCow>I was wrong.
05:18:00<SketchCow>By the way.
05:18:00<SketchCow>Wayback machine has indexed 186 billion webpages.
05:18:00<SketchCow>And is expecting to do 240 billion.
05:18:00<SketchCow>Billion.
05:27:00<DFJustin>(☉ε ⊙ノ)ノ
05:40:00<BlueMax>I think I just crapped my pants at that number
06:04:00<Nemo_bis>too bad Google no longer has that childish count of indexed pages
06:07:00<Nemo_bis>hmm "In 2012, Google has indexed over 30 trillion web pages, 100 billion queries per month"
09:35:00<alard>I've started a Webshots page on the wiki: http://archiveteam.org/index.php?title=Webshots
09:36:00<alard>(There are some nice comments on this "photo of the day": http://travel.webshots.com/photo/2248078140105543869vCJpvs )
11:28:00<alard>https://github.com/ArchiveTeam/webshots-grab/
11:29:00<SmileyG>oooo code
11:30:00SmileyG looks and attempts to learn lua in 10 minutes before giving up
11:33:00<SmileyG>yup, I have no clue wtf that does :<
11:36:00<joepie91>All photos will be removed by December 1. Until then, you may use message boards and you may search, browse and view images, but you won’t be able to upload or download images.
11:36:00<joepie91>nasty.
11:38:00<joepie91>"hi, we're going to close down the site and tell you in advance but you can't download anything anymore by the time you're aware of it"
11:38:00<joepie91>not that flash is a terribly good protection, but ok :P
11:38:00<SmileyG>CHALLENGE ACCEPTED!
11:42:00<joepie91>lol
11:44:00<SmileyG>hmmm
11:44:00<SmileyG>is it a sign that I accidentally named the script webshits.sh?
11:45:00<joepie91>hahaha
11:46:00<joepie91>mmm
11:46:00<joepie91>AT wiki frontpage really needs an update
12:35:00<joepie91>http://i.imgur.com/EMv89.png
12:36:00<joepie91>found about 950k users so far
12:36:00<joepie91>re: webshots
12:36:00<SmileyG>joepie91: how? :o
12:37:00<joepie91>SmileyG: I started out by crawling all 'top members' for all categories
12:37:00<joepie91>script just parses all usernames out of the page, deduplicates
12:37:00<SmileyG>hmmm
12:37:00<joepie91>and writes the whole list to a file
12:37:00<joepie91>it's been running for a night or so now
12:37:00<SmileyG>nice, where are these scripts :S
12:37:00<joepie91>I estimate it's through about half of the users now
12:38:00<joepie91>(it just collects usernames, nothing else, though)
12:38:00<joepie91>git clone http://git.cryto.net/repo/projects/joepie91/webshots
12:38:00<SmileyG>ahhh
12:38:00<joepie91>it's a stupidly simple script though :P
12:38:00<joepie91>(regex is fun!)
12:39:00<SmileyG>:/
12:40:00<SmileyG>I'm trying to understand this and i just urgh.
12:41:00<joepie91>SmileyG: what part are you having trouble understanding?
12:41:00<SmileyG>actually your code kind of makes sense
12:41:00<SmileyG>but I just don't know how I'd ever write it :<
12:41:00<joepie91>what it basically does is this:
12:42:00<joepie91>retrieve community index, find all "top members" links, retrieve all those links, and for each of them find all pagination links
12:42:00<joepie91>then for every page of every category ('top members' link), it finds all user-page URLs in the page
12:42:00<joepie91>and extracts the username from that
12:42:00<joepie91>that's pretty much it
12:43:00<joepie91>structure may seem a bit odd because I'm trying to prevent it from loading the first page of a category twice
12:43:00<joepie91>so the first page (which is the destination of the 'top members' link) is added separately
12:43:00<joepie91>to the 'queue'
12:43:00<joepie91>if I hadn't done that, it would've sufficed to just use a few nested foreach loops and I'd be done
12:43:00<joepie91>:P
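
The crawl joepie91 describes above can be sketched roughly as follows. This is a minimal illustration and not the actual script from the git repository: the URL patterns, the regular expressions, and the names `collect_usernames`, `extract_usernames`, and `fetch` are all assumptions, since the log does not show the real Webshots markup. It also avoids the double load of each category's first page in a slightly different way, by parsing that page as soon as it has been fetched for its pagination links and queueing only the remaining pages.

```python
import re
from collections import deque

# Hypothetical patterns: the real Webshots page markup is not shown in the log.
TOP_MEMBERS_RE = re.compile(r'href="(/top-members/[^"?]+)"')
PAGE_RE = re.compile(r'href="(/top-members/[^"]+\?page=\d+)"')
USER_RE = re.compile(r'href="/user/([A-Za-z0-9_-]+)"')


def extract_usernames(html):
    """Pull usernames out of every user-page URL found in the page."""
    return USER_RE.findall(html)


def collect_usernames(fetch, index_url="/community"):
    """Crawl all 'top members' listings and return deduplicated usernames.

    `fetch` is any callable that takes a URL and returns the page as a string.
    """
    users = set()    # a set deduplicates usernames automatically
    seen = set()     # URLs already fetched or queued
    queue = deque()  # pagination pages still to fetch

    index_html = fetch(index_url)
    for category_url in TOP_MEMBERS_RE.findall(index_html):
        if category_url in seen:
            continue
        seen.add(category_url)
        first_page = fetch(category_url)
        # The first page is already loaded here, so parse it now
        # instead of queueing it and fetching it a second time.
        users.update(extract_usernames(first_page))
        for page_url in PAGE_RE.findall(first_page):
            if page_url not in seen:
                seen.add(page_url)
                queue.append(page_url)

    while queue:
        users.update(extract_usernames(fetch(queue.popleft())))

    return sorted(users)
```

Passing `fetch` in as a parameter keeps the sketch testable without touching the network; a real run would supply a function that downloads the page, and would write the returned list to a file such as users.txt.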
12:56:00<SmileyG>:/ archive.org down :?
13:10:00<joepie91>works for me
13:59:00<joepie91>whoop
13:59:00<joepie91>986098
13:59:00<joepie91>root@aarnist:~/webshots/webshots# cat users.txt | wc -l
14:00:00<joepie91>seems it's done :)
14:00:00<joepie91>a remark though:
14:00:00<joepie91>there were *very* few duplicates
14:00:00<joepie91>that makes me think that this is really only a portion of the webshots users
14:00:00<joepie91>every category's "top users" listing only shows 100 pages max
14:01:00<joepie91>SmileyG, SketchCow, any suggestions as to how to get more usernames?
14:04:00<joepie91>yeah, I was afraid of this: https://www.google.nl/search?sugexp=chrome,mod=8&sourceid=chrome&ie=UTF-8&q=site%3Acommunity.webshots.com+inurl%3Auser
14:04:00<joepie91>about 11.400.000 results
14:04:00<joepie91>that's about 11 times as much as I have now
14:07:00<jiphex>Safari is quite clever. If you open a link like google.com/search?q=something%20something - it parses the query and puts the query into the google search address bar thing as if you'd typed it yourself
14:15:00<SmileyG>hmmm
14:15:00<SmileyG>not really
14:15:00<SmileyG>POST.
14:15:00<SmileyG>:D
14:17:00<jiphex>erm, wrong channel :/
14:22:00<joepie91>SmileyG: http://aarnist.cryto.net:81/webshots/users.txt
14:22:00<joepie91>also, SmileyG, #webshots
14:22:00<alard>(and everyone else is welcome too, of course)
14:23:00<joepie91>:P
17:34:00<alard>SketchCow: Can you make a webshots rsync area on fos?
17:46:00<SketchCow>Yes.
17:46:00<SketchCow>How big do we think this is going to be? Probably big, huh.
17:48:00<alard>Big, yes.
17:49:00<SketchCow>I'm in that channel, let's discuss it there.
19:53:00<SketchCow>Cool thing.
19:53:00<SketchCow>Someone has donated a massive pile of recorded-off-vcr news programs from the 1980s.
19:53:00<SketchCow>So yeah
19:55:00<SketchCow>I'm packing up the first range of Cinch and putting it on the archive.
19:55:00<SketchCow>446gb of audio!
19:55:00<joepie91>Cinch?
19:56:00<chronomex>three pounds of flax!
20:05:00<SketchCow>Cinch.FM, one of the sillier shutdowns.
20:11:00<alard>SketchCow: Keep in mind that this is not the main website, but the discussion boards: http://archive.org/details/archiveteam-city-of-heroes-main
20:12:00<alard>I have a copy of the City of Heroes website (the documentation), but I haven't uploaded that yet.
20:41:00<joepie91>in other news: minus is basically dead, they only allow multimedia uploads now and want to move away from general purpose file storage
20:42:00<chronomex>pirated movies ONLY
20:43:00<joepie91>lol
20:43:00<Aranje>truth
20:43:00<chronomex>worked for megaupload, right?
20:43:00<chronomex>that guy made -piles- of money
20:44:00<Aranje>It's sounding like he'll get to keep it too, pretty shortly
20:44:00<Aranje>on a long enough (or short enough!) timescale, minus will be a raging success :D
20:45:00<godane>i'm still trying to find old webuser magazines for my "collection"
20:45:00<godane>some of the missing issues only link to dead file servers now
20:46:00<godane>or the servers aren't there anymore
20:47:00<joepie91>hai Aranje
20:47:00<joepie91>long time no speak
20:47:00<joepie91>also #webshots
20:47:00<joepie91>:P
20:57:00<SketchCow>So, to explain - I just call it the "Main" city of heroes grab because we want to do one last supplemental grab later.
20:57:00<SketchCow>That's what I meant.
20:59:00<alard>SketchCow: Yes, I thought so. Shall I rsync you the warcs of the main website so you can add them to the item? (The 'alard' rsync space on fos is gone.)
21:18:00<SketchCow>How big?
21:26:00<alard>3.6GB
21:28:00<SketchCow>I guess I need an alard place.
21:28:00<SketchCow>Let me get that going.
21:57:00<oli>joepie91: ...
21:57:00<joepie91>lol
21:57:00<joepie91>hai
21:57:00<oli>i got lots of capacity in LA and the hetzner box
21:57:00<joepie91>see #webshots
21:57:00<joepie91>:P