05:09:00<SketchCow>http://archive.org/details/webshots-freeze-frame
05:19:00<DFJustin>parking some more cars in the IA driveway eh
05:57:00<godane>SketchCow: I'm uploading 132 ISOs of Linux Format
05:58:00<godane>using your naming for the ISOs so they're standard and easy to find
14:38:00<SmileyG>Jason, can you pull my new version of the front page plz, http://archiveteam.org/index.php?title=Djsmiley2k/main_page
14:39:00<SmileyG>And possibly tweet to followers about Warrior? - ( I tweeted Oi followers - webshots are going to delete all member photos - we're backing them up - HELP http://archiveteam.org/index.php?title=ArchiveTeam_Warrior & Download, VM - PLZ RT ) I'm sure you can come up with something more witty.
17:13:00<Zym_>can anyone tell me how the archiveteam warrior works? I'm not sure if mine's doing anything productive...
17:14:00<soultcer>Zym_: It has a webinterface, try http://localhost:8001/
17:18:00<Zym_>got that. I'm still not sure what happens under the surface because all I get is a GUI but no output
18:43:00<SketchCow>http://bt.custhelp.com/app/answers/detail/a_id/39105/~/the-free-bt-web-hosting-service-is-closing
18:43:00<SketchCow>IMMEDIATELY.
18:47:00<joepie91>uhoh
18:47:00<C-Keen>To access our FTP site using a web browser, type the following into your browser:
18:47:00<C-Keen>ftp://username:password@ftp.btinternet.com
18:47:00<C-Keen>Once you're logged in you'll see an FTP listing of directories.
18:47:00<C-Keen>OREALLY?
18:47:00<C-Keen>FTP listing of directories
18:47:00<C-Keen>You may have to select the "pub" directory to view your files. Once you have found your files, right click on the ones you want to save, then choose "Save Target as".
18:57:00<soultcer>How do we usually archive such small webhosting providers? wget-warc? heritrix?
19:06:00<SmileyG>root:*:0:1:Operator:/:
19:06:00<SmileyG>bin:*:2:2::/:/bin/csh
19:06:00<SmileyG>daemon:*:1:1::/:
19:06:00<SmileyG>ftp:*:99:99:ftp user:/var/ftp-anon:nosuchshell
19:06:00<SmileyG>sys:*:3:3::/bin:
19:06:00<SmileyG>nobody:*:65534:65534::/:
19:06:00<SmileyG>lulz?
19:06:00<C-Keen>that's the ftp server?
19:06:00<SmileyG>yes.
19:06:00<C-Keen>lolwut?
19:07:00<SmileyG>they have hidden directory listings but not actually blocked them
19:07:00<C-Keen>should make backup easier :p
19:07:00<SmileyG>it's a chroot though, I think.
19:10:00<SmileyG>https://www.google.co.uk/#q=site:btinternet.com&hl=en&safe=off&prmd=imvns&ei=cnV0ULvMKcvK0AW1_oDwDg&start=10&sa=N&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.&fp=70cfab5615ed83a6&biw=888&bih=625
19:10:00<SmileyG>plenty of sites :S
19:11:00<C-Keen>indeed
19:11:00<C-Keen>how can we get a list of them all?
19:13:00<C-Keen>http://www.btinternet.com/~memoriesofmax/ <-
19:13:00<C-Keen>that's what we do it for...
19:17:00<SketchCow>Really?
19:17:00<SketchCow>I do it for http://www.btinternet.com/~carol.stirling2/New_Folder/whips_and_crops.htm
19:17:00<C-Keen>kinky
19:18:00<SketchCow>http://www.btinternet.com/~shawater/warning.html
19:18:00<C-Keen>this looks like a tiny geocities
19:18:00<SketchCow>It IS a tiny Geocities.
19:19:00<SmileyG>SketchCow: that's "Does it for me" ;)
19:19:00<SmileyG>Do we have anything to rip google results yet?
19:19:00<alard>We could put it on the warrior.
19:19:00<alard>SmileyG: https://gist.github.com/2788197d2db2779cb7b0
19:19:00<SmileyG>ipv6? :S
19:20:00<alard>Is faster.
19:20:00<SmileyG>not if I don't already have IPv6 setup :S
19:20:00<alard>No. Do you want to be blocked by Google?
19:20:00<SmileyG>For how long :S
19:21:00<alard>Well, the trick is: if you use ipv6 you can query at a much higher rate.
19:21:00<SmileyG>D:
19:21:00<C-Keen>I can do it and hand out results...
19:21:00<SmileyG>Looks like there's only 10 pages of google results before they turn into random pages on the sites...
19:21:00<C-Keen>pestilenz.org has got a native connection and not really a bandwidth limit as our hoster does not have a billing plan for v6 in place
19:22:00<alard>C-Keen: You do need an IPv6-over-IPv4 tunnel from tunnelbroker.net.
19:22:00<C-Keen>alard: aawwww
19:22:00<alard>Because you need a /48 subnet.
19:23:00<C-Keen>it's only a /64
19:23:00<joepie91><Migs>LEGAL torrenting is legal
19:23:00<joepie91><Migs>torrenting is not
19:23:00<joepie91><joepie91>torrenting IS legal
19:23:00<joepie91>prepare to laugh
19:24:00<SmileyG>lols
19:24:00<SmileyG>Where is this? I want to point and laugh
19:24:00<SmileyG>car analogy
19:24:00<SmileyG>"Driving is legal"
19:24:00<SmileyG>"Legal driving is legal" "Driving is not"
19:24:00<SmileyG>:D
19:24:00<joepie91>SmileyG: #webtech on freenode
19:28:00<SmileyG>Not worth wasting your time over joepie91
19:31:00<C-Keen>hm, using the google search api limits you to 100 search results per day. how dumb is that
19:31:00<C-Keen>has anyone tried and actually *asked* btinternet for a list of all the sites?
19:34:00<SmileyG>so I'm setting up ipv6 tunnel for future usage :<
19:35:00<joepie91>seriously
19:35:00<joepie91>these people are sad
19:35:00<SmileyG>joepie91: haters, don't waste time on them
19:35:00<joepie91>no, not haters
19:35:00<joepie91>just straight out idiots
19:38:00SketchCow is desperately trying to upload everything off FOS while this insane webshots thing is going down.
19:38:00<SketchCow>Just in case you're wondering what has my attention.
19:39:00<SmileyG>http://www.btinternet.com/new/content/custhome/listings/A.shtml
19:39:00<SmileyG>BINGO
19:39:00<SmileyG>This isn't all sites
19:39:00<SmileyG>but a large chunk :D
19:40:00<SmileyG>NB Sites are not listed automatically however the listing submission process is currently unavailable while the system is being redesigned. Apologies for the inconvenience. << not sure what happened there.
19:41:00<soultcer>I bet they actually add all sites manually
19:41:00<SmileyG>although _a lot_ of sites are dead
19:41:00<C-Keen>the first two entries on that site are broken
19:42:00<C-Keen>maybe it is already out of date
19:42:00<SmileyG>http://www.accordions.btinternet.co.uk/ yeaaaah
19:42:00<C-Keen>it seems to always redirect me to that yahoo login page
19:42:00<SmileyG>yeah the last one works :S
19:43:00<soultcer>Why is it anytime something about deleting data comes up, it's Yahoo that is the culprit?
19:43:00<primus>yahoo is starting to be annoying, do these guys actually keep anything they buy?
19:43:00<SmileyG>lol
19:43:00<SmileyG>"streamline the business"
19:43:00<SmileyG>Remember, they want to go 482 mph.
19:44:00<primus>maybe redbull can sponsor the balloon for them too ....
19:45:00<C-Keen>SmileyG: seems like that index is all the way broken
19:45:00<SmileyG>C-Keen: the last link worked :(
19:45:00<SmileyG>hence why I was celebrating, only to realise the rest didn't. :(
19:45:00<SmileyG>Is there any index that doesn't ban you?
19:45:00<SmileyG>duckduckgo for example?
19:46:00<C-Keen>the duckduckgo folks might even contribute the stuff in a sane way
19:46:00<SmileyG>prepacked list?
19:47:00<C-Keen>for example!
19:47:00SmileyG asks some guys who love ddg.
19:49:00<joepie91>god
19:49:00<joepie91>SmileyG: these guys are complete retards, seriously... I give them a techdirt link after they ask me for a legal source saying that copyright infringement =/= theft
19:49:00<joepie91>and they are so busy bitching about me and stroking their own ego that they don't even notice that the legal document itself is embedded on the page
19:50:00<SmileyG>joepie91: ¬_¬ I deal with people like this all day. Not worth the hassle
19:50:00<joepie91>and keep claiming that I should 'show a legal document' and that 'techdirt is not a legal source'
19:50:00<SmileyG>then again I get involved too
19:50:00<C-Keen>SmileyG: I am asking on their "official IRC channel"
19:50:00<SmileyG>some guy was telling me earlier how " " == ""
19:50:00<SmileyG>i.e. space == null
19:51:00<C-Keen>21:51 < yano> duckduckgo doesn't provide more than the 0-click and the first result via it's API
19:51:00<C-Keen>21:51 < crazedpsyc> we don't and can't allow that due to licensing on the results
19:52:00<C-Keen>SmileyG: ^
19:53:00<SmileyG>ghay.
19:54:00<SmileyG>they like to see themselves as rebels "ooo we block tracking"
19:54:00<C-Keen>but then if you want to shove the results up theirs they all go "I am only thirteen!"
19:54:00<C-Keen>or something
19:58:00<SmileyG>lol
20:04:00<Zym>what about simple wget?
20:05:00<Zym>and a picture of rob? http://www.asallen.btinternet.co.uk/Files/Rob.JPG
20:05:00<SketchCow>Wget no way.
20:05:00<SketchCow>We'll be doing WARCs.
20:05:00<SketchCow>We want these in wayback.archive.org
20:06:00<SketchCow>joepie, what the fuck are you arguing about, and why is it not in -bs?
20:06:00<SketchCow>Because it really should be there.
20:06:00<SketchCow>Damn, Save Rob
20:06:00<SketchCow>Because 10-1 Rob ain't making it another year
20:08:00<joepie91>SketchCow: not really an argument, more a quick remark :P
20:08:00<joepie91>but yeah, I'll take it to bs
20:08:00<alard>Something to be aware of: www.btinternet.com/~$u == www.$u.btinternet.co.uk
20:08:00<SmileyG>Oh hey its big rob.
20:09:00<SmileyG>alard: urgh, I should have said that; it seemed obvious to me ¬_¬
20:09:00<alard>Well, it means: 1. any searches (google etc.) for usernames have to be done for both types of urls, and 2. we need to decide which type to download.
20:10:00<joepie91>it gets more complicated:
20:10:00<joepie91>http://www.sam.gamble.btinternet.co.uk/languages/english/english.inc.php
20:10:00<alard>Both versions, perhaps.
20:10:00<joepie91>note the sam.gamble
20:10:00<SmileyG>yes
20:10:00<alard>http://www.btinternet.com/~sam.gamble/
20:10:00<SketchCow>My attitude is download both
20:10:00<SmileyG>so btinternet.com/~sam.gamble?
20:11:00<SketchCow>All of them.
20:11:00<SketchCow>We'll keep them separate in dump
20:11:00<SketchCow>Do some compares
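A minimal sketch of the approach being settled on here: every username yields two candidate URLs, and both get downloaded. This is illustrative Python, not taken from the actual btinternet-grab code; the function name is invented.

    # Both known URL forms for a btinternet username (per alard's note above:
    # www.btinternet.com/~$u == www.$u.btinternet.co.uk). Usernames may
    # contain dots, e.g. "sam.gamble".
    def url_variants(username):
        return [
            "http://www.btinternet.com/~%s/" % username,
            "http://www.%s.btinternet.co.uk/" % username,
        ]

    for url in url_variants("sam.gamble"):
        print(url)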
20:12:00<joepie91>et presto!
20:12:00<joepie91>SketchCow, alard, SmileyG, http://www.btinternet.com/new/content/custhome/
20:12:00<joepie91>directory :)
20:12:00<joepie91>oh wait
20:12:00<joepie91>ah yes
20:12:00<joepie91>.com == .co.uk
20:13:00<joepie91>so that should have a listing
20:13:00<SmileyG>[20:39:41] < SmileyG> http://www.btinternet.com/new/content/custhome/listings/A.shtml
20:13:00<SmileyG>joepie91: those indexes are outta date it seems :(
20:13:00<joepie91>:(
20:14:00<alard>This is a nice username: http://www.c.h.f.btinternet.co.uk/
20:14:00<joepie91>mmm
20:15:00<joepie91>SmileyG, what are you basing it on that it's outdated?
20:15:00<primus>lots of them redirect to yahoo login
20:15:00<SmileyG>the first 10 or so links didn't work for me :<
20:17:00<joepie91>mmm
20:17:00<Zym>well, a lot redirect to yahoo login and a lot just 404
20:17:00<Zym>or rather 500
20:18:00<SketchCow>I'd like this to be a warrior project, and I want us to fucking DEMOLISH this site
20:18:00<SketchCow>This is a chance to get a geocities "right"
20:19:00<joepie91>mmmm
20:19:00<joepie91>anyone up for running a google crawler?
20:19:00<alard>I'm doing a little bit of google crawling now, I'll stop later. 2200 usernames so far.
20:19:00<alard>My ipv6 trick no longer works, so it's going slow.
20:19:00<joepie91>alard: and you're not getting banned?
20:19:00<joepie91>ipv6 trick..?
20:20:00<joepie91>also, in case it's useful, http://git.cryto.net/cgit/crytobooks/tree/crawler/crawler.py has some code that should work again with a few fixes
20:20:00<joepie91>to crawl google (as well as calibres)
20:20:00<joepie91>does pretty well in the not-getting-banned department :P
20:20:00<alard>It used to be that with a /48 ipv6 subnet you could switch ip addresses between /64 subnets.
20:20:00<joepie91>ahh
20:20:00<alard>So as long as you picked a different /64 subnet each time, you could keep on searching.
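A rough sketch of the /48 trick alard describes, assuming you have an entire routed /48 (for instance via tunnelbroker.net, per the earlier exchange) and can bind outgoing sockets to arbitrary addresses in it. The prefix below is a documentation placeholder.

    import random
    import ipaddress

    PREFIX = ipaddress.ip_network("2001:db8:1234::/48")  # placeholder prefix

    def random_source_address():
        # A /48 contains 2**16 different /64 subnets; pick one at random,
        # then a random host address inside it. A rate limit keyed on /64
        # then sees each request as coming from a different network.
        subnet_index = random.getrandbits(16)
        host_index = random.getrandbits(64)
        return ipaddress.ip_address(
            int(PREFIX.network_address) + (subnet_index << 64) + host_index
        )

    print(random_source_address())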
20:21:00<joepie91>hmm... alard: what if you spread the task between warriors?
20:21:00<joepie91>the googling
20:21:00<alard>It would be nice if we could find this newsgroup: btinternet.homepages.announce
20:22:00<C-Keen>probably on their nntp server
20:22:00<alard>1. I don't have a googling warrior task, no infra to handle the results. 2. You'd have a lot of unused time for the warrior.
20:22:00<joepie91>also, alard, another option although it may be a bit strange, is downloading database dumps from hacks etc, extracting btinternet e-mail addresses, and checking if a site exists for the username used for the email
20:22:00<joepie91>hmm
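A small sketch of joepie91's dump-mining idea: pull btinternet usernames out of leaked e-mail address lists, then probe whether a homepage exists for each. The regex and function name are illustrative.

    import re

    EMAIL_RE = re.compile(r"\b([a-z0-9._-]+)@btinternet\.com\b", re.IGNORECASE)

    def usernames_from_dump(path):
        # Collect the local parts of btinternet.com addresses in a text dump.
        seen = set()
        with open(path, errors="ignore") as f:
            for line in f:
                for match in EMAIL_RE.finditer(line):
                    seen.add(match.group(1).lower())
        return seen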
20:23:00<alard>So to make this a warrior project, we'd need to: 1. figure out a good wget command that downloads everything for a user 2. make a warrior project (with rsync upload) 3. have a list of users
20:23:00<SmileyG>wget -m with it limited to the ~user won't work?
20:24:00<alard>Yes, something like wget -m.
20:25:00<C-Keen>list of users is the main thing
20:26:00<alard>and page-requisites, and something to prevent infinite recursion, and adjust-extension, and ... ?
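A guess at the shape of the wget command being discussed; the real flags ended up in the btinternet-grab pipeline.py linked further down, so treat this as a sketch rather than the project's actual invocation.

    import subprocess

    def grab_user(username):
        url = "http://www.btinternet.com/~%s/" % username
        subprocess.call([
            "wget",
            "--mirror",               # recursive download with timestamping
            "--page-requisites",      # also fetch images/CSS/JS each page needs
            "--adjust-extension",     # add .html extensions, helps wayback
            "--no-parent",            # stay inside the user's directory
            "--warc-file=btinternet-%s" % username,
            url,
        ])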
20:27:00<alard>SketchCow: Can you make a btinternet thing on fos?
20:27:00<alard>(We'll need it at some point.)
20:32:00<DFJustin>I'm sure there's lots of users whose index pages don't link to everything on the account
20:32:00<SmileyG>DFJustin: in those cases we are screwed afaik
20:33:00<DFJustin>just grabbing a list of sub-urls from google would probably improve it substantially
20:33:00<SmileyG>unless google has randomly crawled them due to linking at some point
20:33:00<soultcer>I'll generate a list of all shorturls that link to btinternet if that helps
20:34:00<DFJustin>there's also http://wayback.archive.org/web/*/http://www.btinternet.com/~shawater/*
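A sketch of DFJustin's suggestion: ask the Wayback Machine which sub-URLs it already knows for a user. This assumes the wayback CDX query endpoint; parameter names should be checked against the current documentation.

    import urllib.request

    def wayback_urls(username):
        # Prefix query: every captured URL under the user's directory.
        query = (
            "http://web.archive.org/cdx/search/cdx"
            "?url=btinternet.com/~%s/&matchType=prefix"
            "&fl=original&collapse=urlkey" % username
        )
        with urllib.request.urlopen(query) as resp:
            return resp.read().decode().splitlines()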
20:35:00<joepie91>well, yeah... when crawling google for usernames, why not save all the URLs along the way?
20:35:00<joepie91>:P
20:36:00<SmileyG>well, you're gonna know the username, but yeah
20:36:00<SmileyG>add the pages to the list for that username...
20:36:00<joepie91>thing is... google seems to pick up URLs from a lot of places
20:37:00<joepie91>not just crawling
20:37:00<joepie91>I've more than once seen URLs show up in the search engine index simply because it was pasted in google docs, sent via gmail, opened in a browser that uses google safe browsing, ...
20:37:00<joepie91>even if we can't find it, maybe google has
20:38:00<chronomex>google safe browsing ought not submit urls, they use a bloom filter to vastly reduce queries
20:39:00<chronomex>in most cases the url doesn't leave your machine
20:45:00<joepie91>... I have an idea, one moment
20:46:00<joepie91>http://ip.robtex.com/213.123.20.90.html#shared
20:46:00<joepie91>well, that's a start
20:48:00<joepie91>yeah, looks like all free hosted sites are on 213.123.20.90
20:48:00<soultcer>Oh, good idea
20:48:00<joepie91>literally all of them
20:48:00<joepie91>the IPs around it host the FTP server, real domains, and internal servers
20:48:00<joepie91>lol @ them being blacklisted
20:49:00<joepie91>this has a few as well: http://support.clean-mx.de/clean-mx/viruses.php?ip=213.123.20.90&sort=email%20asc
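A cheap check suggested by the robtex result above: if all the free sites sit on 213.123.20.90, a candidate username can be validated by resolving its vhost. On shared-IP hosting this only confirms the hostname exists, not that any pages do; a follow-up HTTP request would still be needed.

    import socket

    FREE_HOSTING_IP = "213.123.20.90"

    def username_resolves(username):
        try:
            host = "www.%s.btinternet.co.uk" % username
            return socket.gethostbyname(host) == FREE_HOSTING_IP
        except socket.gaierror:
            return False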
21:02:00<Wack0>what's going on
21:03:00<DFJustin>http://bt.custhelp.com/app/answers/detail/a_id/39105/~/the-free-bt-web-hosting-service-is-closing
21:04:00<Wack0>oh
21:08:00<alard>https://github.com/ArchiveTeam/btinternet-grab
21:08:00<alard>More specifically, what about these wget options? https://github.com/ArchiveTeam/btinternet-grab/blob/master/pipeline.py#L65-86
21:12:00<soultcer>alard: Seems fine, but wouldn't that fetch everything twice, using both URLs?
21:12:00<alard>Yes.
21:12:00<soultcer>Guess that's better if it goes into the wayback machine.
21:12:00<soultcer>Does wget-warc support dedup?
21:13:00<alard>A little bit, but not for items with different urls.
21:14:00<soultcer>How many users do you already have?
21:14:00<alard>2250
21:15:00<SmileyG>:/
21:15:00<SmileyG>one or two people should be able to bang this out...
21:15:00<alard>We're going to find more, right?
21:16:00<SmileyG>I hope so :D
21:16:00<soultcer>I hope so, I am searching both urlteam data and the domain name system for usernames
21:20:00<SmileyG>sry guys I'm utterly out of it atm, this cold is really hammering me
21:35:00<soultcer>alard: http://helo.nodes.soultcer.com/btinternet-dns.txt (usernames for btinternet)
21:38:00<alard>Thanks. Processed items: 1297
21:38:00<alard>http://tracker.archiveteam.org/btinternet/
21:45:00<soultcer>alard: http://helo.nodes.soultcer.com/btinternet-urlteam1.txt (from URLteam this time), I'll have a second list where I scan for the co.uk version of the URL, but I have to go to bed now so I can only send that last one to you tomorrow
21:48:00<alard>soultcer: Added. Processed items: 960, added to main queue: 621
21:48:00<alard>(And going to bed is a sensible idea. I might copy that.)
21:49:00<soultcer>Well, extracting the URLs will take a couple of hours. I'm not going to bed right now, but I will be before the processing is done
21:50:00<alard>I can give you an account on the tracker so you can add them yourself, if you'd like.
21:50:00<alard>(Not sure how many more usernames you'll be trying to find.)
21:51:00<soultcer>Only the ones from the second run over the URLteam data
21:51:00<soultcer>I have no other data sources available
21:51:00<soultcer>I am actually quite surprised I even found any hostnames via DNS at all
22:15:00<godane>underscor: I added another: http://archive.org/details/cdrom-linuxformatmagazine-154
22:16:00<godane>before you say its too early to upload: http://linuxformat.com/archives?issue=154
23:27:00<SketchCow>alard: You have alardland to make new subworlds
23:35:00<chronomex>I wonder how big soundcloud is