#archiveteam<efnet> log for 2013-01-10


02:06:00	<arisboch>	Hi, does anyone know about yourfanfiction.com? They went offline. Any word from the archive team about backups? yourfanfiction.com said some time in advance that they were in danger.
02:10:00	<chronomex>	never heard of them
03:16:00	<SketchCow>	OK, that's enough of nemo's stuff
03:16:00	<SketchCow>	That's ONE way to spend six hours
06:07:00	<twrist>	So..
06:07:00	<twrist>	I cannot seem to run yahooblog-grab
06:08:00	<twrist>	Running pipeline.py just quits without any output
11:51:00	<Pimpollo>	www.jizzday.com
14:04:00	<Nemo_bis>	All accounts will be backed up and made available for download (actually, you can do this now, but the new backups will be offline on archive.org and available forever.)
14:04:00	<Nemo_bis>	http://status.net/2013/01/09/preview-of-changes-to-identi-ca
14:04:00	<Nemo_bis>	To be trusted?
14:05:00	<GLaDOS>	Could be a decoy to hold us off.
14:06:00	<GLaDOS>	Do it anyway, for the sake of doing it.
14:06:00	<GLaDOS>	...unless they're in here.
16:05:00	<Coderjoe>	if anything has been uploaded to IA, it can be checked on, even if it is dark
16:08:00	<Coderjoe>	though knowing part of the identifier, or the uploader or collection helps find it
16:17:00	<Nemo_bis>	Coderjoe: what do you mean checked on?
16:17:00	<Coderjoe>	it can be looked up in the catalog
16:18:00	<Coderjoe>	though only admins or possibly local IA users would be able to access the files
16:18:00	<Nemo_bis>	with wildcard search you mean
16:18:00	<Nemo_bis>	I should use that I guess.
16:18:00	<Coderjoe>	or the metamanager
16:19:00	<Coderjoe>	which I don't know if there is limited access to
16:20:00	<Coderjoe>	i think the features to do any changes require rights, but just doing queries doesn't
16:21:00	<Nemo_bis>	changes require shell access; IIRC queries require adminship
16:28:00	<Coderjoe>	i don't think shell access is needed for the changes I refer to in the metamanager
16:28:00	<Coderjoe>	(like moving items between collections and the like)
16:34:00	<Nemo_bis>	hm
16:35:00	<godane1>	start my uploading of the screen savers as one per item: http://archive.org/details/The.Screen.Savers.2004.04.01
16:35:00	<godane1>	*i started
17:05:00	<underscor>	Viewing metamgr requires you to pass User::any_admin()
17:05:00	<underscor>	Which effectively means you need to be a collection owner
17:05:00	<underscor>	You get (more?) buttons if you're User::slash_admin()
17:06:00	<underscor>	which basically means you can frobnicate any item
17:06:00	<underscor>	To view a dark item's files, though, you have to have shell access
17:06:00	<underscor>	on the datanodes
17:15:00	<godane1>	underscor: can you find out why new robots.txt override older ones?
17:15:00	<godane1>	example of older robots.txt: http://web.archive.org/web/20040630192118/http://cetips.com/robots.txt
17:16:00	<godane1>	example of the newer ones: http://web.archive.org/web/20111004042827/http://cetips.com/robots.txt
17:17:00	<Coderjoe>	"oh crap. we didn't want that to be stored on IA." perhaps?
17:18:00	<godane1>	more like the newer ones is cause of that bad website sitter that blocks ia bots
17:24:00	<underscor>	It is a policy decision to avoid legal drama.
17:25:00	<underscor>	But, in theory, we could tie website captures to whatever state the robots.txt had at the time of the capture
17:25:00	<underscor>	instead of the current one
17:25:00	<underscor>	However, there are a lot of people who use the robots.txt block thinking that their stuff wouldn't show up again, so we'd need some other way to "opt out"
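
[Editor's sketch] underscor's idea of tying each capture to the robots.txt that was in force when it was captured, rather than to the current one, could look roughly like the following. This is only an illustration and not how the Wayback Machine works: the snapshot list, the timestamps, the robots.txt contents, and the function names `robots_at` and `capture_visible` are all hypothetical, and treating "no known robots.txt" as "no restriction" is an assumption.

```python
from bisect import bisect_right
from urllib.robotparser import RobotFileParser

def robots_at(snapshots, capture_ts):
    """Return the robots.txt text in force at capture_ts, or None.

    snapshots is a list of (timestamp, robots_txt) pairs sorted by
    timestamp. Timestamps are 14-digit YYYYMMDDhhmmss strings, which
    sort correctly as plain strings.
    """
    stamps = [ts for ts, _ in snapshots]
    i = bisect_right(stamps, capture_ts)
    return snapshots[i - 1][1] if i else None

def capture_visible(snapshots, capture_ts, url, agent="ia_archiver"):
    """Judge a capture against the robots.txt of its own time."""
    text = robots_at(snapshots, capture_ts)
    if text is None:
        # Assumption: no robots.txt known before the capture means no block.
        return True
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser.can_fetch(agent, url)

# Hypothetical history: an open robots.txt, later replaced by a blanket block.
snapshots = [
    ("20040101000000", "User-agent: *\nDisallow:\n"),
    ("20110101000000", "User-agent: *\nDisallow: /\n"),
]

print(capture_visible(snapshots, "20050615000000", "http://example.com/page"))
print(capture_visible(snapshots, "20120101000000", "http://example.com/page"))
```

Under this scheme the 2005 capture stays visible and only captures made after the blanket block are hidden, which is why a separate opt-out mechanism would still be needed for owners who want older captures removed.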
17:58:00	<balrog_>	godane1: the new robots.txt doesn't seem to imply that ia_archiver should be blocked at all?
18:00:00	<godane1>	i was able to look at this site more than a year ago
18:00:00	<godane1>	but now i can't
18:00:00	<godane1>	this is in the newer robots.txt: User-Agent: ia_archiver
18:01:00	<godane1>	and a Disallow: is under that
18:03:00	<balrog_>	godane1: yes but nothing is specified to be disallowed.
18:04:00	<godane1>	i know that
18:04:00	<balrog_>	isn't that a bug in ia_archiver? :/
18:04:00	<godane1>	maybe
18:04:00	<balrog_>	I brought this up a week or two ago
18:04:00	<balrog_>	it sure looks like a bug
18:05:00	<godane1>	maybe it just blocks anything if ia_archiver comes up
18:05:00	<balrog_>	yeah but what if I specifically want to ALLOW ia_archiver for a site? (as it seems to be the case here)
18:05:00	<balrog_>	http://www.robotstxt.org/robotstxt.html states that the syntax on that site is "To allow a single robot"
18:05:00	<balrog_>	this is most definitely a bug ;(
18:06:00	<balrog_>	err no it isn't
18:06:00	<balrog_>	the reason it's blocked is this: http://web.archive.org/web/20120819150435/http://spi.domainsponsor.com/ds_robots.txt
18:06:00	<balrog_>	a rogue robots.txt
18:06:00	<godane1>	thats what i thought too
18:07:00	<godane1>	maybe archive should have a special black list of robots.txt
18:07:00	<godane1>	if something like that comes up then just ignore it
18:07:00	<balrog_>	or apply only to current/future crawls
18:07:00	<balrog_>	and don't black out older ones
18:08:00	<balrog_>	that would make the most sense
18:08:00	<godane1>	that works too
18:08:00	<balrog_>	http://archive.org/post/423432/domainsponsorcom-erasing-prior-archived-copies-of-135000-domains - though jory2 derailed the thread :(
18:10:00	<balrog_>	(at the end)
18:10:00	<balrog_>	http://archive.org/post/433169/domainsponsorcom-monikercom-re-deleted-archive-after-domain-backorder etc
18:16:00	<schbiridi>	maybe ia could only block if the whois did not change between the time of retrieval and the current robots.txt
18:19:00	<DFJustin>	it seems like the main problem is these large-scale squatters and that could be taken care of with a few special cases
18:20:00	<balrog_>	DFJustin: yes, this is the main problem - the large-scale squatters
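
[Editor's sketch] balrog_'s reading of the syntax (a `User-agent: ia_archiver` record followed by an empty `Disallow:` allows that robot) can be checked with Python's standard `urllib.robotparser`. The robots.txt text below is the "allow a single robot" pattern from robotstxt.org, not the actual cetips.com or domainsponsor.com file, and the helper name `allowed` and the second bot name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

# "To allow a single robot": an empty Disallow means nothing is disallowed
# for ia_archiver, while every other robot is blocked from the whole site.
ALLOW_ONE = """\
User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
"""

print(allowed(ALLOW_ONE, "ia_archiver", "http://example.com/index.html"))   # True
print(allowed(ALLOW_ONE, "SomeOtherBot", "http://example.com/index.html"))  # False
```

This matches the conclusion reached in the channel: the empty `Disallow:` is not what blocks the archive, so the block has to come from a different robots.txt.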
18:40:00	<SketchCow>	The bitsavers ingestion has hit its stride!
18:50:00	<Nemo_bis>	Nice, derivers were lazying again.
18:54:00	<Smiley>	MOAR FATA.
19:27:00	<SketchCow>	http://archive.org/details/bitsavers now is starting to have individual companies
21:09:00	<guigui>	hello! how do I know/limit how much disk space the warrior uses?
21:38:00	<SketchCow>	I think it's under a gig, isn't it?
21:40:00	<alard>	The disk image can grow to up to 60 GB.
21:41:00	<SketchCow>	!
21:41:00	<alard>	There's a way to give it more space, or less: disconnect the "data" disk from the VM, create a new virtual disk image of the size you want and connect it.
21:41:00	<alard>	The warrior will format the new drive when it boots.
21:43:00	<alard>	(The problem with these virtual disk images is that they only grow, never shrink. So even though the warrior removes the downloaded files, the disk image will eventually grow to its full size.)
22:41:00	<ersi>	sigh.
