#archiveteam<efnet> log for 2013-01-09


00:51:00	<SketchCow>	So, unfortunately, it looks like Myspace is now doing a small transition and killing pages
02:42:00	<bsmith093>	you know how some really advanced bulk renamers can add the parent folder(s) to the name of a file? well i need to remove that part. organized like this: stuff/blah/status/blah - authorname - filename.txt. the only matching parts will be the "blah", and it is guaranteed to be a part of the filename
02:56:00	<instence_>	uh
02:56:00	<instence_>	whats your before and after?
02:58:00	<instence_>	you say matching parts, are you trying to regex match those? and only modify those files partially? or?
02:59:00	<instence_>	an app for windows I used to rename stuff is called "ReNamer" works great
03:14:00	<bsmith093>	instence_: before= stuff/blah/status/blah - authorname - filename.txt
03:14:00	<bsmith093>	instence_: after= stuff/blah/status/authorname - filename.txt
03:28:00	<instence_>	with ReNamer that would be quite easy, but I think its a windows only app
03:29:00	<instence_>	http://www.den4b.com/?x=downloads&product=renamer
03:30:00	<instence_>	http://www.den4b.com/?x=screenshots&product=renamer
03:31:00	<instence_>	you can stack rules as well
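The before/after pair bsmith093 gives above can also be handled with a short script instead of a GUI renamer. The following is a minimal sketch, not something anyone in the channel posted: it assumes the folder name and the rest of the filename are always joined by " - ", that only `.txt` files matter, and that the function names `strip_folder_prefix` and `rename_tree` are free to invent for illustration.

```python
from pathlib import Path, PurePosixPath

SEP = " - "  # assumed separator between the folder name and the rest of the filename


def strip_folder_prefix(path):
    """Drop a leading '<ancestor folder> - ' from the filename, if present."""
    p = PurePosixPath(path)
    for folder in p.parent.parts:
        prefix = folder + SEP
        # Only strip when something is left over to use as the new filename.
        if p.name.startswith(prefix) and len(p.name) > len(prefix):
            return str(p.with_name(p.name[len(prefix):]))
    return str(p)


def rename_tree(root, dry_run=True):
    """Walk root and rename matching .txt files; with dry_run, only print the plan."""
    for src in sorted(Path(root).rglob("*.txt")):
        dst = Path(strip_folder_prefix(src.as_posix()))
        if dst != src and not dst.exists():
            print(f"{src} -> {dst}")
            if not dry_run:
                src.rename(dst)
```

Calling `rename_tree("stuff")` first with the default `dry_run=True` prints the planned renames without touching any files, which is a cheap way to check the pattern before committing to it.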
03:46:00	<dashcloud>	so, is there still a channel for fileformat wiki efforts, or it just goes here or -bs?
04:45:00	<tuankiet>	Hello everybody!
04:48:00	<bsmith093>	instence_: how would i do that in renamer, it runs fine in wine, so im using that for now
04:55:00	<tuankiet>	@alard: are there any projects?
06:03:00	<Nemo_bis>	SketchCow: thanks!
06:15:00	<SketchCow>	No problem. Sorry there's still a lag with me this year.
06:16:00	<SketchCow>	I'd hoped to be more archiveteam responsive, but this DEFCON documentary is kicking my aaaaaassssss
08:41:00	<chronomex>	godane: ftp.download.packardbell.com: Downloaded: 2679 files, 28G in 1d 17h 47m 55s (194 KB/s)
08:42:00	<chronomex>	now: time nice ionice -c 3 zip -vr ftp.download.packardbell.com.zip ftp.download.packardbell.com
09:07:00	<godane>	chronomex: thanks for getting it
09:07:00	<godane>	i know that would have taken me forever to get
09:08:00	<chronomex>	:)
09:08:00	<godane>	and to also upload
09:19:00	<chronomex>	yeah, might take a while
09:19:00	<chronomex>	I downloaded a terabyte of ftp last month :P
09:20:00	<Nemo_bis>	chronomex: ah, 200 KB/s, lucky you :)
09:21:00	<Nemo_bis>	NATO still at 40 KB/s
09:21:00	<Nemo_bis>	42 GiB so far
09:22:00	<chronomex>	o_O
09:22:00	<chronomex>	ftp.3gpp.org is huge
09:22:00	<chronomex>	btw.
09:23:00	<chronomex>	350g, iirc
09:23:00	<Nemo_bis>	everything has recent timestamps there
13:54:00	<hiker1>	To what extent does heritrix discover JavaScript and CSS?
14:15:00	<alard>	tuankiet: Well, it's time to start downloading the Yahoo blogs.
14:16:00	<alard>	hiker1: It probably downloads things referenced with <script> or <link rel="stylesheet"> tags, and I think it even has some rules to find images etc. in the actual CSS and JavaScript files.
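To make the kind of discovery alard describes concrete, here is a minimal sketch of pulling `<script src>` and `<link rel="stylesheet">` references out of a page. It is not Heritrix's extractor (Heritrix is a Java crawler with its own extraction rules); it only illustrates the idea with Python's standard `html.parser`, and the names `AssetLinkParser` and `find_assets` are made up for this example. It does not look inside CSS or JavaScript files.

```python
from html.parser import HTMLParser


class AssetLinkParser(HTMLParser):
    """Collect URLs referenced by <script src=...> and <link rel="stylesheet">."""

    def __init__(self):
        super().__init__()
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("src"):
            self.assets.append(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            # rel may hold several space-separated tokens, e.g. "alternate stylesheet".
            rels = (attrs.get("rel") or "").lower().split()
            if "stylesheet" in rels:
                self.assets.append(attrs["href"])


def find_assets(html):
    """Return script and stylesheet URLs in document order."""
    parser = AssetLinkParser()
    parser.feed(html)
    return parser.assets
```

A crawler would resolve each returned URL against the page URL and queue it for download; that step is left out here.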
14:16:00	<hiker1>	How easy is it to set up?
14:17:00	<alard>	It isn't that hard, but it's unwieldy.
14:17:00	<hiker1>	I wanted to test it on a single site.
14:18:00	<hiker1>	I suppose it's probably not worth the hassle
14:18:00	<ersi>	Neither Heritrix nor Wayback is easy to set up
14:22:00	<hiker1>	sigh.
14:22:00	<hiker1>	Maybe someone that knows how could release a VirtualBox image with it already installed and ready to accept a warc file?
14:24:00	<hiker1>	alard: There is a python library called mitmproxy. Might be useful to proxy the HTTPS records: http://mitmproxy.org/
14:24:00	<hiker1>	Right now I am using a simple rewrite modification to warc-proxy to get them sent.
14:24:00	<hiker1>	very, very rudimentary.
14:24:00	<ersi>	I've fiddled a little with it, and plan to maybe continue - but we'll see (RE: wayback, heritrix)
14:24:00	<godane>	so i just found a very good copy of the screen savers episode from 2003
14:25:00	<godane>	Kevin Rose uploaded it too :-D
14:25:00	<ersi>	OH MY GOD!
14:29:00	<godane>	https://www.youtube.com/user/kevinrose
14:29:00	<godane>	i found it on his youtube channel
14:29:00	<godane>	i may have to email him so i get more episodes of tss
14:31:00	<godane>	he has about 50 episodes of the screen savers in mp4
14:31:00	<godane>	:-D
14:40:00	<hiker1>	WARC doesn't replay the actual browser sessions, only the traffic. Some JavaScript scripts I found appear to append a callback handle to the url that is generated at runtime based on a live JS object. WARC can not replay this behavior.
14:42:00	<hiker1>	Technically it does archive all the information that a website outputs, but some of the information is impractical to use or view without extensive modifications to the JavaScript.
14:44:00	<hiker1>	It makes me think of an HTML5 game http://wordsquared.com/. You can download all the traffic, but you will never be able to see the game properly I think.
14:44:00	<alard>	hiker1: And? Or are you just thinking aloud? :)
14:44:00	<alard>	You could fix individual sites, but there's no general solution, I think.
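One possible per-site fix for the runtime callback problem hiker1 describes is to ignore the volatile query parameter when looking up an archived record. The sketch below is an assumption-heavy illustration, not part of warc-proxy or any tool mentioned here: the parameter names in `VOLATILE_PARAMS` are guesses that would have to be adjusted per site, and `normalize_url` is a hypothetical helper.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed names of query parameters generated at runtime; adjust per site.
VOLATILE_PARAMS = {"callback", "_"}


def normalize_url(url):
    """Return url without the volatile query parameters, for record lookup."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in VOLATILE_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

This only helps a replay tool find the stored response. The archived body would still wrap its data in the callback name used at capture time, so the response or the page's JavaScript would need rewriting as well, which is why there is no general solution.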
14:53:00	<hiker1>	thinking aloud :)
14:53:00	<hiker1>	I noticed this while attempting to archive a website just now.
14:54:00	<tuankiet>	@alard: Tracker rate limiting is in effect. Retrying after 30 seconds... :((
14:57:00	<alard>	tuankiet: Yes, there was something wrong yesterday. I'm now gathering some files to debug with. (Until I got distracted by wordsquared just now. :)
14:57:00	<hiker1>	hah xD
15:00:00	<tuankiet>	@alard: Oh, running again. I've just restarted VMs to update the code :))
15:11:00	<alard>	Good. Found the problem: HTTP/1.1 999 Unable to process request at this time -- error 999
15:12:00	<alard>	What's the best way to handle those? Wait and retry?
15:13:00	<Nemo_bis>	ah, as it was feared
15:14:00	<balrog->	that means you are being throttled
15:15:00	<balrog->	http://www.murraymoffatt.com/software-problem-0011.html
15:16:00	<alard>	It's Nemo_bis, in this case.
15:17:00	<Nemo_bis>	alard: I got that error? but I just started
15:17:00	<balrog->	wow MS is killing messenger
15:17:00	<Nemo_bis>	I have lots of "Project code is out of date and needs to be upgraded. Retrying after 30 seconds..."
15:18:00	<alard>	Yes, I've paused the thing again.
15:18:00	<twrist>	Messenger is being integrated into skype, though.
15:18:00	<twrist>	So yeah.
15:18:00	<alard>	Nemo_bis: In the last few minutes there were 999-warcs from grue, tuankiet, and you.
15:18:00	<Nemo_bis>	hm
15:19:00	<balrog->	twrist: yeah but the protocol, etc are going away
15:19:00	<twrist>	Ah, right.
15:19:00	<ersi>	Super old.
15:19:00	<balrog->	alard: need to detect 999s and throttle
15:19:00	<twrist>	So, what's currently being archived?
15:19:00	<Nemo_bis>	alard: I've switched the warrior to tinyback
15:19:00	<alard>	balrog-: How long to wait? (And does saying you're from Google still work?)
15:20:00	<balrog->	alard: I don't know, I haven't tested - info online says 2-24 hours, but I don't know
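The "detect 999s and throttle" idea can be sketched as a retry loop with a growing delay. This is not the yahooblog-grab script's actual logic, which is not shown in this log. The starting delay, the doubling, and the retry limit are all assumptions, since nobody here knows how long Yahoo's throttling lasts. `fetch` is a hypothetical callable returning `(status, body)`, so the sketch stays independent of any HTTP library.

```python
import time

THROTTLE_STATUS = 999  # non-standard status Yahoo returned when throttling


def fetch_with_backoff(fetch, url, max_tries=5, base_delay=30.0, sleep=time.sleep):
    """Call fetch(url) until it stops returning 999, doubling the wait each time.

    base_delay and max_tries are illustrative guesses, not measured values.
    """
    delay = base_delay
    for attempt in range(1, max_tries + 1):
        status, body = fetch(url)
        if status != THROTTLE_STATUS:
            return status, body
        if attempt < max_tries:
            print(f"Got {status} for {url}; retrying after {delay:.0f} seconds...")
            sleep(delay)
            delay *= 2
    raise RuntimeError(f"still throttled after {max_tries} tries: {url}")
```

Raising once the limit is reached, rather than saving the 999 response, keeps throttled pages from being stored as if they were real content; the item can then be returned to the tracker for a later attempt.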
15:20:00	<Nemo_bis>	can it be that Yahoo is suspicious because it sees activity from my IP on flickr etc. as logged in user?
15:20:00	<Nemo_bis>	it definitely can't be bandwidth in my case
15:21:00	<tuankiet>	Bad thing now
15:22:00	<alard>	Nemo_bis: Perhaps you're normally less active on Asian blogs.
15:23:00	<twrist>	Give me a git URL to clone, guys.
15:23:00	<ersi>	At what project are you guys getting HTTP 999's?
15:24:00	<twrist>	I'm itching to join in.
15:24:00	<ersi>	twrist: http://github.com/archiveteam/
15:24:00	<twrist>	Need to be a bit more precise, I'm using ubuntu server and IRSSI
15:24:00	<twrist>	I only just started as well
15:24:00		twrist is GLaDOS, FYI
15:25:00	<ersi>	I think they're doing yahooblogs-grab right now
15:25:00	<twrist>	ah
15:25:00	<twrist>	so https://github.com/archiveteam/yahooblogs-grab.git?
15:25:00	<ersi>	yeah..
15:25:00	<alard>	twrist: There's not much sense starting right now, we need to update the script.
15:26:00	<alard>	ersi: blog.yahoo.com
15:26:00	<twrist>	ah
15:28:00	<tuankiet>	Or using Tor so we won't have 999 again. But the speed is super low :))
15:29:00	<twrist>	The URL I typed out isn't working.
15:29:00	<twrist>	Anyone else able to paste it in here?
15:29:00	<Deewiant>	https://github.com/ArchiveTeam/yahooblog-grab.git
15:30:00	<twrist>	ah, no s
15:37:00	<twrist>	so the arguments were --downloader=name --concurrent=6?
15:43:00	<alard>	Yes. There's a new version that should handle the 999 error better.
15:54:00	<goekesmi>	ls
15:54:00		goekesmi sighs.
15:54:00	<hiker1>	xD
16:04:00	<chazchaz>	Is there a channel for yahooblog-grab?
16:42:00	<SketchCow>	I suggest #yahooblah
16:46:00	<Coderjoe>	O_O yahoo blog is from yahoo korea?
18:13:00	<alard>	I think the current version of the script works better. (There are fewer 0MB items, and it's much slower.)
19:09:00	<hiker1>	Is anyone archiving stuff from Tor?
19:24:00	<swebb>	I used tor once to auto-change my IP when grabbing some stuff from google, but it was way slow.
19:25:00	<hiker1>	well, yeah. But there are some websites which are tor only.
19:28:00		ats raw-images an extremely dodgy floppy four times using two different Amiga drives, converts using disk-analyser, merges the resulting partial images back together giving a full image, and peers happily at the first bits of email he ever sent :)
19:28:00	<balrog->	what are you using to merge?
19:30:00	<ats>	rawadf off aminet, patched to not complain about the number of tracks in the .eadf files disk-analyser produces
19:30:00	<ats>	I also had to patch disk-analyser to not write junk into the EADF track header structure...
19:32:00	<ats>	then disk-analyser again to turn (raw-track) EADF into (AmigaDOS-track) ADF, adfread to extract the files from the filesystem, and unar to extract the .lzx archives on the floppy
19:52:00	<hiker1>	If anyone is bored of archiving with wget, please try my WarcMiddleware. I'd be glad to assist in setting it up. https://github.com/iramari/WarcMiddleware
20:34:00	<Nemo_bis>	alard: how do I know if I'm still collecting mostly useless 999 crap, in case I work on Yahoo?
21:02:00	<alard>	Nemo_bis: Hard to say. It shouldn't, it should retry (and print a message).
21:04:00	<Nemo_bis>	ok
21:05:00	<Nemo_bis>	TinyBack was getting ratelimited anyway
21:38:00	<SketchCow>	Nemo_bis: http://archive.org/details/magazine_rack
21:38:00	<Nemo_bis>	SketchCow: Pretty!!!
21:39:00	<Nemo_bis>	Are you going to make some of those dark?
21:39:00	<SketchCow>	Ostensibly
21:40:00	<Nemo_bis>	:)
21:45:00	<SketchCow>	Like, Wood Magazine will probably disappear.
21:50:00	<Nemo_bis>	But... children in Africa will DIE if we don't let them know how to build life-saving wood stuff, in English, on a website!
21:53:00	<Nemo_bis>	On eMule and eMule only there's also another 5 GiB archive of another woodworking magazine. Surely the same woodworking geek scanner.
21:54:00	<chronomex>	haha
21:55:00	<SketchCow>	Which one?
21:55:00	<SketchCow>	You have so many here.
21:56:00	<SketchCow>	http://archive.org/details/general_magazine
21:56:00	<SketchCow>	http://archive.org/details/woodsmith_magazin
21:57:00	<SketchCow>	http://archive.org/details/woodsmith_magazine I mean
21:58:00	<SketchCow>	How long was this uploading, Nemo_bis?
21:58:00	<Nemo_bis>	SketchCow: I don't know, a few days of work for the CSV maybe.
21:58:00	<Nemo_bis>	I didn't measure the time for download and upload in itself.
22:00:00	<Nemo_bis>	Also a few hours of trackers browsing and other searches.
22:01:00	<Nemo_bis>	http://p.defau.lt/?YTRaoQFxExjw8T612Pl_XQ
22:03:00	<SketchCow>	In the future, like godane, I can just browse your uploads and see what you haven't had pushed into a collection and make it happen.
22:03:00	<SketchCow>	Your activities also get the attention of the devs, who see it come by
22:08:00		Nemo_bis hopes not to get too many curses
22:08:00	<Nemo_bis>	I thought sending you a nice list at the end of the job was going to be helpful?
22:09:00	<SketchCow>	No.
22:09:00	<SketchCow>	Doesn't help and it actually gets caught in the spam filter
22:10:00	<SketchCow>	Because someone from italy is mailing me piles of URLs
22:10:00	<Nemo_bis>	Oh, even.
22:10:00	<chronomex>	:P
22:12:00	<SketchCow>	Also, the vokrugsveta collection didn't make it through the fun
22:12:00	<SketchCow>	I'm going to make it a collection for you, but it needs more love
22:12:00	<Nemo_bis>	Yes, I noticed.
22:13:00	<Nemo_bis>	I didn't look those zips carefully enough, sorry.
22:13:00	<SketchCow>	Yeah, those things are buuuuuuuunk
22:13:00	<SketchCow>	How about I dark them all with a note to delete them?
22:13:00	<Nemo_bis>	Suggestions on how to get something useful out of a FictionBook?
22:13:00	<Nemo_bis>	I'm ok with it.
22:14:00	<SketchCow>	No, wait, this thing is valid.
22:14:00	<SketchCow>	Just not playing with our system
22:14:00	<SketchCow>	FICTIONBOOOOOOOOOK
22:14:00	<SketchCow>	Thanks, Russia
22:14:00	<Nemo_bis>	heh
22:14:00	<Nemo_bis>	It's not even well seeded, by the way.
22:28:00	<hiker1>	Nemo_bis: What did you mean when you said make some of those dark?
22:28:00	<SketchCow>	http://archive.org/details/vokrugsveta
22:28:00	<SketchCow>	we'll see when the gods arise on that one
22:29:00	<mistym>	Nemo_bis: Wikipedia suggests Calibre can convert FictionBook to smth more conventional.
22:30:00	<SketchCow>	https://twitter.com/jefferson_bail/status/289096186420400128
22:49:00	<Nemo_bis>	SketchCow: thanks for fixing it. I liked that tweet too, wondered what syllabus exactly.
22:50:00	<SketchCow>	I'm sure it's related to computer programming, and realizing what was done
22:50:00	<SketchCow>	I asked him to send it along.
22:50:00	<Nemo_bis>	Nice
22:52:00	<SketchCow>	By the way, the guy who wrote the wikipedia entry also wrote a scathing e-mail to archive.org about how we were the pit of evil
22:52:00	<SketchCow>	Good thing I helped bring in so much fundraising last year
22:53:00	<SketchCow>	Also: Ares Magazine is as sexy as sexy gets
22:58:00	<SketchCow>	http://archive.org/details/ares_magazine
23:03:00	<Nemo_bis>	Should still be usable, shouldn't it? With some printing perhaps.
23:05:00	<godane>	stupid question
23:05:00	<godane>	i don't know how to submit a comment on youtube
23:07:00	<SketchCow>	Goood
23:08:00	<godane>	why is that?
23:08:00	<godane>	trying to help kevin rose upload the 50 episodes of the screen savers he has
23:10:00	<godane>	this is the episode in question: https://www.youtube.com/watch?v=ZglwVT5NIJw
23:10:00	<godane>	it's an episode from july 14 2003
23:11:00	<godane>	there are next to no caps for episodes in 2003
23:26:00	<SketchCow>	Example of "I'm just gonna dark it"
23:26:00	<SketchCow>	http://www.woodworkersjournal.com/Main/Store/5_Disc_Annual_Collection_CD_Bundle_20052009_257.aspx
23:30:00	<dashcloud>	here's something interesting I came across today: http://www.emsps.com/oldtools/ They buy and sell old-very old software
23:31:00	<Nemo_bis>	SketchCow: some computer magazines like Pc Open here use the PDFs of their past issues as fillers for DVDs when they don't find enough stuff, it seems.
23:32:00	<Nemo_bis>	Something like 10 % of their CD/DVDs contains either some or all past issues in PDF...
23:32:00	<dashcloud>	Linux Journal definitely does that
23:38:00	<chronomex>	nice
23:46:00	<SketchCow>	So, I don't mind being the guy making these collections, BUT
23:47:00	<SketchCow>	I'd really appreciate it if you do-gooder motherfuckers would walk the collection and find doubles and cases where we have something really shitty when there's known better versions.
23:52:00	<Nemo_bis>	SketchCow: are there more duplicates than those I told you?
23:52:00	<Nemo_bis>	(Question is pointless if email really went to spam.)
23:55:00	<SketchCow>	It did go to spam.
23:58:00	<Nemo_bis>	http://p.defau.lt/?2fxIiFNmvwaO2FBSJdn7fA
23:58:00	<Nemo_bis>	<https://archive.org/search.php?query=%22Toronto%20PET%20User%27s%20Group%22> (duplicate of <https://archive.org/details/tpug-newsletter>, I'm afraid)
23:58:00	<Nemo_bis>	and YourComputer which you had already spotted (and deleted, unless it was someone else)
23:59:00	<Nemo_bis>	I didn't find more in public items.
