00:51:00<SketchCow>So, unfortunately, it looks like Myspace is now doing a small transition and killing pages
02:42:00<bsmith093>you know how some really advanced bulk renamers can add the parent folder(s) to the name of a file? well i need to remove that part. organized like this: stuff/blah/status/blah - authorname - filename.txt. the only matching part will be the "blah", and it is guaranteed to be a part of the filename
02:56:00<instence_>uh
02:56:00<instence_>what's your before and after?
02:58:00<instence_>you say matching parts, are you trying to regex match those? and only modify those files partially? or?
02:59:00<instence_>an app for windows I used to rename stuff is called "ReNamer" works great
03:14:00<bsmith093>instence_: before= stuff/blah/status/blah - authorname - filename.txt
03:14:00<bsmith093>instence_: after= stuff/blah/status/authorname - filename.txt
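(A minimal Python sketch of the rename bsmith093 describes: walk the tree and strip a leading "<parent-of-parent> - " prefix from each file name. Only the stuff/blah/status layout and the " - " separator come from the example above; the root directory name and everything else here is an assumption.)

    import os

    ROOT = "stuff"  # assumed top-level directory, as in the example above

    for dirpath, dirnames, filenames in os.walk(ROOT):
        parts = os.path.normpath(dirpath).split(os.sep)
        if len(parts) < 2:
            continue
        blah = parts[-2]            # the "blah" folder in stuff/blah/status/
        prefix = blah + " - "
        for name in filenames:
            if name.startswith(prefix):
                # "blah - authorname - filename.txt" -> "authorname - filename.txt"
                os.rename(os.path.join(dirpath, name),
                          os.path.join(dirpath, name[len(prefix):]))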
03:28:00<instence_>with ReNamer that would be quite easy, but I think it's a Windows-only app
03:29:00<instence_>http://www.den4b.com/?x=downloads&product=renamer
03:30:00<instence_>http://www.den4b.com/?x=screenshots&product=renamer
03:31:00<instence_>you can stack rules as well
03:46:00<dashcloud>so, is there still a channel for fileformat wiki efforts, or it just goes here or -bs?
04:45:00<tuankiet>Hello everybody!
04:48:00<bsmith093>instence_: how would i do that in ReNamer? it runs fine in wine, so i'm using that for now
04:55:00<tuankiet>@alard: are there any projects?
06:03:00<Nemo_bis>SketchCow: thanks!
06:15:00<SketchCow>No problem. Sorry there's still a lag with me this year.
06:16:00<SketchCow>I'd hoped to be more archiveteam responsive, but this DEFCON documentary is kicking my aaaaaassssss
08:41:00<chronomex>godane: ftp.download.packardbell.com: Downloaded: 2679 files, 28G in 1d 17h 47m 55s (194 KB/s)
08:42:00<chronomex>now: time nice ionice -c 3 zip -vr ftp.download.packardbell.com.zip ftp.download.packardbell.com
09:07:00<godane>chronomex: thanks for getting it
09:07:00<godane>i know that would have taken me forever to get
09:08:00<chronomex>:)
09:08:00<godane>and to also upload
09:19:00<chronomex>yeah, might take a while
09:19:00<chronomex>I downloaded a terabyte of ftp last month :P
09:20:00<Nemo_bis>chronomex: ah, 200 KB/s, lucky you :)
09:21:00<Nemo_bis>NATO still at 40 KB/s
09:21:00<Nemo_bis>42 GiB so far
09:22:00<chronomex>o_O
09:22:00<chronomex>ftp.3gpp.org is huge
09:22:00<chronomex>btw.
09:23:00<chronomex>350g, iirc
09:23:00<Nemo_bis>everything has recent timestamps there
13:54:00<hiker1>To what extent does heritrix discover JavaScript and CSS?
14:15:00<alard>tuankiet: Well, it's time to start downloading the Yahoo blogs.
14:16:00<alard>hiker1: It probably downloads things referenced with <script> or <link rel="stylesheet"> tags, and I think it even has some rules to find images etc. in the actual CSS and JavaScript files.
14:16:00<hiker1>How easy is it to set up?
14:17:00<alard>It isn't that hard, but it's unwieldy.
14:17:00<hiker1>I wanted to test it on a single site.
14:18:00<hiker1>I suppose it's probably not worth the hassle
14:18:00<ersi>Neither Heritrix nor Wayback is easy to set up
14:22:00<hiker1>sigh.
14:22:00<hiker1>Maybe someone that knows how could release a VirtualBox image with it already installed and ready to accept a warc file?
14:24:00<hiker1>alard: There is a python library called mitmproxy. Might be useful to proxy the HTTPS records: http://mitmproxy.org/
14:24:00<hiker1>Right now I am using a simple rewrite modification to warc-proxy to get them sent.
14:24:00<hiker1>very, very rudimentary.
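(A rough sketch of the mitmproxy idea, not the warc-proxy modification hiker1 is actually using: an addon that writes every proxied response into a WARC file. It assumes the current mitmproxy addon API and the warcio library; treat the script name and the output file as placeholders.)

    # run with: mitmdump -s warc_dump.py   (script name is a placeholder)
    from io import BytesIO

    from mitmproxy import http
    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders

    class WarcDump:
        def __init__(self):
            # one gzipped WARC file for the whole proxy session
            self.writer = WARCWriter(open("proxy.warc.gz", "wb"), gzip=True)

        def response(self, flow: http.HTTPFlow) -> None:
            # rebuild the HTTP status line and headers for the WARC record
            headers = StatusAndHeaders(
                "%d %s" % (flow.response.status_code, flow.response.reason),
                list(flow.response.headers.items()),
                protocol="HTTP/1.1",
            )
            record = self.writer.create_warc_record(
                flow.request.url, "response",
                payload=BytesIO(flow.response.content),
                http_headers=headers,
            )
            self.writer.write_record(record)

    addons = [WarcDump()]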
14:24:00<ersi>I've fiddled a little with it, and plan to maybe continue - but we'll see (RE: wayback, heritrix)
14:24:00<godane>so i just found a very good copy of the screen savers episode from 2003
14:25:00<godane>Kevin Rose uploaded it too :-D
14:25:00<ersi>OH MY GOD!
14:29:00<godane>https://www.youtube.com/user/kevinrose
14:29:00<godane>i found it on his youtube channel
14:29:00<godane>i may have to email him so i get more episodes of tss
14:31:00<godane>he has about 50 episodes of the screen savers in mp4
14:31:00<godane>:-D
14:40:00<hiker1>WARC doesn't replay the actual browser sessions, only the traffic. Some JavaScript scripts I found appear to append a callback handle to the URL that is generated at runtime based on a live JS object. WARC cannot replay this behavior.
14:42:00<hiker1>Technically it does archive all the information that a website outputs, but some of the information is impractical to use or view without extensive modifications to the JavaScript.
14:44:00<hiker1>It makes me think of an HTML5 game http://wordsquared.com/. You can download all the traffic, but you will never be able to see the game properly I think.
14:44:00<alard>hiker1: And? Or are you just thinking aloud? :)
14:44:00<alard>You could fix individual sites, but there's no general solution, I think.
14:53:00<hiker1>thinking aloud :)
14:53:00<hiker1>I noticed this while attempting to archive a website just now.
14:54:00<tuankiet>@alard: Tracker rate limiting is in effect. Retrying after 30 seconds... :((
14:57:00<alard>tuankiet: Yes, there was something wrong yesterday. I'm now gathering some files to debug with. (Until I got distracted by wordsquared just now. :)
14:57:00<hiker1>hah xD
15:00:00<tuankiet>@alard: Oh, running again. I've just restarted VMs to update the code :))
15:11:00<alard>Good. Found the problem: HTTP/1.1 999 Unable to process request at this time -- error 999
15:12:00<alard>What's the best way to handle those? Wait and retry?
15:13:00<Nemo_bis>ah, as it was feared
15:14:00<balrog->that means you are being throttled
15:15:00<balrog->http://www.murraymoffatt.com/software-problem-0011.html
15:16:00<alard>It's Nemo_bis, in this case.
15:17:00<Nemo_bis>alard: I got that error? but I just started
15:17:00<balrog->wow MS is killing messenger
15:17:00<Nemo_bis>I have lots of "Project code is out of date and needs to be upgraded. Retrying after 30 seconds..."
15:18:00<alard>Yes, I've paused the thing again.
15:18:00<twrist>Messenger is being integrated into skype, though.
15:18:00<twrist>So yeah.
15:18:00<alard>Nemo_bis: In the last few minutes there were 999-warcs from grue, tuankiet, and you.
15:18:00<Nemo_bis>hm
15:19:00<balrog->twrist: yeah but the protocol, etc are going away
15:19:00<twrist>Ah, right.
15:19:00<ersi>Super old.
15:19:00<balrog->alard: need to detect 999s and throttle
15:19:00<twrist>So, what's currently being archived?
15:19:00<Nemo_bis>alard: I've switched the warrior to tinyback
15:19:00<alard>balrog-: How long to wait? (And does saying you're from Google still work?)
15:20:00<balrog->alard: I don't know, I haven't tested — info online says 2-24 hours, but I don't know
15:20:00<Nemo_bis>can it be that Yahoo is suspicious because it sees activity from my IP on flickr etc. as logged in user?
15:20:00<Nemo_bis>it definitely can't be bandwidth in my case
15:21:00<tuankiet>Bad thing now
15:22:00<alard>Nemo_bis: Perhaps you're normally less active on Asian blogs.
15:23:00<twrist>Give me a git URL to clone, guys.
15:23:00<ersi>At what project are you guys getting HTTP 999's?
15:24:00<twrist>I'm itching to join in.
15:24:00<ersi>twrist: http://github.com/archiveteam/
15:24:00<twrist>Need to be a bit more precise, I'm using ubuntu server and IRSSI
15:24:00<twrist>I only just started as well
15:24:00twrist is GLaDOS, FYI
15:25:00<ersi>I think they're doing yahooblogs-grab right now
15:25:00<twrist>ah
15:25:00<twrist>so https://github.com/archiveteam/yahooblogs-grab.git?
15:25:00<ersi>yeah..
15:25:00<alard>twrist: There's not much sense starting right now, we need to update the script.
15:26:00<alard>ersi: blog.yahoo.com
15:26:00<twrist>ah
15:28:00<tuankiet>Or using Tor so we won't have 999 again. But the speed is super low :))
15:29:00<twrist>The URL I typed out isn't working.
15:29:00<twrist>Anyone else able to paste it in here?
15:29:00<Deewiant>https://github.com/ArchiveTeam/yahooblog-grab.git
15:30:00<twrist>ah, no s
15:37:00<twrist>so the arguments were --downloader=name --concurrent=6?
15:43:00<alard>Yes. There's a new version that should handle the 999 error better.
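(The wait-and-retry idea in miniature, as a Python sketch: back off whenever Yahoo answers with its 999 throttling status. The delays, the cap and the use of the requests library are assumptions for illustration; the real handling lives in the yahooblog-grab scripts, which are wget-based.)

    import time
    import requests  # illustration only; the grab scripts are wget-based

    def fetch(url, max_tries=10, base_delay=60):
        """Fetch a URL, sleeping and retrying whenever Yahoo returns 999."""
        delay = base_delay
        for attempt in range(1, max_tries + 1):
            response = requests.get(url)
            if response.status_code != 999:
                return response
            # 999 = "Unable to process request at this time": we are throttled
            print("got 999, sleeping %ds (attempt %d)" % (delay, attempt))
            time.sleep(delay)
            delay = min(delay * 2, 3600)  # exponential backoff, capped at an hour
        raise RuntimeError("still throttled after %d tries: %s" % (max_tries, url))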
15:54:00<goekesmi>ls
15:54:00goekesmi sighs.
15:54:00<hiker1>xD
16:04:00<chazchaz>Is ther a channel for yahooblog-grab?
16:42:00<SketchCow>I suggest #yahooblah
16:46:00<Coderjoe>O_O yahoo blog is from yahoo korea?
18:13:00<alard>I think the current version of the script works better. (There are fewer 0MB items, and it's much slower.)
19:09:00<hiker1>Is anyone archiving stuff from Tor?
19:24:00<swebb>I used tor once to auto-change my IP when grabbing some stuff from google, but it was way slow.
19:25:00<hiker1>well, yeah. But there are some websites which are tor only.
19:28:00ats raw-images an extremely dodgy floppy four times using two different Amiga drives, converts using disk-analyser, merges the resulting partial images back together giving a full image, and peers happily at the first bits of email he ever sent :)
19:28:00<balrog->what are you using to merge?
19:30:00<ats>rawadf off aminet, patched to not complain about the number of tracks in the .eadf files disk-analyser produces
19:30:00<ats>I also had to patch disk-analyser to not write junk into the EADF track header structure...
19:32:00<ats>then disk-analyser again to turn (raw-track) EADF into (AmigaDOS-track) ADF, adfread to extract the files from the filesystem, and unar to extract the .lzx archives on the floppy
19:52:00<hiker1>If anyone is bored of archiving with wget, please try my WarcMiddleware. I'd be glad to assist in setting it up. https://github.com/iramari/WarcMiddleware
20:34:00<Nemo_bis>alard: how do I know if I'm still collecting mostly useless 999 crap, in case I work on Yahoo?
21:02:00<alard>Nemo_bis: Hard to say. It shouldn't, it should retry (and print a message).
21:04:00<Nemo_bis>ok
21:05:00<Nemo_bis>TinyBack was getting ratelimited anyway
21:38:00<SketchCow>Nemo_bis: http://archive.org/details/magazine_rack
21:38:00<Nemo_bis>SketchCow: Pretty!!!
21:39:00<Nemo_bis>Are you going to make some of those dark?
21:39:00<SketchCow>Ostensibly
21:40:00<Nemo_bis>:)
21:45:00<SketchCow>Like, Wood Magazine will probably disappear.
21:50:00<Nemo_bis>But... children in Africa will DIE if we don't let them know how to build life-saving wood stuff, in English, on a website!
21:53:00<Nemo_bis>On eMule and eMule only there's also another 5 GiB archive of another woodworking magazine. Surely the same woodworking geek scanner.
21:54:00<chronomex>haha
21:55:00<SketchCow>Which one?
21:55:00<SketchCow>You have so many here.
21:56:00<SketchCow>http://archive.org/details/general_magazine
21:56:00<SketchCow>http://archive.org/details/woodsmith_magazin
21:57:00<SketchCow>http://archive.org/details/woodsmith_magazine I mean
21:58:00<SketchCow>How long was this uploading, Nemo_bis?
21:58:00<Nemo_bis>SketchCow: I don't know, a few days of work for the CSV maybe.
21:58:00<Nemo_bis>I didn't measure the time for download and upload in itself.
22:00:00<Nemo_bis>Also a few hours of trackers browsing and other searches.
22:01:00<Nemo_bis>http://p.defau.lt/?YTRaoQFxExjw8T612Pl_XQ
22:03:00<SketchCow>In the future, like godane, I can just browse your uploads and see what you haven't had pushed into a collection and make it happen.
22:03:00<SketchCow>Your activities also get the attention of the devs, who see it come by
22:08:00Nemo_bis hopes not to get too many curses
22:08:00<Nemo_bis>I thought sending you a nice list at the end of the job was going to be helpful?
22:09:00<SketchCow>No.
22:09:00<SketchCow>Doesn't help and it actually gets caught in the spam filter
22:10:00<SketchCow>Because someone from italy is mailing me piles of URLs
22:10:00<Nemo_bis>Oh, even.
22:10:00<chronomex>:P
22:12:00<SketchCow>Also, the vokrugsveta collection didn't make it through the fun
22:12:00<SketchCow>I'm going to make it a collection for you, but it needs more love
22:12:00<Nemo_bis>Yes, I noticed.
22:13:00<Nemo_bis>I didn't look those zips carefully enough, sorry.
22:13:00<SketchCow>Yeah, those things are buuuuuuuunk
22:13:00<SketchCow>How about I dark them all with a note to delete them?
22:13:00<Nemo_bis>Suggestions on how to get something useful out of a FictionBook?
22:13:00<Nemo_bis>I'm ok with it.
22:14:00<SketchCow>No, wait, this thing is valid.
22:14:00<SketchCow>Just not playing with our system
22:14:00<SketchCow>FICTIONBOOOOOOOOOK
22:14:00<SketchCow>Thanks, Russia
22:14:00<Nemo_bis>heh
22:14:00<Nemo_bis>It's not even well seeded, by the way.
22:28:00<hiker1>Nemo_bis: What did you mean when you said make some of those dark?
22:28:00<SketchCow>http://archive.org/details/vokrugsveta
22:28:00<SketchCow>we'll see when the gods arise on that one
22:29:00<mistym>Nemo_bis: Wikipedia suggests Calibre can convert FictionBook to smth more conventional.
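(For the FictionBook question above: Calibre's command-line converter, ebook-convert, takes an input and an output file, so a pile of .fb2 files can be batch-converted with a short Python loop. The file names and the EPUB target here are just an example.)

    import glob
    import subprocess

    # convert every FictionBook file in the current directory to EPUB
    for fb2 in glob.glob("*.fb2"):
        epub = fb2.rsplit(".", 1)[0] + ".epub"
        subprocess.run(["ebook-convert", fb2, epub], check=True)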
22:30:00<SketchCow>https://twitter.com/jefferson_bail/status/289096186420400128
22:49:00<Nemo_bis>SketchCow: thanks for fixing it. I liked that tweet too, wondered what syllabus exactly.
22:50:00<SketchCow>I'm sure it's related to computer programming, and realizing what was done
22:50:00<SketchCow>I asked him to send it along.
22:50:00<Nemo_bis>Nice
22:52:00<SketchCow>By the way, the guy who wrote the wikipedia entry also wrote a scathing e-mail to archive.org about how we were the pit of evil
22:52:00<SketchCow>Good thing I helped bring in so much fundraising last year
22:53:00<SketchCow>Also: Ares Magazine is as sexy as sexy gets
22:58:00<SketchCow>http://archive.org/details/ares_magazine
23:03:00<Nemo_bis>Should still be usable, shouldn't it? With some printing perhaps.
23:05:00<godane>stupid question
23:05:00<godane>i don't know how to submit a comment on youtube
23:07:00<SketchCow>Goood
23:08:00<godane>why is that?
23:08:00<godane>trying to help kevin rose upload the 50 episodes of the screen savers he has
23:10:00<godane>this is the episode in question: https://www.youtube.com/watch?v=ZglwVT5NIJw
23:10:00<godane>it's an episode from july 14 2003
23:11:00<godane>there's next to no caps for episodes in 2003
23:26:00<SketchCow>Example of "I'm just gonna dark it"
23:26:00<SketchCow>http://www.woodworkersjournal.com/Main/Store/5_Disc_Annual_Collection_CD_Bundle_20052009_257.aspx
23:30:00<dashcloud>here's something interesting I came across today: http://www.emsps.com/oldtools/ They buy and sell old to very old software
23:31:00<Nemo_bis>SketchCow: some computer magazines like PC Open here use the PDFs of their past issues as filler for DVDs when they don't find enough stuff, it seems.
23:32:00<Nemo_bis>Something like 10 % of their CD/DVDs contains either some or all past issues in PDF...
23:32:00<dashcloud>Linux Journal definitely does that
23:38:00<chronomex>nice
23:46:00<SketchCow>So, I don't mind being the guy making these collections, BUT
23:47:00<SketchCow>I'd really appreciate it if you do-gooder motherfuckers would walk the collection and find doubles and cases where we have something really shitty when there's known better versions.
23:52:00<Nemo_bis>SketchCow: are there more duplicates than those I told you?
23:52:00<Nemo_bis>(Question is pointless if email really went to spam.)
23:55:00<SketchCow>It did go to spam.
23:58:00<Nemo_bis>http://p.defau.lt/?2fxIiFNmvwaO2FBSJdn7fA
23:58:00<Nemo_bis><https://archive.org/search.php?query=%22Toronto%20PET%20User%27s%20Group%22> (duplicate of <https://archive.org/details/tpug-newsletter>, I'm afraid)
23:58:00<Nemo_bis>and YourComputer which you had already spotted (and deleted, unless it was someone else)
23:59:00<Nemo_bis>I didn't find more in public items.