00:51:00 | <SketchCow> | So, unfortunately, it looks like Myspace is now doing a small transition and killing pages |
02:42:00 | <bsmith093> | you know how some really advanced bulk renamers can add the parent folder(s) to the name of a file? well i need to remove that part. organized like this: stuff/blah/status/blah - authorname - filename.txt. the only matching parts will be the "blah", and it is guaranteed to be a part of the filename |
02:56:00 | <instence_> | uh |
02:56:00 | <instence_> | whats your before and after? |
02:58:00 | <instence_> | you say matching parts, are you trying to regex match those? and only modify those files partially? or? |
02:59:00 | <instence_> | an app for windows I used to rename stuff is called "ReNamer" works great |
03:14:00 | <bsmith093> | instence_: before= stuff/blah/status/blah - authorname - filename.txt |
03:14:00 | <bsmith093> | instence_: after= stuff/blah/status/authorname - filename.txt |
03:28:00 | <instence_> | with ReNamer that would be quite easy, but I think its a windows only app |
03:29:00 | <instence_> | http://www.den4b.com/?x=downloads&product=renamer |
03:30:00 | <instence_> | http://www.den4b.com/?x=screenshots&product=renamer |
03:31:00 | <instence_> | you can stack rules as well |
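The before/after pair bsmith093 gives above can also be handled outside ReNamer. A minimal Python sketch, assuming the unwanted prefix is always the first " - "-separated field of the filename and that it matches the name of one of the file's parent folders; the function name `strip_parent_prefix` is made up for illustration:

```python
from pathlib import PurePosixPath


def strip_parent_prefix(path: str, sep: str = " - ") -> str:
    """Drop a leading '<folder><sep>' from the filename when <folder>
    is the name of one of the file's parent directories."""
    p = PurePosixPath(path)
    # Split the filename at the first separator only.
    head, found, rest = p.name.partition(sep)
    # Rename only when the first field really is a parent folder name.
    if found and rest and head in p.parent.parts:
        return str(p.with_name(rest))
    # Anything else is returned unchanged.
    return path


print(strip_parent_prefix("stuff/blah/status/blah - authorname - filename.txt"))
```

The function only computes the new path; renaming the files on disk (for example with `os.rename`) is left to the caller so the result can be reviewed first.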
03:46:00 | <dashcloud> | so, is there still a channel for fileformat wiki efforts, or it just goes here or -bs? |
04:45:00 | <tuankiet> | Hello everybody! |
04:48:00 | <bsmith093> | instence_: how would i do that in renamer, it runs fine in wine, so im using that for now |
04:55:00 | <tuankiet> | @alard: are there any projects? |
06:03:00 | <Nemo_bis> | SketchCow: thanks! |
06:15:00 | <SketchCow> | No problem. Sorry there's still a lag with me this year. |
06:16:00 | <SketchCow> | I'd hoped to be more archiveteam responsive, but this DEFCON documentary is kicking my aaaaaassssss |
08:41:00 | <chronomex> | godane: ftp.download.packardbell.com: Downloaded: 2679 files, 28G in 1d 17h 47m 55s (194 KB/s) |
08:42:00 | <chronomex> | now: time nice ionice -c 3 zip -vr ftp.download.packardbell.com.zip ftp.download.packardbell.com |
09:07:00 | <godane> | chronomex: thanks for getting it |
09:07:00 | <godane> | i know that would have taken me forever to get |
09:08:00 | <chronomex> | :) |
09:08:00 | <godane> | and to also upload |
09:19:00 | <chronomex> | yeah, might take a while |
09:19:00 | <chronomex> | I downloaded a terabyte of ftp last month :P |
09:20:00 | <Nemo_bis> | chronomex: ah, 200 KB/s, lucky you :) |
09:21:00 | <Nemo_bis> | NATO still at 40 KB/s |
09:21:00 | <Nemo_bis> | 42 GiB so far |
09:22:00 | <chronomex> | o_O |
09:22:00 | <chronomex> | ftp.3gpp.org is huge |
09:22:00 | <chronomex> | btw. |
09:23:00 | <chronomex> | 350g, iirc |
09:23:00 | <Nemo_bis> | everything has recent timestamps there |
13:54:00 | <hiker1> | To what extent does heritrix discover JavaScript and CSS? |
14:15:00 | <alard> | tuankiet: Well, it's time to start downloading the Yahoo blogs. |
14:16:00 | <alard> | hiker1: It probably downloads things referenced with <script> or <link rel="stylesheet"> tags, and I think it even has some rules to find images etc. in the actual CSS and JavaScript files. |
14:16:00 | <hiker1> | How easy is it to set up? |
14:17:00 | <alard> | It isn't that hard, but it's unwieldy. |
14:17:00 | <hiker1> | I wanted to test it on a single site. |
14:18:00 | <hiker1> | I suppose it's probably not worth the hassle |
14:18:00 | <ersi> | Neither Heritrix nor Wayback is easy to set up |
14:22:00 | <hiker1> | sigh. |
14:22:00 | <hiker1> | Maybe someone that knows how could release a VirtualBox image with it already installed and ready to accept a warc file? |
14:24:00 | <hiker1> | alard: There is a python library called mitmproxy. Might be useful to proxy the HTTPS records: http://mitmproxy.org/ |
14:24:00 | <hiker1> | Right now I am using a simple rewrite modification to warc-proxy to get them sent. |
14:24:00 | <hiker1> | very, very rudimentary. |
14:24:00 | <ersi> | I've fiddled a little with it, and plan to maybe continue - but we'll see (RE: wayback, heritrix) |
14:24:00 | <godane> | so i just found a very good copy of the screen savers episode from 2003 |
14:25:00 | <godane> | Kevin Rose uploaded it too :-D |
14:25:00 | <ersi> | OH MY GOD! |
14:29:00 | <godane> | https://www.youtube.com/user/kevinrose |
14:29:00 | <godane> | i found it on his youtube channel |
14:29:00 | <godane> | i may have to email him so i get more episodes of tss |
14:31:00 | <godane> | he has about 50 episodes of the screen savers in mp4 |
14:31:00 | <godane> | :-D |
14:40:00 | <hiker1> | WARC doesn't replay the actual browser sessions, only the traffic. Some JavaScript scripts I found appear to append a callback handle to the url that is generated at runtime based on a live JS object. WARC can not replay this behavior. |
14:42:00 | <hiker1> | Technically it does archive all the information that a website outputs, but some of the information is impractical to use or view without extensive modifications to the JavaScript. |
14:44:00 | <hiker1> | It makes me think of an HTML5 game http://wordsquared.com/. You can download all the traffic, but you will never be able to see the game properly I think. |
14:44:00 | <alard> | hiker1: And? Or are you just thinking aloud? :) |
14:44:00 | <alard> | You could fix individual sites, but there's no general solution, I think. |
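As an illustration of what a per-site fix might involve: if the runtime-generated handle is passed as a query parameter, a replay tool could ignore that parameter when looking up archived URLs. The sketch below assumes the parameter is called `callback`, which the log does not say, and `replay_key` is a hypothetical helper rather than part of warc-proxy or any other tool mentioned here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def replay_key(url: str, volatile=("callback",)) -> str:
    """Return the URL with volatile query parameters removed, so that a
    live request and the archived request map to the same lookup key."""
    parts = urlsplit(url)
    # Keep every query parameter except the ones assumed to change per session.
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in volatile
    ]
    # Rebuild the URL without a fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


archived = replay_key("http://example.com/api?callback=jsonp123&id=7")
live = replay_key("http://example.com/api?callback=jsonp999&id=7")
print(archived == live)
```

The archived response body would still name the old callback, so the body or the page's JavaScript would also need rewriting, which is why this stays a site-specific fix.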
14:53:00 | <hiker1> | thinking aloud :) |
14:53:00 | <hiker1> | I noticed this while attempting to archive a website just now. |
14:54:00 | <tuankiet> | @alard: Tracker rate limiting is in effect. Retrying after 30 seconds... :(( |
14:57:00 | <alard> | tuankiet: Yes, there was something wrong yesterday. I'm now gathering some files to debug with. (Until I got distracted by wordsquared just now. :) |
14:57:00 | <hiker1> | hah xD |
15:00:00 | <tuankiet> | @alard: Oh, running again. I've just restarted VMs to update the code :)) |
15:11:00 | <alard> | Good. Found the problem: HTTP/1.1 999 Unable to process request at this time -- error 999 |
15:12:00 | <alard> | What's the best way to handle those? Wait and retry? |
15:13:00 | <Nemo_bis> | ah, as it was feared |
15:14:00 | <balrog-> | that means you are being throttled |
15:15:00 | <balrog-> | http://www.murraymoffatt.com/software-problem-0011.html |
15:16:00 | <alard> | It's Nemo_bis, in this case. |
15:17:00 | <Nemo_bis> | alard: I got that error? but I just started |
15:17:00 | <balrog-> | wow MS is killing messenger |
15:17:00 | <Nemo_bis> | I have lots of "Project code is out of date and needs to be upgraded. Retrying after 30 seconds..." |
15:18:00 | <alard> | Yes, I've paused the thing again. |
15:18:00 | <twrist> | Messenger is being integrated into skype, though. |
15:18:00 | <twrist> | So yeah. |
15:18:00 | <alard> | Nemo_bis: In the last few minutes there were 999-warcs from grue, tuankiet, and you. |
15:18:00 | <Nemo_bis> | hm |
15:19:00 | <balrog-> | twrist: yeah but the protocol, etc are going away |
15:19:00 | <twrist> | Ah, right. |
15:19:00 | <ersi> | Super old. |
15:19:00 | <balrog-> | alard: need to detect 999s and throttle |
15:19:00 | <twrist> | So, what's currently being archived? |
15:19:00 | <Nemo_bis> | alard: I've switched the warrior to tinyback |
15:19:00 | <alard> | balrog-: How long to wait? (And does saying you're from Google still work?) |
15:20:00 | <balrog-> | alard: I don't know, I haven't tested - info online says 2-24 hours, but I don't know |
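The "detect 999s and throttle" idea could look roughly like the following. This is a sketch, not the code in the grab script: `fetch_with_backoff`, the `fetch` callable and the 60-second starting delay are all assumptions, and the log itself leaves open how long the throttle actually lasts.

```python
import time


def fetch_with_backoff(fetch, url, max_tries=5, base_delay=60.0, sleep=time.sleep):
    """Call fetch(url) -> (status, body). On status 999, wait and retry,
    doubling the delay after each throttled attempt."""
    delay = base_delay
    for attempt in range(max_tries):
        status, body = fetch(url)
        if status != 999:
            # Not throttled: hand the response back to the caller.
            return status, body
        if attempt < max_tries - 1:
            # Throttled: back off before the next attempt.
            sleep(delay)
            delay *= 2
    # Give up rather than saving the 999 response as if it were content.
    raise RuntimeError(f"still throttled (999) after {max_tries} tries: {url}")
```

Raising on repeated 999s, instead of returning the last response, keeps throttle pages out of the saved output so the item can be requeued later.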
15:20:00 | <Nemo_bis> | can it be that Yahoo is suspicious because it sees activity from my IP on flickr etc. as logged in user? |
15:20:00 | <Nemo_bis> | it definitely can't be bandwidth in my case |
15:21:00 | <tuankiet> | Bad thing now |
15:22:00 | <alard> | Nemo_bis: Perhaps you're normally less active on Asian blogs. |
15:23:00 | <twrist> | Give me a git URL to clone, guys. |
15:23:00 | <ersi> | At what project are you guys getting HTTP 999's? |
15:24:00 | <twrist> | I'm itching to join in. |
15:24:00 | <ersi> | twrist: http://github.com/archiveteam/ |
15:24:00 | <twrist> | Need to be a bit more precise, I'm using ubuntu server and IRSSI |
15:24:00 | <twrist> | I only just started as well |
15:24:00 | | twrist is GLaDOS, FYI |
15:25:00 | <ersi> | I think they're doing yahooblogs-grab right now |
15:25:00 | <twrist> | ah |
15:25:00 | <twrist> | so https://github.com/archiveteam/yahooblogs-grab.git? |
15:25:00 | <ersi> | yeah.. |
15:25:00 | <alard> | twrist: There's not much sense starting right now, we need to update the script. |
15:26:00 | <alard> | ersi: blog.yahoo.com |
15:26:00 | <twrist> | ah |
15:28:00 | <tuankiet> | Or using Tor so we won't have 999 again. But the speed is super low :)) |
15:29:00 | <twrist> | The URL I typed out isn't working. |
15:29:00 | <twrist> | Anyone else able to paste it in here? |
15:29:00 | <Deewiant> | https://github.com/ArchiveTeam/yahooblog-grab.git |
15:30:00 | <twrist> | ah, no s |
15:37:00 | <twrist> | so the arguments were --downloader=name --concurrent=6? |
15:43:00 | <alard> | Yes. There's a new version that should handle the 999 error better. |
15:54:00 | <goekesmi> | ls |
15:54:00 | | goekesmi sighs. |
15:54:00 | <hiker1> | xD |
16:04:00 | <chazchaz> | Is there a channel for yahooblog-grab? |
16:42:00 | <SketchCow> | I suggest #yahooblah |
16:46:00 | <Coderjoe> | O_O yahoo blog is from yahoo korea? |
18:13:00 | <alard> | I think the current version of the script works better. (There are fewer 0MB items, and it's much slower.) |
19:09:00 | <hiker1> | Is anyone archiving stuff from Tor? |
19:24:00 | <swebb> | I used tor once to auto-change my IP when grabbing some stuff from google, but it was way slow. |
19:25:00 | <hiker1> | well, yeah. But there are some websites which are tor only. |
19:28:00 | | ats raw-images an extremely dodgy floppy four times using two different Amiga drives, converts using disk-analyser, merges the resulting partial images back together giving a full image, and peers happily at the first bits of email he ever sent :) |
19:28:00 | <balrog-> | what are you using to merge? |
19:30:00 | <ats> | rawadf off aminet, patched to not complain about the number of tracks in the .eadf files disk-analyser produces |
19:30:00 | <ats> | I also had to patch disk-analyser to not write junk into the EADF track header structure... |
19:32:00 | <ats> | then disk-analyser again to turn (raw-track) EADF into (AmigaDOS-track) ADF, adfread to extract the files from the filesystem, and unar to extract the .lzx archives on the floppy |
19:52:00 | <hiker1> | If anyone is bored of archiving with wget, please try my WarcMiddleware. I'd be glad to assist in setting it up. https://github.com/iramari/WarcMiddleware |
20:34:00 | <Nemo_bis> | alard: how do I know if I'm still collecting mostly useless 999 crap, in case I work on Yahoo? |
21:02:00 | <alard> | Nemo_bis: Hard to say. It shouldn't, it should retry (and print a message). |
21:04:00 | <Nemo_bis> | ok |
21:05:00 | <Nemo_bis> | TinyBack was getting ratelimited anyway |
21:38:00 | <SketchCow> | Nemo_bis: http://archive.org/details/magazine_rack |
21:38:00 | <Nemo_bis> | SketchCow: Pretty!!! |
21:39:00 | <Nemo_bis> | Are you going to make some of those dark? |
21:39:00 | <SketchCow> | Ostensibly |
21:40:00 | <Nemo_bis> | :) |
21:45:00 | <SketchCow> | Like, Wood Magazine will probably disappear. |
21:50:00 | <Nemo_bis> | But... children in Africa will DIE if we don't let them know how to build life-saving wood stuff, in English, on a website! |
21:53:00 | <Nemo_bis> | On eMule and eMule only there's also another 5 GiB archive of another woodworking magazine. Surely the same woodworking geek scanner. |
21:54:00 | <chronomex> | haha |
21:55:00 | <SketchCow> | Which one? |
21:55:00 | <SketchCow> | You have so many here. |
21:56:00 | <SketchCow> | http://archive.org/details/general_magazine |
21:56:00 | <SketchCow> | http://archive.org/details/woodsmith_magazin |
21:57:00 | <SketchCow> | http://archive.org/details/woodsmith_magazine I mean |
21:58:00 | <SketchCow> | How long was this uploading, Nemo_bis? |
21:58:00 | <Nemo_bis> | SketchCow: I don't know, a few days of work for the CSV maybe. |
21:58:00 | <Nemo_bis> | I didn't measure the time for download and upload in itself. |
22:00:00 | <Nemo_bis> | Also a few hours of trackers browsing and other searches. |
22:01:00 | <Nemo_bis> | http://p.defau.lt/?YTRaoQFxExjw8T612Pl_XQ |
22:03:00 | <SketchCow> | In the future, like godane, I can just browse your uploads and see what you haven't had pushed into a collection and make it happen. |
22:03:00 | <SketchCow> | Your activities also get the attention of the devs, who see it come by |
22:08:00 | | Nemo_bis hopes not to get too many curses |
22:08:00 | <Nemo_bis> | I thought sending you a nice list at the end of the job was going to be helpful? |
22:09:00 | <SketchCow> | No. |
22:09:00 | <SketchCow> | Doesn't help and it actually gets caught in the spam filter |
22:10:00 | <SketchCow> | Because someone from italy is mailing me piles of URLs |
22:10:00 | <Nemo_bis> | Oh, even. |
22:10:00 | <chronomex> | :P |
22:12:00 | <SketchCow> | Also, the vokrugsveta collection didn't make it through the fun |
22:12:00 | <SketchCow> | I'm going to make it a collection for you, but it needs more love |
22:12:00 | <Nemo_bis> | Yes, I noticed. |
22:13:00 | <Nemo_bis> | I didn't look at those zips carefully enough, sorry. |
22:13:00 | <SketchCow> | Yeah, those things are buuuuuuuunk |
22:13:00 | <SketchCow> | How about I dark them all with a note to delete them? |
22:13:00 | <Nemo_bis> | Suggestions on how to get something useful out of a FictionBook? |
22:13:00 | <Nemo_bis> | I'm ok with it. |
22:14:00 | <SketchCow> | No, wait, this thing is valid. |
22:14:00 | <SketchCow> | Just not playing with our system |
22:14:00 | <SketchCow> | FICTIONBOOOOOOOOOK |
22:14:00 | <SketchCow> | Thanks, Russia |
22:14:00 | <Nemo_bis> | heh |
22:14:00 | <Nemo_bis> | It's not even well seeded, by the way. |
22:28:00 | <hiker1> | Nemo_bis: What did you mean when you said make some of those dark? |
22:28:00 | <SketchCow> | http://archive.org/details/vokrugsveta |
22:28:00 | <SketchCow> | we'll see when the gods arise on that one |
22:29:00 | <mistym> | Nemo_bis: Wikipedia suggests Calibre can convert FictionBook to smth more conventional. |
22:30:00 | <SketchCow> | https://twitter.com/jefferson_bail/status/289096186420400128 |
22:49:00 | <Nemo_bis> | SketchCow: thanks for fixing it. I liked that tweet too, wondered what syllabus exactly. |
22:50:00 | <SketchCow> | I'm sure it's related to computer programming, and realizing what was done |
22:50:00 | <SketchCow> | I asked him to send it along. |
22:50:00 | <Nemo_bis> | Nice |
22:52:00 | <SketchCow> | By the way, the guy who wrote the wikipedia entry also wrote a scathing e-mail to archive.org about how we were the pit of evil |
22:52:00 | <SketchCow> | Good thing I helped bring in so much fundraising last year |
22:53:00 | <SketchCow> | Also: Ares Magazine is as sexy as sexy gets |
22:58:00 | <SketchCow> | http://archive.org/details/ares_magazine |
23:03:00 | <Nemo_bis> | Should still be usable, shouldn't it? With some printing perhaps. |
23:05:00 | <godane> | stupid question |
23:05:00 | <godane> | i don't know how to submit a comment on youtube |
23:07:00 | <SketchCow> | Goood |
23:08:00 | <godane> | why is that? |
23:08:00 | <godane> | trying to help kevin rose upload the 50 episodes of the screen savers he has |
23:10:00 | <godane> | this is the episode in question: https://www.youtube.com/watch?v=ZglwVT5NIJw |
23:10:00 | <godane> | it's an episode from july 14 2003 |
23:11:00 | <godane> | there are next to no caps for episodes in 2003 |
23:26:00 | <SketchCow> | Example of "I'm just gonna dark it" |
23:26:00 | <SketchCow> | http://www.woodworkersjournal.com/Main/Store/5_Disc_Annual_Collection_CD_Bundle_20052009_257.aspx |
23:30:00 | <dashcloud> | here's something interesting I came across today: http://www.emsps.com/oldtools/ They buy and sell old-very old software |
23:31:00 | <Nemo_bis> | SketchCow: some computer magazines like Pc Open here use the PDFs of their past issues as fillers for DVDs when they don't find enough stuff, it seems. |
23:32:00 | <Nemo_bis> | Something like 10 % of their CD/DVDs contains either some or all past issues in PDF... |
23:32:00 | <dashcloud> | Linux Journal definitely does that |
23:38:00 | <chronomex> | nice |
23:46:00 | <SketchCow> | So, I don't mind being the guy making these collections, BUT |
23:47:00 | <SketchCow> | I'd really appreciate it if you do-gooder motherfuckers would walk the collection and find doubles and cases where we have something really shitty when there's known better versions. |
23:52:00 | <Nemo_bis> | SketchCow: are there more duplicates than those I told you? |
23:52:00 | <Nemo_bis> | (Question is pointless if email really went to spam.) |
23:55:00 | <SketchCow> | It did go to spam. |
23:58:00 | <Nemo_bis> | http://p.defau.lt/?2fxIiFNmvwaO2FBSJdn7fA |
23:58:00 | <Nemo_bis> | <https://archive.org/search.php?query=%22Toronto%20PET%20User%27s%20Group%22> (duplicate of <https://archive.org/details/tpug-newsletter>, I'm afraid) |
23:58:00 | <Nemo_bis> | and YourComputer which you had already spotted (and deleted, unless it was someone else) |
23:59:00 | <Nemo_bis> | I didn't find more in public items. |