01:07:00<phuzion>ivan`: where did you get that github-repositories.txt file?
01:07:00<chronomex>the gothub subcommittee posted it on IA recently
01:08:00<chronomex>http://archive.org/details/archiveteam-github-repository-index-201212
01:09:00<phuzion>Thanks
01:09:00phuzion considers just starting that for the hell of it, to see how much space it ends up taking.
01:27:00<balrog_>some repos already got renamed or deleted
01:27:00<joepie91>does anyone have a done-in-10-seconds way to submit a WARC to the internet archive?
01:27:00<joepie91>from a headless server
01:28:00<joepie91>I have a WARC of the BBC site of a while ago
01:28:00<chronomex>balrog_: yeah, they will do that
01:28:00<chronomex>joepie91: you can cobble together something that uses curl to POST it to the s3 api
01:29:00<joepie91>:p
01:29:00<joepie91>right, but how would I do that, seeing as I'm entirely unfamiliar with the s3 api
01:29:00<chronomex>okay
01:30:00<chronomex>you need to get tokens and I'll give you a command line
01:30:00<joepie91>how do i get tokens?
01:30:00<ivan`>phuzion: let me know if you run out, I might have 3TB of space to do some of it
01:31:00<phuzion>ivan`: I've started on it, I'll let you know when it fills up my drive.
01:31:00<chronomex>joepie91: http://archive.org/account/s3.php
01:31:00<joepie91>okay, got them
01:32:00<ivan`>phuzion: you might want to run two in parallel since half the time github will be busy counting objects
01:32:00<chronomex>joepie91: curl '--header' 'authorization: LOW your-magic-token' '--header' 'x-archive-meta01-collection:opensource' '--header' 'x-amz-auto-make-bucket:1' '--header' 'x-archive-meta-noindex:true' --header 'x-archive-meta-(title|date|mediatype|language|etc): Value'
01:32:00<phuzion>Hmm... Perhaps I can figure out how to split the list into even and odd lines...
01:32:00<ivan`>or with xargs or parallel
01:32:00ivan` looks it up
01:33:00<chronomex>yes, `parallel' is good
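The parallel-cloning idea ivan` and chronomex are describing can be sketched with `xargs -P`. This is a dry run under invented names: `repos.txt` and its `user/repo` format stand in for the real `github-repositories.txt` list (whose exact format isn't shown here), and `DRYRUN=echo` prints the commands instead of hitting the network; drop it to actually clone.

```shell
#!/bin/sh
# Sketch: clone a list of GitHub repos two at a time with xargs -P.
# repos.txt and its one-"user/repo"-per-line format are assumptions
# for the demo; the real list's format may differ.
set -e

cat > repos.txt <<'EOF'
user1/repo1
user2/repo2
user3/repo3
EOF

# DRYRUN=echo prints each git command instead of running it
DRYRUN=echo
xargs -P 2 -I {} $DRYRUN git clone --mirror "git://github.com/{}.git" \
  < repos.txt > clone-commands.txt
```

GNU `parallel` accepts the same shape (`parallel git clone --mirror git://github.com/{}.git < repos.txt`) and keeps per-job output from interleaving.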
01:33:00<joepie91>magic token == secret key?
01:33:00<chronomex>joepie91: hold on don't run that yet
01:33:00<chronomex>yes, secret key
01:33:00<chronomex>you actually want to run it with these as well ...
01:34:00<chronomex>curl -i '-#' ${args from above} --upload-file /dev/null "http://s3.us.archive.org/"$identifier
01:34:00<chronomex>this will give you a progress bar and stuff
01:35:00<joepie91>what is the $identifier?
01:35:00joepie91 is confused now
01:35:00<joepie91>okay, let me ask it differently
01:36:00<joepie91>if I wanted to upload a warc.gz of the BBC.co.uk site named "BBC.co.uk WARC", and the filename was at-bbc.warc.gz
01:36:00<joepie91>what would the full command be to run (minus secret key, ofc)
01:36:00<joepie91>so that I get a bit of a better grasp on the syntax :p
01:38:00<phuzion>ivan`: I'm trying to figure out how to split the file in half, I want to do even and odd lines, but can't quite nail the sed syntax, you any good with sed?
01:39:00<phuzion>Wait, hang on, I might have gotten it
01:39:00<ivan`>no, I was busy trying to figure out how to do the subshell thing with parallel
01:41:00<DFJustin>so did someone warc this yet, closing tomorrow http://japan.gamespot.com/
01:41:00<phuzion>Yeah, got it
01:41:00<phuzion>sed -n "1~2 p" github-repositories.txt > github-odd.txt and then sed -n "2~2 p" github-repositories.txt > github-even.txt
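phuzion's `first~step` addresses are a GNU sed extension (they won't work in BSD sed). A runnable demonstration on a small stand-in file:

```shell
#!/bin/sh
# Demonstrate the odd/even split using GNU sed's first~step address.
# list.txt is a toy stand-in for github-repositories.txt.
set -e
printf 'a\nb\nc\nd\ne\n' > list.txt
sed -n '1~2p' list.txt > odd.txt    # lines 1, 3, 5, ...
sed -n '2~2p' list.txt > even.txt   # lines 2, 4, 6, ...
```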
01:46:00<chronomex>joepie91: curl -i -'#' (all the --header options from above) --upload-file at-bbc.warc.gz "http://s3.us.archive.org/BBC.co.uk-warc"
01:48:00<chronomex>joepie91: make sense?
01:48:00<chronomex>you should write a description and stuff
01:54:00<joepie91>right... not much to describe though
01:54:00<joepie91>:P
01:55:00<chronomex>well, write where it came from, include the wget command line, etc
01:55:00<chronomex>maybe why you got it
01:55:00<chronomex>just a few sentences
01:56:00<joepie91>just use \n to insert a newline?
01:57:00<chronomex>I don't know tbh
01:57:00phuzion predicts that his 1tb drive will be full tonight, thanks to cloning github repos
01:57:00<chronomex>phuzion: that sounds like a safe bet
01:57:00<joepie91>meh, don't have the command I ran anymore anyway :/
01:57:00<phuzion>heh
01:58:00<godane>how big would you think japan.gamespot.com should be?
01:58:00<joepie91>desc?
01:58:00<joepie91>what's the name of the description header?
01:58:00<chronomex>ummm
01:58:00<phuzion>Can someone take http://git.kernel.org/index.html and get all of the git:// links out of the page for me? I wanna clone all of those as well
02:00:00<chronomex>joepie91: read the examples at http://archive.org/help/abouts3.txt
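Pulling chronomex's pieces together, the upload joepie91 is assembling looks roughly like the following. This is a dry run: `ACCESSKEY:SECRET` is a placeholder for the keys from archive.org/account/s3.php, the metadata values are illustrative, and `DRYRUN=echo` only prints the command (empty it to really upload).

```shell
#!/bin/sh
# Sketch of an IA S3 upload via curl, assembled from the discussion.
# Keys and metadata values are placeholders, not real credentials.
set -e
IDENTIFIER="BBC.co.uk-warc"
AUTH="LOW ACCESSKEY:SECRET"   # access:secret from archive.org/account/s3.php
DRYRUN=echo                   # set to empty to actually upload

$DRYRUN curl -i '-#' \
  --header "authorization: $AUTH" \
  --header 'x-amz-auto-make-bucket:1' \
  --header 'x-archive-meta01-collection:opensource' \
  --header 'x-archive-meta-title:BBC.co.uk WARC' \
  --upload-file at-bbc.warc.gz \
  "http://s3.us.archive.org/$IDENTIFIER" > upload-cmd.txt
```

The `--upload-file /dev/null` variant chronomex mentions earlier is the same command used just to create the bucket before the real upload.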
02:02:00<joepie91>phuzion: http://sprunge.us/XOBV
02:02:00<joepie91>ignore the first line
02:03:00<joepie91>rest should be valid
02:03:00<phuzion>joepie91: you sir, are a gentleman and a scholar
02:03:00<phuzion>Mind if I ask the wizardry you used to obtain such a result?
02:07:00<joepie91>sure, 1 sec :P
02:08:00<joepie91>bit of a nasty method, but
02:08:00<joepie91>http://pastie.org/5541069
02:08:00<joepie91>it does the job
02:08:00<joepie91>curl http://whatever | python gitlink.py
02:09:00<joepie91>the regex is extremely lazy though, and there's no guarantee that it'll work with other stuff :P
02:09:00<joepie91>plus I don't think it'll match more than one git:// url per line in the html file
02:09:00<joepie91>which is fine for this, but may not be fine for other things
02:10:00<chronomex>I would do curl http://whatever | sed -e 's/[" ]/\n/g' | grep ^git://
02:10:00<joepie91>chronomex: that won't work if there's other stuff on the same line, right?
02:10:00<joepie91>wait
02:10:00<joepie91>I see what you're doing
02:11:00<chronomex>:)
02:11:00<joepie91>that would break here though, if you don't include ) in your regex
02:11:00<chronomex>ok
02:11:00<joepie91>there was one that would break and have a ) at the end
02:11:00<chronomex>well, as usual, it requires tuning
02:11:00<joepie91>:P
02:11:00<joepie91>plus you'd have to add <
02:11:00<joepie91>in case it's mentioned as text
02:11:00<chronomex>well yes
02:11:00<chronomex>but you see where I'm going with it
02:11:00<joepie91>yes :)
02:11:00<joepie91>I'm horrible with sed and awk so I prefer python for these kind of things :P
02:12:00<chronomex>or you could do grep -o 'git://[-_A-Za-z./%0-9 etc]*'
02:12:00<chronomex>-o is only-matching-regions
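Both extraction approaches from the exchange, run against a small sample page; the sed delimiter class includes the `)` and `<` additions discussed above. As noted, these are lazy heuristics that need tuning per page (the `\n` replacement is also GNU-sed-specific).

```shell
#!/bin/sh
# Two ways to pull git:// links out of HTML: split on delimiters with
# sed, or print only matching regions with grep -o. sample.html is a
# made-up page for the demo.
set -e
cat > sample.html <<'EOF'
<p>Clone it (git://example.org/one.git) or see
<a href="git://example.org/two.git">the mirror</a>.</p>
EOF

# sed approach: turn likely delimiters into newlines, keep git:// lines
sed -e 's/[" <>()]/\n/g' sample.html | grep '^git://' > links-sed.txt

# grep -o approach: a rough URL character class
grep -o 'git://[-_A-Za-z./%0-9]*' sample.html > links-grep.txt
```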
02:31:00<chronomex>alard: tracker is back in swapsville http://zeppelin.xrtc.net/corp.xrtc.net/shilling.corp.xrtc.net/memory.html
02:39:00<godane>so i'm starting the mirroring of japan.gamespot.com
03:27:00<godane>and japan.gamespot.com is gone
03:27:00<balrog_>:/ did it get backed up?
03:28:00<godane>part of it did
03:28:00<godane>not much
03:43:00<godane>i'm uploading my warc for japan.gamespot.com right now
03:43:00<godane>just don't expect much
03:50:00<balrog_>:[
03:55:00<godane>uploaded: http://archive.org/details/japan.gamespot.com-20121216-mirror-incomplete
03:55:00<godane>we were not fast enough
03:56:00<godane>i'm grabbing stuff like fireflyfans.net before it needs a panic download in under 2 hours
03:59:00<godane>it was already starting to redirect to japan.cnet.com based on my wget.log
04:03:00<phuzion>chronomex: The tracker that you talk of, is that why I can't download github stuff?
04:03:00<chronomex>phuzion: no idea
07:01:00chronomex currently stuffing some ftp grabs from last week into .zips
09:50:00<Nemo_bis>chronomex: you uploaded some of these a while ago didn't you? https://archive.org/details/bellsystempractices
09:50:00<Nemo_bis>are http://thepiratebay.se/torrent/5946997/Bell_Systems_Technical_Journals_(Full_Site_Rip) darkened or just not on archive.org?
09:54:00<chronomex>Nemo_bis: that is my collection, yes.
09:56:00<Nemo_bis>chronomex: do you anything about the bell system technical journals then?
09:56:00<chronomex>do I what anything?
09:57:00<chronomex>I think you a word
09:57:00<alard>chronomex: The tracker likes to swap. We have too many large projects at the moment.
09:57:00<chronomex>yeah, that was my understanding
09:58:00<Nemo_bis>*do you know
09:59:00<alard>GitHub is done now, so that will be going.
09:59:00<chronomex>I know some things about the BSTJ, yes?
10:00:00<Nemo_bis>chronomex: about them being uploaded on archive.org
10:01:00<chronomex>don't
10:02:00<chronomex>http://archive.org/search.php?query=bell%20system%20technical%20journal hmmm this is bad
10:03:00<chronomex>maybe I should upload that 50G torrent
10:03:00<chronomex>orrrrr not?
10:03:00<chronomex>upload it from the lucent site
10:03:00<chronomex>maybe I'll do that tomorrow
10:04:00<chronomex>yeahhhh
10:18:00<Nemo_bis>chronomex: yes you should :)
10:18:00<Nemo_bis>unless Jason already did it?
10:25:00<hiker3>http://japan.gamespot.com/ is gone now
10:26:00<hiker3>I see godane managed to grab some of it
10:28:00<hiker3>If http://andriasang.com/ comes back online it might be nice to grab a copy as well. I am not sure how much longer it will stay up
10:29:00<hiker3>Thank you for grabbing what you did, godane.
11:58:00<Nemo_bis>If you want to pick some... (Or add suggestions; my proxy and myself got sick of browsing TPB. ;-) ) http://archiveteam.org/index.php?title=Magazines_and_journals
13:05:00<godane>hiker3: looks like gamespotjapan twitter feed is gone too
13:05:00<hiker3>Hi! But isn't twitter archived automatically?
13:06:00<godane>don't know
13:06:00<godane>i just know that the account doesn't exist anymore
13:06:00<hiker3>Were you the only one grabbing the site?
13:11:00<godane>i don't know
13:12:00<godane>in least jason scott got it
13:12:00<godane>*in less
13:12:00<godane>when was it posted that it was going to be redirected to japan.cnet.com
13:13:00<hiker3>I came in here 3 days ago and mentioned it I think
13:13:00<godane>so i hope jason got the warning then
13:13:00<godane>i know i was not going to get all of it
13:14:00<hiker3>Is there any way someone can get http://andriasang.com/ if it comes back up?
13:14:00<hiker3>It's been having errors for a few weeks now, and the author has moved on to other things so I am not sure how much longer it will stay up.
14:10:00<godane>so looks like fireflyfans.net bluesunroom is very big
14:41:00<joepie91>chronomex: http://aarnist.cryto.net:81/data/at-trancenu.warc.gz
14:41:00<joepie91>a seemingly complete warc of trance.nu
14:44:00<godane>uploaded: http://archive.org/details/www.engadget.com-images-2006-mirror
14:44:00<norbert79>joepie91: What do you think, would all sources found for gopher be worth of uploading?
14:45:00<norbert79>I mean the UMN gopher engine
14:48:00<joepie91>norbert79: I have no idea what that would entail, to be perfectly honest
14:48:00<joepie91>that was before my time :P
14:50:00<norbert79>Lot of old gopher code; it would mean like I would say: old apache2 code :)
14:52:00<joepie91>ah, right
14:52:00<joepie91>sure, why not :P
14:54:00<hiker3>Is there a list of websites which have shutdown but have archives from AT?
14:56:00<godane>i have gopher plugin in for firefox
14:58:00<Nemo_bis>sigh people packaging PDFs in NRG packaged in multifile RARs
15:11:00<norbert79>Looks like sharing isn't accessible atm
15:12:00<godane>i found some usenet dumps
15:12:00<godane>on gopher://telefisk.org/
15:12:00<joepie91>hiker3: I think it's on the archiveteam wiki
15:12:00<godane>the archive is up to like 2011
15:13:00<norbert79>godane: Telefisk is still an active gopher server
15:13:00<godane>yes
15:13:00<godane>from what i can tell
15:13:00<norbert79>godane: You could also add olduse.net to this too
15:14:00<joepie91>ah
15:14:00<joepie91>hiker3: http://archive.org/details/archiveteam
15:15:00<norbert79>godane: Wanted to upload Old Gopher Sources, connection died, now I can't use that keyword anymore, but am offered OldGopherSOurces_631
15:15:00<norbert79>godane: What now?
15:15:00<norbert79>Shall I ignore this?
15:17:00<godane>i'm downloading this stuff to be on the safe side
15:19:00<DFJustin>norbert79: it looks like https://archive.org/details/OldGopherSources was created, so you ought to be able to go in and edit it
15:20:00<norbert79>DFJustin: Cheers, looks like both https://archive.org/details/OldGopherSources and https://archive.org/details/OldGopherSources_693 got created and got stuck again
15:20:00<DFJustin>afaik olduse.net comes from data that is already on IA so no point in archiving it https://archive.org/details/utzoo-wiseman-usenet-archive
15:21:00<norbert79>DFJustin: About these pages, can I somehow remove them?
15:21:00<DFJustin>no
15:21:00<norbert79>I wish to remove the second, aw crap
15:21:00<DFJustin>it's not public yet so no big deal
15:21:00<norbert79>Ok
15:22:00<norbert79>DFJustin: What is the right choice for compressed source files?
15:22:00<norbert79>I am offered movie, audio and text
15:22:00<norbert79>and etree
15:22:00<DFJustin>pick text and an admin can move it later
15:23:00<norbert79>cheers
15:29:00<norbert79>Done
16:15:00<godane>looks like fireflyfans.net stores the bluesun images using the file's md5sum
16:41:00<joepie91>anything else that needs wget-warcing?
16:48:00<Nemo_bis>joepie91: are you open also to different suggestions? :)
16:48:00<joepie91>that depends on what said suggestion is :P
16:48:00<Nemo_bis>I put some on http://archiveteam.org/index.php?title=Magazines_and_journals
16:50:00<joepie91>Nemo_bis: I can't do torrents, though
16:50:00<Nemo_bis>ah
16:50:00<joepie91>disallowed by the host that I'm using
16:50:00<joepie91>because it's very IO heavy
16:51:00<joepie91>:P
16:51:00<balrog_>:[
16:51:00<joepie91>see https://srsvps.com/terms.html
16:51:00<balrog_>even if you limit to a few connections at a time? ahh
16:51:00<balrog_>I understand that OVH is pretty lenient
16:51:00<balrog_>and is popular for seedboxes
16:51:00<joepie91>ya, but my only OVH box that I could use for this would be my kimsufi
16:51:00<balrog_>yeah
16:52:00<joepie91>:P
16:52:00<joepie91>and that one isn't supposed to do anything besides function as a testing box for my vps panel
16:52:00<joepie91>don't want to risk suspension or similar
16:52:00<Nemo_bis>aww 503 Service Unavailable
16:52:00<joepie91>I have one other VPS on an OVH server, but if I start torrenting on that, encyclopedia dramatica will probably slow down to a crawl, since it's a backend server >.>
16:53:00<Nemo_bis>there's an interesting NATO FTP site there that you could grab though
16:53:00<balrog_>ohhh?
16:53:00balrog_ has been on the lookout for NATO documents
16:53:00<balrog_>well, certain specific ones having to do with speech codecs
16:53:00<joepie91>Nemo_bis: how large is it, approx?
16:53:00<joepie91>I have about 50G of space left
16:53:00<Nemo_bis>joepie91: dunno, some dozens GiB perhaps
16:53:00<joepie91>on this vps
16:53:00<joepie91>hmm
16:53:00<joepie91>I could do it partially
16:53:00<Nemo_bis>ftp.rta.nato.int/PubFullText/AGARD/ or http://thepiratebay.se/torrent/7639843/AGARD_monographs_(_AGARDographs_) 15 GiB/453
16:54:00<Nemo_bis>and parent folder
16:54:00<joepie91>what is the easiest way to mass-download from an FTP server?
16:54:00<Nemo_bis>wget
16:54:00<DFJustin>lftp
16:54:00<joepie91>I'd assume warc isn't suitable for this
16:54:00<Nemo_bis>http://archiveteam.org/index.php?title=FTP
16:54:00<joepie91>ah, nice :P
16:55:00<DFJustin>it doesn't really matter if you're doing a one time pull, I like lftp for updating an existing mirror
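The two commands being recommended, assembled as a dry run (`DRYRUN=echo` prints them instead of contacting the server); the NATO host and path come from the discussion above, the local directory name is invented.

```shell
#!/bin/sh
# wget for a one-time pull, lftp for updating an existing mirror.
# Dry run only: DRYRUN=echo prints the commands.
set -e
URL='ftp://ftp.rta.nato.int/PubFullText/AGARD/'
DRYRUN=echo

# one-time recursive pull
$DRYRUN wget --mirror --no-parent "$URL" > ftp-cmds.txt

# resumable / incremental mirror into ./AGARD
$DRYRUN lftp -e "mirror --continue /PubFullText/AGARD/ ./AGARD; quit" \
  ftp.rta.nato.int >> ftp-cmds.txt
```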
16:56:00<joepie91>downloading...
16:57:00<joepie91>140kb/sec
16:57:00<joepie91>:p
16:57:00<joepie91>not particularly fast
16:57:00<joepie91>KB*
16:57:00<Nemo_bis>:<
16:57:00<joepie91>you'd think nato could afford a decent pipe
16:57:00<joepie91>oh, by the way, alard, are you here?
17:04:00<joepie91>Nemo_bis: I've started downloading the car and motorcycle manual torrents from my home connection (on my media server)
17:04:00<joepie91>:P
17:04:00<joepie91>it'll be slow, but it's something
17:05:00<joepie91>it'll download at 1.1MB/sec max, and upload at like 60KB/sec max
17:06:00<godane>i found a amiga virus collection
17:07:00<joepie91>haha
17:08:00<Nemo_bis>joepie91: ok, upload with the bulk uploader, you know how?
17:09:00<joepie91>Nemo_bis: no idea, and the standard uploader on the archive.org site says it doesn't work properly on unix-based systems
17:09:00<joepie91>honestly, archive.org needs some kind of software to do uploads
17:09:00<joepie91>easily
17:09:00<joepie91>including the whole tagging etc
17:10:00<Nemo_bis>https://wiki.archive.org/twiki/bin/view/Main/IAS3BulkUploader
17:12:00<joepie91>whoa, I did not know that existed
17:20:00<Nemo_bis>joepie91: if your upload bandwidth is so little, perhaps you chose too big a torrent :)
17:20:00<Nemo_bis>but let's see
17:21:00<joepie91>nah, I'll just have patience :P
17:21:00<joepie91>that server runs 24/7 anyway
17:21:00<joepie91>plus I'll probably upload stuff separately
17:26:00<godane>this item needs some help: www.engadget.com-images-2007-mirror
17:34:00<schbiridi>joepie91: i mount the ftp with curlftpfs and then use rsync
17:34:00<joepie91>schbiridi: you mean the archive.org FTP upload?
17:35:00<joepie91>according to the info page that's not recommended because of bandwidth
17:53:00<schbiridi>nah, for mirroring FTP servers
17:54:00<schbiridi>sorry :D
17:54:00<balrog_>schbiridi: I usually use lftp here, or wget
17:57:00<Nemo_bis>the wonders of interoperable systems: everyone can use any flavour of software one likes most ;)
18:00:00<joepie91>ahh
18:01:00<schbiridi>i find rsync the most versatile
18:53:00<alard>joepie91: Yes?
19:06:00<joepie91>alard: I have something that may be of use to you for future projects
19:06:00<joepie91>I wrote a self-extracting python script thingie
19:06:00<joepie91>I'm using it for the installer for my VPS panel, but it may be useful for stand-alone versions of crawlers etc as well
19:07:00<joepie91>https://github.com/joepie91/cvm/tree/develop/tools/pysfx
19:07:00<joepie91>it doesn't have its own repo yet (it will soon, though)
19:07:00<joepie91>example usage: https://github.com/joepie91/cvm/blob/develop/installer/build.sh
19:07:00<joepie91>end result is a single .py that you can run, it'll extract itself to a temp dir, and run the specified command
19:49:00<ivan`>blip.tv a serious risk given "@richhickey says @skelter Blip doesn't want conference vids, tech talks etc, and gave us 2 weeks to move."
19:49:00<ivan`>I have all the Clojure videos, don't worry about those
19:58:00<chronomex>oh really?
19:58:00<chronomex>I thought blip.tv has the HOPE video
20:00:00<ivan`>I also have all of http://blip.tv/linuxconfau
20:04:00<ivan`>going to grab http://blip.tv/linux-journal and other things I can find on google
20:06:00<ivan`>(note: my upstream is terrible and no real backups)
20:07:00<alard>joepie91: Ah, that's something to remember. Similar to py2exe, but for Linux?
20:08:00<joepie91>alard: more similar to a 7zip sfx with autorun, I'd say, but for Linux :P
20:09:00<joepie91>it doesn't include dependencies etc, it just works with whatever tar.gz you give it
20:09:00<joepie91>you could theoretically pack up something entirely non-python with it
20:09:00<joepie91>as it will just run whatever command you specify, but with working directory set to the temp extraction directory
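The mechanism joepie91 describes (a runnable stub with an archive appended, extracted to a temp dir, then a command run there) can be sketched as a shell self-extractor. This is not pysfx itself: the `__ARCHIVE__` marker, the payload, and its `run.sh` command are all invented for the demo.

```shell
#!/bin/sh
# Build and run a minimal self-extracting script: stub + tar.gz payload.
set -e
workdir=$(mktemp -d)
mkdir -p "$workdir/payload"
printf 'echo hello-from-payload\n' > "$workdir/payload/run.sh"
tar -czf "$workdir/payload.tar.gz" -C "$workdir/payload" .

cat > "$workdir/stub.sh" <<'EOF'
#!/bin/sh
set -e
tmp=$(mktemp -d)
# everything after the __ARCHIVE__ marker line is the tar.gz payload
offset=$(awk '/^__ARCHIVE__$/ {print NR + 1; exit}' "$0")
tail -n +"$offset" "$0" | tar -xzf - -C "$tmp"
cd "$tmp" && sh ./run.sh
exit 0
__ARCHIVE__
EOF

# concatenating stub and archive yields the single runnable file
cat "$workdir/stub.sh" "$workdir/payload.tar.gz" > "$workdir/installer.sh"
chmod +x "$workdir/installer.sh"
"$workdir/installer.sh" > sfx-out.txt
```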
20:13:00<ivan`>what's our preferred piratepad?
20:13:00<ivan`>piratepad?
20:16:00<alard>joepie91: OK.
20:24:00<ivan`>if anyone is really interested in blip.tv I can provide a youtube-dl patch and start listing channels in piratepad
20:25:00<ivan`>otherwise I'll just continue sucking things down at 1M/s and hope 2 weeks only apply to hickey
20:35:00<ivan`>surprised to see a lot of great content, site must have terrible googlejuice
21:38:00<godane>SketchCow: so i have the bluesunroom of fireflyfans.net
21:38:00<godane>2.1gb warc.gz with 11000+ images
22:11:00<SketchCow>Goodness
22:13:00<Nemo_bis>I think hank briefly killed the site (or so) earlier today, maybe with this https://archive.org/~tracey/mrtg/derives.html
22:16:00<godane>SketchCow: did you grab japan.gamespot.com?
22:16:00<godane>i only grabbed 90mb
22:18:00<alard>SketchCow: I uploaded the GitHub files, see http://archive.org/details/github-downloads-201212-part-a (to -z and -0 to -9)
22:23:00<SketchCow>So this is the after-we-fixed-the-bugs thing?
22:24:00<alard>Yes.
22:32:00<SketchCow>Fantastic.
22:32:00<SketchCow>I think I put this in software.
22:36:00<dashcloud>ivan`: I'm interested in pulling down the blip.tv stuff and I've got a good pipe here- the latest released version of youtube-dl seems to work fine with blip- any special options to use?
22:38:00<ivan`>dashcloud: yes, you need a patch to get the Source/720p content
22:38:00<ivan`>sec
22:41:00<dashcloud>just ping me with it, and I'll get to it sometime tonight
22:43:00<SketchCow>http://archive.org/details/github-downloads-2012-12
22:48:00<SketchCow>Don't do mediatype data, do mediatype software
22:50:00<alard>"data" is the default, I think. (I'm not even sure if non-admins can upload anything but the default type, but I haven't tried that.)
22:50:00<DFJustin>non-admins can upload anything
22:56:00<SketchCow>Anyway, it's all set now
23:01:00<ivan`>dashcloud: http://piratepad.net/R18h7lKV1N has the patch and some channels
23:02:00<ivan`>it's possible that the clojure channel got specifically targeted for using up too much of their bandwidth or something, but blip.tv still seems careless
23:12:00<dashcloud>ivan`: conferences don't seem a terribly good fit for blip as it is now- conferences happen once a year, and blip is geared toward episodic-type content (weekly/biweekly/monthly shows)
23:12:00<godane>so i got a account to astraweb.com
23:13:00<godane>only got the $10/25gb credit
23:13:00<dashcloud>holy crap- that's weird watching text suddenly show up on the page
23:13:00<godane>just to test if i can get episode of attack of the show without missing parts
23:20:00<joepie91>dashcloud: heh, you're unfamiliar with etherpad?
23:20:00<joepie91>it's basically multiplayer notepad :P
23:21:00<dashcloud>I've never used one before
23:21:00<dashcloud>did you run the git-annex kickstarter?
23:22:00<joepie91>git-annex kickstarter?
23:22:00<godane>good news everyone
23:22:00<godane>i may be able to save more aots
23:35:00<godane>also most of my engadget dumps are uploaded
23:35:00<godane>will do a 2012 year dump sometime next year
23:41:00<ivan`>oh hey http://blip.tv/acquia and its videos got nuked too
23:41:00<ivan`>whatever it was. not that I'll have any idea now.