04:25:00<hiker2>Is there an easy way to convert a warc to a warc.gz?
04:29:00<no2pencil>you want to compress it with gzip?
04:40:00<hiker2>I think warc.gz technically compresses the individual blocks, and is not simply a compressed file.
04:45:00<GLaDOS>Correct.
04:57:00<hiker2>Are mirrors classified as panic grabs even if there isn't any real worry the site will go down?
05:03:00<godane>hiker2: i'm always doing panic grabs of sites that are not going down
05:04:00<hiker2>Where do I upload them to?
05:04:00<godane>i normally upload my stuff to texts
05:05:00<godane>when its warc.gz dumps
05:11:00<hiker2>Wouldn't the files compress a lot better if you saved them uncompressed and then packed them into e.g. a 7z file?
05:13:00<DFJustin>yes but then the wayback machine's software wouldn't be able to ingest them directly
05:13:00<hiker2>Is anyone actually using the wayback machine software to view warcs besides IA?
05:14:00<GLaDOS>Remember: warc.gz isn't just a compressed warc file
05:14:00<GLaDOS>It compresses the chunks, but not the headers and hashes
05:14:00<GLaDOS>or something
05:14:00<hiker2>and that's why its compression suffers
05:14:00<DFJustin>there's the IA partners under the Archive-It umbrella http://archive-it.org/
05:14:00<hiker2>GLaDOS: It appears to compress the entire Record, including headers.
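To illustrate the point hiker2 makes here, and the original question about converting a warc to a warc.gz, the following is a minimal sketch rather than a reference tool. It assumes the input is a well-formed uncompressed WARC in which every record carries a Content-Length header, and it compresses each whole record (headers included) as its own gzip member. The function names iter_warc_records, warc_to_warc_gz and convert_bytes are made up for this example.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield each raw WARC record (headers + block + trailing CRLF CRLF) as bytes."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if line in (b"\r\n", b"\n"):
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected WARC version line, got %r" % line[:20])
        header_lines = [line]
        content_length = None
        while True:
            line = stream.readline()
            if not line:
                raise ValueError("truncated record header")
            header_lines.append(line)
            if line in (b"\r\n", b"\n"):
                break  # blank line ends the header section
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                content_length = int(value.strip())
        if content_length is None:
            raise ValueError("record has no Content-Length header")
        block = stream.read(content_length)
        if len(block) != content_length:
            raise ValueError("truncated record block")
        yield b"".join(header_lines) + block + b"\r\n\r\n"


def warc_to_warc_gz(src, dst):
    """Write every record from src to dst as a separate gzip member; return the count."""
    count = 0
    for record in iter_warc_records(src):
        dst.write(gzip.compress(record))  # one complete gzip member per record
        count += 1
    return count


def convert_bytes(data):
    """In-memory convenience wrapper: uncompressed WARC bytes in, .warc.gz bytes out."""
    out = io.BytesIO()
    warc_to_warc_gz(io.BytesIO(data), out)
    return out.getvalue()
```

For files on disk, something like `warc_to_warc_gz(open("in.warc", "rb"), open("out.warc.gz", "wb"))` would be the equivalent call. Because gzip members can simply be concatenated, the result still decompresses as one stream with ordinary gzip tools, but the compressor starts fresh on every record, which is why the ratio is worse than compressing the whole file at once.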
05:15:00<hiker2>DFJustin: but no one here uses it. And the WARCs generated here are not being used with the wayback machine.
05:15:00<DFJustin>that second point is incorrect
05:16:00<DFJustin>jason has worked with the IA guys to ingest everything we've done into the new beta wayback machine
05:16:00<DFJustin>except for the very latest new grabs
05:16:00<hiker2>Can the beta be accessed from anywhere?
05:16:00<DFJustin>yes
05:16:00<DFJustin>http://wayback-beta.archive.org/
05:17:00<hiker2>DFJustin: could you give me an example of a site that #archiveteam saved and is now available through the beta?
05:18:00<DFJustin>look at Nov/Dec here http://web-beta.archive.org/web/20110701000000*/http://www.splinder.com/
05:18:00<DFJustin>the regular wayback machine was doing spotty crawls but the huge spike is us
05:19:00<hiker2>Do they load for you?
05:19:00<hiker2>ah, it loaded
05:20:00<hiker2>neat. I didn't realize this stuff was being pooled together somewhere
05:20:00<hiker2>How quickly are sites added to the machine after being grabbed on here?
05:21:00<DFJustin>updating the wayback machine has so far been a manual process done irregularly at multi-month intervals, I think the plan with the new one is to do it more often but I don't know the details
05:26:00<DFJustin>here's the spreadsheet jason was using, blue means go for wayback ingestion https://docs.google.com/a/textfiles.com/spreadsheet/ccc?key=0ApQeH7pQrcBWdDZIUEVjR3d1UmRoU0lPSWZYX0Q1Ync#gid=0
05:28:00<DFJustin>so not actually everything because mobileme is too freaking huge for right now
06:03:00<godane>there must have been older pdfs on computerpoweruser.com
06:06:00<godane>nevermind
06:06:00<godane>it was a dead link
06:23:00<hiker2>If the wayback machine already has a good archive, should I bother archiving a site?
06:23:00<godane>i'm grabbing ftp://ftp.futurenet.com
06:23:00<godane>hiker2: i say archive it again
06:23:00<godane>sometimes wayback machine can't get stuff cause of robots.txt
06:24:00<hiker2>some of the waybackmachine grabs don't properly archive external images either
06:25:00<hiker2>The first real spider I wrote now grabs all the urls in a sitemap.xml. Seems to work well for blogspot sites, so you can just download the sitemap and feed it into the spider.
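The sitemap approach hiker2 describes can be sketched roughly as below. This is not hiker2's spider, only an illustration of the first step: pulling the page URLs out of a sitemap.xml that has already been downloaded, so they can be fed to whatever crawler is in use. The function name urls_from_sitemap is made up for this example, and it assumes the standard sitemap layout where each URL sits in a loc element.

```python
import xml.etree.ElementTree as ET


def urls_from_sitemap(xml_data):
    """Return the text of every <loc> element in a sitemap document, in order."""
    root = ET.fromstring(xml_data)
    urls = []
    for elem in root.iter():
        # Compare only the local name so the sitemap XML namespace does not matter.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and elem.text:
            urls.append(elem.text.strip())
    return urls
```

Note that a sitemap index file also uses loc elements, but there they point at further sitemaps rather than pages, so a spider would need to fetch and parse those in turn.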
06:45:00<hiker2>godane: How can I tell which uploads are yours?
06:59:00<godane>hiker2: https://archive.org/search.php?query=uploader%3A%22slaxemulator%40gmail.com%22
07:00:00<hiker2>Do individual items not show who uploaded them?
07:36:00<DFJustin>they do but you have to look at the meta.xml
07:37:00<DFJustin>unless you're a collection admin
07:37:00<Lord_Nigh>i'm grabbing www.polymicrosystems.com/files/ but am NOT using warc... not even sure HOW to use warc
07:38:00<DFJustin>http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
08:09:00<godane>i'm now uploading the official xbox magazine web archive
08:23:00<godane>looks like the wayback machine has all of the xbox podcasts from dl.oxmonline.com
08:41:00<lemonkey>LilyLivingstone: 5 megabyte hard drive from 1956, being loaded via forklift onto plane. http://t.co/Cop9kR0l
09:43:00<godane>is anyone mirroring g4tv videos?
10:17:00<alard>hiker2: The warcproxy also depends on the per-record gzip compression.
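One likely reason tools depend on per-record gzip is random access: if each record is its own gzip member, a reader can seek to a record's byte offset and decompress only that record. The sketch below shows the idea with plain zlib and is not taken from the warcproxy code. The names read_member_at and member_offsets are made up for this example.

```python
import gzip
import zlib


def read_member_at(blob, offset):
    """Decompress the single gzip member starting at offset.

    Returns (decompressed_bytes, offset_of_next_member).
    """
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    chunk = blob[offset:]
    data = d.decompress(chunk)
    if not d.eof:
        raise ValueError("truncated gzip member at offset %d" % offset)
    # Bytes after the end of this member are left in unused_data.
    next_offset = offset + len(chunk) - len(d.unused_data)
    return data, next_offset


def member_offsets(blob):
    """Walk the file once and list the starting offset of every gzip member."""
    offsets = []
    pos = 0
    while pos < len(blob):
        offsets.append(pos)
        _, pos = read_member_at(blob, pos)
    return offsets


if __name__ == "__main__":
    # Two stand-in "records", each compressed as its own member.
    blob = gzip.compress(b"record one") + gzip.compress(b"record two")
    for off in member_offsets(blob):
        print(off, read_member_at(blob, off)[0])
```

With a whole-file gzip or a 7z archive there is no such per-record entry point, so fetching one record would mean decompressing everything before it.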
10:17:00<Nemo_bis>Is this archived somewhere? http://torcache.net/
10:33:00<godane>i think this needs to be backed up: ftp://ftp.download.packardbell.com/
10:34:00<godane>it has manuals and drivers for packardbell or hp stuff
10:52:00<chronomex>godane: on it
10:54:00<Nemo_bis>the NATO FTP is still downloading... at 30 KiB/s now
10:54:00<chronomex>nice
10:54:00<chronomex>packard bell is similarly 90s-bound
11:09:00<Nemo_bis>godane: do you also try eMule to grab stuff?
11:10:00<Nemo_bis>in some countries/languages/niches it's still widely used and some stuff never reaches torrents or other sharing systems
11:20:00<chazchaz>Do you know any examples off the top of your head?
11:24:00<Nemo_bis>chazchaz: examples of what?
11:38:00<chronomex>03:10:23 <@Nemo_bis> in some countries/languages/niches it's still widely used and some stuff never reaches torrents or other sharing systems
11:38:00<chronomex>countries, I'd wager
11:42:00<godane>Nemo_bis: i did try to find the techtv music wars special on emule
11:43:00<godane>but it turns out that the server it was on was called razor
11:44:00<godane>it was raided in feb 2006
11:51:00<Nemo_bis>chronomex: I'm sure of Italy and Spain, for instance
11:51:00<Nemo_bis>godane: eMule has been serverless for ages, it uses KAD
11:52:00<godane>i'm on amule since i run linux
11:52:00<Nemo_bis>so?
11:52:00<godane>i have searched techtv and i'm still not finding it
11:53:00<Nemo_bis>I have to use eMule on wine because I don't have a public IP, Fastweb uses NAT
11:53:00<Nemo_bis>and only MorphXT has a decent support for it
11:54:00<Nemo_bis>godane: KAD needs some time
11:55:00<Nemo_bis>you can try and add to downloads other techtv things and find more noes
11:55:00<Nemo_bis>*nodes
11:56:00<Nemo_bis>that said, it might be the wrong thing to search there, dunno
12:37:00<Nemo_bis>http://www.introni.it/marzaglia.html
12:38:00<hiker2>godane: What exactly from the official xbox mag are you archiving?
12:44:00<godane>hiker2: everything here: http://www.oxmonline.com/secretstash
12:48:00<hiker2>godane: Are those the full issues?
12:48:00<Nemo_bis>chronomex: interested in archiving these? http://www.tubebooks.org/technical_books_online.htm
12:48:00<hiker2>wow. Why do they offer them for free? Most sites would charge for them.
12:50:00<hiker2>godane: Do you delete the archives from your computer after you upload them to IA?
12:54:00<godane>hiker2: no
12:54:00<hiker2>What do you do with them?
12:54:00<godane>i mostly burn them to bluray when i need space
12:54:00<hiker2>wow. you are serious about archiving!
12:55:00<godane>living with dialup made me serious about archiving
12:55:00<hiker2>I had dialup as well.. But you don't have it anymore I assume
13:56:00<godane>Nemo_bis: do you know how to get better search results in emule?
14:13:00<Nemo_bis>godane: you have to know the nodes closer to those who have that stuff
14:13:00<Nemo_bis>when you've been downloading/uploading some things for a while, you're more likely to find similar things
14:14:00<Nemo_bis>you also have to try all possible combinations and orders for your keywords in KAD searches, because they're a bit silly
14:15:00<Nemo_bis>if your first keyword is full of noise, subsequent keywords usually will not help narrowing the search
14:15:00<Nemo_bis>but if it's too specific you may not find anything
14:16:00<Nemo_bis>of course it's better if you have a "high id", which needs a public ip, properly configured firewall etc.
15:27:00<chazchaz>Nemo_bis: Examples of niche things one might only find on eMule.
15:36:00<schbiridi>chazchaz: every network/service/community might have things you can find nowhere else
15:38:00<chazchaz>I know, I was just curious about examples for eMule
15:59:00<godane>there are tons of trails on future publishing ftp
15:59:00<godane>*game trailers
17:19:00<Nemo_bis>chazchaz: I can't provide examples, one just has to try... and the results also depend on "where" one is, I suspect, nor do I have a good setup to have all possible search results
18:25:00<DFJustin>I've grabbed some shareware cds off emule
18:26:00<DFJustin>as Nemo_bis says it's more popular with italians so e.g. I found https://archive.org/details/cdrom-hackers-magazine-57
18:30:00<Nemo_bis>Nice. :)
18:30:00<Nemo_bis>If you keep it in the shared files, you'll later find more similar stuff.
18:32:00<Nemo_bis>That's the only ISO I see, too. I'm downloading a couple PDFs though
18:32:00<DFJustin>yeah lots of ebooks
18:41:00<DFJustin>also all this stuff came from packs on emule https://archive.org/details/firearmsmanuals https://archive.org/details/manuals-apple https://archive.org/details/printer-manuals https://archive.org/details/yamaha_bike_manuals
19:04:00<DFJustin>I see a bunch of photoshop magazine CDs, grabbing those
19:41:00<hiker1>Are there any WARC guis?
20:03:00<Nemo_bis>This guy claims to have scanned 10 000 magazines: http://www.blogdopicco.blogspot.com/
20:03:00<Nemo_bis>And he has uploaded only a tiny fraction of them.
20:04:00<hiker1>People like to exaggerate their claims.
22:30:00<_obscure_>Mr Sketch: I found a site you might like, it's an attempt at an archive of Wisconsin punk rock scene band recordings from 1970-2000. It's a really interesting thing and it's right up your alley. http://www.mkepunk.com/
23:24:00<hiker1>I believe my WarcMiddleware is sufficiently advanced that it could be used to archive websites now: https://github.com/iramari/WarcMiddleware
23:25:00<hiker1>I have successfully used it to do so at least.
23:45:00<chronomex>nice