04:25:00 | <hiker2> | Is there an easy way to convert a warc to a warc.gz? |
04:29:00 | <no2pencil> | you want to compress it with gzip? |
04:40:00 | <hiker2> | I think warc.gz technically compresses the individual blocks, and is not simply a compressed file. |
04:45:00 | <GLaDOS> | Correct. |
04:57:00 | <hiker2> | Are mirrors classified as panic grabs even if there isn't any real worry the site will go down? |
05:03:00 | <godane> | hiker2: i'm always doing panic grabs of sites that are not going down |
05:04:00 | <hiker2> | Where do I upload them to? |
05:04:00 | <godane> | i normally upload my stuff to texts |
05:05:00 | <godane> | when its warc.gz dumps |
05:11:00 | <hiker2> | Wouldn't the files compress a lot better if you compressed them after saving them to e.g. a 7z file? |
05:13:00 | <DFJustin> | yes but then the wayback machine's software wouldn't be able to ingest them directly |
05:13:00 | <hiker2> | Is anyone actually using the wayback machine software to view warcs besides IA? |
05:14:00 | <GLaDOS> | Remember: warc.gz isn't just a compressed warc file |
05:14:00 | <GLaDOS> | It compresses the chunks, but not the headers and hashes |
05:14:00 | <GLaDOS> | or something |
05:14:00 | <hiker2> | and that's why its compression suffers |
05:14:00 | <DFJustin> | there's the IA partners under the Archive-It umbrella http://archive-it.org/ |
05:14:00 | <hiker2> | GLaDOS: It appears to compress the entire Record, including headers. |
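[Editor's note] The conversion asked about at 04:25 can be sketched as follows, assuming the layout described above: each whole record, headers included, becomes its own gzip member, and the members are concatenated. The function names `iter_warc_records` and `warc_to_warc_gz` are hypothetical, and the parser is deliberately minimal (it trusts `Content-Length` and does no validation beyond the version line), so treat it as an illustration rather than a replacement for a real WARC tool.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield each raw WARC record (header + block + CRLF CRLF) as bytes."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of input
        if line in (b"\r\n", b"\n"):
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line[:40])
        header = [line]
        length = None
        while True:
            line = stream.readline()
            if not line:
                raise ValueError("truncated record header")
            header.append(line)
            if line in (b"\r\n", b"\n"):
                break  # blank line ends the header
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        if length is None:
            raise ValueError("record has no Content-Length")
        block = stream.read(length)
        if len(block) != length:
            raise ValueError("truncated record block")
        yield b"".join(header) + block + b"\r\n\r\n"


def warc_to_warc_gz(src, dst):
    """Write every record from src to dst as a separate gzip member."""
    count = 0
    for record in iter_warc_records(src):
        dst.write(gzip.compress(record))  # one complete gzip member per record
        count += 1
    return count
```

With files, this would be called as `warc_to_warc_gz(open("in.warc", "rb"), open("out.warc.gz", "wb"))` (file names are placeholders). Because each member is compressed independently, the output is larger than compressing the whole file at once, which matches the compression trade-off raised at 05:11.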
05:15:00 | <hiker2> | DFJustin: but no one here uses it. And the WARCs generated here are not being used with the wayback machine. |
05:15:00 | <DFJustin> | that second point is incorrect |
05:16:00 | <DFJustin> | jason has worked with the IA guys to ingest everything we've done into the new beta wayback machine |
05:16:00 | <DFJustin> | except for the very latest new grabs |
05:16:00 | <hiker2> | Can the beta be accessed from anywhere? |
05:16:00 | <DFJustin> | yes |
05:16:00 | <DFJustin> | http://wayback-beta.archive.org/ |
05:17:00 | <hiker2> | DFJustin: could you give me an example of a site that #archiveteam saved and is now available through the beta? |
05:18:00 | <DFJustin> | look at Nov/Dec here http://web-beta.archive.org/web/20110701000000*/http://www.splinder.com/ |
05:18:00 | <DFJustin> | the regular wayback machine was doing spotty crawls but the huge spike is us |
05:19:00 | <hiker2> | Do they load for you? |
05:19:00 | <hiker2> | ah, it loaded |
05:20:00 | <hiker2> | neat. I didn't realize this stuff was being pooled together somewhere |
05:20:00 | <hiker2> | How quickly are sites added to the machine after being grabbed on here? |
05:21:00 | <DFJustin> | updating the wayback machine has so far been a manual process done irregularly at multi-month intervals, I think the plan with the new one is to do it more often but I don't know the details |
05:26:00 | <DFJustin> | here's the spreadsheet jason was using, blue means go for wayback ingestion https://docs.google.com/a/textfiles.com/spreadsheet/ccc?key=0ApQeH7pQrcBWdDZIUEVjR3d1UmRoU0lPSWZYX0Q1Ync#gid=0 |
05:28:00 | <DFJustin> | so not actually everything because mobileme is too freaking huge for right now |
06:03:00 | <godane> | there must have been older pdfs on computerpoweruser.com |
06:06:00 | <godane> | nevermind |
06:06:00 | <godane> | it was a dead link |
06:23:00 | <hiker2> | If the wayback machine already has a good archive, should I bother archiving a site? |
06:23:00 | <godane> | i'm grabbing ftp://ftp.futurenet.com |
06:23:00 | <godane> | hiker2: i say archive it again |
06:23:00 | <godane> | sometimes wayback machine can't get stuff cause of robots.txt |
06:24:00 | <hiker2> | some of the waybackmachine grabs don't properly archive external images either |
06:25:00 | <hiker2> | The first real spider I wrote now grabs all the urls in a sitemap.xml. Seems to work well for blogspot sites, so you can just download the sitemap and feed it into the spider. |
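[Editor's note] The sitemap step described above can be sketched like this. It is not hiker2's spider; `sitemap_urls` is a hypothetical helper that only pulls the `<loc>` entries out of an already-downloaded sitemap.xml so they can be handed to whatever crawler is in use.

```python
import xml.etree.ElementTree as ET


def sitemap_urls(xml_data):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_data)
    urls = []
    for element in root.iter():
        # Tags look like "{namespace}loc"; compare on the local name only.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and element.text and element.text.strip():
            urls.append(element.text.strip())
    return urls
```

The same function also returns the child sitemap addresses from a sitemap index file, since those use `<loc>` as well; a spider would need to fetch and parse those in turn.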
06:45:00 | <hiker2> | godane: How can I tell which uploads are yours? |
06:59:00 | <godane> | hiker2: https://archive.org/search.php?query=uploader%3A%22slaxemulator%40gmail.com%22 |
07:00:00 | <hiker2> | Do individual items not show who uploaded them? |
07:36:00 | <DFJustin> | they do but you have to look at the meta.xml |
07:37:00 | <DFJustin> | unless you're a collection admin |
07:37:00 | <Lord_Nigh> | i'm grabbing www.polymicrosystems.com/files/ but am NOT using warc... not even sure HOW to use warc |
07:38:00 | <DFJustin> | http://www.archiveteam.org/index.php?title=Wget_with_WARC_output |
08:09:00 | <godane> | i'm now uploading the official xbox magazine web archive |
08:23:00 | <godane> | looks like the wayback machine has all of the xbox podcasts from dl.oxmonline.com |
08:41:00 | <lemonkey> | LilyLivingstone: 5 megabyte hard drive from 1956, being loaded via forklift onto plane. http://t.co/Cop9kR0l |
09:43:00 | <godane> | is anyone mirroring g4tv videos? |
10:17:00 | <alard> | hiker2: The warcproxy also depends on the per-record gzip compression. |
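[Editor's note] A likely reason a tool would depend on per-record compression (an assumption here, not something stated about warcproxy's internals) is random access: if the byte offset of a record's gzip member is known, that one record can be decompressed without reading the rest of the file. A minimal sketch, with `read_record_at` as a hypothetical name:

```python
import gzip
import io
import zlib


def read_record_at(stream, offset, chunk_size=65536):
    """Decompress the single gzip member that starts at the given byte offset."""
    stream.seek(offset)
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    pieces = []
    while not decompressor.eof:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # input ended before the member was complete
        pieces.append(decompressor.decompress(chunk))
    return b"".join(pieces)
```

The decompressor stops at the end of the member, so bytes belonging to the following record are ignored. A file compressed as one big gzip stream offers no such entry points.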
10:17:00 | <Nemo_bis> | Is this archived somewhere? http://torcache.net/ |
10:33:00 | <godane> | i think this needs to be backed up: ftp://ftp.download.packardbell.com/ |
10:34:00 | <godane> | it has manuals and drivers for packardbell or hp stuff |
10:52:00 | <chronomex> | godane: on it |
10:54:00 | <Nemo_bis> | the NATO FTP is still downloading... at 30 KiB/s now |
10:54:00 | <chronomex> | nice |
10:54:00 | <chronomex> | packard bell is similarly 90s-bound |
11:09:00 | <Nemo_bis> | godane: do you also try eMule to grab stuff? |
11:10:00 | <Nemo_bis> | in some countries/languages/niches it's still widely used and some stuff never reaches torrents or other sharing systems |
11:20:00 | <chazchaz> | Do you know any examples off the top of your head? |
11:24:00 | <Nemo_bis> | chazchaz: examples of what? |
11:38:00 | <chronomex> | 03:10:23 <@Nemo_bis> in some countries/languages/niches it's still widely used and some stuff never reaches torrents or other sharing systems |
11:38:00 | <chronomex> | countries, I'd wager |
11:42:00 | <godane> | Nemo_bis: i did try to find the techtv music wars special on emule |
11:43:00 | <godane> | but it turns out the server it was on was called razor |
11:44:00 | <godane> | it was raided in feb 2006 |
11:51:00 | <Nemo_bis> | chronomex: I'm sure of Italy and Spain, for instance |
11:51:00 | <Nemo_bis> | godane: eMule has been serverless for ages, it uses KAD |
11:52:00 | <godane> | i'm on amule since i run linux |
11:52:00 | <Nemo_bis> | so? |
11:52:00 | <godane> | i have searched techtv and i'm still not finding it |
11:53:00 | <Nemo_bis> | I have to use eMule on wine because I don't have a public IP, Fastweb uses NAT |
11:53:00 | <Nemo_bis> | and only MorphXT has a decent support for it |
11:54:00 | <Nemo_bis> | godane: KAD needs some time |
11:55:00 | <Nemo_bis> | you can try and add to downloads other techtv things and find more noes |
11:55:00 | <Nemo_bis> | *nodes |
11:56:00 | <Nemo_bis> | that said, it might be the wrong thing to search there, dunno |
12:37:00 | <Nemo_bis> | http://www.introni.it/marzaglia.html |
12:38:00 | <hiker2> | godane: What exactly from the official xbox mag are you archiving? |
12:44:00 | <godane> | hiker2: everything here: http://www.oxmonline.com/secretstash |
12:48:00 | <hiker2> | godane: Are those the full issues? |
12:48:00 | <Nemo_bis> | chronomex: interested in archiving these? http://www.tubebooks.org/technical_books_online.htm |
12:48:00 | <hiker2> | wow. Why do they offer them for free? Most sites would charge for them. |
12:50:00 | <hiker2> | godane: Do you delete the archives from your computer after you upload them to IA? |
12:54:00 | <godane> | hiker2: no |
12:54:00 | <hiker2> | What do you do with them? |
12:54:00 | <godane> | i mostly burn them to bluray when i need space |
12:54:00 | <hiker2> | wow. you are serious about archiving! |
12:55:00 | <godane> | living with dialup made me serious about archiving |
12:55:00 | <hiker2> | I had dialup as well.. But you don't have it anymore I assume |
13:56:00 | <godane> | Nemo_bis: do you know how to get better search results in emule? |
14:13:00 | <Nemo_bis> | godane: you have to know the nodes closer to those who have that stuff |
14:13:00 | <Nemo_bis> | when you've been downloading/uploading some things for a while, you're more likely to find similar things |
14:14:00 | <Nemo_bis> | you also have to try all possible combinations and orders for your keywords in KAD searches, because they're a bit silly |
14:15:00 | <Nemo_bis> | if your first keyword is full of noise, subsequent keywords usually will not help narrowing the search |
14:15:00 | <Nemo_bis> | but if it's too specific you may not find anything |
14:16:00 | <Nemo_bis> | of course it's better if you have a "high id", which needs a public ip, properly configured firewall etc. |
15:27:00 | <chazchaz> | Nemo_bis: Examples of niche things one might only find on eMule. |
15:36:00 | <schbiridi> | chazchaz: every network/service/community might have things you can find nowhere else |
15:38:00 | <chazchaz> | I know, I was just curious about examples for eMule |
15:59:00 | <godane> | there are tons of trails on future publishing ftp |
15:59:00 | <godane> | *game trailers |
17:19:00 | <Nemo_bis> | chazchaz: I can't provide examples, one just has to try... and the results also depend on "where" one is, I suspect, nor do I have a good setup to get all possible search results |
18:25:00 | <DFJustin> | I've grabbed some shareware cds off emule |
18:26:00 | <DFJustin> | as Nemo_bis says it's more popular with italians so e.g. I found https://archive.org/details/cdrom-hackers-magazine-57 |
18:30:00 | <Nemo_bis> | Nice. :) |
18:30:00 | <Nemo_bis> | If you keep it in the shared files, you'll later find more similar stuff. |
18:32:00 | <Nemo_bis> | That's the only ISO I see, too. I'm downloading a couple PDFs though |
18:32:00 | <DFJustin> | yeah lots of ebooks |
18:41:00 | <DFJustin> | also all this stuff came from packs on emule https://archive.org/details/firearmsmanuals https://archive.org/details/manuals-apple https://archive.org/details/printer-manuals https://archive.org/details/yamaha_bike_manuals |
19:04:00 | <DFJustin> | I see a bunch of photoshop magazine CDs, grabbing those |
19:41:00 | <hiker1> | Are there any WARC guis? |
20:03:00 | <Nemo_bis> | This guy claims to have scanned 10 000 magazines: http://www.blogdopicco.blogspot.com/ |
20:03:00 | <Nemo_bis> | And he has uploaded only a tiny fraction of them. |
20:04:00 | <hiker1> | People like to exaggerate their claims. |
22:30:00 | <_obscure_> | Mr Sketch: I found a site you might like, it's an attempt at an archive of Wisconsin punk rock scene band recordings from 1970-2000. It's a really interesting thing and it's right up your alley. http://www.mkepunk.com/ |
23:24:00 | <hiker1> | I believe my WarcMiddleware is sufficiently advanced that it could be used to archive websites now: https://github.com/iramari/WarcMiddleware |
23:25:00 | <hiker1> | I have successfully used it to do so at least. |
23:45:00 | <chronomex> | nice |