#archiveteam<efnet> log for 2013-01-06

04:25:00	<hiker2>	Is there an easy way to convert a warc to a warc.gz?
04:29:00	<no2pencil>	you want to compress it with gzip?
04:40:00	<hiker2>	I think warc.gz technically compresses the individual blocks, and is not simply a compressed file.
04:45:00	<GLaDOS>	Correct.
04:57:00	<hiker2>	Are mirrors classified as panic grabs even if there isn't any real worry the site will go down?
05:03:00	<godane>	hiker2: i'm always doing panic grabs of sites that are not going down
05:04:00	<hiker2>	Where do I upload them to?
05:04:00	<godane>	i normally upload my stuff to texts
05:05:00	<godane>	when it's warc.gz dumps
05:11:00	<hiker2>	Wouldn't the files compress a lot better if you compressed them after saving them to e.g. a 7z file?
05:13:00	<DFJustin>	yes but then the wayback machine's software wouldn't be able to ingest them directly
05:13:00	<hiker2>	Is anyone actually using the wayback machine software to view warcs besides IA?
05:14:00	<GLaDOS>	Remember: warc.gz isn't just a compressed warc file
05:14:00	<GLaDOS>	It compresses the chunks, but not the headers and hashes
05:14:00	<GLaDOS>	or something
05:14:00	<hiker2>	and that's why its compression suffers
05:14:00	<DFJustin>	there's the IA partners under the Archive-It umbrella http://archive-it.org/
05:14:00	<hiker2>	GLaDOS: It appears to compress the entire Record, including headers.
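[editor's note] A minimal sketch of the per-record compression discussed above, assuming an uncompressed WARC whose records each carry a Content-Length header and end with two CRLFs. Every record (headers included) becomes its own gzip member, and the members are simply concatenated. The function names and the `make_record` helper are hypothetical and only for illustration; the demo records are deliberately minimal and omit fields a real WARC record would have.

```python
import gzip
import io
import zlib


def iter_warc_records(data):
    """Yield each raw record: header lines, blank line, block, trailing CRLF CRLF."""
    pos = 0
    while pos < len(data):
        header_end = data.index(b"\r\n\r\n", pos) + 4
        length = None
        for line in data[pos:header_end].split(b"\r\n"):
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        if length is None:
            raise ValueError("WARC record without Content-Length")
        # The block is followed by the two CRLFs that close a record.
        record_end = header_end + length + 4
        yield data[pos:record_end]
        pos = record_end


def warc_to_warc_gz(data):
    """Compress every record as its own gzip member; return (bytes, member offsets)."""
    out = io.BytesIO()
    offsets = []
    for record in iter_warc_records(data):
        offsets.append(out.tell())
        out.write(gzip.compress(record))
    return out.getvalue(), offsets


def read_record_at(gz_data, offset):
    """Decompress only the gzip member that starts at the given offset."""
    member = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    return member.decompress(gz_data[offset:])


def make_record(body):
    """Hypothetical helper: build a stripped-down record for the demo."""
    header = (b"WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: "
              + str(len(body)).encode() + b"\r\n\r\n")
    return header + body + b"\r\n\r\n"


warc = make_record(b"hello") + make_record(b"world!")
gz, offsets = warc_to_warc_gz(warc)
assert gzip.decompress(gz) == warc  # still a valid gzip stream as a whole
assert read_record_at(gz, offsets[1]) == make_record(b"world!")
```

Because each member can be decompressed on its own from a known offset, a reader can jump straight to one record without unpacking the whole file, which is the trade-off against the better ratio of compressing everything as one stream.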
05:15:00	<hiker2>	DFJustin: but no one here uses it. And the WARCs generated here are not being used with the wayback machine.
05:15:00	<DFJustin>	that second point is incorrect
05:16:00	<DFJustin>	jason has worked with the IA guys to ingest everything we've done into the new beta wayback machine
05:16:00	<DFJustin>	except for the very latest new grabs
05:16:00	<hiker2>	Can the beta be accessed from anywhere?
05:16:00	<DFJustin>	yes
05:16:00	<DFJustin>	http://wayback-beta.archive.org/
05:17:00	<hiker2>	DFJustin: could you give me an example of a site that #archiveteam saved and is now available through the beta?
05:18:00	<DFJustin>	look at Nov/Dec here http://web-beta.archive.org/web/20110701000000*/http://www.splinder.com/
05:18:00	<DFJustin>	the regular wayback machine was doing spotty crawls but the huge spike is us
05:19:00	<hiker2>	Do they load for you?
05:19:00	<hiker2>	ah, it loaded
05:20:00	<hiker2>	neat. I didn't realize this stuff was being pooled together somewhere
05:20:00	<hiker2>	How quickly are sites added to the machine after being grabbed on here?
05:21:00	<DFJustin>	updating the wayback machine has so far been a manual process done irregularly at multi-month intervals, I think the plan with the new one is to do it more often but I don't know the details
05:26:00	<DFJustin>	here's the spreadsheet jason was using, blue means go for wayback ingestion https://docs.google.com/a/textfiles.com/spreadsheet/ccc?key=0ApQeH7pQrcBWdDZIUEVjR3d1UmRoU0lPSWZYX0Q1Ync#gid=0
05:28:00	<DFJustin>	so not actually everything because mobileme is too freaking huge for right now
06:03:00	<godane>	there must have been older pdfs on computerpoweruser.com
06:06:00	<godane>	nevermind
06:06:00	<godane>	it was a dead link
06:23:00	<hiker2>	If the wayback machine already has a good archive, should I bother archiving a site?
06:23:00	<godane>	i'm grabbing ftp://ftp.futurenet.com
06:23:00	<godane>	hiker2: i say archive it again
06:23:00	<godane>	sometimes wayback machine can't get stuff cause of robots.txt
06:24:00	<hiker2>	some of the waybackmachine grabs don't properly archive external images either
06:25:00	<hiker2>	The first real spider I wrote now grabs all the urls in a sitemap.xml. Seems to work well for blogspot sites, so you can just download the sitemap and feed it into the spider.
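[editor's note] A rough sketch of the sitemap step described above, not hiker2's actual spider: it pulls every <loc> URL out of a sitemap.xml so the list can be fed to a crawler. It assumes the standard sitemaps.org namespace; the function name and the example.blogspot.com URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap and sitemap index files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_bytes):
    """Return every <loc> URL found in a sitemap (or sitemap index) document."""
    root = ET.fromstring(xml_bytes)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


# Hypothetical sitemap content, standing in for a downloaded sitemap.xml.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.blogspot.com/2012/12/first-post.html</loc></url>
  <url><loc>http://example.blogspot.com/2013/01/second-post.html</loc></url>
</urlset>"""

for url in sitemap_urls(sample):
    print(url)
```

A sitemap index lists further sitemaps in its <loc> entries, so a spider would likely need to fetch those and run the same function on each.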
06:45:00	<hiker2>	godane: How can I tell which uploads are yours?
06:59:00	<godane>	hiker2: https://archive.org/search.php?query=uploader%3A%22slaxemulator%40gmail.com%22
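[editor's note] The search link above is just the query uploader:"<email>" percent-encoded into the URL. A small sketch of building such a link; the function name is hypothetical, and it assumes the search.php endpoint shown in the link.

```python
from urllib.parse import urlencode


def uploader_search_url(email):
    """Build an archive.org search URL listing items uploaded by the given account."""
    query = 'uploader:"{}"'.format(email)
    return "https://archive.org/search.php?" + urlencode({"query": query})


print(uploader_search_url("slaxemulator@gmail.com"))
```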
07:00:00	<hiker2>	Do individual items not show who uploaded them?
07:36:00	<DFJustin>	they do but you have to look at the meta.xml
07:37:00	<DFJustin>	unless you're a collection admin
07:37:00	<Lord_Nigh>	i'm grabbing www.polymicrosystems.com/files/ but am NOT using warc... not even sure HOW to use warc
07:38:00	<DFJustin>	http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
08:09:00	<godane>	i'm now uploading the official xbox magazine web archive
08:23:00	<godane>	looks like the wayback machine has all of the xbox podcasts from dl.oxmonline.com
08:41:00	<lemonkey>	LilyLivingstone: 5 megabyte hard drive from 1956, being loaded via forklift onto plane. http://t.co/Cop9kR0l
09:43:00	<godane>	is anyone mirroring g4tv videos?
10:17:00	<alard>	hiker2: The warcproxy also depends on the per-record gzip compression.
10:17:00	<Nemo_bis>	Is this archived somewhere? http://torcache.net/
10:33:00	<godane>	i think this needs to be backed up: ftp://ftp.download.packardbell.com/
10:34:00	<godane>	it has manuals and drivers packardbell or hp stuff
10:52:00	<chronomex>	godane: on it
10:54:00	<Nemo_bis>	the NATO FTP is still downloading... at 30 KiB/s now
10:54:00	<chronomex>	nice
10:54:00	<chronomex>	packard bell is similarly 90s-bound
11:09:00	<Nemo_bis>	godane: do you also try eMule to grab stuff?
11:10:00	<Nemo_bis>	in some countries/languages/niches it's still widely used and some stuff never reaches torrents or other sharing systems
11:20:00	<chazchaz>	Do you know any examples off the top of your head?
11:24:00	<Nemo_bis>	chazchaz: examples of what?
11:38:00	<chronomex>	03:10:23 <@Nemo_bis> in some countries/languages/niches it's still widely used and some stuff never reaches torrents or other sharing systems
11:38:00	<chronomex>	countries, I'd wager
11:42:00	<godane>	Nemo_bis: i did try to find the techtv music wars special on emule
11:43:00	<godane>	but turns out that the server it was on was called razor
11:44:00	<godane>	it was raided in feb 2006
11:51:00	<Nemo_bis>	chronomex: I'm sure of Italy and Spain, for instance
11:51:00	<Nemo_bis>	godane: eMule is serverless since ages, it uses KAD
11:52:00	<godane>	i'm on amule since i run linux
11:52:00	<Nemo_bis>	so?
11:52:00	<godane>	i have searched techtv and i'm still not finding it
11:53:00	<Nemo_bis>	I have to use eMule on wine because I don't have a public IP, Fastweb uses NAT
11:53:00	<Nemo_bis>	and only MorphXT has a decent support for it
11:54:00	<Nemo_bis>	godane: KAD needs some time
11:55:00	<Nemo_bis>	you can try and add to downloads other techtv things and find more noes
11:55:00	<Nemo_bis>	*nodes
11:56:00	<Nemo_bis>	that said, it might be the wrong thing to search there, dunno
12:37:00	<Nemo_bis>	http://www.introni.it/marzaglia.html
12:38:00	<hiker2>	godane: What exactly from the official xbox mag are you archiving?
12:44:00	<godane>	hiker2: everything here: http://www.oxmonline.com/secretstash
12:48:00	<hiker2>	godane: Are those the full issues?
12:48:00	<Nemo_bis>	chronomex: interested in archiving these? http://www.tubebooks.org/technical_books_online.htm
12:48:00	<hiker2>	wow. Why do they offer them for free? Most sites would charge for them.
12:50:00	<hiker2>	godane: Do you delete the archives from your computer after you upload them to IA?
12:54:00	<godane>	hiker2: no
12:54:00	<hiker2>	What do you do with them?
12:54:00	<godane>	i mostly burn them to bluray when i need space
12:54:00	<hiker2>	wow. you are serious about archiving!
12:55:00	<godane>	living with dialup made me serious about archiving
12:55:00	<hiker2>	I had dialup as well.. But you don't have it anymore I assume
13:56:00	<godane>	Nemo_bis: do you know how to get better search results in emule?
14:13:00	<Nemo_bis>	godane: you have to know the nodes closer to those who have that stuff
14:13:00	<Nemo_bis>	when you've been downloading/uploading some things for a while, you're more likely to find similar things
14:14:00	<Nemo_bis>	you also have to try all possible combinations and orders for your keywords in KAD searches, because they're a bit silly
14:15:00	<Nemo_bis>	if your first keyword is full of noise, subsequent keywords usually will not help narrowing the search
14:15:00	<Nemo_bis>	but if it's too specific you may not find anything
14:16:00	<Nemo_bis>	of course it's better if you have a "high id", which needs a public ip, properly configured firewall etc.
15:27:00	<chazchaz>	Nemo_bis: Examples of niche things one might only find on eMule.
15:36:00	<schbiridi>	chazchaz: every network/service/community might have things you can find nowhere else
15:38:00	<chazchaz>	I know, I was just curious about examples for eMule
15:59:00	<godane>	there are tons of trails on future publishing ftp
15:59:00	<godane>	*game trailers
17:19:00	<Nemo_bis>	chazchaz: I can't provide examples, one just has to try... and the results also depend on "where" one is, I suspect, nor do I have a good setup to have all possible search results
18:25:00	<DFJustin>	I've grabbed some shareware cds off emule
18:26:00	<DFJustin>	as Nemo_bis says it's more popular with italians so e.g. I found https://archive.org/details/cdrom-hackers-magazine-57
18:30:00	<Nemo_bis>	Nice. :)
18:30:00	<Nemo_bis>	If you keep it in the shared files, you'll later find more similar stuff.
18:32:00	<Nemo_bis>	That's the only ISO I see, too. I'm downloading a couple PDFs though
18:32:00	<DFJustin>	yeah lots of ebooks
18:41:00	<DFJustin>	also all this stuff came from packs on emule https://archive.org/details/firearmsmanuals https://archive.org/details/manuals-apple https://archive.org/details/printer-manuals https://archive.org/details/yamaha_bike_manuals
19:04:00	<DFJustin>	I see a bunch of photoshop magazine CDs, grabbing those
19:41:00	<hiker1>	Are there any WARC guis?
20:03:00	<Nemo_bis>	This guy claims to have scanned 10 000 magazines: http://www.blogdopicco.blogspot.com/
20:03:00	<Nemo_bis>	And he has uploaded only a tiny fraction of them.
20:04:00	<hiker1>	People like to exaggerate their claims.
22:30:00	<_obscure_>	Mr Sketch: I found a site you might like, it's an attempt at an archive of the Wisconsin punk rock scene band recordings from the 1970-2000. It's a really interesting thing and it's right up your alley. http://www.mkepunk.com/
23:24:00	<hiker1>	I believe my WarcMiddleware is sufficiently advanced that it could be used to archive websites now: https://github.com/iramari/WarcMiddleware
23:25:00	<hiker1>	I have successfully used it to do so at least.
23:45:00	<chronomex>	nice