04:33:00 | <ArtimusAg> | Not sure if off-topic, but if this is relevant to current projects, I began personally archiving VGMusic.com's MIDIs and organizing them per game since there seems to be lacking a proper archive for its contents |
04:40:00 | <chronomex> | cool |
06:13:00 | <Vito``> | tef: from my experience running private bookmarking and caching/archiving services (we donated the CSS parsing code to wget), you increasingly need a "real browser" to do a good job of caching/archiving a page/site. |
06:14:00 | <Vito``> | tef: I actually work off of three different "archives" for any site we cache: we take a screenshot, we cache it with wget, and we're working on caching a static representation as captured from within the browser |
06:14:00 | <Vito``> | none can feed back into wayback machine yet, but it's on the to-do list |
06:29:00 | <Coderjoe> | that's the curse of "Web 2.0" designs :-\ |
06:47:00 | <instence> | The biggest problem right now is sites that use AJAX calls to dynamically load data. wget doesn't have a JavaScript engine, so when it hits pages like this, it just sees an empty DIV and goes nowhere. |
06:49:00 | <Vito``> | yeah, I expect to completely replace wget with phantomjs at some point |
06:50:00 | <Vito``> | well, except for single-file mirroring, like a PDF or something |
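A minimal sketch of the kind of browser-driven capture being described here, assuming PhantomJS is installed; the script and output filenames are made up for the example:

    # write a tiny PhantomJS script, then run it against a URL
    cat > capture.js <<'EOF'
    var page = require('webpage').create();
    var url = require('system').args[1];
    page.open(url, function (status) {
        page.render('screenshot.png');   // pixel screenshot of the rendered page
        console.log(page.content);       // the DOM after scripts have run
        phantom.exit(status === 'success' ? 0 : 1);
    });
    EOF
    phantomjs capture.js http://example.com/ > rendered.html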
09:43:00 | <godane> | chronomex: if vgmusic.com is just midi files then i may look at archiving it so we have a full warc.gz of it |
09:51:00 | <BlueMaxim> | godane it pretty much is, to my knowledge |
13:11:00 | <hiker1> | I am trying to download a site that uses assets on a subdomain. I used --span-hosts and --domains, but now it's making a duplicate copy of the site under the www. variant of the domain. I set -D to include tinypic.com so that it would download hotlinked images, but it seems to have downloaded some of the web pages from tinypic too. |
13:20:00 | <ersi> | AFAIK the images from tinypic are hosted on a subdomain |
13:20:00 | <ersi> | like i.tinypic.com or something like that |
13:23:00 | <hiker1> | yes |
13:24:00 | <hiker1> | But how do I tell it to not access ^tinypic.com and only access *.tinypic.com? |
13:25:00 | <hiker1> | --domains and --exclude-domains don't appear to accept wildcards or regex |
13:30:00 | <schbirid1> | correct |
13:49:00 | <hiker1> | How can I avoid downloading from the wrong domain then? |
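One possible way out, hedged since it depends on where exactly the assets live: wget matches --domains/--exclude-domains entries as hostname suffixes, so listing only the hosts you actually want (the site itself plus the image subdomain, assumed here to be i.tinypic.com) keeps the bare tinypic.com pages out of the crawl, and excluding the www. host avoids the duplicate copy. example.com stands in for the actual site:

    # sketch only; example.com is a placeholder and i.tinypic.com is assumed
    # to be the host serving the hotlinked images
    wget --mirror --page-requisites --span-hosts \
         --domains=example.com,i.tinypic.com \
         --exclude-domains=www.example.com \
         --warc-file=example.com-$(date +%Y%m%d) --warc-cdx \
         http://example.com/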
17:54:00 | <SketchCow> | Hooray, Boxing Day |
18:15:00 | <Nemo_bis> | SketchCow: would it be useful to email you a list of magazines (searches/keywords) I uploaded so that when you have time you can create collections/darken/do whatever you like with them? |
18:29:00 | <godane> | SketchCow: I'm up to 2011.08.31 of attack of the show |
18:29:00 | <godane> | also i'm uploading vgmusic.com warc.gz right now |
19:01:00 | <hiker1> | godane: Andriasang.com appears stable now, if you were still willing to try to grab a copy |
19:08:00 | <godane> | i'm grabbing it |
19:09:00 | <godane> | i grabbed the articles |
19:09:00 | <godane> | but the images i will have to try next |
19:12:00 | <godane> | uploaded: http://archive.org/details/vgmusic.com-20121226-mirror |
19:48:00 | <godane> | hiker1: http://archive.org/details/andriasang.com-articles-20121224-mirror |
19:58:00 | <hiker1> | godane: What commands did you use to mirror the site? |
20:20:00 | <godane> | i made an index file first |
20:20:00 | <hiker1> | using what command? |
20:21:00 | <godane> | wget -x -i index.txt --warc-file=$website-articles-$(date +%Y%m%d) --warc-cdx -E -o wget.log |
20:22:00 | <godane> | i had to do it this way because there were way too many images to mirror it whole |
20:22:00 | <hiker1> | You append http://andriasang.com to that command, right? |
20:23:00 | <godane> | it's all from http://andriasang.com |
20:23:00 | <hiker1> | And will that grab the html articles, just not the images? |
20:24:00 | <godane> | *I had to add http://andriasang.com to all urls since they're local urls |
20:25:00 | <hiker1> | I don't understand what you mean by that. How did you add it to all the urls? |
20:25:00 | <godane> | with sed |
20:26:00 | <hiker1> | to start, that first command grabs all the html files, correct? |
20:26:00 | <godane> | when i made my index.txt file from a dump of the pages, the urls came out without http, like this: /?date=2007-11-05 |
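Those root-relative URLs can be turned into absolute ones with a one-liner along these lines (a sketch; the exact sed expression used isn't shown in the log):

    # prefix every root-relative URL in index.txt with the site's origin
    sed -i 's|^/|http://andriasang.com/|' index.txt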
20:26:00 | <hiker1> | and ignores images because you did not use --page-requisites |
20:27:00 | <hiker1> | Am I correct in saying that? |
20:28:00 | <godane> | i just grabbed what was listed in my index.txt |
20:28:00 | <godane> | there is one image in there |
20:28:00 | <godane> | http://andriasang.com/u/anoop/avatar_full.1351839050.jpg |
20:28:00 | <hiker1> | Does running this command save html files, or just save an index? `wget -x -i index.txt --warc-file=$website-articles-$(date +%Y%m%d) --warc-cdx -E -o wget.log` |
20:29:00 | <godane> | it saves html files |
20:29:00 | <godane> | i got the index.txt file from another warc of the pages |
20:29:00 | <DFJustin> | ultraman cooking what |
20:30:00 | <hiker1> | godane: Could you explain that? How did you get the index file to begin with? |
20:32:00 | <godane> | i think i grabbed it with: zcat *.warc.gz | grep -ohP "href='[^'>]+'" |
20:33:00 | <godane> | i did this to my pages warc.gz |
20:33:00 | <hiker1> | How'd you get the warc.gz to begin with? |
20:34:00 | <godane> | for i in $(seq 1 895); do |
20:34:00 | <godane> | echo "http://andriasang.com/?page=$i" >> index.txt |
20:34:00 | <godane> | done |
20:36:00 | <hiker1> | So that gives you a list of all the pages. How then did you get the warc.gz/index.txt with the full urls and with the urls by date? |
20:36:00 | <godane> | i then did this: wget -x -i index.txt --warc-file=andriasang.com-$(date +%Y%m%d) --warc-cdx -E -o wget.log |
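Putting the steps together, the two-pass grab described here looks roughly like this; it is a reconstruction from the log, not the exact commands, and the intermediate filenames are made up:

    # pass 1: enumerate the paginated listings and grab them into one WARC
    for i in $(seq 1 895); do
        echo "http://andriasang.com/?page=$i" >> pages.txt
    done
    wget -x -i pages.txt --warc-file=andriasang.com-$(date +%Y%m%d) --warc-cdx -E -o wget.log

    # pass 2: pull the article links out of that WARC, absolutize them,
    # and grab the articles into a second WARC
    zcat andriasang.com-*.warc.gz \
        | grep -ohP "href='[^'>]+'" \
        | sed -e "s|^href='||" -e "s|'$||" -e 's|^/|http://andriasang.com/|' \
        | sort -u > articles.txt
    wget -x -i articles.txt --warc-file=andriasang.com-articles-$(date +%Y%m%d) --warc-cdx -E -o wget.log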
20:39:00 | <hiker1> | So you end up downloading the page listings twice in this process? |
20:39:00 | <hiker1> | the first time to get all the urls, then the second time to get the real warc file with all the articles? |
20:39:00 | <godane> | no |
20:39:00 | <godane> | first time it was pages |
20:40:00 | <godane> | then all urls of articles |
20:40:00 | <hiker1> | Did you then merge the two together? |
20:40:00 | <godane> | the dates and pages would also be in the articles dump too |
20:40:00 | <hiker1> | oh, ok |
20:41:00 | <hiker1> | How do you plan to get the images? |
20:43:00 | <godane> | by grabbing the image urls, like how i grabbed the articles |
20:44:00 | <hiker1> | Will you then be able to merge the two warc files so that the images can be viewed in the articles? |
20:45:00 | <godane> | the wayback machine can handle multiple warcs |
20:45:00 | <hiker1> | Can you use the wayback machine to read these from the web? Or do you mean by running a private copy of the wayback machine? |
20:46:00 | <godane> | you can use warc-proxy to do it locally |
20:46:00 | <hiker1> | and just load both warc files from that? |
20:46:00 | <godane> | yes |
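Related aside, hedged: each record in a .warc.gz produced by wget is its own gzip member, so if a single combined file is ever wanted, plain concatenation of the two archives is also valid (the filenames here are placeholders):

    # gzip members can be concatenated; WARC readers treat the result as one archive
    cat articles.warc.gz images.warc.gz > andriasang.com-combined.warc.gz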
20:48:00 | <hiker1> | Thank you for explaining this to me. I was having a hard time understanding the process. I really appreciate the help. |
22:14:00 | <hiker1> | godane: How do you handle grabbing CSS or images embedded in CSS? |
22:28:00 | <godane> | i sadly don't know how to grab stuff in css |
22:28:00 | <godane> | even with wget |
22:28:00 | <godane> | cause i don't know if wget grabs urls in css |
22:30:00 | <Nemo_bis> | the requisites option maybe? |
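Nemo_bis is probably right: wget 1.12 and later parse CSS (see Vito``'s note at the top of the log about donating CSS parsing code to wget), so a --page-requisites grab should also fetch images referenced from stylesheets. A sketch, not the command actually used:

    # grab a page plus everything needed to render it, including stylesheets
    # and the url(...) resources inside them
    wget --page-requisites --span-hosts \
         --warc-file=andriasang.com-page-$(date +%Y%m%d) --warc-cdx -E \
         http://andriasang.com/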
22:32:00 | <godane> | i can't grab the full website in one warc |
22:39:00 | <hiker1> | Why can't you? |
22:40:00 | <godane> | it was 2.8gb big and was still going when i was doing it the first time |
22:40:00 | <hiker1> | is that too large for one wget? |
22:41:00 | <godane> | 4gb is the limit on one warc.gz |
22:41:00 | <godane> | it was getting there and it bothered me |
22:41:00 | <hiker1> | oh. |
22:43:00 | <godane> | there are over 317,000 images on that site |
22:44:00 | <ersi> | that's a few |
22:44:00 | <chronomex> | yeah that'll add up |
22:45:00 | <hiker1> | wow |
22:46:00 | <godane> | i may have to do another grab later of images |
22:46:00 | <hiker1> | What do you mean? |
22:46:00 | <godane> | there were a lot of images that had no folder/url path in them |
22:46:00 | <godane> | it was just the file name |
22:47:00 | <hiker1> | I thought you were only grabbing html files right now |
22:47:00 | <godane> | html was already done |
22:47:00 | <godane> | http://archive.org/details/andriasang.com-articles-20121224-mirror |
22:47:00 | <godane> | thats the html articles |
22:48:00 | <godane> | there were about 30 articles that gave a 502 bad gateway error |
22:48:00 | <godane> | i was only able to get 4 of them on a retry |
22:49:00 | <godane> | i limited the warc.gz file size to 1G |
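wget can enforce that cap itself: --warc-max-size rolls over to a new numbered WARC (a -00000, -00001, and so on suffix on the filename) once the current one reaches the given size. A sketch of how the image grab might be split, with images.txt as a hypothetical URL list:

    # split the WARC output into files of at most 1 GB each
    wget -x -i images.txt --warc-file=andriasang.com-images-$(date +%Y%m%d) \
         --warc-max-size=1G --warc-cdx -E -o wget.log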