04:33:00<ArtimusAg>Not sure if this is off-topic, but in case it's relevant to current projects: I've started personally archiving VGMusic.com's MIDIs and organizing them per game, since there doesn't seem to be a proper archive of its contents
04:40:00<chronomex>cool
06:13:00<Vito``>tef: from my experience running private bookmarking and caching/archiving services (we donated the CSS parsing code to wget), you increasingly need a "real browser" to do a good job of caching/archiving a page/site.
06:14:00<Vito``>tef: I actually work off of three different "archives" for any site we cache: we take a screenshot, we cache it with wget, and we're working on caching a static representation as captured from within the browser
06:14:00<Vito``>none can feed back into wayback machine yet, but it's on the to-do list
06:29:00<Coderjoe>that's the curse of "Web 2.0" designs :-\
06:47:00<instence>The biggest problem right now is sites that use AJAX calls to dynamically load data. wget doesn't have a JavaScript engine, so when it hits pages like this, it just sees an empty DIV and goes nowhere.
06:49:00<Vito``>yeah, I expect to completely replace wget with phantomjs at some point
06:50:00<Vito``>well, except for single-file mirroring, like a PDF or something
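For context, a minimal way to try that with a stock PhantomJS install; the example script path and output name here are placeholders, not anything from the discussion above:

    # render the page in headless WebKit, JavaScript included, and save a screenshot;
    # rasterize.js is one of the example scripts that ships with PhantomJS
    phantomjs examples/rasterize.js http://example.com/ example.png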
09:43:00<godane>chronomex: if vgmusic.com is just midi files then i may look at archiving it so we have a full warc.gz of it
09:51:00<BlueMaxim>godane it pretty much is, to my knowledge
13:11:00<hiker1>I am trying to download a site that uses assets on a subdomain. I used --span-hosts and --domains, but now it's making a duplicate copy of the site under the www. subdomain. I also set -D to include tinypic.com so that it would download hotlinked images, but it seems to have downloaded some of the web pages from tinypic too.
13:20:00<ersi>AFAIK the images from tinypic are hosted on a subdomain
13:20:00<ersi>like i.tinypic.com or something like that
13:23:00<hiker1>yes
13:24:00<hiker1>But how do I tell it to not access ^tinypic.com and only access *.tinypic.com?
13:25:00<hiker1>--domains and --exclude-domains don't appear to accept wildcards or regex
13:30:00<schbirid1>correct
13:49:00<hiker1>How can I avoid downloading from the wrong domain then?
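One hedged workaround, assuming the assets really do live on i.tinypic.com (I haven't tested this against tinypic specifically, and example.com stands in for the actual site):

    # -D does suffix matching on hostnames, so listing i.tinypic.com admits the
    # image host without admitting tinypic.com or www.tinypic.com pages;
    # --exclude-domains drops the duplicate www. copy of the main site
    wget --mirror --page-requisites --span-hosts \
         --domains=example.com,i.tinypic.com \
         --exclude-domains=www.example.com \
         http://example.com/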
17:54:00<SketchCow>Hooray, Boxing Day
18:15:00<Nemo_bis>SketchCow: would it be useful to email you a list of magazines (searches/keywords) I uploaded so that when you have time you can create collections/darken/do whatever you like with them?
18:29:00<godane>SketchCow: I'm up to 2011.08.31 of attack of the show
18:29:00<godane>also i'm uploading vgmusic.com warc.gz right now
19:01:00<hiker1>godane: Andriasang.com appears stable now, if you were still willing to try to grab a copy
19:08:00<godane>i'm grabbing it
19:09:00<godane>i grabbed the articles
19:09:00<godane>but the images i will have to try next
19:12:00<godane>uploaded: http://archive.org/details/vgmusic.com-20121226-mirror
19:48:00<godane>hiker1: http://archive.org/details/andriasang.com-articles-20121224-mirror
19:58:00<hiker1>godane: What commands did you use to mirror the site?
20:20:00<godane>i made a index file first
20:20:00<hiker1>using what command?
20:21:00<godane>wget -x -i index.txt --warc-file=$website-articles-$(date +%Y%m%d) --warc-cdx -E -o wget.log
20:22:00<godane>i had to do it this way cause there was way too many images to mirror it whole
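For reference, the same command with the variable filled in and the flags spelled out; the $website value is a guess based on the eventual upload name:

    # -x          recreate the directory structure on disk
    # -i          read the URL list from index.txt instead of recursing
    # -E          add .html extensions where needed (--adjust-extension)
    # --warc-file / --warc-cdx   write a WARC plus a CDX index alongside the files
    # -o          send the log to wget.log
    wget -x -i index.txt \
         --warc-file=andriasang.com-articles-$(date +%Y%m%d) \
         --warc-cdx -E -o wget.log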
20:22:00<hiker1>You append http://andriasang.com to that command, right?
20:23:00<godane>its all from http://andriasang.com
20:23:00<hiker1>And will that grab the html articles, just not the images?
20:24:00<godane>*I had to add http://andriasang.com to all the urls since they're local urls
20:25:00<hiker1>I don't understand what you mean by that. How did you add it to all the urls?
20:25:00<godane>with sed
20:26:00<hiker1>to start, that first command grabs all the html files, correct?
20:26:00<godane>when i made my index.txt file from a dump of the pages, the urls came without http, like this: /?date=2007-11-05
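So the fix-up step is presumably something along these lines (filenames assumed):

    # prepend the site root to root-relative urls like /?date=2007-11-05
    sed 's|^/|http://andriasang.com/|' raw-urls.txt > index.txt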
20:26:00<hiker1>and ignores images because you did not use --page-requisites
20:27:00<hiker1>Am I correct in saying that?
20:28:00<godane>i just grabbed what was listed in my index.txt
20:28:00<godane>there is one image in there
20:28:00<godane>http://andriasang.com/u/anoop/avatar_full.1351839050.jpg
20:28:00<hiker1>Does running this command save html files, or just save an index? `wget -x -i index.txt --warc-file=$website-articles-$(date +%Y%m%d) --warc-cdx -E -o wget.log`
20:29:00<godane>it saves html files
20:29:00<godane>i got the index.txt file from another warc of the pages
20:29:00<DFJustin>ultraman cooking what
20:30:00<hiker1>godane: Could you explain that? How did you get the index file to begin with?
20:32:00<godane>i think i grabbed it with: zcat *.warc.gz | grep -ohP "href='[^'>]+'"
20:33:00<godane>i did this to my pages warc.gz
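Cleaned up a little, that extraction might look like this (filenames assumed, same grep idea as above):

    # pull every single-quoted href value out of the pages WARC,
    # strip the href='...' wrapper, and de-duplicate into a url list
    zcat andriasang.com-pages.warc.gz \
      | grep -ohP "href='[^'>]+'" \
      | sed "s/^href='//; s/'\$//" \
      | sort -u > raw-urls.txt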
20:33:00<hiker1>How'd you get the warc.gz to begin with?
20:34:00<godane>for i in $(seq 1 895); do
20:34:00<godane>echo "http://andriasang.com/?page=$i" >> index.txt
20:34:00<godane>done
20:36:00<hiker1>So that gives you a list of all the pages. How then did you get the warc.gz/index.txt with the full urls and with the urls by date?
20:36:00<godane>i then did this: wget -x -i index.txt --warc-file=andriasang.com-$(date +%Y%m%d) --warc-cdx -E -o wget.log
20:39:00<hiker1>So you end up downloading the page listings twice in this process?
20:39:00<hiker1>the first time to get all the urls, then the second time to get the real warc file with all the articles?
20:39:00<godane>no
20:39:00<godane>first time it was pages
20:40:00<godane>then all urls of articles
20:40:00<hiker1>Did you then merge the two together?
20:40:00<godane>the dates and pages would also be in the articles dump too
20:40:00<hiker1>oh, ok
20:41:00<hiker1>How do you plan to get the images?
20:43:00<godane>by grabbing the image urls the same way i grabbed the article urls
20:44:00<hiker1>Will you then be able to merge the two warc files so that the images can be viewed in the articles?
20:45:00<godane>the wayback machine can handle multiple warcs
20:45:00<hiker1>Can you use the wayback machine to read these from the web? Or do you mean by running a private copy of the wayback machine?
20:46:00<godane>you can use warc-proxy to do it locally
20:46:00<hiker1>and just load both warc files from that?
20:46:00<godane>yes
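Roughly, the local replay would go like this; the warc-proxy script name and port here are assumptions I haven't verified, so check its README for the real invocation:

    # start the proxy (invocation assumed)
    python warcproxy.py &

    # then set the browser's HTTP proxy to localhost:8000 (default port assumed),
    # pick both the pages and the articles .warc.gz files in warc-proxy's file list,
    # and browse http://andriasang.com/ ; responses are served from the WARCs
    # rather than the live site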
20:48:00<hiker1>Thank you for explaining this to me. I was having a hard time understanding the process. I really appreciate the help.
22:14:00<hiker1>godane: How do you handle grabbing CSS or images embedded in CSS?
22:28:00<godane>i sadly don't know how to grab stuff in css
22:28:00<godane>even with wget
22:28:00<godane>cause i don't know if wget grabs urls in css
22:30:00<Nemo_bis>the requisites option maybe?
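Newer wget (1.12 and later) does parse CSS for url() references when fetching page requisites, though that's worth double-checking against the version at hand; a hedged single-page sketch:

    # fetch one article plus everything needed to render it: stylesheets, images,
    # and (on CSS-aware wget builds) the images referenced from inside the CSS
    wget --page-requisites -E \
         --warc-file=andriasang-page-test-$(date +%Y%m%d) --warc-cdx \
         "http://andriasang.com/?date=2007-11-05"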
22:32:00<godane>i can't grab the full website in one warc
22:39:00<hiker1>Why can't you?
22:40:00<godane>it was already 2.8GB and still going the first time i tried it
22:40:00<hiker1>is that too large for one wget?
22:41:00<godane>4gb is the limit on one warc.gz
22:41:00<godane>it was getting there and it bothered me
22:41:00<hiker1>oh.
22:43:00<godane>there are over 317,000 images on that site
22:44:00<ersi>that's a few
22:44:00<chronomex>yeah that'll add up
22:45:00<hiker1>wow
22:46:00<godane>i may have to do another grab later of images
22:46:00<hiker1>What do you mean?
22:46:00<godane>there were a lot of images that had no folder/url path in them
22:46:00<godane>it was just the file name
22:47:00<hiker1>I thought you were only grabbing html files right now
22:47:00<godane>html was already done
22:47:00<godane>http://archive.org/details/andriasang.com-articles-20121224-mirror
22:47:00<godane>thats the html articles
22:48:00<godane>there were about 30 articles that gave a 502 bad gateway error
22:48:00<godane>i was only able to get 4 of them on a retry
22:49:00<godane>i limit the warc.gz file size to 1G
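If that 1G limit is done with --warc-max-size, the rollover is automatic; for completeness, a sketch of how the upcoming image grab might use it (filenames assumed):

    # --warc-max-size rolls over to additional numbered .warc.gz files once the
    # current one reaches roughly 1 GB, so a big crawl never hits the 4GB cap
    wget -x -i image-urls.txt -E -o wget.log \
         --warc-file=andriasang.com-images-$(date +%Y%m%d) \
         --warc-cdx --warc-max-size=1G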