04:33:00 | <ArtimusAg> | Not sure if off-topic, but if this is relevant to current projects, I began personally archiving VGMusic.com's MIDIs and organizing them per game since there seems to be lacking a proper archive for its contents |
04:40:00 | <chronomex> | cool |
06:13:00 | <Vito``> | tef: from my experience running private bookmarking and caching/archiving services (we donated the CSS parsing code to wget), you increasingly need a "real browser" to do a good job of caching/archiving a page/site. |
06:14:00 | <Vito``> | tef: I actually work off of three different "archives" for any site we cache: we take a screenshot, we cache it with wget, and we're working on caching a static representation as captured from within the browser |
06:14:00 | <Vito``> | none can feed back into wayback machine yet, but it's on the to-do list |
06:29:00 | <Coderjoe> | that's the curse of "Web 2.0" designs :-\ |
06:47:00 | <instence> | The biggest problem right now is sites that use AJAX calls to dynamically load data. wget doesn't have a JavaScript engine, so when it hits pages like this, it just sees an empty DIV and goes nowhere. |
06:49:00 | <Vito``> | yeah, I expect to completely replace wget with phantomjs at some point |
06:50:00 | <Vito``> | well, except for single-file mirroring, like a PDF or something |
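A minimal sketch of the kind of browser-driven capture being described here, assuming PhantomJS is installed; the script and output filenames are made up for the example:

    # write a tiny PhantomJS script, then run it against a URL
    cat > capture.js <<'EOF'
    var page = require('webpage').create();
    var url = require('system').args[1];
    page.open(url, function (status) {
        page.render('screenshot.png');   // pixel screenshot of the rendered page
        console.log(page.content);       // the DOM after scripts have run
        phantom.exit(status === 'success' ? 0 : 1);
    });
    EOF
    phantomjs capture.js http://example.com/ > rendered.html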
09:43:00 | <godane> | chronomex: if vgmusic.com is just midi files then i may look at archiving it so we have a full warc.gz of it |
09:51:00 | <BlueMaxim> | godane it pretty much is, to my knowledge |
13:11:00 | <hiker1> | I am trying to download a site that uses assets on a subdomain. I used --span-hosts and --domains, but now it's making a duplicate copy of the site under the www. variant of the domain. I set -D to include tinypic.com so that it would download hotlinked images, but it seems to have downloaded some of the web pages from tinypic too. |
13:20:00 | <ersi> | AFAIK the images from tinypic are hosted on a subdomain |
13:20:00 | <ersi> | like i.tinypic.com or something like that |
13:23:00 | <hiker1> | yes |
13:24:00 | <hiker1> | But how do I tell it to not access ^tinypic.com and only access *.tinypic.com? |
13:25:00 | <hiker1> | --domains and --exclude-domains don't appear to accept wildcards or regex |
13:30:00 | <schbirid1> | correct |
13:49:00 | <hiker1> | How can I avoid downloading from the wrong domain then? |
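One possible way out, hedged since it depends on where exactly the assets live: wget matches --domains/--exclude-domains entries as hostname suffixes, so listing only the hosts you actually want (the site itself plus the image subdomain, assumed here to be i.tinypic.com) keeps the bare tinypic.com pages out of the crawl, and excluding the www. host avoids the duplicate copy. example.com stands in for the actual site:

    # sketch only; example.com is a placeholder and i.tinypic.com is assumed
    # to be the host serving the hotlinked images
    wget --mirror --page-requisites --span-hosts \
         --domains=example.com,i.tinypic.com \
         --exclude-domains=www.example.com \
         --warc-file=example.com-$(date +%Y%m%d) --warc-cdx \
         http://example.com/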
17:54:00 | <SketchCow> | Hooray, Boxing Day |
18:15:00 | <Nemo_bis> | SketchCow: would it be useful to email you a list of magazines (searches/keywords) I uploaded so that when you have time you can create collections/darken/do whatever you like with them? |
18:29:00 | <godane> | SketchCow: I'm up to 2011.08.31 of attack of the show |
18:29:00 | <godane> | also i'm uploading vgmusic.com warc.gz right now |
19:01:00 | <hiker1> | godane: Andriasang.com appears stable now, if you were still willing to try to grab a copy |
19:08:00 | <godane> | i'm grabbing it |
19:09:00 | <godane> | i grabbed the articles |
19:09:00 | <godane> | but the images i will have to try next |
19:12:00 | <godane> | uploaded: http://archive.org/details/vgmusic.com-20121226-mirror |
19:48:00 | <godane> | hiker1: http://archive.org/details/andriasang.com-articles-20121224-mirror |
19:58:00 | <hiker1> | godane: What commands did you use to mirror the site? |
20:20:00 | <godane> | i made an index file first |
20:20:00 | <hiker1> | using what command? |
20:21:00 | <godane> | wget -x -i index.txt --warc-file=$website-articles-$(date +%Y%m%d) --warc-cdx -E -o wget.log |
20:22:00 | <godane> | i had to do it this way because there were way too many images to mirror it whole |
20:22:00 | <hiker1> | You append http://andriasang.com to that command, right? |
20:23:00 | <godane> | it's all from http://andriasang.com |
20:23:00 | <hiker1> | And will that grab the html articles, just not the images? |
20:24:00 | <godane> | *I had to add http://andriasang.com to all urls since they're local urls |
20:25:00 | <hiker1> | I don't understand what you mean by that. How did you add it to all the urls? |
20:25:00 | <godane> | with sed |
20:26:00 | <hiker1> | to start, that first command grabs all the html files, correct? |
20:26:00 | <godane> | when i made my index.txt file from a dump of the pages, the urls came out without http, like this: /?date=2007-11-05 |
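Those root-relative URLs can be turned into absolute ones with a one-liner along these lines (a sketch; the exact sed expression used isn't shown in the log):

    # prefix every root-relative URL in index.txt with the site's origin
    sed -i 's|^/|http://andriasang.com/|' index.txt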
20:26:00 | <hiker1> | and ignores images because you did not use --page-requisites |
20:27:00 | <hiker1> | Am I correct in saying that? |
20:28:00 | <godane> | i just grabbed what was listed in my index.txt |
20:28:00 | <godane> | there is one image in there |
20:28:00 | <godane> | http://andriasang.com/u/anoop/avatar_full.1351839050.jpg |
20:28:00 | <hiker1> | Does running this command save html files, or just save an index? `wget -x -i index.txt --warc-file=$website-articles-$(date +%Y%m%d) --warc-cdx -E -o wget.log` |
20:29:00 | <godane> | it saves html files |
20:29:00 | <godane> | i got the index.txt file from another warc of the pages |
20:29:00 | <DFJustin> | ultraman cooking what |
20:30:00 | <hiker1> | godane: Could you explain that? How did you get the index file to begin with? |
20:32:00 | <godane> | i think i grabbed it with: zcat *.warc.gz | grep -ohP "href='[^'>]+'" |
20:33:00 | <godane> | i did this to my pages warc.gz |
20:33:00 | <hiker1> | How'd you get the warc.gz to begin with? |
20:34:00 | <godane> | for i in $(seq 1 895); do |
20:34:00 | <godane> | echo "http://andriasang.com/?page=$i" >> index.txt |
20:34:00 | <godane> | done |
20:36:00 | <hiker1> | So that gives you a list of all the pages. How then did you get the warc.gz/index.txt with the full urls and with the urls by date? |
20:36:00 | <godane> | i then did this: wget -x -i index.txt --warc-file=andriasang.com-$(date +%Y%m%d) --warc-cdx -E -o wget.log |
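Putting the steps together, the two-pass grab described here looks roughly like this; it is a reconstruction from the log, not the exact commands, and the intermediate filenames are made up:

    # pass 1: enumerate the paginated listings and grab them into one WARC
    for i in $(seq 1 895); do
        echo "http://andriasang.com/?page=$i" >> pages.txt
    done
    wget -x -i pages.txt --warc-file=andriasang.com-$(date +%Y%m%d) --warc-cdx -E -o wget.log

    # pass 2: pull the article links out of that WARC, absolutize them,
    # and grab the articles into a second WARC
    zcat andriasang.com-*.warc.gz \
        | grep -ohP "href='[^'>]+'" \
        | sed -e "s|^href='||" -e "s|'$||" -e 's|^/|http://andriasang.com/|' \
        | sort -u > articles.txt
    wget -x -i articles.txt --warc-file=andriasang.com-articles-$(date +%Y%m%d) --warc-cdx -E -o wget.log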
20:39:00 | <hiker1> | So you end up downloading the page listings twice in this process? |
20:39:00 | <hiker1> | the first time to get all the urls, then the second time to get the real warc file with all the articles? |
20:39:00 | <godane> | no |
20:39:00 | <godane> | first time it was pages |
20:40:00 | <godane> | then all urls of articles |
20:40:00 | <hiker1> | Did you then merge the two together? |
20:40:00 | <godane> | the dates and pages would also be in the articles dump too |
20:40:00 | <hiker1> | oh, ok |
20:41:00 | <hiker1> | How do you plan to get the images? |
20:43:00 | <godane> | by grabbing the image urls, like how i grabbed the articles |
20:44:00 | <hiker1> | Will you then be able to merge the two warc files so that the images can be viewed in the articles? |
20:45:00 | <godane> | the wayback machine can handle multiple warcs |
20:45:00 | <hiker1> | Can you use the wayback machine to read these from the web? Or do you mean by running a private copy of the wayback machine? |
20:46:00 | <godane> | you can use warc-proxy to do it locally |
20:46:00 | <hiker1> | and just load both warc files from that? |
20:46:00 | <godane> | yes |
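Related aside, hedged: each record in a .warc.gz produced by wget is its own gzip member, so if a single combined file is ever wanted, plain concatenation of the two archives is also valid (the filenames here are placeholders):

    # gzip members can be concatenated; WARC readers treat the result as one archive
    cat articles.warc.gz images.warc.gz > andriasang.com-combined.warc.gz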
20:48:00 | <hiker1> | Thank you for explaining this to me. I was having a hard time understanding the process. I really appreciate the help. |
22:14:00 | <hiker1> | godane: How do you handle grabbing CSS or images embedded in CSS? |
22:28:00 | <godane> | i sadly don't know how to grab stuff in css |
22:28:00 | <godane> | even with wget |
22:28:00 | <godane> | cause i don't know if wget grabs urls in css |
22:30:00 | <Nemo_bis> | the requisites option maybe? |
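Nemo_bis is probably right: wget 1.12 and later parse CSS (see Vito``'s note at the top of the log about donating CSS parsing code to wget), so a --page-requisites grab should also fetch images referenced from stylesheets. A sketch, not the command actually used:

    # grab a page plus everything needed to render it, including stylesheets
    # and the url(...) resources inside them
    wget --page-requisites --span-hosts \
         --warc-file=andriasang.com-page-$(date +%Y%m%d) --warc-cdx -E \
         http://andriasang.com/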
22:32:00 | <godane> | i can't grab the full website in one warc |
22:39:00 | <hiker1> | Why can't you? |
22:40:00 | <godane> | it was 2.8gb big and was still going when i was doing it the first time |
22:40:00 | <hiker1> | is that too large for one wget? |
22:41:00 | <godane> | 4gb is the limit on one warc.gz |
22:41:00 | <godane> | it was getting there and it bothered me |
22:41:00 | <hiker1> | oh. |
22:43:00 | <godane> | there are over 317,000 images on that site |
22:44:00 | <ersi> | that's a few |
22:44:00 | <chronomex> | yeah that'll add up |
22:45:00 | <hiker1> | wow |
22:46:00 | <godane> | i may have to do another grab later of images |
22:46:00 | <hiker1> | What do you mean? |
22:46:00 | <godane> | there were a lot of images that had no folder/url path in them |
22:46:00 | <godane> | it was just the file name |
22:47:00 | <hiker1> | I thought you were only grabbing html files right now |
22:47:00 | <godane> | html was already done |
22:47:00 | <godane> | http://archive.org/details/andriasang.com-articles-20121224-mirror |
22:47:00 | <godane> | thats the html articles |
22:48:00 | <godane> | there were about 30 articles that gave a 502 bad gateway error |
22:48:00 | <godane> | i was only able to get 4 of them on a retry |
22:49:00 | <godane> | i limited the warc.gz file size to 1G |
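wget can enforce that cap itself: --warc-max-size rolls over to a new numbered WARC (a -00000, -00001, and so on suffix on the filename) once the current one reaches the given size. A sketch of how the image grab might be split, with images.txt as a hypothetical URL list:

    # split the WARC output into files of at most 1 GB each
    wget -x -i images.txt --warc-file=andriasang.com-images-$(date +%Y%m%d) \
         --warc-max-size=1G --warc-cdx -E -o wget.log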