I am trying to mirror a Blogger site so that I can have an exact copy of it on my filesystem to view. I have tried issuing the following command on Linux:

wget -r -k -x -e robots=off --wait 1 http://your.site.here.blogspot.com/

I have also tried using the -D flag to give a comma-separated list of domains to follow (though I would prefer to just follow any domain without having to list them all). I have even tried changing the .com part of the URL to the top-level domain for my country (.it); without that change, for some reason I don't understand and would like to, wget retrieves only index.html and no other page (perhaps someone here can explain why).
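
For reference, the -D attempt looked roughly like this (I'm reconstructing it from memory, so treat the domain list as approximate):

wget -r -k -x -e robots=off --wait 1 -D blogspot.com,bp.blogspot.com http://your.site.here.blogspot.com/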

So, even when I do a

wget -r -k -x -e robots=off --wait 1 http://your.site.here.blogspot.it/

several HTML files and the favicon.ico are downloaded, but none of the .png images from Blogger are. Why is this, and how can I get wget to work properly? I've read the wget man page but had no luck.

Thanks.

  • Are you sure the .png images are hosted on http://your.site.here.blogspot.it/? Images uploaded to the Blogger service seem to be served from <number>.bp.blogspot.com instead, which would explain why wget won't fetch them (a quick way to check this is sketched after these comments). – jayhendren Oct 4 '13 at 0:38
  • Have you considered a User-Agent change? Some sites prevent images/pages from being fetched by certain robots/tools. – Tyzoid Aug 20 '14 at 13:17
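
As jayhendren suggests, one quick way to confirm where the images actually come from is to grep the HTML files that wget did manage to download for bp.blogspot.com URLs. This is only a sketch, assuming the partial mirror was saved under your.site.here.blogspot.it/:

grep -rhoE 'https?://[0-9]+\.bp\.blogspot\.com/[^"]*' your.site.here.blogspot.it/ | sort -u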

As jayhendren suggested, I tried adding the domain bp.blogspot.com to the list following the -D flag. However, what I had forgotten to do was add the -H flag. Why wget requires the extra -H flag to be specified separately from the list of domains given with -D is unclear to me, but it works. Here is the command I ultimately used to mirror the Blogger site, including the images served from the external domain:

wget --domains=blogspot.it,bp.blogspot.com -H --mirror -e robots=off \
  --wait 0.5 --convert-links http://yoursitehere.blogspot.it/

Note: this works from Italy. Change .it to .com or to whatever other top-level domain Blogger redirects to if you want this to work from your location.
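
For completeness, the --page-requisites and --adjust-extension options mentioned in the other answer can be combined with the host-spanning flags above. This is only a sketch, and the domain list is an assumption you should check against the image URLs your blog actually uses:

wget -H --domains=blogspot.it,bp.blogspot.com --mirror --page-requisites \
  --adjust-extension --convert-links -e robots=off --wait 0.5 \
  http://yoursitehere.blogspot.it/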

Regards.

  • Awesome, I've been pulling my hair out trying to archive a blog's images as well as its pages, but I've run into so many dead ends. The --domains flag is exactly the ticket; it does what you expect, and the -k flag, AKA --convert-links, is also fantastic. There's a great article here, by an author who runs a Blogger site himself, which offers even more options for archiving a Blogger site. – GDP2 22 hours ago

Without wget's error output I can't tell what your exact problem is. But generally, when downloading (or mirroring) a website with wget, I use the --mirror option like this (an expanded equivalent is sketched below the comment):

wget --mirror -p --adjust-extension --wait 1 http://your.site.here.blogspot.it/

  • Well, from the man page the -p option takes a parameter so this can't be right. Also, -e robots=off is required since the site I am mirroring would otherwise disallow mirroring through its robots.txt file. The --wait 1 option is there so the server does not get overloaded. However, the image files still do not get downloaded. Try this on any blogspot.com site and see for yourself. Where am I going wrong? – John Sonderson Oct 4 '13 at 12:42
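
For reference, the wget manual describes --mirror as shorthand for -r -N -l inf --no-remove-listing, so the command above is roughly equivalent to the expanded form below. This is only a sketch; for a Blogger site you would still need -e robots=off and the host-spanning flags from the other answer:

wget -r -N -l inf --no-remove-listing -p --adjust-extension --wait 1 http://your.site.here.blogspot.it/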
