Mirror a site with wget, but only a specific subdirectory

Published October 29, 2019 by Tommy George

Recently I needed to get a static archive of a website, but only a particular subdirectory. For example, I didn’t want all of example.com/; I only needed the pages and files in example.com/blog/.

The following wget command, which uses --no-parent to stay inside the starting directory, seemed perfect for that task:

wget --mirror --convert-links --adjust-extension --page-requisites --wait=1 --no-parent https://www.example.com/blog/

As it turned out, however, I needed some very specific static files that were linked from those pages, but stored in a higher level directory at example.com/static-files/. The --no-parent option prevented those files from being downloaded.
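
If you hit the same problem, it can help to see which linked files live outside the subdirectory before deciding what else to allow. Here is a rough sketch, not something from my original workflow: outside_refs is a made-up helper name, and it only catches root-relative href and src attributes written with double quotes. The prefix is used as a grep pattern, so keep it to plain path characters.

```shell
# Hypothetical helper: print root-relative href/src targets that fall
# outside a given directory. Reads HTML on stdin.
#   $1  directory prefix to treat as "inside", e.g. /blog/
outside_refs() {
    prefix=$1
    # Pull out href="/..." and src="/..." attributes, one per line.
    grep -Eo '(href|src)="/[^"]*"' |
        # Strip the attribute name and the surrounding quotes.
        sed -E 's/^(href|src)="//; s/"$//' |
        # Drop anything already under the prefix, then de-duplicate.
        grep -v "^$prefix" |
        sort -u
}
```

Running something like outside_refs /blog/ < www.example.com/blog/index.html against a page from the first mirror attempt should list paths such as those under /static-files/, which tells you which directories the next run needs to allow.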

This change seemed to do the trick:

wget --mirror --convert-links --adjust-extension --page-requisites --wait=1 --include-directories="/static-files,/blog" https://www.example.com/blog/

… which restricts the download to the listed directories. Compared to using --no-parent, this may have adverse side effects that I haven’t noticed yet, as I’ve not looked that closely. It seemed to do what I needed.
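
One cheap way to look for side effects is to check which top-level directories actually ended up in the mirror. A recursive wget run saves files under a directory named after the host (www.example.com/ here). The sketch below assumes that layout; top_level_dirs is just an illustrative name.

```shell
# Hypothetical helper: list the top-level directories inside a mirror.
#   $1  mirror root created by wget, e.g. www.example.com
top_level_dirs() {
    mirror_root=$1
    # Directories directly under the root, reduced to their base names.
    find "$mirror_root" -mindepth 1 -maxdepth 1 -type d |
        sed 's|.*/||' |
        sort
}
```

If top_level_dirs www.example.com prints anything other than blog and static-files, the include list let in more than intended.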

Taking extra care?

And finally, if you need to add some random variation to the request timing, provide a user-agent to the host, ignore the rules set in a robots.txt file, and write a quieter log to a file, then you’ll want to add a few more arguments, like so:

wget --mirror --convert-links --adjust-extension --page-requisites -e robots=off --wait=1 --random-wait --no-verbose -o wget-log.log --include-directories="/static-files,/blog" --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36" https://www.example.com/blog/
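
A command this long is easy to mistype, so it may be worth wrapping it in a small shell function. The following is only a sketch under my own assumptions: mirror_subdir, DRY_RUN, and USER_AGENT are names I made up for illustration, not anything wget provides. The wget options themselves are the ones from the command above.

```shell
# Hypothetical wrapper around the command above.
#   $1  start URL, e.g. https://www.example.com/blog/
#   $2  comma-separated list for --include-directories
# Optional environment variables (illustrative, not part of wget):
#   USER_AGENT  user-agent string to send
#   DRY_RUN=1   print the arguments instead of running wget
mirror_subdir() {
    url=$1
    include_dirs=$2

    # Reuse the function's positional parameters as the argument list,
    # so values containing spaces stay single arguments.
    set -- wget \
        --mirror --convert-links --adjust-extension --page-requisites \
        -e robots=off \
        --wait=1 --random-wait \
        --no-verbose -o wget-log.log \
        --include-directories="$include_dirs"

    # Only send a custom user agent if one was provided.
    if [ -n "${USER_AGENT:-}" ]; then
        set -- "$@" --user-agent="$USER_AGENT"
    fi
    set -- "$@" "$url"

    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$@"   # one argument per line, nothing is fetched
    else
        "$@"
    fi
}
```

With DRY_RUN=1 set, calling mirror_subdir "https://www.example.com/blog/" "/static-files,/blog" prints the arguments one per line so you can review them first; unset it to run the real download.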

You can find the documentation for all of these command options in the wget Manual. (I was using version 1.20 here.)