Mirror a site with wget, but only a specific subdirectory
Recently I needed to get a static archive of a website, but only a particular
subdirectory. For example, I didn’t want all of example.com/; I only needed
the pages and files in example.com/blog/.
This wget command seemed perfect for that task:
wget --mirror --convert-links --adjust-extension --page-requisites --wait=1 --no-parent https://www.example.com/blog/
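To make the role of each flag easier to see, the same command can be wrapped in a small shell function with one option per line. This is only a sketch: the function name mirror_blog is made up for illustration, and the flag descriptions are my paraphrase of the wget manual.

```shell
# mirror_blog is a hypothetical helper name; it is not part of wget.
#   --mirror            recursive download with timestamping and unlimited depth
#   --convert-links     rewrite links in the saved pages so they work locally
#   --adjust-extension  add a suffix such as .html where the saved file lacks one
#   --page-requisites   also fetch images, stylesheets, and similar page assets
#   --wait=1            pause one second between requests
#   --no-parent         never ascend above the starting directory
mirror_blog() {
  wget \
    --mirror \
    --convert-links \
    --adjust-extension \
    --page-requisites \
    --wait=1 \
    --no-parent \
    "$1"
}
```

Calling mirror_blog https://www.example.com/blog/ then runs exactly the one-line command above.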
As it turned out, however, I needed some very specific static files that were
linked from those pages, but stored outside the blog directory at
example.com/static-files/. The --no-parent option prevented those files
from being downloaded.
This change seemed to do the trick:
wget --mirror --convert-links --adjust-extension --page-requisites --wait=1 --include-directories="/static-files,/blog" https://www.example.com/blog/
… which restricts the download to the listed directories. Compared to using
--no-parent, this may have adverse side effects that I’m just not noticing,
as I’ve not looked that closely yet. It seemed to do what I needed.
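One way to look for such side effects is to check which top-level directories actually ended up in the mirror. The sketch below assumes wget saved the site into a directory named after the host (www.example.com here, which is its default for recursive downloads); the function name list_top_level_dirs is made up for this example.

```shell
# list_top_level_dirs is a hypothetical helper, not a wget feature.
# It prints the name of every directory directly inside the mirror root.
list_top_level_dirs() {
  for dir in "$1"/*/; do
    [ -d "$dir" ] || continue   # skip the unexpanded pattern if nothing matches
    basename "$dir"
  done
}
```

Running list_top_level_dirs www.example.com after the download would, if the include list behaved as intended, be expected to print only blog and static-files. Anything else in the output is worth a closer look.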
Taking extra care?
Finally, if you need to add a little random variation to the request timing,
send a user-agent string to the host, ignore the rules set in a robots.txt
file, and write a quieter log to a file, you can add a few more arguments to
the command:
wget --mirror --convert-links --adjust-extension --page-requisites -e robots=off --wait=1 --random-wait --no-verbose -o wget-log.log --include-directories="/static-files,/blog" --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36" https://www.example.com/blog/
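Since this command sends wget's output to wget-log.log instead of the terminal, it can be worth scanning that file once the download finishes. A minimal sketch, assuming failed requests show up in the log as lines containing the word ERROR (for example "ERROR 404: Not Found."); the function name count_log_errors is hypothetical.

```shell
# count_log_errors is a hypothetical helper, not part of wget.
# Assumption: failed requests are logged on lines containing "ERROR".
count_log_errors() {
  # grep -c prints the number of matching lines (0 when there are none);
  # "|| true" keeps the function from failing when nothing matches.
  grep -c 'ERROR' "$1" || true
}
```

For example, count_log_errors wget-log.log prints the number of such lines, and grep 'ERROR' wget-log.log lists them.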
You can find the documentation for all of these command options in the wget manual. (I was using version 1.20 here.)