I recently needed to archive a small website before decommissioning it. There are a few distinct reasons you might want an archive of a website:
- To archive the information in case you need it later.
- To archive the look and feel so that you can see how it has progressed.
- To archive the digital artifacts so that you can host them elsewhere as a mirror.
Each of these produces files in a different format, and each is useful over a different time period. In this post, I’ll write a bit about all three, since it’s easiest to archive a website while it is still online.
1. Saving webpage content to PDF
To write an individual page to a PDF, you can use wkhtmltopdf. On Debian/Ubuntu, this can be installed with:
```
sudo apt-get install wkhtmltopdf
```
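The basic usage is to pass a URL and an output filename. Something along these lines, with `example.com` standing in for the real site:

```shell
# Render a single page to a PDF file (example.com is a placeholder)
wkhtmltopdf https://example.com/ index.pdf
```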
This produces a PDF, which you can copy/paste text from, or print.
You then simply repeat this for every page which you want to archive.
2. Saving webpage content to an image
If you are more interested in how the website looked, rather than what it contained, then you can use the same package to write it to an image. I used the jpg format here, because the file sizes were reasonable at higher resolution. I also zoomed the page to 200% for higher quality, and selected widths which are typical of desktop, tablet and mobile screens.
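The commands I ran were along these lines. The exact widths are a judgment call, and `example.com` again stands in for the real site:

```shell
mkdir -p jpg/desktop jpg/tablet jpg/mobile
# 200% zoom for sharper output; widths approximate common screen sizes
wkhtmltoimage --format jpg --zoom 2 --width 1920 https://example.com/ jpg/desktop/index.jpg
wkhtmltoimage --format jpg --zoom 2 --width 768 https://example.com/ jpg/tablet/index.jpg
wkhtmltoimage --format jpg --zoom 2 --width 375 https://example.com/ jpg/mobile/index.jpg
```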
This gives you three images for the page. This example page is quite short, but a larger page produces a very tall image.
As above, this needs to be repeated for each page which you want to archive.
3. Mirroring the entire site as HTML
A full mirror of the site is a good short-term archive. Some websites have a lot of embedded external content like maps and social media widgets, which I would expect to gradually stop working over time as these services change. Still, you should be able to browse the website on your local computer in 10 or 20 years’ time, depending on how browsers change.
wget is the go-to tool for mirroring sites, but it has a lot of options!
```
mkdir -p html/
cd html
wget \
  --trust-server-names \
  -e robots=off \
  --mirror \
  --convert-links \
  --adjust-extension \
  --page-requisites \
  --no-parent \
  https://example.com
```
There are quite a few options here, so I’ll briefly explain why I used each one:
| Option | Purpose |
| --- | --- |
| `--trust-server-names` | Use the correct filename when a redirect is followed. |
| `-e robots=off` | Ignore `robots.txt` rules. This is only OK to do if you own the site and can be sure that mirroring it will not cause capacity issues. |
| `--mirror` | Shorthand for a set of options to recursively download the site. |
| `--convert-links` | Rewrite links in the mirrored pages to point at the local copies. |
| `--adjust-extension` | If you get a page called “foo”, save it as “foo.html”. |
| `--page-requisites` | Download the images, stylesheets and scripts needed to display each page. |
| `--no-parent` | Only download sub-pages from the starting page. This is useful if you want to fetch only part of the domain. |
The result can be opened locally in a web browser:
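For example, on a typical Linux desktop (the path matches the mirror of `example.com` above):

```shell
xdg-open html/example.com/index.html
```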
These options worked well for me on a WordPress site.
Putting it all together
The site I was mirroring was quite small, so I manually assembled a list of pages to mirror, gave each a name, and wrote them in a text file called urls.txt in this format:
```
https://site.example/      index
https://site.example/foo   foo
https://site.example/bar   bar
```
I then ran this script to mirror each URL as an image and PDF, before mirroring the entire site locally in HTML.
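A sketch of that script is below. It assumes the urls.txt format shown above, hard-codes `https://site.example/` as the site to mirror, and reuses the wkhtmltopdf, wkhtmltoimage and wget invocations from the earlier sections, so adjust the URL and widths to suit:

```shell
#!/bin/bash
# archive.sh - sketch of an archiving script for this workflow.
# Assumes urls.txt contains "<url> <name>" pairs, one per line.
set -eu

mkdir -p pdf/ jpg/desktop/ jpg/tablet/ jpg/mobile/

while read -r url name; do
  # Page content as PDF
  wkhtmltopdf "$url" "pdf/$name.pdf"
  # Page appearance at three screen widths, zoomed 200%
  wkhtmltoimage --format jpg --zoom 2 --width 1920 "$url" "jpg/desktop/$name.jpg"
  wkhtmltoimage --format jpg --zoom 2 --width 768 "$url" "jpg/tablet/$name.jpg"
  wkhtmltoimage --format jpg --zoom 2 --width 375 "$url" "jpg/mobile/$name.jpg"
done < urls.txt

# Full HTML mirror of the site
mkdir -p html/
cd html
wget \
  --trust-server-names \
  -e robots=off \
  --mirror \
  --convert-links \
  --adjust-extension \
  --page-requisites \
  --no-parent \
  https://site.example/
```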
The actual domain example.com has only one page, so after running the script against it, you are left with this set of files:
```
├── archive.sh
├── html
│   └── example.com
│       └── index.html
├── jpg
│   ├── desktop
│   │   └── index.jpg
│   ├── mobile
│   │   └── index.jpg
│   └── tablet
│       └── index.jpg
├── pdf
│   └── index.pdf
└── urls.txt
```