Three ways to archive a website

I recently needed to archive a small website before decommissioning it. There are a few distinct reasons you might want an archive of a website:

  • To archive the information in case you need it later.
  • To archive the look and feel so that you can see how it has progressed.
  • To archive the digital artifacts so that you can host them elsewhere as a mirror.

Each of these produces files in a different format, and each is useful over a different time period. In this post, I’ll write a bit about all three, since it’s easiest to archive a website while it is still online.

1. Saving webpage content to PDF

To write an individual page to a PDF, you can use wkhtmltopdf. On Debian/Ubuntu, this can be installed with:

sudo apt-get install wkhtmltopdf

The only extra setting I use for this is the “JavaScript delay”, since some parts of the page will be loaded after the main content.

mkdir -p pdf/
wkhtmltopdf --javascript-delay 1000 https://example.com/ pdf/index.pdf

This produces a PDF that you can print, or copy and paste text from.
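If you want to confirm that the PDF kept a selectable text layer (rather than a rasterised snapshot), pdftotext can dump it to the terminal. This is an optional extra, not part of the wkhtmltopdf workflow:

# Optional: dump the text layer to stdout to check the text survived.
# (pdftotext ships in the poppler-utils package on Debian/Ubuntu.)
pdftotext pdf/index.pdf - | head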

You then simply repeat this for every page which you want to archive.

2. Saving webpage content to an image

If you are more interested in how the website looked than in what it contained, you can use the same package to write each page to an image. I used the JPEG format here because the file sizes stayed reasonable at higher resolutions. I also zoomed the page to 200% for extra quality, and picked widths typical of desktop, tablet and mobile screens.

mkdir -p jpg/desktop jpg/mobile jpg/tablet
wkhtmltoimage --zoom 2.0 --javascript-delay 1000 --width 4380 https://example.com/ jpg/desktop/index.jpg
wkhtmltoimage --zoom 2.0 --javascript-delay 1000 --width 2048 https://example.com/ jpg/tablet/index.jpg
wkhtmltoimage --zoom 2.0 --javascript-delay 1000 --width 960 https://example.com/ jpg/mobile/index.jpg

This gives you three images for the page. This example page is quite short, but a larger page produces a very tall image.
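If you want to verify that the widths came out as expected, ImageMagick’s identify tool prints each file’s dimensions. This is just an optional sanity check, assuming ImageMagick is installed:

# Optional: print the pixel dimensions of each capture.
identify jpg/desktop/index.jpg jpg/tablet/index.jpg jpg/mobile/index.jpg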

As above, this needs to be repeated for each page which you want to archive.

3. Mirroring the entire site as HTML

A full mirror of the site is a good short-term archive. Some websites have a lot of embedded external content, like maps and social media widgets, which I would expect to gradually stop working as those services change. Even so, you might still be able to browse the website on your local computer in 10 or 20 years’ time, depending on how browsers change.

wget is the go-to tool for mirroring sites, but it has a lot of options!

mkdir -p html/
cd html
wget \
  --trust-server-names \
  -e robots=off \
  --mirror \
  --convert-links \
  --adjust-extension \
  --page-requisites \
  --no-parent \
  https://example.com

There are quite a few options here, so I’ll briefly explain why I used each one:

  • --trust-server-names: Use the filename from the final URL when a page redirects.
  • -e robots=off: Ignore the site’s robots.txt rules. This is only OK to do if you own the site and can be sure that mirroring it will not cause capacity issues.
  • --mirror: Shorthand for -r -N -l inf --no-remove-listing, which together download the site recursively.
  • --convert-links: Rewrite links in the downloaded pages to point at the local copies.
  • --adjust-extension: If you get a page called “foo”, save it as “foo.html”.
  • --page-requisites: Also download the CSS, JavaScript and image files referenced by each page.
  • --no-parent: Only download sub-pages of the starting page. This is useful if you want to fetch only part of the domain.
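If you are mirroring a site you do not own, a gentler variant is safer: keep robots.txt handling enabled and slow the crawl down with wget’s standard --wait and --random-wait options. Treat this as a sketch and tune the delay to the site:

# A politer variation for sites you don't own: respect robots.txt
# and pause between requests so the crawl doesn't hammer the server.
wget \
  --wait=1 \
  --random-wait \
  --mirror \
  --convert-links \
  --adjust-extension \
  --page-requisites \
  --no-parent \
  https://example.com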

The result can be opened locally in a web browser.

These options worked well for me on a WordPress site.

Putting it all together

The site I was mirroring was quite small, so I manually assembled a list of pages to mirror, gave each a name, and wrote them in a text file called urls.txt in this format:

https://site.example/ index
https://site.example/foo foo
https://site.example/bar bar

I then ran this script to mirror each URL as an image and PDF, before mirroring the entire site locally in HTML.

#!/bin/bash
set -exu -o pipefail

mkdir -p jpg/desktop jpg/mobile jpg/tablet html/ pdf/
while read -r page_url page_name; do
  echo "## $page_url ($page_name)"
  # JPEG archive
  wkhtmltoimage --zoom 2.0 --javascript-delay 1000 --width 4380 "$page_url" "jpg/desktop/$page_name.jpg"
  wkhtmltoimage --zoom 2.0 --javascript-delay 1000 --width 2048 "$page_url" "jpg/tablet/$page_name.jpg"
  wkhtmltoimage --zoom 2.0 --javascript-delay 1000 --width 960 "$page_url" "jpg/mobile/$page_name.jpg"
  # Printable archive
  wkhtmltopdf --javascript-delay 1000 "$page_url" "pdf/$page_name.pdf"
done < urls.txt

# Browsable archive: mirror the whole site, starting from the first URL
MAIN_PAGE=$(head -n1 urls.txt | cut -d' ' -f1)
(cd html && \
  wget --trust-server-names -e robots=off --mirror --convert-links --adjust-extension --page-requisites --no-parent "$MAIN_PAGE")

The real domain example.com has only one page, so running the script against it downloads this set of files:

├── archive.sh
├── html
│   └── example.com
│       └── index.html
├── jpg
│   ├── desktop
│   │   └── index.jpg
│   ├── mobile
│   │   └── index.jpg
│   └── tablet
│       └── index.jpg
├── pdf
│   └── index.pdf
└── urls.txt

Happy archiving!

New WordPress theme (2018 edition)

This week I replaced the previous WordPress theme on this blog with the current one.

I used Bootstrap to place widgets in my blog content, and Prism.js to do syntax highlighting on code snippets.

This is a heavily modified version of the default twentyseventeen theme. I chose it as a base because of its good use of white-space for typesetting. The updated Bootstrap-based layout is also a big improvement for mobile users, who now make up the majority of web traffic.

How to create an animated GIF from a series of images

Sometimes, you end up with a folder full of images, which you want to animate. With the open source ImageMagick tool, this is easy on the command line:

animate *.png

This will show you all of the PNG files in the folder in quick succession, like a flip book.

ImageMagick works on just about any OS. For Linux users, the package is generally imagemagick or ImageMagick:

sudo apt-get install imagemagick  # Debian/Ubuntu
yum install ImageMagick           # RHEL/CentOS

But this blog post is about animated GIFs, so let’s make one of those. A GIF is a compact way to combine images (here and here for examples in context), and it gives you a re-usable, at-a-glance illustration of something that changes over time.

Example from an older post: an animated GIF of a Tetris game.

The steps to make a good conversion command are:

  1. Check that your images are in alphabetical order, since that is the order they will be animated in. If not, rename them (see the renaming sketch after this list):
    echo *
  2. Convert them to a GIF a few times, and find the delay that suits you (in hundredths of a second between frames):
    convert -delay 80 *.png animated.gif
  3. Choose an output size (here 415 pixels wide; the height scales to match):
    convert -resize 415x -delay 80 *.png animated.gif
  4. Compress with -layers Optimize for a smaller file:
    convert -resize 415x -delay 80 *.png -layers Optimize animated.gif
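Here is the renaming sketch mentioned in step 1. It assumes hypothetical names like frame1.png, frame2.png, and so on, and zero-pads the numbers so they sort correctly:

# Zero-pad numeric suffixes so that frame2.png sorts before frame10.png.
# ("frame" is a hypothetical prefix; adjust it to match your files.)
for f in frame*.png; do
  n=${f#frame}; n=${n%.png}                    # extract the number
  new=$(printf 'frame%03d.png' "$((10#$n))")   # force base 10, pad to 3 digits
  [ "$f" = "$new" ] || mv "$f" "$new"
done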

Notes

  • Generated thumbnails usually take the first frame only, which is why we ask ImageMagick to resize it (WordPress users: choose “Full Size”).
  • To pause at the start of the loop for a moment, just copy the first image a few times.
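For example, assuming frames named frame001.png onwards (hypothetical names), the copies just need names that sort before the rest:

# Hold the opening frame for longer by duplicating it; the copies'
# names are chosen so they still sort first alphabetically.
cp frame001.png frame000a.png
cp frame001.png frame000b.png
convert -resize 415x -delay 80 frame*.png -layers Optimize animated.gif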

New WordPress theme

Since the last major revision of my site setup, I’ve been including more technical content, which would be easier to read with syntax highlighting and tabs.

The most visible part of the transition is now complete.

The old theme was Skittlish, but I decided to move to a new theme which was based on Bootstrap, so that I could use its components. The new theme is a modified version of the default twentyfourteen theme, using the visual style of morphic, with Prism.js added for code highlighting.

Successful migration to WordPress in 3 easy steps

We made the decision in January to migrate our website to a WordPress installation, which is the CMS of choice for most blogs.

This posed a big challenge, mainly because we had been using our in-house CMS to publish content for the past 2 years, meaning our content was tied up in a difficult-to-export format.

Still, it allowed me to dig into the nifty and well-developed world of WordPress. My magic formula for a WordPress migration, in a nutshell:

  1. Export your old blog as something WordPress can understand.
  2. Hack at the theme until your site is beautiful.
  3. Don’t break your URLs.

1. Export your blog

Depending on how you are blogging already, you may be able to save an export which can be loaded into WordPress with a plugin (see Importing Content on the WordPress wiki).

We were not in this lucky category, so I delved into the WXR (WordPress eXtended RSS) format. This example file was a big help, and I wrote up a short PHP script to create a similar-looking file from my blog.
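For a rough idea of the shape of the format: WXR is an RSS 2.0 feed with WordPress-specific elements in a wp: namespace. The fragment below is a hand-written sketch rather than a validated export; the titles and body are made up, and a real export carries many more fields:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <title>My Blog</title>
    <wp:wxr_version>1.2</wp:wxr_version>
    <item>
      <!-- One <item> per post -->
      <title>Hello world</title>
      <wp:post_type>post</wp:post_type>
      <wp:status>publish</wp:status>
      <content:encoded><![CDATA[<p>Post body goes here.</p>]]></content:encoded>
    </item>
  </channel>
</rss>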

2. Hack at the theme

We adapted our site from the ‘Skittlish’ theme, which has also been ported to WordPress. Every theme carries some baggage, so I highly suggest rolling up your sleeves and opening wp-content/themes on your blog.

All non-feature modifications are done in WordPress via themes, so keep tweaking it until you’re happy, or get a designer to put together a theme that suits your needs.

3. Don’t break your URLs

I link between blog posts a lot, and broken links would be mind-numbing to clean up afterwards (and an SEO sin). With the import method above, I used article titles which matched the old permalinks.

WordPress then lets you configure permalinks to use this field, replicating the old behavior and keeping everybody happy.
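For example, assuming the old URLs had the form /post-title, selecting the “Post name” structure under Settings → Permalinks corresponds to this permalink setting:

/%postname%/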

Wrap-up

If you run WordPress on your site, then it makes sense to have somebody on your team who really knows how it works.

If your initial setup is not handled with care, then you could end up wasting several days of work checking old content for errors.

Good luck!