Three ways to archive a website

I recently needed to archive a small website before decommissioning it. There are a few distinct reasons you might want an archive of a website:

  • To archive the information in case you need it later.
  • To archive the look and feel so that you can see how it has progressed.
  • To archive the digital artifacts so that you can host them elsewhere as a mirror.

Each of these produces files in a different format, and each is useful over a different time period. In this post, I’ll write a bit about all three, since it’s easiest to archive a website while it is still online.

1. Saving webpage content to PDF

To write an individual page to a PDF, you can use wkhtmltopdf. On Debian/Ubuntu, this can be installed with:

sudo apt-get install wkhtmltopdf

The only extra setting I use for this is the “Javascript delay”, since some parts of the page will be loaded after the main content.

mkdir -p pdf/
wkhtmltopdf --javascript-delay 1000 https://example.com/ pdf/index.pdf

This produces a PDF, which you can copy/paste text from, or print.
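
If you want a quick check that the text layer made it into the PDF, pdftotext (from the poppler-utils package) can dump it back out. This is just an optional sanity check, not part of the archive itself:

pdftotext pdf/index.pdf - | head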

You then simply repeat this for every page which you want to archive.

2. Saving webpage content to an image

If you are more interested in how the website looked than in what it contained, then you can use the same package to write it to an image. I used the JPEG format here, because the file sizes were reasonable at higher resolutions. I also zoomed the page to 200% to get higher quality, and selected widths typical of desktop, tablet and mobile screens.

mkdir -p jpg/desktop jpg/mobile jpg/tablet
wkhtmltoimage --zoom 2.0 --javascript-delay 1000 --width 4380 https://example.com/ jpg/desktop/index.jpg
wkhtmltoimage --zoom 2.0 --javascript-delay 1000 --width 2048 https://example.com/ jpg/tablet/index.jpg
wkhtmltoimage --zoom 2.0 --javascript-delay 1000 --width 960 https://example.com/ jpg/mobile/index.jpg

This gives you three images for the page. This example page is quite short, but a larger page produces a very tall image.
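
If the full-length image becomes unwieldy, wkhtmltoimage also accepts a --height option to cap the rendered height. I believe anything below the cut-off is simply not included, so this is only useful when the top of the page is enough; the output filename here is just for illustration:

wkhtmltoimage --zoom 2.0 --javascript-delay 1000 --width 2048 --height 3000 https://example.com/ jpg/tablet/index-top.jpg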

As above, this needs to be repeated for each page which you want to archive.

3. Mirroring the entire site as HTML

A full mirror of the site is a good short-term archive. Some websites have a lot of embedded external content like maps and social media widgets, which I would expect to gradually stop working over time as those services change. Even so, you might still be able to browse the website on your local computer in 10 or 20 years’ time, depending on how browsers change.

wget is the go-to tool for mirroring sites, but it has a lot of options!

mkdir -p html/
cd html
wget \
  --trust-server-names \
  -e robots=off \
  --mirror \
  --convert-links \
  --adjust-extension \
  --page-requisites \
  --no-parent \
  https://example.com

There are quite a few options here, so I’ll briefly explain why I used each one:

--trust-server-names Use the filename from the final URL when a redirect is followed.
-e robots=off Ignore the site’s robots.txt rules. This is only OK to do if you own the site and can be sure that mirroring it will not cause capacity issues.
--mirror Short-hand for a set of options to recursively download the site.
--convert-links Change links in the downloaded pages to point at the local copies.
--adjust-extension If you get a page called “foo”, save it as “foo.html”.
--page-requisites Also download the CSS, JavaScript and image files referenced by each page.
--no-parent Only download sub-pages of the starting page. This is useful if you want to fetch only part of the domain, as in the example below.
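
For example, to fetch only part of a domain you can point the same command at a sub-path, and --no-parent stops wget from climbing back up into the rest of the site (the /blog/ path here is just an illustration):

wget \
  --trust-server-names \
  -e robots=off \
  --mirror \
  --convert-links \
  --adjust-extension \
  --page-requisites \
  --no-parent \
  https://example.com/blog/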

The result can be opened locally in a web browser.
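
On a Linux desktop, for example, something like this should bring up the mirrored front page (the path assumes the html/example.com layout shown at the end of this post):

xdg-open html/example.com/index.html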

These options worked well for me on a WordPress site.

Putting it all together

The site I was mirroring was quite small, so I manually assembled a list of pages to mirror, gave each a name, and wrote them in a text file called urls.txt in this format:

https://site.example/ index
https://site.example/foo foo
https://site.example/bar bar

I then ran this script to mirror each URL as an image and PDF, before mirroring the entire site locally in HTML.

#!/bin/bash
set -exu -o pipefail

mkdir -p jpg/desktop jpg/mobile jpg/tablet html/ pdf
while read -r page_url page_name; do
  echo "## $page_url ($page_name)"
  # JPEG archive
  wkhtmltoimage --zoom 2.0 --javascript-delay 1000 --width 4380 "$page_url" "jpg/desktop/$page_name.jpg"
  wkhtmltoimage --zoom 2.0 --javascript-delay 1000 --width 2048 "$page_url" "jpg/tablet/$page_name.jpg"
  wkhtmltoimage --zoom 2.0 --javascript-delay 1000 --width 960 "$page_url" "jpg/mobile/$page_name.jpg"
  # Printable archive
  wkhtmltopdf --javascript-delay 1000 "$page_url" "pdf/$page_name.pdf"
done < urls.txt

# Browsable archive
MAIN_PAGE=$(head -n1 urls.txt | cut -d' ' -f1)
mkdir -p html/
(cd html && \
  wget --trust-server-names -e robots=off --mirror --convert-links --adjust-extension --page-requisites --no-parent "$MAIN_PAGE")

The actual domain example.com has only one page, so running the script against it downloads this set of files:

├── archive.sh
├── html
│   └── example.com
│       └── index.html
├── jpg
│   ├── desktop
│   │   └── index.jpg
│   ├── mobile
│   │   └── index.jpg
│   └── tablet
│       └── index.jpg
├── pdf
│   └── index.pdf
└── urls.txt

Happy archiving!

How to resize a Windows VM image with virt-resize

I recently had a Windows 7 Virtual Machine stored on an undersized qcow2 file. This post steps through the simplest way that I know to produce a new, bigger disk and expand the filesystem onto it.

Zero out the empty space

Because the guest VM is stored on a qcow2 file, we can recover unused space on the host’s disk by zeroing it out inside the guest first. Download the sdelete utility from Microsoft and run it on the system:

sdelete -z

Once this is done, power off the guest.
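
How you power it off depends on how the guest is run. If it happens to be managed with libvirt (an assumption, not a requirement of these steps), something like this would do it, where win7 is whatever your domain is called:

virsh shutdown win7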

Assuming the host is Linux, you need the qemu-utils and libguestfs-tools packages to follow these steps. On Debian:

apt-get install libguestfs-tools qemu-utils

Move the VM image to a new filename and inspect it.

mv windows.img windows.img.bak

The file command indicates that the disk is about 30GB when fully expanded.

$ file windows.img.bak 
windows.img.bak: QEMU QCOW Image (v3), 32212254720 bytes

The qemu-img command shows that the disk is 83% full:

$ qemu-img check windows.img.bak 
No errors were found on the image.
411337/491520 = 83.69% allocated, 5.66% fragmented, 0.00% compressed clusters
Image end offset: 26961969152

Check the filesystem names; note that /dev/sda2 is the partition we want to enlarge in this case:

$ virt-filesystems -a windows.img.bak -l
Name       Type        VFS   Label            Size         Parent
/dev/sda1  filesystem  ntfs  System Reserved  104857600    -
/dev/sda2  filesystem  ntfs  -                32105299968  -

Make a new, bigger disk image

Create a new disk of the desired size. In my case, 50G is sufficient:

$ qemu-img create -f qcow2 windows.img 50G
Formatting 'windows.img', fmt=qcow2 size=53687091200 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16

The file command shows that this new empty disk image is larger than the old image.

$ file windows.img
windows.img: QEMU QCOW Image (v3), 53687091200 bytes

Copy the old disk to the new one. The --expand option names a partition which will be grown to fill the extra space.

virt-resize --expand /dev/sda2 windows.img.bak windows.img

The virt-resize command shows a progress bar while it works, and zero blocks will be reclaimed as a result of the output format.

The final line of output suggests holding on to your backup until you’ve checked it, which is wise:

Resize operation completed with no errors. Before deleting the old disk,
carefully check that the resized disk boots and works correctly.

Check that the new disk is valid and contains partitions at the expected size:

$ virt-filesystems -a windows.img -l
Name       Type        VFS   Label            Size         Parent
/dev/sda1  filesystem  ntfs  System Reserved  104857600    -
/dev/sda2  filesystem  ntfs  -                53579939840  -

Boot up the guest

When the machine boots up, you may get a disk check prompt. Because the console I was using triggered the ‘Press any key to cancel’ prompt, I had to reboot and leave the console disconnected in order for the check to start.

After booting, the C:\ drive should display at its new size.

It’s time to migrate away from Outlook Express

Outlook Express is obsolete, so you need to migrate if you’re still using it. This post is a quick guide to saving your local data so that you can jump ship.

If you are keen on desktop-based email, then there are only two real contenders for a replacement mail client:

  • Mozilla Thunderbird (suggested).
  • Windows Mail.

So where is all my data?

The script below is a Windows batch script for backing up an Outlook Express setup. Just fill in the IDENTITY variable and run it from anywhere to produce a folder containing the saved Outlook Express emails and contacts.

You can find your identity string as a folder name in your profile path, under Local Settings\Application Data\Identities\. The profile path for a user called bob would usually be in a folder like C:\Documents and Settings\bob or C:\Users\bob.

These files can be read by both of the suggested replacements, so a good transition might be:

  1. Make a backup.
  2. Install an alternative & set it up.
  3. Try to import as much as you can.
  4. Once you’re happy, uninstall Outlook Express, or at least delete the shortcuts to it.

@echo off
ECHO -------------------------------------------------------------------------------
SET BACKUPDIR="%username% - Outlook Backup"
SET IDENTITY={AAAAAAAA-AAAA-AAAA-AAAA-AAAAAAAAAAAA}
ECHO Backing up outlook files to: %backupdir%
ECHO --------------------------------------------------------------------------------

ECHO - Clearing backup location... (1 of 3) 
DEL /F /S /Q %backupdir% > NUL 2>&1
MKDIR %backupdir% 2> NUL
ECHO - - Done

ECHO - Copying emails... (2 of 3) 
MKDIR %backupdir%\Emails 2> NUL
XCOPY /E /H /C /R /Y "%userprofile%\Local Settings\Application Data\Identities\%IDENTITY%\Microsoft\Outlook Express" %backupdir%\Emails > NUL
ECHO - - Done

ECHO - Copying address book... (3 of 3)
MKDIR %backupdir%\AddressBook 2> NUL
XCOPY /E /H /C /R /Y "%appdata%\Microsoft\Address Book" %backupdir%\AddressBook > NUL
ECHO - - Done

ECHO - Completed 3 of 3 tasks.
ECHO -------------------------------------------------------------------------------
ECHO The backup is complete.
ECHO Please copy %BACKUPDIR% to external storage.
ECHO --------------------------------------------------------------------------------
PAUSE

Backing up from a hosting provider

Backups are great, and they’re not rocket science. I’m writing up how we do backups, not because I think it’s a cool or unique setup (because it’s not), but to highlight how effective a simple solution can be.

We use rsync to take a local copy of whatever is on our web host, without wasting bandwidth re-downloading files that haven’t changed. The setup looks like this:

Our hosting provider is accessible via ssh, and the backup box we use is a Raspberry Pi model B, costing (more or less) 50 AUD to get running.

On the server

On the server, we back up databases with mysqldump. To do this, you need to enter the user details into a .my.cnf file (a sketch of that file follows the script), and then something like this will do the trick:

#!/bin/sh
# Remove old dump
rm -f database.sql.gz

# Dump and compress database
mysqldump -h sql.example.com --all-databases > database.sql
gzip database.sql
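
The .my.cnf mentioned above only needs a [client] section. A minimal sketch, with placeholder credentials, looks like this:

[client]
user=backupuser
password=changeme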

The script above is saved as database-dump.sh, and is invoked from the backup box to dump the databases to a file before all of the files are fetched.

On the backup box

First, a script to fetch the files. You should set up passwordless login with ssh-copy-id, as sketched after the script, so that this works non-interactively:

#!/bin/sh
# Update the database dump
ssh user@host.example.com './database-dump.sh'
# Get files
rsync -avz --delete-during user@host.example.com:/home/user .
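
Setting up the passwordless login mentioned above is a one-off job on the backup box, roughly:

# Generate a key pair if the backup box doesn't already have one
ssh-keygen
# Copy the public key to the web host
ssh-copy-id user@host.example.com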

We also save a copy of the files in a dated archive, so that we can go back and find things which have since been deleted. Add this at the end of the script above:

mkdir -p archive
now=$(date +"%Y-%m-%d")
tar -czf archive/backup-$now.tar.gz user

There aren’t a huge number of changes to record daily, so we set up cron to run the above script weekly on the backup box. Read man crontab for how to do this, or see the example below.
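
As an illustration, a weekly crontab entry on the backup box might look like this (the script path is hypothetical; use wherever you saved the fetch script):

# Run the backup every Monday at 03:00
0 3 * * 1 /home/pi/backup.sh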

What backup is not

If you think you shouldn’t be doing backups, you’re wrong. The following are not good excuses:

  1. Trust — Whoever is looking after the data won’t lose it.
    Our host is pretty good, but their terms of service say they won’t be responsible for any data loss. Even providers which have support agreements can make mistakes. You’ll also be able to work faster if you’re not paranoid about any mistake being unrecoverable.
  2. Expense — It’s a nice idea but not worth it.
    It’s dirt cheap, you can learn to do it yourself, and once it’s set up it requires virtually no administration. If your organisation can’t afford some kind of backup solution, then it should probably stop using data in any form.
  3. RAID — I invested money in RAID, so I don’t need backups.
    If you accidentally delete something, or notice that some of your files have been tampered with, then RAID will not help you. If there is a problem (e.g. fire) at the hosting location, then you will be in trouble regardless of disk redundancy.