Three ways to archive a website

I recently needed to archive a small website before decommissioning it. There are a few distinct reasons you might want an archive of a website:

  • To archive the information in case you need it later.
  • To archive the look and feel so that you can see how it has progressed.
  • To archive the digital artifacts so that you can host them elsewhere as a mirror.

Each of these produces files in a different format, which are useful over different time-periods. In this post, I’ll write a bit about all three, since it’s easiest to archive a website while it is still online.

1. Saving webpage content to PDF

To write an individual page to a PDF, you can use wkhtmltopdf. On Debian/Ubuntu, this can be installed with:

sudo apt-get install wkhtmltopdf

The only extra setting I use for this is the “Javascript delay”, since some parts of the page will be loaded after the main content.

mkdir -p pdf/
wkhtmltopdf --javascript-delay 1000 https://example.com/ pdf/index.pdf

This produces a PDF, which you can copy/paste text from, or print.

You then simply repeat this for every page which you want to archive.

2. Saving webpage content to an image

If you are more interested in how the website looked, rather than what it contained, then you can use the same package to write it to an image. I used the jpg format here, because the file sizes were reasonable at higher resolution. I also zoom the page 200% to get higher quality, and selected sizes which are typical of desktop, tablet and mobile screen sizes.

mkdir -p jpg/desktop jpg/mobile jpg/tablet
wkhtmltoimage --zoom 2.0 --javascript-delay 1000 --width 4380 https://example.com/ jpg/desktop/index.jpg
wkhtmltoimage --zoom 2.0 --javascript-delay 1000 --width 2048 https://example.com/ jpg/tablet/index.jpg
wkhtmltoimage --zoom 2.0 --javascript-delay 1000 --width 960 https://example.com/ jpg/mobile/index.jpg

This gives you three images for the page. This example page is quite short, but a larger page produces a very tall image.

As above, this needs to be repeated for each page which you want to archive.

3. Mirroring the entire site as HTML

A full mirror of the site is a good short-term archive. Some websites have a lot of embedded external content like maps and external social media widgets, which I would expect to gradually stop working over time as these services change. Still, you might still be able to browse the website on your local computer in 10 or 20 years time, depending on how browsers change.

wget is the go-to tool for mirroring sites, but it has a lot of options!

mkdir -p html/
cd html
wget \
  --trust-server-names \
  -e robots=off \
  --mirror \
  --convert-links \
  --adjust-extension \
  --page-requisites \
  --no-parent \
  https://example.com

There are quite a few options here, I’ll briefly explain why I used each one:

--trust-server-names Allow the correct filename to be used when a redirect is used.
-e robots=off Disable rate limiting. This only OK to do if you own the site and can be sure that mirroring it will not cause capacity issues.
--mirror Short-hand for some options to recursively download the site.
--convert-links Change links on the target site to local ones.
--adjust-extension If you get a page called “foo”, save it as “foo.html”.
--page-requisites Also download CSS and Javascript files referenced on the page
--no-parent Only download sub-pages from the starting page. This is useful if you want to fetch only part of the domain.

The result can be opened locally in a web browser:

These options worked well for me on a WordPress site.

Putting it all together

The site I was mirroring was quite small, so I manually assembled a list of pages to mirror, gave each a name, and wrote them in a text file called urls.txt in this format:

https://site.example/ index
https://site.example/foo foo
https://site.example/bar bar

I then ran this script to mirror each URL as an image and PDF, before mirroring the entire site locally in HTML.

#!/bin/bash
set -exu -o pipefail

mkdir -p jpg/desktop jpg/mobile jpg/tablet html/ pdf
while read page_url page_name; do
  echo "## $page_url ($page_name)"
  # JPEG archive
  wkhtmltoimage --zoom 2.0 --javascript-delay 1000 --width 4380 $page_url jpg/desktop/$page_name.jpg
  wkhtmltoimage --zoom 2.0 --javascript-delay 1000 --width 2048 $page_url jpg/tablet/$page_name.jpg
  wkhtmltoimage --zoom 2.0 --javascript-delay 1000 --width 960 $page_url jpg/mobile/$page_name.jpg
  # Printable archive
  wkhtmltopdf --javascript-delay 1000 $page_url pdf/$page_name.pdf
done < urls.txt

# Browsable archive
MAIN_PAGE=$(head -n1 urls.txt | cut -d' ' -f1)
mkdir -p html/
(cd html && \
  wget --trust-server-names -e robots=off --mirror --convert-links --adjust-extension --page-requisites --no-parent $MAIN_PAGE)

The actual domain example.com has only one page, so after running the script against it, it downloads this set of files:

├── archive.sh
├── html
│   └── example.com
│       └── index.html
├── jpg
│   ├── desktop
│   │   └── index.jpg
│   ├── mobile
│   │   └── index.jpg
│   └── tablet
│       └── index.jpg
├── pdf
│   └── index.pdf
└── urls.txt

Happy archiving!

Monitoring network throughput with Prometheus

Today I’m writing a bit about a Prometheus deployment that I made last year on a Raspberry Pi, to get better data about congestion on my uplink to the Internet.

The problem

You have probably run an Internet speed test before, like this:

2017-07-net-05

A speed test will tell you how slow your computer’s connection is, but it can’t narrow down whether it’s because of other LAN users, the line quality, or congestion at the provider.

You can start to assemble this information from the router, which has counters for each network interface:

2017-07-net-02

This table is from a Sagemcom F@ST 3864, which is a consumer-grade router. It has no SNMP interface, so the only way to get these metrics is to query /statsifc.html and /info.html from the LAN.

Getting the data

We can derive throughput metrics for the uplink if we scrape these metrics every few second and load them into a time-series database. To do this, I wrote a small adapter (called an “exporter” in Prometheus lingo), which exposed the metrics in a more structured way.

The result was a web page on the Raspberry Pi, which returned interface data like this:

# HELP lan_network_receive_bytes Received bytes for network interface
# TYPE lan_network_receive_bytes gauge
lan_network_receive_bytes{device="eth0"} 0.0
lan_network_receive_bytes{device="eth1"} 0.0
lan_network_receive_bytes{device="eth2"} 0.0
lan_network_receive_bytes{device="eth3"} 0.0
lan_network_receive_bytes{device="wl0"} 737476060.0
# HELP lan_network_send_bytes Sent bytes for network interface
# TYPE lan_network_send_bytes gauge
lan_network_send_bytes{device="eth0"} 363957004.0
lan_network_send_bytes{device="eth1"} 0.0
lan_network_send_bytes{device="eth2"} 0.0
lan_network_send_bytes{device="eth3"} 0.0
lan_network_send_bytes{device="wl0"} 2147483647.0
# HELP lan_network_receive_packets Received packets for network interface
# TYPE lan_network_receive_packets gauge
lan_network_receive_packets{device="eth0",disposition="transfer"} 1766250.0
lan_network_receive_packets{device="eth0",disposition="error"} 0.0
lan_network_receive_packets{device="eth0",disposition="drop"} 0.0
lan_network_receive_packets{device="eth1",disposition="transfer"} 0.0
lan_network_receive_packets{device="eth1",disposition="error"} 0.0
lan_network_receive_packets{device="eth1",disposition="drop"} 0.0
lan_network_receive_packets{device="eth2",disposition="transfer"} 0.0
lan_network_receive_packets{device="eth2",disposition="error"} 0.0
lan_network_receive_packets{device="eth2",disposition="drop"} 0.0
lan_network_receive_packets{device="eth3",disposition="transfer"} 0.0
lan_network_receive_packets{device="eth3",disposition="error"} 0.0
lan_network_receive_packets{device="eth3",disposition="drop"} 0.0
lan_network_receive_packets{device="wl0",disposition="transfer"} 6622351.0
lan_network_receive_packets{device="wl0",disposition="error"} 0.0
lan_network_receive_packets{device="wl0",disposition="drop"} 0.0
# HELP lan_network_send_packets Sent packets for network interface
# TYPE lan_network_send_packets gauge
lan_network_send_packets{device="eth0",disposition="transfer"} 3148577.0
lan_network_send_packets{device="eth0",disposition="error"} 0.0
lan_network_send_packets{device="eth0",disposition="drop"} 0.0
lan_network_send_packets{device="eth1",disposition="transfer"} 0.0
lan_network_send_packets{device="eth1",disposition="error"} 0.0
lan_network_send_packets{device="eth1",disposition="drop"} 0.0
lan_network_send_packets{device="eth2",disposition="transfer"} 0.0
lan_network_send_packets{device="eth2",disposition="error"} 0.0
lan_network_send_packets{device="eth2",disposition="drop"} 0.0
lan_network_send_packets{device="eth3",disposition="transfer"} 0.0
lan_network_send_packets{device="eth3",disposition="error"} 0.0
lan_network_send_packets{device="eth3",disposition="drop"} 0.0
lan_network_send_packets{device="wl0",disposition="transfer"} 8803737.0
lan_network_send_packets{device="wl0",disposition="error"} 0.0
lan_network_send_packets{device="wl0",disposition="drop"} 0.0
# HELP wan_network_receive_bytes Received bytes for network interface
# TYPE wan_network_receive_bytes gauge
wan_network_receive_bytes{device="ppp2.1"} 3013958333.0
wan_network_receive_bytes{device="ptm0.1"} 0.0
wan_network_receive_bytes{device="eth4.3"} 0.0
wan_network_receive_bytes{device="ppp1.1"} 0.0
wan_network_receive_bytes{device="ppp3.2"} 0.0
# HELP wan_network_send_bytes Sent bytes for network interface
# TYPE wan_network_send_bytes gauge
wan_network_send_bytes{device="ppp2.1"} 717118493.0
wan_network_send_bytes{device="ptm0.1"} 0.0
wan_network_send_bytes{device="eth4.3"} 0.0
wan_network_send_bytes{device="ppp1.1"} 0.0
wan_network_send_bytes{device="ppp3.2"} 0.0
# HELP wan_network_receive_packets Received packets for network interface
# TYPE wan_network_receive_packets gauge
wan_network_receive_packets{device="ppp2.1",disposition="transfer"} 11525693.0
wan_network_receive_packets{device="ppp2.1",disposition="error"} 0.0
wan_network_receive_packets{device="ppp2.1",disposition="drop"} 0.0
wan_network_receive_packets{device="ptm0.1",disposition="transfer"} 0.0
wan_network_receive_packets{device="ptm0.1",disposition="error"} 0.0
wan_network_receive_packets{device="ptm0.1",disposition="drop"} 0.0
wan_network_receive_packets{device="eth4.3",disposition="transfer"} 0.0
wan_network_receive_packets{device="eth4.3",disposition="error"} 0.0
wan_network_receive_packets{device="eth4.3",disposition="drop"} 0.0
wan_network_receive_packets{device="ppp1.1",disposition="transfer"} 0.0
wan_network_receive_packets{device="ppp1.1",disposition="error"} 0.0
wan_network_receive_packets{device="ppp1.1",disposition="drop"} 0.0
wan_network_receive_packets{device="ppp3.2",disposition="transfer"} 0.0
wan_network_receive_packets{device="ppp3.2",disposition="error"} 0.0
wan_network_receive_packets{device="ppp3.2",disposition="drop"} 0.0
# HELP wan_network_send_packets Sent packets for network interface
# TYPE wan_network_send_packets gauge
wan_network_send_packets{device="ppp2.1",disposition="transfer"} 7728904.0
wan_network_send_packets{device="ppp2.1",disposition="error"} 0.0
wan_network_send_packets{device="ppp2.1",disposition="drop"} 0.0
wan_network_send_packets{device="ptm0.1",disposition="transfer"} 0.0
wan_network_send_packets{device="ptm0.1",disposition="error"} 0.0
wan_network_send_packets{device="ptm0.1",disposition="drop"} 0.0
wan_network_send_packets{device="eth4.3",disposition="transfer"} 0.0
wan_network_send_packets{device="eth4.3",disposition="error"} 0.0
wan_network_send_packets{device="eth4.3",disposition="drop"} 0.0
wan_network_send_packets{device="ppp1.1",disposition="transfer"} 0.0
wan_network_send_packets{device="ppp1.1",disposition="error"} 0.0
wan_network_send_packets{device="ppp1.1",disposition="drop"} 0.0
wan_network_send_packets{device="ppp3.2",disposition="transfer"} 0.0
wan_network_send_packets{device="ppp3.2",disposition="error"} 0.0
wan_network_send_packets{device="ppp3.2",disposition="drop"} 0.0
# HELP adsl_attainable_rate_down_kbps ADSL Attainable Rate down (Kbps)
# TYPE adsl_attainable_rate_down_kbps gauge
adsl_attainable_rate_down_kbps 19708.0
# HELP adsl_attainable_rate_up_kbps ADSL Attainable Rate up (Kbps)
# TYPE adsl_attainable_rate_up_kbps gauge
adsl_attainable_rate_up_kbps 1087.0
# HELP adsl_rate_down_kbps ADSL Rate down (Kbps)
# TYPE adsl_rate_down_kbps gauge
adsl_rate_down_kbps 18175.0
# HELP adsl_rate_up_kbps ADSL Rate up (Kbps)
# TYPE adsl_rate_up_kbps gauge
adsl_rate_up_kbps 1087.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 34197504.0
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 22441984.0
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1497148890.92
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 3254.92
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 7.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1024.0

I then deployed Prometheus to the same Raspberry Pi, and configured it to read these metrics every few seconds by editing prometheus.yml

global:
  scrape_interval: 5s

scrape_configs:
  - job_name: net
    static_configs:
    - targets: ["localhost:8000"]

Making some queries

Prometheus has a query language, which I find similar to spreadsheet formulas. You can enter a query directly into the web interface to get a graph or data table.

2017-07-net-03

I settled on these queries to get the data I needed. They show me the maximum attainable line rate, actual sync rate, and current throughput over the WAN interface.

Downloads

Throughput:

rate(wan_network_receive_bytes{device="ppp2.1"}[10s])*8/1024/1024

ADSL attainable:

adsl_attainable_rate_down_kbps/1024

ADSL sync:

adsl_rate_down_kbps/1024

Uploads

Usage:

rate(wan_network_send_bytes{device="ppp2.1"}[10s])*8/1024/1024

ADSL attainable:

adsl_attainable_rate_up_kbps/1024

ADSL sync:

adsl_rate_up_kbps/1024

Onto a dashboard

I then deployed the last component in this setup, Grafana, to the Raspberry Pi. This tool lets you save your queries on a dashboard.

I made two plots, one for uploads, and one for downloads-

2017-07-net-04

By saturating the link with traffic (such as when running a speed test), it was now possible to compare the actual network speed with the ADSL sync speed.

2017-07-net-06

In my case, the best attainable network speed changed depending on the time of day, while the ADSL sync speed was constant. That’s a simple case of congestion.

Conclusion

I’ve deployed a few tiny Prometheus setups like this, because of how simple it is to work with new sources of metrics. It’s designed for much larger setups than an individual router, so it’s a worthwhile tool to be familiar with. Data is always a good reality-check for your assumptions, of course.

This setup had the level of security that you would expect of a Raspberry Pi project (none), and crashed after 4 days because I did not configure it for a RAM-limited environment, but it was a useful learning exercise, so I uploaded it to GitHub anyway. The python and Ansible code can be found here.

How to export a Maildir email inbox as a CSV file

I’ve often thought it would be useful to have an “Export as CSV” button on my inbox, and recently had an excuse to implement one.

I needed to parse the Subject of each email coming from an automated process, but the inbox was in the Maildir format, and I needed something more useful for data interchange.

So I wrote a utility, mail2csv, to export the Maildir as CSV file, which many existing tools can work with. It produces one row per email, and one column per email header:

The basic usage is:

$ mail2csv example/
Date,Subject,From
"Wed, 16 May 2018 20:05:16 +0000",An email,Bob <bob@example.com>
"Wed, 16 May 2018 20:07:52 +0000",Also an email,Alice <alice@example.com>

You can select any header with the --headers field, which accepts globs:

$ mail2csv example/ --headers Message-ID Date Subject
Message-ID,Date,Subject
<89125@example.com>,"Wed, 16 May 2018 20:05:16 +0000",An email
<180c6@example.com>,"Wed, 16 May 2018 20:07:52 +0000",Also an email

If you’re not sure which headers your emails have, then you can make a CSV with all the headers, and just read the first line:

$ mail2csv  example/ --headers '*' | head -n1

You can find the Python code for this tool on GitHub here.

How to make HHVM 3.21 identify itself as PHP 7

HHVM is an alternative runtime for PHP, which I try to maintain compatibility with.

I recently upgraded a project to PHPUnit 6, and the tests started failing on hhvm-3.21 with this error:

This version of PHPUnit is supported on PHP 7.0 and PHP 7.1.
You are using PHP 5.6.99-hhvm (/usr/bin/hhvm).

Although HHVM supports many PHP 7 features, it is identifying itself as PHP 5.6. The official documentation includes a setting for enabling additional PHP 7 features, which also sets the reported version.

This project was building on Travis CI, so adding this one-liner to .travis.yml sets this flag in the configuration, and the issue goes away:

before_script:
  - bash -c 'if [[ $TRAVIS_PHP_VERSION == hhvm* ]]; then echo "hhvm.php7.all = 1" | sudo tee -a /etc/hhvm/php.ini; fi'

I’m writing about this for other developers who hit the same issue. The current releases of HHVM don’t seem to have this issue, so hopefully this work-around is temporary.

How to check PHP code style in a continuous integration environment

I’m going to write a bit today about PHP_CodeSniffer, which is a tool for detecting PHP style violations.

Background

Convention is good. It’s like following the principle of least surprise in software design, but for your developer audience.

PHP has two code standards that I see in the wild:

If you are writing new PHP code, or refactoring code that follows no discernible standard, you should use the PSR-2 code style.

I use PHP_CodeSniffer on a few projects to stop the build early if the code is not up to standard. It has the ability to fix the formatting problems it finds, so it is not a difficult test to clear.

Add PHP_CodeSniffer to your project

I’m assuming that you are already using composer to manage your PHP dependencies.

Ensure you are specifying your minimum PHP version in your composer.json:

"require": {
    "php": ">=7.0.0"
}

Next, add squizlabs/php_codesniffer as a development dependency of your project:

composer require --dev squizlabs/php_codesniffer

Check your code

I use this command to check the src folder. It can be run locally or on a CI box:

php vendor/bin/phpcs --standard=psr2 src/ -n

This will return nonzero if there are problems with the code style, which should stop the build on most systems.

The -n flag avoids less severe “warnings”. One of these is the “Line exceeds 120 characters”, which can be difficult to completely eliminate.

Fixing the style

Running the phpcbf command with the same arguments will correct simple formatting issues:

php vendor/bin/phpcs --standard=psr2 src/ -n

Travis CI example

A full .travis.yml which uses this would look like this (adapted from here):

language: php

php:
  - 7.0
  - 7.1
  - 7.2

install:
  - composer install

script:
  - php vendor/bin/phpcs --standard=psr2 test/ -n

If you are running a PHP project without any sort of build, then this is a good place to start: Aside from style, this checks that dependencies can be installed on a supported range of PHP versions, and will also break if there are files with syntax errors.

Make better CLI progress bars with Unicode block characters

As a programmer, you might add a progress bar so that the user has feedback while they wait for a slow task.

If you are writing a console (CLI) application, then you need to make your progress bars from text. A good command-line progress bar should update in small increments, like this example:

This uses Unicode Block Elements to give the progress bar a higher resolution.

Code Character
U+2588
U+2589
U+258A
U+258B
U+258C
U+258D
U+258E
U+258F

A lot of applications use plain ASCII in their progress bars. The progress bar in wget, for example, uses ===> characters only, like this:

Progress bars made from ASCII characters like = and # signs are very common, most likely because of the historical portability issues around non-ASCII text. Nowadays, UTF-8 support is ubiquitous, and it’s pointless to adhere to such limitations.

Example: Better progress bars in python

The animation at the top of this blog post is a simple python script.

import sys
import time

from progress_bar import ProgressBar

"""
Example usage of ProgressBar class
"""
print("Doing work\n")

with ProgressBar(sys.stdout) as progress:
    for i in range(0,800):
        progress.update(i / 800)
        time.sleep(0.05)
print("\nDone.\n");

The script progress_bars.py, written for Python 3, contains a class that allows the progress bar to be created and drawn on different types of terminals.

import abc
import math
import shutil
import sys
import time

from typing import TextIO

"""
Produce progress bar with ANSI code output.
"""
class ProgressBar(object):
    def __init__(self, target: TextIO = sys.stdout):
        self._target = target
        self._text_only = not self._target.isatty()
        self._update_width()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if(exc_type is None):
            # Set to 100% for neatness, if no exception is thrown
            self.update(1.0)
        if not self._text_only:
            # ANSI-output should be rounded off with a newline
            self._target.write('\n')
        self._target.flush()
        pass

    def _update_width(self):
        self._width, _ = shutil.get_terminal_size((80, 20))

    def update(self, progress : float):
        # Update width in case of resize
        self._update_width()
        # Progress bar itself
        if self._width < 12:
            # No label in excessively small terminal
            percent_str = ''
            progress_bar_str = ProgressBar.progress_bar_str(progress, self._width - 2)
        elif self._width < 40:
            # No padding at smaller size
            percent_str = "{:6.2f} %".format(progress * 100)
            progress_bar_str = ProgressBar.progress_bar_str(progress, self._width - 11) + ' '
        else:
            # Standard progress bar with padding and label
            percent_str = "{:6.2f} %".format(progress * 100) + "  "
            progress_bar_str = " " * 5 + ProgressBar.progress_bar_str(progress, self._width - 21)
        # Write output
        if self._text_only:
            self._target.write(progress_bar_str + percent_str + '\n')
            self._target.flush()
        else:
            self._target.write('\033[G' + progress_bar_str + percent_str)
            self._target.flush()

    @staticmethod
    def progress_bar_str(progress : float, width : int):
        # 0 <= progress <= 1
        progress = min(1, max(0, progress))
        whole_width = math.floor(progress * width)
        remainder_width = (progress * width) % 1
        part_width = math.floor(remainder_width * 8)
        part_char = [" ", "▏", "▎", "▍", "▌", "▋", "▊", "▉"][part_width]
        if (width - whole_width - 1) < 0:
          part_char = ""
        line = "[" + "█" * whole_width + part_char + " " * (width - whole_width - 1) + "]"
        return line

Aside from the use of box drawing characters, this script includes a few other things which a good progress bar should implement:

  • Resize the progress bar when you resize the terminal
  • Simplify the progress bar on very small terminals
  • Don’t print ANSI terminal codes if the script is not connected to a terminal
  • Round off to 100% once the use of the progress bar completes without error

After writing this, I discovered that the progress pypi package can also use these box characters, so I haven’t packaged this code up. I haven’t used progress before, but you might like to evaluate it for your own applications.

How to create effective PHP project documentation with Read the Docs

Documentation is one of the ways that software projects can communicate information to their users. Effective documentation is high-quality, meaning that it’s complete, accurate, and up-to-date. At least for open source libraries, it also means that you can find it with a search engine. For many small PHP projects, the reality is very far removed from the ideal.

Read the Docs (readthedocs.io) makes it easy to host an up-to-date copy of your project’s documentation online. There are around 2,000 PHP projects which host their documentation on the site, which makes PHP the third most popular programming language for projects on the site.

This post covers the process that is used to automatically publish the documentation for the gfx-php PHP graphics library, which is one of those 2000 projects. You should consider using this setup as a template if you your project is small enough that it does not have its own infrastructure.

Basic concept

Typically, people are using Read the Docs with a tool called Sphinx. If you are writing in Python, it’s also possible to use the autodoc Sphinx plugin to add API documentation, based on docstrings in the code.

PHP programmers are already spoiled for choice if they want to produce HTML documentation from their code. These tools all have huge PHP user bases:

These will each output their own HTML, which is only useful if you want to self-host the documentation. I wanted a tool that was more like “autodoc for PHP”, so that I can have my API docs magically appear in Sphinx output that is hosted on Read the Docs.

Doxygen is the most useful tool for this purpose, because it has a stable XML output format and good-enough PHP support. I decided to write a tool which to take the Doxygen XML info and generate rest for Sphinx:

This introduces some extra tools, which looks complex at first. The stack is geared towards running within the Read the Docs build environment, so most developers can treat it as a black box after the initial setup:

This setup is entirely hosted with free cloud services, so you don’t need to run any applications on your own hardware.

Tools to install on local workstation

First, we will set up each of these tools locally, so that we know everything is working before we upload it.

  • Doxygen
  • Sphinx
  • doxyphp2sphinx

Doxygen

Doxygen can read PHP files to extract class names, documentation and method signatures. Linux and Mac install this from most package managers (apt-get, dnf or brew) under the name doxygen, while Windows users need to chase down binaries.

In your repo, make a sub-folder called docs/, and create a Doxyfile with some defaults:

mkdir docs/
doxygen -g Doxyfile

You need to edit the configuration to make it suitable or generating XML output for your PHP project. The version of Doxygen used here is 1.8.13, where you only need to change a few values to set Doxygen to:

  • Recursively search PHP files in your project’s source folder
  • Generate XML and HTML only
  • Log warnings to a file

For a typical project, these settings are:

PROJECT_NAME           = "Example Project"
INPUT                  = ../src
WARN_LOGFILE           = warnings.log
RECURSIVE              = YES
USE_MDFILE_AS_MAINPAGE = ../README.md
GENERATE_LATEX         = NO
GENERATE_XML           = YES

Once you set these in Doxyfile, you can run Doxygen to generate HTML and XML output.

$ doxygen

Doxygen will pick up most method signatures automatically, and you can add to them via docblocks, which work along the same lines as docstrings in Python. Read Doxygen: Documenting the Code to learn the syntax if you have not used a documentation generator in a curly-bracket language before.

The Doxygen HTML will never be published, but you might need to read see how well Doxygen understands your code.

The XML output is much more useful for our purposes. It contains the same information, and we will read it to generate pages of documentation for Sphinx to render.

Sphinx

Sphinx is the tool that we will use to render the final HTML output. If you haven’t used it before, then see the official Getting Started page.

We are using Sphinx because it can be executed by online services like Read the Docs. It uses the reStructuredText format, which is a whole lot more complex than Markdown, but supports cross-references. I’ll only describe these steps briefly, because there are existing how-to guides on making Sphinx work for manually-written PHP documentation elsewhere on the Internet, such as:

Still in the docs folder with your Doxyfile, create and render an empty Sphinx project.

pip install sphinx
sphinx-quickstart --quiet --project example_project --author example_bob
make html

The generated HTML will initially appear like this:

We need to customize this in a way that adds PHP support. The quickest way is to drop this text into requirements.txt:

Sphinx==1.7.4
sphinx-rtd-theme==0.3.0
sphinxcontrib-phpdomain==0.4.1
doxyphp2sphinx>=1.0.1

Then these two sections of config.py:

extensions = [
  "sphinxcontrib.phpdomain"
]
html_theme = 'sphinx_rtd_theme'

Add this to the end of config.py

# PHP Syntax
from sphinx.highlighting import lexers
from pygments.lexers.web import PhpLexer
lexers["php"] = PhpLexer(startinline=True, linenos=1)
lexers["php-annotations"] = PhpLexer(startinline=True, linenos=1)

# Set domain
primary_domain = "php"

And drop this contents in _templates/breadcrumbs.html (explanation)

{%- extends "sphinx_rtd_theme/breadcrumbs.html" %}

{% block breadcrumbs_aside %}
{% endblock %}

Then finally re-install dependencies and re-build:

pip install -r requirements.txt
make html

The HTML output under _build will now appear as:

This setup gives us three things:

  • The documentation looks the same as Read the Docs.
  • We can use PHP snippets and class documentation.
  • There are no ‘Edit’ links, which is important because some of the files will be generated in the next steps.

doxyphp2sphinx

The doxyphp2sphinx tool will generate .rst files from the Doxygen XML files. This was installed from your requirements.txt in the last step, but you can also install it standalone via pip:

pip install doxyphp2sphinx

The only thing you need to specify is the name of the namespace that you are documenting, using :: as a separator.

doxyphp2sphinx FooCorp::Example

This command will read the xml/ subdirectory, and will create api.rst. It will fill the api/ directory with documentation for each class in the \FooCorp\Example namespace.

To verify that this has worked, check your class structure:

$ tree ../src
../src
├── Dooverwhacky.php
└── Widget.php

You should have documentation for each of these:

$ tree xml/ -P 'class*'
xml/
├── classFooCorp_1_1Example_1_1Dooverwhacky.xml
└── classFooCorp_1_1Example_1_1Widget.xml

And if you have the correct namespace name, you will have .rst files for each as well:

$ tree api
api
├── dooverwhacky.rst
└── widget.rst

Now, add a reference to api.rst somewhere in index.rst:

.. toctree::
   :maxdepth: 2
   :caption: API Documentation

   Classes <api.rst>

And re-compile:

make html

You can now navigate to your classes in the HTML documentation.

The quality of the generated class documentation can be improved by adding docstrings. An example of the generated documentation for a method is:

Writing documentation

As you add pages to your documentation, you can include PHP snippets and reference the usage of your classes.

.. code-block:: php

   <?php
   echo "Hello World"

Lorem ipsum dolor sit :class:`Dooverwhacky`, foo bar baz :meth:`Widget::getFeatureCount`.

This will create syntax highlighting for your examples and inline links to the generated API docs.

Beyond this, you will need to learn some reStructuredText. I found this reference to be useful.

Local docs build

A full build has these dependencies:

apt-get install doxygen make python-pip
pip install -r docs/requirements.txt

And these steps:

cd docs/
doxygen 
doxyphp2sphinx FooCorp::Example
make html

Cloud tools

Next, we will take this local build, and run it on a cloud setup instead, using these services.

  • GitHub
  • Read the docs

GitHub

I will assume that you know how to use Git and GitHub, and that you are creating code that is intended for re-use.

Add these files to your .gitignore:

docs/_build/
docs/warnings.log
docs/xml/
docs/html/

Upload the remainder of your repository to GitHub. The gfx-php project is an example of a working project with all of the correct files included.

To have the initial two build steps execute on Read the Docs, add this to the end of docs/conf.py. Don’t forget to update the namespace to match the command you were running locally.

# Regenerate API docs via doxygen + doxyphp2sphinx
import subprocess, os
read_the_docs_build = os.environ.get('READTHEDOCS', None) == 'True'
if read_the_docs_build:
    subprocess.call(['doxygen', 'Doxyfile'])
    subprocess.call(['doxyphp2sphinx', 'FooCorp::Example'])

Read the Docs

After all this setup, Read the Docs should be able to build the project. Create the project on Read the Docs by using the Import from GitHub option. There are only two settings which need to be set:

The requirements file location must be docs/requirements.txt:

And the programming language should be PHP.

After this, you can go ahead and run a build.

As a last step, you will want to ensure that you have a Webhook set up on GitHub to trigger the builds automatically in future.

Conclusion

It is emerging as a best practice for small libraries to host their documentation with Read the Docs, but this is not yet common in the PHP world. Larger PHP projects tend to self-host documentation on the project website, but smaller projects will often have no hosted API documentation.

Once you write your docs, publishing them should be easy! Hopefully this provides a good example of what’s possible.

Acknowledgements

Credit where credit is due: The Breathe project fills this niche for C++ developers using Doxygen, and has been around for some time. Breathe is in the early stages of adding PHP support, but is not yet ready at the time of writing.

How to get a notification when your site appears on HackerNews

A few weeks ago, somebody listed this article about GNU parallel to HackerNews, and I got a small wave of new visitors trawling my blog.

I don’t actively monitor referrers to this site, so I was oblivious to this until a few days afterward. Aware of the Slashdot Effect, I thought I should set up some free tools to remind me to log in and check the site’s health if it happens again:

Hacker News RSS feeds

Hacker news does not publish it’s own RSS feeds, so I had to use a third-party service. I found a URL that would feed me the feed to the latest articles off this site, by searching the “url” attribute:

https://hnrss.org/newest?q=mike42.me&search_attrs=url

This URL gives an RSS feed, as you might expect:

IFTTT

To save me installing and checking a local reader, I set up IFTTT to send me an email when new articles are published to this feed.

The “RSS feed to email” applet is perfect for this kind of consumer-grade automation.

I set it up with the URL, and well, nothing interesting happens. Only new articles are emailed, so this is expected.

Example email

Since I also use this IFTTT applet to get notifications for other RSS feeds, I do know that it works. Within an hour or two of a new article appearing in the feed, the applet gives you an email from RSS Feed via IFTTT <action@ifttt.com>:

It’s not exactly a real-time notification, but it’s a good start. At this point, I know when my posts are being linked from a specific high-traffic site, which is a good start.

For any site bigger than a personal blog, you might be interested in handling extra traffic rather than just be vaguely aware of it, but I’ll save that for a future post.

Quick statistics about documentation hosted on Read the Docs

Many open source projects host their online documentation with Read The Docs. Since I’m currently developing a tool for Read The Docs users, I recently took a look at the types of projects that are hosted there.

The data in this post is derived from the list of projects available in the public Read The Docs API, and has its limitations. In particular, spam, inactive, or abandoned projects are not filtered.

Totals

This data is current as of May 20, 2018. At that time, there were 90,129 total projects on readthedocs.io.

Starting out, I had assumed that the majority of projects had all four of these attributes:

  • Hosted in Git
  • Programmed in Python
  • Documented in English
  • Documentation generated with Sphinx

As it turned out, this particular combination is only used on 35.8% (32,218) of projects, so let’s take a look at how each of these vary.

Programming languages

The two main conclusions I drew by looking at the programming languages are:

  • Python is the largest developer community that is using the Read the Docs platform
  • A lot of projects are hosting documentation that is not tagged with a programming language
# % Projects Programming language
1 39.92% 35978 Not code (“just words”)
2 39.27% 35391 Python
3 9.27% 8354 (No language listed)
4 2.83% 2551 Javascript
5 2.29% 2060 PHP
6 1.15% 1040 C++
7 1.13% 1018 Java
8 0.85% 769 Other
9 0.77% 690 C#
10 0.71% 640 C
11 0.42% 380 CSS
12 0.34% 304 Go
13 0.31% 283 Ruby
14 0.31% 277 Julia
15 0.11% 96 Perl
16 0.06% 55 Objective C
17 0.06% 54 R
18 0.04% 36 TypeScript
19 0.03% 31 Scala
20 0.03% 30 Haskell
21 0.03% 29 Lua
22 0.02% 22 Swift
23 0.02% 15 CoffeeScript
24 0.02% 14 Visual Basic
25 0.01% 12 Groovy

Documentation languages

You might have guessed that English dominates in software documentation, but here is the data:

# % Projects Language
1 93.1% 83888 English (en)
2 2.4% 2178 Indonesian (id)
3 1.4% 1287 Chinese (zh, zh_TW, zh_CN)
4 0.5% 456 Russian (ru)
5 0.5% 425 Spanish (es)
6 0.5% 412 French (fr)
7 0.3% 254 Portuguese (pt)
8 0.3% 233 German (de)
9 0.2% 209 Italian (it)
10 0.2% 208 Japanese (jp)
0.6% 579 All other languages

In total, documentation on Read the Docs has been published in 54 languages. The “All other languages” item here represents 44 other languages.

Documentation build tool

The majority are using the Sphinx documentation generator. The table here counts sphinx, sphinx_htmldir and sphinx_singlehtml together.

# % Projects Language
1 92.5% 83364 Sphinx
2 6.9% 6229 MkDocs
3 0.6% 536 Auto-select

Version control

Git is the version control system used for the vast majority of projects.

# % Projects Version control
1 96.5% 86988 Git
2 2.1% 1879 Mercurial
3 0.9% 816 SVN
4 0.5% 446 Bazaar

Recovering text from a receipt with escpos-tools

I have written previously about how to generate receipts for printers which understand ESC/POS. Today, I thought I would write about the opposite process.

Unlike PostScript, the ESC/POS binary language is not commonly understood by software. I wrote a few utilities last year to help change that, called escpos-tools.

In this post, I’ll step through an example ESC/POS binary file that an escpos-tools user sent to me, and show how we can turn it back into a usable format. The tools we are using are:

You might need this sort of process if you need to email a copy of your receipts, or to archive them for audit use.

Printing the file

Binary print files are generated from drivers. I can feed this one back to my printer like this:

cat receipt.bin > /dev/usb/lp0

My Epson TM-T20 receipt printer understands ESC/POS, and prints this out:

Installing escpos-tools

escpos-tools is not packaged yet, so you need git and composer (from the PHP eco-system) to use it.

$ git clone https://github.com/receipt-print-hq/escpos-tools
$ cd escpos-tools
$ composer install

Inspecting the file

There is text in the file, so the first thing you should try to do is esc2text. In this case, which works like this:

$ php esc2text.php receipt.bin

In this case, I got no output, so I switch to -v to show the commands being found.

$ php esc2text.php receipt.bin  -v
[DEBUG] SetRelativeVerticalPrintPositionCmd 
[DEBUG] GraphicsDataCmd 
[DEBUG] GraphicsDataCmd 
[DEBUG] SetRelativeVerticalPrintPositionCmd 
...

This indicates that there is no text being sent to the receipt, only images. We know from the print-out that the images contain text, so we need a few more utilities.

Recovering images from the receipt

To extract the images, use escimages. It runs like this:

$ php escimages.php --file receipt.bin
[ Image 1: 576x56 ]
[ Image 2: 576x56 ]
[ Image 3: 576x56 ]
[ Image 4: 576x56 ]
[ Image 5: 576x56 ]
[ Image 6: 576x56 ]
[ Image 7: 576x56 ]
[ Image 8: 576x52 ]

This gave us 8 narrow images:

Using ImageMagick’s convert command, these can be combined into one image like this:

convert -append receipt-*.png -density 70 -units PixelsPerInch receipt.png

The result is now the same as what our printer would output:

Recovering text from the receipt

Lastly, tesseract is an open source OCR engine which can recover text from images. This image is a lossless copy of what we sent to the printer, which is an “easy” input for OCR.

$ tesseract receipt.png -
Estimating resolution as 279
Test Receipt for USB Printer 1

Mar 17, 2018
10:12 PM



Ticket: 01



Item $0,00

Total $0.00

This quality of output is fairly accurate for an untrained reader.

Conclusion

The escpos-tools family of utilities gives some visibility into the contents of ESC/POS binary files.

If you have a use case which requires working with this type of file, then I would encourage you to consider contributing code or example files to the project, so that the utilities can be improved over time.

Get the code

View on GitHub →