Benchmarking PHP code with PhpBench

This blog post is all about measuring the speed of PHP code through micro-benchmarks.

Why micro-benchmarks

I think of micro-benchmarks as a complement to tests. A well-written test gives a developer a definite pass or fail result, while a well-written micro-benchmark gives the developer an indication of the duration of an operation.

To put it another way, tests can help you find out whether the code is behaving correctly, while micro-benchmarks can help you to track whether your changes are making the code faster or slower.

Make sure that your code is doing real processing work before you spend time on this: algorithms such as sorting, compressing, or parsing are good candidates. My guess is that most of the code in a regular PHP application would not benefit significantly from having a suite of benchmarks, because web apps tend to be I/O bound (eg. database, network and disk access).

Introducing PhpBench

I have previously written a standalone test script every time I’ve needed to measure the speed of something in PHP. This works up to a point, but it gets quite hard to maintain as the number of benchmarks increases.

PhpBench is one of the available tools for running micro-benchmarks from PHP, and there are a few reasons why I think it’s worth a try:

  • it’s actively maintained
  • it installs with composer
  • it runs in a similar way to PHPUnit, so the benchmarks are fully specified in PHP code and run with a PHP tool.
  • it implements familiar concepts that you would find in benchmarking tools from other languages (eg. JMH from the Java world).

Installation

Start out with a blank composer project.

$ composer init

Add some auto-loading settings to composer.json so that Composer knows to find ExampleApp classes in the src folder.

{
    "name": "mike42/php-benchmark-examples",
    "description": "Example PHP benchmark project",
    "type": "project",
    "license": "MIT",
    "minimum-stability": "dev",
    "require": {},
    "require-dev": {},
    "autoload": {
        "psr-4": {
            "ExampleApp\\": "src"
        }
    }
}

At this point, install the phpbench dev version.

$ composer require --dev phpbench/phpbench:@dev
./composer.json has been updated
Loading composer repositories with package information
Updating dependencies (including require-dev)
Package operations: 22 installs, 0 updates, 0 removals
  - Installing symfony/process (4.4.x-dev 1a42849): Cloning 1a42849a7f from cache
  - Installing symfony/options-resolver (4.4.x-dev 94cbb72): Cloning 94cbb72bb9 from cache
  - Installing symfony/finder (4.4.x-dev 1b5ec12): Cloning 1b5ec12340 from cache
  - Installing symfony/polyfill-ctype (dev-master 82ebae0): Cloning 82ebae0220 from cache
  - Installing symfony/filesystem (4.4.x-dev 5914824): Cloning 59148241f7 from cache
  - Installing psr/log (dev-master c4421fc): Cloning c4421fcac1 from cache
  - Installing symfony/debug (4.4.x-dev 8278839): Cloning 8278839457 from cache
  - Installing symfony/service-contracts (dev-master 0c81a04): Cloning 0c81a04f68 from cache
  - Installing symfony/polyfill-php73 (dev-master d1fb4ab): Cloning d1fb4abcc0 from cache
  - Installing symfony/polyfill-mbstring (dev-master fe5e94c): Cloning fe5e94c604 from cache
  - Installing symfony/console (4.4.x-dev e2fe100): Cloning e2fe1002fd from cache
  - Installing lstrojny/functional-php (1.9.0): Loading from cache
  - Installing beberlei/assert (v3.x-dev ce139b6): Cloning ce139b6bf8 from cache
  - Installing seld/jsonlint (1.7.1): Loading from cache
  - Installing psr/container (dev-master 014d250): Cloning 014d250dae from cache
  - Installing phpbench/container (1.2): Loading from cache
  - Installing webmozart/assert (1.4.0): Loading from cache
  - Installing webmozart/path-util (dev-master 95a8f7a): Cloning 95a8f7ad15 from cache
  - Installing phpbench/dom (0.2.0): Loading from cache
  - Installing doctrine/lexer (dev-master ee614dd): Cloning ee614dd93a from cache
  - Installing doctrine/annotations (1.7.x-dev 3f35255): Cloning 3f35255290 from cache
  - Installing phpbench/phpbench (dev-master dccc67d): Cloning dccc67dd52 from cache
symfony/service-contracts suggests installing symfony/service-implementation
symfony/console suggests installing symfony/event-dispatcher
symfony/console suggests installing symfony/lock
Writing lock file
Generating autoload files

Writing a benchmark

Add some code to src/ExampleThing.php to measure.

<?php

namespace ExampleApp;

class ExampleThing {
  /**
   * Inefficiently multiply numbers together
   */
  public function multiply(int $x, int $y) : int {
    $ret = 0;
    for($i = 0; $i < $x; $i++) {
      for($j = 0; $j < $y; $j++) {
        $ret++;
      }
    }
    return $ret;
  }
}

Write a benchmark for the code in benchmarks/ExampleThingBenchmark.php.

<?php

use ExampleApp\ExampleThing;

/**
 * @BeforeMethods({"init"})
 * @Revs(1000)
 * @Iterations(5)
 */
class ExampleThingBenchmark {

    private static $exampleThing;

    public function init()
    {
        self::$exampleThing = new ExampleThing();
    }

    /**
     * @Subject
     */
    public function doMultiply()
    {
        self::$exampleThing->multiply(100, 100);
    }
}
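
PhpBench also supports parameterised benchmarks, one of the concepts it shares with JMH. As a sketch (this assumes the @ParamProviders annotation; check the docs for your PhpBench version), you could add a subject like this to the class above, and it would be run once per parameter set:

    /**
     * @Subject
     * @ParamProviders({"provideSizes"})
     */
    public function doMultiplyParams($params)
    {
        self::$exampleThing->multiply($params['x'], $params['y']);
    }

    public function provideSizes()
    {
        return [
            ['x' => 10, 'y' => 10],
            ['x' => 100, 'y' => 100],
        ];
    }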

Before you run anything, you will also need a configuration file. Add this minimal configuration to phpbench.json.dist.

{
    "php_disable_ini": true,
    "bootstrap": "vendor/autoload.php",
    "path": "benchmark",
    "php_config": {
        "extension": [ "json.so" ]
    },
    "time_unit": "milliseconds"
}

Running the benchmark

The most basic way to run all benchmarks at once is phpbench run, which looks like this:

$ php vendor/bin/phpbench run
PhpBench @git_tag@. Running benchmarks.
Using configuration file: /home/mike/workspace/blog/phpbench/php-benchmarks/phpbench.json.dist

\ExampleThingBenchmark

    doMultiply..............................I4 [μ Mo]/r: 0.232 0.232 (ms) [μSD μRSD]/r: 0.000ms 0.10%

1 subjects, 5 iterations, 1,000 revs, 0 rejects, 0 failures, 0 warnings
(best [mean mode] worst) = 0.232 [0.232 0.232] 0.233 (ms)
⅀T: 1.162ms μSD/r 0.000ms μRSD/r: 0.100%

As you might expect, there are options to run specific benchmarks, or to present the results in different ways. However, there are some PhpBench-specific things which you will need to navigate to use it successfully.

Firstly, there are two different output settings to consider:

  • report options allow you to decide which data to print in a table.
  • output options allow you to decide how to format it for output.

I also found it useful to add --progress=none to suppress the text displayed by default, since you will otherwise get broken HTML output.

Lastly, be aware that, confusingly, the default report is not active by default. Specify it to get a table listing out each iteration.

$ php vendor/bin/phpbench run --progress=none --report=default
suite: 13415431dc0db41c12decee0fa83c26d1db0f678, date: 2019-05-31, stime: 22:01:34
+-----------------------+------------+-----+------+------+----------+-----------+--------------+----------------+
| benchmark             | subject    | set | revs | iter | mem_peak | time_rev  | comp_z_value | comp_deviation |
+-----------------------+------------+-----+------+------+----------+-----------+--------------+----------------+
| ExampleThingBenchmark | doMultiply | 0   | 1000 | 0    | 915,736b | 236.716μs | +1.68σ       | +1.08%         |
| ExampleThingBenchmark | doMultiply | 0   | 1000 | 1    | 915,736b | 231.999μs | -1.45σ       | -0.93%         |
| ExampleThingBenchmark | doMultiply | 0   | 1000 | 2    | 915,736b | 233.960μs | -0.15σ       | -0.1%          |
| ExampleThingBenchmark | doMultiply | 0   | 1000 | 3    | 915,736b | 233.878μs | -0.2σ        | -0.13%         |
| ExampleThingBenchmark | doMultiply | 0   | 1000 | 4    | 915,736b | 234.361μs | +0.12σ       | +0.08%         |
+-----------------------+------------+-----+------+------+----------+-----------+--------------+----------------+

The type of report that I found most useful is aggregate, since it summarises all iterations of each subject in a single row.

$ php vendor/bin/phpbench run --progress=none --report=aggregate
suite: 1341543c02ebee4031d165665c2724c379bf9c98, date: 2019-05-31, stime: 22:01:18
+-----------------------+------------+-----+------+-----+----------+-----------+-----------+-----------+-----------+---------+--------+-------+
| benchmark             | subject    | set | revs | its | mem_peak | best      | mean      | mode      | worst     | stdev   | rstdev | diff  |
+-----------------------+------------+-----+------+-----+----------+-----------+-----------+-----------+-----------+---------+--------+-------+
| ExampleThingBenchmark | doMultiply | 0   | 1000 | 5   | 915,736b | 231.907μs | 232.614μs | 232.089μs | 233.538μs | 0.713μs | 0.31%  | 1.00x |
+-----------------------+------------+-----+------+-----+----------+-----------+-----------+-----------+-----------+---------+--------+-------+

Both of the examples above use console output, which is the default.

HTML reports

To capture PhpBench output from a CI environment, you can generate an HTML report as well. Add an outputs section to phpbench.json.dist.

{
    ...
    "outputs": {
         "html_file": {
             "extends": "html",
             "file": "benchmarks.html",
             "title": "Example benchmark report"
         }
    }
}

This new html_file output can be used with --output=html_file:

$ php vendor/bin/phpbench run \
    --progress=none --report=aggregate \
    --output=console --output=html_file

From Jenkins

This is a Jenkinsfile that I use to run benchmarks. You need to install the “HTML Publisher” and “AnsiColor” Jenkins plugins.

pipeline {
  agent any

  stages {
    stage('Install') {
      steps {
        ansiColor('xterm') {
          sh 'composer install'
        }
      }
    }
    stage('Run benchmarks') {
      steps {
        ansiColor('xterm') {
          sh 'php vendor/bin/phpbench run --progress=none --report aggregate --output=console --output=html_file --ansi'
          publishHTML([
            allowMissing: false,
            alwaysLinkToLastBuild: true,
            keepAll: true,
            reportDir: '',
            reportFiles: 'benchmarks.html',
            reportName:
            'phpbench report',
            reportTitles: ''
          ])       
        }
      }
    }
  }
}

Each build is then benchmarked, and the results are available as a link on the left.

The console output of the build shows the results.

To actually view the HTML report you will need to apply this tweak.

Next steps

I’ve started to add phpbench micro-benchmarks to some PHP projects that I work on, and I’m finding it a lot more useful than standalone test scripts.

The next step would be to integrate micro-benchmarks better into the development process. For example, I can already review code coverage improvements/regressions of a change on GitHub using Coveralls, which posts a comment to each pull request. This saves you from merging changes which don’t meet your project’s standards.

As far as I can tell, you would need to write custom code to get more expressive phpbench reports along these lines, such as:

  • the history of a benchmark over time, or
  • the most-changed benchmarks against a baseline

phpbench does provide the building blocks, though, since you can set some options to archive the results of each run to a folder.
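
For example, recent versions can persist each run for later comparison. A minimal sketch of that workflow, assuming the --store flag and log command (these may differ between PhpBench versions):

$ php vendor/bin/phpbench run --store
$ php vendor/bin/phpbench log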

How to integrate Gitea and Jenkins

Gitea is a web app for hosting Git repositories. It’s open source, and very simple to get running. With some extra setup, it can also trigger Jenkins builds, and display the Jenkins build status of each commit once it has been built.

Because the documentation for the Jenkins plugin is very minimalist, I decided to write about it for future reference.

About this setup

I installed Jenkins and Gitea on the same Debian 9 server on the LAN. They communicate only over HTTP, so they could just as easily be installed separately.

To make the configuration clear, I’ve used jenkins.example.com in URLs which refer to Jenkins, and gitea.example.com in URLs which refer to Gitea.

Gitea installation

This command will install and start the linux-amd64 version of Gitea as the user “git”.

useradd git -r --create-home && \
  mkdir /opt/gitea && chown -R git: /opt/gitea && \
  wget -O /opt/gitea/gitea https://dl.gitea.io/gitea/1.7.0/gitea-1.7.0-linux-amd64 && \
  chmod +x /opt/gitea/gitea && \
  sudo -u git bash -c "cd /opt/gitea && ./gitea web"

Shut it down, and configure some paths at /opt/gitea/custom/conf/app.ini. These will depend on your environment.

SSH_DOMAIN       = gitea.example.com
DOMAIN           = gitea.example.com
HTTP_PORT        = 3000
ROOT_URL         = http://gitea.example.com:3000/

Start it back up as a systemd service at this point, by creating /etc/systemd/system/gitea.service with this content:

[Unit]
Description=gitea
After=network.target

[Service]
ExecStart=/opt/gitea/gitea web
WorkingDirectory=/opt/gitea
User=git
Type=simple

[Install]
WantedBy=multi-user.target

Once this is saved, start the service.

systemctl daemon-reload
systemctl start gitea

Optionally, also configure it to start on boot.

systemctl enable gitea

Jenkins installation

I installed Jenkins from the official Debian repo at jenkins-ci.org, and clicked through the initial install.

Add plugin

Open up Manage Jenkins → Manage Plugins. Navigate to Available and check the Gitea plugin.

Next, install the plugin and restart Jenkins.

The configuration for the plugin is located under Manage Jenkins → Configure System.

At this point you will want to tell Jenkins where to find your Gitea server.

I don’t suggest choosing Manage hooks, because it uses the same account to manage hooks across all repos, which would violate the principle of least privilege.

Set up a project in Gitea

In Gitea, create an organization, then a repository under it.

Register an account in Gitea for Jenkins to use for this project.

Log out, log back in as yourself, and add Jenkins as a collaborator to the repo, with Write access.

This is the only permission you need for public repositories. If you plan to lock down your Gitea organization later, then you will also need to give this Jenkins account Read access at the organization level.

Set up a project in Jenkins

Add a new Gitea Organization Jenkins job.

Enter the name of the organization, and the account to log in with.

Add the details for the new account, and make sure it’s selected.

The other options don’t need to be changed at this stage.

When you press ‘save’, Jenkins will immediately attempt to find any repositories in the Gitea organization, and kick off any builds. This is unlikely to work the first time unless everything is exactly right, so pay attention to the error logs.

These three places will show what’s happening:

  • Scan Gitea Organization Log, which lists repositories in the organization.
  • Scan Multibranch Pipeline Log for each repository, which shows the discovery of branches.
  • Console output for each build, which will show errors if the build status could not be submitted.

Problems which I’ve found here include:

  • The URL of Gitea in the Jenkins configuration must match the URL to Gitea in its own configuration.
  • The Jenkins user account must have permission to list repositories, clone, and update statuses.
  • Empty repositories, and repositories without a ‘Jenkinsfile’ are ignored.

For that last item, here is a minimal Jenkinsfile that you can put in your repository to test this integration:

pipeline {
    agent any

    stages {
        stage('Do nothing') {
            steps {
                sh '/bin/true'
            }
        }
    }
}

Once this is sorted out, you can expect to see your repository in Jenkins.

Every branch with a Jenkinsfile will appear.

And each time a commit is mentioned in Gitea, it will display a small icon to indicate the build status.

Set up a web hook in Gitea

At this point, builds need to be manually triggered. To trigger them each time the repository changes, we need to get a notification out to Jenkins.

Under the repository settings, click Webhooks → Add webhook → Gitea.

The correct values to use are:

  • URL: http://[ your jenkins server ]/gitea-webhook/post
  • POST Content Type: application/json

Once you press Add Webhook, it will appear in the list with a small grey dot, indicating that it hasn’t been run before.

If you click edit, then the Test Delivery button can be used to check that it’s working.

The icon indicates the status. If things aren’t working correctly, then click the delivery UUID to expand the full request information, which should help with debugging.

Final result

With Jenkins and Gitea, you have a simple self-hosted continuous integration environment.

In Gitea, you can store, update and review your code. Any build and test steps in a Jenkinsfile will be run automatically each time the repository changes.

The detailed output for each build is visible in Jenkins, where you can track build results with a variety of plugins.

Handling I/O errors in PHP

This blog post is all about how to handle errors from the PHP file_get_contents function, and others which work like it.

The file_get_contents function will read the contents of a file into a string. For example:

<?php
$text = file_get_contents("hello.txt");
echo $text;

You can try this out on the command-line like so:

$ echo "hello" > hello.txt
$ php test.php 
hello

This function is widely used, but I’ve observed that error handling around it is often not quite right. I’ve fixed a few bugs involving incorrect I/O error handling recently, so here are my thoughts on how it should be done.

How file_get_contents fails

For legacy reasons, this function does not throw an exception when something goes wrong. Instead, it will both log a warning, and return false.

<?php
$filename = "not-a-real-file.txt";
$text = file_get_contents($filename);
echo $text;

Which looks like this when you run it:

$ php test.php 
PHP Warning:  file_get_contents(not-a-real-file.txt): failed to open stream: No such file or directory in test.php on line 3
PHP Stack trace:
PHP   1. {main}() test.php:0
PHP   2. file_get_contents() test.php:3

Warnings are not very useful on their own, because the code will continue on without the correct data.

Error handling in four steps

If anything goes wrong when you are reading a file, your code should be throwing some type of Exception which describes the problem. This allows developers to put a try {} catch {} around it, and avoids nasty surprises where invalid data is used later.
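
For example, a caller can then decide how to recover. The loadTextFile function here is a hypothetical wrapper around file_get_contents which applies the steps below:

<?php
try {
    $text = loadTextFile("hello.txt"); // hypothetical wrapper, built below
    echo $text;
} catch (Exception $e) {
    // Log it, show an error page, or fall back to a default -- but never
    // continue as if the file had been read.
    error_log($e->getMessage());
}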

Step 1: Detect that the file was not read

Any call to file_get_contents should be immediately followed by a check for that false return value. This is how you know that there is a problem.

<?php
$filename = "not-a-real-file.txt";
$text = file_get_contents($filename);
if($text === false) {
  throw new Exception("File was not loaded");
}
echo $text;

This now gives both a warning and an uncaught exception:

$ php test.php 
PHP Warning:  file_get_contents(not-a-real-file.txt): failed to open stream: No such file or directory in test.php on line 3
PHP Stack trace:
PHP   1. {main}() test.php:0
PHP   2. file_get_contents() test.php:3
PHP Fatal error:  Uncaught Exception: File was not loaded in test.php:5
Stack trace:
#0 {main}
  thrown in test.php on line 5

Step 2: Suppress the warning

Warnings are usually harmless, but there are several good reasons to suppress them:

  • It ensures that you are not depending on a global error handler (or the absence of one) for correct behaviour.
  • The warning might appear in the middle of the output, depending on php.ini.
  • Warnings can produce a lot of noise in the logs.

Use @ to silence any warnings from a function call.

<?php
$filename = "not-a-real-file.txt";
$text = @file_get_contents($filename);
if($text === false) {
  throw new Exception("File was not loaded");
}
echo $text;

The output is now only the uncaught Exception:

$ php test.php 
PHP Fatal error:  Uncaught Exception: File was not loaded in test.php:5
Stack trace:
#0 {main}
  thrown in test.php on line 5

Step 3: Get the reason for the failure

Unfortunately, we have lost the “No such file or directory” message, which is important information that should go into the Exception. It can be retrieved with the old-style error_get_last function.

This function might just return empty data, so you should check that everything is set and non-empty before you try to use it.

<?php
$filename = "not-a-real-file.txt";
error_clear_last();
$text = @file_get_contents($filename);
if($text === false) {
  $e = error_get_last();
  $error = (isset($e) && isset($e['message']) && $e['message'] != "") ?
      $e['message'] : "Check that the file exists and can be read.";
  throw new Exception("File '$filename' was not loaded. $error");
}
echo $text;

This now embeds the failure reason directly in the message.

$ php test.php 
PHP Fatal error:  Uncaught Exception: File 'not-a-real-file.txt' was not loaded. file_get_contents(not-a-real-file.txt): failed to open stream: No such file or directory in test.php:9
Stack trace:
#0 {main}
  thrown in test.php on line 9

Step 4: Add a fallback

The last time I introduced error_clear_last()/error_get_last() into a code-base, I found that HHVM does not have these functions.

Call to undefined function error_clear_last()

The fix for this is to write some wrapper code, which verifies that each function exists before calling it.

<?php
$filename = 'not-a-real-file';
clearLastError();
$text = @file_get_contents($filename);
if ($text === false) {
    $error = getLastErrorOrDefault("Check that the file exists and can be read.");
    throw new \Exception("Could not retrieve image data from '$filename'. $error");
}
echo $text;

/**
 * Call error_clear_last() if it exists. This is dependent on which PHP runtime is used.
 */
function clearLastError()
{
    if (function_exists('error_clear_last')) {
        error_clear_last();
    }
}
/**
 * Retrieve the message from error_get_last() if possible. This is very useful for debugging, but it will not
 * always exist or return anything useful.
 */
function getLastErrorOrDefault(string $default)
{
    if (function_exists('error_get_last')) {
        $e = error_get_last();
        if (isset($e) && isset($e['message']) && $e['message'] != "") {
            return $e['message'];
        }
    }
    return $default;
}

This does the same thing as before, but without breaking other PHP runtimes.

$ php test.php 
PHP Fatal error:  Uncaught Exception: Could not retrieve image data from 'not-a-real-file'. file_get_contents(not-a-real-file): failed to open stream: No such file or directory in test.php:7
Stack trace:
#0 {main}
  thrown in test.php on line 7

Since HHVM is dropping support for PHP, I expect that this last step will soon become unnecessary.

How not to handle errors

Some applications put a series of checks before each I/O operation, and then simply perform the operation with no error checking. An example of this would be:

<?php
$filename = "not-a-real-file.txt";
// Check everything!
if(!file_exists($filename)) {
  throw new Exception("$filename does not exist");
}
if(!is_file($filename)) {
  throw new Exception("$filename is not a file");
}
if(!is_readable($filename)) {
  throw new Exception("$filename cannot be read");
}
// Assume that nothing can possibly go wrong..
$text = @file_get_contents($filename);
echo $text;

You could probably make a reasonable-sounding argument that these checks are a good idea, but I consider them to be misguided:

  • If you skip any actual error handling, then your code is going to fail in more surprising ways when you encounter an I/O problem that could not be detected.
  • If you do perform correct error handling as well, then the extra checks add nothing other than more branches to test.

Lastly, beware of false positives. For example, the above snippet will reject HTTP URLs, which are perfectly valid for file_get_contents.
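
You can demonstrate that false positive directly. Assuming network access, this snippet prints bool(false) for a URL which file_get_contents can read without any trouble:

<?php
// The HTTP stream wrapper does not support stat(), so file_exists()
// reports false even though the URL is readable.
var_dump(file_exists("https://example.com/"));
echo file_get_contents("https://example.com/");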

Conclusion

Most PHP code now uses try/catch/finally blocks to handle problems, but the ecosystem really values backwards compatibility, so existing functions are rarely changed.

The style of error reporting used in these I/O functions is by now a legacy quirk, and should be wrapped to consistently throw a useful Exception.

The top 100 Ansible modules

Ansible is an automation system which is widely used for deployments and configuration. It contains a dizzying array of modules for interfacing with things like files, services, package managers, and various pieces of software and hardware.

Background

There are some modules which I would expect every Ansible user to be familiar with, while others are barely used at all. I could have a guess about which modules I use most, but what about the wider community?

I couldn’t find any existing data, so I set out to determine which Ansible modules were most widely-used. The best source for this is Ansible Galaxy, which is a directory of around 18,000 Ansible roles. Using some scripts, I sifted through the 18,259 publicly accessible roles from Ansible Galaxy, found 71,610 files to look through, and tallied up which modules were being used by each of the 317,757 tasks in those files.
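
The counting itself is not complicated. This rough sketch shows the approach in PHP (it is not the exact script I used; it assumes the php-yaml extension, roles checked out under roles/, and it skips real-world complications such as blocks and free-form task syntax):

<?php
// The module name is the one key in each task which is not a general
// task keyword. This list of keywords is not exhaustive.
$taskKeywords = ['name', 'when', 'register', 'tags', 'become', 'become_user',
    'with_items', 'loop', 'vars', 'notify', 'ignore_errors', 'changed_when',
    'failed_when', 'delegate_to', 'environment', 'args', 'no_log', 'until',
    'retries', 'delay', 'run_once'];
$counts = [];
foreach (glob('roles/*/tasks/*.yml') as $filename) {
    $tasks = @yaml_parse_file($filename);
    if (!is_array($tasks)) {
        continue; // empty or unparseable file
    }
    foreach ($tasks as $task) {
        if (!is_array($task)) {
            continue;
        }
        foreach (array_keys($task) as $key) {
            if (!in_array($key, $taskKeywords, true)) {
                $counts[$key] = ($counts[$key] ?? 0) + 1;
                break; // count one module per task
            }
        }
    }
}
arsort($counts);
foreach (array_slice($counts, 0, 100, true) as $module => $uses) {
    echo "$module: $uses\n";
}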

The list

These are the top 100 modules as of January 2019. I used the same method to build such a list in January 2018, so you can also see the change in popularity of some modules over the last 12 months.

Rank  Module  Uses  Rank +/- (vs. Jan 2018)
1 file 26,224 +1
2 include 25,267 -1
3 template 24,062
4 command 23,952
5 service 19,436
6 shell 19,401
7 set_fact 17,079 +1
8 apt 16,006 +1
9 lineinfile 12,432 -2
10 copy 8,794 +1
11 yum 8,733 -1
12 assert 7,504 +7
13 include_tasks 6,686 +15
14 stat 5,922 -2
15 package 5,422 -1
16 get_url 5,323 -3
17 debug 5,147 -2
18 import_tasks 5,016 +20
19 include_vars 4,251 -2
20 apt_repository 3,789 -4
21 user 3,233 -3
22 fail 2,812 +2
23 unarchive 2,734 -2
24 apt_key 2,665 -4
25 pip 2,474
26 systemd 2,389 +8
27 action 2,114 -4
28 git 2,043 -1
29 uri 1,680 +7
30 group 1,665 -1
31 sysctl 1,331 -5
32 raw 1,323 -2
33 mysql_user 1,266
34 meta 1,204 +3
35 replace 1,125 -3
36 ini_file 1,125 -5
37 find 1,030 -15
38 local_action 986 -3
39 mysql_db 959
40 cron 925
41 wait_for 898
42 rpm_key 771
43 include_role 742 +15
44 yum_repository 723 +2
45 mount 669 -2
46 blockinfile 619 +3
47 firewalld 579 +1
48 ufw 563 -1
49 authorized_key 557 -4
50 docker_container 488 +2
51 dnf 455 +5
52 seboolean 417 -8
53 homebrew 409 -2
54 fetch 378 +1
55 npm 367 -5
56 osx_defaults 363 +8
57 postgresql_user 350 -4
58 pkgng 345 -4
59 pause 340 +4
60 script 314 +5
61 setup 290 +7
62 postgresql_db 289 -2
63 mysql_replication 286 -1
64 win_regedit 281 +6
65 pacman 257 -4
66 debconf 256 +1
67 slurp 255 +10
68 gem 253 -11
69 iptables 240 +13
70 apache2_module 231 -4
71 synchronize 213 -2
72 docker 213 -13
73 alternatives 210
74 selinux 202
75 oc_obj 199 +98
76 make 194 -1
77 win_shell 191 +13
78 modprobe 187 +11
79 hostname 181 -7
80 zypper 174 -4
81 xml 160
82 supervisorctl 160 -11
83 win_file 149 +15
84 homebrew_cask 149 -6
85 add_host 144 -2
86 rabbitmq_user 129 -7
87 pamd 124 +81
88 win_command 116 +13
89 assemble 116 -9
90 htpasswd 115 -6
91 apk 112 -5
92 openbsd_pkg 111 -7
93 win_get_url 109 +14
94 win_chocolatey 109
95 docker_image 109 +4
96 tempfile 106 +25
97 locale_gen 105 -5
98 easy_install 97 -10
99 django_manage 97 -3
100 composer 96 -3

Out of over 2,500 modules mentioned in Ansible Galaxy, it turns out that only a few dozen are widely used. Following this, there is a long tail of infrequently-used and custom modules.

How many modules do I need to know?

You can write 80% of the roles in Ansible Galaxy using only the 74 most popular Ansible modules.

For writing an individual Ansible role, you don’t need anywhere near this many. The median number of modules used by a role in Ansible Galaxy is just 6.

There are a few reasons why you won’t need to use many of the popular modules at the same time:

  • Different technologies: eg. mysql_db and postgresql_db set up different databases, and would be invoked by different roles if you need both.
  • Obsolete modules: eg. docker has been split up, and new code will use docker_container instead.
  • Different levels of abstraction: eg. cross-platform developers will use the package module, while developers who know their platform may directly use the apt, dnf and yum modules. You are unlikely to see these mixed.
  • Different styles: include_tasks vs. import_tasks.

Some notes

Here are some things to keep in mind when reading the table.

  • I skipped any role that was not hosted in a public GitHub repository, or could not be parsed. A php-yaml bug also kept a few files out of the 2019 tally.
  • Each task was counted only once, so local_action and action tasks obscure which module was eventually executed.
  • The OpenShift oc_obj module on this list is not bundled with Ansible (the module itself is distributed through Galaxy).
  • I did not attempt to exclude abandoned or incomplete packages. Eg. the docker module on this list has been removed from the latest version of Ansible, but it remains popular in existing Ansible Galaxy roles.

Thermal Sans Mono: A new bitmap font for receipt printers

Today I’m writing about a new project that I’ve started to create a font, called “Thermal Sans Mono”. I’m hoping to make this into a set of specialized free bitmap fonts for use on thermal receipt printers and printer emulators.

Here is what it looks like so far:

In this blog post I’ll talk a bit about why I needed a new font, how it’s derived from GNU Unifont, and what I’m planning to do with it.

Background

Firstly, I am solving a technical problem, not a design problem.

Printing non-ASCII characters on receipts can get a bit complex, especially if you have a lot of different printers to support. Each thermal receipt printer has a different set of available “code pages” of glyphs, which makes localisation a real adventure.

Using the escpos-printer-db project, we can currently create software which accepts UTF-8 text, and switches between the available code pages to render text for a specific printer. I’m now looking at how we can encode characters that are not in any available code page.

For example, in this code from the python-escpos library, the ? substitution character still has to be printed for some characters:

encoding = self.encoder.find_suitable_encoding(text[0])
if not encoding:
    self._handle_character_failed(text[0])

Rather than sending a ? character, I think we could retrieve a glyph from a suitable font and send it as a bitmap. I previously wrote about a technique to cache glyphs from GNU Unifont in printer memory, but these glyphs don’t have the correct size or weight to be displayed alongside the glyphs that the printer has rendered itself.

Of course we can’t do this yet, because there is no suitable bitmap font.
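
Once such a font exists, the plumbing is straightforward. Here is a sketch of the fallback using escpos-php; the glyphs/ directory of pre-rendered PNG glyphs is hypothetical:

<?php
require __DIR__ . '/vendor/autoload.php';

use Mike42\Escpos\EscposImage;
use Mike42\Escpos\Printer;
use Mike42\Escpos\PrintConnectors\FilePrintConnector;

$printer = new Printer(new FilePrintConnector("/dev/usb/lp0"));

// ř (U+0159) is missing from many printer code pages, so send a
// pre-rendered 12x24 glyph as a bitmap rather than printing '?'.
$glyph = EscposImage::load(sprintf("glyphs/%04X.png", 0x0159));
$printer->bitImage($glyph);
$printer->cut();
$printer->close();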

How it’s made

The closest font to what we need is GNU Unifont, which is thankfully freely licensed. A Unifont glyph is typically 8×16 with a thin line and a lot of whitespace, while I need to print in 12×24, with thick lines and no whitespace.

To control as many variables as possible, I decided to pre-render a transformed version of GNU Unifont. This involved tracing the original glyph and re-drawing it with new parameters.

The first step is to use a purpose-built algorithm to trace the lines from the glyph:

The points are then re-drawn onto a new canvas with a thicker outline, and a lot less whitespace.

When this canvas is scaled back to 12×24, we have our glyph.

This glyph is now the correct size and weight for use on most thermal receipt printers.

Roadmap

Currently, I have only processed the ASCII characters. Since this font is a GNU Unifont derivative, we should (with some effort) be able to include many more glyphs in the future.

As a starting goal, I am aiming to make the font complete enough to include as a fallback font in the escpos-php printing library.

Download

I’ve made my initial work on “Thermal Sans Mono” available as a bitmap font in PCF and BDF formats on GitHub. You can find these in the “Releases” section of the GitHub project.

View on GitHub →

As always, bugs, comments and suggestions are appreciated.

Three ways to archive a website

I recently needed to archive a small website before decommissioning it. There are a few distinct reasons you might want an archive of a website:

  • To archive the information in case you need it later.
  • To archive the look and feel so that you can see how it has progressed.
  • To archive the digital artifacts so that you can host them elsewhere as a mirror.

Each of these produces files in a different format, which are useful over different time-periods. In this post, I’ll write a bit about all three, since it’s easiest to archive a website while it is still online.

1. Saving webpage content to PDF

To write an individual page to a PDF, you can use wkhtmltopdf. On Debian/Ubuntu, this can be installed with:

sudo apt-get install wkhtmltopdf

The only extra setting I use for this is the “Javascript delay”, since some parts of the page will be loaded after the main content.

mkdir -p pdf/
wkhtmltopdf --javascript-delay 1000 https://example.com/ pdf/index.pdf

This produces a PDF, which you can copy/paste text from, or print.

You then simply repeat this for every page which you want to archive.

2. Saving webpage content to an image

If you are more interested in how the website looked, rather than what it contained, then you can use the same package to write it to an image. I used the jpg format here, because the file sizes were reasonable at higher resolution. I also zoomed the page to 200% for higher quality, and selected widths which are typical of desktop, tablet and mobile screen sizes.

mkdir -p jpg/desktop jpg/mobile jpg/tablet
wkhtmltoimage --zoom 2.0 --javascript-delay 1000 --width 4380 https://example.com/ jpg/desktop/index.jpg
wkhtmltoimage --zoom 2.0 --javascript-delay 1000 --width 2048 https://example.com/ jpg/tablet/index.jpg
wkhtmltoimage --zoom 2.0 --javascript-delay 1000 --width 960 https://example.com/ jpg/mobile/index.jpg

This gives you three images for the page. This example page is quite short, but a larger page produces a very tall image.

As above, this needs to be repeated for each page which you want to archive.

3. Mirroring the entire site as HTML

A full mirror of the site is a good short-term archive. Some websites have a lot of embedded external content like maps and social media widgets, which I would expect to gradually stop working over time as these services change. Still, you may be able to browse the website on your local computer in 10 or 20 years’ time, depending on how browsers change.

wget is the go-to tool for mirroring sites, but it has a lot of options!

mkdir -p html/
cd html
wget \
  --trust-server-names \
  -e robots=off \
  --mirror \
  --convert-links \
  --adjust-extension \
  --page-requisites \
  --no-parent \
  https://example.com

There are quite a few options here, so I’ll briefly explain why I used each one:

  • --trust-server-names: Use the correct filename when a redirect is used.
  • -e robots=off: Ignore robots.txt. This is only OK to do if you own the site and can be sure that mirroring it will not cause capacity issues.
  • --mirror: Shorthand for several options which recursively download the site.
  • --convert-links: Change links on the mirrored pages to point at the local copies.
  • --adjust-extension: If you get a page called “foo”, save it as “foo.html”.
  • --page-requisites: Also download the CSS and JavaScript files referenced on the page.
  • --no-parent: Only download sub-pages of the starting page. This is useful if you want to fetch only part of the domain.

The result can be opened locally in a web browser:

These options worked well for me on a WordPress site.

Putting it all together

The site I was mirroring was quite small, so I manually assembled a list of pages to mirror, gave each a name, and wrote them in a text file called urls.txt in this format:

https://site.example/ index
https://site.example/foo foo
https://site.example/bar bar

I then ran this script to mirror each URL as an image and PDF, before mirroring the entire site locally in HTML.

#!/bin/bash
set -exu -o pipefail

mkdir -p jpg/desktop jpg/mobile jpg/tablet html/ pdf
while read page_url page_name; do
  echo "## $page_url ($page_name)"
  # JPEG archive
  wkhtmltoimage --zoom 2.0 --javascript-delay 1000 --width 4380 $page_url jpg/desktop/$page_name.jpg
  wkhtmltoimage --zoom 2.0 --javascript-delay 1000 --width 2048 $page_url jpg/tablet/$page_name.jpg
  wkhtmltoimage --zoom 2.0 --javascript-delay 1000 --width 960 $page_url jpg/mobile/$page_name.jpg
  # Printable archive
  wkhtmltopdf --javascript-delay 1000 $page_url pdf/$page_name.pdf
done < urls.txt

# Browsable archive
MAIN_PAGE=$(head -n1 urls.txt | cut -d' ' -f1)
mkdir -p html/
(cd html && \
  wget --trust-server-names -e robots=off --mirror --convert-links --adjust-extension --page-requisites --no-parent $MAIN_PAGE)

The actual domain example.com has only one page, so after running the script against it, you end up with this set of files:

├── archive.sh
├── html
│   └── example.com
│       └── index.html
├── jpg
│   ├── desktop
│   │   └── index.jpg
│   ├── mobile
│   │   └── index.jpg
│   └── tablet
│       └── index.jpg
├── pdf
│   └── index.pdf
└── urls.txt

Happy archiving!

Monitoring network throughput with Prometheus

Today I’m writing a bit about a Prometheus deployment that I made last year on a Raspberry Pi, to get better data about congestion on my uplink to the Internet.

The problem

You have probably run an Internet speed test before, like this:

[Screenshot: Internet speed test result]

A speed test will tell you how slow your computer’s connection is, but it can’t narrow down whether it’s because of other LAN users, the line quality, or congestion at the provider.

You can start to assemble this information from the router, which has counters for each network interface:

[Screenshot: network interface counters from the router]

This table is from a Sagemcom F@ST 3864, which is a consumer-grade router. It has no SNMP interface, so the only way to get these metrics is to query /statsifc.html and /info.html from the LAN.

Getting the data

We can derive throughput metrics for the uplink if we scrape these counters every few seconds and load them into a time-series database. To do this, I wrote a small adapter (called an “exporter” in Prometheus lingo), which exposed the metrics in a more structured way.
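
The exporter itself only needs a few lines of Python with the prometheus_client library. In this sketch, scrape_router is a hypothetical stand-in for the code which fetches and parses the pages from the router:

#!/usr/bin/env python
import time
from prometheus_client import Gauge, start_http_server

RECEIVE_BYTES = Gauge('lan_network_receive_bytes',
                      'Received bytes for network interface', ['device'])
SEND_BYTES = Gauge('lan_network_send_bytes',
                   'Sent bytes for network interface', ['device'])

def scrape_router():
    """Hypothetical: fetch and parse /statsifc.html, returning
    (device, rx_bytes, tx_bytes) tuples."""
    raise NotImplementedError

if __name__ == '__main__':
    start_http_server(8000)  # matches the scrape target configured below
    while True:
        for device, rx, tx in scrape_router():
            RECEIVE_BYTES.labels(device=device).set(rx)
            SEND_BYTES.labels(device=device).set(tx)
        time.sleep(5)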

The result was a web page on the Raspberry Pi, which returned interface data like this:

# HELP lan_network_receive_bytes Received bytes for network interface
# TYPE lan_network_receive_bytes gauge
lan_network_receive_bytes{device="eth0"} 0.0
lan_network_receive_bytes{device="eth1"} 0.0
lan_network_receive_bytes{device="eth2"} 0.0
lan_network_receive_bytes{device="eth3"} 0.0
lan_network_receive_bytes{device="wl0"} 737476060.0
# HELP lan_network_send_bytes Sent bytes for network interface
# TYPE lan_network_send_bytes gauge
lan_network_send_bytes{device="eth0"} 363957004.0
lan_network_send_bytes{device="eth1"} 0.0
lan_network_send_bytes{device="eth2"} 0.0
lan_network_send_bytes{device="eth3"} 0.0
lan_network_send_bytes{device="wl0"} 2147483647.0
# HELP lan_network_receive_packets Received packets for network interface
# TYPE lan_network_receive_packets gauge
lan_network_receive_packets{device="eth0",disposition="transfer"} 1766250.0
lan_network_receive_packets{device="eth0",disposition="error"} 0.0
lan_network_receive_packets{device="eth0",disposition="drop"} 0.0
lan_network_receive_packets{device="eth1",disposition="transfer"} 0.0
lan_network_receive_packets{device="eth1",disposition="error"} 0.0
lan_network_receive_packets{device="eth1",disposition="drop"} 0.0
lan_network_receive_packets{device="eth2",disposition="transfer"} 0.0
lan_network_receive_packets{device="eth2",disposition="error"} 0.0
lan_network_receive_packets{device="eth2",disposition="drop"} 0.0
lan_network_receive_packets{device="eth3",disposition="transfer"} 0.0
lan_network_receive_packets{device="eth3",disposition="error"} 0.0
lan_network_receive_packets{device="eth3",disposition="drop"} 0.0
lan_network_receive_packets{device="wl0",disposition="transfer"} 6622351.0
lan_network_receive_packets{device="wl0",disposition="error"} 0.0
lan_network_receive_packets{device="wl0",disposition="drop"} 0.0
# HELP lan_network_send_packets Sent packets for network interface
# TYPE lan_network_send_packets gauge
lan_network_send_packets{device="eth0",disposition="transfer"} 3148577.0
lan_network_send_packets{device="eth0",disposition="error"} 0.0
lan_network_send_packets{device="eth0",disposition="drop"} 0.0
lan_network_send_packets{device="eth1",disposition="transfer"} 0.0
lan_network_send_packets{device="eth1",disposition="error"} 0.0
lan_network_send_packets{device="eth1",disposition="drop"} 0.0
lan_network_send_packets{device="eth2",disposition="transfer"} 0.0
lan_network_send_packets{device="eth2",disposition="error"} 0.0
lan_network_send_packets{device="eth2",disposition="drop"} 0.0
lan_network_send_packets{device="eth3",disposition="transfer"} 0.0
lan_network_send_packets{device="eth3",disposition="error"} 0.0
lan_network_send_packets{device="eth3",disposition="drop"} 0.0
lan_network_send_packets{device="wl0",disposition="transfer"} 8803737.0
lan_network_send_packets{device="wl0",disposition="error"} 0.0
lan_network_send_packets{device="wl0",disposition="drop"} 0.0
# HELP wan_network_receive_bytes Received bytes for network interface
# TYPE wan_network_receive_bytes gauge
wan_network_receive_bytes{device="ppp2.1"} 3013958333.0
wan_network_receive_bytes{device="ptm0.1"} 0.0
wan_network_receive_bytes{device="eth4.3"} 0.0
wan_network_receive_bytes{device="ppp1.1"} 0.0
wan_network_receive_bytes{device="ppp3.2"} 0.0
# HELP wan_network_send_bytes Sent bytes for network interface
# TYPE wan_network_send_bytes gauge
wan_network_send_bytes{device="ppp2.1"} 717118493.0
wan_network_send_bytes{device="ptm0.1"} 0.0
wan_network_send_bytes{device="eth4.3"} 0.0
wan_network_send_bytes{device="ppp1.1"} 0.0
wan_network_send_bytes{device="ppp3.2"} 0.0
# HELP wan_network_receive_packets Received packets for network interface
# TYPE wan_network_receive_packets gauge
wan_network_receive_packets{device="ppp2.1",disposition="transfer"} 11525693.0
wan_network_receive_packets{device="ppp2.1",disposition="error"} 0.0
wan_network_receive_packets{device="ppp2.1",disposition="drop"} 0.0
wan_network_receive_packets{device="ptm0.1",disposition="transfer"} 0.0
wan_network_receive_packets{device="ptm0.1",disposition="error"} 0.0
wan_network_receive_packets{device="ptm0.1",disposition="drop"} 0.0
wan_network_receive_packets{device="eth4.3",disposition="transfer"} 0.0
wan_network_receive_packets{device="eth4.3",disposition="error"} 0.0
wan_network_receive_packets{device="eth4.3",disposition="drop"} 0.0
wan_network_receive_packets{device="ppp1.1",disposition="transfer"} 0.0
wan_network_receive_packets{device="ppp1.1",disposition="error"} 0.0
wan_network_receive_packets{device="ppp1.1",disposition="drop"} 0.0
wan_network_receive_packets{device="ppp3.2",disposition="transfer"} 0.0
wan_network_receive_packets{device="ppp3.2",disposition="error"} 0.0
wan_network_receive_packets{device="ppp3.2",disposition="drop"} 0.0
# HELP wan_network_send_packets Sent packets for network interface
# TYPE wan_network_send_packets gauge
wan_network_send_packets{device="ppp2.1",disposition="transfer"} 7728904.0
wan_network_send_packets{device="ppp2.1",disposition="error"} 0.0
wan_network_send_packets{device="ppp2.1",disposition="drop"} 0.0
wan_network_send_packets{device="ptm0.1",disposition="transfer"} 0.0
wan_network_send_packets{device="ptm0.1",disposition="error"} 0.0
wan_network_send_packets{device="ptm0.1",disposition="drop"} 0.0
wan_network_send_packets{device="eth4.3",disposition="transfer"} 0.0
wan_network_send_packets{device="eth4.3",disposition="error"} 0.0
wan_network_send_packets{device="eth4.3",disposition="drop"} 0.0
wan_network_send_packets{device="ppp1.1",disposition="transfer"} 0.0
wan_network_send_packets{device="ppp1.1",disposition="error"} 0.0
wan_network_send_packets{device="ppp1.1",disposition="drop"} 0.0
wan_network_send_packets{device="ppp3.2",disposition="transfer"} 0.0
wan_network_send_packets{device="ppp3.2",disposition="error"} 0.0
wan_network_send_packets{device="ppp3.2",disposition="drop"} 0.0
# HELP adsl_attainable_rate_down_kbps ADSL Attainable Rate down (Kbps)
# TYPE adsl_attainable_rate_down_kbps gauge
adsl_attainable_rate_down_kbps 19708.0
# HELP adsl_attainable_rate_up_kbps ADSL Attainable Rate up (Kbps)
# TYPE adsl_attainable_rate_up_kbps gauge
adsl_attainable_rate_up_kbps 1087.0
# HELP adsl_rate_down_kbps ADSL Rate down (Kbps)
# TYPE adsl_rate_down_kbps gauge
adsl_rate_down_kbps 18175.0
# HELP adsl_rate_up_kbps ADSL Rate up (Kbps)
# TYPE adsl_rate_up_kbps gauge
adsl_rate_up_kbps 1087.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 34197504.0
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 22441984.0
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1497148890.92
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 3254.92
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 7.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1024.0

I then deployed Prometheus to the same Raspberry Pi, and configured it to read these metrics every few seconds by editing prometheus.yml:

global:
  scrape_interval: 5s

scrape_configs:
  - job_name: net
    static_configs:
    - targets: ["localhost:8000"]

Making some queries

Prometheus has a query language, which I find similar to spreadsheet formulas. You can enter a query directly into the web interface to get a graph or data table.

[Screenshot: graphing a query in the Prometheus web interface]

I settled on these queries to get the data I needed. They show me the maximum attainable line rate, actual sync rate, and current throughput over the WAN interface.

Downloads

Throughput:

rate(wan_network_receive_bytes{device="ppp2.1"}[10s])*8/1024/1024

ADSL attainable:

adsl_attainable_rate_down_kbps/1024

ADSL sync:

adsl_rate_down_kbps/1024

Uploads

Throughput:

rate(wan_network_send_bytes{device="ppp2.1"}[10s])*8/1024/1024

ADSL attainable:

adsl_attainable_rate_up_kbps/1024

ADSL sync:

adsl_rate_up_kbps/1024

Onto a dashboard

I then deployed the last component in this setup, Grafana, to the Raspberry Pi. This tool lets you save your queries on a dashboard.

I made two plots: one for uploads, and one for downloads.

[Screenshot: Grafana dashboard with upload and download plots]

By saturating the link with traffic (such as when running a speed test), it was now possible to compare the actual network speed with the ADSL sync speed.

[Screenshot: comparing actual throughput with the ADSL sync speed]

In my case, the best attainable network speed changed depending on the time of day, while the ADSL sync speed was constant. That’s a simple case of congestion.

Conclusion

I’ve deployed a few tiny Prometheus setups like this, because of how simple it is to work with new sources of metrics. It’s designed for much larger setups than an individual router, so it’s a worthwhile tool to be familiar with. Data is always a good reality-check for your assumptions, of course.

This setup had the level of security that you would expect of a Raspberry Pi project (none), and crashed after 4 days because I did not configure it for a RAM-limited environment, but it was a useful learning exercise, so I uploaded it to GitHub anyway. The Python and Ansible code can be found here.

How to export a Maildir email inbox as a CSV file

I’ve often thought it would be useful to have an “Export as CSV” button on my inbox, and recently had an excuse to implement one.

I needed to parse the Subject of each email coming from an automated process, but the inbox was in the Maildir format, and I needed something more useful for data interchange.

So I wrote a utility, mail2csv, to export a Maildir as a CSV file, which many existing tools can work with. It produces one row per email, and one column per email header.

The basic usage is:

$ mail2csv example/
Date,Subject,From
"Wed, 16 May 2018 20:05:16 +0000",An email,Bob <bob@example.com>
"Wed, 16 May 2018 20:07:52 +0000",Also an email,Alice <alice@example.com>

You can select any header with the --headers option, which accepts globs:

$ mail2csv example/ --headers Message-ID Date Subject
Message-ID,Date,Subject
<89125@example.com>,"Wed, 16 May 2018 20:05:16 +0000",An email
<180c6@example.com>,"Wed, 16 May 2018 20:07:52 +0000",Also an email

If you’re not sure which headers your emails have, then you can make a CSV with all the headers, and just read the first line:

$ mail2csv  example/ --headers '*' | head -n1

You can find the Python code for this tool on GitHub here.
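
The core of a tool like this is small, thanks to the Python standard library. This sketch is not the exact mail2csv code, but it shows the idea for a fixed set of headers:

#!/usr/bin/env python3
import csv
import mailbox
import sys

headers = ["Date", "Subject", "From"]
writer = csv.writer(sys.stdout)
writer.writerow(headers)
# One row per email, one column per selected header.
for message in mailbox.Maildir(sys.argv[1]).values():
    writer.writerow([message.get(header, "") for header in headers])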

How to make HHVM 3.21 identify itself as PHP 7

HHVM is an alternative runtime for PHP, which I try to maintain compatibility with.

I recently upgraded a project to PHPUnit 6, and the tests started failing on hhvm-3.21 with this error:

This version of PHPUnit is supported on PHP 7.0 and PHP 7.1.
You are using PHP 5.6.99-hhvm (/usr/bin/hhvm).

Although HHVM supports many PHP 7 features, it identifies itself as PHP 5.6. The official documentation includes a setting for enabling additional PHP 7 features, which also changes the reported version.
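
The setting itself is a single line in HHVM’s php.ini:

hhvm.php7.all = 1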

This project builds on Travis CI, so adding this one-liner to .travis.yml sets the flag in the configuration, and the issue goes away:

before_script:
  - bash -c 'if [[ $TRAVIS_PHP_VERSION == hhvm* ]]; then echo "hhvm.php7.all = 1" | sudo tee -a /etc/hhvm/php.ini; fi'

I’m writing about this for other developers who hit the same error. The current releases of HHVM don’t seem to have this issue, so hopefully this work-around is temporary.

How to check PHP code style in a continuous integration environment

I’m going to write a bit today about PHP_CodeSniffer, which is a tool for detecting PHP style violations.

Background

Convention is good. It’s like following the principle of least surprise in software design, but for your developer audience.

PHP has a couple of code standards that you will see in the wild. If you are writing new PHP code, or refactoring code that follows no discernible standard, you should use the PSR-2 code style.

I use PHP_CodeSniffer on a few projects to stop the build early if the code is not up to standard. It has the ability to fix the formatting problems it finds, so it is not a difficult test to clear.

Add PHP_CodeSniffer to your project

I’m assuming that you are already using composer to manage your PHP dependencies.

Ensure you are specifying your minimum PHP version in your composer.json:

"require": {
    "php": ">=7.0.0"
}

Next, add squizlabs/php_codesniffer as a development dependency of your project:

composer require --dev squizlabs/php_codesniffer

Check your code

I use this command to check the src folder. It can be run locally or on a CI box:

php vendor/bin/phpcs --standard=psr2 src/ -n

This will return nonzero if there are problems with the code style, which should stop the build on most systems.

The -n flag avoids less severe “warnings”. One of these is “Line exceeds 120 characters”, which can be difficult to completely eliminate.

Fixing the style

Running the phpcbf command with the same arguments will correct simple formatting issues:

php vendor/bin/phpcbf --standard=psr2 src/ -n
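
Optionally, you can also save both commands as Composer scripts, so that nobody has to remember the arguments. Composer puts vendor/bin on the PATH when running scripts, so this addition to composer.json is enough:

"scripts": {
    "check-style": "phpcs --standard=psr2 src/ -n",
    "fix-style": "phpcbf --standard=psr2 src/ -n"
}

These can then be run with composer check-style and composer fix-style.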

Travis CI example

A full .travis.yml which uses this would look like this (adapted from here):

language: php

php:
  - 7.0
  - 7.1
  - 7.2

install:
  - composer install

script:
  - php vendor/bin/phpcs --standard=psr2 src/ -n

If you are running a PHP project without any sort of build, then this is a good place to start: aside from style, this checks that dependencies can be installed on a supported range of PHP versions, and it will also break the build if there are files with syntax errors.