How to create effective PHP project documentation with Read the Docs

Documentation is one of the ways that software projects can communicate information to their users. Effective documentation is high-quality, meaning that it’s complete, accurate, and up-to-date. At least for open source libraries, it also means that you can find it with a search engine. For many small PHP projects, the reality is very far removed from the ideal.

Read the Docs (readthedocs.io) makes it easy to host an up-to-date copy of your project’s documentation online. There are around 2,000 PHP projects which host their documentation on the site, which makes PHP the third most popular programming language for projects on the site.

This post covers the process that is used to automatically publish the documentation for the gfx-php PHP graphics library, which is one of those 2,000 projects. You should consider using this setup as a template if your project is small enough that it does not have its own infrastructure.

Basic concept

Typically, people use Read the Docs with a tool called Sphinx. If you are writing in Python, it’s also possible to use the autodoc Sphinx plugin to add API documentation, based on docstrings in the code.

PHP programmers are already spoiled for choice if they want to produce HTML documentation from their code. These tools all have huge PHP user bases:

These will each output their own HTML, which is only useful if you want to self-host the documentation. I wanted a tool that was more like “autodoc for PHP”, so that I can have my API docs magically appear in Sphinx output that is hosted on Read the Docs.

Doxygen is the most useful tool for this purpose, because it has a stable XML output format and good-enough PHP support. I decided to write a tool to take the Doxygen XML output and generate reStructuredText (reST) for Sphinx:

This introduces some extra tools, which look complex at first. The stack is geared towards running within the Read the Docs build environment, so most developers can treat it as a black box after the initial setup:

This setup is entirely hosted with free cloud services, so you don’t need to run any applications on your own hardware.

Tools to install on local workstation

First, we will set up each of these tools locally, so that we know everything is working before we upload it.

  • Doxygen
  • Sphinx
  • doxyphp2sphinx

Doxygen

Doxygen can read PHP files to extract class names, documentation and method signatures. Linux and Mac users can install it from most package managers (apt-get, dnf or brew) under the name doxygen, while Windows users need to chase down binaries.

In your repo, make a sub-folder called docs/, and create a Doxyfile with some defaults:

mkdir docs/
cd docs/
doxygen -g Doxyfile

You need to edit the configuration to make it suitable for generating XML output for your PHP project. The version of Doxygen used here is 1.8.13, where you only need to change a few values to set Doxygen to:

  • Recursively search PHP files in your project’s source folder
  • Generate XML and HTML only
  • Log warnings to a file

For a typical project, these settings are:

PROJECT_NAME           = "Example Project"
INPUT                  = ../src
WARN_LOGFILE           = warnings.log
RECURSIVE              = YES
USE_MDFILE_AS_MAINPAGE = ../README.md
GENERATE_LATEX         = NO
GENERATE_XML           = YES

Once you set these in Doxyfile, you can run Doxygen to generate HTML and XML output.

$ doxygen

Doxygen will pick up most method signatures automatically, and you can add to them via docblocks, which work along the same lines as docstrings in Python. Read Doxygen: Documenting the Code to learn the syntax if you have not used a documentation generator in a curly-bracket language before.
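As a minimal sketch of what a docblock looks like in PHP, here is a small class in the FooCorp\Example namespace used later in this post. The class body, property name and constructor are hypothetical, and are only here to show the comment syntax that Doxygen reads:

```php
<?php
namespace FooCorp\Example;

/**
 * A widget with a fixed set of features.
 *
 * Hypothetical class, used only to illustrate docblock syntax.
 */
class Widget
{
    /** @var array Names of the features on this widget. */
    private $features;

    /**
     * Create a widget.
     *
     * @param array $features Names of the features to attach.
     */
    public function __construct(array $features = [])
    {
        $this->features = $features;
    }

    /**
     * Count the features on this widget.
     *
     * @return int Number of features, zero if there are none.
     */
    public function getFeatureCount(): int
    {
        return count($this->features);
    }
}
```

The first sentence of each comment becomes the brief description, and the @param and @return lines are attached to the method signature that Doxygen extracts.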

The Doxygen HTML will never be published, but you might need to read it to see how well Doxygen understands your code.

The XML output is much more useful for our purposes. It contains the same information, and we will read it to generate pages of documentation for Sphinx to render.

Sphinx

Sphinx is the tool that we will use to render the final HTML output. If you haven’t used it before, then see the official Getting Started page.

We are using Sphinx because it can be executed by online services like Read the Docs. It uses the reStructuredText format, which is a whole lot more complex than Markdown, but supports cross-references. I’ll only describe these steps briefly, because there are existing how-to guides on making Sphinx work for manually-written PHP documentation elsewhere on the Internet, such as:

Still in the docs folder with your Doxyfile, create and render an empty Sphinx project.

pip install sphinx
sphinx-quickstart --quiet --project example_project --author example_bob
make html

The generated HTML will initially appear like this:

We need to customize this in a way that adds PHP support. The quickest way is to drop this text into requirements.txt:

Sphinx==1.7.4
sphinx-rtd-theme==0.3.0
sphinxcontrib-phpdomain==0.4.1
doxyphp2sphinx>=1.0.1

Then these two sections of conf.py:

extensions = [
  "sphinxcontrib.phpdomain"
]
html_theme = 'sphinx_rtd_theme'

Add this to the end of conf.py:

# PHP Syntax
from sphinx.highlighting import lexers
from pygments.lexers.web import PhpLexer
lexers["php"] = PhpLexer(startinline=True, linenos=1)
lexers["php-annotations"] = PhpLexer(startinline=True, linenos=1)

# Set domain
primary_domain = "php"

And drop this content into _templates/breadcrumbs.html:

{%- extends "sphinx_rtd_theme/breadcrumbs.html" %}

{% block breadcrumbs_aside %}
{% endblock %}

Then finally re-install dependencies and re-build:

pip install -r requirements.txt
make html

The HTML output under _build will now appear as:

This setup gives us three things:

  • The documentation looks the same as Read the Docs.
  • We can use PHP snippets and class documentation.
  • There are no ‘Edit’ links, which is important because some of the files will be generated in the next steps.

doxyphp2sphinx

The doxyphp2sphinx tool will generate .rst files from the Doxygen XML files. This was installed from your requirements.txt in the last step, but you can also install it standalone via pip:

pip install doxyphp2sphinx

The only thing you need to specify is the name of the namespace that you are documenting, using :: as a separator.

doxyphp2sphinx FooCorp::Example

This command will read the xml/ subdirectory, and will create api.rst. It will fill the api/ directory with documentation for each class in the \FooCorp\Example namespace.

To verify that this has worked, check your class structure:

$ tree ../src
../src
├── Dooverwhacky.php
└── Widget.php

You should have documentation for each of these:

$ tree xml/ -P 'class*'
xml/
├── classFooCorp_1_1Example_1_1Dooverwhacky.xml
└── classFooCorp_1_1Example_1_1Widget.xml

And if you have the correct namespace name, you will have .rst files for each as well:

$ tree api
api
├── dooverwhacky.rst
└── widget.rst

Now, add a reference to api.rst somewhere in index.rst:

.. toctree::
   :maxdepth: 2
   :caption: API Documentation

   Classes <api.rst>

And re-compile:

make html

You can now navigate to your classes in the HTML documentation.

The quality of the generated class documentation can be improved by adding docblocks to your code. An example of the generated documentation for a method is:

Writing documentation

As you add pages to your documentation, you can include PHP snippets and reference the usage of your classes.

.. code-block:: php

   <?php
   echo "Hello World";

Lorem ipsum dolor sit :class:`Dooverwhacky`, foo bar baz :meth:`Widget::getFeatureCount`.

This will create syntax highlighting for your examples and inline links to the generated API docs.

Beyond this, you will need to learn some reStructuredText. I found this reference to be useful.

Local docs build

A full build has these dependencies:

apt-get install doxygen make python-pip
pip install -r docs/requirements.txt

And these steps:

cd docs/
doxygen 
doxyphp2sphinx FooCorp::Example
make html

Cloud tools

Next, we will take this local build, and run it on a cloud setup instead, using these services.

  • GitHub
  • Read the Docs

GitHub

I will assume that you know how to use Git and GitHub, and that you are creating code that is intended for re-use.

Add these files to your .gitignore:

docs/_build/
docs/warnings.log
docs/xml/
docs/html/

Upload the remainder of your repository to GitHub. The gfx-php project is an example of a working project with all of the correct files included.

To have the initial two build steps execute on Read the Docs, add this to the end of docs/conf.py. Don’t forget to update the namespace to match the command you were running locally.

# Regenerate API docs via doxygen + doxyphp2sphinx
import subprocess, os
read_the_docs_build = os.environ.get('READTHEDOCS', None) == 'True'
if read_the_docs_build:
    subprocess.call(['doxygen', 'Doxyfile'])
    subprocess.call(['doxyphp2sphinx', 'FooCorp::Example'])

Read the Docs

After all this setup, Read the Docs should be able to build the project. Create the project on Read the Docs by using the Import from GitHub option. There are only two settings which need to be set:

The requirements file location must be docs/requirements.txt:

And the programming language should be PHP.

After this, you can go ahead and run a build.

As a last step, you will want to ensure that you have a Webhook set up on GitHub to trigger the builds automatically in future.

Conclusion

It is emerging as a best practice for small libraries to host their documentation with Read the Docs, but this is not yet common in the PHP world. Larger PHP projects tend to self-host documentation on the project website, but smaller projects will often have no hosted API documentation.

Once you write your docs, publishing them should be easy! Hopefully this provides a good example of what’s possible.

Acknowledgements

Credit where credit is due: The Breathe project fills this niche for C++ developers using Doxygen, and has been around for some time. Breathe is in the early stages of adding PHP support, but is not yet ready at the time of writing.

Optimization: How I made my PHP code run 100 times faster

I’ve had a PHP wikitext parser as a dependency of some of my projects since 2012. It has always been a minor performance bottleneck, and I recently decided to do something about it.

I prepared an update to the code over the course of a month, and achieved a speedup of 142 times over the original code.

Before: 20.65 seconds, After: 0.145 seconds

A lot of the information that I could find online about PHP performance was very outdated, so I decided to write a bit about what I’ve learned. This post walks through the process I used, and the things which were slowing down my code.

This is a long read — I’ve included examples which show the sorts of things that were slowing down my code, and what I replaced them with. If you’re a serious PHP programmer, then read on!

Lesson 1: Know when to optimize

Conventional wisdom seems to dictate that making your code faster is a waste of developer time.

I think it makes you a better programmer to occasionally optimize something in a language that you normally work with. Having a well-calibrated intuition about how your code will run is part of becoming proficient in a language, and you will tend to create fewer performance problems if you’ve got that intuition.

But you do need to be selective about it. This code has survived in the wild for over five years, and I think I will still be using it in another five. This code is also a good candidate because it does not access external resources, so there is only one component to examine.

Lesson 2: Write tests

In the spirit of MakeItWorkMakeItRightMakeItFast, I started by checking my test suite so that I could refactor the code with confidence.

In my case, I haven’t got good unit tests, but I have input files that I can feed through the parser to compare with known-good HTML output, which serves the same purpose:

php example.php > out.txt
diff good.txt out.txt

I ran this after every change to the code, so that I could be sure that the changes were not affecting the output.
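The same check can also be scripted in PHP, which is handy if diff is not available. This is only a sketch: diffLines is a hypothetical helper name, and it does a simple line-by-line comparison rather than a real diff.

```php
<?php
/**
 * Return the 1-based line numbers where two outputs differ.
 * A line that exists in only one of the inputs counts as a difference.
 */
function diffLines(string $expected, string $actual): array {
    $expectedLines = explode("\n", $expected);
    $actualLines = explode("\n", $actual);
    $max = max(count($expectedLines), count($actualLines));
    $changed = [];
    for ($i = 0; $i < $max; $i++) {
        if (($expectedLines[$i] ?? null) !== ($actualLines[$i] ?? null)) {
            $changed[] = $i + 1;
        }
    }
    return $changed;
}

// Usage sketch, with the same file names as the shell commands above:
// $changed = diffLines(file_get_contents('good.txt'), file_get_contents('out.txt'));
// exit(count($changed) === 0 ? 0 : 1);
```

An empty array means that the refactored code still produces the known-good output.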

Lesson 3: Profile your code & Question your assumptions

Code profiling allows you to see how each part of your program is contributing to its overall run-time. This helps you to target your optimization efforts.

The two main debuggers for PHP are Zend and Xdebug, which can both profile your code. I have xdebug installed, which is the free debugger, and I use the Eclipse IDE, which is the free IDE. Unfortunately, the built-in profiling tool in Eclipse seems to only support the Zend debugger, so I have to profile my scripts on the command-line.

The best sources of information for this are:

On Debian or Ubuntu, xdebug is installed via apt-get:

sudo apt-get install php-cli php-xdebug

On Fedora, the package is called php-pecl-xdebug, and is installed as:

sudo dnf install php-pecl-xdebug

Next, I executed a slow-running example script with profiling enabled:

php -dxdebug.profiler_enable=1 -dxdebug.profiler_output_dir=. example.php

This produces a profile file, which you can inspect with any valgrind-compatible tool. I used kcachegrind:

sudo apt-get install kcachegrind

And for Fedora:

sudo dnf install kcachegrind

You can locate and open the profile on the command-line like so:

ls
kcachegrind cachegrind.out.13385

Before profiling, I had guessed that the best way to speed up the code would be to reduce the amount of string concatenation. I have lots of tight loops which append characters one-at-a-time:

$buffer .= "$c";

Xdebug showed me that my guess was wrong, and I would have wasted a lot of time if I tried to remove string concatenation.

kcachegrind screen capture

Instead, it was clear that I was:

  • Calculating the same thing hundreds of times over.
  • Doing it inefficiently.

Lesson 4: Avoid multibyte string functions

I had used functions from the mbstring extension (mb_strlen, mb_substr) to replace strlen and substr throughout my code. This is the simplest way to add UTF-8 support when iterating strings, is commonly suggested, and is a bad idea.

What people do

If you have an ASCII string in PHP and want to iterate over each byte, the idiomatic way to do it is with a for loop which indexes into the string, something like this:

<?php
// Make a test string
$testString = str_repeat('a', 60000);
// Loop through test string
$len = strlen($testString);
for($i = 0; $i < $len; $i++) {
  $c = substr($testString, $i, 1);
  // Do work on $c
  // ...
}

I’ve used substr here so that I can show that it has the same usage as mb_substr, which generally operates on UTF-8 characters. The idiomatic PHP for iterating over a multi-byte string one character at a time would be:

<?php
// Make a test string
$testString = str_repeat('a', 60000);
// Loop through test string
$len = mb_strlen($testString);
for($i = 0; $i < $len; $i++) {
  $c = mb_substr($testString, $i, 1);
  // Do work on $c
  // ...
}

Since mb_substr needs to parse UTF-8 from the start of the string each time it is called, the second snippet runs in quadratic time, where the snippet that calls substr in a loop is linear.

With a few kilobytes of input, this makes mb_substr unacceptably slow.

substr: 0.03 seconds, mb_substr: 4.23 seconds

Averaging over 10 runs, the mb_substr snippet takes 4.23 seconds, while the snippet using substr takes 0.03 seconds.
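If you want to repeat this kind of comparison on your own machine, a small timing harness is enough. The sketch below is not the script that produced the numbers above: averageSeconds is a hypothetical helper, and the string length and run count are reduced so that it finishes quickly. Raise them to 60000 and 10 to match the test described here.

```php
<?php
/**
 * Average wall-clock seconds taken by $fn over $runs calls.
 */
function averageSeconds(callable $fn, int $runs = 10): float {
    $total = 0.0;
    for ($run = 0; $run < $runs; $run++) {
        $start = microtime(true);
        $fn();
        $total += microtime(true) - $start;
    }
    return $total / $runs;
}

// Shortened test string (assumption: 6000 characters instead of 60000)
$testString = str_repeat('a', 6000);

// Byte-at-a-time loop using substr
$byteLoop = function () use ($testString) {
    $len = strlen($testString);
    for ($i = 0; $i < $len; $i++) {
        $c = substr($testString, $i, 1);
    }
};

// Character-at-a-time loop using mb_substr
$charLoop = function () use ($testString) {
    $len = mb_strlen($testString);
    for ($i = 0; $i < $len; $i++) {
        $c = mb_substr($testString, $i, 1);
    }
};

printf("substr:    %.4f seconds\n", averageSeconds($byteLoop, 3));
if (function_exists('mb_substr')) {
    // Only timed when the mbstring extension is loaded
    printf("mb_substr: %.4f seconds\n", averageSeconds($charLoop, 3));
}
```

The absolute numbers will depend on your PHP version and hardware, so treat the ratio between the two as the interesting part.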

What people should do

Split your strings into bytes or characters before you iterate, and write methods which operate on arrays rather than strings.

You can use str_split to iterate over bytes:

<?php
// Make a test string
$testString = str_repeat('a', 60000);
// Loop through test string
$testArray = str_split($testString);
$len = count($testArray);
for($i = 0; $i < $len; $i++) {
  $c = $testArray[$i];
  // Do work on $c
  // ...
}

And for Unicode strings, use preg_split. I learned about this trick from StackOverflow, but it might not be the fastest way. Please leave a comment if you have an alternative!

<?php
// Make a test string
$testString = str_repeat('a', 60000);
// Loop through test string
$testArray = preg_split('//u', $testString, -1, PREG_SPLIT_NO_EMPTY);
$len = count($testArray);
for($i = 0; $i < $len; $i++) {
  $c = $testArray[$i];
  // Do work on $c
  // ...
}

By converting the string to an array, you can pay the penalty of decoding the UTF-8 up-front. This is a few milliseconds at the start of the script, rather than a few milliseconds each time you need to read a character.
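The test strings above only contain the letter a, so they do not show the difference between the two functions. As a small illustration with a string that I made up for this purpose, here is what each one returns when the input contains multi-byte characters:

```php
<?php
// "héllo→" written with code point escapes (PHP 7.0+), so that the
// example does not depend on the encoding of this source file.
$text = "h\u{00E9}llo\u{2192}";

// str_split() works on bytes, so multi-byte characters are broken apart.
$bytes = str_split($text);

// preg_split() with the /u modifier treats the subject as UTF-8 and
// splits between code points.
$chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);

echo count($bytes), " bytes\n";      // prints "9 bytes"
echo count($chars), " characters\n"; // prints "6 characters"
```

Both results can be indexed with $i in the same way, so the loop body does not need to change when you switch between them.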

str_split: 0.0097s, preg_split: 0.0160s

After discovering this faster alternative to mb_substr, I systematically removed every mb_substr and mb_strlen from the code I was working on.

Lesson 5: Optimize for the most common case

Around 50% of the remaining runtime was spent in a method which expanded templates.

To parse wikitext, you first need to expand templates, which involves detecting tags like {{ template-name | foo=bar }} and <noinclude></noinclude>.

My 40 kilobyte test file had fewer than 100 instances of { and <, | and =, so I added a short-circuit to the code to skip most of the processing, for most of the characters.

<?php
self::$preprocessorChars = [
    '<' => true,
    '=' => true,
    '|' => true,
    '{' => true
];

// ...
for ($i = 0; $i < $len; $i++) {
    $c = $textChars[$i];
    if (!isset(self::$preprocessorChars[$c])) {
        /* Fast exit for characters that do not start a tag. */
        $parsed .= $c;
        continue;
    }
   // ... Slower processing 
}

The slower processing is now avoided 99.75% of the time.

Checking for the presence of a key in a map is very fast. To illustrate, here are two examples which each branch on { and <, | and =.

This one uses a map to check each character:

<?php
// Make a test string
$testString = str_repeat('a', 600000);
$chars = [
    '<' => true,
    '=' => true,
    '|' => true,
    '{' => true
];
// Loop through test string
$testArray = preg_split('//u', $testString, -1, PREG_SPLIT_NO_EMPTY);
$len = count($testArray);
$parsed = "";
for($i = 0; $i < $len; $i++) {
  $c = $testArray[$i];
  if(!isset($chars[$c])) {
    $parsed .= $c;
    continue;
  }
  // Never executed
}

While one uses no map, and has four !== checks instead:

<?php
// Make a test string
$testString = str_repeat('a', 600000);
// Loop through test string
$testArray = preg_split('//u', $testString, -1, PREG_SPLIT_NO_EMPTY);
$len = count($testArray);
$parsed = "";
for($i = 0; $i < $len; $i++) {
  $c = $testArray[$i];
  if($c !== "<" && $c !== "=" && $c !== "|" && $c !== "{") {
    $parsed .= $c;
    continue;
  }
  // Never executed
}

Even though the run time of each script includes the generation of a 600kB test string, the difference is still pronounced:

0.29 seconds with map, 0.37 seconds without map

Averaging over 10 runs, the code took 0.29 seconds when using a map, while it took 0.37 seconds to run the example which used four !== statements.

I was a little surprised by this result, but I’ll let the data speak for itself rather than try to explain why this is the case.

Lesson 6: Share data between functions

The next item to appear in the profiler was the copious use of array_slice.
My code uses recursive descent, and was constantly slicing up the input to pass around information. The array slicing had replaced earlier string slicing, which was even slower.

I refactored the code to pass around the entire string with indexes rather than actually cutting it up.

As a contrived example, these scripts each use a (very unnecessary) recursive-descent parser to take words from the dictionary and transform them like this:

example --> (example!)

The first example slices up the input array at each recursion step:

<?php
function handleWord(string $word) {
  return "($word!)\n";
}

/**
 * Parse a word up to the next newline.
 */
function parseWord(array $textChars) {
  $parsed = "";
  $len = count($textChars);
  for($i = 0; $i < $len; $i++) {
    $c = $textChars[$i];
    if($c === "\n") {
      // Word is finished because we hit a newline
      $start = $i + 1; // Past the newline
      $remainderChars = array_slice($textChars, $start , $len - $start);
      return array('parsed' => handleWord($parsed), 'remainderChars' => $remainderChars);
    }
    $parsed .= $c;
  }
  // Word is finished because we hit the end of the input
  return array('parsed' => handleWord($parsed), 'remainderChars' => []);
}

/**
 * Accept newline-delimited dictionary
 */
function parseDictionary(array $textChars) {
  $parsed = "";
  $len = count($textChars);
  for($i = 0; $i < $len; $i++) {
    $c = $textChars[$i];
    if($c === "\n") {
      // Not a word...
      continue;
    }
    // This is part of a word
    $start = $i;
    $remainderChars = array_slice($textChars, $start, $len - $start);
    $result = parseWord($remainderChars);
    $textChars = $result['remainderChars'];
    $len = count($textChars);
    $i = -1;
    $parsed .= $result['parsed'];
  }
  return array('parsed' => $parsed, 'remainderChars' => []);
}

// Load file, split into characters, parse, print result
$testString = file_get_contents("words");
$testArray = preg_split('//u', $testString, -1, PREG_SPLIT_NO_EMPTY);
$ret = parseDictionary($testArray);
file_put_contents("words2", $ret['parsed']);

While the second one always takes an index into the input array:

<?php
function handleWord(string $word) {
  return "($word!)\n";
}

/**
 * Parse a word up to the next newline.
 */
function parseWord(array $textChars, int $idxFrom = 0) {
  $parsed = "";
  $len = count($textChars);
  for($i = $idxFrom; $i < $len; $i++) {
    $c = $textChars[$i];
    if($c === "\n") {
      // Word is finished because we hit a newline
      $start = $i + 1; // Past the newline
      return array('parsed' => handleWord($parsed), 'remainderIdx' => $start);
    }
    $parsed .= $c;
  }
  // Word is finished because we hit the end of the input
  return array('parsed' => handleWord($parsed), 'remainderIdx' => $i);
}

/**
 * Accept newline-delimited dictionary
 */
function parseDictionary(array $textChars, int $idxFrom = 0) {
  $parsed = "";
  $len = count($textChars);
  for($i = $idxFrom; $i < $len; $i++) {
    $c = $textChars[$i];
    if($c === "\n") {
      // Not a word...
      continue;
    }
    // This is part of a word
    $start = $i;
    $result = parseWord($textChars, $start);
    $i = $result['remainderIdx'] - 1;
    $parsed .= $result['parsed'];
  }
  return array('parsed' => $parsed, 'remainderIdx' => $i);
}

// Load file, split into characters, parse, print result
$testString = file_get_contents("words");
$testArray = preg_split('//u', $testString, -1, PREG_SPLIT_NO_EMPTY);
$ret = parseDictionary($testArray);
file_put_contents("words2", $ret['parsed']);

The run-time difference between these examples is again very pronounced:

3.04s with slicing, 0.0302s with no slicing

Averaging over 10 runs, the code snippet which extracts sub-arrays took 3.04 seconds, while the code that passes around the entire array ran in 0.0302 seconds.

It’s amazing that such an obvious inefficiency in my code had been hiding behind larger problems before.

Lesson 7: Use scalar type hinting

Scalar type hinting looks like this:

function foo(int $bar, string $baz) {
  // ...
}

This is the secret PHP performance feature that they don’t tell you about. It does not actually change the speed of your code, but does ensure that it can’t be run on the slower PHP releases before 7.0.

PHP 7.0 was released in 2015, and it’s been established that it is twice as fast as PHP 5.6 in many cases.

I think it’s reasonable to have a dependency on a supported version of your platform, and performance improvements like this are a good reason to update your minimum supported version of PHP.

By breaking compatibility with scalar type hints, you ensure that your software does not appear to “work” in a degraded state.

Simplifying the compatibility landscape will also make performance a more tractable problem.
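As a small sketch of what this looks like in practice (repeatWord is a made-up function for illustration), PHP 7.0 and later enforce the declared types when the function is called:

```php
<?php
/**
 * Repeat a word a fixed number of times.
 */
function repeatWord(string $word, int $times): string {
    return str_repeat($word, $times);
}

echo repeatWord("ab", 3), "\n"; // prints "ababab"

try {
    // A non-numeric string cannot be used as an int.
    repeatWord("ab", "many");
} catch (TypeError $e) {
    echo "Rejected: ", get_class($e), "\n"; // prints "Rejected: TypeError"
}
```

The return type declaration is not valid syntax before PHP 7.0, so a file like this will refuse to load on older releases rather than run slowly.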

Lesson 8: Ignore advice that isn’t backed up by (relevant) data

While I was updating this code, I found a lot of out-dated claims about PHP performance, which did not hold true for the versions that I support.

To call out a few myths that still seem to be alive:

  • Style of quotes impacts performance.
  • Use of by-val is slower than by-ref for variable passing.
  • String concatenation is bad for performance.

I attempted to implement each of these, and wasted a lot of time. Thankfully, I was measuring the run-time and using version control, so it was easy for me to identify and discard changes which had a negligible or negative performance impact.

If somebody makes a claim about something being fast or slow in PHP, best assume that it doesn’t apply to you, unless you see some example code with timings on a recent PHP version.

Conclusion

If you’ve read this far, then I hope you’ve seen that modern PHP is not an intrinsically slow language. A significant speed-up is probably achievable on any real-world code-base with such endemic performance issues.

Before: 20.65 seconds, After: 0.145 seconds

To repeat the graphic from the introduction: the test file could be parsed in 20.65 seconds on the old code, and 0.145 seconds on the improved code (averaging over 10 runs, as before).

At this point I declared my efforts “good enough” and moved on. The job is done, since although another pass could speed it up further, this code is no longer slow enough to justify the effort.

Since I’ve closed the book on PHP 5, I would be interested in knowing whether there is a faster way to parse UTF-8 with the new IntlChar functions, but I’ll have to save that for the next project.

Now that you’ve seen some very inefficient ways to write PHP, I also hope that you will be able to avoid introducing similar problems in your own projects!

New WordPress theme (2018 edition)

This week I replaced the previous WordPress theme on this blog with the current one.

 
I used Bootstrap to place widgets in my blog content, and Prism.js to do syntax highlighting on code snippets.

This is a heavily modified version of the default twentyseventeen theme. I chose this as a base because of its good use of white-space for typesetting. The updated bootstrap-based layout is also a big improvement for mobile users, who now make up the majority of web traffic:

How to install PHP Composer as a regular user

Composer is an essential utility for PHP programmers, and allows you to manage dependencies.

Dependencies

You can use your regular account to install composer, use it, and even update it. You do need to have a few packages installed first though:

sudo apt-get install git curl php-cli

Or on Fedora:

sudo dnf install git curl php-cli

Local install

Next, fetch the installer and deploy composer to your home directory

curl https://getcomposer.org/installer > composer-setup.php
mkdir -p ~/.local/bin
php composer-setup.php --install-dir=$HOME/.local/bin --filename=composer
rm composer-setup.php

Last, add ~/.local/bin to your $PATH:

echo 'PATH=$PATH:~/.local/bin' >> ~/.bashrc
source  ~/.bashrc
echo $PATH

You can now run composer:

$ composer --help
Usage:
  help [options] [--] []
...
$ composer self-update
You are already using composer version 1.5.6 (stable channel).

Make Composer available for all users

Just run this line if you decide that all users should have access to your copy of Composer:

sudo mv ~/.local/bin/composer /usr/local/bin/composer

If you look up how to install Composer, you will find a tempting one-liner that uses curl to fetch a script from the Composer website, then executes it as root. I don’t think it’s good practice to install software like that, so I would encourage you to just run ‘sudo mv’ at the end.

New WordPress theme

Since the last major revision of my site setup, I’ve been including more technical content, which would be easier to read with syntax highlighting and tabs.

The most visible part of the transition is now complete:

The old theme was Skittlish, but I decided to move to a new theme which was based on Bootstrap, so that I could use its components. The new theme is a modified version of the default twentyfourteen theme, using the visual style of morphic, with Prism.js added for code highlighting.

How to generate professional-quality PDF files from PHP

There are a few ways to go about making PDF files from your PHP web app. Your options are basically:

  1. Put all of your text in a 210mm column and get the user to save it as PDF.
  2. Learn a purpose-built library, such as FPDF (free) or pdflib (proprietary).
  3. Use PHP to generate markup which can be compiled to PDF. The markup is, of course, LaTeX.

This article assumes an intermediate knowledge of both PHP and LaTeX, and that your server is not running Windows.

The software mix

PHP is an open-source server package which generates HTML pages, usually based on some sort of dynamic data. It is equally good at (but less well known for) generating other types of markup.

LaTeX is an open source document typesetting system, which will take a markup file in .tex format, and output a printable document, such as a PDF. The engine I will use here is XeLaTeX, because it supports modern trimmings such as Unicode and OpenType fonts.

Naturally, this post will use PHP to populate a .tex file, and then xelatex to create a PDF for the user.

This sounds straightforward enough, but it may not work with all shared hosts. Check your setup before you read on:

  1. Your server needs PHP, with safe mode disabled, so that it can run commands.
  2. This server needs xelatex, or a suitable substitute such as pdflatex.

A bit about markup

We will be working with .tex templates, which will be valid LaTeX files. The basic rules are:

  1. Define a \newcommand for every variable, so that you can compile the document without PHP.
  2. Drop PHP code in comments, which will print out code to override those variables.

So you will end up with code like this:

% Make placeholders visible
\newcommand{\placeholder}[1]{\textbf{$<$ #1 $>$}}

% Defaults for each variable
\newcommand{\test}{\placeholder{Data here}}

% Fill in
% <?php echo "\n" . "\\renewcommand{\\test}{" . LatexTemplate::escape($data['test']) . "}\n"; ?>

Look messy? A multi-line block of PHP is a little easier to follow. This example is from the body of a table; see if you can figure out the syntax:

%<?php                                                                      /*
% */ foreach($data['invoiceItem'] as $invoiceItem) {                        /*
% */    echo "\n" . LatexTemplate::escape($invoiceItem['item']) . " & " .   /*
% */        LatexTemplate::escape($invoiceItem['qty']) . " & " .            /*
% */        LatexTemplate::escape($invoiceItem['price']) . " & " .          /*
% */        LatexTemplate::escape($invoiceItem['total']) . "\\\\\n";        /*
% */ } ?>

So what about this LatexTemplate::escape() business? In LaTeX, just about every symbol seems to be part of the syntax, so it is sadly not very simple to escape.

I have settled on the following series of str_replace() calls to sanitise information for display. It is crude but effective. Generating LaTeX is much like generating SQL, HTML or LDIF from your website: it is quite important to make a habit of wrapping every piece of data with a function to prevent users from writing (‘injecting’) arbitrary code into your document:

/**
 * Series of substitutions to sanitise text for use in LaTeX.
 *
 * http://stackoverflow.com/questions/2627135/how-do-i-sanitize-latex-input
 * Target document should \usepackage{textcomp}
 */
public static function escape($text) {
	// Prepare backslash/newline handling
	$text = str_replace("\n", "\\\\", $text); // Rescue newlines
	$text = preg_replace('/[\x00-\x1F\x7F-\xFF]/', '', $text); // Strip all non-printables
	$text = str_replace("\\\\", "\n", $text); // Re-insert newlines and clear \\
	$text = str_replace("\\", "\\\\", $text); // Use double-backslash to signal a backslash in the input (escaped in the final step).

	// Symbols which are used in LaTeX syntax
	$text = str_replace("{", "\\{", $text);
	$text = str_replace("}", "\\}", $text);
	$text = str_replace("$", "\\$", $text);
	$text = str_replace("&", "\\&", $text);
	$text = str_replace("#", "\\#", $text);
	$text = str_replace("^", "\\textasciicircum{}", $text);
	$text = str_replace("_", "\\_", $text);
	$text = str_replace("~", "\\textasciitilde{}", $text);
	$text = str_replace("%", "\\%", $text);

	// Brackets & pipes
	$text = str_replace("<", "\\textless{}", $text);
	$text = str_replace(">", "\\textgreater{}", $text);
	$text = str_replace("|", "\\textbar{}", $text);

	// Quotes
	$text = str_replace("\"", "\\textquotedbl{}", $text);
	$text = str_replace("'", "\\textquotesingle{}", $text);
	$text = str_replace("`", "\\textasciigrave{}", $text);

	// Clean up backslashes from before
	$text = str_replace("\\\\", "\\textbackslash{}", $text); // Substitute backslashes from first step.
	$text = str_replace("\n", "\\\\", trim($text)); // Replace newlines (trim is in case of leading \\)
	return $text;
}
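One reason the function above needs the double-backslash detour is that chained str_replace() calls re-scan text which earlier calls have inserted. As a point of comparison only (this is not the code used in LatexTemplate), a single strtr() call with a replacement map never re-scans its own output. The function name and the reduced symbol list below are assumptions for illustration, and the sketch does not handle newlines or non-printable characters.

```php
<?php
/**
 * Hypothetical single-pass escaper. strtr() with an array does not
 * search inside text that it has already substituted.
 */
function latex_escape_sketch(string $text): string {
    return strtr($text, [
        '\\' => '\\textbackslash{}',
        '{'  => '\\{',
        '}'  => '\\}',
        '$'  => '\\$',
        '&'  => '\\&',
        '#'  => '\\#',
        '_'  => '\\_',
        '%'  => '\\%',
        '^'  => '\\textasciicircum{}',
        '~'  => '\\textasciitilde{}',
    ]);
}

echo latex_escape_sketch('50% of $10 & {a\b}'), "\n";
```

Here the braces that belong to \textbackslash{} are left alone, because they were never part of the input.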

We then have a template which we can include() from PHP, or run xelatex over. Below is minimal.tex, a minimal example of a PHP-LaTeX template in this form:

% This file is a valid PHP file and also a valid LaTeX file
% When processed with LaTeX, it will generate a blank template
% Loading with PHP will fill it with details

\documentclass{article}
% Required for proper escaping
\usepackage{textcomp} % Symbols
\usepackage[T1]{fontenc} % Input format

% Because Unicode etc.
\usepackage{fontspec} % For loading fonts
\setmainfont{Liberation Serif} % Has a lot more symbols than Computer Modern

% Make placeholders visible
\newcommand{\placeholder}[1]{\textbf{$<$ #1 $>$}}

% Defaults for each variable
\newcommand{\test}{\placeholder{Data here}}

% Fill in
% <?php echo "\n" . "\\renewcommand{\\test}{" . LatexTemplate::escape($data['test']) . "}\n"; ?>

\begin{document}
	\section{Data From PHP}
	\test{}
\end{document}

Generate a PDF on the server

Here is where the fun begins. There is no plugin for compiling a LaTeX document, so we need to directly execute the command on a file.

Looks like we need to save the output somewhere then. You would generate your filled-in LaTeX code in a temporary file by doing something like this:

/**
 * Generate a PDF file using xelatex and pass it to the user
 */
public static function download($data, $template_file, $outp_file) {
	// Pre-flight checks
	if(!file_exists($template_file)) {
		throw new Exception("Could not open template");
	}
	if(($f = tempnam(sys_get_temp_dir(), 'tex-')) === false) {
		throw new Exception("Failed to create temporary file");
	}

	$tex_f = $f . ".tex";
	$aux_f = $f . ".aux";
	$log_f = $f . ".log";
	$pdf_f = $f . ".pdf";

	// Perform substitution of variables
	ob_start();
	include($template_file);
	file_put_contents($tex_f, ob_get_clean());

The next step is to execute your engine of choice on the output files:

	// Run xelatex (Used because of native unicode and TTF font support)
	$cmd = sprintf("xelatex -interaction nonstopmode -halt-on-error %s",
			escapeshellarg($tex_f));
	chdir(sys_get_temp_dir());
	exec($cmd, $foo, $ret);

Once this is done, you can delete a lot of the extra LaTeX files, and check if a .pdf appeared as expected:

	// No need for these files anymore
	@unlink($tex_f);
	@unlink($aux_f);
	@unlink($log_f);

	// Test here
	if(!file_exists($pdf_f)) {
		@unlink($f);
		throw new Exception("Output was not generated and latex returned: $ret.");
	}

And of course, send the completed file back via HTTP:

	// Send through output
	$fp = fopen($pdf_f, 'rb');
	header('Content-Type: application/pdf');
	header('Content-Disposition: attachment; filename="' . $outp_file . '"' );
	header('Content-Length: ' . filesize($pdf_f));
	fpassthru($fp);
	fclose($fp);

	// Final cleanup
	@unlink($pdf_f);
	@unlink($f);
}

The static functions escape($text) and download($data, $template_file, $outp_file) are together placed into a class called LatexTemplate for the remainder of the example (complete file on GitHub).
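The download() function discards the compiler's output ($foo) and reports only the exit status. When debugging a template it can help to keep both. The helper below is a hypothetical sketch, not part of LatexTemplate; to stay runnable without a LaTeX installation, it invokes the PHP binary itself as a stand-in for xelatex, and it assumes a Unix-style shell for the 2>&1 redirection.

```php
<?php
/**
 * Hypothetical helper: run a command and return its exit status
 * together with everything it printed (stdout and stderr).
 */
function run_command_sketch(string $cmd): array {
    $lines = [];
    $ret = -1;
    exec($cmd . ' 2>&1', $lines, $ret);
    return [$ret, $lines];
}

// Stand-in for a failing xelatex run: print a message, exit with status 3.
$cmd = escapeshellarg(PHP_BINARY) . ' -r ' . escapeshellarg('echo "boom"; exit(3);');
[$ret, $lines] = run_command_sketch($cmd);

if ($ret !== 0) {
    echo "Command failed with status $ret:\n" . implode("\n", $lines) . "\n";
}
```

With xelatex, the captured lines would be the last part of the log, which usually names the line of the template that failed.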

Gluing it all together

With the library and template, it is quite easy to set up a PHP script which triggers the above code:

<?php
require_once('../LatexTemplate.php');

$test = "";
if(isset($_GET['t'])) {
	// Make the LaTeX file and send it through
	$test = $_GET['t'];
	if($test =="") {
		// Test pattern to show symbol handling
		for($i = 0; $i < 256; $i++) {
			$test .= chr($i) . " . ";
		}
	}

	try {
		LatexTemplate::download(array('test' => $test), 'minimal.tex', 'foobar.pdf');
	} catch(Exception $e) {
		echo $e -> getMessage();
	}

}
?>
<html>
<head>
<title>LaTeX test (minimal)</title>
</head>
<body>
	<p>Enter some text to be placed on the output:</p>
	<form>
		<input type="text" name="t" /><input type="submit" value="Generate" />
	</form>
</body>
</html>

The above code will show a form, which asks for input. When it gets some text, it will generate a PDF containing the text. If no text is given, it will output an ASCII table, simply to show that it can handle the symbols.

Once the template code is hidden away, this powerful technique is easily applied.

Results

This is only a minimal example. In any real application, your template would be more extensive.

Compiling the template directly creates this PDF:

From the web, a form is presented to fill this single field:

Which results in a PDF containing the user data:

Tips

  1. The text after \end{document} is not even parsed by LaTeX. Use this area to write <?php ?> with fewer constraints.
  2. Consult the GitHub repository for this code to see the complete example.
  3. Comment out the line @unlink($tex_f); if you want to preserve the generated markup (for debugging, etc.).

How to query Microsoft SQL Server from PHP

This post is for anybody who runs a GNU/Linux server and needs to query a MSSQL database. This setup will work on Debian and its relatives. As it’s a dense mix of technologies, I’ve included all of the details which worked for me.

An obvious note: Microsoft SQL is not an ideal choice of database to pair with a GNU/Linux server, but may be acceptable if you are writing something which needs to import some data from an external application which has a better reason to be using it.

A command-line alternative to this setup would be sqsh, which will let you run scheduled queries without PHP, if that’s what you’re after.

Prerequisites

Once you have PHP, the required libraries can be fetched with:

sudo apt-get install unixodbc php5-odbc tdsodbc

MSSQL is accessed with the FreeTDS driver. Once the above packages are installed, you need to tell ODBC where to find this driver, by adding the following block to /etc/odbcinst.ini:

[FreeTDS]
Description=MSSQL DB
Driver=/usr/lib/x86_64-linux-gnu/odbc/libtdsodbc.so
UsageCount=1

The path is different on platforms other than amd64. Check the file list for the tdsodbc package on your architecture if you lose track of the path.

The next step requires that you know the database server address, version, and database name. Add a block for your database to the end of /etc/odbc.ini:

[foodb]
Driver = FreeTDS
Description = Foo Database
Trace = Yes
TraceFile = /tmp/sql.log
ForceTrace = yes
Server = 10.x.x.x
Port = 1433
Database = FooDB
TDS_Version = 8.0

Experiment with TDS_Version values if you have issues connecting, as different versions of MSSQL require different values. The name of the data source (‘foodb’), the Database, Description and Server are all placeholder values which you will need to fill in.
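When experimenting with TDS_Version, it may be quicker to skip /etc/odbc.ini and pass the same attributes in the connection string. The sketch below only builds such a string; the helper name is hypothetical, and it assumes that your PDO ODBC and FreeTDS versions accept DSN-less connection strings with these attribute names, so check this against your own setup.

```php
<?php
// Hypothetical helper: build a PDO DSN from the same keys used in odbc.ini.
function odbc_dsn_sketch(array $attrs): string {
    $parts = [];
    foreach ($attrs as $key => $value) {
        $parts[] = $key . '=' . $value;
    }
    return 'odbc:' . implode(';', $parts);
}

$dsn = odbc_dsn_sketch([
    'Driver'      => 'FreeTDS',   // name of the block in /etc/odbcinst.ini
    'Server'      => '10.x.x.x',  // placeholder, as in the block above
    'Port'        => 1433,
    'Database'    => 'FooDB',
    'TDS_Version' => '8.0',
]);
echo $dsn, "\n";
// Connecting would then look like: new PDO($dsn, $user, $pass);
```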

An example

For new PHP scripts, database grunt-work is invariably done via PHP Data Objects (PDO). The good news is, it is easy to use it with MSSQL from here.

The below file takes a query on standard input, throws it at the database, and returns the result as comma-separated values.

Save this as query.php and fill in your data source (‘odbc:foodb’ here), username, and password.

#!/usr/bin/env php
<?php
$query = file_get_contents("php://stdin");
$user = 'baz';
$pass = 'super secret password here';

$dbh = new PDO('odbc:foodb', $user, $pass);
$dbh -> setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$sth = $dbh -> prepare($query);
$sth -> execute();
$results = $sth -> fetchAll(PDO::FETCH_ASSOC);

/* Quick exit if there are no rows */
if(count($results) == 0) {
	return 0;
}
$f = fopen("php://stdout", "w");

/* Output header */
$a = $results[0];
$header = array();
foreach($a as $key => $val) {
	$header[] = $key;
}
fputcsv($f, $header);

/* Output rows */
foreach($results as $result) {
	fputcsv($f, $result);
}

fclose($f);

To test the new script, first make it executable:

chmod +x query.php

To run a simple test query:

echo "SELECT Name from sys.tables ORDER BY Name;" | ./query.php
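The CSV half of query.php can be tried without a database. The sketch below is a hypothetical helper, not part of the script: it takes rows shaped like the result of fetchAll(PDO::FETCH_ASSOC), uses the keys of the first row as the header, and lets fputcsv() handle the quoting.

```php
<?php
/**
 * Hypothetical helper: turn an array of associative rows into CSV text,
 * using the keys of the first row as the header line.
 */
function rows_to_csv_sketch(array $rows): string {
    if (count($rows) === 0) {
        return '';
    }
    $f = fopen('php://memory', 'w+');
    // Delimiter, enclosure and escape character are passed explicitly.
    fputcsv($f, array_keys($rows[0]), ',', '"', '\\');
    foreach ($rows as $row) {
        fputcsv($f, $row, ',', '"', '\\');
    }
    rewind($f);
    $csv = stream_get_contents($f);
    fclose($f);
    return $csv;
}

// Made-up rows standing in for $sth->fetchAll(PDO::FETCH_ASSOC)
echo rows_to_csv_sketch([
    ['name' => 'a',   'n' => 1],
    ['name' => 'b,c', 'n' => 2],
]);
```

The second row shows why fputcsv() is worth using: the value containing a comma is quoted automatically.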

Refining the setup

The above script has some cool features: It’s short, actually useful, and it sets PDO::ERRMODE_EXCEPTION. This means that if something breaks, it will fail loudly and tell you why.

Hopefully, if your setup has issues, you can track down the cause with the error, and solve it by scrolling through this how-to again.

If you encounter a MSSQL database with an unknown schema, then you may want to list all tables and columns. This is achieved with:

SELECT tables.name AS tbl, columns.name AS col FROM sys.columns JOIN sys.tables ON columns.object_id = tables.object_id ORDER BY tbl, col;

The catch

I’ve run into some bizarre limitations using this. Be sure to run it on a server which you can update at the drop of a hat.

A mini-list of issues I’ve seen with this combination of software (no sources as I never tracked down the causes):

  • An old version of the driver would segfault PHP, apparently when non-ASCII content appeared in a text field.
  • Substituting non-text values fails in the version I am using, although Google suggests that updating the ODBC driver fixes this.

Winning 2048 game with key-mashing?

A new, simple, addictive game called 2048 is out. You slide two matching numbers together to make a bigger number, in ever-increasing powers of two. Reach 2048 and you win.

I noticed that somebody already wrote a neat AI for it, although it does run quite slowly. But then I also noticed a friend mashing keys in a simple pattern, and thought I should test whether this was more effective. The answer: it kinda is.

At least in the first half of the game, a simple key-mashing pattern is a much faster way to build high numbers. The PHP script below will usually get to 512 without much trouble, but rarely to 1024. I would suggest running it for a while, and then taking over with some strategy.

The script

This script spits out commands which can be piped to xte for some automatic key-mashing on GNU/Linux. Save it as 2048.php:

#!/usr/bin/env php
mouseclick 1
<?php
for($i = 0; $i < 10; $i++) {
	move("Left", 1);
	move("Right", 1);
}

while(1) {
	move("Down", 1);
	move("Left", 1);
	move("Down", 1);
	move("Right", 1);
}

function move($dir, $count) {
	for($i = 0; $i < $count; $i++) {
		echo "key $dir\nsleep 0.025\n";
		usleep(25000);
	}
}

Then run this in a terminal, and click over to a 2048 game:

sleep 3; php 2048.php | xte
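Since the script above loops forever, a bounded variant may be handier if you plan to take over by hand. The function below is a hypothetical sketch: it returns the commands for a fixed number of Down/Left/Down/Right cycles, and it assumes that xte's usleep command (microseconds) is enough for pacing, instead of sleeping on the PHP side.

```php
<?php
/**
 * Hypothetical bounded variant: build the xte commands for a fixed
 * number of Down/Left/Down/Right cycles and return them as a string.
 */
function mash_commands_sketch(int $cycles, int $delayMicros = 25000): string {
    $out = '';
    for ($i = 0; $i < $cycles; $i++) {
        foreach (['Down', 'Left', 'Down', 'Right'] as $dir) {
            $out .= "key $dir\nusleep $delayMicros\n";
        }
    }
    return $out;
}

// Two cycles, i.e. eight key presses
echo mash_commands_sketch(2);
```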

Good luck!

How to liberate your myki data

myki is the public transport ticketing system in Melbourne. If you register your myki, you can view the usage history online. Unfortunately, you are limited to paging through HTML, or downloading a PDF.

This post will show you how to get your myki history into a CSV file on a GNU/Linux computer, so that you can analyse it with your favourite spreadsheet/database program.

Get your data as PDFs

Firstly, you need to register your myki, log in, and export your history. The web interface seemed to give you the right data if you chose blocks of 1 month.

Export myki data for each month

Once you do this, organise these into a folder filled with statements.

A folder filled with myki statements

You need the pdftotext utility to continue. On Debian, this is in the poppler-utils package.

The manual steps below run you through how to extract the data, and at the bottom of this post there are some scripts I’ve put together to do this automatically.

Manual steps to extract your data

These steps are basically a crash course in "scraping" PDF files.

To convert all of the PDFs to text, run:

for i in *.pdf; do pdftotext -layout -nopgbrk "$i"; done

This preserves the line-based layout. The next step is to filter out the lines which don’t contain data. Each line we’re interested in begins with a date and time, followed by the words “Touch On”, “Touch Off”, or “Top Up”:

18/08/2013 13:41:20   T...

We can filter all of the text files using grep, and a regex to match this:

cat *.txt | grep "^[0-3][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] *T"

The output looks like:
Filtered output, showing data

So what are we looking at?

  1. One row per line
  2. Fields delimited by multiple spaces

To collapse every double-space into a tab, we use unexpand. Then, to collapse duplicate tabs, we use tr:

cat filtered-data.txt | unexpand -t 2 | tr -s '\t'
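The filtering and splitting steps above can also be sketched in PHP. This is a rough equivalent rather than a replacement for the pipeline: the function name and the sample line are made up, and splitting on runs of two or more spaces only approximates what unexpand and tr do with tab stops.

```php
<?php
/**
 * Hypothetical PHP equivalent of the grep + unexpand + tr steps:
 * keep lines that start with a date and time followed by "T...",
 * then split them on runs of two or more spaces.
 */
function parse_myki_line_sketch(string $line): ?array {
    $pattern = '/^[0-3][0-9]\/[0-9]{2}\/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2} *T/';
    if (preg_match($pattern, $line) !== 1) {
        return null; // header, footer or blank line
    }
    return preg_split('/ {2,}/', trim($line));
}

// Made-up sample line; real statements have more columns.
print_r(parse_myki_line_sketch('18/08/2013 13:41:20   Touch On   1/2'));
var_dump(parse_myki_line_sketch('Date       Time       Transaction'));
```

The single space between the date and the time is kept, so both end up in the first field.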

Finally, some fields need to be quoted, and tabs need to be converted to CSV. The PHP script below will do that step.

Scripts to get your data

myki2csv.sh is a script which performs the above manual steps:

#!/bin/bash
# Convert myki history from PDF to CSV
#	(c) Michael Billington < michael.billington@gmail.com >
#	MIT Licence
hash pdftotext || exit 1
hash unexpand || exit 1
pdftotext -layout -nopgbrk "$1" - | \
	grep "^[0-3][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] *T" | \
	unexpand -t2 | \
	tr -s '\t' | \
	./tab2csv.php > "${1%.pdf}.csv"

tab2csv.php is called at the end of the above script, to turn the result into a well-formed CSV file:

#!/usr/bin/env php
<?php
/* Generate well-formed CSV from dodgy tab-delimited data
	(c) Michael Billington < michael.billington@gmail.com >
	MIT Licence */
$in = fopen("php://stdin", "r");
$out = fopen("php://stdout", "w");
while($line = fgets($in)) {
	$a = explode("\t", $line);
	foreach($a as $key => $value) {
		$a[$key]=trim($value);
		/* Quote out ",", and escape "" */
		if(!(strpos($value, "\"") === false &&
				strpos($value, ",") === false)) {
			$a[$key] = "\"".str_replace("\"", "\"\"", $a[$key])."\"";
		}
	}
	$line = implode(",", $a) . "\r\n";
	fwrite($out, $line);
}
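The quoting rule is easier to check in isolation. The function below is a hypothetical standalone version of the test inside the loop above: a field is wrapped in double quotes only if it contains a comma or a double quote, and any double quotes inside it are doubled.

```php
<?php
// Hypothetical standalone version of the quoting rule used in tab2csv.php.
function csv_quote_sketch(string $value): string {
    $value = trim($value);
    if (strpos($value, '"') !== false || strpos($value, ',') !== false) {
        return '"' . str_replace('"', '""', $value) . '"';
    }
    return $value;
}

echo csv_quote_sketch('plain'), "\n";
echo csv_quote_sketch('a,b'), "\n";
echo csv_quote_sketch('say "hi"'), "\n";
```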

Invocation

Call the script on a single foo.pdf to get foo.csv:

./myki2csv.sh foo.pdf

Convert all PDFs to CSV and then join them:

for i in *.pdf; do ./myki2csv.sh "$i"; done
tac *.csv > my-myki-data.csv

Importing into LibreOffice

The first field must be marked as a DD/MM/YYYY date, and the “zones” need to be marked as text (so that “1/2” isn’t treated as a fraction!)

These are my import settings:

Options to import the myki data into LibreOffice

Happy data analysis!

Update 2013-09-18: The -nopgbrk option was added to the above instructions, to prevent page break characters causing grep to skip one valid line per page.

Update 2014-05-04: The code for the above, as well as for this follow-up post, is now available on GitHub.