Importing myki data into GnuCash

GnuCash, despite all its bugs, is one of the best open source accounting programs going around.

Since it is not hard to export the history from a myki (public transport) card, I figured that it would be nice to track it as an account in GnuCash.

The data on the statements is not quite suitable for an accounting program. Some changes that need to be made are:

  • Conversion of the date format to YYYY-MM-DD (a quick sketch of this step follows the list).
  • Single-field descriptions for each transaction (“Top up myki money”, “Travel zone 2”), rather than multiple fields.
  • Entries which have a 0.00 cost need to be removed.
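
That date conversion is the fiddliest of the three changes. Here is a minimal sketch of the same DD/MM/YYYY to YYYY-MM-DD conversion in the shell; the timestamp is a made-up example in the statement's format:

#!/bin/bash
# Convert a hypothetical "DD/MM/YYYY HH:MM:SS" statement field to "YYYY-MM-DD"
raw="18/08/2013 13:41:20"
IFS=/ read -r day month year <<< "${raw%% *}"
echo "$year-$month-$day" # prints 2013-08-18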

Once you have a CSV of your data, the script below will filter it to be ready for import:

#!/usr/bin/env php
<?php
/* Dump Myki CSV file to a file suitable for gnucash
	(c) Michael Billington < michael.billington@gmail.com >
	MIT Licence */
$in = fopen("php://stdin", "r");
$out = fopen("php://stdout", "w");
$err = fopen("php://stderr", "w");
$lc = 0;

while($line = fgets($in)) {
	$lc++;
	$a = str_getcsv($line, ',', '"');
	if(count($a) == 8) {
		$date = implode("-", array_reverse(explode("/", substr($a[0], 0, strpos($a[0], " ")))));
		$credit = $a[5] == "-" ? "" : $a[5];
		if($credit != "") { // Probably a top-up or reimbursement
			$description = $a[1];
		} else if($a[3] == "-") { // Probably buying myki pass
			$description = trim($a[1], "*");
		} else { // Probably travel charges
			$description = "Travel: " . $a[2] . ", Zone " . $a[3];
		}
		$debit = $a[6] == "-" ? "" : $a[6];
		$balance = $a[7] == "-" ? "" : $a[7];
		if($balance != "") { // Ignore non-charge entries
			fputcsv($out, array($date, $description, $credit, $debit, $balance));
		}
	}
}
fclose($in);
fclose($out);
fclose($err);
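
Assuming the filter is saved as myki-gnucash.php (the filename is my own choice) and your exported history is sitting in my-myki-data.csv, as produced in the post below, it reads standard input and writes standard output:

chmod +x myki-gnucash.php
./myki-gnucash.php < my-myki-data.csv > gnucash-import.csv

The resulting gnucash-import.csv can then be pulled in via GnuCash's CSV import.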

Why would you track it as an account?

There are lots of reasons why a public transport card is account-like enough to put into GnuCash:

  • Money is not spent until you touch on and off. Note that no GST is payable until the card balance is used, so recording a top-up directly under Expenses is not quite correct.
  • You can cancel a myki and have its balance moved to another card.
  • A card can be handed in, and the balance paid back to you as cash.

Update 2014-05-04: All of these myki-related scripts are now available on GitHub.

How to liberate your myki data

myki is the public transport ticketing system in Melbourne. If you register your myki, you can view the usage history online. Unfortunately, you are limited to paging through HTML, or downloading a PDF.

This post will show you how to get your myki history into a CSV file on a GNU/Linux computer, so that you can analyse it with your favourite spreadsheet/database program.

Get your data as PDFs

Firstly, you need to register your myki, log in, and export your history. The web interface seemed to give the right data when I chose blocks of one month.

Export myki data for each month

Once you do this, organise these into a folder filled with statements.

A folder filled with myki statements

You need the pdftotext utility to proceed. On Debian, this is in the poppler-utils package.
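
If it is not installed yet, that package is all you need:

sudo apt-get install poppler-utils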

The manual steps below run you through how to extract the data, and at the bottom of this post there are some scripts I’ve put together to do this automatically.

Manual steps to extract your data

These steps are basically a crash course in "scraping" PDF files.

To convert all of the PDFs to text, run:

for i in *.pdf; do pdftotext -layout -nopgbrk "$i"; done

This preserves the line-based layout. The next step is to filter out the lines which don’t contain data. Each line we’re interested in begins with a date and time, followed by “Touch On”, “Touch Off”, or “Top Up”:

18/08/2013 13:41:20   T...

We can filter all of the text files using grep, with a regex to match this:

cat *.txt | grep "^[0-3][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] *T"
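
With extended regular expressions (grep -E), the same pattern can be written more compactly using repetition counts:

cat *.txt | grep -E '^[0-3][0-9]/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2} *T'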

The output looks like this:
Filtered output, showing data

So what are we looking at?

  1. One record per line
  2. Fields delimited by multiple spaces

To collapse every double-space into a tab, we use unexpand. Then, to collapse duplicate tabs, we use tr:

cat filtered-data.txt | unexpand -t 2 | tr -s '\t'
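
To see what this does, feed the pipeline a made-up line. Runs of two or more spaces become tabs, which tr then squeezes down to a single tab, while the single space inside “Touch on” is left alone:

printf 'Touch on    Flinders Street\n' | unexpand -t 2 | tr -s '\t'

This prints “Touch on”, one tab, then “Flinders Street”.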

Finally, some fields need to be quoted, and the tabs need to be converted to commas. The PHP script below will do that step.

Scripts to get your data

myki2csv.sh is a script which performs the above manual steps:

#!/bin/bash
# Convert myki history from PDF to CSV
#	(c) Michael Billington < michael.billington@gmail.com >
#	MIT Licence
hash pdftotext || exit 1
hash unexpand || exit 1
pdftotext -layout -nopgbrk "$1" - | \
	grep "^[0-3][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] *T" | \
	unexpand -t2 | \
	tr -s '\t' | \
	./tab2csv.php > "${1%.pdf}.csv"

tab2csv.php is called at the end of the above script, to turn the result into a well-formed CSV file:

#!/usr/bin/env php
<?php
/* Generate well-formed CSV from dodgy tab-delimited data
	(c) Michael Billington < michael.billington@gmail.com >
	MIT Licence */
$in = fopen("php://stdin", "r");
$out = fopen("php://stdout", "w");
while($line = fgets($in)) {
	$a = explode("\t", $line);
	foreach($a as $key => $value) {
		$a[$key]=trim($value);
		/* Quote out ",", and escape "" */
		if(!(strpos($value, "\"") === false &&
				strpos($value, ",") === false)) {
			$a[$key] = "\"".str_replace("\"", "\"\"", $a[$key])."\"";
		}
	}
	$line = implode(",", $a) . "\r\n";
	fwrite($out, $line);
}
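
To sanity-check the quoting, feed it a made-up tab-delimited line containing both a comma and a quote:

printf 'a\tsays "hi", ok\n' | ./tab2csv.php

The second field comes out as "says ""hi"", ok": quoted, with the embedded quotes doubled, as CSV requires.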

Invocation

Call the script on a single foo.pdf to get foo.csv:

./myki2csv.sh foo.pdf

Convert all of the PDFs to CSV, then join them. tac is used rather than cat so that the lines of each statement come out in chronological order:

for i in *.pdf; do ./myki2csv.sh "$i"; done
tac *.csv > my-myki-data.csv
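
If everything worked, my-myki-data.csv now holds one record per touch on, touch off, or top-up, and a quick look at the first few lines will confirm the format:

head my-myki-data.csv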

Importing into LibreOffice

The first field must be marked as a DD/MM/YYYY date, and the “zones” need to be marked as text (so that “1/2” isn’t treated as a fraction!).

These are my import settings:

Options to import the myki data into LibreOffice

Happy data analysis!

Update 2013-09-18: The -nopgbrk option was added to the above instructions, to prevent page-break characters from causing grep to skip one valid line per page.

Update 2014-05-04: The code for the above, as well as this follow-up post, is now available on GitHub.