How to liberate your myki data


myki is the public transport ticketing system in Melbourne. If you register your myki, you can view the usage history online. Unfortunately, you are limited to paging through HTML, or downloading a PDF.

This post will show you how to get your myki history into a CSV file on a GNU/Linux computer, so that you can analyse it with your favourite spreadsheet/database program.

Get your data as PDFs

Firstly, you need to register your myki, log in, and export your history. The web interface seemed to give you the right data if you chose blocks of 1 month.

Export myki data for each month

Once you have done this, collect the downloaded statements in a single folder.

A folder filled with myki statements

To continue, you need the pdftotext utility. In Debian, this is in the poppler-utils package.

The manual steps below run you through how to extract the data, and further down the page there are some scripts I’ve put together to do this automatically.

Manual steps to extract your data

These steps are basically a crash course in "scraping" PDF files.

To convert all of the PDFs to text, run:

for i in *.pdf; do pdftotext -layout -nopgbrk "$i"; done

This preserves the line-based layout. The next step is to filter out the lines which don’t contain data. Each line we’re interested in begins with a date and time, followed by the words “Touch On”, “Touch Off”, or “Top Up”:

18/08/2013 13:41:20   T...

We can filter all of the text files using grep, and a regex to match this:

cat *.txt | grep "^[0-3][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] *T"

The output looks like:
Filtered output, showing data
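To sanity-check the pattern before pointing it at real statements, it can be tried on a few made-up lines. This is only a sketch: the function name and the sample lines are invented for illustration, and only the first sample has the shape of a real data row.

```shell
# Keep only lines that start with a date and time, followed by a word
# beginning with "T". Same pattern as above; the function name is hypothetical.
filter_myki_lines() {
    grep "^[0-3][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] *T"
}

# Made-up input: one data row, one heading, one page footer
printf '%s\n' \
    '18/08/2013 13:41:20   Touch On' \
    'Date       Time       Transaction Type' \
    '   Page 1 of 3' | filter_myki_lines
```

Only the first line should come out the other end, since the other two do not start with a date.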

So what are we looking at?

  1. One row per line
  2. Fields delimited by multiple spaces

Assuming the filtered output was saved to filtered-data.txt, we use unexpand to turn every double-space into a tab, then tr to collapse runs of duplicate tabs:

cat filtered-data.txt | unexpand -t 2 | tr -s '\t'
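unexpand works on tab stops, so its exact output depends on which column each run of spaces happens to sit in. As an alternative sketch (a swapped-in technique, not what the scripts below do), awk can replace every run of two or more spaces with a single tab directly. The function name and the sample line are made up.

```shell
# Replace each run of two or more spaces with one tab. Single spaces,
# such as the one inside "Touch On", are left alone.
spaces_to_tabs() {
    awk '{ gsub(/  +/, "\t"); print }'
}

# Made-up sample row with three fields separated by multiple spaces
printf '%s\n' '18/08/2013 13:41:20   Touch On   1/2' | spaces_to_tabs
```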

Finally, some fields need to be quoted, and tabs need to be converted to CSV. The PHP script below will do that step.

Scripts to get your data

myki2csv.sh is a script which performs the above manual steps:

#!/bin/bash
# Convert myki history from PDF to CSV
#	(c) Michael Billington < michael.billington@gmail.com >
#	MIT Licence
hash pdftotext || exit 1
hash unexpand || exit 1
# Quote the filename so that paths containing spaces still work
pdftotext -layout -nopgbrk "$1" - | \
	grep "^[0-3][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] *T" | \
	unexpand -t2 | \
	tr -s '\t' | \
	./tab2csv.php > "${1%.pdf}.csv"

tab2csv.php is called at the end of the above script, to turn the result into a well-formed CSV file:

#!/usr/bin/env php
<?php
/* Generate well-formed CSV from dodgy tab-delimited data
	(c) Michael Billington < michael.billington@gmail.com >
	MIT Licence */
$in = fopen("php://stdin", "r");
$out = fopen("php://stdout", "w");
while($line = fgets($in)) {
	$a = explode("\t", $line);
	foreach($a as $key => $value) {
		$a[$key]=trim($value);
		/* Quote out ",", and escape "" */
		if(!(strpos($value, "\"") === false &&
				strpos($value, ",") === false)) {
			$a[$key] = "\"".str_replace("\"", "\"\"", $a[$key])."\"";
		}
	}
	$line = implode(",", $a) . "\r\n";
	fwrite($out, $line);
}
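The quoting rule that the PHP script applies to each field can also be sketched in shell, which is handy for checking a single value by hand. This is an illustration only: the function name and sample values are made up, and it handles one field at a time rather than a whole line.

```shell
# Quote a single CSV field: wrap it in double quotes if it contains a
# comma or a double quote, and double up any embedded double quotes.
csv_quote() {
    case "$1" in
        *\"*|*,*)
            printf '"%s"' "$(printf '%s' "$1" | sed 's/"/""/g')"
            ;;
        *)
            printf '%s' "$1"
            ;;
    esac
}

# Made-up field containing a comma, so it comes back wrapped in quotes
csv_quote 'Flinders Street, City'
echo
```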

Invocation

Call script on a single foo.pdf to get foo.csv:

./myki2csv.sh foo.pdf

Convert all PDFs to CSV and then join them:

for i in *.pdf; do ./myki2csv.sh "$i"; done
tac *.csv > my-myki-data.csv

Importing into LibreOffice

The first field must be marked as a DD/MM/YYYY date, and the “zones” need to be marked as text (so that “1/2” isn’t treated as a fraction!)

These are my import settings:

Options to import the myki data into LibreOffice

Happy data analysis!

Update 2013-09-18: The -nopgbrk option was added to the above instructions, to prevent page break characters causing grep to skip one valid line per page
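The effect is easy to reproduce on a made-up row. The sketch below assumes the page break character is a form feed, which then sits in front of the date on the first line of each page and defeats the ^ anchor in the pattern. The variable and function names are invented.

```shell
# Made-up data row and the same pattern used by the scripts above
row='18/08/2013 13:41:20   Touch On'
pattern="^[0-3][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] *T"

# Count matching lines on stdin. grep -c exits non-zero when nothing
# matches, so that status is masked to keep the helper script-friendly.
count_rows() {
    grep -c "$pattern" || true
}

# With a leading form feed the row is skipped
printf '\f%s\n' "$row" | count_rows

# With the form feed stripped the row is found again
printf '\f%s\n' "$row" | tr -d '\f' | count_rows
```

The first count is 0 and the second is 1.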

Update 2014-05-04: The code for the above, as well as this follow-up post, is now available on GitHub.

Why you should disable IPv6 on Windows

This post is mainly in response to (what is in my opinion) a piece of misinformation which I stumbled across today in this blog:

If you are running any Windows computer on an un-trusted network, then it is probably wide open to CVE-2010-4669. This means that a few thousand dodgy ICMPv6 packets could fill up its memory until it keels over and needs to be rebooted.

I’m not an advocate of Windows on servers, but it exists and can be made to crash less. If you don’t need IPv6, because you are behind an IPv4 NAT for example, you can just switch it off and bypass Microsoft’s poorly designed implementation altogether. To that end, here is a nice article that will get you deploying .reg files for that in a few minutes.

This is easy and I would recommend it. Contrary to the article above, your computer will work fine on an IPv4 network without IPv6. If disabling IPv6 breaks some application, then it probably wouldn’t have worked properly on your network anyway. What’s important is that the computer works!

A solid Windows firewall configuration will also solve this, but involves leaving the vulnerable stack running. This is a weaker security compromise, as it assumes that you will actually cover every possible attack scenario in your firewall rules.

QJoyPad coolness

I got a USB SNES-controller imitation from the Internet a while back for controlling a missile launcher, and recently decided to re-purpose it for controlling a GNU/Linux computer. After all, VLC is great, but plugging in a keyboard is not so great!

The gamepad is apparently a USB joystick in disguise. From lsusb:

Bus 003 Device 004: ID 0079:0011 DragonRise Inc. Gamepad

The only packaged program for doing this in Debian was joy2key. It was too cryptic for me to figure out in <5 minutes, so I tossed it. Google turned up xjoypad, jkeys and jscal as suggestions, but QJoyPad looked the most promising, and is as simple as a program should be.

To compile it, you need the Qt development libraries and an X library, packaged in Debian as libxtst-dev.

The profile in the screenshot (called “VLC”) controls the mouse, pauses, adjusts volume, and toggles fullscreen. It works well enough for media and web browsing, as long as you don’t need to type anything!

Bugs noticed:

  • Can’t set a button to do Ctrl+<key>, only the key on its own.

Backing up from a hosting provider

Backups are great, and they’re not rocket science. I’m writing up how we do backups, not because I think it’s a cool or unique setup (because it’s not), but to highlight how effective a simple solution can be.

We use rsync to take a local copy of whatever is on our web host without wasting bandwidth downloading files that aren’t needed. The layout looks like this:

Our hosting provider is accessible via ssh, and the backup box we use is a Raspberry Pi model B, costing (more or less) 50 AUD to get running.

On the server

On the server, we back up databases with mysqldump. To do this, you need to enter user details into a .my.cnf file, and then something like this will do the trick:

#!/bin/sh
# Remove old dump
rm -f database.sql.gz

# Dump and compress database
mysqldump -h sql.example.com --all-databases > database.sql
gzip database.sql

The above script is called database-dump.sh, and is called from the backup box, to dump the databases to a file before grabbing all the files.
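One thing to watch in that script is that the old dump is removed before the new one exists, so a failed mysqldump leaves you with nothing. A possible variation is to dump to a temporary file and only replace the old archive on success. This is a sketch that assumes the dump command writes to standard output. The function name and temporary file naming are invented here.

```shell
# Usage: safe_dump OUTPUT.gz COMMAND [ARGS...]
# Runs COMMAND, compresses what it prints, and only then replaces OUTPUT.gz.
safe_dump() {
    out="$1"
    shift
    tmp="$out.tmp.$$"
    if "$@" > "$tmp" && gzip -c "$tmp" > "$tmp.gz"; then
        # Success: swap the new archive into place
        mv "$tmp.gz" "$out"
        rm -f "$tmp"
    else
        # Failure: clean up and leave any previous archive untouched
        rm -f "$tmp" "$tmp.gz"
        return 1
    fi
}

# Example call, using the same mysqldump arguments as the script above:
# safe_dump database.sql.gz mysqldump -h sql.example.com --all-databases
```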

On the backup box

First, a script to get the files. You should use password-less login with ssh-copy-id for this to work non-interactively:

#!/bin/sh
# Update the database dump
ssh user@host.example.com './database-dump.sh'
# Get files
rsync -avz --delete-during user@host.example.com:/home/user .

We save a copy of the files at this date in a dated archive, so we can back-date to find deleted things. At the end of the above script:

mkdir -p archive
now=$(date +"%Y-%m-%d")
tar -czf archive/backup-$now.tar.gz user

There aren’t a huge number of changes to record daily, so we got cron to run the above script weekly on the backup box. Read man crontab for how to do this.
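For reference, a weekly crontab entry has five time fields followed by the command. The small helper below just prints such a line so it can be pasted into crontab -e. The helper name, the script path and the choice of Sunday at 03:00 are all placeholders, not our actual schedule.

```shell
# Print a crontab entry that runs a script once a week (Sunday at 03:00).
# Fields: minute hour day-of-month month day-of-week command
weekly_cron_entry() {
    printf '0 3 * * 0 %s\n' "$1"
}

# Hypothetical location of the backup script on the backup box
weekly_cron_entry /home/pi/backup.sh
```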

What backup is not

If you think you shouldn’t be doing backups, you’re wrong. The following are not good excuses:

  1. Trust: Whoever is looking after the data won’t lose it.
    Our host is pretty good, but their terms of service say they won’t be responsible for any data loss. Even providers which have support agreements can make mistakes. You’ll also be able to work faster if you’re not paranoid about any mistake being unrecoverable.
  2. Expense: It’s a nice idea but not worth it.
    It’s dirt cheap, you can learn to do it yourself, and once set up it requires virtually no administration. If your organisation can’t afford some kind of backup solution, then it should probably stop using data in any form.
  3. RAID: I invested money in RAID, so I don’t need backups.
    If you accidentally delete something, or notice that some of your files have been tampered with, then RAID will not help you. If there is a problem (e.g. fire) at the hosting location, then you will be in trouble regardless of disk redundancy.

Debian & XFCE quirks on Toshiba NB550D

Today I re-installed Debian wheezy on my Toshiba netbook and realised how useful it might be to collate the hurdles into one tidy reference blog post (to save looking everything up next time).

This just covers everything I had to configure or work around to get a working setup.

Install & hardware issues

From Linux, use dd to write your disk image onto a flash drive:

dd if=debian-wheezy-DI-rc1-amd64-netinst.iso of=/dev/sdX bs=4M

If you don’t know your flash drive device, then locate ‘Disk Utility’ or use sudo fdisk -l and choose the likely candidate.

Now boot up the netbook. If you’ve disabled the splash screen, then F12 will get you a boot menu and F2 will let you enable USB booting (if you don’t see the flash drive).

The installer gives you a warning about needing non-free firmware. You can safely ignore this; it’s just Bluetooth.

When you get the option to, install the OpenSSH server. You will have graphics issues later, and your computer will be next to useless if you don’t have some way to log in.

Follow the installer as usual, and boot into the new system.

Graphics

From GNOME, everything initially worked okay out of the box for me, but logging out would predictably corrupt the graphics like so:

Pro-tip(TM): Write down your IP address before this happens and follow the rest of the steps via ssh.

These steps on the Debian wiki suggest getting xserver-xorg-video-radeon and xserver-xorg-video-ati, but they are already installed (and xserver-xorg-video-radeonhd does not appear to exist in wheezy). The free firmware also didn’t work for me:

root@mikebook:~# apt-get install firmware-linux-free

So it looks like we need firmware-linux-nonfree, which means we need to allow non-free packages. Edit the end of each line in /etc/apt/sources.list to add contrib and non-free:
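If you would rather not edit the file by hand, a sed one-liner can do roughly the same thing. This is a sketch that assumes every package line starts with deb or deb-src and currently ends in main. The function name is made up, and it prints the result instead of changing the file so you can review it first.

```shell
# Print FILE with "contrib non-free" appended to every deb/deb-src line
# that currently ends in "main". Other lines pass through unchanged.
add_nonfree() {
    sed -e '/^deb/ s/ main[[:space:]]*$/ main contrib non-free/' "$1"
}

# Example (as root): write to a new file, check it, then move it into place
# add_nonfree /etc/apt/sources.list > /tmp/sources.list.new
```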

After this, update your package list, install the non-free firmware, and restart X (rebooting is shown here but not really necessary):

apt-get update
apt-get install firmware-linux-nonfree
reboot

Next time you log in, GNOME will report that it is running in full-bloat mode, which is a good sign. If you still have issues, then the output of lspci is what you need to google:

02:00.0 VGA compatible controller: Advanced Micro Devices [AMD] nee ATI RV710 [Radeon HD 4350] (prog-if 00 [VGA controller])

Touchpad / two-finger scroll

The laptop’s Synaptics touchpad will work just fine on the default settings. I only wrote this up because the version of XFCE in wheezy has no options for tapping, two-finger scroll, or other fancy things (unlike the screenshots on xfce.org).

GNOME will let you set up per-user mouse preferences, but these don’t affect gdm (the login screen), so you can’t tap the login buttons (how annoying!)

The solution is to configure the mouse using Xorg’s configuration. The Debian wiki page on the topic gives an example file to dump in /etc/X11/xorg.conf.d/. I swapped two values to get two-finger right-click:

Section "InputClass"
        Identifier      "Touchpad"                      # required
        MatchIsTouchpad "yes"                           # required
        Driver          "synaptics"                     # required
        Option          "MinSpeed"              "0.5"
        Option          "MaxSpeed"              "1.0"
        Option          "AccelFactor"           "0.075"
        Option          "TapButton1"            "1"
        Option          "TapButton2"            "3"     # multitouch
        Option          "TapButton3"            "2"     # multitouch
        Option          "VertTwoFingerScroll"   "1"     # multitouch
        Option          "HorizTwoFingerScroll"  "1"     # multitouch
        Option          "VertEdgeScroll"        "1"
        Option          "CoastingSpeed"         "8"
        Option          "CornerCoasting"        "1"
        Option          "CircularScrolling"     "1"
        Option          "CircScrollTrigger"     "7"
        Option          "EdgeMotionUseAlways"   "1"
        Option          "LBCornerButton"        "8"     # browser "back" btn
        Option          "RBCornerButton"        "9"     # browser "forward" btn
EndSection

After saving this file, reboot or run killall Xorg as root.

Sleep

I haven’t investigated this properly, but I would steer clear of suspend-to-RAM, and set your power settings to hibernate (i.e. suspend-to-disk) instead. This is one error you might get on wake (if you are lucky enough to get a display after it wakes):

This is not a kernel thing, as a no-X install can pm-suspend without issue.

Sudo

By default, the user you create during the Debian setup is not in the sudo group. To change this:

su
adduser joebloggs sudo

You need to log out and then back in again for this to take effect in your session.
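To confirm the change, you can check the group list that id reports for the user. A minimal sketch, with a made-up helper name:

```shell
# Succeed if USER is a member of GROUP, according to `id`
in_group() {
    for g in $(id -Gn "$1"); do
        [ "$g" = "$2" ] && return 0
    done
    return 1
}

# Example, using the user name from above:
# in_group joebloggs sudo && echo "joebloggs is in the sudo group"
```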

Making XFCE more useable

XFCE was my desktop of choice, so at this point, you can either stop reading or run this:

apt-get install xfce4 xfce4-goodies

Appearance

Because it runs GTK-2 and not GTK-3, GNOME apps will look ugly beside XFCE apps if you don’t choose settings which work well for both toolkits. I chose these ones but there are other good combinations:

  • Window Manager -> Theme: Default-4.6
  • Appearance -> Style: Anquita

If you use it, then you should open gnome-terminal now. It defaults to black-on-black under XFCE, which you will want to swap out for something less stupid.

Replacing Thunar with Nautilus

Thunar is great, but Nautilus is more familiar to me, and can easily be set up as the preferred file browser:

  • Preferred Applications -> Utilities -> File Manager: Nautilus

Thunar will hold onto your desktop unless you remove it from Session and Startup (tab over to ‘Session’ and delete xfdesktop)

To tell Nautilus to handle the desktop, install gnome-tweak-tool, and check the box labelled ‘Have file manager handle the desktop’. Next time you start Nautilus, it will give you a working desktop.

Disable screensavers

XFCE has some very cool screensavers, but personally I think this part of desktop computing is a bit last-century:

  • Settings -> Screensaver: Blank screen only

The program XScreenSaver itself is a bit of an eyesore. If you don’t like it, this forum post has some suggestions for alternatives.

Getting a calendar

The default clock on the panel is not clickable. Simply remove it and add the ‘DateTime’ widget — This can show a clock with a drop-down calendar, which is basically standard.

Getting ‘Print Screen’ to work

XFCE makes this super easy to set up (once you turn up this thread on google):

  • Settings manager -> Keyboard -> Application shortcuts

Add a new shortcut to this command:

xfce4-screenshooter -f

Then hit Print, and you should get this:

External monitor

When you use an external monitor and switch off the laptop display, you can get stuck without a screen if you pull out the cable! The XFCE screen-switching app (mapped to Fn-F5 on my keyboard) is not really navigable by keyboard, so I added this shortcut as well:

The command xrandr --auto will switch on any connected monitor with a sane default resolution, fixing your display without rebooting.

Update 2013-03-15: I changed this to Shift+Alt+F5, because some programs use the above shortcut, rendering it useless when said programs have focus.

qtHiero: Open-source Egyptian hieroglyph editor

I’m just starting out with Qt4 and C++ and came up with this semi-useful little tool for marking up Egyptian hieroglyphs in MdC.

So far the only annoying Qt-quirk I’ve found is the lack of support for non-BMP unicode characters in the QChar type. Turns out you need to use a QString with two QChars, which is exactly the situation which QChar is supposed to solve (by being larger than 8 bits so that there is a 1-1 correspondence between written characters and QChars in a string).

The unfortunate hack I had to put in for fetching a hieroglyph from a codepoint looked like this:

/**
 * Return a QString from a unicode code-point
 **/
QString MainWindow :: unicode2qstr(uint32_t character) {
	if(0x10000 > character) {
		/* BMP character. */
		return QString(QChar(character));
	} else if (character <= 0x10FFFF) {
		/* Non-BMP character, return surrogate pair */
		unsigned int code;
		QChar glyph[2];
		code = (character - 0x10000);
		glyph[0] = QChar(0xD800 | (code >> 10));
		glyph[1] = QChar(0xDC00 | (code & 0x3FF));
		return QString(glyph, 2);
	}
	/* character > 0x10FFFF: not a valid code point */
	return QString("");
}
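The surrogate-pair arithmetic is easy to check outside of Qt. The sketch below repeats the same calculation with shell arithmetic. The function name is made up, and it assumes the input is already above U+FFFF.

```shell
# Print the UTF-16 surrogate pair (in hex) for a code point above U+FFFF
surrogate_pair() {
    code=$(( $1 - 0x10000 ))
    high=$(( 0xD800 | (code >> 10) ))
    low=$(( 0xDC00 | (code & 0x3FF) ))
    printf '%04X %04X\n' "$high" "$low"
}

# U+13000 is the first code point in the Egyptian Hieroglyphs block
surrogate_pair 0x13000
```

This prints D80C DC00, which is the pair of QChars that ends up in the QString.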

The Qt developer tools get a 10/10 from me though. I say this mainly because glade runs like a slug at the best of times.

Making an XKCD-style password generator in C++

I’m learning C++ at the moment, and I don’t find long tutorials or studying the standard template library particularly fun.

Making this type of password-generator is not new, but it is a nice practical exercise to start out in any language.

1. Get a list of common English words

Googling “common English words” yielded this list, purporting to contain 5,000 words. Unfortunately it contains almost 1,000 duplicates and numerous non-words! Wiktionary has a much higher-quality list of words compiled from Project Gutenberg, but the markup looks a bit like this:

==== 1 - 1000 ====
===== 1 - 100 =====
[[the]] = 56271872
[[of]] = 33950064
[[and]] = 29944184
[[to]] = 25956096
[[in]] = 17420636
[[I]] = 11764797  

Noting the wikilinks surrounding each word, I put together this PHP script to extract the link destinations and called it get-wikilinks.php:

#!/usr/bin/php
<?php
/* Return list of wikilinked words from input text */
$text = explode("[[", file_get_contents("php://stdin"));
foreach($text as $link) {
	$rbrace = strpos($link, "]]");
	if($rbrace !== false) {
		/* Also escape on [[foo|bar]] links */
		$pipe = strpos($link, "|");
		if($pipe !== false && $pipe < $rbrace) {
			$rbrace = $pipe;
		}
		$word = trim(substr($link, 0, $rbrace));
		if($word !== "" && strpos($word, "'") === false && !is_numeric(substr($word, 0, 1))) {
			/* Leave out empty words, words with apostrophes, or words starting with numbers */
			echo $word . "\n";
		}
	}
}

The output of this script is much more workable:

$ chmod +x get-wikilinks.php
$ cat wikt.txt | ./get-wikilinks.php
the
of
and
to
in
I

Using sort and uniq makes a top-notch list of common words, ready for an app to digest:

$ cat wikt.txt | ./get-wikilinks.php | sort | uniq > wordlist.txt
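The same tools can also tell you how bad a word list is before you clean it. As a rough sketch, uniq -d prints one copy of every line that occurs more than once, so counting its output gives the number of distinct words that are duplicated. The function name and sample words are made up.

```shell
# Count how many distinct lines on stdin appear more than once
count_duplicated_words() {
    sort | uniq -d | wc -l | tr -d ' '
}

# Made-up sample: "the" and "of" are repeated, "and" is not
printf '%s\n' the of the and of the | count_duplicated_words
```

For the sample above the count is 2. Note that sort -u gives the same result as sort | uniq if you want to save a process.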

2. Write some C++

There are two problems being solved here:

  • Reading a file into memory
    • An ifstream is used to access the file, and getline() will return false when EOF has been reached
    • Each line is loaded into a vector (roughly the same type of container as an ArrayList in Java), which is resized dynamically and accessed like an array.
  • Choosing random numbers
    • These are seeded from a random_device, being more cross-platform than reading from a file like /dev/urandom.
    • Note that the <random> header is new in C++11.

pw.cpp

#include <fstream>
#include <vector>
#include <string>
#include <iostream>
#include <random>
#include <cstdlib>

using namespace std;

int main(int argc, char* argv[]) {
    const char* fname = "wordlist.txt";

    /* Parse command-line arguments */
    int max = 1;
    if(argc == 2) {
        max = atoi(argv[1]);
    }

    /* Open word list file */
    ifstream input;
    input.open(fname);
    if(input.fail()) {
        cerr << "ERROR: Failed to open " << fname << endl;
        return 1;
    }

    /* Read to end and load words */
    vector<string> wordList;
    string line;
    while(getline(input, line)) {
        wordList.push_back(line);
    }

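    /* An empty list would make the distribution range invalid */
    if(wordList.empty()) {
        cerr << "ERROR: No words found in " << fname << endl;
        return 1;
    }

    /* Seed from random device */
    random_device rd;
    default_random_engine gen;
    gen.seed(rd());
    uniform_int_distribution<size_t> dist(0, wordList.size() - 1);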

    /* Output as many passwords as required */
    const int pwLen = 4;
    int i, j;
    for(i = 0; i < max; i++) {
        for(j = 0; j < pwLen; j++) {
            cout << wordList[dist(gen)] << ((j != pwLen - 1) ? " " : "");
        }
        cout << endl;
    }

    return 0;
}
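How strong are the resulting passwords? Assuming each word is picked uniformly and independently, a password of k words from a list of n words carries k × log2(n) bits. The helper below is a sketch of that arithmetic. Its name and the list size of 2048 are made up, and the figure is best treated as an upper bound, since the generator above is seeded from a single random_device value and default_random_engine is not a cryptographic generator.

```shell
# Bits of entropy for COUNT words drawn uniformly and independently
# from a list of LIST_SIZE words: COUNT * log2(LIST_SIZE)
entropy_bits() {
    awk -v n="$1" -v k="$2" 'BEGIN { printf "%.1f\n", k * log(n) / log(2) }'
}

# Hypothetical list of 2048 words, 4 words per password
entropy_bits 2048 4
```

This prints 44.0. To use your real list size, pass the line count of wordlist.txt as the first argument.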

3. Compile

Lots of projects in compiled languages have a Makefile, so that you can compile them without having to type all the compiler options manually.

Makefiles are a bit heavy to learn properly, but for a project this tiny, something simple is fine:

default:
	g++ pw.cpp -o pw -std=c++11

clean:
	rm -f pw

Now we can compile and run the generator:

make
./pw

The output looks like this for ./pw 30 ("generate 30 passwords"):

Downtime

Looks like one of Facebook’s webservers took a nap yesterday.

When they designed this error page, I wonder if they realised that both the ‘Help’ and ‘Go back’ links give you the same error again.

Pyrocket and Ubuntu

I have a great USB rocket launcher; it’s more useful than a computer mouse most of the time, actually. I spotted a moth on the roof the other day, and hadn’t installed pyrocket on this computer yet.

A quick apt-get install pyrocket is all it takes to solve that though, right? No such luck.

Apparently, the quality control in the Ubuntu repos is such that this package has been broken for several months now, despite the dependency issue being fixed by an upstream fork.

So this is the error you get at the moment anyway:

Traceback (most recent call last):
  File "/usr/bin/pyrocket", line 17, in <module>
    from rocket_frontend import RocketWindow
  File "/usr/lib/pymodules/python2.7/rocket_frontend.py", line 11, in <module>
    from rocket_webcam import VideoWindow
  File "/usr/lib/pymodules/python2.7/rocket_webcam.py", line 2, in <module>
    from opencv import cv, highgui
ImportError: No module named opencv

The solution, beyond complaining about it, is to read the bug report here and do this:

git clone https://github.com/stadler/pyrocket
cd pyrocket/src
./pyrocket.py

But the moth had escaped by then.