Tag Archives: data

What’s in the AEC data feed, and how to use it

With an election coming up, it’s probably a good time to post some notes about using Australian election data feeds. This is mainly aimed at any interested programmers who have a use for this sort of data.

I’ve left out any example code here, as what you write will be specific to how you use the data.

Overview

The AEC publishes results on election night, by simply posting zipped files on an FTP server every few minutes. The detailed documentation is at aec.gov.au.

I have experimented with loading the data into a database during a recent by-election with some success. The data model that I assembled for this was:

2016-06-data-model-aec-feed

You will be able to locate many electoral concepts in this data model, as candidates with a particular affiliation receive votes (contest eachother) at different polling places for a a seat in one of the houses. This maps to the more extensive XML format to extract the fields that I was most interested in mapping.

Getting the data

The FTP server is mediafeed.aec.gov.au, and it accepts anonymous login on election night.

Here you will find a numbered folder for the night’s election. Within this, there is a lot of repetition, so you can ignore most of the files straight away. The sub-folders that I suggest loading are Detailed/LightProgress and Detailed/Preload. For example, in a past election, the file layout was:

18126/
18126/Detailed
18126/Detailed/LightProgress
18126/Detailed/LightProgress/aec-mediafeed-Detailed-LightProgress-18126-20150919194400.zip
18126/Detailed/LightProgress/aec-mediafeed-Detailed-LightProgress-18126-20150919194527.zip
18126/Detailed/LightProgress/aec-mediafeed-Detailed-LightProgress-18126-20150919194702.zip
18126/Detailed/LightProgress/aec-mediafeed-Detailed-LightProgress-18126-20150911140244.zip
18126/Detailed/LightProgress/aec-mediafeed-Detailed-LightProgress-18126-20150919195500.zip
18126/Detailed/LightProgress/...
18126/Detailed/Preload
18126/Detailed/Preload/aec-mediafeed-Detailed-Preload-18126-20150911140243.zip

Loading the data

Start with the Preload data. This zip file is available first, and contains the candidate names, polling places, event details, and an initial (zeroed-out) results feed. Using a custom script and a schema like the one I’ve posted above, you will be able to import this into your database.

Next, the LightProgress feed updates will start being added every few minutes from 7pm. These Zip files contain a small piece of XML, each one superseding older files.

I’ve cut down one of these files as an example:

<?xml version="1.0" encoding="utf-8"?>
<MediaFeed Id="e9334806-990e-4773-8b00-7d74fa58b6af" Created="2015-09-19T19:55:00" SchemaVersion="3" EmlVersion="5" xmlns="http://www.aec.gov.au/xml/schema/mediafeed" xmlns:eml="urn:oasis:names:tc:evs:schema:eml" xmlns:ds="http://www.w3.org/2000/09/xmldsig#" xmlns:xal="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0" xmlns:xnl="urn:oasis:names:tc:ciq:xsdschema:xNL:2.0" xmlns:ts="urn:oasis:names:tc:evs:schema:eml:ts" xmlns:xs="http://www.w3.org/2001/XMLSchema-instance" xs:schemaLocation="http://www.aec.gov.au/xml/schema/mediafeed ../Schema/AEC/aec-mediafeed-results-v3-0.xsd">
  <ManagingAuthority>
    <eml:AuthorityIdentifier Id="AEC">Australian Electoral Commission</eml:AuthorityIdentifier>
  </ManagingAuthority>
  <MessageLanguage>en</MessageLanguage>
  <MessageGenerator>
    <Name>Virtual Tally Room</Name>
    <Environment>PROD</Environment>
    <Site>CDC</Site>
    <Server>DB02</Server>
    <Platform>x64</Platform>
    <Version>9.2.0.28493</Version>
  </MessageGenerator>
  <Cycle Created="2015-09-19T19:54:47">37f177b4-ab20-4073-be85-0edefa5c8e96</Cycle>
  <Results Phase="ElectionNight" Verbosity="LightProgress" Granularity="Detailed">
    <eml:EventIdentifier Id="18126">
      <eml:EventName>Canning By-election</eml:EventName>
    </eml:EventIdentifier>
    <Election>
      <eml:ElectionIdentifier Id="H">
        <eml:ElectionName>House of Representatives By-election for the division of Canning</eml:ElectionName>
        <eml:ElectionCategory>ByElection</eml:ElectionCategory>
      </eml:ElectionIdentifier>
      <House>
        <Contests>
          <Contest Projected="true">
            <eml:ContestIdentifier Id="236">
              <eml:ContestName>Canning</eml:ContestName>
            </eml:ContestIdentifier>
            <Enrolment>112809</Enrolment>
            <FirstPreferences PollingPlacesReturned="0" PollingPlacesExpected="53">
              <Candidate>
                <eml:CandidateIdentifier Id="25424" />
                <BallotPosition>1</BallotPosition>
                <Elected>false</Elected>
                <Votes MatchedHistoric="0">0</Votes>
                <VotesByType>
                  <Votes Type="Ordinary">0</Votes>
                  <Votes Type="Absent">0</Votes>
                  <Votes Type="Provisional">0</Votes>
                  <Votes Type="PrePoll">0</Votes>
                  <Votes Type="Postal">0</Votes>
                </VotesByType>
              </Candidate>
              <!-- (other candidates, ghost candidates) -->
              <Formal>
                <Votes MatchedHistoric="0">0</Votes>
                <VotesByType>
                  <Votes Type="Ordinary">0</Votes>
                  <Votes Type="Absent">0</Votes>
                  <Votes Type="Provisional">0</Votes>
                  <Votes Type="PrePoll">0</Votes>
                  <Votes Type="Postal">0</Votes>
                </VotesByType>
              </Formal>
              <Informal>
                <Votes MatchedHistoric="0">0</Votes>
                <VotesByType>
                  <Votes Type="Ordinary">0</Votes>
                  <Votes Type="Absent">0</Votes>
                  <Votes Type="Provisional">0</Votes>
                  <Votes Type="PrePoll">0</Votes>
                  <Votes Type="Postal">0</Votes>
                </VotesByType>
              </Informal>
              <Total>
                <Votes MatchedHistoric="0">0</Votes>
                <VotesByType>
                  <Votes Type="Ordinary">0</Votes>
                  <Votes Type="Absent">0</Votes>
                  <Votes Type="Provisional">0</Votes>
                  <Votes Type="PrePoll">0</Votes>
                  <Votes Type="Postal">0</Votes>
                </VotesByType>
              </Total>
            </FirstPreferences>
            <TwoCandidatePreferred Restricted="true" PollingPlacesReturned="0" PollingPlacesExpected="53" />
            <TwoPartyPreferred>
              <Coalition>
                <CoalitionIdentifier Id="2" />
                <Votes>0</Votes>
              </Coalition>
              <Coalition>
                <CoalitionIdentifier Id="1" />
                <Votes>0</Votes>
              </Coalition>
            </TwoPartyPreferred>
            <PollingPlaces>
              <PollingPlace>
                <PollingPlaceIdentifier Id="7467" />
                <FirstPreferences />
                <TwoCandidatePreferred Restricted="true" />
              </PollingPlace>
              <!-- (other polling places, ghost candidates) -->
            </PollingPlaces>
          </Contest>
        </Contests>
      </House>
    </Election>
  </Results>
</MediaFeed>

Spatial data

You can also download electorate boundaries as spatial data here to back any visualisations of the results.

The locations of polling places are already encoded in the Preload data file.

Good luck!

Two ways to back up your Google Apps account

If you use Gmail or hosted Google Apps, you might be interested in taking a backup of your data, such as emails, Drive documents, and calendar entries. Thankfully, you can usually export copy of your account data using Google Takeout.

If your hosted Apps account has Takeout disabled, then you can do a backup, it simply has a few extra steps.

Option 1: Google Takeout

This method is nice and simple. Simply go to the Data tools – Download your data page, and select which services you want to export:

2015-01-google-takeout

It can be a bit eye-opening to see the amount of data Google has on you (Files, conversations, location history, etc). At this point, click through to “Prepare Download”. Depending on the size of your account, this may take as a coffee break, a few hours, or even an entire day.

2015-02-google-takeout-prepare

If you check the box for it, you’ll get an email like this when your Download completes:

2015-02-google-archive

And this lets you fetch a single file:

2015-01-download

The .zip file contains a series of folders, one for each service. The defaults seem to be:

Mail
A unix mbox file
Calendar
One iCal file for each calendar
Contacts
One vCard file for each group.
Drive
Exports as PDF, docx, xlsx

Option 2: Export data from each service

Sometimes, Google Takeout isn’t an option.

2015-01-google-takeout-disabled

Luckily, most Google services have some sort of data export built in. This means, if you have a new contact manager, or want to include your Drive in your PC backup, it’s still possible.

The export formats in these examples should match the Google Takeout defaults. Tab through each service t see how to export it:

If you are not a power user, then I would suggest setting up a copy of Mozilla Thunderbird via IMAP, and regularly using it for your email. This is a simple way to keep a clone of your inbox on your desktop computer, so that it can be included in backups.

If you are more tech-savvy, then the rest of this section will focus on helping you generate an mbox file containing a full backup of your email, the same format as Takeout uses. The best tool for that is a Linux program called getmail.

On Debian or Ubuntu Linux, issue this command to install getmail:

sudo apt-get install getmail4

For other package managers, see these directions.

First, you need to enable IMAP for your account, see Google’s article: Get started with IMAP and POP3, for the steps.

Now create a file at ~/.getmail/getmailrc, and configure it to read your email account via IMAP/SSL.

[retriever]
type = SimpleIMAPSSLRetriever
server = imap.gmail.com
port = 993
username = bob@mail.example.com
password = ....

[destination]
type = Mboxrd
path = ~/inbox

[options]
verbose = 1
getmail

After some time, you will end up with a large mbox file at ~/inbox, containing all of your mail.

If, for some reason, you need to use POP3 instead, then see this article on Gmail backup

Go to your contacts, and find a group. Check the box next to each name, and then find the Export button:

2015-02-contacts-export

Select the vCard format here, as it’s the same format which Takeout would have used:

2015-02-contacts-export-group-vcard

Google provides a share-able iCal link, which you can download once, but it is only available if your calendar is public.

So, if your calendar isn’t too sensitive, click “Share” and make the calendar public:2015-02-calendar-01-share

2015-02-calendar-02-public

Go to “Calendar Settings”, find the iCal link. It may take a few minutes for the link to start working, but once it does, download it, and then turn off public sharing.

There is a small risk that somebody else loads your calendar while its public, so if this concerns you, then save the events individually.

Exporting from Google Drive is nice and simple. Select all of your files (Shift+Click):

2015-02-drive-select

And then find the Download button:

2015-02-drive-download

If you apply this to your whole drive, it may take a while, so you may wish to download it in parts if your Internet can be unreliable.

Know how to export a different service? Send it in and I’ll add it to the list.

How do I use these files?

Google Drive’s files are exported in familiar formats. If you haven’t used an mbox, vCard or ics file before, then you will need to find a program which can read these for you.

Google’s support answer “Download your data: Per-service information” contains a list of files types which you’ll run into during this process, and suggests programs which can import them.

Successful migration to WordPress in 3 easy steps

We made the decision in January to migrate our website to a WordPress installation, which is the CMS of choice for most blogs.

This posed a big challenge, mainly because we had been using our in-house CMS to publish content for the past 2 years, meaning our content was tied up in a difficult-to-export format.

Still, it allowed me to dig into the nifty and well-developed world of WordPress. My magic formula for a WordPress migration, in a nutshell:

  1. Export your old blog as something WordPress can understand.
  2. Hack at the theme until your site is beautiful.
  3. Don’t break your URL’s.

1. Export your blog

Depending on how you are blogging already, you may be able to save an export which can be loaded into WordPress with a plugin (see Importing Content on the WordPress wiki).

We were not in this lucky category, so I delved into the WXR (WordPress eXtended RSS) format. This example file was a big help, and I wrote up a short PHP script to create a similar-looking file from my blog.

2. Hack at the theme

We adapted our site from the ‘Skittlish’ theme, which has also been ported to WordPress. Every theme carries some baggage, so I highly suggest rolling up your sleeves and opening wp-content/themes on your blog.

All non-feature modifications are done in WordPress via themes, so keep tweaking it until you’re happy, or get a designer to put together a theme that suits your needs.

3. Don’t break your URL’s

I link between blog posts a lot, and breaking these links would be mind-numbing to clean up after (and a SEO sin). Using the import method above, I used article titles which matched the old permalinks.

WordPress then lets you configure permalinks to use this field, replicating the old behavior and keeping everybody happy.

Wrap-up

If you run WordPress on your site, then it makes sense to have somebody on your team who really knows how it works.

If your initial setup is not handled with care, then you could end up wasting several days of work checking old content for errors.

Good luck!

Loading Asterisk CDR into a database

In the interests of accurate accounting, Asterisk creates a Master.csv, logging all calls and a few things about them.

This page on wiki.asterisk.org has a good breakdown of what all the fields mean.

I put together this MySQL table so that the data can be processed for data-analysis, accounting, or whatever it is that the data is needed for.

-- Code to create MySQL table for Asterisk CDR data
-- Field descriptions from:
--   https://wiki.asterisk.org/wiki/display/AST/CDR+Fields
CREATE TABLE cdr (
    accountcode VARCHAR(256) COMMENT 'What account number to use, (string, 20 characters)',
    src VARCHAR(80) COMMENT 'CallerID number',
    dst VARCHAR(80) COMMENT 'Destination extension',
    dcontext VARCHAR(80) COMMENT 'Destination context',
    clid VARCHAR(80) COMMENT 'CallerID with text',
    channel VARCHAR(80) COMMENT 'Channel used',
    dstchannel VARCHAR(80) COMMENT 'Destination channel if appropriate',
    lastapp VARCHAR(80) COMMENT 'Last application if appropriate',
    lastdata VARCHAR(80) COMMENT 'Last application data (arguments)',
    tsstart DATETIME COMMENT 'Start of call (date/time)',
    tsanswer DATETIME COMMENT 'Answer of call (date/time)',
    tsend DATETIME COMMENT 'End of call (date/time)',
    duration INT(11) COMMENT 'Total time in system, in seconds (integer), from dial to hangup',
    billsec INT(11) COMMENT 'Total time call is up, in seconds (integer), from answer to hangup',
    disposition ENUM('ANSWERED', 'NO ANSWER', 'BUSY') COMMENT 'What happened to the call',
    amaflags ENUM('DOCUMENTATION', 'BILL', 'IGNORE') COMMENT 'What flags to use, specified on a per channel basis like accountcode.',
    uniqueid VARCHAR(32) COMMENT 'Unique Channel Identifier',
    userfield VARCHAR(256) COMMENT 'user field: A user-defined field, maximum 255 characters'
) ;

The second-last field is not present on all installations, but it is on mine. The version I’m on is:

asterisk:~# asterisk -V
Asterisk 1.8.10.1~dfsg-1ubuntu1

How to liberate your myki data

myki logo

myki is the public transport ticketing system in Melbourne. If you register your myki, you can view the usage history online. Unfortunately, you are limited to paging through HTML, or downloading a PDF.

This post will show you how to get your myki history into a CSV file on a GNU/Linux computer, so that you can analyse it with your favourite spreadsheet/database program.

Get your data as PDFs

Firstly, you need to register your myki, log in, and export your history. The web interface seemed to give you the right data if you chose blocks of 1 month.

Export myki data for each month

Once you do this, organise these into a folder filled with statements.

A folder filled with myki statements

You need the pdftotext utility to go on. In debian, this is in the poppler-utils package.

The manual steps below run you through how to extract the data, and at the bottom of the screen there are some scripts I’ve put together to do this automatically.

Manual steps to extract your data

These steps are basically a crash course in "scraping" PDF files.

To convert all of the PDF’s to text, run:

for i in *.pdf; do pdftotext -layout -nopgbrk $i; done

This preserves the line-based layout. The next step is to filter out the lines which don’t contain data. Each line we’re interested in begins with a date, followed by the word “Touch On”, “Touch Off”, or “Top Up”

18/08/2013 13:41:20   T...

We can filter all of the text files using grep, and a regex to match this:

cat *.txt | grep "^[0-3][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] *T"

The output looks like:
Filtered output, showing data

So what are we looking at?

  1. One row per line
  2. Fields delimited by multiple spaces

To collapse every double-space into a tab, we use unexpand. Then, to collapse duplicate tabs, we use tr:

cat filtered-data.txt | unexpand -t 2 | tr -s '\t'

Finally, some fields need to be quoted, and tabs need to be converted to CSV. The PHP script below will do that step.

Scripts to get your data

myki2csv.sh is a script which performs the above manual steps:

#!/bin/bash
# Convert myki history from PDF to CSV
#	(c) Michael Billington < michael.billington@gmail.com >
#	MIT Licence
hash pdftotext || exit 1
hash unexpand || exit 1
pdftotext -layout -nopgbrk $1 - | \
	grep "^[0-3][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] *T" | \
	unexpand -t2 | \
	tr -s '\t' | \
	./tab2csv.php > ${1%.pdf}.csv

tab2csv.php is called at the end of the above script, to turn the result into a well-formed CSV file:

#!/usr/bin/env php
<?php
/* Generate well-formed CSV from dodgy tab-delimitted data
	(c) Michael Billington < michael.billington@gmail.com >
	MIT Licence */
$in = fopen("php://stdin", "r");
$out = fopen("php://stdout", "w");
while($line = fgets($in)) {
	$a = explode("\t", $line);
	foreach($a as $key => $value) {
		$a[$key]=trim($value);
		/* Quote out ",", and escape "" */
		if(!(strpos($value, "\"") === false &&
				strpos($value, ",") === false)) {
			$a[$key] = "\"".str_replace("\"", "\"\"", $a[$key])."\"";
		}
	}
	$line = implode(",", $a) . "\r\n";
	fwrite($out, $line);
}

Invocation

Call script on a single foo.pdf to get foo.csv:

./myki2csv.sh foo.pdf

Convert all PDF’s to CSV and then join them:

for i in *.pdf; do ./myki2csv.sh $i; done
tac *.csv > my-myki-data.csv

Importing into LibreOffice

The first field must be marked as a DD/MM/YYYY date, and the “zones” need to be marked as text (so that “1/2” isn’t treated as a fraction!)

These are my import settings:

Options to import the myki data into LibreOffice

Happy data analysis!

Update 2013-09-18: The -nopgbrk option was added to the above instructions, to prevent page break characters causing grep to skip one valid line per page

Update 2014-05-04: The code for the above, as well as this follow-up post are now available on github.

Making an XKCD-style password generator in C++

I’m learning C++ at the moment, and I don’t find long tutorials or studying the standard template library particularly fun.

Making this type of password-generator is not new, but it is a nice practical exercise to start out in any language.

1. Get a list of common English words

Googling “common English words” yielded this list, purporting to contain 5,000 words. Unfortunately it contains almost 1,000 duplicates and numerous non-words! Wiktionary has a much higher-quality list of words compiled from Project Gutenberg, but the markup looks a bit like this:

==== 1 - 1000 ====
===== 1 - 100 =====
[[the]] = 56271872
[[of]] = 33950064
[[and]] = 29944184
[[to]] = 25956096
[[in]] = 17420636
[[I]] = 11764797  

Noting the wikilinks surrounding each word, I put together this PHP script to extract the link destinations and called it get-wikilinks.php:

#!/usr/bin/php
<?php
/* Return list of wikilinked words from input text */
$text = explode("[[", file_get_contents("php://stdin"));
foreach($text as $link) {
	$rbrace = strpos($link, "]]");
	if(!$rbrace === false) {
		/* Also escape on [[foo|bar]] links */
		$pipe = strpos($link, "|");
		if(!$pipe === false && $pipe < $rbrace) {
			$rbrace = $pipe;
		}
		$word = trim(substr($link, 0, $rbrace))."n";
		if(strpos($word, "'") === false && !is_numeric(substr($word, 0, 1))) {
			/* Leave out words with apostrophes or starting with numbers */
			echo $word;
		}
	}
}

The output of this script is much more workable:

$ chmod +x get-wikilinks.php
$ cat wikt.txt | ./get-wikilinks.php
the
of
and
to
in
I

Using sort and uniq makes a top-notch list of common words, ready for an app to digest:

$ cat wikt.txt | ./get-wikilinks.php | sort | uniq > wordlist.txt

2. Write some C++

There are two problems being solved here:

  • Reading a file into memory
    • An ifstream is used to access the file, and getline() will return false when EOF has been reached
    • Each line is loaded into a vector (roughly the same type of container as an ArrayList in Java), which is resized dynamically and accessed like an array.
  • Choosing random numbers
    • These are seeded from a random_device, being more cross-platform than reading from a file like /dev/urandom.
    • Note that random is new to C++11.
pw.cpp
#include <fstream>
#include <vector>
#include <string>
#include <iostream>
#include <random>
#include <cstdlib>

using namespace std;

int main(int argc, char* argv[]) {
    const char* fname = "wordlist.txt";

    /* Parse command-line arguments */
    int max = 1;
    if(argc == 2) {
        max = atoi(argv[1]);
    }

    /* Open word list file */
    ifstream input;
    input.open(fname);
    if(input.fail()) {
        cerr << "ERROR: Failed to open " << fname << endl;
    }

    /* Read to end and load words */
    vector<string> wordList;
    string line;
    while(getline(input, line)) {
        wordList.push_back(line);
    }

    /* Seed from random device */
    random_device rd;
    default_random_engine gen;
    gen.seed(rd());
    uniform_int_distribution<int> dist(0, wordList.size() - 1);

    /* Output as many passwords as required */
    const int pwLen = 4;
    int wordId, i, j;
    for(i = 0; i < max; i++) {
        for(j = 0; j < pwLen; j++) {
            cout << wordList[dist(gen)] << ((j != pwLen - 1) ? " " : "");
        }
        cout << endl;
    }

    return 0;
}

3. Compile

Lots of projects in compiled languages have a Makefile, so that you can compile them without having to type all the compiler options manually.

Makefiles are a bit heavy to learn properly, but for a project this tiny, something simple is fine:

default:
	g++ pw.cpp -o pw -std=c++11

clean:
	rm -f pw

Now we can compile and run the generator:

make
./pw

The output looks like this for ./pw 30 ("generate 30 passwords"):