Let’s implement power-on self test (POST)

I recently implemented simple Power-on self test (POST) routine for my 65C816 test board, so that it can stop and indicate a hardware failure before attempting to run any normal code.

This was an interesting adventure for two main reasons:

  • This code needs to work with no RAM present in the computer.
  • I wanted to try re-purposing the emulation status (E) output on the 65C816 CPU to blink an LED.

Background

Even modern computers will stop and provide a blinking light or series of beeps if you don’t install RAM or a video card. This is implemented in the BIOS or UEFI.

I got the idea to use the E or MX CPU outputs for this purpose from this thread on the 6502.org forums. This method would allow me to blink a light with just a CPU, clock input, and ROM working.

My main goal is to perform a quick test that each device is present, so that start-up fails in a predictable way if I’ve connected something incorrectly. This is much simpler than the POST routine from a real BIOS, because I’m not doing device detection, and I’m not testing every byte of memory.

Boot-up process

On my test board, I’ve connected an LED directly to the emulation status (E) output on the 65C816 CPU. The CPU starts in emulation mode (E is high). However I have noticed that on power-up, the value of E appears to be random until /RES goes high. If I were wiring this up again, I would also prevent the LED from lighting up while the CPU is in reset:

The first thing the CPU does is read an address from ROM, called the reset vector, which tells it where start executing code.

In my case, the first two instructions set the CPU to native mode, which are clc (clear carry) and xce (exchange carry with emulation).

.segment "CODE"
reset:
    .a8
    .i8
    clc                            ; switch to native mode
    xce
    jmp post_check_loram

By default accumulator and index registers are 8-bit, the .a8 and .i8 directives simply tell the assembler ca65 that this is the case.

Next, the code will jmp to the start of the POST process.

Checking low RAM

The first part of the POST procedure checks if the lower part of RAM is available, by writing values to two address and checking that the same values can be read back.

Note that a:$00 causes the assembler to interpret $00 as an absolute address. This will otherwise be interpreted as direct-page address, which is not what’s intended here.

post_check_loram:
    ldx #%01010101                 ; Power-on self test (POST) - do we have low RAM?
    ldy #%10101010
    stx a:$00                      ; store known values at two addresses
    sty a:$01
    ldx a:$00                      ; read back the values - unlikely to be correct if RAM not present
    ldy a:$01
    cpx #%01010101
    bne post_fail_loram
    cpy #%10101010
    bne post_fail_loram
    jmp post_check_hiram

If this fails, then the boot process stops, and the emulation LED blinks in a distinctive pattern (two blinks) forever.

post_fail_loram:                   ; blink emulation mode output with two pulses
    pause 8
    blink
    blink
    jmp post_fail_loram            ; repeat indefinitely

Macros: pause and blink

It’s a mini-challenge to write code to blink an LED in a distinctive pattern without assuming that RAM works. This means no stack operations (eg. jsr and rts instructions), and that I need to store anything I need in 3 bytes: the A, X, Y registers. A triple-nested loop is the best I can come up with.

I wrote a pause macro, which runs a time-wasting loop for the requested duration – approximately a multiple of 100ms at this clock speed. Every time this macro is used, the len value is substituted in, and this code is included in the source file. This example also uses unnamed labels, which is a ca65 feature for writing messy code.

.macro pause len                   ; time-wasting loop as macro
    lda #len
:
    ldx #64
:
    ldy #255
:
    dey
    cpy #0
    bne :-
    dex
    cpx #0
    bne :--
    dec
    cmp #0
    bne :---
.endmacro

The second macro I wrote is blink, which briefly lights up the LED attached to the E output by toggling emulation mode. I’m using the pause macro from both native mode and emulation mode in this snippet, so I can only treat A, X and Y as 8-bit registers.

.macro blink
    sec                            ; switch to emulation mode
    xce
    pause 1
    clc                            ; switch to native mode
    xce
    pause 2
    sec                            ; switch to emulation mode
.endmacro

Checking high RAM

There is also a second RAM chip, and this process is repeated with some differences. For one, I can now use the stack, which is how I set the data bank byte in this snippet.

Here a:$01 is important, because with direct page addressing, $01 means $000001 at this point in the code, where I want to test that I can write to the memory address $080001.

post_check_hiram:
    ldx #%10101010                 ; Power-on self test (POST) - do we have high RAM?
    ldy #%01010101
    lda #$08                       ; data bank to high ram
    pha
    plb
    stx a:$00                      ; store known values at two addresses
    sty a:$01
    ldx a:$00                      ; read back the values - unlikely to be correct if RAM not present
    ldy a:$01
    cpx #%10101010
    bne post_fail_hiram
    cpy #%01010101
    bne post_fail_hiram
    lda #$00                       ; reset data bank to boot-up value
    pha
    plb
    jmp post_check_via

The failure here is similar, but the LED will blink 3 times instead of 2.

post_fail_hiram:                   ; blink emulation mode output with three pulses. we could use RAM here?
    pause 8
    blink
    blink
    blink
    jmp post_fail_hiram            ; repeat indefinitely

To make sure that I was writing to different chips, I installed the RAM chips one at a time, first observing the expected failures, and then observing that the code continued past this point with the chip installed.

I also checked with an oscilloscope that both RAM chips are now being accessed during start-up. Now that I’ve got some confidence that the computer now requires both chips to start, I can skip a few debugging steps if I’ve got code that isn’t working later.

Checking the Versatile Interface Adapter (VIA)

The third chip I wanted to add to the POST process is the 65C22 VIA. I kept this check simple, because one read to check for a start-up default is sufficient to test for device presence.

VIA_IER = $c00e
post_check_via:                    ; Power-on self test (POST) - do we have a 65C22 Versatile Interface Adapter (VIA)?
    lda a:VIA_IER
    cmp #%10000000                 ; start-up state, interrupts enabled overall (IFR7) but all interrupt sources (IFR0-6) disabled.
    bne post_fail_via
    jmp post_ok

This stops and blinks 4 times if it fails. I recorded the GIF at the top of this blog post by removing the component which generates a chip-select for the VIA, which causes this code to trigger on boot.

post_fail_via:
    pause 8
    blink
    blink
    blink
    blink
    jmp post_fail_via

Beep for good measure

At the end of the POST process, I put in some code to generate a short beep.

This uses the fact that the 65C22 can toggle the PB7 output each time a certain number of clock-cycles pass. I’ve connected a piezo buzzer to that output, which I’m using as a PC speaker. The 65C22 is serving the role of a programmable interrupt timer from the PC world.

VIA_DDRB = $c002
VIA_T1C_L = $c004
VIA_T1C_H = $c005
VIA_ACR = $c00b
BEEP_FREQ_DIVIDER = 461            ; 1KHz, formula is CPU clock / (desired frequency * 2), or 921600 / (1000 * 2) ~= 461
post_ok:                           ; all good, emit celebratory beep, approx 1KHz for 1/10th second
    ; Start beep
    lda #%10000000                 ; VIA PIN PB7 only
    sta VIA_DDRB
    lda #%11000000                 ; set ACR. first two bits = 11 is continuous square wave output on PB7
    sta VIA_ACR
    lda #<BEEP_FREQ_DIVIDER        ; set T1 low-order counter
    sta VIA_T1C_L
    lda #>BEEP_FREQ_DIVIDER        ; set T1 high-order counter
    sta VIA_T1C_H
    ; wait approx 0.1 seconds
    pause 1
    ; Stop beep
    lda #%11000000                 ; set ACR. returns to a one-shot mode
    sta VIA_ACR
    stz VIA_T1C_L                  ; zero the counters
    stz VIA_T1C_H
    ; POST is now done
    jmp post_done

The post_done label points to the start of the old ROM code, which is currently just a “hello world” program.

Next steps

I’m now able to lock in some of my assumptions about should be available in software, so that I can write more complex programs without second-guessing the hardware.

Once the boot ROM is interacting with more hardware, I may add additional checks. I will probably need to split this into different sections, and make use of jsr/rts once RAM has been tested, because the macros are currently generating a huge amount of machine code. I have 8KiB of ROM in the memory map for this computer, and the code on this page tales up around 1.1KiB.

Building an emulator for my 65C816 computer

I’ve been working on adding text-based input and output to my 65C816 computer prototype, but it’s not yet working reliably.

To unblock software development, I decided to put together an emulator, so that I can test my code while the hardware is out of action.

Choosing a base project

There were three main options I considered for running 65C816 programs on a modern computer,

  1. Write a new emulator from scratch
  2. Extract the useful parts of an open source Super Nintendo or Apple IIgs emulator and adapt them
  3. Find a 65C816 CPU emulator, and extend it.

I found a good candidate for the third option in Lib65816 by Francesco Rigoni, which is a C++ library for emulating a 65C816 CPU. The code is GPLv3 licensed, and I could see where I would need to make changes to match the design of my computer.

Process

After reading the code, I could see some places where I would need to modify the library, so I copied it in to a new folder, together with its logging dependency and sample program, and refactored it into one project.

I started by extending RAM up to 16 banks (1 megabyte), and mapping in a ROM device which serves bytes from a file. I also updated the handling of interrupt vectors, so that the CPU would retrieve them from the system bus.

Adding 65C22 support

My first test program was an example of preemptive multi-tasking from my previous blog post. At this point it ran, but did not switch between tasks, since that requires extra hardware support.

I implemented a minimal support for the 65C22 VIA used in my computer design, just enough to log changes to the outputs, and to fire interrupts from a timer.

The Lib65816 library does not support NMI interrupts, which are used in this test program. I added edge detection, and connected the interrupt output of the 65C22 to the CPU NMI input in code.

This screen capture shows the test program during switching between two tasks, with a different VIA port being accessed before and after the switch.

This gave me some confidence that the CPU emulation was good enough to develop on, so I moved on to the next test program.

Adding serial support

The second test program is also from an earlier blog post, and tests printing and reading characters from an NXP SC16C752 UART.

I added a very minimal emulation of the UART chip I’m using, suppressed all of the logging, and converted output to use the curses library. This last step enabled character-by-character I/O, instead of the default line-buffered I/O.

This screen capture shows the program running under emulation. It simply prints “Hello world”, then echoes back the next 3 characters that the user types.

Wrap-up

I didn’t plan to write an emulator for my custom computer, but I am currently debugging some tricky hardware problems, and this detour will allow me to continue to make progress in other areas while I try to find a solution.

I plan to keep the emulator up to date with hardware changes, since it will make it possible to demonstrate the system without any hardware access. It’s also easier to debug issues now that I can choose two different execution environments.

The hardware design, software and emulator for this computer are online at mike42/65816-computer. I would like to again acknowledge Francesco Rigoni’s work on Lib65816. I’m very grateful that he chose to release this under the GPL, since it allows me to re-use the code for my project.

Let’s implement preemptive multitasking

I am currently working on building a computer based on the 65C816 processor. This processor has a 16-bit stack pointer, which should make it far more capable than the earlier 65C02, at least in theory.

I wanted to check my understanding of how the stack works on this processor, so I tried to build the simplest possible implementation of preemptive multitasking in 65C816 assembly.

What I’m working with

I am working with a breadboard computer. It has no operating system, and no ability to load programs or start/stop processes.

To get code running, I am assembling it on a modern computer, then writing the resulting binary to an EEPROM chip.

To see the result of the code, I am using a logic analyser to watch some output pins on the 65C22 VIA. The 65C22 chip also has a timer, so I connected its interrupt output to the NMI input of the CPU for this test.

Test programs

I wrote two simple programs which infinitely loop. These programs use ca65 assembler syntax.

The 65C22 has two 8-bit ports, so I made each test program write a recognisable pattern to its own output port, so that it’s easy to see which one is running at any given time, and check that it is not disrupted by the task switching.

The first program writes to Port A of the 65C22, and alternates the value between 0000 0000, and 0000 0011.

.segment "CODE"
; Task 1
task_1_main:
  .a8                             ; use 8-bit accumulator and index registers
  .i8
  sep #%00110000
@repeat_1:
  lda #%00000000                  ; alternate between two values
  sta PORTA
  lda #%00000011
  sta PORTA
  jmp @repeat_1                   ; repeat forever

The second program writes to Port B of the 65C22. This uses a ror instruction to produce a pattern across all digits of the output port.

.segment "CODE"
; Task 2
task_2_main:
  .a8                             ; use 8-bit accumulator and index registers
  .i8
  sep #%00110000
  clc
  lda #%01010101                  ; grab a start value
@repeat_2:
  ror                             ; rotate right
  sta PORTB
  jmp @repeat_2                   ; repeat forever

Context switching

I’m implementing round-robin scheduling between two processes.

My basic plan was to use a regular interrupt routine to save context, switch the stack pointer to the other process, restore context for the next process, then return from the interrupt.

.segment "CODE"
nmi:
  .a16
  .i16
  rep #%00110000
  ; Save task context to stack
  pha                             ; Push A, X, Y
  phx
  phy
  phb                             ; Push data bank, direct register
  phd

  ; swap stack pointer so that we will restore the other task
  tsc                             ; transfer current stack pointer to memory
  sta temp
  lda next_task_sp                ; load next stack pointer from memory
  tcs
  lda temp                        ; previous task is up next
  sta next_task_sp

  ; Clear interrut
  .a8
  .i8
  sep #%00110000                  ; save X only (assumes it is 8 bits..)
  ldx T1C_L                       ; Clear the interrupt, side-effect of reading

  .a16
  .i16
  rep #%00110000
  ; Restore process context from stack, reverse order
  pld                             ; Pull direct register, data bank
  plb
  ply                             ; Pull Y, X, A
  plx
  pla
  rti

Setup process

The above switch works while the two processes are up and running, that takes some setting up. Firstly, the two variables need to be referenced somewhere.

.segment "BSS"
next_task_sp: .res 2              ; Stack pointer of whichever task is not currently running
temp: .res 2

The boot-up routine then sets everything up. It’s broken up a bit here. First is the CPU initialisation, since the 65C816 will boot to emulation mode.

.segment "CODE"
reset:
  clc                             ; switch to native mode
  xce

The second step is to start a task in the background. This involves pushing appropriate values to the stack, so that when we context switch, “Task 2” will start executing from the top.

  ; Save context as if we are at task_2_main, so we can switch to it later.
  .a16                            ; use 16-bit accumulator and index registers
  .i16
  rep #%00110000
  lda #$3000                      ; set up stack, direct page at $3000
  tcs
  tcd
  ; emulate what is pushed to the stack before NMI is called: program bank, program counter, processor status register
  phk                             ; program bank register, same as current code, will be 0 here.
  pea task_2_main                 ; 16 bit program counter for the start of this task
  php                             ; processor status register
  ; match what we push to the stack in the nmi routine
  lda #0
  tax
  tay
  pha                             ; Push A, X, Y
  phx
  phy
  phb                             ; Push data bank, direct register
  phd
  tsc                             ; save stack pointer to next_task_sp
  sta next_task_sp

  lda #$2000                      ; set up stack, direct page at $2000 for task_1_main
  tcs
  tcd

The nex step is to set up a timer, so that the interrupt routine will fire regularly. This causes interrupts to occur ~28 times per second.

  ; Set up the interrupt timer
  .a8                             ; use 8-bit accumulator and index registers
  .i8
  sep #%00110000
  lda #%11111111                  ; set all VIA pins to output
  sta DDRA
  sta DDRB
  ; set up timer 1
  lda #%01000000                  ; set ACR. first two bits = 01 is continuous interrupts for T1
  sta ACR
  lda #%11000000                  ; enable VIA interrupt for T1
  sta IER
  ; set up a timer at ~65535 clock pulses.
  lda #$ff                        ; set T1 low-order counter
  sta T1C_L
  lda #$ff                        ; set T1 high-order counter
  sta T1C_H

Finally, “Task 1” can be started in the foreground.

  ; start running task 1
  jmp task_1_main

Linker configuration and boilerplate

This is the first time I’m blogging about code written in 65C816 native mode, so for the sake of completeness, I’ll also include the updated linker configuration. The most important change (compared with this blog post) to that 32 bytes are now set aside for interrupt vectors.

MEMORY {
    ZP:     start = $00,    size = $0100, type = rw, file = "";
    RAM:    start = $0100,  size = $7e00, type = rw, file = "";
    PRG:    start = $e000,  size = $2000, type = ro, file = %O, fill = yes, fillval = $00;
}

SEGMENTS {
    ZEROPAGE: load = ZP,  type = zp;
    BSS:      load = RAM, type = bss;
    CODE:     load = PRG, type = ro,  start = $e000;
    VECTORS:  load = PRG, type = ro,  start = $ffe0;
}

Note that this linker configuration is not quite complete, but it is good enough to get code running. The zero page is not relevant for the code I’ll be writing, so I might remove that at some point.

I also have definitions for the 65C22 I/O registers, which correspond to the location that it is mapped into RAM on my prototype computer.

PORTB = $c000
PORTA = $c001
DDRB = $c002
DDRA = $c003
T1C_L = $c004
T1C_H = $c005
ACR = $c00b
IFR = $c00b
IER = $c00e

The final piece of the puzzle then, are definitions for all those interrupt vectors.

.segment "CODE"
irq:
  rti

unused_interrupt:                 ; Probably make this into a crash.
  rti

.segment "VECTORS"
; native mode interrupt vectors
.word unused_interrupt            ; Reserved
.word unused_interrupt            ; Reserved
.word unused_interrupt            ; COP
.word unused_interrupt            ; BRK
.word unused_interrupt            ; Abort
.word nmi                         ; NMI
.word unused_interrupt            ; Reserved
.word irq                         ; IRQ

; emulation mode interrupt vectors
.word unused_interrupt            ; Reserved
.word unused_interrupt            ; Reserved
.word unused_interrupt            ; COP
.word unused_interrupt            ; Reserved
.word unused_interrupt            ; Abort
.word unused_interrupt            ; NMI
.word reset                       ; Reset
.word unused_interrupt            ; IRQ/BRK

The process for building this assembly code into a usable ROM are:

ca65 --cpu 65816 main.s
ld65 -o rom.bin -C system.cfg main.o

This outputs an 8KiB file. I need to pad this with zeroes up to 32KiB, which is the size of the ROM chip I am burning it to.

truncate -s 32768 rom.bin

Lastly, I write this to the ROM.

minipro -p AT28C256 -w rom.bin

Result

I connected a logic analyser to two wires from each output port, the NMI interrupt line, and the reset signal, and opened up sigrok.

This clearly shows the CPU switching between the two processes each time an interrupt is triggered.

Zooming in on the task cut-over, I can also see that the patterns on each port are different, as expected.

Of course this did not work the first time, and I spent quite a bit of time de-constructing and re-constructing the code to isolate bugs.

As result, I shipped some improvements to my 6502 assembly plugin for JetBrains IDE’s, which I have blogged about previously. The plugin will now provide suggestions for mnemonics. There is a project setting for 6502 vs 65C02 vs 65C816 mode, and it will only suggest mnemonics which are available for the CPU.

It also provides a weak warning for binary or hex numbers which are not 8, 16 or 24 bits, which would have saved me some time.

Lastly, it has optional checking for undefined/unused values. This helps catch problems in the editor, and allows me to identify unused code and variables.

Wrap-up

I already knew how multitasking works on a high-level, but it was quite interesting to implement it myself. I have been using this as a test program for my 65C816 computer, since it exercises RAM, ROM, I/O, and uses interrupts.

I don’t think I will be able to use this exact method for more complex programs, because it will break down a bit with multiple interrupt sources. After writing this code, I also found that it’s not common for modern systems to save register values to the stack when context-switching, and that a process control block is used instead.

The next step for my 65C816 computer will be the addition of an NXP UART, which I tested in an earlier blog post.

Interfacing an NXP UART to an 8-bit computer

I recently tested the SC16C752 UART from NXP with my 6502 computer, and I was able to use it to receive and send characters. This blog post shows the simple test program and circuit that I used.

Some background

I built a 65C02-based computer last year, which uses a serial connection for text I/O. I used the WDC 65C51N to provide a UART interface, and I found it to be quite limited.

I am now building a 16-bit computer, and I want to avoid the 65C51N, for two main reasons:

  • It runs at 5v, and I’m planning to convert the design to 3.3v later on. The other 65-series chips I’m using support a wide voltage range.
  • It can only store one byte at a time. I am planning to run at a higher baud rate, and the system will have multiple interrupt sources. It would take very careful programming to avoid losing bytes with a 65C51N in this setup.

I spent some time investigating alternative UART IC’s, and found some discussion on 6502.org about replacing the 6551 with an NXP industrial (SC26 or SC28 series) UART. At the time of writing, many of the hobbyist-friendly PLCC versions of these chips are listed as end of life, though I’ve since found pin and software-compatible versions of this made by MaxLinear (eg. the XR88C92).

For my uses though, I chose the NXP SC16C752. It has two channels, 64-byte receive and transmit buffers, and supports a wide voltage range. I don’t mind that it’s a different series to these standard chips, because I plan to write software from scratch. The only downside is the packaging, which is LQFP48 with 0.5mm pin pitch. I needed to solder this tiny chip to an adapter board in order to use it.

The test circuit

I wired this up to my 6502 computer project as shown in the diagram below. IO3 is a spare address decoding output, which maps the chip to address $8c00, while the rest of the signals connect to the 6502 CPU directly.

I used a 74AC00 and 74AC04 to generate the active-high reset and separate read/write signals (as suggested on the same thread linked before). I plan to use a PLD to generate a qualified write signals on my 65C816 computer build.

Note that there are separate chip-selects for each UART, and I’ve only connected one here. I have also not connected the interrupt output for this test, and have some unused inputs which are left floating.

Developing a test program

I started writing some code, and connected a logic analyser to the UART output. The early tests did not work well, and I now know that this is because I had not enabled the FIFOs, causing characters to be lost.

This screen capture shows an attempt to send the entire alphabet to the UART, but only the letters “A”, “B” and “M” appearing at the output.

I came up with this test program, which does 8-N-1 communication at 115,200 bits per second.

;
; Test program for the SC16C752B dual UART from NXP
;
.import sys_exit

; Assuming /CSA is connected to IO3
UART_BASE = $8c00

; UART registers
; Not included: FIFO ready register, Xon1 Xon2 words.
UART_RHR = UART_BASE        ; Receiver Holding Register (RHR) - R
UART_THR = UART_BASE        ; Transmit Holding Register (THR) - W
UART_IER = UART_BASE + 1    ; Interrupt Enable Register (IER) - R/W
UART_IIR = UART_BASE + 2    ; Interrupt Identification Register (IIR) - R
UART_FCR = UART_BASE + 2    ; FIFO Control Register (FCR) - W
UART_LCR = UART_BASE + 3    ; Line Control Register (LCR) - R/W
UART_MCR = UART_BASE + 4    ; Modem Control Register (MCR) - R/W
UART_LSR = UART_BASE + 5    ; Line Status Register (LSR) - R
UART_MSR = UART_BASE + 6    ; Modem Status Register (MSR) - R
UART_SPR = UART_BASE + 7    ; Scratchpad Register (SPR) - R/W
; Different meaning when LCR is logic 1
UART_DLL = UART_BASE        ; Divisor latch LSB - R/W
UART_DLM = UART_BASE + 1    ; Divisor latch MSB - R/W
; Different meaning when LCR is %1011 1111 ($bh).
UART_EFR = UART_BASE + 2    ; Enhanced Feature Register (EFR) - R/W
UART_TCR = UART_BASE + 6    ; Transmission Control Register (TCR) - R/W
UART_TLR = UART_BASE + 7    ; Trigger Level Register (TLR) - R/W

.segment "CODE"
main:
  lda #$80                  ; Enable divisor latches
  sta UART_LCR
  lda #1                    ; Set divisior to 1 - on a 1.8432 MHZ XTAL1, this gets 115200bps.
  sta UART_DLL
  lda #0
  sta UART_DLM
  lda #%00010111            ; Sets up 8-n-1
  sta UART_LCR
  lda #%00001111            ; Enable FIFO, set DMA mode 1
  sta UART_FCR
  jsr test_print
  jsr test_recv
  lda #0
  jmp sys_exit

test_print:                 ; Send test string to UART
  ldx #0
@test_char:
  lda test_string, X
  beq @test_done
  sta UART_THR
  inx
  jmp @test_char
@test_done:
  rts

test_string: .asciiz "Hello, world!"

test_recv:                  ; Receive three characters and echo them back
  jsr uart_recv_char
  sta UART_THR
  jsr uart_recv_char
  sta UART_THR
  jsr uart_recv_char
  sta UART_THR
  rts

uart_recv_char:             ; Receive next char to A register
  lda UART_LSR
  and #%00000001
  cmp #%00000001
  bne uart_recv_char
  lda UART_RHR
  rts

The program sends the text “Hello, world!”, then waits for input. The next three characters it receives are sent back to the user, and then the program terminates. The only thing in this test program which is specific to my 6502 computer is sys_exit, which a routine to transfer control back to the operating system.

It was very rewarding to see the fast, error-free text output on the logic analyser for the first time. This is a great improvement from the 65C51N.

I was also surprised at how short the delay is to echo back a character. It will be interesting to see how much the latency increases once there is some more complex software handling the input.

Wrap up and next steps

I plan to use the NXP SC16C752 for my 65C816 computer project, which will be controlled via text-based input and output.

I’m testing it on its own first, so that I can refer to a known-good setup if it doesn’t work. This is enough to get started, though there are some features which I wasn’t able to test yet, since I can’t use custom interrupt service routines on this setup.

Adding an SD card reader to my 6502 computer

I recently attempted to add SD card support to my home-built 6502 computer.

This was successful, but more challenging than I expected. This blog post is just a few things that I learned along the way.

Background

SD cards are a popular way to add storage to electronics projects. In my case, I will be using the SD card in SPI mode. My home-built computer has an expansion port with some GPIO pins, which should be sufficient for this.

My basic plan was to:

  • write some test data to an SD card from a modern computer
  • interface the SD card to my home-built 6502 computer
  • write a 6502 assembly program to print out the test data

Since this is a learning project, I started from scratch, and avoided reading any existing implementations of SPI or SD cards before jumping in.

Test data

I started this project by generating a small test file, repeating each character in the alphabet 512 times.

echo abcdefghijklmnopqrstuvwxyz | fold -w1 | while read line; do yes $line | head -n512 ; done | tr -d "\n" > disk.img

It takes a one-line shell command to write this image to an SD card on all good operating systems, or a few clicks in the disk imaging utility.

dd if=disk.img of=/dev/my_sd_card

My laptop has an SD card slot, but I also have Windows on it at the moment, which turned this into a far larger problem than I could have imagined.

Driver support for this particular SD card reader was dropped in Windows 10, there are no built-in tools for imaging disks on Windows, and I had to try multiple tools before I could find one which would write a raw disk image (one which does not contain a valid filesystem).

In retrospect, this was the first of many lessons about not having the right tools for this project.

Physical interface

I connected an SD card module to my 6502 computer, via a breadboard. I used a module which could level-shift, because my computer runs at 5V, where SD cards usually run at 3.3V. This particular module is an Adafruit ADA254.

I also added pull-up resistors to all of the data lines, after noticing that there were several sitations where they would float. The wiring I used is:

The full schematic for my 6502 computer is on this blog post, while the schematic for the SD card module is online here.

Software: Part 1

With some test data and a hardware interface, I got to the third step, which was to read data from the SD card. I was completely unprepared for how tricky it is to do this from scratch, without the right experience or debugging tools.

As a quick re-cap, my development environment is still very limited: I am writing 6502 assembly on a modern computer, and I have the ability to send the assembled program to my home-built 6502 computer over a serial connection.

I started by using lecture notes from the University at Buffalo’s EE-379 course, available online here. This describes SPI, SD card command structure, and the sequence to initialise a card.

The two protocols I needed to implement are SPI, which transfers the bits, plus the SD card commands which run over that. I had to write quite a lot of code with no feedback, because it takes several commands before SD cards are ready to work with in SPI mode, and the exact details vary depending on the generation of card (SDHC vs SDXC, etc).

After days of debugging, I was unable to get a response from the SD card. With few other options, I started filling my code with the assembly-language equivalent of print statements, and found that every byte from the SD card was the default FF. This means ‘no data’ in this context.

Print statements also slowed the program down to a crawl, which allowed me to use a pizeo buzzer to listen to each line. This clicks each time a signal changes between 0 and 1, so I was able to check that each line had the expected activity. I found that the SD card was sending responses, but that the computer was not receiving them, because they were not connected to eachother on the breadboard.

Software: Part 2

After connecting the card correctly, I could see responses for the first time. The format here is byte sent / byte received, with line breaks between commands.

#
SD card detected
40/7f 00/ff 00/ff 00/ff 00/ff 95/ff ff/ff ff/01 ff/ff
48/7f 00/ff 00/ff 01/ff aa/ff 0f/ff ff/ff ff/09 ff/ff ff/ff ff/ff ff/ff
7a/7f 00/ff 00/ff 00/ff 00/ff 75/ff ff/ff ff/01 ff/00 ff/ff ff/80 ff/00 ff/ff ff/ff
#

The correct initialisation sequence depends on the type of SD card, and I spent a lot of time going through each possibility, looking for one one which worked. When this failed, I found this summary online, which includes more information about decoding the responses: Secure Digital (SD) Card Spec and Info.

As it turned out, I was not sending the correct CRC for CMD8, and a CRC error is indicated by the 09 (binary 0000 1001) response to the second command above. Each bit in the response has a different meaning, and I had mistaken this for indicating an unsupported command, which is expected for some card types.

It’s a good thing that I had this problem, because searching for the CRC algorithm turned up this collection of AVR C tutorials, which includes a 4-part series on implementing SD card support.

I noticed that all of the command examples were wrapped with code like this:

    // assert chip select
    SPI_transfer(0xFF);
    CS_ENABLE();
    SPI_transfer(0xFF);
    // ... code to send a command...
    // deassert chip select
    SPI_transfer(0xFF);
    CS_DISABLE();
    SPI_transfer(0xFF);

Later on, I had a problem where responses were sometimes mis-aligned by one bit. When I also tried sending empty bytes between commands, while the SD card is not selected, these problems disappeared entirely.

Software: Part 3

With the initialisation code complete, I could finally request data from the SD card.

Data on an SD card is arranged in 512-byte blocks by default. I wrote a routine to read one block of data from the SD card into RAM, plus a routine to print sections of RAM.

The result was a program which prints the first kilobyte of the SD card, which is the letter a 512 times, followed by the letter b 512 times.

This screen capture shows the output of a program matching a hexdump of the test file, which is exactly what I was aiming for.

The full code is too long to include in a blog post, but can be found in the repository for this project, mike42/6502-computer on GitHub.

Some lessons

While I was able to read data from an SD card, this took a long time, and there were several points where I nearly hit a dead-end.

I couldn’t see what was happening at a hardware level, and I couldn’t compare my setup with one that worked, because I had never done anything similar before. Spot-the-difference is a powerful debugging tool, and by jumping straight into 6502 assembly, while avoiding existing 6502 implementations, I definitely made things difficult for myself.

With that in mind, there are several ways I could have made this easier:

  • Better equipment. A logic analyser, an SD card writer, and spare SD cards would have been useful for testing.
  • Building somewhere less-constrained first. I could have implemented this first on a Raspberry Pi, then ported the result to 6502 assembly. It would then be possible to verify my hardware setup by running existing code, while still getting the learning experience from writing my implementation from scratch. Note that this would have required a different SD card break-out board.
  • Building something simpler first. For example, Ben Eater released a a YouTube video around the time I was working on this, where he reads from an SPI temperature sensor on a 6502 computer. It would be much easier to get started if I had used any other SPI device on a 6502 first.

The next step in this project is to load programs from an SD card, which would require me to implement a filesystem. This is significantly more complex than what I’ve done here, and I know not to approach it the same way.

More equipment

I picked up a logic analyser, USB SD card reader, alternative SD card module, and a spare SD card.

The logic analyser is compatible with sigrok, which is open source. This is the first time I’ve used this software, and it can decode SD card commands running over SPI, among other things.

The USB SD card reader will allow me to write disk images directly from my development environment, instead of copying them over to a laptop. I do most of my development from a virtual machine, and USB peripherals are easy to pass though.

The alternative SD card module is from SparkFun. Compared with the module I was writing in this blog post, it has a smaller footprint, and includes the pull-up resistors. I’m hoping I can use this to fit an SD card into the 3D-printed enclosure for this project.

Implementing the XMODEM protocol for file transfer

I recently built a 6502-based computer from scratch, and I’m using it as a platform for testing 6502 assembly code. The software on this computer is currently very limited, and re-programming an EEPROM chip to test changes is a slow process.

This blog post covers changes which I made to implement the XMODEM protocol, allowing me to upload programs over a serial connection.

The protocol itself

I chose XMODEM because it’s the simplest protocol which is compatible with modern terminal software. All I need right now is the ability to place some data (machine code) into RAM, so that I can execute it.

An important note that is that modern file transfer protocols would normally run over TCP/IP, while XMODEM runs over a serial connection. This works for me, since it’s the only interface I’ve got.

For what it’s worth, XMODEM is also era-appropriate technology for this retro computer, since it was first implemented in 1977. The 6502 processor came out in 1975.

Implementation

Implementing XMODEM was the easy part of this process. There are existing 6502 assembly implementations which I could have used, but I decided to work from the description on the Wikipedia article instead. This is partly to make sure I understand it, and partly to keep the software licensing of my ROM as clear as possible.

I have the beginnings of a shell for running commands from ROM, so I implemented two new commands:

  • rx, to receive a file over XMODEM and load it into memory.
  • run to jump to the program

To start the transfer, I found it useful to write a method for reading a character with a time-out. I already have routines for reading and writing characters from serial.

; Like acia_recv_char, but terminates after a short time if nothing is received
shell_rx_receive_with_timeout:
  ldy #$ff
@y_loop:
  ldx #$ff
@x_loop:
  lda ACIA_STATUS              ; check ACIA status in inner loop
  and #$08                     ; mask rx buffer status flag
  bne @rx_got_char
  dex
  cpx #0
  bne @x_loop
  dey
  cpy #0
  bne @y_loop
  clc                          ; no byte received in time
  rts
@rx_got_char:
  lda ACIA_RX                  ; get byte from ACIA data port
  sec                          ; set carry bit
  rts

The entire receive process is around 60 lines of assembly code. The process involves ASCII NAK character repeatedly until a SOH character is received, which indicates the start of a 128 byte packet. All padding around each packet is discarded, and the payload is written to memory. This process is repeated until the end of transmission (EOT) is received.

; Built-in command: rx
; recives a file over XMODEM protocol, and loads it into memory
USER_PROGRAM_START = $0400   ; Address for start of user programs
USER_PROGRAM_WRITE_PTR = $00 ; ZP address for writing user program

shell_rx_main:
  ; Set pointers
  lda #<USER_PROGRAM_START   ; Low byte first
  sta USER_PROGRAM_WRITE_PTR
  lda #>USER_PROGRAM_START   ; High byte next
  sta USER_PROGRAM_WRITE_PTR + 1
  ; Delay so that we can set up file send
  lda #1                     ; wait ~1 second
  jsr shell_rx_sleep_seconds
  ; NAK, ACK once
@shell_block_nak:
  lda #$15                   ; NAK gets started
  jsr acia_print_char
  lda SPEAKER                ; Click each time we send a NAK or ACK
  jsr shell_rx_receive_with_timeout  ; Check in loop w/ timeout
  bcc @shell_block_nak       ; Not received yet
  cmp #$01                   ; If we do have char, should be SOH
  bne @shell_rx_fail         ; Terminate transfer if we don't get SOH
@shell_rx_block:
  ; Receive one block
  jsr acia_recv_char         ; Block number
  jsr acia_recv_char         ; Inverse block number
  ldy #0                     ; Start at char 0
@shell_rx_char:
  jsr acia_recv_char
  sta (USER_PROGRAM_WRITE_PTR), Y
  iny
  cpy #128
  bne @shell_rx_char
  jsr acia_recv_char         ; Checksum - TODO verify this and jump to shell_block_nak to repeat if not mathing
  lda #$06                   ; ACK the packet
  jsr acia_print_char
  lda SPEAKER                ; Click each time we send a NAK or ACK
  jsr acia_recv_char
  cmp #$04                   ; EOT char, no more blocks
  beq @shell_rx_done
  cmp #$01                   ; SOH char, next block on the way
  bne @shell_block_nak       ; Anything else fail transfer
  lda USER_PROGRAM_WRITE_PTR ; This next part moves write pointer along by 128 bytes
  cmp #$00
  beq @block_half_advance
  lda #$00                   ; If low byte != 0, set to 0 and inc high byte
  sta USER_PROGRAM_WRITE_PTR
  inc USER_PROGRAM_WRITE_PTR + 1
  jmp @shell_rx_block
@block_half_advance:         ; If low byte = 0, set it to 128
  lda #$80
  sta USER_PROGRAM_WRITE_PTR
  jmp @shell_rx_block
@shell_rx_done:
  lda #$6                    ; ACK the EOT as well.
  jsr acia_print_char
  lda SPEAKER                ; Click each time we send a NAK or ACK
  lda #1                     ; wait a moment (printing does not work otherwise..)
  jsr shell_rx_sleep_seconds
  jsr shell_rx_print_user_program
  jsr shell_newline
  lda #0
  jmp sys_exit
@shell_rx_fail:
  lda #1
  jmp sys_exit

This code is the first time that I’ve used the “Zero Page Indirect Indexed with Y” address mode. It uses a pointer to the program’s location, which is incremented 128 each time a packet is accepted.

The index in the current packet is stored in the Y register, allowing each incoming byte to be written to memory with one operation, which is called in a tight loop.

sta (USER_PROGRAM_WRITE_PTR), Y

The other routines mentioned in this source listing are:

  • shell_rx_sleep_seconds – sleeps for the number of seconds in the A register.
  • shell_rx_print_user_program – prints the first 256 bytes of the program for verification.

The checksum is entirely ignored. Everything here is running on my desk, so I’m not too worried about transmission errors.

After a few iterations, I was able to upload placeholder text to RAM.

The screen capture shows 256 bytes of memory (2 packets in XMODEM). The ASCII SUB character (1a in hex) fills out unused bytes at the end, since everything sent over XMODEM needs to be a multiple of 128 bytes.

Assembling some programs

I thought I was almost done at this point, but it turns out that writing a “Hello World” test program to run from RAM is easier said than done.

I started by creating a new ld65 linker configuration which links code to run from address $0400 onwards, which is where programs are loaded into RAM. This system does not have a loader for re-locatable code, and absolute addresses are assigned by the linker.

This is the configuration which I used for these tests. It has some flaws, but it’s good enough to place the CODE segment in the right place.

# Test configuration for user programs
MEMORY {
    ZP:     start = $00,    size = $0100, type = rw, file = "";
    RAM:    start = $0100,  size = $0300, type = rw, file = "";
    MAIN:   start = $0400,  size = $1000, type = rw, file = %O;
    ROM:    start = $C000,  size = $4000, type = ro, file = "";
}

SEGMENTS {
    ZEROPAGE: load = ZP,  type = zp;
    BSS:      load = RAM, type = bss;
    CODE:     load = MAIN, type = ro;
    VECTORS:  load = ROM, type = ro,  start = $FFFA, optional=yes;
}

If I wanted my test program to terminate, then I needed to find the address of sys_exit in ROM. This routine hands control back to the shell after a command completes.

To get this, I re-assembled the ROM, but added the flag --debug-info to the ca65 assembler, and --dbgfile boot.dbg to the ld65 linker. This produces a text file, which contains the final address of each symbol.

sym id=45,name="sys_exit",addrsize=absolute,scope=0,def=28,ref=298+192+426+390+157,val=0xC11E,seg=0,type=lab

The simplest test program I could come up with was:

; location in ROM
sys_exit = $c11e

.segment "CODE"
main:
  jmp sys_exit

I was able to assemble this program to 3 bytes of binary and run it. It does absolutely nothing, and simply returns to the shell.

To extend this into a “Hello World” program, I needed to look up the location of two more routines in the ROM.

; locations of some functions in ROM
acia_print_char = $c020
shell_newline   = $c113
sys_exit        = $c11e

.segment "CODE"
main:
  ldx #0
@test_char:
  lda test_string, X
  beq @test_done
  jsr acia_print_char
  inx
  jmp @test_char
@test_done:
  jsr shell_newline
  lda #0
  jmp sys_exit

test_string: .asciiz "Test program"

This animation shows the process of uploading the binary and executing it via minicom.

There is nothing special about the invocation for minicom here, it’s just a bit-rate and device name, which in my case is:

minicom -b 19200 -D /dev/ttyUSB0

Automating the boring stuff

This approach is very fragile, because the code is full of hard-coded memory addresses.

Any time I change the ROM, the addresses for these routines could change, and this program will no longer work.

On any real architecture, applications call into the OS via system calls, or maybe a jump table for 8-bit systems. I don’t a stable binary interface, and there is no userspace/kernel distinction, so I decided not to go down that path.

Instead, I wrote a Python script to generate an assembly source file, using the same debug symbols which I was manually reading. This runs after each ROM build.

The format is:

...
.export VIA_T2C_H                        = $8009
.export VIA_T2C_L                        = $8008
.export acia_delay                       := $c02b
.export acia_print_char                  := $c014
...

When this file is assembled, it doesn’t generate any code, but it does define all of the constants and labels from the ROM. When applications link against this, they can .import anything which would be available to ROM-based code.

The start of the above script is re-written as:

.import acia_print_char, shell_newline, sys_exit

This means that I have source compatibility, just not binary compatibility.

Bringing everything together

With my initial goal achieved, I was still not happy with the development experience. I started to dig into the murky world of minicom scripting to see if I could automate the process of uploading/running programs, and found that it is indeed possible.

The scripting capabilities of minicom do not appear to be widely used, and I needed to read through the source code to find ways to work around the problems I encountered.

For example, the “send” command was sending characters too quickly for the target machine. There are hard-coded delays between commands in the interpreter, but you can’t use multiple “send” commands to slow things down, because it is also hard-coded to append a line ending each time.

The fix was to use !<, which sends the output of a command verbatim. This script uses echo -n command to send each character individually. This is the exact same procedure as the animation in the previous section. First, the script types “rx”, then sends the file, then types “run”.

# confirm we have a shell prompt
send ""
sleep 0.1
# put into receive mode
expect "# "
sleep 0.1
!< echo -n 'r'
!< echo -n 'x'
send ""
sleep 0.1
# send the program and wait for prompt
! sx -q $TEST_PROGRAM 2> /dev/null
expect "# "
# run program
sleep 0.1
!< echo -n 'r'
!< echo -n 'u'
!< echo -n 'n'
send ""
sleep 0.1
# Terminate minicom when program completes (bit crazy..)
expect "# "
sleep 0.1
! killall -9 minicom

It’s also worth talking about the end of this script. When minicom scripts end, the user is returned to an interactive session. I wanted the minicom process to terminate without clearing the screen, so I run killall -9 minicom as the last line of the script.

This is not the safest thing to include in a script, but it’s certainly effective, and I am quite certain that there is no built-in way to achieve this with the current version of minicom. Realistic alternatives include patching minicom, switching to different terminal software.

The minicom invocation includes the name of the minicom script to execute, and the script will use the TEST_PROGRAM environment variable to name the binary file to run.

TEST_PROGRAM=test.bin minicom -b 19200 -S minicom_autorun.txt -D /dev/ttyUSB0

When running this command, the program executes on the 6502 computer seamlessly, though minicom leaves the terminal in a broken state when it is killed, and bash prints the “Killed” text to the screen.

To work around this, I wrapped the minicom invocation in a shell script which builds the program, suppresses all errors, and always returns 0. This avoids constant reporting that killall -9 is a problem, but it does hide any other problems.

#!/bin/bash
set -eu -o pipefail
make $1.bin
(TEST_PROGRAM=$1.bin minicom -b 19200 -S minicom_autorun.txt -D /dev/ttyUSB0 || true) 2> /dev/null

This script is invoked as:

./assemble_and_run.sh test

A few months back, I wrote an IntelliJ plugin for 6502 assembly, and I’m writing all of my code in PyCharm. The last step was to get this running from my IDE.

I added a build configuration which invokes the assemble_and_run.sh script. The script is completely generic, and I can use it to run other programs by changing the argument.

I can now write a program, click run, and watch it execute on real hardware, all without leaving my IDE. It took a lot of effort, but this is a huge improvement from where I started.

Wrap-up

A fast compile/test/run cycle is a huge gain for developer productivity on any system. I can now get fast, accurate feedback on whether my 6502 assembly programs work on real hardware. The first program I’m writing involves interfacing with an SD card, which will be a lot more approachable with this improvement.

I was surprised at how tricky it was to assemble and link a small program to run in RAM instead of ROM, and I can see that my linker configuration will need to be improved as I write programs which use more data. My basic plan is to test features by uploading them over XMODEM, then add them to the ROM when they are working.

The full source code for this project, which includes hardware and software, can be found on GitHub at mike42/6502-computer.

Porting BASIC to my 6502 computer

I’ve recently been building my own 6502-based computer. After a lot of work on the hardware side, I decided that it’s time to move past “Hello World” programs, and port a BASIC interpreter.

Background

This computer is architecturally similar to a 1980’s home computer, where BASIC was a widely-used interpreted language. Porting an existing interpreter will be a fast way to get a command-prompt, and it will also add the ability to load arbitrary software, by typing out a program.

I heard about an interpreter called EhBASIC (“Enhanced BASIC”) from on this YouTube video by Chris Bird. It’s source-available (free for non-commercial use only), and seems to be well-regarded by 6502 enthusiasts.

EhBASIC was written by the late Lee Davidson. I based my port on the version v2.22, hosted on Klaus Dormann’s GitHub. I used Hans Otten’s mirror of Lee’s website as a reference, plus the EhBASIC section of the 6502.org forum.

Porting process

EhBASIC was a breeze to port.

I started with the ‘patched’ code from Klaus2m5/6502_EhBASIC_V2.22, which I placed in a local git repository, so that I could track the changes.

I first changed some of the syntax so that the code would assemble with ca65, since that is the assembler I’m using for everything else. There are other ca65 ports around, though they do not include the same patches.

The --feature labels_without_colons setting in ca65 was useful here, since the original source does not include colons after label names. The output was 10.5KiB of machine code, which will fit in the 16KiB of ROM space which I have available in my home-built computer.

I also confirmed that the code did not depend on using undocumented 6502 CPU opcodes, which would have been incompatible with my newer-generation 65C02 CPU.

Next, I dropped in the correct routines to read and write characters. I already had something similar from an earlier blog post. I needed to make some small changes to use the carry bit (SEC / CLC opcodes) to indicate whether a character had been received, and to set up the 6551 ACIA on startup. I programmed the code to an EEPROM, and it worked the first time.

All of my changes against the original code, including the Makefile and ca65 config file are in this commit. The memory addresses line up with the decoding scheme described here.

Wrap-up

BASIC is an interesting language, if only for historical reasons. Classic BASIC feels very clunky by modern standards, but the user experience is not that different to the modern Python REPL. Or at least, it is more similar than you might expect given how far computers have advanced.

I’m running EhBASIC on an 8-bit CPU, where the main alternative is plain 6502 assembly language. BASIC has allowed me to hit the ground running, and gives me access to floating point maths and user-loadable programs, on a computer that does not have an operating system or removable storage.

My home-built computer has a switch for selecting between two ROM’s. My plan is to write my own code in assembly language, but keep this port of EhBASIC in the secondary ROM, so that the computer can always boot to something that works.

IntelliJ plugin for 6502 assembly language

I’ve been writing some 6502 assembly code in the past year, and have found the developer experience for this language to be lacking some modern conveniences.

In my last blog post, I described the development tools that I used to implement a simple NES game on Linux. I used text editor to write the code, and it couldn’t do much more than syntax highlighting.

I write most of my other code in IntelliJ, so I decided to take a look at what would be required to write a plugin for 6502 assembly support. I managed to together something that mostly works, which I published to the JetBrains Marketplace last weekend.

You can find it by searching “6502 Assembly” in most JetBrains IDE’s.

Features

Firstly, it supports syntax highlighting. I limited the scope of the plugin to ca65 assembler syntax only, since it’s the assembler that I know best.

You can navigate to any label with Ctrl+Click. This will not yet work for other types of symbols, such as constants, macros, and imports.

If the plugin sees a jump or branch statement, and can figure out where the jump goes, then it will show a gutter icon which navigates to the target.

You can find the usages of a label.

In ca65, you can use nested scopes. The plugin shows code folding buttons to collapse these scopes.

You can navigate to a symbol by name using the “Go To Symbol” action.

The plugin allows for commenting and un-commenting blocks of code, though there is no formatter, so indentation sometimes still has to be fixed manually.

It also allows you to rename a label and its usages, which is a great time-saver.

Limitations

One limitation is that the plugin does not fully parse expressions, so it will not detect errors from mis-matched brackets.

Secondly, the plugin does not understand the project structure. This means that if you re-use the same label name in different places in your project, the “Rename” and “Find Usages” function will match all of them at once, because it is not smart enough to follow imports and apply scope rules.

Future improvements

A lot of things can be done with an IDE plugin, but I’m planning to use this initial implementation for a while before attempting to add any big-ticket features.

A reasonable goal might be to have “Hello World” project skeletons for common 6502 systems, and to support launching with some common emulators. I would like to be able to set breakpoints and debug 6502 programs in an IDE, but external debugging interfaces are not commonly available for retro emulators, so this is probably not a reasonable goal.

The code is on GitHub under an MIT license. If you are using this plugin and would like to help improve it, then pull requests are welcome.

Building my first NES game: A retrospective

Last year, I spent a fair amount of time learning how to make games for the Nintendo Entertainment System. The homebrew scene for this system is very much alive, and I was quite proud that I was able to make a simple, working video game which runs on a NES emulator.

So, I present to you: 8-bit Table Tennis.

This blog post just a few notes about the tools and resources that I used for NES development on Linux, plus a few things that I learned along the way.

Development tools

The CPU in the NES is a derivative of the once-ubiquitous MOS 6502, and games for it are mostly written in 6502 assembly. I already do a fair amount of programming, but quickly found that my usual editors, compilers and debuggers were useless for this platform.

The main resources I used were:

To build my code, I used the cc65 toolchain, which includes a cross-assembler. I did most of my testing in the Nestopia emulator, and most of my editing in Gedit. All three of these are available in the official Debian repositories.

Gedit setup

Out of the box, Gedit can’t syntax highlight 6502 assmebly. I found a language spec file for it on the the 6502.org forums, and edited it slightly before using it.

At the time of writing, new language specs can be installed on Debian like this:

sudo cp asm6502.lang /usr/share/gtksourceview-4/language-specs/asm6502.lang && sudo chmod 0644 /usr/share/gtksourceview-4/language-specs/asm6502.lang

It possible to add a console and git support to with gedit, but this is still a long way from a full IDE.

Geany

Geany is a programming text editor, and is also available in the official Debian repositories.

I wasn’t able to get it to recognise 6502 assembly, but it does have good x86 assembly (NASM) support, which is similar enough for it to allow navigation through the source code using labels.

Geany can also set up projects with build scripts (such as a Makefile), which could allow for quicker testing.

FCEUX

FCEUX is a NES emulator which has an in-built debugger, and it runs well under WINE. I was also able to build and run it natively on Linux, but the debugger is only present in the Windows build.

I also briefly experimented with using the FCEUX Lua interface to pause/resume the emulator over a TCP socket for debugging, since that interface is available in the Linux build. I could only pause the emulator between frames, so I decided to abandon this.

Graphics

I created the graphics in the GNU Image Manipulation Program, and manually broke the title page down into tiles.

I also wrote a custom program to convert a 4-colour PNG file into the native NES CHR graphics format. I run this conversion as part of the build process, so that the graphics files can be stored in a modern format.

Things I learned

Programming for the NES is quite simple once the project is up and running, but even simple tasks can be a real hassle. I spent more time than I expected on mundane tasks such as collision detection, and certainly did a poor job of implementing game physics. The 6502 CPU has no built-in way to multiply, divide, or perform floating point operations. The version used for the NES additionally lacks any way to perform a binary-to-decimal conversion, which would have been very useful for displaying the player scores! In hindsight, I would have been able to improve the physics by pre-computing some lookup tables.

On the other hand, I expected it to be difficult to work within 2KiB of RAM, but due to the simplicity of the game, I only used around 25% of it, and had plenty of CHR ROM space leftover as well.

It took me three weekends to make this project, without having ever written a line of 6502 assembly before. I think that development would be a lot faster if I were able to set breakpoints from my code editor, so I will definitely be looking at other debugging emulators before attempting my next 6502 assembly project.

Conclusion

8-bit Table Tennis is available as an iNES ROM on GitHub.

I have only been running it with an emulator, so if any readers of this blog own a flash cartridge for the NES (such as an Everdrive), then please let me know if it works on the real hardware!

Benchmarking PHP code with PhpBench

This blog post is all about measuring the speed of PHP code through micro-benchmarks.

Why micro-benchmarks

I think of micro-benchmarks as a complement to tests. A well-written test would give a developer a definite pass or fail result, while well-written micro-benchmark will give the developer an indication of the duration of an operation.

To put it another way, tests can help you find out whether the code is behaving correctly, while micro-benchmarks can help you to track whether your changes are making the code faster or slower.

Make sure that your code is doing real processing work before you spend time on this: Algorithms such as sorting, compressing, or parsing are good candidates. My guess is that most of the code in a regular PHP application would not benefit significantly from having a suite of benchmarks, because web apps tend to be I/O bound (eg database, network and disk access).

Introducing PhpBench

I have previously written a standalone test script every time I’ve needed to measure the speed of something in PHP. This works up to a point, but it gets quite hard to maintain as the number of benchmarks increases.

PhpBench is one of the available tools for running micro-benchmarks from PHP, and there are a few reasons why I think it’s worth a try:

  • it’s actively maintained
  • it installs with composer
  • it runs in a similar way to PHPUnit, so the benchmarks are fully specified in PHP code and run with a PHP tool.
  • it implements familiar concepts that you would find in benchmarking tools from other languages (eg. JMH from the Java world).

Installation

Start out with a blank composer project.

$ composer init

Add some auto-loading settings to composer.yml that PHP knows to find ExampleApp classes in the src folder.

{
    "name": "mike42/php-benchmark-examples",
    "description": "Example PHP benchmark project",
    "type": "project",
    "license": "MIT",
    "minimum-stability": "dev",
    "require": {},
    "require-dev": {},
    "autoload": {
        "psr-4": {
            "ExampleApp\\": "src"
        }
    }
}

At this point, install the phpbench dev version.

$ composer require --dev phpbench/phpbench:@dev
./composer.json has been updated
Loading composer repositories with package information
Updating dependencies (including require-dev)
Package operations: 22 installs, 0 updates, 0 removals
  - Installing symfony/process (4.4.x-dev 1a42849): Cloning 1a42849a7f from cache
  - Installing symfony/options-resolver (4.4.x-dev 94cbb72): Cloning 94cbb72bb9 from cache
  - Installing symfony/finder (4.4.x-dev 1b5ec12): Cloning 1b5ec12340 from cache
  - Installing symfony/polyfill-ctype (dev-master 82ebae0): Cloning 82ebae0220 from cache
  - Installing symfony/filesystem (4.4.x-dev 5914824): Cloning 59148241f7 from cache
  - Installing psr/log (dev-master c4421fc): Cloning c4421fcac1 from cache
  - Installing symfony/debug (4.4.x-dev 8278839): Cloning 8278839457 from cache
  - Installing symfony/service-contracts (dev-master 0c81a04): Cloning 0c81a04f68 from cache
  - Installing symfony/polyfill-php73 (dev-master d1fb4ab): Cloning d1fb4abcc0 from cache
  - Installing symfony/polyfill-mbstring (dev-master fe5e94c): Cloning fe5e94c604 from cache
  - Installing symfony/console (4.4.x-dev e2fe100): Cloning e2fe1002fd from cache
  - Installing lstrojny/functional-php (1.9.0): Loading from cache
  - Installing beberlei/assert (v3.x-dev ce139b6): Cloning ce139b6bf8 from cache
  - Installing seld/jsonlint (1.7.1): Loading from cache
  - Installing psr/container (dev-master 014d250): Cloning 014d250dae from cache
  - Installing phpbench/container (1.2): Loading from cache
  - Installing webmozart/assert (1.4.0): Loading from cache
  - Installing webmozart/path-util (dev-master 95a8f7a): Cloning 95a8f7ad15 from cache
  - Installing phpbench/dom (0.2.0): Loading from cache
  - Installing doctrine/lexer (dev-master ee614dd): Cloning ee614dd93a from cache
  - Installing doctrine/annotations (1.7.x-dev 3f35255): Cloning 3f35255290 from cache
  - Installing phpbench/phpbench (dev-master dccc67d): Cloning dccc67dd52 from cache
symfony/service-contracts suggests installing symfony/service-implementation
symfony/console suggests installing symfony/event-dispatcher
symfony/console suggests installing symfony/lock
Writing lock file
Generating autoload files

Writing a benchmark

Add some code to src/ExampleThing.php to measure.

<?php

namespace ExampleApp;

class ExampleThing {
  /**
   * Inefficiently multiply numbers together
   */
  public function multiply(int $x, int $y) : int {
    $ret = 0;
    for($i = 0; $i < $x; $i++) {
      for($j = 0; $j < $y; $j++) {
        $ret++;
      }
    }
    return $ret;
  }
}

Write a benchmark for the code in benchmarks/ExampleThingBenchmark.php

<?php

use ExampleApp\ExampleThing;

/**
 * @BeforeMethods({"init"})
 * @Revs(1000)
 * @Iterations(5)
 */
class ExampleThingBenchmark {

    private static $exampleThing;

    public function init()
    {
        self::$exampleThing = new ExampleThing();
    }

    /**
     * @Subject
     */
    public function doMultiply()
    {
        self::$exampleThing -> multiply(100, 100);
    }
}

Before you run anything, you will also need a configuration file. Add this minimal configuration to phpbench.json.dist.

{
    "php_disable_ini": true,
    "bootstrap": "vendor/autoload.php",
    "path": "benchmark",
    "php_config": {
        "extension": [ "json.so" ]
    },
    "time_unit": "milliseconds"
}

Running the benchmark

The most basic way to run all benchmarks at once is phpbench run, which looks like this:

$ php vendor/bin/phpbench run
PhpBench @git_tag@. Running benchmarks.
Using configuration file: /home/mike/workspace/blog/phpbench/php-benchmarks/phpbench.json.dist

\ExampleThingBenchmark

    doMultiply..............................I4 [μ Mo]/r: 0.232 0.232 (ms) [μSD μRSD]/r: 0.000ms 0.10%

1 subjects, 5 iterations, 1,000 revs, 0 rejects, 0 failures, 0 warnings
(best [mean mode] worst) = 0.232 [0.232 0.232] 0.233 (ms)
⅀T: 1.162ms μSD/r 0.000ms μRSD/r: 0.100%

As you might expect, there are options to run specific benchmarks, or to present the results in different ways. However, there were some PhpBench-specific things which you will need to navigate to use it successfully.

Firstly, there are two different output settings to consider:

  • report options allow you to decide which data to print in a table.
  • output options allow you to decide how to format it for output

I also found it useful to add --progress=none to suppress the text displayed by default, since you will otherwise get broken HTML output.

Lastly, be aware that confusingly, the default report is not active by default. Specify it to get a table listing out each iteration.

$ php vendor/bin/phpbench run --progress=none --report=default
suite: 13415431dc0db41c12decee0fa83c26d1db0f678, date: 2019-05-31, stime: 22:01:34
+-----------------------+------------+-----+------+------+----------+-----------+--------------+----------------+
| benchmark             | subject    | set | revs | iter | mem_peak | time_rev  | comp_z_value | comp_deviation |
+-----------------------+------------+-----+------+------+----------+-----------+--------------+----------------+
| ExampleThingBenchmark | doMultiply | 0   | 1000 | 0    | 915,736b | 236.716μs | +1.68σ       | +1.08%         |
| ExampleThingBenchmark | doMultiply | 0   | 1000 | 1    | 915,736b | 231.999μs | -1.45σ       | -0.93%         |
| ExampleThingBenchmark | doMultiply | 0   | 1000 | 2    | 915,736b | 233.960μs | -0.15σ       | -0.1%          |
| ExampleThingBenchmark | doMultiply | 0   | 1000 | 3    | 915,736b | 233.878μs | -0.2σ        | -0.13%         |
| ExampleThingBenchmark | doMultiply | 0   | 1000 | 4    | 915,736b | 234.361μs | +0.12σ       | +0.08%         |
+-----------------------+------------+-----+------+------+----------+-----------+--------------+----------------+

The type of report that I found most useful is aggregate, since it provides a summary of each iteration.

$ php vendor/bin/phpbench run --progress=none --report=aggregate
suite: 1341543c02ebee4031d165665c2724c379bf9c98, date: 2019-05-31, stime: 22:01:18
+-----------------------+------------+-----+------+-----+----------+-----------+-----------+-----------+-----------+---------+--------+-------+
| benchmark             | subject    | set | revs | its | mem_peak | best      | mean      | mode      | worst     | stdev   | rstdev | diff  |
+-----------------------+------------+-----+------+-----+----------+-----------+-----------+-----------+-----------+---------+--------+-------+
| ExampleThingBenchmark | doMultiply | 0   | 1000 | 5   | 915,736b | 231.907μs | 232.614μs | 232.089μs | 233.538μs | 0.713μs | 0.31%  | 1.00x |
+-----------------------+------------+-----+------+-----+----------+-----------+-----------+-----------+-----------+---------+--------+-------+

Both of the example above use console output, which is the default.

HTML reports

To capture phpbench output from a CI environment, you can generate a HTML report as well. Add an output section to phpbench.json.dist.

{
    ...
    "outputs": {
         "html_file": {
             "extends": "html",
             "file": "benchmarks.html",
             "title": "Example benchmark report"
         }
    }
}

This new html_file output can be used with --output=html_file

$ php vendor/bin/phpbench run \
    --progress=none --report=aggregate \
    --output=console --output=html_file

From Jenkins

This is a Jenkinsfile that I use to run benchmarks. You need to install the “HTML Publisher” and “AnsiColor” Jenkins plugins.

pipeline {
  agent any

  stages {
    stage('Install') {
      steps {
        ansiColor('xterm') {
          sh 'composer install'
        }
      }
    }
    stage('Run benchmarks') {
      steps {
        ansiColor('xterm') {
          sh 'php vendor/bin/phpbench run --progress=none --report aggregate --output=console --output=html_file --ansi'
          publishHTML([
            allowMissing: false,
            alwaysLinkToLastBuild: true,
            keepAll: true,
            reportDir: '',
            reportFiles: 'benchmarks.html',
            reportName:
            'phpbench report',
            reportTitles: ''
          ])       
        }
      }
    }
  }
}

Each build is then benchmarked, and the results are available as a link on the left.

The console output of the build shows the results.

To actually view the HTML report you will need to apply this tweak.

Next steps

I’ve started to add phpbench micro-benchmarks to some PHP projects that I work on, and I’m finding it a lot more useful than standalone test scripts.

The next step would be integrate micro-benchmarks better into the development process. For example, I can already review code coverage improvements/regressions of a change on GitHub using Coveralls, which posts a comment to each pull request. This saves you from merging changes which don’t meet your project’s standards.

As far as I can tell, you would need to write custom code for more expressive phpbench reports along these lines, such as:

  • the history of a benchmark over time, or
  • the most-changed benchmarks against a baseline

phpbench does provide the building-blocks though, since you can set some options to archive the results of each run to a folder.