Switching between VFIO-enabled virtual machines

For the past few months, I’ve been using a setup for software development where I pass the mouse, keyboard and graphics card to a virtual machine via Linux VFIO. The setup is described in detail in its own blog post.

In short though, the idea is to provision a separate virtual machine for each project, so that I can install whatever tools I need to solve the problem at hand, with no real consequences if I break my install.

The problem

I have multiple virtual machines which are configured to use the same hardware, so I can only run one at a time. Before making the changes outlined in this blog post, the method I used to switch to a different VM was unwieldy: I needed to shut down the current VM, switch monitor inputs to the host graphics, start a different VM, then switch monitor inputs back to the VM graphics.

I intend to do all of my work in VM’s, so my basic goal was to make a ‘switch VM’ mechanism available from within a VM, to remove any need to interact with the host operating system directly.

Creating an API

My plan for this was to run an agent on the host, which would accept a request to switch VM’s. It would then shut down the current VM, and start up a different one instead.

This was the first time I had tried using the libvirt Python bindings, and I was able to quickly make a script which toggled between two VM’s, which are nested within the VM I’m using for development, because of course they are.

Toggling which VM is active, using the Python libvirt API in PyCharm.

I then extended this to use an orderly shutdown, and to work with an arbitrary number of VM’s. CirrOS was not completing its boot in this environment, so I switched the test VM’s to Debian, and also passed through a USB device to each of them so that I would get an error if I ever tried to boot both VM’s simultaneously.
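
The core switching logic boils down to something like the following sketch. This is an illustrative reconstruction using the libvirt Python bindings rather than the project’s actual code; the domain name and shutdown timeout below are placeholders.

import time
import libvirt

def switch_to_vm(conn: libvirt.virConnect, target: str, timeout: int = 120) -> None:
    """Shut down whichever domain is running, then start the requested one."""
    for dom in conn.listAllDomains():
        if dom.isActive() and dom.name() != target:
            dom.shutdown()                          # request an orderly guest shutdown
            deadline = time.time() + timeout
            while dom.isActive() and time.time() < deadline:
                time.sleep(1)
            # a real implementation might dom.destroy() here if the guest ignored the request
    conn.lookupByName(target).create()              # start the target VM

conn = libvirt.open("qemu:///system")
switch_to_vm(conn, "dev-vm-2")                      # placeholder domain name
conn.close()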

Last, I exposed this over HTTP as a simple API.
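
As a sketch of what that can look like (the real routes and framework in mike42/vfio-vm-switcher may differ), a small Flask wrapper around the switching function is enough:

from flask import Flask, jsonify

app = Flask(__name__)

def switch_to_vm(name: str) -> None:
    ...                                             # stand-in for the libvirt logic sketched above

@app.route("/vm/<name>/switch", methods=["POST"])
def switch(name: str):
    switch_to_vm(name)
    return jsonify({"status": "ok", "vm": name})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)              # port is a placeholder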

For the first iteration, I installed this on the host, and placed some scripts in a folder which is shared between all of the VM’s. The scripts use curl commands to call the API. To trigger a VM switch (or a shutdown of the host), I could go to a folder, and run one of these scripts.

Web UI

With that proof-of-concept out of the way, I made a web interface, so that I can use this instead of some bash scripts in a folder. This is a very simple Angular app.

GNOME extension

For the most part, I’m working on Debian using the GNOME desktop environment, which gives me some options for integrating this further into my desktop.

I first considered making a search provider, but settled on making a simple standalone extension to display an indicator instead, based on the example app shown running here.

As if nested virtualization was not mind-bending enough, I started testing this using a GNOME session within a GNOME session.

The extension is written in JavaScript, though the JavaScript API for GNOME extensions was very unfamiliar to me. There are some obvious quality issues in the code, but it’s not being published to a repository, and it fires off the correct HTTP requests, so I’ll call it a success.

Deploying a GNOME extension manually is as simple as placing some files in my home directory.

Result and next steps

I’ve bookmarked this URL in each of my VM’s.

And I’ve also installed this GNOME extension on the VM that runs on boot.

The idea of this setup is to have one virtual machine per project, so a good next step would be to make it easier to kick off the install process.

I’ve put the code for this project online at mike42/vfio-vm-switcher on GitHub.

Controlling computer fans with a microcontroller

I’m currently working on building a small computer, and want to add some 4-pin computer fans, running quietly at a fixed speed.

This blog post is just a quick set of notes from prototyping, since it covers a few topics which I haven’t written about before.

The problem

The speed of 4-pin computer fans can be controlled by varying the duty cycle on a 25 kHz PWM signal. This signal normally comes from a 4-pin case fan header on the motherboard, which will not be available in this situation. Rather than run the fans at 100%, I’m going to try to generate my own PWM signal.

Two main considerations led me to choose to use a microcontroller for this:

  • I’ll need some way to adjust the PWM duty cycle after building the circuit, because I don’t know which value will give the best trade-off between airflow and noise yet.
  • Fans need a higher PWM duty cycle to start than they do to run. If I want to run the fans at a very low speed, then I’ll need them to ramp up on start-up, before slowing to the configured value.

It’s worth noting that I’m a complete beginner with microcontrollers. I’ve run some example programs on the Raspberry Pi Pico, but that’s it.

First attempt

I already had MicroPython running on my Raspberry Pi Pico, so that was my starting point.

For a development environment, I installed the MicroPython plugin for PyCharm, since I already use JetBrains IDE’s for programming. Most guides for beginners suggest using Thonny.

There are introductory examples on GitHub at raspberrypi/pico-micropython-examples which showed me everything I needed to know. I was able to combine an ADC example and a PWM example within a few minutes.

import time
from machine import Pin, PWM, ADC, Timer

level_input = ADC(0)
pwm_output = PWM(Pin(27))
timer = Timer()

# 25 KHz
pwm_output.freq(25000)


def update_pwm(timer):
    """ Update PWM duty cycle based on ADC input """
    duty = level_input.read_u16()
    pwm_output.duty_u16(duty)


# Start with 50% duty cycle for 2 seconds (to start fan)
pwm_output.duty_u16(32768)
time.sleep(2)

# Update from ADC input after that
timer.init(mode=Timer.PERIODIC, period=100, callback=update_pwm)

On my oscilloscope, I could confirm that the PWM signal had a 25 kHz frequency, and that the code was adjusting the duty cycle as expected. When the analog input (above) is set to a high value, the PWM signal has a high duty cycle.

When set to a low value, the PWM signal has a low duty cycle.

But can it control fans?

I wired up a PC fan to 12 V power, and also sent it this PWM signal, but the speed didn’t change. This was expected, since I knew that I would most likely need to convert the 3.3 V signal up to 5 V.

I ran it through a 74LS04 hex inverter with VCC at 5 V, which did the trick. I could then adjust a potentiometer, and the fan would speed up or slow down.

I captured the breadboard setup in Fritzing. Just note that there are floating inputs on the 74LS04 chip (not generally a good idea) and that the part is incorrectly labelled 74HC04, when the actual part I used was a 74LS04.

This shows a working setup, but it’s got more components than I would like. I decided to implement it a second time, on a simpler micro-controller which can work at 5 V directly.

Porting to the ATtiny85

For a second attempt, I tried using an ATtiny85. This uses the AVR architecture (of Arduino fame), which I’ve never looked at before.

This chip is small, widely available, and can run at 5 V. I can also program it using the TL-866II+ rather than investing in the ecosystem with Arduino development boards or programmers.

I found GPL-licensed code by Marcelo Aquino for controlling 4-wire fans.

After a few false starts trying to compile manually, I followed this guide to get the ATtiny85 ‘board’ definition loaded into the Arduino IDE.

From there I was able to build and get an Intel HEX file, using the “Export compiled binary” feature.

Writing fuses

The code assumes an 8 MHz clock. The ATtiny85 ships from the factory with a “divide by 8” fuse active. This needs to be turned off, otherwise the clock will be 1 MHz. This involves setting some magic values, separate from the program code.

I found these values using a fuse calculator.

The factory default for this chip is:

lfuse 0x62, hfuse 0xdf, efuse 0xff.

To disable the divide-by-8 clock but leave all other values default, this needs to be:

lfuse 0xe2, hfuse 0xdf, efuse 0xff.
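
As a sanity check, only one bit differs between the old and new low fuse values: bit 7, which is the CKDIV8 fuse on the ATtiny85 (AVR fuses are active-low, so a programmed bit reads as 0). A quick check in Python:

old_lfuse, new_lfuse = 0x62, 0xE2
print(f"{old_lfuse ^ new_lfuse:#010b}")   # 0b10000000 - only bit 7 differs
CKDIV8 = 1 << 7
assert new_lfuse == old_lfuse | CKDIV8    # bit set to 1 = CKDIV8 unprogrammed (divide-by-8 off)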

I am using the minipro open source tool to program the chip, via a TL-866II+ programmer. First, to get the fuses:

$ minipro -p ATTINY85@DIP8 -r fuses.txt -c config
Found TL866II+ 04.2.126 (0x27e)
Warning: Firmware is out of date.
  Expected  04.2.132 (0x284)
  Found     04.2.126 (0x27e)
Chip ID: 0x1E930B  OK
Reading config... 0.00Sec  OK

This saves the current fuse values to a text file:

lfuse = 0x62
hfuse = 0xdf
efuse = 0x00
lock = 0xff

I then set lfuse = 0xe2, and wrote the values back with this command.

$ minipro -p ATTINY85@DIP8 -w fuses.txt -c config
Found TL866II+ 04.2.126 (0x27e)
Warning: Firmware is out of date.
  Expected  04.2.132 (0x284)
  Found     04.2.126 (0x27e)
Chip ID: 0x1E930B  OK
Writing fuses... 0.01Sec  OK
Writing lock bits... 0.01Sec  OK

Writing code

Now the micro-controller is ready to accept the exported binary containing the program.

$ minipro -s -p ATTINY85@DIP8 -w sketch_pwm.ino.tiny8.hex -f ihex
Found TL866II+ 04.2.126 (0x27e)
Warning: Firmware is out of date.
  Expected  04.2.132 (0x284)
  Found     04.2.126 (0x27e)
Chip ID: 0x1E930B  OK
Found Intel hex file.
Erasing... 0.01Sec OK
Writing Code...  1.09Sec  OK
Reading Code...  0.45Sec  OK
Verification OK

With the chip fully programmed, I wired it up on a breadboard with 5 V power.

Checking the output again, the signal was the correct amplitude this time, but the frequency does move around a bit. This is likely because I’m using the internal RC timing on the chip, which is not very accurate. My understanding is that anything near 25 kHz will work fine.

Updates

I made only one change to Marcelo’s code, which is to spin the fan at 50% for a few seconds before using the potentiometer-set value. This is to avoid any issues where the fans fail to start because I’ve set them to run at a very low PWM value.

/*
 *                         ATtiny85
 *                      -------u-------
 *  RST - A0 - (D 5) --| 1 PB5   VCC 8 |-- +5V
 *                     |               |
 *        A3 - (D 3) --| 2 PB3   PB2 7 |-- (D 2) - A1  --> 10K Potentiometer
 *                     |               | 
 *        A2 - (D 4) --| 3 PB4   PB1 6 |-- (D 1) - PWM --> Fan Blue wire
 *                     |               |      
 *              Gnd ---| 4 GND   PB0 5 |-- (D 0) - PWM --> Disabled
 *                     -----------------
 */

// normal delay() won't work anymore because we are changing Timer1 behavior
// Adds delay_ms and delay_us functions
#include <util/delay.h>    // Adds delay_ms and delay_us functions

// Clock at 8mHz
#define F_CPU 8000000  // This is used by delay.h library

const int PWMPin = 1;  // Only works with Pin 1(PB1)
const int PotPin = A1;

void setup()
{
  pinMode(PWMPin, OUTPUT);
  // Phase Correct PWM Mode, no Prescaler
  // PWM on Pin 1(PB1), Pin 0(PB0) disabled
  // 8Mhz / 160 / 2 = 25Khz
  TCCR0A = _BV(COM0B1) | _BV(WGM00);
  TCCR0B = _BV(WGM02) | _BV(CS00); 
  // Set TOP and initialize duty cycle to 50%
  OCR0A = 160;  // TOP - DO NOT CHANGE, SETS PWM PULSE RATE
  OCR0B = 80; // duty cycle for Pin 1(PB1)
  // initial bring-up: leave at 50% for 4 seconds
  _delay_ms(4000);
}

void loop()
{
  int in, out;
  // follow potentiometer-set speed from there
  in = analogRead(PotPin);
  out = map(in, 0, 1023, 0, 160);
  OCR0B = out;
  _delay_ms(200);
}

The wiring of the breadboard is shown below. The capacitor is 0.1 µF for decoupling.

This is far more compact than the Raspberry Pi Pico prototype. I could also miniaturise it further by simply using surface-mount versions of the same components, whereas using an RP2040 microcontroller from the Pico directly on a custom board would involve some design effort.

Next steps & lessons learned

Although this project is simple, I had to learn quite a few things to prototype it successfully. Using a generic chip programmer like the TL-866II+ appears to be uncommon in the AVR world, and most online guides instead suggest repurposing an Arduino development board or using an Arduino ICSP programmer to program the ATtiny85. I was glad to confirm that I could use my existing hardware instead of buying ecosystem-specific items which I would have no other use for. I find the development experience to be far better with the Raspberry Pi Pico, and that’s what I would choose for a more complex project.

I also captured the breadboard wiring in Fritzing for this blog post. The diagrams are clearer than a photo of a breadboard, but I’m not confident that they communicate information about the circuit as well as alternative approaches. For future posts, I’ll return to using KiCad EDA for schematic capture, unless there is some reason to highlight the physical layout of a breadboard prototype.

As a next step, I’ll be building a simple break-out PCB for a specific computer case to power the fans and supply a PWM signal, based on the ATtiny85 prototype shown here.

Porting the Amiga bouncing ball demo to the NES

Earlier this year, I ported the Amiga bouncing ball demo to the Nintendo Entertainment System. This is a video capture from a NES emulator.

I completed this back in January, but I’m only publishing this blog post now, because I was considering entering it into a demo competition.

How it works

If you are familiar with NES development, then you will probably notice that there is nothing ground-breaking going on here. The ball is made up of 64 8×8 sprites, bouncing around the screen over a static background.

There are two different versions of the ball in sprite memory, and with some palette swapping, this can be stretched to 4 frames of animation, just enough to make the ball appear to spin.

There is enough sprite memory to extend this to 8 frames of animation without any banking tricks, but I would need to start again from scratch to achieve that.

Pre-rendered 3D

I drew the ball in Blender, with twice the number of segments required for the final image. This is a UV sphere with 8 rings, and 32 segments.

I coloured each face with one of 4 colours, then rendered it with the Workbench renderer, which I had configured to use Flat lighting, no anti-aliasing, and the texture colour.

I also tilted the ball.

This gave me a crisp, high-resolution ball with solid colours.

To test the idea, I processed this down to a 64×64 image on a transparent background. I needed to substitute colours, so there is no anti-aliasing here.

I then wrote up a Python script to swap out colours to make the 4-frame pattern. Note that frames 3 and 4 are the same as frames 1 and 2, but with white and red swapped.
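
The original script isn’t shown here, but the idea is simple enough to sketch with Pillow. The exact shades and file names below are placeholders:

from PIL import Image

WHITE = (255, 255, 255, 255)
RED = (204, 0, 0, 255)                    # assumed shade of red

def swap_white_red(src: str, dst: str) -> None:
    img = Image.open(src).convert("RGBA")
    swapped = [RED if px == WHITE else WHITE if px == RED else px
               for px in img.getdata()]
    out = Image.new("RGBA", img.size)
    out.putdata(swapped)
    out.save(dst)

swap_white_red("frame1.png", "frame3.png")
swap_white_red("frame2.png", "frame4.png")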

I ran this through ImageMagick to convert it into a GIF preview.

convert -delay 3.5 -loop 0 *.png animation.gif

The result appeared workable, so I went about making the same animation in a NES ROM.

NES implementation

On the NES, it’s not possible to create the spinning ball effect with palette changes only, because it would require four colours plus transparency. To overcome this, I split the image into two frames, each stored as two colours plus transparency, which is possible.
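
For background, NES tile data is 2 bits per pixel, stored as two 8-byte bit planes per 8×8 tile, which allows up to three palette colours plus transparency. This is not the actual tooling behind this demo, just a minimal sketch of that encoding:

def tile_to_chr(tile: list[list[int]]) -> bytes:
    """Pack an 8x8 tile of palette indices (0-3) into the 16-byte NES CHR format."""
    plane0, plane1 = bytearray(), bytearray()
    for row in tile:
        lo = hi = 0
        for px in row:                    # leftmost pixel ends up in the most significant bit
            lo = (lo << 1) | (px & 1)
            hi = (hi << 1) | ((px >> 1) & 1)
        plane0.append(lo)
        plane1.append(hi)
    return bytes(plane0 + plane1)

# a test tile which uses all four palette entries, two rows of each
example = [[0] * 8] * 2 + [[1] * 8] * 2 + [[2] * 8] * 2 + [[3] * 8] * 2
assert len(tile_to_chr(example)) == 16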

To start the code, I checked out a fresh copy of Brad Smith’s NES example project, then deleted things that I didn’t need.

When working with sprites on the NES, it is typical to store object attributes in RAM, then perform a DMA operation once per frame to copy this over to the Picture Processing Unit (PPU). For this project, I wanted to try setting up two different copies of the object attribute memory in RAM – one for each of the two rotations.

This should allow me to set any of the four frames by choosing between two possible colour palettes, and two possible DMA source addresses. This worked, and the first milestone was a spinning ball.

Switching between two DMA sources did not save much effort in the end, because I still needed quite a lot of code to set the X/Y positions of 64 sprites each frame. I set up some assembler macros to help with this.

Physics and sound

I have made one NES game before, and I was not happy with the physics. For this project, I wanted to do a bit better.

For the X position, I use a fixed speed, and simple collision detection with left/right boundaries. The animation frame is calculated from the X position, so the ball changes rotation depending on which direction it is moving, just like the Amiga demo.

The Y location of the ball is a simple loop, and follows the absolute value of a sine wave. I pre-computed this with some Python.
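
Something along these lines will generate the table. The amplitude, step count and output format here are illustrative rather than the demo’s actual values:

import math

STEPS = 64          # table entries per bounce
AMPLITUDE = 120     # bounce height in pixels

table = [round(abs(math.sin(math.pi * i / STEPS)) * AMPLITUDE) for i in range(STEPS)]

# emit as a ca65 .byte table, eight entries per line
for i in range(0, STEPS, 8):
    print(".byte " + ", ".join(f"{v:3d}" for v in table[i:i + 8]))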

My last project also had no sound, so I read up on the NES APU, and added some noise when the ball changes direction.

Development setup

Since my last NES project, I have improved my development setup quite a bit. As always, I am using the ca65 assembler on Linux.

Previously, I was using a text editor. I have since moved to a custom 6502 assembly plugin for PyCharm, which allows me to quickly jump to definitions/usages. I developed this for my other 65C02 and 65C816 projects, which you can read about on this blog.

When I click run, I’ve set up the project to assemble a NES ROM, then launch with fceux, which has debugging features.

The debug features were previously only available on Windows builds of fceux, so when I made this demo, I was running the emulator via WINE. These features have been added to the Linux builds as of v2.5.0, so I’ll most likely switch to that for my next project.

Wrap-up

I learn something new every time I write code for the NES, and it’s a lot of fun to make simple demos for these old systems. It’s also refreshing to create something standalone which I can explain through screenshots, instead of pages of assembly code.

For those who do want to read pages of assembly code, though, this project is up on GitHub at mike42/nes-ball-demo.

Let’s make a bootloader – Part 2

In my previous blog post, I wrote about loading and executing a 512 byte program from an SD card for my 65C816 computer project.

For part 2, I’ll look at what it takes to turn this into a bootloader, which can load a larger program from the SD card. I want the bootloader to control where to read data from, and how much data to load, which will allow me to change the program structure/size without needing to update the code on ROM.

Getting more data

The code in my previous post used software interrupts to print some text to the console, and my first challenge was to implement a similar routine for loading additional data from external storage.

The interface from the bootloader is quite simple: it sets a few registers to specify the source (a block number), destination memory address, and number of blocks to read, then triggers a software interrupt. The current implementation can read up to 64 KiB of data from the SD card each time it is called.

ROM_READ_DISK := $03
; other stuff ...
    ldx #$0000                      ; lower 2 bytes of destination address
    lda #$01                        ; block number
    ldy #$10                        ; number of blocks to read - 8 KiB
    cop ROM_READ_DISK               ; read kernel to RAM

Higher memory addresses

To load data into higher banks (memory addresses above $ffff), I set the data bank register.

I needed to update a lot of my code to use absolute long (24-bit) addressing for I/O access, where it was previously using absolute (16-bit) addressing, for example:

sta $0c00

In my assembler, ca65, I already use the a: prefix to specify 16-bit addressing. I learned here that I can use the f: prefix for 24-bit addressing.

sta f:$0c00

Without doing this, the assembler chooses the smallest possible address size. This is fine in a 65C02 system, but I find it less helpful on the 65C816, where the meaning of a 16-bit address depends on the data bank register, and the meaning of an 8-bit address depends on the direct page register.

New bootloader

With the ROM code sorted out, I went on to write the new bootloader.

This new code sets the data bank register, then uses a software interrupt to load 8 KiB of data to address $010000.

; boot.s: 512-byte bootloader. This utilizes services defined in ROM.
ROM_PRINT_STRING := $02
ROM_READ_DISK := $03

.segment "CODE"
    .a16
    .i16
    jmp code_start

code_start:
    ; load kernel
    ldx #loading_kernel             ; Print loading message (kernel)
    cop ROM_PRINT_STRING
    phb                             ; Save current data bank register
    .a8                             ; Set data bank register to 1
    php
    sep #%00100000
    lda #$01
    pha
    plb
    plp
    .a16
    ldx #$0000                      ; lower 2 bytes of destination address
    lda #$01                        ; block number
    ldy #$10                        ; number of blocks to read - 8 KiB
    cop ROM_READ_DISK               ; read kernel to RAM
    plb                             ; Restore previous data bank register

    ldx #boot                       ; Print boot message
    cop ROM_PRINT_STRING
    jml $010000

loading_kernel: .asciiz "Loading kernel\r\n"
boot: .asciiz "Booting\r\n"

.segment "SIGNATURE"
    wdm $42                         ; Ensure x86 systems don't recognise this as bootable.

The linker configuration for the bootloader is unchanged from part 1.

A placeholder kernel

The bootloader needed some code to load. I don’t have any real operating system to load, so I created the “Hello World” of kernels to work with in the meantime.

This is the simplest possible code I can come up with to test the bootloader. This assembles to 3 machine-language instructions, which occupy just 6 bytes. It is also position-independent, and will work from any memory bank.

; kernel_tmp.s: A temporary placeholder kernel to test boot process.

ROM_PRINT_CHAR   := $00

.segment "CODE"
    .a16
    .i16
    lda #'z'
    cop ROM_PRINT_CHAR
    stp

The linker configuration for this, kernel_tmp.cfg, creates an 8 KiB binary.

MEMORY {
    ZP:     start = $00,    size = $0100, type = rw, file = "";
    RAM:    start = $0200,  size = $7e00, type = rw, file = "";
    PRG:    start = $e000,  size = $2000, type = ro, file = %O, fill = yes, fillval = $00;
}

SEGMENTS {
    ZEROPAGE: load = ZP,  type = zp;
    BSS:      load = RAM, type = bss;
    CODE:     load = PRG, type = ro,  start = $e000;
}

The commands I used to assemble and link the bootloader are:

ca65 --feature string_escapes --cpu 65816 --debug-info boot.s
ld65 -o boot.bin -C boot.cfg boot.o

The commands I used to assemble and link the placeholder kernel are:

ca65 --feature string_escapes --cpu 65816 --debug-info kernel_tmp.s
ld65 -o kernel_tmp.bin -C kernel_tmp.cfg kernel_tmp.o

I assembled the final disk image by concatenating these files together.

cat bootloader/boot.bin kernel_tmp/kernel_tmp.bin > disk.img

I used the GNOME disk utility to write the image to an SD card, which helpfully reports that I’m not using all of the available space.

Result

It took me many attempts to get the ROM code right, but this did work. This screen capture shows the test kernel source code on the left, which is being executed at the end of the boot-up process.

If I wanted to load the kernel from within a proper filesystem on the SD card (eg. FAT32), then I would need to update the starting block in the bootloader (hard-coded to 1 at the moment).

The limitations of this mechanism are that the kernel needs to be stored contiguously, be sector-aligned, and located within the first 64 MiB of the SD card.

Speed increase

This was the first time that I wrote code for this system and needed to wait for it to execute. The CPU was clocked at just 921.6 kHz. My SPI/SD card code was also quite generic, and was optimised for ease of debugging rather than execution speed.

I improved this from two angles. My 65C816 test board allows me to use a different clock speed for the CPU and UART chip, so I sped the CPU up to 4 MHz by dropping in an 8 MHz oscillator (it is halved to produce a two-phase clock). As I sped this up, I also needed to add more retries to the SD card initialisation code, since it does not respond immediately on power-up.

I also spent a lot of time hand-optimising the assembly language routine to receive a 512-byte block of data from the SD card. There is room to speed this up further, but it’s fast enough to not be a problem for now.

I had hoped to load an in-memory filesystem (ramdisk) alongside the test kernel, but I’ve deferred this until I can compress it, since reading from SD card is still quite slow.

A debugging detour

Writing bare-metal assembly with no debugger is very rewarding when it works, and very frustrating when there is something I’m not understanding correctly.

I ran into one debugging challenge here which is obvious in hindsight, but which I almost couldn’t solve at the time.

This code is (I assure you) a bug-free way to print one hex character, assuming the code is loaded in bank 0. I was using this to hexdump the code I was loading into memory.

.a8                             ; Accumulator is 8-bit
.i16                            ; Index registers are 16-bit
lda #0                          ; A is 0 for example
and #$0f                        ; Take low-nibble only
tax                             ; Transfer to X
lda f:hex_chars, X              ; Load hex char for X
jsr uart_print_char             ; Print hex char for X

hex_chars: .byte '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'

The problem I had was that if the data bank register was 0, this would print ASCII character ‘0’ as expected. But if the data bank register was 1, it would print some binary character.

Nothing in this code should be affected by the data bank register, so I went through a methodical debugging process to try to list and check all of my assumptions. At one point, I even checked that the assembler was producing the correct opcode for this lda addressing mode (there are red herrings on the mailing lists about this).

I was able to narrow down the problem by writing different variations of the code which should all do the same thing, but used different opcodes to get there. This quickly revealed that it was the tax instruction which did not work as I thought, after finding that I could get the code working if I avoided it:

.a8                             ; Accumulator is 8-bit
.i16                            ; Index registers are 16-bit
ldx #0                          ; X is 0 for example
lda f:hex_chars, X              ; Load hex char for X
jsr uart_print_char             ; Print hex char for X

hex_chars: .byte '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'

My first faulty assumption was that if I set the accumulator to 0, then transferring the accumulator value to X would set the X register to 0 as well.

.a8                             ; Accumulator is 8-bit
.i16                            ; Index registers are 16-bit
lda #0                          ; Accumulator is 0
tax                             ; X is 0?

On this architecture, there are two bits of CPU status register which specify the size of the accumulator and index registers. When the accumulator is 8-bits, only the lower 8-bits of the 16-bit value are set by lda. I hadn’t realised that when the index registers are set to 16-bit, the tax instruction transfers all 16 bits of the accumulator to the X register (regardless of the accumulator size), which was causing surprising results.

Far away from the bug-free code I was focusing on, I had used a lazy method to set the data bank register, which involved setting the accumulator to $0101.

My second faulty assumption was that the data bank register was involved at all – in fact lda #$0101 would have been enough to break the later code.

.a16
lda #$0101                      ; Set data bank register to 1
pha
plb
plb

To fix this, when switching to an 8-bit accumulator, I now zero out the high 8-bits.

.a8
lda #0                          ; zero out the B accumulator
xba

Alternative boot method

I also added an option to boot from serial, based on my implementation of an XMODEM receive function for the 6502 processor.

This is a bit like PXE boot for retro computers. From a program such as minicom, you tell the ROM to boot from serial, then upload the kernel via an XMODEM send. It loads up to 64KiB to bank 1 (the same location as the bootloader on the SD card would place it), then executes it.

This is an important option for testing, since it’s a fast way to get some code running while I don’t have an operating system, and writing to an SD card from a modern computer is not a fast process. It may also be an important fallback if I need to troubleshoot SD card routines.

Wrap-up

Previously, I needed to use a ROM programmer to load code onto this computer. I can now use an SD card or serial connection, and have a stable interface for the bootstrapping code to access some minimal services provided by the ROM.

It is also the first time I’m running code outside of bank 0, breaking through the 64 KiB limit of the classic 6502. There is an entire megabyte of RAM on this test board.

Of course, this computer still does nothing useful, but at least it is now controlled by an SD card rather than by flashing a ROM. On the hardware side of the project, this will help me to convert the design from 5 V to 3.3 V. I’ll need to convert the ROM to a PLCC-packaged flash chip for that, which is not something I’ll want to be programming frequently.

As far as software goes, my plan is to work on some more interesting demo programs, so that I can start to build a library of open source 65C816 code to work with. The hardware design, software and emulator for this computer can be found on GitHub, at mike42/65816-computer.

Let’s make a bootloader – Part 1

I’ve been working on a homebrew computer based on the 16-bit 65C816 CPU. All of my test programs so far have run from an EEPROM chip, which I need to remove and re-program each time I need to make a change.

Plenty of retro systems ran all of their programs from ROM, but I only want to use it for bootstrapping this computer. I’ve got 8 KiB of space for ROM-based programs in the memory map, which should be plenty to check the hardware and load software from disk.

In this two-part blog post, I’ll take a look at handing over control from the ROM to a program loaded from an SD card.

Technical background

I did some quick reading about the process for bootstrapping a PC during the 16-bit era. My homebrew computer is a completely different architecture to an IBM-compatible PC, but I’m planning to follow a few of the conventions from this ecosystem, since I’ll have some similar challenges.

The best resources on this topic are aimed at bootloader developers, such as this wikibooks page.

For a disk to be considered bootable, the first 512-byte sector needs to end with hex 0xAA55 (little-endian). This is 01010101 10101010 in binary (a great test pattern!). My system is not x86-based, so I’ll store a different value there.
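
For illustration, checking that signature on a modern machine is trivial; the convention is the bytes $55 and $AA at offsets 510 and 511 of the first sector (the image path below is a placeholder):

def has_pc_boot_signature(sector: bytes) -> bool:
    return len(sector) >= 512 and sector[510] == 0x55 and sector[511] == 0xAA

with open("disk.img", "rb") as f:
    print(has_pc_boot_signature(f.read(512)))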

If a disk is bootable, then the BIOS will transfer the 512 byte boot sector to $7C00 and jump to it. The only assumption that the BIOS seems to make about the bootloader structure is that it starts with executable code. I’ll do the same on my system.

It’s worth noting that the first sector may also contain data structures for a filesystem or partitioning scheme, and it’s up to the bootloader code to work around that. For now, my SD card will contain only a bootloader, which does simplify things a bit.

Most bootloaders will then make a series of BIOS calls via software interrupts, which enables them to produce text output or load additional data from disk. This is where I’ll have the biggest challenge, since my ROM has no stable interface for a bootloader to call.

Re-visiting SD card handling

My first task was to load the bootloader itself from SD card, storing the first 512 bytes from the disk to RAM, at address $7C00 onwards. This should be straightforward, since I have working 6502 assembly routines for reading from an SD card, and I’ve added a port for an SD card module to my 65C816 test board.

I came up with a routine which prints the contents of the boot sector, then prompts for whether to execute it. My ROM code is not checking the signature at this stage, and is not aware that the boot sector in this screen capture contains x86 machine code within a FAT32 boot sector, but this is a good start.

It took quite a few revisions to get this working, since my old 65C02 code for reading from SD produced strange output on this system. On my 65C816 test board, it showed almost the right values, but it was jumbled up, and mixed with SPI fill bytes ($FF). The below screen capture shows a diff between the expected and actual output of the ROM.

After a long process to rule out other programming and hardware errors, I finally noticed that I was writing the data starting from address $0104, which was never going to work. The default stack pointer on this CPU is $01ff and grows down, so writing 512 bytes to $0104 would always corrupt the stack after a few hundred bytes.

At this stage I was using the assembler to statically allocate a 512 byte space for IO. It appeared in code like this:

.segment "BSS"
io_block_id:              .res 4
io_buffer:                .res 512

The error was in the linker configuration, which I updated to start assigning RAM addresses from $0200 onwards.

 MEMORY {
     ZP:     start = $00,    size = $0100, type = rw, file = "";
-    RAM:    start = $0100,  size = $7e00, type = rw, file = "";
+    RAM:    start = $0200,  size = $7d00, type = rw, file = "";
     PRG:    start = $e000,  size = $2000, type = ro, file = %O, fill = yes, fillval = $00;
 }

The full SD card handling code is too long to post in this blog, but now allows any 512-byte segment from the first 32 MB of the SD card (identified by a 16-bit segment ID) to be loaded into an arbitrary memory address.
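
That limit follows directly from the block size: a 16-bit block ID addressing 512-byte blocks covers exactly 32 MiB. A quick back-of-the-envelope check in Python:

BLOCK_SIZE = 512

def block_offset(block_id: int) -> int:
    """Byte offset on the SD card for a given 16-bit block ID."""
    assert 0 <= block_id <= 0xFFFF
    return block_id * BLOCK_SIZE

print((block_offset(0xFFFF) + BLOCK_SIZE) // (1024 * 1024))   # 32 (MiB)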

Making an API

My next challenge was to define an API for the bootloader to call into the ROM to perform I/O.

I considered using a jump table, but decided to use the cop instruction instead. This triggers a software interrupt on the 65C816, and has parallels to how the int instruction is used to trigger BIOS routines on x86 systems.

I defined a quick API for four basic routines, passing data via registers.

  • print char
  • read char
  • print string
  • load data from storage

The caller would need to set some registers, then call cop from assembly language. Any return data would also be passed via registers.

The cop instruction takes a one-byte operand, which in this case specifies the ID of the function to call.

cop $00

To prove that the interface would work, I implemented just the routine for printing strings.

; interrupt.s: Handling of software interrupts, the interface into the ROM for
; software (eg. bootloaders)
;
; Usage: Set registers and use 'cop' to trigger software interrupt.
; Eg:
;   ldx #'a'
;   cop ROM_PRINT_CHAR
; CPU should be in native mode with all registers 16-bit.

.import uart_printz, uart_print_char
.export cop_handler
.export ROM_PRINT_CHAR, ROM_READ_CHAR, ROM_PRINT_STRING, ROM_READ_DISK

; Routines available in ROM via software interrupts.
; Print one ASCII char.
;   A is char to print
ROM_PRINT_CHAR   := $00

; Read one ASCII char.
;   Returns typed character in A register
ROM_READ_CHAR    := $01

; Print a null-terminated ASCII string.
;   X is address of string, use data bank register for addresses outside bank 0.
ROM_PRINT_STRING := $02

; Read data from disk to RAM in 512 byte blocks.
;   X is address to write to, use data bank register for addresses outside bank 0.
;   A is low 2 bytes of block number
;   Y is number of blocks to read
ROM_READ_DISK    := $03

.segment "CODE"
; table of routines
cop_routines:
.word rom_print_char_handler
.word rom_read_char_hanlder
.word rom_print_string_handler
.word rom_read_disk_handler

cop_handler:
    .a16                            ; use 16-bit accumulator and index registers
    .i16
    rep #%00110000
    ; Save caller context to stack
    pha                             ; Push A, X, Y
    phx
    phy
    phb                             ; Push data bank, direct register
    phd
    ; Set up stack frame for COP handler
    tsc                             ; set direct page register to point at a new stack frame
    sec
    sbc #cop_handler_local_vars_size
    tcs
    phd
    tcd
caller_k := 15
caller_ret := 13
caller_p := 12
caller_a := 10
caller_x := 8
caller_y := 6
caller_b := 5
caller_d := 3
cop_call_addr := 0
    ; set up 24 bit pointer to COP instruction
    ldx <frame_base+caller_ret
    dex
    dex
    stx <frame_base+cop_call_addr
    .a8                             ; Use 8-bit accumulator
    sep #%00100000
    lda <frame_base+caller_k
    sta <frame_base+cop_call_addr+2
    .a16                            ; Revert to 16-bit accumulator
    rep #%00100000

    ; load COP instruction which triggered this interrupt to figure out routine to run
    lda [<frame_base+cop_call_addr]
    xba                             ; interested only in second byte
    and #%00000011                  ; mask down to final two bits (there are only 4 valid functions at the moment)
    asl                             ; multiply by 2 to index into table of routines
    tax
    jsr (cop_routines, X)

    ; Remove stack frame for COP handler
    pld
    tsc
    clc
    adc #cop_handler_local_vars_size
    tcs

    ; Restore caller context from stack, reverse order
    pld                             ; Pull direct register, data bank
    plb
    ply                             ; Pull Y, X, A
    plx
    pla
    rti

cop_handler_local_vars_size := 3
frame_base := 1

rom_print_char_handler:
    ldx #aa
    jsr uart_printz
    rts

rom_read_char_hanlder:
    ldx #bb
    jsr uart_printz
    rts

rom_print_string_handler:
    ; Print string from X register
    ldx <frame_base+caller_x
    jsr uart_printz
    rts

rom_read_disk_handler:
    ldx #cc
    jsr uart_printz
    rts

aa: .asciiz "Not implemented A\r\n"
bb: .asciiz "Not implemented B\r\n"
cc: .asciiz "Not implemented C\r\n"

This snippet is quite dense, and uses several features which are new to the 65C816, not just the cop instruction.

I’m relocating the direct page to use as a stack frame, which is an idea I got from reading the output of the WDC 65C816 C compiler. Pointers are much easier to work with on the direct page.

This is the first snippet I’ve shared which uses a 24-bit pointer, via “direct page, indirect long” addressing. The pointer is used to load the instruction which triggered the interrupt, so that the code can figure out which function to call.

lda [<frame_base+cop_call_addr]

This snippet is also the first time I’ve used the jump to subroutine instruction (jsr) with the “absolute indirect, indexed with X” address mode. On the 65C02, I could only use this addressing mode on the jmp instruction. The only example of that on this blog is also an interrupt handling example.

jsr (cop_routines, X)

The “Hello World” of bootloaders

My next goal was to load a small program from disk, and show that it can call routines from the ROM. For now it is just a program on the boot sector on an SD card, which demonstrates that the new software interrupt API works.

This assembly file boot.s prints out two strings, so that I can be sure that the ROM is returning control back to the bootloader after a software interrupt completes.

ROM_PRINT_STRING := $02

.segment "CODE"
    .a16
    .i16
    ldx #test_string_1
    cop ROM_PRINT_STRING
    ldx #test_string_2
    cop ROM_PRINT_STRING
    stp

test_string_1: .asciiz "Test 1\r\n"
test_string_2: .asciiz "Test 2\r\n"

.segment "SIGNATURE"
    wdm $42                         ; Ensure x86 systems don't recognise this as bootable.

The linker configuration which goes with this is boot.cfg:

MEMORY {
    ZP:     start = $00,    size = $0100, type = rw, file = "";
    RAM:    start = $7e00,  size = $0200, type = rw, file = "";
    PRG:    start = $7c00,  size = $0200, type = rw, file = %O, fill = yes, fillval = $00;
}

SEGMENTS {
    ZEROPAGE:   load = ZP,  type = zp;
    BSS:        load = RAM, type = bss;
    CODE:       load = PRG, type = rw,  start = $7c00;
    SIGNATURE:  load = PRG, type = rw,  start = $7dfe;
}

The commands to assemble and link this are:

ca65 --feature string_escapes --cpu 65816 boot.s
ld65 -o boot.bin -C boot.cfg boot.o

This produces a 512 byte file, which I wrote to SD card.

This is the first time this computer is running code from RAM, which is an important milestone for this project.

Editor improvements

I needed to do some work on my 6502 assembly plugin for IntelliJ during this process, since it didn’t understand the square brackets used for the long address modes.

    lda [<frame_base+cop_call_addr]

While I was fixing this, I also implemented an auto-format feature. This saves me the manual effort of lining up all the comments in a column, as is typical in assembly code.

Lastly, I added support for jumping to unnamed labels, which are a ca65-specific feature.

Next steps

In the second half of this blog post, I’ll get the bootloader to load a larger program from the SD card. I’m hoping to allow the bootloader to control how much code to load, and where to load it from.

Let’s implement power-on self test (POST)

I recently implemented a simple power-on self test (POST) routine for my 65C816 test board, so that it can stop and indicate a hardware failure before attempting to run any normal code.

This was an interesting adventure for two main reasons:

  • This code needs to work with no RAM present in the computer.
  • I wanted to try re-purposing the emulation status (E) output on the 65C816 CPU to blink an LED.

Background

Even modern computers will stop and provide a blinking light or series of beeps if you don’t install RAM or a video card. This is implemented in the BIOS or UEFI.

I got the idea to use the E or MX CPU outputs for this purpose from this thread on the 6502.org forums. This method would allow me to blink a light with just a CPU, clock input, and ROM working.

My main goal is to perform a quick test that each device is present, so that start-up fails in a predictable way if I’ve connected something incorrectly. This is much simpler than the POST routine from a real BIOS, because I’m not doing device detection, and I’m not testing every byte of memory.

Boot-up process

On my test board, I’ve connected an LED directly to the emulation status (E) output on the 65C816 CPU. The CPU starts in emulation mode (E is high). However, I have noticed that on power-up, the value of E appears to be random until /RES goes high. If I were wiring this up again, I would also prevent the LED from lighting up while the CPU is in reset.

The first thing the CPU does is read an address from ROM, called the reset vector, which tells it where to start executing code.

In my case, the first two instructions set the CPU to native mode, which are clc (clear carry) and xce (exchange carry with emulation).

.segment "CODE"
reset:
    .a8
    .i8
    clc                            ; switch to native mode
    xce
    jmp post_check_loram

By default, the accumulator and index registers are 8-bit; the .a8 and .i8 directives simply tell the ca65 assembler that this is the case.

Next, the code will jmp to the start of the POST process.

Checking low RAM

The first part of the POST procedure checks if the lower part of RAM is available, by writing values to two addresses and checking that the same values can be read back.

Note that a:$00 causes the assembler to interpret $00 as an absolute address. This will otherwise be interpreted as a direct-page address, which is not what’s intended here.

post_check_loram:
    ldx #%01010101                 ; Power-on self test (POST) - do we have low RAM?
    ldy #%10101010
    stx a:$00                      ; store known values at two addresses
    sty a:$01
    ldx a:$00                      ; read back the values - unlikely to be correct if RAM not present
    ldy a:$01
    cpx #%01010101
    bne post_fail_loram
    cpy #%10101010
    bne post_fail_loram
    jmp post_check_hiram

If this fails, then the boot process stops, and the emulation LED blinks in a distinctive pattern (two blinks) forever.

post_fail_loram:                   ; blink emulation mode output with two pulses
    pause 8
    blink
    blink
    jmp post_fail_loram            ; repeat indefinitely

Macros: pause and blink

It’s a mini-challenge to write code to blink an LED in a distinctive pattern without assuming that RAM works. This means no stack operations (eg. jsr and rts instructions), and that I need to store anything I need in 3 bytes: the A, X, Y registers. A triple-nested loop is the best I can come up with.

I wrote a pause macro, which runs a time-wasting loop for the requested duration – approximately a multiple of 100ms at this clock speed. Every time this macro is used, the len value is substituted in, and this code is included in the source file. This example also uses unnamed labels, which is a ca65 feature for writing messy code.

.macro pause len                   ; time-wasting loop as macro
    lda #len
:
    ldx #64
:
    ldy #255
:
    dey
    cpy #0
    bne :-
    dex
    cpx #0
    bne :--
    dec
    cmp #0
    bne :---
.endmacro
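
As a rough cross-check of the “100ms” figure, the loop can be estimated in Python using standard 65xx cycle counts for the 8-bit instructions involved, and the 921,600 Hz clock that appears in the beep calculation later in this post; branch-not-taken cycles and the mode-switching overhead are ignored:

CPU_HZ = 921_600
INNER = 255 * (2 + 2 + 3)                    # dey, cpy #, bne (taken)
MIDDLE = 64 * (2 + INNER + 2 + 2 + 3)        # ldy #, inner loop, dex, cpx #, bne
PER_LEN = 2 + MIDDLE + 2 + 2 + 3             # ldx #, middle loop, dec, cmp #, bne
print(f"{PER_LEN / CPU_HZ:.3f} s per unit of len")   # roughly 0.125 s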

The second macro I wrote is blink, which briefly lights up the LED attached to the E output by toggling emulation mode. I’m using the pause macro from both native mode and emulation mode in this snippet, so I can only treat A, X and Y as 8-bit registers.

.macro blink
    sec                            ; switch to emulation mode
    xce
    pause 1
    clc                            ; switch to native mode
    xce
    pause 2
    sec                            ; switch to emulation mode
.endmacro

Checking high RAM

There is also a second RAM chip, and this process is repeated with some differences. For one, I can now use the stack, which is how I set the data bank byte in this snippet.

Here a:$01 is important, because with direct page addressing, $01 means $000001 at this point in the code, whereas I want to test that I can write to the memory address $080001.

post_check_hiram:
    ldx #%10101010                 ; Power-on self test (POST) - do we have high RAM?
    ldy #%01010101
    lda #$08                       ; data bank to high ram
    pha
    plb
    stx a:$00                      ; store known values at two addresses
    sty a:$01
    ldx a:$00                      ; read back the values - unlikely to be correct if RAM not present
    ldy a:$01
    cpx #%10101010
    bne post_fail_hiram
    cpy #%01010101
    bne post_fail_hiram
    lda #$00                       ; reset data bank to boot-up value
    pha
    plb
    jmp post_check_via

The failure here is similar, but the LED will blink 3 times instead of 2.

post_fail_hiram:                   ; blink emulation mode output with three pulses. we could use RAM here?
    pause 8
    blink
    blink
    blink
    jmp post_fail_hiram            ; repeat indefinitely

To make sure that I was writing to different chips, I installed the RAM chips one at a time, first observing the expected failures, and then observing that the code continued past this point with the chip installed.

I also checked with an oscilloscope that both RAM chips are now being accessed during start-up. Now that I’ve got some confidence that the computer requires both chips to start, I can skip a few debugging steps if I’ve got code that isn’t working later.

Checking the Versatile Interface Adapter (VIA)

The third chip I wanted to add to the POST process is the 65C22 VIA. I kept this check simple, because one read to check for a start-up default is sufficient to test for device presence.

VIA_IER = $c00e
post_check_via:                    ; Power-on self test (POST) - do we have a 65C22 Versatile Interface Adapter (VIA)?
    lda a:VIA_IER
    cmp #%10000000                 ; start-up state, interrupts enabled overall (IFR7) but all interrupt sources (IFR0-6) disabled.
    bne post_fail_via
    jmp post_ok

This stops and blinks 4 times if it fails. I recorded the GIF at the top of this blog post by removing the component which generates a chip-select for the VIA, which causes this code to trigger on boot.

post_fail_via:
    pause 8
    blink
    blink
    blink
    blink
    jmp post_fail_via

Beep for good measure

At the end of the POST process, I put in some code to generate a short beep.

This uses the fact that the 65C22 can toggle the PB7 output each time a certain number of clock-cycles pass. I’ve connected a piezo buzzer to that output, which I’m using as a PC speaker. The 65C22 is serving the role of a programmable interrupt timer from the PC world.

VIA_DDRB = $c002
VIA_T1C_L = $c004
VIA_T1C_H = $c005
VIA_ACR = $c00b
BEEP_FREQ_DIVIDER = 461            ; 1KHz, formula is CPU clock / (desired frequency * 2), or 921600 / (1000 * 2) ~= 461
post_ok:                           ; all good, emit celebratory beep, approx 1KHz for 1/10th second
    ; Start beep
    lda #%10000000                 ; VIA PIN PB7 only
    sta VIA_DDRB
    lda #%11000000                 ; set ACR. first two bits = 11 is continuous square wave output on PB7
    sta VIA_ACR
    lda #<BEEP_FREQ_DIVIDER        ; set T1 low-order counter
    sta VIA_T1C_L
    lda #>BEEP_FREQ_DIVIDER        ; set T1 high-order counter
    sta VIA_T1C_H
    ; wait approx 0.1 seconds
    pause 1
    ; Stop beep
    lda #%11000000                 ; set ACR. returns to a one-shot mode
    sta VIA_ACR
    stz VIA_T1C_L                  ; zero the counters
    stz VIA_T1C_H
    ; POST is now done
    jmp post_done

The post_done label points to the start of the old ROM code, which is currently just a “hello world” program.

Next steps

I’m now able to lock in some of my assumptions about what should be available to software, so that I can write more complex programs without second-guessing the hardware.

Once the boot ROM is interacting with more hardware, I may add additional checks. I will probably need to split this into different sections, and make use of jsr/rts once RAM has been tested, because the macros are currently generating a huge amount of machine code. I have 8 KiB of ROM in the memory map for this computer, and the code on this page takes up around 1.1 KiB.

Building an emulator for my 65C816 computer

I’ve been working on adding text-based input and output to my 65C816 computer prototype, but it’s not yet working reliably.

To unblock software development, I decided to put together an emulator, so that I can test my code while the hardware is out of action.

Choosing a base project

There were three main options I considered for running 65C816 programs on a modern computer:

  1. Write a new emulator from scratch
  2. Extract the useful parts of an open source Super Nintendo or Apple IIgs emulator and adapt them
  3. Find a 65C816 CPU emulator, and extend it.

I found a good candidate for the third option in Lib65816 by Francesco Rigoni, which is a C++ library for emulating a 65C816 CPU. The code is GPLv3 licensed, and I could see where I would need to make changes to match the design of my computer.

Process

After reading the code, I could see some places where I would need to modify the library, so I copied it into a new folder, together with its logging dependency and sample program, and refactored it into one project.

I started by extending RAM up to 16 banks (1 megabyte), and mapping in a ROM device which serves bytes from a file. I also updated the handling of interrupt vectors, so that the CPU would retrieve them from the system bus.

Adding 65C22 support

My first test program was an example of preemptive multi-tasking from my previous blog post. At this point it ran, but did not switch between tasks, since that requires extra hardware support.

I implemented minimal support for the 65C22 VIA used in my computer design, just enough to log changes to the outputs, and to fire interrupts from a timer.

The Lib65816 library does not support NMI interrupts, which are used in this test program. I added edge detection, and connected the interrupt output of the 65C22 to the CPU NMI input in code.

This screen capture shows the test program switching between two tasks, with a different VIA port being accessed before and after the switch.

This gave me some confidence that the CPU emulation was good enough to develop on, so I moved on to the next test program.

Adding serial support

The second test program is also from an earlier blog post, and tests printing and reading characters from an NXP SC16C752 UART.

I added a very minimal emulation of the UART chip I’m using, suppressed all of the logging, and converted output to use the curses library. This last step enabled character-by-character I/O, instead of the default line-buffered I/O.

This screen capture shows the program running under emulation. It simply prints “Hello world”, then echoes back the next 3 characters that the user types.

Wrap-up

I didn’t plan to write an emulator for my custom computer, but I am currently debugging some tricky hardware problems, and this detour will allow me to continue to make progress in other areas while I try to find a solution.

I plan to keep the emulator up to date with hardware changes, since it will make it possible to demonstrate the system without any hardware access. It’s also easier to debug issues now that I can choose between two different execution environments.

The hardware design, software and emulator for this computer are online at mike42/65816-computer. I would like to again acknowledge Francesco Rigoni’s work on Lib65816. I’m very grateful that he chose to release this under the GPL, since it allows me to re-use the code for my project.

Let’s implement preemptive multitasking

I am currently working on building a computer based on the 65C816 processor. This processor has a 16-bit stack pointer, which should make it far more capable than the earlier 65C02, at least in theory.

I wanted to check my understanding of how the stack works on this processor, so I tried to build the simplest possible implementation of preemptive multitasking in 65C816 assembly.

What I’m working with

I am working with a breadboard computer. It has no operating system, and no ability to load programs or start/stop processes.

To get code running, I am assembling it on a modern computer, then writing the resulting binary to an EEPROM chip.

To see the result of the code, I am using a logic analyser to watch some output pins on the 65C22 VIA. The 65C22 chip also has a timer, so I connected its interrupt output to the NMI input of the CPU for this test.

Test programs

I wrote two simple programs which infinitely loop. These programs use ca65 assembler syntax.

The 65C22 has two 8-bit ports, so I made each test program write a recognisable pattern to its own output port, so that it’s easy to see which one is running at any given time, and check that it is not disrupted by the task switching.

The first program writes to Port A of the 65C22, and alternates the value between 0000 0000 and 0000 0011.

.segment "CODE"
; Task 1
task_1_main:
  .a8                             ; use 8-bit accumulator and index registers
  .i8
  sep #%00110000
@repeat_1:
  lda #%00000000                  ; alternate between two values
  sta PORTA
  lda #%00000011
  sta PORTA
  jmp @repeat_1                   ; repeat forever

The second program writes to Port B of the 65C22. This uses a ror instruction to produce a pattern across all digits of the output port.

.segment "CODE"
; Task 2
task_2_main:
  .a8                             ; use 8-bit accumulator and index registers
  .i8
  sep #%00110000
  clc
  lda #%01010101                  ; grab a start value
@repeat_2:
  ror                             ; rotate right
  sta PORTB
  jmp @repeat_2                   ; repeat forever

Context switching

I’m implementing round-robin scheduling between two processes.

My basic plan was to use a regular interrupt routine to save context, switch the stack pointer to the other process, restore context for the next process, then return from the interrupt.

.segment "CODE"
nmi:
  .a16
  .i16
  rep #%00110000
  ; Save task context to stack
  pha                             ; Push A, X, Y
  phx
  phy
  phb                             ; Push data bank, direct register
  phd

  ; swap stack pointer so that we will restore the other task
  tsc                             ; transfer current stack pointer to memory
  sta temp
  lda next_task_sp                ; load next stack pointer from memory
  tcs
  lda temp                        ; previous task is up next
  sta next_task_sp

  ; Clear interrupt
  .a8
  .i8
  sep #%00110000                  ; save X only (assumes it is 8 bits..)
  ldx T1C_L                       ; Clear the interrupt, side-effect of reading

  .a16
  .i16
  rep #%00110000
  ; Restore process context from stack, reverse order
  pld                             ; Pull direct register, data bank
  plb
  ply                             ; Pull Y, X, A
  plx
  pla
  rti

Setup process

The above switch works once the two processes are up and running, but that takes some setting up. First, the two variables used above need to be defined somewhere.

.segment "BSS"
next_task_sp: .res 2              ; Stack pointer of whichever task is not currently running
temp: .res 2

The boot-up routine then sets everything up. It’s broken up a bit here. First is the CPU initialisation, since the 65C816 will boot to emulation mode.

.segment "CODE"
reset:
  clc                             ; switch to native mode
  xce

The second step is to start a task in the background. This involves pushing appropriate values to the stack, so that when we context switch, “Task 2” will start executing from the top.

  ; Save context as if we are at task_2_main, so we can switch to it later.
  .a16                            ; use 16-bit accumulator and index registers
  .i16
  rep #%00110000
  lda #$3000                      ; set up stack, direct page at $3000
  tcs
  tcd
  ; emulate what is pushed to the stack before NMI is called: program bank, program counter, processor status register
  phk                             ; program bank register, same as current code, will be 0 here.
  pea task_2_main                 ; 16 bit program counter for the start of this task
  php                             ; processor status register
  ; match what we push to the stack in the nmi routine
  lda #0
  tax
  tay
  pha                             ; Push A, X, Y
  phx
  phy
  phb                             ; Push data bank, direct register
  phd
  tsc                             ; save stack pointer to next_task_sp
  sta next_task_sp

  lda #$2000                      ; set up stack, direct page at $2000 for task_1_main
  tcs
  tcd

The next step is to set up a timer, so that the interrupt routine will fire regularly. This causes interrupts to occur ~28 times per second.

  ; Set up the interrupt timer
  .a8                             ; use 8-bit accumulator and index registers
  .i8
  sep #%00110000
  lda #%11111111                  ; set all VIA pins to output
  sta DDRA
  sta DDRB
  ; set up timer 1
  lda #%01000000                  ; set ACR. first two bits = 01 is continuous interrupts for T1
  sta ACR
  lda #%11000000                  ; enable VIA interrupt for T1
  sta IER
  ; set up a timer at ~65535 clock pulses.
  lda #$ff                        ; set T1 low-order counter
  sta T1C_L
  lda #$ff                        ; set T1 high-order counter
  sta T1C_H

Finally, “Task 1” can be started in the foreground.

  ; start running task 1
  jmp task_1_main

Linker configuration and boilerplate

This is the first time I’m blogging about code written in 65C816 native mode, so for the sake of completeness, I’ll also include the updated linker configuration. The most important change (compared with this blog post) is that 32 bytes are now set aside for interrupt vectors.

MEMORY {
    ZP:     start = $00,    size = $0100, type = rw, file = "";
    RAM:    start = $0100,  size = $7e00, type = rw, file = "";
    PRG:    start = $e000,  size = $2000, type = ro, file = %O, fill = yes, fillval = $00;
}

SEGMENTS {
    ZEROPAGE: load = ZP,  type = zp;
    BSS:      load = RAM, type = bss;
    CODE:     load = PRG, type = ro,  start = $e000;
    VECTORS:  load = PRG, type = ro,  start = $ffe0;
}

Note that this linker configuration is not quite complete, but it is good enough to get code running. The zero page is not relevant for the code I’ll be writing, so I might remove that at some point.

I also have definitions for the 65C22 I/O registers, which correspond to the addresses it is mapped to on my prototype computer.

PORTB = $c000
PORTA = $c001
DDRB = $c002
DDRA = $c003
T1C_L = $c004
T1C_H = $c005
ACR = $c00b
IFR = $c00b
IER = $c00e

The final piece of the puzzle, then, is a set of definitions for all of those interrupt vectors.

.segment "CODE"
irq:
  rti

unused_interrupt:                 ; Probably make this into a crash.
  rti

.segment "VECTORS"
; native mode interrupt vectors
.word unused_interrupt            ; Reserved
.word unused_interrupt            ; Reserved
.word unused_interrupt            ; COP
.word unused_interrupt            ; BRK
.word unused_interrupt            ; Abort
.word nmi                         ; NMI
.word unused_interrupt            ; Reserved
.word irq                         ; IRQ

; emulation mode interrupt vectors
.word unused_interrupt            ; Reserved
.word unused_interrupt            ; Reserved
.word unused_interrupt            ; COP
.word unused_interrupt            ; Reserved
.word unused_interrupt            ; Abort
.word unused_interrupt            ; NMI
.word reset                       ; Reset
.word unused_interrupt            ; IRQ/BRK

The process for building this assembly code into a usable ROM image is:

ca65 --cpu 65816 main.s
ld65 -o rom.bin -C system.cfg main.o

This outputs an 8KiB file. I need to pad this with zeroes up to 32KiB, which is the size of the ROM chip I am burning it to.

truncate -s 32768 rom.bin

Lastly, I write this to the ROM.

minipro -p AT28C256 -w rom.bin

Result

I connected a logic analyser to two wires from each output port, the NMI interrupt line, and the reset signal, and opened up sigrok.

This clearly shows the CPU switching between the two processes each time an interrupt is triggered.

Zooming in on the task cut-over, I can also see that the patterns on each port are different, as expected.

Of course this did not work the first time, and I spent quite a bit of time de-constructing and re-constructing the code to isolate bugs.

As a result, I shipped some improvements to my 6502 assembly plugin for JetBrains IDE’s, which I have blogged about previously. The plugin will now provide suggestions for mnemonics. There is a project setting for 6502 vs 65C02 vs 65C816 mode, and it will only suggest mnemonics which are available for the selected CPU.

It also provides a weak warning for binary or hex numbers which are not 8, 16 or 24 bits, which would have saved me some time.
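
For example, a literal like the one below now gets flagged, since it is only seven binary digits wide (a contrived example, not code from any of my projects):

  lda #%0000011                   ; 7 binary digits - almost certainly meant %00000011
  sta PORTA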

Lastly, it has optional checking for undefined/unused values. This helps catch problems in the editor, and allows me to identify unused code and variables.

Wrap-up

I already knew how multitasking works at a high level, but it was quite interesting to implement it myself. I have been using this as a test program for my 65C816 computer, since it exercises RAM, ROM, I/O, and uses interrupts.

I don’t think I will be able to use this exact method for more complex programs, because it will break down a bit with multiple interrupt sources. After writing this code, I also found that it’s not common for modern systems to save register values to the stack when context-switching, and that a process control block is used instead.
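
As a rough illustration of that idea, a process control block for this CPU might collect the saved registers into a structure along these lines, rather than leaving them on each task’s stack. This is only a sketch of a possible layout (in ca65 .struct syntax), not something I have implemented:

.struct pcb
  reg_a      .word                ; accumulator
  reg_x      .word                ; index registers
  reg_y      .word
  reg_dp     .word                ; direct page register
  reg_db     .byte                ; data bank register
  reg_pb     .byte                ; program bank register
  reg_pc     .word                ; program counter
  reg_sp     .word                ; stack pointer
  reg_status .byte                ; processor status register
.endstruct

The scheduler would then keep one of these per task, and the interrupt routine would save and restore registers through it instead of swapping stack pointers directly.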

The next step for my 65C816 computer will be the addition of an NXP UART, which I tested in an earlier blog post.

Interfacing an NXP UART to an 8-bit computer

I recently tested the SC16C752 UART from NXP with my 6502 computer, and I was able to use it to receive and send characters. This blog post shows the simple test program and circuit that I used.

Some background

I built a 65C02-based computer last year, which uses a serial connection for text I/O. I used the WDC 65C51N to provide a UART interface, and I found it to be quite limited.

I am now building a 16-bit computer, and I want to avoid the 65C51N, for two main reasons:

  • It runs at 5V, and I’m planning to convert the design to 3.3V later on. The other 65-series chips I’m using support a wide voltage range.
  • It can only store one byte at a time. I am planning to run at a higher baud rate, and the system will have multiple interrupt sources. It would take very careful programming to avoid losing bytes with a 65C51N in this setup.

I spent some time investigating alternative UART IC’s, and found some discussion on 6502.org about replacing the 6551 with an NXP industrial (SC26 or SC28 series) UART. At the time of writing, many of the hobbyist-friendly PLCC versions of these chips are listed as end of life, though I’ve since found pin- and software-compatible versions made by MaxLinear (eg. the XR88C92).

For my uses though, I chose the NXP SC16C752. It has two channels, 64-byte receive and transmit buffers, and supports a wide voltage range. I don’t mind that it’s a different series to these standard chips, because I plan to write software from scratch. The only downside is the packaging, which is LQFP48 with 0.5mm pin pitch. I needed to solder this tiny chip to an adapter board in order to use it.

The test circuit

I wired this up to my 6502 computer project as shown in the diagram below. IO3 is a spare address decoding output, which maps the chip to address $8c00, while the rest of the signals connect to the 6502 CPU directly.

I used a 74AC00 and 74AC04 to generate the active-high reset and separate read/write signals (as suggested on the same thread linked before). I plan to use a PLD to generate a qualified write signal on my 65C816 computer build.

Note that there are separate chip-selects for each UART, and I’ve only connected one here. I have also not connected the interrupt output for this test, and have some unused inputs which are left floating.

Developing a test program

I started writing some code, and connected a logic analyser to the UART output. The early tests did not work well, and I now know that this is because I had not enabled the FIFOs, causing characters to be lost.

This screen capture shows an attempt to send the entire alphabet to the UART, but only the letters “A”, “B” and “M” appearing at the output.

I came up with this test program, which does 8-N-1 communication at 115,200 bits per second.

;
; Test program for the SC16C752B dual UART from NXP
;
.import sys_exit

; Assuming /CSA is connected to IO3
UART_BASE = $8c00

; UART registers
; Not included: FIFO ready register, Xon1 Xon2 words.
UART_RHR = UART_BASE        ; Receiver Holding Register (RHR) - R
UART_THR = UART_BASE        ; Transmit Holding Register (THR) - W
UART_IER = UART_BASE + 1    ; Interrupt Enable Register (IER) - R/W
UART_IIR = UART_BASE + 2    ; Interrupt Identification Register (IIR) - R
UART_FCR = UART_BASE + 2    ; FIFO Control Register (FCR) - W
UART_LCR = UART_BASE + 3    ; Line Control Register (LCR) - R/W
UART_MCR = UART_BASE + 4    ; Modem Control Register (MCR) - R/W
UART_LSR = UART_BASE + 5    ; Line Status Register (LSR) - R
UART_MSR = UART_BASE + 6    ; Modem Status Register (MSR) - R
UART_SPR = UART_BASE + 7    ; Scratchpad Register (SPR) - R/W
; Different meaning when LCR is logic 1
UART_DLL = UART_BASE        ; Divisor latch LSB - R/W
UART_DLM = UART_BASE + 1    ; Divisor latch MSB - R/W
; Different meaning when LCR is %1011 1111 ($bf).
UART_EFR = UART_BASE + 2    ; Enhanced Feature Register (EFR) - R/W
UART_TCR = UART_BASE + 6    ; Transmission Control Register (TCR) - R/W
UART_TLR = UART_BASE + 7    ; Trigger Level Register (TLR) - R/W

.segment "CODE"
main:
  lda #$80                  ; Enable divisor latches
  sta UART_LCR
  lda #1                    ; Set divisor to 1 - on a 1.8432 MHz XTAL1, this gets 115200bps.
  sta UART_DLL
  lda #0
  sta UART_DLM
  lda #%00010111            ; Sets up 8-n-1
  sta UART_LCR
  lda #%00001111            ; Enable FIFO, set DMA mode 1
  sta UART_FCR
  jsr test_print
  jsr test_recv
  lda #0
  jmp sys_exit

test_print:                 ; Send test string to UART
  ldx #0
@test_char:
  lda test_string, X
  beq @test_done
  sta UART_THR
  inx
  jmp @test_char
@test_done:
  rts

test_string: .asciiz "Hello, world!"

test_recv:                  ; Receive three characters and echo them back
  jsr uart_recv_char
  sta UART_THR
  jsr uart_recv_char
  sta UART_THR
  jsr uart_recv_char
  sta UART_THR
  rts

uart_recv_char:             ; Receive next char to A register
  lda UART_LSR
  and #%00000001
  cmp #%00000001
  bne uart_recv_char
  lda UART_RHR
  rts

The program sends the text “Hello, world!”, then waits for input. The next three characters it receives are sent back to the user, and then the program terminates. The only thing in this test program which is specific to my 6502 computer is sys_exit, which is a routine to transfer control back to the operating system.
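
One detail worth noting: test_print writes to the transmit holding register without checking whether there is space, which is fine here because the short test string fits easily in the 64-byte transmit FIFO. For longer output, the usual approach is to poll the line status register first. A minimal sketch, re-using the register definitions above (my own illustration, not part of the test program):

uart_send_char:             ; Send the character in A, waiting for FIFO space first
  pha
@wait:
  lda UART_LSR
  and #%00100000            ; LSR bit 5: transmit holding register / FIFO empty
  beq @wait
  pla
  sta UART_THR
  rts

Waiting for the FIFO to be completely empty is conservative, but it keeps the routine simple.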

It was very rewarding to see the fast, error-free text output on the logic analyser for the first time. This is a great improvement from the 65C51N.

I was also surprised at how short the delay is to echo back a character. It will be interesting to see how much the latency increases once there is some more complex software handling the input.

Wrap up and next steps

I plan to use the NXP SC16C752 for my 65C816 computer project, which will be controlled via text-based input and output.

I’m testing it on its own first, so that I can refer to a known-good setup if it doesn’t work. This is enough to get started, though there are some features which I wasn’t able to test yet, since I can’t use custom interrupt service routines on this setup.

Adding an SD card reader to my 6502 computer

I recently attempted to add SD card support to my home-built 6502 computer.

This was successful, but more challenging than I expected. This blog post is just a few things that I learned along the way.

Background

SD cards are a popular way to add storage to electronics projects. In my case, I will be using the SD card in SPI mode. My home-built computer has an expansion port with some GPIO pins, which should be sufficient for this.

My basic plan was to:

  • write some test data to an SD card from a modern computer
  • interface the SD card to my home-built 6502 computer
  • write a 6502 assembly program to print out the test data

Since this is a learning project, I started from scratch, and avoided reading any existing implementations of SPI or SD cards before jumping in.

Test data

I started this project by generating a small test file, repeating each character in the alphabet 512 times.

echo abcdefghijklmnopqrstuvwxyz | fold -w1 | while read line; do yes $line | head -n512 ; done | tr -d "\n" > disk.img

It takes a one-line shell command to write this image to an SD card on all good operating systems, or a few clicks in a disk imaging utility on others.

dd if=disk.img of=/dev/my_sd_card

My laptop has an SD card slot, but I also have Windows on it at the moment, which turned this into a far larger problem than I could have imagined.

Driver support for this particular SD card reader was dropped in Windows 10, there are no built-in tools for imaging disks on Windows, and I had to try multiple tools before I could find one which would write a raw disk image (one which does not contain a valid filesystem).

In retrospect, this was the first of many lessons about not having the right tools for this project.

Physical interface

I connected an SD card module to my 6502 computer, via a breadboard. I used a module which could level-shift, because my computer runs at 5V, where SD cards usually run at 3.3V. This particular module is an Adafruit ADA254.

I also added pull-up resistors to all of the data lines, after noticing that there were several situations where they would float. The wiring I used is shown below.

The full schematic for my 6502 computer is on this blog post, while the schematic for the SD card module is online here.

Software: Part 1

With some test data and a hardware interface, I got to the third step, which was to read data from the SD card. I was completely unprepared for how tricky it is to do this from scratch, without the right experience or debugging tools.

As a quick re-cap, my development environment is still very limited: I am writing 6502 assembly on a modern computer, and I have the ability to send the assembled program to my home-built 6502 computer over a serial connection.

I started by using lecture notes from the University at Buffalo’s EE-379 course, available online here. This describes SPI, SD card command structure, and the sequence to initialise a card.

The two protocols I needed to implement are SPI, which transfers the bits, plus the SD card commands which run over that. I had to write quite a lot of code with no feedback, because it takes several commands before SD cards are ready to work with in SPI mode, and the exact details vary depending on the generation of card (SDHC vs SDXC, etc).
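
To give an idea of what the SPI side involves, here is a minimal sketch of a bit-banged transfer routine in ca65 syntax. The VIA address and pin assignments here are hypothetical, and this is not the code from my project; it just shows the general idea of clocking one byte out (MSB first) while clocking another byte in, using SPI mode 0.

; Hypothetical wiring: adjust to wherever the VIA is mapped on your build.
SPI_PORT = $c000                  ; 65C22 port B
SPI_DDR  = $c002                  ; data direction register for the same port
SCK      = %00000001              ; clock out on PB0
MOSI     = %00000010              ; data out on PB1
MISO     = %10000000              ; data in on PB7

.segment "BSS"
spi_out: .res 1
spi_in:  .res 1

.segment "CODE"
spi_init:
  lda #%00000111                  ; SCK, MOSI and a chip-select pin (PB2, used later) are outputs
  sta SPI_DDR
  rts

spi_transfer:                     ; exchange one byte: send A, return the received byte in A
  sta spi_out
  ldx #8                          ; 8 bits, most significant first
@next_bit:
  lda SPI_PORT
  and #%11111100                  ; clock low, MOSI low, leave the other pins alone
  asl spi_out                     ; next output bit into carry
  bcc @bit_ready
  ora #MOSI                       ; output bit is 1
@bit_ready:
  sta SPI_PORT                    ; present data while the clock is low
  ora #SCK
  sta SPI_PORT                    ; rising edge: the card samples MOSI here
  lda SPI_PORT                    ; sample MISO while the clock is high
  and #MISO
  cmp #MISO                       ; carry = received bit
  rol spi_in
  lda SPI_PORT
  and #%11111110                  ; falling edge: clock low again
  sta SPI_PORT
  dex
  bne @next_bit
  lda spi_in                      ; received byte
  rts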

After days of debugging, I was unable to get a response from the SD card. With few other options, I started filling my code with the assembly-language equivalent of print statements, and found that every byte from the SD card was the default FF. This means ‘no data’ in this context.

Print statements also slowed the program down to a crawl, which allowed me to use a piezo buzzer to listen to each line. This clicks each time a signal changes between 0 and 1, so I was able to check that each line had the expected activity. I found that the SD card was sending responses, but that the computer was not receiving them, because they were not connected to each other on the breadboard.

Software: Part 2

After connecting the card correctly, I could see responses for the first time. The format here is byte sent / byte received, with line breaks between commands.

#
SD card detected
40/7f 00/ff 00/ff 00/ff 00/ff 95/ff ff/ff ff/01 ff/ff
48/7f 00/ff 00/ff 01/ff aa/ff 0f/ff ff/ff ff/09 ff/ff ff/ff ff/ff ff/ff
7a/7f 00/ff 00/ff 00/ff 00/ff 75/ff ff/ff ff/01 ff/00 ff/ff ff/80 ff/00 ff/ff ff/ff
#

The correct initialisation sequence depends on the type of SD card, and I spent a lot of time going through each possibility, looking for one which worked. When this failed, I found this summary online, which includes more information about decoding the responses: Secure Digital (SD) Card Spec and Info.

As it turned out, I was not sending the correct CRC for CMD8, and a CRC error is indicated by the 09 (binary 0000 1001) response to the second command above. Each bit in the response has a different meaning, and I had mistaken this for indicating an unsupported command, which is expected for some card types.
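
During SPI-mode initialisation, CMD0 and CMD8 are the only commands where the card checks the CRC, so those two frames can simply be hard-coded. A sketch in ca65 syntax; the final byte of CMD8 should be $87, rather than the $0f visible in the capture above:

; Command frames with pre-computed CRC7 bytes (shifted left one bit, stop bit set)
cmd0_frame: .byte $40, $00, $00, $00, $00, $95    ; CMD0 GO_IDLE_STATE
cmd8_frame: .byte $48, $00, $00, $01, $aa, $87    ; CMD8 SEND_IF_COND, argument $000001aa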

It’s a good thing that I had this problem, because searching for the CRC algorithm turned up this collection of AVR C tutorials, which includes a 4-part series on implementing SD card support.

I noticed that all of the command examples were wrapped with code like this:

    // assert chip select
    SPI_transfer(0xFF);
    CS_ENABLE();
    SPI_transfer(0xFF);
    // ... code to send a command...
    // deassert chip select
    SPI_transfer(0xFF);
    CS_DISABLE();
    SPI_transfer(0xFF);

Later on, I had a problem where responses were sometimes mis-aligned by one bit. When I also tried sending empty bytes between commands, while the SD card is not selected, these problems disappeared entirely.
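
Translated to 6502 assembly, an equivalent wrapper might look something like this. Again it is only a sketch: the chip-select pin is a hypothetical spare output on the same VIA port as the SPI sketch above, and is assumed to have been configured as an output during initialisation.

SD_CS = %00000100                 ; hypothetical /CS pin on PB2

sd_select:                        ; assert /CS (active low), with a padding byte either side
  lda #$ff
  jsr spi_transfer
  lda SPI_PORT
  and #%11111011                  ; pull /CS low
  sta SPI_PORT
  lda #$ff
  jsr spi_transfer
  rts

sd_deselect:                      ; de-assert /CS, with a padding byte either side
  lda #$ff
  jsr spi_transfer
  lda SPI_PORT
  ora #SD_CS                      ; pull /CS high
  sta SPI_PORT
  lda #$ff
  jsr spi_transfer
  rts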

Software: Part 3

With the initialisation code complete, I could finally request data from the SD card.

Data on an SD card is arranged in 512-byte blocks by default. I wrote a routine to read one block of data from the SD card into RAM, plus a routine to print sections of RAM.
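
The read itself is CMD17 (READ_SINGLE_BLOCK): send the command, wait for an R1 response, wait for the $fe data token, then clock in 512 data bytes followed by a two-byte CRC. The routine below is a simplified sketch of that flow in ca65 syntax rather than my actual implementation: it hard-codes block zero, relies on the hypothetical spi_transfer, sd_select and sd_deselect helpers sketched earlier, and leaves out the timeouts and error handling that a real version needs.

.segment "BSS"
sd_buffer: .res 512               ; one block of data

.segment "CODE"
sd_read_block_0:
  jsr sd_select
  ldx #0
@send_command:                    ; send the 6-byte CMD17 frame
  lda cmd17_frame, x
  jsr spi_transfer
  inx
  cpx #6
  bne @send_command
@wait_r1:                         ; clock until the R1 response arrives
  lda #$ff
  jsr spi_transfer
  cmp #$00                        ; $00 = command accepted (a real version checks error bits)
  bne @wait_r1
@wait_token:                      ; clock until the start-of-data token arrives
  lda #$ff
  jsr spi_transfer
  cmp #$fe
  bne @wait_token
  ldy #0
@read_first_page:                 ; read 512 bytes of data into the buffer
  lda #$ff
  jsr spi_transfer
  sta sd_buffer, y
  iny
  bne @read_first_page
@read_second_page:
  lda #$ff
  jsr spi_transfer
  sta sd_buffer + 256, y
  iny
  bne @read_second_page
  lda #$ff                        ; discard the two CRC bytes
  jsr spi_transfer
  lda #$ff
  jsr spi_transfer
  jsr sd_deselect
  rts

cmd17_frame: .byte $51, $00, $00, $00, $00, $01   ; CMD17, block address 0, dummy CRC

A more complete routine would take the block number as a parameter, and give up if the R1 response or the data token never arrives.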

The result was a program which prints the first kilobyte of the SD card, which is the letter a 512 times, followed by the letter b 512 times.

This screen capture shows the output of a program matching a hexdump of the test file, which is exactly what I was aiming for.

The full code is too long to include in a blog post, but can be found in the repository for this project, mike42/6502-computer on GitHub.

Some lessons

While I was able to read data from an SD card, this took a long time, and there were several points where I nearly hit a dead-end.

I couldn’t see what was happening at a hardware level, and I couldn’t compare my setup with one that worked, because I had never done anything similar before. Spot-the-difference is a powerful debugging tool, and by jumping straight into 6502 assembly, while avoiding existing 6502 implementations, I definitely made things difficult for myself.

With that in mind, there are several ways I could have made this easier:

  • Better equipment. A logic analyser, an SD card writer, and spare SD cards would have been useful for testing.
  • Building somewhere less-constrained first. I could have implemented this first on a Raspberry Pi, then ported the result to 6502 assembly. It would then be possible to verify my hardware setup by running existing code, while still getting the learning experience from writing my implementation from scratch. Note that this would have required a different SD card break-out board.
  • Building something simpler first. For example, Ben Eater released a YouTube video around the time I was working on this, where he reads from an SPI temperature sensor on a 6502 computer. It would have been much easier to get started if I had used any other SPI device on a 6502 first.

The next step in this project is to load programs from an SD card, which would require me to implement a filesystem. This is significantly more complex than what I’ve done here, and I know not to approach it the same way.

More equipment

I picked up a logic analyser, USB SD card reader, alternative SD card module, and a spare SD card.

The logic analyser is compatible with sigrok, which is open source. This is the first time I’ve used this software, and it can decode SD card commands running over SPI, among other things.

The USB SD card reader will allow me to write disk images directly from my development environment, instead of copying them over to a laptop. I do most of my development from a virtual machine, and USB peripherals are easy to pass through.

The alternative SD card module is from SparkFun. Compared with the module I was writing about in this blog post, it has a smaller footprint, and includes the pull-up resistors. I’m hoping I can use this to fit an SD card into the 3D-printed enclosure for this project.