Switching between VFIO-enabled virtual machines

For the past few months, I’ve been using a setup for software development where I pass the mouse, keyboard and graphics card to a virtual machine via Linux VFIO. The setup is described in detail in its own blog post.

In short though, the idea is to provision a separate virtual machine for each project, so that I can install whatever tools I need to solve the problem at hand, with no real consequences if I break my install.

The problem

I have multiple virtual machines which are configured to use the same hardware, so I can only run one at a time. Before making the changes outlined in this blog post, the method I used to switch to a different VM was unwieldy: I needed is to shut down the current VM, switch monitor inputs to the host graphics, start a different VM, then switch monitor inputs back the the VM graphics.

I intend to do all of my work in VM’s, so my basic goal was to make a ‘switch VM’ mechanism available from within a VM, to remove any need to interact with the host operating system directly.

Creating an API

My plan for this was to run an agent on the host, which will accept a request to switch VM’s. It would then shut down the current VM, and start up a different one instead.

This was the first time I tried using the libvirt Python bindings before, and I was able to quickly make a script which toggled between two VM’s, which are nested within the VM I’m using for development, because of course they are.

Toggling which VM is active, using the Python libvirt API in PyCharm.

I then extended this to use an orderly shutdown, and to work with an arbitrary number of VM’s. CirrOS was not completing its boot in this environment, so I switched the test VM’s to Debian, and also passed through a USB device to each of them so that I would get an error if I ever tried to boot both VM’s simultaneously.

Last, I exposed this over HTTP as a simple API.

For the first iteration, I installed this on the host, and placed some scripts in a folder which is shared between all of the VM’s. The scripts use curl commands to call the API. To trigger a VM switch (or a shutdown of the host), I could go to a folder, and run one of these scripts.

Web UI

With that proof-of-concept out of the way, I made a web interface, so that I can use this instead of some bash scripts in a folder. This is a very simple Angular app.

GNOME extension

For the most part, I’m working on Debian using the GNOME desktop environment, which gives me some options for integrating this further into my desktop.

I first considered making a search provider, but settled instead for making a simple standalone extension to display an indicator instead, based on the example app shown running here.

As if nested virtualization was not mind-bending enough, I started testing this using a GNOME session within a GNOME session.

The extension is written in JavaScript, though the JavaScript API for GNOME extensions was very unfamiliar to me. There some obvious quality issues in the code, but it’s not being published to a repository, and it fires off the correct HTTP requests, so I’ll call it a success.

Deploying a GNOME extension manually is as simple as placing some files my home directory.

Result and next steps

I’ve bookmarked this URL in each of my VM’s.

And I’ve also installed this GNOME extension on the VM that runs on boot.

The idea of this setup is to have one virtual machine per project, so a good next step would be to make it easier to kick of the install process.

I’ve put the code for this project online at mike42/vfio-vm-switcher on GitHub.

Going all-in on GPU passthrough for software development

I recently spent some time improving my software development workflow at home, since my previous setup was starting to limit me. I settled on a configuration which uses GPU passthrough with the KVM hypervisor, running Debian as both a host and a guest.

This post aims to show some of the benefits and drawbacks of this more complex setup for my use case, as well as a specific combination of hardware and software which can be made to work.

Background and previous setup

I was using Debian testing as a host system, and provisioning project-specific virtual machines to work in via virt-manager. I use two monitors, and typically had an IDE (in the VM) open on one monitor, and a web browser (on the host) open on the other.

The setup is simple and effective:

  • The VMs ran in the QEMU user session – previously blogged about here.
  • The path /home/mike/workspace on the host was shared with every VM via a 9p fileshare.
  • Each desktop was set to 1920 x 1080 resolution, with auto-resize disabled. I kept the VM’s at this lower resolution to get acceptable graphical performance on 4K monitors when I set this up.

This allowed me to keep the host operating system uncluttered, which I value for both maintainability and security. Modern software development involves running a lot of code, such as dependencies pulled in via a language-specific package manager, random binaries from GitHub, or tools installed via a sketchy curl ... | sudo sh command. Better to keep that in a VM.

The main drawbacks are with the interface into the VM.

  • Due to scaling, I had imperfect image quality. I also had a small but noticeable input lag when working in my IDE.
  • Development involving 3D graphics was impractical due to the performance difference. I reported a trivial bug in Blender earlier this year but didn’t have an environment suitable for more extensive development on that codebase.
  • There was an audio delay when testing within the VM. I created a NES demo last year and this delay was less than ideal when it came to adding sound.

I was at one point triple-booting Debian, Ubuntu 22.04 and Windows, so that I could also run some software which wouldn’t work well in this setup.

Hardware

GPU passthrough on consumer hardware is highly dependent on motherboard and BIOS support. The most critical components in my setup are:

  • Motherboard: ASUS TUF X470-PLUS GAMING
  • CPU: AMD Ryzen 7 5800X
  • Monitors: 2 x LG 27UD58
  • Graphics card: AMD Radeon RX 6950 XT
    • Installed at PCIEX16_1, directly connected to the CPU
    • Connected to DisplayPort inputs on the two monitors

I added one component specifically for this setup:

  • Second graphics card: AMD Radeon RX 6400
    • Installed at PCIEX16_2, connected to chipset
    • Connected to HDMI inputs on the two monitors (one directly, one via an adapter)

The second graphics card will be used by the host. The RX 6400 is a low-cost, low-power, low-profile card which supports UEFI, Vulkan and the modern amdgpu driver.

In the slot I’ve installed it, it’s limited to PCIE 2.0 x4, and 1920 x 1080 is the highest resolution I can run at 60 Hz on these monitor inputs. I only need to use this to set up the VM’s, or as a recovery interface, so I’m not expecting this to be a major issue.

Inspecting IOMMU groups

On a fresh install of Debian 12, I started walking through the PCI passthrough via OVMF guide on the Arch Wiki, adapting it for my specific setup as necessary.

I verified that I was using UEFI boot, had SVM and IOMMU enabled, and I also enabled resizable BAR. I did not set any IOMMU-related Linux boot options, and also could not find a firmware setting to select which graphics card to use during boot.

Using the script on the Arch Wiki, I then printed out the IOMMU groups on my hardware, which are as follows:

IOMMU Group 0:
    00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 1:
    00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
IOMMU Group 2:
    00:01.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
IOMMU Group 3:
    00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 4:
    00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 5:
    00:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
IOMMU Group 6:
    00:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 7:
    00:05.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 8:
    00:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 9:
    00:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]
IOMMU Group 10:
    00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 11:
    00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]
IOMMU Group 12:
    00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 61)
    00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
IOMMU Group 13:
    00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 0 [1022:1440]
    00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 1 [1022:1441]
    00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 2 [1022:1442]
    00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 3 [1022:1443]
    00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 4 [1022:1444]
    00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 5 [1022:1445]
    00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 6 [1022:1446]
    00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 7 [1022:1447]
IOMMU Group 14:
    01:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
IOMMU Group 15:
    02:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Device [1022:43d0] (rev 01)
    02:00.1 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller [1022:43c8] (rev 01)
    02:00.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Bridge [1022:43c6] (rev 01)
    03:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01)
    03:01.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01)
    03:02.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01)
    03:03.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01)
    03:04.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01)
    03:09.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01)
    04:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 15)
    08:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch [1002:1478] (rev c7)
    09:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch [1002:1479]
    0a:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 24 [Radeon RX 6400/6500 XT/6500M] [1002:743f] (rev c7)
    0a:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller [1002:ab28]
    0b:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD Blue SN550 NVMe SSD [15b7:5009] (rev 01)
IOMMU Group 16:
    0c:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch [1002:1478] (rev c0)
IOMMU Group 17:
    0d:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch [1002:1479]
IOMMU Group 18:
    0e:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6950 XT] [1002:73a5] (rev c0)
IOMMU Group 19:
    0e:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller [1002:ab28]
IOMMU Group 20:
    0f:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function [1022:148a]
IOMMU Group 21:
    10:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP [1022:1485]
IOMMU Group 22:
    10:00.1 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Cryptographic Coprocessor PSPCPP [1022:1486]
IOMMU Group 23:
    10:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller [1022:149c]
IOMMU Group 24:
    10:00.4 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse HD Audio Controller [1022:1487]

Passthrough for an IOMMU group is all-or nothing. For example, all of the chipset-connected devices are grouped together, and I can’t pass any of them through to a VM unless I pass them all through.

I’m mainly interested in a graphics card, an audio device, and a USB controller, and helpfully I have one of each isolated in their own groups, presumably because they are connected to PCIe lanes which go directly to the CPU.

Diagram illustrating that a graphics card, audio device, and USB controller are connected to PCIe lanes which go directly to the CPU, while a number of other devices are connected to the chipset.

Graphics

I created a virtual machine, also Debian 12, in the QEMU system session. The only important setting for now is the chipset and firmware, and in my case I selected Q35, and the OVMF UEFI firmware. GPU passthrough will not work if a virtual machine is booting with legacy BIOS.

I use the WebGL Aquarium as a simple test for whether 3D acceleration is working. It runs much faster on the host system (this is using the RX 6400). The copy in the VM runs at just 3 frames per second, using SPICE display and virtio virtual GPU at this stage.

The next step was to isolate the GPU from the host. The relevant line from lspci -nn is:

0e:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6950 XT] [1002:73a5] (rev c0)

The vendor and device ID for this card is shown at the end of the line, 1002:73a5.

This needs to be added to the vfio-pci.ids boot option, which in my case involves updating /etc/default/grub.

GRUB_CMDLINE_LINUX="vfio-pci.ids=1002:73a5"

This is then applied by running update-grub2, and rebooting. It’s apparently possible to accomplish this without a reboot, but I’m following the easy path for now.

I verified that it worked by checking that the vfio-pci kernel module is in use for this device.

$ lspci -v
0e:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6950 XT] (rev c0) (prog-if 00 [VGA controller])
...
        Kernel driver in use: vfio-pci
        Kernel modules: amdgpu

Before booting up the VM, I added the card as a “Host PCI Device”. It was then visible in lspci output within the VM, but was not being used for video output yet.

To encourage the VM to output to the physical graphics card, I switched the virtualised video card model from “virtio” to “None”.

Booting up the VM after this change, the SPICE display no longer produces output, but USB redirection still works. I passed through the keyboard, then the mouse.

I then switched monitor inputs to the VM, now equipped with a graphics card, and the WebGL aquarium test ran at 30 FPS.

Now that I was switching between near-identical Debian systems, I started using different desktop backgrounds to stay oriented.

Audio output

I needed to make sure that audio output worked reliably in the virtual machine.

Sound was working out of the box through the emulated AC97 device, but I checked an online Audio Video Sync Test, and confirmed that there was a significant delay, somewhere in in the 200-300ms range. This is not good enough, so I deleted the emulated device and decided to try some other options.

I first tried passing through the audio device associated with the graphics card, identified as 1002:ab28.

0e:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller [1002:ab28]

I took the ID, and added it to the option in /etc/default/grub. Now that there are multiple devices, they are separated by commas.

GRUB_CMDLINE_LINUX="vfio-pci.ids=1002:73a5,1002:ab28"

As before, I ran update-grub2, rebooted, added the device to the VM through virt-manager, booted up the VM, and tried to use the device.

With that change, I could connect headphones to the monitor and the audio was no longer delayed. This environment would now be viable for developing apps with sound, or following along with a video tutorial for example.

Audio input

Next I attempted to pass through the on-board audio controller to see if I could get both audio input and output. Discussions online suggest that this doesn’t always work, but there is no harm in trying.

I’ll skip through the exact process this time, but I again identified the device, isolated it from the host, and passed it through to a VM.

10:00.4 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse HD Audio Controller [1022:1487]

Output worked immediately, but the settings app did not show any microphone level, and attempts to capture would immediately stop. I did some basic reading and troubleshooting, but didn’t have a good idea of what was happening.

What worked for me was blindly upgrading my way out of the problem by switching to the Debian testing rolling release in the guest VM.

I was then able to see the input level in settings, and capture the audio with Audacity.

Audio input is not critical for me, but does allow me to move more types of work into virtual machines without switching to a USB sound card.

USB

My computer has two USB controllers. One is available for passthrough, while the other is in the same IOMMU group as all other chipset-attached devices.

02:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Device [1022:43d0] (rev 01)
10:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller [1022:149c]

The controller I passed through is the “Matisse USB 3.0 Host Controller”, device ID 1022:149c, which I initially expected would have plenty of USB 3 ports attached to it.

Reading the manual for my motherboard more carefully, I discovered that this controller is only responsible for 2 USB-A ports and 1 USB-C port, or 20% of the USB ports on the system.

I use a lot of USB peripherals for hardware development. More physical USB ports would be ideal, but I can work around this by using a USB hub.

I’ll also continue to use USB passthrough via libvirt to get the keyboard and mouse into the VM. Instead of manually passing these through each time I start the VM, I added each device in the configuration as a “Physical USB Device”.

This automatically connects the device when the VM boots, and returns it to the host when the VM shuts down, without the SPICE console needing to be open. If I need to control the host and guest at the same time, I can connect a different mouse and keyboard temporarily – I’ve got plenty of USB ports on the host after all.

If this ever becomes a major issue, it should also be possible to switch this setup around, and pass through the IOMMU group containing all chipset-attached PCIe devices instead of the devices I’ve chosen. This would provide extensive I/O and expansion options to the VM, at the cost of things like on-board networking, SATA ports, and an NVMe slot on the host. The 3 USB ports on the “Matisse USB 3.0 Host Controller”, if left to the host, would be just enough for a mouse, keyboard, and USB-C Ethernet adapter.

File share

On my previous setup, I used a 9p fileshare to map a directory on the host to the same path on every VM, allowing an easy way to exchange files.

The path was /home/mike/workspace – a carry-over from when I used Eclipse. In practice it has been slower to work on a fileshare, so I’ll switch to developing in a local directory.

I’ll still set up a fileshare, but with two changes:

  • I’ll map a share to a more generic /home/mike/Shared, and start to back up anything that lands there.
  • I’ll use virtiofs instead of virtio-9p. This claims to be more performant, and it’s apparently possible to get this working on Windows as well.

This is added as a hardware device in virt-manager on the host.

In the guest, I added the following line to /etc/fstab.

shared /home/mike/Shared/ virtiofs

To activate this change, I would previously have created the mount-point and run mount -a. I recently learned that systemd creates mount-points automatically on modern systems, so I instead ran the correct incantation to trigger this process, which is:

$ sudo systemctl daemon-reload
$ sudo systemctl restart local-fs.target

Power management

In my current setup, sleep/wake causes instability. I initially disabled idle suspend on the host in GNOME:

While testing a VM which I had configured to start on boot, the system decided to go to sleep while I was using it. In hindsight this makes complete sense: as far as the host could tell, it was sitting on the login page with no mouse or keyboard input, and had been configured to go to sleep after an idle time-out (the setting within GNOME only applies after login).

Dec 05 16:24:43 mike-kvm systemd-logind[706]: The system will suspend now!

To avoid this, I additionally disabled relevant systemd units (source of this command).

$ sudo systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target
Created symlink /etc/systemd/system/sleep.target → /dev/null.
Created symlink /etc/systemd/system/suspend.target → /dev/null.
Created symlink /etc/systemd/system/hibernate.target → /dev/null.
Created symlink /etc/systemd/system/hybrid-sleep.target → /dev/null.

I’ve also configured each VM to simply blank the screen when idle, since allowing a guest to suspend causes its own problems.

Recap of unexpected issues

Despite my efforts to plan around the limitations of my hardware, I did hit four unanticipated problems, mostly highlighted above.

  • I assumed that there would be a setting in my BIOS to change which graphics card to use during boot, but I couldn’t find one.
  • The audio devices associated with both of my graphics cards had the same vendor and device ID. The way I have it configured, no audio output is available on the host as a result.
  • Microphone input didn’t work on the HD Audio Controller until I upgraded the guest operating system. The usual workaround for this is apparently to use a USB sound card.
  • I misunderstood the limitations of my USB setup, so I have fewer USB ports directly available in the VM than I had hoped for.

Wrap-up and future ideas

I really like the idea of switching into an environment which only has the tools I need for the task at hand. Compared with my previous setup for developing in a VM, GPU passthrough (and sound controller passthrough, and USB controller passthrough) is a huge improvement.

I’ve manually provisioned 3 VM’s, including a general-purpose development VM which has a mix of basic tools so that I can get started.

Since I can only have one VM using the graphics card at a time, this setup works similarly to having multiple operating systems installed as a multi-boot setup. It does however have far better separation between the systems – they can’t read and write each others’ disks for example.

The next steps for me will be to streamline the process of switching VM’s and shutting down the system. Both of these currently require me to manually switch monitor inputs to interact with the host, which I would prefer to avoid.

Building a 1U quiet NAS

I recently built a compact, quiet rackmount NAS for home. I haven’t seen any builds quite like it online, so I’m writing a bit about how it came together.

The problem

My old backup was a mirrored pair of 2 TB hard drives in an old desktop computer, with a portable 2 TB hard drive as an off-site copy. The disks are now 9 years old, and 2 TB is small enough that I need to ration the space. I also recently repurposed most of the components in that system, so I no longer have an up-to-date backup.

I want to solve this properly, and hopefully build a replacement setup which I can install in my network cabinet, to run 24/7.

But if I’m going to do that, the must-haves are:

  • It fits in a 1U rack-mount form-factor with maximum 25cm depth
  • It’s quiet
  • It has front USB for making an offline copy of the backups
  • It has 2 SATA disks – I don’t want to be running a NAS off USB-SATA adapters, for example
  • It has wired ethernet – my network is 1 Gigabit

Nice-to-haves would be:

  • Fast front USB
  • Hot-swap disks or more disks
  • Faster network

I’ve worked with servers, and I’ve built small-form-factor computers, so how hard can it be to build a small-form-factor server?

Parts list

I spent a lot of time sketching out possible builds in LibreOffice Draw. Every combination of parts had some compromise, which is a familiar theme from small-form-factor PC building.

I ended up deciding on parts which follow normal PC standards, hopefully giving me a good chance of keeping it working for many years.

  • Case: Case Athena Power RM1UC138
  • Power supply: HDPLEX GAN 250W
  • Motherboard: Topton N6005 Mini ITX
  • RAM: Crucial 16GB (2 x 8GB) DDR4 3200 SO-DIMM
  • Boot disk: Samsung 970 EVO Plus 1TB M.2 SSD
  • Storage disks: 2 x Samsung 870 QVO 8TiB 2.5″ SSD
  • Fans: 4 x Noctua A4x20 PWM
  • Drive cage: Icy Dock MB608SP-B – 6 x 2.5″ SATA bays
  • Set of 6 x 50cm SATA cables

There are some compatibility issues in the above part list, and it took a little bit of problem-solving to get everything working together.

Closer look at the Athena Power RM1UC138

Short-depth 1U computer cases are nearly impossible to find in Australia. I ordered the Athena Power RM1UC138 from the United States, which is an OEM case with some flexibility built in.

On the front it has a 5.25″ bay, which I plan to use to add a drive cage. The front USB is only USB 2.0, which would normally be a disadvantage, but in this instance is a good match for the motherboard I’m using. The power button looks like a toggle switch, but is actually a momentary switch.

The side shows that the rack ears can be put on the front or back, allowing the case to be mounted in either direction.

On the back there is a wire mesh panel for cutting out a custom I/O shield, which is handy since I have an off-brand motherboard.

On the inside, it’s configured for 2 x 3.5″ hard drives by default. It also includes 2 x 2-pin 12v fans. They are loud like you would find in most managed network switches, but are not jet-engine loud like most servers. The fan controller simply distributes 12v, and has no speed control.

This is not a very common case, and I read everything I could find online about it. In order to help the next person who is searching for it, here are two more random pieces of information which I could not confirm until I had the case in-hand:

  • Stand-offs on the case are all 4mm tall and non-removable
  • Screw spacing of USB 2.0 front panel connector is approx 30.5mm (from centre of each screw). Stacked USB 3 headers that are 30mm spacing are available online and could be made to work.

Closer look at the Topton N6005 Mini ITX

I chose to use a Topton motherboard with a built-in Intel N6005 CPU for this build, since the alternatives were either too tall, use a socketed CPU, use an old CPU, would require add-in cards to get multiple SATA ports, or were not sold in Australia. All of these would make it far more difficult to complete the small, quiet build which I was aiming for.

From the few threads online about this board, I gathered that it is fussy about RAM compatibility, so I booted it up at the first opportunity with an SSD containing Pop!_OS to check that it worked. It’s not my use case, but this motherboard would definitely be viable as a lightweight desktop.

The specific memory I used was a 16 GB kit with the model number CT2K8G4SFRA32A. According to Intel’s product documentation, the N6005 only supports 16 GB maximum, and while I could find claims that higher-capacity memory does work, I couldn’t find anybody who posted actual part numbers.

I was happy to find that the built-in cooler is inaudible at idle loads. This was a bit of a risk: the cooler doesn’t have standard dimensions, so I couldn’t have easily replaced it with an alternative if it was noisy. Based on other people’s experiences with this board, I re-pasted the cooler with Noctua NT-H1 thermal paste, to hopefully help keep temperatures down at higher loads so that the fan will not need to spin up as much. I also also avoided using the M.2 slot which receives the most hot air from the cooler.

Topton also sells an N5105 variant which appears to be more popular (more info here), as well as an alternative layout which has a PCIe slot instead of a second M.2 slot.

Custom power supply adapter plate

The case is designed for a Flex ATX power supply, which is not what I’m using. I’ve instead opted to use a passively-cooled HDPLEX GAN 250W power supply, which ships with both an IEC C6 and IEC C14 cable.

I needed to choose one of these cables, and figure out how to securely mount it to the case.

I designed an adapter plate in FreeCAD around the included IEC C6 cable, since it had threaded holes, and screws were included.

I ordered it from a prototype supplier in laser cut 1mm steel, painted in matte black.

This is the first time I’ve used FreeCAD to design a part, and parametric CAD certainly has a learning curve. For this build, it was well worth it, since the result is better (and safer) than anything I could have improvised.

At the time of writing this post, HDPLEX sells plates for mounting their IEC C14 cables in cases accepting SFX and ATX power supplies, but none for cases which accept Flex ATX power supplies.

Custom fan controller

Cooling this build quietly was always going to be a challenge. The case shipped with 2 x 12 V fans, and had a simple splitter which ran them at max speed, which was just too loud.

I designed a replacement fan controller in KiCad, which allows me to upgrade to high-quality 4-pin fans with PWM speed control, and to set the speed using a potentiometer. I wrote about prototyping this in a separate blog post.

This photo shows the custom controller alongside the original one it replaces.

As you can probably guess, I’ve designed this to use the same mounting location, at the front of the case. My power supply has no Molex connectors, so I’m using a SATA-Molex power adapter.

The main drawback to this simple design is that once I close up the case, I can no longer adjust the fan speed.

Final assembly

Before continuing any further, I took apart the case completely and deleted three standoffs with a belt sander, to leave a flat area for power supply installation later.

Once I got the case back together, the motherboard went in first. I raised it by 1mm using plastic washers, hoping to line it up better with the I/O shield included with the motherboard.

The I/O opening for 1U servers is narrower than standard PC builds, so I needed to cut the I/O shield, which I unfortunately did not do correctly.

Since that did not work, I carefully marked and cut out the wire mesh I/O shield included with the case instead. I still left the motherboard raised up on 1mm washers, though this is not necessary anymore. You need to use slightly longer screws if you try this.

After that I installed the four case fans, plus the fan controller. I’m using a front-to-back airflow, with 2 x 40mm fans mounted at the front, and 2 x 40mm fans at the back. I added a Y splitter to the back fans, which did not have long enough cables to reach the fan controller.

The next component I installed was the drive cage. It’s worth mentioning at this point that the drive cage also has a fan header, which is the same as a 3-pin header that you would find on a PC motherboard. It supplies a different voltage for each of the speed settings. Medium is approximately 7.5 volts and is relatively quiet with the included “Good quality DC fan” fan, and high speed is 12 volts. I set mine to off but left the fan installed.

To install items into the 5.25″ bay in this case, you attach a bracket, then fasten it from above. The bracket allows the depth to be adjusted as well.

I also installed disks in the drive cage at this point, and numbers on the front. Disk 1 is connected to SATA0 on the motherboard, disk 2 is connected to SATA1, and so on.

Next was the power supply. I installed the custom plate for the power connector, and also installed the mounting plate on the bottom of the PSU so that it would have a flat surface. After confirming that it would fit, I cleaned both surfaces with alcohol, and applied double-sided tape.

I then followed a rehearsed path to drop the PSU into place. There is no opportunity to adjust it once it sticks.

At this point I connected everything up and booted up the system to start checking for problems, since it’s easier to troubleshoot in this state. Two modifications I made here were to disconnect the bright red HDD LED, and to introduce a SATA power Y splitter, because the power supply SATA cables were stretched to the limit.

It took a lot of work (and cable ties) to arrange the cables flat so that the case could close. In defence of cable ties, they do make maintenance more difficult, but that’s a worthwhile trade-off for keeping cables clear of airflow paths, fan blades, and the guillotine-like action of the top cover sliding shut.

Completed build

After closing the case, the build is, 434mm × 254mm x 44mm, or 4.8 litres, excluding rack ears.

This is how it appears from the front.

And this is how it appears from the back.

Software

I’m starting with Proxmox, with OpenMediaVault deployed as a virtual machine. I haven’t used either of these before, but both are Debian-based and provide convenient web front-ends to the tools I would otherwise be configuring on the command-line.

I’m passing through the disks as block devices. Running the NAS like this should make it possible to provision extra workloads which need their own SATA disks in future, or to switch from OpenMediaVault to stock Debian if necessary, all without connecting a monitor.

Within OpenMediaVault, I’ve configured Linux software RAID, with an ext4 filesystem, shared via Samba, and can access that file share over the network.

I’ve enabled some basic power management features such as C-states. The system idles in the range of 12-14 watts measured from the wall, and goes up to 20 watts when moving files around.

I don’t need a lot of disk capacity, so I’ve been able to preserve a useful property of my old setup, where every disk in the system has a full copy of the data, in a format which can be understood by a normal Linux system. This it makes single-disk recovery possible using any surviving disk from the system on practically any computer, and that disk can be from either an offline copy or one of the disks in the RAID mirror.

I haven’t tested the process of making an offline copy of the backup volume, but that will be up next.

Wrap-up

This is possibly the most effort I’ve ever put into a PC build. The only unexpected issue I encountered is how heavy it is, and wont be rack-mounting it until I get some generic rails.

The computer uses a strange mix of parts, but meets my requirements well. I hope that by writing this up, I’ll be providing some useful notes to anybody attempting to build something similar.

This project also gave me a chance to practice my entry-level CAD skills to build something which I’ll actually be using. I find a lot of utility in paper prototyping, and printed each design in 1:1 scale to check the physical dimensions before ordering anything.

For the circuit board, I used a print-out to check each part footprint, as well as the hole locations for fitting it in the case.

As with many of the projects which I blog about, I’ve put the design files up on GitHub. The fan controller is at mike42/fan-controller-athena-power, while the Flex ATX adapter plate is at mike42/flexatx-adapter-hdplex.

Controlling computer fans with a microcontroller

I’m currently working on building a small computer, and want to add some 4-pin computer fans, running quietly at a fixed speed.

This blog post is just a quick set of notes from prototyping, since it covers a few topics which I haven’t written about before.

The problem

The speed of 4-pin computer fans can be controlled by varying the duty cycle on a 25 KHz PWM signal. This signal normally comes from a 4-pin case fan header on the motherboard, which will not be available in this situation. Rather than run the fans at 100%, I’m going to try to generate my own PWM signal.

Two main considerations led me to choose to use a microcontroller for this:

  • I’ll need some way to adjust the PWM duty cycle after building the circuit, because I don’t know which value will give the best trade-off between airflow and noise yet.
  • Fans need a higher PWM duty cycle to start than they do to run. If I want to run the fans at a very low speed, then I’ll need them to ramp up on start-up, before slowing to the configured value.

It’s worth noting that I’m a complete beginner with microcontrollers. I’ve run some example programs on the Raspberry Pi Pico, but that’s it.

First attempt

I already had MicroPython running on my Raspberry Pi Pico, so that was my starting point.

For development environment, I installed the MicroPython plugin for PyCharm, since I use JetBrains IDE’s already for programming. Most guides for beginners suggest using Thonny.

There are introductory examples on GitHub at raspberrypi/pico-micropython-examples which showed me everything I needed to know. I was able to combine an ADC example and a PWM example within a few minutes.

import time
from machine import Pin, PWM, ADC, Timer

level_input = ADC(0)
pwm_output = PWM(Pin(27))
timer = Timer()

# 25 KHz
pwm_output.freq(25000)


def update_pwm(timer):
    """ Update PWM duty cycle based on ADC input """
    duty = level_input.read_u16()
    pwm_output.duty_u16(duty)


# Start with 50% duty cycle for 2 seconds (to start fan)
pwm_output.duty_u16(32768)
time.sleep(2)

# Update from ADC input after that
timer.init(mode=Timer.PERIODIC, period=100, callback=update_pwm)

On my oscilloscope, I could confirm that the PWM signal had a 25 KHz frequency, and that the code was adjusting the duty cycle as expected. When the analog input (above) is set to a high value, the PWM signal has a high duty cycle.

When set to a low value, the PWM signal has a low duty cycle.

But can it control fans?

I wired up a PC fan to 12 V power, and also sent it this PWM signal, but the speed didn’t change. This was expected, since I knew that I would most likely need to convert the 3.3 V signal up to 5 V.

I ran it through a 74LS04 hex inverter with VCC at 5 V, which did the trick. I could then adjust a potentiometer, and the fan would speed up or slow down.

I captured the breadboard setup in Fritzing. Just note that there are floating inputs on the 74LS04 chip (not generally a good idea) and that the part is incorrectly labelled 74HC04, when the actual part I used was a 74LS04.

This shows a working setup, but it’s got more components than I would like. I decided to implement it a second time, on a simpler micro-controller which can work at 5 V directly.

Porting to the ATtiny85

For a second attempt, I tried using an ATtiny85. This uses the AVR architecture (of Arduino fame), which I’ve never looked at before.

This chip is small, widely available, and can run at 5 V. I can also program it using the TL-866II+ rather than investing into the ecosystem with Arduino development boards or programmers.

I found GPL-licensed code by Marcelo Aquino for controlling 4-wire fans.

After a few false starts trying to compile manually, I followed this guide to get the ATtiny85 ‘board’ definition loaded into the Arduino IDE.

From there I was able to build and get an intel hex file, using the “Export compiled binary” feature.

Writing fuses

The code assumes an 8 MHz clock. The Attiny85 ships from the factory with a “divide by 8” fuse active. This needs to be turned off, otherwise the clock will be 1 MHz. This involves setting some magic values, separate to the program code.

I found these values using a fuse calculator.

The factory default for this chip is:

lfuse 0x62, hfuse 0xdf, efuse 0xff.

To disable the divide-by-8 clock but leave all other values default, this needs to be:

lfuse 0xe2, hfuse 0xdf, efuse 0xff.

I am using the minipro open source tool to program the chip, via a TL-866II+ programmer. First, to get the fuses:

$ minipro -p ATTINY85@DIP8 -r fuses.txt -c config
Found TL866II+ 04.2.126 (0x27e)
Warning: Firmware is out of date.
  Expected  04.2.132 (0x284)
  Found     04.2.126 (0x27e)
Chip ID: 0x1E930B  OK
Reading config... 0.00Sec  OK

This returns the values expected in a text file.

lfuse = 0x62
hfuse = 0xdf
efuse = 0x00
lock = 0xff

I then set lfuse = 0xe2, and wrote the values back with this command.

$ minipro -p ATTINY85@DIP8 -w fuses.txt -c config
Found TL866II+ 04.2.126 (0x27e)
Warning: Firmware is out of date.
  Expected  04.2.132 (0x284)
  Found     04.2.126 (0x27e)
Chip ID: 0x1E930B  OK
Writing fuses... 0.01Sec  OK
Writing lock bits... 0.01Sec  OK

Writing code

Now the micro-controller is ready to accept the exported binary containing the program.

minipro -s -p ATTINY85@DIP8 -w sketch_pwm.ino.tiny8.hex -f ihex
Found TL866II+ 04.2.126 (0x27e)
Warning: Firmware is out of date.
  Expected  04.2.132 (0x284)
  Found     04.2.126 (0x27e)
Chip ID: 0x1E930B  OK
Found Intel hex file.
Erasing... 0.01Sec OK
Writing Code...  1.09Sec  OK
Reading Code...  0.45Sec  OK
Verification OK

With the chip fully programmed, I wired it up on a breadboard with 5 V power.

Checking the output again, the signal was the correct amplitude this time, but the frequency does move around a bit. This is likely because I’m using the internal RC timing on the chip, which is not very accurate. My understanding is that anything near 25 KHz will work fine.

Updates

I made only one change to Marcelo’s code, which is to spin the fan at 50% for a few seconds before using the potentiometer-set value. This is to avoid any issues where the fans fail to start because I’ve set them to run at a very low PWM value.

/*
 *                         ATtiny85
 *                      -------u-------
 *  RST - A0 - (D 5) --| 1 PB5   VCC 8 |-- +5V
 *                     |               |
 *        A3 - (D 3) --| 2 PB3   PB2 7 |-- (D 2) - A1  --> 10K Potentiometer
 *                     |               | 
 *        A2 - (D 4) --| 3 PB4   PB1 6 |-- (D 1) - PWM --> Fan Blue wire
 *                     |               |      
 *              Gnd ---| 4 GND   PB0 5 |-- (D 0) - PWM --> Disabled
 *                     -----------------
 */

// normal delay() won't work anymore because we are changing Timer1 behavior
// Adds delay_ms and delay_us functions
#include <util/delay.h>    // Adds delay_ms and delay_us functions

// Clock at 8mHz
#define F_CPU 8000000  // This is used by delay.h library

const int PWMPin = 1;  // Only works with Pin 1(PB1)
const int PotPin = A1;

void setup()
{
  pinMode(PWMPin, OUTPUT);
  // Phase Correct PWM Mode, no Prescaler
  // PWM on Pin 1(PB1), Pin 0(PB0) disabled
  // 8Mhz / 160 / 2 = 25Khz
  TCCR0A = _BV(COM0B1) | _BV(WGM00);
  TCCR0B = _BV(WGM02) | _BV(CS00); 
  // Set TOP and initialize duty cycle to 50%
  OCR0A = 160;  // TOP - DO NOT CHANGE, SETS PWM PULSE RATE
  OCR0B = 80; // duty cycle for Pin 1(PB1)
  // initial bring-up: leave at 50% for 4 seconds
  _delay_ms(4000);
}

void loop()
{
  int in, out;
  // follow potentiometer-set speed from there
  in = analogRead(PotPin);
  out = map(in, 0, 1023, 0, 160);
  OCR0B = out;
  _delay_ms(200);
}

The wiring of the breadboard is shown below. The capacitor is 0.1 µF for decoupling.

This is far more compact than the Raspberry Pi Pico prototype. I could also miniaturise it further by simply using surface-mount versions of the same components, where using an RP2040 microcontroller from the Pico directly on a custom board would incur some design effort.

Next steps & lessons learned

Although this project is simple, I had to learn quite a few things to prototype it successfully. Using a generic chip programmer like the TL-866II+ appears to be uncommon in the AVR world, and most online guides instead suggest repurposing an Arduino development board or using an Arduino ICSP programmer to program the Attiny85. I was glad to confirm that I could use my existing hardware instead of buying ecosystem-specific items which I would have no other use for. I find the development experience to be far better with the Raspberry Pi Pico, and that’s what I would be choosing for a more complex project.

I also captured the breadboard wiring in Fritzing for this blog post. The diagrams are clearer than a photo of a breadboard, but I’m not confident that they communicate information about the circuit as well as alternative approaches. For future posts, I’ll return to using KiCad EDA for schematic capture, unless there is some reason to highlight the physical layout of a breadboard prototype.

As a next step, I’ll be building a simple break-out PCB for a specific computer case to power the fans and supply a PWM signal, based on the ATtiny85 prototype shown here.

Converting my 65C816 computer project to 3.3 V

I recently spent some time re-building my 65C816 computer project to run at 3.3 volts, instead of 5 volts which it used previously. This blog post covers some of things I needed to take into consideration for the change.

It involved making use of options which I added when I designed this test board, as well as re-visiting all of the mistakes in that design.

Re-cap: Why 3.3 V is useful

One of the goals of this project is to build a modern computer which uses a 65C816 CPU, using only in-production parts.

A lot of interesting retro chips run at 5 V, but it’s much easier to find modern components which run at 3.3 V. I can make use of these options without adding level-shifting if I can switch important buses and control signals to 3.3 V.

ROM

The ROM chip is the most visible change, because the chip has a different footprint. When assembling this board for 5 V use, I used used an AT28C256 Parallel EEPROM, in a PDIP-28 ZIF socket. I couldn’t find 3.3 V drop-in replacement for this, so I added a footprint for a SST39LF010 flash chip, which has a similar-enough interface.

This was the first time I’ve soldered a surface-mount PLCC socket. I attempted to solder this with hot air, which was unsuccessful, so I instead cut the center of the socket out so that I could use a soldering iron. I then added a small square of 1000 GSM card (approx 1mm thick) as a spacer under the chip.

I also added a new make target to build the system ROM for this chip, which involves padding the file to a larger size, and invoking minipro with a different option so that it could write the file. I’m using a TL866II+ for programming, and adapter boards are available for programming PLCC-packaged chips.

$ make flashrom_sst39lf010
cp -f rom.bin rom_sst39lf010.bin
truncate -s 131072 rom_sst39lf010.bin
minipro -p "SST39LF010@PLCC32" -w rom_sst39lf010.bin
Found TL866II+ 04.2.126 (0x27e)
Warning: Firmware is newer than expected.
  Expected  04.2.123 (0x27b)
  Found     04.2.126 (0x27e)
Chip ID OK: 0xBFD5
Erasing... 0.40Sec OK
Writing Code...  8.38Sec  OK
Reading Code...  1.18Sec  OK
Verification OK

UART

The UART chip is an NXP SC16C752B, which is 3.3 V or 5 V compatible. I rescued one of these from an adapter board (previous experiment). This was my first time de-soldering a component with hot air gun. Hopefully it still works!

My FTDI-based USB/UART adapter has a configuration jumper which I adjusted to 3.3 V.

Clock, reset and address decode

I again used a MIC2775 supervisory IC for power-on reset, though the exact part was different. I previously used a MIC2275-46YM5 (4.6 V threshold), where the re-build used a MIC2275-31YM5 (3.1 V threshold).

I needed a 1.8432 MHz oscillator for the UART. The 3.3 V surface-mount part I substituted in had a different footprint, which I had prepared for.

I also prepared for this change by using an ATF22LV10 PLD for address decoding, which is both 3.3 V and 5 V compatible. The ATF22V10 which I used for earlier experiments works at 5 V only.

SD card

I installed the DEV-13743 SD card module more securely this time by soldering it in place and adding some double-sided tape as a spacer. This module is compatible with 3.3 V or 5 V.

The level-shifting and voltage regulation on this module is now superfluous, so I could probably simplify things by adding an SD card slot and some resistors directly in a later revision.

Modifications

There are three errors in the board which I know about (all described in my earlier blog post). This is my second time fixing them, so I tried to make sure the modifications were neat and reliable this time.

Firstly, the flip-flop used as a clock divider is wired up incorrectly, so I cut a trace and run a wire under the board.

Second, the reset button pin assignments are incorrect. Simply rotating the button worked on the previous board, but it wasn’t fitted securely. This this time I cut one of the legs and ran a short wire under the board, and the modification is barely noticeable.

Lastly, the address bus is wired incorrectly into the chip which selects I/O devices. I previously worked around this in software, but this time I cut two traces and ran two wires (the orange wires in the picture). I made an equivalent modification to the 5 V board, so that I could update the software and test it.

The wire I was using for these mods is far too thick and inflexible. I’ve added some 30 AWG silicon-insulated wire to my inventory for the next time I need to do this type of work, which should be more suitable.

Wrap-up

I transferred components from the old board, and attempted to boot it every time I made change. The power-on self test routine built in to the ROM showed that each new chip was being detected, and I soon had a working 65C816 test board again, now running at 3.3 V. I’m using my tiny linear power supply module to power it.

I can now interface to a variety of chips which only run at 3.3 V. This opens up some interesting possibilities for adding peripherals, which I hope to explore in future blog posts.

It will be more difficult to remove and re-program the ROM chip going forward though. This is hopefully not a problem, since can now run code from an SD card or serial connection as well (part 1, part 2).

Porting the Amiga bouncing ball demo to the NES

Earlier this year, I ported the Amiga bouncing ball demo to the Nintendo Entertainment System. This is a video capture from a NES emulator.

I completed this back in January, but I’m only publishing this blog post now, because I was considering entering it into a demo competition.

How it works

If you are familiar with NES development, then you will probably notice that there is nothing ground-breaking going on here. The ball is made up of 64 8×8 sprites, bouncing around the screen over a static background.

There are two different versions of the ball in sprite memory, and with some palette swapping, this can be stretched to 4 frames of animation, just enough to make the ball appear to spin.

There is enough sprite memory to extend this to 8 frames of animation without any banking tricks, but I would need to start again from scratch to achieve that.

Pre-rendered 3D

I drew the ball in Blender, with twice the number of segments required for the final image. This is a UV sphere with 8 rings, and 32 segments.

I coloured each face with one of 4 colours, then rendered it with the Workbench renderer, which I had configured to use Flat lighting, no anti-aliasing, and the texture colour.

I also tilted the ball.

This gave me a crisp, high-resolution ball with solid colours.

To test the idea, I processed this down to a 64×64 image on a transparent background. I need to substitute colours, so there is no antialiasing here.

I then wrote up a Python script to swap out colours to make the 4-frame pattern. Note that frames 3 and 4 are the same as frames 1 and 2, but with white and red swapped.

I ran this through ImageMagick to convert it into a GIF preview.

convert -delay 3.5 -loop 0 *.png animation.gif

The result appeared workable, so I went about making the same animation in a NES rom.

NES implmentation

On the NES, it’s not possible to create the spinning ball effect with palette changes only, because it would require four colours plus transparency. To overcome this, I split the image into two frames, each stored as two colours plus transparency, which is possible.

To start the code, I checked out a fresh copy of Brad Smith’s NES example project, then deleted things that I didn’t need.

When working with sprites on the NES, it is typical to store object attributes in RAM, then perform a DMA operation once per frame to copy this over to the Picture Processing Unit (PPU). For this project, I wanted to try setting up two different copies of the object attribute memory in RAM – one for each of the two rotations.

This should allow me to set any of the four frames by choosing between two possible colour palettes, and two possible DMA source addresses. This worked, and the first milestone was a spinning ball.

Switching between two DMA sources did not save much effort in the end, because I still needed quite a lot of code to set the X/Y positions of 64 sprites each frame. I set up some assembler macros to help with this.

Physics and sound

I have made one NES game before, and I was not happy with the physics. For this project, I wanted to do a bit better.

For the X position, I use a fixed speed, and simple collision detection with left/right boundaries. The animation frame is calculated from the X position, so the ball changes rotation depending on which direction it is moving, just like the Amiga demo.

The Y location of the ball is a simple loop, and follows the absolute value of a sine wave. I pre-computed this with some Python.

My last project also had no sound, so I read up on the NES APU, and added some noise when the ball changes direction.

Development setup

Since my last NES project, I have improved my development setup quite a bit. As always, I am using the ca65 assembler on Linux.

Previously, I was using a text editor. I have since moved to a custom 6502 assembly plugin on PyCharm, which allows me to quickly jump to definitions/usages. I developed this for my other 65C02 and 65C816 projects, which you can read about on this blog.

When I click run, I’ve set up the project to assemble a NES ROM, then launch with fceux, which has debugging features.

The debug features were previously only available on Windows builds of fceux, so when I made this demo, I was running the emulator via WINE. This has been added to the Linux builds as of v2.5.0, so I’ll most likely switch to that for my next project.

Wrap-up

I learn something new every time I write code for the NES, and it’s a lot of fun to make simple demos for these old systems. It’s also refreshing to create something standalone which I can explain through screenshots, instead of pages of assembly code.

For those who do want to read pages of assembly code, though, this project is up on GitHub at mike42/nes-ball-demo.

Let’s make a bootloader – Part 2

In my previous blog post, I wrote about loading and executing a 512 byte program from an SD card for my 65C816 computer project.

For part 2, I’ll look at what it takes to turn this into a bootloader, which can load a larger program from the SD card. I want the bootloader to control where to read data from, and how much data to load, which will allow me to change the program structure/size without needing to update the code on ROM.

Getting more data

The code in my previous post used software interrupts to print some text to the console, and my first challenge was to implement a similar routine for loading additional data from external storage.

The interface from the bootloader is quite simple: it sets a few registers to specify the source (a block number), destination memory address, and number of blocks to read, then triggers a software interrupt. The current implementation can read up to 64 KiB of data from the SD card each time it is called.

ROM_READ_DISK := $03
; other stuff ...
    ldx #$0000                      ; lower 2 bytes of destination address
    lda #$01                        ; block number
    ldy #$10                        ; number of blocks to read - 8 KiB
    cop ROM_READ_DISK               ; read kernel to RAM

Higher memory addresses

To load data in to higher banks (memory addresses above $ffff, I set the data bank register.

I needed to update a lot of my code to use absolute long (24-bit) addressing for I/O access, where it was previously using absolute (16-bit) addressing, for example:

sta $0c00

In my assembler, ca65, I already use the a: prefix to specify 16-bit addressing. I learned here that I can use the f: prefix for 24-bit addressing.

sta f:$0c00

Without doing this, the assembler chooses the smallest possible address size. This is fine in a 65C02 system, but I find it less helpful on the 65C816, where the meaning of a 16-bit address depends on the data bank register, and the meaning of an 8-bit address depends on the direct page register.

New bootloader

With the ROM code sorted out, I went on to write the new bootloader.

This new code sets the data bank address, then uses a software interrupt to load 8 KiB of data to address $010000

; boot.s: 512-byte bootloader. This utilizes services defined in ROM.
ROM_PRINT_STRING := $02
ROM_READ_DISK := $03

.segment "CODE"
    .a16
    .i16
    jmp code_start

code_start:
    ; load kernel
    ldx #loading_kernel             ; Print loading message (kernel)
    cop ROM_PRINT_STRING
    phb                             ; Save current data bank register
    .a8                             ; Set data bank register to 1
    php
    sep #%00100000
    lda #$01
    pha
    plb
    plp
    .a16
    ldx #$0000                      ; lower 2 bytes of destination address
    lda #$01                        ; block number
    ldy #$10                        ; number of blocks to read - 8 KiB
    cop ROM_READ_DISK               ; read kernel to RAM
    plb                             ; Restore previous data bank register

    ldx #boot                       ; Print boot message
    cop ROM_PRINT_STRING
    jml $010000

loading_kernel: .asciiz "Loading kernel\r\n"
boot: .asciiz "Booting\r\n"

.segment "SIGNATURE"
    wdm $42                         ; Ensure x86 systems don't recognise this as bootable.

The linker configuration for the bootloader is unchanged from part 1.

A placeholder kernel

The bootloader needed some code to load. I don’t have any real operating system to load, so I created the “Hello World” of kernels to work with in the meantime.

This is the simplest possible code I can come up with to test the bootloader. This assembles to 3 machine-language instructions, which occupy just 6 bytes. It is also is position-independent, and will work from any memory bank.

; kernel_tmp.s: A temporary placeholder kernel to test boot process.

ROM_PRINT_CHAR   := $00

.segment "CODE"
    .a16
    .i16
    lda #'z'
    cop ROM_PRINT_CHAR
    stp

The linker configuration for this, kernel_tmp.cfg, creates an 8 KiB binary.

MEMORY {
    ZP:     start = $00,    size = $0100, type = rw, file = "";
    RAM:    start = $0200,  size = $7e00, type = rw, file = "";
    PRG:    start = $e000,  size = $2000, type = ro, file = %O, fill = yes, fillval = $00;
}

SEGMENTS {
    ZEROPAGE: load = ZP,  type = zp;
    BSS:      load = RAM, type = bss;
    CODE:     load = PRG, type = ro,  start = $e000;
}

The commands I used to assemble and link the bootloader are:

ca65 --feature string_escapes --cpu 65816 --debug-info boot.s
ld65 -o boot.bin -C boot.cfg boot.o

The commands I used to assemble and link the placeholder kernel are:

ca65 --feature string_escapes --cpu 65816 --debug-info kernel_tmp.s
ld65 -o kernel_tmp.bin -C kernel_tmp.cfg kernel_tmp.o

I assembled the final disk image by concatenating these files together.

cat bootloader/boot.bin kernel_tmp/kernel_tmp.bin > disk.img

I used the GNOME disk utility to write the image to an SD card, which helpfully reports that I’m not using all of the available space.

Result

It took me many attempts to get the ROM code right, but this did work. This screen capture shows the test kernel source code on the left, which is being executed at the end of the boot-up process.

If I wanted to load the kernel from within a proper filesystem on the SD card (eg. FAT32), then I would need to update the starting block in the bootloader (hard-coded to 1 at the moment).

The limitations of this mechanism are that the kernel needs to be stored contiguously, be sector-aligned, and located within the first 64 MiB of the SD card.

Speed increase

This was the first time that I wrote code for this system and needed to wait for it to execute. The CPU was clocked at just 921.6 KHz. My SPI/SD card code was also quite generic, and was optimised for ease of debugging rather than execution speed.

I improved this from two angles. My 65C816 test board allows me to use a different clock speed for the CPU and UART chip, so I sped the CPU up to 4 MHz by dropping in an 8 MHz oscillator (it is halved to produce a two-phase clock). As I sped this up, I also needed to add more retries to the SD card initialisation code, since it does not respond immediately on power-up.

I also spent a lot of time hand-optimising the assembly language routine to receive a 512-byte block of data from the SD card. There is room to speed this up further, but it’s fast enough to not be a problem for now.

I had hoped to load an in-memory filesystem (ramdisk) alongside the test kernel, but I’ve deferred this until I can compress it, since reading from SD card is still quite slow.

A debugging detour

Writing bare-metal assembly with no debugger is very rewarding when it works, and very frustrating when there is something I’m not understanding correctly.

I ran into one debugging challenge here which is obvious in hindsight, but I almost couldn’t solve at the time.

This code is (I assure you) a bug-free way to print one hex character, assuming the code is loaded in bank 0. I was using this to hexdump the code I was loading into memory.

.a8                             ; Accumulator is 8-bit
.i16                            ; Index registers are 16-bit
lda #0                          ; A is 0 for example
and #$0f                        ; Take low-nibble only
tax                             ; Transfer to X
lda f:hex_chars, X              ; Load hex char for X
jsr uart_print_char             ; Print hex char for X

hex_chars: .byte '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'

The problem I had was that if the data bank register was 0, this would print ASCII character ‘0’ as expected. But if the data bank register was 1, it would print some binary character.

Nothing in this code should be affected by the data bank register, so I went through a methodical debugging process to try to list and check all of my assumptions. At one point, I even checked that the assembler was producing the correct opcode for this lda addressing mode (there are red herrings on the mailing lists about this).

I was able to narrow down the problem by writing different variations of the code which should all do the same thing, but used different opcodes to get there. This quickly revealed that it was the tax instruction which did not work as I thought, after finding that I could get the code working if I avoided it:

.a8                             ; Accumulator is 8-bit
.i16                            ; Index registers are 16-bit
ldx #0                          ; X is 0 for example
lda f:hex_chars, X              ; Load hex char for X
jsr uart_print_char             ; Print hex char for X

hex_chars: .byte '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'

My first faulty assumption was that if I set the accumulator to 0, then transferring the accumulator value to X would set the X register to 0 as well.

.a8                             ; Accumulator is 8-bit
.i16                            ; Index registers are 16-bit
lda #0                          ; Accumulator is 0
tax                             ; X is 0?

On this architecture, there are two bits of CPU status register which specify the size of the accumulator and index registers. When the accumulator is 8-bits, only the lower 8-bits of the 16-bit value are set by lda. I hadn’t realised that when the index registers are set to 16-bit, the tax instruction transfers all 16 bits of the accumulator to the X register (regardless of the accumulator size), which was causing surprising results.

Far away from the bug-free code I was focusing on, I had used a lazy method to set the data bank register, which involved setting the accumulator to $0101.

My second faulty assumption was that the data bank register was involved at all – in fact lda #$0101 would have been enough to break the later code.

.a16
lda #$0101                      ; Set data bank register to 1
pha
plb
plb

To fix this, when switching to an 8-bit accumulator, I now zero out the high 8-bits.

.a8
lda #0                          ; zero out the B accumulator
xba

Alternative boot method

I also added an option to boot from serial, based on my implementation of an XMODEM receive function for the 6502 processor.

This is a bit like PXE boot for retro computers. From a program such as minicom, you tell the ROM to boot from serial, then upload the kernel via an XMODEM send. It loads up to 64KiB to bank 1 (the same location as the bootloader on the SD card would place it), then executes it.

This is an important option for testing, since it’s a fast way to get some code running while I don’t have an operating system, and writing to an SD card from a modern computer is not a fast process. It may also be an important fallback if I need to troubleshoot SD card routines.

Wrap-up

Previously, I needed to use a ROM programmer to load code onto this computer. I can now use an SD card or serial connection, and have a stable interface for the bootstrapping code to access some minimal services provided by the ROM.

It is also the first time I’m running code outside of bank 0, breaking through the 64 KiB limit of the classic 6502. There is an entire megabyte of RAM on this test board.

Of course, this computer still does nothing useful, but at least it now controlled by an SD card rather than flashing a ROM. On the hardware side of the project, this will help me to convert the design from 5 V to 3.3 V. I’ll need to convert the ROM to a PLCC-packaged flash chip for that, which is not something I’ll want to be programming frequently.

As far as software goes, my plan is to work on some more interesting demo programs, so that I can start to build a library of open source 65C816 code to work with. The hardware design, software and emulator for this computer can be found on GitHub, at mike42/65816-computer.

Let’s make a bootloader – Part 1

I’ve been working on a homebrew computer based on the 16-bit 65C816 CPU. All of my test programs so far have run from an EEPROM chip, which I need to remove and re-program each time I need to make a change.

Plenty of retro systems ran all of their programs from ROM, but I only want to use it for bootstrapping this computer. I’ve got 8 KiB of space for ROM-based programs in the memory map, which should be plenty to check the hardware and load software from disk.

In this two-part blog post, I’ll take a look at handing over control from the ROM to a program loaded from an SD card.

Technical background

I did some quick reading about the process for bootstrapping a PC during the 16-bit era. My homebrew computer is a completely different architecture to an IBM-compatible PC, but I’m planning to follow a few of the conventions from this ecosystem, since I’ll have some similar challenges.

The best resources on this topic are aimed at bootloader developers, such as this wikibooks page.

For a disk to be considered bootable, the first 512-byte sector needs to end with hex 0xAA55 (little-endian) . This is 01010101 10101010 in binary (a great test pattern!). My system is not x68-based, so I’ll store a different value there.

If a disk is bootable, then the BIOS will transfer the 512 byte boot sector to $7C00 and jump to it. The only assumption that the BIOS seems to make about the bootloader structure is that it starts with executable code. I’ll do the same on my system.

It’s worth noting that the first sector may also contain data structures for a filesystem or partitioning scheme, and it’s up to the bootloader code to work around that. For now, my SD card will contain only a bootloader, which does simplify things a bit.

Most bootloaders will then make a series of BIOS calls via software interrupts, which enables them produce text output or load additional data from disk. This is where I’ll have the biggest challenge, since my ROM has no stable interface for a bootloader to call.

Re-visiting SD card handling

My first task was to load the bootloader itself from SD card, storing the first 512 bytes from the disk to RAM, at address $7C00 onwards. This should be straightforward, since I have working 6502 assembly routines for reading from an SD card, and I’ve added a port for an SD card module to my 65C816 test board.

I came up with a routine which prints the contents of the boot sector, then prompts for whether to execute it. My ROM code is not checking the signature at this stage, and is not aware that the boot sector in this screen capture contains x86 machine code within a FAT32 boot sector, but this is a good start.

It took quite a few revisions to get this working, since my old 65C02 code for reading from SD produced strange output on this system. On my 65C816 test board, it showed almost the right values, but it was jumbled up, and mixed with SPI fill bytes ($FF). The below screen capture shows a diff between the expected and actual output of the ROM.

After a long process to rule out other programming and hardware errors, I finally noticed that I was writing the data starting from address $0104, which was never going to work. The default stack pointer on this CPU is $01ff and grows down, so writing 512 bytes to $0104 would always corrupt the stack after a few hundred bytes.

At this stage I was using the assembler to statically allocate a 512 byte space for IO. It appeared in code like this:

.segment "BSS"
io_block_id:              .res 4
io_buffer:                .res 512

The error was in the linker configuration, which I updated to start assigning RAM addresses from $0200 onwards.

 MEMORY {
     ZP:     start = $00,    size = $0100, type = rw, file = "";
-    RAM:    start = $0100,  size = $7e00, type = rw, file = "";
+    RAM:    start = $0200,  size = $7d00, type = rw, file = "";
     PRG:    start = $e000,  size = $2000, type = ro, file = %O, fill = yes, fillval = $00;
 }

The full SD card handling code is too long to post in this blog, but now allows any 512-byte segment from the first 32 MB of the SD card (identified by a 16-bit segment ID) to be loaded into an arbitrary memory address.

Making an API

My next challenge was to define an API for the bootloader to call into the ROM to perform I/O.

I considered using a jump table, but decided to use the cop instruction instead. This triggers a software interrupt on the 65C816, and has parallels to how the int instruction is used to trigger BIOS routines on x86 systems.

I defined a quick API for four basic routines, passing data via registers.

  • print char
  • read char
  • print string
  • load data from storage

The caller would need to set some registers, then call cop from assembly language. Any return data would also be passed via registers.

The cop instruction takes a one-byte operand, which in this case specifies the ID of the function to call.

cop $00

To prove that the interface would work, I implemented just the routine for printing strings.

; interrupt.s: Handling of software interrupts, the interface into the ROM for
; software (eg. bootloaders)
;
; Usage: Set registers and use 'cop' to trigger software interrupt.
; Eg:
;   ldx #'a'
;   cop ROM_PRINT_CHAR
; CPU should be in native mode with all registers 16-bit.

.import uart_printz, uart_print_char
.export cop_handler
.export ROM_PRINT_CHAR, ROM_READ_CHAR, ROM_PRINT_STRING, ROM_READ_DISK

; Routines available in ROM via software interrupts.
; Print one ASCII char.
;   A is char to print
ROM_PRINT_CHAR   := $00

; Read one ASCII char.
;   Returns typed character in A register
ROM_READ_CHAR    := $01

; Print a null-terminated ASCII string.
;   X is address of string, use data bank register for addresses outside bank 0.
ROM_PRINT_STRING := $02

; Read data from disk to RAM in 512 byte blocks.
;   X is address to write to, use data bank register for addresses outside bank 0.
;   A is low 2 bytes of block number
;   Y is number of blocks to read
ROM_READ_DISK    := $03

.segment "CODE"
; table of routines
cop_routines:
.word rom_print_char_handler
.word rom_read_char_hanlder
.word rom_print_string_handler
.word rom_read_disk_handler

cop_handler:
    .a16                            ; use 16-bit accumulator and index registers
    .i16
    rep #%00110000
    ; Save caller context to stack
    pha                             ; Push A, X, Y
    phx
    phy
    phb                             ; Push data bank, direct register
    phd
    ; Set up stack frame for COP handler
    tsc                             ; WIP set direct register to equal stack pointer
    sec
    sbc #cop_handler_local_vars_size
    tcs
    phd
    tcd
caller_k := 15
caller_ret := 13
caller_p := 12
caller_a := 10
caller_x := 8
caller_y := 6
caller_b := 5
caller_d := 3
cop_call_addr := 0
    ; set up 24 bit pointer to COP instruction
    ldx <frame_base+caller_ret
    dex
    dex
    stx <frame_base+cop_call_addr
    .a8                             ; Use 8-bit accumulator
    sep #%00100000
    lda <frame_base+caller_k
    sta <frame_base+cop_call_addr+2
    .a16                            ; Revert to 16-bit accumulator
    rep #%00100000

    ; load COP instruction which triggered this interrupt to figure out routine to run
    lda [<frame_base+cop_call_addr]
    xba                             ; interested only in second byte
    and #%00000011                  ; mask down to final two bits (there are only 4 valid functions at the moment)
    asl                             ; multiply by 2 to index into table of routines
    tax
    jsr (cop_routines, X)

    ; Remove stack frame for COP handler
    pld
    tsc
    clc
    adc #cop_handler_local_vars_size
    tcs

    ; Restore caller context from stack, reverse order
    pld                             ; Pull direct register, data bank
    plb
    ply                             ; Pull Y, X, A
    plx
    pla
    rti

cop_handler_local_vars_size := 3
frame_base := 1

rom_print_char_handler:
    ldx #aa
    jsr uart_printz
    rts

rom_read_char_hanlder:
    ldx #bb
    jsr uart_printz
    rts

rom_print_string_handler:
    ; Print string from X register
    ldx <frame_base+caller_x
    jsr uart_printz
    rts

rom_read_disk_handler:
    ldx #cc
    jsr uart_printz
    rts

aa: .asciiz "Not implemented A\r\n"
bb: .asciiz "Not implemented B\r\n"
cc: .asciiz "Not implemented C\r\n"

This snippet is quite dense, and uses several features which are new to the 65C816, not just the cop instruction.

I’m relocating the direct page to use as a stack frame, which is an idea I got from reading the output of the WDC 65C816 C compiler. Pointers are much easier to work with on the direct page.

This is the first snippet I’ve shared which uses a 24-bit pointer, via “direct page, indirect long” addressing. The pointer is used to load the instruction which triggered the interrupt, so that the code can figure out which function to call.

lda [<frame_base+cop_call_addr]

This snippet is also the first time I’ve used the jump to subroutine instruction (jsr) with the “absolute indirect, indexed with X” address mode. On the 65C02, I could only use this addressing mode on the jmp instruction. The only example of that on this blog is also an interrupt handling example.

jsr (cop_routines, X)

The “Hello World” of bootloaders

My next goal was to load a small program from disk, and show that it can call routines from the ROM. For now it is just a program on the boot sector on an SD card, which demonstrates that the new software interrupt API works.

This assembly file boot.s prints out two strings, so that I can be sure that the ROM is returning control back to the bootloader after a software interrupt completes.

ROM_PRINT_STRING := $02

.segment "CODE"
    .a16
    .i16
    ldx #test_string_1
    cop ROM_PRINT_STRING
    ldx #test_string_2
    cop ROM_PRINT_STRING
    stp

test_string_1: .asciiz "Test 1\r\n"
test_string_2: .asciiz "Test 2\r\n"

.segment "SIGNATURE"
    wdm $42                         ; Ensure x86 systems don't recognise this as bootable.

The linker configuration which goes with this is boot.cfg:

MEMORY {
    ZP:     start = $00,    size = $0100, type = rw, file = "";
    RAM:    start = $7e00,  size = $0200, type = rw, file = "";
    PRG:    start = $7c00,  size = $0200, type = rw, file = %O, fill = yes, fillval = $00;
}

SEGMENTS {
    ZEROPAGE:   load = ZP,  type = zp;
    BSS:        load = RAM, type = bss;
    CODE:       load = PRG, type = rw,  start = $7c00;
    SIGNATURE:  load = PRG, type = rw,  start = $7dfe;
}

The commands to assemble and link this are:

ca65 --feature string_escapes --cpu 65816 boot.s
ld65 -o boot.bin -C boot.cfg boot.o

This produces a 512 byte file, which I wrote to SD card.

This is the first time this computer is running code from RAM, which is an important milestone for this project.

Editor improvements

I needed to do some work on my 6502 assembly plugin for IntelliJ during this process, since it didn’t understand the square brackets used for the long address modes.

    lda [<frame_base+cop_call_addr]

While I was fixing this, I also implemented an auto-format feature. This saves me the manual effort of lining up all the comments in a column, as is typical in assembly code.

Lastly, I added support for jumping to unnamed labels, which are a ca65-specific feature.

Next steps

In the second half of this blog post, I’ll get the bootloader to load a larger program from the SD card. I’m hoping to allow the bootloader to control how much code to load, and where to load it from.

Building a simple power supply module

I recently put together a tiny module to power my electronics projects. This is a 3cm x 4cm circuit board, and can be assembled to deliver either a fixed 3.3V or 5V output.

Background: Breadboard power supplies

I start a lot of my projects on a breadboard, and previously used a breadboard power supply. The voltage regulators on these are not particularly resilient, and I’ve damaged two of them by wiring something up incorrectly and drawing too much current.

They do not exactly fail-safe, and now pass through the input voltage to whatever I’m working on. This is 12 V in my case, which is enough to fry most of the components I work with.

A bit of searching shows that this is a common issue. The first power supply which did this was a Bud Industries BBP-32701, which uses an LM1117 regulator.

The second one was a DFRobot DFR0140, which uses the AMS1117 regulator.

For the past year or so, I’ve used an L7805 regulator connected to a breadboard instead. This is not particularly efficient, but it works well, and I’ve caused enough short circuits to know that the over-current protection works.

Converting to a PCB

My only goal was to take the exact parts I’ve already got, and make them into a more compact module for permanent use.

I used KiCad to capture the schematic and design the board. The minimum order is generally 5 circuit boards, so I planned to assemble some with an L7805 (5 V), and others with a LD33V (3.3 V) regulator. These have different pin-outs, so I added a series of solder jumpers so that I can use the same board for either regulator.

Layout for this PCB was straightforward. It’s a two-layer board, and I added a ground pour on the bottom layer, and the output voltage on most of the top layer.

I also decided to switch off thermal relief on the copper pours, to see if a direct connection to a ground plane really makes soldering more difficult (it does! lesson learned).

I tried to make this board quite dense, but didn’t go completely overboard: It’s useful to have holes to install standoffs, plus a small area to write on, so that I can distinguish the 5 V and 3.3 V modules later.

Wrap-up

I’m still a little surprised that custom PCB’s are so accessible for hobby use. The unit price of these boards was just 1.36 AUD before shipping, which is cost-competitive with blank prototyping board.

It’s easy to get assembled DC-DC converters, so I don’t suggest building any of these modules yourself. However, as with many of the things I blog about, I’ve uploaded the project files to GitHub under an open source license. The KiCad project, Gerber files and parts list for this project can be found at mike42/simple-power-supply.

Let’s implement power-on self test (POST)

I recently implemented simple Power-on self test (POST) routine for my 65C816 test board, so that it can stop and indicate a hardware failure before attempting to run any normal code.

This was an interesting adventure for two main reasons:

  • This code needs to work with no RAM present in the computer.
  • I wanted to try re-purposing the emulation status (E) output on the 65C816 CPU to blink an LED.

Background

Even modern computers will stop and provide a blinking light or series of beeps if you don’t install RAM or a video card. This is implemented in the BIOS or UEFI.

I got the idea to use the E or MX CPU outputs for this purpose from this thread on the 6502.org forums. This method would allow me to blink a light with just a CPU, clock input, and ROM working.

My main goal is to perform a quick test that each device is present, so that start-up fails in a predictable way if I’ve connected something incorrectly. This is much simpler than the POST routine from a real BIOS, because I’m not doing device detection, and I’m not testing every byte of memory.

Boot-up process

On my test board, I’ve connected an LED directly to the emulation status (E) output on the 65C816 CPU. The CPU starts in emulation mode (E is high). However I have noticed that on power-up, the value of E appears to be random until /RES goes high. If I were wiring this up again, I would also prevent the LED from lighting up while the CPU is in reset:

The first thing the CPU does is read an address from ROM, called the reset vector, which tells it where start executing code.

In my case, the first two instructions set the CPU to native mode, which are clc (clear carry) and xce (exchange carry with emulation).

.segment "CODE"
reset:
    .a8
    .i8
    clc                            ; switch to native mode
    xce
    jmp post_check_loram

By default accumulator and index registers are 8-bit, the .a8 and .i8 directives simply tell the assembler ca65 that this is the case.

Next, the code will jmp to the start of the POST process.

Checking low RAM

The first part of the POST procedure checks if the lower part of RAM is available, by writing values to two address and checking that the same values can be read back.

Note that a:$00 causes the assembler to interpret $00 as an absolute address. This will otherwise be interpreted as direct-page address, which is not what’s intended here.

post_check_loram:
    ldx #%01010101                 ; Power-on self test (POST) - do we have low RAM?
    ldy #%10101010
    stx a:$00                      ; store known values at two addresses
    sty a:$01
    ldx a:$00                      ; read back the values - unlikely to be correct if RAM not present
    ldy a:$01
    cpx #%01010101
    bne post_fail_loram
    cpy #%10101010
    bne post_fail_loram
    jmp post_check_hiram

If this fails, then the boot process stops, and the emulation LED blinks in a distinctive pattern (two blinks) forever.

post_fail_loram:                   ; blink emulation mode output with two pulses
    pause 8
    blink
    blink
    jmp post_fail_loram            ; repeat indefinitely

Macros: pause and blink

It’s a mini-challenge to write code to blink an LED in a distinctive pattern without assuming that RAM works. This means no stack operations (eg. jsr and rts instructions), and that I need to store anything I need in 3 bytes: the A, X, Y registers. A triple-nested loop is the best I can come up with.

I wrote a pause macro, which runs a time-wasting loop for the requested duration – approximately a multiple of 100ms at this clock speed. Every time this macro is used, the len value is substituted in, and this code is included in the source file. This example also uses unnamed labels, which is a ca65 feature for writing messy code.

.macro pause len                   ; time-wasting loop as macro
    lda #len
:
    ldx #64
:
    ldy #255
:
    dey
    cpy #0
    bne :-
    dex
    cpx #0
    bne :--
    dec
    cmp #0
    bne :---
.endmacro

The second macro I wrote is blink, which briefly lights up the LED attached to the E output by toggling emulation mode. I’m using the pause macro from both native mode and emulation mode in this snippet, so I can only treat A, X and Y as 8-bit registers.

.macro blink
    sec                            ; switch to emulation mode
    xce
    pause 1
    clc                            ; switch to native mode
    xce
    pause 2
    sec                            ; switch to emulation mode
.endmacro

Checking high RAM

There is also a second RAM chip, and this process is repeated with some differences. For one, I can now use the stack, which is how I set the data bank byte in this snippet.

Here a:$01 is important, because with direct page addressing, $01 means $000001 at this point in the code, where I want to test that I can write to the memory address $080001.

post_check_hiram:
    ldx #%10101010                 ; Power-on self test (POST) - do we have high RAM?
    ldy #%01010101
    lda #$08                       ; data bank to high ram
    pha
    plb
    stx a:$00                      ; store known values at two addresses
    sty a:$01
    ldx a:$00                      ; read back the values - unlikely to be correct if RAM not present
    ldy a:$01
    cpx #%10101010
    bne post_fail_hiram
    cpy #%01010101
    bne post_fail_hiram
    lda #$00                       ; reset data bank to boot-up value
    pha
    plb
    jmp post_check_via

The failure here is similar, but the LED will blink 3 times instead of 2.

post_fail_hiram:                   ; blink emulation mode output with three pulses. we could use RAM here?
    pause 8
    blink
    blink
    blink
    jmp post_fail_hiram            ; repeat indefinitely

To make sure that I was writing to different chips, I installed the RAM chips one at a time, first observing the expected failures, and then observing that the code continued past this point with the chip installed.

I also checked with an oscilloscope that both RAM chips are now being accessed during start-up. Now that I’ve got some confidence that the computer now requires both chips to start, I can skip a few debugging steps if I’ve got code that isn’t working later.

Checking the Versatile Interface Adapter (VIA)

The third chip I wanted to add to the POST process is the 65C22 VIA. I kept this check simple, because one read to check for a start-up default is sufficient to test for device presence.

VIA_IER = $c00e
post_check_via:                    ; Power-on self test (POST) - do we have a 65C22 Versatile Interface Adapter (VIA)?
    lda a:VIA_IER
    cmp #%10000000                 ; start-up state, interrupts enabled overall (IFR7) but all interrupt sources (IFR0-6) disabled.
    bne post_fail_via
    jmp post_ok

This stops and blinks 4 times if it fails. I recorded the GIF at the top of this blog post by removing the component which generates a chip-select for the VIA, which causes this code to trigger on boot.

post_fail_via:
    pause 8
    blink
    blink
    blink
    blink
    jmp post_fail_via

Beep for good measure

At the end of the POST process, I put in some code to generate a short beep.

This uses the fact that the 65C22 can toggle the PB7 output each time a certain number of clock-cycles pass. I’ve connected a piezo buzzer to that output, which I’m using as a PC speaker. The 65C22 is serving the role of a programmable interrupt timer from the PC world.

VIA_DDRB = $c002
VIA_T1C_L = $c004
VIA_T1C_H = $c005
VIA_ACR = $c00b
BEEP_FREQ_DIVIDER = 461            ; 1KHz, formula is CPU clock / (desired frequency * 2), or 921600 / (1000 * 2) ~= 461
post_ok:                           ; all good, emit celebratory beep, approx 1KHz for 1/10th second
    ; Start beep
    lda #%10000000                 ; VIA PIN PB7 only
    sta VIA_DDRB
    lda #%11000000                 ; set ACR. first two bits = 11 is continuous square wave output on PB7
    sta VIA_ACR
    lda #<BEEP_FREQ_DIVIDER        ; set T1 low-order counter
    sta VIA_T1C_L
    lda #>BEEP_FREQ_DIVIDER        ; set T1 high-order counter
    sta VIA_T1C_H
    ; wait approx 0.1 seconds
    pause 1
    ; Stop beep
    lda #%11000000                 ; set ACR. returns to a one-shot mode
    sta VIA_ACR
    stz VIA_T1C_L                  ; zero the counters
    stz VIA_T1C_H
    ; POST is now done
    jmp post_done

The post_done label points to the start of the old ROM code, which is currently just a “hello world” program.

Next steps

I’m now able to lock in some of my assumptions about should be available in software, so that I can write more complex programs without second-guessing the hardware.

Once the boot ROM is interacting with more hardware, I may add additional checks. I will probably need to split this into different sections, and make use of jsr/rts once RAM has been tested, because the macros are currently generating a huge amount of machine code. I have 8KiB of ROM in the memory map for this computer, and the code on this page takes up around 1.1KiB.