
From Math to Silicon: 69.7 Gops/s on a $13.50 Chip

Tags: hardware, fpga, risc-v, engineering
Board cost: $13.50
Ops/sec (single bank): 94.5M
Ops/sec (512 banks): 69.7B
SoC generations: 3

ATOMiK started as a set of equations on a whiteboard. XOR-based delta-state algebra: four operations, provably correct, with some interesting properties. But equations on a whiteboard don't ship products. This is the story of how those equations became silicon — three generations of custom hardware, from a first blink test to a custom 64-bit RISC-V CPU with native ATOMiK instructions and HD video output.

Chapter 1: The Math

The core insight is deceptively simple. Instead of storing state and copying it around, you store a reference point and accumulate XOR deltas:

current_state = initial_state XOR accumulator

XOR gives you an Abelian group for free — commutative, associative, self-inverse, with identity element zero. We proved all of this formally with 92 Lean4 theorems. Not "we tested it" — we proved it. The algebra is correct by construction.

The practical consequence: deltas can arrive in any order, from any number of producers, and the result is identical. No locks. No consensus protocol. No ordering constraints. The accumulator is a shared resource by design.

Chapter 2: First Silicon — PicoRV32 + ATOMiK (v1)

The first hardware target was the Tang Nano 9K — a $13.50 FPGA board with a Gowin GW1NR-9K chip, 8,640 LUTs, and 26 block RAMs. We paired ATOMiK with a PicoRV32 RISC-V soft core.

The ATOMiK core lives on the memory bus as an MMIO peripheral. The CPU writes to specific addresses to trigger LOAD, ACCUM, READ, and SWAP operations. A toggle-handshake CDC bridge crosses between the CPU clock domain (25.2 MHz) and the ATOMiK domain (81 MHz).

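// ATOMiK v1: MMIO-mapped operations
#include <stdint.h>

#define ATOMIK_BASE     0x20000000u
#define ATOMIK_LOAD     (ATOMIK_BASE + 0x00)  // set reference state
#define ATOMIK_ACCUM    (ATOMIK_BASE + 0x04)  // XOR delta into accumulator
#define ATOMIK_READ     (ATOMIK_BASE + 0x08)  // reconstruct current state
#define ATOMIK_SWAP     (ATOMIK_BASE + 0x0C)  // atomic read-and-reset

// volatile forces a real bus access for every register read or write
*(volatile uint32_t*)ATOMIK_LOAD = 0xDEADBEEF;
*(volatile uint32_t*)ATOMIK_ACCUM = 0x000000FF;
uint32_t state = *(volatile uint32_t*)ATOMIK_READ;
// state == 0xDEADBE10 (0xEF XOR 0xFF == 0x10)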

Result: single-bank ATOMiK at 81 MHz, 94.5 million operations per second. The entire SoC fits in 44% of the GW1NR-9K. 11/11 hardware tests pass. The core has +23% Fmax margin — it could run faster, but we're limited by the PicoRV32's bus timing.

Chapter 3: Custom CPU — RV64I + ATOMiK ISA (v2/v3)

MMIO works, but it costs bus cycles. Every ATOMiK operation requires a store instruction, a bus transaction, and a load to read the result. What if ATOMiK operations were native CPU instructions?

We built a custom 64-bit RISC-V CPU from scratch. Not a fork — a ground-up implementation with a pipelined FSM (FETCH, DECODE, EXECUTE, WRITEBACK), SPI XIP flash boot, UART, and native ATOMiK custom instructions using the RISC-V custom-0 opcode space:

// ATOMiK v3 — Native ISA extensions (custom-0 opcode 0x0B)
// funct3 encoding:
//   000 = LOAD   (set reference state)
//   001 = ACCUM  (XOR delta into accumulator)
//   010 = READ   (reconstruct current state)
//   011 = SWAP   (atomic read-and-reset)

// In assembly:
.insn r 0x0b, 0, 0, x0, a0, x0    # LOAD  a0
.insn r 0x0b, 1, 0, x0, a1, x0    # ACCUM a1
.insn r 0x0b, 2, 0, a2, x0, x0    # READ  -> a2
.insn r 0x0b, 3, 0, a3, x0, x0    # SWAP  -> a3

ATOMiK operations now execute in a single EXECUTE stage cycle — the same cost as an ADD or XOR instruction. No bus overhead. No MMIO latency. Zero extra cycles.

Chapter 4: HD Video — 1280x720@60Hz (v3.1)

To demonstrate delta-driven display, we added an HDMI output pipeline. On a $13.50 FPGA. At 1280x720@60Hz.

This required 6 pixel pipeline optimizations to hit the 74.25 MHz pixel clock: 3-stage TMDS encoding, pre-registered RNG and cursor flags in svo_tcard, parallel prefix gray-to-binary conversion, split font pipeline, pre-registered BRAM ports, and a register buffer between encoder and TMDS serializer.

The delta display module sits in the video pipeline between the overlay and the encoder. It maintains a per-scanline buffer and applies LUT-mapped delta colors in real time — the display literally shows state changes as they happen, driven by the ATOMiK accumulator.

Final v3.1.0 resource usage on the GW1NR-9K: 6,287 LUT (73%), 3,783 CLS (88%), 20/26 BSRAM (77%). Pixel Fmax: 74.384 MHz (+0.18% margin). We hit the practical optimization ceiling — CLS at 88% is the binding constraint.

Chapter 5: Scaling Up — Zynq Characterization

The Tang Nano 9K proved the architecture. But 8,640 LUTs limit you to a single ATOMiK bank. What happens when you have 53,200 LUTs?

We characterized ATOMiK on the Xilinx Zynq XC7Z020, sweeping from 1 to 512 parallel banks across 4 synthesis strategies (baseline, area, aggressive, maximum):

Banks    Fmax (MHz)    LUT       LUT %    Gops/s
N=1      444.4         302       0.6%     0.4
N=4      347.8         543       1.0%     1.4
N=16     266.7         941       1.8%     4.4
N=64     205.1         3,498     6.6%     13.4
N=256    148.1         15,197    28.6%    38.1
N=512    135.6         23,542    44.3%    69.7

Sub-linear LUT scaling: 512 banks cost only 44.3% of the fabric. Each additional bank adds ~34 LUTs beyond the first, because the shared infrastructure (BRAM, merge tree, CDC bridge) amortizes across all banks.

69.7 billion operations per second. On a $99 development board. With 56% of the fabric still available for your application logic.

What we learned

  • Start with the math. The 92 Lean4 proofs caught edge cases that testing never would have found. When your algebra is provably correct, debugging hardware becomes purely a plumbing exercise.
  • $13.50 is enough. You don't need a $10,000 FPGA board to validate a novel architecture. The Tang Nano 9K proved everything we needed — the Zynq just showed it scales.
  • Custom instructions matter. Going from MMIO (v1) to native ISA (v3) eliminated all bus overhead. ATOMiK ops execute at the same cost as ALU operations.
  • The ceiling is the fabric, not the design. At N=512, we're using 44% of the XC7Z020. N=1024 needs an XC7Z045 (218K LUT) or UltraScale+. The ATOMiK core itself has no inherent scaling limit.

What's next

The ALINX AX7020 Zynq board is on the bench. Next step: PS+PL block design, Linux driver integration, and live benchmarks with the kernel module talking to hardware-accelerated ATOMiK contexts. After that: ASIC evaluation on Sky130.

The math works. The software works. The hardware works. Now we scale.
