March 15, 2026

From Math to Silicon: 69.7 Gops/s on a $13.50 Chip

Tags: hardware, fpga, risc-v, engineering

Board cost: $13.50
Ops/sec (single bank): 94.5M
Ops/sec (512 banks): 69.7B


ATOMiK started as a set of equations on a whiteboard. XOR-based delta-state algebra: four operations, provably correct, with some interesting properties. But equations on a whiteboard don't ship products. This is the story of how those equations became silicon — three generations of custom hardware, from a first blink test to a custom 64-bit RISC-V CPU with native ATOMiK instructions and HD video output.

Chapter 1: The Math

The core insight is deceptively simple. Instead of storing state and copying it around, you store a reference point and accumulate XOR deltas:

current_state = initial_state XOR accumulator

Bitwise XOR on fixed-width words gives you an Abelian group for free: it is commutative and associative, zero is the identity element, and every value is its own inverse. We proved all of this formally with 92 Lean 4 theorems. Not "we tested it": we proved it. The algebra is correct by construction.

The practical consequence: deltas can arrive in any order, from any number of producers, and the result is identical. No locks. No consensus protocol. No ordering constraints. The accumulator is a shared resource by design.
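To make the order-independence claim concrete, here is a minimal C sketch. The function names (`xor_accumulate`, `reconstruct`, `order_independent`) and the sample deltas are illustrative assumptions, not part of the ATOMiK SDK, and the code models arrival order only, not concurrent hardware producers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a list of deltas into one accumulator using XOR. */
uint64_t xor_accumulate(const uint64_t *deltas, size_t n) {
    assert(deltas != NULL || n == 0);
    uint64_t acc = 0;                 /* zero is the identity element */
    for (size_t i = 0; i < n; i++) {
        acc ^= deltas[i];
    }
    return acc;
}

/* current_state = initial_state XOR accumulator */
uint64_t reconstruct(uint64_t initial_state, uint64_t accumulator) {
    return initial_state ^ accumulator;
}

/* Returns 1 when two arrival orders of the same deltas give the same state. */
int order_independent(void) {
    const uint64_t in_order[] = { 0x0F, 0xF0, 0x3C };
    const uint64_t shuffled[] = { 0x3C, 0x0F, 0xF0 };
    const uint64_t initial = 0xDEADBEEF;
    uint64_t a = reconstruct(initial, xor_accumulate(in_order, 3));
    uint64_t b = reconstruct(initial, xor_accumulate(shuffled, 3));
    return a == b;
}
```

With these sample deltas the accumulator ends at 0xC3 in either order, so an initial state of 0xDEADBEEF reconstructs to 0xDEADBE2C both times. Applying the same delta twice cancels it, which is the self-inverse property at work.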

Chapter 2: First Silicon — PicoRV32 + ATOMiK (v1)

The first hardware target was the Tang Nano 9K — a $13.50 FPGA board with a Gowin GW1NR-9K chip, 8,640 LUTs, and 26 block RAMs. We paired ATOMiK with a PicoRV32 RISC-V soft core.

The ATOMiK core lives on the memory bus as an MMIO peripheral. The CPU writes to specific addresses to trigger LOAD, ACCUM, READ, and SWAP operations. A toggle-handshake CDC bridge crosses between the CPU clock domain (25.2 MHz) and the ATOMiK domain (81 MHz).

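// ATOMiK v1: MMIO-mapped operations
#include <stdint.h>

#define ATOMIK_BASE     0x20000000u
#define ATOMIK_LOAD     (ATOMIK_BASE + 0x00)
#define ATOMIK_ACCUM    (ATOMIK_BASE + 0x04)
#define ATOMIK_READ     (ATOMIK_BASE + 0x08)
#define ATOMIK_SWAP     (ATOMIK_BASE + 0x0C)

// Each access below is one bus transaction; volatile keeps the compiler from eliding it.
// (The wrapper function name is illustrative.)
static inline uint32_t atomik_v1_example(void) {
    *(volatile uint32_t*)ATOMIK_LOAD = 0xDEADBEEF;      // set the reference state
    *(volatile uint32_t*)ATOMIK_ACCUM = 0x000000FF;     // XOR a delta into the accumulator
    uint32_t state = *(volatile uint32_t*)ATOMIK_READ;  // reference XOR accumulator
    return state;                                       // state == 0xDEADBE10
}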

Result: single-bank ATOMiK at 81 MHz, 94.5 million operations per second. The entire SoC fits in 44% of the GW1NR-9K. 11/11 hardware tests pass. The core has +23% Fmax margin — it could run faster, but we're limited by the PicoRV32's bus timing.

Chapter 3: Custom CPU — RV64I + ATOMiK ISA (v2/v3)

MMIO works, but it costs bus cycles. Every ATOMiK operation requires a store instruction, a bus transaction, and a load to read the result. What if ATOMiK operations were native CPU instructions?

We built a custom 64-bit RISC-V CPU from scratch. Not a fork — a ground-up implementation with a pipelined FSM (FETCH, DECODE, EXECUTE, WRITEBACK), SPI XIP flash boot, UART, and native ATOMiK custom instructions using the RISC-V custom-0 opcode space:

// ATOMiK v3 — Native ISA extensions (custom-0 opcode 0x0B)
// funct3 encoding:
//   000 = LOAD   (set reference state)
//   001 = ACCUM  (XOR delta into accumulator)
//   010 = READ   (reconstruct current state)
//   011 = SWAP   (atomic read-and-reset)

// In assembly (.insn r opcode, funct3, funct7, rd, rs1, rs2):
.insn r 0x0b, 0, 0, x0, a0, x0    # LOAD  a0
.insn r 0x0b, 1, 0, x0, a1, x0    # ACCUM a1
.insn r 0x0b, 2, 0, a2, x0, x0    # READ  -> a2
.insn r 0x0b, 3, 0, a3, x0, x0    # SWAP  -> a3
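For readers who want to check the encodings by hand, here is a small C sketch that packs the standard RISC-V R-type fields into a 32-bit instruction word. The helper name `encode_rtype` and the `ATOMIK_F3_*` enum are illustrative assumptions; only the opcode (0x0B) and the funct3 values come from the listing above.

```c
#include <assert.h>
#include <stdint.h>

#define OPC_CUSTOM0 0x0Bu   /* RISC-V custom-0 major opcode */

/* funct3 values from the listing above (enum names are hypothetical). */
enum atomik_funct3 {
    ATOMIK_F3_LOAD  = 0,
    ATOMIK_F3_ACCUM = 1,
    ATOMIK_F3_READ  = 2,
    ATOMIK_F3_SWAP  = 3
};

/* Pack an R-type word:
 * funct7[31:25] rs2[24:20] rs1[19:15] funct3[14:12] rd[11:7] opcode[6:0] */
uint32_t encode_rtype(uint32_t opcode, uint32_t funct3, uint32_t funct7,
                      uint32_t rd, uint32_t rs1, uint32_t rs2) {
    assert(opcode < 128 && funct3 < 8 && funct7 < 128);
    assert(rd < 32 && rs1 < 32 && rs2 < 32);
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | opcode;
}
```

In the standard ABI, a0 through a3 are registers x10 through x13. So `encode_rtype(OPC_CUSTOM0, ATOMIK_F3_LOAD, 0, 0, 10, 0)` yields 0x0005000B for `LOAD a0`, and `encode_rtype(OPC_CUSTOM0, ATOMIK_F3_READ, 0, 12, 0, 0)` yields 0x0000260B for `READ -> a2`.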

ATOMiK operations now execute in a single EXECUTE stage cycle — the same cost as an ADD or XOR instruction. No bus overhead. No MMIO latency. Zero extra cycles.
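The four operations can be sketched as a software reference model, plus optional inline-assembly wrappers for the native instructions. Everything named here is an assumption for illustration: the `atomik_bank` struct, the `bank_*` and `atomik_*` function names, and the `ATOMIK_NATIVE` macro are hypothetical. Two behaviors are also guesses rather than documented facts: that LOAD clears the accumulator, and that SWAP returns the current state and then clears the accumulator. The model is single-threaded and does not capture the atomicity of the hardware SWAP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of a single ATOMiK bank. */
typedef struct {
    uint64_t reference;    /* state set by LOAD */
    uint64_t accumulator;  /* XOR of the deltas applied so far */
} atomik_bank;

/* LOAD: set the reference state (clearing the accumulator is an assumption). */
void bank_load(atomik_bank *b, uint64_t state) {
    assert(b != NULL);
    b->reference = state;
    b->accumulator = 0;
}

/* ACCUM: XOR a delta into the accumulator. */
void bank_accum(atomik_bank *b, uint64_t delta) {
    b->accumulator ^= delta;
}

/* READ: reconstruct the current state as reference XOR accumulator. */
uint64_t bank_read(const atomik_bank *b) {
    return b->reference ^ b->accumulator;
}

/* SWAP: read-and-reset (assumed: return current state, clear accumulator). */
uint64_t bank_swap(atomik_bank *b) {
    uint64_t current = bank_read(b);
    b->accumulator = 0;
    return current;
}

#if defined(__riscv) && (__riscv_xlen == 64) && defined(ATOMIK_NATIVE)
/* Native wrappers: only meaningful on a core that implements the
 * custom-0 ATOMiK instructions; elsewhere they would trap. */
static inline void atomik_accum(uint64_t delta) {
    __asm__ volatile(".insn r 0x0b, 1, 0, x0, %0, x0" : : "r"(delta) : "memory");
}

static inline uint64_t atomik_read(void) {
    uint64_t state;
    __asm__ volatile(".insn r 0x0b, 2, 0, %0, x0, x0" : "=r"(state) : : "memory");
    return state;
}
#endif
```

Running the v1 example through the model gives the same answer: after `bank_load(&b, 0xDEADBEEF)` and `bank_accum(&b, 0xFF)`, `bank_read(&b)` returns 0xDEADBE10.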

Chapter 4: HD Video — 1280x720@60Hz (v3.1)

To demonstrate delta-driven display, we added an HDMI output pipeline. On a $13.50 FPGA. At 1280x720@60Hz.

This required 6 pixel pipeline optimizations to hit the 74.25 MHz pixel clock: 3-stage TMDS encoding, pre-registered RNG and cursor flags in svo_tcard, parallel prefix gray-to-binary conversion, split font pipeline, pre-registered BRAM ports, and a register buffer between encoder and TMDS serializer.
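Of the six optimizations, the parallel prefix gray-to-binary conversion is the easiest to show in software. The C sketch below contrasts the serial form, where each output bit depends on every higher bit, with the prefix form, which needs only log2(width) XOR stages. The function names are illustrative, the 32-bit width is an assumption, and the post does not say where in the pipeline the conversion is used.

```c
#include <assert.h>
#include <stdint.h>

/* Binary to Gray: each bit is XORed with its upper neighbor. */
uint32_t binary_to_gray(uint32_t b) {
    return b ^ (b >> 1);
}

/* Gray to binary, serial form: b = g ^ (g >> 1) ^ (g >> 2) ^ ...
 * In hardware this is a long XOR chain from the top bit down. */
uint32_t gray_to_binary_serial(uint32_t g) {
    uint32_t b = 0;
    for (; g != 0; g >>= 1) {
        b ^= g;
    }
    return b;
}

/* Gray to binary, parallel prefix form: five XOR stages for 32 bits. */
uint32_t gray_to_binary_prefix(uint32_t g) {
    g ^= g >> 1;
    g ^= g >> 2;
    g ^= g >> 4;
    g ^= g >> 8;
    g ^= g >> 16;
    return g;
}

/* Returns 1 if both decoders invert binary_to_gray for all values below limit. */
int gray_roundtrip_ok(uint32_t limit) {
    for (uint32_t i = 0; i < limit; i++) {
        uint32_t g = binary_to_gray(i);
        if (gray_to_binary_serial(g) != i || gray_to_binary_prefix(g) != i) {
            return 0;
        }
    }
    return 1;
}
```

The shorter logic depth of the prefix form is what helps timing: the critical path grows with the logarithm of the word width instead of linearly.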

The delta display module sits in the video pipeline between the overlay and the encoder. It maintains a per-scanline buffer and applies LUT-mapped delta colors in real time — the display literally shows state changes as they happen, driven by the ATOMiK accumulator.

Final v3.1.0 resource usage on the GW1NR-9K: 6,287 LUT (73%), 3,783 CLS (88%), 20/26 BSRAM (77%). Pixel Fmax: 74.384 MHz (+0.18% margin). We hit the practical optimization ceiling — CLS at 88% is the binding constraint.
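The timing numbers can be checked with a few lines of arithmetic. The sketch below assumes the standard 720p60 frame of 1650 x 750 total pixels including blanking (the CEA-861 totals; the post only states the active 1280x720 area), and the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Pixel clock in Hz from total (active + blanking) frame size and refresh rate. */
uint64_t pixel_clock_hz(uint32_t h_total, uint32_t v_total, uint32_t refresh_hz) {
    return (uint64_t)h_total * v_total * refresh_hz;
}

/* Timing margin of the achieved Fmax over the required clock,
 * in parts per million (truncated toward zero). */
int64_t fmax_margin_ppm(uint64_t fmax_hz, uint64_t required_hz) {
    assert(required_hz > 0);
    return ((int64_t)fmax_hz - (int64_t)required_hz) * 1000000
           / (int64_t)required_hz;
}
```

`pixel_clock_hz(1650, 750, 60)` gives 74,250,000 Hz, the 74.25 MHz target. `fmax_margin_ppm(74384000, 74250000)` gives 1804 ppm, which is the +0.18% margin quoted above: about 134 kHz of headroom.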

Chapter 5: Scaling Up — Zynq Characterization

The Tang Nano 9K proved the architecture. But 8,640 LUTs limit you to a single ATOMiK bank. What happens when you have 53,200 LUTs?

We characterized ATOMiK on the Xilinx Zynq XC7Z020, sweeping from 1 to 512 parallel banks across 4 synthesis strategies (baseline, area, aggressive, maximum):

Banks	Fmax (MHz)	LUT	LUT %	Gops/s
N=1	444.4	302	0.6%	0.4
N=4	347.8	543	1.0%	1.4
N=16	266.7	941	1.8%	4.4
N=64	205.1	3,498	6.6%	13.4
N=256	148.1	15,197	28.6%	38.1
N=512	135.6	23,542	44.3%	69.7

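Sub-linear LUT scaling: 512 banks cost only 44.3% of the fabric. Going by the table, each additional bank beyond the first adds about 45 LUTs on average ((23,542 - 302) / 511), falling to about 33 per bank between N=256 and N=512, because the shared infrastructure (BRAM, merge tree, CDC bridge) amortizes across all banks.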
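The utilization and per-bank figures can be recomputed directly from the table rows. This is a small C sketch with illustrative function names; the only inputs are the LUT counts in the table and the 53,200-LUT capacity of the XC7Z020 mentioned earlier.

```c
#include <assert.h>
#include <stdint.h>

#define XC7Z020_LUTS 53200u   /* LUT capacity quoted in the post */

/* LUT utilization in tenths of a percent, rounded to nearest. */
uint32_t lut_utilization_permille(uint32_t luts) {
    return (luts * 1000u + XC7Z020_LUTS / 2u) / XC7Z020_LUTS;
}

/* Total LUTs divided by bank count (integer division). */
uint32_t luts_per_bank(uint32_t luts, uint32_t banks) {
    assert(banks > 0);
    return luts / banks;
}

/* Marginal LUTs per added bank between two sweep points (integer division). */
uint32_t marginal_luts_per_bank(uint32_t luts_lo, uint32_t banks_lo,
                                uint32_t luts_hi, uint32_t banks_hi) {
    assert(banks_hi > banks_lo && luts_hi >= luts_lo);
    return (luts_hi - luts_lo) / (banks_hi - banks_lo);
}
```

For the N=512 row, `lut_utilization_permille(23542)` returns 443 (44.3%) and `luts_per_bank(23542, 512)` returns 45, down from 302 LUTs for a single bank. The marginal cost from N=1 to N=512 is `marginal_luts_per_bank(302, 1, 23542, 512)`, which is 45, and from N=256 to N=512 it is 32 after truncation.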

69.7 billion operations per second. On a $99 development board. With 56% of the fabric still available for your application logic.

What we learned

Start with the math. The 92 Lean4 proofs caught edge cases that testing never would have found. When your algebra is provably correct, debugging hardware becomes purely a plumbing exercise.
$13.50 is enough. You don't need a $10,000 FPGA board to validate a novel architecture. The Tang Nano 9K proved everything we needed — the Zynq just showed it scales.
Custom instructions matter. Going from MMIO (v1) to native ISA (v3) eliminated all bus overhead. ATOMiK ops execute at the same cost as ALU operations.
The ceiling is the fabric, not the design. At N=512, we're using 44% of the XC7Z020. N=1024 needs an XC7Z045 (218K LUT) or UltraScale+. The ATOMiK core itself has no inherent scaling limit.

What's next

The ALINX AX7020 Zynq board is on the bench. Next step: PS+PL block design, Linux driver integration, and live benchmarks with the kernel module talking to hardware-accelerated ATOMiK contexts. After that: ASIC evaluation on Sky130.

The math works. The software works. The hardware works. Now we scale.
