sutro

Sparse Parity Benchmark — an energy-efficiency testbed for neural network learning algorithms.

The k-sparse parity problem is the "Drosophila" of algorithmic benchmarking: simple enough to be tractable, rich enough to exhibit grokking (delayed generalization after memorization). The goal is to measure and minimize the energy cost — optimizer steps to 100% test accuracy — of different learning algorithms.

Features

Pure Python — no PyTorch, no NumPy; only math, random, time, and matplotlib
Manual forward/backward — all gradients computed by hand; fully transparent
Fused layer-wise updates — SGD update applied immediately after each layer's gradient, reducing memory reuse distance
Memory reuse distance tracking — measures how cache-friendly the algorithm is (clock counts floats, not ops)
Modular optimizer — swap in any learning algorithm at the marked injection point

Architecture

x ∈ {-1,1}^n  →  Linear(W1, b1)  →  ReLU  →  Linear(W2, b2)  →  scalar f(x)
Loss: Hinge loss = max(0, 1 - f(x)·y)
Optimizer: SGD with weight decay (fused into backward pass)

Usage

python sparse_parity_benchmark.py

No installation needed beyond Python 3 and matplotlib (auto-installed if missing).

Sample Output

[STEP    1] epoch=1  train_loss=103.8993  test_loss=64.8248  train_acc=60%  test_acc=75%
[STEP   20] epoch=1  train_loss=11.9400   test_loss=28.7414  train_acc=90%  test_acc=70%
[STEP   40] epoch=2  train_loss=0.0000    test_loss=0.0000   train_acc=100% test_acc=100%
  🎉 GENERALIZED at epoch 2! (40 steps, 2.2s)

⚡ ENERGY COST = 2 epochs (40 optimizer steps)

Memory Reuse Distance

The benchmark instruments the first optimizer step to measure memory reuse distance: the number of floats that flow through memory between when a buffer is written and when it is next read. The clock advances by buffer size, so a 3,000-float matrix contributes 3,000 to the distance while a scalar contributes 1 — matching real cache eviction behavior.

Small distance → still in cache → energy efficient. Large distance → cache miss → expensive.

Per-buffer summary (fused layer-wise updates)

Buffer            Size  Reads    Avg Dist       Min       Max  Cache?
────────────  ────────  ─────  ──────────  ────────  ────────  ──────
x                    3      2      10,011         4    20,018       ❌
W1               3,000      2      15,015     5,008    25,022       ❌
b1               1,000      2      17,015     5,008    29,022       ❌
h_pre            1,000      2       7,006     1,000    13,011       ✅
h                1,000      2       2,003     1,000     3,006       ✅
W2               1,000      3      12,013     9,008    15,016       ❌
b2                   1      2      12,512     9,008    16,017       ❌
out                  1      1           1         1         1       ✅
y                    1      1       9,007     9,007     9,007       ✅
dout                 1      2       1,502         1     3,003       ✅
dW2              1,000      1       3,002     3,002     3,002       ✅
db2                  1      1       5,002     5,002     5,002       ✅
dh               1,000      1       4,003     4,003     4,003       ✅
dh_pre           1,000      2       4,502     1,000     8,003       ✅

Average reuse distance (per-read):        8,174 floats
Average reuse distance (per-float-read):  10,219 floats
Total working set:                        10,008 floats

Fused updates vs standard backprop

The key optimization: update each layer's parameters immediately after computing its gradient, instead of computing all gradients first and updating at the end.

Buffer	Standard	Fused	Improvement
`dW2` (1,000 floats)	16,005 ❌	3,002 ✅	5.3× closer
`db2` (1 float)	18,005 ❌	5,002 ✅	3.6× closer
Per-read average	9,716	8,174	16% reduction

Gradient buffers dW2, db2 flipped from ❌ to ✅ — consumed while still in cache. For deeper networks, this compounds across more layers.

SGD vs GF(2) Gaussian Elimination (2026-03-09)

Joint work with Yad

compare_reuse_distance.py runs both algorithms on the same sparse parity problem and compares runtime and memory efficiency.

python3 compare_reuse_distance.py

Results

Metric	SGD MLP	GF(2) Gauss Elim	Ratio
Weighted avg reuse distance	2,584	75	34× smaller
Time to 100% accuracy (n=3, k=3)	133 ms	131 µs	1,014× faster
Total FLOPs to 100%	1,158,951	460	2,519× fewer
Total memory reads to 100%	755,514	613	1,232× fewer
Working set	2,725 elements	1,364 elements	2× smaller

Why GF(2) wins

Cache locality: GF(2) operates on a single augmented matrix. Each Gaussian elimination step touches 2 rows (~42 elements), keeping data hot in L1 cache. SGD's W1 matrix (2,000 elements) is read during the forward pass but not re-accessed until end of backward — a reuse distance ≈ entire working set.
Algorithmic fit: Sparse parity is a linear problem over GF(2). Gaussian elimination solves it exactly in O(n²) with n+1 samples, while SGD must discover the structure through iterative gradient descent over many epochs.

GPU Toy Problem (Modal)

gpu_toy.py runs a matrix-multiply benchmark on a cloud NVIDIA L4 GPU via Modal, with cost tracking.

Setup

uv init && uv add modal     # install modal
uv run python -m modal setup # authenticate (opens browser)

Run

uv run modal run gpu_toy.py

Sample Output

NVIDIA L4  |  23034 MiB  |  CUDA 13.0

PyTorch sees: NVIDIA L4
100 matmuls (4096×4096) in 1.127s  (12.20 TFLOPS)

=======================================================
  GPU TFLOPS:          12.20
  GPU compute time:    1.127s
  Container time:      1.9s
  Total wall time:     12.6s  (incl. startup)
  Estimated cost:      $0.0030  (L4 @ $0.84/hr)
=======================================================

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
results_gf2		results_gf2
.gitignore		.gitignore
README.md		README.md
compare_reuse_distance.py		compare_reuse_distance.py
exp_gf2_standalone.py		exp_gf2_standalone.py
get_started.py		get_started.py
gpu_toy.py		gpu_toy.py
loss_plot.png		loss_plot.png
pyproject.toml		pyproject.toml
sparse_parity_benchmark.py		sparse_parity_benchmark.py
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

sutro

Features

Architecture

Usage

Sample Output

Memory Reuse Distance

Per-buffer summary (fused layer-wise updates)

Fused updates vs standard backprop

SGD vs GF(2) Gaussian Elimination (2026-03-09)

Results

Why GF(2) wins

GPU Toy Problem (Modal)

Setup

Run

Sample Output

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

sutro

Features

Architecture

Usage

Sample Output

Memory Reuse Distance

Per-buffer summary (fused layer-wise updates)

Fused updates vs standard backprop

SGD vs GF(2) Gaussian Elimination (2026-03-09)

Results

Why GF(2) wins

GPU Toy Problem (Modal)

Setup

Run

Sample Output

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages