GBxCuLE

A GPU-accelerated vectorized Game Boy emulator for reinforcement learning. Emulation cores run fully on the GPU.

Adapting CuLE: GPU-Accelerated Atari Emulation for Reinforcement Learning (Dalton et al, NVLabs) for Game Boy instead of Atari.

24 million FPS. 400,000x realtime. >3x faster than SOTA. On a home GPU.

Pokemon Red complete. Built in 5 days with a cybernetic forge (LLM-assisted development with strong automated verification). Read more in the blog post.

Performance

On an NVIDIA DGX Spark [20-core ARM, GB10 GPU], initializing from a Tetris savestate:

Advances 24 frames per step (frame skip = 24)
Extracts the observation tensor
Samples a button-press action (random policy)
Applies the action

No cheating: the GPU executes divergent instruction streams across warp lanes via distinct joypad inputs.

Backend	Best SPS	Env Count
CUDA Warp	17,200	16,384
PyBoy + PufferLib	5,366	64

Untuned, on a dev box. Datacenter GPUs will be faster.

How It Works

The system compiles a Game Boy emulator into a single monolithic NVIDIA Warp kernel. Each GPU thread simulates one complete Game Boy. At 16,384 environments, that's 16,384 Game Boys running in parallel on a single GPU.

The architecture follows a functional core / imperative shell pattern:

src/gbxcule/
├── core/           # Pure logic: ABI layouts, ISA definitions, action codecs
├── kernels/        # Warp GPU/CPU kernels: CPU step, PPU render
│   └── cpu_templates/  # Instruction templates (ALU, loads, jumps, etc.)
├── backends/       # Environment implementations (PyBoy oracle, Warp CPU/CUDA)
└── rl/             # RL algorithms and Pokemon Red environments

The Warp Template System

This is the core innovation that makes the project tractable. Game Boy CPU emulation requires implementing ~500 opcodes. Rather than hand-writing CUDA, we use a template-based code generation pipeline:

1. ISA definitions (core/isa_sm83.py) declare each opcode with a template key and placeholder mappings:

# Opcode 0x3E: LD A,d8
OpcodeSpec(opcode=0x3E, template_key="ld_r8_d8", replacements={"REG_i": "a_i"})

# Opcode 0x06: LD B,d8  -- same template, different register
OpcodeSpec(opcode=0x06, template_key="ld_r8_d8", replacements={"REG_i": "b_i"})

2. Templates (kernels/cpu_templates/) are plain Python functions using Warp primitives. Placeholders like REG_i get substituted at build time:

def template_inc_r8(pc_i: int, f_i: int, REG_i: int) -> None:
    old = REG_i
    REG_i = (REG_i + 1) & 0xFF
    z = wp.where(REG_i == 0, 1, 0)
    hflag = wp.where((old & 0x0F) == 0x0F, 1, 0)
    cflag = (f_i >> 4) & 0x1
    f_i = make_flags(z, 0, hflag, cflag)
    pc_i = (pc_i + 1) & 0xFFFF
    cycles = 4

Note wp.where() instead of if/else for branchless GPU-friendly code.

3. LibCST code generation (kernels/cpu_step_builder.py) performs AST-level transformations:

Specializes templates by renaming placeholders (e.g. REG_i -> a_i)
Builds a two-level bucketed dispatch tree (bucket by high nibble, then linear scan)
Injects the dispatch tree into the kernel skeleton
SHA256-hashes the result and caches the generated module

The final output is a single @wp.kernel function containing all ~500 opcodes, which Warp JIT-compiles to CUDA PTX. The generated kernel is cached at ~/.cache/gbxcule/warp_kernels/.

Micro-ROM Test Harness

Correctness is verified against PyBoy as a trusted oracle. The system generates 23 license-safe micro-ROMs that exercise specific emulator features:

Category	ROMs	What they test
CPU/ALU	ALU_LOOP, ALU_FLAGS, ALU16_SP, CB_BITOPS	Arithmetic, flags, bit operations
Memory	MEM_RWB, LOADS_BASIC	Read/write, load/store variants
Control flow	FLOW_STACK	CALL/RET/JR/JP, PUSH/POP
Interrupts	TIMER_DIV_BASIC, TIMER_IRQ_HALT, EI_DELAY	Timer, HALT wake, EI delay
MBC	MBC1_SWITCH, MBC3_SWITCH, MBC1_RAM, MBC3_RAM	Bank switching, cartridge RAM
PPU	BG_STATIC, BG_SCROLL_ANIM, PPU_WINDOW, PPU_SPRITES	Tiles, scrolling, window, sprites
Divergence	JOY_DIVERGE_PERSIST	Joypad-driven per-env divergence

The verification harness (bench/harness.py) runs the oracle and device-under-test (DUT) in lockstep, comparing CPU state at every step. On mismatch, it writes a hermetic repro bundle containing the ROM, action trace, both states, a diff, and a one-command repro.sh script.

LLM agents used these micro-ROMs during development: Codex one-shotted test ROMs in raw hex, then ran them against both the reference emulator and the DUT to find its own bugs.

Parallel Environment Management

Each environment gets a slice of flat, strided GPU buffers:

mem[env_idx * 65536 .. (env_idx+1) * 65536] — full 64KB address space per Game Boy
Per-env CPU registers, PPU state, cartridge state as parallel arrays
Observations and rewards computed inside the kernel (zero CPU-GPU transfer)

Episode resets use a CuLE-style snapshot cache: a golden state is captured once, then a Warp kernel performs masked memcpy to restore only terminated environments. This avoids disk I/O entirely during training.

Torch integration is zero-copy via wp.to_torch(), so the RL training loop never leaves the GPU.

Quick Start

Requires uv and a Game Boy ROM (not included).

git clone https://github.com/bkase/gbxcule.git && cd gbxcule
make setup    # Install dependencies
make hooks    # Install pre-commit hooks (runs make check)
make roms     # Generate micro-ROMs

Verify Correctness (CPU)

make verify        # Step-exact comparison: PyBoy oracle vs Warp CPU
make verify-smoke  # Quick smoke test (frames_per_step=24)

Run Benchmarks

# CPU baselines
make bench

# Tetris benchmark (requires tetris.gb ROM and CUDA GPU)
make bench-tetris-gpu

# Full scaling sweep (CUDA)
make bench-gpu

GPU Verification

make verify-gpu    # PyBoy oracle vs Warp CUDA
make check-gpu     # Fast CUDA smoke test

Run All Checks

make check         # Format, lint, typecheck, build ROMs, compile kernels, test

This is what the pre-commit hook runs. The CPU-only gate completes in under 2 minutes.

RL Infrastructure

The emulator was built to enable GPU-native RL training on Pokemon Red. While I explored several approaches before setting the project aside, the infrastructure is functional and may be useful to others.

What's Here

Three RL algorithms, all wired up to the GPU emulator:

PPO (sync and async) — most developed, scales to 16k parallel environments
A2C — simpler streaming variant
DreamerV3 — full world-model implementation (~5k lines) with RSSM, imagination rollouts, and CUDA replay buffers

Pokemon Red environments with quest-aware observations:

pokered_packed_parcel_env.py — multi-modal observations (packed pixels + game senses + event flags), hash-based exploration bonuses, quest-specific rewards (parcel pickup/delivery), curiosity reset on key events
pokered_packed_goal_env.py — goal-template matching via L1 pixel distance
pokered_pixels_env.py — basic pixel observations with frame stacking

Neural network architectures:

NatureCNN (Atari-style, 3 conv layers)
IMPALA ResNet (deeper, residual blocks)
Dual-lobe model (CNN for pixels + MLP for game senses + MLP for event flags, fused)

A five-stage curriculum with pre-computed savestates:

Stage	Task	Max Steps
1	Exit Oak's lab	128
2	Return home (outside)	128
3	Enter house	96
4	Upstairs at video game	96
5	It's time to go	64

Training scripts in tools/:

# Sync PPO for Oak's Parcel quest (most recent, recommended starting point)
uv run python tools/rl_train_ppo_parcel.py

# Async PPO variant
uv run python tools/rl_train_async_parcel.py

# DreamerV3
uv run python tools/rl_train_gpu.py

What Didn't Work (and Why)

We tried PPO, A2C, and DreamerV3 on the Oak's Parcel delivery quest. DreamerV3 could learn a world model but was sample-inefficient for this shaped problem. PPO with exploration bonuses showed promise on short navigation stages but hit walls on the full quest. Pokemon Red RL remains hard due to long credit assignment and sparse rewards.

The emulator and training infrastructure work. The RL problem is unsolved. If you're excited about Pokemon RL, this is a solid foundation — start with Stage 1 (Exit Oak's lab) and the sync PPO setup.

Project Structure

gbxcule/
├── src/gbxcule/
│   ├── core/               # Pure logic (ABI, ISA, action codec, signatures)
│   ├── kernels/            # Warp kernels + template system
│   │   └── cpu_templates/  # Instruction templates (alu, loads, jumps, stack, bitops)
│   ├── backends/           # PyBoy oracle, Warp CPU/CUDA, PufferLib integrations
│   └── rl/                 # PPO, A2C, DreamerV3, Pokemon Red envs
│       └── dreamer_v3/     # Full DreamerV3 implementation
├── bench/
│   ├── harness.py          # Benchmark + verification harness
│   ├── tetris_bench.py     # Tetris-specific benchmark
│   ├── roms/               # Micro-ROM generator + suite.yaml
│   └── analysis/           # Scaling plots, summary reports
├── tests/                  # 90 test files (CPU instructions, PPU, RL algorithms)
├── tools/                  # Training scripts, profiling, capture utilities
├── states/                 # Savestates for RL stages
├── configs/                # YAML/JSON configurations
├── history/                # Planning documents (PRD, architecture, milestone plans)
├── ARCHITECTURE.md         # Engineering architecture (detailed)
└── CONSTITUTION.md         # Project principles

Documentation

ARCHITECTURE.md — full engineering architecture and tech stack decisions
history/prd.md — product requirements document
history/ — milestone plans, epic breakdowns, development history

Acknowledgments

CuLE (Dalton et al, NVLabs) — the original GPU-accelerated Atari emulation paper that inspired this project
PyBoy — trusted reference emulator used as the oracle
NVIDIA Warp — Python framework for JIT-compiling to CUDA kernels
Peter Whidden's Pokemon RL and drubinstein's Pokemon RL Experiment — inspiration for the Pokemon Red RL goal
PufferLib — vectorized environment framework used for CPU baselines

Name		Name	Last commit message	Last commit date
Latest commit History 386 Commits
.beads		.beads
.githooks		.githooks
bench		bench
configs		configs
docs		docs
history		history
src/gbxcule		src/gbxcule
states		states
tests		tests
third_party		third_party
tools		tools
typings		typings
.gitignore		.gitignore
.gitmodules		.gitmodules
AGENTS.md		AGENTS.md
ARCHITECTURE.md		ARCHITECTURE.md
CLAUDE.md		CLAUDE.md
CONSTITUTION.md		CONSTITUTION.md
Makefile		Makefile
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GBxCuLE

Performance

How It Works

The Warp Template System

Micro-ROM Test Harness

Parallel Environment Management

Quick Start

Verify Correctness (CPU)

Run Benchmarks

GPU Verification

Run All Checks

RL Infrastructure

What's Here

What Didn't Work (and Why)

Project Structure

Documentation

Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

GBxCuLE

Performance

How It Works

The Warp Template System

Micro-ROM Test Harness

Parallel Environment Management

Quick Start

Verify Correctness (CPU)

Run Benchmarks

GPU Verification

Run All Checks

RL Infrastructure

What's Here

What Didn't Work (and Why)

Project Structure

Documentation

Acknowledgments

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages