A GPU-accelerated vectorized Game Boy emulator for reinforcement learning. Emulation cores run fully on the GPU.
Adapting CuLE: GPU-Accelerated Atari Emulation for Reinforcement Learning (Dalton et al, NVLabs) for Game Boy instead of Atari.
24 million FPS. 400,000x realtime. >3x faster than SOTA. On a home GPU.
Pokemon Red complete. Built in 5 days with a cybernetic forge (LLM-assisted development with strong automated verification). Read more in the blog post.
On an NVIDIA DGX Spark [20-core ARM, GB10 GPU], initializing from a Tetris savestate:
- Advances 24 frames per step (frame skip = 24)
- Extracts the observation tensor
- Samples a button-press action (random policy)
- Applies the action
No cheating: the GPU executes divergent instruction streams across warp lanes via distinct joypad inputs.
| Backend | Best SPS | Env Count |
|---|---|---|
| CUDA Warp | 17,200 | 16,384 |
| PyBoy + PufferLib | 5,366 | 64 |
Untuned, on a dev box. Datacenter GPUs will be faster.
The system compiles a Game Boy emulator into a single monolithic NVIDIA Warp kernel. Each GPU thread simulates one complete Game Boy. At 16,384 environments, that's 16,384 Game Boys running in parallel on a single GPU.
The architecture follows a functional core / imperative shell pattern:
src/gbxcule/
├── core/ # Pure logic: ABI layouts, ISA definitions, action codecs
├── kernels/ # Warp GPU/CPU kernels: CPU step, PPU render
│ └── cpu_templates/ # Instruction templates (ALU, loads, jumps, etc.)
├── backends/ # Environment implementations (PyBoy oracle, Warp CPU/CUDA)
└── rl/ # RL algorithms and Pokemon Red environments
This is the core innovation that makes the project tractable. Game Boy CPU emulation requires implementing ~500 opcodes. Rather than hand-writing CUDA, we use a template-based code generation pipeline:
1. ISA definitions (core/isa_sm83.py) declare each opcode with a template key and placeholder mappings:
# Opcode 0x3E: LD A,d8
OpcodeSpec(opcode=0x3E, template_key="ld_r8_d8", replacements={"REG_i": "a_i"})
# Opcode 0x06: LD B,d8 -- same template, different register
OpcodeSpec(opcode=0x06, template_key="ld_r8_d8", replacements={"REG_i": "b_i"})2. Templates (kernels/cpu_templates/) are plain Python functions using Warp primitives. Placeholders like REG_i get substituted at build time:
def template_inc_r8(pc_i: int, f_i: int, REG_i: int) -> None:
old = REG_i
REG_i = (REG_i + 1) & 0xFF
z = wp.where(REG_i == 0, 1, 0)
hflag = wp.where((old & 0x0F) == 0x0F, 1, 0)
cflag = (f_i >> 4) & 0x1
f_i = make_flags(z, 0, hflag, cflag)
pc_i = (pc_i + 1) & 0xFFFF
cycles = 4Note wp.where() instead of if/else for branchless GPU-friendly code.
3. LibCST code generation (kernels/cpu_step_builder.py) performs AST-level transformations:
- Specializes templates by renaming placeholders (e.g.
REG_i->a_i) - Builds a two-level bucketed dispatch tree (bucket by high nibble, then linear scan)
- Injects the dispatch tree into the kernel skeleton
- SHA256-hashes the result and caches the generated module
The final output is a single @wp.kernel function containing all ~500 opcodes, which Warp JIT-compiles to CUDA PTX. The generated kernel is cached at ~/.cache/gbxcule/warp_kernels/.
Correctness is verified against PyBoy as a trusted oracle. The system generates 23 license-safe micro-ROMs that exercise specific emulator features:
| Category | ROMs | What they test |
|---|---|---|
| CPU/ALU | ALU_LOOP, ALU_FLAGS, ALU16_SP, CB_BITOPS | Arithmetic, flags, bit operations |
| Memory | MEM_RWB, LOADS_BASIC | Read/write, load/store variants |
| Control flow | FLOW_STACK | CALL/RET/JR/JP, PUSH/POP |
| Interrupts | TIMER_DIV_BASIC, TIMER_IRQ_HALT, EI_DELAY | Timer, HALT wake, EI delay |
| MBC | MBC1_SWITCH, MBC3_SWITCH, MBC1_RAM, MBC3_RAM | Bank switching, cartridge RAM |
| PPU | BG_STATIC, BG_SCROLL_ANIM, PPU_WINDOW, PPU_SPRITES | Tiles, scrolling, window, sprites |
| Divergence | JOY_DIVERGE_PERSIST | Joypad-driven per-env divergence |
The verification harness (bench/harness.py) runs the oracle and device-under-test (DUT) in lockstep, comparing CPU state at every step. On mismatch, it writes a hermetic repro bundle containing the ROM, action trace, both states, a diff, and a one-command repro.sh script.
LLM agents used these micro-ROMs during development: Codex one-shotted test ROMs in raw hex, then ran them against both the reference emulator and the DUT to find its own bugs.
Each environment gets a slice of flat, strided GPU buffers:
mem[env_idx * 65536 .. (env_idx+1) * 65536]— full 64KB address space per Game Boy- Per-env CPU registers, PPU state, cartridge state as parallel arrays
- Observations and rewards computed inside the kernel (zero CPU-GPU transfer)
Episode resets use a CuLE-style snapshot cache: a golden state is captured once, then a Warp kernel performs masked memcpy to restore only terminated environments. This avoids disk I/O entirely during training.
Torch integration is zero-copy via wp.to_torch(), so the RL training loop never leaves the GPU.
Requires uv and a Game Boy ROM (not included).
git clone https://github.com/bkase/gbxcule.git && cd gbxcule
make setup # Install dependencies
make hooks # Install pre-commit hooks (runs make check)
make roms # Generate micro-ROMsmake verify # Step-exact comparison: PyBoy oracle vs Warp CPU
make verify-smoke # Quick smoke test (frames_per_step=24)# CPU baselines
make bench
# Tetris benchmark (requires tetris.gb ROM and CUDA GPU)
make bench-tetris-gpu
# Full scaling sweep (CUDA)
make bench-gpumake verify-gpu # PyBoy oracle vs Warp CUDA
make check-gpu # Fast CUDA smoke testmake check # Format, lint, typecheck, build ROMs, compile kernels, testThis is what the pre-commit hook runs. The CPU-only gate completes in under 2 minutes.
The emulator was built to enable GPU-native RL training on Pokemon Red. While I explored several approaches before setting the project aside, the infrastructure is functional and may be useful to others.
Three RL algorithms, all wired up to the GPU emulator:
- PPO (sync and async) — most developed, scales to 16k parallel environments
- A2C — simpler streaming variant
- DreamerV3 — full world-model implementation (~5k lines) with RSSM, imagination rollouts, and CUDA replay buffers
Pokemon Red environments with quest-aware observations:
pokered_packed_parcel_env.py— multi-modal observations (packed pixels + game senses + event flags), hash-based exploration bonuses, quest-specific rewards (parcel pickup/delivery), curiosity reset on key eventspokered_packed_goal_env.py— goal-template matching via L1 pixel distancepokered_pixels_env.py— basic pixel observations with frame stacking
Neural network architectures:
- NatureCNN (Atari-style, 3 conv layers)
- IMPALA ResNet (deeper, residual blocks)
- Dual-lobe model (CNN for pixels + MLP for game senses + MLP for event flags, fused)
A five-stage curriculum with pre-computed savestates:
| Stage | Task | Max Steps |
|---|---|---|
| 1 | Exit Oak's lab | 128 |
| 2 | Return home (outside) | 128 |
| 3 | Enter house | 96 |
| 4 | Upstairs at video game | 96 |
| 5 | It's time to go | 64 |
Training scripts in tools/:
# Sync PPO for Oak's Parcel quest (most recent, recommended starting point)
uv run python tools/rl_train_ppo_parcel.py
# Async PPO variant
uv run python tools/rl_train_async_parcel.py
# DreamerV3
uv run python tools/rl_train_gpu.pyWe tried PPO, A2C, and DreamerV3 on the Oak's Parcel delivery quest. DreamerV3 could learn a world model but was sample-inefficient for this shaped problem. PPO with exploration bonuses showed promise on short navigation stages but hit walls on the full quest. Pokemon Red RL remains hard due to long credit assignment and sparse rewards.
The emulator and training infrastructure work. The RL problem is unsolved. If you're excited about Pokemon RL, this is a solid foundation — start with Stage 1 (Exit Oak's lab) and the sync PPO setup.
gbxcule/
├── src/gbxcule/
│ ├── core/ # Pure logic (ABI, ISA, action codec, signatures)
│ ├── kernels/ # Warp kernels + template system
│ │ └── cpu_templates/ # Instruction templates (alu, loads, jumps, stack, bitops)
│ ├── backends/ # PyBoy oracle, Warp CPU/CUDA, PufferLib integrations
│ └── rl/ # PPO, A2C, DreamerV3, Pokemon Red envs
│ └── dreamer_v3/ # Full DreamerV3 implementation
├── bench/
│ ├── harness.py # Benchmark + verification harness
│ ├── tetris_bench.py # Tetris-specific benchmark
│ ├── roms/ # Micro-ROM generator + suite.yaml
│ └── analysis/ # Scaling plots, summary reports
├── tests/ # 90 test files (CPU instructions, PPU, RL algorithms)
├── tools/ # Training scripts, profiling, capture utilities
├── states/ # Savestates for RL stages
├── configs/ # YAML/JSON configurations
├── history/ # Planning documents (PRD, architecture, milestone plans)
├── ARCHITECTURE.md # Engineering architecture (detailed)
└── CONSTITUTION.md # Project principles
- ARCHITECTURE.md — full engineering architecture and tech stack decisions
- history/prd.md — product requirements document
- history/ — milestone plans, epic breakdowns, development history
- CuLE (Dalton et al, NVLabs) — the original GPU-accelerated Atari emulation paper that inspired this project
- PyBoy — trusted reference emulator used as the oracle
- NVIDIA Warp — Python framework for JIT-compiling to CUDA kernels
- Peter Whidden's Pokemon RL and drubinstein's Pokemon RL Experiment — inspiration for the Pokemon Red RL goal
- PufferLib — vectorized environment framework used for CPU baselines