ALB-077: CUDA Graph capture for training loop — eliminate 59% sync overhead #59

@noahgift

Problem

nsys profiling at seq=1024 shows cuStreamSynchronize consumes 59.2% of CUDA API time (2.86s of 4.4s total). 228 syncs × 12.5ms avg per training step.

GPU utilization is ~75% — the remaining 25% is pipeline bubbles from sync points.

Root Cause

Per-block interleaved training requires stream.synchronize() at multiple points:

  1. Post-initialization (after LM head upload)
  2. Pre-backward H2D uploads (ALB-065 Rule 6)
  3. Before workspace D2H during accumulation
  4. End of backward pass

Solution

Use CUDA Graph capture/replay. The infrastructure already exists in trueno-gpu 0.4.26 (PAR-037):

  • CudaStream::begin_capture() / end_capture()
  • CudaGraph::instantiate() → CudaGraphExec
  • CudaStream::launch_graph(&exec): ~3-10μs per graph launch vs ~20-50μs per individual kernel launch

Implementation Plan

  1. Capture first training step as graph: stream.begin_capture() → full step → stream.end_capture()
  2. Cache CudaGraphExec keyed by (max_seq_len, batch_size)
  3. Replay graph for subsequent steps: stream.launch_graph(&exec)
  4. Handle NaN-skip: always run fwd+bwd, check loss AFTER replay
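
The capture-once, replay-after flow in steps 1 to 3 can be sketched roughly as follows. This is a minimal host-only sketch, not the trueno-gpu API: `MockGraphExec`, `GraphCache`, and `step` are hypothetical names, the mock only counts launches, and the comments mark where the real `begin_capture` / `end_capture` / `launch_graph` calls named above would presumably go.

```rust
use std::collections::HashMap;

/// Cache key from step 2 of the plan: (max_seq_len, batch_size).
type ShapeKey = (usize, usize);

/// Hypothetical stand-in for the issue's CudaGraphExec.
/// It only counts launches; no GPU work happens in this sketch.
#[derive(Debug, Default)]
struct MockGraphExec {
    launches: u64,
}

/// Hypothetical cache of instantiated graphs, one per training shape.
#[derive(Debug, Default)]
struct GraphCache {
    graphs: HashMap<ShapeKey, MockGraphExec>,
    captures: u64,
}

impl GraphCache {
    /// Runs one training step and returns true if this call had to capture.
    fn step(&mut self, max_seq_len: usize, batch_size: usize) -> bool {
        let key = (max_seq_len, batch_size);
        let needs_capture = !self.graphs.contains_key(&key);
        if needs_capture {
            // Real code would call stream.begin_capture(), enqueue the full
            // step, call stream.end_capture(), then instantiate the graph.
            self.captures += 1;
            self.graphs.insert(key, MockGraphExec::default());
        }
        // Real code would call stream.launch_graph(&exec) here. In CUDA, work
        // issued during stream capture is recorded rather than executed, so a
        // freshly captured graph is launched as well.
        let exec = self.graphs.get_mut(&key).expect("graph inserted above");
        exec.launches += 1;
        needs_capture
    }
}
```

Calling `step(1024, 4)` repeatedly on a `GraphCache::default()` captures only on the first call and replays afterwards; the batch size of 4 is an illustrative value, not one taken from the issue. A new `(max_seq_len, batch_size)` pair triggers a fresh capture, which is why the fixed-shape constraint matters.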

Constraints

  • Fixed topology required (no branching during capture)
  • Same batch_size/seq_len for all replays (already true in production)
  • NaN check moves to post-replay (backward always runs, wasted compute on NaN steps)
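
The post-replay NaN check could look roughly like this on the host side. It is a sketch under assumptions: `StepOutcome`, `check_after_replay`, and `wasted_fraction` are hypothetical names, the loss is assumed to be available as an `f32` on the host once the replay has finished, and only NaN (not infinity) is treated as a skip.

```rust
/// Host-side decision made after the graph replay has finished.
#[derive(Debug, PartialEq)]
enum StepOutcome {
    /// Loss is usable: keep the results of this step.
    Keep,
    /// Loss is NaN: discard the step. Forward and backward already ran
    /// inside the graph, so their compute is wasted.
    Skip,
}

/// Inspects the loss only after the replay, because the captured graph
/// cannot branch on it.
fn check_after_replay(loss: f32) -> StepOutcome {
    if loss.is_nan() {
        StepOutcome::Skip
    } else {
        StepOutcome::Keep
    }
}

/// Fraction of replayed steps whose forward+backward work was thrown away.
fn wasted_fraction(losses: &[f32]) -> f64 {
    if losses.is_empty() {
        return 0.0;
    }
    let skipped = losses
        .iter()
        .filter(|l| check_after_replay(**l) == StepOutcome::Skip)
        .count();
    skipped as f64 / losses.len() as f64
}
```

The cost of this design is bounded by how often NaN steps occur: `wasted_fraction` makes that visible, and if NaN steps are rare the extra backward pass is small next to the sync time saved.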

Expected Impact

| Metric | Current (Phase 5b) | Expected (Phase 6) |
|---|---|---|
| Step time | 444 ms | ~200 ms |
| Tok/s | 9,216 | ~20K |
| MFU | 26.7% | ~58% |
| v3 wall time (250K steps) | 1.3 days | ~14 hours |
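
The wall-time and throughput rows follow from the step time alone. A small sketch of that arithmetic, assuming tokens per step stay fixed between phases (the helper names are hypothetical):

```rust
/// Wall-clock hours for `steps` training steps at `step_ms` milliseconds each.
fn wall_hours(steps: u64, step_ms: f64) -> f64 {
    steps as f64 * step_ms / 1000.0 / 3600.0
}

/// Throughput after a step-time change, assuming tokens per step are unchanged.
fn scaled_tok_per_s(tok_per_s: f64, old_step_ms: f64, new_step_ms: f64) -> f64 {
    tok_per_s * old_step_ms / new_step_ms
}
```

With the table's inputs, `wall_hours(250_000, 444.0)` is about 30.8 hours (roughly 1.3 days), `wall_hours(250_000, 200.0)` is about 13.9 hours, and `scaled_tok_per_s(9216.0, 444.0, 200.0)` is about 20.5K tok/s.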

Files to Change

  • entrenar/src/train/transformer_trainer/cuda_trainer.rs — graph capture wrapper
  • No trueno-gpu changes needed (infrastructure complete)

Contract

contracts/cuda-graphs-v1.yaml (to be written before implementation)

References

  • nsys profile data: albor@479a0d3 §6.10
  • trueno-gpu graph API: src/driver/graph.rs
  • trueno-gpu graph tests: src/driver/cuda_tests/cuda_graph_tests.rs
