## Problem

nsys profiling at seq=1024 shows `cuStreamSynchronize` consuming 59.2% of CUDA API time (2.86 s of 4.4 s total): 228 syncs × 12.5 ms avg per training step.
GPU utilization is ~75%; the remaining ~25% is pipeline bubbles from these sync points.
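As a quick arithmetic check on the profile numbers above, the per-step sync count and average duration reproduce the measured total (228 × 12.5 ms = 2.85 s, matching the 2.86 s figure up to rounding):

```rust
fn main() {
    // Numbers from the nsys profile quoted above.
    let syncs_per_step = 228.0_f64;
    let avg_sync_ms = 12.5_f64;

    let total_sync_ms = syncs_per_step * avg_sync_ms;
    assert!((total_sync_ms - 2850.0).abs() < 1e-9); // 2.85 s ≈ the reported 2.86 s
    println!("total sync time: {} ms", total_sync_ms);
}
```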
## Root Cause

Per-block interleaved training requires `stream.synchronize()` at multiple points:
- Post-initialization (after LM head upload)
- Pre-backward H2D uploads (ALB-065 Rule 6)
- Before workspace D2H during accumulation
- End of backward pass
## Solution

CUDA Graph capture/replay. The infrastructure already exists in trueno-gpu 0.4.26 (PAR-037):
- `CudaStream::begin_capture()` / `end_capture()`
- `CudaGraph::instantiate()` → `CudaGraphExec`
- `CudaStream::launch_graph(&exec)`: ~3-10 μs per graph launch vs ~20-50 μs per individual kernel launch
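The intended call order can be sketched against stub types that mirror the method names listed above. These stubs are mocks for illustration only, not the real trueno-gpu 0.4.26 API; actual signatures and error handling may differ:

```rust
// Mock types mirroring the documented trueno-gpu graph API surface.
// In the real crate these wrap CUDA driver calls (cuStreamBeginCapture, etc.).
struct CudaStream { capturing: bool }
struct CudaGraph;
struct CudaGraphExec { launches: u32 }

impl CudaStream {
    fn begin_capture(&mut self) { self.capturing = true; }
    fn end_capture(&mut self) -> CudaGraph { self.capturing = false; CudaGraph }
    fn launch_graph(&mut self, exec: &mut CudaGraphExec) { exec.launches += 1; }
}

impl CudaGraph {
    // Instantiation happens once; replay is then cheap.
    fn instantiate(self) -> CudaGraphExec { CudaGraphExec { launches: 0 } }
}

fn main() {
    let mut stream = CudaStream { capturing: false };

    // Capture one full training step into a graph.
    stream.begin_capture();
    // ... enqueue forward + backward kernels here ...
    let graph = stream.end_capture();

    // Instantiate once, then replay for every subsequent step.
    let mut exec = graph.instantiate();
    for _ in 0..3 {
        stream.launch_graph(&mut exec);
    }
    assert_eq!(exec.launches, 3);
    println!("replayed {} times", exec.launches);
}
```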
## Implementation Plan

- Capture the first training step as a graph: `stream.begin_capture()` → full step → `stream.end_capture()`
- Cache the `CudaGraphExec` keyed by `(max_seq_len, batch_size)`
- Replay the graph for subsequent steps: `stream.launch_graph(&exec)`
- Handle NaN-skip: always run fwd+bwd, check the loss after replay
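The cache step above can be a plain `HashMap` keyed by the shape tuple. A minimal sketch, using a placeholder `GraphExec` where the real trainer would hold the instantiated `CudaGraphExec` (the capture itself is stubbed out here):

```rust
use std::collections::HashMap;

// Placeholder; the real cache would store the instantiated CudaGraphExec.
struct GraphExec { id: u32 }

// Cache keyed by (max_seq_len, batch_size): capture on first miss, replay on hits.
struct GraphCache {
    execs: HashMap<(usize, usize), GraphExec>,
    next_id: u32,
}

impl GraphCache {
    fn new() -> Self { Self { execs: HashMap::new(), next_id: 0 } }

    // Returns the cached exec for this shape, "capturing" (stubbed) on first use.
    fn get_or_capture(&mut self, max_seq_len: usize, batch_size: usize) -> &GraphExec {
        let next_id = &mut self.next_id;
        self.execs.entry((max_seq_len, batch_size)).or_insert_with(|| {
            // Real trainer: begin_capture -> full step -> end_capture -> instantiate.
            let exec = GraphExec { id: *next_id };
            *next_id += 1;
            exec
        })
    }
}

fn main() {
    let mut cache = GraphCache::new();
    let a = cache.get_or_capture(1024, 4).id;
    let b = cache.get_or_capture(1024, 4).id; // hit: same graph, replayed
    let c = cache.get_or_capture(2048, 4).id; // miss: new shape, new capture
    assert_eq!((a, b, c), (0, 0, 1));
    println!("cache ok");
}
```

Since batch_size and seq_len are fixed in production (per the constraints below), this cache will hold a single entry in practice; the key exists to make a shape change fail safe rather than replay a stale graph.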
## Constraints

- Fixed topology required (no branching during capture)
- Same batch_size/seq_len for all replays (already true in production)
- NaN check moves to post-replay (backward always runs, so compute is wasted on NaN steps)
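The post-replay NaN check can be a pure predicate on the loss read back after the graph replay; `should_apply_update` is a hypothetical helper name, not an existing function in the trainer:

```rust
// Decide after replay whether to apply the optimizer update.
// With graph replay the backward always runs; a non-finite loss just means
// this step's gradients are discarded (wasted compute, but correct state).
fn should_apply_update(loss: f32) -> bool {
    loss.is_finite() // rejects NaN and +/-inf
}

fn main() {
    assert!(should_apply_update(2.31));
    assert!(!should_apply_update(f32::NAN));
    assert!(!should_apply_update(f32::INFINITY));
    println!("NaN-skip check ok");
}
```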
## Expected Impact
| Metric | Current (Phase 5b) | Expected (Phase 6) |
|---|---|---|
| Step time | 444 ms | ~200 ms |
| Tok/s | 9,216 | ~20K |
| MFU | 26.7% | ~58% |
| v3 wall time (250K steps) | 1.3 days | ~14 hours |
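As a sanity check on the table, the current row implies roughly 9,216 tok/s × 0.444 s ≈ 4,092 tokens per step (consistent with, say, batch 4 × seq 1024 = 4,096; the exact batch shape is an assumption here). At a ~200 ms replayed step, that same token count gives the ~20K tok/s target:

```rust
fn main() {
    // Derived from the "Current" row: tok/s * step time = tokens per step.
    let tokens_per_step = 9_216.0_f64 * 0.444; // ~4092 tokens
    // Same tokens per step at the expected ~200 ms step time.
    let expected_tok_s = tokens_per_step / 0.200;
    assert!((expected_tok_s - 20_459.52).abs() < 0.1); // ~20K tok/s
    println!("expected throughput: ~{:.0} tok/s", expected_tok_s);
}
```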
## Files to Change

- `entrenar/src/train/transformer_trainer/cuda_trainer.rs` (graph capture wrapper)
- No trueno-gpu changes needed (infrastructure complete)
## Contract

`contracts/cuda-graphs-v1.yaml` (to be written before implementation)
## References

- nsys profile data: albor@479a0d3 §6.10
- trueno-gpu graph API: `src/driver/graph.rs`
- trueno-gpu graph tests: `src/driver/cuda_tests/cuda_graph_tests.rs`