## Problem

nsys profiling at seq=1024 shows `cuStreamSynchronize` consuming 59.2% of CUDA API time (2.86 s of 4.4 s total): 228 syncs × 12.5 ms avg per training step.
GPU utilization is ~75%; the remaining ~25% is pipeline bubbles from these sync points.
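As a quick arithmetic check on the profile numbers above, the per-step sync count and average duration reproduce the measured total (228 × 12.5 ms = 2.85 s, matching the 2.86 s figure up to rounding):

```rust
fn main() {
    // Numbers from the nsys profile quoted above.
    let syncs_per_step = 228.0_f64;
    let avg_sync_ms = 12.5_f64;

    let total_sync_ms = syncs_per_step * avg_sync_ms;
    assert!((total_sync_ms - 2850.0).abs() < 1e-9); // 2.85 s ≈ the reported 2.86 s
    println!("total sync time: {} ms", total_sync_ms);
}
```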
## Root Cause

Per-block interleaved training requires `stream.synchronize()` at multiple points:
- Post-initialization (after LM head upload)
- Pre-backward H2D uploads (ALB-065 Rule 6)
- Before workspace D2H during accumulation
- End of backward pass
## Solution

CUDA Graph capture/replay. The infrastructure already exists in trueno-gpu 0.4.26 (PAR-037):
- `CudaStream::begin_capture()` / `end_capture()`
- `CudaGraph::instantiate()` → `CudaGraphExec`
- `CudaStream::launch_graph(&exec)`: ~3-10 μs per graph launch vs ~20-50 μs per individual kernel launch
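The intended call order can be sketched against stub types that mirror the method names listed above. These stubs are mocks for illustration only, not the real trueno-gpu 0.4.26 API; actual signatures and error handling may differ:

```rust
// Mock types mirroring the documented trueno-gpu graph API surface.
// In the real crate these wrap CUDA driver calls (cuStreamBeginCapture, etc.).
struct CudaStream { capturing: bool }
struct CudaGraph;
struct CudaGraphExec { launches: u32 }

impl CudaStream {
    fn begin_capture(&mut self) { self.capturing = true; }
    fn end_capture(&mut self) -> CudaGraph { self.capturing = false; CudaGraph }
    fn launch_graph(&mut self, exec: &mut CudaGraphExec) { exec.launches += 1; }
}

impl CudaGraph {
    // Instantiation happens once; replay is then cheap.
    fn instantiate(self) -> CudaGraphExec { CudaGraphExec { launches: 0 } }
}

fn main() {
    let mut stream = CudaStream { capturing: false };

    // Capture one full training step into a graph.
    stream.begin_capture();
    // ... enqueue forward + backward kernels here ...
    let graph = stream.end_capture();

    // Instantiate once, then replay for every subsequent step.
    let mut exec = graph.instantiate();
    for _ in 0..3 {
        stream.launch_graph(&mut exec);
    }
    assert_eq!(exec.launches, 3);
    println!("replayed {} times", exec.launches);
}
```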
## Implementation Plan

- Capture the first training step as a graph: `stream.begin_capture()` → full step → `stream.end_capture()`
- Cache the `CudaGraphExec` keyed by `(max_seq_len, batch_size)`
- Replay the graph for subsequent steps: `stream.launch_graph(&exec)`
- Handle NaN-skip: always run fwd+bwd, check the loss after replay
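The cache step above can be a plain `HashMap` keyed by the shape tuple. A minimal sketch, using a placeholder `GraphExec` where the real trainer would hold the instantiated `CudaGraphExec` (the capture itself is stubbed out here):

```rust
use std::collections::HashMap;

// Placeholder; the real cache would store the instantiated CudaGraphExec.
struct GraphExec { id: u32 }

// Cache keyed by (max_seq_len, batch_size): capture on first miss, replay on hits.
struct GraphCache {
    execs: HashMap<(usize, usize), GraphExec>,
    next_id: u32,
}

impl GraphCache {
    fn new() -> Self { Self { execs: HashMap::new(), next_id: 0 } }

    // Returns the cached exec for this shape, "capturing" (stubbed) on first use.
    fn get_or_capture(&mut self, max_seq_len: usize, batch_size: usize) -> &GraphExec {
        let next_id = &mut self.next_id;
        self.execs.entry((max_seq_len, batch_size)).or_insert_with(|| {
            // Real trainer: begin_capture -> full step -> end_capture -> instantiate.
            let exec = GraphExec { id: *next_id };
            *next_id += 1;
            exec
        })
    }
}

fn main() {
    let mut cache = GraphCache::new();
    let a = cache.get_or_capture(1024, 4).id;
    let b = cache.get_or_capture(1024, 4).id; // hit: same graph, replayed
    let c = cache.get_or_capture(2048, 4).id; // miss: new shape, new capture
    assert_eq!((a, b, c), (0, 0, 1));
    println!("cache ok");
}
```

Since batch_size and seq_len are fixed in production (per the constraints below), this cache will hold a single entry in practice; the key exists to make a shape change fail safe rather than replay a stale graph.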
## Constraints

- Fixed topology required (no branching during capture)
- Same batch_size/seq_len for all replays (already true in production)
- NaN check moves to post-replay (backward always runs, so compute is wasted on NaN steps)
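The post-replay NaN check can be a pure predicate on the loss read back after the graph replay; `should_apply_update` is a hypothetical helper name, not an existing function in the trainer:

```rust
// Decide after replay whether to apply the optimizer update.
// With graph replay the backward always runs; a non-finite loss just means
// this step's gradients are discarded (wasted compute, but correct state).
fn should_apply_update(loss: f32) -> bool {
    loss.is_finite() // rejects NaN and +/-inf
}

fn main() {
    assert!(should_apply_update(2.31));
    assert!(!should_apply_update(f32::NAN));
    assert!(!should_apply_update(f32::INFINITY));
    println!("NaN-skip check ok");
}
```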
## Expected Impact
| Metric | Current (Phase 5b) | Expected (Phase 6) |
|---|---|---|
| Step time | 444 ms | ~200 ms |
| Tok/s | 9,216 | ~20K |
| MFU | 26.7% | ~58% |
| v3 wall time (250K steps) | 1.3 days | ~14 hours |
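As a sanity check on the table, the current row implies roughly 9,216 tok/s × 0.444 s ≈ 4,092 tokens per step (consistent with, say, batch 4 × seq 1024 = 4,096; the exact batch shape is an assumption here). At a ~200 ms replayed step, that same token count gives the ~20K tok/s target:

```rust
fn main() {
    // Derived from the "Current" row: tok/s * step time = tokens per step.
    let tokens_per_step = 9_216.0_f64 * 0.444; // ~4092 tokens
    // Same tokens per step at the expected ~200 ms step time.
    let expected_tok_s = tokens_per_step / 0.200;
    assert!((expected_tok_s - 20_459.52).abs() < 0.1); // ~20K tok/s
    println!("expected throughput: ~{:.0} tok/s", expected_tok_s);
}
```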
## Files to Change

- `entrenar/src/train/transformer_trainer/cuda_trainer.rs` (graph capture wrapper)
- No trueno-gpu changes needed (infrastructure complete)
## Contract

`contracts/cuda-graphs-v1.yaml` (to be written before implementation)
## References

- nsys profile data: albor@479a0d3 §6.10
- trueno-gpu graph API: `src/driver/graph.rs`
- trueno-gpu graph tests: `src/driver/cuda_tests/cuda_graph_tests.rs`