Commit bfd5962
feat: GPU-resident factorLU hot path for CUDA graph capture
Eliminate all CPU→GPU transfers and forbidden CUDA operations from the
factorLU dense loop, making it fully compatible with CUDA graph capture
(XLA command buffers / WhileCmd).
Changes:
1. GPU prepareAssemble kernel: replaces the CPU loop + pinned H→D copy
with a CUDA kernel that reads device-resident skeleton arrays directly.
Eliminates the ensurePinnedBuf call and the cudaMemcpyAsync.
2. Recording mode for GemmWorkItems: beginRecording() runs factorLU with
all GPU operations as no-ops, capturing the GemmWorkItem schedule and
flush-point boundaries. endRecording() uploads the items to the device
in a single H→D copy. Subsequent factorizations dispatch from the
pre-computed device buffer, with no per-lump CPU computation or H→D
transfers.
3. beginDenseOps stream fix: use sym.stream_ instead of stream 0 for
cudaEventRecord/cudaStreamWaitEvent. The legacy default stream
(stream 0) cannot be used during CUDA graph capture. The event is
pre-created in preAllocateForLU.
4. All NumericCtx GPU methods (getrf, trsm, applyRowPerm, assemble,
doElimination, maxAbsDiag, readValue, perturbSmallDiagonals) are
no-ops in recording mode.
The recorded schedule depends only on structure: the GemmWorkItem
offsets and dimensions never change between NR iterations because the
sparsity pattern is fixed at solver creation.
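The record-once, replay-many idea can be illustrated with a host-only C++ sketch. It is not the code from this commit: RecordingCtx, gpuOp(), replay(), the GemmWorkItem fields, and the lump-size driver are all hypothetical stand-ins, and the "upload" is modelled as a vector copy rather than a real H→D transfer. It only shows how GPU calls can become no-ops while the schedule and flush points are captured from structure alone.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical work-item layout; the commit does not show the real fields.
struct GemmWorkItem {
    std::int64_t offA, offB, offC;  // offsets into the factor data buffer
    int m, n, k;                    // GEMM dimensions
};

// Host-side stand-in for a numeric context with a recording mode.
class RecordingCtx {
public:
    void beginRecording() {
        items_.clear();
        flushPoints_.clear();
        recording_ = true;
    }

    // Models the single H->D copy: the whole schedule moves in one step.
    void endRecording() {
        deviceItems_ = items_;  // stand-in for one bulk copy to the device
        ++uploads_;
        recording_ = false;
    }

    // Stand-in for getrf/trsm/assemble/...: a no-op while recording.
    void gpuOp() {
        if (recording_) return;
        ++gpuLaunches_;
    }

    // While recording, only the schedule entry is captured.
    void gemm(const GemmWorkItem& w) {
        if (recording_) { items_.push_back(w); return; }
        ++gpuLaunches_;
    }

    // Marks a boundary up to which queued GEMMs form one batch.
    void flush() {
        if (recording_) flushPoints_.push_back(items_.size());
    }

    // Replay: count one batched dispatch per non-empty recorded range.
    // A real version would launch a kernel over deviceItems_[begin, end).
    std::size_t replay() const {
        std::size_t batches = 0, begin = 0;
        for (std::size_t end : flushPoints_) {
            if (end > begin) ++batches;
            begin = end;
        }
        return batches;
    }

    const std::vector<GemmWorkItem>& items() const { return items_; }
    const std::vector<std::size_t>& flushPoints() const { return flushPoints_; }
    int uploads() const { return uploads_; }
    int gpuLaunches() const { return gpuLaunches_; }

private:
    bool recording_ = false;
    int uploads_ = 0;
    int gpuLaunches_ = 0;
    std::vector<GemmWorkItem> items_;        // host-side schedule
    std::vector<GemmWorkItem> deviceItems_;  // stand-in for the device buffer
    std::vector<std::size_t> flushPoints_;   // item counts at each flush
};

// Structure-only driver: lump sizes come from the sparsity pattern,
// never from numeric values, so the recorded schedule is reusable.
void factorLU(RecordingCtx& ctx, const std::vector<int>& lumpSizes) {
    std::int64_t offset = 0;
    for (std::size_t i = 0; i < lumpSizes.size(); ++i) {
        const int n = lumpSizes[i];
        ctx.gpuOp();  // e.g. getrf on the diagonal block
        ctx.gpuOp();  // e.g. trsm on the off-diagonal panels
        // One update per later lump (an illustrative schedule only).
        for (std::size_t j = i + 1; j < lumpSizes.size(); ++j) {
            ctx.gemm({offset, offset + n * n, offset + 2 * n * n,
                      lumpSizes[j], lumpSizes[j], n});
        }
        ctx.flush();
        offset += 3 * n * n;
    }
}
```

In this sketch, calling beginRecording(), then factorLU(ctx, {2, 3, 4}), then endRecording() captures three work items and three flush points without counting any GPU launch, and performs exactly one modelled upload. Because the driver reads only lump sizes, recording again with different numeric values would produce the same schedule, which is what lets later factorizations skip the CPU work.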
Co-developed-by: Claude Code v2.1.58 (claude-opus-4-6)
1 parent 1f111cf, commit bfd5962
2 files changed, 193 additions and 24 deletions