Commit bfd5962

feat: GPU-resident factorLU hot path for CUDA graph capture
Eliminate all CPU→GPU transfers and forbidden CUDA operations from the
factorLU dense loop, making it fully compatible with CUDA graph capture
(XLA command buffers / WhileCmd).

Changes:

1. GPU prepareAssemble kernel: replaces the CPU loop + pinned H→D copy
   with a CUDA kernel that reads device-resident skeleton arrays
   directly. Eliminates the ensurePinnedBuf call and cudaMemcpyAsync.

2. Recording mode for GemmWorkItems: beginRecording() runs factorLU with
   all GPU operations as no-ops, capturing the GemmWorkItem schedule and
   flush-point boundaries. endRecording() uploads the items to device in
   a single H→D copy. Subsequent factorizations dispatch from the
   pre-computed device buffer — no per-lump CPU computation or H→D
   transfers.

3. beginDenseOps stream fix: use sym.stream_ instead of stream 0 for
   cudaEventRecord/cudaStreamWaitEvent. Stream 0 is invalid during CUDA
   graph capture. The event is pre-created in preAllocateForLU.

4. All NumericCtx GPU methods (getrf, trsm, applyRowPerm, assemble,
   doElimination, maxAbsDiag, readValue, perturbSmallDiagonals) are
   no-ops in recording mode.

Recording mode is structure-dependent only — the GemmWorkItem offsets
and dimensions never change between NR iterations because the sparsity
pattern is fixed at solver creation.

Co-developed-by: Claude Code v2.1.58 (claude-opus-4-6)
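The record-once/replay-many pattern described in change 2 can be sketched in plain C++. This is a hypothetical mock, not BaSpaCho's actual implementation: the names `GemmWorkItem`, `beginRecording`, and `endRecording` come from the commit message, but the fields, the `RecordingCtx` class, and the vector standing in for the device buffer are illustrative assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical GEMM work descriptor: offsets and dimensions are
// structure-dependent, so they are identical on every factorization.
struct GemmWorkItem {
  int64_t off;      // offset into the factor buffer (illustrative)
  int64_t m, n, k;  // GEMM dimensions
};

// Mock of the recording-mode idea: during recording, GPU dispatch is a
// no-op and only the schedule is captured; endRecording "uploads" it
// once (standing in for a single H->D cudaMemcpyAsync).
struct RecordingCtx {
  bool recording = false;
  std::vector<GemmWorkItem> host;     // schedule built while recording
  std::vector<GemmWorkItem> device;   // stands in for the device buffer
  std::vector<size_t> flushPoints;    // batch boundaries in the schedule
  int launches = 0;                   // counts real kernel dispatches

  void beginRecording() {
    recording = true;
    host.clear();
    flushPoints.clear();
  }

  void gemm(GemmWorkItem w) {
    if (recording) {        // GPU op is a no-op: just capture the item
      host.push_back(w);
      return;
    }
    ++launches;             // real path: dispatch from device schedule
  }

  void flush() {
    if (recording) flushPoints.push_back(host.size());
    // real path would launch the batched GEMMs queued so far
  }

  void endRecording() {
    device = host;          // single batched upload of the schedule
    recording = false;
  }
};
```

After `endRecording()`, repeated factorizations walk the pre-uploaded schedule between flush points, with no per-lump host work; this is what makes the hot path safe to capture into a CUDA graph.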
1 parent 1f111cf commit bfd5962

File tree

2 files changed: +193 −24 lines

baspacho/baspacho/MatOps.h

Lines changed: 8 additions & 0 deletions
@@ -78,6 +78,14 @@ struct NumericCtxBase {
   // devDstPivots must be a device-allocated buffer with enough space.
   // Default: no-op (CPU backends don't have deferred pivot copies).
   virtual void flushDevicePivots(int64_t* devDstPivots) { (void)devDstPivots; }
+
+  // Recording mode: run factorLU with GPU operations as no-ops to capture
+  // all GemmWorkItem data and flush-point boundaries. The recorded data is
+  // structure-dependent only (never changes between NR iterations), so it can
+  // be pre-uploaded to device and reused across all subsequent factorizations.
+  // This eliminates per-lump CPU→GPU transfers during CUDA graph capture/replay.
+  virtual void beginRecording() {}
+  virtual void endRecording() {}
 };
 
 struct SolveCtxBase {
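The stream fix in change 3 rests on a CUDA graph-capture rule: event operations issued during capture must target a stream that is part of the capture, so the legacy default stream (0) is rejected. A minimal compilable mock of that rule, assuming nothing about BaSpaCho beyond the `sym.stream_` name from the commit message (`Stream`, `Sym`, and `recordAndWait` are illustrative stand-ins, not the CUDA runtime):

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for cudaStream_t; 0 models the legacy default stream.
using Stream = std::uintptr_t;
constexpr Stream kLegacyDefaultStream = 0;

// Mock of the solver's symbolic context: it owns a capture-safe stream.
struct Sym {
  Stream stream_;   // the stream actually enrolled in graph capture
  bool capturing;   // true while a CUDA graph is being captured
};

// Stands in for the cudaEventRecord/cudaStreamWaitEvent pair in
// beginDenseOps; returns false where the real calls would fail.
bool recordAndWait(const Sym& sym, Stream opStream) {
  // During capture, touching the default stream is an error -- this is
  // the bug the commit fixes by passing sym.stream_ instead of 0.
  if (sym.capturing && opStream == kLegacyDefaultStream) return false;
  return true;
}
```

Outside capture, either stream would work, which is why the original stream-0 code went unnoticed until factorLU ran under CUDA graph capture.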
