Skip to content

[CUDA] Support user compute stream with CUDA graph in CUDA plugin EP#29221

Merged
tianleiwu merged 12 commits into
mainfrom
tlwu/20260623/cuda_plugin_ep_cuda_graph_stream
Jun 25, 2026
Merged

[CUDA] Support user compute stream with CUDA graph in CUDA plugin EP#29221
tianleiwu merged 12 commits into
mainfrom
tlwu/20260623/cuda_plugin_ep_cuda_graph_stream

Conversation

@tianleiwu

Copy link
Copy Markdown
Contributor

Description

The CUDA plugin EP previously rejected combining a user-provided compute stream
(user_compute_stream) with CUDA graph capture (enable_cuda_graph), returning
ORT_INVALID_ARGUMENT. This PR removes that restriction so the two options can
be used together: when both are set, graph capture and replay run on the
user-owned stream (the same stream the kernels are issued to), matching the
bundled (non-plugin) CUDA EP behavior. Several supporting fixes make capture on a
shared stream stable and Memcpy-free.

Summary of Changes

Allow user stream + CUDA graph

File Change
onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc Remove the validation that rejected user_compute_stream + enable_cuda_graph together.
onnxruntime/core/providers/cuda/plugin/cuda_ep.cc PerThreadContext accepts an optional external graph stream. When both options are set it captures/replays on the user stream and does not create or destroy it (the user owns its lifetime); otherwise it owns a dedicated graph stream as before.

Stable, Memcpy-free CUDA graph capture

File Change
onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h Route kernel scratch/workspace allocations through the EP allocator (BFC arena) instead of raw cudaMallocAsync/cudaMalloc. After warmup the arena reaches steady state, so the capture run serves scratch from already-reserved chunks and the device free-memory footprint stays stable — required for correct capture. Matches the built-in CUDA EP.
onnxruntime/core/providers/cuda/tensor/shape_op.cc Add an adapter-based Shape kernel under #ifdef BUILD_CUDA_EP_AS_PLUGIN with identical semantics to the CPU Shape. Registering Shape on the EP keeps it off the CPU EP and avoids the Memcpy nodes that would otherwise break CUDA graph capture.
cmake/onnxruntime_providers_cuda_plugin.cmake Stop excluding shape_op.cc from the plugin build so the adapter-based Shape kernel is compiled in.

Null-allocator fallback in PrePack (plugin boundary)

In the plugin build the AllocatorPtr passed to PrePack can arrive null across
the library boundary. Each kernel now falls back to its own default-memory
allocator (Info().GetAllocator(OrtMemTypeDefault)), which is always valid.

Misc

Testing

  • New test: onnxruntime/test/providers/cuda/plugin/cuda_plugin_user_stream_graph_test.cc covering:
    1. Session creation succeeds with both user_compute_stream and enable_cuda_graph set (regression for the removed validation).
    2. Capture + replay on the user stream produce correct results.
    3. Replay after an in-place input update on the user stream is correct.
  • Tests are gated on ORT_UNIT_TEST_HAS_CUDA_PLUGIN_EP and skip gracefully when no CUDA device or plugin library is available.

Motivation and Context

Users that drive ORT from their own CUDA stream (e.g. to interleave ORT inference
with their own kernels) previously could not also benefit from CUDA graph capture
on the plugin EP. This change brings the plugin EP to parity with the bundled
CUDA EP for that workflow.

Checklist

  • Tests added/updated
  • No breaking changes (relaxes a previously rejected option combination)
  • Documentation updated (if applicable)

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR brings the CUDA plugin Execution Provider to parity with the bundled CUDA EP by allowing CUDA Graph capture/replay (enable_cuda_graph) to be used together with a user-provided compute stream (user_compute_stream). It also includes supporting changes to make capture stable (avoiding capture-time allocations and cross-EP Memcpy nodes) and adds coverage for the new behavior.

Changes:

  • Allow user_compute_stream + enable_cuda_graph together in the CUDA plugin EP, capturing/replaying on the user stream.
  • Make CUDA graph capture more stable in the plugin EP by routing scratch allocations through the EP allocator and ensuring Shape is available on the CUDA EP to avoid Memcpy nodes.
  • Add a new plugin EP test that validates session creation and correctness across capture/replay and in-place input updates on the user stream.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc Removes the invalid-argument restriction and documents the combined user-stream + CUDA-graph behavior.
onnxruntime/core/providers/cuda/plugin/cuda_ep.cc Updates per-thread CUDA-graph context to optionally use a user-owned stream and avoid destroying it.
onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h Routes scratch/workspace allocations through the EP allocator instead of raw CUDA malloc paths.
onnxruntime/core/providers/cuda/tensor/shape_op.cc Adds a plugin-build adapter-based Shape kernel while keeping output on CPU memory to prevent graph-breaking Memcpy nodes.
cmake/onnxruntime_providers_cuda_plugin.cmake Includes shape_op.cc in the plugin build to enable the new plugin Shape implementation.
onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc Adds null-allocator fallback in PrePack for plugin boundary robustness.
onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc Adds null-allocator fallback in PrePack for plugin boundary robustness.
onnxruntime/contrib_ops/cuda/quantization/matmul_nbits.cc Adds null-allocator fallback in PrePack for plugin boundary robustness.
onnxruntime/test/providers/cuda/plugin/cuda_plugin_user_stream_graph_test.cc Adds coverage for combined user stream + CUDA graph capture/replay correctness.
onnxruntime/core/framework/session_state.cc Pure formatting (line wrap).

Comment thread onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h Outdated
@yuslepukhin

yuslepukhin commented Jun 23, 2026

Copy link
Copy Markdown
Member

// Keep is_packed=false so the original fp16/bf16 head_sink remains available to the Flash/fallback

The pre-packing algo expects the allocator to be CPU based and in some cases attempts to externalize them to disk. #Resolved


Refers to: onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc:144 in 789c771. [](commit_id = 789c771, deletion_comment = False)

Comment thread onnxruntime/core/providers/cuda/tensor/shape_op.cc Outdated
Comment thread onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h
@tianleiwu

Copy link
Copy Markdown
Contributor Author

The pre-packing algo expects the allocator to be CPU based and in some cases attempts to externalize them to disk.

Re: group_query_attention.cc PrePack head_sink fallback. This kernel keeps is_packed = false and never populates prepacked_weights, so the framework does not register a PrePackedWeights container for this input and there is no disk externalization of the result. The allocator is used only to materialize the cached FP32 attention sink in device memory (an H2D copy followed by a conversion kernel), so a default-memory (device) allocator is exactly the kind required here, and a CPU allocator would be wrong. I added a comment clarifying this in 6bfd824. The same reasoning applies to the QMoE and MatMulNBits PrePack fallbacks, which also keep is_packed = false.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can commit the suggested changes from lintrunner.

Comment thread onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Comment thread onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h Outdated
Comment thread onnxruntime/core/providers/cuda/plugin/cuda_ep.cc Outdated
yuslepukhin
yuslepukhin previously approved these changes Jun 23, 2026

@yuslepukhin yuslepukhin left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

…atch Stream*

- PerThreadContext: derive graph-stream ownership from explicit
  use_external_stream intent instead of (external_stream == nullptr), so a
  user-selected CUDA default stream (cudaStream_t(0)) combined with CUDA graph
  is treated as external/user-owned and not destroyed.
- GetScratchBuffer: stop forwarding a stack-temporary PluginStreamShim Stream*
  to the stream-aware arena. A plugin kernel only has the raw cudaStream_t, not
  the framework OrtSyncStream* the arena persists per chunk and later
  dereferences, so the temporary would dangle and be type-confused. Pass a null
  stream; capture stability comes from arena chunk reuse, and the CUDA graph
  path runs on a single unified stream.
- Reconcile arena/cuda-graph plugin docs with the null-stream scratch behavior.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Comment thread onnxruntime/core/providers/cuda/plugin/cuda_ep.cc Outdated
Comment thread onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h
Comment thread docs/cuda_plugin_ep/cuda_plugin_ep_design.md Outdated
@tianleiwu tianleiwu requested a review from Copilot June 24, 2026 01:18

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Comment thread .github/workflows/linux_cuda_plugin_ci.yml Outdated
@yuslepukhin

Copy link
Copy Markdown
Member

Associated Issues / Duplicate Check

Missing negative tests

  • No test verifying that user_compute_stream + external allocator still correctly rejects
  • No test for cudaStream_t(0) as the user stream (edge case explicitly discussed in code)
  • No test for graph annotation ID switching with user stream

Verdict

The PR is functionally correct, well-documented, and conservative in its safety guarantees (disabling concurrent runs until the proper API exists). The code changes are minimal and well-targeted. The main concern is minor missing negative test coverage. The review comments were adequately addressed. Ready for merge pending the final approval from yuslepukhin (who already approved an earlier revision).

@tianleiwu tianleiwu merged commit 3b022ec into main Jun 25, 2026
86 checks passed
@tianleiwu tianleiwu deleted the tlwu/20260623/cuda_plugin_ep_cuda_graph_stream branch June 25, 2026 00:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants