[CUDA] Support user compute stream with CUDA graph in CUDA plugin EP by tianleiwu · Pull Request #29221 · microsoft/onnxruntime

tianleiwu · 2026-06-23T07:51:53Z

Description

The CUDA plugin EP previously rejected combining a user-provided compute stream
(user_compute_stream) with CUDA graph capture (enable_cuda_graph), returning
ORT_INVALID_ARGUMENT. This PR removes that restriction so the two options can
be used together: when both are set, graph capture and replay run on the
user-owned stream (the same stream the kernels are issued to), matching the
bundled (non-plugin) CUDA EP behavior. Several supporting fixes make capture on a
shared stream stable and Memcpy-free.

Summary of Changes

Allow user stream + CUDA graph

File	Change
onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc	Remove the validation that rejected `user_compute_stream` + `enable_cuda_graph` together.
onnxruntime/core/providers/cuda/plugin/cuda_ep.cc	`PerThreadContext` accepts an optional external graph stream. When both options are set it captures/replays on the user stream and does not create or destroy it (the user owns its lifetime); otherwise it owns a dedicated graph stream as before.

Stable, Memcpy-free CUDA graph capture

File	Change
onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h	Route kernel scratch/workspace allocations through the EP allocator (BFC arena) instead of raw `cudaMallocAsync`/`cudaMalloc`. After warmup the arena reaches steady state, so the capture run serves scratch from already-reserved chunks and the device free-memory footprint stays stable — required for correct capture. Matches the built-in CUDA EP.
onnxruntime/core/providers/cuda/tensor/shape_op.cc	Add an adapter-based `Shape` kernel under `#ifdef BUILD_CUDA_EP_AS_PLUGIN` with identical semantics to the CPU `Shape`. Registering `Shape` on the EP keeps it off the CPU EP and avoids the Memcpy nodes that would otherwise break CUDA graph capture.
cmake/onnxruntime_providers_cuda_plugin.cmake	Stop excluding `shape_op.cc` from the plugin build so the adapter-based `Shape` kernel is compiled in.

Null-allocator fallback in PrePack (plugin boundary)

In the plugin build the AllocatorPtr passed to PrePack can arrive null across
the library boundary. Each kernel now falls back to its own default-memory
allocator (Info().GetAllocator(OrtMemTypeDefault)), which is always valid.

Misc

onnxruntime/core/framework/session_state.cc — wrap a long line (no behavior change).

Testing

New test: onnxruntime/test/providers/cuda/plugin/cuda_plugin_user_stream_graph_test.cc covering:
1. Session creation succeeds with both user_compute_stream and enable_cuda_graph set (regression for the removed validation).
2. Capture + replay on the user stream produce correct results.
3. Replay after an in-place input update on the user stream is correct.
Tests are gated on ORT_UNIT_TEST_HAS_CUDA_PLUGIN_EP and skip gracefully when no CUDA device or plugin library is available.

Motivation and Context

Users that drive ORT from their own CUDA stream (e.g. to interleave ORT inference
with their own kernels) previously could not also benefit from CUDA graph capture
on the plugin EP. This change brings the plugin EP to parity with the bundled
CUDA EP for that workflow.

Checklist

Tests added/updated
No breaking changes (relaxes a previously rejected option combination)
Documentation updated (if applicable)

Copilot

Pull request overview

This PR brings the CUDA plugin Execution Provider to parity with the bundled CUDA EP by allowing CUDA Graph capture/replay (enable_cuda_graph) to be used together with a user-provided compute stream (user_compute_stream). It also includes supporting changes to make capture stable (avoiding capture-time allocations and cross-EP Memcpy nodes) and adds coverage for the new behavior.

Changes:

Allow user_compute_stream + enable_cuda_graph together in the CUDA plugin EP, capturing/replaying on the user stream.
Make CUDA graph capture more stable in the plugin EP by routing scratch allocations through the EP allocator and ensuring Shape is available on the CUDA EP to avoid Memcpy nodes.
Add a new plugin EP test that validates session creation and correctness across capture/replay and in-place input updates on the user stream.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc	Removes the invalid-argument restriction and documents the combined user-stream + CUDA-graph behavior.
onnxruntime/core/providers/cuda/plugin/cuda_ep.cc	Updates per-thread CUDA-graph context to optionally use a user-owned stream and avoid destroying it.
onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h	Routes scratch/workspace allocations through the EP allocator instead of raw CUDA malloc paths.
onnxruntime/core/providers/cuda/tensor/shape_op.cc	Adds a plugin-build adapter-based `Shape` kernel while keeping output on CPU memory to prevent graph-breaking Memcpy nodes.
cmake/onnxruntime_providers_cuda_plugin.cmake	Includes `shape_op.cc` in the plugin build to enable the new plugin `Shape` implementation.
onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc	Adds null-allocator fallback in `PrePack` for plugin boundary robustness.
onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc	Adds null-allocator fallback in `PrePack` for plugin boundary robustness.
onnxruntime/contrib_ops/cuda/quantization/matmul_nbits.cc	Adds null-allocator fallback in `PrePack` for plugin boundary robustness.
onnxruntime/test/providers/cuda/plugin/cuda_plugin_user_stream_graph_test.cc	Adds coverage for combined user stream + CUDA graph capture/replay correctness.
onnxruntime/core/framework/session_state.cc	Pure formatting (line wrap).

yuslepukhin · 2026-06-23T17:16:49Z

// Keep is_packed=false so the original fp16/bf16 head_sink remains available to the Flash/fallback

The pre-packing algo expects the allocator to be CPU based and in some cases attempts to externalize them to disk. #Resolved

Refers to: onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc:144 in 789c771. [](commit_id = 789c771, deletion_comment = False)

…rePack comments

tianleiwu · 2026-06-23T18:05:13Z

The pre-packing algo expects the allocator to be CPU based and in some cases attempts to externalize them to disk.

Re: group_query_attention.cc PrePack head_sink fallback. This kernel keeps is_packed = false and never populates prepacked_weights, so the framework does not register a PrePackedWeights container for this input and there is no disk externalization of the result. The allocator is used only to materialize the cached FP32 attention sink in device memory (an H2D copy followed by a conversion kernel), so a default-memory (device) allocator is exactly the kind required here, and a CPU allocator would be wrong. I added a comment clarifying this in 6bfd824. The same reasoning applies to the QMoE and MatMulNBits PrePack fallbacks, which also keep is_packed = false.

github-actions

You can commit the suggested changes from lintrunner.

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

yuslepukhin

…atch Stream* - PerThreadContext: derive graph-stream ownership from explicit use_external_stream intent instead of (external_stream == nullptr), so a user-selected CUDA default stream (cudaStream_t(0)) combined with CUDA graph is treated as external/user-owned and not destroyed. - GetScratchBuffer: stop forwarding a stack-temporary PluginStreamShim Stream* to the stream-aware arena. A plugin kernel only has the raw cudaStream_t, not the framework OrtSyncStream* the arena persists per chunk and later dereferences, so the temporary would dangle and be type-confused. Pass a null stream; capture stability comes from arena chunk reuse, and the CUDA graph path runs on a single unified stream. - Reconcile arena/cuda-graph plugin docs with the null-stream scratch behavior.

Copilot

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Copilot

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

yuslepukhin · 2026-06-25T00:18:40Z

Associated Issues / Duplicate Check

No linked GitHub issue in the PR description.
Follow-up PR [CUDA Plugin EP] Expose kernel sync stream for scratch allocation #29244 (stacked on this) properly exposes KernelContext_GetSyncStream to solve the null-stream-tag limitation for concurrent runs.

Missing negative tests

No test verifying that user_compute_stream + external allocator still correctly rejects
No test for cudaStream_t(0) as the user stream (edge case explicitly discussed in code)
No test for graph annotation ID switching with user stream

Verdict

The PR is functionally correct, well-documented, and conservative in its safety guarantees (disabling concurrent runs until the proper API exists). The code changes are minimal and well-targeted. The main concern is minor missing negative test coverage. The review comments were adequately addressed. Ready for merge pending the final approval from yuslepukhin (who already approved an earlier revision).

cuda plugin user stream for cuda graph

789c771

tianleiwu requested review from Copilot and yuslepukhin June 23, 2026 07:53

Copilot started reviewing on behalf of tianleiwu June 23, 2026 07:54 View session

Copilot AI reviewed Jun 23, 2026

View reviewed changes

Comment thread onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h Outdated

yuslepukhin reviewed Jun 23, 2026

View reviewed changes

Comment thread onnxruntime/core/providers/cuda/tensor/shape_op.cc Outdated

yuslepukhin reviewed Jun 23, 2026

View reviewed changes

Comment thread onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h

Address review feedback: stream-order plugin scratch, clarify Shape/P…

6bfd824

…rePack comments

refine fix

8100544

github-actions Bot reviewed Jun 23, 2026

View reviewed changes

Comment thread onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h Outdated

tianleiwu added 2 commits June 23, 2026 13:28

refactoring

d257b39

lintrunner

fed19f7

tianleiwu requested review from Copilot and yuslepukhin June 23, 2026 20:56

Copilot started reviewing on behalf of tianleiwu June 23, 2026 20:57 View session

update docs

2de6303

Copilot AI reviewed Jun 23, 2026

View reviewed changes

Comment thread onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h Outdated

Comment thread onnxruntime/core/providers/cuda/plugin/cuda_ep.cc Outdated

yuslepukhin previously approved these changes Jun 23, 2026

View reviewed changes

add test to CI

ed12914

tianleiwu dismissed yuslepukhin’s stale review via ed12914 June 23, 2026 21:19

tianleiwu added 2 commits June 23, 2026 16:56

update docs

55683b6

tianleiwu requested a review from Copilot June 24, 2026 00:11

Copilot started reviewing on behalf of tianleiwu June 24, 2026 00:12 View session

Copilot AI reviewed Jun 24, 2026

View reviewed changes

Comment thread onnxruntime/core/providers/cuda/plugin/cuda_ep.cc Outdated

Comment thread onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h

Comment thread docs/cuda_plugin_ep/cuda_plugin_ep_design.md Outdated

tianleiwu added 2 commits June 23, 2026 18:13

fix(cuda): honor default user stream in plugin EP

45b1e11

fix(cuda): disable unsafe plugin concurrent runs

3ce7ff3

tianleiwu requested a review from Copilot June 24, 2026 01:18

Copilot started reviewing on behalf of tianleiwu June 24, 2026 01:19 View session

Copilot AI reviewed Jun 24, 2026

View reviewed changes

Comment thread onnxruntime/test/providers/cuda/plugin/cuda_plugin_user_stream_graph_test.cc

Comment thread .github/workflows/linux_cuda_plugin_ci.yml Outdated

test(cuda): rely on user stream ordering

a1a2574

tianleiwu mentioned this pull request Jun 24, 2026

[CUDA Plugin EP] Expose kernel sync stream for scratch allocation #29244

Open

4 tasks

tianleiwu requested a review from yuslepukhin June 24, 2026 22:50

yuslepukhin approved these changes Jun 25, 2026

View reviewed changes

tianleiwu merged commit 3b022ec into main Jun 25, 2026
86 checks passed

tianleiwu deleted the tlwu/20260623/cuda_plugin_ep_cuda_graph_stream branch June 25, 2026 00:21

Uh oh!

Conversation

tianleiwu commented Jun 23, 2026

Description

Summary of Changes

Allow user stream + CUDA graph

Stable, Memcpy-free CUDA graph capture

Null-allocator fallback in PrePack (plugin boundary)

Misc

Testing

Motivation and Context

Checklist

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

yuslepukhin commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tianleiwu commented Jun 23, 2026

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

yuslepukhin left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

yuslepukhin commented Jun 25, 2026

Associated Issues / Duplicate Check

Missing negative tests

Verdict

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

yuslepukhin commented Jun 23, 2026 •

edited

Loading