[CUDA EP] Add pad op version from 19 to 23 support for CUDA #27416

ShirasawaSama wants to merge 1 commit into microsoft:main
Conversation
Pull request overview
Adds CUDA Execution Provider coverage for ONNX Pad in opset 19–23 (previously only registered up to opset 18), including implementing wrap mode behavior so models exported with newer opsets no longer force a CPU fallback for Pad.
Changes:
- Register CUDA `Pad` kernels for opsets 19–20, 21–22, and 23 (and make opset 18 explicitly versioned).
- Add CUDA kernel support for `wrap` mode, including handling negative pads via slicing metadata.
- Update an existing `wrap` padding test comment now that CUDA is expected to support opset 19.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| onnxruntime/test/providers/cpu/tensor/pad_test.cc | Updates wrap-mode test context now that CUDA can register opset 19+ Pad. |
| onnxruntime/core/providers/cuda/tensor/pad_impl.h | Extends CUDA pad kernel APIs to accept slice/effective-dim metadata needed for wrap + negative pads. |
| onnxruntime/core/providers/cuda/tensor/pad_impl.cu | Implements wrap mode in CUDA kernels and wires new parameters through launch paths. |
| onnxruntime/core/providers/cuda/tensor/pad.cc | Adds CUDA kernel registrations for opset 19–23 and passes slice/effective dims into CUDA implementations. |
| onnxruntime/core/providers/cuda/cuda_execution_provider.cc | Declares/registers the additional versioned CUDA Pad kernels in the EP registry. |
Comments suppressed due to low confidence (1)
onnxruntime/test/providers/cpu/tensor/pad_test.cc:1401
- This test previously avoided CUDA by using an opset version CUDA didn't register for. Now that CUDA is expected to support opset 19+, it would be good to make the test actually fail if Pad falls back to CPU (otherwise a future regression could silently reintroduce CPU offload while still passing). Consider running this case with `session.disable_cpu_ep_fallback=1` and restricting execution providers to CUDA for this test so it validates the new CUDA registration/support for opset 19–23.
```cpp
OpTester test("Pad", 19);
test.AddInput<float>("data", input_shape, input_data);
test.AddInput<int64_t>("pads", {static_cast<int64_t>(pads.size())}, pads, true);
test.AddOutput<float>("output", expected_shape, expected_data);
test.AddAttribute("mode", "wrap");
test.ConfigExcludeEps({kDmlExecutionProvider, kQnnExecutionProvider,
                       kTensorrtExecutionProvider, kWebGpuExecutionProvider});
test.RunWithConfig();
```
```cpp
// CUDA registers only up to 18 and does not impl wrap mode
// so we force version to 19 to automatically exclude EPs that do not
// implement wrap mode similar to the above tests.
```
I am guessing there are Wrap mode Pad tests already in ?
Can you please resolve the conflicts?
Sorry, I think I found some errors in my math formula (my final code review). I'll try adding more unit tests to cover them.
@hariharans29 While reviewing behavior, I compared CUDA with CPU and the ONNX spec, and checked how other execution providers handle Pad.

1. CUDA vs CPU (pre-pad formula)

2. Why current CUDA still passes all tests

With the current ONNX Pad semantics:

Therefore the situation "pre-pad and […]"

3. Other EPs (Pad / Wrap)

So today, only CPU is the clear reference for "effective region + wrap"; CUDA uses effective extent but a different pre-pad formula; WebGPU does not use effective extent for Wrap.

4. How the ONNX spec defines Pad

(Ref: https://onnx.ai/onnx/operators/onnx__Pad.html)

Under the spec, each side of each axis has a single integer (add or remove). So "pre-pad (add at begin) and slice_starts[dim] != 0 (remove at begin)" would require two values for the same begin, which the spec does not allow. That scenario is not a valid spec case; it is a hypothetical that the spec does not define. For all spec-valid inputs, current CUDA behavior is compliant; the only difference is in that unreachable case.

5. Question

For the PR, should I:

I am happy to implement either approach based on the team's preference. Thank you.
Description
Add pad op version from 19 to 23 support for CUDA
Motivation and Context
The CUDA execution provider currently does not support the Pad operator for opsets 19 to 23. When an ONNX model exported at one of these opsets is run with the CUDA execution provider, the Pad operation is forcibly offloaded to the CPU, resulting in significant performance degradation.