
[CUDA EP] Add pad op version from 19 to 23 support for CUDA#27416

Open
ShirasawaSama wants to merge 1 commit into microsoft:main from ShirasawaSama:feature/add-pad-op-version-19-to-23-support-for-CUDA

Conversation

@ShirasawaSama
Contributor

Description

Add pad op version from 19 to 23 support for CUDA

Motivation and Context

The current CUDA executor does not support the pad operation in Opset from 19 to 23. When an ONNX model exported in Opset from 19 to 23 is run on the CUDA executor, the pad operation is forcibly offloaded to the CPU, resulting in significant performance degradation.


@ShirasawaSama ShirasawaSama changed the title Add pad op version from 19 to 23 support for CUDA [CUDA EP] Add pad op version from 19 to 23 support for CUDA Feb 23, 2026
@tianleiwu tianleiwu requested a review from Copilot February 24, 2026 19:05
Contributor

Copilot AI left a comment


Pull request overview

Adds CUDA Execution Provider coverage for ONNX Pad in opset 19–23 (previously only registered up to opset 18), including implementing wrap mode behavior so models exported with newer opsets no longer force a CPU fallback for Pad.

Changes:

  • Register CUDA Pad kernels for opset 19–20, 21–22, and 23 (and make opset 18 explicitly versioned).
  • Add CUDA kernel support for wrap mode, including handling negative pads via slicing metadata.
  • Update an existing wrap padding test comment now that CUDA is expected to support opset 19.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Summary per file:

| File | Description |
| --- | --- |
| onnxruntime/test/providers/cpu/tensor/pad_test.cc | Updates wrap-mode test context now that CUDA can register opset 19+ Pad. |
| onnxruntime/core/providers/cuda/tensor/pad_impl.h | Extends CUDA pad kernel APIs to accept slice/effective-dim metadata needed for wrap + negative pads. |
| onnxruntime/core/providers/cuda/tensor/pad_impl.cu | Implements wrap mode in CUDA kernels and wires new parameters through launch paths. |
| onnxruntime/core/providers/cuda/tensor/pad.cc | Adds CUDA kernel registrations for opset 19–23 and passes slice/effective dims into CUDA implementations. |
| onnxruntime/core/providers/cuda/cuda_execution_provider.cc | Declares/registers the additional versioned CUDA Pad kernels in the EP registry. |
Comments suppressed due to low confidence (1)

onnxruntime/test/providers/cpu/tensor/pad_test.cc:1401

  • This test previously avoided CUDA by using an opset version CUDA didn’t register for. Now that CUDA is expected to support opset 19+, it would be good to make the test actually fail if Pad falls back to CPU (otherwise a future regression could silently reintroduce CPU offload while still passing). Consider running this case with session.disable_cpu_ep_fallback=1 and restricting execution providers to CUDA for this test so it validates the new CUDA registration/support for opset 19–23.
```cpp
OpTester test("Pad", 19);
test.AddInput<float>("data", input_shape, input_data);
test.AddInput<int64_t>("pads", {static_cast<int64_t>(pads.size())}, pads, true);
test.AddOutput<float>("output", expected_shape, expected_data);
test.AddAttribute("mode", "wrap");
test.ConfigExcludeEps({kDmlExecutionProvider, kQnnExecutionProvider,
                       kTensorrtExecutionProvider, kWebGpuExecutionProvider});
test.RunWithConfig();
```



```cpp
// CUDA registers only up to 18 and does not impl wrap mode
// so we force version to 19 to automatically exclude EPs that do not
// implement wrap mode similar to the above tests.
```
Member


I am guessing there are Wrap mode Pad tests already in place?

Contributor Author


yes

@hariharans29
Member

Can you please resolve the conflicts?

@ShirasawaSama
Contributor Author

ShirasawaSama commented Feb 24, 2026

Sorry, I think I found some errors in my math formula (caught during my final code review). I'll try adding more unit tests to cover them.

@ShirasawaSama
Contributor Author

ShirasawaSama commented Feb 24, 2026

@hariharans29 While reviewing behavior, I compared CUDA with CPU and the ONNX spec, and checked how other execution providers handle Pad.

1. CUDA vs CPU (pre-pad formula)

  • CPU (pad.cc): For the pre-pad branch of Wrap, the effective index is
    (eff_len - lower_pads[dim] + out_coord) % eff_len,
    then mapped to the input using the effective (sliced) region. So Wrap is applied on the tensor after applying negative pads as removal, consistent with the ONNX spec.

  • CUDA (pad_impl.cu): The pre-pad branch uses
    pad_head = lower_pads[dim] - slice_starts[dim],
    then offset = (out_coord - pad_head) % eff_len,
    and in_coord = offset - slice_starts[dim].
    This matches the CPU result only when slice_starts[dim] == 0. When slice_starts[dim] != 0, the two formulas differ in principle.

2. Why current CUDA still passes all tests

With the current ONNX Pad semantics:

  • If the begin pad is negative on a dimension, lower_pads[dim] is negative, so there is no pre-pad on that dimension (only crop and/or post-pad). The pre-pad branch is never entered.
  • If the end pad is negative, slice_starts[dim] remains 0, so the current CUDA formula matches the CPU.

Therefore the situation “pre-pad and slice_starts[dim] != 0 on the same dimension” does not occur with the current spec and test set. That is why the current CUDA impl still leaves all tests passing.

3. Other EPs (Pad / Wrap)

  • WebGPU: Uses PadBase::SeparateNegativeToSlices and lower_pads, but the shader uses the input tensor’s shape for Wrap (in_coord = data_shape + output_index - lower_pads). It does not use effective extent; with negative padding, Wrap semantics may differ from CPU/spec.
  • DML: Delegates to DirectML’s Pad (including DML_PADDING_MODE_WRAP). Semantics follow DirectML, not a direct port of CPU.
  • JS: Inherits PadBase and passes pads/mode to JSEP; actual math is in JS/WASM — alignment with CPU depends on that implementation.
  • NNAPI / CoreML / QNN / WebNN: Map ONNX Pad to the backend’s Pad op; behavior is backend-defined (e.g. CoreML only constant mode). None of these “reuse” the CPU kernel.

So today, only CPU is the clear reference for “effective region + wrap”; CUDA uses effective extent but a different pre-pad formula; WebGPU does not use effective extent for Wrap.

4. How the ONNX spec defines Pad

(Ref: https://onnx.ai/onnx/operators/onnx__Pad.html)

  • pads: “Tensor of integers indicating the number of padding elements to add or remove (if negative) at the beginning and end of each axis.”
    Format: [x1_begin, x2_begin, …, x1_end, x2_end, …] — so each axis has one value per side; negative means remove (crop) on that side.
  • wrap mode: “Wrap-around padding as if the data tensor forms a torus.”
    So Wrap is defined on the tensor that results from applying those add/remove operations — i.e. the effective tensor. Implementations (like CPU) that first apply negative pads as removal, then pad/wrap on the resulting effective region, match this.

Under the spec, each side of each axis has a single integer (add or remove). So “pre-pad (add at begin) and slice_starts[dim] != 0 (remove at begin)” would require two values for the same begin, which the spec does not allow. That scenario is not a valid spec case — it is a hypothetical that the spec does not define. For all spec-valid inputs, current CUDA behavior is compliant; the only difference is in that unreachable case.

5. Question

For the PR, should I:

  • A) Leave the CUDA implementation as-is (all current tests pass; the differing case is unreachable and not a valid spec input), or
  • B) Change the CUDA pre-pad formula to match the CPU (e.g. use (eff_len - lower_pads[dim] + out_coord) % eff_len for the effective index, then map to input) for strict spec/CPU alignment and future-proofing?

I am happy to implement either approach based on the team’s preference. Thank you.

@ShirasawaSama ShirasawaSama force-pushed the feature/add-pad-op-version-19-to-23-support-for-CUDA branch 2 times, most recently from e4ea6f1 to cef6716 Compare March 2, 2026 18:13
@ShirasawaSama ShirasawaSama force-pushed the feature/add-pad-op-version-19-to-23-support-for-CUDA branch from 3b7e80a to 37eabee Compare March 2, 2026 19:23
