[CUDA EP] Add hardswish op and add bf16 support for hardsigmoid #25562

Merged
justinchuby merged 12 commits into microsoft:main from Stonesjtu:cuda-hardswish
Aug 21, 2025

@Stonesjtu
Contributor

@Stonesjtu Stonesjtu commented Jul 28, 2025

Description

Add a HardSwish operator, defined as x * HardSigmoid(x).
Add bf16 support for HardSigmoid.

Motivation and Context

HardSwish is currently implemented as HardSigmoid + Mul in the CUDA EP.
A fused HardSwish kernel should take roughly half the time of that two-op sequence.

@Stonesjtu
Contributor Author

@microsoft-github-policy-service agree

@Stonesjtu
Contributor Author

Stonesjtu commented Jul 29, 2025

Can anyone help trigger the CI?
@jywu-msft can you review this PR or assign the responsible reviewers?

@Stonesjtu Stonesjtu changed the title from "Add hardswish op for CUDA EP" to "[CUDA EP] Add hardswish op and add bf16 support for harsigmoid" Jul 30, 2025
@justinchuby justinchuby added the "core runtime" (issues related to core runtime) label Jul 30, 2025
@justinchuby justinchuby requested a review from Copilot July 30, 2025 15:41

This comment was marked as outdated.

Stonesjtu and others added 3 commits July 31, 2025 10:36
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@Stonesjtu Stonesjtu changed the title from "[CUDA EP] Add hardswish op and add bf16 support for harsigmoid" to "[CUDA EP] Add hardswish op and add bf16 support for hardsigmoid" Jul 31, 2025
@Stonesjtu
Contributor Author

Stonesjtu commented Jul 31, 2025

@justinchuby The new tests regarding HardSwish pass locally. Can you trigger the CI again?

Linking the fusion pass: microsoft/onnxscript#2472

@Stonesjtu
Contributor Author

The CI failed for OpenVINO, CoreML (arm64), and Android-NN-API; these failures should be unrelated to this PR. I disabled the HardSwish tests for non-CUDA EPs.

@Stonesjtu
Contributor Author

@justinchuby CI should pass now. Can you review this PR?

@justinchuby
Contributor

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 5 pipeline(s).

justinchuby previously approved these changes Aug 11, 2025
@Stonesjtu
Contributor Author

@justinchuby please trigger the CI

@justinchuby
Contributor

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 5 pipeline(s).

@justinchuby
Contributor

@Stonesjtu
Contributor Author

@justinchuby Thanks. Doc is updated as shown in the Azure CI.

@justinchuby
Contributor

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 5 pipeline(s).

Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds the HardSwish operator and extends HardSigmoid with bf16 (BFloat16) support in the CUDA execution provider. The motivation is a fused HardSwish implementation that is expected to be about twice as fast as the current HardSigmoid + Mul sequence.

  • Adds HardSwish operator implementation with support for float, double, MLFloat16, and BFloat16 types
  • Extends HardSigmoid operator to support BFloat16 data type
  • Updates versioning for both operators to support opset 22 with the new BFloat16 support
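
To illustrate what BFloat16 support involves numerically: bf16 keeps the upper 16 bits of an IEEE float32, so one common approach is to widen to float, compute, and round back. The sketch below assumes that approach; the thread does not state how the PR's kernel handles precision internally, and `FloatToBf16`, `Bf16ToFloat`, and `HardSigmoidBf16` are hypothetical helpers, not onnxruntime's `BFloat16` type.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>

// Convert float32 to bf16 bits using round-to-nearest-even.
// NaN inputs are not specially handled in this sketch.
inline std::uint16_t FloatToBf16(float f) {
  std::uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  const std::uint32_t rounding_bias = 0x7FFFu + ((bits >> 16) & 1u);
  return static_cast<std::uint16_t>((bits + rounding_bias) >> 16);
}

// Widen bf16 bits back to float32 by zero-filling the low 16 bits.
inline float Bf16ToFloat(std::uint16_t h) {
  const std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// HardSigmoid on a bf16 value: compute in float32, then round the result to bf16.
// alpha = 0.2 and beta = 0.5 are the ONNX HardSigmoid defaults.
inline std::uint16_t HardSigmoidBf16(std::uint16_t x, float alpha = 0.2f,
                                     float beta = 0.5f) {
  const float xf = Bf16ToFloat(x);
  const float y = std::max(0.0f, std::min(1.0f, alpha * xf + beta));
  return FloatToBf16(y);
}
```

Because bf16 has the same exponent range as float32 but only 8 bits of significand precision, values such as 0.5 and 1.0 round-trip exactly while most other results are rounded.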

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| onnxruntime/test/providers/cpu/activation/activation_op_test.cc | Adds unit tests for HardSwish operator |
| onnxruntime/core/providers/cuda/cuda_execution_provider.cc | Registers HardSwish and updated HardSigmoid kernels with proper versioning |
| onnxruntime/core/providers/cuda/activation/activations_impl.h | Adds HardSwish to activation operations list |
| onnxruntime/core/providers/cuda/activation/activations_impl.cu | Implements HardSwish CUDA kernel function |
| onnxruntime/core/providers/cuda/activation/activations.h | Declares HardSwish class template |
| onnxruntime/core/providers/cuda/activation/activations.cc | Defines HardSwish operator registration macros |
| docs/OperatorKernels.md | Updates documentation for HardSwish and HardSigmoid operator support |

@justinchuby
Contributor

@Stonesjtu could you merge from main? Sorry for the inconvenience, but we need the latest change to unblock the iPhone simulator pipeline.

@justinchuby
Contributor

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 5 pipeline(s).

@justinchuby justinchuby merged commit 21404e3 into microsoft:main Aug 21, 2025
86 checks passed
gedoensmax pushed a commit to gedoensmax/onnxruntime that referenced this pull request Sep 2, 2025
…osoft#25562)

### Description
Add HardSwish operator which is x*HardSigmoid(x)
Add bf16 support for HardSigmoid


### Motivation and Context
HardSwish is implemented as HardSigmoid + Mul in CUDA EP currently.
A fused HardSwish should take half the time of HardSigmoid + Add.

---------

Co-authored-by: kaiyu <kaiyu@bytedance.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@Stonesjtu Stonesjtu deleted the cuda-hardswish branch October 21, 2025 06:19
adrastogi pushed a commit that referenced this pull request Jan 5, 2026