[CUDA EP] Add hardswish op and add bf16 support for hardsigmoid#25562
justinchuby merged 12 commits into microsoft:main
Conversation
@microsoft-github-policy-service agree

Can anyone help trigger the CI?
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@justinchuby The new tests for HardSwish pass locally. Can you trigger the CI again? Linking the fusion pass: microsoft/onnxscript#2472

The CI failed for OpenVINO, CoreML (arm64), and Android NNAPI, which should be unrelated to this PR. I disabled the HardSwish tests for non-CUDA EPs.

@justinchuby CI should pass; can you review this PR?
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

Azure Pipelines successfully started running 5 pipeline(s).

@justinchuby please trigger the CI

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

Azure Pipelines successfully started running 5 pipeline(s).
@Stonesjtu could you fix the documentation according to https://aiinfra.visualstudio.com/PublicPackages/_build/results?buildId=908666&view=logs&j=7f366e99-16b2-52cc-e1ff-653af284e397&t=834305f1-2220-521d-a5bb-dfba0f922108&l=5484 ?

@justinchuby Thanks. The doc is updated as shown in the Azure CI.

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

Azure Pipelines successfully started running 5 pipeline(s).
Pull Request Overview
This PR adds support for the HardSwish operator and extends BFloat16 (bf16) support to HardSigmoid in the CUDA execution provider. The motivation is to provide a fused HardSwish implementation, which should be roughly twice as fast as the current decomposition into HardSigmoid followed by Mul.
- Adds HardSwish operator implementation with support for float, double, MLFloat16, and BFloat16 types
- Extends HardSigmoid operator to support BFloat16 data type
- Updates versioning for both operators to support opset 22 with the new BFloat16 support
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| onnxruntime/test/providers/cpu/activation/activation_op_test.cc | Adds unit tests for HardSwish operator |
| onnxruntime/core/providers/cuda/cuda_execution_provider.cc | Registers HardSwish and updated HardSigmoid kernels with proper versioning |
| onnxruntime/core/providers/cuda/activation/activations_impl.h | Adds HardSwish to activation operations list |
| onnxruntime/core/providers/cuda/activation/activations_impl.cu | Implements HardSwish CUDA kernel function |
| onnxruntime/core/providers/cuda/activation/activations.h | Declares HardSwish class template |
| onnxruntime/core/providers/cuda/activation/activations.cc | Defines HardSwish operator registration macros |
| docs/OperatorKernels.md | Updates documentation for HardSwish and HardSigmoid operator support |
@Stonesjtu could you merge from main? Sorry for the inconvenience, but we need the latest change to unblock the iPhone simulator pipeline.

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

Azure Pipelines successfully started running 5 pipeline(s).
…osoft#25562)

### Description
Add the HardSwish operator, which computes x * HardSigmoid(x).
Add bf16 support for HardSigmoid.

### Motivation and Context
HardSwish is currently implemented as HardSigmoid + Mul in the CUDA EP.
A fused HardSwish should take half the time of HardSigmoid + Mul.

Co-authored-by: kaiyu <kaiyu@bytedance.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
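The speedup argument can be sketched with a host-side comparison of the two strategies. This is illustrative only, assuming hypothetical function names; in the actual CUDA EP each pass would correspond to one element-wise kernel launch plus one full read and write of the tensor, which is where the roughly 2x saving comes from.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// HardSigmoid with the parameters HardSwish composes with
// (alpha = 1/6, beta = 0.5).
static float HardSigmoidScalar(float x) {
  return std::max(0.0f, std::min(1.0f, x / 6.0f + 0.5f));
}

// Unfused: a HardSigmoid pass followed by a Mul pass —
// two traversals of the data and an intermediate buffer.
std::vector<float> HardSwishUnfused(const std::vector<float>& x) {
  std::vector<float> tmp(x.size()), out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) tmp[i] = HardSigmoidScalar(x[i]);
  for (std::size_t i = 0; i < x.size(); ++i) out[i] = x[i] * tmp[i];
  return out;
}

// Fused: one traversal computing x * HardSigmoid(x) directly,
// with no intermediate buffer.
std::vector<float> HardSwishFused(const std::vector<float>& x) {
  std::vector<float> out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i)
    out[i] = x[i] * HardSigmoidScalar(x[i]);
  return out;
}
```

Both variants perform the same floating-point operations per element, so their results match exactly; only the number of passes over memory differs.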