Conversation

@xhcao xhcao commented Nov 14, 2025

In order to use transpose-shared instead of transpose-naive, we could split transpose perm{2310} into two steps, which benefits the Conv operator.

Description

Motivation and Context

In order to use transpose-shared instead of transpose-naive, we could split transpose perm{2310} into two steps, which benefits the Conv operator.
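The comment does not spell out which two permutations the PR composes, but one decomposition of perm{2,3,1,0} that turns each step into a tile-friendly 2D transpose (the kind a shared-memory kernel handles well) is perm{2,3,0,1} followed by perm{0,1,3,2}. The specific split below is an assumption for illustration; the sketch only verifies that the composition is equivalent to the single permutation:

```python
from itertools import product

def transpose(data, shape, perm):
    """Row-major transpose: output axis k takes input axis perm[k]."""
    out_shape = [shape[p] for p in perm]
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    out = []
    for idx in product(*(range(d) for d in out_shape)):
        # Input index j satisfies j[perm[k]] = idx[k].
        j = [0] * len(shape)
        for k, p in enumerate(perm):
            j[p] = idx[k]
        out.append(data[sum(a * s for a, s in zip(j, strides))])
    return out, out_shape

shape = [2, 3, 4, 5]
data = list(range(2 * 3 * 4 * 5))

# Single-step perm{2,3,1,0}.
direct, ds = transpose(data, shape, [2, 3, 1, 0])

# Assumed two-step split:
#   step 1: perm{2,3,0,1} -- a plain 2D transpose of [d0*d1, d2*d3]
#   step 2: perm{0,1,3,2} -- a per-batch swap of the last two dims
step1, s1 = transpose(data, shape, [2, 3, 0, 1])
step2, s2 = transpose(step1, s1, [0, 1, 3, 2])

assert step2 == direct and s2 == ds
print("perm{2,3,1,0} == perm{2,3,0,1} then perm{0,1,3,2}")
```

Both intermediate permutations reduce to transposing a 2D view of the tensor, which is what lets the shared-memory (tiled) transpose kernel replace the naive one on each step.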

xhcao commented Nov 14, 2025

This PR improves performance on the sdunet-v1.5-demo-layernorm model: total Conv|Transpose time drops from 224 ms to 135 ms.


xhcao commented Nov 14, 2025

@jchen10 @daijh PTAL


jchen10 commented Nov 14, 2025

Looks great. As discussed in #26554 (comment), we are going to cache the transposed kernel, so this PR could become less beneficial for Conv|Transpose. Maybe we can find other places to apply this optimization later.

@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Nov 21, 2025