
Commit 1936d64

xhcao and wp authored
webgpu: fix dispatch size issue of Transpose operator (#26501)
Co-authored-by: wp <[email protected]>
1 parent b39e144 commit 1936d64

File tree

1 file changed: +6 −8 lines changed


onnxruntime/core/providers/webgpu/tensor/transpose.cc

Lines changed: 6 additions & 8 deletions
```diff
@@ -162,18 +162,16 @@ Status Transpose::DoTranspose(onnxruntime::webgpu::ComputeContext& context,
   uint32_t dispatch_z = 1;
 
   // This temporary workaround addresses a significant performance bottleneck
-  // (~12x slower) for the shape (3, 3, 2560, 1280) due to an issue with Intel's
+  // (~12x slower) for the input shape (1280, 2560, 3, 3) due to an issue with Intel's
   // GPU drivers. We manually normalize the dispatch group size to restore
   // performance.
   //
   // TODO: Revert this change once the driver issue is fixed.
-  if (context.AdapterInfo().vendor == std::string_view{"intel"}) {
-    // Only adjusted the dispatch size when rank is 4 yet.
-    if (rank == static_cast<size_t>(4)) {
-      dispatch_x = ceil_div(input_shape[0] * input_shape[1], 2);
-      dispatch_y = ceil_div(input_shape[2], 4);
-      dispatch_z = ceil_div(input_shape[3], 8);
-    }
+  if (context.AdapterInfo().vendor == std::string_view{"intel"} && rank == 4) {
+    uint32_t dispatch_size = dispatch_x;
+    dispatch_x = 4;
+    dispatch_y = 8;
+    dispatch_z = ceil_div(dispatch_size, dispatch_x * dispatch_y);
   }
   program.SetDispatchGroupSize(dispatch_x, dispatch_y, dispatch_z);
 }
```
