Conversation

@daijh (Contributor) commented Nov 19, 2025

Description

This PR optimizes the Conv operation by implementing two new compute shaders: oihw_to_ohwi and im2col-matmul.

oihw_to_ohwi:
Improves performance over the default Transpose shader by staging tiles in workgroup memory so that both memory reads and writes stay contiguous (coalesced).
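For reference, the layout change itself is a plain [0, 2, 3, 1] permutation; a minimal CPU sketch of the index mapping (the helper name `OihwToOhwi` is hypothetical, and the GPU shader additionally tiles through workgroup memory):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// CPU reference for the OIHW -> OHWI permutation ([0, 2, 3, 1]).
// The shader computes the same mapping, but stages tiles in workgroup
// memory so reads and writes both stay contiguous.
std::vector<float> OihwToOhwi(const std::vector<float>& src,
                              uint32_t O, uint32_t I, uint32_t H, uint32_t W) {
  std::vector<float> dst(src.size());
  for (uint32_t o = 0; o < O; ++o)
    for (uint32_t i = 0; i < I; ++i)
      for (uint32_t h = 0; h < H; ++h)
        for (uint32_t w = 0; w < W; ++w)
          // dst is OHWI-contiguous; src is OIHW-contiguous.
          dst[((o * H + h) * W + w) * I + i] =
              src[((o * I + i) * H + h) * W + w];
  return dst;
}
```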

im2col-matmul:

  • Employs a workgroup size of 64.
  • Dynamically selects tile sizes (32x64 or 16x64) based on the source/weight shape.
  • Each invocation handles a dedicated weight element.
  • Uses subgroupShuffle to efficiently access the source tile, leveraging k_vec4 vectorization for better memory throughput.

Testing on Lunar Lake demonstrated up to an 87% performance improvement in Conv_2D operations.
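As background on the im2col-matmul approach: im2col rewrites the convolution as a matrix multiply by gathering each receptive field into one row, after which the conv is a (out_h*out_w) x (kh*kw*C) by (kh*kw*C) x O GEMM against the OHWI weights. A minimal single-image CPU sketch, assuming stride 1 and no padding (the function name is hypothetical; the actual shader fuses this with the matmul and vectorizes along K):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// im2col for one NHWC image, stride 1, no padding (assumptions for this
// sketch).  Each output row holds one kh x kw x C receptive field.
std::vector<float> Im2Col(const std::vector<float>& src,
                          uint32_t H, uint32_t W, uint32_t C,
                          uint32_t kh, uint32_t kw) {
  const uint32_t out_h = H - kh + 1, out_w = W - kw + 1;
  std::vector<float> col(out_h * out_w * kh * kw * C);
  uint32_t row = 0;
  for (uint32_t oh = 0; oh < out_h; ++oh)
    for (uint32_t ow = 0; ow < out_w; ++ow, ++row) {
      uint32_t k = 0;
      for (uint32_t r = 0; r < kh; ++r)
        for (uint32_t s = 0; s < kw; ++s)
          for (uint32_t c = 0; c < C; ++c, ++k)
            col[row * kh * kw * C + k] =
                src[((oh + r) * W + (ow + s)) * C + c];
    }
  return col;
}
```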

Motivation and Context

See above.

@daijh (Contributor, Author) commented Nov 19, 2025

Lunar Lake, onnxruntime commit d55ade0

| Operation | conv2d-mm (ms) | im2col-matmul (ms) |
| --- | --- | --- |
| src: 1x128x512x512, weight: 128x128x3x3 | 56.071 | 42.824 |
| src: 1x2560x8x8, weight: 1280x2560x3x3 | 21.066 | 11.263 |
| src: 1x1280x8x8, weight: 1280x1280x3x3 | 10.384 | 6.357 |

sd-turbo

| Model | conv2d-mm (ms) | im2col-matmul (ms) |
| --- | --- | --- |
| sd-turbo-unet-fp16-demo.onnx | 1010.245 | 612.092 |
| sd-turbo-vae-decoder-fp16-demo.onnx | 2317.391 | 1848.545 |

@daijh (Contributor, Author) commented Nov 19, 2025

@guschmue @fs-eire @qjia7 PTAL.

@guschmue added the `ep:WebGPU` (ort-web webgpu provider) label on Nov 21, 2025
const uint32_t kernel_height = onnxruntime::narrow<uint32_t>(kernel_shape[2]);
const uint32_t kernel_width = onnxruntime::narrow<uint32_t>(kernel_shape[3]);

TensorShape nhwc_kernel_shape{channel_output, kernel_height, kernel_width, channel_input};

Suggested change:
- TensorShape nhwc_kernel_shape{channel_output, kernel_height, kernel_width, channel_input};
+ TensorShape ohwi_kernel_shape{channel_output, kernel_height, kernel_width, channel_input};

const uint32_t kernel_width = onnxruntime::narrow<uint32_t>(kernel_shape[3]);

TensorShape nhwc_kernel_shape{channel_output, kernel_height, kernel_width, channel_input};
Tensor nhwc_kernel = context.CreateGPUTensor(kernel->DataType(), nhwc_kernel_shape);
Suggested change:
- Tensor nhwc_kernel = context.CreateGPUTensor(kernel->DataType(), nhwc_kernel_shape);
+ Tensor ohwi_kernel = context.CreateGPUTensor(kernel->DataType(), nhwc_kernel_shape);


const uint32_t M_tiles = ceil_div(im2col_m, tile_m);
const uint32_t N_tiles = ceil_div(im2col_n, tile_n);
im2col_mm_program.SetDispatchGroupSize(M_tiles, N_tiles, batch);
How about enhancing the current TransposeProgram with a shared-memory path instead of adding a new shader?

You are transposing from perm [0, 1, 2, 3] to perm [0, 2, 3, 1], which is equivalent to transposing [o, i, hw] to [o, hw, i]. You could extend DoTranspose's shared path to support any transpose that only swaps the last two dimensions and leaves the preceding dimensions unchanged. Currently, the shared path only supports a 2-d transpose from perm [0, 1] to perm [1, 0]; it could be extended to handle [0, 1, 2] -> [0, 2, 1] by reshaping the tensor into a 3-d tensor [d0 * d1 * ... * dn-3, dn-2, dn-1] whenever only the last two dimensions are transposed.

for (var inner_k_idx = 0u; inner_k_idx < TILE_K_VEC_SIZE; inner_k_idx++) {
let weight_data = weight_tile[inner_k_idx][local_idx];
#if use_subgroup
let src_data = src_tile[inner_k_idx][sg_id];

What happens if sg_size is larger or smaller than TILE_M_SIZE?
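For context on the question: `subgroupShuffle` can only read values held by lanes within the caller's own subgroup, so distributing a TILE_M_SIZE-wide tile row across lanes implicitly assumes the subgroup is at least that wide. A toy sufficiency check (the helper and its use are hypothetical, not code from the PR):

```cpp
#include <cassert>
#include <cstdint>

// subgroupShuffle(value, id) reads from lane `id` of the caller's own
// subgroup only.  If a tile row is spread across TILE_M_SIZE lanes, every
// lane index 0..TILE_M_SIZE-1 must exist inside one subgroup; otherwise the
// kernel would need a non-subgroup fallback path.  Hypothetical helper.
bool ShuffleCoversTile(uint32_t sg_size, uint32_t tile_m_size) {
  return sg_size >= tile_m_size;
}
```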
