[webgpu] Optimize Conv by im2col-matmul #26603
Conversation
[Benchmark table: per-operation timings for sd-turbo on Lunar Lake; table contents not recoverable]
Review comment on:

```cpp
const uint32_t kernel_height = onnxruntime::narrow<uint32_t>(kernel_shape[2]);
const uint32_t kernel_width = onnxruntime::narrow<uint32_t>(kernel_shape[3]);
// ...
TensorShape nhwc_kernel_shape{channel_output, kernel_height, kernel_width, channel_input};
```
The layout being produced here is OHWI, not NHWC, so the name should say so.

Suggested change:

```diff
-TensorShape nhwc_kernel_shape{channel_output, kernel_height, kernel_width, channel_input};
+TensorShape ohwi_kernel_shape{channel_output, kernel_height, kernel_width, channel_input};
```
Review comment on:

```cpp
TensorShape nhwc_kernel_shape{channel_output, kernel_height, kernel_width, channel_input};
Tensor nhwc_kernel = context.CreateGPUTensor(kernel->DataType(), nhwc_kernel_shape);
```
Suggested change:

```diff
-Tensor nhwc_kernel = context.CreateGPUTensor(kernel->DataType(), nhwc_kernel_shape);
+Tensor ohwi_kernel = context.CreateGPUTensor(kernel->DataType(), nhwc_kernel_shape);
```
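For context, the `oihw_to_ohwi` shader computes the OIHW → OHWI layout transform on the kernel tensor. A minimal CPU-side sketch of the index mapping, assuming contiguous row-major storage (the function and names here are illustrative, not the PR's WebGPU code):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical CPU reference for the OIHW -> OHWI layout transform.
// src is laid out as [O][I][H][W]; dst as [O][H][W][I].
std::vector<float> OihwToOhwi(const std::vector<float>& src,
                              uint32_t O, uint32_t I, uint32_t H, uint32_t W) {
  std::vector<float> dst(src.size());
  for (uint32_t o = 0; o < O; ++o)
    for (uint32_t i = 0; i < I; ++i)
      for (uint32_t h = 0; h < H; ++h)
        for (uint32_t w = 0; w < W; ++w)
          dst[((o * H + h) * W + w) * I + i] = src[((o * I + i) * H + h) * W + w];
  return dst;
}
```

Per the PR description, the shader version stages tiles of this mapping through workgroup memory so that both the global reads and the global writes stay contiguous, which is where the speedup over the default Transpose shader comes from.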
Review comment on:

```cpp
const uint32_t M_tiles = ceil_div(im2col_m, tile_m);
const uint32_t N_tiles = ceil_div(im2col_n, tile_n);
im2col_mm_program.SetDispatchGroupSize(M_tiles, N_tiles, batch);
```
How about enhancing the current TransposeProgram with a shared-memory path instead of adding a new shader?

You are transposing from perm [0, 1, 2, 3] to perm [0, 2, 3, 1], which is equivalent to transposing [o, i, hw] to [o, hw, i]. DoTranspose's shared path could be extended to support any permutation that only swaps the last two dimensions and leaves the leading dimensions unchanged. Currently, the shared path only supports a 2D transpose from perm [0, 1] to perm [1, 0]. We can extend it to transpose [0, 1, 2] to [0, 2, 1]: whenever a permutation only swaps the last two dimensions, reshape the input into a 3D tensor [d0 * d1 * ... * dn-3, dn-2, dn-1] and run the tiled 2D transpose once per leading slice.
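A minimal sketch of the reshape logic this comment proposes; the helper names are hypothetical, not onnxruntime's actual API:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true if `perm` keeps every axis in place except the last two,
// which are swapped, e.g. [1, 0], [0, 2, 1], or [0, 1, 3, 2].
bool IsLastTwoDimsSwap(const std::vector<size_t>& perm) {
  const size_t rank = perm.size();
  if (rank < 2) return false;
  for (size_t i = 0; i + 2 < rank; ++i)
    if (perm[i] != i) return false;
  return perm[rank - 2] == rank - 1 && perm[rank - 1] == rank - 2;
}

// Collapses [d0, d1, ..., dn-3, dn-2, dn-1] into [d0*...*dn-3, dn-2, dn-1]
// so the tiled 2D shared-memory transpose can run once per leading slice.
// Assumes rank >= 2 (checked by IsLastTwoDimsSwap above).
std::vector<uint64_t> CollapseTo3D(const std::vector<uint64_t>& dims) {
  uint64_t batch = 1;
  for (size_t i = 0; i + 2 < dims.size(); ++i) batch *= dims[i];
  return {batch, dims[dims.size() - 2], dims[dims.size() - 1]};
}
```

Under that extension, the [o, i, hw] → [o, hw, i] transpose in this PR would map onto the existing shared path with the leading `o` dimension acting as the batch.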
Review comment on:

```wgsl
for (var inner_k_idx = 0u; inner_k_idx < TILE_K_VEC_SIZE; inner_k_idx++) {
  let weight_data = weight_tile[inner_k_idx][local_idx];
#if use_subgroup
  let src_data = src_tile[inner_k_idx][sg_id];
```
What if sg_size is larger or smaller than TILE_M_SIZE?
Description

This PR optimizes the Conv operator by implementing two new compute shaders: `oihw_to_ohwi` and `im2col-matmul`.

- `oihw_to_ohwi`: improves performance over the default Transpose shader by utilizing workgroup memory to ensure contiguous memory read/write patterns.
- `im2col-matmul`: lowers the convolution to an im2col transform followed by a matrix multiplication (see the sketch below). Testing on Lunar Lake demonstrated up to an 87% performance improvement in Conv_2D operations.
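For readers unfamiliar with the technique, here is a minimal CPU-side sketch of im2col for a single NHWC image, simplified to square kernels without dilation; all names and parameters are illustrative, not the PR's shader code. Each output position's receptive field is unrolled into one row, so the convolution becomes an [out_h*out_w, kh*kw*C] x [kh*kw*C, O] matrix multiply against the OHWI-reshaped kernel:

```cpp
#include <cstdint>
#include <vector>

// Minimal im2col for one NHWC image: every output pixel becomes one row
// holding its kh*kw*C receptive field. Out-of-bounds taps stay zero (padding).
std::vector<float> Im2Col(const std::vector<float>& src,
                          int H, int W, int C,
                          int kh, int kw, int stride, int pad,
                          int out_h, int out_w) {
  std::vector<float> col(static_cast<size_t>(out_h) * out_w * kh * kw * C, 0.0f);
  for (int oy = 0; oy < out_h; ++oy) {
    for (int ox = 0; ox < out_w; ++ox) {
      float* row = &col[(static_cast<size_t>(oy) * out_w + ox) * kh * kw * C];
      for (int ky = 0; ky < kh; ++ky) {
        const int iy = oy * stride - pad + ky;
        if (iy < 0 || iy >= H) continue;  // zero padding
        for (int kx = 0; kx < kw; ++kx) {
          const int ix = ox * stride - pad + kx;
          if (ix < 0 || ix >= W) continue;  // zero padding
          for (int c = 0; c < C; ++c)
            row[(ky * kw + kx) * C + c] =
                src[(static_cast<size_t>(iy) * W + ix) * C + c];
        }
      }
    }
  }
  return col;  // multiply by the [kh*kw*C, O] reshaped OHWI kernel to get the conv output
}
```

Judging by the single `im2col_mm_program` dispatched as M_tiles x N_tiles workgroups per batch in the diff above, the PR appears to perform this transform on the GPU inside the tiled matmul rather than materializing the full column matrix.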
Motivation and Context
See above.