Conversation

@Acly (Collaborator) commented Apr 29, 2025

This implements support for GGML_OP_CONV_2D_DW in the Vulkan backend.

The motivation is the same as for the CPU version (#1152): while depthwise convolution can be implemented via im2col -> mul_mat, this is quite wasteful, and a direct kernel performs much better.

Timings (W=512, H=512, C=256)

Method                     Layout  Time
ggml_conv_2d_dw (im2col)   WHCN    38.5 ms ± 0.02
ggml_conv_2d_dw_direct     WHCN     1.9 ms ± 0.02
ggml_conv_2d_dw_direct     CWHN     2.1 ms ± 0.01

Measured on an RTX 4070. Larger batch sizes run into maximum allocation size limits when using im2col.

Regarding the separate kernel for CWHN (channels most contiguous, a.k.a. NHWC): it's actually slightly slower than the default memory layout here. I kept it because regular (and transposed) Conv2D generally prefers that layout, and it can avoid extra permute/copy steps.
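
To make the comparison concrete, here is a minimal sketch of how the two paths are expressed through the public ggml API (the ggml_conv_2d_dw and ggml_conv_2d_dw_direct signatures are assumed from #1152 and current ggml.h; the context setup is illustrative only, with no_alloc so no tensor data is actually allocated):

#include "ggml.h"

int main() {
    // illustrative context: no_alloc = true, so tensor data is never allocated
    struct ggml_init_params params = {/*mem_size*/ 16*1024*1024, /*mem_buffer*/ NULL, /*no_alloc*/ true};
    struct ggml_context * ctx = ggml_init(params);

    // WHCN layout: input [W=512, H=512, C=256, N=1], kernel [KW=3, KH=3, 1, C=256]
    struct ggml_tensor * input  = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, 512, 512, 256, 1);
    struct ggml_tensor * kernel = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, 3, 3, 1, 256);

    // im2col-based helper: materializes a large intermediate tensor before mul_mat
    struct ggml_tensor * out_im2col = ggml_conv_2d_dw(ctx, kernel, input, 1, 1, 1, 1, 1, 1);

    // direct op (GGML_OP_CONV_2D_DW): same result, no intermediate tensor
    struct ggml_tensor * out_direct = ggml_conv_2d_dw_direct(ctx, kernel, input, 1, 1, 1, 1, 1, 1);

    (void) out_im2col; (void) out_direct;
    ggml_free(ctx);
    return 0;
}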

@ggerganov requested a review from 0cc4m on April 29, 2025 at 13:30
@ggerganov (Member)

Also tagging @jeffbolznv in case you are interested in taking a look.

FLOAT_TYPE sum = 0.0;
for (uint knl_y = 0; knl_y < p.knl_h; ++knl_y) {
    uint src_y = dst_y * p.stride_y + knl_y * p.dilation_y - p.pad_y;
    if (src_y < 0 || src_y >= p.src_h) {
Contributor

Since src_x and src_y are unsigned, the < 0 conditions can be removed.

Collaborator

Is it possible src_y underflows if pad_y is large enough?

Collaborator Author

Yes, if there is padding the expression can become negative and wrap around to a very large unsigned int, which is then caught by the >= check (for typical values). So in the end it does what's intended, and the src_y < 0 check can be omitted.

The alternative is to use a signed int and keep the check (a bit cleaner, but more instructions). Keeping the < 0 check while using unsigned, as it is now, makes no sense; I'll fix that.
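
As a standalone illustration of the wrap-around behavior under discussion (plain C++, not the shader itself; uint arithmetic in GLSL wraps the same way):

#include <cstdint>
#include <cstdio>

// If the padded source coordinate would be negative, the unsigned
// subtraction wraps to a huge value, so a single >= comparison rejects
// both out-of-bounds sides, as long as sizes stay far below UINT32_MAX.
static bool in_bounds(uint32_t dst_y, uint32_t stride_y, uint32_t knl_y,
                      uint32_t dilation_y, uint32_t pad_y, uint32_t src_h) {
    uint32_t src_y = dst_y * stride_y + knl_y * dilation_y - pad_y; // may wrap
    return src_y < src_h; // wrapped "negative" values are >= src_h
}

int main() {
    std::printf("%d\n", in_bounds(0, 1, 0, 1, 1, 512)); // src_y wraps -> 0
    std::printf("%d\n", in_bounds(5, 1, 0, 1, 1, 512)); // in range    -> 1
}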

Collaborator Author

Removed the check and added a comment to indicate wrapping is intentional.

for (uint knl_x = 0; knl_x < p.knl_w; ++knl_x) {
    uint src_x = dst_x * p.stride_x + knl_x * p.dilation_x - p.pad_x;
    if (src_x < 0 || src_x >= p.src_w) {
        continue;
Contributor

I guess padding is always assumed to have a value of zero, never replicating the border?

Collaborator Author

Yes, there is no way to specify padding modes other than zero for convolutions so far.

// args: input {W, H, C, N}, kernel {KW, KH, 1, C}, stride, padding, dilation, cwhn
test_cases.emplace_back(new test_conv_2d_dw({17, 34, 9, 1}, {3, 3, 1, 9}, 1, 0, 1, true));
test_cases.emplace_back(new test_conv_2d_dw({32, 8, 64, 1}, {3, 3, 1, 64}, 2, 1, 1, false));
test_cases.emplace_back(new test_conv_2d_dw({32, 8, 64, 1}, {3, 3, 1, 64}, 2, 1, 1, true));

Contributor

Please add perf tests that correspond to the examples you gave in the PR.

Collaborator Author

Done
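
The added perf entries presumably mirror the timings from the PR description, along these lines (same test_conv_2d_dw argument order as in the snippet above; a sketch, not a quote from the merged commit):

// hypothetical perf entries matching W=512, H=512, C=256 with a 3x3 kernel
test_cases.emplace_back(new test_conv_2d_dw({512, 512, 256, 1}, {3, 3, 1, 256}, 1, 1, 1, false));
test_cases.emplace_back(new test_conv_2d_dw({512, 512, 256, 1}, {3, 3, 1, 256}, 1, 1, 1, true));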

@jeffbolznv (Contributor)

This is very cool. I didn't realize this op had been added. The change LGTM; I just had one minor suggestion.

@ggerganov merged commit 0482de9 into ggml-org:master May 2, 2025
3 checks passed