vulkan : kernels for depthwise 2D convolution (CONV_2D_DW) #1204
Conversation
Also tagging @jeffbolznv in case you are interested in taking a look.
```glsl
FLOAT_TYPE sum = 0.0;
for (uint knl_y = 0; knl_y < p.knl_h; ++knl_y) {
    uint src_y = dst_y * p.stride_y + knl_y * p.dilation_y - p.pad_y;
    if (src_y < 0 || src_y >= p.src_h) {
```
Since src_x and src_y are unsigned, the < 0 conditions can be removed.
Is it possible src_y underflows if pad_y is large enough?
Yes, if there is padding the expression can become negative and wrap around to a very large unsigned int, which will then be caught by the `>=` check (for typical values). So in the end it does what's intended, and the `src_y < 0` check can be omitted.
The alternative is to use signed int and keep the check (a bit cleaner, but more instructions). Keeping both the unsigned type and the check, as it is now, makes no sense; I'll fix that.
Removed the check and added a comment to indicate wrapping is intentional.
```glsl
for (uint knl_x = 0; knl_x < p.knl_w; ++knl_x) {
    uint src_x = dst_x * p.stride_x + knl_x * p.dilation_x - p.pad_x;
    if (src_x < 0 || src_x >= p.src_w) {
        continue;
```
I guess padding is always assumed to have a value of zero, never replicating the border?
Yes, there is no way to specify padding modes other than zero for convolutions so far.
```cpp
test_cases.emplace_back(new test_conv_2d_dw({17, 34, 9, 1}, {3, 3, 1, 9}, 1, 0, 1, true));
test_cases.emplace_back(new test_conv_2d_dw({32, 8, 64, 1}, {3, 3, 1, 64}, 2, 1, 1, false));
test_cases.emplace_back(new test_conv_2d_dw({32, 8, 64, 1}, {3, 3, 1, 64}, 2, 1, 1, true));
```
Please add perf tests that correspond to the examples you gave in the PR.
Done
This is very cool. I didn't realize this op had been added. The change LGTM, I just had one minor suggestion.
This implements support for `GGML_OP_CONV_2D_DW` in the Vulkan backend. Motivation is the same as for CPU (#1152): while depthwise convolution can be implemented via im2col -> mul_mat, this is quite wasteful, and a direct kernel performs much better.
Timings (W=512, H=512, C=256): `ggml_conv_2d_dw` (im2col) vs. `ggml_conv_2d_dw_direct` (default layout) vs. `ggml_conv_2d_dw_direct` (CWHN). Measured on RTX 4070. Larger batch sizes run into max allocation issues when using im2col.
Regarding the separate kernel for CWHN (channels most contiguous, a.k.a. NHWC): it's actually slightly slower than the default memory layout here. I kept it because regular (and transposed) Conv2D generally prefers it, and it can avoid extra permute/copy steps.