add vectorization path on maxpool backward channel last #1907
base: main
Pull Request Overview
This pull request adds a vectorization path for maxpool backward operations in channel-last memory layout to improve performance. The change introduces a new templated kernel implementation that processes multiple elements simultaneously using vector operations.
- Refactors existing backward kernel to accumulate gradients locally before writing
- Adds new vectorized kernel implementation for channel-last memory layout
- Includes vectorization logic (currently commented out) with macro for launching vectorized kernels
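As a rough illustration of the first two points, the host-side sketch below accumulates gradients in a small local vector and writes each input element exactly once. It is not the kernel from this PR: it is plain C++ rather than a device kernel, it covers only 1-D pooling without padding or dilation, and the function name, argument layout, and the requirement that `channels` be divisible by `vec_size` are all assumptions made for the example.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not the PR's kernel: backward pass of 1-D max pooling
// for one sample stored channel-last, i.e. as [length][channels].
// indices[o * channels + c] holds the input position that won the max for
// output position o and channel c.
template <typename scalar_t, int vec_size>
void maxpool1d_backward_channel_last_vec(
    const std::vector<scalar_t>& grad_output,  // [out_len][channels]
    const std::vector<int64_t>& indices,       // [out_len][channels]
    std::vector<scalar_t>& grad_input,         // [in_len][channels]
    int in_len, int out_len, int channels, int kernel, int stride) {
  // Assumption for this sketch: no scalar tail loop, so channels must divide evenly.
  assert(channels % vec_size == 0);
  for (int p = 0; p < in_len; ++p) {
    // Output positions o whose window [o*stride, o*stride + kernel) contains p.
    const int o_begin = p < kernel ? 0 : (p - kernel) / stride + 1;
    const int o_end = std::min(p / stride + 1, out_len);
    for (int c = 0; c < channels; c += vec_size) {
      // Local accumulator for vec_size adjacent channels, zero-initialized.
      std::array<scalar_t, vec_size> grad_vec{};
      for (int o = o_begin; o < o_end; ++o) {
        const int base = o * channels + c;
        for (int i = 0; i < vec_size; ++i) {
          if (indices[base + i] == p) {
            grad_vec[i] += grad_output[base + i];
          }
        }
      }
      // Single contiguous write per input element; no atomics are needed
      // because each (p, c) pair is owned by exactly one loop iteration.
      for (int i = 0; i < vec_size; ++i) {
        grad_input[p * channels + c + i] = grad_vec[i];
      }
    }
  }
}
```

Because channels are the innermost dimension in a channel-last layout, the `vec_size` elements read and written per step are contiguous in memory, which is what makes a vector load/store path possible in a real device kernel.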
// case 4:
// LAUNCH_MAXPOOL_BACKWARD_CHANNEL_LAST_VEC(
// scalar_t,
// 1,
The vec_size parameter should be 4, not 1, for the case 4 branch. This appears to be a copy-paste error that would prevent proper vectorization when vec_size is 4.
- // 1,
+ // 4,
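The intent of the suggestion can be sketched as a dispatch in which every `case` label forwards the same value as the template argument. The names below are hypothetical stand-ins (the real code uses the `LAUNCH_MAXPOOL_BACKWARD_CHANNEL_LAST_VEC` macro, whose full argument list is not shown here), and the set of supported widths (4, 2, 1) is an assumption.

```cpp
#include <stdexcept>

// Hypothetical stand-in for a vectorized kernel launch. It only reports the
// vector width it was instantiated with, so the dispatch can be checked.
template <typename scalar_t, int vec_size>
int launch_backward_channel_last_vec() {
  return vec_size;
}

// Runtime-to-compile-time dispatch: each case passes its own label as the
// template argument. Passing 1 under `case 4` would still compile and run,
// but it would silently fall back to the non-vectorized width.
template <typename scalar_t>
int dispatch_by_vec_size(int vec_size) {
  switch (vec_size) {
    case 4:
      return launch_backward_channel_last_vec<scalar_t, 4>();
    case 2:
      return launch_backward_channel_last_vec<scalar_t, 2>();
    case 1:
      return launch_backward_channel_last_vec<scalar_t, 1>();
    default:
      throw std::invalid_argument("unsupported vec_size");
  }
}
```

A mismatch of this kind produces no compiler error, which is why it is easy to miss in review and worth covering with a check like `dispatch_by_vec_size<float>(4) == 4`.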
grad_vec[i] = static_cast<scalar_t>(grad_vec[i]) +
static_cast<scalar_t>(gout_val_vec[i]);
The cast static_cast<scalar_t>(grad_vec[i]) is redundant since grad_vec[i] is already of type scalar_t. This should be simplified to grad_vec[i] += static_cast<scalar_t>(gout_val_vec[i]);
- grad_vec[i] = static_cast<scalar_t>(grad_vec[i]) +
- static_cast<scalar_t>(gout_val_vec[i]);
+ grad_vec[i] += static_cast<scalar_t>(gout_val_vec[i]);
Co-authored-by: Copilot <[email protected]>
Part 2 of #1861
On PVC, scoreboard stalls drop from 101,628 to 75,976 (about 25% fewer). There are also significantly fewer instruction-fetch and distance stalls, which enables higher effective bandwidth to HBM.