Skip to content

add vectorization path on maxpool backward channel last #1907

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

chunhuanMeng
Copy link
Contributor

@chunhuanMeng chunhuanMeng commented Aug 5, 2025

Part 2 of #1861
on PVC, 101,628 Scoreboard stalls decrease to 75,976. Significantly fewer instruction fetch and distance stalls, enabling higher effective bandwidth to HBM.

shape device before opt after opt
[4096, 64, 27, 27] pvc 27.10ms 12.70 ms
[4096, 192, 13, 13] pvc 17.97ms 8.51 ms
[4096, 256, 6, 6] pvc 5.10 ms 2.47 ms

@Copilot Copilot AI review requested due to automatic review settings August 5, 2025 07:10
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This pull request adds a vectorization path for maxpool backward operations in channel-last memory layout to improve performance. The change introduces a new templated kernel implementation that processes multiple elements simultaneously using vector operations.

  • Refactors existing backward kernel to accumulate gradients locally before writing
  • Adds new vectorized kernel implementation for channel-last memory layout
  • Includes vectorization logic (currently commented out) with macro for launching vectorized kernels

// case 4:
// LAUNCH_MAXPOOL_BACKWARD_CHANNEL_LAST_VEC(
// scalar_t,
// 1,
Copy link
Preview

Copilot AI Aug 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The vec_size parameter should be 4, not 1, for the case 4 branch. This appears to be a copy-paste error that would prevent proper vectorization when vec_size is 4.

Suggested change
// 1,
// 4,

Copilot uses AI. Check for mistakes.

Comment on lines +394 to +395
grad_vec[i] = static_cast<scalar_t>(grad_vec[i]) +
static_cast<scalar_t>(gout_val_vec[i]);
Copy link
Preview

Copilot AI Aug 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cast static_cast<scalar_t>(grad_vec[i]) is redundant since grad_vec[i] is already of type scalar_t. This should be simplified to grad_vec[i] += static_cast<scalar_t>(gout_val_vec[i]);

Suggested change
grad_vec[i] = static_cast<scalar_t>(grad_vec[i]) +
static_cast<scalar_t>(gout_val_vec[i]);
grad_vec[i] += static_cast<scalar_t>(gout_val_vec[i]);

Copilot uses AI. Check for mistakes.

@jianyizh jianyizh requested a review from liangan1 August 13, 2025 02:39
@chuanqi129 chuanqi129 linked an issue Aug 13, 2025 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Maxpooling takes too long on BMG
2 participants