@jianyizh jianyizh commented Jul 28, 2025

Part 1 of #1861.
Tested on shapes from AlexNet training.
On BMG, scoreboard stalls decrease from 831,719 to 497,098 (about 40% fewer). Instruction fetch and distance stalls also improve.

| Shape | Device | Before optimization (ms) | After optimization (ms) |
| --- | --- | --- | --- |
| [4096, 64, 55, 55] | PVC | 8.02 | 5.44 |
| [4096, 64, 55, 55] | BMG | 12.45 | 8.89 |
| [4096, 192, 27, 27] | PVC | 5.72 | 3.85 |
| [4096, 192, 27, 27] | BMG | 9.00 | 5.06 |
| [4096, 256, 13, 13] | PVC | 1.68 | 1.12 |
| [4096, 256, 13, 13] | BMG | 2.83 | 1.35 |
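For readers who want the relative gains rather than raw timings, the ratio before/after can be computed directly from the numbers above. This is a minimal sketch: the `Row` struct and the `speedup` and `print_speedups` helpers are illustrative names, not part of the PR.

```cpp
#include <cassert>
#include <cstdio>

// Illustrative record for one benchmark row (names are hypothetical).
struct Row {
  const char* shape;
  const char* device;
  double before_ms;
  double after_ms;
};

// Speedup is the time before the optimization divided by the time after it.
inline double speedup(double before_ms, double after_ms) {
  assert(after_ms > 0.0);
  return before_ms / after_ms;
}

// Timings copied from the PR description.
static const Row kRows[] = {
    {"[4096, 64, 55, 55]", "pvc", 8.02, 5.44},
    {"[4096, 64, 55, 55]", "bmg", 12.45, 8.89},
    {"[4096, 192, 27, 27]", "pvc", 5.72, 3.85},
    {"[4096, 192, 27, 27]", "bmg", 9.00, 5.06},
    {"[4096, 256, 13, 13]", "pvc", 1.68, 1.12},
    {"[4096, 256, 13, 13]", "bmg", 2.83, 1.35},
};

// Prints one line per row with the speedup rounded to two decimals.
inline void print_speedups() {
  for (const Row& r : kRows) {
    std::printf("%-22s %-4s %.2fx\n", r.shape, r.device,
                speedup(r.before_ms, r.after_ms));
  }
}
```

Calling `print_speedups()` lists the ratio for each shape and device. The ratios fall between roughly 1.4x and 2.1x, with the largest gains on BMG for the smaller spatial sizes.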

@Copilot Copilot AI review requested due to automatic review settings July 28, 2025 08:56
@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds a vectorized code path for the max pooling forward operation when the input uses the channels-last memory layout. On the AlexNet training shapes listed in the PR description, the measured times improve by roughly 1.4x to 2.1x on PVC and BMG. The new path uses vectorized memory operations in a SYCL kernel to improve throughput.

Key changes:

  • Introduces a new vectorized kernel MaxPool2dChannelLastVec that processes multiple channels simultaneously
  • Adds automatic vector size selection (8, 4, 2, or 1) based on data alignment and hardware capabilities
  • Implements dynamic work group sizing based on hardware thread availability
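The vector size selection mentioned in the list above can be sketched as follows. This is an assumption about how such a check is commonly written, not the code merged in the PR: `pick_vec_size` and its parameters are hypothetical, and the real kernel may apply additional hardware-specific conditions. The idea is to take the widest width from {8, 4, 2, 1} for which the channel count is divisible by the width and both buffers are aligned to the width of one vector in bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <initializer_list>

// Hypothetical helper (not taken from the PR): returns the widest vector
// width, in elements, from {8, 4, 2, 1} such that
//   * the channel count is divisible by the width, and
//   * both base addresses are aligned to width * elem_size bytes.
// In a channels-last layout the channels of one pixel are contiguous, so when
// both conditions hold every per-pixel vector load and store is aligned.
inline int pick_vec_size(std::uintptr_t in_addr, std::uintptr_t out_addr,
                         std::int64_t channels, std::size_t elem_size) {
  assert(channels > 0 && elem_size > 0);
  for (int vec : {8, 4, 2}) {
    const std::size_t bytes = static_cast<std::size_t>(vec) * elem_size;
    if (channels % vec == 0 && in_addr % bytes == 0 && out_addr % bytes == 0) {
      return vec;
    }
  }
  return 1;  // Scalar fallback: always valid.
}
```

A caller would pass the buffer addresses via `reinterpret_cast<std::uintptr_t>(ptr)`. For example, a float tensor with 64 channels and 32-byte-aligned buffers would select a width of 8, while a channel count of 3 would always fall back to 1.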

Co-authored-by: Copilot <[email protected]>
@jianyizh jianyizh requested review from toyxu and EikanWang July 28, 2025 09:15
@jianyizh jianyizh requested a review from liangan1 August 13, 2025 02:35
@chuanqi129 chuanqi129 linked an issue Aug 13, 2025 that may be closed by this pull request
@jianyizh jianyizh added this pull request to the merge queue Aug 19, 2025
Merged via the queue into main with commit c091232 Aug 19, 2025
21 checks passed
@jianyizh jianyizh deleted the jianyi/maxpool branch August 19, 2025 05:42
Linked issue: Maxpooling takes too long on BMG