[CK_TILE] Stream-K XCD remapping#4279

Open
assistant-librarian[bot] wants to merge 10 commits into develop from import/develop/ROCm_composable_kernel/pr-3652

@assistant-librarian
Contributor

Proposed changes

This PR adds support for XCD remapping, as detailed in this document. On gfx942, workgroups are typically scheduled round-robin across XCDs, so workgroups with consecutive IDs land on different XCDs, which can lead to poor locality. We use a remapping that assigns the workgroups on each XCD to contiguous tiles, improving locality and the cache hit rate. The remapping is done through the function from this PR that computes this contiguous mapping, which we have added to the StreamKTilePartitioner. The Stream-K algorithm itself needs only minimal changes: a remap at the time the workgroups are partitioned. Improving data locality in this way should raise the cache hit rate and help close the performance gaps seen with the default scheduling. Unit tests have been added to verify the function in isolation. The optimization is not specific to Stream-K GEMM and can be applied to other GEMM kernels.

Note: This only applies to gfx942, since that is the architecture that introduces XCDs.
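
For reviewers who want a concrete picture of the remapping, here is a minimal sketch. It is not the function added by this PR: the name `remap_block_id_contiguous` and its parameters are hypothetical, and it assumes that the workgroup with hardware ID `i` is dispatched to XCD `i % num_xcds`. Under that assumption it returns a logical ID such that each XCD owns one contiguous range of logical IDs, including when the workgroup count is not a multiple of the XCD count.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper for illustration only (not the PR's implementation).
// Assumes round-robin dispatch: hardware block_id i runs on XCD (i % num_xcds).
// Returns a logical id so that each XCD owns one contiguous range of ids.
inline std::uint32_t remap_block_id_contiguous(std::uint32_t block_id,
                                               std::uint32_t num_blocks,
                                               std::uint32_t num_xcds)
{
    assert(num_xcds > 0 && block_id < num_blocks);

    const std::uint32_t xcd   = block_id % num_xcds; // XCD this workgroup lands on
    const std::uint32_t local = block_id / num_xcds; // its arrival order on that XCD

    const std::uint32_t base  = num_blocks / num_xcds; // every XCD gets at least this many
    const std::uint32_t extra = num_blocks % num_xcds; // the first `extra` XCDs get one more

    // Start of this XCD's contiguous range: `base` ids for each earlier XCD,
    // plus one extra id for each earlier XCD that holds an extra workgroup.
    const std::uint32_t start = xcd * base + (xcd < extra ? xcd : extra);
    return start + local;
}
```

With 8 workgroups and 4 XCDs, hardware IDs 0 and 4 both run on XCD 0 and are mapped to logical IDs 0 and 1, so the two workgroups that share an XCD work on neighbouring tiles. The result is a permutation of the original IDs, so every tile is still covered exactly once.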

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which helps the maintainers understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

🔁 Imported from ROCm/composable_kernel#3652
🧑‍💻 Originally authored by @arai713

This change adds a function to remap block IDs from their original round-robin
assignment to a contiguous layout across XCDs. This function is added
to the StreamKTilePartitioner and called in the operator() functions. There
are also unit tests to verify the correctness of the function on minimal
arrays. These changes should improve locality and the cache hit rate, therefore
improving performance overall.
@assistant-librarian assistant-librarian bot added the external contribution label Feb 3, 2026
@arai713 arai713 force-pushed the import/develop/ROCm_composable_kernel/pr-3652 branch from 80369c4 to 1d1c3a9 Compare February 16, 2026 22:52
@arai713 arai713 marked this pull request as ready for review February 23, 2026 21:55
@arai713 arai713 requested a review from a team as a code owner February 23, 2026 21:55
cgmillette and others added 2 commits March 11, 2026 07:27
This commit removes the use of an enum to map XCD values by architecture
and switches to querying the number of XCDs from the device through the
HIP API. The unit tests have been changed to hardcode XCD values to
simplify them.
@arai713 arai713 force-pushed the import/develop/ROCm_composable_kernel/pr-3652 branch from 0edc006 to 23361e2 Compare March 11, 2026 07:28
Contributor

@cgmillette cgmillette left a comment

LGTM

@cgmillette cgmillette enabled auto-merge (squash) March 18, 2026 15:29

4 participants