[CK_TILE] Stream-K XCD remapping #4279
Open
assistant-librarian[bot] wants to merge 10 commits into develop from
This change adds a function that remaps block IDs from their original round-robin assignment to a contiguous layout across XCDs. The function is added to the StreamKTilePartitioner and called in the operator() functions. Unit tests verify the function's correctness on minimal arrays. These changes should improve locality and the cache hit rate, and therefore overall performance.
ecamartins
reviewed
Feb 12, 2026
projects/composablekernel/include/ck_tile/ops/gemm/kernel/streamk_gemm/streamk_gemm_xcd.hpp
Outdated
…evelop/ROCm_composable_kernel/pr-3652
…g runs to the gfx942 architecture
80369c4 to
1d1c3a9
ecamartins
reviewed
Feb 25, 2026
projects/composablekernel/include/ck_tile/ops/gemm/kernel/streamk_gemm/streamk_gemm_kernel.hpp
Outdated
ecamartins
reviewed
Feb 25, 2026
projects/composablekernel/include/ck_tile/ops/gemm/kernel/streamk_gemm/streamk_gemm_xcd.hpp
Outdated
cgmillette
reviewed
Feb 27, 2026
projects/composablekernel/include/ck_tile/ops/gemm/kernel/streamk_gemm/streamk_gemm_kernel.hpp
Outdated
cgmillette
reviewed
Feb 27, 2026
projects/composablekernel/include/ck_tile/ops/gemm/kernel/streamk_gemm/streamk_gemm_kernel.hpp
Outdated
cgmillette
reviewed
Feb 27, 2026
...lekernel/include/ck_tile/ops/gemm/kernel/streamk_gemm/streamk_gemm_tile_partitioner_impl.hpp
cgmillette
reviewed
Feb 27, 2026
projects/composablekernel/test/ck_tile/gemm_streamk/test_streamk_tile_partitioner.cpp
cgmillette
reviewed
Feb 27, 2026
projects/composablekernel/test/ck_tile/gemm_streamk/test_streamk_tile_partitioner.cpp
Outdated
cgmillette
reviewed
Feb 27, 2026
...lekernel/include/ck_tile/ops/gemm/kernel/streamk_gemm/streamk_gemm_tile_partitioner_impl.hpp
cgmillette
reviewed
Feb 27, 2026
projects/composablekernel/include/ck_tile/ops/gemm/kernel/streamk_gemm/streamk_gemm_kernel.hpp
Outdated
This commit removes the enum that mapped XCD counts by architecture and instead queries the number of XCDs from the device through the HIP API. The unit tests now hardcode XCD values to keep them simple.
0edc006 to
23361e2
Proposed changes
This PR adds support for XCD remapping as detailed in this document. On gfx942, workgroups are typically scheduled round-robin across XCDs, which can lead to poor locality. The remapping assigns the workgroups on each XCD to contiguous tiles, improving locality and the cache hit rate. It is implemented by a function, taken from this PR, that computes the contiguous mapping; we have added it to the StreamKTilePartitioner. The Stream-K algorithm needs only minimal changes: a single remap at the time the workgroups are partitioned. The higher cache hit rate should close the performance gaps seen with the default scheduling. Unit tests have been added to verify the function in isolation. The optimization is not specific to Stream-K GEMM and can be applied to other GEMM kernels.
Note: This applies only to gfx942, since that architecture introduces XCDs.
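The contiguous mapping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the function added to the StreamKTilePartitioner: the names `remap_block_id` and `remap_is_permutation` are hypothetical, and the sketch assumes that hardware block `i` is dispatched to XCD `i % num_xcds`. When the grid size is not a multiple of the XCD count, the first few XCDs are assumed to receive one extra block.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not the CK Tile API. Maps a hardware block id, assumed
// to be dispatched round-robin (block_id % num_xcds selects the XCD), to a
// logical id so that each XCD owns one contiguous range of logical ids.
inline std::uint32_t remap_block_id(std::uint32_t block_id,
                                    std::uint32_t grid_size,
                                    std::uint32_t num_xcds)
{
    assert(num_xcds > 0 && block_id < grid_size);

    const std::uint32_t xcd      = block_id % num_xcds; // XCD this block lands on
    const std::uint32_t local_id = block_id / num_xcds; // position within that XCD

    // When grid_size is not a multiple of num_xcds, the first `tall` XCDs
    // receive one more block than the remaining ones.
    const std::uint32_t per_xcd = (grid_size + num_xcds - 1) / num_xcds;
    const std::uint32_t rem     = grid_size % num_xcds;
    const std::uint32_t tall    = (rem == 0) ? num_xcds : rem;

    if(xcd < tall)
        return xcd * per_xcd + local_id;
    return tall * per_xcd + (xcd - tall) * (per_xcd - 1) + local_id;
}

// Returns true when the remap hits every logical id in [0, grid_size) exactly once.
inline bool remap_is_permutation(std::uint32_t grid_size, std::uint32_t num_xcds)
{
    std::vector<bool> seen(grid_size, false);
    for(std::uint32_t id = 0; id < grid_size; ++id)
    {
        const std::uint32_t mapped = remap_block_id(id, grid_size, num_xcds);
        if(mapped >= grid_size || seen[mapped])
            return false;
        seen[mapped] = true;
    }
    return true;
}
```

With 8 blocks on 4 XCDs, hardware ids 0 through 7 map to logical ids 0, 2, 4, 6, 1, 3, 5, 7. Blocks 0 and 4, which share XCD 0 under the round-robin assumption, then work on logical ids 0 and 1, which are neighbours. A permutation check like the one above is one way to test such a function on small arrays.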
Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.
clang-format on all changed files
🔁 Imported from ROCm/composable_kernel#3652
🧑💻 Originally authored by @arai713