[CK_Tile] Refactor amdgcn_mma policy structs by krithalith · Pull Request #5272 · ROCm/rocm-libraries

krithalith · 2026-03-10T10:40:10Z

Motivation

This MR simplifies the intrinsic layout parameters and makes them clearer and more flexible. It also includes a number of small refactors that reduce boilerplate and code duplication.

Technical Details

In CK Tile and old CK, the full set of information available in the intrinsic wrappers, for WMMA and MFMA combined, would be something like:

// Basic info
using ADataType = void;
using BDataType = void;
using CDataType = void;

using AVecType = ext_vector_t<ADataType, 0>;
using BVecType = ext_vector_t<BDataType, 0>;
using CVecType = ext_vector_t<CDataType, 0>;

// Fragment sizes
static constexpr index_t kM;
static constexpr index_t kN;
static constexpr index_t kK;

// Layout parameters
static constexpr index_t kAMBlock;
static constexpr index_t kBNBlock;

static constexpr index_t kRepeat;
static constexpr index_t kAMLane;
static constexpr index_t kBNLane;
static constexpr index_t kABK0PerLane;
static constexpr index_t kABKLane;
static constexpr index_t kABK1PerLane;

static constexpr index_t kCMLane;
static constexpr index_t kCNLane;
static constexpr index_t kCM0PerLane;
static constexpr index_t kCM1PerLane;

using kABPs2RHssMajor = sequence<2, 1>;
using kABPs2RHssMinor = sequence<1, 0>;
using kABYs2RHsMajor  = sequence<2, 2>;
using kABYs2RHsMinor  = sequence<0, 2>;

using kCPs2RHssMajor = sequence<1, 2>;
using kCPs2RHssMinor = sequence<1, 0>;
using kCYs2RHsMajor  = sequence<1, 1>;
using kCYs2RHsMinor  = sequence<0, 2>;

using kCTPs2RHssMajor = sequence<2, 1>;
using kCTPs2RHssMinor = sequence<1, 0>;
using kCTYs2RHsMajor  = sequence<2, 2>;
using kCTYs2RHsMinor  = sequence<0, 2>;

Note that on top of the intrinsic sizes, we have 12 layout parameters. I have reduced this in the new design to:

// Basic info
using ADataType = void;
using BDataType = void;
using CDataType = void;

// Fragment sizes
static constexpr index_t kM;
static constexpr index_t kN;
static constexpr index_t kK;

// Layout parameters
static constexpr index_t kABKPerLane;  // K2 * K0, Always the same, even for diff A / B layouts
static constexpr index_t kAKNumAccess; // K2
static constexpr index_t kARepeat;     // Used for RDNA3 repeated inputs and CDNA block hiding.
static constexpr index_t kBKNumAccess; // K2
static constexpr index_t kBRepeat;     // Used for RDNA3 repeated inputs and CDNA block hiding.
static constexpr index_t kCMPerLane;   // M2 * M0
static constexpr index_t kCMNumAccess; // M2

// Derived properties
using AVecType = ext_vector_t<ADataType, 0>;
using BVecType = ext_vector_t<BDataType, 0>;
using CVecType = ext_vector_t<CDataType, 0>;

Note that there are now only 7 layout parameters and no more dimensionality orderings. Believe it or not, these 7 parameters are more general than the original 12 and can handle intrinsic and mid-level features that are currently awkward in CK Tile, such as AttrNumAccess, different A / B layouts, more general block-hiding (currently very limited in CK Tile), and future arch features.

Furthermore, the A, B and C vec types are now derived directly from the layout parameters to ensure internal consistency.
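The relationship between the new K-dimension parameters and the unmerge factors can be sketched as below. This is only an illustration: `KLayoutSketch`, its template parameters, and the example numbers are hypothetical and do not appear in the PR. The relations `kABKPerLane = K2 * K0` and `kAKNumAccess = K2` come from the comments above; treating the remaining factor `K1` as the part spread across lanes is an assumption.

```cpp
#include <cstdint>

using index_t = std::int32_t; // stand-in for ck_tile::index_t (assumption)

// Hypothetical helper: recovers K0, K1, K2 from the fragment size K and
// the two K-related layout parameters of the A matrix.
template <index_t K, index_t ABKPerLane, index_t AKNumAccess>
struct KLayoutSketch
{
    static_assert(ABKPerLane % AKNumAccess == 0, "K2 must divide K2 * K0");
    static_assert(K % ABKPerLane == 0, "K2 * K0 must divide K");

    static constexpr index_t kK2 = AKNumAccess;              // outermost factor
    static constexpr index_t kK0 = ABKPerLane / AKNumAccess; // fastest-changing factor
    static constexpr index_t kK1 = K / ABKPerLane;           // whatever is not held per lane (assumption)

    static_assert(kK0 * kK1 * kK2 == K, "factors must multiply back to K");
};

// Example numbers chosen for illustration only, not a real intrinsic.
using ExampleLayout = KLayoutSketch<16, 8, 2>;
static_assert(ExampleLayout::kK0 == 4, "K0 = 8 / 2");
static_assert(ExampleLayout::kK1 == 2, "K1 = 16 / 8");
static_assert(ExampleLayout::kK2 == 2, "K2 = kAKNumAccess");
```

The same arithmetic would apply to the C matrix with `kCMPerLane` (M2 * M0) and `kCMNumAccess` (M2).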

I added a detailed explanation of the new params in terms of register mappings at the top of amdgcn_mma.hpp.

Other refactorings I did in this MR:

Add an amdgcn_mma_base struct to drastically reduce code duplication and potential bugs. This should also make auto-generating the amdgcn_mma specializations much easier.
Simplify the MmaOpTraits significantly by only including those parameters that are not directly gettable from the MmaOp itself. This removes duplicated variables and simplifies higher level code.
Remove the overloaded term "Block" for intrinsic dimensions and replace it with "Frag". Some spots were already using the term "Frag" for combined intrinsics; in those places I changed the term to "Chunk".
Remove some tests that had become somewhat pointless (setting variables and then checking their values immediately).
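The base-struct refactor in the list above follows a common pattern, sketched here under stated assumptions. `mma_base_sketch`, `dummy_mma_16x16x16`, the parameter order, and the scalar `exec` are all hypothetical; the real amdgcn_mma_base in the PR may take different parameters and wraps actual intrinsics.

```cpp
#include <cstdint>

using index_t = std::int32_t; // stand-in for ck_tile::index_t (assumption)

// Hypothetical base: declares the data types, fragment sizes and layout
// constants once, so each specialization does not repeat them.
template <typename AData, typename BData, typename CData,
          index_t FragM, index_t FragN, index_t FragK,
          index_t ABKPerLane, index_t CMPerLane>
struct mma_base_sketch
{
    using ADataType = AData;
    using BDataType = BData;
    using CDataType = CData;

    // Fragment sizes
    static constexpr index_t kM = FragM;
    static constexpr index_t kN = FragN;
    static constexpr index_t kK = FragK;

    // Layout constants (only two shown for brevity)
    static constexpr index_t kABKPerLane = ABKPerLane;
    static constexpr index_t kCMPerLane  = CMPerLane;
};

// A specialization now only lists its numbers and supplies exec().
struct dummy_mma_16x16x16
    : mma_base_sketch<float, float, float, 16, 16, 16, 4, 4>
{
    // Placeholder scalar multiply-add, not a real intrinsic call.
    static CDataType exec(ADataType a, BDataType b, CDataType c) { return a * b + c; }
};
```

With this shape, a generator for new specializations would only need to emit one line of template arguments plus the intrinsic call.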

chris-tsiaousis-hpc

Thanks for those changes @krithalith, things look much simpler now! I've added some comments and proposals for improvement.

chris-tsiaousis-hpc · 2026-03-10T14:48:20Z

projects/composablekernel/include/ck_tile/core/arch/mma/amdgcn_mma.hpp

+// TODO: Describe layout params.
+/**
+ *  @class  amdgcn_mma_base
+ *  @brief  Helper base class for amdgcn_mma structs to avoid a lot of code duplication. Also puts


"Helper" here is not correct IMO. This is just a base class.

chris-tsiaousis-hpc · 2026-03-10T14:51:07Z

projects/composablekernel/include/ck_tile/core/arch/mma/amdgcn_mma.hpp

+    static constexpr index_t kN = FragN;
+    static constexpr index_t kK = FragK;
+
+    // Layout constants


In the description of the PR you mention:

// Layout parameters
static constexpr index_t kABKPerLane;  // K2 * K0, Always the same, even for diff A / B layouts
static constexpr index_t kAKNumAccess; // K2
static constexpr index_t kARepeat;     // Used for RDNA3 repeated inputs and CDNA block hiding.
static constexpr index_t kBKNumAccess; // K2
static constexpr index_t kBRepeat;     // Used for RDNA3 repeated inputs and CDNA block hiding.
static constexpr index_t kCMPerLane;   // M2 * M0
static constexpr index_t kCMNumAccess; // M2

I'd like to have those comments here as well. It would also help future devs taking a peek at this to mention that M = M0 * M1 * M2 and so on...

Yes I'll add some inline comments and plan to add a more detailed description of the parameters, maybe with some ASCII art at the top of the file somewhere.

chris-tsiaousis-hpc · 2026-03-10T14:53:18Z

projects/composablekernel/include/ck_tile/core/arch/mma/amdgcn_mma.hpp

 *  @tparam CompilerTarget The current compiler target
 *  @tparam Enabler SFINAE enabler
 */
+// clang-format off


Why turn off clang-format here and for such a long section? If you only need this for the instantiation of the base class, maybe just do it one line before and enable it right after?
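The narrower scoping suggested here might look like the sketch below. `frag_sketch` and `frag_16x16x16` are hypothetical names used only to show where the `// clang-format off` and `// clang-format on` markers would sit; they are not taken from the PR.

```cpp
// Hypothetical stand-in for a base-class instantiation with many arguments.
template <int M, int N, int K>
struct frag_sketch
{
    static constexpr int kSize = M * N * K;
};

// Disable formatting only for the one declaration that needs manual alignment.
// clang-format off
using frag_16x16x16 = frag_sketch<16, 16, 16>;
// clang-format on

// Everything from here on is formatted by clang-format as usual.
constexpr int frag_volume() { return frag_16x16x16::kSize; }
```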

chris-tsiaousis-hpc · 2026-03-10T14:59:11Z

projects/composablekernel/include/ck_tile/core/arch/mma/sparse/mfma/sparse_gfx9.hpp

    exec(AVecType& aVec, BVecType const& bVec, CVecType const& cVec) -> CVecType
    {
-        static constexpr index_t CompressedSize = ABVecN / kCompressionRatio;
+        static constexpr index_t CompressedSize = vector_traits<AVecType>::vector_size / 2;


I'd prefer this 2 to be a constexpr variable as it used to be, since we are still waiting for an answer on whether other compression ratios will be supported. It is also cleaner than a hardcoded number within the exec function...

Yes I knew you would bring this up and you are right, I was just in line-reduction mode :)
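A minimal sketch of the named-constant version discussed in this thread follows. `sparse_sizes_sketch` is a hypothetical wrapper and `VecSize` stands in for `vector_traits<AVecType>::vector_size`; the value 2 is taken from the diff above, and whether other ratios will be supported is still open according to the review.

```cpp
#include <cstdint>

using index_t = std::int32_t; // stand-in for ck_tile::index_t (assumption)

// Hypothetical wrapper: VecSize plays the role of the A vector size.
template <index_t VecSize>
struct sparse_sizes_sketch
{
    // Named once, so a different ratio would need a change in one place only.
    static constexpr index_t kCompressionRatio = 2;

    static_assert(VecSize % kCompressionRatio == 0,
                  "vector size must be divisible by the compression ratio");

    static constexpr index_t CompressedSize = VecSize / kCompressionRatio;
};
```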

chris-tsiaousis-hpc · 2026-03-10T15:01:25Z

projects/composablekernel/include/ck_tile/core/arch/mma/sparse/mfma/sparse_gfx9.hpp

        // and evaluate changing this to a transform at a higher level.
        // aVec not being const can cause problems when running multiple intrinsics.
-        const int32_t idx = ck_tile::compress_a_impl<fp16_t, CompressedSize>(aVec);
+        const index_t idx = ck_tile::compress_a_impl<fp16_t, CompressedSize>(aVec);


While it compiles, this is not correct since the function returns an int32_t and the builtin expects an int as a fourth parameter.
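One way to make the type expectation in this comment explicit is a compile-time check, sketched below. `compress_stub` and `builtin_stub` are hypothetical stand-ins for `compress_a_impl` and the sparse builtin; their bodies are placeholders and only the types matter.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for a helper that returns a 32-bit index.
inline std::int32_t compress_stub() { return 3; }

// Hypothetical stand-in for a builtin whose index argument is a plain int.
inline int builtin_stub(int idx) { return idx + 1; }

inline int run_sketch()
{
    // Declare the variable with the helper's actual return type rather than
    // an unrelated alias, and document the expectation at compile time.
    static_assert(std::is_same<decltype(compress_stub()), std::int32_t>::value,
                  "helper is expected to return int32_t");
    const std::int32_t idx = compress_stub();
    return builtin_stub(idx);
}
```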

chris-tsiaousis-hpc · 2026-03-10T15:01:51Z

projects/composablekernel/include/ck_tile/core/arch/mma/sparse/mfma/sparse_gfx9.hpp

            a_vec_pruned, bVec, cVec, idx, PARAMS.UseFirstIndex, PARAMS.ByteIndexToOverride)};
    }
 };
+// clang-format on


Again, avoid prolonging the disabled clang-format section

chris-tsiaousis-hpc · 2026-03-10T15:02:34Z

projects/composablekernel/include/ck_tile/core/arch/mma/sparse/wmma/sparse_gfx12.hpp

        return {__builtin_amdgcn_swmmac_f32_16x16x32_f16_w32(a_vec_pruned, bVec, cVec, idx)};
    }
 };
+// clang-format on


Avoid prolonging the disabled clang-format section

chris-tsiaousis-hpc · 2026-03-10T15:02:42Z

projects/composablekernel/include/ck_tile/core/arch/mma/wmma/wmma_gfx11.hpp

        return {__builtin_amdgcn_wmma_f32_16x16x16_f16_w32(aVec, bVec, cVec)};
    }
 };
+// clang-format on


Avoid prolonging the disabled clang-format section

chris-tsiaousis-hpc · 2026-03-10T15:02:50Z

projects/composablekernel/include/ck_tile/core/arch/mma/wmma/wmma_gfx12.hpp

        return {__builtin_amdgcn_wmma_f32_16x16x16_f16_w32_gfx12(aVec, bVec, cVec)};
    }
 };
+// clang-format on


Avoid prolonging the disabled clang-format section

wj-laskowski

Looks great! My comments are mostly nits and some thinking out loud.

projects/composablekernel/include/ck_tile/core/arch/mma/amdgcn_mma.hpp

projects/composablekernel/include/ck_tile/core/arch/mma/wmma/wmma_gfx12.hpp

projects/composablekernel/include/ck_tile/core/arch/mma/sparse/mfma/sparse_gfx9.hpp

projects/composablekernel/include/ck_tile/core/arch/mma/sparse/mfma/selector.hpp

projects/composablekernel/test/ck_tile/core/arch/mma/test_amdgcn_mma.cpp

projects/composablekernel/test/ck_tile/core/arch/mma/test_amdgcn_mma_layout_util.hpp

projects/composablekernel/include/ck_tile/core/arch/mma/mfma/mfma_selector.hpp

wj-laskowski · 2026-03-11T15:46:11Z

projects/composablekernel/include/ck_tile/core/arch/mma/amdgcn_mma.hpp

+ * Note that K0 = 2 (first unmerge size, fastest changing), K1 = 3 (second unmerge size,
+ * second-fastest changing), and K2 = 12 / 2 / 3 = 2 (outermost dimension, whatever is left).
+ *
+ * If we were to use this unmerge op to decribe an A matrix layout in registers, we might have for
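The unmerge described in this comment can be sketched as plain index arithmetic. The sizes K0 = 2, K1 = 3 and K2 = 12 / 2 / 3 = 2 come from the comment; the function name `unmerge_k` and the `{k2, k1, k0}` return order are hypothetical choices for illustration.

```cpp
#include <array>
#include <cstdint>

using index_t = std::int32_t; // stand-in for ck_tile::index_t (assumption)

// Split a linear index k in [0, K0 * K1 * K2) into {k2, k1, k0}.
// k0 changes fastest, k1 second-fastest, and k2 is whatever is left.
constexpr std::array<index_t, 3> unmerge_k(index_t k, index_t K0, index_t K1)
{
    return {k / (K0 * K1), (k / K0) % K1, k % K0};
}
```

For example, with K0 = 2 and K1 = 3, the linear index 7 splits into k2 = 1, k1 = 0, k0 = 1, since 7 = 1 * 6 + 0 * 2 + 1.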


projects/composablekernel/include/ck_tile/core/arch/mma/amdgcn_mma.hpp

projects/composablekernel/include/ck_tile/core/arch/mma/mma.hpp

projects/composablekernel/include/ck_tile/core/arch/mma/mma_traits.hpp

cgmillette · 2026-03-18T15:02:22Z

projects/composablekernel/test/ck_tile/core/arch/mma/test_amdgcn_mma.cpp

-// Test MmaDefaultSelector for supported DummyAmdgcnMma on fragment sizes other than 16x16x16
-// This tests that the selector can still pick the correct MMA op even if the fragment sizes differ
-TEST(TestAmdgcnMma, MmaDefaultSelectorSupportedFragment)
+// Test MmaDefaultSelector for supported DummyAmdgcnMma on chunk sizes other than 16x16x16


Some tests still have references to "chunk"

Oops, my search range was too small, nice catch!

cgmillette

Great PR! Excellent work.
I really enjoyed seeing the reduction of boilerplate code with the extraction of the base class.

krithalith added the project: composablekernel label Mar 10, 2026

krithalith requested a review from a team as a code owner March 10, 2026 10:40

krithalith added the organization: streamhpc contributors from streamhpc label Mar 10, 2026

krithalith force-pushed the users/krithalith/ck/unification_policy_struct_refactor branch from b37eacf to 6fb3ad3 March 10, 2026 10:55

krithalith marked this pull request as draft March 10, 2026 10:57

wj-laskowski self-requested a review March 10, 2026 11:00

chris-tsiaousis-hpc requested changes Mar 10, 2026

wj-laskowski reviewed Mar 10, 2026

wj-laskowski self-requested a review March 11, 2026 11:00

wj-laskowski approved these changes Mar 11, 2026

krithalith marked this pull request as ready for review March 11, 2026 15:57

chris-tsiaousis-hpc approved these changes Mar 11, 2026

krithalith force-pushed the users/krithalith/ck/unification_policy_struct_refactor branch 2 times, most recently from 0e26f69 to 78119c5 March 12, 2026 09:03

krithalith changed the title from "[WMMA / MFMA unification] Refactor amdgcn_mma policy structs" to "[CK_Tile] Refactor amdgcn_mma policy structs" Mar 16, 2026

cgmillette reviewed Mar 17, 2026

projects/composablekernel/include/ck_tile/core/arch/mma/amdgcn_mma.hpp (outdated)

cgmillette reviewed Mar 17, 2026

projects/composablekernel/include/ck_tile/core/arch/mma/amdgcn_mma.hpp

cgmillette reviewed Mar 17, 2026

projects/composablekernel/include/ck_tile/core/arch/mma/amdgcn_mma.hpp

cgmillette reviewed Mar 17, 2026

projects/composablekernel/include/ck_tile/core/arch/mma/mma.hpp (outdated)

cgmillette reviewed Mar 17, 2026

projects/composablekernel/include/ck_tile/core/arch/mma/mma_traits.hpp

krithalith force-pushed the users/krithalith/ck/unification_policy_struct_refactor branch from e5b031e to 2dbdd57 March 18, 2026 10:43

krithalith requested a review from cgmillette March 18, 2026 10:53

cgmillette reviewed Mar 18, 2026

cgmillette approved these changes Mar 18, 2026

krithalith added 5 commits March 18, 2026 15:41

Replace layout params with cleaner ones in amdgcn structs, derive register vector sizes.

d7b21e9

Reduce policy struct code duplication with amdgcn_mma_base class

0d32546

Remove already overloaded "block" term for intrinsic sizes and replace with "frag". In places where "frag" was already used, replace that with "chunk".

de0fe34

Simplify MmaOpTraits now that almost everything is directly available from MmaOp.

fe6fa03

Address PR comments except for those about layout explanations. Also missed a number of Block / Frag / Chunk refactor spots.

0034987

krithalith added 4 commits March 18, 2026 15:41

Add detailed layout parameter descriptions.

a56e114

Address review comments (comment changes only)

59f3a60

Address review comments: tweak comments + change Chunk to WaveTile + reduce dummy exec print verbosity.

c999f96

Change some remaining references to Chunk to WaveTile

aa5bcbc

krithalith force-pushed the users/krithalith/ck/unification_policy_struct_refactor branch from 9f4f0a7 to aa5bcbc March 18, 2026 15:41
