[CK_Tile] Refactor amdgcn_mma policy structs#5272
[CK_Tile] Refactor amdgcn_mma policy structs#5272krithalith wants to merge 9 commits intodevelopfrom
Conversation
b37eacf to
6fb3ad3
Compare
chris-tsiaousis-hpc
left a comment
There was a problem hiding this comment.
Thanks for those changes @krithalith, things look much simpler now! I've added some comments and proposals for improvement.
| // TODO: Describe layout params. | ||
| /** | ||
| * @class amdgcn_mma_base | ||
| * @brief Helper base class for amdgcn_mma structs to avoid a lot of code duplication. Also puts |
There was a problem hiding this comment.
"Helper" here is not correct IMO. This is just a base class.
| static constexpr index_t kN = FragN; | ||
| static constexpr index_t kK = FragK; | ||
|
|
||
| // Layout constants |
There was a problem hiding this comment.
In the description of the PR you mention:
// Layout parameters
static constexpr index_t kABKPerLane; // K2 * K0, Always the same, even for diff A / B layouts
static constexpr index_t kAKNumAccess; // K2
static constexpr index_t kARepeat; // Used for RDNA3 repeated inputs and CDNA block hiding.
static constexpr index_t kBKNumAccess; // K2
static constexpr index_t kBRepeat; // Used for RDNA3 repeated inputs and CDNA block hiding.
static constexpr index_t kCMPerLane; // M2 * M0
static constexpr index_t kCMNumAccess; // M2
I'd like to have those comments here as well. It would also be useful for future devs having a peek on this to mention that M = M0 * M1 * M2 and so on...
There was a problem hiding this comment.
Yes I'll add some inline comments and plan to add a more detailed description of the parameters, maybe with some ASCII art at the top of the file somewhere.
| * @tparam CompilerTarget The current compiler target | ||
| * @tparam Enabler SFINAE enabler | ||
| */ | ||
| // clang-format off |
There was a problem hiding this comment.
Why turn off clang-format here and for such a long section? If you only need this for the instantiation of the base class, maybe just do it one line before and enable it right after?
| exec(AVecType& aVec, BVecType const& bVec, CVecType const& cVec) -> CVecType | ||
| { | ||
| static constexpr index_t CompressedSize = ABVecN / kCompressionRatio; | ||
| static constexpr index_t CompressedSize = vector_traits<AVecType>::vector_size / 2; |
There was a problem hiding this comment.
I'd prefer this 2 to be a constexpr variable like it used to, since we are still waiting for an answer on whether other compression ratios will be supported. It is also cleaner than a hardcoded number within the exec function...
There was a problem hiding this comment.
Yes I knew you would bring this up and you are right, I was just in line-reduction mode :)
| // and evaluate changing this to a transform at a higher level. | ||
| // aVec not being const can cause problems when running multiple intrinsics. | ||
| const int32_t idx = ck_tile::compress_a_impl<fp16_t, CompressedSize>(aVec); | ||
| const index_t idx = ck_tile::compress_a_impl<fp16_t, CompressedSize>(aVec); |
There was a problem hiding this comment.
While it compiles, this is not correct since the function returns an int32_t and the builtin expects an int as a fourth parameter.
| a_vec_pruned, bVec, cVec, idx, PARAMS.UseFirstIndex, PARAMS.ByteIndexToOverride)}; | ||
| } | ||
| }; | ||
| // clang-format on |
There was a problem hiding this comment.
Again, avoid prolonging the disabled clang-format section
| return {__builtin_amdgcn_swmmac_f32_16x16x32_f16_w32(a_vec_pruned, bVec, cVec, idx)}; | ||
| } | ||
| }; | ||
| // clang-format on |
There was a problem hiding this comment.
Avoid prolonging the disabled clang-format section
| return {__builtin_amdgcn_wmma_f32_16x16x16_f16_w32(aVec, bVec, cVec)}; | ||
| } | ||
| }; | ||
| // clang-format on |
There was a problem hiding this comment.
Avoid prolonging the disabled clang-format section
| return {__builtin_amdgcn_wmma_f32_16x16x16_f16_w32_gfx12(aVec, bVec, cVec)}; | ||
| } | ||
| }; | ||
| // clang-format on |
There was a problem hiding this comment.
Avoid prolonging the disabled clang-format section
wj-laskowski
left a comment
There was a problem hiding this comment.
Looks great! My comments are mostly nits and some thinking out loud.
projects/composablekernel/include/ck_tile/core/arch/mma/amdgcn_mma.hpp
Outdated
Show resolved
Hide resolved
projects/composablekernel/include/ck_tile/core/arch/mma/wmma/wmma_gfx12.hpp
Show resolved
Hide resolved
projects/composablekernel/include/ck_tile/core/arch/mma/sparse/mfma/sparse_gfx9.hpp
Outdated
Show resolved
Hide resolved
projects/composablekernel/include/ck_tile/core/arch/mma/sparse/mfma/selector.hpp
Outdated
Show resolved
Hide resolved
projects/composablekernel/test/ck_tile/core/arch/mma/test_amdgcn_mma.cpp
Outdated
Show resolved
Hide resolved
projects/composablekernel/test/ck_tile/core/arch/mma/test_amdgcn_mma_layout_util.hpp
Show resolved
Hide resolved
projects/composablekernel/include/ck_tile/core/arch/mma/mfma/mfma_selector.hpp
Outdated
Show resolved
Hide resolved
| * Note that K0 = 2 (first unmerge size, fastest changing), K1 = 3 (second unmerge size, | ||
| * second-fastest changing), and K2 = 12 / 2 / 3 = 2 (outermost dimension, whatever is left). | ||
| * | ||
| * If we were to use this unmerge op to decribe an A matrix layout in registers, we might have for |
0e26f69 to
78119c5
Compare
projects/composablekernel/include/ck_tile/core/arch/mma/amdgcn_mma.hpp
Outdated
Show resolved
Hide resolved
projects/composablekernel/include/ck_tile/core/arch/mma/mma.hpp
Outdated
Show resolved
Hide resolved
e5b031e to
2dbdd57
Compare
| // Test MmaDefaultSelector for supported DummyAmdgcnMma on fragment sizes other than 16x16x16 | ||
| // This tests that the selector can still pick the correct MMA op even if the fragment sizes differ | ||
| TEST(TestAmdgcnMma, MmaDefaultSelectorSupportedFragment) | ||
| // Test MmaDefaultSelector for supported DummyAmdgcnMma on chunk sizes other than 16x16x16 |
There was a problem hiding this comment.
Some tests still have references to "chunk"
There was a problem hiding this comment.
Oops, my search range was too small, nice catch!
cgmillette
left a comment
There was a problem hiding this comment.
Great PR! Excellent work.
I really enjoyed seeing the reduction of boilerplate code with the extraction of the base class.
…ister vector sizes.
…e with "frag". In places where "frag" was already used, replace that with "chunk".
…missed a number of Block / Frag / Chunk refactor spots.
9f4f0a7 to
aa5bcbc
Compare
Motivation
The point of this MR is to update the intrinsic layout parameters to simplify them and make them more clear and flexible. Also, a number of simple refactors were performed to reduce boilerplate and code duplication.
Technical Details
In CK Tile and old CK, the full set of information available in the intrinsic wrappers, for WMMA and MFMA combined, would be something like:
Note that on top of the intrinsic sizes, we have 12 layout parameters. I have reduced this in the new design to:
Note that there are now only 7 layout parameters and no more dimensionality orderings. Believe it or not these 7 parameters are more general than the original 12, and can handle intrinsic and mid-level features that are currently awkward in CK Tile, like dealing with AttrNumAccess, different A / B layouts, more general block-hiding (currently very limited in CK tile), and future arch features.
Furthermore, the A, B and C vec types are now derived directly from the layout parameters to ensure internal consistency.
I added a detailed explanation of the new params in terms of register mappings at the top of amgcn_mma.hpp
Other refactorings I did in this MR:
Make an amdgcn_mma_base struct to drastically reduce code duplication and potential bugs. Should also make auto-generating the amd_gcn specializations much easier.
Simplify the MmaOpTraits significantly by only including those parameters that are not directly gettable from the MmaOp itself. This removes duplicated variables and simplifies higher level code.
Remove overloaded "Block" term for intrinsic dimensions, and replace by "Frag" instead. Some spots were already using the term "Frag" for combined intrinsics, in which case I changed that term to "Chunk" instead.
Remove some tests that had become somewhat pointless (setting variables and then checking their values immediately).
Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.