Skip to content

[CK Tile] Unification work - mma transformations pipeline#5508

Draft
chris-tsiaousis-hpc wants to merge 20 commits intodevelopfrom
users/chris-tsiaousis-hpc/ck/mma-transformations-pipeline
Draft

[CK Tile] Unification work - mma transformations pipeline#5508
chris-tsiaousis-hpc wants to merge 20 commits intodevelopfrom
users/chris-tsiaousis-hpc/ck/mma-transformations-pipeline

Conversation

@chris-tsiaousis-hpc
Copy link
Contributor

Motivation

In this PR we showcase how the amdgcn structs could be used in a pipeline that does some extra pre/post processing.
For the sparse intrinsics, so far we compressed the A vector "on the fly" right before the execution of the builtin. This might introduce performance issues down the line if, for example, the user decided to chain multiple sparse builtins. We tackle this problem by creating a specific SparseCompressTransform.

A MmaPipelineBase is also created to facilitate those kind of higher level compositions of the amdgcn structs and is integrated to the existing WaveWiseMma prototype. There is an effort to facilitate future operations, like swizzle A/B, C transpose or double/quad attr num access through the MmaPipelineOptionFlags, but those are not yet defined and should do so in a future PR.
The pipeline base class is basically at the RFC stage.

We also create a runtime test for the existing WaveWiseMma, as well as one for the SparseMma pipeline.

Technical Details

The goal should be to have the pipeline easily expandable. May the CRTP of the base class or the interface in general be insufficient or unable to handle all of our needs, then a design modification should be discussed.

Test Plan

New tests are added.

Test Result

Tests should pass.

Submission Checklist

@wj-laskowski wj-laskowski self-requested a review March 17, 2026 09:15
@chris-tsiaousis-hpc chris-tsiaousis-hpc force-pushed the users/chris-tsiaousis-hpc/ck/mma-transformations-pipeline branch from 12f78be to a96f4c8 Compare March 17, 2026 09:54
idx = detail::compress_a_impl<fp16_t, CompressedSize>(v);

VecCompressed result;
__builtin_memcpy(&result, &v, sizeof(VecCompressed));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing a

        idx = detail::compress_a_impl<ScalarT, CompressedSize>(v);
        return *static_cast<const VecCompressed*>(&v);

here gives us an error:

sparse_transforms.hpp:79:17: error: static_cast from '_Float16 * __attribute__((ext_vector_type(16)))' to 'const VecCompressed *' (aka 'const ck_tile::impl::ext_vector<_Float16, 8>::type *') is not allowed
   79 |         return *static_cast<const VecCompressed*>(&v);

Can we avoid copying ? Ideas?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the buffer earlier used reinterpret_cast, maybe it works here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is to use what the original code did:

const AVecCompressed a_vec_pruned = {a_vec[0], a_vec[1], a_vec[2], a_vec[3]};

With some if constexpr or some sort of template for different sizes. There may be some perf reason why it was implemented like this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reinterpret_cast gives a similar error, unfortunately.
What the "original" code did is still creating a new vector and copying over elements to it. It is also not universal since the compressed A vector doesn't have the same size for GFX9 and GFX12...
I will investigate a way to change the size of the ext_vector without unnecessary copies.

@chris-tsiaousis-hpc chris-tsiaousis-hpc force-pushed the users/chris-tsiaousis-hpc/ck/mma-transformations-pipeline branch from a96f4c8 to 06106c2 Compare March 17, 2026 10:29
@chris-tsiaousis-hpc chris-tsiaousis-hpc added the organization: streamhpc contributors from streamhpc label Mar 17, 2026
template <typename MmaOp, typename CompilerTarget, typename Enable = void>
// TODO: c++20 template <MmaOpI MmaOp, amdgcn_target_arch_id CompilerTarget, typename Enable = void>
struct MmaTransformsDefaultSelector;
struct MmaTransformsDefaultSelector
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is not good. It was done because DEVICE and HOST code are weirdly intertwined and I'll think of a way to revert this.

@@ -0,0 +1,113 @@
// Copyright (c) Advanced Micro Devices, Inc., or its affiliates.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe rename it to sparse_mma_pipeline.hpp or something like that? And also move it to the dedicated sparse folder?

C_TRANSPOSE = 0x1,
SWIZZLE_A = 0x2,
SWIZZLE_B = 0x4,
DOUBLE_ATTR_NUM_ACCESS = 0x8,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if all these operations belong here and whether we should treat them the same as compress_a. Did we discuss this already? I thought *ATTR_NUM_ACCESS can always be computed deterministically, i.e. not a pipeline option

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think Swizzle and AttrNumAccess are just modifiers on the tile distribution encoding and do not affect the vector fragments. Same for CTranspose except the vector fragments get swapped. I don't know if there is a need for explicit transposition anywhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well the CTranspose can be kept as a pipeline implementation in the base class. Thoughts?

int32_t non_zero_pos = 0;

static_for<0, 3, 1>{}([&](auto j) {
if(a_vec[i * 4 + j] != 0.0f)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to use static_cast here to avoid any implicit conversions

* non‑zeros are found, remaining fields default to 2 (see below).
*/
template <typename ADataType, index_t CompressedSize, typename AVec>
static CK_TILE_DEVICE int32_t compress_a_impl(AVec& a_vec)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we testing this anywhere now? MmaSelector_Sparse_F16_F16_F32_16x16x32_Real fill all A/B with 1s and the layout test tests intrinsic after the compression, so that also won't pick up any errors (I think). Maybe we need a unit test for this?

#include "mma_selector.hpp"
#include "mma_traits.hpp"
#include "mma_transforms.hpp"
#include <netdb.h>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this header for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Committed by mistake

#include "mma_traits.hpp"
#include "mma_transforms.hpp"
#include <netdb.h>
#include <optional>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also no std::optional or this is needed for something else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Committed by mistake

@wj-laskowski wj-laskowski self-requested a review March 18, 2026 09:55
Copy link
Contributor

@wj-laskowski wj-laskowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! I like the new approach, got a few questions

COMPRESS_A = 0x20,
};

struct MmaPipelineOptionFlags
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add something that tests functionality added here?

}
}
}
else // if constexpr(AccumPolicy == MmaAccumPolicy::COL_MAJOR)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth to keep this comment

}
}
else // if constexpr(AccumPolicy == MmaAccumPolicy::COL_MAJOR)
else
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or better would be add else if constexpr and fail in else

#include "ck_tile/core/arch/mma/mma_op_family.hpp"
#include "ck_tile/core/config.hpp"
#include "ck_tile/core/numeric/half.hpp"
#include "ck_tile/core/numeric/pk_fp4.hpp"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do we need this header for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Committed by mistake

idx = detail::compress_a_impl<fp16_t, CompressedSize>(v);

VecCompressed result;
__builtin_memcpy(&result, &v, sizeof(VecCompressed));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the buffer earlier used reinterpret_cast, maybe it works here?


#include "ck_tile/core/numeric/integer.hpp"
#include "ck_tile/core/tensor/static_distributed_tensor.hpp"
namespace ck_tile {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess these with changes we go back to the state before adding sparse amdgcn_mma structs? Nice work :)

The default implementation should include the pass through transforms and is needed to avoid instantiations of an undefined template.

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
* non‑zeros are found, remaining fields default to 2 (see below).
*/
template <typename ADataType, index_t CompressedSize, typename AVec>
static CK_TILE_DEVICE int32_t compress_a_impl(AVec& a_vec)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want this to exist separately from the existing compress_a function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well the idea is that the compress_a function will be removed when we integrate this to CK Tile. I also reverted the changes done to that function to minimize untested regression.

struct SparseCompressTransform
{
template <typename VecType>
CK_TILE_DEVICE static decltype(auto) exec(VecType&& v, int32_t& idx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is decltype really necessary here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Not having it would create a copy even if I find a way to change the vector size and return a reference to it's address. decltype allows to keep the reference to the return type.

template <typename VecType>
CK_TILE_DEVICE static decltype(auto) exec(VecType&& v, int32_t& idx)
{
using VecTraits = vector_traits<std::decay_t<VecType>>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CK library generally uses remove_cvref_t<> instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't use that as it is C++20, didn't know ck has its own implementation. I'll use that, thanks!

COMPRESS_A = 0x20,
};

struct MmaPipelineOptionFlags
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very verbose but I guess necessary if we really don't want to allow raw enums...

Comment on lines +127 to +134
template <typename VecTA, typename VecTB, typename VecTC>
CK_TILE_DEVICE static decltype(auto) exec(VecTA&& a, VecTB&& b, VecTC&& accum)
{
auto pre = Derived::template preApply<Flags>(
std::forward<VecTA>(a), std::forward<VecTB>(b), std::forward<VecTC>(accum));
Derived::execImpl(pre);
return Derived::template postApply<Flags>(pre);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha we are making a second-order wrapper for the intrinsic just like in CK Tile, making more sense to me now.

@krithalith
Copy link
Contributor

Nice work, design makes sense to me an should be compatible with CK Tile, will need to change the amdgcn_mma_base and the tile distribution enc calc slightly if we go with this.

I don't understand this comment though:

This might introduce performance issues down the line if, for example, the user decided to chain multiple sparse builtins.

How would you chain sparse ops? The result is not sparse.

This is because this filename was too general for what it did. Also transfered basic components to the reusable base class.

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
…d A vector size)

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
We should, in the future remove the `::Type`.

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
…unction

Also added a test for this.

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
@chris-tsiaousis-hpc chris-tsiaousis-hpc force-pushed the users/chris-tsiaousis-hpc/ck/mma-transformations-pipeline branch from 9f7a0ab to e2e5634 Compare March 18, 2026 16:02
Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants