
Conversation


@anamikac-intel anamikac-intel commented Oct 22, 2025

PR #540 modernizes the CollectiveMma module by replacing legacy atoms with their updated counterparts.
The current PR focuses on updating the CollectiveEpilogue module with similar improvements. However, PR #540 must be merged first, as the CollectiveEpilogue changes depend on the atom updates introduced in that pull request. This PR also depends on the new copy_c/copy_d APIs for load/store (#572).

@tdeng5 tdeng5 requested review from jiyang1011 and taozha2 October 22, 2025 06:59
@anamikac-intel (Author)

@petercad - While testing the new make_block_2d_copy_{C,D} APIs for loads/stores, I am seeing some performance drops when the API automatically selects load/store operations compared to manually specified operations.

[screenshot: performance comparison of auto-selected vs. manually specified load/store operations]


//remember this PR https://github.com/intel/sycl-tla/pull/565/files
private:
constexpr static bool is_source_supported = not cute::is_void_v<ElementC> && not cute::is_void_v<CopyOpG2R>;
@anamikac-intel (Author)

@jiyang1011 - The validation logic from PR #565 that sets is_source_supported to false when CopyOpG2R is void needs updating. With this PR's automatic ops selection, both CopyOpG2R and CopyOpR2G can now legitimately be void since make_block_2d_copy_* automatically selects appropriate operations.
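
A minimal sketch of the relaxed check, assuming a void CopyOpG2R now simply means "auto-select" (illustrative only, not the final change):

// Source (C) support no longer requires an explicit copy op, since
// make_block_2d_copy_* can pick one automatically when CopyOpG2R is void.
constexpr static bool is_source_supported = not cute::is_void_v<ElementC>;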

@jiyang1011 jiyang1011 Oct 23, 2025

Could we set a default copy trait, like XeCopyAuto or something else, which would also call make_block_2d_copy_*?
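
A minimal sketch of that idea, assuming a new marker type (XeCopyAuto does not exist today; the name comes from this comment):

#include <type_traits>

struct XeCopyAuto {};  // hypothetical marker trait

// The collective would route both void and XeCopyAuto through make_block_2d_copy_*.
template <class CopyOp>
static constexpr bool use_auto_copy =
    cute::is_void_v<CopyOp> || std::is_same_v<CopyOp, XeCopyAuto>;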

@tdeng5 tdeng5 added the urgent PR requires a urgent attention (for release or blocking another PR) label Oct 23, 2025

@anamikac-intel anamikac-intel marked this pull request as ready for review October 23, 2025 07:40
auto thread_xe_load_c = params.xe_load_c.get_thread_slice(thread_idx);
// Get thread-level partitioning across the entire workgroup tile
auto thread_xe_load_c = copy_c.get_thread_slice(thread_idx);
Tensor tCgC = thread_xe_load_c.partition_S(gD);
@anamikac-intel anamikac-intel (Author) Oct 23, 2025

@petercad - I was trying to fix the register spills... reducing copy operations is not helping much. The issue seems to be coming from how the tiling is done. In the legacy code, after partitioning, tCgC and tCgD were:
tCgC : ArithTuple(0,0,0) o ((_8,_1),_4,_4):((_1@0,_0),_8@0,_16@1)
tCgD : ArithTuple(0,0,0) o ((_8,_1),_4,_4):((_1@0,_0),_8@0,_16@1)

So we have 8 fragments of size 4 x 4

whereas in the new code we have 128 fragments of size 1 x 1:

tCgC: ArithTuple(0,0,0) o ((_8,(_4,_4)),_1,_1):((_1@0,(_8@0,_16@1)),_0,_0)
tCgD: ArithTuple(0,0,0) o ((_8,(_4,_4)),_1,_1):((_1@0,(_8@0,_16@1)),_0,_0)

I tried tiling further with SubgroupTileShape{} (_32,_64,_32) but got the same result:
g_wg_D: ArithTuple(0,0,0) o (_256,_256):(_1@0,_1@1)
gD: ArithTuple(0,0,0) o (_32,_64):(_1@0,_1@1)
tCgC: ArithTuple(0,0,0) o ((_8,(_4,_4)),_1,_1):((_1@0,(_8@0,_16@1)),_0,_0)
tCgD: ArithTuple(0,0,0) o ((_8,(_4,_4)),_1,_1):((_1@0,(_8@0,_16@1)),_0,_0)

This seems to be the actual issue. When I reshaped the layout so that tCgC/tCgD become ArithTuple(0,0,0) o ((_8,_1),_4,_4):((_1@0,_0),_8@0,_16@1) (8 fragments of 4x4), the perf drop is fixed. But re-layouting tCgC/tCgD might not be the best option, so can you please check?

[screenshot: performance after reshaping the tCgC/tCgD layout]

@sanchitintel sanchitintel Oct 23, 2025

we have 8 fragments of size 4 x 4

No, the per-thread D or C fragment size is 128 elements in both cases, but the layout is different.
The WG tile size is (256, 256, 32).
There are 32 subgroups, each with 16 threads.
Each SG fragment for C or D is (32, 64) spatially since the subgroup layout is 8x4 in the example.
Each C or D thread fragment is sized 128 elements.

Both ((_8,_1),_4,_4) and ((_8,(_4,_4)),_1,_1) have 128 elements.
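
A minimal host-side check of this, assuming CuTe headers are available (shapes taken from the layouts quoted above):

#include <cute/tensor.hpp>
using namespace cute;

int main() {
  // Legacy thread-fragment shape: ((_8,_1),_4,_4)
  auto legacy_shape = make_shape(make_shape(_8{}, _1{}), _4{}, _4{});
  // New thread-fragment shape: ((_8,(_4,_4)),_1,_1)
  auto new_shape    = make_shape(make_shape(_8{}, make_shape(_4{}, _4{})), _1{}, _1{});
  // Both enumerate 128 elements: a (32,64) SG tile split over 16 threads = 128 per thread.
  print(size(legacy_shape)); print("\n");  // _128
  print(size(new_shape));    print("\n");  // _128
}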


@sanchitintel sanchitintel Oct 24, 2025

Hi @petercad, the recent commits contain many more changes than just the thread-fragment layout (which seems to be equivalent to the previous one). This suggests there is a lurking factor that fixed the performance issues observed earlier in this PR, and that the thread-fragment layout of the new C, D copy atoms isn't itself problematic.

Thanks!

@anamikac-intel (Author)

Yes, this code is close to the legacy one, except that I am using the new copy atoms with the reshaped layout... my only concern is that it only works with ops that have 16 width × 8 height. #573 (comment)

@petercad petercad Oct 24, 2025

@sanchitintel -- will look at it in more detail shortly. The underlying cause for earlier regressions seems to be twofold:

  • Code scheduling issues in IGC. It seems it is not moving loads/stores around sufficiently to reduce register pressure.
  • For the C loads, make_block_2d_copy_c will try to make the blocks as large as possible (because it's operating on the assumption that you're loading all of C at once), but that brings additional register pressure.

The second point is not enough to explain the spills (there is plenty of register space even if you do load huge chunks of C), but it aggravates the first point.

@anamikac-intel anamikac-intel changed the title [WIP] Use newer version of mma_atom and copy_atom in CollectiveEpilogue for 00_bmg_gemm test Use newer version of mma_atom and copy_atom in CollectiveEpilogue for 00_bmg_gemm test Oct 23, 2025
@sanchitintel sanchitintel changed the title Use newer version of mma_atom and copy_atom in CollectiveEpilogue for 00_bmg_gemm test Use newer version of copy_atom in GEMM epilogue collective Oct 23, 2025
@sanchitintel sanchitintel changed the title Use newer version of copy_atom in GEMM epilogue collective Use newer version of copy_atom in epilogue collective Oct 24, 2025

anamikac-intel commented Oct 24, 2025

Just an observation: I was trying to compare tCgC/tCgD between the legacy and new code across different load/store operations, but I found that the legacy code only functions correctly with one specific operation combination:

Load: XE_2D_U32x8x16_LD_N (dimensions: 16 width × 8 height)
Store: XE_2D_U32x8x16_ST_N (dimensions: 16 width × 8 height)
(whose tCgC is ArithTuple(0,0,0) o ((_8,_1),_4,_4):((_1@0,_0),_8@0,_16@1) and tCgD is ArithTuple(0,0,0) o ((_8,_1),_4,_4):((_1@0,_0),_8@0,_16@1))

When I tried to use an alternative load (XE_2D_U32x16x16_LD_N / XE_2D_U32x32x16_LD_N), the copy(params.xe_load_c, tCgC(_, epi_m, epi_n), trC) fails, where tCgC is ArithTuple(0,0,0) o ((_16,_1),_2,_4):((_1@0,_0),_16@0,_16@1) and trC is ptr32b o (_8):(_1).

It seems to be because of fragmentSize ((get<0>(MmaAtomShape()) * get<1>(MmaAtomShape())) / SubgroupSize), which is always 8 since MmaAtomShape is (8x16x16) and the subgroup size is 16. trC and trD are built with this fragment size (trC: ptr32b o (_8):(_1), trD: ptr32b o (_8):(_1)), which is causing the copy to fail.

Seems legacy code has limited compatibility with load/store op variants
[screenshot: tCgC/tCgD comparison across load/store op variants]
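
A small worked check of the mismatch described above (numbers taken from this comment; the constants are illustrative, not the actual kernel code):

constexpr int MmaAtomM = 8, MmaAtomN = 16, SubgroupSize = 16;
constexpr int fragmentSize = (MmaAtomM * MmaAtomN) / SubgroupSize;  // = 8, the size of trC/trD
// XE_2D_U32x8x16_LD_N  :  8 rows x 16 cols ->  8*16/16 =  8 elements per thread (matches trC)
// XE_2D_U32x16x16_LD_N : 16 rows x 16 cols -> 16*16/16 = 16 elements per thread (mismatch, copy fails)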


sanchitintel commented Oct 24, 2025

Seems legacy code has limited compatibility with load/store op variants

@taozha2 @jiyang1011,

Do you know the background? SPIR-V doesn't seem to have any restrictions for something like XE_2D_U32x16x16_ST_N, but it doesn't exist.

Thanks!

@sanchitintel sanchitintel requested a review from taozha2 October 24, 2025 09:07
Comment on lines +261 to +288
CUTLASS_DEVICE auto reshape_with_unit_insertion(Tensor&& tensor) {
  using namespace cute;

  // Original layout, e.g. ((_8,(_4,_4)),_1,_1) as produced by the new copy atom's partitioning.
  auto orig_layout = tensor.layout();
  auto orig_shape  = orig_layout.shape();
  auto orig_stride = orig_layout.stride();

  // Split mode 0 into its outer vector part (_8) and the inner (_4,_4) part.
  auto first_dim  = get<0>(orig_shape);
  auto outer_part = get<0>(first_dim);
  auto inner_part = get<1>(first_dim);

  auto first_stride = get<0>(orig_stride);
  auto outer_stride = get<0>(first_stride);
  auto inner_stride = get<1>(first_stride);

  // Rebuild as ((outer,_1), inner_0, inner_1), e.g. ((_8,_1),_4,_4),
  // matching the legacy fragment layout expected by the epilogue loop.
  auto target_shape = make_shape(
      make_shape(outer_part, _1{}),
      get<0>(inner_part),
      get<1>(inner_part)
  );

  auto target_stride = make_stride(
      make_stride(outer_stride, _0{}),
      get<0>(inner_stride),
      get<1>(inner_stride)
  );

  return make_tensor(tensor.data(), make_layout(target_shape, target_stride));


Thanks for the revision!

IMHO, this seems to be a very good workaround to leverage the new C, D copy atoms, if it works for any copy atom used in epilogues.


but then trC and trD have to be modified for the copy operation to work properly

Sorry, do you mean further changes besides the ones currently in this PR?

@anamikac-intel anamikac-intel (Author) Oct 24, 2025

The legacy code seems to be restricted to the XE_2D_U32x8x16_LD_N/ST_N operations (16×8 dimensions only). If 16×8 is the design target then no further changes are needed; otherwise, generalizing trC/trD requires decoupling them from the fixed fragmentSize, which appears tied to the accumulator layout (_8,_4,_4) and suggests 16×8 as the intended dimension. Not fully sure; waiting for Peter to check.
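
A hypothetical sketch of that generalization (illustrative names only, none of these identifiers exist in the code): derive the register-fragment size from the chosen copy op's block shape instead of hard-coding it from MmaAtomShape, so ops other than 16 width × 8 height could work:

constexpr int SubgroupSize    = 16;
constexpr int CopyBlockHeight = 16;   // e.g. XE_2D_U32x16x16_LD_N
constexpr int CopyBlockWidth  = 16;
constexpr int FragmentSize    = (CopyBlockHeight * CopyBlockWidth) / SubgroupSize;  // 16 instead of the fixed 8
// trC / trD would then be sized with FragmentSize so the copy shapes line up.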

@anamikac-intel (Author)

Seems legacy code has limited compatibility with load/store op variants

@taozha2 @jiyang1011,

Do you know the background? SPIR-V doesn't seem to have any restrictions for something like XE_2D_U32x16x16_ST_N, but it doesn't exist.

Thanks!

I tried with a store op (16 width x 16 height) but it seems we have a hardware constraint, so we cannot have height > 8:

In file included from /home/gta/test/cutlass-sycl/include/cute/atom/copy_traits_xe_2d.hpp:38:
/home/gta/test/cutlass-sycl/include/cute/arch/copy_xe_2d.hpp:163:17: error: static assertion failed due to requirement '16 <= 8': Height exceeds hardware limits
163 | static_assert(Height <= 8, "Height exceeds hardware limits");
| ^~~~~~~~~~~
/home/gta/test/cutlass-sycl/include/cute/atom/copy_traits_xe_2d.hpp:1156:34: note: in instantiation of template class 'cute::XE_STORE_2D<32, 16, 16>' requested here

Tensor tCgC = thread_xe_load_c.partition_S(gD);
// Get thread-level partitioning across the entire workgroup tile
auto thread_xe_load_c = copy_c.get_thread_slice(thread_idx);
Tensor tCgC = reshape_with_unit_insertion(thread_xe_load_c.partition_S(gD));


What's the purpose of this reshape and why is it needed?
