
Conversation


@anamikac-intel anamikac-intel commented Oct 22, 2025

PR #540 modernizes the CollectiveMma module by replacing legacy atoms with their updated counterparts.
The current PR focuses on updating the CollectiveEpilogue module with similar improvements. However, PR #540 must be merged first, as the CollectiveEpilogue changes depend on the atom updates introduced in that pull request. This PR also depends on the new copy_c/copy_d APIs for load/store (#572).

@tdeng5 tdeng5 requested review from jiyang1011 and taozha2 October 22, 2025 06:59
@anamikac-intel (Author)

@petercad - While testing the new make_block_2d_copy_{C,D} APIs for loads/stores, I am seeing some performance drops when the API automatically selects load/store operations compared to manually specified operations.

[screenshot: performance comparison of auto-selected vs. manually specified load/store operations]


//remember this PR https://github.com/intel/sycl-tla/pull/565/files
private:
constexpr static bool is_source_supported = not cute::is_void_v<ElementC> && not cute::is_void_v<CopyOpG2R>;
@anamikac-intel (Author)

@jiyang1011 - The validation logic from PR #565 that sets is_source_supported to false when CopyOpG2R is void needs updating. With this PR's automatic ops selection, both CopyOpG2R and CopyOpR2G can now legitimately be void since make_block_2d_copy_* automatically selects appropriate operations.
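
A minimal sketch of the relaxed check, assuming a void CopyOpG2R now simply means "auto-select" (illustrative only, not the final change):

// Source (C) support no longer requires an explicit copy op, since
// make_block_2d_copy_* can pick one automatically when CopyOpG2R is void.
constexpr static bool is_source_supported = not cute::is_void_v<ElementC>;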

@jiyang1011 jiyang1011 Oct 23, 2025

Could we set a default copy trait, like XeCopyAuto or something else, which would also call make_block_2d_copy_*?
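
A minimal sketch of that idea, assuming a new marker type (XeCopyAuto does not exist today; the name comes from this comment):

#include <type_traits>

struct XeCopyAuto {};  // hypothetical marker trait

// The collective would route both void and XeCopyAuto through make_block_2d_copy_*.
template <class CopyOp>
static constexpr bool use_auto_copy =
    cute::is_void_v<CopyOp> || std::is_same_v<CopyOp, XeCopyAuto>;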

@tdeng5 tdeng5 added the urgent PR requires a urgent attention (for release or blocking another PR) label Oct 23, 2025

@anamikac-intel anamikac-intel marked this pull request as ready for review October 23, 2025 07:40
auto thread_xe_load_c = params.xe_load_c.get_thread_slice(thread_idx);
// Get thread-level partitioning across the entire workgroup tile
auto thread_xe_load_c = copy_c.get_thread_slice(thread_idx);
Tensor tCgC = thread_xe_load_c.partition_S(gD);
@anamikac-intel anamikac-intel (Author) Oct 23, 2025

@petercad - I was trying to fix the register spills... reducing copy operations is not helping much. The issue seems to be coming from how the tiling is done. In the legacy code, after partitioning, tCgC and tCgD were:
tCgC : ArithTuple(0,0,0) o ((_8,_1),_4,_4):((_1@0,_0),_8@0,_16@1)
tCgD : ArithTuple(0,0,0) o ((_8,_1),_4,_4):((_1@0,_0),_8@0,_16@1)

So we have 8 fragments of size 4 x 4

whereas in the new code we have 128 fragments of size 1 x 1:

tCgC: ArithTuple(0,0,0) o ((_8,(_4,_4)),_1,_1):((_1@0,(_8@0,_16@1)),_0,_0)
tCgD: ArithTuple(0,0,0) o ((_8,(_4,_4)),_1,_1):((_1@0,(_8@0,_16@1)),_0,_0)

I tried tiling further with SubgroupTileShape{} (_32,_64,_32) but got the same result:
g_wg_D: ArithTuple(0,0,0) o (_256,_256):(_1@0,_1@1)
gD: ArithTuple(0,0,0) o (_32,_64):(_1@0,_1@1)
tCgC: ArithTuple(0,0,0) o ((_8,(_4,_4)),_1,_1):((_1@0,(_8@0,_16@1)),_0,_0)
tCgD: ArithTuple(0,0,0) o ((_8,(_4,_4)),_1,_1):((_1@0,(_8@0,_16@1)),_0,_0)

This seems to be the actual issue. When I reshaped the layout so that tCgC/tCgD become ArithTuple(0,0,0) o ((_8,_1),_4,_4):((_1@0,_0),_8@0,_16@1) (8 fragments of 4x4), the perf drop is fixed. But re-layouting tCgC/tCgD might not be the best option, so can you please check?

[screenshot: performance after reshaping the tCgC/tCgD layout]

@sanchitintel sanchitintel Oct 23, 2025

we have 8 fragments of size 4 x 4

No, the per-thread D or C fragment size is 128 elements in both cases, but the layout is different.
The WG tile size is (256, 256, 32).
There are 32 subgroups, each with 16 threads.
Each SG fragment for C or D is (32, 64) spatially since the subgroup layout is 8x4 in the example.
Each C or D thread fragment is sized 128 elements.

Both ((_8,_1),_4,_4) and ((_8,(_4,_4)),_1,_1) have 128 elements.
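
A minimal host-side check of this, assuming CuTe headers are available (shapes taken from the layouts quoted above):

#include <cute/tensor.hpp>
using namespace cute;

int main() {
  // Legacy thread-fragment shape: ((_8,_1),_4,_4)
  auto legacy_shape = make_shape(make_shape(_8{}, _1{}), _4{}, _4{});
  // New thread-fragment shape: ((_8,(_4,_4)),_1,_1)
  auto new_shape    = make_shape(make_shape(_8{}, make_shape(_4{}, _4{})), _1{}, _1{});
  // Both enumerate 128 elements: a (32,64) SG tile split over 16 threads = 128 per thread.
  print(size(legacy_shape)); print("\n");  // _128
  print(size(new_shape));    print("\n");  // _128
}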


@sanchitintel sanchitintel Oct 24, 2025

Hi @petercad, the recent commits contain many more changes than just the thread-fragment layout (which seems to be equivalent to the previous one). This suggests there is a lurking factor that fixed the performance issues observed earlier in this PR, and that the thread-fragment layout of the new C, D copy atoms isn't itself problematic.

Thanks!

@anamikac-intel (Author)

Yes, this code is close to the legacy one, except that I am using the new copy atoms with the reshaped layout... my only concern is that it only works with ops that have 16 width × 8 height. #573 (comment)

@petercad petercad Oct 24, 2025

@sanchitintel -- will look at it in more detail shortly. The underlying cause for earlier regressions seems to be twofold:

  • Code scheduling issues in IGC. It seems it is not moving loads/stores around sufficiently to reduce register pressure.
  • For the C loads, make_block_2d_copy_c will try to make the blocks as large as possible (because it's operating on the assumption that you're loading all of C at once), but that brings additional register pressure.

The second point is not enough to explain the spills (there is plenty of register space even if you do load huge chunks of C), but it aggravates the first point.

@anamikac-intel anamikac-intel changed the title [WIP] Use newer version of mma_atom and copy_atom in CollectiveEpilogue for 00_bmg_gemm test Use newer version of mma_atom and copy_atom in CollectiveEpilogue for 00_bmg_gemm test Oct 23, 2025
@sanchitintel sanchitintel changed the title Use newer version of mma_atom and copy_atom in CollectiveEpilogue for 00_bmg_gemm test Use newer version of copy_atom in GEMM epilogue collective Oct 23, 2025
@sanchitintel sanchitintel changed the title Use newer version of copy_atom in GEMM epilogue collective Use newer version of copy_atom in epilogue collective Oct 24, 2025

anamikac-intel commented Oct 24, 2025

Just an observation: I was trying to compare tCgC/tCgD between the legacy and new code across different load/store operations, but I found that the legacy code only functions correctly with one specific operation combination:

Load: XE_2D_U32x8x16_LD_N (dimensions: 16 width × 8 height)
Store: XE_2D_U32x8x16_ST_N (dimensions: 16 width × 8 height)
(whose tCgC is ArithTuple(0,0,0) o ((_8,_1),_4,_4):((_1@0,_0),_8@0,_16@1) and tCgD is ArithTuple(0,0,0) o ((_8,_1),_4,_4):((_1@0,_0),_8@0,_16@1))

When I tried to use an alternative load (XE_2D_U32x16x16_LD_N / XE_2D_U32x32x16_LD_N), the copy(params.xe_load_c, tCgC(_, epi_m, epi_n), trC) fails, where tCgC is ArithTuple(0,0,0) o ((_16,_1),_2,_4):((_1@0,_0),_16@0,_16@1) and trC is ptr32b o (_8):(_1).

It seems to be because of fragmentSize ((get<0>(MmaAtomShape()) * get<1>(MmaAtomShape())) / SubgroupSize), which is always 8 since MmaAtomShape is (8x16x16) and the subgroup size is 16. trC and trD are built with this fragment size (trC: ptr32b o (_8):(_1), trD: ptr32b o (_8):(_1)), which is causing the copy to fail.

Seems legacy code has limited compatibility with load/store op variants
[screenshot: tCgC/tCgD comparison across load/store op variants]
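
A small worked check of the mismatch described above (numbers taken from this comment; the constants are illustrative, not the actual kernel code):

constexpr int MmaAtomM = 8, MmaAtomN = 16, SubgroupSize = 16;
constexpr int fragmentSize = (MmaAtomM * MmaAtomN) / SubgroupSize;  // = 8, the size of trC/trD
// XE_2D_U32x8x16_LD_N  :  8 rows x 16 cols ->  8*16/16 =  8 elements per thread (matches trC)
// XE_2D_U32x16x16_LD_N : 16 rows x 16 cols -> 16*16/16 = 16 elements per thread (mismatch, copy fails)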


sanchitintel commented Oct 24, 2025

Seems legacy code has limited compatibility with load/store op variants

@taozha2 @jiyang1011,

Do you know the background? SPIR-V doesn't seem to have any restrictions for something like XE_2D_U32x16x16_ST_N, but it doesn't exist.

Thanks!

@sanchitintel sanchitintel requested a review from taozha2 October 24, 2025 09:07
Comment on lines +261 to +288
CUTLASS_DEVICE auto reshape_with_unit_insertion(Tensor&& tensor) {
  using namespace cute;

  // Original layout, e.g. ((_8,(_4,_4)),_1,_1) as produced by the new copy atom's partitioning.
  auto orig_layout = tensor.layout();
  auto orig_shape  = orig_layout.shape();
  auto orig_stride = orig_layout.stride();

  // Split mode 0 into its outer vector part (_8) and the inner (_4,_4) part.
  auto first_dim  = get<0>(orig_shape);
  auto outer_part = get<0>(first_dim);
  auto inner_part = get<1>(first_dim);

  auto first_stride = get<0>(orig_stride);
  auto outer_stride = get<0>(first_stride);
  auto inner_stride = get<1>(first_stride);

  // Rebuild as ((outer,_1), inner_0, inner_1), e.g. ((_8,_1),_4,_4),
  // matching the legacy fragment layout expected by the epilogue loop.
  auto target_shape = make_shape(
      make_shape(outer_part, _1{}),
      get<0>(inner_part),
      get<1>(inner_part)
  );

  auto target_stride = make_stride(
      make_stride(outer_stride, _0{}),
      get<0>(inner_stride),
      get<1>(inner_stride)
  );

  return make_tensor(tensor.data(), make_layout(target_shape, target_stride));


Thanks for the revision!

IMHO, this seems to be a very good workaround to leverage the new C, D copy atoms, if it works for any copy atom used in epilogues.


but then trC and trD have to be modified for the copy operation to work properly

Sorry, do you mean further changes besides the ones currently in this PR?

@anamikac-intel anamikac-intel (Author) Oct 24, 2025

The legacy code seems to be restricted to the XE_2D_U32x8x16_LD_N/ST_N operations (16×8 dimensions only). If 16×8 is the design target then no further changes are needed; otherwise, generalizing trC/trD requires decoupling them from the fixed fragmentSize, which appears tied to the accumulator layout (_8,_4,_4) and suggests 16×8 as the intended dimension. Not fully sure; waiting for Peter to check.
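
A hypothetical sketch of that generalization (illustrative names only, none of these identifiers exist in the code): derive the register-fragment size from the chosen copy op's block shape instead of hard-coding it from MmaAtomShape, so ops other than 16 width × 8 height could work:

constexpr int SubgroupSize    = 16;
constexpr int CopyBlockHeight = 16;   // e.g. XE_2D_U32x16x16_LD_N
constexpr int CopyBlockWidth  = 16;
constexpr int FragmentSize    = (CopyBlockHeight * CopyBlockWidth) / SubgroupSize;  // 16 instead of the fixed 8
// trC / trD would then be sized with FragmentSize so the copy shapes line up.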

@anamikac-intel (Author)

Seems legacy code has limited compatibility with load/store op variants

@taozha2 @jiyang1011,

Do you know the background? SPIR-V doesn't seem to have any restrictions for something like XE_2D_U32x16x16_ST_N, but it doesn't exist.

Thanks!

I tried with a store op (16 width x 16 height) but it seems we have a hardware constraint, so we cannot have height > 8:

In file included from /home/gta/test/cutlass-sycl/include/cute/atom/copy_traits_xe_2d.hpp:38:
/home/gta/test/cutlass-sycl/include/cute/arch/copy_xe_2d.hpp:163:17: error: static assertion failed due to requirement '16 <= 8': Height exceeds hardware limits
163 | static_assert(Height <= 8, "Height exceeds hardware limits");
| ^~~~~~~~~~~
/home/gta/test/cutlass-sycl/include/cute/atom/copy_traits_xe_2d.hpp:1156:34: note: in instantiation of template class 'cute::XE_STORE_2D<32, 16, 16>' requested here

Tensor tCgC = thread_xe_load_c.partition_S(gD);
// Get thread-level partitioning across the entire workgroup tile
auto thread_xe_load_c = copy_c.get_thread_slice(thread_idx);
Tensor tCgC = reshape_with_unit_insertion(thread_xe_load_c.partition_S(gD));


What's the purpose of this reshape and why is it needed?
