Use newer version of copy_atom in epilogue collective #573

Conversation
@petercad - While testing the new make_block_2d_copy_{C,D} APIs for loads/stores, I am seeing some perf drops when the API automatically selects load/store operations compared to manually specified ones.
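For context, a minimal sketch of the two configurations being compared; the exact signatures and layouts shown here are assumptions based on this thread, not the repo's verbatim API:

```cpp
// Sketch only; signatures and layouts are assumptions.
// Automatic path: the new API picks the 2D block load op for C itself.
auto xe_load_c = make_block_2d_copy_C(tiled_mma, gC);   // op auto-selected

// Manual path (legacy style): the op is named explicitly, e.g. the
// 16-wide x 8-high XE_2D_U32x8x16_LD_N discussed below.
auto xe_load_c_manual =
    make_tiled_copy(Copy_Atom<XE_2D_U32x8x16_LD_N, float>{},
                    ThreadLayout{}, ValueLayout{});     // placeholder layouts
```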
// remember this PR https://github.com/intel/sycl-tla/pull/565/files
private:
  constexpr static bool is_source_supported = not cute::is_void_v<ElementC> && not cute::is_void_v<CopyOpG2R>;
@jiyang1011 - The validation logic from PR #565, which sets is_source_supported to false when CopyOpG2R is void, needs updating. With this PR's automatic op selection, both CopyOpG2R and CopyOpR2G can now legitimately be void, since make_block_2d_copy_* automatically selects appropriate operations.
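A minimal sketch of the relaxed check (my reading of the comment above, not the PR's actual diff): treat a void copy op as "auto-select" rather than "unsupported":

```cpp
// With automatic op selection, a void CopyOpG2R means "let
// make_block_2d_copy_C pick the op"; only a void ElementC disables the source.
constexpr static bool is_source_supported = not cute::is_void_v<ElementC>;
```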
Could we set a default copy trait, like XeCopyAuto or something similar, that would also call make_block_2d_copy_*?
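A sketch of that idea (XeCopyAuto is the hypothetical tag from the question above, not an existing type):

```cpp
#include <type_traits>

// Hypothetical default trait: an epilogue seeing XeCopyAuto (or void) would
// fall through to make_block_2d_copy_* for automatic op selection.
struct XeCopyAuto {};

template <class CopyOp>
inline constexpr bool use_auto_copy_v =
    std::is_void_v<CopyOp> || std::is_same_v<CopyOp, XeCopyAuto>;
```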
Added right extension
Added legacy DispatchPolicy
-  auto thread_xe_load_c = params.xe_load_c.get_thread_slice(thread_idx);
+  // Get thread-level partitioning across the entire workgroup tile
+  auto thread_xe_load_c = copy_c.get_thread_slice(thread_idx);
   Tensor tCgC = thread_xe_load_c.partition_S(gD);
@petercad - I was trying to fix the register spills; reducing copy operations doesn't help much. The issue seems to come from how the tiling is done. In the legacy code, after partitioning, tCgC and tCgD were:

tCgC : ArithTuple(0,0,0) o ((_8,_1),_4,_4):((_1@0,_0),_8@0,_16@1)
tCgD : ArithTuple(0,0,0) o ((_8,_1),_4,_4):((_1@0,_0),_8@0,_16@1)

so we have 8 fragments of size 4 x 4, whereas in the new code we have 128 fragments of size 1 x 1:

tCgC: ArithTuple(0,0,0) o ((_8,(_4,_4)),_1,_1):((_1@0,(_8@0,_16@1)),_0,_0)
tCgD: ArithTuple(0,0,0) o ((_8,(_4,_4)),_1,_1):((_1@0,(_8@0,_16@1)),_0,_0)

I tried tiling further with SubgroupTileShape{}: (_32,_64,_32), but got the same result:

g_wg_D: ArithTuple(0,0,0) o (_256,_256):(_1@0,_1@1)
gD: ArithTuple(0,0,0) o (_32,_64):(_1@0,_1@1)
tCgC: ArithTuple(0,0,0) o ((_8,(_4,_4)),_1,_1):((_1@0,(_8@0,_16@1)),_0,_0)
tCgD: ArithTuple(0,0,0) o ((_8,(_4,_4)),_1,_1):((_1@0,(_8@0,_16@1)),_0,_0)

This seems to be the actual issue. When I reshaped tCgC/tCgD back to ((_8,_1),_4,_4):((_1@0,_0),_8@0,_16@1) (8 fragments of 4x4), the perf drop was fixed. But re-layouting tCgC/tCgD might not be the best option, so can you please check?
"we have 8 fragments of size 4 x 4"

No, the per-thread C or D fragment size is 128 elements in both cases; only the layout differs.
The WG tile size is (256, 256, 32).
There are 32 subgroups, each with 16 threads.
Each SG fragment for C or D is (32, 64) spatially, since the subgroup layout is 8x4 in this example.
Each C or D thread fragment is therefore (32 * 64) / 16 = 128 elements.
Both ((_8,_1),_4,_4) and ((_8,(_4,_4)),_1,_1) have 128 elements.
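A quick check of that claim with CuTe (shapes copied from the dumps above; strides omitted since only the element count matters here):

```cpp
#include <cute/layout.hpp>
using namespace cute;

using LegacyFrag = Shape<Shape<_8,_1>,_4,_4>;           // ((8,1),4,4)
using NewFrag    = Shape<Shape<_8,Shape<_4,_4>>,_1,_1>; // ((8,(4,4)),1,1)

static_assert(size(LegacyFrag{}) == 128, "legacy per-thread fragment");
static_assert(size(NewFrag{})    == 128, "new per-thread fragment");
```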
Hi @petercad, the recent commits contain a lot more changes than just the thread-fragment layout (which seems equivalent to the previous one). That suggests there's a lurking factor that fixed the performance issues observed earlier in this PR, and that the thread-fragment layout of the new C/D copy atoms isn't itself problematic.
Thanks!
Yes, this code is close to the legacy one, except I am using the new copy atoms with the reshaped layout. My only concern is that it only works with ops that have 16 width × 8 height; see #573 (comment).
@sanchitintel -- will look at it in more detail shortly. The underlying cause of the earlier regressions seems to be twofold:
- Code scheduling issues in IGC: it does not seem to move loads/stores around enough to reduce register pressure.
- For the C loads, make_block_2d_copy_C tries to make the blocks as large as possible (it operates on the assumption that you're loading all of C at once), which brings additional register pressure.

The second point is not enough to explain the spills on its own (there is plenty of register space even if you do load huge chunks of C), but it aggravates the first point.
Just an observation: I was trying to compare tCgC/tCgD between the legacy and new code across different load/store operations, but I found that the legacy code only functions correctly with one specific operation combination: the load XE_2D_U32x8x16_LD_N (16 width × 8 height). When I tried alternative loads (XE_2D_U32x16x16_LD_N / XE_2D_U32x32x16_LD_N), the copy(params.xe_load_c, tCgC(_, epi_m, epi_n), trC) fails, where tCgC is ArithTuple(0,0,0) o ((_16,_1),_2,_4):((_1@0,_0),_16@0,_16@1) and trC is ptr32b o (_8):(_1). This seems to be because of fragmentSize ((get<0>(MmaAtomShape()) * get<1>(MmaAtomShape())) / SubgroupSize), which is always 8, since MmaAtomShape is (8,16,16) and the subgroup size is 16. trC and trD are built from this fragment size (trC: ptr32b o (_8):(_1), trD: ptr32b o (_8):(_1)), which causes the copy to fail. The legacy code seems to have limited compatibility with load/store op variants.
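Worked through, the fragmentSize expression quoted above (numbers from this comment; a sketch, not the collective's actual code):

```cpp
// MmaAtomShape = (8, 16, 16), SubgroupSize = 16:
constexpr int frag_m = 8, frag_n = 16, subgroup_size = 16;
constexpr int fragment_size = (frag_m * frag_n) / subgroup_size;  // = 8
static_assert(fragment_size == 8);
// trC/trD are ptr32b o (_8):(_1), so a 16-row load such as
// XE_2D_U32x16x16_LD_N produces more data per thread than the fragment holds.
```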
Do you know the background? SPIR-V doesn't seem to have any restrictions for something like this. Thanks!
template <class Tensor>
CUTLASS_DEVICE auto reshape_with_unit_insertion(Tensor&& tensor) {
  using namespace cute;

  auto orig_layout = tensor.layout();
  auto orig_shape  = orig_layout.shape();
  auto orig_stride = orig_layout.stride();

  // The first mode is ((outer, (i0, i1))): split it into outer and inner parts.
  auto first_dim  = get<0>(orig_shape);
  auto outer_part = get<0>(first_dim);
  auto inner_part = get<1>(first_dim);

  auto first_stride = get<0>(orig_stride);
  auto outer_stride = get<0>(first_stride);
  auto inner_stride = get<1>(first_stride);

  // Hoist the nested inner modes to the top level and insert a unit mode in
  // their place: ((outer,(i0,i1)),_1,_1) -> ((outer,_1), i0, i1).
  auto target_shape = make_shape(
      make_shape(outer_part, _1{}),
      get<0>(inner_part),
      get<1>(inner_part));

  auto target_stride = make_stride(
      make_stride(outer_stride, _0{}),
      get<0>(inner_stride),
      get<1>(inner_stride));

  return make_tensor(tensor.data(), make_layout(target_shape, target_stride));
}
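Read against the layouts from this thread, the helper performs the following mapping (shapes only; my reading of the code above):

```cpp
// input : ((_8,(_4,_4)),_1,_1)  -- new-atom partition, 4x4 modes nested inside
// output: ((_8,_1),_4,_4)       -- legacy grouping: 4x4 modes hoisted to the
//                                  top level, a unit mode inserted in mode 0
```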
Thanks for the revision!
IMHO, this seems to be a very good workaround for leveraging the new C/D copy atoms, if it works for any copy atom used in epilogues.
but then trC and trD have to be modified for the copy operation to work properly
Sorry, do you mean further changes besides the ones currently in this PR?
The legacy code seems to be restricted to the XE_2D_U32x8x16_LD_N/ST_N operations (16×8 dimensions only). If 16×8 is the design target, then no further changes are needed; otherwise, generalizing trC/trD requires decoupling it from the fixed fragmentSize, which appears tied to the accumulator layout (_8,_4,_4), suggesting 16×8 as the optimal dimension. Not fully sure; waiting for Peter to check.
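A hedged sketch of the decoupling idea (BlockWidth/BlockHeight are hypothetical members here; the real copy-op traits may expose this information differently):

```cpp
// Derive the per-thread fragment size from the copy op's block shape instead
// of the MMA atom, so 16- or 32-row loads get correctly sized trC/trD.
template <class CopyOp, int SubgroupSize>
inline constexpr int fragment_size_v =
    (CopyOp::BlockWidth * CopyOp::BlockHeight) / SubgroupSize;
```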
I tried with a store op (16 width × 16 height), but it seems there is a hardware constraint, so we cannot have height > 8:
In file included from /home/gta/test/cutlass-sycl/include/cute/atom/copy_traits_xe_2d.hpp:38:
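If the 8-row limit is indeed a hardware restriction on Xe 2D block stores (my assumption from the error above), a compile-time guard could surface it earlier; a sketch with hypothetical names:

```cpp
// Hypothetical guard: reject store ops taller than 8 rows at compile time.
template <int BlockHeight>
struct Validate2dBlockStore {
  static_assert(BlockHeight <= 8,
                "Xe 2D block stores appear limited to height <= 8");
};
```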
-  Tensor tCgC = thread_xe_load_c.partition_S(gD);
   // Get thread-level partitioning across the entire workgroup tile
   auto thread_xe_load_c = copy_c.get_thread_slice(thread_idx);
+  Tensor tCgC = reshape_with_unit_insertion(thread_xe_load_c.partition_S(gD));
What's the purpose of this reshape and why is it needed?


PR #540 modernizes the CollectiveMma module by replacing legacy atoms with their updated counterparts.
The current PR focuses on updating the CollectiveEpilogue module with similar improvements. However, PR #540 must be merged first, as the CollectiveEpilogue changes depend on the atom updates introduced in that pull request. This PR also depends on the new copy_c/copy_d APIs for load/store (#572).