
Conversation

@victor-eds
Contributor

Detect sub-group transpose cases as those in which the warp and lane dimensions in the resulting linear layouts get swapped and no transfer within block groups is needed. Use sub-group write operations to store the contents to local memory and vector operations to read them back. These will be translated to non-transposed stores and transposed loads, respectively. As data is only moved within sub-groups, no barriers are needed.

For now, handle only the case of a single `sub_group_size^2` block being transposed.

Instead of using the `triton_gen` operation directly, we could use a higher-level operation, thus moving legalization to a common place, as we already legalize this same operation elsewhere in our codebase.

This may be split into `N*M` iterations in the future to handle matrices of size `(N*sub_group_size) x (M*sub_group_size)`.
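
To make the intended data movement concrete, here is a minimal NumPy sketch of the scheme (illustrative only; it assumes lane `i` of the sub-group starts out owning column `i` of a `sub_group_size x sub_group_size` tile and must end up owning row `i`, and it does not model the backend's actual SLM addressing):

```python
import numpy as np

sub_group_size = 16
tile = np.arange(sub_group_size * sub_group_size).reshape(sub_group_size, sub_group_size)

# Registers before the conversion: lane i of the sub-group owns column i.
regs_in = [tile[:, lane].copy() for lane in range(sub_group_size)]

# Step 1: sub-group (SIMD) store to local memory. A sub-group write emits
# element k of every lane side by side, so the scratch buffer ends up
# holding the tile in plain row-major order.
slm = np.empty(sub_group_size * sub_group_size, dtype=tile.dtype)
for k in range(sub_group_size):
    for lane in range(sub_group_size):
        slm[k * sub_group_size + lane] = regs_in[lane][k]

# Step 2: per-lane vector load. Lane i reads sub_group_size contiguous
# elements, i.e. row i of the tile.
regs_out = [slm[lane * sub_group_size:(lane + 1) * sub_group_size].copy()
            for lane in range(sub_group_size)]

# Every lane went from owning a column to owning a row.
for lane in range(sub_group_size):
    assert (regs_out[lane] == tile[lane, :]).all()
```

Since the round trip never moves data outside the owning sub-group, no barrier is required.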

@victor-eds victor-eds requested review from a team, etiotto and whitneywhtsang October 18, 2024 12:36
@victor-eds victor-eds self-assigned this Oct 18, 2024
@victor-eds victor-eds requested a review from hwnam831 October 18, 2024 15:34
@victor-eds
Contributor Author

@hwnam831 can you please check my linear layout knowledge is on point?

@whitneywhtsang
Contributor

> @hwnam831 can you please check my linear layout knowledge is on point?

Nam has finished his internship, and he is back to school, so he may not be able to check this PR.

@victor-eds victor-eds removed the request for review from hwnam831 October 18, 2024 16:28
@victor-eds
Contributor Author

Part of #2266.

@victor-eds victor-eds requested a review from chengjunlu October 22, 2024 08:16
@chengjunlu
Contributor

LGTM.

Contributor

@whitneywhtsang whitneywhtsang left a comment


Is this change upstreamable?

@victor-eds
Contributor Author

> Is this change upstreamable?

I don't think so. Not sure whether the performance gain translates to NVIDIA too.

@victor-eds
Contributor Author

Performance evaluation is ongoing.

@victor-eds victor-eds marked this pull request as draft October 22, 2024 14:45
@victor-eds
Contributor Author

victor-eds commented Oct 23, 2024

Let's use the following code to compare the generated ASM:

#blocked = #triton_gpu.blocked<{sizePerThread = [16, 1], threadsPerWarp = [1, 16], warpsPerCTA = [1, 1], order = [0, 1]}>
#blocked1 = #triton_gpu.blocked<{sizePerThread = [1, 16], threadsPerWarp = [16, 1], warpsPerCTA = [1, 1], order = [0, 1]}>

module attributes {"triton_gpu.num-ctas" = 1 : i32, "triton_gpu.num-warps" = 1 : i32, triton_gpu.target = "xpu", "triton_gpu.threads-per-warp" = 16 : i32} {
  tt.func @test_f16(%arg0: tensor<16x16xf16, #blocked>, %arg1: tensor<16x16x!tt.ptr<f16>, #blocked1>) {
    %0 = triton_gpu.convert_layout %arg0 : tensor<16x16xf16, #blocked> -> tensor<16x16xf16, #blocked1>
    tt.store %arg1, %0 : tensor<16x16x!tt.ptr<f16>, #blocked1>
    tt.return
  }
}
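
For reference, Triton source along the lines of the kernel below is the kind of code that gives rise to such a `convert_layout`. The kernel and its names are illustrative; with `BLOCK = 16` and `num_warps=1` the transpose stays within a single sub-group, though the exact layouts the compiler assigns may differ.

```python
import triton
import triton.language as tl


@triton.jit
def transpose_tile(in_ptr, out_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    # Load a BLOCK x BLOCK tile (row-major), transpose it in registers,
    # and store it back row-major, i.e. out = in^T.
    x = tl.load(in_ptr + offs[:, None] * BLOCK + offs[None, :])
    tl.store(out_ptr + offs[:, None] * BLOCK + offs[None, :], tl.trans(x))
```

A launch such as `transpose_tile[(1,)](inp, out, BLOCK=16, num_warps=1)` keeps the whole tile inside one warp.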

Before this patch, we did not detect that these were within-warp transposes, so we just generated basic stores and loads to SLM:

# 8 stores
store.slm.d32.a32 (16|M0)  [r8:1]       r19:1              {F@4,$3}
store.slm.d32.a32 (16|M0)  [r15:1]      r20:1              {I@2,$4}
...
# 16 loads
load.slm.d16u32.a32 (16|M0)  r69:1      [r63:1]            {I@4,$13}
load.slm.d16u32.a32 (16|M0)  r65:1      [r47:1]            {I@5,$14}
...

whereas the optimized case leverages more efficient instructions:

# Single store
store.slm.d64x64t.a32 (1|M0)  [r9:1]    r1:8               {A@1,$3}
# Single load
load.slm.d32x8.a32 (16|M0)  r9:8        [r2:1]             {I@1,$4}

This also translates to less pointer arithmetic and fewer ALU instructions overall:

# Base case
//.numALUInst: 224
# Optimized case
//.numALUInst: 217

Note that the difference in arithmetic instruction count should be much larger (I don't have an estimate), but the base case reuses the indices computed for SLM access for the UGM access as well.

Also, as we know the data won't be shared across warps, we simply do not need synchronization:

# Base case
//.syncInstCount: 9
# Optimized case
//.syncInstCount: 0

Update: Optimized transpose code fits in a single screenshot:

[screenshot: optimized transpose assembly]

@etiotto etiotto marked this pull request as ready for review October 23, 2024 13:58
@chengjunlu
Contributor

chengjunlu commented Oct 24, 2024

> whereas the optimized case leverages more efficient instructions:
>
> # Single store
> store.slm.d64x64t.a32 (1|M0)  [r9:1]    r1:8               {A@1,$3}
> # Single load
> load.slm.d32x8.a32 (16|M0)  r9:8        [r2:1]             {I@1,$4}

An interesting thing is that the assembly shows the store is transposed and the load is not transposed:
store.slm.d64x64t.a32

The comments in the Triton code say there are a non-transposed store and a transposed load. Any background for this?

@victor-eds
Contributor Author

> > whereas the optimized case leverages more efficient instructions:
> >
> > # Single store
> > store.slm.d64x64t.a32 (1|M0)  [r9:1]    r1:8               {A@1,$3}
> > # Single load
> > load.slm.d32x8.a32 (16|M0)  r9:8        [r2:1]             {I@1,$4}
>
> An interesting thing is that the assembly shows the store is transposed and the load is not transposed: store.slm.d64x64t.a32
>
> The comments in the Triton code say there are a non-transposed store and a transposed load. Any background for this?

Yes: bad comments :) I will update them later. Thanks for pointing that out!
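
For what it's worth, whichever side carries the transpose, the round trip through local memory produces the same result; a trivial NumPy check (illustrative only):

```python
import numpy as np

x = np.arange(16 * 16).reshape(16, 16)

# Transposed store, then plain load (what the GEN assembly shows).
slm_a = x.T.copy()
out_a = slm_a.copy()

# Plain store, then transposed load (what the code comments described).
slm_b = x.copy()
out_b = slm_b.T.copy()

assert (out_a == out_b).all() and (out_a == x.T).all()
```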

@victor-eds victor-eds enabled auto-merge (squash) October 24, 2024 10:37
@victor-eds victor-eds merged commit 988b62b into intel:main Oct 24, 2024
4 checks passed

Successfully merging this pull request may close these issues.

Port "sub-group transpose reduction" to default path
