
Conversation

@victor-eds
Contributor

Detect sub-group transpose cases as those in which the warp and lane dimensions in the resulting linear layouts get swapped and no transfer within block groups is needed. Use sub-group write operations to store the contents to local memory and vector operations to read them back. These will be translated to non-transposed stores and transposed loads, respectively. As data is only moved within sub-groups, no barriers are needed.

For now, handle only the case of a single `sub_group_size^2` block being transposed.

Instead of using the `triton_gen` operation directly, we could use a higher-level operation, thus moving legalization to a common place, as we already legalize this same operation elsewhere in our codebase.

This may be split into `N*M` iterations in the future to handle matrices of size `(N*sub_group_size) x (M*sub_group_size)`.
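
To make the intended data movement concrete, here is a minimal NumPy sketch of the scheme (illustrative only; it assumes lane `i` of the sub-group starts out owning column `i` of a `sub_group_size x sub_group_size` tile and must end up owning row `i`, and it does not model the backend's actual SLM addressing):

```python
import numpy as np

sub_group_size = 16
tile = np.arange(sub_group_size * sub_group_size).reshape(sub_group_size, sub_group_size)

# Registers before the conversion: lane i of the sub-group owns column i.
regs_in = [tile[:, lane].copy() for lane in range(sub_group_size)]

# Step 1: sub-group (SIMD) store to local memory. A sub-group write emits
# element k of every lane side by side, so the scratch buffer ends up
# holding the tile in plain row-major order.
slm = np.empty(sub_group_size * sub_group_size, dtype=tile.dtype)
for k in range(sub_group_size):
    for lane in range(sub_group_size):
        slm[k * sub_group_size + lane] = regs_in[lane][k]

# Step 2: per-lane vector load. Lane i reads sub_group_size contiguous
# elements, i.e. row i of the tile.
regs_out = [slm[lane * sub_group_size:(lane + 1) * sub_group_size].copy()
            for lane in range(sub_group_size)]

# Every lane went from owning a column to owning a row.
for lane in range(sub_group_size):
    assert (regs_out[lane] == tile[lane, :]).all()
```

Since the round trip never moves data outside the owning sub-group, no barrier is required.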

@victor-eds victor-eds requested review from a team, etiotto and whitneywhtsang October 18, 2024 12:36
@victor-eds victor-eds self-assigned this Oct 18, 2024
@victor-eds victor-eds requested a review from hwnam831 October 18, 2024 15:34
@victor-eds
Contributor Author

@hwnam831 can you please check my linear layout knowledge is on point?

@whitneywhtsang
Contributor

> @hwnam831 can you please check my linear layout knowledge is on point?

Nam has finished his internship, and he is back to school, so he may not be able to check this PR.

@victor-eds victor-eds removed the request for review from hwnam831 October 18, 2024 16:28
@victor-eds
Contributor Author

Part of #2266.

@victor-eds victor-eds requested a review from chengjunlu October 22, 2024 08:16
@chengjunlu
Contributor

LGTM.

Contributor

@whitneywhtsang whitneywhtsang left a comment


Is this change upstreamable?

@victor-eds
Contributor Author

> Is this change upstreamable?

I don't think so. Not sure whether the performance gain translates to NVIDIA too.

@victor-eds
Contributor Author

Performance evaluation is ongoing.

@victor-eds victor-eds marked this pull request as draft October 22, 2024 14:45
@victor-eds
Contributor Author

victor-eds commented Oct 23, 2024

Let's use the following code to compare the generated ASM:

#blocked = #triton_gpu.blocked<{sizePerThread = [16, 1], threadsPerWarp = [1, 16], warpsPerCTA = [1, 1], order = [0, 1]}>
#blocked1 = #triton_gpu.blocked<{sizePerThread = [1, 16], threadsPerWarp = [16, 1], warpsPerCTA = [1, 1], order = [0, 1]}>

module attributes {"triton_gpu.num-ctas" = 1 : i32, "triton_gpu.num-warps" = 1 : i32, triton_gpu.target = "xpu", "triton_gpu.threads-per-warp" = 16 : i32} {
  tt.func @test_f16(%arg0: tensor<16x16xf16, #blocked>, %arg1: tensor<16x16x!tt.ptr<f16>, #blocked1>) {
    %0 = triton_gpu.convert_layout %arg0 : tensor<16x16xf16, #blocked> -> tensor<16x16xf16, #blocked1>
    tt.store %arg1, %0 : tensor<16x16x!tt.ptr<f16>, #blocked1>
    tt.return
  }
}
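
For reference, Triton source along the lines of the kernel below is the kind of code that gives rise to such a `convert_layout`. The kernel and its names are illustrative; with `BLOCK = 16` and `num_warps=1` the transpose stays within a single sub-group, though the exact layouts the compiler assigns may differ.

```python
import triton
import triton.language as tl


@triton.jit
def transpose_tile(in_ptr, out_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    # Load a BLOCK x BLOCK tile (row-major), transpose it in registers,
    # and store it back row-major, i.e. out = in^T.
    x = tl.load(in_ptr + offs[:, None] * BLOCK + offs[None, :])
    tl.store(out_ptr + offs[:, None] * BLOCK + offs[None, :], tl.trans(x))
```

A launch such as `transpose_tile[(1,)](inp, out, BLOCK=16, num_warps=1)` keeps the whole tile inside one warp.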

Before this patch, we did not detect that these were within-warp transposes, so we just generated basic stores and loads to SLM:

# 8 stores
store.slm.d32.a32 (16|M0)  [r8:1]       r19:1              {F@4,$3}
store.slm.d32.a32 (16|M0)  [r15:1]      r20:1              {I@2,$4}
...
# 16 loads
load.slm.d16u32.a32 (16|M0)  r69:1      [r63:1]            {I@4,$13}
load.slm.d16u32.a32 (16|M0)  r65:1      [r47:1]            {I@5,$14}
...

whereas the optimized case leverages more efficient instructions:

# Single store
store.slm.d64x64t.a32 (1|M0)  [r9:1]    r1:8               {A@1,$3}
# Single load
load.slm.d32x8.a32 (16|M0)  r9:8        [r2:1]             {I@1,$4}

This also translates to less pointer arithmetic and fewer ALU instructions overall:

# Base case
//.numALUInst: 224
# Optimized case
//.numALUInst: 217

Note that the difference in arithmetic instruction count should be much larger (I don't have an estimate), but the base case reuses the indices computed for SLM access for the UGM access as well.

Also, as we know the data won't be shared across warps, we simply do not need synchronization:

# Base case
//.syncInstCount: 9
# Optimized case
//.syncInstCount: 0

Update: Optimized transpose code fits in a single screenshot:

[screenshot: optimized transpose assembly]

@etiotto etiotto marked this pull request as ready for review October 23, 2024 13:58
@chengjunlu
Contributor

chengjunlu commented Oct 24, 2024

> whereas the optimized case leverages more efficient instructions:
>
> # Single store
> store.slm.d64x64t.a32 (1|M0)  [r9:1]    r1:8               {A@1,$3}
> # Single load
> load.slm.d32x8.a32 (16|M0)  r9:8        [r2:1]             {I@1,$4}

An interesting thing is that the assembly shows the store is transposed and the load is not transposed:
store.slm.d64x64t.a32

The comments in the Triton code say there are a non-transposed store and a transposed load. Any background for this?

@victor-eds
Contributor Author

> > whereas the optimized case leverages more efficient instructions:
> >
> > # Single store
> > store.slm.d64x64t.a32 (1|M0)  [r9:1]    r1:8               {A@1,$3}
> > # Single load
> > load.slm.d32x8.a32 (16|M0)  r9:8        [r2:1]             {I@1,$4}
>
> An interesting thing is that the assembly shows the store is transposed and the load is not transposed: store.slm.d64x64t.a32
>
> The comments in the Triton code say there are a non-transposed store and a transposed load. Any background for this?

Yes: bad comments :) I will update them later. Thanks for pointing that out!
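
For what it's worth, whichever side carries the transpose, the round trip through local memory produces the same result; a trivial NumPy check (illustrative only):

```python
import numpy as np

x = np.arange(16 * 16).reshape(16, 16)

# Transposed store, then plain load (what the GEN assembly shows).
slm_a = x.T.copy()
out_a = slm_a.copy()

# Plain store, then transposed load (what the code comments described).
slm_b = x.copy()
out_b = slm_b.T.copy()

assert (out_a == out_b).all() and (out_a == x.T).all()
```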

@victor-eds victor-eds enabled auto-merge (squash) October 24, 2024 10:37
@victor-eds victor-eds merged commit 988b62b into intel:main Oct 24, 2024
4 checks passed

Successfully merging this pull request may close these issues.

Port "sub-group transpose reduction" to default path
