Question about Triton GPU IR SharedEncoding #2026

chengjunlu · 2023-08-03T08:51:31Z

chengjunlu
Aug 3, 2023

There is very little documentation about the SharedEncoding in Triton GPU IR.

I have a new MMA layout attribute for the Intel XMX layout, used for lowering tt.dot to the Intel XMX engine (https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/supported/sycl_ext_intel_esimd/sycl_ext_intel_esimd.md#horizontal-packing-for-a-c-and-result).

The optimization passes decompose the layout conversion into blocked -> shared -> mma, for example:

      %74 = tt.load %arg11 {cache = 1 : i32, evict = 1 : i32, isVolatile = false} : tensor<16x16xf16, #blocked>
      %75 = triton_gpu.convert_layout %74 : (tensor<16x16xf16, #blocked>) -> tensor<16x16xf16, #shared>
      %78 = triton_gpu.convert_layout %75 : (tensor<16x16xf16, #shared>) -> tensor<16x16xf16, #triton_gpu.dot_op<{opIdx = 0, parent = #triton_intel_gpu.intel_mma<{warpsPerCTA = [1, 1], warpTileShape = [4, 2, 16], warpTileStride = [32, 1, 2]}>}>>

I read https://github.com/openai/triton/blob/5df904233c11a65bd131ead7268f84cca7804275/include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td#L52, but I still don't get the idea.

My question is: how should I understand the meaning of the SharedEncoding? Knowing the details would let me lower the swizzle properly.

Answered by zhanglx13

Aug 17, 2023


Amoskin · 2023-08-16T12:34:27Z

Amoskin
Aug 16, 2023

I can't help you!

0 replies

zhanglx13 · 2023-08-16T18:08:01Z

zhanglx13
Aug 16, 2023
Collaborator

It means the elements in the tensor are in shared memory. And the mapping from each element of the tensor to shared memory addresses is represented by the swizzling parameters: vec, perPhase, and maxPhase.

0 replies

chengjunlu · 2023-08-17T00:20:26Z

chengjunlu
Aug 17, 2023
Author

> It means the elements in the tensor are in shared memory. And the mapping from each element of the tensor to shared memory addresses is represented by the swizzling parameters: vec, perPhase, and maxPhase.

How is the layout characterized by vec, perPhase, and maxPhase?
Can you explain it with an example, such as a 16x16 tensor with SharedEncoding?

0 replies

zhanglx13 · 2023-08-17T02:34:07Z

zhanglx13
Aug 17, 2023
Collaborator

Say we have a 16 (M) by 16 (N) tensor A whose elements are f32, and we want to swizzle along the N dim (within each row).

We want to swizzle the elements within each row when putting them in shared memory. Here is how the parameters control the swizzling behavior:

perPhase: multiple consecutive rows can share the same swizzling pattern. The number of rows that share a pattern is perPhase, calculated as perPhase = 128 / (elementsPerRow * elementTypeInBytes), i.e. the number of rows that fit in 128 bytes. In this example, perPhase = 128 / (16*4) = 2, which means every 2 rows have the same swizzling pattern.
maxPhase: how many distinct patterns we want in total. This is usually set according to how shared memory is accessed, to minimize bank conflicts. In this toy example, without assuming any access pattern, we can set maxPhase to 8, so that we have enough swizzling patterns to cover all 16 rows.
vec: when we swizzle (or shuffle, or shift, whatever you want to call it), we move vec elements together as one package. This is also determined by the "user" of the shared memory. Let's assume vec = 2.

In the codebase, each swizzling pattern is identified by a variable called phase, and the swizzling function is an XOR:
col_swizzled = ((col / vec) ^ phase) * vec
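To make the formula concrete, here is a minimal Python sketch of the mapping for the toy example. It rests on assumptions not spelled out in this reply: the phase of a row is taken to be (row / perPhase) % maxPhase, the XOR is applied to the package index before multiplying by vec, and an element keeps its offset (col % vec) inside its package. The function names are made up for illustration and are not Triton APIs.

```python
def per_phase(elements_per_row: int, element_bytes: int) -> int:
    """Rows that share one swizzling pattern: 128 / (bytes per row).

    Clamping to at least 1 is an assumption, so that rows wider than
    128 bytes do not produce 0.
    """
    return max(128 // (elements_per_row * element_bytes), 1)


def phase_of(row: int, per_phase_: int, max_phase: int) -> int:
    """Assumed phase definition: (row / perPhase) % maxPhase."""
    return (row // per_phase_) % max_phase


def swizzle_col(row: int, col: int, vec: int, per_phase_: int, max_phase: int) -> int:
    """Column at which element (row, col) lands in shared memory."""
    phase = phase_of(row, per_phase_, max_phase)
    package = (col // vec) ^ phase       # XOR the package index with the phase
    return package * vec + col % vec     # keep the offset inside the package


if __name__ == "__main__":
    pp = per_phase(elements_per_row=16, element_bytes=4)
    print("perPhase =", pp)
    for row in (0, 2, 3, 15):
        print(f"row {row}: col 0 -> {swizzle_col(row, 0, 2, pp, 8)}")
```

With vec = 2, perPhase = 2 and maxPhase = 8, rows 0 and 1 have phase 0 and are left untouched, while rows 2 and 3 have phase 1 and swap neighbouring 2-element packages.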

The data layout in shared memory becomes
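The figure that followed this sentence is not available in this text. As a stand-in, the sketch below prints the layout that the formula produces for the toy example (16x16, vec = 2, perPhase = 2, maxPhase = 8). It assumes phase = (row / perPhase) % maxPhase and that elements keep their offset inside a vec package; `swizzled_layout` is a hypothetical helper, not code from Triton.

```python
M, N = 16, 16
VEC, PER_PHASE, MAX_PHASE = 2, 2, 8


def swizzled_layout(m, n, vec, per_phase, max_phase):
    """Return layout[row][slot] = (row, col) of the element stored in that slot."""
    layout = [[None] * n for _ in range(m)]
    for row in range(m):
        phase = (row // per_phase) % max_phase  # assumed phase definition
        for col in range(n):
            dst = ((col // vec) ^ phase) * vec + col % vec
            layout[row][dst] = (row, col)
    return layout


layout = swizzled_layout(M, N, VEC, PER_PHASE, MAX_PHASE)
for row in layout:
    # One line per shared-memory row, showing which A[r,c] sits in each slot.
    print(" ".join(f"A{r},{c}".ljust(6) for r, c in row))
```

Each printed row is a permutation of the original row: the XOR only reorders 2-element packages within a row and never moves an element to another row.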

0 replies

chengjunlu · 2023-08-18T00:32:24Z

chengjunlu
Aug 18, 2023
Author

> maxPhase means how many patterns in total do we want. This is usually set according to how shared memory is accessed to minimize bank conflicts. In this toy example, without assuming any access pattern, we can set maxPhase to 8, so that we have enough swizzling patterns to cover all the 16 rows.

Can you explain more about how this minimizes bank conflicts?
I cannot understand how the swizzle pattern defined in SharedEncoding helps. Does it only work with the NV MMAEncoding, or is it a general way to minimize bank conflicts?

1 reply

zhanglx13 Aug 18, 2023
Collaborator

Shared memory bank conflicts are very complicated.
Let's focus on load bank conflicts, which can happen during shared memory loads.
Without swizzling, the data layout in shared memory is the same as in global memory.

The first row is put in banks 0 to 15, the second row in banks 16 to 31, and so on.
When preparing operands for mma instructions, threads in a warp usually access elements in the same column. A0,0, A2,0, A4,0, ... are all in bank 0 and are accessed in the same cycle, which causes bank conflicts.

With swizzling, A0,0, A2,0, A4,0, ... are in different banks.
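The effect on banks can be checked with a small sketch. It assumes 32 banks of one 4-byte word each (consistent with "bank 0 to 15" and "bank 16 to 31" above for 16 f32 per row), the toy parameters vec = 2, perPhase = 2, maxPhase = 8, and phase = (row / perPhase) % maxPhase. The helper names are hypothetical.

```python
NUM_BANKS = 32  # assumption: 32 banks, one f32 word per bank
N = 16          # f32 elements per row
VEC, PER_PHASE, MAX_PHASE = 2, 2, 8


def bank(row, col, swizzle):
    """Bank holding element (row, col) of a row-major 16-wide f32 tile."""
    if swizzle:
        phase = (row // PER_PHASE) % MAX_PHASE  # assumed phase definition
        col = ((col // VEC) ^ phase) * VEC + col % VEC
    return (row * N + col) % NUM_BANKS


def column_banks(col, rows, swizzle):
    """Banks touched when reading one column across the given rows."""
    return [bank(r, col, swizzle) for r in rows]


rows = range(16)
plain = column_banks(0, rows, swizzle=False)
swizzled = column_banks(0, rows, swizzle=True)
print("no swizzle:", plain, "->", len(set(plain)), "distinct banks")
print("swizzle:   ", swizzled, "->", len(set(swizzled)), "distinct banks")
```

Without swizzling, column 0 of the 16 rows alternates between only two banks. With swizzling, the same 16 elements fall into 16 different banks.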

We can also use padding to shift elements and achieve the same effect. The drawback is that padding takes extra shared memory space, which can hurt occupancy.
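For comparison, here is a rough sketch of the padding alternative on the same toy tile. The pad of one f32 element per row is a hypothetical choice for illustration, and the 32-bank, 4-byte-word model is an assumption as well.

```python
NUM_BANKS = 32  # assumption: 32 banks, one f32 word per bank
M, N = 16, 16
ELEM_BYTES = 4


def padded_bank(row, col, pad):
    """Bank of element (row, col) when every row is followed by `pad` unused elements."""
    return (row * (N + pad) + col) % NUM_BANKS


def shared_bytes(pad):
    """Shared memory footprint of the tile, padding included."""
    return M * (N + pad) * ELEM_BYTES


banks = [padded_bank(r, 0, pad=1) for r in range(M)]
extra = shared_bytes(1) - shared_bytes(0)
print("column 0 banks with pad=1:", banks)
print("extra shared memory in bytes:", extra)
```

In this model, one padding element per row also spreads column 0 over 16 distinct banks, at the cost of 64 extra bytes for the tile. Swizzling gets the same spread with no extra space.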

Swizzling generally works. We also use swizzling to avoid bank conflicts on AMD GPUs.

chengjunlu · 2023-08-18T02:39:42Z

chengjunlu
Aug 18, 2023
Author

@zhanglx13 Thanks for the kind help.
So it only makes sense in combination with the later access to the tensor in SLM, such as the convert_layout #shared -> #mma.
At first I thought it was a general way to reduce bank conflicts when spilling a tensor to SLM.

I think my question is answered now.
Thanks for all your help.

0 replies
