[MLIR][XeGPU] Remove the transpose attribute from Gather/Scatter ops and clean up the documentation #145389
@@ -80,9 +80,6 @@ def XeGPU_CreateNdDescOp: XeGPU_Op<"create_nd_tdesc", [Pure, ViewLikeOpInterface
information e.g., memref<?x?xf16>, the strides information has to be explicitly
passed via the "strides" and "const_strides" argument.

In SIMT mode, tensor descriptor is augmented with `LayoutAttr` which describes the
mapping of the tensor descriptor to the work items.

Example 1 (suppose the tensor shape inferred by the compiler is 8x16):
```mlir
%0 = memref.alloc() : memref<1024x1024xf32>
@@ -106,15 +103,6 @@ def XeGPU_CreateNdDescOp: XeGPU_Op<"create_nd_tdesc", [Pure, ViewLikeOpInterface
%c1 = arith.constant 1 : index
%1 = xegpu.create_nd_tdesc %0[%c0, %c0], [%h, %w], [%w, %c1]: ui64 -> TensorDesc<8x16xf32>
```

Example 4 (SIMT mode):
```mlir
%0 = memref.alloc() : memref<1024x1024xf32>
%c0 = arith.constant 0 : index
%c1 = arith.constant 8 : index
%1 = xegpu.create_nd_tdesc %0[%c0, %c0] : memref<1024x1024xf32>
  -> !xegpu.tensor_desc<8x16xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
```
}];

let arguments = (ins
@@ -301,9 +289,7 @@ def XeGPU_LoadNdOp : XeGPU_Op<"load_nd", [
fp32 or fp64. It implies that vnni and transpose cannot exist at the
same time.

In SIMT mode, LoadNdOp expects the tensor descriptor to be augmented with `LayoutAttr`
which describes the mapping of the tensor to the work items. In this case, result
vector represents the data to be loaded by each work-item.
In SIMT mode, result vector represents the data to be loaded by each work-item.

Example 1:
```mlir
@@ -317,8 +303,7 @@ def XeGPU_LoadNdOp : XeGPU_Op<"load_nd", [
```mlir
xegpu.load_nd %1 {l1_hint = #xegpu.cache_hint<cached>,
                  l2_hint = #xegpu.cache_hint<uncached>}>
  : !xegpu.tensor_desc<8x16xf32,
      #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>> -> vector<8x1xf32>
  : !xegpu.tensor_desc<8x16xf32> -> vector<8xf32>
```
@@ -359,9 +344,7 @@ def XeGPU_StoreNdOp : XeGPU_Op<"store_nd", [
of cache, L1, L2 and L3. If hardware does not have a corresponding cache,
the corresponding cache hint attribute will be masked.

In SIMT mode, StoreNdOp expects the tensor descriptor to be augmented with `LayoutAttr`
which describes the mapping of the tensor to the work items. In this case, input
vector represents the data to be stored by each work-item.
In SIMT mode, the input vector represents the data to be stored by each work-item.

Example 1:
```mlir
@@ -375,8 +358,7 @@ def XeGPU_StoreNdOp : XeGPU_Op<"store_nd", [
xegpu.store_nd %3, %2 {l1_hint = #xegpu.cache_hint<uncached>,
                       l2_hint = #xegpu.cache_hint<write_back>,
                       l3_hint = #xegpu.cache_hint<write_through>}
  : vector<8x1xf16>, !xegpu.tensor_desc<8x16xf16,
      #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
  : vector<8xf16>, !xegpu.tensor_desc<8x16xf16>
```
@@ -410,15 +392,10 @@ def XeGPU_UpdateNdOffsetOp : XeGPU_Op<"update_nd_offset",
The offsets are relative offsets to the current position, in number of elements.
It will result in a TensorDesc of the same type as the input.

Example 1:
Example:
```
%2 = xegpu.update_nd_offset %1, [0, 16]: !xegpu.tensor_desc<8x16xf32>
```
Example 2 (SIMT mode):
```
%2 = xegpu.update_nd_offset %1, [0, 16]:
  !xegpu.tensor_desc<8x16xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
```
}];

let arguments = (ins
@@ -476,11 +453,6 @@ def XeGPU_CreateDescOp: XeGPU_Op<"create_tdesc", [Pure, ViewLikeOpInterface]> {
match the dimension of offsets. It may also have a second dimension corresponding to
the chunk_size if the chunk size is larger than 1.

In SIMT mode, similar to `create_nd_tdesc` the resulting tensor descriptor is augmented
with `LayoutAttr` which describes the mapping of the tensor descriptor to the work items.
In this case, the first dimension of the tensor descriptor represents the work-items, and
the second dimension represents the chunk size.

Example 1: It assumes subgroup size is 4, and accesses a[0], a[16], a[32], a[64]
```mlir
%a = memref.alloc() : memref<1024xf32>
@@ -505,15 +477,6 @@ def XeGPU_CreateDescOp: XeGPU_Op<"create_tdesc", [Pure, ViewLikeOpInterface]> {
%1 = xegpu.create_tdesc %0, %off : memref<1024xf32>, vector<4xindex>
  -> TensorDesc<4x8xf32, #xegpu.scattered_tdesc_attr<chunk_size = 8>>
```

Example 4: SIMT mode
```mlir
%0 = memref.alloc() : memref<1024xf32>
%off = arith.constant dense<[0, 16, 32, 64]> : vector<4xindex>
%1 = xegpu.create_tdesc %0, %off : memref<1024xf32>, vector<4xindex>
  -> TensorDesc<4x8xf32, #xegpu.scattered_tdesc_attr<chunk_size = 8>,
     #xegpu.layout<lane_layout = [4, 1], lane_data = [1, 1]>>
```
}];

let arguments = (ins XeGPU_BaseAddrType: $source,
@@ -609,19 +572,13 @@ def XeGPU_LoadGatherOp : XeGPU_Op<"load", [
let description = [{ It (aka. load) loads data per work-item. The output
describes the data being loaded at the subgroup level, so its size is
consistent with the number of work-items in a subgroup. When the chunk size
is larger than 2, the output vector is a 2D vector, with dim-1 corresponding
to work-items, and dim-0 corresponding to the chunk size loaded by each work-item.
Specially, there is a transpose effect on the result (as compared to the TensorDesc)
due to the hardware implementation. Therefore, a transpose attribute is introduced
on purpose, making sure users are aware of this implicit transformation.

is larger than 2, the output vector is a 2D vector, with dim-0 corresponding
to work-items, and dim-1 corresponding to the chunk size loaded by each work-item.
The mask operand masks out memory access so that it is safe to pass out-of-boundary
addresses/offsets as long as they are masked. It applies to slots of SIMD lanes.

In SIMT mode, LoadGatherOp expects the tensor descriptor to be augmented with `LayoutAttr`
which describes the mapping of the tensor to the work items. In this case, result vector
represents the data to be loaded by each work-item. Each work-item receives a `chunk_size`
number of elements.
In SIMT mode, the result vector represents the data to be loaded by each work-item.
Each work-item receives a `chunk_size` number of elements.

Example 1:
```mlir
@@ -634,29 +591,25 @@ def XeGPU_LoadGatherOp : XeGPU_Op<"load", [

Example 2:
```mlir
%2 = xegpu.load %1, %0 {transpose,
                        l1_hint = #xegpu.cache_hint<cached>,
%2 = xegpu.load %1, %0 {l1_hint = #xegpu.cache_hint<cached>,
                        l2_hint = #xegpu.cache_hint<uncached>,
                        l3_hint = #xegpu.cache_hint<uncached>}
  : !xegpu.tensor_desc<16x8xf32, #xegpu.scatter_tdesc_attr<memory_space=global, chunk_size=8>>,
    vector<16xi1> -> vector<8x16xf32>
```
Example 3 (SIMT mode):
```mlir
%2 = xegpu.load %1, %0 {transpose,
                        l1_hint = #xegpu.cache_hint<cached>,
%2 = xegpu.load %1, %0 {l1_hint = #xegpu.cache_hint<cached>,
                        l2_hint = #xegpu.cache_hint<uncached>,
                        l3_hint = #xegpu.cache_hint<uncached>}
  : !xegpu.tensor_desc<16x8xf32, #xegpu.scatter_tdesc_attr<memory_space=global, chunk_size=8>,
      !xegpu.layout<lane_layout = [16, 1], lane_data = [1, 1]>>
    vector<16xi1> -> vector<8x1xf32>
  : !xegpu.tensor_desc<16x8xf32, #xegpu.scatter_tdesc_attr<memory_space=global, chunk_size=8>>
    vector<16xi1> -> vector<8xf32>
```

}];

let arguments = (ins XeGPU_TensorDesc: $TensorDesc,
                     XeGPU_MaskType: $mask,
                     OptionalAttr<UnitAttr>: $transpose,
                     OptionalAttr<XeGPU_CacheHintAttr>: $l1_hint,
                     OptionalAttr<XeGPU_CacheHintAttr>: $l2_hint,
                     OptionalAttr<XeGPU_CacheHintAttr>: $l3_hint);
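To make the transpose-free semantics above concrete, here is a minimal illustrative sketch (not taken from the patch; `%r`, `%tdesc`, and `%mask` are placeholder names, and the result shape follows the updated description in which dim-0 maps to work-items):

```mlir
// 16 work-items with chunk_size = 8: the result keeps the 16x8 shape of the
// tensor descriptor (dim-0 = work-items, dim-1 = chunk), with no implicit
// transpose and no transpose attribute.
%r = xegpu.load %tdesc, %mask {l1_hint = #xegpu.cache_hint<cached>}
  : !xegpu.tensor_desc<16x8xf32, #xegpu.scatter_tdesc_attr<memory_space=global, chunk_size=8>>,
    vector<16xi1> -> vector<16x8xf32>
```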
@@ -699,10 +652,8 @@ def XeGPU_StoreScatterOp : XeGPU_Op<"store", [
has transpose effect, which is similar to `load_gather`. Therefore, a transpose attribute is
introduced on purpose, making sure users are aware of this implicit transformation.

In SIMT mode, StoreScatterOp expects the tensor descriptor to be augmented with `LayoutAttr`
which describes the mapping of the tensor to the work items. In this case, input vector
represents the data to be stored by each work-item. Each work-item receives a `chunk_size`
number of elements.
In SIMT mode, the input vector represents the data to be stored by each work-item.
Each work-item stores a `chunk_size` number of elements.

Example 1:
```mlir
@@ -714,29 +665,17 @@ def XeGPU_StoreScatterOp : XeGPU_Op<"store", [

Example 2:
```mlir
xegpu.store %0, %1, %2 {transpose,
                        l1_hint = #xegpu.cache_hint<uncached>,
                        l2_hint = #xegpu.cache_hint<write_back>,
                        l3_hint = #xegpu.cache_hint<write_through>}
xegpu.store %0, %1, %2 {l1_hint = #xegpu.cache_hint<uncached>,
                        l2_hint = #xegpu.cache_hint<write_back>,
                        l3_hint = #xegpu.cache_hint<write_through>}
  : vector<8x16xf32>, !xegpu.tensor_desc<16x8xf32, #xegpu.scattered_tdesc_attr<chunk_size=8>>, vector<16xi1>
```
Example 3 (SIMT mode):
```mlir
xegpu.store %0, %1, %2 {transpose,
                        l1_hint = #xegpu.cache_hint<uncached>,
                        l2_hint = #xegpu.cache_hint<write_back>,
                        l3_hint = #xegpu.cache_hint<write_through>}
  : vector<8x1xf32>, !xegpu.tensor_desc<16x8xf32, #xegpu.scattered_tdesc_attr<chunk_size=8>,
      !xegpu.layout<lane_layout = [16, 1], lane_data = [1, 1]>> vector<16xi1>
```

}];

let arguments = (ins
  XeGPU_ValueType: $value,
  XeGPU_TensorDesc: $TensorDesc,
  XeGPU_MaskType: $mask,
  OptionalAttr<UnitAttr>: $transpose,
  OptionalAttr<XeGPU_CacheHintAttr>: $l1_hint,
  OptionalAttr<XeGPU_CacheHintAttr>: $l2_hint,
  OptionalAttr<XeGPU_CacheHintAttr>: $l3_hint);
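For symmetry with the load sketch above, a hedged sketch of the transpose-free store (placeholder names; the stored value shape is assumed to match the tensor descriptor shape, per the updated description that each work-item stores `chunk_size` elements):

```mlir
// Each of the 16 work-items stores chunk_size = 8 elements; the value keeps
// the 16x8 shape of the tensor descriptor and there is no transpose attribute.
xegpu.store %val, %tdesc, %mask {l1_hint = #xegpu.cache_hint<uncached>}
  : vector<16x8xf32>,
    !xegpu.tensor_desc<16x8xf32, #xegpu.scattered_tdesc_attr<chunk_size=8>>,
    vector<16xi1>
```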
@@ -773,20 +712,13 @@ def XeGPU_UpdateOffsetOp: XeGPU_Op<"update_offset",
update the offset per work-item, so its offsets contain values representing
shifts for each work-item.

Example 1:
Example:
```mlir
%off = arith.constant dense<[32, 32, 32, 32]> : vector<4xindex>
%2 = xegpu.update_offset %1, %off :
  !xegpu.tensor_desc<4x2xf32, #xegpu.scattered_tdesc_attr<chunk_size=2>>, vector<4xindex>
```

Example 2 (SIMT mode):
```mlir
%off = arith.constant dense<[32, 32, 32, 32]> : vector<4xindex>
%2 = xegpu.update_offset %1, %off :
  !xegpu.tensor_desc<4x2xf32, #xegpu.scattered_tdesc_attr<chunk_size=2>,
    #xegpu.layout<lane_layout = [4, 1], lane_data = [1, 1]>>, vector<4xindex>
```
}];

let arguments = (ins XeGPU_TensorDesc: $TensorDesc,
@@ -8,6 +8,7 @@

#include "mlir/Dialect/Utils/IndexingUtils.h"
#include "mlir/Dialect/XeGPU/IR/XeGPU.h"
#include "mlir/Dialect/XeGPU/Utils/XeGPUUtils.h"
#include "mlir/IR/Builders.h"
#include "mlir/IR/DialectImplementation.h"
#include "llvm/ADT/TypeSwitch.h"
@@ -309,11 +310,23 @@ LogicalResult TensorDescType::verify(
    llvm::ArrayRef<int64_t> shape, mlir::Type elementType,
    mlir::Attribute encoding, mlir::Attribute layout) {
  size_t rank = shape.size();
  // Low-precision types are packed in 32-bit units.
  int32_t packingFactor = 32 / elementType.getIntOrFloatBitWidth();
  if (rank != 1 && rank != 2)
    return emitError() << "expected 1D or 2D tensor";

  auto blockAttr = mlir::dyn_cast_if_present<BlockTensorDescAttr>(encoding);
  if (blockAttr) {
    MemorySpaceAttr memorySpaceAttr = blockAttr.getMemorySpace();
    if (rank == 2 && memorySpaceAttr &&
        memorySpaceAttr.getValue() == MemorySpace::SLM)
      return emitError() << "SLM is not supported for 2D block tensor";
  }

  // for gather and scatter ops, Low-precision types are packed in 32-bit units.
  unsigned bitWidth = elementType.getIntOrFloatBitWidth();
  int packingFactor =
      bitWidth < targetinfo::packedSizeInBitsForGatherScatter
          ? targetinfo::packedSizeInBitsForGatherScatter / bitWidth
          : 1;
  auto scatterAttr = mlir::dyn_cast_if_present<ScatterTensorDescAttr>(encoding);
  if (scatterAttr) {
    // Expected tensor ranks for scattered data:
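As a worked example of the packing rule introduced above (assuming `targetinfo::packedSizeInBitsForGatherScatter` is the 32-bit unit mentioned in the original comment): a 16-bit element type such as f16 yields a packing factor of 32 / 16 = 2, an 8-bit type yields 4, and a 32-bit or wider type such as f32 or f64 takes the `: 1` branch, whereas the old unconditional `32 / bitWidth` would have produced 0 for f64.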
@@ -336,14 +349,6 @@ LogicalResult TensorDescType::verify(
    }
  }

  auto blockAttr = mlir::dyn_cast_if_present<BlockTensorDescAttr>(encoding);
  if (blockAttr) {
    MemorySpaceAttr memorySpaceAttr = blockAttr.getMemorySpace();
    if (rank == 2 && memorySpaceAttr &&
        memorySpaceAttr.getValue() == MemorySpace::SLM)
      return emitError() << "SLM is not supported for 2D block tensor";
  }

  auto layoutAttr = llvm::dyn_cast_if_present<LayoutAttr>(layout);
  if (layoutAttr) {
    if (rank != (size_t)layoutAttr.getRank())
Are we assuming here that there will always be a transpose operation after the load?
I wonder how a user can understand the semantics of this op. What if the user does not want the transpose and wants to use the op in isolation (which is perfectly legal)?
There is no transpose. The semantics are that each row corresponds to a lane. In the SIMD lowering pipeline, the transpose will be added when we lower the `load_gather` to the corresponding intrinsic. For SIMT lowering, there is no transpose at all.
I thought about it again.
It seems like now xegpu.load (with chunk > 1) is just a logical operation, meaning it does not have a matching HW instruction. Logically we can use it without an accompanying transpose operation; that is true.
In practice, it will always come with an accompanying transpose. It will mostly be useful for the A*BT case. In that case we always need an explicit `vector.transpose` after the `xegpu.load`. During lowering, the load + transpose pair is optimized away in both the SIMD and SIMT paths. Essentially we say "we have a HW instruction that can do both of these together, so the transpose here is a nop"; no need to do any shuffling for the transpose. For the A*B case, I think doing multiple loads may be cheaper than doing a load-gather and then an in-register transpose. Not sure about this case.
A*BT case
A*B case.
The A*BT case is clear to me, but I'm not sure what we do with the A*B case here? Maybe I am still missing something. @Jianhui-Li can you also clarify on these examples. I know that A*B is not a real use case, but I'm still confused how layout propagation works here.
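A rough MLIR sketch of the A*BT pattern described in this thread (illustrative only; names and shapes are placeholders taken from the 16-lane, chunk-8 examples in the diff):

```mlir
// Chunked gather-load of a B^T tile: each lane owns one 8-element row.
%b = xegpu.load %b_desc, %mask
  : !xegpu.tensor_desc<16x8xf32, #xegpu.scatter_tdesc_attr<chunk_size=8>>,
    vector<16xi1> -> vector<16x8xf32>
// Explicit transpose for the consumer; per the comment above, this
// load + transpose pair is expected to be folded away during lowering.
%bt = vector.transpose %b, [1, 0] : vector<16x8xf32> to vector<8x16xf32>
```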
nvm. It is clear now after discussing with @Jianhui-Li. A*B case will need a convert_layout because the load is not giving us the layout needed for DPAS B.
For the A*B case, since you use load w/ chunk_size for B, which assumes the [16, 1][1, 2] layout, the propagation needs to insert an xegpu.convert_layout to convert it to [1, 16][2, 1] before it feeds into DPAS.
So from a lowering perspective we expect two cases.
Yes.
One thing to note in the lowering: in your code example, the user specifies xegpu.load w/ chunk_size, which will be lowered to XeVM.load w/ vector size by default (each lane loads contiguous data).
If the user overrides the layout of xegpu.load w/ chunk_size, say forcing it to take the [1, 16][2, 1] layout, it will need to be lowered to multiple regular XeVM.loads, since the data loaded by each lane is then not contiguous.
Is the user allowed to do this? I'd also like it if we keep it relaxed. But I can see that in this PR we have hard-coded the scattered load layout to [16, 1][1, 2]. Check here:
https://github.com/llvm/llvm-project/pull/145389/files#diff-fcc9cdbf8bb4e5d37e661524b877082aee9b7badb0317f980c1881da564a926dR230-R237
I meant that the propagation pass will be improved to allow the user to set a layout that overrides the default decision.