Commit 34d586f

[MLIR][XeGPU] Extend SGMapAttr and Add ConvertLayoutOp (#132425)
This PR improves the `SGMapAttr` to enable workgroup-level programming, representing the first step in expanding the XeGPU dialect from the subgroup level to the workgroup level, and renames it to `LayoutAttr`.
1 parent 442050c commit 34d586f

File tree

7 files changed: +727 -389 lines changed


mlir/include/mlir/Dialect/XeGPU/IR/XeGPUAttrs.td

Lines changed: 115 additions & 20 deletions
@@ -35,7 +35,7 @@ def XeGPU_BlockTensorDescAttr: XeGPU_TensorDescAttr<"BlockTensorDesc", "block_td
     It is default to `Global`.
     2. `array_length`: It describes how many horizontally consecutive blocks
        will be loaded by a hardware load instruction. If the TensorDesc shape
-       is 8x16, with array_length = 2. The loaded block shape will be acctually
+       is 8x16, with array_length = 2. The loaded block shape will be actually
       8x32. Its default value is 1.
     3. `boundary_check`: It is used to indicates the hardware whether to do
       out-of-boundary check. The default value is true.
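As an aside (not part of the diff): the `array_length` semantics described above, fetching several horizontally consecutive blocks with one hardware load, can be sketched in a few lines of Python. The helper name is purely illustrative:

```python
def effective_load_shape(tdesc_shape, array_length=1):
    """array_length horizontally consecutive blocks are fetched by one
    hardware load, widening the second (innermost) dimension."""
    rows, cols = tdesc_shape
    return [rows, cols * array_length]

# The docstring's example: an 8x16 TensorDesc with array_length = 2
# actually loads an 8x32 block.
print(effective_load_shape([8, 16], array_length=2))  # [8, 32]
```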
@@ -154,33 +154,128 @@ def XeGPU_FenceScopeAttr:
   let assemblyFormat = "$value";
 }
 
-def XeGPU_SGMapAttr : XeGPUAttr<"SGMap", "sg_map"> {
+def XeGPU_LayoutAttr : XeGPUAttr<"Layout", "layout"> {
   let summary = [{
-    Describes the mapping between work item (WI) and the 2D tensor specified by the tensor descriptor.
+    Describes the data distribution to subgroups and work-items for a tensor
+    specified by the tensor descriptor.
   }];
   let description = [{
-    To distribute the XeGPU operation to work items, the tensor_desc must be specified with the sg_map
-    attribute at the tensor description creation time.
-    Within the `sg_map`, `wi_layout` specifies the layout of work items,
-    describing the mapping of work items to the tensor.
-    wi_layout[0] x wi_layout[1] must be equal to the total number of work items within a subgroup.
-    `wi_data` specifies the minimum number of data elements assigned to each work item for a single distribution.
-
-    E.g., #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>
-    In this example, the subgroup has 16 work items in wi_layout=[1, 16],
-    each accessing 1 element as specified by wi_data=[1, 1].
-
-    `wi_data[0] * wi_data[1]` can be greater than 1, meaning that each work item operates on multiple elements,
-    which is eventually lowered to "SIMT-flavor" vector, like SPIR-V vector or llvm vector, or packed to a storage data type.
-    The multiple elements indicated by `wi_data` can only be from one dimension and must be contiguous in the memory along either dimension.
+    XeGPU operations use `LayoutAttr` to define how data is distributed across subgroups and work-items.
+    This attribute is specified in tensor descriptors during tensor description creation. `LayoutAttr`
+    includes the following parameters:
+
+    * `sg_layout`: Specifies the total number of subgroups and their layout within a workgroup.
+      It is mandatory for workgroup-level programming. Its presence implies workgroup-level code.
+    * `sg_data`: Defines the data size accessed per subgroup. It is optionally used with `sg_layout`
+      for workgroup-level programming. When it is left empty, the size accessed per subgroup can be
+      derived from the tensor shape and `sg_layout` using the formula:
+      `sg_data[i] = tensor_shape[i] / sg_layout[i]`.
+    * `inst_data`: Specifies the data size that is processed by an instruction. It is optionally
+      used with lane_layout. When it is left empty, the data size per instruction is equivalent to
+      the sg_data for workgroup-level programming or equivalent to tensor shape for subgroup-level
+      programming.
+    * `lane_layout` : Specifies the total number of work-items and their arrangement within a subgroup.
+      It is mandatory for subgroup-level programming and optional for workgroup-level programming.
+    * `lane_data` : Specifies the shape of the tensor fragment that each lane accesses. It defines a single,
+      minimal distribution unit. Processing the entire tensor may require one or more distribution units per
+      hardware instruction.
+    * `order`: Specifies the dimension order used to linearize n-dimensional sg_layout and lane_layout to
+      1-dimensional layout. The first dimension in the order list is the fastest-changing dimension. If it
+      is not present, the default value is [1, 0].
+
+    ### Examples:
+    1. Subgroup level layout:
+    ```mlir
+    #xegpu.layout<lane_layout = [2, 8], lane_data = [1, 1]>
+    ```
+    In this example, there are 16 work-items per subgroup, and is organized as
+    [[0, 1, 2, .., 7],[8, 9, .., 15]]. The distribution unit is 1x1.
+
+    2. Subgroup level layout with order:
+    ```mlir
+    #xegpu.layout<lane_layout = [2, 8], lane_data = [1, 1], order = [0, 1]>
+    ```
+    In this example, there are 16 work-items per subgroup, and is organized as
+    [[0, 2, 4, ..., 14], [1, 3, 5, ..., 15]]. The distribution unit is 1x1.
+
+    3. Subgroup level layout with inst_data
+    ```mlir
+    #xegpu.layout<inst_data = [8, 16], lane_layout = [2, 8], lane_data = [2, 2]>
+    ```
+    In this example, the original problem size is partitioned into smaller subproblems of dimensions [8, 16],
+    which are then distributed among 16 work-items arranged as [[0, 1, 2, ..., 7], [8, 9, ..., 15]]. Each
+    work-item is assigned four 2x2 blocks in a round-robin manner.
+
+    4. Workgroup level layout:
+    ```mlir
+    #xegpu.layout<sg_layout = [2, 4], sg_data = [16, 16], lane_layout = [2, 8], lane_data = [1, 1]>
+    ```
+    In this example, the layout represents a workgroup distribution. A workgroup consists of 8 subgroups
+    arranged as [[0, 1, 2, 3], [4, 5, 6, 7]]. Each subgroup accesses a 16x16 block per instruction, which
+    is further distributed to 16 work items which is organized as [[0, 1, 2, .., 7],[8, 9, .., 15]].
+
+    5. Workgroup level layout with order:
+    ```mlir
+    #xegpu.layout<sg_layout = [2, 4], sg_data = [16, 16], lane_layout = [2, 8], lane_data = [1, 1], order = [0, 1]>
+    ```
+    In this example, the layout represents a workgroup distribution. A workgroup consists of 8 subgroups
+    arranged as [[0, 2, 4, 6], [1, 3, 5, 7]]. Each subgroup accesses a 16x16 block per instruction, which
+    is further distributed to 16 work items which is organized as [[0, 2, 4, ..., 14], [1, 3, 5, ..., 15]].
+
+    6. Workgroup level layout with inst_data:
+    ```mlir
+    #xegpu.layout<sg_layout = [2, 4], sg_data = [16, 16], inst_data = [8, 16], lane_layout = [2, 8], lane_data = [1, 1]>
+    ```
+    This example is similar to the previous ones, but the `inst_data` parameter divides `sg_data` into two instructions,
+    each processing an 8x16 block. These blocks are further distributed across 16 work-items with a distribution unit of 1x1.
+    Unlike the 2x2 distribution unit in example 3, which results in accessing contiguous 2x2 blocks, the 1x1 distribution
+    unit may result in non-contiguous access.
   }];
+
   let parameters = (ins
-    ArrayRefParameter<"uint32_t">:$wi_layout,
-    ArrayRefParameter<"uint32_t">:$wi_data
+    OptionalParameter<"DenseI32ArrayAttr">: $sg_layout,
+    OptionalParameter<"DenseI32ArrayAttr">: $sg_data,
+    OptionalParameter<"DenseI32ArrayAttr">: $inst_data,
+    OptionalParameter<"DenseI32ArrayAttr">: $lane_layout,
+    OptionalParameter<"DenseI32ArrayAttr">: $lane_data,
+    OptionalParameter<"DenseI32ArrayAttr">: $order
   );
 
+  let builders = [
+    AttrBuilder<(ins "llvm::ArrayRef<int>": $lane_layout,
+                     "llvm::ArrayRef<int>": $lane_data),
+      [{
+        auto sg_layout = DenseI32ArrayAttr();
+        auto sg_data = DenseI32ArrayAttr();
+        auto inst_data = DenseI32ArrayAttr();
+        auto order = DenseI32ArrayAttr();
+        return $_get($_ctxt, sg_layout, sg_data, inst_data,
+                     DenseI32ArrayAttr::get($_ctxt, lane_layout),
+                     DenseI32ArrayAttr::get($_ctxt, lane_data), order);
+      }]>
+  ];
+
+  let extraClassDeclaration = [{
+    bool isWgLayout() {
+      return getSgLayout() != nullptr;
+    }
+
+    bool isSgLayout() {
+      return getSgLayout() == nullptr && getLaneLayout() != nullptr;
+    }
 
-  let hasCustomAssemblyFormat = 1;
+    int64_t getRank() {
+      if (auto attr = getSgLayout())
+        return attr.size();
+      if (auto attr = getInstData())
+        return attr.size();
+      if (auto attr = getLaneLayout())
+        return attr.size();
+      return 0;
+    }
+  }];
+
+  let assemblyFormat = "`<` struct(params) `>`";
   let genVerifyDecl = 1;
 }
 
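The lane numbering used in the examples of the new `LayoutAttr` documentation (row-major by default, and with `order = [0, 1]` making dimension 0 the fastest-changing) can be reproduced with a short Python sketch. `lane_grid` is an illustrative helper, not code from this commit:

```python
def lane_grid(lane_layout, order=(1, 0)):
    """Arrange lane ids 0..N-1 in a 2D grid. order[0] is the
    fastest-changing dimension (default (1, 0), i.e. row-major)."""
    rows, cols = lane_layout
    grid = [[0] * cols for _ in range(rows)]
    for lane in range(rows * cols):
        if order == (1, 0):            # dimension 1 (columns) varies fastest
            r, c = divmod(lane, cols)
        else:                          # order == (0, 1): rows vary fastest
            c, r = divmod(lane, rows)
        grid[r][c] = lane
    return grid

# Example 1: lane_layout = [2, 8] with the default order = [1, 0]
print(lane_grid((2, 8)))           # [[0, 1, ..., 7], [8, 9, ..., 15]]
# Example 2: the same layout with order = [0, 1]
print(lane_grid((2, 8), (0, 1)))   # [[0, 2, ..., 14], [1, 3, ..., 15]]
```

The same helper also reproduces the subgroup arrangements of examples 4 and 5 when called with `sg_layout` instead of `lane_layout`.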
mlir/include/mlir/Dialect/XeGPU/IR/XeGPUOps.td

Lines changed: 44 additions & 20 deletions
@@ -80,7 +80,7 @@ def XeGPU_CreateNdDescOp: XeGPU_Op<"create_nd_tdesc", [Pure, ViewLikeOpInterface
     information e.g., memref<?x?xf16>, the strides information has to be explicitly
     passed via the "strides" and "const_strides" argument.
 
-    In SIMT mode, tensor descriptor is augmented with `SGMapAttr` which describes the
+    In SIMT mode, tensor descriptor is augmented with `LayoutAttr` which describes the
     mapping of the tensor descriptor to the work items.
 
     Example 1 (suppose the tensor shape inferred by the compiler is 8x16):
@@ -113,7 +113,7 @@ def XeGPU_CreateNdDescOp: XeGPU_Op<"create_nd_tdesc", [Pure, ViewLikeOpInterface
     %c0 = arith.constant 0 : index
     %c1 = arith.constant 8 : index
     %1 = xegpu.create_nd_tdesc %0[%c0, %c0] : memref<1024x1024xf32>
-      -> !xegpu.tensor_desc<8x16xf32, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>
+      -> !xegpu.tensor_desc<8x16xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
     ```
   }];
@@ -306,7 +306,7 @@ def XeGPU_LoadNdOp : XeGPU_Op<"load_nd", [
     fp32 or fp64. It implies that vnni and transpose cannot exit at the
     same time.
 
-    In SIMT mode, LoadNdOp expects the tensor descriptor to be augmented with `SGMapAttr`
+    In SIMT mode, LoadNdOp expects the tensor descriptor to be augmented with `LayoutAttr`
     which describes the mapping of the tensor to the work items. In this case, result
     vector represents the data to be loaded by each work-item.
 
@@ -323,7 +323,7 @@ def XeGPU_LoadNdOp : XeGPU_Op<"load_nd", [
     xegpu.load_nd %1 {l1_hint = #xegpu.cache_hint<cached>,
                       l2_hint = #xegpu.cache_hint<uncached>}>
       : !xegpu.tensor_desc<8x16xf32,
-          #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>> -> vector<8x1xf32>
+          #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>> -> vector<8x1xf32>
     ```
 
 
@@ -364,7 +364,7 @@ def XeGPU_StoreNdOp : XeGPU_Op<"store_nd", [
     of cache, L1, L2 and L3. If hardware does not have a correspoding cache,
     Corresponding cache hint attribute will be masked.
 
-    In SIMT mode, StoreNdOp expects the tensor descriptor to be augmented with `SGMapAttr`
+    In SIMT mode, StoreNdOp expects the tensor descriptor to be augmented with `LayoutAttr`
     which describes the mapping of the tensor to the work items. In this case, input
     vector represents the data to be stored by each work-item.
 
@@ -381,7 +381,7 @@ def XeGPU_StoreNdOp : XeGPU_Op<"store_nd", [
                          l2_hint = #xegpu.cache_hint<write_back>,
                          l3_hint = #xegpu.cache_hint<write_through>}
       : vector<8x1xf16>, !xegpu.tensor_desc<8x16xf16,
-          #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>
+          #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
     ```
 
 
@@ -422,7 +422,7 @@ def XeGPU_UpdateNdOffsetOp : XeGPU_Op<"update_nd_offset",
     Example 2 (SIMT mode):
     ```
     %2 = xegpu.update_nd_offset %1, [0, 16]:
-      !xegpu.tensor_desc<8x16xf32, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>
+      !xegpu.tensor_desc<8x16xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
     ```
   }];
 
@@ -482,7 +482,7 @@ def XeGPU_CreateDescOp: XeGPU_Op<"create_tdesc", [Pure, ViewLikeOpInterface]> {
     the chunk_size if the chunk size is larger than 1.
 
     In SIMT mode, similar to `create_nd_tdesc` the resulting tensor descriptor is augmented
-    with `SGMapAttr` which describes the mapping of the tensor descriptor to the work items.
+    with `LayoutAttr` which describes the mapping of the tensor descriptor to the work items.
     In this case, the first dimension of the tensor descriptor represents the work-items, and
     the second dimension represents the chunk size.
 
@@ -517,7 +517,7 @@ def XeGPU_CreateDescOp: XeGPU_Op<"create_tdesc", [Pure, ViewLikeOpInterface]> {
     %off = arith.constant dense<[0, 16, 32, 64]> : vector<4xindex>
     %1 = xegpu.create_tdesc %0, %off : memref<1024xf32>, vector<4xindex>
           -> TensorDesc<4x8xf32, #xegpu.scattered_tdesc_attr<chunk_size = 8>,
-             #xegpu.sg_map<wi_layout = [4, 1], wi_data = [1, 1]>>
+             #xegpu.layout<lane_layout = [4, 1], lane_data = [1, 1]>>
     ```
   }];
 
@@ -623,7 +623,7 @@ def XeGPU_LoadGatherOp : XeGPU_Op<"load", [
     The mask operand masks out memory access so that it is safe to pass out-of-boundary
     addresses/offsets as long as they are masked. It applies to slots of SIMD lanes.
 
-    In SIMT mode, LoadGatherOp expects the tensor descriptor to be augmented with `SGMapAttr`
+    In SIMT mode, LoadGatherOp expects the tensor descriptor to be augmented with `LayoutAttr`
     which describes the mapping of the tensor to the work items. In this case, result vector
     represents the data to be loaded by each work-item. Each work-item recieves a `chunk_size`
     number of elements.
@@ -653,7 +653,7 @@ def XeGPU_LoadGatherOp : XeGPU_Op<"load", [
                           l2_hint = #xegpu.cache_hint<uncached>,
                           l3_hint = #xegpu.cache_hint<uncached>}
           : !xegpu.tensor_desc<16x8xf32, #xegpu.scatter_tdesc_attr<memory_space=global, chunk_size=8>,
-            !xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 1]>>
+            !xegpu.layout<lane_layout = [16, 1], lane_data = [1, 1]>>
             vector<16xi1> -> vector<8x1xf32>
     ```
 
@@ -704,7 +704,7 @@ def XeGPU_StoreScatterOp : XeGPU_Op<"store", [
     has transpose effect, which is similar to `load_gather`. Therefore, a transpose attribute is
     introduced on purpose, making sure users are aware of this implicit transformation.
 
-    In SIMT mode, StoreScatterOp expects the tensor descriptor to be augmented with `SGMapAttr`
+    In SIMT mode, StoreScatterOp expects the tensor descriptor to be augmented with `LayoutAttr`
     which describes the mapping of the tensor to the work items. In this case, input vector
     represents the data to be stored by each work-item. Each work-item recieves a `chunk_size`
     number of elements.
@@ -732,7 +732,7 @@ def XeGPU_StoreScatterOp : XeGPU_Op<"store", [
                              l2_hint = #xegpu.cache_hint<write_back>,
                              l3_hint = #xegpu.cache_hint<write_through>}
           : vector<8x1xf32>, !xegpu.tensor_desc<16x8xf32, #xegpu.scattered_tdesc_attr<chunk_size=8>,
-            !xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 1]>> vector<16xi1>
+            !xegpu.layout<lane_layout = [16, 1], lane_data = [1, 1]>> vector<16xi1>
     ```
 
   }];
@@ -790,7 +790,7 @@ def XeGPU_UpdateOffsetOp: XeGPU_Op<"update_offset",
     %off = arith.constant dense<[32, 32, 32, 32]> : vector<4xindex>
     %2 = xegpu.update_offset %1, %off :
           !xegpu.tensor_desc<4x2xf32, #xegpu.scattered_tdesc_attr<chunk_size=2>,
-          #xegpu.sg_map<wi_layout = [4, 1], wi_data = [1, 1]>>, vector<4xindex>
+          #xegpu.layout<lane_layout = [4, 1], lane_data = [1, 1]>>, vector<4xindex>
     ```
   }];
 
@@ -840,9 +840,9 @@ def XeGPU_DpasOp : XeGPU_Op<"dpas", [Pure, AllElementTypesMatch<["lhs", "rhs"]>]
    factor, which is computed as `32/bit_width_of_elem_type`. Thus, `B: vector<16x16xf16>`
    can be represented as `B: vector<8x16x2xf16>`.
 
-    In SIMT mode, DpasOp expects attributes `sg_map_a`, `sg_map_b`, and `sg_map_c`
-    which descibes the data fragment owned by each work-item w.r.t. the tensor
-    descriptor these data are loaded from.
+    In SIMT mode, DpasOp expects layout attributes `a`, `b`, and `c` (only if acc is used)
+    which describe the data fragment owned by each work-item w.r.t. the tensor descriptor
+    these data are loaded from.
 
    Note: on PVC, the hardware can perform load with VNNI transformation when data
          element type is 16-bit or lower precision, taking 2 or 4 elements from
@@ -853,9 +853,9 @@ def XeGPU_DpasOp : XeGPU_Op<"dpas", [Pure, AllElementTypesMatch<["lhs", "rhs"]>]
    XeGPU_DpasOpType : $lhs,
    XeGPU_DpasOpType : $rhs,
    Optional<XeGPU_Vector2DType>: $acc,
-    OptionalAttr<XeGPU_SGMapAttr>:$sg_map_a,
-    OptionalAttr<XeGPU_SGMapAttr>:$sg_map_b,
-    OptionalAttr<XeGPU_SGMapAttr>:$sg_map_c);
+    OptionalAttr<XeGPU_LayoutAttr>:$a_layout,
+    OptionalAttr<XeGPU_LayoutAttr>:$b_layout,
+    OptionalAttr<XeGPU_LayoutAttr>:$c_layout);
   let results = (outs XeGPU_Vector2DType: $result);
 
   let extraClassDeclaration = [{
@@ -876,6 +876,10 @@ def XeGPU_DpasOp : XeGPU_Op<"dpas", [Pure, AllElementTypesMatch<["lhs", "rhs"]>]
    VectorType getResultType() {
      return getResult().getType();
    }
+
+    bool hasAcc() {
+      return getAcc() != nullptr;
+    }
   }];
 
   let assemblyFormat = [{
@@ -979,4 +983,24 @@ def XeGPU_FenceOp: XeGPU_Op<"fence", []> {
   let extraClassDeclaration = extraBaseClassDeclaration;
 }
 
+def XeGPU_ConvertLayoutOp: XeGPU_Op<"convert_layout", [Pure, AllTypesMatch<["source", "result"]>]> {
+  let summary = "Convert the layout of the input operand";
+  let description = [{
+    `convert_layout` adjusts the data distribution across subgroups and/or work-items by modifying
+    the `LayoutAttr`. Both `srcMap` and `resMap` must correspond to the same programming scope, such
+    as workgroup-level (wg) or subgroup-level (sg) code. This operation is not valid once the IR is
+    lowered to WI level because that is the end result of all distributions.
+  }];
+  let arguments = (ins XeGPU_Vector2DType: $source,
+                       XeGPU_LayoutAttr: $srcMap,
+                       XeGPU_LayoutAttr: $resMap
+  );
+  let results = (outs XeGPU_Vector2DType: $result);
+  let assemblyFormat = [{
+    $source attr-dict `:` type($source)
+  }];
+
+  let hasVerifier = 1;
+}
+
 #endif // MLIR_DIALECT_XEGPU_IR_XEGPUOPS_TD
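For workgroup-level layouts, the new `LayoutAttr` documentation states that an omitted `sg_data` is derived as `sg_data[i] = tensor_shape[i] / sg_layout[i]`. A minimal sketch of that rule follows; the helper is illustrative Python, not code from this commit, and the 32x64 tensor shape is an assumption chosen so that the derived tile matches example 4's `sg_data = [16, 16]`:

```python
def derive_sg_data(tensor_shape, sg_layout, sg_data=None):
    """Return the per-subgroup tile size. When sg_data is omitted,
    apply the documented rule sg_data[i] = tensor_shape[i] / sg_layout[i]."""
    if sg_data is not None:
        return list(sg_data)
    if any(t % s for t, s in zip(tensor_shape, sg_layout)):
        raise ValueError("tensor_shape must be divisible by sg_layout")
    return [t // s for t, s in zip(tensor_shape, sg_layout)]

# A 32x64 tensor over sg_layout = [2, 4]: each of the 8 subgroups
# covers a 16x16 tile, matching example 4's explicit sg_data.
print(derive_sg_data([32, 64], [2, 4]))  # [16, 16]
```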
