[MLIR][XeGPU] Add anchor_layout and update propagation to honor user-specified layouts #169267
Conversation
@llvm/pr-subscribers-clang-tools-extra @llvm/pr-subscribers-mlir
Author: Jianhui Li (Jianhui-Li)
Changes: Introduce an anchor layout for the XeGPU anchor ops: load_nd, store_nd, prefetch_nd, dpas, load, store, prefetch, load_matrix, store_matrix, and atomic_rmw. An anchor layout is permanent and is guaranteed to be honored by XeGPU distribution and lowerings once specified.
Patch is 107.16 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/169267.diff 14 Files Affected:
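For illustration only, a workgroup-level anchor layout on a load_nd might look like the sketch below; the layout parameters, tile sizes, and exact assembly syntax are assumptions for this example, not taken from the patch:

```mlir
// Hypothetical workgroup-level layout: 8x4 subgroups, each owning a 32x32 tile.
#wg = #xegpu.layout<sg_layout = [8, 4], sg_data = [32, 32]>

// anchor_layout marks this load as an anchor op; once specified, layout
// propagation and distribution are expected to honor it rather than infer one.
%tile = xegpu.load_nd %tdesc {anchor_layout = #wg,
                              l1_hint = #xegpu.cache_hint<cached>}
  : !xegpu.tensor_desc<256x128xf16> -> vector<256x128xf16>
```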
diff --git a/mlir/include/mlir/Dialect/XeGPU/IR/XeGPUOps.td b/mlir/include/mlir/Dialect/XeGPU/IR/XeGPUOps.td
index 4c67856b559b1..344fb23ba7b8d 100644
--- a/mlir/include/mlir/Dialect/XeGPU/IR/XeGPUOps.td
+++ b/mlir/include/mlir/Dialect/XeGPU/IR/XeGPUOps.td
@@ -253,6 +253,22 @@ def XeGPU_PrefetchNdOp : XeGPU_Op<"prefetch_nd", []> {
It issues an instruction to prefetch a block of data from continuous
memory regions to each level of the cache based on their cache policy.
+ Arguments:
+ - `TensorDesc`: A tensor descriptor specifying the base nd-region of
+ memory and tensor tile to be prefetched.
+
+ - `offsets`: Index values representing per-dimension offsets from the
+ base position encoded in `TensorDesc`. They are encoded via `offsets`
+ and `const_offsets`.
+
+ - `l1_hint`, `l2_hint`, `l3_hint`: [optional] Cache-hint attributes
+ indicating the desired behavior at the L1, L2, and L3 cache levels.
+
+ - `anchor_layout`: [optional] An attribute that identifies the operation
+ as an anchor, enabling users to assign a layout that governs distribution
+ at the subgroup and/or work-item level. Only valid at workgroup and subgroup
+ level.
+
Example:
```mlir
xegpu.prefetch_nd %tdesc {l1_hint = #xegpu.cache_hint<cached>,
@@ -268,7 +284,8 @@ def XeGPU_PrefetchNdOp : XeGPU_Op<"prefetch_nd", []> {
OptionalAttr<DenseI64ArrayAttr>: $const_offsets,
OptionalAttr<XeGPU_CacheHintAttr>: $l1_hint,
OptionalAttr<XeGPU_CacheHintAttr>: $l2_hint,
- OptionalAttr<XeGPU_CacheHintAttr>: $l3_hint);
+ OptionalAttr<XeGPU_CacheHintAttr>: $l3_hint,
+ OptionalAttr<DistributeLayoutAttr>:$anchor_layout);
let extraClassDeclaration = extraBaseClassDeclaration # [{
xegpu::TensorDescType getTensorDescType() {
@@ -325,16 +342,37 @@ def XeGPU_LoadNdOp : XeGPU_Op<"load_nd", [
a block of data from memory to register. It takes a set of optional cache
hints for each level of cache, L1, L2 and L3. If hardware does not have a
corresponding cache, the corresponding cache hint attribute will be masked.
- VNNI transformation is an hardware feature for Intel GPU, which is used to
- do data packing during the load for B operand of matrix operation, if
- the bit width of the data type is less then 32 bits, e.g., fp16. And
- transpose is another Intel hardware feature, which will do transpose
- operation when loading the data if the bit width of the data type is
- fp32 or fp64. It implies that vnni and transpose cannot exit at the
- same time. It is only available to 1D or 2D blocked tensor_desc.
+
+ On Intel GPUs, hardware-supported packing rearranges data elements during
+ the load of the B operand when the element bit-width is less than 32 bits
+ (for example, fp16). The transpose feature reorders data during the load
+ when the element type is fp32 or fp64. These two features are mutually
+ exclusive and shall not be enabled simultaneously. Both features support only
+ 2D blocked tensor_desc.
In SIMT mode, result vector represents the data to be loaded by each work-item.
+ Arguments:
+
+ - `TensorDesc`: A tensor descriptor specifying the base nd-region of memory
+ and the tensor tile to be loaded.
+
+ - `offsets`: Index values representing per-dimension offsets from the base position
+ encoded in `TensorDesc`. They are encoded via `offsets` and `const_offsets`.
+
+ - `packed`: [optional] A unit attribute indicating that packing is applied
+ during the load when supported by the hardware. Only valid at lane level.
+
+ - `transpose`: [optional] An attribute describing a hardware-supported transpose
+ to be applied during the load. Only valid at lane level.
+
+ - `l1_hint`, `l2_hint`, `l3_hint`: [optional] Cache-hint attributes indicating the
+ desired behavior at the L1, L2, and L3 cache levels.
+
+ - `anchor_layout`: [optional] An attribute that identifies the operation as an anchor,
+ enabling users to assign a layout that governs distribution at the subgroup and/or
+ work-item level. Only valid at workgroup and subgroup levels.
+
Example 1:
```mlir
xegpu.load_nd %1 {transpose = [1, 0],
@@ -360,7 +398,8 @@ def XeGPU_LoadNdOp : XeGPU_Op<"load_nd", [
OptionalAttr<DenseI64ArrayAttr>: $transpose,
OptionalAttr<XeGPU_CacheHintAttr>: $l1_hint,
OptionalAttr<XeGPU_CacheHintAttr>: $l2_hint,
- OptionalAttr<XeGPU_CacheHintAttr>: $l3_hint);
+ OptionalAttr<XeGPU_CacheHintAttr>: $l3_hint,
+ OptionalAttr<DistributeLayoutAttr>:$anchor_layout);
let results = (outs XeGPU_ValueType: $value);
@@ -389,7 +428,6 @@ def XeGPU_LoadNdOp : XeGPU_Op<"load_nd", [
return getTensorDescType().getShape();
}
-
}];
let assemblyFormat = [{
@@ -430,6 +468,23 @@ def XeGPU_StoreNdOp : XeGPU_Op<"store_nd", [
In SIMT mode, the input vector represents the data to be stored by each work-item.
+ Arguments:
+
+ - `value`: A vector value representing the tensor tile to be stored.
+
+ - `TensorDesc`: A tensor descriptor specifying the base nd-region of memory and
+ the tensor tile to be stored.
+
+ - `offsets`: Index values representing per-dimension offsets from the base position
+ encoded in `TensorDesc`. They are encoded via `offsets` and `const_offsets`.
+
+ - `l1_hint`, `l2_hint`, `l3_hint`: [optional] Cache-hint attributes indicating the
+ desired behavior at the L1, L2, and L3 cache levels.
+
+ - `anchor_layout`: [optional] An attribute that identifies the operation as an anchor,
+ enabling users to assign a layout that governs distribution at the subgroup and/or
+ work-item level. Only valid at workgroup and subgroup levels.
+
Example 1:
```mlir
xegpu.store_nd %3, %2 {l1_hint = #xegpu.cache_hint<uncached>,
@@ -454,7 +509,8 @@ def XeGPU_StoreNdOp : XeGPU_Op<"store_nd", [
OptionalAttr<DenseI64ArrayAttr>: $const_offsets,
OptionalAttr<XeGPU_CacheHintAttr>: $l1_hint,
OptionalAttr<XeGPU_CacheHintAttr>: $l2_hint,
- OptionalAttr<XeGPU_CacheHintAttr>: $l3_hint);
+ OptionalAttr<XeGPU_CacheHintAttr>: $l3_hint,
+ OptionalAttr<DistributeLayoutAttr>:$anchor_layout);
let extraClassDeclaration = extraBaseClassDeclaration # [{
VectorType getValueType() {
@@ -565,8 +621,10 @@ def XeGPU_CreateDescOp: XeGPU_Op<"create_tdesc", [Pure, ViewLikeOpInterface]> {
It accepts the following parameters:
Arguments:
+
- `source`: a 1D memref or pointer (i64, i32, ui64, ui32) represents the flattened
memory object.
+
- `offsets`: a vector containing offsets of each access point. Its size
is fixed to the hardware supportted subgroup size, e.g., 16 on PVC,
implying each element in the vector corresponds to a work-item (SIMT lane)
@@ -665,17 +723,25 @@ def XeGPU_PrefetchOp : XeGPU_Op<"prefetch", []> {
it works on scattered TensorDesc instead.
Arguments:
+
- `source`: represents the memory region to be loaded from, which can be either a
tensor_desc or a 1D memref or pointer (ui64, ui32, i64 or i32).
In case of tensor_desc, offsets come from the producer create_tdesc op.
tensor_desc cannot be used in SIMT mode.
+
- `offsets`: represents offsets from source. required if `source` in not a TensorDescType.
offsets is a vector of `index` type and vector length is either the subgroup size
or 1 in SIMT mode. scalar offset is also valid for SIMT mode.
- - `l1_hint`, `l2_hint`, `l3_hint`: are optional cache hints for each level of cache.
- - `offset_align_byte`: required if `source` is a pointer. If `source` is not a pointer,
+
+ - `l1_hint`, `l2_hint`, `l3_hint`: [optional] cache hints for each level of cache.
+
+ - `offset_align_byte`: [optional] required if `source` is a pointer. If `source` is not a pointer,
it is not allowed. Represents the alignment in bytes of each offset in offsets.
+ - `anchor_layout`: [optional] An attribute that identifies the operation as an anchor,
+ enabling users to assign a layout that governs distribution at the subgroup and/or
+ work-item level. Only valid at workgroup and subgroup levels.
+
Example 1:
```mlir
xegpu.prefetch %tdesc {l1_hint = #xegpu.cache_hint<cached>,
@@ -724,7 +790,8 @@ def XeGPU_PrefetchOp : XeGPU_Op<"prefetch", []> {
OptionalAttr<XeGPU_CacheHintAttr>:$l1_hint,
OptionalAttr<XeGPU_CacheHintAttr>:$l2_hint,
OptionalAttr<XeGPU_CacheHintAttr>:$l3_hint,
- OptionalAttr<I64Attr>:$offset_align_byte);
+ OptionalAttr<I64Attr>:$offset_align_byte,
+ OptionalAttr<DistributeLayoutAttr>:$anchor_layout);
let extraClassDeclaration = extraBaseClassDeclaration # [{
Type getSourceType() {
@@ -776,18 +843,27 @@ def XeGPU_LoadGatherOp : XeGPU_Op<"load", [MemoryEffects<[MemRead]>]> {
each work-item. If size is not 1, size should be equal to the chunk size,
Arguments:
+
- `source`: represents the memory region to be loaded from, which can be either a
tensor_desc or a 1D memref or pointer (ui64, ui32, i64 or i32).
In case of tensor_desc, offsets come from the producer create_tdesc op.
tensor_desc cannot be used in SIMT mode.
+
- `offsets`: represents offsets from source. required if `source` in not a TensorDescType.
offsets is a vector of `index` type and vector length is either the subgroup size
or 1 in SIMT mode. scalar offset is also valid for SIMT mode.
+
- `mask`: is a vector of `i1` type, which is used to mask out the memory access.
mask is a vector of size equal to the subgroup size, or 1 in SIMT mode.
scalar mask is also valid for SIMT mode.
- - `chunk_size`: (optional) represents contiguous number of elements to load from per work item.
- - `l1_hint`, `l2_hint`, `l3_hint`: are optional cache hints for each level of cache.
+
+ - `chunk_size`: [optional] represents the number of contiguous elements to load per work-item.
+
+ - `l1_hint`, `l2_hint`, `l3_hint`: [optional] cache hints for each level of cache.
+
+ - `anchor_layout`: [optional] An attribute that identifies the operation as an anchor,
+ enabling users to assign a layout that governs distribution at the subgroup and/or
+ work-item level. Only valid at workgroup and subgroup levels.
Results:
- `res`: represents loaded data
@@ -844,7 +920,7 @@ def XeGPU_LoadGatherOp : XeGPU_Op<"load", [MemoryEffects<[MemRead]>]> {
OptionalAttr<XeGPU_CacheHintAttr>:$l1_hint,
OptionalAttr<XeGPU_CacheHintAttr>:$l2_hint,
OptionalAttr<XeGPU_CacheHintAttr>:$l3_hint,
- OptionalAttr<DistributeLayoutAttr>:$layout);
+ OptionalAttr<DistributeLayoutAttr>:$anchor_layout);
let results = (outs AnyTypeOf<[XeGPU_ValueType, XeGPU_ScalarType]>:$value);
let extraClassDeclaration = extraBaseClassDeclaration # [{
@@ -903,7 +979,7 @@ def XeGPU_LoadGatherOp : XeGPU_Op<"load", [MemoryEffects<[MemRead]>]> {
"xegpu::CachePolicyAttr": $l1_hint,
"xegpu::CachePolicyAttr": $l2_hint,
"xegpu::CachePolicyAttr": $l3_hint,
- "xegpu::DistributeLayoutAttr": $layout)>
+ "xegpu::DistributeLayoutAttr": $anchor_layout)>
];
let hasVerifier = 1;
@@ -923,19 +999,30 @@ def XeGPU_StoreScatterOp : XeGPU_Op<"store", [MemoryEffects<[MemWrite]>]> {
each work-item. If size is not 1, size should be equal to the chunk size.
Arguments:
+
- `value`: represents the data to be stored.
+
- `dest`: represents the memory region to be stored to, which can be either a
tensor_desc or a 1D memref or pointer (ui64, ui32, i64 or i32).
In case of tensor_desc, offsets come from the producer create_tdesc op.
tensor_desc cannot be used in SIMT mode.
+
- `offsets`: represents offsets from dest. required if `source` in not a TensorDescType.
offsets is a vector of `index` type and vector length is either the subgroup size
or 1 in SIMT mode. scalar offset is also valid for SIMT mode.
+
- `mask`: is a vector of `i1` type, which is used to mask out the memory access.
mask is a vector of size equal to the subgroup size, or 1 in SIMT mode.
scalar mask is also valid for SIMT mode.
- - `chunk_size`: (optional) represents contiguous number of elements to store to per work item.
- - `l1_hint`, `l2_hint`, `l3_hint`: are optional cache hints for each level of cache.
+
+ - `chunk_size`: [optional] represents the number of contiguous elements to store per work-item.
+
+ - `l1_hint`, `l2_hint`, `l3_hint`: [optional] cache hints for each level of cache.
+
+ - `anchor_layout`: [optional] An attribute that identifies the operation as an anchor,
+ enabling users to assign a layout that governs distribution at the subgroup and/or
+ work-item level. Only valid at workgroup and subgroup levels.
+
Example 1:
```mlir
@@ -988,7 +1075,7 @@ def XeGPU_StoreScatterOp : XeGPU_Op<"store", [MemoryEffects<[MemWrite]>]> {
OptionalAttr<XeGPU_CacheHintAttr>:$l1_hint,
OptionalAttr<XeGPU_CacheHintAttr>:$l2_hint,
OptionalAttr<XeGPU_CacheHintAttr>:$l3_hint,
- OptionalAttr<DistributeLayoutAttr>:$layout);
+ OptionalAttr<DistributeLayoutAttr>:$anchor_layout);
let extraClassDeclaration = extraBaseClassDeclaration#[{
Type getDestType() {
@@ -1046,7 +1133,7 @@ def XeGPU_StoreScatterOp : XeGPU_Op<"store", [MemoryEffects<[MemWrite]>]> {
"xegpu::CachePolicyAttr": $l1_hint,
"xegpu::CachePolicyAttr": $l2_hint,
"xegpu::CachePolicyAttr": $l3_hint,
- "xegpu::DistributeLayoutAttr": $layout)>
+ "xegpu::DistributeLayoutAttr": $anchor_layout)>
];
let hasVerifier = 1;
@@ -1112,28 +1199,38 @@ def XeGPU_DpasOp : XeGPU_Op<"dpas", [Pure, AllElementTypesMatch<["lhs", "rhs"]>]
size, B of `kxn` size, and accumulate on matrix C of `mxn` to the same size
matrix , `m=8`, `n=16` and `k=8 * 32/bit_width_of_elem_type`. So for fp16
data type, the matrices are `A: vector<8x16xf16>`, `B: vector<16x16xf16>`,
- and `C/D: vector<8x16xf32>`. Besides the matrix size requirements, DPAS
- also requires A and B to be loaded with the required data layout. Specially,
- VNNI layout is required for B operand. It is achieved via adding `packed`
- attribute to the `load_nd` operator. Due to the VNNI transformation, B operands
- can be represented as a 3D vector, with the last dimension representing the VNNI
- factor, which is computed as `32/bit_width_of_elem_type`. Thus, `B: vector<16x16xf16>`
- can be represented as `B: vector<8x16x2xf16>`.
+ and `C/D: vector<8x16xf32>`.
In SIMT code, each work-item from a subgroup holds a data fragment for A, B, C and the result,
which are represented as 1D vectors. Please refer to [OpenCL Intel extentions]
(https://registry.khronos.org/OpenCL/extensions/intel/cl_intel_subgroup_matrix_multiply_accumulate.html)
for more details about the fragment distribution.
- Note: on PVC, the hardware can perform load with VNNI transformation when data
- element type is 16-bit or lower precision, taking 2 or 4 elements from
- the first dimension and inserted into the newly added innermost dimension.
+ Arguments:
+
+ - `lhs`: A vector value representing the left-hand-side matrix tile (A) participating in the
+ matrix multiply.
+
+ - `rhs`: A vector value representing the right-hand-side matrix tile (B).
+
+ - `acc`: [optional] A vector value representing the accumulator matrix tile (C). When present, the
+ result is computed as `lhs * rhs + acc`; otherwise, the accumulator is implicitly assumed to be zero.
+
+ - `anchor_layout_a`, `anchor_layout_b`, `anchor_layout_cd`: [optional] Attributes that identify this
+ operation as an anchor for operands A, B, and the accumulator/result, enabling users to assign layouts
+ that govern distribution at the subgroup and/or work-item level. Only valid at the workgroup and subgroup
+ levels.
+
}];
let arguments = (ins
XeGPU_DpasOprType : $lhs,
XeGPU_DpasOprType : $rhs,
- Optional<XeGPU_DpasResType>: $acc);
+ Optional<XeGPU_DpasResType>: $acc,
+ OptionalAttr<DistributeLayoutAttr>:$anchor_layout_a,
+ OptionalAttr<DistributeLayoutAttr>:$anchor_layout_b,
+ OptionalAttr<DistributeLayoutAttr>:$anchor_layout_cd
+ );
let results = (outs XeGPU_DpasResType: $result);
let extraClassDeclaration = [{
@@ -1180,13 +1277,31 @@ def XeGPU_AtomicRMWOp: XeGPU_Op<"atomic_rmw", [Pure,
has the same shape with `TensorDesc`, and is used to enable or disable specific
data points of the `TensorDesc`. The `value` operand represents the new value to
be applied during the modification.
+ Arguments:
+ - `kind`: An attribute that specifies the atomic operation to be performed
+ (e.g., add, min, max, exchange, etc.).
+
+ - `tensorDesc`: A `TensorDesc` describing the memory region on which the atomic
+ read-modify-write is performed.
+
+ - `mask`: A predicate mask with the same shape as `tensorDesc`. Only elements
+ with a true (non-zero) mask value participate in the atomic operation;
+ masked-out elements are not modified.
+
+ - `value`: The input values used by the atomic operation. It must have the same
+ shape and element type as `tensorDesc` and `result`.
+
+ - `anchor_layout`: [optional] An attribute that identifies the operation as an anchor,
+ enabling users to assign a layout that governs distribution at the subgroup
+ and/or work-item level. Only valid at workgroup and subgroup levels.
}];
let arguments = (ins
AtomicRMWKindAttr:$kind,
XeGPU_TensorDesc:$tensorDesc,
XeGPU_MaskType:$mask,
- XeGPU_ValueType:$value);
+ XeGPU_ValueType:$value,
+ OptionalAttr<DistributeLayoutAttr>:$anchor_layout);
let results = (outs XeGPU_ValueType:$result);
@@ -1268,6 +1383,13 @@ def XeGPU_ConvertLayoutOp: XeGPU_Op<"convert_layout", [Pure, AllTypesMatch<["sou
the `target_layout`. Both `input_layout` and `target_layout` must correspond to the same programming
scope, such as workgroup-level (wg) or subgroup-level (sg) code. This operation is not valid once
the IR is lowered to WI level because that is the end result of all distributions.
+ Arguments:
+ - `source`: The input vector whose data is to be redistributed. The source and
+ result types must match.
+ - `input_layout`: The layout attribute describing the current distribution of `source`
+ across subgroups and/or work-items.
+ - `target_layout`: The layout attribute describing the desired distribution of the result
+ across subgroups and/or work-items.
}];
let arguments = (ins XeGPU_VectorType: $source,
DistributeLayoutAttr: $input_layout,
@@ -1319,7 +1441,7 @@ def XeGPU_LoadMatrixOp: XeGPU_Op<"load_matrix", [MemoryEffects<[MemRead]>,
Variadic<Index>: $offsets,
DenseI64ArrayAttr: $const_offsets,
OptionalAttr<UnitAttr>:$subgroup_block_io,
- OptionalAttr<DistributeLayoutAttr>:$layout
+ OptionalAttr<DistributeLayoutAttr>:$anchor_layout
);
let results = (outs AnyTypeOf<[XeGPU_ValueType, XeGPU_ScalarType]>:$res);
let assemblyFormat = [{
@@ -1335,19 +1457,20 @@ def XeGPU_LoadMatrixOp: XeGPU_Op<"load_matrix", [MemoryEffects<[MemRead]>,
Arguments:
- `mem_desc`: the memory descriptor identifying the SLM region.
- `offsets`: the coordinates within the matrix to read from.
- - `subgroup_block_io`: [optional] An attribute indicating that the operation can be
- lowered to a subgroup block load. When this attribute is present,
- the offsets are subgroup-uniform across all lanes.
- - `layout`: [optional] An attribute for guiding ...
[truncated]
charithaintc left a comment:
Looks good. Will approve after feedback for the initial review.
janghaeng-intel left a comment:
In xegpu::setDistributeLayoutAttr, shouldn't anchor_layout be set for the ops with anchor_layout (e.g., ld/st matrix)?
charithaintc left a comment:
LGTM % existing concerns/comments
    ArrayRef<LayoutInfoLattice *> operands,
    ArrayRef<const LayoutInfoLattice *> results);

    bool hasParamsOfLayoutKind(xegpu::DistributeLayoutAttr anchorLayout);

nit: this method has nothing to do with layout propagation. I would move it to a static helper hasParamsOfLayoutKind(xegpu::DistributeLayoutAttr anchorLayout, LayoutKind kind) and keep LayoutInfoPropagation clean.
      }
      // Propagate the new layout to the tensor descriptor operand.
-     propagateIfChanged(operands[0], operands[0]->meet(tensorDescLayout));
+     propagateIfChanged(operands[0], operands[0]->meet(loadLayout));

Here, what happens if the user specifies the layout at create_nd? I guess that is ignored, right?
nit: also, what happens if the layout assigned to the value (valueLayout) does not match the anchor layout? Shouldn't we verify it, mainly because we don't have layout conflict handling yet?
If the user specifies the layout at create_nd, it won't be honored.
We will address this in the next refactor steps.
akroviakov left a comment:
It looks like anchor ops might benefit from a dedicated interface.
For us, it would make layout handling in ops easier (e.g., use it in updateOp() in propagation instead of within visitors).
For users, it would serve as a single place that clearly defines the "anchor" term and as an easy-to-notice marker for anchor ops when reading the tablegen.
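A rough sketch of what such a dedicated interface could look like in ODS (hypothetical: the interface name and method set are assumptions, not part of this patch):

```tablegen
// Hypothetical anchor-op interface sketch; names and methods are illustrative.
def XeGPU_AnchorLayoutOpInterface : OpInterface<"AnchorLayoutOp"> {
  let cppNamespace = "::mlir::xegpu";
  let description = [{
    An op that may carry a user-specified anchor layout, which layout
    propagation and distribution must honor once set.
  }];
  let methods = [
    InterfaceMethod<
      /*desc=*/"Returns the anchor layout, or null if none was specified.",
      /*retTy=*/"xegpu::DistributeLayoutAttr",
      /*methodName=*/"getAnchorLayoutAttr">,
    InterfaceMethod<
      /*desc=*/"Sets the anchor layout.",
      /*retTy=*/"void",
      /*methodName=*/"setAnchorLayoutAttr",
      /*args=*/(ins "xegpu::DistributeLayoutAttr":$layout)>
  ];
}
```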
-   tensor_desc cannot be used in SIMT mode.
+   tensor_desc cannot be used at lane level.

    - `offsets`: represents offsets from source. required if `source` in not a TensorDescType.

Suggested change:
-   - `offsets`: represents offsets from source. required if `source` in not a TensorDescType.
+   - `offsets`: represents offsets from source. required if `source` is not a TensorDescType.
      prefetchLayout = getDefaultSIMTLayoutInfo(
          tdescTy, uArch, uArchInstruction->getPackedFormatBitSize());

      prefetch.setLayoutAttr(

Shouldn't this be done in the updateOp() utility as we do for other cases? This could be abstracted to an "anchor interface" for ease of use.

The updateOp() needs to be refactored. We should not update the "operand_result_*" attribute and hope it can be used by another pass, so we can try the interface idea then.
  // CHECK-NEXT: %[[T2:.*]]:3 = scf.for %{{.*}} iter_args(%[[ARG4:.*]] = %[[T0]], %[[ARG5:.*]] = %[[T1]], %[[ARG6:.*]] = %[[CST]]) ->
  // CHECK-SAME: (!xegpu.tensor_desc<8x16xf16, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>, !xegpu.tensor_desc<16x16xf16, #xegpu.layout<lane_layout = [1, 16], lane_data = [2, 1]>>, vector<8x16xf32>) {
- // CHECK-NEXT: %[[T4:.*]] = xegpu.load_nd %[[ARG4]] {layout_result_0 = #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>} :
+ // CHECK-NEXT: %[[T4:.*]] = xegpu.load_nd %[[ARG4]] <{layout = #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>}> {layout_result_0 = #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>} :
Why the redundancy between
<{layout = #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>}>
and
{layout_result_0 = #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>}
layout_result_0 should be removed, but in future refactor steps.

Looks like an unrelated change.