[MLIR][XeGPU] Remove the transpose attribute from Gather/Scatter ops and clean up the documentation #145389
Changes from 9 commits
```diff
@@ -8,6 +8,7 @@
 #include "mlir/Dialect/Utils/IndexingUtils.h"
 #include "mlir/Dialect/XeGPU/IR/XeGPU.h"
 #include "mlir/Dialect/XeGPU/Utils/XeGPUUtils.h"
 #include "mlir/IR/Builders.h"
 #include "mlir/IR/DialectImplementation.h"
 #include "llvm/ADT/TypeSwitch.h"

@@ -309,11 +310,23 @@ LogicalResult TensorDescType::verify(
     llvm::ArrayRef<int64_t> shape, mlir::Type elementType,
     mlir::Attribute encoding, mlir::Attribute layout) {
   size_t rank = shape.size();
-  // Low-precision types are packed in 32-bit units.
-  int32_t packingFactor = 32 / elementType.getIntOrFloatBitWidth();
   if (rank != 1 && rank != 2)
     return emitError() << "expected 1D or 2D tensor";

+  auto blockAttr = mlir::dyn_cast_if_present<BlockTensorDescAttr>(encoding);
+  if (blockAttr) {
+    MemorySpaceAttr memorySpaceAttr = blockAttr.getMemorySpace();
+    if (rank == 2 && memorySpaceAttr &&
+        memorySpaceAttr.getValue() == MemorySpace::SLM)
+      return emitError() << "SLM is not supported for 2D block tensor";
+  }
+
+  // for gather and scatter ops, Low-precision types are packed in 32-bit units.
+  unsigned bitWidth = elementType.getIntOrFloatBitWidth();
+  int packingFactor =
+      bitWidth < targetinfo::packedSizeInBitsForGatherScatter
+          ? targetinfo::packedSizeInBitsForGatherScatter / bitWidth
+          : 1;
   auto scatterAttr = mlir::dyn_cast_if_present<ScatterTensorDescAttr>(encoding);
   if (scatterAttr) {
     // Expected tensor ranks for scattered data:

@@ -336,14 +349,6 @@ LogicalResult TensorDescType::verify(
     }
   }

-  auto blockAttr = mlir::dyn_cast_if_present<BlockTensorDescAttr>(encoding);
-  if (blockAttr) {
-    MemorySpaceAttr memorySpaceAttr = blockAttr.getMemorySpace();
-    if (rank == 2 && memorySpaceAttr &&
-        memorySpaceAttr.getValue() == MemorySpace::SLM)
-      return emitError() << "SLM is not supported for 2D block tensor";
-  }

   auto layoutAttr = llvm::dyn_cast_if_present<LayoutAttr>(layout);
   if (layoutAttr) {
     if (rank != (size_t)layoutAttr.getRank())
```
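To make the new packing rule concrete, here is a small illustrative sketch (not code from the PR); it assumes packedSizeInBitsForGatherScatter is 32, matching the "32-bit units" comment above, and the descriptor type and its alias name are only examples:

```mlir
// f16 has a bit width of 16 < 32, so packingFactor = 32 / 16 = 2: two f16
// values share one 32-bit unit. For f32 (bit width 32) the factor is 1.
// An example of a scattered descriptor the verifier's packing check applies to:
!example_tdesc = !xegpu.tensor_desc<16x8xf16, #xegpu.scatter_tdesc_attr<chunk_size = 8>>
```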
Are we assuming here that there will always be a transpose operation after the load?
I wonder how a user can understand the semantics of this op. What if the user does not want the transpose and wants to use the op in isolation (which is perfectly legal)?
There is no transpose. The semantics are that each row corresponds to a lane. In the SIMD lowering pipeline, the transpose will be added when we lower the load_gather to the corresponding intrinsic. For SIMT lowering, there is no transpose at all.
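A minimal sketch of that semantic, assuming 16 lanes and chunk_size = 8 (the shapes, values, and exact op syntax are illustrative, not taken from the PR; %src, %offsets, and %mask are assumed to be defined earlier):

```mlir
// 16 offsets: each one selects a chunk of 8 contiguous f16 values for one lane.
%tdesc = xegpu.create_tdesc %src, %offsets : memref<?xf16>, vector<16xindex>
    -> !xegpu.tensor_desc<16x8xf16, #xegpu.scatter_tdesc_attr<chunk_size = 8>>
// Subgroup (SIMD) view: row i of the result is the data owned by lane i;
// the op itself implies no transpose.
%val = xegpu.load %tdesc, %mask
    : !xegpu.tensor_desc<16x8xf16, #xegpu.scatter_tdesc_attr<chunk_size = 8>>, vector<16xi1>
    -> vector<16x8xf16>
// SIMT view: each lane simply sees its own vector<8xf16>; still no transpose.
```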
I thought about it again.
It seems like now xegpu.load (with chunk > 1) is just a logical operation, meaning it does not have a matching HW instruction. Logically we can use it without an accompanying transpose operation; that is true.
In practice, it will always come with an accompanying transpose. It will mostly be useful for the A*BT case. In that case we always need an explicit vector.transpose after the xegpu.load. During lowering, the load + transpose are optimized away in both the SIMD and SIMT paths. Essentially we say "we have a HW instruction that can do both of these together, so the transpose here is a nop"; no need to do any shuffling for the transpose. For the A*B case, I think doing multiple loads may be cheaper than doing a load gather and then an in-register transpose; I am not sure about this case.
A*BT case
A*B case.
The A*BT case is clear to me, but I am not sure what we do with the A*B case here. Maybe I am still missing something. @Jianhui-Li can you also clarify these examples? I know that A*B is not a real use case, but I am still confused about how layout propagation works here.
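For reference, a rough sketch of the A*BT pattern discussed above, with the explicit vector.transpose between the gather load and the DPAS; the tile shapes, chunk_size, and exact op syntax are assumptions for illustration, not code from the PR:

```mlir
// Gather-load a 16x16 f16 tile of B: each of the 16 lanes owns one contiguous
// chunk of 16 elements (%b_src, %offsets, %mask, and %a assumed defined).
%b_desc = xegpu.create_tdesc %b_src, %offsets : memref<?xf16>, vector<16xindex>
    -> !xegpu.tensor_desc<16x16xf16, #xegpu.scatter_tdesc_attr<chunk_size = 16>>
%b_raw = xegpu.load %b_desc, %mask
    : !xegpu.tensor_desc<16x16xf16, #xegpu.scatter_tdesc_attr<chunk_size = 16>>, vector<16xi1>
    -> vector<16x16xf16>
// Explicit transpose required by the A*BT pattern; per the discussion, the
// load + transpose pair is folded away during both SIMD and SIMT lowering.
%b = vector.transpose %b_raw, [1, 0] : vector<16x16xf16> to vector<16x16xf16>
%c = xegpu.dpas %a, %b : vector<8x16xf16>, vector<16x16xf16> -> vector<8x16xf32>
```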
nvm. It is clear now after discussing with @Jianhui-Li. A*B case will need a convert_layout because the load is not giving us the layout needed for DPAS B.
For the A*B case: since load w/ chunk_size is used for B, which assumes a [16, 1] [1, 2] layout, the propagation needs to insert an xegpu.convert_layout to convert it to [1, 16] [2, 1] before it feeds the DPAS.
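A rough sketch of what that conversion might look like, assuming the layouts are expressed as #xegpu.layout with lane_layout/lane_data and the conversion uses xegpu.convert_layout; the attribute names and exact syntax are approximations, not code from the PR:

```mlir
// %b_loaded comes from the chunked gather load of B: lanes own rows,
// i.e. lane_layout = [16, 1] with lane_data = [1, 2] for f16.
// DPAS B instead wants lanes to own columns: lane_layout = [1, 16], lane_data = [2, 1].
%b_dpas = xegpu.convert_layout %b_loaded
    <{input_layout  = #xegpu.layout<lane_layout = [16, 1], lane_data = [1, 2]>,
      target_layout = #xegpu.layout<lane_layout = [1, 16], lane_data = [2, 1]>}>
    : vector<16x16xf16>
%c = xegpu.dpas %a, %b_dpas : vector<8x16xf16>, vector<16x16xf16> -> vector<8x16xf32>
```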
So from a lowering perspective we expect two cases.
Yes.
One thing to note in the lowering: in your code example, the user specifies xegpu.load w/ chunk_size, which by default will be lowered to an XeVM load with a vector size (each lane loads contiguous data).
If the user overrides the layout of the xegpu.load w/ chunk_size, say forcing it to take a [1, 16] [2, 1] layout, it will need to be lowered to multiple regular XeVM loads, since the data loaded by each lane is no longer contiguous.
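Sketched in XeGPU terms (the result-layout attribute below is only a hypothetical notation to make the contrast visible; it is not an interface defined by this PR, and the XeVM-level output is described only in the comments):

```mlir
// Case 1, default layout ([16, 1] [1, 2]): each lane owns one contiguous chunk,
// so the op can lower to a single vector-sized per-lane load.
%v0 = xegpu.load %tdesc, %mask
    : !xegpu.tensor_desc<16x8xf16, #xegpu.scatter_tdesc_attr<chunk_size = 8>>, vector<16xi1>
    -> vector<16x8xf16>

// Case 2, user-overridden layout ([1, 16] [2, 1]), written with a hypothetical
// result-layout attribute: the elements a lane needs are no longer contiguous,
// so the same op would have to lower to multiple regular per-lane loads.
%v1 = xegpu.load %tdesc, %mask
    {layout_result_0 = #xegpu.layout<lane_layout = [1, 16], lane_data = [2, 1]>}
    : !xegpu.tensor_desc<16x8xf16, #xegpu.scatter_tdesc_attr<chunk_size = 8>>, vector<16xi1>
    -> vector<16x8xf16>
```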
Is the user allowed to do this? I would also like it if we keep it relaxed. But I can see that in this PR we have hard-coded the scattered load layout to [16, 1] [1, 2]. Check here:
https://github.com/llvm/llvm-project/pull/145389/files#diff-fcc9cdbf8bb4e5d37e661524b877082aee9b7badb0317f980c1881da564a926dR230-R237
I meant that the propagation pass will be improved to allow the user to set a layout that overrides the default decision.