
@charithaintc (Contributor) commented Oct 28, 2025:

This pass rewrites certain xegpu CreateNd and LoadNd operations that feed into vector.transpose into a more optimal form to improve performance. Specifically, low-precision (bitwidth < 32) LoadNd ops that feed into transpose ops are rewritten as i32 loads with a valid transpose layout, so that later passes can use the load-with-transpose HW feature to accelerate such loads.
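A minimal before/after sketch of this rewrite, assuming a 32x16xf16 tile. The SSA names, memref shapes, and the i32 view %src32 are illustrative, and the pass's actual output encodes the transpose intent via layout attributes rather than exactly this IR:

%c0 = arith.constant 0 : index
// Before: an f16 load whose result only feeds a transpose.
%desc = xegpu.create_nd_tdesc %src : memref<64x64xf16> -> !xegpu.tensor_desc<32x16xf16>
%v = xegpu.load_nd %desc[%c0, %c0] : !xegpu.tensor_desc<32x16xf16> -> vector<32x16xf16>
%t = vector.transpose %v, [1, 0] : vector<32x16xf16> to vector<16x32xf16>

// After (conceptually): each pair of f16 values is viewed as one i32 so the
// load-with-transpose HW feature applies; a bitcast restores the f16 view in
// a packed order that later layout handling accounts for. %src32 stands in
// for an i32 view of the same buffer.
%desc32 = xegpu.create_nd_tdesc %src32 : memref<64x32xi32> -> !xegpu.tensor_desc<32x8xi32>
%v32 = xegpu.load_nd %desc32[%c0, %c0] <{transpose = array<i64: 1, 0>}> : !xegpu.tensor_desc<32x8xi32> -> vector<8x32xi32>
%cast = vector.bitcast %v32 : vector<8x32xi32> to vector<8x64xf16>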

Update:
The pass is renamed to OptimizeBlockLoads because we plan to add the array-length optimization to this pass as well. That optimization will break a larger load (like 32x32xf16) into more DPAS-favorable array-length loads (32x16xf16 with array_length = 2); a sketch follows below. Both optimizations require rewriting CreateNd and LoadNd, so it makes sense to have a common pass for both.
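A hedged sketch of the planned array-length rewrite (descriptor names are illustrative; the shapes mirror the test excerpt further down):

// Before: one large 32x32xf16 load.
%a = xegpu.load_nd %desc[%c0, %c0] : !xegpu.tensor_desc<32x32xf16> -> vector<32x32xf16>

// After: two DPAS-favorable 32x16 blocks fetched with a single array load.
%b = xegpu.load_nd %desc2[%c0, %c0]
    : !xegpu.tensor_desc<32x16xf16, #xegpu.block_tdesc_attr<array_length = 2 : i64>> -> vector<2x32x16xf16>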

@akroviakov (Contributor) left a comment:

Some preliminary comments

%4:4 = scf.for %arg3 = %c0 to %c256 step %c32 iter_args(%arg4 = %1, %arg5 = %1, %arg6 = %1, %arg7 = %1)
    -> (vector<8x16xf32>, vector<8x16xf32>, vector<8x16xf32>, vector<8x16xf32>) {
  %6 = xegpu.load_nd %3[%c0, %arg3] { layout_result_0 = #b }
      : !xegpu.tensor_desc<32x16xf16, #b, #xegpu.block_tdesc_attr<array_length = 2 : i64>> -> vector<2x32x16xf16>
@akroviakov: The #b layout is 2D but the result vector is 3D. I think we should consider just using 2D (32x32 instead of 2x32x16) for the block array load.
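For contrast, the suggested 2D result shape would read as follows (hypothetical; not what the pass currently emits):

%6 = xegpu.load_nd %3[%c0, %arg3] { layout_result_0 = #b }
    : !xegpu.tensor_desc<32x16xf16, #b, #xegpu.block_tdesc_attr<array_length = 2 : i64>> -> vector<32x32xf16>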

@charithaintc (Author): This needs some changes to XeGPU op verifications and some existing lowering passes, so I would like to keep this test to show the current capability of this pass. I will do the cleanup for the 3D vector representation in another PR.

// converted.
target.addDynamicallyLegalOp<xegpu::CreateNdDescOp>(
    [&](xegpu::CreateNdDescOp createNdOp) {
      return !canBeOptimized(createNdOp.getType());
@akroviakov: Consider using a name like canLoadTransposed.

@charithaintc (Author) commented Nov 3, 2025: Renamed to canBeCanonicalizedForTranspose. I don't want to mention any specific op (like Load) in the name because we are modifying an op sequence here.

VectorType origVectorType =
    VectorType::get(origDataShape, adaptorType.getElementType());
Value data;
// Orig data shape is 3D for the array length case.
@akroviakov: I think 2D works better for the array block load result.

@charithaintc (Author): Agreed. This will be cleaned up; I will create an issue to track this.

@charithaintc changed the title from "[mlir][xegpu] Add OptimizeTranspose pass." to "[mlir][xegpu] Add OptimizeBlockLoads pass." on Nov 3, 2025.
@Jianhui-Li (Contributor) left a comment:

LGTM

rewriter, loc,
VectorType::get(supportedShape, data.getType().getElementType()),
newTensorDesc, ArrayRef<OpFoldResult>{loadOffsetX, loadOffsetY},
origLoadOp.getPackedAttr(), origLoadOp.getTransposeAttr(),
@Jianhui-Li: Nit: Do we still need to set the Pack and Transpose attributes? I think this info is associated with the layout, and we only set them at the WI level when the lane-level layout is dropped.

@charithaintc (Author): You are right. This is only for copying existing attributes from the f16 load to the i32 load; it has no impact.

@charithaintc merged commit 9703bda into llvm:main on Nov 4, 2025 (8 of 9 checks passed).
@adam-smnk (Contributor): Just a fly-by comment, not adding the new pass to the pipeline?

@charithaintc (Author):

> Just a fly-by comment, not adding the new pass to the pipeline?

Ah sorry, I was unaware of this requirement. I think I will add it once this is e2e tested properly downstream.

@llvm-ci (Collaborator) commented Nov 5, 2025:

LLVM Buildbot has detected a new failure on builder ppc64le-mlir-rhel-clang running on ppc64le-mlir-rhel-test while building mlir at step 6 "test-build-check-mlir-build-only-check-mlir".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/129/builds/32450

Here is the relevant piece of the build log for reference:
Step 6 (test-build-check-mlir-build-only-check-mlir) failure: 1200 seconds without output running [b'ninja', b'check-mlir'], attempting to kill
...
PASS: MLIR-Unit :: Interfaces/./MLIRInterfacesTests/12/22 (3624 of 3634)
PASS: MLIR-Unit :: Pass/./MLIRPassTests/10/13 (3625 of 3634)
PASS: MLIR-Unit :: Interfaces/./MLIRInterfacesTests/11/22 (3626 of 3634)
PASS: MLIR-Unit :: IR/./MLIRIRTests/0/130 (3627 of 3634)
PASS: MLIR :: mlir-reduce/dce-test.mlir (3628 of 3634)
PASS: MLIR :: mlir-tblgen/cpp-class-comments.td (3629 of 3634)
PASS: MLIR :: mlir-reduce/crashop-reduction.mlir (3630 of 3634)
PASS: MLIR :: mlir-reduce/multiple-function.mlir (3631 of 3634)
PASS: MLIR :: mlir-tblgen/op-error.td (3632 of 3634)
PASS: MLIR :: Pass/pipeline-options-parsing.mlir (3633 of 3634)
command timed out: 1200 seconds without output running [b'ninja', b'check-mlir'], attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=1998.228646

@adam-smnk (Contributor):

> > Just a fly-by comment, not adding the new pass to the pipeline?
>
> Ah sorry, I was unaware of this requirement. I think I will add it once this is e2e tested properly downstream.

Not a requirement per se, just curious 😀

Ideally, the pipeline should be kept up to date when possible. But no worries if a new pass/transform needs more testing first (standalone or within the pipeline) or might not be universally applicable.
