[mlir][XeGPU][VectorToXeGPU] Propagate vector layouts to xegpu ops #163071

dchigarev · 2025-10-12T14:02:37Z

This PR adds xegpu.layout and xegpu.cache_hint attributes propagation from an original vector ops to the generated xegpu.load(_nd)/.store(_nd) ops in VectorToXeGPU pass.

Signed-off-by: dchigarev <[email protected]>

dchigarev · 2025-10-12T14:20:33Z

mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp

+  // Extract indices layout and propagate it to all 'vector' ops created here
+  auto indicesLayout = mlir::xegpu::getDistributeLayoutAttr(indices);
  VectorType vecType = cast<VectorType>(indices.getType());


In this section we compute 'flat' indices for the xegpu.load/store op using vector's %indices operand like this:

%base_offset = vector.broadcast index -> vector<...> %last_stride = vector.broadcast index -> vector<...> %flat_indices = %indices * %last_stride + %base_offset

We want all the vector broadcast/add/mul operations that are generated here to have layout_result_0 to be equal to the %indices layout.

We should use the propagation pass to properly set up the layout attributes for non-anchor ops.

llvmbot · 2025-10-12T14:21:15Z

@llvm/pr-subscribers-mlir

@llvm/pr-subscribers-mlir-gpu

Author: Dmitry Chigarev (dchigarev)

Changes

This PR adds xegpu.layout and xegpu.cache_hint attributes propagation from an original vector.gather/.scatter op to the generated xegpu.load/.store ops in VectorToXeGPU pass.

A similiar PR for vector.transfer_read/write will follow.

Patch is 28.96 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/163071.diff

3 Files Affected:

(modified) mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp (+84-23)
(modified) mlir/test/Conversion/VectorToXeGPU/gather-to-xegpu.mlir (+185)
(modified) mlir/test/Conversion/VectorToXeGPU/scatter-to-xegpu.mlir (+149)

diff --git a/mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp b/mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp
index e2c7d803e5a5e..2e365ec692c2e 100644
--- a/mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp
+++ b/mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp
@@ -97,6 +97,20 @@ static LogicalResult transferPreconditions(PatternRewriter &rewriter,
   return success();
 }
 
+// Extract cache hints from the op attributes if available.
+static void getOpCacheHints(Operation *op,
+                            SmallVector<xegpu::CachePolicyAttr, 3> &hints) {
+  assert(hints.size() == 3 &&
+         "Expecting a vector of size 3 for l1, l2, l3 hints.");
+  // get l1, l2, l3 hints from attributes if available.
+  if (auto l1Attr = op->getAttrOfType<xegpu::CachePolicyAttr>("l1_hint"))
+    hints[0] = l1Attr;
+  if (auto l2Attr = op->getAttrOfType<xegpu::CachePolicyAttr>("l2_hint"))
+    hints[1] = l2Attr;
+  if (auto l3Attr = op->getAttrOfType<xegpu::CachePolicyAttr>("l3_hint"))
+    hints[2] = l3Attr;
+}
+
 static xegpu::CreateNdDescOp
 createNdDescriptor(PatternRewriter &rewriter, Location loc,
                    xegpu::TensorDescType descType, TypedValue<MemRefType> src,
@@ -374,22 +388,30 @@ static Value computeOffsets(PatternRewriter &rewriter, OpType gatScatOp,
         arith::AddIOp::create(rewriter, loc, baseOffset, offsetContrib);
   }
   Value indices = gatScatOp.getIndices();
+  // Extract indices layout and propagate it to all 'vector' ops created here
+  auto indicesLayout = mlir::xegpu::getDistributeLayoutAttr(indices);
   VectorType vecType = cast<VectorType>(indices.getType());
 
-  Value strideVector =
-      vector::BroadcastOp::create(rewriter, loc, vecType, strides.back())
-          .getResult();
-  Value stridedIndices =
-      arith::MulIOp::create(rewriter, loc, strideVector, indices).getResult();
-
-  Value baseVector =
-      vector::BroadcastOp::create(
-          rewriter, loc,
-          VectorType::get(vecType.getShape(), rewriter.getIndexType()),
-          baseOffset)
-          .getResult();
-  return arith::AddIOp::create(rewriter, loc, baseVector, stridedIndices)
-      .getResult();
+  auto strideVector =
+      vector::BroadcastOp::create(rewriter, loc, vecType, strides.back());
+  mlir::xegpu::setDistributeLayoutAttr(strideVector->getOpResult(0),
+                                       indicesLayout);
+
+  auto stridedIndices =
+      arith::MulIOp::create(rewriter, loc, strideVector.getResult(), indices);
+  mlir::xegpu::setDistributeLayoutAttr(stridedIndices->getOpResult(0),
+                                       indicesLayout);
+
+  auto baseVector = vector::BroadcastOp::create(
+      rewriter, loc,
+      VectorType::get(vecType.getShape(), rewriter.getIndexType()), baseOffset);
+  mlir::xegpu::setDistributeLayoutAttr(baseVector->getOpResult(0),
+                                       indicesLayout);
+
+  auto result = arith::AddIOp::create(rewriter, loc, baseVector.getResult(),
+                                      stridedIndices.getResult());
+  mlir::xegpu::setDistributeLayoutAttr(result->getOpResult(0), indicesLayout);
+  return result.getResult();
 }
 
 template <
@@ -616,16 +638,39 @@ struct GatherLowering : public OpRewritePattern<vector::GatherOp> {
         computeOffsets(rewriter, gatherOp, meta.first, meta.second);
     Value flatMemref = memrefToIndexPtr(gatherOp, rewriter);
 
+    auto numOffsets = gatherOp.getOffsets().size();
+    auto layoutRes = mlir::xegpu::getDistributeLayoutAttr(gatherOp.getResult());
+    auto layoutIndices = mlir::xegpu::getDistributeLayoutAttr(
+        gatherOp->getOpOperand(numOffsets + 1));
+    auto layoutMask = mlir::xegpu::getDistributeLayoutAttr(
+        gatherOp->getOpOperand(numOffsets + 2));
+    auto layoutPassThru = mlir::xegpu::getDistributeLayoutAttr(
+        gatherOp->getOpOperand(numOffsets + 3));
+
+    SmallVector<xegpu::CachePolicyAttr, 3> cacheHints{xegpu::CachePolicyAttr{},
+                                                      xegpu::CachePolicyAttr{},
+                                                      xegpu::CachePolicyAttr{}};
+    getOpCacheHints(gatherOp, cacheHints);
     auto xeGatherOp = xegpu::LoadGatherOp::create(
         rewriter, loc, vectorType, flatMemref, localOffsets, gatherOp.getMask(),
         /*chunk_size=*/IntegerAttr{},
-        /*l1_hint=*/xegpu::CachePolicyAttr{},
-        /*l2_hint=*/xegpu::CachePolicyAttr{},
-        /*l3_hint=*/xegpu::CachePolicyAttr{});
+        /*l1_hint=*/cacheHints[0],
+        /*l2_hint=*/cacheHints[1],
+        /*l3_hint=*/cacheHints[2]);
+    mlir::xegpu::setDistributeLayoutAttr(xeGatherOp->getOpResult(0), layoutRes);
+    mlir::xegpu::setDistributeLayoutAttr(xeGatherOp->getOpOperand(1),
+                                         layoutIndices);
+    mlir::xegpu::setDistributeLayoutAttr(xeGatherOp->getOpOperand(2),
+                                         layoutMask);
 
     auto selectOp =
         arith::SelectOp::create(rewriter, loc, gatherOp.getMask(),
                                 xeGatherOp.getResult(), gatherOp.getPassThru());
+    mlir::xegpu::setDistributeLayoutAttr(selectOp->getOpOperand(0), layoutMask);
+    mlir::xegpu::setDistributeLayoutAttr(selectOp->getOpOperand(2),
+                                         layoutPassThru);
+    mlir::xegpu::setDistributeLayoutAttr(selectOp->getOpResult(0), layoutRes);
+
     rewriter.replaceOp(gatherOp, selectOp.getResult());
     return success();
   }
@@ -650,12 +695,28 @@ struct ScatterLowering : public OpRewritePattern<vector::ScatterOp> {
         computeOffsets(rewriter, scatterOp, meta.first, meta.second);
     Value flatMemref = memrefToIndexPtr(scatterOp, rewriter);
 
-    xegpu::StoreScatterOp::create(rewriter, loc, scatterOp.getValueToStore(),
-                                  flatMemref, localOffsets, scatterOp.getMask(),
-                                  /*chunk_size=*/IntegerAttr{},
-                                  /*l1_hint=*/xegpu::CachePolicyAttr{},
-                                  /*l2_hint=*/xegpu::CachePolicyAttr{},
-                                  /*l3_hint=*/xegpu::CachePolicyAttr{});
+    auto numOffsets = scatterOp.getOffsets().size();
+    auto layoutIndices = mlir::xegpu::getDistributeLayoutAttr(
+        scatterOp->getOpOperand(numOffsets + 1));
+    auto layoutMask = mlir::xegpu::getDistributeLayoutAttr(
+        scatterOp->getOpOperand(numOffsets + 2));
+    auto layoutVal = mlir::xegpu::getDistributeLayoutAttr(
+        scatterOp->getOpOperand(numOffsets + 3));
+    SmallVector<xegpu::CachePolicyAttr, 3> cacheHints{xegpu::CachePolicyAttr{},
+                                                      xegpu::CachePolicyAttr{},
+                                                      xegpu::CachePolicyAttr{}};
+    getOpCacheHints(scatterOp, cacheHints);
+    auto storeOp = xegpu::StoreScatterOp::create(
+        rewriter, loc, scatterOp.getValueToStore(), flatMemref, localOffsets,
+        scatterOp.getMask(),
+        /*chunk_size=*/IntegerAttr{},
+        /*l1_hint=*/cacheHints[0],
+        /*l2_hint=*/cacheHints[1],
+        /*l3_hint=*/cacheHints[2]);
+    mlir::xegpu::setDistributeLayoutAttr(storeOp->getOpOperand(0), layoutVal);
+    mlir::xegpu::setDistributeLayoutAttr(storeOp->getOpOperand(2),
+                                         layoutIndices);
+    mlir::xegpu::setDistributeLayoutAttr(storeOp->getOpOperand(3), layoutMask);
     rewriter.eraseOp(scatterOp);
     return success();
   }
diff --git a/mlir/test/Conversion/VectorToXeGPU/gather-to-xegpu.mlir b/mlir/test/Conversion/VectorToXeGPU/gather-to-xegpu.mlir
index 2a319869a7b06..cb4cef9724da8 100644
--- a/mlir/test/Conversion/VectorToXeGPU/gather-to-xegpu.mlir
+++ b/mlir/test/Conversion/VectorToXeGPU/gather-to-xegpu.mlir
@@ -249,3 +249,188 @@ gpu.func @non_unit_inner_stride_3D(
 // CHECK:        %[[RES:.+]] = arith.select %[[MASK]], %[[V]], %[[PASS]] : vector<8xi1>, vector<8xf32>
 // CHECK:        gpu.return %[[RES]] : vector<8xf32>
 }
+
+// -----
+
+gpu.module @xevm_module {
+// Layouts are only specified for the gather op itself.
+gpu.func @load_dynamic_layout_operands(%source: memref<?x?xf32>,
+    %off0: index, %off1: index,
+    %indices: vector<8x16xindex>, %mask: vector<8x16xi1>,
+    %pass_thru: vector<8x16xf32>) -> vector<8x16xf32> {
+  %res = vector.gather %source[%off0, %off1][%indices], %mask,
+       %pass_thru {
+        layout_result_0 = #xegpu.layout<sg_layout = [0]>,
+        layout_operand_3 = #xegpu.layout<sg_layout = [1]>,
+        layout_operand_4 = #xegpu.layout<sg_layout = [2]>,
+        layout_operand_5 = #xegpu.layout<sg_layout = [3]>
+  } : memref<?x?xf32>, vector<8x16xindex>, vector<8x16xi1>, vector<8x16xf32> into vector<8x16xf32>
+  gpu.return %res : vector<8x16xf32>
+}
+// CHECK-LABEL:  @load_dynamic_layout_operands(
+// CHECK-SAME:   %[[SRC:.+]]: memref<?x?xf32>,
+// CHECK-SAME:   %[[OFF1:.+]]: index, %[[OFF2:.+]]: index,
+// CHECK-SAME:   %[[INDICES:.+]]: vector<8x16xindex>, %[[MASK:.+]]: vector<8x16xi1>, %[[PASS:.+]]: vector<8x16xf32>) -> vector<8x16xf32> {
+// %indices producer doesn't have a layout, so as 'broadcast/add' ops computing linear index.
+// CHECK:        %[[SPLAT:.+]] = vector.broadcast {{.*}} :  index to vector<8x16xindex>
+// CHECK:        %[[LIN_IDX:.+]] = arith.addi %[[SPLAT]], {{.*}} : vector<8x16xindex>
+// CHECK:        %[[VEC:.+]] = xegpu.load %[[BASE_I64:.+]]{{\[}}%[[LIN_IDX]]{{\]}}, %[[MASK]]
+// CHECK-SAME:   {layout_operand_1 = #xegpu.layout<sg_layout = [1]>, layout_operand_2 = #xegpu.layout<sg_layout = [2]>,
+// CHECK-SAME:   layout_result_0 = #xegpu.layout<sg_layout = [0]>}
+// CHECK:        %[[RES:.+]] = arith.select {{[^{]*}}
+// CHECK-SAME:   {{{[^}]*}}layout_operand_0 = #xegpu.layout<sg_layout = [2]>,
+// CHECK-SAME:   {{[^}]*}}layout_operand_2 = #xegpu.layout<sg_layout = [3]>,
+// CHECK-SAME:   {{[^}]*}}layout_result_0 = #xegpu.layout<sg_layout = [0]>} : vector<8x16xi1>, vector<8x16xf32>
+}
+
+// -----
+
+gpu.module @xevm_module {
+gpu.func @load_dynamic_layout_mixed(%source: memref<?x?x?xf32>,
+    %off0: index, %off1: index, %off2: index,
+    %mask: vector<8x16xi1>) -> vector<8x16xf32> {
+  %pass_thru = arith.constant {layout_result_0 = #xegpu.layout<sg_layout = [0]>} dense<0.000000e+00> : vector<8x16xf32>
+  %cst_1 = arith.constant {layout_result_0 = #xegpu.layout<sg_layout = [1]>} dense<[[0], [32], [64], [96], [128], [160], [192], [224]]> : vector<8x1xindex>
+  %cst_2 = arith.constant {layout_result_0 = #xegpu.layout<sg_layout = [2]>} dense<[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]]> : vector<1x16xindex>
+  %0 = vector.broadcast %cst_1 {layout_result_0 = #xegpu.layout<sg_layout = [3]>} : vector<8x1xindex> to vector<8x16xindex>
+  %1 = vector.broadcast %cst_2 {layout_result_0 = #xegpu.layout<sg_layout = [4]>} : vector<1x16xindex> to vector<8x16xindex>
+  %2 = arith.addi %0, %1 {layout_result_0 = #xegpu.layout<sg_layout = [5]>} : vector<8x16xindex>
+
+  %res = vector.gather %source[%off0, %off1, %off2][%2], %mask,
+       %pass_thru {
+        layout_result_0 = #xegpu.layout<sg_layout = [6]>,
+        layout_operand_5 = #xegpu.layout<sg_layout = [7]>
+  } : memref<?x?x?xf32>, vector<8x16xindex>, vector<8x16xi1>, vector<8x16xf32> into vector<8x16xf32>
+  %res2 = arith.addf %res, %pass_thru : vector<8x16xf32>
+  gpu.return %res2 : vector<8x16xf32>
+}
+// CHECK-LABEL:  @load_dynamic_layout_mixed(
+// CHECK-SAME:   %[[SRC:.+]]: memref<?x?x?xf32>,
+// CHECK-SAME:   %[[OFF1:.+]]: index, %[[OFF2:.+]]: index, %[[OFF3:.+]]: index,
+// CHECK-SAME:   %[[MASK:.+]]: vector<8x16xi1>) -> vector<8x16xf32> {
+// CHECK:        %[[PASS_THRU:.+]] = arith.constant {layout_result_0 = #xegpu.layout<sg_layout = [0]>} dense<0.000000e+00> : vector<8x16xf32>
+// Verify that linear-indices computation uses layout from the 'indices' producer op (%2).
+// CHECK:        %[[SPLAT:.+]] = vector.broadcast {{.*}} {layout_result_0 = #xegpu.layout<sg_layout = [5]>} :  index to vector<8x16xindex>
+// CHECK:        %[[LIN_IDX:.+]] = arith.addi %[[SPLAT]], {{.*}} {layout_result_0 = #xegpu.layout<sg_layout = [5]>} : vector<8x16xindex>
+// CHECK:        %[[VEC:.+]] = xegpu.load %[[BASE_I64:.+]]{{\[}}%[[LIN_IDX]]{{\]}}, %[[MASK]]
+// CHECK-SAME:   {{{[^}]*}}layout_operand_2 = #xegpu.layout<sg_layout = [7]>
+// CHECK-SAME:   {{[^}]*}}layout_result_0 = #xegpu.layout<sg_layout = [6]>}
+// CHECK:        %[[RES:.+]] = arith.select {{[^{]*}}
+// CHECK-SAME:   {{{[^}]*}}layout_operand_0 = #xegpu.layout<sg_layout = [7]>,
+// CHECK-SAME:   {{[^}]*}}layout_result_0 = #xegpu.layout<sg_layout = [6]>} : vector<8x16xi1>, vector<8x16xf32>
+}
+
+
+// -----
+
+gpu.module @xevm_module {
+gpu.func @load_static_layout_mixed(%source: memref<8x16x32xf32>,
+    %off0: index, %off1: index, %off2: index,
+    %mask: vector<8x16xi1>) -> vector<8x16xf32> {
+  %pass_thru = arith.constant {layout_result_0 = #xegpu.layout<sg_layout = [0]>} dense<0.000000e+00> : vector<8x16xf32>
+  %cst_1 = arith.constant {layout_result_0 = #xegpu.layout<sg_layout = [1]>} dense<[[0], [32], [64], [96], [128], [160], [192], [224]]> : vector<8x1xindex>
+  %cst_2 = arith.constant {layout_result_0 = #xegpu.layout<sg_layout = [2]>} dense<[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]]> : vector<1x16xindex>
+  %0 = vector.broadcast %cst_1 {layout_result_0 = #xegpu.layout<sg_layout = [3]>} : vector<8x1xindex> to vector<8x16xindex>
+  %1 = vector.broadcast %cst_2 {layout_result_0 = #xegpu.layout<sg_layout = [4]>} : vector<1x16xindex> to vector<8x16xindex>
+  %2 = arith.addi %0, %1 {layout_result_0 = #xegpu.layout<sg_layout = [5]>} : vector<8x16xindex>
+
+  %res = vector.gather %source[%off0, %off1, %off2][%2], %mask,
+       %pass_thru {
+        layout_result_0 = #xegpu.layout<sg_layout = [6]>,
+        layout_operand_5 = #xegpu.layout<sg_layout = [7]>
+  } : memref<8x16x32xf32>, vector<8x16xindex>, vector<8x16xi1>, vector<8x16xf32> into vector<8x16xf32>
+  %res2 = arith.addf %res, %pass_thru : vector<8x16xf32>
+  gpu.return %res2 : vector<8x16xf32>
+}
+// CHECK-LABEL:  @load_static_layout_mixed(
+// CHECK-SAME:   %[[SRC:.+]]: memref<8x16x32xf32>,
+// CHECK-SAME:   %[[OFF1:.+]]: index, %[[OFF2:.+]]: index, %[[OFF3:.+]]: index,
+// CHECK-SAME:   %[[MASK:.+]]: vector<8x16xi1>) -> vector<8x16xf32> {
+// CHECK:        %[[PASS_THRU:.+]] = arith.constant {layout_result_0 = #xegpu.layout<sg_layout = [0]>} dense<0.000000e+00> : vector<8x16xf32>
+// Verify that linear-indices computation uses layout from the 'indices' producer op (%2).
+// CHECK:        %[[SPLAT:.+]] = vector.broadcast {{.*}} {layout_result_0 = #xegpu.layout<sg_layout = [5]>} :  index to vector<8x16xindex>
+// CHECK:        %[[LIN_IDX:.+]] = arith.addi %[[SPLAT]], {{.*}} {layout_result_0 = #xegpu.layout<sg_layout = [5]>} : vector<8x16xindex>
+// CHECK:        %[[VEC:.+]] = xegpu.load %[[BASE_I64:.+]]{{\[}}%[[LIN_IDX]]{{\]}}, %[[MASK]]
+// CHECK-SAME:   {{{[^}]*}}layout_operand_2 = #xegpu.layout<sg_layout = [7]>
+// CHECK-SAME:   {{[^}]*}}layout_result_0 = #xegpu.layout<sg_layout = [6]>}
+// CHECK:        %[[RES:.+]] = arith.select {{[^{]*}}
+// CHECK-SAME:   {{{[^}]*}}layout_operand_0 = #xegpu.layout<sg_layout = [7]>,
+// CHECK-SAME:   {{[^}]*}}layout_result_0 = #xegpu.layout<sg_layout = [6]>} : vector<8x16xi1>, vector<8x16xf32>
+}
+
+// -----
+
+gpu.module @xevm_module {
+gpu.func @load_dynamic_layout_mixed_override(%source: memref<?x?x?xf32>,
+    %off0: index, %off1: index, %off2: index,
+    %mask: vector<8x16xi1>) -> vector<8x16xf32> {
+  %pass_thru = arith.constant {layout_result_0 = #xegpu.layout<sg_layout = [0]>} dense<0.000000e+00> : vector<8x16xf32>
+  %cst_1 = arith.constant {layout_result_0 = #xegpu.layout<sg_layout = [1]>} dense<[[0], [32], [64], [96], [128], [160], [192], [224]]> : vector<8x1xindex>
+  %cst_2 = arith.constant {layout_result_0 = #xegpu.layout<sg_layout = [2]>} dense<[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]]> : vector<1x16xindex>
+  %0 = vector.broadcast %cst_1 {layout_result_0 = #xegpu.layout<sg_layout = [3]>} : vector<8x1xindex> to vector<8x16xindex>
+  %1 = vector.broadcast %cst_2 {layout_result_0 = #xegpu.layout<sg_layout = [4]>} : vector<1x16xindex> to vector<8x16xindex>
+  %2 = arith.addi %0, %1 {layout_result_0 = #xegpu.layout<sg_layout = [5]>} : vector<8x16xindex>
+
+  %res = vector.gather %source[%off0, %off1, %off2][%2], %mask,
+       %pass_thru {
+        layout_result_0 = #xegpu.layout<sg_layout = [6]>,
+        layout_operand_4 = #xegpu.layout<sg_layout = [99]>, // overriding %2's layout
+        layout_operand_5 = #xegpu.layout<sg_layout = [7]>
+  } : memref<?x?x?xf32>, vector<8x16xindex>, vector<8x16xi1>, vector<8x16xf32> into vector<8x16xf32>
+  %res2 = arith.addf %res, %pass_thru : vector<8x16xf32>
+  gpu.return %res2 : vector<8x16xf32>
+}
+// CHECK-LABEL:  @load_dynamic_layout_mixed_override(
+// CHECK-SAME:   %[[SRC:.+]]: memref<?x?x?xf32>,
+// CHECK-SAME:   %[[OFF1:.+]]: index, %[[OFF2:.+]]: index, %[[OFF3:.+]]: index,
+// CHECK-SAME:   %[[MASK:.+]]: vector<8x16xi1>) -> vector<8x16xf32> {
+// CHECK:        %[[PASS_THRU:.+]] = arith.constant {layout_result_0 = #xegpu.layout<sg_layout = [0]>} dense<0.000000e+00> : vector<8x16xf32>
+// Verify that linear-indices computation uses layout from the 'indices' producer op (%2)
+// and not it's overriden version from the scatter_op (sg_layout = [99])
+// CHECK:        %[[SPLAT:.+]] = vector.broadcast {{.*}} {layout_result_0 = #xegpu.layout<sg_layout = [5]>} :  index to vector<8x16xindex>
+// CHECK:        %[[LIN_IDX:.+]] = arith.addi %[[SPLAT]], {{.*}} {layout_result_0 = #xegpu.layout<sg_layout = [5]>} : vector<8x16xindex>
+// CHECK:        %[[VEC:.+]] = xegpu.load %[[BASE_I64:.+]]{{\[}}%[[LIN_IDX]]{{\]}}, %[[MASK]]
+// CHECK-SAME:   {layout_operand_1 = #xegpu.layout<sg_layout = [99]>, layout_operand_2 = #xegpu.layout<sg_layout = [7]>
+// CHECK-SAME:   {{[^}]*}}layout_result_0 = #xegpu.layout<sg_layout = [6]>}
+// CHECK:        %[[RES:.+]] = arith.select {{[^{]*}}
+// CHECK-SAME:   {{{[^}]*}}layout_operand_0 = #xegpu.layout<sg_layout = [7]>,
+// CHECK-SAME:   {{[^}]*}}layout_result_0 = #xegpu.layout<sg_layout = [6]>} : vector<8x16xi1>, vector<8x16xf32>
+}
+
+// -----
+
+gpu.module @xevm_module {
+gpu.func @load_with_cache_hints(%source: memref<8x16x32xf32>,
+     %off1: index, %off2: index, %off3: index,
+     %indices: vector<8xindex>, %mask: vector<8xi1>,
+     %pass_thru: vector<8xf32>) -> vector<8xf32> {
+  %0 = vector.gather %source[%off1, %off2, %off3][%indices], %mask,
+       %pass_thru {
+        l1_hint = #xegpu.cache_hint<cached>, l2_hint = #xegpu.cache_hint<uncached>,
+        l3_hint = #xegpu.cache_hint<streaming>
+  } : memref<8x16x32xf32>, vector<8xindex>, vector<8xi1>, vector<8xf32> into vector<8xf32>
+  gpu.return %0 : vector<8xf32>
+}
+// CHECK-LABEL:  @load_with_cache_hints(
+// CHECK:        xegpu.load {{[^<]*}}
+// CHECK-SAME:   <{l1_hint = #xegpu.cache_hint<cached>, l2_hint = #xegpu.cache_hint<uncached>, l3_hint = #xegpu.cache_hint<streaming>}>
+}
+
+// -----
+
+gpu.module @xevm_module {
+gpu.func @load_with_partial_cache_hints(%source: memref<8x16x32xf32>,
+     %off1: index, %off2: index, %off3: index,
+     %indices: vector<8xindex>, %mask: vector<8xi1>,
+     %pass_thru: vector<8xf32>) -> vector<8xf32> {
+  %0 = vector.gather %source[%off1, %off2, %off3][%indices], %mask,
+       %pass_thru {
+        l1_hint = #xegpu.cache_hint<cached>,
+        l3_hint = #xegpu.cache_hint<streaming>
+  } : memref<8x16x32xf32>, vector<8xindex>, vector<8xi1>, vector<8xf32> into vector<8xf32>
+  gpu.return %0 : vector<8xf32>
+}
+// CHECK-LABEL:  @load_with_partial_cache_hints(
+// CHECK:        xegpu.load {{[^<]*}}
+// CHECK-SAME:   <{l1_hint = #xegpu.cache_hint<cached>, l3_hint = #xegpu.cache_hint<streaming>}>
+}
diff --git a/mlir/test/Conversion/VectorToXeGPU/scatter-to-xegpu.mlir b/mlir/test/Conversion/VectorToXeGPU/scatter-to-xegpu.mlir
index ffd3f170c0fad..9393d88e0f54c 100644
--- a/mlir/test/Conversion/VectorToXeGPU/scatter-to-xegpu.mlir
+++ b/mlir/test/Conversion/VectorToXeGPU/scatter-to-xegpu.mlir
@@ -204,3 +204,152 @@ gpu.func @scatter_into_subview(%vals: vector<8xf16>,
 // CHECK:        xegpu.store %[[VALS]], %[[BASE_I64]]{{\[}}%[[LIN]]{{\]}}, %[[MASK]] : vector<8xf16>, i64, vector<8xindex>, vector<8xi1>
 // CHECK:        gpu.return
 }
+
+// -----
+
+gpu.module @xevm_module {
+gpu.func @store_dynamic_layout_operands(%vec: vector<8x16xf32>, %source: memref<?x?xf32>,
+    %off0: index, %off1: index,...
[truncated]

mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp

adam-smnk · 2025-10-13T10:32:18Z

mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp


    auto selectOp =
        arith::SelectOp::create(rewriter, loc, gatherOp.getMask(),
                                xeGatherOp.getResult(), gatherOp.getPassThru());


Just to double-check, the layout isn't assigned to the second operand (LoadGather) as it's already in the producer's result.
I assume it's left to the propagation to fill the gap? Any drawbacks in assigning layout in both places?

Just to double-check, the layout isn't assigned to the second operand (LoadGather) as it's already in the producer's result

Right, I thought that's enough. There seems to be no drawbacks though from assigning layout in both places. Applied layout in both places in the last commit just in case

setting the operand layout is done at the client of layout propagation result. So I think there is no need to update layout_operand[_] here

Is it always the case? Say, the value to store comes as a function's argument (our func is not inlined yet) and it's impossible to determine the producer but it's possible to determine the layout from {layout_operand_i} attribute. Should we support such cases? Are there any negative side-effects from setting a layout_operand_i even if we can access the layout from layout_result_i of the value producer?

Signed-off-by: Dmitry Chigarev <[email protected]>

adam-smnk

Minor suggestion
LGTM but I'll let sb else double check the layout assignments.

mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp

Signed-off-by: dchigarev <[email protected]>

Jianhui-Li

ignore this comment. still reviewing.

Jianhui-Li

Can we extend the PR to cover transfer_read/transfer_write to xegpu.load_nd/store_nd also? You only need to carry cache hint and layout attribute to anchor ops, and you don't need to extend op definition since the layout needs to set to the tensor descriptor operand of load_nd/store_nd.

Jianhui-Li · 2025-10-13T19:45:32Z

mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp

+  // Extract indices layout and propagate it to all 'vector' ops created here
+  auto indicesLayout = mlir::xegpu::getDistributeLayoutAttr(indices);
  VectorType vecType = cast<VectorType>(indices.getType());


We should use the propagation pass to properly set up the layout attributes for non-anchor ops.

Jianhui-Li · 2025-10-13T19:59:58Z

mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp

+        /*l1_hint=*/cacheHints[0],
+        /*l2_hint=*/cacheHints[1],
+        /*l3_hint=*/cacheHints[2]);
+    xegpu::setDistributeLayoutAttr(xeGatherOp->getOpResult(0), layoutRes);


We should expand xegpu.load/store with optional attribute DistributeLayoutAttr (like loadmatrix op definition), and set the operation's attribute only. The propagation process will propagate it to other ops.

Jianhui-Li · 2025-10-13T20:07:28Z

mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp

@@ -616,16 +636,34 @@ struct GatherLowering : public OpRewritePattern<vector::GatherOp> {
        computeOffsets(rewriter, gatherOp, meta.first, meta.second);
    Value flatMemref = memrefToIndexPtr(gatherOp, rewriter);

+    auto layoutRes = xegpu::getDistributeLayoutAttr(gatherOp.getResult());


All three layouts should be exactly the same.
We can take the layout from Result vector, and set to the store op's layout attribute (need to expand the op definition to make the attribute persistent, not lost after optimization passes).

I found at least one scenario in our test cases where not all operands of xegpu.store have the same layout.

It also looks quite confusing in case of xegpu.load to have mixed layout_operand_* and layout attribute:

%3 = xegpu.load %src[%offset], %1 <{chunk_size=8}> { layout_operand_1 = #xegpu.layout<lane_layout = [16], lane_data = [1]>, layout = #xegpu.layout< lane_layout = [16, 1], lane_data = [1, 2]> // layout for what?? } : memref<256xf16>, vector<16xindex>, vector<16xi1> -> vector<16x8xf16> // ----- %3 = xegpu.load %src[%offset], %1 <{chunk_size=8}> { layout_operand_1 = #xegpu.layout<lane_layout = [16], lane_data = [1]>, layout_result_0 = #xegpu.layout< lane_layout = [16, 1], lane_data = [1, 2]> // the purpose of the layout is very clear } : memref<256xf16>, vector<16xindex>, vector<16xi1> -> vector<16x8xf16>

Given that, I don't think we need to have a separate layout attribute for xegpu load/store ops (or we have to rethink it once again, and make it not like load_matrix/store_matrix layout attribute).

At the entry XeGPU IR, we need to set the layout to anchor op, and rely on the propagation pass to propagate the layout to all values. The anchor op include: load/store, load_matrix/store_matrix, load_nd/store_nd, and dpas, convert_layout. The layout describes the layout for the tensor tile: for memory operation, it describes the memory operand (like memref, mem_desc, nd_tdesc), for dpas, it describes the vector operands.

The layout propagated are temporarily attributes and can be lost during IR transformation or lowering. But the anchor op defined layout attributes are permanents and we rely on these attributes being set up properly so able to recover in case user compose xegpu passes with their own passes.

The example you show above actually looks fine to me, as long as the propogation rules decides to drop one dimension for the mask and index and keep the tensor tile layout as 2d. It does look a bit complex, but I don't expect this kind of layout is exposed to user since chunkload should be lower-level representation for memory coalesce or load_matrix lowering.

So I still suggest changing this PR and just set the layout attribute according to the tensor tile's layout for anchor ops only and remove the attributes setting for other ops and operands.

charithaintc

LGTM % existing comments

charithaintc · 2025-10-13T23:54:42Z

mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp


    auto selectOp =
        arith::SelectOp::create(rewriter, loc, gatherOp.getMask(),
                                xeGatherOp.getResult(), gatherOp.getPassThru());


setting the operand layout is done at the client of layout propagation result. So I think there is no need to update layout_operand[_] here

charithaintc · 2025-10-13T23:55:58Z

mlir/test/Conversion/VectorToXeGPU/gather-to-xegpu.mlir

+// CHECK:        %[[SPLAT:.+]] = vector.broadcast {{.*}} {layout_result_0 = #xegpu.layout<sg_layout = [5]>} :  index to vector<8x16xindex>
+// CHECK:        %[[LIN_IDX:.+]] = arith.addi %[[SPLAT]], {{.*}} {layout_result_0 = #xegpu.layout<sg_layout = [5]>} : vector<8x16xindex>
+// CHECK:        %[[VEC:.+]] = xegpu.load %[[BASE_I64:.+]]{{\[}}%[[LIN_IDX]]{{\]}}, %[[MASK]]
+// CHECK-SAME:   {{{[^}]*}}layout_operand_2 = #xegpu.layout<sg_layout = [7]>


similar to my previous comment operand layouts are not needed.

akroviakov · 2025-10-14T17:37:41Z

mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp


+    auto layoutRes = xegpu::getDistributeLayoutAttr(gatherOp.getResult());
+    auto layoutIndices =
+        xegpu::getDistributeLayoutAttr(gatherOp.getIndicesMutable());


Any reason to use mutable getters?

otherwise it won't get layout_operand_i assigned to this operand

Signed-off-by: dchigarev <[email protected]>

dchigarev · 2025-10-16T14:39:42Z

Can we extend the PR to cover transfer_read/transfer_write to xegpu.load_nd/store_nd also? You only need to carry cache hint and layout attribute to anchor ops, and you don't need to extend op definition since the layout needs to set to the tensor descriptor operand of load_nd/store_nd.

@Jianhui-Li added layout/cache hints propagation for the rest of the ops in VectorToXeGPU pass

dchigarev added 3 commits October 12, 2025 13:00

[mlir][XeGPU][VectorToXeGPU] Propagate vector layouts to xegpu ops

a22b251

Signed-off-by: dchigarev <[email protected]>

add cache hint propagation

76c8129

Signed-off-by: dchigarev <[email protected]>

Add in-code comments

fdb0540

Signed-off-by: dchigarev <[email protected]>

dchigarev commented Oct 12, 2025

View reviewed changes

dchigarev marked this pull request as ready for review October 12, 2025 14:20

llvmbot added mlir:gpu mlir labels Oct 12, 2025

adam-smnk reviewed Oct 13, 2025

View reviewed changes

mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp Outdated Show resolved Hide resolved

mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp Outdated Show resolved Hide resolved

adam-smnk reviewed Oct 13, 2025

View reviewed changes

adam-smnk requested review from Jianhui-Li and charithaintc October 13, 2025 10:32

apply review suggestions

79e37d8

Signed-off-by: Dmitry Chigarev <[email protected]>

adam-smnk reviewed Oct 13, 2025

View reviewed changes

mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp Outdated Show resolved Hide resolved

use 'op.get*Mutable()' when getting layout attrs

3afe5d5

Signed-off-by: dchigarev <[email protected]>

dchigarev force-pushed the dchigarev/xegpu-carry-layouts branch from 1a160b8 to 3afe5d5 Compare October 13, 2025 17:18

Jianhui-Li reviewed Oct 13, 2025

View reviewed changes

charithaintc reviewed Oct 13, 2025

View reviewed changes

charithaintc approved these changes Oct 13, 2025

View reviewed changes

dchigarev mentioned this pull request Oct 14, 2025

[mlir][XeGPU] Add optional layout attribute to LoadGather StoreScatter ops #163414

Draft

akroviakov reviewed Oct 14, 2025

View reviewed changes

dchigarev added 2 commits October 16, 2025 13:37

only set layouts for anchor ops

2a43ee6

Signed-off-by: dchigarev <[email protected]>

Propagate layouts/cache hints for transfer_read/write/load/store

e0cf57b

Signed-off-by: dchigarev <[email protected]>

dchigarev requested a review from Jianhui-Li October 16, 2025 16:36

[mlir][XeGPU][VectorToXeGPU] Propagate vector layouts to xegpu ops #163071

Are you sure you want to change the base?

[mlir][XeGPU][VectorToXeGPU] Propagate vector layouts to xegpu ops #163071

Uh oh!

Conversation

dchigarev commented Oct 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

llvmbot commented Oct 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adam-smnk left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Jianhui-Li left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Jianhui-Li left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

charithaintc left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dchigarev commented Oct 16, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

dchigarev commented Oct 12, 2025 •

edited

Loading

llvmbot commented Oct 12, 2025 •

edited

Loading

Jianhui-Li left a comment •

edited

Loading