@silee2 silee2 commented Sep 11, 2025

Add op definitions for the subgroup block load and store ops,
xevm.blockload and xevm.blockstore.

llvmbot commented Sep 11, 2025

@llvm/pr-subscribers-mlir-llvm

@llvm/pr-subscribers-mlir

Author: Sang Ik Lee (silee2)

Changes

Add op definition for subgroup block load and store ops:
xevm.blockload and xevm.blockstore


Full diff: https://github.com/llvm/llvm-project/pull/158118.diff

2 Files Affected:

  • (modified) mlir/include/mlir/Dialect/LLVMIR/XeVMOps.td (+72)
  • (modified) mlir/test/Dialect/LLVMIR/xevm.mlir (+23)
diff --git a/mlir/include/mlir/Dialect/LLVMIR/XeVMOps.td b/mlir/include/mlir/Dialect/LLVMIR/XeVMOps.td
index f457f47d56219..5b7814c37bbd1 100644
--- a/mlir/include/mlir/Dialect/LLVMIR/XeVMOps.td
+++ b/mlir/include/mlir/Dialect/LLVMIR/XeVMOps.td
@@ -187,6 +187,78 @@ def XeVM_StoreCacheControlAttr
   let assemblyFormat = "`<` $value `>`";
 }
 
+def XeVM_BlockLoadOp
+    : XeVM_Op<"blockload">,
+      Results<(outs FixedVectorOfRankAndType<[1], [XeVM_ElemType]>:$res)>,
+      Arguments<(ins Arg<LLVM_AnyPointer, "", [MemRead]>:$ptr,
+          OptionalAttr<XeVM_LoadCacheControlAttr>:$cache_control)> {
+  let summary = "subgroup block load";
+  let description = [{
+    Reads one or more components of Result data for each invocation
+    in the subgroup from the specified `ptr` as a block operation.
+    The data is read strided, so the first value read is:
+    ```
+      ptr[ SubgroupLocalInvocationId ]
+    ```
+    and the second value read is:
+    ```
+      ptr[ SubgroupLocalInvocationId + SubgroupMaxSize ]
+    ```
+    Result type may be a scalar or vector type of scalar element type.
+
+    The parameters are:
+      * `ptr` - the base address to load from
+      * `cache_control` - an enumerator that sets the cache behaviour
+
+    Example:
+    ```mlir
+      %loaded_a = xevm.blockload %src
+                      <{cache_control=#xevm.load_cache_control<L1uc_L2uc_L3uc>}>
+                    : (!llvm.ptr<1>) -> vector<4xi16>
+    ```
+  }];
+  let assemblyFormat = [{
+    operands prop-dict attr-dict `:` functional-type(operands, results)
+  }];
+}
+
+def XeVM_BlockStoreOp
+    : XeVM_Op<"blockstore">,
+      Arguments<(ins Arg<LLVM_AnyPointer, "", [MemWrite]>:$ptr,
+          FixedVectorOfRankAndType<[1], [XeVM_ElemType]>:$val,
+          OptionalAttr<XeVM_StoreCacheControlAttr>:$cache_control)> {
+  let summary = "subgroup block store";
+  let description = [{
+    Writes one or more components of `val` for each invocation
+    in the subgroup to the specified `ptr` as a block operation.
+    The data is written strided, so the first value is written to:
+    ```
+      ptr[ SubgroupLocalInvocationId ]
+    ```
+    and the second value is written to:
+    ```
+      ptr[ SubgroupLocalInvocationId + SubgroupMaxSize ]
+    ```
+    `val` type may be a scalar or vector type of scalar element type.
+
+    The parameters are:
+      * `ptr` - the base address to store to
+      * `val` - the value to store
+      * `cache_control` - an enumerator that sets the cache behaviour
+
+    Example:
+    ```mlir
+      xevm.blockstore %ptr, %val
+        <{cache_control=#xevm.store_cache_control<L1uc_L2uc_L3uc>}>
+        : (!llvm.ptr<1>, vector<4xi16>)
+    ```
+  }];
+
+  let assemblyFormat = [{
+    operands prop-dict attr-dict `:` `(` type(operands) `)`
+  }];
+}
+
 def XeVM_BlockLoad2dOp
     : XeVM_Op<"blockload2d">,
       Results<(outs FixedVectorOfRankAndType<[1], [XeVM_ElemType]>:$res)>,
diff --git a/mlir/test/Dialect/LLVMIR/xevm.mlir b/mlir/test/Dialect/LLVMIR/xevm.mlir
index 3dd5f872f898c..bb1f650a1cd12 100644
--- a/mlir/test/Dialect/LLVMIR/xevm.mlir
+++ b/mlir/test/Dialect/LLVMIR/xevm.mlir
@@ -58,6 +58,29 @@ func.func @blockprefetch2d(%ptr: !llvm.ptr<1>, %base_width: i32, %base_height: i
   return
 }
 
+// -----
+// CHECK-LABEL: func.func @blockload(
+// CHECK-SAME: %[[ARG0:.*]]: !llvm.ptr<1>)
+func.func @blockload(%ptr: !llvm.ptr<1>) -> vector<4xi16> {
+  // CHECK: %[[VAR0:.*]] = xevm.blockload %[[ARG0]]
+  // CHECK-SAME: cache_control = #xevm.load_cache_control<L1uc_L2uc_L3uc>
+  // CHECK-SAME: (!llvm.ptr<1>) -> vector<4xi16>
+  %loaded = xevm.blockload %ptr <{cache_control=#xevm.load_cache_control<L1uc_L2uc_L3uc>}>
+              : (!llvm.ptr<1>) -> vector<4xi16>
+  return %loaded : vector<4xi16>
+}
+
+// -----
+// CHECK-LABEL: func.func @blockstore(
+// CHECK-SAME: %[[ARG0:.*]]: !llvm.ptr<1>,
+// CHECK-SAME: %[[ARG1:.*]]: vector<4xi32>)
+func.func @blockstore(%ptr: !llvm.ptr<1>, %value: vector<4xi32>) {
+  // CHECK: xevm.blockstore %[[ARG0]], %[[ARG1]]
+  // CHECK-SAME: (!llvm.ptr<1>, vector<4xi32>)
+  xevm.blockstore %ptr, %value : (!llvm.ptr<1>, vector<4xi32>)
+  return
+}
+
 // -----
 // CHECK-LABEL: func.func @mma(
 // CHECK-SAME: %[[ARG0:.*]]: vector<8xf32>, %[[ARG1:.*]]: vector<8xi16>, %[[ARG2:.*]]: vector<8xi32>)

Result type may be a scalar or vector type of scalar element type.

The parameters are:
* `ptr` - the base address to load from

Contributor

How about adding that it must be uniform across the subgroup?

Contributor Author

Updated description.

silee2 commented Sep 15, 2025

@charithaintc charithaintc left a comment

LGTM

vTy = op.getVal().getType();
int elemTySize = vTy.getElementType().getIntOrFloatBitWidth() / 8;
if (elemTySize == 1) {
  llvm::SmallSet<int, 5> validSizes{1, 2, 4, 8, 16};

Contributor

nit: This seems target-specific. Add a TODO or move it to a dedicated location for HW specifics.

Contributor Author

The restrictions are not specific to a target architecture or chip; they come from the Intel OpenCL / SPIR-V extensions.
In that sense, they apply to all Intel HW and are not target-specific.

Contributor Author

I added links to the related specs above.
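
One way to read the "dedicated location" suggestion is to gather the size restrictions into a single lookup. The sketch below does that for the only case visible in the quoted snippet: 1-byte elements with vector sizes 1, 2, 4, 8, or 16. `checkBlockVectorSize` and `SizeCheck` are hypothetical names, and `std::set` stands in for `llvm::SmallSet` so the example compiles standalone. Restrictions for wider element types are not shown in this thread, so the sketch reports them as unknown rather than guessing.

```cpp
#include <cassert>
#include <set>

// Hypothetical result type, not part of the XeVM dialect.
enum class SizeCheck { Valid, Invalid, Unknown };

// Hypothetical helper mirroring the quoted verifier fragment.
// elemBitWidth: bit width of the vector element type (e.g. 8 for i8).
// numElems:     number of elements in the per-lane vector.
SizeCheck checkBlockVectorSize(int elemBitWidth, int numElems) {
  const int elemTySize = elemBitWidth / 8; // size in bytes, as in the snippet
  if (elemTySize == 1) {
    // The only rule visible in the quoted snippet.
    static const std::set<int> validSizes{1, 2, 4, 8, 16};
    return validSizes.count(numElems) ? SizeCheck::Valid : SizeCheck::Invalid;
  }
  // Rules for wider elements are not shown in this thread; do not guess.
  return SizeCheck::Unknown;
}
```

Keeping the table in one function like this would make it easy to relocate later if the restrictions turn out to vary by target.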

// CHECK-SAME: cache_control = #xevm.load_cache_control<L1uc_L2uc_L3uc>
// CHECK-SAME: (!llvm.ptr<1>) -> vector<4xi16>
%loaded = xevm.blockload %ptr <{cache_control=#xevm.load_cache_control<L1uc_L2uc_L3uc>}>
: (!llvm.ptr<1>) -> vector<4xi16>

Contributor

Why is the output not a multiple of the SG size?

Contributor Author

The output is distributed across work-item lanes.
The vector size is the number of elements gathered per work-item lane.
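
To make the lane distribution concrete, here is a small host-side C++ model of the strided addressing given in the op description (`ptr[SubgroupLocalInvocationId + i * SubgroupMaxSize]`). It is only a sketch: `simulateBlockLoad`, `simulateBlockStore`, and the explicit `subgroupSize` parameter are hypothetical and are not part of XeVM or this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative model of a subgroup block load.
// result[lane][i] holds the i-th element that `lane` receives, read from
//   mem[base + lane + i * subgroupSize]
// `base` plays the role of the subgroup-uniform `ptr` operand.
std::vector<std::vector<uint16_t>>
simulateBlockLoad(const std::vector<uint16_t> &mem, std::size_t base,
                  std::size_t subgroupSize, std::size_t elemsPerLane) {
  assert(base + subgroupSize * elemsPerLane <= mem.size() && "out of bounds");
  std::vector<std::vector<uint16_t>> result(
      subgroupSize, std::vector<uint16_t>(elemsPerLane));
  for (std::size_t lane = 0; lane < subgroupSize; ++lane)
    for (std::size_t i = 0; i < elemsPerLane; ++i)
      result[lane][i] = mem[base + lane + i * subgroupSize];
  return result;
}

// Illustrative model of the matching block store: the inverse mapping.
// Assumes every lane holds a vector of the same length.
void simulateBlockStore(std::vector<uint16_t> &mem, std::size_t base,
                        const std::vector<std::vector<uint16_t>> &vals) {
  const std::size_t subgroupSize = vals.size();
  for (std::size_t lane = 0; lane < subgroupSize; ++lane) {
    assert(base + lane + vals[lane].size() * subgroupSize <=
               mem.size() + subgroupSize && "out of bounds");
    for (std::size_t i = 0; i < vals[lane].size(); ++i)
      mem[base + lane + i * subgroupSize] = vals[lane][i];
  }
}
```

Under this model, a subgroup of 16 lanes (an assumed size) with `vector<4xi16>` per lane covers 16 × 4 = 64 consecutive `i16` values, so the per-lane vector length need not be a multiple of the subgroup size.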

@silee2 silee2 merged commit 0021a6b into llvm:main Sep 16, 2025
9 checks passed