[mlir][gpu] Add `subgroup_broadcast` op #152808

Hardcode84 · 2025-08-08T22:28:39Z

subgroup_broadcast broadcasts a value from one lane to all lanes in the subgroup.

Supported modes:

first_active_lane - broadcast the value from the first active lane in the subgroup.
specific_lane - broadcast the value from the specified lane; the lane index must be within the subgroup.
any_lane - if the src value is uniform across all subgroup lanes, return it unchanged; otherwise the result is poison. This variant is essentially a uniformity hint for the compiler, conveying that a specific value is uniform across all subgroup lanes. Dropping an any_lane broadcast should not change the code semantics.
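
The three modes can be modelled on the host as a rough sketch. Everything here is hypothetical and not part of the patch: the `Subgroup` struct, the helper names, and the use of `std::nullopt` to stand in for poison. Treating an inactive or out-of-range source lane as poison follows the review discussion further down rather than the list above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical host-side model: per-lane values plus a bitmask of active
// lanes (bit i set means lane i is active). Assumes at most 64 lanes.
struct Subgroup {
  std::vector<int> values;
  uint64_t activeMask;
};

// Poison is modelled as std::nullopt in this sketch.
using Result = std::optional<int>;

// first_active_lane: value held by the lowest-numbered active lane.
Result broadcastFirstActive(const Subgroup &sg) {
  for (std::size_t lane = 0; lane < sg.values.size(); ++lane)
    if (sg.activeMask & (uint64_t{1} << lane))
      return sg.values[lane];
  return std::nullopt; // no active lane at all
}

// specific_lane: value held by `lane`; an out-of-range or inactive lane
// yields poison (an assumption taken from the review discussion).
Result broadcastSpecific(const Subgroup &sg, uint32_t lane) {
  if (lane >= sg.values.size())
    return std::nullopt;
  if (!(sg.activeMask & (uint64_t{1} << lane)))
    return std::nullopt;
  return sg.values[lane];
}

// any_lane: the common value if every lane agrees, otherwise poison.
Result broadcastAny(const Subgroup &sg) {
  if (sg.values.empty())
    return std::nullopt;
  for (int v : sg.values)
    if (v != sg.values.front())
      return std::nullopt;
  return sg.values.front();
}
```

With values `{7, 8, 9, 10}` and only lanes 2 and 3 active, `broadcastFirstActive` yields 9, `broadcastSpecific(sg, 3)` yields 10, and `broadcastAny` yields poison because the lanes disagree.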

llvmbot · 2025-08-08T22:29:13Z

@llvm/pr-subscribers-mlir

@llvm/pr-subscribers-mlir-gpu

Author: Ivan Butygin (Hardcode84)

Changes

broadcast_lane broadcasts a value from one lane to all lanes in the subgroup.

Supported modes:

first_lane - broadcast the value from the first active lane in the subgroup.
lane - broadcast the value from the specified lane; the lane index must be within the subgroup.
any_lane - if the src value is uniform across all subgroup lanes, return it unchanged; otherwise the result is poison. This variant is essentially a uniformity hint for the compiler, conveying that a specific value is uniform across all subgroup lanes. Dropping an any_lane broadcast will not change the code semantics.

Full diff: https://github.com/llvm/llvm-project/pull/152808.diff

7 Files Affected:

(modified) mlir/include/mlir/Dialect/GPU/IR/GPUOps.td (+43-1)
(modified) mlir/lib/Conversion/GPUToROCDL/LowerGpuOpsToROCDLOps.cpp (+24-1)
(modified) mlir/lib/Dialect/GPU/IR/GPUDialect.cpp (+37)
(modified) mlir/test/Conversion/GPUToROCDL/gpu-to-rocdl.mlir (+17-1)
(added) mlir/test/Dialect/GPU/broadcast-speculatability.mlir (+23)
(modified) mlir/test/Dialect/GPU/int-range-interface.mlir (+19)
(modified) mlir/test/Dialect/GPU/ops.mlir (+14-2)

diff --git a/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td b/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
index f946bb731e2ca..6592a5c55b0c2 100644
--- a/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
+++ b/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
@@ -1517,7 +1517,7 @@ def GPU_GPUModuleOp : GPU_Op<"module", [
     /// Sets the targets of the module.
     void setTargets(ArrayRef<TargetAttrInterface> targets);
   }];
-  
+
   let hasVerifier = 1;
 }
 
@@ -3212,4 +3212,46 @@ def GPU_WarpExecuteOnLane0Op : GPU_Op<"warp_execute_on_lane_0",
   }];
 }
 
+def GPU_BroadcastType : I32EnumAttr<"BroadcastType",
+    "a lane to broadcast from",
+    [
+      I32EnumAttrCase<"first_lane", 0>,
+      I32EnumAttrCase<"any_lane", 1>,
+      I32EnumAttrCase<"lane", 2>
+    ]>{
+  let genSpecializedAttr = 0;
+  let cppNamespace = "::mlir::gpu";
+}
+def GPU_BroadcastTypeAttr : EnumAttr<GPU_Dialect, GPU_BroadcastType, "broadcast">;
+
+def GPU_BroadcastLaneOp : GPU_Op<"broadcast_lane",
+    [NoMemoryEffect, AllTypesMatch<["result", "src"]>,
+    DeclareOpInterfaceMethods<InferIntRangeInterface, ["inferResultRanges"]>,
+    DeclareOpInterfaceMethods<ConditionallySpeculatable, ["getSpeculatability"]>] #
+    ElementwiseMappable.traits>,
+  Arguments<(ins AnyType:$src,
+                 Optional<I32>:$lane,
+                 GPU_BroadcastTypeAttr:$broadcast_type)> {
+  let summary = "Broadcasts a value from the specific lane across subgroup";
+  let description = [{
+      Broadcasts the value from the one lane to the all lanes in subgroup.
+
+      The possible broadcats types are:
+
+      * `first_lane` - first active lane in subgroup.
+      * `lane` - from the specified lane, lane index must be withing subgroup.
+      * `any_lane` - if `src` value is uniform across all the subgroup
+      lanes return it unchanged, otherwise result is poison. This variant
+      essentially an uniformity hint for the compiler, conveying that
+      specific value is uniform across all subgroup lanes. Dropping `any_lane`
+      broadcast will not change the code semantics.
+      ```
+  }];
+  let results = (outs AnyType:$result);
+  let assemblyFormat = [{
+    $src `,` $broadcast_type ($lane^)?  attr-dict `:` type($result)
+  }];
+  let hasVerifier = 1;
+}
+
 #endif // GPU_OPS
diff --git a/mlir/lib/Conversion/GPUToROCDL/LowerGpuOpsToROCDLOps.cpp b/mlir/lib/Conversion/GPUToROCDL/LowerGpuOpsToROCDLOps.cpp
index d22364e1ef441..4d081cefb5f35 100644
--- a/mlir/lib/Conversion/GPUToROCDL/LowerGpuOpsToROCDLOps.cpp
+++ b/mlir/lib/Conversion/GPUToROCDL/LowerGpuOpsToROCDLOps.cpp
@@ -160,6 +160,27 @@ struct GPUSubgroupSizeOpToROCDL : ConvertOpToLLVMPattern<gpu::SubgroupSizeOp> {
   const amdgpu::Chipset chipset;
 };
 
+struct GPUBroadcastLaneOpToROCDL
+    : public ConvertOpToLLVMPattern<gpu::BroadcastLaneOp> {
+  using ConvertOpToLLVMPattern::ConvertOpToLLVMPattern;
+
+  LogicalResult
+  matchAndRewrite(gpu::BroadcastLaneOp op, OpAdaptor adaptor,
+                  ConversionPatternRewriter &rewriter) const override {
+    Value src = adaptor.getSrc();
+    if (adaptor.getBroadcastType() == gpu::BroadcastType::lane) {
+      rewriter.replaceOpWithNewOp<ROCDL::ReadlaneOp>(op, src.getType(), src,
+                                                     adaptor.getLane());
+    } else { // first_lane or any_lane
+      // any_lane is lowered to readfirstlane too, to force value into scalar
+      // register.
+      rewriter.replaceOpWithNewOp<ROCDL::ReadfirstlaneOp>(op, src.getType(),
+                                                          src);
+    }
+    return success();
+  }
+};
+
 struct GPUShuffleOpLowering : public ConvertOpToLLVMPattern<gpu::ShuffleOp> {
   using ConvertOpToLLVMPattern<gpu::ShuffleOp>::ConvertOpToLLVMPattern;
 
@@ -453,7 +474,9 @@ void mlir::populateGpuToROCDLConversionPatterns(
   // TODO: Add alignment for workgroup memory
   patterns.add<GPUDynamicSharedMemoryOpLowering>(converter);
 
-  patterns.add<GPUShuffleOpLowering, GPULaneIdOpToROCDL>(converter);
+  patterns
+      .add<GPUShuffleOpLowering, GPULaneIdOpToROCDL, GPUBroadcastLaneOpToROCDL>(
+          converter);
   patterns.add<GPUSubgroupSizeOpToROCDL>(converter, chipset);
 
   populateMathToROCDLConversionPatterns(converter, patterns);
diff --git a/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp b/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
index 2503ccb6a2cfe..2b1adc90477af 100644
--- a/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
+++ b/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
@@ -2511,6 +2511,43 @@ bool WarpExecuteOnLane0Op::areTypesCompatible(Type lhs, Type rhs) {
       verifyDistributedType(lhs, rhs, getWarpSize(), getOperation()));
 }
 
+//===----------------------------------------------------------------------===//
+// GPU_BroadcastLaneOp
+//===----------------------------------------------------------------------===//
+
+void gpu::BroadcastLaneOp::inferResultRanges(
+    ArrayRef<ConstantIntRanges> argRanges, SetIntRangeFn setResultRange) {
+  setResultRange(getResult(), argRanges.front());
+}
+
+Speculation::Speculatability gpu::BroadcastLaneOp::getSpeculatability() {
+  switch (getBroadcastType()) {
+  case BroadcastType::first_lane:
+    // Cannot speculate first_lane broadcast, because speculating it across
+    // control flow can change the active lanes.
+    return Speculation::NotSpeculatable;
+  case BroadcastType::any_lane:
+    LLVM_FALLTHROUGH;
+  case BroadcastType::lane:
+    return Speculation::Speculatable;
+  }
+}
+
+LogicalResult gpu::BroadcastLaneOp::verify() {
+  switch (getBroadcastType()) {
+  case BroadcastType::first_lane:
+    LLVM_FALLTHROUGH;
+  case BroadcastType::any_lane:
+    if (getLane())
+      return emitOpError() << "lane can only be specified for lane broadcast";
+    return success();
+  case BroadcastType::lane:
+    if (!getLane())
+      return emitOpError() << "lane must be specified for lane broadcast";
+    return success();
+  }
+}
+
 //===----------------------------------------------------------------------===//
 // GPU KernelMetadataAttr
 //===----------------------------------------------------------------------===//
diff --git a/mlir/test/Conversion/GPUToROCDL/gpu-to-rocdl.mlir b/mlir/test/Conversion/GPUToROCDL/gpu-to-rocdl.mlir
index 2b6adffc81f72..ed62e1c689ae3 100644
--- a/mlir/test/Conversion/GPUToROCDL/gpu-to-rocdl.mlir
+++ b/mlir/test/Conversion/GPUToROCDL/gpu-to-rocdl.mlir
@@ -701,7 +701,7 @@ gpu.module @test_module {
     // CHECK: %[[#CAST_VALUE:]] = llvm.bitcast %[[#VALUE]] : f32 to i32
     // CHECK: %[[#PERMUTE:]] = rocdl.ds_bpermute %[[#ALIGNED_DST_LANE]], %[[#CAST_VALUE]] : (i32, i32) -> i32
     // CHECK: %[[#CAST_SHFL_VALUE:]] = llvm.bitcast %[[#PERMUTE]] : i32 to f32
-    %shfli, %predi = gpu.shuffle idx %arg0, %arg1, %arg2 : f32 
+    %shfli, %predi = gpu.shuffle idx %arg0, %arg1, %arg2 : f32
     // *** UP mode shuffle ***
     // CHECK: %[[#LANE_ID:]] = rocdl.mbcnt.hi
     // CHECK: %[[#ZERO:]] = llvm.mlir.constant(0 : i32) : i32
@@ -776,3 +776,19 @@ gpu.module @test_module {
     func.return %bDimX : index
   }
 }
+
+// -----
+
+gpu.module @test_module {
+// CHECK-LABEL: func @broadcast
+//  CHECK-SAME:   (%[[ARG:.*]]: i64, %[[IDX:.*]]: i32)
+func.func @broadcast(%arg0 : index, %arg1 : i32) -> (index, index, index) {
+//       CHECK:   %{{.*}} = rocdl.readfirstlane %[[ARG]] : i64
+//       CHECK:   %{{.*}} = rocdl.readfirstlane %[[ARG]] : i64
+//       CHECK:   %{{.*}} = rocdl.readlane %[[ARG]], %[[IDX]] : (i64, i32) -> i64
+  %0 = gpu.broadcast_lane %arg0, first_lane : index
+  %1 = gpu.broadcast_lane %arg0, any_lane : index
+  %2 = gpu.broadcast_lane %arg0, lane %arg1 : index
+  func.return %0, %1, %2 : index, index, index
+}
+}
diff --git a/mlir/test/Dialect/GPU/broadcast-speculatability.mlir b/mlir/test/Dialect/GPU/broadcast-speculatability.mlir
new file mode 100644
index 0000000000000..facbe8761c1fd
--- /dev/null
+++ b/mlir/test/Dialect/GPU/broadcast-speculatability.mlir
@@ -0,0 +1,23 @@
+// RUN: mlir-opt %s --loop-invariant-code-motion | FileCheck %s
+
+func.func private @side_effect(%arg0 : f32, %arg1 : f32, %arg2 : f32)
+
+// CHECK-LABEL: func @broadcast_hoisting
+//  CHECK-SAME: (%[[ARG:.*]]: f32, %[[IDX:.*]]: i32)
+func.func @broadcast_hoisting(%arg0 : f32, %arg1 : i32) {
+  %c0 = arith.constant 0 : index
+  %c1 = arith.constant 1 : index
+  %c10 = arith.constant 10 : index
+// CHECK: %[[V1:.*]] = gpu.broadcast_lane %[[ARG]], any_lane : f32
+// CHECK: %[[V2:.*]] = gpu.broadcast_lane %[[ARG]], lane %[[IDX]] : f32
+// CHECK: scf.for
+// CHECK: %[[V0:.*]] = gpu.broadcast_lane %[[ARG]], first_lane : f32
+// CHECK: func.call @side_effect(%[[V0]], %[[V1]], %[[V2]])
+  scf.for %i = %c0 to %c10 step %c1 {
+    %0 = gpu.broadcast_lane %arg0, first_lane : f32
+    %1 = gpu.broadcast_lane %arg0, any_lane : f32
+    %2 = gpu.broadcast_lane %arg0, lane %arg1 : f32
+    func.call @side_effect(%0, %1, %2) : (f32, f32, f32) -> ()
+  }
+  func.return
+}
diff --git a/mlir/test/Dialect/GPU/int-range-interface.mlir b/mlir/test/Dialect/GPU/int-range-interface.mlir
index 1613f83b17bde..cfb99283652a2 100644
--- a/mlir/test/Dialect/GPU/int-range-interface.mlir
+++ b/mlir/test/Dialect/GPU/int-range-interface.mlir
@@ -329,3 +329,22 @@ module attributes {gpu.container_module} {
     }
   }
 }
+
+// -----
+
+// CHECK-LABEL: func @broadcast
+func.func @broadcast(%idx: i32) {
+  %0 = test.with_bounds { umin = 0 : index, umax = 10 : index, smin = 0 : index, smax = 10 : index } : index
+  %1 = gpu.broadcast_lane %0, first_lane : index
+  %2 = gpu.broadcast_lane %0, any_lane : index
+  %3 = gpu.broadcast_lane %0, lane %idx : index
+
+  // CHECK: test.reflect_bounds {smax = 10 : index, smin = 0 : index, umax = 10 : index, umin = 0 : index}
+  // CHECK: test.reflect_bounds {smax = 10 : index, smin = 0 : index, umax = 10 : index, umin = 0 : index}
+  // CHECK: test.reflect_bounds {smax = 10 : index, smin = 0 : index, umax = 10 : index, umin = 0 : index}
+
+  %4 = test.reflect_bounds %1 : index
+  %5 = test.reflect_bounds %2 : index
+  %6 = test.reflect_bounds %3 : index
+  return
+}
diff --git a/mlir/test/Dialect/GPU/ops.mlir b/mlir/test/Dialect/GPU/ops.mlir
index 9cc0bf8f41d5a..95b6d21097a37 100644
--- a/mlir/test/Dialect/GPU/ops.mlir
+++ b/mlir/test/Dialect/GPU/ops.mlir
@@ -126,7 +126,7 @@ module attributes {gpu.container_module} {
       // CHECK-NEXT: %{{.*}} = arith.addf %{{.*}}, %{{.*}} : f32
       // CHECK-NEXT: gpu.yield %{{.*}} : f32
       // CHECK-NEXT: } : (f32) -> f32
-      %sum2 = gpu.all_reduce %one { 
+      %sum2 = gpu.all_reduce %one {
       ^bb(%lhs : f32, %rhs : f32):
         %tmp = arith.addf %lhs, %rhs : f32
         gpu.yield %tmp : f32
@@ -259,7 +259,7 @@ module attributes {gpu.container_module} {
       %1 = arith.cmpi slt, %arg0, %arg0 : i32
       scf.if %1 {
         gpu.printf ", "
-      } 
+      }
       gpu.return
     }
 
@@ -542,3 +542,15 @@ func.func @warp_operand_result(%laneid: index, %v0 : vector<4xi32>) -> (vector<4
   }
   return %2 : vector<4xi32>
 }
+
+// CHECK-LABEL: func @broadcast_lane
+//  CHECK-SAME: (%[[ARG:.*]]: f32, %[[IDX:.*]]: i32)
+func.func @broadcast_lane(%arg0 : f32, %arg1 : i32) -> (f32, f32, f32) {
+  // CHECK: gpu.broadcast_lane %[[ARG]], first_lane : f32
+  %0 = gpu.broadcast_lane %arg0, first_lane : f32
+  // CHECK: gpu.broadcast_lane %[[ARG]], any_lane : f32
+  %1 = gpu.broadcast_lane %arg0, any_lane : f32
+  // CHECK: gpu.broadcast_lane %[[ARG]], lane %[[IDX]] : f32
+  %2 = gpu.broadcast_lane %arg0, lane %arg1 : f32
+  func.return %0, %1, %2 : f32, f32, f32
+}

krzysz00 · 2025-08-09T00:17:05Z

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp

Hold on, is readlane speculatable?

From the past discussion #152551 (comment)

@kuhar

I think structured control flow guarantees that your parents are at least as active as you, so this seems safe for me

Does speculatability only get used in structured control flow, though?

You can speculate within a single block for sure, but practically speaking we don't have anything that speculates across blocks in MLIR AFAIK. This is only being used by LICM.

We can say that any_lane includes both active and inactive lanes to make this unambiguous.

technically, CSE can merge ops across blocks

    %1 = gpu.broadcast %src, lane %idx
    scf.if lane_id < 13 {
      %2 = gpu.broadcast %src, lane %idx
    }

not sure if it's a problem, though

If your op semantics include inactive lanes this is fine

Jianhui-Li · 2025-08-09T00:42:07Z

How are the broadcast first_lane semantics different from gpu.shuffle?

idx example:
%cst0 = arith.constant 0 : i32
%7, %8 = gpu.shuffle idx %0, %cst0, %width : f32
Broadcasts the value from lane 0 to all lanes.

Can the broadcast lane semantics be implemented by gpu.shuffle down + gpu.shuffle idx?

Why does the compiler hint have to be part of the broadcast op semantics as "any_lane"?

Hardcode84 · 2025-08-09T09:37:12Z

How are the broadcast first_lane semantics different from gpu.shuffle?

Unlike gpu.shuffle, the result of the broadcast is guaranteed to be uniform across the subgroup, which can enable a more efficient lowering (e.g. using scalar registers instead of vector registers on AMDGPU). The any_lane option follows the same broadcast logic (take the value from some lane and make it uniform across the whole subgroup); the only difference is that the user guarantees the input value is also uniform, so the compiler can choose any lane to take it from and still put the result into a scalar register. And unlike first_lane, any_lane has more relaxed speculation rules: first_lane cannot be speculated across control flow, because that can change the set of active lanes, but any_lane can, because all inputs are already known to be uniform.

any_lane and speculation were among the original motivations for this op (see #152740 (comment) for the technical details).
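
The speculation argument can be illustrated with a small host-side sketch. The helper names and the `-1` sentinel are hypothetical. "Hoisting" is modelled simply as evaluating the same read under the outer (wider) active mask instead of the mask inside the branch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Value held by the lowest-numbered lane whose bit is set in activeMask.
// Returns -1 when no lane is active (a sentinel chosen for this sketch).
int readFirstActive(const std::vector<int> &values, uint32_t activeMask) {
  for (std::size_t lane = 0; lane < values.size() && lane < 32; ++lane)
    if (activeMask & (1u << lane))
      return values[lane];
  return -1;
}

// Moving a first-active-lane read out of a branch is only safe if the read
// produces the same value under the outer mask as under the inner mask.
bool hoistingPreservesResult(const std::vector<int> &values,
                             uint32_t outerMask, uint32_t innerMask) {
  return readFirstActive(values, outerMask) ==
         readFirstActive(values, innerMask);
}
```

For non-uniform values `{10, 11, 12, 13}` with all four lanes active outside the branch and only lanes 2 and 3 inside it, the read returns 12 inside the branch but 10 when hoisted, so the hoist changes the result. For uniform values such as `{42, 42, 42, 42}` the mask does not matter, which is the property any_lane relies on.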

grypp · 2025-08-09T13:57:38Z

As far as I understand, there is a difference between shuffle and broadcast here, but this distinction exists only for AMDGPU. I’m not familiar with Intel GPUs or others, but NVIDIA GPUs don’t have this distinction at the PTX level.

I’d recommend implementing this op in the AMD-specific dialect rather than in the target agnostic GPU dialect.

kuhar · 2025-08-09T13:58:07Z

Can the broadcast lane semantics be implemented by gpu.shuffle down + gpu.shuffle idx?

This op is inspired by the operations in the Vulkan subgroup extension: https://www.khronos.org/blog/vulkan-subgroup-tutorial#:~:text=T-,subgroupBroadcast,-(T%20value%2C%20uint and is meant to have a performant lowering that uses much cheaper instructions than shuffles.

Shuffles are not the universally best abstraction for modern chips that have much more performant primitives with narrower semantics (e.g., https://github.com/nod-ai/shark-ai/blob/main/docs/amdgpu_kernel_optimization_guide.md#data-parallel-primitives-and-warp-level-reduction + new v_permlane in CDNA4). If we restrict ourselves to shuffles, we have to do some heavy pattern matching / idiom recognition and make sure that the emitter caters towards these.

kuhar · 2025-08-09T14:02:54Z

In my view, broadcasting a lane is a fundamental primitive both at the level of the SIMT programming model and the hardware. It can be emulated with shuffles, but not efficiently without some idiom recognition.
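
A simplified sketch of the relationship between the two: a broadcast is an index shuffle in which every lane supplies the same source index. The model below is hypothetical, ignores the width operand of gpu.shuffle, and assumes all lanes are active. The idiom recognition mentioned above amounts to proving that the index vector is uniform.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified index shuffle: lane i reads the value held by lane srcLane[i].
// The caller must pass one in-range source index per lane.
std::vector<int> shuffleIdx(const std::vector<int> &values,
                            const std::vector<std::size_t> &srcLane) {
  std::vector<int> result(values.size());
  for (std::size_t lane = 0; lane < values.size(); ++lane)
    result[lane] = values[srcLane[lane]];
  return result;
}

// Broadcast expressed as a shuffle whose index is identical in every lane.
std::vector<int> broadcastViaShuffle(const std::vector<int> &values,
                                     std::size_t lane) {
  return shuffleIdx(values, std::vector<std::size_t>(values.size(), lane));
}

// True if every lane holds the same value (vacuously true when empty).
bool isUniform(const std::vector<int> &values) {
  for (int v : values)
    if (v != values.front())
      return false;
  return true;
}
```

The result of `broadcastViaShuffle` is always uniform, whereas a general `shuffleIdx` result is uniform only when the indices happen to agree, which a compiler would have to prove before picking a scalar lowering.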

kuhar · 2025-08-09T14:03:45Z

As far as I understand, there is a difference between shuffle and broadcast here, but this distinction exists only for AMDGPU. I’m not familiar with Intel GPUs or others, but NVIDIA GPUs don’t have this distinction at the PTX level.

I’d recommend implementing this op in the AMD-specific dialect rather than in the target agnostic GPU dialect.

It's portable and exists at the level of Vulkan / SPIR-V: https://www.khronos.org/blog/vulkan-subgroup-tutorial#:~:text=T-,subgroupBroadcast,-(T%20value%2C%20uint . We will implement conversion to SPIR-V in a followup PR.

kuhar · 2025-08-09T14:19:16Z

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td

What do you mean by dropping? I think you mean changing to a specific lane or first_lane?

kuhar · 2025-08-09T14:19:52Z

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td

I'd say that this returns a value from any lane within the subgroup, active or inactive, assuming uniformity.

ok, I did reword doc somewhat

kuhar · 2025-08-09T14:20:18Z

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td

Maybe put the enum before the operand like with gpu.subgroup_reduce?

I chose the current order to be able to write gpu.broadcast_lane %src, lane %idx : f32, i.e. having the lane index directly after the lane keyword.

Jianhui-Li

The proposed subgroup uniform broadcast operation doesn't have support on Intel GPU. Since it is also unsupported on NVIDIA GPUs, I believe it would be more appropriate to place this operation in the AMDGPU dialect until it is supported by multiple GPU vendors.

Jianhui-Li · 2025-08-11T21:35:21Z

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td

What are the semantics if the specified lane is an inactive lane?

The result will be undefined/poison. I will update the docs, thanks.

Please also describe the behavior when the specified lane is out of range (e.g. larger than the subgroup size).

Updated the description.

Jianhui-Li · 2025-08-11T21:37:25Z

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td

When I first read first_lane, I thought it meant lane 0. I think it is important to keep "active" in the attribute name. Consider: first_active_lane, specific_lane, and any_lane.

Yeah, makes sense.

Jianhui-Li · 2025-08-11T21:40:47Z

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td

consider broadcast_lane to subgroup_uniform_broadcast

How about just gpu.subgroup_broadcast?

I do kinda like subgroup_broadcast

Agree. "uniform" is an ambiguous term and would require better documentation in the op definition. The corresponding SPIR-V functions actually have non_uniform in their names, indicating that they don't require all participating lanes to be active.

Hardcode84 · 2025-08-11T22:25:36Z

The proposed subgroup uniform broadcast operation doesn't have support on Intel GPU

I think it does

https://github.com/intel/intel-graphics-compiler/blob/master/IGC/GenISAIntrinsics/generator/input/Intrinsic_definitions.yml#L2467

Jianhui-Li · 2025-08-11T23:26:33Z

The proposed subgroup uniform broadcast operation doesn't have support on Intel GPU

I think it does

https://github.com/intel/intel-graphics-compiler/blob/master/IGC/GenISAIntrinsics/generator/input/Intrinsic_definitions.yml#L2467

The one you refer to is not officially supported interface.

The closest one I identified is the following:
//non-uniform broadcast for a specific lane
https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#OpGroupNonUniformBroadcast

//non-uniform broadcasts the first active value:
https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#OpGroupNonUniformBroadcastFirst

These non-uniform ops only guarantee the result is uniform for all active lanes, not all lanes. SPIRV dialect actually already includes these operations so I assumed these were different.

I may misunderstand the semantics of the proposed broadcast_lane op, as the definition says "broadcast the value from one lane to all lanes in subgroup". Could you please clarify that all lanes refers just to all active lanes?

Hardcode84 · 2025-08-11T23:40:21Z

I may misunderstand the semantics of the proposed broadcast_lane op, as the definition says "broadcast the value from one lane to all lanes in subgroup". Could you please clarify that all lanes refers just to all active lanes?

Yeah, "broadcast the value from one lane to all active lanes in subgroup" definition will be better. Although I'm not sure it will make any practical difference as we don't have any way to access inactive lanes in gpu dialect currently I think.

Jianhui-Li · 2025-08-12T19:22:36Z

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td

I suggest explicitly describing the following semantics:

Does it require all lanes to be active?

For inactive lanes, how does this operation affect them, if at all?

Updated the description

Jianhui-Li · 2025-08-12T19:28:03Z

mlir/lib/Conversion/GPUToROCDL/LowerGpuOpsToROCDLOps.cpp

As the any_lane semantics are lost during lowering, I assume that any_lane is designed to help target-independent MLIR optimization at the GPU dialect level.

Could you please describe which optimization passes you plan to use any_lane with, and which use case you are targeting?

any_lane can be freely speculated across control flow; see the mlir/test/Dialect/GPU/broadcast-speculatability.mlir test.

The test shows that the LICM optimization can utilize the speculatability property.
But I can't figure out how higher-level dialects can use the broadcast any_lane op. In particular, does the user need to know the target-specific implementation to use this operation effectively? Is the usage related to broadcast, or is it just a hint?

Jianhui-Li · 2025-08-15T16:01:18Z

After the discussion, I don't have an issue with the first_lane and specific_lane op semantics. They have counterparts in the SPIR-V op definitions, so lowering should not be a problem, and I agree that a GPU op definition can be slightly higher level than the exact HW semantics (like shuffle) as long as it captures a common use case.
//non-uniform broadcast for a specific lane
https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#OpGroupNonUniformBroadcast
//non-uniform broadcasts the first active value:
https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#OpGroupNonUniformBroadcastFirst

I still have reservations about the "any_lane" variant. I am not sure how users will use it, whether the usage is tied to a specific target implementation, and whether there is an alternative way to represent the compiler hint that does not have to be tied to broadcast.

@grypp @Hardcode84 @krzysz00 @kuhar

krzysz00 · 2025-08-15T17:38:09Z

So the point of any_lane is to mark a value that, from higher-level information or compiler knowledge, is known to be subgroup-uniform, in order to allow more speculation than reading the first active lane would permit.

One example is something like [thread ID x] / [whatever I know my subgroup size is].

This will often lower to something more concrete like reading the first active lane, but allows more transformations before that lowering takes place.

(There's also been mention on a related ticket of the possibility of adding a "read any" to AMDGPU as an LLVM backend version of this, but I don't think that went anywhere.)

The motivation behind any_lane is to allow moving specific known-uniform computations, which eventually have to be in scalar registers for certain instructions, out of loops, forcing the transition from vector to scalar registers before the loop. Otherwise, default compiler transformations will keep redoing the readfirstlane in each loop iteration.

Does that help?
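
The thread-ID example can be checked with a short host-side sketch. It assumes subgroups are formed from consecutive linear thread IDs, which is an assumption of this model and not something the thread states. The helper name `uniformPerSubgroup` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Returns true if f(tid) is the same for every thread within each subgroup,
// assuming subgroups are consecutive runs of subgroupSize thread IDs.
template <typename F>
bool uniformPerSubgroup(uint32_t workgroupSize, uint32_t subgroupSize, F f) {
  for (uint32_t base = 0; base < workgroupSize; base += subgroupSize)
    for (uint32_t tid = base; tid < base + subgroupSize && tid < workgroupSize;
         ++tid)
      if (f(tid) != f(base))
        return false;
  return true;
}
```

Under that assumption, `tid / subgroupSize` passes the check while the lane ID `tid % subgroupSize` does not. The compiler cannot easily derive this on its own once the division has been lowered, which is why a frontend that already knows it may want to annotate the value.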

kuhar · 2025-08-15T17:53:23Z

I still have reservations about the "any_lane" variant. I am not sure how users will use it, whether the usage is tied to a specific target implementation, and whether there is an alternative way to represent the compiler hint that does not have to be tied to broadcast.

I am not sure how user will use it

As an example, this comes up when you distribute a broadcast across threads. After the distribution happens, there's no existing way to hint the value is uniform. This op would be emitted by the code that performs distribution.

whether the usage is tied to a specific target implementation

It's not.

whether there is alternative way to represent compiler hint so may or may not have to tie with broadcast

There are no existing alternatives for this.

Jianhui-Li · 2025-08-15T21:17:41Z

One example is something like [thread ID x] / [whatever I know my subgroup size is].
This will often lower to something more concrete like reading the first active lane, but allow more transformations before that lowering takes place.

Thanks. The example certainly helps addressing my use case question.

There are no existing alternatives for this.

Why does the compiler hint have to be part of the broadcast op? I see alternatives that do not attach the hint to broadcast.

Ivan's earlier proposal doesn't have broadcast semantics: [mlir][amdgpu] Introduce assume_subgroup_uniform op by Hardcode84 · Pull Request #152740 · llvm/llvm-project

Another example of providing compiler hint without using broadcast: There is a "Uniform" SPIR-V decoration: https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#Decoration

Uniform
Apply only to an object. Asserts that, for each [dynamic instance](https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#DynamicInstance) of the instruction that computes the result, all invocations in the same [tangle](https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#Tangle) within the invocation’s Subgroup scope compute the same result value.

These alternatives are more straightforward to my eyes.

Users don't have to understand the broadcast "any_lane" semantics; they only need to realize that the hint describes a property of the input value and has nothing to do with the result value.
Each HW target can choose to implement the hint differently. AMD GPUs may choose to implement it as "readFirstLane". Others may choose another way or simply ignore it. But with the current proposal, each HW target must implement the broadcast semantics, because the broadcast semantics (not as a hint) can be used to do something meaningful, as shown below.

    $value = 1
    $result = 0
    if (lane_id & 0x1) 
       $result = subgroup.broadcast @any_lane $value

Jianhui-Li · 2025-09-03T22:09:14Z

@Jianhui-Li ping

I was on vacation for the last two weeks, so I did not respond in time. I don't think we reached consensus on the "any_lane" attribute. I was hoping that the GPU dialect maintainer would make a final call in a situation like this.

fabianmcg · 2025-09-03T22:59:44Z

Apologies, I saw the PR, gave it a glance, and since there was active discussion I didn't think I had to take a deeper look.

I think I concur with @Jianhui-Li, or at least it's also not clear to me. Further, I'd argue first_lane and specific_lane are the same case, and what we are missing is a gpu.first_active_lane_id op.

Also, what's the point of any_lane? From what I can understand, it's a nop that just means the value is uniform across all lanes; otherwise, how is the compiler choosing which lane to broadcast from?

Edit: The thing that is not clear to me, is the rationale to justify these.

krzysz00 · 2025-09-03T23:13:58Z

first_active_lane isn't the same as any_lane(0) - it's the first active lane, which is an operation that often exists.

And the point of any_lane is to be a strong MLIR-level indicator that the value is uniform, which means that you could replace it with first_active_lane or specific_lane(31) or what have you and it would be correct - and it provides for lowerings to any "this value is uniform, grab it from your favorite lane" intrinsics that might eventually get added in backends.

fabianmcg · 2025-09-03T23:23:24Z

first_active_lane isn't the same as any_lane(0) - it's the first active lane, which is an operation that often exists.

I didn't say it was lane 0, here is my quote:

what we are missing is a gpu.first_active_lane_id op.

On the second point:

is to be a strong MLIR-level indicator that the value is uniform

That is good. However, this PR is overloading an op with semantics that should belong to a different op, as they don't really fit with broadcast semantics, what am I broadcasting if there's nothing to broadcast? I'd argue, something like gpu.uniform_value metadata nop is better.

kuhar · 2025-09-04T00:12:34Z

Hi @Jianhui-Li and @fabianmcg,

I gave this a green light to merge because the comments seemed reasonably addressed from what I could tell and @Jianhui-Li did not leave a blocking review -- this is the tool to indicate that you do not wish the PR be merged as is.

Since there's still a little bit to work out, I propose we do not revert the whole thing but instead either fix forward or drop just the any_lane variant, which I think is the only thing we haven't agreed on. Does this sound reasonable @Jianhui-Li and @fabianmcg ?

krzysz00 · 2025-09-04T00:17:19Z

Half the point of this op is that any_lane can be speculated in a way that reading the first active lane or reading a specific lane can't be

krzysz00 · 2025-09-04T00:18:48Z

gpu.first_active_lane_id op

But that's not the primitive. The primitive is "read from the first active lane" - AMD literally has a readfirstlane instruction, SPIR-V has one, and so on.

Nothing in that process requires - or allows - you to compute the ID of the first active lane, which is a much less trivial process. (The easiest form is everyone holds up their lane IDs and you readfirstlane that, which is circular)

krzysz00 · 2025-09-04T00:20:18Z

Well, no, a read from any lane is a broadcast to all lanes - it's just declared that you get to take your pick of whose value you broadcast because they're all the same.

All modes of this operation are broadcasts - some of them (any_lane) just come with extra information that lets you optimize said broadcast.

krzysz00 · 2025-09-04T00:21:14Z

And ... it's not a nop. Broadcasting from any lane has "get this into a subgroup-uniform register (if such a thing exists on my target)" semantics, which isn't nothing

Jianhui-Li · 2025-09-04T01:16:49Z

Hi @Jianhui-Li and @fabianmcg,

I gave this a green light to merge because the comments seemed reasonably addressed from what I could tell and @Jianhui-Li did not leave a blocking review -- this is the tool to indicate that you do not wish the PR be merged as is.

Since there's still a little bit to work out, I propose we do not revert the whole thing but instead either fix forward or drop just the any_lane variant, which I think is the only thing we haven't agreed on. Does this sound reasonable @Jianhui-Li and @fabianmcg ?

I agree. As I stated earlier, I agree with the first_active_lane and specific_lane variants, but I think more discussion is needed before using any_lane as a compiler hint.

ftynse · 2025-09-04T07:08:53Z

I was hoping that the GPU dialect maintainer makes a final call in situation like this.

The role of a maintainer is to ensure PRs get reviewed, bugs get fixed, RFCs get engagement, etc. Maintainers do not have decision power or the authority to make any final calls. The discussion seems to be going on, if consensus proves elusive, please tag the MLIR area team for help.

(I was summoned to the thread for procedural comment as the chair of the area team)

fabianmcg · 2025-09-04T12:34:43Z

Thank you for the clarification @ftynse .

And I agree with @kuhar: let's continue the discussion and try to fix forward any lingering issues.

Half the point of this op is that any_lane can be speculated in a way that reading the first active lane or reading a specific lane can't be

Well, no, a read from any lane is a broadcast to all lanes - it's just declared that you get to take your pick of whose value you broadcast because they're all the same.

All modes of this operation are broadcasts - some of them (any_lane) just come with extra information that lets you optimize said broadcast.

That's not the point. Tell me how the compiler chooses which lane to take? The only way to do it is to take a random lane, but that would mean all lanes hold the same value, at which point what you have is gpu.uniform_value. I'm arguing that's a better abstraction than an overloaded broadcast.

gpu.first_active_lane_id op

But that's not the primitive. The primitive is "read from the first active lane" - AMD literally has a readfirstlane instruction, SPIR-V has one, and so on.

Nothing in that process requires - or allows - you to compute the ID of the first active lane, which is a much less trivial process. (The easiest form is everyone holds up their lane IDs and you readfirstlane that, which is circular)

You can compute first_active_lane_id using __activemask(ballot) and bit operations. However, for this case:

%lane = gpu.first_active_lane_id
%v = gpu.broadcast %lane , %value

A pattern could take that and produce a readfirstlane for AMD before first_active_lane_id gets lowered.
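As a rough illustration of the mask-based computation mentioned above, here is a minimal host-side C++ sketch. It is a model, not GPU code and not part of the PR: the names `firstActiveLane` and `broadcastFromLane` are hypothetical, and the active mask is assumed to be a plain 32-bit integer in which bit i is set when lane i is active.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model: returns the index of the lowest set bit of the mask,
// i.e. the first active lane. The mask must have at least one bit set.
inline unsigned firstActiveLane(std::uint32_t activeMask) {
  assert(activeMask != 0 && "at least one lane must be active");
  unsigned lane = 0;
  while ((activeMask & 1u) == 0) {
    activeMask >>= 1;
    ++lane;
  }
  return lane;
}

// Hypothetical model of a broadcast: every lane receives the value that
// `lane` holds. `laneValues[i]` is the value held by lane i.
inline std::uint32_t broadcastFromLane(
    const std::vector<std::uint32_t> &laneValues, unsigned lane) {
  assert(lane < laneValues.size() && "lane index must be within the subgroup");
  return laneValues[lane];
}
```

Composing the two functions mirrors the two-op form in the snippet above; the disagreement in the thread is whether that composition or a single "read first active lane" operation is the better abstraction.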

And ... it's not a nop. Broadcasting from any lane has "get this into a subgroup-uniform register (if such a thing exists on my target)" semantics, which isn't nothing

In gpu it would be a nop, as you are only signaling that the value is uniform. An optimization can use the fact that it's uniform and, when lowering to AMD, use a readfirstlane to promote from a VGPR to an SGPR.

krzysz00 · 2025-09-04T16:39:49Z

That's not the point. Tell me how the compiler chooses which lane to take? The only way to do it is to take a random lane, but that would mean all lanes hold the same value, at which point what you have is gpu.uniform_value. I'm arguing that's a better abstraction than an overloaded broadcast.

I don't have strong opinions on whether gpu.uniform_value is a better abstraction than sneaking in a case into a broadcast op.

The reason I'd think you'd put it here is that - while an any_lane could be a nop, it has hinting semantics - I want this value to be treated as uniform / promoted to uniform registers / ... where that's a thing on my target. So there is a sense in which you will perform some sort of broadcast to implement this, where such an operation exists.

And it's not a "random" lane - it's an arbitrary lane. Any lowering to any kind of broadcast operation is a valid lowering for a broadcast from any_lane - possibly including the one where it's a nop, but you're supposed to do a broadcast from something.

And re the pattern match on first_active_lane_id - because first_active_lane_id isn't really a primitive or a common abstraction, I'm not sure we should add such a thing. And also, I don't trust that pattern match. Having "read first active lane", which is a thing that really exists, as an operation (or a mode of an operation) feels quite reasonable to me

fabianmcg · 2025-09-04T16:57:23Z

I'll preface the comment with, I'm not trying to be pedantic. I'm trying to get to a common understanding, and move forward.

And it's not a "random" lane - it's an arbitrary lane. Any lowering to any kind of broadcast operation is a valid lowering for a broadcast from any_lane - possibly including the one where it's a nop, but you're supposed to do a broadcast from something.

Arbitrary is a synonym for random. https://www.merriam-webster.com/dictionary/arbitrary
My point is, any_lane implies randomness because it doesn't describe which lane to choose. As such, it is not a good abstraction. That's why I think a uniform_value is a better op. And on targets like AMD, this can be promoted to a VGPR to SGPR conversion.

And re the pattern match on first_active_lane_id - because first_active_lane_id isn't really a primitive or a common abstraction, I'm not sure we should add such a thing. And also, I don't trust that pattern match. Having "read first active lane", which is a thing that really exists, as an operation (or a mode of an operation) feels quite reasonable to me

I agree that's not a common primitive, I'm just saying it's an option to model it. I don't get why a pattern match would present any issues. But how about:

Add the gpu.uniform_value op.
Remove the mode flag from the broadcast op. If no lane is passed, the semantics mean use the first active lane. If a lane is passed, use that lane.
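The mode-less broadcast proposed above could be modeled roughly as follows. This is a host-side C++ sketch under stated assumptions, not the op's implementation: `subgroupBroadcast` is a hypothetical name, the optional lane operand is modeled with `std::optional`, and the active mask is a 32-bit integer in which bit i is set when lane i is active.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical model of the proposed semantics: if `lane` is provided, read
// from that lane; otherwise read from the first active lane in `activeMask`.
inline std::uint32_t subgroupBroadcast(
    const std::vector<std::uint32_t> &laneValues, std::uint32_t activeMask,
    std::optional<unsigned> lane = std::nullopt) {
  assert(activeMask != 0 && "at least one lane must be active");
  unsigned src = 0;
  if (lane) {
    src = *lane;
  } else {
    // Scan upward for the lowest set bit; terminates because the mask is
    // non-zero.
    while (((activeMask >> src) & 1u) == 0)
      ++src;
  }
  assert(src < laneValues.size() && "lane index must be within the subgroup");
  return laneValues[src];
}
```

With lane values {10, 11, 12, 13} and lanes 2 and 3 active, omitting the lane reads lane 2, while passing lane 1 reads lane 1 regardless of the mask.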

kuhar · 2025-09-04T17:18:24Z

Arbitrary is a synonym for random. https://www.merriam-webster.com/dictionary/arbitrary
My point is, any_lane implies randomness because it doesn't describe which lane to choose. As such, it is not a good abstraction.

This is incorrect. In the context of program semantics / formal analysis, arbitrary means non-deterministic, which is different from random because it does not require any specific distribution. The implementation may choose any lane, hence the name. Perhaps this framing is not well known by the larger MLIR community and we could pick a different name, e.g., gpu.subgroup_broadcast uniform.

But overall, I think that this does fit broadcast semantics and follows the same lowering, so not having an extra op for uniform values reduces the maintenance burden.
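To make the any_lane semantics from the PR description concrete (uniform input is returned unchanged, otherwise the result is poison), here is a small host-side C++ model. It is only a sketch: `broadcastAnyLane` is a hypothetical name, and `std::nullopt` is used as a stand-in for poison, which has no direct C++ equivalent.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical model of any_lane. `laneValues[i]` is the value held by lane i.
// Returns the common value when all lanes agree; std::nullopt models poison.
inline std::optional<std::uint32_t> broadcastAnyLane(
    const std::vector<std::uint32_t> &laneValues) {
  assert(!laneValues.empty() && "subgroup must have at least one lane");
  for (std::uint32_t v : laneValues)
    if (v != laneValues.front())
      return std::nullopt; // not uniform: the result is poison
  // Uniform: whichever lane is picked, the broadcast yields the same value,
  // so the choice of lane is unobservable.
  return laneValues.front();
}
```

In this model the arbitrary choice of source lane never affects a non-poison result, which is the sense in which a broadcast from the first active lane, a broadcast from a specific lane, or no broadcast at all would each be a valid lowering.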

kuhar · 2025-09-04T17:23:11Z

You can see how the LLVM language reference uses similar wording to define the freeze instruction: https://llvm.org/docs/LangRef.html#id334 . If you search for other uses of 'arbitrary', there are further examples that support using 'any' as the enum option.

joker-eph · 2025-09-04T17:29:56Z

@Jianhui-Li did not leave a blocking review -- this is the tool to indicate that you do not wish the PR be merged as is.

I don't agree: there is no policy on this. I never use this "tool" and still expect all open discussions or concerns I have on a PR to be blocking by default.
It is the author's responsibility to ensure that the discussion converges. Here it hasn't been the case and IMO this should not have been merged. Unless you get a straightforward consensus with @Jianhui-Li, please revert while discussing.

fabianmcg · 2025-09-04T19:30:27Z

After talking with @kuhar, we agreed on a partial revert of the any_lane option, followed by a new PR to discuss the matter further. Are you satisfied with this resolution @Jianhui-Li ?

Jianhui-Li · 2025-09-04T23:59:35Z

After talking with @kuhar, we agreed on a partial revert of the any_lane option, followed by a new PR to discuss the matter further. Are you satisfied with this resolution @Jianhui-Li ?

I am OK with the partial revert. Thanks for the discussion.

This partially reverts llvm#152808. Post-commit comments revealed that the `any_lane` variant hasn't been fully agreed upon at the time of landing.

kuhar · 2025-09-08T00:23:16Z

Partial revert PR for any_lane: #157373


llvmbot added mlir:gpu mlir labels Aug 8, 2025

Hardcode84 requested review from krzysz00 and kuhar August 8, 2025 22:29

Hardcode84 mentioned this pull request Aug 8, 2025

[mlir][amdgpu] Introduce assume_subgroup_uniform op #152740

Closed

Hardcode84 requested review from grypp and rengolin August 8, 2025 22:36

rengolin requested review from Jianhui-Li and silee2 August 8, 2025 23:28

krzysz00 reviewed Aug 9, 2025


kuhar reviewed Aug 9, 2025


Jianhui-Li reviewed Aug 11, 2025


Jianhui-Li reviewed Aug 12, 2025


Hardcode84 changed the title ~~[mlir][gpu] Add broadcast_lane op~~ [mlir][gpu] Add subgroup_broadcast op Aug 13, 2025

Hardcode84 force-pushed the broadcast-lane branch from 3416bf1 to 0bdc237 Compare August 16, 2025 07:30

add readlane type check

30d28ca

Hardcode84 force-pushed the broadcast-lane branch from 5c5cdcf to 30d28ca Compare August 29, 2025 14:07

Hardcode84 requested a review from fabianmcg as a code owner August 29, 2025 14:07

Hardcode84 merged commit 4880940 into llvm:main Aug 30, 2025
9 checks passed

Hardcode84 deleted the broadcast-lane branch August 30, 2025 06:25

kuhar mentioned this pull request Sep 8, 2025

[mlir][gpu] Revert gpu.subgroup_broadcast with any_lane #157373

Merged

kuhar added a commit that referenced this pull request Sep 8, 2025

[mlir][gpu] Revert gpu.subgroup_broadcast with any_lane (#157373)

2b3d3fc

This partially reverts #152808. Post-commit comments revealed that the `any_lane` variant hasn't been fully agreed upon at the time of landing.

Hardcode84 mentioned this pull request Sep 10, 2025

[mlir][gpu] Add gpu.subgroup_uniform op #157743

Open
