[mlir][gpu] Add subgroup_broadcast op #152808
Conversation
`broadcast_lane` allows broadcasting a value from one lane to all lanes in the subgroup. Supported modes:

* `first_lane` - broadcast the value from the first active lane in the subgroup.
* `lane` - broadcast the value from the specified lane; the lane index must be within the subgroup.
* `any_lane` - if the `src` value is uniform across all subgroup lanes, return it unchanged; otherwise the result is poison. This variant is essentially a uniformity hint for the compiler, conveying that a specific value is uniform across all subgroup lanes. Dropping an `any_lane` broadcast will not change the code semantics.
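For reference, the assembly exercised by the round-trip test added in this PR looks like this (`%arg0 : f32`, `%arg1 : i32`; note the op was renamed to `gpu.subgroup_broadcast` later in the review):

```mlir
%0 = gpu.broadcast_lane %arg0, first_lane : f32
%1 = gpu.broadcast_lane %arg0, any_lane : f32
%2 = gpu.broadcast_lane %arg0, lane %arg1 : f32
```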
@llvm/pr-subscribers-mlir @llvm/pr-subscribers-mlir-gpu Author: Ivan Butygin (Hardcode84)
Full diff: https://github.com/llvm/llvm-project/pull/152808.diff 7 Files Affected:
diff --git a/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td b/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
index f946bb731e2ca..6592a5c55b0c2 100644
--- a/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
+++ b/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
@@ -1517,7 +1517,7 @@ def GPU_GPUModuleOp : GPU_Op<"module", [
/// Sets the targets of the module.
void setTargets(ArrayRef<TargetAttrInterface> targets);
}];
-
+
let hasVerifier = 1;
}
@@ -3212,4 +3212,46 @@ def GPU_WarpExecuteOnLane0Op : GPU_Op<"warp_execute_on_lane_0",
}];
}
+def GPU_BroadcastType : I32EnumAttr<"BroadcastType",
+ "a lane to broadcast from",
+ [
+ I32EnumAttrCase<"first_lane", 0>,
+ I32EnumAttrCase<"any_lane", 1>,
+ I32EnumAttrCase<"lane", 2>
+ ]>{
+ let genSpecializedAttr = 0;
+ let cppNamespace = "::mlir::gpu";
+}
+def GPU_BroadcastTypeAttr : EnumAttr<GPU_Dialect, GPU_BroadcastType, "broadcast">;
+
+def GPU_BroadcastLaneOp : GPU_Op<"broadcast_lane",
+ [NoMemoryEffect, AllTypesMatch<["result", "src"]>,
+ DeclareOpInterfaceMethods<InferIntRangeInterface, ["inferResultRanges"]>,
+ DeclareOpInterfaceMethods<ConditionallySpeculatable, ["getSpeculatability"]>] #
+ ElementwiseMappable.traits>,
+ Arguments<(ins AnyType:$src,
+ Optional<I32>:$lane,
+ GPU_BroadcastTypeAttr:$broadcast_type)> {
+ let summary = "Broadcasts a value from a specific lane across the subgroup";
+ let description = [{
+ Broadcasts a value from one lane to all lanes in the subgroup.
+
+ The possible broadcast types are:
+
+ * `first_lane` - first active lane in the subgroup.
+ * `lane` - from the specified lane; the lane index must be within the subgroup.
+ * `any_lane` - if the `src` value is uniform across all the subgroup
+ lanes, return it unchanged, otherwise the result is poison. This variant
+ is essentially a uniformity hint for the compiler, conveying that a
+ specific value is uniform across all subgroup lanes. Dropping an `any_lane`
+ broadcast will not change the code semantics.
+ }];
+ let results = (outs AnyType:$result);
+ let assemblyFormat = [{
+ $src `,` $broadcast_type ($lane^)? attr-dict `:` type($result)
+ }];
+ let hasVerifier = 1;
+}
+
#endif // GPU_OPS
diff --git a/mlir/lib/Conversion/GPUToROCDL/LowerGpuOpsToROCDLOps.cpp b/mlir/lib/Conversion/GPUToROCDL/LowerGpuOpsToROCDLOps.cpp
index d22364e1ef441..4d081cefb5f35 100644
--- a/mlir/lib/Conversion/GPUToROCDL/LowerGpuOpsToROCDLOps.cpp
+++ b/mlir/lib/Conversion/GPUToROCDL/LowerGpuOpsToROCDLOps.cpp
@@ -160,6 +160,27 @@ struct GPUSubgroupSizeOpToROCDL : ConvertOpToLLVMPattern<gpu::SubgroupSizeOp> {
const amdgpu::Chipset chipset;
};
+struct GPUBroadcastLaneOpToROCDL
+ : public ConvertOpToLLVMPattern<gpu::BroadcastLaneOp> {
+ using ConvertOpToLLVMPattern::ConvertOpToLLVMPattern;
+
+ LogicalResult
+ matchAndRewrite(gpu::BroadcastLaneOp op, OpAdaptor adaptor,
+ ConversionPatternRewriter &rewriter) const override {
+ Value src = adaptor.getSrc();
+ if (adaptor.getBroadcastType() == gpu::BroadcastType::lane) {
+ rewriter.replaceOpWithNewOp<ROCDL::ReadlaneOp>(op, src.getType(), src,
+ adaptor.getLane());
+ } else { // first_lane or any_lane
+ // any_lane is lowered to readfirstlane too, to force value into scalar
+ // register.
+ rewriter.replaceOpWithNewOp<ROCDL::ReadfirstlaneOp>(op, src.getType(),
+ src);
+ }
+ return success();
+ }
+};
+
struct GPUShuffleOpLowering : public ConvertOpToLLVMPattern<gpu::ShuffleOp> {
using ConvertOpToLLVMPattern<gpu::ShuffleOp>::ConvertOpToLLVMPattern;
@@ -453,7 +474,9 @@ void mlir::populateGpuToROCDLConversionPatterns(
// TODO: Add alignment for workgroup memory
patterns.add<GPUDynamicSharedMemoryOpLowering>(converter);
- patterns.add<GPUShuffleOpLowering, GPULaneIdOpToROCDL>(converter);
+ patterns
+ .add<GPUShuffleOpLowering, GPULaneIdOpToROCDL, GPUBroadcastLaneOpToROCDL>(
+ converter);
patterns.add<GPUSubgroupSizeOpToROCDL>(converter, chipset);
populateMathToROCDLConversionPatterns(converter, patterns);
diff --git a/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp b/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
index 2503ccb6a2cfe..2b1adc90477af 100644
--- a/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
+++ b/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
@@ -2511,6 +2511,43 @@ bool WarpExecuteOnLane0Op::areTypesCompatible(Type lhs, Type rhs) {
verifyDistributedType(lhs, rhs, getWarpSize(), getOperation()));
}
+//===----------------------------------------------------------------------===//
+// GPU_BroadcastLaneOp
+//===----------------------------------------------------------------------===//
+
+void gpu::BroadcastLaneOp::inferResultRanges(
+ ArrayRef<ConstantIntRanges> argRanges, SetIntRangeFn setResultRange) {
+ setResultRange(getResult(), argRanges.front());
+}
+
+Speculation::Speculatability gpu::BroadcastLaneOp::getSpeculatability() {
+ switch (getBroadcastType()) {
+ case BroadcastType::first_lane:
+ // Cannot speculate first_lane broadcast, because speculating it across
+ // control flow can change the active lanes.
+ return Speculation::NotSpeculatable;
+ case BroadcastType::any_lane:
+ LLVM_FALLTHROUGH;
+ case BroadcastType::lane:
+ return Speculation::Speculatable;
+ }
+}
+
+LogicalResult gpu::BroadcastLaneOp::verify() {
+ switch (getBroadcastType()) {
+ case BroadcastType::first_lane:
+ LLVM_FALLTHROUGH;
+ case BroadcastType::any_lane:
+ if (getLane())
+ return emitOpError() << "lane can only be specified for lane broadcast";
+ return success();
+ case BroadcastType::lane:
+ if (!getLane())
+ return emitOpError() << "lane must be specified for lane broadcast";
+ return success();
+ }
+}
+
//===----------------------------------------------------------------------===//
// GPU KernelMetadataAttr
//===----------------------------------------------------------------------===//
diff --git a/mlir/test/Conversion/GPUToROCDL/gpu-to-rocdl.mlir b/mlir/test/Conversion/GPUToROCDL/gpu-to-rocdl.mlir
index 2b6adffc81f72..ed62e1c689ae3 100644
--- a/mlir/test/Conversion/GPUToROCDL/gpu-to-rocdl.mlir
+++ b/mlir/test/Conversion/GPUToROCDL/gpu-to-rocdl.mlir
@@ -701,7 +701,7 @@ gpu.module @test_module {
// CHECK: %[[#CAST_VALUE:]] = llvm.bitcast %[[#VALUE]] : f32 to i32
// CHECK: %[[#PERMUTE:]] = rocdl.ds_bpermute %[[#ALIGNED_DST_LANE]], %[[#CAST_VALUE]] : (i32, i32) -> i32
// CHECK: %[[#CAST_SHFL_VALUE:]] = llvm.bitcast %[[#PERMUTE]] : i32 to f32
- %shfli, %predi = gpu.shuffle idx %arg0, %arg1, %arg2 : f32
+ %shfli, %predi = gpu.shuffle idx %arg0, %arg1, %arg2 : f32
// *** UP mode shuffle ***
// CHECK: %[[#LANE_ID:]] = rocdl.mbcnt.hi
// CHECK: %[[#ZERO:]] = llvm.mlir.constant(0 : i32) : i32
@@ -776,3 +776,19 @@ gpu.module @test_module {
func.return %bDimX : index
}
}
+
+// -----
+
+gpu.module @test_module {
+// CHECK-LABEL: func @broadcast
+// CHECK-SAME: (%[[ARG:.*]]: i64, %[[IDX:.*]]: i32)
+func.func @broadcast(%arg0 : index, %arg1 : i32) -> (index, index, index) {
+// CHECK: %{{.*}} = rocdl.readfirstlane %[[ARG]] : i64
+// CHECK: %{{.*}} = rocdl.readfirstlane %[[ARG]] : i64
+// CHECK: %{{.*}} = rocdl.readlane %[[ARG]], %[[IDX]] : (i64, i32) -> i64
+ %0 = gpu.broadcast_lane %arg0, first_lane : index
+ %1 = gpu.broadcast_lane %arg0, any_lane : index
+ %2 = gpu.broadcast_lane %arg0, lane %arg1 : index
+ func.return %0, %1, %2 : index, index, index
+}
+}
diff --git a/mlir/test/Dialect/GPU/broadcast-speculatability.mlir b/mlir/test/Dialect/GPU/broadcast-speculatability.mlir
new file mode 100644
index 0000000000000..facbe8761c1fd
--- /dev/null
+++ b/mlir/test/Dialect/GPU/broadcast-speculatability.mlir
@@ -0,0 +1,23 @@
+// RUN: mlir-opt %s --loop-invariant-code-motion | FileCheck %s
+
+func.func private @side_effect(%arg0 : f32, %arg1 : f32, %arg2 : f32)
+
+// CHECK-LABEL: func @broadcast_hoisting
+// CHECK-SAME: (%[[ARG:.*]]: f32, %[[IDX:.*]]: i32)
+func.func @broadcast_hoisting(%arg0 : f32, %arg1 : i32) {
+ %c0 = arith.constant 0 : index
+ %c1 = arith.constant 1 : index
+ %c10 = arith.constant 10 : index
+// CHECK: %[[V1:.*]] = gpu.broadcast_lane %[[ARG]], any_lane : f32
+// CHECK: %[[V2:.*]] = gpu.broadcast_lane %[[ARG]], lane %[[IDX]] : f32
+// CHECK: scf.for
+// CHECK: %[[V0:.*]] = gpu.broadcast_lane %[[ARG]], first_lane : f32
+// CHECK: func.call @side_effect(%[[V0]], %[[V1]], %[[V2]])
+ scf.for %i = %c0 to %c10 step %c1 {
+ %0 = gpu.broadcast_lane %arg0, first_lane : f32
+ %1 = gpu.broadcast_lane %arg0, any_lane : f32
+ %2 = gpu.broadcast_lane %arg0, lane %arg1 : f32
+ func.call @side_effect(%0, %1, %2) : (f32, f32, f32) -> ()
+ }
+ func.return
+}
diff --git a/mlir/test/Dialect/GPU/int-range-interface.mlir b/mlir/test/Dialect/GPU/int-range-interface.mlir
index 1613f83b17bde..cfb99283652a2 100644
--- a/mlir/test/Dialect/GPU/int-range-interface.mlir
+++ b/mlir/test/Dialect/GPU/int-range-interface.mlir
@@ -329,3 +329,22 @@ module attributes {gpu.container_module} {
}
}
}
+
+// -----
+
+// CHECK-LABEL: func @broadcast
+func.func @broadcast(%idx: i32) {
+ %0 = test.with_bounds { umin = 0 : index, umax = 10 : index, smin = 0 : index, smax = 10 : index } : index
+ %1 = gpu.broadcast_lane %0, first_lane : index
+ %2 = gpu.broadcast_lane %0, any_lane : index
+ %3 = gpu.broadcast_lane %0, lane %idx : index
+
+ // CHECK: test.reflect_bounds {smax = 10 : index, smin = 0 : index, umax = 10 : index, umin = 0 : index}
+ // CHECK: test.reflect_bounds {smax = 10 : index, smin = 0 : index, umax = 10 : index, umin = 0 : index}
+ // CHECK: test.reflect_bounds {smax = 10 : index, smin = 0 : index, umax = 10 : index, umin = 0 : index}
+
+ %4 = test.reflect_bounds %1 : index
+ %5 = test.reflect_bounds %2 : index
+ %6 = test.reflect_bounds %3 : index
+ return
+}
diff --git a/mlir/test/Dialect/GPU/ops.mlir b/mlir/test/Dialect/GPU/ops.mlir
index 9cc0bf8f41d5a..95b6d21097a37 100644
--- a/mlir/test/Dialect/GPU/ops.mlir
+++ b/mlir/test/Dialect/GPU/ops.mlir
@@ -126,7 +126,7 @@ module attributes {gpu.container_module} {
// CHECK-NEXT: %{{.*}} = arith.addf %{{.*}}, %{{.*}} : f32
// CHECK-NEXT: gpu.yield %{{.*}} : f32
// CHECK-NEXT: } : (f32) -> f32
- %sum2 = gpu.all_reduce %one {
+ %sum2 = gpu.all_reduce %one {
^bb(%lhs : f32, %rhs : f32):
%tmp = arith.addf %lhs, %rhs : f32
gpu.yield %tmp : f32
@@ -259,7 +259,7 @@ module attributes {gpu.container_module} {
%1 = arith.cmpi slt, %arg0, %arg0 : i32
scf.if %1 {
gpu.printf ", "
- }
+ }
gpu.return
}
@@ -542,3 +542,15 @@ func.func @warp_operand_result(%laneid: index, %v0 : vector<4xi32>) -> (vector<4
}
return %2 : vector<4xi32>
}
+
+// CHECK-LABEL: func @broadcast_lane
+// CHECK-SAME: (%[[ARG:.*]]: f32, %[[IDX:.*]]: i32)
+func.func @broadcast_lane(%arg0 : f32, %arg1 : i32) -> (f32, f32, f32) {
+ // CHECK: gpu.broadcast_lane %[[ARG]], first_lane : f32
+ %0 = gpu.broadcast_lane %arg0, first_lane : f32
+ // CHECK: gpu.broadcast_lane %[[ARG]], any_lane : f32
+ %1 = gpu.broadcast_lane %arg0, any_lane : f32
+ // CHECK: gpu.broadcast_lane %[[ARG]], lane %[[IDX]] : f32
+ %2 = gpu.broadcast_lane %arg0, lane %arg1 : f32
+ func.return %0, %1, %2 : f32, f32, f32
+}
    return Speculation::NotSpeculatable;
  case BroadcastType::any_lane:
    LLVM_FALLTHROUGH;
  case BroadcastType::lane:
Hold on, is readlane speculatable?
From the past discussion #152551 (comment): I think structured control flow guarantees that your parents are at least as active as you, so this seems safe to me.
Does speculatability only get used in structured control flow, though?
You can speculate within a single block for sure, but practically speaking we don't have anything that speculates across blocks in MLIR AFAIK. This is only being used by LICM. We can say that `any_lane` includes active and inactive lanes to make this unambiguous.
technically, CSE can merge ops across blocks:

    %1 = gpu.broadcast %src, lane %idx
    scf.if lane_id < 13 {
      %2 = gpu.broadcast %src, lane %idx
    }

not sure if it's a problem, though
If your op semantics include inactive lanes, this is fine.
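To make the `first_lane` case concrete, here is a small hypothetical sketch (the `@use` callee and `%cond` are illustrative, not from the patch) of why the op is marked `NotSpeculatable`:

```mlir
func.func private @use(f32)

func.func @sketch(%cond: i1, %v: f32) {
  scf.if %cond {
    // Only lanes where %cond is true are active here, so first_lane
    // broadcasts from the first lane that entered the scf.if.
    %0 = gpu.broadcast_lane %v, first_lane : f32
    func.call @use(%0) : (f32) -> ()
  }
  // Hoisting the broadcast above the scf.if would execute it with all
  // lanes active, potentially picking a different source lane.
  return
}
```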
How are the `first_lane` broadcast semantics different from `gpu.shuffle idx`? For example: can the broadcast lane semantics be implemented with `gpu.shuffle down` + `gpu.shuffle idx`? Why does the compiler hint have to be part of the broadcast op semantics as `any_lane`?
Unlike …
As far as I understand, there is a difference between shuffle and broadcast here, but this distinction exists only for AMDGPU. I'm not familiar with Intel GPUs or others, but NVIDIA GPUs don't have this distinction at the PTX level. I'd recommend implementing this op in the AMD-specific dialect rather than in the target-agnostic GPU dialect.
This op is inspired by the operations in the Vulkan subgroup extension: https://www.khronos.org/blog/vulkan-subgroup-tutorial#:~:text=T-,subgroupBroadcast,-(T%20value%2C%20uint and is meant to have a performant lowering that uses much cheaper instructions than shuffles. Shuffles are not the universally best abstraction for modern chips, which have much more performant primitives with narrower semantics (e.g., https://github.com/nod-ai/shark-ai/blob/main/docs/amdgpu_kernel_optimization_guide.md#data-parallel-primitives-and-warp-level-reduction + the new v_permlane in CDNA4). If we restrict ourselves to shuffles, we have to do some heavy pattern matching / idiom recognition and make sure that the emitter caters towards these.
In my view, broadcasting a lane is a fundamental primitive both at the level of the SIMT programming model and the hardware. It can be emulated with shuffles, but not efficiently without some idiom recognition.
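For illustration, a sketch of the shuffle-based emulation under discussion, assuming `%width` holds the subgroup size (the validity bit is ignored):

```mlir
// Every lane reads from the same (uniform) lane %idx, emulating
// `gpu.broadcast_lane %val, lane %idx`. A backend would need idiom
// recognition to turn this back into a cheap readlane-style instruction.
%bcast, %valid = gpu.shuffle idx %val, %idx, %width : f32
```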
It's portable and exists at the level of Vulkan / SPIR-V: https://www.khronos.org/blog/vulkan-subgroup-tutorial#:~:text=T-,subgroupBroadcast,-(T%20value%2C%20uint . We will implement the conversion to SPIR-V in a follow-up PR.
    * `any_lane` - if `src` value is uniform across all the subgroup
      lanes return it unchanged, otherwise result is poison. This variant
      essentially an uniformity hint for the compiler, conveying that
      specific value is uniform across all subgroup lanes. Dropping `any_lane`
What do you mean by dropping? I think you mean changing to a specific lane or first_lane?
    * `first_lane` - first active lane in subgroup.
    * `lane` - from the specified lane, lane index must be withing subgroup.
    * `any_lane` - if `src` value is uniform across all the subgroup
I'd say that this returns a value from any lane within the subgroup, active or inactive, assuming uniformity.
ok, I did reword the doc somewhat
    }];
    let results = (outs AnyType:$result);
    let assemblyFormat = [{
      $src `,` $broadcast_type ($lane^)? attr-dict `:` type($result)
Maybe put the enum before the operand, like with `gpu.subgroup_reduce`?
I chose the current order to be able to write `gpu.broadcast_lane %src, lane %idx : f32`, i.e. having the lane index directly after the `lane` keyword.
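For comparison, the two orderings side by side (the `subgroup_reduce` line reflects my recollection of its syntax, not this patch):

```mlir
// Enum before the operand, as in gpu.subgroup_reduce:
%sum = gpu.subgroup_reduce add %val : (f32) -> f32
// Operand first, as proposed here, so the index directly follows `lane`:
%0 = gpu.broadcast_lane %src, lane %idx : f32
```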
The proposed subgroup uniform broadcast operation doesn't have support on Intel GPUs. Since it is also unsupported on NVIDIA GPUs, I believe it would be more appropriate to place this operation in the AMDGPU dialect until it is supported by multiple GPU vendors.
      subgroup.
    * `lane` - broadcasts from the specified lane. The lane index must be
      uniform and within the subgroup size. The result is poison if the lane
      index is invalid or non-subgroup-uniform.
What are the semantics if the specified lane is an inactive lane?
The result will be undefined/poison; I will update the docs, thanks.
Please also describe the behavior when the specified lane is out of range (e.g. larger than the subgroup size).
Updated the description
    The possible broadcast types are:

    * `first_lane` - broadcasts the value from the first active lane in the
When I first read `first_lane`, I thought it was lane 0. I think it is important to keep "active" in the attribute name. Consider: `first_active_lane`, `specific_lane`, and `any_lane`.
yeah, makes sense
done
    }
    def GPU_BroadcastTypeAttr : EnumAttr<GPU_Dialect, GPU_BroadcastType, "broadcast">;

    def GPU_BroadcastLaneOp : GPU_Op<"broadcast_lane",
Consider renaming `broadcast_lane` to `subgroup_uniform_broadcast`.
How about just `gpu.subgroup_broadcast`?
Agree. `uniform` is an ambiguous term and requires better documentation in the op definition. The corresponding SPIR-V function actually has `non_uniform`, indicating that it doesn't require all participating lanes to be active.
done
I think it does
The one you refer to is not an officially supported interface. The closest one I identified is the following: `// non-uniform broadcasts the first active value`. These non-uniform ops only guarantee the result is uniform for all active lanes, not all lanes. The SPIR-V dialect actually already includes these operations, so I assumed these were different. I may misunderstand the semantics of the proposed broadcast_lane op, as the definition says "broadcast the value from one lane to all lanes in subgroup". Could you please clarify whether "all lanes" refers just to all active lanes?
Yeah, "broadcast the value from one lane to all active lanes in subgroup" definition will be better. Although I'm not sure it will make any practical difference as we don't have any way to access inactive lanes in gpu dialect currently I think. |
        GPU_BroadcastTypeAttr:$broadcast_type)> {
      let summary = "Broadcasts a value from the specific lane across subgroup";
      let description = [{
        Broadcasts a value from one lane to all lanes in a subgroup. The
I suggest describing the following semantics explicitly:
- Does it require all lanes to be active?
- For inactive lanes, how does this operation affect them, if at all?
Updated the description
    } else { // first_lane or any_lane
      // any_lane is lowered to readfirstlane too, to force value into scalar
      // register.
      rewriter.replaceOpWithNewOp<ROCDL::ReadfirstlaneOp>(op, src.getType(),
As the `any_lane` semantics get lost during lowering, I assume that `any_lane` is designed to help MLIR optimizations at the GPU dialect level and is target-independent. Could you please describe what optimization passes you plan for `any_lane`, and what use case you are targeting?
`any_lane` allows the compiler to speculate freely across control flow; see the mlir/test/Dialect/GPU/broadcast-speculatability.mlir test.
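Abridged from that test: after `--loop-invariant-code-motion`, the `any_lane` and `lane` broadcasts are hoisted out of the loop while `first_lane` stays inside:

```mlir
%1 = gpu.broadcast_lane %arg0, any_lane : f32
%2 = gpu.broadcast_lane %arg0, lane %arg1 : f32
scf.for %i = %c0 to %c10 step %c1 {
  %0 = gpu.broadcast_lane %arg0, first_lane : f32
  func.call @side_effect(%0, %1, %2) : (f32, f32, f32) -> ()
}
```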
broadcast should not change the code semantics.