-
Notifications
You must be signed in to change notification settings - Fork 15.4k
[AMDGPU] Add llvm.amdgcn.dead intrinsic #123190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@llvm/pr-subscribers-llvm-analysis @llvm/pr-subscribers-backend-amdgpu Author: Diana Picus (rovka) ChangesShaders that use the llvm.amdgcn.init.whole.wave intrinsic need to Unfortunately, this kind of phi node will get optimized away and the This patch fixes the issue by introducing a llvm.amdgcn.dead intrinsic, At the moment, the llvm.amdgcn.dead intrinsic works only on i32 values. Full diff: https://github.com/llvm/llvm-project/pull/123190.diff 5 Files Affected:
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index b529642a558710..3a3e7949aa770f 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -3442,4 +3442,11 @@ def int_amdgcn_addrspacecast_nonnull : DefaultAttrsIntrinsic<
[llvm_anyptr_ty], [llvm_anyptr_ty],
[IntrNoMem, IntrSpeculatable]
>;
+
+/// Make it clear to the backend that this value is really dead. For instance,
+/// when used as an input to a phi node, it will make it possible for the
+/// backend to allocate the dead lanes for operations within the corresponding
+/// incoming block.
+def int_amdgcn_dead: DefaultAttrsIntrinsic<[llvm_i32_ty], [],
+ [IntrNoMem, IntrWillReturn, IntrNoCallback]>;
}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index 926c1e4b23b4a1..702e62fb6366da 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -1108,6 +1108,12 @@ bool AMDGPUInstructionSelector::selectG_INTRINSIC(MachineInstr &I) const {
case Intrinsic::amdgcn_permlane16_swap:
case Intrinsic::amdgcn_permlane32_swap:
return selectPermlaneSwapIntrin(I, IntrinsicID);
+ case Intrinsic::amdgcn_dead: {
+ I.setDesc(TII.get(TargetOpcode::IMPLICIT_DEF));
+ I.removeOperand(1); // drop intrinsic ID
+ return RBI.constrainGenericRegister(I.getOperand(0).getReg(),
+ AMDGPU::VGPR_32RegClass, *MRI);
+ }
default:
return selectImpl(I, *CoverageInfo);
}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
index 224c368cff4a1f..e1ef60e669b1f3 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -4675,6 +4675,7 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
case Intrinsic::amdgcn_set_inactive:
case Intrinsic::amdgcn_set_inactive_chain_arg:
case Intrinsic::amdgcn_permlane64:
+ case Intrinsic::amdgcn_dead:
return getDefaultMappingAllVGPR(MI);
case Intrinsic::amdgcn_cvt_pkrtz:
if (Subtarget.hasSALUFloatInsts() && isSALUMapping(MI))
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index cdc1132579d8d8..81163180058072 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -4145,3 +4145,9 @@ def V_ILLEGAL : Enc32, InstSI<(outs), (ins), "v_illegal"> {
let hasSideEffects = 1;
let SubtargetPredicate = isGFX10Plus;
}
+
+// FIXME: Would be nice if we could set the register class for the destination
+// register too.
+def IMP_DEF_FROM_INTRINSIC: Pat<
+ (i32 (int_amdgcn_dead)), (IMPLICIT_DEF)>;
+
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.init.whole.wave-w32.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.init.whole.wave-w32.ll
index 353f4d90cad1f2..07d82e1243f713 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.init.whole.wave-w32.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.init.whole.wave-w32.ll
@@ -1124,4 +1124,142 @@ tail:
unreachable
}
+; Since functions that contain amdgcn.init.whole.wave do not preserve the inactive
+; lanes of any VGPRs, the middle end will explicitly preserve them if needed by adding
+; dummy VGPR arguments. Since only the inactive lanes are important, we need to make
+; it clear to the backend that it's safe to allocate v9's active lanes inside
+; shader. This is achieved by using the llvm.amdgcn.dead intrinsic.
+define amdgpu_cs_chain void @with_inactive_vgprs(ptr inreg %callee, i32 inreg %exec, i32 inreg %sgpr, i32 %active.vgpr, i32 %inactive.vgpr) {
+; GISEL12-LABEL: with_inactive_vgprs:
+; GISEL12: ; %bb.0: ; %entry
+; GISEL12-NEXT: s_wait_loadcnt_dscnt 0x0
+; GISEL12-NEXT: s_wait_expcnt 0x0
+; GISEL12-NEXT: s_wait_samplecnt 0x0
+; GISEL12-NEXT: s_wait_bvhcnt 0x0
+; GISEL12-NEXT: s_wait_kmcnt 0x0
+; GISEL12-NEXT: s_or_saveexec_b32 s6, -1
+; GISEL12-NEXT: s_mov_b32 s4, s0
+; GISEL12-NEXT: s_mov_b32 s5, s1
+; GISEL12-NEXT: s_mov_b32 s0, s3
+; GISEL12-NEXT: s_wait_alu 0xfffe
+; GISEL12-NEXT: s_and_saveexec_b32 s1, s6
+; GISEL12-NEXT: s_cbranch_execz .LBB6_2
+; GISEL12-NEXT: ; %bb.1: ; %shader
+; GISEL12-NEXT: v_dual_mov_b32 v10, s5 :: v_dual_mov_b32 v9, s4
+; GISEL12-NEXT: flat_load_b32 v11, v[9:10]
+; GISEL12-NEXT: ;;#ASMSTART
+; GISEL12-NEXT: ; use v0-7
+; GISEL12-NEXT: ;;#ASMEND
+; GISEL12-NEXT: s_wait_loadcnt_dscnt 0x0
+; GISEL12-NEXT: v_add_nc_u32_e32 v8, v8, v11
+; GISEL12-NEXT: flat_store_b32 v[9:10], v11
+; GISEL12-NEXT: ; implicit-def: $vgpr9
+; GISEL12-NEXT: .LBB6_2: ; %tail.block
+; GISEL12-NEXT: s_wait_alu 0xfffe
+; GISEL12-NEXT: s_or_b32 exec_lo, exec_lo, s1
+; GISEL12-NEXT: s_mov_b32 exec_lo, s2
+; GISEL12-NEXT: s_wait_alu 0xfffe
+; GISEL12-NEXT: s_setpc_b64 s[4:5]
+;
+; DAGISEL12-LABEL: with_inactive_vgprs:
+; DAGISEL12: ; %bb.0: ; %entry
+; DAGISEL12-NEXT: s_wait_loadcnt_dscnt 0x0
+; DAGISEL12-NEXT: s_wait_expcnt 0x0
+; DAGISEL12-NEXT: s_wait_samplecnt 0x0
+; DAGISEL12-NEXT: s_wait_bvhcnt 0x0
+; DAGISEL12-NEXT: s_wait_kmcnt 0x0
+; DAGISEL12-NEXT: s_or_saveexec_b32 s6, -1
+; DAGISEL12-NEXT: s_mov_b32 s5, s1
+; DAGISEL12-NEXT: s_mov_b32 s4, s0
+; DAGISEL12-NEXT: s_wait_alu 0xfffe
+; DAGISEL12-NEXT: s_and_saveexec_b32 s0, s6
+; DAGISEL12-NEXT: s_cbranch_execz .LBB6_2
+; DAGISEL12-NEXT: ; %bb.1: ; %shader
+; DAGISEL12-NEXT: v_dual_mov_b32 v10, s5 :: v_dual_mov_b32 v9, s4
+; DAGISEL12-NEXT: flat_load_b32 v11, v[9:10]
+; DAGISEL12-NEXT: ;;#ASMSTART
+; DAGISEL12-NEXT: ; use v0-7
+; DAGISEL12-NEXT: ;;#ASMEND
+; DAGISEL12-NEXT: s_wait_loadcnt_dscnt 0x0
+; DAGISEL12-NEXT: v_add_nc_u32_e32 v8, v8, v11
+; DAGISEL12-NEXT: flat_store_b32 v[9:10], v11
+; DAGISEL12-NEXT: ; implicit-def: $vgpr9
+; DAGISEL12-NEXT: .LBB6_2: ; %tail.block
+; DAGISEL12-NEXT: s_wait_alu 0xfffe
+; DAGISEL12-NEXT: s_or_b32 exec_lo, exec_lo, s0
+; DAGISEL12-NEXT: s_mov_b32 s0, s3
+; DAGISEL12-NEXT: s_mov_b32 exec_lo, s2
+; DAGISEL12-NEXT: s_wait_alu 0xfffe
+; DAGISEL12-NEXT: s_setpc_b64 s[4:5]
+;
+; GISEL10-LABEL: with_inactive_vgprs:
+; GISEL10: ; %bb.0: ; %entry
+; GISEL10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GISEL10-NEXT: s_or_saveexec_b32 s6, -1
+; GISEL10-NEXT: s_mov_b32 s4, s0
+; GISEL10-NEXT: s_mov_b32 s5, s1
+; GISEL10-NEXT: s_mov_b32 s0, s3
+; GISEL10-NEXT: s_and_saveexec_b32 s1, s6
+; GISEL10-NEXT: s_cbranch_execz .LBB6_2
+; GISEL10-NEXT: ; %bb.1: ; %shader
+; GISEL10-NEXT: v_mov_b32_e32 v10, s5
+; GISEL10-NEXT: v_mov_b32_e32 v9, s4
+; GISEL10-NEXT: flat_load_dword v11, v[9:10]
+; GISEL10-NEXT: ;;#ASMSTART
+; GISEL10-NEXT: ; use v0-7
+; GISEL10-NEXT: ;;#ASMEND
+; GISEL10-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; GISEL10-NEXT: v_add_nc_u32_e32 v8, v8, v11
+; GISEL10-NEXT: flat_store_dword v[9:10], v11
+; GISEL10-NEXT: ; implicit-def: $vgpr9
+; GISEL10-NEXT: .LBB6_2: ; %tail.block
+; GISEL10-NEXT: s_or_b32 exec_lo, exec_lo, s1
+; GISEL10-NEXT: s_mov_b32 exec_lo, s2
+; GISEL10-NEXT: s_setpc_b64 s[4:5]
+;
+; DAGISEL10-LABEL: with_inactive_vgprs:
+; DAGISEL10: ; %bb.0: ; %entry
+; DAGISEL10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; DAGISEL10-NEXT: s_or_saveexec_b32 s6, -1
+; DAGISEL10-NEXT: s_mov_b32 s5, s1
+; DAGISEL10-NEXT: s_mov_b32 s4, s0
+; DAGISEL10-NEXT: s_and_saveexec_b32 s0, s6
+; DAGISEL10-NEXT: s_cbranch_execz .LBB6_2
+; DAGISEL10-NEXT: ; %bb.1: ; %shader
+; DAGISEL10-NEXT: v_mov_b32_e32 v10, s5
+; DAGISEL10-NEXT: v_mov_b32_e32 v9, s4
+; DAGISEL10-NEXT: flat_load_dword v11, v[9:10]
+; DAGISEL10-NEXT: ;;#ASMSTART
+; DAGISEL10-NEXT: ; use v0-7
+; DAGISEL10-NEXT: ;;#ASMEND
+; DAGISEL10-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; DAGISEL10-NEXT: v_add_nc_u32_e32 v8, v8, v11
+; DAGISEL10-NEXT: flat_store_dword v[9:10], v11
+; DAGISEL10-NEXT: ; implicit-def: $vgpr9
+; DAGISEL10-NEXT: .LBB6_2: ; %tail.block
+; DAGISEL10-NEXT: s_or_b32 exec_lo, exec_lo, s0
+; DAGISEL10-NEXT: s_mov_b32 s0, s3
+; DAGISEL10-NEXT: s_mov_b32 exec_lo, s2
+; DAGISEL10-NEXT: s_setpc_b64 s[4:5]
+entry:
+ %0 = call i32 @llvm.amdgcn.dead()
+ %1 = call i1 @llvm.amdgcn.init.whole.wave()
+ br i1 %1, label %shader, label %tail.block
+
+shader: ; preds = %entry
+ %use.another.vgpr = load i32, ptr %callee ; smth that won't be moved past the inline asm
+ call void asm sideeffect "; use v0-7", "~{v0},~{v1},~{v2},~{v3},~{v4},~{v5},~{v6},~{v7}"()
+ store i32 %use.another.vgpr, ptr %callee
+ %active.vgpr.new = add i32 %active.vgpr, %use.another.vgpr
+ br label %tail.block
+
+tail.block: ; preds = %.exit27, %.exit49, %244, %243, %entry
+ %active.vgpr.arg = phi i32 [ %active.vgpr, %entry ], [ %active.vgpr.new, %shader ]
+ %inactive.vgpr.arg = phi i32 [ %inactive.vgpr, %entry ], [ %0, %shader ]
+ %vgprs.0 = insertvalue { i32, i32 } poison, i32 %active.vgpr.arg, 0
+ %vgprs = insertvalue { i32, i32 } %vgprs.0, i32 %inactive.vgpr.arg, 1
+ call void (ptr, i32, i32, { i32, i32 }, i32, ...) @llvm.amdgcn.cs.chain.p0.i32.i32.sl_i32i32(ptr inreg %callee, i32 inreg %exec, i32 inreg %sgpr, { i32, i32} %vgprs, i32 0)
+ unreachable
+}
+
declare amdgpu_gfx <16 x i32> @write_v0_v15(<16 x i32>)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs test in test/Analysis/UniformityAnalysis
Does InstSimplify/InstCombine optimize away a phi if all but one of its inputs is undefined? I can see why it might be tempting to do that to simplify the IR, but it also increases liveness (and hence register pressure) in a way that is impossible to undo later. @nikic do you have any thoughts/comments about this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use named values in tests. Should also have a dedicated test with just the intrinsic uses in it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So is this just convergent poison? Why isn't it IntrConvergent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because in my current usage it's ok for it to be sunk from the entry block into shader. In my very limited understanding, convergent would preclude that (but do correct me if I'm wrong).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But is hosting correct? In the future with convergence tokens this will be explicit for where movement can occur
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think hoisting should be correct too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, hoisting is semantically correct, just undesired in practice.
FWIW phi node elimination makes similar decisions (i.e. even if the middle-end leaves the phi nodes in place, we still get the long live ranges after phi-node-elim). |
|
Ping |
This is going to look better when using amdgcn.dead instead.
Shaders that use the llvm.amdgcn.init.whole.wave intrinsic need to
explicitly preserve the inactive lanes of VGPRs of interest by adding
them as dummy arguments. The code usually looks something like this:
```
define amdgcn_cs_chain void f(active vgpr args..., i32 %inactive.vgpr1, ..., i32 %inactive.vgprN) {
entry:
%c = call i1 @llvm.amdgcn.init.whole.wave()
br i1 %c, label %shader, label %tail
shader:
[...]
tail:
%inactive.vgpr.arg1 = phi i32 [ %inactive.vgpr1, %entry], [poison, %shader]
[...]
; %inactive.vgpr* then get passed into a llvm.amdgcn.cs.chain call
```
Unfortunately, this kind of phi node will get optimized away and the
backend won't be able to figure out that it's ok to use the active lanes
of `%inactive.vgpr*` inside `shader`.
This patch fixes the issue by introducing a llvm.amdgcn.dead intrinsic,
whose result can be used inside the PHI instead of the poison. This will
be selected to an IMPLICIT_DEF, which the backend can work with.
At the moment, the llvm.amdgcn.dead intrinsic works only on i32 values.
Support for other types can be added later if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, hoisting is semantically correct, just undesired in practice.
| /// when used as an input to a phi node, it will make it possible for the | ||
| /// backend to allocate the dead lanes for operations within the corresponding | ||
| /// incoming block. | ||
| def int_amdgcn_dead: DefaultAttrsIntrinsic<[llvm_i32_ty], [], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to overload this as llvm_any_ty at some point, but I suppose for now this is okay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should just start out with it with mangling
Shaders that use the llvm.amdgcn.init.whole.wave intrinsic need to
explicitly preserve the inactive lanes of VGPRs of interest by adding
them as dummy arguments. The code usually looks something like this:
Unfortunately, this kind of phi node will get optimized away and the
backend won't be able to figure out that it's ok to use the active lanes
of
%inactive.vgpr*insideshader.This patch fixes the issue by introducing a llvm.amdgcn.dead intrinsic,
whose result can be used as a PHI operand instead of the poison. This will
be selected to an IMPLICIT_DEF, which the backend can work with.
At the moment, the llvm.amdgcn.dead intrinsic works only on i32 values.
Support for other types can be added later if needed.