
Conversation

@GinShio

@GinShio GinShio commented Nov 11, 2024

Sometimes we know the value is uniform, but the backend cannot easily prove that it is.

This change introduces the intrinsic readanylane, which is similar to readfirstlane, but has a couple of advantages:

  • It can be sunk into control flow.
  • If the result is needed in a vgpr then the v_readfirstlane instruction can be optimized away.

CC: @arsenm, @jayfoad, @nhaehnle, @ruiling
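
As a rough illustration (a sketch, not code from this patch), the intended usage looks something like the IR below: the producer knows %v is uniform, and readanylane communicates that to the backend. The function and value names are made up.

declare i32 @llvm.amdgcn.readanylane.i32(i32)

define void @use_readanylane(ptr addrspace(1) %out, i32 %v) {
  ; %v is known by the caller to be uniform, but uniformity analysis cannot
  ; prove it. readanylane lets the backend keep the value in an SGPR; the
  ; result is undefined if %v is actually divergent.
  %u = call i32 @llvm.amdgcn.readanylane.i32(i32 %v)
  store i32 %u, ptr addrspace(1) %out
  ret void
}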

Sometimes we know the value is uniform, but the backend cannot easily prove
that it is.

This change introduces the intrinsic `readanylane`, which is similar to
readfirstlane, but has a couple of advantages:
 + It isn't convergent, so it can be moved across control flow.
 + If the result is needed in a vgpr then the v_readfirstlane
   instruction can be optimized away.
@github-actions

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Member

llvmbot commented Nov 11, 2024

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-llvm-ir

Author: Xin "Russell" Liu (GinShio)

Changes

Sometimes we know the value is uniform, but the backend cannot easily prove that it is.

This change introduces the intrinsic readanylane, which is similar to readfirstlane, but has a couple of advantages:

  • It isn't convergent, so it can be moved across control flow.
  • If the result is needed in a vgpr then the v_readfirstlane instruction can be optimized away.

CC: @arsenm, @jayfoad, @nhaehnle, @ruiling


Patch is 38.89 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/115696.diff

14 Files Affected:

  • (modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+6)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp (+3)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp (+1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp (+7-2)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp (+2)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp (+2-1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp (+2)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td (+1)
  • (modified) llvm/lib/Target/AMDGPU/SIFoldOperands.cpp (+14-11)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+2)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+3-1)
  • (modified) llvm/lib/Target/AMDGPU/SIInstructions.td (+9)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.readanylane.ll (+492)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.readanylane.ptr.ll (+126)
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index d6375ab77cfb32..bb7931d4a95c92 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -2152,6 +2152,12 @@ def int_amdgcn_readfirstlane :
   Intrinsic<[llvm_any_ty], [LLVMMatchType<0>],
             [IntrNoMem, IntrConvergent, IntrWillReturn, IntrNoCallback, IntrNoFree]>;
 
+// This is similar to readfirstlane, but marks the value as uniform, allowing it to be sunk into or hoisted out of
+// control flow. The result is undefined if the value is actually divergent.
+def int_amdgcn_readanylane :
+  Intrinsic<[llvm_any_ty], [LLVMMatchType<0>],
+            [IntrNoCallback, IntrNoFree, IntrNoMem, IntrSpeculatable, IntrWillReturn]>;
+
 // The lane argument must be uniform across the currently active threads of the
 // current wave. Otherwise, the result is undefined.
 def int_amdgcn_readlane :
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
index e3a330d45aaa57..edd8e042d3f4b6 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
@@ -2775,6 +2775,9 @@ void AMDGPUDAGToDAGISel::SelectINTRINSIC_WO_CHAIN(SDNode *N) {
   case Intrinsic::amdgcn_strict_wqm:
     Opcode = AMDGPU::STRICT_WQM;
     break;
+  case Intrinsic::amdgcn_readanylane:
+    Opcode = AMDGPU::SI_READANYLANE;
+    break;
   case Intrinsic::amdgcn_interp_p1_f16:
     SelectInterpP1F16(N);
     return;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
index 8beb9defee66a0..3bdc258f180f88 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
@@ -1081,6 +1081,7 @@ GCNTTIImpl::instCombineIntrinsic(InstCombiner &IC, IntrinsicInst &II) const {
   }
   case Intrinsic::amdgcn_permlane64:
   case Intrinsic::amdgcn_readfirstlane:
+  case Intrinsic::amdgcn_readanylane:
   case Intrinsic::amdgcn_readlane: {
     // If the first argument is uniform these intrinsics return it unchanged.
     const Use &Src = II.getArgOperandUse(0);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index d51d136ba4200c..a5e984bde0e6c4 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -25,6 +25,7 @@
 #include "llvm/CodeGen/GlobalISel/MIPatternMatch.h"
 #include "llvm/CodeGen/GlobalISel/MachineIRBuilder.h"
 #include "llvm/CodeGen/MachineFrameInfo.h"
+#include "llvm/CodeGen/TargetRegisterInfo.h"
 #include "llvm/IR/DiagnosticInfo.h"
 #include "llvm/IR/IntrinsicsAMDGPU.h"
 #include <optional>
@@ -97,9 +98,11 @@ bool AMDGPUInstructionSelector::isVCC(Register Reg,
 
 bool AMDGPUInstructionSelector::constrainCopyLikeIntrin(MachineInstr &MI,
                                                         unsigned NewOpc) const {
+  const bool NeedExec = NewOpc != AMDGPU::SI_READANYLANE;
   MI.setDesc(TII.get(NewOpc));
   MI.removeOperand(1); // Remove intrinsic ID.
-  MI.addOperand(*MF, MachineOperand::CreateReg(AMDGPU::EXEC, false, true));
+  if (NeedExec)
+    MI.addOperand(*MF, MachineOperand::CreateReg(AMDGPU::EXEC, false, true));
 
   MachineOperand &Dst = MI.getOperand(0);
   MachineOperand &Src = MI.getOperand(1);
@@ -112,7 +115,7 @@ bool AMDGPUInstructionSelector::constrainCopyLikeIntrin(MachineInstr &MI,
     = TRI.getConstrainedRegClassForOperand(Dst, *MRI);
   const TargetRegisterClass *SrcRC
     = TRI.getConstrainedRegClassForOperand(Src, *MRI);
-  if (!DstRC || DstRC != SrcRC)
+  if (!DstRC || (NeedExec && DstRC != SrcRC))
     return false;
 
   return RBI.constrainGenericRegister(Dst.getReg(), *DstRC, *MRI) &&
@@ -1061,6 +1064,8 @@ bool AMDGPUInstructionSelector::selectG_INTRINSIC(MachineInstr &I) const {
     return constrainCopyLikeIntrin(I, AMDGPU::STRICT_WQM);
   case Intrinsic::amdgcn_writelane:
     return selectWritelane(I);
+  case Intrinsic::amdgcn_readanylane:
+    return constrainCopyLikeIntrin(I, AMDGPU::SI_READANYLANE);
   case Intrinsic::amdgcn_div_scale:
     return selectDivScale(I);
   case Intrinsic::amdgcn_icmp:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index 545eb9046ff030..5ff64e3be58669 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -5475,6 +5475,7 @@ bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper &Helper,
     auto LaneOp = B.buildIntrinsic(IID, {VT}).addUse(Src0);
     switch (IID) {
     case Intrinsic::amdgcn_readfirstlane:
+    case Intrinsic::amdgcn_readanylane:
     case Intrinsic::amdgcn_permlane64:
       return LaneOp.getReg(0);
     case Intrinsic::amdgcn_readlane:
@@ -7561,6 +7562,7 @@ bool AMDGPULegalizerInfo::legalizeIntrinsic(LegalizerHelper &Helper,
   case Intrinsic::amdgcn_readlane:
   case Intrinsic::amdgcn_writelane:
   case Intrinsic::amdgcn_readfirstlane:
+  case Intrinsic::amdgcn_readanylane:
   case Intrinsic::amdgcn_permlane16:
   case Intrinsic::amdgcn_permlanex16:
   case Intrinsic::amdgcn_permlane64:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp b/llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp
index 6a79aa0cbf4df7..4972ccbce3618e 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp
@@ -137,7 +137,8 @@ void AMDGPUMCInstLower::lower(const MachineInstr *MI, MCInst &OutMI) const {
              Opcode == AMDGPU::SI_TCRETURN_GFX) {
     // TODO: How to use branch immediate and avoid register+add?
     Opcode = AMDGPU::S_SETPC_B64;
-  }
+  } else if (Opcode == AMDGPU::SI_READANYLANE)
+    Opcode = AMDGPU::V_READFIRSTLANE_B32;
 
   int MCOpcode = TII->pseudoToMCOpcode(Opcode);
   if (MCOpcode == -1) {
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
index 415c068367074f..1728876eafffcc 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -4658,6 +4658,8 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
       OpdsMapping[3] = AMDGPU::getValueMapping(IdxBank, IdxSize);
       [[fallthrough]];
     }
+    case Intrinsic::amdgcn_readanylane:
+      [[fallthrough]];
     case Intrinsic::amdgcn_readfirstlane: {
       unsigned DstSize = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
       unsigned SrcSize = MRI.getType(MI.getOperand(2).getReg()).getSizeInBits();
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td b/llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td
index 60fa2adc62dc8c..a36c38e105ce6e 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td
@@ -366,6 +366,7 @@ def UniformIntrinsics : GenericTable {
 }
 
 def : AlwaysUniform<int_amdgcn_readfirstlane>;
+def : AlwaysUniform<int_amdgcn_readanylane>;
 def : AlwaysUniform<int_amdgcn_readlane>;
 def : AlwaysUniform<int_amdgcn_icmp>;
 def : AlwaysUniform<int_amdgcn_fcmp>;
diff --git a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
index 73834773f66e3c..c1f35e62e633fd 100644
--- a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
@@ -1116,18 +1116,20 @@ void SIFoldOperandsImpl::foldOperand(
 
     unsigned UseOpc = UseMI->getOpcode();
     if (UseOpc == AMDGPU::V_READFIRSTLANE_B32 ||
+        UseOpc == AMDGPU::SI_READANYLANE ||
         (UseOpc == AMDGPU::V_READLANE_B32 &&
          (int)UseOpIdx ==
-         AMDGPU::getNamedOperandIdx(UseOpc, AMDGPU::OpName::src0))) {
+             AMDGPU::getNamedOperandIdx(UseOpc, AMDGPU::OpName::src0))) {
+      // readanylane does not depend on exec
+      const bool ReadAnyLane = UseOpc == AMDGPU::SI_READANYLANE;
       // %vgpr = V_MOV_B32 imm
       // %sgpr = V_READFIRSTLANE_B32 %vgpr
       // =>
       // %sgpr = S_MOV_B32 imm
       if (FoldingImmLike) {
-        if (execMayBeModifiedBeforeUse(*MRI,
-                                       UseMI->getOperand(UseOpIdx).getReg(),
-                                       *OpToFold.getParent(),
-                                       *UseMI))
+        if (!ReadAnyLane && execMayBeModifiedBeforeUse(
+                                *MRI, UseMI->getOperand(UseOpIdx).getReg(),
+                                *OpToFold.getParent(), *UseMI))
           return;
 
         UseMI->setDesc(TII->get(AMDGPU::S_MOV_B32));
@@ -1136,15 +1138,15 @@ void SIFoldOperandsImpl::foldOperand(
           UseMI->getOperand(1).ChangeToImmediate(OpToFold.getImm());
         else
           UseMI->getOperand(1).ChangeToFrameIndex(OpToFold.getIndex());
-        UseMI->removeOperand(2); // Remove exec read (or src1 for readlane)
+        if (!ReadAnyLane)
+          UseMI->removeOperand(2); // Remove exec read (or src1 for readlane)
         return;
       }
 
       if (OpToFold.isReg() && TRI->isSGPRReg(*MRI, OpToFold.getReg())) {
-        if (execMayBeModifiedBeforeUse(*MRI,
-                                       UseMI->getOperand(UseOpIdx).getReg(),
-                                       *OpToFold.getParent(),
-                                       *UseMI))
+        if (!ReadAnyLane && execMayBeModifiedBeforeUse(
+                                *MRI, UseMI->getOperand(UseOpIdx).getReg(),
+                                *OpToFold.getParent(), *UseMI))
           return;
 
         // %vgpr = COPY %sgpr0
@@ -1155,7 +1157,8 @@ void SIFoldOperandsImpl::foldOperand(
         UseMI->getOperand(1).setReg(OpToFold.getReg());
         UseMI->getOperand(1).setSubReg(OpToFold.getSubReg());
         UseMI->getOperand(1).setIsKill(false);
-        UseMI->removeOperand(2); // Remove exec read (or src1 for readlane)
+        if (!ReadAnyLane)
+          UseMI->removeOperand(2); // Remove exec read (or src1 for readlane)
         return;
       }
     }
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 37dc433d154f64..0acc90faa268db 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -6186,6 +6186,7 @@ static SDValue lowerLaneOp(const SITargetLowering &TLI, SDNode *N,
       Operands.push_back(Src1);
       [[fallthrough]];
     case Intrinsic::amdgcn_readfirstlane:
+    case Intrinsic::amdgcn_readanylane:
     case Intrinsic::amdgcn_permlane64:
       Operands.push_back(Src0);
       break;
@@ -8837,6 +8838,7 @@ SDValue SITargetLowering::LowerINTRINSIC_WO_CHAIN(SDValue Op,
     return lowerADDRSPACECAST(Op, DAG);
   case Intrinsic::amdgcn_readlane:
   case Intrinsic::amdgcn_readfirstlane:
+  case Intrinsic::amdgcn_readanylane:
   case Intrinsic::amdgcn_writelane:
   case Intrinsic::amdgcn_permlane16:
   case Intrinsic::amdgcn_permlanex16:
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index ad45af00f2bd75..ce5f19b2561dbe 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -4159,7 +4159,8 @@ bool SIInstrInfo::hasUnwantedEffectsWhenEXECEmpty(const MachineInstr &MI) const
   if (Opcode == AMDGPU::V_READFIRSTLANE_B32 ||
       Opcode == AMDGPU::V_READLANE_B32 || Opcode == AMDGPU::V_WRITELANE_B32 ||
       Opcode == AMDGPU::SI_RESTORE_S32_FROM_VGPR ||
-      Opcode == AMDGPU::SI_SPILL_S32_TO_VGPR)
+      Opcode == AMDGPU::SI_SPILL_S32_TO_VGPR ||
+      Opcode == AMDGPU::SI_READANYLANE)
     return true;
 
   return false;
@@ -9619,6 +9620,7 @@ SIInstrInfo::getInstructionUniformity(const MachineInstr &MI) const {
   unsigned opcode = MI.getOpcode();
   if (opcode == AMDGPU::V_READLANE_B32 ||
       opcode == AMDGPU::V_READFIRSTLANE_B32 ||
+      opcode == AMDGPU::SI_READANYLANE ||
       opcode == AMDGPU::SI_RESTORE_S32_FROM_VGPR)
     return InstructionUniformity::AlwaysUniform;
 
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index 25df5dabdc6aa1..575fac67288e01 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -546,6 +546,10 @@ def SI_MASKED_UNREACHABLE : SPseudoInstSI <(outs), (ins),
   let maybeAtomic = 0;
 }
 
+def SI_READANYLANE : SPseudoInstSI <(outs SReg_32:$dst), (ins VGPR_32:$src)> {
+  let SALU = 1;
+}
+
 // Used as an isel pseudo to directly emit initialization with an
 // s_mov_b32 rather than a copy of another initialized
 // register. MachineCSE skips copies, and we don't want to have to
@@ -3504,6 +3508,11 @@ def : GCNPat<
   (S_MOV_B32 SReg_32:$src)
 >;
 
+def : GCNPat<
+  (i32 (int_amdgcn_readanylane (i32 imm:$src))),
+  (S_MOV_B32 SReg_32:$src)
+>;
+
 multiclass BFMPatterns <ValueType vt, PatFrag SHL, PatFrag ADD, InstSI BFM> {
   def : GCNPat <
     (vt (SHL (vt (add (vt (shl 1, vt:$a)), -1)), vt:$b)),
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.readanylane.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.readanylane.ll
new file mode 100644
index 00000000000000..0da1f47d8fe1f3
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.readanylane.ll
@@ -0,0 +1,492 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=gfx1100 -verify-machineinstrs -o - < %s | FileCheck -check-prefix=CHECK-SDAG -enable-var-scope %s
+; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=gfx1100 -verify-machineinstrs -global-isel -o - < %s | FileCheck -check-prefix=CHECK-GISEL -enable-var-scope %s
+
+define void @test_readanylane_i1(ptr addrspace(1) %out, i1 %src) #1 {
+; CHECK-SDAG-LABEL: test_readanylane_i1:
+; CHECK-SDAG:       ; %bb.0:
+; CHECK-SDAG-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CHECK-SDAG-NEXT:    v_readfirstlane_b32 s0, v2
+; CHECK-SDAG-NEXT:    s_delay_alu instid0(SALU_CYCLE_3) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; CHECK-SDAG-NEXT:    s_and_b32 s0, s0, 1
+; CHECK-SDAG-NEXT:    v_mov_b32_e32 v2, s0
+; CHECK-SDAG-NEXT:    global_store_b8 v[0:1], v2, off
+; CHECK-SDAG-NEXT:    s_setpc_b64 s[30:31]
+;
+; CHECK-GISEL-LABEL: test_readanylane_i1:
+; CHECK-GISEL:       ; %bb.0:
+; CHECK-GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CHECK-GISEL-NEXT:    v_readfirstlane_b32 s0, v2
+; CHECK-GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_3) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; CHECK-GISEL-NEXT:    s_and_b32 s0, s0, 1
+; CHECK-GISEL-NEXT:    v_mov_b32_e32 v2, s0
+; CHECK-GISEL-NEXT:    global_store_b8 v[0:1], v2, off
+; CHECK-GISEL-NEXT:    s_setpc_b64 s[30:31]
+  %readanylane = call i1 @llvm.amdgcn.readanylane.i1(i1 %src)
+  store i1 %readanylane, ptr addrspace(1) %out, align 4
+  ret void
+}
+
+define void @test_readanylane_i1_inreg(ptr addrspace(1) %out, i1 inreg %src) #1 {
+; CHECK-SDAG-LABEL: test_readanylane_i1_inreg:
+; CHECK-SDAG:       ; %bb.0:
+; CHECK-SDAG-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CHECK-SDAG-NEXT:    s_and_b32 s0, s0, 1
+; CHECK-SDAG-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; CHECK-SDAG-NEXT:    v_mov_b32_e32 v2, s0
+; CHECK-SDAG-NEXT:    global_store_b8 v[0:1], v2, off
+; CHECK-SDAG-NEXT:    s_setpc_b64 s[30:31]
+;
+; CHECK-GISEL-LABEL: test_readanylane_i1_inreg:
+; CHECK-GISEL:       ; %bb.0:
+; CHECK-GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CHECK-GISEL-NEXT:    s_and_b32 s0, s0, 1
+; CHECK-GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; CHECK-GISEL-NEXT:    v_mov_b32_e32 v2, s0
+; CHECK-GISEL-NEXT:    global_store_b8 v[0:1], v2, off
+; CHECK-GISEL-NEXT:    s_setpc_b64 s[30:31]
+  %readanylane = call i1 @llvm.amdgcn.readanylane.i1(i1 %src)
+  store i1 %readanylane, ptr addrspace(1) %out, align 4
+  ret void
+}
+
+define void @test_readanylane_i1_select(ptr addrspace(1) %out, i32 %src, i32 %src1) #1 {
+; CHECK-SDAG-LABEL: test_readanylane_i1_select:
+; CHECK-SDAG:       ; %bb.0:
+; CHECK-SDAG-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CHECK-SDAG-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 42, v2
+; CHECK-SDAG-NEXT:    v_cndmask_b32_e64 v4, 0, 1, vcc_lo
+; CHECK-SDAG-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(SALU_CYCLE_3)
+; CHECK-SDAG-NEXT:    v_readfirstlane_b32 s0, v4
+; CHECK-SDAG-NEXT:    s_bitcmp1_b32 s0, 0
+; CHECK-SDAG-NEXT:    s_cselect_b32 vcc_lo, -1, 0
+; CHECK-SDAG-NEXT:    v_cndmask_b32_e32 v2, v3, v2, vcc_lo
+; CHECK-SDAG-NEXT:    global_store_b32 v[0:1], v2, off
+; CHECK-SDAG-NEXT:    s_setpc_b64 s[30:31]
+;
+; CHECK-GISEL-LABEL: test_readanylane_i1_select:
+; CHECK-GISEL:       ; %bb.0:
+; CHECK-GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CHECK-GISEL-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 42, v2
+; CHECK-GISEL-NEXT:    v_cndmask_b32_e64 v4, 0, 1, vcc_lo
+; CHECK-GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(SALU_CYCLE_3)
+; CHECK-GISEL-NEXT:    v_readfirstlane_b32 s0, v4
+; CHECK-GISEL-NEXT:    s_and_b32 s0, 1, s0
+; CHECK-GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; CHECK-GISEL-NEXT:    v_cmp_ne_u32_e64 vcc_lo, 0, s0
+; CHECK-GISEL-NEXT:    v_cndmask_b32_e32 v2, v3, v2, vcc_lo
+; CHECK-GISEL-NEXT:    global_store_b32 v[0:1], v2, off
+; CHECK-GISEL-NEXT:    s_setpc_b64 s[30:31]
+  %cmp = icmp eq i32 %src, 42
+  %readanylane = call i1 @llvm.amdgcn.readanylane.i1(i1 %cmp)
+  %sel = select i1 %readanylane, i32 %src, i32 %src1
+  store i32 %sel, ptr addrspace(1) %out, align 4
+  ret void
+}
+
+define void @test_readanylane_i16(i16 %src) #1 {
+; CHECK-SDAG-LABEL: test_readanylane_i16:
+; CHECK-SDAG:       ; %bb.0:
+; CHECK-SDAG-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CHECK-SDAG-NEXT:    v_readfirstlane_b32 s0, v0
+; CHECK-SDAG-NEXT:    s_delay_alu instid0(SALU_CYCLE_3)
+; CHECK-SDAG-NEXT:    s_and_b32 s0, s0, 0xffff
+; CHECK-SDAG-NEXT:    ;;#ASMSTART
+; CHECK-SDAG-NEXT:    ; use s0
+; CHECK-SDAG-NEXT:    ;;#ASMEND
+; CHECK-SDAG-NEXT:    s_setpc_b64 s[30:31]
+;
+; CHECK-GISEL-LABEL: test_readanylane_i16:
+; CHECK-GISEL:       ; %bb.0:
+; CHECK-GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CHECK-GISEL-NEXT:    v_readfirstlane_b32 s0, v0
+; CHECK-GISEL-NEXT:    ;;#ASMSTART
+; CHECK-GISEL-NEXT:    ; use s0
+; CHECK-GISEL-NEXT:    ;;#ASMEND
+; CHECK-GISEL-NEXT:    s_setpc_b64 s[30:31]
+  %readanylane = call i16 @llvm.amdgcn.readanylane.i16(i16 %src)
+  call void asm sideeffect "; use $0", "s"(i16 %readanylane)
+  ret void
+}
+
+define void @test_readanylane_half(half %src) #1 {
+; CHECK-SDAG-LABEL: test_readanylane_half:
+; CHECK-SDAG:       ; %bb.0:
+; CHECK-SDAG-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CHECK-SDAG-NEXT:    v_readfirstlane_b32 s0, v0
+; CHECK-SDAG-NEXT:    ;;#ASMSTART
+; CHECK-SDAG-NEXT:    ; use s0
+; CHECK-SDAG-NEXT:    ;;#ASMEND
+; CHECK-SDAG-NEXT:    s_setpc_b64 s[30:31]
+;
+; CHECK-GISEL-LABEL: test_readanylane_half:
+; CHECK-GISEL:       ; %bb.0:
+; CHECK-GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CHECK-GISEL-NEXT:    v_readfirstlane_b32 s0, v0
+; CHECK-GISEL-NEXT:    ;;#ASMSTART
+; CHECK-GISEL-NEXT:    ; use s0
+; CHECK-GISEL-NEXT:    ;;#ASMEND
+; CHECK-GISEL-NEXT:    s_setpc_b64 s[30:31]
+  %readanylane = call half @llvm.amdgcn.readanylane.f16(half %src)
+  call void asm sideeffect "; use $0", "s"(half %readanylane)
+  ret void
+}
+
+define void @test_readanylane_float(float %src) #1 {
+; CHECK-SDAG-LABEL: test_readanylane_float:
+; CHECK-SDAG:       ; %bb.0:
+; CHECK-SDAG-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CHECK-SDAG-NEXT:    v_readfirstlane_b32 s0, v0
+; CHECK-SDAG-NEXT:    ;;#ASMSTART
+; CHECK-SDAG-NEXT:    ; use s0
+; CHECK-SDAG-NEXT:    ;;#ASMEND
+; CHECK-SDAG-NEXT:    s_setpc_b64 s[30:31]
+;
+; CHECK-GISEL-LABEL: test_readanylane_float:
+; CHECK-GISEL:       ; %bb.0:
+; CHECK-GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CHECK-GISEL-NEXT:    v_readfirstlane_b32 s0, v0
+; CHECK-GISEL-NEXT:    ;;#ASMSTART
+; CHECK-GISEL-NEXT:    ; use s0
+; CHECK-GISEL-NEXT:    ;;#ASMEND
+; CHECK-GISEL-NEXT:    s_setpc_b64 s[30:31]
+  %readanylane = call float @llvm.amdgcn.readanylane.f32(float %src)
+  call void asm sideeffect "; use $0", "s"(float %readanylane)
+  ret void
+}
+
+define void @test_readanylane_i32_immed() #1 {
+; CHECK-SDAG-LABEL: test_readanylane_i32_immed:
+; CHECK-SDAG:       ; %bb.0:
+; CHECK-SDAG-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CHECK-SDAG-NEXT:    s_mov_b32 s0, 42
+; CHECK-SDAG-NEXT:    ;;#AS...
[truncated]

@llvmbot
Member

llvmbot commented Nov 11, 2024

@llvm/pr-subscribers-backend-amdgpu


@jayfoad
Contributor

jayfoad commented Nov 11, 2024

Thanks for working on this!

It isn't convergent, so it can be moved across control flow.

Unfortunately the new intrinsic still has to be convergent.

Consider these transformations:

A: x = def();                            B: x = def();
   y = read*lane(x);                        if (divergentcondition) {
   if (divergentcondition) {                  y = read*lane(x);
     use(y);                                  use(y);
   }                                        }

Transforming A-->B is not safe for readfirstlane, but it is safe for readanylane (because if x is uniform outside the if then it must also be uniform inside the if). Transforming B-->A is not safe for readfirstlane and not safe for readanylane.
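
For concreteness, here is shape A as an IR sketch using the proposed intrinsic (illustrative only; the divergent condition is modeled with the workitem id, and the function and value names are made up). Shape B is the same function with the readanylane call moved into %then:

define void @shape_a(ptr addrspace(1) %out, i32 %x) {
entry:
  %y = call i32 @llvm.amdgcn.readanylane.i32(i32 %x)  ; y = read*lane(x)
  %tid = call i32 @llvm.amdgcn.workitem.id.x()
  %cond = icmp eq i32 %tid, 0                         ; divergent condition
  br i1 %cond, label %then, label %end

then:                                                 ; use(y) under divergent control flow
  store i32 %y, ptr addrspace(1) %out
  br label %end

end:
  ret void
}

declare i32 @llvm.amdgcn.readanylane.i32(i32)
declare i32 @llvm.amdgcn.workitem.id.x()

Sinking the call into %then (A-->B) is fine because a value that is uniform outside the if is still uniform inside it; hoisting it out (B-->A) is not, because %x might only be uniform among the lanes that are active inside the if.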

Unfortunately the convergent attribute currently restricts both of these transforms, so we need readanylane to be convergent to prevent B-->A.

Maybe in future we can have some better way to express this which will allow transforming A-->B but prevent B-->A.

Cc: @nhaehnle @ssahasra

@GinShio
Author

GinShio commented Nov 11, 2024

Transforming A-->B is not safe for readfirstlane, but it is safe for readanylane (because if x is uniform outside the if then it must also be uniform inside the if). Transforming B-->A is not safe for readfirstlane and not safe for readanylane.

Sorry, I don't quite understand why the latter (B-->A for readanylane) is unsafe.
My understanding is that if x is uniform, readanylane is not affected by the exec mask.

block-beta
  columns 8
  IfTh0[" "] IfTh1["42"] IfTh2["42"] IfTh3["42"] IfTh4["42"] IfTh5[" "] IfTh6["42"] IfTh7["42"]
  ThenTh0[" "] ThenTh1["42"] ThenTh2["42"] ThenTh3["42"] ThenTh4["42"] ThenTh5[" "] ThenTh6["42"] ThenTh7["42"]
  ElseTh0[" "] ElseTh1["42"] ElseTh2["42"] ElseTh3["42"] ElseTh4["42"] ElseTh5[" "] ElseTh6["42"] ElseTh7["42"]

  classDef disable fill:#808080,stroke:#333;
  classDef convergence fill:#218733,stroke:#333;
  classDef then fill:#A93376,stroke:#333;
  classDef else fill:#65CBEE,stroke:#333;
  class IfTh0,IfTh5,IfTh7 disable
  class IfTh1,IfTh2,IfTh3,IfTh4,IfTh6,IfTh7 convergence
  class ThenTh0,ThenTh1,ThenTh2,ThenTh4,ThenTh5 disable
  class ThenTh3,ThenTh6,ThenTh7 then
  class ElseTh0,ElseTh3,ElseTh5,ElseTh6,ElseTh7 disable
  class ElseTh1,ElseTh2,ElseTh4 else

For example, the first row is the if block, the second is the then block, and the last is the else block.
It looks like reading the data is safe no matter which basic block it is in.

Could you help me point out my mistake?

@jayfoad
Contributor

jayfoad commented Nov 11, 2024

The definition of readanylane(x) has to say something like "if x is uniform then the result is x, otherwise the result is undefined". The first part "if x is uniform" means that it is uniform at the place where it is used, not at the place where it is defined. In your example that means that all the dark red cells must have the same value (42) but other green cells could be different.

(Another example of why we have to consider uniformity at the point of use is temporal divergence. You can have a uniform def inside a loop with a divergent exit condition, but a use of that def outside the loop must be considered as divergent.)
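
A minimal IR sketch of the temporal-divergence case (illustrative, not taken from this thread): %i is uniform on every loop iteration, but because lanes leave the loop on different iterations, %i has to be treated as divergent at its use after the loop.

define void @temporal_divergence(ptr addrspace(1) %out) {
entry:
  %tid = call i32 @llvm.amdgcn.workitem.id.x()
  br label %loop

loop:
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]      ; uniform inside the loop
  %i.next = add i32 %i, 1
  %cond = icmp ult i32 %i.next, %tid                  ; divergent exit condition
  br i1 %cond, label %loop, label %exit

exit:
  store i32 %i, ptr addrspace(1) %out                 ; %i is divergent here
  ret void
}

declare i32 @llvm.amdgcn.workitem.id.x()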

The new intrinsic has to be convergent. The value must be uniform at the place
where it is used.

This commit changes:
 + Add attribute `convergent`.
 + Remove testcase `test_readanylane_hoist`. The intrinsic is no longer
   allowed to be hoisted.
@GinShio
Author

GinShio commented Nov 11, 2024

Thanks a lot! Added attribute convergent.

// control flow. The result is undefined if the value is actually divergent.
def int_amdgcn_readanylane :
Intrinsic<[llvm_any_ty], [LLVMMatchType<0>],
[IntrConvergent, IntrNoCallback, IntrNoFree, IntrNoMem, IntrSpeculatable,
Contributor


Did you add IntrSpeculatable deliberately? If so, you should add a test showing what sort of transform it enables.

Author


It's removed.


bool AMDGPUInstructionSelector::constrainCopyLikeIntrin(MachineInstr &MI,
unsigned NewOpc) const {
const bool NeedExec = NewOpc != AMDGPU::SI_READANYLANE;
Contributor


Why doesn't readanylane need exec?

Author


In the first version it was SALU, and I thought the value was uniform when defined. So...
Now, the intrinsic depends on exec.

@ruiling
Contributor

ruiling commented Nov 12, 2024

The first part "if x is uniform" means that it is uniform at the place where it is used, not at the place where it is defined. In your example that means that all the dark red cells must have the same value (42) but other green cells could be different.

I would say we should state it clearly: readanylane is defined only if the input is uniform both where it is defined and at the use of the readanylane call. We really want the intrinsic to not be convergent, to allow related middle-end optimizations.

By the way, I think for the existing uniformity analysis we can say that "if a value is uniform at a use, then it is uniform at its definition", right? Currently we only have the case where a value is defined uniform but is divergent at a use.

@ssahasra
Collaborator

I would say we should state it clearly: readanylane is defined only if the input is uniform both where it is defined and at the use of the readanylane call. We really want the intrinsic to not be convergent, to allow related middle-end optimizations.

The change description starts with:

"Sometimes we know the value is uniform, but the backend cannot easily prove that it is."

Can this be solved with an attribute instead?

If we do need an intrinsic, we should stop comparing it with readfirstlane, and just focus on what readanylane means. Would it be correct to rename it to something like "assumeuniform"? Does this provide information that can be propagated backwards when initializing uniformity analysis?

@jayfoad
Contributor

jayfoad commented Nov 12, 2024

I would say we should state it clearly: readanylane is defined only if the input is uniform both where it is defined and at the use of the readanylane call. We really want the intrinsic to not be convergent, to allow related middle-end optimizations.

I'm not comfortable with saying "the input is uniform when defined". The problem is with code like this:

  x = uniform_def();
  y = divergent_def();
  if (x == y) {
    ...
  }

Inside the body of the if, generic optimizations should be allowed to replace uses of x with y and vice versa. But if they replace readanylane(x) with readanylane(y) then that would break the "uniform when defined" condition.

I don't have a good example of an optimization that actually does this, but I do think they should be allowed to do this.
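
Sketching that concern in IR (hypothetical; no particular pass is claimed to perform this rewrite today):

define i32 @equality_propagation(i32 %x, i32 %y) {
entry:
  ; assume %x is uniform and %y is divergent
  %c = icmp eq i32 %x, %y
  br i1 %c, label %then, label %end

then:
  ; On this path %x == %y, so an optimizer may rewrite readanylane(%x) as
  ; readanylane(%y). The runtime value is unchanged, but a "uniform when
  ; defined" rule would make the rewritten call undefined.
  %r = call i32 @llvm.amdgcn.readanylane.i32(i32 %x)
  ret i32 %r

end:
  ret i32 0
}

declare i32 @llvm.amdgcn.readanylane.i32(i32)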

By the way, I think for the existing uniformity analysis we can say that "if a value is uniform at a use, then it is uniform at its definition", right? Currently we only have the case where a value is defined uniform but is divergent at a use.

Yes I think that's correct, currently.

Can this be solved with an attribute instead?

I'm not sure how that would work. We want to apply this to a specific IR Use, or to an IR Value at a specific point in the CFG. The use case is quite similar to the existing @llvm.assume intrinsic.

If we do need an intrinsic, we should stop comparing it with readfirstlane, and just focus on what readanylane means.

Yeah. I would define it something like "if the argument is uniform, return that value, otherwise the result is undefined" with an explanatory note saying why you would want to use it.

Would it be correct to rename it to something like "assumeuniform"?

I guess so. Personally I like the name readanylane, but that might be because I came up with it :-)

Does this provide information that can be propagated backwards when initializing uniformity analysis?

Yes, if UA did any backwards propagation then I think this could be a good thing to propagate backwards from.

@GinShio
Author

GinShio commented Nov 12, 2024

I thought maybe we should decide between uniform when defined and uniform when used.

If it's the former, this intrinsic has undefined behavior when the value is divergent at its definition. That means this example is invalid, and the user shouldn't generate readanylane in this case.
If it's the latter, keep the convergent attribute and modify the code in this change. Like Jay said, just allow sinking it.

It looks like the former has more opportunities for optimization.

@ssahasra
Collaborator

A: x = def();                            B: x = def();
   y = read*lane(x);                        if (divergentcondition) {
   if (divergentcondition) {                  y = read*lane(x);
     use(y);                                  use(y);
   }                                        }

I am not sure what the primary goal is here. If the goal is to allow certain defs to be sunk into divergent control flow, then maybe this particular situation is sufficient justification to add a new attribute that allows sinking. This can be combined with a builtin that declares "uniform at use". A combination of these two would allow the A --> B transformation.

In program B, being able to say that a value is uniform at its use inside divergent control flow can be useful in itself. Trying to hoist this value as in B --> A is hard anyway. This should be considered separately from the A --> B transform. In program B, if the user wanted the value to be hoisted, then they probably have sufficient information to write A instead.

Syntax-wise, I am leaning more towards something that looks like @llvm.assume. In fact, we could consider @llvm.is_uniform as a builtin predicate, and then the user can say @llvm.assume(@llvm.is_uniform()). This is similar to the existing @llvm.is.constant.* builtins.
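
For reference only: @llvm.assume already expresses a point-specific fact this way; a hypothetical @llvm.is_uniform-style predicate (which does not exist today) would take the place of the ordinary comparison below.

declare void @llvm.assume(i1)

define i32 @assume_example(i32 %x) {
  ; Existing pattern: assert a fact about %x at this program point.
  ; A hypothetical uniformity predicate would replace the icmp here.
  %fact = icmp sge i32 %x, 0
  call void @llvm.assume(i1 %fact)
  ret i32 %x
}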

@jayfoad
Contributor

jayfoad commented Nov 13, 2024

If the goal is to allow certain defs to be sunk into divergent control flow, then maybe this particular situation is sufficient justification to add a new attribute that allows sinking. This can be combined with a builtin that declares "uniform at use". A combination of these two would allow the A --> B transformation.

Would this new attribute be in addition to convergent, or instead of it? I still think readanylane should be convergent because it depends on the set of active threads, so I see this as a special case where we know the rules can be relaxed: in general readanylane should not be moved past divergent control flow, but in the specific case of sinking into an if it is OK.

FYI previously, probably before most of your work on convergence, someone proposed splitting convergent into two separate attributes, one to prevent sinking (adding additional control dependencies) and one to prevent hoisting (removing control dependencies).

Let me add one more example:

A: x = def();                           B: x = def();                           C: x = def();
   y = read*lane(x);                       if (divergentcondition) {               y = read*lane(x);
   if (divergentcondition) {                 y = read*lane(x);                     if (divergentcondition) {
     use(y);                                 use(y);                                 z = read*lane(x);
   }                                       }                                         use(z);
                                                                                   }

CSE will transform C-->A if we don't prevent it in some way. That's one of the reasons for marking readanylane as convergent.

 + Remove unused attribute `speculatable`.
 + Change machine instruction from SAlu to VAlu, add attributes.
 + Add testcase for InstCombine, and CSE.
 + Update docs.
@llvmbot llvmbot added llvm:instcombine Covers the InstCombine, InstSimplify and AggressiveInstCombine passes llvm:transforms labels Nov 13, 2024
@GinShio
Copy link
Author

GinShio commented Nov 13, 2024

Sorry, there may be a gap in my understanding of readanylane. But I prefer @llvm.assume if we want to mark the value as uniform when defined.

Let me add one more example:

A: x = def();                           B: x = def();                           C: x = def();
   y = read*lane(x);                       if (divergentcondition) {               y = read*lane(x);
   if (divergentcondition) {                 y = read*lane(x);                     if (divergentcondition) {
     use(y);                                 use(y);                                 z = read*lane(x);
   }                                       }                                         use(z);
                                                                                   }

CSE will transform C-->A if we don't prevent it in some way. That's one of the reasons for marking readanylane as convergent.

This example (C --> A) may be fine, if I understand correctly how readanylane is used. If we think x is uniform where y is defined (before the if) and also uniform in the then-block where z is defined, we can safely do CSE in this case.

I thought we must prevent GVN hoisting (the following case, B --> A, with "uniform when used" semantics). It's one of the reasons.

A: x = def();                            B: x = def();
   y = readanylane(x);                      if (divergentcondition) {
   if (divergentcondition) {                  y = readanylane(x);
     dosomthing0(y);                          dosomthing0(y);
   } else {                                 } else {
     dosomthing1(y);                          z = readanylane(x);
   }                                          dosomthing1(z);
                                            }

Finally, I updated the code. Currently, readanylane hints that the value is uniform when used.

@jayfoad
Copy link
Contributor

jayfoad commented Nov 13, 2024

Let me add one more example:

A: x = def();                           B: x = def();                           C: x = def();
   y = read*lane(x);                       if (divergentcondition) {               y = read*lane(x);
   if (divergentcondition) {                 y = read*lane(x);                     if (divergentcondition) {
     use(y);                                 use(y);                                 z = read*lane(x);
   }                                       }                                         use(z);
                                                                                   }

CSE will transform C-->A if we don't prevent it in some way. That's one of the reasons for marking readanylane as convergent.

This example (C --> A) may be fine, if I understand correctly how readanylane is used. If we think x is uniform where y is defined (before the if) and also uniform in the then-block where z is defined, we can safely do CSE in this case.

The problem is when x is uniform inside the if but divergent outside the if. Then z is defined but y is undefined, so it is not OK to replace a use of z with a use of y.
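
A concrete IR sketch of that hazard, again assuming the hypothetical overload name @llvm.amdgcn.readanylane.i32; %x is divergent in the entry block but uniform among the active lanes inside the branch, so the two calls are not interchangeable:

declare i32 @llvm.amdgcn.workitem.id.x()
declare i32 @llvm.amdgcn.readanylane.i32(i32)   ; assumed name, from this PR
declare void @use(i32)

define amdgpu_kernel void @sketch_cse_hazard() {
entry:
  %x = call i32 @llvm.amdgcn.workitem.id.x()          ; divergent
  %y = call i32 @llvm.amdgcn.readanylane.i32(i32 %x)  ; undefined: %x is divergent here
  %c = icmp eq i32 %x, 0
  br i1 %c, label %then, label %exit

then:                                                 ; only lanes with %x == 0 are active
  %z = call i32 @llvm.amdgcn.readanylane.i32(i32 %x)  ; well defined: %x is 0 for every active lane
  call void @use(i32 %z)                              ; CSE must not replace %z with %y
  br label %exit

exit:
  ret void
}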

Incorrectly added FileCheck pattern for readanylane
@GinShio
Copy link
Author

GinShio commented Nov 14, 2024

The problem is when x is uniform inside the if but divergent outside the if. Then z is defined but y is undefined, so it is not OK to replace a use of z with a use of y.

Sorry, I'm confused again.

y = readanylane(x) means x is uniform outside of the if, and z = readanylane(x) means x is uniform inside of the if.
That looks okay; in other words, x is always uniform in this case, both inside and outside of the if. Right?

@jayfoad
Copy link
Contributor

jayfoad commented Nov 14, 2024

y = readanylane(x) means x is uniform outside of the if, and z = readanylane(x) means x is uniform inside of the if.
That looks okay; in other words, x is always uniform in this case, both inside and outside of the if. Right?

I don't think it is helpful to say "readanylane(x) means x is uniform", because the presence of the readanylane() call does not magically make x uniform if it was not already uniform. It is better to describe it as "readanylane(x) is x if x is uniform, otherwise readanylane(x) is undefined".

With that in mind, does my explanation make sense to you:

The problem is when x is uniform inside the if but divergent outside the if. Then z is defined but y is undefined, so it is not OK to replace a use of z with a use of y.

@GinShio
Copy link
Author

GinShio commented Nov 14, 2024

With that in mind, does my explanation make sense to you:

Yes, that's a good explanation.

because the presence of the readanylane() call does not magically make x uniform if it was not already uniform.

IIUC, if x is not uniform outside of the if and we use readanylane, the transform C --> A:

  • is okay, if a non-uniform source is undefined behavior: then we can assume x is uniform wherever readanylane is used (magically making x uniform), and the user shouldn't use the intrinsic unless they can guarantee x is uniform.
  • is bad, if a non-uniform source merely produces an undefined result: then we cannot make that assumption.

We expect the latter.

@ssahasra
Copy link
Collaborator

ssahasra commented Nov 15, 2024

Would this new attribute be in addition to convergent, or instead of it? I still think readanylane should be convergent because it depends on the set of active threads, so I see this as a special case where we know the rules can be relaxed: in general readanylane should not be moved past divergent control flow, but in the specific case of sinking into an if it is OK.

There should probably be a family of attributes, where convergent is the most conservative. The other attributes will add nuance that relaxes what convergent means. For example, the property that we need here is that "this convergent call is valid for any subset of the set of threads that implicitly reach here, but not for a superset". This is more specific than saying things like "may sink" or "may add control dependencies". Alternatively we could even implement this as metadata that goes along with the convergent attribute.

EDIT: If we go for new attributes, the expectation is that each of them is usable by itself. They don't have to be added to convergent, but any of them can be replaced with convergent without losing correctness.

@ssahasra
Copy link
Collaborator

It is better to describe it as "readanylane(x) is x if x is uniform, otherwise readanylane(x) is undefined".

Compared to that, @llvm.assume(@llvm.is.uniform()) actively tells the compiler that the user knows that this use of the value is uniform. But if the user is wrong, the result is undefined anyway. In what way is one better than the other?

@GinShio
Copy link
Author

GinShio commented Nov 29, 2024

It is better to describe it as "readanylane(x) is x if x is uniform, otherwise readanylane(x) is undefined".

Compared to that, @llvm.assume(@llvm.is.uniform()) actively tells the compiler that the user knows that this use of the value is uniform. But if the user is wrong, the result is undefined anyway. In what way is one better than the other?

Sorry, I haven't focused on this for a long time. Thanks a lot for testing llvm.is.uniform!!!
We discussed this offline: "Yes, we know the value is uniform, and we wish the value could be hoisted, among other things." A typical case is a subgroup add.
We will try to use is.uniform. That's cool!

For this intrinsic, maybe it's not what we want, but we can continue to improve it if necessary...

Thanks again to you and Jay for the detailed explanations.

@jayfoad
Copy link
Contributor

jayfoad commented Dec 2, 2024

It is better to describe it as "readanylane(x) is x if x is uniform, otherwise readanylane(x) is undefined".

Compared to that, @llvm.assume(@llvm.is.uniform()) actively tells the compiler that the user knows that this use of the value is uniform. But if the user is wrong, the result is undefined anyway. In what way is one better than the other?

Are you asking me? I am slightly reluctant to use @llvm.assume because I am less familiar with it, and it seems more complicated: to tell whether a value %x is uniform you would have to search its uses looking for the @llvm.assume(@llvm.is.uniform(%x)) pattern, and probably check that those intrinsic calls dominate the use of %x. I guess InstCombine is already set up to do this for you, but I am not sure about other parts of the compiler that might want to do an ad hoc optimization when they can assume %x is uniform.
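
A small IR sketch of the dominance concern, reusing the hypothetical @llvm.is.uniform.i32 predicate; a pass could treat %x as uniform at the first use below, but not at the second, because the assume does not dominate it:

declare void @llvm.assume(i1)
declare i1 @llvm.is.uniform.i32(i32)   ; hypothetical, for illustration only
declare void @use(i32)

define void @sketch_dominance(i32 %x, i1 %c) {
entry:
  br i1 %c, label %a, label %b

a:
  %p = call i1 @llvm.is.uniform.i32(i32 %x)
  call void @llvm.assume(i1 %p)
  call void @use(i32 %x)   ; the assume dominates this use of %x
  br label %b

b:
  call void @use(i32 %x)   ; not dominated by the assume, so uniformity cannot be inferred here
  ret void
}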
