[AMDGPU][SDAG] Legalise v2i32 or/xor/and instructions to make use of 64-bit wide instructions #140694

chrisjbris · 2025-05-20T08:57:56Z

Enable s_or_b64/s_and_b64/s_xor_b64 for v2i32. Add various additional combines to make use of these newly legalised instructions.
Update several tests and separate legacy r600 tests where necessary.

github-actions · 2025-05-20T08:58:17Z

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

llvmbot · 2025-05-20T08:58:19Z

@llvm/pr-subscribers-backend-amdgpu

Author: Chris Jackson (chrisjbris)

Changes

Make use of s_or_b64/s_and_b64/s_xor_b64 for v2i32. Legalising these causes a number of test regressions, so extra work in the combiner and Tablegen patterns was necessary.

Rebasing broke a couple of the regression fixes, and some refactoring of the DagCombine modifications is needed to tidy up. So keeping this in draft for now while I fixup those issues.

Patch is 45.28 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/140694.diff

19 Files Affected:

(modified) llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp (+3)
(modified) llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp (+1-1)
(modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+97-4)
(modified) llvm/lib/Target/AMDGPU/SIISelLowering.h (+3)
(modified) llvm/lib/Target/AMDGPU/SIInstructions.td (+64-2)
(modified) llvm/lib/Target/AMDGPU/SOPInstructions.td (+15)
(modified) llvm/lib/Target/AMDGPU/VOP2Instructions.td (+7-2)
(modified) llvm/test/CodeGen/AMDGPU/and.ll (+1-2)
(modified) llvm/test/CodeGen/AMDGPU/bf16-conversions.ll (+12-12)
(modified) llvm/test/CodeGen/AMDGPU/bfi_int.ll (+2-2)
(modified) llvm/test/CodeGen/AMDGPU/copysign-simplify-demanded-bits.ll (+1-1)
(modified) llvm/test/CodeGen/AMDGPU/dag-preserve-disjoint-flag.ll (+26-10)
(modified) llvm/test/CodeGen/AMDGPU/fneg-modifier-casting.ll (+6-4)
(modified) llvm/test/CodeGen/AMDGPU/fshr.ll (+54-134)
(modified) llvm/test/CodeGen/AMDGPU/or.ll (+2-2)
(modified) llvm/test/CodeGen/AMDGPU/rotl.ll (+28-14)
(added) llvm/test/CodeGen/AMDGPU/rotr-v2i32.ll (+32)
(modified) llvm/test/CodeGen/AMDGPU/rotr.ll (+23-11)
(modified) llvm/test/CodeGen/AMDGPU/xor.ll (+2-2)

diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
index affcd78ea61b0..6f0c524d33c5d 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
@@ -235,6 +235,7 @@ bool VectorLegalizer::Run() {
 
   LegalizedNodes.clear();
 
+
   // Remove dead nodes now.
   DAG.RemoveDeadNodes();
 
@@ -1282,6 +1283,8 @@ void VectorLegalizer::Expand(SDNode *Node, SmallVectorImpl<SDValue> &Results) {
   }
 
   SDValue Unrolled = DAG.UnrollVectorOp(Node);
+  LLVM_DEBUG(dbgs() << "\nUnrolled node: "; Unrolled->dump());
+  LLVM_DEBUG(dbgs() << "\n");
   if (Node->getNumValues() == 1) {
     Results.push_back(Unrolled);
   } else {
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index 7ed055e8da2b6..388efe0368256 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -152,7 +152,7 @@ AMDGPUTargetLowering::AMDGPUTargetLowering(const TargetMachine &TM,
 
   setOperationAction(ISD::LOAD, MVT::i128, Promote);
   AddPromotedToType(ISD::LOAD, MVT::i128, MVT::v4i32);
-
+  
   // TODO: Would be better to consume as directly legal
   setOperationAction(ISD::ATOMIC_LOAD, MVT::f32, Promote);
   AddPromotedToType(ISD::ATOMIC_LOAD, MVT::f32, MVT::i32);
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index ba7e11a853347..7338abd67dfe5 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -40,6 +40,7 @@
 #include "llvm/IR/IntrinsicsR600.h"
 #include "llvm/IR/MDBuilder.h"
 #include "llvm/Support/CommandLine.h"
+
 #include "llvm/Support/KnownBits.h"
 #include "llvm/Support/ModRef.h"
 #include "llvm/Transforms/Utils/LowerAtomic.h"
@@ -430,6 +431,12 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
     setOperationAction(ISD::VECTOR_SHUFFLE, {MVT::v2i32, MVT::v2f32}, Legal);
   }
 
+  setOperationAction({ISD::AND, ISD::OR, ISD::XOR}, MVT::v2i32, Legal);
+  // Prevent SELECT from being implemented with the above bitwise ops and instead use cndmask.
+  setOperationAction(ISD::SELECT, MVT::v2i32, Custom);
+ // Enable MatchRotate to produce ISD::ROTR, which is later transformed to alignbit.
+  setOperationAction(ISD::ROTR, MVT::v2i32, Legal);
+
   setOperationAction(ISD::BUILD_VECTOR, {MVT::v4f16, MVT::v4i16, MVT::v4bf16},
                      Custom);
 
@@ -835,6 +842,7 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
     AddPromotedToType(ISD::SELECT, MVT::v2f16, MVT::i32);
   } else {
     // Legalization hack.
+    // Hmm.
     setOperationAction(ISD::SELECT, {MVT::v2i16, MVT::v2f16}, Custom);
 
     setOperationAction({ISD::FNEG, ISD::FABS}, MVT::v2f16, Custom);
@@ -1986,6 +1994,13 @@ bool SITargetLowering::shouldConvertConstantLoadToIntImm(const APInt &Imm,
   return true;
 }
 
+bool SITargetLowering::shouldFoldSelectWithIdentityConstant(unsigned BinOpcode,
+                                                            EVT VT) const {
+  return (BinOpcode == ISD::AND || BinOpcode == ISD::OR ||
+          BinOpcode == ISD::XOR) &&
+         VT.getScalarType() == MVT::i64;
+}
+
 bool SITargetLowering::isExtractSubvectorCheap(EVT ResVT, EVT SrcVT,
                                                unsigned Index) const {
   if (!isOperationLegalOrCustom(ISD::EXTRACT_SUBVECTOR, ResVT))
@@ -12872,6 +12887,52 @@ SDValue SITargetLowering::performOrCombine(SDNode *N,
     }
   }
 
+  // Detect identity v2i32 OR and replace with identity source node.
+  // Specifically an Or that has operands constructed from the same source node
+  // via extract_vector_elt and build_vector.
+  if (VT == MVT::v2i32) {
+    if (LHS->getOpcode() == ISD::BUILD_VECTOR &&
+        RHS->getOpcode() == ISD::BUILD_VECTOR) {
+      // DAG.canonicalizeCommutativeBinop(ISD::OR, RHS, LHS);
+      SDValue BVLHS = LHS->getOperand(0);
+      SDValue CLHS = LHS->getOperand(1);
+      SDValue CRHS = RHS->getOperand(0);
+      SDValue BVRHS = RHS->getOperand(1);
+      LLVM_DEBUG(
+          dbgs()
+              << "### Performing v2i32 SIISelLowering DAGCombine::CombineOR\n";);
+
+      auto *LC = dyn_cast<ConstantSDNode>(LHS->getOperand(1));
+      auto *RC = dyn_cast<ConstantSDNode>(RHS->getOperand(0));
+
+      if (LC && RC) {
+
+        // Test for and normalise build vectors.
+        if (LHS->getOpcode() == ISD::BUILD_VECTOR &&
+            RHS->getOpcode() == ISD::BUILD_VECTOR &&
+            // Check cast to constantnode here
+            LHS->getConstantOperandVal(1) == 0 &&
+            RHS->getConstantOperandVal(0) == 0) {
+
+          // Get the extract_vector_element operands.
+          SDValue LEVE = LHS->getOperand(0);
+          SDValue REVE = RHS->getOperand(1);
+
+          if (LEVE->getOpcode() == ISD::EXTRACT_VECTOR_ELT &&
+              REVE->getOpcode() == ISD::EXTRACT_VECTOR_ELT) {
+            // Check that the the elements from the same vector are extracted.
+            if (LEVE->getOperand(0) == REVE->getOperand(0) &&
+                LEVE->getOperand(1) != REVE->getOperand(1)) {
+              LLVM_DEBUG(dbgs() << "### Found identity OR, folding...\n";);
+              SDValue IdentitySrc = LEVE.getOperand(0);
+              return IdentitySrc;
+            }
+          }
+        }
+      }
+    }
+  }
+
   if (VT != MVT::i64 || DCI.isBeforeLegalizeOps())
     return SDValue();
 
@@ -12915,20 +12976,52 @@ SDValue SITargetLowering::performXorCombine(SDNode *N,
                                             DAGCombinerInfo &DCI) const {
   if (SDValue RV = reassociateScalarOps(N, DCI.DAG))
     return RV;
-
+  
+  SelectionDAG &DAG = DCI.DAG;
+  EVT VT = N->getValueType(0);
   SDValue LHS = N->getOperand(0);
   SDValue RHS = N->getOperand(1);
 
-  const ConstantSDNode *CRHS = dyn_cast<ConstantSDNode>(RHS);
-  SelectionDAG &DAG = DCI.DAG;
+  if (VT == MVT::v2i32 && LHS.getNumOperands() > 1) {
+
+    const ConstantSDNode *CRHS_0 = dyn_cast<ConstantSDNode>(RHS.getOperand(0));
+    const ConstantSDNode *CRHS_1 = dyn_cast<ConstantSDNode>(RHS.getOperand(1));
+    SDValue LHS_0 = LHS.getOperand(0);
+    SDValue LHS_1 = LHS.getOperand(1);
+
+    if (LHS.getOpcode() == ISD::VSELECT && VT == MVT::v2i32) {
+      if (CRHS_0 && CRHS_0->getAPIntValue().isSignMask() &&
+          shouldFoldFNegIntoSrc(N, LHS_0))
+        if (CRHS_1 && CRHS_1->getAPIntValue().isSignMask() &&
+            shouldFoldFNegIntoSrc(N, LHS_1)) {
+          SDLoc DL(N);
+          SDValue CastLHS =
+              DAG.getNode(ISD::BITCAST, DL, MVT::v2f32, LHS->getOperand(1));
+          SDValue CastRHS =
+              DAG.getNode(ISD::BITCAST, DL, MVT::v2f32, LHS->getOperand(2));
+          SDValue FNegLHS = DAG.getNode(ISD::FNEG, DL, MVT::v2f32, CastLHS);
+          SDValue FNegRHS = DAG.getNode(ISD::FNEG, DL, MVT::v2f32, CastRHS);
+          SDValue NewSelect = DAG.getNode(ISD::VSELECT, DL, MVT::v2f32,
+                                          LHS->getOperand(0), FNegLHS, FNegRHS);
+          return DAG.getNode(ISD::BITCAST, DL, VT, NewSelect);
+        }
+    }
+    // Possibly split vector here if one side does have a constant RHS.
+  }
 
-  EVT VT = N->getValueType(0);
+  // Add test for when only one of the RHS vector elements is a const. Might be possible to optimise this case.
+
+
+  const ConstantSDNode *CRHS = dyn_cast<ConstantSDNode>(RHS);
+ 
+ 
   if (CRHS && VT == MVT::i64) {
     if (SDValue Split =
             splitBinaryBitConstantOp(DCI, SDLoc(N), ISD::XOR, LHS, CRHS))
       return Split;
   }
 
+
   // Make sure to apply the 64-bit constant splitting fold before trying to fold
   // fneg-like xors into 64-bit select.
   if (LHS.getOpcode() == ISD::SELECT && VT == MVT::i32) {
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.h b/llvm/lib/Target/AMDGPU/SIISelLowering.h
index c42366a1c04c8..b651c2530d628 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.h
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.h
@@ -366,6 +366,9 @@ class SITargetLowering final : public AMDGPUTargetLowering {
   bool shouldConvertConstantLoadToIntImm(const APInt &Imm,
                                         Type *Ty) const override;
 
+  bool shouldFoldSelectWithIdentityConstant(unsigned BinOpcode,
+                                            EVT VT) const override;
+
   bool isExtractSubvectorCheap(EVT ResVT, EVT SrcVT,
                                unsigned Index) const override;
   bool isExtractVecEltCheap(EVT VT, unsigned Index) const override;
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index 2e2913d88cc54..7944f783c82ba 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -2334,9 +2334,9 @@ def : AMDGPUPatIgnoreCopies <
                 (COPY_TO_REGCLASS VSrc_b32:$z, VGPR_32))
 >;
 
-// 64-bit version
+foreach vt = [i64, v2i32] in {
 def : AMDGPUPatIgnoreCopies <
-  (DivergentBinFrag<xor> i64:$z, (and i64:$x, (xor i64:$y, i64:$z))),
+  (DivergentBinFrag<xor> vt:$z, (and vt:$x, (xor vt:$y, vt:$z))),
   (REG_SEQUENCE VReg_64,
     (V_BFI_B32_e64 (i32 (EXTRACT_SUBREG VReg_64:$x, sub0)),
               (i32 (EXTRACT_SUBREG VReg_64:$y, sub0)),
@@ -2345,6 +2345,7 @@ def : AMDGPUPatIgnoreCopies <
               (i32 (EXTRACT_SUBREG VReg_64:$y, sub1)),
               (i32 (EXTRACT_SUBREG VReg_64:$z, sub1))), sub1)
 >;
+}
 
 def : AMDGPUPat <
   (fcopysign f32:$src0, f32:$src1),
@@ -2378,6 +2379,24 @@ def : AMDGPUPat <
 let True16Predicate = NotHasTrue16BitInsts in {
 def : ROTRPattern <V_ALIGNBIT_B32_e64>;
 
+def : AMDGPUPat <
+  (rotr v2i32:$src0, v2i32:$src1),
+  (REG_SEQUENCE VReg_64,
+    (V_ALIGNBIT_B32_e64
+      (i32 (EXTRACT_SUBREG VReg_64:$src0, sub0)), 
+      (i32 (EXTRACT_SUBREG VReg_64:$src0, sub0)), 
+      (i32 (EXTRACT_SUBREG VReg_64:$src1, sub0))), sub0,
+    (V_ALIGNBIT_B32_e64
+      (i32 (EXTRACT_SUBREG VReg_64:$src0, sub1)),
+      (i32 (EXTRACT_SUBREG VReg_64:$src0, sub1)), 
+      (i32 (EXTRACT_SUBREG VReg_64:$src1, sub1))), sub1)
+>;
+
+// Prevents regression in fneg-modifier-casting.ll along with modifications to XorCombine().
+def : AMDGPUPat <
+  (fneg (select i1:$src0, (f32 (bitconvert i32:$src1)), (f32 (bitconvert i32:$src2)))),
+    (V_CNDMASK_B32_e64 (i32 1), $src2, (i32 1), $src1, $src0)>;
+
 def : GCNPat<(i32 (trunc (srl i64:$src0, (and i32:$src1, (i32 31))))),
           (V_ALIGNBIT_B32_e64 (i32 (EXTRACT_SUBREG (i64 $src0), sub1)),
                           (i32 (EXTRACT_SUBREG (i64 $src0), sub0)), $src1)>;
@@ -2385,6 +2404,20 @@ def : GCNPat<(i32 (trunc (srl i64:$src0, (and i32:$src1, (i32 31))))),
 def : GCNPat<(i32 (trunc (srl i64:$src0, (i32 ShiftAmt32Imm:$src1)))),
           (V_ALIGNBIT_B32_e64 (i32 (EXTRACT_SUBREG (i64 $src0), sub1)),
                           (i32 (EXTRACT_SUBREG (i64 $src0), sub0)), $src1)>;
+
+def : GCNPat <
+  (rotr v2i32:$src0, v2i32:$src1),
+  (REG_SEQUENCE VReg_64,
+    (V_ALIGNBIT_B32_e64
+      (i32 (EXTRACT_SUBREG VReg_64:$src0, sub0)), 
+      (i32 (EXTRACT_SUBREG VReg_64:$src0, sub0)), 
+      (i32 (EXTRACT_SUBREG VReg_64:$src1, sub0))), sub0,
+    (V_ALIGNBIT_B32_e64
+      (i32 (EXTRACT_SUBREG VReg_64:$src0, sub1)),
+      (i32 (EXTRACT_SUBREG VReg_64:$src0, sub1)), 
+      (i32 (EXTRACT_SUBREG VReg_64:$src1, sub1))), sub1)
+>;
+
 } // end True16Predicate = NotHasTrue16BitInsts
 
 let True16Predicate = UseRealTrue16Insts in {
@@ -2397,6 +2430,20 @@ def : GCNPat <
                           /* clamp */ 0, /* op_sel */ 0)
 >;
 
+def : GCNPat <
+  (rotr v2i32:$src0, v2i32:$src1),
+  (REG_SEQUENCE VReg_64,
+    (V_ALIGNBIT_B32_t16_e64
+      0, (i32 (EXTRACT_SUBREG VReg_64:$src0, sub0)), 
+      0, (i32 (EXTRACT_SUBREG VReg_64:$src0, sub0)), 
+      0, (EXTRACT_SUBREG (i32 (EXTRACT_SUBREG VReg_64:$src1, sub0)) ,lo16),0,0), sub0,
+    (V_ALIGNBIT_B32_t16_e64
+      0, (i32 (EXTRACT_SUBREG VReg_64:$src0, sub1)),
+      0, (i32 (EXTRACT_SUBREG VReg_64:$src0, sub1)), 
+      0, (EXTRACT_SUBREG (i32 (EXTRACT_SUBREG VReg_64:$src1, sub0)) ,lo16),0,0), sub1)
+>;
+
+
 def : GCNPat<(i32 (trunc (srl i64:$src0, (i32 ShiftAmt32Imm:$src1)))),
           (V_ALIGNBIT_B32_t16_e64 0, /* src0_modifiers */
                           (i32 (EXTRACT_SUBREG (i64 $src0), sub1)),
@@ -2423,6 +2470,20 @@ def : GCNPat <
                              $src1, /* clamp */ 0, /* op_sel */ 0)
 >;
 
+def : GCNPat <
+  (rotr v2i32:$src0, v2i32:$src1),
+  (REG_SEQUENCE VReg_64,
+    (V_ALIGNBIT_B32_fake16_e64
+      0, (i32 (EXTRACT_SUBREG VReg_64:$src0, sub0)), 
+      0, (i32 (EXTRACT_SUBREG VReg_64:$src0, sub0)), 
+      0, (i32 (EXTRACT_SUBREG VReg_64:$src1, sub0)),0,0), sub0,
+    (V_ALIGNBIT_B32_fake16_e64
+      0, (i32 (EXTRACT_SUBREG VReg_64:$src0, sub1)),
+      0, (i32 (EXTRACT_SUBREG VReg_64:$src0, sub1)), 
+      0, (i32 (EXTRACT_SUBREG VReg_64:$src1, sub1)),0,0), sub1)
+>;
+
+
 def : GCNPat<(i32 (trunc (srl i64:$src0, (and i32:$src1, (i32 31))))),
      (V_ALIGNBIT_B32_fake16_e64 0, /* src0_modifiers */
                                (i32 (EXTRACT_SUBREG (i64 $src0), sub1)),
@@ -2449,6 +2510,7 @@ def : GCNPat<(fshr i32:$src0, i32:$src1, i32:$src2),
 >;
 } // end True16Predicate = UseFakeTrue16Insts
 
+
 /********** ====================== **********/
 /**********   Indirect addressing  **********/
 /********** ====================== **********/
diff --git a/llvm/lib/Target/AMDGPU/SOPInstructions.td b/llvm/lib/Target/AMDGPU/SOPInstructions.td
index 40b3dfb94ce2f..f2e1a27644afb 100644
--- a/llvm/lib/Target/AMDGPU/SOPInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SOPInstructions.td
@@ -1779,6 +1779,21 @@ def : GCNPat <
   (S_MOV_B32 imm:$imm)
 >;
 
+def : GCNPat <
+  (v2i32 (UniformBinFrag<and> v2i32:$x, v2i32:$y)),
+  (S_AND_B64 SReg_64:$x, SReg_64:$y)
+>;
+
+def : GCNPat <
+  (v2i32 (UniformBinFrag<or> v2i32:$x, v2i32:$y)),
+  (S_OR_B64 SReg_64:$x, SReg_64:$y)
+>;
+
+def : GCNPat <
+  (v2i32 (UniformBinFrag<xor> v2i32:$x, v2i32:$y)),
+  (S_XOR_B64 SReg_64:$x, SReg_64:$y)
+>;
+
 // Same as a 32-bit inreg
 def : GCNPat<
   (i32 (UniformUnaryFrag<sext> i16:$src)),
diff --git a/llvm/lib/Target/AMDGPU/VOP2Instructions.td b/llvm/lib/Target/AMDGPU/VOP2Instructions.td
index 0c7e20fc1ebf3..0fe09a66ada58 100644
--- a/llvm/lib/Target/AMDGPU/VOP2Instructions.td
+++ b/llvm/lib/Target/AMDGPU/VOP2Instructions.td
@@ -954,9 +954,9 @@ def : DivergentClampingBinOp<sub, V_SUB_CO_U32_e64>;
 def : DivergentBinOp<adde, V_ADDC_U32_e32>;
 def : DivergentBinOp<sube, V_SUBB_U32_e32>;
 
-class divergent_i64_BinOp <SDPatternOperator Op, Instruction Inst> :
+class divergent_i64_BinOp <SDPatternOperator Op, Instruction Inst, ValueType vt = i64> :
   GCNPat<
-      (DivergentBinFrag<Op> i64:$src0, i64:$src1),
+      (DivergentBinFrag<Op> vt:$src0, vt:$src1),
       (REG_SEQUENCE VReg_64,
         (Inst
           (i32 (EXTRACT_SUBREG $src0, sub0)),
@@ -973,6 +973,11 @@ def :  divergent_i64_BinOp <and, V_AND_B32_e64>;
 def :  divergent_i64_BinOp <or,  V_OR_B32_e64>;
 def :  divergent_i64_BinOp <xor, V_XOR_B32_e64>;
 
+def :  divergent_i64_BinOp <and, V_AND_B32_e64, v2i32>;
+def :  divergent_i64_BinOp <or,  V_OR_B32_e64, v2i32>;
+def :  divergent_i64_BinOp <xor, V_XOR_B32_e64, v2i32>;
+
+
 // mul24 w/ 64 bit output.
 class mul24_64_Pat<SDPatternOperator Op, Instruction InstLo, Instruction InstHi> : GCNPat<
   (i64 (Op i32:$src0, i32:$src1)),
diff --git a/llvm/test/CodeGen/AMDGPU/and.ll b/llvm/test/CodeGen/AMDGPU/and.ll
index c6233642110ea..05402b3c89409 100644
--- a/llvm/test/CodeGen/AMDGPU/and.ll
+++ b/llvm/test/CodeGen/AMDGPU/and.ll
@@ -8,8 +8,7 @@ declare i32 @llvm.amdgcn.workitem.id.x() #0
 ; EG: AND_INT {{\*? *}}T{{[0-9]+\.[XYZW], T[0-9]+\.[XYZW], T[0-9]+\.[XYZW]}}
 ; EG: AND_INT {{\*? *}}T{{[0-9]+\.[XYZW], T[0-9]+\.[XYZW], T[0-9]+\.[XYZW]}}
 
-; SI: s_and_b32 s{{[0-9]+, s[0-9]+, s[0-9]+}}
-; SI: s_and_b32 s{{[0-9]+, s[0-9]+, s[0-9]+}}
+; SI: s_and_b64
 
 define amdgpu_kernel void @test2(ptr addrspace(1) %out, ptr addrspace(1) %in) {
   %b_ptr = getelementptr <2 x i32>, ptr addrspace(1) %in, i32 1
diff --git a/llvm/test/CodeGen/AMDGPU/bf16-conversions.ll b/llvm/test/CodeGen/AMDGPU/bf16-conversions.ll
index a597faa028f22..ca8f7736f6093 100644
--- a/llvm/test/CodeGen/AMDGPU/bf16-conversions.ll
+++ b/llvm/test/CodeGen/AMDGPU/bf16-conversions.ll
@@ -151,25 +151,25 @@ define amdgpu_ps float @v_test_cvt_v2f64_v2bf16_v(<2 x double> %src) {
 ; GFX-950-LABEL: v_test_cvt_v2f64_v2bf16_v:
 ; GFX-950:       ; %bb.0:
 ; GFX-950-NEXT:    v_cvt_f32_f64_e32 v6, v[2:3]
+; GFX-950-NEXT:    v_and_b32_e32 v4, 1, v6
+; GFX-950-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v4
 ; GFX-950-NEXT:    v_cvt_f64_f32_e32 v[4:5], v6
-; GFX-950-NEXT:    v_and_b32_e32 v7, 1, v6
 ; GFX-950-NEXT:    v_cmp_gt_f64_e64 s[2:3], |v[2:3]|, |v[4:5]|
-; GFX-950-NEXT:    v_cmp_nlg_f64_e32 vcc, v[2:3], v[4:5]
-; GFX-950-NEXT:    v_cmp_eq_u32_e64 s[0:1], 1, v7
+; GFX-950-NEXT:    v_cmp_nlg_f64_e64 s[0:1], v[2:3], v[4:5]
+; GFX-950-NEXT:    v_cvt_f32_f64_e32 v7, v[0:1]
 ; GFX-950-NEXT:    v_cndmask_b32_e64 v2, -1, 1, s[2:3]
 ; GFX-950-NEXT:    v_add_u32_e32 v2, v6, v2
-; GFX-950-NEXT:    s_or_b64 vcc, vcc, s[0:1]
-; GFX-950-NEXT:    v_cvt_f32_f64_e32 v5, v[0:1]
+; GFX-950-NEXT:    s_or_b64 vcc, s[0:1], vcc
 ; GFX-950-NEXT:    v_cndmask_b32_e32 v4, v2, v6, vcc
-; GFX-950-NEXT:    v_cvt_f64_f32_e32 v[2:3], v5
-; GFX-950-NEXT:    v_and_b32_e32 v6, 1, v5
+; GFX-950-NEXT:    v_cvt_f64_f32_e32 v[2:3], v7
+; GFX-950-NEXT:    v_and_b32_e32 v8, 1, v7
 ; GFX-950-NEXT:    v_cmp_gt_f64_e64 s[2:3], |v[0:1]|, |v[2:3]|
-; GFX-950-NEXT:    v_cmp_nlg_f64_e32 vcc, v[0:1], v[2:3]
-; GFX-950-NEXT:    v_cmp_eq_u32_e64 s[0:1], 1, v6
+; GFX-950-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v8
+; GFX-950-NEXT:    v_cmp_nlg_f64_e64 s[0:1], v[0:1], v[2:3]
 ; GFX-950-NEXT:    v_cndmask_b32_e64 v0, -1, 1, s[2:3]
-; GFX-950-NEXT:    v_add_u32_e32 v0, v5, v0
-; GFX-950-NEXT:    s_or_b64 vcc, vcc, s[0:1]
-; GFX-950-NEXT:    v_cndmask_b32_e32 v0, v0, v5, vcc
+; GFX-950-NEXT:    v_add_u32_e32 v0, v7, v0
+; GFX-950-NEXT:    s_or_b64 vcc, s[0:1], vcc
+; GFX-950-NEXT:    v_cndmask_b32_e32 v0, v0, v7, vcc
 ; GFX-950-NEXT:    v_cvt_pk_bf16_f32 v0, v0, v4
 ; GFX-950-NEXT:    ; return to shader part epilog
   %res = fptrunc <2 x double> %src to <2 x bfloat>
diff --git a/llvm/test/CodeGen/AMDGPU/bfi_int.ll b/llvm/test/CodeGen/AMDGPU/bfi_int.ll
index 201b97d479c68..d76ecbd73fe6e 100644
--- a/llvm/test/CodeGen/AMDGPU/bfi_int.ll
+++ b/llvm/test/CodeGen/AMDGPU/bfi_int.ll
@@ -582,15 +582,15 @@ define <2 x i32> @v_bitselect_v2i32_pat1(<2 x i32> %a, <2 x i32> %b, <2 x i32> %
 ; GFX7-LABEL: v_bitselect_v2i32_pat1:
 ; GFX7:       ; %bb.0:
 ; GFX7-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX7-NEXT:    v_bfi_b32 v0, v2, v0, v4
 ; GFX7-NEXT:    v_bfi_b32 v1, v3, v1, v5
+; GFX7-NEXT:    v_bfi_b32 v0, v2, v0, v4
 ; GFX7-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX8-LABEL: v_bitselect_v2i32_pat1:
 ; GFX8:       ; %bb.0:
 ; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8-NEXT:    v_bfi_b32 v0, v2, v0, v4
 ; GFX8-NEXT:    v_bfi_b32 v1, v3, v1, v5
+; GFX8-NEXT:    v_bfi_b32 v0, v2, v0, v4
 ; GFX8-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX10-LABEL: v_bitselect_v2i32_pat1:
diff --git a/llvm/test/CodeGen/AMDGPU/copysign-simplify-demanded-bits.ll b/llvm/test/CodeGen/AMDGPU/copysign-simplify-demanded-bits.ll
index a01c2fa152ab3..2d73f17d74d8b 100644
--- a/llvm/test/CodeGen/AMDGPU/copysign-simplify-demanded-bits.ll
+++ b/llvm/test/CodeGen/AMDGPU/copysign-simplify-demanded-bits.ll
@@ -31,8 +31,8 @@ define <2 x half> @test_pown_reduced_fast_v2f16_known_odd(<2 x half> %x, <2 x i3
 ; GFX9-LABEL: test_pown_reduced_fast_v2f16_known_odd:
 ; GFX9:       ; %bb.0:
 ; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX9-NEXT:    v_or_b32_e32 v1, 1, v1
 ; GFX9-NEXT:    v_or_b32_e32 v2, 1, v2
+; GFX9-NEXT:    v_or_b32_e32 v1, 1, v1
 ; GFX9-NEXT:    v_cvt_f32_i32_e32 v2, v2
 ; GFX9-NEXT:    v_cvt_f32_i32_e32 v1, v1
 ; GFX9-NEXT:    v_and_b32_e32 v3, 0x7fff7fff, v0
diff --git a/llvm/test/CodeGen/AMDGPU/dag-preserve-disjoint-flag.ll b/llvm/test/CodeGen/AMDGPU/dag-preserve-disjoint-flag.ll
index d63a36c4b2958..7e2e8b577e085 100644
--- a/llvm/test/CodeGen/AMDGPU/dag-preserve-disjoint-flag.ll
+++ b/llvm/test/CodeGen/AMDGPU/dag-preserve-disjoint-flag.ll
@@ -28,12 +28,15 @@ define amdgpu_ps <2 x i32> @s_or_v2i32_disjoint(<2 x i32> inreg %a, <2 x i32> in
   ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:sgpr_32 = COPY $sgpr2
   ; CHECK-NEX...
[truncated]

github-actions · 2025-05-20T09:00:30Z

✅ With the latest revision this PR passed the C/C++ code formatter.

llvm/lib/Target/AMDGPU/SIInstructions.td

llvm/test/CodeGen/AMDGPU/rotr-v2i32.ll

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

llvm/lib/Target/AMDGPU/SIInstructions.td

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

chrisjbris · 2025-05-28T09:21:29Z

Combined existing commits and modified commit message for easier review.

JanekvO

Couple of low hanging fruit comments for now

llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

llvm/lib/Target/AMDGPU/SIInstructions.td

JanekvO

Should also add the issue you're fixing: #124776

llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

llvm/lib/Target/AMDGPU/SIInstructions.td

llvm/test/CodeGen/AMDGPU/xor.ll

llvm/test/CodeGen/AMDGPU/vector_range_metadata.ll

llvm/test/CodeGen/AMDGPU/shl64_reduce.ll

llvm/test/CodeGen/AMDGPU/dag-preserve-disjoint-flag.ll

llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp

chrisjbris · 2025-06-10T10:51:19Z

Rebased commits and updated the commit message for easier review.

or/xor/and are legalized.

performExtractVectorEltCombine().

chrisjbris · 2025-09-09T15:26:07Z

Rebased.

JanekvO

LGTM
One last request on my end: fix the github summary or redo the commit message when landing

github-actions · 2025-09-11T12:33:05Z

@chrisjbris Congratulations on having your first Pull Request (PR) merged into the LLVM Project!

Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR.

Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. It is not uncommon for your change to be included in a build that fails due to someone else's changes, or infrastructure issues.

How to do this, and the rest of the post-merge process, is covered in detail here.

If your change does cause a problem, it may be reverted, or you can revert it yourself. This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again.

If you don't get any reports, no action is required from you. Your changes are working as expected, well done!

chrisjbris requested review from JanekvO, arsenm and bcahoon May 20, 2025 08:57

chrisjbris self-assigned this May 20, 2025

chrisjbris added the backend:AMDGPU label May 20, 2025

chrisjbris force-pushed the 124775_AMDGPU_v2i32_bitwise_ops branch from 125a6d4 to c6fac80 Compare May 20, 2025 10:16

arsenm reviewed May 20, 2025

View reviewed changes

chrisjbris force-pushed the 124775_AMDGPU_v2i32_bitwise_ops branch 2 times, most recently from bcb5393 to 29e949b Compare May 21, 2025 09:37

JanekvO reviewed May 21, 2025

View reviewed changes

chrisjbris force-pushed the 124775_AMDGPU_v2i32_bitwise_ops branch 2 times, most recently from 7f18b35 to 957644f Compare May 22, 2025 12:45

arsenm reviewed May 22, 2025

View reviewed changes

chrisjbris force-pushed the 124775_AMDGPU_v2i32_bitwise_ops branch from d1afe09 to 06c6c6e Compare May 28, 2025 09:20

chrisjbris force-pushed the 124775_AMDGPU_v2i32_bitwise_ops branch from 06c6c6e to 995e7fb Compare May 28, 2025 09:46

chrisjbris force-pushed the 124775_AMDGPU_v2i32_bitwise_ops branch 2 times, most recently from 022b765 to 548e519 Compare June 4, 2025 13:30

chrisjbris marked this pull request as ready for review June 4, 2025 13:35

JanekvO reviewed Jun 4, 2025

View reviewed changes

chrisjbris force-pushed the 124775_AMDGPU_v2i32_bitwise_ops branch 2 times, most recently from edf0ee6 to 359fde6 Compare June 6, 2025 09:56

JanekvO reviewed Jun 6, 2025

View reviewed changes

chrisjbris force-pushed the 124775_AMDGPU_v2i32_bitwise_ops branch 2 times, most recently from 71b1e29 to b3b53fe Compare June 10, 2025 10:44

chrisjbris mentioned this pull request Jun 10, 2025

AMDGPU should make v2i32 and/or/xor legal #124776

Closed

chrisjbris added 25 commits September 9, 2025 10:18

Remove unhelpful commentary.

5f3e156

Remove unnecessary driveby clang-format

3e0fe34

Remove dead checks in xor.ll

89ea0ae

Remove unnnecessary node duplication

75c72c8

Remove dead xorcombine.

57dafa9

Work to fix regressions in integer select srcmod generation when v2i32

0a05bba

or/xor/and are legalized.

Correct clang-format

f10e16f

Remove unnecessary code reordering

51e44b8

clang-format

a0b2a91

clang-format

fe8b60e

Simplify ReplaceSrc()

309e2f9

Remove dead ternary expression from AMDGPUDAGToDAG.

7815116

Change unnecessary use of PeekSrc to Src in ReplaceSrc().

bc4813a

Update comment for v2i32 case.

5791b10

Apply reviewer corrections to getShiftForReduction()

82e51ed

Update getShiftForReduction() to type check and add comment.

b4fcd59

Remove dead truncate, correct flags() call and add detail to comment.

a8668b7

Correct comment.

151460d

Updated test.

90fa0e7

Replace getShiftForeReduction with an additional combine in

81cf80c

performExtractVectorEltCombine().

Correct formatting.

5736525

Remove deprecated getShiftForRefuction() function

1943ddc

Add TODO to performExtractVectorEltCombine()

4fc323e

Simplify combine with isConstorConstSplat().

c9c5967

Further simplify extract_vector_elt combine

deb29c8

chrisjbris force-pushed the 124775_AMDGPU_v2i32_bitwise_ops branch from d76e6fd to deb29c8 Compare September 9, 2025 15:26

JanekvO approved these changes Sep 11, 2025

View reviewed changes

chrisjbris merged commit 5e6564b into llvm:main Sep 11, 2025
9 checks passed

[AMDGPU][SDAG] Legalise v2i32 or/xor/and instructions to make use of 64-bit wide instructions #140694

[AMDGPU][SDAG] Legalise v2i32 or/xor/and instructions to make use of 64-bit wide instructions #140694

Uh oh!

Conversation

chrisjbris commented May 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented May 20, 2025

Uh oh!

llvmbot commented May 20, 2025

Uh oh!

github-actions bot commented May 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chrisjbris commented May 28, 2025

Uh oh!

JanekvO left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

JanekvO left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chrisjbris commented Jun 10, 2025

Uh oh!

chrisjbris commented Sep 9, 2025

Uh oh!

JanekvO left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

github-actions bot commented Sep 11, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

chrisjbris commented May 20, 2025 •

edited

Loading

github-actions bot commented May 20, 2025 •

edited

Loading