
Conversation

@JanekvO
Contributor

@JanekvO JanekvO commented Jun 20, 2025

Any subtarget with full mov64 support may be able to take advantage of build_vector with 64-bit vector elements. This changes build_vector selection to support 64-bit elements and adds a build_vector combine that tries to fold build_vector nodes with 32-bit elements into their 64-bit equivalent.

Related to #138843
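
For illustration (not part of the patch), a minimal standalone C++ sketch of the constant packing the new combine performs, assuming little-endian element order; the element values are made up for the example:

  // Pairs of 32-bit constant build_vector elements are merged into single
  // 64-bit values, so a subtarget with s_mov_b64/v_mov_b64 can materialize
  // each pair with one move.
  #include <cstdint>
  #include <cstdio>
  #include <vector>

  int main() {
    std::vector<uint32_t> Elts = {0x1, 0x2, 0x3, 0x4}; // a constant <4 x i32>
    std::vector<uint64_t> Packed;                      // becomes a <2 x i64>
    for (size_t I = 0; I < Elts.size(); I += 2)
      Packed.push_back(uint64_t(Elts[I]) | (uint64_t(Elts[I + 1]) << 32));
    for (uint64_t V : Packed)                            // 0x0000000200000001
      std::printf("0x%016llx\n", (unsigned long long)V); // 0x0000000400000003
    return 0;
  }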

@llvmbot
Member

llvmbot commented Jun 20, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Janek van Oirschot (JanekvO)

Changes

Any subtarget with full mov64 support may be able to take advantage of build_vector with 64-bit vector elements. This changes build_vector selection to support 64-bit elements and adds a build_vector combine that tries to fold build_vector nodes with 32-bit elements into their 64-bit equivalent.


Patch is 3.28 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/145052.diff

45 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp (+15-7)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp (+8)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+99-15)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.h (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/bf16-conversions.ll (+14-14)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-atomic-fadd.f32-no-rtn.ll (+154-85)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-atomic-fadd.f32-rtn.ll (+162-89)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-atomic-fadd.f64.ll (+618-214)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-atomic-fadd.v2f16-no-rtn.ll (+154-85)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-atomic-fadd.v2f16-rtn.ll (+162-89)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-scratch.ll (+16-32)
  • (modified) llvm/test/CodeGen/AMDGPU/fmaximum3.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/fminimum3.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/i1-to-bf16.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.gfx90a.ll (+101-58)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.maximum.f64.ll (+315-275)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.minimum.f64.ll (+315-275)
  • (modified) llvm/test/CodeGen/AMDGPU/masked-load-vectortypes.ll (+12-17)
  • (modified) llvm/test/CodeGen/AMDGPU/maximumnum.ll (+988-473)
  • (modified) llvm/test/CodeGen/AMDGPU/minimumnum.ll (+988-473)
  • (modified) llvm/test/CodeGen/AMDGPU/packed-fp32.ll (+288-141)
  • (modified) llvm/test/CodeGen/AMDGPU/preload-kernargs.ll (+3-5)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v2i64.v2i64.ll (+140-94)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v2i64.v3i64.ll (+337-238)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v2i64.v4i64.ll (+817-456)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v2i64.v8i64.ll (+2575-1764)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v2p0.v2p0.ll (+140-94)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v2p0.v3p0.ll (+337-238)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v2p0.v4p0.ll (+817-456)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v3i64.v2i64.ll (+429-256)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v3i64.v3i64.ll (+1151-691)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v3i64.v4i64.ll (+2800-2259)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v3p0.v2p0.ll (+429-256)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v3p0.v3p0.ll (+1151-691)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v3p0.v4p0.ll (+2800-2259)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4i64.v2i64.ll (+1839-962)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4i64.v3i64.ll (+1376-1013)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4i64.v4i64.ll (+4594-3830)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4p0.v2p0.ll (+1839-962)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4p0.v3p0.ll (+1376-1013)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4p0.v4p0.ll (+4594-3830)
  • (modified) llvm/test/CodeGen/AMDGPU/smfmac_no_agprs.ll (+11-14)
  • (modified) llvm/test/CodeGen/AMDGPU/vector-reduce-fmax.ll (+4-3)
  • (modified) llvm/test/CodeGen/AMDGPU/vector-reduce-fmin.ll (+3-2)
  • (modified) llvm/test/CodeGen/AMDGPU/vni8-across-blocks.ll (+2-3)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
index 6e990cb2e160c..044c073e7f918 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
@@ -440,6 +440,8 @@ void AMDGPUDAGToDAGISel::SelectBuildVector(SDNode *N, unsigned RegClassID) {
   EVT EltVT = VT.getVectorElementType();
   SDLoc DL(N);
   SDValue RegClass = CurDAG->getTargetConstant(RegClassID, DL, MVT::i32);
+  unsigned NumRegs = EltVT.getSizeInBits() / 32;
+  bool IsGCN = CurDAG->getSubtarget().getTargetTriple().isAMDGCN();
 
   if (NumVectorElts == 1) {
     CurDAG->SelectNodeTo(N, AMDGPU::COPY_TO_REGCLASS, EltVT, N->getOperand(0),
@@ -449,12 +451,13 @@ void AMDGPUDAGToDAGISel::SelectBuildVector(SDNode *N, unsigned RegClassID) {
 
   assert(NumVectorElts <= 32 && "Vectors with more than 32 elements not "
                                   "supported yet");
+  assert((IsGCN || (!IsGCN && NumRegs == 1)) &&
+         "R600 does not support 64-bit reg_seq elements");
   // 32 = Max Num Vector Elements
   // 2 = 2 REG_SEQUENCE operands per element (value, subreg index)
   // 1 = Vector Register Class
   SmallVector<SDValue, 32 * 2 + 1> RegSeqArgs(NumVectorElts * 2 + 1);
 
-  bool IsGCN = CurDAG->getSubtarget().getTargetTriple().isAMDGCN();
   RegSeqArgs[0] = CurDAG->getTargetConstant(RegClassID, DL, MVT::i32);
   bool IsRegSeq = true;
   unsigned NOps = N->getNumOperands();
@@ -464,8 +467,9 @@ void AMDGPUDAGToDAGISel::SelectBuildVector(SDNode *N, unsigned RegClassID) {
       IsRegSeq = false;
       break;
     }
-    unsigned Sub = IsGCN ? SIRegisterInfo::getSubRegFromChannel(i)
-                         : R600RegisterInfo::getSubRegFromChannel(i);
+    unsigned Sub =
+        IsGCN ? SIRegisterInfo::getSubRegFromChannel(i * NumRegs, NumRegs)
+              : R600RegisterInfo::getSubRegFromChannel(i);
     RegSeqArgs[1 + (2 * i)] = N->getOperand(i);
     RegSeqArgs[1 + (2 * i) + 1] = CurDAG->getTargetConstant(Sub, DL, MVT::i32);
   }
@@ -475,8 +479,9 @@ void AMDGPUDAGToDAGISel::SelectBuildVector(SDNode *N, unsigned RegClassID) {
     MachineSDNode *ImpDef = CurDAG->getMachineNode(TargetOpcode::IMPLICIT_DEF,
                                                    DL, EltVT);
     for (unsigned i = NOps; i < NumVectorElts; ++i) {
-      unsigned Sub = IsGCN ? SIRegisterInfo::getSubRegFromChannel(i)
-                           : R600RegisterInfo::getSubRegFromChannel(i);
+      unsigned Sub =
+          IsGCN ? SIRegisterInfo::getSubRegFromChannel(i * NumRegs, NumRegs)
+                : R600RegisterInfo::getSubRegFromChannel(i);
       RegSeqArgs[1 + (2 * i)] = SDValue(ImpDef, 0);
       RegSeqArgs[1 + (2 * i) + 1] =
           CurDAG->getTargetConstant(Sub, DL, MVT::i32);
@@ -644,9 +649,12 @@ void AMDGPUDAGToDAGISel::Select(SDNode *N) {
       break;
     }
 
-    assert(VT.getVectorElementType().bitsEq(MVT::i32));
+    EVT VET = VT.getVectorElementType();
+    assert(VET.bitsEq(MVT::i32) || VET.bitsEq(MVT::i64));
+    unsigned EltSize = VET.getSizeInBits();
     unsigned RegClassID =
-        SIRegisterInfo::getSGPRClassForBitWidth(NumVectorElts * 32)->getID();
+        SIRegisterInfo::getSGPRClassForBitWidth(NumVectorElts * EltSize)
+            ->getID();
     SelectBuildVector(N, RegClassID);
     return;
   }
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index 134adc681215f..8a12b9b25d713 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -5206,6 +5206,14 @@ SDValue AMDGPUTargetLowering::PerformDAGCombine(SDNode *N,
   case ISD::BITCAST: {
     EVT DestVT = N->getValueType(0);
 
+    // Avoid undoing build_vector with 64b elements if subtarget supports 64b
+    // movs (i.e., avoid inf loop through combines).
+    if (Subtarget->isGCN()) {
+      const GCNSubtarget &ST = DAG.getSubtarget<GCNSubtarget>();
+      if (ST.hasMovB64())
+        break;
+    }
+
     // Push casts through vector builds. This helps avoid emitting a large
     // number of copies when materializing floating point vector constants.
     //
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 07d79d677104a..2dc12357cb4b0 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -357,9 +357,12 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
   // Most operations are naturally 32-bit vector operations. We only support
   // load and store of i64 vectors, so promote v2i64 vector operations to v4i32.
   for (MVT Vec64 : {MVT::v2i64, MVT::v2f64}) {
-    setOperationAction(ISD::BUILD_VECTOR, Vec64, Promote);
-    AddPromotedToType(ISD::BUILD_VECTOR, Vec64, MVT::v4i32);
-
+    if (STI.hasMovB64())
+      setOperationAction(ISD::BUILD_VECTOR, Vec64, Legal);
+    else {
+      setOperationAction(ISD::BUILD_VECTOR, Vec64, Promote);
+      AddPromotedToType(ISD::BUILD_VECTOR, Vec64, MVT::v4i32);
+    }
     setOperationAction(ISD::EXTRACT_VECTOR_ELT, Vec64, Promote);
     AddPromotedToType(ISD::EXTRACT_VECTOR_ELT, Vec64, MVT::v4i32);
 
@@ -371,9 +374,12 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
   }
 
   for (MVT Vec64 : {MVT::v3i64, MVT::v3f64}) {
-    setOperationAction(ISD::BUILD_VECTOR, Vec64, Promote);
-    AddPromotedToType(ISD::BUILD_VECTOR, Vec64, MVT::v6i32);
-
+    if (STI.hasMovB64())
+      setOperationAction(ISD::BUILD_VECTOR, Vec64, Legal);
+    else {
+      setOperationAction(ISD::BUILD_VECTOR, Vec64, Promote);
+      AddPromotedToType(ISD::BUILD_VECTOR, Vec64, MVT::v6i32);
+    }
     setOperationAction(ISD::EXTRACT_VECTOR_ELT, Vec64, Promote);
     AddPromotedToType(ISD::EXTRACT_VECTOR_ELT, Vec64, MVT::v6i32);
 
@@ -385,9 +391,12 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
   }
 
   for (MVT Vec64 : {MVT::v4i64, MVT::v4f64}) {
-    setOperationAction(ISD::BUILD_VECTOR, Vec64, Promote);
-    AddPromotedToType(ISD::BUILD_VECTOR, Vec64, MVT::v8i32);
-
+    if (STI.hasMovB64())
+      setOperationAction(ISD::BUILD_VECTOR, Vec64, Legal);
+    else {
+      setOperationAction(ISD::BUILD_VECTOR, Vec64, Promote);
+      AddPromotedToType(ISD::BUILD_VECTOR, Vec64, MVT::v8i32);
+    }
     setOperationAction(ISD::EXTRACT_VECTOR_ELT, Vec64, Promote);
     AddPromotedToType(ISD::EXTRACT_VECTOR_ELT, Vec64, MVT::v8i32);
 
@@ -399,9 +408,12 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
   }
 
   for (MVT Vec64 : {MVT::v8i64, MVT::v8f64}) {
-    setOperationAction(ISD::BUILD_VECTOR, Vec64, Promote);
-    AddPromotedToType(ISD::BUILD_VECTOR, Vec64, MVT::v16i32);
-
+    if (STI.hasMovB64())
+      setOperationAction(ISD::BUILD_VECTOR, Vec64, Legal);
+    else {
+      setOperationAction(ISD::BUILD_VECTOR, Vec64, Promote);
+      AddPromotedToType(ISD::BUILD_VECTOR, Vec64, MVT::v16i32);
+    }
     setOperationAction(ISD::EXTRACT_VECTOR_ELT, Vec64, Promote);
     AddPromotedToType(ISD::EXTRACT_VECTOR_ELT, Vec64, MVT::v16i32);
 
@@ -413,9 +425,12 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
   }
 
   for (MVT Vec64 : {MVT::v16i64, MVT::v16f64}) {
-    setOperationAction(ISD::BUILD_VECTOR, Vec64, Promote);
-    AddPromotedToType(ISD::BUILD_VECTOR, Vec64, MVT::v32i32);
-
+    if (STI.hasMovB64())
+      setOperationAction(ISD::BUILD_VECTOR, Vec64, Legal);
+    else {
+      setOperationAction(ISD::BUILD_VECTOR, Vec64, Promote);
+      AddPromotedToType(ISD::BUILD_VECTOR, Vec64, MVT::v32i32);
+    }
     setOperationAction(ISD::EXTRACT_VECTOR_ELT, Vec64, Promote);
     AddPromotedToType(ISD::EXTRACT_VECTOR_ELT, Vec64, MVT::v32i32);
 
@@ -945,6 +960,7 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
   }
 
   setTargetDAGCombine({ISD::ADD,
+                       ISD::BUILD_VECTOR,
                        ISD::UADDO_CARRY,
                        ISD::SUB,
                        ISD::USUBO_CARRY,
@@ -15486,6 +15502,72 @@ SDValue SITargetLowering::performClampCombine(SDNode *N,
   return SDValue(CSrc, 0);
 }
 
+SDValue
+SITargetLowering::performBuildVectorCombine(SDNode *N,
+                                            DAGCombinerInfo &DCI) const {
+  const GCNSubtarget *ST = getSubtarget();
+  if (DCI.Level < AfterLegalizeDAG || !ST->hasMovB64())
+    return SDValue();
+
+  SelectionDAG &DAG = DCI.DAG;
+  SDLoc SL(N);
+  BuildVectorSDNode *BVN = dyn_cast<BuildVectorSDNode>(N);
+
+  EVT VT = N->getValueType(0);
+  EVT EltVT = VT.getVectorElementType();
+  unsigned SizeBits = VT.getSizeInBits();
+  unsigned EltSize = EltVT.getSizeInBits();
+
+  // Skip if:
+  //  - Value type isn't multiplication of 64 bit (e.g., v3i32), or
+  //  - BuildVector instruction has non-constants, or
+  //  - Element type has already been combined into i64 elements
+  if ((SizeBits % 64) != 0 || !BVN->isConstant() || EltVT == MVT::i64)
+    return SDValue();
+
+  // Construct the 64b values.
+  SmallVector<uint64_t, 8> ImmVals;
+  uint64_t ImmVal = 0;
+  uint64_t ImmSize = 0;
+  for (SDValue Opand : N->ops()) {
+    ConstantSDNode *C = dyn_cast<ConstantSDNode>(Opand);
+    if (!C)
+      return SDValue();
+
+    ImmVal |= C->getZExtValue() << ImmSize;
+    ImmSize += EltSize;
+    if (ImmSize > 64)
+      return SDValue();
+    if (ImmSize == 64) {
+      if (!isUInt<32>(ImmVal))
+        return SDValue();
+      ImmVals.push_back(ImmVal);
+      ImmVal = 0;
+      ImmSize = 0;
+    }
+  }
+
+  // Avoid emitting build_vector with 1 element and directly emit value.
+  if (ImmVals.size() == 1) {
+    SDValue Val = DAG.getConstant(ImmVals[0], SL, MVT::i64);
+    return DAG.getNode(ISD::BITCAST, SL, MVT::v2i32, Val);
+  }
+
+  // Construct and return build_vector with 64b elements.
+  if (!ImmVals.empty()) {
+    SmallVector<SDValue, 8> VectorConsts;
+    for (uint64_t I : ImmVals)
+      VectorConsts.push_back(DAG.getConstant(I, SL, MVT::i64));
+    unsigned NewNumElts = SizeBits / 64;
+    LLVMContext &Ctx = *DAG.getContext();
+    EVT NewVT = EVT::getVectorVT(Ctx, MVT::i64, NewNumElts);
+    SDValue BV = DAG.getBuildVector(
+        NewVT, SL, ArrayRef(VectorConsts.begin(), VectorConsts.end()));
+    return DAG.getBitcast(VT, BV);
+  }
+  return SDValue();
+}
+
 SDValue SITargetLowering::PerformDAGCombine(SDNode *N,
                                             DAGCombinerInfo &DCI) const {
   switch (N->getOpcode()) {
@@ -15573,6 +15655,8 @@ SDValue SITargetLowering::PerformDAGCombine(SDNode *N,
     return performFCanonicalizeCombine(N, DCI);
   case AMDGPUISD::RCP:
     return performRcpCombine(N, DCI);
+  case ISD::BUILD_VECTOR:
+    return performBuildVectorCombine(N, DCI);
   case ISD::FLDEXP:
   case AMDGPUISD::FRACT:
   case AMDGPUISD::RSQ:
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.h b/llvm/lib/Target/AMDGPU/SIISelLowering.h
index 89fb12b52c3e6..8be0631b37dd2 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.h
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.h
@@ -231,6 +231,7 @@ class SITargetLowering final : public AMDGPUTargetLowering {
   SDValue performCvtF32UByteNCombine(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue performClampCombine(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue performRcpCombine(SDNode *N, DAGCombinerInfo &DCI) const;
+  SDValue performBuildVectorCombine(SDNode *N, DAGCombinerInfo &DCI) const;
 
   bool isLegalMUBUFAddressingMode(const AddrMode &AM) const;
 
diff --git a/llvm/test/CodeGen/AMDGPU/bf16-conversions.ll b/llvm/test/CodeGen/AMDGPU/bf16-conversions.ll
index a597faa028f22..8c48f9a4a7b5b 100644
--- a/llvm/test/CodeGen/AMDGPU/bf16-conversions.ll
+++ b/llvm/test/CodeGen/AMDGPU/bf16-conversions.ll
@@ -152,24 +152,24 @@ define amdgpu_ps float @v_test_cvt_v2f64_v2bf16_v(<2 x double> %src) {
 ; GFX-950:       ; %bb.0:
 ; GFX-950-NEXT:    v_cvt_f32_f64_e32 v6, v[2:3]
 ; GFX-950-NEXT:    v_cvt_f64_f32_e32 v[4:5], v6
-; GFX-950-NEXT:    v_and_b32_e32 v7, 1, v6
-; GFX-950-NEXT:    v_cmp_gt_f64_e64 s[2:3], |v[2:3]|, |v[4:5]|
+; GFX-950-NEXT:    v_cmp_gt_f64_e64 s[0:1], |v[2:3]|, |v[4:5]|
 ; GFX-950-NEXT:    v_cmp_nlg_f64_e32 vcc, v[2:3], v[4:5]
-; GFX-950-NEXT:    v_cmp_eq_u32_e64 s[0:1], 1, v7
-; GFX-950-NEXT:    v_cndmask_b32_e64 v2, -1, 1, s[2:3]
-; GFX-950-NEXT:    v_add_u32_e32 v2, v6, v2
+; GFX-950-NEXT:    v_and_b32_e32 v2, 1, v6
+; GFX-950-NEXT:    v_cndmask_b32_e64 v7, -1, 1, s[0:1]
+; GFX-950-NEXT:    v_cvt_f32_f64_e32 v8, v[0:1]
+; GFX-950-NEXT:    v_cmp_eq_u32_e64 s[0:1], 1, v2
+; GFX-950-NEXT:    v_add_u32_e32 v7, v6, v7
 ; GFX-950-NEXT:    s_or_b64 vcc, vcc, s[0:1]
-; GFX-950-NEXT:    v_cvt_f32_f64_e32 v5, v[0:1]
-; GFX-950-NEXT:    v_cndmask_b32_e32 v4, v2, v6, vcc
-; GFX-950-NEXT:    v_cvt_f64_f32_e32 v[2:3], v5
-; GFX-950-NEXT:    v_and_b32_e32 v6, 1, v5
-; GFX-950-NEXT:    v_cmp_gt_f64_e64 s[2:3], |v[0:1]|, |v[2:3]|
+; GFX-950-NEXT:    v_cvt_f64_f32_e32 v[2:3], v8
+; GFX-950-NEXT:    v_cndmask_b32_e32 v4, v7, v6, vcc
+; GFX-950-NEXT:    v_cmp_gt_f64_e64 s[0:1], |v[0:1]|, |v[2:3]|
 ; GFX-950-NEXT:    v_cmp_nlg_f64_e32 vcc, v[0:1], v[2:3]
-; GFX-950-NEXT:    v_cmp_eq_u32_e64 s[0:1], 1, v6
-; GFX-950-NEXT:    v_cndmask_b32_e64 v0, -1, 1, s[2:3]
-; GFX-950-NEXT:    v_add_u32_e32 v0, v5, v0
+; GFX-950-NEXT:    v_and_b32_e32 v0, 1, v8
+; GFX-950-NEXT:    v_cndmask_b32_e64 v5, -1, 1, s[0:1]
+; GFX-950-NEXT:    v_cmp_eq_u32_e64 s[0:1], 1, v0
+; GFX-950-NEXT:    v_add_u32_e32 v5, v8, v5
 ; GFX-950-NEXT:    s_or_b64 vcc, vcc, s[0:1]
-; GFX-950-NEXT:    v_cndmask_b32_e32 v0, v0, v5, vcc
+; GFX-950-NEXT:    v_cndmask_b32_e32 v0, v5, v8, vcc
 ; GFX-950-NEXT:    v_cvt_pk_bf16_f32 v0, v0, v4
 ; GFX-950-NEXT:    ; return to shader part epilog
   %res = fptrunc <2 x double> %src to <2 x bfloat>
diff --git a/llvm/test/CodeGen/AMDGPU/buffer-atomic-fadd.f32-no-rtn.ll b/llvm/test/CodeGen/AMDGPU/buffer-atomic-fadd.f32-no-rtn.ll
index a14114358433a..9b783ed9d7772 100644
--- a/llvm/test/CodeGen/AMDGPU/buffer-atomic-fadd.f32-no-rtn.ll
+++ b/llvm/test/CodeGen/AMDGPU/buffer-atomic-fadd.f32-no-rtn.ll
@@ -1,11 +1,11 @@
 ; NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
 ; RUN: llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs -stop-after=amdgpu-isel < %s | FileCheck -check-prefix=GFX908_GFX11 %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx90a -verify-machineinstrs -stop-after=amdgpu-isel < %s | FileCheck -check-prefix=GFX90A_GFX942 %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx942 -verify-machineinstrs -stop-after=amdgpu-isel < %s | FileCheck -check-prefix=GFX90A_GFX942 %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx90a -verify-machineinstrs -stop-after=amdgpu-isel < %s | FileCheck -check-prefixes=GFX90A_GFX942,GFX90A %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx942 -verify-machineinstrs -stop-after=amdgpu-isel < %s | FileCheck -check-prefixes=GFX90A_GFX942,GFX942 %s
 ; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -verify-machineinstrs -stop-after=amdgpu-isel < %s | FileCheck -check-prefix=GFX908_GFX11 %s
 ; RUN: llc -mtriple=amdgcn -mcpu=gfx908 -enable-new-pm -stop-after=amdgpu-isel < %s | FileCheck -check-prefix=GFX908_GFX11 %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx90a -enable-new-pm -stop-after=amdgpu-isel < %s | FileCheck -check-prefix=GFX90A_GFX942 %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx942 -enable-new-pm -stop-after=amdgpu-isel < %s | FileCheck -check-prefix=GFX90A_GFX942 %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx90a -enable-new-pm -stop-after=amdgpu-isel < %s | FileCheck -check-prefixes=GFX90A_GFX942,GFX90A %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx942 -enable-new-pm -stop-after=amdgpu-isel < %s | FileCheck -check-prefixes=GFX90A_GFX942,GFX942 %s
 ; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -enable-new-pm -stop-after=amdgpu-isel < %s | FileCheck -check-prefix=GFX908_GFX11 %s
 
 define amdgpu_ps void @buffer_atomic_fadd_f32_offset_no_rtn(float %val, <4 x i32> inreg %rsrc, i32 inreg %soffset) {
@@ -167,25 +167,41 @@ define amdgpu_ps void @buffer_ptr_atomic_fadd_f32_offset_no_rtn(float %val, ptr
   ; GFX908_GFX11-NEXT:   BUFFER_ATOMIC_ADD_F32_OFFSET [[COPY5]], killed [[REG_SEQUENCE2]], [[COPY]], 0, 0, implicit $exec :: (volatile dereferenceable load store (s32) on %ir.rsrc, align 1, addrspace 8)
   ; GFX908_GFX11-NEXT:   S_ENDPGM 0
   ;
-  ; GFX90A_GFX942-LABEL: name: buffer_ptr_atomic_fadd_f32_offset_no_rtn
-  ; GFX90A_GFX942: bb.0 (%ir-block.0):
-  ; GFX90A_GFX942-NEXT:   liveins: $vgpr0, $sgpr0, $sgpr1, $sgpr2, $sgpr3, $sgpr4
-  ; GFX90A_GFX942-NEXT: {{  $}}
-  ; GFX90A_GFX942-NEXT:   [[COPY:%[0-9]+]]:sgpr_32 = COPY $sgpr4
-  ; GFX90A_GFX942-NEXT:   [[COPY1:%[0-9]+]]:sgpr_32 = COPY $sgpr3
-  ; GFX90A_GFX942-NEXT:   [[COPY2:%[0-9]+]]:sgpr_32 = COPY $sgpr2
-  ; GFX90A_GFX942-NEXT:   [[COPY3:%[0-9]+]]:sgpr_32 = COPY $sgpr1
-  ; GFX90A_GFX942-NEXT:   [[COPY4:%[0-9]+]]:sgpr_32 = COPY $sgpr0
-  ; GFX90A_GFX942-NEXT:   [[COPY5:%[0-9]+]]:vgpr_32 = COPY $vgpr0
-  ; GFX90A_GFX942-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE [[COPY2]], %subreg.sub0, [[COPY1]], %subreg.sub1
-  ; GFX90A_GFX942-NEXT:   [[COPY6:%[0-9]+]]:sreg_32 = COPY [[REG_SEQUENCE]].sub1
-  ; GFX90A_GFX942-NEXT:   [[COPY7:%[0-9]+]]:sreg_32 = COPY [[REG_SEQUENCE]].sub0
-  ; GFX90A_GFX942-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:sgpr_64 = REG_SEQUENCE [[COPY4]], %subreg.sub0, [[COPY3]], %subreg.sub1
-  ; GFX90A_GFX942-NEXT:   [[COPY8:%[0-9]+]]:sreg_32 = COPY [[REG_SEQUENCE1]].sub1
-  ; GFX90A_GFX942-NEXT:   [[COPY9:%[0-9]+]]:sreg_32 = COPY [[REG_SEQUENCE1]].sub0
-  ; GFX90A_GFX942-NEXT:   [[REG_SEQUENCE2:%[0-9]+]]:sgpr_128 = REG_SEQUENCE killed [[COPY9]], %subreg.sub0, killed [[COPY8]], %subreg.sub1, killed [[COPY7]], %subreg.sub2, killed [[COPY6]], %subreg.sub3
-  ; GFX90A_GFX942-NEXT:   BUFFER_ATOMIC_ADD_F32_OFFSET [[COPY5]], killed [[REG_SEQUENCE2]], [[COPY]], 0, 0, implicit $exec :: (volatile dereferenceable load store (s32) on %ir.rsrc, align 1, addrspace 8)
-  ; GFX90A_GFX942-NEXT:   S_ENDPGM 0
+  ; GFX90A-LABEL: name: buffer_ptr_atomic_fadd_f32_offset_no_rtn
+  ; GFX90A: bb.0 (%ir-block.0):
+  ; GFX90A-NEXT:   liveins: $vgpr0, $sgpr0, $sgpr1, $sgpr2, $sgpr3, $sgpr4
+  ; GFX90A-NEXT: {{  $}}
+  ; GFX90A-NEXT:   [[COPY:%[0-9]+]]:sgpr_32 = COPY $sgpr4
+  ; GFX90A-NEXT:   [[COPY1:%[0-9]+]]:sgpr_32 = COPY $sgpr3
+  ; GFX90A-NEXT:   [[COPY2:%[0-9]+]]:sgpr_32 = COPY $sgpr2
+  ; GFX90A-NEXT:   [[COPY3:%[0-9]+]]:sgpr_32 = COPY $sgpr1
+  ; GFX90A-NEXT:   [[COPY4:%[0-9]+]]:sgpr_32 = COPY $sgpr0
+  ; GFX90A-NEXT:   [[COPY5:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+  ; GFX90A-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE [[COPY2]], %subreg.sub0, [[COPY1]], %subreg.sub1
+  ; GFX90A-NEXT:   [[COPY6:%[0-9]+]]:sreg_32 = COPY [[REG_SEQUENCE]].sub1
+  ; GFX90A-NEXT:   [[COPY7:%[0-9]+]]:sreg_32 = COPY [[REG_SEQUENCE]].sub0
+  ; GFX90A-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:sgpr_64 = REG_SEQUENCE [[COPY4]], %subreg.sub0, [[COPY3]], %subreg.sub1
+  ; GFX90A-NEXT:   [[COPY8:%[0-9]+]]:sreg_32 = COPY [[REG_SEQUENCE1]].sub1
+  ; GFX90A-NEXT:   [[COPY9:%[0-9]+]]:sreg_32 = COPY [[REG_SEQUENCE1]].sub0
+  ; GFX90A-NEXT:   [[REG_SEQUENCE2:%[0-9]+]]:sgpr_128 = REG_SEQUENCE killed [[COPY9]], %subreg.sub0, killed [[COPY8]], %subreg.sub1, killed [[COPY7]], %subreg.sub2, killed [[COPY6]], %subreg.sub3
+  ; GFX90A-NEXT:   BUFFER_ATOMIC_ADD_F32_OFFSET [[COPY5]], killed [[REG_SEQUENCE2]], [[COPY]], 0, 0, implicit $exec :: (volatile dereferenceable load store (s32) on %ir.rsrc, align 1, addrspace 8)
+  ; GFX90A-NEXT:   S_ENDPGM 0
+  ;
+  ; GFX942-LABEL: name: buffer_ptr_atomic_fadd_f32_offset_no_rtn
+  ; GFX942: bb.0 (%ir-block.0):
+  ; GFX942-NEXT:   liveins: $vgpr0, $sgpr0, $sgpr1, $sgpr2, $sgpr3, $sgpr4
+  ; GFX942-NEXT: {{  $}}
+  ; GFX942-NEXT:   [[COPY:%[0-9]+]]:sgpr_32 = COPY $sgpr4
+  ; GFX942-NEXT:   [[COPY1:%[0-9]+]]:sgpr_32 = COPY $sgpr3
+  ; GFX942-NEXT:   [[COPY2:%[0-9]+]]:sgpr_32 = COPY $sgpr2
+  ; GFX942-NEXT:   [[COPY3:%[0-9]+]]:sgpr_32 = COPY $sgpr1
+  ; GFX942-NEXT:   [[COPY4:%[0-9]+]]:sgpr_32 = COPY $sgpr0
+  ; GFX942-NEXT:   [[COPY5:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+  ; GFX942-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE [[COPY2]], %subreg.sub0, [[COPY1]], %subreg.sub1
+  ; GFX942-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:sgpr_64 = REG_SEQUENCE [[COPY4]], %subreg.sub0, [[COPY3]], %subreg.sub1
+  ; GFX942-NEXT:   [[REG_SEQUENCE2:%[0-9]+]]:sgpr_128 = REG_SEQUENCE killed [[REG_SEQUENCE1]], %subreg.sub0_sub1, killed [[REG_SEQUENCE]], %subreg.sub2_sub3
+  ; GFX942-NEXT:   BUFFER_ATOMIC_ADD_F32_OFFSET [[COPY5]], killed [[REG_SEQUENCE2]], [[COPY]], 0, 0, implicit $exec :: (volatile dereferenc...
[truncated]

Contributor Author

Not convinced we can easily elide this superfluous v_mov with current LLVM. Previously this would have been CSE'd with the other v_mov_b32 that materializes 0, but now it requires MachineCSE to become aware of subregisters, which I don't think is as trivial (unless there's some low-hanging fruit I'm unaware of).

Contributor

I've actually been working on that on and off for a while. The actual patch isn't hard; fixing all the regressions elsewhere is time-consuming.

Contributor

I don't understand what ImmSize is for. All the sizes are exactly computable from the type and the number of operands; you shouldn't need to sum anything?

Contributor Author

I'd still have to keep track of some index to know how far to shift the immediate value into ImmVal, or to know when we've completed an ImmVal; I just thought I could combine the iteration count and element size to more conveniently track the shift amount and the end of each 64-bit ImmVal.
For now I'll leave it as ImmSize, but if you'd prefer the iteration index plus a computation from the vector type, let me know.
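
(For reference, a sketch of that index-based alternative. It reuses ImmVal/ImmVals/EltSize from the patch above, is a fragment rather than the committed code, and elides the bail-out checks the patch performs.)

  // Shift amount and 64-bit chunk boundary derived from the operand index and
  // element size, instead of an accumulated ImmSize.
  const unsigned EltsPer64 = 64 / EltSize; // e.g. 2 for 32-bit elements
  for (auto [Idx, Op] : llvm::enumerate(N->ops())) {
    auto *C = dyn_cast<ConstantSDNode>(Op);
    if (!C)
      return SDValue();
    ImmVal |= C->getZExtValue() << ((Idx % EltsPer64) * EltSize);
    if ((Idx + 1) % EltsPer64 == 0) { // completed one 64-bit value
      ImmVals.push_back(ImmVal);
      ImmVal = 0;
    }
  }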

@JanekvO JanekvO force-pushed the legalize_buildvector_b64_elt branch from 730a90c to 10e2de6 on July 1, 2025, 12:28
Comment on lines 222 to 236
Contributor Author

Regression that came from the rebase; I haven't looked into it yet, but it seems to be a missing case in immediate folding.

@JanekvO
Contributor Author

JanekvO commented Jul 16, 2025

ping

@JanekvO JanekvO force-pushed the legalize_buildvector_b64_elt branch from 3fcafc8 to 6c324a8 on July 21, 2025, 14:36
@JanekvO
Contributor Author

JanekvO commented Jul 21, 2025

Rebase

@JanekvO JanekvO force-pushed the legalize_buildvector_b64_elt branch from 6c324a8 to 059a068 on July 28, 2025, 11:04
@JanekvO
Contributor Author

JanekvO commented Jul 28, 2025

Rebase

Comment on lines 16169 to 16390
Contributor

Something about this seems off; can you keep this logic in terms of the BUILD_VECTOR legality directly in the BITCAST combine implementation?

Contributor Author

I couldn't do an obvious isOperationLegal check, particularly for the combine v2i32 (bitcast i64:k) -> build_vector lo_32(k), hi_32(k), which I'm trying to avoid since a 64-bit value can be materialized directly through [v/s]_mov_b64. I think it would have to boil down to gating it on hasMovB64() again in AMDGPUTargetLowering::PerformDAGCombine's BITCAST combine.
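
(The transform in question, written out as a hedged fragment for reference; K and SL stand in for the i64 constant node and the debug location and are not names from the tree.)

  // (v2i32 (bitcast (i64 K)))  -->  (build_vector lo_32(K), hi_32(K))
  uint64_t Imm = K->getZExtValue();
  SDValue Lo = DAG.getConstant(Lo_32(Imm), SL, MVT::i32);
  SDValue Hi = DAG.getConstant(Hi_32(Imm), SL, MVT::i32);
  SDValue Split = DAG.getBuildVector(MVT::v2i32, SL, {Lo, Hi});
  // With 64-bit moves available, keeping the unsplit i64 lets a single
  // s_mov_b64 / v_mov_b64 materialize the constant instead of two 32-bit movs.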

Contributor

Can you drop this part of the patch and look into it later? You still cannot use v_mov_b64 to materialize a general 64-bit constant. Targets with v_mov_b64 are under-represented in the test suite and this is likely introducing many regressions. Plus, we still want the non-constant case that just pushes the cast through build_vector.

Contributor Author

Sorry, I'm probably missing the suggested change here, but removing the combines will end up with even more regressions, no? Or is the suggestion to make the combines AMDGPU target agnostic (i.e., stop guarding on hasMovB64)?

Contributor

@arsenm arsenm Aug 13, 2025

I mean split this into a separate PR, which you can do prior to this patch. The solution isn't to just fully disable this combine based on the target; this logic should be integrated directly into the bitcast combine. It's value-dependent on the specific constants and the target.

Contributor Author

Opened #154115

@JanekvO JanekvO force-pushed the legalize_buildvector_b64_elt branch from 059a068 to 7955d7c on July 31, 2025, 16:15
Contributor

@chrisjbris chrisjbris left a comment

Just some suggestions, nothing critical.

@JanekvO JanekvO force-pushed the legalize_buildvector_b64_elt branch from 448598c to d7c91f5 on August 12, 2025, 16:20
@JanekvO
Contributor Author

JanekvO commented Aug 12, 2025

Rebase

@JanekvO JanekvO force-pushed the legalize_buildvector_b64_elt branch from d7c91f5 to 4af9f87 on September 17, 2025, 12:23
@JanekvO
Contributor Author

JanekvO commented Sep 17, 2025

Rebase after landing #154115

; GFX90A-NEXT: global_store_dwordx4 v0, a[0:3], s[0:1]
; GFX90A-NEXT: s_endpgm
;
; GFX942-LABEL: test_mfma_nested_loop_zeroinit:
Contributor Author

Materializes 0 as a 64-bit value, which goes through a VGPR with v_mov_b64; this results in a regression of +1 instruction instead of materializing into a0 and re-using that to set the remaining registers.
