[NVPTX] Disable v2f32 registers when no operations supported, or via cl::opt #154476
Conversation
@llvm/pr-subscribers-backend-nvptx

Author: Alex MacLean (AlexMaclean)

Changes

The addition of v2f32 as a legal type, supported by the B64 register class, has caused performance regressions, broken inline assembly, and resulted in a couple of (now fixed) mis-compilations. To mitigate these issues, only mark v2f32 as a legal type when operations that support it exist, since on targets where this is not the case it serves no purpose. To enable further debugging, add an option to disable v2f32.

In order to allow for a target-dependent set of legal types, ComputePTXValueVTs has been fully re-written to take advantage of the TargetLowering call-lowering APIs.

Patch is 246.56 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/154476.diff

20 files affected.
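As a rough illustration of the user-visible effect (this snippet and the llc invocations in its comments are illustrative assumptions, not taken from the patch), a <2 x float> value can now only stay packed in a single B64 register when the target actually has f32x2 instructions, and the new cl::opt acts as an escape hatch:

; Illustrative only. With f32x2 support (sm_100 and PTX 8.6 or later) the add
; below may be kept as a packed f32x2 operation in one 64-bit register; with
; -nvptx-no-f32x2, or on targets without f32x2 instructions, it is expected to
; be split into two scalar f32 (B32) operations instead.
;   llc -mtriple=nvptx64 -mcpu=sm_100 -mattr=+ptx86 example.ll
;   llc -mtriple=nvptx64 -mcpu=sm_100 -mattr=+ptx86 -nvptx-no-f32x2 example.ll
define <2 x float> @example_fadd_v2f32(<2 x float> %a, <2 x float> %b) {
  %sum = fadd <2 x float> %a, %b
  ret <2 x float> %sum
}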
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
index 74e6c139c610d..f5302cd1bbd17 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
@@ -195,7 +195,8 @@ static bool IsPTXVectorType(MVT VT) {
// - unsigned int NumElts - The number of elements in the final vector
// - EVT EltVT - The type of the elements in the final vector
static std::optional<std::pair<unsigned int, MVT>>
-getVectorLoweringShape(EVT VectorEVT, bool CanLowerTo256Bit) {
+getVectorLoweringShape(EVT VectorEVT, const NVPTXSubtarget &STI,
+ unsigned AddressSpace) {
if (!VectorEVT.isSimple())
return std::nullopt;
const MVT VectorVT = VectorEVT.getSimpleVT();
@@ -212,6 +213,8 @@ getVectorLoweringShape(EVT VectorEVT, bool CanLowerTo256Bit) {
// The size of the PTX virtual register that holds a packed type.
unsigned PackRegSize;
+ bool CanLowerTo256Bit = STI.has256BitVectorLoadStore(AddressSpace);
+
// We only handle "native" vector sizes for now, e.g. <4 x double> is not
// legal. We can (and should) split that into 2 stores of <2 x double> here
// but I'm leaving that as a TODO for now.
@@ -262,6 +265,8 @@ getVectorLoweringShape(EVT VectorEVT, bool CanLowerTo256Bit) {
LLVM_FALLTHROUGH;
case MVT::v2f32: // <1 x f32x2>
case MVT::v4f32: // <2 x f32x2>
+ if (!STI.hasF32x2Instructions())
+ return std::pair(NumElts, EltVT);
PackRegSize = 64;
break;
}
@@ -277,97 +282,46 @@ getVectorLoweringShape(EVT VectorEVT, bool CanLowerTo256Bit) {
}
/// ComputePTXValueVTs - For the given Type \p Ty, returns the set of primitive
-/// EVTs that compose it. Unlike ComputeValueVTs, this will break apart vectors
-/// into their primitive components.
+/// legal-ish MVTs that compose it. Unlike ComputeValueVTs, this will legalize
+/// the types as required by the calling convention (with special handling for
+/// i8s).
/// NOTE: This is a band-aid for code that expects ComputeValueVTs to return the
/// same number of types as the Ins/Outs arrays in LowerFormalArguments,
/// LowerCall, and LowerReturn.
static void ComputePTXValueVTs(const TargetLowering &TLI, const DataLayout &DL,
+ LLVMContext &Ctx, CallingConv::ID CallConv,
Type *Ty, SmallVectorImpl<EVT> &ValueVTs,
- SmallVectorImpl<uint64_t> *Offsets = nullptr,
+ SmallVectorImpl<uint64_t> *Offsets,
uint64_t StartingOffset = 0) {
+ assert(Offsets && "Offsets must be non-null");
+
SmallVector<EVT, 16> TempVTs;
SmallVector<uint64_t, 16> TempOffsets;
-
- // Special case for i128 - decompose to (i64, i64)
- if (Ty->isIntegerTy(128) || Ty->isFP128Ty()) {
- ValueVTs.append({MVT::i64, MVT::i64});
-
- if (Offsets)
- Offsets->append({StartingOffset + 0, StartingOffset + 8});
-
- return;
- }
-
- // Given a struct type, recursively traverse the elements with custom ComputePTXValueVTs.
- if (StructType *STy = dyn_cast<StructType>(Ty)) {
- auto const *SL = DL.getStructLayout(STy);
- auto ElementNum = 0;
- for(auto *EI : STy->elements()) {
- ComputePTXValueVTs(TLI, DL, EI, ValueVTs, Offsets,
- StartingOffset + SL->getElementOffset(ElementNum));
- ++ElementNum;
- }
- return;
- }
-
- // Given an array type, recursively traverse the elements with custom ComputePTXValueVTs.
- if (ArrayType *ATy = dyn_cast<ArrayType>(Ty)) {
- Type *EltTy = ATy->getElementType();
- uint64_t EltSize = DL.getTypeAllocSize(EltTy);
- for (int I : llvm::seq<int>(ATy->getNumElements()))
- ComputePTXValueVTs(TLI, DL, EltTy, ValueVTs, Offsets, StartingOffset + I * EltSize);
- return;
- }
-
- // Will split structs and arrays into member types, but will not split vector
- // types. We do that manually below.
ComputeValueVTs(TLI, DL, Ty, TempVTs, &TempOffsets, StartingOffset);
- for (auto [VT, Off] : zip(TempVTs, TempOffsets)) {
- // Split vectors into individual elements that fit into registers.
- if (VT.isVector()) {
- unsigned NumElts = VT.getVectorNumElements();
- EVT EltVT = VT.getVectorElementType();
- // Below we must maintain power-of-2 sized vectors because
- // TargetLoweringBase::getVectorTypeBreakdown() which is invoked in
- // ComputePTXValueVTs() cannot currently break down non-power-of-2 sized
- // vectors.
-
- // If the element type belongs to one of the supported packed vector types
- // then we can pack multiples of this element into a single register.
- if (VT == MVT::v2i8) {
- // We can pack 2 i8s into a single 16-bit register. We only do this for
- // loads and stores, which is why we have a separate case for it.
- EltVT = MVT::v2i8;
- NumElts = 1;
- } else if (VT == MVT::v3i8) {
- // We can also pack 3 i8s into 32-bit register, leaving the 4th
- // element undefined.
- EltVT = MVT::v4i8;
- NumElts = 1;
- } else if (NumElts > 1 && isPowerOf2_32(NumElts)) {
- // Handle default packed types.
- for (MVT PackedVT : NVPTX::packed_types()) {
- const auto NumEltsPerReg = PackedVT.getVectorNumElements();
- if (NumElts % NumEltsPerReg == 0 &&
- EltVT == PackedVT.getVectorElementType()) {
- EltVT = PackedVT;
- NumElts /= NumEltsPerReg;
- break;
- }
- }
- }
+ for (const auto [VT, Off] : zip(TempVTs, TempOffsets)) {
+ MVT RegisterVT = TLI.getRegisterTypeForCallingConv(Ctx, CallConv, VT);
+ unsigned NumRegs = TLI.getNumRegistersForCallingConv(Ctx, CallConv, VT);
+
+ // Since we actually can load/store b8, we need to ensure that we'll use
+ // the original sized type for any i8s or i8 vectors.
+ if (VT.getScalarType() == MVT::i8) {
+ if (RegisterVT == MVT::i16)
+ RegisterVT = MVT::i8;
+ else if (RegisterVT == MVT::v2i16)
+ RegisterVT = MVT::v2i8;
+ else
+ assert(RegisterVT == MVT::v4i8 &&
+ "Expected v4i8, v2i16, or i16 for i8 RegisterVT");
+ }
- for (unsigned J : seq(NumElts)) {
- ValueVTs.push_back(EltVT);
- if (Offsets)
- Offsets->push_back(Off + J * EltVT.getStoreSize());
- }
- } else {
- ValueVTs.push_back(VT);
- if (Offsets)
- Offsets->push_back(Off);
+ // TODO: This is horribly incorrect for cases where the vector elements are
+ // not a multiple of bytes (ex i1) and legal or i8. However, this problem
+ // has existed for as long as NVPTX has and no one has complained, so we'll
+ // leave it for now.
+ for (unsigned I : seq(NumRegs)) {
+ ValueVTs.push_back(RegisterVT);
+ Offsets->push_back(Off + I * RegisterVT.getStoreSize());
}
}
}
@@ -630,7 +584,9 @@ NVPTXTargetLowering::NVPTXTargetLowering(const NVPTXTargetMachine &TM,
addRegisterClass(MVT::v2f16, &NVPTX::B32RegClass);
addRegisterClass(MVT::bf16, &NVPTX::B16RegClass);
addRegisterClass(MVT::v2bf16, &NVPTX::B32RegClass);
- addRegisterClass(MVT::v2f32, &NVPTX::B64RegClass);
+
+ if (STI.hasF32x2Instructions())
+ addRegisterClass(MVT::v2f32, &NVPTX::B64RegClass);
// Conversion to/from FP16/FP16x2 is always legal.
setOperationAction(ISD::BUILD_VECTOR, MVT::v2f16, Custom);
@@ -671,7 +627,8 @@ NVPTXTargetLowering::NVPTXTargetLowering(const NVPTXTargetMachine &TM,
setOperationAction(ISD::INSERT_VECTOR_ELT, MVT::v2f32, Expand);
setOperationAction(ISD::VECTOR_SHUFFLE, MVT::v2f32, Expand);
// Need custom lowering in case the index is dynamic.
- setOperationAction(ISD::EXTRACT_VECTOR_ELT, MVT::v2f32, Custom);
+ if (STI.hasF32x2Instructions())
+ setOperationAction(ISD::EXTRACT_VECTOR_ELT, MVT::v2f32, Custom);
// Custom conversions to/from v2i8.
setOperationAction(ISD::BITCAST, MVT::v2i8, Custom);
@@ -1605,7 +1562,8 @@ SDValue NVPTXTargetLowering::LowerCall(TargetLowering::CallLoweringInfo &CLI,
} else {
SmallVector<EVT, 16> VTs;
SmallVector<uint64_t, 16> Offsets;
- ComputePTXValueVTs(*this, DL, Arg.Ty, VTs, &Offsets, VAOffset);
+ ComputePTXValueVTs(*this, DL, Ctx, CLI.CallConv, Arg.Ty, VTs, &Offsets,
+ VAOffset);
assert(VTs.size() == Offsets.size() && "Size mismatch");
assert(VTs.size() == ArgOuts.size() && "Size mismatch");
@@ -1755,7 +1713,7 @@ SDValue NVPTXTargetLowering::LowerCall(TargetLowering::CallLoweringInfo &CLI,
if (!Ins.empty()) {
SmallVector<EVT, 16> VTs;
SmallVector<uint64_t, 16> Offsets;
- ComputePTXValueVTs(*this, DL, RetTy, VTs, &Offsets);
+ ComputePTXValueVTs(*this, DL, Ctx, CLI.CallConv, RetTy, VTs, &Offsets);
assert(VTs.size() == Ins.size() && "Bad value decomposition");
const Align RetAlign = getArgumentAlignment(CB, RetTy, 0, DL);
@@ -3216,8 +3174,8 @@ NVPTXTargetLowering::LowerSTOREVector(SDValue Op, SelectionDAG &DAG) const {
if (ValVT != MemVT)
return SDValue();
- const auto NumEltsAndEltVT = getVectorLoweringShape(
- ValVT, STI.has256BitVectorLoadStore(N->getAddressSpace()));
+ const auto NumEltsAndEltVT =
+ getVectorLoweringShape(ValVT, STI, N->getAddressSpace());
if (!NumEltsAndEltVT)
return SDValue();
const auto [NumElts, EltVT] = NumEltsAndEltVT.value();
@@ -3385,6 +3343,7 @@ SDValue NVPTXTargetLowering::LowerFormalArguments(
const SmallVectorImpl<ISD::InputArg> &Ins, const SDLoc &dl,
SelectionDAG &DAG, SmallVectorImpl<SDValue> &InVals) const {
const DataLayout &DL = DAG.getDataLayout();
+ LLVMContext &Ctx = *DAG.getContext();
auto PtrVT = getPointerTy(DAG.getDataLayout());
const Function &F = DAG.getMachineFunction().getFunction();
@@ -3456,7 +3415,7 @@ SDValue NVPTXTargetLowering::LowerFormalArguments(
} else {
SmallVector<EVT, 16> VTs;
SmallVector<uint64_t, 16> Offsets;
- ComputePTXValueVTs(*this, DL, Ty, VTs, &Offsets, 0);
+ ComputePTXValueVTs(*this, DL, Ctx, CallConv, Ty, VTs, &Offsets, 0);
assert(VTs.size() == ArgIns.size() && "Size mismatch");
assert(VTs.size() == Offsets.size() && "Size mismatch");
@@ -3468,7 +3427,7 @@ SDValue NVPTXTargetLowering::LowerFormalArguments(
for (const unsigned NumElts : VI) {
// i1 is loaded/stored as i8
const EVT LoadVT = VTs[I] == MVT::i1 ? MVT::i8 : VTs[I];
- const EVT VecVT = getVectorizedVT(LoadVT, NumElts, *DAG.getContext());
+ const EVT VecVT = getVectorizedVT(LoadVT, NumElts, Ctx);
SDValue VecAddr = DAG.getObjectPtrOffset(
dl, ArgSymbol, TypeSize::getFixed(Offsets[I]));
@@ -3513,6 +3472,7 @@ NVPTXTargetLowering::LowerReturn(SDValue Chain, CallingConv::ID CallConv,
}
const DataLayout &DL = DAG.getDataLayout();
+ LLVMContext &Ctx = *DAG.getContext();
const SDValue RetSymbol = DAG.getExternalSymbol("func_retval0", MVT::i32);
const auto RetAlign = getFunctionParamOptimizedAlign(&F, RetTy, DL);
@@ -3525,7 +3485,7 @@ NVPTXTargetLowering::LowerReturn(SDValue Chain, CallingConv::ID CallConv,
SmallVector<EVT, 16> VTs;
SmallVector<uint64_t, 16> Offsets;
- ComputePTXValueVTs(*this, DL, RetTy, VTs, &Offsets);
+ ComputePTXValueVTs(*this, DL, Ctx, CallConv, RetTy, VTs, &Offsets);
assert(VTs.size() == OutVals.size() && "Bad return value decomposition");
const auto GetRetVal = [&](unsigned I) -> SDValue {
@@ -6056,8 +6016,8 @@ static void replaceLoadVector(SDNode *N, SelectionDAG &DAG,
if (ResVT != MemVT)
return;
- const auto NumEltsAndEltVT = getVectorLoweringShape(
- ResVT, STI.has256BitVectorLoadStore(LD->getAddressSpace()));
+ const auto NumEltsAndEltVT =
+ getVectorLoweringShape(ResVT, STI, LD->getAddressSpace());
if (!NumEltsAndEltVT)
return;
const auto [NumElts, EltVT] = NumEltsAndEltVT.value();
diff --git a/llvm/lib/Target/NVPTX/NVPTXSubtarget.cpp b/llvm/lib/Target/NVPTX/NVPTXSubtarget.cpp
index e5d680c19d921..a84ceaba991c7 100644
--- a/llvm/lib/Target/NVPTX/NVPTXSubtarget.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXSubtarget.cpp
@@ -29,6 +29,12 @@ static cl::opt<bool>
NoF16Math("nvptx-no-f16-math", cl::Hidden,
cl::desc("NVPTX Specific: Disable generation of f16 math ops."),
cl::init(false));
+
+static cl::opt<bool> NoF32x2("nvptx-no-f32x2", cl::Hidden,
+ cl::desc("NVPTX Specific: Disable generation of "
+ "f32x2 instructions and registers."),
+ cl::init(false));
+
// Pin the vtable to this file.
void NVPTXSubtarget::anchor() {}
@@ -70,6 +76,10 @@ bool NVPTXSubtarget::allowFP16Math() const {
return hasFP16Math() && NoF16Math == false;
}
+bool NVPTXSubtarget::hasF32x2Instructions() const {
+ return SmVersion >= 100 && PTXVersion >= 86 && !NoF32x2;
+}
+
bool NVPTXSubtarget::hasNativeBF16Support(int Opcode) const {
if (!hasBF16Math())
return false;
diff --git a/llvm/lib/Target/NVPTX/NVPTXSubtarget.h b/llvm/lib/Target/NVPTX/NVPTXSubtarget.h
index 81af55edccadb..acf025b70ce34 100644
--- a/llvm/lib/Target/NVPTX/NVPTXSubtarget.h
+++ b/llvm/lib/Target/NVPTX/NVPTXSubtarget.h
@@ -117,9 +117,7 @@ class NVPTXSubtarget : public NVPTXGenSubtargetInfo {
return HasTcgen05 && PTXVersion >= 86;
}
// f32x2 instructions in Blackwell family
- bool hasF32x2Instructions() const {
- return SmVersion >= 100 && PTXVersion >= 86;
- }
+ bool hasF32x2Instructions() const;
// TMA G2S copy with cta_group::1/2 support
bool hasCpAsyncBulkTensorCTAGroupSupport() const {
diff --git a/llvm/test/CodeGen/NVPTX/aggregate-return.ll b/llvm/test/CodeGen/NVPTX/aggregate-return.ll
index bf51973e88357..fab60bdb3f2d1 100644
--- a/llvm/test/CodeGen/NVPTX/aggregate-return.ll
+++ b/llvm/test/CodeGen/NVPTX/aggregate-return.ll
@@ -10,19 +10,20 @@ declare {float, float} @bars({float, float} %input)
define void @test_v2f32(<2 x float> %input, ptr %output) {
; CHECK-LABEL: test_v2f32(
; CHECK: {
-; CHECK-NEXT: .reg .b64 %rd<4>;
+; CHECK-NEXT: .reg .b32 %r<5>;
+; CHECK-NEXT: .reg .b64 %rd<2>;
; CHECK-EMPTY:
; CHECK-NEXT: // %bb.0:
-; CHECK-NEXT: ld.param.b64 %rd1, [test_v2f32_param_0];
+; CHECK-NEXT: ld.param.v2.b32 {%r1, %r2}, [test_v2f32_param_0];
; CHECK-NEXT: { // callseq 0, 0
; CHECK-NEXT: .param .align 8 .b8 param0[8];
; CHECK-NEXT: .param .align 8 .b8 retval0[8];
-; CHECK-NEXT: st.param.b64 [param0], %rd1;
+; CHECK-NEXT: st.param.v2.b32 [param0], {%r1, %r2};
; CHECK-NEXT: call.uni (retval0), barv, (param0);
-; CHECK-NEXT: ld.param.b64 %rd2, [retval0];
+; CHECK-NEXT: ld.param.v2.b32 {%r3, %r4}, [retval0];
; CHECK-NEXT: } // callseq 0
-; CHECK-NEXT: ld.param.b64 %rd3, [test_v2f32_param_1];
-; CHECK-NEXT: st.b64 [%rd3], %rd2;
+; CHECK-NEXT: ld.param.b64 %rd1, [test_v2f32_param_1];
+; CHECK-NEXT: st.v2.b32 [%rd1], {%r3, %r4};
; CHECK-NEXT: ret;
%call = tail call <2 x float> @barv(<2 x float> %input)
store <2 x float> %call, ptr %output, align 8
@@ -32,24 +33,28 @@ define void @test_v2f32(<2 x float> %input, ptr %output) {
define void @test_v3f32(<3 x float> %input, ptr %output) {
; CHECK-LABEL: test_v3f32(
; CHECK: {
-; CHECK-NEXT: .reg .b32 %r<3>;
-; CHECK-NEXT: .reg .b64 %rd<4>;
+; CHECK-NEXT: .reg .b32 %r<7>;
+; CHECK-NEXT: .reg .b64 %rd<6>;
; CHECK-EMPTY:
; CHECK-NEXT: // %bb.0:
-; CHECK-NEXT: ld.param.b64 %rd1, [test_v3f32_param_0];
-; CHECK-NEXT: ld.param.b32 %r1, [test_v3f32_param_0+8];
+; CHECK-NEXT: ld.param.v2.b32 {%r1, %r2}, [test_v3f32_param_0];
+; CHECK-NEXT: ld.param.b32 %r3, [test_v3f32_param_0+8];
; CHECK-NEXT: { // callseq 1, 0
; CHECK-NEXT: .param .align 16 .b8 param0[16];
; CHECK-NEXT: .param .align 16 .b8 retval0[16];
-; CHECK-NEXT: st.param.b32 [param0+8], %r1;
-; CHECK-NEXT: st.param.b64 [param0], %rd1;
+; CHECK-NEXT: st.param.b32 [param0+8], %r3;
+; CHECK-NEXT: st.param.v2.b32 [param0], {%r1, %r2};
; CHECK-NEXT: call.uni (retval0), barv3, (param0);
-; CHECK-NEXT: ld.param.b32 %r2, [retval0+8];
-; CHECK-NEXT: ld.param.b64 %rd2, [retval0];
+; CHECK-NEXT: ld.param.b32 %r4, [retval0+8];
+; CHECK-NEXT: ld.param.v2.b32 {%r5, %r6}, [retval0];
; CHECK-NEXT: } // callseq 1
-; CHECK-NEXT: ld.param.b64 %rd3, [test_v3f32_param_1];
-; CHECK-NEXT: st.b32 [%rd3+8], %r2;
-; CHECK-NEXT: st.b64 [%rd3], %rd2;
+; CHECK-NEXT: cvt.u64.u32 %rd1, %r5;
+; CHECK-NEXT: cvt.u64.u32 %rd2, %r6;
+; CHECK-NEXT: shl.b64 %rd3, %rd2, 32;
+; CHECK-NEXT: or.b64 %rd4, %rd1, %rd3;
+; CHECK-NEXT: ld.param.b64 %rd5, [test_v3f32_param_1];
+; CHECK-NEXT: st.b32 [%rd5+8], %r4;
+; CHECK-NEXT: st.b64 [%rd5], %rd4;
; CHECK-NEXT: ret;
%call = tail call <3 x float> @barv3(<3 x float> %input)
; Make sure we don't load more values than than we need to.
diff --git a/llvm/test/CodeGen/NVPTX/bf16-instructions.ll b/llvm/test/CodeGen/NVPTX/bf16-instructions.ll
index aee58a044a986..a386e4292777b 100644
--- a/llvm/test/CodeGen/NVPTX/bf16-instructions.ll
+++ b/llvm/test/CodeGen/NVPTX/bf16-instructions.ll
@@ -688,25 +688,25 @@ define <8 x float> @test_extload_bf16x8(ptr addrspace(3) noundef %arg) #0 {
; SM70-NEXT: // %bb.0:
; SM70-NEXT: ld.param.b64 %rd1, [test_extload_bf16x8_param_0];
; SM70-NEXT: ld.shared.v4.b32 {%r1, %r2, %r3, %r4}, [%rd1];
-; SM70-NEXT: mov.b32 {%rs1, %rs2}, %r2;
-; SM70-NEXT: cvt.u32.u16 %r5, %rs2;
+; SM70-NEXT: mov.b32 {%rs1, %rs2}, %r3;
+; SM70-NEXT: mov.b32 {%rs3, %rs4}, %r4;
+; SM70-NEXT: mov.b32 {%rs5, %rs6}, %r1;
+; SM70-NEXT: mov.b32 {%rs7, %rs8}, %r2;
+; SM70-NEXT: cvt.u32.u16 %r5, %rs8;
; SM70-NEXT: shl.b32 %r6, %r5, 16;
-; SM70-NEXT: cvt.u32.u16 %r7, %rs1;
+; SM70-NEXT: cvt.u32.u16 %r7, %rs7;
; SM70-NEXT: shl.b32 %r8, %r7, 16;
-; SM70-NEXT: mov.b32 {%rs3, %rs4}, %r1;
-; SM70-NEXT: cvt.u32.u16 %r9, %rs4;
+; SM70-NEXT: cvt.u32.u16 %r9, %rs6;
; SM70-NEXT: shl.b32 %r10, %r9, 16;
-; SM70-NEXT: cvt.u32.u16 %r11, %rs3;
+; SM70-NEXT: cvt.u32.u16 %r11, %rs5;
; SM70-NEXT: shl.b32 %r12, %r11, 16;
-; SM70-NEXT: mov.b32 {%rs5, %rs6}, %r4;
-; SM70-NEXT: cvt.u32.u16 %r13, %rs6;
+; SM70-NEXT: cvt.u32.u16 %r13, %rs4;
; SM70-NEXT: shl.b32 %r14, %r13, 16;
-; SM70-NEXT: cvt.u32.u16 %r15, %rs5;
+; SM70-NEXT: cvt.u32.u16 %r15, %rs3;
; SM70-NEXT: shl.b32 %r16, %r15, 16;
-; SM70-NEXT: mov.b32 {%rs7, %rs8}, %r3;
-; SM70-NEXT: cvt.u32.u16 %r17, %rs8;
+; SM70-NEXT: cvt.u32.u16 %r17, %rs2;
; SM70-NEXT: shl.b32 %r18, %r17, 16;
-; SM70-NEXT: cvt.u32.u16 %r19, %rs7;
+; SM70-NEXT: cvt.u32.u16 %r19, %rs1;
; SM70-NEXT: shl.b32 %r20, %r19, 16;
; SM70-NEXT: st.param.v4.b32 [func_retval0+16], {%r20, %r18, %r16, %r14};
; SM70-NEXT: st.param.v4.b32 [func_retval0], {%r12, %r10, %r8, %r6};
@@ -721,18 +721,18 @@ define <8 x float> @test_extload_bf16x8(ptr addrspace(3) noundef %arg) #0 {
; SM80-NEXT: // %bb.0:
; SM80-NEXT: ld.param.b64 %rd1, [test_extload_bf16x8_param_0];
; SM80-NEXT: ld.shared.v4.b32 {%r1, %r2, %r3, %r4}, [%rd1];
-; SM80-NEXT: mov.b32 {%rs1, %rs2}, %r2;
-; SM80-NEXT: cvt.f32.bf16 %r5, %rs2;
-; SM80-NEXT: cvt.f32.bf16 %r6, %rs1;
-; SM80-NEXT: mov.b32 {%rs3, %rs4}, %r1;
-; SM80-NEXT: cvt.f32.bf16 %r7, %rs4;
-; SM80-NEXT: cvt.f32.bf16 %r8, %rs3;
-; SM80-NEXT: mov.b32 {%rs5, %rs6}, %r4;
-; SM80-NEXT: cvt.f32.bf16 %r9, %rs6;
-; SM80-NEXT: cvt.f32.bf16 %r10, %rs5;
-; SM80-NEXT: mov.b32 {%rs7, %rs8}, %r3;
-; SM80-NEXT: cvt.f32.bf16 %r11, %rs8;
-; SM80-NEXT: cvt.f32.bf16 %r12, %rs7;
+; SM80-NEXT: mov.b32 {%rs1, %rs2}, %r3;
+; SM80-NEXT: mov.b32 {%rs3, %rs4}, %r4;
+; SM80-NEXT: mov.b32 {%rs5, %rs6}, %r1;
+; SM80-NEXT: mov.b32 {%rs7, %rs8}, %r2;
+; SM80-NEXT: cvt.f32.bf16 %r5, %rs8;
+; SM80-NEXT: cvt.f32.bf16 %r6, %rs7;
+; SM80-NEXT: cvt.f32.bf16 %r7, %rs6;
+; SM80-NEXT: cvt.f32.bf16 %r8, %rs5;
+; SM80-NEXT: cvt.f32.bf16 %r9, %rs4;
+; SM80-NEXT: cvt.f32.bf16 %r10, %rs3;
+; SM80-NEXT: cvt.f32.bf16 %r11, %rs2;
+; SM80-NEXT: cvt.f32.bf16 %r12, %rs1;
; SM80-NEXT: st.param.v4.b32 [func_retval0+16], {%r12, %r11, %r10, %r9};
; SM80-NEXT: st.param.v4.b32 [func_retval0], {%r8, %r7, %r6, %r5};
; SM80-NEXT: ret;
@@ -746,18 +746,18 @@ define <8 x float> @test_ex...
[truncated]
Prince781 left a comment:
LGTM.
I can confirm that the patch fixes the regression reported in #153109.
The branch was force-pushed from 7923c1a to 8260332.
LLVM Buildbot has detected a new failure. Full details are available at: https://lab.llvm.org/buildbot/#/builders/155/builds/12174

LLVM Buildbot has detected a new failure. Full details are available at: https://lab.llvm.org/buildbot/#/builders/54/builds/12005

LLVM Buildbot has detected a new failure. Full details are available at: https://lab.llvm.org/buildbot/#/builders/180/builds/23537
Follow-up: The refactoring of ComputePTXValueVTs in #154476 caused the compiler to no longer crash when lowering i256 and i96. This has caused a few tests to unexpectedly pass. Update these tests and tweak how we emit parameter declarations to correctly lower these types.
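For context, the types in question are plain wide integers; a hypothetical sketch (the function names below are made up for illustration and do not appear in the follow-up change) of the kind of signatures that are now decomposed by the rewritten ComputePTXValueVTs rather than crashing the backend:

; Hypothetical illustration only: wide integer arguments and return values
; like these are broken down into multiple registers via the calling-convention
; register-type queries used by the new ComputePTXValueVTs.
define i256 @example_ret_i256(i256 %x) {
  ret i256 %x
}

define void @example_store_i96(i96 %v, ptr %p) {
  store i96 %v, ptr %p
  ret void
}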