
Conversation

@dakersnar
Contributor

@dakersnar dakersnar commented Sep 17, 2025

This change introduces Gap Filling, an optimization that fills holes in otherwise contiguous load/store chains to enable vectorization. It also introduces Chain Extending, which extends the end of a chain up to the nearest power-of-2 length.

This was originally motivated by the NVPTX target, but I tried to generalize it so it is applicable to any target that uses the LSV. I'm more than willing to make adjustments to make this change more target-agnostic; I fully expect there are some issues and encourage feedback on how to improve things.

For both loads and stores, we only perform the optimization when we can generate a legal LLVM masked load/store intrinsic, masking off the "extra" elements. Determining legality for stores is a little tricky on the NVPTX side, because these intrinsics are only supported for 256-bit vectors. See the other PR I opened for the NVPTX lowering of masked store intrinsics, which includes NVPTX TTI changes that return true for isLegalMaskedStore under certain conditions: #159387. This change depends on that backend change, but I expect this one will require more discussion, so I am putting them both up at the same time. The backend change will be merged first, assuming both are approved.

Edited: both stores and loads must use masked intrinsics for this optimization to be legal.
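
To make the load side concrete, here is a rough sketch of the kind of rewrite gap filling enables (not taken from the PR's tests; the function name, offsets, and alignments are made up). Three i32 loads at offsets 0, 4, and 12 leave a hole at offset 8; filling that hole lets the chain become a single <4 x i32> masked load with the third lane disabled:

  ; Before: contiguous except for the missing element at offset 8.
  define i32 @sketch(ptr align 16 %p) {
    %a = load i32, ptr %p, align 16
    %q1 = getelementptr i8, ptr %p, i64 4
    %b = load i32, ptr %q1, align 4
    %q3 = getelementptr i8, ptr %p, i64 12
    %d = load i32, ptr %q3, align 4
    %s = add i32 %a, %b
    %r = add i32 %s, %d
    ret i32 %r
  }

  ; After (sketch): one 128-bit masked load, lane 2 masked off, with the
  ; original scalars recovered via extractelement.
  ;   %v = call <4 x i32> @llvm.masked.load.v4i32.p0(
  ;            ptr %p, i32 16,
  ;            <4 x i1> <i1 true, i1 true, i1 false, i1 true>,
  ;            <4 x i32> poison)
  ;   %a = extractelement <4 x i32> %v, i32 0
  ;   %b = extractelement <4 x i32> %v, i32 1
  ;   %d = extractelement <4 x i32> %v, i32 3

Whether the pass ultimately emits a masked load or a plain widened load depends on the TTI hooks discussed above.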

@llvmbot llvmbot added vectorizers, backend:NVPTX, llvm:analysis, and llvm:transforms labels Sep 17, 2025
@llvmbot
Member

llvmbot commented Sep 17, 2025

@llvm/pr-subscribers-vectorizers

@llvm/pr-subscribers-llvm-transforms

Author: Drew Kersnar (dakersnar)

Changes

This change introduces Gap Filling, an optimization that aims to fill in holes in otherwise contiguous load/store chains to enable vectorization. It also introduces Chain Extending, which extends the end of a chain to the closest power of 2.

This was originally motivated by the NVPTX target, but I tried to generalize it to be universally applicable to all targets that may use the LSV. One way I did so was by introducing a new TTI API, isLegalToWidenLoads, that allows targets to opt in to these optimizations. I'm more than willing to make adjustments to make this change more target-agnostic; I fully expect there are some issues and encourage feedback on how to improve things.

For stores, which, unlike loads, cannot be filled or extended without consequence, we only perform the optimization when we can generate a legal LLVM masked store intrinsic, masking off the additional elements. Determining legality for stores is a little tricky on the NVPTX side, because these intrinsics are only supported for 256-bit vectors. See the other PR I opened for the NVPTX lowering of masked store intrinsics, which includes NVPTX TTI changes that return true for isLegalMaskedStore under certain conditions: #159387. This change depends on that backend change, but I expect this one will require more discussion, so I am putting them both up at the same time. The backend change will be merged first, assuming both are approved.
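
As an illustrative sketch of the store side (assumed shapes, not taken from the PR's tests): a chain of seven i32 stores extended to eight elements would end up as a single 256-bit masked store whose last lane is disabled, so the poison "extra" element is never actually written:

  ; %vals is an <8 x i32> whose lane 7 holds poison (the extra element).
  call void @llvm.masked.store.v8i32.p0(
      <8 x i32> %vals, ptr %dst, i32 32,
      <8 x i1> <i1 true, i1 true, i1 true, i1 true,
                i1 true, i1 true, i1 true, i1 false>)

On NVPTX this is only attempted when isLegalMaskedStore reports the resulting 256-bit vector store as legal, per the companion backend change (#159387).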


Patch is 95.34 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/159388.diff

15 Files Affected:

  • (modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+6)
  • (modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+2)
  • (modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+4)
  • (modified) llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h (+2)
  • (modified) llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp (+383-52)
  • (modified) llvm/test/CodeGen/NVPTX/LoadStoreVectorizer.ll (+21-19)
  • (modified) llvm/test/CodeGen/NVPTX/param-vectorize-device.ll (+2-4)
  • (modified) llvm/test/CodeGen/NVPTX/variadics-backend.ll (+1-1)
  • (added) llvm/test/Transforms/LoadStoreVectorizer/NVPTX/extend-chain.ll (+81)
  • (added) llvm/test/Transforms/LoadStoreVectorizer/NVPTX/gap-fill-cleanup.ll (+37)
  • (added) llvm/test/Transforms/LoadStoreVectorizer/NVPTX/gap-fill-invariant.ll (+83)
  • (added) llvm/test/Transforms/LoadStoreVectorizer/NVPTX/gap-fill-vectors.ll (+186)
  • (added) llvm/test/Transforms/LoadStoreVectorizer/NVPTX/gap-fill.ll (+194)
  • (added) llvm/test/Transforms/LoadStoreVectorizer/NVPTX/masked-store.ll (+541)
  • (modified) llvm/test/Transforms/LoadStoreVectorizer/NVPTX/vectorize_i8.ll (+1-2)
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 41ff54f0781a2..f8f134c833ea2 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -817,6 +817,12 @@ class TargetTransformInfo {
   LLVM_ABI bool isLegalMaskedLoad(Type *DataType, Align Alignment,
                                   unsigned AddressSpace) const;
 
+  /// Return true if it is legal to widen loads beyond their current width,
+  /// assuming the result is still well-aligned. For example, converting a load
+  /// i32 to a load i64, or vectorizing three contiguous load i32s into a load
+  /// <4 x i32>.
+  LLVM_ABI bool isLegalToWidenLoads() const;
+
   /// Return true if the target supports nontemporal store.
   LLVM_ABI bool isLegalNTStore(Type *DataType, Align Alignment) const;
   /// Return true if the target supports nontemporal load.
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 566e1cf51631a..55bd4bd709589 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -318,6 +318,8 @@ class TargetTransformInfoImplBase {
     return false;
   }
 
+  virtual bool isLegalToWidenLoads() const { return false; }
+
   virtual bool isLegalNTStore(Type *DataType, Align Alignment) const {
     // By default, assume nontemporal memory stores are available for stores
     // that are aligned and have a size that is a power of 2.
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index 09b50c5270e57..89cda79558057 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -476,6 +476,10 @@ bool TargetTransformInfo::isLegalMaskedLoad(Type *DataType, Align Alignment,
   return TTIImpl->isLegalMaskedLoad(DataType, Alignment, AddressSpace);
 }
 
+bool TargetTransformInfo::isLegalToWidenLoads() const {
+  return TTIImpl->isLegalToWidenLoads();
+}
+
 bool TargetTransformInfo::isLegalNTStore(Type *DataType,
                                          Align Alignment) const {
   return TTIImpl->isLegalNTStore(DataType, Alignment);
diff --git a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
index b32d931bd3074..d56cff1ce3695 100644
--- a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
+++ b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
@@ -72,6 +72,8 @@ class NVPTXTTIImpl final : public BasicTTIImplBase<NVPTXTTIImpl> {
     return isLegalToVectorizeLoadChain(ChainSizeInBytes, Alignment, AddrSpace);
   }
 
+  bool isLegalToWidenLoads() const override { return true; }
+
   // NVPTX has infinite registers of all kinds, but the actual machine doesn't.
   // We conservatively return 1 here which is just enough to enable the
   // vectorizers but disables heuristics based on the number of registers.
diff --git a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
index 7b5137b0185ab..04f4e92826a52 100644
--- a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
@@ -119,6 +119,29 @@ using namespace llvm;
 
 #define DEBUG_TYPE "load-store-vectorizer"
 
+cl::opt<bool>
+    ExtendLoads("vect-extend-loads", cl::Hidden,
+                cl::desc("Load more elements if the target VF is higher "
+                         "than the chain length."),
+                cl::init(true));
+
+cl::opt<bool> ExtendStores(
+    "vect-extend-stores", cl::Hidden,
+    cl::desc("Store more elements if the target VF is higher "
+             "than the chain length and we have access to masked stores."),
+    cl::init(true));
+
+cl::opt<bool> FillLoadGaps(
+    "vect-fill-load-gaps", cl::Hidden,
+    cl::desc("Should Loads be introduced in gaps to enable vectorization."),
+    cl::init(true));
+
+cl::opt<bool>
+    FillStoreGaps("vect-fill-store-gaps", cl::Hidden,
+                  cl::desc("Should Stores be introduced in gaps to enable "
+                           "vectorization into masked stores."),
+                  cl::init(true));
+
 STATISTIC(NumVectorInstructions, "Number of vector accesses generated");
 STATISTIC(NumScalarsVectorized, "Number of scalar accesses vectorized");
 
@@ -246,12 +269,16 @@ class Vectorizer {
   const DataLayout &DL;
   IRBuilder<> Builder;
 
-  // We could erase instrs right after vectorizing them, but that can mess up
-  // our BB iterators, and also can make the equivalence class keys point to
-  // freed memory.  This is fixable, but it's simpler just to wait until we're
-  // done with the BB and erase all at once.
+  /// We could erase instrs right after vectorizing them, but that can mess up
+  /// our BB iterators, and also can make the equivalence class keys point to
+  /// freed memory.  This is fixable, but it's simpler just to wait until we're
+  /// done with the BB and erase all at once.
   SmallVector<Instruction *, 128> ToErase;
 
+  /// We insert load/store instructions and GEPs to fill gaps and extend chains
+  /// to enable vectorization. Keep track and delete them later.
+  DenseSet<Instruction *> ExtraElements;
+
 public:
   Vectorizer(Function &F, AliasAnalysis &AA, AssumptionCache &AC,
              DominatorTree &DT, ScalarEvolution &SE, TargetTransformInfo &TTI)
@@ -344,6 +371,28 @@ class Vectorizer {
   /// Postcondition: For all i, ret[i][0].second == 0, because the first instr
   /// in the chain is the leader, and an instr touches distance 0 from itself.
   std::vector<Chain> gatherChains(ArrayRef<Instruction *> Instrs);
+
+  /// Is a load/store with this alignment allowed by TTI and at least as fast
+  /// as an unvectorized load/store.
+  bool accessIsAllowedAndFast(unsigned SizeBytes, unsigned AS, Align Alignment,
+                              unsigned VecElemBits) const;
+
+  /// Before attempting to fill gaps, check if the chain is a candidate for
+  /// a masked store, to save compile time if it is not possible for the address
+  /// space and element type.
+  bool shouldAttemptMaskedStore(const ArrayRef<ChainElem> C) const;
+
+  /// Create a new GEP and a new Load/Store instruction such that the GEP
+  /// is pointing at PrevElem + Offset. In the case of stores, store poison.
+  /// Extra elements will either be combined into a vector/masked store or
+  /// deleted before the end of the pass.
+  ChainElem createExtraElementAfter(const ChainElem &PrevElem, APInt Offset,
+                                    StringRef Prefix,
+                                    Align Alignment = Align(1));
+
+  /// Delete dead GEPs and extra Load/Store instructions created by
+  /// createExtraElementAfter
+  void deleteExtraElements();
 };
 
 class LoadStoreVectorizerLegacyPass : public FunctionPass {
@@ -457,12 +506,21 @@ bool Vectorizer::run() {
       Changed |= runOnPseudoBB(*It, *std::next(It));
 
     for (Instruction *I : ToErase) {
+      // These will get deleted in deleteExtraElements.
+      // This is because ExtraElements will include both extra elements
+      // that *were* vectorized and extra elements that *were not*
+      // vectorized. ToErase will only include extra elements that *were*
+      // vectorized, so in order to avoid double deletion we skip them here and
+      // handle them in deleteExtraElements.
+      if (ExtraElements.contains(I))
+        continue;
       auto *PtrOperand = getLoadStorePointerOperand(I);
       if (I->use_empty())
         I->eraseFromParent();
       RecursivelyDeleteTriviallyDeadInstructions(PtrOperand);
     }
     ToErase.clear();
+    deleteExtraElements();
   }
 
   return Changed;
@@ -623,6 +681,29 @@ std::vector<Chain> Vectorizer::splitChainByContiguity(Chain &C) {
     dumpChain(C);
   });
 
+  // If the chain is not contiguous, we try to fill the gap with "extra"
+  // elements to artificially make it contiguous, to try to enable
+  // vectorization.
+  // - Filling gaps in loads is always ok if the target supports widening loads.
+  // - For stores, we only fill gaps if there is a potentially legal masked
+  //   store for the target. If later on, we don't end up with a chain that
+  //   could be vectorized into a legal masked store, the chains with extra
+  //   elements will be filtered out in splitChainByAlignment.
+  bool TryFillGaps = isa<LoadInst>(C[0].Inst)
+                         ? (FillLoadGaps && TTI.isLegalToWidenLoads())
+                         : (FillStoreGaps && shouldAttemptMaskedStore(C));
+
+  unsigned ASPtrBits =
+      DL.getIndexSizeInBits(getLoadStoreAddressSpace(C[0].Inst));
+
+  // Compute the alignment of the leader of the chain (which every stored offset
+  // is based on) using the current first element of the chain. This is
+  // conservative, we may be able to derive better alignment by iterating over
+  // the chain and finding the leader.
+  Align LeaderOfChainAlign =
+      commonAlignment(getLoadStoreAlignment(C[0].Inst),
+                      C[0].OffsetFromLeader.abs().getLimitedValue());
+
   std::vector<Chain> Ret;
   Ret.push_back({C.front()});
 
@@ -633,7 +714,8 @@ std::vector<Chain> Vectorizer::splitChainByContiguity(Chain &C) {
     unsigned SzBits = DL.getTypeSizeInBits(getLoadStoreType(&*Prev.Inst));
     assert(SzBits % 8 == 0 && "Non-byte sizes should have been filtered out by "
                               "collectEquivalenceClass");
-    APInt PrevReadEnd = Prev.OffsetFromLeader + SzBits / 8;
+    APInt PrevSzBytes = APInt(ASPtrBits, SzBits / 8);
+    APInt PrevReadEnd = Prev.OffsetFromLeader + PrevSzBytes;
 
     // Add this instruction to the end of the current chain, or start a new one.
     bool AreContiguous = It->OffsetFromLeader == PrevReadEnd;
@@ -642,10 +724,54 @@ std::vector<Chain> Vectorizer::splitChainByContiguity(Chain &C) {
                       << *Prev.Inst << " (ends at offset " << PrevReadEnd
                       << ") -> " << *It->Inst << " (starts at offset "
                       << It->OffsetFromLeader << ")\n");
-    if (AreContiguous)
+
+    if (AreContiguous) {
       CurChain.push_back(*It);
-    else
-      Ret.push_back({*It});
+      continue;
+    }
+
+    // For now, we aren't filling gaps between load/stores of different sizes.
+    // Additionally, as a conservative heuristic, we only fill gaps of 1-2
+    // elements. Generating loads/stores with too many unused bytes has a side
+    // effect of increasing register pressure (on NVIDIA targets at least),
+    // which could cancel out the benefits of reducing number of load/stores.
+    if (TryFillGaps &&
+        SzBits == DL.getTypeSizeInBits(getLoadStoreType(It->Inst))) {
+      APInt OffsetOfGapStart = Prev.OffsetFromLeader + PrevSzBytes;
+      APInt GapSzBytes = It->OffsetFromLeader - OffsetOfGapStart;
+      if (GapSzBytes == PrevSzBytes) {
+        // There is a single-element gap between Prev and Curr; create one extra element.
+        ChainElem NewElem = createExtraElementAfter(
+            Prev, PrevSzBytes, "GapFill",
+            commonAlignment(LeaderOfChainAlign,
+                            OffsetOfGapStart.abs().getLimitedValue()));
+        CurChain.push_back(NewElem);
+        CurChain.push_back(*It);
+        continue;
+      }
+      // There is a two-element gap between Prev and Curr; only create two
+      // extra elements if Prev is the first element in a sequence of four.
+      // This has the highest chance of resulting in a beneficial vectorization.
+      if ((GapSzBytes == 2 * PrevSzBytes) && (CurChain.size() % 4 == 1)) {
+        ChainElem NewElem1 = createExtraElementAfter(
+            Prev, PrevSzBytes, "GapFill",
+            commonAlignment(LeaderOfChainAlign,
+                            OffsetOfGapStart.abs().getLimitedValue()));
+        ChainElem NewElem2 = createExtraElementAfter(
+            NewElem1, PrevSzBytes, "GapFill",
+            commonAlignment(
+                LeaderOfChainAlign,
+                (OffsetOfGapStart + PrevSzBytes).abs().getLimitedValue()));
+        CurChain.push_back(NewElem1);
+        CurChain.push_back(NewElem2);
+        CurChain.push_back(*It);
+        continue;
+      }
+    }
+
+    // The chain is not contiguous and cannot be made contiguous with gap
+    // filling, so we need to start a new chain.
+    Ret.push_back({*It});
   }
 
   // Filter out length-1 chains, these are uninteresting.
@@ -721,6 +847,14 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
   unsigned AS = getLoadStoreAddressSpace(C[0].Inst);
   unsigned VecRegBytes = TTI.getLoadStoreVecRegBitWidth(AS) / 8;
 
+  // For compile time reasons, we cache whether or not the superset
+  // of all candidate chains contains any extra stores from earlier gap
+  // filling.
+  bool CandidateChainsMayContainExtraStores =
+      !IsLoadChain && any_of(C, [this](const ChainElem &E) {
+        return ExtraElements.contains(E.Inst);
+      });
+
   std::vector<Chain> Ret;
   for (unsigned CBegin = 0; CBegin < C.size(); ++CBegin) {
     // Find candidate chains of size not greater than the largest vector reg.
@@ -769,41 +903,6 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
         continue;
       }
 
-      // Is a load/store with this alignment allowed by TTI and at least as fast
-      // as an unvectorized load/store?
-      //
-      // TTI and F are passed as explicit captures to WAR an MSVC misparse (??).
-      auto IsAllowedAndFast = [&, SizeBytes = SizeBytes, &TTI = TTI,
-                               &F = F](Align Alignment) {
-        if (Alignment.value() % SizeBytes == 0)
-          return true;
-        unsigned VectorizedSpeed = 0;
-        bool AllowsMisaligned = TTI.allowsMisalignedMemoryAccesses(
-            F.getContext(), SizeBytes * 8, AS, Alignment, &VectorizedSpeed);
-        if (!AllowsMisaligned) {
-          LLVM_DEBUG(dbgs()
-                     << "LSV: Access of " << SizeBytes << "B in addrspace "
-                     << AS << " with alignment " << Alignment.value()
-                     << " is misaligned, and therefore can't be vectorized.\n");
-          return false;
-        }
-
-        unsigned ElementwiseSpeed = 0;
-        (TTI).allowsMisalignedMemoryAccesses((F).getContext(), VecElemBits, AS,
-                                             Alignment, &ElementwiseSpeed);
-        if (VectorizedSpeed < ElementwiseSpeed) {
-          LLVM_DEBUG(dbgs()
-                     << "LSV: Access of " << SizeBytes << "B in addrspace "
-                     << AS << " with alignment " << Alignment.value()
-                     << " has relative speed " << VectorizedSpeed
-                     << ", which is lower than the elementwise speed of "
-                     << ElementwiseSpeed
-                     << ".  Therefore this access won't be vectorized.\n");
-          return false;
-        }
-        return true;
-      };
-
       // If we're loading/storing from an alloca, align it if possible.
       //
       // FIXME: We eagerly upgrade the alignment, regardless of whether TTI
@@ -818,8 +917,7 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
                             isa<AllocaInst>(PtrOperand->stripPointerCasts());
       Align Alignment = getLoadStoreAlignment(C[CBegin].Inst);
       Align PrefAlign = Align(StackAdjustedAlignment);
-      if (IsAllocaAccess && Alignment.value() % SizeBytes != 0 &&
-          IsAllowedAndFast(PrefAlign)) {
+      if (IsAllocaAccess && Alignment.value() % SizeBytes != 0) {
         Align NewAlign = getOrEnforceKnownAlignment(
             PtrOperand, PrefAlign, DL, C[CBegin].Inst, nullptr, &DT);
         if (NewAlign >= Alignment) {
@@ -831,7 +929,59 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
         }
       }
 
-      if (!IsAllowedAndFast(Alignment)) {
+      Chain ExtendingLoadsStores;
+      bool ExtendChain = IsLoadChain ? ExtendLoads : ExtendStores;
+      if (ExtendChain && NumVecElems < TargetVF && NumVecElems % 2 != 0 &&
+          VecElemBits >= 8) {
+        // TargetVF may be a lot higher than NumVecElems,
+        // so only extend to the next power of 2.
+        assert(VecElemBits % 8 == 0);
+        unsigned VecElemBytes = VecElemBits / 8;
+        unsigned NewNumVecElems = PowerOf2Ceil(NumVecElems);
+        unsigned NewSizeBytes = VecElemBytes * NewNumVecElems;
+
+        assert(NewNumVecElems <= TargetVF);
+
+        LLVM_DEBUG(dbgs() << "LSV: attempting to extend chain of "
+                          << NumVecElems << " "
+                          << (IsLoadChain ? "loads" : "stores") << " to "
+                          << NewNumVecElems << " elements\n");
+        // Do not artificially increase the chain if it becomes misaligned,
+        // otherwise we may unnecessarily split the chain when the target actually
+        // supports non-pow2 VF.
+        if (accessIsAllowedAndFast(NewSizeBytes, AS, Alignment, VecElemBits) &&
+            ((IsLoadChain ? TTI.isLegalToWidenLoads()
+                          : TTI.isLegalMaskedStore(
+                                FixedVectorType::get(VecElemTy, NewNumVecElems),
+                                Alignment, AS, /*IsMaskConstant=*/true)))) {
+          LLVM_DEBUG(dbgs()
+                     << "LSV: extending " << (IsLoadChain ? "load" : "store")
+                     << " chain of " << NumVecElems << " "
+                     << (IsLoadChain ? "loads" : "stores")
+                     << " with total byte size of " << SizeBytes << " to "
+                     << NewNumVecElems << " "
+                     << (IsLoadChain ? "loads" : "stores")
+                     << " with total byte size of " << NewSizeBytes
+                     << ", TargetVF=" << TargetVF << " \n");
+
+          unsigned ASPtrBits = DL.getIndexSizeInBits(AS);
+          ChainElem Prev = C[CEnd];
+          for (unsigned i = 0; i < (NewNumVecElems - NumVecElems); i++) {
+            ChainElem NewElem = createExtraElementAfter(
+                Prev, APInt(ASPtrBits, VecElemBytes), "Extend");
+            ExtendingLoadsStores.push_back(NewElem);
+            Prev = ExtendingLoadsStores.back();
+          }
+
+          // Update the size and number of elements for upcoming checks.
+          SizeBytes = NewSizeBytes;
+          NumVecElems = NewNumVecElems;
+        }
+      }
+
+      if (!accessIsAllowedAndFast(SizeBytes, AS, Alignment, VecElemBits)) {
         LLVM_DEBUG(
             dbgs() << "LSV: splitChainByAlignment discarding candidate chain "
                       "because its alignment is not AllowedAndFast: "
@@ -849,10 +999,41 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
         continue;
       }
 
+      if (CandidateChainsMayContainExtraStores) {
+        // The legality of adding extra stores to ExtendingLoadsStores has
+        // already been checked, but if the candidate chain contains extra
+        // stores from an earlier optimization, confirm legality now.
+        // This filter is essential because, when filling gaps in
+        // splitChainByContinuity, we queried the API to check that (for a given
+        // element type and address space) there *may* be a legal masked store
+        // we can try to create. Now, we need to check if the actual chain we
+        // ended up with is legal to turn into a masked store.
+        // This is relevant for NVPTX targets, for example, where a masked store
+        // is only legal if we have ended up with a 256-bit vector.
+        bool CandidateChainContainsExtraStores = llvm::any_of(
+            ArrayRef<ChainElem>(C).slice(CBegin, CEnd - CBegin + 1),
+            [this](const ChainElem &E) {
+              return ExtraElements.contains(E.Inst);
+            });
+
+        if (CandidateChainContainsExtraStores &&
+            !TTI.isLegalMaskedStore(
+                FixedVectorType::get(VecElemTy, NumVecElems), Alignment, AS,
+                /*IsMaskConstant=*/true)) {
+          LLVM_DEBUG(dbgs()
+                     << "LSV: splitChainByAlignment discarding candidate chain "
+                        "because it contains extra stores that we cannot "
+                        "legally vectorize into a masked store \n");
+          continue;
+        }
+      }
+
       // Hooray, we can vect...
[truncated]
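
For anyone experimenting with this locally, the hidden cl::opt flags added above can be toggled when running the pass standalone through opt; one possible invocation (the input file name is just a placeholder) is:

  opt -S -passes=load-store-vectorizer \
      -vect-fill-load-gaps=false -vect-fill-store-gaps=false \
      -vect-extend-loads=false -vect-extend-stores=false \
      input.ll -o -

With all four flags set to false, the new gap-filling and chain-extending behavior is disabled.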

@llvmbot
Member

llvmbot commented Sep 17, 2025

@llvm/pr-subscribers-backend-nvptx


@llvmbot
Member

llvmbot commented Sep 17, 2025

@llvm/pr-subscribers-llvm-analysis

+
   std::vector<Chain> Ret;
   Ret.push_back({C.front()});
 
@@ -633,7 +714,8 @@ std::vector<Chain> Vectorizer::splitChainByContiguity(Chain &C) {
     unsigned SzBits = DL.getTypeSizeInBits(getLoadStoreType(&*Prev.Inst));
     assert(SzBits % 8 == 0 && "Non-byte sizes should have been filtered out by "
                               "collectEquivalenceClass");
-    APInt PrevReadEnd = Prev.OffsetFromLeader + SzBits / 8;
+    APInt PrevSzBytes = APInt(ASPtrBits, SzBits / 8);
+    APInt PrevReadEnd = Prev.OffsetFromLeader + PrevSzBytes;
 
     // Add this instruction to the end of the current chain, or start a new one.
     bool AreContiguous = It->OffsetFromLeader == PrevReadEnd;
@@ -642,10 +724,54 @@ std::vector<Chain> Vectorizer::splitChainByContiguity(Chain &C) {
                       << *Prev.Inst << " (ends at offset " << PrevReadEnd
                       << ") -> " << *It->Inst << " (starts at offset "
                       << It->OffsetFromLeader << ")\n");
-    if (AreContiguous)
+
+    if (AreContiguous) {
       CurChain.push_back(*It);
-    else
-      Ret.push_back({*It});
+      continue;
+    }
+
+    // For now, we aren't filling gaps between load/stores of different sizes.
+    // Additionally, as a conservative heuristic, we only fill gaps of 1-2
+    // elements. Generating loads/stores with too many unused bytes has a side
+    // effect of increasing register pressure (on NVIDIA targets at least),
+    // which could cancel out the benefits of reducing number of load/stores.
+    if (TryFillGaps &&
+        SzBits == DL.getTypeSizeInBits(getLoadStoreType(It->Inst))) {
+      APInt OffsetOfGapStart = Prev.OffsetFromLeader + PrevSzBytes;
+      APInt GapSzBytes = It->OffsetFromLeader - OffsetOfGapStart;
+      if (GapSzBytes == PrevSzBytes) {
+        // There is a single gap between Prev and Curr, create one extra element
+        ChainElem NewElem = createExtraElementAfter(
+            Prev, PrevSzBytes, "GapFill",
+            commonAlignment(LeaderOfChainAlign,
+                            OffsetOfGapStart.abs().getLimitedValue()));
+        CurChain.push_back(NewElem);
+        CurChain.push_back(*It);
+        continue;
+      }
+      // There are two gaps between Prev and Curr, only create two extra
+      // elements if Prev is the first element in a sequence of four.
+      // This has the highest chance of resulting in a beneficial vectorization.
+      if ((GapSzBytes == 2 * PrevSzBytes) && (CurChain.size() % 4 == 1)) {
+        ChainElem NewElem1 = createExtraElementAfter(
+            Prev, PrevSzBytes, "GapFill",
+            commonAlignment(LeaderOfChainAlign,
+                            OffsetOfGapStart.abs().getLimitedValue()));
+        ChainElem NewElem2 = createExtraElementAfter(
+            NewElem1, PrevSzBytes, "GapFill",
+            commonAlignment(
+                LeaderOfChainAlign,
+                (OffsetOfGapStart + PrevSzBytes).abs().getLimitedValue()));
+        CurChain.push_back(NewElem1);
+        CurChain.push_back(NewElem2);
+        CurChain.push_back(*It);
+        continue;
+      }
+    }
+
+    // The chain is not contiguous and cannot be made contiguous with gap
+    // filling, so we need to start a new chain.
+    Ret.push_back({*It});
   }
 
   // Filter out length-1 chains, these are uninteresting.
@@ -721,6 +847,14 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
   unsigned AS = getLoadStoreAddressSpace(C[0].Inst);
   unsigned VecRegBytes = TTI.getLoadStoreVecRegBitWidth(AS) / 8;
 
+  // For compile time reasons, we cache whether or not the superset
+  // of all candidate chains contains any extra stores from earlier gap
+  // filling.
+  bool CandidateChainsMayContainExtraStores =
+      !IsLoadChain && any_of(C, [this](const ChainElem &E) {
+        return ExtraElements.contains(E.Inst);
+      });
+
   std::vector<Chain> Ret;
   for (unsigned CBegin = 0; CBegin < C.size(); ++CBegin) {
     // Find candidate chains of size not greater than the largest vector reg.
@@ -769,41 +903,6 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
         continue;
       }
 
-      // Is a load/store with this alignment allowed by TTI and at least as fast
-      // as an unvectorized load/store?
-      //
-      // TTI and F are passed as explicit captures to WAR an MSVC misparse (??).
-      auto IsAllowedAndFast = [&, SizeBytes = SizeBytes, &TTI = TTI,
-                               &F = F](Align Alignment) {
-        if (Alignment.value() % SizeBytes == 0)
-          return true;
-        unsigned VectorizedSpeed = 0;
-        bool AllowsMisaligned = TTI.allowsMisalignedMemoryAccesses(
-            F.getContext(), SizeBytes * 8, AS, Alignment, &VectorizedSpeed);
-        if (!AllowsMisaligned) {
-          LLVM_DEBUG(dbgs()
-                     << "LSV: Access of " << SizeBytes << "B in addrspace "
-                     << AS << " with alignment " << Alignment.value()
-                     << " is misaligned, and therefore can't be vectorized.\n");
-          return false;
-        }
-
-        unsigned ElementwiseSpeed = 0;
-        (TTI).allowsMisalignedMemoryAccesses((F).getContext(), VecElemBits, AS,
-                                             Alignment, &ElementwiseSpeed);
-        if (VectorizedSpeed < ElementwiseSpeed) {
-          LLVM_DEBUG(dbgs()
-                     << "LSV: Access of " << SizeBytes << "B in addrspace "
-                     << AS << " with alignment " << Alignment.value()
-                     << " has relative speed " << VectorizedSpeed
-                     << ", which is lower than the elementwise speed of "
-                     << ElementwiseSpeed
-                     << ".  Therefore this access won't be vectorized.\n");
-          return false;
-        }
-        return true;
-      };
-
       // If we're loading/storing from an alloca, align it if possible.
       //
       // FIXME: We eagerly upgrade the alignment, regardless of whether TTI
@@ -818,8 +917,7 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
                             isa<AllocaInst>(PtrOperand->stripPointerCasts());
       Align Alignment = getLoadStoreAlignment(C[CBegin].Inst);
       Align PrefAlign = Align(StackAdjustedAlignment);
-      if (IsAllocaAccess && Alignment.value() % SizeBytes != 0 &&
-          IsAllowedAndFast(PrefAlign)) {
+      if (IsAllocaAccess && Alignment.value() % SizeBytes != 0) {
         Align NewAlign = getOrEnforceKnownAlignment(
             PtrOperand, PrefAlign, DL, C[CBegin].Inst, nullptr, &DT);
         if (NewAlign >= Alignment) {
@@ -831,7 +929,59 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
         }
       }
 
-      if (!IsAllowedAndFast(Alignment)) {
+      Chain ExtendingLoadsStores;
+      bool ExtendChain = IsLoadChain
+                             ? ExtendLoads
+                             : ExtendStores;
+      if (ExtendChain && NumVecElems < TargetVF && NumVecElems % 2 != 0 &&
+          VecElemBits >= 8) {
+        // TargetVF may be a lot higher than NumVecElems,
+        // so only extend to the next power of 2.
+        assert(VecElemBits % 8 == 0);
+        unsigned VecElemBytes = VecElemBits / 8;
+        unsigned NewNumVecElems = PowerOf2Ceil(NumVecElems);
+        unsigned NewSizeBytes = VecElemBytes * NewNumVecElems;
+
+        assert(NewNumVecElems <= TargetVF);
+
+        LLVM_DEBUG(dbgs() << "LSV: attempting to extend chain of "
+                          << NumVecElems << " "
+                          << (IsLoadChain ? "loads" : "stores") << " to "
+                          << NewNumVecElems << " elements\n");
+        // Do not artificially increase the chain if it becomes misaligned,
+        // otherwise we may unnecessary split the chain when the target actually
+        // supports non-pow2 VF.
+        if (accessIsAllowedAndFast(NewSizeBytes, AS, Alignment, VecElemBits) &&
+            ((IsLoadChain ? TTI.isLegalToWidenLoads()
+                          : TTI.isLegalMaskedStore(
+                                FixedVectorType::get(VecElemTy, NewNumVecElems),
+                                Alignment, AS, /*IsMaskConstant=*/true)))) {
+          LLVM_DEBUG(dbgs()
+                     << "LSV: extending " << (IsLoadChain ? "load" : "store")
+                     << " chain of " << NumVecElems << " "
+                     << (IsLoadChain ? "loads" : "stores")
+                     << " with total byte size of " << SizeBytes << " to "
+                     << NewNumVecElems << " "
+                     << (IsLoadChain ? "loads" : "stores")
+                     << " with total byte size of " << NewSizeBytes
+                     << ", TargetVF=" << TargetVF << " \n");
+
+          unsigned ASPtrBits = DL.getIndexSizeInBits(AS);
+          ChainElem Prev = C[CEnd];
+          for (unsigned i = 0; i < (NewNumVecElems - NumVecElems); i++) {
+            ChainElem NewElem = createExtraElementAfter(
+                Prev, APInt(ASPtrBits, VecElemBytes), "Extend");
+            ExtendingLoadsStores.push_back(NewElem);
+            Prev = ExtendingLoadsStores.back();
+          }
+
+          // Update the size and number of elements for upcoming checks.
+          SizeBytes = NewSizeBytes;
+          NumVecElems = NewNumVecElems;
+        }
+      }
+
+      if (!accessIsAllowedAndFast(SizeBytes, AS, Alignment, VecElemBits)) {
         LLVM_DEBUG(
             dbgs() << "LSV: splitChainByAlignment discarding candidate chain "
                       "because its alignment is not AllowedAndFast: "
@@ -849,10 +999,41 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
         continue;
       }
 
+      if (CandidateChainsMayContainExtraStores) {
+        // The legality of adding extra stores to ExtendingLoadsStores has
+        // already been checked, but if the candidate chain contains extra
+        // stores from an earlier optimization, confirm legality now.
+        // This filter is essential because, when filling gaps in
+        // splitChainByContinuity, we queried the API to check that (for a given
+        // element type and address space) there *may* be a legal masked store
+        // we can try to create. Now, we need to check if the actual chain we
+        // ended up with is legal to turn into a masked store.
+        // This is relevant for NVPTX targets, for example, where a masked store
+        // is only legal if we have ended up with a 256-bit vector.
+        bool CandidateChainContainsExtraStores = llvm::any_of(
+            ArrayRef<ChainElem>(C).slice(CBegin, CEnd - CBegin + 1),
+            [this](const ChainElem &E) {
+              return ExtraElements.contains(E.Inst);
+            });
+
+        if (CandidateChainContainsExtraStores &&
+            !TTI.isLegalMaskedStore(
+                FixedVectorType::get(VecElemTy, NumVecElems), Alignment, AS,
+                /*IsMaskConstant=*/true)) {
+          LLVM_DEBUG(dbgs()
+                     << "LSV: splitChainByAlignment discarding candidate chain "
+                        "because it contains extra stores that we cannot "
+                        "legally vectorize into a masked store \n");
+          continue;
+        }
+      }
+
       // Hooray, we can vect...
[truncated]
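To make the new knobs concrete, here is a minimal sketch of an input the pass could act on and how the flags might be toggled from `opt`. The flag names come from the `cl::opt` definitions above and `load-store-vectorizer` is the standard new-PM pass name; the triple, alignments, and the expectation that this particular chain gets gap-filled and extended are illustrative assumptions, not output from the patch.

```
; RUN: opt -passes=load-store-vectorizer -vect-fill-load-gaps=true \
; RUN:     -vect-extend-loads=true -S %s
target triple = "nvptx64-nvidia-cuda"

; Two i32 loads at offsets 0 and 8 leave a one-element gap at offset 4.
; Gap filling can make the chain contiguous, and extension can round it up
; to a power-of-2 element count so it may be emitted as a single (possibly
; masked) vector load.
define i32 @gap_candidate(ptr addrspace(1) %p) {
  %a = load i32, ptr addrspace(1) %p, align 16
  %q = getelementptr inbounds i8, ptr addrspace(1) %p, i64 8
  %b = load i32, ptr addrspace(1) %q, align 8
  %r = add i32 %a, %b
  ret i32 %r
}
```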

@github-actions

github-actions bot commented Sep 17, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@arsenm arsenm requested a review from gandhi56 September 18, 2025 02:21
@nikic
Contributor

nikic commented Sep 25, 2025

For stores, which unlike loads, cannot be filled/extended without consequence, we only perform the optimization when we can generate a legal llvm masked store intrinsic, masking off the additional elements.

I'm a bit concerned about the direction of this change wrt the assumption that loads are always safe. Do you need this to work in cases where the memory is not known dereferenceable?

The problem here is that even if this may be valid from a hardware perspective, it is not necessarily valid in LLVM IR. To give an obvious example, let's say you have a noalias store to just the "gap" element. Then introducing a load from that location will introduce UB, even if the result is ultimately "unused".

This also intersects with the larger question of whether allocations can have gaps (as in, munmaped regions). This change is not necessarily incompatible with that because in practice gaps would be at page granularity, but it does constrain the design space.
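As a concrete illustration of that concern (a sketch with hypothetical names, not code from the patch): if the caller also hands out a `noalias` pointer to the gap element, widening the two scalar loads into one plain vector load makes the function read the gap bytes through `%p`, conflicting with the `noalias` store even though that lane is never used.

```
; %p has live i32 elements at offsets 0 and 8; offset 4 is the "gap".
; Suppose the caller also passes %p+4 as the noalias pointer %gap.
define i32 @gap_hazard(ptr %p, ptr noalias %gap) {
  store i32 0, ptr %gap, align 4       ; a store to just the gap element
  %a = load i32, ptr %p, align 16
  %q = getelementptr inbounds i8, ptr %p, i64 8
  %b = load i32, ptr %q, align 4
  ; Replacing %a/%b with a plain `load <3 x i32>, ptr %p` would also read
  ; offset 4 through %p, which conflicts with the noalias store above.
  %r = add i32 %a, %b
  ret i32 %r
}
```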

@gonzalobg
Contributor

The problem here is that even if this may be valid from a hardware perspective, it is not necessarily valid in LLVM IR.

Would widening into a masked load that masks the unused elements (so that they are not loaded) be legal in LLVM IR?
That'd still allow the backend to issue a fully covered load if the backend knows that this is ok for the particular platform.

@nikic
Contributor

nikic commented Sep 25, 2025

The problem here is that even if this may be valid from a hardware perspective, it is not necessarily valid in LLVM IR.

Would widening into a masked load that masks the unused elements (so that they are not loaded) be legal in LLVM IR? That'd still allow the backend to issue a fully covered load if the backend knows that this is ok for the particular platform.

Yeah, a masked load is definitely fine.

@dakersnar
Contributor Author

Yeah, a masked load is definitely fine.

I prototyped an approach that uses this instead, and it did work. Only real downsides are in implementation difficulty: it places a higher burden on the backend to efficiently lower masked loads, requires other passes to keep masked loads in mind when optimizing loads, etc. But if we are confident that the current proposed approach is functionally incorrect, then I can and will pivot to that approach instead.

Two small details I want to get thoughts on before I proceed:

  • the masked load intrinsic includes this "pass through" vector as a parameter, which contains values to return in the positions of the masked-off elements. Can/should I fill that input vector with poison in this case, since those values should never be used?
  • I will need to adjust some of the generic SelectionDAG backend files to properly handle masked loads with `MD_invariant_load` metadata. Currently that metadata gets dropped when the SelectionDAG is built. I assume this would be an acceptable change?

@nikic
Contributor

nikic commented Sep 26, 2025

I prototyped an approach that uses this instead, and it did work. Only real downsides are in implementation difficulty: it places a higher burden on the backend to efficiently lower masked loads, requires other passes to keep masked loads in mind when optimizing loads, etc. But if we are confident that the current proposed approach is functionally incorrect, then I can and will pivot to that approach instead.

I wouldn't say I'm confident that we can't make it work in this form, but it's the kind of thing that gets bogged down in extended IR design discussions. If the masked load variant isn't too hard to get working, then it's probably more expedient to just go with that.

Two small details I want to get thoughts on before I proceed:

  • the masked load intrinsic includes this "pass through" vector as a parameter, which contains values to return in the positions of the masked-off elements. Can/should I fill that input vector with poison in this case, since those values should never be used?

Yes, using poison passthru should be fine.

  • I will need to adjust some of the generic SelectionDAG backend files to properly handle masked loads with `MD_invariant_load` metadata. Currently that metadata gets dropped when the SelectionDAG is built. I assume this would be an acceptable change?

Without knowing the details: Yes, that sounds fine to me.
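Putting those answers together, the gap-filled form would look roughly like the sketch below: the gap and extension lanes are masked off, the pass-through operand is poison, and only the live lanes are extracted. Lane counts and names are illustrative and not taken from the patch.

```
declare <4 x i32> @llvm.masked.load.v4i32.p0(ptr, i32, <4 x i1>, <4 x i32>)

; i32 loads at offsets 0 and 8 become one masked <4 x i32> load: lane 1 is
; the gap and lane 3 is the power-of-2 extension, so both are masked off
; and the pass-through is poison because those lanes are never used.
define i32 @masked_gap_fill(ptr %p) {
  %v = call <4 x i32> @llvm.masked.load.v4i32.p0(
           ptr %p, i32 16,
           <4 x i1> <i1 true, i1 false, i1 true, i1 false>,
           <4 x i32> poison)
  %a = extractelement <4 x i32> %v, i64 0
  %b = extractelement <4 x i32> %v, i64 2
  %r = add i32 %a, %b
  ret i32 %r
}
```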

@dakersnar
Contributor Author

I assume that if we use masked loads, we no longer need the new isLegalToWidenLoads API, and should determine legality in the same way that masked store legality is currently determined?

@dakersnar
Contributor Author

I'm working on the masked load implementation; it might take a while. I'll re-ping these two PRs once I'm ready for another review. Thanks for the feedback so far, folks.

@dakersnar
Contributor Author

I've updated this PR to generate masked loads as discussed, and it can be reviewed now. In the next day or two, I will update the NVPTX change that this is dependent on (#159387) to handle lowering of the masked load intrinsics. At that point they will both be ready for review. I'm hoping to merge them at the same time, back to back, once review iteration is done.

@dakersnar
Contributor Author

Both PRs are now completed + updated with recent LLVM changes and ready for full review.

@dakersnar dakersnar force-pushed the github/dkersnar/lsv-gap-fill branch from 47cf25a to bba61b5 Compare October 24, 2025 17:04
@dakersnar
Contributor Author

dakersnar commented Nov 21, 2025

Added a test that combines redundant loads with gap filling and chain extending. I'm also going to temporarily push a cherry pick of #159387 into this branch to ensure the pipeline is green, and then I'll rebase it away to make reviewing easier again. Just looking for showstoppers. If anyone happens to review this change while that is happening, ignore the changes from that commit.

Edit: These are the commits to review: https://github.com/llvm/llvm-project/pull/159388/files/551f136fe621027307a66954d1974cdd80432271, with ce4b7e0 just being here temporarily to check the pipeline.

Edit: checks are green, removed the cherry pick.

@dakersnar dakersnar requested a review from cmc-rep November 21, 2025 23:48
@dakersnar dakersnar force-pushed the github/dkersnar/lsv-gap-fill branch from ce4b7e0 to 551f136 Compare November 22, 2025 02:20
@dakersnar
Contributor Author

Checks are green.

@cmc-rep, @arsenm, and/or @Artem-B, ping for review please. Thank you :).

Contributor

@cmc-rep cmc-rep left a comment

The latest merge looks good to me

Member

@Artem-B Artem-B left a comment

LGTM overall with a test nit.

@cmc-rep
Contributor

cmc-rep commented Nov 24, 2025

Unfortunately, my change has been reverted again. It looks like this PR will need to be re-merged with the latest main.

@dakersnar
Contributor Author

@cmc-rep Do you know the reason yet? Would it make sense for me to merge my change first before relanding yours?

@dakersnar
Contributor Author

In the meantime, I'm going to merge the backend changes (#159387), now that this change has gotten approval.

@cmc-rep
Contributor

cmc-rep commented Nov 24, 2025

It causes some assertion in llvm when compiling some AMD library.
Please move ahead with your PR. Now that I see how you did the merge, it won't be very difficult for me to merge it in the other direction.

@Artem-B
Member

Artem-B commented Nov 24, 2025

It causes some assertion in llvm when compiling some AMD library.

Is there a reproducer? Stack trace pointing to the specific assertion? Anything else that we can use to figure out what went wrong?

dakersnar added a commit that referenced this pull request Nov 25, 2025
This backend support will allow the LoadStoreVectorizer, in certain
cases, to fill in gaps when creating load/store vectors and generate
LLVM masked load/stores
(https://llvm.org/docs/LangRef.html#llvm-masked-store-intrinsics). To
accomplish this, changes are separated into two parts. This first part
has the backend lowering and TTI changes, and a follow up PR will have
the LSV generate these intrinsics:
#159388.

In this backend change, Masked Loads get lowered to PTX with `#pragma
"used_bytes_mask" [mask];`
(https://docs.nvidia.com/cuda/parallel-thread-execution/#pragma-strings-used-bytes-mask).
And Masked Stores get lowered to PTX using the new sink symbol syntax
(https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-st).

# TTI Changes
TTI changes are needed because NVPTX only supports masked loads/stores
with _constant_ masks. `ScalarizeMaskedMemIntrin.cpp` is adjusted to
check that the mask is constant and pass that result into the TTI check.
Behavior shouldn't change for non-NVPTX targets, which do not care
whether the mask is variable or constant when determining legality, but
all TTI files that implement these API need to be updated.

# Masked store lowering implementation details
If the masked stores make it to the NVPTX backend without being
scalarized, they are handled by the following:
* `NVPTXISelLowering.cpp` - Sets up a custom operation action and
handles it in lowerMSTORE. Similar handling to normal store vectors,
except we read the mask and place a sentinel register `$noreg` in each
position where the mask reads as false.

For example, 
```
t10: v8i1 = BUILD_VECTOR Constant:i1<-1>, Constant:i1<0>, Constant:i1<0>, Constant:i1<-1>, Constant:i1<-1>, Constant:i1<0>, Constant:i1<0>, Constant:i1<-1>
t11: ch = masked_store<(store unknown-size into %ir.lsr.iv28, align 32, addrspace 1)> t5:1, t5, t7, undef:i64, t10

->

STV_i32_v8 killed %13:int32regs, $noreg, $noreg, killed %16:int32regs, killed %17:int32regs, $noreg, $noreg, killed %20:int32regs, 0, 0, 1, 8, 0, 32, %4:int64regs, 0, debug-location !18 :: (store unknown-size into %ir.lsr.iv28, align 32, addrspace 1);

```

* `NVPTXInstInfo.td` - changes the definition of store vectors to allow
for a mix of sink symbols and registers.
* `NVPXInstPrinter.h/.cpp` - Handles the `$noreg` case by printing "_".

# Masked load lowering implementation details
Masked loads are routed to normal PTX loads, with one difference: a
`#pragma "used_bytes_mask"` is emitted before the load instruction
(https://docs.nvidia.com/cuda/parallel-thread-execution/#pragma-strings-used-bytes-mask).
To accomplish this, a new operand is added to every NVPTXISD Load type
representing this mask.
* `NVPTXISelLowering.h/.cpp` - Masked loads are converted into normal
NVPTXISD loads with a mask operand in two ways. 1) In type legalization
through replaceLoadVector, which is the normal path, and 2) through
LowerMLOAD, to handle the legal vector types
(v2f16/v2bf16/v2i16/v4i8/v2f32) that will not be type legalized. Both
share the same convertMLOADToLoadWithUsedBytesMask helper. Both default
this operand to UINT32_MAX, representing all bytes on. For the latter,
we need a new `NVPTXISD::MLoadV1` type to represent that edge case
because we cannot put the used bytes mask operand on a generic
LoadSDNode.
* `NVPTXISelDAGToDAG.cpp` - Extract used bytes mask from loads, add them
to created machine instructions.
* `NVPTXInstPrinter.h/.cpp` - Print the pragma when the used bytes mask
isn't all ones.
* `NVPTXForwardParams.cpp`, `NVPTXReplaceImageHandles.cpp` - Update
manual indexing of load operands to account for new operand.
* `NVPTXInsrtInfo.td`, `NVPTXIntrinsics.td` - Add the used bytes mask to
the MI definitions.
* `NVPTXTagInvariantLoads.cpp` - Ensure that masked loads also get
tagged as invariant.

Some generic changes that are needed:
* `LegalizeVectorTypes.cpp` - Ensure flags are preserved when splitting
masked loads.
* `SelectionDAGBuilder.cpp` - Preserve `MD_invariant_load` on masked
load SDNode creation
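At the IR level, the input that produces a DAG like the one above would be an `llvm.masked.store` with a constant mask, roughly as in the sketch below (reconstructed from the BUILD_VECTOR mask; the function name and types are illustrative).

```
declare void @llvm.masked.store.v8i32.p1(<8 x i32>, ptr addrspace(1), i32, <8 x i1>)

; Lanes 1, 2, 5 and 6 are masked off; after lowering they surface as the
; "_" (sink) operands of the store, matching the $noreg operands in the
; STV_i32_v8 example above.
define void @store_with_gaps(<8 x i32> %v, ptr addrspace(1) %p) {
  call void @llvm.masked.store.v8i32.p1(
          <8 x i32> %v, ptr addrspace(1) %p, i32 32,
          <8 x i1> <i1 true, i1 false, i1 false, i1 true,
                    i1 true, i1 false, i1 false, i1 true>)
  ret void
}
```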
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Nov 25, 2025
…#159387)

@cmc-rep
Contributor

cmc-rep commented Nov 26, 2025

It causes some assertion in llvm when compiling some AMD library.

Is there a reproducer? Stack trace pointing to the specific assertion? Anything else that we can use to figure out what went wrong?

This is the case causing problems for my change:
%464 = load <1 x i32>, ptr addrspace(4) %coefficients_.i.i1034, align 4
%394 = load i32, ptr addrspace(4) %coefficients_.i.i1034, align 4
The original chain element is <1 x i32> and the new load is i32, but it still tries to create a shuffle.

So it looks like my change was only reverted in the AMD internal branch, not here. I will add the fix here quickly.

#169671

@dakersnar
Contributor Author

cc @gandhi56 for visibility

augusto2112 pushed a commit to augusto2112/llvm-project that referenced this pull request Dec 3, 2025
@dakersnar dakersnar force-pushed the github/dkersnar/lsv-gap-fill branch from 7cfc98c to 97e8a10 Compare December 3, 2025 03:36
; CHECK-NEXT: store <3 x half> [[S4]], ptr [[IN4]], align 4
; CHECK-NEXT: ret void
;
%load1 = load <3 x half>, ptr %rd0, align 16
Contributor Author

I updated this test to match what we expect to see after InferAlignment runs and puts an alignment of 16 onto this `load <3 x half>`. That is more representative of an input the vectorizer should expect. With that alignment plus the masked load change, the test now extends the two `load <3 x half>` instructions into a single `<8 x half>` masked load.
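The shape of that transformation is roughly the sketch below (value names are illustrative and are not the test's actual CHECK lines):

```
declare <8 x half> @llvm.masked.load.v8f16.p0(ptr, i32, <8 x i1>, <8 x half>)

; Two contiguous <3 x half> loads (12 bytes) are extended to one <8 x half>
; masked load (16 bytes) with the last two lanes masked off; the original
; values are recovered from lanes 0-2 and 3-5.
define <3 x half> @extend_sketch(ptr %rd0) {
  %wide = call <8 x half> @llvm.masked.load.v8f16.p0(
              ptr %rd0, i32 16,
              <8 x i1> <i1 true, i1 true, i1 true, i1 true,
                        i1 true, i1 true, i1 false, i1 false>,
              <8 x half> poison)
  %v0 = shufflevector <8 x half> %wide, <8 x half> poison, <3 x i32> <i32 0, i32 1, i32 2>
  %v1 = shufflevector <8 x half> %wide, <8 x half> poison, <3 x i32> <i32 3, i32 4, i32 5>
  %sum = fadd <3 x half> %v0, %v1
  ret <3 x half> %sum
}
```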

@dakersnar dakersnar force-pushed the github/dkersnar/lsv-gap-fill branch from 42a1b3f to eb6df17 Compare December 3, 2025 18:02
@dakersnar dakersnar force-pushed the github/dkersnar/lsv-gap-fill branch from a0968d5 to 37cef5b Compare December 3, 2025 23:20
@dakersnar
Contributor Author

Likely merging this on Monday, assuming green pipelines.
