
Conversation

@dakersnar
Contributor

@dakersnar dakersnar commented Sep 17, 2025

This change introduces Gap Filling, an optimization that fills holes in otherwise contiguous load/store chains to enable vectorization. It also introduces Chain Extending, which extends the end of a chain up to the nearest power-of-2 length.

This was originally motivated by the NVPTX target, but I tried to generalize it so it is applicable to any target that uses the LSV. I'm more than willing to make adjustments to make this change more target-agnostic; I fully expect there are some issues and encourage feedback on how to improve things.

For both loads and stores, we only perform the optimization when we can generate a legal LLVM masked load/store intrinsic, masking off the "extra" elements. Determining legality for stores is a little tricky on the NVPTX side, because these intrinsics are only supported for 256-bit vectors. See the other PR I opened for the NVPTX lowering of masked store intrinsics, which includes NVPTX TTI changes that return true for isLegalMaskedStore under certain conditions: #159387. This change depends on that backend change, but I expect this one will require more discussion, so I am putting them both up at the same time. The backend change will be merged first, assuming both are approved.

Edited: both stores and loads must use masked intrinsics for this optimization to be legal.
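
To make the load side concrete, here is a rough sketch of the kind of rewrite gap filling enables (not taken from the PR's tests; the function name, offsets, and alignments are made up). Three i32 loads at offsets 0, 4, and 12 leave a hole at offset 8; filling that hole lets the chain become a single <4 x i32> masked load with the third lane disabled:

  ; Before: contiguous except for the missing element at offset 8.
  define i32 @sketch(ptr align 16 %p) {
    %a = load i32, ptr %p, align 16
    %q1 = getelementptr i8, ptr %p, i64 4
    %b = load i32, ptr %q1, align 4
    %q3 = getelementptr i8, ptr %p, i64 12
    %d = load i32, ptr %q3, align 4
    %s = add i32 %a, %b
    %r = add i32 %s, %d
    ret i32 %r
  }

  ; After (sketch): one 128-bit masked load, lane 2 masked off, with the
  ; original scalars recovered via extractelement.
  ;   %v = call <4 x i32> @llvm.masked.load.v4i32.p0(
  ;            ptr %p, i32 16,
  ;            <4 x i1> <i1 true, i1 true, i1 false, i1 true>,
  ;            <4 x i32> poison)
  ;   %a = extractelement <4 x i32> %v, i32 0
  ;   %b = extractelement <4 x i32> %v, i32 1
  ;   %d = extractelement <4 x i32> %v, i32 3

Whether the pass ultimately emits a masked load or a plain widened load depends on the TTI hooks discussed above.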

@llvmbot llvmbot added vectorizers, backend:NVPTX, llvm:analysis, and llvm:transforms labels Sep 17, 2025
@llvmbot
Member

llvmbot commented Sep 17, 2025

@llvm/pr-subscribers-vectorizers

@llvm/pr-subscribers-llvm-transforms

Author: Drew Kersnar (dakersnar)

Changes

This change introduces Gap Filling, an optimization that aims to fill in holes in otherwise contiguous load/store chains to enable vectorization. It also introduces Chain Extending, which extends the end of a chain to the closest power of 2.

This was originally motivated by the NVPTX target, but I tried to generalize it to be universally applicable to all targets that may use the LSV. One way I did so was by introducing a new TTI API, isLegalToWidenLoads, that allows targets to opt in to these optimizations. I'm more than willing to make adjustments to make this change more target-agnostic; I fully expect there are some issues and encourage feedback on how to improve things.

For stores, which, unlike loads, cannot be filled or extended without consequence, we only perform the optimization when we can generate a legal LLVM masked store intrinsic, masking off the additional elements. Determining legality for stores is a little tricky on the NVPTX side, because these intrinsics are only supported for 256-bit vectors. See the other PR I opened for the NVPTX lowering of masked store intrinsics, which includes NVPTX TTI changes that return true for isLegalMaskedStore under certain conditions: #159387. This change depends on that backend change, but I expect this one will require more discussion, so I am putting them both up at the same time. The backend change will be merged first, assuming both are approved.
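
As an illustrative sketch of the store side (assumed shapes, not taken from the PR's tests): a chain of seven i32 stores extended to eight elements would end up as a single 256-bit masked store whose last lane is disabled, so the poison "extra" element is never actually written:

  ; %vals is an <8 x i32> whose lane 7 holds poison (the extra element).
  call void @llvm.masked.store.v8i32.p0(
      <8 x i32> %vals, ptr %dst, i32 32,
      <8 x i1> <i1 true, i1 true, i1 true, i1 true,
                i1 true, i1 true, i1 true, i1 false>)

On NVPTX this is only attempted when isLegalMaskedStore reports the resulting 256-bit vector store as legal, per the companion backend change (#159387).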


Patch is 95.34 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/159388.diff

15 Files Affected:

  • (modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+6)
  • (modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+2)
  • (modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+4)
  • (modified) llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h (+2)
  • (modified) llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp (+383-52)
  • (modified) llvm/test/CodeGen/NVPTX/LoadStoreVectorizer.ll (+21-19)
  • (modified) llvm/test/CodeGen/NVPTX/param-vectorize-device.ll (+2-4)
  • (modified) llvm/test/CodeGen/NVPTX/variadics-backend.ll (+1-1)
  • (added) llvm/test/Transforms/LoadStoreVectorizer/NVPTX/extend-chain.ll (+81)
  • (added) llvm/test/Transforms/LoadStoreVectorizer/NVPTX/gap-fill-cleanup.ll (+37)
  • (added) llvm/test/Transforms/LoadStoreVectorizer/NVPTX/gap-fill-invariant.ll (+83)
  • (added) llvm/test/Transforms/LoadStoreVectorizer/NVPTX/gap-fill-vectors.ll (+186)
  • (added) llvm/test/Transforms/LoadStoreVectorizer/NVPTX/gap-fill.ll (+194)
  • (added) llvm/test/Transforms/LoadStoreVectorizer/NVPTX/masked-store.ll (+541)
  • (modified) llvm/test/Transforms/LoadStoreVectorizer/NVPTX/vectorize_i8.ll (+1-2)
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 41ff54f0781a2..f8f134c833ea2 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -817,6 +817,12 @@ class TargetTransformInfo {
   LLVM_ABI bool isLegalMaskedLoad(Type *DataType, Align Alignment,
                                   unsigned AddressSpace) const;
 
+  /// Return true if it is legal to widen loads beyond their current width,
+  /// assuming the result is still well-aligned. For example, converting a load
+  /// i32 to a load i64, or vectorizing three contiguous load i32s into a load
+  /// <4 x i32>.
+  LLVM_ABI bool isLegalToWidenLoads() const;
+
   /// Return true if the target supports nontemporal store.
   LLVM_ABI bool isLegalNTStore(Type *DataType, Align Alignment) const;
   /// Return true if the target supports nontemporal load.
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 566e1cf51631a..55bd4bd709589 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -318,6 +318,8 @@ class TargetTransformInfoImplBase {
     return false;
   }
 
+  virtual bool isLegalToWidenLoads() const { return false; }
+
   virtual bool isLegalNTStore(Type *DataType, Align Alignment) const {
     // By default, assume nontemporal memory stores are available for stores
     // that are aligned and have a size that is a power of 2.
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index 09b50c5270e57..89cda79558057 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -476,6 +476,10 @@ bool TargetTransformInfo::isLegalMaskedLoad(Type *DataType, Align Alignment,
   return TTIImpl->isLegalMaskedLoad(DataType, Alignment, AddressSpace);
 }
 
+bool TargetTransformInfo::isLegalToWidenLoads() const {
+  return TTIImpl->isLegalToWidenLoads();
+}
+
 bool TargetTransformInfo::isLegalNTStore(Type *DataType,
                                          Align Alignment) const {
   return TTIImpl->isLegalNTStore(DataType, Alignment);
diff --git a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
index b32d931bd3074..d56cff1ce3695 100644
--- a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
+++ b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
@@ -72,6 +72,8 @@ class NVPTXTTIImpl final : public BasicTTIImplBase<NVPTXTTIImpl> {
     return isLegalToVectorizeLoadChain(ChainSizeInBytes, Alignment, AddrSpace);
   }
 
+  bool isLegalToWidenLoads() const override { return true; }
+
   // NVPTX has infinite registers of all kinds, but the actual machine doesn't.
   // We conservatively return 1 here which is just enough to enable the
   // vectorizers but disables heuristics based on the number of registers.
diff --git a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
index 7b5137b0185ab..04f4e92826a52 100644
--- a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
@@ -119,6 +119,29 @@ using namespace llvm;
 
 #define DEBUG_TYPE "load-store-vectorizer"
 
+cl::opt<bool>
+    ExtendLoads("vect-extend-loads", cl::Hidden,
+                cl::desc("Load more elements if the target VF is higher "
+                         "than the chain length."),
+                cl::init(true));
+
+cl::opt<bool> ExtendStores(
+    "vect-extend-stores", cl::Hidden,
+    cl::desc("Store more elements if the target VF is higher "
+             "than the chain length and we have access to masked stores."),
+    cl::init(true));
+
+cl::opt<bool> FillLoadGaps(
+    "vect-fill-load-gaps", cl::Hidden,
+    cl::desc("Should Loads be introduced in gaps to enable vectorization."),
+    cl::init(true));
+
+cl::opt<bool>
+    FillStoreGaps("vect-fill-store-gaps", cl::Hidden,
+                  cl::desc("Should Stores be introduced in gaps to enable "
+                           "vectorization into masked stores."),
+                  cl::init(true));
+
 STATISTIC(NumVectorInstructions, "Number of vector accesses generated");
 STATISTIC(NumScalarsVectorized, "Number of scalar accesses vectorized");
 
@@ -246,12 +269,16 @@ class Vectorizer {
   const DataLayout &DL;
   IRBuilder<> Builder;
 
-  // We could erase instrs right after vectorizing them, but that can mess up
-  // our BB iterators, and also can make the equivalence class keys point to
-  // freed memory.  This is fixable, but it's simpler just to wait until we're
-  // done with the BB and erase all at once.
+  /// We could erase instrs right after vectorizing them, but that can mess up
+  /// our BB iterators, and also can make the equivalence class keys point to
+  /// freed memory.  This is fixable, but it's simpler just to wait until we're
+  /// done with the BB and erase all at once.
   SmallVector<Instruction *, 128> ToErase;
 
+  /// We insert load/store instructions and GEPs to fill gaps and extend chains
+  /// to enable vectorization. Keep track and delete them later.
+  DenseSet<Instruction *> ExtraElements;
+
 public:
   Vectorizer(Function &F, AliasAnalysis &AA, AssumptionCache &AC,
              DominatorTree &DT, ScalarEvolution &SE, TargetTransformInfo &TTI)
@@ -344,6 +371,28 @@ class Vectorizer {
   /// Postcondition: For all i, ret[i][0].second == 0, because the first instr
   /// in the chain is the leader, and an instr touches distance 0 from itself.
   std::vector<Chain> gatherChains(ArrayRef<Instruction *> Instrs);
+
+  /// Is a load/store with this alignment allowed by TTI and at least as fast
+  /// as an unvectorized load/store.
+  bool accessIsAllowedAndFast(unsigned SizeBytes, unsigned AS, Align Alignment,
+                              unsigned VecElemBits) const;
+
+  /// Before attempting to fill gaps, check if the chain is a candidate for
+  /// a masked store, to save compile time if it is not possible for the address
+  /// space and element type.
+  bool shouldAttemptMaskedStore(const ArrayRef<ChainElem> C) const;
+
+  /// Create a new GEP and a new Load/Store instruction such that the GEP
+  /// is pointing at PrevElem + Offset. In the case of stores, store poison.
+  /// Extra elements will either be combined into a vector/masked store or
+  /// deleted before the end of the pass.
+  ChainElem createExtraElementAfter(const ChainElem &PrevElem, APInt Offset,
+                                    StringRef Prefix,
+                                    Align Alignment = Align(1));
+
+  /// Delete dead GEPs and extra Load/Store instructions created by
+  /// createExtraElementAfter
+  void deleteExtraElements();
 };
 
 class LoadStoreVectorizerLegacyPass : public FunctionPass {
@@ -457,12 +506,21 @@ bool Vectorizer::run() {
       Changed |= runOnPseudoBB(*It, *std::next(It));
 
     for (Instruction *I : ToErase) {
+      // These will get deleted in deleteExtraElements.
+      // This is because ExtraElements will include both extra elements
+      // that *were* vectorized and extra elements that *were not*
+      // vectorized. ToErase will only include extra elements that *were*
+      // vectorized, so in order to avoid double deletion we skip them here and
+      // handle them in deleteExtraElements.
+      if (ExtraElements.contains(I))
+        continue;
       auto *PtrOperand = getLoadStorePointerOperand(I);
       if (I->use_empty())
         I->eraseFromParent();
       RecursivelyDeleteTriviallyDeadInstructions(PtrOperand);
     }
     ToErase.clear();
+    deleteExtraElements();
   }
 
   return Changed;
@@ -623,6 +681,29 @@ std::vector<Chain> Vectorizer::splitChainByContiguity(Chain &C) {
     dumpChain(C);
   });
 
+  // If the chain is not contiguous, we try to fill the gap with "extra"
+  // elements to artificially make it contiguous, to try to enable
+  // vectorization.
+  // - Filling gaps in loads is always ok if the target supports widening loads.
+  // - For stores, we only fill gaps if there is a potentially legal masked
+  //   store for the target. If later on, we don't end up with a chain that
+  //   could be vectorized into a legal masked store, the chains with extra
+  //   elements will be filtered out in splitChainByAlignment.
+  bool TryFillGaps = isa<LoadInst>(C[0].Inst)
+                         ? (FillLoadGaps && TTI.isLegalToWidenLoads())
+                         : (FillStoreGaps && shouldAttemptMaskedStore(C));
+
+  unsigned ASPtrBits =
+      DL.getIndexSizeInBits(getLoadStoreAddressSpace(C[0].Inst));
+
+  // Compute the alignment of the leader of the chain (which every stored offset
+  // is based on) using the current first element of the chain. This is
+  // conservative, we may be able to derive better alignment by iterating over
+  // the chain and finding the leader.
+  Align LeaderOfChainAlign =
+      commonAlignment(getLoadStoreAlignment(C[0].Inst),
+                      C[0].OffsetFromLeader.abs().getLimitedValue());
+
   std::vector<Chain> Ret;
   Ret.push_back({C.front()});
 
@@ -633,7 +714,8 @@ std::vector<Chain> Vectorizer::splitChainByContiguity(Chain &C) {
     unsigned SzBits = DL.getTypeSizeInBits(getLoadStoreType(&*Prev.Inst));
     assert(SzBits % 8 == 0 && "Non-byte sizes should have been filtered out by "
                               "collectEquivalenceClass");
-    APInt PrevReadEnd = Prev.OffsetFromLeader + SzBits / 8;
+    APInt PrevSzBytes = APInt(ASPtrBits, SzBits / 8);
+    APInt PrevReadEnd = Prev.OffsetFromLeader + PrevSzBytes;
 
     // Add this instruction to the end of the current chain, or start a new one.
     bool AreContiguous = It->OffsetFromLeader == PrevReadEnd;
@@ -642,10 +724,54 @@ std::vector<Chain> Vectorizer::splitChainByContiguity(Chain &C) {
                       << *Prev.Inst << " (ends at offset " << PrevReadEnd
                       << ") -> " << *It->Inst << " (starts at offset "
                       << It->OffsetFromLeader << ")\n");
-    if (AreContiguous)
+
+    if (AreContiguous) {
       CurChain.push_back(*It);
-    else
-      Ret.push_back({*It});
+      continue;
+    }
+
+    // For now, we aren't filling gaps between load/stores of different sizes.
+    // Additionally, as a conservative heuristic, we only fill gaps of 1-2
+    // elements. Generating loads/stores with too many unused bytes has a side
+    // effect of increasing register pressure (on NVIDIA targets at least),
+    // which could cancel out the benefits of reducing number of load/stores.
+    if (TryFillGaps &&
+        SzBits == DL.getTypeSizeInBits(getLoadStoreType(It->Inst))) {
+      APInt OffsetOfGapStart = Prev.OffsetFromLeader + PrevSzBytes;
+      APInt GapSzBytes = It->OffsetFromLeader - OffsetOfGapStart;
+      if (GapSzBytes == PrevSzBytes) {
+        // There is a single-element gap between Prev and Curr; create one extra element.
+        ChainElem NewElem = createExtraElementAfter(
+            Prev, PrevSzBytes, "GapFill",
+            commonAlignment(LeaderOfChainAlign,
+                            OffsetOfGapStart.abs().getLimitedValue()));
+        CurChain.push_back(NewElem);
+        CurChain.push_back(*It);
+        continue;
+      }
+      // There is a two-element gap between Prev and Curr; only create two
+      // extra elements if Prev is the first element in a sequence of four.
+      // This has the highest chance of resulting in a beneficial vectorization.
+      if ((GapSzBytes == 2 * PrevSzBytes) && (CurChain.size() % 4 == 1)) {
+        ChainElem NewElem1 = createExtraElementAfter(
+            Prev, PrevSzBytes, "GapFill",
+            commonAlignment(LeaderOfChainAlign,
+                            OffsetOfGapStart.abs().getLimitedValue()));
+        ChainElem NewElem2 = createExtraElementAfter(
+            NewElem1, PrevSzBytes, "GapFill",
+            commonAlignment(
+                LeaderOfChainAlign,
+                (OffsetOfGapStart + PrevSzBytes).abs().getLimitedValue()));
+        CurChain.push_back(NewElem1);
+        CurChain.push_back(NewElem2);
+        CurChain.push_back(*It);
+        continue;
+      }
+    }
+
+    // The chain is not contiguous and cannot be made contiguous with gap
+    // filling, so we need to start a new chain.
+    Ret.push_back({*It});
   }
 
   // Filter out length-1 chains, these are uninteresting.
@@ -721,6 +847,14 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
   unsigned AS = getLoadStoreAddressSpace(C[0].Inst);
   unsigned VecRegBytes = TTI.getLoadStoreVecRegBitWidth(AS) / 8;
 
+  // For compile time reasons, we cache whether or not the superset
+  // of all candidate chains contains any extra stores from earlier gap
+  // filling.
+  bool CandidateChainsMayContainExtraStores =
+      !IsLoadChain && any_of(C, [this](const ChainElem &E) {
+        return ExtraElements.contains(E.Inst);
+      });
+
   std::vector<Chain> Ret;
   for (unsigned CBegin = 0; CBegin < C.size(); ++CBegin) {
     // Find candidate chains of size not greater than the largest vector reg.
@@ -769,41 +903,6 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
         continue;
       }
 
-      // Is a load/store with this alignment allowed by TTI and at least as fast
-      // as an unvectorized load/store?
-      //
-      // TTI and F are passed as explicit captures to WAR an MSVC misparse (??).
-      auto IsAllowedAndFast = [&, SizeBytes = SizeBytes, &TTI = TTI,
-                               &F = F](Align Alignment) {
-        if (Alignment.value() % SizeBytes == 0)
-          return true;
-        unsigned VectorizedSpeed = 0;
-        bool AllowsMisaligned = TTI.allowsMisalignedMemoryAccesses(
-            F.getContext(), SizeBytes * 8, AS, Alignment, &VectorizedSpeed);
-        if (!AllowsMisaligned) {
-          LLVM_DEBUG(dbgs()
-                     << "LSV: Access of " << SizeBytes << "B in addrspace "
-                     << AS << " with alignment " << Alignment.value()
-                     << " is misaligned, and therefore can't be vectorized.\n");
-          return false;
-        }
-
-        unsigned ElementwiseSpeed = 0;
-        (TTI).allowsMisalignedMemoryAccesses((F).getContext(), VecElemBits, AS,
-                                             Alignment, &ElementwiseSpeed);
-        if (VectorizedSpeed < ElementwiseSpeed) {
-          LLVM_DEBUG(dbgs()
-                     << "LSV: Access of " << SizeBytes << "B in addrspace "
-                     << AS << " with alignment " << Alignment.value()
-                     << " has relative speed " << VectorizedSpeed
-                     << ", which is lower than the elementwise speed of "
-                     << ElementwiseSpeed
-                     << ".  Therefore this access won't be vectorized.\n");
-          return false;
-        }
-        return true;
-      };
-
       // If we're loading/storing from an alloca, align it if possible.
       //
       // FIXME: We eagerly upgrade the alignment, regardless of whether TTI
@@ -818,8 +917,7 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
                             isa<AllocaInst>(PtrOperand->stripPointerCasts());
       Align Alignment = getLoadStoreAlignment(C[CBegin].Inst);
       Align PrefAlign = Align(StackAdjustedAlignment);
-      if (IsAllocaAccess && Alignment.value() % SizeBytes != 0 &&
-          IsAllowedAndFast(PrefAlign)) {
+      if (IsAllocaAccess && Alignment.value() % SizeBytes != 0) {
         Align NewAlign = getOrEnforceKnownAlignment(
             PtrOperand, PrefAlign, DL, C[CBegin].Inst, nullptr, &DT);
         if (NewAlign >= Alignment) {
@@ -831,7 +929,59 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
         }
       }
 
-      if (!IsAllowedAndFast(Alignment)) {
+      Chain ExtendingLoadsStores;
+      bool ExtendChain = IsLoadChain ? ExtendLoads : ExtendStores;
+      if (ExtendChain && NumVecElems < TargetVF && NumVecElems % 2 != 0 &&
+          VecElemBits >= 8) {
+        // TargetVF may be a lot higher than NumVecElems,
+        // so only extend to the next power of 2.
+        assert(VecElemBits % 8 == 0);
+        unsigned VecElemBytes = VecElemBits / 8;
+        unsigned NewNumVecElems = PowerOf2Ceil(NumVecElems);
+        unsigned NewSizeBytes = VecElemBytes * NewNumVecElems;
+
+        assert(NewNumVecElems <= TargetVF);
+
+        LLVM_DEBUG(dbgs() << "LSV: attempting to extend chain of "
+                          << NumVecElems << " "
+                          << (IsLoadChain ? "loads" : "stores") << " to "
+                          << NewNumVecElems << " elements\n");
+        // Do not artificially increase the chain if it becomes misaligned,
+        // otherwise we may unnecessarily split the chain when the target actually
+        // supports non-pow2 VF.
+        if (accessIsAllowedAndFast(NewSizeBytes, AS, Alignment, VecElemBits) &&
+            ((IsLoadChain ? TTI.isLegalToWidenLoads()
+                          : TTI.isLegalMaskedStore(
+                                FixedVectorType::get(VecElemTy, NewNumVecElems),
+                                Alignment, AS, /*IsMaskConstant=*/true)))) {
+          LLVM_DEBUG(dbgs()
+                     << "LSV: extending " << (IsLoadChain ? "load" : "store")
+                     << " chain of " << NumVecElems << " "
+                     << (IsLoadChain ? "loads" : "stores")
+                     << " with total byte size of " << SizeBytes << " to "
+                     << NewNumVecElems << " "
+                     << (IsLoadChain ? "loads" : "stores")
+                     << " with total byte size of " << NewSizeBytes
+                     << ", TargetVF=" << TargetVF << " \n");
+
+          unsigned ASPtrBits = DL.getIndexSizeInBits(AS);
+          ChainElem Prev = C[CEnd];
+          for (unsigned i = 0; i < (NewNumVecElems - NumVecElems); i++) {
+            ChainElem NewElem = createExtraElementAfter(
+                Prev, APInt(ASPtrBits, VecElemBytes), "Extend");
+            ExtendingLoadsStores.push_back(NewElem);
+            Prev = ExtendingLoadsStores.back();
+          }
+
+          // Update the size and number of elements for upcoming checks.
+          SizeBytes = NewSizeBytes;
+          NumVecElems = NewNumVecElems;
+        }
+      }
+
+      if (!accessIsAllowedAndFast(SizeBytes, AS, Alignment, VecElemBits)) {
         LLVM_DEBUG(
             dbgs() << "LSV: splitChainByAlignment discarding candidate chain "
                       "because its alignment is not AllowedAndFast: "
@@ -849,10 +999,41 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
         continue;
       }
 
+      if (CandidateChainsMayContainExtraStores) {
+        // The legality of adding extra stores to ExtendingLoadsStores has
+        // already been checked, but if the candidate chain contains extra
+        // stores from an earlier optimization, confirm legality now.
+        // This filter is essential because, when filling gaps in
+        // splitChainByContinuity, we queried the API to check that (for a given
+        // element type and address space) there *may* be a legal masked store
+        // we can try to create. Now, we need to check if the actual chain we
+        // ended up with is legal to turn into a masked store.
+        // This is relevant for NVPTX targets, for example, where a masked store
+        // is only legal if we have ended up with a 256-bit vector.
+        bool CandidateChainContainsExtraStores = llvm::any_of(
+            ArrayRef<ChainElem>(C).slice(CBegin, CEnd - CBegin + 1),
+            [this](const ChainElem &E) {
+              return ExtraElements.contains(E.Inst);
+            });
+
+        if (CandidateChainContainsExtraStores &&
+            !TTI.isLegalMaskedStore(
+                FixedVectorType::get(VecElemTy, NumVecElems), Alignment, AS,
+                /*IsMaskConstant=*/true)) {
+          LLVM_DEBUG(dbgs()
+                     << "LSV: splitChainByAlignment discarding candidate chain "
+                        "because it contains extra stores that we cannot "
+                        "legally vectorize into a masked store \n");
+          continue;
+        }
+      }
+
       // Hooray, we can vect...
[truncated]
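
For anyone experimenting with this locally, the hidden cl::opt flags added above can be toggled when running the pass standalone through opt; one possible invocation (the input file name is just a placeholder) is:

  opt -S -passes=load-store-vectorizer \
      -vect-fill-load-gaps=false -vect-fill-store-gaps=false \
      -vect-extend-loads=false -vect-extend-stores=false \
      input.ll -o -

With all four flags set to false, the new gap-filling and chain-extending behavior is disabled.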

@llvmbot
Member

llvmbot commented Sep 17, 2025

@llvm/pr-subscribers-backend-nvptx


@llvmbot
Member

llvmbot commented Sep 17, 2025

@llvm/pr-subscribers-llvm-analysis

+
   std::vector<Chain> Ret;
   Ret.push_back({C.front()});
 
@@ -633,7 +714,8 @@ std::vector<Chain> Vectorizer::splitChainByContiguity(Chain &C) {
     unsigned SzBits = DL.getTypeSizeInBits(getLoadStoreType(&*Prev.Inst));
     assert(SzBits % 8 == 0 && "Non-byte sizes should have been filtered out by "
                               "collectEquivalenceClass");
-    APInt PrevReadEnd = Prev.OffsetFromLeader + SzBits / 8;
+    APInt PrevSzBytes = APInt(ASPtrBits, SzBits / 8);
+    APInt PrevReadEnd = Prev.OffsetFromLeader + PrevSzBytes;
 
     // Add this instruction to the end of the current chain, or start a new one.
     bool AreContiguous = It->OffsetFromLeader == PrevReadEnd;
@@ -642,10 +724,54 @@ std::vector<Chain> Vectorizer::splitChainByContiguity(Chain &C) {
                       << *Prev.Inst << " (ends at offset " << PrevReadEnd
                       << ") -> " << *It->Inst << " (starts at offset "
                       << It->OffsetFromLeader << ")\n");
-    if (AreContiguous)
+
+    if (AreContiguous) {
       CurChain.push_back(*It);
-    else
-      Ret.push_back({*It});
+      continue;
+    }
+
+    // For now, we aren't filling gaps between load/stores of different sizes.
+    // Additionally, as a conservative heuristic, we only fill gaps of 1-2
+    // elements. Generating loads/stores with too many unused bytes has a side
+    // effect of increasing register pressure (on NVIDIA targets at least),
+    // which could cancel out the benefits of reducing number of load/stores.
+    if (TryFillGaps &&
+        SzBits == DL.getTypeSizeInBits(getLoadStoreType(It->Inst))) {
+      APInt OffsetOfGapStart = Prev.OffsetFromLeader + PrevSzBytes;
+      APInt GapSzBytes = It->OffsetFromLeader - OffsetOfGapStart;
+      if (GapSzBytes == PrevSzBytes) {
+        // There is a single gap between Prev and Curr, create one extra element
+        ChainElem NewElem = createExtraElementAfter(
+            Prev, PrevSzBytes, "GapFill",
+            commonAlignment(LeaderOfChainAlign,
+                            OffsetOfGapStart.abs().getLimitedValue()));
+        CurChain.push_back(NewElem);
+        CurChain.push_back(*It);
+        continue;
+      }
+      // There are two gaps between Prev and Curr, only create two extra
+      // elements if Prev is the first element in a sequence of four.
+      // This has the highest chance of resulting in a beneficial vectorization.
+      if ((GapSzBytes == 2 * PrevSzBytes) && (CurChain.size() % 4 == 1)) {
+        ChainElem NewElem1 = createExtraElementAfter(
+            Prev, PrevSzBytes, "GapFill",
+            commonAlignment(LeaderOfChainAlign,
+                            OffsetOfGapStart.abs().getLimitedValue()));
+        ChainElem NewElem2 = createExtraElementAfter(
+            NewElem1, PrevSzBytes, "GapFill",
+            commonAlignment(
+                LeaderOfChainAlign,
+                (OffsetOfGapStart + PrevSzBytes).abs().getLimitedValue()));
+        CurChain.push_back(NewElem1);
+        CurChain.push_back(NewElem2);
+        CurChain.push_back(*It);
+        continue;
+      }
+    }
+
+    // The chain is not contiguous and cannot be made contiguous with gap
+    // filling, so we need to start a new chain.
+    Ret.push_back({*It});
   }
 
   // Filter out length-1 chains, these are uninteresting.
@@ -721,6 +847,14 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
   unsigned AS = getLoadStoreAddressSpace(C[0].Inst);
   unsigned VecRegBytes = TTI.getLoadStoreVecRegBitWidth(AS) / 8;
 
+  // For compile time reasons, we cache whether or not the superset
+  // of all candidate chains contains any extra stores from earlier gap
+  // filling.
+  bool CandidateChainsMayContainExtraStores =
+      !IsLoadChain && any_of(C, [this](const ChainElem &E) {
+        return ExtraElements.contains(E.Inst);
+      });
+
   std::vector<Chain> Ret;
   for (unsigned CBegin = 0; CBegin < C.size(); ++CBegin) {
     // Find candidate chains of size not greater than the largest vector reg.
@@ -769,41 +903,6 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
         continue;
       }
 
-      // Is a load/store with this alignment allowed by TTI and at least as fast
-      // as an unvectorized load/store?
-      //
-      // TTI and F are passed as explicit captures to WAR an MSVC misparse (??).
-      auto IsAllowedAndFast = [&, SizeBytes = SizeBytes, &TTI = TTI,
-                               &F = F](Align Alignment) {
-        if (Alignment.value() % SizeBytes == 0)
-          return true;
-        unsigned VectorizedSpeed = 0;
-        bool AllowsMisaligned = TTI.allowsMisalignedMemoryAccesses(
-            F.getContext(), SizeBytes * 8, AS, Alignment, &VectorizedSpeed);
-        if (!AllowsMisaligned) {
-          LLVM_DEBUG(dbgs()
-                     << "LSV: Access of " << SizeBytes << "B in addrspace "
-                     << AS << " with alignment " << Alignment.value()
-                     << " is misaligned, and therefore can't be vectorized.\n");
-          return false;
-        }
-
-        unsigned ElementwiseSpeed = 0;
-        (TTI).allowsMisalignedMemoryAccesses((F).getContext(), VecElemBits, AS,
-                                             Alignment, &ElementwiseSpeed);
-        if (VectorizedSpeed < ElementwiseSpeed) {
-          LLVM_DEBUG(dbgs()
-                     << "LSV: Access of " << SizeBytes << "B in addrspace "
-                     << AS << " with alignment " << Alignment.value()
-                     << " has relative speed " << VectorizedSpeed
-                     << ", which is lower than the elementwise speed of "
-                     << ElementwiseSpeed
-                     << ".  Therefore this access won't be vectorized.\n");
-          return false;
-        }
-        return true;
-      };
-
       // If we're loading/storing from an alloca, align it if possible.
       //
       // FIXME: We eagerly upgrade the alignment, regardless of whether TTI
@@ -818,8 +917,7 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
                             isa<AllocaInst>(PtrOperand->stripPointerCasts());
       Align Alignment = getLoadStoreAlignment(C[CBegin].Inst);
       Align PrefAlign = Align(StackAdjustedAlignment);
-      if (IsAllocaAccess && Alignment.value() % SizeBytes != 0 &&
-          IsAllowedAndFast(PrefAlign)) {
+      if (IsAllocaAccess && Alignment.value() % SizeBytes != 0) {
         Align NewAlign = getOrEnforceKnownAlignment(
             PtrOperand, PrefAlign, DL, C[CBegin].Inst, nullptr, &DT);
         if (NewAlign >= Alignment) {
@@ -831,7 +929,59 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
         }
       }
 
-      if (!IsAllowedAndFast(Alignment)) {
+      Chain ExtendingLoadsStores;
+      bool ExtendChain = IsLoadChain
+                             ? ExtendLoads
+                             : ExtendStores;
+      if (ExtendChain && NumVecElems < TargetVF && NumVecElems % 2 != 0 &&
+          VecElemBits >= 8) {
+        // TargetVF may be a lot higher than NumVecElems,
+        // so only extend to the next power of 2.
+        assert(VecElemBits % 8 == 0);
+        unsigned VecElemBytes = VecElemBits / 8;
+        unsigned NewNumVecElems = PowerOf2Ceil(NumVecElems);
+        unsigned NewSizeBytes = VecElemBytes * NewNumVecElems;
+
+        assert(NewNumVecElems <= TargetVF);
+
+        LLVM_DEBUG(dbgs() << "LSV: attempting to extend chain of "
+                          << NumVecElems << " "
+                          << (IsLoadChain ? "loads" : "stores") << " to "
+                          << NewNumVecElems << " elements\n");
+        // Do not artificially increase the chain if it becomes misaligned,
+        // otherwise we may unnecessary split the chain when the target actually
+        // supports non-pow2 VF.
+        if (accessIsAllowedAndFast(NewSizeBytes, AS, Alignment, VecElemBits) &&
+            ((IsLoadChain ? TTI.isLegalToWidenLoads()
+                          : TTI.isLegalMaskedStore(
+                                FixedVectorType::get(VecElemTy, NewNumVecElems),
+                                Alignment, AS, /*IsMaskConstant=*/true)))) {
+          LLVM_DEBUG(dbgs()
+                     << "LSV: extending " << (IsLoadChain ? "load" : "store")
+                     << " chain of " << NumVecElems << " "
+                     << (IsLoadChain ? "loads" : "stores")
+                     << " with total byte size of " << SizeBytes << " to "
+                     << NewNumVecElems << " "
+                     << (IsLoadChain ? "loads" : "stores")
+                     << " with total byte size of " << NewSizeBytes
+                     << ", TargetVF=" << TargetVF << " \n");
+
+          unsigned ASPtrBits = DL.getIndexSizeInBits(AS);
+          ChainElem Prev = C[CEnd];
+          for (unsigned i = 0; i < (NewNumVecElems - NumVecElems); i++) {
+            ChainElem NewElem = createExtraElementAfter(
+                Prev, APInt(ASPtrBits, VecElemBytes), "Extend");
+            ExtendingLoadsStores.push_back(NewElem);
+            Prev = ExtendingLoadsStores.back();
+          }
+
+          // Update the size and number of elements for upcoming checks.
+          SizeBytes = NewSizeBytes;
+          NumVecElems = NewNumVecElems;
+        }
+      }
+
+      if (!accessIsAllowedAndFast(SizeBytes, AS, Alignment, VecElemBits)) {
         LLVM_DEBUG(
             dbgs() << "LSV: splitChainByAlignment discarding candidate chain "
                       "because its alignment is not AllowedAndFast: "
@@ -849,10 +999,41 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
         continue;
       }
 
+      if (CandidateChainsMayContainExtraStores) {
+        // The legality of adding extra stores to ExtendingLoadsStores has
+        // already been checked, but if the candidate chain contains extra
+        // stores from an earlier optimization, confirm legality now.
+        // This filter is essential because, when filling gaps in
+        // splitChainByContinuity, we queried the API to check that (for a given
+        // element type and address space) there *may* be a legal masked store
+        // we can try to create. Now, we need to check if the actual chain we
+        // ended up with is legal to turn into a masked store.
+        // This is relevant for NVPTX targets, for example, where a masked store
+        // is only legal if we have ended up with a 256-bit vector.
+        bool CandidateChainContainsExtraStores = llvm::any_of(
+            ArrayRef<ChainElem>(C).slice(CBegin, CEnd - CBegin + 1),
+            [this](const ChainElem &E) {
+              return ExtraElements.contains(E.Inst);
+            });
+
+        if (CandidateChainContainsExtraStores &&
+            !TTI.isLegalMaskedStore(
+                FixedVectorType::get(VecElemTy, NumVecElems), Alignment, AS,
+                /*IsMaskConstant=*/true)) {
+          LLVM_DEBUG(dbgs()
+                     << "LSV: splitChainByAlignment discarding candidate chain "
+                        "because it contains extra stores that we cannot "
+                        "legally vectorize into a masked store \n");
+          continue;
+        }
+      }
+
       // Hooray, we can vect...
[truncated]
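To make the new knobs concrete, here is a minimal sketch of an input the pass could act on and how the flags might be toggled from `opt`. The flag names come from the `cl::opt` definitions above and `load-store-vectorizer` is the standard new-PM pass name; the triple, alignments, and the expectation that this particular chain gets gap-filled and extended are illustrative assumptions, not output from the patch.

```
; RUN: opt -passes=load-store-vectorizer -vect-fill-load-gaps=true \
; RUN:     -vect-extend-loads=true -S %s
target triple = "nvptx64-nvidia-cuda"

; Two i32 loads at offsets 0 and 8 leave a one-element gap at offset 4.
; Gap filling can make the chain contiguous, and extension can round it up
; to a power-of-2 element count so it may be emitted as a single (possibly
; masked) vector load.
define i32 @gap_candidate(ptr addrspace(1) %p) {
  %a = load i32, ptr addrspace(1) %p, align 16
  %q = getelementptr inbounds i8, ptr addrspace(1) %p, i64 8
  %b = load i32, ptr addrspace(1) %q, align 8
  %r = add i32 %a, %b
  ret i32 %r
}
```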

@github-actions

github-actions bot commented Sep 17, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@arsenm arsenm requested a review from gandhi56 September 18, 2025 02:21
@nikic
Contributor

nikic commented Sep 25, 2025

For stores, which unlike loads, cannot be filled/extended without consequence, we only perform the optimization when we can generate a legal llvm masked store intrinsic, masking off the additional elements.

I'm a bit concerned about the direction of this change wrt the assumption that loads are always safe. Do you need this to work in cases where the memory is not known dereferenceable?

The problem here is that even if this may be valid from a hardware perspective, it is not necessarily valid in LLVM IR. To give an obvious example, let's say you have a noalias store to just the "gap" element. Then introducing a load from that location will introduce UB, even if the result is ultimately "unused".

This also intersects with the larger question of whether allocations can have gaps (as in, munmaped regions). This change is not necessarily incompatible with that because in practice gaps would be at page granularity, but it does constrain the design space.
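As a concrete illustration of that concern (a sketch with hypothetical names, not code from the patch): if the caller also hands out a `noalias` pointer to the gap element, widening the two scalar loads into one plain vector load makes the function read the gap bytes through `%p`, conflicting with the `noalias` store even though that lane is never used.

```
; %p has live i32 elements at offsets 0 and 8; offset 4 is the "gap".
; Suppose the caller also passes %p+4 as the noalias pointer %gap.
define i32 @gap_hazard(ptr %p, ptr noalias %gap) {
  store i32 0, ptr %gap, align 4       ; a store to just the gap element
  %a = load i32, ptr %p, align 16
  %q = getelementptr inbounds i8, ptr %p, i64 8
  %b = load i32, ptr %q, align 4
  ; Replacing %a/%b with a plain `load <3 x i32>, ptr %p` would also read
  ; offset 4 through %p, which conflicts with the noalias store above.
  %r = add i32 %a, %b
  ret i32 %r
}
```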

@gonzalobg
Contributor

The problem here is that even if this may be valid from a hardware perspective, it is not necessarily valid in LLVM IR.

Would widening into a masked load that masks the unused elements (so that they are not loaded) be legal in LLVM IR?
That'd still allow the backend to issue a fully covered load if the backend knows that this is ok for the particular platform.

@nikic
Contributor

nikic commented Sep 25, 2025

The problem here is that even if this may be valid from a hardware perspective, it is not necessarily valid in LLVM IR.

Would widening into a masked load that masks the unused elements (so that they are not loaded) be legal in LLVM IR? That'd still allow the backend to issue a fully covered load if the backend knows that this is ok for the particular platform.

Yeah, a masked load is definitely fine.

@dakersnar
Contributor Author

Yeah, a masked load is definitely fine.

I prototyped an approach that uses this instead, and it did work. Only real downsides are in implementation difficulty: it places a higher burden on the backend to efficiently lower masked loads, requires other passes to keep masked loads in mind when optimizing loads, etc. But if we are confident that the current proposed approach is functionally incorrect, then I can and will pivot to that approach instead.

Two small details I want to get thoughts on before I proceed:

  • the masked load intrinsic includes this "pass through" vector as a parameter, which contains values to return in the positions of the masked-off elements. Can/should I fill that input vector with poison in this case, since those values should never be used?
  • I will need to adjust some of the generic SelectionDAG backend files to properly handle masked loads with `MD_invariant_load` metadata. Currently that metadata gets dropped when the SelectionDAG is built. I assume this would be an acceptable change?

@nikic
Contributor

nikic commented Sep 26, 2025

I prototyped an approach that uses this instead, and it did work. Only real downsides are in implementation difficulty: it places a higher burden on the backend to efficiently lower masked loads, requires other passes to keep masked loads in mind when optimizing loads, etc. But if we are confident that the current proposed approach is functionally incorrect, then I can and will pivot to that approach instead.

I wouldn't say I'm confident that we can't make it work in this form, but it's the kind of thing that gets bogged down in extended IR design discussions. If the masked load variant isn't too hard to get working, then it's probably more expedient to just go with that.

Two small details I want to get thoughts on before I proceed:

  • the masked load intrinsic includes this "pass through" vector as a parameter, which contains values to return in the positions of the masked-off elements. Can/should I fill that input vector with poison in this case, since those values should never be used?

Yes, using poison passthru should be fine.

  • I will need to adjust some of the generic SelectionDAG backend files to properly handle masked loads with `MD_invariant_load` metadata. Currently that metadata gets dropped when the SelectionDAG is built. I assume this would be an acceptable change?

Without knowing the details: Yes, that sounds fine to me.
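Putting those answers together, the gap-filled form would look roughly like the sketch below: the gap and extension lanes are masked off, the pass-through operand is poison, and only the live lanes are extracted. Lane counts and names are illustrative and not taken from the patch.

```
declare <4 x i32> @llvm.masked.load.v4i32.p0(ptr, i32, <4 x i1>, <4 x i32>)

; i32 loads at offsets 0 and 8 become one masked <4 x i32> load: lane 1 is
; the gap and lane 3 is the power-of-2 extension, so both are masked off
; and the pass-through is poison because those lanes are never used.
define i32 @masked_gap_fill(ptr %p) {
  %v = call <4 x i32> @llvm.masked.load.v4i32.p0(
           ptr %p, i32 16,
           <4 x i1> <i1 true, i1 false, i1 true, i1 false>,
           <4 x i32> poison)
  %a = extractelement <4 x i32> %v, i64 0
  %b = extractelement <4 x i32> %v, i64 2
  %r = add i32 %a, %b
  ret i32 %r
}
```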

@dakersnar
Contributor Author

I assume that if we use masked loads, we no longer need the new isLegalToWidenLoads API, and should determine legality in the same way that masked store legality is currently determined?

@dakersnar
Contributor Author

I'm working on the masked load implementation; it might take a while. I'll re-ping these two PRs once I'm ready for another review. Thanks for the feedback so far, folks.

@dakersnar
Contributor Author

I've updated this PR to generate masked loads as discussed, and it can be reviewed now. In the next day or two, I will update the NVPTX change that this is dependent on (#159387) to handle lowering of the masked load intrinsics. At that point they will both be ready for review. I'm hoping to merge them at the same time, back to back, once review iteration is done.

@dakersnar
Contributor Author

Both PRs are now completed + updated with recent LLVM changes and ready for full review.

@dakersnar dakersnar force-pushed the github/dkersnar/lsv-gap-fill branch from 47cf25a to bba61b5 Compare October 24, 2025 17:04
@dakersnar
Contributor Author

dakersnar commented Nov 21, 2025

Added a test that combines redundant loads with gap filling and chain extending. I'm also going to temporarily push a cherry pick of #159387 into this branch to ensure the pipeline is green, and then I'll rebase it away to make reviewing easier again. Just looking for showstoppers. If anyone happens to review this change while that is happening, ignore the changes from that commit.

Edit: These are the commits to review: https://github.com/llvm/llvm-project/pull/159388/files/551f136fe621027307a66954d1974cdd80432271, with ce4b7e0 just being here temporarily to check the pipeline.

Edit: checks are green, removed the cherry pick.

@dakersnar dakersnar requested a review from cmc-rep November 21, 2025 23:48
@dakersnar dakersnar force-pushed the github/dkersnar/lsv-gap-fill branch from ce4b7e0 to 551f136 Compare November 22, 2025 02:20
@dakersnar
Contributor Author

Checks are green.

@cmc-rep, @arsenm, and/or @Artem-B, ping for review please. Thank you :).

Contributor

@cmc-rep cmc-rep left a comment

The latest merge looks good to me

Member

@Artem-B Artem-B left a comment

LGTM overall with a test nit.

@cmc-rep
Contributor

cmc-rep commented Nov 24, 2025

Unfortunately, my change has been reverted again. It looks like this PR will need to be re-merged with the latest main.

@dakersnar
Contributor Author

@cmc-rep Do you know the reason yet? Would it make sense for me to merge my change first before relanding yours?

@dakersnar
Contributor Author

In the meantime, I'm going to merge the backend changes (#159387), now that this change has gotten approval.

@cmc-rep
Contributor

cmc-rep commented Nov 24, 2025

It causes some assertion in llvm when compiling some AMD library.
Please move ahead with your PR. Now that I see how you did the merge, it won't be very difficult for me to merge it in the other direction.

@Artem-B
Member

Artem-B commented Nov 24, 2025

It causes some assertion in llvm when compiling some AMD library.

Is there a reproducer? Stack trace pointing to the specific assertion? Anything else that we can use to figure out what went wrong?

dakersnar added a commit that referenced this pull request Nov 25, 2025
This backend support will allow the LoadStoreVectorizer, in certain
cases, to fill in gaps when creating load/store vectors and generate
LLVM masked load/stores
(https://llvm.org/docs/LangRef.html#llvm-masked-store-intrinsics). To
accomplish this, changes are separated into two parts. This first part
has the backend lowering and TTI changes, and a follow up PR will have
the LSV generate these intrinsics:
#159388.

In this backend change, Masked Loads get lowered to PTX with `#pragma
"used_bytes_mask" [mask];`
(https://docs.nvidia.com/cuda/parallel-thread-execution/#pragma-strings-used-bytes-mask).
And Masked Stores get lowered to PTX using the new sink symbol syntax
(https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-st).

# TTI Changes
TTI changes are needed because NVPTX only supports masked loads/stores
with _constant_ masks. `ScalarizeMaskedMemIntrin.cpp` is adjusted to
check that the mask is constant and pass that result into the TTI check.
Behavior shouldn't change for non-NVPTX targets, which do not care
whether the mask is variable or constant when determining legality, but
all TTI files that implement these API need to be updated.

# Masked store lowering implementation details
If the masked stores make it to the NVPTX backend without being
scalarized, they are handled by the following:
* `NVPTXISelLowering.cpp` - Sets up a custom operation action and
handles it in lowerMSTORE. Similar handling to normal store vectors,
except we read the mask and place a sentinel register `$noreg` in each
position where the mask reads as false.

For example, 
```
t10: v8i1 = BUILD_VECTOR Constant:i1<-1>, Constant:i1<0>, Constant:i1<0>, Constant:i1<-1>, Constant:i1<-1>, Constant:i1<0>, Constant:i1<0>, Constant:i1<-1>
t11: ch = masked_store<(store unknown-size into %ir.lsr.iv28, align 32, addrspace 1)> t5:1, t5, t7, undef:i64, t10

->

STV_i32_v8 killed %13:int32regs, $noreg, $noreg, killed %16:int32regs, killed %17:int32regs, $noreg, $noreg, killed %20:int32regs, 0, 0, 1, 8, 0, 32, %4:int64regs, 0, debug-location !18 :: (store unknown-size into %ir.lsr.iv28, align 32, addrspace 1);

```

* `NVPTXInstInfo.td` - changes the definition of store vectors to allow
for a mix of sink symbols and registers.
* `NVPXInstPrinter.h/.cpp` - Handles the `$noreg` case by printing "_".

# Masked load lowering implementation details
Masked loads are routed to normal PTX loads, with one difference: a
`#pragma "used_bytes_mask"` is emitted before the load instruction
(https://docs.nvidia.com/cuda/parallel-thread-execution/#pragma-strings-used-bytes-mask).
To accomplish this, a new operand is added to every NVPTXISD Load type
representing this mask.
* `NVPTXISelLowering.h/.cpp` - Masked loads are converted into normal
NVPTXISD loads with a mask operand in two ways. 1) In type legalization
through replaceLoadVector, which is the normal path, and 2) through
LowerMLOAD, to handle the legal vector types
(v2f16/v2bf16/v2i16/v4i8/v2f32) that will not be type legalized. Both
share the same convertMLOADToLoadWithUsedBytesMask helper. Both default
this operand to UINT32_MAX, representing all bytes on. For the latter,
we need a new `NVPTXISD::MLoadV1` type to represent that edge case
because we cannot put the used bytes mask operand on a generic
LoadSDNode.
* `NVPTXISelDAGToDAG.cpp` - Extract used bytes mask from loads, add them
to created machine instructions.
* `NVPTXInstPrinter.h/.cpp` - Print the pragma when the used bytes mask
isn't all ones.
* `NVPTXForwardParams.cpp`, `NVPTXReplaceImageHandles.cpp` - Update
manual indexing of load operands to account for new operand.
* `NVPTXInsrtInfo.td`, `NVPTXIntrinsics.td` - Add the used bytes mask to
the MI definitions.
* `NVPTXTagInvariantLoads.cpp` - Ensure that masked loads also get
tagged as invariant.

Some generic changes that are needed:
* `LegalizeVectorTypes.cpp` - Ensure flags are preserved when splitting
masked loads.
* `SelectionDAGBuilder.cpp` - Preserve `MD_invariant_load` on masked
load SDNode creation
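At the IR level, the input that produces a DAG like the one above would be an `llvm.masked.store` with a constant mask, roughly as in the sketch below (reconstructed from the BUILD_VECTOR mask; the function name and types are illustrative).

```
declare void @llvm.masked.store.v8i32.p1(<8 x i32>, ptr addrspace(1), i32, <8 x i1>)

; Lanes 1, 2, 5 and 6 are masked off; after lowering they surface as the
; "_" (sink) operands of the store, matching the $noreg operands in the
; STV_i32_v8 example above.
define void @store_with_gaps(<8 x i32> %v, ptr addrspace(1) %p) {
  call void @llvm.masked.store.v8i32.p1(
          <8 x i32> %v, ptr addrspace(1) %p, i32 32,
          <8 x i1> <i1 true, i1 false, i1 false, i1 true,
                    i1 true, i1 false, i1 false, i1 true>)
  ret void
}
```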
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Nov 25, 2025
…#159387)

@cmc-rep
Contributor

cmc-rep commented Nov 26, 2025

It causes some assertion in llvm when compiling some AMD library.

Is there a reproducer? Stack trace pointing to the specific assertion? Anything else that we can use to figure out what went wrong?

This is the case causing problems for my change:
%464 = load <1 x i32>, ptr addrspace(4) %coefficients_.i.i1034, align 4
%394 = load i32, ptr addrspace(4) %coefficients_.i.i1034, align 4
The original chain element is <1 x i32> and the new load is i32, but it still tries to create a shuffle.

So it looks like my change was only reverted in the AMD internal branch, not here. I will add the fix here quickly.

#169671

@dakersnar
Contributor Author

cc @gandhi56 for visibility

augusto2112 pushed a commit to augusto2112/llvm-project that referenced this pull request Dec 3, 2025
@dakersnar dakersnar force-pushed the github/dkersnar/lsv-gap-fill branch from 7cfc98c to 97e8a10 Compare December 3, 2025 03:36
; CHECK-NEXT: store <3 x half> [[S4]], ptr [[IN4]], align 4
; CHECK-NEXT: ret void
;
%load1 = load <3 x half>, ptr %rd0, align 16
Contributor Author

I updated this test to match what we expect to see after InferAlignment runs and puts an alignment of 16 onto this `load <3 x half>`. That is more representative of an input the vectorizer should expect. With that alignment plus the masked load change, the test now extends the two `load <3 x half>` instructions into a single `<8 x half>` masked load.
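The shape of that transformation is roughly the sketch below (value names are illustrative and are not the test's actual CHECK lines):

```
declare <8 x half> @llvm.masked.load.v8f16.p0(ptr, i32, <8 x i1>, <8 x half>)

; Two contiguous <3 x half> loads (12 bytes) are extended to one <8 x half>
; masked load (16 bytes) with the last two lanes masked off; the original
; values are recovered from lanes 0-2 and 3-5.
define <3 x half> @extend_sketch(ptr %rd0) {
  %wide = call <8 x half> @llvm.masked.load.v8f16.p0(
              ptr %rd0, i32 16,
              <8 x i1> <i1 true, i1 true, i1 true, i1 true,
                        i1 true, i1 true, i1 false, i1 false>,
              <8 x half> poison)
  %v0 = shufflevector <8 x half> %wide, <8 x half> poison, <3 x i32> <i32 0, i32 1, i32 2>
  %v1 = shufflevector <8 x half> %wide, <8 x half> poison, <3 x i32> <i32 3, i32 4, i32 5>
  %sum = fadd <3 x half> %v0, %v1
  ret <3 x half> %sum
}
```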

@dakersnar dakersnar force-pushed the github/dkersnar/lsv-gap-fill branch from 42a1b3f to eb6df17 Compare December 3, 2025 18:02
@dakersnar dakersnar force-pushed the github/dkersnar/lsv-gap-fill branch from a0968d5 to 37cef5b Compare December 3, 2025 23:20
@dakersnar
Contributor Author

Likely merging this on Monday, assuming green pipelines.
