
Conversation

@gandhi56
Contributor

@gandhi56 gandhi56 commented Aug 18, 2025

This change enables the LoadStoreVectorizer to merge and vectorize contiguous chains even when their scalar element types differ, as long as the total bitwidth matches. To do so, we rebase offsets between chains, normalize value types to a common integer type, and insert the necessary casts around loads and stores. This uncovers more vectorization opportunities, which accounts for the expected codegen updates across the AMDGPU tests.

Key changes:

  • Chain merging

    • Build contiguous subchains and then merge adjacent ones when:
      • They refer to the same underlying pointer object and address space.
      • They are either all loads or all stores.
      • A constant leader-to-leader delta exists.
      • Rebasing one chain into the other's coordinate space does not create overlapping accesses.
      • All elements have equal total bit width.
    • Rebase the second chain by the computed delta and append it to the first.
  • Type normalization and casting

    • Normalize merged chains to a common integer type sized to the total bits (see the IR sketch after this list).
    • For loads: create a new load of the normalized type, copy metadata, and cast back to the original type for uses if needed.
    • For stores: bitcast the value to the normalized type and store that.
    • Insert zext/trunc for integer size changes; use bit-or-pointer casts when sizes match.
  • Cleanups

    • Erase replaced instructions and DCE pointer operands when safe.
    • New helpers: computeLeaderDelta, chainsOverlapAfterRebase, rebaseChain, normalizeChainToType, and allElemsMatchTotalBits.
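
As a rough illustration (the function name, offsets, and exact output shape below are assumptions, not taken from the patch; the merge-vectors.ll test carries the authoritative CHECK lines), this is the kind of chain pair the change targets and the approximate result:

; Two contiguous 32-bit accesses with different scalar types (i32 vs.
; <2 x i16>); previously they land in different equivalence classes and
; are not vectorized together.
define void @mixed_scalar_loads(ptr addrspace(1) %p) {
  %gep.1 = getelementptr inbounds i32, ptr addrspace(1) %p, i64 1
  %ld.0 = load i32, ptr addrspace(1) %p, align 8
  %ld.1 = load <2 x i16>, ptr addrspace(1) %gep.1, align 4
  ret void
}
; After merging and normalizing to i32, the expected shape is roughly:
;   %v    = load <2 x i32>, ptr addrspace(1) %p, align 8
;   %ld.0 = extractelement <2 x i32> %v, i32 0
;   %e.1  = extractelement <2 x i32> %v, i32 1
;   %ld.1 = bitcast i32 %e.1 to <2 x i16>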

Impact:

  • Increases vectorization opportunities across mixed-typed but size-compatible access chains.
  • Large set of expected AMDGPU codegen diffs due to more/changed vectorization.

This PR resolves #97715.

@llvmbot
Member

llvmbot commented Aug 18, 2025

@llvm/pr-subscribers-llvm-globalisel
@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-vectorizers

@llvm/pr-subscribers-llvm-transforms

Author: Anshil Gandhi (gandhi56)

Changes


Patch is 772.03 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/154069.diff

65 Files Affected:

  • (modified) llvm/include/llvm/Transforms/Utils/Local.h (+4)
  • (modified) llvm/lib/Transforms/Utils/Local.cpp (+45)
  • (modified) llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp (+221-62)
  • (modified) llvm/test/CodeGen/AMDGPU/32-bit-local-address-space.ll (+8-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/amdgpu-irtranslator.ll (+3)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll (+144-192)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.div.fmas.ll (+36-49)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.div.scale.ll (+8-12)
  • (modified) llvm/test/CodeGen/AMDGPU/add_i1.ll (+68-10)
  • (modified) llvm/test/CodeGen/AMDGPU/add_i64.ll (+105-28)
  • (modified) llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll (+121-121)
  • (modified) llvm/test/CodeGen/AMDGPU/alloca.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn-ieee.ll (+364-84)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn-load-offset-from-reg.ll (+5)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll (+760-727)
  • (modified) llvm/test/CodeGen/AMDGPU/and.ll (+22-35)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_cmp_swap_local.ll (+175-68)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-relaxation.ll (+22-25)
  • (modified) llvm/test/CodeGen/AMDGPU/build_vector.ll (+7-7)
  • (modified) llvm/test/CodeGen/AMDGPU/extract_vector_elt-i16.ll (+19-17)
  • (modified) llvm/test/CodeGen/AMDGPU/fabs.bf16.ll (+27-76)
  • (modified) llvm/test/CodeGen/AMDGPU/fabs.f16.ll (+14-14)
  • (modified) llvm/test/CodeGen/AMDGPU/fabs.ll (+7-7)
  • (modified) llvm/test/CodeGen/AMDGPU/fcopysign.f32.ll (+53-51)
  • (modified) llvm/test/CodeGen/AMDGPU/fdiv.ll (+171-162)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics.ll (+602-706)
  • (modified) llvm/test/CodeGen/AMDGPU/fnearbyint.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-combines.new.ll (+26-28)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-fabs.bf16.ll (+54-54)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-fabs.f16.ll (+7-7)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-fabs.ll (+7-7)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-modifier-casting.ll (+13-13)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg.ll (+7-7)
  • (modified) llvm/test/CodeGen/AMDGPU/fp64-atomics-gfx90a.ll (+168-192)
  • (modified) llvm/test/CodeGen/AMDGPU/fshl.ll (+92-100)
  • (modified) llvm/test/CodeGen/AMDGPU/fshr.ll (+89-106)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics.ll (+633-718)
  • (modified) llvm/test/CodeGen/AMDGPU/half.ll (+65-109)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_vector_dynelt.ll (+27-27)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_vector_elt.ll (+10-12)
  • (modified) llvm/test/CodeGen/AMDGPU/kernel-args.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.div.fmas.ll (+189-86)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.gfx950.ll (+36-20)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp.ll (+43-43)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp10.ll (+43-43)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp2.ll (+18-18)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log.ll (+38-38)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log10.ll (+38-38)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log2.ll (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/mad_64_32.ll (+28-28)
  • (modified) llvm/test/CodeGen/AMDGPU/min.ll (+96-96)
  • (modified) llvm/test/CodeGen/AMDGPU/packed-fp32.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/rotl.ll (+35-33)
  • (modified) llvm/test/CodeGen/AMDGPU/rotr.ll (+27-25)
  • (modified) llvm/test/CodeGen/AMDGPU/s_addk_i32.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/sext-divergence-driven-isel.ll (+23-21)
  • (modified) llvm/test/CodeGen/AMDGPU/shl.ll (+264-236)
  • (modified) llvm/test/CodeGen/AMDGPU/si-triv-disjoint-mem-access.ll (+14-14)
  • (modified) llvm/test/CodeGen/AMDGPU/sminmax.v2i16.ll (+32-32)
  • (modified) llvm/test/CodeGen/AMDGPU/tuple-allocation-failure.ll (+130-129)
  • (modified) llvm/test/CodeGen/AMDGPU/udivrem.ll (+65-61)
  • (modified) llvm/test/CodeGen/AMDGPU/uint_to_fp.f64.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/zext-divergence-driven-isel.ll (+11-10)
  • (modified) llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-vectors-complex.ll (+24-15)
  • (modified) llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-vectors.ll (+21-3)
diff --git a/llvm/include/llvm/Transforms/Utils/Local.h b/llvm/include/llvm/Transforms/Utils/Local.h
index df146458b4e6f..ad29d69900235 100644
--- a/llvm/include/llvm/Transforms/Utils/Local.h
+++ b/llvm/include/llvm/Transforms/Utils/Local.h
@@ -448,6 +448,10 @@ LLVM_ABI void combineAAMetadata(Instruction *K, const Instruction *J);
 /// replacement for the source instruction).
 LLVM_ABI void copyMetadataForLoad(LoadInst &Dest, const LoadInst &Source);
 
+/// Copy the metadata from the source instruction to the destination (the
+/// replacement for the source instruction).
+LLVM_ABI void copyMetadataForStore(StoreInst &Dest, const StoreInst &Source);
+
 /// Patch the replacement so that it is not more restrictive than the value
 /// being replaced. It assumes that the replacement does not get moved from
 /// its original position.
diff --git a/llvm/lib/Transforms/Utils/Local.cpp b/llvm/lib/Transforms/Utils/Local.cpp
index f5208d50c6aae..ecdc920d442dc 100644
--- a/llvm/lib/Transforms/Utils/Local.cpp
+++ b/llvm/lib/Transforms/Utils/Local.cpp
@@ -3496,6 +3496,51 @@ void llvm::copyMetadataForLoad(LoadInst &Dest, const LoadInst &Source) {
   }
 }
 
+void llvm::copyMetadataForStore(StoreInst &Dest, const StoreInst &Source) {
+  SmallVector<std::pair<unsigned, MDNode *>, 8> MD;
+  Source.getAllMetadata(MD);
+  MDBuilder MDB(Dest.getContext());
+  Type *NewType = Dest.getType();
+  for (const auto &MDPair : MD) {
+    unsigned ID = MDPair.first;
+    MDNode *N = MDPair.second;
+    switch (ID) {
+    case LLVMContext::MD_dbg:
+    case LLVMContext::MD_prof:
+    case LLVMContext::MD_tbaa_struct:
+    case LLVMContext::MD_alias_scope:
+    case LLVMContext::MD_noalias:
+    case LLVMContext::MD_nontemporal:
+    case LLVMContext::MD_access_group:
+    case LLVMContext::MD_noundef:
+    case LLVMContext::MD_noalias_addrspace:
+    case LLVMContext::MD_mem_parallel_loop_access:
+      Dest.setMetadata(ID, N);
+      break;
+
+    case LLVMContext::MD_tbaa: {
+      MDNode *NewTyNode =
+          MDB.createTBAAScalarTypeNode(NewType->getStructName(), N);
+      Dest.setMetadata(LLVMContext::MD_tbaa, NewTyNode);
+      break;
+    }
+    case LLVMContext::MD_nonnull:
+      break;
+
+    case LLVMContext::MD_align:
+    case LLVMContext::MD_dereferenceable:
+    case LLVMContext::MD_dereferenceable_or_null:
+      // These only directly apply if the new type is also a pointer.
+      if (NewType->isPointerTy())
+        Dest.setMetadata(ID, N);
+      break;
+
+    case LLVMContext::MD_range:
+      break;
+    }
+  }
+}
+
 void llvm::patchReplacementInstruction(Instruction *I, Value *Repl) {
   auto *ReplInst = dyn_cast<Instruction>(Repl);
   if (!ReplInst)
diff --git a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
index 89f63c3b66aad..12ac5c891aefa 100644
--- a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
@@ -112,6 +112,7 @@
 #include <optional>
 #include <tuple>
 #include <type_traits>
+#include <unordered_map>
 #include <utility>
 #include <vector>
 
@@ -268,11 +269,6 @@ class Vectorizer {
   /// isGuaranteedToTransferExecutionToSuccessor(I) == true.
   bool runOnPseudoBB(BasicBlock::iterator Begin, BasicBlock::iterator End);
 
-  /// Runs the vectorizer on one equivalence class, i.e. one set of loads/stores
-  /// in the same BB with the same value for getUnderlyingObject() etc.
-  bool runOnEquivalenceClass(const EqClassKey &EqClassKey,
-                             ArrayRef<Instruction *> EqClass);
-
   /// Runs the vectorizer on one chain, i.e. a subset of an equivalence class
   /// where all instructions access a known, constant offset from the first
   /// instruction.
@@ -337,12 +333,24 @@ class Vectorizer {
   EquivalenceClassMap collectEquivalenceClasses(BasicBlock::iterator Begin,
                                                 BasicBlock::iterator End);
 
+  /// Inserts a cast instruction to convert Inst to DstTy.
+  Value *insertCast(Value *Val, Type *DstTy);
+
   /// Partitions Instrs into "chains" where every instruction has a known
   /// constant offset from the first instr in the chain.
   ///
   /// Postcondition: For all i, ret[i][0].second == 0, because the first instr
   /// in the chain is the leader, and an instr touches distance 0 from itself.
   std::vector<Chain> gatherChains(ArrayRef<Instruction *> Instrs);
+
+  // Helpers for chain merging.
+  std::optional<APInt> computeLeaderDelta(Instruction *I1, Instruction *I2);
+  bool chainsOverlapAfterRebase(const Chain &A, const Chain &B,
+                                const APInt &Delta) const;
+  static bool allElemsMatchTotalBits(const Chain &C, unsigned TotalBits,
+                                     const DataLayout &DL);
+  static void rebaseChain(Chain &C, const APInt &Delta);
+  void normalizeChainToType(Chain &C, Type *CastTy);
 };
 
 class LoadStoreVectorizerLegacyPass : public FunctionPass {
@@ -424,6 +432,20 @@ PreservedAnalyses LoadStoreVectorizerPass::run(Function &F,
   return Changed ? PA : PreservedAnalyses::all();
 }
 
+static const Value *getUnderlyingPtrObject(const Value *Ptr) {
+  const Value *ObjPtr = llvm::getUnderlyingObject(Ptr);
+  if (const auto *Sel = dyn_cast<SelectInst>(ObjPtr)) {
+    // The select's themselves are distinct instructions even if they share
+    // the same condition and evaluate to consecutive pointers for true and
+    // false values of the condition. Therefore using the select's themselves
+    // for grouping instructions would put consecutive accesses into different
+    // lists and they won't be even checked for being consecutive, and won't
+    // be vectorized.
+    return Sel->getCondition();
+  }
+  return ObjPtr;
+}
+
 bool Vectorizer::run() {
   bool Changed = false;
   // Break up the BB if there are any instrs which aren't guaranteed to transfer
@@ -467,6 +489,98 @@ bool Vectorizer::run() {
   return Changed;
 }
 
+Value *Vectorizer::insertCast(Value *Val, Type *DstTy) {
+  if (DL.getTypeSizeInBits(Val->getType()) == DL.getTypeSizeInBits(DstTy)) {
+    return Builder.CreateBitOrPointerCast(Val, DstTy, Val->getName() + ".bc");
+  }
+
+  // If the types are of different sizes and both are integers, we can use
+  // zext or sext to cast.
+  if (Val->getType()->isIntegerTy() && DstTy->isIntegerTy()) {
+    if (DL.getTypeSizeInBits(Val->getType()) < DL.getTypeSizeInBits(DstTy)) {
+      return Builder.CreateZExt(Val, DstTy, Val->getName() + ".bc");
+    }
+    return Builder.CreateTrunc(Val, DstTy, Val->getName() + ".bc");
+  }
+
+  return nullptr;
+}
+
+std::optional<APInt> Vectorizer::computeLeaderDelta(Instruction *I1,
+                                                    Instruction *I2) {
+  assert((isa<LoadInst>(I1) || isa<StoreInst>(I1)) &&
+         (isa<LoadInst>(I2) || isa<StoreInst>(I2)) &&
+         "computeLeaderDelta must be called with load or store instructions");
+  Instruction *CtxInst = I1->comesBefore(I2) ? I2 : I1;
+  const Value *Ptr1 = getLoadStorePointerOperand(I1);
+  const Value *Ptr2 = getLoadStorePointerOperand(I2);
+  return getConstantOffset(const_cast<Value *>(Ptr1), const_cast<Value *>(Ptr2),
+                           CtxInst);
+}
+
+bool Vectorizer::chainsOverlapAfterRebase(const Chain &A, const Chain &B,
+                                          const APInt &Delta) const {
+  for (const ChainElem &EB : B) {
+    APInt OffB = EB.OffsetFromLeader + Delta;
+    unsigned SizeB = DL.getTypeStoreSize(getLoadStoreType(EB.Inst));
+    APInt EndB = OffB + SizeB;
+    for (const ChainElem &EA : A) {
+      APInt OffA = EA.OffsetFromLeader;
+      unsigned SizeA = DL.getTypeStoreSize(getLoadStoreType(EA.Inst));
+      APInt EndA = OffA + SizeA;
+      if (OffB.slt(EndA) && OffA.slt(EndB))
+        return true;
+    }
+  }
+  return false;
+}
+
+bool Vectorizer::allElemsMatchTotalBits(const Chain &C, unsigned TotalBits,
+                                        const DataLayout &DL) {
+  return llvm::all_of(C, [&](const ChainElem &E) {
+    return DL.getTypeSizeInBits(getLoadStoreType(E.Inst)) == TotalBits;
+  });
+}
+
+void Vectorizer::rebaseChain(Chain &C, const APInt &Delta) {
+  for (ChainElem &E : C)
+    E.OffsetFromLeader += Delta;
+}
+
+void Vectorizer::normalizeChainToType(Chain &C, Type *CastTy) {
+  for (ChainElem &Elem : C) {
+    Instruction *Inst = Elem.Inst;
+    Type *OrigValTy = getLoadStoreType(Inst);
+    if (OrigValTy == CastTy)
+      continue;
+
+    if (auto *LI = dyn_cast<LoadInst>(Inst)) {
+      Builder.SetInsertPoint(LI);
+      LoadInst *NewLoad = Builder.CreateLoad(CastTy, LI->getPointerOperand(),
+                                             LI->getName() + ".mut");
+      copyMetadataForLoad(*NewLoad, *LI);
+      Value *CastBack = insertCast(NewLoad, OrigValTy);
+      if (!CastBack)
+        llvm_unreachable("Failed to insert cast");
+      LI->replaceAllUsesWith(CastBack);
+      ToErase.emplace_back(LI);
+      Elem.Inst = NewLoad;
+    } else if (auto *SI = dyn_cast<StoreInst>(Inst)) {
+      Builder.SetInsertPoint(SI);
+      Value *CastVal = insertCast(SI->getValueOperand(), CastTy);
+      if (!CastVal)
+        llvm_unreachable("Failed to insert cast");
+      StoreInst *NewStore =
+          Builder.CreateStore(CastVal, SI->getPointerOperand());
+      NewStore->setAlignment(SI->getAlign());
+      NewStore->setVolatile(SI->isVolatile());
+      copyMetadataForStore(*NewStore, *SI);
+      ToErase.emplace_back(SI);
+      Elem.Inst = NewStore;
+    }
+  }
+}
+
 bool Vectorizer::runOnPseudoBB(BasicBlock::iterator Begin,
                                BasicBlock::iterator End) {
   LLVM_DEBUG({
@@ -479,49 +593,111 @@ bool Vectorizer::runOnPseudoBB(BasicBlock::iterator Begin,
   });
 
   bool Changed = false;
+  SmallVector<Chain> ContiguousSubChains;
+
   for (const auto &[EqClassKey, EqClass] :
-       collectEquivalenceClasses(Begin, End))
-    Changed |= runOnEquivalenceClass(EqClassKey, EqClass);
+       collectEquivalenceClasses(Begin, End)) {
 
-  return Changed;
-}
+    LLVM_DEBUG({
+      dbgs() << "LSV: Running on equivalence class of size " << EqClass.size()
+             << " keyed on " << EqClassKey << ":\n";
+      for (Instruction *I : EqClass)
+        dbgs() << "  " << *I << "\n";
+    });
 
-bool Vectorizer::runOnEquivalenceClass(const EqClassKey &EqClassKey,
-                                       ArrayRef<Instruction *> EqClass) {
-  bool Changed = false;
+    for (Chain &C : gatherChains(EqClass)) {
 
-  LLVM_DEBUG({
-    dbgs() << "LSV: Running on equivalence class of size " << EqClass.size()
-           << " keyed on " << EqClassKey << ":\n";
-    for (Instruction *I : EqClass)
-      dbgs() << "  " << *I << "\n";
-  });
+      // Split up the chain into increasingly smaller chains, until we can
+      // finally vectorize the chains.
+      //
+      // (Don't be scared by the depth of the loop nest here.  These operations
+      // are all at worst O(n lg n) in the number of instructions, and splitting
+      // chains doesn't change the number of instrs.  So the whole loop nest is
+      // O(n lg n).)
+      for (auto &C : splitChainByMayAliasInstrs(C)) {
+        for (auto &C : splitChainByContiguity(C)) {
+          ContiguousSubChains.emplace_back(C);
+        }
+      }
+    }
+  }
 
-  std::vector<Chain> Chains = gatherChains(EqClass);
-  LLVM_DEBUG(dbgs() << "LSV: Got " << Chains.size()
-                    << " nontrivial chains.\n";);
-  for (Chain &C : Chains)
-    Changed |= runOnChain(C);
-  return Changed;
-}
+  // Merge chains in reverse order, so that the first chain is the largest.
+  for (int I = ContiguousSubChains.size() - 1; I > 0; I--) {
+    Chain &C1 = ContiguousSubChains[I - 1];
+    Chain &C2 = ContiguousSubChains[I];
 
-bool Vectorizer::runOnChain(Chain &C) {
-  LLVM_DEBUG({
-    dbgs() << "LSV: Running on chain with " << C.size() << " instructions:\n";
-    dumpChain(C);
-  });
+    // If the scalar types of the chains are the same, we can merge them
+    // without inserting any casts.
+    if (getLoadStoreType(C1[0].Inst)->getScalarType() ==
+        getLoadStoreType(C2[0].Inst)->getScalarType())
+      continue;
+
+    const Value *C1Ptr = getLoadStorePointerOperand(C1[0].Inst);
+    const Value *C2Ptr = getLoadStorePointerOperand(C2[0].Inst);
+    unsigned AS1 = C1Ptr->getType()->getPointerAddressSpace();
+    unsigned AS2 = C2Ptr->getType()->getPointerAddressSpace();
+    bool C1IsLoad = isa<LoadInst>(C1[0].Inst);
+    bool C2IsLoad = isa<LoadInst>(C2[0].Inst);
+
+    // If the chains are mapped to different types, have distinct underlying
+    // pointer objects, or include both loads and stores, skip.
+    if (getUnderlyingPtrObject(C1Ptr) != getUnderlyingPtrObject(C2Ptr) ||
+        C1IsLoad != C2IsLoad || AS1 != AS2)
+      continue;
+
+    // Compute constant offset between chain leaders; if unknown, skip.
+    std::optional<APInt> DeltaOpt = computeLeaderDelta(C1[0].Inst, C2[0].Inst);
+    if (!DeltaOpt)
+      continue;
+
+    // Check that rebasing C2 into C1's coordinate space will not overlap C1.
+    if (chainsOverlapAfterRebase(C1, C2, *DeltaOpt))
+      continue;
+
+    // Determine the common integer cast type for normalization and ensure total
+    // bitwidth matches across all elements of both chains.
+    Type *C1ElemTy = getLoadStoreType(C1[0].Inst);
+    unsigned TotalBits = DL.getTypeSizeInBits(C1ElemTy);
+    auto AllElemsMatchTotalBits = [&](const Chain &C) {
+      return llvm::all_of(C, [&](const ChainElem &E) {
+        return DL.getTypeSizeInBits(getLoadStoreType(E.Inst)) == TotalBits;
+      });
+    };
+    if (!AllElemsMatchTotalBits(C1) || !AllElemsMatchTotalBits(C2))
+      continue;
+
+    // Rebase C2's offsets into C1's coordinate space prior to merging.
+    rebaseChain(C2, *DeltaOpt);
+
+    // Merge C2 into C1 by appending all elements of C2 to C1, then erase C2
+    // from ContiguousSubChains.
+    C1.insert(C1.end(), C2.begin(), C2.end());
+    ContiguousSubChains.erase(ContiguousSubChains.begin() + I);
+
+    // Normalize the value operand/result type of each instruction in C1 to
+    // C1CastTy.
+    Type *C1CastTy =
+        Type::getIntNTy(C1ElemTy->getContext(), DL.getTypeSizeInBits(C1ElemTy));
+    normalizeChainToType(C1, C1CastTy);
+  }
+
+  for (auto &C : ContiguousSubChains) {
+    if (C.size() <= 1)
+      continue;
+    for (auto &AlignedSubChain : splitChainByAlignment(C))
+      Changed |= vectorizeChain(AlignedSubChain);
+  }
+
+  // Erase all instructions scheduled for deletion in this pseudo-BB.
+  for (Instruction *I : ToErase) {
+    auto *PtrOperand = getLoadStorePointerOperand(I);
+    if (I->use_empty())
+      I->eraseFromParent();
+    RecursivelyDeleteTriviallyDeadInstructions(PtrOperand);
+  }
+  ToErase.clear();
 
-  // Split up the chain into increasingly smaller chains, until we can finally
-  // vectorize the chains.
-  //
-  // (Don't be scared by the depth of the loop nest here.  These operations are
-  // all at worst O(n lg n) in the number of instructions, and splitting chains
-  // doesn't change the number of instrs.  So the whole loop nest is O(n lg n).)
-  bool Changed = false;
-  for (auto &C : splitChainByMayAliasInstrs(C))
-    for (auto &C : splitChainByContiguity(C))
-      for (auto &C : splitChainByAlignment(C))
-        Changed |= vectorizeChain(C);
   return Changed;
 }
 
@@ -578,7 +754,7 @@ std::vector<Chain> Vectorizer::splitChainByMayAliasInstrs(Chain &C) {
         LLVM_DEBUG(
             dbgs() << "LSV: Found intervening may-alias instrs; cannot merge "
                    << *ChainIt->Inst << " into " << *ChainBegin->Inst << "\n");
-        if (NewChain.size() > 1) {
+        if (!NewChain.empty()) {
           LLVM_DEBUG({
             dbgs() << "LSV: got nontrivial chain without aliasing instrs:\n";
             dumpChain(NewChain);
@@ -590,7 +766,7 @@ std::vector<Chain> Vectorizer::splitChainByMayAliasInstrs(Chain &C) {
         NewChain = SmallVector<ChainElem, 1>({*ChainIt});
       }
     }
-    if (NewChain.size() > 1) {
+    if (!NewChain.empty()) {
       LLVM_DEBUG({
         dbgs() << "LSV: got nontrivial chain without aliasing instrs:\n";
         dumpChain(NewChain);
@@ -643,8 +819,6 @@ std::vector<Chain> Vectorizer::splitChainByContiguity(Chain &C) {
       Ret.push_back({*It});
   }
 
-  // Filter out length-1 chains, these are uninteresting.
-  llvm::erase_if(Ret, [](const auto &Chain) { return Chain.size() <= 1; });
   return Ret;
 }
 
@@ -1427,20 +1601,6 @@ Vectorizer::collectEquivalenceClasses(BasicBlock::iterator Begin,
                                       BasicBlock::iterator End) {
   EquivalenceClassMap Ret;
 
-  auto GetUnderlyingObject = [](const Value *Ptr) -> const Value * {
-    const Value *ObjPtr = llvm::getUnderlyingObject(Ptr);
-    if (const auto *Sel = dyn_cast<SelectInst>(ObjPtr)) {
-      // The select's themselves are distinct instructions even if they share
-      // the same condition and evaluate to consecutive pointers for true and
-      // false values of the condition. Therefore using the select's themselves
-      // for grouping instructions would put consecutive accesses into different
-      // lists and they won't be even checked for being consecutive, and won't
-      // be vectorized.
-      return Sel->getCondition();
-    }
-    return ObjPtr;
-  };
-
   for (Instruction &I : make_range(Begin, End)) {
     auto *LI = dyn_cast<LoadInst>(&I);
     auto *SI = dyn_cast<StoreInst>(&I);
@@ -1488,7 +1648,7 @@ Vectorizer::collectEquivalenceClasses(BasicBlock::iterator Begin,
         (VecTy && TTI.getLoadVectorFactor(VF, TySize, TySize / 8, VecTy) == 0))
       continue;
 
-    Ret[{GetUnderlyingObject(Ptr), AS,
+    Ret[{getUnderlyingPtrObject(Ptr), AS,
          DL.getTypeSizeInBits(getLoadStoreType(&I)->getScalarType()),
          /*IsLoad=*/LI != nullptr}]
         .emplace_back(&I);
@@ -1583,8 +1743,7 @@ std::vector<Chain> Vectorizer::gatherChains(ArrayRef<Instruction *> Instrs) {
   Ret.reserve(Chains.size());
   // Iterate over MRU rather than Chains so the order is deterministic.
   for (auto &E : MRU)
-    if (E.second.size() > 1)
-      Ret.emplace_back(std::move(E.second));
+    Ret.emplace_back(std::move(E.second));
   return Ret;
 }
 
diff --git a/llvm/test/CodeGen/AMDGPU/32-bit-local-address-space.ll b/llvm/test/CodeGen/AMDGPU/32-bit-local-address-space.ll
index 840165d5a7e7a..fc1b636f525fd 100644
--- a/llvm/test/CodeGen/AMDGPU/32-bit-local-address-space.ll
+++ b/llvm/test/CodeGen/AMDGPU/32-bit-local-address-space.ll
@@ -298,26 +298,24 @@ define amdgpu_kernel void @local_address_store(ptr addrspace(3) %out, i32 %val)
 define amdgpu_kernel void @local_address_gep_store(ptr addrspace(3) %out, i32, i32 %val, i32 %offset) {
 ; GFX7-LABEL: local_address_gep_store:
 ; GFX7:       ; %bb.0:
-; GFX7-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0xb
-; GFX7-NEXT:    s_load_dword s2, s[4:5], 0x9
+; GFX7-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x9
 ; GFX7-NEXT:    s_mov_b32 m0, -1
 ; GFX7-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX7-NEXT:    s_lshl_b32 s1, s1, 2
-; GFX7-NEXT:    v_mov_b32_e32 v0, s0
-; GFX7-NEXT:    s_add_i32 s0, s2, s1
+; GFX7-NEXT:    s_lshl_b32 s2, s2, 2
+; GFX7-NEXT:    s_add_i32 s0, s0, s2
+; GFX7-NEXT:    v_mov_b32_e32 v0, s1
 ; GFX7-NEXT:    v_mov_b32_e32 v1, s0
 ; GFX7-NEXT:    ds_write_b32 v1, v0
 ; GFX7-NEXT:    s_endpgm
 ;
 ; GFX8-LABEL: local_address_gep_store:
 ; GFX8:       ; %bb.0:
-; GFX8-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
-; GFX8-NEXT:    s_load_dword s2, s[4:5], 0x24
+; GFX8-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
 ; GFX8-NEXT:    s_mov_b32 m0, -1
 ; GFX8-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8-NEXT:    s_lshl_b32 s1, s1, 2
-; GFX8-NEXT:    v_mov_b32_e32 v0, s0
-; GFX8-NEXT:    s_add_i32 s0, s2, s1
+; GFX8-NEXT:    s_lshl_b32 s2, s2, 2
+; GFX8-NEXT:    s_add_i32 s0, s0, s2
+; GFX8-NEXT:    v_mov_b32_e32 v0, s1
 ; GFX8-NEXT:    v_mov_b32_e32 v1, s0
 ; GFX8-NEXT:    ds_write_b32 v1, v0
 ; GFX8-NEXT:    s_endpgm
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/amdgpu-irtranslator.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/amdgpu-irtranslator.ll
index 523d51ddcd2bc..a9271794bc03b 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/amdgpu-irtranslator.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/amdgpu-irtranslator.ll
@@ -1,3 +1,4 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
 ; RUN: llc -m...
[truncated]

gandhi56 added a commit to gandhi56/llvm-project that referenced this pull request Aug 22, 2025
@gandhi56 gandhi56 force-pushed the issue-97715/lsv-merge-chains branch from 9dd14e7 to c50b7ff on August 22, 2025 19:07
@github-actions

github-actions bot commented Aug 22, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@gandhi56 gandhi56 force-pushed the issue-97715/lsv-merge-chains branch from c50b7ff to 97222ff on August 22, 2025 19:11
@gandhi56 gandhi56 force-pushed the issue-97715/lsv-merge-chains branch 2 times, most recently from d006836 to f694360 on August 31, 2025 02:02
@gandhi56
Contributor Author

Internal PSDB passed.

@gandhi56 gandhi56 requested review from Copilot and rampitec September 2, 2025 16:25
Contributor

Copilot AI left a comment

Pull Request Overview

This pull request enables the LoadStoreVectorizer to merge contiguous memory access chains across different scalar types that have matching total bitwidth. The vectorizer now normalizes chains to common integer types and inserts necessary type casts, uncovering more vectorization opportunities.

Key changes:

  • Implements chain merging logic to combine contiguous access patterns with different element types but same total bitwidth
  • Adds type normalization system that converts merged chains to common integer types with appropriate casting
  • Updates test expectations across AMDGPU codegen to reflect improved vectorization patterns

Reviewed Changes

Copilot reviewed 45 out of 45 changed files in this pull request and generated no comments.

File / Description

  • llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-vectors.ll: Demonstrates new vectorization of mixed i32/v2i16 loads with bitcasting
  • llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-vectors-complex.ll: Shows vectorization of mixed f32/v2f16 loads/stores with type normalization
  • Various AMDGPU test files: Updated codegen expectations due to improved vectorization patterns
Comments suppressed due to low confidence (4)

@gandhi56 gandhi56 requested a review from arsenm September 3, 2025 04:46
@gandhi56
Contributor Author

gandhi56 commented Sep 8, 2025

ping

@gandhi56 gandhi56 force-pushed the issue-97715/lsv-merge-chains branch from 6901539 to f961be4 on November 26, 2025 20:40
@gandhi56
Contributor Author

  • Added llvm/test/Transforms/InstCombine/copy-access-metadata.ll to test copying of each metadata kind handled by the utility function copyMetadataForAccess, for loads.
  • Added llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/copy-metadata-loads-stores.ll to implement vectorization-specific copyMetadataForAccess checks.
  • Rebased and updated all tests.

@gandhi56
Contributor Author

> I've not looked at the LSV changes (leaving this to @arsenm), but I'll note that the IR test coverage here looks pretty weak for a significant change to the pass. The part I looked at (the Local.cpp changes) doesn't appear to have any test coverage at all.

Can this PR be merged, @arsenm ?

@gandhi56 gandhi56 force-pushed the issue-97715/lsv-merge-chains branch from f961be4 to dc0dcc2 on December 2, 2025 01:44
@gandhi56
Contributor Author

gandhi56 commented Dec 2, 2025

Thanks for the feedback. I have addressed them all. With approval from John, I will merge this PR.

@gandhi56 gandhi56 merged commit fbdf8ab into llvm:main Dec 2, 2025
10 checks passed
; CHECK-SAME: ptr addrspace(1) [[PTR1:%.*]]) {
; CHECK-NEXT: [[LOAD_0:%.*]] = load i32, ptr addrspace(1) [[PTR1]], align 4
; CHECK-NEXT: [[GEP2:%.*]] = getelementptr inbounds i32, ptr addrspace(1) [[PTR1]], i64 1
; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i32>, ptr addrspace(1) [[GEP2]], align 4
Contributor

How is this created despite only an alignment of 4? Is this test intended to merge all of these values together? Is it behaving as intended?

Comment on lines +106 to +126
define void @merge_i32_float(ptr addrspace(1) %ptr1, ptr addrspace(2) %ptr2) {
; CHECK-LABEL: define void @merge_i32_float(
; CHECK-SAME: ptr addrspace(1) [[PTR1:%.*]], ptr addrspace(2) [[PTR2:%.*]]) {
; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i32>, ptr addrspace(1) [[PTR1]], align 4
; CHECK-NEXT: [[LOAD_01:%.*]] = extractelement <2 x i32> [[TMP1]], i32 0
; CHECK-NEXT: [[LOAD_12:%.*]] = extractelement <2 x i32> [[TMP1]], i32 1
; CHECK-NEXT: [[TMP2:%.*]] = bitcast i32 [[LOAD_12]] to float
; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x i32> poison, i32 [[LOAD_01]], i32 0
; CHECK-NEXT: [[TMP4:%.*]] = bitcast float [[TMP2]] to i32
; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x i32> [[TMP3]], i32 [[TMP4]], i32 1
; CHECK-NEXT: store <2 x i32> [[TMP5]], ptr addrspace(2) [[PTR2]], align 4
; CHECK-NEXT: ret void
;
%gep.1 = getelementptr inbounds i32, ptr addrspace(1) %ptr1, i64 1
%load.0 = load i32, ptr addrspace(1) %ptr1
%load.1 = load float, ptr addrspace(1) %gep.1
%store.gep.1 = getelementptr inbounds i32, ptr addrspace(2) %ptr2, i64 1
store i32 %load.0, ptr addrspace(2) %ptr2
store float %load.1, ptr addrspace(2) %store.gep.1
ret void
}
Contributor

Just to confirm, the old vectorizer was already capable of vectorizing chains of elements that have different-type but same-sized scalar types, such as this chain of an i32 and a float, right?

Contributor Author

Correct

Comment on lines +690 to +697
// Erase all instructions scheduled for deletion in this pseudo-BB.
for (Instruction *I : ToErase) {
auto *PtrOperand = getLoadStorePointerOperand(I);
if (I->use_empty())
I->eraseFromParent();
RecursivelyDeleteTriviallyDeadInstructions(PtrOperand);
}
ToErase.clear();
Contributor

Comment on lines +346 to +352

// Helpers for chain merging.
std::optional<APInt> computeLeaderDelta(Instruction *I1, Instruction *I2);
bool chainsOverlapAfterRebase(const Chain &A, const Chain &B,
const APInt &Delta) const;
static void rebaseChain(Chain &C, const APInt &Delta);
void normalizeChainToType(Chain &C, Type *CastTy);
Contributor

It would be nice to have comments above each of these, matching the rest of the file.

#include <optional>
#include <tuple>
#include <type_traits>
#include <unordered_map>
Contributor

I don't think this include is used in the file

Comment on lines -1629 to +1785
Ret.emplace_back(std::move(E.second));
Ret.emplace_back(std::move(E.second));
Contributor

Can you explain this change? Are you basically allowing chains that have a size of 1 in order to merge them later?

Contributor Author

Yup, that's the idea.
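
For context, a minimal example of why singleton chains are kept (function name, offsets, and alignments are illustrative only, not taken from the patch): each load below sits alone in its equivalence class, because the class key includes the scalar element width (32 bits for i32, 16 bits for <2 x i16>), so each forms a chain of size 1. If gatherChains still dropped size-1 chains, the later cross-chain merge would never see them.

define void @singleton_chains(ptr addrspace(1) %p) {
  %gep = getelementptr inbounds i32, ptr addrspace(1) %p, i64 1
  %a = load i32, ptr addrspace(1) %p, align 8          ; chain 1 (single element)
  %b = load <2 x i16>, ptr addrspace(1) %gep, align 4  ; chain 2 (single element)
  ret void
}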

Comment on lines +619 to +623
// If the scalar types of the chains are the same, we can merge them
// without inserting any casts.
if (getLoadStoreType(C1[0].Inst)->getScalarType() ==
getLoadStoreType(C2[0].Inst)->getScalarType())
continue;
Contributor

I'm confused by this comment. Entering the if statement and running continue would result in not merging these chains, right?

Comment on lines +614 to +615
// Merge chains in reverse order, so that the first chain is the largest.
for (int I = ContiguousSubChains.size() - 1; I > 0; I--) {
Contributor

nit: it would be nice if this whole for loop were inside a helper function; it would make reading the overall algorithm easier.

Changed |= runOnChain(C);
return Changed;
}
// Merge chains in reverse order, so that the first chain is the largest.
Contributor

@dakersnar dakersnar Dec 2, 2025

This comment concerns me. We don't have a guarantee that chains are in increasing size order at this point in the code. Am I missing something?

Contributor

The goal of this optimization is to find chains that would be in the same equivalence class if not for type differences, right? Why are we arbitrarily comparing adjacent chains to do so? Can you explain this heuristic? Also, what if two adjacent chains have aliasing issues that would normally be blocked via splitChainByMayAliasInstrs, and you re-merge them here without checking for aliasing?

Comment on lines +521 to +532
bool Vectorizer::chainsOverlapAfterRebase(const Chain &A, const Chain &B,
const APInt &Delta) const {
ConstantRange ARange(
A.front().OffsetFromLeader,
A.back().OffsetFromLeader +
DL.getTypeStoreSize(getLoadStoreType(A.back().Inst)));
ConstantRange BRange(
B.front().OffsetFromLeader + Delta,
B.back().OffsetFromLeader + Delta +
DL.getTypeStoreSize(getLoadStoreType(B.back().Inst)));
return !ARange.intersectWith(BRange).isEmptySet();
}
Contributor

@gandhi56 This contains a bug. After #168135, the last element of the chain is not necessarily the furthest reading element of the chain.

For example, we can have a chain that looks like this:
...other loads...
%l1 = load <2 x i32>, ptr %a
%l2 = load i32, ptr %a

%l2 is the back of the chain, but %l1 extends further than %l2. So an overlap check that derives the chain's end from %l2 (the back element) misses %l1's extent.
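
To make the hazard concrete, a hypothetical layout (names, offsets, and the delta below are assumed for illustration, following the chain shape described above):

define void @overlap_miss(ptr %a) {
  ; Chain A, leader %a; byte ranges relative to the leader in brackets.
  %l1 = load <2 x i32>, ptr %a, align 8  ; covers [0, 8), reaches furthest
  %l2 = load i32, ptr %a, align 8        ; covers [0, 4), but this is A.back()
  ; Chain B, rebased into A's coordinate space with delta = 4.
  %bp = getelementptr i8, ptr %a, i64 4
  %b1 = load i32, ptr %bp, align 4       ; covers [4, 8)
  ret void
}
; A range built only from A.front() and A.back() is [0, 4), which does not
; intersect B's [4, 8), so the check reports no overlap, yet %l1 does
; overlap %b1.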

Comment on lines +663 to +668
APInt Sz = C2.front().OffsetFromLeader +
DL.getTypeStoreSize(getLoadStoreType(C2.front().Inst)) -
C1.back().OffsetFromLeader + *DeltaOpt;
if (!Sz.isPowerOf2())
continue;

Contributor

Do you mind explaining this calculation? This doesn't seem right to me.

Comment on lines +669 to +674
// Rebase C2's offsets into C1's coordinate space prior to merging and
// merge C2 into C1 by appending all elements of C2 to C1, then erase C2
// from ContiguousSubChains.
rebaseChain(C2, *DeltaOpt);
C1.insert(C1.end(), C2.begin(), C2.end());
ContiguousSubChains.erase(ContiguousSubChains.begin() + I);
Contributor

Are we assuming that C2 and C1 are contiguous? splitChainByAlias is expecting contiguous chains, but I don't see how we have any guarantees that they are if we do a merge like this.

@dakersnar
Contributor

Unless I am mistaken, I think there are a number of correctness issues in this change, in addition to some less important style issues. I think we may need to revert and rethink this change. I can come up with correctness-issue-demonstrating test examples soon if you would like.

@dakersnar
Contributor

dakersnar commented Dec 2, 2025

Here are two miscompile examples, one due to breaking of aliasing, and one due to breaking of contiguity. I think this needs to be reworked from scratch; there are too many issues to safely resolve in follow-ups. @arsenm is this sufficient motivation to revert this change?

I also question the overall strategy. This seems like a hacky way to do this, have you explored simply expanding the definition of "equivalence class" to allow all these chain elements in the same chains to begin with, rather than trying to patch them together after the first two filters (aliasing and contiguity) are complete?

Edit: the aliasing miscompile is obvious, but see #159388 (comment) for more info on why the contiguity miscompile is wrong.

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 6
; RUN: opt -mtriple=nvptx64-nvidia-cuda -passes=load-store-vectorizer -S -o - %s | FileCheck %s


; splitChainByMayAliasInstrs should block this.
define void @correct1(ptr nocapture align 8 %ptr) {
; CHECK-LABEL: define void @correct1(
; CHECK-SAME: ptr align 8 captures(none) [[PTR:%.*]]) {
; CHECK-NEXT:    [[PTR0:%.*]] = getelementptr i8, ptr [[PTR]], i64 0
; CHECK-NEXT:    [[PTR3:%.*]] = getelementptr i8, ptr [[PTR]], i64 4
; CHECK-NEXT:    [[L1:%.*]] = load i32, ptr [[PTR0]], align 8
; CHECK-NEXT:    call void @llvm.memset.p0.i64(ptr [[PTR3]], i8 0, i64 1, i1 false)
; CHECK-NEXT:    [[L0:%.*]] = load i32, ptr [[PTR3]], align 4
; CHECK-NEXT:    ret void
;
  %ptr0 = getelementptr i8, ptr %ptr, i64 0
  %ptr3 = getelementptr i8, ptr %ptr, i64 4

  %l0 = load i32, ptr %ptr0, align 8
  call void @llvm.memset.p0.i64(ptr %ptr3, i8 0, i64 1, i1 false)
  %l1 = load i32, ptr %ptr3, align 4

  ret void
}

; This should be blocked like above due to aliasing, but the merging does not check for aliasing.
define void @miscompile1(ptr nocapture align 8 %ptr) {
; CHECK-LABEL: define void @miscompile1(
; CHECK-SAME: ptr align 8 captures(none) [[PTR:%.*]]) {
; CHECK-NEXT:    [[PTR0:%.*]] = getelementptr i8, ptr [[PTR]], i64 0
; CHECK-NEXT:    [[PTR3:%.*]] = getelementptr i8, ptr [[PTR]], i64 4
; CHECK-NEXT:    [[TMP1:%.*]] = load <2 x i32>, ptr [[PTR0]], align 8
; CHECK-NEXT:    [[L11:%.*]] = extractelement <2 x i32> [[TMP1]], i32 0
; CHECK-NEXT:    [[L0_MUT2:%.*]] = extractelement <2 x i32> [[TMP1]], i32 1
; CHECK-NEXT:    call void @llvm.memset.p0.i64(ptr [[PTR3]], i8 0, i64 1, i1 false)
; CHECK-NEXT:    [[L0_MUT_BC:%.*]] = bitcast i32 [[L0_MUT2]] to <2 x i16>
; CHECK-NEXT:    ret void
;
  %ptr0 = getelementptr i8, ptr %ptr, i64 0
  %ptr3 = getelementptr i8, ptr %ptr, i64 4

  %l0 = load i32, ptr %ptr0, align 8
  call void @llvm.memset.p0.i64(ptr %ptr3, i8 0, i64 1, i1 false)
  %l1 = load <2 x i16>, ptr %ptr3, align 4

  ret void
}

; This merge 1) breaks the promise of "contiguous" that splitChainByContiguity is supposed to guarantee, and 2) results in a load
; that loads more than the original code, which we have established is not legal without using a masked load.
define void @miscompile2(ptr nocapture align 16 %ptr) {
; CHECK-LABEL: define void @miscompile2(
; CHECK-SAME: ptr align 16 captures(none) [[PTR:%.*]]) {
; CHECK-NEXT:    [[PTR0:%.*]] = getelementptr i8, ptr [[PTR]], i64 0
; CHECK-NEXT:    [[TMP1:%.*]] = load <4 x i32>, ptr [[PTR0]], align 16
; CHECK-NEXT:    [[L01:%.*]] = extractelement <4 x i32> [[TMP1]], i32 0
; CHECK-NEXT:    [[L1_MUT2:%.*]] = extractelement <4 x i32> [[TMP1]], i32 3
; CHECK-NEXT:    [[L1_MUT_BC:%.*]] = bitcast i32 [[L1_MUT2]] to <2 x i16>
; CHECK-NEXT:    ret void
;
  %ptr0 = getelementptr i8, ptr %ptr, i64 0
  ; 8 byte gap between ptr0 and ptr1
  %ptr1 = getelementptr i8, ptr %ptr, i64 12

  %l0 = load i32, ptr %ptr0, align 16
  %l1 = load <2 x i16>, ptr %ptr1, align 8

  ret void
}

declare void @llvm.memset.p0.i64(ptr, i8, i64, i1)

dakersnar added a commit that referenced this pull request Dec 2, 2025
gandhi56 pushed a commit that referenced this pull request Dec 2, 2025
Reverts #154069. I pointed out a number of issues
post-merge, most importantly examples of miscompiles:
#154069 (comment).

While the motivation of the change is clear, I think the implementation
approach is flawed. It seems like the goal is to allow elements like
`load <2xi16>` and `load i32` to be vectorized together despite the
current algorithm not grouping them into the same equivalence classes. I
personally think that if we want to attempt this it should be a more
wholistic approach, maybe even redefining the concept of an equivalence
class. This current solution seems like it would be really hard to do
bug-free, and even if the bugs were not present, it is only able to
merge chains that happen to be adjacent to each other after
`splitChainByContiguity`, which seems like it is leaving things up to
chance whether this optimization kicks in. But we can discuss more in
the re-land. Maybe the broader approach I'm proposing is too difficult,
and a narrow optimization is worthwhile. Regardless, this should be
reverted, it needs more iteration before it is correct.
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Dec 2, 2025
kcloudy0717 pushed a commit to kcloudy0717/llvm-project that referenced this pull request Dec 4, 2025
kcloudy0717 pushed a commit to kcloudy0717/llvm-project that referenced this pull request Dec 4, 2025
honeygoyal pushed a commit to honeygoyal/llvm-project that referenced this pull request Dec 9, 2025
honeygoyal pushed a commit to honeygoyal/llvm-project that referenced this pull request Dec 9, 2025
Labels

backend:AMDGPU llvm:globalisel llvm:instcombine llvm:transforms vectorizers

Successfully merging this pull request may close these issues:

[LSV] Load Store Vectorizer failed to combine two related loads.