[LoadStoreVectorizer] Fill gaps in load/store chains to enable vectorization #159388
Conversation
@llvm/pr-subscribers-vectorizers @llvm/pr-subscribers-llvm-transforms

Author: Drew Kersnar (dakersnar)

Changes

This change introduces Gap Filling, an optimization that fills holes in otherwise-contiguous load/store chains to enable vectorization. It also introduces Chain Extending, which extends the end of a chain to the next power of 2.

This was originally motivated by the NVPTX target, but I tried to generalize it to be applicable to all targets that may use the LSV. One way I did so was by introducing a new TTI API called "isLegalToWidenLoads", allowing targets to opt in to these optimizations. I'm more than willing to make adjustments to improve the target-agnosticism of this change. I fully expect there are some issues and encourage feedback on how to improve things.

For stores, which unlike loads cannot be filled or extended without consequence, we only perform the optimization when we can generate a legal LLVM masked store intrinsic, masking off the additional elements. Determining legality for stores is a little tricky on the NVPTX side, because these intrinsics are only supported for 256-bit vectors. See the other PR I opened for the NVPTX lowering of masked store intrinsics, which includes NVPTX TTI changes that return true for isLegalMaskedStore under certain conditions: #159387. This change depends on that backend change, but I expect this one will require more discussion, so I am putting both up at the same time. The backend change will be merged first, assuming both are approved.

Patch is 95.34 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/159388.diff

15 Files Affected:
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 41ff54f0781a2..f8f134c833ea2 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -817,6 +817,12 @@ class TargetTransformInfo {
LLVM_ABI bool isLegalMaskedLoad(Type *DataType, Align Alignment,
unsigned AddressSpace) const;
+ /// Return true if it is legal to widen loads beyond their current width,
+ /// assuming the result is still well-aligned. For example, converting a load
+ /// i32 to a load i64, or vectorizing three continuous load i32s into a load
+ /// <4 x i32>.
+ LLVM_ABI bool isLegalToWidenLoads() const;
+
/// Return true if the target supports nontemporal store.
LLVM_ABI bool isLegalNTStore(Type *DataType, Align Alignment) const;
/// Return true if the target supports nontemporal load.
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 566e1cf51631a..55bd4bd709589 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -318,6 +318,8 @@ class TargetTransformInfoImplBase {
return false;
}
+ virtual bool isLegalToWidenLoads() const { return false; }
+
virtual bool isLegalNTStore(Type *DataType, Align Alignment) const {
// By default, assume nontemporal memory stores are available for stores
// that are aligned and have a size that is a power of 2.
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index 09b50c5270e57..89cda79558057 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -476,6 +476,10 @@ bool TargetTransformInfo::isLegalMaskedLoad(Type *DataType, Align Alignment,
return TTIImpl->isLegalMaskedLoad(DataType, Alignment, AddressSpace);
}
+bool TargetTransformInfo::isLegalToWidenLoads() const {
+ return TTIImpl->isLegalToWidenLoads();
+}
+
bool TargetTransformInfo::isLegalNTStore(Type *DataType,
Align Alignment) const {
return TTIImpl->isLegalNTStore(DataType, Alignment);
diff --git a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
index b32d931bd3074..d56cff1ce3695 100644
--- a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
+++ b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
@@ -72,6 +72,8 @@ class NVPTXTTIImpl final : public BasicTTIImplBase<NVPTXTTIImpl> {
return isLegalToVectorizeLoadChain(ChainSizeInBytes, Alignment, AddrSpace);
}
+ bool isLegalToWidenLoads() const override { return true; };
+
// NVPTX has infinite registers of all kinds, but the actual machine doesn't.
// We conservatively return 1 here which is just enough to enable the
// vectorizers but disables heuristics based on the number of registers.
diff --git a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
index 7b5137b0185ab..04f4e92826a52 100644
--- a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
@@ -119,6 +119,29 @@ using namespace llvm;
#define DEBUG_TYPE "load-store-vectorizer"
+cl::opt<bool>
+ ExtendLoads("vect-extend-loads", cl::Hidden,
+ cl::desc("Load more elements if the target VF is higher "
+ "than the chain length."),
+ cl::init(true));
+
+cl::opt<bool> ExtendStores(
+ "vect-extend-stores", cl::Hidden,
+ cl::desc("Store more elements if the target VF is higher "
+ "than the chain length and we have access to masked stores."),
+ cl::init(true));
+
+cl::opt<bool> FillLoadGaps(
+ "vect-fill-load-gaps", cl::Hidden,
+ cl::desc("Should Loads be introduced in gaps to enable vectorization."),
+ cl::init(true));
+
+cl::opt<bool>
+ FillStoreGaps("vect-fill-store-gaps", cl::Hidden,
+ cl::desc("Should Stores be introduced in gaps to enable "
+ "vectorization into masked stores."),
+ cl::init(true));
+
STATISTIC(NumVectorInstructions, "Number of vector accesses generated");
STATISTIC(NumScalarsVectorized, "Number of scalar accesses vectorized");
@@ -246,12 +269,16 @@ class Vectorizer {
const DataLayout &DL;
IRBuilder<> Builder;
- // We could erase instrs right after vectorizing them, but that can mess up
- // our BB iterators, and also can make the equivalence class keys point to
- // freed memory. This is fixable, but it's simpler just to wait until we're
- // done with the BB and erase all at once.
+ /// We could erase instrs right after vectorizing them, but that can mess up
+ /// our BB iterators, and also can make the equivalence class keys point to
+ /// freed memory. This is fixable, but it's simpler just to wait until we're
+ /// done with the BB and erase all at once.
SmallVector<Instruction *, 128> ToErase;
+ /// We insert load/store instructions and GEPs to fill gaps and extend chains
+ /// to enable vectorization. Keep track and delete them later.
+ DenseSet<Instruction *> ExtraElements;
+
public:
Vectorizer(Function &F, AliasAnalysis &AA, AssumptionCache &AC,
DominatorTree &DT, ScalarEvolution &SE, TargetTransformInfo &TTI)
@@ -344,6 +371,28 @@ class Vectorizer {
/// Postcondition: For all i, ret[i][0].second == 0, because the first instr
/// in the chain is the leader, and an instr touches distance 0 from itself.
std::vector<Chain> gatherChains(ArrayRef<Instruction *> Instrs);
+
+ /// Is a load/store with this alignment allowed by TTI and at least as fast
+ /// as an unvectorized load/store.
+ bool accessIsAllowedAndFast(unsigned SizeBytes, unsigned AS, Align Alignment,
+ unsigned VecElemBits) const;
+
+ /// Before attempting to fill gaps, check if the chain is a candidate for
+ /// a masked store, to save compile time if it is not possible for the address
+ /// space and element type.
+ bool shouldAttemptMaskedStore(const ArrayRef<ChainElem> C) const;
+
+ /// Create a new GEP and a new Load/Store instruction such that the GEP
+ /// is pointing at PrevElem + Offset. In the case of stores, store poison.
+ /// Extra elements will either be combined into a vector/masked store or
+ /// deleted before the end of the pass.
+ ChainElem createExtraElementAfter(const ChainElem &PrevElem, APInt Offset,
+ StringRef Prefix,
+ Align Alignment = Align(1));
+
+ /// Delete dead GEPs and extra Load/Store instructions created by
+ /// createExtraElementAfter
+ void deleteExtraElements();
};
class LoadStoreVectorizerLegacyPass : public FunctionPass {
@@ -457,12 +506,21 @@ bool Vectorizer::run() {
Changed |= runOnPseudoBB(*It, *std::next(It));
for (Instruction *I : ToErase) {
+ // These will get deleted in deleteExtraElements.
+ // This is because ExtraElements will include both extra elements
+ // that *were* vectorized and extra elements that *were not*
+ // vectorized. ToErase will only include extra elements that *were*
+ // vectorized, so in order to avoid double deletion we skip them here and
+ // handle them in deleteExtraElements.
+ if (ExtraElements.contains(I))
+ continue;
auto *PtrOperand = getLoadStorePointerOperand(I);
if (I->use_empty())
I->eraseFromParent();
RecursivelyDeleteTriviallyDeadInstructions(PtrOperand);
}
ToErase.clear();
+ deleteExtraElements();
}
return Changed;
@@ -623,6 +681,29 @@ std::vector<Chain> Vectorizer::splitChainByContiguity(Chain &C) {
dumpChain(C);
});
+ // If the chain is not contiguous, we try to fill the gap with "extra"
+ // elements to artificially make it contiguous, to try to enable
+ // vectorization.
+ // - Filling gaps in loads is always ok if the target supports widening loads.
+ // - For stores, we only fill gaps if there is a potentially legal masked
+ // store for the target. If later on, we don't end up with a chain that
+ // could be vectorized into a legal masked store, the chains with extra
+ // elements will be filtered out in splitChainByAlignment.
+ bool TryFillGaps = isa<LoadInst>(C[0].Inst)
+ ? (FillLoadGaps && TTI.isLegalToWidenLoads())
+ : (FillStoreGaps && shouldAttemptMaskedStore(C));
+
+ unsigned ASPtrBits =
+ DL.getIndexSizeInBits(getLoadStoreAddressSpace(C[0].Inst));
+
+ // Compute the alignment of the leader of the chain (which every stored offset
+ // is based on) using the current first element of the chain. This is
+ // conservative, we may be able to derive better alignment by iterating over
+ // the chain and finding the leader.
+ Align LeaderOfChainAlign =
+ commonAlignment(getLoadStoreAlignment(C[0].Inst),
+ C[0].OffsetFromLeader.abs().getLimitedValue());
+
std::vector<Chain> Ret;
Ret.push_back({C.front()});
@@ -633,7 +714,8 @@ std::vector<Chain> Vectorizer::splitChainByContiguity(Chain &C) {
unsigned SzBits = DL.getTypeSizeInBits(getLoadStoreType(&*Prev.Inst));
assert(SzBits % 8 == 0 && "Non-byte sizes should have been filtered out by "
"collectEquivalenceClass");
- APInt PrevReadEnd = Prev.OffsetFromLeader + SzBits / 8;
+ APInt PrevSzBytes = APInt(ASPtrBits, SzBits / 8);
+ APInt PrevReadEnd = Prev.OffsetFromLeader + PrevSzBytes;
// Add this instruction to the end of the current chain, or start a new one.
bool AreContiguous = It->OffsetFromLeader == PrevReadEnd;
@@ -642,10 +724,54 @@ std::vector<Chain> Vectorizer::splitChainByContiguity(Chain &C) {
<< *Prev.Inst << " (ends at offset " << PrevReadEnd
<< ") -> " << *It->Inst << " (starts at offset "
<< It->OffsetFromLeader << ")\n");
- if (AreContiguous)
+
+ if (AreContiguous) {
CurChain.push_back(*It);
- else
- Ret.push_back({*It});
+ continue;
+ }
+
+ // For now, we aren't filling gaps between load/stores of different sizes.
+ // Additionally, as a conservative heuristic, we only fill gaps of 1-2
+ // elements. Generating loads/stores with too many unused bytes has a side
+ // effect of increasing register pressure (on NVIDIA targets at least),
+ // which could cancel out the benefits of reducing number of load/stores.
+ if (TryFillGaps &&
+ SzBits == DL.getTypeSizeInBits(getLoadStoreType(It->Inst))) {
+ APInt OffsetOfGapStart = Prev.OffsetFromLeader + PrevSzBytes;
+ APInt GapSzBytes = It->OffsetFromLeader - OffsetOfGapStart;
+ if (GapSzBytes == PrevSzBytes) {
+ // There is a single gap between Prev and Curr, create one extra element
+ ChainElem NewElem = createExtraElementAfter(
+ Prev, PrevSzBytes, "GapFill",
+ commonAlignment(LeaderOfChainAlign,
+ OffsetOfGapStart.abs().getLimitedValue()));
+ CurChain.push_back(NewElem);
+ CurChain.push_back(*It);
+ continue;
+ }
+ // There are two gaps between Prev and Curr, only create two extra
+ // elements if Prev is the first element in a sequence of four.
+ // This has the highest chance of resulting in a beneficial vectorization.
+ if ((GapSzBytes == 2 * PrevSzBytes) && (CurChain.size() % 4 == 1)) {
+ ChainElem NewElem1 = createExtraElementAfter(
+ Prev, PrevSzBytes, "GapFill",
+ commonAlignment(LeaderOfChainAlign,
+ OffsetOfGapStart.abs().getLimitedValue()));
+ ChainElem NewElem2 = createExtraElementAfter(
+ NewElem1, PrevSzBytes, "GapFill",
+ commonAlignment(
+ LeaderOfChainAlign,
+ (OffsetOfGapStart + PrevSzBytes).abs().getLimitedValue()));
+ CurChain.push_back(NewElem1);
+ CurChain.push_back(NewElem2);
+ CurChain.push_back(*It);
+ continue;
+ }
+ }
+
+ // The chain is not contiguous and cannot be made contiguous with gap
+ // filling, so we need to start a new chain.
+ Ret.push_back({*It});
}
// Filter out length-1 chains, these are uninteresting.
@@ -721,6 +847,14 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
unsigned AS = getLoadStoreAddressSpace(C[0].Inst);
unsigned VecRegBytes = TTI.getLoadStoreVecRegBitWidth(AS) / 8;
+ // For compile time reasons, we cache whether or not the superset
+ // of all candidate chains contains any extra stores from earlier gap
+ // filling.
+ bool CandidateChainsMayContainExtraStores =
+ !IsLoadChain && any_of(C, [this](const ChainElem &E) {
+ return ExtraElements.contains(E.Inst);
+ });
+
std::vector<Chain> Ret;
for (unsigned CBegin = 0; CBegin < C.size(); ++CBegin) {
// Find candidate chains of size not greater than the largest vector reg.
@@ -769,41 +903,6 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
continue;
}
- // Is a load/store with this alignment allowed by TTI and at least as fast
- // as an unvectorized load/store?
- //
- // TTI and F are passed as explicit captures to WAR an MSVC misparse (??).
- auto IsAllowedAndFast = [&, SizeBytes = SizeBytes, &TTI = TTI,
- &F = F](Align Alignment) {
- if (Alignment.value() % SizeBytes == 0)
- return true;
- unsigned VectorizedSpeed = 0;
- bool AllowsMisaligned = TTI.allowsMisalignedMemoryAccesses(
- F.getContext(), SizeBytes * 8, AS, Alignment, &VectorizedSpeed);
- if (!AllowsMisaligned) {
- LLVM_DEBUG(dbgs()
- << "LSV: Access of " << SizeBytes << "B in addrspace "
- << AS << " with alignment " << Alignment.value()
- << " is misaligned, and therefore can't be vectorized.\n");
- return false;
- }
-
- unsigned ElementwiseSpeed = 0;
- (TTI).allowsMisalignedMemoryAccesses((F).getContext(), VecElemBits, AS,
- Alignment, &ElementwiseSpeed);
- if (VectorizedSpeed < ElementwiseSpeed) {
- LLVM_DEBUG(dbgs()
- << "LSV: Access of " << SizeBytes << "B in addrspace "
- << AS << " with alignment " << Alignment.value()
- << " has relative speed " << VectorizedSpeed
- << ", which is lower than the elementwise speed of "
- << ElementwiseSpeed
- << ". Therefore this access won't be vectorized.\n");
- return false;
- }
- return true;
- };
-
// If we're loading/storing from an alloca, align it if possible.
//
// FIXME: We eagerly upgrade the alignment, regardless of whether TTI
@@ -818,8 +917,7 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
isa<AllocaInst>(PtrOperand->stripPointerCasts());
Align Alignment = getLoadStoreAlignment(C[CBegin].Inst);
Align PrefAlign = Align(StackAdjustedAlignment);
- if (IsAllocaAccess && Alignment.value() % SizeBytes != 0 &&
- IsAllowedAndFast(PrefAlign)) {
+ if (IsAllocaAccess && Alignment.value() % SizeBytes != 0) {
Align NewAlign = getOrEnforceKnownAlignment(
PtrOperand, PrefAlign, DL, C[CBegin].Inst, nullptr, &DT);
if (NewAlign >= Alignment) {
@@ -831,7 +929,59 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
}
}
- if (!IsAllowedAndFast(Alignment)) {
+ Chain ExtendingLoadsStores;
+ bool ExtendChain = IsLoadChain
+ ? ExtendLoads
+ : ExtendStores;
+ if (ExtendChain && NumVecElems < TargetVF && NumVecElems % 2 != 0 &&
+ VecElemBits >= 8) {
+ // TargetVF may be a lot higher than NumVecElems,
+ // so only extend to the next power of 2.
+ assert(VecElemBits % 8 == 0);
+ unsigned VecElemBytes = VecElemBits / 8;
+ unsigned NewNumVecElems = PowerOf2Ceil(NumVecElems);
+ unsigned NewSizeBytes = VecElemBytes * NewNumVecElems;
+
+ assert(NewNumVecElems <= TargetVF);
+
+ LLVM_DEBUG(dbgs() << "LSV: attempting to extend chain of "
+ << NumVecElems << " "
+ << (IsLoadChain ? "loads" : "stores") << " to "
+ << NewNumVecElems << " elements\n");
+ // Do not artificially increase the chain if it becomes misaligned,
+ // otherwise we may unnecessary split the chain when the target actually
+ // supports non-pow2 VF.
+ if (accessIsAllowedAndFast(NewSizeBytes, AS, Alignment, VecElemBits) &&
+ ((IsLoadChain ? TTI.isLegalToWidenLoads()
+ : TTI.isLegalMaskedStore(
+ FixedVectorType::get(VecElemTy, NewNumVecElems),
+ Alignment, AS, /*IsMaskConstant=*/true)))) {
+ LLVM_DEBUG(dbgs()
+ << "LSV: extending " << (IsLoadChain ? "load" : "store")
+ << " chain of " << NumVecElems << " "
+ << (IsLoadChain ? "loads" : "stores")
+ << " with total byte size of " << SizeBytes << " to "
+ << NewNumVecElems << " "
+ << (IsLoadChain ? "loads" : "stores")
+ << " with total byte size of " << NewSizeBytes
+ << ", TargetVF=" << TargetVF << " \n");
+
+ unsigned ASPtrBits = DL.getIndexSizeInBits(AS);
+ ChainElem Prev = C[CEnd];
+ for (unsigned i = 0; i < (NewNumVecElems - NumVecElems); i++) {
+ ChainElem NewElem = createExtraElementAfter(
+ Prev, APInt(ASPtrBits, VecElemBytes), "Extend");
+ ExtendingLoadsStores.push_back(NewElem);
+ Prev = ExtendingLoadsStores.back();
+ }
+
+ // Update the size and number of elements for upcoming checks.
+ SizeBytes = NewSizeBytes;
+ NumVecElems = NewNumVecElems;
+ }
+ }
+
+ if (!accessIsAllowedAndFast(SizeBytes, AS, Alignment, VecElemBits)) {
LLVM_DEBUG(
dbgs() << "LSV: splitChainByAlignment discarding candidate chain "
"because its alignment is not AllowedAndFast: "
@@ -849,10 +999,41 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
continue;
}
+ if (CandidateChainsMayContainExtraStores) {
+ // The legality of adding extra stores to ExtendingLoadsStores has
+ // already been checked, but if the candidate chain contains extra
+ // stores from an earlier optimization, confirm legality now.
+ // This filter is essential because, when filling gaps in
+ // splitChainByContinuity, we queried the API to check that (for a given
+ // element type and address space) there *may* be a legal masked store
+ // we can try to create. Now, we need to check if the actual chain we
+ // ended up with is legal to turn into a masked store.
+ // This is relevant for NVPTX targets, for example, where a masked store
+ // is only legal if we have ended up with a 256-bit vector.
+ bool CandidateChainContainsExtraStores = llvm::any_of(
+ ArrayRef<ChainElem>(C).slice(CBegin, CEnd - CBegin + 1),
+ [this](const ChainElem &E) {
+ return ExtraElements.contains(E.Inst);
+ });
+
+ if (CandidateChainContainsExtraStores &&
+ !TTI.isLegalMaskedStore(
+ FixedVectorType::get(VecElemTy, NumVecElems), Alignment, AS,
+ /*IsMaskConstant=*/true)) {
+ LLVM_DEBUG(dbgs()
+ << "LSV: splitChainByAlignment discarding candidate chain "
+ "because it contains extra stores that we cannot "
+ "legally vectorize into a masked store \n");
+ continue;
+ }
+ }
+
// Hooray, we can vect...
[truncated]
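To make the new heuristics concrete, below is a minimal, standalone C++ sketch of the gap-filling rule added in splitChainByContiguity, reduced to plain byte offsets. It is illustrative only: the function names are mine, and it omits the patch's alignment, element-type, and TTI legality checks.

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// Byte offsets at which filler elements would be created between an element
// ending at PrevEnd and the next element starting at NextStart; empty if the
// gap should not be filled. ElemSize is the common access size in bytes and
// ChainLen is the number of elements gathered into the chain so far.
std::vector<uint64_t> gapFillOffsets(uint64_t PrevEnd, uint64_t NextStart,
                                     uint64_t ElemSize, std::size_t ChainLen) {
  uint64_t Gap = NextStart - PrevEnd;
  // One missing element: always a candidate for filling.
  if (Gap == ElemSize)
    return {PrevEnd};
  // Two missing elements: only fill when the previous element starts a
  // prospective group of four, the case most likely to vectorize profitably.
  if (Gap == 2 * ElemSize && ChainLen % 4 == 1)
    return {PrevEnd, PrevEnd + ElemSize};
  // Larger or irregular gaps: the vectorizer starts a new chain instead.
  return {};
}

int main() {
  // i32 accesses at byte offsets 0, 4, and 12: the element at offset 8 is
  // missing, so one filler access at offset 8 makes the chain contiguous.
  for (uint64_t Off : gapFillOffsets(/*PrevEnd=*/8, /*NextStart=*/12,
                                     /*ElemSize=*/4, /*ChainLen=*/2))
    std::cout << "fill at offset " << Off << "\n";
}

A companion sketch of the chain-extension rule in splitChainByAlignment, under the same caveats (simplified, names are mine): only odd-length chains shorter than the target VF are padded, and only up to the next power of two.

#include <cassert>
#include <cstdint>

// Stand-in for llvm::PowerOf2Ceil.
uint64_t nextPow2(uint64_t N) {
  uint64_t P = 1;
  while (P < N)
    P <<= 1;
  return P;
}

// Number of extra elements appended to a chain of NumVecElems same-sized
// elements, given the target vectorization factor TargetVF. Returns 0 when
// the chain is left alone (even length, or already at least TargetVF wide).
unsigned extraElemsForExtension(unsigned NumVecElems, unsigned TargetVF) {
  if (NumVecElems >= TargetVF || NumVecElems % 2 == 0)
    return 0;
  unsigned NewNumVecElems = nextPow2(NumVecElems);
  // Mirrors the patch's assert; assumes TargetVF is a power of two.
  assert(NewNumVecElems <= TargetVF && "never extend past the target VF");
  return NewNumVecElems - NumVecElems;
}

// Example: extraElemsForExtension(3, 8) == 1, so a chain of three i32 loads
// is padded to four and can become a single <4 x i32> load; for stores the
// extra lane would be masked off in the resulting masked store.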
|
@llvm/pr-subscribers-backend-nvptx Author: Drew Kersnar (dakersnar) ChangesThis change introduces Gap Filling, an optimization that aims to fill in holes in otherwise contiguous load/store chains to enable vectorization. It also introduces Chain Extending, which extends the end of a chain to the closest power of 2. This was originally motivated by the NVPTX target, but I tried to generalize it to be universally applicable to all targets that may use the LSV. One way I did so was introducing a new TTI API called "isLegalToWidenLoads", allowing targets to opt in to these optimizations. I'm more than willing to make adjustments to improve the target-agnostic-ness of this change. I fully expect there are some issues and encourage feedback on how to improve things. For stores, which unlike loads, cannot be filled/extended without consequence, we only perform the optimization when we can generate a legal llvm masked store intrinsic, masking off the additional elements. Determining legality for stores is a little tricky from the NVPTX side, because these intrinsics are only supported for 256-bit vectors. See the other PR I opened for the implementation of the NVPTX lowering of masked store intrinsics, which include NVPTX TTI changes that return true for isLegalMaskedStore under certain conditions: #159387. This change is dependent on that backend change, but I predict this change will require more discussion, so I am putting them both up at the same time. The backend change will be merged first assuming both are approved. Patch is 95.34 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/159388.diff 15 Files Affected:
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 41ff54f0781a2..f8f134c833ea2 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -817,6 +817,12 @@ class TargetTransformInfo {
LLVM_ABI bool isLegalMaskedLoad(Type *DataType, Align Alignment,
unsigned AddressSpace) const;
+ /// Return true if it is legal to widen loads beyond their current width,
+ /// assuming the result is still well-aligned. For example, converting a load
+ /// i32 to a load i64, or vectorizing three continuous load i32s into a load
+ /// <4 x i32>.
+ LLVM_ABI bool isLegalToWidenLoads() const;
+
/// Return true if the target supports nontemporal store.
LLVM_ABI bool isLegalNTStore(Type *DataType, Align Alignment) const;
/// Return true if the target supports nontemporal load.
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 566e1cf51631a..55bd4bd709589 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -318,6 +318,8 @@ class TargetTransformInfoImplBase {
return false;
}
+ virtual bool isLegalToWidenLoads() const { return false; }
+
virtual bool isLegalNTStore(Type *DataType, Align Alignment) const {
// By default, assume nontemporal memory stores are available for stores
// that are aligned and have a size that is a power of 2.
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index 09b50c5270e57..89cda79558057 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -476,6 +476,10 @@ bool TargetTransformInfo::isLegalMaskedLoad(Type *DataType, Align Alignment,
return TTIImpl->isLegalMaskedLoad(DataType, Alignment, AddressSpace);
}
+bool TargetTransformInfo::isLegalToWidenLoads() const {
+ return TTIImpl->isLegalToWidenLoads();
+}
+
bool TargetTransformInfo::isLegalNTStore(Type *DataType,
Align Alignment) const {
return TTIImpl->isLegalNTStore(DataType, Alignment);
diff --git a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
index b32d931bd3074..d56cff1ce3695 100644
--- a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
+++ b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
@@ -72,6 +72,8 @@ class NVPTXTTIImpl final : public BasicTTIImplBase<NVPTXTTIImpl> {
return isLegalToVectorizeLoadChain(ChainSizeInBytes, Alignment, AddrSpace);
}
+ bool isLegalToWidenLoads() const override { return true; };
+
// NVPTX has infinite registers of all kinds, but the actual machine doesn't.
// We conservatively return 1 here which is just enough to enable the
// vectorizers but disables heuristics based on the number of registers.
diff --git a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
index 7b5137b0185ab..04f4e92826a52 100644
--- a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
@@ -119,6 +119,29 @@ using namespace llvm;
#define DEBUG_TYPE "load-store-vectorizer"
+cl::opt<bool>
+ ExtendLoads("vect-extend-loads", cl::Hidden,
+ cl::desc("Load more elements if the target VF is higher "
+ "than the chain length."),
+ cl::init(true));
+
+cl::opt<bool> ExtendStores(
+ "vect-extend-stores", cl::Hidden,
+ cl::desc("Store more elements if the target VF is higher "
+ "than the chain length and we have access to masked stores."),
+ cl::init(true));
+
+cl::opt<bool> FillLoadGaps(
+ "vect-fill-load-gaps", cl::Hidden,
+ cl::desc("Should Loads be introduced in gaps to enable vectorization."),
+ cl::init(true));
+
+cl::opt<bool>
+ FillStoreGaps("vect-fill-store-gaps", cl::Hidden,
+ cl::desc("Should Stores be introduced in gaps to enable "
+ "vectorization into masked stores."),
+ cl::init(true));
+
STATISTIC(NumVectorInstructions, "Number of vector accesses generated");
STATISTIC(NumScalarsVectorized, "Number of scalar accesses vectorized");
@@ -246,12 +269,16 @@ class Vectorizer {
const DataLayout &DL;
IRBuilder<> Builder;
- // We could erase instrs right after vectorizing them, but that can mess up
- // our BB iterators, and also can make the equivalence class keys point to
- // freed memory. This is fixable, but it's simpler just to wait until we're
- // done with the BB and erase all at once.
+ /// We could erase instrs right after vectorizing them, but that can mess up
+ /// our BB iterators, and also can make the equivalence class keys point to
+ /// freed memory. This is fixable, but it's simpler just to wait until we're
+ /// done with the BB and erase all at once.
SmallVector<Instruction *, 128> ToErase;
+ /// We insert load/store instructions and GEPs to fill gaps and extend chains
+ /// to enable vectorization. Keep track and delete them later.
+ DenseSet<Instruction *> ExtraElements;
+
public:
Vectorizer(Function &F, AliasAnalysis &AA, AssumptionCache &AC,
DominatorTree &DT, ScalarEvolution &SE, TargetTransformInfo &TTI)
@@ -344,6 +371,28 @@ class Vectorizer {
/// Postcondition: For all i, ret[i][0].second == 0, because the first instr
/// in the chain is the leader, and an instr touches distance 0 from itself.
std::vector<Chain> gatherChains(ArrayRef<Instruction *> Instrs);
+
+ /// Is a load/store with this alignment allowed by TTI and at least as fast
+ /// as an unvectorized load/store.
+ bool accessIsAllowedAndFast(unsigned SizeBytes, unsigned AS, Align Alignment,
+ unsigned VecElemBits) const;
+
+ /// Before attempting to fill gaps, check if the chain is a candidate for
+ /// a masked store, to save compile time if it is not possible for the address
+ /// space and element type.
+ bool shouldAttemptMaskedStore(const ArrayRef<ChainElem> C) const;
+
+ /// Create a new GEP and a new Load/Store instruction such that the GEP
+ /// is pointing at PrevElem + Offset. In the case of stores, store poison.
+ /// Extra elements will either be combined into a vector/masked store or
+ /// deleted before the end of the pass.
+ ChainElem createExtraElementAfter(const ChainElem &PrevElem, APInt Offset,
+ StringRef Prefix,
+ Align Alignment = Align(1));
+
+ /// Delete dead GEPs and extra Load/Store instructions created by
+ /// createExtraElementAfter
+ void deleteExtraElements();
};
class LoadStoreVectorizerLegacyPass : public FunctionPass {
@@ -457,12 +506,21 @@ bool Vectorizer::run() {
Changed |= runOnPseudoBB(*It, *std::next(It));
for (Instruction *I : ToErase) {
+ // These will get deleted in deleteExtraElements.
+ // This is because ExtraElements will include both extra elements
+ // that *were* vectorized and extra elements that *were not*
+ // vectorized. ToErase will only include extra elements that *were*
+ // vectorized, so in order to avoid double deletion we skip them here and
+ // handle them in deleteExtraElements.
+ if (ExtraElements.contains(I))
+ continue;
auto *PtrOperand = getLoadStorePointerOperand(I);
if (I->use_empty())
I->eraseFromParent();
RecursivelyDeleteTriviallyDeadInstructions(PtrOperand);
}
ToErase.clear();
+ deleteExtraElements();
}
return Changed;
@@ -623,6 +681,29 @@ std::vector<Chain> Vectorizer::splitChainByContiguity(Chain &C) {
dumpChain(C);
});
+ // If the chain is not contiguous, we try to fill the gap with "extra"
+ // elements to artificially make it contiguous, to try to enable
+ // vectorization.
+ // - Filling gaps in loads is always ok if the target supports widening loads.
+ // - For stores, we only fill gaps if there is a potentially legal masked
+ // store for the target. If later on, we don't end up with a chain that
+ // could be vectorized into a legal masked store, the chains with extra
+ // elements will be filtered out in splitChainByAlignment.
+ bool TryFillGaps = isa<LoadInst>(C[0].Inst)
+ ? (FillLoadGaps && TTI.isLegalToWidenLoads())
+ : (FillStoreGaps && shouldAttemptMaskedStore(C));
+
+ unsigned ASPtrBits =
+ DL.getIndexSizeInBits(getLoadStoreAddressSpace(C[0].Inst));
+
+ // Compute the alignment of the leader of the chain (which every stored offset
+ // is based on) using the current first element of the chain. This is
+ // conservative, we may be able to derive better alignment by iterating over
+ // the chain and finding the leader.
+ Align LeaderOfChainAlign =
+ commonAlignment(getLoadStoreAlignment(C[0].Inst),
+ C[0].OffsetFromLeader.abs().getLimitedValue());
+
std::vector<Chain> Ret;
Ret.push_back({C.front()});
@@ -633,7 +714,8 @@ std::vector<Chain> Vectorizer::splitChainByContiguity(Chain &C) {
unsigned SzBits = DL.getTypeSizeInBits(getLoadStoreType(&*Prev.Inst));
assert(SzBits % 8 == 0 && "Non-byte sizes should have been filtered out by "
"collectEquivalenceClass");
- APInt PrevReadEnd = Prev.OffsetFromLeader + SzBits / 8;
+ APInt PrevSzBytes = APInt(ASPtrBits, SzBits / 8);
+ APInt PrevReadEnd = Prev.OffsetFromLeader + PrevSzBytes;
// Add this instruction to the end of the current chain, or start a new one.
bool AreContiguous = It->OffsetFromLeader == PrevReadEnd;
@@ -642,10 +724,54 @@ std::vector<Chain> Vectorizer::splitChainByContiguity(Chain &C) {
<< *Prev.Inst << " (ends at offset " << PrevReadEnd
<< ") -> " << *It->Inst << " (starts at offset "
<< It->OffsetFromLeader << ")\n");
- if (AreContiguous)
+
+ if (AreContiguous) {
CurChain.push_back(*It);
- else
- Ret.push_back({*It});
+ continue;
+ }
+
+ // For now, we aren't filling gaps between load/stores of different sizes.
+ // Additionally, as a conservative heuristic, we only fill gaps of 1-2
+ // elements. Generating loads/stores with too many unused bytes has a side
+ // effect of increasing register pressure (on NVIDIA targets at least),
+ // which could cancel out the benefits of reducing number of load/stores.
+ if (TryFillGaps &&
+ SzBits == DL.getTypeSizeInBits(getLoadStoreType(It->Inst))) {
+ APInt OffsetOfGapStart = Prev.OffsetFromLeader + PrevSzBytes;
+ APInt GapSzBytes = It->OffsetFromLeader - OffsetOfGapStart;
+ if (GapSzBytes == PrevSzBytes) {
+ // There is a single gap between Prev and Curr, create one extra element
+ ChainElem NewElem = createExtraElementAfter(
+ Prev, PrevSzBytes, "GapFill",
+ commonAlignment(LeaderOfChainAlign,
+ OffsetOfGapStart.abs().getLimitedValue()));
+ CurChain.push_back(NewElem);
+ CurChain.push_back(*It);
+ continue;
+ }
+ // There are two gaps between Prev and Curr, only create two extra
+ // elements if Prev is the first element in a sequence of four.
+ // This has the highest chance of resulting in a beneficial vectorization.
+ if ((GapSzBytes == 2 * PrevSzBytes) && (CurChain.size() % 4 == 1)) {
+ ChainElem NewElem1 = createExtraElementAfter(
+ Prev, PrevSzBytes, "GapFill",
+ commonAlignment(LeaderOfChainAlign,
+ OffsetOfGapStart.abs().getLimitedValue()));
+ ChainElem NewElem2 = createExtraElementAfter(
+ NewElem1, PrevSzBytes, "GapFill",
+ commonAlignment(
+ LeaderOfChainAlign,
+ (OffsetOfGapStart + PrevSzBytes).abs().getLimitedValue()));
+ CurChain.push_back(NewElem1);
+ CurChain.push_back(NewElem2);
+ CurChain.push_back(*It);
+ continue;
+ }
+ }
+
+ // The chain is not contiguous and cannot be made contiguous with gap
+ // filling, so we need to start a new chain.
+ Ret.push_back({*It});
}
// Filter out length-1 chains, these are uninteresting.
@@ -721,6 +847,14 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
unsigned AS = getLoadStoreAddressSpace(C[0].Inst);
unsigned VecRegBytes = TTI.getLoadStoreVecRegBitWidth(AS) / 8;
+ // For compile time reasons, we cache whether or not the superset
+ // of all candidate chains contains any extra stores from earlier gap
+ // filling.
+ bool CandidateChainsMayContainExtraStores =
+ !IsLoadChain && any_of(C, [this](const ChainElem &E) {
+ return ExtraElements.contains(E.Inst);
+ });
+
std::vector<Chain> Ret;
for (unsigned CBegin = 0; CBegin < C.size(); ++CBegin) {
// Find candidate chains of size not greater than the largest vector reg.
@@ -769,41 +903,6 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
continue;
}
- // Is a load/store with this alignment allowed by TTI and at least as fast
- // as an unvectorized load/store?
- //
- // TTI and F are passed as explicit captures to WAR an MSVC misparse (??).
- auto IsAllowedAndFast = [&, SizeBytes = SizeBytes, &TTI = TTI,
- &F = F](Align Alignment) {
- if (Alignment.value() % SizeBytes == 0)
- return true;
- unsigned VectorizedSpeed = 0;
- bool AllowsMisaligned = TTI.allowsMisalignedMemoryAccesses(
- F.getContext(), SizeBytes * 8, AS, Alignment, &VectorizedSpeed);
- if (!AllowsMisaligned) {
- LLVM_DEBUG(dbgs()
- << "LSV: Access of " << SizeBytes << "B in addrspace "
- << AS << " with alignment " << Alignment.value()
- << " is misaligned, and therefore can't be vectorized.\n");
- return false;
- }
-
- unsigned ElementwiseSpeed = 0;
- (TTI).allowsMisalignedMemoryAccesses((F).getContext(), VecElemBits, AS,
- Alignment, &ElementwiseSpeed);
- if (VectorizedSpeed < ElementwiseSpeed) {
- LLVM_DEBUG(dbgs()
- << "LSV: Access of " << SizeBytes << "B in addrspace "
- << AS << " with alignment " << Alignment.value()
- << " has relative speed " << VectorizedSpeed
- << ", which is lower than the elementwise speed of "
- << ElementwiseSpeed
- << ". Therefore this access won't be vectorized.\n");
- return false;
- }
- return true;
- };
-
// If we're loading/storing from an alloca, align it if possible.
//
// FIXME: We eagerly upgrade the alignment, regardless of whether TTI
@@ -818,8 +917,7 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
isa<AllocaInst>(PtrOperand->stripPointerCasts());
Align Alignment = getLoadStoreAlignment(C[CBegin].Inst);
Align PrefAlign = Align(StackAdjustedAlignment);
- if (IsAllocaAccess && Alignment.value() % SizeBytes != 0 &&
- IsAllowedAndFast(PrefAlign)) {
+ if (IsAllocaAccess && Alignment.value() % SizeBytes != 0) {
Align NewAlign = getOrEnforceKnownAlignment(
PtrOperand, PrefAlign, DL, C[CBegin].Inst, nullptr, &DT);
if (NewAlign >= Alignment) {
@@ -831,7 +929,59 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
}
}
- if (!IsAllowedAndFast(Alignment)) {
+ Chain ExtendingLoadsStores;
+ bool ExtendChain = IsLoadChain
+ ? ExtendLoads
+ : ExtendStores;
+ if (ExtendChain && NumVecElems < TargetVF && NumVecElems % 2 != 0 &&
+ VecElemBits >= 8) {
+ // TargetVF may be a lot higher than NumVecElems,
+ // so only extend to the next power of 2.
+ assert(VecElemBits % 8 == 0);
+ unsigned VecElemBytes = VecElemBits / 8;
+ unsigned NewNumVecElems = PowerOf2Ceil(NumVecElems);
+ unsigned NewSizeBytes = VecElemBytes * NewNumVecElems;
+
+ assert(NewNumVecElems <= TargetVF);
+
+ LLVM_DEBUG(dbgs() << "LSV: attempting to extend chain of "
+ << NumVecElems << " "
+ << (IsLoadChain ? "loads" : "stores") << " to "
+ << NewNumVecElems << " elements\n");
+ // Do not artificially increase the chain if it becomes misaligned,
+ // otherwise we may unnecessary split the chain when the target actually
+ // supports non-pow2 VF.
+ if (accessIsAllowedAndFast(NewSizeBytes, AS, Alignment, VecElemBits) &&
+ ((IsLoadChain ? TTI.isLegalToWidenLoads()
+ : TTI.isLegalMaskedStore(
+ FixedVectorType::get(VecElemTy, NewNumVecElems),
+ Alignment, AS, /*IsMaskConstant=*/true)))) {
+ LLVM_DEBUG(dbgs()
+ << "LSV: extending " << (IsLoadChain ? "load" : "store")
+ << " chain of " << NumVecElems << " "
+ << (IsLoadChain ? "loads" : "stores")
+ << " with total byte size of " << SizeBytes << " to "
+ << NewNumVecElems << " "
+ << (IsLoadChain ? "loads" : "stores")
+ << " with total byte size of " << NewSizeBytes
+ << ", TargetVF=" << TargetVF << " \n");
+
+ unsigned ASPtrBits = DL.getIndexSizeInBits(AS);
+ ChainElem Prev = C[CEnd];
+ for (unsigned i = 0; i < (NewNumVecElems - NumVecElems); i++) {
+ ChainElem NewElem = createExtraElementAfter(
+ Prev, APInt(ASPtrBits, VecElemBytes), "Extend");
+ ExtendingLoadsStores.push_back(NewElem);
+ Prev = ExtendingLoadsStores.back();
+ }
+
+ // Update the size and number of elements for upcoming checks.
+ SizeBytes = NewSizeBytes;
+ NumVecElems = NewNumVecElems;
+ }
+ }
+
+ if (!accessIsAllowedAndFast(SizeBytes, AS, Alignment, VecElemBits)) {
LLVM_DEBUG(
dbgs() << "LSV: splitChainByAlignment discarding candidate chain "
"because its alignment is not AllowedAndFast: "
@@ -849,10 +999,41 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
continue;
}
+ if (CandidateChainsMayContainExtraStores) {
+ // The legality of adding extra stores to ExtendingLoadsStores has
+ // already been checked, but if the candidate chain contains extra
+ // stores from an earlier optimization, confirm legality now.
+ // This filter is essential because, when filling gaps in
+ // splitChainByContinuity, we queried the API to check that (for a given
+ // element type and address space) there *may* be a legal masked store
+ // we can try to create. Now, we need to check if the actual chain we
+ // ended up with is legal to turn into a masked store.
+ // This is relevant for NVPTX targets, for example, where a masked store
+ // is only legal if we have ended up with a 256-bit vector.
+ bool CandidateChainContainsExtraStores = llvm::any_of(
+ ArrayRef<ChainElem>(C).slice(CBegin, CEnd - CBegin + 1),
+ [this](const ChainElem &E) {
+ return ExtraElements.contains(E.Inst);
+ });
+
+ if (CandidateChainContainsExtraStores &&
+ !TTI.isLegalMaskedStore(
+ FixedVectorType::get(VecElemTy, NumVecElems), Alignment, AS,
+ /*IsMaskConstant=*/true)) {
+ LLVM_DEBUG(dbgs()
+ << "LSV: splitChainByAlignment discarding candidate chain "
+ "because it contains extra stores that we cannot "
+ "legally vectorize into a masked store \n");
+ continue;
+ }
+ }
+
// Hooray, we can vect...
[truncated]
|
@llvm/pr-subscribers-llvm-analysis Author: Drew Kersnar (dakersnar) ChangesThis change introduces Gap Filling, an optimization that aims to fill in holes in otherwise contiguous load/store chains to enable vectorization. It also introduces Chain Extending, which extends the end of a chain to the closest power of 2. This was originally motivated by the NVPTX target, but I tried to generalize it to be universally applicable to all targets that may use the LSV. One way I did so was introducing a new TTI API called "isLegalToWidenLoads", allowing targets to opt in to these optimizations. I'm more than willing to make adjustments to improve the target-agnostic-ness of this change. I fully expect there are some issues and encourage feedback on how to improve things. For stores, which unlike loads, cannot be filled/extended without consequence, we only perform the optimization when we can generate a legal llvm masked store intrinsic, masking off the additional elements. Determining legality for stores is a little tricky from the NVPTX side, because these intrinsics are only supported for 256-bit vectors. See the other PR I opened for the implementation of the NVPTX lowering of masked store intrinsics, which include NVPTX TTI changes that return true for isLegalMaskedStore under certain conditions: #159387. This change is dependent on that backend change, but I predict this change will require more discussion, so I am putting them both up at the same time. The backend change will be merged first assuming both are approved. Patch is 95.34 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/159388.diff 15 Files Affected:
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 41ff54f0781a2..f8f134c833ea2 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -817,6 +817,12 @@ class TargetTransformInfo {
LLVM_ABI bool isLegalMaskedLoad(Type *DataType, Align Alignment,
unsigned AddressSpace) const;
+ /// Return true if it is legal to widen loads beyond their current width,
+ /// assuming the result is still well-aligned. For example, converting a load
+ /// i32 to a load i64, or vectorizing three continuous load i32s into a load
+ /// <4 x i32>.
+ LLVM_ABI bool isLegalToWidenLoads() const;
+
/// Return true if the target supports nontemporal store.
LLVM_ABI bool isLegalNTStore(Type *DataType, Align Alignment) const;
/// Return true if the target supports nontemporal load.
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 566e1cf51631a..55bd4bd709589 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -318,6 +318,8 @@ class TargetTransformInfoImplBase {
return false;
}
+ virtual bool isLegalToWidenLoads() const { return false; }
+
virtual bool isLegalNTStore(Type *DataType, Align Alignment) const {
// By default, assume nontemporal memory stores are available for stores
// that are aligned and have a size that is a power of 2.
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index 09b50c5270e57..89cda79558057 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -476,6 +476,10 @@ bool TargetTransformInfo::isLegalMaskedLoad(Type *DataType, Align Alignment,
return TTIImpl->isLegalMaskedLoad(DataType, Alignment, AddressSpace);
}
+bool TargetTransformInfo::isLegalToWidenLoads() const {
+ return TTIImpl->isLegalToWidenLoads();
+}
+
bool TargetTransformInfo::isLegalNTStore(Type *DataType,
Align Alignment) const {
return TTIImpl->isLegalNTStore(DataType, Alignment);
diff --git a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
index b32d931bd3074..d56cff1ce3695 100644
--- a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
+++ b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
@@ -72,6 +72,8 @@ class NVPTXTTIImpl final : public BasicTTIImplBase<NVPTXTTIImpl> {
return isLegalToVectorizeLoadChain(ChainSizeInBytes, Alignment, AddrSpace);
}
+ bool isLegalToWidenLoads() const override { return true; };
+
// NVPTX has infinite registers of all kinds, but the actual machine doesn't.
// We conservatively return 1 here which is just enough to enable the
// vectorizers but disables heuristics based on the number of registers.
diff --git a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
index 7b5137b0185ab..04f4e92826a52 100644
--- a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
@@ -119,6 +119,29 @@ using namespace llvm;
#define DEBUG_TYPE "load-store-vectorizer"
+cl::opt<bool>
+ ExtendLoads("vect-extend-loads", cl::Hidden,
+ cl::desc("Load more elements if the target VF is higher "
+ "than the chain length."),
+ cl::init(true));
+
+cl::opt<bool> ExtendStores(
+ "vect-extend-stores", cl::Hidden,
+ cl::desc("Store more elements if the target VF is higher "
+ "than the chain length and we have access to masked stores."),
+ cl::init(true));
+
+cl::opt<bool> FillLoadGaps(
+ "vect-fill-load-gaps", cl::Hidden,
+ cl::desc("Should Loads be introduced in gaps to enable vectorization."),
+ cl::init(true));
+
+cl::opt<bool>
+ FillStoreGaps("vect-fill-store-gaps", cl::Hidden,
+ cl::desc("Should Stores be introduced in gaps to enable "
+ "vectorization into masked stores."),
+ cl::init(true));
+
STATISTIC(NumVectorInstructions, "Number of vector accesses generated");
STATISTIC(NumScalarsVectorized, "Number of scalar accesses vectorized");
@@ -246,12 +269,16 @@ class Vectorizer {
const DataLayout &DL;
IRBuilder<> Builder;
- // We could erase instrs right after vectorizing them, but that can mess up
- // our BB iterators, and also can make the equivalence class keys point to
- // freed memory. This is fixable, but it's simpler just to wait until we're
- // done with the BB and erase all at once.
+ /// We could erase instrs right after vectorizing them, but that can mess up
+ /// our BB iterators, and also can make the equivalence class keys point to
+ /// freed memory. This is fixable, but it's simpler just to wait until we're
+ /// done with the BB and erase all at once.
SmallVector<Instruction *, 128> ToErase;
+ /// We insert load/store instructions and GEPs to fill gaps and extend chains
+ /// to enable vectorization. Keep track and delete them later.
+ DenseSet<Instruction *> ExtraElements;
+
public:
Vectorizer(Function &F, AliasAnalysis &AA, AssumptionCache &AC,
DominatorTree &DT, ScalarEvolution &SE, TargetTransformInfo &TTI)
@@ -344,6 +371,28 @@ class Vectorizer {
/// Postcondition: For all i, ret[i][0].second == 0, because the first instr
/// in the chain is the leader, and an instr touches distance 0 from itself.
std::vector<Chain> gatherChains(ArrayRef<Instruction *> Instrs);
+
+ /// Is a load/store with this alignment allowed by TTI and at least as fast
+ /// as an unvectorized load/store.
+ bool accessIsAllowedAndFast(unsigned SizeBytes, unsigned AS, Align Alignment,
+ unsigned VecElemBits) const;
+
+ /// Before attempting to fill gaps, check if the chain is a candidate for
+ /// a masked store, to save compile time if it is not possible for the address
+ /// space and element type.
+ bool shouldAttemptMaskedStore(const ArrayRef<ChainElem> C) const;
+
+ /// Create a new GEP and a new Load/Store instruction such that the GEP
+ /// is pointing at PrevElem + Offset. In the case of stores, store poison.
+ /// Extra elements will either be combined into a vector/masked store or
+ /// deleted before the end of the pass.
+ ChainElem createExtraElementAfter(const ChainElem &PrevElem, APInt Offset,
+ StringRef Prefix,
+ Align Alignment = Align(1));
+
+ /// Delete dead GEPs and extra Load/Store instructions created by
+ /// createExtraElementAfter
+ void deleteExtraElements();
};
class LoadStoreVectorizerLegacyPass : public FunctionPass {
@@ -457,12 +506,21 @@ bool Vectorizer::run() {
Changed |= runOnPseudoBB(*It, *std::next(It));
for (Instruction *I : ToErase) {
+ // These will get deleted in deleteExtraElements.
+ // This is because ExtraElements will include both extra elements
+ // that *were* vectorized and extra elements that *were not*
+ // vectorized. ToErase will only include extra elements that *were*
+ // vectorized, so in order to avoid double deletion we skip them here and
+ // handle them in deleteExtraElements.
+ if (ExtraElements.contains(I))
+ continue;
auto *PtrOperand = getLoadStorePointerOperand(I);
if (I->use_empty())
I->eraseFromParent();
RecursivelyDeleteTriviallyDeadInstructions(PtrOperand);
}
ToErase.clear();
+ deleteExtraElements();
}
return Changed;
@@ -623,6 +681,29 @@ std::vector<Chain> Vectorizer::splitChainByContiguity(Chain &C) {
dumpChain(C);
});
+ // If the chain is not contiguous, we try to fill the gap with "extra"
+ // elements to artificially make it contiguous, to try to enable
+ // vectorization.
+ // - Filling gaps in loads is always ok if the target supports widening loads.
+ // - For stores, we only fill gaps if there is a potentially legal masked
+ // store for the target. If later on, we don't end up with a chain that
+ // could be vectorized into a legal masked store, the chains with extra
+ // elements will be filtered out in splitChainByAlignment.
+ bool TryFillGaps = isa<LoadInst>(C[0].Inst)
+ ? (FillLoadGaps && TTI.isLegalToWidenLoads())
+ : (FillStoreGaps && shouldAttemptMaskedStore(C));
+
+ unsigned ASPtrBits =
+ DL.getIndexSizeInBits(getLoadStoreAddressSpace(C[0].Inst));
+
+ // Compute the alignment of the leader of the chain (which every stored offset
+ // is based on) using the current first element of the chain. This is
+ // conservative, we may be able to derive better alignment by iterating over
+ // the chain and finding the leader.
+ Align LeaderOfChainAlign =
+ commonAlignment(getLoadStoreAlignment(C[0].Inst),
+ C[0].OffsetFromLeader.abs().getLimitedValue());
+
std::vector<Chain> Ret;
Ret.push_back({C.front()});
@@ -633,7 +714,8 @@ std::vector<Chain> Vectorizer::splitChainByContiguity(Chain &C) {
unsigned SzBits = DL.getTypeSizeInBits(getLoadStoreType(&*Prev.Inst));
assert(SzBits % 8 == 0 && "Non-byte sizes should have been filtered out by "
"collectEquivalenceClass");
- APInt PrevReadEnd = Prev.OffsetFromLeader + SzBits / 8;
+ APInt PrevSzBytes = APInt(ASPtrBits, SzBits / 8);
+ APInt PrevReadEnd = Prev.OffsetFromLeader + PrevSzBytes;
// Add this instruction to the end of the current chain, or start a new one.
bool AreContiguous = It->OffsetFromLeader == PrevReadEnd;
@@ -642,10 +724,54 @@ std::vector<Chain> Vectorizer::splitChainByContiguity(Chain &C) {
<< *Prev.Inst << " (ends at offset " << PrevReadEnd
<< ") -> " << *It->Inst << " (starts at offset "
<< It->OffsetFromLeader << ")\n");
- if (AreContiguous)
+
+ if (AreContiguous) {
CurChain.push_back(*It);
- else
- Ret.push_back({*It});
+ continue;
+ }
+
+ // For now, we aren't filling gaps between load/stores of different sizes.
+ // Additionally, as a conservative heuristic, we only fill gaps of 1-2
+ // elements. Generating loads/stores with too many unused bytes has a side
+ // effect of increasing register pressure (on NVIDIA targets at least),
+ // which could cancel out the benefits of reducing number of load/stores.
+ if (TryFillGaps &&
+ SzBits == DL.getTypeSizeInBits(getLoadStoreType(It->Inst))) {
+ APInt OffsetOfGapStart = Prev.OffsetFromLeader + PrevSzBytes;
+ APInt GapSzBytes = It->OffsetFromLeader - OffsetOfGapStart;
+ if (GapSzBytes == PrevSzBytes) {
+ // There is a single gap between Prev and Curr, create one extra element
+ ChainElem NewElem = createExtraElementAfter(
+ Prev, PrevSzBytes, "GapFill",
+ commonAlignment(LeaderOfChainAlign,
+ OffsetOfGapStart.abs().getLimitedValue()));
+ CurChain.push_back(NewElem);
+ CurChain.push_back(*It);
+ continue;
+ }
+ // There are two gaps between Prev and Curr, only create two extra
+ // elements if Prev is the first element in a sequence of four.
+ // This has the highest chance of resulting in a beneficial vectorization.
+ if ((GapSzBytes == 2 * PrevSzBytes) && (CurChain.size() % 4 == 1)) {
+ ChainElem NewElem1 = createExtraElementAfter(
+ Prev, PrevSzBytes, "GapFill",
+ commonAlignment(LeaderOfChainAlign,
+ OffsetOfGapStart.abs().getLimitedValue()));
+ ChainElem NewElem2 = createExtraElementAfter(
+ NewElem1, PrevSzBytes, "GapFill",
+ commonAlignment(
+ LeaderOfChainAlign,
+ (OffsetOfGapStart + PrevSzBytes).abs().getLimitedValue()));
+ CurChain.push_back(NewElem1);
+ CurChain.push_back(NewElem2);
+ CurChain.push_back(*It);
+ continue;
+ }
+ }
+
+ // The chain is not contiguous and cannot be made contiguous with gap
+ // filling, so we need to start a new chain.
+ Ret.push_back({*It});
}
// Filter out length-1 chains, these are uninteresting.
@@ -721,6 +847,14 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
unsigned AS = getLoadStoreAddressSpace(C[0].Inst);
unsigned VecRegBytes = TTI.getLoadStoreVecRegBitWidth(AS) / 8;
+ // For compile time reasons, we cache whether or not the superset
+ // of all candidate chains contains any extra stores from earlier gap
+ // filling.
+ bool CandidateChainsMayContainExtraStores =
+ !IsLoadChain && any_of(C, [this](const ChainElem &E) {
+ return ExtraElements.contains(E.Inst);
+ });
+
std::vector<Chain> Ret;
for (unsigned CBegin = 0; CBegin < C.size(); ++CBegin) {
// Find candidate chains of size not greater than the largest vector reg.
@@ -769,41 +903,6 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
continue;
}
- // Is a load/store with this alignment allowed by TTI and at least as fast
- // as an unvectorized load/store?
- //
- // TTI and F are passed as explicit captures to WAR an MSVC misparse (??).
- auto IsAllowedAndFast = [&, SizeBytes = SizeBytes, &TTI = TTI,
- &F = F](Align Alignment) {
- if (Alignment.value() % SizeBytes == 0)
- return true;
- unsigned VectorizedSpeed = 0;
- bool AllowsMisaligned = TTI.allowsMisalignedMemoryAccesses(
- F.getContext(), SizeBytes * 8, AS, Alignment, &VectorizedSpeed);
- if (!AllowsMisaligned) {
- LLVM_DEBUG(dbgs()
- << "LSV: Access of " << SizeBytes << "B in addrspace "
- << AS << " with alignment " << Alignment.value()
- << " is misaligned, and therefore can't be vectorized.\n");
- return false;
- }
-
- unsigned ElementwiseSpeed = 0;
- (TTI).allowsMisalignedMemoryAccesses((F).getContext(), VecElemBits, AS,
- Alignment, &ElementwiseSpeed);
- if (VectorizedSpeed < ElementwiseSpeed) {
- LLVM_DEBUG(dbgs()
- << "LSV: Access of " << SizeBytes << "B in addrspace "
- << AS << " with alignment " << Alignment.value()
- << " has relative speed " << VectorizedSpeed
- << ", which is lower than the elementwise speed of "
- << ElementwiseSpeed
- << ". Therefore this access won't be vectorized.\n");
- return false;
- }
- return true;
- };
-
// If we're loading/storing from an alloca, align it if possible.
//
// FIXME: We eagerly upgrade the alignment, regardless of whether TTI
@@ -818,8 +917,7 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
isa<AllocaInst>(PtrOperand->stripPointerCasts());
Align Alignment = getLoadStoreAlignment(C[CBegin].Inst);
Align PrefAlign = Align(StackAdjustedAlignment);
- if (IsAllocaAccess && Alignment.value() % SizeBytes != 0 &&
- IsAllowedAndFast(PrefAlign)) {
+ if (IsAllocaAccess && Alignment.value() % SizeBytes != 0) {
Align NewAlign = getOrEnforceKnownAlignment(
PtrOperand, PrefAlign, DL, C[CBegin].Inst, nullptr, &DT);
if (NewAlign >= Alignment) {
@@ -831,7 +929,59 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
}
}
- if (!IsAllowedAndFast(Alignment)) {
+ Chain ExtendingLoadsStores;
+ bool ExtendChain = IsLoadChain
+ ? ExtendLoads
+ : ExtendStores;
+ if (ExtendChain && NumVecElems < TargetVF && NumVecElems % 2 != 0 &&
+ VecElemBits >= 8) {
+ // TargetVF may be a lot higher than NumVecElems,
+ // so only extend to the next power of 2.
+ assert(VecElemBits % 8 == 0);
+ unsigned VecElemBytes = VecElemBits / 8;
+ unsigned NewNumVecElems = PowerOf2Ceil(NumVecElems);
+ unsigned NewSizeBytes = VecElemBytes * NewNumVecElems;
+
+ assert(NewNumVecElems <= TargetVF);
+
+ LLVM_DEBUG(dbgs() << "LSV: attempting to extend chain of "
+ << NumVecElems << " "
+ << (IsLoadChain ? "loads" : "stores") << " to "
+ << NewNumVecElems << " elements\n");
+ // Do not artificially increase the chain if it becomes misaligned,
+ // otherwise we may unnecessary split the chain when the target actually
+ // supports non-pow2 VF.
+ if (accessIsAllowedAndFast(NewSizeBytes, AS, Alignment, VecElemBits) &&
+ ((IsLoadChain ? TTI.isLegalToWidenLoads()
+ : TTI.isLegalMaskedStore(
+ FixedVectorType::get(VecElemTy, NewNumVecElems),
+ Alignment, AS, /*IsMaskConstant=*/true)))) {
+ LLVM_DEBUG(dbgs()
+ << "LSV: extending " << (IsLoadChain ? "load" : "store")
+ << " chain of " << NumVecElems << " "
+ << (IsLoadChain ? "loads" : "stores")
+ << " with total byte size of " << SizeBytes << " to "
+ << NewNumVecElems << " "
+ << (IsLoadChain ? "loads" : "stores")
+ << " with total byte size of " << NewSizeBytes
+ << ", TargetVF=" << TargetVF << " \n");
+
+ unsigned ASPtrBits = DL.getIndexSizeInBits(AS);
+ ChainElem Prev = C[CEnd];
+ for (unsigned i = 0; i < (NewNumVecElems - NumVecElems); i++) {
+ ChainElem NewElem = createExtraElementAfter(
+ Prev, APInt(ASPtrBits, VecElemBytes), "Extend");
+ ExtendingLoadsStores.push_back(NewElem);
+ Prev = ExtendingLoadsStores.back();
+ }
+
+ // Update the size and number of elements for upcoming checks.
+ SizeBytes = NewSizeBytes;
+ NumVecElems = NewNumVecElems;
+ }
+ }
+
+ if (!accessIsAllowedAndFast(SizeBytes, AS, Alignment, VecElemBits)) {
LLVM_DEBUG(
dbgs() << "LSV: splitChainByAlignment discarding candidate chain "
"because its alignment is not AllowedAndFast: "
@@ -849,10 +999,41 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
continue;
}
+ if (CandidateChainsMayContainExtraStores) {
+ // The legality of adding extra stores to ExtendingLoadsStores has
+ // already been checked, but if the candidate chain contains extra
+ // stores from an earlier optimization, confirm legality now.
+ // This filter is essential because, when filling gaps in
+      // splitChainByContiguity, we queried the API to check that (for a given
+ // element type and address space) there *may* be a legal masked store
+ // we can try to create. Now, we need to check if the actual chain we
+ // ended up with is legal to turn into a masked store.
+ // This is relevant for NVPTX targets, for example, where a masked store
+ // is only legal if we have ended up with a 256-bit vector.
+ bool CandidateChainContainsExtraStores = llvm::any_of(
+ ArrayRef<ChainElem>(C).slice(CBegin, CEnd - CBegin + 1),
+ [this](const ChainElem &E) {
+ return ExtraElements.contains(E.Inst);
+ });
+
+ if (CandidateChainContainsExtraStores &&
+ !TTI.isLegalMaskedStore(
+ FixedVectorType::get(VecElemTy, NumVecElems), Alignment, AS,
+ /*IsMaskConstant=*/true)) {
+ LLVM_DEBUG(dbgs()
+ << "LSV: splitChainByAlignment discarding candidate chain "
+ "because it contains extra stores that we cannot "
+ "legally vectorize into a masked store \n");
+ continue;
+ }
+ }
+
// Hooray, we can vect...
[truncated]
✅ With the latest revision this PR passed the C/C++ code formatter.
/// assuming the result is still well-aligned. For example, converting a load
/// i32 to a load i64, or vectorizing three continuous load i32s into a load
/// <4 x i32>.
LLVM_ABI bool isLegalToWidenLoads() const;
This needs a context argument.
I'm assuming you're referring to an instance of LLVMContext, right? If so, done.
No. I mean a hook without parameters corresponding to a specific load is close to useless. At minimum would need address space, alignment, type etc.
Got it. I was unsure what we may want to check because for NVPTX the answer is "true" under any conditions, and I'm not familiar enough with other architectures to know what they may want to check.

Address space and (Element?) Type sound reasonable enough, but Alignment would make this less ergonomic to use. The use case of this API is to check whether widening loads is generally allowed for a given target, and the result is used to answer the question "can we attempt to fill gaps, which would eventually result in widened loads?" As documented in the comment, there is an assumption that the answer spit out by the API depends on the resulting widened load being sufficiently aligned. In this case, at the first point in the code where we call this API (splitChainByContiguity), we do not know the alignment of the resulting load, as alignment analysis and chain splitting (splitChainByAlignment) happen later.

What we would have to do to incorporate an alignment argument into the API is something similar to how I'm using the existing isLegalMaskedStore, which is this not-so-clean helper function: https://github.com/llvm/llvm-project/pull/159388/files#diff-e0eab10050c9ef433cb0d7bc38e32e8d7fe44cdb0cf7422ae7a98966bff53672R1865-R1892. The helper essentially converts the answer to "is this specific masked store legal" into the more general "are masked stores generally legal on this target", which is what we need to know at that point in the code. If you think that is best, I'm ok with it, but it feels a little hacky to me, so I was trying to come up with something better for this new API.
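For reference, the general shape of that kind of helper is sketched below. This is a simplified illustration, not the exact code behind the link: the probed widths and the best-case alignment are assumptions made up for the example, and it reuses the four-argument isLegalMaskedStore form from this patch.

```cpp
// Sketch: answer "are masked stores ever legal on this target for this
// element type and address space?" by probing a few plausible vector widths.
// Widths and the assumed best-case alignment are illustrative only.
static bool maskedStoresAreGenerallyLegal(const TargetTransformInfo &TTI,
                                          Type *ElemTy, unsigned AS) {
  const Align BestCaseAlign(16); // assume the chain ends up well-aligned
  for (unsigned NumElems : {2, 4, 8, 16}) {
    auto *VecTy = FixedVectorType::get(ElemTy, NumElems);
    if (TTI.isLegalMaskedStore(VecTy, BestCaseAlign, AS,
                               /*IsMaskConstant=*/true))
      return true;
  }
  return false;
}
```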
Adding AS as a parameter sounds reasonable to me, but I am not sure about the other "context". Currently, this feature is only enabled with the NVPTX target. If other targets want to enable this feature, they should modify the TTI API according to their own needs, rather than trying to guess someone else’s requirements here.
LoadInst + type it wants to widen to
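If the hook grows that kind of per-load context, the declaration might end up looking something like the sketch below. This is only an illustration of the suggestion, not what the patch currently implements; the exact parameter set is an assumption.

```cpp
/// Sketch only: a context-aware variant of the proposed hook. Returns true if
/// it is legal to widen \p LI so that it loads \p WideTy instead of its
/// current type (e.g. turning three contiguous i32 loads plus a gap into one
/// <4 x i32> load), given the address space and alignment of the access.
LLVM_ABI bool isLegalToWidenLoads(const LoadInst *LI, Type *WideTy,
                                  unsigned AddressSpace,
                                  Align Alignment) const;
```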
; ENABLED-NEXT: cvt.f32.f16 %r11, %rs5;
; ENABLED-NEXT: add.rn.f32 %r12, %r10, %r11;
; ENABLED-NEXT: cvt.rn.f16.f32 %rs9, %r12;
; ENABLED-NEXT: ld.v4.b32 {%r1, %r2, %r3, %r4}, [%rd1];
This does not look right. Our input is presumably an array of f16 elements, but we end up loading 4 x b32 and then appear to ignore the last two elements. It should have been ld.v2.b32, or perhaps the load should have remained ld.v4.f16.
Note the difference in the number of ld instructions in the PTX. The old output has two load instructions to load 5 b16s: a ld.v4.b16 and a ld.b16. The new version, in the LSV, "extends" the chain of 5 loads to the next power of two, a chain of 8 loads with 3 unused tail elements, vectorizing it into a single load <8 x i16>. This gets lowered by the backend to a ld.v4.b32, with 2.5 elements (containing the packed 5 b16s) ending up being used and the rest unused.
This reduction from two load instructions to one load instruction is an optimization.
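For readers following along, here is a reduced illustration of that shape. It is not the exact test from the diff; the function name, offsets, and alignment are made up, and the scalar results' uses are omitted.

```llvm
; Five contiguous f16 loads, 10 bytes total, from a 16-byte-aligned base.
define void @five_halfs(ptr align 16 %p) {
  %l0 = load half, ptr %p, align 16
  %p1 = getelementptr inbounds i8, ptr %p, i64 2
  %l1 = load half, ptr %p1, align 2
  %p2 = getelementptr inbounds i8, ptr %p, i64 4
  %l2 = load half, ptr %p2, align 4
  %p3 = getelementptr inbounds i8, ptr %p, i64 6
  %l3 = load half, ptr %p3, align 2
  %p4 = getelementptr inbounds i8, ptr %p, i64 8
  %l4 = load half, ptr %p4, align 8
  ; (uses of %l0..%l4 omitted)
  ret void
}
; With extension, the LSV emits one 16-byte load covering all eight lanes
; (the last three unused), which the NVPTX backend can lower to a single
; ld.v4.b32 instead of a ld.v4.b16 plus a ld.b16.
```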
I've missed the 5th load of f16. Generated code looks correct.
My next question is whether this extension is always beneficial. E.g. if we do that on shared memory, it may potentially increase bank contention due to the extra loads. In the worst case we'd waste ~25% of shared memory bandwidth for this particular extension from v5f16 to v4b32.
I think we should take AS info into account and have some sort of user-controllable knob to enable/disable the gap filling, if needed. E.g. it's probably always good for loads from global AS, it's a maybe for shared memory (fewer instructions may win over bank conflicts if the extra loads happen to be broadcast to other threads' loads, but would waste bandwidth otherwise), and we can't say much about generic AS, as it could go either way, I think.
For masked writes it's more likely to be a win, as we don't actually write extra data, so the potential downside is a possible register pressure bump.
if we do that on shared memory, it may potentially increase bank contention due to the extra loads.
I don't think that's a concern for CUDA GPUs, but it's a good idea to add AS as a parameter to the TTI API; other targets may want to control this feature for specific address spaces.
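One way an AS parameter could be consumed on the NVPTX side is sketched below. This is purely illustrative and not part of the patch: the method signature, the per-AS policy, and the EnableSharedGapFill flag are all assumptions standing in for whatever knob is ultimately chosen.

```cpp
// Hypothetical AS-aware hook for NVPTX; EnableSharedGapFill stands in for a
// user-controllable cl::opt knob that does not exist in this patch.
bool NVPTXTTIImpl::isLegalToWidenLoads(unsigned AddrSpace) const {
  switch (AddrSpace) {
  case ADDRESS_SPACE_GLOBAL:
    return true;                // extra global-memory loads are cheap
  case ADDRESS_SPACE_SHARED:
    return EnableSharedGapFill; // trade bank-conflict risk for fewer loads
  default:
    return false;               // generic/local: stay conservative
  }
}
```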
Co-authored-by: Matt Arsenault <[email protected]>
I'm a bit concerned about the direction of this change wrt the assumption that loads are always safe. Do you need this to work in cases where the memory is not known

The problem here is that even if this may be valid from a hardware perspective, it is not necessarily valid in LLVM IR. To give an obvious example, let's say you have a noalias store to just the "gap" element. Then introducing a load from that location will introduce UB, even if the result is ultimately "unused".

This also intersects with the larger question of whether allocations can have gaps (as in, munmapped regions). This change is not necessarily incompatible with that because in practice gaps would be at page granularity, but it does constrain the design space.
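To spell out the noalias example, here is a minimal made-up case (not from the PR's tests), assuming %q happens to point at the gap element at %p+8:

```llvm
; %q is noalias and, at runtime, aliases the gap element at %p+8. The scalar
; code only reads offsets 0, 4, and 12 through %p, so the noalias guarantee
; holds: %q's memory is touched only through %q.
define i32 @f(ptr %p, ptr noalias %q) {
  store i32 1, ptr %q
  %a = load i32, ptr %p
  %pb = getelementptr inbounds i8, ptr %p, i64 4
  %b = load i32, ptr %pb
  %pd = getelementptr inbounds i8, ptr %p, i64 12
  %d = load i32, ptr %pd
  %ab = add i32 %a, %b
  %abd = add i32 %ab, %d
  ret i32 %abd
}
; Filling the gap with "%v = load <4 x i32>, ptr %p" would also read %q's
; memory through a pointer not based on %q, violating noalias and making the
; widened program UB even though lane 2 is never used.
```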
Would widening into a masked load that masks the unused elements (so that they are not loaded) be legal in LLVM IR?
Yeah, a masked load is definitely fine.
I prototyped an approach that uses this instead, and it did work. The only real downsides are in implementation difficulty: it places a higher burden on the backend to efficiently lower masked loads, requires other passes to keep masked loads in mind when optimizing loads, etc. But if we are confident that the current proposed approach is functionally incorrect, then I can and will pivot to that approach instead. Two small details I want to get thoughts on before I proceed:
I wouldn't say I'm confident that we can't make it work in this form, but it's the kind of thing that gets bogged down in extended IR design discussions. If the masked load variant isn't too hard to get working, then it's probably more expedient to just go with that.
Yes, using poison passthru should be fine.
Without knowing the details: Yes, that sounds fine to me.
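Concretely, the widened access under that scheme would look roughly like the sketch below, assuming a <4 x i32> chain whose lane 2 is the gap and poison as the passthru for the masked-off lane (a hypothetical example, not output from this patch):

```llvm
declare <4 x i32> @llvm.masked.load.v4i32.p0(ptr, i32, <4 x i1>, <4 x i32>)

define <4 x i32> @widened(ptr %p) {
  ; Constant mask: lanes 0, 1, and 3 are the real elements; lane 2 (the gap)
  ; is never loaded, and its result lane is poison.
  %v = call <4 x i32> @llvm.masked.load.v4i32.p0(ptr %p, i32 16,
           <4 x i1> <i1 true, i1 true, i1 false, i1 true>, <4 x i32> poison)
  ret <4 x i32> %v
}
```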
I assume that if we use masked loads, we no longer need the new isLegalToWidenLoads API, and should determine legality in the same way that masked store legality is currently determined?
I'm working on the masked load implementation; it might take a while. I'll re-ping these two PRs once I'm ready for another review. Thanks for the feedback so far, folks.
I've updated this PR to generate masked loads as discussed, and it can be reviewed now. In the next day or two, I will update the NVPTX change that this is dependent on (#159387) to handle lowering of the masked load intrinsics. At that point they will both be ready for review. I'm hoping to merge them at the same time, back to back, once review iteration is done.
// supports non-pow2 VF.
if (accessIsAllowedAndFast(NewSizeBytes, AS, Alignment, VecElemBits) &&
    ((IsLoadChain &&
      TTI.isLegalMaskedLoad(
TODO: all these calls to isLegalMaskedLoad/Store need to be updated to pass in the new enum that represents a Constant mask that is being workshopped in the other PR.
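For illustration, the updated call sites would presumably end up along these lines once that enum lands. The enum name and values below are placeholders only, not the final API being workshopped in #159387:

```cpp
// Placeholder for the mask-kind enum under discussion in the other PR.
enum class MaskKind { Unknown, Variable, Constant };

// A gap-filled or extended chain always uses a compile-time-constant mask,
// so the call site would communicate that instead of a plain bool.
bool Legal = TTI.isLegalMaskedLoad(
    FixedVectorType::get(VecElemTy, NewNumVecElems), Alignment, AS,
    /*Mask=*/MaskKind::Constant);
```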
This change introduces Gap Filling, an optimization that aims to fill in holes in otherwise contiguous load/store chains to enable vectorization. It also introduces Chain Extending, which extends the end of a chain to the closest power of 2.
This was originally motivated by the NVPTX target, but I tried to generalize it to be universally applicable to all targets that may use the LSV. I'm more than willing to make adjustments to improve the target-agnostic-ness of this change. I fully expect there are some issues and encourage feedback on how to improve things.
For both loads and stores, we only perform the optimization when we can generate a legal LLVM masked load/store intrinsic, masking off the "extra" elements. Determining legality for stores is a little tricky from the NVPTX side, because these intrinsics are only supported for 256-bit vectors. See the other PR I opened for the implementation of the NVPTX lowering of masked store intrinsics, which includes NVPTX TTI changes that return true for isLegalMaskedStore under certain conditions: #159387. This change is dependent on that backend change, but I predict this change will require more discussion, so I am putting them both up at the same time. The backend change will be merged first, assuming both are approved.
Edited: both stores and loads must use masked intrinsics for this optimization to be legal.
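As a concrete, hypothetical illustration of the store side (not a test from this PR): a chain of seven i32 stores covering eight contiguous slots, with a gap at element 5, becomes a single 256-bit masked store whose constant mask switches off the filler lane.

```llvm
declare void @llvm.masked.store.v8i32.p0(<8 x i32>, ptr, i32, <8 x i1>)

; Sketch only: the shape of a gap-filled store chain after vectorization.
; Lane 5 is the gap; its mask bit is false, so nothing is written there.
define void @store_with_gap(ptr align 32 %p, <8 x i32> %v) {
  call void @llvm.masked.store.v8i32.p0(<8 x i32> %v, ptr %p, i32 32,
      <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 false, i1 true, i1 true>)
  ret void
}
```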