[LoadStoreVectorizer] Fill gaps in load/store chains to enable vectorization #159388
Conversation
@llvm/pr-subscribers-vectorizers @llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-backend-nvptx @llvm/pr-subscribers-llvm-analysis

Author: Drew Kersnar (dakersnar)

Changes

This change introduces Gap Filling, an optimization that aims to fill in holes in otherwise contiguous load/store chains to enable vectorization. It also introduces Chain Extending, which extends the end of a chain to the closest power of 2.

This was originally motivated by the NVPTX target, but I tried to generalize it to be universally applicable to all targets that may use the LSV. One way I did so was by introducing a new TTI API called "isLegalToWidenLoads", allowing targets to opt in to these optimizations. I'm more than willing to make adjustments to improve the target-agnosticism of this change. I fully expect there are some issues and encourage feedback on how to improve things.

For stores, which, unlike loads, cannot be filled/extended without consequence, we only perform the optimization when we can generate a legal LLVM masked store intrinsic, masking off the additional elements. Determining legality for stores is a little tricky on the NVPTX side, because these intrinsics are only supported for 256-bit vectors. See the other PR I opened for the implementation of the NVPTX lowering of masked store intrinsics, which includes NVPTX TTI changes that return true for isLegalMaskedStore under certain conditions: #159387. This change is dependent on that backend change, but I predict this change will require more discussion, so I am putting them both up at the same time. The backend change will be merged first, assuming both are approved.

Patch is 95.34 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/159388.diff

15 Files Affected:
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 41ff54f0781a2..f8f134c833ea2 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -817,6 +817,12 @@ class TargetTransformInfo {
LLVM_ABI bool isLegalMaskedLoad(Type *DataType, Align Alignment,
unsigned AddressSpace) const;
+ /// Return true if it is legal to widen loads beyond their current width,
+ /// assuming the result is still well-aligned. For example, converting a load
+ /// i32 to a load i64, or vectorizing three continuous load i32s into a load
+ /// <4 x i32>.
+ LLVM_ABI bool isLegalToWidenLoads() const;
+
/// Return true if the target supports nontemporal store.
LLVM_ABI bool isLegalNTStore(Type *DataType, Align Alignment) const;
/// Return true if the target supports nontemporal load.
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 566e1cf51631a..55bd4bd709589 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -318,6 +318,8 @@ class TargetTransformInfoImplBase {
return false;
}
+ virtual bool isLegalToWidenLoads() const { return false; }
+
virtual bool isLegalNTStore(Type *DataType, Align Alignment) const {
// By default, assume nontemporal memory stores are available for stores
// that are aligned and have a size that is a power of 2.
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index 09b50c5270e57..89cda79558057 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -476,6 +476,10 @@ bool TargetTransformInfo::isLegalMaskedLoad(Type *DataType, Align Alignment,
return TTIImpl->isLegalMaskedLoad(DataType, Alignment, AddressSpace);
}
+bool TargetTransformInfo::isLegalToWidenLoads() const {
+ return TTIImpl->isLegalToWidenLoads();
+}
+
bool TargetTransformInfo::isLegalNTStore(Type *DataType,
Align Alignment) const {
return TTIImpl->isLegalNTStore(DataType, Alignment);
diff --git a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
index b32d931bd3074..d56cff1ce3695 100644
--- a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
+++ b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
@@ -72,6 +72,8 @@ class NVPTXTTIImpl final : public BasicTTIImplBase<NVPTXTTIImpl> {
return isLegalToVectorizeLoadChain(ChainSizeInBytes, Alignment, AddrSpace);
}
+ bool isLegalToWidenLoads() const override { return true; };
+
// NVPTX has infinite registers of all kinds, but the actual machine doesn't.
// We conservatively return 1 here which is just enough to enable the
// vectorizers but disables heuristics based on the number of registers.
diff --git a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
index 7b5137b0185ab..04f4e92826a52 100644
--- a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
@@ -119,6 +119,29 @@ using namespace llvm;
#define DEBUG_TYPE "load-store-vectorizer"
+cl::opt<bool>
+ ExtendLoads("vect-extend-loads", cl::Hidden,
+ cl::desc("Load more elements if the target VF is higher "
+ "than the chain length."),
+ cl::init(true));
+
+cl::opt<bool> ExtendStores(
+ "vect-extend-stores", cl::Hidden,
+ cl::desc("Store more elements if the target VF is higher "
+ "than the chain length and we have access to masked stores."),
+ cl::init(true));
+
+cl::opt<bool> FillLoadGaps(
+ "vect-fill-load-gaps", cl::Hidden,
+ cl::desc("Should Loads be introduced in gaps to enable vectorization."),
+ cl::init(true));
+
+cl::opt<bool>
+ FillStoreGaps("vect-fill-store-gaps", cl::Hidden,
+ cl::desc("Should Stores be introduced in gaps to enable "
+ "vectorization into masked stores."),
+ cl::init(true));
+
STATISTIC(NumVectorInstructions, "Number of vector accesses generated");
STATISTIC(NumScalarsVectorized, "Number of scalar accesses vectorized");
@@ -246,12 +269,16 @@ class Vectorizer {
const DataLayout &DL;
IRBuilder<> Builder;
- // We could erase instrs right after vectorizing them, but that can mess up
- // our BB iterators, and also can make the equivalence class keys point to
- // freed memory. This is fixable, but it's simpler just to wait until we're
- // done with the BB and erase all at once.
+ /// We could erase instrs right after vectorizing them, but that can mess up
+ /// our BB iterators, and also can make the equivalence class keys point to
+ /// freed memory. This is fixable, but it's simpler just to wait until we're
+ /// done with the BB and erase all at once.
SmallVector<Instruction *, 128> ToErase;
+ /// We insert load/store instructions and GEPs to fill gaps and extend chains
+ /// to enable vectorization. Keep track and delete them later.
+ DenseSet<Instruction *> ExtraElements;
+
public:
Vectorizer(Function &F, AliasAnalysis &AA, AssumptionCache &AC,
DominatorTree &DT, ScalarEvolution &SE, TargetTransformInfo &TTI)
@@ -344,6 +371,28 @@ class Vectorizer {
/// Postcondition: For all i, ret[i][0].second == 0, because the first instr
/// in the chain is the leader, and an instr touches distance 0 from itself.
std::vector<Chain> gatherChains(ArrayRef<Instruction *> Instrs);
+
+ /// Is a load/store with this alignment allowed by TTI and at least as fast
+ /// as an unvectorized load/store.
+ bool accessIsAllowedAndFast(unsigned SizeBytes, unsigned AS, Align Alignment,
+ unsigned VecElemBits) const;
+
+ /// Before attempting to fill gaps, check if the chain is a candidate for
+ /// a masked store, to save compile time if it is not possible for the address
+ /// space and element type.
+ bool shouldAttemptMaskedStore(const ArrayRef<ChainElem> C) const;
+
+ /// Create a new GEP and a new Load/Store instruction such that the GEP
+ /// is pointing at PrevElem + Offset. In the case of stores, store poison.
+ /// Extra elements will either be combined into a vector/masked store or
+ /// deleted before the end of the pass.
+ ChainElem createExtraElementAfter(const ChainElem &PrevElem, APInt Offset,
+ StringRef Prefix,
+ Align Alignment = Align(1));
+
+ /// Delete dead GEPs and extra Load/Store instructions created by
+ /// createExtraElementAfter
+ void deleteExtraElements();
};
class LoadStoreVectorizerLegacyPass : public FunctionPass {
@@ -457,12 +506,21 @@ bool Vectorizer::run() {
Changed |= runOnPseudoBB(*It, *std::next(It));
for (Instruction *I : ToErase) {
+ // These will get deleted in deleteExtraElements.
+ // This is because ExtraElements will include both extra elements
+ // that *were* vectorized and extra elements that *were not*
+ // vectorized. ToErase will only include extra elements that *were*
+ // vectorized, so in order to avoid double deletion we skip them here and
+ // handle them in deleteExtraElements.
+ if (ExtraElements.contains(I))
+ continue;
auto *PtrOperand = getLoadStorePointerOperand(I);
if (I->use_empty())
I->eraseFromParent();
RecursivelyDeleteTriviallyDeadInstructions(PtrOperand);
}
ToErase.clear();
+ deleteExtraElements();
}
return Changed;
@@ -623,6 +681,29 @@ std::vector<Chain> Vectorizer::splitChainByContiguity(Chain &C) {
dumpChain(C);
});
+ // If the chain is not contiguous, we try to fill the gap with "extra"
+ // elements to artificially make it contiguous, to try to enable
+ // vectorization.
+ // - Filling gaps in loads is always ok if the target supports widening loads.
+ // - For stores, we only fill gaps if there is a potentially legal masked
+ // store for the target. If later on, we don't end up with a chain that
+ // could be vectorized into a legal masked store, the chains with extra
+ // elements will be filtered out in splitChainByAlignment.
+ bool TryFillGaps = isa<LoadInst>(C[0].Inst)
+ ? (FillLoadGaps && TTI.isLegalToWidenLoads())
+ : (FillStoreGaps && shouldAttemptMaskedStore(C));
+
+ unsigned ASPtrBits =
+ DL.getIndexSizeInBits(getLoadStoreAddressSpace(C[0].Inst));
+
+ // Compute the alignment of the leader of the chain (which every stored offset
+ // is based on) using the current first element of the chain. This is
+ // conservative, we may be able to derive better alignment by iterating over
+ // the chain and finding the leader.
+ Align LeaderOfChainAlign =
+ commonAlignment(getLoadStoreAlignment(C[0].Inst),
+ C[0].OffsetFromLeader.abs().getLimitedValue());
+
std::vector<Chain> Ret;
Ret.push_back({C.front()});
@@ -633,7 +714,8 @@ std::vector<Chain> Vectorizer::splitChainByContiguity(Chain &C) {
unsigned SzBits = DL.getTypeSizeInBits(getLoadStoreType(&*Prev.Inst));
assert(SzBits % 8 == 0 && "Non-byte sizes should have been filtered out by "
"collectEquivalenceClass");
- APInt PrevReadEnd = Prev.OffsetFromLeader + SzBits / 8;
+ APInt PrevSzBytes = APInt(ASPtrBits, SzBits / 8);
+ APInt PrevReadEnd = Prev.OffsetFromLeader + PrevSzBytes;
// Add this instruction to the end of the current chain, or start a new one.
bool AreContiguous = It->OffsetFromLeader == PrevReadEnd;
@@ -642,10 +724,54 @@ std::vector<Chain> Vectorizer::splitChainByContiguity(Chain &C) {
<< *Prev.Inst << " (ends at offset " << PrevReadEnd
<< ") -> " << *It->Inst << " (starts at offset "
<< It->OffsetFromLeader << ")\n");
- if (AreContiguous)
+
+ if (AreContiguous) {
CurChain.push_back(*It);
- else
- Ret.push_back({*It});
+ continue;
+ }
+
+ // For now, we aren't filling gaps between load/stores of different sizes.
+ // Additionally, as a conservative heuristic, we only fill gaps of 1-2
+ // elements. Generating loads/stores with too many unused bytes has a side
+ // effect of increasing register pressure (on NVIDIA targets at least),
+ // which could cancel out the benefits of reducing number of load/stores.
+ if (TryFillGaps &&
+ SzBits == DL.getTypeSizeInBits(getLoadStoreType(It->Inst))) {
+ APInt OffsetOfGapStart = Prev.OffsetFromLeader + PrevSzBytes;
+ APInt GapSzBytes = It->OffsetFromLeader - OffsetOfGapStart;
+ if (GapSzBytes == PrevSzBytes) {
+ // There is a single gap between Prev and Curr, create one extra element
+ ChainElem NewElem = createExtraElementAfter(
+ Prev, PrevSzBytes, "GapFill",
+ commonAlignment(LeaderOfChainAlign,
+ OffsetOfGapStart.abs().getLimitedValue()));
+ CurChain.push_back(NewElem);
+ CurChain.push_back(*It);
+ continue;
+ }
+ // There are two gaps between Prev and Curr, only create two extra
+ // elements if Prev is the first element in a sequence of four.
+ // This has the highest chance of resulting in a beneficial vectorization.
+ if ((GapSzBytes == 2 * PrevSzBytes) && (CurChain.size() % 4 == 1)) {
+ ChainElem NewElem1 = createExtraElementAfter(
+ Prev, PrevSzBytes, "GapFill",
+ commonAlignment(LeaderOfChainAlign,
+ OffsetOfGapStart.abs().getLimitedValue()));
+ ChainElem NewElem2 = createExtraElementAfter(
+ NewElem1, PrevSzBytes, "GapFill",
+ commonAlignment(
+ LeaderOfChainAlign,
+ (OffsetOfGapStart + PrevSzBytes).abs().getLimitedValue()));
+ CurChain.push_back(NewElem1);
+ CurChain.push_back(NewElem2);
+ CurChain.push_back(*It);
+ continue;
+ }
+ }
+
+ // The chain is not contiguous and cannot be made contiguous with gap
+ // filling, so we need to start a new chain.
+ Ret.push_back({*It});
}
// Filter out length-1 chains, these are uninteresting.
@@ -721,6 +847,14 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
unsigned AS = getLoadStoreAddressSpace(C[0].Inst);
unsigned VecRegBytes = TTI.getLoadStoreVecRegBitWidth(AS) / 8;
+ // For compile time reasons, we cache whether or not the superset
+ // of all candidate chains contains any extra stores from earlier gap
+ // filling.
+ bool CandidateChainsMayContainExtraStores =
+ !IsLoadChain && any_of(C, [this](const ChainElem &E) {
+ return ExtraElements.contains(E.Inst);
+ });
+
std::vector<Chain> Ret;
for (unsigned CBegin = 0; CBegin < C.size(); ++CBegin) {
// Find candidate chains of size not greater than the largest vector reg.
@@ -769,41 +903,6 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
continue;
}
- // Is a load/store with this alignment allowed by TTI and at least as fast
- // as an unvectorized load/store?
- //
- // TTI and F are passed as explicit captures to WAR an MSVC misparse (??).
- auto IsAllowedAndFast = [&, SizeBytes = SizeBytes, &TTI = TTI,
- &F = F](Align Alignment) {
- if (Alignment.value() % SizeBytes == 0)
- return true;
- unsigned VectorizedSpeed = 0;
- bool AllowsMisaligned = TTI.allowsMisalignedMemoryAccesses(
- F.getContext(), SizeBytes * 8, AS, Alignment, &VectorizedSpeed);
- if (!AllowsMisaligned) {
- LLVM_DEBUG(dbgs()
- << "LSV: Access of " << SizeBytes << "B in addrspace "
- << AS << " with alignment " << Alignment.value()
- << " is misaligned, and therefore can't be vectorized.\n");
- return false;
- }
-
- unsigned ElementwiseSpeed = 0;
- (TTI).allowsMisalignedMemoryAccesses((F).getContext(), VecElemBits, AS,
- Alignment, &ElementwiseSpeed);
- if (VectorizedSpeed < ElementwiseSpeed) {
- LLVM_DEBUG(dbgs()
- << "LSV: Access of " << SizeBytes << "B in addrspace "
- << AS << " with alignment " << Alignment.value()
- << " has relative speed " << VectorizedSpeed
- << ", which is lower than the elementwise speed of "
- << ElementwiseSpeed
- << ". Therefore this access won't be vectorized.\n");
- return false;
- }
- return true;
- };
-
// If we're loading/storing from an alloca, align it if possible.
//
// FIXME: We eagerly upgrade the alignment, regardless of whether TTI
@@ -818,8 +917,7 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
isa<AllocaInst>(PtrOperand->stripPointerCasts());
Align Alignment = getLoadStoreAlignment(C[CBegin].Inst);
Align PrefAlign = Align(StackAdjustedAlignment);
- if (IsAllocaAccess && Alignment.value() % SizeBytes != 0 &&
- IsAllowedAndFast(PrefAlign)) {
+ if (IsAllocaAccess && Alignment.value() % SizeBytes != 0) {
Align NewAlign = getOrEnforceKnownAlignment(
PtrOperand, PrefAlign, DL, C[CBegin].Inst, nullptr, &DT);
if (NewAlign >= Alignment) {
@@ -831,7 +929,59 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
}
}
- if (!IsAllowedAndFast(Alignment)) {
+ Chain ExtendingLoadsStores;
+ bool ExtendChain = IsLoadChain
+ ? ExtendLoads
+ : ExtendStores;
+ if (ExtendChain && NumVecElems < TargetVF && NumVecElems % 2 != 0 &&
+ VecElemBits >= 8) {
+ // TargetVF may be a lot higher than NumVecElems,
+ // so only extend to the next power of 2.
+ assert(VecElemBits % 8 == 0);
+ unsigned VecElemBytes = VecElemBits / 8;
+ unsigned NewNumVecElems = PowerOf2Ceil(NumVecElems);
+ unsigned NewSizeBytes = VecElemBytes * NewNumVecElems;
+
+ assert(NewNumVecElems <= TargetVF);
+
+ LLVM_DEBUG(dbgs() << "LSV: attempting to extend chain of "
+ << NumVecElems << " "
+ << (IsLoadChain ? "loads" : "stores") << " to "
+ << NewNumVecElems << " elements\n");
+ // Do not artificially increase the chain if it becomes misaligned,
+ // otherwise we may unnecessary split the chain when the target actually
+ // supports non-pow2 VF.
+ if (accessIsAllowedAndFast(NewSizeBytes, AS, Alignment, VecElemBits) &&
+ ((IsLoadChain ? TTI.isLegalToWidenLoads()
+ : TTI.isLegalMaskedStore(
+ FixedVectorType::get(VecElemTy, NewNumVecElems),
+ Alignment, AS, /*IsMaskConstant=*/true)))) {
+ LLVM_DEBUG(dbgs()
+ << "LSV: extending " << (IsLoadChain ? "load" : "store")
+ << " chain of " << NumVecElems << " "
+ << (IsLoadChain ? "loads" : "stores")
+ << " with total byte size of " << SizeBytes << " to "
+ << NewNumVecElems << " "
+ << (IsLoadChain ? "loads" : "stores")
+ << " with total byte size of " << NewSizeBytes
+ << ", TargetVF=" << TargetVF << " \n");
+
+ unsigned ASPtrBits = DL.getIndexSizeInBits(AS);
+ ChainElem Prev = C[CEnd];
+ for (unsigned i = 0; i < (NewNumVecElems - NumVecElems); i++) {
+ ChainElem NewElem = createExtraElementAfter(
+ Prev, APInt(ASPtrBits, VecElemBytes), "Extend");
+ ExtendingLoadsStores.push_back(NewElem);
+ Prev = ExtendingLoadsStores.back();
+ }
+
+ // Update the size and number of elements for upcoming checks.
+ SizeBytes = NewSizeBytes;
+ NumVecElems = NewNumVecElems;
+ }
+ }
+
+ if (!accessIsAllowedAndFast(SizeBytes, AS, Alignment, VecElemBits)) {
LLVM_DEBUG(
dbgs() << "LSV: splitChainByAlignment discarding candidate chain "
"because its alignment is not AllowedAndFast: "
@@ -849,10 +999,41 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
continue;
}
+ if (CandidateChainsMayContainExtraStores) {
+ // The legality of adding extra stores to ExtendingLoadsStores has
+ // already been checked, but if the candidate chain contains extra
+ // stores from an earlier optimization, confirm legality now.
+ // This filter is essential because, when filling gaps in
+ // splitChainByContinuity, we queried the API to check that (for a given
+ // element type and address space) there *may* be a legal masked store
+ // we can try to create. Now, we need to check if the actual chain we
+ // ended up with is legal to turn into a masked store.
+ // This is relevant for NVPTX targets, for example, where a masked store
+ // is only legal if we have ended up with a 256-bit vector.
+ bool CandidateChainContainsExtraStores = llvm::any_of(
+ ArrayRef<ChainElem>(C).slice(CBegin, CEnd - CBegin + 1),
+ [this](const ChainElem &E) {
+ return ExtraElements.contains(E.Inst);
+ });
+
+ if (CandidateChainContainsExtraStores &&
+ !TTI.isLegalMaskedStore(
+ FixedVectorType::get(VecElemTy, NumVecElems), Alignment, AS,
+ /*IsMaskConstant=*/true)) {
+ LLVM_DEBUG(dbgs()
+ << "LSV: splitChainByAlignment discarding candidate chain "
+ "because it contains extra stores that we cannot "
+ "legally vectorize into a masked store \n");
+ continue;
+ }
+ }
+
// Hooray, we can vect...
[truncated]
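To make the intended transformation concrete, here is a rough IR sketch (offsets, types, alignments, and the mask are invented for illustration and are not taken from the patch's tests). A load chain with a one-element hole is widened into a single vector load whose extra lane is simply ignored, while the analogous store chain is only filled when it can be emitted as a masked store with the gap lane masked off. Chain extending applies the same idea at the tail of a chain, rounding the element count up to the next power of 2.

```llvm
; Hypothetical load chain with a one-element (4-byte) hole at offset 8.
define i32 @load_gap(ptr %p) {
  %l0 = load i32, ptr %p, align 16
  %p1 = getelementptr inbounds i8, ptr %p, i64 4
  %l1 = load i32, ptr %p1, align 4
  %p3 = getelementptr inbounds i8, ptr %p, i64 12
  %l3 = load i32, ptr %p3, align 4
  %s = add i32 %l0, %l1
  %r = add i32 %s, %l3
  ret i32 %r
}

; With gap filling (and isLegalToWidenLoads() returning true for the target),
; the chain can become a single wide load; the lane at offset 8 is loaded but
; its value is never used.
define i32 @load_gap_filled(ptr %p) {
  %v = load <4 x i32>, ptr %p, align 16
  %l0 = extractelement <4 x i32> %v, i64 0
  %l1 = extractelement <4 x i32> %v, i64 1
  %l3 = extractelement <4 x i32> %v, i64 3
  %s = add i32 %l0, %l1
  %r = add i32 %s, %l3
  ret i32 %r
}

; The analogous store chain is only filled when the target reports a legal
; masked store for the resulting vector type; the gap lane (lane 2) is masked
; off rather than written. The alignment argument assumes %p is 16-byte aligned.
define void @store_gap_filled(ptr %p, i32 %a, i32 %b, i32 %c) {
  %v0 = insertelement <4 x i32> poison, i32 %a, i64 0
  %v1 = insertelement <4 x i32> %v0, i32 %b, i64 1
  %v2 = insertelement <4 x i32> %v1, i32 %c, i64 3
  call void @llvm.masked.store.v4i32.p0(<4 x i32> %v2, ptr %p, i32 16, <4 x i1> <i1 true, i1 true, i1 false, i1 true>)
  ret void
}

declare void @llvm.masked.store.v4i32.p0(<4 x i32>, ptr, i32 immarg, <4 x i1>)
```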
✅ With the latest revision this PR passed the C/C++ code formatter. |
I'm a bit concerned about the direction of this change wrt the assumption that loads are always safe. Do you need this to work in cases where the memory is not known to be dereferenceable? The problem here is that even if this may be valid from a hardware perspective, it is not necessarily valid in LLVM IR. To give an obvious example, let's say you have a noalias store to just the "gap" element. Then introducing a load from that location will introduce UB, even if the result is ultimately "unused". This also intersects with the larger question of whether allocations can have gaps (as in, munmap'ed regions). This change is not necessarily incompatible with that, because in practice gaps would be at page granularity, but it does constrain the design space. |
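(To make the concern concrete, here is a minimal LLVM IR sketch of the kind of input pattern under discussion; the function name, pointer layout, and types are hypothetical and not taken from the patch or its tests.)

```
; Three i32 loads covering bytes 0-3, 8-11, and 12-15 of %p; bytes 4-7 are a gap.
; If some other noalias pointer is the only legal way to access bytes 4-7,
; widening this chain into a plain <4 x i32> load would introduce UB, even if
; the extra lane is never used.
define <3 x i32> @gap_chain(ptr %p) {
  %a = load i32, ptr %p, align 16
  %p2 = getelementptr inbounds i8, ptr %p, i64 8
  %b = load i32, ptr %p2, align 8
  %p3 = getelementptr inbounds i8, ptr %p, i64 12
  %c = load i32, ptr %p3, align 4
  %v0 = insertelement <3 x i32> poison, i32 %a, i32 0
  %v1 = insertelement <3 x i32> %v0, i32 %b, i32 1
  %v2 = insertelement <3 x i32> %v1, i32 %c, i32 2
  ret <3 x i32> %v2
}
```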
Would widening into a masked load that masks the unused elements (so that they are not loaded) be legal in LLVM IR? |
Yeah, a masked load is definitely fine. |
I prototyped an approach that uses this instead, and it did work. Only real downsides are in implementation difficulty: it places a higher burden on the backend to efficiently lower masked loads, requires other passes to keep masked loads in mind when optimizing loads, etc. But if we are confident that the current proposed approach is functionally incorrect, then I can and will pivot to that approach instead. Two small details I want to get thoughts on before I proceed:
|
I wouldn't say I'm confident that we can't make it work in this form, but it's the kind of thing that gets bogged down in extended IR design discussions. If the masked load variant isn't too hard to get working, then it's probably more expedient to just go with that.
Yes, using poison passthru should be fine.
Without knowing the details: Yes, that sounds fine to me. |
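(For illustration, the masked-load form being agreed on here would look roughly like the following for the sketch above; this is a hand-written example with a constant mask and poison passthru, not output copied from the pass.)

```
; The gap element (lane 1, bytes 4-7) is masked off, so those bytes are never
; accessed and the corresponding result lane is poison.
declare <4 x i32> @llvm.masked.load.v4i32.p0(ptr, i32, <4 x i1>, <4 x i32>)

define <4 x i32> @gap_chain_vectorized(ptr %p) {
  %v = call <4 x i32> @llvm.masked.load.v4i32.p0(
           ptr %p, i32 16,
           <4 x i1> <i1 true, i1 false, i1 true, i1 true>,
           <4 x i32> poison)
  ret <4 x i32> %v
}
```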
|
I assume that if we use masked loads, we no longer need the new isLegalToWidenLoads API, and should determine legality in the same way that masked store legality is currently determined? |
|
I'm working on the masked load implementation; it might take a while. I'll re-ping these two PRs once I'm ready for another review. Thanks for the feedback so far, folks. |
|
I've updated this PR to generate masked loads as discussed, and it can be reviewed now. In the next day or two, I will update the NVPTX change that this is dependent on (#159387) to handle lowering of the masked load intrinsics. At that point they will both be ready for review. I'm hoping to merge them at the same time, back to back, once review iteration is done. |
|
Both PRs are now completed + updated with recent LLVM changes and ready for full review. |
Force-pushed 47cf25a to bba61b5
|
Added a test that combines redundant loads with gap filling and chain extending. I'm also going to temporarily push a cherry-pick of #159387 into this branch to ensure the pipeline is green, and then I'll rebase it away to make reviewing easier again. Just looking for showstoppers. If anyone happens to review this change while that is happening, ignore the changes from that commit.
Edit: These are the commits to review: https://github.com/llvm/llvm-project/pull/159388/files/551f136fe621027307a66954d1974cdd80432271, with ce4b7e0 just being here temporarily to check the pipeline.
Edit: checks are green, removed the cherry-pick. |
Force-pushed ce4b7e0 to 551f136
cmc-rep left a comment
The latest merge looks good to me
Artem-B left a comment
LGTM overall with a test nit.
|
Unfortunately my change has been reverted again. Looks like this PR will have to re-merge with the latest main. |
|
@cmc-rep Do you know the reason yet? Would it make sense for me to merge my change first before relanding yours? |
|
In the meantime, I'm going to merge the backend changes (#159387), now that this change has gotten approval. |
|
It causes an assertion failure in LLVM when compiling an AMD library. |
Is there a reproducer? Stack trace pointing to the specific assertion? Anything else that we can use to figure out what went wrong? |
This backend support will allow the LoadStoreVectorizer, in certain cases, to fill in gaps when creating load/store vectors and generate LLVM masked load/stores (https://llvm.org/docs/LangRef.html#llvm-masked-store-intrinsics). To accomplish this, changes are separated into two parts. This first part has the backend lowering and TTI changes, and a follow-up PR will have the LSV generate these intrinsics: #159388. In this backend change, masked loads get lowered to PTX with `#pragma "used_bytes_mask" [mask];` (https://docs.nvidia.com/cuda/parallel-thread-execution/#pragma-strings-used-bytes-mask), and masked stores get lowered to PTX using the new sink symbol syntax (https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-st).

# TTI Changes
TTI changes are needed because NVPTX only supports masked loads/stores with _constant_ masks. `ScalarizeMaskedMemIntrin.cpp` is adjusted to check that the mask is constant and pass that result into the TTI check. Behavior shouldn't change for non-NVPTX targets, which do not care whether the mask is variable or constant when determining legality, but all TTI files that implement these APIs need to be updated.

# Masked store lowering implementation details
If the masked stores make it to the NVPTX backend without being scalarized, they are handled by the following:
* `NVPTXISelLowering.cpp` - Sets up a custom operation action and handles it in lowerMSTORE. Similar handling to normal store vectors, except we read the mask and place a sentinel register `$noreg` in each position where the mask reads as false. For example,
```
t10: v8i1 = BUILD_VECTOR Constant:i1<-1>, Constant:i1<0>, Constant:i1<0>, Constant:i1<-1>, Constant:i1<-1>, Constant:i1<0>, Constant:i1<0>, Constant:i1<-1>
t11: ch = masked_store<(store unknown-size into %ir.lsr.iv28, align 32, addrspace 1)> t5:1, t5, t7, undef:i64, t10
->
STV_i32_v8 killed %13:int32regs, $noreg, $noreg, killed %16:int32regs, killed %17:int32regs, $noreg, $noreg, killed %20:int32regs, 0, 0, 1, 8, 0, 32, %4:int64regs, 0, debug-location !18 :: (store unknown-size into %ir.lsr.iv28, align 32, addrspace 1);
```
* `NVPTXInstrInfo.td` - Changes the definition of store vectors to allow for a mix of sink symbols and registers.
* `NVPTXInstPrinter.h/.cpp` - Handles the `$noreg` case by printing "_".

# Masked load lowering implementation details
Masked loads are routed to normal PTX loads, with one difference: a `#pragma "used_bytes_mask"` is emitted before the load instruction (https://docs.nvidia.com/cuda/parallel-thread-execution/#pragma-strings-used-bytes-mask). To accomplish this, a new operand is added to every NVPTXISD load type representing this mask.
* `NVPTXISelLowering.h/.cpp` - Masked loads are converted into normal NVPTXISD loads with a mask operand in two ways: 1) in type legalization through replaceLoadVector, which is the normal path, and 2) through LowerMLOAD, to handle the legal vector types (v2f16/v2bf16/v2i16/v4i8/v2f32) that will not be type legalized. Both share the same convertMLOADToLoadWithUsedBytesMask helper, and both default this operand to UINT32_MAX, representing all bytes on. For the latter, we need a new `NVPTXISD::MLoadV1` type to represent that edge case because we cannot put the used bytes mask operand on a generic LoadSDNode.
* `NVPTXISelDAGToDAG.cpp` - Extracts the used bytes mask from loads and adds it to the created machine instructions.
* `NVPTXInstPrinter.h/.cpp` - Prints the pragma when the used bytes mask isn't all ones.
* `NVPTXForwardParams.cpp`, `NVPTXReplaceImageHandles.cpp` - Update manual indexing of load operands to account for the new operand.
* `NVPTXInstrInfo.td`, `NVPTXIntrinsics.td` - Add the used bytes mask to the MI definitions.
* `NVPTXTagInvariantLoads.cpp` - Ensures that masked loads also get tagged as invariant.

Some generic changes that are needed:
* `LegalizeVectorTypes.cpp` - Ensure flags are preserved when splitting masked loads.
* `SelectionDAGBuilder.cpp` - Preserve `MD_invariant_load` on masked load SDNode creation.
This is the case causing the problem for my change: so it looks like my change is only reverted in the AMD internal branch, not here. I will add the fix here quickly. |
|
cc @gandhi56 for visibility |
Force-pushed 7cfc98c to 97e8a10
```
; CHECK-NEXT: store <3 x half> [[S4]], ptr [[IN4]], align 4
; CHECK-NEXT: ret void
;
%load1 = load <3 x half>, ptr %rd0, align 16
```
I updated this test to match what we expect to see after InferAlignment runs and puts an alignment of 16 onto this `load <3 x half>`. This is more representative of the input the vectorizer should expect. With that alignment plus the masked-load change, this test now extends the two `load <3 x half>`s into one `<8 x half>` masked load.
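(Roughly what that transformation looks like, as a hand-written sketch: the byte offsets, names, and mask below are assumptions for illustration, not copied from the test.)

```
; Hypothetical input: two <3 x half> loads at byte offsets 0 and 8 of %rd0.
; Lane 3 (bytes 6-7) is a gap and lane 7 (bytes 14-15) is the extension.
declare <8 x half> @llvm.masked.load.v8f16.p0(ptr, i32, <8 x i1>, <8 x half>)

define <8 x half> @extend_sketch(ptr %rd0) {
  ; before: %load1 = load <3 x half>, ptr %rd0, align 16
  ;         %gep   = getelementptr inbounds i8, ptr %rd0, i64 8
  ;         %load2 = load <3 x half>, ptr %gep, align 8
  ; after: one 16-byte masked load, with the gap and tail lanes masked off
  %v = call <8 x half> @llvm.masked.load.v8f16.p0(
          ptr %rd0, i32 16,
          <8 x i1> <i1 true, i1 true, i1 true, i1 false,
                    i1 true, i1 true, i1 true, i1 false>,
          <8 x half> poison)
  ret <8 x half> %v
}
```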
Force-pushed 42a1b3f to eb6df17
Force-pushed a0968d5 to 37cef5b
|
Likely merging this on Monday, assuming green pipelines. |

This change introduces Gap Filling, an optimization that aims to fill in holes in otherwise contiguous load/store chains to enable vectorization. It also introduces Chain Extending, which extends the end of a chain to the closest power of 2.
This was originally motivated by the NVPTX target, but I tried to generalize it to be universally applicable to all targets that may use the LSV. I'm more than willing to make adjustments to improve the target-agnostic-ness of this change. I fully expect there are some issues and encourage feedback on how to improve things.
For both loads and stores, we only perform the optimization when we can generate a legal LLVM masked load/store intrinsic, masking off the "extra" elements. Determining legality for stores is a little tricky from the NVPTX side, because these intrinsics are only supported for 256-bit vectors. See the other PR I opened for the implementation of the NVPTX lowering of masked store intrinsics, which includes NVPTX TTI changes that return true for isLegalMaskedStore under certain conditions: #159387. This change is dependent on that backend change, but I predict this change will require more discussion, so I am putting them both up at the same time. The backend change will be merged first assuming both are approved.
Edited: both stores and loads must use masked intrinsics for this optimization to be legal.
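(A hand-written sketch of the store side under these constraints: seven contiguous f32 stores extended to a 256-bit masked store, with the extra lane masked off by a constant mask. The names and layout are hypothetical, not taken from the pass or its tests.)

```
; Hypothetical: lanes 0-6 come from the seven original stores; lane 7 is the
; extension and is masked off, so its bytes are never written.
declare void @llvm.masked.store.v8f32.p0(<8 x float>, ptr, i32, <8 x i1>)

define void @store_extend_sketch(<8 x float> %vals, ptr %p) {
  ; the vectorizer would build %vals from the seven original stored values;
  ; lane 7 can be anything (e.g. poison) because it is never stored.
  call void @llvm.masked.store.v8f32.p0(
      <8 x float> %vals, ptr %p, i32 32,
      <8 x i1> <i1 true, i1 true, i1 true, i1 true,
                i1 true, i1 true, i1 true, i1 false>)
  ret void
}
```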