[Transform][LoadStoreVectorizer] allow redundant in Chain #163019
@llvm/pr-subscribers-llvm-globalisel @llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-vectorizers

Author: Gang Chen (cmc-rep)

Changes: This can absorb redundant loads when forming a vector load. It can be used to fix the situation created by VectorCombine. See: https://discourse.llvm.org/t/what-is-the-purpose-of-vectorizeloadinsert-in-the-vectorcombine-pass/88532

Full diff: https://github.com/llvm/llvm-project/pull/163019.diff

1 file affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
index 7b5137b0185ab..484a0b762ad12 100644
--- a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
@@ -157,6 +157,7 @@ using EqClassKey =
struct ChainElem {
Instruction *Inst;
APInt OffsetFromLeader;
+ bool Redundant = false; // Set to true when load is redundant.
ChainElem(Instruction *Inst, APInt OffsetFromLeader)
: Inst(std::move(Inst)), OffsetFromLeader(std::move(OffsetFromLeader)) {}
};
@@ -626,26 +627,33 @@ std::vector<Chain> Vectorizer::splitChainByContiguity(Chain &C) {
std::vector<Chain> Ret;
Ret.push_back({C.front()});
+ APInt PrevReadEnd = C[0].OffsetFromLeader +
+ DL.getTypeSizeInBits(getLoadStoreType(&*C[0].Inst)) / 8;
for (auto It = std::next(C.begin()), End = C.end(); It != End; ++It) {
// `prev` accesses offsets [PrevDistFromBase, PrevReadEnd).
auto &CurChain = Ret.back();
- const ChainElem &Prev = CurChain.back();
- unsigned SzBits = DL.getTypeSizeInBits(getLoadStoreType(&*Prev.Inst));
+ unsigned SzBits = DL.getTypeSizeInBits(getLoadStoreType(&*It->Inst));
assert(SzBits % 8 == 0 && "Non-byte sizes should have been filtered out by "
"collectEquivalenceClass");
- APInt PrevReadEnd = Prev.OffsetFromLeader + SzBits / 8;
// Add this instruction to the end of the current chain, or start a new one.
+ APInt ReadEnd = It->OffsetFromLeader + SzBits / 8;
+ bool IsRedundant = ReadEnd.sle(PrevReadEnd);
bool AreContiguous = It->OffsetFromLeader == PrevReadEnd;
- LLVM_DEBUG(dbgs() << "LSV: Instructions are "
- << (AreContiguous ? "" : "not ") << "contiguous: "
- << *Prev.Inst << " (ends at offset " << PrevReadEnd
- << ") -> " << *It->Inst << " (starts at offset "
+
+ LLVM_DEBUG(dbgs() << "LSV: Instruction is "
+ << (AreContiguous
+ ? "contiguous"
+ : ((IsRedundant ? "redundant" : "chain-breaker")))
+ << *It->Inst << " (starts at offset "
<< It->OffsetFromLeader << ")\n");
- if (AreContiguous)
+
+ It->Redundant = IsRedundant;
+ if (AreContiguous || IsRedundant)
CurChain.push_back(*It);
else
Ret.push_back({*It});
+ PrevReadEnd = APIntOps::smax(PrevReadEnd, ReadEnd);
}
// Filter out length-1 chains, these are uninteresting.
@@ -874,10 +882,12 @@ bool Vectorizer::vectorizeChain(Chain &C) {
Type *VecElemTy = getChainElemTy(C);
bool IsLoadChain = isa<LoadInst>(C[0].Inst);
unsigned AS = getLoadStoreAddressSpace(C[0].Inst);
- unsigned ChainBytes = std::accumulate(
- C.begin(), C.end(), 0u, [&](unsigned Bytes, const ChainElem &E) {
- return Bytes + DL.getTypeStoreSize(getLoadStoreType(E.Inst));
- });
+ unsigned ChainBytes = 0;
+ for (auto &E : C) {
+ if (E.Redundant)
+ continue;
+ ChainBytes += DL.getTypeStoreSize(getLoadStoreType(E.Inst));
+ }
assert(ChainBytes % DL.getTypeStoreSize(VecElemTy) == 0);
// VecTy is a power of 2 and 1 byte at smallest, but VecElemTy may be smaller
// than 1 byte (e.g. VecTy == <32 x i1>).
@@ -916,20 +926,19 @@ bool Vectorizer::vectorizeChain(Chain &C) {
getLoadStorePointerOperand(C[0].Inst),
Alignment);
- unsigned VecIdx = 0;
for (const ChainElem &E : C) {
Instruction *I = E.Inst;
Value *V;
Type *T = getLoadStoreType(I);
+ int EOffset = (E.OffsetFromLeader - C[0].OffsetFromLeader).getSExtValue();
+ int VecIdx = 8 * EOffset / DL.getTypeSizeInBits(VecElemTy);
if (auto *VT = dyn_cast<FixedVectorType>(T)) {
auto Mask = llvm::to_vector<8>(
llvm::seq<int>(VecIdx, VecIdx + VT->getNumElements()));
V = Builder.CreateShuffleVector(VecInst, Mask, I->getName());
- VecIdx += VT->getNumElements();
} else {
V = Builder.CreateExtractElement(VecInst, Builder.getInt32(VecIdx),
I->getName());
- ++VecIdx;
}
if (V->getType() != I->getType())
V = Builder.CreateBitOrPointerCast(V, I->getType());
@@ -964,22 +973,24 @@ bool Vectorizer::vectorizeChain(Chain &C) {
// Build the vector to store.
Value *Vec = PoisonValue::get(VecTy);
- unsigned VecIdx = 0;
- auto InsertElem = [&](Value *V) {
+ auto InsertElem = [&](Value *V, unsigned VecIdx) {
if (V->getType() != VecElemTy)
V = Builder.CreateBitOrPointerCast(V, VecElemTy);
- Vec = Builder.CreateInsertElement(Vec, V, Builder.getInt32(VecIdx++));
+ Vec = Builder.CreateInsertElement(Vec, V, Builder.getInt32(VecIdx));
};
for (const ChainElem &E : C) {
auto *I = cast<StoreInst>(E.Inst);
+ int EOffset = (E.OffsetFromLeader - C[0].OffsetFromLeader).getSExtValue();
+ int VecIdx = 8 * EOffset / DL.getTypeSizeInBits(VecElemTy);
if (FixedVectorType *VT =
dyn_cast<FixedVectorType>(getLoadStoreType(I))) {
for (int J = 0, JE = VT->getNumElements(); J < JE; ++J) {
InsertElem(Builder.CreateExtractElement(I->getValueOperand(),
- Builder.getInt32(J)));
+ Builder.getInt32(J)),
+ VecIdx++);
}
} else {
- InsertElem(I->getValueOperand());
+ InsertElem(I->getValueOperand(), VecIdx);
}
}
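To make the scenario concrete, here is a minimal hypothetical IR sketch (not taken from the PR or its tests; the function name, types, and offsets are invented) of the kind of redundant load that splitChainByContiguity can now keep in the chain instead of splitting on:

```llvm
define i32 @absorb_redundant(ptr %p) {
  ; Covers bytes [0, 8) of the chain.
  %v = load <2 x i32>, ptr %p, align 8
  %q = getelementptr inbounds i8, ptr %p, i64 4
  ; Covers bytes [4, 8): its read end does not extend past the previous
  ; read end, so with this patch it is marked Redundant instead of
  ; breaking the chain.
  %b = load i32, ptr %q, align 4
  %e = extractelement <2 x i32> %v, i64 0
  %r = add i32 %e, %b
  ret i32 %r
}
```

With the patch, the redundant scalar load should be served out of the vectorized load, via an extractelement at the index derived from its byte offset, rather than forcing a chain split.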
You can test this locally with the following command:

  git diff -U0 --pickaxe-regex -S '([^a-zA-Z0-9#_-]undef([^a-zA-Z0-9_-]|$)|UndefValue::get)' 'HEAD~1' HEAD llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-call.ll llvm/test/CodeGen/AMDGPU/branch-folding-implicit-def-subreg.ll llvm/test/CodeGen/AMDGPU/chain-hi-to-lo.ll llvm/test/CodeGen/AMDGPU/divergence-driven-trunc-to-i1.ll llvm/test/CodeGen/AMDGPU/exec-mask-opt-cannot-create-empty-or-backward-segment.ll llvm/test/CodeGen/AMDGPU/fmul-2-combine-multi-use.ll llvm/test/CodeGen/AMDGPU/mad_uint24.ll llvm/test/CodeGen/AMDGPU/sad.ll llvm/test/CodeGen/AMDGPU/simplifydemandedbits-recursion.ll llvm/test/CodeGen/AMDGPU/splitkit-getsubrangeformask.ll llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/multiple_tails.ll llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/vect-ptr-ptr-size-mismatch.ll llvm/test/Transforms/LoadStoreVectorizer/X86/subchain-interleaved.ll

The following files introduce new uses of undef:
Undef is now deprecated and should only be used in the rare cases where no replacement is possible. For example, a load of uninitialized memory yields undef. In tests, avoid using undef and having tests that trigger undefined behavior. If you need an operand with some unimportant value, you can add a new argument to the function and use that instead.

For example, this is considered a bad practice:

  define void @fn() {
    ...
    br i1 undef, ...
  }

Please use the following instead:

  define void @fn(i1 %cond) {
    ...
    br i1 %cond, ...
  }

Please refer to the Undefined Behavior Manual for more information.
I took a look at your linked discussion thread. Is there a reason this problem cannot be solved in VectorCombine itself? This feels like it might be a band-aid solution. |
No one has answered my question on Discourse. I feel it may be difficult to convince people that that specific function in VectorCombine only creates noise. My change to LoadStoreVectorizer not only cleans up that case, it also brings some benefit, as some of the test changes show. I wouldn't call it a band-aid; it is an enhancement to the LoadStoreVectorizer.
I think both points have merit, but...
It does sound like it's not a bug, but a feature. Whether a vector load is profitable would generally require target-specific knowledge. E.g., on NVPTX, some vector loads may be nearly free, while others are costly if too many loaded elements are unused. The LoadStoreVectorizer may be a reasonable place to handle some cases that we determine to be unprofitable. And we may still need to handle even more cases in the back-end.
But VectorCombine already uses TTI to get target-specific cost information. |
✅ With the latest revision this PR passed the C/C++ code formatter. |
I don't feel that VectorCombine is well equipped to understand the cost during "vectorizeLoadInsert". It has TTI; however, it only has a very narrow view of the IR, i.e. it does not look beyond the current load+insertelement. The LoadStoreVectorizer is better equipped for this.
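For reference, a rough sketch of the narrow window vectorizeLoadInsert pattern-matches and the shape it rewrites it into (hypothetical functions; the element type and width are arbitrary, and the exact output of VectorCombine may differ):

```llvm
; The pattern VectorCombine sees: a scalar load feeding lane 0 of an
; insertelement.
define <4 x float> @before(ptr align 16 %p) {
  %s = load float, ptr %p, align 16
  %v = insertelement <4 x float> poison, float %s, i64 0
  ret <4 x float> %v
}

; Roughly what vectorizeLoadInsert produces when TTI says it is cheap:
; a wider vector load plus a shuffle. Any neighbouring scalar loads of
; %p elsewhere in the function now partially overlap this wide load,
; which is the situation the LSV change is meant to absorb.
define <4 x float> @after(ptr align 16 %p) {
  %w = load <4 x float>, ptr %p, align 16
  %v = shufflevector <4 x float> %w, <4 x float> poison, <4 x i32> <i32 0, i32 poison, i32 poison, i32 poison>
  ret <4 x float> %v
}
```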
Heads up, I'm working on a change that overlaps with your change: #159388. Conceptually I don't see an issue with them both coexisting. One of us will have to resolve some conflicts but it shouldn't be a big deal. |
Ping, would like to get this in. |
nhaehnle left a comment:
LGTM, I do have some minor suggestions for improvement though.
                  << *Prev.Inst << " (ends at offset " << PrevReadEnd
                  << ") -> " << *It->Inst << " (starts at offset "
    APInt ReadEnd = It->OffsetFromLeader + SzBits / 8;
    // Alllow redundancy: partial or full overlaping counts as contiguous.
Typo: *Allow
    APInt ReadEnd = It->OffsetFromLeader + SzBits / 8;
    // Alllow redundancy: partial or full overlaping counts as contiguous.
    int ExtraBytes =
        PrevReadEnd.sle(ReadEnd) ? (ReadEnd - PrevReadEnd).getSExtValue() : 0;
    bool AreContiguous = It->OffsetFromLeader.sle(PrevReadEnd) &&
                         SzBits % ElemBytes == 0 && ExtraBytes % ElemBytes == 0;
Use an assertion where it feels like it should be applicable (element size is part of the equivalence class key during gatherChains) and simplify the logic a little bit.
-    APInt ReadEnd = It->OffsetFromLeader + SzBits / 8;
-    // Alllow redundancy: partial or full overlaping counts as contiguous.
-    int ExtraBytes =
-        PrevReadEnd.sle(ReadEnd) ? (ReadEnd - PrevReadEnd).getSExtValue() : 0;
-    bool AreContiguous = It->OffsetFromLeader.sle(PrevReadEnd) &&
-                         SzBits % ElemBytes == 0 && ExtraBytes % ElemBytes == 0;
+    uint64_t SzBytes = SzBits / 8;
+    assert(SzBytes % ElemBytes == 0);
+    APInt ReadEnd = It->OffsetFromLeader + SzBits / 8;
+    // Allow redundancy: partial or full overlap counts as contiguous.
+    bool AreContiguous = false;
+    if (It->OffsetFromLeader.sle(PrevReadEnd)) {
+      uint64_t Overlap = (PrevReadEnd - It->OffsetFromLeader).getZExtValue();
+      if (Overlap % ElemBytes == 0)
+        AreContiguous = true;
+    }
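A small hypothetical example of what the suggested Overlap % ElemBytes check rejects (invented IR, assuming the chain's element type is i32, so ElemBytes == 4):

```llvm
define void @odd_overlap(ptr %p) {
  ; Covers bytes [0, 8).
  %a = load <2 x i32>, ptr %p, align 8
  %q = getelementptr inbounds i8, ptr %p, i64 6
  ; Covers bytes [6, 10): the overlap with the previous read end is 2
  ; bytes, which is not a multiple of ElemBytes, so this load would
  ; start a new chain rather than be treated as contiguous.
  %b = load i32, ptr %q, align 2
  ret void
}
```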
  int BytesAdded = DL.getTypeSizeInBits(getLoadStoreType(&*C[0].Inst)) / 8;
  APInt PrevReadEnd = C[0].OffsetFromLeader + BytesAdded;
  int ChainBytes = BytesAdded;
  for (auto It = std::next(C.begin()), End = C.end(); It != End; ++It) {
    unsigned SzBits = DL.getTypeSizeInBits(getLoadStoreType(&*It->Inst));
    APInt ReadEnd = It->OffsetFromLeader + SzBits / 8;
    // Update ChainBytes considering possible overlap.
    BytesAdded =
        PrevReadEnd.sle(ReadEnd) ? (ReadEnd - PrevReadEnd).getSExtValue() : 0;
    ChainBytes += BytesAdded;
    PrevReadEnd = APIntOps::smax(PrevReadEnd, ReadEnd);
  }
Instead of the loop, couldn't you just look at the first and last chain elements? (This applies to the old code as well, but while we're touching it...)
I think we need to look at every chain element because we could have a chain element in the middle that covers the rest of the chain and determines the ChainBytes.
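A hypothetical chain (invented offsets, not from any test in this PR) where a middle element determines the total coverage:

```llvm
; Sorted by offset from the leader %p:
;   %a: i32       at offset 0,  covers bytes [0, 4)
;   %b: <4 x i32> at offset 4,  covers bytes [4, 20)
;   %c: i32       at offset 12, covers bytes [12, 16)  (redundant)
; Looking only at the first and last elements would give 16 bytes, but
; the middle element extends the chain to 20 bytes, so every element's
; read end has to be folded into the running maximum.
define void @mid_element_covers_tail(ptr %p) {
  %a = load i32, ptr %p, align 4
  %p1 = getelementptr inbounds i8, ptr %p, i64 4
  %b = load <4 x i32>, ptr %p1, align 4
  %p2 = getelementptr inbounds i8, ptr %p, i64 12
  %c = load i32, ptr %p2, align 4
  ret void
}
```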
    // This can happen due to a chain of redundant loads.
    // In this case, just use the element-type, and avoid ExtractElement.
    if (NumElem == 1)
      VecTy = VecElemTy;
Updating the type here in the middle of the code is unfortunate. Better to move it to the top where VecTy is defined.
I only want to apply the type change for load-chain. Right now, we perhaps won't see redundancy in store chain due to aliasing check. When that part is improved, we can have the following:
vec1 = insert-element <1 x i32> vec0, v0, 0
vec2 = insert-element <1 x i32> vec1, v1, 0
store <1xi32> vec2, addr
Having those insert-elements preserves the correct write order.
What happens if you have a chain that looks like this:

Here is the debug output when splitChainByAlignment considers the candidate chain that contains all four loads:

Note that last line. The algorithm believes that that chain is only accessing 12B, which is wrong; it is accessing 16B. In this case, it only results in a false negative: that chain does not get vectorized. But there is probably a scenario where this bug causes a false positive and generates incorrect IR.

Given this, I personally think this should be reverted and reworked. Even if the above issues can be resolved in a follow-up patch, I'm still hesitant to believe that this is the best solution, or that the LSV is the right place to solve this problem. I think it would be a good idea to create a new LSV-specific lit test (which I think is what was requested here: #163019 (comment)) that tests a few scenarios that your patch is targeting, so that we can properly arrive at the cleanest correct solution.

FWIW, I have tinkered with similar duplicate-load problems in the LSV and determined that EarlyCSE/GVN were better ways to solve this problem. But if you can simplify the cases your patch is improving into a new unit test, I could be convinced the LSV is the place for this. Just throwing out ideas: maybe duplicate loads can be simply removed from the chain, and the chain can continue onward without them in
I did resolve many of Matt's comments in the code in the last update. However, I didn't add a new IR test; I'm not sure what kind of new test should be written.
Not sure I have fully understood your test case. Let me try it. However, I don't think my change will cause false positives. Notice that the chain size is recomputed during vectorizeChain (not just based upon splitChainByContiguity). Please let me know if you come up with a false-positive case.
I see you addressed some but not all of them, but I particularly think a new test case that contains all the cases that you want to improve would be really helpful to ensure this is the best solution. This could be done by extracting the pre-LSV IR from the cases where you are seeing improvement in your compiler, especially IR that represents the original motivating case. |
I thought about that; the case I had in the Discourse thread is fairly simple. I felt it was already covered by the existing test changes I have seen.
Here's a case that demonstrates the issue on x86. Unfortunately, it is a little convoluted; SizeBytes in x86 allows misaligned loads up to a certain size if it determines it is fast. My testing indicates that it allows 12B misaligned vectors but not 16B. In the first test, it determines that 16B is too large and splits up the vector. See this debug output, with the last line being the justification for why it does not create a v4. But for the second test, the additional load causes SizeBytes to be incorrectly computed as 12 instead of 16, even though the size of the chain is 16 bytes. Thus,
Thanks for all the explanation and the cases. I will revert the change, and add the patch for SizeBytes, plus these cases.
Sounds good, thank you! I'll be more active with feedback when you put up that patch. Like I said, I suspect this can be done in a cleaner and safer way, but I need to look at the important cases you are trying to tackle more closely first. |
…vm#163019)" This reverts commit 92e5608.
Here is the reverting PR #168105 |
This is the fixed version of llvm#163019
…n Chain (#168135) This is the fixed version of llvm/llvm-project#163019
…68135) This is the fixed version of llvm/llvm-project#163019 Signed-off-by: Hafidz Muzakky <[email protected]>
…vm#168135) This is the fixed version of llvm#163019