[VPlan] Get Addr computation cost with scalar type if it is uniform for gather/scatter. #150371
Conversation
@llvm/pr-subscribers-vectorizers @llvm/pr-subscribers-llvm-transforms

Author: Elvis Wang (ElvisWang123)

Changes

This patch queries `getAddressComputationCost()` with the scalar type if the address is uniform, which makes the gather/scatter cost more accurate.

In the current LV, a non-consecutive VPWidenMemoryRecipe (gather/scatter) accounts for the cost of the address computation. But there are cases where the address is uniform across all lanes, so it can be computed with a scalar type and a broadcast.

I have a follow-up optimization that converts gather/scatter with a uniform memory access into a scalar load/store + broadcast (and a select if needed). With that optimization, we can remove this temporary change.

This patch is preparation for #149955.

Full diff: https://github.com/llvm/llvm-project/pull/150371.diff

2 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 3ce9d29d34553..7adb87f4557f8 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -6932,6 +6932,12 @@ static bool planContainsAdditionalSimplifications(VPlan &Plan,
auto Iter = vp_depth_first_deep(Plan.getVectorLoopRegion()->getEntry());
for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(Iter)) {
for (VPRecipeBase &R : *VPBB) {
+ if (auto *MR = dyn_cast<VPWidenMemoryRecipe>(&R)) {
+ // The address computation cost can be query as scalar type if the
+ // address is uniform.
+ if (!MR->isConsecutive() && vputils::isSingleScalar(MR->getAddr()))
+ return true;
+ }
if (auto *IR = dyn_cast<VPInterleaveRecipe>(&R)) {
auto *IG = IR->getInterleaveGroup();
unsigned NumMembers = IG->getNumMembers();
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 57b713d3dfcb9..e8a3951bbeb20 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -3083,9 +3083,18 @@ InstructionCost VPWidenMemoryRecipe::computeCost(ElementCount VF,
const Value *Ptr = getLoadStorePointerOperand(&Ingredient);
assert(!Reverse &&
"Inconsecutive memory access should not have the order.");
- return Ctx.TTI.getAddressComputationCost(Ty) +
- Ctx.TTI.getGatherScatterOpCost(Opcode, Ty, Ptr, IsMasked, Alignment,
- Ctx.CostKind, &Ingredient);
+ InstructionCost Cost = 0;
+
+ // If the address value is uniform across all lane, then the address can be
+ // calculated with scalar type and broacast.
+ if (vputils::isSingleScalar(getAddr()))
+ Cost += Ctx.TTI.getAddressComputationCost(Ty->getScalarType());
+ else
+ Cost += Ctx.TTI.getAddressComputationCost(Ty);
+
+ return Cost + Ctx.TTI.getGatherScatterOpCost(Opcode, Ty, Ptr, IsMasked,
+ Alignment, Ctx.CostKind,
+ &Ingredient);
}
InstructionCost Cost = 0;
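A minimal standalone sketch of the cost selection this hunk introduces, with made-up numbers standing in for the TTI queries (assumed costs, not real target data; ModelTTI and modelGatherCost are hypothetical names, not part of the patch):

#include <iostream>

// Toy stand-ins for the TTI queries used above; the numbers are invented.
struct ModelTTI {
  int getAddressComputationCost(bool IsVectorTy) const {
    return IsVectorTy ? 10 : 1; // assume vector address math is pricier
  }
  int getGatherScatterOpCost() const { return 4; } // assumed fixed cost
};

// Mirrors the new logic: charge the scalar address-computation cost when the
// address is uniform across lanes, since it can be computed once and broadcast.
int modelGatherCost(const ModelTTI &TTI, bool AddrIsSingleScalar) {
  int Cost = TTI.getAddressComputationCost(/*IsVectorTy=*/!AddrIsSingleScalar);
  return Cost + TTI.getGatherScatterOpCost();
}

int main() {
  ModelTTI TTI;
  std::cout << "uniform address:     " << modelGatherCost(TTI, true) << "\n";  // 1 + 4 = 5
  std::cout << "non-uniform address: " << modelGatherCost(TTI, false) << "\n"; // 10 + 4 = 14
  return 0;
}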
// If the address value is uniform across all lane, then the address can be
// calculated with scalar type and broacast.
if (vputils::isSingleScalar(getAddr()))
  Cost += Ctx.TTI.getAddressComputationCost(Ty->getScalarType());
Looks fine for RISC-V but is this accurate for e.g. x86? I think the address might still be broadcasted? cc @alexey-bataev
Yes, I think this is too optimistic
Closing since it is too optimistic. Will try another method to fix the regression caused by #149955.
Hi, after thinking about this a bit more I think this PR is actually NFC, and is probably the easiest way forward.
I really can't think of a better way to avoid the regression in #149955. Is it possible to reopen this PR? Hopefully all of this will become simpler once the widening decisions are driven by VPlan/done as VPlan transformations themselves. Pasting in the TTI hooks for reference:

InstructionCost ARMTTIImpl::getAddressComputationCost(Type *Ty,
                                                      ScalarEvolution *SE,
                                                      const SCEV *Ptr) const {
  // Address computations in vectorized code with non-consecutive addresses will
  // likely result in more instructions compared to scalar code where the
  // computation can more often be merged into the index mode. The resulting
  // extra micro-ops can significantly decrease throughput.
  unsigned NumVectorInstToHideOverhead = 10;
  int MaxMergeDistance = 64;

  if (ST->hasNEON()) {
    if (Ty->isVectorTy() && SE &&
        !BaseT::isConstantStridedAccessLessThan(SE, Ptr, MaxMergeDistance + 1))
      return NumVectorInstToHideOverhead;

    // In many cases the address computation is not merged into the instruction
    // addressing mode.
    return 1;
  }
  return BaseT::getAddressComputationCost(Ty, SE, Ptr);
}

InstructionCost
AArch64TTIImpl::getAddressComputationCost(Type *Ty, ScalarEvolution *SE,
                                          const SCEV *Ptr) const {
  // Address computations in vectorized code with non-consecutive addresses will
  // likely result in more instructions compared to scalar code where the
  // computation can more often be merged into the index mode. The resulting
  // extra micro-ops can significantly decrease throughput.
  unsigned NumVectorInstToHideOverhead = NeonNonConstStrideOverhead;
  int MaxMergeDistance = 64;

  if (Ty->isVectorTy() && SE &&
      !BaseT::isConstantStridedAccessLessThan(SE, Ptr, MaxMergeDistance + 1))
    return NumVectorInstToHideOverhead;

  // In many cases the address computation is not merged into the instruction
  // addressing mode.
  return 1;
}

InstructionCost X86TTIImpl::getAddressComputationCost(Type *Ty,
                                                      ScalarEvolution *SE,
                                                      const SCEV *Ptr) const {
  // Address computations in vectorized code with non-consecutive addresses will
  // likely result in more instructions compared to scalar code where the
  // computation can more often be merged into the index mode. The resulting
  // extra micro-ops can significantly decrease throughput.
  const unsigned NumVectorInstToHideOverhead = 10;

  // Cost modeling of Strided Access Computation is hidden by the indexing
  // modes of X86 regardless of the stride value. We dont believe that there
  // is a difference between constant strided access in gerenal and constant
  // strided value which is less than or equal to 64.
  // Even in the case of (loop invariant) stride whose value is not known at
  // compile time, the address computation will not incur more than one extra
  // ADD instruction.
  if (Ty->isVectorTy() && SE && !ST->hasAVX2()) {
    // TODO: AVX2 is the current cut-off because we don't have correct
    // interleaving costs for prior ISA's.
    if (!BaseT::isStridedAccess(Ptr))
      return NumVectorInstToHideOverhead;
    if (!BaseT::getConstantStrideStep(SE, Ptr))
      return 1;
  }
  return BaseT::getAddressComputationCost(Ty, SE, Ptr);
}
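For reference on how these hooks use their arguments, here is a tiny standalone model (an assumption-laden sketch mirroring the ARM/AArch64 pattern above, not the real TargetTransformInfo API; ModelSCEV and modelAddressComputationCost are invented names): the large per-access overhead is only returned for a vector type when a pointer SCEV is supplied and the stride is not a known small constant.

#include <iostream>

// Stand-in for llvm::SCEV that only records whether the stride is a known
// small constant.
struct ModelSCEV { bool SmallConstantStride; };

// Mirrors: if (Ty->isVectorTy() && SE && !isConstantStridedAccessLessThan(...))
//            return NumVectorInstToHideOverhead;
//          return 1;
int modelAddressComputationCost(bool TyIsVector, const ModelSCEV *PtrSCEV) {
  const int NumVectorInstToHideOverhead = 10;
  if (TyIsVector && PtrSCEV && !PtrSCEV->SmallConstantStride)
    return NumVectorInstToHideOverhead;
  return 1;
}

int main() {
  ModelSCEV UnknownStride{false};
  // With a SCEV for an unknown stride, the type matters: 10 vs 1.
  std::cout << modelAddressComputationCost(true, &UnknownStride) << "\n";  // 10
  std::cout << modelAddressComputationCost(false, &UnknownStride) << "\n"; // 1
  // VPWidenMemoryRecipe::computeCost passes no SCEV today, so both queries
  // give the same baseline result.
  std::cout << modelAddressComputationCost(true, nullptr) << "\n";         // 1
  std::cout << modelAddressComputationCost(false, nullptr) << "\n";        // 1
  return 0;
}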
…or gather/scatter.

This patch queries `getAddressComputationCost()` with the scalar type if the address is uniform, which makes the gather/scatter cost more accurate. In the current LV, a non-consecutive VPWidenMemoryRecipe (gather/scatter) accounts for the cost of the address computation, but there are cases where the address is uniform across lanes, so it can be computed with a scalar type and a broadcast. I have a follow-up optimization that converts gather/scatter with a uniform memory access into a scalar load/store + broadcast. With that optimization, we can remove this temporary change.
Thanks @lukel97, will reopen this.
// The address computation cost can be query as scalar type if the
// address is uniform.
if (!MR->isConsecutive() && vputils::isSingleScalar(MR->getAddr()))
  return true;
Were you seeing any cost model assertions with this PR? I guess the VPlan cost model never passed in the pointer SCEV to begin with so is it already out of sync on other targets?
After the rebase, this isn't needed anymore. Removed, thanks!
d56bab9 to 8989f33
LGTM.
Btw, I think this makes more sense if we change the Type argument to getAddressComputationCost to the type of the pointer, not the value. But we don't do this consistently today.
LoopVectorizationCostModel::getMemInstScalarizationCost passes in the pointer type:

  // Get the cost of the scalar memory instruction and address computation.
  InstructionCost Cost =
      VF.getFixedValue() * TTI.getAddressComputationCost(PtrTy, SE, PtrSCEV);
X86 calls it PtrTy:
InstructionCost getAddressComputationCost(Type *PtrTy, ScalarEvolution *SE,
But VectorCombine passes in the element type:
ScalarizedCost += TTI.getAddressComputationCost(VecTy->getElementType());
As does VPWidenMemoryRecipe::computeCost.
I think we should fix this in a separate PR.
// If the address value is uniform across all lane, then the address can be
// calculated with scalar type and broacast.
Suggested change:

-// If the address value is uniform across all lane, then the address can be
-// calculated with scalar type and broacast.
+// If the address value is uniform across all lanes, then the address can be
+// calculated with scalar type and broadcast.
It should be possible to add a test for this?
I think this is NFC with the current TTIs, since they only use the type when the SCEV argument is passed. We don't pass it in VPWidenMemoryRecipe::computeCost. It makes a difference with #149955 though, because RISC-V will start checking the type without the SCEV. Should the changes be bundled up into that PR?
This patch queries `getAddressComputationCost()` with the scalar type if the address is uniform, which makes the gather/scatter cost more accurate.

In the current LV, a non-consecutive VPWidenMemoryRecipe (gather/scatter) accounts for the cost of the address computation. But there are cases where the address is uniform across all lanes, so it can be computed with a scalar type and a broadcast.

I have a follow-up optimization that converts gather/scatter with a uniform memory access into a scalar load/store + broadcast (and a select if needed). With that optimization, we can remove this temporary change.

This patch is preparation for #149955.
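For illustration only, here is a hypothetical C loop (not taken from the patch or its tests) of the kind this targets: the conditional load reads through a pointer that does not depend on the induction variable, so if it is vectorized as a masked gather, every lane uses the same address, and the follow-up optimization described above could replace it with a scalar load plus a broadcast and a select under the mask.

// Hypothetical example of a predicated load with a uniform address.
void foo(int *dst, const int *cond, const int *src, int n) {
  for (int i = 0; i < n; ++i) {
    if (cond[i])
      dst[i] = *src; // 'src' is loop-invariant: the gather's address is the
                     // same for every vector lane.
  }
}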