[VPlan] Don't apply predication discount to non-originally-predicated blocks #160449
Conversation
Split off from llvm#158690. Currently, if an instruction needs to be predicated due to tail folding, it will also have a predication discount applied to it in multiple places. This is likely inaccurate because we can expect a tail-folded instruction to be executed on every iteration bar the last. This fixes it by checking whether the instruction/block was originally predicated, and in doing so prevents vectorization with tail folding where we would have had to scalarize the memory op anyway.
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-vectorizers

Author: Luke Lau (lukel97)

Changes

Split off from #158690. Currently, if an instruction needs to be predicated due to tail folding, it will also have a predication discount applied to it in multiple places. This fixes it by checking whether the instruction/block was originally predicated, and in doing so prevents vectorization with tail folding where we would have had to scalarize the memory op anyway.

Patch is 56.89 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/160449.diff

9 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index ca092dcfcb492..c0c2063ca81b8 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -1249,6 +1249,25 @@ class LoopVectorizationCostModel {
/// Superset of instructions that return true for isScalarWithPredication.
bool isPredicatedInst(Instruction *I) const;
+ /// A helper function that returns how much we should divide the cost of a
+ /// predicated block by. Typically this is the reciprocal of the block
+ /// probability, i.e. if we return X we are assuming the predicated block will
+ /// execute once for every X iterations of the loop header so the block should
+ /// only contribute 1/X of its cost to the total cost calculation, but when
+ /// optimizing for code size it will just be 1 as code size costs don't depend
+ /// on execution probabilities.
+ ///
+ /// TODO: We should use actual block probability here, if available.
+ /// Currently, we always assume predicated blocks have a 50% chance of
+ /// executing.
+ inline unsigned
+ getPredBlockCostDivisor(TargetTransformInfo::TargetCostKind CostKind,
+ BasicBlock *BB) const {
+ if (!Legal->blockNeedsPredication(BB))
+ return 1;
+ return CostKind == TTI::TCK_CodeSize ? 1 : 2;
+ }
+
/// Return the costs for our two available strategies for lowering a
/// div/rem operation which requires speculating at least one lane.
/// First result is for scalarization (will be invalid for scalable
@@ -2902,7 +2921,8 @@ LoopVectorizationCostModel::getDivRemSpeculationCost(Instruction *I,
// Scale the cost by the probability of executing the predicated blocks.
// This assumes the predicated block for each vector lane is equally
// likely.
- ScalarizationCost = ScalarizationCost / getPredBlockCostDivisor(CostKind);
+ ScalarizationCost =
+ ScalarizationCost / getPredBlockCostDivisor(CostKind, I->getParent());
}
InstructionCost SafeDivisorCost = 0;
@@ -5035,7 +5055,7 @@ InstructionCost LoopVectorizationCostModel::computePredInstDiscount(
}
// Scale the total scalar cost by block probability.
- ScalarCost /= getPredBlockCostDivisor(CostKind);
+ ScalarCost /= getPredBlockCostDivisor(CostKind, I->getParent());
// Compute the discount. A non-negative discount means the vector version
// of the instruction costs more, and scalarizing would be beneficial.
@@ -5088,7 +5108,7 @@ InstructionCost LoopVectorizationCostModel::expectedCost(ElementCount VF) {
// cost by the probability of executing it. blockNeedsPredication from
// Legal is used so as to not include all blocks in tail folded loops.
if (VF.isScalar() && Legal->blockNeedsPredication(BB))
- BlockCost /= getPredBlockCostDivisor(CostKind);
+ BlockCost /= getPredBlockCostDivisor(CostKind, BB);
Cost += BlockCost;
}
@@ -5167,7 +5187,7 @@ LoopVectorizationCostModel::getMemInstScalarizationCost(Instruction *I,
// conditional branches, but may not be executed for each vector lane. Scale
// the cost by the probability of executing the predicated block.
if (isPredicatedInst(I)) {
- Cost /= getPredBlockCostDivisor(CostKind);
+ Cost /= getPredBlockCostDivisor(CostKind, I->getParent());
// Add the cost of an i1 extract and a branch
auto *VecI1Ty =
@@ -6727,6 +6747,11 @@ bool VPCostContext::skipCostComputation(Instruction *UI, bool IsVector) const {
SkipCostComputation.contains(UI);
}
+unsigned VPCostContext::getPredBlockCostDivisor(
+ TargetTransformInfo::TargetCostKind CostKind, BasicBlock *BB) const {
+ return CM.getPredBlockCostDivisor(CostKind, BB);
+}
+
InstructionCost
LoopVectorizationPlanner::precomputeCosts(VPlan &Plan, ElementCount VF,
VPCostContext &CostCtx) const {
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index a1c6f7977885f..e3b0c2bff9d02 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -855,7 +855,9 @@ InstructionCost VPRegionBlock::cost(ElementCount VF, VPCostContext &Ctx) {
// For the scalar case, we may not always execute the original predicated
// block, Thus, scale the block's cost by the probability of executing it.
if (VF.isScalar())
- return ThenCost / getPredBlockCostDivisor(Ctx.CostKind);
+ if (auto *VPIRBB = dyn_cast<VPIRBasicBlock>(Then))
+ return ThenCost / Ctx.getPredBlockCostDivisor(Ctx.CostKind,
+ VPIRBB->getIRBasicBlock());
return ThenCost;
}
diff --git a/llvm/lib/Transforms/Vectorize/VPlanHelpers.h b/llvm/lib/Transforms/Vectorize/VPlanHelpers.h
index fe59774b7c838..fe082851ca00c 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanHelpers.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanHelpers.h
@@ -50,21 +50,6 @@ Value *getRuntimeVF(IRBuilderBase &B, Type *Ty, ElementCount VF);
Value *createStepForVF(IRBuilderBase &B, Type *Ty, ElementCount VF,
int64_t Step);
-/// A helper function that returns how much we should divide the cost of a
-/// predicated block by. Typically this is the reciprocal of the block
-/// probability, i.e. if we return X we are assuming the predicated block will
-/// execute once for every X iterations of the loop header so the block should
-/// only contribute 1/X of its cost to the total cost calculation, but when
-/// optimizing for code size it will just be 1 as code size costs don't depend
-/// on execution probabilities.
-///
-/// TODO: We should use actual block probability here, if available. Currently,
-/// we always assume predicated blocks have a 50% chance of executing.
-inline unsigned
-getPredBlockCostDivisor(TargetTransformInfo::TargetCostKind CostKind) {
- return CostKind == TTI::TCK_CodeSize ? 1 : 2;
-}
-
/// A range of powers-of-2 vectorization factors with fixed start and
/// adjustable end. The range includes start and excludes end, e.g.,:
/// [1, 16) = {1, 2, 4, 8}
@@ -364,6 +349,10 @@ struct VPCostContext {
/// has already been pre-computed.
bool skipCostComputation(Instruction *UI, bool IsVector) const;
+ /// \returns how much the cost of a predicated block should be divided by.
+ unsigned getPredBlockCostDivisor(TargetTransformInfo::TargetCostKind CostKind,
+ BasicBlock *BB) const;
+
/// Returns the OperandInfo for \p V, if it is a live-in.
TargetTransformInfo::OperandValueInfo getOperandInfo(VPValue *V) const;
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index aa3de3613b68e..2e77b75b16e47 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -3170,7 +3170,7 @@ InstructionCost VPReplicateRecipe::computeCost(ElementCount VF,
// Scale the cost by the probability of executing the predicated blocks.
// This assumes the predicated block for each vector lane is equally
// likely.
- ScalarCost /= getPredBlockCostDivisor(Ctx.CostKind);
+ ScalarCost /= Ctx.getPredBlockCostDivisor(Ctx.CostKind, UI->getParent());
return ScalarCost;
}
case Instruction::Load:
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll b/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
index e4ee6776ae24c..790e1d20b6ec1 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
@@ -612,63 +612,18 @@ define void @low_trip_count_fold_tail_scalarized_store(ptr %dst) {
;
; COMMON-LABEL: define void @low_trip_count_fold_tail_scalarized_store(
; COMMON-SAME: ptr [[DST:%.*]]) {
-; COMMON-NEXT: [[ENTRY:.*:]]
-; COMMON-NEXT: br label %[[VECTOR_PH:.*]]
-; COMMON: [[VECTOR_PH]]:
-; COMMON-NEXT: br label %[[VECTOR_BODY:.*]]
-; COMMON: [[VECTOR_BODY]]:
-; COMMON-NEXT: br i1 true, label %[[PRED_STORE_IF:.*]], label %[[PRED_STORE_CONTINUE:.*]]
-; COMMON: [[PRED_STORE_IF]]:
-; COMMON-NEXT: [[TMP0:%.*]] = getelementptr i8, ptr [[DST]], i64 0
-; COMMON-NEXT: store i8 0, ptr [[TMP0]], align 1
-; COMMON-NEXT: br label %[[PRED_STORE_CONTINUE]]
-; COMMON: [[PRED_STORE_CONTINUE]]:
-; COMMON-NEXT: br i1 true, label %[[PRED_STORE_IF1:.*]], label %[[PRED_STORE_CONTINUE2:.*]]
-; COMMON: [[PRED_STORE_IF1]]:
-; COMMON-NEXT: [[TMP1:%.*]] = getelementptr i8, ptr [[DST]], i64 1
-; COMMON-NEXT: store i8 1, ptr [[TMP1]], align 1
-; COMMON-NEXT: br label %[[PRED_STORE_CONTINUE2]]
-; COMMON: [[PRED_STORE_CONTINUE2]]:
-; COMMON-NEXT: br i1 true, label %[[PRED_STORE_IF3:.*]], label %[[PRED_STORE_CONTINUE4:.*]]
-; COMMON: [[PRED_STORE_IF3]]:
-; COMMON-NEXT: [[TMP2:%.*]] = getelementptr i8, ptr [[DST]], i64 2
-; COMMON-NEXT: store i8 2, ptr [[TMP2]], align 1
-; COMMON-NEXT: br label %[[PRED_STORE_CONTINUE4]]
-; COMMON: [[PRED_STORE_CONTINUE4]]:
-; COMMON-NEXT: br i1 true, label %[[PRED_STORE_IF5:.*]], label %[[PRED_STORE_CONTINUE6:.*]]
-; COMMON: [[PRED_STORE_IF5]]:
-; COMMON-NEXT: [[TMP3:%.*]] = getelementptr i8, ptr [[DST]], i64 3
-; COMMON-NEXT: store i8 3, ptr [[TMP3]], align 1
-; COMMON-NEXT: br label %[[PRED_STORE_CONTINUE6]]
-; COMMON: [[PRED_STORE_CONTINUE6]]:
-; COMMON-NEXT: br i1 true, label %[[PRED_STORE_IF7:.*]], label %[[PRED_STORE_CONTINUE8:.*]]
-; COMMON: [[PRED_STORE_IF7]]:
-; COMMON-NEXT: [[TMP4:%.*]] = getelementptr i8, ptr [[DST]], i64 4
-; COMMON-NEXT: store i8 4, ptr [[TMP4]], align 1
-; COMMON-NEXT: br label %[[PRED_STORE_CONTINUE8]]
-; COMMON: [[PRED_STORE_CONTINUE8]]:
-; COMMON-NEXT: br i1 true, label %[[PRED_STORE_IF9:.*]], label %[[PRED_STORE_CONTINUE10:.*]]
-; COMMON: [[PRED_STORE_IF9]]:
-; COMMON-NEXT: [[TMP5:%.*]] = getelementptr i8, ptr [[DST]], i64 5
-; COMMON-NEXT: store i8 5, ptr [[TMP5]], align 1
-; COMMON-NEXT: br label %[[PRED_STORE_CONTINUE10]]
-; COMMON: [[PRED_STORE_CONTINUE10]]:
-; COMMON-NEXT: br i1 true, label %[[PRED_STORE_IF11:.*]], label %[[PRED_STORE_CONTINUE12:.*]]
-; COMMON: [[PRED_STORE_IF11]]:
-; COMMON-NEXT: [[TMP6:%.*]] = getelementptr i8, ptr [[DST]], i64 6
-; COMMON-NEXT: store i8 6, ptr [[TMP6]], align 1
-; COMMON-NEXT: br label %[[PRED_STORE_CONTINUE12]]
-; COMMON: [[PRED_STORE_CONTINUE12]]:
-; COMMON-NEXT: br i1 false, label %[[PRED_STORE_IF13:.*]], label %[[EXIT:.*]]
-; COMMON: [[PRED_STORE_IF13]]:
-; COMMON-NEXT: [[TMP7:%.*]] = getelementptr i8, ptr [[DST]], i64 7
-; COMMON-NEXT: store i8 7, ptr [[TMP7]], align 1
-; COMMON-NEXT: br label %[[EXIT]]
-; COMMON: [[EXIT]]:
-; COMMON-NEXT: br label %[[SCALAR_PH:.*]]
-; COMMON: [[SCALAR_PH]]:
-; COMMON-NEXT: br [[EXIT1:label %.*]]
-; COMMON: [[SCALAR_PH1:.*:]]
+; COMMON-NEXT: [[ENTRY:.*]]:
+; COMMON-NEXT: br label %[[EXIT1:.*]]
+; COMMON: [[EXIT1]]:
+; COMMON-NEXT: [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[EXIT1]] ]
+; COMMON-NEXT: [[IV_TRUNC:%.*]] = trunc i64 [[IV]] to i8
+; COMMON-NEXT: [[GEP:%.*]] = getelementptr i8, ptr [[DST]], i64 [[IV]]
+; COMMON-NEXT: store i8 [[IV_TRUNC]], ptr [[GEP]], align 1
+; COMMON-NEXT: [[IV_NEXT]] = add i64 [[IV]], 1
+; COMMON-NEXT: [[EC:%.*]] = icmp eq i64 [[IV_NEXT]], 7
+; COMMON-NEXT: br i1 [[EC]], label %[[SCALAR_PH1:.*]], label %[[EXIT1]]
+; COMMON: [[SCALAR_PH1]]:
+; COMMON-NEXT: ret void
;
entry:
br label %loop
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/induction-costs-sve.ll b/llvm/test/Transforms/LoopVectorize/AArch64/induction-costs-sve.ll
index cc7b4aecc3642..71c2a05af964f 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/induction-costs-sve.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/induction-costs-sve.ll
@@ -274,69 +274,11 @@ define void @iv_trunc(i32 %x, ptr %dst, i64 %N) #0 {
;
; PRED-LABEL: define void @iv_trunc(
; PRED-SAME: i32 [[X:%.*]], ptr [[DST:%.*]], i64 [[N:%.*]]) #[[ATTR0]] {
-; PRED-NEXT: [[ENTRY:.*:]]
+; PRED-NEXT: [[ENTRY:.*]]:
; PRED-NEXT: [[MUL_X:%.*]] = add i32 [[X]], 1
-; PRED-NEXT: [[TMP0:%.*]] = add i64 [[N]], 1
-; PRED-NEXT: br label %[[VECTOR_SCEVCHECK:.*]]
-; PRED: [[VECTOR_SCEVCHECK]]:
-; PRED-NEXT: [[TMP1:%.*]] = sub i32 -1, [[X]]
-; PRED-NEXT: [[TMP2:%.*]] = icmp slt i32 [[MUL_X]], 0
-; PRED-NEXT: [[TMP3:%.*]] = select i1 [[TMP2]], i32 [[TMP1]], i32 [[MUL_X]]
-; PRED-NEXT: [[TMP4:%.*]] = trunc i64 [[N]] to i32
-; PRED-NEXT: [[MUL:%.*]] = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 [[TMP3]], i32 [[TMP4]])
-; PRED-NEXT: [[MUL_RESULT:%.*]] = extractvalue { i32, i1 } [[MUL]], 0
-; PRED-NEXT: [[MUL_OVERFLOW:%.*]] = extractvalue { i32, i1 } [[MUL]], 1
-; PRED-NEXT: [[TMP5:%.*]] = sub i32 0, [[MUL_RESULT]]
-; PRED-NEXT: [[TMP6:%.*]] = icmp ugt i32 [[TMP5]], 0
-; PRED-NEXT: [[TMP7:%.*]] = select i1 [[TMP2]], i1 [[TMP6]], i1 false
-; PRED-NEXT: [[TMP8:%.*]] = or i1 [[TMP7]], [[MUL_OVERFLOW]]
-; PRED-NEXT: [[TMP9:%.*]] = icmp ugt i64 [[N]], 4294967295
-; PRED-NEXT: [[TMP10:%.*]] = icmp ne i32 [[MUL_X]], 0
-; PRED-NEXT: [[TMP11:%.*]] = and i1 [[TMP9]], [[TMP10]]
-; PRED-NEXT: [[TMP12:%.*]] = or i1 [[TMP8]], [[TMP11]]
-; PRED-NEXT: br i1 [[TMP12]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
-; PRED: [[VECTOR_PH]]:
-; PRED-NEXT: [[TMP13:%.*]] = sub i64 [[TMP0]], 2
-; PRED-NEXT: [[TMP14:%.*]] = icmp ugt i64 [[TMP0]], 2
-; PRED-NEXT: [[TMP15:%.*]] = select i1 [[TMP14]], i64 [[TMP13]], i64 0
-; PRED-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <2 x i1> @llvm.get.active.lane.mask.v2i1.i64(i64 0, i64 [[TMP0]])
-; PRED-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x i32> poison, i32 [[MUL_X]], i64 0
-; PRED-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <2 x i32> [[BROADCAST_SPLATINSERT]], <2 x i32> poison, <2 x i32> zeroinitializer
-; PRED-NEXT: br label %[[VECTOR_BODY:.*]]
-; PRED: [[VECTOR_BODY]]:
-; PRED-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_STORE_CONTINUE2:.*]] ]
-; PRED-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], %[[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], %[[PRED_STORE_CONTINUE2]] ]
-; PRED-NEXT: [[VEC_IND:%.*]] = phi <2 x i32> [ <i32 0, i32 1>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[PRED_STORE_CONTINUE2]] ]
-; PRED-NEXT: [[TMP16:%.*]] = mul <2 x i32> [[BROADCAST_SPLAT]], [[VEC_IND]]
-; PRED-NEXT: [[TMP17:%.*]] = zext <2 x i32> [[TMP16]] to <2 x i64>
-; PRED-NEXT: [[TMP18:%.*]] = extractelement <2 x i1> [[ACTIVE_LANE_MASK]], i32 0
-; PRED-NEXT: br i1 [[TMP18]], label %[[PRED_STORE_IF:.*]], label %[[PRED_STORE_CONTINUE:.*]]
-; PRED: [[PRED_STORE_IF]]:
-; PRED-NEXT: [[TMP19:%.*]] = extractelement <2 x i64> [[TMP17]], i32 0
-; PRED-NEXT: [[TMP20:%.*]] = getelementptr i32, ptr [[DST]], i64 [[TMP19]]
-; PRED-NEXT: store i32 1, ptr [[TMP20]], align 4
-; PRED-NEXT: br label %[[PRED_STORE_CONTINUE]]
-; PRED: [[PRED_STORE_CONTINUE]]:
-; PRED-NEXT: [[TMP21:%.*]] = extractelement <2 x i1> [[ACTIVE_LANE_MASK]], i32 1
-; PRED-NEXT: br i1 [[TMP21]], label %[[PRED_STORE_IF1:.*]], label %[[PRED_STORE_CONTINUE2]]
-; PRED: [[PRED_STORE_IF1]]:
-; PRED-NEXT: [[TMP22:%.*]] = extractelement <2 x i64> [[TMP17]], i32 1
-; PRED-NEXT: [[TMP23:%.*]] = getelementptr i32, ptr [[DST]], i64 [[TMP22]]
-; PRED-NEXT: store i32 1, ptr [[TMP23]], align 4
-; PRED-NEXT: br label %[[PRED_STORE_CONTINUE2]]
-; PRED: [[PRED_STORE_CONTINUE2]]:
-; PRED-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 2
-; PRED-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <2 x i1> @llvm.get.active.lane.mask.v2i1.i64(i64 [[INDEX]], i64 [[TMP15]])
-; PRED-NEXT: [[TMP24:%.*]] = extractelement <2 x i1> [[ACTIVE_LANE_MASK_NEXT]], i32 0
-; PRED-NEXT: [[TMP25:%.*]] = xor i1 [[TMP24]], true
-; PRED-NEXT: [[VEC_IND_NEXT]] = add <2 x i32> [[VEC_IND]], splat (i32 2)
-; PRED-NEXT: br i1 [[TMP25]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
-; PRED: [[MIDDLE_BLOCK]]:
-; PRED-NEXT: br label %[[EXIT:.*]]
-; PRED: [[SCALAR_PH]]:
; PRED-NEXT: br label %[[FOR_BODY:.*]]
; PRED: [[FOR_BODY]]:
-; PRED-NEXT: [[IV:%.*]] = phi i64 [ 0, %[[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], %[[FOR_BODY]] ]
+; PRED-NEXT: [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[FOR_BODY]] ]
; PRED-NEXT: [[TRUNC_IV:%.*]] = trunc i64 [[IV]] to i32
; PRED-NEXT: [[ADD_I:%.*]] = mul i32 [[MUL_X]], [[TRUNC_IV]]
; PRED-NEXT: [[IV_MUL:%.*]] = zext i32 [[ADD_I]] to i64
@@ -344,7 +286,7 @@ define void @iv_trunc(i32 %x, ptr %dst, i64 %N) #0 {
; PRED-NEXT: store i32 1, ptr [[GEP]], align 4
; PRED-NEXT: [[IV_NEXT]] = add i64 [[IV]], 1
; PRED-NEXT: [[EC:%.*]] = icmp eq i64 [[IV]], [[N]]
-; PRED-NEXT: br i1 [[EC]], label %[[EXIT]], label %[[FOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
+; PRED-NEXT: br i1 [[EC]], label %[[EXIT:.*]], label %[[FOR_BODY]]
; PRED: [[EXIT]]:
; PRED-NEXT: ret void
;
@@ -440,101 +382,21 @@ define void @trunc_ivs_and_store(i32 %x, ptr %dst, i64 %N) #0 {
;
; PRED-LABEL: define void @trunc_ivs_and_store(
; PRED-SAME: i32 [[X:%.*]], ptr [[DST:%.*]], i64 [[N:%.*]]) #[[ATTR0]] {
-; PRED-NEXT: [[ENTRY:.*:]]
-; PRED-NEXT: [[MUL:%.*]] = mul i32 [[X]], [[X]]
-; PRED-NEXT: [[TMP0:%.*]] = add i64 [[N]], 1
-; PRED-NEXT: br label %[[VECTOR_SCEVCHECK:.*]]
-; PRED: [[VECTOR_SCEVCHECK]]:
+; PRED-NEXT: [[ENTRY:.*]]:
; PRED-NEXT: [[TMP1:%.*]] = mul i32 [[X]], [[X]]
-; PRED-NEXT: [[TMP2:%.*]] = sub i32 0, [[TMP1]]
-; PRED-NEXT: [[TMP3:%.*]] = icmp slt i32 [[MUL]], 0
-; PRED-NEXT: [[TMP4:%.*]] = select i1 [[TMP3]], i32 [[TMP2]], i32 [[MUL]]
-; PRED-NEXT: [[TMP5:%.*]] = trunc i64 [[N]] to i32
-; PRED-NEXT: [[MUL1:%.*]] = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 [[TMP4]], i32 [[TMP5]])
-; PRED-NEXT: [[MUL_RESULT:%.*]] = extractvalue { i32, i1 } [[MUL1]], 0
-; PRED-NEXT: [[MUL_OVERFLOW:%.*]] = extractvalue { i32, i1 } [[MUL1]], 1
-; PRED-NEXT: [[TMP6:%.*]] = sub i32 0, [[MUL_RESULT]]
-; PRED-NEXT: [[TMP7:%.*]] = icmp ugt i32 [[TMP6]], 0
-; PRED-NEXT: [[TMP8:%.*]] = select i1 [[TMP3]], i1 [[TMP7]], i1 false
-; PRED-NEXT: [[TMP9:%.*]] = or i1 [[TMP8]], [[MUL_OVERFLOW]]
-; PRED-NEXT: [[TMP10:%.*]] = icmp ugt i64 [[N]], 4294967295
-; PRED-NEXT: [[TMP11:%.*]] = icmp ne i32 [[MUL]], 0
-; PRED-NEXT: [[TMP12:%.*]] = and i1 [[TMP10]], [[TMP11]]
-; PRED-NEXT: [[TMP13:%.*]] = or i1 [[TMP9]], [[TMP12]]
-; PRED-NEXT: br i1 [[TMP13]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
-; PRED: [[VECTOR_PH]]:
-; PRED-NEXT: [[TMP14:%.*]] = sub i64 [[TMP0]], 4
-; PRED-NEXT: [[TMP15:%.*]] = icmp ugt i64 [[TMP0]], 4
-; PRED-NEXT: [[TMP16:%.*]] = select i1 [[TMP15]], i64 [[TMP14]], i64 0
-; PRED-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64 0, i64 [[TMP0]])
-; PRED-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> poison, i32 [[MUL]], i64 0
-; PRED-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> poison, <4 x i32> zeroinitializer
-; PRED-NEXT: br label %[[VECTOR_BODY:.*]]
-; PRED: [[VECTOR_BODY]]:
-; PRED-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_STORE_CONTINUE7:.*]] ]
-; PRED-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <4 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], %[[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], %[[PRED_STORE_CONTINUE7]] ]
-; PRED-NEXT: [[VEC_IND:%.*]] = phi <4 x i32> [ <i32 0, i32 1, i32 2, i32 3>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[PRED_STORE_CONTINUE7]] ]
-; PRED-NEXT: [[OFFSET_IDX:%.*]] = trunc i64 [[INDEX]] to i32
-; PRED-NEXT: [[TMP17:%.*]] = mul <4 x i32> [[BROADCAST_SPLAT]], [[VEC_IND]]
-; PRED-NEXT: [[TMP18:%.*]] = zext <4 x i32> [[TMP17]] to <4 x i64>
-; PRED-NEXT: [[TMP19:%.*]] = extractelement <4 x i1> [[ACTIVE_LANE_MASK]], i32 0
-; PRED-NEXT: br i1 [[TMP19]], label %[[PRED_STORE_IF:.*]], label %[[PRED_STORE_CONTINUE:.*]]
-; PRED: [[PRED_STORE_IF]]:
-; PRED-NEXT: [...
[truncated]
david-arm left a comment:

I'm not sure if this statement is quite right:

> This is likely inaccurate because we can expect a tail folded instruction to be executed on every iteration bar the last.

In one of the tests changed by this patch the loop has a low trip count of 7. Suppose we chose a VF of 16, then doesn't that mean that only 7 out of 16 scalar instructions are being executed? That takes us back to getPredBlockCostDivisor returning a value of 2, I think. It feels like there are three different problems here:

I guess what I'm trying to say is that if the intent of the patch is to more accurately calculate the cost divisor based on the probability of entering the block, then paradoxically the cost should really reduce even further in some cases. That makes me wonder if this is the right approach? Or perhaps the change to …
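(Illustrative arithmetic, not from the thread: with TC = 7 and VF = 16, the single masked vector iteration has only 7 of its 16 lanes active, so the per-lane execution probability is 7/16 ≈ 44%, roughly the 50% that the old divisor of 2 assumes.)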
fhahn left a comment:
> I'm not sure if this statement is quite right:
>
> > This is likely inaccurate because we can expect a tail folded instruction to be executed on every iteration bar the last.
>
> In one of the tests changed by this patch the loop has a low trip count of 7. Suppose we chose a VF of 16, then doesn't that mean that only 7 out of 16 scalar instructions are being executed? That takes us back to getPredBlockCostDivisor returning a value of 2 I think.
In cases except very low trip counts, using a probability closer to 1 should be more accurate than 0.5, I think. E.g. if we execute 4 full vector iterations and one partial one, the probability should be something close to ~80%. Overall this should hopefully help to reduce the number of cases where we vectorize with tail-folding and costly replication.
I might be wrong, but I think this isn't something where block-frequency info would help, which is why I suggested splitting this off from #158690.
For loops with known trip counts, we should be able to compute it accurately given the VF and trip count. For other cases, it might make sense to be a bit more conservative and don't assume 100% probability but something like 80% or 90%.
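For illustration, a minimal sketch of that estimate (the helper name and the fixed-width/known-trip-count assumptions are hypothetical, not from the patch):

// Per-lane execution probability of a tail-folded body with known trip
// count TC and fixed vector width VF: ceil(TC/VF) masked iterations cover
// ceil(TC/VF)*VF lanes, of which TC are active. E.g. TC=18, VF=4 gives
// 5 iterations and 18/20 = 90%; TC=17 gives 17/20 = 85%, far above 50%.
double laneActiveProbability(unsigned TC, unsigned VF) {
  unsigned Lanes = ((TC + VF - 1) / VF) * VF; // total lanes across iterations
  return double(TC) / double(Lanes);
}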
I think the test that @david-arm mentioned would be a good sanity check; we shouldn't choose VF 16 over VF 8 just because we add additional branches we know won't execute.
I see what you're saying, but I really think it's just coincidental that the predicated discount cost sometimes aligns with low trip counts, for example in that test. From what I can tell, getPredBlockCostDivisor was never intended to discount the cost for tail folding; it was only for if-converted blocks: 1755d81. And somewhere along the line, blocks that are predicated due to tail folding got caught up in this discount when it wasn't intended for them.
I agree here. I think we could detect whenever a block is only predicated for tail folding and we have a known trip count; we can plumb through the VF and get a better estimate there. It feels kind of separate to this PR though. Would it be ok to do that as a follow up?
@david-arm Also probably worth mentioning is that the impact of this was actually very small when I tested it out on benchmarks. In #158690 with aarch64-linux-gnu -march=armv9-a -flto -O3 there was no change in the number of loops vectorized in llvm-test-suite, and actually 0.2% more loops vectorized on SPEC CPU 2017. (I don't think exchange2 is in here though.)
I tried out accounting for tail folding using the estimated trip count:

inline unsigned
getPredBlockCostDivisor(TargetTransformInfo::TargetCostKind CostKind,
                        BasicBlock *BB, ElementCount VF) const {
  if (!Legal->blockNeedsPredication(BB)) {
    if (foldTailByMasking()) {
      if (auto TC = getSmallBestKnownTC(PSE, TheLoop)) {
        unsigned ETC = estimateElementCount(*TC, getVScaleForTuning());
        unsigned EVF = estimateElementCount(VF, getVScaleForTuning());
        return std::max(1U, EVF / ETC);
      }
    }
    return 1;
  }
  return CostKind == TTI::TCK_CodeSize ? 1 : 2;
}

But there's no difference on any of the in-tree tests. My guess is that the discount really only occurs with a high enough VF, at which point the cost of scalarization for every lane begins to outweigh any discount from the blocks potentially not being executed.
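Plugging numbers into that sketch supports this (a hypothetical check, assuming unit vscale so the estimated element counts equal the raw VF/TC values):

#include <algorithm>

// For the low-trip-count test above (TC = 7), the sketched divisor is:
unsigned divisor(unsigned EVF, unsigned ETC) { return std::max(1U, EVF / ETC); }
// divisor(8, 7)  == 1, so no discount at VF 8;
// divisor(16, 7) == 2, i.e. only at VF 16 does it reach the old 50% value.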
Thanks to @lukel97 and @fhahn for explaining! Just to be clear, I won't hold up the patch. I realise this is for a different day and a different PR, but it seems to me the root problem here is that we make the decision to tail-fold the loop far too early without considering the cost of doing so, and then everything afterwards has to try very hard to unwind that extremely poor early decision. If we'd avoided tail-folding in the first place, e.g. by at least checking whether the target supports masked loads and stores, then we may avoid some of the extra complexity of rolling back. Alternatively, if we'd chosen from two VPlans per VF using different styles of vectorisation then we may end up with a better choice than simply not vectorising.
  // Legal is used so as to not include all blocks in tail folded loops.
  if (VF.isScalar() && Legal->blockNeedsPredication(BB))
-   BlockCost /= getPredBlockCostDivisor(CostKind);
+   BlockCost /= getPredBlockCostDivisor(CostKind, BB);
Here the call to getPredBlockCostDivisor is already guarded by Legal->blockNeedsPredication(BB). Instead of changing the meaning/behaviour of getPredBlockCostDivisor, is it perhaps better to simply guard all calls to the function in a similar way? Then you can add a comment to getPredBlockCostDivisor saying that it should only ever be called for blocks that are predicated in the original scalar code.
Alternatively, you could remove the Legal->blockNeedsPredication(BB) check here and update the comment above getPredBlockCostDivisor saying that it will always return 1 for blocks that are predicated due to tail-folding.
Oh good point. And I guess that comment above confirms that the predication discount isn't meant to be applied to tail folded blocks? ab97c9b
/// TODO: We should use actual block probability here, if available.
/// Currently, we always assume predicated blocks have a 50% chance of
/// executing.
inline unsigned
I think the comment on this function needs updating.
I've updated the comment in cf6b435 to mention that tail-folded predication doesn't count in this case
|
Having thought about this just now, I also suspect that when tail-folding due to low trip counts (i.e. TC <= 16) we probably clamp the max VF to the next largest power-of-2 value. So we would never choose a max VF of 16 for TC=5; we'd probably choose a max VF of 8. There will be very few cases where the cost divisor goes above 1. We also don't know at this point what the interleave count will be.
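A sketch of the clamping being described (an assumption about the heuristic, not a quote of the cost model; the helper name is hypothetical):

#include "llvm/Support/MathExtras.h"

// Cap the max VF at the next power of 2 >= the trip count, so a loop with
// TC=5 gets a max VF of 8 and a VF of 16 is never considered.
uint64_t clampedMaxVF(uint64_t TC) { return llvm::PowerOf2Ceil(TC); }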
…ze/dont-discount-unpredicated-blocks
; CHECK-NEXT: vector.body:
; CHECK-NEXT: EMIT vp<[[CAN_IV:%.+]]> = CANONICAL-INDUCTION
; CHECK-NEXT: ir<%iv> = WIDEN-INDUCTION ir<0>, ir<1>, vp<[[VF]]>
; CHECK-NEXT: vp<[[STEPS:%.+]]> = SCALAR-STEPS vp<[[CAN_IV]]>, ir<1>, vp<[[VF]]>
I think this is now not testing what it used to (replicate regions aren't merged any more); it would be good to adjust the test so this merging still takes place.
Also done in f072e3c; I needed to pass -force-widen-divrem-via-safe-divisor=0 so that the div is scalarized.
; CHECK-NEXT: }
; CHECK-NEXT: Successor(s): pred.load
; CHECK-EMPTY:
I think this is now not testing what it used to (replicate regions aren't merged any more); it would be good to adjust the test so this merging still takes place.
Sorry for the delay, should be done now in f072e3c. I had to remove the instructions in between the load and store that were now widened, to get the replicate regions beside each other.
…ze/dont-discount-unpredicated-blocks
…ze/dont-discount-unpredicated-blocks
…ll merging replicate regions
…ze/dont-discount-unpredicated-blocks
Ping
fhahn left a comment:
Do you by any chance have any perf numbers to confirm this doesn't cause any regressions?
If so, it would be great if you could add them to the PR description.
%add = add i32 %lv.b, 10
%mul = mul i32 2, %add
Is there any reason those need to be removed? I might be remembering incorrectly, but this should test that we can sink multiple recipes.
I think this is supposed to be testing that replicate regions are merged? #160449 (comment)
Because the add and mul are no longer replicated (there's no predication discount anymore), we now end up with:
REPLICATE load
WIDE add
WIDE mul
REPLICATE store
So I had to remove the add and the mul so the load and store replicate regions would be together, which allows them to be merged.
; CHECK-NEXT: ir<%iv> = WIDEN-INDUCTION ir<0>, ir<1>, vp<[[VF]]>
; CHECK-NEXT: EMIT vp<[[MASK:%.+]]> = icmp ule ir<%iv>, vp<[[BTC]]>
; CHECK-NEXT: EMIT vp<[[SPLICE:%.+]]> = first-order splice ir<%recur>, ir<%recur.next>
; CHECK-NEXT: WIDEN ir<%rem> = srem vp<[[SPLICE]]>, ir<%x>
not testing sinking into replicate regions any longer?
; CHECK-NEXT: Successor(s): loop.1
; CHECK-EMPTY:
; CHECK-NEXT: loop.1:
; CHECK-NEXT: WIDEN ir<%add.1> = add ir<%conv>, ir<%rem>
this now disables merging/sinking? Can we avoid this change, to keep testing the original thing?
This was originally to test splitting at the end of a block, but as of today it's not testing that because the conv isn't at the end of the block. I've opened #164636 to move it to a unit test.
I tested this again on llvm-test-suite arm64-apple-darwin -O3; the geomean difference in loops vectorized is -0.0%. Only four tests were affected, with one fewer loop vectorized, and there was no observable performance difference.
In llvm#160449 some of the tests end up merging fewer blocks, so we end up losing test coverage. This adds a unit test to try and cover it elsewhere in a more predictable way, so it won't be influenced by e.g. whether or not the cost model decides to scalarize an instruction. Two parts in createReplicateRegion/addReplicateRegions had to be relaxed to allow using a fake Instruction with no BasicBlock parent. I wasn't able to create a fake BasicBlock in the test without hitting iterator assertions.
…ze/dont-discount-unpredicated-blocks
; CHECK-NEXT: WIDEN-CAST ir<%conv> = sext vp<[[PRED1]]> to i32
; CHECK-NEXT: EMIT vp<[[SPLICE:%.+]]> = first-order splice ir<%0>, ir<%conv>
; CHECK-NEXT: WIDEN ir<%rem> = srem vp<[[SPLICE]]>, ir<%x>
; CHECK-NEXT: WIDEN ir<%add> = add ir<%conv>, ir<%rem>
IIUC when this test was first written in https://reviews.llvm.org/D100751 sinking operated directly on regions. But now it happens in sinkRecurrenceUsersAfterPrevious which is called before optimization, so before createAndOptimizeReplicateRegions.
So I don't think we're sinking replicate regions directly anymore, and I'm not sure if this was still testing the original "; Test cases for PR50009, which require sinking a replicate region due to a first-order recurrence." comment.
Just noting here that we can't explicitly predicate this test because the srem that's being sunk is used in the FOR chain
@sink_replicate_region_4_requires_split_at_end_of_block was originally added to ensure splitting at the end of a block wouldn't crash, see bdada75. However, it looks like we're now no longer testing this because conv isn't at the end of the block anymore. This moves it into a unit test instead. Discovered when working on #160449.
// getPredBlockCostDivisor won't include blocks that are only predicated due
// to tail folded loops
Suggested change:
- // getPredBlockCostDivisor won't include blocks that are only predicated due
- // to tail folded loops
+ // getPredBlockCostDivisor will return 1 for blocks only predicated
+ // by the header mask when folding the tail
Done in fdfa9b2
unsigned VPCostContext::getPredBlockCostDivisor(
    TargetTransformInfo::TargetCostKind CostKind, BasicBlock *BB) const {
Suggested change:
- unsigned VPCostContext::getPredBlockCostDivisor(
-     TargetTransformInfo::TargetCostKind CostKind, BasicBlock *BB) const {
+ unsigned VPCostContext::getPredBlockCostDivisor(BasicBlock *BB) const {

I think it should be fine to just use CostKind from VPCostContext?
Thanks, done in fdfa9b2
david-arm left a comment:

LGTM!
fhahn left a comment:

LGTM, thanks!
* main: (1028 commits)
  [clang][DebugInfo] Attach `DISubprogram` to additional call variants (llvm#166202)
  [C2y] Claim nonconformance to WG14 N3348 (llvm#166966)
  [X86] 2012-01-10-UndefExceptionEdge.ll - regenerate test checks (llvm#167307)
  Remove unused standard headers: <string>, <optional>, <numeric>, <tuple> (llvm#167232)
  [DebugInfo] Add Verifier check for incorrectly-scoped retainedNodes (llvm#166855)
  [VPlan] Don't apply predication discount to non-originally-predicated blocks (llvm#160449)
  [libc++] Avoid overloaded `operator,` for (`T`, `Iter`) cases (llvm#161049)
  [tools][llc] Make save-stats.ll test target independent (llvm#167238)
  [AArch64] Fallback to PRFUM for PRFM with negative or unaligned offset (llvm#166756)
  [X86] ldexp-avx512.ll - add v8f16/v16f16/v32f16 test coverage for llvm#165694 (llvm#167294)
  [DropAssumes] Drop dereferenceable assumptions after vectorization. (llvm#166947)
  [VPlan] Simplify branch-cond with getVectorTripCount (llvm#155604)
  Remove unused <algorithm> inclusion (llvm#166942)
  [AArch64] Combine subtract with borrow to SBC. (llvm#165271)
  [AArch64][SVE] Avoid redundant extend of unsigned i8/i16 extracts. (llvm#165863)
  [SPIRV] Fix failing assertion in SPIRVAsmPrinter (llvm#166909)
  [libc++] Merge insert/emplace(const_iterator, Args...) implementations (llvm#166470)
  [libc++] Replace __libcpp_is_final with a variable template (llvm#167137)
  [gn build] Port 152bda7
  [libc++] Replace the last uses of __tuple_types with __type_list (llvm#167214)
  ...
Split off from #158690. Currently, if an instruction needs to be predicated due to tail folding, it will also have a predication discount applied to it in multiple places.
This is likely inaccurate because we can expect a tail folded instruction to be executed on every iteration bar the last.
This fixes it by checking if the instruction/block was originally predicated, and in doing so prevents vectorization with tail folding where we would have had to scalarize the memory op anyway.
On llvm-test-suite this causes 4 loops in total to no longer be vectorized with -O3 on arm64-apple-darwin, and there's no observable performance impact.