[VPlan] Don't use the legacy cost model for loop conditions #156864
Conversation
The current behaviour of using the legacy cost model for instructions that compute the loop exit condition gets the wrong result when the loop has been transformed to use a different exit condition, e.g. with tail-folded predicated vectorization the exit condition is based on the predicate vector. Fix this by adding costs for all of the VPlan instructions that can create a compare instruction and removing the use of the legacy cost model. This can result in the VPlan choosing a different VF to the legacy cost model, so check for this in computeBestVF so we don't assert.

This causes quite a lot of changes to expected output in tests. These are mostly just changes to the -debug output, or choosing a different VF, but others aren't as obvious as to what's happening:

* AArch64/conditional-branches-cost.ll has more IR instructions, but fewer final machine instructions.
* In X86/pr81872.ll I've increased the trip count by 1, as otherwise the test no longer vectorizes.
* X86/consecutive-ptr-uniforms.ll has more IR instructions, but the branches have constant conditions, so it would later get simplified to just three stores.
@llvm/pr-subscribers-vectorizers @llvm/pr-subscribers-llvm-transforms

Author: John Brawn (john-brawn-arm)
Patch is 95.10 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/156864.diff

13 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 3fbeef1211954..a7021003b8dd4 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -6789,46 +6789,6 @@ LoopVectorizationPlanner::precomputeCosts(VPlan &Plan, ElementCount VF,
}
}
- /// Compute the cost of all exiting conditions of the loop using the legacy
- /// cost model. This is to match the legacy behavior, which adds the cost of
- /// all exit conditions. Note that this over-estimates the cost, as there will
- /// be a single condition to control the vector loop.
- SmallVector<BasicBlock *> Exiting;
- CM.TheLoop->getExitingBlocks(Exiting);
- SetVector<Instruction *> ExitInstrs;
- // Collect all exit conditions.
- for (BasicBlock *EB : Exiting) {
- auto *Term = dyn_cast<BranchInst>(EB->getTerminator());
- if (!Term || CostCtx.skipCostComputation(Term, VF.isVector()))
- continue;
- if (auto *CondI = dyn_cast<Instruction>(Term->getOperand(0))) {
- ExitInstrs.insert(CondI);
- }
- }
- // Compute the cost of all instructions only feeding the exit conditions.
- for (unsigned I = 0; I != ExitInstrs.size(); ++I) {
- Instruction *CondI = ExitInstrs[I];
- if (!OrigLoop->contains(CondI) ||
- !CostCtx.SkipCostComputation.insert(CondI).second)
- continue;
- InstructionCost CondICost = CostCtx.getLegacyCost(CondI, VF);
- LLVM_DEBUG({
- dbgs() << "Cost of " << CondICost << " for VF " << VF
- << ": exit condition instruction " << *CondI << "\n";
- });
- Cost += CondICost;
- for (Value *Op : CondI->operands()) {
- auto *OpI = dyn_cast<Instruction>(Op);
- if (!OpI || CostCtx.skipCostComputation(OpI, VF.isVector()) ||
- any_of(OpI->users(), [&ExitInstrs, this](User *U) {
- return OrigLoop->contains(cast<Instruction>(U)->getParent()) &&
- !ExitInstrs.contains(cast<Instruction>(U));
- }))
- continue;
- ExitInstrs.insert(OpI);
- }
- }
-
// Pre-compute the costs for branches except for the backedge, as the number
// of replicate regions in a VPlan may not directly match the number of
// branches, which would lead to different decisions.
@@ -6977,6 +6937,37 @@ static bool planContainsAdditionalSimplifications(VPlan &Plan,
});
});
}
+
+static bool planContainsDifferentCompares(VPlan &Plan, VPCostContext &CostCtx,
+ Loop *TheLoop, ElementCount VF) {
+ // Count how many compare instructions there are in the legacy cost model.
+ unsigned NumLegacyCompares = 0;
+ for (BasicBlock *BB : TheLoop->blocks()) {
+ for (auto &I : *BB) {
+ if (isa<CmpInst>(I)) {
+ NumLegacyCompares += 1;
+ }
+ }
+ }
+
+ // Count how many compare instructions there are in the VPlan.
+ unsigned NumVPlanCompares = 0;
+ auto Iter = vp_depth_first_deep(Plan.getVectorLoopRegion()->getEntry());
+ for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(Iter)) {
+ for (VPRecipeBase &R : *VPBB) {
+ if (auto *VPI = dyn_cast<VPInstruction>(&R)) {
+ if (VPI->getOpcode() == VPInstruction::BranchOnCount ||
+ VPI->getOpcode() == Instruction::ICmp ||
+ VPI->getOpcode() == Instruction::FCmp)
+ NumVPlanCompares += 1;
+ }
+ }
+ }
+
+ // If we have a different amount, then the legacy cost model and vplan will
+ // disagree.
+ return NumLegacyCompares != NumVPlanCompares;
+}
#endif
VectorizationFactor LoopVectorizationPlanner::computeBestVF() {
@@ -7085,7 +7076,9 @@ VectorizationFactor LoopVectorizationPlanner::computeBestVF() {
CostCtx, OrigLoop,
BestFactor.Width) ||
planContainsAdditionalSimplifications(
- getPlanFor(LegacyVF.Width), CostCtx, OrigLoop, LegacyVF.Width)) &&
+ getPlanFor(LegacyVF.Width), CostCtx, OrigLoop, LegacyVF.Width) ||
+ planContainsDifferentCompares(BestPlan, CostCtx, OrigLoop,
+ BestFactor.Width)) &&
" VPlan cost model and legacy cost model disagreed");
assert((BestFactor.Width.isScalar() || BestFactor.ScalarCost > 0) &&
"when vectorizing, the scalar cost must be computed.");
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 5f3503d0ce57a..1c4e38f75d626 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -28,6 +28,7 @@
#include "llvm/IR/Instruction.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Intrinsics.h"
+#include "llvm/IR/PatternMatch.h"
#include "llvm/IR/Type.h"
#include "llvm/IR/Value.h"
#include "llvm/Support/Casting.h"
@@ -40,6 +41,7 @@
#include <cassert>
using namespace llvm;
+using namespace llvm::PatternMatch;
using VectorParts = SmallVector<Value *, 2>;
@@ -1117,6 +1119,29 @@ InstructionCost VPInstruction::computeCost(ElementCount VF,
return Ctx.TTI.getIndexedVectorInstrCostFromEnd(Instruction::ExtractElement,
VecTy, Ctx.CostKind, 0);
}
+ case VPInstruction::BranchOnCount: {
+ // If TC <= VF then this is just a branch.
+ // FIXME: Removing the branch happens in simplifyBranchConditionForVFAndUF
+ // where it checks TC <= VF * UF, but we don't know UF yet. This means in
+ // some cases we get a cost that's too high due to counting a cmp that
+ // later gets removed.
+ Value *TC = getParent()->getPlan()->getTripCount()->getUnderlyingValue();
+ uint64_t C0;
+ if (TC && match(TC, m_ConstantInt(C0)) && C0 <= VF.getKnownMinValue())
+ return 0;
+ // Otherwise BranchOnCount generates ICmpEQ followed by a branch.
+ Type *ValTy = Ctx.Types.inferScalarType(getOperand(0));
+ return Ctx.TTI.getCmpSelInstrCost(Instruction::ICmp, ValTy,
+ CmpInst::makeCmpResultType(ValTy),
+ CmpInst::ICMP_EQ, Ctx.CostKind);
+ }
+ case Instruction::FCmp:
+ case Instruction::ICmp: {
+ Type *ValTy = Ctx.Types.inferScalarType(getOperand(0));
+ return Ctx.TTI.getCmpSelInstrCost(getOpcode(), ValTy,
+ CmpInst::makeCmpResultType(ValTy),
+ getPredicate(), Ctx.CostKind);
+ }
case VPInstruction::ExtractPenultimateElement:
if (VF == ElementCount::getScalable(1))
return InstructionCost::getInvalid();
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll b/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
index dad0290142f29..9d074c0c8d907 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
@@ -268,7 +268,7 @@ define void @latch_branch_cost(ptr %dst) {
; DEFAULT: [[MIDDLE_BLOCK]]:
; DEFAULT-NEXT: br i1 false, [[EXIT:label %.*]], label %[[VEC_EPILOG_ITER_CHECK:.*]]
; DEFAULT: [[VEC_EPILOG_ITER_CHECK]]:
-; DEFAULT-NEXT: br i1 false, label %[[VEC_EPILOG_SCALAR_PH]], label %[[VEC_EPILOG_PH]]
+; DEFAULT-NEXT: br i1 false, label %[[VEC_EPILOG_SCALAR_PH]], label %[[VEC_EPILOG_PH]], !prof [[PROF5:![0-9]+]]
; DEFAULT: [[VEC_EPILOG_PH]]:
; DEFAULT-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ 96, %[[VEC_EPILOG_ITER_CHECK]] ], [ 0, %[[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
; DEFAULT-NEXT: br label %[[VEC_EPILOG_VECTOR_BODY:.*]]
@@ -278,7 +278,7 @@ define void @latch_branch_cost(ptr %dst) {
; DEFAULT-NEXT: store <4 x i8> zeroinitializer, ptr [[TMP8]], align 1
; DEFAULT-NEXT: [[INDEX_NEXT2]] = add nuw i64 [[INDEX1]], 4
; DEFAULT-NEXT: [[TMP10:%.*]] = icmp eq i64 [[INDEX_NEXT2]], 100
-; DEFAULT-NEXT: br i1 [[TMP10]], label %[[VEC_EPILOG_MIDDLE_BLOCK:.*]], label %[[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
+; DEFAULT-NEXT: br i1 [[TMP10]], label %[[VEC_EPILOG_MIDDLE_BLOCK:.*]], label %[[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
; DEFAULT: [[VEC_EPILOG_MIDDLE_BLOCK]]:
; DEFAULT-NEXT: br i1 true, [[EXIT]], label %[[VEC_EPILOG_SCALAR_PH]]
; DEFAULT: [[VEC_EPILOG_SCALAR_PH]]:
@@ -429,14 +429,14 @@ define i32 @header_mask_and_invariant_compare(ptr %A, ptr %B, ptr %C, ptr %D, pt
; DEFAULT-NEXT: br label %[[VECTOR_BODY:.*]]
; DEFAULT: [[VECTOR_BODY]]:
; DEFAULT-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_STORE_CONTINUE37:.*]] ]
-; DEFAULT-NEXT: [[TMP9:%.*]] = load i32, ptr [[A]], align 4, !alias.scope [[META7:![0-9]+]]
+; DEFAULT-NEXT: [[TMP9:%.*]] = load i32, ptr [[A]], align 4, !alias.scope [[META8:![0-9]+]]
; DEFAULT-NEXT: [[BROADCAST_SPLATINSERT28:%.*]] = insertelement <4 x i32> poison, i32 [[TMP9]], i64 0
; DEFAULT-NEXT: [[BROADCAST_SPLAT29:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT28]], <4 x i32> poison, <4 x i32> zeroinitializer
-; DEFAULT-NEXT: [[TMP19:%.*]] = load i32, ptr [[B]], align 4, !alias.scope [[META10:![0-9]+]]
+; DEFAULT-NEXT: [[TMP19:%.*]] = load i32, ptr [[B]], align 4, !alias.scope [[META11:![0-9]+]]
; DEFAULT-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> poison, i32 [[TMP19]], i64 0
; DEFAULT-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> poison, <4 x i32> zeroinitializer
; DEFAULT-NEXT: [[TMP6:%.*]] = or <4 x i32> [[BROADCAST_SPLAT]], [[BROADCAST_SPLAT29]]
-; DEFAULT-NEXT: [[TMP7:%.*]] = load i32, ptr [[C]], align 4, !alias.scope [[META12:![0-9]+]]
+; DEFAULT-NEXT: [[TMP7:%.*]] = load i32, ptr [[C]], align 4, !alias.scope [[META13:![0-9]+]]
; DEFAULT-NEXT: [[BROADCAST_SPLATINSERT30:%.*]] = insertelement <4 x i32> poison, i32 [[TMP7]], i64 0
; DEFAULT-NEXT: [[BROADCAST_SPLAT31:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT30]], <4 x i32> poison, <4 x i32> zeroinitializer
; DEFAULT-NEXT: [[TMP8:%.*]] = icmp ugt <4 x i32> [[BROADCAST_SPLAT31]], [[TMP6]]
@@ -445,34 +445,34 @@ define i32 @header_mask_and_invariant_compare(ptr %A, ptr %B, ptr %C, ptr %D, pt
; DEFAULT-NEXT: br i1 [[TMP20]], label %[[PRED_STORE_IF:.*]], label %[[PRED_STORE_CONTINUE:.*]]
; DEFAULT: [[PRED_STORE_IF]]:
; DEFAULT-NEXT: [[TMP11:%.*]] = extractelement <4 x i32> [[TMP6]], i32 0
-; DEFAULT-NEXT: store i32 [[TMP11]], ptr [[E]], align 4, !alias.scope [[META14:![0-9]+]], !noalias [[META16:![0-9]+]]
+; DEFAULT-NEXT: store i32 [[TMP11]], ptr [[E]], align 4, !alias.scope [[META15:![0-9]+]], !noalias [[META17:![0-9]+]]
; DEFAULT-NEXT: br label %[[PRED_STORE_CONTINUE]]
; DEFAULT: [[PRED_STORE_CONTINUE]]:
; DEFAULT-NEXT: [[TMP12:%.*]] = extractelement <4 x i1> [[TMP8]], i32 1
; DEFAULT-NEXT: br i1 [[TMP12]], label %[[PRED_STORE_IF32:.*]], label %[[PRED_STORE_CONTINUE33:.*]]
; DEFAULT: [[PRED_STORE_IF32]]:
; DEFAULT-NEXT: [[TMP13:%.*]] = extractelement <4 x i32> [[TMP6]], i32 0
-; DEFAULT-NEXT: store i32 [[TMP13]], ptr [[E]], align 4, !alias.scope [[META14]], !noalias [[META16]]
+; DEFAULT-NEXT: store i32 [[TMP13]], ptr [[E]], align 4, !alias.scope [[META15]], !noalias [[META17]]
; DEFAULT-NEXT: br label %[[PRED_STORE_CONTINUE33]]
; DEFAULT: [[PRED_STORE_CONTINUE33]]:
; DEFAULT-NEXT: [[TMP14:%.*]] = extractelement <4 x i1> [[TMP8]], i32 2
; DEFAULT-NEXT: br i1 [[TMP14]], label %[[PRED_STORE_IF34:.*]], label %[[PRED_STORE_CONTINUE35:.*]]
; DEFAULT: [[PRED_STORE_IF34]]:
; DEFAULT-NEXT: [[TMP15:%.*]] = extractelement <4 x i32> [[TMP6]], i32 0
-; DEFAULT-NEXT: store i32 [[TMP15]], ptr [[E]], align 4, !alias.scope [[META14]], !noalias [[META16]]
+; DEFAULT-NEXT: store i32 [[TMP15]], ptr [[E]], align 4, !alias.scope [[META15]], !noalias [[META17]]
; DEFAULT-NEXT: br label %[[PRED_STORE_CONTINUE35]]
; DEFAULT: [[PRED_STORE_CONTINUE35]]:
; DEFAULT-NEXT: [[TMP21:%.*]] = extractelement <4 x i1> [[TMP8]], i32 3
; DEFAULT-NEXT: br i1 [[TMP21]], label %[[PRED_STORE_IF36:.*]], label %[[PRED_STORE_CONTINUE37]]
; DEFAULT: [[PRED_STORE_IF36]]:
; DEFAULT-NEXT: [[TMP22:%.*]] = extractelement <4 x i32> [[TMP6]], i32 0
-; DEFAULT-NEXT: store i32 [[TMP22]], ptr [[E]], align 4, !alias.scope [[META14]], !noalias [[META16]]
+; DEFAULT-NEXT: store i32 [[TMP22]], ptr [[E]], align 4, !alias.scope [[META15]], !noalias [[META17]]
; DEFAULT-NEXT: br label %[[PRED_STORE_CONTINUE37]]
; DEFAULT: [[PRED_STORE_CONTINUE37]]:
-; DEFAULT-NEXT: call void @llvm.masked.store.v4i32.p0(<4 x i32> zeroinitializer, ptr [[TMP16]], i32 4, <4 x i1> [[TMP8]]), !alias.scope [[META18:![0-9]+]], !noalias [[META19:![0-9]+]]
+; DEFAULT-NEXT: call void @llvm.masked.store.v4i32.p0(<4 x i32> zeroinitializer, ptr [[TMP16]], i32 4, <4 x i1> [[TMP8]]), !alias.scope [[META19:![0-9]+]], !noalias [[META20:![0-9]+]]
; DEFAULT-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
; DEFAULT-NEXT: [[TMP18:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
-; DEFAULT-NEXT: br i1 [[TMP18]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP20:![0-9]+]]
+; DEFAULT-NEXT: br i1 [[TMP18]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP21:![0-9]+]]
; DEFAULT: [[MIDDLE_BLOCK]]:
; DEFAULT-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC]]
; DEFAULT-NEXT: br i1 [[CMP_N]], [[EXIT:label %.*]], label %[[SCALAR_PH]]
@@ -533,25 +533,47 @@ define void @multiple_exit_conditions(ptr %src, ptr noalias %dst) #1 {
; DEFAULT-LABEL: define void @multiple_exit_conditions(
; DEFAULT-SAME: ptr [[SRC:%.*]], ptr noalias [[DST:%.*]]) #[[ATTR2:[0-9]+]] {
; DEFAULT-NEXT: [[ENTRY:.*:]]
-; DEFAULT-NEXT: br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; DEFAULT-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
+; DEFAULT-NEXT: [[TMP4:%.*]] = shl nuw i64 [[TMP0]], 3
+; DEFAULT-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 257, [[TMP4]]
+; DEFAULT-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
; DEFAULT: [[VECTOR_PH]]:
-; DEFAULT-NEXT: [[IND_END:%.*]] = getelementptr i8, ptr [[DST]], i64 2048
-; DEFAULT-NEXT: br label %[[VECTOR_BODY:.*]]
-; DEFAULT: [[VECTOR_BODY]]:
-; DEFAULT-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; DEFAULT-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
+; DEFAULT-NEXT: [[TMP3:%.*]] = mul nuw i64 [[TMP2]], 8
+; DEFAULT-NEXT: [[N_MOD_VF:%.*]] = urem i64 257, [[TMP3]]
+; DEFAULT-NEXT: [[INDEX:%.*]] = sub i64 257, [[N_MOD_VF]]
; DEFAULT-NEXT: [[OFFSET_IDX:%.*]] = mul i64 [[INDEX]], 8
; DEFAULT-NEXT: [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[DST]], i64 [[OFFSET_IDX]]
+; DEFAULT-NEXT: [[TMP6:%.*]] = mul i64 [[INDEX]], 2
+; DEFAULT-NEXT: br label %[[VECTOR_BODY:.*]]
+; DEFAULT: [[VECTOR_BODY]]:
+; DEFAULT-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; DEFAULT-NEXT: [[OFFSET_IDX1:%.*]] = mul i64 [[INDEX1]], 8
+; DEFAULT-NEXT: [[NEXT_GEP1:%.*]] = getelementptr i8, ptr [[DST]], i64 [[OFFSET_IDX1]]
; DEFAULT-NEXT: [[TMP1:%.*]] = load i16, ptr [[SRC]], align 2
-; DEFAULT-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <8 x i16> poison, i16 [[TMP1]], i64 0
-; DEFAULT-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <8 x i16> [[BROADCAST_SPLATINSERT]], <8 x i16> poison, <8 x i32> zeroinitializer
-; DEFAULT-NEXT: [[TMP2:%.*]] = or <8 x i16> [[BROADCAST_SPLAT]], splat (i16 1)
-; DEFAULT-NEXT: [[TMP3:%.*]] = uitofp <8 x i16> [[TMP2]] to <8 x double>
-; DEFAULT-NEXT: store <8 x double> [[TMP3]], ptr [[NEXT_GEP]], align 8
-; DEFAULT-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
-; DEFAULT-NEXT: [[TMP5:%.*]] = icmp eq i64 [[INDEX_NEXT]], 256
-; DEFAULT-NEXT: br i1 [[TMP5]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP22:![0-9]+]]
+; DEFAULT-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 2 x i16> poison, i16 [[TMP1]], i64 0
+; DEFAULT-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 2 x i16> [[BROADCAST_SPLATINSERT]], <vscale x 2 x i16> poison, <vscale x 2 x i32> zeroinitializer
+; DEFAULT-NEXT: [[TMP11:%.*]] = or <vscale x 2 x i16> [[BROADCAST_SPLAT]], splat (i16 1)
+; DEFAULT-NEXT: [[TMP15:%.*]] = uitofp <vscale x 2 x i16> [[TMP11]] to <vscale x 2 x double>
+; DEFAULT-NEXT: [[TMP16:%.*]] = call i64 @llvm.vscale.i64()
+; DEFAULT-NEXT: [[TMP17:%.*]] = shl nuw i64 [[TMP16]], 1
+; DEFAULT-NEXT: [[TMP18:%.*]] = getelementptr double, ptr [[NEXT_GEP1]], i64 [[TMP17]]
+; DEFAULT-NEXT: [[TMP19:%.*]] = call i64 @llvm.vscale.i64()
+; DEFAULT-NEXT: [[TMP20:%.*]] = shl nuw i64 [[TMP19]], 2
+; DEFAULT-NEXT: [[TMP21:%.*]] = getelementptr double, ptr [[NEXT_GEP1]], i64 [[TMP20]]
+; DEFAULT-NEXT: [[TMP22:%.*]] = call i64 @llvm.vscale.i64()
+; DEFAULT-NEXT: [[TMP23:%.*]] = mul nuw i64 [[TMP22]], 6
+; DEFAULT-NEXT: [[TMP24:%.*]] = getelementptr double, ptr [[NEXT_GEP1]], i64 [[TMP23]]
+; DEFAULT-NEXT: store <vscale x 2 x double> [[TMP15]], ptr [[NEXT_GEP1]], align 8
+; DEFAULT-NEXT: store <vscale x 2 x double> [[TMP15]], ptr [[TMP18]], align 8
+; DEFAULT-NEXT: store <vscale x 2 x double> [[TMP15]], ptr [[TMP21]], align 8
+; DEFAULT-NEXT: store <vscale x 2 x double> [[TMP15]], ptr [[TMP24]], align 8
+; DEFAULT-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX1]], [[TMP3]]
+; DEFAULT-NEXT: [[TMP25:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[INDEX]]
+; DEFAULT-NEXT: br i1 [[TMP25]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP23:![0-9]+]]
; DEFAULT: [[MIDDLE_BLOCK]]:
-; DEFAULT-NEXT: br label %[[SCALAR_PH]]
+; DEFAULT-NEXT: [[CMP_N:%.*]] = icmp eq i64 257, [[INDEX]]
+; DEFAULT-NEXT: br i1 [[CMP_N]], [[EXIT:label %.*]], label %[[SCALAR_PH]]
; DEFAULT: [[SCALAR_PH]]:
;
; PRED-LABEL: define void @multiple_exit_conditions(
@@ -658,12 +680,12 @@ define void @low_trip_count_fold_tail_scalarized_store(ptr %dst) {
; COMMON-NEXT: store i8 6, ptr [[TMP6]], align 1
; COMMON-NEXT: br label %[[PRED_STORE_CONTINUE12]]
; COMMON: [[PRED_STORE_CONTINUE12]]:
-; COMMON-NEXT: br i1 false, label %[[PRED_STORE_IF13:.*]], label %[[PRED_STORE_CONTINUE14:.*]]
+; COMMON-NEXT: br i1 false, label %[[PRED_STORE_IF13:.*]], label %[[EXIT1:.*]]
; COMMON: [[PRED_STORE_IF13]]:
; COMMON-NEXT: [[TMP7:%.*]] = getelementptr i8, ptr [[DST]], i64 7
; COMMON-NEXT: store i8 7, ptr [[TMP7]], align 1
-; COMMON-NEXT: br label %[[PRED_STORE_CONTINUE14]]
-; COMMON: [[PRED_STORE_CONTINUE14]]:
+; COMMON-NEXT: br label %[[EXIT1]]
+; COMMON: [[EXIT1]]:
; COMMON-NEXT: br label %[[MIDDLE_BLOCK:.*]]
; COMMON: [[MIDDLE_BLOCK]]:
; COMMON-NEXT: br [[EXIT:label %.*]]
@@ -866,7 +888,7 @@ define void @test_conditional_interleave_group (ptr noalias %src.1, ptr noalias
; DEFAULT-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
; DEFAULT-NEXT: [[VEC_IND_NEXT]] = add <8 x i64> [[VEC_IND]], splat (i64 8)
; DEFAULT-NEXT: [[TMP80:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
-; DEFAULT-NEXT: br i1 [[TMP80]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP25:![0-9]+]]
+; DEFAULT-NEXT: br i1 [[TMP80]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP26:![0-9]+]]
; DEFAULT: [[MIDDLE_BLOCK]]:
; DEFAULT-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC]]
; DEFAULT-NEXT: br i1 [[CMP_N]], [[EXIT:label %.*]], label %[[SCALAR_PH]]
@@ -1115,7 +1137,7 @@ define void @redundant_branch_and_tail_folding(ptr %dst, i1 %c) {
; DEFAULT-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
; DEFAULT-NEXT: [[VEC_IND_NEXT]] = add <4 x i64> [[STEP_ADD]], splat (i64 4)
; DEFAULT-NEXT: [[TMP3:%.*]] = icmp eq i64 [[INDEX_NEXT]], 16
-; DEFAULT-NEXT: br i1 [[TMP3]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP27:![0-9]+]]
+; DEFAULT-NEXT: br i1 [[TMP3]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP28:![0-9]+]]
; DEFAULT: [[MIDDLE_BLOCK]]:
; DEFAULT-NEXT: br label %[[SCALAR_PH]]
; DEFAULT: [[SCALAR_PH]]:
@@ -1289,7 +1311,7 @@ defin...
[truncated]
Thanks for moving more cost modelling into vplan. I just had a few concerns about some missing costs and a possible test regression.
// later gets removed.
Value *TC = getParent()->getPlan()->getTripCount()->getUnderlyingValue();
ConstantInt *TCConst = dyn_cast_if_present<ConstantInt>(TC);
if (TCConst && TCConst->getValue().ule(VF.getKnownMinValue()))
Can you add a TODO for the case where TC=vscale x M and VF=vscale * N as well? In such cases we should also be able to prove that TC <= VF because it just requires asking if M <= N.
Will do.
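For illustration, a sketch of what that TODO might eventually look like, reusing the PatternMatch helpers already used for the constant trip-count case (this is an assumed shape, not code from this patch, and it assumes TC, VF and match are in scope as in computeCost):

// Hypothetical extension (not in this patch): if TC = vscale * M and VF is
// scalable with a known minimum of N lanes, then TC <= VF follows from
// M <= N, so the compare would also be removed and can be costed as 0.
// A real implementation would also handle commuted or shl forms of the
// multiply.
uint64_t M;
if (TC && VF.isScalable() &&
    match(TC, m_Mul(m_VScale(), m_ConstantInt(M))) &&
    M <= VF.getKnownMinValue())
  return 0;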
return Ctx.TTI.getIndexedVectorInstrCostFromEnd(Instruction::ExtractElement,
VecTy, Ctx.CostKind, 0);
}
case VPInstruction::BranchOnCount: {
I think BranchOnCond needs adding too.
BranchOnCond doesn't cause a compare instruction to be generated; it uses the condition generated by another instruction.
Type *ValTy = Ctx.Types.inferScalarType(getOperand(0));
return Ctx.TTI.getCmpSelInstrCost(Instruction::ICmp, ValTy,
CmpInst::makeCmpResultType(ValTy),
CmpInst::ICMP_EQ, Ctx.CostKind);
I think you need to add the cost of the branch as well.
I've deliberately avoided touching branch costs to keep the scope of this work as small as possible.
; CHECK-NEXT: Cost of 0 for VF 8: induction instruction %i.iv = phi i64 [ 0, %entry ], [ %i.iv.next, %for.body ]
; CHECK-NEXT: Cost of 1 for VF 8: exit condition instruction %exitcond.not = icmp eq i64 %i.iv.next, 16
; CHECK-NEXT: Cost of 0 for VF 8: EMIT vp<{{.+}}> = CANONICAL-INDUCTION ir<0>, vp<%index.next>
; CHECK: Cost of 1 for VF 8: EMIT branch-on-count vp<%index.next>, vp<{{.+}}>
I think @igogo-x86 added this file in #106699. You might also want to remove addFullyUnrolledInstructionsToIgnore, since you're now dealing with the cost properly in vplan?
It looks like that function is marking both compares and induction variable instructions as needing to be ignored. So we could remove the ignoring of compares, except that the function is also called from the legacy cost model which does need to continue ignoring the compares. So I think the call to addFullyUnrolledInstructionsToIgnore in LoopVectorizationPlanner::precomputeCosts can only be removed once the vplan cost model is also correctly handling the induction variable instructions.
; PRED-NEXT: [[TMP35:%.*]] = extractelement <2 x i1> [[ACTIVE_LANE_MASK_NEXT]], i32 0
; PRED-NEXT: [[TMP36:%.*]] = xor i1 [[TMP35]], true
; PRED-NEXT: [[VEC_IND_NEXT]] = add <4 x i32> [[VEC_IND]], splat (i32 4)
; PRED-NEXT: [[VEC_IND_NEXT]] = add <2 x i32> [[VEC_IND]], splat (i32 2)
Do you know why we're choosing a lower VF here? It may be a sensible change, but it's not obvious what triggered the different choice of VF, as it doesn't seem like an FCmp, ICmp or BranchOnCount VPInstruction is being generated here.
In the vplan the vector loop branch is
EMIT vp<%active.lane.mask.next> = active lane mask vp<%12>, vp<%5>, ir<1>
EMIT vp<%13> = not vp<%active.lane.mask.next>
EMIT branch-on-cond vp<%13>
The loop branch condition comes from the active lane mask where we're already calculating the cost of
Cost of 4 for VF 2: EMIT vp<%active.lane.mask.entry> = active lane mask vp<%index.part.next>, vp<%4>, ir<1>
Previously we also added the cost of a compare in the legacy cost model (which doesn't correspond to anything in the vplan)
Cost of 1 for VF 2: exit condition instruction %exitcond.3.not = icmp eq i64 %iv.1, %N
meaning the cost was 1 higher than it should have been. This means the calculated costs are
| VF | Before this patch | With this patch |
|---|---|---|
| 2 | 12, 6 per lane | 11, 5.5 per lane |
| 4 | 23, 5.8 per lane | 22, 5.5 per lane |
So VF 2 is now chosen because we calculate that its per-lane cost is the same as for VF 4.
; CHECK-NEXT: br i1 [[CMP2]], label %[[RETURN:.*]], label %[[FOR_BODY]]
; CHECK: [[RETURN]]:
; CHECK-NEXT: ret void
; CHECK-NEXT: [[ENTRY:.*:]]
At first glance this looks worse than before.
All of the control flow here is using constant conditions (because we know the trip count is 3 so we know which lanes need to be stored) so after simplifycfg we get
define void @PR40816() #1 {
entry:
store i32 0, ptr @b, align 1
store i32 1, ptr @b, align 1
store i32 2, ptr @b, align 1
ret void
}
Though if I modify this test a little so that the trip count is 7 (make array @a larger) then we still vectorize but don't get constant conditions for the branches and so it is worse. I'll look into this.
It looks like what's going on here is:
- Currently the load from arrayidx is considered to be part of calculating the loop exit condition, and so the cost is calculated in LoopVectorizationPlanner::precomputeCosts. It gets a very high cost due to useEmulatedMaskMemRefHack so we don't vectorize.
- In the vplan something has figured out that the loop has a constant trip count due to the load being from a constant array, so the load has been removed.
- With this patch that means we don't use the cost of the load, as it no longer exists, and the resulting cost says that vectorization is profitable.
If I manually transform the function into what it is after the vplan transformation then it looks like
define void @PR40816_adj() #1 {
entry:
br label %for.body
for.body:
%0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
store i32 %0, ptr @b, align 1
%inc = add nuw nsw i32 %0, 1
%cmp = icmp uge i32 %inc, 7
br i1 %cmp, label %return, label %for.body
return:
ret void
}
and this currently gets vectorized. This is in fact very similar to the test low_trip_count_fold_tail_scalarized_store in llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll.
I think this ultimately comes down to this FIXME in LoopVectorizationCostModel::setCostBasedWideningDecision
// Load: Scalar load + broadcast
// Store: Scalar store + isLoopInvariantStoreValue ? 0 : extract
// FIXME: This cost is a significant under-estimate for tail folded
// memory ops.
const InstructionCost ScalarizationCost =
IsLegalToScalarize() ? getUniformMemOpCost(&I, VF)
: InstructionCost::getInvalid();
Likely related to @ElvisWang123's #125640. IIRC the latest status was that there were still some regressions needing investigation.
for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(Iter)) {
for (VPRecipeBase &R : *VPBB) {
if (auto *VPI = dyn_cast<VPInstruction>(&R)) {
if (VPI->getOpcode() == VPInstruction::BranchOnCount ||
That doesn't seem quite right; isn't this missing at least wide compares and replicating/uniform compares?
I've changed this to use VPlanPatternMatch, which should catch everything I think.
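For context, a rough sketch of what counting the extra recipe kinds directly could look like; this is only an illustration of the reviewer's point (the recipe API usage is an assumption), and the updated PR uses VPlanPatternMatch instead:

// Sketch: count compares from widened and replicated recipes as well as
// VPInstructions (illustration only; not the code used in the PR).
unsigned NumVPlanCompares = 0;
for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(Iter)) {
  for (VPRecipeBase &R : *VPBB) {
    unsigned Opcode = 0;
    if (auto *VPI = dyn_cast<VPInstruction>(&R))
      Opcode = VPI->getOpcode();
    else if (auto *WidenR = dyn_cast<VPWidenRecipe>(&R))
      Opcode = WidenR->getOpcode();
    else if (auto *RepR = dyn_cast<VPReplicateRecipe>(&R))
      Opcode = RepR->getUnderlyingInstr()->getOpcode();
    if (Opcode == VPInstruction::BranchOnCount ||
        Opcode == Instruction::ICmp || Opcode == Instruction::FCmp)
      ++NumVPlanCompares;
  }
}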
Ping.
I'm all in favour of making steps to retire the legacy cost model where possible! However, I think there are still a few issues that need dealing with, particularly as the scope of this PR is actually a bit greater than it seems.
VecTy, Ctx.CostKind, 0);
}
case VPInstruction::BranchOnCount: {
// If TC <= VF then this is just a branch.
I'm not sure what this means. Are you saying that we create a vplan for a given VF despite knowing that we will never enter the vector loop? I guess this can happen if TC is exactly equal to VF or we're using tail-folding, but not using the mask for control flow.
As mentioned in the comment below this, this transformation is happening in simplifyBranchConditionForVFAndUF and means the vector loop is executed exactly once. TC < VF can happen with tail folding, e.g. low_trip_count_fold_tail_scalarized_store in llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll.
}
case VPInstruction::BranchOnCount: {
// If TC <= VF then this is just a branch.
// FIXME: Removing the branch happens in simplifyBranchConditionForVFAndUF
So you're saying the cost of this branch is based on a prediction about what simplifyBranchConditionForVFAndUF is going to do later on? I guess that's fine so long as they're both using the same logic. Ideally both simplifyBranchConditionForVFAndUF and this code would call the same common function checking if the branch will be simplified or not. I'm just a bit worried that over time the two will diverge. Although I appreciate here in the code you'd have to assume UF=1.
For example, if you pulled this code out of simplifyBranchConditionForVFAndUF into a common function you could reuse it in both places:
// Try to simplify the branch condition if TC <= VF * UF when the latch
// terminator is BranchOnCount or BranchOnCond where the input is
// Not(ActiveLaneMask).
const SCEV *TripCount =
vputils::getSCEVExprForVPValue(Plan.getTripCount(), SE);
assert(!isa<SCEVCouldNotCompute>(TripCount) &&
"Trip count SCEV must be computable");
ElementCount NumElements = BestVF.multiplyCoefficientBy(BestUF);
const SCEV *C = SE.getElementCount(TripCount->getType(), NumElements);
if (TripCount->isZero() ||
!SE.isKnownPredicate(CmpInst::ICMP_ULE, TripCount, C))
return false;
You'd also now be able to remove the "// FIXME: The compare could also be removed if TC = M * vscale" comment below.
I'll look into doing that.
I don't think that's the way to go; we shouldn't duplicate that kind of reasoning here. The TODO should say that the branch should be simplified before we compute the costs.
As a workaround for now, catching some cases here should be fine, as this should only mean we may miss some new optimizations, not make things worse.
Well, I do think we should be dealing with trip counts that are multiples of vscale here too since we now support them and we know that simplifyBranchConditionForVFAndUF should correctly detect TC=3 x vscale < VF=4 x vscale. It would seem unfair to treat TC=3 <= VF=4 as a cost of 0 and TC=3 * vscale <= VF=4 * vscale as a cost of 1 just so we can keep the code simple, right?
After looking into this, the problem is that we don't have access to the ScalarEvolution object in VPInstruction, so putting code from simplifyBranchConditionForVFAndUF into a function and calling it won't work, as that makes use of ScalarEvolution. Simplifying the branch before we compute the cost seems like a good solution to this.
OK no problem. I wasn't sure how easy it would be, but thanks for looking into it! I can follow up with a later PR to get access to the SCEV here.
CmpInst::makeCmpResultType(ValTy),
CmpInst::ICMP_EQ, Ctx.CostKind);
}
case Instruction::FCmp:
The change below is about more than just the loop conditions. See VPPredicator::createHeaderMask for an example where we explicitly introduce an icmp for the current tail-folding mask. In fact, I don't think the cost below is correct because the icmp can have vector inputs.
I think you either need to:
- Find a way to bail out if the icmp/fcmp isn't used as a branch condition, or
- Add support for vector types using ValTy = toVectorTy(ValTy, VF), and just make sure the example of VPPredicator::createHeaderMask is being tested in this PR.
Sorry about this!
getCostForRecipeWithOpcode already correctly handles vector compares, so I've changed this to use that function.
This causes a bunch of test changes, but it looks like they are due to correctly accounting for the cost of a vector compare. In test/Transforms/LoopVectorize/X86/pr81872.ll I've adjusted the iteration count again so that it still vectorizes.
Thanks for being patient with all the comments. I spotted a bug in the icmp/fcmp cost calculation, but once that's fixed I'm happy!
}
case Instruction::FCmp:
case Instruction::ICmp:
return getCostForRecipeWithOpcode(getOpcode(), VF, Ctx);
Sorry about this, but while I was trying to understand what's going on with the test changes to conditional-branches-cost.ll I discovered we're now treating the icmp for the exit condition as a vector comparison rather than a scalar comparison. This means we're actually reporting costs for fixed-width VFs that are too high.
I think this code needs changing to be:
return getCostForRecipeWithOpcode(getOpcode(),
vputils::onlyFirstLaneUsed(this) ? ElementCount::getFixed(1) : VF, Ctx);
Done.
; PRED-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 [[TMP0]])
; PRED-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 16 x i8> poison, i8 [[Y]], i64 0
; PRED-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 16 x i8> [[BROADCAST_SPLATINSERT]], <vscale x 16 x i8> poison, <vscale x 16 x i32> zeroinitializer
; PRED-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[TMP0]])
Looks like the reason this test has changed behaviour is that we were previously adding on the cost of the original scalar exit condition (an icmp) when in reality the vplan doesn't have one. Nice!
I think it's the same conclusion you came to with the induction-costs-sve.ll test below.
// Count how many compare instructions there are in the VPlan.
unsigned NumVPlanCompares = 0;
auto Iter = vp_depth_first_deep(Plan.getVectorLoopRegion()->getEntry());
for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(Iter)) {
vp_depth_first_deep will also leave the region and visit its successors, so we will also count the compare in the middle block, and almost always overcount the compares in the VPlan. It probably needs to check if we left the region.
I think it will also always disable the check if we have loops controlled by active-lane-mask?
I've changed this to ignore blocks outside of the vector loop region.
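For illustration, restricting the traversal to the loop region might look something like this (a sketch assuming VPBasicBlock::getEnclosingLoopRegion; the actual change in the updated PR may differ):

// Sketch: vp_depth_first_deep can also visit the region's successors (e.g.
// the middle block), so only count compares in blocks that are still inside
// the vector loop region.
auto *LoopRegion = Plan.getVectorLoopRegion();
auto Iter = vp_depth_first_deep(LoopRegion->getEntry());
for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(Iter)) {
  if (VPBB->getEnclosingLoopRegion() != LoopRegion)
    continue;
  // ... count compares in VPBB as before ...
}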
I think it will also always disable the check if we have loops controlled by active-lane-mask?
I'm not sure what you're asking here. When we have a loop that's using an active-lane-mask, the vplan will have something like (example here taken from llvm/test/Transforms/LoopVectorize/AArch64/sve-wide-lane-mask.ll)
Cost of 1 for VF vscale x 4: EMIT vp<%active.lane.mask.next> = active lane mask vp<%10>, vp<%4>, ir<1>
Cost of 0 for VF vscale x 4: EMIT vp<%11> = not vp<%active.lane.mask.next>
Cost of 0 for VF vscale x 4: EMIT branch-on-cond vp<%11>
the legacy cost model will have
LV: Found an estimated cost of 1 for VF vscale x 4 For instruction: %exitcond.not = icmp eq i64 %iv.next, %n
LV: Found an estimated cost of 0 for VF vscale x 4 For instruction: br i1 %exitcond.not, label %for.end, label %for.body
planContainsDifferentCompares would count 1 compare in the legacy cost model, no compares in the vplan, and return true.
planContainsDifferentCompares would count 1 compare in the legacy cost model, no compares in the vplan, and return true.
Yep, what I was wondering was if we could exclude plans with ActiveLaneMask terminated exiting blocks from the carve-out, to preserve the original check for those?
Sorry, it's still not clear what you're asking here. Do you mean: planContainsDifferentCompares should return false for plans that contain ActiveLaneMask terminated exiting blocks, so that the assert in computeBestVF that calls planContainsDifferentCompares will do the check BestFactor.Width == LegacyVF.Width? If so then possibly we could, though I don't think it would make a difference as I haven't found an example where planContainsAdditionalSimplifications doesn't also return true (which it will do because the cmp in the legacy cost model doesn't correspond to anything in the vplan).