Changes from all commits
19 commits
861d883
[VPlan] Don't use the legacy cost model for loop conditions
john-brawn-arm Jul 31, 2025
c3f2e8f
Do TC <= VF check differently to avoid llvm::PatternMatch error
john-brawn-arm Sep 4, 2025
77a2769
Merge branch 'main' into vplan_cmp_cost
john-brawn-arm Sep 8, 2025
19b984c
Add extra comment to VPInstruction::computeCost
john-brawn-arm Sep 11, 2025
2b83698
Merge branch 'main' into vplan_cmp_cost
john-brawn-arm Sep 16, 2025
ec22b08
Merge branch 'main' into vplan_cmp_cost
john-brawn-arm Sep 18, 2025
1bf3611
Merge branch 'main' into vplan_cmp_cost
john-brawn-arm Sep 22, 2025
2a6c490
Use VPlanPatternMatch for counting the number of compares
john-brawn-arm Sep 23, 2025
6e23400
Merge branch 'main' into vplan_cmp_cost
john-brawn-arm Oct 1, 2025
24fa802
Merge branch 'main' into vplan_cmp_cost
john-brawn-arm Oct 7, 2025
515264b
Use getCostForRecipeWithOpcode for cmp cost
john-brawn-arm Oct 15, 2025
50b801f
Merge branch 'main' into vplan_cmp_cost
john-brawn-arm Oct 20, 2025
2d3999b
Fix tests after merge
john-brawn-arm Oct 20, 2025
33b76ce
Adjust cmp cost calculation
john-brawn-arm Oct 22, 2025
8c21ea3
Make planContainsDifferentCompares ignore blocks outside the vector r…
john-brawn-arm Oct 22, 2025
759741c
Merge branch 'main' into vplan_cmp_cost
john-brawn-arm Oct 23, 2025
60b76a2
Update test after merge
john-brawn-arm Oct 23, 2025
8a4c2ad
Merge branch 'main' into vplan_cmp_cost
john-brawn-arm Oct 24, 2025
ae498b8
Merge branch 'main' into vplan_cmp_cost
john-brawn-arm Nov 3, 2025
77 changes: 36 additions & 41 deletions llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -6791,46 +6791,6 @@ LoopVectorizationPlanner::precomputeCosts(VPlan &Plan, ElementCount VF,
}
}

/// Compute the cost of all exiting conditions of the loop using the legacy
/// cost model. This is to match the legacy behavior, which adds the cost of
/// all exit conditions. Note that this over-estimates the cost, as there will
/// be a single condition to control the vector loop.
SmallVector<BasicBlock *> Exiting;
CM.TheLoop->getExitingBlocks(Exiting);
SetVector<Instruction *> ExitInstrs;
// Collect all exit conditions.
for (BasicBlock *EB : Exiting) {
auto *Term = dyn_cast<BranchInst>(EB->getTerminator());
if (!Term || CostCtx.skipCostComputation(Term, VF.isVector()))
continue;
if (auto *CondI = dyn_cast<Instruction>(Term->getOperand(0))) {
ExitInstrs.insert(CondI);
}
}
// Compute the cost of all instructions only feeding the exit conditions.
for (unsigned I = 0; I != ExitInstrs.size(); ++I) {
Instruction *CondI = ExitInstrs[I];
if (!OrigLoop->contains(CondI) ||
!CostCtx.SkipCostComputation.insert(CondI).second)
continue;
InstructionCost CondICost = CostCtx.getLegacyCost(CondI, VF);
LLVM_DEBUG({
dbgs() << "Cost of " << CondICost << " for VF " << VF
<< ": exit condition instruction " << *CondI << "\n";
});
Cost += CondICost;
for (Value *Op : CondI->operands()) {
auto *OpI = dyn_cast<Instruction>(Op);
if (!OpI || CostCtx.skipCostComputation(OpI, VF.isVector()) ||
any_of(OpI->users(), [&ExitInstrs, this](User *U) {
return OrigLoop->contains(cast<Instruction>(U)->getParent()) &&
!ExitInstrs.contains(cast<Instruction>(U));
}))
continue;
ExitInstrs.insert(OpI);
}
}

// Pre-compute the costs for branches except for the backedge, as the number
// of replicate regions in a VPlan may not directly match the number of
// branches, which would lead to different decisions.
@@ -7011,6 +6971,39 @@ static bool planContainsAdditionalSimplifications(VPlan &Plan,
});
});
}

static bool planContainsDifferentCompares(VPlan &Plan, VPCostContext &CostCtx,
Loop *TheLoop, ElementCount VF) {
// Count how many compare instructions there are in the legacy cost model.
unsigned NumLegacyCompares = 0;
for (BasicBlock *BB : TheLoop->blocks()) {
for (auto &I : *BB) {
if (isa<CmpInst>(I)) {
NumLegacyCompares += 1;
}
}
}

// Count how many compare instructions there are in the VPlan.
unsigned NumVPlanCompares = 0;
VPRegionBlock *VectorRegion = Plan.getVectorLoopRegion();
auto Iter = vp_depth_first_deep(VectorRegion->getEntry());
for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(Iter)) {
Contributor

vp_depth_first_deep will also leave the region and visit its successors, so we will also count the compare in the middle block, and almost always overcount the compares in VPlan. Probably needs to check if we left the region.

I think it will also always disable the check if we have loops controlled by active-lane-mask?

Collaborator Author

I've changed this to ignore blocks outside of the vector loop region.

> I think it will also always disable the check if we have loops controlled by active-lane-mask?

I'm not sure what you're asking here. When we have a loop that's using an active-lane-mask, the vplan will have something like this (example taken from llvm/test/Transforms/LoopVectorize/AArch64/sve-wide-lane-mask.ll):

Cost of 1 for VF vscale x 4: EMIT vp<%active.lane.mask.next> = active lane mask vp<%10>, vp<%4>, ir<1>
Cost of 0 for VF vscale x 4: EMIT vp<%11> = not vp<%active.lane.mask.next>
Cost of 0 for VF vscale x 4: EMIT branch-on-cond vp<%11>

while the legacy cost model will have:

LV: Found an estimated cost of 1 for VF vscale x 4 For instruction:   %exitcond.not = icmp eq i64 %iv.next, %n
LV: Found an estimated cost of 0 for VF vscale x 4 For instruction:   br i1 %exitcond.not, label %for.end, label %for.body

planContainsDifferentCompares would count 1 compare in the legacy cost model, no compares in the vplan, and return true.

Contributor

> planContainsDifferentCompares would count 1 compare in the legacy cost model, no compares in the vplan, and return true.

Yep, what I was wondering was whether we could exclude plans with ActiveLaneMask-terminated exiting blocks from the carve-out, to preserve the original check for those?

Collaborator Author

Sorry, it's still not clear what you're asking here. Do you mean that planContainsDifferentCompares should return false for plans that contain ActiveLaneMask-terminated exiting blocks, so that the assert in computeBestVF that calls planContainsDifferentCompares will do the BestFactor.Width == LegacyVF.Width check? If so then possibly we could, though I don't think it would make a difference, as I haven't found an example where planContainsAdditionalSimplifications doesn't also return true (which it will, because the cmp in the legacy cost model doesn't correspond to anything in the vplan).
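
A rough, purely illustrative sketch of the kind of carve-out exclusion being discussed (not part of this PR; the helper name and the exact way of detecting an ActiveLaneMask-controlled latch are assumptions):

    // Hypothetical helper: return true if the vector loop's latch is
    // controlled by an active-lane-mask, i.e. it ends in
    // BranchOnCond(Not(ActiveLaneMask)).
    static bool planIsControlledByActiveLaneMask(VPlan &Plan) {
      using namespace VPlanPatternMatch;
      VPBasicBlock *Exiting = Plan.getVectorLoopRegion()->getExitingBasicBlock();
      VPRecipeBase *Term = &Exiting->back();
      VPValue *Cond;
      if (!match(Term, m_BranchOnCond(m_Not(m_VPValue(Cond)))))
        return false;
      // The negated branch condition should be the next active-lane-mask value.
      auto *CondI = dyn_cast_or_null<VPInstruction>(Cond->getDefiningRecipe());
      return CondI && CondI->getOpcode() == VPInstruction::ActiveLaneMask;
    }

planContainsDifferentCompares could then return false early for such plans, preserving the original BestFactor.Width == LegacyVF.Width check for them.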

// Only the blocks in the vector region are relevant.
if (VPBB->getEnclosingLoopRegion() != VectorRegion)
continue;
for (VPRecipeBase &R : *VPBB) {
using namespace VPlanPatternMatch;
if (match(&R, m_BranchOnCount(m_VPValue(), m_VPValue())) ||
match(&R, m_Cmp(m_VPValue(), m_VPValue())))
NumVPlanCompares += 1;
}
}

// If we have a different amount, then the legacy cost model and vplan will
// disagree.
return NumLegacyCompares != NumVPlanCompares;
}
#endif

VectorizationFactor LoopVectorizationPlanner::computeBestVF() {
@@ -7122,7 +7115,9 @@ VectorizationFactor LoopVectorizationPlanner::computeBestVF() {
CostCtx, OrigLoop,
BestFactor.Width) ||
planContainsAdditionalSimplifications(
getPlanFor(LegacyVF.Width), CostCtx, OrigLoop, LegacyVF.Width)) &&
getPlanFor(LegacyVF.Width), CostCtx, OrigLoop, LegacyVF.Width) ||
planContainsDifferentCompares(BestPlan, CostCtx, OrigLoop,
BestFactor.Width)) &&
" VPlan cost model and legacy cost model disagreed");
assert((BestFactor.Width.isScalar() || BestFactor.ScalarCost > 0) &&
"when vectorizing, the scalar cost must be computed.");
24 changes: 24 additions & 0 deletions llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -1173,6 +1173,30 @@ InstructionCost VPInstruction::computeCost(ElementCount VF,
return Ctx.TTI.getIndexedVectorInstrCostFromEnd(Instruction::ExtractElement,
VecTy, Ctx.CostKind, 0);
}
case VPInstruction::BranchOnCount: {
Contributor

I think BranchOnCond needs adding too.

Collaborator Author

BranchOnCond doesn't cause a compare instruction to be generated; it uses the condition generated by another instruction.

// If TC <= VF then this is just a branch.
Contributor

I'm not sure what this means. Are you saying that we create a vplan for a given VF despite knowing that we will never enter the vector loop? I guess this can happen if TC is exactly equal to VF or we're using tail-folding, but not using the mask for control flow.

Collaborator Author

As mentioned in the comment below, this transformation happens in simplifyBranchConditionForVFAndUF and means the vector loop is executed exactly once. TC < VF can happen with tail folding, e.g. low_trip_count_fold_tail_scalarized_store in llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll.

// FIXME: Removing the branch happens in simplifyBranchConditionForVFAndUF
Contributor

So you're saying the cost of this branch is based on a prediction about what simplifyBranchConditionForVFAndUF is going to do later on? I guess that's fine so long as they're both using the same logic. Ideally both simplifyBranchConditionForVFAndUF and this code would call the same common function to check whether the branch will be simplified or not. I'm just a bit worried that over time the two will diverge, although I appreciate that here in the code you'd have to assume UF=1.

For example, if you pulled this code out of simplifyBranchConditionForVFAndUF into a common function you could reuse it in both places:

    // Try to simplify the branch condition if TC <= VF * UF when the latch
    // terminator is BranchOnCount or BranchOnCond where the input is
    // Not(ActiveLaneMask).
    const SCEV *TripCount =
        vputils::getSCEVExprForVPValue(Plan.getTripCount(), SE);
    assert(!isa<SCEVCouldNotCompute>(TripCount) &&
           "Trip count SCEV must be computable");
    ElementCount NumElements = BestVF.multiplyCoefficientBy(BestUF);
    const SCEV *C = SE.getElementCount(TripCount->getType(), NumElements);
    if (TripCount->isZero() ||
        !SE.isKnownPredicate(CmpInst::ICMP_ULE, TripCount, C))
      return false;

You'd also now be able to remove the "// FIXME: The compare could also be removed if TC = M * vscale" comment below.
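
As a purely illustrative sketch, the common helper suggested above might look roughly like this, reusing the quoted logic (the name and signature are assumptions; as discussed below, VPInstruction::computeCost currently has no way to reach a ScalarEvolution object, which is why the PR does not end up taking this route):

    // Hypothetical shared helper: true if the trip count is known to be
    // <= VF * UF, i.e. the latch branch-on-count will be simplified away.
    static bool isTripCountKnownLEVFxUF(VPlan &Plan, ScalarEvolution &SE,
                                        ElementCount VF, unsigned UF) {
      const SCEV *TripCount =
          vputils::getSCEVExprForVPValue(Plan.getTripCount(), SE);
      assert(!isa<SCEVCouldNotCompute>(TripCount) &&
             "Trip count SCEV must be computable");
      ElementCount NumElements = VF.multiplyCoefficientBy(UF);
      const SCEV *C = SE.getElementCount(TripCount->getType(), NumElements);
      return !TripCount->isZero() &&
             SE.isKnownPredicate(CmpInst::ICMP_ULE, TripCount, C);
    }

simplifyBranchConditionForVFAndUF would call it with the chosen VF and UF, and the cost code would call it with UF = 1.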

Collaborator Author

I'll look into doing that.

Contributor

I don't think that's the way to go; we shouldn't duplicate that kind of reasoning here. The TODO should say that the branch should be simplified before we compute the costs.

As a workaround for now, catching some cases here should be fine, as this should only mean we may miss some new optimizations, not make things worse.

Contributor

Well, I do think we should be dealing with trip counts that are multiples of vscale here too since we now support them and we know that simplifyBranchConditionForVFAndUF should correctly detect TC=3 x vscale < VF=4 x vscale. It would seem unfair to treat TC=3 <= VF=4 as a cost of 0 and TC=3 * vscale <= VF=4 * vscale as a cost of 1 just so we can keep the code simple, right?

Collaborator Author

After looking into this, the problem is that we don't have access to the ScalarEvolution object here in VPInstruction, so putting the code from simplifyBranchConditionForVFAndUF into a function and calling it won't work, as it makes use of ScalarEvolution. Simplifying the branch before we compute the cost seems like a good solution to this.

Contributor

OK, no problem. I wasn't sure how easy it would be, but thanks for looking into it! I can follow up with a later PR to get access to the SCEV here.

// where it checks TC <= VF * UF, but we don't know UF yet. This means in
// some cases we get a cost that's too high due to counting a cmp that
// later gets removed.
// FIXME: The compare could also be removed if TC = M * vscale,
// VF = N * vscale, and M <= N. Detecting that would require having the
// trip count as a SCEV though.
Value *TC = getParent()->getPlan()->getTripCount()->getUnderlyingValue();
ConstantInt *TCConst = dyn_cast_if_present<ConstantInt>(TC);
if (TCConst && TCConst->getValue().ule(VF.getKnownMinValue()))
Contributor

Can you add a TODO for the case where TC = vscale * M and VF = vscale * N as well? In such cases we should also be able to prove that TC <= VF, because it just requires asking if M <= N.

Collaborator Author

Will do.

return 0;
// Otherwise BranchOnCount generates ICmpEQ followed by a branch.
Type *ValTy = Ctx.Types.inferScalarType(getOperand(0));
return Ctx.TTI.getCmpSelInstrCost(Instruction::ICmp, ValTy,
CmpInst::makeCmpResultType(ValTy),
CmpInst::ICMP_EQ, Ctx.CostKind);
Contributor

I think you need to add the cost of the branch as well.

Collaborator Author

I've deliberately avoided touching branch costs to keep the scope of this work as small as possible.

}
case Instruction::FCmp:
Contributor

The change below is about more than just the loop conditions. See VPPredicator::createHeaderMask for an example where we explicitly introduce an icmp for the current tail-folding mask. In fact, I don't think the cost below is correct, because the icmp can have vector inputs.

I think you either need to:

  1. Find a way to bail out if the icmp/fcmp isn't used as a branch condition, or
  2. Add support for vector types using ValTy = toVectorTy(ValTy, VF), and just make sure the example of VPPredicator::createHeaderMask is being tested in this PR.

Sorry about this!
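
For reference, option 2 would presumably look something like the sketch below. This is illustrative only, not what the PR ends up doing (the reply below switches to getCostForRecipeWithOpcode instead), and it assumes toVectorTy and getPredicate are usable here:

    // Hypothetical sketch of option 2: cost the compare at vector type
    // unless only the first lane of the result is used.
    case Instruction::FCmp:
    case Instruction::ICmp: {
      Type *ValTy = Ctx.Types.inferScalarType(getOperand(0));
      if (!vputils::onlyFirstLaneUsed(this))
        ValTy = toVectorTy(ValTy, VF);
      return Ctx.TTI.getCmpSelInstrCost(getOpcode(), ValTy,
                                        CmpInst::makeCmpResultType(ValTy),
                                        getPredicate(), Ctx.CostKind);
    }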

Collaborator Author

getCostForRecipeWithOpcode already correctly handles vector compares, so I've changed this to use that function.

case Instruction::ICmp:
return getCostForRecipeWithOpcode(
getOpcode(),
vputils::onlyFirstLaneUsed(this) ? ElementCount::getFixed(1) : VF, Ctx);
case VPInstruction::ExtractPenultimateElement:
if (VF == ElementCount::getScalable(1))
return InstructionCost::getInvalid();
llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
@@ -534,25 +534,47 @@ define void @multiple_exit_conditions(ptr %src, ptr noalias %dst) #1 {
; DEFAULT-LABEL: define void @multiple_exit_conditions(
; DEFAULT-SAME: ptr [[SRC:%.*]], ptr noalias [[DST:%.*]]) #[[ATTR2:[0-9]+]] {
; DEFAULT-NEXT: [[ENTRY:.*:]]
; DEFAULT-NEXT: br label %[[VECTOR_PH:.*]]
; DEFAULT-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
; DEFAULT-NEXT: [[TMP4:%.*]] = shl nuw i64 [[TMP0]], 3
; DEFAULT-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 257, [[TMP4]]
; DEFAULT-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
; DEFAULT: [[VECTOR_PH]]:
; DEFAULT-NEXT: [[IND_END:%.*]] = getelementptr i8, ptr [[DST]], i64 2048
; DEFAULT-NEXT: br label %[[VECTOR_BODY:.*]]
; DEFAULT: [[VECTOR_BODY]]:
; DEFAULT-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
; DEFAULT-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
; DEFAULT-NEXT: [[TMP3:%.*]] = mul nuw i64 [[TMP2]], 8
; DEFAULT-NEXT: [[N_MOD_VF:%.*]] = urem i64 257, [[TMP3]]
; DEFAULT-NEXT: [[INDEX:%.*]] = sub i64 257, [[N_MOD_VF]]
; DEFAULT-NEXT: [[OFFSET_IDX:%.*]] = mul i64 [[INDEX]], 8
; DEFAULT-NEXT: [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[DST]], i64 [[OFFSET_IDX]]
; DEFAULT-NEXT: [[TMP6:%.*]] = mul i64 [[INDEX]], 2
; DEFAULT-NEXT: br label %[[VECTOR_BODY:.*]]
; DEFAULT: [[VECTOR_BODY]]:
; DEFAULT-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
; DEFAULT-NEXT: [[OFFSET_IDX1:%.*]] = mul i64 [[INDEX1]], 8
; DEFAULT-NEXT: [[NEXT_GEP1:%.*]] = getelementptr i8, ptr [[DST]], i64 [[OFFSET_IDX1]]
; DEFAULT-NEXT: [[TMP1:%.*]] = load i16, ptr [[SRC]], align 2
; DEFAULT-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <8 x i16> poison, i16 [[TMP1]], i64 0
; DEFAULT-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <8 x i16> [[BROADCAST_SPLATINSERT]], <8 x i16> poison, <8 x i32> zeroinitializer
; DEFAULT-NEXT: [[TMP2:%.*]] = or <8 x i16> [[BROADCAST_SPLAT]], splat (i16 1)
; DEFAULT-NEXT: [[TMP3:%.*]] = uitofp <8 x i16> [[TMP2]] to <8 x double>
; DEFAULT-NEXT: store <8 x double> [[TMP3]], ptr [[NEXT_GEP]], align 8
; DEFAULT-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
; DEFAULT-NEXT: [[TMP5:%.*]] = icmp eq i64 [[INDEX_NEXT]], 256
; DEFAULT-NEXT: br i1 [[TMP5]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP23:![0-9]+]]
; DEFAULT-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 2 x i16> poison, i16 [[TMP1]], i64 0
; DEFAULT-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 2 x i16> [[BROADCAST_SPLATINSERT]], <vscale x 2 x i16> poison, <vscale x 2 x i32> zeroinitializer
; DEFAULT-NEXT: [[TMP11:%.*]] = or <vscale x 2 x i16> [[BROADCAST_SPLAT]], splat (i16 1)
; DEFAULT-NEXT: [[TMP15:%.*]] = uitofp <vscale x 2 x i16> [[TMP11]] to <vscale x 2 x double>
; DEFAULT-NEXT: [[TMP16:%.*]] = call i64 @llvm.vscale.i64()
; DEFAULT-NEXT: [[TMP17:%.*]] = shl nuw i64 [[TMP16]], 1
; DEFAULT-NEXT: [[TMP18:%.*]] = getelementptr double, ptr [[NEXT_GEP1]], i64 [[TMP17]]
; DEFAULT-NEXT: [[TMP19:%.*]] = call i64 @llvm.vscale.i64()
; DEFAULT-NEXT: [[TMP20:%.*]] = shl nuw i64 [[TMP19]], 2
; DEFAULT-NEXT: [[TMP21:%.*]] = getelementptr double, ptr [[NEXT_GEP1]], i64 [[TMP20]]
; DEFAULT-NEXT: [[TMP22:%.*]] = call i64 @llvm.vscale.i64()
; DEFAULT-NEXT: [[TMP23:%.*]] = mul nuw i64 [[TMP22]], 6
; DEFAULT-NEXT: [[TMP24:%.*]] = getelementptr double, ptr [[NEXT_GEP1]], i64 [[TMP23]]
; DEFAULT-NEXT: store <vscale x 2 x double> [[TMP15]], ptr [[NEXT_GEP1]], align 8
; DEFAULT-NEXT: store <vscale x 2 x double> [[TMP15]], ptr [[TMP18]], align 8
; DEFAULT-NEXT: store <vscale x 2 x double> [[TMP15]], ptr [[TMP21]], align 8
; DEFAULT-NEXT: store <vscale x 2 x double> [[TMP15]], ptr [[TMP24]], align 8
; DEFAULT-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX1]], [[TMP3]]
; DEFAULT-NEXT: [[TMP25:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[INDEX]]
; DEFAULT-NEXT: br i1 [[TMP25]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP23:![0-9]+]]
; DEFAULT: [[MIDDLE_BLOCK]]:
; DEFAULT-NEXT: br label %[[SCALAR_PH:.*]]
; DEFAULT-NEXT: [[CMP_N:%.*]] = icmp eq i64 257, [[INDEX]]
; DEFAULT-NEXT: br i1 [[CMP_N]], [[EXIT:label %.*]], label %[[SCALAR_PH]]
; DEFAULT: [[SCALAR_PH]]:
;
; PRED-LABEL: define void @multiple_exit_conditions(
@@ -660,16 +682,16 @@ define void @low_trip_count_fold_tail_scalarized_store(ptr %dst) {
; COMMON-NEXT: store i8 6, ptr [[TMP6]], align 1
; COMMON-NEXT: br label %[[PRED_STORE_CONTINUE12]]
; COMMON: [[PRED_STORE_CONTINUE12]]:
; COMMON-NEXT: br i1 false, label %[[PRED_STORE_IF13:.*]], label %[[EXIT:.*]]
; COMMON-NEXT: br i1 false, label %[[PRED_STORE_IF13:.*]], label %[[PRED_STORE_CONTINUE14:.*]]
; COMMON: [[PRED_STORE_IF13]]:
; COMMON-NEXT: [[TMP7:%.*]] = getelementptr i8, ptr [[DST]], i64 7
; COMMON-NEXT: store i8 7, ptr [[TMP7]], align 1
; COMMON-NEXT: br label %[[EXIT]]
; COMMON-NEXT: br label %[[PRED_STORE_CONTINUE14]]
; COMMON: [[PRED_STORE_CONTINUE14]]:
; COMMON-NEXT: br label %[[MIDDLE_BLOCK:.*]]
; COMMON: [[MIDDLE_BLOCK]]:
; COMMON-NEXT: br label %[[EXIT:.*]]
; COMMON: [[EXIT]]:
; COMMON-NEXT: br label %[[SCALAR_PH:.*]]
; COMMON: [[SCALAR_PH]]:
; COMMON-NEXT: br label %[[EXIT1:.*]]
; COMMON: [[EXIT1]]:
; COMMON-NEXT: ret void
;
entry:
Expand Down Expand Up @@ -1303,7 +1325,7 @@ define void @pred_udiv_select_cost(ptr %A, ptr %B, ptr %C, i64 %n, i8 %y) #1 {
; PRED-NEXT: br label %[[VECTOR_MEMCHECK:.*]]
; PRED: [[VECTOR_MEMCHECK]]:
; PRED-NEXT: [[TMP1:%.*]] = call i64 @llvm.vscale.i64()
; PRED-NEXT: [[TMP2:%.*]] = mul nuw i64 [[TMP1]], 16
; PRED-NEXT: [[TMP2:%.*]] = mul nuw i64 [[TMP1]], 4
; PRED-NEXT: [[TMP3:%.*]] = sub i64 [[C1]], [[A2]]
; PRED-NEXT: [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP3]], [[TMP2]]
; PRED-NEXT: [[TMP4:%.*]] = sub i64 [[C1]], [[B3]]
@@ -1312,42 +1334,42 @@ define void @pred_udiv_select_cost(ptr %A, ptr %B, ptr %C, i64 %n, i8 %y) #1 {
; PRED-NEXT: br i1 [[CONFLICT_RDX]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
; PRED: [[VECTOR_PH]]:
; PRED-NEXT: [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
; PRED-NEXT: [[TMP6:%.*]] = mul nuw i64 [[TMP5]], 16
; PRED-NEXT: [[TMP6:%.*]] = mul nuw i64 [[TMP5]], 4
; PRED-NEXT: [[TMP7:%.*]] = call i64 @llvm.vscale.i64()
; PRED-NEXT: [[TMP8:%.*]] = shl nuw i64 [[TMP7]], 4
; PRED-NEXT: [[TMP8:%.*]] = shl nuw i64 [[TMP7]], 2
; PRED-NEXT: [[TMP9:%.*]] = sub i64 [[TMP0]], [[TMP8]]
; PRED-NEXT: [[TMP10:%.*]] = icmp ugt i64 [[TMP0]], [[TMP8]]
; PRED-NEXT: [[TMP11:%.*]] = select i1 [[TMP10]], i64 [[TMP9]], i64 0
; PRED-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 [[TMP0]])
; PRED-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 16 x i8> poison, i8 [[Y]], i64 0
; PRED-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 16 x i8> [[BROADCAST_SPLATINSERT]], <vscale x 16 x i8> poison, <vscale x 16 x i32> zeroinitializer
; PRED-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[TMP0]])
Contributor

It looks like the reason this test has changed behaviour is that we were previously adding on the cost of the original scalar exit condition (an icmp) when in reality the vplan doesn't have one. Nice!

I think it's the same conclusion you came to with the induction-costs-sve.ll test below.

; PRED-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i8> poison, i8 [[Y]], i64 0
; PRED-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i8> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i8> poison, <vscale x 4 x i32> zeroinitializer
; PRED-NEXT: br label %[[VECTOR_BODY:.*]]
; PRED: [[VECTOR_BODY]]:
; PRED-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
; PRED-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 16 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], %[[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], %[[VECTOR_BODY]] ]
; PRED-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], %[[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], %[[VECTOR_BODY]] ]
; PRED-NEXT: [[TMP12:%.*]] = getelementptr i8, ptr [[A]], i64 [[INDEX]]
; PRED-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr align 1 [[TMP12]], <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)
; PRED-NEXT: [[TMP13:%.*]] = uitofp <vscale x 16 x i8> [[WIDE_MASKED_LOAD]] to <vscale x 16 x float>
; PRED-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x i8> @llvm.masked.load.nxv4i8.p0(ptr align 1 [[TMP12]], <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i8> poison)
; PRED-NEXT: [[TMP13:%.*]] = uitofp <vscale x 4 x i8> [[WIDE_MASKED_LOAD]] to <vscale x 4 x float>
; PRED-NEXT: [[TMP14:%.*]] = getelementptr i8, ptr [[B]], i64 [[INDEX]]
; PRED-NEXT: [[WIDE_MASKED_LOAD5:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr align 1 [[TMP14]], <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)
; PRED-NEXT: [[TMP15:%.*]] = icmp ne <vscale x 16 x i8> [[WIDE_MASKED_LOAD5]], zeroinitializer
; PRED-NEXT: [[TMP16:%.*]] = select <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i1> [[TMP15]], <vscale x 16 x i1> zeroinitializer
; PRED-NEXT: [[TMP17:%.*]] = xor <vscale x 16 x i8> [[WIDE_MASKED_LOAD]], splat (i8 1)
; PRED-NEXT: [[TMP18:%.*]] = select <vscale x 16 x i1> [[TMP16]], <vscale x 16 x i8> [[BROADCAST_SPLAT]], <vscale x 16 x i8> splat (i8 1)
; PRED-NEXT: [[TMP19:%.*]] = udiv <vscale x 16 x i8> [[TMP17]], [[TMP18]]
; PRED-NEXT: [[TMP20:%.*]] = icmp ugt <vscale x 16 x i8> [[TMP19]], splat (i8 1)
; PRED-NEXT: [[TMP21:%.*]] = select <vscale x 16 x i1> [[TMP20]], <vscale x 16 x i32> zeroinitializer, <vscale x 16 x i32> splat (i32 255)
; PRED-NEXT: [[PREDPHI:%.*]] = select <vscale x 16 x i1> [[TMP15]], <vscale x 16 x i32> [[TMP21]], <vscale x 16 x i32> zeroinitializer
; PRED-NEXT: [[TMP22:%.*]] = zext <vscale x 16 x i8> [[WIDE_MASKED_LOAD]] to <vscale x 16 x i32>
; PRED-NEXT: [[TMP23:%.*]] = sub <vscale x 16 x i32> [[PREDPHI]], [[TMP22]]
; PRED-NEXT: [[TMP24:%.*]] = sitofp <vscale x 16 x i32> [[TMP23]] to <vscale x 16 x float>
; PRED-NEXT: [[TMP25:%.*]] = call <vscale x 16 x float> @llvm.fmuladd.nxv16f32(<vscale x 16 x float> [[TMP24]], <vscale x 16 x float> splat (float 3.000000e+00), <vscale x 16 x float> [[TMP13]])
; PRED-NEXT: [[TMP26:%.*]] = fptoui <vscale x 16 x float> [[TMP25]] to <vscale x 16 x i8>
; PRED-NEXT: [[WIDE_MASKED_LOAD5:%.*]] = call <vscale x 4 x i8> @llvm.masked.load.nxv4i8.p0(ptr align 1 [[TMP14]], <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i8> poison)
; PRED-NEXT: [[TMP15:%.*]] = icmp ne <vscale x 4 x i8> [[WIDE_MASKED_LOAD5]], zeroinitializer
; PRED-NEXT: [[TMP16:%.*]] = select <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i1> [[TMP15]], <vscale x 4 x i1> zeroinitializer
; PRED-NEXT: [[TMP17:%.*]] = xor <vscale x 4 x i8> [[WIDE_MASKED_LOAD]], splat (i8 1)
; PRED-NEXT: [[TMP18:%.*]] = select <vscale x 4 x i1> [[TMP16]], <vscale x 4 x i8> [[BROADCAST_SPLAT]], <vscale x 4 x i8> splat (i8 1)
; PRED-NEXT: [[TMP19:%.*]] = udiv <vscale x 4 x i8> [[TMP17]], [[TMP18]]
; PRED-NEXT: [[TMP20:%.*]] = icmp ugt <vscale x 4 x i8> [[TMP19]], splat (i8 1)
; PRED-NEXT: [[TMP21:%.*]] = select <vscale x 4 x i1> [[TMP20]], <vscale x 4 x i32> zeroinitializer, <vscale x 4 x i32> splat (i32 255)
; PRED-NEXT: [[PREDPHI:%.*]] = select <vscale x 4 x i1> [[TMP15]], <vscale x 4 x i32> [[TMP21]], <vscale x 4 x i32> zeroinitializer
; PRED-NEXT: [[TMP22:%.*]] = zext <vscale x 4 x i8> [[WIDE_MASKED_LOAD]] to <vscale x 4 x i32>
; PRED-NEXT: [[TMP23:%.*]] = sub <vscale x 4 x i32> [[PREDPHI]], [[TMP22]]
; PRED-NEXT: [[TMP24:%.*]] = sitofp <vscale x 4 x i32> [[TMP23]] to <vscale x 4 x float>
; PRED-NEXT: [[TMP25:%.*]] = call <vscale x 4 x float> @llvm.fmuladd.nxv4f32(<vscale x 4 x float> [[TMP24]], <vscale x 4 x float> splat (float 3.000000e+00), <vscale x 4 x float> [[TMP13]])
; PRED-NEXT: [[TMP26:%.*]] = fptoui <vscale x 4 x float> [[TMP25]] to <vscale x 4 x i8>
; PRED-NEXT: [[TMP27:%.*]] = getelementptr i8, ptr [[C]], i64 [[INDEX]]
; PRED-NEXT: call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> [[TMP26]], ptr align 1 [[TMP27]], <vscale x 16 x i1> [[ACTIVE_LANE_MASK]])
; PRED-NEXT: call void @llvm.masked.store.nxv4i8.p0(<vscale x 4 x i8> [[TMP26]], ptr align 1 [[TMP27]], <vscale x 4 x i1> [[ACTIVE_LANE_MASK]])
; PRED-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP6]]
; PRED-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[INDEX]], i64 [[TMP11]])
; PRED-NEXT: [[TMP28:%.*]] = extractelement <vscale x 16 x i1> [[ACTIVE_LANE_MASK_NEXT]], i32 0
; PRED-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX]], i64 [[TMP11]])
; PRED-NEXT: [[TMP28:%.*]] = extractelement <vscale x 4 x i1> [[ACTIVE_LANE_MASK_NEXT]], i32 0
; PRED-NEXT: [[TMP29:%.*]] = xor i1 [[TMP28]], true
; PRED-NEXT: br i1 [[TMP29]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP7:![0-9]+]]
; PRED: [[MIDDLE_BLOCK]]: