
Conversation

@ElvisWang123
Contributor

@ElvisWang123 ElvisWang123 commented Feb 4, 2025

This patch tries to model the cost of exit conditions through the VPlan-based cost model.

  • BranchOnCount will generate icmp + br.

The branch instruction is already implemented by the VPRegionBlock, so we only need to calculate the cost of the icmp.

If the VF is the same as the trip count of the loop, the BranchOnCount is free.

There are some test changes for the following reasons.

  • Some of the BranchOnCount recipes can be optimized to BranchOnCond, which generates extract-element + br, and the extract-element is free on some targets.
  • Some of the instructions counted toward the exit condition in the legacy cost model will be optimized out.

@llvmbot
Member

llvmbot commented Feb 4, 2025

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-vectorizers

Author: Elvis Wang (ElvisWang123)

Patch is 76.32 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/125640.diff

14 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (-39)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+1-4)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+23)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/clamped-trip-count.ll (+42-38)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll (+42-12)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/fully-unrolled-cost.ll (+7-5)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/induction-costs-sve.ll (+46-82)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/uniform-args-call-variants.ll (+91-40)
  • (modified) llvm/test/Transforms/LoopVectorize/ARM/mve-icmpcost.ll (+7-14)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/low-trip-count.ll (+7-15)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/short-trip-count.ll (+10-24)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/cost-model.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/multi-exit-cost.ll (+14-7)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/reduction-small-size.ll (+1-3)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 29f3940ed6fa7a..49159ad40fafc7 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7238,45 +7238,6 @@ LoopVectorizationPlanner::precomputeCosts(VPlan &Plan, ElementCount VF,
     }
   }
 
-  /// Compute the cost of all exiting conditions of the loop using the legacy
-  /// cost model. This is to match the legacy behavior, which adds the cost of
-  /// all exit conditions. Note that this over-estimates the cost, as there will
-  /// be a single condition to control the vector loop.
-  SmallVector<BasicBlock *> Exiting;
-  CM.TheLoop->getExitingBlocks(Exiting);
-  SetVector<Instruction *> ExitInstrs;
-  // Collect all exit conditions.
-  for (BasicBlock *EB : Exiting) {
-    auto *Term = dyn_cast<BranchInst>(EB->getTerminator());
-    if (!Term)
-      continue;
-    if (auto *CondI = dyn_cast<Instruction>(Term->getOperand(0))) {
-      ExitInstrs.insert(CondI);
-    }
-  }
-  // Compute the cost of all instructions only feeding the exit conditions.
-  for (unsigned I = 0; I != ExitInstrs.size(); ++I) {
-    Instruction *CondI = ExitInstrs[I];
-    if (!OrigLoop->contains(CondI) ||
-        !CostCtx.SkipCostComputation.insert(CondI).second)
-      continue;
-    InstructionCost CondICost = CostCtx.getLegacyCost(CondI, VF);
-    LLVM_DEBUG({
-      dbgs() << "Cost of " << CondICost << " for VF " << VF
-             << ": exit condition instruction " << *CondI << "\n";
-    });
-    Cost += CondICost;
-    for (Value *Op : CondI->operands()) {
-      auto *OpI = dyn_cast<Instruction>(Op);
-      if (!OpI || any_of(OpI->users(), [&ExitInstrs, this](User *U) {
-            return OrigLoop->contains(cast<Instruction>(U)->getParent()) &&
-                   !ExitInstrs.contains(cast<Instruction>(U));
-          }))
-        continue;
-      ExitInstrs.insert(OpI);
-    }
-  }
-
   // The legacy cost model has special logic to compute the cost of in-loop
   // reductions, which may be smaller than the sum of all instructions involved
   // in the reduction.
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index db45ad8aadbbe3..a46a5af6151f8b 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -1318,10 +1318,7 @@ class VPInstruction : public VPRecipeWithIRFlags,
 
   /// Return the cost of this VPInstruction.
   InstructionCost computeCost(ElementCount VF,
-                              VPCostContext &Ctx) const override {
-    // TODO: Compute accurate cost after retiring the legacy cost model.
-    return 0;
-  }
+                              VPCostContext &Ctx) const override;
 
 #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
   /// Print the VPInstruction to \p O.
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index aa5f92b235555e..6b80fda76d0f19 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -823,6 +823,29 @@ bool VPInstruction::onlyFirstPartUsed(const VPValue *Op) const {
   llvm_unreachable("switch should return");
 }
 
+InstructionCost VPInstruction::computeCost(ElementCount VF,
+                                           VPCostContext &Ctx) const {
+  Type *ValTy = Ctx.Types.inferScalarType(getOperand(0));
+
+  switch (getOpcode()) {
+  case VPInstruction::BranchOnCount: {
+  // BranchOnCount will generate icmp_eq + br instructions and the
+  // cost of branch will be calculated in VPRegionBlock.
+  // If the vector loop only executes once, ignore the cost of the cmp.
+    auto TC = dyn_cast_if_present<ConstantInt>(
+        getParent()->getPlan()->getTripCount()->getUnderlyingValue());
+    if (TC && VF.isFixed() && TC->getSExtValue() == VF.getFixedValue())
+      return 0;
+    return Ctx.TTI.getCmpSelInstrCost(Instruction::ICmp, ValTy, nullptr,
+                                      CmpInst::ICMP_EQ, Ctx.CostKind);
+  }
+  default:
+    // TODO: Compute accurate cost after retiring the legacy cost model.
+    return 0;
+  };
+  llvm_unreachable("switch should return");
+}
+
 #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
 void VPInstruction::dump() const {
   VPSlotTracker SlotTracker(getParent()->getPlan());
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/clamped-trip-count.ll b/llvm/test/Transforms/LoopVectorize/AArch64/clamped-trip-count.ll
index 5b77ced73bce0a..b510c50450055c 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/clamped-trip-count.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/clamped-trip-count.ll
@@ -8,39 +8,41 @@ define void @clamped_tc_8(ptr nocapture %dst, i32 %n, i64 %val) vscale_range(1,1
 ; CHECK-NEXT:    br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
 ; CHECK:       vector.ph:
 ; CHECK-NEXT:    [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT:    [[TMP1:%.*]] = mul i64 [[TMP0]], 8
+; CHECK-NEXT:    [[TMP1:%.*]] = mul i64 [[TMP0]], 2
 ; CHECK-NEXT:    [[TMP4:%.*]] = sub i64 [[TMP1]], 1
 ; CHECK-NEXT:    [[N_RND_UP:%.*]] = add i64 8, [[TMP4]]
 ; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP1]]
 ; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
 ; CHECK-NEXT:    [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT:    [[TMP6:%.*]] = mul i64 [[TMP5]], 8
+; CHECK-NEXT:    [[TMP6:%.*]] = mul i64 [[TMP5]], 2
 ; CHECK-NEXT:    [[IND_END:%.*]] = getelementptr i8, ptr [[DST]], i64 [[N_VEC]]
-; CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 0, i64 8)
-; CHECK-NEXT:    [[TMP8:%.*]] = call <vscale x 8 x i64> @llvm.stepvector.nxv8i64()
-; CHECK-NEXT:    [[TMP7:%.*]] = mul <vscale x 8 x i64> [[TMP8]], splat (i64 1)
-; CHECK-NEXT:    [[INDUCTION:%.*]] = add <vscale x 8 x i64> zeroinitializer, [[TMP7]]
+; CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 0, i64 8)
+; CHECK-NEXT:    [[TMP8:%.*]] = call <vscale x 2 x i64> @llvm.stepvector.nxv2i64()
+; CHECK-NEXT:    [[TMP7:%.*]] = mul <vscale x 2 x i64> [[TMP8]], splat (i64 1)
+; CHECK-NEXT:    [[INDUCTION:%.*]] = add <vscale x 2 x i64> zeroinitializer, [[TMP7]]
 ; CHECK-NEXT:    [[TMP12:%.*]] = mul i64 1, [[TMP6]]
-; CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 8 x i64> poison, i64 [[TMP12]], i64 0
-; CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 8 x i64> [[DOTSPLATINSERT]], <vscale x 8 x i64> poison, <vscale x 8 x i32> zeroinitializer
-; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 8 x i64> poison, i64 [[VAL]], i64 0
-; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 8 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 8 x i64> poison, <vscale x 8 x i32> zeroinitializer
+; CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 2 x i64> poison, i64 [[TMP12]], i64 0
+; CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 2 x i64> [[DOTSPLATINSERT]], <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 2 x i64> poison, i64 [[VAL]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 2 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
 ; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK:       vector.body:
 ; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 8 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <vscale x 8 x i64> [ [[INDUCTION]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <vscale x 2 x i64> [ [[INDUCTION]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; CHECK-NEXT:    [[TMP13:%.*]] = add i64 [[INDEX]], 0
 ; CHECK-NEXT:    [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP13]]
-; CHECK-NEXT:    [[TMP10:%.*]] = shl nuw nsw <vscale x 8 x i64> [[VEC_IND]], splat (i64 3)
-; CHECK-NEXT:    [[TMP11:%.*]] = lshr <vscale x 8 x i64> [[BROADCAST_SPLAT]], [[TMP10]]
-; CHECK-NEXT:    [[TMP14:%.*]] = trunc <vscale x 8 x i64> [[TMP11]] to <vscale x 8 x i8>
+; CHECK-NEXT:    [[TMP10:%.*]] = shl nuw nsw <vscale x 2 x i64> [[VEC_IND]], splat (i64 3)
+; CHECK-NEXT:    [[TMP11:%.*]] = lshr <vscale x 2 x i64> [[BROADCAST_SPLAT]], [[TMP10]]
+; CHECK-NEXT:    [[TMP16:%.*]] = trunc <vscale x 2 x i64> [[TMP11]] to <vscale x 2 x i8>
 ; CHECK-NEXT:    [[TMP17:%.*]] = getelementptr i8, ptr [[NEXT_GEP]], i32 0
-; CHECK-NEXT:    call void @llvm.masked.store.nxv8i8.p0(<vscale x 8 x i8> [[TMP14]], ptr [[TMP17]], i32 1, <vscale x 8 x i1> [[ACTIVE_LANE_MASK]])
+; CHECK-NEXT:    call void @llvm.masked.store.nxv2i8.p0(<vscale x 2 x i8> [[TMP16]], ptr [[TMP17]], i32 1, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]])
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP6]]
-; CHECK-NEXT:    [[VEC_IND_NEXT]] = add <vscale x 8 x i64> [[VEC_IND]], [[DOTSPLAT]]
-; CHECK-NEXT:    [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 [[INDEX_NEXT]], i64 8)
-; CHECK-NEXT:    br i1 true, label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK-NEXT:    [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 [[INDEX_NEXT]], i64 8)
+; CHECK-NEXT:    [[TMP14:%.*]] = xor <vscale x 2 x i1> [[ACTIVE_LANE_MASK_NEXT]], splat (i1 true)
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add <vscale x 2 x i64> [[VEC_IND]], [[DOTSPLAT]]
+; CHECK-NEXT:    [[TMP15:%.*]] = extractelement <vscale x 2 x i1> [[TMP14]], i32 0
+; CHECK-NEXT:    br i1 [[TMP15]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
 ; CHECK:       middle.block:
 ; CHECK-NEXT:    br i1 true, label [[FOR_COND_CLEANUP:%.*]], label [[SCALAR_PH]]
 ; CHECK:       scalar.ph:
@@ -94,39 +96,41 @@ define void @clamped_tc_max_8(ptr nocapture %dst, i32 %n, i64 %val) vscale_range
 ; CHECK-NEXT:    br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
 ; CHECK:       vector.ph:
 ; CHECK-NEXT:    [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT:    [[TMP1:%.*]] = mul i64 [[TMP0]], 8
+; CHECK-NEXT:    [[TMP1:%.*]] = mul i64 [[TMP0]], 2
 ; CHECK-NEXT:    [[TMP4:%.*]] = sub i64 [[TMP1]], 1
 ; CHECK-NEXT:    [[N_RND_UP:%.*]] = add i64 [[WIDE_TRIP_COUNT]], [[TMP4]]
 ; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP1]]
 ; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
 ; CHECK-NEXT:    [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT:    [[TMP6:%.*]] = mul i64 [[TMP5]], 8
+; CHECK-NEXT:    [[TMP6:%.*]] = mul i64 [[TMP5]], 2
 ; CHECK-NEXT:    [[IND_END:%.*]] = getelementptr i8, ptr [[DST]], i64 [[N_VEC]]
-; CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 0, i64 [[WIDE_TRIP_COUNT]])
-; CHECK-NEXT:    [[TMP8:%.*]] = call <vscale x 8 x i64> @llvm.stepvector.nxv8i64()
-; CHECK-NEXT:    [[TMP7:%.*]] = mul <vscale x 8 x i64> [[TMP8]], splat (i64 1)
-; CHECK-NEXT:    [[INDUCTION:%.*]] = add <vscale x 8 x i64> zeroinitializer, [[TMP7]]
+; CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 0, i64 [[WIDE_TRIP_COUNT]])
+; CHECK-NEXT:    [[TMP8:%.*]] = call <vscale x 2 x i64> @llvm.stepvector.nxv2i64()
+; CHECK-NEXT:    [[TMP7:%.*]] = mul <vscale x 2 x i64> [[TMP8]], splat (i64 1)
+; CHECK-NEXT:    [[INDUCTION:%.*]] = add <vscale x 2 x i64> zeroinitializer, [[TMP7]]
 ; CHECK-NEXT:    [[TMP12:%.*]] = mul i64 1, [[TMP6]]
-; CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 8 x i64> poison, i64 [[TMP12]], i64 0
-; CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 8 x i64> [[DOTSPLATINSERT]], <vscale x 8 x i64> poison, <vscale x 8 x i32> zeroinitializer
-; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 8 x i64> poison, i64 [[VAL]], i64 0
-; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 8 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 8 x i64> poison, <vscale x 8 x i32> zeroinitializer
+; CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 2 x i64> poison, i64 [[TMP12]], i64 0
+; CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 2 x i64> [[DOTSPLATINSERT]], <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 2 x i64> poison, i64 [[VAL]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 2 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
 ; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK:       vector.body:
 ; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 8 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <vscale x 8 x i64> [ [[INDUCTION]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <vscale x 2 x i64> [ [[INDUCTION]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; CHECK-NEXT:    [[TMP13:%.*]] = add i64 [[INDEX]], 0
 ; CHECK-NEXT:    [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP13]]
-; CHECK-NEXT:    [[TMP10:%.*]] = shl nuw nsw <vscale x 8 x i64> [[VEC_IND]], splat (i64 3)
-; CHECK-NEXT:    [[TMP11:%.*]] = lshr <vscale x 8 x i64> [[BROADCAST_SPLAT]], [[TMP10]]
-; CHECK-NEXT:    [[TMP14:%.*]] = trunc <vscale x 8 x i64> [[TMP11]] to <vscale x 8 x i8>
+; CHECK-NEXT:    [[TMP10:%.*]] = shl nuw nsw <vscale x 2 x i64> [[VEC_IND]], splat (i64 3)
+; CHECK-NEXT:    [[TMP11:%.*]] = lshr <vscale x 2 x i64> [[BROADCAST_SPLAT]], [[TMP10]]
+; CHECK-NEXT:    [[TMP16:%.*]] = trunc <vscale x 2 x i64> [[TMP11]] to <vscale x 2 x i8>
 ; CHECK-NEXT:    [[TMP17:%.*]] = getelementptr i8, ptr [[NEXT_GEP]], i32 0
-; CHECK-NEXT:    call void @llvm.masked.store.nxv8i8.p0(<vscale x 8 x i8> [[TMP14]], ptr [[TMP17]], i32 1, <vscale x 8 x i1> [[ACTIVE_LANE_MASK]])
+; CHECK-NEXT:    call void @llvm.masked.store.nxv2i8.p0(<vscale x 2 x i8> [[TMP16]], ptr [[TMP17]], i32 1, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]])
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP6]]
-; CHECK-NEXT:    [[VEC_IND_NEXT]] = add <vscale x 8 x i64> [[VEC_IND]], [[DOTSPLAT]]
-; CHECK-NEXT:    [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 [[INDEX_NEXT]], i64 [[WIDE_TRIP_COUNT]])
-; CHECK-NEXT:    br i1 true, label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
+; CHECK-NEXT:    [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 [[INDEX_NEXT]], i64 [[WIDE_TRIP_COUNT]])
+; CHECK-NEXT:    [[TMP14:%.*]] = xor <vscale x 2 x i1> [[ACTIVE_LANE_MASK_NEXT]], splat (i1 true)
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add <vscale x 2 x i64> [[VEC_IND]], [[DOTSPLAT]]
+; CHECK-NEXT:    [[TMP15:%.*]] = extractelement <vscale x 2 x i1> [[TMP14]], i32 0
+; CHECK-NEXT:    br i1 [[TMP15]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
 ; CHECK:       middle.block:
 ; CHECK-NEXT:    br i1 true, label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]], label [[SCALAR_PH]]
 ; CHECK:       scalar.ph:
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll b/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
index caa98d766a8c34..0ff84e13c243d9 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
@@ -768,9 +768,20 @@ define void @multiple_exit_conditions(ptr %src, ptr noalias %dst) #1 {
 ; DEFAULT-LABEL: define void @multiple_exit_conditions(
 ; DEFAULT-SAME: ptr [[SRC:%.*]], ptr noalias [[DST:%.*]]) #[[ATTR2:[0-9]+]] {
 ; DEFAULT-NEXT:  entry:
-; DEFAULT-NEXT:    br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
+; DEFAULT-NEXT:    [[TMP7:%.*]] = call i64 @llvm.vscale.i64()
+; DEFAULT-NEXT:    [[TMP9:%.*]] = mul i64 [[TMP7]], 8
+; DEFAULT-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 257, [[TMP9]]
+; DEFAULT-NEXT:    br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
 ; DEFAULT:       vector.ph:
-; DEFAULT-NEXT:    [[IND_END:%.*]] = getelementptr i8, ptr [[DST]], i64 2048
+; DEFAULT-NEXT:    [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
+; DEFAULT-NEXT:    [[TMP3:%.*]] = mul i64 [[TMP2]], 8
+; DEFAULT-NEXT:    [[N_MOD_VF:%.*]] = urem i64 257, [[TMP3]]
+; DEFAULT-NEXT:    [[N_VEC:%.*]] = sub i64 257, [[N_MOD_VF]]
+; DEFAULT-NEXT:    [[TMP10:%.*]] = call i64 @llvm.vscale.i64()
+; DEFAULT-NEXT:    [[TMP5:%.*]] = mul i64 [[TMP10]], 8
+; DEFAULT-NEXT:    [[TMP6:%.*]] = mul i64 [[N_VEC]], 8
+; DEFAULT-NEXT:    [[IND_END:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP6]]
+; DEFAULT-NEXT:    [[TMP8:%.*]] = mul i64 [[N_VEC]], 2
 ; DEFAULT-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; DEFAULT:       vector.body:
 ; DEFAULT-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
@@ -778,20 +789,39 @@ define void @multiple_exit_conditions(ptr %src, ptr noalias %dst) #1 {
 ; DEFAULT-NEXT:    [[TMP0:%.*]] = add i64 [[OFFSET_IDX]], 0
 ; DEFAULT-NEXT:    [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP0]]
 ; DEFAULT-NEXT:    [[TMP1:%.*]] = load i16, ptr [[SRC]], align 2
-; DEFAULT-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <8 x i16> poison, i16 [[TMP1]], i64 0
-; DEFAULT-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <8 x i16> [[BROADCAST_SPLATINSERT]], <8 x i16> poison, <8 x i32> zeroinitializer
-; DEFAULT-NEXT:    [[TMP2:%.*]] = or <8 x i16> [[BROADCAST_SPLAT]], splat (i16 1)
-; DEFAULT-NEXT:    [[TMP3:%.*]] = uitofp <8 x i16> [[TMP2]] to <8 x double>
+; DEFAULT-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 2 x i16> poison, i16 [[TMP1]], i64 0
+; DEFAULT-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 2 x i16> [[BROADCAST_SPLATINSERT]], <vscale x 2 x i16> poison, <vscale x 2 x i32> zeroinitializer
+; DEFAULT-NEXT:    [[TMP11:%.*]] = or <vscale x 2 x i16> [[BROADCAST_SPLAT]], splat (i16 1)
+; DEFAULT-NEXT:    [[TMP12:%.*]] = or <vscale x 2 x i16> [[BROADCAST_SPLAT]], splat (i16 1)
+; DEFAULT-NEXT:    [[TMP13:%.*]] = or <vscale x 2 x i16> [[BROADCAST_SPLAT]], splat (i16 1)
+; DEFAULT-NEXT:    [[TMP14:%.*]] = or <vscale x 2 x i16> [[BROADCAST_SPLAT]], splat (i16 1)
+; DEFAULT-NEXT:    [[TMP15:%.*]] = uitofp <vscale x 2 x i16> [[TMP11]] to <vscale x 2 x double>
+; DEFAULT-NEXT:    [[TMP16:%.*]] = uitofp <vscale x 2 x i16> [[TMP12]] to <vscale x 2 x double>
+; DEFAULT-NEXT:    [[TMP17:%.*]] = uitofp <vscale x 2 x i16> [[TMP13]] to <vscale x 2 x double>
+; DEFAULT-NEXT:    [[TMP18:%.*]] = uitofp <vscale x 2 x i16> [[TMP14]] to <vscale x 2 x double>
 ; DEFAULT-NEXT:    [[TMP4:%.*]] = getelementptr double, ptr [[NEXT_GEP]], i32 0
-; DEFAULT-NEXT:    store <8 x double> [[TMP3]], ptr [[TMP4]], align 8
-; DEFAULT-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
-; DEFAULT-NEXT:    [[TMP5:%.*]] = icmp eq i64 [[INDEX_NEXT]], 256
-; DEFAULT-NEXT:    br i1 [[TMP5]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP22:![0-9]+]]
+; DEFAULT-NEXT:    [[TMP20:%.*]] = call i64 @llvm.vscale.i64()
+; DEFAULT-NEXT:    [[TMP21:%.*]] = mul i64 [[TMP20]], 2
+; DEFAULT-NEXT:    [[TMP22:%.*]] = getelementptr double, ptr [[NEXT_GEP]], i64 [[TMP21]]
+; DEFAULT-NEXT:    [[TMP23:%.*]] = call i64 @llvm.vscale.i64()
+; DEFAULT-NEXT:    [[TMP24:%.*]] = mul i64 [[TMP23]], 4
+; DEFAULT-NEXT:    [[TMP25:%.*]] = getelementptr double, ptr [[NEXT_GEP]], i64 [[TMP24]]
...
[truncated]

Type *ValTy = Ctx.Types.inferScalarType(getOperand(0));

switch (getOpcode()) {
case VPInstruction::BranchOnCount: {
Contributor

Are we guaranteed to only ever generate BranchOnCount here? I thought various vplan transforms took place that could change the terminator. For example, we might need BranchOnCond here for completeness.

I think addVPLaneMaskPhiAndUpdateExitBranch shows an example where BranchOnCond should not be free.
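
As a concrete reference, here is a minimal sketch (an assumption, not part of the patch) of what charging for BranchOnCond could look like in the patch's computeCost() switch, pricing the lane-0 extract that feeds the scalar branch:

  case VPInstruction::BranchOnCond: {
    // The i1 condition may come from a widened recipe; lane 0 is extracted
    // before the scalar branch, so charge for that extract. (Sketch only.)
    Type *CondTy = Ctx.Types.inferScalarType(getOperand(0));
    if (VF.isScalar())
      return 0; // A scalar condition feeds the branch directly.
    auto *VecTy = VectorType::get(CondTy, VF);
    return Ctx.TTI.getVectorInstrCost(Instruction::ExtractElement, VecTy,
                                      Ctx.CostKind, /*Index=*/0);
  }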

Contributor

It looks like there are a couple of regressions, e.g. clamped_tc_8, which are probably related. The legacy cost model for now just considers the cost of the condition, but in some cases we removed the original condition already.

Contributor Author

Agreed that BranchOnCond will generate extractelement(vec, 0) and isn't free for all targets. But after this change, the tests still change because the extractelement is free on AArch64.

In clamped_tc_8 the BranchOnCount is replaced by get.active.lane.mask + not + BranchOnCond.
I've tried to implement the cost of get.active.lane.mask locally, but it impacts lots of tests because this intrinsic is very expensive on AArch64 (2 * element count), and clamped_tc_8 would not even be vectorized.
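
For illustration, a rough sketch (an assumption about what such a local change could look like, not the author's actual code) of costing the ActiveLaneMask VPInstruction via the generic intrinsic cost hook:

  case VPInstruction::ActiveLaneMask: {
    // get.active.lane.mask takes two scalar indices and returns a vector of
    // i1, so query the generic intrinsic cost for that signature.
    // (Assumes both operands share one scalar type.)
    Type *IdxTy = Ctx.Types.inferScalarType(getOperand(0));
    auto *MaskTy = VectorType::get(Type::getInt1Ty(IdxTy->getContext()), VF);
    IntrinsicCostAttributes Attrs(Intrinsic::get_active_lane_mask, MaskTy,
                                  {IdxTy, IdxTy});
    return Ctx.TTI.getIntrinsicInstrCost(Attrs, Ctx.CostKind);
  }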

Comment on lines 835 to 850
auto TC = dyn_cast_if_present<ConstantInt>(
getParent()->getPlan()->getTripCount()->getUnderlyingValue());
if (TC && VF.isFixed() && TC->getSExtValue() == VF.getFixedValue())
return 0;
Contributor

Why is this needed here?

Contributor Author

The legacy cost model checks whether the vectorizer can remove the condition, and in that case it does not calculate the cost of the condition. You can find this in the test fully-unrolled-cost.ll.

We can remove this after we hoist unrollByUF and optimizeForVFandUF to before the instruction cost estimation.

@fhahn Do you have a better method to get the original trip count here?

@ElvisWang123
Contributor Author

Ping!
@david-arm @fhahn I was wondering if you could test this patch and check whether it causes regressions on AArch64?
If so, I plan to change the implementation to always add the cost of the ICmp at the end of the vector loop to match the behavior of the legacy cost model. Thanks!

@ElvisWang123
Contributor Author

Gentle ping :)

The potential regression of clamped_tc_8 is gone.
If this patch still causes regressions, I will try to always add the cost of the icmp to the VPBB (similar to the cost of the br) to match the legacy cost model.

@ElvisWang123
Contributor Author

ElvisWang123 commented Jun 2, 2025

Gentle Ping! :)

The test changes in AArch64/induction-costs-sve.ll and AArch64/reduction-recurrence-costs-sve.ll basically revert to the state before #130565. These changes are caused by BranchOnCount being optimized to BranchOnCond, which is free.

@ElvisWang123
Contributor Author

Ping :)

Contributor

@fhahn fhahn left a comment

Thanks for the update, will give this a try this week.

@ElvisWang123
Contributor Author

Rebased, and a gentle ping :)

Contributor

Do we also remove the branch when interleaving, i.e. TC == VF * UF? I wonder if we should do:

  ElementCount VFxUF = VF * UF;
  if (TC && VFxUF.isFixed() && TC->getSExtValue() == VFxUF.getFixedValue())

Contributor Author

Yes, that would be more accurate, as in the following example. But unfortunately, the UF is not decided when calculating the cost (the VPlan is unrolled at executePlan()).

if (TC && VF.isFixed() && TC->getZExtValue() == VF.getFixedValue() * Plan->getUF())

Contributor

Shouldn't the trip count be an unsigned value so we should be using TC->getZExtValue() instead?

Contributor Author

Updated, thanks!

Contributor

nit: Formatting doesn't look quite right, I think it can be:

    // BranchOnCount will generate icmp_eq + br instructions and the cost of
    // branch will be calculated in VPRegionBlock.

Contributor Author

Fixed, thanks!

Contributor

nit: Looks like this comment can fit on one line

Contributor Author

Fixed, thanks!

Contributor

Hmm, the test wasn't supposed to be profitable due to the trunc. Do you know why it has changed? It would be good to keep the test not vectorising due to profitability if possible.

Contributor Author

Yes, the cost of the VPInstruction ICmp is not modeled in the VPlan-based cost model yet.

But modeling the cost of ICmp would cause more test changes, and more loops would not be vectorized.

Contributor

What I mean is that this patch has "broken" the test. The purpose of the test was to explicitly defend a case where vectorisation was not profitable, in this case due to the truncate. That's why the test is named @vectorization_not_profitable_due_to_trunc. Either:

  1. There is a bug in the cost model with this PR which is causing a regression in this test, or
  2. The test is unstable. If so, can you rename the test to something else such as @vectorization_profitable_even_with_trunc.

If it's 2 I think we need to find a replacement test where we do decide against vectorisation due to profitability. I'm happy to look into that, but first I want to be sure it's not 1.

Contributor

Something doesn't look right here. The cost of the exit condition went down from 4->1, yet the total cost of the loop has increased from 12->13. Can you try to find out what's going on here?

Contributor Author

Added the cost of the icmp to make the test clearer.

In this case, the icmp will be widened and has an external user in the middle block, so its cost is 4. Also, the branch-on-count will generate a scalar icmp, which costs 1. So the entire cost increases (12 - 4 + 4 + 1 = 13).

Actually, if forcing VF=8, the plan will select UF=2, and the cost of the ICmp and branch-on-count should be free.

VPlan:

VPlan 'Initial VPlan for VF={2,4,8,16},UF>=1' {
Live-in vp<%0> = VF
Live-in vp<%1> = VF * UF
Live-in vp<%2> = vector-trip-count
Live-in ir<16> = original trip-count

ir-bb<entry>:
Successor(s): scalar.ph, vector.ph

vector.ph:
Successor(s): vector loop

<x1> vector loop: {
  vector.body:
    EMIT vp<%3> = CANONICAL-INDUCTION ir<0>, vp<%index.next>
    ir<%indvars.iv> = WIDEN-INDUCTION  ir<0>, ir<1>, vp<%0>
    vp<%4> = SCALAR-STEPS vp<%3>, ir<1>, vp<%0>
    CLONE ir<%arrayidx> = getelementptr inbounds nuw ir<%src>, vp<%4>
    vp<%5> = vector-pointer ir<%arrayidx>
    WIDEN ir<%0> = load vp<%5>
    CLONE ir<%arrayidx2> = getelementptr inbounds nuw ir<%dst>, vp<%4>
    vp<%6> = vector-pointer ir<%arrayidx2>
    WIDEN ir<%1> = load vp<%6>
    WIDEN ir<%add> = add nsw ir<%1>, ir<%0>
    vp<%7> = vector-pointer ir<%arrayidx2>
    WIDEN store vp<%7>, ir<%add>
    WIDEN ir<%indvars.iv.next> = add nuw nsw ir<%indvars.iv>, ir<1>
    WIDEN ir<%exitcond.not> = icmp eq ir<%indvars.iv.next>, ir<16>
    EMIT vp<%index.next> = add nuw vp<%3>, vp<%1>
    EMIT branch-on-count vp<%index.next>, vp<%2>
  No successors
}
Successor(s): middle.block

middle.block:
  EMIT vp<%9> = extract-last-element ir<%exitcond.not>
  EMIT vp<%cmp.n> = icmp eq ir<16>, vp<%2>
  EMIT branch-on-cond vp<%cmp.n>
Successor(s): ir-bb<exit>, scalar.ph

Contributor

This looks like a possible regression, as we now only use half of the available vector width?

Contributor Author

This can be fixed after #152628 and implementing the cost of VPInstruction::Not.

Contributor

Hmm, doesn't #152628 just hide the issue, as it pushes the extract before the Not? The extract is still there, and only because of the BranchOnCond. If this really gets code-gen'd, it should probably be accounted for, as extracts can be quite expensive on AArch64 and X86 at least.

Contributor Author

Yes, we should account for the cost of the extract, but I think it is better to implement this after removing the assertion. Extract/insert costs are accounted for in the legacy model, and it is hard to calculate how many extracts/inserts will be generated.

In this case, the more scalar instructions there are, the more likely the cost model will choose a higher VF, which prevents the regressions. After #152628, the additional scalar instruction (Not) helps this case choose a higher VF.

Contributor

This looks like a possible regression; we now use a narrower vector width but interleave. This should probably just use the wider vector width?

Contributor Author

This loop will choose the smaller VF because the TC is relatively small (31) and there is no tail folding.

The costs for each VF with TC = 31:

LV: Scalar loop costs: 7.         => cost = 217
Cost for VF 2: 8 (Estimated cost per lane: 4.0) => cost = 127
Cost for VF 4: 9 (Estimated cost per lane: 2.2) => cost = 84
Cost for VF 8: 13 (Estimated cost per lane: 1.6) => cost = 88
Cost for VF 16: 21 (Estimated cost per lane: 1.3) => cost = 126

Note that without this patch, LV counted 2 icmps + 1 add, but these instructions will not be generated in the vector.body.

Cost of 1 for VF 4: exit condition instruction   %cmp = icmp eq i64 %x.inc, 0
Cost of 1 for VF 4: exit condition instruction   %ec = icmp eq i64 %iv.next, %N
Cost of 2 for VF 4: exit condition instruction   %x.inc = add i64 %iv.and, %x
vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <8 x i8> [ zeroinitializer, %vector.ph ], [ %14, %vector.body ]
  %10 = and i64 %index, 1
  %11 = getelementptr i8, ptr %src, i64 %10
  %12 = getelementptr i8, ptr %11, i32 0
  %13 = getelementptr i8, ptr %12, i32 -7
  %wide.load = load <8 x i8>, ptr %13, align 1
  %reverse = shufflevector <8 x i8> %wide.load, <8 x i8> poison, <8 x i32> <i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
  %14 = xor <8 x i8> %reverse, %vec.phi
  %index.next = add nuw i64 %index, 8
  %15 = icmp eq i64 %index.next, %n.vec
  br i1 %15, label %middle.block, label %vector.body, !llvm.loop !0

Contributor

This also looks like a potential regression, as we only use 1/4 of the available vector width?

Contributor Author

In this case, the cost of VF = 8 equals that of VF = vscale x 2. It seems to prefer choosing the scalable vector.

Cost for VF 2: 10 (Estimated cost per lane: 5.0)
Cost for VF 4: 16 (Estimated cost per lane: 4.0)
Cost for VF 8: 28 (Estimated cost per lane: 3.5)
Cost for VF vscale x 1: Invalid (Estimated cost per lane: Invalid)
Cost for VF vscale x 2: 7 (Estimated cost per lane: 3.5)

The original cost contains ICmp + and, but the and does not appear in the generated vector instructions or the VPlan.

Cost of 1 for VF 4: exit condition instruction   %ec = icmp eq i64 %iv.clamp, 512
Cost of 1 for VF 4: exit condition instruction   %iv.clamp = and i64 %iv, 4294967294

Contributor Author

@ElvisWang123 ElvisWang123 left a comment

Addressed comments; based on #152628, which fixes some of the test-case changes.

@david-arm
Contributor

Yes, the cost of the VPInstruction ICmp is not modeled in the VPlan-based cost model yet.

But modeling the cost of ICmp would cause more test changes, and more loops would not be vectorized.

However, if possible I would still prefer to do this in a different order, i.e.

  1. Do preparation work to ensure this PR doesn't regress, then
  2. Land this PR.

Otherwise, if you do this in the opposite order, we're knowingly breaking the loop vectoriser for a period of time, which isn't ideal for other users. If the only way to avoid regressions is to add the cost of the icmp to this PR as well, then perhaps that's a necessary evil?

Also, the test @vectorization_not_profitable_due_to_trunc is still broken, which needs fixing. The test is very confusing because with this PR we do vectorise.

@ElvisWang123 ElvisWang123 force-pushed the impl-vplan-br-cost branch 2 times, most recently from 746880a to 9c5e12e on August 12, 2025.
@ElvisWang123
Contributor Author

Yes, the cost of the VPInstruction ICmp is not modeled in the VPlan-based cost model yet.
But modeling the cost of ICmp would cause more test changes, and more loops would not be vectorized.

However, if possible I would still prefer to do this in a different order, i.e.

  1. Do preparation work to ensure this PR doesn't regress, then
  2. Land this PR.

Otherwise, if you do this in the opposite order, we're knowingly breaking the loop vectoriser for a period of time, which isn't ideal for other users. If the only way to avoid regressions is to add the cost of the icmp to this PR as well, then perhaps that's a necessary evil?

Also, the test @vectorization_not_profitable_due_to_trunc is still broken, which needs fixing. The test is very confusing because with this PR we do vectorise.

Yes, I would like to fix all regressions before landing this patch.
I think some of the cases are caused by a wrong cost from the legacy model that makes them vectorize (or choose a higher VF), while the VPlan-based cost model is still missing some parts of the cost. I will keep finding cases such as #152628 to fix the regressions.

Rebased and implemented the cost of Cmp in 9c5e12e, with some regressions on X86. I will try to find the reason for the regressions caused by the cost of Cmp instructions and fix them.

And the test @vectorization_not_profitable_due_to_trunc won't be vectorized once the cost of ICmp is modeled.

This patch tries to model the cost of exit conditions through the VPlan-based cost model.

* `BranchOnCount` will generate icmp + br.

The branch instruction is already implemented by the VPRegionBlock, so we only need to calculate the cost of the icmp.

If the VF is the same as the trip count of the loop, the BranchOnCount is free.

This patch is not quite NFC for the following reasons.

* Some of the BranchOnCount recipes could be optimized to BranchOnCond, which is free.
* Some of the instructions counted toward the exit condition in the legacy cost model will be optimized out.

This fixes the regression in `llvm/test/Transforms/LoopVectorize/AArch64/induction-costs-sve.ll`.
@ElvisWang123
Contributor Author

Rebased and ping!

Summary of VF changes for different benchmarks:

  • AArch64/conditional-branches-cost.ll: The cost of the uitofp with <vscale x 2 x i8> (cost = 1) is lower than with <2 x i8> (cost = 4), so the LV will choose <vscale x 2 x i8>. @david-arm @davemgreen Just wondering why the cost is different for the same width?
  • X86/constant-fold.ll: Max TC = 3 makes VF 2 and 4 still need a scalar epilogue, which is not profitable.
  • X86/fixed-order-recurrence.ll: Max TC = 9.
  • X86/induction-costs.ll: TC = 8.
  • X86/multi-exit-cost.ll: TC = 31; this makes VF = 16 need 15 iterations of the scalar epilogue.
  • X86/optsize.ll: Optimized for code size.
  • X86/pr81872.ll: TC = 9.
  • X86/predicate-switch.ll: The extra cost of the icmp makes this loop not vectorized. But this test will still not be vectorized when implementing the cost of VPInstructions without an underlying value.
  • X86/scev-checks-unprofitable.ll: TC = 7.

Possible solutions:

  • If the test was created for some optimization or for functional coverage, it's probably better to force vectorization or set the vector width to make sure the test is always vectorized.
  • Increase the small (not power-of-2-friendly) trip count to a large trip count.
