[VPlan] Don't use the legacy cost model for loop conditions #156864
Conversation
The current behaviour of using the legacy cost model for instructions that compute the loop exit condition gets the wrong result when the loop has been transformed to use a different exit condition, e.g. with tail-folded predicated vectorization the exit condition is based on the predicate vector. Fix this by adding costs for all of the VPlan instructions that can create a compare instruction and removing the use of the legacy cost model. This can result in the VPlan choosing a different VF to the legacy cost model, so check for this in computeBestVF so we don't assert.

This causes quite a lot of changes to expected output in tests. These are mostly just changes to the -debug output, or choosing a different VF, but others aren't as obvious as to what's happening:

* AArch64/conditional-branches-cost.ll has more IR instructions, but fewer final machine instructions.
* In X86/pr81872.ll I've increased the trip count by 1, as otherwise the test no longer vectorizes.
* X86/consecutive-ptr-uniforms.ll has more IR instructions, but the branches have constant conditions, so it would later get simplified to just three stores.
@llvm/pr-subscribers-vectorizers @llvm/pr-subscribers-llvm-transforms

Author: John Brawn (john-brawn-arm)
Patch is 95.10 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/156864.diff

13 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 3fbeef1211954..a7021003b8dd4 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -6789,46 +6789,6 @@ LoopVectorizationPlanner::precomputeCosts(VPlan &Plan, ElementCount VF,
}
}
- /// Compute the cost of all exiting conditions of the loop using the legacy
- /// cost model. This is to match the legacy behavior, which adds the cost of
- /// all exit conditions. Note that this over-estimates the cost, as there will
- /// be a single condition to control the vector loop.
- SmallVector<BasicBlock *> Exiting;
- CM.TheLoop->getExitingBlocks(Exiting);
- SetVector<Instruction *> ExitInstrs;
- // Collect all exit conditions.
- for (BasicBlock *EB : Exiting) {
- auto *Term = dyn_cast<BranchInst>(EB->getTerminator());
- if (!Term || CostCtx.skipCostComputation(Term, VF.isVector()))
- continue;
- if (auto *CondI = dyn_cast<Instruction>(Term->getOperand(0))) {
- ExitInstrs.insert(CondI);
- }
- }
- // Compute the cost of all instructions only feeding the exit conditions.
- for (unsigned I = 0; I != ExitInstrs.size(); ++I) {
- Instruction *CondI = ExitInstrs[I];
- if (!OrigLoop->contains(CondI) ||
- !CostCtx.SkipCostComputation.insert(CondI).second)
- continue;
- InstructionCost CondICost = CostCtx.getLegacyCost(CondI, VF);
- LLVM_DEBUG({
- dbgs() << "Cost of " << CondICost << " for VF " << VF
- << ": exit condition instruction " << *CondI << "\n";
- });
- Cost += CondICost;
- for (Value *Op : CondI->operands()) {
- auto *OpI = dyn_cast<Instruction>(Op);
- if (!OpI || CostCtx.skipCostComputation(OpI, VF.isVector()) ||
- any_of(OpI->users(), [&ExitInstrs, this](User *U) {
- return OrigLoop->contains(cast<Instruction>(U)->getParent()) &&
- !ExitInstrs.contains(cast<Instruction>(U));
- }))
- continue;
- ExitInstrs.insert(OpI);
- }
- }
-
// Pre-compute the costs for branches except for the backedge, as the number
// of replicate regions in a VPlan may not directly match the number of
// branches, which would lead to different decisions.
@@ -6977,6 +6937,37 @@ static bool planContainsAdditionalSimplifications(VPlan &Plan,
});
});
}
+
+static bool planContainsDifferentCompares(VPlan &Plan, VPCostContext &CostCtx,
+ Loop *TheLoop, ElementCount VF) {
+ // Count how many compare instructions there are in the legacy cost model.
+ unsigned NumLegacyCompares = 0;
+ for (BasicBlock *BB : TheLoop->blocks()) {
+ for (auto &I : *BB) {
+ if (isa<CmpInst>(I)) {
+ NumLegacyCompares += 1;
+ }
+ }
+ }
+
+ // Count how many compare instructions there are in the VPlan.
+ unsigned NumVPlanCompares = 0;
+ auto Iter = vp_depth_first_deep(Plan.getVectorLoopRegion()->getEntry());
+ for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(Iter)) {
+ for (VPRecipeBase &R : *VPBB) {
+ if (auto *VPI = dyn_cast<VPInstruction>(&R)) {
+ if (VPI->getOpcode() == VPInstruction::BranchOnCount ||
+ VPI->getOpcode() == Instruction::ICmp ||
+ VPI->getOpcode() == Instruction::FCmp)
+ NumVPlanCompares += 1;
+ }
+ }
+ }
+
+ // If we have a different amount, then the legacy cost model and vplan will
+ // disagree.
+ return NumLegacyCompares != NumVPlanCompares;
+}
#endif
VectorizationFactor LoopVectorizationPlanner::computeBestVF() {
@@ -7085,7 +7076,9 @@ VectorizationFactor LoopVectorizationPlanner::computeBestVF() {
CostCtx, OrigLoop,
BestFactor.Width) ||
planContainsAdditionalSimplifications(
- getPlanFor(LegacyVF.Width), CostCtx, OrigLoop, LegacyVF.Width)) &&
+ getPlanFor(LegacyVF.Width), CostCtx, OrigLoop, LegacyVF.Width) ||
+ planContainsDifferentCompares(BestPlan, CostCtx, OrigLoop,
+ BestFactor.Width)) &&
" VPlan cost model and legacy cost model disagreed");
assert((BestFactor.Width.isScalar() || BestFactor.ScalarCost > 0) &&
"when vectorizing, the scalar cost must be computed.");
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 5f3503d0ce57a..1c4e38f75d626 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -28,6 +28,7 @@
#include "llvm/IR/Instruction.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Intrinsics.h"
+#include "llvm/IR/PatternMatch.h"
#include "llvm/IR/Type.h"
#include "llvm/IR/Value.h"
#include "llvm/Support/Casting.h"
@@ -40,6 +41,7 @@
#include <cassert>
using namespace llvm;
+using namespace llvm::PatternMatch;
using VectorParts = SmallVector<Value *, 2>;
@@ -1117,6 +1119,29 @@ InstructionCost VPInstruction::computeCost(ElementCount VF,
return Ctx.TTI.getIndexedVectorInstrCostFromEnd(Instruction::ExtractElement,
VecTy, Ctx.CostKind, 0);
}
+ case VPInstruction::BranchOnCount: {
+ // If TC <= VF then this is just a branch.
+ // FIXME: Removing the branch happens in simplifyBranchConditionForVFAndUF
+ // where it checks TC <= VF * UF, but we don't know UF yet. This means in
+ // some cases we get a cost that's too high due to counting a cmp that
+ // later gets removed.
+ Value *TC = getParent()->getPlan()->getTripCount()->getUnderlyingValue();
+ uint64_t C0;
+ if (TC && match(TC, m_ConstantInt(C0)) && C0 <= VF.getKnownMinValue())
+ return 0;
+ // Otherwise BranchOnCount generates ICmpEQ followed by a branch.
+ Type *ValTy = Ctx.Types.inferScalarType(getOperand(0));
+ return Ctx.TTI.getCmpSelInstrCost(Instruction::ICmp, ValTy,
+ CmpInst::makeCmpResultType(ValTy),
+ CmpInst::ICMP_EQ, Ctx.CostKind);
+ }
+ case Instruction::FCmp:
+ case Instruction::ICmp: {
+ Type *ValTy = Ctx.Types.inferScalarType(getOperand(0));
+ return Ctx.TTI.getCmpSelInstrCost(getOpcode(), ValTy,
+ CmpInst::makeCmpResultType(ValTy),
+ getPredicate(), Ctx.CostKind);
+ }
case VPInstruction::ExtractPenultimateElement:
if (VF == ElementCount::getScalable(1))
return InstructionCost::getInvalid();
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll b/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
index dad0290142f29..9d074c0c8d907 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
@@ -268,7 +268,7 @@ define void @latch_branch_cost(ptr %dst) {
; DEFAULT: [[MIDDLE_BLOCK]]:
; DEFAULT-NEXT: br i1 false, [[EXIT:label %.*]], label %[[VEC_EPILOG_ITER_CHECK:.*]]
; DEFAULT: [[VEC_EPILOG_ITER_CHECK]]:
-; DEFAULT-NEXT: br i1 false, label %[[VEC_EPILOG_SCALAR_PH]], label %[[VEC_EPILOG_PH]]
+; DEFAULT-NEXT: br i1 false, label %[[VEC_EPILOG_SCALAR_PH]], label %[[VEC_EPILOG_PH]], !prof [[PROF5:![0-9]+]]
; DEFAULT: [[VEC_EPILOG_PH]]:
; DEFAULT-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ 96, %[[VEC_EPILOG_ITER_CHECK]] ], [ 0, %[[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
; DEFAULT-NEXT: br label %[[VEC_EPILOG_VECTOR_BODY:.*]]
@@ -278,7 +278,7 @@ define void @latch_branch_cost(ptr %dst) {
; DEFAULT-NEXT: store <4 x i8> zeroinitializer, ptr [[TMP8]], align 1
; DEFAULT-NEXT: [[INDEX_NEXT2]] = add nuw i64 [[INDEX1]], 4
; DEFAULT-NEXT: [[TMP10:%.*]] = icmp eq i64 [[INDEX_NEXT2]], 100
-; DEFAULT-NEXT: br i1 [[TMP10]], label %[[VEC_EPILOG_MIDDLE_BLOCK:.*]], label %[[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
+; DEFAULT-NEXT: br i1 [[TMP10]], label %[[VEC_EPILOG_MIDDLE_BLOCK:.*]], label %[[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
; DEFAULT: [[VEC_EPILOG_MIDDLE_BLOCK]]:
; DEFAULT-NEXT: br i1 true, [[EXIT]], label %[[VEC_EPILOG_SCALAR_PH]]
; DEFAULT: [[VEC_EPILOG_SCALAR_PH]]:
@@ -429,14 +429,14 @@ define i32 @header_mask_and_invariant_compare(ptr %A, ptr %B, ptr %C, ptr %D, pt
; DEFAULT-NEXT: br label %[[VECTOR_BODY:.*]]
; DEFAULT: [[VECTOR_BODY]]:
; DEFAULT-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_STORE_CONTINUE37:.*]] ]
-; DEFAULT-NEXT: [[TMP9:%.*]] = load i32, ptr [[A]], align 4, !alias.scope [[META7:![0-9]+]]
+; DEFAULT-NEXT: [[TMP9:%.*]] = load i32, ptr [[A]], align 4, !alias.scope [[META8:![0-9]+]]
; DEFAULT-NEXT: [[BROADCAST_SPLATINSERT28:%.*]] = insertelement <4 x i32> poison, i32 [[TMP9]], i64 0
; DEFAULT-NEXT: [[BROADCAST_SPLAT29:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT28]], <4 x i32> poison, <4 x i32> zeroinitializer
-; DEFAULT-NEXT: [[TMP19:%.*]] = load i32, ptr [[B]], align 4, !alias.scope [[META10:![0-9]+]]
+; DEFAULT-NEXT: [[TMP19:%.*]] = load i32, ptr [[B]], align 4, !alias.scope [[META11:![0-9]+]]
; DEFAULT-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> poison, i32 [[TMP19]], i64 0
; DEFAULT-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> poison, <4 x i32> zeroinitializer
; DEFAULT-NEXT: [[TMP6:%.*]] = or <4 x i32> [[BROADCAST_SPLAT]], [[BROADCAST_SPLAT29]]
-; DEFAULT-NEXT: [[TMP7:%.*]] = load i32, ptr [[C]], align 4, !alias.scope [[META12:![0-9]+]]
+; DEFAULT-NEXT: [[TMP7:%.*]] = load i32, ptr [[C]], align 4, !alias.scope [[META13:![0-9]+]]
; DEFAULT-NEXT: [[BROADCAST_SPLATINSERT30:%.*]] = insertelement <4 x i32> poison, i32 [[TMP7]], i64 0
; DEFAULT-NEXT: [[BROADCAST_SPLAT31:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT30]], <4 x i32> poison, <4 x i32> zeroinitializer
; DEFAULT-NEXT: [[TMP8:%.*]] = icmp ugt <4 x i32> [[BROADCAST_SPLAT31]], [[TMP6]]
@@ -445,34 +445,34 @@ define i32 @header_mask_and_invariant_compare(ptr %A, ptr %B, ptr %C, ptr %D, pt
; DEFAULT-NEXT: br i1 [[TMP20]], label %[[PRED_STORE_IF:.*]], label %[[PRED_STORE_CONTINUE:.*]]
; DEFAULT: [[PRED_STORE_IF]]:
; DEFAULT-NEXT: [[TMP11:%.*]] = extractelement <4 x i32> [[TMP6]], i32 0
-; DEFAULT-NEXT: store i32 [[TMP11]], ptr [[E]], align 4, !alias.scope [[META14:![0-9]+]], !noalias [[META16:![0-9]+]]
+; DEFAULT-NEXT: store i32 [[TMP11]], ptr [[E]], align 4, !alias.scope [[META15:![0-9]+]], !noalias [[META17:![0-9]+]]
; DEFAULT-NEXT: br label %[[PRED_STORE_CONTINUE]]
; DEFAULT: [[PRED_STORE_CONTINUE]]:
; DEFAULT-NEXT: [[TMP12:%.*]] = extractelement <4 x i1> [[TMP8]], i32 1
; DEFAULT-NEXT: br i1 [[TMP12]], label %[[PRED_STORE_IF32:.*]], label %[[PRED_STORE_CONTINUE33:.*]]
; DEFAULT: [[PRED_STORE_IF32]]:
; DEFAULT-NEXT: [[TMP13:%.*]] = extractelement <4 x i32> [[TMP6]], i32 0
-; DEFAULT-NEXT: store i32 [[TMP13]], ptr [[E]], align 4, !alias.scope [[META14]], !noalias [[META16]]
+; DEFAULT-NEXT: store i32 [[TMP13]], ptr [[E]], align 4, !alias.scope [[META15]], !noalias [[META17]]
; DEFAULT-NEXT: br label %[[PRED_STORE_CONTINUE33]]
; DEFAULT: [[PRED_STORE_CONTINUE33]]:
; DEFAULT-NEXT: [[TMP14:%.*]] = extractelement <4 x i1> [[TMP8]], i32 2
; DEFAULT-NEXT: br i1 [[TMP14]], label %[[PRED_STORE_IF34:.*]], label %[[PRED_STORE_CONTINUE35:.*]]
; DEFAULT: [[PRED_STORE_IF34]]:
; DEFAULT-NEXT: [[TMP15:%.*]] = extractelement <4 x i32> [[TMP6]], i32 0
-; DEFAULT-NEXT: store i32 [[TMP15]], ptr [[E]], align 4, !alias.scope [[META14]], !noalias [[META16]]
+; DEFAULT-NEXT: store i32 [[TMP15]], ptr [[E]], align 4, !alias.scope [[META15]], !noalias [[META17]]
; DEFAULT-NEXT: br label %[[PRED_STORE_CONTINUE35]]
; DEFAULT: [[PRED_STORE_CONTINUE35]]:
; DEFAULT-NEXT: [[TMP21:%.*]] = extractelement <4 x i1> [[TMP8]], i32 3
; DEFAULT-NEXT: br i1 [[TMP21]], label %[[PRED_STORE_IF36:.*]], label %[[PRED_STORE_CONTINUE37]]
; DEFAULT: [[PRED_STORE_IF36]]:
; DEFAULT-NEXT: [[TMP22:%.*]] = extractelement <4 x i32> [[TMP6]], i32 0
-; DEFAULT-NEXT: store i32 [[TMP22]], ptr [[E]], align 4, !alias.scope [[META14]], !noalias [[META16]]
+; DEFAULT-NEXT: store i32 [[TMP22]], ptr [[E]], align 4, !alias.scope [[META15]], !noalias [[META17]]
; DEFAULT-NEXT: br label %[[PRED_STORE_CONTINUE37]]
; DEFAULT: [[PRED_STORE_CONTINUE37]]:
-; DEFAULT-NEXT: call void @llvm.masked.store.v4i32.p0(<4 x i32> zeroinitializer, ptr [[TMP16]], i32 4, <4 x i1> [[TMP8]]), !alias.scope [[META18:![0-9]+]], !noalias [[META19:![0-9]+]]
+; DEFAULT-NEXT: call void @llvm.masked.store.v4i32.p0(<4 x i32> zeroinitializer, ptr [[TMP16]], i32 4, <4 x i1> [[TMP8]]), !alias.scope [[META19:![0-9]+]], !noalias [[META20:![0-9]+]]
; DEFAULT-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
; DEFAULT-NEXT: [[TMP18:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
-; DEFAULT-NEXT: br i1 [[TMP18]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP20:![0-9]+]]
+; DEFAULT-NEXT: br i1 [[TMP18]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP21:![0-9]+]]
; DEFAULT: [[MIDDLE_BLOCK]]:
; DEFAULT-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC]]
; DEFAULT-NEXT: br i1 [[CMP_N]], [[EXIT:label %.*]], label %[[SCALAR_PH]]
@@ -533,25 +533,47 @@ define void @multiple_exit_conditions(ptr %src, ptr noalias %dst) #1 {
; DEFAULT-LABEL: define void @multiple_exit_conditions(
; DEFAULT-SAME: ptr [[SRC:%.*]], ptr noalias [[DST:%.*]]) #[[ATTR2:[0-9]+]] {
; DEFAULT-NEXT: [[ENTRY:.*:]]
-; DEFAULT-NEXT: br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; DEFAULT-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
+; DEFAULT-NEXT: [[TMP4:%.*]] = shl nuw i64 [[TMP0]], 3
+; DEFAULT-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 257, [[TMP4]]
+; DEFAULT-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
; DEFAULT: [[VECTOR_PH]]:
-; DEFAULT-NEXT: [[IND_END:%.*]] = getelementptr i8, ptr [[DST]], i64 2048
-; DEFAULT-NEXT: br label %[[VECTOR_BODY:.*]]
-; DEFAULT: [[VECTOR_BODY]]:
-; DEFAULT-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; DEFAULT-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
+; DEFAULT-NEXT: [[TMP3:%.*]] = mul nuw i64 [[TMP2]], 8
+; DEFAULT-NEXT: [[N_MOD_VF:%.*]] = urem i64 257, [[TMP3]]
+; DEFAULT-NEXT: [[INDEX:%.*]] = sub i64 257, [[N_MOD_VF]]
; DEFAULT-NEXT: [[OFFSET_IDX:%.*]] = mul i64 [[INDEX]], 8
; DEFAULT-NEXT: [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[DST]], i64 [[OFFSET_IDX]]
+; DEFAULT-NEXT: [[TMP6:%.*]] = mul i64 [[INDEX]], 2
+; DEFAULT-NEXT: br label %[[VECTOR_BODY:.*]]
+; DEFAULT: [[VECTOR_BODY]]:
+; DEFAULT-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; DEFAULT-NEXT: [[OFFSET_IDX1:%.*]] = mul i64 [[INDEX1]], 8
+; DEFAULT-NEXT: [[NEXT_GEP1:%.*]] = getelementptr i8, ptr [[DST]], i64 [[OFFSET_IDX1]]
; DEFAULT-NEXT: [[TMP1:%.*]] = load i16, ptr [[SRC]], align 2
-; DEFAULT-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <8 x i16> poison, i16 [[TMP1]], i64 0
-; DEFAULT-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <8 x i16> [[BROADCAST_SPLATINSERT]], <8 x i16> poison, <8 x i32> zeroinitializer
-; DEFAULT-NEXT: [[TMP2:%.*]] = or <8 x i16> [[BROADCAST_SPLAT]], splat (i16 1)
-; DEFAULT-NEXT: [[TMP3:%.*]] = uitofp <8 x i16> [[TMP2]] to <8 x double>
-; DEFAULT-NEXT: store <8 x double> [[TMP3]], ptr [[NEXT_GEP]], align 8
-; DEFAULT-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
-; DEFAULT-NEXT: [[TMP5:%.*]] = icmp eq i64 [[INDEX_NEXT]], 256
-; DEFAULT-NEXT: br i1 [[TMP5]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP22:![0-9]+]]
+; DEFAULT-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 2 x i16> poison, i16 [[TMP1]], i64 0
+; DEFAULT-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 2 x i16> [[BROADCAST_SPLATINSERT]], <vscale x 2 x i16> poison, <vscale x 2 x i32> zeroinitializer
+; DEFAULT-NEXT: [[TMP11:%.*]] = or <vscale x 2 x i16> [[BROADCAST_SPLAT]], splat (i16 1)
+; DEFAULT-NEXT: [[TMP15:%.*]] = uitofp <vscale x 2 x i16> [[TMP11]] to <vscale x 2 x double>
+; DEFAULT-NEXT: [[TMP16:%.*]] = call i64 @llvm.vscale.i64()
+; DEFAULT-NEXT: [[TMP17:%.*]] = shl nuw i64 [[TMP16]], 1
+; DEFAULT-NEXT: [[TMP18:%.*]] = getelementptr double, ptr [[NEXT_GEP1]], i64 [[TMP17]]
+; DEFAULT-NEXT: [[TMP19:%.*]] = call i64 @llvm.vscale.i64()
+; DEFAULT-NEXT: [[TMP20:%.*]] = shl nuw i64 [[TMP19]], 2
+; DEFAULT-NEXT: [[TMP21:%.*]] = getelementptr double, ptr [[NEXT_GEP1]], i64 [[TMP20]]
+; DEFAULT-NEXT: [[TMP22:%.*]] = call i64 @llvm.vscale.i64()
+; DEFAULT-NEXT: [[TMP23:%.*]] = mul nuw i64 [[TMP22]], 6
+; DEFAULT-NEXT: [[TMP24:%.*]] = getelementptr double, ptr [[NEXT_GEP1]], i64 [[TMP23]]
+; DEFAULT-NEXT: store <vscale x 2 x double> [[TMP15]], ptr [[NEXT_GEP1]], align 8
+; DEFAULT-NEXT: store <vscale x 2 x double> [[TMP15]], ptr [[TMP18]], align 8
+; DEFAULT-NEXT: store <vscale x 2 x double> [[TMP15]], ptr [[TMP21]], align 8
+; DEFAULT-NEXT: store <vscale x 2 x double> [[TMP15]], ptr [[TMP24]], align 8
+; DEFAULT-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX1]], [[TMP3]]
+; DEFAULT-NEXT: [[TMP25:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[INDEX]]
+; DEFAULT-NEXT: br i1 [[TMP25]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP23:![0-9]+]]
; DEFAULT: [[MIDDLE_BLOCK]]:
-; DEFAULT-NEXT: br label %[[SCALAR_PH]]
+; DEFAULT-NEXT: [[CMP_N:%.*]] = icmp eq i64 257, [[INDEX]]
+; DEFAULT-NEXT: br i1 [[CMP_N]], [[EXIT:label %.*]], label %[[SCALAR_PH]]
; DEFAULT: [[SCALAR_PH]]:
;
; PRED-LABEL: define void @multiple_exit_conditions(
@@ -658,12 +680,12 @@ define void @low_trip_count_fold_tail_scalarized_store(ptr %dst) {
; COMMON-NEXT: store i8 6, ptr [[TMP6]], align 1
; COMMON-NEXT: br label %[[PRED_STORE_CONTINUE12]]
; COMMON: [[PRED_STORE_CONTINUE12]]:
-; COMMON-NEXT: br i1 false, label %[[PRED_STORE_IF13:.*]], label %[[PRED_STORE_CONTINUE14:.*]]
+; COMMON-NEXT: br i1 false, label %[[PRED_STORE_IF13:.*]], label %[[EXIT1:.*]]
; COMMON: [[PRED_STORE_IF13]]:
; COMMON-NEXT: [[TMP7:%.*]] = getelementptr i8, ptr [[DST]], i64 7
; COMMON-NEXT: store i8 7, ptr [[TMP7]], align 1
-; COMMON-NEXT: br label %[[PRED_STORE_CONTINUE14]]
-; COMMON: [[PRED_STORE_CONTINUE14]]:
+; COMMON-NEXT: br label %[[EXIT1]]
+; COMMON: [[EXIT1]]:
; COMMON-NEXT: br label %[[MIDDLE_BLOCK:.*]]
; COMMON: [[MIDDLE_BLOCK]]:
; COMMON-NEXT: br [[EXIT:label %.*]]
@@ -866,7 +888,7 @@ define void @test_conditional_interleave_group (ptr noalias %src.1, ptr noalias
; DEFAULT-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
; DEFAULT-NEXT: [[VEC_IND_NEXT]] = add <8 x i64> [[VEC_IND]], splat (i64 8)
; DEFAULT-NEXT: [[TMP80:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
-; DEFAULT-NEXT: br i1 [[TMP80]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP25:![0-9]+]]
+; DEFAULT-NEXT: br i1 [[TMP80]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP26:![0-9]+]]
; DEFAULT: [[MIDDLE_BLOCK]]:
; DEFAULT-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC]]
; DEFAULT-NEXT: br i1 [[CMP_N]], [[EXIT:label %.*]], label %[[SCALAR_PH]]
@@ -1115,7 +1137,7 @@ define void @redundant_branch_and_tail_folding(ptr %dst, i1 %c) {
; DEFAULT-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
; DEFAULT-NEXT: [[VEC_IND_NEXT]] = add <4 x i64> [[STEP_ADD]], splat (i64 4)
; DEFAULT-NEXT: [[TMP3:%.*]] = icmp eq i64 [[INDEX_NEXT]], 16
-; DEFAULT-NEXT: br i1 [[TMP3]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP27:![0-9]+]]
+; DEFAULT-NEXT: br i1 [[TMP3]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP28:![0-9]+]]
; DEFAULT: [[MIDDLE_BLOCK]]:
; DEFAULT-NEXT: br label %[[SCALAR_PH]]
; DEFAULT: [[SCALAR_PH]]:
@@ -1289,7 +1311,7 @@ defin...
[truncated]
Thanks for moving more cost modelling into vplan. I just had a few concerns about some missing costs and a possible test regression.
// later gets removed.
Value *TC = getParent()->getPlan()->getTripCount()->getUnderlyingValue();
ConstantInt *TCConst = dyn_cast_if_present<ConstantInt>(TC);
if (TCConst && TCConst->getValue().ule(VF.getKnownMinValue()))
Can you add a TODO for the case where TC=vscale x M and VF=vscale * N as well? In such cases we should also be able to prove that TC <= VF because it just requires asking if M <= N.
Will do.
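For illustration, a sketch of what that TODO might eventually look like, reusing the PatternMatch helpers already used for the constant trip-count case (this is an assumed shape, not code from this patch, and it assumes TC, VF and match are in scope as in computeCost):

// Hypothetical extension (not in this patch): if TC = vscale * M and VF is
// scalable with a known minimum of N lanes, then TC <= VF follows from
// M <= N, so the compare would also be removed and can be costed as 0.
// A real implementation would also handle commuted or shl forms of the
// multiply.
uint64_t M;
if (TC && VF.isScalable() &&
    match(TC, m_Mul(m_VScale(), m_ConstantInt(M))) &&
    M <= VF.getKnownMinValue())
  return 0;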
return Ctx.TTI.getIndexedVectorInstrCostFromEnd(Instruction::ExtractElement,
VecTy, Ctx.CostKind, 0);
}
case VPInstruction::BranchOnCount: {
I think BranchOnCond needs adding too.
BranchOnCond doesn't cause a compare instruction to be generated; it uses the condition generated by another instruction.
Type *ValTy = Ctx.Types.inferScalarType(getOperand(0));
return Ctx.TTI.getCmpSelInstrCost(Instruction::ICmp, ValTy,
CmpInst::makeCmpResultType(ValTy),
CmpInst::ICMP_EQ, Ctx.CostKind);
I think you need to add the cost of the branch as well.
I've deliberately avoided touching branch costs to keep the scope of this work as small as possible.
; CHECK-NEXT: Cost of 0 for VF 8: induction instruction %i.iv = phi i64 [ 0, %entry ], [ %i.iv.next, %for.body ]
; CHECK-NEXT: Cost of 1 for VF 8: exit condition instruction %exitcond.not = icmp eq i64 %i.iv.next, 16
; CHECK-NEXT: Cost of 0 for VF 8: EMIT vp<{{.+}}> = CANONICAL-INDUCTION ir<0>, vp<%index.next>
; CHECK: Cost of 1 for VF 8: EMIT branch-on-count vp<%index.next>, vp<{{.+}}>
I think @igogo-x86 added this file in #106699. You might also want to remove addFullyUnrolledInstructionsToIgnore, since you're now dealing with the cost properly in vplan?
It looks like that function is marking both compares and induction variable instructions as needing to be ignored. So we could remove the ignoring of compares, except that the function is also called from the legacy cost model which does need to continue ignoring the compares. So I think the call to addFullyUnrolledInstructionsToIgnore in LoopVectorizationPlanner::precomputeCosts can only be removed once the vplan cost model is also correctly handling the induction variable instructions.
; PRED-NEXT: [[TMP35:%.*]] = extractelement <2 x i1> [[ACTIVE_LANE_MASK_NEXT]], i32 0
; PRED-NEXT: [[TMP36:%.*]] = xor i1 [[TMP35]], true
; PRED-NEXT: [[VEC_IND_NEXT]] = add <4 x i32> [[VEC_IND]], splat (i32 4)
; PRED-NEXT: [[VEC_IND_NEXT]] = add <2 x i32> [[VEC_IND]], splat (i32 2)
Do you know why we're choosing a lower VF here? It may be a sensible change, but it's not obvious what triggered the different choice of VF, as it doesn't seem like an FCmp, ICmp or BranchOnCount VPInstruction is being generated here.
In the vplan the vector loop branch is
EMIT vp<%active.lane.mask.next> = active lane mask vp<%12>, vp<%5>, ir<1>
EMIT vp<%13> = not vp<%active.lane.mask.next>
EMIT branch-on-cond vp<%13>
The loop branch condition comes from the active lane mask where we're already calculating the cost of
Cost of 4 for VF 2: EMIT vp<%active.lane.mask.entry> = active lane mask vp<%index.part.next>, vp<%4>, ir<1>
Previously we also added the cost of a compare in the legacy cost model (which doesn't correspond to anything in the vplan)
Cost of 1 for VF 2: exit condition instruction %exitcond.3.not = icmp eq i64 %iv.1, %N
meaning the cost was 1 higher than it should have been. This means the calculated costs are
| VF | Before this patch | With this patch |
|---|---|---|
| 2 | 12, 6 per lane | 11, 5.5 per lane |
| 4 | 23, 5.8 per lane | 22, 5.5 per lane |
So VF 2 is now chosen because we calculate that its per-lane cost is the same as for VF 4.
; CHECK-NEXT: br i1 [[CMP2]], label %[[RETURN:.*]], label %[[FOR_BODY]]
; CHECK: [[RETURN]]:
; CHECK-NEXT: ret void
; CHECK-NEXT: [[ENTRY:.*:]]
At first glance this looks worse than before.
All of the control flow here is using constant conditions (because we know the trip count is 3 so we know which lanes need to be stored) so after simplifycfg we get
define void @PR40816() #1 {
entry:
store i32 0, ptr @b, align 1
store i32 1, ptr @b, align 1
store i32 2, ptr @b, align 1
ret void
}
Though if I modify this test a little so that the trip count is 7 (make array @a larger) then we still vectorize but don't get constant conditions for the branches and so it is worse. I'll look into this.
It looks like what's going on here is:
- Currently the load from arrayidx is considered to be part of calculating the loop exit condition, and so the cost is calculated in LoopVectorizationPlanner::precomputeCosts. It gets a very high cost due to useEmulatedMaskMemRefHack so we don't vectorize.
- In the vplan something has figured out that the loop has a constant trip count due to the load being from a constant array, so the load has been removed.
- With this patch that means we don't use the cost of the load, as it no longer exists, and the resulting cost says that vectorization is profitable.
If I manually transform the function into what it is after the vplan transformation then it looks like
define void @PR40816_adj() #1 {
entry:
br label %for.body
for.body:
%0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
store i32 %0, ptr @b, align 1
%inc = add nuw nsw i32 %0, 1
%cmp = icmp uge i32 %inc, 7
br i1 %cmp, label %return, label %for.body
return:
ret void
}
and this currently gets vectorized. This is in fact very similar to the test low_trip_count_fold_tail_scalarized_store in llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll.
I think this ultimately comes down to this FIXME in LoopVectorizationCostModel::setCostBasedWideningDecision
// Load: Scalar load + broadcast
// Store: Scalar store + isLoopInvariantStoreValue ? 0 : extract
// FIXME: This cost is a significant under-estimate for tail folded
// memory ops.
const InstructionCost ScalarizationCost =
IsLegalToScalarize() ? getUniformMemOpCost(&I, VF)
: InstructionCost::getInvalid();
Likely related to @ElvisWang123's #125640. IIRC the latest status was that there were still some regressions needing investigation.
for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(Iter)) {
for (VPRecipeBase &R : *VPBB) {
if (auto *VPI = dyn_cast<VPInstruction>(&R)) {
if (VPI->getOpcode() == VPInstruction::BranchOnCount ||
That doesn't seem quite right; isn't this missing at least wide compares and replicating/uniform compares?
I've changed this to use VPlanPatternMatch, which should catch everything I think.
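For context, a rough sketch of what counting the extra recipe kinds directly could look like; this is only an illustration of the reviewer's point (the recipe API usage is an assumption), and the updated PR uses VPlanPatternMatch instead:

// Sketch: count compares from widened and replicated recipes as well as
// VPInstructions (illustration only; not the code used in the PR).
unsigned NumVPlanCompares = 0;
for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(Iter)) {
  for (VPRecipeBase &R : *VPBB) {
    unsigned Opcode = 0;
    if (auto *VPI = dyn_cast<VPInstruction>(&R))
      Opcode = VPI->getOpcode();
    else if (auto *WidenR = dyn_cast<VPWidenRecipe>(&R))
      Opcode = WidenR->getOpcode();
    else if (auto *RepR = dyn_cast<VPReplicateRecipe>(&R))
      Opcode = RepR->getUnderlyingInstr()->getOpcode();
    if (Opcode == VPInstruction::BranchOnCount ||
        Opcode == Instruction::ICmp || Opcode == Instruction::FCmp)
      ++NumVPlanCompares;
  }
}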
Ping.
I'm all in favour of making steps to retire the legacy cost model where possible! However, I think there are still a few issues that need dealing with, particularly as the scope of this PR is actually a bit greater than it seems.
VecTy, Ctx.CostKind, 0);
}
case VPInstruction::BranchOnCount: {
// If TC <= VF then this is just a branch.
I'm not sure what this means. Are you saying that we create a vplan for a given VF despite knowing that we will never enter the vector loop? I guess this can happen if TC is exactly equal to VF or we're using tail-folding, but not using the mask for control flow.
As mentioned in the comment below this, this transformation is happening in simplifyBranchConditionForVFAndUF and means the vector loop is executed exactly once. TC < VF can happen with tail folding, e.g. low_trip_count_fold_tail_scalarized_store in llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll.
}
case VPInstruction::BranchOnCount: {
// If TC <= VF then this is just a branch.
// FIXME: Removing the branch happens in simplifyBranchConditionForVFAndUF
So you're saying the cost of this branch is based on a prediction about what simplifyBranchConditionForVFAndUF is going to do later on? I guess that's fine so long as they're both using the same logic. Ideally both simplifyBranchConditionForVFAndUF and this code would call the same common function checking if the branch will be simplified or not. I'm just a bit worried that over time the two will diverge. Although I appreciate here in the code you'd have to assume UF=1.
For example, if you pulled this code out of simplifyBranchConditionForVFAndUF into a common function you could reuse it in both places:
// Try to simplify the branch condition if TC <= VF * UF when the latch
// terminator is BranchOnCount or BranchOnCond where the input is
// Not(ActiveLaneMask).
const SCEV *TripCount =
vputils::getSCEVExprForVPValue(Plan.getTripCount(), SE);
assert(!isa<SCEVCouldNotCompute>(TripCount) &&
"Trip count SCEV must be computable");
ElementCount NumElements = BestVF.multiplyCoefficientBy(BestUF);
const SCEV *C = SE.getElementCount(TripCount->getType(), NumElements);
if (TripCount->isZero() ||
!SE.isKnownPredicate(CmpInst::ICMP_ULE, TripCount, C))
return false;
You'd also now be able to remove the "// FIXME: The compare could also be removed if TC = M * vscale" comment below.
I'll look into doing that.
I don't think that's the way to go; we shouldn't duplicate that kind of reasoning here. The TODO should say that the branch should be simplified before we compute the costs.
As a workaround for now, catching some cases here should be fine, as this should only mean we may miss some new optimizations, not make things worse.
Well, I do think we should be dealing with trip counts that are multiples of vscale here too since we now support them and we know that simplifyBranchConditionForVFAndUF should correctly detect TC=3 x vscale < VF=4 x vscale. It would seem unfair to treat TC=3 <= VF=4 as a cost of 0 and TC=3 * vscale <= VF=4 * vscale as a cost of 1 just so we can keep the code simple, right?
After looking into this, the problem is that we don't have access to the ScalarEvolution object in VPInstruction, so putting code from simplifyBranchConditionForVFAndUF into a function and calling it won't work, as that makes use of ScalarEvolution. Simplifying the branch before we compute the cost seems like a good solution to this.
OK no problem. I wasn't sure how easy it would be, but thanks for looking into it! I can follow up with a later PR to get access to the SCEV here.
CmpInst::makeCmpResultType(ValTy),
CmpInst::ICMP_EQ, Ctx.CostKind);
}
case Instruction::FCmp:
The change below is about more than just the loop conditions. See VPPredicator::createHeaderMask for an example where we explicitly introduce an icmp for the current tail-folding mask. In fact, I don't think the cost below is correct because the icmp can have vector inputs.
I think you either need to:
- Find a way to bail out if the icmp/fcmp isn't used as a branch condition, or
- Add support for vector types using ValTy = toVectorTy(ValTy, VF), and just make sure the example of VPPredicator::createHeaderMask is being tested in this PR.
Sorry about this!
getCostForRecipeWithOpcode already correctly handles vector compares, so I've changed this to use that function.
This causes a bunch of test changes, but it looks like they are due to correctly accounting for the cost of a vector compare. In test/Transforms/LoopVectorize/X86/pr81872.ll I've adjusted the iteration count again so that it still vectorizes.
Thanks for being patient with all the comments. I spotted a bug in the icmp/fcmp cost calculation, but once that's fixed I'm happy!
}
case Instruction::FCmp:
case Instruction::ICmp:
return getCostForRecipeWithOpcode(getOpcode(), VF, Ctx);
Sorry about this, but while I was trying to understand what's going on with the test changes to conditional-branches-cost.ll I discovered we're now treating the icmp for the exit condition as a vector comparison rather than a scalar comparison. This means we're actually reporting costs for fixed-width VFs that are too high.
I think this code needs changing to be:
return getCostForRecipeWithOpcode(getOpcode(),
vputils::onlyFirstLaneUsed(this) ? ElementCount::getFixed(1) : VF, Ctx);
Done.
; PRED-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 [[TMP0]])
; PRED-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 16 x i8> poison, i8 [[Y]], i64 0
; PRED-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 16 x i8> [[BROADCAST_SPLATINSERT]], <vscale x 16 x i8> poison, <vscale x 16 x i32> zeroinitializer
; PRED-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[TMP0]])
Looks like the reason this test has changed behaviour is that we were previously adding on the cost of the original scalar exit condition (an icmp) when in reality the vplan doesn't have one. Nice!
I think it's the same conclusion you came to with the induction-costs-sve.ll test below.
// Count how many compare instructions there are in the VPlan.
unsigned NumVPlanCompares = 0;
auto Iter = vp_depth_first_deep(Plan.getVectorLoopRegion()->getEntry());
for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(Iter)) {
vp_depth_first_deep will also leave the region and visit its successors, so we will also count the compare in the middle block, and almost always overcount the compares in the VPlan. It probably needs to check if we left the region.
I think it will also always disable the check if we have loops controlled by active-lane-mask?
I've changed this to ignore blocks outside of the vector loop region.
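For illustration, restricting the traversal to the loop region might look something like this (a sketch assuming VPBasicBlock::getEnclosingLoopRegion; the actual change in the updated PR may differ):

// Sketch: vp_depth_first_deep can also visit the region's successors (e.g.
// the middle block), so only count compares in blocks that are still inside
// the vector loop region.
auto *LoopRegion = Plan.getVectorLoopRegion();
auto Iter = vp_depth_first_deep(LoopRegion->getEntry());
for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(Iter)) {
  if (VPBB->getEnclosingLoopRegion() != LoopRegion)
    continue;
  // ... count compares in VPBB as before ...
}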
I think it will also always disable the check if we have loops controlled by active-lane-mask?
I'm not sure what you're asking here. When we have a loop that's using an active-lane-mask, the vplan will have something like (example here taken from llvm/test/Transforms/LoopVectorize/AArch64/sve-wide-lane-mask.ll)
Cost of 1 for VF vscale x 4: EMIT vp<%active.lane.mask.next> = active lane mask vp<%10>, vp<%4>, ir<1>
Cost of 0 for VF vscale x 4: EMIT vp<%11> = not vp<%active.lane.mask.next>
Cost of 0 for VF vscale x 4: EMIT branch-on-cond vp<%11>
the legacy cost model will have
LV: Found an estimated cost of 1 for VF vscale x 4 For instruction: %exitcond.not = icmp eq i64 %iv.next, %n
LV: Found an estimated cost of 0 for VF vscale x 4 For instruction: br i1 %exitcond.not, label %for.end, label %for.body
planContainsDifferentCompares would count 1 compare in the legacy cost model, no compares in the vplan, and return true.
planContainsDifferentCompares would count 1 compare in the legacy cost model, no compares in the vplan, and return true.
Yep, what I was wondering was if we could exclude plans with ActiveLaneMask terminated exiting blocks from the carve-out, to preserve the original check for those?
Sorry, it's still not clear what you're asking here. Do you mean: planContainsDifferentCompares should return false for plans that contain ActiveLaneMask terminated exiting blocks, so that the assert in computeBestVF that calls planContainsDifferentCompares will do the check BestFactor.Width == LegacyVF.Width? If so then possibly we could, though I don't think it would make a difference as I haven't found an example where planContainsAdditionalSimplifications doesn't also return true (which it will do because the cmp in the legacy cost model doesn't correspond to anything in the vplan).