
Conversation

Contributor

@lukel97 lukel97 commented Sep 15, 2025

In 531.deepsjeng_r from SPEC CPU 2017 there's a loop that we unprofitably loop vectorize on RISC-V.

The loop looks something like:

  for (int i = 0; i < n; i++) {
    if (x0[i] == a)
      if (x1[i] == b)
        if (x2[i] == c)
          // do stuff...
  }

Because it's so deeply nested, the innermost level of the loop rarely gets executed. However, we still deem it profitable to vectorize, which due to the if-conversion means we now always execute the body.

This stems from the fact that getPredBlockCostDivisor currently assumes, as a heuristic, that predicated blocks have a 50% chance of being executed.

We can fix this by using BlockFrequencyInfo, which gives the more accurate estimate that the innermost block is executed only 12.5% of the time (0.5^3 for the three nested conditions). We can then compute the cost divisor as HeaderFrequency / BlockFrequency, i.e. the reciprocal of the block's execution probability.
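
A minimal sketch of the divisor computation follows (it mirrors the getPredBlockCostDivisor change in the diff below; the free-function name is illustrative, and the real patch also returns 1 for code-size costs):

  #include "llvm/Analysis/BlockFrequencyInfo.h"
  #include "llvm/Analysis/LoopInfo.h"
  using namespace llvm;

  // Sketch: the divisor is the reciprocal of the predicated block's execution
  // probability, derived from BlockFrequencyInfo instead of a fixed 50% guess.
  static unsigned predBlockCostDivisor(const BlockFrequencyInfo &BFI,
                                       const Loop &L, const BasicBlock &BB) {
    uint64_t HeaderFreq = BFI.getBlockFreq(L.getHeader()).getFrequency();
    uint64_t BBFreq = BFI.getBlockFreq(&BB).getFrequency();
    // With three independent 50% branches the innermost block runs 1/8th of
    // the time, so it should contribute only 1/8 of its cost.
    return BBFreq ? HeaderFreq / BBFreq : 1; // zero guard only for this sketch
  }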

Fixing the cost here gives a 7% speedup for 531.deepsjeng_r on RISC-V.

Whilst there are a lot of changes in the in-tree tests, this doesn't affect llvm-test-suite or SPEC CPU 2017 much:

  • On armv9-a -flto -O3 there's 0.0%/0.2% more geomean loops vectorized on llvm-test-suite/SPEC CPU 2017.
  • On x86-64 -flto -O3 with PGO there's 0.9%/0% less geomean loops vectorized on llvm-test-suite/SPEC CPU 2017.

Overall geomean compile time impact is 0.03% on stage1-ReleaseLTO: https://llvm-compile-time-tracker.com/compare.php?from=9eee396c58d2e24beb93c460141170def328776d&to=32fbff48f965d03b51549fdf9bbc4ca06473b623&stat=instructions%3Au

Member

llvmbot commented Sep 15, 2025

@llvm/pr-subscribers-vectorizers
@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-risc-v

Author: Luke Lau (lukel97)

Changes

In 531.deepsjeng_r on SPEC CPU 2017 there's a loop that we unprofitably loop vectorize.

The loop looks something like:

  for (int i = 0; i < n; i++) {
    if (x0[i] == a)
      if (x1[i] == b)
        if (x2[i] == c)
          // do stuff...
  }

Because it's so deeply nested, the innermost level of the loop rarely gets executed. However, we still deem it profitable to vectorize, which due to the if-conversion means we now always execute the body.

This stems from the fact that getPredBlockCostDivisor currently assumes, as a heuristic, that predicated blocks have a 50% chance of being executed.

We can fix this by using BlockFrequencyInfo, which gives the more accurate estimate that the innermost block is executed only 12.5% of the time (0.5^3 for the three nested conditions). We can then compute the cost divisor as HeaderFrequency / BlockFrequency, i.e. the reciprocal of the block's execution probability.

Fixing the cost here gives a 7% speedup for 531.deepsjeng_r on RISC-V.

This is a WIP, as there is an issue with BlockFrequencyInfo not being invalidated with LTO, which leads to incorrect results; #157888 aims to fix that. I'm posting this early to give a motivating example that illustrates the issue in llvm/test/Transforms/PhaseOrdering/loop-vectorize-bfi.ll

There are also more test changes that I haven't included for now.


Patch is 24.56 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/158690.diff

6 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+41-9)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.cpp (+3-1)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanHelpers.h (+3-15)
  • (added) llvm/test/Transforms/LoopVectorize/AArch64/sve-predicated-costs.ll (+157)
  • (added) llvm/test/Transforms/LoopVectorize/RISCV/predicated-costs.ll (+125)
  • (added) llvm/test/Transforms/PhaseOrdering/loop-vectorize-bfi.ll (+34)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 640a98c622f80..088c35b23563d 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -578,8 +578,10 @@ class InnerLoopVectorizer {
   /// The profitablity analysis.
   LoopVectorizationCostModel *Cost;
 
-  /// BFI and PSI are used to check for profile guided size optimizations.
+  /// Used to calculate the probability of predicated blocks in
+  /// getPredBlockCostDivisor.
   BlockFrequencyInfo *BFI;
+  /// Used to check for profile guided size optimizations.
   ProfileSummaryInfo *PSI;
 
   /// Structure to hold information about generated runtime checks, responsible
@@ -900,7 +902,7 @@ class LoopVectorizationCostModel {
                              InterleavedAccessInfo &IAI,
                              ProfileSummaryInfo *PSI, BlockFrequencyInfo *BFI)
       : ScalarEpilogueStatus(SEL), TheLoop(L), PSE(PSE), LI(LI), Legal(Legal),
-        TTI(TTI), TLI(TLI), DB(DB), AC(AC), ORE(ORE), TheFunction(F),
+        TTI(TTI), TLI(TLI), DB(DB), AC(AC), ORE(ORE), BFI(BFI), TheFunction(F),
         Hints(Hints), InterleaveInfo(IAI) {
     if (TTI.supportsScalableVectors() || ForceTargetSupportsScalableVectors)
       initializeVScaleForTuning();
@@ -1249,6 +1251,17 @@ class LoopVectorizationCostModel {
   /// Superset of instructions that return true for isScalarWithPredication.
   bool isPredicatedInst(Instruction *I) const;
 
+  /// A helper function that returns how much we should divide the cost of a
+  /// predicated block by. Typically this is the reciprocal of the block
+  /// probability, i.e. if we return X we are assuming the predicated block will
+  /// execute once for every X iterations of the loop header so the block should
+  /// only contribute 1/X of its cost to the total cost calculation, but when
+  /// optimizing for code size it will just be 1 as code size costs don't depend
+  /// on execution probabilities.
+  inline unsigned
+  getPredBlockCostDivisor(TargetTransformInfo::TargetCostKind CostKind,
+                          const BasicBlock *BB) const;
+
   /// Return the costs for our two available strategies for lowering a
   /// div/rem operation which requires speculating at least one lane.
   /// First result is for scalarization (will be invalid for scalable
@@ -1711,6 +1724,8 @@ class LoopVectorizationCostModel {
   /// Interface to emit optimization remarks.
   OptimizationRemarkEmitter *ORE;
 
+  const BlockFrequencyInfo *BFI;
+
   const Function *TheFunction;
 
   /// Loop Vectorize Hint.
@@ -2863,6 +2878,19 @@ bool LoopVectorizationCostModel::isPredicatedInst(Instruction *I) const {
   }
 }
 
+unsigned LoopVectorizationCostModel::getPredBlockCostDivisor(
+    TargetTransformInfo::TargetCostKind CostKind, const BasicBlock *BB) const {
+  if (CostKind == TTI::TCK_CodeSize)
+    return 1;
+
+  uint64_t HeaderFreq = BFI->getBlockFreq(TheLoop->getHeader()).getFrequency();
+  uint64_t BBFreq = BFI->getBlockFreq(BB).getFrequency();
+  assert(HeaderFreq >= BBFreq &&
+         "Header has smaller block freq than dominated BB?");
+  return BFI->getBlockFreq(TheLoop->getHeader()).getFrequency() /
+         BFI->getBlockFreq(BB).getFrequency();
+}
+
 std::pair<InstructionCost, InstructionCost>
 LoopVectorizationCostModel::getDivRemSpeculationCost(Instruction *I,
                                                     ElementCount VF) const {
@@ -2899,7 +2927,8 @@ LoopVectorizationCostModel::getDivRemSpeculationCost(Instruction *I,
     // Scale the cost by the probability of executing the predicated blocks.
     // This assumes the predicated block for each vector lane is equally
     // likely.
-    ScalarizationCost = ScalarizationCost / getPredBlockCostDivisor(CostKind);
+    ScalarizationCost =
+        ScalarizationCost / getPredBlockCostDivisor(CostKind, I->getParent());
   }
   InstructionCost SafeDivisorCost = 0;
 
@@ -5031,7 +5060,7 @@ InstructionCost LoopVectorizationCostModel::computePredInstDiscount(
       }
 
     // Scale the total scalar cost by block probability.
-    ScalarCost /= getPredBlockCostDivisor(CostKind);
+    ScalarCost /= getPredBlockCostDivisor(CostKind, PredInst->getParent());
 
     // Compute the discount. A non-negative discount means the vector version
     // of the instruction costs more, and scalarizing would be beneficial.
@@ -5084,7 +5113,7 @@ InstructionCost LoopVectorizationCostModel::expectedCost(ElementCount VF) {
     // cost by the probability of executing it. blockNeedsPredication from
     // Legal is used so as to not include all blocks in tail folded loops.
     if (VF.isScalar() && Legal->blockNeedsPredication(BB))
-      BlockCost /= getPredBlockCostDivisor(CostKind);
+      BlockCost /= getPredBlockCostDivisor(CostKind, BB);
 
     Cost += BlockCost;
   }
@@ -5162,7 +5191,7 @@ LoopVectorizationCostModel::getMemInstScalarizationCost(Instruction *I,
   // conditional branches, but may not be executed for each vector lane. Scale
   // the cost by the probability of executing the predicated block.
   if (isPredicatedInst(I)) {
-    Cost /= getPredBlockCostDivisor(CostKind);
+    Cost /= getPredBlockCostDivisor(CostKind, I->getParent());
 
     // Add the cost of an i1 extract and a branch
     auto *VecI1Ty =
@@ -6710,6 +6739,11 @@ bool VPCostContext::skipCostComputation(Instruction *UI, bool IsVector) const {
          SkipCostComputation.contains(UI);
 }
 
+unsigned VPCostContext::getPredBlockCostDivisor(
+    TargetTransformInfo::TargetCostKind CostKind, const BasicBlock *BB) const {
+  return CM.getPredBlockCostDivisor(CostKind, BB);
+}
+
 InstructionCost
 LoopVectorizationPlanner::precomputeCosts(VPlan &Plan, ElementCount VF,
                                           VPCostContext &CostCtx) const {
@@ -10273,9 +10307,7 @@ PreservedAnalyses LoopVectorizePass::run(Function &F,
 
   auto &MAMProxy = AM.getResult<ModuleAnalysisManagerFunctionProxy>(F);
   PSI = MAMProxy.getCachedResult<ProfileSummaryAnalysis>(*F.getParent());
-  BFI = nullptr;
-  if (PSI && PSI->hasProfileSummary())
-    BFI = &AM.getResult<BlockFrequencyAnalysis>(F);
+  BFI = &AM.getResult<BlockFrequencyAnalysis>(F);
   LoopVectorizeResult Result = runImpl(F);
   if (!Result.MadeAnyChange)
     return PreservedAnalyses::all();
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index 30a3a01ddd949..e2dffdd129c24 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -855,7 +855,9 @@ InstructionCost VPRegionBlock::cost(ElementCount VF, VPCostContext &Ctx) {
   // For the scalar case, we may not always execute the original predicated
   // block, Thus, scale the block's cost by the probability of executing it.
   if (VF.isScalar())
-    return ThenCost / getPredBlockCostDivisor(Ctx.CostKind);
+    if (auto *VPIRBB = dyn_cast<VPIRBasicBlock>(Then))
+      return ThenCost / Ctx.getPredBlockCostDivisor(Ctx.CostKind,
+                                                    VPIRBB->getIRBasicBlock());
 
   return ThenCost;
 }
diff --git a/llvm/lib/Transforms/Vectorize/VPlanHelpers.h b/llvm/lib/Transforms/Vectorize/VPlanHelpers.h
index fe59774b7c838..0e9d6e47c740d 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanHelpers.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanHelpers.h
@@ -50,21 +50,6 @@ Value *getRuntimeVF(IRBuilderBase &B, Type *Ty, ElementCount VF);
 Value *createStepForVF(IRBuilderBase &B, Type *Ty, ElementCount VF,
                        int64_t Step);
 
-/// A helper function that returns how much we should divide the cost of a
-/// predicated block by. Typically this is the reciprocal of the block
-/// probability, i.e. if we return X we are assuming the predicated block will
-/// execute once for every X iterations of the loop header so the block should
-/// only contribute 1/X of its cost to the total cost calculation, but when
-/// optimizing for code size it will just be 1 as code size costs don't depend
-/// on execution probabilities.
-///
-/// TODO: We should use actual block probability here, if available. Currently,
-///       we always assume predicated blocks have a 50% chance of executing.
-inline unsigned
-getPredBlockCostDivisor(TargetTransformInfo::TargetCostKind CostKind) {
-  return CostKind == TTI::TCK_CodeSize ? 1 : 2;
-}
-
 /// A range of powers-of-2 vectorization factors with fixed start and
 /// adjustable end. The range includes start and excludes end, e.g.,:
 /// [1, 16) = {1, 2, 4, 8}
@@ -378,6 +363,9 @@ struct VPCostContext {
   InstructionCost getScalarizationOverhead(Type *ResultTy,
                                            ArrayRef<const VPValue *> Operands,
                                            ElementCount VF);
+
+  unsigned getPredBlockCostDivisor(TargetTransformInfo::TargetCostKind CostKind,
+                                   const BasicBlock *BB) const;
 };
 
 /// This class can be used to assign names to VPValues. For VPValues without
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/sve-predicated-costs.ll b/llvm/test/Transforms/LoopVectorize/AArch64/sve-predicated-costs.ll
new file mode 100644
index 0000000000000..e683f19b6effa
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/sve-predicated-costs.ll
@@ -0,0 +1,157 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals none --version 5
+; RUN: opt -p loop-vectorize -mtriple=aarch64 -mattr=+sve -S %s | FileCheck %s
+
+define void @nested(ptr noalias %p0, ptr noalias %p1, i1 %c0, i1 %c1) {
+; CHECK-LABEL: define void @nested(
+; CHECK-SAME: ptr noalias [[P0:%.*]], ptr noalias [[P1:%.*]], i1 [[C0:%.*]], i1 [[C1:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[X:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LATCH:.*]] ]
+; CHECK-NEXT:    br i1 [[C0]], label %[[THEN_0:.*]], label %[[LATCH]]
+; CHECK:       [[THEN_0]]:
+; CHECK-NEXT:    br i1 [[C1]], label %[[THEN_1:.*]], label %[[LATCH]]
+; CHECK:       [[THEN_1]]:
+; CHECK-NEXT:    [[GEP0:%.*]] = getelementptr i64, ptr [[P0]], i32 [[X]]
+; CHECK-NEXT:    [[X1:%.*]] = load i64, ptr [[GEP0]], align 8
+; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr i64, ptr [[P1]], i32 [[X]]
+; CHECK-NEXT:    [[Y:%.*]] = load i64, ptr [[GEP1]], align 8
+; CHECK-NEXT:    [[Z:%.*]] = udiv i64 [[X1]], [[Y]]
+; CHECK-NEXT:    store i64 [[Z]], ptr [[GEP1]], align 8
+; CHECK-NEXT:    br label %[[LATCH]]
+; CHECK:       [[LATCH]]:
+; CHECK-NEXT:    [[IV_NEXT]] = add i32 [[X]], 1
+; CHECK-NEXT:    [[DONE:%.*]] = icmp eq i32 [[IV_NEXT]], 1024
+; CHECK-NEXT:    br i1 [[DONE]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i32 [ 0, %entry ], [ %iv.next, %latch ]
+  br i1 %c0, label %then.0, label %latch
+
+then.0:
+  br i1 %c1, label %then.1, label %latch
+
+then.1:
+  %gep0 = getelementptr i64, ptr %p0, i32 %iv
+  %x = load i64, ptr %gep0
+  %gep1 = getelementptr i64, ptr %p1, i32 %iv
+  %y = load i64, ptr %gep1
+  %z = udiv i64 %x, %y
+  store i64 %z, ptr %gep1
+  br label %latch
+
+latch:
+  %iv.next = add i32 %iv, 1
+  %done = icmp eq i32 %iv.next, 1024
+  br i1 %done, label %exit, label %loop
+
+exit:
+  ret void
+}
+
+define void @always_taken(ptr noalias %p0, ptr noalias %p1, i1 %c0, i1 %c1) {
+; CHECK-LABEL: define void @always_taken(
+; CHECK-SAME: ptr noalias [[P0:%.*]], ptr noalias [[P1:%.*]], i1 [[C0:%.*]], i1 [[C1:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[TMP0:%.*]] = call i32 @llvm.vscale.i32()
+; CHECK-NEXT:    [[TMP1:%.*]] = shl nuw i32 [[TMP0]], 2
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 1024, [[TMP1]]
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[TMP4:%.*]] = call i32 @llvm.vscale.i32()
+; CHECK-NEXT:    [[TMP5:%.*]] = mul nuw i32 [[TMP4]], 4
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i32 1024, [[TMP5]]
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i32 1024, [[N_MOD_VF]]
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 2 x i1> poison, i1 [[C1]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 2 x i1> [[BROADCAST_SPLATINSERT]], <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <vscale x 2 x i1> poison, i1 [[C0]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT2:%.*]] = shufflevector <vscale x 2 x i1> [[BROADCAST_SPLATINSERT1]], <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP6:%.*]] = select <vscale x 2 x i1> [[BROADCAST_SPLAT2]], <vscale x 2 x i1> [[BROADCAST_SPLAT]], <vscale x 2 x i1> zeroinitializer
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[TMP10:%.*]] = getelementptr i64, ptr [[P0]], i32 [[INDEX]]
+; CHECK-NEXT:    [[TMP8:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP7:%.*]] = shl nuw i64 [[TMP8]], 1
+; CHECK-NEXT:    [[TMP20:%.*]] = getelementptr i64, ptr [[TMP10]], i64 [[TMP7]]
+; CHECK-NEXT:    [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0(ptr [[TMP10]], i32 8, <vscale x 2 x i1> [[TMP6]], <vscale x 2 x i64> poison)
+; CHECK-NEXT:    [[WIDE_MASKED_LOAD3:%.*]] = call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0(ptr [[TMP20]], i32 8, <vscale x 2 x i1> [[TMP6]], <vscale x 2 x i64> poison)
+; CHECK-NEXT:    [[TMP9:%.*]] = getelementptr i64, ptr [[P1]], i32 [[INDEX]]
+; CHECK-NEXT:    [[TMP13:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP11:%.*]] = shl nuw i64 [[TMP13]], 1
+; CHECK-NEXT:    [[TMP12:%.*]] = getelementptr i64, ptr [[TMP9]], i64 [[TMP11]]
+; CHECK-NEXT:    [[WIDE_MASKED_LOAD4:%.*]] = call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0(ptr [[TMP9]], i32 8, <vscale x 2 x i1> [[TMP6]], <vscale x 2 x i64> poison)
+; CHECK-NEXT:    [[WIDE_MASKED_LOAD5:%.*]] = call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0(ptr [[TMP12]], i32 8, <vscale x 2 x i1> [[TMP6]], <vscale x 2 x i64> poison)
+; CHECK-NEXT:    [[TMP21:%.*]] = select <vscale x 2 x i1> [[TMP6]], <vscale x 2 x i64> [[WIDE_MASKED_LOAD4]], <vscale x 2 x i64> splat (i64 1)
+; CHECK-NEXT:    [[TMP14:%.*]] = select <vscale x 2 x i1> [[TMP6]], <vscale x 2 x i64> [[WIDE_MASKED_LOAD5]], <vscale x 2 x i64> splat (i64 1)
+; CHECK-NEXT:    [[TMP15:%.*]] = udiv <vscale x 2 x i64> [[WIDE_MASKED_LOAD]], [[TMP21]]
+; CHECK-NEXT:    [[TMP22:%.*]] = udiv <vscale x 2 x i64> [[WIDE_MASKED_LOAD3]], [[TMP14]]
+; CHECK-NEXT:    [[TMP17:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP18:%.*]] = shl nuw i64 [[TMP17]], 1
+; CHECK-NEXT:    [[TMP19:%.*]] = getelementptr i64, ptr [[TMP9]], i64 [[TMP18]]
+; CHECK-NEXT:    call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[TMP15]], ptr [[TMP9]], i32 8, <vscale x 2 x i1> [[TMP6]])
+; CHECK-NEXT:    call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[TMP22]], ptr [[TMP19]], i32 8, <vscale x 2 x i1> [[TMP6]])
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], [[TMP5]]
+; CHECK-NEXT:    [[TMP16:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP16]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i32 1024, [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i32 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[IV1:%.*]] = phi i32 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[IV_NEXT1:%.*]], %[[LATCH:.*]] ]
+; CHECK-NEXT:    br i1 [[C0]], label %[[THEN_0:.*]], label %[[LATCH]], !prof [[PROF3:![0-9]+]]
+; CHECK:       [[THEN_0]]:
+; CHECK-NEXT:    br i1 [[C1]], label %[[THEN_1:.*]], label %[[LATCH]], !prof [[PROF3]]
+; CHECK:       [[THEN_1]]:
+; CHECK-NEXT:    [[GEP0:%.*]] = getelementptr i64, ptr [[P0]], i32 [[IV1]]
+; CHECK-NEXT:    [[X:%.*]] = load i64, ptr [[GEP0]], align 8
+; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr i64, ptr [[P1]], i32 [[IV1]]
+; CHECK-NEXT:    [[Y:%.*]] = load i64, ptr [[GEP1]], align 8
+; CHECK-NEXT:    [[Z:%.*]] = udiv i64 [[X]], [[Y]]
+; CHECK-NEXT:    store i64 [[Z]], ptr [[GEP1]], align 8
+; CHECK-NEXT:    br label %[[LATCH]]
+; CHECK:       [[LATCH]]:
+; CHECK-NEXT:    [[IV_NEXT1]] = add i32 [[IV1]], 1
+; CHECK-NEXT:    [[DONE:%.*]] = icmp eq i32 [[IV_NEXT1]], 1024
+; CHECK-NEXT:    br i1 [[DONE]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP4:![0-9]+]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i32 [ 0, %entry ], [ %iv.next, %latch ]
+  br i1 %c0, label %then.0, label %latch, !prof !4
+
+then.0:
+  br i1 %c1, label %then.1, label %latch, !prof !4
+
+then.1:
+  %gep0 = getelementptr i64, ptr %p0, i32 %iv
+  %x = load i64, ptr %gep0
+  %gep1 = getelementptr i64, ptr %p1, i32 %iv
+  %y = load i64, ptr %gep1
+  %z = udiv i64 %x, %y
+  store i64 %z, ptr %gep1
+  br label %latch
+
+latch:
+  %iv.next = add i32 %iv, 1
+  %done = icmp eq i32 %iv.next, 1024
+  br i1 %done, label %exit, label %loop
+
+exit:
+  ret void
+}
+
+!4 = !{!"branch_weights", i32 1, i32 0}
+
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/predicated-costs.ll b/llvm/test/Transforms/LoopVectorize/RISCV/predicated-costs.ll
new file mode 100644
index 0000000000000..23f2624a207b3
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/predicated-costs.ll
@@ -0,0 +1,125 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals none --version 5
+; RUN: opt < %s -S -p loop-vectorize -mtriple=riscv64 -mattr=+v | FileCheck %s
+
+define void @nested(ptr noalias %p0, ptr noalias %p1, i1 %c0, i1 %c1) {
+; CHECK-LABEL: define void @nested(
+; CHECK-SAME: ptr noalias [[P0:%.*]], ptr noalias [[P1:%.*]], i1 [[C0:%.*]], i1 [[C1:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[IV1:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LATCH:.*]] ]
+; CHECK-NEXT:    br i1 [[C0]], label %[[THEN_0:.*]], label %[[LATCH]]
+; CHECK:       [[THEN_0]]:
+; CHECK-NEXT:    br i1 [[C1]], label %[[THEN_1:.*]], label %[[LATCH]]
+; CHECK:       [[THEN_1]]:
+; CHECK-NEXT:    [[GEP2:%.*]] = getelementptr i32, ptr [[P0]], i32 [[IV1]]
+; CHECK-NEXT:    [[X:%.*]] = load i32, ptr [[GEP2]], align 4
+; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr i32, ptr [[P1]], i32 [[X]]
+; CHECK-NEXT:    store i32 0, ptr [[GEP1]], align 4
+; CHECK-NEXT:    br label %[[LATCH]]
+; CHECK:       [[LATCH]]:
+; CHECK-NEXT:    [[IV_NEXT]] = add i32 [[IV1]], 1
+; CHECK-NEXT:    [[DONE:%.*]] = icmp eq i32 [[IV_NEXT]], 1024
+; CHECK-NEXT:    br i1 [[DONE]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i32 [ 0, %entry ], [ %iv.next, %latch ]
+  br i1 %c0, label %then.0, label %latch
+
+then.0:
+  br i1 %c1, label %then.1, label %latch
+
+then.1:
+  %gep0 = getelementptr i32, ptr %p0, i32 %iv
+  %x = load i32, ptr %gep0
+  %gep1 = getelementptr i32, ptr %p1, i32 %x
+  store i32 0, ptr %gep1
+  br label %latch
+
+latch:
+  %iv.next = add i32 %iv, 1
+  %done = icmp eq i32 %iv.next, 1024
+  br i1 %done, label %exit, label %loop
+
+exit:
+  ret void
+}
+
+define void @always_taken(ptr noalias %p0, ptr noalias %p1, i1 %c0, i1 %c1) {
+; CHECK-LABEL: define void @always_taken(
+; CHECK-SAME: ptr noalias [[P0:%.*]], ptr noalias [[P1:%.*]], i1 [[C0:%.*]], i1 [[C1:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i1> poison, i1 [[C1]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i1> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer
+; CHECK-NEXT:    [[BROADCAST_...
[truncated]

lukel97 added a commit to lukel97/llvm-project that referenced this pull request Sep 23, 2025
I ran into this crash when llvm#158690 caused a loop with a struct call to be vectorized.

If we have a replicate recipe in a branch-on-mask predicated region that's used by a widened recipe in another block then it will be packed together with the other lanes via a VPPredInstPHIRecipe.

If we're replicating a call with a struct return type then we currently crash. The code that handles structs in packScalarIntoVectorizedValue seemed to be untested at least on test/Transforms/LoopVectorize.

There are two places that need to be fixed. The poison value that the scalar is packed into needs to use toVectorizedTy to correctly handle structs (not to be confused with toVectorTy!).

The other is that VPPredInstPHIRecipe expects its operand to be an InsertElementInstr when stringing together the different lanes. For structs this will be an InsertValueInstr, and the value for the previous lane will be at the back of a chain of InsertValueInstrs.
%var0 = getelementptr inbounds i64, ptr %a, i64 %i
%var2 = load i64, ptr %var0, align 4
%cond0 = icmp sgt i64 %var2, 0
%cond0 = icmp sgt i64 %var2, 1
Contributor Author

This constant was changed to keep the old branch probability the same and keep the block scalarized, since with BFI icmp sgt %x, 0 is predicted to be slightly > 50%.

%l = load i64, ptr %gep.src, align 1
%t = trunc i64 %l to i1
br i1 %t, label %exit.0, label %loop.latch
br i1 %t, label %exit.0, label %loop.latch, !prof !0
Contributor Author

Added to retain the old branch probability. With BFI, the probability of exit.0 being taken is computed as 3% (why that is, I'm not sure) and the IV increment isn't discounted in the VF=1 plan.

Contributor

That is weird. I'd expect trunc i64 to i1 to give a probability of 50% given it's essentially asking for the likelihood of loading an odd-numbered value!

Contributor Author

Oh, it looks like it's because this is branching to the loop exit, and BPI scales down the weight of the exiting branch by the trip count as a heuristic, which I guess makes sense:

  if (isLoopExitingEdge(Edge) &&
      // Avoid adjustment of ZERO weight since it should remain unchanged.
      Weight != static_cast<uint32_t>(BlockExecWeight::ZERO)) {
    // Scale down loop exiting weight by trip count.
    Weight = std::max(
        static_cast<uint32_t>(BlockExecWeight::LOWEST_NON_ZERO),
        Weight.value_or(static_cast<uint32_t>(BlockExecWeight::DEFAULT)) /
            TC);
  }

; AVX1-NEXT: [[TMP2:%.*]] = getelementptr inbounds i32, ptr [[TMP1]], i32 -3
; AVX1-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i32>, ptr [[TMP2]], align 4, !alias.scope [[META18:![0-9]+]]
; AVX1-NEXT: [[REVERSE:%.*]] = shufflevector <4 x i32> [[WIDE_LOAD]], <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
; AVX1-NEXT: [[TMP3:%.*]] = icmp sgt <4 x i32> [[REVERSE]], zeroinitializer
Contributor Author

BPI computes >= 0 to be slightly >50%, which influences the cost for AVX1 here.

; COST: [[LOOP_HEADER]]:
; COST-NEXT: [[PTR_IV:%.*]] = phi ptr [ [[START]], %[[ENTRY]] ], [ [[PTR_IV_NEXT:%.*]], %[[LOOP_LATCH:.*]] ]
; COST-NEXT: [[PTR_IV:%.*]] = phi ptr [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[PTR_IV_NEXT:%.*]], %[[LOOP_LATCH:.*]] ]
; COST-NEXT: [[L:%.*]] = load i64, ptr [[PTR_IV]], align 1
Contributor Author

The total probability of all these switch destinations being executed is 1. BFI correctly divides up the probabilities, but previously each destination was given a 50% chance, which meant the VF=1 VPlan was overly discounted.

; ENABLED_MASKED_STRIDED-NEXT: [[TMP8:%.*]] = extractelement <4 x i64> [[TMP2]], i64 1
; ENABLED_MASKED_STRIDED-NEXT: [[TMP9:%.*]] = getelementptr inbounds nuw i16, ptr [[POINTS]], i64 [[TMP8]]
; ENABLED_MASKED_STRIDED-NEXT: [[TMP10:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i64 1
; ENABLED_MASKED_STRIDED-NEXT: store i16 [[TMP10]], ptr [[TMP9]], align 2
Contributor Author

Predication due to tail folding no longer accidentally discounts the cost here.

@lukel97 lukel97 changed the title [WIP][VPlan] Use BlockFrequencyInfo in getPredBlockCostDivisor [VPlan] Use BlockFrequencyInfo in getPredBlockCostDivisor Sep 23, 2025
@lukel97 lukel97 marked this pull request as ready for review September 23, 2025 12:51
Contributor Author

lukel97 commented Sep 23, 2025

The invalid results being returned from BlockFrequencyInfo have been fixed in #157888, so this should be ready to review now. There is one crash to do with struct return types that #160274 fixes.

lukel97 added a commit to lukel97/llvm-project that referenced this pull request Sep 24, 2025
… blocks

Split off from llvm#158690. Currently, if an instruction needs to be predicated due to tail folding, it will also have a predicated discount applied to it in multiple places.
This is likely inaccurate because we can expect a tail folded instruction to be executed on every iteration bar the last.

This fixes it by checking if the instruction/block was originally predicated, and in doing so prevents vectorization with tail folding where we would have had to scalarize the memory op anyway.
lukel97 added a commit that referenced this pull request Sep 26, 2025
I ran into this crash when #158690 caused a loop with a struct call to
be vectorized.

If we have a replicate recipe in a branch-on-mask predicated region
that's used by a widened recipe in another block then it will be packed
together with the other lanes via a VPPredInstPHIRecipe.

If we're replicating a call with a struct return type then we currently
crash. The code that handles structs in packScalarIntoVectorizedValue
seemed to be untested at least on test/Transforms/LoopVectorize.

There are two places that need to be fixed. The poison value that the
scalar is packed into needs to use toVectorizedTy to correctly handle
structs (not to be confused with toVectorTy!)

The other is that VPPredInstPHIRecipe expects its operand to be an
InsertElementInstr when stringing together the different lanes. For
structs this will be an InsertValueInstr, and the value for the previous
lane will be at the back of a chain of InsertValueInstrs.
mahesh-attarde pushed a commit to mahesh-attarde/llvm-project that referenced this pull request Oct 3, 2025
lukel97 added a commit that referenced this pull request Nov 10, 2025
… blocks (#160449)

Split off from #158690. Currently, if an instruction needs to be predicated due
to tail folding, it will also have a predicated discount applied to it
in multiple places.
This is likely inaccurate because we can expect a tail folded
instruction to be executed on every iteration bar the last.

This fixes it by checking if the instruction/block was originally
predicated, and in doing so prevents vectorization with tail folding
where we would have had to scalarize the memory op anyway.

On llvm-test-suite this causes 4 loops in total to no longer be
vectorized with -O3 on arm64-apple-darwin, and there's no observable
performance impact.
lukel97 added a commit that referenced this pull request Nov 19, 2025
…#168697)

#158690 plans on passing BFI as a lazy lambda to avoid computing
BlockFrequencyInfo when not needed.

In preparation for that, this PR removes BFI and PSI from some
constructors that aren't used. It also consolidates the two calls to
llvm::shouldOptimizeForSize so that the result is computed once and
passed where needed.

This also renames OptForSize in LoopVectorizationLegality to clarify
that it's to prevent runtime SCEV checks, see
https://reviews.llvm.org/D68082
Contributor Author

lukel97 commented Nov 25, 2025

Ping

  uint64_t BBFreq = BFI->getBlockFreq(BB).getFrequency();
  assert(HeaderFreq >= BBFreq &&
         "Header has smaller block freq than dominated BB?");
  return HeaderFreq / BBFreq;
Contributor

If I understand correctly the header frequency should really be the sum of all block frequencies in the loop? That's because it dominates all the other blocks, so in order to get to any block in the loop it has to go through the header. If so, then this calculation looks right to me.

Contributor Author

Exactly. As an aside, I added that assert because when I first started this PR there was a bug with the pass manager that meant the header sometimes had a smaller frequency, which led to divisions by zero. It was fixed in 4663d25

Contributor

Overall the number of changes I found with the patch is relatively small, but a few instances seem slightly unprofitable now because the unsigned division truncates toward zero. An example is https://llvm.godbolt.org/z/q46vM1eh3, where the header frequency is 30 and the predicated block frequency is 20: we will return 1 and assume that the predicated block always executes in the scalar loop, and the same happens if the block frequency is 16.

Perhaps we could improve that by handling the remainder differently, e.g. returning 2 for ratios between 1.5 and 2.5 etc., or returning a double, although I'm not sure how big the changes would be for the latter.
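
For illustration, nearest-integer rounding of the ratio could look roughly like this (a sketch only; the actual change in 70298e4 may differ):

  #include <cstdint>

  // Sketch: round HeaderFreq / BBFreq to the nearest integer, so a header
  // frequency of 30 with a block frequency of 20 (ratio 1.5) yields a divisor
  // of 2 instead of 1, and likewise for a block frequency of 16 (ratio 1.875).
  static uint64_t roundedPredBlockCostDivisor(uint64_t HeaderFreq,
                                              uint64_t BBFreq) {
    if (BBFreq == 0)
      return 1; // sketch-only guard; the real code asserts HeaderFreq >= BBFreq
    uint64_t Divisor = (HeaderFreq + BBFreq / 2) / BBFreq;
    return Divisor ? Divisor : 1; // never divide a block cost by zero
  }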

Contributor Author

I've rounded the divisor now in 70298e4; this actually reduces the diff a good bit, thanks. It also fixes the test case you linked, which is added now in AArch64/predicated-costs.ll.


; AVX-NEXT: [[TMP17:%.*]] = or <8 x i1> [[TMP9]], [[TMP13]]
; AVX-NEXT: [[TMP18:%.*]] = or <8 x i1> [[TMP10]], [[TMP14]]
; AVX-NEXT: [[TMP19:%.*]] = or <8 x i1> [[TMP11]], [[TMP15]]
; AVX-NEXT: tail call void @llvm.masked.store.v8i32.p0(<8 x i32> splat (i32 42), ptr align 4 [[NEXT_GEP]], <8 x i1> [[TMP16]])
Contributor

I'm very surprised we now start vectorising this loop. It seems like a mistake to me and likely a regression. The chance of any store taking place is tiny, i.e. 1 out of every UINT_MAX - see the code:

  %val = load i32, ptr %ptr, align 4
  %c1 = icmp eq i32 %val, 13
  %c2 = icmp eq i32 %val, -12
  %c3 = or i1 %c1, %c2
  br i1 %c3, label %store, label %latch

It would be good to find out what's wrong here.

Contributor Author

@lukel97 lukel97 Nov 25, 2025

Good point, looks like this test gets converted to a switch first, and then BranchProbabilityInfo assigns each possible successor of the switch an equal weight here, so the probability of store being taken is 66.6%.

FWIW, without BFI the probability would previously have been 50%, and it looks like we were vectorizing it with AVX2 anyway. So it seems likely it's just the total scalar cost being nudged past the profitability threshold.

aadeshps-mcw pushed a commit to aadeshps-mcw/llvm-project that referenced this pull request Nov 26, 2025
…llvm#168697)

Priyanshu3820 pushed a commit to Priyanshu3820/llvm-project that referenced this pull request Nov 26, 2025
…llvm#168697)

Contributor Author

lukel97 commented Dec 1, 2025

Ping. Since the last review I've gone ahead and removed more of the test diffs by adding explicit branch weights or changing icmp cond %x, 0 to icmp cond %x, 1, which gets BPI to produce 50/50 weights.

Contributor

@fhahn fhahn left a comment

Thanks for the latest updates! I will check the impact on AArch64 today to make sure there are no surprises.

The PR description still mentions +0.3% compile-time impact, but IIUC with the latest changes it is much lower, right? Would be good to update

Contributor Author

lukel97 commented Dec 1, 2025

The PR description still mentions +0.3% compile-time impact, but IIUC with the latest changes it is much lower, right? Would be good to update

Woops, that was missing a zero. Fixed now to 0.03%.

@david-arm
Contributor

Thanks for the latest updates! I will check the impact on AArch64 today to make sure there are no surprises.

I will run some tests too!

Comment on lines 10360 to 10363
  GetBFI = [&AM, &F, &BFI]() -> BlockFrequencyInfo & {
    if (!BFI)
      BFI = &AM.getResult<BlockFrequencyAnalysis>(F);
    return *BFI;
Contributor

I think the main saving would come from avoiding the indirect call to the lambda in many cases.

Contributor Author

Oh sorry, I missed your comment about the ::getBFI method. Are you suggesting that we should implement that in LoopVectorizationCostModel + LoopVectorizer? There are two places where we potentially need to fetch BFI.

Contributor

IIRC in LoopVectorize we only call it once to determine if we're optimizing for size, so calling the lambda there shouldn't make a difference.

Contributor Author

Added a method in 55cb35b; unfortunately it requires removing const from some methods since we're setting a member. I also tried passing through a functor in 6d7c004, but that was also kind of hairy. Let me know which you prefer.
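
For reference, the cached-getter shape being discussed might look roughly like this (hypothetical names; the actual commits 55cb35b / 6d7c004 may differ, e.g. in whether a mutable member is used to keep callers const):

  #include <functional>
  #include <utility>

  namespace llvm {
  class BlockFrequencyInfo;
  } // namespace llvm

  // Sketch: the pass hands the cost model a callback that computes BFI on
  // demand; the cost model caches the result so BlockFrequencyAnalysis only
  // runs if a predicated block cost is actually queried.
  class CostModelSketch {
    std::function<llvm::BlockFrequencyInfo &()> GetBFI; // set by the pass
    llvm::BlockFrequencyInfo *CachedBFI = nullptr;

  public:
    explicit CostModelSketch(std::function<llvm::BlockFrequencyInfo &()> GetBFI)
        : GetBFI(std::move(GetBFI)) {}

    // Non-const because it caches into a member; this is why some callers lose
    // their const qualifier with the member-function approach.
    llvm::BlockFrequencyInfo &getBFI() {
      if (!CachedBFI)
        CachedBFI = &GetBFI(); // first use triggers the analysis
      return *CachedBFI;
    }
  };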

