[VPlan] Don't apply predication discount to non-originally-predicated blocks #160449
Conversation
Split off from llvm#158690. Currently, if an instruction needs to be predicated due to tail folding, it will also have a predication discount applied to it in multiple places. This is likely inaccurate because we can expect a tail-folded instruction to be executed on every iteration bar the last. This fixes it by checking whether the instruction/block was originally predicated, and in doing so prevents vectorization with tail folding where we would have had to scalarize the memory op anyway.
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-vectorizers

Author: Luke Lau (lukel97)

Changes

Split off from #158690. Currently, if an instruction needs to be predicated due to tail folding, it will also have a predication discount applied to it in multiple places. This fixes it by checking whether the instruction/block was originally predicated, and in doing so prevents vectorization with tail folding where we would have had to scalarize the memory op anyway.

Patch is 56.89 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/160449.diff

9 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index ca092dcfcb492..c0c2063ca81b8 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -1249,6 +1249,25 @@ class LoopVectorizationCostModel {
/// Superset of instructions that return true for isScalarWithPredication.
bool isPredicatedInst(Instruction *I) const;
+ /// A helper function that returns how much we should divide the cost of a
+ /// predicated block by. Typically this is the reciprocal of the block
+ /// probability, i.e. if we return X we are assuming the predicated block will
+ /// execute once for every X iterations of the loop header so the block should
+ /// only contribute 1/X of its cost to the total cost calculation, but when
+ /// optimizing for code size it will just be 1 as code size costs don't depend
+ /// on execution probabilities.
+ ///
+ /// TODO: We should use actual block probability here, if available.
+ /// Currently, we always assume predicated blocks have a 50% chance of
+ /// executing.
+ inline unsigned
+ getPredBlockCostDivisor(TargetTransformInfo::TargetCostKind CostKind,
+ BasicBlock *BB) const {
+ if (!Legal->blockNeedsPredication(BB))
+ return 1;
+ return CostKind == TTI::TCK_CodeSize ? 1 : 2;
+ }
+
/// Return the costs for our two available strategies for lowering a
/// div/rem operation which requires speculating at least one lane.
/// First result is for scalarization (will be invalid for scalable
@@ -2902,7 +2921,8 @@ LoopVectorizationCostModel::getDivRemSpeculationCost(Instruction *I,
// Scale the cost by the probability of executing the predicated blocks.
// This assumes the predicated block for each vector lane is equally
// likely.
- ScalarizationCost = ScalarizationCost / getPredBlockCostDivisor(CostKind);
+ ScalarizationCost =
+ ScalarizationCost / getPredBlockCostDivisor(CostKind, I->getParent());
}
InstructionCost SafeDivisorCost = 0;
@@ -5035,7 +5055,7 @@ InstructionCost LoopVectorizationCostModel::computePredInstDiscount(
}
// Scale the total scalar cost by block probability.
- ScalarCost /= getPredBlockCostDivisor(CostKind);
+ ScalarCost /= getPredBlockCostDivisor(CostKind, I->getParent());
// Compute the discount. A non-negative discount means the vector version
// of the instruction costs more, and scalarizing would be beneficial.
@@ -5088,7 +5108,7 @@ InstructionCost LoopVectorizationCostModel::expectedCost(ElementCount VF) {
// cost by the probability of executing it. blockNeedsPredication from
// Legal is used so as to not include all blocks in tail folded loops.
if (VF.isScalar() && Legal->blockNeedsPredication(BB))
- BlockCost /= getPredBlockCostDivisor(CostKind);
+ BlockCost /= getPredBlockCostDivisor(CostKind, BB);
Cost += BlockCost;
}
@@ -5167,7 +5187,7 @@ LoopVectorizationCostModel::getMemInstScalarizationCost(Instruction *I,
// conditional branches, but may not be executed for each vector lane. Scale
// the cost by the probability of executing the predicated block.
if (isPredicatedInst(I)) {
- Cost /= getPredBlockCostDivisor(CostKind);
+ Cost /= getPredBlockCostDivisor(CostKind, I->getParent());
// Add the cost of an i1 extract and a branch
auto *VecI1Ty =
@@ -6727,6 +6747,11 @@ bool VPCostContext::skipCostComputation(Instruction *UI, bool IsVector) const {
SkipCostComputation.contains(UI);
}
+unsigned VPCostContext::getPredBlockCostDivisor(
+ TargetTransformInfo::TargetCostKind CostKind, BasicBlock *BB) const {
+ return CM.getPredBlockCostDivisor(CostKind, BB);
+}
+
InstructionCost
LoopVectorizationPlanner::precomputeCosts(VPlan &Plan, ElementCount VF,
VPCostContext &CostCtx) const {
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index a1c6f7977885f..e3b0c2bff9d02 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -855,7 +855,9 @@ InstructionCost VPRegionBlock::cost(ElementCount VF, VPCostContext &Ctx) {
// For the scalar case, we may not always execute the original predicated
// block, Thus, scale the block's cost by the probability of executing it.
if (VF.isScalar())
- return ThenCost / getPredBlockCostDivisor(Ctx.CostKind);
+ if (auto *VPIRBB = dyn_cast<VPIRBasicBlock>(Then))
+ return ThenCost / Ctx.getPredBlockCostDivisor(Ctx.CostKind,
+ VPIRBB->getIRBasicBlock());
return ThenCost;
}
diff --git a/llvm/lib/Transforms/Vectorize/VPlanHelpers.h b/llvm/lib/Transforms/Vectorize/VPlanHelpers.h
index fe59774b7c838..fe082851ca00c 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanHelpers.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanHelpers.h
@@ -50,21 +50,6 @@ Value *getRuntimeVF(IRBuilderBase &B, Type *Ty, ElementCount VF);
Value *createStepForVF(IRBuilderBase &B, Type *Ty, ElementCount VF,
int64_t Step);
-/// A helper function that returns how much we should divide the cost of a
-/// predicated block by. Typically this is the reciprocal of the block
-/// probability, i.e. if we return X we are assuming the predicated block will
-/// execute once for every X iterations of the loop header so the block should
-/// only contribute 1/X of its cost to the total cost calculation, but when
-/// optimizing for code size it will just be 1 as code size costs don't depend
-/// on execution probabilities.
-///
-/// TODO: We should use actual block probability here, if available. Currently,
-/// we always assume predicated blocks have a 50% chance of executing.
-inline unsigned
-getPredBlockCostDivisor(TargetTransformInfo::TargetCostKind CostKind) {
- return CostKind == TTI::TCK_CodeSize ? 1 : 2;
-}
-
/// A range of powers-of-2 vectorization factors with fixed start and
/// adjustable end. The range includes start and excludes end, e.g.,:
/// [1, 16) = {1, 2, 4, 8}
@@ -364,6 +349,10 @@ struct VPCostContext {
/// has already been pre-computed.
bool skipCostComputation(Instruction *UI, bool IsVector) const;
+ /// \returns how much the cost of a predicated block should be divided by.
+ unsigned getPredBlockCostDivisor(TargetTransformInfo::TargetCostKind CostKind,
+ BasicBlock *BB) const;
+
/// Returns the OperandInfo for \p V, if it is a live-in.
TargetTransformInfo::OperandValueInfo getOperandInfo(VPValue *V) const;
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index aa3de3613b68e..2e77b75b16e47 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -3170,7 +3170,7 @@ InstructionCost VPReplicateRecipe::computeCost(ElementCount VF,
// Scale the cost by the probability of executing the predicated blocks.
// This assumes the predicated block for each vector lane is equally
// likely.
- ScalarCost /= getPredBlockCostDivisor(Ctx.CostKind);
+ ScalarCost /= Ctx.getPredBlockCostDivisor(Ctx.CostKind, UI->getParent());
return ScalarCost;
}
case Instruction::Load:
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll b/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
index e4ee6776ae24c..790e1d20b6ec1 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
@@ -612,63 +612,18 @@ define void @low_trip_count_fold_tail_scalarized_store(ptr %dst) {
;
; COMMON-LABEL: define void @low_trip_count_fold_tail_scalarized_store(
; COMMON-SAME: ptr [[DST:%.*]]) {
-; COMMON-NEXT: [[ENTRY:.*:]]
-; COMMON-NEXT: br label %[[VECTOR_PH:.*]]
-; COMMON: [[VECTOR_PH]]:
-; COMMON-NEXT: br label %[[VECTOR_BODY:.*]]
-; COMMON: [[VECTOR_BODY]]:
-; COMMON-NEXT: br i1 true, label %[[PRED_STORE_IF:.*]], label %[[PRED_STORE_CONTINUE:.*]]
-; COMMON: [[PRED_STORE_IF]]:
-; COMMON-NEXT: [[TMP0:%.*]] = getelementptr i8, ptr [[DST]], i64 0
-; COMMON-NEXT: store i8 0, ptr [[TMP0]], align 1
-; COMMON-NEXT: br label %[[PRED_STORE_CONTINUE]]
-; COMMON: [[PRED_STORE_CONTINUE]]:
-; COMMON-NEXT: br i1 true, label %[[PRED_STORE_IF1:.*]], label %[[PRED_STORE_CONTINUE2:.*]]
-; COMMON: [[PRED_STORE_IF1]]:
-; COMMON-NEXT: [[TMP1:%.*]] = getelementptr i8, ptr [[DST]], i64 1
-; COMMON-NEXT: store i8 1, ptr [[TMP1]], align 1
-; COMMON-NEXT: br label %[[PRED_STORE_CONTINUE2]]
-; COMMON: [[PRED_STORE_CONTINUE2]]:
-; COMMON-NEXT: br i1 true, label %[[PRED_STORE_IF3:.*]], label %[[PRED_STORE_CONTINUE4:.*]]
-; COMMON: [[PRED_STORE_IF3]]:
-; COMMON-NEXT: [[TMP2:%.*]] = getelementptr i8, ptr [[DST]], i64 2
-; COMMON-NEXT: store i8 2, ptr [[TMP2]], align 1
-; COMMON-NEXT: br label %[[PRED_STORE_CONTINUE4]]
-; COMMON: [[PRED_STORE_CONTINUE4]]:
-; COMMON-NEXT: br i1 true, label %[[PRED_STORE_IF5:.*]], label %[[PRED_STORE_CONTINUE6:.*]]
-; COMMON: [[PRED_STORE_IF5]]:
-; COMMON-NEXT: [[TMP3:%.*]] = getelementptr i8, ptr [[DST]], i64 3
-; COMMON-NEXT: store i8 3, ptr [[TMP3]], align 1
-; COMMON-NEXT: br label %[[PRED_STORE_CONTINUE6]]
-; COMMON: [[PRED_STORE_CONTINUE6]]:
-; COMMON-NEXT: br i1 true, label %[[PRED_STORE_IF7:.*]], label %[[PRED_STORE_CONTINUE8:.*]]
-; COMMON: [[PRED_STORE_IF7]]:
-; COMMON-NEXT: [[TMP4:%.*]] = getelementptr i8, ptr [[DST]], i64 4
-; COMMON-NEXT: store i8 4, ptr [[TMP4]], align 1
-; COMMON-NEXT: br label %[[PRED_STORE_CONTINUE8]]
-; COMMON: [[PRED_STORE_CONTINUE8]]:
-; COMMON-NEXT: br i1 true, label %[[PRED_STORE_IF9:.*]], label %[[PRED_STORE_CONTINUE10:.*]]
-; COMMON: [[PRED_STORE_IF9]]:
-; COMMON-NEXT: [[TMP5:%.*]] = getelementptr i8, ptr [[DST]], i64 5
-; COMMON-NEXT: store i8 5, ptr [[TMP5]], align 1
-; COMMON-NEXT: br label %[[PRED_STORE_CONTINUE10]]
-; COMMON: [[PRED_STORE_CONTINUE10]]:
-; COMMON-NEXT: br i1 true, label %[[PRED_STORE_IF11:.*]], label %[[PRED_STORE_CONTINUE12:.*]]
-; COMMON: [[PRED_STORE_IF11]]:
-; COMMON-NEXT: [[TMP6:%.*]] = getelementptr i8, ptr [[DST]], i64 6
-; COMMON-NEXT: store i8 6, ptr [[TMP6]], align 1
-; COMMON-NEXT: br label %[[PRED_STORE_CONTINUE12]]
-; COMMON: [[PRED_STORE_CONTINUE12]]:
-; COMMON-NEXT: br i1 false, label %[[PRED_STORE_IF13:.*]], label %[[EXIT:.*]]
-; COMMON: [[PRED_STORE_IF13]]:
-; COMMON-NEXT: [[TMP7:%.*]] = getelementptr i8, ptr [[DST]], i64 7
-; COMMON-NEXT: store i8 7, ptr [[TMP7]], align 1
-; COMMON-NEXT: br label %[[EXIT]]
-; COMMON: [[EXIT]]:
-; COMMON-NEXT: br label %[[SCALAR_PH:.*]]
-; COMMON: [[SCALAR_PH]]:
-; COMMON-NEXT: br [[EXIT1:label %.*]]
-; COMMON: [[SCALAR_PH1:.*:]]
+; COMMON-NEXT: [[ENTRY:.*]]:
+; COMMON-NEXT: br label %[[EXIT1:.*]]
+; COMMON: [[EXIT1]]:
+; COMMON-NEXT: [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[EXIT1]] ]
+; COMMON-NEXT: [[IV_TRUNC:%.*]] = trunc i64 [[IV]] to i8
+; COMMON-NEXT: [[GEP:%.*]] = getelementptr i8, ptr [[DST]], i64 [[IV]]
+; COMMON-NEXT: store i8 [[IV_TRUNC]], ptr [[GEP]], align 1
+; COMMON-NEXT: [[IV_NEXT]] = add i64 [[IV]], 1
+; COMMON-NEXT: [[EC:%.*]] = icmp eq i64 [[IV_NEXT]], 7
+; COMMON-NEXT: br i1 [[EC]], label %[[SCALAR_PH1:.*]], label %[[EXIT1]]
+; COMMON: [[SCALAR_PH1]]:
+; COMMON-NEXT: ret void
;
entry:
br label %loop
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/induction-costs-sve.ll b/llvm/test/Transforms/LoopVectorize/AArch64/induction-costs-sve.ll
index cc7b4aecc3642..71c2a05af964f 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/induction-costs-sve.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/induction-costs-sve.ll
@@ -274,69 +274,11 @@ define void @iv_trunc(i32 %x, ptr %dst, i64 %N) #0 {
;
; PRED-LABEL: define void @iv_trunc(
; PRED-SAME: i32 [[X:%.*]], ptr [[DST:%.*]], i64 [[N:%.*]]) #[[ATTR0]] {
-; PRED-NEXT: [[ENTRY:.*:]]
+; PRED-NEXT: [[ENTRY:.*]]:
; PRED-NEXT: [[MUL_X:%.*]] = add i32 [[X]], 1
-; PRED-NEXT: [[TMP0:%.*]] = add i64 [[N]], 1
-; PRED-NEXT: br label %[[VECTOR_SCEVCHECK:.*]]
-; PRED: [[VECTOR_SCEVCHECK]]:
-; PRED-NEXT: [[TMP1:%.*]] = sub i32 -1, [[X]]
-; PRED-NEXT: [[TMP2:%.*]] = icmp slt i32 [[MUL_X]], 0
-; PRED-NEXT: [[TMP3:%.*]] = select i1 [[TMP2]], i32 [[TMP1]], i32 [[MUL_X]]
-; PRED-NEXT: [[TMP4:%.*]] = trunc i64 [[N]] to i32
-; PRED-NEXT: [[MUL:%.*]] = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 [[TMP3]], i32 [[TMP4]])
-; PRED-NEXT: [[MUL_RESULT:%.*]] = extractvalue { i32, i1 } [[MUL]], 0
-; PRED-NEXT: [[MUL_OVERFLOW:%.*]] = extractvalue { i32, i1 } [[MUL]], 1
-; PRED-NEXT: [[TMP5:%.*]] = sub i32 0, [[MUL_RESULT]]
-; PRED-NEXT: [[TMP6:%.*]] = icmp ugt i32 [[TMP5]], 0
-; PRED-NEXT: [[TMP7:%.*]] = select i1 [[TMP2]], i1 [[TMP6]], i1 false
-; PRED-NEXT: [[TMP8:%.*]] = or i1 [[TMP7]], [[MUL_OVERFLOW]]
-; PRED-NEXT: [[TMP9:%.*]] = icmp ugt i64 [[N]], 4294967295
-; PRED-NEXT: [[TMP10:%.*]] = icmp ne i32 [[MUL_X]], 0
-; PRED-NEXT: [[TMP11:%.*]] = and i1 [[TMP9]], [[TMP10]]
-; PRED-NEXT: [[TMP12:%.*]] = or i1 [[TMP8]], [[TMP11]]
-; PRED-NEXT: br i1 [[TMP12]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
-; PRED: [[VECTOR_PH]]:
-; PRED-NEXT: [[TMP13:%.*]] = sub i64 [[TMP0]], 2
-; PRED-NEXT: [[TMP14:%.*]] = icmp ugt i64 [[TMP0]], 2
-; PRED-NEXT: [[TMP15:%.*]] = select i1 [[TMP14]], i64 [[TMP13]], i64 0
-; PRED-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <2 x i1> @llvm.get.active.lane.mask.v2i1.i64(i64 0, i64 [[TMP0]])
-; PRED-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x i32> poison, i32 [[MUL_X]], i64 0
-; PRED-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <2 x i32> [[BROADCAST_SPLATINSERT]], <2 x i32> poison, <2 x i32> zeroinitializer
-; PRED-NEXT: br label %[[VECTOR_BODY:.*]]
-; PRED: [[VECTOR_BODY]]:
-; PRED-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_STORE_CONTINUE2:.*]] ]
-; PRED-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], %[[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], %[[PRED_STORE_CONTINUE2]] ]
-; PRED-NEXT: [[VEC_IND:%.*]] = phi <2 x i32> [ <i32 0, i32 1>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[PRED_STORE_CONTINUE2]] ]
-; PRED-NEXT: [[TMP16:%.*]] = mul <2 x i32> [[BROADCAST_SPLAT]], [[VEC_IND]]
-; PRED-NEXT: [[TMP17:%.*]] = zext <2 x i32> [[TMP16]] to <2 x i64>
-; PRED-NEXT: [[TMP18:%.*]] = extractelement <2 x i1> [[ACTIVE_LANE_MASK]], i32 0
-; PRED-NEXT: br i1 [[TMP18]], label %[[PRED_STORE_IF:.*]], label %[[PRED_STORE_CONTINUE:.*]]
-; PRED: [[PRED_STORE_IF]]:
-; PRED-NEXT: [[TMP19:%.*]] = extractelement <2 x i64> [[TMP17]], i32 0
-; PRED-NEXT: [[TMP20:%.*]] = getelementptr i32, ptr [[DST]], i64 [[TMP19]]
-; PRED-NEXT: store i32 1, ptr [[TMP20]], align 4
-; PRED-NEXT: br label %[[PRED_STORE_CONTINUE]]
-; PRED: [[PRED_STORE_CONTINUE]]:
-; PRED-NEXT: [[TMP21:%.*]] = extractelement <2 x i1> [[ACTIVE_LANE_MASK]], i32 1
-; PRED-NEXT: br i1 [[TMP21]], label %[[PRED_STORE_IF1:.*]], label %[[PRED_STORE_CONTINUE2]]
-; PRED: [[PRED_STORE_IF1]]:
-; PRED-NEXT: [[TMP22:%.*]] = extractelement <2 x i64> [[TMP17]], i32 1
-; PRED-NEXT: [[TMP23:%.*]] = getelementptr i32, ptr [[DST]], i64 [[TMP22]]
-; PRED-NEXT: store i32 1, ptr [[TMP23]], align 4
-; PRED-NEXT: br label %[[PRED_STORE_CONTINUE2]]
-; PRED: [[PRED_STORE_CONTINUE2]]:
-; PRED-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 2
-; PRED-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <2 x i1> @llvm.get.active.lane.mask.v2i1.i64(i64 [[INDEX]], i64 [[TMP15]])
-; PRED-NEXT: [[TMP24:%.*]] = extractelement <2 x i1> [[ACTIVE_LANE_MASK_NEXT]], i32 0
-; PRED-NEXT: [[TMP25:%.*]] = xor i1 [[TMP24]], true
-; PRED-NEXT: [[VEC_IND_NEXT]] = add <2 x i32> [[VEC_IND]], splat (i32 2)
-; PRED-NEXT: br i1 [[TMP25]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
-; PRED: [[MIDDLE_BLOCK]]:
-; PRED-NEXT: br label %[[EXIT:.*]]
-; PRED: [[SCALAR_PH]]:
; PRED-NEXT: br label %[[FOR_BODY:.*]]
; PRED: [[FOR_BODY]]:
-; PRED-NEXT: [[IV:%.*]] = phi i64 [ 0, %[[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], %[[FOR_BODY]] ]
+; PRED-NEXT: [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[FOR_BODY]] ]
; PRED-NEXT: [[TRUNC_IV:%.*]] = trunc i64 [[IV]] to i32
; PRED-NEXT: [[ADD_I:%.*]] = mul i32 [[MUL_X]], [[TRUNC_IV]]
; PRED-NEXT: [[IV_MUL:%.*]] = zext i32 [[ADD_I]] to i64
@@ -344,7 +286,7 @@ define void @iv_trunc(i32 %x, ptr %dst, i64 %N) #0 {
; PRED-NEXT: store i32 1, ptr [[GEP]], align 4
; PRED-NEXT: [[IV_NEXT]] = add i64 [[IV]], 1
; PRED-NEXT: [[EC:%.*]] = icmp eq i64 [[IV]], [[N]]
-; PRED-NEXT: br i1 [[EC]], label %[[EXIT]], label %[[FOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
+; PRED-NEXT: br i1 [[EC]], label %[[EXIT:.*]], label %[[FOR_BODY]]
; PRED: [[EXIT]]:
; PRED-NEXT: ret void
;
@@ -440,101 +382,21 @@ define void @trunc_ivs_and_store(i32 %x, ptr %dst, i64 %N) #0 {
;
; PRED-LABEL: define void @trunc_ivs_and_store(
; PRED-SAME: i32 [[X:%.*]], ptr [[DST:%.*]], i64 [[N:%.*]]) #[[ATTR0]] {
-; PRED-NEXT: [[ENTRY:.*:]]
-; PRED-NEXT: [[MUL:%.*]] = mul i32 [[X]], [[X]]
-; PRED-NEXT: [[TMP0:%.*]] = add i64 [[N]], 1
-; PRED-NEXT: br label %[[VECTOR_SCEVCHECK:.*]]
-; PRED: [[VECTOR_SCEVCHECK]]:
+; PRED-NEXT: [[ENTRY:.*]]:
; PRED-NEXT: [[TMP1:%.*]] = mul i32 [[X]], [[X]]
-; PRED-NEXT: [[TMP2:%.*]] = sub i32 0, [[TMP1]]
-; PRED-NEXT: [[TMP3:%.*]] = icmp slt i32 [[MUL]], 0
-; PRED-NEXT: [[TMP4:%.*]] = select i1 [[TMP3]], i32 [[TMP2]], i32 [[MUL]]
-; PRED-NEXT: [[TMP5:%.*]] = trunc i64 [[N]] to i32
-; PRED-NEXT: [[MUL1:%.*]] = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 [[TMP4]], i32 [[TMP5]])
-; PRED-NEXT: [[MUL_RESULT:%.*]] = extractvalue { i32, i1 } [[MUL1]], 0
-; PRED-NEXT: [[MUL_OVERFLOW:%.*]] = extractvalue { i32, i1 } [[MUL1]], 1
-; PRED-NEXT: [[TMP6:%.*]] = sub i32 0, [[MUL_RESULT]]
-; PRED-NEXT: [[TMP7:%.*]] = icmp ugt i32 [[TMP6]], 0
-; PRED-NEXT: [[TMP8:%.*]] = select i1 [[TMP3]], i1 [[TMP7]], i1 false
-; PRED-NEXT: [[TMP9:%.*]] = or i1 [[TMP8]], [[MUL_OVERFLOW]]
-; PRED-NEXT: [[TMP10:%.*]] = icmp ugt i64 [[N]], 4294967295
-; PRED-NEXT: [[TMP11:%.*]] = icmp ne i32 [[MUL]], 0
-; PRED-NEXT: [[TMP12:%.*]] = and i1 [[TMP10]], [[TMP11]]
-; PRED-NEXT: [[TMP13:%.*]] = or i1 [[TMP9]], [[TMP12]]
-; PRED-NEXT: br i1 [[TMP13]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
-; PRED: [[VECTOR_PH]]:
-; PRED-NEXT: [[TMP14:%.*]] = sub i64 [[TMP0]], 4
-; PRED-NEXT: [[TMP15:%.*]] = icmp ugt i64 [[TMP0]], 4
-; PRED-NEXT: [[TMP16:%.*]] = select i1 [[TMP15]], i64 [[TMP14]], i64 0
-; PRED-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64 0, i64 [[TMP0]])
-; PRED-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> poison, i32 [[MUL]], i64 0
-; PRED-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> poison, <4 x i32> zeroinitializer
-; PRED-NEXT: br label %[[VECTOR_BODY:.*]]
-; PRED: [[VECTOR_BODY]]:
-; PRED-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_STORE_CONTINUE7:.*]] ]
-; PRED-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <4 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], %[[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], %[[PRED_STORE_CONTINUE7]] ]
-; PRED-NEXT: [[VEC_IND:%.*]] = phi <4 x i32> [ <i32 0, i32 1, i32 2, i32 3>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[PRED_STORE_CONTINUE7]] ]
-; PRED-NEXT: [[OFFSET_IDX:%.*]] = trunc i64 [[INDEX]] to i32
-; PRED-NEXT: [[TMP17:%.*]] = mul <4 x i32> [[BROADCAST_SPLAT]], [[VEC_IND]]
-; PRED-NEXT: [[TMP18:%.*]] = zext <4 x i32> [[TMP17]] to <4 x i64>
-; PRED-NEXT: [[TMP19:%.*]] = extractelement <4 x i1> [[ACTIVE_LANE_MASK]], i32 0
-; PRED-NEXT: br i1 [[TMP19]], label %[[PRED_STORE_IF:.*]], label %[[PRED_STORE_CONTINUE:.*]]
-; PRED: [[PRED_STORE_IF]]:
-; PRED-NEXT: [...
[truncated]
david-arm left a comment:

I'm not sure if this statement is quite right:

> This is likely inaccurate because we can expect a tail folded instruction to be executed on every iteration bar the last.

In one of the tests changed by this patch the loop has a low trip count of 7. Suppose we chose a VF of 16, then doesn't that mean that only 7 out of 16 scalar instructions are being executed? That takes us back to getPredBlockCostDivisor returning a value of 2, I think. It feels like there are three different problems here:

I guess what I'm trying to say is that if the intent of the patch is to more accurately calculate the cost divisor based on the probability of entering the block, then paradoxically the cost should really reduce even further in some cases. That makes me wonder if this is the right approach? Or perhaps the change to …
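(Illustrative arithmetic, not from the thread: with TC = 7 and VF = 16, the single masked vector iteration has only 7 of its 16 lanes active, so the per-lane execution probability is 7/16 ≈ 44%, roughly the 50% that the old divisor of 2 assumes.)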
fhahn left a comment:
> I'm not sure if this statement is quite right:
>
> > This is likely inaccurate because we can expect a tail folded instruction to be executed on every iteration bar the last.
>
> In one of the tests changed by this patch the loop has a low trip count of 7. Suppose we chose a VF of 16, then doesn't that mean that only 7 out of 16 scalar instructions are being executed? That takes us back to getPredBlockCostDivisor returning a value of 2 I think.
In cases except very low trip counts, using a probability closer to 1 should be more accurate than 0.5, I think. E.g. if we execute 4 full vector iterations and one partial one, the probability should be something close to ~80%. Overall this should hopefully help to reduce the number of cases where we vectorize with tail-folding and costly replication.
I might be wrong, but I think this isn't something where block-frequency info would help, which is why I suggested splitting this off from #158690.
For loops with known trip counts, we should be able to compute it accurately given the VF and trip count. For other cases, it might make sense to be a bit more conservative and don't assume 100% probability but something like 80% or 90%.
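For illustration, a minimal sketch of that estimate (the helper name and the fixed-width/known-trip-count assumptions are hypothetical, not from the patch):

// Per-lane execution probability of a tail-folded body with known trip
// count TC and fixed vector width VF: ceil(TC/VF) masked iterations cover
// ceil(TC/VF)*VF lanes, of which TC are active. E.g. TC=18, VF=4 gives
// 5 iterations and 18/20 = 90%; TC=17 gives 17/20 = 85%, far above 50%.
double laneActiveProbability(unsigned TC, unsigned VF) {
  unsigned Lanes = ((TC + VF - 1) / VF) * VF; // total lanes across iterations
  return double(TC) / double(Lanes);
}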
I think the test that @david-arm mentioned would be a good sanity check; we shouldn't choose VF 16 over VF 8 just because we add additional branches we know won't execute.
I see what you're saying, but I really think it's just coincidental that the predicated discount cost sometimes aligns with low trip counts, for example in that test. From what I can tell, getPredBlockCostDivisor was never intended to discount the cost for tail folding; it was only for if-converted blocks: 1755d81. And somewhere along the line, blocks that are predicated due to tail folding got caught up in this discount when it wasn't intended for them.
I agree here. I think we could detect whenever a block is only predicated for tail folding and we have a known trip count; we can plumb through the VF and get a better estimate there. It feels kind of separate to this PR though. Would it be ok to do that as a follow up?
@david-arm Also probably worth mentioning is that the impact of this was actually very small when I tested it out on benchmarks. In #158690 with aarch64-linux-gnu -march=armv9-a -flto -O3 there was no change in the number of loops vectorized in llvm-test-suite, and actually 0.2% more loops vectorized on SPEC CPU 2017. (I don't think exchange2 is in here though.)
I tried out accounting for tail folding using the estimated trip count:

inline unsigned
getPredBlockCostDivisor(TargetTransformInfo::TargetCostKind CostKind,
                        BasicBlock *BB, ElementCount VF) const {
  if (!Legal->blockNeedsPredication(BB)) {
    if (foldTailByMasking()) {
      if (auto TC = getSmallBestKnownTC(PSE, TheLoop)) {
        unsigned ETC = estimateElementCount(*TC, getVScaleForTuning());
        unsigned EVF = estimateElementCount(VF, getVScaleForTuning());
        return std::max(1U, EVF / ETC);
      }
    }
    return 1;
  }
  return CostKind == TTI::TCK_CodeSize ? 1 : 2;
}

But there's no difference on any of the in-tree tests. My guess is that the discount really only occurs with a high enough VF, at which point the cost of scalarization for every lane begins to outweigh any discount from the blocks potentially not being executed.
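Plugging numbers into that sketch supports this (a hypothetical check, assuming unit vscale so the estimated element counts equal the raw VF/TC values):

#include <algorithm>

// For the low-trip-count test above (TC = 7), the sketched divisor is:
unsigned divisor(unsigned EVF, unsigned ETC) { return std::max(1U, EVF / ETC); }
// divisor(8, 7)  == 1, so no discount at VF 8;
// divisor(16, 7) == 2, i.e. only at VF 16 does it reach the old 50% value.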
Thanks to @lukel97 and @fhahn for explaining! Just to be clear, I won't hold up the patch. I realise this is for a different day and a different PR, but it seems to me the root problem here is that we make the decision to tail-fold the loop far too early without considering the cost of doing so, and then everything afterwards has to try very hard to unwind that extremely poor early decision. If we'd avoided tail-folding in the first place, e.g. by at least checking whether the target supports masked loads and stores, then we may avoid some of the extra complexity of rolling back. Alternatively, if we'd chosen from two VPlans per VF using different styles of vectorisation then we may end up with a better choice than simply not vectorising.
  // Legal is used so as to not include all blocks in tail folded loops.
  if (VF.isScalar() && Legal->blockNeedsPredication(BB))
-   BlockCost /= getPredBlockCostDivisor(CostKind);
+   BlockCost /= getPredBlockCostDivisor(CostKind, BB);
Here the call to getPredBlockCostDivisor is already guarded by Legal->blockNeedsPredication(BB). Instead of changing the meaning/behaviour of getPredBlockCostDivisor, is it perhaps better to simply guard all calls to the function in a similar way? Then you can add a comment to getPredBlockCostDivisor saying that it should only ever be called for blocks that are predicated in the original scalar code.
Alternatively, you could remove the Legal->blockNeedsPredication(BB) check here and update the comment above getPredBlockCostDivisor saying that it will always return 1 for blocks that are predicated due to tail-folding.
Oh good point. And I guess that comment above confirms that the predication discount isn't meant to be applied to tail folded blocks? ab97c9b
/// TODO: We should use actual block probability here, if available.
/// Currently, we always assume predicated blocks have a 50% chance of
/// executing.
inline unsigned
I think the comment on this function needs updating.
I've updated the comment in cf6b435 to mention that tail-folded predication doesn't count in this case
|
Having thought about this just now, I also suspect that when tail-folding due to low trip counts (i.e. TC <= 16) we probably clamp the max VF to the next largest power-of-2 value. So we would never choose a max VF of 16 for TC=5; we'd probably choose a max VF of 8. There will be very few cases where the cost divisor goes above 1. We also don't know at this point what the interleave count will be.
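A sketch of the clamping being described (an assumption about the heuristic, not a quote of the cost model; the helper name is hypothetical):

#include "llvm/Support/MathExtras.h"

// Cap the max VF at the next power of 2 >= the trip count, so a loop with
// TC=5 gets a max VF of 8 and a VF of 16 is never considered.
uint64_t clampedMaxVF(uint64_t TC) { return llvm::PowerOf2Ceil(TC); }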
…ze/dont-discount-unpredicated-blocks
; CHECK-NEXT: vector.body:
; CHECK-NEXT: EMIT vp<[[CAN_IV:%.+]]> = CANONICAL-INDUCTION
; CHECK-NEXT: ir<%iv> = WIDEN-INDUCTION ir<0>, ir<1>, vp<[[VF]]>
; CHECK-NEXT: vp<[[STEPS:%.+]]> = SCALAR-STEPS vp<[[CAN_IV]]>, ir<1>, vp<[[VF]]>
I think this is now not testing what it used to (replicate regions aren't merged any more); it would be good to adjust the test so this merging still takes place.
Also done in f072e3c; I needed to pass -force-widen-divrem-via-safe-divisor=0 so that the div is scalarized.
; CHECK-NEXT: }
; CHECK-NEXT: Successor(s): pred.load
; CHECK-EMPTY:
I think this is now not testing what it used to (replicate regions aren't merged any more); it would be good to adjust the test so this merging still takes place.
Sorry for the delay, should be done now in f072e3c. I had to remove the instructions in between the load and store that were now widened, to get the replicate regions beside each other.
…ze/dont-discount-unpredicated-blocks
…ze/dont-discount-unpredicated-blocks
…ll merging replicate regions
…ze/dont-discount-unpredicated-blocks
Ping
fhahn left a comment:
Do you by any chance have any perf numbers to confirm this doesn't cause any regressions?
If so, it would be great if you could add them to the PR description.
%add = add i32 %lv.b, 10
%mul = mul i32 2, %add
Is there any reason those need to be removed? I might be remembering incorrectly, but this should test that we can sink multiple recipes.
I think this is supposed to be testing that replicate regions are merged? #160449 (comment)
Because the add and mul are no longer replicated (there's no predication discount anymore), we now end up with:
REPLICATE load
WIDE add
WIDE mul
REPLICATE store
So I had to remove the add and the mul so the load and store replicate regions would be together, which allows them to be merged.
; CHECK-NEXT: ir<%iv> = WIDEN-INDUCTION ir<0>, ir<1>, vp<[[VF]]>
; CHECK-NEXT: EMIT vp<[[MASK:%.+]]> = icmp ule ir<%iv>, vp<[[BTC]]>
; CHECK-NEXT: EMIT vp<[[SPLICE:%.+]]> = first-order splice ir<%recur>, ir<%recur.next>
; CHECK-NEXT: WIDEN ir<%rem> = srem vp<[[SPLICE]]>, ir<%x>
not testing sinking into replicate regions any longer?
; CHECK-NEXT: Successor(s): loop.1
; CHECK-EMPTY:
; CHECK-NEXT: loop.1:
; CHECK-NEXT: WIDEN ir<%add.1> = add ir<%conv>, ir<%rem>
this now disables merging/sinking? Can we avoid this change, to keep testing the original thing?
This was originally to test splitting at the end of a block, but as of today it's not testing that because the conv isn't at the end of the block. I've opened #164636 to move it to a unit test.
I tested this again on llvm-test-suite arm64-apple-darwin -O3; the geomean difference in loops vectorized is -0.0%. Only four tests were affected, with one fewer loop vectorized, and there was no observable performance difference.
In llvm#160449 some of the tests end up merging fewer blocks, so we end up losing test coverage. This adds a unit test to try and cover it elsewhere in a more predictable way, so it won't be influenced by e.g. whether or not the cost model decides to scalarize an instruction. Two parts in createReplicateRegion/addReplicateRegions had to be relaxed to allow using a fake Instruction with no BasicBlock parent. I wasn't able to create a fake BasicBlock in the test without hitting iterator assertions.
…ze/dont-discount-unpredicated-blocks
; CHECK-NEXT: WIDEN-CAST ir<%conv> = sext vp<[[PRED1]]> to i32
; CHECK-NEXT: EMIT vp<[[SPLICE:%.+]]> = first-order splice ir<%0>, ir<%conv>
; CHECK-NEXT: WIDEN ir<%rem> = srem vp<[[SPLICE]]>, ir<%x>
; CHECK-NEXT: WIDEN ir<%add> = add ir<%conv>, ir<%rem>
IIUC when this test was first written in https://reviews.llvm.org/D100751 sinking operated directly on regions. But now it happens in sinkRecurrenceUsersAfterPrevious which is called before optimization, so before createAndOptimizeReplicateRegions.
So I don't think we're sinking replicate regions directly anymore, and I'm not sure if this was still testing the original "; Test cases for PR50009, which require sinking a replicate region due to a first-order recurrence." comment.
Just noting here that we can't explicitly predicate this test because the srem that's being sunk is used in the FOR chain
@sink_replicate_region_4_requires_split_at_end_of_block was originally added to ensure splitting at the end of a block wouldn't crash, see bdada75. However, it looks like we're now no longer testing this because conv isn't at the end of the block anymore. This moves it into a unit test instead. Discovered when working on #160449.
// getPredBlockCostDivisor won't include blocks that are only predicated due
// to tail folded loops
Suggested change:
- // getPredBlockCostDivisor won't include blocks that are only predicated due
- // to tail folded loops
+ // getPredBlockCostDivisor will return 1 for blocks only predicated
+ // by the header mask when folding the tail
Done in fdfa9b2
unsigned VPCostContext::getPredBlockCostDivisor(
    TargetTransformInfo::TargetCostKind CostKind, BasicBlock *BB) const {
Suggested change:
- unsigned VPCostContext::getPredBlockCostDivisor(
-     TargetTransformInfo::TargetCostKind CostKind, BasicBlock *BB) const {
+ unsigned VPCostContext::getPredBlockCostDivisor(BasicBlock *BB) const {

I think it should be fine to just use CostKind from VPCostContext?
Thanks, done in fdfa9b2
david-arm left a comment:

LGTM!
fhahn left a comment:

LGTM, thanks!
* main: (1028 commits)
  [clang][DebugInfo] Attach `DISubprogram` to additional call variants (llvm#166202)
  [C2y] Claim nonconformance to WG14 N3348 (llvm#166966)
  [X86] 2012-01-10-UndefExceptionEdge.ll - regenerate test checks (llvm#167307)
  Remove unused standard headers: <string>, <optional>, <numeric>, <tuple> (llvm#167232)
  [DebugInfo] Add Verifier check for incorrectly-scoped retainedNodes (llvm#166855)
  [VPlan] Don't apply predication discount to non-originally-predicated blocks (llvm#160449)
  [libc++] Avoid overloaded `operator,` for (`T`, `Iter`) cases (llvm#161049)
  [tools][llc] Make save-stats.ll test target independent (llvm#167238)
  [AArch64] Fallback to PRFUM for PRFM with negative or unaligned offset (llvm#166756)
  [X86] ldexp-avx512.ll - add v8f16/v16f16/v32f16 test coverage for llvm#165694 (llvm#167294)
  [DropAssumes] Drop dereferenceable assumptions after vectorization. (llvm#166947)
  [VPlan] Simplify branch-cond with getVectorTripCount (llvm#155604)
  Remove unused <algorithm> inclusion (llvm#166942)
  [AArch64] Combine subtract with borrow to SBC. (llvm#165271)
  [AArch64][SVE] Avoid redundant extend of unsigned i8/i16 extracts. (llvm#165863)
  [SPIRV] Fix failing assertion in SPIRVAsmPrinter (llvm#166909)
  [libc++] Merge insert/emplace(const_iterator, Args...) implementations (llvm#166470)
  [libc++] Replace __libcpp_is_final with a variable template (llvm#167137)
  [gn build] Port 152bda7
  [libc++] Replace the last uses of __tuple_types with __type_list (llvm#167214)
  ...
Split off from #158690. Currently, if an instruction needs to be predicated due to tail folding, it will also have a predication discount applied to it in multiple places.
This is likely inaccurate because we can expect a tail folded instruction to be executed on every iteration bar the last.
This fixes it by checking if the instruction/block was originally predicated, and in doing so prevents vectorization with tail folding where we would have had to scalarize the memory op anyway.
On llvm-test-suite this causes 4 loops in total to no longer be vectorized with -O3 on arm64-apple-darwin, and there's no observable performance impact.