[LV] Count cost of middle block if TC <= VF. #168949

fhahn · 2025-11-20T21:27:45Z

If the expected trip count is less than the VF, the vector loop will only execute a single iteration. When that's the case, the cost of the middle block has the same impact as the cost of the vector loop. Include it in isOutsideLoopWorkProfitable to avoid vectorizing when the extra work in the middle block makes it unprofitable.

Note that isOutsideLoopWorkProfitable already scales the cost of blocks outside the vector region, but the patch restricts accounting for the middle block to cases where VF <= ExpectedTC, to initially catch some worst cases and avoid regressions.

This initial version should specifically avoid unprofitable tail-folding for loops with low trip counts after re-applying #149042.

If the expected trip count is less than the VF, the vector loop will only execute a single iteration. When that's the case, the cost of the middle block has the same impact as the cost of the vector loop. Include it in isOutsideLoopWorkProfitable to avoid vectorizing when the extra work in the middle block makes it unprofitable. Note that isOutsideLoopWorkProfitable already scales the cost of blocks outside the vector region, but the patch restricts accounting for the middle block to cases where VF <= ExpectedTC, to initially catch some worst cases and avoid regressions. This initial version should specifically avoid unprofitable tail-folding for loops with low trip counts after re-applying llvm#149042.

llvmbot · 2025-11-20T21:28:17Z

@llvm/pr-subscribers-vectorizers

Author: Florian Hahn (fhahn)

Changes

If the expected trip count is less than the VF, the vector loop will only execute a single iteration. When that's the case, the cost of the middle block has the same impact as the cost of the vector loop. Include it in isOutsideLoopWorkProfitable to avoid vectorizing when the extra work in the middle block makes it unprofitable.

Note that isOutsideLoopWorkProfitable already scales the cost of blocks outside the vector region, but the patch restricts accounting for the middle block to cases where VF <= ExpectedTC, to initially catch some worst cases and avoid regressions.

This initial version should specifically avoid unprofitable tail-folding for loops with low trip counts after re-applying #149042.

Full diff: https://github.com/llvm/llvm-project/pull/168949.diff

2 Files Affected:

(modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+8)
(modified) llvm/test/Transforms/LoopVectorize/X86/replicating-load-store-costs.ll (+17-34)

diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 5b8725178521e..f872df8a1d23a 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -9283,6 +9283,14 @@ static bool isOutsideLoopWorkProfitable(GeneratedRTChecks &Checks,
   // one exists.
   TotalCost += calculateEarlyExitCost(CostCtx, Plan, VF.Width);
 
+  // If the expected trip count is less than the VF, the vector loop will only
+  // execute a single iteration. Then the middle block is executed the same
+  // number of times as the vector region.
+  // TODO: Extend logic to always account for the cost of the middle block.
+  auto ExpectedTC = getSmallBestKnownTC(PSE, L);
+  if (ExpectedTC && ElementCount::isKnownLE(*ExpectedTC, VF.Width))
+    TotalCost += Plan.getMiddleBlock()->cost(VF.Width, CostCtx);
+
   // When interleaving only scalar and vector cost will be equal, which in turn
   // would lead to a divide by 0. Fall back to hard threshold.
   if (VF.Width.isScalar()) {
diff --git a/llvm/test/Transforms/LoopVectorize/X86/replicating-load-store-costs.ll b/llvm/test/Transforms/LoopVectorize/X86/replicating-load-store-costs.ll
index 803c472a35a51..f44d3008cbaa5 100644
--- a/llvm/test/Transforms/LoopVectorize/X86/replicating-load-store-costs.ll
+++ b/llvm/test/Transforms/LoopVectorize/X86/replicating-load-store-costs.ll
@@ -569,53 +569,36 @@ define double @test_load_used_by_other_load_scev_low_trip_count(ptr %ptr.a, ptr
 ; I64-NEXT:  [[ENTRY:.*]]:
 ; I64-NEXT:    br label %[[OUTER_LOOP:.*]]
 ; I64:       [[OUTER_LOOP_LOOPEXIT:.*]]:
+; I64-NEXT:    [[RESULT_LCSSA:%.*]] = phi double [ [[RESULT:%.*]], %[[INNER_LOOP:.*]] ]
 ; I64-NEXT:    br label %[[OUTER_LOOP]]
 ; I64:       [[OUTER_LOOP]]:
-; I64-NEXT:    [[ACCUM:%.*]] = phi double [ 0.000000e+00, %[[ENTRY]] ], [ [[TMP29:%.*]], %[[OUTER_LOOP_LOOPEXIT]] ]
+; I64-NEXT:    [[ACCUM:%.*]] = phi double [ 0.000000e+00, %[[ENTRY]] ], [ [[RESULT_LCSSA]], %[[OUTER_LOOP_LOOPEXIT]] ]
 ; I64-NEXT:    [[COND:%.*]] = call i1 @cond()
 ; I64-NEXT:    br i1 [[COND]], label %[[INNER_LOOP_PREHEADER:.*]], label %[[EXIT:.*]]
 ; I64:       [[INNER_LOOP_PREHEADER]]:
-; I64-NEXT:    br label %[[VECTOR_PH:.*]]
-; I64:       [[VECTOR_PH]]:
-; I64-NEXT:    br label %[[VECTOR_BODY:.*]]
-; I64:       [[VECTOR_BODY]]:
-; I64-NEXT:    [[TMP0:%.*]] = add i64 0, 1
-; I64-NEXT:    [[TMP1:%.*]] = add i64 1, 1
-; I64-NEXT:    [[TMP2:%.*]] = getelementptr i8, ptr [[PTR_C]], i64 [[TMP0]]
+; I64-NEXT:    br label %[[INNER_LOOP]]
+; I64:       [[INNER_LOOP]]:
+; I64-NEXT:    [[IV:%.*]] = phi i64 [ [[IV_NEXT:%.*]], %[[INNER_LOOP]] ], [ 0, %[[INNER_LOOP_PREHEADER]] ]
+; I64-NEXT:    [[ACCUM_INNER:%.*]] = phi double [ [[MUL1:%.*]], %[[INNER_LOOP]] ], [ [[ACCUM]], %[[INNER_LOOP_PREHEADER]] ]
+; I64-NEXT:    [[TMP1:%.*]] = add i64 [[IV]], 1
 ; I64-NEXT:    [[TMP3:%.*]] = getelementptr i8, ptr [[PTR_C]], i64 [[TMP1]]
-; I64-NEXT:    [[TMP4:%.*]] = getelementptr i64, ptr [[PTR_A]], i64 [[TMP0]]
 ; I64-NEXT:    [[TMP5:%.*]] = getelementptr i64, ptr [[PTR_A]], i64 [[TMP1]]
-; I64-NEXT:    [[TMP6:%.*]] = load i64, ptr [[TMP4]], align 8
 ; I64-NEXT:    [[TMP7:%.*]] = load i64, ptr [[TMP5]], align 8
-; I64-NEXT:    [[TMP8:%.*]] = getelementptr double, ptr [[PTR_B]], i64 [[TMP6]]
 ; I64-NEXT:    [[TMP9:%.*]] = getelementptr double, ptr [[PTR_B]], i64 [[TMP7]]
 ; I64-NEXT:    [[TMP10:%.*]] = load double, ptr [[PTR_A]], align 8
-; I64-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x double> poison, double [[TMP10]], i64 0
-; I64-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <2 x double> [[BROADCAST_SPLATINSERT]], <2 x double> poison, <2 x i32> zeroinitializer
-; I64-NEXT:    [[TMP11:%.*]] = fadd <2 x double> [[BROADCAST_SPLAT]], zeroinitializer
-; I64-NEXT:    [[TMP12:%.*]] = getelementptr i8, ptr [[TMP2]], i64 8
+; I64-NEXT:    [[ADD1:%.*]] = fadd double [[TMP10]], 0.000000e+00
 ; I64-NEXT:    [[TMP13:%.*]] = getelementptr i8, ptr [[TMP3]], i64 8
-; I64-NEXT:    [[TMP14:%.*]] = load double, ptr [[TMP12]], align 8
 ; I64-NEXT:    [[TMP15:%.*]] = load double, ptr [[TMP13]], align 8
-; I64-NEXT:    [[TMP16:%.*]] = insertelement <2 x double> poison, double [[TMP14]], i32 0
-; I64-NEXT:    [[TMP17:%.*]] = insertelement <2 x double> [[TMP16]], double [[TMP15]], i32 1
-; I64-NEXT:    [[TMP18:%.*]] = fmul <2 x double> [[TMP11]], zeroinitializer
-; I64-NEXT:    [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <2 x double> poison, double [[ACCUM]], i64 0
-; I64-NEXT:    [[BROADCAST_SPLAT2:%.*]] = shufflevector <2 x double> [[BROADCAST_SPLATINSERT1]], <2 x double> poison, <2 x i32> zeroinitializer
-; I64-NEXT:    [[TMP19:%.*]] = shufflevector <2 x double> [[BROADCAST_SPLAT2]], <2 x double> [[TMP18]], <2 x i32> <i32 1, i32 2>
-; I64-NEXT:    [[TMP20:%.*]] = fmul <2 x double> [[TMP17]], zeroinitializer
-; I64-NEXT:    [[TMP21:%.*]] = fadd <2 x double> [[TMP20]], zeroinitializer
-; I64-NEXT:    [[TMP22:%.*]] = fadd <2 x double> [[TMP21]], splat (double 1.000000e+00)
-; I64-NEXT:    [[TMP23:%.*]] = load double, ptr [[TMP8]], align 8
+; I64-NEXT:    [[MUL1]] = fmul double [[ADD1]], 0.000000e+00
+; I64-NEXT:    [[MUL2:%.*]] = fmul double [[TMP15]], 0.000000e+00
+; I64-NEXT:    [[ADD2:%.*]] = fadd double [[MUL2]], 0.000000e+00
+; I64-NEXT:    [[ADD3:%.*]] = fadd double [[ADD2]], 1.000000e+00
 ; I64-NEXT:    [[TMP24:%.*]] = load double, ptr [[TMP9]], align 8
-; I64-NEXT:    [[TMP25:%.*]] = insertelement <2 x double> poison, double [[TMP23]], i32 0
-; I64-NEXT:    [[TMP26:%.*]] = insertelement <2 x double> [[TMP25]], double [[TMP24]], i32 1
-; I64-NEXT:    [[TMP27:%.*]] = fdiv <2 x double> [[TMP26]], [[TMP22]]
-; I64-NEXT:    [[TMP28:%.*]] = fsub <2 x double> [[TMP19]], [[TMP27]]
-; I64-NEXT:    br label %[[MIDDLE_BLOCK:.*]]
-; I64:       [[MIDDLE_BLOCK]]:
-; I64-NEXT:    [[TMP29]] = extractelement <2 x double> [[TMP28]], i32 1
-; I64-NEXT:    br label %[[OUTER_LOOP_LOOPEXIT]]
+; I64-NEXT:    [[DIV:%.*]] = fdiv double [[TMP24]], [[ADD3]]
+; I64-NEXT:    [[RESULT]] = fsub double [[ACCUM_INNER]], [[DIV]]
+; I64-NEXT:    [[IV_NEXT]] = add i64 [[IV]], 1
+; I64-NEXT:    [[EXITCOND:%.*]] = icmp eq i64 [[IV]], 1
+; I64-NEXT:    br i1 [[EXITCOND]], label %[[OUTER_LOOP_LOOPEXIT]], label %[[INNER_LOOP]]
 ; I64:       [[EXIT]]:
 ; I64-NEXT:    ret double [[ACCUM]]
 ;

llvmbot · 2025-11-20T21:28:17Z

@llvm/pr-subscribers-llvm-transforms

Author: Florian Hahn (fhahn)

Changes

If the expected trip count is less than the VF, the vector loop will only execute a single iteration. When that's the case, the cost of the middle block has the same impact as the cost of the vector loop. Include it in isOutsideLoopWorkProfitable to avoid vectorizing when the extra work in the middle block makes it unprofitable.

Note that isOutsideLoopWorkProfitable already scales the cost of blocks outside the vector region, but the patch restricts accounting for the middle block to cases where VF <= ExpectedTC, to initially catch some worst cases and avoid regressions.

This initial version should specifically avoid unprofitable tail-folding for loops with low trip counts after re-applying #149042.

Full diff: https://github.com/llvm/llvm-project/pull/168949.diff

2 Files Affected:

(modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+8)
(modified) llvm/test/Transforms/LoopVectorize/X86/replicating-load-store-costs.ll (+17-34)

diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 5b8725178521e..f872df8a1d23a 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -9283,6 +9283,14 @@ static bool isOutsideLoopWorkProfitable(GeneratedRTChecks &Checks,
   // one exists.
   TotalCost += calculateEarlyExitCost(CostCtx, Plan, VF.Width);
 
+  // If the expected trip count is less than the VF, the vector loop will only
+  // execute a single iteration. Then the middle block is executed the same
+  // number of times as the vector region.
+  // TODO: Extend logic to always account for the cost of the middle block.
+  auto ExpectedTC = getSmallBestKnownTC(PSE, L);
+  if (ExpectedTC && ElementCount::isKnownLE(*ExpectedTC, VF.Width))
+    TotalCost += Plan.getMiddleBlock()->cost(VF.Width, CostCtx);
+
   // When interleaving only scalar and vector cost will be equal, which in turn
   // would lead to a divide by 0. Fall back to hard threshold.
   if (VF.Width.isScalar()) {
diff --git a/llvm/test/Transforms/LoopVectorize/X86/replicating-load-store-costs.ll b/llvm/test/Transforms/LoopVectorize/X86/replicating-load-store-costs.ll
index 803c472a35a51..f44d3008cbaa5 100644
--- a/llvm/test/Transforms/LoopVectorize/X86/replicating-load-store-costs.ll
+++ b/llvm/test/Transforms/LoopVectorize/X86/replicating-load-store-costs.ll
@@ -569,53 +569,36 @@ define double @test_load_used_by_other_load_scev_low_trip_count(ptr %ptr.a, ptr
 ; I64-NEXT:  [[ENTRY:.*]]:
 ; I64-NEXT:    br label %[[OUTER_LOOP:.*]]
 ; I64:       [[OUTER_LOOP_LOOPEXIT:.*]]:
+; I64-NEXT:    [[RESULT_LCSSA:%.*]] = phi double [ [[RESULT:%.*]], %[[INNER_LOOP:.*]] ]
 ; I64-NEXT:    br label %[[OUTER_LOOP]]
 ; I64:       [[OUTER_LOOP]]:
-; I64-NEXT:    [[ACCUM:%.*]] = phi double [ 0.000000e+00, %[[ENTRY]] ], [ [[TMP29:%.*]], %[[OUTER_LOOP_LOOPEXIT]] ]
+; I64-NEXT:    [[ACCUM:%.*]] = phi double [ 0.000000e+00, %[[ENTRY]] ], [ [[RESULT_LCSSA]], %[[OUTER_LOOP_LOOPEXIT]] ]
 ; I64-NEXT:    [[COND:%.*]] = call i1 @cond()
 ; I64-NEXT:    br i1 [[COND]], label %[[INNER_LOOP_PREHEADER:.*]], label %[[EXIT:.*]]
 ; I64:       [[INNER_LOOP_PREHEADER]]:
-; I64-NEXT:    br label %[[VECTOR_PH:.*]]
-; I64:       [[VECTOR_PH]]:
-; I64-NEXT:    br label %[[VECTOR_BODY:.*]]
-; I64:       [[VECTOR_BODY]]:
-; I64-NEXT:    [[TMP0:%.*]] = add i64 0, 1
-; I64-NEXT:    [[TMP1:%.*]] = add i64 1, 1
-; I64-NEXT:    [[TMP2:%.*]] = getelementptr i8, ptr [[PTR_C]], i64 [[TMP0]]
+; I64-NEXT:    br label %[[INNER_LOOP]]
+; I64:       [[INNER_LOOP]]:
+; I64-NEXT:    [[IV:%.*]] = phi i64 [ [[IV_NEXT:%.*]], %[[INNER_LOOP]] ], [ 0, %[[INNER_LOOP_PREHEADER]] ]
+; I64-NEXT:    [[ACCUM_INNER:%.*]] = phi double [ [[MUL1:%.*]], %[[INNER_LOOP]] ], [ [[ACCUM]], %[[INNER_LOOP_PREHEADER]] ]
+; I64-NEXT:    [[TMP1:%.*]] = add i64 [[IV]], 1
 ; I64-NEXT:    [[TMP3:%.*]] = getelementptr i8, ptr [[PTR_C]], i64 [[TMP1]]
-; I64-NEXT:    [[TMP4:%.*]] = getelementptr i64, ptr [[PTR_A]], i64 [[TMP0]]
 ; I64-NEXT:    [[TMP5:%.*]] = getelementptr i64, ptr [[PTR_A]], i64 [[TMP1]]
-; I64-NEXT:    [[TMP6:%.*]] = load i64, ptr [[TMP4]], align 8
 ; I64-NEXT:    [[TMP7:%.*]] = load i64, ptr [[TMP5]], align 8
-; I64-NEXT:    [[TMP8:%.*]] = getelementptr double, ptr [[PTR_B]], i64 [[TMP6]]
 ; I64-NEXT:    [[TMP9:%.*]] = getelementptr double, ptr [[PTR_B]], i64 [[TMP7]]
 ; I64-NEXT:    [[TMP10:%.*]] = load double, ptr [[PTR_A]], align 8
-; I64-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x double> poison, double [[TMP10]], i64 0
-; I64-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <2 x double> [[BROADCAST_SPLATINSERT]], <2 x double> poison, <2 x i32> zeroinitializer
-; I64-NEXT:    [[TMP11:%.*]] = fadd <2 x double> [[BROADCAST_SPLAT]], zeroinitializer
-; I64-NEXT:    [[TMP12:%.*]] = getelementptr i8, ptr [[TMP2]], i64 8
+; I64-NEXT:    [[ADD1:%.*]] = fadd double [[TMP10]], 0.000000e+00
 ; I64-NEXT:    [[TMP13:%.*]] = getelementptr i8, ptr [[TMP3]], i64 8
-; I64-NEXT:    [[TMP14:%.*]] = load double, ptr [[TMP12]], align 8
 ; I64-NEXT:    [[TMP15:%.*]] = load double, ptr [[TMP13]], align 8
-; I64-NEXT:    [[TMP16:%.*]] = insertelement <2 x double> poison, double [[TMP14]], i32 0
-; I64-NEXT:    [[TMP17:%.*]] = insertelement <2 x double> [[TMP16]], double [[TMP15]], i32 1
-; I64-NEXT:    [[TMP18:%.*]] = fmul <2 x double> [[TMP11]], zeroinitializer
-; I64-NEXT:    [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <2 x double> poison, double [[ACCUM]], i64 0
-; I64-NEXT:    [[BROADCAST_SPLAT2:%.*]] = shufflevector <2 x double> [[BROADCAST_SPLATINSERT1]], <2 x double> poison, <2 x i32> zeroinitializer
-; I64-NEXT:    [[TMP19:%.*]] = shufflevector <2 x double> [[BROADCAST_SPLAT2]], <2 x double> [[TMP18]], <2 x i32> <i32 1, i32 2>
-; I64-NEXT:    [[TMP20:%.*]] = fmul <2 x double> [[TMP17]], zeroinitializer
-; I64-NEXT:    [[TMP21:%.*]] = fadd <2 x double> [[TMP20]], zeroinitializer
-; I64-NEXT:    [[TMP22:%.*]] = fadd <2 x double> [[TMP21]], splat (double 1.000000e+00)
-; I64-NEXT:    [[TMP23:%.*]] = load double, ptr [[TMP8]], align 8
+; I64-NEXT:    [[MUL1]] = fmul double [[ADD1]], 0.000000e+00
+; I64-NEXT:    [[MUL2:%.*]] = fmul double [[TMP15]], 0.000000e+00
+; I64-NEXT:    [[ADD2:%.*]] = fadd double [[MUL2]], 0.000000e+00
+; I64-NEXT:    [[ADD3:%.*]] = fadd double [[ADD2]], 1.000000e+00
 ; I64-NEXT:    [[TMP24:%.*]] = load double, ptr [[TMP9]], align 8
-; I64-NEXT:    [[TMP25:%.*]] = insertelement <2 x double> poison, double [[TMP23]], i32 0
-; I64-NEXT:    [[TMP26:%.*]] = insertelement <2 x double> [[TMP25]], double [[TMP24]], i32 1
-; I64-NEXT:    [[TMP27:%.*]] = fdiv <2 x double> [[TMP26]], [[TMP22]]
-; I64-NEXT:    [[TMP28:%.*]] = fsub <2 x double> [[TMP19]], [[TMP27]]
-; I64-NEXT:    br label %[[MIDDLE_BLOCK:.*]]
-; I64:       [[MIDDLE_BLOCK]]:
-; I64-NEXT:    [[TMP29]] = extractelement <2 x double> [[TMP28]], i32 1
-; I64-NEXT:    br label %[[OUTER_LOOP_LOOPEXIT]]
+; I64-NEXT:    [[DIV:%.*]] = fdiv double [[TMP24]], [[ADD3]]
+; I64-NEXT:    [[RESULT]] = fsub double [[ACCUM_INNER]], [[DIV]]
+; I64-NEXT:    [[IV_NEXT]] = add i64 [[IV]], 1
+; I64-NEXT:    [[EXITCOND:%.*]] = icmp eq i64 [[IV]], 1
+; I64-NEXT:    br i1 [[EXITCOND]], label %[[OUTER_LOOP_LOOPEXIT]], label %[[INNER_LOOP]]
 ; I64:       [[EXIT]]:
 ; I64-NEXT:    ret double [[ACCUM]]
 ;

fhahn · 2025-11-20T21:36:43Z

Without #149042 there are no changes across a very large C/C++ based test suite on ARM64 macOS

github-actions · 2025-11-20T22:11:17Z

🐧 Linux x64 Test Results

186505 tests passed
4873 tests skipped

lukel97

Makes sense to me, confirmed no changes either on llvm-test-suite rva23u64 -O3.

david-arm · 2025-11-24T11:38:19Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+  // number of times as the vector region.
+  // TODO: Extend logic to always account for the cost of the middle block.
+  auto ExpectedTC = getSmallBestKnownTC(PSE, L);
+  if (ExpectedTC && ElementCount::isKnownLE(*ExpectedTC, VF.Width))


This code now seems inconsistent with the sum of calculateEarlyExitCost and Checks.getCost(), which is always adding on the cost of work in the early exits regardless of trip count. Also, the code below already seems to have proper logic to handle this more complelely, i.e.

if (auto ExpectedTC = getSmallBestKnownTC(PSE, L)) { if (ElementCount::isKnownLT(*ExpectedTC, VF.MinProfitableTripCount)) {

I don't know why we shouldn't just always add on the cost to be consistent with the existing code that expects it? It feels really odd to treat the middle block differently to the runtime check block and early exit block code.

Yep, this is intentional in the initial patch (from the description: Note that isOutsideLoopWorkProfitable already scales the cost of blocks outside the vector region, but the patch restricts accounting for the middle block to cases where VF <= ExpectedTC, to initially catch some worst cases and avoid regressions.)

The TODO is to enable it unconditionally, but that potentially has much higher impact, which is why I'd prefer to do this as a separate commit, to make it slightly easier to narrow down regressions.

OK fair enough. If there is going to be follow-on work to do this unconditionally I'm happy with that!

david-arm · 2025-11-24T11:39:01Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+  // TODO: Extend logic to always account for the cost of the middle block.
+  auto ExpectedTC = getSmallBestKnownTC(PSE, L);
+  if (ExpectedTC && ElementCount::isKnownLE(*ExpectedTC, VF.Width))
+    TotalCost += Plan.getMiddleBlock()->cost(VF.Width, CostCtx);


Can you update the comment above the function isOutsideLoopWorkProfitable please?

Also the comment further down:

RtC is the cost of the generated runtime checks plus the cost of

needs updating too I think.

Both should be updated, thanks!

…ock-cost-for-low-tripcount-loops

david-arm

LGTM!

david-arm · 2025-11-24T12:55:21Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+  // number of times as the vector region.
+  // TODO: Extend logic to always account for the cost of the middle block.
+  auto ExpectedTC = getSmallBestKnownTC(PSE, L);
+  if (ExpectedTC && ElementCount::isKnownLE(*ExpectedTC, VF.Width))


OK fair enough. If there is going to be follow-on work to do this unconditionally I'm happy with that!

…unt-loops

If the expected trip count is less than the VF, the vector loop will only execute a single iteration. When that's the case, the cost of the middle block has the same impact as the cost of the vector loop. Include it in isOutsideLoopWorkProfitable to avoid vectorizing when the extra work in the middle block makes it unprofitable. Note that isOutsideLoopWorkProfitable already scales the cost of blocks outside the vector region, but the patch restricts accounting for the middle block to cases where VF <= ExpectedTC, to initially catch some worst cases and avoid regressions. This initial version should specifically avoid unprofitable tail-folding for loops with low trip counts after re-applying llvm/llvm-project#149042. PR: llvm/llvm-project#168949

If the expected trip count is less than the VF, the vector loop will only execute a single iteration. When that's the case, the cost of the middle block has the same impact as the cost of the vector loop. Include it in isOutsideLoopWorkProfitable to avoid vectorizing when the extra work in the middle block makes it unprofitable. Note that isOutsideLoopWorkProfitable already scales the cost of blocks outside the vector region, but the patch restricts accounting for the middle block to cases where VF <= ExpectedTC, to initially catch some worst cases and avoid regressions. This initial version should specifically avoid unprofitable tail-folding for loops with low trip counts after re-applying llvm#149042. PR: llvm#168949

…folding. (#149042)" This reverts commit a6edeed. The following fixes have landed, addressing issues causing the original revert: * #169298 * #167897 * #168949 Original message: Building on top of #148817, introduce a new abstract LastActiveLane opcode that gets lowered to Not(Mask) → FirstActiveLane(NotMask) → Sub(result, 1). When folding the tail, update all extracts for uses outside the loop the extract the value of the last actice lane. See also #148603 PR: #149042

… when tail-folding. (#149042)" This reverts commit a6edeed. The following fixes have landed, addressing issues causing the original revert: * llvm/llvm-project#169298 * llvm/llvm-project#167897 * llvm/llvm-project#168949 Original message: Building on top of llvm/llvm-project#148817, introduce a new abstract LastActiveLane opcode that gets lowered to Not(Mask) → FirstActiveLane(NotMask) → Sub(result, 1). When folding the tail, update all extracts for uses outside the loop the extract the value of the last actice lane. See also llvm/llvm-project#148603 PR: llvm/llvm-project#149042

…folding. (#149042)" This reverts commit a6edeed. The following fixes have landed, addressing issues causing the original revert: * #169298 * #167897 * #168949 Original message: Building on top of #148817, introduce a new abstract LastActiveLane opcode that gets lowered to Not(Mask) → FirstActiveLane(NotMask) → Sub(result, 1). When folding the tail, update all extracts for uses outside the loop the extract the value of the last actice lane. See also #148603 PR: #149042

… when tail-folding. (#149042)" This reverts commit a6edeed. The following fixes have landed, addressing issues causing the original revert: * llvm/llvm-project#169298 * llvm/llvm-project#167897 * llvm/llvm-project#168949 Original message: Building on top of llvm/llvm-project#148817, introduce a new abstract LastActiveLane opcode that gets lowered to Not(Mask) → FirstActiveLane(NotMask) → Sub(result, 1). When folding the tail, update all extracts for uses outside the loop the extract the value of the last actice lane. See also llvm/llvm-project#148603 PR: llvm/llvm-project#149042

…folding. (llvm#149042)" This reverts commit a6edeed. The following fixes have landed, addressing issues causing the original revert: * llvm#169298 * llvm#167897 * llvm#168949 Original message: Building on top of llvm#148817, introduce a new abstract LastActiveLane opcode that gets lowered to Not(Mask) → FirstActiveLane(NotMask) → Sub(result, 1). When folding the tail, update all extracts for uses outside the loop the extract the value of the last actice lane. See also llvm#148603 PR: llvm#149042

fhahn requested review from ayalz, davemgreen, david-arm, lukel97 and rengolin November 20, 2025 21:27

llvmbot added vectorizers llvm:transforms labels Nov 20, 2025

fhahn mentioned this pull request Nov 20, 2025

[LV] #149042 caused performance regressions on avx512 #167858

Closed

lukel97 approved these changes Nov 21, 2025

View reviewed changes

david-arm reviewed Nov 24, 2025

View reviewed changes

fhahn added 2 commits November 24, 2025 12:26

Merge remote-tracking branch 'origin/main' into lv-consider-middle-bl…

d5aa2ae

…ock-cost-for-low-tripcount-loops

!fixup update comments, thanks

5a9560d

david-arm approved these changes Nov 24, 2025

View reviewed changes

lukel97 mentioned this pull request Nov 24, 2025

Reapply "[LV] Use ExtractLane(LastActiveLane, V) live outs when tail-folding. (#149042)" #168738

Closed

Merge branch 'main' into lv-consider-middle-block-cost-for-low-tripco…

423b872

…unt-loops

fhahn enabled auto-merge (squash) November 24, 2025 18:53

fhahn merged commit 48eb697 into llvm:main Nov 24, 2025
7 of 9 checks passed

fhahn deleted the lv-consider-middle-block-cost-for-low-tripcount-loops branch November 24, 2025 20:45

[LV] Count cost of middle block if TC <= VF. #168949

[LV] Count cost of middle block if TC <= VF. #168949

Uh oh!

Conversation

fhahn commented Nov 20, 2025

Uh oh!

llvmbot commented Nov 20, 2025

Uh oh!

llvmbot commented Nov 20, 2025

Uh oh!

fhahn commented Nov 20, 2025

Uh oh!

github-actions bot commented Nov 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🐧 Linux x64 Test Results

Uh oh!

lukel97 left a comment

Choose a reason for hiding this comment

Uh oh!

david-arm Nov 24, 2025

Choose a reason for hiding this comment

Uh oh!

fhahn Nov 24, 2025

Choose a reason for hiding this comment

Uh oh!

david-arm Nov 24, 2025

Choose a reason for hiding this comment

Uh oh!

david-arm Nov 24, 2025

Choose a reason for hiding this comment

Uh oh!

fhahn Nov 24, 2025

Choose a reason for hiding this comment

Uh oh!

david-arm left a comment

Choose a reason for hiding this comment

Uh oh!

david-arm Nov 24, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

github-actions bot commented Nov 20, 2025 •

edited

Loading