
Conversation

@david-arm
Contributor

There are a number of places where we call getSmallConstantMaxTripCount
without passing a vector of predicates:

getSmallBestKnownTC
isIndvarOverflowCheckKnownFalse
computeMaxVF
isMoreProfitable

I've changed all of these to now pass in a predicate vector so that
we get the benefit of making better vectorisation choices when we
know the max trip count for loops that require SCEV predicate checks.

I've tried to add tests that cover all the cases affected by these
changes.

@llvmbot
Member

llvmbot commented Sep 25, 2024

@llvm/pr-subscribers-llvm-analysis
@llvm/pr-subscribers-backend-risc-v

@llvm/pr-subscribers-llvm-transforms

Author: David Sherwood (david-arm)


Patch is 27.39 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/109928.diff

3 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+14-5)
  • (added) llvm/test/Transforms/LoopVectorize/AArch64/low_trip_count_predicates.ll (+405)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/riscv-vector-reverse.ll (+2)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 0566d80c1cc001..55b59458a0aa41 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -422,7 +422,8 @@ static std::optional<unsigned> getSmallBestKnownTC(ScalarEvolution &SE,
       return *EstimatedTC;
 
   // Check if upper bound estimate is known.
-  if (unsigned ExpectedTC = SE.getSmallConstantMaxTripCount(L))
+  SmallVector<const SCEVPredicate *, 2> Predicates;
+  if (unsigned ExpectedTC = SE.getSmallConstantMaxTripCount(L, &Predicates))
     return ExpectedTC;
 
   return std::nullopt;
@@ -2298,8 +2299,9 @@ static bool isIndvarOverflowCheckKnownFalse(
   // We know the runtime overflow check is known false iff the (max) trip-count
   // is known and (max) trip-count + (VF * UF) does not overflow in the type of
   // the vector loop induction variable.
-  if (unsigned TC =
-          Cost->PSE.getSE()->getSmallConstantMaxTripCount(Cost->TheLoop)) {
+  SmallVector<const SCEVPredicate *, 2> Predicates;
+  if (unsigned TC = Cost->PSE.getSE()->getSmallConstantMaxTripCount(
+          Cost->TheLoop, &Predicates)) {
     uint64_t MaxVF = VF.getKnownMinValue();
     if (VF.isScalable()) {
       std::optional<unsigned> MaxVScale =
@@ -3994,8 +3996,13 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
   }
 
   unsigned TC = PSE.getSE()->getSmallConstantTripCount(TheLoop);
-  unsigned MaxTC = PSE.getSE()->getSmallConstantMaxTripCount(TheLoop);
+
+  SmallVector<const SCEVPredicate *, 2> Predicates;
+  unsigned MaxTC =
+      PSE.getSE()->getSmallConstantMaxTripCount(TheLoop, &Predicates);
   LLVM_DEBUG(dbgs() << "LV: Found trip count: " << TC << '\n');
+  if (TC != MaxTC)
+    LLVM_DEBUG(dbgs() << "LV: Found maximum trip count: " << MaxTC << '\n');
   if (TC == 1) {
     reportVectorizationFailure("Single iteration (non) loop",
         "loop trip count is one, irrelevant for vectorization",
@@ -4283,7 +4290,9 @@ bool LoopVectorizationPlanner::isMoreProfitable(
   InstructionCost CostA = A.Cost;
   InstructionCost CostB = B.Cost;
 
-  unsigned MaxTripCount = PSE.getSE()->getSmallConstantMaxTripCount(OrigLoop);
+  SmallVector<const SCEVPredicate *, 2> Predicates;
+  unsigned MaxTripCount =
+      PSE.getSE()->getSmallConstantMaxTripCount(OrigLoop, &Predicates);
 
   // Improve estimate for the vector width if it is scalable.
   unsigned EstimatedWidthA = A.Width.getKnownMinValue();
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/low_trip_count_predicates.ll b/llvm/test/Transforms/LoopVectorize/AArch64/low_trip_count_predicates.ll
new file mode 100644
index 00000000000000..2cdfa0d1564219
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/low_trip_count_predicates.ll
@@ -0,0 +1,405 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; REQUIRES: asserts
+; RUN: opt -S < %s -p loop-vectorize -debug-only=loop-vectorize -mattr=+sve 2>%t | FileCheck %s
+; RUN: cat %t | FileCheck %s --check-prefix=DEBUG
+
+target triple = "aarch64-unknown-linux-gnu"
+
+; DEBUG-LABEL: LV: Checking a loop in 'low_vf_ic_is_better'
+; DEBUG: LV: Found trip count: 0
+; DEBUG: LV: Found maximum trip count: 19
+; DEBUG: LV: IC is 1
+; DEBUG: LV: VF is vscale x 8
+; DEBUG: Main Loop VF:vscale x 8, Main Loop UF:1, Epilogue Loop VF:vscale x 4, Epilogue Loop UF:1
+
+; DEBUG-LABEL: LV: Checking a loop in 'trip_count_too_small'
+; DEBUG: LV: Found a loop with a very small trip count. This loop is worth vectorizing only if no scalar iteration overheads are incurred.
+; DEBUG: LV: Not vectorizing: The trip count is below the minial threshold value..
+
+; DEBUG-LABEL: LV: Checking a loop in 'too_many_runtime_checks'
+; DEBUG: LV: Found trip count: 0
+; DEBUG: LV: Found maximum trip count: 16
+; DEBUG: LV: Clamping the MaxVF to maximum power of two not exceeding the constant trip count: 16
+; DEBUG: LV: IC is 1
+; DEBUG: LV: VF is 16
+; DEBUG: LV: Vectorization is not beneficial: expected trip count < minimum profitable VF (16 < 32)
+; DEBUG: LV: Too many memory checks needed.
+
+; DEBUG-LABEL: LV: Checking a loop in 'overflow_indvar_known_false'
+; DEBUG: LV: Found trip count: 0
+; DEBUG: LV: Found maximum trip count: 1027
+; DEBUG: LV: can fold tail by masking.
+; DEBUG: Executing best plan with VF=vscale x 16, UF=1
+
+define void @low_vf_ic_is_better(ptr nocapture noundef %p, i16 noundef %val) {
+; CHECK-LABEL: define void @low_vf_ic_is_better(
+; CHECK-SAME: ptr nocapture noundef [[P:%.*]], i16 noundef [[VAL:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[P_PROMOTED:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[CMP7:%.*]] = icmp ult i32 [[P_PROMOTED]], 19
+; CHECK-NEXT:    br i1 [[CMP7]], label %[[ITER_CHECK:.*]], label %[[WHILE_END:.*]]
+; CHECK:       [[ITER_CHECK]]:
+; CHECK-NEXT:    [[CONV:%.*]] = trunc i16 [[VAL]] to i8
+; CHECK-NEXT:    [[V:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 4
+; CHECK-NEXT:    [[TMP0:%.*]] = zext nneg i32 [[P_PROMOTED]] to i64
+; CHECK-NEXT:    [[TMP1:%.*]] = add i32 [[P_PROMOTED]], 1
+; CHECK-NEXT:    [[TMP2:%.*]] = zext i32 [[TMP1]] to i64
+; CHECK-NEXT:    [[TMP3:%.*]] = sub i64 20, [[TMP2]]
+; CHECK-NEXT:    [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP5:%.*]] = mul i64 [[TMP4]], 4
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP3]], [[TMP5]]
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[VEC_EPILOG_SCALAR_PH:.*]], label %[[VECTOR_SCEVCHECK:.*]]
+; CHECK:       [[VECTOR_SCEVCHECK]]:
+; CHECK-NEXT:    [[TMP6:%.*]] = add i32 [[P_PROMOTED]], 1
+; CHECK-NEXT:    [[TMP7:%.*]] = zext i32 [[TMP6]] to i64
+; CHECK-NEXT:    [[TMP8:%.*]] = sub i64 19, [[TMP7]]
+; CHECK-NEXT:    [[TMP9:%.*]] = trunc i64 [[TMP8]] to i32
+; CHECK-NEXT:    [[TMP10:%.*]] = add i32 [[TMP6]], [[TMP9]]
+; CHECK-NEXT:    [[TMP11:%.*]] = icmp ult i32 [[TMP10]], [[TMP6]]
+; CHECK-NEXT:    [[TMP12:%.*]] = icmp ugt i64 [[TMP8]], 4294967295
+; CHECK-NEXT:    [[TMP13:%.*]] = or i1 [[TMP11]], [[TMP12]]
+; CHECK-NEXT:    br i1 [[TMP13]], label %[[VEC_EPILOG_SCALAR_PH]], label %[[VECTOR_MAIN_LOOP_ITER_CHECK:.*]]
+; CHECK:       [[VECTOR_MAIN_LOOP_ITER_CHECK]]:
+; CHECK-NEXT:    [[TMP14:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP15:%.*]] = mul i64 [[TMP14]], 8
+; CHECK-NEXT:    [[MIN_ITERS_CHECK1:%.*]] = icmp ult i64 [[TMP3]], [[TMP15]]
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK1]], label %[[VEC_EPILOG_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[TMP16:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP17:%.*]] = mul i64 [[TMP16]], 8
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[TMP3]], [[TMP17]]
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[TMP3]], [[N_MOD_VF]]
+; CHECK-NEXT:    [[TMP18:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP19:%.*]] = mul i64 [[TMP18]], 8
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 8 x i8> poison, i8 [[CONV]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 8 x i8> [[BROADCAST_SPLATINSERT]], <vscale x 8 x i8> poison, <vscale x 8 x i32> zeroinitializer
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[OFFSET_IDX:%.*]] = add i64 [[TMP0]], [[INDEX]]
+; CHECK-NEXT:    [[TMP20:%.*]] = add i64 [[OFFSET_IDX]], 0
+; CHECK-NEXT:    [[TMP21:%.*]] = getelementptr inbounds [100 x i8], ptr [[V]], i64 0, i64 [[TMP20]]
+; CHECK-NEXT:    [[TMP22:%.*]] = getelementptr inbounds i8, ptr [[TMP21]], i32 0
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <vscale x 8 x i8>, ptr [[TMP22]], align 1
+; CHECK-NEXT:    [[TMP23:%.*]] = add <vscale x 8 x i8> [[WIDE_LOAD]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    store <vscale x 8 x i8> [[TMP23]], ptr [[TMP22]], align 1
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP19]]
+; CHECK-NEXT:    [[TMP31:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP31]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[TMP3]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label %[[WHILE_END_LOOPEXIT:.*]], label %[[VEC_EPILOG_ITER_CHECK:.*]]
+; CHECK:       [[VEC_EPILOG_ITER_CHECK]]:
+; CHECK-NEXT:    [[IND_END5:%.*]] = add i64 [[TMP0]], [[N_VEC]]
+; CHECK-NEXT:    [[N_VEC_REMAINING:%.*]] = sub i64 [[TMP3]], [[N_VEC]]
+; CHECK-NEXT:    [[TMP32:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP33:%.*]] = mul i64 [[TMP32]], 4
+; CHECK-NEXT:    [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_VEC_REMAINING]], [[TMP33]]
+; CHECK-NEXT:    br i1 [[MIN_EPILOG_ITERS_CHECK]], label %[[VEC_EPILOG_SCALAR_PH]], label %[[VEC_EPILOG_PH]]
+; CHECK:       [[VEC_EPILOG_PH]]:
+; CHECK-NEXT:    [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[VEC_EPILOG_ITER_CHECK]] ], [ 0, %[[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
+; CHECK-NEXT:    [[TMP34:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP35:%.*]] = mul i64 [[TMP34]], 4
+; CHECK-NEXT:    [[N_MOD_VF3:%.*]] = urem i64 [[TMP3]], [[TMP35]]
+; CHECK-NEXT:    [[N_VEC4:%.*]] = sub i64 [[TMP3]], [[N_MOD_VF3]]
+; CHECK-NEXT:    [[IND_END:%.*]] = add i64 [[TMP0]], [[N_VEC4]]
+; CHECK-NEXT:    [[TMP36:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP37:%.*]] = mul i64 [[TMP36]], 4
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT8:%.*]] = insertelement <vscale x 4 x i8> poison, i8 [[CONV]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT9:%.*]] = shufflevector <vscale x 4 x i8> [[BROADCAST_SPLATINSERT8]], <vscale x 4 x i8> poison, <vscale x 4 x i32> zeroinitializer
+; CHECK-NEXT:    br label %[[VEC_EPILOG_VECTOR_BODY:.*]]
+; CHECK:       [[VEC_EPILOG_VECTOR_BODY]]:
+; CHECK-NEXT:    [[INDEX6:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], %[[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT11:%.*]], %[[VEC_EPILOG_VECTOR_BODY]] ]
+; CHECK-NEXT:    [[OFFSET_IDX7:%.*]] = add i64 [[TMP0]], [[INDEX6]]
+; CHECK-NEXT:    [[TMP38:%.*]] = add i64 [[OFFSET_IDX7]], 0
+; CHECK-NEXT:    [[TMP39:%.*]] = getelementptr inbounds [100 x i8], ptr [[V]], i64 0, i64 [[TMP38]]
+; CHECK-NEXT:    [[TMP40:%.*]] = getelementptr inbounds i8, ptr [[TMP39]], i32 0
+; CHECK-NEXT:    [[WIDE_LOAD7:%.*]] = load <vscale x 4 x i8>, ptr [[TMP40]], align 1
+; CHECK-NEXT:    [[TMP41:%.*]] = add <vscale x 4 x i8> [[WIDE_LOAD7]], [[BROADCAST_SPLAT9]]
+; CHECK-NEXT:    store <vscale x 4 x i8> [[TMP41]], ptr [[TMP40]], align 1
+; CHECK-NEXT:    [[INDEX_NEXT11]] = add nuw i64 [[INDEX6]], [[TMP37]]
+; CHECK-NEXT:    [[TMP42:%.*]] = icmp eq i64 [[INDEX_NEXT11]], [[N_VEC4]]
+; CHECK-NEXT:    br i1 [[TMP42]], label %[[VEC_EPILOG_MIDDLE_BLOCK:.*]], label %[[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK:       [[VEC_EPILOG_MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[CMP_N12:%.*]] = icmp eq i64 [[TMP3]], [[N_VEC4]]
+; CHECK-NEXT:    br i1 [[CMP_N12]], label %[[WHILE_END_LOOPEXIT]], label %[[VEC_EPILOG_SCALAR_PH]]
+; CHECK:       [[VEC_EPILOG_SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[IND_END]], %[[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[IND_END5]], %[[VEC_EPILOG_ITER_CHECK]] ], [ [[TMP0]], %[[VECTOR_SCEVCHECK]] ], [ [[TMP0]], %[[ITER_CHECK]] ]
+; CHECK-NEXT:    br label %[[WHILE_BODY:.*]]
+; CHECK:       [[WHILE_BODY]]:
+; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[VEC_EPILOG_SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], %[[WHILE_BODY]] ]
+; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds nuw [100 x i8], ptr [[V]], i64 0, i64 [[INDVARS_IV]]
+; CHECK-NEXT:    [[TMP43:%.*]] = load i8, ptr [[ARRAYIDX]], align 1
+; CHECK-NEXT:    [[ADD:%.*]] = add i8 [[TMP43]], [[CONV]]
+; CHECK-NEXT:    store i8 [[ADD]], ptr [[ARRAYIDX]], align 1
+; CHECK-NEXT:    [[TMP44:%.*]] = and i64 [[INDVARS_IV_NEXT]], 4294967295
+; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[TMP44]], 19
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[WHILE_END_LOOPEXIT]], label %[[WHILE_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
+; CHECK:       [[WHILE_END_LOOPEXIT]]:
+; CHECK-NEXT:    br label %[[WHILE_END]]
+; CHECK:       [[WHILE_END]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  %p.promoted = load i32, ptr %p, align 4
+  %cmp7 = icmp ult i32 %p.promoted, 19
+  br i1 %cmp7, label %while.preheader, label %while.end
+
+while.preheader:
+  %conv = trunc i16 %val to i8
+  %v = getelementptr inbounds nuw i8, ptr %p, i64 4
+  %0 = zext nneg i32 %p.promoted to i64
+  br label %while.body
+
+while.body:
+  %indvars.iv = phi i64 [ %0, %while.preheader ], [ %indvars.iv.next, %while.body ]
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  %arrayidx = getelementptr inbounds nuw [100 x i8], ptr %v, i64 0, i64 %indvars.iv
+  %1 = load i8, ptr %arrayidx, align 1
+  %add = add i8 %1, %conv
+  store i8 %add, ptr %arrayidx, align 1
+  %2 = and i64 %indvars.iv.next, 4294967295
+  %exitcond.not = icmp eq i64 %2, 19
+  br i1 %exitcond.not, label %while.end, label %while.body
+
+while.end:
+  ret void
+}
+
+define void @trip_count_too_small(ptr nocapture noundef %p, i16 noundef %val) {
+; CHECK-LABEL: define void @trip_count_too_small(
+; CHECK-SAME: ptr nocapture noundef [[P:%.*]], i16 noundef [[VAL:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[P_PROMOTED:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[CMP7:%.*]] = icmp ult i32 [[P_PROMOTED]], 3
+; CHECK-NEXT:    br i1 [[CMP7]], label %[[WHILE_PREHEADER:.*]], label %[[WHILE_END:.*]]
+; CHECK:       [[WHILE_PREHEADER]]:
+; CHECK-NEXT:    [[CONV:%.*]] = trunc i16 [[VAL]] to i8
+; CHECK-NEXT:    [[V:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 4
+; CHECK-NEXT:    [[TMP0:%.*]] = zext nneg i32 [[P_PROMOTED]] to i64
+; CHECK-NEXT:    br label %[[WHILE_BODY:.*]]
+; CHECK:       [[WHILE_BODY]]:
+; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ [[TMP0]], %[[WHILE_PREHEADER]] ], [ [[INDVARS_IV_NEXT:%.*]], %[[WHILE_BODY]] ]
+; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds nuw [100 x i8], ptr [[V]], i64 0, i64 [[INDVARS_IV]]
+; CHECK-NEXT:    [[TMP43:%.*]] = load i8, ptr [[ARRAYIDX]], align 1
+; CHECK-NEXT:    [[ADD:%.*]] = add i8 [[TMP43]], [[CONV]]
+; CHECK-NEXT:    store i8 [[ADD]], ptr [[ARRAYIDX]], align 1
+; CHECK-NEXT:    [[TMP44:%.*]] = and i64 [[INDVARS_IV_NEXT]], 4294967295
+; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[TMP44]], 3
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[WHILE_END_LOOPEXIT:.*]], label %[[WHILE_BODY]]
+; CHECK:       [[WHILE_END_LOOPEXIT]]:
+; CHECK-NEXT:    br label %[[WHILE_END]]
+; CHECK:       [[WHILE_END]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  %p.promoted = load i32, ptr %p, align 4
+  %cmp7 = icmp ult i32 %p.promoted, 3
+  br i1 %cmp7, label %while.preheader, label %while.end
+
+while.preheader:
+  %conv = trunc i16 %val to i8
+  %v = getelementptr inbounds nuw i8, ptr %p, i64 4
+  %0 = zext nneg i32 %p.promoted to i64
+  br label %while.body
+
+while.body:
+  %indvars.iv = phi i64 [ %0, %while.preheader ], [ %indvars.iv.next, %while.body ]
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  %arrayidx = getelementptr inbounds nuw [100 x i8], ptr %v, i64 0, i64 %indvars.iv
+  %1 = load i8, ptr %arrayidx, align 1
+  %add = add i8 %1, %conv
+  store i8 %add, ptr %arrayidx, align 1
+  %2 = and i64 %indvars.iv.next, 4294967295
+  %exitcond.not = icmp eq i64 %2, 3
+  br i1 %exitcond.not, label %while.end, label %while.body
+
+while.end:
+  ret void
+}
+
+define void @too_many_runtime_checks(ptr nocapture noundef %p, ptr nocapture noundef %p1, ptr nocapture noundef readonly %p2, ptr nocapture noundef readonly %p3, i16 noundef %val) {
+; CHECK-LABEL: define void @too_many_runtime_checks(
+; CHECK-SAME: ptr nocapture noundef [[P:%.*]], ptr nocapture noundef [[P1:%.*]], ptr nocapture noundef readonly [[P2:%.*]], ptr nocapture noundef readonly [[P3:%.*]], i16 noundef [[VAL:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[TMP0:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[CMP20:%.*]] = icmp ult i32 [[TMP0]], 16
+; CHECK-NEXT:    br i1 [[CMP20]], label %[[WHILE_PREHEADER:.*]], label %[[WHILE_END:.*]]
+; CHECK:       [[WHILE_PREHEADER]]:
+; CHECK-NEXT:    [[CONV8:%.*]] = trunc i16 [[VAL]] to i8
+; CHECK-NEXT:    [[V:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 4
+; CHECK-NEXT:    [[TMP1:%.*]] = zext nneg i32 [[TMP0]] to i64
+; CHECK-NEXT:    br label %[[WHILE_BODY:.*]]
+; CHECK:       [[WHILE_BODY]]:
+; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ [[TMP1]], %[[WHILE_PREHEADER]] ], [ [[INDVARS_IV_NEXT:%.*]], %[[WHILE_BODY]] ]
+; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds nuw i8, ptr [[P2]], i64 [[INDVARS_IV]]
+; CHECK-NEXT:    [[TMP60:%.*]] = load i8, ptr [[ARRAYIDX]], align 1
+; CHECK-NEXT:    [[ARRAYIDX2:%.*]] = getelementptr inbounds nuw i8, ptr [[P3]], i64 [[INDVARS_IV]]
+; CHECK-NEXT:    [[TMP61:%.*]] = load i8, ptr [[ARRAYIDX2]], align 1
+; CHECK-NEXT:    [[MUL:%.*]] = mul i8 [[TMP61]], [[TMP60]]
+; CHECK-NEXT:    [[ARRAYIDX5:%.*]] = getelementptr inbounds nuw i8, ptr [[P1]], i64 [[INDVARS_IV]]
+; CHECK-NEXT:    [[TMP62:%.*]] = load i8, ptr [[ARRAYIDX5]], align 1
+; CHECK-NEXT:    [[ADD:%.*]] = add i8 [[MUL]], [[TMP62]]
+; CHECK-NEXT:    store i8 [[ADD]], ptr [[ARRAYIDX5]], align 1
+; CHECK-NEXT:    [[ARRAYIDX10:%.*]] = getelementptr inbounds nuw [100 x i8], ptr [[V]], i64 0, i64 [[INDVARS_IV]]
+; CHECK-NEXT:    [[TMP63:%.*]] = load i8, ptr [[ARRAYIDX10]], align 1
+; CHECK-NEXT:    [[ADD12:%.*]] = add i8 [[TMP63]], [[CONV8]]
+; CHECK-NEXT:    store i8 [[ADD12]], ptr [[ARRAYIDX10]], align 1
+; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; CHECK-NEXT:    [[TMP64:%.*]] = and i64 [[INDVARS_IV_NEXT]], 4294967295
+; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[TMP64]], 16
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[WHILE_END_LOOPEXIT:.*]], label %[[WHILE_BODY]]
+; CHECK:       [[WHILE_END_LOOPEXIT]]:
+; CHECK-NEXT:    br label %[[WHILE_END]]
+; CHECK:       [[WHILE_END]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  %0 = load i32, ptr %p, align 4
+  %cmp20 = icmp ult i32 %0, 16
+  br i1 %cmp20, label %while.preheader, label %while.end
+
+while.preheader:
+  %conv8 = trunc i16 %val to i8
+  %v = getelementptr inbounds nuw i8, ptr %p, i64 4
+  %1 = zext nneg i32 %0 to i64
+  br label %while.body
+
+while.body:
+  %indvars.iv = phi i64 [ %1, %while.preheader ], [ %indvars.iv.next, %while.body ]
+  %arrayidx = getelementptr inbounds nuw i8, ptr %p2, i64 %indvars.iv
+  %2 = load i8, ptr %arrayidx, align 1
+  %arrayidx2 = getelementptr inbounds nuw i8, ptr %p3, i64 %indvars.iv
+  %3 = load i8, ptr %arrayidx2, align 1
+  %mul = mul i8 %3, %2
+  %arrayidx5 = getelementptr inbounds nuw i8, ptr %p1, i64 %indvars.iv
+  %4 = load i8, ptr %arrayidx5, align 1
+  %add = add i8 %mul, %4
+  store i8 %add, ptr %arrayidx5, align 1
+  %arrayidx10 = getelementptr inbounds nuw [100 x i8], ptr %v, i64 0, i64 %indvars.iv
+  %5 = load i8, ptr %arrayidx10, align 1
+  %add12 = add i8 %5, %conv8
+  store i8 %add12, ptr %arrayidx10, align 1
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  %6 = and i64 %indvars.iv.next, 4294967295
+  %exitcond.not = icmp eq i64 %6, 16
+  br i1 %exitcond.not, label %while.end, label %while.body
+
+while.end:
+  ret void
+}
+
+define void @overflow_indvar_known_false(ptr nocapture noundef %p, i16 noundef %val) vscale_range(1,16) {
+; CHECK-LABEL: define void @overflow_indvar_known_false(
+; CHECK-SAME: ptr nocapture noundef [[P:%.*]], i16 noundef [[VAL:%.*]]) #[[ATTR1:[0-9]+]] {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[P_PROMOTED:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[CMP7:%.*]] = icmp ult i32 [[P_PROMOTED]], 1027
+; CHECK-NEXT:    br i1 [[CMP7]], label %[[WHILE_PREHEADER:.*]], label %[[WHILE_END:.*]]
+; CHECK:       [[WHILE_PREHEADER]]:
+; CHECK-NEXT:    [[CONV:%.*]] = trunc i16 [[VAL]] to i8
+; CHECK-NEXT:    [[V:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 4
+; CHECK-NEXT:    [[TMP0:%.*]] = zext nneg i32 [[P_PROMOTED]] to i64
+; CHECK-NEXT:    [[TMP19:%.*]] = add i32 [[P_PROMOTED]], 1
+; CHECK-NEXT:    [[TMP20:%.*]] = zext i32 [[TMP19...
[truncated]


SmallVector<const SCEVPredicate *, 2> Predicates;
unsigned MaxTC =
PSE.getSE()->getSmallConstantMaxTripCount(TheLoop, &Predicates);
Contributor

Should this use PSE.getSmallConstantMaxTripCount() to ensure the predicates are added to PSE?

Contributor Author

Hmm, that's a good question. I'd assumed that when we call PSE.getBackedgeTakenCount(TheLoop, &Predicates) at some point later on to create the trip count, it would just end up adding the same predicates anyway.

Does it matter if we attempt to add the same predicate twice? If not, then I'm happy to change this and use the predicated version.

Contributor Author

I think it's safe to add the predicates multiple times, since the addPredicate function in PredicatedScalarEvolution already checks whether the new predicate is implied by its existing list. I've added a new getSmallConstantMaxTripCount to PredicatedScalarEvolution and cached the count to avoid unnecessary recomputation, similar to how we cache the backedge-taken count.

Contributor

Yes, PSE should take care to avoid adding duplicated predicates, thanks for adjusting!

Contributor

Should a similar change also be done for the recent changes for early exit checks in LV?

Contributor Author

Do you mean the calls to PSE.getSE()->getPredicatedExitCount in LoopVectorizationLegality::isVectorizableEarlyExitLoop? I'm not sure if we want to do exactly the same, i.e. cache the exit count for every exit. I could potentially add an equivalent method to PredicatedScalarEvolution that automatically adds the predicates, but probably best done in a separate patch because it's unrelated to the max trip count. What do you think?

@david-arm david-arm requested a review from nikic as a code owner September 27, 2024 14:07
@llvmbot llvmbot added the llvm:analysis Includes value tracking, cost tables and constant folding label Sep 27, 2024
@david-arm
Contributor Author

david-arm commented Oct 3, 2024

Deal with conflicts after rebase

const SCEV *SymbolicMaxBackedgeCount = nullptr;

/// The constant max trip count for the loop.
std::optional<unsigned> SmallConstantMaxTripCount;
Contributor

Suggested change
std::optional<unsigned> SmallConstantMaxTripCount;
const SCEV* SmallConstantMaxTripCount = nullptr;

for consistency with BackedgeCount and SymbolicMaxBackedgeCount above?

Contributor Author

The problem with that is the max trip count is measured as an unsigned value. I could potentially cache the intermediate result in the function:

unsigned ScalarEvolution::getSmallConstantMaxTripCount(
    const Loop *L, SmallVectorImpl<const SCEVPredicate *> *Predicates) {

  const auto *MaxExitCount =
      Predicates ? getPredicatedConstantMaxBackedgeTakenCount(L, *Predicates)
                 : getConstantMaxBackedgeTakenCount(L);
  return getConstantTripCount(dyn_cast<SCEVConstant>(MaxExitCount));
}

i.e. MaxExitCount, but then you still have to call getConstantTripCount each time, so caching gives less benefit. That's why I chose std::optional<unsigned>. Alternatively, I could use a larger type such as uint64_t and treat UINT64_MAX as meaning "not cached yet".

Contributor

Ah I see! It's not worth the trouble then I think.

Contributor

@fhahn fhahn left a comment

LGTM, thanks!


@david-arm david-arm merged commit 72f339d into llvm:main Oct 11, 2024
DanielCChen pushed a commit to DanielCChen/llvm-project that referenced this pull request Oct 16, 2024
llvm#109928)

@david-arm david-arm deleted the pred_small_tc branch January 28, 2025 11:50

Labels

backend:RISC-V llvm:analysis Includes value tracking, cost tables and constant folding llvm:transforms vectorizers
