[LoopVectorize] Use predicated version of getSmallConstantMaxTripCount #109928
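@llvm/pr-subscribers-llvm-analysis @llvm/pr-subscribers-llvm-transforms

Author: David Sherwood (david-arm)

Changes

There are a number of places where we call getSmallConstantMaxTripCount
without passing a vector of predicates:

getSmallBestKnownTC
isIndvarOverflowCheckKnownFalse
computeMaxVF
isMoreProfitable

I've changed all of these to now pass in a predicate vector so that
we get the benefit of making better vectorisation choices when we
know the max trip count for loops that require SCEV predicate checks.

I've tried to add tests that cover all the cases affected by these
changes.

Patch is 27.39 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/109928.diff

3 Files Affected: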
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 0566d80c1cc001..55b59458a0aa41 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -422,7 +422,8 @@ static std::optional<unsigned> getSmallBestKnownTC(ScalarEvolution &SE,
return *EstimatedTC;
// Check if upper bound estimate is known.
- if (unsigned ExpectedTC = SE.getSmallConstantMaxTripCount(L))
+ SmallVector<const SCEVPredicate *, 2> Predicates;
+ if (unsigned ExpectedTC = SE.getSmallConstantMaxTripCount(L, &Predicates))
return ExpectedTC;
return std::nullopt;
@@ -2298,8 +2299,9 @@ static bool isIndvarOverflowCheckKnownFalse(
// We know the runtime overflow check is known false iff the (max) trip-count
// is known and (max) trip-count + (VF * UF) does not overflow in the type of
// the vector loop induction variable.
- if (unsigned TC =
- Cost->PSE.getSE()->getSmallConstantMaxTripCount(Cost->TheLoop)) {
+ SmallVector<const SCEVPredicate *, 2> Predicates;
+ if (unsigned TC = Cost->PSE.getSE()->getSmallConstantMaxTripCount(
+ Cost->TheLoop, &Predicates)) {
uint64_t MaxVF = VF.getKnownMinValue();
if (VF.isScalable()) {
std::optional<unsigned> MaxVScale =
@@ -3994,8 +3996,13 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
}
unsigned TC = PSE.getSE()->getSmallConstantTripCount(TheLoop);
- unsigned MaxTC = PSE.getSE()->getSmallConstantMaxTripCount(TheLoop);
+
+ SmallVector<const SCEVPredicate *, 2> Predicates;
+ unsigned MaxTC =
+ PSE.getSE()->getSmallConstantMaxTripCount(TheLoop, &Predicates);
LLVM_DEBUG(dbgs() << "LV: Found trip count: " << TC << '\n');
+ if (TC != MaxTC)
+ LLVM_DEBUG(dbgs() << "LV: Found maximum trip count: " << MaxTC << '\n');
if (TC == 1) {
reportVectorizationFailure("Single iteration (non) loop",
"loop trip count is one, irrelevant for vectorization",
@@ -4283,7 +4290,9 @@ bool LoopVectorizationPlanner::isMoreProfitable(
InstructionCost CostA = A.Cost;
InstructionCost CostB = B.Cost;
- unsigned MaxTripCount = PSE.getSE()->getSmallConstantMaxTripCount(OrigLoop);
+ SmallVector<const SCEVPredicate *, 2> Predicates;
+ unsigned MaxTripCount =
+ PSE.getSE()->getSmallConstantMaxTripCount(OrigLoop, &Predicates);
// Improve estimate for the vector width if it is scalable.
unsigned EstimatedWidthA = A.Width.getKnownMinValue();
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/low_trip_count_predicates.ll b/llvm/test/Transforms/LoopVectorize/AArch64/low_trip_count_predicates.ll
new file mode 100644
index 00000000000000..2cdfa0d1564219
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/low_trip_count_predicates.ll
@@ -0,0 +1,405 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; REQUIRES: asserts
+; RUN: opt -S < %s -p loop-vectorize -debug-only=loop-vectorize -mattr=+sve 2>%t | FileCheck %s
+; RUN: cat %t | FileCheck %s --check-prefix=DEBUG
+
+target triple = "aarch64-unknown-linux-gnu"
+
+; DEBUG-LABEL: LV: Checking a loop in 'low_vf_ic_is_better'
+; DEBUG: LV: Found trip count: 0
+; DEBUG: LV: Found maximum trip count: 19
+; DEBUG: LV: IC is 1
+; DEBUG: LV: VF is vscale x 8
+; DEBUG: Main Loop VF:vscale x 8, Main Loop UF:1, Epilogue Loop VF:vscale x 4, Epilogue Loop UF:1
+
+; DEBUG-LABEL: LV: Checking a loop in 'trip_count_too_small'
+; DEBUG: LV: Found a loop with a very small trip count. This loop is worth vectorizing only if no scalar iteration overheads are incurred.
+; DEBUG: LV: Not vectorizing: The trip count is below the minial threshold value..
+
+; DEBUG-LABEL: LV: Checking a loop in 'too_many_runtime_checks'
+; DEBUG: LV: Found trip count: 0
+; DEBUG: LV: Found maximum trip count: 16
+; DEBUG: LV: Clamping the MaxVF to maximum power of two not exceeding the constant trip count: 16
+; DEBUG: LV: IC is 1
+; DEBUG: LV: VF is 16
+; DEBUG: LV: Vectorization is not beneficial: expected trip count < minimum profitable VF (16 < 32)
+; DEBUG: LV: Too many memory checks needed.
+
+; DEBUG-LABEL: LV: Checking a loop in 'overflow_indvar_known_false'
+; DEBUG: LV: Found trip count: 0
+; DEBUG: LV: Found maximum trip count: 1027
+; DEBUG: LV: can fold tail by masking.
+; DEBUG: Executing best plan with VF=vscale x 16, UF=1
+
+define void @low_vf_ic_is_better(ptr nocapture noundef %p, i16 noundef %val) {
+; CHECK-LABEL: define void @low_vf_ic_is_better(
+; CHECK-SAME: ptr nocapture noundef [[P:%.*]], i16 noundef [[VAL:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT: [[ENTRY:.*:]]
+; CHECK-NEXT: [[P_PROMOTED:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT: [[CMP7:%.*]] = icmp ult i32 [[P_PROMOTED]], 19
+; CHECK-NEXT: br i1 [[CMP7]], label %[[ITER_CHECK:.*]], label %[[WHILE_END:.*]]
+; CHECK: [[ITER_CHECK]]:
+; CHECK-NEXT: [[CONV:%.*]] = trunc i16 [[VAL]] to i8
+; CHECK-NEXT: [[V:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 4
+; CHECK-NEXT: [[TMP0:%.*]] = zext nneg i32 [[P_PROMOTED]] to i64
+; CHECK-NEXT: [[TMP1:%.*]] = add i32 [[P_PROMOTED]], 1
+; CHECK-NEXT: [[TMP2:%.*]] = zext i32 [[TMP1]] to i64
+; CHECK-NEXT: [[TMP3:%.*]] = sub i64 20, [[TMP2]]
+; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4
+; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP3]], [[TMP5]]
+; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[VEC_EPILOG_SCALAR_PH:.*]], label %[[VECTOR_SCEVCHECK:.*]]
+; CHECK: [[VECTOR_SCEVCHECK]]:
+; CHECK-NEXT: [[TMP6:%.*]] = add i32 [[P_PROMOTED]], 1
+; CHECK-NEXT: [[TMP7:%.*]] = zext i32 [[TMP6]] to i64
+; CHECK-NEXT: [[TMP8:%.*]] = sub i64 19, [[TMP7]]
+; CHECK-NEXT: [[TMP9:%.*]] = trunc i64 [[TMP8]] to i32
+; CHECK-NEXT: [[TMP10:%.*]] = add i32 [[TMP6]], [[TMP9]]
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ult i32 [[TMP10]], [[TMP6]]
+; CHECK-NEXT: [[TMP12:%.*]] = icmp ugt i64 [[TMP8]], 4294967295
+; CHECK-NEXT: [[TMP13:%.*]] = or i1 [[TMP11]], [[TMP12]]
+; CHECK-NEXT: br i1 [[TMP13]], label %[[VEC_EPILOG_SCALAR_PH]], label %[[VECTOR_MAIN_LOOP_ITER_CHECK:.*]]
+; CHECK: [[VECTOR_MAIN_LOOP_ITER_CHECK]]:
+; CHECK-NEXT: [[TMP14:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP15:%.*]] = mul i64 [[TMP14]], 8
+; CHECK-NEXT: [[MIN_ITERS_CHECK1:%.*]] = icmp ult i64 [[TMP3]], [[TMP15]]
+; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK1]], label %[[VEC_EPILOG_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK: [[VECTOR_PH]]:
+; CHECK-NEXT: [[TMP16:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP17:%.*]] = mul i64 [[TMP16]], 8
+; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP3]], [[TMP17]]
+; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP3]], [[N_MOD_VF]]
+; CHECK-NEXT: [[TMP18:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP19:%.*]] = mul i64 [[TMP18]], 8
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 8 x i8> poison, i8 [[CONV]], i64 0
+; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 8 x i8> [[BROADCAST_SPLATINSERT]], <vscale x 8 x i8> poison, <vscale x 8 x i32> zeroinitializer
+; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK: [[VECTOR_BODY]]:
+; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[OFFSET_IDX:%.*]] = add i64 [[TMP0]], [[INDEX]]
+; CHECK-NEXT: [[TMP20:%.*]] = add i64 [[OFFSET_IDX]], 0
+; CHECK-NEXT: [[TMP21:%.*]] = getelementptr inbounds [100 x i8], ptr [[V]], i64 0, i64 [[TMP20]]
+; CHECK-NEXT: [[TMP22:%.*]] = getelementptr inbounds i8, ptr [[TMP21]], i32 0
+; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 8 x i8>, ptr [[TMP22]], align 1
+; CHECK-NEXT: [[TMP23:%.*]] = add <vscale x 8 x i8> [[WIDE_LOAD]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT: store <vscale x 8 x i8> [[TMP23]], ptr [[TMP22]], align 1
+; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP19]]
+; CHECK-NEXT: [[TMP31:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT: br i1 [[TMP31]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK: [[MIDDLE_BLOCK]]:
+; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP3]], [[N_VEC]]
+; CHECK-NEXT: br i1 [[CMP_N]], label %[[WHILE_END_LOOPEXIT:.*]], label %[[VEC_EPILOG_ITER_CHECK:.*]]
+; CHECK: [[VEC_EPILOG_ITER_CHECK]]:
+; CHECK-NEXT: [[IND_END5:%.*]] = add i64 [[TMP0]], [[N_VEC]]
+; CHECK-NEXT: [[N_VEC_REMAINING:%.*]] = sub i64 [[TMP3]], [[N_VEC]]
+; CHECK-NEXT: [[TMP32:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP33:%.*]] = mul i64 [[TMP32]], 4
+; CHECK-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_VEC_REMAINING]], [[TMP33]]
+; CHECK-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK]], label %[[VEC_EPILOG_SCALAR_PH]], label %[[VEC_EPILOG_PH]]
+; CHECK: [[VEC_EPILOG_PH]]:
+; CHECK-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[VEC_EPILOG_ITER_CHECK]] ], [ 0, %[[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
+; CHECK-NEXT: [[TMP34:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP35:%.*]] = mul i64 [[TMP34]], 4
+; CHECK-NEXT: [[N_MOD_VF3:%.*]] = urem i64 [[TMP3]], [[TMP35]]
+; CHECK-NEXT: [[N_VEC4:%.*]] = sub i64 [[TMP3]], [[N_MOD_VF3]]
+; CHECK-NEXT: [[IND_END:%.*]] = add i64 [[TMP0]], [[N_VEC4]]
+; CHECK-NEXT: [[TMP36:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP37:%.*]] = mul i64 [[TMP36]], 4
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT8:%.*]] = insertelement <vscale x 4 x i8> poison, i8 [[CONV]], i64 0
+; CHECK-NEXT: [[BROADCAST_SPLAT9:%.*]] = shufflevector <vscale x 4 x i8> [[BROADCAST_SPLATINSERT8]], <vscale x 4 x i8> poison, <vscale x 4 x i32> zeroinitializer
+; CHECK-NEXT: br label %[[VEC_EPILOG_VECTOR_BODY:.*]]
+; CHECK: [[VEC_EPILOG_VECTOR_BODY]]:
+; CHECK-NEXT: [[INDEX6:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], %[[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT11:%.*]], %[[VEC_EPILOG_VECTOR_BODY]] ]
+; CHECK-NEXT: [[OFFSET_IDX7:%.*]] = add i64 [[TMP0]], [[INDEX6]]
+; CHECK-NEXT: [[TMP38:%.*]] = add i64 [[OFFSET_IDX7]], 0
+; CHECK-NEXT: [[TMP39:%.*]] = getelementptr inbounds [100 x i8], ptr [[V]], i64 0, i64 [[TMP38]]
+; CHECK-NEXT: [[TMP40:%.*]] = getelementptr inbounds i8, ptr [[TMP39]], i32 0
+; CHECK-NEXT: [[WIDE_LOAD7:%.*]] = load <vscale x 4 x i8>, ptr [[TMP40]], align 1
+; CHECK-NEXT: [[TMP41:%.*]] = add <vscale x 4 x i8> [[WIDE_LOAD7]], [[BROADCAST_SPLAT9]]
+; CHECK-NEXT: store <vscale x 4 x i8> [[TMP41]], ptr [[TMP40]], align 1
+; CHECK-NEXT: [[INDEX_NEXT11]] = add nuw i64 [[INDEX6]], [[TMP37]]
+; CHECK-NEXT: [[TMP42:%.*]] = icmp eq i64 [[INDEX_NEXT11]], [[N_VEC4]]
+; CHECK-NEXT: br i1 [[TMP42]], label %[[VEC_EPILOG_MIDDLE_BLOCK:.*]], label %[[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK: [[VEC_EPILOG_MIDDLE_BLOCK]]:
+; CHECK-NEXT: [[CMP_N12:%.*]] = icmp eq i64 [[TMP3]], [[N_VEC4]]
+; CHECK-NEXT: br i1 [[CMP_N12]], label %[[WHILE_END_LOOPEXIT]], label %[[VEC_EPILOG_SCALAR_PH]]
+; CHECK: [[VEC_EPILOG_SCALAR_PH]]:
+; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[IND_END]], %[[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[IND_END5]], %[[VEC_EPILOG_ITER_CHECK]] ], [ [[TMP0]], %[[VECTOR_SCEVCHECK]] ], [ [[TMP0]], %[[ITER_CHECK]] ]
+; CHECK-NEXT: br label %[[WHILE_BODY:.*]]
+; CHECK: [[WHILE_BODY]]:
+; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[VEC_EPILOG_SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], %[[WHILE_BODY]] ]
+; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds nuw [100 x i8], ptr [[V]], i64 0, i64 [[INDVARS_IV]]
+; CHECK-NEXT: [[TMP43:%.*]] = load i8, ptr [[ARRAYIDX]], align 1
+; CHECK-NEXT: [[ADD:%.*]] = add i8 [[TMP43]], [[CONV]]
+; CHECK-NEXT: store i8 [[ADD]], ptr [[ARRAYIDX]], align 1
+; CHECK-NEXT: [[TMP44:%.*]] = and i64 [[INDVARS_IV_NEXT]], 4294967295
+; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[TMP44]], 19
+; CHECK-NEXT: br i1 [[EXITCOND_NOT]], label %[[WHILE_END_LOOPEXIT]], label %[[WHILE_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
+; CHECK: [[WHILE_END_LOOPEXIT]]:
+; CHECK-NEXT: br label %[[WHILE_END]]
+; CHECK: [[WHILE_END]]:
+; CHECK-NEXT: ret void
+;
+entry:
+ %p.promoted = load i32, ptr %p, align 4
+ %cmp7 = icmp ult i32 %p.promoted, 19
+ br i1 %cmp7, label %while.preheader, label %while.end
+
+while.preheader:
+ %conv = trunc i16 %val to i8
+ %v = getelementptr inbounds nuw i8, ptr %p, i64 4
+ %0 = zext nneg i32 %p.promoted to i64
+ br label %while.body
+
+while.body:
+ %indvars.iv = phi i64 [ %0, %while.preheader ], [ %indvars.iv.next, %while.body ]
+ %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+ %arrayidx = getelementptr inbounds nuw [100 x i8], ptr %v, i64 0, i64 %indvars.iv
+ %1 = load i8, ptr %arrayidx, align 1
+ %add = add i8 %1, %conv
+ store i8 %add, ptr %arrayidx, align 1
+ %2 = and i64 %indvars.iv.next, 4294967295
+ %exitcond.not = icmp eq i64 %2, 19
+ br i1 %exitcond.not, label %while.end, label %while.body
+
+while.end:
+ ret void
+}
+
+define void @trip_count_too_small(ptr nocapture noundef %p, i16 noundef %val) {
+; CHECK-LABEL: define void @trip_count_too_small(
+; CHECK-SAME: ptr nocapture noundef [[P:%.*]], i16 noundef [[VAL:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[ENTRY:.*:]]
+; CHECK-NEXT: [[P_PROMOTED:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT: [[CMP7:%.*]] = icmp ult i32 [[P_PROMOTED]], 3
+; CHECK-NEXT: br i1 [[CMP7]], label %[[WHILE_PREHEADER:.*]], label %[[WHILE_END:.*]]
+; CHECK: [[WHILE_PREHEADER]]:
+; CHECK-NEXT: [[CONV:%.*]] = trunc i16 [[VAL]] to i8
+; CHECK-NEXT: [[V:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 4
+; CHECK-NEXT: [[TMP0:%.*]] = zext nneg i32 [[P_PROMOTED]] to i64
+; CHECK-NEXT: br label %[[WHILE_BODY:.*]]
+; CHECK: [[WHILE_BODY]]:
+; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[TMP0]], %[[WHILE_PREHEADER]] ], [ [[INDVARS_IV_NEXT:%.*]], %[[WHILE_BODY]] ]
+; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds nuw [100 x i8], ptr [[V]], i64 0, i64 [[INDVARS_IV]]
+; CHECK-NEXT: [[TMP43:%.*]] = load i8, ptr [[ARRAYIDX]], align 1
+; CHECK-NEXT: [[ADD:%.*]] = add i8 [[TMP43]], [[CONV]]
+; CHECK-NEXT: store i8 [[ADD]], ptr [[ARRAYIDX]], align 1
+; CHECK-NEXT: [[TMP44:%.*]] = and i64 [[INDVARS_IV_NEXT]], 4294967295
+; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[TMP44]], 3
+; CHECK-NEXT: br i1 [[EXITCOND_NOT]], label %[[WHILE_END_LOOPEXIT:.*]], label %[[WHILE_BODY]]
+; CHECK: [[WHILE_END_LOOPEXIT]]:
+; CHECK-NEXT: br label %[[WHILE_END]]
+; CHECK: [[WHILE_END]]:
+; CHECK-NEXT: ret void
+;
+entry:
+ %p.promoted = load i32, ptr %p, align 4
+ %cmp7 = icmp ult i32 %p.promoted, 3
+ br i1 %cmp7, label %while.preheader, label %while.end
+
+while.preheader:
+ %conv = trunc i16 %val to i8
+ %v = getelementptr inbounds nuw i8, ptr %p, i64 4
+ %0 = zext nneg i32 %p.promoted to i64
+ br label %while.body
+
+while.body:
+ %indvars.iv = phi i64 [ %0, %while.preheader ], [ %indvars.iv.next, %while.body ]
+ %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+ %arrayidx = getelementptr inbounds nuw [100 x i8], ptr %v, i64 0, i64 %indvars.iv
+ %1 = load i8, ptr %arrayidx, align 1
+ %add = add i8 %1, %conv
+ store i8 %add, ptr %arrayidx, align 1
+ %2 = and i64 %indvars.iv.next, 4294967295
+ %exitcond.not = icmp eq i64 %2, 3
+ br i1 %exitcond.not, label %while.end, label %while.body
+
+while.end:
+ ret void
+}
+
+define void @too_many_runtime_checks(ptr nocapture noundef %p, ptr nocapture noundef %p1, ptr nocapture noundef readonly %p2, ptr nocapture noundef readonly %p3, i16 noundef %val) {
+; CHECK-LABEL: define void @too_many_runtime_checks(
+; CHECK-SAME: ptr nocapture noundef [[P:%.*]], ptr nocapture noundef [[P1:%.*]], ptr nocapture noundef readonly [[P2:%.*]], ptr nocapture noundef readonly [[P3:%.*]], i16 noundef [[VAL:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[ENTRY:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT: [[CMP20:%.*]] = icmp ult i32 [[TMP0]], 16
+; CHECK-NEXT: br i1 [[CMP20]], label %[[WHILE_PREHEADER:.*]], label %[[WHILE_END:.*]]
+; CHECK: [[WHILE_PREHEADER]]:
+; CHECK-NEXT: [[CONV8:%.*]] = trunc i16 [[VAL]] to i8
+; CHECK-NEXT: [[V:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 4
+; CHECK-NEXT: [[TMP1:%.*]] = zext nneg i32 [[TMP0]] to i64
+; CHECK-NEXT: br label %[[WHILE_BODY:.*]]
+; CHECK: [[WHILE_BODY]]:
+; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[TMP1]], %[[WHILE_PREHEADER]] ], [ [[INDVARS_IV_NEXT:%.*]], %[[WHILE_BODY]] ]
+; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds nuw i8, ptr [[P2]], i64 [[INDVARS_IV]]
+; CHECK-NEXT: [[TMP60:%.*]] = load i8, ptr [[ARRAYIDX]], align 1
+; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds nuw i8, ptr [[P3]], i64 [[INDVARS_IV]]
+; CHECK-NEXT: [[TMP61:%.*]] = load i8, ptr [[ARRAYIDX2]], align 1
+; CHECK-NEXT: [[MUL:%.*]] = mul i8 [[TMP61]], [[TMP60]]
+; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds nuw i8, ptr [[P1]], i64 [[INDVARS_IV]]
+; CHECK-NEXT: [[TMP62:%.*]] = load i8, ptr [[ARRAYIDX5]], align 1
+; CHECK-NEXT: [[ADD:%.*]] = add i8 [[MUL]], [[TMP62]]
+; CHECK-NEXT: store i8 [[ADD]], ptr [[ARRAYIDX5]], align 1
+; CHECK-NEXT: [[ARRAYIDX10:%.*]] = getelementptr inbounds nuw [100 x i8], ptr [[V]], i64 0, i64 [[INDVARS_IV]]
+; CHECK-NEXT: [[TMP63:%.*]] = load i8, ptr [[ARRAYIDX10]], align 1
+; CHECK-NEXT: [[ADD12:%.*]] = add i8 [[TMP63]], [[CONV8]]
+; CHECK-NEXT: store i8 [[ADD12]], ptr [[ARRAYIDX10]], align 1
+; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; CHECK-NEXT: [[TMP64:%.*]] = and i64 [[INDVARS_IV_NEXT]], 4294967295
+; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[TMP64]], 16
+; CHECK-NEXT: br i1 [[EXITCOND_NOT]], label %[[WHILE_END_LOOPEXIT:.*]], label %[[WHILE_BODY]]
+; CHECK: [[WHILE_END_LOOPEXIT]]:
+; CHECK-NEXT: br label %[[WHILE_END]]
+; CHECK: [[WHILE_END]]:
+; CHECK-NEXT: ret void
+;
+entry:
+ %0 = load i32, ptr %p, align 4
+ %cmp20 = icmp ult i32 %0, 16
+ br i1 %cmp20, label %while.preheader, label %while.end
+
+while.preheader:
+ %conv8 = trunc i16 %val to i8
+ %v = getelementptr inbounds nuw i8, ptr %p, i64 4
+ %1 = zext nneg i32 %0 to i64
+ br label %while.body
+
+while.body:
+ %indvars.iv = phi i64 [ %1, %while.preheader ], [ %indvars.iv.next, %while.body ]
+ %arrayidx = getelementptr inbounds nuw i8, ptr %p2, i64 %indvars.iv
+ %2 = load i8, ptr %arrayidx, align 1
+ %arrayidx2 = getelementptr inbounds nuw i8, ptr %p3, i64 %indvars.iv
+ %3 = load i8, ptr %arrayidx2, align 1
+ %mul = mul i8 %3, %2
+ %arrayidx5 = getelementptr inbounds nuw i8, ptr %p1, i64 %indvars.iv
+ %4 = load i8, ptr %arrayidx5, align 1
+ %add = add i8 %mul, %4
+ store i8 %add, ptr %arrayidx5, align 1
+ %arrayidx10 = getelementptr inbounds nuw [100 x i8], ptr %v, i64 0, i64 %indvars.iv
+ %5 = load i8, ptr %arrayidx10, align 1
+ %add12 = add i8 %5, %conv8
+ store i8 %add12, ptr %arrayidx10, align 1
+ %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+ %6 = and i64 %indvars.iv.next, 4294967295
+ %exitcond.not = icmp eq i64 %6, 16
+ br i1 %exitcond.not, label %while.end, label %while.body
+
+while.end:
+ ret void
+}
+
+define void @overflow_indvar_known_false(ptr nocapture noundef %p, i16 noundef %val) vscale_range(1,16) {
+; CHECK-LABEL: define void @overflow_indvar_known_false(
+; CHECK-SAME: ptr nocapture noundef [[P:%.*]], i16 noundef [[VAL:%.*]]) #[[ATTR1:[0-9]+]] {
+; CHECK-NEXT: [[ENTRY:.*:]]
+; CHECK-NEXT: [[P_PROMOTED:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT: [[CMP7:%.*]] = icmp ult i32 [[P_PROMOTED]], 1027
+; CHECK-NEXT: br i1 [[CMP7]], label %[[WHILE_PREHEADER:.*]], label %[[WHILE_END:.*]]
+; CHECK: [[WHILE_PREHEADER]]:
+; CHECK-NEXT: [[CONV:%.*]] = trunc i16 [[VAL]] to i8
+; CHECK-NEXT: [[V:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 4
+; CHECK-NEXT: [[TMP0:%.*]] = zext nneg i32 [[P_PROMOTED]] to i64
+; CHECK-NEXT: [[TMP19:%.*]] = add i32 [[P_PROMOTED]], 1
+; CHECK-NEXT: [[TMP20:%.*]] = zext i32 [[TMP19...
[truncated]
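The loops in the new test share one shape: a 32-bit start value is loaded, the loop is entered only if that value is below a constant bound, and the exit test compares the low 32 bits of a 64-bit induction variable against the same bound. The following is a toy C++ model of that shape, not LLVM code. The names `countIterations` and `maxIterations` are hypothetical, and the model only illustrates why the maximum trip count equals the bound (19, 3, 16 and 1027 in the four tests) even though the exact trip count is not a compile-time constant.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of the test loops (hypothetical helper, not part of LLVM).
// Mirrors the IR: guard `icmp ult i32 %start, Bound`, a zero-extended i64
// induction variable, and an exit test on the low 32 bits of the next value.
unsigned countIterations(uint32_t Start, uint32_t Bound) {
  if (Start >= Bound)
    return 0; // Loop is never entered.
  uint64_t IV = Start; // zext i32 -> i64
  unsigned Iterations = 0;
  for (;;) {
    ++Iterations;
    uint64_t Next = IV + 1;
    // Exit when the low 32 bits of the incremented IV reach the bound.
    if ((Next & 0xffffffffULL) == Bound)
      break;
    IV = Next;
  }
  return Iterations;
}

// Largest iteration count over every start value that passes the guard.
unsigned maxIterations(uint32_t Bound) {
  unsigned Max = 0;
  for (uint32_t Start = 0; Start < Bound; ++Start) {
    unsigned N = countIterations(Start, Bound);
    if (N > Max)
      Max = N;
  }
  return Max;
}
```

For a bound of 19 the model gives `19 - Start` iterations for each start value that passes the guard, so the maximum is 19, which matches the `LV: Found maximum trip count: 19` line the test expects for `low_vf_ic_is_better`.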
SmallVector<const SCEVPredicate *, 2> Predicates;
unsigned MaxTC =
    PSE.getSE()->getSmallConstantMaxTripCount(TheLoop, &Predicates);
Should this use PSE.getSmallConstantMaxTripCount() to ensure the predicates are added to PSE?
Hmm, that's a good question. I'd assumed that when we call PSE.getBackedgeTakenCount(TheLoop, &Predicates) at some point later on to create the trip count, that it would just end up adding the same predicates anyway.
Does it matter if we attempt to add the same predicate twice? If not, then I'm happy to change this and use the predicated version.
I think it's safe to add the predicates multiple times, since the addPredicate function in PredicatedScalarEvolution already checks to see if the new predicate is already implied by its existing list. I've added a new getSmallConstantMaxTripCount to PredicatedScalarEvolution and kept track of the count to avoid unnecessary computation, similar to how we return the backedge taken count.
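The two behaviours described in this reply (ignoring a predicate that is already covered, and computing the max trip count only once) can be sketched with a small stand-alone model. Everything below is hypothetical: `ToyPSE`, `ToyPredicate` and `computeMaxTripCount` are illustrative names, predicates are modelled as strings, and duplicate detection uses plain equality rather than the implication check that the real addPredicate is said to perform.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a SCEV predicate. Real predicates are objects.
using ToyPredicate = std::string;

// Toy analogue of a predicated analysis wrapper (not the LLVM class).
class ToyPSE {
public:
  // Record P unless an equal predicate is already present. This toy uses
  // equality as a simplification of an "already implied" check.
  void addPredicate(const ToyPredicate &P) {
    if (std::find(Preds.begin(), Preds.end(), P) == Preds.end())
      Preds.push_back(P);
  }

  // Compute the max trip count on first use, record the predicates it
  // depends on, and return the cached value on later calls.
  unsigned getSmallConstantMaxTripCount() {
    if (!CachedMaxTC) {
      ++NumComputations;
      std::vector<ToyPredicate> Needed;
      CachedMaxTC = computeMaxTripCount(Needed);
      for (const ToyPredicate &P : Needed)
        addPredicate(P);
    }
    return *CachedMaxTC;
  }

  std::size_t numPredicates() const { return Preds.size(); }
  unsigned numComputations() const { return NumComputations; }

private:
  // Placeholder for the underlying analysis: returns a fixed bound and
  // reports the single predicate the bound relies on.
  static unsigned computeMaxTripCount(std::vector<ToyPredicate> &Needed) {
    Needed.push_back("iv does not wrap in i32");
    return 19;
  }

  std::vector<ToyPredicate> Preds;
  std::optional<unsigned> CachedMaxTC;
  unsigned NumComputations = 0;
};
```

In this model, adding the same predicate before or after querying the trip count leaves exactly one entry in the list, and repeated queries run the placeholder analysis once.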
Yes, PSE should take care to avoid adding duplicated predicates, thanks for adjusting!
Should a similar change also be done for the recent changes for early exit checks in LV?
Do you mean the calls to PSE.getSE()->getPredicatedExitCount in LoopVectorizationLegality::isVectorizableEarlyExitLoop? I'm not sure if we want to do exactly the same, i.e. cache the exit count for every exit. I could potentially add an equivalent method to PredicatedScalarEvolution that automatically adds the predicates, but probably best done in a separate patch because it's unrelated to the max trip count. What do you think?
Deal with conflicts after rebase
const SCEV *SymbolicMaxBackedgeCount = nullptr;

/// The constant max trip count for the loop.
std::optional<unsigned> SmallConstantMaxTripCount;
Suggested change:
- std::optional<unsigned> SmallConstantMaxTripCount;
+ const SCEV* SmallConstantMaxTripCount = nullptr;
for consistency with BackedgeCount and SymbolicMaxBackedgeCount above?
The problem with that is the max trip count is measured as an unsigned value. I could potentially cache the intermediate result in the function:
unsigned ScalarEvolution::getSmallConstantMaxTripCount(
    const Loop *L, SmallVectorImpl<const SCEVPredicate *> *Predicates) {
  const auto *MaxExitCount =
      Predicates ? getPredicatedConstantMaxBackedgeTakenCount(L, *Predicates)
                 : getConstantMaxBackedgeTakenCount(L);
  return getConstantTripCount(dyn_cast<SCEVConstant>(MaxExitCount));
}
i.e. MaxExitCount, but then you still have to call getConstantTripCount each time so you see less benefit from caching it. That's why I chose to use std::optional<unsigned> - alternatively I could use a larger type such as uint64_t and treat UINT64_MAX as being equivalent to not cached yet.
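The two caching options weighed in this reply can be compared in a minimal sketch. `OptionalCache` and `SentinelCache` are hypothetical names used only for illustration. The point of interest is that 0 is itself a meaningful result (the call sites in the diff treat a zero return as "no known bound"), so the cache needs a "not computed yet" state that is distinct from every valid `unsigned` value.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Option 1: std::optional<unsigned>. An empty optional means "not cached",
// so every unsigned value, including 0, can be stored.
struct OptionalCache {
  std::optional<unsigned> Value;
  bool isCached() const { return Value.has_value(); }
  void set(unsigned V) { Value = V; }
  unsigned get() const { return *Value; }
};

// Option 2: a wider integer with UINT64_MAX reserved as "not cached".
// No unsigned result can collide with the sentinel when unsigned is
// narrower than 64 bits, which this sketch assumes.
struct SentinelCache {
  static constexpr uint64_t NotCached = UINT64_MAX;
  uint64_t Value = NotCached;
  bool isCached() const { return Value != NotCached; }
  void set(unsigned V) { Value = V; }
  unsigned get() const { return static_cast<unsigned>(Value); }
};
```

Both variants can remember a result of 0 without recomputing it. The optional version states the "not cached" state in the type, while the sentinel version relies on a convention that every reader of the field has to know.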
Ah I see! It's not worth the trouble then I think.
fhahn
left a comment
LGTM, thanks!
There are a number of places where we call getSmallConstantMaxTripCount
without passing a vector of predicates:
getSmallBestKnownTC
isIndvarOverflowCheckKnownFalse
computeMaxVF
isMoreProfitable
I've changed all of these to now pass in a predicate vector so that
we get the benefit of making better vectorisation choices when we
know the max trip count for loops that require SCEV predicate checks.
I've tried to add tests that cover all the cases affected by these
changes.
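As one illustration of why a known max trip count helps, the comment in isIndvarOverflowCheckKnownFalse says the runtime overflow check is known false when the (max) trip count is known and trip count + (VF * UF) does not overflow in the induction variable's type. The sketch below restates that arithmetic as a stand-alone function. The helper name `overflowCheckKnownFalse`, its parameters and the choice of a 32-bit induction variable in the usage note are assumptions for illustration and are not taken from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not LLVM code). Returns true when
// MaxTC + (KnownMinVF * MaxVScale * UF) provably fits in an unsigned
// integer of IVBits bits. A MaxTC of 0 means "no known bound".
// Assumes the product of the three factors does not overflow uint64_t.
bool overflowCheckKnownFalse(uint64_t MaxTC, uint64_t KnownMinVF,
                             uint64_t MaxVScale, uint64_t UF,
                             unsigned IVBits) {
  assert(IVBits >= 1 && IVBits <= 64 && "unsupported width in this sketch");
  if (MaxTC == 0)
    return false; // Nothing is known about the trip count.
  // Largest value representable in the induction variable's type.
  const uint64_t MaxUInt =
      IVBits == 64 ? UINT64_MAX : ((uint64_t{1} << IVBits) - 1);
  if (MaxTC > MaxUInt)
    return false;
  // Largest number of elements processed per vector iteration.
  const uint64_t Step = KnownMinVF * MaxVScale * UF;
  // Headroom left above the trip count must exceed one vector step.
  return (MaxUInt - MaxTC) > Step;
}
```

With the numbers from the `overflow_indvar_known_false` test (max trip count 1027, VF = vscale x 16, `vscale_range(1,16)`, UF = 1) and an assumed 32-bit induction variable, the step is at most 256 and 1027 + 256 is far below 2^32, so the function returns true. Without a max trip count (0) it returns false.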