Merged
5 changes: 3 additions & 2 deletions llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1812,10 +1812,11 @@ class TargetTransformInfo {
unsigned ChainSizeInBytes,
VectorType *VecTy) const;

/// \returns True if the targets prefers fixed width vectorization if the
/// \returns True if the target prefers fixed width vectorization if the
/// loop vectorizer's cost-model assigns an equal cost to the fixed and
/// scalable version of the vectorized loop.
LLVM_ABI bool preferFixedOverScalableIfEqualCost() const;
/// \p IsEpilogue is true if the decision is for the epilogue loop.
LLVM_ABI bool preferFixedOverScalableIfEqualCost(bool IsEpilogue) const;
Contributor: should be documented

/// \returns True if target prefers SLP vectorizer with altermate opcode
/// vectorization, false - otherwise.
4 changes: 3 additions & 1 deletion llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -1092,7 +1092,9 @@ class TargetTransformInfoImplBase {
return VF;
}

virtual bool preferFixedOverScalableIfEqualCost() const { return false; }
virtual bool preferFixedOverScalableIfEqualCost(bool IsEpilogue) const {
return false;
}

virtual bool preferInLoopReduction(RecurKind Kind, Type *Ty) const {
return false;
5 changes: 3 additions & 2 deletions llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -1403,8 +1403,9 @@ unsigned TargetTransformInfo::getStoreVectorFactor(unsigned VF,
return TTIImpl->getStoreVectorFactor(VF, StoreSize, ChainSizeInBytes, VecTy);
}

bool TargetTransformInfo::preferFixedOverScalableIfEqualCost() const {
return TTIImpl->preferFixedOverScalableIfEqualCost();
bool TargetTransformInfo::preferFixedOverScalableIfEqualCost(
bool IsEpilogue) const {
return TTIImpl->preferFixedOverScalableIfEqualCost(IsEpilogue);
}

bool TargetTransformInfo::preferInLoopReduction(RecurKind Kind,
8 changes: 7 additions & 1 deletion llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -6022,9 +6022,15 @@ static bool containsDecreasingPointers(Loop *TheLoop,
return false;
}

bool AArch64TTIImpl::preferFixedOverScalableIfEqualCost() const {
bool AArch64TTIImpl::preferFixedOverScalableIfEqualCost(bool IsEpilogue) const {
if (SVEPreferFixedOverScalableIfEqualCost.getNumOccurrences())
return SVEPreferFixedOverScalableIfEqualCost;
// For cases like post-LTO vectorization, when we eventually know the trip
// count, epilogue with fixed-width vectorization can be deleted if the trip
// count is less than the epilogue iterations. That's why we prefer
// fixed-width vectorization in epilogue in case of equal costs.
if (IsEpilogue)
    return true;
  return ST->useFixedOverScalableIfEqualCost();
}

Contributor: I think we probably want this to happen after checking SVEPreferFixedOverScalableIfEqualCost so that we have a way to disable it. Also, it might be worth calling useFixedOverScalableIfEqualCost(IsEpilogue) to get the answer and setting this on a per-CPU basis.

Member Author: To set it on a per-CPU basis, I think I have to create another new target feature for it?

Contributor: The reasoning above seems not CPU specific and should apply to all CPUs?

Contributor: Yeah, that's probably true. I was just thinking of limiting this for now to specific CPUs where we have observed the problem, but maybe that's unnecessary. I guess we can leave it like this for now.

2 changes: 1 addition & 1 deletion llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
@@ -424,7 +424,7 @@ class AArch64TTIImpl final : public BasicTTIImplBase<AArch64TTIImpl> {
return TailFoldingStyle::DataWithoutLaneMask;
}

bool preferFixedOverScalableIfEqualCost() const override;
bool preferFixedOverScalableIfEqualCost(bool IsEpilogue) const override;

unsigned getEpilogueVectorizationMinVF() const override;

6 changes: 4 additions & 2 deletions llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
@@ -622,13 +622,15 @@ class LoopVectorizationPlanner {
/// Returns true if the per-lane cost of VectorizationFactor A is lower than
/// that of B.
bool isMoreProfitable(const VectorizationFactor &A,
const VectorizationFactor &B, bool HasTail) const;
const VectorizationFactor &B, bool HasTail,
bool IsEpilogue = false) const;

/// Returns true if the per-lane cost of VectorizationFactor A is lower than
/// that of B in the context of vectorizing a loop with known \p MaxTripCount.
bool isMoreProfitable(const VectorizationFactor &A,
const VectorizationFactor &B,
const unsigned MaxTripCount, bool HasTail) const;
const unsigned MaxTripCount, bool HasTail,
bool IsEpilogue = false) const;

/// Determines if we have the infrastructure to vectorize the loop and its
/// epilogue, assuming the main loop is vectorized by \p VF.
15 changes: 9 additions & 6 deletions llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -3856,7 +3856,8 @@ ElementCount LoopVectorizationCostModel::getMaximizedVFForTarget(
bool LoopVectorizationPlanner::isMoreProfitable(const VectorizationFactor &A,
const VectorizationFactor &B,
const unsigned MaxTripCount,
bool HasTail) const {
bool HasTail,
bool IsEpilogue) const {
InstructionCost CostA = A.Cost;
InstructionCost CostB = B.Cost;

@@ -3880,7 +3881,7 @@ bool LoopVectorizationPlanner::isMoreProfitable(const VectorizationFactor &A,
// Assume vscale may be larger than 1 (or the value being tuned for),
// so that scalable vectorization is slightly favorable over fixed-width
// vectorization.
bool PreferScalable = !TTI.preferFixedOverScalableIfEqualCost() &&
bool PreferScalable = !TTI.preferFixedOverScalableIfEqualCost(IsEpilogue) &&
A.Width.isScalable() && !B.Width.isScalable();

auto CmpFn = [PreferScalable](const InstructionCost &LHS,
@@ -3918,10 +3919,11 @@ bool LoopVectorizationPlanner::isMoreProfitable(const VectorizationFactor &A,

bool LoopVectorizationPlanner::isMoreProfitable(const VectorizationFactor &A,
const VectorizationFactor &B,
bool HasTail) const {
bool HasTail,
bool IsEpilogue) const {
const unsigned MaxTripCount = PSE.getSmallConstantMaxTripCount();
return LoopVectorizationPlanner::isMoreProfitable(A, B, MaxTripCount,
HasTail);
return LoopVectorizationPlanner::isMoreProfitable(A, B, MaxTripCount, HasTail,
IsEpilogue);
}

void LoopVectorizationPlanner::emitInvalidCostRemarks(
@@ -4454,7 +4456,8 @@ VectorizationFactor LoopVectorizationPlanner::selectEpilogueVectorizationFactor(
}

if (Result.Width.isScalar() ||
isMoreProfitable(NextVF, Result, MaxTripCount, !CM.foldTailByMasking()))
isMoreProfitable(NextVF, Result, MaxTripCount, !CM.foldTailByMasking(),
/*IsEpilogue*/ true))
Result = NextVF;
}

@@ -12,9 +12,9 @@ target triple = "aarch64-unknown-linux-gnu"
; DEBUG: LV: Found maximum trip count: 19
; DEBUG: LV: IC is 1
; DEBUG-VS1: LV: VF is vscale x 16
; DEBUG-VS1: Main Loop VF:vscale x 16, Main Loop UF:1, Epilogue Loop VF:vscale x 8, Epilogue Loop UF:1
; DEBUG-VS1: Main Loop VF:vscale x 16, Main Loop UF:1, Epilogue Loop VF:8, Epilogue Loop UF:1
; DEBUG-VS2: LV: VF is vscale x 8
; DEBUG-VS2: Main Loop VF:vscale x 8, Main Loop UF:1, Epilogue Loop VF:vscale x 4, Epilogue Loop UF:1
; DEBUG-VS2: Main Loop VF:vscale x 8, Main Loop UF:1, Epilogue Loop VF:8, Epilogue Loop UF:1

; DEBUG-LABEL: LV: Checking a loop in 'trip_count_too_small'
; DEBUG: LV: Found a loop with a very small trip count. This loop is worth vectorizing only if no scalar iteration overheads are incurred.
@@ -48,9 +48,7 @@ define void @low_vf_ic_is_better(ptr nocapture noundef %p, i32 %tc, i16 noundef
; CHECK-VS1-NEXT: [[TMP1:%.*]] = add i32 [[TC]], 1
; CHECK-VS1-NEXT: [[TMP2:%.*]] = zext i32 [[TMP1]] to i64
; CHECK-VS1-NEXT: [[TMP3:%.*]] = sub i64 20, [[TMP2]]
; CHECK-VS1-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
; CHECK-VS1-NEXT: [[TMP5:%.*]] = shl nuw i64 [[TMP4]], 3
; CHECK-VS1-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP3]], [[TMP5]]
; CHECK-VS1-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP3]], 8
; CHECK-VS1-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[VEC_EPILOG_SCALAR_PH:.*]], label %[[VECTOR_SCEVCHECK:.*]]
; CHECK-VS1: [[VECTOR_SCEVCHECK]]:
; CHECK-VS1-NEXT: [[TMP6:%.*]] = add i32 [[TC]], 1
@@ -91,28 +89,24 @@ define void @low_vf_ic_is_better(ptr nocapture noundef %p, i32 %tc, i16 noundef
; CHECK-VS1: [[VEC_EPILOG_ITER_CHECK]]:
; CHECK-VS1-NEXT: [[IND_END4:%.*]] = add i64 [[TMP0]], [[N_VEC]]
; CHECK-VS1-NEXT: [[N_VEC_REMAINING:%.*]] = sub i64 [[TMP3]], [[N_VEC]]
; CHECK-VS1-NEXT: [[TMP26:%.*]] = call i64 @llvm.vscale.i64()
; CHECK-VS1-NEXT: [[TMP27:%.*]] = shl nuw i64 [[TMP26]], 3
; CHECK-VS1-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_VEC_REMAINING]], [[TMP27]]
; CHECK-VS1-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_VEC_REMAINING]], 8
; CHECK-VS1-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK]], label %[[VEC_EPILOG_SCALAR_PH]], label %[[VEC_EPILOG_PH]], !prof [[PROF3:![0-9]+]]
; CHECK-VS1: [[VEC_EPILOG_PH]]:
; CHECK-VS1-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[VEC_EPILOG_ITER_CHECK]] ], [ 0, %[[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
; CHECK-VS1-NEXT: [[TMP28:%.*]] = call i64 @llvm.vscale.i64()
; CHECK-VS1-NEXT: [[TMP29:%.*]] = mul nuw i64 [[TMP28]], 8
; CHECK-VS1-NEXT: [[N_MOD_VF2:%.*]] = urem i64 [[TMP3]], [[TMP29]]
; CHECK-VS1-NEXT: [[N_MOD_VF2:%.*]] = urem i64 [[TMP3]], 8
; CHECK-VS1-NEXT: [[N_VEC3:%.*]] = sub i64 [[TMP3]], [[N_MOD_VF2]]
; CHECK-VS1-NEXT: [[TMP39:%.*]] = add i64 [[TMP0]], [[N_VEC3]]
; CHECK-VS1-NEXT: [[BROADCAST_SPLATINSERT7:%.*]] = insertelement <vscale x 8 x i8> poison, i8 [[CONV]], i64 0
; CHECK-VS1-NEXT: [[BROADCAST_SPLAT8:%.*]] = shufflevector <vscale x 8 x i8> [[BROADCAST_SPLATINSERT7]], <vscale x 8 x i8> poison, <vscale x 8 x i32> zeroinitializer
; CHECK-VS1-NEXT: [[BROADCAST_SPLATINSERT4:%.*]] = insertelement <8 x i8> poison, i8 [[CONV]], i64 0
; CHECK-VS1-NEXT: [[BROADCAST_SPLAT5:%.*]] = shufflevector <8 x i8> [[BROADCAST_SPLATINSERT4]], <8 x i8> poison, <8 x i32> zeroinitializer
; CHECK-VS1-NEXT: br label %[[VEC_EPILOG_VECTOR_BODY:.*]]
; CHECK-VS1: [[VEC_EPILOG_VECTOR_BODY]]:
; CHECK-VS1-NEXT: [[INDEX5:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], %[[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT9:%.*]], %[[VEC_EPILOG_VECTOR_BODY]] ]
; CHECK-VS1-NEXT: [[OFFSET_IDX:%.*]] = add i64 [[TMP0]], [[INDEX5]]
; CHECK-VS1-NEXT: [[TMP33:%.*]] = getelementptr inbounds nuw i8, ptr [[V]], i64 [[OFFSET_IDX]]
; CHECK-VS1-NEXT: [[WIDE_LOAD6:%.*]] = load <vscale x 8 x i8>, ptr [[TMP33]], align 1
; CHECK-VS1-NEXT: [[TMP35:%.*]] = add <vscale x 8 x i8> [[WIDE_LOAD6]], [[BROADCAST_SPLAT8]]
; CHECK-VS1-NEXT: store <vscale x 8 x i8> [[TMP35]], ptr [[TMP33]], align 1
; CHECK-VS1-NEXT: [[INDEX_NEXT9]] = add nuw i64 [[INDEX5]], [[TMP29]]
; CHECK-VS1-NEXT: [[WIDE_LOAD7:%.*]] = load <8 x i8>, ptr [[TMP33]], align 1
; CHECK-VS1-NEXT: [[TMP23:%.*]] = add <8 x i8> [[WIDE_LOAD7]], [[BROADCAST_SPLAT5]]
; CHECK-VS1-NEXT: store <8 x i8> [[TMP23]], ptr [[TMP33]], align 1
; CHECK-VS1-NEXT: [[INDEX_NEXT9]] = add nuw i64 [[INDEX5]], 8
; CHECK-VS1-NEXT: [[TMP36:%.*]] = icmp eq i64 [[INDEX_NEXT9]], [[N_VEC3]]
; CHECK-VS1-NEXT: br i1 [[TMP36]], label %[[VEC_EPILOG_MIDDLE_BLOCK:.*]], label %[[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
; CHECK-VS1: [[VEC_EPILOG_MIDDLE_BLOCK]]:
@@ -148,9 +142,7 @@ define void @low_vf_ic_is_better(ptr nocapture noundef %p, i32 %tc, i16 noundef
; CHECK-VS2-NEXT: [[TMP1:%.*]] = add i32 [[TC]], 1
; CHECK-VS2-NEXT: [[TMP2:%.*]] = zext i32 [[TMP1]] to i64
; CHECK-VS2-NEXT: [[TMP3:%.*]] = sub i64 20, [[TMP2]]
; CHECK-VS2-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
; CHECK-VS2-NEXT: [[TMP5:%.*]] = shl nuw i64 [[TMP4]], 2
; CHECK-VS2-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP3]], [[TMP5]]
; CHECK-VS2-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP3]], 8
; CHECK-VS2-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[VEC_EPILOG_SCALAR_PH:.*]], label %[[VECTOR_SCEVCHECK:.*]]
; CHECK-VS2: [[VECTOR_SCEVCHECK]]:
; CHECK-VS2-NEXT: [[TMP6:%.*]] = add i32 [[TC]], 1
@@ -191,28 +183,24 @@ define void @low_vf_ic_is_better(ptr nocapture noundef %p, i32 %tc, i16 noundef
; CHECK-VS2: [[VEC_EPILOG_ITER_CHECK]]:
; CHECK-VS2-NEXT: [[IND_END4:%.*]] = add i64 [[TMP0]], [[N_VEC]]
; CHECK-VS2-NEXT: [[N_VEC_REMAINING:%.*]] = sub i64 [[TMP3]], [[N_VEC]]
; CHECK-VS2-NEXT: [[TMP26:%.*]] = call i64 @llvm.vscale.i64()
; CHECK-VS2-NEXT: [[TMP27:%.*]] = shl nuw i64 [[TMP26]], 2
; CHECK-VS2-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_VEC_REMAINING]], [[TMP27]]
; CHECK-VS2-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_VEC_REMAINING]], 8
; CHECK-VS2-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK]], label %[[VEC_EPILOG_SCALAR_PH]], label %[[VEC_EPILOG_PH]], !prof [[PROF3:![0-9]+]]
; CHECK-VS2: [[VEC_EPILOG_PH]]:
; CHECK-VS2-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[VEC_EPILOG_ITER_CHECK]] ], [ 0, %[[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
; CHECK-VS2-NEXT: [[TMP28:%.*]] = call i64 @llvm.vscale.i64()
; CHECK-VS2-NEXT: [[TMP29:%.*]] = mul nuw i64 [[TMP28]], 4
; CHECK-VS2-NEXT: [[N_MOD_VF2:%.*]] = urem i64 [[TMP3]], [[TMP29]]
; CHECK-VS2-NEXT: [[N_MOD_VF2:%.*]] = urem i64 [[TMP3]], 8
; CHECK-VS2-NEXT: [[N_VEC3:%.*]] = sub i64 [[TMP3]], [[N_MOD_VF2]]
; CHECK-VS2-NEXT: [[TMP39:%.*]] = add i64 [[TMP0]], [[N_VEC3]]
; CHECK-VS2-NEXT: [[BROADCAST_SPLATINSERT7:%.*]] = insertelement <vscale x 4 x i8> poison, i8 [[CONV]], i64 0
; CHECK-VS2-NEXT: [[BROADCAST_SPLAT8:%.*]] = shufflevector <vscale x 4 x i8> [[BROADCAST_SPLATINSERT7]], <vscale x 4 x i8> poison, <vscale x 4 x i32> zeroinitializer
; CHECK-VS2-NEXT: [[BROADCAST_SPLATINSERT4:%.*]] = insertelement <8 x i8> poison, i8 [[CONV]], i64 0
; CHECK-VS2-NEXT: [[BROADCAST_SPLAT5:%.*]] = shufflevector <8 x i8> [[BROADCAST_SPLATINSERT4]], <8 x i8> poison, <8 x i32> zeroinitializer
; CHECK-VS2-NEXT: br label %[[VEC_EPILOG_VECTOR_BODY:.*]]
; CHECK-VS2: [[VEC_EPILOG_VECTOR_BODY]]:
; CHECK-VS2-NEXT: [[INDEX5:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], %[[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT9:%.*]], %[[VEC_EPILOG_VECTOR_BODY]] ]
; CHECK-VS2-NEXT: [[OFFSET_IDX:%.*]] = add i64 [[TMP0]], [[INDEX5]]
; CHECK-VS2-NEXT: [[TMP33:%.*]] = getelementptr inbounds nuw i8, ptr [[V]], i64 [[OFFSET_IDX]]
; CHECK-VS2-NEXT: [[WIDE_LOAD6:%.*]] = load <vscale x 4 x i8>, ptr [[TMP33]], align 1
; CHECK-VS2-NEXT: [[TMP35:%.*]] = add <vscale x 4 x i8> [[WIDE_LOAD6]], [[BROADCAST_SPLAT8]]
; CHECK-VS2-NEXT: store <vscale x 4 x i8> [[TMP35]], ptr [[TMP33]], align 1
; CHECK-VS2-NEXT: [[INDEX_NEXT9]] = add nuw i64 [[INDEX5]], [[TMP29]]
; CHECK-VS2-NEXT: [[WIDE_LOAD7:%.*]] = load <8 x i8>, ptr [[TMP33]], align 1
; CHECK-VS2-NEXT: [[TMP23:%.*]] = add <8 x i8> [[WIDE_LOAD7]], [[BROADCAST_SPLAT5]]
; CHECK-VS2-NEXT: store <8 x i8> [[TMP23]], ptr [[TMP33]], align 1
; CHECK-VS2-NEXT: [[INDEX_NEXT9]] = add nuw i64 [[INDEX5]], 8
; CHECK-VS2-NEXT: [[TMP36:%.*]] = icmp eq i64 [[INDEX_NEXT9]], [[N_VEC3]]
; CHECK-VS2-NEXT: br i1 [[TMP36]], label %[[VEC_EPILOG_MIDDLE_BLOCK:.*]], label %[[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
; CHECK-VS2: [[VEC_EPILOG_MIDDLE_BLOCK]]:
26 changes: 8 additions & 18 deletions llvm/test/Transforms/LoopVectorize/AArch64/store-costs-sve.ll
@@ -9,10 +9,7 @@ define void @cost_store_i8(ptr %dst) #0 {
; DEFAULT-LABEL: define void @cost_store_i8(
; DEFAULT-SAME: ptr [[DST:%.*]]) #[[ATTR0:[0-9]+]] {
; DEFAULT-NEXT: iter.check:
; DEFAULT-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
; DEFAULT-NEXT: [[TMP1:%.*]] = shl nuw i64 [[TMP0]], 3
; DEFAULT-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 101, [[TMP1]]
; DEFAULT-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH:%.*]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.*]]
; DEFAULT-NEXT: br i1 false, label [[VEC_EPILOG_SCALAR_PH:%.*]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.*]]
; DEFAULT: vector.main.loop.iter.check:
; DEFAULT-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
; DEFAULT-NEXT: [[TMP3:%.*]] = shl nuw i64 [[TMP2]], 5
@@ -40,29 +37,22 @@ define void @cost_store_i8(ptr %dst) #0 {
; DEFAULT-NEXT: br i1 [[CMP_N]], label [[EXIT:%.*]], label [[VEC_EPILOG_ITER_CHECK:%.*]]
; DEFAULT: vec.epilog.iter.check:
; DEFAULT-NEXT: [[N_VEC_REMAINING:%.*]] = sub i64 101, [[N_VEC]]
; DEFAULT-NEXT: [[TMP12:%.*]] = call i64 @llvm.vscale.i64()
; DEFAULT-NEXT: [[TMP13:%.*]] = shl nuw i64 [[TMP12]], 3
; DEFAULT-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_VEC_REMAINING]], [[TMP13]]
; DEFAULT-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_VEC_REMAINING]], 8
; DEFAULT-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]], !prof [[PROF3:![0-9]+]]
; DEFAULT: vec.epilog.ph:
; DEFAULT-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
; DEFAULT-NEXT: [[TMP14:%.*]] = call i64 @llvm.vscale.i64()
; DEFAULT-NEXT: [[TMP15:%.*]] = mul nuw i64 [[TMP14]], 8
; DEFAULT-NEXT: [[N_MOD_VF2:%.*]] = urem i64 101, [[TMP15]]
; DEFAULT-NEXT: [[N_VEC3:%.*]] = sub i64 101, [[N_MOD_VF2]]
; DEFAULT-NEXT: br label [[VEC_EPILOG_VECTOR_BODY:%.*]]
; DEFAULT: vec.epilog.vector.body:
; DEFAULT-NEXT: [[INDEX5:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT6:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]
; DEFAULT-NEXT: [[TMP19:%.*]] = getelementptr i8, ptr [[DST]], i64 [[INDEX5]]
; DEFAULT-NEXT: store <vscale x 8 x i8> zeroinitializer, ptr [[TMP19]], align 1
; DEFAULT-NEXT: [[INDEX_NEXT6]] = add nuw i64 [[INDEX5]], [[TMP15]]
; DEFAULT-NEXT: [[TMP21:%.*]] = icmp eq i64 [[INDEX_NEXT6]], [[N_VEC3]]
; DEFAULT-NEXT: br i1 [[TMP21]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
; DEFAULT-NEXT: store <8 x i8> zeroinitializer, ptr [[TMP19]], align 1
; DEFAULT-NEXT: [[INDEX_NEXT6]] = add nuw i64 [[INDEX5]], 8
; DEFAULT-NEXT: [[TMP10:%.*]] = icmp eq i64 [[INDEX_NEXT6]], 96
; DEFAULT-NEXT: br i1 [[TMP10]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
; DEFAULT: vec.epilog.middle.block:
; DEFAULT-NEXT: [[CMP_N4:%.*]] = icmp eq i64 101, [[N_VEC3]]
; DEFAULT-NEXT: br i1 [[CMP_N4]], label [[EXIT]], label [[VEC_EPILOG_SCALAR_PH]]
; DEFAULT-NEXT: br i1 false, label [[EXIT]], label [[VEC_EPILOG_SCALAR_PH]]
; DEFAULT: vec.epilog.scalar.ph:
; DEFAULT-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC3]], [[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[ITER_CHECK:%.*]] ]
; DEFAULT-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 96, [[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[ITER_CHECK:%.*]] ]
; DEFAULT-NEXT: br label [[LOOP:%.*]]
; DEFAULT: loop:
; DEFAULT-NEXT: [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[VEC_EPILOG_SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]
@@ -10,7 +10,7 @@ target triple = "aarch64-unknown-linux-gnu"
define void @foo(ptr noalias nocapture readonly %p, ptr noalias nocapture %q, i64 %len) #0 {
; CHECK-EPILOG: vec.epilog.ph:
; CHECK-EPILOG: vec.epilog.vector.body:
; CHECK-EPILOG: load <vscale x 4 x i16>
; CHECK-EPILOG: load <8 x i16>
Contributor: I think it's worth adding a new RUN line for neoverse-v1 with the flag -sve-prefer-fixed-over-scalable-if-equal=false so that we also test the old behaviour, i.e.

  CHECK-EPILOG-PREFER-SCALABLE:        load <vscale x 4 x i16>

Member Author: done

Contributor: Is the new line relevant for all run lines here? It may be better to have a new test file with a function where the cost of fixed and scalable is the same, which has a comment with a brief explanation and run lines with both the option on and off. With the changes here it's difficult to see if this is working as expected.

Member Author: done

; The epilogue loop gets vectorised vscale x 2 x i16 wide.
; CHECK-EPILOG-V2: vec.epilog.ph: