Merged
Changes from 3 commits
61 changes: 43 additions & 18 deletions llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -1441,9 +1441,8 @@ class LoopVectorizationCostModel {

/// Selects and saves TailFoldingStyle for 2 options - if IV update may
/// overflow or not.
Collaborator

(Can select and save 4 options instead of 2 ... [may IV update overflow or not] x [is VF fixed or scalable] ... but better to simplify this than to further complicate it.)

Contributor

still pending?

Member Author

I do not understand how this is related.

/// \param IsScalableVF true if scalable vector factors enabled.
/// \param UserIC User specific interleave count.
void setTailFoldingStyles(bool IsScalableVF, unsigned UserIC) {
void setTailFoldingStyles(unsigned UserIC) {
assert(!ChosenTailFoldingStyle && "Tail folding must not be selected yet.");
if (!Legal->canFoldTailByMasking()) {
ChosenTailFoldingStyle =
@@ -1466,12 +1465,9 @@ class LoopVectorizationCostModel {
// Override forced styles if needed.
// FIXME: use actual opcode/data type for analysis here.
// FIXME: Investigate opportunity for fixed vector factor.
bool EVLIsLegal =
IsScalableVF && UserIC <= 1 &&
TTI.hasActiveVectorLength(0, nullptr, Align()) &&
!EnableVPlanNativePath &&
// FIXME: implement support for max safe dependency distance.
Legal->isSafeForAnyVectorWidth();
bool EVLIsLegal = UserIC <= 1 &&
TTI.hasActiveVectorLength(0, nullptr, Align()) &&
!EnableVPlanNativePath;
Comment on lines +1434 to +1436
Collaborator

(unrelated to this patch) Should this fallback of EVL, due to UserIC>1 or NativePath, apply to getPreferredTailFoldingStyle() decisions above, in addition to (de)forced styles below?

if (!EVLIsLegal) {
// If for some reason EVL mode is unsupported, fallback to
// DataWithoutLaneMask to try to vectorize the loop with folded tail
@@ -1489,13 +1485,29 @@ class LoopVectorizationCostModel {
}
}

/// Disables previously chosen tail folding policy, sets it to None. Expects,
/// that the tail policy was selected.
void disableTailFolding() {
assert(ChosenTailFoldingStyle && "Tail folding must be selected.");
ChosenTailFoldingStyle =
std::make_pair(TailFoldingStyle::None, TailFoldingStyle::None);
}

Collaborator

In general, it would be better to avoid undoing recorded decisions, as it is difficult to track and update related decisions. VPlan was designed to address this by materializing decisions.

Could max feasible VF's be recomputed once no tail is left to fold, rather than re-selecting tail style?

Member Author

Maybe; I will try to do that.

Member Author

After some thought, I think this may not be possible for a non-power-of-2 distance. Here we need to be sure that tail folding with EVL is chosen, to enable selection of the non-power-of-2 max safe distance.
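
For readers following the disableTailFolding() discussion, here is a minimal standalone sketch of the "select once, then reset to None" pattern that the diff introduces. It is not the LLVM class: `TailStyle`, `TailStyleState`, and its member names are hypothetical, the enum lists only a few styles, and which pair element serves the overflow case is an assumption.

```cpp
#include <cassert>
#include <optional>
#include <utility>

// Hypothetical, reduced stand-in for the tail folding style enum.
enum class TailStyle { None, DataWithoutLaneMask, DataWithEVL };

// Hypothetical holder modelling a style pair that is selected once and may
// later be reset to None, as disableTailFolding() does in the diff.
class TailStyleState {
  // Assumption: first = style if the IV update may overflow, second = if not.
  std::optional<std::pair<TailStyle, TailStyle>> Chosen;

public:
  void select(TailStyle IfOverflow, TailStyle IfNoOverflow) {
    assert(!Chosen && "style must not be selected yet");
    Chosen = std::make_pair(IfOverflow, IfNoOverflow);
  }

  // Undo a recorded decision: both entries become None.
  void disable() {
    assert(Chosen && "style must be selected");
    Chosen = std::make_pair(TailStyle::None, TailStyle::None);
  }

  TailStyle get(bool IVUpdateMayOverflow = true) const {
    if (!Chosen)
      return TailStyle::None;
    return IVUpdateMayOverflow ? Chosen->first : Chosen->second;
  }

  bool foldTail() const { return get() != TailStyle::None; }
};
```

Any decision derived from `foldTail()` between `select()` and `disable()` is not revisited by the reset, which is the tracking concern raised in the review.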

/// Returns true if all loop blocks should be masked to fold tail loop.
bool foldTailByMasking() const {
// TODO: check if it is possible to check for None style independent of
// IVUpdateMayOverflow flag in getTailFoldingStyle.
return getTailFoldingStyle() != TailFoldingStyle::None;
}

/// Return maximum safe number of elements to be processed, which do not
/// prevent store-load forwarding.
Collaborator

This explains what max dependence distance is, which is indeed target independent. Why/does EVL need another value? Tried to answer below.

Member Author

Tried to clarify

/// TODO: need to consider adjusting cost model to use this value as a
/// vectorization factor for EVL-based vectorization.
std::optional<unsigned> getMaxEVLSafeElements() const {
return MaxEVLSafeElements;
}

/// Returns true if the instructions in this block requires predication
/// for any reason, e.g. because tail folding now requires a predicate
/// or because the block in the original loop was predicated.
@@ -1647,6 +1659,11 @@ class LoopVectorizationCostModel {
/// true if scalable vectorization is supported and enabled.
std::optional<bool> IsScalableVectorizationAllowed;

/// Maximum safe number of elements to be processed, which do not
/// prevent store-load forwarding and safe with regard of the memory
/// dependencies.
Collaborator

Again, the explanation is general; what is special about EVL?

Contributor

still pending?

Member Author

Added some extra explanation

std::optional<unsigned> MaxEVLSafeElements;

/// A map holding scalar costs for different vectorization factors. The
/// presence of a cost for an instruction in the mapping indicates that the
/// instruction will be scalarized when vectorizing with the associated
@@ -3898,9 +3915,14 @@ FixedScalableVFPair LoopVectorizationCostModel::computeFeasibleMaxVF(
// dependence distance).
Collaborator

Some further explanation about scalable and EVL max safe dependences, complementing the above?

unsigned MaxSafeElements =
llvm::bit_floor(Legal->getMaxSafeVectorWidthInBits() / WidestType);
unsigned MaxScalableSafeElements = MaxSafeElements;
if (foldTailWithEVL() && !Legal->isSafeForAnyVectorWidth()) {
MaxScalableSafeElements = PowerOf2Ceil(MaxSafeElements);
MaxEVLSafeElements = MaxSafeElements;
}

auto MaxSafeFixedVF = ElementCount::getFixed(MaxSafeElements);
auto MaxSafeScalableVF = getMaxLegalScalableVF(MaxSafeElements);
auto MaxSafeScalableVF = getMaxLegalScalableVF(MaxScalableSafeElements);
Collaborator

Should getMaxLegalScalableVF() take into account if EVL is used or not, i.e., be responsible for producing MaxSafeScalableVF given MaxSafeElements for EVL and non EVL cases, rather than changing this parameter passed to it? Conceptually, max number of safe elements itself is invariant across fixed/scalable/EVL or any other target feature.
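
To make the floor/ceil distinction concrete, here is a small standalone sketch. It is not the patch's code: in the hunk above MaxSafeElements is already the result of bit_floor, so PowerOf2Ceil of it returns the same value. The sketch instead starts from a raw, possibly non-power-of-2 element count, which is the follow-up case mentioned in this thread. `floorPow2`, `ceilPow2`, `SafeBounds`, and `sketchSafeBounds` are hypothetical names.

```cpp
#include <cassert>

// Largest power of two that is <= X. Requires X > 0.
inline unsigned floorPow2(unsigned X) {
  unsigned P = 1;
  while (P <= X / 2)
    P *= 2;
  return P;
}

// Smallest power of two that is >= X. Requires 0 < X <= 2^31.
inline unsigned ceilPow2(unsigned X) {
  unsigned P = floorPow2(X);
  return P == X ? P : P * 2;
}

// Hypothetical bundle of the three bounds discussed in the review.
struct SafeBounds {
  unsigned FixedElements;    // bound for fixed-width VFs
  unsigned ScalableElements; // bound passed on to the scalable-VF query
  unsigned EVLCap;           // runtime cap on lanes per iteration, 0 = none
};

inline SafeBounds sketchSafeBounds(unsigned RawSafeElements,
                                   bool FoldTailWithEVL) {
  assert(RawSafeElements > 0 && "expects at least one safe element");
  SafeBounds B;
  B.FixedElements = floorPow2(RawSafeElements);
  B.ScalableElements = B.FixedElements;
  B.EVLCap = 0;
  if (FoldTailWithEVL) {
    // With EVL the static vector type may be wider than the safe distance,
    // provided the runtime vector length never exceeds the cap.
    B.ScalableElements = ceilPow2(RawSafeElements);
    B.EVLCap = RawSafeElements;
  }
  return B;
}
```

For a raw count of 3, the fixed bound is 2, while the EVL case allows a static width of 4 with a runtime cap of 3. For a power-of-2 count such as 1024, all three values coincide.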


LLVM_DEBUG(dbgs() << "LV: The max safe fixed VF is: " << MaxSafeFixedVF
<< ".\n");
Collaborator

(Also dump max EVL safe VF?)
(while we're here, the next else is redundant - follows a return)

@@ -4070,15 +4092,22 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
InterleaveInfo.invalidateGroupsRequiringScalarEpilogue();
}

FixedScalableVFPair MaxFactors = computeFeasibleMaxVF(MaxTC, UserVF, true);
Collaborator

Could this speculative computing of feasible max VF's, assuming tail is folded, be complemented with recomputing the feasible max VF's later when we know there's no tail to fold?

Member Author

You mean, instead of moving tail folding before the MaxVF computation, try to compute MaxVF twice, if required? I think this is possible.

// If we don't know the precise trip count, or if the trip count that we
// found modulo the vectorization factor is not zero, try to fold the tail
// by masking.
// FIXME: look for a smaller MaxVF that does divide TC rather than masking.
setTailFoldingStyles(UserIC);
FixedScalableVFPair MaxFactors =
computeFeasibleMaxVF(MaxTC, UserVF, foldTailByMasking());
Collaborator

Trying to reason about MaxVF, setting tail, max dependence distance, and EVL:

MaxVF: the current mechanism of computeMaxVF() should be simplified rather than further complicated, let alone due to EVL/safe-dependence-distance support. This method determines two upper bounds: fixed and scalable, of VF ranges for building VPlans, to save time considering unprofitable and illegal ones. How many VPlans should be built in case of EVL, for what range of (scalable) VF's - ending with MaxVF?
Suffice to consider a single VPlan for a single VF - the one corresponding to vector length computed dynamically by providing the original trip count and max safe distance - regardless of any MaxVF, both fixed and scalable? (LMULs other than 1 may play a role, but one that conceptually corresponds to UF, treated as a compile-time fixed constant.) Some scalable-VF should be used as the static type, accommodating the trip-count (if known) and max dependence distance. But eventually all excessive lanes will be masked out every iteration, i.e., MaxVF may exceed max dependence distance in the case of EVL, but not in other cases (fixed or scalable).

MaxVF and tail folding (yes/no/style): computeMaxVF() uses computeFeasibleMaxVF() which in turn uses getMaximizedVFForTarget() - the latter dependent on whether tail is folded or not - to limit MaxVF by the original trip count or not, and ends up being responsible for setting the tail style. Speculating a folded tail should produce greater (or equal) MaxVF's.
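
The "vector length computed dynamically from the trip count and max safe distance" idea above can be modelled with a scalar simulation. This is a conceptual sketch only, not the recipe sequence the patch emits. It assumes the target grants exactly min(AVL, VLMAX) lanes per iteration, which real hardware is not required to do, and `simulateEVLLoop` and its parameters are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Returns how many elements each vector iteration processes.
// TripCount:       original scalar trip count
// VLMax:           lanes available in the static (scalable) vector type
// MaxSafeElements: max dependence distance in elements
inline std::vector<unsigned> simulateEVLLoop(unsigned TripCount,
                                             unsigned VLMax,
                                             unsigned MaxSafeElements) {
  assert(VLMax > 0 && MaxSafeElements > 0 && "loop must make progress");
  std::vector<unsigned> Steps;
  unsigned Done = 0;
  while (Done < TripCount) {
    unsigned Remaining = TripCount - Done;
    // Clamp the requested length by the safe distance first ...
    unsigned AVL = std::min(Remaining, MaxSafeElements);
    // ... then by what the vector type can hold (assumed hardware behaviour).
    unsigned EVL = std::min(AVL, VLMax);
    Steps.push_back(EVL);
    Done += EVL;
  }
  return Steps;
}
```

With a trip count of 10, VLMax of 4 and a safe distance of 3, the iterations process 3, 3, 3 and 1 elements: the static width exceeds the dependence distance, but the excess lane is never active.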

Member Author

Our downstream implementation differs a bit from the upstream one, and it is not the best one. I am trying to improve it here.

> How many VPlans should be built in case of EVL, for what range of (scalable) VF's - ending with MaxVF?

Yes, bit_ceil(MaxVF).

> Suffice to consider a single VPlan for a single VF - the one corresponding to vector length computed dynamically by providing the original trip count and max safe distance - regardless of any MaxVF, both fixed and scalable?

Currently no, I think; it may affect the cost estimation.

> MaxVF and tail folding (yes/no/style): computeMaxVF() uses computeFeasibleMaxVF() which in turn uses getMaximizedVFForTarget() - the latter dependent on whether tail is folded or not - to limit MaxVF by the original trip count or not, and ends up being responsible for setting the tail style. Speculating a folded tail should produce greater (or equal) MaxVF's.

The tail folding style needs to be defined beforehand to avoid speculation here. Plus, it is required for non-power-of-2 distance support.


// Avoid tail folding if the trip count is known to be a multiple of any VF
// we choose.
std::optional<unsigned> MaxPowerOf2RuntimeVF =
Collaborator

(unrelated) What does "Runtime" in "MaxPowerOf2RuntimeVF" stand for?

MaxFactors.FixedVF.getFixedValue();
if (MaxFactors.ScalableVF) {
std::optional<unsigned> MaxVScale = getMaxVScale(*TheFunction, TTI);
if (MaxVScale && TTI.isVScaleKnownToBeAPowerOfTwo()) {
if (MaxVScale && TTI.isVScaleKnownToBeAPowerOfTwo() &&
(!foldTailWithEVL() || isPowerOf2_32(MaxEVLSafeElements.value_or(0)))) {
MaxPowerOf2RuntimeVF = std::max<unsigned>(
*MaxPowerOf2RuntimeVF,
*MaxVScale * MaxFactors.ScalableVF.getKnownMinValue());
@@ -4101,15 +4130,11 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
if (Rem->isZero()) {
// Accept MaxFixedVF if we do not have a tail.
LLVM_DEBUG(dbgs() << "LV: No tail will remain for any chosen VF.\n");
disableTailFolding();
Contributor

This is needed because we first need to compute the max VFs, assuming tail folding, right?

Could we end up in a scenario where MaxSafeNumberOfElements is 3, a max scalable VF of 4 is picked, and then tail folding is disabled here so no EVL is used, vectorizing incorrectly with VF 4?

Member Author

  1. Right.
  2. Yes, looks so. Need to do extra processing here

Contributor

Is this still pending or is there a check now elsewhere?

Would it be simpler to just increase the scalable VF when tail-folding with EVL below where we already adjust the max fixed VF, instead of needing to reset the tail folding decision?

Member Author

  1. Added the extra check above, when setting MaxPowerOf2RuntimeVF
  2. I'm afraid there might be some side effects, if we'll keep tail-folding mode ON, while it is OFF. Better explicitly set it to OFF, I think
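
A minimal sketch of the extra clause added to the MaxPowerOf2RuntimeVF condition, isolated from the vscale checks that surround it in the diff. `isPow2` and `scalableVFMayRaiseRuntimeVF` are hypothetical names; the zero handling follows the behaviour of a power-of-2 test that rejects 0.

```cpp
#include <cassert>
#include <optional>

// True only for non-zero powers of two.
inline bool isPow2(unsigned X) { return X != 0 && (X & (X - 1)) == 0; }

// Models only the added clause:
//   !foldTailWithEVL() || isPowerOf2_32(MaxEVLSafeElements.value_or(0))
// The MaxVScale and vscale-is-power-of-two checks are left out.
inline bool
scalableVFMayRaiseRuntimeVF(bool FoldTailWithEVL,
                            std::optional<unsigned> MaxEVLSafeElements) {
  return !FoldTailWithEVL || isPow2(MaxEVLSafeElements.value_or(0));
}
```

With a cap of 3 the clause is false, so the scalable VF does not feed into the "no tail remains" shortcut. Note that, as written, the clause is also false when EVL folding is on and no cap was recorded, because `value_or(0)` yields 0.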

Contributor

> I'm afraid there might be some side effects, if we'll keep tail-folding mode ON, while it is OFF. Better explicitly set it to OFF, I think

Agreed. I was asking whether it is possible to keep setTailFoldingStyles at its original place and then try to maximize MaxScalableVF below, where we already deal with the EVL case (the "if (getTailFoldingStyle() == TailFoldingStyle::DataWithEVL) {" block below).

Member Author

Maybe it can be done right now, but the follow-up patch still requires moving it up (for non-power-of-2 support).

Contributor

If it is possible to keep it in the original place without much extra work, that would be preferable; could it be moved only when actually needed?

return MaxFactors;
}
}

// If we don't know the precise trip count, or if the trip count that we
// found modulo the vectorization factor is not zero, try to fold the tail
// by masking.
// FIXME: look for a smaller MaxVF that does divide TC rather than masking.
setTailFoldingStyles(MaxFactors.ScalableVF.isScalable(), UserIC);
if (foldTailByMasking()) {
Collaborator

Following above mentioned thought, should the following correction for over-speculated MaxFactors take place in the else, (!foldTailByMasking()) case -

MaxFactors = computeFeasibleMaxVF(MaxTC, UserVF, false);

?

Member Author

As I said, some special processing will be required for the EVL case with a non-power-of-2 safe distance.

if (getTailFoldingStyle() == TailFoldingStyle::DataWithEVL) {
LLVM_DEBUG(
@@ -8492,8 +8517,8 @@ void LoopVectorizationPlanner::buildVPlansWithVPRecipes(ElementCount MinVF,
VPlanTransforms::optimize(*Plan, *PSE.getSE());
// TODO: try to put it close to addActiveLaneMask().
// Discard the plan if it is not EVL-compatible
if (CM.foldTailWithEVL() &&
!VPlanTransforms::tryAddExplicitVectorLength(*Plan))
if (CM.foldTailWithEVL() && !VPlanTransforms::tryAddExplicitVectorLength(
*Plan, CM.getMaxEVLSafeElements()))
break;
assert(verifyVPlanIsValid(*Plan) && "VPlan is invalid");
VPlans.push_back(std::move(Plan));
7 changes: 7 additions & 0 deletions llvm/lib/Transforms/Vectorize/VPlan.h
@@ -1243,6 +1243,13 @@ class VPInstruction : public VPRecipeWithIRFlags {
SLPLoad,
SLPStore,
ActiveLaneMask,
/// Creates special scalar explicit-vector-length instruction, which
/// calculates the vectorization factor (number of iterations, that can be
/// executed simultaneously) at runtime.
/// Has two mandatory parameters - EVL (effective vector length) on the
/// previous iteration and original trip count.
/// Also, has one optional parameter - max safe distance, allowed for the
/// loop.
ExplicitVectorLength,
/// Creates a scalar phi in a leaf VPBB with a single predecessor in VPlan.
/// The first operand is the incoming value from the predecessor in VPlan,
9 changes: 6 additions & 3 deletions llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -357,6 +357,7 @@ bool VPInstruction::canGenerateScalarForFirstLane() const {
return true;
switch (Opcode) {
case Instruction::ICmp:
case Instruction::Select:
case VPInstruction::BranchOnCond:
case VPInstruction::BranchOnCount:
case VPInstruction::CalculateTripCountMinusVF:
@@ -405,9 +406,10 @@ Value *VPInstruction::generatePerPart(VPTransformState &State, unsigned Part) {
return Builder.CreateCmp(getPredicate(), A, B, Name);
}
case Instruction::Select: {
Value *Cond = State.get(getOperand(0), Part);
Value *Op1 = State.get(getOperand(1), Part);
Value *Op2 = State.get(getOperand(2), Part);
bool OnlyFirstLaneUsed = vputils::onlyFirstLaneUsed(this);
Value *Cond = State.get(getOperand(0), Part, OnlyFirstLaneUsed);
Value *Op1 = State.get(getOperand(1), Part, OnlyFirstLaneUsed);
Value *Op2 = State.get(getOperand(2), Part, OnlyFirstLaneUsed);
return Builder.CreateSelect(Cond, Op1, Op2, Name);
}
case VPInstruction::ActiveLaneMask: {
@@ -753,6 +755,7 @@ bool VPInstruction::onlyFirstLaneUsed(const VPValue *Op) const {
default:
return false;
case Instruction::ICmp:
case Instruction::Select:
case VPInstruction::PtrAdd:
// TODO: Cover additional opcodes.
return vputils::onlyFirstLaneUsed(this);
39 changes: 35 additions & 4 deletions llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -1427,7 +1427,23 @@ void VPlanTransforms::addActiveLaneMask(
/// %NextEVLIV = add IVSize (cast i32 %VPEVVL to IVSize), %EVLPhi
/// ...
///
bool VPlanTransforms::tryAddExplicitVectorLength(VPlan &Plan) {
/// If MaxEVLSafeElements is provided, the function adds the following recipes:
/// vector.ph:
/// ...
///
/// vector.body:
/// ...
/// %EVLPhi = EXPLICIT-VECTOR-LENGTH-BASED-IV-PHI [ %StartV, %vector.ph ],
/// [ %NextEVLIV, %vector.body ]
/// %cmp = cmp ult %EVLPhi, MaxEVLSafeElements
/// %SAFE_AVL = select %cmp, %EVLPhi, MaxEVLSafeElements
/// %VPEVL = EXPLICIT-VECTOR-LENGTH %SAFE_AVL, original TC
/// ...
/// %NextEVLIV = add IVSize (cast i32 %VPEVVL to IVSize), %EVLPhi
/// ...
///
bool VPlanTransforms::tryAddExplicitVectorLength(
VPlan &Plan, const std::optional<unsigned> &MaxEVLSafeElements) {
VPBasicBlock *Header = Plan.getVectorLoopRegion()->getEntryBasicBlock();
// The transform updates all users of inductions to work based on EVL, instead
// of the VF directly. At the moment, widened inductions cannot be updated, so
@@ -1452,9 +1468,24 @@ bool VPlanTransforms::tryAddExplicitVectorLength(VPlan &Plan) {
// Create the ExplicitVectorLengthPhi recipe in the main loop.
auto *EVLPhi = new VPEVLBasedIVPHIRecipe(StartV, DebugLoc());
EVLPhi->insertAfter(CanonicalIVPHI);
auto *VPEVL = new VPInstruction(VPInstruction::ExplicitVectorLength,
{EVLPhi, Plan.getTripCount()});
VPEVL->insertBefore(*Header, Header->getFirstNonPhi());
VPRecipeBase *AVL = EVLPhi;
if (MaxEVLSafeElements) {
VPValue *EVLSafe = Plan.getOrAddLiveIn(
ConstantInt::get(CanonicalIVPHI->getScalarType(), *MaxEVLSafeElements));
auto *Cmp = new VPInstruction(Instruction::ICmp, ICmpInst::ICMP_ULT, EVLPhi,
EVLSafe);
Cmp->insertBefore(*Header, Header->getFirstNonPhi());
AVL = new VPInstruction(Instruction::Select, {Cmp, EVLPhi, EVLSafe},
DebugLoc(), "safe_avl");
AVL->insertAfter(Cmp);
}
auto *VPEVL = new VPInstruction(
VPInstruction::ExplicitVectorLength,
{AVL->getVPSingleValue(), Plan.getTripCount()}, DebugLoc());
if (MaxEVLSafeElements)
VPEVL->insertAfter(AVL);
else
VPEVL->insertBefore(*Header, Header->getFirstNonPhi());
Collaborator

nit: using VPBuilder would probably help generate the phi and sequence of non-phi instructions more easily, setting its insertion point once (to first non-phi) rather than inserting each recipe individually. BTW, EVLPhi might also be inserted there, instead of immediately after CanonicalIVPHI.

Member Author

The EVLPhi change is better done in a separate patch.


auto *CanonicalIVIncrement =
cast<VPInstruction>(CanonicalIVPHI->getBackedgeValue());
4 changes: 3 additions & 1 deletion llvm/lib/Transforms/Vectorize/VPlanTransforms.h
@@ -105,7 +105,9 @@ struct VPlanTransforms {
/// VPCanonicalIVPHIRecipe is only used to control the loop after
/// this transformation.
/// \returns true if the transformation succeeds, or false if it doesn't.
static bool tryAddExplicitVectorLength(VPlan &Plan);
static bool
tryAddExplicitVectorLength(VPlan &Plan,
const std::optional<unsigned> &MaxEVLSafeElements);
};

} // namespace llvm
@@ -422,28 +422,37 @@ define void @no_high_lmul_or_interleave(ptr %p) {
; IF-EVL-NEXT: entry:
; IF-EVL-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; IF-EVL: vector.ph:
; IF-EVL-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
; IF-EVL-NEXT: [[TMP1:%.*]] = sub i64 [[TMP0]], 1
; IF-EVL-NEXT: [[N_RND_UP:%.*]] = add i64 3002, [[TMP1]]
; IF-EVL-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP0]]
; IF-EVL-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
; IF-EVL-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
; IF-EVL-NEXT: br label [[VECTOR_BODY:%.*]]
; IF-EVL: vector.body:
; IF-EVL-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; IF-EVL-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
; IF-EVL-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i64> poison, i64 [[INDEX]], i64 0
; IF-EVL-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT]], <4 x i64> poison, <4 x i32> zeroinitializer
; IF-EVL-NEXT: [[VEC_IV:%.*]] = add <4 x i64> [[BROADCAST_SPLAT]], <i64 0, i64 1, i64 2, i64 3>
; IF-EVL-NEXT: [[TMP1:%.*]] = icmp ule <4 x i64> [[VEC_IV]], <i64 3001, i64 3001, i64 3001, i64 3001>
; IF-EVL-NEXT: [[TMP2:%.*]] = getelementptr i64, ptr [[P:%.*]], i64 [[TMP0]]
; IF-EVL-NEXT: [[TMP3:%.*]] = getelementptr i64, ptr [[TMP2]], i32 0
; IF-EVL-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <4 x i64> @llvm.masked.load.v4i64.p0(ptr [[TMP3]], i32 32, <4 x i1> [[TMP1]], <4 x i64> poison)
; IF-EVL-NEXT: [[TMP4:%.*]] = add i64 [[TMP0]], 1024
; IF-EVL-NEXT: [[TMP5:%.*]] = getelementptr i64, ptr [[P]], i64 [[TMP4]]
; IF-EVL-NEXT: [[TMP6:%.*]] = getelementptr i64, ptr [[TMP5]], i32 0
; IF-EVL-NEXT: call void @llvm.masked.store.v4i64.p0(<4 x i64> [[WIDE_MASKED_LOAD]], ptr [[TMP6]], i32 32, <4 x i1> [[TMP1]])
; IF-EVL-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 4
; IF-EVL-NEXT: [[TMP7:%.*]] = icmp eq i64 [[INDEX_NEXT]], 3004
; IF-EVL-NEXT: br i1 [[TMP7]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP10:![0-9]+]]
; IF-EVL-NEXT: [[EVL_BASED_IV:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_EVL_NEXT:%.*]], [[VECTOR_BODY]] ]
; IF-EVL-NEXT: [[TMP3:%.*]] = icmp ult i64 [[EVL_BASED_IV]], 1024
; IF-EVL-NEXT: [[SAFE_AVL:%.*]] = select i1 [[TMP3]], i64 [[EVL_BASED_IV]], i64 1024
Collaborator

Must the (max) dependence distance be a power of 2, currently?

Member Author

Yes, need to check if vectorization is safe for any VF before setting MaxSafeElements

Member Author

It is power-of-2 here, no problem

; IF-EVL-NEXT: [[TMP4:%.*]] = sub i64 3002, [[SAFE_AVL]]
; IF-EVL-NEXT: [[TMP5:%.*]] = call i32 @llvm.experimental.get.vector.length.i64(i64 [[TMP4]], i32 1, i1 true)
; IF-EVL-NEXT: [[TMP6:%.*]] = add i64 [[EVL_BASED_IV]], 0
; IF-EVL-NEXT: [[TMP7:%.*]] = getelementptr i64, ptr [[P:%.*]], i64 [[TMP6]]
; IF-EVL-NEXT: [[TMP8:%.*]] = getelementptr i64, ptr [[TMP7]], i32 0
; IF-EVL-NEXT: [[VP_OP_LOAD:%.*]] = call <vscale x 1 x i64> @llvm.vp.load.nxv1i64.p0(ptr align 32 [[TMP8]], <vscale x 1 x i1> shufflevector (<vscale x 1 x i1> insertelement (<vscale x 1 x i1> poison, i1 true, i64 0), <vscale x 1 x i1> poison, <vscale x 1 x i32> zeroinitializer), i32 [[TMP5]])
; IF-EVL-NEXT: [[TMP9:%.*]] = add i64 [[TMP6]], 1024
; IF-EVL-NEXT: [[TMP10:%.*]] = getelementptr i64, ptr [[P]], i64 [[TMP9]]
; IF-EVL-NEXT: [[TMP11:%.*]] = getelementptr i64, ptr [[TMP10]], i32 0
; IF-EVL-NEXT: call void @llvm.vp.store.nxv1i64.p0(<vscale x 1 x i64> [[VP_OP_LOAD]], ptr align 32 [[TMP11]], <vscale x 1 x i1> shufflevector (<vscale x 1 x i1> insertelement (<vscale x 1 x i1> poison, i1 true, i64 0), <vscale x 1 x i1> poison, <vscale x 1 x i32> zeroinitializer), i32 [[TMP5]])
; IF-EVL-NEXT: [[TMP12:%.*]] = zext i32 [[TMP5]] to i64
; IF-EVL-NEXT: [[INDEX_EVL_NEXT]] = add i64 [[TMP12]], [[EVL_BASED_IV]]
; IF-EVL-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP2]]
; IF-EVL-NEXT: [[TMP13:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
; IF-EVL-NEXT: br i1 [[TMP13]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP10:![0-9]+]]
; IF-EVL: middle.block:
; IF-EVL-NEXT: br i1 true, label [[EXIT:%.*]], label [[SCALAR_PH]]
; IF-EVL: scalar.ph:
; IF-EVL-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 3004, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
; IF-EVL-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
; IF-EVL-NEXT: br label [[LOOP:%.*]]
; IF-EVL: loop:
; IF-EVL-NEXT: [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]