7 changes: 4 additions & 3 deletions llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7179,9 +7179,6 @@ DenseMap<const SCEV *, Value *> LoopVectorizationPlanner::executePlan(
VPlanTransforms::optimizeForVFAndUF(BestVPlan, BestVF, BestUF, PSE);
VPlanTransforms::simplifyRecipes(BestVPlan);
VPlanTransforms::removeBranchOnConst(BestVPlan);
VPlanTransforms::narrowInterleaveGroups(
BestVPlan, BestVF,
TTI.getRegisterBitWidth(TargetTransformInfo::RGK_FixedWidthVector));
Comment on lines -7234 to -7236
Collaborator

Note that in addition to moving narrowInterleaveGroups from VPlan execution to planning, it also changes relative transform order - being moved from LVP::executePlan() after optimizing for final VF and UF, to be the last transform of buildVPlansWithVPRecipes(), skipping over several transforms in LVP::executePlan().

Perhaps worth first hoisting it to appear earlier/earliest in LVP::executePlan(), still operating on the final VPlan but before it is unrolled etc., and then moving it to the end of buildVPlansWithVPRecipes(), where it operates on multiple VPlans?
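Concretely, that intermediate step might look roughly like the following inside LVP::executePlan(), reusing the calls visible in this hunk; this is only a sketch of the suggestion, not code from the patch:

  // Hypothetical ordering: run narrowInterleaveGroups first, on the final
  // VPlan, before the later final-VF/UF optimization and unrolling steps.
  VPlanTransforms::narrowInterleaveGroups(
      BestVPlan, BestVF,
      TTI.getRegisterBitWidth(TargetTransformInfo::RGK_FixedWidthVector));
  VPlanTransforms::optimizeForVFAndUF(BestVPlan, BestVF, BestUF, PSE);
  VPlanTransforms::simplifyRecipes(BestVPlan);
  VPlanTransforms::removeBranchOnConst(BestVPlan);
  VPlanTransforms::cse(BestVPlan);
  VPlanTransforms::removeDeadRecipes(BestVPlan);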

VPlanTransforms::cse(BestVPlan);
VPlanTransforms::removeDeadRecipes(BestVPlan);

@@ -8228,6 +8225,10 @@ void LoopVectorizationPlanner::buildVPlansWithVPRecipes(ElementCount MinVF,
if (CM.foldTailWithEVL() && !HasScalarVF)
VPlanTransforms::runPass(VPlanTransforms::addExplicitVectorLength,
*Plan, CM.getMaxSafeElements());

if (auto P = VPlanTransforms::narrowInterleaveGroups(*Plan, TTI))
VPlans.push_back(std::move(P));

assert(verifyVPlanIsValid(*Plan) && "VPlan is invalid");
VPlans.push_back(std::move(Plan));
}
1 change: 1 addition & 0 deletions llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -1214,6 +1214,7 @@ VPlan *VPlan::duplicate() {
}
Old2NewVPValues[&VectorTripCount] = &NewPlan->VectorTripCount;
Old2NewVPValues[&VF] = &NewPlan->VF;
Old2NewVPValues[&UF] = &NewPlan->UF;
Old2NewVPValues[&VFxUF] = &NewPlan->VFxUF;
if (BackedgeTakenCount) {
NewPlan->BackedgeTakenCount = new VPValue();
6 changes: 6 additions & 0 deletions llvm/lib/Transforms/Vectorize/VPlan.h
@@ -4085,6 +4085,9 @@ class VPlan {
/// Represents the vectorization factor of the loop.
VPValue VF;

/// Represents the symbolic unroll factor of the loop.
Collaborator

Suggested change
/// Represents the symbolic unroll factor of the loop.
/// Represents the unroll factor of the loop.

VF and VFxUF are also "symbolic", when VF is fixed.
Worth documenting here that they must not be used after materializing?

VPValue UF;

/// Represents the loop-invariant VF * UF of the vector loop region.
VPValue VFxUF;

@@ -4236,6 +4239,9 @@ class VPlan {
/// Returns the VF of the vector loop region.
VPValue &getVF() { return VF; };

/// Returns the symbolic UF of the vector loop region.
VPValue &getSymbolicUF() { return UF; };
Member

const

Contributor Author

Unfortunately this can't be made const, as it is used with replaceAllUsesWith, which cannot be const.
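For context, the usage that forces a non-const accessor is the materialization code later in this PR (sketched here; TCTy is the trip-count type used at that call site):

  // replaceAllUsesWith rewrites the symbolic UF's users, so the accessor must
  // hand out a mutable VPValue reference.
  VPValue *UF = Plan.getOrAddLiveIn(ConstantInt::get(TCTy, Plan.getUF()));
  Plan.getSymbolicUF().replaceAllUsesWith(UF);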

Comment on lines +4311 to +4312
Collaborator

Suggested change
/// Returns the symbolic UF of the vector loop region.
VPValue &getSymbolicUF() { return UF; };
/// Returns the UF of the vector loop region.
VPValue &getUF() { return UF; };

to be consistent with VF and VFxUF which may also be symbolic; or at-least rename UF to be SymbolicUF.
This would require renaming the existing getUF() which returns unsigned, say, to be getFixedUF(). (Can also provide getFixedVF(), getFixedVFxUF() to support the fixed VF case.)
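One possible shape of that renaming, purely illustrative (getFixedUF is a hypothetical name, not part of this patch):

  /// Returns the symbolic UF of the vector loop region as a VPValue.
  VPValue &getUF() { return UF; }

  /// Returns the compile-time unroll factor (the current unsigned getUF()).
  unsigned getFixedUF() const;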


/// Returns VF * UF of the vector loop region.
VPValue &getVFxUF() { return VFxUF; }

99 changes: 71 additions & 28 deletions llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -3802,6 +3802,9 @@ void VPlanTransforms::materializeVFAndVFxUF(VPlan &Plan, VPBasicBlock *VectorPH,
// used.
// TODO: Assert that they aren't used.
Comment on lines 3960 to 3961
Collaborator

Above comment and TODO apply to Plan.getUF as well?


VPValue *UF = Plan.getOrAddLiveIn(ConstantInt::get(TCTy, Plan.getUF()));
Plan.getSymbolicUF().replaceAllUsesWith(UF);
Collaborator

Better rename materializeVFAndVFxUF() now to, say, materializeVFAndUF() or materializeFactors()?


// If there are no users of the runtime VF, compute VFxUF by constant folding
// the multiplication of VF and UF.
if (VF.getNumUsers() == 0) {
@@ -3821,7 +3824,6 @@ }
}
VF.replaceAllUsesWith(RuntimeVF);

VPValue *UF = Plan.getOrAddLiveIn(ConstantInt::get(TCTy, Plan.getUF()));
VPValue *MulByUF = Builder.createNaryOp(Instruction::Mul, {RuntimeVF, UF});
VFxUF.replaceAllUsesWith(MulByUF);
}
@@ -3930,16 +3932,26 @@ static bool isAlreadyNarrow(VPValue *VPV) {
return RepR && RepR->isSingleScalar();
}

void VPlanTransforms::narrowInterleaveGroups(VPlan &Plan, ElementCount VF,
unsigned VectorRegWidth) {
std::unique_ptr<VPlan>
VPlanTransforms::narrowInterleaveGroups(VPlan &Plan,
const TargetTransformInfo &TTI) {
using namespace llvm::VPlanPatternMatch;
VPRegionBlock *VectorLoop = Plan.getVectorLoopRegion();

if (!VectorLoop)
return;
return nullptr;

VPTypeAnalysis TypeInfo(Plan);
auto GetVectorWidthForVF = [&TTI](ElementCount VF) {
return TTI
.getRegisterBitWidth(VF.isFixed()
? TargetTransformInfo::RGK_FixedWidthVector
: TargetTransformInfo::RGK_ScalableVector)
.getKnownMinValue();
};

unsigned VFMinVal = VF.getKnownMinValue();
VPTypeAnalysis TypeInfo(Plan);
SmallVector<VPInterleaveRecipe *> StoreGroups;
std::optional<ElementCount> VFToOptimize;
for (auto &R : *VectorLoop->getEntryBasicBlock()) {
Collaborator

(Independent) Checking recipes of entry BB only?

if (isa<VPCanonicalIVPHIRecipe>(&R) ||
match(&R, m_BranchOnCount(m_VPValue(), m_VPValue())))
@@ -3954,30 +3966,38 @@ void VPlanTransforms::narrowInterleaveGroups(VPlan &Plan, ElementCount VF,
// * recipes writing to memory except interleave groups
// Only support plans with a canonical induction phi.
if (R.isPhi())
return;
return nullptr;

auto *InterleaveR = dyn_cast<VPInterleaveRecipe>(&R);
if (R.mayWriteToMemory() && !InterleaveR)
return;

// Do not narrow interleave groups if there are VectorPointer recipes and
// the plan was unrolled. The recipe implicitly uses VF from
// VPTransformState.
// TODO: Remove restriction once the VF for the VectorPointer offset is
// modeled explicitly as operand.
if (isa<VPVectorPointerRecipe>(&R) && Plan.getUF() > 1)
return;
Comment on lines -4120 to -4126
Collaborator

This TODO taken care of? Below asserts that vector pointer recipes are absent.

return nullptr;

// All other ops are allowed, but we reject uses that cannot be converted
// when checking all allowed consumers (store interleave groups) below.
if (!InterleaveR)
continue;

// Bail out on non-consecutive interleave groups.
if (!isConsecutiveInterleaveGroup(InterleaveR, VFMinVal, TypeInfo,
VectorRegWidth))
return;

// Try to find a single VF, where all interleave groups are consecutive and
// saturate the full vector width. If we already have a candidate VF, check
// if it is applicable for the current InterleaveR, otherwise look for a
// suitable VF across the Plans VFs.
//
if (VFToOptimize) {
Collaborator

Unify using VFs = VFToOptimize ? {*VFToOptimize} : Plan.vectorFactors()?
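Spelled out, the unification might look something like this (a sketch only; it reuses the names from the surrounding hunk and keeps the existing bail-out behaviour):

  // Build one candidate list (either the already-chosen VF or all of the
  // plan's VFs) and run a single check loop over it.
  SmallVector<ElementCount> CandidateVFs;
  if (VFToOptimize)
    CandidateVFs.push_back(*VFToOptimize);
  else
    append_range(CandidateVFs, Plan.vectorFactors());

  bool Found = false;
  for (ElementCount VF : CandidateVFs) {
    if (isConsecutiveInterleaveGroup(InterleaveR, VF.getKnownMinValue(),
                                     TypeInfo, GetVectorWidthForVF(VF))) {
      VFToOptimize = VF;
      Found = true;
      break;
    }
  }
  if (!Found)
    return nullptr;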

if (!isConsecutiveInterleaveGroup(
InterleaveR, VFToOptimize->getKnownMinValue(), TypeInfo,
GetVectorWidthForVF(*VFToOptimize)))
return nullptr;
} else {
for (ElementCount VF : Plan.vectorFactors()) {
Contributor

Couldn't you just pass in the ElementCounts directly to isConsecutiveInterleaveGroup? It feels a bit cleaner because otherwise isConsecutiveInterleaveGroup is a bit fragile, since it doesn't know if the min value passed in for the VF is for a fixed-width or scalable VF and could lead to incorrect behaviour. If you pass in the original VFs then isConsecutiveInterleaveGroup can bail out or assert if VF.isScalable() != RegWidth.isScalable().

Contributor Author

Updated, thanks

if (isConsecutiveInterleaveGroup(InterleaveR, VF.getKnownMinValue(),
TypeInfo, GetVectorWidthForVF(VF))) {
VFToOptimize = VF;
break;
}
}
if (!VFToOptimize)
Contributor

nit: Can't you just fold this into

      if (auto VF = isConsecutiveInterleaveGroup(
              InterleaveR, to_vector(Plan.vectorFactors()), TypeInfo, TTI))
        VFToOptimize = *VF;
      else
        return nullptr;

?

Contributor Author

Done thanks

return nullptr;
}
// Skip read interleave groups.
if (InterleaveR->getStoredValues().empty())
Collaborator

(Independent) May be good to rename InterleaveR into InterleavedStore, at-least from here on.

continue;
Collaborator

(Independent) What if below Member0 is already narrow but not all stored values are the same?

Collaborator

(Independent) Better to check indices of members in IG rather than match the order of VPValues defined by interleaved load recipe to the order of interleaved store operands? Or verify that these recipes retain these orders.

@@ -4011,24 +4031,44 @@ void VPlanTransforms::narrowInterleaveGroups(VPlan &Plan, ElementCount VF,
auto *WideMember0 = dyn_cast_or_null<VPWidenRecipe>(
InterleaveR->getStoredValues()[0]->getDefiningRecipe());
if (!WideMember0)
return;
return nullptr;
for (const auto &[I, V] : enumerate(InterleaveR->getStoredValues())) {
auto *R = dyn_cast_or_null<VPWidenRecipe>(V->getDefiningRecipe());
if (!R || R->getOpcode() != WideMember0->getOpcode() ||
R->getNumOperands() > 2)
return;
return nullptr;
if (any_of(enumerate(R->operands()),
[WideMember0, Idx = I](const auto &P) {
const auto &[OpIdx, OpV] = P;
return !canNarrowLoad(WideMember0, OpIdx, OpV, Idx);
}))
return;
return nullptr;
}
StoreGroups.push_back(InterleaveR);
}

if (StoreGroups.empty())
return;
return nullptr;

// All interleave groups in Plan can be narrowed for VFToOptimize. Split the
Collaborator

Worth adding a
TODO: Handle cases where only some interleave groups can be narrowed.
?
This transform pivots the dimension of vectorization from fully-loop-based to fully-SLP, affecting all recipes. The motivation stems from at-least one SLP tree, and works well when all recipes of the loop lie on SLP trees, but may still be beneficial even if some recipes remain scalar - outside of any SLP tree.

// original Plan into 2: a) a new clone which contains all VFs of Plan, except
// VFToOptimize, and b) the original Plan with VFToOptimize as single VF.
std::unique_ptr<VPlan> NewPlan;
if (size(Plan.vectorFactors()) != 1) {
NewPlan = std::unique_ptr<VPlan>(Plan.duplicate());
Plan.setVF(*VFToOptimize);
bool First = true;
for (ElementCount VF : NewPlan->vectorFactors()) {
if (VF == VFToOptimize)
continue;
if (First) {
NewPlan->setVF(VF);
First = false;
continue;
}
NewPlan->addVF(VF);
Contributor

This feels a bit cumbersome. It would be nice if addVF could be made to work without any existing VF, then you could rewrite the loop as:

  for (ElementCount VF : NewPlan->vectorFactors())
    if (VF != VFToOptimize)
      NewPlan->addVF(VF);

Contributor Author

Yeah, that's a good point, the current code is quite cumbersome. What we really want is to remove VFToOptimize from NewPlan; I added a removeVF helper.
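With such a helper, the splitting code could reduce to roughly the following (a sketch; removeVF is the hypothetical helper mentioned above and may differ from what actually landed):

  // Clone the plan, drop the optimized VF from the clone, and keep only the
  // optimized VF in the original plan.
  std::unique_ptr<VPlan> NewPlan;
  if (size(Plan.vectorFactors()) != 1) {
    NewPlan = std::unique_ptr<VPlan>(Plan.duplicate());
    NewPlan->removeVF(*VFToOptimize);
    Plan.setVF(*VFToOptimize);
  }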

}
}

// Convert InterleaveGroup \p R to a single VPWidenLoadRecipe.
SmallPtrSet<VPValue *, 4> NarrowedOps;
@@ -4099,9 +4139,8 @@ void VPlanTransforms::narrowInterleaveGroups(VPlan &Plan, ElementCount VF,
auto *Inc = cast<VPInstruction>(CanIV->getBackedgeValue());
VPBuilder PHBuilder(Plan.getVectorPreheader());

VPValue *UF = Plan.getOrAddLiveIn(
ConstantInt::get(CanIV->getScalarType(), 1 * Plan.getUF()));
if (VF.isScalable()) {
VPValue *UF = &Plan.getSymbolicUF();
if (VFToOptimize->isScalable()) {
VPValue *VScale = PHBuilder.createElementCount(
CanIV->getScalarType(), ElementCount::getScalable(1));
VPValue *VScaleUF = PHBuilder.createNaryOp(Instruction::Mul, {VScale, UF});
Expand All @@ -4113,6 +4152,10 @@ void VPlanTransforms::narrowInterleaveGroups(VPlan &Plan, ElementCount VF,
Plan.getOrAddLiveIn(ConstantInt::get(CanIV->getScalarType(), 1)));
Collaborator

The VF of Plan is set to 1 to affect the induction recipes that use it, in order to de-vectorize the loop, but the widen load and store recipes (that replace the interleaved loads and stores) are still expected to generate vector instructions according to the original VF. Would be good to clarify this discrepancy.

}
removeDeadRecipes(Plan);
assert(none_of(*VectorLoop->getEntryBasicBlock(),
Collaborator

Again attention is given to entry BB only.

IsaPred<VPVectorPointerRecipe>) &&
"All VPVectorPointerRecipes should have been removed");
Comment on lines +4303 to +4305
Collaborator

This corresponds to the original constraint that UF must be 1 if vector pointer recipes are present?

return NewPlan;
}

/// Add branch weight metadata, if the \p Plan's middle block is terminated by a
21 changes: 13 additions & 8 deletions llvm/lib/Transforms/Vectorize/VPlanTransforms.h
@@ -333,14 +333,19 @@ struct VPlanTransforms {
static DenseMap<const SCEV *, Value *> expandSCEVs(VPlan &Plan,
ScalarEvolution &SE);

/// Try to convert a plan with interleave groups with VF elements to a plan
/// with the interleave groups replaced by wide loads and stores processing VF
/// elements, if all transformed interleave groups access the full vector
/// width (checked via \o VectorRegWidth). This effectively is a very simple
/// form of loop-aware SLP, where we use interleave groups to identify
/// candidates.
static void narrowInterleaveGroups(VPlan &Plan, ElementCount VF,
unsigned VectorRegWidth);
/// Try to find a single VF among \p Plan's VFs for which all interleave
/// groups (with VF elements) can be replaced by wide loads ans tores
Contributor

nit: /// groups (with known minimum VF elements)
nit: by wide loads and stores

Contributor Author

fixed thanks!

/// processing VF elements, if all transformed interleave groups access the
/// full vector width (checked via \o VectorRegWidth). If the transformation
Contributor

nit: VectorRegWidth is no longer a parameter passed to the function. Perhaps replace with checked via the maximum vector register width

Contributor Author

updated, thanks!

/// can be applied, the original \p Plan will be split in 2, if is has
Contributor

This is a bit difficult to follow. Do you mean something like

  /// can be applied, the original \p Plan will be split in 2:
  ///   1. The original Plan with the single VF containing the optimised recipes using wide loads instead of interleave groups.
  ///   2. A new clone which contains all VFs of Plan except the optimised VF.

It's unclear what VFToOptimize is because it's not passed as a parameter to the function.

Contributor Author

Yep, much better, updated, thanks!

/// multiple VFs: a) a new clone which contains all VFs of Plan, except
/// VFToOptimize, and b) the original Plan with VFToOptimize as single VF. In
/// that case, the new clone is returned.
///
/// This effectively is a very simple form of loop-aware SLP, where we use
/// interleave groups to identify candidates.
static std::unique_ptr<VPlan>
narrowInterleaveGroups(VPlan &Plan, const TargetTransformInfo &TTI);
Collaborator

More important than "narrowing" is the "pivoting" of the vectorization dimension from being loop-based to being SLP-based, thereby eliminating shuffle-de-shuffle redundancies. This can be achieved w/o narrowing, provided support for very-wide load/store recipes or emission of multiple wide load/store recipes instead of emitting only single ones.


/// Predicate and linearize the control-flow in the only loop region of
/// \p Plan. If \p FoldTail is true, create a mask guarding the loop
Expand Down
Contributor

Can you add a scalable vector version of at least one of these tests please? I tested this file with this PR and ran opt -p loop-vectorize -mcpu=neoverse-v1 and we generate IR like this for test_add_double_same_const_args_1:

  %wide.load = load <vscale x 2 x double>, ptr %9, align 4
  %wide.load1 = load <vscale x 2 x double>, ptr %10, align 4
  %11 = fadd <vscale x 2 x double> %wide.load, splat (double 1.000000e+00)
  %12 = fadd <vscale x 2 x double> %wide.load1, splat (double 1.000000e+00)
...
  store <vscale x 2 x double> %11, ptr %13, align 4
  store <vscale x 2 x double> %12, ptr %14, align 4

Contributor Author

I added a RUN line to the scalable test file w/o forced interleaving. I think that should add the missing coverage. Could also add additional tests there.

@@ -175,28 +175,18 @@ define void @test_add_double_same_var_args_1(ptr %res, ptr noalias %A, ptr noali
; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
; CHECK: [[VECTOR_BODY]]:
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 2
; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 1
; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds nuw { double, double }, ptr [[A]], i64 [[INDEX]]
; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds nuw { double, double }, ptr [[A]], i64 [[TMP0]]
; CHECK-NEXT: [[WIDE_VEC:%.*]] = load <4 x double>, ptr [[TMP1]], align 4
; CHECK-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <4 x double> [[WIDE_VEC]], <4 x double> poison, <2 x i32> <i32 0, i32 2>
; CHECK-NEXT: [[STRIDED_VEC1:%.*]] = shufflevector <4 x double> [[WIDE_VEC]], <4 x double> poison, <2 x i32> <i32 1, i32 3>
; CHECK-NEXT: [[WIDE_VEC2:%.*]] = load <4 x double>, ptr [[TMP2]], align 4
; CHECK-NEXT: [[STRIDED_VEC3:%.*]] = shufflevector <4 x double> [[WIDE_VEC2]], <4 x double> poison, <2 x i32> <i32 0, i32 2>
; CHECK-NEXT: [[STRIDED_VEC4:%.*]] = shufflevector <4 x double> [[WIDE_VEC2]], <4 x double> poison, <2 x i32> <i32 1, i32 3>
; CHECK-NEXT: [[TMP3:%.*]] = fadd <2 x double> [[STRIDED_VEC]], [[BROADCAST_SPLAT]]
; CHECK-NEXT: [[TMP4:%.*]] = fadd <2 x double> [[STRIDED_VEC3]], [[BROADCAST_SPLAT]]
; CHECK-NEXT: [[STRIDED_VEC1:%.*]] = load <2 x double>, ptr [[TMP1]], align 4
; CHECK-NEXT: [[STRIDED_VEC4:%.*]] = load <2 x double>, ptr [[TMP2]], align 4
; CHECK-NEXT: [[TMP5:%.*]] = fadd <2 x double> [[STRIDED_VEC1]], [[BROADCAST_SPLAT]]
; CHECK-NEXT: [[TMP6:%.*]] = fadd <2 x double> [[STRIDED_VEC4]], [[BROADCAST_SPLAT]]
; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds nuw { double, double }, ptr [[RES]], i64 [[INDEX]]
; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds nuw { double, double }, ptr [[RES]], i64 [[TMP0]]
; CHECK-NEXT: [[TMP9:%.*]] = shufflevector <2 x double> [[TMP3]], <2 x double> [[TMP5]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
; CHECK-NEXT: [[INTERLEAVED_VEC:%.*]] = shufflevector <4 x double> [[TMP9]], <4 x double> poison, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
; CHECK-NEXT: store <4 x double> [[INTERLEAVED_VEC]], ptr [[TMP7]], align 4
; CHECK-NEXT: [[TMP10:%.*]] = shufflevector <2 x double> [[TMP4]], <2 x double> [[TMP6]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
; CHECK-NEXT: [[INTERLEAVED_VEC5:%.*]] = shufflevector <4 x double> [[TMP10]], <4 x double> poison, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
; CHECK-NEXT: store <4 x double> [[INTERLEAVED_VEC5]], ptr [[TMP8]], align 4
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
; CHECK-NEXT: store <2 x double> [[TMP5]], ptr [[TMP7]], align 4
; CHECK-NEXT: store <2 x double> [[TMP6]], ptr [[TMP8]], align 4
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 2
; CHECK-NEXT: [[TMP11:%.*]] = icmp eq i64 [[INDEX_NEXT]], 100
; CHECK-NEXT: br i1 [[TMP11]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]
; CHECK: [[MIDDLE_BLOCK]]:
@@ -237,28 +227,18 @@ define void @test_add_double_same_var_args_2(ptr %res, ptr noalias %A, ptr noali
; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
; CHECK: [[VECTOR_BODY]]:
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 2
; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 1
; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds nuw { double, double }, ptr [[A]], i64 [[INDEX]]
; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds nuw { double, double }, ptr [[A]], i64 [[TMP0]]
; CHECK-NEXT: [[WIDE_VEC:%.*]] = load <4 x double>, ptr [[TMP1]], align 4
; CHECK-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <4 x double> [[WIDE_VEC]], <4 x double> poison, <2 x i32> <i32 0, i32 2>
; CHECK-NEXT: [[STRIDED_VEC1:%.*]] = shufflevector <4 x double> [[WIDE_VEC]], <4 x double> poison, <2 x i32> <i32 1, i32 3>
; CHECK-NEXT: [[WIDE_VEC2:%.*]] = load <4 x double>, ptr [[TMP2]], align 4
; CHECK-NEXT: [[STRIDED_VEC3:%.*]] = shufflevector <4 x double> [[WIDE_VEC2]], <4 x double> poison, <2 x i32> <i32 0, i32 2>
; CHECK-NEXT: [[STRIDED_VEC4:%.*]] = shufflevector <4 x double> [[WIDE_VEC2]], <4 x double> poison, <2 x i32> <i32 1, i32 3>
; CHECK-NEXT: [[TMP3:%.*]] = fadd <2 x double> [[BROADCAST_SPLAT]], [[STRIDED_VEC]]
; CHECK-NEXT: [[TMP4:%.*]] = fadd <2 x double> [[BROADCAST_SPLAT]], [[STRIDED_VEC3]]
; CHECK-NEXT: [[STRIDED_VEC1:%.*]] = load <2 x double>, ptr [[TMP1]], align 4
; CHECK-NEXT: [[STRIDED_VEC4:%.*]] = load <2 x double>, ptr [[TMP2]], align 4
; CHECK-NEXT: [[TMP5:%.*]] = fadd <2 x double> [[BROADCAST_SPLAT]], [[STRIDED_VEC1]]
; CHECK-NEXT: [[TMP6:%.*]] = fadd <2 x double> [[BROADCAST_SPLAT]], [[STRIDED_VEC4]]
; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds nuw { double, double }, ptr [[RES]], i64 [[INDEX]]
; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds nuw { double, double }, ptr [[RES]], i64 [[TMP0]]
; CHECK-NEXT: [[TMP9:%.*]] = shufflevector <2 x double> [[TMP3]], <2 x double> [[TMP5]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
; CHECK-NEXT: [[INTERLEAVED_VEC:%.*]] = shufflevector <4 x double> [[TMP9]], <4 x double> poison, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
; CHECK-NEXT: store <4 x double> [[INTERLEAVED_VEC]], ptr [[TMP7]], align 4
; CHECK-NEXT: [[TMP10:%.*]] = shufflevector <2 x double> [[TMP4]], <2 x double> [[TMP6]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
; CHECK-NEXT: [[INTERLEAVED_VEC5:%.*]] = shufflevector <4 x double> [[TMP10]], <4 x double> poison, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
; CHECK-NEXT: store <4 x double> [[INTERLEAVED_VEC5]], ptr [[TMP8]], align 4
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
; CHECK-NEXT: store <2 x double> [[TMP5]], ptr [[TMP7]], align 4
; CHECK-NEXT: store <2 x double> [[TMP6]], ptr [[TMP8]], align 4
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 2
; CHECK-NEXT: [[TMP11:%.*]] = icmp eq i64 [[INDEX_NEXT]], 100
; CHECK-NEXT: br i1 [[TMP11]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP10:![0-9]+]]
; CHECK: [[MIDDLE_BLOCK]]: