Skip to content
Merged
89 changes: 88 additions & 1 deletion llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -771,6 +771,92 @@ sinkRecurrenceUsersAfterPrevious(VPFirstOrderRecurrencePHIRecipe *FOR,
return true;
}

/// Try to hoist \p Previous and its operands before all users of \p FOR.
static bool hoistPreviousBeforeFORUsers(VPFirstOrderRecurrencePHIRecipe *FOR,
VPRecipeBase *Previous,
VPDominatorTree &VPDT) {
using namespace llvm::VPlanPatternMatch;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this namespace (still) needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed, thanks!

if (Previous->mayHaveSideEffects() || Previous->mayReadFromMemory())
return false;

// Collect recipes that need hoisting.
SmallVector<VPRecipeBase *> WorkList;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
SmallVector<VPRecipeBase *> WorkList;
SmallVector<VPRecipeBase *> HoistCandidates;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

SmallPtrSet<VPRecipeBase *, 8> Seen;
VPBasicBlock *HoistBlock = FOR->getParent();
auto HoistPoint = HoistBlock->getFirstNonPhi();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simpler to set HoistPoint to FOR, given the flat loop? See more below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to initialize to nullptr and then look for the user dominating all others. FOR as initializer isn't suitable I think, as it already dominates all users and is in the phi section of the block, so we can't hoist recipes before it

auto TryToPushHoistCandidate = [&](VPRecipeBase *HoistCandidate) {
// If we reach FOR, it means the original Previous depends on some other
// recurrence that in turn depends on FOR. If that is the case, we would
// also need to hoist recipes involving the other FOR, which may break
// dependencies.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If Previous depends on FOR (directly or indirectly), they form a dependence cycle, which should have been classified as an induction or reduction rather than FOR? I.e., assert this does not occur. (Also holds for sinking.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can occur at the moment I think.

There can be cases where 'previous' of one FOR uses another FOR (e.g. as in https://github.com/llvm/llvm-project/blob/main/llvm/test/Transforms/LoopVectorize/first-order-recurrence-multiply-recurrences.ll#L32). They get classified as FORs but not vectorized because we cannot rearrange the users.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, but here "If we reach FOR" refers to the header phi of the original recurrence, rather than 'previous' of one FOR using header phi of another FOR. Posted inline as this appears outdated.

if (HoistCandidate == FOR)
return false;

// Hoist candidate outside any region, no need to hoist.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Hoist candidate outside any region, no need to hoist.
// Candidate is outside loop region, dominates FOR users w/o hoisting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks!

if (!HoistCandidate->getParent()->getParent())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (!HoistCandidate->getParent()->getParent())
if (!HoistCandidate->getParent()->getEnclosingLoopRegion())

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

return true;

// Hoist candidate is a header phi or already visited, no need to hoist.
if (isa<VPHeaderPHIRecipe>(HoistCandidate) ||
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly more logical to check for header phi along with previous out-of-loop check, as both cases dominate all FOR users, and check if already Seen separately?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved, thanks!

!Seen.insert(HoistCandidate).second)
return true;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: may be worth asserting HoistPoint (is non-null and) dominates all FOR->users().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

// If we reached a recipe that dominates all users of FOR, we don't need to
// hoist the recipe.
if (all_of(FOR->users(), [&VPDT, HoistCandidate](VPUser *U) {
return VPDT.properlyDominates(HoistCandidate, cast<VPRecipeBase>(U));
})) {
if (VPDT.properlyDominates(&*HoistPoint, HoistCandidate)) {
// This HoistCandidate domiantes all users of FOR and is closer to them
// than the previous HoistPoint.
HoistPoint = std::next(HoistCandidate->getIterator());
HoistBlock = HoistCandidate->getParent();
}
return true;
}

// Don't move candiates with sideeffects, as we do not yet analyze recipes
// between candidate and hoist destination yet.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Don't move candiates with sideeffects, as we do not yet analyze recipes
// between candidate and hoist destination yet.
// Avoid moving candidates with side-effects, as we do not yet analyze associated
// dependencies.

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks!

if (HoistCandidate->mayHaveSideEffects())
return false;

WorkList.push_back(HoistCandidate);
return true;
};

// Recursively try to hoist Previous and its operands before all users of FOR.
// Update HoistPoint to the closest recipe that dominates all users of FOR.
if (!TryToPushHoistCandidate(Previous))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall structure is similar to that of sinking. Here's an alternative structure, which may be simpler: initialize Worklist with Previous, and iteratively check each candidate if (a) it already dominates FOR users or already Seen, if not (b) can be hoisted to dominate them. If (a) continue, otherwise if not (b) return false, otherwise add all operands to worklist. See more below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks!

return false;

for (unsigned I = 0; I != WorkList.size(); ++I) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for (unsigned I : seq<unsigned>(WorkList.size())) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WorkList gets expanded in the loop, so I left it as is for now.

VPRecipeBase *Current = WorkList[I];
assert(Current->getNumDefinedValues() == 1 &&
"only recipes with a single defined value expected");

for (VPValue *Op : Current->operands())
if (auto *R = Op->getDefiningRecipe())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Live-ins could be considered along with header phis and out-of-loop recipes, as dominating users of FOR, but then Worklist will accommodate VPValues rather than recipes. Check (a) above for dominance could be applied to each operand before inserting into Worklist, as done here, rather than when retrieving it as described above. I.e., instead of a single TryToPushHoistCandidate : recipe -> bool, have two functions:

  SmallVector<VPRecipeBase *> HoistCandidates ({Previous});
  for (unsigned I = 0; I != HoistCandidates.size(); ++I) {
    VPRecipeBase *HoistCandidate = HoistCandidates[I];
    if (!CanHoistBefore(HoistCandidate))
      return false;
    for (VPValue *Op : HoistCandidate->operands())
      if (VPRecipeBase *OpCandidate = OpNeedsHoisting(Op))
        HoistCandidates.push_back(OpCandidate);
  }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated as suggested thanks!

if (!TryToPushHoistCandidate(R))
return false;
}

// Keep recipes to hoist ordered by dominance so earlier instructions are
// processed first.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hoist candidates are already held in a set (Seen), have that set ordered according to dominance?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently it uses a SmallPtrSet, could be updated to use std::set, but sorting explicitly might be a bit clearer?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine. But then "Keep recipes to hoist ordered by ..." should read "Order recipes to hoist by ..."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done thanks!

sort(WorkList, [&VPDT](const VPRecipeBase *A, const VPRecipeBase *B) {
return VPDT.properlyDominates(A, B);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it provide strict weak ordering?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it should, as the CFG in the loop is flattened.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that the CFG in the loop is assumed to be flattened,

  • verify CFG in the loop is flattened, expansion of replicate regions must be applied later?
  • fix hoist point to right before first FOR user?
  • check dominance of first FOR user only?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

verify CFG in the loop is flattened, expansion of replicate regions must be applied later?
adde assert, thanks
fix hoist point to right before first FOR user?
check dominance of first FOR user only?
both done, thanks!

});

for (VPRecipeBase *HoistCandidate : WorkList) {
if (HoistPoint == HoistCandidate->getIterator()) {
HoistPoint = std::next(HoistCandidate->getIterator());
continue;
}
HoistCandidate->moveBefore(*HoistBlock, HoistPoint);
}

return true;
}

bool VPlanTransforms::adjustFixedOrderRecurrences(VPlan &Plan,
VPBuilder &LoopBuilder) {
VPDominatorTree VPDT;
Expand All @@ -795,7 +881,8 @@ bool VPlanTransforms::adjustFixedOrderRecurrences(VPlan &Plan,
}

if (!sinkRecurrenceUsersAfterPrevious(FOR, Previous, VPDT))
return false;
if (!hoistPreviousBeforeFORUsers(FOR, Previous, VPDT))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (!hoistPreviousBeforeFORUsers(FOR, Previous, VPDT))
if (!sinkRecurrenceUsersAfterPrevious(FOR, Previous, VPDT) &&
!hoistPreviousBeforeFORUsers(FOR, Previous, VPDT))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks!

return false;

// Introduce a recipe to combine the incoming and previous values of a
// fixed-order recurrence.
Expand Down
12 changes: 7 additions & 5 deletions llvm/lib/Transforms/Vectorize/VPlanTransforms.h
Original file line number Diff line number Diff line change
Expand Up @@ -36,11 +36,13 @@ struct VPlanTransforms {
GetIntOrFpInductionDescriptor,
ScalarEvolution &SE, const TargetLibraryInfo &TLI);

/// Sink users of fixed-order recurrences after the recipe defining their
/// previous value. Then introduce FirstOrderRecurrenceSplice VPInstructions
/// to combine the value from the recurrence phis and previous values. The
/// current implementation assumes all users can be sunk after the previous
/// value, which is enforced by earlier legality checks.
/// Try to move users of fixed-order recurrences after the recipe defining
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Try to move users of fixed-order recurrences after the recipe defining
/// Try to have all users of fixed-order recurrences appear after the recipe defining

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

/// their previous value, either by sinking them or hoisting the recipe
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// their previous value, either by sinking them or hoisting the recipe
/// their previous value, by either sinking the users or hoisting the recipe

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done thanks!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// their previous value, either by sinking them or hoisting the recipe
/// their previous value, by either sinking the users or hoisting the recipe

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done thanks!

/// defining their previous value (and its operands). Then introduce
/// FirstOrderRecurrenceSplice VPInstructions to combine the value from the
/// recurrence phis and previous values. The current implementation assumes
/// all users can be sunk after the previous value, which is enforced by
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// all users can be sunk after the previous value, which is enforced by
/// all users can be sunk after the previous value, or the previous value can be hoisted before all users, which is enforced by

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done thanks!

/// earlier legality checks.
/// \returns true if all users of fixed-order recurrences could be re-arranged
/// as needed or false if it is not possible. In the latter case, \p Plan is
/// not valid.
Expand Down
64 changes: 64 additions & 0 deletions llvm/test/Transforms/LoopVectorize/X86/fixed-order-recurrence.ll
Original file line number Diff line number Diff line change
Expand Up @@ -278,3 +278,67 @@ exit:
store double %.lcssa, ptr %C
ret i64 %.in.lcssa
}

; Test for https://github.com/llvm/llvm-project/issues/106523.
define void @for_iv_trunc_optimized(ptr %dst) {
; CHECK-LABEL: @for_iv_trunc_optimized(
; CHECK-NEXT: bb:
; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; CHECK: vector.ph:
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[VECTOR_RECUR:%.*]] = phi <4 x i32> [ <i32 poison, i32 poison, i32 poison, i32 1>, [[VECTOR_PH]] ], [ [[STEP_ADD:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[VECTOR_RECUR1:%.*]] = phi <4 x i32> [ <i32 poison, i32 poison, i32 poison, i32 0>, [[VECTOR_PH]] ], [ [[TMP3:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[VEC_IND:%.*]] = phi <4 x i32> [ <i32 1, i32 2, i32 3, i32 4>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop seems to be vectorized by VF=4 and unrolled by UF=2 as VEC_IND and INDEX are bumped by 8's, but there's only a single copy of <4 x i32> vectors, presumably due to dce.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, due to only storing to an invariant pointer I think

; CHECK-NEXT: [[STEP_ADD]] = add <4 x i32> [[VEC_IND]], <i32 4, i32 4, i32 4, i32 4>
; CHECK-NEXT: [[TMP0:%.*]] = shufflevector <4 x i32> [[VECTOR_RECUR]], <4 x i32> [[VEC_IND]], <4 x i32> <i32 3, i32 4, i32 5, i32 6>
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[VEC_IND]], <4 x i32> [[STEP_ADD]], <4 x i32> <i32 3, i32 4, i32 5, i32 6>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TMP0 is the first splice for %for.1, fed by (last lane of) a <4 x i32> vector IV of last iteration, rather than truncating an i64 one, along with first 3 lanes of current vector IV.
TMP1 is the second splice for %for.1.

; CHECK-NEXT: [[TMP2:%.*]] = or <4 x i32> [[TMP0]], zeroinitializer
; CHECK-NEXT: [[TMP3]] = or <4 x i32> [[TMP1]], zeroinitializer
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ors-with-zeroes should be simplified away?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could do yes. For now I updated the input to avoid the redundant op to make the test more robust

; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> [[TMP3]], <4 x i32> <i32 3, i32 4, i32 5, i32 6>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first splice of %for.2 combines the last lane of VECTOR_RECUR1 with first 3 lanes of TMP2, but being dead is eliminated.
TMP5 is the second splice for %for.2, combining last lane of TMP2 with first 3 lanes of TMP3, the last of which feeds the store to invariant address.

; CHECK-NEXT: [[TMP6:%.*]] = extractelement <4 x i32> [[TMP5]], i32 3
; CHECK-NEXT: store i32 [[TMP6]], ptr [[DST:%.*]], align 4
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
; CHECK-NEXT: [[VEC_IND_NEXT]] = add <4 x i32> [[STEP_ADD]], <i32 4, i32 4, i32 4, i32 4>
; CHECK-NEXT: [[TMP7:%.*]] = icmp eq i64 [[INDEX_NEXT]], 336
; CHECK-NEXT: br i1 [[TMP7]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]
; CHECK: middle.block:
; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i32> [[STEP_ADD]], i32 3
; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT3:%.*]] = extractelement <4 x i32> [[TMP3]], i32 3
; CHECK-NEXT: br i1 false, label [[EXIT:%.*]], label [[SCALAR_PH]]
; CHECK: scalar.ph:
; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 337, [[MIDDLE_BLOCK]] ], [ 1, [[BB:%.*]] ]
; CHECK-NEXT: [[SCALAR_RECUR_INIT:%.*]] = phi i32 [ [[VECTOR_RECUR_EXTRACT]], [[MIDDLE_BLOCK]] ], [ 1, [[BB]] ]
; CHECK-NEXT: [[SCALAR_RECUR_INIT4:%.*]] = phi i32 [ [[VECTOR_RECUR_EXTRACT3]], [[MIDDLE_BLOCK]] ], [ 0, [[BB]] ]
; CHECK-NEXT: br label [[LOOP:%.*]]
; CHECK: loop:
; CHECK-NEXT: [[IV:%.*]] = phi i64 [ [[ADD:%.*]], [[LOOP]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
; CHECK-NEXT: [[FOR_1:%.*]] = phi i32 [ [[TRUNC:%.*]], [[LOOP]] ], [ [[SCALAR_RECUR_INIT]], [[SCALAR_PH]] ]
; CHECK-NEXT: [[FOR_2:%.*]] = phi i32 [ [[OR:%.*]], [[LOOP]] ], [ [[SCALAR_RECUR_INIT4]], [[SCALAR_PH]] ]
; CHECK-NEXT: [[OR]] = or i32 [[FOR_1]], 0
; CHECK-NEXT: [[ADD]] = add i64 [[IV]], 1
; CHECK-NEXT: store i32 [[FOR_2]], ptr [[DST]], align 4
; CHECK-NEXT: [[ICMP:%.*]] = icmp ult i64 [[IV]], 337
; CHECK-NEXT: [[TRUNC]] = trunc i64 [[IV]] to i32
; CHECK-NEXT: br i1 [[ICMP]], label [[LOOP]], label [[EXIT]], !llvm.loop [[LOOP9:![0-9]+]]
; CHECK: exit:
; CHECK-NEXT: ret void
;
bb:
br label %loop

loop:
%iv = phi i64 [ %add, %loop ], [ 1, %bb ]
%for.1 = phi i32 [ %trunc, %loop ], [ 1, %bb ]
%for.2 = phi i32 [ %or, %loop ], [ 0, %bb ]
%or = or i32 %for.1, 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth explaining this case.

%for.2 requires no code motion, as its previous (%or) precedes its (first) user (store). Furthermore, its user cannot sink, being a store.

%for.1 requires code motion, as its previous (%trunc) follows its (first) user (%or). Sinking %or past %trunc seems possible, as %or has no uses (except for feeding %for.2; worth strengthening VPlan's dce?). However, %or is both the user of %for.1 and the previous of %for.2, and we refrain from sinking instructions that act as previous because they (may) serve points to sink after.

Instead, %for.1 can be reconciled by hoisting its previous above its user %or, as this user %trunc depends only on %iv. Somewhat difficult to track in the above CHECKs, whose vector loop is free of any trunc, has two or's due to unrolling, and a single splice for %for.2 due to dce.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, thanks!

%add = add i64 %iv, 1
store i32 %for.2, ptr %dst, align 4
%icmp = icmp ult i64 %iv, 337
%trunc = trunc i64 %iv to i32
br i1 %icmp, label %loop, label %exit

exit:
ret void
}
Original file line number Diff line number Diff line change
Expand Up @@ -147,14 +147,57 @@ exit:
}

; This test has two FORs (for.x and for.y) where incoming value from the previous
; iteration (for.x.prev) of one FOR (for.y) depends on another FOR (for.x). Due to
; this dependency all uses of the former FOR (for.y) should be sunk after
; incoming value from the previous iteration (for.x.prev) of te latter FOR (for.y).
; That means side-effecting user (store i64 %for.y.i64, ptr %gep) of the latter
; FOR (for.y) should be moved which is not currently supported.
; iteration (for.x.prev) of one FOR (for.y) depends on another FOR (for.x).
; Sinking would require moving a recipe with side effects (store). Instead,
; for.x.prev can be hoisted.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Instead, for.x.prev can be hoisted" - is it important to hoist for.x.prev? (swapping places with the gep)

for.x requires code motion, as its first user for.x.prev precedes its previous for.x.next.

for.y does not require code motion, as its first user for.y.i64 succeeds its previous for.x.prev.

To handle for.x, we avoid sinking its user for.x.prev, because of its role as previous for for.y. Instead, we hoist its previous for.x.next above its user (and all the way out of the loop).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, should be for.x.next. updated, thanks!

define i32 @test_chained_first_order_recurrences_4(ptr %base, i64 %x) {
; CHECK-LABEL: 'test_chained_first_order_recurrences_4'
; CHECK: No VPlans built.
; CHECK: VPlan 'Initial VPlan for VF={4},UF>=1' {
; CHECK-NEXT: Live-in vp<[[VFxUF:%.+]]> = VF * UF
; CHECK-NEXT: Live-in vp<[[VTC:%.+]]> = vector-trip-count
; CHECK-NEXT: Live-in ir<4098> = original trip-count
; CHECK-EMPTY:
; CHECK-NEXT: vector.ph:
; CHECK-NEXT: WIDEN ir<%for.x.next> = mul ir<%x>, ir<2>
; CHECK-NEXT: Successor(s): vector loop
; CHECK-EMPTY:
; CHECK-NEXT: <x1> vector loop: {
; CHECK-NEXT: vector.body:
; CHECK-NEXT: EMIT vp<[[CAN_IV:%.+]]> = CANONICAL-INDUCTION ir<0>, vp<[[CAN_IV_NEXT:%.+]]>
; CHECK-NEXT: FIRST-ORDER-RECURRENCE-PHI ir<%for.x> = phi ir<0>, ir<%for.x.next>
; CHECK-NEXT: FIRST-ORDER-RECURRENCE-PHI ir<%for.y> = phi ir<0>, ir<%for.x.prev>
; CHECK-NEXT: vp<[[SCALAR_STEPS:%.+]]> = SCALAR-STEPS vp<[[CAN_IV]]>, ir<1>
; CHECK-NEXT: EMIT vp<[[SPLICE_X:%.]]> = first-order splice ir<%for.x>, ir<%for.x.next>
; CHECK-NEXT: WIDEN-CAST ir<%for.x.prev> = trunc vp<[[SPLICE_X]]> to i32
; CHECK-NEXT: EMIT vp<[[SPLICE_Y:%.+]]> = first-order splice ir<%for.y>, ir<%for.x.prev>
; CHECK-NEXT: CLONE ir<%gep> = getelementptr ir<%base>, vp<[[SCALAR_STEPS]]>
; CHECK-NEXT: WIDEN-CAST ir<%for.y.i64> = sext vp<[[SPLICE_Y]]> to i64
; CHECK-NEXT: vp<[[VEC_PTR:%.+]]> = vector-pointer ir<%gep>
; CHECK-NEXT: WIDEN store vp<[[VEC_PTR]]>, ir<%for.y.i64>
; CHECK-NEXT: EMIT vp<[[CAN_IV_NEXT]]> = add nuw vp<[[CAN_IV]]>, vp<[[VFxUF]]>
; CHECK-NEXT: EMIT branch-on-count vp<[[CAN_IV_NEXT]]>, vp<[[VTC]]>
; CHECK-NEXT: No successors
; CHECK-NEXT: }
; CHECK-NEXT: Successor(s): middle.block
; CHECK-EMPTY:
; CHECK-NEXT: middle.block:
; CHECK-NEXT: EMIT vp<[[EXT_X:%.+]]> = extract-from-end ir<%for.x.next>, ir<1>
; CHECK-NEXT: EMIT vp<[[EXT_Y:%.+]]> = extract-from-end ir<%for.x.prev>, ir<1>
; CHECK-NEXT: EMIT vp<[[MIDDLE_C:%.+]]> = icmp eq ir<4098>, vp<[[VTC]]>
; CHECK-NEXT: EMIT branch-on-cond vp<[[MIDDLE_C]]>
; CHECK-NEXT: Successor(s): ir-bb<ret>, scalar.ph
; CHECK-EMPTY:
; CHECK-NEXT: ir-bb<ret>:
; CHECK-NEXT: No successors
; CHECK-EMPTY:
; CHECK-NEXT: scalar.ph:
; CHECK-NEXT: EMIT vp<[[RESUME_X:%.+]]> = resume-phi vp<[[EXT_X]]>, ir<0>
; CHECK-NEXT: EMIT vp<[[RESUME_Y:%.+]]> = resume-phi vp<[[EXT_Y]]>, ir<0>
; CHECK-NEXT: No successors
; CHECK-EMPTY:
; CHECK-NEXT: Live-out i64 %for.x = vp<[[RESUME_X]]>
; CHECK-NEXT: Live-out i32 %for.y = vp<[[RESUME_Y]]>
; CHECK-NEXT: }
;
entry:
br label %loop
Expand All @@ -178,7 +221,54 @@ ret:

define i32 @test_chained_first_order_recurrences_5_hoist_to_load(ptr %base) {
; CHECK-LABEL: 'test_chained_first_order_recurrences_5_hoist_to_load'
; CHECK: No VPlans built.
; CHECK: VPlan 'Initial VPlan for VF={4},UF>=1' {
; CHECK-NEXT: Live-in vp<[[VFxUF:%.+]]> = VF * UF
; CHECK-NEXT: Live-in vp<[[VTC:%.+]]> = vector-trip-count
; CHECK-NEXT: Live-in ir<4098> = original trip-count
; CHECK-EMPTY:
; CHECK-NEXT: vector.ph:
; CHECK-NEXT: Successor(s): vector loop
; CHECK-EMPTY:
; CHECK-NEXT: <x1> vector loop: {
; CHECK-NEXT: vector.body:
; CHECK-NEXT: EMIT vp<[[CAN_IV:%.+]]> = CANONICAL-INDUCTION ir<0>, vp<[[CAN_IV_NEXT:%.+]]>
; CHECK-NEXT: FIRST-ORDER-RECURRENCE-PHI ir<%for.x> = phi ir<0>, ir<%for.x.next>
; CHECK-NEXT: FIRST-ORDER-RECURRENCE-PHI ir<%for.y> = phi ir<0>, ir<%for.x.prev>
; CHECK-NEXT: vp<[[SCALAR_STEPS:%.+]]> = SCALAR-STEPS vp<[[CAN_IV]]>, ir<1>
; CHECK-NEXT: CLONE ir<%gep> = getelementptr ir<%base>, vp<[[SCALAR_STEPS]]>
; CHECK-NEXT: vp<[[VEC_PTR:%.+]]> = vector-pointer ir<%gep>
; CHECK-NEXT: WIDEN ir<%l> = load vp<[[VEC_PTR]]>
; CHECK-NEXT: WIDEN ir<%for.x.next> = mul ir<%l>, ir<2>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar case to the one above, except the end - we hoist its previous for.x.next above its user for.x.prev (until reaching a dependence on %l, rather than all the way out of the loop).

; CHECK-NEXT: EMIT vp<[[SPLICE_X:%.]]> = first-order splice ir<%for.x>, ir<%for.x.next>
; CHECK-NEXT: WIDEN-CAST ir<%for.x.prev> = trunc vp<[[SPLICE_X]]> to i32
; CHECK-NEXT: EMIT vp<[[SPLICE_Y:%.+]]> = first-order splice ir<%for.y>, ir<%for.x.prev>
; CHECK-NEXT: WIDEN-CAST ir<%for.y.i64> = sext vp<[[SPLICE_Y]]> to i64
; CHECK-NEXT: vp<[[VEC_PTR:%.+]]> = vector-pointer ir<%gep>
; CHECK-NEXT: WIDEN store vp<[[VEC_PTR]]>, ir<%for.y.i64>
; CHECK-NEXT: EMIT vp<[[CAN_IV_NEXT]]> = add nuw vp<[[CAN_IV]]>, vp<[[VFxUF]]>
; CHECK-NEXT: EMIT branch-on-count vp<[[CAN_IV_NEXT]]>, vp<[[VTC]]>
; CHECK-NEXT: No successors
; CHECK-NEXT: }
; CHECK-NEXT: Successor(s): middle.block
; CHECK-EMPTY:
; CHECK-NEXT: middle.block:
; CHECK-NEXT: EMIT vp<[[EXT_X:%.+]]> = extract-from-end ir<%for.x.next>, ir<1>
; CHECK-NEXT: EMIT vp<[[EXT_Y:%.+]]> = extract-from-end ir<%for.x.prev>, ir<1>
; CHECK-NEXT: EMIT vp<[[MIDDLE_C:%.+]]> = icmp eq ir<4098>, vp<[[VTC]]>
; CHECK-NEXT: EMIT branch-on-cond vp<[[MIDDLE_C]]>
; CHECK-NEXT: Successor(s): ir-bb<ret>, scalar.ph
; CHECK-EMPTY:
; CHECK-NEXT: ir-bb<ret>:
; CHECK-NEXT: No successors
; CHECK-EMPTY:
; CHECK-NEXT: scalar.ph:
; CHECK-NEXT: EMIT vp<[[RESUME_X:%.+]]> = resume-phi vp<[[EXT_X]]>, ir<0>
; CHECK-NEXT: EMIT vp<[[RESUME_Y:%.+]]> = resume-phi vp<[[EXT_Y]]>, ir<0>
; CHECK-NEXT: No successors
; CHECK-EMPTY:
; CHECK-NEXT: Live-out i64 %for.x = vp<[[RESUME_X]]>
; CHECK-NEXT: Live-out i32 %for.y = vp<[[RESUME_Y]]>
; CHECK-NEXT: }
;
entry:
br label %loop
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -385,19 +385,51 @@ exit:
define void @hoist_previous_value_and_operands(ptr %dst, i64 %mask) {
; CHECK-LABEL: @hoist_previous_value_and_operands(
; CHECK-NEXT: bb:
; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; CHECK: vector.ph:
; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i64> poison, i64 [[MASK:%.*]], i64 0
; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT]], <4 x i64> poison, <4 x i32> zeroinitializer
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[VEC_IND:%.*]] = phi <4 x i64> [ <i64 1, i64 2, i64 3, i64 4>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[VECTOR_RECUR:%.*]] = phi <4 x i32> [ <i32 poison, i32 poison, i32 poison, i32 1>, [[VECTOR_PH]] ], [ [[TMP2:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[VECTOR_RECUR1:%.*]] = phi <4 x i32> [ <i32 poison, i32 poison, i32 poison, i32 0>, [[VECTOR_PH]] ], [ [[TMP4:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[OFFSET_IDX:%.*]] = add i64 1, [[INDEX]]
; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[OFFSET_IDX]], 0
; CHECK-NEXT: [[TMP1:%.*]] = and <4 x i64> [[VEC_IND]], [[BROADCAST_SPLAT]]
; CHECK-NEXT: [[TMP2]] = trunc <4 x i64> [[TMP1]] to <4 x i32>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar case to PR106523 above, except here the loop is vectorized w/o being unrolled, the previous truncation gets vectorized (rather than folded into an IV) and hoisted along with its AND operand above the user 'or', the store is consecutive (rather than to an invariant address). Name should use plural hoist_previous_value_and_operands or singular hoist_previous_value_and_operand?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, this is a target-independent version. Added a comment, thanks!

And dropped the s from operands

; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[VECTOR_RECUR]], <4 x i32> [[TMP2]], <4 x i32> <i32 3, i32 4, i32 5, i32 6>
; CHECK-NEXT: [[TMP4]] = or <4 x i32> [[TMP3]], zeroinitializer
; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[VECTOR_RECUR1]], <4 x i32> [[TMP4]], <4 x i32> <i32 3, i32 4, i32 5, i32 6>
; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds i32, ptr [[DST:%.*]], i64 [[TMP0]]
; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i32, ptr [[TMP6]], i32 0
; CHECK-NEXT: store <4 x i32> [[TMP5]], ptr [[TMP7]], align 4
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
; CHECK-NEXT: [[VEC_IND_NEXT]] = add <4 x i64> [[VEC_IND]], <i64 4, i64 4, i64 4, i64 4>
; CHECK-NEXT: [[TMP8:%.*]] = icmp eq i64 [[INDEX_NEXT]], 336
; CHECK-NEXT: br i1 [[TMP8]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
; CHECK: middle.block:
; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i32> [[TMP2]], i32 3
; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT2:%.*]] = extractelement <4 x i32> [[TMP4]], i32 3
; CHECK-NEXT: br i1 false, label [[EXIT:%.*]], label [[SCALAR_PH]]
; CHECK: scalar.ph:
; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 337, [[MIDDLE_BLOCK]] ], [ 1, [[BB:%.*]] ]
; CHECK-NEXT: [[SCALAR_RECUR_INIT:%.*]] = phi i32 [ [[VECTOR_RECUR_EXTRACT]], [[MIDDLE_BLOCK]] ], [ 1, [[BB]] ]
; CHECK-NEXT: [[SCALAR_RECUR_INIT3:%.*]] = phi i32 [ [[VECTOR_RECUR_EXTRACT2]], [[MIDDLE_BLOCK]] ], [ 0, [[BB]] ]
; CHECK-NEXT: br label [[LOOP:%.*]]
; CHECK: loop:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should have probably been check not vectorized.

; CHECK-NEXT: [[IV:%.*]] = phi i64 [ [[ADD:%.*]], [[LOOP]] ], [ 1, [[BB:%.*]] ]
; CHECK-NEXT: [[FOR_1:%.*]] = phi i32 [ [[TRUNC:%.*]], [[LOOP]] ], [ 1, [[BB]] ]
; CHECK-NEXT: [[FOR_2:%.*]] = phi i32 [ [[OR:%.*]], [[LOOP]] ], [ 0, [[BB]] ]
; CHECK-NEXT: [[IV:%.*]] = phi i64 [ [[ADD:%.*]], [[LOOP]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
; CHECK-NEXT: [[FOR_1:%.*]] = phi i32 [ [[TRUNC:%.*]], [[LOOP]] ], [ [[SCALAR_RECUR_INIT]], [[SCALAR_PH]] ]
; CHECK-NEXT: [[FOR_2:%.*]] = phi i32 [ [[OR:%.*]], [[LOOP]] ], [ [[SCALAR_RECUR_INIT3]], [[SCALAR_PH]] ]
; CHECK-NEXT: [[OR]] = or i32 [[FOR_1]], 0
; CHECK-NEXT: [[ADD]] = add i64 [[IV]], 1
; CHECK-NEXT: [[GEP:%.*]] = getelementptr inbounds i32, ptr [[DST:%.*]], i64 [[IV]]
; CHECK-NEXT: [[GEP:%.*]] = getelementptr inbounds i32, ptr [[DST]], i64 [[IV]]
; CHECK-NEXT: store i32 [[FOR_2]], ptr [[GEP]], align 4
; CHECK-NEXT: [[ICMP:%.*]] = icmp ult i64 [[IV]], 337
; CHECK-NEXT: [[A:%.*]] = and i64 [[IV]], [[MASK:%.*]]
; CHECK-NEXT: [[A:%.*]] = and i64 [[IV]], [[MASK]]
; CHECK-NEXT: [[TRUNC]] = trunc i64 [[A]] to i32
; CHECK-NEXT: br i1 [[ICMP]], label [[LOOP]], label [[EXIT:%.*]]
; CHECK-NEXT: br i1 [[ICMP]], label [[LOOP]], label [[EXIT]], !llvm.loop [[LOOP7:![0-9]+]]
; CHECK: exit:
; CHECK-NEXT: ret void
;
Expand Down
Loading