@jdenny-ornl (Collaborator) commented Sep 2, 2025

The original loop (OL) that serves as input to LoopUnroll has basic blocks that are arranged as follows:

```
OLPreHeader
OLHeader <-.
...        |
OLLatch ---'
OLExit
```

In this depiction, every block has an implicit edge to the next block below, so any explicit edge indicates a conditional branch.

Given OL and unroll count N, LoopUnroll sometimes creates an unrolled loop (UL) with a remainder loop (RL) epilogue arranged like this:

```
,-- ULGuard
|   ULPreHeader
|   ULHeader <-.
|   ...        |
|   ULLatch ---'
|   ULExit
`-> RLGuard -----.
    RLPreHeader  |
,-> RLHeader     |
|   ...          |
`-- RLLatch      |
    RLExit       |
    OLExit <-----'
```

Each UL iteration executes N OL iterations, but each RL iteration executes 1 OL iteration. ULGuard or RLGuard checks whether the first iteration of UL or RL should execute, respectively. If so, ULLatch or RLLatch checks whether to execute each subsequent iteration.
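
For concreteness, a small worked example (numbers mine, not from the patch): for an OL trip count t and unroll count N, this arrangement executes

$$\left\lfloor t/N \right\rfloor \text{ UL iterations} \quad\text{and}\quad (t \bmod N) \text{ RL iterations},$$

so with t = 10 and N = 4, UL runs 2 iterations (covering 8 OL iterations) and RL runs the remaining 2. When t mod N is 0, RLGuard skips RL; when t < N, ULGuard skips UL.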

Once reached, OL always executes its first iteration but not necessarily the next N-1 iterations. Thus, ULGuard is always required before the first UL iteration. However, when control flows from ULGuard directly to RLGuard, the first OL iteration has yet to execute, so RLGuard is then redundant before the first RL iteration.

Thus, this patch makes the following changes:

  • Adjust ULGuard to branch to RLPreHeader instead of RLGuard, thus eliminating RLGuard's unnecessary branch instruction for that path.
  • Eliminate the creation of poison values in RLGuard phi nodes. Without this patch, RLGuard has such a phi node for each value that is defined in some OL iteration and used in OLExit, and a poison value is required where ULGuard is the predecessor. The poison value indicates that control flow from ULGuard through RLGuard to OLExit has no counterpart in OL because the first OL iteration must execute in either UL or RL.
  • Simplify the CFG by not splitting ULExit from RLGuard: without the ULGuard predecessor, the single combined block can serve as a dedicated UL exit.
  • To RLPreHeader, add an llvm.assume call that asserts the RL trip count is non-zero. Without this patch, RLPreHeader is reachable only when RLGuard guarantees that assertion is true. With this patch, RLGuard guarantees it only when RLGuard is the predecessor, and the OL structure guarantees it when ULGuard is the predecessor. If RL itself is unrolled later, this guarantee somehow prevents ScalarEvolution from giving up when trying to compute a maximum trip count for RL. That maximum trip count enables the branch instruction in the final unrolled instance of RLLatch to be eliminated. Without the llvm.assume call, some existing unroll tests start to fail because that instruction is not eliminated. (See the IR sketch just after this list.)
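
Sketching the resulting IR around RLGuard and RLPreHeader (a hand-written illustration using the block names from the depiction above and invented value names; see the updated tests in the diff for actual output):

```
ULExit:                                  ; after this patch, also serves as RLGuard
  %iv.unr = phi i64 [ %iv.next, %ULLatch ]        ; LCSSA phi, single incoming
  %lcmp.mod = icmp ne i64 %xtraiter, 0
  br i1 %lcmp.mod, label %RLPreHeader, label %OLExit

RLPreHeader:                             ; reached from RLGuard and from ULGuard
  %iv.epil.init = phi i64 [ 0, %ULGuard ], [ %iv.unr, %ULExit ]
  %lcmp.mod1 = icmp ne i64 %xtraiter, 0
  call void @llvm.assume(i1 %lcmp.mod1)           ; RL trip count is non-zero
  br label %RLHeader
```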

The original motivation for this patch is to facilitate later patches that fix LoopUnroll's computation of branch weights so that they maintain the block frequency of OL's body (see #135812). Specifically, this patch ensures RLGuard's branch weights do not affect RL's contribution to the block frequency of OL's body in the case that ULGuard skips UL.

For example:

```
declare void @f(i32)

define void @test(i32 %n) {
entry:
  br label %do.body

do.body:
  %i = phi i32 [ 0, %entry ], [ %inc, %do.body ]
  %inc = add i32 %i, 1
  call void @f(i32 %i)
  %c = icmp sge i32 %inc, %n
  br i1 %c, label %do.end, label %do.body, !prof !0

do.end:
  ret void
}

!0 = !{!"branch_weights", i32 1, i32 9}
```

Given those branch weights, once any loop iteration is actually
reached, the probability of the loop exiting at the iteration's end is
1/(1+9).  That is, the loop is expected to exit after about 10
iterations and thus has an estimated trip count of 10.  `opt
-passes='print<block-freq>'` shows that 10 is indeed the frequency of
the loop body:

```
Printing analysis results of BFI for function 'test':
block-frequency-info: test
 - entry: float = 1.0, int = 1801439852625920
 - do.body: float = 10.0, int = 18014398509481984
 - do.end: float = 1.0, int = 1801439852625920
```

Key Observation: The frequency of reaching any particular iteration is
less than for the previous iteration because the previous iteration
has a non-zero probability of exiting the loop.  This observation
holds even though every loop iteration, once actually reached, has
exactly the same probability of exiting and thus exactly the same
branch weights.
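
In symbols, with per-iteration exit probability p = 1/10: the frequency of reaching iteration k and the total body frequency are

$$f_k = (1-p)^{k-1} = 0.9^{\,k-1}, \qquad \sum_{k=1}^{\infty} f_k = \frac{1}{p} = 10,$$

so f_1 = 1.0, f_2 = 0.9, f_3 = 0.81, and so on, decreasing even though p is constant.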

Now we use `opt -unroll-force-peel-count=2 -passes=loop-unroll` to
peel 2 iterations and insert them before the remaining loop.  We
expect the key observation above not to change, but it does under the
implementation without this patch.  The block frequency becomes 1.0
for the first iteration, 0.9 for the second, and 6.4 for the main loop
body.  Again, a decreasing frequency is expected, but it decreases too
much: the total frequency of the original loop body becomes 8.3.  The
new branch weights reveal the problem:

```
!0 = !{!"branch_weights", i32 1, i32 9}
!1 = !{!"branch_weights", i32 1, i32 8}
!2 = !{!"branch_weights", i32 1, i32 7}
```

The exit probability is now 1/10 for the first peeled iteration, 1/9
for the second, and 1/8 for the remaining loop iterations.  It seems
this behavior is trying to ensure a decreasing block frequency.
However, as in the key observation above for the original loop, that
happens correctly without decreasing the branch weights across
iterations.

This patch changes the peeling implementation not to decrease the
branch weights across loop iterations so that the frequency for every
iteration is the same as it was in the original loop.  The total
frequency of the loop body, summed across all its occurrences, thus
remains 10 after peeling.
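
Checking that arithmetic: the two peeled iterations keep frequencies f_1 = 1.0 and f_2 = 0.9, and the remaining loop retains the tail of the geometric sum,

$$\sum_{k=3}^{\infty} 0.9^{\,k-1} = 0.9^2 \cdot 10 = 8.1, \qquad 1.0 + 0.9 + 8.1 = 10.$$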

Unfortunately, that change means a later analysis cannot accurately
estimate the trip count of the remaining loop while examining the
remaining loop in isolation without considering the probability of
actually reaching it.  For that purpose, this patch stores the new
trip count as separate metadata named `llvm.loop.estimated_trip_count`
and extends `llvm::getLoopEstimatedTripCount` to prefer it, if
present, over branch weights.
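
As a sketch of the result on the example above (the exact metadata encoding is defined by this patch; the operand layout shown here is an assumption, following the style of other `llvm.loop.*` metadata), the remaining loop after peeling 2 of an estimated 10 iterations might carry:

```
  br i1 %c, label %do.end, label %do.body, !prof !0, !llvm.loop !1

!0 = !{!"branch_weights", i32 1, i32 9}           ; unchanged across iterations
!1 = distinct !{!1, !2}
!2 = !{!"llvm.loop.estimated_trip_count", i32 8}  ; 10 estimated - 2 peeled
```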

An alternative fix is for `llvm::getLoopEstimatedTripCount` to
subtract the `llvm.loop.peeled.count` metadata from the trip count
estimated by a loop's branch weights.  However, there might be other
loop transformations that still corrupt block frequencies in a similar
manner and require a similar fix.  `llvm.loop.estimated_trip_count` is
intended to provide a general way to store estimated trip counts when
branch weights cannot directly store them.
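
For the example above, that alternative would compute 10 - 2 = 8, the trip count estimated from the branch weights minus the `llvm.loop.peeled.count` value, which matches what `llvm.loop.estimated_trip_count` stores directly.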

This patch introduces several fixme comments that need to be addressed
before it can land.

Extending beyond the limitations of `getExpectedExitLoopLatchBranch`
is a possible improvement for the future, not an urgent fixme.

No one has pointed out code that computes estimated trip counts
without using `llvm::getLoopEstimatedTripCount`.
The update adds this PR's new metadata.

This patch implements the `llvm.loop.estimated_trip_count` metadata
discussed in [[RFC] Fix Loop Transformations to Preserve Block
Frequencies](https://discourse.llvm.org/t/rfc-fix-loop-transformations-to-preserve-block-frequencies/85785).
As [suggested in the RFC
comments](https://discourse.llvm.org/t/rfc-fix-loop-transformations-to-preserve-block-frequencies/85785/4),
it adds the new metadata to all loops at the time of profile ingestion
and estimates each trip count from the loop's `branch_weights`
metadata.  As [suggested in the PR#128785
review](#128785 (comment)),
it does so via a `PGOEstimateTripCountsPass` pass, which creates the
new metadata for the loop but omits the value if it cannot estimate a
trip count due to the loop's form.
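
Continuing the metadata sketch from above (the encoding is again an assumption), the pass would emit one of two forms:

```
!1 = !{!"llvm.loop.estimated_trip_count", i32 10} ; trip count was estimable
!2 = !{!"llvm.loop.estimated_trip_count"}         ; loop form prevented an estimate
```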

An important observation not previously discussed is that
`PGOEstimateTripCountsPass` *often* cannot estimate a loop's trip
count but later passes can transform the loop in a way that makes it
possible.  Currently, such passes do not necessarily update the
metadata, but eventually that should be fixed.  Until then, if the new
metadata has no value, `llvm::getLoopEstimatedTripCount` disregards it
and tries again to estimate the trip count from the loop's
`branch_weights` metadata.

Somehow, on some of my builds, `llvm::` prefixes are dropped from some
symbol names in the printed pass list.

That's PR #128785, which is now a parent PR.

First, remove a todo that is now documented, more generally than in
LoopPeel, in plenty of other places.  Second, update LoopPeel's
setLoopEstimatedTripCount call to avoid a now-redundant argument that
eventually won't be supported.

```
// Adding a value to the new PHI node from the unrolling loop latch.
NewPN->addIncoming(PN.getIncomingValueForBlock(Latch), Latch);
// Add new PHI nodes to the loop exit block.
PHINode *NewPN0 = PHINode::Create(PN.getType(), 1, PN.getName() + ".unr");
```

A reviewer (Member) commented:

do you need a phi if all you have is 1 incoming value?

also, a nit: could you please add an argument comment for what 1 and then further below, 2, are?

jdenny-ornl (Collaborator, Author) replied:

> do you need a phi if all you have is 1 incoming value?

I believe that is required for LCSSA form, given that the incoming value comes from the latch (inside the loop) and the phi is on an exiting edge.
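
A minimal generic example of that rule (not code from this patch):

```
define i64 @count(i64 %n) {
entry:
  br label %loop

loop:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
  %iv.next = add i64 %iv, 1
  %c = icmp uge i64 %iv.next, %n
  br i1 %c, label %exit, label %loop

exit:
  ; LCSSA: a loop-defined value used outside the loop must pass through a
  ; phi in the exit block, even though that phi has only one incoming value.
  %iv.lcssa = phi i64 [ %iv.next, %loop ]
  ret i64 %iv.lcssa
}
```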

> also, a nit: could you please add an argument comment for what 1 and then further below, 2, are?

Will do.

@llvmbot (Member) commented Sep 3, 2025

@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-backend-risc-v

@llvm/pr-subscribers-backend-amdgpu

Author: Joel E. Denny (jdenny-ornl)

Patch is 308.36 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/156549.diff

38 Files Affected:

  • (modified) llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp (+59-33)
  • (modified) llvm/test/DebugInfo/KeyInstructions/Generic/loop-unroll-runtime.ll (+1-1)
  • (modified) llvm/test/Transforms/HardwareLoops/ARM/structure.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopUnroll/AArch64/apple-unrolling.ll (+55-56)
  • (modified) llvm/test/Transforms/LoopUnroll/AArch64/runtime-unroll-generic.ll (+6-4)
  • (modified) llvm/test/Transforms/LoopUnroll/AArch64/vector.ll (+15-15)
  • (modified) llvm/test/Transforms/LoopUnroll/AMDGPU/unroll-runtime.ll (+8-8)
  • (modified) llvm/test/Transforms/LoopUnroll/ARM/multi-blocks.ll (+21-23)
  • (modified) llvm/test/Transforms/LoopUnroll/Hexagon/reuse-lcssa-phi-scev-expansion.ll (+8-8)
  • (modified) llvm/test/Transforms/LoopUnroll/PowerPC/p8-unrolling-legalize-vectors-inseltpoison.ll (+6-6)
  • (modified) llvm/test/Transforms/LoopUnroll/PowerPC/p8-unrolling-legalize-vectors.ll (+6-6)
  • (modified) llvm/test/Transforms/LoopUnroll/RISCV/vector.ll (+7-7)
  • (modified) llvm/test/Transforms/LoopUnroll/WebAssembly/basic-unrolling.ll (+6-4)
  • (modified) llvm/test/Transforms/LoopUnroll/convergent.controlled.ll (+5-5)
  • (modified) llvm/test/Transforms/LoopUnroll/followup.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-epilog-debuginfo.ll (+1-3)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-exit-phi-scev-invalidation.ll (+10-11)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-i128.ll (+8-8)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-loop-at-most-two-exits.ll (+12-13)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-loop-branchweight.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-loop-multiple-exits.ll (+405-422)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-loop.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-loop1.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-loop2.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-loop5.ll (+11-12)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-multiexit-heuristic.ll (+58-63)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-unroll-assume-no-remainder.ll (+7-7)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-unroll-reductions.ll (+31-34)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-unroll-remainder.ll (+13-14)
  • (modified) llvm/test/Transforms/LoopUnroll/scev-invalidation-lcssa.ll (+8-8)
  • (modified) llvm/test/Transforms/LoopUnroll/tripcount-overflow.ll (+8-9)
  • (modified) llvm/test/Transforms/LoopUnroll/unroll-heuristics-pgo.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopUnroll/unroll-loads-cse.ll (+35-35)
  • (modified) llvm/test/Transforms/LoopUnrollAndJam/dependencies_visit_order.ll (+7-8)
  • (modified) llvm/test/Transforms/LoopUnrollAndJam/followup.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopUnrollAndJam/unroll-and-jam.ll (+51-58)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/float-induction-x86.ll (+25-18)
  • (modified) llvm/test/Transforms/PhaseOrdering/AArch64/extra-unroll-simplifications.ll (+14-10)

```
diff --git a/llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp b/llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp
index bf882d7406853..6312831cf0ee0 100644
--- a/llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp
+++ b/llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp
@@ -201,18 +201,27 @@ static void ConnectProlog(Loop *L, Value *BECount, unsigned Count,
 /// unroll count is non-zero.
 ///
 /// This function performs the following:
-/// - Update PHI nodes at the unrolling loop exit and epilog loop exit
-/// - Create PHI nodes at the unrolling loop exit to combine
-///   values that exit the unrolling loop code and jump around it.
+/// - Update PHI nodes at the epilog loop exit
+/// - Create PHI nodes at the unrolling loop exit and epilog preheader to
+///   combine values that exit the unrolling loop code and jump around it.
 /// - Update PHI operands in the epilog loop by the new PHI nodes
-/// - Branch around the epilog loop if extra iters (ModVal) is zero.
+/// - At the unrolling loop exit, branch around the epilog loop if extra iters
+///   (ModVal) is zero.
+/// - At the epilog preheader, add an llvm.assume call that extra iters is
+///   non-zero.  If the unrolling loop exit is the predecessor, the above new
+///   branch guarantees that assumption.  If the unrolling loop preheader is the
+///   predecessor, then the required first iteration from the original loop has
+///   yet to be executed, so it must be executed in the epilog loop.  If we
+///   later unroll the epilog loop, that llvm.assume call somehow enables
+///   ScalarEvolution to compute an epilog loop maximum trip count, which enables
+///   eliminating the branch at the end of the final unrolled epilog iteration.
 ///
 static void ConnectEpilog(Loop *L, Value *ModVal, BasicBlock *NewExit,
                           BasicBlock *Exit, BasicBlock *PreHeader,
                           BasicBlock *EpilogPreHeader, BasicBlock *NewPreHeader,
                           ValueToValueMapTy &VMap, DominatorTree *DT,
                           LoopInfo *LI, bool PreserveLCSSA, ScalarEvolution &SE,
-                          unsigned Count) {
+                          unsigned Count, AssumptionCache &AC) {
   BasicBlock *Latch = L->getLoopLatch();
   assert(Latch && "Loop must have a latch");
   BasicBlock *EpilogLatch = cast<BasicBlock>(VMap[Latch]);
@@ -231,7 +240,7 @@ static void ConnectEpilog(Loop *L, Value *ModVal, BasicBlock *NewExit,
   //   EpilogLatch
   // Exit (EpilogPN)
 
-  // Update PHI nodes at NewExit and Exit.
+  // Update PHI nodes at Exit.
   for (PHINode &PN : NewExit->phis()) {
     // PN should be used in another PHI located in Exit block as
     // Exit was split by SplitBlockPredecessors into Exit and NewExit
@@ -246,15 +255,11 @@ static void ConnectEpilog(Loop *L, Value *ModVal, BasicBlock *NewExit,
     // epilogue edges have already been added.
     //
     // There is EpilogPreHeader incoming block instead of NewExit as
-    // NewExit was spilt 1 more time to get EpilogPreHeader.
+    // NewExit was split 1 more time to get EpilogPreHeader.
     assert(PN.hasOneUse() && "The phi should have 1 use");
     PHINode *EpilogPN = cast<PHINode>(PN.use_begin()->getUser());
     assert(EpilogPN->getParent() == Exit && "EpilogPN should be in Exit block");
 
-    // Add incoming PreHeader from branch around the Loop
-    PN.addIncoming(PoisonValue::get(PN.getType()), PreHeader);
-    SE.forgetValue(&PN);
-
     Value *V = PN.getIncomingValueForBlock(Latch);
     Instruction *I = dyn_cast<Instruction>(V);
     if (I && L->contains(I))
@@ -271,35 +276,52 @@ static void ConnectEpilog(Loop *L, Value *ModVal, BasicBlock *NewExit,
                                NewExit);
     // Now PHIs should look like:
     // NewExit:
-    //   PN = PHI [I, Latch], [poison, PreHeader]
+    //   PN = PHI [I, Latch]
     // ...
     // Exit:
     //   EpilogPN = PHI [PN, NewExit], [VMap[I], EpilogLatch]
   }
 
-  // Create PHI nodes at NewExit (from the unrolling loop Latch and PreHeader).
-  // Update corresponding PHI nodes in epilog loop.
+  // Create PHI nodes at NewExit (from the unrolling loop Latch) and at
+  // EpilogPreHeader (from PreHeader and NewExit).  Update corresponding PHI
+  // nodes in epilog loop.
   for (BasicBlock *Succ : successors(Latch)) {
     // Skip this as we already updated phis in exit blocks.
     if (!L->contains(Succ))
       continue;
+
+    // Succ here appears to always be just L->getHeader().  Otherwise, how do we
+    // know its corresponding epilog block (from VMap) is EpilogHeader and thus
+    // EpilogPreHeader is the right incoming block for VPN, as set below?
+    // TODO: Can we thus avoid the enclosing loop over successors?
+    assert(Succ == L->getHeader() &&
+           "Expect the only in-loop successor of latch to be the loop header");
+
     for (PHINode &PN : Succ->phis()) {
-      // Add new PHI nodes to the loop exit block and update epilog
-      // PHIs with the new PHI values.
-      PHINode *NewPN = PHINode::Create(PN.getType(), 2, PN.getName() + ".unr");
-      NewPN->insertBefore(NewExit->getFirstNonPHIIt());
-      // Adding a value to the new PHI node from the unrolling loop preheader.
-      NewPN->addIncoming(PN.getIncomingValueForBlock(NewPreHeader), PreHeader);
-      // Adding a value to the new PHI node from the unrolling loop latch.
-      NewPN->addIncoming(PN.getIncomingValueForBlock(Latch), Latch);
+      // Add new PHI nodes to the loop exit block.
+      PHINode *NewPN0 = PHINode::Create(PN.getType(), /*NumReservedValues=*/1,
+                                        PN.getName() + ".unr");
+      NewPN0->insertBefore(NewExit->getFirstNonPHIIt());
+      // Add value to the new PHI node from the unrolling loop latch.
+      NewPN0->addIncoming(PN.getIncomingValueForBlock(Latch), Latch);
+
+      // Add new PHI nodes to EpilogPreHeader.
+      PHINode *NewPN1 = PHINode::Create(PN.getType(), /*NumReservedValues=*/2,
+                                        PN.getName() + ".epil.init");
+      NewPN1->insertBefore(EpilogPreHeader->getFirstNonPHIIt());
+      // Add value to the new PHI node from the unrolling loop preheader.
+      NewPN1->addIncoming(PN.getIncomingValueForBlock(NewPreHeader), PreHeader);
+      // Add value to the new PHI node from the epilog loop guard.
+      NewPN1->addIncoming(NewPN0, NewExit);
 
       // Update the existing PHI node operand with the value from the new PHI
       // node.  Corresponding instruction in epilog loop should be PHI.
       PHINode *VPN = cast<PHINode>(VMap[&PN]);
-      VPN->setIncomingValueForBlock(EpilogPreHeader, NewPN);
+      VPN->setIncomingValueForBlock(EpilogPreHeader, NewPN1);
     }
   }
 
+  // In NewExit, branch around the epilog loop if no extra iters.
   Instruction *InsertPt = NewExit->getTerminator();
   IRBuilder<> B(InsertPt);
   Value *BrLoopExit = B.CreateIsNotNull(ModVal, "lcmp.mod");
@@ -308,7 +330,7 @@ static void ConnectEpilog(Loop *L, Value *ModVal, BasicBlock *NewExit,
   SmallVector<BasicBlock*, 4> Preds(predecessors(Exit));
   SplitBlockPredecessors(Exit, Preds, ".epilog-lcssa", DT, LI, nullptr,
                          PreserveLCSSA);
-  // Add the branch to the exit block (around the unrolling loop)
+  // Add the branch to the exit block (around the epilog loop)
   MDNode *BranchWeights = nullptr;
   if (hasBranchWeightMD(*Latch->getTerminator())) {
     // Assume equal distribution in interval [0, Count).
@@ -322,10 +344,11 @@ static void ConnectEpilog(Loop *L, Value *ModVal, BasicBlock *NewExit,
     DT->changeImmediateDominator(Exit, NewDom);
   }
 
-  // Split the main loop exit to maintain canonicalization guarantees.
-  SmallVector<BasicBlock*, 4> NewExitPreds{Latch};
-  SplitBlockPredecessors(NewExit, NewExitPreds, ".loopexit", DT, LI, nullptr,
-                         PreserveLCSSA);
+  // In EpilogPreHeader, assume extra iters is non-zero.
+  IRBuilder<> B2(EpilogPreHeader, EpilogPreHeader->getFirstNonPHIIt());
+  Value *ModIsNotNull = B2.CreateIsNotNull(ModVal, "lcmp.mod");
+  AssumeInst *AI = cast<AssumeInst>(B2.CreateAssumption(ModIsNotNull));
+  AC.registerAssumption(AI);
 }
 
 /// Create a clone of the blocks in a loop and connect them together. A new
@@ -795,7 +818,8 @@ bool llvm::UnrollRuntimeLoopRemainder(
                                            ConstantInt::get(BECount->getType(),
                                                             Count - 1)) :
                            B.CreateIsNotNull(ModVal, "lcmp.mod");
-  BasicBlock *RemainderLoop = UseEpilogRemainder ? NewExit : PrologPreHeader;
+  BasicBlock *RemainderLoop =
+      UseEpilogRemainder ? EpilogPreHeader : PrologPreHeader;
   BasicBlock *UnrollingLoop = UseEpilogRemainder ? NewPreHeader : PrologExit;
   // Branch to either remainder (extra iterations) loop or unrolling loop.
   MDNode *BranchWeights = nullptr;
@@ -808,7 +832,7 @@ bool llvm::UnrollRuntimeLoopRemainder(
   PreHeaderBR->eraseFromParent();
   if (DT) {
     if (UseEpilogRemainder)
-      DT->changeImmediateDominator(NewExit, PreHeader);
+      DT->changeImmediateDominator(EpilogPreHeader, PreHeader);
     else
       DT->changeImmediateDominator(PrologExit, PreHeader);
   }
@@ -880,7 +904,8 @@ bool llvm::UnrollRuntimeLoopRemainder(
   // from both the original loop and the remainder code reaching the exit
   // blocks. While the IDom of these exit blocks were from the original loop,
   // now the IDom is the preheader (which decides whether the original loop or
-  // remainder code should run).
+  // remainder code should run) unless the block still has just the original
+  // predecessor (such as NewExit in the case of an epilog remainder).
   if (DT && !L->getExitingBlock()) {
     SmallVector<BasicBlock *, 16> ChildrenToUpdate;
     // NB! We have to examine the dom children of all loop blocks, not just
@@ -891,7 +916,8 @@ bool llvm::UnrollRuntimeLoopRemainder(
       auto *DomNodeBB = DT->getNode(BB);
       for (auto *DomChild : DomNodeBB->children()) {
         auto *DomChildBB = DomChild->getBlock();
-        if (!L->contains(LI->getLoopFor(DomChildBB)))
+        if (!L->contains(LI->getLoopFor(DomChildBB)) &&
+            DomChildBB->getUniquePredecessor() != BB)
           ChildrenToUpdate.push_back(DomChildBB);
       }
     }
@@ -930,7 +956,7 @@ bool llvm::UnrollRuntimeLoopRemainder(
     // Connect the epilog code to the original loop and update the
     // PHI functions.
     ConnectEpilog(L, ModVal, NewExit, LatchExit, PreHeader, EpilogPreHeader,
-                  NewPreHeader, VMap, DT, LI, PreserveLCSSA, *SE, Count);
+                  NewPreHeader, VMap, DT, LI, PreserveLCSSA, *SE, Count, *AC);
 
     // Update counter in loop for unrolling.
     // Use an incrementing IV.  Pre-incr/post-incr is backedge/trip count.
diff --git a/llvm/test/DebugInfo/KeyInstructions/Generic/loop-unroll-runtime.ll b/llvm/test/DebugInfo/KeyInstructions/Generic/loop-unroll-runtime.ll
index d23afaed9a047..abcc566ec5633 100644
--- a/llvm/test/DebugInfo/KeyInstructions/Generic/loop-unroll-runtime.ll
+++ b/llvm/test/DebugInfo/KeyInstructions/Generic/loop-unroll-runtime.ll
@@ -5,7 +5,7 @@
 ;; Check atoms are remapped for runtime unrolling.
 
 ; CHECK: for.body.epil:
-; CHECK-NEXT: store i64 %indvars.iv.unr, ptr %p, align 4, !dbg [[G2R1:!.*]]
+; CHECK-NEXT: store i64 %indvars.iv.epil.init, ptr %p, align 4, !dbg [[G2R1:!.*]]
 
 ; CHECK: for.body.epil.1:
 ; CHECK-NEXT: store i64 %indvars.iv.next.epil, ptr %p, align 4, !dbg [[G3R1:!.*]]
diff --git a/llvm/test/Transforms/HardwareLoops/ARM/structure.ll b/llvm/test/Transforms/HardwareLoops/ARM/structure.ll
index cb66fefbfcc85..6993fd16dbad5 100644
--- a/llvm/test/Transforms/HardwareLoops/ARM/structure.ll
+++ b/llvm/test/Transforms/HardwareLoops/ARM/structure.ll
@@ -321,10 +321,10 @@ for.inc:                                          ; preds = %sw.bb, %sw.bb1, %fo
 ; CHECK-UNROLL-NOT: dls
 ; CHECK-UNROLL:     [[LOOP:.LBB[0-9_]+]]: @ %for.body
 ; CHECK-UNROLL:     le lr, [[LOOP]]
-; CHECK-UNROLL:     wls lr, r12, [[EXIT:.LBB[0-9_]+]]
+; CHECK-UNROLL:     dls lr, r12
 ; CHECK-UNROLL:     [[EPIL:.LBB[0-9_]+]]:
 ; CHECK-UNROLL:     le lr, [[EPIL]]
-; CHECK-UNROLL-NEXT: [[EXIT]]
+; CHECK-UNROLL-NEXT: {{\.LBB[0-9_]+}}: @ %for.cond.cleanup
 
 define void @unroll_inc_int(ptr nocapture %a, ptr nocapture readonly %b, ptr nocapture readonly %c, i32 %N) {
 entry:
@@ -357,10 +357,10 @@ for.body:
 ; CHECK-UNROLL-NOT: dls
 ; CHECK-UNROLL:     [[LOOP:.LBB[0-9_]+]]: @ %for.body
 ; CHECK-UNROLL:     le lr, [[LOOP]]
-; CHECK-UNROLL:     wls lr, r12, [[EPIL_EXIT:.LBB[0-9_]+]]
+; CHECK-UNROLL:     dls lr, r12
 ; CHECK-UNROLL: [[EPIL:.LBB[0-9_]+]]:
 ; CHECK-UNROLL:     le lr, [[EPIL]]
-; CHECK-UNROLL: [[EPIL_EXIT]]:
+; CHECK-UNROLL:     {{\.LBB[0-9_]+}}: @ %for.cond.cleanup
 ; CHECK-UNROLL:     pop
 define void @unroll_inc_unsigned(ptr nocapture %a, ptr nocapture readonly %b, ptr nocapture readonly %c, i32 %N) {
 entry:
diff --git a/llvm/test/Transforms/LoopUnroll/AArch64/apple-unrolling.ll b/llvm/test/Transforms/LoopUnroll/AArch64/apple-unrolling.ll
index a62e45d27a5c5..ef105d7adc05b 100644
--- a/llvm/test/Transforms/LoopUnroll/AArch64/apple-unrolling.ll
+++ b/llvm/test/Transforms/LoopUnroll/AArch64/apple-unrolling.ll
@@ -15,7 +15,7 @@ define void @small_load_store_loop(ptr %src, ptr %dst, i64 %N, i64 %scale) {
 ; APPLE-NEXT:    [[TMP0:%.*]] = add i64 [[N]], -1
 ; APPLE-NEXT:    [[XTRAITER:%.*]] = and i64 [[N]], 7
 ; APPLE-NEXT:    [[TMP1:%.*]] = icmp ult i64 [[TMP0]], 7
-; APPLE-NEXT:    br i1 [[TMP1]], label %[[EXIT_UNR_LCSSA:.*]], label %[[ENTRY_NEW:.*]]
+; APPLE-NEXT:    br i1 [[TMP1]], label %[[LOOP_EPIL_PREHEADER:.*]], label %[[ENTRY_NEW:.*]]
 ; APPLE:       [[ENTRY_NEW]]:
 ; APPLE-NEXT:    [[UNROLL_ITER:%.*]] = sub i64 [[N]], [[XTRAITER]]
 ; APPLE-NEXT:    br label %[[LOOP:.*]]
@@ -72,18 +72,18 @@ define void @small_load_store_loop(ptr %src, ptr %dst, i64 %N, i64 %scale) {
 ; APPLE-NEXT:    [[IV_NEXT_7]] = add nuw nsw i64 [[IV_EPIL]], 8
 ; APPLE-NEXT:    [[NITER_NEXT_7]] = add i64 [[NITER]], 8
 ; APPLE-NEXT:    [[NITER_NCMP_7:%.*]] = icmp eq i64 [[NITER_NEXT_7]], [[UNROLL_ITER]]
-; APPLE-NEXT:    br i1 [[NITER_NCMP_7]], label %[[EXIT_UNR_LCSSA_LOOPEXIT:.*]], label %[[LOOP]]
-; APPLE:       [[EXIT_UNR_LCSSA_LOOPEXIT]]:
-; APPLE-NEXT:    [[IV_UNR_PH:%.*]] = phi i64 [ [[IV_NEXT_7]], %[[LOOP]] ]
-; APPLE-NEXT:    br label %[[EXIT_UNR_LCSSA]]
+; APPLE-NEXT:    br i1 [[NITER_NCMP_7]], label %[[EXIT_UNR_LCSSA:.*]], label %[[LOOP]]
 ; APPLE:       [[EXIT_UNR_LCSSA]]:
-; APPLE-NEXT:    [[IV_UNR:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_UNR_PH]], %[[EXIT_UNR_LCSSA_LOOPEXIT]] ]
+; APPLE-NEXT:    [[IV_UNR:%.*]] = phi i64 [ [[IV_NEXT_7]], %[[LOOP]] ]
 ; APPLE-NEXT:    [[LCMP_MOD:%.*]] = icmp ne i64 [[XTRAITER]], 0
-; APPLE-NEXT:    br i1 [[LCMP_MOD]], label %[[LOOP_EPIL_PREHEADER:.*]], label %[[EXIT:.*]]
+; APPLE-NEXT:    br i1 [[LCMP_MOD]], label %[[LOOP_EPIL_PREHEADER]], label %[[EXIT:.*]]
 ; APPLE:       [[LOOP_EPIL_PREHEADER]]:
+; APPLE-NEXT:    [[IV_EPIL_INIT:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_UNR]], %[[EXIT_UNR_LCSSA]] ]
+; APPLE-NEXT:    [[LCMP_MOD1:%.*]] = icmp ne i64 [[XTRAITER]], 0
+; APPLE-NEXT:    call void @llvm.assume(i1 [[LCMP_MOD1]])
 ; APPLE-NEXT:    br label %[[LOOP_EPIL:.*]]
 ; APPLE:       [[LOOP_EPIL]]:
-; APPLE-NEXT:    [[IV_EPIL1:%.*]] = phi i64 [ [[IV_UNR]], %[[LOOP_EPIL_PREHEADER]] ], [ [[IV_NEXT_EPIL1:%.*]], %[[LOOP_EPIL]] ]
+; APPLE-NEXT:    [[IV_EPIL1:%.*]] = phi i64 [ [[IV_EPIL_INIT]], %[[LOOP_EPIL_PREHEADER]] ], [ [[IV_NEXT_EPIL1:%.*]], %[[LOOP_EPIL]] ]
 ; APPLE-NEXT:    [[EPIL_ITER:%.*]] = phi i64 [ 0, %[[LOOP_EPIL_PREHEADER]] ], [ [[EPIL_ITER_NEXT:%.*]], %[[LOOP_EPIL]] ]
 ; APPLE-NEXT:    [[SCALED_IV_EPIL1:%.*]] = mul nuw nsw i64 [[IV_EPIL1]], [[SCALE]]
 ; APPLE-NEXT:    [[GEP_SRC_EPIL1:%.*]] = getelementptr inbounds float, ptr [[SRC]], i64 [[SCALED_IV_EPIL1]]
@@ -106,7 +106,7 @@ define void @small_load_store_loop(ptr %src, ptr %dst, i64 %N, i64 %scale) {
 ; OTHER-NEXT:    [[TMP0:%.*]] = add i64 [[N]], -1
 ; OTHER-NEXT:    [[XTRAITER:%.*]] = and i64 [[N]], 1
 ; OTHER-NEXT:    [[TMP1:%.*]] = icmp ult i64 [[TMP0]], 1
-; OTHER-NEXT:    br i1 [[TMP1]], label %[[EXIT_UNR_LCSSA:.*]], label %[[ENTRY_NEW:.*]]
+; OTHER-NEXT:    br i1 [[TMP1]], label %[[LOOP_EPIL_PREHEADER:.*]], label %[[ENTRY_NEW:.*]]
 ; OTHER:       [[ENTRY_NEW]]:
 ; OTHER-NEXT:    [[UNROLL_ITER:%.*]] = sub i64 [[N]], [[XTRAITER]]
 ; OTHER-NEXT:    br label %[[LOOP:.*]]
@@ -127,15 +127,15 @@ define void @small_load_store_loop(ptr %src, ptr %dst, i64 %N, i64 %scale) {
 ; OTHER-NEXT:    [[IV_NEXT_1]] = add nuw nsw i64 [[IV]], 2
 ; OTHER-NEXT:    [[NITER_NEXT_1]] = add i64 [[NITER]], 2
 ; OTHER-NEXT:    [[NITER_NCMP_1:%.*]] = icmp eq i64 [[NITER_NEXT_1]], [[UNROLL_ITER]]
-; OTHER-NEXT:    br i1 [[NITER_NCMP_1]], label %[[EXIT_UNR_LCSSA_LOOPEXIT:.*]], label %[[LOOP]]
-; OTHER:       [[EXIT_UNR_LCSSA_LOOPEXIT]]:
-; OTHER-NEXT:    [[IV_UNR_PH:%.*]] = phi i64 [ [[IV_NEXT_1]], %[[LOOP]] ]
-; OTHER-NEXT:    br label %[[EXIT_UNR_LCSSA]]
+; OTHER-NEXT:    br i1 [[NITER_NCMP_1]], label %[[EXIT_UNR_LCSSA:.*]], label %[[LOOP]]
 ; OTHER:       [[EXIT_UNR_LCSSA]]:
-; OTHER-NEXT:    [[IV_UNR:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_UNR_PH]], %[[EXIT_UNR_LCSSA_LOOPEXIT]] ]
+; OTHER-NEXT:    [[IV_UNR1:%.*]] = phi i64 [ [[IV_NEXT_1]], %[[LOOP]] ]
 ; OTHER-NEXT:    [[LCMP_MOD:%.*]] = icmp ne i64 [[XTRAITER]], 0
-; OTHER-NEXT:    br i1 [[LCMP_MOD]], label %[[LOOP_EPIL_PREHEADER:.*]], label %[[EXIT:.*]]
+; OTHER-NEXT:    br i1 [[LCMP_MOD]], label %[[LOOP_EPIL_PREHEADER]], label %[[EXIT:.*]]
 ; OTHER:       [[LOOP_EPIL_PREHEADER]]:
+; OTHER-NEXT:    [[IV_UNR:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_UNR1]], %[[EXIT_UNR_LCSSA]] ]
+; OTHER-NEXT:    [[LCMP_MOD1:%.*]] = icmp ne i64 [[XTRAITER]], 0
+; OTHER-NEXT:    call void @llvm.assume(i1 [[LCMP_MOD1]])
 ; OTHER-NEXT:    br label %[[LOOP_EPIL:.*]]
 ; OTHER:       [[LOOP_EPIL]]:
 ; OTHER-NEXT:    [[SCALED_IV_EPIL:%.*]] = mul nuw nsw i64 [[IV_UNR]], [[SCALE]]
@@ -172,7 +172,7 @@ define void @load_op_store_loop(ptr %src, ptr %dst, i64 %N, i64 %scale, float %k
 ; APPLE-NEXT:    [[TMP0:%.*]] = add i64 [[N]], -1
 ; APPLE-NEXT:    [[XTRAITER:%.*]] = and i64 [[N]], 1
 ; APPLE-NEXT:    [[TMP1:%.*]] = icmp ult i64 [[TMP0]], 1
-; APPLE-NEXT:    br i1 [[TMP1]], label %[[EXIT_UNR_LCSSA:.*]], label %[[ENTRY_NEW:.*]]
+; APPLE-NEXT:    br i1 [[TMP1]], label %[[LOOP_EPIL_PREHEADER:.*]], label %[[ENTRY_NEW:.*]]
 ; APPLE:       [[ENTRY_NEW]]:
 ; APPLE-NEXT:    [[UNROLL_ITER:%.*]] = sub i64 [[N]], [[XTRAITER]]
 ; APPLE-NEXT:    br label %[[LOOP:.*]]
@@ -195,15 +195,15 @@ define void @load_op_store_loop(ptr %src, ptr %dst, i64 %N, i64 %scale, float %k
 ; APPLE-NEXT:    [[IV_NEXT_1]] = add nuw nsw i64 [[IV]], 2
 ; APPLE-NEXT:    [[NITER_NEXT_1]] = add i64 [[NITER]], 2
 ; APPLE-NEXT:    [[NITER_NCMP_1:%.*]] = icmp eq i64 [[NITER_NEXT_1]], [[UNROLL_ITER]]
-; APPLE-NEXT:    br i1 [[NITER_NCMP_1]], label %[[EXIT_UNR_LCSSA_LOOPEXIT:.*]], label %[[LOOP]]
-; APPLE:       [[EXIT_UNR_LCSSA_LOOPEXIT]]:
-; APPLE-NEXT:    [[IV_UNR_PH:%.*]] = phi i64 [ [[IV_NEXT_1]], %[[LOOP]] ]
-; APPLE-NEXT:    br label %[[EXIT_UNR_LCSSA]]
+; APPLE-NEXT:    br i1 [[NITER_NCMP_1]], label %[[EXIT_UNR_LCSSA:.*]], label %[[LOOP]]
 ; APPLE:       [[EXIT_UNR_LCSSA]]:
-; APPLE-NEXT:    [[IV_UNR:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_UNR_PH]], %[[EXIT_UNR_LCSSA_LOOPEXIT]] ]
+; APPLE-NEXT:    [[IV_UNR1:%.*]] = phi i64 [ [[IV_NEXT_1]], %[[LOOP]] ]
 ; APPLE-NEXT:    [[LCMP_MOD:%.*]] = icmp ne i64 [[XTRAITER]], 0
-; APPLE-NEXT:    br i1 [[LCMP_MOD]], label %[[LOOP_EPIL_PREHEADER:.*]], label %[[EXIT:.*]]
+; APPLE-NEXT:    br i1 [[LCMP_MOD]], label %[[LOOP_EPIL_PREHEADER]], label %[[EXIT:.*]]
 ; APPLE:       [[LOOP_EPIL_PREHEADER]]:
+; APPLE-NEXT:    [[IV_UNR:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_UNR1]], %[[EXIT_UNR_LCSSA]] ]
+; APPLE-NEXT:    [[LCMP_MOD1:%.*]] = icmp ne i64 [[XTRAITER]], 0
+; APPLE-NEXT:    call void @llvm.assume(i1 [[LCMP_MOD1]])
 ; APPLE-NEXT:    br label %[[LOOP_EPIL:.*]]
 ; APPLE:       [[LOOP_EPIL]]:
 ; APPLE-NEXT:    [[SCALED_IV_EPIL:%.*]] = mul nuw nsw i64 [[IV_UNR]], [[SCALE]]
@@ -222,7 +222,7 @@ define void @load_op_store_loop(ptr %src, ptr %dst, i64 %N, i64 %scale, float %k
 ; OTHER-NEXT:    [[TMP0:%.*]] = add i64 [[N]], -1
 ; OTHER-NEXT:    [[XTRAITER:%.*]] = and i64 [[N]], 1
 ; OTHER-NEXT:    [[TMP1:%.*]] = icmp ult i64 [[TMP0]], 1
-; OTHER-NEXT:    br i1 [[TMP1]], label %[[EXIT_UNR_LCSSA:.*]], label %[[ENTRY_NEW:.*]]
+; OTHER-...
```
[truncated]

@llvmbot (Member) commented Sep 3, 2025

@llvm/pr-subscribers-debuginfo

-; OTHER-NEXT:    br i1 [[LCMP_MOD]], label %[[LOOP_EPIL_PREHEADER:.*]], label %[[EXIT:.*]]
+; OTHER-NEXT:    br i1 [[LCMP_MOD]], label %[[LOOP_EPIL_PREHEADER]], label %[[EXIT:.*]]
 ; OTHER:       [[LOOP_EPIL_PREHEADER]]:
+; OTHER-NEXT:    [[IV_UNR:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_UNR1]], %[[EXIT_UNR_LCSSA]] ]
+; OTHER-NEXT:    [[LCMP_MOD1:%.*]] = icmp ne i64 [[XTRAITER]], 0
+; OTHER-NEXT:    call void @llvm.assume(i1 [[LCMP_MOD1]])
 ; OTHER-NEXT:    br label %[[LOOP_EPIL:.*]]
 ; OTHER:       [[LOOP_EPIL]]:
 ; OTHER-NEXT:    [[SCALED_IV_EPIL:%.*]] = mul nuw nsw i64 [[IV_UNR]], [[SCALE]]
@@ -172,7 +172,7 @@ define void @load_op_store_loop(ptr %src, ptr %dst, i64 %N, i64 %scale, float %k
 ; APPLE-NEXT:    [[TMP0:%.*]] = add i64 [[N]], -1
 ; APPLE-NEXT:    [[XTRAITER:%.*]] = and i64 [[N]], 1
 ; APPLE-NEXT:    [[TMP1:%.*]] = icmp ult i64 [[TMP0]], 1
-; APPLE-NEXT:    br i1 [[TMP1]], label %[[EXIT_UNR_LCSSA:.*]], label %[[ENTRY_NEW:.*]]
+; APPLE-NEXT:    br i1 [[TMP1]], label %[[LOOP_EPIL_PREHEADER:.*]], label %[[ENTRY_NEW:.*]]
 ; APPLE:       [[ENTRY_NEW]]:
 ; APPLE-NEXT:    [[UNROLL_ITER:%.*]] = sub i64 [[N]], [[XTRAITER]]
 ; APPLE-NEXT:    br label %[[LOOP:.*]]
@@ -195,15 +195,15 @@ define void @load_op_store_loop(ptr %src, ptr %dst, i64 %N, i64 %scale, float %k
 ; APPLE-NEXT:    [[IV_NEXT_1]] = add nuw nsw i64 [[IV]], 2
 ; APPLE-NEXT:    [[NITER_NEXT_1]] = add i64 [[NITER]], 2
 ; APPLE-NEXT:    [[NITER_NCMP_1:%.*]] = icmp eq i64 [[NITER_NEXT_1]], [[UNROLL_ITER]]
-; APPLE-NEXT:    br i1 [[NITER_NCMP_1]], label %[[EXIT_UNR_LCSSA_LOOPEXIT:.*]], label %[[LOOP]]
-; APPLE:       [[EXIT_UNR_LCSSA_LOOPEXIT]]:
-; APPLE-NEXT:    [[IV_UNR_PH:%.*]] = phi i64 [ [[IV_NEXT_1]], %[[LOOP]] ]
-; APPLE-NEXT:    br label %[[EXIT_UNR_LCSSA]]
+; APPLE-NEXT:    br i1 [[NITER_NCMP_1]], label %[[EXIT_UNR_LCSSA:.*]], label %[[LOOP]]
 ; APPLE:       [[EXIT_UNR_LCSSA]]:
-; APPLE-NEXT:    [[IV_UNR:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_UNR_PH]], %[[EXIT_UNR_LCSSA_LOOPEXIT]] ]
+; APPLE-NEXT:    [[IV_UNR1:%.*]] = phi i64 [ [[IV_NEXT_1]], %[[LOOP]] ]
 ; APPLE-NEXT:    [[LCMP_MOD:%.*]] = icmp ne i64 [[XTRAITER]], 0
-; APPLE-NEXT:    br i1 [[LCMP_MOD]], label %[[LOOP_EPIL_PREHEADER:.*]], label %[[EXIT:.*]]
+; APPLE-NEXT:    br i1 [[LCMP_MOD]], label %[[LOOP_EPIL_PREHEADER]], label %[[EXIT:.*]]
 ; APPLE:       [[LOOP_EPIL_PREHEADER]]:
+; APPLE-NEXT:    [[IV_UNR:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_UNR1]], %[[EXIT_UNR_LCSSA]] ]
+; APPLE-NEXT:    [[LCMP_MOD1:%.*]] = icmp ne i64 [[XTRAITER]], 0
+; APPLE-NEXT:    call void @llvm.assume(i1 [[LCMP_MOD1]])
 ; APPLE-NEXT:    br label %[[LOOP_EPIL:.*]]
 ; APPLE:       [[LOOP_EPIL]]:
 ; APPLE-NEXT:    [[SCALED_IV_EPIL:%.*]] = mul nuw nsw i64 [[IV_UNR]], [[SCALE]]
@@ -222,7 +222,7 @@ define void @load_op_store_loop(ptr %src, ptr %dst, i64 %N, i64 %scale, float %k
 ; OTHER-NEXT:    [[TMP0:%.*]] = add i64 [[N]], -1
 ; OTHER-NEXT:    [[XTRAITER:%.*]] = and i64 [[N]], 1
 ; OTHER-NEXT:    [[TMP1:%.*]] = icmp ult i64 [[TMP0]], 1
-; OTHER-NEXT:    br i1 [[TMP1]], label %[[EXIT_UNR_LCSSA:.*]], label %[[ENTRY_NEW:.*]]
+; OTHER-...
[truncated]

@llvmbot
Member

llvmbot commented Sep 3, 2025

@llvm/pr-subscribers-backend-powerpc

Author: Joel E. Denny (jdenny-ornl)


Patch is 308.36 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/156549.diff

38 Files Affected:

  • (modified) llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp (+59-33)
  • (modified) llvm/test/DebugInfo/KeyInstructions/Generic/loop-unroll-runtime.ll (+1-1)
  • (modified) llvm/test/Transforms/HardwareLoops/ARM/structure.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopUnroll/AArch64/apple-unrolling.ll (+55-56)
  • (modified) llvm/test/Transforms/LoopUnroll/AArch64/runtime-unroll-generic.ll (+6-4)
  • (modified) llvm/test/Transforms/LoopUnroll/AArch64/vector.ll (+15-15)
  • (modified) llvm/test/Transforms/LoopUnroll/AMDGPU/unroll-runtime.ll (+8-8)
  • (modified) llvm/test/Transforms/LoopUnroll/ARM/multi-blocks.ll (+21-23)
  • (modified) llvm/test/Transforms/LoopUnroll/Hexagon/reuse-lcssa-phi-scev-expansion.ll (+8-8)
  • (modified) llvm/test/Transforms/LoopUnroll/PowerPC/p8-unrolling-legalize-vectors-inseltpoison.ll (+6-6)
  • (modified) llvm/test/Transforms/LoopUnroll/PowerPC/p8-unrolling-legalize-vectors.ll (+6-6)
  • (modified) llvm/test/Transforms/LoopUnroll/RISCV/vector.ll (+7-7)
  • (modified) llvm/test/Transforms/LoopUnroll/WebAssembly/basic-unrolling.ll (+6-4)
  • (modified) llvm/test/Transforms/LoopUnroll/convergent.controlled.ll (+5-5)
  • (modified) llvm/test/Transforms/LoopUnroll/followup.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-epilog-debuginfo.ll (+1-3)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-exit-phi-scev-invalidation.ll (+10-11)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-i128.ll (+8-8)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-loop-at-most-two-exits.ll (+12-13)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-loop-branchweight.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-loop-multiple-exits.ll (+405-422)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-loop.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-loop1.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-loop2.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-loop5.ll (+11-12)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-multiexit-heuristic.ll (+58-63)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-unroll-assume-no-remainder.ll (+7-7)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-unroll-reductions.ll (+31-34)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-unroll-remainder.ll (+13-14)
  • (modified) llvm/test/Transforms/LoopUnroll/scev-invalidation-lcssa.ll (+8-8)
  • (modified) llvm/test/Transforms/LoopUnroll/tripcount-overflow.ll (+8-9)
  • (modified) llvm/test/Transforms/LoopUnroll/unroll-heuristics-pgo.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopUnroll/unroll-loads-cse.ll (+35-35)
  • (modified) llvm/test/Transforms/LoopUnrollAndJam/dependencies_visit_order.ll (+7-8)
  • (modified) llvm/test/Transforms/LoopUnrollAndJam/followup.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopUnrollAndJam/unroll-and-jam.ll (+51-58)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/float-induction-x86.ll (+25-18)
  • (modified) llvm/test/Transforms/PhaseOrdering/AArch64/extra-unroll-simplifications.ll (+14-10)
diff --git a/llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp b/llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp
index bf882d7406853..6312831cf0ee0 100644
--- a/llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp
+++ b/llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp
@@ -201,18 +201,27 @@ static void ConnectProlog(Loop *L, Value *BECount, unsigned Count,
 /// unroll count is non-zero.
 ///
 /// This function performs the following:
-/// - Update PHI nodes at the unrolling loop exit and epilog loop exit
-/// - Create PHI nodes at the unrolling loop exit to combine
-///   values that exit the unrolling loop code and jump around it.
+/// - Update PHI nodes at the epilog loop exit
+/// - Create PHI nodes at the unrolling loop exit and epilog preheader to
+///   combine values that exit the unrolling loop code and jump around it.
 /// - Update PHI operands in the epilog loop by the new PHI nodes
-/// - Branch around the epilog loop if extra iters (ModVal) is zero.
+/// - At the unrolling loop exit, branch around the epilog loop if extra iters
+//    (ModVal) is zero.
+/// - At the epilog preheader, add an llvm.assume call that extra iters is
+///   non-zero.  If the unrolling loop exit is the predecessor, the above new
+///   branch guarantees that assumption.  If the unrolling loop preheader is the
+///   predecessor, then the required first iteration from the original loop has
+///   yet to be executed, so it must be executed in the epilog loop.  If we
+///   later unroll the epilog loop, that llvm.assume call somehow enables
+///   ScalarEvolution to compute a epilog loop maximum trip count, which enables
+///   eliminating the branch at the end of the final unrolled epilog iteration.
 ///
 static void ConnectEpilog(Loop *L, Value *ModVal, BasicBlock *NewExit,
                           BasicBlock *Exit, BasicBlock *PreHeader,
                           BasicBlock *EpilogPreHeader, BasicBlock *NewPreHeader,
                           ValueToValueMapTy &VMap, DominatorTree *DT,
                           LoopInfo *LI, bool PreserveLCSSA, ScalarEvolution &SE,
-                          unsigned Count) {
+                          unsigned Count, AssumptionCache &AC) {
   BasicBlock *Latch = L->getLoopLatch();
   assert(Latch && "Loop must have a latch");
   BasicBlock *EpilogLatch = cast<BasicBlock>(VMap[Latch]);
@@ -231,7 +240,7 @@ static void ConnectEpilog(Loop *L, Value *ModVal, BasicBlock *NewExit,
   //   EpilogLatch
   // Exit (EpilogPN)
 
-  // Update PHI nodes at NewExit and Exit.
+  // Update PHI nodes at Exit.
   for (PHINode &PN : NewExit->phis()) {
     // PN should be used in another PHI located in Exit block as
     // Exit was split by SplitBlockPredecessors into Exit and NewExit
@@ -246,15 +255,11 @@ static void ConnectEpilog(Loop *L, Value *ModVal, BasicBlock *NewExit,
     // epilogue edges have already been added.
     //
     // There is EpilogPreHeader incoming block instead of NewExit as
-    // NewExit was spilt 1 more time to get EpilogPreHeader.
+    // NewExit was split 1 more time to get EpilogPreHeader.
     assert(PN.hasOneUse() && "The phi should have 1 use");
     PHINode *EpilogPN = cast<PHINode>(PN.use_begin()->getUser());
     assert(EpilogPN->getParent() == Exit && "EpilogPN should be in Exit block");
 
-    // Add incoming PreHeader from branch around the Loop
-    PN.addIncoming(PoisonValue::get(PN.getType()), PreHeader);
-    SE.forgetValue(&PN);
-
     Value *V = PN.getIncomingValueForBlock(Latch);
     Instruction *I = dyn_cast<Instruction>(V);
     if (I && L->contains(I))
@@ -271,35 +276,52 @@ static void ConnectEpilog(Loop *L, Value *ModVal, BasicBlock *NewExit,
                                NewExit);
     // Now PHIs should look like:
     // NewExit:
-    //   PN = PHI [I, Latch], [poison, PreHeader]
+    //   PN = PHI [I, Latch]
     // ...
     // Exit:
     //   EpilogPN = PHI [PN, NewExit], [VMap[I], EpilogLatch]
   }
 
-  // Create PHI nodes at NewExit (from the unrolling loop Latch and PreHeader).
-  // Update corresponding PHI nodes in epilog loop.
+  // Create PHI nodes at NewExit (from the unrolling loop Latch) and at
+  // EpilogPreHeader (from PreHeader and NewExit).  Update corresponding PHI
+  // nodes in epilog loop.
   for (BasicBlock *Succ : successors(Latch)) {
     // Skip this as we already updated phis in exit blocks.
     if (!L->contains(Succ))
       continue;
+
+    // Succ here appears to always be just L->getHeader().  Otherwise, how do we
+    // know its corresponding epilog block (from VMap) is EpilogHeader and thus
+    // EpilogPreHeader is the right incoming block for VPN, as set below?
+    // TODO: Can we thus avoid the enclosing loop over successors?
+    assert(Succ == L->getHeader() &&
+           "Expect the only in-loop successor of latch to be the loop header");
+
     for (PHINode &PN : Succ->phis()) {
-      // Add new PHI nodes to the loop exit block and update epilog
-      // PHIs with the new PHI values.
-      PHINode *NewPN = PHINode::Create(PN.getType(), 2, PN.getName() + ".unr");
-      NewPN->insertBefore(NewExit->getFirstNonPHIIt());
-      // Adding a value to the new PHI node from the unrolling loop preheader.
-      NewPN->addIncoming(PN.getIncomingValueForBlock(NewPreHeader), PreHeader);
-      // Adding a value to the new PHI node from the unrolling loop latch.
-      NewPN->addIncoming(PN.getIncomingValueForBlock(Latch), Latch);
+      // Add new PHI nodes to the loop exit block.
+      PHINode *NewPN0 = PHINode::Create(PN.getType(), /*NumReservedValues=*/1,
+                                        PN.getName() + ".unr");
+      NewPN0->insertBefore(NewExit->getFirstNonPHIIt());
+      // Add value to the new PHI node from the unrolling loop latch.
+      NewPN0->addIncoming(PN.getIncomingValueForBlock(Latch), Latch);
+
+      // Add new PHI nodes to EpilogPreHeader.
+      PHINode *NewPN1 = PHINode::Create(PN.getType(), /*NumReservedValues=*/2,
+                                        PN.getName() + ".epil.init");
+      NewPN1->insertBefore(EpilogPreHeader->getFirstNonPHIIt());
+      // Add value to the new PHI node from the unrolling loop preheader.
+      NewPN1->addIncoming(PN.getIncomingValueForBlock(NewPreHeader), PreHeader);
+      // Add value to the new PHI node from the epilog loop guard.
+      NewPN1->addIncoming(NewPN0, NewExit);
 
       // Update the existing PHI node operand with the value from the new PHI
       // node.  Corresponding instruction in epilog loop should be PHI.
       PHINode *VPN = cast<PHINode>(VMap[&PN]);
-      VPN->setIncomingValueForBlock(EpilogPreHeader, NewPN);
+      VPN->setIncomingValueForBlock(EpilogPreHeader, NewPN1);
     }
   }
 
+  // In NewExit, branch around the epilog loop if no extra iters.
   Instruction *InsertPt = NewExit->getTerminator();
   IRBuilder<> B(InsertPt);
   Value *BrLoopExit = B.CreateIsNotNull(ModVal, "lcmp.mod");
@@ -308,7 +330,7 @@ static void ConnectEpilog(Loop *L, Value *ModVal, BasicBlock *NewExit,
   SmallVector<BasicBlock*, 4> Preds(predecessors(Exit));
   SplitBlockPredecessors(Exit, Preds, ".epilog-lcssa", DT, LI, nullptr,
                          PreserveLCSSA);
-  // Add the branch to the exit block (around the unrolling loop)
+  // Add the branch to the exit block (around the epilog loop)
   MDNode *BranchWeights = nullptr;
   if (hasBranchWeightMD(*Latch->getTerminator())) {
     // Assume equal distribution in interval [0, Count).
@@ -322,10 +344,11 @@ static void ConnectEpilog(Loop *L, Value *ModVal, BasicBlock *NewExit,
     DT->changeImmediateDominator(Exit, NewDom);
   }
 
-  // Split the main loop exit to maintain canonicalization guarantees.
-  SmallVector<BasicBlock*, 4> NewExitPreds{Latch};
-  SplitBlockPredecessors(NewExit, NewExitPreds, ".loopexit", DT, LI, nullptr,
-                         PreserveLCSSA);
+  // In EpilogPreHeader, assume extra iters is non-zero.
+  IRBuilder<> B2(EpilogPreHeader, EpilogPreHeader->getFirstNonPHIIt());
+  Value *ModIsNotNull = B2.CreateIsNotNull(ModVal, "lcmp.mod");
+  AssumeInst *AI = cast<AssumeInst>(B2.CreateAssumption(ModIsNotNull));
+  AC.registerAssumption(AI);
 }
 
 /// Create a clone of the blocks in a loop and connect them together. A new
@@ -795,7 +818,8 @@ bool llvm::UnrollRuntimeLoopRemainder(
                                            ConstantInt::get(BECount->getType(),
                                                             Count - 1)) :
                            B.CreateIsNotNull(ModVal, "lcmp.mod");
-  BasicBlock *RemainderLoop = UseEpilogRemainder ? NewExit : PrologPreHeader;
+  BasicBlock *RemainderLoop =
+      UseEpilogRemainder ? EpilogPreHeader : PrologPreHeader;
   BasicBlock *UnrollingLoop = UseEpilogRemainder ? NewPreHeader : PrologExit;
   // Branch to either remainder (extra iterations) loop or unrolling loop.
   MDNode *BranchWeights = nullptr;
@@ -808,7 +832,7 @@ bool llvm::UnrollRuntimeLoopRemainder(
   PreHeaderBR->eraseFromParent();
   if (DT) {
     if (UseEpilogRemainder)
-      DT->changeImmediateDominator(NewExit, PreHeader);
+      DT->changeImmediateDominator(EpilogPreHeader, PreHeader);
     else
       DT->changeImmediateDominator(PrologExit, PreHeader);
   }
@@ -880,7 +904,8 @@ bool llvm::UnrollRuntimeLoopRemainder(
   // from both the original loop and the remainder code reaching the exit
   // blocks. While the IDom of these exit blocks were from the original loop,
   // now the IDom is the preheader (which decides whether the original loop or
-  // remainder code should run).
+  // remainder code should run) unless the block still has just the original
+  // predecessor (such as NewExit in the case of an epilog remainder).
   if (DT && !L->getExitingBlock()) {
     SmallVector<BasicBlock *, 16> ChildrenToUpdate;
     // NB! We have to examine the dom children of all loop blocks, not just
@@ -891,7 +916,8 @@ bool llvm::UnrollRuntimeLoopRemainder(
       auto *DomNodeBB = DT->getNode(BB);
       for (auto *DomChild : DomNodeBB->children()) {
         auto *DomChildBB = DomChild->getBlock();
-        if (!L->contains(LI->getLoopFor(DomChildBB)))
+        if (!L->contains(LI->getLoopFor(DomChildBB)) &&
+            DomChildBB->getUniquePredecessor() != BB)
           ChildrenToUpdate.push_back(DomChildBB);
       }
     }
@@ -930,7 +956,7 @@ bool llvm::UnrollRuntimeLoopRemainder(
     // Connect the epilog code to the original loop and update the
     // PHI functions.
     ConnectEpilog(L, ModVal, NewExit, LatchExit, PreHeader, EpilogPreHeader,
-                  NewPreHeader, VMap, DT, LI, PreserveLCSSA, *SE, Count);
+                  NewPreHeader, VMap, DT, LI, PreserveLCSSA, *SE, Count, *AC);
 
     // Update counter in loop for unrolling.
     // Use an incrementing IV.  Pre-incr/post-incr is backedge/trip count.
[truncated]
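
The net effect on the IR is easiest to see in the apple-unrolling checks above. Below is a minimal sketch, not the pass's verbatim output: block and value names are borrowed from that test, only the induction variable is shown, and %xtraiter is the remainder trip count computed in the entry block.

```
exit.unr-lcssa:                        ; sole pred: the unrolled latch
  ; Pre-patch, this guard block also had an incoming edge from the entry
  ; guard, which forced extra merges (and poison values for loop-exit phis);
  ; now it has a single predecessor.
  %iv.unr = phi i64 [ %iv.next.7, %loop ]
  ; Branch around the epilog when there are no extra iterations.
  %lcmp.mod = icmp ne i64 %xtraiter, 0
  br i1 %lcmp.mod, label %loop.epil.preheader, label %exit

loop.epil.preheader:                   ; preds: %entry, %exit.unr-lcssa
  ; Start the epilog IV at 0 if the unrolled loop was skipped entirely,
  ; otherwise at the unrolled loop's final IV.
  %iv.epil.init = phi i64 [ 0, %entry ], [ %iv.unr, %exit.unr-lcssa ]
  ; Assert a non-zero remainder count; this is what lets ScalarEvolution
  ; still bound the epilog trip count if the epilog is unrolled later.
  %lcmp.mod1 = icmp ne i64 %xtraiter, 0
  call void @llvm.assume(i1 %lcmp.mod1)
  br label %loop.epil
```

The assumption is also registered with the AssumptionCache (see the ConnectEpilog change above), so later passes can find it without rescanning the function.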

@jdenny-ornl jdenny-ornl changed the base branch from users/jdenny-ornl/fix-peel-branch-weights to main October 6, 2025 19:17
@jdenny-ornl jdenny-ornl merged commit 6d44b90 into main Oct 7, 2025
9 checks passed
@jdenny-ornl jdenny-ornl deleted the users/jdenny-ornl/skip-unroll-epilog-guard branch October 7, 2025 14:45
@valerydmit
Contributor

@jdenny-ornl , this patch caused more than 25% performance degradation due to additional spills/reloads. Could you please take a look?
f.zip
To reproduce, compile the attached file at -O3 using flang or clang.

@mtrofin
Member

mtrofin commented Nov 1, 2025

> @jdenny-ornl , this patch caused more than 25% performance degradation due to additional spills/reloads. Could you please take a look?
>
> f.zip
>
> To reproduce, compile the attached file at -O3 using flang or clang.

Drive-by comment: is the regression present when profiles are available?
