Skip to content

Conversation

jdenny-ornl
Copy link
Collaborator

@jdenny-ornl jdenny-ornl commented Sep 9, 2025

This patch implements the LoopUnroll changes discussed in [RFC] Fix Loop Transformations to Preserve Block
Frequencies
and is thus another step in addressing issue #135812.

In summary, for the case of partial loop unrolling without a remainder loop, this patch changes LoopUnroll to:

  • Maintain branch weights consistently with the original loop for the sake of preserving the total frequency of the original loop body.
  • Store the new estimated trip count in the llvm.loop.estimated_trip_count metadata, introduced by PR [PGO] Add llvm.loop.estimated_trip_count metadata #148758.
  • Correct the new estimated trip count (e.g., 3 instead of 2) when the original estimated trip count (e.g., 10) divided by the unroll count (e.g., 4) leaves a remainder (e.g., 2).

There are loop unrolling cases this patch does not fully fix, such as partial unrolling with a remainder loop and complete unrolling, and there are two associated tests whose branch weights this patch adversely affects. They will be addressed in future patches that should land with this patch.

For example:

```
declare void @f(i32)

define void @test(i32 %n) {
entry:
  br label %do.body

do.body:
  %i = phi i32 [ 0, %entry ], [ %inc, %do.body ]
  %inc = add i32 %i, 1
  call void @f(i32 %i)
  %c = icmp sge i32 %inc, %n
  br i1 %c, label %do.end, label %do.body, !prof !0

do.end:
  ret void
}

!0 = !{!"branch_weights", i32 1, i32 9}
```

Given those branch weights, once any loop iteration is actually
reached, the probability of the loop exiting at the iteration's end is
1/(1+9).  That is, the loop is likely to exit every 10 iterations and
thus has an estimated trip count of 10.  `opt
-passes='print<block-freq>'` shows that 10 is indeed the frequency of
the loop body:

```
Printing analysis results of BFI for function 'test':
block-frequency-info: test
 - entry: float = 1.0, int = 1801439852625920
 - do.body: float = 10.0, int = 18014398509481984
 - do.end: float = 1.0, int = 1801439852625920
```

Key Observation: The frequency of reaching any particular iteration is
less than for the previous iteration because the previous iteration
has a non-zero probability of exiting the loop.  This observation
holds even though every loop iteration, once actually reached, has
exactly the same probability of exiting and thus exactly the same
branch weights.

Now we use `opt -unroll-force-peel-count=2 -passes=loop-unroll` to
peel 2 iterations and insert them before the remaining loop.  We
expect the key observation above not to change, but it does under the
implementation without this patch.  The block frequency becomes 1.0
for the first iteration, 0.9 for the second, and 6.4 for the main loop
body.  Again, a decreasing frequency is expected, but it decreases too
much: the total frequency of the original loop body becomes 8.3.  The
new branch weights reveal the problem:

```
!0 = !{!"branch_weights", i32 1, i32 9}
!1 = !{!"branch_weights", i32 1, i32 8}
!2 = !{!"branch_weights", i32 1, i32 7}
```

The exit probability is now 1/10 for the first peeled iteration, 1/9
for the second, and 1/8 for the remaining loop iterations.  It seems
this behavior is trying to ensure a decreasing block frequency.
However, as in the key observation above for the original loop, that
happens correctly without decreasing the branch weights across
iterations.

This patch changes the peeling implementation not to decrease the
branch weights across loop iterations so that the frequency for every
iteration is the same as it was in the original loop.  The total
frequency of the loop body, summed across all its occurrences, thus
remains 10 after peeling.

Unfortunately, that change means a later analysis cannot accurately
estimate the trip count of the remaining loop while examining the
remaining loop in isolation without considering the probability of
actually reaching it.  For that purpose, this patch stores the new
trip count as separate metadata named `llvm.loop.estimated_trip_count`
and extends `llvm::getLoopEstimatedTripCount` to prefer it, if
present, over branch weights.

An alternative fix is for `llvm::getLoopEstimatedTripCount` to
subtract the `llvm.loop.peeled.count` metadata from the trip count
estimated by a loop's branch weights.  However, there might be other
loop transformations that still corrupt block frequencies in a similar
manner and require a similar fix.  `llvm.loop.estimated_trip_count` is
intended to provide a general way to store estimated trip counts when
branch weights cannot directly store them.

This patch introduces several fixme comments that need to be addressed
before it can land.
Extending beyond the limitations of `getExpectedExitLoopLatchBranch`
is a possible improvement for the future not an urgent fixme.

No one has pointed out code that computes estimated trip counts
without using `llvm::getLoopEstimatedTripCount`.
The update adds this PR's new metadata.
This patch implements the `llvm.loop.estimated_trip_count` metadata
discussed in [[RFC] Fix Loop Transformations to Preserve Block
Frequencies](https://discourse.llvm.org/t/rfc-fix-loop-transformations-to-preserve-block-frequencies/85785).
As [suggested in the RFC
comments](https://discourse.llvm.org/t/rfc-fix-loop-transformations-to-preserve-block-frequencies/85785/4),
it adds the new metadata to all loops at the time of profile ingestion
and estimates each trip count from the loop's `branch_weights`
metadata.  As [suggested in the PR#128785
review](#128785 (comment)),
it does so via a `PGOEstimateTripCountsPass` pass, which creates the
new metadata for the loop but omits the value if it cannot estimate a
trip count due to the loop's form.

An important observation not previously discussed is that
`PGOEstimateTripCountsPass` *often* cannot estimate a loop's trip
count but later passes can transform the loop in a way that makes it
possible.  Currently, such passes do not necessarily update the
metadata, but eventually that should be fixed.  Until then, if the new
metadata has no value, `llvm::getLoopEstimatedTripCount` disregards it
and tries again to estimate the trip count from the loop's
`branch_weights` metadata.
Somehow, on some of my builds, `llvm::` prefixes are dropped from some
symbol names in the printed past list.
That's PR #128785, which is now a parent PR.

First, remove a todo that's now documented more generally than
LoopPeel in plenty of other places.  Second, update LoopPeel's
setLoopEstimatedTripCount call to avoid a now redundant argument that
eventually won't be supported.
This patch implements the LoopUnroll changes discussed in [[RFC] Fix
Loop Transformations to Preserve Block
Frequencies](https://discourse.llvm.org/t/rfc-fix-loop-transformations-to-preserve-block-frequencies/85785)
and is thus another step in addressing issue #135812.

In summary, for the case of partial loop unrolling without a runtime,
this patch changes LoopUnroll to:

- Maintain branch weights consistently with the original loop for the
  sake of preserving the total frequency of the original loop body.
- Store the new estimated trip count in the
  `llvm.loop.estimated_trip_count` metadata, introduced by PR #148758.
- Correct the new estimated trip count (e.g., 3 instead of 2) when the
  original estimated trip count (e.g., 10) divided by the unroll count
  (e.g., 4) leaves a remainder (e.g., 2).

There are loop unrolling cases this patch does not fully fix, such as
partial unrolling with a runtime and complete unrolling, and there are
two associated tests this patch marks as XFAIL.  They will be
addressed in future patches that should land with this patch.
@llvmbot
Copy link
Member

llvmbot commented Sep 9, 2025

@llvm/pr-subscribers-llvm-transforms

Author: Joel E. Denny (jdenny-ornl)

Changes

This patch implements the LoopUnroll changes discussed in [RFC] Fix Loop Transformations to Preserve Block
Frequencies
and is thus another step in addressing issue #135812.

In summary, for the case of partial loop unrolling without a runtime, this patch changes LoopUnroll to:

  • Maintain branch weights consistently with the original loop for the sake of preserving the total frequency of the original loop body.
  • Store the new estimated trip count in the llvm.loop.estimated_trip_count metadata, introduced by PR #148758.
  • Correct the new estimated trip count (e.g., 3 instead of 2) when the original estimated trip count (e.g., 10) divided by the unroll count (e.g., 4) leaves a remainder (e.g., 2).

There are loop unrolling cases this patch does not fully fix, such as partial unrolling with a runtime and complete unrolling, and there are two associated tests this patch marks as XFAIL. They will be addressed in future patches that should land with this patch.


Full diff: https://github.com/llvm/llvm-project/pull/157754.diff

5 Files Affected:

  • (modified) llvm/lib/Transforms/Utils/LoopUnroll.cpp (+30-6)
  • (renamed) llvm/test/Transforms/LoopUnroll/branch-weights-freq/peel.ll ()
  • (added) llvm/test/Transforms/LoopUnroll/branch-weights-freq/unroll-partial.ll (+68)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-loop-branchweight.ll (+1)
  • (modified) llvm/test/Transforms/LoopUnroll/unroll-heuristics-pgo.ll (+1)
diff --git a/llvm/lib/Transforms/Utils/LoopUnroll.cpp b/llvm/lib/Transforms/Utils/LoopUnroll.cpp
index 8a6c7789d1372..93c43396c54b6 100644
--- a/llvm/lib/Transforms/Utils/LoopUnroll.cpp
+++ b/llvm/lib/Transforms/Utils/LoopUnroll.cpp
@@ -499,9 +499,8 @@ llvm::UnrollLoop(Loop *L, UnrollLoopOptions ULO, LoopInfo *LI,
 
   const unsigned MaxTripCount = SE->getSmallConstantMaxTripCount(L);
   const bool MaxOrZero = SE->isBackedgeTakenCountMaxOrZero(L);
-  unsigned EstimatedLoopInvocationWeight = 0;
   std::optional<unsigned> OriginalTripCount =
-      llvm::getLoopEstimatedTripCount(L, &EstimatedLoopInvocationWeight);
+      llvm::getLoopEstimatedTripCount(L);
 
   // Effectively "DCE" unrolled iterations that are beyond the max tripcount
   // and will never be executed.
@@ -1130,10 +1129,35 @@ llvm::UnrollLoop(Loop *L, UnrollLoopOptions ULO, LoopInfo *LI,
     // We shouldn't try to use `L` anymore.
     L = nullptr;
   } else if (OriginalTripCount) {
-    // Update the trip count. Note that the remainder has already logic
-    // computing it in `UnrollRuntimeLoopRemainder`.
-    setLoopEstimatedTripCount(L, *OriginalTripCount / ULO.Count,
-                              EstimatedLoopInvocationWeight);
+    // Update metadata for the estimated trip count.
+    //
+    // If ULO.Runtime, UnrollRuntimeLoopRemainder handles branch weights for the
+    // remainder loop it creates, and the unrolled loop's branch weights are
+    // adjusted below.  Otherwise, if unrolled loop iterations' latches become
+    // unconditional, branch weights are adjusted above.  Otherwise, the
+    // original loop's branch weights are correct for the unrolled loop, so do
+    // not adjust them.
+    // FIXME: Actually handle such unconditional latches and ULO.Runtime.
+    //
+    // For example, consider what happens if the unroll count is 4 for a loop
+    // with an estimated trip count of 10 when we do not create a remainder loop
+    // and all iterations' latches remain conditional.  Each unrolled
+    // iteration's latch still has the same probability of exiting the loop as
+    // it did when in the original loop, and thus it should still have the same
+    // branch weights.  Each unrolled iteration's non-zero probability of
+    // exiting already appropriately reduces the probability of reaching the
+    // remaining iterations just as it did in the original loop.  Trying to also
+    // adjust the branch weights of the final unrolled iteration's latch (i.e.,
+    // the backedge for the unrolled loop as a whole) to reflect its new trip
+    // count of 3 will erroneously further reduce its block frequencies.
+    // However, in case an analysis later needs to estimate the trip count of
+    // the unrolled loop as a whole without considering the branch weights for
+    // each unrolled iteration's latch within it, we store the new trip count as
+    // separate metadata.
+    unsigned NewTripCount = *OriginalTripCount / ULO.Count;
+    if (!ULO.Runtime && *OriginalTripCount % ULO.Count)
+      NewTripCount += 1;
+    setLoopEstimatedTripCount(L, NewTripCount);
   }
 
   // LoopInfo should not be valid, confirm that.
diff --git a/llvm/test/Transforms/LoopUnroll/peel-branch-weights-freq.ll b/llvm/test/Transforms/LoopUnroll/branch-weights-freq/peel.ll
similarity index 100%
rename from llvm/test/Transforms/LoopUnroll/peel-branch-weights-freq.ll
rename to llvm/test/Transforms/LoopUnroll/branch-weights-freq/peel.ll
diff --git a/llvm/test/Transforms/LoopUnroll/branch-weights-freq/unroll-partial.ll b/llvm/test/Transforms/LoopUnroll/branch-weights-freq/unroll-partial.ll
new file mode 100644
index 0000000000000..cde9d46ee8421
--- /dev/null
+++ b/llvm/test/Transforms/LoopUnroll/branch-weights-freq/unroll-partial.ll
@@ -0,0 +1,68 @@
+; Test branch weight metadata, estimated trip count metadata, and block
+; frequencies after partial loop unrolling without -unroll-runtime.
+
+; RUN: opt < %s -S -passes='print<block-freq>' 2>&1 | \
+; RUN:   FileCheck -check-prefix=CHECK %s
+
+; The -implicit-check-not options make sure that no additional labels or calls
+; to @f show up.
+; RUN: opt < %s -S -passes='loop-unroll,print<block-freq>' \
+; RUN:     -unroll-count=4 2>&1 | \
+; RUN:   FileCheck %s -check-prefix=CHECK-UR \
+; RUN:       -implicit-check-not='{{^( *- )?[^ ;]*:}}' \
+; RUN:       -implicit-check-not='call void @f'
+
+; CHECK: block-frequency-info: test
+; CHECK: do.body: float = 10.0,
+
+; The sum should still be ~10.
+;
+; CHECK-UR: block-frequency-info: test
+; CHECK-UR: - [[ENTRY:.*]]:
+; CHECK-UR: - [[DO_BODY:.*]]: float = 2.9078,
+; CHECK-UR: - [[DO_BODY_1:.*]]: float = 2.617,
+; CHECK-UR: - [[DO_BODY_2:.*]]: float = 2.3553,
+; CHECK-UR: - [[DO_BODY_3:.*]]: float = 2.1198,
+; CHECK-UR: - [[DO_END:.*]]:
+
+declare void @f(i32)
+
+define void @test(i32 %n) {
+; CHECK-UR-LABEL: define void @test(i32 %{{.*}}) {
+;       CHECK-UR: [[ENTRY]]:
+;       CHECK-UR:   br label %[[DO_BODY]]
+;       CHECK-UR: [[DO_BODY]]:
+;       CHECK-UR:   call void @f
+;       CHECK-UR:   br i1 %{{.*}}, label %[[DO_END]], label %[[DO_BODY_1]], !prof ![[#PROF:]]
+;       CHECK-UR: [[DO_BODY_1]]:
+;       CHECK-UR:   call void @f
+;       CHECK-UR:   br i1 %{{.*}}, label %[[DO_END]], label %[[DO_BODY_2]], !prof ![[#PROF]]
+;       CHECK-UR: [[DO_BODY_2]]:
+;       CHECK-UR:   call void @f
+;       CHECK-UR:   br i1 %{{.*}}, label %[[DO_END]], label %[[DO_BODY_3]], !prof ![[#PROF]]
+;       CHECK-UR: [[DO_BODY_3]]:
+;       CHECK-UR:   call void @f
+;       CHECK-UR:   br i1 %{{.*}}, label %[[DO_END]], label %[[DO_BODY]], !prof ![[#PROF]], !llvm.loop ![[#LOOP_UR_LATCH:]]
+;       CHECK-UR: [[DO_END]]:
+;       CHECK-UR:   ret void
+
+entry:
+  br label %do.body
+
+do.body:
+  %i = phi i32 [ 0, %entry ], [ %inc, %do.body ]
+  %inc = add i32 %i, 1
+  call void @f(i32 %i)
+  %c = icmp sge i32 %inc, %n
+  br i1 %c, label %do.end, label %do.body, !prof !0
+
+do.end:
+  ret void
+}
+
+!0 = !{!"branch_weights", i32 1, i32 9}
+
+; CHECK-UR: ![[#PROF]] = !{!"branch_weights", i32 1, i32 9}
+; CHECK-UR: ![[#LOOP_UR_LATCH]] = distinct !{![[#LOOP_UR_LATCH]], ![[#LOOP_UR_TC:]], ![[#DISABLE:]]}
+; CHECK-UR: ![[#LOOP_UR_TC]] = !{!"llvm.loop.estimated_trip_count", i32 3}
+; CHECK-UR: ![[#DISABLE]] = !{!"llvm.loop.unroll.disable"}
diff --git a/llvm/test/Transforms/LoopUnroll/runtime-loop-branchweight.ll b/llvm/test/Transforms/LoopUnroll/runtime-loop-branchweight.ll
index 26171990a2592..36c0497a002cf 100644
--- a/llvm/test/Transforms/LoopUnroll/runtime-loop-branchweight.ll
+++ b/llvm/test/Transforms/LoopUnroll/runtime-loop-branchweight.ll
@@ -1,4 +1,5 @@
 ; RUN: opt < %s -S -passes=loop-unroll -unroll-runtime=true -unroll-count=4 | FileCheck %s
+; XFAIL: *
 
 ;; Check that the remainder loop is properly assigned a branch weight for its latch branch.
 ; CHECK-LABEL: @test(
diff --git a/llvm/test/Transforms/LoopUnroll/unroll-heuristics-pgo.ll b/llvm/test/Transforms/LoopUnroll/unroll-heuristics-pgo.ll
index 611ee5fb5807e..d59a0298b3106 100644
--- a/llvm/test/Transforms/LoopUnroll/unroll-heuristics-pgo.ll
+++ b/llvm/test/Transforms/LoopUnroll/unroll-heuristics-pgo.ll
@@ -1,4 +1,5 @@
 ; RUN: opt < %s -S -passes=loop-unroll -unroll-runtime -unroll-threshold=40 -unroll-max-percent-threshold-boost=100 | FileCheck %s
+; XFAIL: *
 
 @known_constant = internal unnamed_addr constant [9 x i32] [i32 0, i32 -1, i32 0, i32 -1, i32 5, i32 -1, i32 0, i32 -1, i32 0], align 16
 

jdenny-ornl added a commit that referenced this pull request Sep 16, 2025
As another step in issue #135812, this patch fixes block frequencies
for partial loop unrolling with an epilogue remainder loop.  It does
not fully handle the case when the epilogue loop itself is unrolled.
That will be handled in the next patch.

For the guard and latch of each of the unrolled loop and epilogue
loop, this patch sets branch weights derived directly from the
original loop latch branch weights.  The total frequency of the
original loop body, summed across all its occurrences in the unrolled
loop and epilogue loop, is the same as in the original loop.  This
patch also sets `llvm.loop.estimated_trip_count` for the epilogue loop
instead of relying on the epilogue's latch branch weights to imply it.

This patch removes the XFAIL directives that PR #157754 added to the
test suite.
@@ -1,4 +1,5 @@
; RUN: opt < %s -S -passes=loop-unroll -unroll-runtime=true -unroll-count=4 | FileCheck %s
; XFAIL: *
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No xfailing tests?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See last paragraph of #157754 (comment).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't use xfail for this, check the actual baseline content and update in the next patches

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check the actual baseline content

Sorry, what do you mean?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the xfails and made the tests pass. Let me know whether it's what you had in mind.

Base automatically changed from users/jdenny-ornl/skip-unroll-epilog-guard to main October 7, 2025 14:45
@jdenny-ornl
Copy link
Collaborator Author

ping

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants