Skip to content

Conversation

@jdenny-ornl
Copy link
Collaborator

@jdenny-ornl jdenny-ornl commented Sep 16, 2025

As another step in issue #135812, this patch fixes block frequencies for partial loop unrolling with an epilogue remainder loop. It does not fully handle the case when the epilogue loop itself is unrolled. That will be handled in the next patch.

For the guard and latch of each of the unrolled loop and epilogue loop, this patch sets branch weights derived directly from the original loop latch branch weights. The total frequency of the original loop body, summed across all its occurrences in the unrolled loop and epilogue loop, is the same as in the original loop. This patch also sets llvm.loop.estimated_trip_count for the epilogue loop instead of relying on the epilogue's latch branch weights to imply it.

This patch fixes branch weights in tests that PR #157754 adversely affected.

For example:

```
declare void @f(i32)

define void @test(i32 %n) {
entry:
  br label %do.body

do.body:
  %i = phi i32 [ 0, %entry ], [ %inc, %do.body ]
  %inc = add i32 %i, 1
  call void @f(i32 %i)
  %c = icmp sge i32 %inc, %n
  br i1 %c, label %do.end, label %do.body, !prof !0

do.end:
  ret void
}

!0 = !{!"branch_weights", i32 1, i32 9}
```

Given those branch weights, once any loop iteration is actually
reached, the probability of the loop exiting at the iteration's end is
1/(1+9).  That is, the loop is likely to exit every 10 iterations and
thus has an estimated trip count of 10.  `opt
-passes='print<block-freq>'` shows that 10 is indeed the frequency of
the loop body:

```
Printing analysis results of BFI for function 'test':
block-frequency-info: test
 - entry: float = 1.0, int = 1801439852625920
 - do.body: float = 10.0, int = 18014398509481984
 - do.end: float = 1.0, int = 1801439852625920
```

Key Observation: The frequency of reaching any particular iteration is
less than for the previous iteration because the previous iteration
has a non-zero probability of exiting the loop.  This observation
holds even though every loop iteration, once actually reached, has
exactly the same probability of exiting and thus exactly the same
branch weights.

Now we use `opt -unroll-force-peel-count=2 -passes=loop-unroll` to
peel 2 iterations and insert them before the remaining loop.  We
expect the key observation above not to change, but it does under the
implementation without this patch.  The block frequency becomes 1.0
for the first iteration, 0.9 for the second, and 6.4 for the main loop
body.  Again, a decreasing frequency is expected, but it decreases too
much: the total frequency of the original loop body becomes 8.3.  The
new branch weights reveal the problem:

```
!0 = !{!"branch_weights", i32 1, i32 9}
!1 = !{!"branch_weights", i32 1, i32 8}
!2 = !{!"branch_weights", i32 1, i32 7}
```

The exit probability is now 1/10 for the first peeled iteration, 1/9
for the second, and 1/8 for the remaining loop iterations.  It seems
this behavior is trying to ensure a decreasing block frequency.
However, as in the key observation above for the original loop, that
happens correctly without decreasing the branch weights across
iterations.

This patch changes the peeling implementation not to decrease the
branch weights across loop iterations so that the frequency for every
iteration is the same as it was in the original loop.  The total
frequency of the loop body, summed across all its occurrences, thus
remains 10 after peeling.

Unfortunately, that change means a later analysis cannot accurately
estimate the trip count of the remaining loop while examining the
remaining loop in isolation without considering the probability of
actually reaching it.  For that purpose, this patch stores the new
trip count as separate metadata named `llvm.loop.estimated_trip_count`
and extends `llvm::getLoopEstimatedTripCount` to prefer it, if
present, over branch weights.

An alternative fix is for `llvm::getLoopEstimatedTripCount` to
subtract the `llvm.loop.peeled.count` metadata from the trip count
estimated by a loop's branch weights.  However, there might be other
loop transformations that still corrupt block frequencies in a similar
manner and require a similar fix.  `llvm.loop.estimated_trip_count` is
intended to provide a general way to store estimated trip counts when
branch weights cannot directly store them.

This patch introduces several fixme comments that need to be addressed
before it can land.
Extending beyond the limitations of `getExpectedExitLoopLatchBranch`
is a possible improvement for the future not an urgent fixme.

No one has pointed out code that computes estimated trip counts
without using `llvm::getLoopEstimatedTripCount`.
The update adds this PR's new metadata.
This patch implements the `llvm.loop.estimated_trip_count` metadata
discussed in [[RFC] Fix Loop Transformations to Preserve Block
Frequencies](https://discourse.llvm.org/t/rfc-fix-loop-transformations-to-preserve-block-frequencies/85785).
As [suggested in the RFC
comments](https://discourse.llvm.org/t/rfc-fix-loop-transformations-to-preserve-block-frequencies/85785/4),
it adds the new metadata to all loops at the time of profile ingestion
and estimates each trip count from the loop's `branch_weights`
metadata.  As [suggested in the PR#128785
review](#128785 (comment)),
it does so via a `PGOEstimateTripCountsPass` pass, which creates the
new metadata for the loop but omits the value if it cannot estimate a
trip count due to the loop's form.

An important observation not previously discussed is that
`PGOEstimateTripCountsPass` *often* cannot estimate a loop's trip
count but later passes can transform the loop in a way that makes it
possible.  Currently, such passes do not necessarily update the
metadata, but eventually that should be fixed.  Until then, if the new
metadata has no value, `llvm::getLoopEstimatedTripCount` disregards it
and tries again to estimate the trip count from the loop's
`branch_weights` metadata.
Somehow, on some of my builds, `llvm::` prefixes are dropped from some
symbol names in the printed past list.
That's PR #128785, which is now a parent PR.

First, remove a todo that's now documented more generally than
LoopPeel in plenty of other places.  Second, update LoopPeel's
setLoopEstimatedTripCount call to avoid a now redundant argument that
eventually won't be supported.
if (OriginalTripCount) {
unsigned NewTripCount = *OriginalTripCount / ULO.Count;
if (!ULO.Runtime && *OriginalTripCount % ULO.Count)
NewTripCount += 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: ++NewTripCount?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Done. I was thinking in Python, I guess.

if (!ForFirstTarget)
std::swap(Prob0, Prob1);
MDBuilder MDB(B->getContext());
B->setMetadata(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use setBranchWeights from ProfDataUtils.h, less boylerplate

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Done. I also applied that in setLoopEstimatedTripCount for consistency.

@jdenny-ornl
Copy link
Collaborator Author

ping

Base automatically changed from users/jdenny-ornl/fix-blockfreq-unroll-no-runtime to main October 31, 2025 14:44
@jdenny-ornl jdenny-ornl merged commit cc8ff73 into main Oct 31, 2025
7 of 10 checks passed
@jdenny-ornl jdenny-ornl deleted the users/jdenny-ornl/fix-blockfreq-unroll-epilogue branch October 31, 2025 15:01
Copy link
Contributor

@fhahn fhahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this is causing a crash for the opt -passes='loop-unroll<O3>':

target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-n32:64-S128-Fn32"
target triple = "arm64-apple-macosx12.0.0"

define void @test() #0 {
entry:
  br label %loop

for.cond.cleanup:
  ret void

loop:
  br i1 false, label %for.cond.cleanup, label %loop, !prof !0
}

attributes #0 = { "target-cpu"="neoverse-512tvb" }

!0 = !{!"branch_weights", i32 0, i32 0}

@jdenny-ornl
Copy link
Collaborator Author

It looks like this is causing a crash for the opt -passes='loop-unroll<O3>':

Thanks. Is there a bot somewhere that shows the crash?

@fhahn
Copy link
Contributor

fhahn commented Oct 31, 2025

It looks like this is causing a crash for the opt -passes='loop-unroll<O3>':

Thanks. Is there a bot somewhere that shows the crash?

no, this is when building a large C/C++ workload

jdenny-ornl added a commit that referenced this pull request Oct 31, 2025
BranchProbability fails an assert when its denominator is zero.

Reported at
<#159163 (review)>.
@jdenny-ornl
Copy link
Collaborator Author

It looks like this is causing a crash for the opt -passes='loop-unroll<O3>':

Thanks. Is there a bot somewhere that shows the crash?

no, this is when building a large C/C++ workload

PR #165938 fixes it for me.

@Andarwinux
Copy link
Contributor

It looks like this is causing a crash for the opt -passes='loop-unroll<O3>':

Thanks. Is there a bot somewhere that shows the crash?

no, this is when building a large C/C++ workload

PR #165938 fixes it for me.

Could you take a look at this? I used bisect to determine the crash was caused by this PR, but your fix doesn't seem to fully fix the problem I'm encountering.

To be precise, the original crash was fixed, but it subsequently crashed in the same manner elsewhere. Furthermore, it appears the problem doesn't occur as long as --pgso=false is not specified.

@Andarwinux
Copy link
Contributor

Now there's a simpler reproducer.

DEBADRIBASAK pushed a commit to DEBADRIBASAK/llvm-project that referenced this pull request Nov 3, 2025
As another step in issue llvm#135812, this patch fixes block frequencies for
partial loop unrolling with an epilogue remainder loop. It does not
fully handle the case when the epilogue loop itself is unrolled. That
will be handled in the next patch.

For the guard and latch of each of the unrolled loop and epilogue loop,
this patch sets branch weights derived directly from the original loop
latch branch weights. The total frequency of the original loop body,
summed across all its occurrences in the unrolled loop and epilogue
loop, is the same as in the original loop. This patch also sets
`llvm.loop.estimated_trip_count` for the epilogue loop instead of
relying on the epilogue's latch branch weights to imply it.

This patch fixes branch weights in tests that PR llvm#157754 adversely
affected.
jdenny-ornl added a commit that referenced this pull request Nov 3, 2025
BranchProbability fails an assert when its denominator is zero.

Reported at
<#159163 (review)>.
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Nov 3, 2025
…5938)

BranchProbability fails an assert when its denominator is zero.

Reported at
<llvm/llvm-project#159163 (review)>.
@jdenny-ornl
Copy link
Collaborator Author

@Andarwinux Thanks for letting me know.

Could you take a look at this? I used bisect to determine the crash was caused by this PR, but your fix doesn't seem to fully fix the problem I'm encountering.

To be precise, the original crash was fixed, but it subsequently crashed in the same manner elsewhere. Furthermore, it appears the problem doesn't occur as long as --pgso=false is not specified.

As I commented there, I need instructions to reproduce that bug.

Now there's a simpler reproducer.

That one seems to be resolved by PR #165938.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants