[SimpleLoopUnswitch] Record loops from unswitching non-trivial conditions #141121

antoniofrighetto · 2025-05-22T18:44:56Z

Track newly-cloned loops coming from unswitching non-trivial invariant conditions, so as to prevent conditions in such cloned blocks from being unswitched again.

Fixes: #138509.

llvmbot · 2025-05-22T18:45:34Z

@llvm/pr-subscribers-llvm-transforms

Author: Antonio Frighetto (antoniofrighetto)

Changes

Track newly-cloned loops coming from unswitching non-trivial invariant conditions, so as to prevent conditions in such cloned blocks from being unswitched again. While this should optimistically suffice, ensure the outer loop basic block size is taken into account as well when estimating the cost for unswitching non-trivial conditions.

Fixes: #138509.

Full diff: https://github.com/llvm/llvm-project/pull/141121.diff

2 Files Affected:

(modified) llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp (+35-26)
(added) llvm/test/Transforms/SimpleLoopUnswitch/pr138509.ll (+49)

diff --git a/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp b/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
index 0bf90036b8b82..4ebb73e917370 100644
--- a/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
+++ b/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
@@ -2142,9 +2142,22 @@ void visitDomSubTree(DominatorTree &DT, BasicBlock *BB, CallableT Callable) {
 void postUnswitch(Loop &L, LPMUpdater &U, StringRef LoopName,
                   bool CurrentLoopValid, bool PartiallyInvariant,
                   bool InjectedCondition, ArrayRef<Loop *> NewLoops) {
-  // If we did a non-trivial unswitch, we have added new (cloned) loops.
-  if (!NewLoops.empty())
+  auto RecordLoopAsUnswitched = [&](Loop *TargetLoop, StringRef Tag) {
+    auto &Ctx = TargetLoop->getHeader()->getContext();
+    const auto &DisableMDName = (Twine(Tag) + ".disable").str();
+    MDNode *DisableMD = MDNode::get(Ctx, MDString::get(Ctx, DisableMDName));
+    MDNode *NewLoopID = makePostTransformationMetadata(
+        Ctx, TargetLoop->getLoopID(), {Tag}, {DisableMD});
+    TargetLoop->setLoopID(NewLoopID);
+  };
+
+  // If we performed a non-trivial unswitch, we have added new cloned loops.
+  // Mark such newly-created loops as visited.
+  if (!NewLoops.empty()) {
+    for (Loop *NL : NewLoops)
+      RecordLoopAsUnswitched(NL, "llvm.loop.unswitch.nontrivial");
     U.addSiblingLoops(NewLoops);
+  }
 
   // If the current loop remains valid, we should revisit it to catch any
   // other unswitch opportunities. Otherwise, we need to mark it as deleted.
@@ -2152,24 +2165,10 @@ void postUnswitch(Loop &L, LPMUpdater &U, StringRef LoopName,
     if (PartiallyInvariant) {
       // Mark the new loop as partially unswitched, to avoid unswitching on
       // the same condition again.
-      auto &Context = L.getHeader()->getContext();
-      MDNode *DisableUnswitchMD = MDNode::get(
-          Context,
-          MDString::get(Context, "llvm.loop.unswitch.partial.disable"));
-      MDNode *NewLoopID = makePostTransformationMetadata(
-          Context, L.getLoopID(), {"llvm.loop.unswitch.partial"},
-          {DisableUnswitchMD});
-      L.setLoopID(NewLoopID);
+      RecordLoopAsUnswitched(&L, "llvm.loop.unswitch.partial");
     } else if (InjectedCondition) {
       // Do the same for injection of invariant conditions.
-      auto &Context = L.getHeader()->getContext();
-      MDNode *DisableUnswitchMD = MDNode::get(
-          Context,
-          MDString::get(Context, "llvm.loop.unswitch.injection.disable"));
-      MDNode *NewLoopID = makePostTransformationMetadata(
-          Context, L.getLoopID(), {"llvm.loop.unswitch.injection"},
-          {DisableUnswitchMD});
-      L.setLoopID(NewLoopID);
+      RecordLoopAsUnswitched(&L, "llvm.loop.unswitch.injection");
     } else
       U.revisitCurrentLoop();
   } else
@@ -2806,9 +2805,9 @@ static BranchInst *turnGuardIntoBranch(IntrinsicInst *GI, Loop &L,
 }
 
 /// Cost multiplier is a way to limit potentially exponential behavior
-/// of loop-unswitch. Cost is multipied in proportion of 2^number of unswitch
-/// candidates available. Also accounting for the number of "sibling" loops with
-/// the idea to account for previous unswitches that already happened on this
+/// of loop-unswitch. Cost is multiplied in proportion of 2^number of unswitch
+/// candidates available. Also consider the number of "sibling" loops with
+/// the idea of accounting for previous unswitches that already happened on this
 /// cluster of loops. There was an attempt to keep this formula simple,
 /// just enough to limit the worst case behavior. Even if it is not that simple
 /// now it is still not an attempt to provide a detailed heuristic size
@@ -2839,7 +2838,14 @@ static int CalculateUnswitchCostMultiplier(
     return 1;
   }
 
+  // When dealing with nested loops, the basic block size of the outer loop may
+  // increase significantly during unswitching non-trivial conditions. The final
+  // cost may be adjusted taking this into account.
   auto *ParentL = L.getParentLoop();
+  int ParentSizeMultiplier = 1;
+  if (ParentL)
+    ParentSizeMultiplier = std::max((int)ParentL->getNumBlocks(), 1);
+
   int SiblingsCount = (ParentL ? ParentL->getSubLoopsVector().size()
                                : std::distance(LI.begin(), LI.end()));
   // Count amount of clones that all the candidates might cause during
@@ -2887,11 +2893,13 @@ static int CalculateUnswitchCostMultiplier(
       SiblingsMultiplier > UnswitchThreshold)
     CostMultiplier = UnswitchThreshold;
   else
-    CostMultiplier = std::min(SiblingsMultiplier * (1 << ClonesPower),
-                              (int)UnswitchThreshold);
+    CostMultiplier =
+        std::min(SiblingsMultiplier * ParentSizeMultiplier * (1 << ClonesPower),
+                 (int)UnswitchThreshold);
 
   LLVM_DEBUG(dbgs() << "  Computed multiplier  " << CostMultiplier
-                    << " (siblings " << SiblingsMultiplier << " * clones "
+                    << " (siblings " << SiblingsMultiplier << "* parent size "
+                    << ParentSizeMultiplier << " * clones "
                     << (1 << ClonesPower) << ")"
                     << " for unswitch candidate: " << TI << "\n");
   return CostMultiplier;
@@ -3504,8 +3512,9 @@ static bool unswitchBestCondition(Loop &L, DominatorTree &DT, LoopInfo &LI,
   SmallVector<NonTrivialUnswitchCandidate, 4> UnswitchCandidates;
   IVConditionInfo PartialIVInfo;
   Instruction *PartialIVCondBranch = nullptr;
-  collectUnswitchCandidates(UnswitchCandidates, PartialIVInfo,
-                            PartialIVCondBranch, L, LI, AA, MSSAU);
+  if (!findOptionMDForLoop(&L, "llvm.loop.unswitch.nontrivial.disable"))
+    collectUnswitchCandidates(UnswitchCandidates, PartialIVInfo,
+                              PartialIVCondBranch, L, LI, AA, MSSAU);
   if (!findOptionMDForLoop(&L, "llvm.loop.unswitch.injection.disable"))
     collectUnswitchCandidatesWithInjections(UnswitchCandidates, PartialIVInfo,
                                             PartialIVCondBranch, L, DT, LI, AA,
diff --git a/llvm/test/Transforms/SimpleLoopUnswitch/pr138509.ll b/llvm/test/Transforms/SimpleLoopUnswitch/pr138509.ll
new file mode 100644
index 0000000000000..e24d17f088427
--- /dev/null
+++ b/llvm/test/Transforms/SimpleLoopUnswitch/pr138509.ll
@@ -0,0 +1,49 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -S -passes="loop-mssa(loop-simplifycfg,licm,loop-rotate,simple-loop-unswitch<nontrivial>)" < %s | FileCheck %s
+
+@a = global i32 0, align 4
+@b = global i32 0, align 4
+@c = global i32 0, align 4
+@d = global i32 0, align 4
+
+define i32 @main() {
+entry:
+  br label %outer.loop.header
+
+outer.loop.header:                                ; preds = %outer.loop.latch, %entry
+  br i1 false, label %exit, label %outer.loop.body
+
+outer.loop.body:                                  ; preds = %inner.loop.header, %outer.loop.header
+  store i32 1, ptr @c, align 4
+  %cmp = icmp sgt i32 0, -1
+  br i1 %cmp, label %outer.loop.latch, label %exit
+
+inner.loop.header:                                ; preds = %outer.loop.latch, %inner.loop.body
+  %a_val = load i32, ptr @a, align 4
+  %c_val = load i32, ptr @c, align 4
+  %mul = mul nsw i32 %c_val, %a_val
+  store i32 %mul, ptr @b, align 4
+  %cmp2 = icmp sgt i32 %mul, -1
+  br i1 %cmp2, label %inner.loop.body, label %outer.loop.body
+
+inner.loop.body:                                  ; preds = %inner.loop.header
+  %mul2 = mul nsw i32 %c_val, 3
+  store i32 %mul2, ptr @c, align 4
+  store i32 %c_val, ptr @d, align 4
+  %mul3 = mul nsw i32 %c_val, %a_val
+  %cmp3 = icmp sgt i32 %mul3, -1
+  br i1 %cmp3, label %inner.loop.header, label %exit
+
+outer.loop.latch:                                 ; preds = %outer.loop.body
+  %d_val = load i32, ptr @d, align 4
+  store i32 %d_val, ptr @b, align 4
+  %cmp4 = icmp eq i32 %d_val, 0
+  br i1 %cmp4, label %inner.loop.header, label %outer.loop.header
+
+exit:                                             ; preds = %inner.loop.body, %outer.loop.body, %outer.loop.header
+  ret i32 0
+}
+
+; CHECK: [[LOOP0:.*]] = distinct !{[[LOOP0]], [[META1:![0-9]+]]}
+; CHECK: [[META1]] = !{!"llvm.loop.unswitch.nontrivial.disable"}
+; CHECK: [[LOOP2:.*]] = distinct !{[[LOOP2]], [[META1]]}

dtcxzyw · 2025-05-30T14:03:22Z

llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp


 static cl::opt<int>
-    UnswitchThreshold("unswitch-threshold", cl::init(50), cl::Hidden,
+    UnswitchThreshold("unswitch-threshold", cl::init(120), cl::Hidden,


> While this should optimistically suffice, ensure the outer loop basic block size is taken into account as well when estimating the cost for unswitching non-trivial conditions.

Can you provide some performance data on SPEC/llvm-test-suite? IMO adding llvm.loop.unswitch.nontrivial.disable is enough...

antoniofrighetto · 2025-06-02

Thinking on this more, I'm not totally sure we want to entirely prevent unswitched non-trivial conditions from being unswitched again (at least, this seems to be one of the reasons why the cost estimator tool is there). Perhaps there are downstream clients which may need to perform arbitrarily non-trivial unswitches (?), in which case we should only tune the cost. If this is not a concern, I think we could drop the extra cost and keep llvm.loop.unswitch.nontrivial.disable only.

antoniofrighetto · 2025-07-14

Thinking on this again, I don't think it is desirable to unswitch previously unswitched conditions in the new sibling loops, and recording the loops as unswitched should suffice. I updated the patch accordingly and simplified it by dropping the cost estimator changes.
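For readers following the thread, the cost adjustment that was dropped can be sketched as a standalone function. This is a minimal model, assuming the capped product from the earlier revision of the diff (siblings * parent size * 2^clones, limited by the unswitch threshold). The name `costMultiplier`, its parameters, and the overflow guard are illustrative assumptions, not the code in SimpleLoopUnswitch.cpp. Passing 0 for `ParentNumBlocks` models a loop without a parent.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for the capped product computed in
// CalculateUnswitchCostMultiplier in the earlier revision of the patch.
inline int costMultiplier(int SiblingsMultiplier, int ParentNumBlocks,
                          int ClonesPower, int Threshold) {
  assert(Threshold > 0 && SiblingsMultiplier > 0 && ClonesPower >= 0);

  // As in the diff: a parent loop contributes at least a factor of 1.
  int ParentSizeMultiplier = std::max(ParentNumBlocks, 1);

  // Assumption of this sketch: 2^31 or more clones always exceeds an int
  // threshold, so return the cap directly instead of shifting that far.
  if (ClonesPower >= 31)
    return Threshold;

  // Clamp after each factor so the intermediate value cannot overflow.
  // All factors are >= 1, so this yields the same result as clamping once.
  long long Product = SiblingsMultiplier;
  Product = std::min<long long>(Product * ParentSizeMultiplier, Threshold);
  Product = std::min<long long>(Product * (1LL << ClonesPower), Threshold);
  return static_cast<int>(Product);
}
```

With a threshold of 50, an inner loop with 2 siblings, a 3-block parent and 2^2 potential clones gets a multiplier of 24, while 2^4 potential clones already hits the cap.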

dtcxzyw

LGTM. Thank you!
Can you test this patch on llvm-test-suite to ensure that it doesn't cause performance regressions?

antoniofrighetto · 2025-07-23T13:12:17Z

> LGTM. Thank you! Can you test this patch on llvm-test-suite to ensure that it doesn't cause performance regressions?

Sure. The test-suite baseline and loop-unswitch builds were set up as follows:

$ mkdir build && cd build
$ cmake -G Ninja -DCMAKE_C_COMPILER=/path/to/clang -C ../cmake/caches/O2.cmake ..

Execution time:

~/llvm-test-suite $ python3 ./utils/compare.py ./baseline_build/output.json ./loop_unswitch_build/output.json
Tests: 3402
Metric: exec_time

Program                                       exec_time               
                                              base      patch  diff   
SingleSource/UnitTests/testcase-Value-1         0.00      0.00    inf%
tools/fpcmp-target                              0.00      0.00    inf%
SingleSour...r/AVX512F/Vector-AVX512F-shift     0.00      0.00    inf%
SingleSour...ctor-AVX512F-roundscale_scalar     0.00      0.00    inf%
SingleSour...tor/AVX512F/Vector-AVX512F-xor     0.00      0.00    inf%
SingleSour...s/Vector/SSE/Vector-sse.isamax     0.00      0.00    inf%
SingleSour.../AVX512F/Vector-AVX512F-reduce     0.00      0.00    inf%
SingleSour...AVX512F/Vector-AVX512F-movedup     0.00      0.00    inf%
SingleSour...execute/GCC-C-execute-941014-1     0.00      0.00    inf%
SingleSour...execute/GCC-C-execute-941025-1     0.00      0.00    inf%
SingleSour...execute/GCC-C-execute-960218-1     0.00      0.00    inf%
SingleSour...execute/GCC-C-execute-960219-1     0.00      0.00    inf%
SingleSour...execute/GCC-C-execute-960301-1     0.00      0.00    inf%
SingleSour...execute/GCC-C-execute-960302-1     0.00      0.00    inf%
SingleSour...execute/GCC-C-execute-960311-2     0.00      0.00    inf%
                           Geomean difference                  -100.0%
           exec_time                            
run             base          patch         diff
count  3402.000000    3402.000000    2994.000000
mean   1772.676173    1787.095436    inf        
std    34734.073638   35483.831295  NaN         
min    0.000000       0.000000      -1.000000   
25%    0.000600       0.000000      -0.142857   
50%    0.000800       0.000800       0.000000   
75%    3.602975       3.628384       0.125000   
max    978102.897079  983259.753532  inf

Compile time:

~/llvm-test-suite $ python3 ./utils/compare.py -m compile_time ./baseline_build/output.json ./loop_unswitch_build/output.json
Tests: 3402
Metric: compile_time

Program                                       compile_time             
                                              base         patch  diff 
SingleSource/UnitTests/blockstret               0.01         0.02 43.9%
SingleSour...ce/UnitTests/2003-05-26-Shorts     0.02         0.03 42.9%
SingleSour.../UnitTests/2007-01-04-KNR-Args     0.01         0.02 38.0%
SingleSour...tTests/2020-01-06-coverage-010     0.02         0.03 32.8%
SingleSour.../UnitTests/2002-05-02-CastTest     0.03         0.03 30.8%
SingleSour...tTests/2003-09-18-BitFieldTest     0.02         0.02 28.6%
SingleSour...tTests/2020-01-06-coverage-003     0.02         0.02 25.5%
SingleSource/UnitTests/block-byref-test         0.01         0.01 25.3%
SingleSour...tTests/2002-05-02-ArgumentTest     0.01         0.02 23.5%
SingleSour...Tests/block-copied-in-cxxobj-1     0.01         0.01 22.0%
SingleSour...UnitTests/ms_struct-bitfield-1     0.02         0.02 19.2%
SingleSour...ests/ms_struct-bitfield-init-1     0.02         0.03 18.4%
MultiSourc...enchmarks/McCat/17-bintr/bintr     0.08         0.09 17.8%
SingleSour...tTests/2004-02-02-NegativeZero     0.01         0.01 17.6%
SingleSour...UnitTests/DefaultInitDynArrays     0.01         0.01 16.4%
                           Geomean difference                     -1.3%
      compile_time                         
run           base        patch        diff
count  3402.000000  3402.000000  467.000000
mean   0.212624     0.212469    -0.008679  
std    2.140155     2.140445     0.089601  
min    0.000000     0.000000    -0.442177  
25%    0.000000     0.000000    -0.031559  
50%    0.000000     0.000000    -0.002967  
75%    0.000000     0.000000     0.015789  
max    88.569200    88.710500    0.439252

Overall mean:

mean exec_time:    1772.68s -> 1787.10s (+0.8%)
mean compile_time: 0.213s   -> 0.212s   (-0.1%)

Should be on track.

nikic · 2025-07-24T08:35:26Z

Does this mean that we can now only unswitch a single condition in the loop?

antoniofrighetto · 2025-07-24T08:46:59Z

> Does this mean that we can now only unswitch a single condition in the loop?

It should mean that the conditions that have happened to be created in the new sibling loops during cloning won't be unswitched further.
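The distinction can be illustrated with a toy model. This is only a sketch under simplifying assumptions: `ToyLoop`, `disableName`, `mayCollectCandidates` and `unswitchOnce` are hypothetical names, and a loop is reduced to its option strings plus a list of remaining invariant conditions. It mirrors two things from the diff: clones are tagged with `Tag + ".disable"`, and the regular candidate collection is skipped for tagged loops, while the original loop stays untagged and can be revisited.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy stand-in for llvm::Loop: only loop-ID option strings and the
// invariant conditions still present in the loop body are modelled.
struct ToyLoop {
  std::set<std::string> Options;
  std::vector<std::string> Conditions;
};

// Same name construction as in the patch: Tag + ".disable".
inline std::string disableName(const std::string &Tag) {
  return Tag + ".disable";
}

// Models the new check guarding the regular candidate collection.
inline bool mayCollectCandidates(const ToyLoop &L) {
  return L.Options.count(disableName("llvm.loop.unswitch.nontrivial")) == 0;
}

// Unswitch the first remaining condition of L and return the cloned loop.
// The clone keeps copies of the other conditions but is tagged, so those
// copies are not collected again. L itself is left untagged.
inline ToyLoop unswitchOnce(ToyLoop &L) {
  assert(mayCollectCandidates(L) && !L.Conditions.empty());
  L.Conditions.erase(L.Conditions.begin());
  ToyLoop Clone = L;
  Clone.Options.insert(disableName("llvm.loop.unswitch.nontrivial"));
  return Clone;
}
```

In this model a loop with two invariant conditions can still be unswitched twice, once per revisit of the original loop, but neither clone is considered for further non-trivial unswitching.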

bjope · 2025-08-05T18:49:20Z

For the record, this did cause benchmark regressions for our downstream target. The worst case I've seen was a cycle increase by a factor of 1.3 for one benchmark. I don't have much more to say about it right now (I haven't had time to look at it closely enough to understand the exact scenario). I just wanted to mention this in case it turns out more people have problems.

bjope · 2025-08-12T21:40:18Z

As I'm trying to understand the regression we see a bit more, I'm curious to know more about the scenario that this patch is trying to prevent. Did a single invocation of loop-unswitch cause exponential growth in the past? (I assume that was the case as the involved test cases seem to only add loop-unswitch to the pipeline once)

Afaik we might run SimpleLoopUnswitch multiple times in a pipeline. And there is functionality such as ShouldRunExtraSimpleLoopUnswitch that adds extra runs of SimpleLoopUnswitch. There is even a performance tip in docs/Frontend/PerformanceTips.rst about the possibility of adding extra executions of LoopUnswitch in some situations to get more unswitching.

Should perhaps the llvm.loop.unswitch.nontrivial.disable metadata be removed again at the end of the pass to allow more unswitching if running the pass multiple times? That would still prevent exponential unswitching as we wouldn't unswitch a loop more times than the number of times the pass is added to the pipeline.

bjope · 2025-08-12T21:59:48Z

Removing it at the end of the "pass" is perhaps not as simple as it sounds, since it would rather have to happen at the end of the loop pass manager run or something like that (given that the loop pass manager runs the pass on one loop at a time).
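As a rough illustration of the removal step alone, the sketch below rewrites a list of option strings by dropping every entry with a given prefix and appending new ones. That is the way the patch uses `makePostTransformationMetadata` (remove prefixes `{Tag}`, add `{DisableMD}`). The helper `rewriteOptions` and the flat string list are assumptions made for illustration. Where such a step would be hooked into the loop pass manager is exactly the open question above and is not modelled here.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: drop all options that start with RemovePrefix,
// then append the options in Add. An empty Add strips the tag only.
inline std::vector<std::string>
rewriteOptions(const std::vector<std::string> &Options,
               const std::string &RemovePrefix,
               const std::vector<std::string> &Add) {
  std::vector<std::string> Result;
  for (const std::string &Opt : Options) {
    // rfind(Prefix, 0) == 0 is a prefix test that works before C++20.
    if (Opt.rfind(RemovePrefix, 0) != 0)
      Result.push_back(Opt);
  }
  Result.insert(Result.end(), Add.begin(), Add.end());
  return Result;
}
```

Calling it with the prefix `llvm.loop.unswitch.nontrivial` and an empty `Add` list removes the disable tag while leaving unrelated loop options in place.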

antoniofrighetto requested review from dtcxzyw, fhahn and nikic May 22, 2025 18:44

llvmbot added the llvm:transforms label May 22, 2025

antoniofrighetto marked this pull request as draft May 22, 2025 21:44

dtcxzyw reviewed May 23, 2025

llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp (outdated)

antoniofrighetto marked this pull request as ready for review May 23, 2025 15:22

dtcxzyw reviewed May 30, 2025

dtcxzyw mentioned this pull request May 30, 2025

Fuzz PR141121 dtcxzyw/llvm-fuzz-service#76

Closed

antoniofrighetto requested a review from preames June 5, 2025 08:46

dtcxzyw approved these changes Jul 15, 2025

[SimpleLoopUnswitch] Record loops from unswitching non-trivial condit…

e9de32f

…ions Track newly-cloned loops coming from unswitching non-trivial invariant conditions, so as to prevent conditions in such cloned blocks from being unswitched again. Fixes: llvm#138509.

antoniofrighetto force-pushed the feature/loop-unswitch-record-new-loops branch from 015a1aa to e9de32f Compare July 24, 2025 08:28

antoniofrighetto merged commit e9de32f into llvm:main Jul 24, 2025
7 of 9 checks passed

antoniofrighetto mentioned this pull request Aug 18, 2025

[14.0.0 regression] Compiler hang at -O3 on x86_64-linux-gnu #138509

Closed

bjope mentioned this pull request Aug 26, 2025

[SimpleLoopUnswitch] Adjust cost multiplier accounting for parent loop size #155379

Merged
