@antoniofrighetto
Contributor

@antoniofrighetto antoniofrighetto commented May 22, 2025

Track newly-cloned loops coming from unswitching non-trivial invariant conditions, so as to prevent conditions in such cloned blocks from being unswitched again.

Fixes: #138509.

@llvmbot
Member

llvmbot commented May 22, 2025

@llvm/pr-subscribers-llvm-transforms

Author: Antonio Frighetto (antoniofrighetto)

Changes

Track newly-cloned loops coming from unswitching non-trivial invariant conditions, so as to prevent conditions in such cloned blocks from being unswitched again. While this should optimistically suffice, ensure the outer loop basic block size is taken into account as well when estimating the cost for unswitching non-trivial conditions.

Fixes: #138509.
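
As a rough illustration of the tracking described above, the following self-contained sketch models a loop's option metadata as a plain set of strings. `ToyLoop`, `disableOptionName`, `recordLoopAsUnswitched`, and `mayCollectNonTrivialCandidates` are hypothetical names made up for this example and are not LLVM APIs; only the tag strings and the `<tag>.disable` naming scheme are taken from the diff.

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy stand-in for llvm::Loop (assumption): it only models the loop's
// option metadata as a set of strings.
struct ToyLoop {
  std::set<std::string> Options;
};

// Same naming scheme as the patch's RecordLoopAsUnswitched lambda:
// the disabling option is "<Tag>.disable".
inline std::string disableOptionName(const std::string &Tag) {
  assert(!Tag.empty() && "tag must not be empty");
  return Tag + ".disable";
}

// Mark a loop so that later visits can tell it was already handled.
inline void recordLoopAsUnswitched(ToyLoop &L, const std::string &Tag) {
  L.Options.insert(disableOptionName(Tag));
}

// Toy analogue of the new guard in unswitchBestCondition: non-trivial
// candidates are collected only if the loop does not carry the
// "llvm.loop.unswitch.nontrivial.disable" option.
inline bool mayCollectNonTrivialCandidates(const ToyLoop &L) {
  return L.Options.count("llvm.loop.unswitch.nontrivial.disable") == 0;
}
```

In this model a freshly cloned loop passed to `recordLoopAsUnswitched` with the tag `"llvm.loop.unswitch.nontrivial"` is skipped by the guard, while a loop tagged only with `"llvm.loop.unswitch.partial"` is not.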


Full diff: https://github.com/llvm/llvm-project/pull/141121.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp (+35-26)
  • (added) llvm/test/Transforms/SimpleLoopUnswitch/pr138509.ll (+49)
diff --git a/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp b/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
index 0bf90036b8b82..4ebb73e917370 100644
--- a/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
+++ b/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
@@ -2142,9 +2142,22 @@ void visitDomSubTree(DominatorTree &DT, BasicBlock *BB, CallableT Callable) {
 void postUnswitch(Loop &L, LPMUpdater &U, StringRef LoopName,
                   bool CurrentLoopValid, bool PartiallyInvariant,
                   bool InjectedCondition, ArrayRef<Loop *> NewLoops) {
-  // If we did a non-trivial unswitch, we have added new (cloned) loops.
-  if (!NewLoops.empty())
+  auto RecordLoopAsUnswitched = [&](Loop *TargetLoop, StringRef Tag) {
+    auto &Ctx = TargetLoop->getHeader()->getContext();
+    const auto &DisableMDName = (Twine(Tag) + ".disable").str();
+    MDNode *DisableMD = MDNode::get(Ctx, MDString::get(Ctx, DisableMDName));
+    MDNode *NewLoopID = makePostTransformationMetadata(
+        Ctx, TargetLoop->getLoopID(), {Tag}, {DisableMD});
+    TargetLoop->setLoopID(NewLoopID);
+  };
+
+  // If we performed a non-trivial unswitch, we have added new cloned loops.
+  // Mark such newly-created loops as visited.
+  if (!NewLoops.empty()) {
+    for (Loop *NL : NewLoops)
+      RecordLoopAsUnswitched(NL, "llvm.loop.unswitch.nontrivial");
     U.addSiblingLoops(NewLoops);
+  }
 
   // If the current loop remains valid, we should revisit it to catch any
   // other unswitch opportunities. Otherwise, we need to mark it as deleted.
@@ -2152,24 +2165,10 @@ void postUnswitch(Loop &L, LPMUpdater &U, StringRef LoopName,
     if (PartiallyInvariant) {
       // Mark the new loop as partially unswitched, to avoid unswitching on
       // the same condition again.
-      auto &Context = L.getHeader()->getContext();
-      MDNode *DisableUnswitchMD = MDNode::get(
-          Context,
-          MDString::get(Context, "llvm.loop.unswitch.partial.disable"));
-      MDNode *NewLoopID = makePostTransformationMetadata(
-          Context, L.getLoopID(), {"llvm.loop.unswitch.partial"},
-          {DisableUnswitchMD});
-      L.setLoopID(NewLoopID);
+      RecordLoopAsUnswitched(&L, "llvm.loop.unswitch.partial");
     } else if (InjectedCondition) {
       // Do the same for injection of invariant conditions.
-      auto &Context = L.getHeader()->getContext();
-      MDNode *DisableUnswitchMD = MDNode::get(
-          Context,
-          MDString::get(Context, "llvm.loop.unswitch.injection.disable"));
-      MDNode *NewLoopID = makePostTransformationMetadata(
-          Context, L.getLoopID(), {"llvm.loop.unswitch.injection"},
-          {DisableUnswitchMD});
-      L.setLoopID(NewLoopID);
+      RecordLoopAsUnswitched(&L, "llvm.loop.unswitch.injection");
     } else
       U.revisitCurrentLoop();
   } else
@@ -2806,9 +2805,9 @@ static BranchInst *turnGuardIntoBranch(IntrinsicInst *GI, Loop &L,
 }
 
 /// Cost multiplier is a way to limit potentially exponential behavior
-/// of loop-unswitch. Cost is multipied in proportion of 2^number of unswitch
-/// candidates available. Also accounting for the number of "sibling" loops with
-/// the idea to account for previous unswitches that already happened on this
+/// of loop-unswitch. Cost is multiplied in proportion of 2^number of unswitch
+/// candidates available. Also consider the number of "sibling" loops with
+/// the idea of accounting for previous unswitches that already happened on this
 /// cluster of loops. There was an attempt to keep this formula simple,
 /// just enough to limit the worst case behavior. Even if it is not that simple
 /// now it is still not an attempt to provide a detailed heuristic size
@@ -2839,7 +2838,14 @@ static int CalculateUnswitchCostMultiplier(
     return 1;
   }
 
+  // When dealing with nested loops, the basic block size of the outer loop may
+  // increase significantly during unswitching non-trivial conditions. The final
+  // cost may be adjusted taking this into account.
   auto *ParentL = L.getParentLoop();
+  int ParentSizeMultiplier = 1;
+  if (ParentL)
+    ParentSizeMultiplier = std::max((int)ParentL->getNumBlocks(), 1);
+
   int SiblingsCount = (ParentL ? ParentL->getSubLoopsVector().size()
                                : std::distance(LI.begin(), LI.end()));
   // Count amount of clones that all the candidates might cause during
@@ -2887,11 +2893,13 @@ static int CalculateUnswitchCostMultiplier(
       SiblingsMultiplier > UnswitchThreshold)
     CostMultiplier = UnswitchThreshold;
   else
-    CostMultiplier = std::min(SiblingsMultiplier * (1 << ClonesPower),
-                              (int)UnswitchThreshold);
+    CostMultiplier =
+        std::min(SiblingsMultiplier * ParentSizeMultiplier * (1 << ClonesPower),
+                 (int)UnswitchThreshold);
 
   LLVM_DEBUG(dbgs() << "  Computed multiplier  " << CostMultiplier
-                    << " (siblings " << SiblingsMultiplier << " * clones "
+                    << " (siblings " << SiblingsMultiplier << "* parent size "
+                    << ParentSizeMultiplier << " * clones "
                     << (1 << ClonesPower) << ")"
                     << " for unswitch candidate: " << TI << "\n");
   return CostMultiplier;
@@ -3504,8 +3512,9 @@ static bool unswitchBestCondition(Loop &L, DominatorTree &DT, LoopInfo &LI,
   SmallVector<NonTrivialUnswitchCandidate, 4> UnswitchCandidates;
   IVConditionInfo PartialIVInfo;
   Instruction *PartialIVCondBranch = nullptr;
-  collectUnswitchCandidates(UnswitchCandidates, PartialIVInfo,
-                            PartialIVCondBranch, L, LI, AA, MSSAU);
+  if (!findOptionMDForLoop(&L, "llvm.loop.unswitch.nontrivial.disable"))
+    collectUnswitchCandidates(UnswitchCandidates, PartialIVInfo,
+                              PartialIVCondBranch, L, LI, AA, MSSAU);
   if (!findOptionMDForLoop(&L, "llvm.loop.unswitch.injection.disable"))
     collectUnswitchCandidatesWithInjections(UnswitchCandidates, PartialIVInfo,
                                             PartialIVCondBranch, L, DT, LI, AA,
diff --git a/llvm/test/Transforms/SimpleLoopUnswitch/pr138509.ll b/llvm/test/Transforms/SimpleLoopUnswitch/pr138509.ll
new file mode 100644
index 0000000000000..e24d17f088427
--- /dev/null
+++ b/llvm/test/Transforms/SimpleLoopUnswitch/pr138509.ll
@@ -0,0 +1,49 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -S -passes="loop-mssa(loop-simplifycfg,licm,loop-rotate,simple-loop-unswitch<nontrivial>)" < %s | FileCheck %s
+
+@a = global i32 0, align 4
+@b = global i32 0, align 4
+@c = global i32 0, align 4
+@d = global i32 0, align 4
+
+define i32 @main() {
+entry:
+  br label %outer.loop.header
+
+outer.loop.header:                                ; preds = %outer.loop.latch, %entry
+  br i1 false, label %exit, label %outer.loop.body
+
+outer.loop.body:                                  ; preds = %inner.loop.header, %outer.loop.header
+  store i32 1, ptr @c, align 4
+  %cmp = icmp sgt i32 0, -1
+  br i1 %cmp, label %outer.loop.latch, label %exit
+
+inner.loop.header:                                ; preds = %outer.loop.latch, %inner.loop.body
+  %a_val = load i32, ptr @a, align 4
+  %c_val = load i32, ptr @c, align 4
+  %mul = mul nsw i32 %c_val, %a_val
+  store i32 %mul, ptr @b, align 4
+  %cmp2 = icmp sgt i32 %mul, -1
+  br i1 %cmp2, label %inner.loop.body, label %outer.loop.body
+
+inner.loop.body:                                  ; preds = %inner.loop.header
+  %mul2 = mul nsw i32 %c_val, 3
+  store i32 %mul2, ptr @c, align 4
+  store i32 %c_val, ptr @d, align 4
+  %mul3 = mul nsw i32 %c_val, %a_val
+  %cmp3 = icmp sgt i32 %mul3, -1
+  br i1 %cmp3, label %inner.loop.header, label %exit
+
+outer.loop.latch:                                 ; preds = %outer.loop.body
+  %d_val = load i32, ptr @d, align 4
+  store i32 %d_val, ptr @b, align 4
+  %cmp4 = icmp eq i32 %d_val, 0
+  br i1 %cmp4, label %inner.loop.header, label %outer.loop.header
+
+exit:                                             ; preds = %inner.loop.body, %outer.loop.body, %outer.loop.header
+  ret i32 0
+}
+
+; CHECK: [[LOOP0:.*]] = distinct !{[[LOOP0]], [[META1:![0-9]+]]}
+; CHECK: [[META1]] = !{!"llvm.loop.unswitch.nontrivial.disable"}
+; CHECK: [[LOOP2:.*]] = distinct !{[[LOOP2]], [[META1]]}

@antoniofrighetto antoniofrighetto marked this pull request as draft May 22, 2025 21:44
@antoniofrighetto antoniofrighetto marked this pull request as ready for review May 23, 2025 15:22

 static cl::opt<int>
-UnswitchThreshold("unswitch-threshold", cl::init(50), cl::Hidden,
+UnswitchThreshold("unswitch-threshold", cl::init(120), cl::Hidden,
Member

> While this should optimistically suffice, ensure the outer loop basic block size is taken into account as well when estimating the cost for unswitching non-trivial conditions.

Can you provide some performance data on SPEC/llvm-test-suite? IMO adding llvm.loop.unswitch.nontrivial.disable is enough...

Contributor Author

Thinking about this more, I'm not totally sure we want to entirely prevent unswitched non-trivial conditions from being unswitched again (at least, this seems to be one of the reasons the cost estimator is there). Perhaps there are downstream clients that need to perform arbitrarily many non-trivial unswitches (?), in which case we should only tune the cost. If this is not a concern, I think we could drop the extra cost and keep only llvm.loop.unswitch.nontrivial.disable.
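
To make the "only tune the cost" option concrete, here is a minimal worked sketch of the final clamp from the proposed diff. The function name `clampedCostMultiplier` and the sample inputs are assumptions for illustration; the sketch covers only the last `else` branch of `CalculateUnswitchCostMultiplier` and leaves out the earlier saturation checks. The thresholds 50 and 120 are the two values visible in the review snippet.

```cpp
#include <algorithm>
#include <cassert>

// Final clamp as written in the proposed diff:
//   min(SiblingsMultiplier * ParentSizeMultiplier * 2^ClonesPower, Threshold)
// The asserts keep this toy version away from integer overflow; the real
// code guards against large values before reaching this point.
inline int clampedCostMultiplier(int SiblingsMultiplier,
                                 int ParentSizeMultiplier, int ClonesPower,
                                 int UnswitchThreshold) {
  assert(ClonesPower >= 0 && ClonesPower < 16);
  assert(SiblingsMultiplier >= 1 && SiblingsMultiplier < 1024);
  assert(ParentSizeMultiplier >= 1 && ParentSizeMultiplier < 1024);
  return std::min(SiblingsMultiplier * ParentSizeMultiplier *
                      (1 << ClonesPower),
                  UnswitchThreshold);
}
```

With a siblings multiplier of 2 and a clones power of 3, a parent size factor of 1 gives 2 * 1 * 8 = 16. A parent loop of 4 blocks gives 2 * 4 * 8 = 64, which is clamped to 50 under a threshold of 50 and stays at 64 under a threshold of 120.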

Contributor Author

Thinking about this again, I don't think it is desirable to unswitch previously unswitched conditions in the new sibling loops, and recording the loops as unswitched should suffice. Updated this and simplified by dropping the cost estimator changes.

Member

@dtcxzyw dtcxzyw left a comment

LGTM. Thank you!
Can you test this patch on llvm-test-suite to ensure that it doesn't cause performance regressions?

@antoniofrighetto
Contributor Author

> LGTM. Thank you! Can you test this patch on llvm-test-suite to ensure that it doesn't cause performance regressions?

Sure. The test-suite baseline and loop-unswitch builds were set up as follows:

$ mkdir build && cd build
$ cmake -G Ninja -DCMAKE_C_COMPILER=/path/to/clang -C ../cmake/caches/O2.cmake ..

Execution time:

~/llvm-test-suite $ python3 ./utils/compare.py ./baseline_build/output.json ./loop_unswitch_build/output.json
Tests: 3402
Metric: exec_time

Program                                       exec_time               
                                              base      patch  diff   
SingleSource/UnitTests/testcase-Value-1         0.00      0.00    inf%
tools/fpcmp-target                              0.00      0.00    inf%
SingleSour...r/AVX512F/Vector-AVX512F-shift     0.00      0.00    inf%
SingleSour...ctor-AVX512F-roundscale_scalar     0.00      0.00    inf%
SingleSour...tor/AVX512F/Vector-AVX512F-xor     0.00      0.00    inf%
SingleSour...s/Vector/SSE/Vector-sse.isamax     0.00      0.00    inf%
SingleSour.../AVX512F/Vector-AVX512F-reduce     0.00      0.00    inf%
SingleSour...AVX512F/Vector-AVX512F-movedup     0.00      0.00    inf%
SingleSour...execute/GCC-C-execute-941014-1     0.00      0.00    inf%
SingleSour...execute/GCC-C-execute-941025-1     0.00      0.00    inf%
SingleSour...execute/GCC-C-execute-960218-1     0.00      0.00    inf%
SingleSour...execute/GCC-C-execute-960219-1     0.00      0.00    inf%
SingleSour...execute/GCC-C-execute-960301-1     0.00      0.00    inf%
SingleSour...execute/GCC-C-execute-960302-1     0.00      0.00    inf%
SingleSour...execute/GCC-C-execute-960311-2     0.00      0.00    inf%
                           Geomean difference                  -100.0%
           exec_time                            
run             base          patch         diff
count  3402.000000    3402.000000    2994.000000
mean   1772.676173    1787.095436    inf        
std    34734.073638   35483.831295  NaN         
min    0.000000       0.000000      -1.000000   
25%    0.000600       0.000000      -0.142857   
50%    0.000800       0.000800       0.000000   
75%    3.602975       3.628384       0.125000   
max    978102.897079  983259.753532  inf

Compile time:

~/llvm-test-suite $ python3 ./utils/compare.py -m compile_time ./baseline_build/output.json ./loop_unswitch_build/output.json
Tests: 3402
Metric: compile_time

Program                                       compile_time             
                                              base         patch  diff 
SingleSource/UnitTests/blockstret               0.01         0.02 43.9%
SingleSour...ce/UnitTests/2003-05-26-Shorts     0.02         0.03 42.9%
SingleSour.../UnitTests/2007-01-04-KNR-Args     0.01         0.02 38.0%
SingleSour...tTests/2020-01-06-coverage-010     0.02         0.03 32.8%
SingleSour.../UnitTests/2002-05-02-CastTest     0.03         0.03 30.8%
SingleSour...tTests/2003-09-18-BitFieldTest     0.02         0.02 28.6%
SingleSour...tTests/2020-01-06-coverage-003     0.02         0.02 25.5%
SingleSource/UnitTests/block-byref-test         0.01         0.01 25.3%
SingleSour...tTests/2002-05-02-ArgumentTest     0.01         0.02 23.5%
SingleSour...Tests/block-copied-in-cxxobj-1     0.01         0.01 22.0%
SingleSour...UnitTests/ms_struct-bitfield-1     0.02         0.02 19.2%
SingleSour...ests/ms_struct-bitfield-init-1     0.02         0.03 18.4%
MultiSourc...enchmarks/McCat/17-bintr/bintr     0.08         0.09 17.8%
SingleSour...tTests/2004-02-02-NegativeZero     0.01         0.01 17.6%
SingleSour...UnitTests/DefaultInitDynArrays     0.01         0.01 16.4%
                           Geomean difference                     -1.3%
      compile_time                         
run           base        patch        diff
count  3402.000000  3402.000000  467.000000
mean   0.212624     0.212469    -0.008679  
std    2.140155     2.140445     0.089601  
min    0.000000     0.000000    -0.442177  
25%    0.000000     0.000000    -0.031559  
50%    0.000000     0.000000    -0.002967  
75%    0.000000     0.000000     0.015789  
max    88.569200    88.710500    0.439252  

Overall mean:

mean exec_time:    1772.68s -> 1787.10s (+0.8%)
mean compile_time: 0.213s   -> 0.212s   (-0.1%)

Should be on track.

…ions

Track newly-cloned loops coming from unswitching non-trivial invariant
conditions, so as to prevent conditions in such cloned blocks from
being unswitched again.

Fixes: llvm#138509.
@antoniofrighetto antoniofrighetto force-pushed the feature/loop-unswitch-record-new-loops branch from 015a1aa to e9de32f Compare July 24, 2025 08:28
@antoniofrighetto antoniofrighetto merged commit e9de32f into llvm:main Jul 24, 2025
7 of 9 checks passed
@nikic
Contributor

nikic commented Jul 24, 2025

Does this mean that we can now only unswitch a single condition in the loop?

@antoniofrighetto
Contributor Author

> Does this mean that we can now only unswitch a single condition in the loop?

It should mean that the conditions that have happened to be created in the new sibling loops during cloning won't be unswitched further.
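
The difference can be sketched with a toy worklist model. This is an assumption-laden simplification, not the pass itself: it ignores the cost model entirely, treats every unswitch as producing exactly one clone, and assumes the original loop is revisited (as `postUnswitch` does through `U.revisitCurrentLoop()` in the plain non-trivial case). `ToyLoop` and `countLoopsAfterUnswitching` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Toy loop: how many invariant conditions it still contains, and whether
// it carries the "nontrivial.disable" marker.
struct ToyLoop {
  int RemainingConditions;
  bool NonTrivialDisabled;
};

// Drains a worklist of loops. Each unswitch removes one condition and
// produces a clone that still contains all the other conditions.
// Returns the number of loops left once nothing more can be unswitched.
inline int countLoopsAfterUnswitching(int NumConditions, bool TagClones) {
  assert(NumConditions >= 0 && NumConditions < 20);
  std::vector<ToyLoop> Worklist{{NumConditions, false}};
  int Finished = 0;
  while (!Worklist.empty()) {
    ToyLoop L = Worklist.back();
    Worklist.pop_back();
    if (L.NonTrivialDisabled || L.RemainingConditions == 0) {
      ++Finished; // Nothing (more) to unswitch in this loop.
      continue;
    }
    int Rest = L.RemainingConditions - 1;
    Worklist.push_back({Rest, false});     // Original loop is revisited.
    Worklist.push_back({Rest, TagClones}); // Clone is queued as a sibling.
  }
  return Finished;
}
```

In this model, a loop with 3 conditions ends up as 4 loops when clones are tagged (the original is still unswitched on every condition) and as 8 loops when they are not, so the growth is linear rather than exponential in the number of conditions.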

@bjope
Collaborator

bjope commented Aug 5, 2025

For the record, this did cause benchmark regressions for our downstream target. The worst case I've seen was a cycle count increase by a factor of 1.3 for one benchmark. I don't have much more to say about it right now (I haven't had time to look at it more closely to understand the exact scenario). I just wanted to mention this in case it turns out more people have problems.

@bjope
Collaborator

bjope commented Aug 12, 2025

As I'm trying to understand the regression we see a bit more, I'm curious to know more about the scenario that this patch is trying to prevent. Did a single invocation of loop-unswitch cause exponential growth in the past? (I assume that was the case as the involved test cases seem to only add loop-unswitch to the pipeline once)

Afaik we might run SimpleLoopUnswitch multiple times in a pipeline, and there is functionality like ShouldRunExtraSimpleLoopUnswitch that adds extra runs of SimpleLoopUnswitch. There is even a performance tip in docs/Frontend/PerformanceTips.rst about the possibility of adding extra executions of LoopUnswitch in some situations to get more unswitching.

Should perhaps the llvm.loop.unswitch.nontrivial.disable metadata be removed again at the end of the pass to allow more unswitching if running the pass multiple times? That would still prevent exponential unswitching as we wouldn't unswitch a loop more times than the number of times the pass is added to the pipeline.

@bjope
Collaborator

bjope commented Aug 12, 2025

Removing it at the end of the "pass" is perhaps not as simple as it sounds as it rather would be at the end of the loop pass manager or something like that (given that the loop pass manager is running the pass on one loop at a time).

Successfully merging this pull request may close these issues.

[14.0.0 regression] Compiler hang at -O3 on x86_64-linux-gnu

5 participants