
Conversation

@DingdWang
Contributor

During compilation of large files with many branches, I observed that the function SortNonLocalDepInfoCache in MemoryDependenceAnalysis becomes a significant performance bottleneck. Cache.size() can be very large (around 20,000), but only a small number of entries (approximately 5 to 8) actually need sorting, and the original implementation falls back to a full sort whenever more than two entries are new, which is inefficient.
This patch introduces a lightweight heuristic: compare the number of unsorted entries against the logarithm of the cache size, insert the new entries individually when they are few, and fall back to a full sort otherwise.
As a result, the GVN pass runtime on a large file is reduced from approximately 26.3 minutes to 16.5 minutes.
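For orientation before reading the diff, here is a minimal standalone sketch of the approach, written against a plain std::vector<int> rather than the real NonLocalDepInfo type. The threshold uses an integer floor-log2 in the spirit of the final revision's Log2_32; floorLog2 and sortMostlySorted are names invented for this illustration, and the insertion loop binary-searches only the sorted prefix, which is one safe way to structure it (the actual patch organizes its loop differently):

#include <algorithm>
#include <cstdint>
#include <vector>

// floor(log2(V)); stands in for llvm::Log2_32 from llvm/Support/MathExtras.h.
static unsigned floorLog2(uint32_t V) { return 31 - __builtin_clz(V | 1); }

// Precondition: the first NumSorted elements of Cache are sorted; the
// remaining tail is unsorted and typically small.
static void sortMostlySorted(std::vector<int> &Cache, unsigned NumSorted) {
  unsigned NumUnsorted = Cache.size() - NumSorted;
  if (NumUnsorted == 0)
    return;
  if (NumUnsorted < floorLog2(Cache.size())) {
    // Few new entries: binary-search each one into the growing sorted
    // prefix instead of re-sorting the whole vector.
    for (size_t Sorted = NumSorted; Sorted < Cache.size(); ++Sorted) {
      auto Pos = std::upper_bound(Cache.begin(), Cache.begin() + Sorted,
                                  Cache[Sorted]);
      // Slide Cache[Sorted] down into place; [Pos, begin()+Sorted) shifts
      // right by one.
      std::rotate(Pos, Cache.begin() + Sorted, Cache.begin() + Sorted + 1);
    }
  } else {
    // Many new entries: a single full sort is cheaper.
    std::sort(Cache.begin(), Cache.end());
  }
}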

@github-actions

github-actions bot commented Jun 6, 2025

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository; in that case, you can instead tag reviewers by name in a comment, using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "pinging" the PR with a comment saying "Ping". The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot llvmbot added the llvm:analysis Includes value tracking, cost tables and constant folding label Jun 6, 2025
@llvmbot
Member

llvmbot commented Jun 6, 2025

@llvm/pr-subscribers-llvm-analysis

Author: None (DingdWang)

Changes

During compilation of large files with many branches, I observed that the function SortNonLocalDepInfoCache in MemoryDependenceAnalysis becomes a significant performance bottleneck. Cache.size() can be very large (around 20,000), but only a small number of entries (approximately 5 to 8) actually need sorting, and the original implementation falls back to a full sort whenever more than two entries are new, which is inefficient.
This patch introduces a lightweight heuristic: compare the number of unsorted entries against the logarithm of the cache size, insert the new entries individually when they are few, and fall back to a full sort otherwise.
As a result, the GVN pass runtime on a large file is reduced from approximately 26.3 minutes to 16.5 minutes.


Full diff: https://github.com/llvm/llvm-project/pull/143107.diff

1 File Affected:

  • (modified) llvm/lib/Analysis/MemoryDependenceAnalysis.cpp (+21-34)
diff --git a/llvm/lib/Analysis/MemoryDependenceAnalysis.cpp b/llvm/lib/Analysis/MemoryDependenceAnalysis.cpp
index f062189bac6a0..bd0d6bb18241a 100644
--- a/llvm/lib/Analysis/MemoryDependenceAnalysis.cpp
+++ b/llvm/lib/Analysis/MemoryDependenceAnalysis.cpp
@@ -49,6 +49,7 @@
 #include "llvm/Support/Debug.h"
 #include <algorithm>
 #include <cassert>
+#include <cmath>
 #include <iterator>
 #include <utility>
 
@@ -83,6 +84,9 @@ static cl::opt<unsigned>
 // Limit on the number of memdep results to process.
 static const unsigned int NumResultsLimit = 100;
 
+// for quickly calculating log
+const float ln2 = 0.69314718f;
+
 /// This is a helper function that removes Val from 'Inst's set in ReverseMap.
 ///
 /// If the set becomes empty, remove Inst's entry.
@@ -369,8 +373,7 @@ MemDepResult MemoryDependenceResults::getSimplePointerDependencyFrom(
     BasicBlock *BB, Instruction *QueryInst, unsigned *Limit,
     BatchAAResults &BatchAA) {
   bool isInvariantLoad = false;
-  Align MemLocAlign =
-      MemLoc.Ptr->getPointerAlignment(BB->getDataLayout());
+  Align MemLocAlign = MemLoc.Ptr->getPointerAlignment(BB->getDataLayout());
 
   unsigned DefaultLimit = getDefaultBlockScanLimit();
   if (!Limit)
@@ -418,7 +421,7 @@ MemDepResult MemoryDependenceResults::getSimplePointerDependencyFrom(
   // True for volatile instruction.
   // For Load/Store return true if atomic ordering is stronger than AO,
   // for other instruction just true if it can read or write to memory.
-  auto isComplexForReordering = [](Instruction * I, AtomicOrdering AO)->bool {
+  auto isComplexForReordering = [](Instruction *I, AtomicOrdering AO) -> bool {
     if (I->isVolatile())
       return true;
     if (auto *LI = dyn_cast<LoadInst>(I))
@@ -461,7 +464,7 @@ MemDepResult MemoryDependenceResults::getSimplePointerDependencyFrom(
       case Intrinsic::masked_load:
       case Intrinsic::masked_store: {
         MemoryLocation Loc;
-        /*ModRefInfo MR =*/ GetLocation(II, Loc, TLI);
+        /*ModRefInfo MR =*/GetLocation(II, Loc, TLI);
         AliasResult R = BatchAA.alias(Loc, MemLoc);
         if (R == AliasResult::NoAlias)
           continue;
@@ -890,7 +893,7 @@ void MemoryDependenceResults::getNonLocalPointerDependency(
   // translation.
   SmallDenseMap<BasicBlock *, Value *, 16> Visited;
   if (getNonLocalPointerDepFromBB(QueryInst, Address, Loc, isLoad, FromBB,
-                                   Result, Visited, true))
+                                  Result, Visited, true))
     return;
   Result.clear();
   Result.push_back(NonLocalDepResult(FromBB, MemDepResult::getUnknown(),
@@ -991,33 +994,19 @@ MemDepResult MemoryDependenceResults::getNonLocalInfoForBlock(
 static void
 SortNonLocalDepInfoCache(MemoryDependenceResults::NonLocalDepInfo &Cache,
                          unsigned NumSortedEntries) {
-  switch (Cache.size() - NumSortedEntries) {
-  case 0:
-    // done, no new entries.
-    break;
-  case 2: {
-    // Two new entries, insert the last one into place.
-    NonLocalDepEntry Val = Cache.back();
-    Cache.pop_back();
-    MemoryDependenceResults::NonLocalDepInfo::iterator Entry =
-        std::upper_bound(Cache.begin(), Cache.end() - 1, Val);
-    Cache.insert(Entry, Val);
-    [[fallthrough]];
-  }
-  case 1:
-    // One new entry, Just insert the new value at the appropriate position.
-    if (Cache.size() != 1) {
+
+  auto s = Cache.size() - NumSortedEntries;
+  if (s < log2(Cache.size()) * ln2) {
+    while (s > 0) {
       NonLocalDepEntry Val = Cache.back();
       Cache.pop_back();
       MemoryDependenceResults::NonLocalDepInfo::iterator Entry =
-          llvm::upper_bound(Cache, Val);
+          std::upper_bound(Cache.begin(), Cache.end() - 1, Val);
       Cache.insert(Entry, Val);
+      s--;
     }
-    break;
-  default:
-    // Added many values, do a full scale sort.
+  } else {
     llvm::sort(Cache);
-    break;
   }
 }
 
@@ -1343,8 +1332,8 @@ bool MemoryDependenceResults::getNonLocalPointerDepFromBB(
       // assume it is unknown, but this also does not block PRE of the load.
       if (!CanTranslate ||
           !getNonLocalPointerDepFromBB(QueryInst, PredPointer,
-                                      Loc.getWithNewPtr(PredPtrVal), isLoad,
-                                      Pred, Result, Visited)) {
+                                       Loc.getWithNewPtr(PredPtrVal), isLoad,
+                                       Pred, Result, Visited)) {
         // Add the entry to the Result list.
         NonLocalDepResult Entry(Pred, MemDepResult::getUnknown(), PredPtrVal);
         Result.push_back(Entry);
@@ -1412,7 +1401,6 @@ bool MemoryDependenceResults::getNonLocalPointerDepFromBB(
 
         I.setResult(MemDepResult::getUnknown());
 
-
         break;
       }
     }
@@ -1733,9 +1721,7 @@ MemoryDependenceWrapperPass::MemoryDependenceWrapperPass() : FunctionPass(ID) {}
 
 MemoryDependenceWrapperPass::~MemoryDependenceWrapperPass() = default;
 
-void MemoryDependenceWrapperPass::releaseMemory() {
-  MemDep.reset();
-}
+void MemoryDependenceWrapperPass::releaseMemory() { MemDep.reset(); }
 
 void MemoryDependenceWrapperPass::getAnalysisUsage(AnalysisUsage &AU) const {
   AU.setPreservesAll();
@@ -1745,8 +1731,9 @@ void MemoryDependenceWrapperPass::getAnalysisUsage(AnalysisUsage &AU) const {
   AU.addRequiredTransitive<TargetLibraryInfoWrapperPass>();
 }
 
-bool MemoryDependenceResults::invalidate(Function &F, const PreservedAnalyses &PA,
-                               FunctionAnalysisManager::Invalidator &Inv) {
+bool MemoryDependenceResults::invalidate(
+    Function &F, const PreservedAnalyses &PA,
+    FunctionAnalysisManager::Invalidator &Inv) {
   // Check whether our analysis is preserved.
   auto PAC = PA.getChecker<MemoryDependenceAnalysis>();
   if (!PAC.preserved() && !PAC.preservedSet<AllAnalysesOn<Function>>())

Contributor

@nikic nikic left a comment


Please do not reformat parts of the file you do not modify.

Why does this cache use a sorted vector at all? Would it be possible to convert it to a DenseMap instead?

@github-actions

github-actions bot commented Jun 11, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@DingdWang
Contributor Author

DingdWang commented Jun 13, 2025

Please do not reformat parts of the file you do not modify.

Why does this cache use a sorted vector at all? Would it be possible to convert it to a DenseMap instead?

  1. Sorry, I have reverted the reformatting of the unmodified parts.
  2. Within MemDep, I found that the sorted vector is primarily used to look up records for a specific basic block in the cache. A DenseMap could serve this purpose as well. However, I’m not entirely sure if switching to a DenseMap would affect the users of MemDep, since the order of the cache currently influences the order of the returned results.
    Even if the order does matter, we could always perform a sorting step right before returning the results to preserve the expected order. In addition, changing to a DenseMap would alter the interface, and users would need to adjust their usage accordingly.
    Overall, I think switching to a DenseMap is worth considering. I plan to experiment with this approach to see if there are any unforeseen impacts; if it proves feasible, I will submit a separate patch for this change in the future. A rough sketch of the direction follows below.
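To make the DenseMap direction concrete, here is a rough sketch (purely hypothetical, not part of this patch). NonLocalDepMap and orderedEntries are invented names, and it assumes callers that depend on ordering can tolerate a sort-on-demand step:

#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/Analysis/MemoryDependenceAnalysis.h"

using namespace llvm;

// Hypothetical replacement for the sorted NonLocalDepInfo vector:
// O(1) lookup by basic block, no incremental re-sorting on insertion.
using NonLocalDepMap = DenseMap<BasicBlock *, MemDepResult>;

// Recover the ordered view only when a caller actually needs it.
static SmallVector<NonLocalDepEntry, 16>
orderedEntries(const NonLocalDepMap &Map) {
  SmallVector<NonLocalDepEntry, 16> Out;
  Out.reserve(Map.size());
  for (const auto &KV : Map)
    Out.emplace_back(KV.first, KV.second);
  llvm::sort(Out); // NonLocalDepEntry orders by its BasicBlock pointer.
  return Out;
}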

@DingdWang DingdWang marked this pull request as draft June 26, 2025 06:17
@DingdWang DingdWang closed this Jun 26, 2025
@DingdWang DingdWang reopened this Jun 26, 2025
@DingdWang
Contributor Author

Compile time result: https://llvm-compile-time-tracker.com/compare.php?from=26f3f24a4f0a67eb23d255aba7a73a12bee1db11&to=4e570371a96e25c8b7f9d25266afe864dcfdfa20&stat=instructions%3Au

The optimization effect does not appear significant on the benchmark. I checked the related cache size and NumSortedEntries outputs, and found that on the benchmark, the overall cache size is relatively small—around a few hundred entries—and the function is called infrequently. As a result, the number of times this patch can actually hit and optimize is limited, so the performance improvement is not obvious.
In contrast, for large files, this function can be called nearly 16 million times, with about 1.5 million hits benefiting from the optimization, which leads to a clear performance gain. This data suggests that the original code’s approach of only considering cases 1 and 2 for insertion sort is insufficient; in reality, cases 3, 4, and 5 occur frequently as well.
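As a rough back-of-envelope illustration (my arithmetic, not numbers from the profile): with Cache.size() around 20,000, log2(20,000) is about 14.3, so a full llvm::sort performs on the order of 20,000 × 14.3 ≈ 286,000 comparisons, while inserting 5 new entries by binary search needs only about 5 × 14.3 ≈ 72 comparisons. Each insertion still shifts up to 20,000 trailing entries, but that is a contiguous move of small trivially-copyable entries, which is far cheaper per element than comparison-driven sorting.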

Statistics of sqlite3 from the benchmark

Total: 95321

Cache.size statistics:
  Max: 613
  Min: 0
  Median: 3
  Average: 29.8503

Cache.size - NumSortedEntries difference distribution (top 20 by frequency), excluding NumSortedEntries=0 and diff=0:
  1: 6133
  2: 1650
  3: 1147
  5: 1088
  4: 626
  7: 620
  6: 490
  8: 388
  10: 372
  9: 231
  11: 204
  12: 140
  13: 100
  19: 88
  15: 84
  14: 82
  16: 72
  18: 69
  17: 67
  22: 65

Number of cases where 2 < Cache.size - NumSortedEntries < log2(Cache.size): 1784

Difference distribution in special cases (top 20), excluding NumSortedEntries=0 and diff=0:
  3: 703
  5: 500
  4: 300
  6: 146
  7: 115
  8: 20

Cache.size statistics for top 10 difference values in special cases:
  Difference = 3:
    Count: 703
    Max Cache.size: 571
    Min Cache.size: 9
    Median Cache.size: 34
    Average Cache.size: 86.0484
  Difference = 5:
    Count: 500
    Max Cache.size: 589
    Min Cache.size: 33
    Median Cache.size: 96.5
    Average Cache.size: 156.2580
  Difference = 4:
    Count: 300
    Max Cache.size: 600
    Min Cache.size: 17
    Median Cache.size: 64.0
    Average Cache.size: 109.3567
  Difference = 6:
    Count: 146
    Max Cache.size: 606
    Min Cache.size: 66
    Median Cache.size: 197.5
    Average Cache.size: 220.3151
  Difference = 7:
    Count: 115
    Max Cache.size: 613
    Min Cache.size: 130
    Median Cache.size: 250
    Average Cache.size: 293.1391
  Difference = 8:
    Count: 20
    Max Cache.size: 440
    Min Cache.size: 257
    Median Cache.size: 326.0
    Average Cache.size: 347.1500

Statistics of a big file

Total: 15971872

Cache.size statistics:
  Max: 57548
  Min: 0
  Median: 165.0
  Average: 3376.9060

Cache.size - NumSortedEntries difference distribution (top 20 by frequency), excluding NumSortedEntries=0 and diff=0:
  2: 2339040
  1: 700840
  4: 502824
  6: 302040
  5: 289494
  7: 168130
  3: 159230
  8: 46549
  9: 24162
  10: 22767
  11: 13970
  12: 4347
  13: 2280
  161: 1112
  160: 1098
  157: 1092
  162: 1083
  158: 1015
  155: 989
  154: 954

Number of cases where 2 < Cache.size - NumSortedEntries < log2(Cache.size): 1526693

Difference distribution in special cases (top 20), excluding NumSortedEntries=0 and diff=0:
  4: 502647
  6: 301711
  5: 288484
  7: 167896
  3: 158712
  8: 45461
  9: 23584
  10: 21185
  11: 12859
  12: 2953
  13: 1074
  14: 127

Cache.size statistics for top 10 difference values in special cases:
  Difference = 4:
    Count: 502647
    Max Cache.size: 57432
    Min Cache.size: 17
    Median Cache.size: 11537
    Average Cache.size: 11473.2582
  Difference = 6:
    Count: 301711
    Max Cache.size: 56986
    Min Cache.size: 65
    Median Cache.size: 11535
    Average Cache.size: 11426.9020
  Difference = 5:
    Count: 288484
    Max Cache.size: 57400
    Min Cache.size: 50
    Median Cache.size: 10694.0
    Average Cache.size: 10912.9279
  Difference = 7:
    Count: 167896
    Max Cache.size: 57382
    Min Cache.size: 129
    Median Cache.size: 10765.5
    Average Cache.size: 10981.8878
  Difference = 3:
    Count: 158712
    Max Cache.size: 57219
    Min Cache.size: 9
    Median Cache.size: 11095.0
    Average Cache.size: 11065.3424
  Difference = 8:
    Count: 45461
    Max Cache.size: 57326
    Min Cache.size: 257
    Median Cache.size: 11178
    Average Cache.size: 11267.4616
  Difference = 9:
    Count: 23584
    Max Cache.size: 57272
    Min Cache.size: 513
    Median Cache.size: 11668.0
    Average Cache.size: 11672.4953
  Difference = 10:
    Count: 21185
    Max Cache.size: 57442
    Min Cache.size: 1025
    Median Cache.size: 13721
    Average Cache.size: 14924.2003
  Difference = 11:
    Count: 12859
    Max Cache.size: 48734
    Min Cache.size: 2051
    Median Cache.size: 13343
    Average Cache.size: 13009.9933
  Difference = 12:
    Count: 2953
    Max Cache.size: 39377
    Min Cache.size: 4118
    Median Cache.size: 14387
    Average Cache.size: 14023.0687

// If the number of unsorted entires is small and the cache size is big, use
// insertion sort is faster. Here use Log2_32 to quickly choose the sort
// method.
if (s < Log2_32(Cache.size())) {
Contributor Author


The choice of log2 here is based on empirical experience. The main goal is a cheap way to determine whether the number of unsorted entries is significantly smaller than the cache size. To tune this condition, I experimented with the following four options; based on the timing results, the plain Log2_32 form (option 4) proved to be the fastest. The benchmark results are as follows (a worked comparison of the thresholds follows the list):

  1. s < NumSortedEntries: https://llvm-compile-time-tracker.com/compare.php?from=26f3f24a4f0a67eb23d255aba7a73a12bee1db11&to=e174118c88ee3d9d31fb3ed4e29b9ae2fcac46fa&stat=instructions%3Au
  2. s < Log2_32(Cache.size()) * llvm::numbers::ln2 / llvm::numbers::ln10: https://llvm-compile-time-tracker.com/compare.php?from=26f3f24a4f0a67eb23d255aba7a73a12bee1db11&to=9368621b42fa8b68e1e3081110f82ae9a5d57458&stat=instructions%3Au
  3. s < Log2_32(Cache.size()) * llvm::numbers::ln2: https://llvm-compile-time-tracker.com/compare.php?from=26f3f24a4f0a67eb23d255aba7a73a12bee1db11&to=19a8584d14dbb95b4a71a92a43da3a2c5d5e550a&stat=instructions%3Au
  4. s < Log2_32(Cache.size()): https://llvm-compile-time-tracker.com/compare.php?from=26f3f24a4f0a67eb23d255aba7a73a12bee1db11&to=0fa6bc6bdf1c9c5464e81970e973f2c43edac874&stat=instructions%3Au
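For a sense of scale: at Cache.size() ≈ 20,000, Log2_32(Cache.size()) = 14 (since 2^14 = 16384 ≤ 20,000 < 2^15). Option 4's threshold is therefore 14; option 3's is 14 × ln2 ≈ 9.7, which approximates ln(Cache.size()); option 2's is 14 × ln2/ln10 ≈ 4.2, which approximates log10(Cache.size()). The observed unsorted counts of roughly 5 to 8 sit comfortably under option 4's threshold, and option 4's comparison is also the only log-based one that needs no floating-point arithmetic.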

@DingdWang DingdWang marked this pull request as ready for review June 26, 2025 11:26
@DingdWang
Contributor Author

kind ping

@Enna1 Enna1 requested a review from nikic June 26, 2025 11:28
Contributor

@nikic nikic left a comment


LGTM


// Output number of sorted entries and size of cache for each sort.
LLVM_DEBUG(dbgs() << "NumSortedEntries: " << NumSortedEntries
                  << ", Cache.size: " << Cache.size() << "\n");
Contributor


Please drop this debug output, it will be very spammy for anyone not specifically trying to optimize this code.

Contributor Author


done

// One new entry, Just insert the new value at the appropriate position.
if (Cache.size() != 1) {

// If the number of unsorted entires is small and the cache size is big, use
Contributor


Suggested change
// If the number of unsorted entires is small and the cache size is big, use
// If the number of unsorted entires is small and the cache size is big, using

Contributor Author


done

@nikic nikic merged commit 0c6784c into llvm:main Jul 25, 2025
9 checks passed
@github-actions

@DingdWang Congratulations on having your first Pull Request (PR) merged into the LLVM Project!

Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR.

Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. It is not uncommon for your change to be included in a build that fails due to someone else's changes, or infrastructure issues.

How to do this, and the rest of the post-merge process, is covered in detail here.

If your change does cause a problem, it may be reverted, or you can revert it yourself. This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again.

If you don't get any reports, no action is required from you. Your changes are working as expected, well done!

mahesh-attarde pushed a commit to mahesh-attarde/llvm-project that referenced this pull request Jul 28, 2025
… caches with few unsorted entries (llvm#143107)

During compilation of large files with many branches, I observed that
the function `SortNonLocalDepInfoCache` in `MemoryDependenceAnalysis`
becomes a significant performance bottleneck. This is because
`Cache.size()` can be very large (around 20,000), but only a small
number of entries (approximately 5 to 8) actually need sorting. The
original implementation performs a full sort in all cases, which is
inefficient.

This patch introduces a lightweight heuristic to quickly estimate the
number of unsorted entries and choose a more efficient sorting method
accordingly.

As a result, the GVN pass runtime on a large file is reduced from
approximately 26.3 minutes to 16.5 minutes.
