
Conversation

@DingdWang
Contributor

During compilation of large files with many branches, I observed that the function SortNonLocalDepInfoCache in MemoryDependenceAnalysis becomes a significant performance bottleneck. Cache.size() can be very large (around 20,000), but only a small number of entries (approximately 5 to 8) actually need sorting, and the original implementation falls back to a full sort whenever more than two entries are new, which is inefficient.
This patch introduces a lightweight heuristic: compare the number of unsorted entries against the logarithm of the cache size, insert the new entries individually when they are few, and fall back to a full sort otherwise.
As a result, the GVN pass runtime on a large file is reduced from approximately 26.3 minutes to 16.5 minutes.
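For orientation before reading the diff, here is a minimal standalone sketch of the approach, written against a plain std::vector<int> rather than the real NonLocalDepInfo type. The threshold uses an integer floor-log2 in the spirit of the final revision's Log2_32; floorLog2 and sortMostlySorted are names invented for this illustration, and the insertion loop binary-searches only the sorted prefix, which is one safe way to structure it (the actual patch organizes its loop differently):

#include <algorithm>
#include <cstdint>
#include <vector>

// floor(log2(V)); stands in for llvm::Log2_32 from llvm/Support/MathExtras.h.
static unsigned floorLog2(uint32_t V) { return 31 - __builtin_clz(V | 1); }

// Precondition: the first NumSorted elements of Cache are sorted; the
// remaining tail is unsorted and typically small.
static void sortMostlySorted(std::vector<int> &Cache, unsigned NumSorted) {
  unsigned NumUnsorted = Cache.size() - NumSorted;
  if (NumUnsorted == 0)
    return;
  if (NumUnsorted < floorLog2(Cache.size())) {
    // Few new entries: binary-search each one into the growing sorted
    // prefix instead of re-sorting the whole vector.
    for (size_t Sorted = NumSorted; Sorted < Cache.size(); ++Sorted) {
      auto Pos = std::upper_bound(Cache.begin(), Cache.begin() + Sorted,
                                  Cache[Sorted]);
      // Slide Cache[Sorted] down into place; [Pos, begin()+Sorted) shifts
      // right by one.
      std::rotate(Pos, Cache.begin() + Sorted, Cache.begin() + Sorted + 1);
    }
  } else {
    // Many new entries: a single full sort is cheaper.
    std::sort(Cache.begin(), Cache.end());
  }
}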

@github-actions

github-actions bot commented Jun 6, 2025

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository; in that case, you can instead tag reviewers by name in a comment, using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "pinging" the PR with a comment saying "Ping". The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot llvmbot added the llvm:analysis Includes value tracking, cost tables and constant folding label Jun 6, 2025
@llvmbot
Member

llvmbot commented Jun 6, 2025

@llvm/pr-subscribers-llvm-analysis

Author: None (DingdWang)

Changes

During compilation of large files with many branches, I observed that the function SortNonLocalDepInfoCache in MemoryDependenceAnalysis becomes a significant performance bottleneck. Cache.size() can be very large (around 20,000), but only a small number of entries (approximately 5 to 8) actually need sorting, and the original implementation falls back to a full sort whenever more than two entries are new, which is inefficient.
This patch introduces a lightweight heuristic: compare the number of unsorted entries against the logarithm of the cache size, insert the new entries individually when they are few, and fall back to a full sort otherwise.
As a result, the GVN pass runtime on a large file is reduced from approximately 26.3 minutes to 16.5 minutes.


Full diff: https://github.com/llvm/llvm-project/pull/143107.diff

1 File Affected:

  • (modified) llvm/lib/Analysis/MemoryDependenceAnalysis.cpp (+21-34)
diff --git a/llvm/lib/Analysis/MemoryDependenceAnalysis.cpp b/llvm/lib/Analysis/MemoryDependenceAnalysis.cpp
index f062189bac6a0..bd0d6bb18241a 100644
--- a/llvm/lib/Analysis/MemoryDependenceAnalysis.cpp
+++ b/llvm/lib/Analysis/MemoryDependenceAnalysis.cpp
@@ -49,6 +49,7 @@
 #include "llvm/Support/Debug.h"
 #include <algorithm>
 #include <cassert>
+#include <cmath>
 #include <iterator>
 #include <utility>
 
@@ -83,6 +84,9 @@ static cl::opt<unsigned>
 // Limit on the number of memdep results to process.
 static const unsigned int NumResultsLimit = 100;
 
+// for quickly calculating log
+const float ln2 = 0.69314718f;
+
 /// This is a helper function that removes Val from 'Inst's set in ReverseMap.
 ///
 /// If the set becomes empty, remove Inst's entry.
@@ -369,8 +373,7 @@ MemDepResult MemoryDependenceResults::getSimplePointerDependencyFrom(
     BasicBlock *BB, Instruction *QueryInst, unsigned *Limit,
     BatchAAResults &BatchAA) {
   bool isInvariantLoad = false;
-  Align MemLocAlign =
-      MemLoc.Ptr->getPointerAlignment(BB->getDataLayout());
+  Align MemLocAlign = MemLoc.Ptr->getPointerAlignment(BB->getDataLayout());
 
   unsigned DefaultLimit = getDefaultBlockScanLimit();
   if (!Limit)
@@ -418,7 +421,7 @@ MemDepResult MemoryDependenceResults::getSimplePointerDependencyFrom(
   // True for volatile instruction.
   // For Load/Store return true if atomic ordering is stronger than AO,
   // for other instruction just true if it can read or write to memory.
-  auto isComplexForReordering = [](Instruction * I, AtomicOrdering AO)->bool {
+  auto isComplexForReordering = [](Instruction *I, AtomicOrdering AO) -> bool {
     if (I->isVolatile())
       return true;
     if (auto *LI = dyn_cast<LoadInst>(I))
@@ -461,7 +464,7 @@ MemDepResult MemoryDependenceResults::getSimplePointerDependencyFrom(
       case Intrinsic::masked_load:
       case Intrinsic::masked_store: {
         MemoryLocation Loc;
-        /*ModRefInfo MR =*/ GetLocation(II, Loc, TLI);
+        /*ModRefInfo MR =*/GetLocation(II, Loc, TLI);
         AliasResult R = BatchAA.alias(Loc, MemLoc);
         if (R == AliasResult::NoAlias)
           continue;
@@ -890,7 +893,7 @@ void MemoryDependenceResults::getNonLocalPointerDependency(
   // translation.
   SmallDenseMap<BasicBlock *, Value *, 16> Visited;
   if (getNonLocalPointerDepFromBB(QueryInst, Address, Loc, isLoad, FromBB,
-                                   Result, Visited, true))
+                                  Result, Visited, true))
     return;
   Result.clear();
   Result.push_back(NonLocalDepResult(FromBB, MemDepResult::getUnknown(),
@@ -991,33 +994,19 @@ MemDepResult MemoryDependenceResults::getNonLocalInfoForBlock(
 static void
 SortNonLocalDepInfoCache(MemoryDependenceResults::NonLocalDepInfo &Cache,
                          unsigned NumSortedEntries) {
-  switch (Cache.size() - NumSortedEntries) {
-  case 0:
-    // done, no new entries.
-    break;
-  case 2: {
-    // Two new entries, insert the last one into place.
-    NonLocalDepEntry Val = Cache.back();
-    Cache.pop_back();
-    MemoryDependenceResults::NonLocalDepInfo::iterator Entry =
-        std::upper_bound(Cache.begin(), Cache.end() - 1, Val);
-    Cache.insert(Entry, Val);
-    [[fallthrough]];
-  }
-  case 1:
-    // One new entry, Just insert the new value at the appropriate position.
-    if (Cache.size() != 1) {
+
+  auto s = Cache.size() - NumSortedEntries;
+  if (s < log2(Cache.size()) * ln2) {
+    while (s > 0) {
       NonLocalDepEntry Val = Cache.back();
       Cache.pop_back();
       MemoryDependenceResults::NonLocalDepInfo::iterator Entry =
-          llvm::upper_bound(Cache, Val);
+          std::upper_bound(Cache.begin(), Cache.end() - 1, Val);
       Cache.insert(Entry, Val);
+      s--;
     }
-    break;
-  default:
-    // Added many values, do a full scale sort.
+  } else {
     llvm::sort(Cache);
-    break;
   }
 }
 
@@ -1343,8 +1332,8 @@ bool MemoryDependenceResults::getNonLocalPointerDepFromBB(
       // assume it is unknown, but this also does not block PRE of the load.
       if (!CanTranslate ||
           !getNonLocalPointerDepFromBB(QueryInst, PredPointer,
-                                      Loc.getWithNewPtr(PredPtrVal), isLoad,
-                                      Pred, Result, Visited)) {
+                                       Loc.getWithNewPtr(PredPtrVal), isLoad,
+                                       Pred, Result, Visited)) {
         // Add the entry to the Result list.
         NonLocalDepResult Entry(Pred, MemDepResult::getUnknown(), PredPtrVal);
         Result.push_back(Entry);
@@ -1412,7 +1401,6 @@ bool MemoryDependenceResults::getNonLocalPointerDepFromBB(
 
         I.setResult(MemDepResult::getUnknown());
 
-
         break;
       }
     }
@@ -1733,9 +1721,7 @@ MemoryDependenceWrapperPass::MemoryDependenceWrapperPass() : FunctionPass(ID) {}
 
 MemoryDependenceWrapperPass::~MemoryDependenceWrapperPass() = default;
 
-void MemoryDependenceWrapperPass::releaseMemory() {
-  MemDep.reset();
-}
+void MemoryDependenceWrapperPass::releaseMemory() { MemDep.reset(); }
 
 void MemoryDependenceWrapperPass::getAnalysisUsage(AnalysisUsage &AU) const {
   AU.setPreservesAll();
@@ -1745,8 +1731,9 @@ void MemoryDependenceWrapperPass::getAnalysisUsage(AnalysisUsage &AU) const {
   AU.addRequiredTransitive<TargetLibraryInfoWrapperPass>();
 }
 
-bool MemoryDependenceResults::invalidate(Function &F, const PreservedAnalyses &PA,
-                               FunctionAnalysisManager::Invalidator &Inv) {
+bool MemoryDependenceResults::invalidate(
+    Function &F, const PreservedAnalyses &PA,
+    FunctionAnalysisManager::Invalidator &Inv) {
   // Check whether our analysis is preserved.
   auto PAC = PA.getChecker<MemoryDependenceAnalysis>();
   if (!PAC.preserved() && !PAC.preservedSet<AllAnalysesOn<Function>>())

Contributor

@nikic nikic left a comment


Please do not reformat parts of the file you do not modify.

Why does this cache use a sorted vector at all? Would it be possible to convert it to a DenseMap instead?

@github-actions

github-actions bot commented Jun 11, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@DingdWang
Contributor Author

DingdWang commented Jun 13, 2025

Please do not reformat parts of the file you do not modify.

Why does this cache use a sorted vector at all? Would it be possible to convert it to a DenseMap instead?

  1. Sorry, I have reverted the reformatting of the unmodified parts.
  2. Within MemDep, I found that the sorted vector is primarily used to look up records for a specific basic block in the cache. A DenseMap could serve this purpose as well. However, I’m not entirely sure if switching to a DenseMap would affect the users of MemDep, since the order of the cache currently influences the order of the returned results.
    Even if the order does matter, we could always perform a sorting step right before returning the results to preserve the expected order. In addition, changing to a DenseMap would alter the interface, and users would need to adjust their usage accordingly.
    Overall, I think switching to a DenseMap is worth considering. I plan to experiment with this approach to see if there are any unforeseen impacts; if it proves feasible, I will submit a separate patch for this change in the future. A rough sketch of the direction follows below.
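To make the DenseMap direction concrete, here is a rough sketch (purely hypothetical, not part of this patch). NonLocalDepMap and orderedEntries are invented names, and it assumes callers that depend on ordering can tolerate a sort-on-demand step:

#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/Analysis/MemoryDependenceAnalysis.h"

using namespace llvm;

// Hypothetical replacement for the sorted NonLocalDepInfo vector:
// O(1) lookup by basic block, no incremental re-sorting on insertion.
using NonLocalDepMap = DenseMap<BasicBlock *, MemDepResult>;

// Recover the ordered view only when a caller actually needs it.
static SmallVector<NonLocalDepEntry, 16>
orderedEntries(const NonLocalDepMap &Map) {
  SmallVector<NonLocalDepEntry, 16> Out;
  Out.reserve(Map.size());
  for (const auto &KV : Map)
    Out.emplace_back(KV.first, KV.second);
  llvm::sort(Out); // NonLocalDepEntry orders by its BasicBlock pointer.
  return Out;
}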

@DingdWang DingdWang marked this pull request as draft June 26, 2025 06:17
@DingdWang DingdWang closed this Jun 26, 2025
@DingdWang DingdWang reopened this Jun 26, 2025
@DingdWang
Contributor Author

Compile time result: https://llvm-compile-time-tracker.com/compare.php?from=26f3f24a4f0a67eb23d255aba7a73a12bee1db11&to=4e570371a96e25c8b7f9d25266afe864dcfdfa20&stat=instructions%3Au

The optimization effect does not appear significant on the benchmark. I checked the related cache size and NumSortedEntries outputs, and found that on the benchmark, the overall cache size is relatively small—around a few hundred entries—and the function is called infrequently. As a result, the number of times this patch can actually hit and optimize is limited, so the performance improvement is not obvious.
In contrast, for large files, this function can be called nearly 16 million times, with about 1.5 million hits benefiting from the optimization, which leads to a clear performance gain. This data suggests that the original code’s approach of only considering cases 1 and 2 for insertion sort is insufficient; in reality, cases 3, 4, and 5 occur frequently as well.
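As a rough back-of-envelope illustration (my arithmetic, not numbers from the profile): with Cache.size() around 20,000, log2(20,000) is about 14.3, so a full llvm::sort performs on the order of 20,000 × 14.3 ≈ 286,000 comparisons, while inserting 5 new entries by binary search needs only about 5 × 14.3 ≈ 72 comparisons. Each insertion still shifts up to 20,000 trailing entries, but that is a contiguous move of small trivially-copyable entries, which is far cheaper per element than comparison-driven sorting.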

Statistics of sqlite3 from the benchmark

Total: 95321

Cache.size statistics:
  Max: 613
  Min: 0
  Median: 3
  Average: 29.8503

Cache.size - NumSortedEntries difference distribution (top 20 by frequency), excluding NumSortedEntries=0 and diff=0:
  1: 6133
  2: 1650
  3: 1147
  5: 1088
  4: 626
  7: 620
  6: 490
  8: 388
  10: 372
  9: 231
  11: 204
  12: 140
  13: 100
  19: 88
  15: 84
  14: 82
  16: 72
  18: 69
  17: 67
  22: 65

Number of cases where 2 < Cache.size - NumSortedEntries < log2(Cache.size): 1784

Difference distribution in special cases (top 20), excluding NumSortedEntries=0 and diff=0:
  3: 703
  5: 500
  4: 300
  6: 146
  7: 115
  8: 20

Cache.size statistics for top 10 difference values in special cases:
  Difference = 3:
    Count: 703
    Max Cache.size: 571
    Min Cache.size: 9
    Median Cache.size: 34
    Average Cache.size: 86.0484
  Difference = 5:
    Count: 500
    Max Cache.size: 589
    Min Cache.size: 33
    Median Cache.size: 96.5
    Average Cache.size: 156.2580
  Difference = 4:
    Count: 300
    Max Cache.size: 600
    Min Cache.size: 17
    Median Cache.size: 64.0
    Average Cache.size: 109.3567
  Difference = 6:
    Count: 146
    Max Cache.size: 606
    Min Cache.size: 66
    Median Cache.size: 197.5
    Average Cache.size: 220.3151
  Difference = 7:
    Count: 115
    Max Cache.size: 613
    Min Cache.size: 130
    Median Cache.size: 250
    Average Cache.size: 293.1391
  Difference = 8:
    Count: 20
    Max Cache.size: 440
    Min Cache.size: 257
    Median Cache.size: 326.0
    Average Cache.size: 347.1500

Statistics of a big file

Total: 15971872

Cache.size statistics:
  Max: 57548
  Min: 0
  Median: 165.0
  Average: 3376.9060

Cache.size - NumSortedEntries difference distribution (top 20 by frequency), excluding NumSortedEntries=0 and diff=0:
  2: 2339040
  1: 700840
  4: 502824
  6: 302040
  5: 289494
  7: 168130
  3: 159230
  8: 46549
  9: 24162
  10: 22767
  11: 13970
  12: 4347
  13: 2280
  161: 1112
  160: 1098
  157: 1092
  162: 1083
  158: 1015
  155: 989
  154: 954

Number of cases where 2 < Cache.size - NumSortedEntries < log2(Cache.size): 1526693

Difference distribution in special cases (top 20), excluding NumSortedEntries=0 and diff=0:
  4: 502647
  6: 301711
  5: 288484
  7: 167896
  3: 158712
  8: 45461
  9: 23584
  10: 21185
  11: 12859
  12: 2953
  13: 1074
  14: 127

Cache.size statistics for top 10 difference values in special cases:
  Difference = 4:
    Count: 502647
    Max Cache.size: 57432
    Min Cache.size: 17
    Median Cache.size: 11537
    Average Cache.size: 11473.2582
  Difference = 6:
    Count: 301711
    Max Cache.size: 56986
    Min Cache.size: 65
    Median Cache.size: 11535
    Average Cache.size: 11426.9020
  Difference = 5:
    Count: 288484
    Max Cache.size: 57400
    Min Cache.size: 50
    Median Cache.size: 10694.0
    Average Cache.size: 10912.9279
  Difference = 7:
    Count: 167896
    Max Cache.size: 57382
    Min Cache.size: 129
    Median Cache.size: 10765.5
    Average Cache.size: 10981.8878
  Difference = 3:
    Count: 158712
    Max Cache.size: 57219
    Min Cache.size: 9
    Median Cache.size: 11095.0
    Average Cache.size: 11065.3424
  Difference = 8:
    Count: 45461
    Max Cache.size: 57326
    Min Cache.size: 257
    Median Cache.size: 11178
    Average Cache.size: 11267.4616
  Difference = 9:
    Count: 23584
    Max Cache.size: 57272
    Min Cache.size: 513
    Median Cache.size: 11668.0
    Average Cache.size: 11672.4953
  Difference = 10:
    Count: 21185
    Max Cache.size: 57442
    Min Cache.size: 1025
    Median Cache.size: 13721
    Average Cache.size: 14924.2003
  Difference = 11:
    Count: 12859
    Max Cache.size: 48734
    Min Cache.size: 2051
    Median Cache.size: 13343
    Average Cache.size: 13009.9933
  Difference = 12:
    Count: 2953
    Max Cache.size: 39377
    Min Cache.size: 4118
    Median Cache.size: 14387
    Average Cache.size: 14023.0687

// If the number of unsorted entires is small and the cache size is big, use
// insertion sort is faster. Here use Log2_32 to quickly choose the sort
// method.
if (s < Log2_32(Cache.size())) {
Contributor Author


The choice of log2 here is based on empirical experience. The main goal is a cheap way to determine whether the number of unsorted entries is significantly smaller than the cache size. To tune this condition, I experimented with the following four options; based on the timing results, the plain Log2_32 form (option 4) proved to be the fastest. The benchmark results are as follows (a worked comparison of the thresholds follows the list):

  1. s < NumSortedEntries: https://llvm-compile-time-tracker.com/compare.php?from=26f3f24a4f0a67eb23d255aba7a73a12bee1db11&to=e174118c88ee3d9d31fb3ed4e29b9ae2fcac46fa&stat=instructions%3Au
  2. s < Log2_32(Cache.size()) * llvm::numbers::ln2 / llvm::numbers::ln10: https://llvm-compile-time-tracker.com/compare.php?from=26f3f24a4f0a67eb23d255aba7a73a12bee1db11&to=9368621b42fa8b68e1e3081110f82ae9a5d57458&stat=instructions%3Au
  3. s < Log2_32(Cache.size()) * llvm::numbers::ln2: https://llvm-compile-time-tracker.com/compare.php?from=26f3f24a4f0a67eb23d255aba7a73a12bee1db11&to=19a8584d14dbb95b4a71a92a43da3a2c5d5e550a&stat=instructions%3Au
  4. s < Log2_32(Cache.size()): https://llvm-compile-time-tracker.com/compare.php?from=26f3f24a4f0a67eb23d255aba7a73a12bee1db11&to=0fa6bc6bdf1c9c5464e81970e973f2c43edac874&stat=instructions%3Au
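For a sense of scale: at Cache.size() ≈ 20,000, Log2_32(Cache.size()) = 14 (since 2^14 = 16384 ≤ 20,000 < 2^15). Option 4's threshold is therefore 14; option 3's is 14 × ln2 ≈ 9.7, which approximates ln(Cache.size()); option 2's is 14 × ln2/ln10 ≈ 4.2, which approximates log10(Cache.size()). The observed unsorted counts of roughly 5 to 8 sit comfortably under option 4's threshold, and option 4's comparison is also the only log-based one that needs no floating-point arithmetic.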

@DingdWang DingdWang marked this pull request as ready for review June 26, 2025 11:26
@DingdWang
Contributor Author

kind ping

@Enna1 Enna1 requested a review from nikic June 26, 2025 11:28
Contributor

@nikic nikic left a comment


LGTM


// Output number of sorted entries and size of cache for each sort.
LLVM_DEBUG(dbgs() << "NumSortedEntries: " << NumSortedEntries
                  << ", Cache.size: " << Cache.size() << "\n");
Contributor


Please drop this debug output, it will be very spammy for anyone not specifically trying to optimize this code.

Contributor Author


done

// One new entry, Just insert the new value at the appropriate position.
if (Cache.size() != 1) {

// If the number of unsorted entires is small and the cache size is big, use
Contributor


Suggested change
// If the number of unsorted entires is small and the cache size is big, use
// If the number of unsorted entires is small and the cache size is big, using

Contributor Author


done

@nikic nikic merged commit 0c6784c into llvm:main Jul 25, 2025
9 checks passed
@github-actions

@DingdWang Congratulations on having your first Pull Request (PR) merged into the LLVM Project!

Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR.

Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. It is not uncommon for your change to be included in a build that fails due to someone else's changes, or infrastructure issues.

How to do this, and the rest of the post-merge process, is covered in detail here.

If your change does cause a problem, it may be reverted, or you can revert it yourself. This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again.

If you don't get any reports, no action is required from you. Your changes are working as expected, well done!

mahesh-attarde pushed a commit to mahesh-attarde/llvm-project that referenced this pull request Jul 28, 2025
… caches with few unsorted entries (llvm#143107)

During compilation of large files with many branches, I observed that
the function `SortNonLocalDepInfoCache` in `MemoryDependenceAnalysis`
becomes a significant performance bottleneck. This is because
`Cache.size()` can be very large (around 20,000), but only a small
number of entries (approximately 5 to 8) actually need sorting. The
original implementation performs a full sort in all cases, which is
inefficient.

This patch introduces a lightweight heuristic to quickly estimate the
number of unsorted entries and choose a more efficient sorting method
accordingly.

As a result, the GVN pass runtime on a large file is reduced from
approximately 26.3 minutes to 16.5 minutes.
