[MemDep] Optimize SortNonLocalDepInfoCache sorting strategy for large caches with few unsorted entries #143107
Conversation
Thank you for submitting a Pull Request (PR) to the LLVM Project! This PR will be automatically labeled and the relevant teams will be notified. If you wish to, you can add reviewers by using the "Reviewers" section on this page. If this is not working for you, it is probably because you do not have write permissions for the repository; in that case you can instead tag reviewers by name in a comment by using `@` followed by their GitHub username. If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment "Ping". The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers. If you have further questions, they may be answered by the LLVM GitHub User Guide. You can also ask questions in a comment on this PR, on the LLVM Discord, or on the forums.
@llvm/pr-subscribers-llvm-analysis

Author: None (DingdWang)

Changes: During compilation of large files with many branches, I observed that the function `SortNonLocalDepInfoCache` in `MemoryDependenceAnalysis` becomes a significant performance bottleneck. This is because `Cache.size()` can be very large (around 20,000), but only a small number of entries (approximately 5 to 8) actually need sorting. The original implementation performs a full sort in all cases, which is inefficient. This patch introduces a lightweight heuristic to quickly estimate the number of unsorted entries and choose a more efficient sorting method accordingly. As a result, the GVN pass runtime on a large file is reduced from approximately 26.3 minutes to 16.5 minutes.

Full diff: https://github.com/llvm/llvm-project/pull/143107.diff

1 file affected:
diff --git a/llvm/lib/Analysis/MemoryDependenceAnalysis.cpp b/llvm/lib/Analysis/MemoryDependenceAnalysis.cpp
index f062189bac6a0..bd0d6bb18241a 100644
--- a/llvm/lib/Analysis/MemoryDependenceAnalysis.cpp
+++ b/llvm/lib/Analysis/MemoryDependenceAnalysis.cpp
@@ -49,6 +49,7 @@
#include "llvm/Support/Debug.h"
#include <algorithm>
#include <cassert>
+#include <cmath>
#include <iterator>
#include <utility>
@@ -83,6 +84,9 @@ static cl::opt<unsigned>
// Limit on the number of memdep results to process.
static const unsigned int NumResultsLimit = 100;
+// for quickly calculating log
+const float ln2 = 0.69314718f;
+
/// This is a helper function that removes Val from 'Inst's set in ReverseMap.
///
/// If the set becomes empty, remove Inst's entry.
@@ -369,8 +373,7 @@ MemDepResult MemoryDependenceResults::getSimplePointerDependencyFrom(
BasicBlock *BB, Instruction *QueryInst, unsigned *Limit,
BatchAAResults &BatchAA) {
bool isInvariantLoad = false;
- Align MemLocAlign =
- MemLoc.Ptr->getPointerAlignment(BB->getDataLayout());
+ Align MemLocAlign = MemLoc.Ptr->getPointerAlignment(BB->getDataLayout());
unsigned DefaultLimit = getDefaultBlockScanLimit();
if (!Limit)
@@ -418,7 +421,7 @@ MemDepResult MemoryDependenceResults::getSimplePointerDependencyFrom(
// True for volatile instruction.
// For Load/Store return true if atomic ordering is stronger than AO,
// for other instruction just true if it can read or write to memory.
- auto isComplexForReordering = [](Instruction * I, AtomicOrdering AO)->bool {
+ auto isComplexForReordering = [](Instruction *I, AtomicOrdering AO) -> bool {
if (I->isVolatile())
return true;
if (auto *LI = dyn_cast<LoadInst>(I))
@@ -461,7 +464,7 @@ MemDepResult MemoryDependenceResults::getSimplePointerDependencyFrom(
case Intrinsic::masked_load:
case Intrinsic::masked_store: {
MemoryLocation Loc;
- /*ModRefInfo MR =*/ GetLocation(II, Loc, TLI);
+ /*ModRefInfo MR =*/GetLocation(II, Loc, TLI);
AliasResult R = BatchAA.alias(Loc, MemLoc);
if (R == AliasResult::NoAlias)
continue;
@@ -890,7 +893,7 @@ void MemoryDependenceResults::getNonLocalPointerDependency(
// translation.
SmallDenseMap<BasicBlock *, Value *, 16> Visited;
if (getNonLocalPointerDepFromBB(QueryInst, Address, Loc, isLoad, FromBB,
- Result, Visited, true))
+ Result, Visited, true))
return;
Result.clear();
Result.push_back(NonLocalDepResult(FromBB, MemDepResult::getUnknown(),
@@ -991,33 +994,19 @@ MemDepResult MemoryDependenceResults::getNonLocalInfoForBlock(
static void
SortNonLocalDepInfoCache(MemoryDependenceResults::NonLocalDepInfo &Cache,
unsigned NumSortedEntries) {
- switch (Cache.size() - NumSortedEntries) {
- case 0:
- // done, no new entries.
- break;
- case 2: {
- // Two new entries, insert the last one into place.
- NonLocalDepEntry Val = Cache.back();
- Cache.pop_back();
- MemoryDependenceResults::NonLocalDepInfo::iterator Entry =
- std::upper_bound(Cache.begin(), Cache.end() - 1, Val);
- Cache.insert(Entry, Val);
- [[fallthrough]];
- }
- case 1:
- // One new entry, Just insert the new value at the appropriate position.
- if (Cache.size() != 1) {
+
+ auto s = Cache.size() - NumSortedEntries;
+ if (s < log2(Cache.size()) * ln2) {
+ while (s > 0) {
NonLocalDepEntry Val = Cache.back();
Cache.pop_back();
MemoryDependenceResults::NonLocalDepInfo::iterator Entry =
- llvm::upper_bound(Cache, Val);
+ std::upper_bound(Cache.begin(), Cache.end() - 1, Val);
Cache.insert(Entry, Val);
+ s--;
}
- break;
- default:
- // Added many values, do a full scale sort.
+ } else {
llvm::sort(Cache);
- break;
}
}
@@ -1343,8 +1332,8 @@ bool MemoryDependenceResults::getNonLocalPointerDepFromBB(
// assume it is unknown, but this also does not block PRE of the load.
if (!CanTranslate ||
!getNonLocalPointerDepFromBB(QueryInst, PredPointer,
- Loc.getWithNewPtr(PredPtrVal), isLoad,
- Pred, Result, Visited)) {
+ Loc.getWithNewPtr(PredPtrVal), isLoad,
+ Pred, Result, Visited)) {
// Add the entry to the Result list.
NonLocalDepResult Entry(Pred, MemDepResult::getUnknown(), PredPtrVal);
Result.push_back(Entry);
@@ -1412,7 +1401,6 @@ bool MemoryDependenceResults::getNonLocalPointerDepFromBB(
I.setResult(MemDepResult::getUnknown());
-
break;
}
}
@@ -1733,9 +1721,7 @@ MemoryDependenceWrapperPass::MemoryDependenceWrapperPass() : FunctionPass(ID) {}
MemoryDependenceWrapperPass::~MemoryDependenceWrapperPass() = default;
-void MemoryDependenceWrapperPass::releaseMemory() {
- MemDep.reset();
-}
+void MemoryDependenceWrapperPass::releaseMemory() { MemDep.reset(); }
void MemoryDependenceWrapperPass::getAnalysisUsage(AnalysisUsage &AU) const {
AU.setPreservesAll();
@@ -1745,8 +1731,9 @@ void MemoryDependenceWrapperPass::getAnalysisUsage(AnalysisUsage &AU) const {
AU.addRequiredTransitive<TargetLibraryInfoWrapperPass>();
}
-bool MemoryDependenceResults::invalidate(Function &F, const PreservedAnalyses &PA,
- FunctionAnalysisManager::Invalidator &Inv) {
+bool MemoryDependenceResults::invalidate(
+ Function &F, const PreservedAnalyses &PA,
+ FunctionAnalysisManager::Invalidator &Inv) {
// Check whether our analysis is preserved.
auto PAC = PA.getChecker<MemoryDependenceAnalysis>();
if (!PAC.preserved() && !PAC.preservedSet<AllAnalysesOn<Function>>())
nikic left a comment:
Please do not reformat parts of the file you do not modify.
Why does this cache use a sorted vector at all? Would it be possible to convert it to a DenseMap instead?
Crashes during clang bootstrap: https://llvm-compile-time-tracker.com/show_error.php?commit=0273219d4061bb0bcd0a39e7556be882c6792e24
✅ With the latest revision this PR passed the C/C++ code formatter.
Bug fixed.
Compile time result: https://llvm-compile-time-tracker.com/compare.php?from=26f3f24a4f0a67eb23d255aba7a73a12bee1db11&to=4e570371a96e25c8b7f9d25266afe864dcfdfa20&stat=instructions%3Au

The optimization effect does not appear significant on the benchmark. I checked the related cache size and NumSortedEntries outputs and found that, on the benchmark, the overall cache size is relatively small (around a few hundred entries) and the function is called infrequently. As a result, the number of times this patch can actually hit and optimize is limited, so the performance improvement is not obvious.

[attachment: statistics of sqlite3 from the benchmark]
[attachment: statistics of a big file]
// If the number of unsorted entires is small and the cache size is big, use
// insertion sort is faster. Here use Log2_32 to quickly choose the sort
// method.
if (s < Log2_32(Cache.size())) {
The choice of using log2 here is based on empirical experience. The main goal is to have a relatively fast way to determine whether the number of unsorted entries is significantly smaller than the cache size. To tune this condition, I experimented with the following four options, and based on the timing results, using log2 proved to be the fastest (a rough cost estimate supporting this crossover is sketched after the list). The benchmark results are as follows:
- s < NumSortedEntries: https://llvm-compile-time-tracker.com/compare.php?from=26f3f24a4f0a67eb23d255aba7a73a12bee1db11&to=e174118c88ee3d9d31fb3ed4e29b9ae2fcac46fa&stat=instructions%3Au
- s < Log2_32(Cache.size()) * llvm::numbers::ln2 / llvm::numbers::ln10: https://llvm-compile-time-tracker.com/compare.php?from=26f3f24a4f0a67eb23d255aba7a73a12bee1db11&to=9368621b42fa8b68e1e3081110f82ae9a5d57458&stat=instructions%3Au
- s < Log2_32(Cache.size()) * llvm::numbers::ln2: https://llvm-compile-time-tracker.com/compare.php?from=26f3f24a4f0a67eb23d255aba7a73a12bee1db11&to=19a8584d14dbb95b4a71a92a43da3a2c5d5e550a&stat=instructions%3Au
- s < Log2_32(Cache.size()): https://llvm-compile-time-tracker.com/compare.php?from=26f3f24a4f0a67eb23d255aba7a73a12bee1db11&to=0fa6bc6bdf1c9c5464e81970e973f2c43edac874&stat=instructions%3Au
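As a rough cost estimate (my own back-of-the-envelope reasoning, not part of the patch): inserting k out-of-place entries one at a time into a sorted vector of total size n costs about k binary searches plus k linear element shifts, while a full sort costs on the order of n·log2(n) operations. Ignoring constant factors, the incremental path is therefore only expected to win when

$$ k \log_2 n + k\,n \;\ll\; n \log_2 n \quad\Longleftrightarrow\quad k \ll \log_2 n , $$

which is consistent with the s < Log2_32(Cache.size()) threshold measuring fastest; the exact constant is hidden in memmove versus comparator costs, hence the empirical tuning above.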
kind ping |
nikic left a comment:
LGTM
// Output number of sorted entries and size of cache for each sort.
LLVM_DEBUG(dbgs() << "NumSortedEntries: " << NumSortedEntries
                  << ", Cache.size: " << Cache.size() << "\n");
Please drop this debug output, it will be very spammy for anyone not specifically trying to optimize this code.
done
Suggested change:
- // If the number of unsorted entires is small and the cache size is big, use
+ // If the number of unsorted entires is small and the cache size is big, using
done
@DingdWang Congratulations on having your first Pull Request (PR) merged into the LLVM Project! Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR. Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. It is not uncommon for your change to be included in a build that fails due to someone else's changes, or infrastructure issues. How to do this, and the rest of the post-merge process, is covered in detail here. If your change does cause a problem, it may be reverted, or you can revert it yourself. This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again. If you don't get any reports, no action is required from you. Your changes are working as expected, well done!
[MemDep] Optimize SortNonLocalDepInfoCache sorting strategy for large caches with few unsorted entries (llvm#143107)

During compilation of large files with many branches, I observed that the function `SortNonLocalDepInfoCache` in `MemoryDependenceAnalysis` becomes a significant performance bottleneck. This is because `Cache.size()` can be very large (around 20,000), but only a small number of entries (approximately 5 to 8) actually need sorting. The original implementation performs a full sort in all cases, which is inefficient. This patch introduces a lightweight heuristic to quickly estimate the number of unsorted entries and choose a more efficient sorting method accordingly. As a result, the GVN pass runtime on a large file is reduced from approximately 26.3 minutes to 16.5 minutes.
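For illustration, here is a minimal standalone sketch of the merged sorting strategy, written over a plain std::vector<int> instead of MemoryDependenceResults::NonLocalDepInfo. The function name, the hand-rolled log2 (standing in for llvm::Log2_32), and the exact tail-insertion loop are my own simplification and may differ in detail from the committed code:

#include <algorithm>
#include <cstddef>
#include <vector>

// Sort 'Cache', whose first 'NumSortedEntries' elements are already sorted
// and whose remaining entries were appended later in arbitrary order.
void sortMostlySortedCache(std::vector<int> &Cache,
                           std::size_t NumSortedEntries) {
  std::size_t Unsorted = Cache.size() - NumSortedEntries;

  // floor(log2(Cache.size())); plays the role of llvm::Log2_32 here.
  std::size_t Log2Size = 0;
  for (std::size_t N = Cache.size(); N > 1; N >>= 1)
    ++Log2Size;

  if (Unsorted < Log2Size) {
    // Few new entries relative to the cache size: binary-search each one
    // into the sorted prefix.
    while (Unsorted > 0) {
      int Val = Cache.back();
      Cache.pop_back();
      --Unsorted;
      // Only [begin, end - Unsorted) is sorted; the still-unprocessed new
      // entries remain at the very end of the vector.
      auto SortedEnd = Cache.end() - static_cast<std::ptrdiff_t>(Unsorted);
      Cache.insert(std::upper_bound(Cache.begin(), SortedEnd, Val), Val);
    }
  } else {
    // Many new entries: a full sort is cheaper overall.
    std::sort(Cache.begin(), Cache.end());
  }
}

For example, with 14 already-sorted entries and 2 appended ones (size 16, floor(log2) = 4, and 2 < 4), the insertion path runs; with 16 sorted and 16 appended entries the full std::sort path is taken instead.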