[memprof] Speed up caller-callee pair extraction (Part 2) #116441
Conversation
This patch further speeds up the extraction of caller-callee pairs from the profile.

Recall that we reconstruct a call stack by traversing the radix tree from one of its leaf nodes toward a root. The implication is that when we decode many different call stacks, we end up visiting nodes near the root(s) repeatedly. That in turn adds many duplicates to our data structure:

  DenseMap<uint64_t, SmallVector<CallEdgeTy, 0>> Calls;

only to be deduplicated later with sort+unique for each vector.

This patch makes the extraction process more efficient by keeping track of the indices of the radix tree array we've visited so far and terminating the traversal as soon as we encounter an element we've already visited.

Note that even with this improvement, we still add at least one caller-callee pair to the data structure above for each call stack, because we do need to add a caller-callee pair for the leaf node, with the callee GUID being 0.

Without this patch, it takes 4 seconds to extract caller-callee pairs from a large MemProf profile. This patch shortens that to 900ms.
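To make the technique concrete, here is a minimal, self-contained sketch. The Node struct, extractPairs, and the parent-index layout are hypothetical simplifications, not the actual MemProf radix-tree encoding, and std::vector<bool> stands in for the llvm::BitVector used by the patch.

  // Minimal sketch (hypothetical types; not the real MemProf radix-tree layout).
  // Each call stack is decoded by walking parent links from a leaf toward the
  // root; a visited bitmap lets a traversal stop as soon as it reaches a node
  // that an earlier traversal has already recorded.
  #include <cstdint>
  #include <map>
  #include <vector>

  struct Node {
    uint64_t CallerGUID; // function containing the call site
    uint64_t CalleeGUID; // function being called (0 at a leaf node)
    int Parent;          // index of the next node toward the root, -1 at a root
  };

  std::map<uint64_t, std::vector<uint64_t>>
  extractPairs(const std::vector<Node> &Tree, const std::vector<int> &Leaves) {
    std::map<uint64_t, std::vector<uint64_t>> Calls;
    // The patch uses an llvm::BitVector sized to the radix tree; a
    // std::vector<bool> plays the same role here.
    std::vector<bool> Visited(Tree.size(), false);
    for (int Leaf : Leaves) {
      for (int I = Leaf; I != -1; I = Tree[I].Parent) {
        Calls[Tree[I].CallerGUID].push_back(Tree[I].CalleeGUID);
        // Nodes near the root are shared by many call stacks.  Once a node
        // has been visited, every node above it has already been recorded,
        // so the traversal can stop instead of re-adding the same pairs.
        if (Visited[I])
          break;
        Visited[I] = true;
      }
    }
    return Calls;
  }

With this cutoff, the total number of insertions is bounded by the number of distinct radix-tree elements plus one boundary duplicate per call stack, rather than growing with the combined length of all call stacks, which is what the reported speedup reflects.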
lgtm
@llvm/pr-subscribers-pgo

Author: Kazu Hirata (kazutakahirata)

Changes: (see the description above)

Full diff: https://github.com/llvm/llvm-project/pull/116441.diff

2 Files Affected:
 diff --git a/llvm/include/llvm/ProfileData/MemProf.h b/llvm/include/llvm/ProfileData/MemProf.h
index ae262060718a7c..41dd41169320a5 100644
--- a/llvm/include/llvm/ProfileData/MemProf.h
+++ b/llvm/include/llvm/ProfileData/MemProf.h
@@ -1,6 +1,7 @@
 #ifndef LLVM_PROFILEDATA_MEMPROF_H_
 #define LLVM_PROFILEDATA_MEMPROF_H_
 
+#include "llvm/ADT/BitVector.h"
 #include "llvm/ADT/MapVector.h"
 #include "llvm/ADT/STLForwardCompat.h"
 #include "llvm/ADT/STLFunctionalExtras.h"
@@ -971,11 +972,16 @@ struct CallerCalleePairExtractor {
   // A map from caller GUIDs to lists of call sites in respective callers.
   DenseMap<uint64_t, SmallVector<CallEdgeTy, 0>> CallerCalleePairs;
 
+  // The set of linear call stack IDs that we've visited.
+  BitVector Visited;
+
   CallerCalleePairExtractor() = delete;
   CallerCalleePairExtractor(
       const unsigned char *CallStackBase,
-      llvm::function_ref<Frame(LinearFrameId)> FrameIdToFrame)
-      : CallStackBase(CallStackBase), FrameIdToFrame(FrameIdToFrame) {}
+      llvm::function_ref<Frame(LinearFrameId)> FrameIdToFrame,
+      unsigned RadixTreeSize)
+      : CallStackBase(CallStackBase), FrameIdToFrame(FrameIdToFrame),
+        Visited(RadixTreeSize) {}
 
   void operator()(LinearCallStackId LinearCSId) {
     const unsigned char *Ptr =
@@ -1004,6 +1010,15 @@ struct CallerCalleePairExtractor {
       LineLocation Loc(F.LineOffset, F.Column);
       CallerCalleePairs[CallerGUID].emplace_back(Loc, CalleeGUID);
 
+      // Keep track of the indices we've visited.  If we've already visited the
+      // current one, terminate the traversal.  We will not discover any new
+      // caller-callee pair by continuing the traversal.
+      unsigned Offset =
+          std::distance(CallStackBase, Ptr) / sizeof(LinearFrameId);
+      if (Visited.test(Offset))
+        break;
+      Visited.set(Offset);
+
       Ptr += sizeof(LinearFrameId);
       CalleeGUID = CallerGUID;
     }
diff --git a/llvm/lib/ProfileData/InstrProfReader.cpp b/llvm/lib/ProfileData/InstrProfReader.cpp
index 5a2a3352c4b07d..1d6ffaff230074 100644
--- a/llvm/lib/ProfileData/InstrProfReader.cpp
+++ b/llvm/lib/ProfileData/InstrProfReader.cpp
@@ -1678,7 +1678,8 @@ IndexedMemProfReader::getMemProfCallerCalleePairs() const {
   assert(Version == memprof::Version3);
 
   memprof::LinearFrameIdConverter FrameIdConv(FrameBase);
-  memprof::CallerCalleePairExtractor Extractor(CallStackBase, FrameIdConv);
+  memprof::CallerCalleePairExtractor Extractor(CallStackBase, FrameIdConv,
+                                               RadixTreeSize);
 
   // The set of linear call stack IDs that we need to traverse from.  We expect
   // the set to be dense, so we use a BitVector.
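
For context, here is a hedged sketch of the caller side. The function name and the way the leaf call-stack IDs arrive are assumptions for illustration (the real getMemProfCallerCalleePairs gathers them from the profile records), but the constructor call with RadixTreeSize and the dense BitVector worklist mirror the diff above.

  #include "llvm/ADT/ArrayRef.h"
  #include "llvm/ADT/BitVector.h"
  #include "llvm/ProfileData/MemProf.h"

  using namespace llvm;

  // Hedged sketch, not the verbatim body of getMemProfCallerCalleePairs.
  // LeafCSIds stands in for however the real caller collects leaf call stack IDs.
  void extractAllPairsExample(
      const unsigned char *CallStackBase,
      function_ref<memprof::Frame(memprof::LinearFrameId)> FrameIdToFrame,
      unsigned RadixTreeSize,
      ArrayRef<memprof::LinearCallStackId> LeafCSIds) {
    // Size the Visited bitmap to the whole radix tree, as the new constructor
    // parameter requires.
    memprof::CallerCalleePairExtractor Extractor(CallStackBase, FrameIdToFrame,
                                                 RadixTreeSize);

    // The set of leaf IDs is expected to be dense, so a BitVector both
    // deduplicates and enumerates it cheaply.
    BitVector Worklist(RadixTreeSize);
    for (memprof::LinearCallStackId CSId : LeafCSIds)
      Worklist.set(CSId);

    // Each traversal stops as soon as it reaches a previously visited element.
    for (unsigned CSId : Worklist.set_bits())
      Extractor(CSId);

    // Extractor.CallerCalleePairs now holds the caller-callee pairs, still to
    // be sorted and uniqued per caller.
  }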