[BOLT][NFC] Use brstack in guides and user outputs #163950
✅ With the latest revision this PR passed the C/C++ code formatter.
439c102 to 1ac2401
@llvm/pr-subscribers-bolt
Author: Paschalis Mpeis (paschalis-mpeis)
Changes
Update guides to use BRSTACK, with a mention to BRBE for AArch64. Use BRSTACK in user-facing outputs.
Full diff: https://github.com/llvm/llvm-project/pull/163950.diff
12 Files Affected:
diff --git a/bolt/README.md b/bolt/README.md
index fe54bd82a356a..962a450d7885c 100644
--- a/bolt/README.md
+++ b/bolt/README.md
@@ -108,9 +108,10 @@ $ perf record -e cycles:u -j any,u -o perf.data -- <executable> <args> ...
#### For Services
Once you get the service deployed and warmed-up, it is time to collect perf
-data with LBR (branch information). The exact perf command to use will depend
-on the service. E.g., to collect the data for all processes running on the
-server for the next 3 minutes use:
+data with BRSTACK (branch information). Different architectures implement this
+using different hardware units, for example LBR on X86, and BRBE on AArch64.
+The exact perf command to use will depend on the service. E.g., to collect the
+data for all processes running on the server for the next 3 minutes use:
```
$ perf record -e cycles:u -j any,u -a -o perf.data -- sleep 180
```
@@ -163,7 +164,7 @@ $ perf2bolt -p perf.data -o perf.fdata <executable>
This command will aggregate branch data from `perf.data` and store it in a
format that is both more compact and more resilient to binary modifications.
-If the profile was collected without LBRs, you will need to add `-nl` flag to
+If the profile was collected without BRSTACKs, you will need to add `-nl` flag to
the command line above.
### Step 3: Optimize with BOLT
diff --git a/bolt/docs/Heatmaps.md b/bolt/docs/Heatmaps.md
index 6cf9c4da533b1..221b848811f02 100644
--- a/bolt/docs/Heatmaps.md
+++ b/bolt/docs/Heatmaps.md
@@ -1,7 +1,7 @@
# Code Heatmaps
BOLT has gained the ability to print code heatmaps based on
-sampling-based profiles generated by `perf`, either with `LBR` data or not.
+sampling-based profiles generated by `perf`, either with `BRSTACK` data or not.
The output is produced in colored ASCII to be displayed in a color-capable
terminal. It looks something like this:
@@ -20,9 +20,9 @@ or if you want to monitor the existing process(es):
$ perf record -e cycles:u -j any,u [-p PID|-a] -- sleep <interval>
```
-Running with LBR (`-j any,u` or `-b`) is recommended. Heatmaps can be generated
-from basic events by using the llvm-bolt-heatmap option `-nl` (no LBR) but
-such heatmaps do not have the coverage provided by LBR and may only be useful
+Running with BRSTACK (`-j any,u` or `-b`) is recommended. Heatmaps can be generated
+from basic events by using the llvm-bolt-heatmap option `-nl` (no BRSTACK) but
+such heatmaps do not have the coverage provided by BRSTACK and may only be useful
for finding event hotspots at larger code block granularities.
Once the run is complete, and `perf.data` is generated, run llvm-bolt-heatmap:
diff --git a/bolt/docs/OptimizingClang.md b/bolt/docs/OptimizingClang.md
index 685fcc2b738fa..02227d9642266 100644
--- a/bolt/docs/OptimizingClang.md
+++ b/bolt/docs/OptimizingClang.md
@@ -97,7 +97,7 @@ BOLT-INFO: basic block reordering modified layout of 7848 (10.32%) functions
790053908 : all conditional branches (=)
...
```
-The statistics in the output is based on the LBR profile collected with `perf`, and since we were using
+The statistics in the output is based on the BRSTACK profile (LBR) collected with `perf`, and since we were using
the `cycles` counter, its accuracy is affected. However, the relative improvement in `taken conditional
branches` is a good indication that BOLT was able to straighten out the code even after PGO.
diff --git a/bolt/docs/OptimizingLinux.md b/bolt/docs/OptimizingLinux.md
index c85fecabcccc2..65aa8bf418c80 100644
--- a/bolt/docs/OptimizingLinux.md
+++ b/bolt/docs/OptimizingLinux.md
@@ -5,7 +5,7 @@
Many Linux applications spend a significant amount of their execution time in the kernel. Thus, when we consider code optimization for system performance, it is essential to improve the CPU utilization not only in the user-space applications and libraries but also in the kernel. BOLT has demonstrated double-digit gains while being applied to user-space programs. This guide shows how to apply BOLT to the x86-64 Linux kernel and enhance your system's performance. In our experiments, BOLT boosted database TPS by 2 percent when applied to the kernel compiled with the highest level optimizations, including PGO and LTO. The database spent ~40% of the time in the kernel and was quite sensitive to kernel performance.
-BOLT optimizes code layout based on a low-level execution profile collected with the Linux `perf` tool. The best quality profile should include branch history, such as Intel's last branch records (LBR). BOLT runs on a linked binary and reorders the code while combining frequently executed blocks of instructions in a manner best suited for the hardware. Other than branch instructions, most of the code is left unchanged. Additionally, BOLT updates all metadata associated with the modified code, including DWARF debug information and Linux ORC unwind information.
+BOLT optimizes code layout based on a low-level execution profile collected with the Linux `perf` tool. The best quality profile should include branch stack history (BRSTACK), such as Intel's last branch records (LBR) or AArch64's Branch Record Buffer Extension (BRBE). BOLT runs on a linked binary and reorders the code while combining frequently executed blocks of instructions in a manner best suited for the hardware. Other than branch instructions, most of the code is left unchanged. Additionally, BOLT updates all metadata associated with the modified code, including DWARF debug information and Linux ORC unwind information.
While BOLT optimizations are not specific to the Linux kernel, certain quirks distinguish the kernel from user-level applications.
diff --git a/bolt/lib/Profile/DataAggregator.cpp b/bolt/lib/Profile/DataAggregator.cpp
index 3604fdd3a94b4..01dadfbeb0cd0 100644
--- a/bolt/lib/Profile/DataAggregator.cpp
+++ b/bolt/lib/Profile/DataAggregator.cpp
@@ -46,16 +46,15 @@ namespace opts {
static cl::opt<bool>
BasicAggregation("nl",
- cl::desc("aggregate basic samples (without LBR info)"),
+ cl::desc("aggregate basic samples (without BRSTACK info)"),
cl::cat(AggregatorCategory));
cl::opt<bool> ArmSPE("spe", cl::desc("Enable Arm SPE mode."),
cl::cat(AggregatorCategory));
-static cl::opt<std::string>
- ITraceAggregation("itrace",
- cl::desc("Generate LBR info with perf itrace argument"),
- cl::cat(AggregatorCategory));
+static cl::opt<std::string> ITraceAggregation(
+ "itrace", cl::desc("Generate BRSTACK info with perf itrace argument"),
+ cl::cat(AggregatorCategory));
static cl::opt<bool>
FilterMemProfile("filter-mem-profile",
@@ -201,7 +200,7 @@ void DataAggregator::start() {
}
if (opts::BasicAggregation) {
- launchPerfProcess("events without LBR", MainEventsPPI,
+ launchPerfProcess("events without BRSTACK", MainEventsPPI,
"script -F pid,event,ip");
} else if (!opts::ITraceAggregation.empty()) {
// Disable parsing memory profile from trace data, unless requested by user.
@@ -1069,7 +1068,7 @@ ErrorOr<DataAggregator::LBREntry> DataAggregator::parseLBREntry() {
if (std::error_code EC = Rest.getError())
return EC;
if (Rest.get().size() < 5) {
- reportError("expected rest of LBR entry");
+ reportError("expected rest of BRSTACK entry");
Diag << "Found: " << Rest.get() << "\n";
return make_error_code(llvm::errc::io_error);
}
@@ -1433,7 +1432,7 @@ std::error_code DataAggregator::printLBRHeatMap() {
errs() << "HEATMAP-ERROR: no basic event samples detected in profile. "
"Cannot build heatmap.";
} else {
- errs() << "HEATMAP-ERROR: no LBR traces detected in profile. "
+ errs() << "HEATMAP-ERROR: no BRSTACK traces detected in profile. "
"Cannot build heatmap. Use -nl for building heatmap from "
"basic events.\n";
}
@@ -1572,7 +1571,7 @@ void DataAggregator::printBranchStacksDiagnostics(
std::error_code DataAggregator::parseBranchEvents() {
std::string BranchEventTypeStr =
- opts::ArmSPE ? "SPE branch events in LBR-format" : "branch events";
+ opts::ArmSPE ? "SPE branch events in BRSTACK-format" : "branch events";
outs() << "PERF2BOLT: parse " << BranchEventTypeStr << "...\n";
NamedRegionTimer T("parseBranch", "Parsing branch events", TimerGroupName,
TimerGroupDesc, opts::TimeAggregator);
@@ -1620,7 +1619,7 @@ std::error_code DataAggregator::parseBranchEvents() {
clear(TraceMap);
outs() << "PERF2BOLT: read " << NumSamples << " samples and " << NumEntries
- << " LBR entries\n";
+ << " BRSTACK entries\n";
if (NumTotalSamples) {
if (NumSamples && NumSamplesNoLBR == NumSamples) {
// Note: we don't know if perf2bolt is being used to parse memory samples
@@ -1628,8 +1627,10 @@ std::error_code DataAggregator::parseBranchEvents() {
if (!opts::ArmSPE)
errs()
<< "PERF2BOLT-WARNING: all recorded samples for this binary lack "
- "LBR. Record profile with perf record -j any or run perf2bolt "
- "in no-LBR mode with -nl (the performance improvement in -nl "
+ "BRSTACK. Record profile with perf record -j any or run "
+ "perf2bolt "
+ "in non-BRSTACK mode with -nl (the performance improvement in "
+ "-nl "
"mode may be limited)\n";
else
errs()
@@ -1664,7 +1665,7 @@ void DataAggregator::processBranchEvents() {
}
std::error_code DataAggregator::parseBasicEvents() {
- outs() << "PERF2BOLT: parsing basic events (without LBR)...\n";
+ outs() << "PERF2BOLT: parsing basic events (without BRSTACK)...\n";
NamedRegionTimer T("parseBasic", "Parsing basic events", TimerGroupName,
TimerGroupDesc, opts::TimeAggregator);
while (hasData()) {
@@ -1688,7 +1689,7 @@ std::error_code DataAggregator::parseBasicEvents() {
}
void DataAggregator::processBasicEvents() {
- outs() << "PERF2BOLT: processing basic events (without LBR)...\n";
+ outs() << "PERF2BOLT: processing basic events (without BRSTACK)...\n";
NamedRegionTimer T("processBasic", "Processing basic events", TimerGroupName,
TimerGroupDesc, opts::TimeAggregator);
uint64_t OutOfRangeSamples = 0;
@@ -1777,7 +1778,8 @@ std::error_code DataAggregator::parsePreAggregatedLBRSamples() {
++AggregatedLBRs;
}
- outs() << "PERF2BOLT: read " << AggregatedLBRs << " aggregated LBR entries\n";
+ outs() << "PERF2BOLT: read " << AggregatedLBRs
+ << " aggregated BRSTACK entries\n";
return std::error_code();
}
@@ -2426,7 +2428,7 @@ std::error_code DataAggregator::writeBATYAML(BinaryContext &BC,
void DataAggregator::dump() const { DataReader::dump(); }
void DataAggregator::dump(const PerfBranchSample &Sample) const {
- Diag << "Sample LBR entries: " << Sample.LBR.size() << "\n";
+ Diag << "Sample BRSTACK entries: " << Sample.LBR.size() << "\n";
for (const LBREntry &LBR : Sample.LBR)
Diag << LBR << '\n';
}
diff --git a/bolt/lib/Profile/DataReader.cpp b/bolt/lib/Profile/DataReader.cpp
index 277d4bb5e7282..ca76f199c25e1 100644
--- a/bolt/lib/Profile/DataReader.cpp
+++ b/bolt/lib/Profile/DataReader.cpp
@@ -570,7 +570,7 @@ void DataReader::readBasicSampleData(BinaryFunction &BF) {
if (!SampleDataOrErr)
return;
- // Basic samples mode territory (without LBR info)
+ // Basic samples mode territory (without BRSTACK info)
// First step is to assign BB execution count based on samples from perf
BF.ProfileMatchRatio = 1.0f;
BF.removeTagsFromProfile();
@@ -578,8 +578,8 @@ void DataReader::readBasicSampleData(BinaryFunction &BF) {
bool NormalizeByCalls = usesEvent("branches");
static bool NagUser = true;
if (NagUser) {
- outs()
- << "BOLT-INFO: operating with basic samples profiling data (no LBR).\n";
+ outs() << "BOLT-INFO: operating with basic samples profiling data (no "
+ "BRSTACK).\n";
if (NormalizeByInsnCount)
outs() << "BOLT-INFO: normalizing samples by instruction count.\n";
else if (NormalizeByCalls)
diff --git a/bolt/test/X86/bolt-address-translation-yaml.test b/bolt/test/X86/bolt-address-translation-yaml.test
index cffe848a16ae1..27cea4a24a2f2 100644
--- a/bolt/test/X86/bolt-address-translation-yaml.test
+++ b/bolt/test/X86/bolt-address-translation-yaml.test
@@ -46,7 +46,7 @@ WRITE-BAT-CHECK: BOLT-INFO: BAT section size (bytes): 404
READ-BAT-CHECK-NOT: BOLT-ERROR: unable to save profile in YAML format for input file processed by BOLT
READ-BAT-CHECK: BOLT-INFO: Parsed 5 BAT entries
-READ-BAT-CHECK: PERF2BOLT: read 79 aggregated LBR entries
+READ-BAT-CHECK: PERF2BOLT: read 79 aggregated BRSTACK entries
READ-BAT-CHECK: HEATMAP: building heat map
READ-BAT-CHECK: BOLT-INFO: 5 out of 21 functions in the binary (23.8%) have non-empty execution profile
READ-BAT-FDATA-CHECK: BOLT-INFO: 5 out of 16 functions in the binary (31.2%) have non-empty execution profile
diff --git a/bolt/test/X86/heatmap-preagg.test b/bolt/test/X86/heatmap-preagg.test
index 493101664c4fd..3b31ca290b9c4 100644
--- a/bolt/test/X86/heatmap-preagg.test
+++ b/bolt/test/X86/heatmap-preagg.test
@@ -32,7 +32,7 @@ RUN: --block-size=1024 | FileCheck --check-prefix CHECK-HEATMAP-BAT-1K %s
CHECK-HEATMAP-BAT-1K: HEATMAP: dumping heatmap with bucket size 1024
CHECK-HEATMAP-BAT-1K-NOT: HEATMAP: dumping heatmap with bucket size
-CHECK-HEATMAP: PERF2BOLT: read 81 aggregated LBR entries
+CHECK-HEATMAP: PERF2BOLT: read 81 aggregated BRSTACK entries
CHECK-HEATMAP: HEATMAP: invalid traces: 1
CHECK-HEATMAP: HEATMAP: dumping heatmap with bucket size 64
CHECK-HEATMAP: HEATMAP: dumping heatmap with bucket size 128
@@ -71,7 +71,7 @@ CHECK-HM-1024-NEXT: 0
CHECK-BAT-HM-64: (349, 1126]
CHECK-BAT-HM-4K: (605, 2182]
-CHECK-HEATMAP-BAT: PERF2BOLT: read 79 aggregated LBR entries
+CHECK-HEATMAP-BAT: PERF2BOLT: read 79 aggregated BRSTACK entries
CHECK-HEATMAP-BAT: HEATMAP: invalid traces: 2
CHECK-HEATMAP-BAT: HEATMAP: dumping heatmap with bucket size 64
CHECK-HEATMAP-BAT: HEATMAP: dumping heatmap with bucket size 4096
diff --git a/bolt/test/X86/nolbr.s b/bolt/test/X86/nolbr.s
index 999c68566c949..eca2c3e11d27a 100644
--- a/bolt/test/X86/nolbr.s
+++ b/bolt/test/X86/nolbr.s
@@ -17,7 +17,7 @@
# CHECK-FDATA-NEXT: 1 _start [[#]] 1
# CHECK-BOLT: BOLT-INFO: pre-processing profile using branch profile reader
-# CHECK-BOLT: BOLT-INFO: operating with basic samples profiling data (no LBR).
+# CHECK-BOLT: BOLT-INFO: operating with basic samples profiling data (no BRSTACK).
# CHECK-BOLT: BOLT-INFO: 1 out of 1 functions in the binary (100.0%) have non-empty execution profile
.globl _start
diff --git a/bolt/test/perf2bolt/AArch64/perf2bolt-spe.test b/bolt/test/perf2bolt/AArch64/perf2bolt-spe.test
index 1f44f7510a9fb..1677d270236f3 100644
--- a/bolt/test/perf2bolt/AArch64/perf2bolt-spe.test
+++ b/bolt/test/perf2bolt/AArch64/perf2bolt-spe.test
@@ -6,6 +6,6 @@ RUN: %clang %cflags %p/../../Inputs/asm_foo.s %p/../../Inputs/asm_main.c -o %t.e
RUN: perf record -e cycles -q -o %t.perf.data -- %t.exe 2> /dev/null
-RUN: perf2bolt -p %t.perf.data -o %t.perf.boltdata --spe %t.exe | FileCheck %s --check-prefix=CHECK-SPE-LBR
+RUN: perf2bolt -p %t.perf.data -o %t.perf.boltdata --spe %t.exe | FileCheck %s --check-prefix=CHECK-SPE-BRSTACK
-CHECK-SPE-LBR: PERF2BOLT: parse SPE branch events in LBR-format
+CHECK-SPE-BRSTACK: PERF2BOLT: parse SPE branch events in BRSTACK-format
diff --git a/bolt/tools/heatmap/heatmap.cpp b/bolt/tools/heatmap/heatmap.cpp
index 43167558b6758..1fd0c5292cbdf 100644
--- a/bolt/tools/heatmap/heatmap.cpp
+++ b/bolt/tools/heatmap/heatmap.cpp
@@ -69,7 +69,8 @@ int main(int argc, char **argv) {
" - Sampled profile collected from the binary:\n"
" - perf data or pre-aggregated profile data (instrumentation profile "
"not supported)\n"
- " - perf data can have basic (IP) or branch-stack (LBR) samples\n\n"
+ " - perf data can have basic (IP) or branch-stack (BRSTACK) "
+ "samples\n\n"
" Outputs:\n"
" - Heatmaps: colored ASCII (requires a color-capable terminal or a"
diff --git a/bolt/tools/merge-fdata/merge-fdata.cpp b/bolt/tools/merge-fdata/merge-fdata.cpp
index cfcb9373548a1..af3e50f062e08 100644
--- a/bolt/tools/merge-fdata/merge-fdata.cpp
+++ b/bolt/tools/merge-fdata/merge-fdata.cpp
@@ -120,14 +120,14 @@ void mergeProfileHeaders(BinaryProfileHeader &MergedHeader,
if (!MergedHeader.Id.empty() && (MergedHeader.Id != Header.Id))
errs() << "WARNING: build-ids in merged profiles do not match\n";
- // Cannot merge samples profile with LBR profile.
+ // Cannot merge samples profile with BRSTACK profile.
if (!MergedHeader.Flags)
MergedHeader.Flags = Header.Flags;
constexpr auto Mask = llvm::bolt::BinaryFunction::PF_BRANCH |
llvm::bolt::BinaryFunction::PF_BASIC;
if ((MergedHeader.Flags & Mask) != (Header.Flags & Mask)) {
- errs() << "ERROR: cannot merge LBR profile with non-LBR profile\n";
+ errs() << "ERROR: cannot merge BRSTACK profile with non-BRSTACK profile\n";
exit(1);
}
MergedHeader.Flags = MergedHeader.Flags | Header.Flags;
@@ -319,7 +319,7 @@ void mergeLegacyProfiles(const SmallVectorImpl<std::string> &Filenames) {
auto [Signature, ExecCount] = Line.rsplit(' ');
if (ExecCount.getAsInteger(10, Count.Exec))
report_error(Filename, "Malformed / corrupted execution count");
- // Only LBR profile has misprediction field
+ // Only BRSTACK profile has misprediction field
if (!NoLBRCollection.value_or(false)) {
auto [SignatureLBR, MispredCount] = Signature.rsplit(' ');
Signature = SignatureLBR;
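For readers skimming the merge-fdata hunk: the renamed error message guards a flag comparison that rejects mixing branch-stack and basic-sample profiles. The following is a minimal standalone sketch of that check, not the actual BOLT code. The enum values are assumptions standing in for `BinaryFunction::PF_BRANCH` and `BinaryFunction::PF_BASIC`, and `mergeFlags` is a hypothetical helper that returns a message instead of calling `exit(1)`.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical stand-ins for llvm::bolt::BinaryFunction::PF_BRANCH / PF_BASIC.
// The numeric values are assumed for illustration only.
enum ProfileFlags : uint16_t {
  PF_NONE = 0,
  PF_BRANCH = 1, // assumed: profile built from branch-stack (BRSTACK) samples
  PF_BASIC = 2,  // assumed: profile built from basic (IP) samples
};

// Folds HeaderFlags into MergedFlags, mirroring the logic in the hunk above.
// Returns an error message when the two profiles are of different kinds,
// std::nullopt when the merge is allowed.
std::optional<std::string> mergeFlags(uint16_t &MergedFlags,
                                      uint16_t HeaderFlags) {
  // The first profile seen defines the kind of the merged profile.
  if (!MergedFlags)
    MergedFlags = HeaderFlags;

  // Only the branch/basic bits decide compatibility.
  constexpr uint16_t Mask = PF_BRANCH | PF_BASIC;
  if ((MergedFlags & Mask) != (HeaderFlags & Mask))
    return std::string(
        "cannot merge BRSTACK profile with non-BRSTACK profile");

  MergedFlags |= HeaderFlags;
  return std::nullopt;
}
```

In this sketch, merging two `PF_BRANCH` headers succeeds, while following a `PF_BRANCH` header with a `PF_BASIC` one yields the error string and leaves the merged flags unchanged.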
Hey folks, I thought to start simple with
Linux perf man page uses non-capitalized form:
Update guides to use brstack, with a mention to BRBE for AArch64. Use brstack in user-facing outputs.
1ac2401 to a170e35
Good suggestion. Done. Force-pushed to update the commit message as well.
LG overall, thank you for addressing the ambiguity.
Co-authored-by: Amir Ayupov <[email protected]>
Thanks Amir, applied the suggestion.
I must have caused this failure: Will send a fixup patch soon.
Pushed to bolt-tests: (cc: @aaupov; unsure if I can add reviewers there)