Merged
9 changes: 5 additions & 4 deletions bolt/README.md
@@ -108,9 +108,10 @@ $ perf record -e cycles:u -j any,u -o perf.data -- <executable> <args> ...
#### For Services

Once you get the service deployed and warmed-up, it is time to collect perf
data with LBR (branch information). The exact perf command to use will depend
on the service. E.g., to collect the data for all processes running on the
server for the next 3 minutes use:
data with brstack (branch information). Different architectures implement this
using different hardware units, for example LBR on X86, and BRBE on AArch64.
The exact perf command to use will depend on the service. E.g., to collect the
data for all processes running on the server for the next 3 minutes use:
```
$ perf record -e cycles:u -j any,u -a -o perf.data -- sleep 180
```
@@ -163,7 +164,7 @@ $ perf2bolt -p perf.data -o perf.fdata <executable>
This command will aggregate branch data from `perf.data` and store it in a
format that is both more compact and more resilient to binary modifications.

If the profile was collected without LBRs, you will need to add `-nl` flag to
If the profile was collected without brstacks, you will need to add `-nl` flag to
the command line above.

### Step 3: Optimize with BOLT
8 changes: 4 additions & 4 deletions bolt/docs/Heatmaps.md
@@ -1,7 +1,7 @@
# Code Heatmaps

BOLT has gained the ability to print code heatmaps based on
sampling-based profiles generated by `perf`, either with `LBR` data or not.
sampling-based profiles generated by `perf`, either with `brstack` data or not.
The output is produced in colored ASCII to be displayed in a color-capable
terminal. It looks something like this:

@@ -20,9 +20,9 @@ or if you want to monitor the existing process(es):
$ perf record -e cycles:u -j any,u [-p PID|-a] -- sleep <interval>
```

Running with LBR (`-j any,u` or `-b`) is recommended. Heatmaps can be generated
from basic events by using the llvm-bolt-heatmap option `-nl` (no LBR) but
such heatmaps do not have the coverage provided by LBR and may only be useful
Running with brstack (`-j any,u` or `-b`) is recommended. Heatmaps can be generated
from basic events by using the llvm-bolt-heatmap option `-nl` (no brstack) but
such heatmaps do not have the coverage provided by brstack and may only be useful
for finding event hotspots at larger code block granularities.

Once the run is complete, and `perf.data` is generated, run llvm-bolt-heatmap:
2 changes: 1 addition & 1 deletion bolt/docs/OptimizingClang.md
@@ -97,7 +97,7 @@ BOLT-INFO: basic block reordering modified layout of 7848 (10.32%) functions
790053908 : all conditional branches (=)
...
```
The statistics in the output is based on the LBR profile collected with `perf`, and since we were using
The statistics in the output is based on the brstack profile (LBR) collected with `perf`, and since we were using
the `cycles` counter, its accuracy is affected. However, the relative improvement in `taken conditional
branches` is a good indication that BOLT was able to straighten out the code even after PGO.

2 changes: 1 addition & 1 deletion bolt/docs/OptimizingLinux.md
@@ -5,7 +5,7 @@

Many Linux applications spend a significant amount of their execution time in the kernel. Thus, when we consider code optimization for system performance, it is essential to improve the CPU utilization not only in the user-space applications and libraries but also in the kernel. BOLT has demonstrated double-digit gains while being applied to user-space programs. This guide shows how to apply BOLT to the x86-64 Linux kernel and enhance your system's performance. In our experiments, BOLT boosted database TPS by 2 percent when applied to the kernel compiled with the highest level optimizations, including PGO and LTO. The database spent ~40% of the time in the kernel and was quite sensitive to kernel performance.

BOLT optimizes code layout based on a low-level execution profile collected with the Linux `perf` tool. The best quality profile should include branch history, such as Intel's last branch records (LBR). BOLT runs on a linked binary and reorders the code while combining frequently executed blocks of instructions in a manner best suited for the hardware. Other than branch instructions, most of the code is left unchanged. Additionally, BOLT updates all metadata associated with the modified code, including DWARF debug information and Linux ORC unwind information.
BOLT optimizes code layout based on a low-level execution profile collected with the Linux `perf` tool. The best quality profile should include branch stack history (brstack), such as Intel's last branch records (LBR) or AArch64's Branch Record Buffer Extension (BRBE). BOLT runs on a linked binary and reorders the code while combining frequently executed blocks of instructions in a manner best suited for the hardware. Other than branch instructions, most of the code is left unchanged. Additionally, BOLT updates all metadata associated with the modified code, including DWARF debug information and Linux ORC unwind information.

While BOLT optimizations are not specific to the Linux kernel, certain quirks distinguish the kernel from user-level applications.

34 changes: 18 additions & 16 deletions bolt/lib/Profile/DataAggregator.cpp
@@ -46,16 +46,15 @@ namespace opts {

static cl::opt<bool>
BasicAggregation("nl",
cl::desc("aggregate basic samples (without LBR info)"),
cl::desc("aggregate basic samples (without brstack info)"),
cl::cat(AggregatorCategory));

cl::opt<bool> ArmSPE("spe", cl::desc("Enable Arm SPE mode."),
cl::cat(AggregatorCategory));

static cl::opt<std::string>
ITraceAggregation("itrace",
cl::desc("Generate LBR info with perf itrace argument"),
cl::cat(AggregatorCategory));
static cl::opt<std::string> ITraceAggregation(
"itrace", cl::desc("Generate brstack info with perf itrace argument"),
cl::cat(AggregatorCategory));

static cl::opt<bool>
FilterMemProfile("filter-mem-profile",
@@ -201,7 +200,7 @@ void DataAggregator::start() {
}

if (opts::BasicAggregation) {
launchPerfProcess("events without LBR", MainEventsPPI,
launchPerfProcess("events without brstack", MainEventsPPI,
"script -F pid,event,ip");
} else if (!opts::ITraceAggregation.empty()) {
// Disable parsing memory profile from trace data, unless requested by user.
@@ -1069,7 +1068,7 @@ ErrorOr<DataAggregator::LBREntry> DataAggregator::parseLBREntry() {
if (std::error_code EC = Rest.getError())
return EC;
if (Rest.get().size() < 5) {
reportError("expected rest of LBR entry");
reportError("expected rest of brstack entry");
Diag << "Found: " << Rest.get() << "\n";
return make_error_code(llvm::errc::io_error);
}
@@ -1433,7 +1432,7 @@ std::error_code DataAggregator::printLBRHeatMap() {
errs() << "HEATMAP-ERROR: no basic event samples detected in profile. "
"Cannot build heatmap.";
} else {
errs() << "HEATMAP-ERROR: no LBR traces detected in profile. "
errs() << "HEATMAP-ERROR: no brstack traces detected in profile. "
"Cannot build heatmap. Use -nl for building heatmap from "
"basic events.\n";
}
@@ -1572,7 +1571,7 @@ void DataAggregator::printBranchStacksDiagnostics(

std::error_code DataAggregator::parseBranchEvents() {
std::string BranchEventTypeStr =
opts::ArmSPE ? "SPE branch events in LBR-format" : "branch events";
opts::ArmSPE ? "SPE branch events in brstack-format" : "branch events";
outs() << "PERF2BOLT: parse " << BranchEventTypeStr << "...\n";
NamedRegionTimer T("parseBranch", "Parsing branch events", TimerGroupName,
TimerGroupDesc, opts::TimeAggregator);
@@ -1620,16 +1619,18 @@ std::error_code DataAggregator::parseBranchEvents() {
clear(TraceMap);

outs() << "PERF2BOLT: read " << NumSamples << " samples and " << NumEntries
<< " LBR entries\n";
<< " brstack entries\n";
if (NumTotalSamples) {
if (NumSamples && NumSamplesNoLBR == NumSamples) {
// Note: we don't know if perf2bolt is being used to parse memory samples
// at this point. In this case, it is OK to parse zero LBRs.
if (!opts::ArmSPE)
errs()
<< "PERF2BOLT-WARNING: all recorded samples for this binary lack "
"LBR. Record profile with perf record -j any or run perf2bolt "
"in no-LBR mode with -nl (the performance improvement in -nl "
"brstack. Record profile with perf record -j any or run "
"perf2bolt "
"in non-brstack mode with -nl (the performance improvement in "
"-nl "
"mode may be limited)\n";
else
errs()
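Reviewer note: the guard around the renamed warning in this hunk can be read as a single predicate. The following is a minimal sketch only, assuming the three counters mean "all samples recorded", "samples kept for this binary", and "kept samples with an empty brstack"; the function name `allSamplesLackBrstack` and the parameter name `NumSamplesNoBrstack` are hypothetical and do not appear in the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of DataAggregator. It restates the nested
// conditions that lead to the PERF2BOLT-WARNING: something was recorded,
// some samples were kept, and every kept sample lacks a brstack.
inline bool allSamplesLackBrstack(uint64_t NumTotalSamples,
                                  uint64_t NumSamples,
                                  uint64_t NumSamplesNoBrstack) {
  return NumTotalSamples != 0 && NumSamples != 0 &&
         NumSamplesNoBrstack == NumSamples;
}
```

With this reading, a profile where even one kept sample carries a brstack does not trigger the warning, and an empty profile does not trigger it either.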
@@ -1664,7 +1665,7 @@ void DataAggregator::processBranchEvents() {
}

std::error_code DataAggregator::parseBasicEvents() {
outs() << "PERF2BOLT: parsing basic events (without LBR)...\n";
outs() << "PERF2BOLT: parsing basic events (without brstack)...\n";
NamedRegionTimer T("parseBasic", "Parsing basic events", TimerGroupName,
TimerGroupDesc, opts::TimeAggregator);
while (hasData()) {
@@ -1688,7 +1689,7 @@ std::error_code DataAggregator::parseBasicEvents() {
}

void DataAggregator::processBasicEvents() {
outs() << "PERF2BOLT: processing basic events (without LBR)...\n";
outs() << "PERF2BOLT: processing basic events (without brstack)...\n";
NamedRegionTimer T("processBasic", "Processing basic events", TimerGroupName,
TimerGroupDesc, opts::TimeAggregator);
uint64_t OutOfRangeSamples = 0;
@@ -1777,7 +1778,8 @@ std::error_code DataAggregator::parsePreAggregatedLBRSamples() {
++AggregatedLBRs;
}

outs() << "PERF2BOLT: read " << AggregatedLBRs << " aggregated LBR entries\n";
outs() << "PERF2BOLT: read " << AggregatedLBRs
<< " aggregated brstack entries\n";

return std::error_code();
}
@@ -2426,7 +2428,7 @@ std::error_code DataAggregator::writeBATYAML(BinaryContext &BC,
void DataAggregator::dump() const { DataReader::dump(); }

void DataAggregator::dump(const PerfBranchSample &Sample) const {
Diag << "Sample LBR entries: " << Sample.LBR.size() << "\n";
Diag << "Sample brstack entries: " << Sample.LBR.size() << "\n";
for (const LBREntry &LBR : Sample.LBR)
Diag << LBR << '\n';
}
6 changes: 3 additions & 3 deletions bolt/lib/Profile/DataReader.cpp
@@ -570,16 +570,16 @@ void DataReader::readBasicSampleData(BinaryFunction &BF) {
if (!SampleDataOrErr)
return;

// Basic samples mode territory (without LBR info)
// Basic samples mode territory (without brstack info)
// First step is to assign BB execution count based on samples from perf
BF.ProfileMatchRatio = 1.0f;
BF.removeTagsFromProfile();
bool NormalizeByInsnCount = usesEvent("cycles") || usesEvent("instructions");
bool NormalizeByCalls = usesEvent("branches");
static bool NagUser = true;
if (NagUser) {
outs()
<< "BOLT-INFO: operating with basic samples profiling data (no LBR).\n";
outs() << "BOLT-INFO: operating with basic samples profiling data (no "
"brstack).\n";
if (NormalizeByInsnCount)
outs() << "BOLT-INFO: normalizing samples by instruction count.\n";
else if (NormalizeByCalls)
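Reviewer note: the hunk above picks a normalization strategy for basic (non-brstack) samples from the recorded event names, with `cycles` or `instructions` taking priority over `branches`. A minimal sketch of that priority order follows. It assumes exact matching against a set of event names, which may differ from how `usesEvent` actually matches; the `Normalization` enum and `pickNormalization` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical enum; DataReader itself uses two booleans.
enum class Normalization { None, ByInsnCount, ByCalls };

// Hypothetical stand-in for the usesEvent() checks in the hunk above.
// Assumption: event names are compared for exact equality.
inline Normalization pickNormalization(const std::set<std::string> &Events) {
  auto Uses = [&Events](const char *Name) { return Events.count(Name) != 0; };
  // "cycles" or "instructions" wins, matching the if / else-if order.
  if (Uses("cycles") || Uses("instructions"))
    return Normalization::ByInsnCount;
  if (Uses("branches"))
    return Normalization::ByCalls;
  return Normalization::None;
}
```

A profile that records both `cycles` and `branches` would therefore be normalized by instruction count under this sketch.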
2 changes: 1 addition & 1 deletion bolt/test/X86/bolt-address-translation-yaml.test
@@ -46,7 +46,7 @@ WRITE-BAT-CHECK: BOLT-INFO: BAT section size (bytes): 404

READ-BAT-CHECK-NOT: BOLT-ERROR: unable to save profile in YAML format for input file processed by BOLT
READ-BAT-CHECK: BOLT-INFO: Parsed 5 BAT entries
READ-BAT-CHECK: PERF2BOLT: read 79 aggregated LBR entries
READ-BAT-CHECK: PERF2BOLT: read 79 aggregated brstack entries
READ-BAT-CHECK: HEATMAP: building heat map
READ-BAT-CHECK: BOLT-INFO: 5 out of 21 functions in the binary (23.8%) have non-empty execution profile
READ-BAT-FDATA-CHECK: BOLT-INFO: 5 out of 16 functions in the binary (31.2%) have non-empty execution profile
4 changes: 2 additions & 2 deletions bolt/test/X86/heatmap-preagg.test
@@ -32,7 +32,7 @@ RUN: --block-size=1024 | FileCheck --check-prefix CHECK-HEATMAP-BAT-1K %s
CHECK-HEATMAP-BAT-1K: HEATMAP: dumping heatmap with bucket size 1024
CHECK-HEATMAP-BAT-1K-NOT: HEATMAP: dumping heatmap with bucket size

CHECK-HEATMAP: PERF2BOLT: read 81 aggregated LBR entries
CHECK-HEATMAP: PERF2BOLT: read 81 aggregated brstack entries
CHECK-HEATMAP: HEATMAP: invalid traces: 1
CHECK-HEATMAP: HEATMAP: dumping heatmap with bucket size 64
CHECK-HEATMAP: HEATMAP: dumping heatmap with bucket size 128
@@ -71,7 +71,7 @@ CHECK-HM-1024-NEXT: 0
CHECK-BAT-HM-64: (349, 1126]
CHECK-BAT-HM-4K: (605, 2182]

CHECK-HEATMAP-BAT: PERF2BOLT: read 79 aggregated LBR entries
CHECK-HEATMAP-BAT: PERF2BOLT: read 79 aggregated brstack entries
CHECK-HEATMAP-BAT: HEATMAP: invalid traces: 2
CHECK-HEATMAP-BAT: HEATMAP: dumping heatmap with bucket size 64
CHECK-HEATMAP-BAT: HEATMAP: dumping heatmap with bucket size 4096
2 changes: 1 addition & 1 deletion bolt/test/X86/nolbr.s
@@ -17,7 +17,7 @@
# CHECK-FDATA-NEXT: 1 _start [[#]] 1

# CHECK-BOLT: BOLT-INFO: pre-processing profile using branch profile reader
# CHECK-BOLT: BOLT-INFO: operating with basic samples profiling data (no LBR).
# CHECK-BOLT: BOLT-INFO: operating with basic samples profiling data (no brstack).
# CHECK-BOLT: BOLT-INFO: 1 out of 1 functions in the binary (100.0%) have non-empty execution profile

.globl _start
4 changes: 2 additions & 2 deletions bolt/test/perf2bolt/AArch64/perf2bolt-spe.test
@@ -6,6 +6,6 @@ RUN: %clang %cflags %p/../../Inputs/asm_foo.s %p/../../Inputs/asm_main.c -o %t.e

RUN: perf record -e cycles -q -o %t.perf.data -- %t.exe 2> /dev/null

RUN: perf2bolt -p %t.perf.data -o %t.perf.boltdata --spe %t.exe | FileCheck %s --check-prefix=CHECK-SPE-LBR
RUN: perf2bolt -p %t.perf.data -o %t.perf.boltdata --spe %t.exe | FileCheck %s --check-prefix=CHECK-SPE-BRSTACK

CHECK-SPE-LBR: PERF2BOLT: parse SPE branch events in LBR-format
CHECK-SPE-BRSTACK: PERF2BOLT: parse SPE branch events in brstack-format
3 changes: 2 additions & 1 deletion bolt/tools/heatmap/heatmap.cpp
@@ -69,7 +69,8 @@ int main(int argc, char **argv) {
" - Sampled profile collected from the binary:\n"
" - perf data or pre-aggregated profile data (instrumentation profile "
"not supported)\n"
" - perf data can have basic (IP) or branch-stack (LBR) samples\n\n"
" - perf data can have basic (IP) or branch-stack (brstack) "
"samples\n\n"

" Outputs:\n"
" - Heatmaps: colored ASCII (requires a color-capable terminal or a"
4 changes: 2 additions & 2 deletions bolt/tools/merge-fdata/merge-fdata.cpp
@@ -120,14 +120,14 @@ void mergeProfileHeaders(BinaryProfileHeader &MergedHeader,
if (!MergedHeader.Id.empty() && (MergedHeader.Id != Header.Id))
errs() << "WARNING: build-ids in merged profiles do not match\n";

// Cannot merge samples profile with LBR profile.
// Cannot merge samples profile with brstack profile.
if (!MergedHeader.Flags)
MergedHeader.Flags = Header.Flags;

constexpr auto Mask = llvm::bolt::BinaryFunction::PF_BRANCH |
llvm::bolt::BinaryFunction::PF_BASIC;
if ((MergedHeader.Flags & Mask) != (Header.Flags & Mask)) {
errs() << "ERROR: cannot merge LBR profile with non-LBR profile\n";
errs() << "ERROR: cannot merge brstack profile with non-brstack profile\n";
exit(1);
}
MergedHeader.Flags = MergedHeader.Flags | Header.Flags;
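Reviewer note: the compatibility check in this hunk compares only the two profile-kind bits, so unrelated flag bits never block a merge. The sketch below restates that logic in isolation. The numeric flag values and the `PF_OTHER` bit are assumptions made for illustration (the real constants live in `llvm::bolt::BinaryFunction`), and `mergeProfileFlags` is a hypothetical helper that returns `false` where merge-fdata prints the error and exits.

```cpp
#include <cassert>

// Assumed values for illustration only; the real constants may differ.
constexpr unsigned PF_BRANCH = 1; // profile built from brstack samples
constexpr unsigned PF_BASIC = 2;  // profile built from basic (IP) samples
constexpr unsigned PF_OTHER = 4;  // stand-in for any unrelated flag bit

// Hypothetical helper: folds Incoming into Merged. Returns false, leaving
// Merged unchanged, when the profile kinds (brstack vs. basic) disagree.
inline bool mergeProfileFlags(unsigned &Merged, unsigned Incoming) {
  constexpr unsigned Mask = PF_BRANCH | PF_BASIC;
  // The first header seen defines the profile kind.
  unsigned Current = Merged ? Merged : Incoming;
  if ((Current & Mask) != (Incoming & Mask))
    return false;
  Merged = Current | Incoming;
  return true;
}
```

Under these assumptions, merging two brstack profiles succeeds even if only one of them carries an extra flag, while merging a brstack profile with a basic one is rejected.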