Commit 4679420

Add BOLT Merge Learning Path
1 parent 109bb6d commit 4679420

9 files changed

+520
-0
lines changed

Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
1+
---
2+
3+
4+
title: "BOLT Merge: Feature-level and Library-level BOLTing with Profile Merging"
5+
6+
minutes_to_complete: 30
7+
8+
who_is_this_for: >
9+
In networking and high-performance applications, a single execution path (e.g., one feature) often activates only a small portion of the binary. For example, 20% of the application may be exercised under one feature, while the remaining features and external libraries remain untouched.
10+
11+
A single BOLT pass using one workload leads to partial optimization, typically benefiting only the code paths covered by that specific run.
12+
13+
This learning path is intended for performance engineers and developers working on Arm-based systems who need to optimize large, feature-rich application binaries that depend on external libraries.
14+
15+
It demonstrates how to bolt application features and shared libraries independently, then merge the resulting profiles to achieve full code coverage and deploy a fully optimized binary.
16+
17+
18+
learning_objectives:
19+
- Instrument and optimize binaries for individual workload features using LLVM-BOLT
20+
- Collect separate BOLT profiles and merge them for comprehensive code coverage
21+
- Optimize shared libraries independently
22+
- Integrate these bolted libraries into applications at runtime
23+
- Compare performance across baseline, isolated, and merged optimization cases
24+
25+
prerequisites:
26+
- An Arm-based system running Linux with BOLT and Linux Perf installed. The Linux kernel should be version 5.15 or later. Earlier kernel versions can be used, but some Linux Perf features may be limited or not available.
27+
- (Optional) A second, more powerful Linux system to build the software executable and run BOLT.
28+
29+
author: Gayathri Narayana Yegna Narayanan
30+
31+
### Tags
32+
skilllevels: Introductory
33+
subjects: Performance and Architecture
34+
armips:
35+
- Neoverse
36+
- Cortex-A
37+
tools_software_languages:
38+
- BOLT
39+
- perf
40+
- Runbook
41+
operatingsystems:
42+
- Linux
43+
44+
further_reading:
45+
- resource:
46+
title: BOLT README
47+
link: https://github.com/llvm/llvm-project/tree/main/bolt
48+
type: documentation
49+
- resource:
50+
title: BOLT - A Practical Binary Optimizer for Data Centers and Beyond
51+
link: https://research.facebook.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/
52+
type: website
53+
54+
55+
56+
### FIXED, DO NOT MODIFY
57+
# ================================================================================
58+
weight: 1 # _index.md always has weight of 1 to order correctly
59+
layout: "learningpathall" # All files under learning paths have this same wrapper
60+
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
61+
---
62+
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
1+
---
2+
# ================================================================================
3+
# FIXED, DO NOT MODIFY THIS FILE
4+
# ================================================================================
5+
weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
6+
title: "Next Steps" # Always the same, html page title.
7+
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
8+
---
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
1+
---
2+
title: Overview of BOLT Merge
3+
weight: 2
4+
5+
### FIXED, DO NOT MODIFY
6+
layout: learningpathall
7+
---
8+
9+
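[BOLT](https://github.com/llvm/llvm-project/blob/main/bolt/README.md) is a post-link binary optimizer that uses profile data, collected with Linux Perf or with BOLT's own instrumentation, to reorder the code layout of an executable. A better layout improves instruction-cache and TLB locality, which in turn improves performance.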
10+
11+
12+
The diagram below illustrates why merging BOLT profiles and optimizing libraries independently is essential for complete binary optimization:
13+
14+
![Why BOLT Profile Merging?](Bolt-merge.png)
15+
16+
- The **left chart** shows a typical application binary, where only 50% is proprietary application code, and the rest consists of external libraries.
17+
- The **right chart** breaks down that application code into individual features (F1–F5). In any given run, typically only one feature is active, so only about 20% of the application code is exercised and profiled.
18+
- As a result, a single BOLT pass provides incomplete optimization.
19+
20+
To ensure full optimization, the workflow includes:
21+
1. Profiling each workload feature separately
22+
2. Profiling external libraries independently
23+
3. Merging profiles for broader code coverage
24+
4. Applying BOLT to each binary and library
25+
5. Linking bolted libraries with the merged-profile binary
26+
27+
In this Learning Path, you'll learn how to:
28+
- Collect and merge BOLT profiles from multiple workload features (e.g., read-only and write-only)
29+
- Independently optimize application binaries and external user-space libraries (e.g., `libssl.so`, `libcrypto.so`)
30+
- Link the final optimized binary with the separately bolted libraries to deploy a fully optimized runtime stack
31+
32+
While MySQL and sysbench are used as examples, this method applies to **any feature-rich application** that:
33+
- Exhibits multiple runtime paths
34+
- Uses dynamic libraries
35+
- Requires full-stack binary optimization for performance-critical deployment
36+
37+
38+
39+
Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
1+
---
2+
title: BOLT Optimization - First feature
3+
weight: 3
4+
5+
### FIXED, DO NOT MODIFY
6+
layout: learningpathall
7+
---
8+
9+
In this step, you will instrument an application binary (such as `mysqld`) with BOLT to collect runtime profile data for a specific feature, for example a **read-only workload**.
10+
11+
The collected profile will later be merged with others and used to optimize the application's code layout.
12+
13+
### Step 1: Build or obtain the uninstrumented binary
14+
15+
Make sure your application binary is:
16+
17+
- Built from source (e.g., `mysqld`)
18+
- Unstripped, with symbol information available
19+
- Compiled with frame pointers enabled (`-fno-omit-frame-pointer`)
20+
21+
You can check that symbol information is present with:
22+
23+
```bash
readelf -s /path/to/mysqld | grep main
```
26+
27+
If the symbols are missing, rebuild the binary with debug info and no stripping.
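
Beyond symbols, the BOLT README notes that binaries should be linked with relocations (`--emit-relocs`) to get the most out of the optimizer. The helper below is a minimal sketch of how both conditions could be checked from `readelf -S` output. The function name `report_bolt_sections` is hypothetical, and the check assumes the relocation section is named `.rela.text`, as is usual for AArch64 and x86-64 ELF files.

```bash
#!/usr/bin/env bash
# Sketch: summarize BOLT-relevant sections from `readelf -S` output on stdin.
# report_bolt_sections is a hypothetical helper name, not part of BOLT.
report_bolt_sections() {
  local listing
  listing=$(cat)

  # .symtab is the full symbol table that `strip` removes.
  if grep -q '\.symtab' <<<"$listing"; then
    echo "symtab: present"
  else
    echo "symtab: missing (binary looks stripped)"
  fi

  # .rela.text is normally kept only when linking with --emit-relocs.
  if grep -q '\.rela\.text' <<<"$listing"; then
    echo "relocations: present"
  else
    echo "relocations: missing (consider relinking with -Wl,--emit-relocs)"
  fi
}
```

A typical call, with a placeholder path, would be `readelf -S --wide /path/to/mysqld | report_bolt_sections`.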
28+
29+
---
30+
31+
### Step 2: Instrument the binary with BOLT
32+
33+
Use `llvm-bolt` to create an instrumented version of the binary:
34+
35+
```bash
# Each line ends with a single backslash so the shell treats this as one command.
llvm-bolt /path/to/mysqld \
  -instrument \
  -o /path/to/mysqld.instrumented \
  --instrumentation-file=/path/to/profile-readonly.fdata \
  --instrumentation-sleep-time=5 \
  --instrumentation-no-counters-clear \
  --instrumentation-wait-forks
```
44+
45+
### Explanation of key options
46+
47+
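- `-instrument`: Enables profile-generation instrumentation
- `--instrumentation-file`: Path where the profile output will be saved
- `--instrumentation-sleep-time=5`: Writes the profile to disk every 5 seconds while the process runs, rather than only at exit (useful for a server that is not shut down cleanly)
- `--instrumentation-no-counters-clear`: Keeps the counters across these periodic dumps, so each dump is cumulative
- `--instrumentation-wait-forks`: Waits for forked child processes of the instrumented binary to finish, so profile data keeps being collected across forks (important for daemon processes)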
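
Because the profile path is passed to `llvm-bolt` at instrumentation time, one way to keep per-feature profiles apart is to instrument one copy of the binary per feature. The following is a sketch under that assumption: the function name, the `LLVM_BOLT` override variable, and the `profile-<feature>.fdata` naming scheme are illustrative, while the flags are the ones used in the command above.

```bash
#!/usr/bin/env bash
# Sketch: instrument one copy of the binary per feature so that each run
# writes to its own profile file. LLVM_BOLT and the naming scheme are
# assumptions made for this example.
LLVM_BOLT=${LLVM_BOLT:-llvm-bolt}

instrument_for_feature() {
  local binary=$1 feature=$2 outdir=$3
  "$LLVM_BOLT" "$binary" \
    -instrument \
    -o "$outdir/$(basename "$binary").instrumented-$feature" \
    --instrumentation-file="$outdir/profile-$feature.fdata" \
    --instrumentation-sleep-time=5 \
    --instrumentation-no-counters-clear \
    --instrumentation-wait-forks
}
```

For example, `instrument_for_feature /path/to/mysqld readonly /path/to` would produce `/path/to/mysqld.instrumented-readonly`, which writes `/path/to/profile-readonly.fdata` when run. Setting `LLVM_BOLT=echo` prints the command instead of running it.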
50+
51+
---
52+
53+
### Step 3: Run the instrumented binary under a feature-specific workload
54+
55+
Use a workload generator to stress the binary in a feature-specific way. For example, to simulate **read-only traffic** with sysbench:
56+
57+
```bash
# Pin sysbench to one core and drive the server with the read-only OLTP script.
taskset -c 9 ./src/sysbench \
  --db-driver=mysql \
  --mysql-host=127.0.0.1 \
  --mysql-db=bench \
  --mysql-user=bench \
  --mysql-password=bench \
  --mysql-port=3306 \
  --tables=8 \
  --table-size=10000 \
  --threads=1 \
  src/lua/oltp_read_only.lua run
```
70+
71+
> Adjust this command as needed for your workload and CPU/core binding.
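
Since only the Lua script changes from one feature to the next, the sysbench invocation can be wrapped in a small function. This is a sketch, not part of sysbench: the function name and the `SYSBENCH` and `PIN_CPU` variables are hypothetical, and the connection settings are copied from the example above.

```bash
#!/usr/bin/env bash
# Sketch: run one sysbench OLTP workload by name (e.g. read_only, write_only).
# SYSBENCH and PIN_CPU are assumed override variables for this example.
SYSBENCH=${SYSBENCH:-./src/sysbench}

run_feature_workload() {
  local workload=$1

  # Optionally pin the load generator to a core, as in the example above.
  if [ -n "${PIN_CPU:-}" ]; then
    set -- taskset -c "$PIN_CPU" "$SYSBENCH"
  else
    set -- "$SYSBENCH"
  fi

  "$@" \
    --db-driver=mysql \
    --mysql-host=127.0.0.1 \
    --mysql-db=bench \
    --mysql-user=bench \
    --mysql-password=bench \
    --mysql-port=3306 \
    --tables=8 \
    --table-size=10000 \
    --threads=1 \
    "src/lua/oltp_${workload}.lua" run
}
```

With this in place, `PIN_CPU=9 run_feature_workload read_only` reproduces the command shown earlier.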
72+
73+
The `.fdata` file defined in `--instrumentation-file` will be populated with runtime execution data.
74+
75+
---
76+
77+
### Step 4: Verify the profile was created
78+
79+
After running the workload:
80+
81+
```bash
ls -lh /path/to/profile-readonly.fdata
```
84+
85+
You should see a non-empty file. This file will later be merged with other profiles (e.g., for write-only traffic) to generate a complete merged profile.
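
With `--instrumentation-sleep-time` set, the profile is written while the server is still running, so it can take a few seconds to appear. A small polling helper such as the following can be used in scripts to wait for it. The function name and the default timeout are assumptions made for this sketch.

```bash
#!/usr/bin/env bash
# Sketch: wait until a profile file exists and is non-empty.
# wait_for_profile and its 60-second default are illustrative choices.
wait_for_profile() {
  local file=$1 timeout=${2:-60} waited=0

  until [ -s "$file" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out waiting for $file" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done

  echo "$file: $(wc -c <"$file") bytes"
}
```

For example, `wait_for_profile /path/to/profile-readonly.fdata 30` returns as soon as the file has content, or fails after 30 seconds.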
86+
87+
---
88+
89+
Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
1+
---
2+
title: BOLT Optimization - Second Feature and Profile Merge
3+
weight: 4
4+
5+
### FIXED, DO NOT MODIFY
6+
layout: learningpathall
7+
---
8+
9+
In this step, you'll collect profile data for a **write-only** workload, merge it with the read-only profile, and use the merged profile to optimize the application binary (for example, MySQL).
10+
11+
12+
### Step 1: Run Write-Only Workload for Application Binary
13+
14+
Drive the BOLT-instrumented MySQL binary with a write-only workload to capture `profile-writeonly.fdata`:
15+
16+
```bash
# Same invocation as the read-only run; only the Lua script differs.
taskset -c 9 ./src/sysbench \
  --db-driver=mysql \
  --mysql-host=127.0.0.1 \
  --mysql-db=bench \
  --mysql-user=bench \
  --mysql-password=bench \
  --mysql-port=3306 \
  --tables=8 \
  --table-size=10000 \
  --threads=1 \
  src/lua/oltp_write_only.lua run
```
29+
30+
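The profile path is set when the binary is instrumented, so make sure `--instrumentation-file` points to `profile-writeonly.fdata` for this run: either re-instrument the binary with the new path, or move the read-only profile aside first. Otherwise the write-only data is written to the read-only profile's path.

---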
32+
### Step 2: Verify the Second Profile Was Generated
33+
34+
```bash
ls -lh /path/to/profile-writeonly.fdata
```
37+
38+
Both `.fdata` files should now exist and contain valid data:
39+
40+
- `profile-readonly.fdata`
41+
- `profile-writeonly.fdata`
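
Before merging, it can help to check every input profile in one go, so that an empty or missing file is caught early. The function below is a minimal sketch with a hypothetical name. It only checks that each file exists and is non-empty, and it does not validate the profile contents.

```bash
#!/usr/bin/env bash
# Sketch: confirm that every profile passed as an argument exists and is
# non-empty. check_profiles is a hypothetical helper name.
check_profiles() {
  local status=0 f

  for f in "$@"; do
    if [ -s "$f" ]; then
      echo "ok: $f"
    else
      echo "missing or empty: $f" >&2
      status=1
    fi
  done

  return "$status"
}
```

For example, `check_profiles /path/to/profile-readonly.fdata /path/to/profile-writeonly.fdata` returns a non-zero status if either file is unusable.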
42+
43+
---
44+
45+
### Step 3: Merge the Feature Profiles
46+
47+
Use `merge-fdata` to combine the feature-specific profiles into one comprehensive `.fdata` file:
48+
49+
```bash
# Inputs are listed first; -o names the merged output profile.
merge-fdata /path/to/profile-readonly.fdata /path/to/profile-writeonly.fdata \
  -o /path/to/profile-merged.fdata
```
53+
54+
**Example command from an actual setup:**
55+
56+
```bash
/home/ubuntu/llvm-latest/build/bin/merge-fdata prof-instrumentation-readonly.fdata prof-instrumentation-writeonly.fdata \
  -o prof-instrumentation-readwritemerged.fdata
```
60+
61+
Output:
62+
63+
```
Using legacy profile format.
Profile from 2 files merged.
```
67+
68+
This creates a single merged profile (`profile-merged.fdata`) covering both read-only and write-only workload behaviors.
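
The same `merge-fdata` invocation extends to more than two features by listing more input files. The wrapper below is a sketch of that idea. The function name and the `MERGE_FDATA` override variable are assumptions, and the only `merge-fdata` usage relied on is the `inputs... -o output` form shown above.

```bash
#!/usr/bin/env bash
# Sketch: merge any number of per-feature profiles into one file.
# MERGE_FDATA lets you point at a specific build of the tool.
MERGE_FDATA=${MERGE_FDATA:-merge-fdata}

merge_profiles() {
  local out=$1
  shift

  # Merging only makes sense with at least two inputs.
  if [ "$#" -lt 2 ]; then
    echo "need at least two input profiles" >&2
    return 1
  fi

  "$MERGE_FDATA" "$@" -o "$out"
}
```

For example, `merge_profiles /path/to/profile-merged.fdata /path/to/profile-readonly.fdata /path/to/profile-writeonly.fdata` runs the same merge as Step 3.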
69+
70+
---
71+
72+
### Step 4: Verify the Merged Profile
73+
74+
Check the merged `.fdata` file:
75+
76+
```bash
ls -lh /path/to/profile-merged.fdata
```
79+
80+
---
81+
### Step 5: Generate the Final Binary with the Merged Profile
82+
83+
Use LLVM-BOLT to generate the final optimized binary using the merged `.fdata` file:
84+
85+
```bash
# Optimize the original (uninstrumented) binary using the merged profile.
# The log is captured with tee for later inspection.
llvm-bolt build/bin/mysqld \
  -o build/bin/mysqldreadwrite_merged.bolt_instrumentation \
  -data=/home/ubuntu/mysql-server-8.0.33/sysbench/prof-instrumentation-readwritemerged.fdata \
  -reorder-blocks=ext-tsp \
  -reorder-functions=hfsort \
  -split-functions \
  -split-all-cold \
  -split-eh \
  -dyno-stats \
  --print-profile-stats 2>&1 | tee bolt_orig.log
```
97+
98+
This command optimizes the binary layout based on the merged workload profile, creating a single binary (`mysqldreadwrite_merged.bolt_instrumentation`) that is optimized across both features.
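
To keep the binary, profile, and output paths consistent across steps, the optimization command can be parameterized. This is a sketch only: the function name and the `LLVM_BOLT` override variable are hypothetical, and the optimization flags are copied unchanged from the command above.

```bash
#!/usr/bin/env bash
# Sketch: apply a merged profile to an uninstrumented binary.
# optimize_with_profile and LLVM_BOLT are illustrative names.
LLVM_BOLT=${LLVM_BOLT:-llvm-bolt}

optimize_with_profile() {
  local binary=$1 profile=$2 output=$3
  "$LLVM_BOLT" "$binary" \
    -o "$output" \
    -data="$profile" \
    -reorder-blocks=ext-tsp \
    -reorder-functions=hfsort \
    -split-functions \
    -split-all-cold \
    -split-eh \
    -dyno-stats \
    --print-profile-stats
}
```

For example, `optimize_with_profile build/bin/mysqld /path/to/profile-merged.fdata build/bin/mysqld.bolt 2>&1 | tee bolt_orig.log` mirrors Step 5 with the profile path from Step 3. The output name `mysqld.bolt` is a placeholder.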
99+
100+
