Commit 0cfca10

Optimizing Arm binaries and libraries with LLVM BOLT
1 parent 1bfa370 commit 0cfca10

File tree

8 files changed

+501
-0
lines changed

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
1+
---
2+
title: Optimizing Arm binaries and libraries with LLVM-BOLT and profile merging
3+
4+
draft: true
5+
cascade:
6+
draft: true
7+
8+
minutes_to_complete: 30
9+
10+
who_is_this_for: Performance engineers and software developers working on Arm platforms who want to optimize both application binaries and shared libraries using LLVM-BOLT.
11+
12+
learning_objectives:
13+
- Instrument and optimize binaries for individual workload features using LLVM-BOLT.
14+
- Collect separate BOLT profiles and merge them for comprehensive code coverage.
15+
- Optimize shared libraries independently.
16+
- Integrate optimized shared libraries into applications.
17+
- Evaluate and compare application and library performance across baseline, isolated, and merged optimization scenarios.
18+
19+
prerequisites:
20+
- An Arm-based system running Linux with BOLT and Linux Perf installed. The Linux kernel should be version 5.15 or later.
21+
- (Optional) A second, more powerful Linux system to build the software executable and run BOLT.
22+
23+
author: Gayathri Narayana Yegna Narayanan
24+
25+
### Tags
26+
skilllevels: Introductory
27+
subjects: Performance and Architecture
28+
armips:
29+
- Neoverse
30+
- Cortex-A
31+
tools_software_languages:
32+
- BOLT
33+
- perf
34+
- Runbook
35+
operatingsystems:
36+
- Linux
37+
38+
further_reading:
39+
- resource:
40+
title: BOLT README
41+
link: https://github.com/llvm/llvm-project/tree/main/bolt
42+
type: documentation
43+
- resource:
44+
title: BOLT - A Practical Binary Optimizer for Data Centers and Beyond
45+
link: https://research.facebook.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/
46+
type: website
47+
48+
49+
### FIXED, DO NOT MODIFY
50+
# ================================================================================
51+
weight: 1 # _index.md always has weight of 1 to order correctly
52+
layout: "learningpathall" # All files under learning paths have this same wrapper
53+
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
54+
---
55+
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
1+
---
2+
# ================================================================================
3+
# FIXED, DO NOT MODIFY THIS FILE
4+
# ================================================================================
5+
weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
6+
title: "Next Steps" # Always the same, html page title.
7+
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
8+
---
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
1+
---
2+
title: Overview of BOLT Merge
3+
weight: 2
4+
5+
### FIXED, DO NOT MODIFY
6+
layout: learningpathall
7+
---
8+
9+
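[BOLT](https://github.com/llvm/llvm-project/blob/main/bolt/README.md) is a post-link binary optimizer that uses profile data, collected either by sampling with Linux Perf or by instrumenting the binary, to reorder the code layout of an executable. A better layout makes more effective use of the instruction cache and TLB, which improves performance.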
10+
11+
In this Learning Path, you'll learn how to:
12+
- Collect and merge BOLT profiles from multiple workload features (e.g., read-only and write-only)
13+
- Independently optimize application binaries and external user-space libraries (e.g., `libssl.so`, `libcrypto.so`)
14+
- Link the final optimized binary with the separately bolted libraries to deploy a fully optimized runtime stack
15+
16+
While MySQL and sysbench are used as examples, this method applies to **any feature-rich application** that:
17+
- Exhibits multiple runtime paths
18+
- Uses dynamic libraries
19+
- Requires full-stack binary optimization for performance-critical deployment
20+
21+
The workflow includes:
22+
1. Profiling each workload feature separately
23+
2. Profiling external libraries independently
24+
3. Merging profiles for broader code coverage
25+
4. Applying BOLT to each binary and library
26+
5. Linking bolted libraries with the merged-profile binary
27+
Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
1+
---
2+
title: BOLT Optimization - First feature
3+
weight: 3
4+
5+
### FIXED, DO NOT MODIFY
6+
layout: learningpathall
7+
---
8+
9+
In this step, you will instrument an application binary (such as `mysqld`) with BOLT to collect runtime profile data for a specific feature, for example a **read-only workload**.
10+
11+
The collected profile will later be merged with others and used to optimize the application's code layout.
12+
13+
### Step 1: Build or obtain the uninstrumented binary
14+
15+
Make sure your application binary is:
16+
17+
- Built from source (e.g., `mysqld`)
18+
- Unstripped, with symbol information available
19+
- Compiled with frame pointers enabled (`-fno-omit-frame-pointer`)
20+
21+
You can check that the symbol table is present with:

```bash
readelf -s /path/to/mysqld | grep main
```
26+
27+
If the symbols are missing, rebuild the binary with debug info and no stripping.
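
The check above can be wrapped in a small helper. This is a minimal sketch, not part of the original workflow: `check_bolt_ready` is a hypothetical function name, and the `.rela.text` test assumes you want to know whether the binary was linked with `--emit-relocs`, which BOLT needs in order to reorder functions.

```bash
# Hypothetical helper: report whether a binary looks usable as BOLT input.
check_bolt_ready() {
    local bin="$1"
    local sections

    [ -f "$bin" ] || { echo "not a file: $bin" >&2; return 1; }

    # List section headers; fail if readelf cannot parse the file.
    sections=$(readelf -S -W "$bin" 2>/dev/null) || return 1

    # A stripped binary has no .symtab section.
    if ! echo "$sections" | grep -q '\.symtab'; then
        echo "no .symtab: binary looks stripped" >&2
        return 1
    fi

    # Relocations are optional, so only warn when they are absent.
    if ! echo "$sections" | grep -q '\.rela\.text'; then
        echo "note: no .rela.text; relink with --emit-relocs to allow function reordering" >&2
    fi
    return 0
}
```

Call it as `check_bolt_ready /path/to/mysqld` and inspect the exit status before instrumenting. It does not check for frame pointers.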
28+
29+
---
30+
31+
### Step 2: Instrument the binary with BOLT
32+
33+
Use `llvm-bolt` to create an instrumented version of the binary:
34+
35+
```bash
llvm-bolt /path/to/mysqld \
  -instrument \
  -o /path/to/mysqld.instrumented \
  --instrumentation-file=/path/to/profile-readonly.fdata \
  --instrumentation-sleep-time=5 \
  --instrumentation-no-counters-clear \
  --instrumentation-wait-forks
```
44+
45+
### Explanation of key options
46+
47+
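- `-instrument`: Inserts profiling counters into the binary so that it records execution data at run time
- `--instrumentation-file`: Path where the profile output will be saved
- `--instrumentation-sleep-time`: Interval, in seconds, between periodic profile dumps (useful for long-running services that are not shut down cleanly)
- `--instrumentation-no-counters-clear`: Keeps the counters accumulating across periodic dumps instead of resetting them after each dump
- `--instrumentation-wait-forks`: Waits for forked child processes of the instrumented program to finish, so their activity is captured (important for daemon processes)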
50+
51+
---
52+
53+
### Step 3: Run the instrumented binary under a feature-specific workload
54+
55+
Use a workload generator to stress the binary in a feature-specific way. For example, to simulate **read-only traffic** with sysbench:
56+
57+
```bash
taskset -c 9 ./src/sysbench \
  --db-driver=mysql \
  --mysql-host=127.0.0.1 \
  --mysql-db=bench \
  --mysql-user=bench \
  --mysql-password=bench \
  --mysql-port=3306 \
  --tables=8 \
  --table-size=10000 \
  --threads=1 \
  src/lua/oltp_read_only.lua run
```
70+
71+
> Adjust this command as needed for your workload and CPU/core binding.
72+
73+
The `.fdata` file defined in `--instrumentation-file` will be populated with runtime execution data.
74+
75+
---
76+
77+
### Step 4: Verify the profile was created
78+
79+
After running the workload:
80+
81+
```bash
ls -lh /path/to/profile-readonly.fdata
```
84+
85+
You should see a non-empty file. This file will later be merged with other profiles (e.g., for write-only traffic) to generate a complete merged profile.
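
If you script the workflow, the same check can gate the later steps. The following is a minimal sketch; `check_profile` is a hypothetical helper name, and it only confirms that the file exists and is non-empty, not that its contents are a valid profile.

```bash
# Hypothetical helper: succeed only if the profile exists and is non-empty.
check_profile() {
    local profile="$1"
    if [ -s "$profile" ]; then
        echo "OK: $profile"
    else
        echo "MISSING OR EMPTY: $profile" >&2
        return 1
    fi
}
```

For example, `check_profile /path/to/profile-readonly.fdata || exit 1` stops a script before it attempts a merge with a missing profile.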
86+
87+
---
88+
89+
Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
1+
---
2+
title: BOLT Optimization - Second feature and profile merge
3+
weight: 4
4+
5+
### FIXED, DO NOT MODIFY
6+
layout: learningpathall
7+
---
8+
9+
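In this step, you'll collect profile data for a **write-only** workload, merge it with the read-only profile, and use the merged profile to generate an optimized binary. External libraries used by the application (such as `libcrypto.so` and `libssl.so`) are profiled and optimized independently of these steps.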
10+
11+
12+
### Step 1: Run Write-Only Workload for Application Binary
13+
14+
Drive the BOLT-instrumented MySQL binary with a write-only workload to capture `profile-writeonly.fdata`:
15+
16+
```bash
taskset -c 9 ./src/sysbench \
  --db-driver=mysql \
  --mysql-host=127.0.0.1 \
  --mysql-db=bench \
  --mysql-user=bench \
  --mysql-password=bench \
  --mysql-port=3306 \
  --tables=8 \
  --table-size=10000 \
  --threads=1 \
  src/lua/oltp_write_only.lua run
```
29+
30+
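The profile path is embedded in the binary when it is instrumented, so a binary instrumented with `--instrumentation-file=/path/to/profile-readonly.fdata` keeps writing to that file. Before running this workload, re-instrument `mysqld` with `--instrumentation-file=/path/to/profile-writeonly.fdata` and restart the server with the new binary.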
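
One way to keep the per-feature profiles apart is to generate the instrumentation command from the feature name. This is a minimal sketch under stated assumptions: `instrument_cmd`, `BIN`, and `PROFILE_DIR` are hypothetical names, the paths are placeholders, and the function only prints the command so that you can review it before running it.

```bash
# Placeholder locations; override them in the environment for your setup.
BIN="${BIN:-/path/to/mysqld}"
PROFILE_DIR="${PROFILE_DIR:-/path/to}"

# Hypothetical helper: print the llvm-bolt instrumentation command for one feature.
instrument_cmd() {
    local feature="$1"   # for example: readonly or writeonly
    echo "llvm-bolt $BIN" \
        "-instrument" \
        "-o $BIN.$feature.instrumented" \
        "--instrumentation-file=$PROFILE_DIR/profile-$feature.fdata" \
        "--instrumentation-sleep-time=5" \
        "--instrumentation-no-counters-clear" \
        "--instrumentation-wait-forks"
}

# Dry run: prints the command for the write-only feature.
instrument_cmd writeonly
```

Each feature gets its own instrumented binary and its own `.fdata` file, so a later run cannot overwrite an earlier profile.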
31+
---
32+
### Step 2: Verify the Second Profile Was Generated
33+
34+
```bash
ls -lh /path/to/profile-writeonly.fdata
```
37+
38+
Both `.fdata` files should now exist and contain valid data:
39+
40+
- `profile-readonly.fdata`
41+
- `profile-writeonly.fdata`
42+
43+
---
44+
45+
### Step 3: Merge the Feature Profiles
46+
47+
Use `merge-fdata` to combine the feature-specific profiles into one comprehensive `.fdata` file:
48+
49+
```bash
merge-fdata /path/to/profile-readonly.fdata /path/to/profile-writeonly.fdata \
  -o /path/to/profile-merged.fdata
```
53+
54+
**Example command from an actual setup:**
55+
56+
```bash
/home/ubuntu/llvm-latest/build/bin/merge-fdata prof-instrumentation-readonly.fdata prof-instrumentation-writeonly.fdata \
  -o prof-instrumentation-readwritemerged.fdata
```
60+
61+
Output:
62+
63+
```
Using legacy profile format.
Profile from 2 files merged.
```
67+
68+
This creates a single merged profile (`profile-merged.fdata`) covering both read-only and write-only workload behaviors.
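
When more than two features are profiled, it can help to build the merge command from whichever profiles actually exist. The following is a minimal sketch; `merge_cmd` is a hypothetical helper that skips missing or empty inputs and prints the resulting `merge-fdata` command rather than running it. It assumes the paths contain no spaces.

```bash
# Hypothetical helper: print a merge-fdata command covering all usable profiles.
# Usage: merge_cmd OUTPUT INPUT...
merge_cmd() {
    local out="$1"
    shift
    local inputs=""
    local f
    for f in "$@"; do
        if [ -s "$f" ]; then
            inputs="$inputs $f"
        else
            echo "skipping missing or empty profile: $f" >&2
        fi
    done
    # Refuse to build a command when no usable profile was found.
    [ -n "$inputs" ] || return 1
    echo "merge-fdata$inputs -o $out"
}
```

For example, `merge_cmd /path/to/profile-merged.fdata /path/to/profile-readonly.fdata /path/to/profile-writeonly.fdata` prints the merge command from this step if both profiles are present.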
69+
70+
---
71+
72+
### Step 4: Verify the Merged Profile
73+
74+
Check the merged `.fdata` file:
75+
76+
```bash
ls -lh /path/to/profile-merged.fdata
```
79+
80+
---
81+
### Step 5: Generate the Final Binary with the Merged Profile
82+
83+
Use LLVM-BOLT to generate the final optimized binary using the merged `.fdata` file:
84+
85+
```bash
llvm-bolt build/bin/mysqld \
  -o build/bin/mysqldreadwrite_merged.bolt_instrumentation \
  -data=/home/ubuntu/mysql-server-8.0.33/sysbench/prof-instrumentation-readwritemerged.fdata \
  -reorder-blocks=ext-tsp \
  -reorder-functions=hfsort \
  -split-functions \
  -split-all-cold \
  -split-eh \
  -dyno-stats \
  --print-profile-stats 2>&1 | tee bolt_orig.log
```
97+
98+
This command optimizes the binary layout based on the merged workload profile, creating a single binary (`mysqldreadwrite_merged.bolt_instrumentation`) that is optimized across both features.
99+
100+
