
Commit 36f516e: initial commit
Author: Your Name
1 parent 59f4931

File tree: 9 files changed (+356, -0 lines changed)
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
---
title: Getting Started with Profile Guided Optimisation and Google Benchmark

minutes_to_complete: 15

who_is_this_for: Developers who want to optimise the performance of a program using characteristics observed at runtime.

learning_objectives:
    - Learn how to write a function microbenchmark using Google Benchmark
    - Learn how to use profile guided optimisation to build binaries optimised for real-world workloads

prerequisites:
    - Basic C++ understanding
    - Access to an Arm-based Linux machine

author: Kieran Hejmadi

### Tags
skilllevels: Introductory
subjects: ML
armips:
    - Neoverse
tools_software_languages:
    - C++
    - Google Benchmark
    - G++
operatingsystems:
    - Linux

further_reading:
    - resource:
        title: G++ Profile Guided Optimisation Documentation
        link: https://gcc.gnu.org/onlinedocs/gcc-13.3.0/gcc/Instrumentation-Options.html
        type: documentation
    - resource:
        title: Google Benchmark Library
        link: https://github.com/google/benchmark
        type: documentation


### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1                       # _index.md always has weight of 1 to order correctly
layout: "learningpathall"       # All files under learning paths have this same wrapper
learning_path_main_page: "yes"  # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21                  # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps"         # Always the same, html page title.
layout: "learningpathall"   # All files under learning paths have this same wrapper for Hugo processing.
---
Two binary image files added (3.57 MB and 636 KB).
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
---
title: Introduction to Profile-Guided Optimisation
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Introduction to Profile-Guided Optimisation

### What is Profile-Guided Optimisation (PGO) and How Does It Work?

Profile-Guided Optimisation (PGO) is a compiler optimisation technique that improves program performance by using real-world execution data. In GCC/G++, PGO is a two-step process: first, compile the program with the `-fprofile-generate` flag to produce an instrumented binary that collects profiling data while it runs; second, recompile the program with the `-fprofile-use` flag so that the compiler can use the collected data to make better-informed optimisation decisions. This allows the compiler to identify frequently executed ("hot") paths and optimise them more aggressively, while placing less emphasis on rarely executed code.

### When to Use Profile-Guided Optimisation

PGO is particularly beneficial in the later stages of development, once the codebase has stabilised. It is most effective for applications where performance is critical and runtime behaviour is complex or data-dependent. For instance, optimising "hot" functions, those executed most frequently, can lead to significant performance improvements. By focusing on these critical sections, PGO ensures that the most impactful parts of the code are optimised based on actual usage patterns.
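
As a concrete illustration, consider a function whose branch behaviour depends entirely on its input data. The sketch below is hypothetical (the `clamp_sum` function is not part of this Learning Path's example code); with a representative profile, the compiler learns which side of the branch dominates and can lay out and optimise the hot path accordingly.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical example: the profile collected with -fprofile-generate tells
// the compiler that `value < threshold` holds for the vast majority of
// elements. When rebuilt with -fprofile-use, the compiler can keep the hot
// branch on the fall-through path, unroll or vectorise it, and move the
// rarely taken branch out of line.
std::int64_t clamp_sum(const std::vector<int>& data, int threshold) {
    std::int64_t sum = 0;
    for (int value : data) {
        if (value < threshold)   // hot path according to the profile
            sum += value;
        else                     // cold path, rarely taken in practice
            sum += threshold;
    }
    return sum;
}
```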

### Limitations of Profile-Guided Optimisation and When Not to Use It

While PGO offers substantial performance benefits, it has limitations. The profiling data must accurately represent typical usage scenarios; otherwise, the optimisations may not deliver the expected improvements and can even degrade performance.

Additionally, PGO requires extra build steps, which inevitably increases compile time and can be an issue for large codebases. As such, PGO is not suitable for every part of a program: we recommend applying it only to code that is heavily influenced by runtime behaviour and is performance critical. For the same reasons, PGO is usually not ideal for early-stage development or for applications with highly variable or unpredictable usage patterns.
Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
---
title: Introduction to Google Benchmark
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Google Benchmark

Google Benchmark is a C++ library designed to make microbenchmarking code quick and reliable. It simplifies the process of writing microbenchmarks by providing a structured framework that automatically handles iterations, timing, and statistical analysis. This lets developers focus on optimising their code rather than writing main functions, refactoring source code to run in a test harness, and trying to anticipate unwanted compiler optimisations.

To use Google Benchmark, you define a function containing the code you want to measure. This function accepts a `benchmark::State&` parameter and iterates over it to perform the benchmark. You then register the function with the `BENCHMARK` macro and include `BENCHMARK_MAIN()` to generate the main function for the benchmark executable. Here's a basic example:

```cpp
#include <benchmark/benchmark.h>
#include <string>

// Measure the cost of default-constructing an empty std::string
static void BM_StringCreation(benchmark::State& state) {
  for (auto _ : state)
    std::string empty_string;
}
// Register the function as a benchmark
BENCHMARK(BM_StringCreation);

// Generate the benchmark executable's main()
BENCHMARK_MAIN();
```

### Filtering and Preventing Compiler Optimisations

To ensure that the compiler does not optimise away parts of your benchmarked code, Google Benchmark provides `benchmark::DoNotOptimize(value);`. This prevents the compiler from eliminating a variable or expression by forcing it to be read and stored.
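
For example, here is a minimal sketch of how `DoNotOptimize` can be used; the `BM_SumToN` benchmark is hypothetical and not part of this Learning Path's example code.

```cpp
#include <benchmark/benchmark.h>

// Without DoNotOptimize, the compiler could see that `sum` is never used
// and delete the entire loop, leaving nothing to measure.
static void BM_SumToN(benchmark::State& state) {
  for (auto _ : state) {
    long sum = 0;
    for (int i = 0; i < 1000; ++i)
      sum += i;
    benchmark::DoNotOptimize(sum);  // force the result to be kept
  }
}
BENCHMARK(BM_SumToN);

BENCHMARK_MAIN();
```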

Additionally, to run a specific subset of benchmarks, you can use the `--benchmark_filter` command-line option with a regular expression, for example `./benchmark_binary --benchmark_filter=BM_String.*`, so you don't need to repeatedly comment out benchmarks in the source code.

For more detailed information and advanced usage, refer to the [official Google documentation](https://github.com/google/benchmark).
Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
---
title: Division Example
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Install prerequisites

In this example I am connecting to an AWS `c7g.xlarge` instance running Ubuntu 24.04 LTS. Run the following commands to install the prerequisite packages.

```bash
sudo apt update
sudo apt install gcc g++ make libbenchmark-dev -y
```

## Division example

Copy and paste the `C++` source code below into a file named `div_bench.cpp`. This example takes a vector of 4096 32-bit integers and divides each element by a number. Importantly, passing the divisor through `benchmark::State` introduces a level of indirection, so the value is unknown at compile time even though it is visible in the source code as 1500.

```cpp
#include <benchmark/benchmark.h>
#include <vector>

// Benchmark the division instruction
static void baseDiv(benchmark::State &s) {
  std::vector<int> v_in(4096);
  std::vector<int> v_out(4096);

  for (auto _ : s) {
    // s.range(0) is unknown at compile time, so the division cannot be
    // strength-reduced to cheaper multiply/shift instructions
    for (size_t i = 0; i < v_in.size(); i++) v_out[i] = v_in[i] / s.range(0);
  }
}

// The value 1500 is passed through as a runtime argument so strength reduction cannot be applied
BENCHMARK(baseDiv)->Arg(1500)->Unit(benchmark::kMicrosecond);

BENCHMARK_MAIN();
```

To compile and run the microbenchmark we need to link against `pthread` and `benchmark` with the following command.

```bash
g++ -O3 -std=c++17 div_bench.cpp -lbenchmark -lpthread -o div_bench.base
```

Running the resulting binary, `div_bench.base`, produces the following output.

```output
Running ./div_bench.base
Run on (4 X 2100 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x4)
  L1 Instruction 64 KiB (x4)
  L2 Unified 1024 KiB (x4)
  L3 Unified 32768 KiB (x1)
Load Average: 0.00, 0.00, 0.00
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------
Benchmark           Time             CPU   Iterations
-------------------------------------------------------
baseDiv/1500     7.90 us         7.90 us        88512
```

### Inspect Assembly

To inspect which assembly instructions are executed most frequently, we can use the `perf` command. Please install `perf` using the [installation instructions](https://learn.arm.com/install-guides/perf/) before proceeding.

{{% notice Please Note %}}
You may need to set the `perf_event_paranoid` value to 0 with the `sudo sysctl kernel.perf_event_paranoid=0` command.
{{% /notice %}}

Run the following commands to record `perf` data and create a report in the terminal.

```bash
sudo perf record -o perf-division-base ./div_bench.base
sudo perf report --input=perf-division-base
```

![before-pgo](./before-pgo.gif)
Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
---
title: Using Profile Guided Optimisation
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

### Building the binary with PGO

To generate a binary optimised using the runtime profile, we first need to build an instrumented binary that can record usage data. Run the following command, which includes the `-fprofile-generate` flag, to build the instrumented binary.

```bash
g++ -O3 -std=c++17 -fprofile-generate div_bench.cpp -lbenchmark -lpthread -o div_bench.opt
```

Next, run the binary to record the profile.

```bash
./div_bench.opt
```

An output file with a `.gcda` extension should be generated in the same directory. To incorporate this profile into the compilation, run the following command with the `-fprofile-use` flag.

```bash
g++ -O3 -std=c++17 -fprofile-use div_bench.cpp -lbenchmark -lpthread -o div_bench.opt
```

### Running the optimised binary

Running the newly created `div_bench.opt` binary, we observe the following improvement.

```output
Running ./div_bench.opt
Run on (4 X 2100 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x4)
  L1 Instruction 64 KiB (x4)
  L2 Unified 1024 KiB (x4)
  L3 Unified 32768 KiB (x1)
Load Average: 0.10, 0.03, 0.01
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------
Benchmark           Time             CPU   Iterations
-------------------------------------------------------
baseDiv/1500     2.86 us         2.86 us       244429
```

As the terminal output above shows, the average execution time drops from 7.90 to 2.86 microseconds. This is because the profile data shows that the input divisor is always 1500, and the compiler is able to incorporate that context into the optimised build. Next, let's understand how it was optimised.
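
Conceptually, the profile enables a form of value specialisation. The sketch below is only an illustration of the kind of code the compiler can generate, not its actual output: because the recorded profile shows the divisor is always 1500, the compiler can emit a fast path that divides by a compile-time constant, which can be strength-reduced to cheaper multiply and shift instructions, while keeping a generic division as a fallback.

```cpp
#include <cstddef>
#include <vector>

// Illustrative only: roughly what a PGO build can do with the inner loop
// when the profile says `divisor` is almost always 1500.
void divide_all(const std::vector<int>& v_in, std::vector<int>& v_out, int divisor) {
    if (divisor == 1500) {
        // Hot, specialised path: constant divisor, so no hardware divide is
        // needed and the loop can be unrolled or vectorised more aggressively.
        for (std::size_t i = 0; i < v_in.size(); i++)
            v_out[i] = v_in[i] / 1500;
    } else {
        // Cold, generic fallback kept for correctness.
        for (std::size_t i = 0; i < v_in.size(); i++)
            v_out[i] = v_in[i] / divisor;
    }
}
```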

### Inspect Assembly

Run the following commands to record `perf` data and create a report that can be viewed in the terminal.

```bash
sudo perf record -o perf-division-opt ./div_bench.opt
sudo perf report --input=perf-division-opt
```

As the graphic below shows, the profile allowed the compiler to unroll the loop several times and use slightly different instructions.

![after-pgo](./after-pgo.gif)
Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
---
title: (Optional) Incorporating PGO into a CI system
weight: 6

### FIXED, DO NOT MODIFY
layout: learningpathall
---

### Building locally with Make

Since PGO is driven by simple command-line steps, it can easily be incorporated into a Makefile when building locally, as in the sample below.

```makefile
# Simple Makefile for building and benchmarking div_bench with and without PGO
# NOTE: recipe lines must be indented with a tab character

# Compiler and flags
CXX := g++
CXXFLAGS := -O3 -std=c++17
LDLIBS := -lbenchmark -lpthread

# Default target: build both binaries
.PHONY: all clean clean-gcda clean-perf run
all: div_bench.base div_bench.opt

# Build the baseline binary (no PGO)
div_bench.base: div_bench.cpp
	$(CXX) $(CXXFLAGS) $< $(LDLIBS) -o $@

# Build the PGO-optimised binary: instrument, run to collect a profile, then rebuild using it
div_bench.opt: div_bench.cpp
	$(MAKE) clean-gcda
	$(CXX) $(CXXFLAGS) -fprofile-generate $< $(LDLIBS) -o $@
	@echo "Running instrumented binary to gather profile data..."
	./div_bench.opt
	$(CXX) $(CXXFLAGS) -fprofile-use $< $(LDLIBS) -o $@

# Remove stale profiling data
clean-gcda:
	rm -f ./*.gcda

# Remove perf recordings
clean-perf:
	rm -f perf-division-base perf-division-opt

# Remove all generated files
clean: clean-gcda clean-perf
	rm -f div_bench.base div_bench.opt

# Run both benchmarks with informative headers
run: div_bench.base div_bench.opt
	@echo "==================== Without Profile-Guided Optimization ===================="
	./div_bench.base
	@echo "==================== With Profile-Guided Optimization ===================="
	./div_bench.opt
```

### Building with GitHub Actions

The `yaml` file below is a basic example of integrating profile-guided optimisation into your CI flow. It could be extended with further checks, for example to detect performance regressions between the baseline and PGO builds.

```yaml
name: PGO Benchmark

on:
  push:
    branches: [ main ]

jobs:
  build:
    runs-on: ubuntu-24.04-arm

    steps:
      - name: Check out source
        uses: actions/checkout@v3

      - name: Install dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y libbenchmark-dev g++

      - name: Clean previous profiling data
        run: |
          rm -f ./*.gcda
          rm -f div_bench.base div_bench.opt

      - name: Compile base and instrumented binary
        run: |
          g++ -O3 -std=c++17 div_bench.cpp -lbenchmark -lpthread -o div_bench.base
          g++ -O3 -std=c++17 -fprofile-generate div_bench.cpp -lbenchmark -lpthread -o div_bench.opt

      - name: Generate profile data and compile with PGO
        run: |
          ./div_bench.opt
          g++ -O3 -std=c++17 -fprofile-use div_bench.cpp -lbenchmark -lpthread -o div_bench.opt

      - name: Run benchmarks
        run: |
          echo "==================== Without Profile-Guided Optimization ===================="
          ./div_bench.base
          echo "==================== With Profile-Guided Optimization ===================="
          ./div_bench.opt
          echo "==================== Benchmarking complete ===================="
```
