Commit 5d18b40

Author: Your Name
Commit message: review content before PR
1 parent 36f516e, commit 5d18b40

6 files changed: +25 -16 lines changed

content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/_index.md

Lines changed: 3 additions & 3 deletions
@@ -1,12 +1,12 @@
11
---
2-
title: Getting Started with Profile Guided Optimisation and Google Benchmark
2+
title: Optimizing Performance with Profile-Guided Optimization and Google Benchmark
33

44
minutes_to_complete: 15
55

6-
who_is_this_for: Developers who are looking to optimise the performance of a program using the observed characteristics at runtime.
6+
who_is_this_for: Developers who are looking to optimise the performance of a program using the characteristics observed at runtime.
77

88
learning_objectives:
9-
- Learn how to write a function microbenchmark using Google Benchmark
9+
- Learn how to microbenchmark a function using Google Benchmark
1010
- Learn how to use profile guided optimisation to build binaries optimised for real-world workloads
1111

1212
prerequisites:

content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-1.md

Lines changed: 5 additions & 4 deletions
@@ -6,19 +6,20 @@ weight: 2
66
layout: learningpathall
77
---
88

9-
## Introduction of Profile Guided Optimisation
9+
## Introduction to Profile Guided Optimisation
1010

1111
### What is Profile-Guided Optimization (PGO) and How Does It Work?
1212

13-
Profile-Guided Optimization (PGO) is a compiler optimization technique that enhances program performance by utilizing real-world execution data. In GCC/G++, PGO involves a two-step process: first, compiling the program with the -fprofile-generate flag to produce an instrumented binary that collects profiling data during execution; and second, recompiling the program with the -fprofile-use flag, allowing the compiler to leverage the collected data to make informed optimization decisions. This approach enables the compiler to identify frequently executed paths—known as “hot” paths—and optimize them more aggressively, while potentially reducing the emphasis on less critical code paths.
13+
Profile-Guided Optimization (PGO) is a compiler optimization technique that enhances program performance by utilizing real-world execution data. In GCC/G++, PGO involves a two-step process: first, compiling the program with the `-fprofile-generate` flag to produce an instrumented binary that collects profiling data during execution; and second, recompiling the program with the `-fprofile-use` flag, allowing the compiler to leverage the collected data to make informed optimization decisions. This approach enables the compiler to identify frequently executed paths—known as “hot” paths—and optimize them more aggressively, while potentially reducing the emphasis on less critical code paths.
1414

1515
### When to Use Profile-Guided Optimization
1616

17-
PGO is particularly beneficial in the later stages of development, once the codebase has stabilized. It’s most effective for applications where performance is critical and runtime behavior is complex or data-dependent. For instance, optimizing “hot” functions—those that are executed frequently—can lead to significant performance improvements. By focusing on these critical sections, PGO ensures that the most impactful parts of the code are optimized based on actual usage patterns.
17+
PGO is particularly beneficial in the later stages of development, when a real-world workload can be applied. It’s most effective for applications where performance is critical and runtime behavior is complex or data-dependent. For instance, optimizing “hot” functions that are executed frequently can lead to significant performance improvements. By focusing on these critical sections, PGO ensures that the most impactful parts of the code are optimized based on actual usage patterns.
1818

1919
### Limitations of Profile-Guided Optimization and When Not to Use
2020

21-
While PGO offers substantial performance benefits, it has certain limitations. The profiling data must accurately represent typical usage scenarios; otherwise, the optimizations may not yield the desired performance improvements and could even degrade performance in some cases.
21+
While PGO offers substantial performance benefits, it has limitations. The profiling data must accurately represent typical usage scenarios; otherwise, the optimizations may not yield the desired performance improvements and could even degrade performance.
2222

2323
Additionally, the process requires additional build steps, which inevitably increase compile time; this can be an issue for large code bases. As such, PGO is not suitable for all sections of code. We recommend using PGO only on sections of code which are heavily influenced by run-time behaviour and are performance critical. Therefore, PGO might not be ideal for early-stage development or for applications with highly variable or unpredictable usage patterns.
2424

25+
Please refer to the [GCC documentation](https://gcc.gnu.org/onlinedocs/gcc-13.3.0/gcc/Instrumentation-Options.html) for more information.

content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-2.md

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ BENCHMARK(BM_StringCreation);
2424
BENCHMARK_MAIN();
2525
```
2626
27-
Filtering and preventing Compiler Optimisations
27+
### Filtering and preventing Compiler Optimisations
2828
2929
To ensure that the compiler does not optimize away parts of your benchmarked code, Google Benchmark provides the function `benchmark::DoNotOptimize(value);`. This prevents the compiler from optimizing away a variable or expression by forcing it to be read and stored.
3030
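
To illustrate the idea behind `benchmark::DoNotOptimize`, here is a minimal, self-contained sketch. The `keep_alive` helper and the `sum_of_squares` workload are hypothetical names used only for illustration and are not part of Google Benchmark. The empty inline-assembly barrier assumes GCC or Clang, and the library's real implementation may differ.

```cpp
#include <cstdint>

// Hypothetical stand-in for benchmark::DoNotOptimize (assumes GCC or Clang).
// The empty asm statement tells the compiler that `value` is used and that
// memory may be read or written, so the computation feeding it is kept.
template <typename T>
inline void keep_alive(const T& value) {
    asm volatile("" : : "r,m"(value) : "memory");
}

// Hypothetical workload: without keep_alive, an optimising compiler could
// discard this loop entirely if the caller ignores the result.
inline std::uint64_t sum_of_squares(std::uint32_t n) {
    std::uint64_t total = 0;
    for (std::uint32_t i = 1; i <= n; ++i) {
        total += static_cast<std::uint64_t>(i) * i;
        keep_alive(total);  // treat the running total as observed each iteration
    }
    return total;
}
```

In a real benchmark you would call `benchmark::DoNotOptimize` on the value instead of the hypothetical helper above.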

content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-3.md

Lines changed: 5 additions & 2 deletions
@@ -18,7 +18,7 @@ sudo apt install gcc g++ make libbenchmark-dev -y
1818

1919
## Division example
2020

21-
Copy and paste the `C++` source code below into a file named `div_bench.cpp`. This example takes in a vector of 4096 32-bit integers and divides each element by a number. Importantly, the `benchmark/benchmark.h` results in indirection so that the value is unknown compile time, although it is visible in our source code as 1500.
21+
Copy and paste the `C++` source code below into a file named `div_bench.cpp`. This trivial example takes in a vector of 4096 32-bit integers and divides each element by a number. Importantly, using `benchmark/benchmark.h` introduces indirection, so the divisor value is unknown at compile time, although it is visible in our source code as 1500.
2222

2323
```cpp
2424
#include <benchmark/benchmark.h>
@@ -70,7 +70,7 @@ baseDiv/1500 7.90 us 7.90 us 88512
7070
To inspect what assembly instructions are being executed the most frequently, we can use the `perf` command. Please install `perf` using the [installation instructions](https://learn.arm.com/install-guides/perf/) before proceeding.
7171

7272
{{% notice Please Note %}}
73-
You may need to set the `perf_event_paranoid` value to 0 with the `sudo sysctl kernel.perf_event_paranoid=0` command
73+
You may need to set the `perf_event_paranoid` value to -1 with the `sudo sysctl kernel.perf_event_paranoid=-1` command
7474
{{% /notice %}}
7575

7676

@@ -80,5 +80,8 @@ Run the following command to record `perf` data and create a report in the termi
8080
sudo perf record -o perf-division-base ./div_bench.base
8181
sudo perf report --input=perf-division-base
8282
```
83+
84+
As the `perf report` graphic below shows, our program spends a significant amount of time in the short loops with no loop unrolling. There is also the relatively expensive `sdiv` operation and we spend most of the execution time storing the result of that operation.
85+
8386
![before-pgo](./before-pgo.gif)
8487
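
For readers unfamiliar with loop unrolling, the sketch below contrasts a plain loop with a hand-unrolled one. Both function names are hypothetical and this is not the `div_bench.cpp` source. It only shows the shape of the transformation that the base binary is reported to lack.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Straightforward loop: one division, one store, and one loop-control check
// per element.
inline void divide_rolled(std::vector<std::int32_t>& data, std::int32_t divisor) {
    for (std::size_t i = 0; i < data.size(); ++i) {
        data[i] /= divisor;
    }
}

// Hand-unrolled by four: the same work with fewer loop-control checks per
// element. Compilers can apply this transformation automatically.
inline void divide_unrolled(std::vector<std::int32_t>& data, std::int32_t divisor) {
    const std::size_t n = data.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        data[i]     /= divisor;
        data[i + 1] /= divisor;
        data[i + 2] /= divisor;
        data[i + 3] /= divisor;
    }
    for (; i < n; ++i) {  // handle any leftover elements
        data[i] /= divisor;
    }
}
```

Both functions produce identical results; the unrolled form simply trades code size for less loop overhead.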

content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-4.md

Lines changed: 4 additions & 4 deletions
@@ -8,7 +8,7 @@ layout: learningpathall
88

99
### Building binary with PGO
1010

11-
To generate an binary optimised on the runtime profile. First we need to build an instrumented binary that can record the usage. Run the following command, that includes the `-fprofile-generate` flag to build the instrumented binary.
11+
To generate a binary optimised on the runtime profile, we first need to build an instrumented binary that can record the usage. Run the following command, which includes the `-fprofile-generate` flag, to build the instrumented binary.
1212

1313
```bash
1414
g++ -O3 -std=c++17 -fprofile-generate div_bench.cpp -lbenchmark -lpthread -o div_bench.opt
@@ -45,18 +45,18 @@ Benchmark Time CPU Iterations
4545
baseDiv/1500 2.86 us 2.86 us 244429
4646
```
4747

48-
As the terminal output above shows, we have reduced our average execution time from 7.90 to 2.86 microseconds. This is because we are able to provide the context that the profile data shows the input divisor is always 1500 and the compiler is able to incorporate this context. Next, let's understand how it was optimised.
48+
As the terminal output above shows, we have reduced our average execution time from 7.90 to 2.86 microseconds. **This is because the profile data shows that the input divisor is always 1500, and the compiler is able to incorporate this into the optimisation process**. Next, let's understand how it was optimised.
4949

5050
### Inspect Assembly
5151

5252

53-
Run the following command to record `perf` data and create a report that can be viewed in the terminal.
53+
As per the previous section, run the following command to record `perf` data and create a report that can be viewed in the terminal.
5454

5555
```bash
5656
sudo perf record -o perf-division-opt ./div_bench.opt
5757
sudo perf report --input=perf-division-opt
5858
```
5959

60-
As the graphic below shows, the profile provided allowed the optimised program to unroll several times and use slightly different instructions.
60+
As the graphic below shows, the profile allowed the compiler to unroll the loop several times and to replace the expensive division with a larger number of cheaper operations (a technique known as strength reduction), so the loop executes far more quickly.
6161

6262
![after-pgo](./after-pgo.gif)
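
As a rough illustration of strength reduction, the sketch below replaces division by 1500 with a shift, a multiplication, and another shift, and guards it with a check on the divisor. The function names, the unsigned types, and the magic constant are assumptions chosen for this example. The instruction sequence GCC actually emits for the signed division in the benchmark may differ.

```cpp
#include <cstdint>

// Divide by the constant 1500 without a division instruction.
// 1500 = 4 * 375, so shift right by 2, multiply by ceil(2^39 / 375), then
// shift right by 39. The result is exact for every 32-bit unsigned input.
inline std::uint32_t div_by_1500(std::uint32_t n) {
    constexpr std::uint64_t kMagic = 1466015504ULL;  // ceil(2^39 / 375)
    return static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(n >> 2) * kMagic) >> 39);
}

// Shape of a value-specialised division: a fast path for the divisor seen in
// the profile, and a generic division for anything else.
inline std::uint32_t div_specialised(std::uint32_t n, std::uint32_t divisor) {
    if (divisor == 1500u) {
        return div_by_1500(n);
    }
    return n / divisor;
}
```

The fast path is only worthwhile when the profiled value really dominates at runtime, which is why representative profiling data matters.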

content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-5.md

Lines changed: 7 additions & 2 deletions
@@ -8,7 +8,12 @@ layout: learningpathall
88

99
### Building locally with Make
1010

11-
Since PGO can be used by simple command-line instructions, we can trivially incorporate this into a `make` file, as per the sample Makefile below if building locally.
11+
As PGO can be utilized with simple command-line instructions, it can easily be integrated into a `make` file and continuous integration (CI) systems, as demonstrated in the sample Makefile below for local builds.
12+
13+
{{% notice Caution %}}
14+
PGO requires additional build steps, which inevitably increase compile time; this can be an issue for large code bases. As such, PGO is not suitable for all sections of code. We recommend using PGO only on sections of code which are heavily influenced by run-time behaviour and are performance critical. Therefore, PGO might not be ideal for early-stage development or for applications with highly variable or unpredictable usage patterns.
15+
{{% /notice %}}
16+
1217

1318
```makefile
1419
# Simple Makefile for building and benchmarking div_bench with and without PGO
@@ -50,7 +55,7 @@ run: div_bench.base div_bench.opt
5055

5156
### Building with GitHub Actions
5257

53-
The `yaml` file below can serve as an basic example of integrating profile guided optimisation into your CI flow. Further tests could be to check for regressions.
58+
Alternatively, the `yaml` file below can serve as a basic example of integrating profile guided optimisation into your CI flow. This barebones example natively compiles on a GitHub-hosted Ubuntu 24.04 Arm-based runner. Further tests could automatically check for regressions.
5459

5560
```yaml
5661
name: PGO Benchmark
