Commit feff99a

Merge pull request #1903 from jasonrandrews/review

Review PGO Learning Path

2 parents f52f9eb + a9a7596 commit feff99a

File tree

6 files changed: +111 / -62 lines changed

content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/_index.md

Lines changed: 7 additions & 8 deletions
@@ -7,15 +7,15 @@ cascade:
 
 minutes_to_complete: 15
 
-who_is_this_for: Developers who are looking to optimise the performance of a program using the characteristics observed at runtime.
+who_is_this_for: Developers who are looking to optimize C++ performance using characteristics observed at runtime.
 
 learning_objectives:
-- Learn how to microbenchmark a function using Google Benchmark
-- Learn how to use profile guided optimisation to build binaries optimised for real-world workloads
+- Learn how to microbenchmark a function using Google Benchmark.
+- Learn how to use profile guided optimization to build binaries optimized for real-world workloads.
 
 prerequisites:
-- Basic C++ understanding
-- Access to an Arm-based linux machine
+- Basic C++ understanding.
+- Access to an Arm-based Linux machine.
 
 author: Kieran Hejmadi
 
@@ -25,15 +25,14 @@ subjects: ML
 armips:
 - Neoverse
 tools_software_languages:
-- C++
 - Google Benchmark
-- G++
+- Runbook
 operatingsystems:
 - Linux
 
 further_reading:
 - resource:
-title: G++ Profile Guided Optimisation Documentation
+title: G++ Profile Guided Optimization Documentation
 link: https://gcc.gnu.org/onlinedocs/gcc-13.3.0/gcc/Instrumentation-Options.html
 type: documentation
 - resource:
Lines changed: 9 additions & 11 deletions
@@ -1,25 +1,23 @@
 ---
-title: Introduction to Profile-Guided Optimisation
+title: Introduction to Profile-Guided Optimization
 weight: 2
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
-## Introduction to Profile Guided Optimisation
+### What is Profile-Guided Optimization (PGO) and how does it work?
 
-### What is Profile-Guided Optimization (PGO) and How Does It Work?
+Profile-Guided Optimization (PGO) is a compiler optimization technique that enhances program performance by utilizing real-world execution data. In GCC/G++, PGO involves a two-step process: first, compile the program with the `-fprofile-generate` flag to produce an instrumented binary that collects profiling data during execution; and second, recompile the program with the `-fprofile-use` flag, allowing the compiler to leverage the collected data to make informed optimization decisions. This approach identifies frequently executed paths—known as “hot” paths—and optimizes them more aggressively, while potentially reducing emphasis on less critical code paths.
 
-Profile-Guided Optimization (PGO) is a compiler optimization technique that enhances program performance by utilizing real-world execution data. In GCC/G++, PGO involves a two-step process: first, compiling the program with the `-fprofile-generate` flag to produce an instrumented binary that collects profiling data during execution; and second, recompiling the program with the `-fprofile-use` flag, allowing the compiler to leverage the collected data to make informed optimization decisions. This approach enables the compiler to identify frequently executed paths—known as “hot” paths—and optimize them more aggressively, while potentially reducing the emphasis on less critical code paths.
+### When should I use Profile-Guided Optimization?
 
-### When to Use Profile-Guided Optimization
+PGO is particularly beneficial in the later stages of development when real-world workloads are available. It is most effective for applications where performance is critical and runtime behavior is complex or data-dependent. For instance, consider optimizing “hot” functions that execute frequently. Doing so ensures that the most impactful parts of your code are optimized based on actual usage patterns.
 
-PGO is particularly beneficial in the later stages of development when the real-world workload can be applied. It’s most effective for applications where performance is critical and runtime behavior is complex or data-dependent. For instance, optimizing “hot” functions that are executed frequently. By focusing on these critical sections, PGO ensures that the most impactful parts of the code are optimized based on actual usage patterns.
+### What are the limitations of Profile-Guided Optimization and when should I avoid it?
 
-### Limitations of Profile-Guided Optimization and When Not to Use
+While PGO offers substantial performance benefits, it has limitations. The profiling data must accurately represent typical usage scenarios; otherwise, the optimizations may not deliver the desired performance improvements and could even degrade performance.
 
-While PGO offers substantial performance benefits, it has limitations. The profiling data must accurately represent typical usage scenarios; otherwise, the optimizations may not yield the desired performance improvements and could even degrade performance.
+Additionally, the process requires extra build steps, potentially increasing compile times for large codebases. Therefore, use PGO only on performance-critical sections that are heavily influenced by actual runtime behavior. PGO might not be ideal for early-stage development or applications with highly variable or unpredictable usage patterns.
 
-Additionally, the process requires additional build steps which will inevitably increase compile time which can be an issue for large code bases. As such, PGO is not suitable for all sections of code. We recommend only using PGO only sections of code which are heavily influenced by run-time behaviour and are performance critical. Therefore, PGO might not be ideal for early-stage development or for applications with highly variable or unpredictable usage patterns.
-
-Please refer to the [GCC documentation](https://gcc.gnu.org/onlinedocs/gcc-13.3.0/gcc/Instrumentation-Options.html) for more information.
+Please refer to the [GCC documentation](https://gcc.gnu.org/onlinedocs/gcc-13.3.0/gcc/Instrumentation-Options.html) for further details on enabling and using PGO.

content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-2.md

Lines changed: 14 additions & 6 deletions
@@ -8,9 +8,11 @@ layout: learningpathall
 
 ## Google Benchmark
 
-Google Benchmark is a C++ library designed to expedite the microbenchmarking of code. It simplifies the process of writing microbenchmarks by providing a structured framework that automatically handles iterations, timing, and statistical analysis. This allows developers to focus on optimizing their code rather than writing main functions, refactoring source code to run in a testing scenario and trying to anticipate any unwanted compiler optimisations.
+Google Benchmark is a C++ library specifically designed for microbenchmarking – measuring the performance of small code snippets with high accuracy. Microbenchmarking is essential for identifying bottlenecks and optimizing critical sections of code, especially in performance-sensitive applications. Google Benchmark simplifies this process by providing a framework that handles common tasks like managing iterations, timing execution, and performing statistical analysis. This allows you to focus on the code being measured rather than writing boilerplate code for testing scenarios or trying to prevent unwanted compiler optimizations.
 
-To use Google Benchmark, you define a function that contains the code you want to measure. This function should accept a `benchmark::State&` parameter and iterate over it to perform the benchmarking. You then register this function using the `BENCHMARK` macro and include `BENCHMARK_MAIN()` to create the main function for the benchmark executable. Here's a basic example:
+To use Google Benchmark, you define a function that contains the code you want to measure. This function should accept a `benchmark::State&` parameter and iterate over it to perform the benchmarking. You then register this function using the `BENCHMARK` macro and include `BENCHMARK_MAIN()` to create the main function for the benchmark executable.
+
+Here's a basic example:
 
 ```cpp
 #include <benchmark/benchmark.h>
@@ -24,11 +26,17 @@ BENCHMARK(BM_StringCreation);
 BENCHMARK_MAIN();
 ```
 
-### Filtering and preventing Compiler Optimisations
+### Filtering and Preventing Compiler Optimizations
 
-To ensure that the compiler does not optimize away parts of your benchmarked code, Google Benchmark provides the function `benchmark::DoNotOptimize(value);`. This Prevents the compiler from optimizing away a variable or expression by forcing it to be read and stored.
+Google Benchmark provides tools to ensure accurate measurements by preventing the compiler from optimizing away parts of your benchmarked code:
 
-Additionally, to run a specific subset of benchmarks, you can use the `--benchmark_filter` command-line option with a regular expression. For example `./benchmark_binary --benchmark_filter=BM_String.*` so you don't need to repeatedly comment out lines of source code.
+1. **Preventing Optimizations**: Use `benchmark::DoNotOptimize(value);` to force the compiler to read and store a variable or expression, ensuring it is not optimized away.
+
+2. **Filtering Benchmarks**: To run a specific subset of benchmarks, use the `--benchmark_filter` command-line option with a regular expression. For example:
+
+```bash
+./benchmark_binary --benchmark_filter=BM_String.*
+```
+This eliminates the need to repeatedly comment out lines of source code.
 
-For more detailed information and advanced usage, refer to the [official Google documentation](https://github.com/google/benchmark).
+For more detailed information and advanced usage, refer to the [official documentation](https://github.com/google/benchmark).

content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-3.md

Lines changed: 28 additions & 11 deletions
@@ -6,10 +6,17 @@ weight: 4
 layout: learningpathall
 ---
 
-## Install prerequisites
+## Introduction
 
-In this example I am connecting to an AWS-based, `c7g.xlarge` instance running Ubuntu 24.04 LTS. Run the following commands to install the prerequisite packages.
+In this section, you'll learn how to use Google Benchmark and Profile-Guided Optimization to improve the performance of a simple division operation. This example demonstrates how even seemingly straightforward operations can benefit from optimization techniques.
+
+Integer division is an excellent operation to benchmark because it's typically much more expensive than other arithmetic operations like addition, subtraction, or multiplication. On most CPU architectures, including Arm, division instructions have higher latency and lower throughput compared to other arithmetic operations. By applying Profile-Guided Optimization to code containing division operations, we can potentially achieve significant performance improvements.
+
+## What tools are needed to run a Google Benchmark example?
+
+For this example, you can use any Arm Linux computer. For example, an AWS EC2 `c7g.xlarge` instance running Ubuntu 24.04 LTS can be used.
+
+Run the following commands to install the prerequisite packages:
 
 ```bash
 sudo apt update
@@ -18,7 +25,9 @@ sudo apt install gcc g++ make libbenchmark-dev -y
 
 ## Division example
 
-Copy and paste the `C++` source code below into a file named `div_bench.cpp`. This trivial example takes in a vector of 4096 32-bit integers and divides each element by a number. Importantly, the `benchmark/benchmark.h` causes indirection since the divisor value is unknown compile time, although it is visible in our source code as 1500.
+Use an editor to copy and paste the C++ source code below into a file named `div_bench.cpp`.
+
+This trivial example takes in a vector of 4096 32-bit integers and divides each element by a number. Importantly, the use of `benchmark/benchmark.h` introduces indirection since the divisor value is unknown at compile time, although it is visible in the source code as 1500.
 
 ```cpp
 #include <benchmark/benchmark.h>
@@ -40,13 +49,21 @@ BENCHMARK(baseDiv)->Arg(1500)->Unit(benchmark::kMicrosecond); // value of 1500 i
 BENCHMARK_MAIN();
 ```
 
-To compile as run the microbenchmark on this function we need to link against `pthreads` and `benchmark` with the following commands.
+To compile and run the microbenchmark on this function, you need to link with the `pthreads` and `benchmark` libraries.
+
+Compile with the command:
 
 ```bash
 g++ -O3 -std=c++17 div_bench.cpp -lbenchmark -lpthread -o div_bench.base
 ```
 
-Running the output, `div_bench.base` results in the following output.
+Run the program:
+
+```bash
+./div_bench.base
+```
+
+The output is:
 
 ```output
 Running ./div_bench.base
@@ -64,24 +81,24 @@ Benchmark Time CPU Iterations
 baseDiv/1500 7.90 us 7.90 us 88512
 ```
 
+### Inspect Assembly
 
-### Inspect Assembly
+To inspect what assembly instructions are being executed the most frequently, you can use the `perf` command. This is useful for identifying bottlenecks and understanding the performance characteristics of your code.
 
-To inspect what assembly instructions are being executed the most frequently, we can use the `perf` command. Please install `perf` using the [installation instructions](https://learn.arm.com/install-guides/perf/) before proceeding.
+Install Perf using the [install guide](https://learn.arm.com/install-guides/perf/) before proceeding.
 
 {{% notice Please Note %}}
-You may need to set the `perf_event_paranoid` value to -1 with the `sudo sysctl kernel.perf_event_paranoid=-1` command
+You may need to set the `perf_event_paranoid` value to -1 with the `sudo sysctl kernel.perf_event_paranoid=-1` command to run the commands below.
 {{% /notice %}}
 
-
-Run the following command to record `perf` data and create a report in the terminal
+Run the following commands to record `perf` data and create a report in the terminal:
 
 ```bash
 sudo perf record -o perf-division-base ./div_bench.base
 sudo perf report --input=perf-division-base
 ```
 
-As the `perf report` graphic below shows, our program spends a significant amount of time in the short loops with no loop unrolling. There is also the relatively expensive `sdiv` operation and we spend most of the execution time storing the result of that operation.
+As the `perf report` graphic below shows, the program spends a significant amount of time in the short loops with no loop unrolling. There is also an expensive `sdiv` operation, and most of the execution time is spent storing the result of the operation.
 
 ![before-pgo](./before-pgo.gif)

content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-4.md

Lines changed: 18 additions & 10 deletions
@@ -1,5 +1,5 @@
 ---
-title: Using Profile Guided Optimisation
+title: Using Profile Guided Optimization
 weight: 5
 
 ### FIXED, DO NOT MODIFY
@@ -8,26 +8,33 @@ layout: learningpathall
 
 ### Building binary with PGO
 
-To generate an binary optimised on the runtime profile. First we need to build an instrumented binary that can record the usage. Run the following command that includes the `-fprofile-generate` flag to build the instrumented binary.
+To generate a binary optimized using the runtime profile, first build an instrumented binary that records usage data. Run the following command, which includes the `-fprofile-generate` flag, to build the instrumented binary:
 
 ```bash
 g++ -O3 -std=c++17 -fprofile-generate div_bench.cpp -lbenchmark -lpthread -o div_bench.opt
 ```
 
-Next, run the binary to record the profile.
+Next, run the instrumented binary to generate the profile data:
 
 ```bash
 ./div_bench.opt
 ```
-An output file, `*.gcda` should be generated in the same directory. To incorporate this profile into the compilation, run the following command with the `-fprofile-use` flag.
+
+This execution creates profile data files (typically with a `.gcda` extension) in the same directory. To incorporate this profile data into the compilation, rebuild the program using the `-fprofile-use` flag:
 
 ```bash
 g++ -O3 -std=c++17 -fprofile-use div_bench.cpp -lbenchmark -lpthread -o div_bench.opt
 ```
 
-### Running the optimised binary
+### Running the optimized binary
+
+Run again with the optimized binary:
+
+```bash
+./div_bench.opt
+```
 
-Running the newly created `div_bench.opt` binary we observe following improvement.
+Running the newly created `div_bench.opt` binary, you observe the following improvement:
 
 ```output
 Running ./div_bench.opt
@@ -45,18 +52,19 @@ Benchmark Time CPU Iterations
 baseDiv/1500 2.86 us 2.86 us 244429
 ```
 
-As the terminal output above shows, we have reduced our average execution time from 7.90 to 2.86 microseconds. **This is because we are able to provide the context that the profile data shows the input divisor is always 1500 and the compiler is able to incorporate this into the optimisation process**. Next, let's understand how it was optimised.
+As the terminal output above shows, the average execution time is reduced from 7.90 to 2.86 microseconds. **This improvement occurs because the profile data informed the compiler that the input divisor was consistently 1500 during the profiled runs, allowing it to apply specific optimizations.**
 
-### Inspect Assembly
+Next, let's examine how the code was optimized at the assembly level.
 
+### Inspect Assembly
 
-As per the previous section, run the following command to record `perf` data and create a report that can be viewed in the terminal.
+Run the following commands to record `perf` data for the optimized binary and create a report:
 
 ```bash
 sudo perf record -o perf-division-opt ./div_bench.opt
 sudo perf report --input=perf-division-opt
 ```
 
-As the graphic below shows, the profile provided allowed the optimised program to unroll several times and use many more cheaper operations (also known as strength reduction) to execute our loop far quicker.
+As the graphic below illustrates, the profile data enabled the compiler to optimize the program significantly. The optimized code features loop unrolling and uses strength reduction (replacing the expensive division with cheaper operations), allowing the loop to execute much faster.
 
 ![after-pgo](./after-pgo.gif)
