content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-1.md
Lines changed: 5 additions & 4 deletions
@@ -6,19 +6,20 @@ weight: 2
 layout: learningpathall
 ---

-## Introduction of Profile Guided Optimisation
+## Introduction to Profile Guided Optimisation

 ### What is Profile-Guided Optimization (PGO) and How Does It Work?

-Profile-Guided Optimization (PGO) is a compiler optimization technique that enhances program performance by utilizing real-world execution data. In GCC/G++, PGO involves a two-step process: first, compiling the program with the -fprofile-generate flag to produce an instrumented binary that collects profiling data during execution; and second, recompiling the program with the -fprofile-use flag, allowing the compiler to leverage the collected data to make informed optimization decisions. This approach enables the compiler to identify frequently executed paths, known as “hot” paths, and optimize them more aggressively, while potentially reducing the emphasis on less critical code paths.
+Profile-Guided Optimization (PGO) is a compiler optimization technique that enhances program performance by utilizing real-world execution data. In GCC/G++, PGO involves a two-step process: first, compiling the program with the `-fprofile-generate` flag to produce an instrumented binary that collects profiling data during execution; and second, recompiling the program with the `-fprofile-use` flag, allowing the compiler to leverage the collected data to make informed optimization decisions. This approach enables the compiler to identify frequently executed paths, known as “hot” paths, and optimize them more aggressively, while potentially reducing the emphasis on less critical code paths.

 ### When to Use Profile-Guided Optimization

-PGO is particularly beneficial in the later stages of development, once the codebase has stabilized. It’s most effective for applications where performance is critical and runtime behavior is complex or data-dependent. For instance, optimizing “hot” functions—those that are executed frequently—can lead to significant performance improvements. By focusing on these critical sections, PGO ensures that the most impactful parts of the code are optimized based on actual usage patterns.
+PGO is particularly beneficial in the later stages of development, when a real-world workload can be applied. It’s most effective for applications where performance is critical and runtime behavior is complex or data-dependent, for instance where a small number of “hot” functions are executed frequently. By focusing on these critical sections, PGO ensures that the most impactful parts of the code are optimized based on actual usage patterns.

 ### Limitations of Profile-Guided Optimization and When Not to Use

-While PGO offers substantial performance benefits, it has certain limitations. The profiling data must accurately represent typical usage scenarios; otherwise, the optimizations may not yield the desired performance improvements and could even degrade performance in some cases.
+While PGO offers substantial performance benefits, it has limitations. The profiling data must accurately represent typical usage scenarios; otherwise, the optimizations may not yield the desired performance improvements and could even degrade performance.

 Additionally, the process requires additional build steps, which inevitably increase compile time; this can be an issue for large code bases. As such, PGO is not suitable for all sections of code. We recommend using PGO only for sections of code that are heavily influenced by run-time behaviour and are performance critical. PGO might also not be ideal for early-stage development or for applications with highly variable or unpredictable usage patterns.

+Please refer to the [GCC documentation](https://gcc.gnu.org/onlinedocs/gcc-13.3.0/gcc/Instrumentation-Options.html) for more information.
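
To make the idea of a “hot” path more concrete, the sketch below states by hand the kind of branch-likelihood information that PGO would otherwise derive from a recorded profile. It is only an illustration: the function name `checked_divide` and the assumption that a zero divisor is rare are hypothetical, and `__builtin_expect` is a GCC/Clang builtin rather than standard C++.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical example: assumes GCC or Clang, which provide __builtin_expect.
// The hint says the zero-divisor branch is rarely taken, so the compiler can
// lay out the common path as straight-line code. With PGO, this likelihood
// comes from the recorded profile instead of a hand-written annotation.
std::int32_t checked_divide(std::int32_t value, std::int32_t divisor) {
    // Precondition: INT32_MIN / -1 would overflow a 32-bit signed integer.
    assert(!(value == INT32_MIN && divisor == -1));

    if (__builtin_expect(divisor == 0, 0)) {
        return 0;  // cold path: assumed to be rare in this sketch
    }
    return value / divisor;  // hot path
}
```

A hand-written hint like this only reflects what the developer believes about the workload, whereas a profile reflects what was actually measured, which is why the representativeness of the profiling run matters.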
content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-2.md
Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ BENCHMARK(BM_StringCreation);
 BENCHMARK_MAIN();
 ```

-Filtering and preventing Compiler Optimisations
+### Filtering and preventing Compiler Optimisations

 To ensure that the compiler does not optimize away parts of your benchmarked code, Google Benchmark provides the function `benchmark::DoNotOptimize(value);`. This prevents the compiler from optimizing away a variable or expression by forcing it to be read and stored.
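
As a rough illustration of how such a barrier can work, the sketch below defines a stand-in named `keep_alive`. This helper is hypothetical and is not the Google Benchmark API; it assumes GCC or Clang, whose extended inline assembly lets an empty statement claim to read a value and clobber memory.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an optimisation barrier, assuming GCC or Clang.
// The empty asm statement claims to read `value` and to clobber memory, so
// the compiler has to compute the value even if nothing else uses it.
template <typename T>
inline void keep_alive(const T& value) {
    asm volatile("" : : "r,m"(value) : "memory");
}

// Computes the integer mean of the input. In a benchmark the result would
// normally be discarded; it is returned here only so it can be checked.
std::int64_t mean_value(const std::vector<std::int32_t>& values) {
    assert(!values.empty());
    std::int64_t total = 0;
    for (std::int32_t v : values) {
        total += v;
    }
    const std::int64_t mean = total / static_cast<std::int64_t>(values.size());
    keep_alive(mean);  // the computation above cannot be dropped as dead code
    return mean;
}
```

In real benchmark code, prefer `benchmark::DoNotOptimize` itself, since the library maintains the compiler-specific details.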
-Copy and paste the `C++` source code below into a file named `div_bench.cpp`. This example takes in a vector of 4096 32-bit integers and divides each element by a number. Importantly, the `benchmark/benchmark.h`results in indirection so that the value is unknown compile time, although it is visible in our source code as 1500.
+Copy and paste the `C++` source code below into a file named `div_bench.cpp`. This trivial example takes a vector of 4096 32-bit integers and divides each element by a number. Importantly, using `benchmark/benchmark.h` introduces indirection, so the divisor value is unknown at compile time, although it is visible in our source code as 1500.

 ```cpp
 #include<benchmark/benchmark.h>
@@ -70,7 +70,7 @@ baseDiv/1500 7.90 us 7.90 us 88512
 To inspect which assembly instructions are executed most frequently, we can use the `perf` command. Please install `perf` using the [installation instructions](https://learn.arm.com/install-guides/perf/) before proceeding.

 {{% notice Please Note %}}
-You may need to set the `perf_event_paranoid` value to 0 with the `sudo sysctl kernel.perf_event_paranoid=0` command
+You may need to set the `perf_event_paranoid` value to -1 with the `sudo sysctl kernel.perf_event_paranoid=-1` command.
 {{% /notice %}}


@@ -80,5 +80,8 @@ Run the following command to record `perf` data and create a report in the termi
 sudo perf record -o perf-division-base ./div_bench.base
 sudo perf report --input=perf-division-base
 ```
+
+As the `perf report` graphic below shows, our program spends a significant amount of time in short loops with no loop unrolling. The loop also contains the relatively expensive `sdiv` operation, and most of the execution time is attributed to storing the result of that operation.
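
For orientation, the kind of loop being profiled can be sketched as follows. This is not the `div_bench.cpp` source from the learning path; the names `divide_all` and `runtime_divisor` are hypothetical, and the `volatile` read is just one assumed way to keep the divisor opaque to the compiler.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Divides every element in place by a divisor that is only known at run time.
// Without further information the compiler has to emit a general signed
// division for each element.
void divide_all(std::vector<std::int32_t>& values, std::int32_t divisor) {
    assert(divisor != 0);
    for (std::int32_t& v : values) {
        v /= divisor;
    }
}

// Hypothetical helper: reading through a volatile object stops the compiler
// from treating 1500 as a compile-time constant at the call site.
std::int32_t runtime_divisor() {
    volatile std::int32_t hidden = 1500;
    return hidden;
}
```

A caller would build a vector of 4096 elements and pass `runtime_divisor()` as the second argument, which mirrors the situation described above where the value is visible in the source but not to the optimiser.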
content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-4.md
Lines changed: 4 additions & 4 deletions
@@ -8,7 +8,7 @@ layout: learningpathall

 ### Building binary with PGO

-To generate an binary optimised on the runtime profile. First we need to build an instrumented binary that can record the usage. Run the following command, that includes the `-fprofile-generate` flag to build the instrumented binary.
+To generate a binary optimised using the runtime profile, we first need to build an instrumented binary that can record the usage. Run the following command, which includes the `-fprofile-generate` flag, to build the instrumented binary.
-As the terminal output above shows, we have reduced our average execution time from 7.90 to 2.86 microseconds. This is because we are able to provide the context that the profile data shows the input divisor is always 1500 and the compiler is able to incorporate this context. Next, let's understand how it was optimised.
+As the terminal output above shows, we have reduced our average execution time from 7.90 to 2.86 microseconds. **This is because the profile data shows that the input divisor is always 1500, and the compiler is able to incorporate this into the optimisation process**. Next, let's understand how it was optimised.

 ### Inspect Assembly


-Run the following command to record `perf` data and create a report that can be viewed in the terminal.
+As per the previous section, run the following command to record `perf` data and create a report that can be viewed in the terminal.

 ```bash
 sudo perf record -o perf-division-opt ./div_bench.opt
 sudo perf report --input=perf-division-opt
 ```

-As the graphic below shows, the profile provided allowed the optimised program to unroll several times and use slightly different instructions.
+As the graphic below shows, the profile allowed the compiler to unroll the loop several times and to replace the division with a larger number of cheaper operations (a technique known as strength reduction), so the loop executes far more quickly.
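
The two ideas described above, specialising for the divisor value seen in the profile and replacing the division with cheaper operations, can be sketched by hand. This is an illustration of the concept under stated assumptions, not the code GCC generates: the function names are hypothetical, and it uses unsigned integers to keep the arithmetic simple, whereas the benchmark uses signed values.

```cpp
#include <cassert>
#include <cstdint>

// Strength-reduced x / 1500 for unsigned 32-bit x (assumption: unsigned only).
// Since 1500 = 4 * 375, shift right by 2 first, then divide by 375 by
// multiplying with ceil(2^39 / 375) = 1466015504 and shifting right by 39.
// The rounding error of the constant is small enough that the result is
// exact for every 30-bit input, which covers all values of x >> 2.
std::uint32_t divide_by_1500(std::uint32_t x) {
    const std::uint64_t magic = 1466015504ULL;
    const std::uint64_t quarter = static_cast<std::uint64_t>(x >> 2);
    return static_cast<std::uint32_t>((quarter * magic) >> 39);
}

// Value specialisation: take the cheap path when the divisor matches the
// value that dominated the profile, and fall back to a general division
// otherwise so that the result stays correct for any non-zero divisor.
std::uint32_t divide_specialised(std::uint32_t x, std::uint32_t divisor) {
    assert(divisor != 0);
    if (divisor == 1500) {
        return divide_by_1500(x);  // hot path: multiply and shift only
    }
    return x / divisor;            // cold path: general division
}
```

The fallback branch is what makes this kind of transformation safe: the profile only says that 1500 was the common case, not that it is the only possible one.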
content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-5.md
Lines changed: 7 additions & 2 deletions
@@ -8,7 +8,12 @@ layout: learningpathall

 ### Building locally with Make

-Since PGO can be used by simple command-line instructions, we can trivially incorporate this into a `make` file, as per the sample Makefile below if building locally.
+As PGO can be utilized with simple command-line instructions, it can easily be integrated into a `make` file and continuous integration (CI) systems, as demonstrated in the sample Makefile below for local builds.
+
+{{% notice Caution %}}
+PGO requires additional build steps, which inevitably increase compile time; this can be an issue for large code bases. As such, PGO is not suitable for all sections of code. We recommend using PGO only for sections of code that are heavily influenced by run-time behaviour and are performance critical. PGO might also not be ideal for early-stage development or for applications with highly variable or unpredictable usage patterns.
+{{% /notice %}}
+

 ```makefile
 # Simple Makefile for building and benchmarking div_bench with and without PGO
-The `yaml` file below can serve as an basic example of integrating profile guided optimisation into your CI flow. Further tests could be to check for regressions.
+As another alternative, the `yaml` file below can serve as a basic example of integrating profile guided optimisation into your CI flow. This barebones example natively compiles on a GitHub-hosted Ubuntu 24.04 Arm-based runner. Further tests could automate checks for regressions.