review content before PR

Your Name · Your Name · commit 5d18b403d824 · 2025-04-28T16:57:07.000+01:00
diff --git a/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/_index.md b/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/_index.md
@@ -1,12 +1,12 @@
 ---
-title: Getting Started with Profile Guided Optimisation and Google Benchmark
+title: Optimizing Performance with Profile-Guided Optimization and Google Benchmark
 
 minutes_to_complete: 15
 
-who_is_this_for: Developers who are looking to optimise the performance of a program using the observed characteristics at runtime.
+who_is_this_for: Developers who are looking to optimise the performance of a program using the characteristics observed at runtime.
 
 learning_objectives: 
-    - Learn how to write a function microbenchmark using Google Benchmark
+    - Learn how to microbenchmark a function using Google Benchmark
     - Learn how to use profile guided optimisation to build binaries optimised for real-world workloads
 
 prerequisites:
diff --git a/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-1.md b/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-1.md
@@ -6,19 +6,20 @@ weight: 2
 layout: learningpathall
 ---
 
-## Introduction of Profile Guided Optimisation
+## Introduction to Profile-Guided Optimization
 
 ### What is Profile-Guided Optimization (PGO) and How Does It Work?
 
-Profile-Guided Optimization (PGO) is a compiler optimization technique that enhances program performance by utilizing real-world execution data. In GCC/G++, PGO involves a two-step process: first, compiling the program with the -fprofile-generate flag to produce an instrumented binary that collects profiling data during execution; and second, recompiling the program with the -fprofile-use flag, allowing the compiler to leverage the collected data to make informed optimization decisions. This approach enables the compiler to identify frequently executed paths—known as “hot” paths—and optimize them more aggressively, while potentially reducing the emphasis on less critical code paths.
+Profile-Guided Optimization (PGO) is a compiler optimization technique that improves program performance by using real-world execution data. In GCC/G++, PGO involves two compilations with a profiling run in between: first, compile the program with the `-fprofile-generate` flag to produce an instrumented binary; next, run that binary on a representative workload so that it records profiling data; finally, recompile the program with the `-fprofile-use` flag so that the compiler can use the collected data to make informed optimization decisions. This approach enables the compiler to identify frequently executed paths (known as “hot” paths) and optimize them more aggressively, while potentially reducing the emphasis on less critical code paths.
 
 ### When to Use Profile-Guided Optimization
 
-PGO is particularly beneficial in the later stages of development, once the codebase has stabilized. It’s most effective for applications where performance is critical and runtime behavior is complex or data-dependent. For instance, optimizing “hot” functions—those that are executed frequently—can lead to significant performance improvements. By focusing on these critical sections, PGO ensures that the most impactful parts of the code are optimized based on actual usage patterns.
+PGO is particularly beneficial in the later stages of development, when the real-world workload can be applied. It’s most effective for applications where performance is critical and runtime behavior is complex or data-dependent. For instance, it can optimize “hot” functions that are executed frequently. By focusing on these critical sections, PGO ensures that the most impactful parts of the code are optimized based on actual usage patterns.
 
 ### Limitations of Profile-Guided Optimization and When Not to Use
 
-While PGO offers substantial performance benefits, it has certain limitations. The profiling data must accurately represent typical usage scenarios; otherwise, the optimizations may not yield the desired performance improvements and could even degrade performance in some cases. 
+While PGO offers substantial performance benefits, it has limitations. The profiling data must accurately represent typical usage scenarios; otherwise, the optimizations may not yield the desired performance improvements and could even degrade performance. 
 
 Additionally, the process requires extra build steps, which inevitably increase compile time and can be an issue for large code bases. As such, PGO is not suitable for all sections of code. We recommend using PGO only on sections of code that are heavily influenced by run-time behaviour and are performance critical. PGO might also not be ideal for early-stage development or for applications with highly variable or unpredictable usage patterns.
 
+Please refer to the [GCC documentation](https://gcc.gnu.org/onlinedocs/gcc-13.3.0/gcc/Instrumentation-Options.html) for more information. 
diff --git a/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-2.md b/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-2.md
@@ -24,7 +24,7 @@ BENCHMARK(BM_StringCreation);
 BENCHMARK_MAIN();
 ```
 
-Filtering and preventing Compiler Optimisations
+### Filtering and Preventing Compiler Optimisations
 
 To ensure that the compiler does not optimize away parts of your benchmarked code, Google Benchmark provides the function `benchmark::DoNotOptimize(value);`. This prevents the compiler from optimizing away a variable or expression by forcing it to be read and stored. 
 
diff --git a/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-3.md b/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-3.md
@@ -18,7 +18,7 @@ sudo apt install gcc g++ make libbenchmark-dev -y
 
 ## Division example
 
-Copy and paste the `C++` source code below into a file named `div_bench.cpp`. This example takes in a vector of 4096 32-bit integers and divides each element by a number. Importantly, the `benchmark/benchmark.h` results in indirection so that the value is unknown compile time, although it is visible in our source code as 1500. 
+Copy and paste the `C++` source code below into a file named `div_bench.cpp`. This trivial example takes in a vector of 4096 32-bit integers and divides each element by a number. Importantly, passing the divisor through `benchmark/benchmark.h` adds indirection, so its value is unknown at compile time even though it is visible in our source code as 1500. 
 
 ```cpp
 #include <benchmark/benchmark.h>
@@ -70,7 +70,7 @@ baseDiv/1500       7.90 us         7.90 us        88512
 To inspect what assembly instructions are being executed the most frequently, we can use the `perf` command. Please install `perf` using the [installation instructions](https://learn.arm.com/install-guides/perf/) before proceeding. 
 
 {{% notice Please Note %}}
-You may need to set the `perf_event_paranoid` value to 0 with the `sudo sysctl kernel.perf_event_paranoid=0` command
+You may need to set the `perf_event_paranoid` value to -1 with the `sudo sysctl kernel.perf_event_paranoid=-1` command
 {{% /notice %}}
 
 
@@ -80,5 +80,8 @@ Run the following command to record `perf` data and create a report in the termi
 sudo perf record -o perf-division-base ./div_bench.base 
 sudo perf report --input=perf-division-base
 ```
+
+As the `perf report` graphic below shows, our program spends a significant amount of time in short loops with no loop unrolling. The loop also contains the relatively expensive `sdiv` instruction, and most of the execution time is attributed to storing the result of that operation.
+
 ![before-pgo](./before-pgo.gif)
 
diff --git a/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-4.md b/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-4.md
@@ -8,7 +8,7 @@ layout: learningpathall
 
 ### Building binary with PGO
 
-To generate an binary optimised on the runtime profile. First we need to build an instrumented binary that can record the usage. Run the following command, that includes the `-fprofile-generate` flag to build the instrumented binary. 
+To generate a binary optimised for the runtime profile, we first need to build an instrumented binary that can record the usage. Run the following command, which includes the `-fprofile-generate` flag, to build the instrumented binary. 
 
 ```bash
 g++ -O3 -std=c++17 -fprofile-generate div_bench.cpp -lbenchmark -lpthread -o div_bench.opt
@@ -45,18 +45,18 @@ Benchmark             Time             CPU   Iterations
 baseDiv/1500       2.86 us         2.86 us       244429
 ```
 
-As the terminal output above shows, we have reduced our average execution time from 7.90 to 2.86 microseconds. This is because we are able to provide the context that the profile data shows the input divisor is always 1500 and the compiler is able to incorporate this context. Next, let's understand how it was optimised. 
+As the terminal output above shows, we have reduced our average execution time from 7.90 to 2.86 microseconds. **This is because the profile data shows that the input divisor is always 1500, and the compiler can incorporate this into the optimisation process**. Next, let's understand how it was optimised. 
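
One way to picture this, before looking at the assembly, is a source-level sketch of what the compiler is then free to do. This is a hypothetical illustration rather than GCC's actual output: the names `div_generic` and `div_specialised` are invented here, and the sketch assumes the profile reported 1500 as the dominant divisor. The common value gets a fast path in which the divisor is a compile-time constant, while every other value falls back to the generic division, so results are unchanged.

```cpp
#include <cstdint>

// Generic path: the divisor is only known at run time, so the compiler
// has to emit a real division instruction (sdiv on AArch64).
inline std::int32_t div_generic(std::int32_t value, std::int32_t divisor) {
    return value / divisor;
}

// Hypothetical specialised form: the profile reported 1500 as the divisor,
// so that case is tested first and divides by a literal constant.
inline std::int32_t div_specialised(std::int32_t value, std::int32_t divisor) {
    if (divisor == 1500) {
        return value / 1500;  // constant divisor: open to cheaper instruction sequences
    }
    return div_generic(value, divisor);  // fallback keeps other divisors correct
}
```

Calling `div_specialised(3000, 1500)` takes the fast path, and `div_specialised(10, 3)` takes the fallback; both return the same value as the plain division.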
 
 ### Inspect Assembly 
 
 
-Run the following command to record `perf` data and create a report that can be viewed in the terminal. 
+As in the previous section, run the following command to record `perf` data and create a report that can be viewed in the terminal. 
 
 ```bash
 sudo perf record -o perf-division-opt ./div_bench.opt
 sudo perf report --input=perf-division-opt
 ```
 
-As the graphic below shows, the profile provided allowed the optimised program to unroll several times and use slightly different instructions.  
+As the graphic below shows, the profile allowed the compiler to unroll the loop several times and replace the expensive division with a larger number of cheaper operations (a technique known as strength reduction), so the loop executes far more quickly. 
 
 ![after-pgo](./after-pgo.gif)
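
To make the idea of strength reduction concrete, here is a minimal sketch of dividing by the constant 1500 with a multiply and a shift instead of a division instruction. It makes several simplifying assumptions: it uses unsigned 32-bit values, whereas the benchmark in this Learning Path divides signed integers, and the exact instruction sequence GCC emits will differ. The names `div1500_plain`, `div1500_fast`, `kShift` and `kMagic` are hypothetical and exist only for this illustration.

```cpp
#include <cstdint>

// Reference: plain division by the constant 1500.
constexpr std::uint32_t div1500_plain(std::uint32_t n) {
    return n / 1500u;
}

// kMagic is ceil(2^37 / 1500). A shift of 37 is large enough that the
// rounding error never reaches one for any 32-bit n, and small enough
// that n * kMagic still fits in 64 bits.
constexpr std::uint64_t kShift = 37;
constexpr std::uint64_t kMagic =
    ((std::uint64_t{1} << kShift) + 1500 - 1) / 1500;

// Strength-reduced form: one widening multiply and one shift, no division.
constexpr std::uint32_t div1500_fast(std::uint32_t n) {
    return static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(n) * kMagic) >> kShift);
}

// Compile-time spot checks, including both ends of the 32-bit range.
static_assert(div1500_fast(0u) == 0u, "zero");
static_assert(div1500_fast(1499u) == 0u, "just below one multiple");
static_assert(div1500_fast(1500u) == 1u, "exactly one multiple");
static_assert(div1500_fast(4294967295u) == div1500_plain(4294967295u),
              "largest 32-bit value");
```

The multiply and shift are typically much cheaper than a hardware divide, which is why knowing the divisor ahead of time matters so much for this loop.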
diff --git a/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-5.md b/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-5.md
@@ -8,7 +8,12 @@ layout: learningpathall
 
 ### Building locally with Make
 
-Since PGO can be used by simple command-line instructions, we can trivially incorporate this into a `make` file, as per the sample Makefile below if building locally. 
+As PGO can be utilized with simple command-line instructions, it can easily be integrated into a `make` file and continuous integration (CI) systems, as demonstrated in the sample Makefile below for local builds.
+
+{{% notice Caution %}}
+PGO requires additional build steps, which inevitably increase compile time and can be an issue for large code bases. As such, PGO is not suitable for all sections of code. We recommend using PGO only on sections of code that are heavily influenced by run-time behaviour and are performance critical. PGO might also not be ideal for early-stage development or for applications with highly variable or unpredictable usage patterns.
+{{% /notice %}}
+
 
 ```makefile
 # Simple Makefile for building and benchmarking div_bench with and without PGO
@@ -50,7 +55,7 @@ run: div_bench.base div_bench.opt
 
 ### Building with GitHub Actions
 
-The `yaml` file below can serve as an basic example of integrating profile guided optimisation into your CI flow. Further tests could be to check for regressions. 
+As an alternative, the `yaml` file below can serve as a basic example of integrating profile guided optimisation into your CI flow. This barebones example compiles natively on a GitHub-hosted Ubuntu 24.04 Arm-based runner. Further tests could automate checks for regressions. 
 
 ```yaml
 name: PGO Benchmark