---
title: Optimizing Performance with Profile-Guided Optimization and Google Benchmark

minutes_to_complete: 15

who_is_this_for: Developers looking to optimise the performance of a program using characteristics observed at runtime.

learning_objectives:
- Learn how to microbenchmark a function using Google Benchmark
- Learn how to use profile-guided optimisation to build binaries optimised for real-world workloads

prerequisites:
- Basic C++ understanding
- Access to an Arm-based Linux machine

author: Kieran Hejmadi

### Tags
skilllevels: Introductory
subjects: Performance and Architecture
armips:
- Neoverse
tools_software_languages:
- C++
- Google Benchmark
- G++
operatingsystems:
- Linux

further_reading:
- resource:
title: G++ Profile Guided Optimisation Documentation
link: https://gcc.gnu.org/onlinedocs/gcc-13.3.0/gcc/Instrumentation-Options.html
type: documentation
- resource:
title: Google Benchmark Library
link: https://github.com/google/benchmark
type: documentation



### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
---
title: Introduction to Profile-Guided Optimisation
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Introduction to Profile-Guided Optimisation

### What is Profile-Guided Optimization (PGO) and How Does It Work?

Profile-Guided Optimization (PGO) is a compiler optimization technique that enhances program performance by utilizing real-world execution data. In GCC/G++, PGO involves a two-step process: first, compiling the program with the `-fprofile-generate` flag to produce an instrumented binary that collects profiling data during execution; and second, recompiling the program with the `-fprofile-use` flag, allowing the compiler to leverage the collected data to make informed optimization decisions. This approach enables the compiler to identify frequently executed paths—known as “hot” paths—and optimize them more aggressively, while potentially reducing the emphasis on less critical code paths.

### When to Use Profile-Guided Optimization

PGO is particularly beneficial in the later stages of development, once a representative real-world workload can be applied. It is most effective for applications where performance is critical and runtime behavior is complex or data-dependent, such as "hot" functions that are executed frequently. By focusing on these critical sections, PGO ensures that the most impactful parts of the code are optimized based on actual usage patterns.

### Limitations of Profile-Guided Optimization and When Not to Use

While PGO offers substantial performance benefits, it has limitations. The profiling data must accurately represent typical usage scenarios; otherwise, the optimizations may not yield the desired performance improvements and could even degrade performance.

Additionally, PGO requires extra build steps, which inevitably increases compile time and can be an issue for large code bases. As such, PGO is not suitable for all sections of code: we recommend applying it only to sections that are heavily influenced by runtime behaviour and are performance critical. PGO is therefore not ideal for early-stage development or for applications with highly variable or unpredictable usage patterns.

Please refer to the [GCC documentation](https://gcc.gnu.org/onlinedocs/gcc-13.3.0/gcc/Instrumentation-Options.html) for more information.
---
title: Introduction to Google Benchmark
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Google Benchmark

Google Benchmark is a C++ library designed to expedite the microbenchmarking of code. It simplifies the process of writing microbenchmarks by providing a structured framework that automatically handles iterations, timing, and statistical analysis. This allows developers to focus on optimizing their code rather than writing main functions, refactoring source code to run in a test harness, and trying to anticipate unwanted compiler optimisations.

To use Google Benchmark, you define a function that contains the code you want to measure. This function should accept a `benchmark::State&` parameter and iterate over it to perform the benchmarking. You then register this function using the `BENCHMARK` macro and include `BENCHMARK_MAIN()` to create the main function for the benchmark executable. Here's a basic example:

```cpp
#include <benchmark/benchmark.h>
#include <string>

static void BM_StringCreation(benchmark::State& state) {
  for (auto _ : state)
    std::string empty_string;
}
BENCHMARK(BM_StringCreation);

BENCHMARK_MAIN();
```

### Filtering benchmarks and preventing compiler optimisations

To ensure that the compiler does not optimize away parts of your benchmarked code, Google Benchmark provides the function `benchmark::DoNotOptimize(value);`. This prevents the compiler from optimizing away a variable or expression by forcing it to be read and stored.
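Under the hood, this kind of barrier can be built from an empty inline `asm` statement that claims to read the value. The sketch below is a simplified, GCC/Clang-specific stand-in for illustration, not the library's actual implementation; the names `do_not_optimize` and `churn_strings` are hypothetical.

```cpp
#include <cassert>
#include <string>

// Simplified stand-in for benchmark::DoNotOptimize: the empty asm block
// tells the compiler the value is "used", so the computation feeding it
// cannot be eliminated as dead code. (GCC/Clang-specific, illustrative only.)
template <typename T>
inline void do_not_optimize(T const& value) {
    asm volatile("" : : "r,m"(value) : "memory");
}

// Without the barrier, the compiler could legally delete this entire loop,
// since the string is never observably used.
inline void churn_strings(int n) {
    for (int i = 0; i < n; ++i) {
        std::string s = "hello";
        do_not_optimize(s);
    }
}
```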

Additionally, to run a specific subset of benchmarks, you can use the `--benchmark_filter` command-line option with a regular expression, for example `./benchmark_binary --benchmark_filter=BM_String.*`. This saves you from repeatedly commenting out lines of source code.


For more detailed information and advanced usage, refer to the [official Google documentation](https://github.com/google/benchmark).
---
title: Division Example
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Install prerequisites

In this example, I am connecting to an AWS `c7g.xlarge` instance running Ubuntu 24.04 LTS. Run the following commands to install the prerequisite packages.


```bash
sudo apt update
sudo apt install gcc g++ make libbenchmark-dev -y
```

## Division example

Copy and paste the `C++` source code below into a file named `div_bench.cpp`. This trivial example takes a vector of 4096 32-bit integers and divides each element by a number. Importantly, passing the divisor through `s.range(0)` introduces indirection: its value is unknown at compile time, even though it is visible in our source code as 1500.

```cpp
#include <benchmark/benchmark.h>
#include <vector>

// Benchmark division instruction
static void baseDiv(benchmark::State &s) {
  std::vector<int> v_in(4096);
  std::vector<int> v_out(4096);

  for (auto _ : s) {
    // s.range(0) is unknown at compile time, so the division cannot be reduced
    for (size_t i = 0; i < v_in.size(); i++) v_out[i] = v_in[i] / s.range(0);
  }
}

// The value 1500 is passed through as a runtime argument, so strength reduction cannot be applied
BENCHMARK(baseDiv)->Arg(1500)->Unit(benchmark::kMicrosecond);

BENCHMARK_MAIN();
```

To compile and run the microbenchmark we need to link against `pthread` and `benchmark` with the following command.

```bash
g++ -O3 -std=c++17 div_bench.cpp -lbenchmark -lpthread -o div_bench.base
```

Running the resulting binary, `div_bench.base`, produces the following output.

```output
Running ./div_bench.base
Run on (4 X 2100 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x4)
  L1 Instruction 64 KiB (x4)
  L2 Unified 1024 KiB (x4)
  L3 Unified 32768 KiB (x1)
Load Average: 0.00, 0.00, 0.00
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------
Benchmark           Time             CPU   Iterations
-------------------------------------------------------
baseDiv/1500     7.90 us         7.90 us        88512
```


### Inspect Assembly

To inspect which assembly instructions are executed most frequently, we can use the `perf` command. Please install `perf` using the [installation instructions](https://learn.arm.com/install-guides/perf/) before proceeding.

{{% notice Please Note %}}
You may need to set the `perf_event_paranoid` value to -1 with the `sudo sysctl kernel.perf_event_paranoid=-1` command.
{{% /notice %}}


Run the following commands to record `perf` data and create a report in the terminal.

```bash
sudo perf record -o perf-division-base ./div_bench.base
sudo perf report --input=perf-division-base
```

As the `perf report` graphic below shows, our program spends a significant amount of time in short loops with no loop unrolling. There is also the relatively expensive `sdiv` operation, and we spend most of the execution time storing its result.

![before-pgo](./before-pgo.gif)

---
title: Using Profile Guided Optimisation
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

### Building a binary with PGO

To generate a binary optimised using the runtime profile, we first need to build an instrumented binary that records usage. Run the following command, which includes the `-fprofile-generate` flag, to build the instrumented binary.

```bash
g++ -O3 -std=c++17 -fprofile-generate div_bench.cpp -lbenchmark -lpthread -o div_bench.opt
```

Next, run the binary to record the profile.

```bash
./div_bench.opt
```
An output file (`*.gcda`) should be generated in the same directory. To incorporate this profile into the compilation, run the following command with the `-fprofile-use` flag.

```bash
g++ -O3 -std=c++17 -fprofile-use div_bench.cpp -lbenchmark -lpthread -o div_bench.opt
```

### Running the optimised binary

Running the newly created `div_bench.opt` binary, we observe the following improvement.

```output
Running ./div_bench.opt
Run on (4 X 2100 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x4)
  L1 Instruction 64 KiB (x4)
  L2 Unified 1024 KiB (x4)
  L3 Unified 32768 KiB (x1)
Load Average: 0.10, 0.03, 0.01
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------
Benchmark           Time             CPU   Iterations
-------------------------------------------------------
baseDiv/1500     2.86 us         2.86 us       244429
```

As the terminal output above shows, we have reduced the average execution time from 7.90 to 2.86 microseconds. **This is because the profile data provides the context that the input divisor is always 1500, and the compiler is able to incorporate this into the optimisation process**. Next, let's understand how it was optimised.
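To see why a known divisor is so valuable, consider the two functions below. This is an illustrative sketch, not the compiler's actual output: when the divisor is fixed at 1500, the compiler can replace the `sdiv` with a cheaper multiply-and-shift sequence, which is the strength reduction that the profile unlocked. The function names here are hypothetical.

```cpp
#include <cassert>

// Runtime divisor: the compiler must emit a hardware divide instruction.
int div_runtime(int x, int d) { return x / d; }

// Compile-time divisor: the compiler can replace the divide with a
// multiply-and-shift sequence, since 1500 is known when code is generated.
int div_constant(int x) { return x / 1500; }

// Sanity check that both paths compute the same results.
void check_agreement() {
    for (int x = 0; x < 100000; x += 997)
        assert(div_runtime(x, 1500) == div_constant(x));
}
```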

### Inspect Assembly


As in the previous section, run the following commands to record `perf` data and create a report that can be viewed in the terminal.

```bash
sudo perf record -o perf-division-opt ./div_bench.opt
sudo perf report --input=perf-division-opt
```

As the graphic below shows, the profile allowed the compiler to unroll the loop several times and replace the expensive division with cheaper operations (a transformation known as strength reduction), executing our loop far quicker.

![after-pgo](./after-pgo.gif)
---
title: (Optional) Incorporating PGO into CI system
weight: 6

### FIXED, DO NOT MODIFY
layout: learningpathall
---

### Building locally with Make

Because PGO only requires simple command-line steps, it can easily be integrated into a Makefile and continuous integration (CI) systems, as demonstrated in the sample Makefile below for local builds.

{{% notice Caution %}}
PGO requires additional build steps, which inevitably increases compile time and can be an issue for large code bases. As such, PGO is not suitable for all sections of code: we recommend applying it only to sections that are heavily influenced by runtime behaviour and are performance critical. PGO might therefore not be ideal for early-stage development or for applications with highly variable or unpredictable usage patterns.
{{% /notice %}}


```makefile
# Simple Makefile for building and benchmarking div_bench with and without PGO

# Compiler and flags
CXX := g++
CXXFLAGS := -O3 -std=c++17
LDLIBS := -lbenchmark -lpthread

# Default target: build both binaries
.PHONY: all clean clean-gcda clean-perf run
all: div_bench.base div_bench.opt

# Build the baseline binary (no PGO)
div_bench.base: div_bench.cpp
	$(CXX) $(CXXFLAGS) $< $(LDLIBS) -o $@

# Build the PGO-optimized binary:
# instrument, run to gather a profile, then rebuild using the profile
div_bench.opt: div_bench.cpp
	$(MAKE) clean-gcda
	$(CXX) $(CXXFLAGS) -fprofile-generate $< $(LDLIBS) -o $@
	@echo "Running instrumented binary to gather profile data..."
	./div_bench.opt
	$(CXX) $(CXXFLAGS) -fprofile-use $< $(LDLIBS) -o $@

# Remove profiling data from previous runs
clean-gcda:
	rm -f ./*.gcda

# Remove perf recordings
clean-perf:
	rm -f perf-division-*

# Remove all generated files
clean: clean-gcda clean-perf
	rm -f div_bench.base div_bench.opt

# Run both benchmarks with informative headers
run: div_bench.base div_bench.opt
	@echo "==================== Without Profile-Guided Optimization ===================="
	./div_bench.base
	@echo "==================== With Profile-Guided Optimization ===================="
	./div_bench.opt
```

### Building with GitHub Actions

As another alternative, the `yaml` file below serves as a basic example of integrating profile-guided optimisation into your CI flow. This barebones example natively compiles on a GitHub-hosted, Ubuntu 24.04 Arm-based runner. Further steps could automate regression testing.

```yaml
name: PGO Benchmark

on:
  push:
    branches: [ main ]

jobs:
  build:
    runs-on: ubuntu-24.04-arm

    steps:
      - name: Check out source
        uses: actions/checkout@v3

      - name: Install dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y libbenchmark-dev g++

      - name: Clean previous profiling data
        run: |
          rm -rf ./*gcda
          rm -f div_bench.base div_bench.opt

      - name: Compile base and instrumented binary
        run: |
          g++ -O3 -std=c++17 div_bench.cpp -lbenchmark -lpthread -o div_bench.base
          g++ -O3 -std=c++17 -fprofile-generate div_bench.cpp -lbenchmark -lpthread -o div_bench.opt

      - name: Generate profile data and compile with PGO
        run: |
          ./div_bench.opt
          g++ -O3 -std=c++17 -fprofile-use div_bench.cpp -lbenchmark -lpthread -o div_bench.opt

      - name: Run benchmarks
        run: |
          echo "==================== Without Profile-Guided Optimization ===================="
          ./div_bench.base
          echo "==================== With Profile-Guided Optimization ===================="
          ./div_bench.opt
          echo "==================== Benchmarking complete ===================="
```