Merge pull request #1833 from jasonrandrews/review

jasonrandrews · web-flow · commit cc64dd9132d8 · 2025-04-15T16:34:31.000-05:00
Review C++ loop optimization Learning Path: update titles, enhance co…
diff --git a/content/learning-paths/cross-platform/cpp-loop-size-context/Example.md b/content/learning-paths/cross-platform/cpp-loop-size-context/Example.md
@@ -1,16 +1,18 @@
 ---
-title: Example
+title: Baseline loop implementation
 weight: 3
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
-## Example
+## Understand the baseline loop
 
-The following `C++` snippet takes user input as the loop size so that the loop size, `max_loop_size`, is only known at runtime. This initialises an array of size, , `max_loop_size` with the value for each element corresponding to the index position. The function, `foo`, loops through each element to print out the sum of all elements. 
+The following C++ program takes the loop size as user input, so `max_loop_size` is known only at runtime. The program initializes an array of size `max_loop_size`, setting each element to its index position. 
 
-Copy the snippet below into a file named, `no-context.cpp`. 
+The function `foo()` loops through each element to print out the sum of all elements. Without any boundary information, the compiler must generate conservative code that works for any loop size. 
+
+Use a text editor to copy the code below into a file named `no-context.cpp`. 
 
 ```cpp
 #include <iostream>
@@ -48,18 +50,24 @@ int main() {
 }
 ```
 
-Compiling using the following command. 
+Compile the program using the following command: 
+
+```bash
+g++ -O3 -march=armv8-a+simd no-context.cpp -o no-context
+```
+
+Run the example with 40000 as the input:
 
 ```bash
-g++ -O3 -march=armv8-a+simd no_context.cpp -o no_context
+./no-context 
 ```
 
-Running the example with the number 4000 leads to the following results. You will see runtime variability depending on which platform you run this on. 
+You see the output below. Your runtime will vary depending on the computer you are using.
 
 ```output
-./no_context 
 Enter a value for max_loop_size (must be a multiple of 4): 40000
 Sum: 799980000
 Time taken by foo: 138100 nanoseconds
 ```
 
+Continue to the next section to see how to use developer knowledge of loops to improve performance. 
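The `Sum` value in the output above can be sanity-checked without the elided source: an array whose elements equal their indices sums to n(n-1)/2, which is 799980000 for n = 40000. The sketch below is not the Learning Path's `no-context.cpp`. The helper names `sum_of_indices` and `time_ns` are hypothetical, and the code only illustrates one way such a sum and a nanosecond timing could be produced with `std::chrono`.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: builds {0, 1, ..., n-1} and sums it with a plain loop
// whose bound is only known at runtime.
long long sum_of_indices(int n) {
    assert(n >= 0);  // a negative size makes no sense for this sketch
    std::vector<int> data(static_cast<std::size_t>(n));
    std::iota(data.begin(), data.end(), 0);  // data[i] == i
    long long sum = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        sum += data[i];
    }
    return sum;
}

// Hypothetical helper: returns the elapsed time, in nanoseconds, of one call to f().
template <typename F>
long long time_ns(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

Calling `sum_of_indices(40000)` returns 799980000, matching the `Sum` line above. Timings from `time_ns` depend on the machine, so no specific value should be expected.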
diff --git a/content/learning-paths/cross-platform/cpp-loop-size-context/Introduction.md b/content/learning-paths/cross-platform/cpp-loop-size-context/Introduction.md
@@ -1,25 +1,54 @@
 ---
-title: Setup
+title: Understand developer knowledge for compiler optimizations
 weight: 2
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
-## Introduction
+## What is developer knowledge?
 
-Often, the programmer has deeper insights into their software's behavior and its inputs than the compiler does. For instance, if a loop's size is determined at runtime, the compiler must conservatively handle the possibility of variable sizes, potentially limiting optimization opportunities. However, a developer might know more about the application's runtime characteristics—such as the fact that the loop size always adheres to specific constraints, like being a multiple of a particular number.
+Often, software developers have deeper insights into their software's behavior and its inputs than the compiler does. This knowledge represents a valuable optimization opportunity that can significantly improve performance when properly communicated to the compiler as boundary information.
 
-To illustrate how you can explicitly provide this valuable context to the compiler, we'll walk through a simple C++ example.
+### The compiler's challenge
 
-## Setup
+When a loop's size is determined at runtime, the compiler faces a dilemma:
+- It must generate code that works correctly for any possible input size
+- It cannot make assumptions that might enable more aggressive optimizations
+- It must take a conservative approach to ensure correctness across all scenarios
 
-In this learning path, I will be demonstrating the examples using an Arm-based `r7g.large` instance from AWS; however, you're welcome to follow along using any Arm-based machine that suits your environment or preference.
+### The developer's advantage
 
-To get started, you'll first need to install the `g++` compiler on your system. Use the following commands as a guide, adjusting them accordingly based on the operating system or distribution you're working with.
+As a developer, you often know more about your application's runtime characteristics than the compiler can infer, such as:
+- Loop sizes that always follow specific patterns (like being multiples of 4, 8, or 16)
+- Input constraints that are enforced elsewhere in your application
+- Data alignment guarantees that enable vectorization opportunities
+
+In this Learning Path, you'll learn how to explicitly communicate this valuable context to the compiler, enabling it to generate more efficient code.
+
+## Environment setup
+
+You can use any Arm Linux system to run the example application and learn about loop optimization. The only requirement is to install the `g++` compiler.
+
+### Installing the compiler
+
+If you are running Ubuntu or another Debian-based Linux distribution, you can use the commands below to install the compiler:
 
 ```bash
 sudo apt update
-sudo apt install g++
+sudo apt install g++ -y
+```
+
+For other Linux distributions, use the appropriate package manager to install `g++`.
+
+### Compiler version
+
+This Learning Path uses standard C++ features and optimization techniques that work with any recent C++ compiler.
+
+You can check your version using:
+
+```bash
+g++ --version
 ```
 
+Continue to the next section to learn about an example application that demonstrates how to use developer knowledge for loop boundary information.
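The conservative strategy described under "The compiler's challenge" can be pictured in source form. The sketch below is a hand-written illustration, not actual compiler output. The function name `sum_any_size` is hypothetical, and the block width of 4 is assumed only to match the multiples-of-4 example used in this Learning Path.

```cpp
#include <cassert>
#include <cstddef>

// Hand-written picture of code that must be correct for any n:
// a main loop over full blocks of 4, plus a scalar tail for the 0-3 leftovers.
long long sum_any_size(const int* data, std::size_t n) {
    assert(data != nullptr || n == 0);  // precondition for this sketch
    long long sum = 0;
    std::size_t i = 0;

    // Main loop: covers the largest multiple of 4 that fits in n.
    for (; i + 4 <= n; i += 4) {
        sum += static_cast<long long>(data[i]) + data[i + 1] + data[i + 2] + data[i + 3];
    }

    // Tail loop: needed only because n might not be a multiple of 4.
    for (; i < n; ++i) {
        sum += data[i];
    }
    return sum;
}
```

When `n` is known to be a multiple of 4, the tail loop never executes. That is the kind of extra code a compiler may be able to drop once it has the boundary information.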
diff --git a/content/learning-paths/cross-platform/cpp-loop-size-context/_index.md b/content/learning-paths/cross-platform/cpp-loop-size-context/_index.md
@@ -1,35 +1,51 @@
 ---
-title: Learn to Optimize C++ Loops with Size Context
+title: Boost C++ performance by optimizing loops with boundary information
+
+draft: true
+cascade:
+    draft: true
 
 minutes_to_complete: 15
 
-who_is_this_for: C++ developer who want to improve the runtime of for loops with basic insider knowledge of the loop size
+who_is_this_for: This is an introductory topic for C++ developers who want to improve the runtime of loops using existing knowledge of the loop size.
 
 learning_objectives: 
-    - Learn how to add preexisting knowledge of loop sizes to for loops
+    - Learn how to communicate loop size constraints to the compiler for better optimization.
+    - Understand how providing compile-time context can improve runtime performance.
+    - Implement techniques to express loop boundaries that enable better code generation.
+    - Compare and analyze the performance impact of providing loop size context.
 
 prerequisites:
-    - Access to an Arm-based machine / instance
-    - Basic understanding of C++
+    - An Arm computer running Linux. You can also use a virtual machine from a [cloud service provider](/learning-paths/servers-and-cloud-computing/csp/).
 
 author: Kieran Hejmadi
 
 ### Tags
 skilllevels: Introductory
-subjects: ML
+subjects: Performance and Architecture
 armips:
     - Neoverse
+    - Cortex-A
 tools_software_languages:
     - C++
+    - Runbook
 operatingsystems:
     - Linux
 
-
+### Cross-platform metadata only
+shared_path: true
+shared_between:
+    - servers-and-cloud-computing
+    - laptops-and-desktops
 
 further_reading:
     - resource:
-        title: PLACEHOLDER MANUAL 
-        link: PLACEHOLDER MANUAL LINK
+        title: GCC Optimization Options Documentation
+        link: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
+        type: documentation
+    - resource:
+        title: LLVM Loop Vectorization Guide
+        link: https://llvm.org/docs/Vectorizers.html
         type: documentation
 
 
diff --git a/content/learning-paths/cross-platform/cpp-loop-size-context/providing-inside-knowledge.md b/content/learning-paths/cross-platform/cpp-loop-size-context/providing-inside-knowledge.md
@@ -1,30 +1,34 @@
 ---
-title: Adding Inside Knowledge
+title: Optimize loops using boundary information
 weight: 4
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
-## Adding Inside Knowledge
+## How can I add developer knowledge to optimize performance? 
 
-To explicitly inform the compiler that our input will always be a multiple of 4, we can rewrite the loop size calculation as follows:
+To ensure the loop size is always a multiple of 4 and communicate this boundary information to the compiler, you can rewrite the loop size calculation as follows:
 
 ```output
 ((max_loop_size/4)*4)
 ```
 
-At first glance, this calculation might seem mathematically redundant. However, since the expression `(max_loop_size/4)` is an integer division, it truncates the result, effectively guaranteeing that `(max_loop_size/4)*4` will always yield a number divisible by 4. The compiler can pick up on this information and optimise accordingly. 
+At first glance, this calculation looks mathematically redundant. However, since the expression `(max_loop_size/4)` is an integer division, it truncates the result, effectively guaranteeing that `(max_loop_size/4)*4` will always yield a number divisible by 4. This pattern allows the compiler to recognize and optimize for this specific constraint.
 
-As slightly easier to read method that avoids confusion when passing arguments is to divide the variable and rename before it is passed in. For example.
+This optimization is particularly effective because it enables the compiler to use SIMD (Single Instruction, Multiple Data) vectorization. When the compiler knows the loop count is a multiple of 4, it can process four elements at once using vector registers, significantly improving performance on Arm processors.
+
+A slightly more readable approach, which also avoids confusion when passing arguments, is to divide the variable by 4 and store the result under a new name before passing it in. 
+
+For example:
 
 ```output
 (max_loop_size_div_4 * 4)
 ```
 
-## Improved Example
+## Try an improved example
 
-Copy the snippet below and paste into a file named `context.cpp`.
+Use a text editor to copy the code below and paste it into a file named `context.cpp`.
 
 ```cpp
 #include <iostream>
@@ -63,23 +67,59 @@ int main() {
 }
 ```
 
-Again compile with the same compiler flags. 
+Compile the new program with the same flags:
 
 ```bash
 g++ -O3 -march=armv8-a+simd context.cpp -o context
 ```
 
-```output
+Run the new example with the same 40000 as input:
+
+```bash
 ./context 
+```
+
+You see the new output:
+
+```output
 Enter a value for max_loop_size (must be a multiple of 4): 40000
 Sum: 799980000
 Time taken by foo: 24650 nanoseconds
 ```
-In this particular run, the time taken has significantly reduced compared to our previous example. 
+
+In these runs, the time taken drops from 138100 nanoseconds to 24650 nanoseconds, which is roughly 5.6 times faster than the previous version. This performance improvement is a direct result of providing boundary information to the compiler. 
+
+## Performance considerations
+
+While this optimization technique provides significant performance benefits, it's important to note that it assumes the input is a multiple of 4. In a real-world application, you would need to validate user input or handle cases where the input isn't a multiple of 4. 
+
+For example:
+
+```cpp
+// Validate input
+if (max_loop_size % 4 != 0) {
+    std::cerr << "Error: Input must be a multiple of 4" << std::endl;
+    return 1;
+}
+```
+
+Alternatively, you could pad the array to ensure its size is always a multiple of 4, or handle the remainder elements separately after processing the vectorized portion of the array. The approach you choose depends on your specific application requirements and constraints.
 
 ## Comparison
 
-To compare we will use compiler explorer to see the assembly [here](https://godbolt.org/z/nvx4j1vTK). 
+You can compare the differences in [Compiler Explorer](https://godbolt.org/z/nvx4j1vTK). 
+
+The assembly shows fewer instructions for the function `foo()` when context is added. This is because the compiler can remove the conditional checks and cleanup code that the context makes unnecessary.
+
+When examining the assembly output in Compiler Explorer, look for these key differences:
+
+1. **Vector instructions**: In the optimized version, look for instructions like `ld1` (load to vector register) and `addv` (add across vector) which indicate SIMD operations.
+
+2. **Loop structure**: The optimized version will likely have fewer instructions inside the main loop body as multiple elements are processed at once.
+
+3. **Unrolling factor**: Notice how the compiler might unroll the loop to process multiple elements in each iteration, reducing branch overhead.
+
+4. **Register usage**: The optimized version will make more efficient use of vector registers (v0-v31) rather than just scalar registers.
 
-As the assembly shows we have fewer lines of assembly corresponding to the function `foo` when context is added. This is because the compiler can optimise the conditional checking and any clean up code given the context. 
+These assembly-level differences directly translate to the performance improvements you observed in the execution time.
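The rounding behavior behind `(max_loop_size/4)*4`, and the caveat raised under "Performance considerations", can be checked with a small sketch. The code below is not the Learning Path's `context.cpp`. The names `round_down_to_4` and `sum_with_context` are hypothetical, and the parameter `max_loop_size_div_4` follows the naming suggested earlier in this file.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Integer division truncates, so the result is always a multiple of 4.
int round_down_to_4(int max_loop_size) {
    return (max_loop_size / 4) * 4;
}

// Sums the first (max_loop_size_div_4 * 4) elements of data. Writing the bound
// this way states in the source that the trip count is a multiple of 4.
long long sum_with_context(const std::vector<int>& data, int max_loop_size_div_4) {
    assert(max_loop_size_div_4 >= 0);
    const std::size_t bound = static_cast<std::size_t>(max_loop_size_div_4) * 4;
    assert(bound <= data.size());  // the bound must not run past the array
    long long sum = 0;
    for (std::size_t i = 0; i < bound; ++i) {
        sum += data[i];
    }
    return sum;
}
```

For an input of 10, `round_down_to_4(10)` is 8, so a loop bounded this way skips elements 8 and 9. This is why the input needs to be validated, or the remainder handled separately, when the size is not guaranteed to be a multiple of 4.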