ArmDeveloperEcosystem · jasonrandrews · Apr 15, 2025 · Mar 26, 2025 · Mar 26, 2025
diff --git a/content/learning-paths/cross-platform/cpp-loop-size-context/Example.md b/content/learning-paths/cross-platform/cpp-loop-size-context/Example.md
@@ -0,0 +1,65 @@
+---
+title: Example
+weight: 3
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Example
+
+The following `C++` snippet takes user input as the loop size so that the loop size, `max_loop_size`, is only known at runtime. It initialises an array of `max_loop_size` elements, setting each element to its index. The function `foo` loops over the elements, sums them, and prints the result.
+
+Copy the snippet below into a file named `no_context.cpp`.
+
+```cpp
+#include <iostream>
+#include <chrono>
+
+void foo(const int* x, int max_loop_size)
+{
+    int sum = 0;
+    for (int k = 0; k < max_loop_size; k++) {
+        sum += x[k];
+    }
+    std::cout << "Sum: " << sum << std::endl;
+}
+
+int main() {
+    int max_loop_size;
+    std::cout << "Enter a value for max_loop_size (must be a multiple of 4): ";
+    std::cin >> max_loop_size;
+
+    int x[max_loop_size]; // variable-length array: a GCC extension, not standard C++
+    // Initialise test data
+    for(int i = 0; i < max_loop_size; ++i) x[i] = i;
+
+    // Start timing
+    auto start = std::chrono::high_resolution_clock::now();
+    foo(x, max_loop_size);
+    // Stop timing
+    auto end = std::chrono::high_resolution_clock::now();
+
+    // Calculate and display the elapsed time
+    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
+    std::cout << "Time taken by foo: " << duration << " nanoseconds" << std::endl;
+
+    return 0;
+}
+```
+
+Compile using the following command.
+
+```bash
+g++ -O3 -march=armv8-a+simd no_context.cpp -o no_context
+```
+
+Running the example with the number 40000 leads to the following results. The measured time varies between runs and between platforms.
+
+```output
+./no_context 
+Enter a value for max_loop_size (must be a multiple of 4): 40000
+Sum: 799980000
+Time taken by foo: 138100 nanoseconds
+```
+
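The single measurement in `no_context.cpp` is noisy, so one option is to repeat the call and keep the fastest run. The following is a minimal sketch under stated assumptions, not part of the learning path: `sum_array` and `min_time_ns` are hypothetical names, `sum_array` returns the sum instead of printing it so that console I/O is not timed, and `std::chrono::steady_clock` is swapped in for `high_resolution_clock` because it is guaranteed to be monotonic.

```cpp
#include <chrono>
#include <cstdint>
#include <limits>

// Hypothetical variant of foo: returns the sum rather than printing it,
// so the measurement does not include console output.
int sum_array(const int* x, int n)
{
    int sum = 0;
    for (int k = 0; k < n; k++) {
        sum += x[k];
    }
    return sum;
}

// Hypothetical helper: call `work` `repeats` times and return the fastest
// run in nanoseconds. If `repeats` is zero or negative, nothing is run and
// the maximum int64 value is returned.
template <typename Work>
std::int64_t min_time_ns(Work&& work, int repeats)
{
    using clock = std::chrono::steady_clock;
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int r = 0; r < repeats; ++r) {
        auto start = clock::now();
        work();
        auto end = clock::now();
        std::int64_t ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
        if (ns < best) {
            best = ns;
        }
    }
    return best;
}
```

A call could look like `min_time_ns([&] { sink = sum_array(x, max_loop_size); }, 10)`, where `sink` is a variable that is read afterwards so the compiler cannot discard the work.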
diff --git a/content/learning-paths/cross-platform/cpp-loop-size-context/Introduction.md b/content/learning-paths/cross-platform/cpp-loop-size-context/Introduction.md
@@ -0,0 +1,25 @@
+---
+title: Setup
+weight: 2
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Introduction
+
+Often, the programmer has deeper insight into their software's behavior and its inputs than the compiler does. For instance, if a loop's size is determined at runtime, the compiler must conservatively handle every possible size, which can limit optimization opportunities. However, a developer might know more about the application's runtime characteristics, such as the fact that the loop size always satisfies a specific constraint, like being a multiple of a particular number.
+
+To illustrate how you can explicitly provide this valuable context to the compiler, we'll walk through a simple C++ example.
+
+## Setup
+
+In this learning path, I will be demonstrating the examples using an Arm-based `r7g.large` instance from AWS; however, you're welcome to follow along using any Arm-based machine that suits your environment or preference.
+
+To get started, you'll first need to install the `g++` compiler on your system. Use the following commands as a guide, adjusting them accordingly based on the operating system or distribution you're working with.
+
+```bash
+sudo apt update
+sudo apt install g++
+```
+
diff --git a/content/learning-paths/cross-platform/cpp-loop-size-context/_index.md b/content/learning-paths/cross-platform/cpp-loop-size-context/_index.md
@@ -0,0 +1,41 @@
+---
+title: Learn to Optimize C++ Loops with Size Context
+
+minutes_to_complete: 15
+
+who_is_this_for: C++ developers who want to improve the runtime of for loops using prior knowledge of the loop size
+
+learning_objectives: 
+    - Learn how to add preexisting knowledge of loop sizes to for loops
+
+prerequisites:
+    - Access to an Arm-based machine / instance
+    - Basic understanding of C++
+
+author: Kieran Hejmadi
+
+### Tags
+skilllevels: Introductory
+subjects: ML
+armips:
+    - Neoverse
+tools_software_languages:
+    - C++
+operatingsystems:
+    - Linux
+
+
+
+further_reading:
+    - resource:
+        title: PLACEHOLDER MANUAL 
+        link: PLACEHOLDER MANUAL LINK
+        type: documentation
+
+
+### FIXED, DO NOT MODIFY
+# ================================================================================
+weight: 1                       # _index.md always has weight of 1 to order correctly
+layout: "learningpathall"       # All files under learning paths have this same wrapper
+learning_path_main_page: "yes"  # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
+---
diff --git a/content/learning-paths/cross-platform/cpp-loop-size-context/_next-steps.md b/content/learning-paths/cross-platform/cpp-loop-size-context/_next-steps.md
@@ -0,0 +1,8 @@
+---
+# ================================================================================
+#       FIXED, DO NOT MODIFY THIS FILE
+# ================================================================================
+weight: 21                  # Set to always be larger than the content in this path to be at the end of the navigation.
+title: "Next Steps"         # Always the same, html page title.
+layout: "learningpathall"   # All files under learning paths have this same wrapper for Hugo processing.
+---
diff --git a/...arning-paths/cross-platform/cpp-loop-size-context/providing-inside-knowledge.md b/...arning-paths/cross-platform/cpp-loop-size-context/providing-inside-knowledge.md
@@ -0,0 +1,85 @@
+---
+title: Adding Inside Knowledge
+weight: 4
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Adding Inside Knowledge
+
+To explicitly inform the compiler that our input will always be a multiple of 4, we can rewrite the loop size calculation as follows:
+
+```cpp
+((max_loop_size/4)*4)
+```
+
+At first glance, this calculation might seem mathematically redundant. However, since the expression `(max_loop_size/4)` is an integer division, it truncates the result, effectively guaranteeing that `(max_loop_size/4)*4` will always yield a number divisible by 4. The compiler can pick up on this information and optimise accordingly. 
+
+A slightly easier-to-read method, which avoids confusion when passing arguments, is to divide the variable and rename it before it is passed in. For example:
+
+```cpp
+(max_loop_size_div_4 * 4)
+```
+
+## Improved Example
+
+Copy the snippet below and paste into a file named `context.cpp`.
+
+```cpp
+#include <iostream>
+#include <chrono>
+
+void foo(const int* x, int max_loop_size_div_4)
+{
+    int sum = 0;
+    for (int k = 0; k < max_loop_size_div_4 * 4; k++) {
+        sum += x[k];
+    }
+    std::cout << "Sum: " << sum << std::endl;
+}
+
+int main() {
+    int max_loop_size;
+    std::cout << "Enter a value for max_loop_size (must be a multiple of 4): ";
+    std::cin >> max_loop_size;
+
+    int max_loop_size_div_4 = max_loop_size / 4;
+    int x[max_loop_size]; // variable-length array: a GCC extension, not standard C++
+    // Initialise test data
+    for(int i = 0; i < max_loop_size; ++i) x[i] = i;
+
+    // Start timing
+    auto start = std::chrono::high_resolution_clock::now();
+    foo(x, max_loop_size_div_4);
+    // Stop timing
+    auto end = std::chrono::high_resolution_clock::now();
+
+    // Calculate and display the elapsed time
+    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
+    std::cout << "Time taken by foo: " << duration << " nanoseconds" << std::endl;
+
+    return 0;
+}
+```
+
+Compile again with the same compiler flags, then run the program with the same input of 40000.
+
+```bash
+g++ -O3 -march=armv8-a+simd context.cpp -o context
+```
+
+```output
+./context 
+Enter a value for max_loop_size (must be a multiple of 4): 40000
+Sum: 799980000
+Time taken by foo: 24650 nanoseconds
+```
+In this particular run, the time taken is significantly lower than in the previous example (24650 nanoseconds compared with 138100 nanoseconds).
+
+## Comparison
+
+To compare the two versions, use Compiler Explorer to view the assembly [here](https://godbolt.org/z/nvx4j1vTK).
+
+As the assembly shows, fewer lines of assembly are generated for the function `foo` when context is added. This is because, given the context, the compiler can remove some of the conditional checks and the clean-up code that would otherwise handle leftover elements.
+
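The effect described in `providing-inside-knowledge.md` can be pictured with hand-written C++. This is a conceptual sketch only: the function names are hypothetical, non-negative sizes are assumed, and the code the compiler actually emits uses SIMD instructions and varies by compiler version. The first function shows the shape needed for an arbitrary loop size, with a clean-up loop for the leftover elements. The second shows the shape that is possible when the size is known to be a multiple of 4.

```cpp
// Hypothetical helper mirroring the expression ((max_loop_size / 4) * 4).
// Integer division truncates, so the result is always a multiple of 4.
constexpr int round_down_to_multiple_of_4(int n)
{
    return (n / 4) * 4;
}

static_assert(round_down_to_multiple_of_4(40000) == 40000, "multiples of 4 are unchanged");
static_assert(round_down_to_multiple_of_4(40003) == 40000, "a remainder of 3 is dropped");

// Arbitrary size: a main loop handling four elements per iteration,
// followed by a clean-up loop for the remaining 0 to 3 elements.
int sum_with_cleanup(const int* x, int n)
{
    int sum = 0;
    int k = 0;
    for (; k + 4 <= n; k += 4) {
        sum += x[k] + x[k + 1] + x[k + 2] + x[k + 3];
    }
    for (; k < n; k++) { // clean-up loop
        sum += x[k];
    }
    return sum;
}

// Size known to be a multiple of 4: the clean-up loop is not needed.
int sum_multiple_of_4(const int* x, int n_div_4)
{
    int sum = 0;
    for (int k = 0; k < n_div_4 * 4; k += 4) {
        sum += x[k] + x[k + 1] + x[k + 2] + x[k + 3];
    }
    return sum;
}
```

Note the consequence of the truncation: if the input is not actually a multiple of 4, the second form silently skips the last one to three elements, which is why the prompt in `context.cpp` asks for a multiple of 4.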