Commit 277a452

Author: Your Name
Revert "removed additional LP"
This reverts commit e072136.
1 parent e072136 commit 277a452

9 files changed

+392
-0
lines changed

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
---
title: Introduction
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Introduction

In Arm-based systems, it is crucial to consider the platform’s architecture and compiler capabilities when writing a C++ for loop. By understanding how compilers automatically vectorize code, you can better organize loops to leverage advanced SIMD features for performance. Compiler autovectorization inspects loops at compile time, generating instructions that process multiple data elements in parallel. Depending on compiler flags and data types, the resulting machine code may use different extensions to harness the underlying hardware.

- **NEON**: a 128-bit SIMD extension that processes data in parallel, offering improved performance for single-precision floating-point operations and integer workloads.

- **SVE** (Scalable Vector Extension): introduces variable-length vectors to provide scalable performance across different Arm implementations, enabling a flexible approach to SIMD.

- **SVE2**: builds upon SVE by adding more instructions for integer, fixed-point, and complex workloads, broadening the range of vectorizable code to enable general data processing.

While hand-written assembly and Arm intrinsics can yield further optimizations, they are beyond the scope of this learning path. Instead, we will concentrate on C++ constructs that help the compiler generate efficient vectorized instructions.

## Environment Setup

This learning path uses an AWS Graviton3 instance, which is based on the Arm Neoverse V1 core and supports both NEON and SVE. If you are unfamiliar with using cloud instances, please refer to the [getting started guide](TODO).

```bash
sudo apt update
sudo apt install g++
```

Please note: there will be slight differences in performance when using different versions of compilers.

## Trivial Vectorisable Example

Data-level parallelism (DLP) refers to the capability of modern CPUs, including Arm architectures, to perform operations on multiple data points simultaneously. In practice, this means the compiler can identify loops or repeated calculations on array elements and convert them into a smaller set of vectorized instructions. By grouping data elements, the compiler leverages hardware instructions that operate on multiple values at once, reducing the total number of instruction cycles. This transformation is key to achieving high-performance code on Arm-based systems, where the NEON, SVE, and SVE2 extensions are used to efficiently handle tasks that involve large arrays or complex data processing.

Copy and paste the C++ code snippet below into a file named `trivial_vector.cpp`.

```cpp
#include <iostream>
#include <vector>

using namespace std;

void vector_add(const vector<int>& a, const vector<int>& b, vector<int>& c) {
    const size_t size = a.size();
    for (size_t i = 0; i < size; ++i) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int size = 100;
    vector<int> a(size, 8);
    vector<int> b(size, 2);
    vector<int> c(size, 0);

    vector_add(a, b, c);

    for (int i = 0; i < size; ++i) {
        cout << c[i] << " ";
    }
    cout << endl;

    return 0;
}
```

The snippet above performs an element-wise addition of two vectors, `a` and `b`, each of size 100. Notable things to observe are:

- Fixed loop size of 100
- No conditional statements within the loop
- Fixed data type of `int`
Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
---
title: Providing additional information
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Conditional Statements

Conditional statements within a for loop allow certain iterations to execute only when specified conditions are met. This mechanism is crucial for processing arrays or vectors selectively, as it lets you skip unnecessary computations or handle edge cases without disrupting the flow of the entire loop.

Arm’s Scalable Vector Extension (SVE) introduces a predicate (mask) to manage these conditional operations on a per-element basis. Instead of processing entire vectors uniformly, SVE uses the mask to enable or disable specific lanes dynamically. This approach is especially powerful for loops whose iteration counts are not exact multiples of the vector length, as it avoids wasted operations. Additionally, SVE supports strided access, meaning it can load or store elements separated by a constant stride in memory, improving efficiency in scenarios like processing slices of arrays.

In contrast, Arm NEON relies on packing data into fixed-width 128-bit registers. Elements are grouped together (packed) and processed simultaneously, but this can lead to overhead when handling irregular loop counts or accessing data with non-contiguous memory layouts. By comparison, SVE’s mask-based approach and flexible vector lengths provide more fine-grained control and higher efficiency for diverse data patterns.

To demonstrate this, the C++ code snippet below initialises two arrays of 128 integers. For even indices, `a[i]` is set to the index and `b[i]` to 1; for odd indices, both are set to 0. Save it as `pred_loop.cpp`.

```cpp
#include <iostream>

int reduce(int *a, int *b, long N);

int main(){
    int a[128];
    int b[128];
    for (int i = 0; i < 128; ++i){
        if (i % 2 == 0){
            a[i] = i;
            b[i] = 1;
        }
        else {
            a[i] = 0;
            b[i] = 0;
        }
    }
    long N = 128;
    int s = reduce(a, b, N);
    std::cout << s << std::endl;
    return 0;
}

int reduce(int *a, int *b, long N){
    long i;
    int s = 0;
    for (i = 0; i < N; ++i){
        if (b[i]){
            s += a[i];
        }
    }
    return s;
}
```

This example can be vectorised with SVE predication. Run the commands below to generate the annotated assembly for both NEON (simd) and SVE.

```bash
g++ -march=armv8-a+simd -fverbose-asm -O3 pred_loop.cpp -S -o neon_basic.s
g++ -march=armv8-a+sve -O3 -fverbose-asm pred_loop.cpp -S -o sve_basic.s
```

Passing the `-fverbose-asm` flag annotates the assembly with the corresponding lines of source code.

The SVE assembly uses `st1w` stores governed by the predicate register `p0`, whereas the NEON implementation has no equivalent per-lane masking.

```output
// pred_loop.cpp:9: if (i % 2 == 0){
...
st1w z2.s, p0, [x2, x0, lsl 2] // vect_patt_104.35, loop_mask_124, MEM <vector([4,4]) int> [(int *)_67 + ivtmp.63_15 * 4]
st1w z0.s, p0, [x1, x0, lsl 2] // vect_cstore_25.34, loop_mask_124, MEM <vector([4,4]) int> [(int *)_22 + ivtmp.63_15 * 4]
```

Inspecting the assembly, we can see the `UADDV` instruction being used for the reduction operation.

![reduction_operation](./reduction.png)
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
---
title: Sparse Addressing
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

9+
## Sparse Addressing (indirect addressing)
10+
11+
Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
---
title: Adding Context
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

9+
## Adding Context
10+
11+
In some situations, you may already know that a loop will run a multiple of a constant. Making this explicit can help the compiler avoid generating “cleanup” code for leftover iterations. To do this we can change our variable.

Original:

```cpp
int max_loop_size; // max_loop_size is always a multiple of 4.
for (n = 0; n < max_loop_size; n++) {
    // ...
}
```

Addition of context:

```cpp
int max_loop_size_div_4 = max_loop_size / 4;
for (n = 0; n < max_loop_size_div_4 * 4; n++) {
    // ...
}
```

Your initial observation might be that these calculations are redundant. However, this change guarantees to the compiler that the loop bound is a multiple of 4. For example, if `max_loop_size` were 9, integer division truncates the fractional part, so 9/4 yields 2 instead of 2.25, and `max_loop_size_div_4 * 4` is 8. If you want to test this out, run the basic snippet of code below.

```cpp
#include <stdio.h>

int main() {
    for (int i = 1; i <= 20; ++i) {
        int result = i / 4;
        printf("Number: %d, Divided by 4: %d\n", i, result);
    }
    return 0;
}
```

## Example

The following example presumes `max_loop_size` can be any integer. Save it as `example_no_context.cpp`.

```cpp
#include <iostream>

void foo(const int* x, int max_loop_size)
{
    int sum = 0;
    for (int k = 0; k < max_loop_size; k++) {
        sum += x[k];
    }
    std::cout << "Sum: " << sum << std::endl;
}

int main() {
    int max_loop_size;
    std::cout << "Enter a value for max_loop_size: ";
    std::cin >> max_loop_size;

    int x[max_loop_size]; // variable-length array: a GCC extension in C++

    // Initialize test data
    for(int i = 0; i < max_loop_size; ++i) x[i] = i;

    foo(x, max_loop_size);

    return 0;
}
```

The version below passes the additional context to the compiler. Save it as `example_with_context.cpp`.

```cpp
#include <iostream>

void foo(const int* x, int max_loop_size_div_4)
{
    int sum = 0;
    for (int k = 0; k < max_loop_size_div_4 * 4; k++) {
        sum += x[k];
    }
    std::cout << "Sum: " << sum << std::endl;
}

int main() {
    int max_loop_size;
    std::cout << "Enter a value for max_loop_size (must be a multiple of 4): ";
    std::cin >> max_loop_size;

    int max_loop_size_div_4 = max_loop_size / 4;

    int x[max_loop_size_div_4 * 4]; // variable-length array: a GCC extension in C++

    // Initialize test data
    for(int i = 0; i < (max_loop_size_div_4*4); ++i) x[i] = i;

    foo(x, max_loop_size_div_4);

    return 0;
}
```

Compile both versions and compare the generated assembly:

```bash
g++ -O3 -fverbose-asm -march=armv8-a+simd example_with_context.cpp -S -o example_with_context.s
g++ -O3 -fverbose-asm -march=armv8-a+simd example_no_context.cpp -S -o example_no_context.s
```

The version with context produces noticeably less code, because the compiler does not need to generate a scalar cleanup loop:

```output
wc -l example_with_context.s example_no_context.s
  259 example_with_context.s
  319 example_no_context.s
```

![diff](./diff.png)
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
---
title: Loop-carried Dependencies
weight: 6

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Loop-carried Dependencies

Loop-carried dependencies are computed values that carry over from one iteration to the next, making each iteration dependent on the outcome of the previous one. When trying to vectorise code for Arm SIMD instructions (such as NEON and SVE), the goal is to perform multiple iterations concurrently. However, these dependencies force a strictly sequential execution order, preventing independent, parallel computation across iterations.

Consider the C++ loop below.

```cpp
for (int i = 0; i < 50; i++){
    A[i + 1] = A[i] + c[i];
    B[i + 1] = B[i] + A[i + 1];
}
```

In this loop, an iteration is defined as a single execution of the loop body for a specific index value `i`. Each iteration computes two new values: one for array `A` and one for array `B`. However, two loop-carried dependencies are present that hinder vectorisation.

- The first dependency is found in the computation of `A[i + 1]`. Here, the value `A[i + 1]` relies on the value of `A[i]` computed in the previous iteration. This creates a sequential chain: you must compute `A[i]` before you can compute `A[i + 1]`.

- The second dependency appears in the computation of `B[i + 1]`. This value depends on `B[i]` from the previous iteration, and it also relies on `A[i + 1]`, which itself is a product of the previous `A` value.

A reduction, such as the dot product below, is also a loop-carried dependency: `sum` depends on its value from the previous iteration. Compilers can nevertheless vectorise this pattern by keeping partial sums in separate vector lanes and combining them after the loop.

```cpp
for (int i = 0; i < 1000; i++){
    sum = sum + x[i] * y[i];
}
```
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
---
title: C++ for loop considerations for Autovectorisation

minutes_to_complete: 90

who_is_this_for: This is an advanced topic for software developers who want to write C++ for loops that the compiler can autovectorise for Arm SIMD extensions.

learning_objectives:
- Understand how compiler autovectorisation targets NEON, SVE, and SVE2
- Recognise loop constructs, such as conditional statements and loop-carried dependencies, that affect vectorisation
- Provide additional context in C++ so the compiler can avoid generating cleanup code

prerequisites:
- An Arm computer running Linux. These instructions have been tested on an AWS Graviton3 instance, which supports both NEON and SVE.

author: Julio Suarez

### Tags
skilllevels: Advanced
subjects: Performance and Architecture
armips:
- Neoverse
tools_software_languages:
- Assembly
- GCC
- Runbook

operatingsystems:
- Linux

further_reading:
- resource:
    title: Linux perf_events documentation
    link: https://www.man7.org/linux/man-pages/man2/perf_event_open.2.html
    type: documentation
- resource:
    title: PAPI documentation
    link: https://github.com/icl-utk-edu/papi/wiki
    type: documentation
- resource:
    title: Perf
    link: https://en.wikipedia.org/wiki/Perf_%28Linux%29
    type: documentation


### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1                       # _index.md always has weight of 1 to order correctly
layout: "learningpathall"       # All files under learning paths have this same wrapper
learning_path_main_page: "yes"  # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
---
next_step_guidance: Learn about Thread Sanitizer and more memory ordering options.

recommended_path: /learning-paths/cross-platform/intrinsics

further_reading:
- resource:
    title: C++ Memory Order Reference Manual
    link: https://en.cppreference.com/w/cpp/atomic/memory_order
    type: documentation
- resource:
    title: Thread Sanitizer Manual
    link: https://github.com/google/sanitizers/wiki/threadsanitizercppmanual
    type: documentation


# ================================================================================
# FIXED, DO NOT MODIFY
# ================================================================================
weight: 21                  # set to always be larger than the content in this path, and one more than 'review'
title: "Next Steps"         # Always the same
layout: "learningpathall"   # All files under learning paths have this same wrapper
---