Commit 277a452

Author: Your Name
Revert "removed additional LP"
This reverts commit e072136.
1 parent e072136 commit 277a452

9 files changed

+392
-0
lines changed

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
---
title: Introduction
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Introduction

In Arm-based systems, it is crucial to consider the platform’s architecture and compiler capabilities when writing a C++ for loop. By understanding how compilers automatically vectorize code, you can better organize loops to leverage advanced SIMD features for performance. Compiler autovectorization inspects loops at compile time, generating instructions that process multiple data elements in parallel. Depending on compiler flags and data types, the resulting machine code may use different extensions to harness the underlying hardware.

- **NEON**: a 128-bit SIMD extension that processes data in parallel, offering improved performance for single-precision floating-point operations and integer workloads.

- **SVE** (Scalable Vector Extension): introduces variable-length vectors to provide scalable performance across different Arm implementations, enabling a flexible approach to SIMD.

- **SVE2**: builds upon SVE by adding more instructions for integer, fixed-point, and complex workloads, broadening the range of vectorizable code to enable general data processing.

While hand-written assembly and Arm intrinsics can yield further optimizations, they are beyond the scope of this learning path. Instead, we will concentrate on C++ constructs that help the compiler generate efficient vectorized instructions.

## Environment Setup

This learning path uses an AWS Graviton3 instance, which is based on the Arm Neoverse V1 core and supports both NEON and SVE. If you are unfamiliar with using cloud instances, please refer to the [getting started guide](TODO).

```bash
sudo apt update
sudo apt install g++
```

Please note: there will be slight differences in performance when using different versions of compilers.

## Trivial Vectorisable Example

Data-level parallelism (DLP) refers to the capability of modern CPUs, including Arm architectures, to perform operations on multiple data points simultaneously. In practice, this means the compiler can identify loops or repeated calculations on array elements and convert them into a smaller set of vectorized instructions. By grouping data elements, the compiler leverages hardware instructions that operate on multiple values at once, reducing the total number of instruction cycles. This transformation is key to achieving high-performance code on Arm-based systems, where the NEON, SVE, and SVE2 extensions are used to efficiently handle tasks that involve large arrays or complex data processing.

Copy and paste the C++ code snippet below into a file named `trivial_vector.cpp`.

```cpp
#include <iostream>
#include <vector>

using namespace std;

void vector_add(const vector<int>& a, const vector<int>& b, vector<int>& c) {
    const size_t size = a.size();
    for (size_t i = 0; i < size; ++i) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int size = 100;
    vector<int> a(size, 8);
    vector<int> b(size, 2);
    vector<int> c(size, 0);

    vector_add(a, b, c);

    for (int i = 0; i < size; ++i) {
        cout << c[i] << " ";
    }
    cout << endl;

    return 0;
}
```

The snippet above performs an element-wise addition of two vectors, `a` and `b`, each of size 100. Notable things to observe are:

- Fixed loop size of 100
- No conditional statements within the loop
- Fixed data type of `int`
Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
---
title: Providing additional information
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Conditional Statements

Conditional statements within a for loop allow certain iterations to execute only when specified conditions are met. This mechanism is crucial for processing arrays or vectors selectively, as it lets you skip unnecessary computations or handle edge cases without disrupting the flow of the entire loop.

Arm’s Scalable Vector Extension (SVE) introduces a predicate (mask) to manage these conditional operations on a per-element basis. Instead of processing entire vectors uniformly, SVE uses the mask to enable or disable specific lanes dynamically. This approach is especially powerful for loops whose iteration counts are not exact multiples of the vector length, as it avoids wasted operations. Additionally, SVE supports strided access, meaning it can load or store elements separated by a constant stride in memory, improving efficiency in scenarios like processing slices of arrays.

In contrast, Arm NEON relies on packing data into fixed-width 128-bit registers. Elements are grouped together (packed) and processed simultaneously, but this can lead to overhead when handling irregular loop counts or accessing data with non-contiguous memory layouts. By comparison, SVE’s mask-based approach and flexible vector lengths provide more fine-grained control and higher efficiency for diverse data patterns.

To demonstrate this, the C++ code snippet below initialises two arrays of 128 integers. For even indices, `a[i]` is set to the index and `b[i]` to 1; for odd indices, both are set to 0. Save it as `pred_loop.cpp`.

```cpp
#include <iostream>

int reduce(int *a, int *b, long N);

int main(){
    int a[128];
    int b[128];
    for (int i = 0; i < 128; ++i){
        if (i % 2 == 0){
            a[i] = i;
            b[i] = 1;
        }
        else {
            a[i] = 0;
            b[i] = 0;
        }
    }
    long N = 128;
    int s = reduce(a, b, N);
    std::cout << s << std::endl;
    return 0;
}

int reduce(int *a, int *b, long N){
    long i;
    int s = 0;
    for (i = 0; i < N; ++i){
        if (b[i]){
            s += a[i];
        }
    }
    return s;
}
```

This example can be vectorised with SVE predication. Run the commands below to generate the annotated assembly for both NEON (simd) and SVE.

```bash
g++ -march=armv8-a+simd -fverbose-asm -O3 pred_loop.cpp -S -o neon_basic.s
g++ -march=armv8-a+sve -O3 -fverbose-asm pred_loop.cpp -S -o sve_basic.s
```

Passing the `-fverbose-asm` flag annotates the assembly with the corresponding lines of source code.

The SVE assembly uses `st1w` stores governed by the predicate register `p0`, whereas the NEON implementation has no equivalent per-lane masking.

```output
// pred_loop.cpp:9: if (i % 2 == 0){
...
st1w z2.s, p0, [x2, x0, lsl 2] // vect_patt_104.35, loop_mask_124, MEM <vector([4,4]) int> [(int *)_67 + ivtmp.63_15 * 4]
st1w z0.s, p0, [x1, x0, lsl 2] // vect_cstore_25.34, loop_mask_124, MEM <vector([4,4]) int> [(int *)_22 + ivtmp.63_15 * 4]
```

Inspecting the assembly, we can see the `UADDV` instruction being used for the reduction operation.

![reduction_operation](./reduction.png)
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
---
title: Sparse Addressing
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

9+
## Sparse Addressing (indirect addressing)
10+
11+
Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
---
title: Adding Context
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

9+
## Adding Context
10+
11+
In some situations, you may already know that a loop will run a multiple of a constant. Making this explicit can help the compiler avoid generating “cleanup” code for leftover iterations. To do this we can change our variable.

Original:

```cpp
int max_loop_size; // max_loop_size is always a multiple of 4.
for (n = 0; n < max_loop_size; n++) {
    // ...
}
```

Addition of context:

```cpp
int max_loop_size_div_4 = max_loop_size / 4;
for (n = 0; n < max_loop_size_div_4 * 4; n++) {
    // ...
}
```

Your initial observation might be that these calculations are redundant. However, this change guarantees to the compiler that the loop bound is a multiple of 4. For example, if `max_loop_size` were 9, integer division truncates the fractional part, so 9/4 yields 2 instead of 2.25, and `max_loop_size_div_4 * 4` is 8. If you want to test this out, run the basic snippet of code below.

```cpp
#include <stdio.h>

int main() {
    for (int i = 1; i <= 20; ++i) {
        int result = i / 4;
        printf("Number: %d, Divided by 4: %d\n", i, result);
    }
    return 0;
}
```

## Example

The following example presumes `max_loop_size` can be any integer. Save it as `example_no_context.cpp`.

```cpp
#include <iostream>

void foo(const int* x, int max_loop_size)
{
    int sum = 0;
    for (int k = 0; k < max_loop_size; k++) {
        sum += x[k];
    }
    std::cout << "Sum: " << sum << std::endl;
}

int main() {
    int max_loop_size;
    std::cout << "Enter a value for max_loop_size: ";
    std::cin >> max_loop_size;

    int x[max_loop_size]; // variable-length array: a GCC extension in C++

    // Initialize test data
    for(int i = 0; i < max_loop_size; ++i) x[i] = i;

    foo(x, max_loop_size);

    return 0;
}
```

The version below passes the additional context to the compiler. Save it as `example_with_context.cpp`.

```cpp
#include <iostream>

void foo(const int* x, int max_loop_size_div_4)
{
    int sum = 0;
    for (int k = 0; k < max_loop_size_div_4 * 4; k++) {
        sum += x[k];
    }
    std::cout << "Sum: " << sum << std::endl;
}

int main() {
    int max_loop_size;
    std::cout << "Enter a value for max_loop_size (must be a multiple of 4): ";
    std::cin >> max_loop_size;

    int max_loop_size_div_4 = max_loop_size / 4;

    int x[max_loop_size_div_4 * 4]; // variable-length array: a GCC extension in C++

    // Initialize test data
    for(int i = 0; i < (max_loop_size_div_4*4); ++i) x[i] = i;

    foo(x, max_loop_size_div_4);

    return 0;
}
```

Compile both versions and compare the generated assembly:

```bash
g++ -O3 -fverbose-asm -march=armv8-a+simd example_with_context.cpp -S -o example_with_context.s
g++ -O3 -fverbose-asm -march=armv8-a+simd example_no_context.cpp -S -o example_no_context.s
```

The version with context produces noticeably less code, because the compiler does not need to generate a scalar cleanup loop:

```output
wc -l example_with_context.s example_no_context.s
  259 example_with_context.s
  319 example_no_context.s
```

![diff](./diff.png)
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
---
title: Loop-carried Dependencies
weight: 6

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Loop-carried Dependencies

Loop-carried dependencies are computed values that carry over from one iteration to the next, making each iteration dependent on the outcome of the previous one. When trying to vectorise code for Arm SIMD instructions (such as NEON and SVE), the goal is to perform multiple iterations concurrently. However, these dependencies force a strictly sequential execution order, preventing independent, parallel computation across iterations.

Consider the C++ loop below.

```cpp
for (int i = 0; i < 50; i++){
    A[i + 1] = A[i] + c[i];
    B[i + 1] = B[i] + A[i + 1];
}
```

In this loop, an iteration is defined as a single execution of the loop body for a specific index value `i`. Each iteration computes two new values: one for array `A` and one for array `B`. However, two loop-carried dependencies are present that hinder vectorisation.

- The first dependency is found in the computation of `A[i + 1]`. Here, the value `A[i + 1]` relies on the value of `A[i]` computed in the previous iteration. This creates a sequential chain: you must compute `A[i]` before you can compute `A[i + 1]`.

- The second dependency appears in the computation of `B[i + 1]`. This value depends on `B[i]` from the previous iteration, and it also relies on `A[i + 1]`, which itself is a product of the previous `A` value.

A reduction, such as the dot product below, is also a loop-carried dependency: `sum` depends on its value from the previous iteration. Compilers can nevertheless vectorise this pattern by keeping partial sums in separate vector lanes and combining them after the loop.

```cpp
for (int i = 0; i < 1000; i++){
    sum = sum + x[i] * y[i];
}
```
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
---
title: C++ for loop considerations for Autovectorisation

minutes_to_complete: 90

who_is_this_for: This is an advanced topic for software developers who want to write C++ for loops that the compiler can autovectorise for Arm SIMD extensions.

learning_objectives:
- Understand how compiler autovectorisation targets NEON, SVE, and SVE2
- Recognise loop constructs, such as conditional statements and loop-carried dependencies, that affect vectorisation
- Provide additional context in C++ so the compiler can avoid generating cleanup code

prerequisites:
- An Arm computer running Linux. These instructions have been tested on an AWS Graviton3 instance, which supports both NEON and SVE.

author: Julio Suarez

### Tags
skilllevels: Advanced
subjects: Performance and Architecture
armips:
- Neoverse
tools_software_languages:
- Assembly
- GCC
- Runbook

operatingsystems:
- Linux

further_reading:
- resource:
    title: Linux perf_events documentation
    link: https://www.man7.org/linux/man-pages/man2/perf_event_open.2.html
    type: documentation
- resource:
    title: PAPI documentation
    link: https://github.com/icl-utk-edu/papi/wiki
    type: documentation
- resource:
    title: Perf
    link: https://en.wikipedia.org/wiki/Perf_%28Linux%29
    type: documentation


### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1                       # _index.md always has weight of 1 to order correctly
layout: "learningpathall"       # All files under learning paths have this same wrapper
learning_path_main_page: "yes"  # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
---
next_step_guidance: Learn about Thread Sanitizer and more memory ordering options.

recommended_path: /learning-paths/cross-platform/intrinsics

further_reading:
- resource:
    title: C++ Memory Order Reference Manual
    link: https://en.cppreference.com/w/cpp/atomic/memory_order
    type: documentation
- resource:
    title: Thread Sanitizer Manual
    link: https://github.com/google/sanitizers/wiki/threadsanitizercppmanual
    type: documentation


# ================================================================================
# FIXED, DO NOT MODIFY
# ================================================================================
weight: 21                  # set to always be larger than the content in this path, and one more than 'review'
title: "Next Steps"         # Always the same
layout: "learningpathall"   # All files under learning paths have this same wrapper
---