Commit cc64dd9

Merge pull request #1833 from jasonrandrews/review
Review C++ loop optimization Learning Path: update titles, enhance co…
2 parents 733b0cf + b4ef798 commit cc64dd9

4 files changed

+130
-37
lines changed

content/learning-paths/cross-platform/cpp-loop-size-context/Example.md

Lines changed: 16 additions & 8 deletions
Original file line number | Diff line number | Diff line change
@@ -1,16 +1,18 @@
11
---
2-
title: Example
2+
title: Baseline loop implementation
33
weight: 3
44

55
### FIXED, DO NOT MODIFY
66
layout: learningpathall
77
---
88

9-
## Example
9+
## Understand the baseline loop
1010

11-
The following `C++` snippet takes user input as the loop size so that the loop size, `max_loop_size`, is only known at runtime. This initialises an array of size, , `max_loop_size` with the value for each element corresponding to the index position. The function, `foo`, loops through each element to print out the sum of all elements.
11+
The following C++ program reads the loop size, `max_loop_size`, from user input, so the value is known only at runtime. The program initializes an array of size `max_loop_size`, setting each element to its index position.
1212

13-
Copy the snippet below into a file named, `no-context.cpp`.
13+
The function `foo()` loops through each element to print out the sum of all elements. Because the compiler has no boundary information, it must generate conservative code that works for any loop size.
14+
15+
Use a text editor to copy the code below into a file named `no-context.cpp`.
1416

1517
```cpp
1618
#include <iostream>
@@ -48,18 +50,24 @@ int main() {
4850
}
4951
```
5052
51-
Compiling using the following command.
53+
Compile the program using the following command:
54+
55+
```bash
56+
g++ -O3 -march=armv8-a+simd no-context.cpp -o no-context
57+
```
58+
59+
Run the example with 40000 as the input:
5260

5361
```bash
54-
g++ -O3 -march=armv8-a+simd no_context.cpp -o no_context
62+
./no-context
5563
```
5664

57-
Running the example with the number 4000 leads to the following results. You will see runtime variability depending on which platform you run this on.
65+
You see the output below. Your runtime will vary depending on the computer you are using.
5866

5967
```output
60-
./no_context
6168
Enter a value for max_loop_size (must be a multiple of 4): 40000
6269
Sum: 799980000
6370
Time taken by foo: 138100 nanoseconds
6471
```
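
A figure such as `Time taken by foo` can be collected with a small timing wrapper. The following is a minimal sketch, not the code from `no-context.cpp` (which is elided in this diff). The helper name `time_in_nanoseconds` and the choice of `std::chrono::steady_clock` are assumptions made for illustration.

```cpp
#include <chrono>

// Hypothetical helper (not from the Learning Path): runs `work` once and
// returns the elapsed wall-clock time in nanoseconds.
template <typename Func>
long long time_in_nanoseconds(Func&& work) {
    // steady_clock is monotonic, so the measured interval is never negative.
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

You could call it as `time_in_nanoseconds([&] { foo(...); })`, passing whatever arguments your version of `foo()` takes. A single measurement is noisy, so repeating the run several times gives a more reliable comparison.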
6572

73+
Continue to the next section to see how to use developer knowledge of loops to improve performance.
Lines changed: 37 additions & 8 deletions
@@ -1,25 +1,54 @@
11
---
2-
title: Setup
2+
title: Understand developer knowledge for compiler optimizations
33
weight: 2
44

55
### FIXED, DO NOT MODIFY
66
layout: learningpathall
77
---
88

9-
## Introduction
9+
## What is developer knowledge?
1010

11-
Often, the programmer has deeper insights into their software's behavior and its inputs than the compiler does. For instance, if a loop's size is determined at runtime, the compiler must conservatively handle the possibility of variable sizes, potentially limiting optimization opportunities. However, a developer might know more about the application's runtime characteristics—such as the fact that the loop size always adheres to specific constraints, like being a multiple of a particular number.
11+
Often, software developers have deeper insights into their software's behavior and its inputs than the compiler does. This knowledge represents a valuable optimization opportunity that can significantly improve performance when properly communicated to the compiler as boundary information.
1212

13-
To illustrate how you can explicitly provide this valuable context to the compiler, we'll walk through a simple C++ example.
13+
### The compiler's challenge
1414

15-
## Setup
15+
When a loop's size is determined at runtime, the compiler faces a dilemma:
16+
- It must generate code that works correctly for any possible input size
17+
- It cannot make assumptions that might enable more aggressive optimizations
18+
- It must take a conservative approach to ensure correctness across all scenarios
1619

17-
In this learning path, I will be demonstrating the examples using an Arm-based `r7g.large` instance from AWS; however, you're welcome to follow along using any Arm-based machine that suits your environment or preference.
20+
### The developer's advantage
1821

19-
To get started, you'll first need to install the `g++` compiler on your system. Use the following commands as a guide, adjusting them accordingly based on the operating system or distribution you're working with.
22+
As a developer, you often know more about your application's runtime characteristics than the compiler can infer, such as:
23+
- Loop sizes that always follow specific patterns (like being multiples of 4, 8, or 16)
24+
- Input constraints that are enforced elsewhere in your application
25+
- Data alignment guarantees that enable vectorization opportunities
26+
27+
In this Learning Path, you'll learn how to explicitly communicate this valuable context to the compiler, enabling it to generate more efficient code.
28+
29+
## Environment setup
30+
31+
You can use any Arm Linux system to run the example application and learn about loop optimization. The only requirement is to install the `g++` compiler.
32+
33+
### Installing the compiler
34+
35+
If you are running Ubuntu or another Debian-based Linux distribution, you can use the commands below to install the compiler:
2036

2137
```bash
2238
sudo apt update
23-
sudo apt install g++
39+
sudo apt install g++ -y
40+
```
41+
42+
For other Linux distributions, use the appropriate package manager to install `g++`.
43+
44+
### Compiler version
45+
46+
This Learning Path uses standard C++ features and optimization techniques that work with any recent C++ compiler.
47+
48+
You can check your version using:
49+
50+
```bash
51+
g++ --version
2452
```
2553

54+
Continue to the next section to learn about an example application that demonstrates how to use developer knowledge for loop boundary information.

content/learning-paths/cross-platform/cpp-loop-size-context/_index.md

Lines changed: 25 additions & 9 deletions
@@ -1,35 +1,51 @@
11
---
2-
title: Learn to Optimize C++ Loops with Size Context
2+
title: Boost C++ performance by optimizing loops with boundary information
3+
4+
draft: true
5+
cascade:
6+
draft: true
37

48
minutes_to_complete: 15
59

6-
who_is_this_for: C++ developer who want to improve the runtime of for loops with basic insider knowledge of the loop size
10+
who_is_this_for: This is an introductory topic for C++ developers who want to improve the runtime of loops using existing knowledge of the loop size.
711

812
learning_objectives:
9-
- Learn how to add preexisting knowledge of loop sizes to for loops
13+
- Learn how to communicate loop size constraints to the compiler for better optimization.
14+
- Understand how providing compile-time context can improve runtime performance.
15+
- Implement techniques to express loop boundaries that enable better code generation.
16+
- Compare and analyze the performance impact of providing loop size context.
1017

1118
prerequisites:
12-
- Access to an Arm-based machine / instance
13-
- Basic understanding of C++
19+
- An Arm computer running Linux. You can also use a virtual machine from a [cloud service provider](/learning-paths/servers-and-cloud-computing/csp/).
1420

1521
author: Kieran Hejmadi
1622

1723
### Tags
1824
skilllevels: Introductory
19-
subjects: ML
25+
subjects: Performance and Architecture
2026
armips:
2127
- Neoverse
28+
- Cortex-A
2229
tools_software_languages:
2330
- C++
31+
- Runbook
2432
operatingsystems:
2533
- Linux
2634

27-
35+
### Cross-platform metadata only
36+
shared_path: true
37+
shared_between:
38+
- servers-and-cloud-computing
39+
- laptops-and-desktops
2840

2941
further_reading:
3042
- resource:
31-
title: PLACEHOLDER MANUAL
32-
link: PLACEHOLDER MANUAL LINK
43+
title: GCC Optimization Options Documentation
44+
link: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
45+
type: documentation
46+
- resource:
47+
title: LLVM Loop Vectorization Guide
48+
link: https://llvm.org/docs/Vectorizers.html
3349
type: documentation
3450

3551

content/learning-paths/cross-platform/cpp-loop-size-context/providing-inside-knowledge.md

Lines changed: 52 additions & 12 deletions
@@ -1,30 +1,34 @@
11
---
2-
title: Adding Inside Knowledge
2+
title: Optimize loops using boundary information
33
weight: 4
44

55
### FIXED, DO NOT MODIFY
66
layout: learningpathall
77
---
88

9-
## Adding Inside Knowledge
9+
## How can I add developer knowledge to optimize performance?
1010

11-
To explicitly inform the compiler that our input will always be a multiple of 4, we can rewrite the loop size calculation as follows:
11+
To ensure the loop size is always a multiple of 4 and communicate this boundary information to the compiler, you can rewrite the loop size calculation as follows:
1212

1313
```output
1414
((max_loop_size/4)*4)
1515
```
1616

17-
At first glance, this calculation might seem mathematically redundant. However, since the expression `(max_loop_size/4)` is an integer division, it truncates the result, effectively guaranteeing that `(max_loop_size/4)*4` will always yield a number divisible by 4. The compiler can pick up on this information and optimise accordingly.
17+
At first glance, this calculation looks mathematically redundant. However, since the expression `(max_loop_size/4)` is an integer division, it truncates the result, effectively guaranteeing that `(max_loop_size/4)*4` will always yield a number divisible by 4. This pattern allows the compiler to recognize and optimize for this specific constraint.
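
The truncation can be checked in isolation. The following is a minimal sketch. The helper `round_down_to_multiple_of_4` and the sample values are hypothetical and do not appear in the Learning Path.

```cpp
#include <iostream>

// Hypothetical helper (not from the Learning Path): rounds n down to a
// multiple of 4. Integer division truncates, so (n / 4) * 4 discards any
// remainder left over after dividing by 4.
constexpr int round_down_to_multiple_of_4(int n) {
    return (n / 4) * 4;
}

// Prints a few sample values to show the effect of the truncation.
void print_rounding_examples() {
    const int samples[] = {40000, 40001, 40003, 7};
    for (int n : samples) {
        std::cout << n << " -> " << round_down_to_multiple_of_4(n) << '\n';
    }
}
```

An input that is already a multiple of 4 is unchanged, while any other input is rounded down, so the expression only alters the loop count when the input breaks the stated constraint.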
1818

19-
As slightly easier to read method that avoids confusion when passing arguments is to divide the variable and rename before it is passed in. For example.
19+
This optimization is particularly effective because it enables the compiler to use SIMD (Single Instruction, Multiple Data) vectorization. When the compiler knows the loop count is a multiple of 4, it can process four elements at once using vector registers, significantly improving performance on Arm processors.
20+
21+
A slightly more readable approach, which avoids confusion when passing arguments, is to divide the variable by 4 and rename it before passing it in.
22+
23+
For example:
2024

2125
```output
2226
(max_loop_size_div_4 * 4)
2327
```
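
As a rough illustration of this naming pattern, the sketch below passes the pre-divided value into a summing function. The function name `sum_first_elements` and its signature are assumptions for illustration. They are not the `foo()` from `context.cpp`, which is elided in this diff.

```cpp
#include <vector>

// Hypothetical function (name and signature are illustrative only).
// The caller passes the loop size already divided by 4, so the loop bound
// (max_loop_size_div_4 * 4) is visibly a multiple of 4 to the compiler.
// Precondition: data holds at least max_loop_size_div_4 * 4 elements.
long long sum_first_elements(const std::vector<int>& data, int max_loop_size_div_4) {
    long long sum = 0;
    for (int i = 0; i < max_loop_size_div_4 * 4; ++i) {
        sum += data[i];
    }
    return sum;
}
```

A caller would write something like `sum_first_elements(data, max_loop_size / 4)`. Whether the compiler actually generates different code depends on the compiler version and flags, so it is worth checking the assembly output.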
2428

25-
## Improved Example
29+
## Try an improved example
2630

27-
Copy the snippet below and paste into a file named `context.cpp`.
31+
Use a text editor to copy the code below and paste it into a file named `context.cpp`.
2832

2933
```cpp
3034
#include <iostream>
@@ -63,23 +67,59 @@ int main() {
6367
}
6468
```
6569
66-
Again compile with the same compiler flags.
70+
Compile the new program with the same flags:
6771
6872
```bash
6973
g++ -O3 -march=armv8-a+simd context.cpp -o context
7074
```
7175

72-
```output
76+
Run the new example with the same input, 40000:
77+
78+
```bash
7379
./context
80+
```
81+
82+
You see the new output:
83+
84+
```output
7485
Enter a value for max_loop_size (must be a multiple of 4): 40000
7586
Sum: 799980000
7687
Time taken by foo: 24650 nanoseconds
7788
```
78-
In this particular run, the time taken has significantly reduced compared to our previous example.
89+
90+
The time taken is significantly lower than in the previous version: 24650 nanoseconds compared with 138100 nanoseconds, roughly 5.6 times faster in this run. This performance improvement is a direct result of providing boundary information to the compiler.
91+
92+
## Performance considerations
93+
94+
While this optimization technique provides significant performance benefits, it's important to note that it assumes the input is a multiple of 4. In a real-world application, you would need to validate user input or handle cases where the input isn't a multiple of 4.
95+
96+
For example:
97+
98+
```cpp
99+
// Validate input
100+
if (max_loop_size % 4 != 0) {
101+
std::cerr << "Error: Input must be a multiple of 4" << std::endl;
102+
return 1;
103+
}
104+
```
105+
106+
Alternatively, you could pad the array to ensure its size is always a multiple of 4, or handle the remainder elements separately after processing the vectorized portion of the array. The approach you choose depends on your specific application requirements and constraints.
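
The remainder-handling approach can be sketched as follows. This is one possible arrangement under stated assumptions, not code from the Learning Path. The function name `sum_with_remainder` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical function (not from the Learning Path): sums every element,
// splitting the work into a main loop whose trip count is a multiple of 4
// and a short scalar loop for the 0 to 3 leftover elements.
long long sum_with_remainder(const std::vector<int>& data) {
    const std::size_t n = data.size();
    const std::size_t main_count = (n / 4) * 4;  // largest multiple of 4 <= n

    long long sum = 0;

    // Main portion: the bound is a multiple of 4 by construction.
    for (std::size_t i = 0; i < main_count; ++i) {
        sum += data[i];
    }

    // Remainder: at most 3 elements, handled one at a time.
    for (std::size_t i = main_count; i < n; ++i) {
        sum += data[i];
    }

    return sum;
}
```

This version accepts any input size, so no validation error is needed. The cost is a small amount of extra code for the tail.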
79107

80108
## Comparison
81109

82-
To compare we will use compiler explorer to see the assembly [here](https://godbolt.org/z/nvx4j1vTK).
110+
You can compare the differences in [Compiler Explorer](https://godbolt.org/z/nvx4j1vTK).
111+
112+
The assembly shows fewer lines corresponding to the function `foo()` when context is added. Given the context, the compiler can remove conditional checks and cleanup code that would otherwise handle loop counts that are not a multiple of 4.
113+
114+
When examining the assembly output in Compiler Explorer, look for these key differences:
115+
116+
1. **Vector instructions**: In the optimized version, look for instructions like `ld1` (load to vector register) and `addv` (add across vector) which indicate SIMD operations.
117+
118+
2. **Loop structure**: The optimized version will likely have fewer instructions inside the main loop body as multiple elements are processed at once.
119+
120+
3. **Unrolling factor**: Notice how the compiler might unroll the loop to process multiple elements in each iteration, reducing branch overhead.
121+
122+
4. **Register usage**: The optimized version will make more efficient use of vector registers (v0-v31) rather than just scalar registers.
83123

84-
As the assembly shows we have fewer lines of assembly corresponding to the function `foo` when context is added. This is because the compiler can optimise the conditional checking and any clean up code given the context.
124+
These assembly-level differences directly translate to the performance improvements you observed in the execution time.
85125
