Commit a2b146e

Merge pull request #1751 from kieranhejmadi01/cpp-loop-size-context
Learn to Optimize C++ Loops with Size Context - LP
2 parents cfcf859 + 762f3ae commit a2b146e

5 files changed

+224
-0
lines changed

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
---
title: Example
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Example

The following C++ snippet reads the loop size from user input, so the loop size, `max_loop_size`, is known only at runtime. The program initialises an array of `max_loop_size` elements, setting each element to its index position. The function `foo` then loops through the array and prints the sum of all elements.

Copy the snippet below into a file named `no_context.cpp`.

```cpp
#include <iostream>
#include <chrono>
#include <vector>

void foo(const int* x, int max_loop_size)
{
    int sum = 0;
    for (int k = 0; k < max_loop_size; k++) {
        sum += x[k];
    }
    std::cout << "Sum: " << sum << std::endl;
}

int main() {
    int max_loop_size;
    std::cout << "Enter a value for max_loop_size (must be a multiple of 4): ";
    std::cin >> max_loop_size;

    // std::vector is used because variable-length arrays are not standard C++
    std::vector<int> x(max_loop_size);
    // Initialise test data
    for (int i = 0; i < max_loop_size; ++i) x[i] = i;

    // Start timing
    auto start = std::chrono::high_resolution_clock::now();
    foo(x.data(), max_loop_size);
    // Stop timing
    auto end = std::chrono::high_resolution_clock::now();

    // Calculate and display the elapsed time
    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << "Time taken by foo: " << duration << " nanoseconds" << std::endl;

    return 0;
}
```

Compile the program using the following command.

```bash
g++ -O3 -march=armv8-a+simd no_context.cpp -o no_context
```

Running the example with an input of 40000 produces the following results. Expect the measured time to vary depending on the platform you run it on.

```output
./no_context
Enter a value for max_loop_size (must be a multiple of 4): 40000
Sum: 799980000
Time taken by foo: 138100 nanoseconds
```

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
---
title: Setup
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Introduction

Often, the programmer has deeper insight into the software's behavior and its inputs than the compiler does. For instance, if a loop's size is determined at runtime, the compiler must conservatively handle every possible size, which can limit optimization opportunities. However, a developer might know more about the application's runtime characteristics, such as the fact that the loop size always satisfies a specific constraint, like being a multiple of a particular number.

To illustrate how you can explicitly provide this context to the compiler, this learning path walks through a simple C++ example.

## Setup

The examples in this learning path were run on an Arm-based `r7g.large` instance from AWS; however, you can follow along on any Arm-based machine that suits your environment.

To get started, install the `g++` compiler on your system. The following commands apply to Debian-based distributions such as Ubuntu; adjust them for the operating system or distribution you are working with.

```bash
sudo apt update
sudo apt install g++
```
Lines changed: 41 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,41 @@
---
title: Learn to Optimize C++ Loops with Size Context

minutes_to_complete: 15

who_is_this_for: C++ developers who want to improve the runtime of for loops using prior knowledge of the loop size

learning_objectives:
    - Learn how to pass preexisting knowledge of loop sizes to the compiler in for loops

prerequisites:
    - Access to an Arm-based machine / instance
    - Basic understanding of C++

author: Kieran Hejmadi

### Tags
skilllevels: Introductory
subjects: ML
armips:
    - Neoverse
tools_software_languages:
    - C++
operatingsystems:
    - Linux

further_reading:
    - resource:
        title: PLACEHOLDER MANUAL
        link: PLACEHOLDER MANUAL LINK
        type: documentation

### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,8 @@
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
Lines changed: 85 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,85 @@
---
title: Adding Inside Knowledge
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Adding Inside Knowledge

To explicitly inform the compiler that the input will always be a multiple of 4, you can rewrite the loop size calculation as follows:

```cpp
((max_loop_size/4)*4)
```

At first glance, this calculation might seem mathematically redundant. However, because `max_loop_size/4` is an integer division, the result is truncated, which guarantees that `(max_loop_size/4)*4` is always divisible by 4. The compiler can pick up on this information and optimise accordingly. Note that if `max_loop_size` is not actually a multiple of 4, the truncation rounds the loop bound down and the trailing elements are not processed.

A slightly easier-to-read method, which also avoids confusion when passing arguments, is to divide the variable by 4 and rename it before it is passed in. For example:

```cpp
(max_loop_size_div_4 * 4)
```

## Improved Example

Copy the snippet below and paste it into a file named `context.cpp`.

```cpp
#include <iostream>
#include <chrono>
#include <vector>

void foo(const int* x, int max_loop_size_div_4)
{
    int sum = 0;
    // The loop bound is visibly a multiple of 4, which the compiler can exploit
    for (int k = 0; k < max_loop_size_div_4 * 4; k++) {
        sum += x[k];
    }
    std::cout << "Sum: " << sum << std::endl;
}

int main() {
    int max_loop_size;
    std::cout << "Enter a value for max_loop_size (must be a multiple of 4): ";
    std::cin >> max_loop_size;

    int max_loop_size_div_4 = max_loop_size / 4;
    // std::vector is used because variable-length arrays are not standard C++
    std::vector<int> x(max_loop_size);
    // Initialise test data
    for (int i = 0; i < max_loop_size; ++i) x[i] = i;

    // Start timing
    auto start = std::chrono::high_resolution_clock::now();
    foo(x.data(), max_loop_size_div_4);
    // Stop timing
    auto end = std::chrono::high_resolution_clock::now();

    // Calculate and display the elapsed time
    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << "Time taken by foo: " << duration << " nanoseconds" << std::endl;

    return 0;
}
```

Compile again with the same compiler flags.

```bash
g++ -O3 -march=armv8-a+simd context.cpp -o context
```

```output
./context
Enter a value for max_loop_size (must be a multiple of 4): 40000
Sum: 799980000
Time taken by foo: 24650 nanoseconds
```
In this particular run, the time taken by `foo` dropped from 138100 nanoseconds in the previous example to 24650 nanoseconds.

## Comparison

To compare the two versions, view the generated assembly in Compiler Explorer [here](https://godbolt.org/z/nvx4j1vTK).

As the assembly shows, the function `foo` compiles to fewer lines of assembly when the context is added. This is because, given the context, the compiler can simplify the conditional checks and omit clean-up code that it would otherwise need.
