@@ -0,0 +1,65 @@
---
title: Example
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Example

The following C++ snippet reads the loop size from user input, so the loop size, `max_loop_size`, is only known at runtime. It initialises an array of `max_loop_size` elements, setting each element to its index position. The function, `foo`, loops through each element and prints out the sum of all elements.

Copy the snippet below into a file named `no_context.cpp`.

```cpp
#include <iostream>
#include <chrono>

void foo(const int* x, int max_loop_size)
{
    int sum = 0;
    // The loop bound is only known at runtime
    for (int k = 0; k < max_loop_size; k++) {
        sum += x[k];
    }
    std::cout << "Sum: " << sum << std::endl;
}

int main() {
    int max_loop_size;
    std::cout << "Enter a value for max_loop_size (must be a multiple of 4): ";
    std::cin >> max_loop_size;

    // Variable-length array: a GNU extension in C++, accepted by g++
    int x[max_loop_size];
    // Initialise test data
    for (int i = 0; i < max_loop_size; ++i) x[i] = i;

    // Start timing
    auto start = std::chrono::high_resolution_clock::now();
    foo(x, max_loop_size);
    // Stop timing
    auto end = std::chrono::high_resolution_clock::now();

    // Calculate and display the elapsed time
    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << "Time taken by foo: " << duration << " nanoseconds" << std::endl;

    return 0;
}
```

Compile using the following command.

```bash
g++ -O3 -march=armv8-a+simd no_context.cpp -o no_context
```

Running the example with the number 40000 leads to the following results. The measured time will vary depending on the platform you run this on.

```output
./no_context
Enter a value for max_loop_size (must be a multiple of 4): 40000
Sum: 799980000
Time taken by foo: 138100 nanoseconds
```

@@ -0,0 +1,25 @@
---
title: Setup
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Introduction

Often, the programmer has deeper insight into their software's behavior and its inputs than the compiler does. For instance, if a loop's size is determined at runtime, the compiler must conservatively handle every possible size, which can limit optimization opportunities. However, a developer might know more about the application's runtime characteristics, such as the loop size always being a multiple of a particular number.

To illustrate how you can explicitly provide this valuable context to the compiler, we'll walk through a simple C++ example.

## Setup

This learning path demonstrates the examples on an Arm-based `r7g.large` instance from AWS, but you can follow along on any Arm-based machine that suits your environment or preference.

To get started, install the `g++` compiler on your system. The following commands are for Debian-based distributions such as Ubuntu; adjust them for the operating system or distribution you're working with.

```bash
sudo apt update
sudo apt install g++
```

@@ -0,0 +1,41 @@
---
title: Learn to Optimize C++ Loops with Size Context

minutes_to_complete: 15

who_is_this_for: C++ developers who want to improve the runtime of for loops using prior knowledge of the loop size

learning_objectives:
- Learn how to add preexisting knowledge of loop sizes to for loops

prerequisites:
- Access to an Arm-based machine / instance
- Basic understanding of C++

author: Kieran Hejmadi

### Tags
skilllevels: Introductory
subjects: ML
armips:
- Neoverse
tools_software_languages:
- C++
operatingsystems:
- Linux



further_reading:
    - resource:
        title: PLACEHOLDER MANUAL
        link: PLACEHOLDER MANUAL LINK
        type: documentation


### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
@@ -0,0 +1,8 @@
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
@@ -0,0 +1,85 @@
---
title: Adding Inside Knowledge
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Adding Inside Knowledge

To explicitly inform the compiler that our input will always be a multiple of 4, we can rewrite the loop size calculation as follows:

```cpp
((max_loop_size/4)*4)
```

At first glance, this calculation might seem mathematically redundant. However, since the expression `(max_loop_size/4)` is an integer division, it truncates the result, effectively guaranteeing that `(max_loop_size/4)*4` will always yield a number divisible by 4. The compiler can pick up on this information and optimise accordingly.
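
As a quick check of the arithmetic above, the following sketch rounds a few values down to a multiple of 4 using integer division. The helper name `round_down_to_multiple_of_4` is hypothetical and introduced only for illustration. It is marked `constexpr` so that the results can be checked at compile time with `static_assert`.

```cpp
// Hypothetical helper, not part of the original example: integer division
// discards the remainder, so multiplying back by 4 yields a multiple of 4.
constexpr int round_down_to_multiple_of_4(int n) {
    return (n / 4) * 4;
}

// A multiple of 4 is unchanged; other non-negative values are rounded down.
static_assert(round_down_to_multiple_of_4(40000) == 40000, "unchanged");
static_assert(round_down_to_multiple_of_4(40003) == 40000, "rounded down");
static_assert(round_down_to_multiple_of_4(7) == 4, "rounded down");
static_assert(round_down_to_multiple_of_4(3) == 0, "rounded down");
```

Note that if the input is not actually a multiple of 4, a loop bounded by this expression skips up to three trailing elements, which is why the examples ask for a multiple of 4.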

A slightly easier-to-read method, which avoids confusion when passing arguments, is to divide the variable by 4 and give it a new name before it is passed in, then multiply by 4 in the loop bound. For example:

```cpp
(max_loop_size_div_4 * 4)
```

## Improved Example

Copy the snippet below and paste it into a file named `context.cpp`.

```cpp
#include <iostream>
#include <chrono>

void foo(const int* x, int max_loop_size_div_4)
{
    int sum = 0;
    // The loop bound is visibly a multiple of 4
    for (int k = 0; k < max_loop_size_div_4 * 4; k++) {
        sum += x[k];
    }
    std::cout << "Sum: " << sum << std::endl;
}

int main() {
    int max_loop_size;
    std::cout << "Enter a value for max_loop_size (must be a multiple of 4): ";
    std::cin >> max_loop_size;

    int max_loop_size_div_4 = max_loop_size / 4;
    // Variable-length array: a GNU extension in C++, accepted by g++
    int x[max_loop_size];
    // Initialise test data
    for (int i = 0; i < max_loop_size; ++i) x[i] = i;

    // Start timing
    auto start = std::chrono::high_resolution_clock::now();
    foo(x, max_loop_size_div_4);
    // Stop timing
    auto end = std::chrono::high_resolution_clock::now();

    // Calculate and display the elapsed time
    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << "Time taken by foo: " << duration << " nanoseconds" << std::endl;

    return 0;
}
```

Compile again with the same compiler flags.

```bash
g++ -O3 -march=armv8-a+simd context.cpp -o context
```

Running the example with the number 40000 again leads to the following results.

```output
./context
Enter a value for max_loop_size (must be a multiple of 4): 40000
Sum: 799980000
Time taken by foo: 24650 nanoseconds
```
In this particular run, the time taken fell from 138100 to 24650 nanoseconds, roughly 5.6 times lower than in the previous example.

## Comparison

To compare the two versions, view the generated assembly in [Compiler Explorer](https://godbolt.org/z/nvx4j1vTK).

As the assembly shows, fewer lines of assembly correspond to the function `foo` when context is added. Because the loop size is known to be a multiple of 4, the compiler can optimise away the conditional checks and the clean-up code that would otherwise handle any leftover elements.
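
To make the idea of clean-up code more concrete, here is a hand-written sketch of the two loop shapes. It is a conceptual illustration only, assuming the loop is processed four elements per step; it is not the code that `g++` emits, and the function names `sum_without_context` and `sum_with_context` are hypothetical. The functions are `constexpr` (C++14 or later) only so that the results can be checked at compile time.

```cpp
// Conceptual illustration only: hand-written C++ that mimics the shape of
// the loops, not the instructions the compiler actually generates.

// Without context: n may be any value, so a scalar clean-up loop is needed
// for the 0 to 3 elements left over after the blocks of 4.
constexpr int sum_without_context(const int* x, int n) {
    int sum = 0;
    int k = 0;
    for (; k + 4 <= n; k += 4) {   // main loop: 4 elements per step
        sum += x[k] + x[k + 1] + x[k + 2] + x[k + 3];
    }
    for (; k < n; k++) {           // clean-up loop for the remainder
        sum += x[k];
    }
    return sum;
}

// With context: the trip count is n_div_4 * 4, so no clean-up loop is needed.
constexpr int sum_with_context(const int* x, int n_div_4) {
    int sum = 0;
    for (int k = 0; k < n_div_4 * 4; k += 4) {
        sum += x[k] + x[k + 1] + x[k + 2] + x[k + 3];
    }
    return sum;
}

// Both shapes agree when the element count is a multiple of 4.
constexpr int sample[] = {0, 1, 2, 3, 4, 5, 6, 7};
static_assert(sum_without_context(sample, 8) == 28, "sum of 0..7");
static_assert(sum_with_context(sample, 2) == 28, "sum of 0..7");
```

The second function has one loop and no remainder handling, which mirrors why the version with context needs less code for `foo`.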
