|
1 | 1 | --- |
2 | | -title: Adding Inside Knowledge |
| 2 | +title: Optimize loops using boundary information |
3 | 3 | weight: 4 |
4 | 4 |
|
5 | 5 | ### FIXED, DO NOT MODIFY |
6 | 6 | layout: learningpathall |
7 | 7 | --- |
8 | 8 |
|
9 | | -## Adding Inside Knowledge |
| 9 | +## How can I add developer knowledge to optimize performance? |
10 | 10 |
|
11 | | -To explicitly inform the compiler that our input will always be a multiple of 4, we can rewrite the loop size calculation as follows: |
| 11 | +To ensure the loop size is always a multiple of 4 and communicate this boundary information to the compiler, you can rewrite the loop size calculation as follows: |
12 | 12 |
|
13 | 13 | ```output |
14 | 14 | ((max_loop_size/4)*4) |
15 | 15 | ``` |
16 | 16 |
|
17 | | -At first glance, this calculation might seem mathematically redundant. However, since the expression `(max_loop_size/4)` is an integer division, it truncates the result, effectively guaranteeing that `(max_loop_size/4)*4` will always yield a number divisible by 4. The compiler can pick up on this information and optimise accordingly. |
| 17 | +At first glance, this calculation looks mathematically redundant. However, since the expression `(max_loop_size/4)` is an integer division, it truncates the result, effectively guaranteeing that `(max_loop_size/4)*4` will always yield a number divisible by 4. This pattern allows the compiler to recognize and optimize for this specific constraint. |
18 | 18 |
|
19 | | -As slightly easier to read method that avoids confusion when passing arguments is to divide the variable and rename before it is passed in. For example. |
| 19 | +This optimization is particularly effective because it enables the compiler to use SIMD (Single Instruction, Multiple Data) vectorization. When the compiler knows the loop count is a multiple of 4, it can process four elements at once using vector registers, significantly improving performance on Arm processors. |
| 20 | + |
| 21 | +A slightly more readable approach, which avoids confusion when passing arguments, is to divide the variable by 4 and rename it before passing it in. |
| 22 | + |
| 23 | +For example: |
20 | 24 |
|
21 | 25 | ```output |
22 | 26 | (max_loop_size_div_4 * 4) |
23 | 27 | ``` |
24 | 28 |
|
25 | | -## Improved Example |
| 29 | +## Try an improved example |
26 | 30 |
|
27 | | -Copy the snippet below and paste into a file named `context.cpp`. |
| 31 | +Use a text editor to copy the code below and paste it into a file named `context.cpp`. |
28 | 32 |
|
29 | 33 | ```cpp |
30 | 34 | #include <iostream> |
@@ -63,23 +67,59 @@ int main() { |
63 | 67 | } |
64 | 68 | ``` |
65 | 69 |
|
66 | | -Again compile with the same compiler flags. |
| 70 | +Compile the new program with the same flags: |
67 | 71 |
|
68 | 72 | ```bash |
69 | 73 | g++ -O3 -march=armv8-a+simd context.cpp -o context |
70 | 74 | ``` |
71 | 75 |
|
72 | | -```output |
| 76 | +Run the new example with the same input value, 40000: |
| 77 | + |
| 78 | +```bash |
73 | 79 | ./context |
| 80 | +``` |
| 81 | + |
| 82 | +You see the new output: |
| 83 | + |
| 84 | +```output |
74 | 85 | Enter a value for max_loop_size (must be a multiple of 4): 40000 |
75 | 86 | Sum: 799980000 |
76 | 87 | Time taken by foo: 24650 nanoseconds |
77 | 88 | ``` |
78 | | -In this particular run, the time taken has significantly reduced compared to our previous example. |
| 89 | + |
| 90 | +The time taken has dropped significantly compared to the previous version. The likely cause is the boundary information now available to the compiler, though exact timings vary between runs and machines. |
| 91 | + |
| 92 | +## Performance considerations |
| 93 | + |
| 94 | +While this optimization technique provides significant performance benefits, it's important to note that it assumes the input is a multiple of 4. In a real-world application, you would need to validate user input or handle cases where the input isn't a multiple of 4. |
| 95 | + |
| 96 | +For example: |
| 97 | + |
| 98 | +```cpp |
| 99 | +// Validate input |
| 100 | +if (max_loop_size % 4 != 0) { |
| 101 | + std::cerr << "Error: Input must be a multiple of 4" << std::endl; |
| 102 | + return 1; |
| 103 | +} |
| 104 | +``` |
| 105 | + |
| 106 | +Alternatively, you could pad the array to ensure its size is always a multiple of 4, or handle the remainder elements separately after processing the vectorized portion of the array. The approach you choose depends on your specific application requirements and constraints. |
79 | 107 |
|
80 | 108 | ## Comparison |
81 | 109 |
|
82 | | -To compare we will use compiler explorer to see the assembly [here](https://godbolt.org/z/nvx4j1vTK). |
| 110 | +You can compare the differences in [Compiler Explorer](https://godbolt.org/z/nvx4j1vTK). |
| 111 | + |
| 112 | +The assembly for the function `foo()` is shorter when context is added. This is because, given the context, the compiler can optimize away the conditional checks and cleanup code. |
| 113 | + |
| 114 | +When examining the assembly output in Compiler Explorer, look for these key differences: |
| 115 | + |
| 116 | +1. **Vector instructions**: In the optimized version, look for instructions like `ld1` (load to vector register) and `addv` (add across vector) which indicate SIMD operations. |
| 117 | + |
| 118 | +2. **Loop structure**: The optimized version will likely have fewer instructions inside the main loop body as multiple elements are processed at once. |
| 119 | + |
| 120 | +3. **Unrolling factor**: Notice how the compiler might unroll the loop to process multiple elements in each iteration, reducing branch overhead. |
| 121 | + |
| 122 | +4. **Register usage**: The optimized version will make more efficient use of vector registers (v0-v31) rather than just scalar registers. |
83 | 123 |
|
84 | | -As the assembly shows we have fewer lines of assembly corresponding to the function `foo` when context is added. This is because the compiler can optimise the conditional checking and any clean up code given the context. |
| 124 | +These assembly-level differences directly translate to the performance improvements you observed in the execution time. |
85 | 125 |
|