Commit ea7dc98 (parent dd4a67a)
Author: Your Name
Commit message: renamed to floating-point-rounding-errors

9 files changed: +348 −0 lines
Lines changed: 42 additions & 0 deletions
---
title: Learn about floating point rounding errors on Arm and x86

minutes_to_complete: 30

who_is_this_for: Developers porting applications from x86 to AArch64 who observe different results on each platform.

learning_objectives:
    - Understand the differences between converting floating point numbers on x86 and Arm
    - Understand the factors that affect floating point behaviour
    - Use basic compiler flags to produce predictable behaviour

prerequisites:
    - Access to an x86 and an Arm-based machine
    - A basic understanding of floating point numbers
    - A C/C++ compiler

author: Kieran Hejmadi

### Tags
skilllevels: Introductory
subjects: Performance and Architecture
armips:
    - Cortex-A
    - Neoverse
tools_software_languages:
    - C++

further_reading:
    - resource:
        title: G++ Optimisation Flags
        link: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
        type: documentation

### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
Lines changed: 8 additions & 0 deletions
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
Lines changed: 37 additions & 0 deletions
---
title: Floating Point Representations
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Recap on Floating Point Numbers

If you are unfamiliar with floating point representations, we recommend working through this [introductory learning path](https://learn.arm.com/learning-paths/cross-platform/integer-vs-floats/introduction-integer-float-types/) first.

As a recap, floating-point numbers are a fundamental representation of real numbers in computer systems, enabling efficient storage and computation of decimal values with varying degrees of precision. In C/C++, floating point variables are created with keywords such as `float` or `double`. The IEEE 754 standard, established in 1985, is the most widely used format for floating-point arithmetic, ensuring consistency across different hardware and software implementations.

IEEE 754 defines two primary formats: single-precision (32-bit) and double-precision (64-bit). Each floating-point number consists of three components:

- **sign bit** (determining whether the value is positive or negative)
- **exponent** (defining the scale or magnitude)
- **significand** (also called the mantissa, representing the significant digits of the number)

The standard uses a biased exponent to handle both large and small numbers efficiently, and it incorporates special values such as NaN (Not a Number), infinity, and subnormal numbers for robust numerical computation. A key feature of IEEE 754 is its support for rounding modes and exception handling, ensuring predictable behavior in mathematical operations. However, floating-point arithmetic is inherently imprecise due to limited precision, leading to small rounding errors.

The graphic below illustrates various forms of floating point representation supported by Arm, each with a different number of bits assigned to the exponent and mantissa.

![floating-point](./floating-point-numbers.png)

## Rounding Errors

As mentioned above, because a finite number of bits is used to store a continuous range of numbers, rounding error is introduced. The unit in last place (ULP) is the distance between two consecutive representable floating-point numbers. It measures floating-point rounding error, which arises because not all real numbers can be exactly represented. When an operation is performed, the result is rounded to the nearest representable value, introducing a small error. This error, often measured in ULPs, indicates how close the computed value is to the exact result. For a simple example, construct a floating-point scheme with 3 bits for the mantissa (precision) and an exponent in the range -1 to 2; the possible values look like the graph below.

![ulp](./ulp.png)

Key takeaways:

- ULP size varies with the number's magnitude.
- Larger numbers have bigger ULPs due to wider spacing between values.
- Smaller numbers have smaller ULPs, reducing quantization error.
- ULP behavior impacts numerical stability and precision in computations.
Lines changed: 108 additions & 0 deletions
---
title: Differences between x86 and Arm
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Differences in behaviour between x86 and Arm

Architectures and language standards handle floating point overflow and truncation in different ways. First, connect to an x86 and an Arm-based machine. In this example I am connecting to an AWS `t2.2xlarge` instance (x86) and a `t4g.xlarge` instance (Arm), both running Ubuntu 22.04 LTS.

To demonstrate this, the C++ code snippet below casts floating point numbers to various data types. Copy and paste it into a new file called `converting-float.cpp`.
```cpp
#include <iostream>
#include <cmath>
#include <limits>
#include <cstdint>

void convertFloatToInt(float value) {
    // Convert to unsigned 32-bit integer
    uint32_t u32 = static_cast<uint32_t>(value);

    // Convert to signed 32-bit integer
    int32_t s32 = static_cast<int32_t>(value);

    // Convert to unsigned 16-bit integer (truncation happens)
    uint16_t u16 = static_cast<uint16_t>(u32);
    uint8_t u8 = static_cast<uint8_t>(value);

    // Convert to signed 16-bit integer (truncation happens)
    int16_t s16 = static_cast<int16_t>(s32);

    std::cout << "Floating-Point Value: " << value << "\n";
    std::cout << " → uint32_t: " << u32 << " (0x" << std::hex << u32 << std::dec << ")\n";
    std::cout << " → int32_t: " << s32 << " (0x" << std::hex << s32 << std::dec << ")\n";
    std::cout << " → uint16_t (truncated): " << u16 << " (0x" << std::hex << u16 << std::dec << ")\n";
    std::cout << " → int16_t (truncated): " << s16 << " (0x" << std::hex << s16 << std::dec << ")\n";
    std::cout << " → uint8_t (truncated): " << static_cast<int>(u8) << std::endl;

    std::cout << "----------------------------------\n";
}

int main() {
    std::cout << "Demonstrating Floating-Point to Integer Conversion\n\n";

    // Test cases
    convertFloatToInt(42.7f);         // Normal case
    convertFloatToInt(-15.3f);        // Negative value -> wraps on unsigned
    convertFloatToInt(4294967296.0f); // Overflow: 2^32 (UINT32_MAX + 1)
    convertFloatToInt(3.4e+38f);      // Large float exceeding UINT32_MAX
    convertFloatToInt(-3.4e+38f);     // Large negative float
    convertFloatToInt(NAN);           // NaN behavior on different platforms
    return 0;
}
```
To demonstrate the differences, compile `converting-float.cpp` on both an Arm64 and an x86 machine. Install the `g++` compiler with the following commands. I am using `g++` version 13.3 at the time of writing.

```bash
sudo apt update
sudo apt install g++ gcc
```

Run the command below on both the Arm-based and the x86-based system.

```bash
g++ converting-float.cpp -o converting-float
```

For easy comparison, the image below shows the x86 output (left) and the Arm output (right). The highlighted lines show the differences in output.

![differences](./differences.png)

As you can see, there are several cases where different behaviour is observed, for example when converting a negative number to an unsigned type or when dealing with out-of-range values.

## Removing Hardcoded Values with Macros

The differences above show that explicitly checking for specific values leads to unportable code. For example, consider the naively implemented function below. It checks whether the converted value is 0, which is the value an x86 machine produces when converting a floating point number that is out of range for a 32-bit unsigned integer. AArch64 behaves differently, so this check is not portable.
```cpp
void checkFloatToUint32(float num) {
    uint32_t castedNum = static_cast<uint32_t>(num);
    if (castedNum == 0) {
        std::cout << "The casted number is 0, indicating the float could be out of bounds for uint32_t." << std::endl;
    } else {
        std::cout << "The casted number is: " << castedNum << std::endl;
    }
}
```
This can be corrected by comparing against the macro `UINT32_MAX` instead of a hardcoded platform-specific value.

{{% notice Note %}} To list all the compiler's predefined macros, run `echo | gcc -dM -E -` (substitute `clang` or another compiler as appropriate); they are printed to the terminal. {{% /notice %}}

```cpp
void checkFloatToUint32(float num) {
    uint32_t castedNum = static_cast<uint32_t>(num);
    if (castedNum == UINT32_MAX) {
        std::cout << "The casted number is " << UINT32_MAX << ", indicating the float was out of bounds for uint32_t." << std::endl;
    } else {
        std::cout << "The casted number is: " << castedNum << std::endl;
    }
}
```
Lines changed: 74 additions & 0 deletions
---
title: Example
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Error Propagation

One cause of differing output between x86 and Arm is the order of instructions and how rounding errors propagate. As a hypothetical example, a compiler targeting Arm may reorder floating point instructions; because each instruction carries its own rounding error (described in the section on the unit in last place), subtle differences in the result can be observed. Similarly, two functions that are mathematically equivalent can propagate errors differently when evaluated in floating point.

Consider the example below. The functions `f1` and `f2` are mathematically equivalent, so they should return the same value for the same input. However, for a very small input such as `1e-8`, the results differ due to loss of precision caused by the different operations. Specifically, `f2` avoids subtracting two nearly equal numbers. The full reasoning is out of scope here, but interested readers should look into [numerical stability](https://en.wikipedia.org/wiki/Numerical_stability).

Copy and paste the C++ snippet below into a file named `error-propagation.cpp`.
```cpp
#include <stdio.h>
#include <math.h>

// Function 1: Computes sqrt(1 + x) - 1 using the naive approach
float f1(float x) {
    return sqrtf(1 + x) - 1;
}

// Function 2: Computes the same value using an algebraically equivalent transformation
// This version is numerically more stable
float f2(float x) {
    return x / (sqrtf(1 + x) + 1);
}

int main() {
    float x = 1e-8f; // A small value that causes floating-point precision issues
    float result1 = f1(x);
    float result2 = f2(x);

    // Theoretically, result1 and result2 should be the same
    float difference = result1 - result2;
    // Multiply by a large number to amplify the error
    float final_result = 100000000.0f * difference + 0.0001f;

    // Print the results
    printf("f1(%e) = %.10f\n", x, result1);
    printf("f2(%e) = %.10f\n", x, result2);
    printf("Difference (f1 - f2) = %.10e\n", difference);
    printf("Final result after magnification: %.10f\n", final_result);

    return 0;
}
```
Compile the source code on both x86 and Arm64 with the following command.

```bash
g++ -g error-propagation.cpp -o error-propagation
```

Running the two binaries shows that the naive function `f1` loses all precision on both architectures, while `f2` returns the expected value. In addition, the magnified final result differs between x86 and Arm.

On x86:

```output
f1(1.000000e-08) = 0.0000000000
f2(1.000000e-08) = 0.0000000050
Difference (f1 - f2) = -4.9999999696e-09
Final result after magnification: -0.4999000132
```

On Arm:

```output
f1(1.000000e-08) = 0.0000000000
f2(1.000000e-08) = 0.0000000050
Difference (f1 - f2) = -4.9999999696e-09
Final result after magnification: -0.4998999834
```
Lines changed: 79 additions & 0 deletions
---
title: Minimising Variability across platforms
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Minimising Variability across platforms

The line `#pragma STDC FENV_ACCESS ON` is a directive that informs the compiler that the program may access the floating-point environment. The `<cfenv>` facilities it relates to are part of the C++11 standard and allow a program to observe and handle floating-point status flags and rounding modes. For more information, refer to the [documentation for the C++ standard library](https://en.cppreference.com/w/cpp/numeric/fenv).

In the context below, enabling floating-point environment access matters because the functions you are working with involve floating-point arithmetic, which is prone to precision errors and exceptions such as overflow, underflow, division by zero, and invalid operations. In our example, because the inputs are hardcoded, this is not strictly necessary, but it is included as it may be relevant for your own application.

This directive is particularly important when performing operations that require high numerical stability and precision, such as the square root calculations in the functions below. It allows the program to manage the floating-point state and handle any anomalies that occur during these calculations, improving the robustness and reliability of your numerical computations.
Save the C++ file below as `error-propagation-min.cpp`.

```cpp
#include <cstdio>
#include <cmath>
#include <cfenv>

// Inform the compiler that the floating-point environment is accessed
#pragma STDC FENV_ACCESS ON

// Function 1: Computes sqrt(1 + x) - 1 using the naive approach
double f1(double x) {
    return sqrt(1 + x) - 1;
}

// Function 2: Computes the same value using an algebraically equivalent transformation
// This version is numerically more stable
double f2(double x) {
    return x / (sqrt(1 + x) + 1);
}

int main() {
    // Clear any pending floating-point exception flags
    std::feclearexcept(FE_ALL_EXCEPT);

    double x = 1e-8; // A small value that causes floating-point precision issues
    double result1 = f1(x);
    double result2 = f2(x);

    // Theoretically, result1 and result2 should be the same
    double difference = result1 - result2;
    // Multiply by a large number to amplify the error
    double final_result = 100000000.0 * difference + 0.0001;

    // Print the results
    printf("f1(%e) = %.10f\n", x, result1);
    printf("f2(%e) = %.10f\n", x, result2);
    printf("Difference (f1 - f2) = %.10e\n", difference);
    printf("Final result after magnification: %.10f\n", final_result);

    return 0;
}
```
Compile with the following command. In addition, pass the C++ flag `-frounding-math`. Use it when your program dynamically changes the floating-point rounding mode or needs to run correctly under different rounding modes; in this example, it results in predictable rounding for function `f1` across x86 and Arm64. For more information, refer to the [G++ documentation](https://gcc.gnu.org/onlinedocs/gcc-13.3.0/gcc/Optimize-Options.html).

```bash
g++ -o error-propagation-min error-propagation-min.cpp -frounding-math
```

Running the new binary on both platforms shows `f1` now returning a value close to `f2`. Furthermore, the difference is now identical across Arm64 and x86.

```output
./error-propagation-min
f1(1.000000e-08) = 0.0000000050
f2(1.000000e-08) = 0.0000000050
Difference (f1 - f2) = -1.7887354748e-17
Final result after magnification: 0.0000999982
```

{{% notice Note %}} G++ provides several compiler flags to help balance accuracy and performance, such as `-ffp-contract`, which controls lossy fused operations like fused multiply-add, and `-ffloat-store`, which prevents floating point variables from being kept in registers that can have different precision and rounding. **Please refer to your compiler documentation for more information on the available flags.**{{% /notice %}}