Commit aa25b17

Review floating point Learning Path
1 parent a55a94f commit aa25b17

5 files changed: +87 -44 lines changed

content/learning-paths/cross-platform/floating-point-rounding-errors/_index.md

Lines changed: 17 additions & 8 deletions
````diff
@@ -1,19 +1,18 @@
 ---
-title: Learn about floating point rounding errors on Arm and x86
+title: Learn about floating point rounding on Arm
 
 minutes_to_complete: 30
 
-who_is_this_for: Developers porting applications from x86 to AArch64 who observe different results on each platform.
+who_is_this_for: Developers porting applications from x86 to Arm who observe different floating point values on each platform.
 
 learning_objectives:
-    - Understand the differences between converting floating point numbers on x86 and Arm.
-    - Understand factors that affect floating point behaviour
-    - How to use basic compiler flags to produce predictable behaviour
+    - Understand the differences between floating point numbers on x86 and Arm.
+    - Understand factors that affect floating point behavior.
+    - How to use compiler flags to produce predictable behavior.
 
 prerequisites:
-    - Access to an x86 and Arm-based machine
-    - Basic understanding of floating point numbers
-    - A C++/C compiler
+    - Access to an x86 and an Arm Linux machine.
+    - Basic understanding of floating point numbers.
 
 author: Kieran Hejmadi
 
@@ -26,11 +25,21 @@ armips:
 tools_software_languages:
     - C++
 
+shared_path: true
+shared_between:
+    - servers-and-cloud-computing
+    - laptops-and-desktops
+    - mobile-graphics-and-gaming
+
 further_reading:
     - resource:
         title: G++ Optimisation Flags
         link: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
         type: documentation
+    - resource:
+        title: Floating-point environment
+        link: https://en.cppreference.com/w/cpp/numeric/fenv
+        type: documentation
````

content/learning-paths/cross-platform/floating-point-rounding-errors/how-to-1.md

Lines changed: 11 additions & 7 deletions
````diff
@@ -6,26 +6,30 @@ weight: 2
 layout: learningpathall
 ---
 
-## Recap on Floating Point Numbers
+## Review of floating point numbers
 
-If you are unfamiliar with floating point representations, we recommend looking at this [introductory learning path](https://learn.arm.com/learning-paths/cross-platform/integer-vs-floats/introduction-integer-float-types/).
+If you are unfamiliar with floating point number representation, you can review [Learn about integer and floating-point conversions](/learning-paths/cross-platform/integer-vs-floats/introduction-integer-float-types/). It covers different data types and explains data type conversions.
 
-As a recap, floating-point numbers are a fundamental representation of real numbers in computer systems, enabling efficient storage and computation of decimal values with varying degrees of precision. In C/C++, floating point variables are created with keywords such as `float` or `double`. The IEEE 754 standard, established in 1985, is the most widely used format for floating-point arithmetic, ensuring consistency across different hardware and software implementations.
+Floating-point numbers are a fundamental representation of real numbers in computer systems, enabling efficient storage and computation of decimal values with varying degrees of precision. In C/C++, floating point variables are created with keywords such as `float` or `double`. The IEEE 754 standard, established in 1985, is the most widely used format for floating-point arithmetic, ensuring consistency across different hardware and software implementations.
 
-IEEE 754 defines two primary formats: single-precision (32-bit) and double-precision (64-bit). Each floating-point number consists of three components:
+IEEE 754 defines two primary formats: single-precision (32-bit) and double-precision (64-bit).
+
+Each floating-point number consists of three components:
 - **sign bit** (determining positive or negative value)
 - **exponent** (defining the scale or magnitude)
 - **significand** (also called the mantissa, representing the significant digits of the number).
 
 The standard uses a biased exponent to handle both large and small numbers efficiently, and it incorporates special values such as NaN (Not a Number), infinity, and subnormal numbers for robust numerical computation. A key feature of IEEE 754 is its support for rounding modes and exception handling, ensuring predictable behavior in mathematical operations. However, floating-point arithmetic is inherently imprecise due to limited precision, leading to small rounding errors.
 
-The graphic below illustrates various forms of floating point representation supported by Arm, each with varying number of bits assigned to the exponent and matissa.
+The graphic below illustrates various forms of floating point representation supported by Arm, each with a varying number of bits assigned to the exponent and mantissa.
 
 ![floating-point](./floating-point-numbers.png)
 
-## Rounding Errors
+## Rounding errors
+
+Since computers use a finite number of bits to store a continuous range of numbers, rounding errors are introduced. The unit in last place (ULP) is the smallest difference between two consecutive floating-point numbers. It measures floating-point rounding error, which arises because not all real numbers can be exactly represented.
 
-As mentioned above, since we are using a finite number of bits to store a continuous range of numbers, we introduce rounding error. The unit in last place (ULP) is the smallest difference between two consecutive floating-point numbers. It measures floating-point rounding error, which arises because not all real numbers can be exactly represented. When an operation is performed, the result is rounded to the nearest representable value, introducing a small error. This error, often measured in ULPs, indicates how close the computed value is to the exact result. For a simple example, if we construct a floating-point schema with 3 bits for the mantissa (precision) and an exponent in the range of -1 to 2. The possible values will look like the graph below.
+When an operation is performed, the result is rounded to the nearest representable value, introducing a small error. This error, often measured in ULPs, indicates how close the computed value is to the exact result. For a simple example, if a floating-point schema with 3 bits for the mantissa (precision) and an exponent in the range of -1 to 2 is used, the possible values are represented in the graph below.
 
 ![ulp](./ulp.png)
 
````
content/learning-paths/cross-platform/floating-point-rounding-errors/how-to-2.md

Lines changed: 25 additions & 10 deletions
````diff
@@ -6,12 +6,16 @@ weight: 3
 layout: learningpathall
 ---
 
-## Differences in behaviour between x86 and Arm.
+## What are the differences in behavior between x86 and Arm floating point?
 
-Architecture and standards define dealing with floating point overflows and truncations in difference ways. First, connect to a x86 and Arm-based machine. In this example I am connecting to an AWS `t2.2xlarge` and `t4g.xlarge` running Ubuntu 22.04 LTS.
+Architectures and standards define floating point overflows and truncations in different ways.
 
+You can see this by comparing an example application on an x86 and an Arm Linux system.
 
-To demonstrate this, the C++ code snippet below casts floating point numbers to various data types. Copy and paste into a new file called `converting-float.cpp`.
+You can use any Linux systems for this example. If you are using AWS, you can use EC2 instance types `t3.micro` and `t4g.small` running Ubuntu 24.04.
+
+To learn about floating point differences, use an editor to copy and paste the C++ code below into a new file named `converting-float.cpp`.
 
 ```cpp
 #include <iostream>
@@ -57,15 +61,16 @@ int main() {
 }
 ```
 
-To demonstrate we will compile `converting-float.cpp` on an Arm64 and x86 machine. Install the `g++` compiler with the following command. I am using `G++` version 13.3 at the time of writing.
+If you need to install the `g++` compiler, run the commands below.
 
 ```bash
 sudo apt update
-sudo apt install g++ gcc
+sudo apt install g++ -y
 ```
 
+Compile `converting-float.cpp` on an Arm and x86 machine.
 
-Run the command below on both an Arm-based and x86-based system.
+The compile command is the same on both systems.
 
 ```bash
 g++ converting-float.cpp -o converting-float
@@ -75,11 +80,13 @@ For easy comparison, the image below shows the x86 output (left) and Arm output
 
 ![differences](./differences.png)
 
-As you can see, there are several cases where different behaviour is observed. For example when trying to convert a signed number to a unsigned number or dealing with out-of-bounds numbers.
+As you can see, there are several cases where different behavior is observed, for example, when converting a signed number to an unsigned number or dealing with out-of-bounds numbers.
+
+## Removing hardcoded values with macros
 
-## Removing Hardcoded values with Macros
+The above differences show that explicitly checking for specific values will lead to unportable code.
 
-Clearly the above differences show that explictly checking for specific values will lead to unportable code. For example, consider the naively implemented function below. This checks if the value is 0, the value an x86 machine will convert a floating point number that exceeds the maximum 32-bit float value. This is different to AArch64 behaviour leading to unportable code.
+For example, consider the function below. The code checks if the value is 0, which is the value an x86 machine produces when converting a floating point number that exceeds the maximum 32-bit unsigned value. This is different from Arm behavior, leading to unportable code.
 
 ```cpp
 void checkFloatToUint32(float num) {
@@ -93,7 +100,15 @@ void checkFloatToUint32(float num) {
 ```
 
 This can simply be corrected by using the macro, `UINT32_MAX`.
-{{% notice Note %}} To find out all the available compiler-defined macros, the `echo | <clang|gcc etc.> -dM -` will output them to the terminal{{% /notice %}}
+
+{{% notice Note %}}
+To find out all the available compiler-defined macros, you can output them using:
+```bash
+echo "" | g++ -dM -E -
+```
+{{% /notice %}}
+
+A portable version of the code is:
 
 ```cpp
 void checkFloatToUint32(float num) {
````
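The diff view cuts off before the portable version is shown in full. As a sketch only (the commit's exact function body is elided here), a `UINT32_MAX`-based range check can look like this:

```cpp
// Illustrative sketch -- the commit's full portable version is elided
// from this diff. Range-check against UINT32_MAX before converting,
// because out-of-range float-to-unsigned conversion is undefined behavior.
#include <cmath>
#include <cstdint>
#include <cstdio>

void checkFloatToUint32(float num) {
    // UINT32_MAX is not exactly representable as a float; the cast
    // rounds up to 2^32, so use >= to reject everything out of range.
    const float limit = static_cast<float>(UINT32_MAX);
    if (std::isnan(num) || num < 0.0f || num >= limit) {
        std::printf("%f is not representable as uint32_t\n", (double)num);
        return;
    }
    std::printf("%f converts to %u\n", (double)num, static_cast<uint32_t>(num));
}

int main() {
    checkFloatToUint32(42.5f);  // in range
    checkFloatToUint32(1e10f);  // exceeds UINT32_MAX
    checkFloatToUint32(-1.0f);  // negative, not representable
    return 0;
}
```

Because the check compares against the macro rather than a hardcoded result of an out-of-range conversion, it behaves identically on x86 and Arm.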

content/learning-paths/cross-platform/floating-point-rounding-errors/how-to-3.md

Lines changed: 15 additions & 10 deletions
````diff
@@ -1,19 +1,22 @@
 ---
-title: Example
-weight: 3
+title: Error propagation
+weight: 4
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
-## Error Propagation
+## What is error propagation in x86 and Arm systems?
 
-One cause of different outputs between x86 and Arm stems from the order of instructions and how errors are propagated. As a hypothetical example, an Arm architecture may decide to reorder the instructions that each has a different rounding error (described in the unit in last place section) so that subtle changes are observed. Alternatively, 2 functions that are mathematically equivalent will propagate errors differently on a computer.
+One cause of different outputs between x86 and Arm stems from the order of instructions and how errors are propagated. As a hypothetical example, an Arm system may reorder instructions that each have a different rounding error, so that subtle changes are observed.
 
-Consider the example below. Function 1, `f1` and 2, `f2` are mathematically equivalent. Hence they should return the same value given the same input. If we input a very small number, `1e-8`, the error is different due to the loss in precision caused by different operations. Specifically, function `f2` avoids the subtraction of nearly equal number. The full reasoning is out of scope but for those interested should look into the topic of [numerical stability](https://en.wikipedia.org/wiki/Numerical_stability).
+It is possible that 2 functions that are mathematically equivalent will propagate errors differently on a computer.
 
-Copy and paste the C++ snippet below into a file named `error-propagation.cpp`.
+Functions `f1` and `f2` are mathematically equivalent. You would expect them to return the same value given the same input.
+
+If the input is a very small number, `1e-8`, the error is different due to the loss in precision caused by different operations. Specifically, `f2` avoids the subtraction of nearly equal numbers. For a full description, look into the topic of [numerical stability](https://en.wikipedia.org/wiki/Numerical_stability).
 
+Use an editor to copy and paste the C++ code below into a file named `error-propagation.cpp`.
 
 ```cpp
 #include <stdio.h>
@@ -50,22 +53,24 @@ int main() {
 }
 ```
 
-Compile the source code on both x86 and Arm64 with the following command.
+Compile the code on both x86 and Arm with the following command.
 
 ```bash
 g++ -g error-propagation.cpp -o error-propagation
 ```
 
-Running the 2 binaries shows that the second function, f2, has a small rounding error on both architectures. Additionally, there is a further rounding difference when run on x86 compared to Arm.
+Running the 2 binaries shows that the second function, `f2`, has a small rounding error on both architectures. Additionally, there is a further rounding difference when run on x86 compared to Arm.
+
+Running on x86:
 
-on x86:
 ```output
 f1(1.000000e-08) = 0.0000000000
 f2(1.000000e-08) = 0.0000000050
 Difference (f1 - f2) = -4.9999999696e-09
 Final result after magnification: -0.4999000132
 ```
-on Arm:
+
+Running on Arm:
 ```output
 f1(1.000000e-08) = 0.0000000000
 f2(1.000000e-08) = 0.0000000050
````
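The bodies of `f1` and `f2` are elided from this diff. For readers of the review, a representative pair with the same cancellation behavior (an assumption, not the author's exact code) is:

```cpp
// Representative sketch of the instability discussed above -- assumed,
// not the commit's exact f1/f2. Both compute sqrt(1 + x) - 1.
#include <cmath>
#include <cstdio>

// Naive form: subtracts two nearly equal values, losing precision.
float f1(float x) { return std::sqrt(1.0f + x) - 1.0f; }

// Algebraically identical form that avoids the cancellation:
// sqrt(1 + x) - 1 == x / (sqrt(1 + x) + 1).
float f2(float x) { return x / (std::sqrt(1.0f + x) + 1.0f); }

int main() {
    float x = 1e-8f;
    // 1.0f + 1e-8f rounds to 1.0f, so f1 collapses to 0,
    // while f2 returns the well-conditioned result of about 5e-9.
    std::printf("f1(%e) = %.10f\n", (double)x, (double)f1(x));
    std::printf("f2(%e) = %.10f\n", (double)x, (double)f2(x));
    return 0;
}
```

This reproduces the shape of the output shown in the diff: the naive form loses the answer entirely, while the rearranged form keeps it.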

content/learning-paths/cross-platform/floating-point-rounding-errors/how-to-4.md

Lines changed: 19 additions & 9 deletions
````diff
@@ -1,20 +1,22 @@
 ---
-title: Minimising Variability across platforms
-weight: 3
+title: Minimizing variability across platforms
+weight: 5
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
-## Minimising Variability across platforms
+## How can I minimize variability across x86 and Arm?
 
-The line `#pragma STDC FENV_ACCESS ON` is a directive that informs the compiler to enable access to the floating-point environment. This is part of the C++11 standard and is used to ensure that the program can properly handle floating-point exceptions and rounding modes enabling your program to continue running if an exception is thrown. For more information, refer to the [documentation in the C++11 standard](https://en.cppreference.com/w/cpp/numeric/fenv).
+The line `#pragma STDC FENV_ACCESS ON` is a directive that informs the compiler to enable access to the floating-point environment.
 
-In the context below, enabling floating-point environment access is crucial because the functions you are working with involve floating-point arithmetic, which can be prone to precision errors and exceptions such as overflow, underflow, division by zero, and invalid operations. However, in our example since we are hardcoding the inputs this is not strictly necessary but is included as it may be relevant for your own application.
+This is part of the C++11 standard and is used to ensure that the program can properly handle floating-point exceptions and rounding modes, enabling your program to continue running if an exception is thrown.
+
+In the context below, enabling floating-point environment access is crucial because the functions you are working with involve floating-point arithmetic, which can be prone to precision errors and exceptions such as overflow, underflow, division by zero, and invalid operations. This is not necessary for this example, but is included because it may be relevant for your own application.
 
 This directive is particularly important when performing operations that require high numerical stability and precision, such as the square root calculations in functions below. It allows the program to manage the floating-point state and handle any anomalies that might occur during these calculations, thereby improving the robustness and reliability of your numerical computations.
 
-Save the C++ file below as `error-propagation-min.cpp`.
+Use an editor to copy and paste the C++ code below into a file named `error-propagation-min.cpp`.
 
 ```cpp
 #include <cstdio>
@@ -59,13 +61,17 @@ int main() {
 }
 ```
 
-Compile with the following command. In addition, we pass the C++ flag, `-frounding-math`. You should use use when your program dynamically changes the floating-point rounding mode or needs to run correctly under different rounding modes. In our example, it results in a predictable rounding mode on function `f1` across x86 and Arm64. For more information, please refer to the [G++ documentation](https://gcc.gnu.org/onlinedocs/gcc-13.3.0/gcc/Optimize-Options.html)
+Compile on both computers, using the C++ flag `-frounding-math`.
+
+You should use this flag when your program dynamically changes the floating-point rounding mode or needs to run correctly under different rounding modes. In this example, it results in a predictable rounding mode on function `f1` across x86 and Arm.
 
 ```bash
 g++ -o error-propagation-min error-propagation-min.cpp -frounding-math
 ```
 
-Running the new binary on both platforms leads to function, `f1` having a similar value to `f2`. Further the difference is now identical across both Arm64 and x86.
+Running the new binary on both systems leads to function `f1` having a similar value to `f2`. Furthermore, the difference is now identical across both Arm and x86.
+
+Here is the output on both systems:
 
 ```output
 ./error-propagation-min
@@ -75,5 +81,9 @@ Difference (f1 - f2) = -1.7887354748e-17
 Final result after magnification: 0.0000999982
 ```
 
-{{% notice Note %}} G++ provides several compiler flags to help balance accuracy and performance such as`-ffp-contract` which is useful when lossy, fused operations are used, for example, fused-multiple. As another example `-ffloat-store` which prevent floating point variables from being stored in registers which can have different levels of precision and rounding. **Please refer to your compiler documentation for more information on the available flags**{{% /notice %}}
+G++ provides several compiler flags to help balance accuracy and performance, such as `-ffp-contract`, which is useful when lossy, fused operations such as fused multiply-add are used.
+
+Another example is `-ffloat-store`, which prevents floating point variables from being stored in registers, which can have different levels of precision and rounding.
+
+You can refer to compiler documentation for more information about the available flags.
````
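To complement the `-frounding-math` discussion, here is a minimal sketch (not from the commit) that exercises the floating-point environment directly with `<cfenv>`:

```cpp
// Illustrative sketch, not part of this commit. Changes the rounding
// mode at runtime via <cfenv>; compile with -frounding-math so the
// compiler does not constant-fold assuming round-to-nearest.
#include <cfenv>
#include <cstdio>

// Standard C/C++ pragma; some compilers (including g++) may ignore it
// and rely on -frounding-math instead.
#pragma STDC FENV_ACCESS ON

int main() {
    volatile float x = 1.0f, y = 3.0f; // volatile blocks constant folding

    std::fesetround(FE_DOWNWARD);      // round toward -infinity
    std::printf("down:    %.10f\n", (double)(x / y));

    std::fesetround(FE_UPWARD);        // round toward +infinity
    std::printf("up:      %.10f\n", (double)(x / y));

    std::fesetround(FE_TONEAREST);     // restore the default
    std::printf("nearest: %.10f\n", (double)(x / y));
    return 0;
}
```

Under the three modes the printed quotients differ in the last digits, which is the same class of variation this Learning Path teaches readers to control.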
