`content/learning-paths/cross-platform/floating-point-rounding-errors/how-to-1.md` (9 additions, 6 deletions)
@@ -8,21 +8,24 @@ layout: learningpathall

## Review of floating-point numbers

-If you are new to floating-point numbers, for some background information, see the Learning Path [Learn about integer and floating-point conversions](/learning-paths/cross-platform/integer-vs-floats/introduction-integer-float-types/). It covers data types and conversions.
+{{% notice Learning tip %}}
+If you are new to floating-point numbers and would like some further information, see the Learning Path [Learn about integer and floating-point conversions](/learning-paths/cross-platform/integer-vs-floats/introduction-integer-float-types/). It covers data types and conversions.
+{{% /notice %}}
-Floating-point numbers represent real numbers using limited precision, enabling efficient storage and computation of decimal values with varying degrees of precision. In C/C++, floating-point variables are created with keywords such as `float` or `double`. The IEEE 754 standard, established in 1985, defines the most widely used format for floating-point arithmetic, ensuring consistency across hardware and software.
+Floating-point numbers represent real numbers using limited precision, enabling efficient storage and computation of decimal values. In C/C++, floating-point variables are created with keywords such as `float` or `double`. The IEEE 754 standard, established in 1985, defines the most widely used format for floating-point arithmetic, ensuring consistency across hardware and software.
IEEE 754 specifies two primary formats: single-precision (32-bit) and double-precision (64-bit).

Each floating-point number consists of three components:

- **Sign bit**: Determines the sign (positive or negative).
- **Exponent**: Sets the scale or magnitude.
-- **Significand** (or mantissa): Holds the significant digits in binary.
+- **Significand**: Holds the significant digits in binary.

The standard uses a biased exponent to handle both large and small numbers efficiently, and it incorporates special values such as NaN (Not a Number), infinity, and subnormal numbers. It supports rounding modes and exception handling, which help ensure predictable results. However, floating-point arithmetic is inherently imprecise, leading to small rounding errors.
-The graphic below shows various forms of floating-point representation supported by Arm, each with varying number of bits assigned to the exponent and mantissa.
+The graphic below shows various forms of floating-point representation supported by Arm, each with a varying number of bits assigned to the exponent and significand.

(image: floating-point formats supported by Arm)
@@ -32,7 +35,7 @@ Because computers use a finite number of bits to store a continuous range of num

Operations round results to the nearest representable value, introducing small discrepancies. This rounding error, often measured in ULPs (units in the last place), reflects how far the computed value may deviate from the exact mathematical result.

-For example, with 3 bits for the significand (mantissa) and an exponent range of -1 to 2, only a limited set of values can be represented. The diagram below illustrates these values.
+For example, with 3 bits for the significand and an exponent range of -1 to 2, only a limited set of values can be represented. The diagram below illustrates these values.
- ULP behavior impacts numerical stability and precision.

-{{% notice Learning Tip %}}
+{{% notice Learning tip %}}

Keep in mind that rounding and representation issues aren't bugs — they're a consequence of how floating-point math works at the hardware level. Understanding these fundamentals is essential when porting numerical code across architectures like x86 and Arm.
`content/learning-paths/cross-platform/floating-point-rounding-errors/how-to-2.md` (2 additions, 2 deletions)
@@ -79,7 +79,7 @@ For easy comparison, the image below shows the x86 output (left) and Arm output

(image: x86 output on the left, Arm output on the right)

-As you can see, there are several cases where different behavior is observed. For example when trying to convert a signed number to a unsigned number or dealing with out-of-bounds numbers.
+As you can see, there are several cases where different behavior is observed, for example when converting a signed number to an unsigned number or when dealing with out-of-bounds numbers.

## Removing hardcoded values with macros
@@ -93,7 +93,7 @@ For example, the function below checks if the casted result is `0`. This can be

void checkFloatToUint32(float num) {
    uint32_t castedNum = static_cast<uint32_t>(num);
    if (castedNum == 0) {
-        std::cout << "The casted number is 0, indicating the float could out of bounds for uint32_t." << std::endl;
+        std::cout << "The casted number is 0, indicating that the float is out of bounds for uint32_t." << std::endl;
    } else {
        std::cout << "The casted number is: " << castedNum << std::endl;
`content/learning-paths/cross-platform/floating-point-rounding-errors/how-to-3.md` (5 additions, 5 deletions)
@@ -10,13 +10,13 @@ layout: learningpathall

One cause of different outputs between x86 and Arm stems from the order of instructions and how errors are propagated. As a hypothetical example, an Arm system may reorder instructions that each have a different rounding error, so that subtle changes are observed.

-It is possible that 2 functions that are mathematically equivalent will propagate errors differently on a computer.
+It is possible that two functions that are mathematically equivalent will propagate errors differently on a computer.

Functions `f1` and `f2` are mathematically equivalent. You would expect them to return the same value given the same input.

-If the input is a very small number, `1e-8`, the error is different due to the loss in precision caused by different operations. Specifically, `f2` avoids the subtraction of nearly equal numbers. For a full description, look into the topic of [numerical stability](https://en.wikipedia.org/wiki/Numerical_stability).
+If the input is a very small number, `1e-8`, the error is different due to the loss in precision caused by different operations. Specifically, `f2` avoids subtracting nearly equal numbers. For a full description, look into the topic of [numerical stability](https://en.wikipedia.org/wiki/Numerical_stability).

-Use an editor to copy and paste the C++ code below into a file named `error-propagation.cpp`.
+Use an editor to copy and paste the C++ code below into a file named `error-propagation.cpp`:

```cpp
#include <stdio.h>

@@ -53,13 +53,13 @@ int main() {
}
```
-Compile the code on both x86 and Arm with the following command.
+Compile the code on both x86 and Arm with the following command:

```bash
g++ -g error-propagation.cpp -o error-propagation
```

-Running the 2 binaries shows that the second function, `f2`, has a small rounding error on both architectures. Additionally, there is a further rounding difference when run on x86 compared to Arm.
+Running the two binaries shows that the second function, `f2`, has a small rounding error on both architectures. Additionally, there is a further rounding difference when run on x86 compared to Arm.
`content/learning-paths/cross-platform/floating-point-rounding-errors/how-to-4.md` (7 additions, 9 deletions)
@@ -8,15 +8,13 @@ layout: learningpathall

## How can I minimize floating-point variability across x86 and Arm?

-The line `#pragma STDC FENV_ACCESS ON` is a directive that informs the compiler to enable access to the floating-point environment.
-
-This is part of the C++11 standard and is used to ensure that the program can properly handle floating-point exceptions and rounding modes enabling your program to continue running if an exception is thrown.
+The line `#pragma STDC FENV_ACCESS ON` is a directive that informs the compiler to enable access to the floating-point environment. The pragma originates in the C99 standard and is exposed to C++ through the `<cfenv>` facilities added in C++11 (its effect is implementation-defined, so compiler support varies). It ensures that the program can properly handle floating-point exceptions and rounding modes, enabling your program to continue running if an exception is thrown.

-In the context below, enabling floating-point environment access is crucial because the functions you are working with involve floating-point arithmetic, which can be prone to precision errors and exceptions such as overflow, underflow, division by zero, and invalid operations. This is not necessary for this example, but is included because it may be relevant for your own application.
+In the context below, enabling floating-point environment access is crucial because the functions in this example involve floating-point arithmetic, which can be prone to precision errors and exceptions such as overflow, underflow, division by zero, and invalid operations. Although not strictly necessary for this example, the directive is included because it may be relevant for your own applications.

This directive is particularly important when performing operations that require high numerical stability and precision, such as the square root calculations in the functions below. It allows the program to manage the floating-point state and handle any anomalies that might occur during these calculations, thereby improving the robustness and reliability of your numerical computations.

-Use an editor to copy and paste the C++ file below into a file named `error-propagation-min.cpp`.
+Use an editor to copy and paste the C++ code below into a file named `error-propagation-min.cpp`:

```cpp
#include <cstdio>
```
@@ -63,13 +61,13 @@ int main() {

Compile on both computers using the C++ flag `-frounding-math`.

-You should use this flat when your program dynamically changes the floating-point rounding mode or needs to run correctly under different rounding modes. In this example, it results in a predictable rounding mode on function `f1` across x86 and Arm.
+You should use this flag when your program dynamically changes the floating-point rounding mode or needs to run correctly under different rounding modes. In this example, it ensures that `f1` uses a predictable rounding mode across both x86 and Arm.

-Running the new binary on both systems leads to function `f1` having a similar value to `f2`. Further, the difference is now identical across both Arm64 and x86.
+Running the new binary on both systems shows that function `f1` produces a value nearly identical to `f2`, and the difference between them is now identical across both Arm64 and x86.
-G++ provides several compiler flags to help balance accuracy and performance, such as `-ffp-contract`, which is useful when lossy, fused operations are used, such as fused multiply-add.
+G++ provides several compiler flags to help balance accuracy and performance. For example, `-ffp-contract` controls whether the compiler may combine operations into lossy, fused forms such as fused multiply-add.

Another example is `-ffloat-store`, which prevents floating-point variables from being stored in registers, which can have different levels of precision and rounding.
-You can refer to compiler documentation for more information about the available flags.
+You can refer to the compiler documentation for more information on the available flags.