
Commit 00f990c

Merge pull request #1788 from madeline-underwood/FP
FP_JA to sign off
2 parents f6093ff + 5ae8a88 commit 00f990c

File tree: 5 files changed (+63, −50 lines)


content/learning-paths/cross-platform/floating-point-rounding-errors/_index.md

Lines changed: 8 additions & 7 deletions
@@ -1,22 +1,23 @@
 ---
-title: Learn about floating point rounding on Arm
+title: Explore floating-point differences between x86 and Arm

 draft: true
 cascade:
     draft: true

 minutes_to_complete: 30

-who_is_this_for: Developers porting applications from x86 to Arm who observe different floating point values on each platform.
+who_is_this_for: This is an introductory topic for developers who are porting applications from x86 to Arm and want to understand how floating-point behavior differs between these architectures - particularly in the context of numerical consistency, performance, and debugging subtle bugs.

 learning_objectives:
-    - Understand the differences between floating point numbers on x86 and Arm.
-    - Understand factors that affect floating point behavior.
-    - How to use compiler flags to produce predictable behavior.
+    - Identify key differences in floating-point behavior between the x86 and Arm architectures.
+    - Recognize the impact of compiler optimizations and instruction sets on floating-point results.
+    - Apply compiler flags and best practices to ensure consistent floating-point behavior across
+      platforms.

 prerequisites:
     - Access to an x86 and an Arm Linux machine.
-    - Basic understanding of floating point numbers.
+    - Familiarity with floating-point numbers.

 author: Kieran Hejmadi

@@ -38,7 +39,7 @@ shared_between:

 further_reading:
     - resource:
-        title: G++ Optimisation Flags
+        title: G++ Optimization Flags
         link: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
         type: documentation
     - resource:
Lines changed: 29 additions & 16 deletions
@@ -1,41 +1,54 @@
 ---
-title: Floating Point Representations
+title: "Floating-Point Representation"
 weight: 2

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---

-## Review of floating point numbers
+## Review of floating-point numbers

-If you are unfamiliar with floating point number representation, you can review [Learn about integer and floating-point conversions](/learning-paths/cross-platform/integer-vs-floats/introduction-integer-float-types/). It covers different data types and explains data type conversions.
+{{% notice Learning tip %}}
+If you are new to floating-point numbers and would like some further information, see
+the Learning Path [Learn about integer and floating-point conversions](/learning-paths/cross-platform/integer-vs-floats/introduction-integer-float-types/). It covers data types and conversions.
+{{% /notice %}}

-Floating-point numbers are a fundamental representation of real numbers in computer systems, enabling efficient storage and computation of decimal values with varying degrees of precision. In C/C++, floating point variables are created with keywords such as `float` or `double`. The IEEE 754 standard, established in 1985, is the most widely used format for floating-point arithmetic, ensuring consistency across different hardware and software implementations.
+Floating-point numbers represent real numbers using limited precision, enabling efficient storage and computation of decimal values. In C/C++, floating-point variables are created with keywords such as `float` or `double`. The IEEE 754 standard, established in 1985, defines the most widely used format for floating-point arithmetic, ensuring consistency across hardware and software.

-IEEE 754 defines two primary formats: single-precision (32-bit) and double-precision (64-bit).
+IEEE 754 specifies two primary formats: single-precision (32-bit) and double-precision (64-bit).

 Each floating-point number consists of three components:
-- **sign bit**. (Determining positive or negative value)
-- **exponent** (defining the scale or magnitude)
-- **significand** (also called the mantissa, representing the significant digits of the number).

-The standard uses a biased exponent to handle both large and small numbers efficiently, and it incorporates special values such as NaN (Not a Number), infinity, and subnormal numbers for robust numerical computation. A key feature of IEEE 754 is its support for rounding modes and exception handling, ensuring predictable behavior in mathematical operations. However, floating-point arithmetic is inherently imprecise due to limited precision, leading to small rounding errors.
+- **Sign bit**: Determines the sign (positive or negative).
+- **Exponent**: Sets the scale or magnitude.
+- **Significand**: Holds the significant digits in binary.

-The graphic below illustrates various forms of floating point representation supported by Arm, each with varying number of bits assigned to the exponent and mantissa.
+The standard uses a biased exponent to handle both large and small numbers efficiently, and it incorporates special values such as NaN (Not a Number), infinity, and subnormal numbers. It supports rounding modes and exception handling, which help ensure predictable results. However, floating-point arithmetic is inherently imprecise, leading to small rounding errors.
+
+The graphic below shows various forms of floating-point representation supported by Arm, each with a varying number of bits assigned to the exponent and significand.

 ![floating-point](./floating-point-numbers.png)

 ## Rounding errors

-Since computers use a finite number of bits to store a continuous range of numbers, rounding errors are introduced. The unit in last place (ULP) is the smallest difference between two consecutive floating-point numbers. It measures floating-point rounding error, which arises because not all real numbers can be exactly represented.
+Because computers use a finite number of bits to store a continuous range of numbers, rounding errors are introduced. The unit in last place (ULP) is the smallest difference between two consecutive floating-point numbers. It quantifies the rounding error, which arises because not all real values can be exactly represented.
+
+Operations round results to the nearest representable value, introducing small discrepancies. This rounding error, often measured in ULPs, reflects how far the computed value may deviate from the exact mathematical result.

-When an operation is performed, the result is rounded to the nearest representable value, introducing a small error. This error, often measured in ULPs, indicates how close the computed value is to the exact result. For a simple example, if a floating-point schema with 3 bits for the mantissa (precision) and an exponent in the range of -1 to 2 is used, the possible values are represented in the graph below.
+For example, with 3 bits for the significand and an exponent range of -1 to 2, only a limited set of values can be represented. The diagram below illustrates these values.

 ![ulp](./ulp.png)

 Key takeaways:

-- ULP size varies with the number’s magnitude.
-- Larger numbers have bigger ULPs due to wider spacing between values.
-- Smaller numbers have smaller ULPs, reducing quantization error.
-- ULP behavior impacts numerical stability and precision in computations.
+- ULP size increases with magnitude.
+- Larger numbers have wider spacing between values (larger ULPs).
+- Smaller numbers have tighter spacing (smaller ULPs), reducing quantization error.
+- ULP behavior impacts numerical stability and precision.
+
+{{% notice Learning tip %}}
+Keep in mind that rounding and representation issues aren't bugs — they're a consequence of how floating-point math works at the hardware level. Understanding these fundamentals is essential when porting numerical code across architectures like x86 and Arm.
+{{% /notice %}}

+In the next section, you'll explore how x86 and Arm differ in how they implement and optimize floating-point operations — and why this matters for writing portable, accurate software.
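
The added text above describes the sign/exponent/significand split and ULP spacing in prose only. A short, self-contained sketch (not part of this commit) can make both concrete; it assumes `float` is an IEEE 754 binary32 value, as it is on both x86 and Arm Linux.

```cpp
// Illustrative sketch, not part of this commit. Assumes 'float' is IEEE 754
// binary32: 1 sign bit, 8 exponent bits (biased by 127), 23 significand bits.
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    float value = 6.25f;

    // Copy the float's bit pattern into an integer; memcpy avoids the
    // undefined behavior of pointer-based type punning.
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);

    unsigned sign        = bits >> 31;           // 1 bit
    unsigned exponent    = (bits >> 23) & 0xFFu; // 8 bits, biased
    unsigned significand = bits & 0x7FFFFFu;     // 23 bits

    std::printf("value=%g sign=%u exponent=%u (unbiased %d) significand=0x%06X\n",
                value, sign, exponent,
                static_cast<int>(exponent) - 127, significand);

    // ULP spacing grows with magnitude: the gap to the next representable
    // float is far larger near 1e6 than near 1.0.
    std::printf("ULP near 1.0: %g\n", std::nextafterf(1.0f, 2.0f) - 1.0f);
    std::printf("ULP near 1e6: %g\n", std::nextafterf(1.0e6f, 2.0e6f) - 1.0e6f);
    return 0;
}
```

For `6.25f` this prints sign 0, biased exponent 129 (unbiased 2), and significand `0x480000`, corresponding to 1.5625 × 2²; the two ULP lines show the spacing widening as the magnitude grows.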

content/learning-paths/cross-platform/floating-point-rounding-errors/how-to-2.md

Lines changed: 11 additions & 10 deletions
@@ -8,14 +8,13 @@ layout: learningpathall
 ## What are the differences in behavior between x86 and Arm floating point?

-Architecture and standards define floating point overflows and truncations in different ways.
+Although both x86 and Arm generally follow the IEEE 754 standard for floating-point representation, their behavior in edge cases — like overflow and truncation — can differ due to implementation details and instruction sets.

-You can see this by comparing an example application on an x86 and an Arm Linux system.
+You can see this by comparing an example application on both an x86 and an Arm Linux system.

-You can use any Linux systems for this example. If you are using AWS, you can use EC2 instance types
-`t3.micro` and `t4g.small` running Ubuntu 24.04.
+Run this example on any x86 and Arm Linux systems; on AWS, you can use EC2 instance types `t3.micro` and `t4g.small` running Ubuntu 24.04.

-To learn about floating point differences, use an editor to copy and paste the C++ code below into a new file named `converting-float.cpp`.
+To learn about floating-point differences, use an editor to copy and paste the C++ code below into a new file named `converting-float.cpp`:

 ```cpp
 #include <iostream>
@@ -61,7 +60,7 @@ int main() {
 }
 ```

-If you need to install the `g++` compiler, run the commands below.
+If you need to install the `g++` compiler, run the commands below:

 ```bash
 sudo apt update
@@ -76,23 +75,25 @@ The compile command is the same on both systems.
 g++ converting-float.cpp -o converting-float
 ```

-For easy comparison, the image below shows the x86 output (left) and Arm output (right). The highlighted lines show the difference in output.
+For easy comparison, the image below shows the x86 output (left) and Arm output (right). The highlighted lines show the difference in output:

 ![differences](./differences.png)

-As you can see, there are several cases where different behavior is observed. For example when trying to convert a signed number to a unsigned number or dealing with out-of-bounds numbers.
+As you can see, there are several cases where different behavior is observed, for example, when trying to convert a signed number to an unsigned number or when dealing with out-of-bounds numbers.

 ## Removing hardcoded values with macros

 The above differences show that explicitly checking for specific values will lead to unportable code.

-For example, consider the function below. The code checks if the value is 0. The value an x86 machine will convert a floating point number that exceeds the maximum 32-bit float value. This is different from Arm behavior leading to unportable code.
+For example, the function below checks if the casted result is `0`. This can be misleading — on x86, casting an out-of-range floating-point value to `uint32_t` may wrap to `0`, while on Arm it may behave differently. Relying on these results makes the code unportable.

 ```cpp
 void checkFloatToUint32(float num) {
     uint32_t castedNum = static_cast<uint32_t>(num);
     if (castedNum == 0) {
-        std::cout << "The casted number is 0, indicating the float could out of bounds for uint32_t." << std::endl;
+        std::cout << "The casted number is 0, indicating that the float is out of bounds for uint32_t." << std::endl;
     } else {
         std::cout << "The casted number is: " << castedNum << std::endl;
     }
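
The snippet above shows the unportable pattern, but the hunk stops short of a corrected version. One possible portable rewrite (a sketch, not from this commit) validates the range before casting, so the result never depends on architecture-specific out-of-range conversion behavior:

```cpp
// Sketch of a portable alternative, not part of this commit. The range
// check happens *before* the cast, so no architecture-specific
// out-of-range conversion result is ever inspected.
#include <cmath>
#include <cstdint>
#include <iostream>
#include <limits>

void checkFloatToUint32(float num) {
    // Reject NaN, negative values, and values at or above 2^32 up front.
    // Note: float cannot represent UINT32_MAX exactly; the cast below
    // rounds it up to 2^32, which is exactly the bound we want.
    constexpr float limit =
        static_cast<float>(std::numeric_limits<std::uint32_t>::max());
    if (std::isnan(num) || num < 0.0f || num >= limit) {
        std::cout << "The float is out of bounds for uint32_t." << std::endl;
        return;
    }
    std::cout << "The casted number is: "
              << static_cast<std::uint32_t>(num) << std::endl;
}
```

Because every input is classified before conversion, this version prints the same result on x86 and Arm.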

content/learning-paths/cross-platform/floating-point-rounding-errors/how-to-3.md

Lines changed: 5 additions & 5 deletions
@@ -10,13 +10,13 @@ layout: learningpathall
 One cause of different outputs between x86 and Arm stems from the order of instructions and how errors are propagated. As a hypothetical example, an Arm system may decide to reorder the instructions that each have a different rounding error so that subtle changes are observed.

-It is possible that 2 functions that are mathematically equivalent will propagate errors differently on a computer.
+It is possible that two functions that are mathematically equivalent will propagate errors differently on a computer.

 Functions `f1` and `f2` are mathematically equivalent. You would expect them to return the same value given the same input.

-If the input is a very small number, `1e-8`, the error is different due to the loss in precision caused by different operations. Specifically, `f2` avoids the subtraction of nearly equal number. For a full description look into the topic of [numerical stability](https://en.wikipedia.org/wiki/Numerical_stability).
+If the input is a very small number, `1e-8`, the error is different due to the loss in precision caused by different operations. Specifically, `f2` avoids subtracting nearly equal numbers. For a full description, look into the topic of [numerical stability](https://en.wikipedia.org/wiki/Numerical_stability).

-Use an editor to copy and paste the C++ code below into a file named `error-propagation.cpp`.
+Use an editor to copy and paste the C++ code below into a file named `error-propagation.cpp`:

 ```cpp
 #include <stdio.h>
@@ -53,13 +53,13 @@ int main() {
 }
 ```

-Compile the code on both x86 and Arm with the following command.
+Compile the code on both x86 and Arm with the following command:

 ```bash
 g++ -g error-propagation.cpp -o error-propagation
 ```

-Running the 2 binaries shows that the second function, `f2`, has a small rounding error on both architectures. Additionally, there is a further rounding difference when run on x86 compared to Arm.
+Running the two binaries shows that the second function, `f2`, has a small rounding error on both architectures. Additionally, there is a further rounding difference when run on x86 compared to Arm.

 Running on x86:
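
The diff elides the body of `error-propagation.cpp`, so only the hunk headers are visible here. For readers following along, a pair of the kind the text describes might look like the sketch below; this is an assumption about the shape of the example, not the file's actual contents. `f1` subtracts nearly equal numbers for small inputs, while the algebraically identical `f2` avoids the cancellation.

```cpp
// A sketch of the kind of function pair described above; the commit's
// actual error-propagation.cpp is not shown in this diff. f1 and f2 are
// mathematically equivalent, but f1 suffers catastrophic cancellation for
// small x while f2 avoids the subtraction entirely.
#include <cmath>
#include <cstdio>

double f1(double x) { return std::sqrt(1.0 + x) - 1.0; }        // cancels near x = 0
double f2(double x) { return x / (std::sqrt(1.0 + x) + 1.0); }  // same value, stable

int main() {
    double x = 1e-8;  // the small input mentioned in the text
    std::printf("f1(x)   = %.17g\n", f1(x));
    std::printf("f2(x)   = %.17g\n", f2(x));
    std::printf("f1 - f2 = %.17g\n", f1(x) - f2(x));
    return 0;
}
```

Both functions compute the same mathematical quantity, yet `f1` loses most of its significant digits at `x = 1e-8`, which is exactly the effect the Learning Path measures across the two architectures.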

content/learning-paths/cross-platform/floating-point-rounding-errors/how-to-4.md

Lines changed: 10 additions & 12 deletions
@@ -1,22 +1,20 @@
 ---
-title: Minimizing variability across platforms
+title: Minimizing floating-point variability across platforms
 weight: 5

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---

-## How can I minimize variability across x86 and Arm?
+## How can I minimize floating-point variability across x86 and Arm?

-The line `#pragma STDC FENV_ACCESS ON` is a directive that informs the compiler to enable access to the floating-point environment.
+The line `#pragma STDC FENV_ACCESS ON` is a directive that informs the compiler to enable access to the floating-point environment. The directive comes from the C99 standard (C++ inherits it through `<cfenv>`) and ensures that the program can properly handle floating-point exceptions and rounding modes, enabling your program to continue running if an exception is thrown.

-This is part of the C++11 standard and is used to ensure that the program can properly handle floating-point exceptions and rounding modes enabling your program to continue running if an exception is thrown.
-
-In the context below, enabling floating-point environment access is crucial because the functions you are working with involve floating-point arithmetic, which can be prone to precision errors and exceptions such as overflow, underflow, division by zero, and invalid operations. This is not necessary for this example, but is included because it may be relevant for your own application.
+In the context below, enabling floating-point environment access is crucial because the functions in this example involve floating-point arithmetic, which can be prone to precision errors and exceptions such as overflow, underflow, division by zero, and invalid operations. Although not strictly necessary for this example, the directive is included because it may be relevant for your own applications.

 This directive is particularly important when performing operations that require high numerical stability and precision, such as the square root calculations in functions below. It allows the program to manage the floating-point state and handle any anomalies that might occur during these calculations, thereby improving the robustness and reliability of your numerical computations.

-Use an editor to copy and paste the C++ file below into a file named `error-propagation-min.cpp`.
+Use an editor to copy and paste the C++ file below into a file named `error-propagation-min.cpp`:

 ```cpp
 #include <cstdio>
@@ -63,13 +61,13 @@ int main() {

 Compile on both computers, using the C++ flag, `-frounding-math`.

-You should use this flat when your program dynamically changes the floating-point rounding mode or needs to run correctly under different rounding modes. In this example, it results in a predictable rounding mode on function `f1` across x86 and Arm.
+You should use this flag when your program dynamically changes the floating-point rounding mode or needs to run correctly under different rounding modes. In this example, it ensures that `f1` uses a predictable rounding mode across both x86 and Arm.

 ```bash
 g++ -o error-propagation-min error-propagation-min.cpp -frounding-math
 ```

-Running the new binary on both systems leads to function, `f1` having a similar value to `f2`. Further the difference is now identical across both Arm64 and x86.
+Running the new binary on both systems shows that function `f1` produces a value nearly identical to `f2`, and the difference between them is now identical across both Arm64 and x86.

 Here is the output on both systems:

@@ -81,9 +79,9 @@ Difference (f1 - f2) = -1.7887354748e-17
 Final result after magnification: 0.0000999982
 ```

-G++ provides several compiler flags to help balance accuracy and performance such as`-ffp-contract` which is useful when lossy, fused operations are used, such as fused-multiple.
+G++ provides several compiler flags to help balance accuracy and performance. For example, `-ffp-contract` is useful when lossy, fused operations are used, such as fused multiply-add.

-Another example is `-ffloat-store` which prevents floating point variables from being stored in registers which can have different levels of precision and rounding.
+Another example is `-ffloat-store`, which prevents floating-point variables from being stored in registers, which can have different levels of precision and rounding.

-You can refer to compiler documentation for more information about the available flags.
+You can refer to compiler documentation for more information on the flags available.
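
The hunks above reference `#pragma STDC FENV_ACCESS ON` and `-frounding-math`, but the surrounding code is elided from the diff. As a rough illustration (not from this commit) of what floating-point environment access enables, `<cfenv>` can switch rounding modes at run time; note that some g++ versions warn that they ignore the pragma, which is one reason the `-frounding-math` flag matters.

```cpp
// Rough illustration, not part of this commit: run-time rounding-mode
// control via <cfenv>. Build with -frounding-math so the compiler does not
// assume the default round-to-nearest mode when optimizing.
#include <cfenv>
#include <cstdio>

#pragma STDC FENV_ACCESS ON

int main() {
    volatile double a = 1.0, b = 3.0;  // volatile discourages constant folding

    std::fesetround(FE_DOWNWARD);      // round toward -infinity
    double down = a / b;

    std::fesetround(FE_UPWARD);        // round toward +infinity
    double up = a / b;

    std::fesetround(FE_TONEAREST);     // restore the default mode
    std::printf("1/3 rounded down: %.17g\n", down);
    std::printf("1/3 rounded up:   %.17g\n", up);
    return 0;
}
```

Compiled with `-frounding-math`, the two printed values differ in their final digits, which is the behavior the text's `-frounding-math` discussion relies on.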
