Commit 5927a7d

Merge pull request #1770 from ArmDeveloperEcosystem/main
Production update
2 parents 1eb22c2 + f2b7707 commit 5927a7d

File tree

126 files changed

+5127
-1473
lines changed


.gitignore

Lines changed: 7 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -14,4 +14,10 @@ startup.sh
1414
nohup.out
1515

1616
venv/
17-
z_local_saved/
17+
z_local_saved/
18+
/.idea/
19+
/tools/.python-version
20+
/.python-version
21+
*.iml
22+
*.xml
23+

.wordlist.txt

Lines changed: 45 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3872,3 +3872,48 @@ upscales
38723872
upscaling
38733873
vl
38743874
webbot
3875+
APKs
3876+
ASR's
3877+
DLRM
3878+
DLRMv
3879+
DeepSeek
3880+
Geremy
3881+
MERCHANTABILITY
3882+
MLPerf’s
3883+
MoE
3884+
NONINFRINGEMENT
3885+
NaN
3886+
OCPU
3887+
OCaml
3888+
Ollama
3889+
Ollama's
3890+
Prefill
3891+
Unsloth’s
3892+
YAMLs
3893+
Yiyang
3894+
bartowski
3895+
bc
3896+
checkboxes
3897+
deepseek
3898+
diy
3899+
fenv
3900+
gguf
3901+
highmem
3902+
inria
3903+
lfs
3904+
lora
3905+
ollama
3906+
opam
3907+
perceptrons
3908+
personalization
3909+
rclone
3910+
screenspace
3911+
significand
3912+
stdbuf
3913+
sublicense
3914+
tok
3915+
truncations
3916+
ulp
3917+
unmangled
3918+
unportable
3919+
zeropoint

assets/contributors.csv

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -47,7 +47,7 @@ Alaaeddine Chakroun,Day Devs,Alaaeddine-Chakroun,alaaeddine-chakroun,,https://da
4747
Koki Mitsunami,Arm,,kmitsunami,,
4848
Chen Zhang,Zilliz,,,,
4949
Tianyu Li,Arm,,,,
50-
Georgios Mermigkis,VectorCamp,gMerm,georgios-mermigkis,,https://vectorcamp.gr/
50+
Georgios Mermigkis,VectorCamp,gMerm,georgios-mermigkis,,https://vectorcamp.gr/
5151
Ben Clark,Arm,,,,
5252
Han Yin,Arm,hanyin-arm,nacosiren,,
5353
Willen Yang,Arm,,,,
@@ -80,3 +80,5 @@ Tom Pilar,,,,,
8080
Cyril Rohr,,,,,
8181
Odin Shen,Arm,odincodeshen,odin-shen-lmshen,,
8282
Avin Zarlez,Arm,AvinZarlez,avinzarlez,,https://www.avinzarlez.com/
83+
Shuheng Deng,Arm,,,,
84+
Yiyang Fan,Arm,,,,
Lines changed: 56 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,56 @@
1+
---
2+
title: Learn about floating point rounding on Arm
3+
4+
draft: true
5+
cascade:
6+
draft: true
7+
8+
minutes_to_complete: 30
9+
10+
who_is_this_for: Developers porting applications from x86 to Arm who observe different floating point values on each platform.
11+
12+
learning_objectives:
13+
- Understand the differences between floating point numbers on x86 and Arm.
14+
- Understand factors that affect floating point behavior.
15+
- Use compiler flags to produce predictable behavior.
16+
17+
prerequisites:
18+
- Access to an x86 and an Arm Linux machine.
19+
- Basic understanding of floating point numbers.
20+
21+
author: Kieran Hejmadi
22+
23+
### Tags
24+
skilllevels: Introductory
25+
subjects: Performance and Architecture
26+
armips:
27+
- Cortex-A
28+
- Neoverse
29+
tools_software_languages:
30+
- C++
31+
operatingsystems:
32+
- Linux
33+
shared_path: true
34+
shared_between:
35+
- servers-and-cloud-computing
36+
- laptops-and-desktops
37+
- mobile-graphics-and-gaming
38+
39+
further_reading:
40+
- resource:
41+
title: G++ Optimisation Flags
42+
link: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
43+
type: documentation
44+
- resource:
45+
title: Floating-point environment
46+
link: https://en.cppreference.com/w/cpp/numeric/fenv
47+
type: documentation
48+
49+
50+
51+
### FIXED, DO NOT MODIFY
52+
# ================================================================================
53+
weight: 1 # _index.md always has weight of 1 to order correctly
54+
layout: "learningpathall" # All files under learning paths have this same wrapper
55+
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
56+
---
Lines changed: 41 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,41 @@
1+
---
2+
title: Floating Point Representations
3+
weight: 2
4+
5+
### FIXED, DO NOT MODIFY
6+
layout: learningpathall
7+
---
8+
9+
## Review of floating point numbers
10+
11+
If you are unfamiliar with floating point number representation, you can review [Learn about integer and floating-point conversions](/learning-paths/cross-platform/integer-vs-floats/introduction-integer-float-types/). It covers different data types and explains data type conversions.
12+
13+
Floating-point numbers are a fundamental representation of real numbers in computer systems, enabling efficient storage and computation of decimal values with varying degrees of precision. In C/C++, floating point variables are created with keywords such as `float` or `double`. The IEEE 754 standard, established in 1985, is the most widely used format for floating-point arithmetic, ensuring consistency across different hardware and software implementations.
14+
15+
IEEE 754 defines two primary formats: single-precision (32-bit) and double-precision (64-bit).
16+
17+
Each floating-point number consists of three components:
18+
- **sign bit** (determining positive or negative value)
19+
- **exponent** (defining the scale or magnitude)
20+
- **significand** (also called the mantissa, representing the significant digits of the number).
21+
22+
The standard uses a biased exponent to handle both large and small numbers efficiently, and it incorporates special values such as NaN (Not a Number), infinity, and subnormal numbers for robust numerical computation. A key feature of IEEE 754 is its support for rounding modes and exception handling, ensuring predictable behavior in mathematical operations. However, floating-point arithmetic is inherently imprecise due to limited precision, leading to small rounding errors.
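
The three components described above can be pulled out of a single-precision value with a few bit operations. The following is a minimal sketch, assuming `float` is the 32-bit IEEE 754 format (1 sign bit, 8 exponent bits with a bias of 127, 23 stored significand bits). The names `FloatFields` and `decodeFloat` are hypothetical and are not part of the Learning Path code.

```cpp
#include <cstdint>
#include <cstring>

// Assumption for this sketch: float is IEEE 754 single precision (32 bits).
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

// Hypothetical helper type holding the three raw bit fields.
struct FloatFields {
    uint32_t sign;      // 1 bit: 0 for positive, 1 for negative
    uint32_t exponent;  // 8 bits, stored with a bias of 127
    uint32_t fraction;  // 23 bits, the stored part of the significand
};

FloatFields decodeFloat(float value) {
    uint32_t bits = 0;
    // memcpy reinterprets the bytes of the float without undefined behavior
    std::memcpy(&bits, &value, sizeof(bits));
    return FloatFields{bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}
```

For example, `decodeFloat(1.0f)` gives a sign of 0, a stored exponent of 127 (the bias, meaning a scale of 2^0), and a fraction of 0.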
23+
24+
The graphic below illustrates various forms of floating point representation supported by Arm, each with a different number of bits assigned to the exponent and mantissa.
25+
26+
![floating-point](./floating-point-numbers.png)
27+
28+
## Rounding errors
29+
30+
Since computers use a finite number of bits to store a continuous range of numbers, rounding errors are introduced. The unit in the last place (ULP) is the gap between two consecutive floating-point numbers. It is used to measure floating-point rounding error, which arises because not all real numbers can be exactly represented.
31+
32+
When an operation is performed, the result is rounded to the nearest representable value, introducing a small error. This error, often measured in ULPs, indicates how close the computed value is to the exact result. As a simple example, consider a floating-point scheme with 3 bits for the mantissa (precision) and an exponent in the range of -1 to 2. The possible values are shown in the graph below.
33+
34+
![ulp](./ulp.png)
35+
36+
Key takeaways:
37+
38+
- ULP size varies with the number’s magnitude.
39+
- Larger numbers have bigger ULPs due to wider spacing between values.
40+
- Smaller numbers have smaller ULPs, reducing quantization error.
41+
- ULP behavior impacts numerical stability and precision in computations.
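
The first two takeaways can be checked directly. This is a minimal sketch that measures the gap to the next representable `float` with the standard `std::nextafter` function. The helper name `ulpAbove` is hypothetical.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: distance from x to the next larger representable float.
float ulpAbove(float x) {
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}
```

With single precision, `ulpAbove(1.0f)` equals `std::numeric_limits<float>::epsilon()`, and `ulpAbove(1024.0f)` is 1024 times larger, because the spacing doubles each time the exponent increases by one.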
Lines changed: 123 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,123 @@
1+
---
2+
title: Differences between x86 and Arm
3+
weight: 3
4+
5+
### FIXED, DO NOT MODIFY
6+
layout: learningpathall
7+
---
8+
9+
## What are the differences in behavior between x86 and Arm floating point?
10+
11+
The C++ standard leaves the result of converting an out-of-range floating point value to an integer undefined, so x86 and Arm handle floating point overflows and truncations in different ways.
12+
13+
You can see this by comparing an example application on an x86 and an Arm Linux system.
14+
15+
You can use any Linux system for this example. If you are using AWS, you can use EC2 instance types
16+
`t3.micro` and `t4g.small` running Ubuntu 24.04.
17+
18+
To learn about floating point differences, use an editor to copy and paste the C++ code below into a new file named `converting-float.cpp`.
19+
20+
```cpp
21+
#include <iostream>
22+
#include <cmath>
23+
#include <limits>
24+
#include <cstdint>
25+
26+
void convertFloatToInt(float value) {
27+
// Convert to unsigned 32-bit integer
28+
uint32_t u32 = static_cast<uint32_t>(value);
29+
30+
// Convert to signed 32-bit integer
31+
int32_t s32 = static_cast<int32_t>(value);
32+
33+
// Convert to unsigned 16-bit integer (truncation happens)
34+
uint16_t u16 = static_cast<uint16_t>(u32);
35+
uint8_t u8 = static_cast<uint8_t>(value);
36+
37+
// Convert to signed 16-bit integer (truncation happens)
38+
int16_t s16 = static_cast<int16_t>(s32);
39+
40+
std::cout << "Floating-Point Value: " << value << "\n";
41+
std::cout << " → uint32_t: " << u32 << " (0x" << std::hex << u32 << std::dec << ")\n";
42+
std::cout << " → int32_t: " << s32 << " (0x" << std::hex << s32 << std::dec << ")\n";
43+
std::cout << " → uint16_t (truncated): " << u16 << " (0x" << std::hex << u16 << std::dec << ")\n";
44+
std::cout << " → int16_t (truncated): " << s16 << " (0x" << std::hex << s16 << std::dec << ")\n";
45+
std::cout << " → uint8_t (truncated): " << static_cast<int>(u8) << std::endl;
46+
47+
std::cout << "----------------------------------\n";
48+
}
49+
50+
int main() {
51+
std::cout << "Demonstrating Floating-Point to Integer Conversion\n\n";
52+
53+
// Test cases
54+
convertFloatToInt(42.7f); // Normal case
55+
convertFloatToInt(-15.3f); // Negative value: out of range for the unsigned types
56+
convertFloatToInt(4294967296.0f); // Overflow: 2^32 (UINT32_MAX + 1)
57+
convertFloatToInt(3.4e+38f); // Large float exceeding UINT32_MAX
58+
convertFloatToInt(-3.4e+38f); // Large negative float
59+
convertFloatToInt(NAN); // NaN behavior on different platforms
60+
return 0;
61+
}
62+
```
63+
64+
If you need to install the `g++` compiler, run the commands below.
65+
66+
```bash
67+
sudo apt update
68+
sudo apt install g++ -y
69+
```
70+
71+
Compile `converting-float.cpp` on both the Arm and the x86 machine.
72+
73+
The compile command is the same on both systems.
74+
75+
```bash
76+
g++ converting-float.cpp -o converting-float
77+
```
78+
79+
For easy comparison, the image below shows the x86 output (left) and Arm output (right). The highlighted lines show the difference in output.
80+
81+
![differences](./differences.png)
82+
83+
As you can see, there are several cases where different behavior is observed, for example when converting a negative number to an unsigned type or when dealing with out-of-bounds numbers.
84+
85+
## Removing hardcoded values with macros
86+
87+
The above differences show that explicitly checking for specific values will lead to unportable code.
88+
89+
For example, consider the function below. The code checks whether the cast result is 0, which is the value an x86 machine produces when it converts a floating point number that exceeds the maximum 32-bit unsigned integer value. Arm behaves differently, so this check leads to unportable code.
90+
91+
```cpp
92+
void checkFloatToUint32(float num) {
93+
uint32_t castedNum = static_cast<uint32_t>(num);
94+
if (castedNum == 0) {
95+
std::cout << "The casted number is 0, indicating the float could be out of bounds for uint32_t." << std::endl;
96+
} else {
97+
std::cout << "The casted number is: " << castedNum << std::endl;
98+
}
99+
}
100+
```
101+
102+
This can simply be corrected by using the macro, `UINT32_MAX`.
103+
104+
{{% notice Note %}}
105+
To find out all the available compiler-defined macros, you can output them using:
106+
```bash
107+
echo "" | g++ -dM -E -
108+
```
109+
{{% /notice %}}
110+
111+
A portable version of the code is:
112+
113+
```cpp
114+
void checkFloatToUint32(float num) {
115+
uint32_t castedNum = static_cast<uint32_t>(num);
116+
if (castedNum == UINT32_MAX) {
117+
std::cout << "The casted number is " << UINT32_MAX << " indicating the float was out of bounds for uint32_t." << std::endl;
118+
} else {
119+
std::cout << "The casted number is: " << castedNum << std::endl;
120+
}
121+
}
122+
```
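
As an alternative sketch, and not the approach shown above, the range can be tested before the cast so that the out-of-range conversion never runs on either architecture. The helper name `tryFloatToUint32` is hypothetical, and the bounds assume a 32-bit IEEE 754 `float`.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper for this sketch: returns true and writes the converted
// value only when num is a number inside the uint32_t range.
bool tryFloatToUint32(float num, uint32_t &out) {
    // 4294967296.0f is 2^32, which a float represents exactly.
    // Values in (-1, 0) are accepted because they truncate to 0.
    if (std::isnan(num) || num <= -1.0f || num >= 4294967296.0f) {
        return false;
    }
    out = static_cast<uint32_t>(num);  // in range, so the result is well defined
    return true;
}
```

With this check, `42.7f` converts to 42, while `-15.3f`, `4294967296.0f`, `3.4e+38f`, and `NAN` are all rejected in the same way on both platforms.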
123+
Lines changed: 79 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,79 @@
1+
---
2+
title: Error propagation
3+
weight: 4
4+
5+
### FIXED, DO NOT MODIFY
6+
layout: learningpathall
7+
---
8+
9+
## What is error propagation in x86 and Arm systems?
10+
11+
One cause of different outputs between x86 and Arm is the order of instructions and how rounding errors propagate. As a hypothetical example, a compiler targeting Arm may reorder instructions that each carry a different rounding error, so that subtle changes are observed.
12+
13+
Two functions that are mathematically equivalent can propagate errors differently on a computer.
14+
15+
Functions `f1` and `f2` are mathematically equivalent. You would expect them to return the same value given the same input.
16+
17+
If the input is a very small number, such as `1e-8`, the error is different due to the loss in precision caused by different operations. Specifically, `f2` avoids the subtraction of nearly equal numbers. For a full description, look into the topic of [numerical stability](https://en.wikipedia.org/wiki/Numerical_stability).
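
The equivalence of the two functions follows from multiplying and dividing by the conjugate. This short derivation is added for illustration:

$$
\sqrt{1+x} - 1
= \frac{\left(\sqrt{1+x} - 1\right)\left(\sqrt{1+x} + 1\right)}{\sqrt{1+x} + 1}
= \frac{(1+x) - 1}{\sqrt{1+x} + 1}
= \frac{x}{\sqrt{1+x} + 1}
$$

In single precision, $1 + 10^{-8}$ rounds to exactly $1$ because $10^{-8}$ is smaller than half the spacing between $1$ and the next `float` (about $6 \times 10^{-8}$). The left-hand form therefore evaluates to $\sqrt{1} - 1 = 0$, while the right-hand form evaluates to $x / 2 = 5 \times 10^{-9}$, which is close to the exact result.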
18+
19+
Use an editor to copy and paste the C++ code below into a file named `error-propagation.cpp`.
20+
21+
```cpp
22+
#include <stdio.h>
23+
#include <math.h>
24+
25+
// Function 1: Computes sqrt(1 + x) - 1 using the naive approach
26+
float f1(float x) {
27+
return sqrtf(1 + x) - 1;
28+
}
29+
30+
// Function 2: Computes the same value using an algebraically equivalent transformation
31+
// This version is numerically more stable
32+
float f2(float x) {
33+
return x / (sqrtf(1 + x) + 1);
34+
}
35+
36+
int main() {
37+
float x = 1e-8; // A small value that causes floating-point precision issues
38+
float result1 = f1(x);
39+
float result2 = f2(x);
40+
41+
// Theoretically, result1 and result2 should be the same
42+
float difference = result1 - result2;
43+
// Multiply by a large number to amplify the error
44+
float final_result = 100000000.0f * difference + 0.0001f;
45+
46+
// Print the results
47+
printf("f1(%e) = %.10f\n", x, result1);
48+
printf("f2(%e) = %.10f\n", x, result2);
49+
printf("Difference (f1 - f2) = %.10e\n", difference);
50+
printf("Final result after magnification: %.10f\n", final_result);
51+
52+
return 0;
53+
}
54+
```
55+
56+
Compile the code on both x86 and Arm with the following command.
57+
58+
```bash
59+
g++ -g error-propagation.cpp -o error-propagation
60+
```
61+
62+
Running the two binaries shows that the first function, `f1`, loses all precision and returns 0 on both architectures, while `f2` returns approximately `5e-9`. Additionally, the magnified result differs slightly when run on x86 compared to Arm.
63+
64+
Running on x86:
65+
66+
```output
67+
f1(1.000000e-08) = 0.0000000000
68+
f2(1.000000e-08) = 0.0000000050
69+
Difference (f1 - f2) = -4.9999999696e-09
70+
Final result after magnification: -0.4999000132
71+
```
72+
73+
Running on Arm:
74+
```output
75+
f1(1.000000e-08) = 0.0000000000
76+
f2(1.000000e-08) = 0.0000000050
77+
Difference (f1 - f2) = -4.9999999696e-09
78+
Final result after magnification: -0.4998999834
79+
```
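
One mechanism that can produce a last-digit difference like this is a fused multiply-add, which rounds `a * b + c` once instead of twice. Whether the Arm build above actually uses a fused instruction depends on the compiler and its flags, so treat this as an assumption rather than a confirmed cause. The sketch below compares the two forms using the standard `std::fma` function. The function names are hypothetical.

```cpp
#include <cmath>

// Multiply, round to float, then add and round again (two roundings).
float multiplyThenAdd(float a, float b, float c) {
    // volatile stops the compiler from merging the two operations into one
    volatile float product = a * b;
    return product + c;
}

// Fused multiply-add: a * b + c is rounded only once, at the end.
float fusedMultiplyAdd(float a, float b, float c) {
    return std::fma(a, b, c);
}
```

Calling both functions with `a = 100000000.0f`, `b = -(1e-8f / 2.0f)` (the difference computed in the example), and `c = 0.0001f` gives results that differ by one unit in the last place, which is the same size as the gap between the two magnified results shown above.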
