
Commit 5e3a0a0

Merge pull request #2375 from ArmDeveloperEcosystem/main
Production update
2 parents: 2840f3d + fe17602


61 files changed: +1776 −752 lines

.wordlist.txt

Lines changed: 12 additions & 1 deletion
@@ -4949,4 +4949,15 @@ uop
 walkthrough
 warmups
 xo
-yi
+yi
+AMX
+AlexNet
+FMAC
+MySql
+MyStrongPassword
+RDBMS
+SqueezeNet
+TIdentify
+goroutines
+mysqlslap
+squeezenet

content/install-guides/container.md

Lines changed: 4 additions & 4 deletions
@@ -46,7 +46,7 @@ sw_vers -productVersion
 Example output:
 
 ```output
-15.5
+15.6.1
 ```
 
 You must be running macOS 15.0 or later to use the Container CLI.
@@ -60,13 +60,13 @@ Go to the [GitHub Releases page](https://github.com/apple/container/releases) an
 For example:
 
 ```bash
-wget https://github.com/apple/container/releases/download/0.2.0/container-0.2.0-installer-signed.pkg
+wget https://github.com/apple/container/releases/download/0.4.1/container-0.4.1-installer-signed.pkg
 ```
 
 Install the package:
 
 ```bash
-sudo installer -pkg container-0.2.0-installer-signed.pkg -target /
+sudo installer -pkg container-0.4.1-installer-signed.pkg -target /
 ```
 
 This installs the Container binary at `/usr/local/bin/container`.
@@ -90,7 +90,7 @@ container --version
 Example output:
 
 ```output
-container CLI version 0.2.0
+container CLI version 0.4.1 (build: release, commit: 4ac18b5)
 ```
 
 ## Build and run a container

content/learning-paths/cross-platform/floating-point-behavior/_index.md

Lines changed: 6 additions & 5 deletions
@@ -3,22 +3,23 @@ title: Understand floating-point behavior across x86 and Arm architectures
 
 minutes_to_complete: 30
 
-who_is_this_for: This is an introductory topic for developers who are porting applications from x86 to Arm and want to understand floating-point behavior across these architectures. Both architectures provide reliable and consistent floating-point computation following the IEEE 754 standard.
+who_is_this_for: This is a topic for developers who are porting applications from x86 to Arm and want to understand floating-point behavior across these architectures. Both architectures provide reliable and consistent floating-point computation following the IEEE 754 standard.
 
 learning_objectives:
 - Understand that Arm and x86 produce identical results for all well-defined floating-point operations.
 - Recognize that differences only occur in special undefined cases permitted by IEEE 754.
-- Learn best practices for writing portable floating-point code across architectures.
-- Apply appropriate precision levels for portable results.
+- Learn to recognize floating-point differences and make your code portable across architectures.
 
 prerequisites:
 - Access to an x86 and an Arm Linux machine.
 - Familiarity with floating-point numbers.
 
-author: Kieran Hejmadi
+author:
+- Kieran Hejmadi
+- Jason Andrews
 
 ### Tags
-skilllevels: Introductory
+skilllevels: Advanced
 subjects: Performance and Architecture
 armips:
 - Cortex-A
Lines changed: 69 additions & 34 deletions
@@ -1,26 +1,26 @@
 ---
-title: Single and double precision considerations
+title: Precision and floating-point instruction considerations
 weight: 4
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
-## Understanding numerical precision differences in single vs double precision
+When moving from x86 to Arm you may see differences in floating-point behavior. Understanding these differences may require digging deeper into the details, including the precision and the floating-point instructions.
 
-This section explores how different levels of floating-point precision can affect numerical results. The differences shown here are not architecture-specific issues, but demonstrate the importance of choosing appropriate precision levels for numerical computations.
+This section explores an example with minor differences in floating-point results, particularly focused on Fused Multiply-Add (FMAC) operations. You can run the example to learn more about how the same C code can produce different results on different platforms.
 
-### Single precision limitations
+## Single precision and FMAC differences
 
-Consider two mathematically equivalent functions, `f1()` and `f2()`. While they should theoretically produce the same result, small differences can arise due to the limited precision of floating-point arithmetic.
+Consider two mathematically equivalent functions, `f1()` and `f2()`. While they should theoretically produce the same result, small differences can arise due to the limited precision of floating-point arithmetic and the instructions used.
 
-The differences shown in this example are due to using single precision (float) arithmetic, not due to architectural differences between Arm and x86. Both architectures handle single precision arithmetic according to IEEE 754.
+When these small differences are amplified, you can observe how Arm and x86 architectures handle floating-point operations differently, particularly with respect to FMAC (Fused Multiply-Add) operations. The example shows the Clang compiler on Arm using FMAC instructions by default, which can lead to slightly different results compared to x86, which is not using FMAC instructions.
 
 Functions `f1()` and `f2()` are mathematically equivalent. You would expect them to return the same value given the same input.
 
-Use an editor to copy and paste the C++ code below into a file named `single-precision.cpp`
+Use an editor to copy and paste the C code below into a file named `example.c`
 
-```cpp
+```c
 #include <stdio.h>
 #include <math.h>
 
@@ -42,74 +42,109 @@ int main() {
 
     // Theoretically, result1 and result2 should be the same
     float difference = result1 - result2;
-    // Multiply by a large number to amplify the error
+
+    // Multiply by a large number to amplify the error - using single precision (float)
+    // This is where architecture differences occur due to FMAC instructions
     float final_result = 100000000.0f * difference + 0.0001f;
+
+    // Using double precision for the calculation makes results consistent across platforms
+    double final_result_double = 100000000.0 * difference + 0.0001;
 
     // Print the results
     printf("f1(%e) = %.10f\n", x, result1);
    printf("f2(%e) = %.10f\n", x, result2);
     printf("Difference (f1 - f2) = %.10e\n", difference);
-    printf("Final result after magnification: %.10f\n", final_result);
+    printf("Final result after magnification (float): %.10f\n", final_result);
+    printf("Final result after magnification (double): %.10f\n", final_result_double);
 
     return 0;
 }
 ```
 
+You need access to an Arm and x86 Linux computer to compare the results. The output below is from Ubuntu 24.04 using Clang. The Clang version is 18.1.3.
+
 Compile and run the code on both x86 and Arm with the following command:
 
 ```bash
-g++ -g single-precision.cpp -o single-precision
-./single-precision
+clang -g example.c -o example -lm
+./example
 ```
 
-Output running on x86:
+The output running on x86:
 
 ```output
 f1(1.000000e-08) = 0.0000000000
 f2(1.000000e-08) = 0.0000000050
 Difference (f1 - f2) = -4.9999999696e-09
-Final result after magnification: -0.4999000132
+Final result after magnification (float): -0.4999000132
+Final result after magnification (double): -0.4998999970
 ```
 
-Output running on Arm:
+The output running on Arm:
 
 ```output
 f1(1.000000e-08) = 0.0000000000
 f2(1.000000e-08) = 0.0000000050
 Difference (f1 - f2) = -4.9999999696e-09
-Final result after magnification: -0.4998999834
+Final result after magnification (float): -0.4998999834
+Final result after magnification (double): -0.4998999970
 ```
 
-Depending on your compiler and library versions, you may get the same output on both systems. You can also use the `clang` compiler and see if the output matches.
+Notice that the double precision results are identical across platforms, while the single precision results differ.
+
+You can disable the fused multiply-add on Arm with a compiler flag:
+
+```bash
+clang -g -ffp-contract=off example.c -o example2 -lm
+./example2
+```
+
+Now the output of `example2` on Arm matches the x86 output.
+
+You can use `objdump` to look at the assembly instructions to confirm the use of FMAC instructions.
+
+Page through the `objdump` output to find the difference shown below in the `main()` function.
 
 ```bash
-clang -g single-precision.cpp -o single-precision -lm
-./single-precision
+llvm-objdump -d ./example | more
 ```
 
-In some cases the GNU compiler output differs from the Clang output.
+The Arm output includes `fmadd`:
+
+```output
+8c8: 1f010800 fmadd s0, s0, s1, s2
+```
 
-Here's what's happening:
+The x86 uses separate multiply and add instructions:
+
+```output
+125c: f2 0f 59 c1 mulsd %xmm1, %xmm0
+1260: f2 0f 10 0d b8 0d 00 00 movsd 0xdb8(%rip), %xmm1 # 0x2020 <_IO_stdin_used+0x20>
+1268: f2 0f 58 c1 addsd %xmm1, %xmm0
+```
 
-1. Different square root algorithms: x86 and Arm use different hardware and library implementations for `sqrtf(1 + 1e-8)`
+{{% notice Note %}}
+On Ubuntu 24.04 the GNU Compiler, `gcc`, produces the same result as x86 and does not use the `fmadd` instruction. Be aware that corner case examples like this may change in future compiler versions.
+{{% /notice %}}
 
-2. Tiny implementation differences get amplified. The difference between the two `sqrtf()` results is only about 3e-10, but this gets multiplied by 100,000,000, making it visible in the final result.
+## Techniques for consistent results
 
-3. Both `f1()` and `f2()` use `sqrtf()`. Even though `f2()` is more numerically stable, both functions call `sqrtf()` with the same input, so they both inherit the same architecture-specific square root result.
+You can make the results consistent across platforms in several ways:
 
-4. Compiler and library versions may produce different output due to different implementations of library functions such as `sqrtf()`.
+- Use double precision for critical calculations by changing `100000000.0f` to `100000000.0` (double precision).
 
-The final result is that x86 and Arm libraries compute `sqrtf(1.00000001)` with tiny differences in the least significant bits. This is normal and expected behavior and IEEE 754 allows for implementation variations in transcendental functions like square root, as long as they stay within specified error bounds.
+- Disable fused multiply-add operations using the `-ffp-contract=off` compiler flag.
 
-The very small difference you see is within acceptable floating-point precision limits.
+- Use the compiler flag `-ffp-contract=fast` to enable fused multiply-add on x86.
 
-### Key takeaways
+## Key takeaways
 
-- The small differences shown are due to library implementations in single-precision mode, not fundamental architectural differences.
-- Single-precision arithmetic has inherent limitations that can cause small numerical differences.
-- Using numerically stable algorithms, like `f2()`, can minimize error propagation.
-- Understanding [numerical stability](https://en.wikipedia.org/wiki/Numerical_stability) is important for writing portable code.
+- Different floating-point behavior between architectures can often be traced to specific hardware features or instructions such as Fused Multiply-Add (FMAC) operations.
+- FMAC performs multiplication and addition with a single rounding step, which can lead to different results compared to separate multiply and add operations.
+- Compilers may use FMAC instructions on Arm by default, but not on x86.
+- To ensure consistent results across platforms, consider using double precision for critical calculations and controlling compiler optimizations with flags like `-ffp-contract=off` and `-ffp-contract=fast`.
+- Understanding [numerical stability](https://en.wikipedia.org/wiki/Numerical_stability) remains important for writing portable code.
 
-By adopting best practices and appropriate precision levels, developers can ensure consistent results across platforms.
+If you see differences in floating-point results, it typically means you need to look a little deeper to find the causes.
 
-Continue to the next section to see how precision impacts the results.
+These situations are not common, but it is good to be aware of them as a software developer migrating to the Arm architecture. You can be confident that floating-point on Arm behaves predictably and that you can get consistent results across multiple architectures.
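
For reference, a minimal standalone C sketch of the single-rounding effect described in the diff above: it uses `fmaf()` from `<math.h>`, which computes the product and sum with a single rounding, and compares it against a separate multiply followed by an add. The hard-coded `difference` value is copied from the example output, and the exact printed values are an assumption that may vary by platform, compiler, and flags; compile with `-ffp-contract=off` so the separate expression is not contracted into a fused operation.

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    // Value of (f1 - f2) copied from the example output above
    float difference = -4.9999999696e-09f;

    // Two rounding steps: the multiply result is rounded, then the add is rounded
    float separate = 100000000.0f * difference + 0.0001f;

    // One rounding step: fmaf() rounds once, which is what a hardware fmadd does
    float fused = fmaf(100000000.0f, difference, 0.0001f);

    printf("separate multiply then add: %.10f\n", separate);
    printf("fused multiply-add (fmaf):  %.10f\n", fused);
    return 0;
}
```

With contraction disabled, the two printed values should differ only in the trailing digits, mirroring the x86 and Arm results shown above. As a source-level alternative to the `-ffp-contract=off` flag, the standard `#pragma STDC FP_CONTRACT OFF` pragma can disable contraction for a region of code; Clang honors this pragma, although support varies between compilers.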

content/learning-paths/cross-platform/floating-point-behavior/how-to-4.md

Lines changed: 0 additions & 74 deletions
This file was deleted.

content/learning-paths/cross-platform/vectorization-comparison/_index.md

Lines changed: 0 additions & 4 deletions
@@ -1,10 +1,6 @@
 ---
 title: "Migrate x86-64 SIMD to Arm64"
 
-draft: true
-cascade:
-  draft: true
-
 minutes_to_complete: 30
 
 who_is_this_for: This is an advanced topic for developers migrating vectorized (SIMD) code from x86-64 to Arm64.

content/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/1-overview.md

Lines changed: 9 additions & 7 deletions
@@ -6,32 +6,34 @@ weight: 2
 layout: learningpathall
 ---
 
-## TinyML
+## Overview
 
 This Learning Path is about TinyML. It is a starting point for learning how innovative AI technologies can be used on even the smallest of devices, making Edge AI more accessible and efficient. You will learn how to set up your host machine to facilitate compilation and ensure smooth integration across devices.
 
 This section provides an overview of the domain with real-life use cases and available devices.
+## What is TinyML?
+
 
 TinyML represents a significant shift in Machine Learning deployment. Unlike traditional Machine Learning, which typically depends on cloud-based servers or high-performance hardware, TinyML is tailored to function on devices with limited resources, constrained memory, low power, and fewer processing capabilities.
 
 TinyML has gained popularity because it enables AI applications to operate in real-time, directly on the device, with minimal latency, enhanced privacy, and the ability to work offline. This shift opens up new possibilities for creating smarter and more efficient embedded systems.
 
-### Benefits and applications
+## Benefits and applications
 
 The benefits of TinyML align well with the Arm architecture, which is widely used in IoT, mobile devices, and edge AI deployments.
 
 Here are some of the key benefits of TinyML on Arm:
 
 
-- **Power Efficiency**: TinyML models are designed to be extremely power-efficient, making them ideal for battery-operated devices like sensors, wearables, and drones.
+- Power efficiency: TinyML models are designed to be extremely power-efficient, making them ideal for battery-operated devices like sensors, wearables, and drones.
 
-- **Low Latency**: AI processing happens on-device, so there is no need to send data to the cloud, which reduces latency and enables real-time decision-making.
+- Low latency: AI processing happens on-device, so there is no need to send data to the cloud, which reduces latency and enables real-time decision-making.
 
-- **Data Privacy**: With on-device computation, sensitive data remains local, providing enhanced privacy and security. This is a priority in healthcare and personal devices.
+- Data privacy: with on-device computation, sensitive data remains local, providing enhanced privacy and security. This is a priority in healthcare and personal devices.
 
-- **Cost-Effective**: Arm devices, which are cost-effective and scalable, can now handle sophisticated Machine Learning tasks, reducing the need for expensive hardware or cloud services.
+- Cost-effective: Arm devices, which are cost-effective and scalable, can now handle sophisticated machine learning tasks, reducing the need for expensive hardware or cloud services.
 
-- **Scalability**: With billions of Arm devices in the market, TinyML is well-suited for scaling across industries, enabling widespread adoption of AI at the edge.
+- Scalability: with billions of Arm devices in the market, TinyML is well-suited for scaling across industries, enabling widespread adoption of AI at the edge.
 
 TinyML is being deployed across multiple industries, enhancing everyday experiences and enabling groundbreaking solutions. The table below shows some examples of TinyML applications.