Commit ffbf09f

author
Your Name
committed
refined
1 parent e07c58d commit ffbf09f

File tree

3 files changed

+67
-57
lines changed
  • content/learning-paths/servers-and-cloud-computing/Optimised-libraries-on-Arm

content/learning-paths/servers-and-cloud-computing/Optimised-libraries-on-Arm/1.md

Lines changed: 16 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -8,26 +8,36 @@ layout: learningpathall
88

99
## Introduction to Performance Libraries
1010

11-
Performance libraries for Arm CPUs, such as the Arm Performance Libraries (APL), provide highly optimized mathematical functions for scientific computing, similar to how cuBLAS serves GPUs and Intel's MKL serves x86 architectures. These libraries can be linked dynamically at runtime or statically during compilation, offering flexibility in deployment. Generally, minimal source code changes are required to support these libraries, making them easy to integrate. They are designed to support multiple versions of the Arm architecture, including those with NEON and SVE extensions. Performance libraries are crafted through extensive benchmarking and optimization, and can be domain-specific, such as genomics libraries, or produced by Arm for general-purpose computing.
11+
The C++ Standard Library provides a collection of classes and functions that are essential for everyday programming tasks, such as data structures, algorithms, and input/output operations. It is designed to be versatile and easy to use, ensuring compatibility and portability across different platforms. However, this portability comes with limitations, because a portable implementation cannot be tuned for any one processor. Performance-sensitive applications may need to take maximum advantage of the hardware's capabilities. This is where performance libraries come in.
1212

13-
ILP64 use 64 bits for representing integers, which are often used for indexing large arrays in scentific computing. In C++ source code we use the `long long` type to specify 64-bit integers. Alternatively, LP64 use 32 bits to present integers which are more common in general purpose applications.
13+
Performance libraries like OpenRNG are specialized for high-performance computing tasks and are often tailored to the microarchitecture of a specific processor. They are optimized for speed and efficiency, often leveraging hardware-specific features such as vector units. They are crafted through extensive benchmarking and optimization, and can be domain-specific, such as genomics libraries, or produced by Arm for general-purpose computing. For example, OpenRNG focuses on generating random numbers quickly and efficiently, which is crucial for simulations and scientific computations, whereas the C++ Standard Library offers a more general-purpose approach with engines such as `std::mt19937`.
1414

15-
Open Multi-process is a programming interface for paralleling workloads across many CPU cores on shared memory across multiple platforms (i.e. x86, AArch64 etc.). Programmers would interact primarily through compiler directives, such as `#pragma omp parallel` indicating which section of source code can be run on parallel and which require synchronisation. This learning path does not serve to teach you about OpenMP but presumes the reader is familiar.
15+
Performance libraries for Arm CPUs, such as the Arm Performance Libraries (APL), provide highly optimized mathematical functions for scientific computing, much as cuBLAS is an optimised library specifically for NVIDIA GPUs. These libraries can be linked dynamically at runtime or statically during compilation, offering flexibility in deployment. They are designed to support multiple versions of the Arm architecture, including those with NEON and SVE extensions. Generally, minimal source code changes are required to support these libraries, making them easy to integrate.
16+
17+
### Common Versions of performance libraries
18+
19+
Performance libraries are often distributed in the following variants to support various use cases.
20+
21+
- **ILP64** uses 64 bits to represent integers, which are often needed for indexing large arrays in scientific computing. In C++ source code we use the `long long` type to specify 64-bit integers.
22+
23+
- **LP64** uses 32 bits to represent integers, which is more common in general-purpose applications.
24+
25+
- **Open Multi-Processing** (OpenMP) is a programming interface for parallelising workloads across many CPU cores with shared memory, and it is available on multiple platforms (e.g. x86, AArch64). Programmers interact with it primarily through compiler directives, such as `#pragma omp parallel`, which indicate which sections of source code can run in parallel and which sections require synchronisation. This learning path does not teach OpenMP and presumes the reader is familiar with it.
1626

1727
Arm Performance Libraries, like the x86 equivalent, Intel's oneAPI Math Kernel Library (oneMKL), provide optimised functions for both ILP64 and LP64, in either OpenMP or single-threaded implementations. Further, the interface libraries are available as shared libraries for dynamic linking (`*.so`) or as archives for static linking (`*.a`).
1828

1929
## Why Multiple Performance Libraries Exist
2030

21-
A natural source of confusion stems from the plethora of similar seeming performance libraries, for example OpenBLAS, NVIDIA Performance Libraries (NVPL) which have their own implementations for specific functions, for example basic linear algebra subprograms (BLAS). This begs the question which one should a developer use.
31+
A natural source of confusion is the plethora of similar-seeming performance libraries, such as OpenBLAS and the NVIDIA Performance Libraries (NVPL), each with its own implementation of common functionality such as the basic linear algebra subprograms (BLAS). This raises the question: which one should a developer use?
2232

23-
Multiple performance libraries exist to cater to the diverse needs of different hardware architectures and applications. For instance, Arm performance libraries are optimized for Arm CPUs, leveraging their unique instruction sets and power efficiency. On the other hand, NVIDIA performance libraries for Grace CPU are tailored to maximize the performance of NVIDIA's Grace hardware features specific to their own Neoverse implementation.
33+
Multiple performance libraries coexist to cater to the diverse needs of different hardware architectures and applications. For instance, Arm performance libraries are optimized for Arm CPUs, leveraging their unique instruction sets and power efficiency. On the other hand, the NVIDIA Performance Libraries for the Grace CPU are tailored to the hardware features specific to NVIDIA's own Neoverse implementation.
2434

2535
- **Hardware Specialization**: Some libraries are tuned for a single hardware architecture, while others are designed to be cross-platform, supporting multiple architectures to provide flexibility and broader usability. For example, the OpenBLAS library supports both Arm and x86 architectures, allowing developers to use the same library across different systems.
2636

2737
- **Domain-Specific Libraries**: Libraries are often created to handle specific domains or types of computations more efficiently. For instance, libraries like cuDNN are optimized for deep learning tasks, providing specialized functions that significantly speed up neural network training and inference.
2838
These factors contribute to the existence of multiple performance libraries, each tailored to meet the specific demands of various hardware and applications.
2939

30-
- **Commercial Libraries**: Alternatively, highly performant libraries require a license to use. This is more common in domain specific libraries such as computations chemistry or fluid dynamics.
40+
- **Commercial Libraries**: Alternatively, some highly performant libraries require a license to use. This is more common for domain-specific libraries, such as those for computational chemistry or fluid dynamics.
3141

3242
For a directory of optimised libraries produced externally, we recommend looking at the [Arm Ecosystem Dashboard](https://www.arm.com/developer-hub/ecosystem-dashboard/?utm_source=google&utm_medium=cpc&utm_content=text_txt_na_ecodash&utm_term=ecodash&utm_campaign=mk24_developer_devhub_keyword_traffic_na&utm_term=arm%20software&gad_source=1&gclid=Cj0KCQiAwOe8BhCCARIsAGKeD56NbfrF3zq4fw5inKdGQMUZFgPqpfLjupj3KVgBsYu4ko7abMI0ePMaAkHNEALw_wcB). It has useful filters for open-source and commercial implementations.
3343

content/learning-paths/servers-and-cloud-computing/Optimised-libraries-on-Arm/2.md

Lines changed: 24 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -8,39 +8,59 @@ layout: learningpathall
88

99
## Setting Up Your Environment
1010

11+
In this initial example we will use an Arm-based AWS `t4g.2xlarge` instance along with the Arm Performance Libraries. For instructions to connect to an AWS instance, please see our [getting started guide](https://learn.arm.com/learning-paths/servers-and-cloud-computing/intro/).
1112

12-
- Run on Arm CPUs,
13+
Once connected via `ssh`, install the required packages with the following commands.
1314

1415
```bash
1516
sudo apt update
1617
sudo apt install gcc make
1718
```
18-
Install Arm performance libraries using the following [installation guide](https://learn.arm.com/install-guides/armpl/)
19+
Next, install Arm Performance Libraries using the following [installation guide](https://learn.arm.com/install-guides/armpl/). Alternatively, download and extract the package with the commands below, then complete the installation as described in the guide.
1920

2021
```bash
2122
wget https://developer.arm.com/-/cdn-downloads/permalink/Arm-Performance-Libraries/Version_24.10/arm-performance-libraries_24.10_deb_gcc.tar
2223
tar xvf arm-performance-libraries_24.10_deb_gcc.tar
2324
cd arm-performance-libraries_24.10_deb/
2425
```
25-
```bash
2626

27+
Now we need to install environment modules to set the required environment variables, allowing us to quickly build the example applications.
28+
29+
```bash
2730
sudo add-apt-repository universe
2831
sudo apt install environment-modules
2932
source /usr/share/modules/init/bash
3033
export MODULEPATH=$MODULEPATH:/opt/arm/modulefiles
3134
module avail
3235
```
3336

37+
You should see the `armpl/24.10.0_gcc` module listed as available.
3438
```output
3539
------------------------------------------------------------------------------------------------------- /opt/arm/modulefiles -------------------------------------------------------------------------------------------------------
3640
armpl/24.10.0_gcc
41+
```
3742

38-
Key:
43+
Load the module with the following command.
44+
45+
```bash
46+
module load armpl/24.10.0_gcc
3947
```
48+
49+
Navigate to the `lp64` C source code examples and compile.
50+
4051
```bash
4152
cd $ARMPL_DIR
4253
cd examples_lp64/
4354
sudo -E make c_examples  # -E preserves the environment variables set by the module
4455
```
4556

57+
Your terminal output should show the examples being compiled, ending with the following.
58+
59+
```output
60+
...
61+
Test passed OK
62+
```
63+
64+
For more information on all the available functions, please refer to the [Arm Performance Libraries Reference Guide](https://developer.arm.com/documentation/101004/latest/).
65+
4666

content/learning-paths/servers-and-cloud-computing/Optimised-libraries-on-Arm/3.md

Lines changed: 27 additions & 47 deletions
Original file line numberDiff line numberDiff line change
@@ -8,9 +8,9 @@ layout: learningpathall
88

99
## Example using Optimised Math library
1010

11-
The libamath library from Arm is an optimized subset of the standard library math functions, providing both scalar and vector functions at different levels of precision. It includes vectorized versions (Neon and SVE) of common math functions found in the standard library, such as those in the <cmath> header.
11+
The libamath library from Arm is an optimized subset of the standard library math functions, providing both scalar and vector functions at different levels of precision. It includes vectorized versions (Neon and SVE) of common math functions found in the standard library, such as those in the `<cmath>` header.
1212

13-
The trivial snippet below uses the `<cmath>` standard cmath header. Copy and paste the code sample below into a file named `basic_math.cpp`.
13+
The trivial snippet below uses the standard `<cmath>` header to calculate the natural (base e) exponential of a scalar value. Copy and paste the code sample below into a file named `basic_math.cpp`.
1414

1515
```c++
1616
#include <iostream>
@@ -20,17 +20,21 @@ The trivial snippet below uses the `<cmath>` standard cmath header. Copy and pas
2020
int main() {
2121
std::srand(std::time(0));
2222
double random_number = std::rand() / static_cast<double>(RAND_MAX);
23-
double result = exp(random_number); // Use the optimized exp function from libamath
23+
double result = exp(random_number); // Use the standard exponential function
2424
std::cout << "Exponential of " << random_number << " is " << result << std::endl;
2525
return 0;
2626
}
2727
```
2828

2929
Compile using the following g++ command. We can then use the `ldd` command to print the shared objects that are dynamically linked. Here we observe that `libm.so`, the full standard math library, is linked.
3030

31-
```output
31+
```bash
3232
g++ basic_math.cpp -o basic_math
3333
ldd basic_math
34+
```
35+
You should see the following output.
36+
37+
```output
3438
linux-vdso.so.1 (0x0000f55218587000)
3539
libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000f55218200000)
3640
libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000f55218490000)
@@ -41,7 +45,9 @@ ldd basic_math
4145

4246
## Updating to use Optimised Library
4347

44-
To use the optimised math library `libamath` requires minimal source code changes, just modifying the include statements to point to the correct header file and additional compiler flags.
48+
Using the optimised math library `libamath` requires minimal source code changes for our scalar example: modify the include statement to point to the correct header file and add the extra flags to the compile command.
49+
50+
Libamath routines have maximum errors below 4 ULPs, where ULP stands for Unit in the Last Place, the spacing between two consecutive floating-point numbers at a specific precision. These routines only support the default rounding mode (round-to-nearest, ties to even). Therefore, switching from libm to libamath results in a small accuracy loss on a range of routines, similar to other vectorized implementations of these functions.
4551

4652
Copy and paste the following C++ snippet into a file named `optimised_math.cpp`.
4753

@@ -61,9 +67,14 @@ int main() {
6167

6268
Compile using the following g++ command. Again we can use the `ldd` command to print the shared objects that are dynamically linked. Now we can observe that the `libamath.so` shared object is linked.
6369

64-
```output
70+
```bash
6571
g++ optimised_math.cpp -o optimised_math -lamath -lm
6672
ldd optimised_math
73+
```
74+
You should see the following output.
75+
76+
```output
77+
6778
linux-vdso.so.1 (0x0000eb1eb379b000)
6879
libamath.so => /opt/arm/armpl_24.10_gcc/lib/libamath.so (0x0000eb1eb35c0000)
6980
libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000eb1eb3200000)
@@ -73,51 +84,20 @@ ldd optimised_math
7384
libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000eb1eb34f0000)
7485
```
7586

87+
### What about vector operations?
7688

77-
```c++
78-
#include <iostream>
79-
#include <cstdlib>
80-
#include <ctime>
81-
#include <cmath> //
89+
The naming convention of the Arm Performance Library for scalar operations follows that of `libm`. Hence, we are able to simply update the header file and recompile. For vector operations, we have two options. We can rely on compiler autovectorisation, whereby the compiler generates the vector code for us; this is the approach used in the Arm Compiler for Linux (ACfL). Alternatively, we can call the vector routines directly, which use name mangling. Mangling is a technique that encodes extra information into a function's name, here to keep the vector variants of a function unique and avoid conflicts with the scalar version. This is particularly important in compiled languages like C++ and in environments where multiple libraries or modules may be used together.
8290

83-
int main() {
84-
std::srand(std::time(0));
85-
double random_number = std::rand() / static_cast<double>(RAND_MAX);
86-
double result = exp(random_number); // Use reg
87-
std::cout << "Exponential of " << random_number << " is " << result << std::endl;
88-
return 0;
89-
}
90-
```
91-
92-
```bash
93-
g++ x.cpp -o x -lamath -lm
94-
```
91+
In the context of Arm's AArch64 architecture, vector name mangling follows the specific convention below to differentiate between scalar and vector versions of functions.
9592

9693
```output
97-
ldd x
98-
linux-vdso.so.1 (0x0000ef553b10a000)
99-
libamath.so => /opt/arm/armpl_24.10_gcc/lib/libamath.so (0x0000ef553af30000)
100-
libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000ef553ac00000)
101-
libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ef553aa50000)
102-
libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ef553ae90000)
103-
/lib/ld-linux-aarch64.so.1 (0x0000ef553b0d1000)
104-
libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000ef553ae60000)
105-
```
106-
107-
```
108-
#include <iostream>
109-
#include <cstdlib>
110-
#include <ctime>
111-
#include <amath.h> // Include the Arm Performance Library header
112-
113-
int main() {
114-
std::srand(std::time(0));
115-
double random_number = std::rand() / static_cast<double>(RAND_MAX);
116-
double result = exp(random_number); // Use the optimized exp function from libamath
117-
std::cout << "Exponential of " << random_number << " is " << result << std::endl;
118-
return 0;
119-
}
120-
94+
'_ZGV' <isa> <mask> <vlen> <signature> '_' <original_name>
12195
```
12296

97+
- **original_name** : name of scalar libm function
98+
- **isa** : 'n' for Neon, 's' for SVE
99+
- **mask** : 'M' for masked/predicated version, 'N' for unmasked. Only masked routines are defined for SVE, and only unmasked for Neon.
100+
- **vlen** : integer number representing vector length expressed as number of lanes. For Neon, `<vlen>` is '2' in double-precision and '4' in single-precision. For SVE, `<vlen>` is 'x'.
101+
- **signature** : 'v' for 1 input floating point or integer argument, 'vv' for 2. See the AArch64 vector function ABI for more details.
123102

103+
Please refer to the [Arm Performance Library reference guide](https://developer.arm.com/documentation/101004/latest/) for more information.
