Commit e07c58d

author
Your Name
committed
added x86 example and reformat
1 parent 3f52abd commit e07c58d

8 files changed

+376
-39
lines changed

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
---
title: Introduction to Performance Libraries
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Introduction to Performance Libraries

Performance libraries for Arm CPUs, such as the Arm Performance Libraries (APL), provide highly optimized mathematical functions for scientific computing, similar to how cuBLAS serves GPUs and Intel's MKL serves x86 architectures. These libraries can be linked dynamically at runtime or statically during compilation, offering flexibility in deployment. Generally, minimal source code changes are required to support these libraries, making them easy to integrate. They are designed to support multiple versions of the Arm architecture, including those with Neon and SVE extensions. Performance libraries are crafted through extensive benchmarking and optimization, and can be domain-specific, such as genomics libraries, or produced by Arm for general-purpose computing.

The ILP64 interface uses 64-bit integers, which are often needed to index large arrays in scientific computing. In C++ source code, the `long long` type specifies a 64-bit integer. The LP64 interface, by contrast, uses 32-bit integers (`int`), which are more common in general-purpose applications.

OpenMP (Open Multi-Processing) is a programming interface for parallelizing workloads across many CPU cores that share memory, and it is available on multiple platforms (for example x86 and AArch64). Programmers interact with it primarily through compiler directives, such as `#pragma omp parallel`, which indicate the sections of source code that can run in parallel and those that require synchronisation. This learning path does not teach OpenMP and presumes the reader is already familiar with it.

Like Intel's oneAPI Math Kernel Library (MKL) on x86, Arm Performance Libraries provide optimised functions for both ILP64 and LP64, in OpenMP and single-threaded implementations. Further, the interface libraries are available as shared libraries for dynamic linking (`*.so`) or as archives for static linking (`*.a`).

## Why Multiple Performance Libraries Exist

A natural source of confusion is the plethora of similar-seeming performance libraries, for example OpenBLAS and NVIDIA Performance Libraries (NVPL), which each have their own implementations of the same functions, such as the Basic Linear Algebra Subprograms (BLAS). This raises the question of which one a developer should use.

Multiple performance libraries exist to cater to the diverse needs of different hardware architectures and applications. For instance, Arm Performance Libraries are optimized for Arm CPUs, leveraging their instruction sets and power efficiency. NVIDIA Performance Libraries for the Grace CPU, on the other hand, are tailored to the hardware features specific to NVIDIA's own Neoverse implementation.

- **Hardware Specialization**: Some libraries are tuned for one vendor's hardware, while others are designed to be cross-platform, supporting multiple hardware architectures to provide flexibility and broader usability. For example, the OpenBLAS library supports both Arm and x86 architectures, allowing developers to use the same library across different systems.

- **Domain-Specific Libraries**: Libraries are often created to handle specific domains or types of computations more efficiently. For instance, libraries like cuDNN are optimized for deep learning tasks, providing specialized functions that significantly speed up neural network training and inference.

- **Commercial Libraries**: Some highly performant libraries require a license to use. This is more common for domain-specific libraries, such as those for computational chemistry or fluid dynamics.

These factors contribute to the existence of multiple performance libraries, each tailored to meet the specific demands of various hardware and applications.

For a directory of optimised libraries produced externally we recommend looking at the [Arm Ecosystem Dashboard](https://www.arm.com/developer-hub/ecosystem-dashboard/?utm_source=google&utm_medium=cpc&utm_content=text_txt_na_ecodash&utm_term=ecodash&utm_campaign=mk24_developer_devhub_keyword_traffic_na&utm_term=arm%20software&gad_source=1&gclid=Cj0KCQiAwOe8BhCCARIsAGKeD56NbfrF3zq4fw5inKdGQMUZFgPqpfLjupj3KVgBsYu4ko7abMI0ePMaAkHNEALw_wcB). There are useful filters for open-source and commercial implementations.

Invariably, there will be performance differences between libraries, and the best way to observe them is to use each library within your own program. For more information please read [this blog](https://community.arm.com/arm-community-blogs/b/servers-and-cloud-computing-blog/posts/arm-performance-libraries-24-10).
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
---
title: Setting Up Your Environment
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Setting Up Your Environment

These steps must be run on an Arm CPU. First install the compilers and `make`:

```bash
sudo apt update
sudo apt install gcc g++ make
```

Install Arm Performance Libraries using the following [installation guide](https://learn.arm.com/install-guides/armpl/). Start by downloading and unpacking the 24.10 package:

```bash
wget https://developer.arm.com/-/cdn-downloads/permalink/Arm-Performance-Libraries/Version_24.10/arm-performance-libraries_24.10_deb_gcc.tar
tar xvf arm-performance-libraries_24.10_deb_gcc.tar
cd arm-performance-libraries_24.10_deb/
```

Run the installer from this directory as described in the installation guide, then set up environment modules so that the library can be found:

```bash
sudo add-apt-repository universe
sudo apt install environment-modules
source /usr/share/modules/init/bash
export MODULEPATH=$MODULEPATH:/opt/arm/modulefiles
module avail
```

The output lists the installed module:

```output
-------------------- /opt/arm/modulefiles --------------------
armpl/24.10.0_gcc

Key:
```

Load the module, which sets `ARMPL_DIR`, then build the LP64 C examples that ship with the library:

```bash
module load armpl/24.10.0_gcc
cd $ARMPL_DIR
cd examples_lp64/
sudo -E make c_examples  # -E preserves environment variables
```
Lines changed: 123 additions & 0 deletions
@@ -0,0 +1,123 @@
---
title: Using Optimised Math Library
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Example using Optimised Math library

The libamath library from Arm is an optimized subset of the standard library math functions, providing both scalar and vector functions at different levels of precision. It includes vectorized versions (Neon and SVE) of common math functions found in the standard library, such as those in the `<cmath>` header.

The trivial snippet below uses the standard `<cmath>` header. Copy and paste the code sample below into a file named `basic_math.cpp`.

```c++
#include <iostream>
#include <cstdlib>  // std::rand, std::srand
#include <ctime>
#include <cmath>    // Include the standard math library

int main() {
    std::srand(std::time(0));
    double random_number = std::rand() / static_cast<double>(RAND_MAX);
    double result = std::exp(random_number); // Use the standard exp function
    std::cout << "Exponential of " << random_number << " is " << result << std::endl;
    return 0;
}
```

Compile using the following `g++` command. The `ldd` command prints the shared objects that are dynamically linked. Here we observe that the standard math library `libm.so` is linked.

```output
g++ basic_math.cpp -o basic_math
ldd basic_math
linux-vdso.so.1 (0x0000f55218587000)
libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000f55218200000)
libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000f55218490000)
libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000f55218050000)
/lib/ld-linux-aarch64.so.1 (0x0000f5521854e000)
libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000f55218460000)
```

## Updating to use Optimised Library

Using the optimised math library `libamath` requires minimal source code changes: modify the include statement to point to the correct header file and add the linker flags.

Copy and paste the following C++ snippet into a file named `optimised_math.cpp`.

```c++
#include <iostream>
#include <cstdlib>  // std::rand, std::srand
#include <ctime>
#include <amath.h>  // Include the Arm Performance Library header

int main() {
    std::srand(std::time(0));
    double random_number = std::rand() / static_cast<double>(RAND_MAX);
    double result = exp(random_number); // Use the optimized exp function from libamath
    std::cout << "Exponential of " << random_number << " is " << result << std::endl;
    return 0;
}
```

Compile using the following `g++` command. Again, the `ldd` command prints the shared objects for dynamic linking. Now we can observe that the `libamath.so` shared object is linked.

```output
g++ optimised_math.cpp -o optimised_math -lamath -lm
ldd optimised_math
linux-vdso.so.1 (0x0000eb1eb379b000)
libamath.so => /opt/arm/armpl_24.10_gcc/lib/libamath.so (0x0000eb1eb35c0000)
libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000eb1eb3200000)
libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000eb1eb3050000)
libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000eb1eb3520000)
/lib/ld-linux-aarch64.so.1 (0x0000eb1eb3762000)
libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000eb1eb34f0000)
```

The same link flags can also be applied to a source file that keeps the standard `<cmath>` header, for example the following saved as `x.cpp`:

```c++
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <cmath>    // Include the standard math header

int main() {
    std::srand(std::time(0));
    double random_number = std::rand() / static_cast<double>(RAND_MAX);
    double result = exp(random_number); // Use the regular exp call
    std::cout << "Exponential of " << random_number << " is " << result << std::endl;
    return 0;
}
```

```bash
g++ x.cpp -o x -lamath -lm
```

The `ldd` output again shows `libamath.so` among the linked shared objects:

```output
ldd x
linux-vdso.so.1 (0x0000ef553b10a000)
libamath.so => /opt/arm/armpl_24.10_gcc/lib/libamath.so (0x0000ef553af30000)
libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000ef553ac00000)
libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ef553aa50000)
libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ef553ae90000)
/lib/ld-linux-aarch64.so.1 (0x0000ef553b0d1000)
libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000ef553ae60000)
```
Lines changed: 161 additions & 0 deletions
@@ -0,0 +1,161 @@
---
title: Moving from x86 to AArch64
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Example: Porting an Application that Uses Intel's Vector Statistics Library

OpenRNG is an open-source Random Number Generator (RNG) library, initially released with Arm Performance Libraries 24.04, designed to improve performance when porting applications to Arm. It serves as a drop-in replacement for Intel's Vector Statistics Library (VSL). OpenRNG supports various RNG types, including pseudorandom, quasirandom, and nondeterministic generators, and offers tools for efficient multithreading and converting random sequences into common probability distributions. A vector of random numbers is a sequence of numbers that appear random and are used in various applications, such as simulating unpredictable natural processes, modeling financial markets, and creating unpredictable AI behaviors in gaming.

## Run on an x86 Instance

To demonstrate porting, we will start with an application running on an x86_64 AWS `t3.2xlarge` instance with 32GB of storage. Please refer to the [Getting started with Servers and Cloud computing](https://learn.arm.com/learning-paths/servers-and-cloud-computing/intro/) guide and select an x86 instance type.

Install the oneAPI toolkit using [Intel's instructions](https://www.intel.com/content/www/us/en/docs/oneapi/installation-guide-linux/2023-0/apt.html#GUID-560A487B-1B5B-4406-BB93-22BC7B526BCD).

The following source code uses a classic algorithm to estimate pi. Copy and paste the source code below into a file named `pi_x86.c`.

```c
/*
 * SPDX-FileCopyrightText: <text>Copyright 2024 Arm Limited and/or its
 * affiliates <[email protected]></text>
 *
 * SPDX-License-Identifier: MIT OR Apache-2.0 WITH LLVM-exception
 */

#include <mkl.h> // Using Vector Statistics Library
#include <stdio.h>
#include <stdlib.h>

void assert_message(int condition, const char *message) {
  if (!condition) {
    printf("Error: %s\n", message);
    exit(EXIT_FAILURE);
  }
}

int main() {

  const size_t nIterations = 1000 * 1000;
  const size_t nRandomNumbers = 2 * nIterations;

  //
  // Declare and initialise the stream.
  //
  // In this example, we've selected the PHILOX4X32X10 generator and seeded it
  // with 42. We can then check that the method executed successfully by checking
  // the return value for VSL_ERROR_OK. Most methods return VSL_ERROR_OK on
  // success.
  //
  VSLStreamStatePtr stream;
  int errcode = vslNewStream(&stream, VSL_BRNG_PHILOX4X32X10, 42);
  assert_message(errcode == VSL_ERROR_OK, "vslNewStream failed");

  //
  // Allocate a buffer for storing random numbers.
  //
  float *randomNumbers = malloc(nRandomNumbers * sizeof(float));
  assert_message(randomNumbers != NULL, "malloc failed");

  //
  // Generate a uniform distribution between 0 and 1.
  //
  // First, we select the method used to generate the uniform distribution; in
  // this example, we use the standard method. We pass in a pointer to an
  // initialised stream, the amount of random numbers we want, followed by a
  // pointer to a buffer big enough to hold all the random numbers requested.
  // Finally, we pass in parameters specific to the distribution, in this case,
  // 0 and 1, meaning we want the range [0, 1).
  //
  errcode = vsRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, nRandomNumbers,
                         randomNumbers, 0, 1);
  assert_message(errcode == VSL_ERROR_OK, "vsRngUniform failed");

  //
  // Use the random numbers.
  //
  // This is a classic algorithm used for estimating the value of pi. We imagine
  // a unit square overlapping a quarter of a circle with unit radius. We then
  // treat pairs of successive random numbers as points on the unit square. We
  // can check if the point is inside the quarter circle by measuring the
  // distance between the point and the centre of the circle; if the distance is
  // less than 1, the point is inside the circle. The proportion of points
  // inside the circle should be
  //
  //   (area of quarter circle) / (area of square) := pi / 4.
  //
  // so
  //
  //   pi = 4 * (proportion of points inside circle)
  //
  int count = 0;
  for (size_t i = 0; i < nIterations; i++) {
    float x = randomNumbers[2 * i + 0];
    float y = randomNumbers[2 * i + 1];

    if (x * x + y * y < 1) {
      count++;
    }
  }
  float estimateOfPi = 4.0f * count / nIterations;

  printf("Estimate of pi: %f\n", estimateOfPi);
  printf("Number of iterations: %zu\n", nIterations);

  //
  // The buffer passed into vsRngUniform is still owned by the user.
  //
  free(randomNumbers);

  //
  // Release any resources held by the stream.
  //
  errcode = vslDeleteStream(&stream);
  assert_message(errcode == VSL_ERROR_OK, "vslDeleteStream failed");

  return EXIT_SUCCESS;
}
```

Compile the source code by running the following commands. Note that you may need to adjust the oneAPI version from 2025.0 to the version installed on your system.

```bash
export LD_LIBRARY_PATH=/opt/intel/oneapi/2025.0/lib:$LD_LIBRARY_PATH
gcc -o pi_x86 pi_x86.c -lmkl_rt -I/opt/intel/oneapi/2025.0/include -L/opt/intel/oneapi/2025.0/lib
```

Using the `ldd` command to print the shared objects, we can see that we are linking against `libmkl_rt`.

```output
ldd ./pi_x86
linux-vdso.so.1 (0x00007fff9ddc7000)
libmkl_rt.so.2 => /opt/intel/oneapi/2025.0/lib/libmkl_rt.so.2 (0x0000748c46400000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000748c46000000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x0000748c4711a000)
/lib64/ld-linux-x86-64.so.2 (0x0000748c4712c000)
```

## Porting to use OpenRNG

In most cases, OpenRNG is a drop-in replacement for the Vector Statistics Library. Please refer to the reference guide for full information on which functions are supported. To enable this source code to run on Arm, we simply need to change the header file.

```c
// from
#include <mkl.h>
// to
#include <openrng.h>
```

Save the modified source as `pi.c`, then compile and link it against Arm Performance Libraries. The build commands and a sample run are shown below.

```output
gcc -c -mcpu=native -I/opt/arm/armpl_24.10_gcc/include -std=c99 pi.c -o pi.o
gcc -mcpu=native pi.o -L/opt/arm/armpl_24.10_gcc/lib -larmpl -lamath -lm -o pi.exe

Running program openrng.exe:
Estimate of pi: 3.142112
Number of iterations: 1000000
```
