ArmDeveloperEcosystem
diff --git a/content/learning-paths/servers-and-cloud-computing/Optimised-libraries-on-Arm/1.md b/content/learning-paths/servers-and-cloud-computing/Optimised-libraries-on-Arm/1.md
Lines changed: 34 additions & 0 deletions
diff --git a/content/learning-paths/servers-and-cloud-computing/Optimised-libraries-on-Arm/2.md b/content/learning-paths/servers-and-cloud-computing/Optimised-libraries-on-Arm/2.md
Lines changed: 46 additions & 0 deletions
diff --git a/content/learning-paths/servers-and-cloud-computing/Optimised-libraries-on-Arm/3.md b/content/learning-paths/servers-and-cloud-computing/Optimised-libraries-on-Arm/3.md
Lines changed: 123 additions & 0 deletions
diff --git a/content/learning-paths/servers-and-cloud-computing/Optimised-libraries-on-Arm/4.md b/content/learning-paths/servers-and-cloud-computing/Optimised-libraries-on-Arm/4.md
Lines changed: 161 additions & 0 deletions
@@ -0,0 +1,34 @@
+---
+title: Introduction to Performance Libraries
+weight: 2
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Introduction to Performance Libraries
+
+Performance libraries for Arm CPUs, such as the Arm Performance Libraries (APL), provide highly optimized mathematical functions for scientific computing, similar to how cuBLAS serves GPUs and Intel's MKL serves x86 architectures. These libraries can be linked dynamically at runtime or statically during compilation, offering flexibility in deployment. Generally, minimal source code changes are required to support these libraries, making them easy to integrate. They are designed to support multiple versions of the Arm architecture, including those with NEON and SVE extensions. Performance libraries are crafted through extensive benchmarking and optimization, and can be domain-specific, such as genomics libraries, or produced by Arm for general-purpose computing.
+
+The ILP64 interface uses 64-bit integers, which are often needed to index large arrays in scientific computing. In C++ source code, the `long long` type specifies an integer of at least 64 bits. In contrast, the LP64 interface uses 32-bit integers, which are more common in general-purpose applications.
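
As a rough illustration of when the 64-bit interface becomes necessary, the sketch below checks whether the element count of a dense matrix still fits in a signed 32-bit integer. The helper name `fits_in_32bit_index` is hypothetical and is not part of Arm Performance Libraries.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: returns true if a dense rows x cols matrix can be
// indexed with the signed 32-bit integers used by an LP64 interface.
bool fits_in_32bit_index(std::int64_t rows, std::int64_t cols) {
    assert(rows >= 0 && cols >= 0);
    if (cols == 0) {
        return true;  // An empty matrix has nothing to index.
    }
    // Divide instead of multiplying so the check itself cannot overflow.
    const std::int64_t max_index = std::numeric_limits<std::int32_t>::max();
    return rows <= max_index / cols;
}
```

A 1,000 x 1,000 matrix passes this check, while a 50,000 x 50,000 matrix (2.5 billion elements) does not, which is the kind of problem size where an ILP64 build is worth considering.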
+
+OpenMP (Open Multi-Processing) is a programming interface for parallelising workloads across many CPU cores that share memory, and it is available on multiple platforms (for example x86 and AArch64). Programmers interact with it primarily through compiler directives, such as `#pragma omp parallel`, which indicate which sections of source code can run in parallel and which require synchronisation. This learning path does not teach OpenMP; it assumes you are already familiar with it.
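
For readers who want a quick reminder of what such a directive looks like, here is a minimal sketch of a parallel reduction. The function name `parallel_sum` is made up for illustration. If the file is compiled without `-fopenmp`, GCC ignores the pragma and the loop runs on a single thread.

```cpp
#include <cstddef>
#include <vector>

// Sum a vector using an OpenMP reduction. Each thread accumulates a private
// partial sum, and OpenMP combines the partial sums when the loop ends.
double parallel_sum(const std::vector<double>& values) {
    double total = 0.0;
    const long long n = static_cast<long long>(values.size());
    #pragma omp parallel for reduction(+ : total)
    for (long long i = 0; i < n; ++i) {
        total += values[static_cast<std::size_t>(i)];
    }
    return total;
}
```

The `reduction` clause is what provides the synchronisation here: without it, all threads would update `total` at the same time.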
+
+Arm Performance Libraries, like the x86 equivalent, Intel oneAPI Math Kernel Library (oneMKL), provide optimised functions for both the ILP64 and LP64 interfaces, in OpenMP and single-threaded implementations. Further, the libraries are available as shared libraries for dynamic linking (`*.so`) or as archives for static linking (`*.a`).
+
+## Why Multiple Performance Libraries Exist
+
+The number of similar-seeming performance libraries is a natural source of confusion. For example, OpenBLAS and the NVIDIA Performance Libraries (NVPL) each have their own implementations of common functions such as the Basic Linear Algebra Subprograms (BLAS). This raises the question of which one a developer should use.
+
+Multiple performance libraries exist to cater to the diverse needs of different hardware architectures and applications. For instance, Arm Performance Libraries are optimized for Arm CPUs, leveraging their instruction sets and power efficiency. The NVIDIA Performance Libraries, on the other hand, are tailored to the NVIDIA Grace CPU and the hardware features specific to NVIDIA's Neoverse-based implementation.
+
+- **Hardware Specialization**: Libraries differ in how closely they target particular hardware. Some are designed to be cross-platform, supporting multiple hardware architectures to provide flexibility and broader usability. For example, the OpenBLAS library supports both Arm and x86 architectures, allowing developers to use the same library across different systems.
+
+- **Domain-Specific Libraries**: Libraries are often created to handle specific domains or types of computations more efficiently. For instance, libraries like cuDNN are optimized for deep learning tasks, providing specialized functions that significantly speed up neural network training and inference.
+
+- **Commercial Libraries**: Some highly performant libraries require a licence to use. This is more common for domain-specific libraries, such as those for computational chemistry or fluid dynamics.
+
+These factors contribute to the existence of multiple performance libraries, each tailored to meet the specific demands of various hardware and applications.
+
+For a directory of externally produced optimised libraries, see the [Arm Ecosystem Dashboard](https://www.arm.com/developer-hub/ecosystem-dashboard/?utm_source=google&utm_medium=cpc&utm_content=text_txt_na_ecodash&utm_term=ecodash&utm_campaign=mk24_developer_devhub_keyword_traffic_na&utm_term=arm%20software&gad_source=1&gclid=Cj0KCQiAwOe8BhCCARIsAGKeD56NbfrF3zq4fw5inKdGQMUZFgPqpfLjupj3KVgBsYu4ko7abMI0ePMaAkHNEALw_wcB). It provides useful filters for open-source and commercial implementations.
+
+Performance will invariably differ between libraries, and the best way to observe the difference is to use each library within your own program. For more information, read [this blog](https://community.arm.com/arm-community-blogs/b/servers-and-cloud-computing-blog/posts/arm-performance-libraries-24-10).
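
One simple way to compare libraries inside your own program is to time the same call against each build. The following is a minimal sketch of such a timer, assuming wall-clock time is a good enough measure for your workload. The helper name `best_time_ms` is hypothetical.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: run `work` `repeats` times and return the fastest
// wall-clock time in milliseconds. Taking the minimum reduces the influence
// of one-off noise such as cold caches.
template <typename Work>
double best_time_ms(Work work, int repeats) {
    assert(repeats > 0);
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(stop - start).count();
        if (best < 0.0 || ms < best) {
            best = ms;
        }
    }
    return best;
}
```

Pass the library call you care about as a lambda, build the program once per library, and compare the reported times on the machine you intend to deploy on.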
@@ -0,0 +1,46 @@
+---
+title: Setting Up Your Environment
+weight: 3
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Setting Up Your Environment
+
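+These steps assume an Arm-based (AArch64) machine running a Debian-based Linux distribution such as Ubuntu. First, install the compiler and build tools: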
+
+```bash
+sudo apt update
+sudo apt install gcc make
+```
+
+Install Arm Performance Libraries by following the [installation guide](https://learn.arm.com/install-guides/armpl/). The commands below download and unpack version 24.10 for GCC:
+
+```bash
+wget https://developer.arm.com/-/cdn-downloads/permalink/Arm-Performance-Libraries/Version_24.10/arm-performance-libraries_24.10_deb_gcc.tar
+tar xvf arm-performance-libraries_24.10_deb_gcc.tar
+cd arm-performance-libraries_24.10_deb/
+```
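+
+Run the installer in this directory as described in the installation guide. Then install Environment Modules and check that the Arm Performance Libraries module is available:
+
+```bash
+sudo add-apt-repository universe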
+sudo apt install environment-modules
+source /usr/share/modules/init/bash
+export MODULEPATH=$MODULEPATH:/opt/arm/modulefiles
+module avail
+```
+
+```output
+------------------------------------------------------------------------------------------------------- /opt/arm/modulefiles -------------------------------------------------------------------------------------------------------
+armpl/24.10.0_gcc  
+
+Key:
+```
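+
+Load the module, which sets the `ARMPL_DIR` environment variable, then build the bundled C examples:
+
+```bash
+module load armpl/24.10.0_gcc
+cd $ARMPL_DIR
+cd examples_lp64/
+sudo -E make c_examples # -E preserves environment variables
+```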
+
+
@@ -0,0 +1,123 @@
+---
+title: Using Optimised Math Library
+weight: 4
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Example using Optimised Math library
+
+The libamath library from Arm is an optimized subset of the standard library math functions, providing both scalar and vector functions at different levels of precision. It includes vectorized versions (Neon and SVE) of common math functions found in the standard library, such as those in the `<cmath>` header.
+
+The simple snippet below uses the standard `<cmath>` header. Copy and paste the code sample below into a file named `basic_math.cpp`.
+
+```c++
+#include <iostream>
+#include <cstdlib> // std::srand, std::rand
+#include <ctime>
+#include <cmath>  // Include the standard math header
+
+int main() {
+    std::srand(std::time(0));
+    double random_number = std::rand() / static_cast<double>(RAND_MAX);
+    double result = exp(random_number); // Use the standard exp function from libm
+    std::cout << "Exponential of " << random_number << " is " << result << std::endl;
+    return 0;
+}
+```
+
+Compile using the following `g++` command. We can use the `ldd` command to print the shared objects that are dynamically linked. Here we observe that the standard math library, `libm.so`, is linked.
+
+```output
+g++ basic_math.cpp -o basic_math
+ldd basic_math
+        linux-vdso.so.1 (0x0000f55218587000)
+        libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000f55218200000)
+        libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000f55218490000)
+        libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000f55218050000)
+        /lib/ld-linux-aarch64.so.1 (0x0000f5521854e000)
+        libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000f55218460000)
+```
+
+## Updating to use Optimised Library
+
+Using the optimised math library `libamath` requires minimal source code changes: modify the include statement to point to the correct header file and add the extra flags to the compile command.
+
+Copy and paste the following C++ snippet into a file named `optimised_math.cpp`.
+
+```c++
+#include <iostream>
+#include <cstdlib> // std::srand, std::rand
+#include <ctime>
+#include <amath.h> // Include the Arm Performance Library header
+
+int main() {
+    std::srand(std::time(0));
+    double random_number = std::rand() / static_cast<double>(RAND_MAX);
+    double result = exp(random_number); // Use the optimized exp function from libamath
+    std::cout << "Exponential of " << random_number << " is " << result << std::endl;
+    return 0;
+}
+```
+
+Compile using the following `g++` command. Again, we can use the `ldd` command to print the shared objects that are dynamically linked. Now we can observe that the `libamath.so` shared object is linked.
+
+```output
+g++ optimised_math.cpp -o optimised_math -lamath -lm
+ldd optimised_math
+        linux-vdso.so.1 (0x0000eb1eb379b000)
+        libamath.so => /opt/arm/armpl_24.10_gcc/lib/libamath.so (0x0000eb1eb35c0000)
+        libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000eb1eb3200000)
+        libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000eb1eb3050000)
+        libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000eb1eb3520000)
+        /lib/ld-linux-aarch64.so.1 (0x0000eb1eb3762000)
+        libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000eb1eb34f0000)
+```
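
A single call to `exp` is far too small to measure. As a sketch of a workload you could build both with and without `-lamath` and then time, the function below sums `exp` over many points. The name `sum_exp` is hypothetical, and whether the linked library changes the runtime on your system is something to measure rather than assume.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical workload: sum exp(x) over n evenly spaced points in [0, 1).
// Returning the sum keeps the compiler from discarding the calls.
double sum_exp(std::size_t n) {
    assert(n > 0);
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        total += std::exp(static_cast<double>(i) / static_cast<double>(n));
    }
    return total;
}
```

For large `n`, `sum_exp(n) / n` approaches e - 1 (about 1.71828), which gives a quick check that both builds compute the same result.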
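+
+The original source, which still includes `<cmath>`, can also be linked against `libamath` without changing the include. Copy the following into a file named `x.cpp`.
+
+```c++
+#include <iostream>
+#include <cstdlib>
+#include <ctime>
+#include <cmath> // Standard header, unchanged
+
+int main() {
+    std::srand(std::time(0));
+    double random_number = std::rand() / static_cast<double>(RAND_MAX);
+    double result = exp(random_number); // Same call as before
+    std::cout << "Exponential of " << random_number << " is " << result << std::endl;
+    return 0;
+}
+```
+
+Compile it with the same flags as before. The `ldd` output again lists `libamath.so`.
+
+```bash
+g++ x.cpp -o x -lamath -lm
+```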
+
+```output
+ldd x
+        linux-vdso.so.1 (0x0000ef553b10a000)
+        libamath.so => /opt/arm/armpl_24.10_gcc/lib/libamath.so (0x0000ef553af30000)
+        libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000ef553ac00000)
+        libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ef553aa50000)
+        libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ef553ae90000)
+        /lib/ld-linux-aarch64.so.1 (0x0000ef553b0d1000)
+        libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000ef553ae60000)
+```
+
+
@@ -0,0 +1,161 @@
+---
+title: Moving from x86 to AArch64
+weight: 5
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Example Porting Application that uses Intel Vector Statistics Library
+
+OpenRNG is an open-source Random Number Generator (RNG) library, initially released with Arm Performance Libraries 24.04, designed to improve performance when porting applications to Arm. It serves as a drop-in replacement for Intel's Vector Statistics Library (VSL). OpenRNG supports various RNG types, including pseudorandom, quasirandom, and nondeterministic generators, and offers tools for efficient multithreading and converting random sequences into common probability distributions. A vector of random numbers is a sequence of numbers that appear random and are used in various applications, such as simulating unpredictable natural processes, modeling financial markets, and creating unpredictable AI behaviors in gaming.
+
+
+## Run on an x86 Instance
+
+To demonstrate porting, we start with an application running on an x86_64 AWS `t3.2xlarge` instance with 32GB of storage. Refer to the [Getting started with Servers and Cloud computing](https://learn.arm.com/learning-paths/servers-and-cloud-computing/intro/) guide and select an x86 instance type.
+
+Install the oneAPI toolkit using [Intel's instructions](https://www.intel.com/content/www/us/en/docs/oneapi/installation-guide-linux/2023-0/apt.html#GUID-560A487B-1B5B-4406-BB93-22BC7B526BCD).
+
+The following source code uses a classic algorithm to estimate pi. Copy and paste the source code below into a file named `pi_x86.c`.
+
+```c
+/*
+ * SPDX-FileCopyrightText: <text>Copyright 2024 Arm Limited and/or its
+ * affiliates <open-source-office@arm.com></text>
+ *
+ * SPDX-License-Identifier: MIT OR Apache-2.0 WITH LLVM-exception
+ */
+
+#include <mkl.h> // Using Vector Statistics Library
+#include <stdio.h>
+#include <stdlib.h>
+
+void assert_message(int condition, const char *message) {
+  if (!condition) {
+    printf("Error: %s\n", message);
+    exit(EXIT_FAILURE);
+  }
+}
+
+int main() {
+
+  const size_t nIterations = 1000 * 1000;
+  const size_t nRandomNumbers = 2 * nIterations;
+
+  //
+  // Declare and initialise the stream.
+  //
+  // In this example, we've selected the PHILOX4X32X10 generator and seeded it
+  // with 42. We can then check that the method executed successfully by checking
+  // the return value for VSL_ERROR_OK. Most methods return VSL_ERROR_OK on
+  // success.
+  //
+  VSLStreamStatePtr stream;
+  int errcode = vslNewStream(&stream, VSL_BRNG_PHILOX4X32X10, 42);
+  assert_message(errcode == VSL_ERROR_OK, "vslNewStream failed");
+
+  //
+  // Allocate a buffer for storing random numbers.
+  //
+  float *randomNumbers = malloc(nRandomNumbers * sizeof(float));
+  assert_message(randomNumbers != NULL, "malloc failed");
+
+  //
+  // Generate a uniform distribution between 0 and 1.
+  //
+  // First, we select the method used to generate the uniform distribution; in
+  // this example, we use the standard method. We pass in a pointer to an
+  // initialised stream, the amount of random numbers we want, followed by a
+  // pointer to a buffer big enough to hold all the random numbers requested.
+  // Finally, we pass in parameters specific to the distribution, in this case,
+  // 0 and 1, meaning we want the range [0, 1).
+  //
+  errcode = vsRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, nRandomNumbers,
+                         randomNumbers, 0, 1);
+  assert_message(errcode == VSL_ERROR_OK, "vsRngUniform failed");
+
+  //
+  // Use the random numbers.
+  //
+  // This is a classic algorithm used for estimating the value of pi. We imagine
+  // a unit square overlapping a quarter of a circle with unit radius. We then
+  // treat pairs of successive random numbers as points on the unit square. We
+  // can check if the point is inside the quarter circle by measuring the
+  // distance between the point and the centre of the circle; if the distance is
+  // less than 1, the point is inside the circle. The proportion of points
+  // inside the circle should be
+  //
+  //  (area of quarter circle) / (area of square) := pi / 4.
+  //
+  // so
+  //
+  //  pi = 4 * (proportion of points inside circle)
+  //
+  int count = 0;
+  for (size_t i = 0; i < nIterations; i++) {
+    float x = randomNumbers[2 * i + 0];
+    float y = randomNumbers[2 * i + 1];
+
+    if (x * x + y * y < 1) {
+      count++;
+    }
+  }
+  float estimateOfPi = 4.0f * count / nIterations;
+
+  printf("Estimate of pi:        %f\n", estimateOfPi);
+  printf("Number of iterations:  %zu\n", nIterations);
+
+  //
+  // The buffer passed into vsRngUniform is still owned by the user.
+  //
+  free(randomNumbers);
+
+  //
+  // Release any resources held by the stream.
+  //
+  errcode = vslDeleteStream(&stream);
+  assert_message(errcode == VSL_ERROR_OK, "vslDeleteStream failed");
+
+  return EXIT_SUCCESS;
+}
+
+```
+
+
+Compile the source code by running the following commands. Note that you may need to change the oneAPI version from 2025.0 to the version installed on your system.
+
+```bash
+export LD_LIBRARY_PATH=/opt/intel/oneapi/2025.0/lib:$LD_LIBRARY_PATH
+gcc -o pi_x86 pi_x86.c -lmkl_rt -I/opt/intel/oneapi/2025.0/include -L/opt/intel/oneapi/2025.0/lib
+```
+
+Using the `ldd` command to print the shared objects, we can see that we are linking to `libmkl_rt`.
+
+```output
+ldd ./pi_x86
+        linux-vdso.so.1 (0x00007fff9ddc7000)
+        libmkl_rt.so.2 => /opt/intel/oneapi/2025.0/lib/libmkl_rt.so.2 (0x0000748c46400000)
+        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000748c46000000)
+        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x0000748c4711a000)
+        /lib64/ld-linux-x86-64.so.2 (0x0000748c4712c000)
+```
+
+## Porting to use OpenRNG
+
+OpenRNG is, in most cases, a drop-in replacement for the Vector Statistics Library. Refer to the reference guide for full information on which functions are supported. To run this source code on Arm, we only need to change the header file that is included.
+
+```c
+// from
+#include <mkl.h>
+// to
+#include <openrng.h>
+```
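
If you want to keep a single source file that builds on both platforms, one option is to select the header with the preprocessor. The sketch below assumes GCC or Clang, which predefine `__aarch64__` when targeting 64-bit Arm. The macro name `RNG_HEADER` and the function `rng_header_name` are made up for illustration.

```cpp
// Select the RNG header by target architecture.
#if defined(__aarch64__)
#define RNG_HEADER "openrng.h"
#else
#define RNG_HEADER "mkl.h"
#endif

// In the real source the next line would be:  #include RNG_HEADER
// It is left out here so that the sketch compiles without either library.

// Report which header this build would include.
constexpr const char* rng_header_name() { return RNG_HEADER; }
```

The rest of the file stays unchanged because the VSL function names used in the example are the same in both headers.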
+
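+Save the modified source as `pi.c`, then compile and link it against Arm Performance Libraries:
+
+```bash
+gcc -c -mcpu=native -I/opt/arm/armpl_24.10_gcc/include -std=c99 pi.c -o pi.o
+gcc -mcpu=native pi.o -L/opt/arm/armpl_24.10_gcc/lib -larmpl -lamath -lm -o pi.exe
+```
+
+Running the resulting program prints the estimate:
+
+```output
+Estimate of pi:        3.142112
+Number of iterations:  1000000
+```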
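
To sanity-check the ported program's output without either library, you can run the same quarter-circle estimate with the C++ standard library. This is a portable baseline that uses `std::mt19937`, not OpenRNG or VSL, so the digits will differ from those above. The function name `estimate_pi` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <random>

// Portable baseline: estimate pi by counting random points in the unit
// square that fall inside the quarter circle of radius 1.
double estimate_pi(std::size_t iterations, unsigned seed) {
    assert(iterations > 0);
    std::mt19937 engine(seed);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    std::size_t inside = 0;
    for (std::size_t i = 0; i < iterations; ++i) {
        const double x = uniform(engine);
        const double y = uniform(engine);
        if (x * x + y * y < 1.0) {
            ++inside;
        }
    }
    return 4.0 * static_cast<double>(inside) / static_cast<double>(iterations);
}
```

With one million iterations the statistical error is on the order of 0.002, so both this baseline and the OpenRNG build should land close to 3.14.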