diff --git a/.wordlist.txt b/.wordlist.txt index 9606eb651b..afdc6f20b0 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -4949,4 +4949,15 @@ uop walkthrough warmups xo -yi \ No newline at end of file +yi +AMX +AlexNet +FMAC +MySql +MyStrongPassword +RDBMS +SqueezeNet +TIdentify +goroutines +mysqlslap +squeezenet \ No newline at end of file diff --git a/content/install-guides/container.md b/content/install-guides/container.md index 22f3be8cb2..96f64095da 100644 --- a/content/install-guides/container.md +++ b/content/install-guides/container.md @@ -46,7 +46,7 @@ sw_vers -productVersion Example output: ```output -15.5 +15.6.1 ``` You must be running macOS 15.0 or later to use the Container CLI. @@ -60,13 +60,13 @@ Go to the [GitHub Releases page](https://github.com/apple/container/releases) an For example: ```bash -wget https://github.com/apple/container/releases/download/0.2.0/container-0.2.0-installer-signed.pkg +wget https://github.com/apple/container/releases/download/0.4.1/container-0.4.1-installer-signed.pkg ``` Install the package: ```bash -sudo installer -pkg container-0.2.0-installer-signed.pkg -target / +sudo installer -pkg container-0.4.1-installer-signed.pkg -target / ``` This installs the Container binary at `/usr/local/bin/container`. 
@@ -90,7 +90,7 @@ container --version Example output: ```output -container CLI version 0.2.0 +container CLI version 0.4.1 (build: release, commit: 4ac18b5) ``` ## Build and run a container diff --git a/content/learning-paths/cross-platform/floating-point-behavior/_index.md b/content/learning-paths/cross-platform/floating-point-behavior/_index.md index 0d9f1dad10..0c2964e03a 100644 --- a/content/learning-paths/cross-platform/floating-point-behavior/_index.md +++ b/content/learning-paths/cross-platform/floating-point-behavior/_index.md @@ -3,22 +3,23 @@ title: Understand floating-point behavior across x86 and Arm architectures minutes_to_complete: 30 -who_is_this_for: This is an introductory topic for developers who are porting applications from x86 to Arm and want to understand floating-point behavior across these architectures. Both architectures provide reliable and consistent floating-point computation following the IEEE 754 standard. +who_is_this_for: This is a topic for developers who are porting applications from x86 to Arm and want to understand floating-point behavior across these architectures. Both architectures provide reliable and consistent floating-point computation following the IEEE 754 standard. learning_objectives: - Understand that Arm and x86 produce identical results for all well-defined floating-point operations. - Recognize that differences only occur in special undefined cases permitted by IEEE 754. - - Learn best practices for writing portable floating-point code across architectures. - - Apply appropriate precision levels for portable results. + - Learn to recognize floating-point differences and make your code portable across architectures. prerequisites: - Access to an x86 and an Arm Linux machine. - Familiarity with floating-point numbers. 
-author: Kieran Hejmadi +author: + - Kieran Hejmadi + - Jason Andrews ### Tags -skilllevels: Introductory +skilllevels: Advanced subjects: Performance and Architecture armips: - Cortex-A diff --git a/content/learning-paths/cross-platform/floating-point-behavior/how-to-3.md b/content/learning-paths/cross-platform/floating-point-behavior/how-to-3.md index 437fbfdd16..673a67a318 100644 --- a/content/learning-paths/cross-platform/floating-point-behavior/how-to-3.md +++ b/content/learning-paths/cross-platform/floating-point-behavior/how-to-3.md @@ -1,26 +1,26 @@ --- -title: Single and double precision considerations +title: Precision and floating-point instruction considerations weight: 4 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Understanding numerical precision differences in single vs double precision +When moving from x86 to Arm, you may see differences in floating-point behavior. Understanding these differences may require digging deeper into the details, including the precision used and the floating-point instructions generated. -This section explores how different levels of floating-point precision can affect numerical results. The differences shown here are not architecture-specific issues, but demonstrate the importance of choosing appropriate precision levels for numerical computations. +This section explores an example that shows minor differences in floating-point results, focusing on Fused Multiply-Add (FMAC) operations. You can run the example to learn how the same C code can produce different results on different platforms. -### Single precision limitations +## Single precision and FMAC differences -Consider two mathematically equivalent functions, `f1()` and `f2()`. While they should theoretically produce the same result, small differences can arise due to the limited precision of floating-point arithmetic. +Consider two mathematically equivalent functions, `f1()` and `f2()`.
While they should theoretically produce the same result, small differences can arise due to the limited precision of floating-point arithmetic and the instructions used. -The differences shown in this example are due to using single precision (float) arithmetic, not due to architectural differences between Arm and x86. Both architectures handle single precision arithmetic according to IEEE 754. +When these small differences are amplified, you can observe how Arm and x86 architectures handle floating-point operations differently, particularly with respect to FMAC (Fused Multiply-Add) operations. The example shows the Clang compiler on Arm using FMAC instructions by default, which can lead to slightly different results compared to x86, where FMAC instructions are not used by default. Functions `f1()` and `f2()` are mathematically equivalent. You would expect them to return the same value given the same input. -Use an editor to copy and paste the C++ code below into a file named `single-precision.cpp` +Use an editor to copy and paste the C code below into a file named `example.c`. -```cpp +```c #include <stdio.h> #include <math.h> @@ -42,74 +42,109 @@ int main() { // Theoretically, result1 and result2 should be the same float difference = result1 - result2; - // Multiply by a large number to amplify the error + + // Multiply by a large number to amplify the error - using single precision (float) + // This is where architecture differences occur due to FMAC instructions float final_result = 100000000.0f * difference + 0.0001f; + + // Using double precision for the calculation makes results consistent across platforms + double final_result_double = 100000000.0 * difference + 0.0001; // Print the results printf("f1(%e) = %.10f\n", x, result1); printf("f2(%e) = %.10f\n", x, result2); printf("Difference (f1 - f2) = %.10e\n", difference); - printf("Final result after magnification: %.10f\n", final_result); + printf("Final result after magnification (float): %.10f\n", final_result); + printf("Final result
after magnification (double): %.10f\n", final_result_double); return 0; } ``` +You need access to an Arm and x86 Linux computer to compare the results. The output below is from Ubuntu 24.04 using Clang. The Clang version is 18.1.3. + Compile and run the code on both x86 and Arm with the following command: ```bash -g++ -g single-precision.cpp -o single-precision -./single-precision +clang -g example.c -o example -lm +./example ``` -Output running on x86: +The output running on x86: ```output f1(1.000000e-08) = 0.0000000000 f2(1.000000e-08) = 0.0000000050 Difference (f1 - f2) = -4.9999999696e-09 -Final result after magnification: -0.4999000132 +Final result after magnification (float): -0.4999000132 +Final result after magnification (double): -0.4998999970 ``` -Output running on Arm: +The output running on Arm: ```output f1(1.000000e-08) = 0.0000000000 f2(1.000000e-08) = 0.0000000050 Difference (f1 - f2) = -4.9999999696e-09 -Final result after magnification: -0.4998999834 +Final result after magnification (float): -0.4998999834 +Final result after magnification (double): -0.4998999970 ``` -Depending on your compiler and library versions, you may get the same output on both systems. You can also use the `clang` compiler and see if the output matches. +Notice that the double precision results are identical across platforms, while the single precision results differ. + +You can disable the fused multiply-add on Arm with a compiler flag: + +```bash +clang -g -ffp-contract=off example.c -o example2 -lm +./example2 +``` + +Now the output of `example2` on Arm matches the x86 output. + +You can use `objdump` to look at the assembly instructions to confirm the use of FMAC instructions. + +Page through the `objdump` output to find the difference shown below in the `main()` function. ```bash -clang -g single-precision.cpp -o single-precision -lm -./single-precision +llvm-objdump -d ./example | more ``` -In some cases the GNU compiler output differs from the Clang output. 
+The Arm output includes `fmadd`: + +```output +8c8: 1f010800 fmadd s0, s0, s1, s2 +``` -Here's what's happening: +The x86 uses separate multiply and add instructions: + +```output +125c: f2 0f 59 c1 mulsd %xmm1, %xmm0 +1260: f2 0f 10 0d b8 0d 00 00 movsd 0xdb8(%rip), %xmm1 # 0x2020 <_IO_stdin_used+0x20> +1268: f2 0f 58 c1 addsd %xmm1, %xmm0 +``` -1. Different square root algorithms: x86 and Arm use different hardware and library implementations for `sqrtf(1 + 1e-8)` +{{% notice Note %}} +On Ubuntu 24.04 the GNU Compiler, `gcc`, produces the same result as x86 and does not use the `fmadd` instruction. Be aware that corner case examples like this may change in future compiler versions. +{{% /notice %}} -2. Tiny implementation differences get amplified. The difference between the two `sqrtf()` results is only about 3e-10, but this gets multiplied by 100,000,000, making it visible in the final result. +## Techniques for consistent results -3. Both `f1()` and `f2()` use `sqrtf()`. Even though `f2()` is more numerically stable, both functions call `sqrtf()` with the same input, so they both inherit the same architecture-specific square root result. +You can make the results consistent across platforms in several ways: -4. Compiler and library versions may produce different output due to different implementations of library functions such as `sqrtf()`. +- Use double precision for critical calculations by changing `100000000.0f` to `100000000.0` (double precision). -The final result is that x86 and Arm libraries compute `sqrtf(1.00000001)` with tiny differences in the least significant bits. This is normal and expected behavior and IEEE 754 allows for implementation variations in transcendental functions like square root, as long as they stay within specified error bounds. +- Disable fused multiply-add operations using the `-ffp-contract=off` compiler flag. -The very small difference you see is within acceptable floating-point precision limits. 
+- Use the compiler flag `-ffp-contract=fast` to enable fused multiply-add on x86. -### Key takeaways +## Key takeaways -- The small differences shown are due to library implementations in single-precision mode, not fundamental architectural differences. -- Single-precision arithmetic has inherent limitations that can cause small numerical differences. -- Using numerically stable algorithms, like `f2()`, can minimize error propagation. -- Understanding [numerical stability](https://en.wikipedia.org/wiki/Numerical_stability) is important for writing portable code. +- Different floating-point behavior between architectures can often be traced to specific hardware features or instructions such as Fused Multiply-Add (FMAC) operations. +- FMAC performs multiplication and addition with a single rounding step, which can lead to different results compared to separate multiply and add operations. +- Compilers may use FMAC instructions on Arm by default, but not on x86. +- To ensure consistent results across platforms, consider using double precision for critical calculations and controlling compiler optimizations with flags like `-ffp-contract=off` and `-ffp-contract=fast`. +- Understanding [numerical stability](https://en.wikipedia.org/wiki/Numerical_stability) remains important for writing portable code. -By adopting best practices and appropriate precision levels, developers can ensure consistent results across platforms. +If you see differences in floating-point results, it typically means you need to look a little deeper to find the causes. -Continue to the next section to see how precision impacts the results. +These situations are not common, but it is good to be aware of them as a software developer migrating to the Arm architecture. You can be confident that floating-point on Arm behaves predictably and that you can get consistent results across multiple architectures. 
diff --git a/content/learning-paths/cross-platform/floating-point-behavior/how-to-4.md b/content/learning-paths/cross-platform/floating-point-behavior/how-to-4.md deleted file mode 100644 index 0bbd869072..0000000000 --- a/content/learning-paths/cross-platform/floating-point-behavior/how-to-4.md +++ /dev/null @@ -1,74 +0,0 @@ ---- -title: Minimize floating-point variability across platforms -weight: 5 - -### FIXED, DO NOT MODIFY -layout: learningpathall ---- - -## How can I ensure consistent floating-point results across x86 and Arm? - -The most effective way to ensure consistent floating-point results across platforms is to use double precision arithmetic. Both Arm and x86 produce identical results when using double precision for the same operations. - -### Double precision floating-point eliminates differences - -The example below demonstrates how using double precision eliminates the small differences observed in the previous single-precision example. Switching from `float` to `double` ensures identical results on both architectures. 
- -Use an editor to copy and paste the C++ file below into a file named `double-precision.cpp` - -```cpp -#include <stdio.h> -#include <math.h> - -// Function 1: Computes sqrt(1 + x) - 1 using the naive approach -double f1(double x) { - return sqrtf(1 + x) - 1; -} - -// Function 2: Computes the same value using an algebraically equivalent transformation -// This version is numerically more stable -double f2(double x) { - return x / (sqrtf(1 + x) + 1); -} - -int main() { - double x = 1e-8; - double result1 = f1(x); - double result2 = f2(x); - - // Theoretically, result1 and result2 should be the same - double difference = result1 - result2; - // Multiply by a large number to amplify the error - double final_result = 100000000.0f * difference + 0.0001f; - - // Print the results - printf("f1(%e) = %.10f\n", x, result1); - printf("f2(%e) = %.10f\n", x, result2); - printf("Difference (f1 - f2) = %.10e\n", difference); - printf("Final result after magnification: %.10f\n", final_result); - - return 0; -} -``` - -Compile on both computers: - -```bash -g++ -o double-precision double-precision.cpp -./double-precision -``` - -Running the new binary on both systems shows that both functions produce identical results. - -Here is the output on both systems: - -```output -f1(1.000000e-08) = 0.0000000050 -f2(1.000000e-08) = 0.0000000050 -Difference (f1 - f2) = -1.7887354748e-17 -Final result after magnification: 0.0000999982 -``` - -By choosing appropriate precision levels, you can write code that remains consistent and reliable across architectures. Precision, however, involves a trade-off: single precision reduces memory use and often improves performance, while double precision is essential for applications demanding higher accuracy and greater numerical stability, particularly to control rounding errors. - -For the vast majority of floating-point application code, you will not notice any differences between x86 and Arm architectures.
However, in rare cases where differences do occur, they are usually due to undefined behaviors or non-portable code. These differences should not be a cause for concern, but rather an opportunity to improve the code for better portability and consistency across platforms. By addressing these issues, you can ensure that your floating-point code runs reliably and produces identical results on both x86 and Arm systems. diff --git a/content/learning-paths/cross-platform/vectorization-comparison/_index.md b/content/learning-paths/cross-platform/vectorization-comparison/_index.md index a925bb0166..b67b5168fb 100644 --- a/content/learning-paths/cross-platform/vectorization-comparison/_index.md +++ b/content/learning-paths/cross-platform/vectorization-comparison/_index.md @@ -1,10 +1,6 @@ --- title: "Migrate x86-64 SIMD to Arm64" -draft: true -cascade: - draft: true - minutes_to_complete: 30 who_is_this_for: This is an advanced topic for developers migrating vectorized (SIMD) code from x86-64 to Arm64. diff --git a/content/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/1-overview.md b/content/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/1-overview.md index 68b09f11c7..2fa725c9fe 100644 --- a/content/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/1-overview.md +++ b/content/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/1-overview.md @@ -6,32 +6,34 @@ weight: 2 layout: learningpathall --- -## TinyML +## Overview This Learning Path is about TinyML. It is a starting point for learning how innovative AI technologies can be used on even the smallest of devices, making Edge AI more accessible and efficient. You will learn how to set up your host machine to facilitate compilation and ensure smooth integration across devices. This section provides an overview of the domain with real-life use cases and available devices. +## What is TinyML? 
+ TinyML represents a significant shift in Machine Learning deployment. Unlike traditional Machine Learning, which typically depends on cloud-based servers or high-performance hardware, TinyML is tailored to function on devices with limited resources, constrained memory, low power, and fewer processing capabilities. TinyML has gained popularity because it enables AI applications to operate in real-time, directly on the device, with minimal latency, enhanced privacy, and the ability to work offline. This shift opens up new possibilities for creating smarter and more efficient embedded systems. -### Benefits and applications +## Benefits and applications The benefits of TinyML align well with the Arm architecture, which is widely used in IoT, mobile devices, and edge AI deployments. Here are some of the key benefits of TinyML on Arm: -- **Power Efficiency**: TinyML models are designed to be extremely power-efficient, making them ideal for battery-operated devices like sensors, wearables, and drones. +- Power efficiency: TinyML models are designed to be extremely power-efficient, making them ideal for battery-operated devices like sensors, wearables, and drones. -- **Low Latency**: AI processing happens on-device, so there is no need to send data to the cloud, which reduces latency and enables real-time decision-making. +- Low latency: AI processing happens on-device, so there is no need to send data to the cloud, which reduces latency and enables real-time decision-making. -- **Data Privacy**: With on-device computation, sensitive data remains local, providing enhanced privacy and security. This is a priority in healthcare and personal devices. +- Data privacy: with on-device computation, sensitive data remains local, providing enhanced privacy and security. This is a priority in healthcare and personal devices. 
-- **Cost-Effective**: Arm devices, which are cost-effective and scalable, can now handle sophisticated Machine Learning tasks, reducing the need for expensive hardware or cloud services. +- Cost-effective: Arm devices are affordable and scalable, and can now handle sophisticated machine learning tasks, reducing the need for expensive hardware or cloud services. -- **Scalability**: With billions of Arm devices in the market, TinyML is well-suited for scaling across industries, enabling widespread adoption of AI at the edge. +- Scalability: with billions of Arm devices in the market, TinyML is well-suited for scaling across industries, enabling widespread adoption of AI at the edge. TinyML is being deployed across multiple industries, enhancing everyday experiences and enabling groundbreaking solutions. The table below shows some examples of TinyML applications. diff --git a/content/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/2-env-setup.md b/content/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/2-env-setup.md index 6c1ff55547..55b214dac7 100644 --- a/content/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/2-env-setup.md +++ b/content/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/2-env-setup.md @@ -7,9 +7,6 @@ weight: 3 # Do not modify these elements layout: "learningpathall" --- - -In this section, you will prepare a development environment to compile a machine learning model. - ## Introduction to ExecuTorch ExecuTorch is a lightweight runtime designed for efficient execution of PyTorch models on resource-constrained devices. It enables machine learning inference on embedded and edge platforms, making it well-suited for Arm-based hardware. Since Arm processors are widely used in mobile, IoT, and embedded applications, ExecuTorch leverages Arm's efficient CPU architectures to deliver optimized performance while maintaining low power consumption.
By integrating with Arm's compute libraries, it ensures smooth execution of AI workloads on Arm-powered devices, from Cortex-M microcontrollers to Cortex-A application processors. @@ -18,7 +15,7 @@ ExecuTorch is a lightweight runtime designed for efficient execution of PyTorch These instructions have been tested on Ubuntu 22.04, 24.04, and on Windows Subsystem for Linux (WSL). -Python3 is required and comes installed with Ubuntu, but some additional packages are needed: +Python 3 is required and comes installed with Ubuntu, but some additional packages are needed: ```bash sudo apt update @@ -36,7 +33,7 @@ source $HOME/executorch-venv/bin/activate The prompt of your terminal now has `(executorch)` as a prefix to indicate the virtual environment is active. -## Install Executorch +## Install ExecuTorch From within the Python virtual environment, run the commands below to download the ExecuTorch repository and install the required packages: @@ -74,6 +71,6 @@ pip list | grep executorch executorch 1.1.0a0+1883128 ``` -## Next Steps +## Next steps -Proceed to the next section to learn about and set up the virtualized hardware. +Proceed to the next section to set up the virtualized hardware. diff --git a/content/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/3-env-setup-fvp.md b/content/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/3-env-setup-fvp.md index c554c0a575..5a601ffb89 100644 --- a/content/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/3-env-setup-fvp.md +++ b/content/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/3-env-setup-fvp.md @@ -8,22 +8,24 @@ weight: 5 # 1 is first, 2 is second, etc. layout: "learningpathall" --- -In this section, you will run scripts to set up the Corstone-320 reference package. +## Overview -The Corstone-320 Fixed Virtual Platform (FVP) is a pre-silicon software development environment for Arm-based microcontrollers. 
It provides a virtual representation of hardware, allowing developers to test and optimize software before actual hardware is available. Designed for AI and machine learning workloads, it includes support for Arm's Ethos-U NPU and Cortex-M processors, making it ideal for embedded AI applications. The FVP accelerates development by enabling early software validation and performance tuning in a flexible, simulation-based environment. +In this section, you run scripts to set up the Corstone-320 reference package. + +The Corstone-320 Fixed Virtual Platform (FVP) is a pre-silicon software development environment for Arm-based microcontrollers. It provides a virtual representation of hardware so you can test and optimize software before boards are available. Designed for AI and machine learning workloads, it includes support for Arm Ethos-U NPUs and Cortex-M processors, which makes it well-suited to embedded AI applications. The FVP accelerates development by enabling early software validation and performance tuning in a flexible, simulation-based environment. The Corstone reference system is provided free of charge, although you will have to accept the license in the next step. For more information on Corstone-320, check out the [official documentation](https://developer.arm.com/documentation/109761/0000?lang=en). -## Corstone-320 FVP Setup for ExecuTorch +## Set up Corstone-320 FVP for ExecuTorch -Run the FVP setup script in the ExecuTorch repository. +Run the FVP setup script in the ExecuTorch repository: ```bash cd $HOME/executorch ./examples/arm/setup.sh --i-agree-to-the-contained-eula ``` -After the script has finished running, it prints a command to run to finalize the installation. This step adds the FVP executables to your system path. 
+When the script completes, it prints a command to finalize the installation by adding the FVP executables to your `PATH`: ```bash source $HOME/executorch/examples/arm/ethos-u-scratch/setup_path.sh diff --git a/content/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/4-build-model.md b/content/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/4-build-model.md index 08d2e97004..597fc6e65b 100644 --- a/content/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/4-build-model.md +++ b/content/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/4-build-model.md @@ -36,7 +36,7 @@ class SimpleNN(torch.nn.Module): return out # Create the model instance -input_size = 10 # example input features size +input_size = 10 # example input feature size hidden_size = 5 # hidden layer size output_size = 2 # number of output classes @@ -52,7 +52,7 @@ ModelInputs = x print("Model successfully exported to simple_nn.pte") ``` -## Running the model on the Corstone-320 FVP +## Run the model on the Corstone-320 FVP The final step is to take the Python-defined model and run it on the Corstone-320 FVP. This was done upon running the `run.sh` script in a previous section. To wrap up the Learning Path, you will perform these steps separately to better understand what happened under the hood. Start by setting some environment variables that are used by ExecuTorch. @@ -61,7 +61,7 @@ export ET_HOME=$HOME/executorch export executorch_DIR=$ET_HOME/build ``` -Then, generate a model file on the `.pte` format using the Arm examples. The Ahead-of-Time (AoT) Arm compiler will enable optimizations for devices like the Grove Vision AI Module V2 and the Corstone-320 FVP. Run it from the ExecuTorch root directory. +Generate a model in ExecuTorch `.pte` format using the Arm examples. The AoT Arm compiler enables optimizations for devices such as the Grove Vision AI Module V2 and the Corstone-320 FVP. 
Run the compiler from the ExecuTorch root directory: ```bash cd $ET_HOME @@ -90,7 +90,7 @@ cmake --build $ET_HOME/examples/arm/executor_runner/cmake-out --parallel -- arm_ ``` -Now run the model on the Corstone-320 with the following command: +Run the model on Corstone-320: ```bash FVP_Corstone_SSE-320 \ @@ -104,9 +104,7 @@ FVP_Corstone_SSE-320 \ ``` {{% notice Note %}} - -The argument `mps4_board.visualisation.disable-visualisation=1` disables the FVP GUI. This can speed up launch time for the FVP. - +The argument `mps4_board.visualisation.disable-visualisation=1` disables the FVP GUI and can speed up launch time. {{% /notice %}} Observe that the FVP loads the model file. @@ -119,4 +117,4 @@ I [executorch:arm_executor_runner.cpp:412] Model in 0x70000000 $ I [executorch:arm_executor_runner.cpp:414] Model PTE file loaded. Size: 3360 bytes. ``` -You have now set up your environment for TinyML development on Arm, and tested a small PyTorch and ExecuTorch Neural Network. In the next Learning Path of this series, you will learn about optimizing neural networks to run on Arm. +You have now set up your environment for TinyML development on Arm and tested a small PyTorch model with ExecuTorch on the Corstone-320 FVP. In the next Learning Path, you learn how to optimize neural networks to run efficiently on Arm. diff --git a/content/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/_index.md b/content/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/_index.md index e67b323d70..461505fa0a 100644 --- a/content/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/_index.md +++ b/content/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/_index.md @@ -6,10 +6,10 @@ minutes_to_complete: 40 who_is_this_for: This is an introductory topic for developers and data scientists new to Tiny Machine Learning (TinyML) who want to explore its potential using PyTorch and ExecuTorch.
learning_objectives: - - Describe what differentiates TinyML from other AI domains. - - Describe the benefits of deploying AI models on Arm-based edge devices. - - Identify suitable Arm-based devices for TinyML applications. - - Set up and configure a TinyML development environment using ExecuTorch and Corstone-320 Fixed Virtual Platform (FVP). + - Describe what differentiates TinyML from other AI domains + - Describe the benefits of deploying AI models on Arm-based edge devices + - Identify suitable Arm-based devices for TinyML applications + - Set up and configure a TinyML development environment using ExecuTorch and Corstone-320 Fixed Virtual Platform (FVP) prerequisites: - Basic knowledge of Machine Learning concepts diff --git a/content/learning-paths/embedded-and-microcontrollers/training-inference-pytorch/_index.md b/content/learning-paths/embedded-and-microcontrollers/training-inference-pytorch/_index.md index ad8a8ade3f..37fc058544 100644 --- a/content/learning-paths/embedded-and-microcontrollers/training-inference-pytorch/_index.md +++ b/content/learning-paths/embedded-and-microcontrollers/training-inference-pytorch/_index.md @@ -1,23 +1,21 @@ --- -title: Edge AI with PyTorch & ExecuTorch - Tiny Rock-Paper-Scissors on Arm +title: "Edge AI on Arm: PyTorch and ExecuTorch rock-paper-scissors" minutes_to_complete: 60 -who_is_this_for: This learning path is for machine learning developers interested in deploying TinyML models on Arm-based edge devices. You will learn how to train and deploy a machine learning model for the classic game "Rock-Paper-Scissors" on edge devices. You'll use PyTorch and ExecuTorch, frameworks designed for efficient on-device inference, to build and run a small-scale computer vision model. - +who_is_this_for: This is an introductory topic for machine learning developers who want to deploy TinyML models on Arm-based edge devices using PyTorch and ExecuTorch. 
learning_objectives: - - Train a small Convolutional Neural Network (CNN) for image classification using PyTorch. - - Understand how to use synthetic data generation for training a model when real-world data is limited. - - Optimize and convert a PyTorch model into an ExecuTorch program (.pte) for Arm-based devices. - - Run the trained model on a local machine to play an interactive mini-game, demonstrating model inference. - + - Train a small Convolutional Neural Network (CNN) for image classification using PyTorch + - Use synthetic data generation for training a model when real data is limited + - Convert and optimize a PyTorch model to an ExecuTorch program (`.pte`) for Arm-based devices + - Run the trained model locally as an interactive mini-game to demonstrate inference prerequisites: - - A basic understanding of machine learning concepts. - - Familiarity with Python and the PyTorch library. - - Having completed [Introduction to TinyML on Arm using PyTorch and ExecuTorch](/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm). - - An x86 Linux host machine or VM running Ubuntu 22.04 or higher. + - Basic understanding of machine learning concepts + - Familiarity with Python and the PyTorch library + - Completion of the Learning Path [Introduction to TinyML on Arm using PyTorch and ExecuTorch](/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/) + - An x86 Linux host machine or VM running Ubuntu 22.04 or later author: Dominica Abena O. Amanfo @@ -25,8 +23,8 @@ author: Dominica Abena O. 
Amanfo skilllevels: Introductory subjects: ML armips: - - Cortex-M - - Ethos-U + - Cortex-M + - Ethos-U tools_software_languages: - tinyML - Computer Vision @@ -36,23 +34,21 @@ tools_software_languages: - ExecuTorch operatingsystems: - - Linux + - Linux further_reading: - - resource: - title: Run Llama 3 on a Raspberry Pi 5 using ExecuTorch - link: /learning-paths/embedded-and-microcontrollers/rpi-llama3 - type: website - - resource: - title: ExecuTorch Examples - link: https://github.com/pytorch/executorch/blob/main/examples/README.md - type: website - - + - resource: + title: Run Llama 3 on a Raspberry Pi 5 using ExecuTorch + link: /learning-paths/embedded-and-microcontrollers/rpi-llama3 + type: website + - resource: + title: ExecuTorch examples + link: https://github.com/pytorch/executorch/blob/main/examples/README.md + type: website ### FIXED, DO NOT MODIFY # ================================================================================ weight: 1 # _index.md always has weight of 1 to order correctly layout: "learningpathall" # All files under learning paths have this same wrapper learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. 
---- \ No newline at end of file +--- diff --git a/content/learning-paths/embedded-and-microcontrollers/training-inference-pytorch/env-setup-1.md b/content/learning-paths/embedded-and-microcontrollers/training-inference-pytorch/env-setup-1.md index ac6b5e10a2..1fde02ba2f 100644 --- a/content/learning-paths/embedded-and-microcontrollers/training-inference-pytorch/env-setup-1.md +++ b/content/learning-paths/embedded-and-microcontrollers/training-inference-pytorch/env-setup-1.md @@ -1,46 +1,45 @@ --- -title: Environment Setup +title: Set up your environment weight: 2 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Overview -This learning path (LP) is a direct follow-up to the [Introduction to TinyML on Arm using PyTorch and ExecuTorch](/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm) learning path. While the previous one introduced you to the core concepts and the toolchain, this one puts that knowledge into practice with a fun, real-world example. You will move from the simple [Feedforward Neural Network](/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/4-build-model) in the previous LP, to a more practical computer vision task: A tiny Rock-Paper-Scissors game, to demonstrate how these tools can be used to solve a tangible problem and run efficiently on Arm-based edge devices. +## Set up your environment for Tiny rock-paper-scissors on Arm + +This Learning Path is a direct follow-up to [Introduction to TinyML on Arm using PyTorch and ExecuTorch](/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm). While the previous Learning Path introduced the core concepts and toolchain, this one puts that knowledge into practice with a small, real-world example. 
You move from a simple [Feedforward Neural Network](/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm/4-build-model) to a practical computer vision task: a tiny rock-paper-scissors game that runs efficiently on Arm-based edge devices.

You will train a lightweight CNN to classify images of the letters R, P, and S as "rock," "paper," or "scissors." The script uses a synthetic data renderer to create a large dataset of these images with various transformations and noise, eliminating the need for a massive real-world dataset.

### What is a Convolutional Neural Network (CNN)?

-A Convolutional Neural Network (CNN) is a type of deep neural network primarily used for analyzing visual imagery. Unlike traditional neural networks, CNNs are designed to process pixel data by using a mathematical operation called **convolution**. This allows them to automatically and adaptively learn spatial hierarchies of features from input images, from low-level features like edges and textures to high-level features like shapes and objects.
-
-![Image of a convolutional neural network architecture](image.png)
-[Image credits](https://medium.com/@atul_86537/learning-ml-from-first-principles-c-linux-the-rick-and-morty-way-convolutional-neural-c76c3df511f4).
-CNNs are the backbone of many modern computer vision applications, including:
+A convolutional neural network (CNN) is a deep neural network designed to analyze visual data using the *convolution* operation.
CNNs learn spatial hierarchies of features - from edges and textures to shapes and objects - directly from pixels. -- **Image Classification:** Identifying the main object in an image, like classifying a photo as a "cat" or "dog". -- **Object Detection:** Locating specific objects within an image and drawing a box around them. -- **Facial Recognition:** Identifying and verifying individuals based on their faces. +Common CNN applications include: -For the Rock-Paper-Scissors game, you'll use a tiny CNN to classify images of the letters R, P, and S as the corresponding hand gestures. +- Image classification: identify the main object in an image, such as classifying a photo as a cat or dog +- Object detection: locate specific objects in an image and draw bounding boxes +- Facial recognition: identify or verify individuals based on facial features +For the rock-paper-scissors game, you use a tiny CNN to classify the letters R, P, and S as the corresponding hand gestures. +## Environment setup -## Environment Setup -To get started, follow the first three chapters of the [Introduction to TinyML on Arm using PyTorch and ExecuTorch](/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm) Learning Path. This will set up your development environment and install the necessary tools. Return to this LP once you've run the `./examples/arm/run.sh` script in the ExecuTorch repository. +To get started, complete the first three sections of [Introduction to TinyML on Arm using PyTorch and ExecuTorch](/learning-paths/embedded-and-microcontrollers/introduction-to-tinyml-on-arm). This setup prepares your development environment and installs the required tools. Return here after running the `./examples/arm/run.sh` script in the ExecuTorch repository. -If you just followed the LP above, you should already have your virtual environment activated. 
If not, activate it using: +If you just completed the earlier Learning Path, your virtual environment should still be active. If not, activate it: ```console source $HOME/executorch-venv/bin/activate ``` The prompt of your terminal now has `(executorch-venv)` as a prefix to indicate the virtual environment is active. -Run the commands below to install the dependencies. +Install Python dependencies: -```bash -pip install argparse numpy pillow torch +```console +pip install numpy pillow torch ``` -You are now ready to create the model. +You’re now ready to create the model. diff --git a/content/learning-paths/embedded-and-microcontrollers/training-inference-pytorch/fine-tune-2.md b/content/learning-paths/embedded-and-microcontrollers/training-inference-pytorch/fine-tune-2.md index e9ffd439ec..e4fa83ecee 100644 --- a/content/learning-paths/embedded-and-microcontrollers/training-inference-pytorch/fine-tune-2.md +++ b/content/learning-paths/embedded-and-microcontrollers/training-inference-pytorch/fine-tune-2.md @@ -1,20 +1,20 @@ --- -title: Train and Test the Rock-Paper-Scissors Model +title: Train and Test the rock-paper-scissors Model weight: 3 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Build the Model +## Build the model -Navigate to the Arm examples directory in the ExecuTorch repository. +Navigate to the Arm examples directory in the ExecuTorch repository: ```bash cd $HOME/executorch/examples/arm ``` -Using a file editor of your choice, create a file named `rps_tiny.py`, copy and paste the code shown below: +Create a file named `rps_tiny.py` and paste the following code: ```python #!/usr/bin/env python3 @@ -369,24 +369,24 @@ if __name__ == "__main__": ``` -### About the Script +### About the script The script handles the entire workflow: data generation, model training, and a simple command-line game. 
-- **Synthetic Data Generation:** The script includes a function `render_rps()` that generates 28x28 grayscale images of the letters 'R', 'P', and 'S' with random rotations, blurs, and noise. This creates a diverse dataset that's used to train the model. -- **Model Architecture:** The model, a TinyRPS class, is a simple Convolutional Neural Network (CNN). It uses a series of 2D convolutional layers, followed by pooling layers to reduce spatial dimensions, and finally, fully connected linear layers to produce a final prediction. This architecture is efficient and well-suited for edge devices. -- **Training:** The script generates synthetic training and validation datasets. It then trains the CNN model using the **Adam optimizer** and **Cross-Entropy Loss**. It tracks validation accuracy and saves the best-performing model to `rps_best.pt`. -- **ExecuTorch Export:** A key part of the script is the `export_to_pte()` function. This function uses the `torch.export module` (or a fallback) to trace the trained PyTorch model and convert it into an ExecuTorch program (`.pte`). This compiled program is highly optimized for deployment on any target hardware, for example Cortex-M or Cortex-A CPUs for embedded devices. -- **CLI Mini-Game**: After training, you can play an interactive game. The script generates an image of your move and a random opponent's move. It then uses the trained model to classify both images and determines the winner based on the model's predictions. +- Synthetic Data Generation: the script includes a function `render_rps()` that generates 28x28 grayscale images of the letters 'R', 'P', and 'S' with random rotations, blurs, and noise. This creates a diverse dataset that's used to train the model. +- Model Architecture: the model, a TinyRPS class, is a simple Convolutional Neural Network (CNN). 
It uses a series of 2D convolutional layers, followed by pooling layers to reduce spatial dimensions, and finally, fully connected linear layers to produce a final prediction. This architecture is efficient and well-suited for edge devices.
+- Training: the script generates synthetic training and validation datasets. It then trains the CNN model using the **Adam optimizer** and **Cross-Entropy Loss**. It tracks validation accuracy and saves the best-performing model to `rps_best.pt`.
+- ExecuTorch Export: a key part of the script is the `export_to_pte()` function. This function uses the `torch.export` module (or a fallback) to trace the trained PyTorch model and convert it into an ExecuTorch program (`.pte`). This compiled program is highly optimized for deployment on any target hardware, for example Cortex-M or Cortex-A CPUs for embedded devices.
+- CLI Mini-Game: after training, you can play an interactive game. The script generates an image of your move and a random opponent's move. It then uses the trained model to classify both images and determines the winner based on the model's predictions.

-### Running the Script:
+## Run the script

-To train the model, export it, and play the game, run the following command:
+Train the model, export it, and play the game:

```bash
python rps_tiny.py --epochs 8 --export --play
```

-You'll see the training progress, where the model's accuracy rapidly improves on the synthetic data.
+You’ll see training progress similar to:

```output
== Building synthetic datasets ==
@@ -402,7 +402,8 @@ Training done. Loaded weights from rps_best.pt
[export] wrote rps_tiny.pte
```

-After training and export, the game will start. Type rock, paper, or scissors and see the model's predictions and what your opponent played.
+
+After training and export, the game starts.
Type rock, paper, or scissors, and review the model’s predictions for you and a random opponent: ```output === Rock–Paper–Scissors: Play vs Tiny CNN === diff --git a/content/learning-paths/embedded-and-microcontrollers/training-inference-pytorch/fvp-3.md b/content/learning-paths/embedded-and-microcontrollers/training-inference-pytorch/fvp-3.md index b26333edb0..53983fca76 100644 --- a/content/learning-paths/embedded-and-microcontrollers/training-inference-pytorch/fvp-3.md +++ b/content/learning-paths/embedded-and-microcontrollers/training-inference-pytorch/fvp-3.md @@ -6,13 +6,15 @@ weight: 4 layout: learningpathall --- -This section guides you through the process of compiling your trained Rock-Paper-Scissors model and running it on a simulated Arm-based edge device, the Corstone-320 Fixed Virtual Platform (FVP). This final step demonstrates the end-to-end workflow of deploying a TinyML model for on-device inference. +## Compile and run the rock-paper-scissors model on Corstone-320 FVP + +This section shows how to compile your trained rock-paper-scissors model and run it on the Corstone-320 Fixed Virtual Platform (FVP), a simulated Arm-based edge device. This completes the end-to-end workflow for deploying a TinyML model for on-device inference. ## Compile and build the executable -First, you'll use the Ahead-of-Time (AOT) Arm compiler to convert your PyTorch model into a format optimized for the Arm architecture and the Ethos-U NPU. This process, known as delegation, offloads parts of the neural network graph that are compatible with the NPU, allowing for highly efficient inference. +Use the Ahead-of-Time (AoT) Arm compiler to convert your PyTorch model to an ExecuTorch program optimized for Arm and the Ethos-U NPU. This process (delegation) offloads supported parts of the neural network to the NPU for efficient inference. 
-Set up your environment variables by running the following commands in your terminal: +Set up environment variables: ```bash export ET_HOME=$HOME/executorch @@ -34,7 +36,7 @@ You should see: PTE file saved as rps_tiny_arm_delegate_ethos-u85-128.pte ``` -Next, you'll build the **Ethos-U runner**, which is a bare-metal executable that includes the ExecuTorch runtime and your compiled model. This runner is what the FVP will execute. Navigate to the runner's directory and use CMake to configure the build. +Next, build the Ethos-U runner - a bare-metal executable that includes the ExecuTorch runtime and your compiled model. Configure the build with CMake: ```bash cd $HOME/executorch/examples/arm/executor_runner @@ -52,7 +54,7 @@ cmake -DCMAKE_BUILD_TYPE=Release \ -DSYSTEM_CONFIG=Ethos_U85_SYS_DRAM_Mid ``` -You should see output similar to this, indicating a successful configuration: +You should see configuration output similar to: ```bash -- ******************************************************* @@ -67,13 +69,13 @@ You should see output similar to this, indicating a successful configuration: -- Build files have been written to: ~/executorch/examples/arm/executor_runner/cmake-out ``` -Now, build the executable with CMake: +Build the executable: ```bash cmake --build "$ET_HOME/examples/arm/executor_runner/cmake-out" -j --target arm_executor_runner ``` -### Run the Model on the FVP +## Run the model on the FVP With the `arm_executor_runner` executable ready, you can now run it on the Corstone-320 FVP to see the model on a simulated Arm device. ```bash @@ -88,11 +90,10 @@ FVP_Corstone_SSE-320 \ ``` {{% notice Note %}} -The argument `mps4_board.visualisation.disable-visualisation=1` disables the FVP GUI. This can speed up launch time for the FVP. +`mps4_board.visualisation.disable-visualisation=1` disables the FVP GUI and can reduce launch time {{% /notice %}} - -Observe the output from the FVP. 
You'll see messages indicating that the model file has been loaded and the inference is running. This confirms that your ExecuTorch program is successfully executing on the simulated Arm hardware. +You should see logs indicating that the model file loads and inference begins: ```output telnetterminal0: Listening for serial connection on port 5000 @@ -109,9 +110,7 @@ I [executorch:EthosUBackend.cpp:116 init()] data:0x70000070 ``` {{% notice Note %}} -The inference itself may take a longer to run with a model this size - note that this is not a reflection of actual execution time. +Inference might take longer with a model of this size on the FVP; this does not reflect real device performance. {{% /notice %}} -You've now successfully built, optimized, and deployed a computer vision model on a simulated Arm-based system. This hands-on exercise demonstrates the power and practicality of TinyML and ExecuTorch for resource-constrained devices. - -In a future learning path, you can explore comparing different model performances and inference times before and after optimization. You could also analyze CPU and memory usage during inference, providing a deeper understanding of how the ExecuTorch framework optimizes your model for edge deployment. \ No newline at end of file +You have now built, optimized, and deployed a computer vision model on a simulated Arm-based system. In a future Learning Path, you can compare performance and latency before and after optimization and analyze CPU and memory usage during inference for deeper insight into ExecuTorch on edge devices. 
diff --git a/content/learning-paths/servers-and-cloud-computing/_index.md b/content/learning-paths/servers-and-cloud-computing/_index.md index c17a248304..87a21aecdc 100644 --- a/content/learning-paths/servers-and-cloud-computing/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/_index.md @@ -8,7 +8,7 @@ key_ip: maintopic: true operatingsystems_filter: - Android: 3 -- Linux: 177 +- Linux: 179 - macOS: 13 - Windows: 14 pinned_modules: @@ -20,9 +20,9 @@ pinned_modules: subjects_filter: - CI-CD: 7 - Containers and Virtualization: 32 -- Databases: 17 +- Databases: 18 - Libraries: 9 -- ML: 31 +- ML: 32 - Performance and Architecture: 72 - Storage: 1 - Web: 12 @@ -36,9 +36,9 @@ tools_software_languages_filter: - AI: 1 - Android Studio: 1 - Ansible: 2 -- Apache Bench: 1 - Apache Spark: 2 - Apache Tomcat: 2 +- ApacheBench: 1 - Arm Compiler for Linux: 1 - Arm Development Studio: 3 - Arm ISA: 1 @@ -80,7 +80,7 @@ tools_software_languages_filter: - Daytona: 1 - Demo: 3 - Django: 1 -- Docker: 22 +- Docker: 23 - Envoy: 3 - ExecuTorch: 1 - FAISS: 1 @@ -99,7 +99,6 @@ tools_software_languages_filter: - GitLab: 1 - glibc: 1 - Go: 4 -- go test -bench: 1 - Golang: 1 - Google Axion: 3 - Google Benchmark: 1 @@ -139,15 +138,14 @@ tools_software_languages_filter: - mongostat: 1 - mongotop: 1 - mpi: 1 -- MySQL: 9 +- MySQL: 10 - NEON: 7 -- Neoverse: 1 - Networking: 1 - Nexmark: 1 - NGINX: 4 - Node.js: 3 - Ollama: 1 -- ONNX Runtime: 1 +- ONNX Runtime: 2 - OpenBLAS: 1 - OpenBMC: 1 - OpenJDK 21: 2 @@ -157,7 +155,7 @@ tools_software_languages_filter: - perf: 6 - PostgreSQL: 4 - Profiling: 1 -- Python: 31 +- Python: 32 - PyTorch: 9 - QEMU: 1 - RAG: 1 @@ -170,7 +168,7 @@ tools_software_languages_filter: - Siege: 1 - snappy: 1 - Snort3: 1 -- SQL: 7 +- SQL: 8 - Streamline CLI: 1 - Streamlit: 2 - Supervisor: 1 @@ -204,6 +202,6 @@ weight: 1 cloud_service_providers_filter: - AWS: 17 - Google Cloud: 18 -- Microsoft Azure: 15 +- Microsoft Azure: 17 - Oracle: 2 --- diff --git 
a/content/learning-paths/servers-and-cloud-computing/golang-on-azure/_index.md b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/_index.md index 1fe6173552..4138b2819c 100644 --- a/content/learning-paths/servers-and-cloud-computing/golang-on-azure/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/_index.md @@ -1,22 +1,19 @@ --- -title: Deploy Golang on the Microsoft Azure Cobalt 100 processors +title: Deploy Golang on Azure Cobalt 100 on Arm -draft: true -cascade: - draft: true - -minutes_to_complete: 30 +minutes_to_complete: 30 -who_is_this_for: This is an introductory topic for software developers looking to migrate their Golang workloads from x86_64 to Arm-based platforms, specifically on the Microsoft Azure Cobalt 100 processors. +who_is_this_for: This is an introductory topic for software developers, DevOps engineers, and cloud architects looking to migrate their Golang (Go) applications from x86_64 to high-performance Arm-based Azure Cobalt 100 virtual machines for improved cost efficiency and performance. learning_objectives: - - Provision an Azure Arm64 virtual machine using Azure console, with Ubuntu Pro 24.04 LTS as the base image. - - Deploy Golang on an Arm64-based virtual machine running Ubuntu Pro 24.04 LTS. - - Perform Golang baseline testing and benchmarking on both x86_64 and Arm64 virtual machine. + - Provision an Azure Arm64 virtual machine using the Azure portal, with Ubuntu Pro 24.04 LTS as the base image + - Deploy Golang on an Arm64-based virtual machine running Ubuntu Pro 24.04 LTS + - Perform Golang baseline testing and benchmarking on both x86_64 and Arm64 virtual machines prerequisites: - - A [Microsoft Azure](https://azure.microsoft.com/) account with access to Cobalt 100 based instances (Dpsv6) - - Familiarity with the [Golang](https://go.dev/) and deployment practices on Arm64 platforms. 
+ - A [Microsoft Azure](https://azure.microsoft.com/) account with access to Azure Cobalt 100 Arm-based instances (Dpsv6-series) + - Basic familiarity with the [Go programming language](https://go.dev/) and cloud deployment practices + - Understanding of Linux command line and virtual machine management author: Pareena Verma @@ -26,13 +23,13 @@ subjects: Performance and Architecture cloud_service_providers: Microsoft Azure armips: - - Neoverse + - Neoverse tools_software_languages: - Golang operatingsystems: - - Linux + - Linux further_reading: - resource: @@ -42,7 +39,7 @@ further_reading: - resource: title: Testing and Benchmarking in Go link: https://pkg.go.dev/testing - type: Official Documentation + type: Documentation - resource: title: Using go test -bench for Benchmarking link: https://pkg.go.dev/cmd/go#hdr-Testing_flags diff --git a/content/learning-paths/servers-and-cloud-computing/golang-on-azure/background.md b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/background.md index 4ad80dac05..53fc7b4acb 100644 --- a/content/learning-paths/servers-and-cloud-computing/golang-on-azure/background.md +++ b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/background.md @@ -1,18 +1,39 @@ --- title: "Overview" - weight: 2 -layout: "learningpathall" +### FIXED, DO NOT MODIFY +layout: learningpathall --- +## Microsoft Azure Cobalt 100 Arm-based processor + +Azure Cobalt 100 is Microsoft's first-generation, custom-designed Arm-based processor built on the advanced Arm Neoverse N2 architecture. This high-performance 64-bit CPU is specifically optimized for cloud-native, scale-out Linux workloads including web servers, application servers, real-time data analytics, open-source databases, and in-memory caching solutions. 
+
+Key performance features include:
+- Clock speed: 3.4 GHz
+- Core allocation: dedicated physical core per vCPU for consistent performance
+- Architecture: Arm Neoverse N2, designed for energy efficiency
+- Target workloads: cloud-native applications and microservices
+
+To learn more, see the Microsoft blog: [Announcing the preview of new Azure virtual machines based on the Azure Cobalt 100 processor](https://techcommunity.microsoft.com/blog/azurecompute/announcing-the-preview-of-new-azure-vms-based-on-the-azure-cobalt-100-processor/4146353).
+
+## Golang (Go programming language)

-## Cobalt 100 Arm-based processor

+Golang (Go) is a modern, open-source programming language developed by Google, designed for building scalable, high-performance applications. Go emphasizes simplicity, compilation speed, and runtime efficiency, making it a strong fit for cloud-native development and Arm64 deployment.

-Azure’s Cobalt 100 is built on Microsoft's first-generation, in-house Arm-based processor: the Cobalt 100. Designed entirely by Microsoft and based on Arm’s Neoverse N2 architecture, this 64-bit CPU delivers improved performance and energy efficiency across a broad spectrum of cloud-native, scale-out Linux workloads. These include web and application servers, data analytics, open-source databases, caching systems, and more. Running at 3.4 GHz, the Cobalt 100 processor allocates a dedicated physical core for each vCPU, ensuring consistent and predictable performance.
+Key language features include:
+- Built-in concurrency with goroutines and channels
+- Strong static typing for improved code reliability
+- Comprehensive standard library for rapid development
+- Fast compilation and efficient garbage collection
+- Cross-platform compatibility including native Arm64 support

-To learn more about Cobalt 100, refer to the blog [Announcing the preview of new Azure virtual machine based on the Azure Cobalt 100 processor](https://techcommunity.microsoft.com/blog/azurecompute/announcing-the-preview-of-new-azure-vms-based-on-the-azure-cobalt-100-processor/4146353).

+Primary Go use cases:
+- Cloud-native applications and containerized workloads
+- Microservices architecture and API development
+- Systems programming and infrastructure tools
+- DevOps automation and CI/CD pipelines
+- Distributed systems and high-throughput services

-## Golang
-Golang (or Go) is an open-source programming language developed by Google, designed for simplicity, efficiency, and scalability. It provides built-in support for concurrency, strong typing, and a rich standard library, making it ideal for building reliable, high-performance applications.

+For more information, visit the [Go website](https://go.dev/) and see the [Go documentation](https://go.dev/doc/).

-Go is widely used for cloud-native development, microservices, system programming, DevOps tools, and distributed systems. Learn more from the [Go official website](https://go.dev/) and its [official documentation](https://go.dev/doc/).
diff --git a/content/learning-paths/servers-and-cloud-computing/golang-on-azure/baseline-testing.md b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/baseline-testing.md index 2f23ad7591..45623d22e6 100644 --- a/content/learning-paths/servers-and-cloud-computing/golang-on-azure/baseline-testing.md +++ b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/baseline-testing.md @@ -1,32 +1,31 @@ --- -title: Golang Baseline Testing +title: Perform Golang baseline testing and web server deployment on Azure Cobalt 100 weight: 5 ### FIXED, DO NOT MODIFY layout: learningpathall --- +## Baseline testing: run a Go web server on Azure Arm64 -### Baseline Testing: Running a Go Web Server on Azure Arm64 -To validate your Go toolchain and runtime environment, you can build and run a lightweight web server. This ensures that compilation, networking, and runtime execution are working correctly on your Ubuntu Pro 24.04 LTS Arm64 virtual machine running on Azure Cobalt 100. +Validate your Go development environment by building and deploying a complete web application. This baseline test confirms that compilation, networking, and runtime execution work correctly on your Ubuntu Pro 24.04 LTS Arm64 virtual machine powered by Azure Cobalt 100 processors. -1. Create the project directory - -Start by creating a new folder to hold your Go web project and navigate to it: +## Initialize Go Web Project + +Create a dedicated project directory for your Go web application: ```console mkdir goweb && cd goweb ``` -2. Create an HTML Page with Bootstrap Styling - -Next, create a simple web page that your Go server will serve. Using the nano editor (or any editor of your choice), create a file named `index.html`: +## Create an HTML page with Bootstrap styling +Next, create a simple web page that your Go server will serve. Open an editor and create `index.html`: ```console nano index.html ``` -Paste the following HTML code into the `index.html` file. 
This page uses Bootstrap for styling and includes a header, a welcome message, and a button that links to a Go-powered API endpoint.
+Add the following HTML code with Bootstrap styling and Azure Cobalt 100 branding:

```html
@@ -65,15 +64,16 @@ Paste the following HTML code into the `index.html` file. This page uses Bootstr
```

-3. Create Golang Web Server
+## Create the Go web server

Now, let’s create the Go program that will serve your static HTML page and expose a simple API endpoint.

+Open an editor and create `main.go`:

```console
nano main.go
```

-Paste the following code into the `main.go` file. This sets up a very basic web server that serves files from the current folder, including the `index.html` you just created. When it runs, it will print a message showing the server address.
+Paste the following code into the `main.go` file. This sets up a basic web server that serves files from the current folder, including the `index.html` you just created. When it runs, it will print a message showing the server address.

```go
package main
import (
@@ -105,24 +105,22 @@ func main() {
```

{{% notice Note %}}Running on port 80 requires root privileges. Use sudo with the full Go path if needed.{{% /notice %}}

-4. Run the Web Server
-
-Compile and start your Go program with:
+## Deploy and start the web server
+
+Compile and launch your Go web server on Azure Cobalt 100:

```console
sudo /usr/local/go/bin/go run main.go
```

-This command compiles the Go source code into a binary and immediately starts the server on port 80. If the server starts successfully, you will see the following message in your terminal:
-
+Expected output confirming successful startup:

```output
2025/08/19 04:35:06 Server running on http://0.0.0.0:80
```

-5. Allow HTTP Traffic in Firewall
-On Ubuntu Pro 24.04 LTS virtual machines, UFW (Uncomplicated Firewall) is used to manage firewall rules. By default, it allows only SSH (port 22), while other inbound connections are blocked.
+### Configure the Ubuntu firewall for HTTP access

-Even if you have already configured Azure Network Security Group (NSG) rules to allow inbound traffic on port 80, the VM level firewall may still block HTTP requests until explicitly opened.
+Ubuntu Pro 24.04 LTS uses UFW (Uncomplicated Firewall) to manage network access. Even with Azure Network Security Group (NSG) rules configured, the VM-level firewall requires explicit HTTP access configuration.

Run the following commands to allow HTTP traffic on port 80:

@@ -145,14 +143,27 @@ To Action From
80/tcp (v6) ALLOW Anywhere (v6)
```

-6. Open in a Browser
-To quickly get your VM’s public IP address and form the URL, run:
+{{% notice Note %}}
+If UFW is already active, `sudo ufw enable` might warn you about disrupting SSH. Proceed only if you understand the impact, or use an Azure VM serial console as a recovery option.
+{{% /notice %}}
+
+## Open the site in a browser
+
+Print your VM’s public URL:

```console
echo "http://$(curl -s ifconfig.me)/"
```

-Open this URL in your browser, and you should see the styled HTML landing page being served directly by your Go application.
-![golang](images/go-web.png)
+Open this URL in your browser. You should see the styled HTML landing page served by your Go application.
+
+![Go web server running on Azure Cobalt 100 Arm64 alt-text#center](images/go-web.png "Go web server running on Azure Cobalt 100 Arm64")
+
+## Baseline testing complete
+
+Successfully reaching this page confirms:
+- **Go toolchain** is properly installed and configured
+- **Development environment** is ready for Arm64 compilation
+- **Network connectivity** and firewall configuration are correct
+- **Runtime execution** works correctly on Azure Cobalt 100 processors

-Reaching this page in your browser confirms that Go is installed correctly, your environment is configured, and your Go web server is working end-to-end on Azure Cobalt 100 (Arm64). You can now proceed to perform further benchmarking tests.
+Your Azure Cobalt 100 virtual machine is now ready for advanced Go application development and performance benchmarking. diff --git a/content/learning-paths/servers-and-cloud-computing/golang-on-azure/benchmarking.md b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/benchmarking.md index 85329cabad..69ad8182bb 100644 --- a/content/learning-paths/servers-and-cloud-computing/golang-on-azure/benchmarking.md +++ b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/benchmarking.md @@ -1,42 +1,39 @@ --- -title: Benchmarking via go test -bench +title: Run performance tests using go test -bench weight: 6 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Run Performance Tests Using go test -bench +## Performance benchmarking with Go's built-in testing framework -`go test -bench` (the benchmarking mode of go test) is Golang’s built-in benchmarking framework that measures the performance of functions by running them repeatedly and reporting execution time (**ns/op**), memory usage, and allocations. With the `-benchmem flag`, it also shows memory usage and allocations. It’s simple, reliable, and requires only writing benchmark functions in the standard Golang testing package. +`go test -bench` is Go’s built-in benchmark runner. It repeatedly executes benchmark functions and reports latency (ns/op). With the `-benchmem` flag, it also reports memory usage (B/op) and allocations (allocs/op). It’s simple, reliable, and requires only writing benchmark functions in the standard Golang testing package. -1. Create a Project Folder - -In your terminal, create a directory for your benchmark project and navigate into it: +## Create a project folder +Create a directory for your benchmark project and navigate to it: ```console mkdir gosort-bench cd gosort-bench ``` -2. 
Initialize a Go Module - -Inside your project directory, initialize a new Go module by running: +## Initialize a Go module +Initialize a new module: ```console go mod init gosort-bench ``` -This creates a `go.mod` file, which defines the module path (gosort-bench in this case) and marks the directory as a Go project. The `go.mod` file also allows Go to manage dependencies (external libraries) automatically, ensuring your project remains reproducible and easy to maintain. - -3. Add Sorting Functions +This creates a `go.mod` file, which defines the module path (`gosort-bench` in this case) and marks the directory as a Go project. The `go.mod` file also allows Go to manage dependencies (external libraries) automatically, ensuring your project remains reproducible and easy to maintain. -Create a file called `sorting.go`: +## Add sorting functions +Create a file named `sorting.go`: ```console nano sorting.go ``` -Paste the following code in `sorting.go`: +Paste the following implementation into `sorting.go`: ```go package sorting func BubbleSort(arr []int) { @@ -75,9 +72,10 @@ func partition(arr []int, low, high int) int { return i + 1 } ``` -The code contains two sorting methods, Bubble Sort and Quick Sort, which arrange numbers in order from smallest to largest. - * Bubble Sort works by repeatedly comparing two numbers side by side and swapping them if they are in the wrong order. It keeps doing this until the whole list is sorted. - * Quick Sort is faster. It picks a pivot number and splits the list into two groups — numbers smaller than the pivot and numbers bigger than it. Then it sorts each group separately. The function partition helps Quick Sort decide where to split the list based on the pivot number. +The code contains two sorting methods, *Bubble Sort* and *Quick Sort*, which arrange numbers in order from smallest to largest: + +- *Bubble Sort* works by repeatedly comparing two numbers side by side and swapping them if they are in the wrong order. 
It keeps doing this until the whole list is sorted.
+- *Quick Sort* is faster. It picks a pivot number and splits the list into two groups: numbers smaller than the pivot and numbers bigger than it. Then it sorts each group separately. The `partition` function helps Quick Sort decide where to split the list based on the pivot number.

To summarize, Bubble Sort is simple but slow, while Quick Sort is more efficient and usually much faster for big lists of numbers.

@@ -90,16 +88,14 @@
mkdir sorting
mv sorting.go sorting/
```

-4. Add Benchmark Tests
-
-Next, create a benchmark test file named `sorting_benchmark_test.go` in your project’s root directory (not inside the sorting/ folder, so it can import the sorting package cleanly):
+### Add benchmark tests
+Create a benchmark file named `sorting_benchmark_test.go` in the project root:

```console
nano sorting_benchmark_test.go
-````
-
-Paste the following code into it:
+```
+Paste the following code:
```go
package sorting_test
import (
@@ -133,24 +129,19 @@ func BenchmarkQuickSort(b *testing.B) {
 }
}
```
-The code implements a benchmark that checks how fast Bubble Sort and Quick Sort run in Go.
-- It first creates a list of 10,000 random numbers each time before running a sort, so the test is fair and consistent.
-- The BenchmarkBubbleSort() function measures the speed of sorting using the slower Bubble Sort method.
-- The BenchmarkQuickSort() function measures the speed of sorting using the faster Quick Sort method.
+The code implements a benchmark that measures the performance of Bubble Sort and Quick Sort in Go by generating a new list of 10,000 random numbers before each run to keep the test fair and consistent. The `BenchmarkBubbleSort()` function evaluates the slower Bubble Sort algorithm, while the `BenchmarkQuickSort()` function evaluates the faster Quick Sort algorithm, so you can compare their relative speed and efficiency.
When you run the benchmark, Go will show you how long each sort takes and how much memory it uses, so you can compare the two sorting techniques.

-### Run the Benchmark
+## Run the benchmark

Execute the benchmark suite using the following command:

```console
go test -bench=. -benchmem
```
--bench=. runs every function whose name starts with Benchmark.
--benchmem adds memory metrics (B/op, allocs/op) to the report.
-
-You should see output similar to:
+`-bench=.` runs every function whose name starts with `Benchmark`. `-benchmem` adds memory metrics (B/op, allocs/op) to the report.
+
+Expected output:
```output
goos: linux
goarch: arm64
BenchmarkQuickSort-4   3506   340873 ns/op   0
PASS
ok   gosort-bench   2.905s
```
-### Metrics Explained
- * ns/op – nanoseconds per operation (lower is better). This is the primary latency metric.
- * B/op – bytes allocated per operation (lower is better). This is useful for spotting hidden allocations.
- * allocs/op – number of heap allocations per operation (lower is better). Zero here means the algorithm itself didn’t allocate.
-
-### Benchmark summary on Arm64
-Here is a summary of benchmark results collected on an Arm64 D4ps_v6 Ubuntu Pro 24.04 LTS virtual machine.
+## Metrics explained
-| Benchmark | Value on Virtual Machine |
-|-------------------|--------------------------|
-| BubbleSort (ns/op) | 36,616,759 |
-| QuickSort (ns/op) | 340,873 |
-| BubbleSort runs | 32 |
-| QuickSort runs | 3,506 |
-| Allocations/op | 0 |
-| Bytes/op | 0 |
-| Total time (s) | 2.905 |
+The metrics reported by `go test -bench` include:
+
+- ns/op: nanoseconds per operation; the primary latency metric (lower is better)
+- B/op: bytes allocated per operation; useful for spotting hidden memory allocations (lower is better)
+- allocs/op: heap allocations per operation; zero means the benchmarked code itself did not allocate (lower is better)
+ +## Benchmark summary on Arm64 + +Results collected on an Arm64 D4ps_v6 Ubuntu Pro 24.04 LTS virtual machine: -### Benchmark summary on x86_64 -Here is a summary of the benchmark results collected on x86_64 D4s_v6 Ubuntu Pro 24.04 LTS virtual machine. +| Benchmark | Value | +|----------------------|-------| +| BubbleSort (ns/op) | 36,616,759 | +| QuickSort (ns/op) | 340,873 | +| BubbleSort runs | 32 | +| QuickSort runs | 3,506 | +| Allocations/op | 0 | +| Bytes/op | 0 | +| Total time (s) | 2.905 | + +## Benchmark summary on x86-64 + +Results collected on an x86-64 D4s_v6 Ubuntu Pro 24.04 LTS virtual machine: | Benchmark | Value on Virtual Machine | |-------------------|--------------------------| @@ -193,12 +185,8 @@ Here is a summary of the benchmark results collected on x86_64 D4s_v6 Ubuntu Pro | Total time (s) | 2.716 | -### Benchmarking comparison summary - -When you compare the benchmarking results you will notice that on the Azure Cobalt 100: - -Azure Cobalt 100 (Arm64) outperforms in both BubbleSort and QuickSort benchmarks, with the advantage more pronounced for QuickSort. The performance delta (~15–33%) shows how Arm Neoverse cores deliver strong results in CPU-bound, integer-heavy workloads common in Go applications. +## Benchmarking comparison summary -For real-world Go applications that rely on sorting, JSON processing, and other recursive or data-processing workloads, running on Azure Cobalt 100 Arm64 VMs can deliver better throughput and reduced execution time compared to similarly sized x86_64 VMs. +On Azure Cobalt 100 (Arm64), both BubbleSort and QuickSort run faster, with a larger advantage for QuickSort. The observed performance delta (~15–33%) highlights how Arm Neoverse cores excel at CPU-bound, integer-heavy workloads common in Go services. -These results validate the benefits of running Go workloads on Azure Cobalt 100 Arm64 instances, and establish a baseline for extending benchmarks to real-world workloads beyond sorting. 
+For real-world Go workloads that rely on sorting, JSON processing, and other recursive or data-processing tasks, Azure Cobalt 100 Arm64 VMs can provide higher throughput and lower execution time than similarly sized x86-64 VMs. These results validate the benefits of running Go on Cobalt 100 and establish a baseline for extending benchmarks beyond simple sorting.
diff --git a/content/learning-paths/servers-and-cloud-computing/golang-on-azure/create-instance.md b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/create-instance.md
index 7ef1323d1e..75a6e05a23 100644
--- a/content/learning-paths/servers-and-cloud-computing/golang-on-azure/create-instance.md
+++ b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/create-instance.md
@@ -1,56 +1,64 @@
---
-title: Create an Arm based cloud virtual machine using Microsoft Cobalt 100 CPU
+title: Create an Azure Cobalt 100 Arm64 virtual machine for Golang deployment
weight: 3
-
### FIXED, DO NOT MODIFY
layout: learningpathall
---
-## Introduction
+## Prerequisites and setup
-There are several ways to create an Arm-based Cobalt 100 virtual machine:
+There are several common ways to create an Arm-based Cobalt 100 virtual machine, and you can choose the method that best fits your workflow or requirements:
-- The Azure console
+- The Azure Portal
- The Azure CLI
- An infrastructure as code (IaC) tool
-In this section, you will use the Azure console to create a virtual machine with the Arm-based Azure Cobalt 100 processor.
+In this section, you will use the Azure Portal to create a virtual machine with the Arm-based Azure Cobalt 100 processor.
-This learning path focuses on the general-purpose virtual machine of the D series. Please read the guide on [Dpsv6 size series](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/general-purpose/dpsv6-series) offered by Microsoft Azure.
+This Learning Path focuses on general-purpose virtual machines in the Dpsv6 series.
For more information, see the [Microsoft Azure guide for the Dpsv6 size series](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/general-purpose/dpsv6-series). While the steps to create this instance are included here for convenience, you can also refer to the [Deploy a Cobalt 100 virtual machine on Azure Learning Path](/learning-paths/servers-and-cloud-computing/cobalt/). -#### Create an Arm-based Azure Virtual Machine +## Create an Arm-based Azure virtual machine + +Creating a virtual machine on Azure Cobalt 100 follows the standard Azure VM flow, which typically involves specifying basic settings, selecting an operating system image, configuring authentication, and setting up networking and security options. + +For more information, see the [Azure VM creation documentation](https://learn.microsoft.com/en-us/azure/virtual-machines/linux/quick-create-portal). + +To create a VM using the Azure Portal, follow these steps: + +- In the Azure portal, go to **Virtual machines**. + +- Select **Create**, then choose **Virtual machine** from the drop-down. + +- On the **Basics** tab, enter **Virtual machine name** and **Region**. + +- Under **Image**, choose your OS (for example, *Ubuntu Pro 24.04 LTS*) and set **Architecture** to **Arm64**. + +- In **Size**, select **See all sizes**, choose the **Dpsv6** series, then select **D4ps_v6**. -Creating a virtual machine based on Azure Cobalt 100 is no different from creating any other virtual machine in Azure. To create an Azure virtual machine, launch the Azure portal and navigate to "Virtual Machines". -1. Select "Create", and click on "Virtual Machine" from the drop-down list. -2. Inside the "Basic" tab, fill in the Instance details such as "Virtual machine name" and "Region". -3. Choose the image for your virtual machine (for example, Ubuntu Pro 24.04 LTS) and select “Arm64” as the VM architecture. -4. In the “Size” field, click on “See all sizes” and select the D-Series v6 family of virtual machines. 
Select “D4ps_v6” from the list. +![Azure portal VM creation - Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance.png "Select the Dpsv6 series and D4ps_v6") -![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance.png "Figure 1: Select the D-Series v6 family of virtual machines") +- Under **Authentication type**, choose **SSH public key**. Azure can generate a key pair and store it for future use. For **SSH key type**, **ED25519** is recommended (RSA is also supported). -5. Select "SSH public key" as an Authentication type. Azure will automatically generate an SSH key pair for you and allow you to store it for future use. It is a fast, simple, and secure way to connect to your virtual machine. -6. Fill in the Administrator username for your VM. -7. Select "Generate new key pair", and select "RSA SSH Format" as the SSH Key Type. RSA could offer better security with keys longer than 3072 bits. Give a Key pair name to your SSH key. -8. In the "Inbound port rules", select HTTP (80) and SSH (22) as the inbound ports. +- Enter the **Administrator username**. -![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance1.png "Figure 2: Allow inbound port rules") +- If generating a new key, select **Generate new key pair**, choose **ED25519** (or **RSA**), and provide a **Key pair name**. -9. Click on the "Review + Create" tab and review the configuration for your virtual machine. It should look like the following: +- In **Inbound port rules**, select **HTTP (80)** and **SSH (22)**. -![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/ubuntu-pro.png "Figure 3: Review and Create an Azure Cobalt 100 Arm64 VM") +![Azure portal VM creation - Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance1.png "Allow inbound port rules") -10. 
Finally, when you are confident about your selection, click on the "Create" button, and click on the "Download Private key and Create Resources" button.
+- Select **Review + create** and review your configuration. It should look similar to:
-![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance4.png "Figure 4: Download Private key and Create Resources")
+![Azure portal VM creation - Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/ubuntu-pro.png "Review and create an Arm64 VM on Cobalt 100")
-11. Your virtual machine should be ready and running within no time. You can SSH into the virtual machine using the private key, along with the Public IP details.
+When you’re ready, select **Create**, then **Download private key and create resources**.
-![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/final-vm.png "Figure 5: VM deployment confirmation in Azure portal")
+![Azure portal VM creation - Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance4.png "Download private key and create resources")
-{{% notice Note %}}
+Your virtual machine should be ready and running within a few minutes. You can SSH into the virtual machine using the private key, along with the Public IP details.
-To learn more about Arm-based virtual machine in Azure, refer to “Getting Started with Microsoft Azure” in [Get started with Arm-based cloud instances](/learning-paths/servers-and-cloud-computing/csp/azure).
+![Azure portal VM creation - Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/final-vm.png "VM deployment confirmation in the Azure portal") -{{% /notice %}} +{{% notice Note %}}To learn more about Arm-based virtual machines on Azure, see the section *Getting Started with Microsoft Azure* within the Learning Path [Get started with Arm-based cloud instances](/learning-paths/servers-and-cloud-computing/csp/azure).{{% /notice %}} diff --git a/content/learning-paths/servers-and-cloud-computing/golang-on-azure/deploy.md b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/deploy.md index b416eb44ed..2d034b3f25 100644 --- a/content/learning-paths/servers-and-cloud-computing/golang-on-azure/deploy.md +++ b/content/learning-paths/servers-and-cloud-computing/golang-on-azure/deploy.md @@ -1,29 +1,31 @@ --- -title: Install Golang +title: Install and configure Golang on Azure Cobalt 100 Arm64 + weight: 4 ### FIXED, DO NOT MODIFY layout: learningpathall --- +## Install Golang on Azure Cobalt 100 -## Install Golang on Ubuntu Pro 24.04 LTS (Arm64) -This section guides you through installing the latest Go toolchain on Ubuntu Pro 24.04 LTS (Arm64), configuring the environment, and verifying the setup for benchmarking workloads on Azure Cobalt 100 VMs. +This section demonstrates how to install the Go programming language toolchain on Ubuntu Pro 24.04 LTS (Arm64), configure your development environment, and verify the setup for optimal performance on Azure Cobalt 100 virtual machines. -1. Download the Golang archive +## Download the Official Go Distribution -Use the following command to download the latest Go release for Linux Arm64 directly from the official Go distribution site: +Download the latest Arm64-optimized Go distribution directly from the official Go website. 
This ensures you get the best performance on Azure Cobalt 100 processors:
```console
wget https://go.dev/dl/go1.25.0.linux-arm64.tar.gz
```
+
{{% notice Note %}}
-There are many enhancements added to Golang version 1.18, that have resulted in up to a 20% increase in performance for Golang workloads on Arm-based servers. Please see [this blog](https://aws.amazon.com/blogs/compute/making-your-go-workloads-up-to-20-faster-with-go-1-18-and-aws-graviton/) for the details.
+Golang version 1.18 introduced many enhancements that can make Golang workloads up to 20% faster on Arm-based servers. For further information, see the AWS blog [Making your Go workloads up to 20% faster with Go 1.18 and AWS Graviton](https://aws.amazon.com/blogs/compute/making-your-go-workloads-up-to-20-faster-with-go-1-18-and-aws-graviton/). The [Arm Ecosystem Dashboard](https://developer.arm.com/ecosystem-dashboard/) also lists Golang version 1.18 as the minimum recommended version on Arm platforms.
{{% /notice %}}
-2. Extract the archive
+## Extract the archive

Unpack the downloaded archive into `/usr/local`, which is the conventional directory for installing system-wide software on Linux. This ensures the Go toolchain is available for all users and integrates cleanly with the system’s environment.
@@ -31,41 +33,43 @@
sudo tar -C /usr/local -xzf ./go1.25.0.linux-arm64.tar.gz
```
-3. Add Go to your system PATH
+## Add Go to your shell PATH
-To make the Go toolchain accessible from any directory, add its binary location to your shell’s PATH environment variable. Updating your `.bashrc` file ensures this change persists across sessions:
+To make the Go toolchain accessible from any directory, add its binary location to your shell’s `PATH` environment variable.
Updating your `.bashrc` file ensures this change persists across sessions:
```console
echo 'export PATH="$PATH:/usr/local/go/bin"' >> ~/.bashrc
```
-4. Apply the PATH changes immediately
+## Reload shell configuration
-After updating .bashrc, reload it so your current shell session picks up the new environment variables without requiring you to log out and back in:
+Apply the environment changes to your current shell session without requiring a logout/login cycle:
```console
source ~/.bashrc
```
-5. Verify Go installation
+## Verify Go installation
-Check if Go is installed correctly and confirm the version:
+Confirm that Go is properly installed and accessible:
```console
go version
```
-You should see output similar to:
-
+Expected output for Azure Cobalt 100 Arm64:
```output
go version go1.25.0 linux/arm64
```
-6. Check Go environment settings
+
+## Validate Go environment configuration
+
Use the following command to display Go’s environment variables and confirm that key paths (such as GOROOT and GOPATH) are correctly set:
+
```console
go env
```
You should see output similar to:
@@ -118,4 +122,4 @@
GOVERSION='go1.25.0'
GOWORK=''
PKG_CONFIG='pkg-config'
```
-At this point, the Go installation on Ubuntu Pro 24.04 LTS (Arm64) VM is complete. You are now ready to proceed with Go application development, benchmarking, or performance tuning on Azure Cobalt 100 VMs.
+The Go installation on Ubuntu Pro 24.04 LTS (Arm64) VM is now complete and you are ready to proceed with Go application development, benchmarking, or performance tuning on Azure Cobalt 100 VMs.
diff --git a/content/learning-paths/servers-and-cloud-computing/kafka/kafka_cluster.md b/content/learning-paths/servers-and-cloud-computing/kafka/kafka_cluster.md index 835d647a29..6e8b8f81be 100644 --- a/content/learning-paths/servers-and-cloud-computing/kafka/kafka_cluster.md +++ b/content/learning-paths/servers-and-cloud-computing/kafka/kafka_cluster.md @@ -1,6 +1,6 @@ --- # User change -title: "Setup a 3 node Kafka Cluster" +title: "Set up a 3 node Kafka Cluster" weight: 4 @@ -9,9 +9,9 @@ layout: "learningpathall" --- -## Setup 3 node Kafka Cluster: +## Set up 3 node Kafka Cluster: -In this section, you will setup a Kafka cluster on 3 Arm machines. Ensure that the [3 Node Zookeeper cluster](/learning-paths/servers-and-cloud-computing/kafka/zookeeper_cluster) is running. +In this section, you will set up a Kafka cluster on 3 Arm machines. Ensure that the [3 Node Zookeeper cluster](/learning-paths/servers-and-cloud-computing/kafka/zookeeper_cluster) is running. ### Node 1: diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/1_overview.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/1_overview.md index 790f5c66bd..9477e2b1c3 100644 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/1_overview.md +++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/1_overview.md @@ -6,26 +6,21 @@ weight: 2 layout: learningpathall --- -## Overview: Profiling LLMs on Arm CPUs with Streamline +## Profiling LLMs on Arm CPUs with Streamline -Large Language Models (LLMs) run efficiently on Arm CPUs. -Frameworks that run LLMs, such as [**llama.cpp**](https://github.com/ggml-org/llama.cpp), provides a convenient framework for running LLMs, it also comes with a certain level of complexity. +Deploying Large Language Models (LLMs) on Arm CPUs provides a power-efficient and flexible solution. 
While larger models may benefit from GPU acceleration, techniques like quantization enable a wide range of LLMs to perform effectively on CPUs alone.
+
+Frameworks such as [**llama.cpp**](https://github.com/ggml-org/llama.cpp) provide a convenient way to run LLMs, but they also come with a certain level of complexity.
To analyze their execution and use profiling insights for optimization, you need both a basic understanding of transformer architectures and the right analysis tools.
-This learning path demonstrates how to use the **llama-cli** application from llama.cpp together with **Arm Streamline** to analyze the efficiency of LLM inference on Arm CPUs.
+This Learning Path demonstrates how to use the `llama-cli` application from llama.cpp together with Arm Streamline to analyze the efficiency of LLM inference on Arm CPUs.
-In this guide you will learn how to:
-- Profile token generation at the **Prefill** and **Decode** stages
+You will learn how to:
+- Profile token generation at the Prefill and Decode stages
- Profile execution of individual tensor nodes and operators
-- Profile LLM execution across **multiple threads and cores**
+- Profile LLM execution across multiple threads and cores
-You will run the **Qwen1_5-0_5b-chat-q4_0.gguf** model with llama-cli on **Arm64 Linux** and use Streamline for analysis.
-The same method can also be applied to **Arm64 Android** platforms.
+You will run the `Qwen1_5-0_5b-chat-q4_0.gguf` model using `llama-cli` on Arm Linux and use Streamline for analysis.
-## Prerequisites
-Before starting this guide, you should be familiar with:
-- Basic understanding of llama.cpp
-- Understanding of transformer model
-- Knowledge of Streamline usage
-- An Arm Neoverse or Cortex-A hardware platform running Linux or Android to test the application
diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/2_llama.cpp_intro.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/2_llama.cpp_intro.md index addcdd28b4..70510e4cea 100644 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/2_llama.cpp_intro.md +++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/2_llama.cpp_intro.md @@ -1,57 +1,86 @@ --- -title: Understand the llama.cpp +title: Understand llama.cpp weight: 3 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Understand the llama.cpp +## Understand llama.cpp -**llama.cpp** is an open-source LLM framework implemented in C++ that supports both training and inference. -This learning path focuses only on **inference on the CPU**. +llama.cpp is an open-source LLM framework implemented in C++ that supports both training and inference. -The **llama-cli** tool provides a command-line interface to run LLMs with the llama.cpp inference engine. +This Learning Path focuses on inference on Arm CPUs. + +The `llama-cli` tool provides a command-line interface to run LLMs with the llama.cpp inference engine. It supports text generation, chat mode, and grammar-constrained output directly from the terminal. ![text#center](images/llama_structure.png "Figure 1. llama-cli Flow") -### What llama-cli does -- Load and interpret LLMs in **.gguf** format -- Build a **compute graph** based on the model structure - - The graph can be divided into subgraphs, each assigned to the most suitable backend device - - In this guide, all operators are executed on the **CPU backend** -- Allocate memory for tensor nodes using the **graph planner** -- Execute tensor nodes in the graph during the **graph_compute** stage, which traverses nodes and forwards work to backend devices +### What does the Llama CLI do? + +Here are the steps performed by `llama-cli`: + +1. Load and interpret LLMs in GGUF format + +2. 
Build a compute graph based on the model structure + + The graph can be divided into subgraphs, each assigned to the most suitable backend device, but in this Learning Path all operations are executed on the Arm CPU backend. + +3. Allocate memory for tensor nodes using the graph planner -Step2 to Step4 are wrapped inside the function **`llama_decode`**. -During **Prefill** and **Decode**, `llama-cli` repeatedly calls `llama_decode` to generate tokens. -The parameter **`llama_batch`** passed to `llama_decode` differs between stages, containing input tokens, their count, and their positions. +4. Execute tensor nodes in the graph during the `graph_compute` stage, which traverses nodes and forwards work to backend devices + +Steps 2 to 4 are wrapped inside the function `llama_decode`. +During Prefill and Decode, `llama-cli` repeatedly calls `llama_decode` to generate tokens. + +The parameter `llama_batch` passed to `llama_decode` differs between stages, containing input tokens, their count, and their positions. + +### What are the components of llama.cpp? -### Components of llama.cpp The components of llama.cpp include: -![text#center](images/llama_componetns.jpg "Figure 2. llmama.cpp components") -llama.cpp supports various backends such as `CPU`, `GPU`, `CUDA`, `OpenCL` etc. +![text#center](images/llama_components.jpg "Figure 2. llama.cpp components") + +llama.cpp supports various backends such as `CPU`, `GPU`, `CUDA`, and `OpenCL`. + +For the CPU backend, it provides an optimized `ggml-cpu` library, mainly utilizing CPU vector instructions. + +For Arm CPUs, the `ggml-cpu` library also offers an `aarch64` trait that leverages 8-bit integer multiply (i8mm) instructions for acceleration. -For the CPU backend, it provides an optimized `ggml-cpu` library (mainly utilizing CPU vector instructions). -For Arm CPUs, the `ggml-cpu` library also offers an `aarch64` trait that leverages the new **I8MM** instructions for acceleration. 
The `ggml-cpu` library also integrates the Arm [KleidiAI](https://github.com/ARM-software/kleidiai) library as an additional trait. ### Prefill and Decode in autoregressive LLMs -Most autoregressive LLMs are Decoder-only model. -Here is a brief introduction to Prefill and Decode stage of autoregressive LLMs. -![text#center](images/llm_prefill_decode.jpg "Figure 3. Prefill and Decode stage") + +An autoregressive LLM is a type of Large Language Model that generates text by predicting the next token (word or word piece) in a sequence based on all the previously generated tokens. + +The term "autoregressive" means the model uses its own previous outputs as inputs for generating subsequent outputs, creating a sequential generation process. + +For example, when generating the sentence "The cat sat on the", an autoregressive LLM: +1. Takes the input prompt as context +2. Predicts the next most likely token (e.g., "mat") +3. Uses the entire sequence including "mat" to predict the following token +4. Continues this process token by token until completion + +This sequential nature is why autoregressive LLMs have two distinct computational phases: Prefill (processing the initial prompt) and Decode (generating tokens one by one). + +Most autoregressive LLMs are Decoder-only models. This refers to the transformer architecture they use, which consists only of decoder blocks from the original Transformer paper. The alternatives to decoder-only models include encoder-only models used for tasks like classification and encoder-decoder models used for tasks like translation. + +Decoder-only models like LLaMA have become dominant for text generation because they are simpler to train at scale, can handle both understanding and generation tasks, and are more efficient for text generation. + +Here is a brief introduction to Prefill and Decode stages of autoregressive LLMs. +![text#center](images/llm_prefill_decode.jpg "Figure 3. 
Prefill and Decode stages")

At the Prefill stage, multiple input tokens of the prompt are processed.
-It mainly performs GEMM (A matrix is multiplied by another matrix) operations to generate the first output token.
+
+It mainly performs GEMM (a matrix is multiplied by another matrix) operations to generate the first output token.
+
 ![text#center](images/transformer_prefill.jpg "Figure 4. Prefill stage")
 
-At the Decode stage, by utilizing the [KV cache](https://huggingface.co/blog/not-lain/kv-caching), it mainly performs GEMV (A vector is multiplied by a matrix) operations to generate subsequent output tokens one by one.
+At the Decode stage, by utilizing the [KV cache](https://huggingface.co/blog/not-lain/kv-caching), it mainly performs GEMV (a vector is multiplied by a matrix) operations to generate subsequent output tokens one by one.
+
 ![text#center](images/transformer_decode.jpg "Figure 5. Decode stage")
 
-Therefore,
-- **Prefill** is **compute-bound**, dominated by large GEMM operations
-- **Decode** is **memory-bound**, dominated by KV cache access and GEMV operations
+In summary, Prefill is compute-bound, dominated by large GEMM operations, and Decode is memory-bound, dominated by KV cache access and GEMV operations.
 
-This can be seen in the subsequent analysis with Streamline. \ No newline at end of file
+You will see this highlighted during the analysis with Streamline. 
\ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/3_llama.cpp_annotation.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/3_llama.cpp_annotation.md index 85ddc43038..9de51513f7 100644 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/3_llama.cpp_annotation.md +++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/3_llama.cpp_annotation.md @@ -1,36 +1,40 @@ --- -title: Integrating Streamline Annotations into llama.cpp +title: Integrate Streamline Annotations into llama.cpp weight: 4 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Integrating Streamline Annotations into llama.cpp +## Integrate Streamline Annotations into llama.cpp -To visualize token generation at the **Prefill** and **Decode** stages, we use **Streamline’s Annotation Marker** feature. -This requires integrating annotation support into the **llama.cpp** project. -More information about the Annotation Marker API can be found [here](https://developer.arm.com/documentation/101816/9-7/Annotate-your-code?lang=en). +To visualize token generation at the Prefill and Decode stages, you can use Streamline's Annotation Marker feature. + +This requires integrating annotation support into the llama.cpp project. + +More information about the Annotation Marker API can be found in the [Streamline User Guide](https://developer.arm.com/documentation/101816/9-7/Annotate-your-code?lang=en). {{% notice Note %}} -You can either build natively on an **Arm platform**, or cross-compile on another architecture using an Arm cross-compiler toolchain. +You can either build natively on an Arm platform, or cross-compile on another architecture using an Arm cross-compiler toolchain. 
{{% /notice %}} ### Step 1: Build Streamline Annotation library Install [Arm DS](https://developer.arm.com/Tools%20and%20Software/Arm%20Development%20Studio) or [Arm Streamline](https://developer.arm.com/Tools%20and%20Software/Streamline%20Performance%20Analyzer) on your development machine first. -Streamline Annotation support code in the installation directory such as *"Arm\Development Studio 2024.1\sw\streamline\gator\annotate"*. +Streamline Annotation support code is in the installation directory such as `Arm/Development Studio 2024.1/sw/streamline/gator/annotate`. -For installation guidance, refer to the [Streamline installation guide](https://learn.arm.com/install-guides/streamline/). +For installation guidance, refer to the [Streamline installation guide](/install-guides/streamline/). Clone the gator repository that matches your Streamline version and build the `Annotation support library`. -The installation step is depends on your development machine. +The installation step depends on your development machine. + +For Arm native build, you can use the following instructions to install the packages. -For Arm native build, you can use following insturction to install the packages. -For other machine, you need to set up the cross compiler environment by install [aarch64 gcc compiler toolchain](https://developer.arm.com/downloads/-/arm-gnu-toolchain-downloads). -You can refer this [guide](https://learn.arm.com/install-guides/gcc/cross/) for Cross-compiler installation. +For other machines, you need to set up the cross compiler environment by installing [Arm GNU toolchain](https://developer.arm.com/downloads/-/arm-gnu-toolchain-downloads). + +You can refer to the [GCC install guide](https://learn.arm.com/install-guides/gcc/cross/) for cross-compiler installation. 
{{< tabpane code=true >}} {{< tab header="Arm Native Build" language="bash">}} @@ -40,7 +44,6 @@ You can refer this [guide](https://learn.arm.com/install-guides/gcc/cross/) for git clone https://github.com/ARM-software/gator.git cd gator ./build-linux.sh - cd annotate make {{< /tab >}} @@ -49,20 +52,22 @@ You can refer this [guide](https://learn.arm.com/install-guides/gcc/cross/) for apt-get install ninja-build cmake gcc g++ g++-aarch64-linux-gnu curl zip unzip tar pkg-config git cd ~ git clone https://github.com/ARM-software/gator.git - cd gator make CROSS_COMPILE=/path/to/aarch64_linux_gcc_tool {{< /tab >}} {{< /tabpane >}} -Once complete, the static library **libstreamline_annotate.a** will be generated at `~/gator/annotate/libstreamline_annotate.a` and the header file at: `gator/annotate/streamline_annotate.h` +Once complete, the static library `libstreamline_annotate.a` will be generated at `~/gator/annotate/libstreamline_annotate.a` and the header file is at `gator/annotate/streamline_annotate.h`. ### Step 2: Integrate Annotation Marker into llama.cpp -Next, we need to install **llama.cpp** to run the LLM model. -To make the following performance profiling content easier to follow, this Learning Path will use a specific release version of llama.cpp to ensure the steps and results remain consistent. +Next, you need to install llama.cpp to run the LLM model. -Before the build **llama.cpp**, create a directory `streamline_annotation` and copy the library `libstreamline_annotate.a` and the header file `streamline_annotate.h` into the folder. +{{% notice Note %}} +To make the performance profiling content easier to follow, this Learning Path uses a specific release version of llama.cpp to ensure the steps and results remain consistent. +{{% /notice %}} + +Before building llama.cpp, create a directory `streamline_annotation` and copy the library `libstreamline_annotate.a` and the header file `streamline_annotate.h` into the new directory. 
```bash cd ~ @@ -74,7 +79,7 @@ mkdir streamline_annotation cp ~/gator/annotate/libstreamline_annotate.a ~/gator/annotate/streamline_annotate.h streamline_annotation ``` -To link `libstreamline_annotate.a` library when building llama-cli, adding following lines in the end of `llama.cpp/tools/main/CMakeLists.txt`. +To link the `libstreamline_annotate.a` library when building llama-cli, add the following lines at the end of `llama.cpp/tools/main/CMakeLists.txt`. ```makefile set(STREAMLINE_LIB_PATH "${CMAKE_SOURCE_DIR}/streamline_annotation/libstreamline_annotate.a") @@ -82,13 +87,13 @@ target_include_directories(llama-cli PRIVATE "${CMAKE_SOURCE_DIR}/streamline_ann target_link_libraries(llama-cli PRIVATE "${STREAMLINE_LIB_PATH}") ``` -To add Annotation Markers to llama-cli, change the llama-cli code **llama.cpp/tools/main/main.cpp** by adding +To add Annotation Markers to `llama-cli`, change the `llama-cli` code in `llama.cpp/tools/main/main.cpp` by adding the include file: ```c #include "streamline_annotate.h" ``` -After the call to common_init(), add the setup macro: +After the call to `common_init()`, add the setup macro: ```c common_init(); @@ -125,16 +130,16 @@ A string is added to the Annotation Marker to record the position of input token ### Step 3: Build llama-cli -For convenience, llama-cli is **static linked**. +For convenience, llama-cli is statically linked. -Firstly, create a new directory `build` understand llama.cpp root directory and go into it. +Create a new directory `build` under the llama.cpp root directory and change to the new directory: ```bash cd ~/llama.cpp -mkdir ./build & cd ./build +mkdir build && cd build ``` -Then configure the project by running +Next, configure the project. {{< tabpane code=true >}} {{< tab header="Arm Native Build" language="bash">}} @@ -174,23 +179,20 @@ Then configure the project by running {{< /tabpane >}} -Set `CMAKE_C_COMPILER` and `DCMAKE_CXX_COMPILER` to your cross compiler path. 
Make sure that **-march** in `DCMAKE_C_FLAGS` and `CMAKE_CXX_FLAGS` matches your Arm CPU hardware.
+Set `CMAKE_C_COMPILER` and `CMAKE_CXX_COMPILER` to your cross compiler path. Make sure that -march in `CMAKE_C_FLAGS` and `CMAKE_CXX_FLAGS` matches your Arm CPU hardware.
 
-In this learning path, we run llama-cli on an Arm CPU that supports **NEON Dotprod** and **I8MM** instructions.
-Therefore, we specify: **armv8.2-a+dotprod+i8mm**.
+With the flags above you can run `llama-cli` on an Arm CPU that supports NEON dot product and 8-bit integer multiply (i8mm) instructions.
 
-We also specify **-static** and **-g** options:
-- **-static**: produces a statically linked executable, so it can run on different Arm64 Linux/Android environments without needing shared libraries.
-- **-g**: includes debug information, which makes source code and function-level profiling in Streamline much easier.
+The `-static` and `-g` options are also specified: `-static` produces a statically linked executable, so it can run on different Arm64 Linux/Android environments without needing shared libraries, and `-g` includes debug information, which makes source code and function-level profiling in Streamline much easier.
 
-so that the llama-cli executable is static linked and with debug info. This makes source code/function level profiling easier and the llama-cli executable runnable on various version of Arm64 Linux/Android.
-
-Now you can build the project by running:
+Now you can build the project using `cmake`:
 
 ```bash
 cd ~/llama.cpp/build
 cmake --build ./ --config Release
 ```
 
-After the building process, you should find the llama-cli will be generated at **~/llama.cpp/build/bin/** directory.
+After the building process completes, you can find `llama-cli` in the `~/llama.cpp/build/bin/` directory.
+
+You now have an annotated version of `llama-cli` ready for Streamline. 
\ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/4_analyze_token_prefill_decode.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/4_analyze_token_prefill_decode.md index d33d989fa9..c413ac15b6 100644 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/4_analyze_token_prefill_decode.md +++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/4_analyze_token_prefill_decode.md @@ -1,46 +1,52 @@ --- -title: Running llama-cli and Analyzing Data with Streamline +title: Run llama-cli and analyze the data with Streamline weight: 5 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Running llama-cli and Analyzing Data with Streamline +## Run llama-cli and analyze the data with Streamline -After successfully building **llama-cli**, the next step is to set up the runtime environment on your Arm64 platform. +After successfully building llama-cli, the next step is to set up the runtime environment on your Arm platform. -### Setup gatord +### Set up gatord + +The gator daemon (gatord) is the Streamline collection agent that runs on the target device. It captures performance data including CPU metrics, PMU events, and annotations, then sends this data to the Streamline analysis tool running on your host machine. The daemon needs to be running on your target device before you can capture performance data. Depending on how you built llama.cpp: -- **Cross Build:** - - Copy the `llama-cli` executable to your Arm64 target. +For the cross-compiled build flow: + + - Copy the `llama-cli` executable to your Arm target. 
- Also copy the `gatord` binary from the Arm DS or Streamline installation:
    - Linux: `Arm\Development Studio 2024.1\sw\streamline\bin\linux\arm64`
    - Android: `Arm\Development Studio 2024.1\sw\streamline\bin\android\arm64`
-- **Native Build:**
+For the native build flow:
+
 - Use the `llama-cli` from your local build and the `gatord` you compiled earlier (`~/gator/build-native-gcc-rel/gatord`).
 
 ### Download a lightweight model
 
-Then, download the LLM model into the target platform.
-For demonstration, we use the lightweight **Qwen1_5-0_5b-chat-q4_0.gguf** model, which can run on both Arm servers and resource-constrained edge devices:
+You can download the LLM model to the target platform.
+
+For demonstration, use the lightweight `Qwen1_5-0_5b-chat-q4_0.gguf` model, which can run on both Arm servers and resource-constrained edge devices:
 
 ```bash
 cd ~
 wget https://huggingface.co/Qwen/Qwen1.5-0.5B-Chat-GGUF/resolve/main/qwen1_5-0_5b-chat-q4_0.gguf
 ```
 
-### Run gatord
+### Run the Gator daemon
+
+Start the gator daemon on your Arm target:
 
-Start the gator daemon on your Arm64 target:
 ```bash
 ./gatord
 ```
 
-You should see similar messages as below,
+You should see similar messages to those shown below:
 
 ``` bash
 Streamline Data Recorder v9.4.0 (Build 9b1e8f8)
@@ -50,19 +56,21 @@ Gator ready
 
 ### Connect Streamline
 
-Next, we will need use Streamline to setup the collect CPU performance data.
+Next, you can use Streamline to set up the collection of CPU performance data.
+
+If you're accessing the Arm server via SSH, you need to forward port `8080` from the host platform to your local machine.
 
-If you're accessing the Arm server via **SSH**, you need to forward port `8080` from the host platform to your local machine.
``` bash
 ssh -i user@arm-server -L 8080:localhost:8080 -N
 ```
+
 Append `-L 8080:localhost:8080 -N` to your original SSH command to enable local port forwarding; this allows Arm Streamline on your local machine to connect to the Arm server. 
-Then launch the Streamline application on your host machine, connect to the gatord running on your Arm64 target with either TCP or ADB connection.
+Then launch the Streamline application on your host machine, and connect to the gatord running on your Arm target with either a TCP or an ADB connection.
 
 You can select PMU events to be monitored at this point.
 
 {{% notice Note %}}
-If you are using ssh port forwarding, you need select TCP `127.0.0.1:8080`.
+If you are using ssh port forwarding, you need to select TCP `127.0.0.1:8080`.
 {{% /notice %}}
 
 ![text#center](images/streamline_capture.png "Figure 6. Streamline Start Capture ")
@@ -70,26 +78,28 @@ If you are using ssh port forwarding, you need select TCP `127.0.0.1:8080`.
 
 Set the path of llama-cli executable for Streamline so that its debug info can be used for analysis.
 ![text#center](images/streamline_capture_image.png "Figure 7. Streamline image path")
 
-Click `Start Capture` button on Streamline to start collecting data from the Arm64 target.
+Click the `Start Capture` button on Streamline to start collecting data from the Arm target.
 
 {{% notice Note %}}
-This guide is not intended to introduce how to use Streamline, if you encounter any issue during setting up gatord or Streamline, please refer this [user guide](https://developer.arm.com/documentation/101816/latest/?lang=en)
+This guide is not intended to introduce how to use Streamline. If you encounter any issues with gatord or Streamline, refer to the [Streamline User Guide](https://developer.arm.com/documentation/101816/latest/?lang=en).
 {{% /notice %}}
 
 ### Run llama-cli
 
-Now, run the llama-cli executable as below,
+Run the `llama-cli` executable as below:
 
 ``` bash
 cd ~/llama.cpp/build/bin
 ./llama-cli -m qwen1_5-0_5b-chat-q4_0.gguf -p "<|im_start|>system\nYou are a helpful AI assistant.<|im_end|>\n<|im_start|>user\nTell me a story about a fox and a crow? Please do not tell the traditional story in Aesop's fables. 
Please tell me a positive story about friendship and love. The story should have no more than 400 words<|im_end|>\n<|im_start|>assistant\n" -st -t 1 ``` -After a while, you can stop the Streamline data collection by clicking ‘Stop’ button on Streamline. Then Streamline tool on your host PC will start the data analysis. +After a while, you can stop the Streamline data collection by clicking the `Stop` button on Streamline. + +Streamline running on your host PC will start the data analysis. ### Analyze the data with Streamline -From the timeline view of Streamline, we can see some Annotation Markers. Since we add an Annotation Marker before llama_decode function, each Annotation Marker marks the start time of a token generation. +From the timeline view of Streamline, you can see some Annotation Markers. Since an Annotation Marker is added before the llama_decode function, each Annotation Marker marks the start time of a token generation. ![text#center](images/annotation_marker_1.png "Figure 8. Annotation Marker") The string in the Annotation Marker can be shown when clicking those Annotation Markers. For example, @@ -97,39 +107,48 @@ The string in the Annotation Marker can be shown when clicking those Annotation The number after `past` indicates the position of input tokens, the number after `n_eval` indicates the number of tokens to be processed this time. -As shown in the timeline view below, with help of Annotation Markers, we can clearly identify the Prefill stage and Decode stage. -![text#center](images/annotation_marker_prefill.png "Figure 10. Annotation Marker at Prefill and Decode stage") - By checking the string of Annotation Marker, the first token generation at Prefill stage has `past 0, n_eval 78`, which means that the position of input tokens starts at 0 and there are 78 input tokens to be processed. 
-We can see that the first token generated at Prefill stage takes more time, since 78 input tokens have to be processed at Prefill stage, it performs lots of GEMM operations. At Decode stage, tokens are generated one by one at mostly equal speed, one token takes less time than that of Prefill stage, thanks to the effect of KV cache. At Decode stage, it performs many GEMV operations.
+You can see that the first token generated at the Prefill stage takes more time: 78 input tokens have to be processed, which requires many GEMM operations. At the Decode stage, tokens are generated one by one at a mostly equal speed, and each token takes less time than the first token at the Prefill stage, thanks to the KV cache. The Decode stage performs many GEMV operations.
 
-We can further investigate it with PMU event counters that are captured by Streamline. At Prefill stage, the amount of computation, which are indicated by PMU event counters that count number of Advanced SIMD (NEON), Floating point, Integer data processing instruction, is large. However, the memory access is relatively low. Especially, the number of L3 cache refill/miss is much lower than that of Decode stage.
+You can further investigate this with the PMU event counters captured by Streamline. At the Prefill stage, the amount of computation, indicated by the PMU event counters for Advanced SIMD (NEON), floating-point, and integer data processing instructions, is large. However, memory access is relatively low; in particular, the number of L3 cache refills/misses is much lower than at the Decode stage.
 
 At Decode stage, the amount of computation is relatively less (since the time of each token is less), but the number of L3 cache refill/miss goes much higher.
 
-By monitoring other PMU events, Backend Stall Cycles and Backend Stall Cycles due to Memory stall,
+
 ![text#center](images/annotation_pmu_stall.png "Figure 11. 
Backend stall PMU event")
-We can see that at Prefill stage, Backend Stall Cycles due to Memory stall are only about 10% of total Backend Stall Cycles. However, at Decode stage, Backend Stall Cycles due to Memory stall are around 50% of total Backend Stall Cycles.
+You can see that at the Prefill stage, Backend Stall Cycles due to Memory stall are only about 10% of total Backend Stall Cycles. However, at the Decode stage, Backend Stall Cycles due to Memory stall are around 50% of total Backend Stall Cycles.
 
 All those PMU event counters indicate that it is compute-bound at Prefill stage and memory-bound at Decode stage.
 
-Now, let us further profile the code execution with Streamline. In the ‘Call Paths’ view of Streamline, we can see the percentage of running time of functions that are organized in form of call stack.
+Now, you can further profile the code execution with Streamline. In the Call Paths view of Streamline, you can see the percentage of running time of functions, organized in the form of a call stack.
+
 ![text#center](images/annotation_prefill_call_stack.png "Figure 12. Call stack")
 
-In the ‘Functions’ view of Streamline, we can see the overall percentage of running time of functions.
+In the Functions view of Streamline, you can see the overall percentage of running time of functions.
+
 ![text#center](images/annotation_prefill_functions.png "Figure 13. Functions view")
 
-As we can see, the function, graph_compute, takes the largest portion of the running time. It shows that large amounts of GEMM and GEMV operations take most of the time. With Qwen1_5-0_5b-chat-q4_0 model,
-* The computation (GEMM and GEMV) of Q, K, V vectors and most of FFN layers: their weights are with Q4_0 data type and the input activations are with FP32 data type. The computation is forwarded to KleidiAI trait by *ggml_cpu_extra_compute_forward*. KleidiAI ukernels implemented with NEON Dotprod and I8MM vector instructions are used to accelerate the computation. 
- At Prefill stage, *kai_run_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm* KleidiAI ukernel is used for GEMM (Matrix Multiply) operators. It takes the advantage of NEON I8MM instruction. Since Prefill stage only takes small percentage of the whole time, the percentage of this function is small as shown in figures above. However, if we focus on Prefill stage only, with ‘Samplings’ view in Timeline. We can see *kai_run_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm* takes the largest portion of the whole Prefill stage.
-  ![text#center](images/prefill_only.png "Figure 14. Prefill only view")
+As you can see, the `graph_compute` function takes the largest portion of the running time.
+
+This shows that the many GEMM and GEMV operations take most of the time.
 
- - At Decode stage, *kai_run_matmul_clamp_f32_qsi8d32p1x8_qsi4c32p4x8_1x4x32_neon_dotprod* KleidiAI ukernel is used for GEMV operators. It takes advantage of NEON Dotprod instruction. If we focus on Decode stage only, we can see this function takes the second largest portion.
- ![text#center](images/decode_only.png "Figure 15. Decode only view")
+With the `Qwen1_5-0_5b-chat-q4_0` model, in the computation (GEMM and GEMV) of the Q, K, V vectors and most of the FFN layers, the weights use the Q4_0 data type and the input activations use the FP32 data type.
 
-- There is a result_output linear layer in Qwen1_5-0_5b-chat-q4_0 model, the wights are with Q6_K data type. The layer computes a huge [1, 1024] x [1024, 151936] GEMV operation, where 1024 is the embedding size and 151936 is the vocabulary size. This operation cannot be handled by KleidiAI yet, it is handled by the ggml_vec_dot_q6_K_q8_K function in the ggml-cpu library.
-- The tensor nodes for computation of Multi-Head attention are presented as three-dimension matrices with FP16 data type (KV cache also holds FP16 values), they are computed by ggml_vec_dot_f16 function in ggml-cpu library. 
-- The computation of RoPE, Softmax, RMSNorm layers does not take significant portion of the running time.
+The computation is forwarded to the KleidiAI trait by `ggml_cpu_extra_compute_forward`. KleidiAI microkernels implemented with NEON dot product and i8mm vector instructions accelerate the computation.
+
+At the Prefill stage, the `kai_run_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm` KleidiAI ukernel is used for GEMM (Matrix Multiply) operators. It takes advantage of i8mm instructions. Since the Prefill stage only takes a small percentage of the whole time, the percentage of this function is small, as shown in the figures above. However, if you focus only on the Prefill stage with the Samplings view in the Timeline, you see that `kai_run_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm` takes the largest portion of the Prefill stage.
+
+![text#center](images/prefill_only.png "Figure 14. Prefill only view")
+
+At the Decode stage, the `kai_run_matmul_clamp_f32_qsi8d32p1x8_qsi4c32p4x8_1x4x32_neon_dotprod` KleidiAI ukernel is used for GEMV operators. It takes advantage of dot product instructions. If you look only at the Decode stage, you can see this function takes the second largest portion.
+
+![text#center](images/decode_only.png "Figure 15. Decode only view")
+
+There is a result_output linear layer in the Qwen1_5-0_5b-chat-q4_0 model whose weights use the Q6_K data type. The layer computes a huge [1, 1024] x [1024, 151936] GEMV operation, where 1024 is the embedding size and 151936 is the vocabulary size. This operation cannot be handled by KleidiAI yet, so it is handled by the ggml_vec_dot_q6_K_q8_K function in the ggml-cpu library.
+
+The tensor nodes for the computation of Multi-Head attention are presented as three-dimensional matrices with the FP16 data type (the KV cache also holds FP16 values); they are computed by the ggml_vec_dot_f16 function in the ggml-cpu library.
+
+The computation of the RoPE, Softmax, and RMSNorm layers does not take a significant portion of the running time. 
### Analyzing results

- Annotation Markers show token generation start points.
@@ -142,3 +161,3 @@ As we can see, the function, graph_compute, takes the largest portion of the run
 |---------|----------|----------------|--------------------------------------------------|
 | Prefill | GEMM | Compute-bound | Heavy SIMD/FP/INT ops, few cache refills |
 | Decode | GEMV | Memory-bound | Light compute, many L3 cache misses, ~50% stalls |
diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/5_operator_deepdive.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/5_operator_deepdive.md
index fd1cb948dc..8b2ebf8eb0 100644
--- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/5_operator_deepdive.md
+++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/5_operator_deepdive.md
@@ -1,20 +1,20 @@
 ---
-title: Deep Dive Into Individual Operator
+title: Deep dive into operators
 weight: 6
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
-## Deep Dive Into Individual Operator
+## Deep dive into operators
 
-This module shows how to use **Streamline Annotation Channels** to analyze the execution time of each node in the compute graph. More details on Annotation Channels can be found [here](https://developer.arm.com/documentation/101816/9-7/Annotate-your-code/User-space-annotations/Group-and-Channel-annotations?lang=en).
+You can use Streamline Annotation Channels to analyze the execution time of each node in the compute graph. 
More details on Annotation Channels can be found in the [Group and Channel annotations](https://developer.arm.com/documentation/101816/9-7/Annotate-your-code/User-space-annotations/Group-and-Channel-annotations?lang=en) section of the Streamline User Guide.
 
 ## Integrating Annotation Channels into llama.cpp
 
-In llama.cpp, tensor nodes are executed in the CPU backend inside the function `ggml_graph_compute_thread` (`~/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c`).
+In llama.cpp, tensor nodes are executed in the CPU backend inside the function `ggml_graph_compute_thread()` in the file `~/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c`.
 
-In our selected release tag, the loop over tensor nodes looks like this (around line 2862):
+In the selected release tag, the loop over tensor nodes looks like this (around line 2862):
 
 ```c
 for (int node_n = 0; node_n < cgraph->n_nodes && atomic_load_explicit(&tp->abort, memory_order_relaxed) != node_n; node_n++) {
@@ -23,28 +23,27 @@ for (int node_n = 0; node_n < cgraph->n_nodes && atomic_load_explicit(&tp->abort
 
     ggml_compute_forward(&params, node);
 ```
 
-To monitor operator execution time, let's create annotation channels for each type of operators (such as `GGML_OP_MUL_MAT`, `GGML_OP_SOFTMAX`, `GGML_OP_ROPE` and `GGML_OP_MUL`).
+To monitor operator execution time, you can create annotation channels for each type of operator, such as `GGML_OP_MUL_MAT`, `GGML_OP_SOFTMAX`, `GGML_OP_ROPE`, and `GGML_OP_MUL`.
 
 Since `GGML_OP_MUL_MAT` including both GEMM and GEMV operation takes significant portion of execution time, two dedicated annotation channels are created for GEMM and GEMV respectively.
 
-The annotation starts at the beginning of `ggml_compute_forward` and stops at the end, so that the computation of tensor node/operator can be monitored.
+The annotation starts at the beginning of `ggml_compute_forward()` and stops at the end, so that the computation of each tensor node/operator can be monitored. 
-### Step 1: Add Annotation Code
+### Step 1: Add annotation code
 
-Firstly, add Streamline annotation header file to ggml-cpu.c,
+First, add the Streamline annotation header file to `ggml-cpu.c`:
 
 ```c
 #include "streamline_annotate.h"
 ```
 
-Edit `ggml_graph_compute_thread` function in `~/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c`.
+Edit the `ggml_graph_compute_thread()` function in the file `~/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c`.
 
-Add following code in front and after the **ggml_compute_forward(&params, node)**.
+Add the following code before and after the call to `ggml_compute_forward(&params, node)`.
 
-Your code will be looks like:
+Your code now looks like:
 
 ```c
-
 for (int node_n = 0; node_n < cgraph->n_nodes && atomic_load_explicit(&tp->abort, memory_order_relaxed) != node_n; node_n++) {
     struct ggml_tensor * node = cgraph->nodes[node_n];
@@ -80,7 +79,7 @@ for (int node_n = 0; node_n < cgraph->n_nodes && atomic_load_explicit(&tp->abort
     // --- End Annotation Channel for Streamline
 ```
 
-### Step 2: Add Tensor Shape Info (Optional)
+### Step 2: Add tensor shape info (optional)
 
 You can also add information of the shape and size of source tensor by replace sprintf function as follow:
@@ -97,7 +96,7 @@ You can also add information of the shape and size of source tensor by replace s
 
 ### Step 3: Update CMakeLists
 
-Edit `~/llama.cpp/ggml/src/ggml-cpu/CMakeLists.txt` to include Streamline Annotation header file and libstreamline_annotate.a library by adding codes, copy following lines inside ggml_add_cpu_backend_variant_impl function.
+Edit `~/llama.cpp/ggml/src/ggml-cpu/CMakeLists.txt` to include the Streamline Annotation header file and the `libstreamline_annotate.a` library by adding the following lines inside the `ggml_add_cpu_backend_variant_impl` function:
 
 ```bash
 set(STREAMLINE_LIB_PATH "${CMAKE_SOURCE_DIR}/streamline_annotation/libstreamline_annotate.a")
@@ -109,64 +108,75 @@ Then, build `llama-cli` again.
 
 ### Analyze the data with Streamline
 
-Run llama-cli and collect profiling data with Streamline as previous session. 
+Run `llama-cli` and collect profiling data with Streamline as you did in the previous section.
 
-String annotations are displayed as text overlays inside the relevant channels in the details panel of the `Timeline` view.
+String annotations are displayed as text overlays inside the relevant channels in the details panel of the Timeline view.
 
 For example, inside Channel 0 in the following screenshot.
+
 ![text#center](images/deep_dive_1.png "Figure 16. Annotation Channel")
 
 The letter A is displayed in the process list to indicate the presence of annotations.
+
 String annotations are also displayed in the Message column in the Log view.
+
 ![text#center](images/deep_dive_2.png "Figure 17. Annotation log")
 
+### View the individual operators at Prefill stage
 
-### View of individual operators at Prefill stage
+The annotation channel view at the Prefill stage is shown below:
 
-The screenshot of annotation channel view at Prefill stage is shown as below,
 ![text#center](images/prefill_annotation_channel.png "Figure 18. Annotation Channel at Prefill stage")
 
-Note that the name of operator in the screenshot above is manually edited. If the name of operator needs to be shown instead of Channel number by Streamline, ANNOTATE_NAME_CHANNEL can be added to ggml_graph_compute_thread function.
-This annotation macro is defined as,
+The operator names in the screenshot above were edited manually. If you want Streamline to show the operator name instead of the channel number, add `ANNOTATE_NAME_CHANNEL` to the `ggml_graph_compute_thread` function.
+
+This annotation macro is defined as:
 
 ```c
 ANNOTATE_NAME_CHANNEL(channel, group, string)
 ```
 
 For example,
+
 ```c
 ANNOTATE_NAME_CHANNEL(0, 0, "MUL_MAT_GEMV");
 ANNOTATE_NAME_CHANNEL(1, 0, "MUL_MAT_GEMM");
 ```
 
-The code above sets the name of annotation channel 0 as **MUL_MAT_GEMV** and channel 1 as **MUL_MAT_GEMM**.
+The code above sets the name of annotation channel 0 as `MUL_MAT_GEMV` and channel 1 as `MUL_MAT_GEMM`. 
+ +By zooming into the timeline view, you can see more details: -![text#center](images/prefill_annotation_channel_2.png "Figure 19. Annotation Channel at Prefill stage") +![text#center](images/prefill_annotation_channel_2.png "Figure 19. Annotation Channel at Prefill stage") When moving the cursor over an annotation channel, Streamline shows: + - The tensor node name - The operator type - The shape and size of the source tensors + ![text#center](images/prefill_annotation_channel_3.png "Figure 20. Annotation Channel Zoom in") -In the example above, we see a `GGML_OP_MUL_MAT` operator for the **FFN_UP** node. -Its source tensors have shapes **[1024, 2816]** and **[1024, 68]**. +In the example above, you see a `GGML_OP_MUL_MAT` operator for the `FFN_UP` node. +The source tensors have shapes [1024, 2816] and [1024, 68]. This view makes it clear that: -- The majority of time at the **Prefill stage** is spent on **MUL_MAT GEMM** operations in the attention and FFN layers. -- There is also a large **MUL_MAT GEMV** operation in the `result_output` linear layer. -- Other operators, such as **MUL, Softmax, Norm, RoPE**, consume only a small portion of execution time. +- The majority of time at the Prefill stage is spent on `MUL_MAT GEMM` operations in the attention and FFN layers. +- There is also a large `MUL_MAT GEMV` operation in the `result_output` linear layer. +- Other operators, such as MUL, Softmax, Norm, RoPE, consume only a small portion of execution time. -### View of individual operators at Decode stage +### View the individual operators at Decode stage -The annotation channel view for the **Decode stage** is shown below: + +The annotation channel view for the Decode stage is shown below: + ![text#center](images/decode_annotation_channel.png "Figure 21. Annotation Channel at Decode stage") Zooming in provides additional details: + ![text#center](images/decode_annotation_channel_2.png "Figure 22.
Annotation Channel string") -From this view, we observe: -- The majority of time in **Decode** is spent on **MUL_MAT GEMV** operations in the attention and FFN layers. +From this view, you can see: +- The majority of time in Decode is spent on `MUL_MAT GEMV` operations in the attention and FFN layers. - In contrast to Prefill, **no GEMM operations** are executed in these layers. -- The `result_output` linear layer has a **large GEMV operation**, which takes an even larger proportion of runtime in Decode. +- The `result_output` linear layer has a large GEMV operation, which takes an even larger proportion of runtime in Decode. - This is expected, since each token generation at Decode is shorter due to KV cache reuse, making the result_output layer more dominant. diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/6_multithread_analyze.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/6_multithread_analyze.md index 908766cf8c..00b0c4bf00 100644 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/6_multithread_analyze.md +++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/6_multithread_analyze.md @@ -1,64 +1,63 @@ --- -title: Analyzing Multi-Core/Multi-Thread Performance +title: Analyze multi-threaded performance weight: 7 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Analyzing Multi-Core/Multi-Thread Performance +## Analyze multi-threaded performance -The CPU backend in **llama.cpp** uses multiple cores and threads to accelerate operator execution. -It creates a **threadpool**, where: +The CPU backend in llama.cpp uses multiple cores and threads to accelerate operator execution. + +It creates a threadpool, where: - The number of threads is controlled by the `-t` option - If `-t` is not specified, it defaults to the number of CPU cores in the system -The entrypoint for secondary threads is the function **`ggml_graph_compute_secondary_thread`**. 
+The entrypoint for secondary threads is the function `ggml_graph_compute_secondary_thread()`. When computing a tensor node/operator with a large workload, llama.cpp splits the computation into multiple parts and distributes them across threads. ### Example: MUL_MAT Operator -For the **MUL_MAT** operator, the output matrix **C** can be divided across threads: +For the MUL_MAT operator, the output matrix C can be divided across threads: + ![text#center](images/multi_thread.jpg "Figure 23. Multi-Thread") In this example, four threads each compute one quarter of matrix C. -### Observing Thread Execution with Streamline +### Observing thread execution with Streamline -The execution of multiple threads on CPU cores can be observed using **Core Map** and **Cluster Map** modes in the Streamline Timeline. -Learn more about these modes [here](https://developer.arm.com/documentation/101816/9-7/Analyze-your-capture/Viewing-application-activity/Core-Map-and-Cluster-Map-modes). +The execution of multiple threads on CPU cores can be observed using Core Map and Cluster Map modes in the Streamline Timeline. -Run llama-cli with `-t 2 -C 0x3` to specify two threads and thread affinity as CPU core0 and core1, -* -t 2 → creates two worker threads -* -C 0x3 → sets CPU affinity to core0 and core1 +Learn more about these modes in the [Core Map and Cluster Map modes](https://developer.arm.com/documentation/101816/9-7/Analyze-your-capture/Viewing-application-activity/Core-Map-and-Cluster-Map-modes) section of the Streamline User Guide. + +Run `llama-cli` with `-t 2 -C 0x3`, where `-t 2` creates two worker threads and `-C 0x3` sets their CPU affinity to core0 and core1. ```bash ./llama-cli -m qwen1_5-0_5b-chat-q4_0.gguf -p "<|im_start|>system\nYou are a helpful AI assistant.<|im_end|>\n<|im_start|>user\nTell me a story about a fox and a crow? Please do not tell the traditional story in Aesop's fables. Please tell me a positive story about friendship and love.
The story should have no more than 400 words<|im_end|>\n<|im_start|>assistant\n" -st -t 2 -C 0x3 ``` -### Streamline Results +### Streamline results -Collect profiling data with **Streamline**, then select **Core Map** and **Cluster Map** modes in the Timeline view. +Collect profiling data with Streamline, then select Core Map and Cluster Map modes in the Timeline view. ![text#center](images/multi_thread_core_map.png "Figure 24. Multi-Thread") In the screenshot above: - Two threads are created -- They are running on **CPU core0** and **CPU core1**, respectively +- They are running on CPU core0 and CPU core1, respectively -In addition, you can use the **Annotation Channel** view to analyze operator execution on a per-thread basis. -Each thread generates its own annotation channel independently. +In addition, you can use the Annotation Channel view to analyze operator execution on a per-thread basis. Each thread generates its own annotation channel independently. ![text#center](images/multi_thread_annotation_channel.png "Figure 25. Multi-Thread") In the screenshot above, at the highlighted time: -- Both threads are executing the **same node** -- In this case, the node is the **result_output linear layer** - +- Both threads are executing the same node +- In this case, the node is the result_output linear layer -Congratulations — you have completed the walkthrough of profiling an LLM model on an Arm CPU. +You have completed the walkthrough of profiling an LLM model on an Arm CPU! -By combining **Arm Streamline** with a solid understanding of llama.cpp, you can visualize model execution, analyze code efficiency, and identify opportunities for optimization. +By combining Arm Streamline with a solid understanding of llama.cpp, you can visualize model execution, analyze code efficiency, and identify opportunities for optimization. 
Keep in mind that adding annotation code to llama.cpp and running gatord may introduce a small performance overhead, so interpret profiling results accordingly. diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/_index.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/_index.md index d78e492e38..11ebed8219 100644 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/_index.md @@ -1,5 +1,5 @@ --- -title: Use Streamline to analyze LLM on CPU with llama.cpp and KleidiAI +title: Analyze llama.cpp and KleidiAI LLM performance using Streamline draft: true cascade: @@ -7,10 +7,10 @@ cascade: minutes_to_complete: 50 -who_is_this_for: This advanced topic is for software developers, performance engineers, and AI practitioners who want to run llama.cpp on Arm-based CPUs, learn how to use Arm Streamline to capture and analyze performance data, understand how LLM inference behaves at the Prefill and Decode stages. +who_is_this_for: This is an advanced topic for software developers, performance engineers, and AI practitioners who want to run llama.cpp on Arm-based CPUs, learn how to use Arm Streamline to capture and analyze performance data, and understand how LLM inference behaves at the Prefill and Decode stages.
learning_objectives: - - Describe the architecture of llama.cpp and the role of Prefill and Decode stages + - Describe the architecture of llama.cpp and the role of the Prefill and Decode stages - Integrate Streamline Annotations into llama.cpp for fine-grained performance insights - Capture and interpret profiling data with Streamline - Use Annotation Channels to analyze specific operators during token generation @@ -18,7 +18,7 @@ learning_objectives: prerequisites: - Basic understanding of llama.cpp - - Understanding of transformer model + - Understanding of transformer models - Knowledge of Streamline usage - An Arm Neoverse or Cortex-A hardware platform running Linux or Android to test the application @@ -37,7 +37,6 @@ tools_software_languages: - C++ - llama.cpp - KleidiAI - - Neoverse - Profiling operatingsystems: - Linux diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/llama_componetns.jpg b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/llama_components.jpg similarity index 100% rename from content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/llama_componetns.jpg rename to content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/images/llama_components.jpg diff --git a/content/learning-paths/servers-and-cloud-computing/mysql-azure/_index.md b/content/learning-paths/servers-and-cloud-computing/mysql-azure/_index.md new file mode 100644 index 0000000000..f609621253 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/mysql-azure/_index.md @@ -0,0 +1,62 @@ +--- +title: Deploy MySQL on Microsoft Azure Cobalt 100 processors + +draft: true +cascade: + draft: true + +minutes_to_complete: 40 + +who_is_this_for: This is an advanced topic that introduces MySQL deployment on Microsoft Azure Cobalt 100 (Arm-based) virtual machines. It is designed for developers migrating MySQL applications from x86_64 to Arm. 
+ +learning_objectives: + - Provision an Azure Arm64 virtual machine using Azure console, with Ubuntu Pro 24.04 LTS as the base image. + - Deploy MySQL on the Ubuntu virtual machine. + - Perform MySQL baseline testing and benchmarking on both x86_64 and Arm64 virtual machines. + +prerequisites: + - A [Microsoft Azure](https://azure.microsoft.com/) account with access to Cobalt 100 based instances (Dpsv6) + - Familiarity with relational databases and the basics of [MySQL](https://dev.mysql.com/doc/refman/8.0/en/introduction.html) + +author: Pareena Verma + +### Tags +skilllevels: Advanced +subjects: Databases +cloud_service_providers: Microsoft Azure + +armips: + - Neoverse + +tools_software_languages: + - MySQL + - SQL + - Docker + +operatingsystems: + - Linux + +further_reading: + - resource: + title: Azure Virtual Machines documentation + link: https://learn.microsoft.com/en-us/azure/virtual-machines/ + type: documentation + - resource: + title: Azure Container Instances documentation + link: https://learn.microsoft.com/en-us/azure/container-instances/ + type: documentation + - resource: + title: MySQL Manual + link: https://dev.mysql.com/doc/refman/8.0/en/installing.html + type: documentation + - resource: + title: mysqlslap official website + link: https://dev.mysql.com/doc/refman/8.4/en/mysqlslap.html + type: website + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 # _index.md always has weight of 1 to order correctly +layout: "learningpathall" # All files under learning paths have this same wrapper +learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. 
+--- diff --git a/content/learning-paths/servers-and-cloud-computing/mysql-azure/_next-steps.md b/content/learning-paths/servers-and-cloud-computing/mysql-azure/_next-steps.md new file mode 100644 index 0000000000..c3db0de5a2 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/mysql-azure/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation. +title: "Next Steps" # Always the same, html page title. +layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. +--- diff --git a/content/learning-paths/servers-and-cloud-computing/mysql-azure/background.md b/content/learning-paths/servers-and-cloud-computing/mysql-azure/background.md new file mode 100644 index 0000000000..c36ac02dea --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/mysql-azure/background.md @@ -0,0 +1,24 @@ +--- +title: "Overview" +weight: 2 + +### FIXED, DO NOT MODIFY +layout: "learningpathall" +--- + +## Cobalt 100 Arm-based processor + +Azure Cobalt 100 is Microsoft’s first-generation Arm-based processor, designed for cloud-native, scale-out Linux workloads. Based on Arm’s Neoverse-N2 architecture, this 64-bit CPU delivers improved performance and energy efficiency. Running at 3.4 GHz, it provides a dedicated physical core for each vCPU, ensuring consistent and predictable performance. + +Typical workloads include web and application servers, data analytics, open-source databases, and caching systems. 
+ +To learn more, see the Microsoft blog [Announcing the preview of new Azure virtual machines based on the Azure Cobalt 100 processor](https://techcommunity.microsoft.com/blog/azurecompute/announcing-the-preview-of-new-azure-vms-based-on-the-azure-cobalt-100-processor/4146353). + +## MySQL + +MySQL is an open-source relational database management system (RDBMS) widely used for storing, organizing, and managing structured data. It uses SQL (Structured Query Language) for querying and managing databases, making it one of the most popular choices for web applications, enterprise solutions, and cloud deployments. + +It is known for its reliability, high performance, and ease of use. MySQL supports features like transactions, replication, partitioning, and robust security, making it suitable for both small applications and large-scale production systems. + +Learn more at the [MySQL official website](https://www.mysql.com/) and in the [official documentation](https://dev.mysql.com/doc/). diff --git a/content/learning-paths/servers-and-cloud-computing/mysql-azure/baseline.md b/content/learning-paths/servers-and-cloud-computing/mysql-azure/baseline.md new file mode 100644 index 0000000000..401a3cd315 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/mysql-azure/baseline.md @@ -0,0 +1,140 @@ +--- +title: Validate MySQL +weight: 6 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Run a functional test of MySQL on Azure Cobalt 100 + +After installing MySQL on your Arm64 virtual machine, you can perform simple baseline testing to validate that MySQL runs correctly and produces the expected output. + +### Start MySQL + +Make sure MySQL is running: + +```console +sudo systemctl start mysql +sudo systemctl enable mysql +``` +### Connect to MySQL + +```console +mysql -u admin -p +``` +This opens the MySQL client and connects as the new user (`admin`), prompting you to enter the admin password.
+ +### Show and use a database + +```sql +CREATE DATABASE baseline_test; +SHOW DATABASES; +USE baseline_test; +SELECT DATABASE(); +``` + +- `CREATE DATABASE baseline_test;` - Creates a new database named baseline_test. +- `SHOW DATABASES;` - Lists all available databases. +- `USE baseline_test;` - Switches to the new database. +- `SELECT DATABASE();` - Confirms the current database in use. + +You should see output similar to: + +```output +mysql> CREATE DATABASE baseline_test; +Query OK, 1 row affected (0.01 sec) + +mysql> SHOW DATABASES; ++--------------------+ +| Database | ++--------------------+ +| baseline_test | +| benchmark_db | +| information_schema | +| mydb | +| mysql | +| performance_schema | +| sys | ++--------------------+ +7 rows in set (0.00 sec) + +mysql> USE baseline_test; +Database changed +mysql> SELECT DATABASE(); ++---------------+ +| DATABASE() | ++---------------+ +| baseline_test | ++---------------+ +1 row in set (0.00 sec) +``` +You created a new database named **baseline_test**, verified its presence with `SHOW DATABASES`, and confirmed it is the active database using `SELECT DATABASE()`. + +### Create and show a table + +```sql +CREATE TABLE test_table ( + id INT AUTO_INCREMENT PRIMARY KEY, + name VARCHAR(50), + value INT +); +SHOW TABLES; +``` + +- `CREATE TABLE` - Defines a new table named test_table. + - `id` - Primary key with auto-increment. + - `name` - String field up to 50 characters. + - `value` - Integer field. +- `SHOW TABLES;` - Lists all tables in the current database. + +You should see output similar to: + +```output +Query OK, 0 rows affected (0.05 sec) + +mysql> SHOW TABLES; ++-------------------------+ +| Tables_in_baseline_test | ++-------------------------+ +| test_table | ++-------------------------+ +1 row in set (0.00 sec) +``` +You successfully created the table **test_table** in the `baseline_test` database and verified its existence using `SHOW TABLES`.
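If you want to rehearse the same DDL without a live MySQL server, Python's built-in `sqlite3` module can act as a rough stand-in. This sketch is illustrative only and not part of the MySQL setup; note that SQLite spells the auto-increment clause `INTEGER PRIMARY KEY AUTOINCREMENT` rather than MySQL's `AUTO_INCREMENT`:

```python
import sqlite3

# In-memory database standing in for the MySQL baseline_test database
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# SQLite equivalent of the MySQL test_table definition
cur.execute("""
    CREATE TABLE test_table (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name VARCHAR(50),
        value INTEGER
    )
""")

# SQLite has no SHOW TABLES; query the sqlite_master catalog instead
tables = [row[0] for row in cur.execute(
    "SELECT name FROM sqlite_master "
    "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'")]
print(tables)
```

Running the script prints the list of user tables, which should contain only `test_table`.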
+ +### Insert Sample Data + +```sql +INSERT INTO test_table (name, value) +VALUES +('Alice', 100), +('Bob', 200), +('Charlie', 300); +``` +- `INSERT INTO test_table (name, value)` - Specifies which table and columns to insert into. +- `VALUES` - Provides three rows of data. + +After inserting, you can check the data with: + +```sql +SELECT * FROM test_table; +``` +- `SELECT *` - Retrieves all columns. +- `FROM test_table` - Selects from the test_table. + +You should see output similar to: + +```output +mysql> SELECT * FROM test_table; ++----+---------+-------+ +| id | name | value | ++----+---------+-------+ +| 1 | Alice | 100 | +| 2 | Bob | 200 | +| 3 | Charlie | 300 | ++----+---------+-------+ +3 rows in set (0.00 sec) +``` + +The functional test was successful — the **test_table** contains three rows (**Alice, Bob, and Charlie**) with their respective values, confirming MySQL is working +correctly. diff --git a/content/learning-paths/servers-and-cloud-computing/mysql-azure/benchmarking.md b/content/learning-paths/servers-and-cloud-computing/mysql-azure/benchmarking.md new file mode 100644 index 0000000000..eb9cc1b80c --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/mysql-azure/benchmarking.md @@ -0,0 +1,143 @@ +--- +title: Benchmark MySQL with mysqlslap +weight: 7 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Benchmark MySQL on Azure Cobalt 100 Arm-based instances and x86_64 instances + +`mysqlslap` is the official MySQL benchmarking tool used to simulate multiple client connections and measure query performance. It helps evaluate **read/write throughput, query response times**, and overall MySQL server performance under different workloads, making it ideal for baseline testing and optimization. + +## Steps for MySQL Benchmarking with mysqlslap + +1. 
Connect to MySQL and Create a Database + +To access the MySQL server, log in with the `admin` user created earlier: + +```console +mysql -u admin -p +``` +Once logged in, create a database named `benchmark_db`: + +```sql +CREATE DATABASE benchmark_db; +USE benchmark_db; +``` + +2. Create a Table and Populate Data + +After logging into MySQL, you can create a table to store benchmark data. Here’s a simple example: + +```sql +CREATE TABLE benchmark_table ( + record_id INT AUTO_INCREMENT PRIMARY KEY, + username VARCHAR(50), + score INT +); +``` +Insert some sample rows manually: + +```sql +INSERT INTO benchmark_table (username,score) VALUES +('John',100),('Jane',200),('Mike',300); +``` + +Or populate automatically with 1000 rows: + +```sql +DELIMITER // +CREATE PROCEDURE populate_benchmark_data() +BEGIN + DECLARE i INT DEFAULT 1; + WHILE i <= 1000 DO + INSERT INTO benchmark_table (username, score) + VALUES (CONCAT('Player', i), i*10); + SET i = i + 1; + END WHILE; +END // +DELIMITER ; + +CALL populate_benchmark_data(); +DROP PROCEDURE populate_benchmark_data; +``` +- The table `benchmark_table` has three columns: `record_id` (primary key), `username`, and `score`. +- You can insert a few rows manually for testing or use a procedure to generate **1000 rows automatically** for more realistic benchmarking. + +## Run a Simple Read/Write Benchmark + +Once your table is ready, you can use `mysqlslap` to simulate multiple clients performing queries. This helps test MySQL’s performance under load. + +```console +mysqlslap --user=admin --password="MyStrongPassword!" --host=127.0.0.1 --concurrency=10 --iterations=5 --query="INSERT INTO benchmark_db.benchmark_table (username,score) VALUES('TestUser',123);" --create-schema=benchmark_db +``` +- **--user / --password:** MySQL login credentials. +- **--host:** MySQL server address (127.0.0.1 for local). +- **--concurrency:** Number of simultaneous clients (here, 10).
+- **--iterations:** How many times to repeat the test (here, 5). +- **--query:** The SQL statement to run repeatedly. +- **--create-schema:** The database in which to run the query. + +You should see output similar to the following: + +```output +Benchmark + Average number of seconds to run all queries: 0.267 seconds + Minimum number of seconds to run all queries: 0.265 seconds + Maximum number of seconds to run all queries: 0.271 seconds + Number of clients running queries: 10 + Average number of queries per client: 1 +``` + +The following command runs a **read benchmark** on your MySQL database using `mysqlslap`. It simulates multiple clients querying the table at the same time and records the results. + +```console +mysqlslap --user=admin --password="MyStrongPassword!" --host=127.0.0.1 --concurrency=10 --iterations=5 --query="SELECT * FROM benchmark_db.benchmark_table WHERE record_id < 500;" --create-schema=benchmark_db --verbose | tee -a /tmp/mysqlslap_benchmark.log +``` + +You should see output similar to the following: + +```output +Benchmark + Average number of seconds to run all queries: 0.263 seconds + Minimum number of seconds to run all queries: 0.261 seconds + Maximum number of seconds to run all queries: 0.264 seconds + Number of clients running queries: 10 + Average number of queries per client: 1 +``` + +## Understanding the benchmark results + +- **Average number of seconds to run all queries:** This is the average time it took for all the queries in one iteration to complete across all clients. It gives you a quick sense of overall performance. +- **Minimum number of seconds to run all queries:** This is the fastest time any iteration of queries took. +- **Maximum number of seconds to run all queries:** This is the slowest time any iteration of queries took. The closer this is to the average, the more consistent your performance is.
+- **Number of clients running queries:** Indicates how many simulated users (or connections) ran queries simultaneously during the test. +- **Average number of queries per client:** Shows the average number of queries each client executed in the benchmark iteration. + +## Benchmark summary on Arm64 +Here is a summary of the benchmark results collected on an Arm64 **D4ps_v6 Ubuntu Pro 24.04 LTS virtual machine**. + +| Query Type | Average Time (s) | Minimum Time (s) | Maximum Time (s) | Clients | Avg Queries per Client | +|------------|-----------------|-----------------|-----------------|--------|----------------------| +| INSERT | 0.267 | 0.265 | 0.271 | 10 | 1 | +| SELECT | 0.263 | 0.261 | 0.264 | 10 | 1 | + +## Benchmark summary on x86_64 +Here is a summary of the benchmark results collected on an x86_64 **D4s_v6 Ubuntu Pro 24.04 LTS virtual machine**. + +| Query Type | Average Time (s) | Minimum Time (s) | Maximum Time (s) | Clients | Avg Queries per Client | +|------------|-----------------|-----------------|-----------------|--------|----------------------| +| INSERT | 0.243 | 0.231 | 0.273 | 10 | 1 | +| SELECT | 0.222 | 0.214 | 0.233 | 10 | 1 | + +## Insights from Benchmark Results + +The benchmark results on the Arm64 virtual machine show: + +- **Balanced Performance for Read and Write Queries:** Both `INSERT` and `SELECT` queries performed consistently, with average times of **0.267s** and **0.263s**, respectively. +- **Low Variability Across Iterations:** The difference between the minimum and maximum times was very small for both query types, indicating stable and predictable behavior under load. +- **Moderate Workload Handling:** With **10 clients** and an average of **1 query per client**, the system handled concurrent operations efficiently without significant delays.
+- **Key Takeaway:** The MySQL setup on Arm64 provides reliable and steady performance for both data insertion and retrieval tasks, making it a solid choice for applications requiring dependable database operations. + +You have now benchmarked MySQL on an Azure Cobalt 100 Arm64 virtual machine and compared results with x86_64. diff --git a/content/learning-paths/servers-and-cloud-computing/mysql-azure/create-instance.md b/content/learning-paths/servers-and-cloud-computing/mysql-azure/create-instance.md new file mode 100644 index 0000000000..9571395aa2 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/mysql-azure/create-instance.md @@ -0,0 +1,50 @@ +--- +title: Create an Arm-based cloud virtual machine using the Microsoft Cobalt 100 CPU +weight: 3 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Introduction + +There are several ways to create an Arm-based Cobalt 100 virtual machine: the Microsoft Azure console, the Azure CLI tool, or using your choice of IaC (Infrastructure as Code). This guide will use the Azure console to create a virtual machine with the Arm-based Cobalt 100 processor. + +This learning path focuses on the general-purpose virtual machines of the D series. Please read the guide on the [Dpsv6 size series](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/general-purpose/dpsv6-series) offered by Microsoft Azure. + +If you have never used the Microsoft Cloud Platform before, please review the Microsoft [guide to Create a Linux virtual machine in the Azure portal](https://learn.microsoft.com/en-us/azure/virtual-machines/linux/quick-create-portal?tabs=ubuntu). + +#### Create an Arm-based Azure Virtual Machine + +Creating a virtual machine based on Azure Cobalt 100 is no different from creating any other virtual machine in Azure. To create an Azure virtual machine, launch the Azure portal and navigate to "Virtual Machines". +1. Select "Create", and click on "Virtual Machine" from the drop-down list. +2.
Inside the "Basic" tab, fill in the Instance details such as "Virtual machine name" and "Region". +3. Choose the image for your virtual machine (for example, Ubuntu Pro 24.04 LTS) and select “Arm64” as the VM architecture. +4. In the “Size” field, click on “See all sizes” and select the D-Series v6 family of virtual machines. Select “D4ps_v6” from the list. + +![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance.png "Figure 1: Select the D-Series v6 family of virtual machines") + +5. Select "SSH public key" as an Authentication type. Azure will automatically generate an SSH key pair for you and allow you to store it for future use. It is a fast, simple, and secure way to connect to your virtual machine. +6. Fill in the Administrator username for your VM. +7. Select "Generate new key pair", and select "RSA SSH Format" as the SSH Key Type. RSA could offer better security with keys longer than 3072 bits. Give a Key pair name to your SSH key. +8. In the "Inbound port rules", select HTTP (80) and SSH (22) as the inbound ports. + +![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance1.png "Figure 2: Allow inbound port rules") + +9. Click on the "Review + Create" tab and review the configuration for your virtual machine. It should look like the following: + +![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/ubuntu-pro.png "Figure 3: Review and Create an Azure Cobalt 100 Arm64 VM") + +10. Finally, when you are confident about your selection, click on the "Create" button, and click on the "Download Private key and Create Resources" button. + +![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance4.png "Figure 4: Download Private key and Create Resources") + +11. Your virtual machine should be ready and running within no time. 
You can SSH into the virtual machine using the private key, along with the Public IP details. + +![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/final-vm.png "Figure 5: VM deployment confirmation in Azure portal") + +{{% notice Note %}} + +To learn more about Arm-based virtual machines in Azure, refer to “Getting Started with Microsoft Azure” in [Get started with Arm-based cloud instances](/learning-paths/servers-and-cloud-computing/csp/azure). + +{{% /notice %}} diff --git a/content/learning-paths/servers-and-cloud-computing/mysql-azure/deploy.md b/content/learning-paths/servers-and-cloud-computing/mysql-azure/deploy.md new file mode 100644 index 0000000000..7e30f55f3e --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/mysql-azure/deploy.md @@ -0,0 +1,139 @@ +--- +title: Install MySQL +weight: 5 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Install MySQL on Azure Cobalt 100 + +This section walks you through installing and securing MySQL on an Azure Arm64 virtual machine. You will set up the database, configure access, and verify that it is running, ready for development and testing. + +Start by installing MySQL and other essential tools: + +## Install MySQL and tools + +1. Update the system and install MySQL +You update your system's package lists to ensure you get the latest versions and then install the MySQL server using the package manager. + +```console +sudo apt update +sudo apt install -y mysql-server +``` + +2. Secure MySQL installation + +After installing MySQL, lock down your database so that only authorized users can access it. This step sets a root password and removes insecure defaults such as anonymous accounts and test databases. + +```console +sudo mysql_secure_installation +``` +Follow the prompts: + +- Set a strong password for root. +- Remove anonymous users. +- Disallow remote root login. +- Remove test databases. +- Reload privilege tables. + +3.
Start and enable MySQL service +Start the MySQL service and enable it so that it starts automatically on every boot: + +```console +sudo systemctl start mysql +sudo systemctl enable mysql +``` +Check the status: + +```console +sudo systemctl status mysql +``` +You should see `active (running)`. + +4. Verify MySQL version + +You check the installed version of MySQL to confirm it’s set up correctly and is running. + +```console +mysql -V +``` +You should see output similar to the following: + +```output +mysql Ver 8.0.43-0ubuntu0.24.04.1 for Linux on aarch64 ((Ubuntu)) +``` +5. Access MySQL shell + +You log in to the MySQL interface using the root user to interact with the database and perform administrative tasks: + +``` +sudo mysql +``` +You should see output similar to the following: + +```output +Welcome to the MySQL monitor. Commands end with ; or \g. +Your MySQL connection id is 17 +Server version: 8.0.43-0ubuntu0.24.04.1 (Ubuntu) + +Copyright (c) 2000, 2025, Oracle and/or its affiliates. + +Oracle is a registered trademark of Oracle Corporation and/or its +affiliates. Other names may be trademarks of their respective +owners. + +Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. + +mysql> +``` + +6. Create a new user + +Create a dedicated administrative user so that you do not need to use the root account for everyday tasks: + +```console +sudo mysql +``` + +Inside the MySQL shell, run: + +```sql +CREATE USER 'admin'@'localhost' IDENTIFIED BY 'MyStrongPassword!'; +GRANT ALL PRIVILEGES ON *.* TO 'admin'@'localhost' WITH GRANT OPTION; +FLUSH PRIVILEGES; +EXIT; +``` + +- Replace **MyStrongPassword!** with the password you want. +- `FLUSH PRIVILEGES;` reloads the privilege tables so the new account takes effect immediately.
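Rather than inventing a password by hand for the `CREATE USER` statement above, you can generate a strong random one. Here is a small sketch using Python's standard `secrets` module; the function name and symbol set are illustrative choices, not anything MySQL requires:

```python
import secrets
import string

def generate_password(length: int = 16) -> str:
    """Return a random password with lower, upper, digit, and symbol characters."""
    symbols = "!@#$%^&*"
    alphabet = string.ascii_letters + string.digits + symbols
    # Redraw until every required character class is present
    while True:
        pwd = "".join(secrets.choice(alphabet) for _ in range(length))
        if (any(c.islower() for c in pwd) and any(c.isupper() for c in pwd)
                and any(c.isdigit() for c in pwd) and any(c in symbols for c in pwd)):
            return pwd

print(generate_password())
```

Substitute the printed value for `MyStrongPassword!` in the `IDENTIFIED BY` clause, and store it somewhere safe.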
+
+## Verify access with the new user
+
+Test logging in to MySQL with the new user account to confirm that it works and has the proper permissions. In this example, the new user is `admin`.
+
+```console
+mysql -u admin -p
+```
+- Enter the `admin` password when prompted.
+
+You should see output similar to the following:
+
+```output
+Enter password:
+Welcome to the MySQL monitor. Commands end with ; or \g.
+Your MySQL connection id is 16
+Server version: 8.0.43-0ubuntu0.24.04.1 (Ubuntu)
+
+Copyright (c) 2000, 2025, Oracle and/or its affiliates.
+
+Oracle is a registered trademark of Oracle Corporation and/or its
+affiliates. Other names may be trademarks of their respective
+owners.
+
+Type 'help;' or '\h' for help. Type '\c' to clear the current input statement
+mysql> exit
+```
+
+MySQL installation is complete. You can now proceed with baseline testing of MySQL in the next section.
diff --git a/content/learning-paths/servers-and-cloud-computing/mysql-azure/images/final-vm.png b/content/learning-paths/servers-and-cloud-computing/mysql-azure/images/final-vm.png
new file mode 100644
index 0000000000..5207abfb41
Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/mysql-azure/images/final-vm.png differ
diff --git a/content/learning-paths/servers-and-cloud-computing/mysql-azure/images/instance.png b/content/learning-paths/servers-and-cloud-computing/mysql-azure/images/instance.png
new file mode 100644
index 0000000000..285cd764a5
Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/mysql-azure/images/instance.png differ
diff --git a/content/learning-paths/servers-and-cloud-computing/mysql-azure/images/instance1.png b/content/learning-paths/servers-and-cloud-computing/mysql-azure/images/instance1.png
new file mode 100644
index 0000000000..b9d22c352d
Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/mysql-azure/images/instance1.png differ
diff --git
a/content/learning-paths/servers-and-cloud-computing/mysql-azure/images/instance4.png b/content/learning-paths/servers-and-cloud-computing/mysql-azure/images/instance4.png new file mode 100644 index 0000000000..2a0ff1e3b0 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/mysql-azure/images/instance4.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/mysql-azure/images/ubuntu-pro.png b/content/learning-paths/servers-and-cloud-computing/mysql-azure/images/ubuntu-pro.png new file mode 100644 index 0000000000..d54bd75ca6 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/mysql-azure/images/ubuntu-pro.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/_index.md b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/_index.md index 1f64240ea5..ac71403fa2 100644 --- a/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/_index.md @@ -1,23 +1,19 @@ --- -title: Deploy NGINX on the Microsoft Azure Cobalt 100 processors +title: Deploy NGINX on Azure Cobalt 100 Arm-based virtual machines -draft: true -cascade: - draft: true - minutes_to_complete: 30 -who_is_this_for: This Learning Path introduces NGINX deployment on Microsoft Azure Cobalt 100 (Arm-based) virtual machine. It is intended for system administrators and developers looking to deploy and benchmark NGINX on Arm-based instances. +who_is_this_for: This is an introductory topic for system administrators and developers who want to learn how to deploy and benchmark NGINX on Microsoft Azure Cobalt 100 Arm-based instances. learning_objectives: - - Start an Azure Arm64 virtual machine using the Azure console and Ubuntu Pro 24.04 LTS as the base image. - - Deploy the NGINX web server on the Azure Arm64 virtual machine. - - Configure and test a static website using NGINX on the virtual machine. 
- - Perform baseline testing and benchmarking of NGINX in the Ubuntu Pro 24.04 LTS Arm64 virtual machine environment. + - Create an Arm64 virtual machine on Azure Cobalt 100 (Dpsv6) using the Azure console with Ubuntu Pro 24.04 LTS as the base image + - Install and configure the NGINX web server on the Azure Arm64 virtual machine + - Configure and test a static website with NGINX on the virtual machine + - Run baseline NGINX performance tests with ApacheBench (ab) on Ubuntu Pro 24.04 LTS Arm64 prerequisites: - - A [Microsoft Azure](https://azure.microsoft.com/) account with access to Cobalt 100 based instances (Dpsv6). + - A [Microsoft Azure](https://azure.microsoft.com/) account with access to Cobalt 100 based instances (Dpsv6) author: Pareena Verma @@ -31,7 +27,7 @@ armips: tools_software_languages: - NGINX - - Apache Bench + - ApacheBench operatingsystems: - Linux @@ -42,11 +38,11 @@ further_reading: link: https://nginx.org/en/docs/ type: documentation - resource: - title: Apache Bench official documentation + title: ApacheBench official documentation link: https://httpd.apache.org/docs/2.4/programs/ab.html type: documentation - resource: - title: NGINX on Azure + title: NGINX on Azure virtual machines link: https://docs.nginx.com/nginx/deployment-guides/microsoft-azure/virtual-machines-for-nginx/ type: documentation diff --git a/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/backgroud.md b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/backgroud.md index 9363127800..0dafafd818 100644 --- a/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/backgroud.md +++ b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/backgroud.md @@ -1,22 +1,22 @@ --- -title: "Overview" +title: "Overview of Azure Cobalt 100 and NGINX" weight: 2 layout: "learningpathall" --- -## Cobalt 100 Arm-based processor +## Azure Cobalt 100 Arm-based processor -Azure’s Cobalt 100 is built on Microsoft's first-generation, in-house 
Arm-based processor: the Cobalt 100. Designed entirely by Microsoft and based on Arm’s Neoverse N2 architecture, this 64-bit CPU delivers improved performance and energy efficiency across a broad spectrum of cloud-native, scale-out Linux workloads. These include web and application servers, data analytics, open-source databases, caching systems, and more. Running at 3.4 GHz, the Cobalt 100 processor allocates a dedicated physical core for each vCPU, ensuring consistent and predictable performance. +Azure’s Cobalt 100 is Microsoft’s first-generation, in-house Arm-based processor. Built on Arm Neoverse N2, Cobalt 100 is a 64-bit CPU that delivers strong performance and energy efficiency for cloud-native, scale-out Linux workloads such as web and application servers, data analytics, open-source databases, and caching systems. Running at 3.4 GHz, Cobalt 100 allocates a dedicated physical core for each vCPU, which helps ensure consistent and predictable performance. -To learn more about Cobalt 100, refer to the blog [Announcing the preview of new Azure virtual machine based on the Azure Cobalt 100 processor](https://techcommunity.microsoft.com/blog/azurecompute/announcing-the-preview-of-new-azure-vms-based-on-the-azure-cobalt-100-processor/4146353). +To learn more, see the Microsoft blog [Announcing the preview of new Azure VMs based on the Azure Cobalt 100 processor](https://techcommunity.microsoft.com/blog/azurecompute/announcing-the-preview-of-new-azure-vms-based-on-the-azure-cobalt-100-processor/4146353). ## NGINX -NGINX is a high-performance, open-source web server, reverse proxy, load balancer, and HTTP cache. Originally developed by Igor Sysoev, NGINX is known for its event-driven, asynchronous architecture, which enables it to handle high concurrency with low resource usage. +NGINX is a high-performance open-source web server, reverse proxy, load balancer, and HTTP cache. 
Known for its event-driven, asynchronous architecture, NGINX handles high concurrency with low resource usage.

There are three main variants of NGINX:

-- **NGINX Open Source**– Free and [open-source version available at nginx.org](https://nginx.org)
-- **NGINX Plus**- [Commercial edition of NGINX](https://www.nginx.com/products/nginx/) with features like dynamic reconfig, active health checks, and monitoring.
-- **NGINX Unit**- A lightweight, dynamic application server that complements NGINX. [Learn more at unit.nginx.org](https://unit.nginx.org/).
+- Open source NGINX: a free and [open-source version available at nginx.org](https://nginx.org)
+- NGINX Plus: a [commercial edition of NGINX](https://www.nginx.com/products/nginx/) with features like dynamic reconfiguration, active health checks, and monitoring
+- NGINX Unit: a lightweight, dynamic application server that complements NGINX; find out more at the [NGINX website](https://unit.nginx.org/)
diff --git a/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/baseline.md b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/baseline.md
index 1119102c40..5144c44095 100644
--- a/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/baseline.md
+++ b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/baseline.md
@@ -6,21 +6,20 @@ weight: 5
 layout: learningpathall
 ---
+## Baseline test NGINX with a static website
-### Baseline testing with a static website on NGINX
-Once NGINX is installed and serving the default welcome page, the next step is to verify that it can serve your own content. A baseline test using a simple static HTML site ensures that NGINX is correctly configured and working as expected on your Ubuntu Pro 24.04 LTS virtual machine.
+Once NGINX is installed and serving the default welcome page, verify that it can serve your own content.
A baseline test with a simple static HTML site confirms that NGINX is correctly configured and working as expected on your Ubuntu Pro 24.04 LTS virtual machine. -1. Create a Static Website Directory: - -Prepare a folder to host your HTML content. +## Create a static website directory +Prepare a folder to host your HTML content: ```console mkdir -p /var/www/my-static-site cd /var/www/my-static-site ``` -2. Create an HTML file and Web page: -Create a simple HTML file to replace the default NGINX welcome page. Using a file editor of your choice create the file `index.html` with the content below: +## Create an HTML file +Create a simple HTML page to replace the default NGINX welcome page. Using a file editor of your choice, create `index.html` with the following content: ```html @@ -56,30 +55,25 @@ Create a simple HTML file to replace the default NGINX welcome page. Using a fil
-    <h1>Welcome to NGINX on Azure Ubuntu Pro 24.04 LTS!</h1>
-    <p>Your static site is running beautifully on ARM64</p>
+    <h1>Welcome to NGINX on Azure Ubuntu Pro 24.04 LTS!</h1>
+    <p>Your static site is running beautifully on Arm64</p>
``` -3. Adjust Permissions: - -Ensure that NGINX (running as the www-data user) can read the files in your custom site directory: +## Adjust permissions +Ensure that NGINX (running as the `www-data` user) can read the files in your custom site directory: ```console sudo chown -R www-data:www-data /var/www/my-static-site ``` -This sets the ownership of the directory and files so that the NGINX process can serve them without permission issues. - -4. Update NGINX Configuration: - -Point NGINX to serve files from your new directory by creating a dedicated configuration file under /etc/nginx/conf.d/. +## Update NGINX configuration +Point NGINX to serve files from your new directory by creating a dedicated configuration file under `/etc/nginx/conf.d/`: ```console sudo nano /etc/nginx/conf.d/static-site.conf ``` -Add the following configuration to it: - +Add the following configuration: ```console server { listen 80 default_server; @@ -97,54 +91,44 @@ server { error_log /var/log/nginx/static-error.log; } ``` -This configuration block tells NGINX to: - - Listen on port 80 (both IPv4 and IPv6). - - Serve files from /var/www/my-static-site. - - Use index.html as the default page. - - Log access and errors to dedicated log files for easier troubleshooting. +This server block listens on port 80 for both IPv4 and IPv6, serves files from `/var/www/my-static-site/`, and uses `index.html` as the default page. It also writes access and error events to dedicated log files to simplify troubleshooting. +{{% notice Note %}} Make sure the path to your `index.html` file is correct before saving. +{{% /notice %}} -5. Disable the default site: - -By default, NGINX comes with a packaged default site configuration. Since you have created a custom config, it is good practice to disable the default to avoid conflicts: - +## Disable the default site +NGINX ships with a packaged default site configuration. 
Since you created a custom config, disable the default to avoid conflicts:
```console
sudo unlink /etc/nginx/sites-enabled/default
```

-6. Test the NGINX Configuration:
-
-Before applying your changes, always test the configuration to make sure there are no syntax errors:
-
+## Test the NGINX configuration
+Before applying your changes, test the configuration for syntax errors:
```console
sudo nginx -t
```

-You should see output similar to:
+Expected output:

```output
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
```

-If you see both lines, your configuration is valid.
-
-7. Reload or Restart NGINX:
-After testing the configuration, apply your changes by reloading or restarting the NGINX service:
+## Reload or restart NGINX
+Apply your changes by reloading or restarting the NGINX service:

```console
sudo nginx -s reload
sudo systemctl restart nginx
```

-8. Test the Static Website in a browser:
-
-Access your website at your public IP on port 80.
-
+## Test the static website in a browser
+Access your website at your public IP on port 80:
```console
http://<your-vm-public-ip>/
```

-9. You should see the NGINX welcome page confirming a successful deployment:
-
-![Static Website Screenshot](images/nginx-web.png)
+## Verify the page renders
+You should see your custom page instead of the default welcome page:
+![Custom static website served by NGINX on Azure VM alt-text#center](images/nginx-web.png "Custom static website served by NGINX on an Azure Arm64 VM")

-This verifies the basic functionality of NGINX installation and you can now proceed to benchmarking the performance of NGINX on your Arm-based Azure VM.
+This verifies the basic functionality of the NGINX installation. You can now proceed to benchmarking NGINX performance on your Arm-based Azure VM.
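If you want to sanity-check which directives the site will use, a small sketch like the following can help. It is illustrative only: it parses an inline copy of the server block so it runs anywhere; on the VM you would instead set `CONF=/etc/nginx/conf.d/static-site.conf`:

```shell
# Illustrative: list the key directives (listen/root/index) from a server block.
# An inline sample is used here; on the VM, point CONF at your real config:
#   CONF=/etc/nginx/conf.d/static-site.conf
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
server {
    listen 80 default_server;
    root /var/www/my-static-site;
    index index.html;
}
EOF
DIRECTIVES=$(grep -E 'listen|root|index' "$CONF" | sed 's/^[[:space:]]*//')
echo "$DIRECTIVES"
rm -f "$CONF"
```

If the printed `root` path does not match the directory you created earlier, fix the config before reloading NGINX.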
diff --git a/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/benchmarking.md b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/benchmarking.md index 1e9b3a0129..757c3dcfaf 100644 --- a/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/benchmarking.md +++ b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/benchmarking.md @@ -6,25 +6,26 @@ weight: 6 layout: learningpathall --- -## NGINX Benchmarking by ApacheBench +## Benchmark NGINX with ApacheBench (ab) on Ubuntu Pro 24.04 LTS -To understand how your NGINX deployment performs under load, you can benchmark it using ApacheBench (ab). ApacheBench is a lightweight command-line tool for benchmarking HTTP servers. It measures performance metrics like requests per second, response time, and throughput under concurrent load. +Use ApacheBench (**ab**) to measure NGINX performance on your Arm64 Azure VM. This section shows you how to install the tool, run a basic benchmark, interpret key metrics, and review a sample result from an Azure **D4ps_v6** instance. +## Install ApacheBench -1. Install ApacheBench +On **Ubuntu Pro 24.04 LTS**, ApacheBench is provided by the **apache2-utils** package: -On **Ubuntu Pro 24.04 LTS**, ApacheBench is available as part of the `apache2-utils` package: ```console sudo apt update -sudo apt install apache2-utils -y +sudo apt install -y apache2-utils ``` -2. Verify Installation +Verify the installation: ```console ab -V ``` -You should see output similar to: + +Expected output: ```output This is ApacheBench, Version 2.3 <$Revision: 1923142 $> @@ -32,7 +33,7 @@ Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/ Licensed to The Apache Software Foundation, http://www.apache.org/ ``` -3. 
Basic Benchmark Syntax +## Run a basic benchmark The general syntax for running an ApacheBench test is: @@ -40,14 +41,16 @@ The general syntax for running an ApacheBench test is: ab -n -c ``` -Now run an example: +Example (1,000 requests, 50 concurrent, to the NGINX default page on localhost): ```console ab -n 1000 -c 50 http://localhost/ ``` -This sends **1000 total requests** with **50 concurrent connections** to `http://localhost/`. -You should see a output similar to: +This command sends 1,000 total requests with 50 concurrent connections to `http://localhost/`. + +Sample output: + ```output This is ApacheBench, Version 2.3 <$Revision: 1903618 $> Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/ @@ -104,47 +107,43 @@ Percentage of the requests served within a certain time (ms) 100% 2 (longest request) ``` -### Interpret Benchmark Results: - -ApacheBench outputs several metrics. Key ones to focus on include: - - - Requests per second: Average throughput. - - Time per request: Latency per request. - - Failed request: Should ideally be zero. - - Transfer rate: Bandwidth used by the responses. - -### Benchmark summary on Arm64: -Here is a summary of benchmark results collected on an Arm64 **D4ps_v6 Ubuntu Pro 24.04 LTS virtual machine**. 
- -| **Category** | **Metric** | **Value** | -|---------------------------|-------------------------------------------------|-------------------------------| -| **General Info** | Server Software | nginx/1.24.0 | -| | Server Hostname | localhost | -| | Server Port | 80 | -| | Document Path | / | -| | Document Length | 890 bytes | -| **Test Setup** | Concurrency Level | 50 | -| | Time Taken for Tests | 0.032 sec | -| | Complete Requests | 1000 | -| | Failed Requests | 0 | -| **Transfer Stats** | Total Transferred | 1,132,000 bytes | -| | HTML Transferred | 890,000 bytes | -| | Requests per Second | 31,523.86 [#/sec] | -| | Time per Request (mean) | 1.586 ms | -| | Time per Request (across all) | 0.032 ms | -| | Transfer Rate | 34,848.65 KB/sec | -| **Connection Times (ms)** | Connect (min / mean / stdev / median / max) | 0 / 1 / 0.1 / 1 / 1 | -| | Processing (min / mean / stdev / median / max) | 0 / 1 / 0.1 / 1 / 1 | -| | Waiting (min / mean / stdev / median / max) | 0 / 1 / 0.2 / 1 / 1 | -| | Total (min / mean / stdev / median / max) | 1 / 2 / 0.1 / 2 / 2 | - -### Analysis of results from NGINX benchmarking on Arm-based Azure Cobalt-100 - -These benchmark results highlight the strong performance characteristics of NGINX running on Arm64-based Azure VMs (such as the D4ps_v6 instance type): - -- High Requests Per second(31,523.86 requests/sec), demonstrating high throughput under concurrent load. -- Response time per request averaged **1.586 ms**, indicating efficient handling of requests with minimal delay. -- **Zero failed requests**, confirming stability and reliability during testing. -- Consistently low **connection and processing times** (mean ≈ 1 ms), ensuring smooth performance. - -Overall, these results illustrate that NGINX on Arm64 machines provides a highly performant solution for web workloads on Azure. 
You can also use the same benchmarking framework to compare results on equivalent x86-based Azure instances, which provides useful insight into relative performance and cost efficiency across architectures. +## Interpret benchmark results + +ApacheBench produces a variety of metrics, but the most useful ones highlight how well your server handles load. The requests per second value shows the average throughput, while the time per request (mean) indicates the latency experienced by each request. Ideally, the failed requests metric should remain at zero to confirm reliability. Finally, the transfer rate measures the effective bandwidth used by the responses, giving you insight into overall data flow efficiency. + +## Benchmark summary on Arm64 + +The following results were collected on an Arm64 **D4ps_v6** VM running **Ubuntu Pro 24.04 LTS**: + +| Category | Metric | Value | +|---------------------------|--------------------------------------------------|----------------------------| +| General info | Server Software | nginx/1.24.0 | +| | Server Hostname | localhost | +| | Server Port | 80 | +| | Document Path | / | +| | Document Length | 890 bytes | +| Test setup | Concurrency Level | 50 | +| | Time Taken for Tests | 0.032 sec | +| | Complete Requests | 1,000 | +| | Failed Requests | 0 | +| Transfer stats | Total Transferred | 1,132,000 bytes | +| | HTML Transferred | 890,000 bytes | +| | Requests per Second | 31,523.86 #/sec | +| | Time per Request (mean) | 1.586 ms | +| | Time per Request (across all) | 0.032 ms | +| | Transfer Rate | 34,848.65 KB/sec | +| Connection times (ms) | Connect (min / mean / stdev / median / max) | 0 / 1 / 0.1 / 1 / 1 | +| | Processing (min / mean / stdev / median / max) | 0 / 1 / 0.1 / 1 / 1 | +| | Waiting (min / mean / stdev / median / max) | 0 / 1 / 0.2 / 1 / 1 | +| | Total (min / mean / stdev / median / max) | 1 / 2 / 0.1 / 2 / 2 | + +## Analysis of results from NGINX benchmarking on Arm-based Azure Cobalt-100 + +These results 
highlight the performance characteristics of NGINX on Arm64-based Azure VMs (such as **D4ps_v6**): + +- High Requests per second (31,523.86 requests/sec), demonstrating high throughput under concurrent load. +- Response time per request averaged 1.586 ms, indicating efficient handling of requests with minimal delay. +- Zero failed requests, confirming stability and reliability during testing. +- Consistently low connection and processing times (mean ≈ 1 ms), ensuring smooth performance. + +Overall, NGINX on Arm64 provides a performant, cost‑efficient platform for web workloads on Azure. You can use the same benchmark to compare with equivalent x86-based instances to evaluate relative performance and cost efficiency across architectures. diff --git a/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/deploy.md b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/deploy.md index 97fb26ce45..c9236e0330 100644 --- a/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/deploy.md +++ b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/deploy.md @@ -1,18 +1,18 @@ --- -title: Install NGINX +title: "Install NGINX" weight: 4 ### FIXED, DO NOT MODIFY layout: learningpathall --- +# Install and verify NGINX on Ubuntu Pro 24.04 LTS (Azure Arm64) +In this section, you install and configure NGINX, a high-performance web server and reverse proxy, on your Azure Arm64 (Cobalt 100) virtual machine. NGINX is widely used to serve static content, handle large volumes of connections efficiently, and act as a load balancer. Running it on your Azure Cobalt 100 virtual machine allows you to serve web traffic securely and reliably. -## NGINX Installation on Ubuntu Pro 24.04 LTS +## Install NGINX (apt) -In this section, you will install and configure NGINX, a high-performance web server and reverse proxy on your Arm-based Azure instance. 
NGINX is widely used to serve static content, handle large volumes of connections efficiently, and act as a load balancer. Running it on your Azure Cobalt-100 virtual machine will allow you to serve web traffic securely and reliably. - -### Install NGINX +Install NGINX from Ubuntu’s repositories on Ubuntu Pro 24.04 LTS. Run the following commands to install and enable NGINX: @@ -23,29 +23,31 @@ sudo systemctl enable nginx sudo systemctl start nginx ``` -### Verify NGINX +## Verify NGINX is running Check the installed version of NGINX: ```console nginx -v ``` -The output should look like: + +Expected output: ```output nginx version: nginx/1.24.0 (Ubuntu) ``` -{{% notice Note %}} - -The [Arm Ecosystem Dashboard](https://developer.arm.com/ecosystem-dashboard/) recommends NGINX version 1.20.1 as the minimum recommended on the Arm platforms. +{{% notice Note %}} +The [Arm Ecosystem Dashboard](https://developer.arm.com/ecosystem-dashboard/) recommends NGINX version 1.20.1 or later for Arm platforms. {{% /notice %}} -You can confirm that NGINX is running correctly by checking its systemd service status: +You can also confirm that NGINX is running correctly by checking its systemd service status: + ```console sudo systemctl status nginx ``` -You should see output similar to: + +Expected output: ```output ● nginx.service - A high performance web server and a reverse proxy server @@ -63,19 +65,20 @@ You should see output similar to: ├─1944 "nginx: worker process" └─1945 "nginx: worker process" ``` -If you see Active: active (running), NGINX is successfully installed and running. +If you see `Active: active (running)`, NGINX is successfully installed and running. -### Validation with curl -Validation with `curl` confirms that NGINX is correctly installed, running, and serving **HTTP** responses. 
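Beyond eyeballing the headers, you can script the validation. The sketch below parses sample headers inlined as a string so it runs without a server; on the VM, you would populate `HEADERS` from `curl -sI http://localhost/` instead (the sample values are placeholders):

```shell
# Illustrative: extract and check the HTTP status code from response headers.
# Sample headers are inlined; on the VM, use: HEADERS=$(curl -sI http://localhost/)
HEADERS='HTTP/1.1 200 OK
Server: nginx/1.24.0 (Ubuntu)
Content-Type: text/html'
STATUS=$(printf '%s\n' "$HEADERS" | head -n1 | awk '{print $2}')
echo "status=$STATUS"
```

A `200` status confirms the server answered successfully; anything else warrants a look at the NGINX error log.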
+## Validate HTTP response with curl + +You can validate that NGINX is serving HTTP responses using `curl`: -Run the following command to send a HEAD request to the local NGINX server: ```console curl -I http://localhost/ ``` -The -I option tells curl to request only the HTTP response headers, without downloading the page body. -You should see output similar to: +The **-I** option requests only the HTTP response headers without downloading the page body. + +Expected output: ```output HTTP/1.1 200 OK @@ -89,24 +92,32 @@ ETag: "68be5aff-267" Accept-Ranges: bytes ``` -Output summary: -- HTTP/1.1 200 OK: Confirms that NGINX is responding successfully. -- Server: nginx/1.24.0: Shows that the server is powered by NGINX. -- Content-Type, Content-Length, Last-Modified, ETag: Provide details about the served file and its metadata. +**Output summary** -This step verifies that your NGINX installation is functional at the system level, even before exposing it to external traffic. It’s a quick diagnostic check that is useful when troubleshooting connectivity issues. +| Field | What it tells you | Example | +|------------------|------------------------------------------------------|------------------------------| +| `HTTP/1.1 200 OK`| NGINX responded successfully | `HTTP/1.1 200 OK` | +| Server | NGINX and version returned by the server | `nginx/1.24.0 (Ubuntu)` | +| Content-Type | MIME type of the response | `text/html` | +| Content-Length | Size of the response body in bytes | `615` | +| Last-Modified| Timestamp of the file served | `Mon, 08 Sep 2025 04:26:39 GMT` | +| ETag | Identifier for the specific version of the resource | `68be5aff-267` | -### Allowing HTTP Traffic -When you created your VM instance earlier, you configured the Azure Network Security Group (NSG) to allow inbound HTTP (port 80) traffic. This means the Azure-side firewall is already open for web requests. 
-On the VM itself, you still need to make sure that the Uncomplicated firewall (UFW) which is used to manage firewall rules on Ubuntu allows web traffic. Run: +This confirms that NGINX is functional at the system level, even before exposing it to external traffic. +## Allow HTTP traffic (port 80) in UFW and NSG + +When you created your VM instance earlier, you configured the Azure Network Security Group (NSG) to allow inbound HTTP (port 80) traffic. + +On the VM itself, you must also allow traffic through the Ubuntu firewall (UFW). To do this, run: ```console sudo ufw allow 80/tcp sudo ufw enable ``` -The output from this command should look like: + +Expected output: ```output sudo ufw enable @@ -115,12 +126,15 @@ Rules updated (v6) Command may disrupt existing ssh connections. Proceed with operation (y|n)? y Firewall is active and enabled on system startup ``` -You can verify that HTTP is now allowed with: + +Verify that HTTP is allowed with: ```console sudo ufw status ``` -You should see an output similar to: + +Expected output: + ```output Status: active @@ -131,17 +145,19 @@ To Action From 8080/tcp (v6) ALLOW Anywhere (v6) 80/tcp (v6) ALLOW Anywhere (v6) ``` -This ensures that both Azure and the VM-level firewalls are aligned to permit HTTP requests. -### Accessing the NGINX Default Page +This ensures that both Azure and the VM-level firewalls permit HTTP requests. -You can now access the NGINX default page from your Virtual machine’s public IP address. Run the following command to display your public URL: +## Access the NGINX welcome page + +You can now access the NGINX welcome page from your VM’s public IP address. Run: ```console echo "http://$(curl -s ifconfig.me)/" ``` -Copy the printed URL and open it in your browser. You should see the default NGINX welcome page, which confirms a successful installation and that HTTP traffic is reaching your VM. -![NGINX](images/nginx-browser.png) +Copy the printed URL and open it in your browser. 
You should see the default NGINX welcome page as shown below, which confirms that HTTP traffic is reaching your VM: + +![NGINX default welcome page in a web browser on an Azure VM alt-text#center](images/nginx-browser.png) -At this stage, your NGINX installation is complete. You are now ready to proceed with baseline testing and further configuration. +At this stage, your NGINX installation is complete. You are ready to begin baseline testing and further configuration. diff --git a/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/instance.md b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/instance.md index 16d2b8546f..371ce69a3b 100644 --- a/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/instance.md +++ b/content/learning-paths/servers-and-cloud-computing/nginx-on-azure/instance.md @@ -1,50 +1,55 @@ --- -title: Create an Arm based cloud virtual machine using Microsoft Cobalt 100 CPU +title: "Create an Arm-based Azure VM with Cobalt 100" weight: 3 ### FIXED, DO NOT MODIFY layout: learningpathall --- +## Set up your development environment -## Introduction +There is more than one way to create an Arm-based Cobalt 100 virtual machine: -There are several ways to create an Arm-based Cobalt 100 virtual machine : the Microsoft Azure console, the Azure CLI tool, or using your choice of IaC (Infrastructure as Code). In this section, you will use the Azure console to create a virtual machine with Arm-based Azure Cobalt 100 Processor. +- The Microsoft Azure portal +- The Azure CLI +- Your preferred infrastructure as code (IaC) tool -This learning path focuses on the general-purpose virtual machine of the D series. Please read the guide on [Dpsv6 size series](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/general-purpose/dpsv6-series) offered by Microsoft Azure. +In this Learning Path, you will use the Azure portal to create a virtual machine with the Arm-based Azure Cobalt 100 processor. 
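If you prefer the Azure CLI route mentioned above, the call is roughly as follows. This is a sketch only: the resource group, VM name, admin username, and image URN are placeholders, not values from this Learning Path; look up the exact Ubuntu Pro 24.04 LTS Arm64 image URN (for example with `az vm image list`) before running anything:

```shell
# Sketch only: assemble the rough Azure CLI equivalent of the portal flow.
# Every value below is a placeholder, including the image URN.
RG="my-resource-group"
VM="cobalt-demo-vm"
CMD="az vm create --resource-group $RG --name $VM \
  --image <ubuntu-pro-2404-arm64-urn> --size Standard_D4ps_v6 \
  --admin-username azureuser --generate-ssh-keys"
echo "$CMD"
```

The portal steps below remain the recommended path for this Learning Path; the CLI form is shown only to illustrate the alternative.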
-While the steps to create this instance are included here for your convenience, you can also refer to the [Deploy a Cobalt 100 Virtual Machine on Azure Learning Path](/learning-paths/servers-and-cloud-computing/cobalt/) +You will focus on the general-purpose virtual machines in the D-series. For further information, see the Microsoft Azure guide for the [Dpsv6 size series](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/general-purpose/dpsv6-series). -#### Create an Arm-based Azure Virtual Machine +While the steps to create this instance are included here for convenience, for further information on setting up Cobalt on Azure, see [Deploy a Cobalt 100 virtual machine on Azure Learning Path](/learning-paths/servers-and-cloud-computing/cobalt/). -Creating a virtual machine based on Azure Cobalt 100 is no different from creating any other virtual machine in Azure. To create an Azure virtual machine, launch the Azure portal and navigate to "Virtual Machines". -1. Select "Create", and click on "Virtual Machine" from the drop-down list. -2. Inside the "Basic" tab, fill in the Instance details such as "Virtual machine name" and "Region". -3. Choose the image for your virtual machine (for example, Ubuntu Pro 24.04 LTS) and select “Arm64” as the VM architecture. -4. In the “Size” field, click on “See all sizes” and select the D-Series v6 family of virtual machines. Select “D4ps_v6” from the list. +## Create an Arm-based Azure virtual machine -![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance.png "Figure 1: Select the D-Series v6 family of virtual machines") +Creating a virtual machine based on Azure Cobalt 100 is similar to creating any other virtual machine on Azure. -5. Select "SSH public key" as an Authentication type. Azure will automatically generate an SSH key pair for you and allow you to store it for future use. It is a fast, simple, and secure way to connect to your virtual machine. -6. 
Fill in the Administrator username for your VM. -7. Select "Generate new key pair", and select "RSA SSH Format" as the SSH Key Type. RSA could offer better security with keys longer than 3072 bits. Give a Key pair name to your SSH key. -8. In the "Inbound port rules", select HTTP (80) and SSH (22) as the inbound ports. The default port for NGINX when handling standard web traffic (HTTP) is 80. +To get started, launch the Azure portal and navigate to **Virtual Machines**. Then follow these steps: -![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance1.png "Figure 2: Allow inbound port rules") +- Select **Create**, then select **Virtual machine** from the drop-down list. +- In the **Basics** tab, fill in information about the instance such as **Virtual machine name** and **Region**. +- Select the image for your virtual machine (for example, Ubuntu Pro 24.04 LTS) and select **Arm64** as the VM architecture. +- In the **Size** field, select **See all sizes**, select the **D-series v6** family of virtual machines, then select **D4ps_v6**. -9. Click on the "Review + Create" tab and review the configuration for your virtual machine. It should look like the following: +![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance.png "Selecting the D-series v6 family of virtual machines") -![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/ubuntu-pro.png "Figure 3: Review and Create an Azure Cobalt 100 Arm64 VM") +- Now select **SSH public key** as the authentication type. Azure can generate an SSH key pair for you and store it for future use. It is a fast, simple, and secure way to connect to your virtual machine. +- Enter the **Administrator username** for your VM. +- Select **Generate new key pair**, then select **RSA SSH format** as the SSH key type. RSA can offer better security with keys longer than 3072 bits. 
+- Give a **Key pair name** to your SSH key. +- In **Inbound port rules**, select **HTTP (80)** and **SSH (22)** as the inbound ports. The default port for NGINX when handling standard web traffic (HTTP) is 80. -10. Finally, when you are confident about your selection, click on the "Create" button, and click on the "Download Private key and Create Resources" button. +![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance1.png "Allowing inbound port rules") -![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance4.png "Figure 4: Download Private key and Create Resources") +Now select the **Review and create** tab and review the configuration for your virtual machine. It should look like the following: -11. Your virtual machine should be ready and running within no time. You can SSH into the virtual machine using the private key, along with the Public IP details. +![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/ubuntu-pro.png "Reviewing and creating an Azure Cobalt 100 Arm64 VM") -![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/final-vm.png "Figure 5: VM deployment confirmation in Azure portal") +When you have made your selection, select **Create**, then **Download private key and create resources**. -{{% notice Note %}} +![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance4.png "Downloading private key and creating resources") -To learn more about Arm-based virtual machine in Azure, refer to “Getting Started with Microsoft Azure” in [Get started with Arm-based cloud instances](/learning-paths/servers-and-cloud-computing/csp/azure). +Your virtual machine should soon be ready and start running. You can SSH into the virtual machine using the private key and the **Public IP** details. 
-{{% /notice %}}
+![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/final-vm.png "VM deployment confirmation in Azure portal")
+
+{{% notice Note %}} To find out more about Arm-based virtual machines on Azure, see the section *Getting Started with Microsoft Azure* within the Learning Path [Get started with Arm-based cloud instances](/learning-paths/servers-and-cloud-computing/csp/azure). {{% /notice %}}
diff --git a/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/_index.md b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/_index.md
new file mode 100644
index 0000000000..4d901c770b
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/_index.md
@@ -0,0 +1,63 @@
+---
+title: Deploy SqueezeNet 1.0 INT8 model with ONNX Runtime on Azure Cobalt 100
+
+draft: true
+cascade:
+  draft: true
+
+minutes_to_complete: 60
+
+who_is_this_for: This Learning Path introduces ONNX deployment on Microsoft Azure Cobalt 100 (Arm-based) virtual machines. It is designed for developers migrating ONNX-based applications from x86_64 to Arm with minimal or no changes.
+
+learning_objectives:
+  - Provision an Azure Arm64 virtual machine using the Azure portal, with Ubuntu Pro 24.04 LTS as the base image.
+  - Deploy ONNX on the Ubuntu Pro virtual machine.
+  - Perform ONNX baseline testing and benchmarking on both x86_64 and Arm64 virtual machines.
+
+prerequisites:
+  - A [Microsoft Azure](https://azure.microsoft.com/) account with access to Cobalt 100-based instances (Dpsv6).
+  - Basic understanding of Python and machine learning concepts.
+  - Familiarity with [ONNX Runtime](https://onnxruntime.ai/docs/) and Azure cloud services.
+ +author: Jason Andrews + +### Tags +skilllevels: Advanced +subjects: ML +cloud_service_providers: Microsoft Azure + +armips: + - Neoverse + +tools_software_languages: + - Python + - ONNX Runtime + +operatingsystems: + - Linux + +further_reading: + - resource: + title: Azure Virtual Machines documentation + link: https://learn.microsoft.com/en-us/azure/virtual-machines/ + type: documentation + - resource: + title: ONNX Runtime Docs + link: https://onnxruntime.ai/docs/ + type: documentation + - resource: + title: ONNX (Open Neural Network Exchange) documentation + link: https://onnx.ai/ + type: documentation + - resource: + title: onnxruntime_perf_test tool - ONNX Runtime performance benchmarking + link: https://onnxruntime.ai/docs/performance/tune-performance/profiling-tools.html#in-code-performance-profiling + type: documentation + + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 # _index.md always has weight of 1 to order correctly +layout: "learningpathall" # All files under learning paths have this same wrapper +learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. +--- diff --git a/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/_next-steps.md b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/_next-steps.md new file mode 100644 index 0000000000..c3db0de5a2 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation. +title: "Next Steps" # Always the same, html page title. 
+layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
+---
diff --git a/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/background.md b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/background.md
new file mode 100644
index 0000000000..03ff40cd59
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/background.md
@@ -0,0 +1,21 @@
+---
+title: "Overview"
+
+weight: 2
+
+layout: "learningpathall"
+---
+
+## Cobalt 100 Arm-based processor
+
+Azure’s Cobalt 100 virtual machines are powered by Microsoft's first-generation, in-house Arm-based processor. Designed entirely by Microsoft and based on Arm’s Neoverse N2 architecture, this 64-bit CPU delivers improved performance and energy efficiency across a broad spectrum of cloud-native, scale-out Linux workloads. These include web and application servers, data analytics, open-source databases, caching systems, and more. Running at 3.4 GHz, the Cobalt 100 processor allocates a dedicated physical core for each vCPU, ensuring consistent and predictable performance.
+
+To learn more about Cobalt 100, see the blog [Announcing the preview of new Azure VMs based on the Azure Cobalt 100 processor](https://techcommunity.microsoft.com/blog/azurecompute/announcing-the-preview-of-new-azure-vms-based-on-the-azure-cobalt-100-processor/4146353).
+
+## ONNX
+ONNX (Open Neural Network Exchange) is an open-source format designed for representing machine learning models.
+It provides interoperability between different deep learning frameworks, enabling models trained in one framework (such as PyTorch or TensorFlow) to be deployed and run in another.
+
+ONNX models are serialized into a standardized format that can be executed by the **ONNX Runtime**, a high-performance inference engine optimized for CPU, GPU, and specialized hardware accelerators.
This separation of model training and inference allows developers to build flexible, portable, and production-ready AI workflows.
+
+ONNX is widely used in cloud, edge, and mobile environments to deliver efficient and scalable inference for deep learning models. Learn more from the [ONNX official website](https://onnx.ai/) and the [ONNX Runtime documentation](https://onnxruntime.ai/docs/).
diff --git a/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/baseline.md b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/baseline.md
new file mode 100644
index 0000000000..3e7ed69a1c
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/baseline.md
@@ -0,0 +1,52 @@
+---
+title: Baseline Testing
+weight: 5
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+
+## Baseline testing using ONNX Runtime
+
+This test measures inference latency with ONNX Runtime by timing how long it takes to process a single input using the `squeezenet-int8.onnx` model. It helps evaluate how efficiently the model runs on the target hardware.
+
+Create a file named **baseline.py** with the following code to run a baseline test of ONNX Runtime:
+
+```python
+import onnxruntime as ort
+import numpy as np
+import time
+
+session = ort.InferenceSession("squeezenet-int8.onnx")
+input_name = session.get_inputs()[0].name
+data = np.random.rand(1, 3, 224, 224).astype(np.float32)
+
+start = time.time()
+outputs = session.run(None, {input_name: data})
+end = time.time()
+
+print("Inference time:", end - start)
+```
+
+Run the baseline test:
+
+```console
+python3 baseline.py
+```
+You should see an output similar to:
+```output
+Inference time: 0.0026061534881591797
+```
+{{% notice Note %}}Inference time is the time a trained machine learning model takes to produce a prediction after receiving input data.
+The input tensor has shape (1, 3, 224, 224):
+- 1: batch size
+- 3: color channels (RGB)
+- 224 x 224: image resolution (common for models like SqueezeNet)
+{{% /notice %}}
+
+#### Output summary
+
+- Single inference latency: ~2.60 milliseconds (0.00260 sec)
+- This shows the initial (cold-start) inference performance of ONNX Runtime on CPU using an optimized INT8 quantized model.
+- This demonstrates that the setup is fully working, and that ONNX Runtime efficiently executes quantized models on Arm64.
diff --git a/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/benchmarking.md b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/benchmarking.md
new file mode 100644
index 0000000000..56d54578ae
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/benchmarking.md
@@ -0,0 +1,138 @@
+---
+title: Benchmarking via onnxruntime_perf_test
+weight: 6
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+Now that you have set up and run an ONNX model such as SqueezeNet, you can benchmark its inference performance using Python-based timing or tools like **onnxruntime_perf_test**. This helps you evaluate ONNX Runtime efficiency on Azure Arm64-based Cobalt 100 instances.
+
+You can also compare the inference time between Cobalt 100 (Arm64) and a similar D-series x86_64-based virtual machine on Azure.
+
+## Run the performance tests using onnxruntime_perf_test
+**onnxruntime_perf_test** is a performance benchmarking tool included in the ONNX Runtime source code. It measures the inference performance of ONNX models under various runtime conditions, such as different execution providers (CPU, GPU, or others).
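Before building the C++ tool, note that the Python-based timing mentioned above can approximate the same statistics. The sketch below extends the baseline script with a warm-up phase and latency percentiles; the `summarize` helper, the 10-run warm-up, and the rounded-rank percentile are illustrative choices rather than part of the tutorial.

```python
import statistics
import time

def summarize(latencies_s):
    """Summarize a list of per-run latencies (in seconds) as avg/percentiles in ms."""
    ordered = sorted(latencies_s)

    def pct(p):
        # Rank-based percentile (rounded rank) - a simple approximation
        idx = min(len(ordered) - 1, max(0, round(p / 100 * len(ordered)) - 1))
        return ordered[idx] * 1000

    return {
        "avg_ms": statistics.mean(ordered) * 1000,
        "p50_ms": pct(50),
        "p90_ms": pct(90),
        "p99_ms": pct(99),
    }

def benchmark(model_path="squeezenet-int8.onnx", warmup=10, runs=100):
    """Time repeated inferences, excluding cold-start effects via warm-up runs."""
    # Imported here so the summary helper above stays dependency-free
    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession(model_path)
    input_name = session.get_inputs()[0].name
    data = np.random.rand(1, 3, 224, 224).astype(np.float32)

    for _ in range(warmup):
        session.run(None, {input_name: data})

    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        session.run(None, {input_name: data})
        latencies.append(time.perf_counter() - start)
    return summarize(latencies)

# Inside the activated onnx-env, with the model downloaded, run:
# print(benchmark())
```

This reports the same style of metrics (average, P50, P90, P99) as onnxruntime_perf_test, without a 40-50 minute build, though the C++ tool remains the more rigorous option.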
+
+### Install required build tools
+
+```console
+sudo apt update
+sudo apt install -y build-essential cmake git unzip pkg-config
+sudo apt install -y protobuf-compiler libprotobuf-dev libprotoc-dev
+```
+Then verify:
+```console
+protoc --version
+```
+You should see an output similar to:
+
+```output
+libprotoc 3.21.12
+```
+### Build ONNX Runtime from source
+
+The benchmarking tool, **onnxruntime_perf_test**, isn't available as a pre-built binary for any platform, so you have to build it from source. Expect the build to take around 40-50 minutes.
+
+Clone ONNX Runtime:
+```console
+git clone --recursive https://github.com/microsoft/onnxruntime
+cd onnxruntime
+```
+Now build the benchmarking tool:
+
+```console
+./build.sh --config Release --build_dir build/Linux --build_shared_lib --parallel --build --update --skip_tests
+```
+This builds the tool at `./build/Linux/Release/onnxruntime_perf_test`.
+
+### Run the benchmark
+Now that the benchmarking tool has been built, you can benchmark the **squeezenet-int8.onnx** model:
+
+```console
+./build/Linux/Release/onnxruntime_perf_test -e cpu -r 100 -m times -s -Z -I squeezenet-int8.onnx
+```
+- **-e cpu**: Use the CPU execution provider (not GPU or any other backend).
+- **-r 100**: Run 100 inferences.
+- **-m times**: Use "repeat N times" mode.
+- **-s**: Show detailed statistics.
+- **-Z**: Disable intra-op thread spinning (reduces CPU usage when idle between runs).
+- **-I**: Pass the ONNX model path directly, without using input/output test data.
+ +You should see an output similar to: + +```output +Disabling intra-op thread spinning between runs +Session creation time cost: 0.0102016 s +First inference time cost: 2 ms +Total inference time cost: 0.185739 s +Total inference requests: 100 +Average inference time cost: 1.85739 ms +Total inference run time: 0.18581 s +Number of inferences per second: 538.184 +Avg CPU usage: 96 % +Peak working set size: 36696064 bytes +Avg CPU usage:96 +Peak working set size:36696064 +Runs:100 +Min Latency: 0.00183404 s +Max Latency: 0.00190312 s +P50 Latency: 0.00185674 s +P90 Latency: 0.00187215 s +P95 Latency: 0.00187393 s +P99 Latency: 0.00190312 s +P999 Latency: 0.00190312 s +``` +### Benchmark Metrics Explained + +- **Average Inference Time**: The mean time taken to process a single inference request across all runs. Lower values indicate faster model execution. +- **Throughput**: The number of inference requests processed per second. Higher throughput reflects the model’s ability to handle larger workloads efficiently. +- **CPU Utilization**: The percentage of CPU resources used during inference. A value close to 100% indicates full CPU usage, which is expected during performance benchmarking. +- **Peak Memory Usage**: The maximum amount of system memory (RAM) consumed during inference. Lower memory usage is beneficial for resource-constrained environments. +- **P50 Latency (Median Latency)**: The time below which 50% of inference requests complete. Represents typical latency under normal load. +- **Latency Consistency**: Describes the stability of latency values across all runs. "Consistent" indicates predictable inference performance with minimal jitter. + +### Benchmark summary on Arm64: +Here is a summary of benchmark results collected on an Arm64 **D4ps_v6 Ubuntu Pro 24.04 LTS virtual machine**. 
+ +| **Metric** | **Value** | +|----------------------------|-------------------------------| +| **Average Inference Time** | 1.857 ms | +| **Throughput** | 538.18 inferences/sec | +| **CPU Utilization** | 96% | +| **Peak Memory Usage** | 36.70 MB | +| **P50 Latency** | 1.857 ms | +| **P90 Latency** | 1.872 ms | +| **P95 Latency** | 1.874 ms | +| **P99 Latency** | 1.903 ms | +| **P999 Latency** | 1.903 ms | +| **Max Latency** | 1.903 ms | +| **Latency Consistency** | Consistent | + + +### Benchmark summary on x86 +Here is a summary of benchmark results collected on x86 **D4s_v6 Ubuntu Pro 24.04 LTS virtual machine**. + +| **Metric** | **Value on Virtual Machine** | +|----------------------------|-------------------------------| +| **Average Inference Time** | 1.413 ms | +| **Throughput** | 707.48 inferences/sec | +| **CPU Utilization** | 100% | +| **Peak Memory Usage** | 38.80 MB | +| **P50 Latency** | 1.396 ms | +| **P90 Latency** | 1.501 ms | +| **P95 Latency** | 1.520 ms | +| **P99 Latency** | 1.794 ms | +| **P999 Latency** | 1.794 ms | +| **Max Latency** | 1.794 ms | +| **Latency Consistency** | Consistent | + + +### Highlights from Ubuntu Pro 24.04 Arm64 Benchmarking + +When comparing the results on Arm64 vs x86_64 virtual machines: +- **Low-Latency Inference:** Achieved consistent average inference times of ~1.86 ms on Arm64. +- **Strong and Stable Throughput:** Sustained throughput of over 538 inferences/sec using the `squeezenet-int8.onnx` model on D4ps_v6 instances. +- **Lightweight Resource Footprint:** Peak memory usage stayed below 37 MB, with CPU utilization around 96%, ideal for efficient edge or cloud inference. +- **Consistent Performance:** P50, P95, and Max latency remained tightly bound, showcasing reliable performance on Azure Cobalt 100 Arm-based infrastructure. + +You have now benchmarked ONNX on an Azure Cobalt 100 Arm64 virtual machine and compared results with x86_64. 
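To put the two summary tables side by side numerically, you can compute relative differences with a short script. The values below are copied from the benchmark summaries above; the helper name is illustrative.

```python
def relative_diff_pct(arm_value, x86_value):
    """Percentage difference of the Arm64 value relative to the x86_64 value."""
    return (arm_value - x86_value) / x86_value * 100

# Metric: (Arm64 D4ps_v6, x86_64 D4s_v6), taken from the tables above
results = {
    "avg_inference_ms": (1.857, 1.413),
    "throughput_ips": (538.18, 707.48),
    "peak_memory_mb": (36.70, 38.80),
}

for metric, (arm64, x86_64) in results.items():
    print(f"{metric}: Arm64={arm64} x86_64={x86_64} "
          f"diff={relative_diff_pct(arm64, x86_64):+.1f}%")
```

For example, average inference time is about 31% higher on the Arm64 instance in these runs, while peak memory usage is about 5% lower.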
diff --git a/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/create-instance.md b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/create-instance.md
new file mode 100644
index 0000000000..9571395aa2
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/create-instance.md
@@ -0,0 +1,50 @@
+---
+title: Create an Arm-based cloud virtual machine using the Microsoft Cobalt 100 CPU
+weight: 3
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Introduction
+
+There are several ways to create an Arm-based Cobalt 100 virtual machine: the Microsoft Azure portal, the Azure CLI, or your choice of IaC (Infrastructure as Code) tool. This guide uses the Azure portal to create a virtual machine with the Arm-based Cobalt 100 processor.
+
+This Learning Path focuses on the general-purpose virtual machines of the D-series. For more information, see the Microsoft Azure guide on the [Dpsv6 size series](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/general-purpose/dpsv6-series).
+
+If you have never used Microsoft Azure before, review the Microsoft guide [Create a Linux virtual machine in the Azure portal](https://learn.microsoft.com/en-us/azure/virtual-machines/linux/quick-create-portal?tabs=ubuntu).
+
+#### Create an Arm-based Azure virtual machine
+
+Creating a virtual machine based on Azure Cobalt 100 is no different from creating any other virtual machine in Azure. To create an Azure virtual machine, launch the Azure portal and navigate to "Virtual Machines".
+1. Select "Create", and click on "Virtual Machine" from the drop-down list.
+2. Inside the "Basics" tab, fill in the Instance details such as "Virtual machine name" and "Region".
+3. Choose the image for your virtual machine (for example, Ubuntu Pro 24.04 LTS) and select "Arm64" as the VM architecture.
+4. In the "Size" field, click on "See all sizes" and select the D-Series v6 family of virtual machines.
Select "D4ps_v6" from the list.
+
+![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance.png "Figure 1: Select the D-Series v6 family of virtual machines")
+
+5. Select "SSH public key" as the Authentication type. Azure will automatically generate an SSH key pair for you and allow you to store it for future use. It is a fast, simple, and secure way to connect to your virtual machine.
+6. Fill in the Administrator username for your VM.
+7. Select "Generate new key pair", and select "RSA SSH Format" as the SSH Key Type. RSA can offer better security with keys longer than 3072 bits. Give a Key pair name to your SSH key.
+8. In the "Inbound port rules", select HTTP (80) and SSH (22) as the inbound ports.
+
+![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance1.png "Figure 2: Allow inbound port rules")
+
+9. Click on the "Review + Create" tab and review the configuration for your virtual machine. It should look like the following:
+
+![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/ubuntu-pro.png "Figure 3: Review and Create an Azure Cobalt 100 Arm64 VM")
+
+10. Finally, when you are confident about your selection, click on the "Create" button, and then click on the "Download Private key and Create Resources" button.
+
+![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance4.png "Figure 4: Download Private key and Create Resources")
+
+11. Your virtual machine should be ready and running shortly. You can SSH into the virtual machine using the private key, along with the Public IP details.
+
+![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/final-vm.png "Figure 5: VM deployment confirmation in Azure portal")
+
+{{% notice Note %}}
+
+To learn more about Arm-based virtual machines in Azure, refer to "Getting Started with Microsoft Azure" in [Get started with Arm-based cloud instances](/learning-paths/servers-and-cloud-computing/csp/azure).
+
+{{% /notice %}}
diff --git a/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/deploy.md b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/deploy.md
new file mode 100644
index 0000000000..971777eb11
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/deploy.md
@@ -0,0 +1,78 @@
+---
+title: ONNX Installation
+weight: 4
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+
+## ONNX Installation on Azure Ubuntu Pro 24.04 LTS
+Install Python, create a virtual environment, and use pip to install ONNX, ONNX Runtime, and their dependencies. Then verify the setup and validate a sample ONNX model such as SqueezeNet.
+
+### Install Python and create a virtual environment
+
+```console
+sudo apt update
+sudo apt install -y python3 python3-pip python3-venv
+```
+Create and activate a virtual environment:
+
+```console
+python3 -m venv onnx-env
+source onnx-env/bin/activate
+```
+{{% notice Note %}}Using a virtual environment isolates ONNX and its dependencies to avoid conflicts with system packages.{{% /notice %}}
+
+### Install ONNX and required libraries
+
+```console
+pip install --upgrade pip
+pip install onnx onnxruntime fastapi uvicorn numpy
+```
+This installs the ONNX libraries along with FastAPI (for web serving) and NumPy (for input tensor generation).
+
+### Validate ONNX and ONNX Runtime
+Create a file named **version.py** with the following code:
+
+```python
+import onnx
+import onnxruntime
+
+print("ONNX version:", onnx.__version__)
+print("ONNX Runtime version:", onnxruntime.__version__)
+```
+Now run version.py:
+
+```console
+python3 version.py
+```
+You should see an output similar to:
+```output
+ONNX version: 1.18.0
+ONNX Runtime version: 1.22.0
+```
+### Download and validate the ONNX model - SqueezeNet
+SqueezeNet is a lightweight convolutional neural network (CNN) architecture designed to achieve accuracy comparable to AlexNet, but with fewer parameters and a smaller model size.
+
+```console
+wget https://github.com/onnx/models/raw/main/validated/vision/classification/squeezenet/model/squeezenet1.0-12-int8.onnx -O squeezenet-int8.onnx
+```
+#### Validate the model
+
+Create a file named **validation.py** with the following code to validate the ONNX model:
+
+```python
+import onnx
+
+model = onnx.load("squeezenet-int8.onnx")
+onnx.checker.check_model(model)
+print("✅ Model is valid!")
+```
+Run the check with `python3 validation.py`. You should see an output similar to:
+```output
+✅ Model is valid!
+```
+This downloads a quantized (INT8) classification model and validates its structure using ONNX's built-in checker.
+
+ONNX installation and model validation are complete. You can now proceed with baseline testing.
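The pip install earlier included FastAPI and uvicorn, which the rest of this Learning Path does not use. If you want to serve the validated model over HTTP, the sketch below is one way to do it; the `/classify` endpoint name, the `build_app` factory, and the random input standing in for a preprocessed image are illustrative assumptions, not part of the tutorial.

```python
# serve.py - minimal sketch: expose squeezenet-int8.onnx via an HTTP endpoint.
def top_class(scores):
    """Index of the highest score in a flat list of class scores."""
    return max(range(len(scores)), key=lambda i: scores[i])

def build_app(model_path="squeezenet-int8.onnx"):
    """App factory: load the model once, then serve predictions."""
    import numpy as np
    import onnxruntime as ort
    from fastapi import FastAPI

    app = FastAPI()
    session = ort.InferenceSession(model_path)
    input_name = session.get_inputs()[0].name

    @app.post("/classify")
    def classify():
        # A random tensor stands in for a preprocessed 224x224 RGB image
        data = np.random.rand(1, 3, 224, 224).astype(np.float32)
        outputs = session.run(None, {input_name: data})
        return {"top_class_index": top_class(outputs[0].ravel().tolist())}

    return app

# Inside the activated onnx-env, start the server with uvicorn's factory mode:
#   uvicorn serve:build_app --factory --port 8000
```

Loading the session once in the factory, rather than per request, keeps inference latency close to the baseline numbers measured in the next section.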
diff --git a/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/images/final-vm.png b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/images/final-vm.png new file mode 100644 index 0000000000..5207abfb41 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/images/final-vm.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/images/instance.png b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/images/instance.png new file mode 100644 index 0000000000..285cd764a5 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/images/instance.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/images/instance1.png b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/images/instance1.png new file mode 100644 index 0000000000..b9d22c352d Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/images/instance1.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/images/instance4.png b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/images/instance4.png new file mode 100644 index 0000000000..2a0ff1e3b0 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/images/instance4.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/images/ubuntu-pro.png b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/images/ubuntu-pro.png new file mode 100644 index 0000000000..d54bd75ca6 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/images/ubuntu-pro.png differ