
Commit 75d89b7

Merge pull request #2341 from odincodeshen/main
llama.cpp streamline with lowercase file name.
2 parents 65a3a06 + 5c70db8 commit 75d89b7

21 files changed (+682, -460 lines)

assets/contributors.csv

Lines changed: 1 addition & 0 deletions
@@ -102,3 +102,4 @@ Ker Liu,,,,,
 Rui Chang,,,,,
 Alejandro Martinez Vicente,Arm,,,,
 Mohamad Najem,Arm,,,,
+Zenon Zhilong Xiu,Arm,,zenon-zhilong-xiu-491bb398,,
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
---
title: Overview
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Overview: Profiling LLMs on Arm CPUs with Streamline

Large Language Models (LLMs) run efficiently on Arm CPUs.
Frameworks such as [**llama.cpp**](https://github.com/ggml-org/llama.cpp) provide a convenient way to run LLMs, but they also come with a certain level of complexity.

To analyze their execution and use profiling insights for optimization, you need both a basic understanding of transformer architectures and the right analysis tools.

This learning path demonstrates how to use the **llama-cli** application from llama.cpp together with **Arm Streamline** to analyze the efficiency of LLM inference on Arm CPUs.

In this guide you will learn how to:
- Profile token generation at the **Prefill** and **Decode** stages
- Profile execution of individual tensor nodes and operators
- Profile LLM execution across **multiple threads and cores**

You will run the **Qwen1_5-0_5b-chat-q4_0.gguf** model with llama-cli on **Arm64 Linux** and use Streamline for analysis.
The same method can also be applied to **Arm64 Android** platforms.

## Prerequisites
Before starting this guide, you should have:
- A basic understanding of llama.cpp
- An understanding of the transformer model architecture
- Working knowledge of Arm Streamline
- An Arm Neoverse or Cortex-A hardware platform running Linux or Android to test the application
Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
---
title: Understand llama.cpp
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Understand llama.cpp

**llama.cpp** is an open-source LLM framework implemented in C++ that supports both training and inference.
This learning path focuses only on **inference on the CPU**.

The **llama-cli** tool provides a command-line interface to run LLMs with the llama.cpp inference engine.
It supports text generation, chat mode, and grammar-constrained output directly from the terminal.

![text#center](images/llama_structure.png "Figure 1. llama-cli Flow")

### What llama-cli does
1. Load and interpret LLMs in **.gguf** format
2. Build a **compute graph** based on the model structure
   - The graph can be divided into subgraphs, each assigned to the most suitable backend device
   - In this guide, all operators are executed on the **CPU backend**
3. Allocate memory for tensor nodes using the **graph planner**
4. Execute tensor nodes in the graph during the **graph_compute** stage, which traverses nodes and forwards work to backend devices

Steps 2 to 4 are wrapped inside the function **`llama_decode`**.
During **Prefill** and **Decode**, `llama-cli` repeatedly calls `llama_decode` to generate tokens.
The parameter **`llama_batch`** passed to `llama_decode` differs between stages; it contains the input tokens, their count, and their positions.
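
The simplified sketch below illustrates this calling pattern. It is not the actual llama-cli source; `prompt_tokens`, `n_prompt`, `n_predict`, and `sample_next_token()` are illustrative placeholders.

```c
// Simplified sketch (not the actual llama-cli code) of how llama_decode
// is driven at the two stages.
#include "llama.h"

// Placeholder: in llama-cli the next token comes from the sampler chain.
llama_token sample_next_token(struct llama_context * ctx);

void generate(struct llama_context * ctx, llama_token * prompt_tokens,
              int32_t n_prompt, int n_predict) {
    // Prefill: the whole prompt is submitted as one batch (GEMM-heavy).
    llama_decode(ctx, llama_batch_get_one(prompt_tokens, n_prompt));

    // Decode: tokens are generated and fed back one at a time
    // (GEMV-heavy, bounded by KV cache and weight memory traffic).
    llama_token tok = sample_next_token(ctx);
    for (int i = 0; i < n_predict; i++) {
        llama_decode(ctx, llama_batch_get_one(&tok, 1));
        tok = sample_next_token(ctx);
    }
}
```
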
### Components of llama.cpp
The components of llama.cpp include:
![text#center](images/llama_componetns.jpg "Figure 2. llama.cpp components")

llama.cpp supports various backends, such as `CPU`, `GPU`, `CUDA`, and `OpenCL`.

For the CPU backend, it provides an optimized `ggml-cpu` library (mainly utilizing CPU vector instructions).
For Arm CPUs, the `ggml-cpu` library also offers an `aarch64` trait that leverages the new **I8MM** instructions for acceleration.
The `ggml-cpu` library also integrates the Arm [KleidiAI](https://github.com/ARM-software/kleidiai) library as an additional trait.

### Prefill and Decode in autoregressive LLMs
Most autoregressive LLMs are decoder-only models.
Here is a brief introduction to the Prefill and Decode stages of autoregressive LLMs.
![text#center](images/llm_prefill_decode.jpg "Figure 3. Prefill and Decode stage")

At the Prefill stage, multiple input tokens of the prompt are processed together.
This stage mainly performs GEMM operations (a matrix multiplied by another matrix) to generate the first output token.
![text#center](images/transformer_prefill.jpg "Figure 4. Prefill stage")

At the Decode stage, by utilizing the [KV cache](https://huggingface.co/blog/not-lain/kv-caching), the model mainly performs GEMV operations (a vector multiplied by a matrix) to generate subsequent output tokens one by one.
![text#center](images/transformer_decode.jpg "Figure 5. Decode stage")

Therefore:
- **Prefill** is **compute-bound**, dominated by large GEMM operations
- **Decode** is **memory-bound**, dominated by KV cache access and GEMV operations

This can be seen in the subsequent analysis with Streamline.
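
To make the distinction concrete, here is a rough, illustrative C sketch (not llama.cpp code; dimensions and function names are made up). In the GEMM case each weight element fetched from memory is reused across every row of the token batch, while in the GEMV case it is used exactly once, which is why Decode tends to be limited by memory bandwidth.

```c
// Rough illustration (not llama.cpp code) of why Prefill leans on GEMM
// and Decode leans on GEMV.
#include <stddef.h>

// Prefill: n_tokens rows are processed at once, so each weight element
// loaded from memory is reused n_tokens times -> compute-bound.
void gemm(const float *x, const float *w, float *y,
          size_t n_tokens, size_t k, size_t n) {
    for (size_t t = 0; t < n_tokens; t++)
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t i = 0; i < k; i++)
                acc += x[t * k + i] * w[i * n + j];
            y[t * n + j] = acc;
        }
}

// Decode: a single token (one row), so every weight element is read once
// and used once -> performance is limited by memory bandwidth.
void gemv(const float *x, const float *w, float *y, size_t k, size_t n) {
    for (size_t j = 0; j < n; j++) {
        float acc = 0.0f;
        for (size_t i = 0; i < k; i++)
            acc += x[i] * w[i * n + j];
        y[j] = acc;
    }
}
```
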
Lines changed: 196 additions & 0 deletions
@@ -0,0 +1,196 @@
---
title: Integrating Streamline Annotations into llama.cpp
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Integrating Streamline Annotations into llama.cpp

To visualize token generation at the **Prefill** and **Decode** stages, we use **Streamline’s Annotation Marker** feature.
This requires integrating annotation support into the **llama.cpp** project.
More information about the Annotation Marker API can be found [here](https://developer.arm.com/documentation/101816/9-7/Annotate-your-code?lang=en).
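
As a quick preview, the minimal sketch below shows how the two macros used later in this guide fit together in a standalone program. It is an illustrative example, not llama.cpp code, and it assumes you link against the `libstreamline_annotate.a` library built in Step 1.

```c
// Minimal, illustrative use of the Annotation Marker API.
// Build by linking against libstreamline_annotate.a (see Step 1).
#include <stdio.h>
#include "streamline_annotate.h"

int main(void) {
    ANNOTATE_SETUP;                // set up annotation support once at startup

    for (int i = 0; i < 4; i++) {
        char buf[64];
        snprintf(buf, sizeof(buf), "iteration %d", i);
        ANNOTATE_MARKER_STR(buf);  // marker appears on the Streamline timeline
        // ... work to be profiled ...
    }
    return 0;
}
```
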
{{% notice Note %}}
You can either build natively on an **Arm platform**, or cross-compile on another architecture using an Arm cross-compiler toolchain.
{{% /notice %}}

### Step 1: Build Streamline Annotation library

Install [Arm DS](https://developer.arm.com/Tools%20and%20Software/Arm%20Development%20Studio) or [Arm Streamline](https://developer.arm.com/Tools%20and%20Software/Streamline%20Performance%20Analyzer) on your development machine first.

The Streamline Annotation support code can be found in the installation directory, for example *"Arm\Development Studio 2024.1\sw\streamline\gator\annotate"*.

For installation guidance, refer to the [Streamline installation guide](https://learn.arm.com/install-guides/streamline/).

Clone the gator repository that matches your Streamline version and build the Annotation support library.

The installation steps depend on your development machine.

For an Arm native build, use the following instructions to install the required packages.
For other machines, set up the cross-compilation environment by installing the [aarch64 gcc compiler toolchain](https://developer.arm.com/downloads/-/arm-gnu-toolchain-downloads).
You can refer to this [guide](https://learn.arm.com/install-guides/gcc/cross/) for cross-compiler installation.

{{< tabpane code=true >}}
{{< tab header="Arm Native Build" language="bash">}}
apt-get update
apt-get install ninja-build cmake gcc g++ g++-aarch64-linux-gnu curl zip unzip tar pkg-config git
cd ~
git clone https://github.com/ARM-software/gator.git
cd gator
./build-linux.sh

cd annotate
make
{{< /tab >}}
{{< tab header="Cross Compiler" language="bash">}}
apt-get update
apt-get install ninja-build cmake gcc g++ g++-aarch64-linux-gnu curl zip unzip tar pkg-config git
cd ~
git clone https://github.com/ARM-software/gator.git

cd gator
make CROSS_COMPILE=/path/to/aarch64_linux_gcc_tool
{{< /tab >}}
{{< /tabpane >}}

Once complete, the static library **libstreamline_annotate.a** is generated at `~/gator/annotate/libstreamline_annotate.a`, and the header file is at `gator/annotate/streamline_annotate.h`.

### Step 2: Integrate Annotation Marker into llama.cpp

Next, we need to install **llama.cpp** to run the LLM model.
To make the following performance profiling content easier to follow, this Learning Path uses a specific release version of llama.cpp so that the steps and results remain consistent.

Before building **llama.cpp**, create a directory `streamline_annotation` and copy the library `libstreamline_annotate.a` and the header file `streamline_annotate.h` into that folder:

```bash
cd ~
wget https://github.com/ggml-org/llama.cpp/archive/refs/tags/b6202.tar.gz
tar -xvzf b6202.tar.gz
mv llama.cpp-b6202 llama.cpp
cd ./llama.cpp
mkdir streamline_annotation
cp ~/gator/annotate/libstreamline_annotate.a ~/gator/annotate/streamline_annotate.h streamline_annotation
```

To link the `libstreamline_annotate.a` library when building llama-cli, add the following lines at the end of `llama.cpp/tools/main/CMakeLists.txt`:

```cmake
set(STREAMLINE_LIB_PATH "${CMAKE_SOURCE_DIR}/streamline_annotation/libstreamline_annotate.a")
target_include_directories(llama-cli PRIVATE "${CMAKE_SOURCE_DIR}/streamline_annotation")
target_link_libraries(llama-cli PRIVATE "${STREAMLINE_LIB_PATH}")
```

To add Annotation Markers to llama-cli, modify **llama.cpp/tools/main/main.cpp** by adding the following include:

```c
#include "streamline_annotate.h"
```

After the call to `common_init()`, add the setup macro:

```c
common_init();
// Add the Annotation setup code
ANNOTATE_SETUP;
```

Finally, add an annotation marker inside the main loop:

```c
for (int i = 0; i < (int) embd.size(); i += params.n_batch) {
    int n_eval = (int) embd.size() - i;
    if (n_eval > params.n_batch) {
        n_eval = params.n_batch;
    }

    LOG_DBG("eval: %s\n", string_from(ctx, embd).c_str());

    // Add annotation marker code for Streamline
    {
        char printf_buf[200];
        sprintf(printf_buf, "past %d, n_eval %d", n_past, n_eval);
        ANNOTATE_MARKER_STR(printf_buf);
    }
    // End of annotation marker

    if (llama_decode(ctx, llama_batch_get_one(&embd[i], n_eval))) {
        LOG_ERR("%s : failed to eval\n", __func__);
        return 1;
    }
```

A string is added to the annotation marker to record the position of the input tokens and the number of tokens to be processed.

### Step 3: Build llama-cli

For convenience, llama-cli is **statically linked**.

First, create a new directory `build` under the llama.cpp root directory and go into it:

```bash
cd ~/llama.cpp
mkdir ./build && cd ./build
```

Then configure the project by running:

{{< tabpane code=true >}}
{{< tab header="Arm Native Build" language="bash">}}
cmake .. \
    -DGGML_NATIVE=ON \
    -DLLAMA_F16C=OFF \
    -DLLAMA_GEMM_ARM=ON \
    -DBUILD_SHARED_LIBS=OFF \
    -DCMAKE_EXE_LINKER_FLAGS="-static -g" \
    -DGGML_OPENMP=OFF \
    -DCMAKE_C_FLAGS="-march=armv8.2-a+dotprod+i8mm -g" \
    -DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod+i8mm -g" \
    -DGGML_CPU_KLEIDIAI=ON \
    -DLLAMA_BUILD_TESTS=OFF \
    -DLLAMA_BUILD_EXAMPLES=ON \
    -DLLAMA_CURL=OFF
{{< /tab >}}
{{< tab header="Cross Compiler" language="bash">}}
cmake .. \
    -DCMAKE_SYSTEM_NAME=Linux \
    -DCMAKE_SYSTEM_PROCESSOR=arm \
    -DCMAKE_C_COMPILER=aarch64-none-linux-gnu-gcc \
    -DCMAKE_CXX_COMPILER=aarch64-none-linux-gnu-g++ \
    -DLLAMA_NATIVE=OFF \
    -DLLAMA_F16C=OFF \
    -DLLAMA_GEMM_ARM=ON \
    -DBUILD_SHARED_LIBS=OFF \
    -DCMAKE_EXE_LINKER_FLAGS="-static -g" \
    -DGGML_OPENMP=OFF \
    -DCMAKE_C_FLAGS="-march=armv8.2-a+dotprod+i8mm -g" \
    -DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod+i8mm -g" \
    -DGGML_CPU_KLEIDIAI=ON \
    -DLLAMA_BUILD_TESTS=OFF \
    -DLLAMA_BUILD_EXAMPLES=ON \
    -DLLAMA_CURL=OFF
{{< /tab >}}
{{< /tabpane >}}

For the cross-compiler build, set `CMAKE_C_COMPILER` and `CMAKE_CXX_COMPILER` to your cross-compiler paths. Make sure that the **-march** value in `CMAKE_C_FLAGS` and `CMAKE_CXX_FLAGS` matches your Arm CPU hardware.

In this learning path, we run llama-cli on an Arm CPU that supports **NEON Dotprod** and **I8MM** instructions.
Therefore, we specify **armv8.2-a+dotprod+i8mm**.
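
If you are unsure whether your target CPU supports these features, the short illustrative program below (not part of llama.cpp) queries the Linux hwcaps. It assumes an Arm64 Linux target with reasonably recent kernel headers; the `HWCAP` macros are guarded in case your headers do not define them.

```c
// Illustrative helper (not part of llama.cpp): check at runtime that the
// CPU provides the dotprod and i8mm features selected by -march above.
// Assumes an Arm64 Linux target.
#include <stdio.h>
#include <sys/auxv.h>
#include <asm/hwcap.h>

int main(void) {
    unsigned long hwcap  = getauxval(AT_HWCAP);
    unsigned long hwcap2 = getauxval(AT_HWCAP2);

#ifdef HWCAP_ASIMDDP
    printf("dotprod: %s\n", (hwcap & HWCAP_ASIMDDP) ? "yes" : "no");
#endif
#ifdef HWCAP2_I8MM
    printf("i8mm   : %s\n", (hwcap2 & HWCAP2_I8MM) ? "yes" : "no");
#endif
    return 0;
}
```
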

We also specify the **-static** and **-g** options:
- **-static**: produces a statically linked executable, so it can run on different Arm64 Linux/Android environments without needing shared libraries.
- **-g**: includes debug information, which makes source code and function-level profiling in Streamline much easier.

Now you can build the project by running:

```bash
cd ~/llama.cpp/build
cmake --build ./ --config Release
```

After the build completes, the llama-cli executable is generated in the **~/llama.cpp/build/bin/** directory.
