
Commit 4a96c91

Merge pull request #2384 from jasonrandrews/review
Complete llama.cpp Streamline technical review
2 parents d9b191c + d2eb1a4 commit 4a96c91

5 files changed: +39 additions, -38 deletions

content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/1_overview.md

Lines changed: 2 additions & 2 deletions
@@ -14,7 +14,7 @@ Frameworks such as [**llama.cpp**](https://github.com/ggml-org/llama.cpp), provi
 
 To analyze their execution and use profiling insights for optimization, you need both a basic understanding of transformer architectures and the right analysis tools.
 
-This Learning Path demonstrates how to use `llama-cli` application from llama.cpp together with Arm Streamline to analyze the efficiency of LLM inference on Arm CPUs.
+This Learning Path demonstrates how to use `llama-cli` from the command line together with Arm Streamline to analyze the efficiency of LLM inference on Arm CPUs.
 
 You will learn how to:
 - Profile token generation at the Prefill and Decode stages
@@ -23,4 +23,4 @@ You will learn how to:
 
 You will run the `Qwen1_5-0_5b-chat-q4_0.gguf` model using `llama-cli` on Arm Linux and use Streamline for analysis.
 
-The same method can also be applied to Android platforms.
+The same method can also be used on Android.

content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/2_llama.cpp_intro.md

Lines changed: 1 addition & 1 deletion
@@ -83,4 +83,4 @@ At the Decode stage, by utilizing the [KV cache](https://huggingface.co/blog/not
 
 In summary, Prefill is compute-bound, dominated by large GEMM operations, and Decode is memory-bound, dominated by KV cache access and GEMV operations.
 
-You will see this highlighted during the analysis with Streamline.
+You will see this highlighted during the Streamline performance analysis.
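The compute-bound versus memory-bound split summarized in this file can be illustrated with a rough arithmetic-intensity estimate. The sketch below is illustrative only and is not part of the Learning Path; the matrix dimensions are made up:

```python
# Rough arithmetic-intensity sketch: why Prefill (batched GEMM) is
# compute-bound while Decode (GEMV) is memory-bound. Dimensions are
# hypothetical, chosen only to illustrate the ratio.

def gemm_intensity(m, n, k, bytes_per_elem=4):
    """FLOPs per byte moved for an (m x k) @ (k x n) matrix multiply."""
    flops = 2 * m * n * k                                   # multiply-accumulates
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem  # A, B, C traffic
    return flops / bytes_moved

# Prefill: many prompt tokens processed in one batch -> large m.
prefill = gemm_intensity(m=512, n=4096, k=4096)
# Decode: one token generated at a time -> GEMV, m = 1.
decode = gemm_intensity(m=1, n=4096, k=4096)

print(f"Prefill (GEMM): ~{prefill:.1f} FLOPs/byte")  # ~204.8
print(f"Decode  (GEMV): ~{decode:.1f} FLOPs/byte")   # ~0.5
```

With only about 0.5 FLOPs per byte, Decode cannot keep the vector units busy and is limited by how fast the weights and KV cache stream from memory, which is what the Streamline capture makes visible.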

content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/3_llama.cpp_annotation.md

Lines changed: 23 additions & 24 deletions
@@ -20,36 +20,33 @@ You can either build natively on an Arm platform, or cross-compile on another ar
 
 ### Step 1: Build Streamline Annotation library
 
-Install [Arm DS](https://developer.arm.com/Tools%20and%20Software/Arm%20Development%20Studio) or [Arm Streamline](https://developer.arm.com/Tools%20and%20Software/Streamline%20Performance%20Analyzer) on your development machine first.
+Download and install [Arm Performance Studio](https://developer.arm.com/Tools%20and%20Software/Arm%20Performance%20Studio#Downloads) on your development machine.
 
-Streamline Annotation support code is in the installation directory such as `Arm/Development Studio 2024.1/sw/streamline/gator/annotate`.
-
-For installation guidance, refer to the [Streamline installation guide](/install-guides/streamline/).
-
-Clone the gator repository that matches your Streamline version and build the `Annotation support library`.
+{{% notice Note %}}
+You can also download and install [Arm Development Studio](https://developer.arm.com/Tools%20and%20Software/Arm%20Development%20Studio#Downloads), as it also includes Streamline.
 
-The installation step depends on your development machine.
+{{% /notice %}}
 
-For Arm native build, you can use the following instructions to install the packages.
+Streamline Annotation support code is in the Arm Performance Studio installation directory, under the `streamline/gator/annotate` directory.
 
-For other machines, you need to set up the cross compiler environment by installing [Arm GNU toolchain](https://developer.arm.com/downloads/-/arm-gnu-toolchain-downloads).
+Clone the gator repository that matches your Streamline version and build the annotation support library. You can build it on your current machine using the native build instructions, or cross-compile it for another Arm computer using the cross-compile instructions.
 
-You can refer to the [GCC install guide](https://learn.arm.com/install-guides/gcc/cross/) for cross-compiler installation.
+If you need to set up a cross compiler, you can review the [GCC install guide](/install-guides/gcc/cross/).
 
 {{< tabpane code=true >}}
 {{< tab header="Arm Native Build" language="bash">}}
-apt-get update
-apt-get install ninja-build cmake gcc g++ g++-aarch64-linux-gnu curl zip unzip tar pkg-config git
+sudo apt-get update
+sudo apt-get install -y ninja-build cmake gcc g++ g++-aarch64-linux-gnu curl zip unzip tar pkg-config git
 cd ~
 git clone https://github.com/ARM-software/gator.git
 cd gator
 ./build-linux.sh
 cd annotate
 make
 {{< /tab >}}
-{{< tab header="Cross Compiler" language="bash">}}
-apt-get update
-apt-get install ninja-build cmake gcc g++ g++-aarch64-linux-gnu curl zip unzip tar pkg-config git
+{{< tab header="Cross Compile" language="bash">}}
+sudo apt-get update
+sudo apt-get install ninja-build cmake gcc g++ g++-aarch64-linux-gnu curl zip unzip tar pkg-config git
 cd ~
 git clone https://github.com/ARM-software/gator.git
 cd gator
@@ -79,29 +76,31 @@ mkdir streamline_annotation
 cp ~/gator/annotate/libstreamline_annotate.a ~/gator/annotate/streamline_annotate.h streamline_annotation
 ```
 
-To link the `libstreamline_annotate.a` library when building llama-cli, add the following lines at the end of `llama.cpp/tools/main/CMakeLists.txt`.
+To link the `libstreamline_annotate.a` library when building llama-cli, use an editor to add the following lines at the end of `llama.cpp/tools/main/CMakeLists.txt`.
 
 ```makefile
 set(STREAMLINE_LIB_PATH "${CMAKE_SOURCE_DIR}/streamline_annotation/libstreamline_annotate.a")
 target_include_directories(llama-cli PRIVATE "${CMAKE_SOURCE_DIR}/streamline_annotation")
 target_link_libraries(llama-cli PRIVATE "${STREAMLINE_LIB_PATH}")
 ```
 
-To add Annotation Markers to `llama-cli`, change the `llama-cli` code in `llama.cpp/tools/main/main.cpp` by adding the include file:
+To add Annotation Markers to `llama-cli`, edit the file `llama.cpp/tools/main/main.cpp` and make three modifications.
+
+First, add the include file at the top of `main.cpp` with the other include files:
 
 ```c
 #include "streamline_annotate.h"
 ```
 
-After the call to `common_init()`, add the setup macro:
+Next, find the `common_init()` call in the `main()` function and add the Streamline setup macro below it so that the code looks like:
 
 ```c
 common_init();
 //Add the Annotation setup code
 ANNOTATE_SETUP;
 ```
 
-Finally, add an annotation marker inside the main loop:
+Finally, add an annotation marker inside the main loop. Add the complete code in place of the annotation comments so it looks like:
 
 ```c
 for (int i = 0; i < (int) embd.size(); i += params.n_batch) {
@@ -150,8 +149,8 @@ Next, configure the project.
 -DBUILD_SHARED_LIBS=OFF \
 -DCMAKE_EXE_LINKER_FLAGS="-static -g" \
 -DGGML_OPENMP=OFF \
--DCMAKE_C_FLAGS="-march=armv8.2-a+dotprod+i8mm -g" \
--DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod+i8mm -g" \
+-DCMAKE_C_FLAGS="-march=native -g" \
+-DCMAKE_CXX_FLAGS="-march=native -g" \
 -DGGML_CPU_KLEIDIAI=ON \
 -DLLAMA_BUILD_TESTS=OFF \
 -DLLAMA_BUILD_EXAMPLES=ON \
@@ -161,8 +160,8 @@ Next, configure the project.
 cmake .. \
 -DCMAKE_SYSTEM_NAME=Linux \
 -DCMAKE_SYSTEM_PROCESSOR=arm \
--DCMAKE_C_COMPILER=aarch64-none-linux-gnu-gcc \
--DCMAKE_CXX_COMPILER=aarch64-none-linux-gnu-g++ \
+-DCMAKE_C_COMPILER=aarch64-linux-gnu-gcc \
+-DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++ \
 -DLLAMA_NATIVE=OFF \
 -DLLAMA_F16C=OFF \
 -DLLAMA_GEMM_ARM=ON \
@@ -190,7 +189,7 @@ Now you can build the project using `cmake`:
 
 ```bash
 cd ~/llama.cpp/build
-cmake --build ./ --config Release
+cmake --build ./ --config Release -j $(nproc)
 ```
 
 After the building process completes, you can find `llama-cli` in the `~/llama.cpp/build/bin/` directory.

content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/4_analyze_token_prefill_decode.md

Lines changed: 11 additions & 9 deletions
@@ -8,24 +8,25 @@ layout: learningpathall
 
 ## Run llama-cli and analyze the data with Streamline
 
-After successfully building llama-cli, the next step is to set up the runtime environment on your Arm platform.
+After successfully building llama-cli, the next step is to set up the runtime environment on your Arm platform. This can be your development machine or another Arm system.
 
-### Set up gatord
+### Set up the gator daemon
 
-The gator daemon (gatord) is the Streamline collection agent that runs on the target device. It captures performance data including CPU metrics, PMU events, and annotations, then sends this data to the Streamline analysis tool running on your host machine. The daemon needs to be running on your target device before you can capture performance data.
+The gator daemon, `gatord`, is the Streamline collection agent that runs on the target device. It captures performance data, including CPU metrics, PMU events, and annotations, then sends this data to the Streamline analysis tool running on your host machine. The daemon needs to be running on your target device before you can capture performance data.
 
 Depending on how you built llama.cpp:
 
 For the cross-compiled build flow:
 
 - Copy the `llama-cli` executable to your Arm target.
--Also copy the `gatord` binary from the Arm DS or Streamline installation:
-  - Linux: `Arm\Development Studio 2024.1\sw\streamline\bin\linux\arm64`
-  - Android: `Arm\Development Studio 2024.1\sw\streamline\bin\android\arm64`
+- Copy the `gatord` binary from the Arm Performance Studio release. If you are targeting Linux, take it from `streamline\bin\linux\arm64`, and if you are targeting Android, take it from `streamline\bin\android\arm64`.
+
+Put both of these programs in your home directory on the target system.
 
 For the native build flow:
+- Use the `llama-cli` from your local build in `llama.cpp/build/bin` and the `gatord` you compiled earlier at `~/gator/build-native-gcc-rel/gatord`.
 
-- Use the `llama-cli` from your local build and the `gatord` you compiled earlier (`~/gator/build-native-gcc-rel/gatord`).
+You now have `gatord` and `llama-cli` on the computer you want to run and profile.
 
 ### Download a lightweight model
 
@@ -49,8 +50,9 @@ Start the gator daemon on your Arm target:
 
 You should see messages similar to those shown below:
 
 ```bash
-Streamline Data Recorder v9.4.0 (Build 9b1e8f8)
-Copyright (c) 2010-2024 Arm Limited. All rights reserved.
+Streamline Data Recorder v9.6.0 (Build oss)
+Copyright (c) 2010-2025 Arm Limited. All rights reserved.
+
 Gator ready
 ```
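Before profiling, a back-of-envelope bound helps set expectations for the Decode stage analyzed in this file: because Decode is memory-bound, tokens per second cannot exceed memory bandwidth divided by the bytes read per generated token (roughly the whole quantized model for a small model). All numbers below are assumptions for illustration, not measurements from this Learning Path:

```python
# Back-of-envelope decode-rate bound (all numbers are assumptions).
# Decode reads roughly the entire set of quantized weights per token,
# so DRAM bandwidth puts an upper bound on tokens per second.

def decode_tokens_per_sec_bound(model_bytes, bandwidth_bytes_per_sec):
    """Upper bound on decode rate for a memory-bound workload."""
    return bandwidth_bytes_per_sec / model_bytes

# Assumed: ~0.4 GB for a Q4_0 0.5B-parameter model, 50 GB/s DRAM bandwidth.
bound = decode_tokens_per_sec_bound(0.4e9, 50e9)
print(f"Decode upper bound: ~{bound:.0f} tokens/sec")  # ~125 tokens/sec
```

If the decode rate reported by `llama-cli` is far below such a bound for your hardware, the Streamline capture is the place to look for the reason.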

content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/_index.md

Lines changed: 2 additions & 2 deletions
@@ -19,8 +19,8 @@ learning_objectives:
 prerequisites:
 - Basic understanding of llama.cpp
 - Understanding of transformer models
-- Knowledge of Streamline usage
-- An Arm Neoverse or Cortex-A hardware platform running Linux or Android to test the application
+- Knowledge of Arm Streamline usage
+- An Arm Neoverse or Cortex-A hardware platform running Linux or Android
 
 author:
 - Zenon Zhilong Xiu
