
Commit 75d89b7

Merge pull request #2341 from odincodeshen/main
llama.cpp streamline with lowercase file name.
2 parents 65a3a06 + 5c70db8 commit 75d89b7

21 files changed (+682, -460 lines)

assets/contributors.csv

Lines changed: 1 addition & 0 deletions
@@ -102,3 +102,4 @@ Ker Liu,,,,,
 Rui Chang,,,,,
 Alejandro Martinez Vicente,Arm,,,,
 Mohamad Najem,Arm,,,,
+Zenon Zhilong Xiu,Arm,,zenon-zhilong-xiu-491bb398,,
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
---
title: Overview
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Overview: Profiling LLMs on Arm CPUs with Streamline

Large Language Models (LLMs) run efficiently on Arm CPUs.
Frameworks such as [**llama.cpp**](https://github.com/ggml-org/llama.cpp) provide a convenient way to run LLMs, but they also come with a certain level of complexity.

To analyze their execution and use profiling insights for optimization, you need both a basic understanding of transformer architectures and the right analysis tools.

This learning path demonstrates how to use the **llama-cli** application from llama.cpp together with **Arm Streamline** to analyze the efficiency of LLM inference on Arm CPUs.

In this guide you will learn how to:
- Profile token generation at the **Prefill** and **Decode** stages
- Profile execution of individual tensor nodes and operators
- Profile LLM execution across **multiple threads and cores**

You will run the **Qwen1_5-0_5b-chat-q4_0.gguf** model with llama-cli on **Arm64 Linux** and use Streamline for analysis.
The same method can also be applied to **Arm64 Android** platforms.

## Prerequisites
Before starting this guide, you should have:
- A basic understanding of llama.cpp
- An understanding of the transformer model architecture
- Working knowledge of Arm Streamline
- An Arm Neoverse or Cortex-A hardware platform running Linux or Android to test the application
Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
---
title: Understand llama.cpp
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Understand llama.cpp

**llama.cpp** is an open-source LLM framework implemented in C++ that supports both training and inference.
This learning path focuses only on **inference on the CPU**.

The **llama-cli** tool provides a command-line interface to run LLMs with the llama.cpp inference engine.
It supports text generation, chat mode, and grammar-constrained output directly from the terminal.

![text#center](images/llama_structure.png "Figure 1. llama-cli Flow")

### What llama-cli does
1. Load and interpret LLMs in **.gguf** format
2. Build a **compute graph** based on the model structure
   - The graph can be divided into subgraphs, each assigned to the most suitable backend device
   - In this guide, all operators are executed on the **CPU backend**
3. Allocate memory for tensor nodes using the **graph planner**
4. Execute tensor nodes in the graph during the **graph_compute** stage, which traverses nodes and forwards work to backend devices

Steps 2 to 4 are wrapped inside the function **`llama_decode`**.
During **Prefill** and **Decode**, `llama-cli` repeatedly calls `llama_decode` to generate tokens.
The parameter **`llama_batch`** passed to `llama_decode` differs between stages; it contains the input tokens, their count, and their positions.
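
The simplified sketch below illustrates this calling pattern. It is not the actual llama-cli source; `prompt_tokens`, `n_prompt`, `n_predict`, and `sample_next_token()` are illustrative placeholders.

```c
// Simplified sketch (not the actual llama-cli code) of how llama_decode
// is driven at the two stages.
#include "llama.h"

// Placeholder: in llama-cli the next token comes from the sampler chain.
llama_token sample_next_token(struct llama_context * ctx);

void generate(struct llama_context * ctx, llama_token * prompt_tokens,
              int32_t n_prompt, int n_predict) {
    // Prefill: the whole prompt is submitted as one batch (GEMM-heavy).
    llama_decode(ctx, llama_batch_get_one(prompt_tokens, n_prompt));

    // Decode: tokens are generated and fed back one at a time
    // (GEMV-heavy, bounded by KV cache and weight memory traffic).
    llama_token tok = sample_next_token(ctx);
    for (int i = 0; i < n_predict; i++) {
        llama_decode(ctx, llama_batch_get_one(&tok, 1));
        tok = sample_next_token(ctx);
    }
}
```
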
### Components of llama.cpp
The components of llama.cpp include:
![text#center](images/llama_componetns.jpg "Figure 2. llama.cpp components")

llama.cpp supports various backends, such as `CPU`, `GPU`, `CUDA`, and `OpenCL`.

For the CPU backend, it provides an optimized `ggml-cpu` library (mainly utilizing CPU vector instructions).
For Arm CPUs, the `ggml-cpu` library also offers an `aarch64` trait that leverages the new **I8MM** instructions for acceleration.
The `ggml-cpu` library also integrates the Arm [KleidiAI](https://github.com/ARM-software/kleidiai) library as an additional trait.

### Prefill and Decode in autoregressive LLMs
Most autoregressive LLMs are decoder-only models.
Here is a brief introduction to the Prefill and Decode stages of autoregressive LLMs.
![text#center](images/llm_prefill_decode.jpg "Figure 3. Prefill and Decode stage")

At the Prefill stage, multiple input tokens of the prompt are processed together.
This stage mainly performs GEMM operations (a matrix multiplied by another matrix) to generate the first output token.
![text#center](images/transformer_prefill.jpg "Figure 4. Prefill stage")

At the Decode stage, by utilizing the [KV cache](https://huggingface.co/blog/not-lain/kv-caching), the model mainly performs GEMV operations (a vector multiplied by a matrix) to generate subsequent output tokens one by one.
![text#center](images/transformer_decode.jpg "Figure 5. Decode stage")

Therefore:
- **Prefill** is **compute-bound**, dominated by large GEMM operations
- **Decode** is **memory-bound**, dominated by KV cache access and GEMV operations

This can be seen in the subsequent analysis with Streamline.
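
To make the distinction concrete, here is a rough, illustrative C sketch (not llama.cpp code; dimensions and function names are made up). In the GEMM case each weight element fetched from memory is reused across every row of the token batch, while in the GEMV case it is used exactly once, which is why Decode tends to be limited by memory bandwidth.

```c
// Rough illustration (not llama.cpp code) of why Prefill leans on GEMM
// and Decode leans on GEMV.
#include <stddef.h>

// Prefill: n_tokens rows are processed at once, so each weight element
// loaded from memory is reused n_tokens times -> compute-bound.
void gemm(const float *x, const float *w, float *y,
          size_t n_tokens, size_t k, size_t n) {
    for (size_t t = 0; t < n_tokens; t++)
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t i = 0; i < k; i++)
                acc += x[t * k + i] * w[i * n + j];
            y[t * n + j] = acc;
        }
}

// Decode: a single token (one row), so every weight element is read once
// and used once -> performance is limited by memory bandwidth.
void gemv(const float *x, const float *w, float *y, size_t k, size_t n) {
    for (size_t j = 0; j < n; j++) {
        float acc = 0.0f;
        for (size_t i = 0; i < k; i++)
            acc += x[i] * w[i * n + j];
        y[j] = acc;
    }
}
```
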
Lines changed: 196 additions & 0 deletions
@@ -0,0 +1,196 @@
---
title: Integrating Streamline Annotations into llama.cpp
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Integrating Streamline Annotations into llama.cpp

To visualize token generation at the **Prefill** and **Decode** stages, we use **Streamline’s Annotation Marker** feature.
This requires integrating annotation support into the **llama.cpp** project.
More information about the Annotation Marker API can be found [here](https://developer.arm.com/documentation/101816/9-7/Annotate-your-code?lang=en).
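
As a quick preview, the minimal sketch below shows how the two macros used later in this guide fit together in a standalone program. It is an illustrative example, not llama.cpp code, and it assumes you link against the `libstreamline_annotate.a` library built in Step 1.

```c
// Minimal, illustrative use of the Annotation Marker API.
// Build by linking against libstreamline_annotate.a (see Step 1).
#include <stdio.h>
#include "streamline_annotate.h"

int main(void) {
    ANNOTATE_SETUP;                // set up annotation support once at startup

    for (int i = 0; i < 4; i++) {
        char buf[64];
        snprintf(buf, sizeof(buf), "iteration %d", i);
        ANNOTATE_MARKER_STR(buf);  // marker appears on the Streamline timeline
        // ... work to be profiled ...
    }
    return 0;
}
```
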
{{% notice Note %}}
You can either build natively on an **Arm platform**, or cross-compile on another architecture using an Arm cross-compiler toolchain.
{{% /notice %}}

### Step 1: Build Streamline Annotation library

Install [Arm DS](https://developer.arm.com/Tools%20and%20Software/Arm%20Development%20Studio) or [Arm Streamline](https://developer.arm.com/Tools%20and%20Software/Streamline%20Performance%20Analyzer) on your development machine first.

The Streamline Annotation support code can be found in the installation directory, for example *"Arm\Development Studio 2024.1\sw\streamline\gator\annotate"*.

For installation guidance, refer to the [Streamline installation guide](https://learn.arm.com/install-guides/streamline/).

Clone the gator repository that matches your Streamline version and build the Annotation support library.

The installation steps depend on your development machine.

For an Arm native build, use the following instructions to install the required packages.
For other machines, set up the cross-compilation environment by installing the [aarch64 gcc compiler toolchain](https://developer.arm.com/downloads/-/arm-gnu-toolchain-downloads).
You can refer to this [guide](https://learn.arm.com/install-guides/gcc/cross/) for cross-compiler installation.

{{< tabpane code=true >}}
{{< tab header="Arm Native Build" language="bash">}}
apt-get update
apt-get install ninja-build cmake gcc g++ g++-aarch64-linux-gnu curl zip unzip tar pkg-config git
cd ~
git clone https://github.com/ARM-software/gator.git
cd gator
./build-linux.sh

cd annotate
make
{{< /tab >}}
{{< tab header="Cross Compiler" language="bash">}}
apt-get update
apt-get install ninja-build cmake gcc g++ g++-aarch64-linux-gnu curl zip unzip tar pkg-config git
cd ~
git clone https://github.com/ARM-software/gator.git

cd gator
make CROSS_COMPILE=/path/to/aarch64_linux_gcc_tool
{{< /tab >}}
{{< /tabpane >}}

Once complete, the static library **libstreamline_annotate.a** is generated at `~/gator/annotate/libstreamline_annotate.a`, and the header file is at `gator/annotate/streamline_annotate.h`.

### Step 2: Integrate Annotation Marker into llama.cpp

Next, we need to install **llama.cpp** to run the LLM model.
To make the following performance profiling content easier to follow, this Learning Path uses a specific release version of llama.cpp so that the steps and results remain consistent.

Before building **llama.cpp**, create a directory `streamline_annotation` and copy the library `libstreamline_annotate.a` and the header file `streamline_annotate.h` into that folder:

```bash
cd ~
wget https://github.com/ggml-org/llama.cpp/archive/refs/tags/b6202.tar.gz
tar -xvzf b6202.tar.gz
mv llama.cpp-b6202 llama.cpp
cd ./llama.cpp
mkdir streamline_annotation
cp ~/gator/annotate/libstreamline_annotate.a ~/gator/annotate/streamline_annotate.h streamline_annotation
```

To link the `libstreamline_annotate.a` library when building llama-cli, add the following lines at the end of `llama.cpp/tools/main/CMakeLists.txt`:

```cmake
set(STREAMLINE_LIB_PATH "${CMAKE_SOURCE_DIR}/streamline_annotation/libstreamline_annotate.a")
target_include_directories(llama-cli PRIVATE "${CMAKE_SOURCE_DIR}/streamline_annotation")
target_link_libraries(llama-cli PRIVATE "${STREAMLINE_LIB_PATH}")
```

To add Annotation Markers to llama-cli, modify **llama.cpp/tools/main/main.cpp** by adding the following include:

```c
#include "streamline_annotate.h"
```

After the call to `common_init()`, add the setup macro:

```c
common_init();
// Add the Annotation setup code
ANNOTATE_SETUP;
```

Finally, add an annotation marker inside the main loop:

```c
for (int i = 0; i < (int) embd.size(); i += params.n_batch) {
    int n_eval = (int) embd.size() - i;
    if (n_eval > params.n_batch) {
        n_eval = params.n_batch;
    }

    LOG_DBG("eval: %s\n", string_from(ctx, embd).c_str());

    // Add annotation marker code for Streamline
    {
        char printf_buf[200];
        sprintf(printf_buf, "past %d, n_eval %d", n_past, n_eval);
        ANNOTATE_MARKER_STR(printf_buf);
    }
    // End of annotation marker

    if (llama_decode(ctx, llama_batch_get_one(&embd[i], n_eval))) {
        LOG_ERR("%s : failed to eval\n", __func__);
        return 1;
    }
```

A string is added to the annotation marker to record the position of the input tokens and the number of tokens to be processed.

### Step 3: Build llama-cli

For convenience, llama-cli is **statically linked**.

First, create a new directory `build` under the llama.cpp root directory and go into it:

```bash
cd ~/llama.cpp
mkdir ./build && cd ./build
```

Then configure the project by running:

{{< tabpane code=true >}}
{{< tab header="Arm Native Build" language="bash">}}
cmake .. \
    -DGGML_NATIVE=ON \
    -DLLAMA_F16C=OFF \
    -DLLAMA_GEMM_ARM=ON \
    -DBUILD_SHARED_LIBS=OFF \
    -DCMAKE_EXE_LINKER_FLAGS="-static -g" \
    -DGGML_OPENMP=OFF \
    -DCMAKE_C_FLAGS="-march=armv8.2-a+dotprod+i8mm -g" \
    -DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod+i8mm -g" \
    -DGGML_CPU_KLEIDIAI=ON \
    -DLLAMA_BUILD_TESTS=OFF \
    -DLLAMA_BUILD_EXAMPLES=ON \
    -DLLAMA_CURL=OFF
{{< /tab >}}
{{< tab header="Cross Compiler" language="bash">}}
cmake .. \
    -DCMAKE_SYSTEM_NAME=Linux \
    -DCMAKE_SYSTEM_PROCESSOR=arm \
    -DCMAKE_C_COMPILER=aarch64-none-linux-gnu-gcc \
    -DCMAKE_CXX_COMPILER=aarch64-none-linux-gnu-g++ \
    -DLLAMA_NATIVE=OFF \
    -DLLAMA_F16C=OFF \
    -DLLAMA_GEMM_ARM=ON \
    -DBUILD_SHARED_LIBS=OFF \
    -DCMAKE_EXE_LINKER_FLAGS="-static -g" \
    -DGGML_OPENMP=OFF \
    -DCMAKE_C_FLAGS="-march=armv8.2-a+dotprod+i8mm -g" \
    -DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod+i8mm -g" \
    -DGGML_CPU_KLEIDIAI=ON \
    -DLLAMA_BUILD_TESTS=OFF \
    -DLLAMA_BUILD_EXAMPLES=ON \
    -DLLAMA_CURL=OFF
{{< /tab >}}
{{< /tabpane >}}

For the cross-compiler build, set `CMAKE_C_COMPILER` and `CMAKE_CXX_COMPILER` to your cross-compiler paths. Make sure that the **-march** value in `CMAKE_C_FLAGS` and `CMAKE_CXX_FLAGS` matches your Arm CPU hardware.

In this learning path, we run llama-cli on an Arm CPU that supports **NEON Dotprod** and **I8MM** instructions.
Therefore, we specify **armv8.2-a+dotprod+i8mm**.
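
If you are unsure whether your target CPU supports these features, the short illustrative program below (not part of llama.cpp) queries the Linux hwcaps. It assumes an Arm64 Linux target with reasonably recent kernel headers; the `HWCAP` macros are guarded in case your headers do not define them.

```c
// Illustrative helper (not part of llama.cpp): check at runtime that the
// CPU provides the dotprod and i8mm features selected by -march above.
// Assumes an Arm64 Linux target.
#include <stdio.h>
#include <sys/auxv.h>
#include <asm/hwcap.h>

int main(void) {
    unsigned long hwcap  = getauxval(AT_HWCAP);
    unsigned long hwcap2 = getauxval(AT_HWCAP2);

#ifdef HWCAP_ASIMDDP
    printf("dotprod: %s\n", (hwcap & HWCAP_ASIMDDP) ? "yes" : "no");
#endif
#ifdef HWCAP2_I8MM
    printf("i8mm   : %s\n", (hwcap2 & HWCAP2_I8MM) ? "yes" : "no");
#endif
    return 0;
}
```
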

We also specify the **-static** and **-g** options:
- **-static**: produces a statically linked executable, so it can run on different Arm64 Linux/Android environments without needing shared libraries.
- **-g**: includes debug information, which makes source code and function-level profiling in Streamline much easier.

Now you can build the project by running:

```bash
cd ~/llama.cpp/build
cmake --build ./ --config Release
```

After the build completes, the llama-cli executable is generated in the **~/llama.cpp/build/bin/** directory.
