Commit 32365b6

Merge pull request #2258 from zenonxiu81/main
Add a new blog on 'Use Streamline to analyze LLM running on CPU with llama.cpp'
2 parents 30709e3 + 144edc8 commit 32365b6

35 files changed: +512 −0 lines changed

Lines changed: 204 additions & 0 deletions

---
title: Analyze token generation at Prefill and Decode stage
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

# Analyze token generation at Prefill and Decode stage

To get a clear view of token generation at the Prefill and Decode stages, the Annotation Marker feature of Streamline is used, and the Annotation Marker generation code is integrated into the llama.cpp project.
You can find more information about the Annotation Marker feature here: https://developer.arm.com/documentation/101816/9-7/Annotate-your-code?lang=en.

## Steps for llama.cpp integration and Streamline setup

### Step 1: Build the Streamline Annotation library
Install Arm DS (Arm Development Studio) or Arm Streamline on your host PC first.
You can find the Streamline Annotation support code in the installation directory, such as *"Arm\Development Studio 2024.1\sw\streamline\gator\annotate"*.
You can also get the Annotation support code here: https://github.com/ARM-software/gator/tree/main. Please download the code that matches the version of the Streamline tool on your host PC.
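If you prefer to fetch the gator sources instead, one possible way is to clone the repository and check out the release that matches your Streamline version; the annotation support code lives in the *annotate* directory of the checkout. The tag name below is a placeholder:

```bash
# Clone the gator sources and switch to the release matching your Streamline version
git clone https://github.com/ARM-software/gator.git
cd gator
git checkout <your-streamline-version>
cd annotate
```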

Then you can build the Streamline Annotation library by running:
```bash
make CROSS_COMPILE=/path/to/aarch64_linux_gcc_tool
```

For example:
```bash
make CROSS_COMPILE=./Work/arm-gnu-toolchain-13.3.rel1-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu-
```
You can get the AArch64 GCC compiler toolchain here: https://developer.arm.com/downloads/-/arm-gnu-toolchain-downloads.
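As an illustration only (the exact archive name depends on the release you download; the name below matches the example path used above), extracting the toolchain and checking the compiler could look like this:

```bash
# Assumed archive name; adjust to the release you actually downloaded
tar -xf arm-gnu-toolchain-13.3.rel1-x86_64-aarch64-none-linux-gnu.tar.xz
export PATH=$PWD/arm-gnu-toolchain-13.3.rel1-x86_64-aarch64-none-linux-gnu/bin:$PATH
aarch64-none-linux-gnu-gcc --version
```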

The statically linked library, libstreamline_annotate.a, will be produced.
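Optionally, you can confirm that the archive was built for AArch64 before using it (this assumes binutils is available on the host PC):

```bash
# Every object in the archive should report AArch64 as the machine type
readelf -h libstreamline_annotate.a | grep Machine
```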

### Step 2: Integrate Annotation Marker code into llama.cpp
Download the llama.cpp code from https://github.com/ggml-org/llama.cpp/archive/refs/tags/b6202.tar.gz
Go to the llama.cpp root directory and create a directory ‘streamline_annotation’ there:
```bash
cd ./llama.cpp
mkdir streamline_annotation
```

Copy the library ‘libstreamline_annotate.a’ and the header file ‘streamline_annotate.h’ from Step 1 to the directory ‘streamline_annotation’, for example:
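The commands below assume the annotation library was built in a gator checkout sitting next to the llama.cpp directory; adjust the source path to wherever you built it:

```bash
# Copy the static library and its header into the streamline_annotation directory
cp ../gator/annotate/libstreamline_annotate.a ./streamline_annotation/
cp ../gator/annotate/streamline_annotate.h ./streamline_annotation/
```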

To link the 'libstreamline_annotate.a' library when building llama-cli, change *llama.cpp/CMakeLists.txt* by adding the following lines:

```cmake
set(STREAMLINE_LIB_PATH ${CMAKE_SOURCE_DIR}/streamline_annotation/libstreamline_annotate.a)
target_include_directories(llama-cli PRIVATE ${CMAKE_SOURCE_DIR}/streamline_annotation)
target_link_libraries(${TARGET} PRIVATE ${STREAMLINE_LIB_PATH})
```

To add Annotation Markers to llama-cli, change the llama-cli code *llama.cpp/tools/main/main.cpp* by adding
```c
#include "streamline_annotate.h"
```
and then adding the Annotation Marker code in the 'main' function.

First, add the Streamline Annotation setup code after *common_init*:
```c
common_init();

// Add the Annotation setup code
ANNOTATE_SETUP;
```

Then add the Annotation Marker generation code in the batch evaluation loop, right before the call to llama_decode:

```c
for (int i = 0; i < (int) embd.size(); i += params.n_batch) {
    int n_eval = (int) embd.size() - i;
    if (n_eval > params.n_batch) {
        n_eval = params.n_batch;
    }

    LOG_DBG("eval: %s\n", string_from(ctx, embd).c_str());

    // Add annotation marker code for Streamline
    {
        char printf_buf[200];
        sprintf(printf_buf, "past %d, n_eval %d", n_past, n_eval);
        ANNOTATE_MARKER_STR(printf_buf);
    }
    // End of annotation marker

    if (llama_decode(ctx, llama_batch_get_one(&embd[i], n_eval))) {
        LOG_ERR("%s : failed to eval\n", __func__);
        return 1;
    }
```

A string is added to the Annotation Marker to record the position of the input tokens and the number of tokens to be processed.

### Step 3: Build the llama-cli executable
For convenience, llama-cli is statically linked.

First, create a new directory ‘build’ under the llama.cpp root directory and go into it:
```bash
mkdir ./build && cd ./build
```
Then configure the project by running:
```bash
cmake .. -DCMAKE_SYSTEM_NAME=Linux -DCMAKE_SYSTEM_PROCESSOR=arm -DCMAKE_C_COMPILER=aarch64-none-linux-gnu-gcc -DCMAKE_CXX_COMPILER=aarch64-none-linux-gnu-g++ -DLLAMA_NATIVE=OFF -DLLAMA_F16C=OFF -DLLAMA_GEMM_ARM=ON -DBUILD_SHARED_LIBS=OFF -DCMAKE_EXE_LINKER_FLAGS="-static -g" -DGGML_OPENMP=OFF -DCMAKE_C_FLAGS="-march=armv8.2-a+i8mm+dotprod -g" -DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod+i8mm -g" -DGGML_CPU_KLEIDIAI=ON -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=ON -DLLAMA_CURL=OFF
```

Set CMAKE_C_COMPILER and CMAKE_CXX_COMPILER to your cross compiler path. Make sure that “-march” in CMAKE_C_FLAGS and CMAKE_CXX_FLAGS matches your Arm CPU hardware.

In this guide, we run llama-cli on an Arm CPU which supports the NEON Dotprod and I8MM instructions, so ‘-march’ is specified as ‘armv8.2-a+dotprod+i8mm’. We also specify the ‘-static’ and ‘-g’ options so that the llama-cli executable is statically linked and carries debug info. This makes source code and function level profiling easier, and makes the llama-cli executable runnable on various versions of Arm64 Linux/Android.
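If you are unsure which features your CPU supports, one way to check on a Linux target is to inspect the CPU feature flags ('asimddp' is the Linux name for the Dotprod feature):

```bash
# Run on the Arm64 target, not on the host PC
grep -m1 Features /proc/cpuinfo
# The output should include 'asimddp' and 'i8mm' for this build to run
```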

Now, we can build the project by running:
```bash
cmake --build ./ --config Release
```

After the build completes, you should find the llama-cli executable in the *./build/bin/* directory.
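Before copying it to the target, you can optionally confirm the binary is what we expect:

```bash
# Should report a statically linked AArch64 ELF executable that is not stripped
file ./build/bin/llama-cli
```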

### Step 4: Run llama-cli and analyze the data with Streamline
Copy the following files to your Arm64 platform (an example transfer command is shown after the list):
* the llama-cli executable
* the ‘gatord’ executable from the Arm DS or Streamline installation folder, such as *Arm\Development Studio 2024.1\sw\streamline\bin\linux\arm64* for Linux and *Arm\Development Studio 2024.1\sw\streamline\bin\android\arm64* for Android
* the LLM model, Qwen1_5-0_5b-chat-q4_0.gguf
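For an Android target, the files can be pushed over ADB; for a Linux target, scp works similarly. The target paths below are only examples:

```bash
adb push llama-cli /data/local/tmp/
adb push gatord /data/local/tmp/
adb push qwen1_5-0_5b-chat-q4_0.gguf /data/local/tmp/
# Linux target alternative:
# scp llama-cli gatord qwen1_5-0_5b-chat-q4_0.gguf user@target:/path/to/workdir/
```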

Then run gatord on your Arm64 target:
```bash
./gatord
```
You should see messages similar to the ones below:

```bash
Streamline Data Recorder v9.4.0 (Build 9b1e8f8)
Copyright (c) 2010-2024 Arm Limited. All rights reserved.
Gator ready
```

Then launch the Streamline application on your host PC and connect to the gatord running on your Arm64 target with either a TCP or ADB connection. You can select the PMU events to be monitored at this point.

![text#center](images/streamline_capture.png "Figure 6. Streamline Start Capture")

Set the path of the llama-cli executable for Streamline so that its debug info can be used for analysis.

![text#center](images/streamline_capture_image.png "Figure 7. Streamline image path")

Click the ‘Start Capture’ button in Streamline to start collecting data from the Arm64 target.

*Note: This guide is not intended to introduce how to use Streamline. If you encounter any issues while setting up gatord or Streamline, please seek help from Arm support.*

Now, run the llama-cli executable as below:

```bash
./llama-cli -m qwen1_5-0_5b-chat-q4_0.gguf -p "<|im_start|>system\nYou are a helpful AI assistant.<|im_end|>\n<|im_start|>user\nTell me a story about a fox and a crow? Please do not tell the traditional story in Aesop's fables. Please tell me a positive story about friendship and love. The story should have no more than 400 words<|im_end|>\n<|im_start|>assistant\n" -st -t 1
```

After a while, you can stop the Streamline data collection by clicking the ‘Stop’ button in Streamline. The Streamline tool on your host PC will then start analyzing the data.

## Analyze the data with Streamline
In the timeline view of Streamline, we can see some Annotation Markers. Since we add an Annotation Marker before the llama_decode function, each Annotation Marker marks the start time of a token generation.

![text#center](images/annotation_marker_1.png "Figure 8. Annotation Marker")

The string in an Annotation Marker is shown when you click on it. For example:

![text#center](images/annotation_marker_2.png "Figure 9. Annotation String")

The number after ‘past’ indicates the position of the input tokens, and the number after ‘n_eval’ indicates the number of tokens to be processed this time.

As shown in the timeline view below, with the help of Annotation Markers, we can clearly identify the Prefill stage and the Decode stage.

![text#center](images/annotation_marker_prefill.png "Figure 10. Annotation Marker at Prefill and Decode stage")

By checking the Annotation Marker string, the first token generation at the Prefill stage shows 'past 0, n_eval 78', which means that the position of the input tokens starts at 0 and there are 78 input tokens to be processed.
We can see that the first token, generated at the Prefill stage, takes more time: since 78 input tokens have to be processed at the Prefill stage, it performs lots of GEMM operations. At the Decode stage, tokens are generated one by one at a mostly constant speed, and each token takes less time than the first token at the Prefill stage, thanks to the KV cache. At the Decode stage, the workload consists mostly of GEMV operations.

We can investigate this further with the PMU event counters captured by Streamline. At the Prefill stage, the amount of computation, indicated by the PMU event counters for Advanced SIMD (NEON), floating-point, and integer data-processing instructions, is large. However, memory access is relatively low. In particular, the number of L3 cache refills/misses is much lower than at the Decode stage.

At the Decode stage, the amount of computation is relatively smaller (since each token takes less time), but the number of L3 cache refills/misses goes much higher.
We can also monitor other PMU events, such as Backend Stall Cycles and Backend Stall Cycles due to Memory stall:

![text#center](images/annotation_pmu_stall.png "Figure 11. Backend stall PMU event")

We can see that at the Prefill stage, Backend Stall Cycles due to Memory stall are only about 10% of total Backend Stall Cycles. However, at the Decode stage, Backend Stall Cycles due to Memory stall are around 50% of total Backend Stall Cycles.
All these PMU event counters indicate that the workload is compute-bound at the Prefill stage and memory-bound at the Decode stage.

Now, let us further profile the code execution with Streamline. In the ‘Call Paths’ view of Streamline, we can see the percentage of running time of functions, organized in the form of a call stack.

![text#center](images/annotation_prefill_call_stack.png "Figure 12. Call stack")

In the ‘Functions’ view of Streamline, we can see the overall percentage of running time of functions.

![text#center](images/annotation_prefill_functions.png "Figure 13. Functions view")

As we can see, the function graph_compute takes the largest portion of the running time. This shows that the large number of GEMM and GEMV operations takes most of the time. With the Qwen1_5-0_5b-chat-q4_0 model:
* The computation (GEMM and GEMV) of the Q, K, V vectors and of most FFN layers: their weights use the Q4_0 data type and the input activations use the FP32 data type. The computation is forwarded to the KleidiAI trait by *ggml_cpu_extra_compute_forward*. KleidiAI ukernels implemented with NEON Dotprod and I8MM vector instructions are used to accelerate the computation.
- At the Prefill stage, the *kai_run_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm* KleidiAI ukernel is used for GEMM (matrix multiply) operators. It takes advantage of the NEON I8MM instruction. Since the Prefill stage only takes a small percentage of the whole run time, the percentage of this function is small, as shown in the figures above. However, if we focus on the Prefill stage only, using the ‘Samplings’ view in the Timeline, we can see that *kai_run_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm* takes the largest portion of the whole Prefill stage.

![text#center](images/Prefill_only.png "Figure 14. Prefill only view")

- At the Decode stage, the *kai_run_matmul_clamp_f32_qsi8d32p1x8_qsi4c32p4x8_1x4x32_neon_dotprod* KleidiAI ukernel is used for GEMV operators. It takes advantage of the NEON Dotprod instruction. If we focus on the Decode stage only, we can see that this function takes the second largest portion.

![text#center](images/Decode_only.png "Figure 15. Decode only view")

* There is a result_output linear layer in the Qwen1_5-0_5b-chat-q4_0 model whose weights use the Q6_K data type. The layer computes a huge [1, 1024] x [1024, 151936] GEMV operation, where 1024 is the embedding size and 151936 is the vocabulary size. This operation cannot be handled by KleidiAI yet, so it is handled by the ggml_vec_dot_q6_K_q8_K function in the ggml-cpu library.
* The tensor nodes for the computation of Multi-Head Attention are represented as three-dimensional matrices with the FP16 data type (the KV cache also holds FP16 values); they are computed by the ggml_vec_dot_f16 function in the ggml-cpu library.
* The computation of the RoPE, Softmax, and RMSNorm layers does not take a significant portion of the running time.
Lines changed: 13 additions & 0 deletions
---
title: Conclusion
weight: 7

### FIXED, DO NOT MODIFY
layout: learningpathall
---

# Conclusion
By leveraging the Streamline tool together with a good understanding of the llama.cpp code, the execution process of the LLM model can be visualized, which helps analyze code efficiency and investigate potential optimizations.

Note that the additional annotation code in llama.cpp and the gatord process might affect the performance to some extent.