Commit 540cf22

update after hugo server checking

1 parent 8853420 commit 540cf22

17 files changed: +33 −84 lines changed

content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Analyzing_token_generation_at_Prefill_and_Decode_stage.md

Lines changed: 12 additions & 41 deletions
@@ -1,12 +1,12 @@
 ---
-title: Analyzing token generation at Prefill and Decode stage
+title: Analyze token generation at Prefill and Decode stage
 weight: 4
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
-# Analyzing token generation at Prefill and Decode stage
+# Analyze token generation at Prefill and Decode stage
 To get a visible view of token generation at the Prefill and Decode stages, the Annotation Marker feature of Streamline is used, and the Annotation Marker generation code is integrated into the llama.cpp project.
 You can find more information about the Annotation Marker feature here: https://developer.arm.com/documentation/101816/9-7/Annotate-your-code?lang=en.
 

@@ -122,15 +122,10 @@ Gator ready
 
 Then launch the Streamline application on your host PC and connect to the gatord running on your Arm64 target with either a TCP or ADB connection. You can select the PMU events to be monitored at this point.
 
-<p align="center">
-<img src="images/streamline_capture.png" alt="Alt text" width="50%"/>
-</p>
+![text#center](images/streamline_capture.png "Figure 6. Streamline Start Capture")
 
 Set the path of the llama-cli executable for Streamline so that its debug info can be used for analysis.
-
-<p align="center">
-<img src="images/streamline_capture_image.png" alt="Alt text" width="40%"/>
-</p>
+![text#center](images/streamline_capture_image.png "Figure 7. Streamline image path")
 
 Click the ‘Start Capture’ button on Streamline to start collecting data from the Arm64 target.
 
@@ -146,24 +141,15 @@ After a while, you can stop the Streamline data collection by clicking ‘Stop
 
 ## Analyze the data with Streamline
 From the timeline view of Streamline, we can see some Annotation Markers. Since we add an Annotation Marker before the llama_decode function, each Annotation Marker marks the start time of a token generation.
-
-<p align="center">
-<img src="images/annotation_marker_1.png" alt="Alt text" width="50%"/>
-</p>
+![text#center](images/annotation_marker_1.png "Figure 8. Annotation Marker")
 
 The string in an Annotation Marker is shown when you click on the marker. For example,
-
-<p align="center">
-<img src="images/annotation_marker_2.png" alt="Alt text" width="20%"/>
-</p>
+![text#center](images/annotation_marker_2.png "Figure 9. Annotation String")
 
 The number after ‘past’ indicates the position of the input tokens, and the number after ‘n_eval’ indicates the number of tokens to be processed this time.
 
 As shown in the timeline view below, with the help of Annotation Markers, we can clearly identify the Prefill stage and the Decode stage.
-
-<p align="center">
-<img src="images/annotation_marker_prefill.png" alt="Alt text" width="100%"/>
-</p>
+![text#center](images/annotation_marker_prefill.png "Figure 10. Annotation Marker at Prefill and Decode stage")
 
 By checking the string of the Annotation Marker, the first token generation at the Prefill stage has 'past 0, n_eval 78', which means that the position of the input tokens starts at 0 and there are 78 input tokens to be processed.
 The first token, generated at the Prefill stage, takes more time: all 78 input tokens have to be processed there, which requires many GEMM operations. At the Decode stage, tokens are generated one by one at a mostly even speed, and each token takes less time than the first one, thanks to the KV cache. The Decode stage mainly performs GEMV operations.
@@ -172,39 +158,24 @@ We can further investigate it with PMU event counters that are captured by Strea
 
 At the Decode stage, the amount of computation is relatively smaller (each token takes less time), but the number of L3 cache refills/misses goes much higher.
 We can also monitor two further PMU events: Backend Stall Cycles, and Backend Stall Cycles due to Memory stall.
-
-<p align="center">
-<img src="images/annotation_pmu_stall.png" alt="Alt text" width="100%"/>
-</p>
+![text#center](images/annotation_pmu_stall.png "Figure 11. Backend stall PMU event")
 
 At the Prefill stage, Backend Stall Cycles due to Memory stall are only about 10% of the total Backend Stall Cycles. However, at the Decode stage, Backend Stall Cycles due to Memory stall are around 50% of the total Backend Stall Cycles.
 All those PMU event counters indicate that the workload is compute-bound at the Prefill stage and memory-bound at the Decode stage.
 
 Now, let us further profile the code execution with Streamline. In the ‘Call Paths’ view of Streamline, we can see the percentage of running time of functions organized in the form of a call stack.
-
-<p align="center">
-<img src="images/annotation_prefill_call_stack.png" alt="Alt text" width="70%"/>
-</p>
+![text#center](images/annotation_prefill_call_stack.png "Figure 12. Call stack")
 
 In the ‘Functions’ view of Streamline, we can see the overall percentage of running time of each function.
-
-<p align="center">
-<img src="images/annotation_prefill_functions.png" alt="Alt text" width="70%"/>
-</p>
+![text#center](images/annotation_prefill_functions.png "Figure 13. Functions view")
 
 As we can see, the function graph_compute takes the largest portion of the running time, which shows that large amounts of GEMM and GEMV operations take most of the time. With the Qwen1_5-0_5b-chat-q4_0 model:
 * The computation (GEMM and GEMV) of the Q, K, V vectors and most of the FFN layers: their weights use the Q4_0 data type and the input activations use the FP32 data type. The computation is forwarded to the KleidiAI trait by *ggml_cpu_extra_compute_forward*. KleidiAI ukernels implemented with NEON Dotprod and I8MM vector instructions are used to accelerate the computation.
   - At the Prefill stage, the *kai_run_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm* KleidiAI ukernel is used for GEMM (matrix multiply) operators. It takes advantage of the NEON I8MM instruction. Since the Prefill stage takes only a small percentage of the whole time, the percentage of this function is small, as shown in the figures above. However, if we focus on the Prefill stage only with the ‘Samplings’ view in the Timeline, we can see that *kai_run_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm* takes the largest portion of the whole Prefill stage.
-
-<p align="center">
-<img src="images/Prefill_only.png" alt="Alt text" width="70%"/>
-</p>
+![text#center](images/Prefill_only.png "Figure 14. Prefill only view")
 
   - At the Decode stage, the *kai_run_matmul_clamp_f32_qsi8d32p1x8_qsi4c32p4x8_1x4x32_neon_dotprod* KleidiAI ukernel is used for GEMV operators. It takes advantage of the NEON Dotprod instruction. If we focus on the Decode stage only, we can see this function takes the second largest portion.
-
-<p align="center">
-<img src="images/Decode_only.png " alt="Alt text" width="80%"/>
-</p>
+![text#center](images/Decode_only.png "Figure 15. Decode only view")
 
 * There is a result_output linear layer in the Qwen1_5-0_5b-chat-q4_0 model whose weights use the Q6_K data type. The layer computes a huge [1, 1024] x [1024, 151936] GEMV operation, where 1024 is the embedding size and 151936 is the vocabulary size. This operation cannot be handled by KleidiAI yet, so it is handled by the ggml_vec_dot_q6_K_q8_K function in the ggml-cpu library.
 * The tensor nodes for the computation of Multi-Head Attention are presented as three-dimensional matrices with the FP16 data type (the KV cache also holds FP16 values); they are computed by the ggml_vec_dot_f16 function in the ggml-cpu library.

content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Deep_dive.md

Lines changed: 10 additions & 21 deletions
@@ -91,21 +91,16 @@ Then build llama-cli executable, run llama-cli and collect profiling data with S
 
 ## Analyze the data with Streamline
 String annotations are displayed as text overlays inside the relevant channels in the details panel of the Timeline view, for example inside Channel 0 in the following screenshot.
-<p align="center">
-<img src="images/deep_dive_1.png" alt="Alt text" width="90%"/>
-</p>
+![text#center](images/deep_dive_1.png "Figure 16. Annotation Channel")
 
 The letter A is displayed in the process list to indicate the presence of annotations.
 String annotations are also displayed in the Message column in the Log view.
-<p align="center">
-<img src="images/deep_dive_2.png" alt="Alt text" width="80%"/>
-</p>
+![text#center](images/deep_dive_2.png "Figure 17. Annotation log")
+
 ### View of individual operators at Prefill stage
 
 The screenshot of the annotation channel view at the Prefill stage is shown below.
-<p align="center">
-<img src="images/prefill_annotation_channel.png" alt="Alt text" width="90%"/>
-</p>
+![text#center](images/prefill_annotation_channel.png "Figure 18. Annotation Channel at Prefill stage")
 
 Note that the name of the operator in the screenshot above was manually edited. If the name of the operator, rather than the channel number, should be shown by Streamline, ANNOTATE_NAME_CHANNEL can be added to the ggml_graph_compute_thread function.
 This annotation macro is defined as,
@@ -119,25 +114,19 @@ For example,
 ```
 The code above sets the name of annotation channel 0 to ‘MUL_MAT_GEMV’ and the name of annotation channel 1 to ‘MUL_MAT_GEMM’.
 We can get more detailed information by zooming into the view.
-<p align="center">
-<img src="images/prefill_annotation_channel_2.png" alt="Alt text" width="90%"/>
-</p>
+![text#center](images/prefill_annotation_channel_2.png "Figure 18. Annotation Channel at Prefill stage, zoomed in")
 
 When moving the cursor to the Annotation channel, the tensor node name, the name of the operation, and the shape and size of the source tensor nodes are shown.
-<p align="center">
-<img src="images/prefill_annotation_channel_3.png" alt="Alt text" width="40%"/>
-</p>
+![text#center](images/prefill_annotation_channel_3.png "Figure 19. Annotation Channel Zoom in")
+
 The screenshot above shows a GGML_OP_MUL_MAT operator of the FFN_UP node, whose source tensors have shape/size [1024, 2816] and [1024, 68].
 The view clearly shows that the major time is spent on the MUL_MAT GEMM operations of the attention layers and FFN layers at the Prefill stage. There is one large MUL_MAT GEMV operation at the result_output linear layer. Other operators such as MUL, Softmax, Norm, and RoPE do not take significant time.
 
 ### View of individual operators at Decode stage
 The screenshot of the annotation channel view at the Decode stage is shown below.
-<p align="center">
-<img src="images/decode_annotation_channel.png" alt="Alt text" width="80%"/>
-</p>
+![text#center](images/decode_annotation_channel.png "Figure 20. Annotation Channel at Decode stage")
+
 We can get more detailed information by zooming into the view.
-<p align="center">
-<img src="images/decode_annotation_channel_2.png" alt="Alt text" width="60%"/>
-</p>
+![text#center](images/decode_annotation_channel_2.png "Figure 21. Annotation Channel string")
 
 The view shows that the major time is spent on MUL_MAT GEMV operations of the attention layers and FFN layers at the Decode stage. Compared with the Prefill stage, there is no GEMM at those layers; GEMV operations are performed instead. The large MUL_MAT GEMV operation at the result_output linear layer takes a more significant portion of time at the Decode stage, since less time is spent on each token generation thanks to the KV cache. This corresponds to the percentage of execution time of the function ggml_vec_dot_q6_K_q8_K that we observed in the previous section.
Lines changed: 3 additions & 4 deletions
@@ -1,12 +1,12 @@
 ---
-title: Introduction
+title: Overview
 weight: 2
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
-# Introduction
+# Overview
 Large Language Models (LLMs) run very smoothly on Arm CPUs. The framework that runs LLM models is usually complex. To analyze the execution of an LLM and use the profiling information for potential code optimization, a good understanding of the transformer architecture and an appropriate analysis tool are required.
 This guide uses the llama-cli application from llama.cpp and Arm’s Streamline tool to analyze the efficiency of an LLM running on an Arm CPU.
 
@@ -17,5 +17,4 @@ The guide includes,
 
 Understanding this guide requires prerequisite knowledge of the transformer architecture, llama.cpp, and Streamline.
 
-We run Qwen1_5-0_5b-chat-q4_0.gguf model with llama-cli on Arm64 Linux and use Streamline for analysis. This guide should also work on Arm64 Android platform.
-
+We run Qwen1_5-0_5b-chat-q4_0.gguf model with llama-cli on Arm64 Linux and use Streamline for analysis. This guide should also work on Arm64 Android platform.

content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Introduction_to_llama_cpp.md

Lines changed: 7 additions & 15 deletions
@@ -9,9 +9,8 @@ layout: learningpathall
 # Introduction to llama.cpp
 llama.cpp is an LLM framework implemented in C++ that can be used for both training and inference. This guide only covers inference on the CPU.
 llama-cli provides a terminal interface to interact with an LLM using the llama.cpp inference engine. It enables LLM inference, chat mode, and grammar-constrained generation directly from the command line.
-<p align="center">
-<img src="images/llama_structure.png" alt="Alt text" width="50%"/>
-</p>
+![text#center](images/llama_structure.png "Figure 1. llama-cli structure")
+
 llama-cli does the following things:
 * Load and interpret LLMs in .gguf format.
 * Build a compute graph according to the model structure. The compute graph can be divided into subgraphs that are assigned to the most suitable backend devices. At this step, the model structure is converted into a compute graph with many tensor nodes/operators (such as ADD, MUL_MAT, NORM, SOFTMAX) that can actually be computed.
@@ -22,26 +21,19 @@ Since this guide only focuses on running LLM on CPU, all operators are assigned
 Those steps above are wrapped in the function ‘llama_decode’. At the LLM Prefill and Decode stages, llama-cli calls ‘llama_decode’ repeatedly to generate tokens. However, the parameter ‘llama_batch’ passed to ‘llama_decode’ differs between the Prefill and Decode stages. ‘llama_batch’ includes information such as the input tokens, the number of input tokens, and the position of the input tokens.
 
 The components of llama.cpp include:
-<p align="center">
-<img src="images/llama_componetns.png" alt="Alt text" width="50%"/>
-</p>
+![text#center](images/llama_componetns.jpg "Figure 2. llama.cpp components")
 
 llama.cpp supports various backends such as CPU, GPU, CUDA, and OpenCL.
 For the CPU backend, it provides an optimized ggml-cpu library (mainly utilizing CPU vector instructions). For Arm CPUs, the ggml-cpu library also offers an aarch64 trait that leverages the new I8MM instructions for acceleration. The ggml-cpu library also integrates the Arm KleidiAI library as an additional trait.
 
 Most autoregressive LLMs are Decoder-only models. Here is a brief introduction to the Prefill and Decode stages of autoregressive LLMs.
-<p align="center">
-<img src="images/llm_prefill_decode.jpg" alt="Alt text" width="70%"/>
-</p>
+![text#center](images/llm_prefill_decode.jpg "Figure 3. Prefill and Decode stage")
 
 At the Prefill stage, multiple input tokens of the prompt are processed. It mainly performs GEMM (a matrix multiplied by another matrix) operations to generate the first output token.
-<p align="center">
-<img src="images/transformer_prefill.jpg" alt="Alt text" width="100%"/>
-</p>
+![text#center](images/transformer_prefill.jpg "Figure 4. Prefill stage")
+
 
 At the Decode stage, by utilizing the KV cache, it mainly performs GEMV (a vector multiplied by a matrix) operations to generate subsequent output tokens one by one.
-<p align="center">
-<img src="images/transformer_decode.jpg" alt="Alt text" width="100%"/>
-</p>
+![text#center](images/transformer_decode.jpg "Figure 5. Decode stage")
 
 Therefore, the Prefill stage is compute-bound, while the Decode stage has relatively less computation and is more memory-bound due to heavy KV cache memory access. This can be seen in the subsequent analysis with Streamline.

content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Multi_threads.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ The entrypoint of secondary thread is ggml_graph_compute_secondary_thread.
 When computing one tensor node/operator in the compute graph, if the work size is big, llama.cpp splits its computation into multiple parts across those threads.
 Here is an example of a MUL_MAT operator that demonstrates how the splitting is done.
 
-![text#center](images/multi_thread.png "Figure 22. Multi-thread")
+![text#center](images/multi_thread.jpg "Figure 22. Multi-thread")
 
 In this example, the result matrix C is split equally between four threads; each thread computes a quarter of matrix C.
 The execution of multiple threads on CPU cores can be observed with Streamline. The Core Map and Cluster Map modes in the Streamline Timeline view map threads to CPU cores.

content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/_index.md

Lines changed: 0 additions & 2 deletions
@@ -29,8 +29,6 @@ operatingsystems:
 - Linux
 - Android
 
-
-
 further_reading:
 - resource:
     title: llama.cpp project
Binary image files changed: 74 KB, 36.3 KB, −780 bytes, 11.9 KB
