Commit 9df1b20

some formatting
1 parent 5d86186 commit 9df1b20

File tree

1 file changed: +9 -0 lines changed


content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Analyzing_token_generation_at_Prefill_and_Decode_stage.md

Lines changed: 9 additions & 0 deletions
@@ -138,6 +138,7 @@ Then launch the Streamline application on your host PC, connect to the gatord ru
![text#center](images/streamline_capture.png "Figure 6. Streamline Start Capture ")

Set the path of the llama-cli executable in Streamline so that its debug information can be used for analysis.

![text#center](images/streamline_capture_image.png "Figure 7. Streamline image path")

Click the ‘Start Capture’ button in Streamline to start collecting data from the Arm64 target.
@@ -154,14 +155,17 @@ After a while, you can stop the Streamline data collection by clicking ‘Stop

## Analyze the data with Streamline
From the timeline view of Streamline, we can see some Annotation Markers. Since we add an Annotation Marker before the llama_decode function, each marker records the start time of a token generation.
![text#center](images/annotation_marker_1.png "Figure 8. Annotation Marker")

The string in an Annotation Marker is shown when you click it. For example,

![text#center](images/annotation_marker_2.png "Figure 9. Annotation String")

The number after ‘past’ indicates the position of the input tokens, and the number after ‘n_eval’ indicates the number of tokens to be processed this time.
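As an illustration of how such a marker string could be produced, the sketch below only formats the string; it is not the exact instrumentation code from this learning path, and the `ANNOTATE_MARKER_STR` macro from Streamline's `streamline_annotate.h` appears only as a comment because it requires a running gatord:

```cpp
#include <cstdio>
#include <string>

// Format a marker string in the same "past %d, n_eval %d" form described above.
// In an instrumented llama.cpp build, this string would be handed to the
// Streamline annotation API (for example ANNOTATE_MARKER_STR from
// streamline_annotate.h) just before llama_decode() runs.
std::string make_marker(int past, int n_eval) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "past %d, n_eval %d", past, n_eval);
    // ANNOTATE_MARKER_STR(buf);  // emit the marker (needs gatord running)
    return std::string(buf);
}
```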

As shown in the timeline view below, with the help of Annotation Markers, we can clearly identify the Prefill and Decode stages.

![text#center](images/annotation_marker_prefill.png "Figure 10. Annotation Marker at Prefill and Decode stage")

The Annotation Marker string of the first token generation, at the Prefill stage, is 'past 0, n_eval 78', which means that the position of the input tokens starts at 0 and there are 78 input tokens to be processed.
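The stage structure implied by these marker values can be sketched as a toy model (an illustration under the 78-token-prompt assumption, not llama.cpp code): one Prefill call consumes the whole prompt, then each Decode call evaluates a single token and advances the position by one.

```cpp
#include <utility>
#include <vector>

// Simplified model of the (past, n_eval) pairs seen in the Annotation Markers:
// the Prefill call processes the whole prompt at once, then every Decode call
// handles exactly one token and advances the position.
std::vector<std::pair<int, int>> token_positions(int n_prompt, int n_decode) {
    std::vector<std::pair<int, int>> calls;
    calls.emplace_back(0, n_prompt);   // Prefill: past 0, n_eval n_prompt
    int past = n_prompt;
    for (int i = 0; i < n_decode; ++i) {
        calls.emplace_back(past, 1);   // Decode: one token at a time
        past += 1;
    }
    return calls;
}
```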
@@ -171,23 +175,28 @@ We can further investigate it with PMU event counters that are captured by Strea

At the Decode stage, the amount of computation is relatively small (each token takes less time), but the number of L3 cache refills and misses is much higher.
We can also monitor two related PMU events: Backend Stall Cycles and Backend Stall Cycles due to Memory stall.

![text#center](images/annotation_pmu_stall.png "Figure 11. Backend stall PMU event")

We can see that at the Prefill stage, Backend Stall Cycles due to Memory stall are only about 10% of the total Backend Stall Cycles. However, at the Decode stage, they are around 50% of the total.
All of these PMU event counters indicate that the workload is compute-bound at the Prefill stage and memory-bound at the Decode stage.
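The ratio behind this judgment is simple; the sketch below (with made-up counter values, not measured data) just computes the memory-stall share of total backend stalls:

```cpp
// The share of backend stall cycles attributable to memory stalls:
// a low share suggests compute-bound, a high share suggests memory-bound.
double memory_stall_share(double backend_stall_cycles,
                          double memory_stall_cycles) {
    return memory_stall_cycles / backend_stall_cycles;
}
```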

Now, let us further profile the code execution with Streamline. In the ‘Call Paths’ view of Streamline, we can see the percentage of running time of functions, organized in the form of a call stack.

![text#center](images/annotation_prefill_call_stack.png "Figure 12. Call stack")

In the ‘Functions’ view of Streamline, we can see the overall percentage of the running time of each function.

![text#center](images/annotation_prefill_functions.png "Figure 13. Functions view")

As we can see, the function graph_compute takes the largest portion of the running time, and within it large numbers of GEMM and GEMV operations take most of the time. With the Qwen1_5-0_5b-chat-q4_0 model:
* The computation (GEMM and GEMV) of the Q, K, V vectors and of most FFN layers: the weights use the Q4_0 data type and the input activations use the FP32 data type. The computation is forwarded to the KleidiAI trait by *ggml_cpu_extra_compute_forward*. KleidiAI ukernels implemented with NEON Dotprod and I8MM vector instructions are used to accelerate the computation.
  - At the Prefill stage, the *kai_run_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm* KleidiAI ukernel is used for GEMM (matrix multiply) operators. It takes advantage of the NEON I8MM instruction. Since the Prefill stage takes only a small percentage of the whole run time, the percentage of this function is small, as shown in the figures above. However, if we focus on the Prefill stage only, using the ‘Samplings’ view in the Timeline, we can see that *kai_run_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm* takes the largest portion of the whole Prefill stage.

![text#center](images/Prefill_only.png "Figure 14. Prefill only view")

  - At the Decode stage, the *kai_run_matmul_clamp_f32_qsi8d32p1x8_qsi4c32p4x8_1x4x32_neon_dotprod* KleidiAI ukernel is used for GEMV operators. It takes advantage of the NEON Dotprod instruction. If we focus on the Decode stage only, we can see that this function takes the second largest portion.

![text#center](images/Decode_only.png "Figure 15. Decode only view")

* There is a result_output linear layer in the Qwen1_5-0_5b-chat-q4_0 model whose weights use the Q6_K data type. The layer computes a huge [1, 1024] x [1024, 151936] GEMV operation, where 1024 is the embedding size and 151936 is the vocabulary size. This operation cannot be handled by KleidiAI yet, so it is handled by the ggml_vec_dot_q6_K_q8_K function in the ggml-cpu library.
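For a sense of scale, this single layer performs n_embd × n_vocab multiply-accumulates per generated token. The sketch below is a plain-float naive GEMV of the same shape, shown on tiny dimensions for clarity; it is an illustration only, since the real ggml_vec_dot_q6_K_q8_K operates on quantized Q6_K/Q8_K blocks, not floats.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Naive GEMV with the same shape as the result_output layer:
// y[1 x n_vocab] = x[1 x n_embd] * W[n_embd x n_vocab] (W row-major).
std::vector<float> gemv(const std::vector<float>& x,
                        const std::vector<float>& W,
                        int n_embd, int n_vocab) {
    std::vector<float> y(n_vocab, 0.0f);
    for (int i = 0; i < n_embd; ++i)
        for (int j = 0; j < n_vocab; ++j)
            y[j] += x[i] * W[static_cast<std::size_t>(i) * n_vocab + j];
    return y;
}

// Multiply-accumulate count for the full layer: n_embd * n_vocab.
std::int64_t result_output_macs() {
    return std::int64_t{1024} * 151936;  // dimensions quoted above
}
```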
