Commit 82d6fd4

Update Analyzing_token_generation_at_Prefill_and_Decode_stage.md
1 parent 93cf8e4 commit 82d6fd4

1 file changed (+2, −2 lines)


content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Analyzing_token_generation_at_Prefill_and_Decode_stage.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -181,7 +181,7 @@ By monitoring other PMU events, Backend Stall Cycles and Backend Stall Cycles du
 We can see that at Prefill stage, Backend Stall Cycles due to Memory stall are only about 10% of total Backend Stall Cycles. However, at Decode stage, Backend Stall Cycles due to Memory stall are around 50% of total Backend Stall Cycles.
 All those PMU event counters indicate that it is compute-bound at Prefill stage and memory-bound at Decode stage.
 
-Now, let us further profile the code execution with Streamline. In the ‘Call Paths’ view of Streamline, we can see the percentage of running time of functions that are orginized in form of call stack.
+Now, let us further profile the code execution with Streamline. In the ‘Call Paths’ view of Streamline, we can see the percentage of running time of functions that are organized in form of call stack.
 
 ![text#center](images/annotation_prefill_call_stack.png "Figure 12. Call stack")
 
```
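The ~10% vs ~50% comparison in the diff above is simply the ratio of memory-stall cycles to total backend-stall cycles. A minimal sketch with hypothetical counter values (the real numbers come from the PMU counters mentioned in the text):

```python
def memory_stall_fraction(backend_stall_cycles, backend_stall_mem_cycles):
    """Fraction of backend stall cycles attributable to memory stalls."""
    return backend_stall_mem_cycles / backend_stall_cycles

# Hypothetical counts, chosen only to mirror the ~10% / ~50% observation.
prefill = memory_stall_fraction(1_000_000, 100_000)   # compute-bound: 0.1
decode = memory_stall_fraction(1_000_000, 500_000)    # memory-bound: 0.5
print(prefill, decode)
```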
```diff
@@ -201,4 +201,4 @@ As we can see, the function, graph_compute, takes the largest portion of the run
 
 * There is a result_output linear layer in Qwen1_5-0_5b-chat-q4_0 model, the wights are with Q6_K data type. The layer computes a huge [1, 1024] x [1024, 151936] GEMV operation, where 1024 is the embedding size and 151936 is the vocabulary size. This operation cannot be handled by KleidiAI yet, it is handled by the ggml_vec_dot_q6_K_q8_K function in ggml-cpu library.
 * The tensor nodes for computation of Multi-Head attention are presented as three-dimension matrices with FP16 data type (KV cache also holds FP16 values), they are computed by ggml_vec_dot_f16 function in ggml-cpu library.
-* The computation of RoPE, Softmax, RMSNorm layers does not take significant portion of the running time.
+* The computation of RoPE, Softmax, RMSNorm layers does not take significant portion of the running time.
```
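As background for the result_output layer described in the diff above: a GEMV multiplies a 1×K vector by a K×N matrix, producing one logit per vocabulary entry. A pure-Python sketch with toy dimensions (the real shapes are 1×1024 times 1024×151936, and ggml performs this in quantized Q6_K/Q8_K arithmetic rather than floats):

```python
def gemv(x, w):
    # y[j] = sum_i x[i] * w[i][j]: a 1 x K vector times a K x N matrix.
    k, n = len(w), len(w[0])
    return [sum(x[i] * w[i][j] for i in range(k)) for j in range(n)]

# Toy stand-ins for embedding size 1024 and vocabulary size 151936.
x = [1.0, 2.0, 3.0]                         # 1 x 3 activation vector
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # 3 x 2 weight matrix
print(gemv(x, w))  # [4.0, 5.0]
```

At decode time this layer reads every weight once per generated token while doing only one multiply-add per weight, which is part of why the stage is memory-bound.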
