Commit 82d6fd4

Update Analyzing_token_generation_at_Prefill_and_Decode_stage.md
1 parent 93cf8e4 commit 82d6fd4

1 file changed (+2, −2 lines)


content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Analyzing_token_generation_at_Prefill_and_Decode_stage.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -181,7 +181,7 @@ By monitoring other PMU events, Backend Stall Cycles and Backend Stall Cycles du
 We can see that at Prefill stage, Backend Stall Cycles due to Memory stall are only about 10% of total Backend Stall Cycles. However, at Decode stage, Backend Stall Cycles due to Memory stall are around 50% of total Backend Stall Cycles.
 All those PMU event counters indicate that it is compute-bound at Prefill stage and memory-bound at Decode stage.
 
-Now, let us further profile the code execution with Streamline. In the ‘Call Paths’ view of Streamline, we can see the percentage of running time of functions that are orginized in form of call stack.
+Now, let us further profile the code execution with Streamline. In the ‘Call Paths’ view of Streamline, we can see the percentage of running time of functions that are organized in form of call stack.
 
 ![text#center](images/annotation_prefill_call_stack.png "Figure 12. Call stack")
 
```
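The ~10% vs ~50% comparison in the diff above is simply the ratio of memory-stall cycles to total backend-stall cycles. A minimal sketch with hypothetical counter values (the real numbers come from the PMU counters mentioned in the text):

```python
def memory_stall_fraction(backend_stall_cycles, backend_stall_mem_cycles):
    """Fraction of backend stall cycles attributable to memory stalls."""
    return backend_stall_mem_cycles / backend_stall_cycles

# Hypothetical counts, chosen only to mirror the ~10% / ~50% observation.
prefill = memory_stall_fraction(1_000_000, 100_000)   # compute-bound: 0.1
decode = memory_stall_fraction(1_000_000, 500_000)    # memory-bound: 0.5
print(prefill, decode)
```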
```diff
@@ -201,4 +201,4 @@ As we can see, the function, graph_compute, takes the largest portion of the run
 
 * There is a result_output linear layer in Qwen1_5-0_5b-chat-q4_0 model, the wights are with Q6_K data type. The layer computes a huge [1, 1024] x [1024, 151936] GEMV operation, where 1024 is the embedding size and 151936 is the vocabulary size. This operation cannot be handled by KleidiAI yet, it is handled by the ggml_vec_dot_q6_K_q8_K function in ggml-cpu library.
 * The tensor nodes for computation of Multi-Head attention are presented as three-dimension matrices with FP16 data type (KV cache also holds FP16 values), they are computed by ggml_vec_dot_f16 function in ggml-cpu library.
-* The computation of RoPE, Softmax, RMSNorm layers does not take significant portion of the running time.
+* The computation of RoPE, Softmax, RMSNorm layers does not take significant portion of the running time.
```
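As background for the result_output layer described in the diff above: a GEMV multiplies a 1×K vector by a K×N matrix, producing one logit per vocabulary entry. A pure-Python sketch with toy dimensions (the real shapes are 1×1024 times 1024×151936, and ggml performs this in quantized Q6_K/Q8_K arithmetic rather than floats):

```python
def gemv(x, w):
    # y[j] = sum_i x[i] * w[i][j]: a 1 x K vector times a K x N matrix.
    k, n = len(w), len(w[0])
    return [sum(x[i] * w[i][j] for i in range(k)) for j in range(n)]

# Toy stand-ins for embedding size 1024 and vocabulary size 151936.
x = [1.0, 2.0, 3.0]                         # 1 x 3 activation vector
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # 3 x 2 weight matrix
print(gemv(x, w))  # [4.0, 5.0]
```

At decode time this layer reads every weight once per generated token while doing only one multiply-add per weight, which is part of why the stage is memory-bound.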
