content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Analyzing_token_generation_at_Prefill_and_Decode_stage.md (12 additions, 41 deletions)
---
title: Analyze token generation at Prefill and Decode stage
weight: 4
### FIXED, DO NOT MODIFY
layout: learningpathall
---
# Analyze token generation at Prefill and Decode stage
To visualize token generation at the Prefill and Decode stages, the Annotation Marker feature of Streamline is used, and the marker-generation code is integrated into the llama.cpp project.
You can find more information about the Annotation Marker feature here: https://developer.arm.com/documentation/101816/9-7/Annotate-your-code?lang=en.
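The integration can be sketched as follows. `ANNOTATE_SETUP` and `ANNOTATE_MARKER_STR` are macros from Streamline's annotation header (streamline_annotate.h); the fallback stubs and helper names here (`HAVE_STREAMLINE_ANNOTATE`, `format_marker`, `decode_with_marker`) are hypothetical, added only so the sketch compiles standalone. The marker text mirrors the "past N, n_eval M" strings inspected later in this guide.

```c
#include <stdio.h>

/* When built against gator's annotate library, use the real macros;
 * otherwise fall back to no-op stubs so this sketch is self-contained. */
#ifdef HAVE_STREAMLINE_ANNOTATE
#include "streamline_annotate.h"
#else
#define ANNOTATE_SETUP          ((void)0)
#define ANNOTATE_MARKER_STR(s)  ((void)(s)) /* no-op stub */
#endif

/* Build the marker text for one llama_decode() call:
 * n_past  = position of the first input token,
 * n_eval  = number of input tokens to process in this call. */
int format_marker(char *buf, int cap, int n_past, int n_eval) {
    return snprintf(buf, cap, "past %d, n_eval %d", n_past, n_eval);
}

/* Emit a marker just before the decode call, so each marker in the
 * Streamline Timeline marks the start of one token generation. */
void decode_with_marker(int n_past, int n_eval) {
    char buf[64];
    format_marker(buf, sizeof buf, n_past, n_eval);
    ANNOTATE_MARKER_STR(buf);
    /* llama_decode(ctx, batch);  -- the actual decode call goes here */
}
```

With a 78-token prompt, the first marker reads "past 0, n_eval 78", matching the analysis below.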
When gatord is running on the target, it prints 'Gator ready'.
Then launch the Streamline application on your host PC and connect to the gatord running on your Arm64 target over either a TCP or an ADB connection. You can select the PMU events to be monitored at this point.
Click the 'Start Capture' button in Streamline to start collecting data from the Arm64 target.
After a while, you can stop the Streamline data collection by clicking the 'Stop' button.
## Analyze the data with Streamline
In the Timeline view of Streamline, we can see the Annotation Markers. Since an Annotation Marker is added before the llama_decode function, each marker indicates the start time of a token generation.

Checking the marker strings, the first token generation at the Prefill stage shows 'past 0, n_eval 78', which means the input token positions start at 0 and there are 78 input tokens to process.
We can see that the first token, generated at the Prefill stage, takes more time: all 78 input tokens must be processed there, which requires many GEMM operations. At the Decode stage, tokens are generated one by one at a roughly constant rate, and each token takes less time than at the Prefill stage thanks to the KV cache; the Decode stage mainly performs GEMV operations.
We can further investigate this with the PMU event counters captured by Streamline.
At the Decode stage, there is relatively less computation (each token takes less time), but the number of L3 cache refills/misses is much higher.
By monitoring two further PMU events, Backend Stall Cycles and Backend Stall Cycles due to Memory stall, we can see that at the Prefill stage, Backend Stall Cycles due to Memory stall are only about 10% of total Backend Stall Cycles, while at the Decode stage they are around 50%.
Together, these PMU event counters indicate that the workload is compute-bound at the Prefill stage and memory-bound at the Decode stage.
Now, let us profile the code execution further with Streamline. The 'Call Paths' view of Streamline shows the percentage of running time for each function, organized as a call stack.
As we can see, the graph_compute function takes the largest portion of the running time, showing that the bulk of the time goes into GEMM and GEMV operations. With the Qwen1_5-0_5b-chat-q4_0 model:
* The computation (GEMM and GEMV) of the Q, K, V vectors and most FFN layers: the weights use the Q4_0 data type and the input activations use FP32. The computation is forwarded to the KleidiAI trait by *ggml_cpu_extra_compute_forward*. KleidiAI ukernels implemented with NEON Dotprod and I8MM vector instructions accelerate the computation.
- At the Prefill stage, the *kai_run_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm* KleidiAI ukernel is used for GEMM (matrix multiply) operators. It takes advantage of the NEON I8MM instructions. Since the Prefill stage takes only a small percentage of the whole run, the percentage shown for this function in the figures above is small. However, if we focus on the Prefill stage only, using the 'Samplings' view in the Timeline, we can see that *kai_run_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm* takes the largest portion of the Prefill stage.

- At the Decode stage, the *kai_run_matmul_clamp_f32_qsi8d32p1x8_qsi4c32p4x8_1x4x32_neon_dotprod* KleidiAI ukernel is used for GEMV operators. It takes advantage of the NEON Dotprod instructions. If we focus on the Decode stage only, we can see this function takes the second largest portion.

* There is a result_output linear layer in the Qwen1_5-0_5b-chat-q4_0 model whose weights use the Q6_K data type. The layer computes a huge [1, 1024] x [1024, 151936] GEMV operation, where 1024 is the embedding size and 151936 is the vocabulary size. This operation is not handled by KleidiAI yet; it is handled by the ggml_vec_dot_q6_K_q8_K function in the ggml-cpu library.
* The tensor nodes for the Multi-Head Attention computation are represented as three-dimensional matrices with the FP16 data type (the KV cache also holds FP16 values); they are computed by the ggml_vec_dot_f16 function in the ggml-cpu library.
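As a back-of-envelope check of why the result_output layer is costly, a sketch (`gemv_macs` is a hypothetical helper; the shapes come from the model sizes quoted above):

```c
/* Sketch: cost of the result_output GEMV. A [1, 1024] x [1024, 151936]
 * GEMV is one dot product of length n_embd per vocabulary entry, so it
 * performs n_embd * n_vocab multiply-accumulates per generated token,
 * and each weight element is read from memory exactly once -- there is
 * no reuse, which is why this layer is memory-bound at Decode. */
long long gemv_macs(long long n_embd, long long n_vocab) {
    return n_embd * n_vocab; /* one MAC per weight element */
}
```

For the sizes above this is 1024 x 151936, roughly 155.6 million multiply-accumulates, repeated for every generated token.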
content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Deep_dive.md (10 additions, 21 deletions)

Then build the llama-cli executable, run llama-cli, and collect profiling data with Streamline.
## Analyze the data with Streamline
String annotations are displayed as text overlays inside the relevant channels in the details panel of the Timeline view, for example inside Channel 0 in the following screenshot.

Note that the operator name in the screenshot above was edited in manually. If you want Streamline to show the operator name instead of the channel number, the ANNOTATE_NAME_CHANNEL macro can be added to the ggml_graph_compute_thread function.
This annotation macro takes the channel number, an annotation group, and a name string. For example,

```
ANNOTATE_NAME_CHANNEL(0, 0, "MUL_MAT_GEMV");
ANNOTATE_NAME_CHANNEL(1, 0, "MUL_MAT_GEMM");
```

The code above sets the name of annotation channel 0 to 'MUL_MAT_GEMV' and the name of annotation channel 1 to 'MUL_MAT_GEMM'.
We can get more detail by zooming into the view.

When you move the cursor over the annotation channel, the tensor node name, the operation name, and the shape and size of the source tensor nodes are shown.
The screenshot above shows a GGML_OP_MUL_MAT operator of the FFN_UP node, whose source tensors have shapes [1024, 2816] and [1024, 68].
The view clearly shows that at the Prefill stage most of the time is spent on the MUL_MAT GEMM operations of the attention and FFN layers, plus one large MUL_MAT GEMV operation at the result_output linear layer. Other operators such as MUL, Softmax, Norm, and RoPE do not take significant time.
### View of individual operators at Decode stage
The annotation channel view at the Decode stage is shown below.
It shows that most of the time is spent on the MUL_MAT GEMV operations of the attention and FFN layers. Compared with the Prefill stage, there is no GEMM at those layers; GEMV operations are performed instead. The large MUL_MAT GEMV operation at the result_output linear layer takes a more significant portion of the time at the Decode stage, since each token generation is faster there thanks to the KV cache. This matches the percentage of execution time of the ggml_vec_dot_q6_K_q8_K function observed in the previous section.
Large Language Models (LLMs) run very smoothly on Arm CPUs, but the frameworks that run them are usually complex. Analyzing LLM execution and using profiling information for potential code optimization requires a good understanding of the transformer architecture and an appropriate analysis tool.
This guide uses the llama-cli application from llama.cpp and Arm's Streamline tool to analyze the efficiency of LLM inference on Arm CPUs.
The guide includes,
Understanding this guide requires prerequisite knowledge of the transformer architecture, llama.cpp, and Streamline.
We run the Qwen1_5-0_5b-chat-q4_0.gguf model with llama-cli on Arm64 Linux and use Streamline for analysis. This guide should also work on the Arm64 Android platform.
content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Introduction_to_llama_cpp.md (7 additions, 15 deletions)
# Introduction to llama.cpp
llama.cpp is an LLM framework implemented in C++ that can be used for both training and inference. This guide covers only inference on the CPU.
llama-cli provides a terminal interface for interacting with an LLM using the llama.cpp inference engine. It enables LLM inference, chat mode, and grammar-constrained generation directly from the command line.
* Build a compute graph according to the model structure. The compute graph can be divided into subgraphs that are assigned to the most suitable backend devices. At this step, the model structure is converted into a compute graph with many tensor nodes/operators (such as ADD, MUL_MAT, NORM, SOFTMAX) that can actually be computed.
Since this guide focuses only on running the LLM on the CPU, all operators are assigned to the CPU backend.
The steps above are wrapped in the function 'llama_decode'. At the Prefill and Decode stages, llama-cli calls 'llama_decode' repeatedly to generate tokens. However, the 'llama_batch' parameter passed to 'llama_decode' differs between the Prefill and Decode stages. 'llama_batch' includes information such as the input tokens, the number of input tokens, and the positions of the input tokens.
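The difference can be sketched as follows. This is a simplified illustration, not the real llama_batch definition; the struct and the helper names (`batch_sketch`, `prefill_batch`, `decode_batch`) are hypothetical, and only the shape of the batch matters here.

```c
/* Simplified sketch of how the batch passed to llama_decode differs
 * between the two stages. The real llama_batch carries more fields
 * (token ids, per-token positions, sequence ids). */
typedef struct {
    int n_tokens; /* number of tokens processed by this llama_decode call */
    int pos0;     /* position of the first token in the sequence */
} batch_sketch;

/* Prefill: the whole prompt in a single call, positions starting at 0. */
static batch_sketch prefill_batch(int n_prompt_tokens) {
    batch_sketch b = { n_prompt_tokens, 0 };
    return b;
}

/* Decode: one new token per call, placed after the n_past cached tokens. */
static batch_sketch decode_batch(int n_past) {
    batch_sketch b = { 1, n_past };
    return b;
}
```

With the 78-token prompt from the earlier example, Prefill passes n_tokens = 78 with positions starting at 0, and each subsequent Decode call passes a single token at position n_past.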
llama.cpp supports various backends, such as CPU, CUDA, and OpenCL.
For the CPU backend, it provides an optimized ggml-cpu library (mainly utilizing CPU vector instructions). For Arm CPUs, the ggml-cpu library also offers an aarch64 trait that leverages the new I8MM instructions for acceleration. The ggml-cpu library also integrates the Arm KleidiAI library as an additional trait.
Most autoregressive LLMs are decoder-only models. Here is a brief introduction to the Prefill and Decode stages of autoregressive LLMs.

At the Prefill stage, multiple input tokens of the prompt are processed together. This mainly performs GEMM operations (a matrix multiplied by another matrix) to generate the first output token.
At the Decode stage, by utilizing the KV cache, mainly GEMV operations (a vector multiplied by a matrix) are performed to generate subsequent output tokens one by one.
Therefore, the Prefill stage is compute-bound, while the Decode stage involves relatively less computation and is more memory-bound due to the large number of KV cache memory accesses. This can be seen in the subsequent analysis with Streamline.
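The core shapes behind the two stages can be sketched with naive reference kernels (illustrative only; real implementations use blocked, vectorized code):

```c
/* Sketch: Prefill multiplies an [m x k] activation matrix by a [k x n]
 * weight matrix (GEMM); Decode multiplies a single [1 x k] vector by the
 * same weights (GEMV). With m rows, each weight element loaded from
 * memory is reused m times in GEMM but only once in GEMV -- the drop in
 * reuse is what makes Decode memory-bound. Row-major layout. */
void gemm(int m, int n, int k, const float *a, const float *b, float *c) {
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int p = 0; p < k; p++)
                acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = acc;
        }
}

/* GEMV is just GEMM with a single activation row (m == 1). */
void gemv(int n, int k, const float *a, const float *b, float *c) {
    gemm(1, n, k, a, b, c);
}
```

For the 78-token prompt above, Prefill runs GEMMs with m = 78, while each Decode step runs GEMVs with m = 1 against the same weight matrices.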
content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/Multi_threads.md (1 addition, 1 deletion)

The entry point of a secondary thread is ggml_graph_compute_secondary_thread.
When computing one tensor node/operator in the compute graph, if the work size is big enough, llama.cpp splits the computation across the threads.
Here is an example of the MUL_MAT operator that demonstrates how the splitting is done.
In this example, the result matrix C is split equally between four threads; each thread computes a quarter of matrix C.
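A sketch of such a split (assumed even row-partitioning with a hypothetical helper; llama.cpp's actual chunking logic differs in detail):

```c
/* Sketch: assign rows of the result matrix C to nth threads. Thread ith
 * computes rows [*start, *end). Ceiling division ensures all rows are
 * covered even when nrows is not a multiple of nth. */
void rows_for_thread(int nrows, int nth, int ith, int *start, int *end) {
    int chunk = (nrows + nth - 1) / nth; /* rows per thread, rounded up */
    *start = ith * chunk;
    if (*start > nrows) *start = nrows;
    *end = *start + chunk;
    if (*end > nrows) *end = nrows;
}
```

With four threads and a 100-row result matrix, thread 0 gets rows 0-24 and thread 3 gets rows 75-99, matching the quarter-of-C split described above.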
The execution of multiple threads on the CPU cores can be observed with Streamline: the Core Map and Cluster Map modes in the Streamline Timeline view map threads to CPU cores.