Commit 2f3c546

Merge pull request #2402 from madeline-underwood/llamacpp
Llamacpp_JA to review
2 parents 4a15673 + 9106ee6 commit 2f3c546

7 files changed: +228 -229 lines changed

content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/1_overview.md

Lines changed: 6 additions & 13 deletions
@@ -6,21 +6,14 @@ weight: 2
 layout: learningpathall
 ---

-## Profiling LLMs on Arm CPUs with Streamline
+## Profile LLMs on Arm CPUs with Streamline

-Deploying Large Language Models (LLMs) on Arm CPUs provides a power-efficient and flexible solution. While larger models may benefit from GPU acceleration, techniques like quantization enable a wide range of LLMs to perform effectively on CPUs alone.
+Deploying Large Language Models (LLMs) on Arm CPUs provides a power-efficient and flexible solution for many applications. While larger models can benefit from GPU acceleration, techniques like quantization enable a wide range of LLMs to perform effectively on CPUs alone by reducing model precision to save memory.

-Frameworks such as [**llama.cpp**](https://github.com/ggml-org/llama.cpp), provide a convenient way to run LLMs, but it also comes with a certain level of complexity.
+Frameworks such as [llama.cpp](https://github.com/ggml-org/llama.cpp) provide a convenient way to run LLMs. However, understanding their performance characteristics requires specialized analysis tools. To optimize LLM execution on Arm platforms, you need both a basic understanding of transformer architectures and the right profiling tools to identify bottlenecks.

-To analyze their execution and use profiling insights for optimization, you need both a basic understanding of transformer architectures and the right analysis tools.
+This Learning Path demonstrates how to use `llama-cli` from the command line together with Arm Streamline to analyze the efficiency of LLM inference on Arm CPUs. You'll gain insights into token generation performance at both the Prefill and Decode stages. You'll also understand how individual tensor operations contribute to overall execution time, and evaluate multi-threaded performance across multiple CPU cores.

-This Learning Path demonstrates how to use `llama-cli` from the command line together with Arm Streamline to analyze the efficiency of LLM inference on Arm CPUs.
+You will run the Qwen1_5-0_5b-chat-q4_0.gguf model using `llama-cli` on Arm Linux and use Streamline for detailed performance analysis. The same methodology can also be applied on Android systems.

-You will learn how to:
-- Profile token generation at the Prefill and Decode stages
-- Profile execution of individual tensor nodes and operators
-- Profile LLM execution across multiple threads and cores
-
-You will run the `Qwen1_5-0_5b-chat-q4_0.gguf` model using `llama-cli` on Arm Linux and use Streamline for analysis.
-
-The same method can also be used on Android.
+By the end of this Learning Path, you'll understand how to profile LLM inference, identify performance bottlenecks, and analyze multi-threaded execution patterns on Arm CPUs.
Lines changed: 49 additions & 46 deletions
@@ -1,86 +1,89 @@
 ---
-title: Understand llama.cpp
+title: Explore llama.cpp architecture and the inference workflow
 weight: 3

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---

-## Understand llama.cpp
+## Key concepts and architecture overview

-llama.cpp is an open-source LLM framework implemented in C++ that supports both training and inference.
+llama.cpp is an open-source LLM framework implemented in C++ that supports both training and inference. This Learning Path focuses specifically on inference performance on Arm CPUs.

-This Learning Path focuses on inference on Arm CPUs.
+The `llama-cli` tool provides a command-line interface to run LLMs with the llama.cpp inference engine. It supports text generation, chat mode, and grammar-constrained output directly from the terminal.

-The `llama-cli` tool provides a command-line interface to run LLMs with the llama.cpp inference engine.
-It supports text generation, chat mode, and grammar-constrained output directly from the terminal.
+{{% notice Note %}}
+These are some key terms used in this Learning Path:
+- *Inference*: the process of generating text from a trained model
+- *GGUF format*: a file format optimized for storing and loading LLM models efficiently
+- *Tokenization*: converting text into numerical tokens that the model can process
+{{% /notice %}}

-![text#center](images/llama_structure.png "Figure 1. llama-cli Flow")
+## The llama-cli workflow

-### What does the Llama CLI do?
+The following diagram shows the high-level workflow of llama-cli during inference:

-Here are the steps performed by `llama-cli`:
+![Workflow diagram showing llama-cli inference pipeline with input prompt processing through model loading, tokenization, parallel Prefill stage, and sequential Decode stage for token generation alt-text#center](images/llama_structure.png "The llama-cli inference workflow")

-1. Load and interpret LLMs in GGUF format
+The workflow begins when you provide an input prompt to `llama-cli`. The tool loads the specified GGUF model file and tokenizes your prompt. It then processes the prompt through two distinct stages:

-2. Build a compute graph based on the model structure
+- Prefill stage: the entire prompt is processed in parallel to generate the first output token
+- Decode stage: additional tokens are generated sequentially, one at a time

-The graph can be divided into subgraphs, each assigned to the most suitable backend device, but in this Learning Path all operations are executed on the Arm CPU backend.
+This process continues until the model generates a complete response or reaches a stopping condition.

-3. Allocate memory for tensor nodes using the graph planner
+## How does llama-cli process requests?

-4. Execute tensor nodes in the graph during the `graph_compute` stage, which traverses nodes and forwards work to backend devices
+Here are the steps performed by `llama-cli` during inference:

-Steps 2 to 4 are wrapped inside the function `llama_decode`.
-During Prefill and Decode, `llama-cli` repeatedly calls `llama_decode` to generate tokens.
+- Load and interpret LLMs in GGUF format

-The parameter `llama_batch` passed to `llama_decode` differs between stages, containing input tokens, their count, and their positions.
+- Build a compute graph based on the model structure:
+  - A compute graph defines the mathematical operations required for inference
+  - The graph is divided into subgraphs to optimize execution across available hardware backends
+  - Each subgraph is assigned to the most suitable backend device; in this Learning Path, all subgraphs are assigned to the Arm CPU backend
+
+- Allocate memory for tensor nodes using the graph planner
+  - Tensor nodes represent data and operations in the compute graph

-### What are the components of llama.cpp?
+- Execute tensor nodes in the graph during the `graph_compute` stage
+  - This stage traverses nodes and forwards work to backend devices

-The components of llama.cpp include:
+The compute graph building and tensor node execution stages are wrapped inside the function `llama_decode`. During both Prefill and Decode stages, `llama-cli` repeatedly calls `llama_decode` to generate tokens. The parameter `llama_batch` passed to `llama_decode` differs between stages. It contains input tokens, their count, and their positions.
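
To make the compute graph and tensor node concepts above more concrete, here is a minimal, self-contained C++ sketch. It is illustrative only and is not llama.cpp or ggml source code: the `Node` struct and the `graph_compute` function are hypothetical stand-ins showing how a graph of tensor nodes can be executed by traversing the nodes and running each node's operator.

```cpp
// Conceptual sketch only: a miniature "compute graph" with tensor nodes,
// executed by traversing the nodes in order. This loosely mirrors the idea
// of a graph_compute stage; it is not llama.cpp or ggml source code.
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

struct Node {
    std::string name;                 // operator name, for example "matmul"
    std::vector<Node*> inputs;        // tensor nodes this node depends on
    std::vector<float> data;          // output buffer allocated up front
    std::function<void(Node&)> op;    // work to run for this node
};

// Traverse the graph and run each node's operator in dependency order.
static void graph_compute(std::vector<Node*>& graph) {
    for (Node* n : graph) {
        if (n->op) n->op(*n);
        std::printf("executed node: %s\n", n->name.c_str());
    }
}

int main() {
    // Leaf tensors: a 2x2 weight matrix and a 2-element input vector.
    Node weights{"weights", {}, {1.0f, 2.0f, 3.0f, 4.0f}, nullptr};
    Node input{"input", {}, {0.5f, -1.0f}, nullptr};

    // matmul node: y = W * x (2x2 matrix times 2-element vector).
    Node matmul{"matmul", {&weights, &input}, std::vector<float>(2, 0.0f),
                [](Node& n) {
                    const auto& W = n.inputs[0]->data;
                    const auto& x = n.inputs[1]->data;
                    for (int r = 0; r < 2; ++r)
                        n.data[r] = W[r * 2 + 0] * x[0] + W[r * 2 + 1] * x[1];
                }};

    std::vector<Node*> graph = {&weights, &input, &matmul};
    graph_compute(graph);

    std::printf("result: %.1f %.1f\n", matmul.data[0], matmul.data[1]);
    return 0;
}
```

In llama.cpp the equivalent traversal additionally forwards each node's work to a backend device (here, the Arm CPU backend), and it is triggered on every call to `llama_decode`.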

-![text#center](images/llama_components.jpg "Figure 2. llama.cpp components")
+## What are the components of llama.cpp?

-llama.cpp supports various backends such as `CPU`, `GPU`, `CUDA`, and `OpenCL`.
+The architecture of llama.cpp includes several key components that work together to provide efficient LLM inference, as shown in the diagram:

-For the CPU backend, it provides an optimized `ggml-cpu` library, mainly utilizing CPU vector instructions.
+![Architecture diagram showing llama.cpp components including backends, ggml-cpu library, and KleidiAI integration alt-text#center](images/llama_components.jpg "llama.cpp components")

-For Arm CPUs, the `ggml-cpu` library also offers an `aarch64` trait that leverages 8-bit integer multiply (i8mm) instructions for acceleration.
+llama.cpp provides optimized support for Arm CPUs through its `ggml-cpu` library, which leverages Arm-specific vector instructions such as NEON and SVE, and includes an AArch64 trait that accelerates inference using 8-bit integer multiply (i8mm) instructions. The `ggml-cpu` library also integrates the Arm [KleidiAI](https://github.com/ARM-software/kleidiai) library as an additional trait. In addition to Arm CPU support, llama.cpp offers backends for GPU, CUDA, and OpenCL to enable inference on a variety of hardware platforms.
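
As a concrete illustration of what the i8mm extension provides, the following C++ sketch calls the `vmmlaq_s32` intrinsic, which maps to the SMMLA 8-bit integer matrix multiply-accumulate instruction. This is not llama.cpp code; the build command and expected output are assumptions, and it only runs on Arm hardware that implements FEAT_I8MM.

```cpp
// Conceptual sketch: one SMMLA instruction multiplies a 2x8 block of int8
// values by the transpose of another 2x8 block, accumulating a 2x2 int32
// result. This is the kind of building block i8mm-accelerated kernels use.
// Assumed build command: g++ -O2 -march=armv8.2-a+i8mm i8mm_demo.cpp
#include <arm_neon.h>
#include <cstdint>
#include <cstdio>

int main() {
    const int8_t a_bytes[16] = {1, 2, 3, 4, 5, 6, 7, 8,
                                1, 1, 1, 1, 1, 1, 1, 1};
    const int8_t b_bytes[16] = {1, 0, 1, 0, 1, 0, 1, 0,
                                2, 2, 2, 2, 2, 2, 2, 2};

    int8x16_t a = vld1q_s8(a_bytes);     // 2x8 int8 matrix A
    int8x16_t b = vld1q_s8(b_bytes);     // 2x8 int8 matrix B
    int32x4_t acc = vdupq_n_s32(0);      // 2x2 int32 accumulator
    acc = vmmlaq_s32(acc, a, b);         // C += A * B^T (SMMLA)

    int32_t c[4];
    vst1q_s32(c, acc);
    // Expected (assuming row-major result lanes): [[16, 72], [4, 16]]
    std::printf("C = [[%d, %d], [%d, %d]]\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```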

-The `ggml-cpu` library also integrates the Arm [KleidiAI](https://github.com/ARM-software/kleidiai) library as an additional trait.
+## Prefill and Decode in autoregressive LLMs

-### Prefill and Decode in autoregressive LLMs
+An autoregressive LLM is a type of Large Language Model that generates text by predicting the next token based on all the previously-generated tokens. A token represents a word or word piece in the sequence.

-An autoregressive LLM is a type of Large Language Model that generates text by predicting the next token (word or word piece) in a sequence based on all the previously generated tokens.
+The term *autoregressive* means the model uses its own previous outputs as inputs for generating subsequent outputs, creating a sequential generation process. For example, when generating the sentence "The cat sat on the...", an autoregressive LLM takes the input prompt as context and predicts the next most likely token, such as "mat". The model then uses the entire sequence, including "mat", to predict the following token, and continues this process token by token until completion. This sequential nature is why autoregressive LLMs have two distinct computational phases: Prefill (processing the initial prompt) and Decode (generating tokens one by one).
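
The generation loop just described can be sketched in a few lines of C++. This is a toy illustration, not llama.cpp code: the `predict_next` lookup table is a hypothetical stand-in for a real model, but the loop shows the autoregressive pattern of feeding each generated token back in until a stopping condition is reached.

```cpp
// Conceptual sketch of autoregressive generation (illustrative only).
// The "model" is a tiny lookup table keyed on the two most recent tokens;
// a real LLM attends to the whole sequence, but the loop has the same
// shape: the prompt is available up front (Prefill), then each new token
// is fed back in to predict the following one (Decode).
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Stand-in for a trained model: maps "previous two tokens" -> next token.
static std::string predict_next(const std::vector<std::string>& seq) {
    static const std::map<std::string, std::string> table = {
        {"on the", "mat"}, {"the mat", "<eos>"}};
    std::string key = seq[seq.size() - 2] + " " + seq.back();
    auto it = table.find(key);
    return it != table.end() ? it->second : "<eos>";
}

int main() {
    // Prefill: the whole prompt is processed as one batch.
    std::vector<std::string> tokens = {"The", "cat", "sat", "on", "the"};

    // Decode: generate one token at a time, feeding each output back in.
    while (true) {
        std::string next = predict_next(tokens);
        if (next == "<eos>") break;   // stopping condition
        tokens.push_back(next);
    }

    for (const auto& t : tokens) std::printf("%s ", t.c_str());
    std::printf("\n");                // prints: The cat sat on the mat
    return 0;
}
```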

-The term "autoregressive" means the model uses its own previous outputs as inputs for generating subsequent outputs, creating a sequential generation process.
-
-For example, when generating the sentence "The cat sat on the", an autoregressive LLM:
-1. Takes the input prompt as context
-2. Predicts the next most likely token (e.g., "mat")
-3. Uses the entire sequence including "mat" to predict the following token
-4. Continues this process token by token until completion
-
-This sequential nature is why autoregressive LLMs have two distinct computational phases: Prefill (processing the initial prompt) and Decode (generating tokens one by one).
-
-Most autoregressive LLMs are Decoder-only models. This refers to the transformer architecture they use, which consists only of decoder blocks from the original Transformer paper. The alternatives to decoder-only models include encoder-only models used for tasks like classification and encoder-decoder models used for tasks like translation.
+Most autoregressive LLMs are decoder-only models. This refers to the transformer architecture, which consists only of decoder blocks from the original transformer paper. The alternatives to decoder-only models include encoder-only models used for tasks like classification and encoder-decoder models used for tasks like translation.

 Decoder-only models like LLaMA have become dominant for text generation because they are simpler to train at scale, can handle both understanding and generation tasks, and are more efficient for text generation.

-Here is a brief introduction to Prefill and Decode stages of autoregressive LLMs.
-![text#center](images/llm_prefill_decode.jpg "Figure 3. Prefill and Decode stages")
+This diagram introduces the idea of Prefill and Decode stages of autoregressive LLMs:
+![Diagram illustrating the two stages of autoregressive LLM inference: Prefill stage processing input tokens and Decode stage generating output tokens sequentially alt-text#center](images/llm_prefill_decode.jpg "Prefill and Decode stages")
+
+The Prefill stage is shown below, and as you can see, multiple input tokens of the prompt are processed simultaneously.

-At the Prefill stage, multiple input tokens of the prompt are processed.
+In the context of Large Language Models (LLMs), a *matrix* is a two-dimensional array of numbers representing data such as model weights or token embeddings, while a *vector* is a one-dimensional array often used to represent a single token or feature set.

-It mainly performs GEMM (a matrix is multiplied by another matrix) operations to generate the first output token.
+This stage mainly performs GEMM operations (General Matrix Multiply, where one matrix is multiplied by another matrix) to generate the first output token.

-![text#center](images/transformer_prefill.jpg "Figure 4. Prefill stage")
+![Diagram showing the Prefill stage processing multiple input tokens in parallel through transformer blocks using GEMM operations alt-text#center](images/transformer_prefill.jpg "Prefill stage")

-At the Decode stage, by utilizing the [KV cache](https://huggingface.co/blog/not-lain/kv-caching), it mainly performs GEMV (a vector is multiplied by a matrix) operations to generate subsequent output tokens one by one.
+At the Decode stage, the model utilizes the [KV cache](https://huggingface.co/blog/not-lain/kv-caching) (Key-Value cache, which stores attention information from previous tokens). This stage mainly performs GEMV operations (General Matrix-Vector multiply, where a vector is multiplied by a matrix) to generate subsequent output tokens one by one.

-![text#center](images/transformer_decode.jpg "Figure 5. Decode stage")
+![Diagram showing the Decode stage generating tokens one by one using KV cache and GEMV operations alt-text#center](images/transformer_decode.jpg "Decode stage")
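
To see why the two stages show up so differently in a profiler, the following self-contained C++ sketch (illustrative only, with made-up layer sizes) contrasts the Prefill GEMM with the per-token Decode GEMV: both read the same weight matrix, but Prefill amortizes each weight load over many tokens, while Decode reloads the weights for every generated token.

```cpp
// Conceptual sketch (not llama.cpp code) contrasting the two stages:
// Prefill multiplies a whole block of token vectors by a weight matrix
// (GEMM), while Decode multiplies a single token vector by the same
// weight matrix (GEMV). The weight matrix must be read from memory in
// both cases, which is why Decode tends to be memory-bound.
#include <cstdio>
#include <vector>

// C[M x N] = A[M x K] * B[K x N]; with M == 1 this degenerates to GEMV.
static void matmul(const std::vector<float>& A, const std::vector<float>& B,
                   std::vector<float>& C, int M, int K, int N) {
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) acc += A[m * K + k] * B[k * N + n];
            C[m * N + n] = acc;
        }
}

int main() {
    const int prompt_tokens = 32;   // tokens processed together in Prefill
    const int K = 1024, N = 1024;   // illustrative layer dimensions

    std::vector<float> weights(K * N, 0.01f);

    // Prefill: one GEMM over all prompt tokens.
    std::vector<float> prompt(prompt_tokens * K, 1.0f);
    std::vector<float> out_prefill(prompt_tokens * N);
    matmul(prompt, weights, out_prefill, prompt_tokens, K, N);

    // Decode: one GEMV per generated token.
    std::vector<float> token(K, 1.0f), out_decode(N);
    matmul(token, weights, out_decode, 1, K, N);

    // Both stages read the same K*N weights, but Prefill does
    // prompt_tokens times more arithmetic per weight loaded.
    std::printf("Prefill MACs: %lld\n", 1LL * prompt_tokens * K * N);
    std::printf("Decode  MACs per token: %lld\n", 1LL * K * N);
    return 0;
}
```

This difference in arithmetic per byte of weights loaded is the root of the compute-bound versus memory-bound distinction summarized below.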

-In summary, Prefill is compute-bound, dominated by large GEMM operations and Decode is memory-bound, dominated by KV cache access and GEMV operations.
+## Summary

-You will see this highlighted during the Streamline performance analysis.
+In this section, you learned about llama.cpp architecture and its inference workflow. The framework uses a two-stage process where the Prefill stage is compute-bound and dominated by large GEMM operations that process multiple tokens in parallel, while the Decode stage is memory-bound and dominated by KV cache access and GEMV operations that process one token at a time. You will see this distinction between Prefill and Decode stages reflected in the performance metrics and visualizations. In the next section, you'll integrate Streamline annotations into llama.cpp to enable detailed performance profiling of these stages.
