File: `content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/_index.md`
---
title: Distributed inference using llama.cpp

minutes_to_complete: 30

who_is_this_for: This introductory topic is for developers with some experience using llama.cpp who want to learn distributed inference.

learning_objectives:
    - Set up a main host and worker nodes with llama.cpp
    - Run a large quantized model (for example, Llama 3.1 405B) with distributed CPU inference on Arm machines

prerequisites:
    - Three AWS c8g.16xlarge instances with at least 2 TB of EBS storage
    - Python 3 installed on each instance
    - Access to Meta's gated repository for the Llama 3.1 model family and a Hugging Face token to download models
    - Familiarity with the Learning Path [Deploy a Large Language Model (LLM) chatbot with llama.cpp using KleidiAI on Arm servers](/learning-paths/servers-and-cloud-computing/llama-cpu)
File: `content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-1.md`
---
title: Convert model to GGUF and quantize
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Overview

This example runs on three AWS Graviton4 `c8g.16xlarge` instances. Each instance has 64 cores, 128 GB of RAM, and 2 TB of disk storage to store the downloaded and quantized model weights.

You will perform these steps in this Learning Path:

- Download and build `llama.cpp`, a C++ library for efficient CPU inference of LLaMA and similar large language models, optimized for local and embedded environments.
- Convert Meta's `safetensors` files to a single GGUF file.
- Quantize the 16-bit GGUF weights file to 4-bit weights.
- Load and run the model.

{{% notice Note %}}
The **Reading time** shown on the **Introduction** page does not include downloading, converting, and quantizing the model. These steps can take more than six hours. If you already have a quantized GGUF file, you can skip the download and quantization.
{{% /notice %}}
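As a preview, the convert and quantize steps can be sketched with the tools that ship with `llama.cpp`. The file paths and output names below are illustrative, and script flags can vary between llama.cpp versions:

```shell
# Convert Meta's safetensors checkpoint into a single 16-bit GGUF file.
# convert_hf_to_gguf.py lives in the root of the llama.cpp repository.
python3 convert_hf_to_gguf.py ./Llama-3.1-405B \
    --outfile model-f16.gguf --outtype f16

# Quantize the 16-bit GGUF file down to 4-bit (Q4_0) weights.
bin/llama-quantize model-f16.gguf model.gguf Q4_0
```

Both steps are disk- and memory-intensive for a 405B model, which is why the Note above budgets several hours for them.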

## Set up dependencies

Before you start, make sure you have permission to access Meta's [Llama 3.1 405B parameter model](https://huggingface.co/meta-llama/Llama-3.1-405B).

{{% notice Note %}}
You must repeat the install steps on each device. However, only run the download and quantization steps once as `llama.cpp` caches the tensors for reuse across devices.
{{% /notice %}}

## Create a virtual environment

```bash
apt update
apt install python3.12-venv
python3 -m venv myenv
source myenv/bin/activate
```

## Clone the llama.cpp repo and build dependencies

```bash
git clone https://github.com/ggerganov/llama.cpp
apt install -y cmake build-essential
cd llama.cpp
mkdir build-rpc
cd build-rpc
cmake .. -DGGML_RPC=ON -DLLAMA_BUILD_SERVER=ON
cmake --build . --config Release
```

The build output is placed in the `build-rpc/bin` directory.

Verify that the build succeeded by running the help command:

```bash
cd build-rpc
bin/llama-cli -h
```

## Download the model (single instance)

Install Hugging Face Hub in your virtual environment:
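A sketch of this step, assuming the `huggingface_hub` package and a Hugging Face token exported in the `HF_TOKEN` environment variable; the local directory name is an example:

```shell
# Install the Hugging Face Hub client inside the active virtual environment.
pip install huggingface_hub

# Download the gated model weights; requires approved access to the repo
# and a valid token (exported here as HF_TOKEN).
huggingface-cli download meta-llama/Llama-3.1-405B \
    --local-dir Llama-3.1-405B --token "$HF_TOKEN"
```

Run the download on the master node only; the other nodes receive tensors over RPC.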

## Overview of the cluster

`llama.cpp` is a C++ library that enables efficient inference of LLaMA and similar large language models on CPUs, optimized for local and embedded environments.

Just over a year before this Learning Path was published, Radoslav Gerganov's (rgerganov) RPC code was merged into `llama.cpp`. This feature enables distributed inference of large LLMs across multiple CPU-based machines, even when the models don't fit into the memory of a single machine.
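To see why a single instance cannot hold this model, a rough back-of-the-envelope calculation helps. The figure of roughly 4.5 bits per weight for Q4_0 (4-bit values plus per-block scale factors) is an approximation, not an exact specification:

```shell
# Estimate Q4_0 weight storage for a 405B-parameter model.
params=405          # parameters, in billions
tenth_bits=45       # ~4.5 bits/weight, scaled by 10 for integer math
model_gb=$(( params * tenth_bits / 8 / 10 ))
echo "Approximate Q4_0 model size: ${model_gb} GB"
echo "RAM on one c8g.16xlarge:     128 GB"
echo "RAM across three instances:  $(( 3 * 128 )) GB"
```

At roughly 227 GB of weights, plus KV cache and compute buffers, the model exceeds a single node's 128 GB but fits in the cluster's combined 384 GB.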

In this Learning Path, you'll explore how to run a 405B parameter model on Arm-based CPUs.

For this demonstration, the experimental setup includes:

- Number of instances: 3
- Instance type: `c8g.16xlarge`
- Model: `model.gguf` (Llama-3.1-405B_Q4_0)

One of the three nodes serves as the master node, which physically hosts the model file. The other two nodes act as worker nodes. In `llama.cpp`, remote procedure calls (RPC) offload both the model and the computation over TCP connections between nodes. The master node forwards inference requests to the worker nodes, where computation is performed.

## Set up the worker nodes

Choose two of the three devices to act as backend workers. If the devices have varying compute capacities, select the ones with the highest compute, especially for a 405B model. Because all three devices in this setup are identical, you can select any two to serve as backend workers.

Communication between the master node and the worker nodes occurs through a socket created on each worker. This socket listens for incoming data from the master, such as model parameters, tokens, hidden states, and other inference-related information.

{{% notice Note %}}
The RPC feature in `llama.cpp` is not secure by default, so you should never expose it to the open internet. To reduce this risk, ensure that the security groups for all your EC2 instances are configured to restrict access to trusted IPs or internal VPC traffic only. This prevents unauthorized access to the RPC endpoints.
{{% /notice %}}

Start the worker nodes with the following command:

```bash
bin/rpc-server -c -p 50052 -H 0.0.0.0 -t 64
```

## Review RPC server options

The following flags are available with the `rpc-server` command:

```output
-h, --help            show this help message and exit
...
-m MEM, --mem MEM     backend memory size (in MB)
-c, --cache           enable local file cache
```

Although setting the host to `0.0.0.0` might seem counterintuitive given the earlier security warning, it is acceptable here because the EC2 security groups are configured to block unintended or unauthorized access.
You can find the IP addresses of your AWS instances in the AWS console.

Verify communication with a worker node by running the following command on the master node:

```bash
telnet 172.31.110.11 50052
```

If the backend server is set up correctly, the output should look like:

```output
Trying 172.31.110.11...
Connected to 172.31.110.11.
Escape character is '^]'.
```
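The inference command below expects the worker endpoints in a shell variable. The variable name `worker_ips` and the second worker's address are illustrative; the first address matches the telnet check above:

```shell
# Comma-separated host:port list of the rpc-server endpoints.
# Substitute the private IPs of your own worker instances.
worker_ips="172.31.110.11:50052,172.31.110.12:50052"
echo "$worker_ips"
```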

Run distributed inference using `llama-cli`:

```bash
bin/llama-cli -m ../../model.gguf -p "Tell me a joke" -n 128 --rpc "$worker_ips" -ngl 999
```

{{% notice Note %}}
Loading tensors on the worker nodes can take up to 30 minutes. Pre-loaded tensors are a requested enhancement for llama.cpp.
{{% /notice %}}

## Understand the command flags

- `-n`: maximum number of output tokens
- `--rpc`: list of backend workers
- `-ngl`: number of layers to offload to backend workers (`999` offloads all layers)

{{% notice Note %}}At the time of publication, llama.cpp only supports up to 16 backend workers.{{% /notice %}}

## Review example output

```output
build: 5935 (2adf8d83) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for aarch64-linux-gnu
main: llama backend init
...
llama_perf_context_print: eval time     = 77429.95 ms / 127 runs
llama_perf_context_print: total time    = 79394.06 ms / 132 tokens
llama_perf_context_print: graphs reused = 0
```

That's it! You have successfully run the Llama 3.1 405B model on CPUs with the power of llama.cpp RPC functionality.

The following table provides a brief description of the metrics from `llama_perf`:

| Metric | Description |
|--------|-------------|
| sampling time | Time spent choosing next tokens using the sampling strategy (for example, top-k, top-p) |
| load time | Time required to load the model into memory and initialize weights and buffers |
| prompt eval time | Time to process the input prompt tokens before generation (fills KV cache) |
| eval time | Time to generate output tokens by forward-passing through the model |
| total time | Total time for both prompt processing and token generation (excludes model load) |
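From the `llama_perf` numbers shown above, you can derive the generation speed. This short awk sketch uses the eval figures (77429.95 ms for 127 generated tokens):

```shell
# Tokens per second = generated tokens / eval time in seconds.
awk 'BEGIN {
    eval_ms = 77429.95; tokens = 127
    printf "eval speed: %.2f tokens/s\n", tokens / (eval_ms / 1000.0)
}'
```

Roughly 1.6 tokens per second is modest, but this is a very large model running entirely on CPUs across three machines.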

## Run distributed inference with llama-server

Lastly, to set up an OpenAI-compatible API, you can use the `llama-server` functionality. The process is described in [Deploy a Large Language Model (LLM) chatbot with llama.cpp using KleidiAI on Arm servers](/learning-paths/servers-and-cloud-computing/llama-cpu) under the "Access the chatbot using the OpenAI-compatible API" section.
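A sketch of such an invocation is shown below; the `--host` and `--port` values are examples, and `$worker_ips` is the same worker list used with `llama-cli`:

```shell
# Serve the model with an OpenAI-compatible HTTP API, offloading
# layers to the RPC workers exactly as llama-cli does.
bin/llama-server -m ../../model.gguf --rpc "$worker_ips" -ngl 999 \
    --host 0.0.0.0 --port 8080
```

Clients can then point any OpenAI-compatible SDK at `http://<master-ip>:8080/v1`. As with `rpc-server`, keep the port restricted to trusted traffic via your security groups.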