Commit 2396b66

Content development review
1 parent 61aa16c commit 2396b66

4 files changed: 77 additions, 48 deletions

content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/_index.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ prerequisites:
 - Three AWS c8g.16xlarge instances with at least 2 TB of EBS storage
 - Python 3 installed on each instance
 - Access to Meta's gated repository for the Llama 3.1 model family and a Hugging Face token to download models
-- Familiarity with [Deploy a Large Language Model (LLM) chatbot with llama.cpp using KleidiAI on Arm servers](/learning-paths/servers-and-cloud-computing/llama-cpu)
+- Familiarity with the Learning Path [Deploy a Large Language Model (LLM) chatbot with llama.cpp using KleidiAI on Arm servers](/learning-paths/servers-and-cloud-computing/llama-cpu)
 - Familiarity with AWS

 author: Aryan Bhusari

content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-1.md

Lines changed: 10 additions & 10 deletions
@@ -1,5 +1,5 @@
 ---
-title: Convert model to GGUP and quantize
+title: Convert model to GGUF and quantize
 weight: 2

 ### FIXED, DO NOT MODIFY
@@ -12,22 +12,22 @@ This example runs on three AWS Graviton4 `c8g.16xlarge` instances. Each instance

 In this Learning Path, you will:

-1. Download Meta's [Llama 3.1 405B parameter model](https://huggingface.co/meta-llama/Llama-3.1-405B).
-2. Download and build `llama.cpp`, a C++ library for efficient CPU inference of LLaMA and similar large language models on CPUs, optimized for local and embedded environments.
-3. Convert Meta's `safetensors` files to a single GGUF file.
-4. Quantize the 16-bit GGUF weights file to 4-bit weights.
-5. Load and run the model.
+- Download Meta's [Llama 3.1 405B parameter model](https://huggingface.co/meta-llama/Llama-3.1-405B).
+- Download and build `llama.cpp`, a C++ library for efficient CPU inference of LLaMA and similar large language models on CPUs, optimized for local and embedded environments.
+- Convert Meta's `safetensors` files to a single GGUF file.
+- Quantize the 16-bit GGUF weights file to 4-bit weights.
+- Load and run the model.

 {{% notice Note %}}
-The **Reading time** shown on the Introduction page does not include downloading, converting, and quantizing the model. These steps can take more than six hours. If you already have a quantized GGUF file, you can skip the download and quantization.
+The **Reading time** shown on the **Introduction** page does not include downloading, converting, and quantizing the model. These steps can take more than six hours. If you already have a quantized GGUF file, you can skip the download and quantization.
 {{% /notice %}}

 ## Set up dependencies

 Before you start, make sure you have permission to access Meta's [Llama 3.1 405B parameter model](https://huggingface.co/meta-llama/Llama-3.1-405B).

 {{% notice Note %}}
-You must repeat the install steps on each device. However, only run the download and quantization steps once. `llama.cpp` will cache the tensors for reuse across devices.
+You must repeat the install steps on each device. However, only run the download and quantization steps once, as `llama.cpp` caches the tensors for reuse across devices.
 {{% /notice %}}

 ## Create a virtual environment
@@ -62,7 +62,7 @@ cd build-rpc
 bin/llama-cli -h
 ```

-## Download the model (single instance only)
+## Download the model (single instance)

 Install Hugging Face Hub in your virtual environment:

@@ -102,7 +102,7 @@ Run the script:
 python3 download.py
 ```

-## Convert and quantize the model (single instance only)
+## Convert and quantize the model (single instance)

 Install the conversion dependencies:

content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-2.md
Lines changed: 31 additions & 15 deletions
@@ -1,31 +1,46 @@
 ---
-title: Worker Node Configuration
+title: Configure the worker nodes
 weight: 3

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
-## Cluster overview
-llama.cpp is a C++ library that enables efficient inference of LLaMA and similar large language models on CPUs, optimized for local and embedded environments. Just over a year ago from the publication date of this article, rgerganov’s RPC code was merged into llama.cpp, enabling distributed inference of large LLMs across multiple CPU-based machines—even when the models don’t fit into the memory of a single machine. In this learning path, we’ll explore how to run a 405B parameter model on Arm-based CPUs.

-For the purposes of this demonstration, the following experimental setup will be used:
-- Total number of instances: 3
-- Instance type: c8g.16xlarge
-- Model: model.gguf (Llama-3.1-405B_Q4_0)
+## Overview of the cluster

-One of the three nodes will serve as the master node, which physically hosts the model file. The other two nodes will act as worker nodes. In llama.cpp, remote procedure calls (RPC) are used to offload both the model and the computation over TCP connections between nodes. The master node forwards inference requests to the worker nodes, where all the actual computation is performed.
+`llama.cpp` is a C++ library that enables efficient inference of LLaMA and similar large language models on CPUs, optimized for local and embedded environments.

-## Cluster setup
+Just over a year before this Learning Path was published, Radoslav Gerganov's (rgerganov) RPC code was merged into `llama.cpp`. This feature enables distributed inference of large LLMs across multiple CPU-based machines, even when the models don’t fit into the memory of a single machine.

-Choose two of the three devices to act as backend workers. If the devices had varying compute capacities, the ones with the highest compute should be selected—especially for a 405B model. However, since all three devices have identical compute capabilities in this case, you can select any two to serve as backend workers.
+In this Learning Path, you’ll explore how to run a 405B parameter model on Arm-based CPUs.
+
+For this demonstration, the experimental setup includes:
+
+- Number of instances: 3
+- Instance type: `c8g.16xlarge`
+- Model: `model.gguf` (Llama-3.1-405B_Q4_0)
+
+One of the three nodes serves as the master node, which physically hosts the model file. The other two nodes act as worker nodes. In `llama.cpp`, remote procedure calls (RPC) offload both the model and the computation over TCP connections between nodes. The master node forwards inference requests to the worker nodes, where computation is performed.
+
+## Set up the worker nodes
+
+Choose two of the three devices to act as backend workers. If the devices have varying compute capacities, select the ones with the highest compute, especially for a 405B model. Because all three devices in this setup are identical, you can select any two to serve as backend workers.
+
+Communication between the master node and the worker nodes occurs through a socket created on each worker. This socket listens for incoming data from the master, such as model parameters, tokens, hidden states, and other inference-related information.
+
+{{% notice Note %}}
+The RPC feature in `llama.cpp` is not secure by default, so you should never expose it to the open internet. To reduce this risk, ensure that the security groups for all your EC2 instances are configured to restrict access to trusted IPs or internal VPC traffic only. This prevents unauthorized access to the RPC endpoints.
+{{% /notice %}}
+
+Start the worker nodes with the following command:

-Communication between the master node and the worker nodes occurs through a socket created on each worker. This socket listens for incoming data from the master—such as model parameters, tokens, hidden states, and other inference-related information.
-{{% notice Note %}}The RPC feature in llama.cpp is not secure by default, so you should never expose it to the open internet. To mitigate this risk, ensure that the security groups for all your EC2 instances are properly configured—restricting access to only trusted IPs or internal VPC traffic. This helps prevent unauthorized access to the RPC endpoints.{{% /notice %}}
-Use the following command to start the listening on the worker nodes:
 ```bash
 bin/rpc-server -c -p 50052 -H 0.0.0.0 -t 64
 ```
-Below are the available flag options that can be used with the rpc-server functionality:
+
+## Review RPC server options
+
+The following flags are available with the `rpc-server` command:

 ```output
 -h, --help show this help message and exit
@@ -36,4 +51,5 @@ Below are the available flag options that can be used with the rpc-server functi
 -m MEM, --mem MEM backend memory size (in MB)
 -c, --cache enable local file cache
 ```
-Setting the host to 0.0.0.0 might seem counterintuitive given the earlier security warning, but it’s acceptable in this case because the security groups have been properly configured to block any unintended or unauthorized access.
+
+Although setting the host to `0.0.0.0` might seem counterintuitive given the earlier security warning, it is acceptable here because the EC2 security groups are configured to block unintended or unauthorized access.
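As a side note on the worker launch in this change: the `-t 64` thread count matches the 64 vCPUs of a `c8g.16xlarge`, but it need not be hardcoded. A minimal sketch, assuming `rpc-server` was built at `bin/rpc-server` as in the earlier steps; deriving the thread count from `nproc` is an illustrative choice, and the launch itself is left commented out:

```shell
# Sketch: compose the rpc-server launch command for a worker node.
# On a c8g.16xlarge, nproc reports 64, matching the -t 64 used above.
PORT=50052
THREADS=$(nproc)
CMD="bin/rpc-server -c -p $PORT -H 0.0.0.0 -t $THREADS"
echo "Would run: $CMD"
# On a real worker node, run the command directly instead:
# $CMD
```

Binding to `0.0.0.0` here carries the same caveat as above: it is only acceptable because the security groups restrict access.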

content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-3.md

Lines changed: 35 additions & 22 deletions
@@ -1,45 +1,54 @@
 ---
-title: Configuring Master Node
+title: Configure the master node
 weight: 4

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
-## Master node setup
-In this learning path, we will use the following two IP addresses for the worker nodes. Replace these with your own node IPs.
+
+## Set up the master node
+
+In this section, you configure the master node and verify communication with worker nodes before running distributed inference.
+
+Export the worker node IP addresses, replacing the example values with your own node IPs:

 ```bash
-export worker_ips = "172.31.110.11:50052,172.31.110.12:50052"
+export worker_ips="172.31.110.11:50052,172.31.110.12:50052"
 ```
+
 You can find the IP addresses of your AWS instances in the AWS console.

-You can verify communication with the worker nodes using the following command on master node:
+Verify communication with a worker node by running the following command on the master node:
+
 ```bash
 telnet 172.31.110.11 50052
 ```
-If the backend server is set up correctly, the output of the `telnet` command should look like the following:
-```bash
+If the backend server is set up correctly, the output should look like:
+
+```output
 Trying 172.31.110.11...
 Connected to 172.31.110.11.
 Escape character is '^]'.
 ```
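As an aside to this change: rather than running `telnet` against each worker by hand, the check can be scripted from the `worker_ips` list. A minimal sketch, using the example IPs from this step; the actual probe uses `nc` from netcat and is left commented out so it can be run on the master node:

```shell
# Sketch: split the comma-separated worker list into host:port pairs.
worker_ips="172.31.110.11:50052,172.31.110.12:50052"   # replace with your worker IPs
for w in $(echo "$worker_ips" | tr ',' ' '); do
    host=${w%:*}   # strip the :port suffix
    port=${w#*:}   # strip the host prefix
    echo "checking $host on port $port"
    # On the master node, probe the socket (requires netcat):
    # nc -z -w 3 "$host" "$port" && echo "$w reachable"
done
```

This uses the same `worker_ips` format that the `--rpc` flag expects, so one variable drives both the check and the inference command.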
-Finally, you can execute the following command, to execute distributed inference:
+Run distributed inference using `llama-cli`:
+
 ```bash
 bin/llama-cli -m ../../model.gguf -p "Tell me a joke" -n 128 --rpc "$worker_ips" -ngl 999
 ```

 {{% notice Note %}}
-It will take a significant amount of time (~30 minutes) to load the tensors on the worker nodes. Pre-loaded tensors are a current development request for llama.cpp.
+Loading tensors on the worker nodes can take up to 30 minutes. Pre-loaded tensors are a requested enhancement for llama.cpp.
 {{% /notice %}}
+
+## Understand the command flags

-Here are short definitions of the flags used in above command:
--n => Number of maximum output tokens
---rpc => list of backend workers
--ngl => Number of layers to be placed on backend workers (999 means offload all layers on workers)
+- `-n`: maximum number of output tokens
+- `--rpc`: list of backend workers
+- `-ngl`: number of layers to offload to backend workers (`999` offloads all layers)

 {{% notice Note %}}At the time of publication, llama.cpp only supports up to 16 backend workers.{{% /notice %}}

-The output:
+## Review example output
+
 ```output
 build: 5935 (2adf8d83) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for aarch64-linux-gnu
 main: llama backend init
@@ -195,18 +204,22 @@ llama_perf_context_print: eval time = 77429.95 ms / 127 runs ( 609
 llama_perf_context_print: total time = 79394.06 ms / 132 tokens
 llama_perf_context_print: graphs reused = 0
 ```
-That's it! You have successfully run the llama-3.1-8B model on CPUs with the power of llama.cpp RPC functionality. The following table provides brief description of the metrics from `llama_perf`:
+That's it! You have successfully run the Llama 3.1 405B model on CPUs with the power of llama.cpp RPC functionality.
+
+The following table provides a brief description of the metrics from `llama_perf`:


-| Log Line | Description |
+| Log line | Description |
 |-------------------|-----------------------------------------------------------------------------|
-| sampling time | Time spent choosing next tokens using sampling strategy (e.g., top-k, top-p). |
-| load time | Time to load the model into memory and initialize weights/buffers. |
-| prompt eval time | Time to process the input prompt tokens before generation (fills KV cache). |
-| eval time | Time to generate output tokens by forward-passing through the model. |
-| total time | Total time for both prompt processing and token generation (excludes model load). |
+| sampling time | Time spent choosing next tokens using the sampling strategy (for example, top-k, top-p) |
+| load time | Time required to load the model into memory and initialize weights and buffers |
+| prompt eval time | Time to process the input prompt tokens before generation (fills KV cache) |
+| eval time | Time to generate output tokens by forward-passing through the model |
+| total time | Total time for both prompt processing and token generation (excludes model load) |
+
+## Run distributed inference with llama-server

-Lastly to set up OpenAI compatible API, you can use the `llama-server` functionality. The process of implementing this is described [here](/learning-paths/servers-and-cloud-computing/llama-cpu) under the "Access the chatbot using the OpenAI-compatible API" section. Here is a snippet, for how to set up llama-server for distributed inference:
+Lastly, to set up an OpenAI-compatible API, you can use the `llama-server` functionality. The process is described in the "Access the chatbot using the OpenAI-compatible API" section of [this Learning Path](/learning-paths/servers-and-cloud-computing/llama-cpu). Here is a snippet showing how to set up `llama-server` for distributed inference:
 ```bash
 bin/llama-server -m ../../model.gguf --port 8080 --rpc "$worker_ips" -ngl 99
 ```
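Once `llama-server` is listening, its OpenAI-compatible endpoint can be exercised with `curl`. A hedged sketch: the port matches the command above, the `max_tokens` value mirrors the `-n 128` used earlier, and the actual call is left commented out so it can be run on the master node once the server is up:

```shell
# Sketch: compose a chat-completion request for the OpenAI-compatible API.
URL="http://localhost:8080/v1/chat/completions"
PAYLOAD='{"messages":[{"role":"user","content":"Tell me a joke"}],"max_tokens":128}'
echo "POST $URL"
echo "$PAYLOAD"
# On the master node, once llama-server is running:
# curl -s "$URL" -H "Content-Type: application/json" -d "$PAYLOAD"
```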
