content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/_index.md (+1 -1)
@@ -10,7 +10,7 @@ learning_objectives:
 - Run a large quantized model (for example, Llama 3.1 405B) with distributed CPU inference on Arm machines

 prerequisites:
-- Three AWS c8g.16xlarge instances with at least 2 TB of EBS storage
+- Three AWS c8g.4xlarge instances with at least 500 GB of EBS storage
 - Python 3 installed on each instance
 - Access to Meta's gated repository for the Llama 3.1 model family and a Hugging Face token to download models
 - Familiarity with the Learning Path [Deploy a Large Language Model (LLM) chatbot with llama.cpp using KleidiAI on Arm servers](/learning-paths/servers-and-cloud-computing/llama-cpu)
content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-1.md (+9 -10)
@@ -8,23 +8,23 @@ layout: learningpathall
 ## Overview

-This example runs on three AWS Graviton4 `c8g.16xlarge` instances. Each instance has 64 cores, 128 GB of RAM, and 2 TB of disk storage to store the downloaded and quantized model weights.
+This example runs on three AWS Graviton4 `c8g.4xlarge` instances. Each instance has 16 cores, 32 GB of RAM, and 200 GB of disk storage to store the downloaded and quantized model weights.
 - Download and build `llama.cpp`, a C++ library for efficient inference of LLaMA and similar large language models on CPUs, optimized for local and embedded environments.
 - Convert Meta's `safetensors` files to a single GGUF file.
 - Quantize the 16-bit GGUF weights file to 4-bit weights.
 - Load and run the model.

 {{% notice Note %}}
-The **Reading time** shown on the **Introduction** page does not include downloading, converting, and quantizing the model. These steps can take more than six hours. If you already have a quantized GGUF file, you can skip the download and quantization.
+The **Reading time** shown on the **Introduction** page does not include downloading, converting, and quantizing the model. These steps can take 1-2 hours. If you already have a quantized GGUF file, you can skip the download and quantization.
 {{% /notice %}}

 ## Set up dependencies

-Before you start, make sure you have permission to access Meta's [Llama 3.1 405B parameter model](https://huggingface.co/meta-llama/Llama-3.1-405B).
+Before you start, make sure you have permission to access Meta's [Llama 3.1 70B parameter model](https://huggingface.co/meta-llama/Llama-3.1-70B).

 {{% notice Note %}}
 You must repeat the install steps on each device. However, only run the download and quantization steps once as `llama.cpp` caches the tensors for reuse across devices.
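For reference, the conversion and quantization steps listed in the overview above typically map to commands along the following lines. This is only a sketch, not part of the change set: the repository path, output file names, and quantization type are assumptions and depend on the llama.cpp version you build.

```bash
# Sketch: convert the downloaded safetensors checkpoint to a single 16-bit GGUF
# file, then quantize it to 4-bit (Q4_0) weights. All paths are placeholders.
cd llama.cpp
python3 convert_hf_to_gguf.py ../llama-hf --outfile ../llama-3.1-70B-F16.gguf --outtype f16
build-rpc/bin/llama-quantize ../llama-3.1-70B-F16.gguf ../model.gguf Q4_0
```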
@@ -34,7 +34,7 @@ You must repeat the install steps on each device. However, only run the download
 ```bash
 apt update
-apt install python3.12-venv
+apt install -y python3.12-venv
 python3 -m venv myenv
 source myenv/bin/activate
 ```
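The hunk above only touches the Python virtual environment step. On a fresh instance you also need the usual build tools before compiling llama.cpp; this is an assumption based on a standard Ubuntu setup and is not part of this diff.

```bash
# Assumed build prerequisites for compiling llama.cpp from source (not shown in this diff).
apt install -y build-essential cmake git
```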
@@ -58,7 +58,6 @@ The build output is placed in the `build-rpc/bin` directory.
 Verify that the build succeeded by running the help command:

 ```bash
-cd build-rpc
 bin/llama-cli -h
 ```
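The build commands that produce `build-rpc/bin` sit outside the hunks shown here. Based on the llama.cpp RPC documentation, an RPC-enabled build typically looks like the sketch below; treat it as an assumption, since the CMake flag has changed name across releases (older versions used `-DLLAMA_RPC=ON`).

```bash
# Sketch: configure and build llama.cpp with the RPC backend, placing
# binaries under build-rpc/bin as referenced above.
cmake -B build-rpc -DGGML_RPC=ON
cmake --build build-rpc --config Release -j"$(nproc)"
```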
@@ -73,6 +72,7 @@ pip3 install huggingface_hub
 Create a new Python file named `download.py`:

 ```bash
+cd ../..
 vi download.py
 ```
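As an aside, the same gated-model download can also be done without a helper script by using the Hugging Face CLI. This alternative is not part of the Learning Path or of this diff; the repository ID and target directory below simply mirror the values used in `download.py`.

```bash
# Alternative to download.py: authenticate and fetch the model with the Hugging Face CLI.
huggingface-cli login            # paste your Hugging Face access token when prompted
huggingface-cli download meta-llama/Llama-3.1-70B --local-dir llama-hf
```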
@@ -81,8 +81,7 @@ Add the following code:
 ```python
 import os
 from huggingface_hub import snapshot_download
-
-model_id = "meta-llama/Llama-3.1-405B"
+model_id = "meta-llama/Llama-3.1-70B"
 local_dir = "llama-hf"

 # Create the directory if it doesn't exist
@@ -120,10 +119,10 @@ Quantize the model to 4-bit weights:
content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-2.md (+5 -5)
@@ -12,19 +12,19 @@ layout: learningpathall
 Just over a year before this Learning Path was published, Radoslav Gerganov's (rgerganov) RPC code was merged into `llama.cpp`. This feature enables distributed inference of large LLMs across multiple CPU-based machines, even when the models don’t fit into the memory of a single machine.

-In this Learning Path, you’ll explore how to run a 405B parameter model on Arm-based CPUs.
+In this Learning Path, you’ll explore how to run a 70B parameter model on Arm-based CPUs.

 For this demonstration, the experimental setup includes:

--Number of instances: 3
-- Instance type: `c8g.16xlarge`
-- Model: `model.GGUF` (Llama-3.1-405B_Q4_0)
+- Total number of instances: 3
+- Instance type: `c8g.4xlarge`
+- Model: `model.gguf` (Llama-3.1-70B_Q4_0, ~38 GB when quantized to 4 bits)
 One of the three nodes serves as the master node, which physically hosts the model file. The other two nodes act as worker nodes. In `llama.cpp`, remote procedure calls (RPC) offload both the model and the computation over TCP connections between nodes. The master node forwards inference requests to the worker nodes, where computation is performed.

 ## Set up the worker nodes

-Choose two of the three devices to act as backend workers. If the devices have varying compute capacities, select the ones with the highest compute, especially for a 405B model. Because all three devices in this setup are identical, you can select any two to serve as backend workers.
+Choose two of the three devices to act as backend workers. If the devices have varying compute capacities, select the ones with the highest compute. Because all three devices in this setup are identical, you can select any two to serve as backend workers.

 Communication between the master node and the worker nodes occurs through a socket created on each worker. This socket listens for incoming data from the master, such as model parameters, tokens, hidden states, and other inference-related information.
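The commands that implement this worker/master split fall outside the hunks shown here. Based on the llama.cpp RPC documentation, the setup described above is typically driven as in the sketch below; the addresses, port, prompt, and model path are placeholders, and exact flag names can vary between llama.cpp versions.

```bash
# Sketch: on each of the two worker nodes, start an RPC backend that listens
# for the master (bind address and port are placeholders).
build-rpc/bin/rpc-server --host 0.0.0.0 --port 50052

# Sketch: on the master node, which hosts model.gguf, point llama-cli at both
# workers so model data and computation are offloaded over TCP.
build-rpc/bin/llama-cli -m model.gguf \
  --rpc 192.168.1.101:50052,192.168.1.102:50052 \
  -p "Tell me a joke" -n 128
```

Start the workers before launching the command on the master, since the master connects to the worker sockets at startup.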
+A: He had a little lamb.
+Q: What do you get if you cross an elephant and a rhinoceros?
+Q: What's the difference between a cat and a comma?
+A:
+
+llama_perf_sampler_print: sampling time = 5.42 ms / 74 runs ( 0.07 ms per token, 13643.07 tokens per second)
+llama_perf_context_print: load time = 489542.78 ms
+llama_perf_context_print: prompt eval time = 1854.82 ms / 10 tokens ( 185.48 ms per token, 5.39 tokens per second)
+llama_perf_context_print: eval time = 36101.93 ms / 63 runs ( 573.05 ms per token, 1.75 tokens per second)
+llama_perf_context_print: total time = 37989.35 ms / 73 tokens
+llama_perf_context_print: graphs reused = 60
 ```
-That's it! You have successfully run the llama-3.1-8B model on CPUs with the power of llama.cpp RPC functionality.
+
+That's it! You have successfully run the llama-3.1-70B model on CPUs with the power of llama.cpp RPC functionality.
 The following table provides a brief description of the metrics from `llama_perf`:
@@ -215,16 +218,4 @@ The following table provides brief description of the metrics from `llama_perf`:
 | load time | Time required to load the model into memory and initialize weights and buffers |
 | prompt eval time | Time to process the input prompt tokens before generation (fills KV cache) |
 | eval time | Time to generate output tokens by forward-passing through the model |
-| total time | Total time for both prompt processing and token generation (excludes model load) |
-
-## Run distributed inference with llama-server
-
-Lastly, to set up OpenAI compatible API, you can use the `llama-server` functionality. The process of implementing this is described [here](/learning-paths/servers-and-cloud-computing/llama-cpu) under the "Access the chatbot using the OpenAI-compatible API" section. Here is a snippet, for how to set up llama-server for distributed inference:
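The removed paragraph above refers to a llama-server snippet that does not appear in this truncated diff. For orientation only, serving the same distributed setup through an OpenAI-compatible endpoint on the master node generally looks like the sketch below; the addresses and ports are placeholders, not the Learning Path's original snippet.

```bash
# Sketch: serve an OpenAI-compatible API from the master node while offloading
# computation to the two RPC workers (placeholder addresses and ports).
build-rpc/bin/llama-server -m model.gguf \
  --rpc 192.168.1.101:50052,192.168.1.102:50052 \
  --host 0.0.0.0 --port 8080
```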