Commit 941ef70

Merge pull request #2243 from JoeStech/model-edits-distributed-inference

Distributed inference LP changes

2 parents: c4ae2b4 + f30aee8

File tree: 4 files changed, +76 −86 lines

content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/_index.md

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ learning_objectives:
 - Run a large quantized model (for example, Llama 3.1 405B) with distributed CPU inference on Arm machines
 
 prerequisites:
-- Three AWS c8g.16xlarge instances with at least 2 TB of EBS storage
+- Three AWS c8g.4xlarge instances with at least 500 GB of EBS storage
 - Python 3 installed on each instance
 - Access to Meta's gated repository for the Llama 3.1 model family and a Hugging Face token to download models
 - Familiarity with the Learning Path [Deploy a Large Language Model (LLM) chatbot with llama.cpp using KleidiAI on Arm servers](/learning-paths/servers-and-cloud-computing/llama-cpu)
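
The prerequisites call for a Hugging Face token with access to Meta's gated repository. A minimal sketch of supplying that token before the download step, assuming it is exported as `HF_TOKEN` (the variable name is an assumption, not part of the Learning Path):

```bash
# Authenticate once per instance so snapshot_download can reach the gated
# meta-llama/Llama-3.1-70B repository. HF_TOKEN is assumed to hold a token
# that has been granted access.
huggingface-cli login --token "$HF_TOKEN"
```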

content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-1.md

Lines changed: 9 additions & 10 deletions
@@ -8,23 +8,23 @@ layout: learningpathall
 
 ## Overview
 
-This example runs on three AWS Graviton4 `c8g.16xlarge` instances. Each instance has 64 cores, 128 GB of RAM, and 2 TB of disk storage to store the downloaded and quantized model weights.
+This example runs on three AWS Graviton4 `c8g.4xlarge` instances. Each instance has 16 cores, 32 GB of RAM, and 200 GB of disk storage to store the downloaded and quantized model weights.
 
 In this Learning Path, you will:
 
-- Download Meta's [Llama 3.1 405B parameter model](https://huggingface.co/meta-llama/Llama-3.1-405B).
+- Download Meta's [Llama 3.1 70B parameter model](https://huggingface.co/meta-llama/Llama-3.1-70B).
 - Download and build `llama.cpp`, a C++ library for efficient CPU inference of LLaMA and similar large language models on CPUs, optimized for local and embedded environments.
 - Convert Meta's `safetensors` files to a single GGUF file.
 - Quantize the 16-bit GGUF weights file to 4-bit weights.
 - Load and run the model.
 
 {{% notice Note %}}
-The **Reading time** shown on the **Introduction** page does not include downloading, converting, and quantizing the model. These steps can take more than six hours. If you already have a quantized GGUF file, you can skip the download and quantization.
+The **Reading time** shown on the **Introduction** page does not include downloading, converting, and quantizing the model. These steps can take 1-2 hours. If you already have a quantized GGUF file, you can skip the download and quantization.
 {{% /notice %}}
 
 ## Set up dependencies
 
-Before you start, make sure you have permission to access Meta's [Llama 3.1 405B parameter model](https://huggingface.co/meta-llama/Llama-3.1-405B).
+Before you start, make sure you have permission to access Meta's [Llama 3.1 70B parameter model](https://huggingface.co/meta-llama/Llama-3.1-70B).
 
 {{% notice Note %}}
 You must repeat the install steps on each device. However, only run the download and quantization steps once as `llama.cpp` caches the tensors for reuse across devices.
@@ -34,7 +34,7 @@ You must repeat the install steps on each device. However, only run the download
 
 ```bash
 apt update
-apt install python3.12-venv
+apt install -y python3.12-venv
 python3 -m venv myenv
 source myenv/bin/activate
 ```
@@ -58,7 +58,6 @@ The build output is placed in the `build-rpc/bin` directory.
 Verify that the build succeeded by running the help command:
 
 ```bash
-cd build-rpc
 bin/llama-cli -h
 ```
 
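The `build-rpc` directory referenced above comes from building `llama.cpp` with the RPC backend enabled. A minimal sketch of such a build, assuming CMake and a C/C++ toolchain are installed (the exact commands are not part of this diff and follow the upstream `llama.cpp` RPC documentation):

```bash
# Build llama.cpp with the RPC backend so llama-cli, llama-quantize, and
# rpc-server are produced under build-rpc/bin.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build-rpc -DGGML_RPC=ON
cmake --build build-rpc --config Release -j"$(nproc)"
```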
@@ -73,6 +72,7 @@ pip3 install huggingface_hub
 Create a new Python file named `download.py`:
 
 ```bash
+cd ../..
 vi download.py
 ```
 
@@ -81,8 +81,7 @@ Add the following code:
 ```python
 import os
 from huggingface_hub import snapshot_download
-
-model_id = "meta-llama/Llama-3.1-405B"
+model_id = "meta-llama/Llama-3.1-70B"
 local_dir = "llama-hf"
 
 # Create the directory if it doesn't exist
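
Between the download script above and the quantization step below, the downloaded `safetensors` checkpoint still has to be converted to a single 16-bit GGUF file. A sketch of those two steps, assuming the converter script shipped with `llama.cpp` and the output filename that the `llama-quantize` command in this commit expects:

```bash
# Run the download script created above (writes the weights into ./llama-hf).
python3 download.py

# Convert the safetensors checkpoint to one F16 GGUF file; the output path
# matches the file passed to llama-quantize later in this diff.
python3 llama.cpp/convert_hf_to_gguf.py llama-hf \
  --outfile llama-hf/Llama-3.1-70B-F16.gguf --outtype f16
```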
@@ -120,10 +119,10 @@ Quantize the model to 4-bit weights:
 
 ```bash
 cd llama.cpp/build-rpc
-bin/llama-quantize ../../llama-hf/llama-3.1-405B-F16.GGUF Q4_0
+bin/llama-quantize ../../llama-hf/Llama-3.1-70B-F16.gguf Q4_0
 ```
 
-You can rename the output file to `model.GGUF` for easier use.
+You can rename the output file to `model.gguf` for easier use.
 
 Check available quantization options:
 
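For the rename suggested above, a plain `mv` is enough; the source filename below is an assumption (when no output path is given, `llama-quantize` prints the file it wrote, so substitute that path):

```bash
# List the supported quantization types (Q4_0, Q4_K_M, Q8_0, and so on).
bin/llama-quantize --help

# Rename the quantized file for easier use. The source name is an assumed
# default; use the path that llama-quantize actually reported.
mv ../../llama-hf/ggml-model-Q4_0.gguf ../../model.gguf
```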
content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-2.md

Lines changed: 5 additions & 5 deletions
@@ -12,19 +12,19 @@ layout: learningpathall
 
 Just over a year before this Learning Path was published, Radoslav Gerganov's (rgerganov) RPC code was merged into `llama.cpp`. This feature enables distributed inference of large LLMs across multiple CPU-based machines, even when the models don’t fit into the memory of a single machine.
 
-In this Learning Path, you’ll explore how to run a 405B parameter model on Arm-based CPUs.
+In this Learning Path, you’ll explore how to run a 70B parameter model on Arm-based CPUs.
 
 For this demonstration, the experimental setup includes:
 
-- Number of instances: 3
-- Instance type: `c8g.16xlarge`
-- Model: `model.GGUF` (Llama-3.1-405B_Q4_0)
+- Total number of instances: 3
+- Instance type: c8g.4xlarge
+- Model: model.gguf (Llama-3.1-70B_Q4_0, ~38GB when quantized to 4 bits)
 
 One of the three nodes serves as the master node, which physically hosts the model file. The other two nodes act as worker nodes. In `llama.cpp`, remote procedure calls (RPC) offload both the model and the computation over TCP connections between nodes. The master node forwards inference requests to the worker nodes, where computation is performed.
 
 ## Set up the worker nodes
 
-Choose two of the three devices to act as backend workers. If the devices have varying compute capacities, select the ones with the highest compute, especially for a 405B model. Because all three devices in this setup are identical, you can select any two to serve as backend workers.
+Choose two of the three devices to act as backend workers. If the devices have varying compute capacities, select the ones with the highest compute. Because all three devices in this setup are identical, you can select any two to serve as backend workers.
 
 Communication between the master node and the worker nodes occurs through a socket created on each worker. This socket listens for incoming data from the master, such as model parameters, tokens, hidden states, and other inference-related information.
 
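The socket described above is the `llama.cpp` RPC backend. A minimal sketch of starting it on each worker node, assuming the `rpc-server` binary from the same `build-rpc` tree and the port (50052) used elsewhere in this commit; check `bin/rpc-server --help` for the exact flags in your build:

```bash
# On each of the two worker nodes: start the RPC backend so the master can
# offload model shards and computation to it over TCP.
cd llama.cpp/build-rpc
# -H binds all interfaces so the master can connect; flag assumed from the
# upstream RPC example, verify with --help.
bin/rpc-server -H 0.0.0.0 -p 50052
```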
content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-3.md

Lines changed: 61 additions & 70 deletions
@@ -33,11 +33,11 @@ Escape character is '^]'.
 Run distributed inference using `llama-cli`:
 
 ```bash
-bin/llama-cli -m ../../model.gguf -p "Tell me a joke" -n 128 --rpc "$worker_ips" -ngl 999
+bin/llama-cli -m ../../model.gguf -p "Here's a knock knock joke for kids:" -n 128 --rpc "$worker_ips" -ngl 999
 ```
 
 {{% notice Note %}}
-Loading tensors on the worker nodes can take up to 30 minutes. Pre-loaded tensors are a requested enhancement for llama.cpp.
+It will take a significant amount of time (~10 minutes) for inference to run.
 {{% /notice %}}
 ## Understand the command flags
 
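The `--rpc` flag in the command above takes a comma-separated list of `host:port` pairs, so `$worker_ips` is assumed to have been defined beforehand on the master node, for example with the worker addresses that appear in the output below:

```bash
# Comma-separated host:port pairs of the two rpc-server worker nodes.
worker_ips="172.31.27.42:50052,172.31.20.38:50052"
```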
@@ -50,25 +50,25 @@ ## Review example output
 ## Review example output
 
 ```output
-build: 5935 (2adf8d83) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for aarch64-linux-gnu
+build: 6209 (fb22dd07) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for aarch64-linux-gnu
 main: llama backend init
 main: load the model and apply lora adapter, if any
-llama_model_load_from_file_impl: using device RPC[172.31.110.11:50052] (RPC[172.31.110.11:50052]) - 126497 MiB free
-llama_model_load_from_file_impl: using device RPC[172.31.110.12:50052] (RPC[172.31.110.12:50052]) - 126497 MiB free
-llama_model_loader: loaded meta data with 30 key-value pairs and 1138 tensors from /home/ubuntu/Llama-3.1-405B_Q4_0.gguf (version GGUF V3 (latest))
+llama_model_load_from_file_impl: using device RPC[172.31.27.42:50052] (RPC[172.31.27.42:50052]) - 31491 MiB free
+llama_model_load_from_file_impl: using device RPC[172.31.20.38:50052] (RPC[172.31.20.38:50052]) - 31491 MiB free
+llama_model_loader: loaded meta data with 30 key-value pairs and 724 tensors from model.gguf (version GGUF V3 (latest))
 llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
 llama_model_loader: - kv 0: general.architecture str = llama
 llama_model_loader: - kv 1: general.type str = model
 llama_model_loader: - kv 2: general.name str = Llama Hf
-llama_model_loader: - kv 3: general.size_label str = 406B
+llama_model_loader: - kv 3: general.size_label str = 71B
 llama_model_loader: - kv 4: general.license str = llama3.1
 llama_model_loader: - kv 5: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
 llama_model_loader: - kv 6: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
-llama_model_loader: - kv 7: llama.block_count u32 = 126
+llama_model_loader: - kv 7: llama.block_count u32 = 80
 llama_model_loader: - kv 8: llama.context_length u32 = 131072
-llama_model_loader: - kv 9: llama.embedding_length u32 = 16384
-llama_model_loader: - kv 10: llama.feed_forward_length u32 = 53248
-llama_model_loader: - kv 11: llama.attention.head_count u32 = 128
+llama_model_loader: - kv 9: llama.embedding_length u32 = 8192
+llama_model_loader: - kv 10: llama.feed_forward_length u32 = 28672
+llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
 llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
 llama_model_loader: - kv 13: llama.rope.freq_base f32 = 500000.000000
 llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
@@ -87,27 +87,31 @@ llama_model_loader: - kv 26: tokenizer.ggml.add_bos_token bool
 llama_model_loader: - kv 27: tokenizer.ggml.add_sep_token bool = false
 llama_model_loader: - kv 28: general.quantization_version u32 = 2
 llama_model_loader: - kv 29: general.file_type u32 = 2
-llama_model_loader: - type f32: 254 tensors
-llama_model_loader: - type q4_0: 883 tensors
+llama_model_loader: - type f32: 162 tensors
+llama_model_loader: - type q4_0: 561 tensors
 llama_model_loader: - type q6_K: 1 tensors
 print_info: file format = GGUF V3 (latest)
 print_info: file type = Q4_0
-print_info: file size = 213.13 GiB (4.51 BPW)
+print_info: file size = 37.22 GiB (4.53 BPW)
+load: printing all EOG tokens:
+load: - 128001 ('<|end_of_text|>')
+load: - 128008 ('<|eom_id|>')
+load: - 128009 ('<|eot_id|>')
 load: special tokens cache size = 256
 load: token to piece cache size = 0.7999 MB
 print_info: arch = llama
 print_info: vocab_only = 0
 print_info: n_ctx_train = 131072
-print_info: n_embd = 16384
-print_info: n_layer = 126
-print_info: n_head = 128
+print_info: n_embd = 8192
+print_info: n_layer = 80
+print_info: n_head = 64
 print_info: n_head_kv = 8
 print_info: n_rot = 128
 print_info: n_swa = 0
 print_info: is_swa_any = 0
 print_info: n_embd_head_k = 128
 print_info: n_embd_head_v = 128
-print_info: n_gqa = 16
+print_info: n_gqa = 8
 print_info: n_embd_k_gqa = 1024
 print_info: n_embd_v_gqa = 1024
 print_info: f_norm_eps = 0.0e+00
@@ -116,7 +120,7 @@ print_info: f_clamp_kqv = 0.0e+00
 print_info: f_max_alibi_bias = 0.0e+00
 print_info: f_logit_scale = 0.0e+00
 print_info: f_attn_scale = 0.0e+00
-print_info: n_ff = 53248
+print_info: n_ff = 28672
 print_info: n_expert = 0
 print_info: n_expert_used = 0
 print_info: causal attn = 1
@@ -127,8 +131,8 @@ print_info: freq_base_train = 500000.0
 print_info: freq_scale_train = 1
 print_info: n_ctx_orig_yarn = 131072
 print_info: rope_finetuned = unknown
-print_info: model type = ?B
-print_info: model params = 405.85 B
+print_info: model type = 70B
+print_info: model params = 70.55 B
 print_info: general.name = Llama Hf
 print_info: vocab type = BPE
 print_info: n_vocab = 128256
@@ -143,68 +147,67 @@ print_info: EOG token = 128008 '<|eom_id|>'
 print_info: EOG token = 128009 '<|eot_id|>'
 print_info: max token length = 256
 load_tensors: loading model tensors, this can take a while... (mmap = true)
-....................................................................................................
+load_tensors: offloading 80 repeating layers to GPU
+load_tensors: offloading output layer to GPU
+load_tensors: offloaded 81/81 layers to GPU
+load_tensors: RPC[172.31.27.42:50052] model buffer size = 18821.56 MiB
+load_tensors: RPC[172.31.20.38:50052] model buffer size = 18725.42 MiB
+load_tensors: CPU_Mapped model buffer size = 563.62 MiB
+..................................................................................................
 llama_context: constructing llama_context
-llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
 llama_context: n_seq_max = 1
 llama_context: n_ctx = 4096
 llama_context: n_ctx_per_seq = 4096
 llama_context: n_batch = 2048
 llama_context: n_ubatch = 512
 llama_context: causal_attn = 1
 llama_context: flash_attn = 0
-llama_context: kv_unified = true
+llama_context: kv_unified = false
 llama_context: freq_base = 500000.0
 llama_context: freq_scale = 1
 llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
 llama_context: CPU output buffer size = 0.49 MiB
-llama_kv_cache_unified: RPC[172.31.110.11:50052] KV buffer size = 800.00 MiB
-llama_kv_cache_unified: RPC[172.31.110.12:50052] KV buffer size = 784.00 MiB
-llama_kv_cache_unified: CPU KV buffer size = 432.00 MiB
-llama_kv_cache_unified: size = 2016.00 MiB ( 4096 cells, 126 layers, 1/ 1 seqs), K (f16): 1008.00 MiB, V (f16): 1008.00 MiB
-llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
-llama_context: RPC[172.31.110.11:50052] compute buffer size = 1160.00 MiB
-llama_context: RPC[172.31.110.12:50052] compute buffer size = 1160.00 MiB
-llama_context: CPU compute buffer size = 1160.01 MiB
-llama_context: graph nodes = 4668
-llama_context: graph splits = 4
+llama_kv_cache_unified: RPC[172.31.27.42:50052] KV buffer size = 656.00 MiB
+llama_kv_cache_unified: RPC[172.31.20.38:50052] KV buffer size = 624.00 MiB
+llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB
+llama_context: RPC[172.31.27.42:50052] compute buffer size = 588.01 MiB
+llama_context: RPC[172.31.20.38:50052] compute buffer size = 588.01 MiB
+llama_context: CPU compute buffer size = 28.01 MiB
+llama_context: graph nodes = 2806
+llama_context: graph splits = 3
 common_init_from_params: added <|end_of_text|> logit bias = -inf
 common_init_from_params: added <|eom_id|> logit bias = -inf
 common_init_from_params: added <|eot_id|> logit bias = -inf
 common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
 common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
-main: llama threadpool init, n_threads = 64
+main: llama threadpool init, n_threads = 16
 
-system_info: n_threads = 64 (n_threads_batch = 64) / 64 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | SVE = 1 | DOTPROD = 1 | SVE_CNT = 16 | OPENMP = 1 | REPACK = 1 |
+system_info: n_threads = 16 (n_threads_batch = 16) / 16 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | SVE = 1 | DOTPROD = 1 | SVE_CNT = 16 | OPENMP = 1 | REPACK = 1 |
 
-sampler seed: 4077122424
+sampler seed: 3485539003
 sampler params:
 repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
 dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
 top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
 mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
 sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
-generate: n_ctx = 4096, n_batch = 2048, n_predict = 128, n_keep = 1
-
-Tell me a joke! (or a funny story)
-Thread starter Fiver
-This thread is for any jokes you may want to share with other members. Please keep them clean!
-Reactions: Fiver
-A duck walks into a bar, and asks the bartender, "Have you got any bread?"
-The bartender says, "No, we don't have any bread."
-The duck leaves.
-A few minutes later, the duck returns, and asks the bartender, "Have you got any bread?"
-The bartender says, "No, I told you, we don't have any bread."
-A few minutes later, the duck returns, and asks the bartender,
-
-llama_perf_sampler_print: sampling time = 9.48 ms / 133 runs ( 0.07 ms per token, 14032.50 tokens per second)
-llama_perf_context_print: load time = 1796754.73 ms
-llama_perf_context_print: prompt eval time = 1925.98 ms / 5 tokens ( 385.20 ms per token, 2.60 tokens per second)
-llama_perf_context_print: eval time = 77429.95 ms / 127 runs ( 609.68 ms per token, 1.64 tokens per second)
-llama_perf_context_print: total time = 79394.06 ms / 132 tokens
-llama_perf_context_print: graphs reused = 0
+generate: n_ctx = 4096, n_batch = 2048, n_predict = 64, n_keep = 1
+
+Here's a knock knock joke for kids: Knock, knock. Who's there? The interrupting cow. The interrupting cow wh- Mooooooo!
+A: He had a little lamb.
+Q: What do you get if you cross an elephant and a rhinoceros?
+Q: What's the difference between a cat and a comma?
+A:
+
+llama_perf_sampler_print: sampling time = 5.42 ms / 74 runs ( 0.07 ms per token, 13643.07 tokens per second)
+llama_perf_context_print: load time = 489542.78 ms
+llama_perf_context_print: prompt eval time = 1854.82 ms / 10 tokens ( 185.48 ms per token, 5.39 tokens per second)
+llama_perf_context_print: eval time = 36101.93 ms / 63 runs ( 573.05 ms per token, 1.75 tokens per second)
+llama_perf_context_print: total time = 37989.35 ms / 73 tokens
+llama_perf_context_print: graphs reused = 60
 ```
-That's it! You have successfully run the llama-3.1-8B model on CPUs with the power of llama.cpp RPC functionality.
+
+That's it! You have successfully run the llama-3.1-70B model on CPUs with the power of llama.cpp RPC functionality.
 
 The following table provides brief description of the metrics from `llama_perf`:
 
@@ -215,16 +218,4 @@ The following table provides brief description of the metrics from `llama_perf`:
 | load time | Time required to load the model into memory and initialize weights and buffers |
 | prompt eval time | Time to process the input prompt tokens before generation (fills KV cache) |
 | eval time | Time to generate output tokens by forward-passing through the model |
-| total time | Total time for both prompt processing and token generation (excludes model load) |
-
-## Run distributed inference with llama-server
-
-Lastly, to set up OpenAI compatible API, you can use the `llama-server` functionality. The process of implementing this is described [here](/learning-paths/servers-and-cloud-computing/llama-cpu) under the "Access the chatbot using the OpenAI-compatible API" section. Here is a snippet, for how to set up llama-server for distributed inference:
-```bash
-bin/llama-server -m ../../model.gguf --port 8080 --rpc "$worker_ips" -ngl 99
-```
-At the very end of the output to the above command, you will see something like the following:
-```output
-main: server is listening on http://127.0.0.1:8080 - starting the main loop
-srv update_slots: all slots are idle
-```
+| total time | Total time for both prompt processing and token generation (excludes model load) |
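
As a quick check on the metrics in this table, the per-token latency and throughput reported by `llama_perf_context_print` in the example output follow directly from the raw numbers (63 generated tokens in 36101.93 ms of eval time):

```bash
# Per-token eval latency in milliseconds (matches the ~573 ms per token line).
echo "scale=4; 36101.93 / 63" | bc
# Generation throughput in tokens per second (matches the ~1.75 tokens/s line).
echo "scale=4; 63 * 1000 / 36101.93" | bc
```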
