Commit 135f19e

committed: rewrite to put steps in the right order and add additional steps
1 parent b96db01 commit 135f19e

File tree

4 files changed: 368 additions, 231 deletions


content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/_index.md

Lines changed: 3 additions & 1 deletion
@@ -14,7 +14,9 @@ learning_objectives:
 - Run a large quantized model (e.g., Llama 3.1 405B) on CPUs in a distributed manner on Arm machines

 prerequisites:
-- An AWS Graviton4 c8g.16xlarge instance to test Arm performance optimizations, or any [Arm based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider or an on-premise Arm server.
+- Three AWS c8g.16xlarge instances with at least 2TB of EBS storage each.
+- Python installed on the AWS instances.
+- Access to Meta's gated repository for the Llama 3.1 model family, with a Hugging Face token generated for downloading the models.
 - Familiarity with [Deploy a Large Language Model (LLM) chatbot with llama.cpp using KleidiAI on Arm servers](/learning-paths/servers-and-cloud-computing/llama-cpu)
 - Familiarity with AWS
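Before starting, you can verify that an instance matches the required shape. A quick sketch using standard Linux tools; the thresholds come from the prerequisites above, and `df` is pointed at the root volume as an assumption about where the EBS space is mounted:

```shell
# c8g.16xlarge should report 64 vCPUs, ~128GB of RAM, and ~2TB of disk
nproc                                          # vCPU count
free -g | awk '/^Mem:/ { print $2 " GiB RAM" }'  # total memory
df -BG --output=size / | tail -1                 # root volume size
```

Running this on each of the three instances before the multi-hour download starts can save a failed run later.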

Lines changed: 127 additions & 31 deletions
@@ -1,64 +1,160 @@
 ---
-title: Overview and Worker Node Configuration
+title: Convert model to gguf and quantize
 weight: 2

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
+## Overview
+This example runs on three AWS Graviton4 c8g.16xlarge instances, each with 64 cores and 128GB of RAM. The instances should have 2TB of disk storage to hold the downloaded and quantized model weights.

-## Before you begin
-The instructions in this Learning Path are for any Arm server running Ubuntu 24.04.2 LTS. You will need at least three Arm server instances with at least 64 cores and 128GB of RAM to run this example. The instructions have been tested on an AWS Graviton4 c8g.16xlarge instance.
+You will perform these steps in this Learning Path:

-## Overview
-llama.cpp is a C++ library that enables efficient inference of LLaMA and similar large language models on CPUs, optimized for local and embedded environments. Just over a year ago from its publication date, rgerganov's RPC code was merged into llama.cpp, enabling distributed inference of large LLMs across multiple CPU-based machines, even when the models don't fit into the memory of a single machine. In this Learning Path, we'll explore how to run a 405B parameter model on Arm-based CPUs.
+1. Download Meta's [405B parameter Llama 3.1 model](https://huggingface.co/meta-llama/Llama-3.1-405B).
+2. Download and build llama.cpp, a C++ library that enables efficient inference of LLaMA and similar large language models on CPUs, optimized for local and embedded environments.
+3. Convert Meta's safetensors files to a single gguf file.
+4. Quantize the 16-bit gguf weights file to 4-bit weights.
+5. Load and run the model.

+{{% notice Note %}}The "reading time" mentioned on the Introduction page doesn't include downloading, converting, and requantizing the model; the process on this page will take 6+ hours. You may skip the model download and quantization if you already have a quantized gguf file ready to use.{{% /notice %}}

-One of the three nodes will serve as the master node, which physically hosts the model file. The other two nodes will act as worker nodes. In llama.cpp, remote procedure calls (RPC) are used to offload both the model and the computation over TCP connections between nodes. The master node forwards inference requests to the worker nodes, where all the actual computation is performed.
+## Procedure
+First, ensure you have permission to access Meta's [405B parameter Llama 3.1 model](https://huggingface.co/meta-llama/Llama-3.1-405B).

-## Implementation
+{{% notice Note %}}
+Remember that you will need to replicate the install steps below on each instance. Do NOT replicate the download and quantization steps, since they take excessive time; instead, `scp` the quantized file from the quantization machine to the other instances, as described in step 5.
+{{% /notice %}}

-1. To get started, follow [this learning path](/learning-paths/servers-and-cloud-computing/llama-cpu) up to the step where you clone the llama.cpp repository. Since this setup involves multiple instances (or devices), you will need to replicate the initial setup on each device. Specifically, after executing the command below on all devices, continue with this learning path starting from Step 2.
+##### 1. Generate a virtual environment

 ```bash
-git clone https://github.com/ggerganov/llama.cpp
+sudo apt update
+sudo apt install -y python3.12-venv
+python3 -m venv myenv
+source myenv/bin/activate
 ```
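Before installing packages, you can confirm that the virtual environment is active. A small sketch; it assumes the `myenv` name used in the commands above:

```shell
# When a venv is active, sys.prefix points inside it (e.g., .../myenv);
# outside a venv it points at the system Python installation
python3 -c 'import sys; print(sys.prefix)'
```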
-2. Now we can build the llama.cpp library with the RPC feature enabled by compiling it with the -DLLAMA_RPC=ON flag
+##### 2. Clone the llama.cpp repo and install build dependencies
 ```bash
+git clone https://github.com/ggerganov/llama.cpp
+sudo apt install -y cmake build-essential g++ libcurl4-openssl-dev
 cd llama.cpp
 mkdir -p build-rpc
 cd build-rpc
 cmake .. -DGGML_RPC=ON -DLLAMA_BUILD_SERVER=ON
 cmake --build . --config Release
 ```
 `llama.cpp` is now built in the `build-rpc/bin` directory.
 Check that `llama.cpp` has built correctly by running the help command:
 ```bash
 cd build-rpc
 bin/llama-cli -h
 ```
-If everything was built correctly, you should see a list of all the available flags that can be used with llama-cli.
-3. Now, choose two of the three devices to act as backend workers. If the devices had varying compute capacities, the ones with the highest compute should be selected, especially for a 405B model. However, since all three devices have identical compute capabilities in this case, you can select any two to serve as backend workers.

-Communication between the master node and the worker nodes occurs through a socket created on each worker. This socket listens for incoming data from the master, such as model parameters, tokens, hidden states, and other inference-related information.
-{{% notice Note %}}The RPC feature in llama.cpp is not secure by default, so you should never expose it to the open internet. To mitigate this risk, ensure that the security groups for all your EC2 instances are properly configured, restricting access to only trusted IPs or internal VPC traffic. This helps prevent unauthorized access to the RPC endpoints.{{% /notice %}}
-Use the following command to start listening on the worker nodes:
+##### 3. Download the model (on a single instance)
+Install the Hugging Face Hub client in the virtual environment:
 ```bash
-bin/rpc-server -p 50052 -H 0.0.0.0 -t 64
-```
-Below are the available flag options that can be used with the rpc-server functionality:
+pip3 install huggingface_hub
+```
+Create a Python file named download.py:
+```bash
+vi download.py
+```
+Add the following code to it:
+```python
+import os
+from huggingface_hub import snapshot_download
+
+model_id = "meta-llama/Llama-3.1-405B"
+local_dir = "llama-hf"
+
+# Create the directory if it doesn't exist
+os.makedirs(local_dir, exist_ok=True)
+
+# Download the model snapshot
+snapshot_download(
+    repo_id=model_id,
+    local_dir=local_dir,
+    revision="main",
+    token="your_hf_token",  # replace with your Hugging Face token
+    allow_patterns=["*.md", "*.json", "*.safetensors"]
+)
+```
+Execute the file:
+```bash
+python3 download.py
+```
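Once the download completes, it is worth sanity-checking the result before starting the lengthy conversion step. A sketch; the `llama-hf` directory name matches download.py above, and the rough size figure is an assumption (405B parameters at 2 bytes each is about 810GB):

```shell
# Count the downloaded weight shards and report their total size;
# expect many .safetensors files totalling roughly 800GB
find llama-hf -name '*.safetensors' 2>/dev/null | wc -l
du -sh llama-hf 2>/dev/null || true
```

A shard count of zero, or a total far below the expected size, usually means the gated-repo token was rejected or the download was interrupted.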
+##### 4. Convert the model from .safetensors to gguf and quantize (on a single instance)
+The following commands install the dependencies required for conversion to .gguf format, convert the model, and quantize it to 4-bit weights:
+```bash
+pip3 install -r llama.cpp/requirements.txt
+python3 llama.cpp/convert_hf_to_gguf.py llama-hf
+cd llama.cpp/build-rpc
+bin/llama-quantize ../../llama-hf/llama-3.1-405B-F16.gguf Q4_0
+```
+You may rename the resulting file to model.gguf and use it. Other quantization options are listed in the help output:
+```bash
+bin/llama-quantize -h
+```
 ```output
--h, --help            show this help message and exit
--t, --threads         number of threads for the CPU backend (default: 6)
--d DEV, --device      device to use
--H HOST, --host HOST  host to bind to (default: 127.0.0.1)
--p PORT, --port PORT  port to bind to (default: 50052)
--m MEM, --mem MEM     backend memory size (in MB)
--c, --cache           enable local file cache
+usage: bin/llama-quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] [--imatrix] [--include-weights] [--exclude-weights] [--output-tensor-type]
+       [--token-embedding-type] [--tensor-type] [--keep-split] [--override-kv] model-f32.gguf [model-quant.gguf] type [nthreads]
+
+  --allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
+  --leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
+  --pure: Disable k-quant mixtures and quantize all tensors to the same type
+  --imatrix file_name: use data in file_name as importance matrix for quant optimizations
+  --include-weights tensor_name: use importance matrix for this/these tensor(s)
+  --exclude-weights tensor_name: use importance matrix for this/these tensor(s)
+  --output-tensor-type ggml_type: use this ggml_type for the output.weight tensor
+  --token-embedding-type ggml_type: use this ggml_type for the token embeddings tensor
+  --tensor-type TENSOR=TYPE: quantize this tensor to this ggml_type. example: --tensor-type attn_q=q8_0
+      Advanced option to selectively quantize tensors. May be specified multiple times.
+  --keep-split: will generate quantized model in the same shards as input
+  --override-kv KEY=TYPE:VALUE
+      Advanced option to override model metadata by key in the quantized model. May be specified multiple times.
+Note: --include-weights and --exclude-weights cannot be used together
+
+Allowed quantization types:
+   2  or  Q4_0    :  4.34G, +0.4685 ppl @ Llama-3-8B
+   3  or  Q4_1    :  4.78G, +0.4511 ppl @ Llama-3-8B
+   8  or  Q5_0    :  5.21G, +0.1316 ppl @ Llama-3-8B
+   9  or  Q5_1    :  5.65G, +0.1062 ppl @ Llama-3-8B
+  19  or  IQ2_XXS :  2.06 bpw quantization
+  20  or  IQ2_XS  :  2.31 bpw quantization
+  28  or  IQ2_S   :  2.5  bpw quantization
+  29  or  IQ2_M   :  2.7  bpw quantization
+  24  or  IQ1_S   :  1.56 bpw quantization
+  31  or  IQ1_M   :  1.75 bpw quantization
+  36  or  TQ1_0   :  1.69 bpw ternarization
+  37  or  TQ2_0   :  2.06 bpw ternarization
+  10  or  Q2_K    :  2.96G, +3.5199 ppl @ Llama-3-8B
+  21  or  Q2_K_S  :  2.96G, +3.1836 ppl @ Llama-3-8B
+  23  or  IQ3_XXS :  3.06 bpw quantization
+  26  or  IQ3_S   :  3.44 bpw quantization
+  27  or  IQ3_M   :  3.66 bpw quantization mix
+  12  or  Q3_K    : alias for Q3_K_M
+  22  or  IQ3_XS  :  3.3 bpw quantization
+  11  or  Q3_K_S  :  3.41G, +1.6321 ppl @ Llama-3-8B
+  12  or  Q3_K_M  :  3.74G, +0.6569 ppl @ Llama-3-8B
+  13  or  Q3_K_L  :  4.03G, +0.5562 ppl @ Llama-3-8B
+  25  or  IQ4_NL  :  4.50 bpw non-linear quantization
+  30  or  IQ4_XS  :  4.25 bpw non-linear quantization
+  15  or  Q4_K    : alias for Q4_K_M
+  14  or  Q4_K_S  :  4.37G, +0.2689 ppl @ Llama-3-8B
+  15  or  Q4_K_M  :  4.58G, +0.1754 ppl @ Llama-3-8B
+  17  or  Q5_K    : alias for Q5_K_M
+  16  or  Q5_K_S  :  5.21G, +0.1049 ppl @ Llama-3-8B
+  17  or  Q5_K_M  :  5.33G, +0.0569 ppl @ Llama-3-8B
+  18  or  Q6_K    :  6.14G, +0.0217 ppl @ Llama-3-8B
+   7  or  Q8_0    :  7.96G, +0.0026 ppl @ Llama-3-8B
+   1  or  F16     : 14.00G, +0.0020 ppl @ Mistral-7B
+  32  or  BF16    : 14.00G, -0.0050 ppl @ Mistral-7B
+   0  or  F32     : 26.00G           @ 7B
+          COPY    : only copy tensors, no quantizing
 ```
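The file sizes in the help output scale roughly linearly with parameter count: bytes ≈ parameters × bits-per-weight ÷ 8. As a back-of-the-envelope check of what to expect for the Q4_0 quantization of the 405B model (the ~4.5 bits per weight figure, which accounts for per-block scale overhead, is an approximation, not a value from the help text):

```shell
# Estimate the Q4_0 file size: 405e9 parameters at ~4.5 bits per weight
awk 'BEGIN { params = 405e9; bpw = 4.5; printf "%.0f GB\n", params * bpw / 8 / 1e9 }'
# prints: 228 GB
```

This is why the quantized model still cannot fit in the 128GB of RAM of a single instance and must be split across the three nodes.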
-Setting the host to 0.0.0.0 might seem counterintuitive given the earlier security warning, but it's acceptable in this case because the security groups have been properly configured to block any unintended or unauthorized access.
+
+##### 5. Copy the quantized gguf to the other instances
+
+Ensure that your EC2 security group has an inbound rule allowing SSH from itself, copy your SSH .pem key file to the instance you performed the quantization on, and then use `scp` to copy the quantized gguf file to your two other instances.
+
+{{% notice Note %}}
+Use the private IPs of your EC2 instances for this copy operation if your security group has a self-referencing rule.
+{{% /notice %}}
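The copy in step 5 can be scripted. A sketch in which the key path, user name, and private IPs are placeholders you must replace with your own values; the `echo` makes this a dry run that only prints the commands, so remove it to actually copy:

```shell
PEM="$HOME/.ssh/your-key.pem"        # placeholder: key pair for your instances
MODEL="model.gguf"                   # the quantized file produced in step 4
for ip in 10.0.0.11 10.0.0.12; do    # placeholders: private IPs of the other two instances
  echo scp -i "$PEM" "$MODEL" "ubuntu@$ip:~/"
done
```

Given that the Q4_0 file is over 200GB, expect each transfer to take a while even over VPC-internal networking.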
