Commit 4348ceb

set up rpc caching instead of copying gguf files

1 parent: b441a78

File tree

2 files changed: +3 −11 lines

content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-1.md

Lines changed: 2 additions & 10 deletions
````diff
@@ -22,7 +22,7 @@ You will perform these steps in this Learning Path:
 First, ensure you have permission to access Meta's [405B parameter llama 3.1 model](https://huggingface.co/meta-llama/Llama-3.1-405B).
 
 {{% notice Note %}}
-Remember that you will need to replicate the install steps below on each device. Do NOT replicate the download and quantization step, since that will take excessive time -- instead do an `scp` from the quantization machine to the other instances, as shown below.
+Remember that you will need to replicate the install steps below on each device. Do NOT replicate the download and quantization step; llama.cpp will send the tensors to each worker's cache.
 {{% /notice %}}
 
 ##### 1. Generate a virtual environment
@@ -149,12 +149,4 @@ Allowed quantization types:
    32 or BF16 : 14.00G, -0.0050 ppl @ Mistral-7B
     0 or F32  : 26.00G             @ 7B
         COPY  : only copy tensors, no quantizing
-```
-
-##### 5. Copy the quantized gguf to the other instances
-
-Ensure that your EC2 security group has an inbound rule allowing itself, copy your ssh pem file to the instance you did the requantization on, and then use `scp` to copy the quantized gguf file to your two other instances.
-
-{{% notice Note %}}
-Use the private IP of your ec2 instances for this copy operation if your SG has a self-reference.
-{{% /notice %}}
+```
````
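For reference, the one-time quantization step that produces the gguf (run on a single machine only, per the note above) can be sketched as below. The file names and the `Q4_0` type are illustrative assumptions, not taken from this commit:

```shell
# Hypothetical paths for illustration; substitute your converted model.
SRC="Llama-3.1-405B-f16.gguf"
DST="Llama-3.1-405B-q4_0.gguf"

# Q4_0 is one of the "Allowed quantization types" listed by llama-quantize.
CMD="bin/llama-quantize ${SRC} ${DST} Q4_0"

# Print the command; run it (once) on the quantization machine only.
echo "${CMD}"
```

With the `-c` cache flag on the workers there is no longer any need to `scp` the resulting file to the other instances.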

content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-2.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -23,7 +23,7 @@ Communication between the master node and the worker nodes occurs through a socket
 {{% notice Note %}}The RPC feature in llama.cpp is not secure by default, so you should never expose it to the open internet. To mitigate this risk, ensure that the security groups for all your EC2 instances are properly configured, restricting access to only trusted IPs or internal VPC traffic. This helps prevent unauthorized access to the RPC endpoints.{{% /notice %}}
 Use the following command to start listening on the worker nodes:
 ```bash
-bin/rpc-server -p 50052 -H 0.0.0.0 -t 64
+bin/rpc-server -c -p 50052 -H 0.0.0.0 -t 64
 ```
 Below are the available flag options that can be used with the rpc-server functionality:
 
````
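Once each worker runs `rpc-server` with `-c` (so received tensors are cached locally instead of being copied over with `scp`), the master node reaches the workers through `--rpc`. A minimal sketch, assuming two workers at hypothetical private IPs on the port used above:

```shell
# Hypothetical private IPs; use your own instances' addresses.
WORKER1="10.0.0.11:50052"
WORKER2="10.0.0.12:50052"

# Comma-separated list of worker endpoints for the --rpc flag.
RPC="${WORKER1},${WORKER2}"
echo "${RPC}"

# On the master node, something like:
#   bin/llama-cli -m Llama-3.1-405B-q4_0.gguf --rpc "${RPC}" -p "Hello"
```

On the first run the master streams tensors to each worker, which caches them; subsequent runs start without re-sending the model.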