Commit 4348ceb

set up rpc caching instead of copying gguf files

1 parent: b441a78

File tree

2 files changed: +3 −11 lines

content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-1.md

Lines changed: 2 additions & 10 deletions
````diff
@@ -22,7 +22,7 @@ You will perform these steps in this Learning Path:
 First, ensure you have permission to access Meta's [405B parameter llama 3.1 model](https://huggingface.co/meta-llama/Llama-3.1-405B).
 
 {{% notice Note %}}
-Remember that you will need to replicate the install steps below on each device. Do NOT replicate the download and quantization step, since that will take excessive time -- instead do an `scp` from the quantization machine to the other instances, as shown below.
+Remember that you will need to replicate the install steps below on each device. Do NOT replicate the download and quantization step; llama.cpp will send the tensors to each worker's cache.
 {{% /notice %}}
 
 ##### 1. Generate a virtual environment
@@ -149,12 +149,4 @@ Allowed quantization types:
    32 or BF16 : 14.00G, -0.0050 ppl @ Mistral-7B
     0 or F32  : 26.00G             @ 7B
         COPY  : only copy tensors, no quantizing
-```
-
-##### 5. Copy the quantized gguf to the other instances
-
-Ensure that your EC2 security group has an inbound rule allowing itself, copy your ssh pem file to the instance you did the requantization on, and then use `scp` to copy the quantized gguf file to your two other instances.
-
-{{% notice Note %}}
-Use the private IP of your ec2 instances for this copy operation if your SG has a self-reference.
-{{% /notice %}}
+```
````
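For reference, the one-time quantization step that produces the gguf (run on a single machine only, per the note above) can be sketched as below. The file names and the `Q4_0` type are illustrative assumptions, not taken from this commit:

```shell
# Hypothetical paths for illustration; substitute your converted model.
SRC="Llama-3.1-405B-f16.gguf"
DST="Llama-3.1-405B-q4_0.gguf"

# Q4_0 is one of the "Allowed quantization types" listed by llama-quantize.
CMD="bin/llama-quantize ${SRC} ${DST} Q4_0"

# Print the command; run it (once) on the quantization machine only.
echo "${CMD}"
```

With the `-c` cache flag on the workers there is no longer any need to `scp` the resulting file to the other instances.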

content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-2.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -23,7 +23,7 @@ Communication between the master node and the worker nodes occurs through a socket
 {{% notice Note %}}The RPC feature in llama.cpp is not secure by default, so you should never expose it to the open internet. To mitigate this risk, ensure that the security groups for all your EC2 instances are properly configured, restricting access to only trusted IPs or internal VPC traffic. This helps prevent unauthorized access to the RPC endpoints.{{% /notice %}}
 Use the following command to start listening on the worker nodes:
 ```bash
-bin/rpc-server -p 50052 -H 0.0.0.0 -t 64
+bin/rpc-server -c -p 50052 -H 0.0.0.0 -t 64
 ```
 Below are the available flag options that can be used with the rpc-server functionality:
 
````
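Once each worker runs `rpc-server` with `-c` (so received tensors are cached locally instead of being copied over with `scp`), the master node reaches the workers through `--rpc`. A minimal sketch, assuming two workers at hypothetical private IPs on the port used above:

```shell
# Hypothetical private IPs; use your own instances' addresses.
WORKER1="10.0.0.11:50052"
WORKER2="10.0.0.12:50052"

# Comma-separated list of worker endpoints for the --rpc flag.
RPC="${WORKER1},${WORKER2}"
echo "${RPC}"

# On the master node, something like:
#   bin/llama-cli -m Llama-3.1-405B-q4_0.gguf --rpc "${RPC}" -p "Hello"
```

On the first run the master streams tensors to each worker, which caches them; subsequent runs start without re-sending the model.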