File: `content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/_index.md`
---
title: Distributed inference using llama.cpp

minutes_to_complete: 30

who_is_this_for: This introductory topic is for developers with some experience using llama.cpp who want to learn distributed inference.

learning_objectives:
    - Set up a main host and worker nodes with llama.cpp
    - Run a large quantized model (for example, Llama 3.1 405B) with distributed CPU inference on Arm machines

prerequisites:
    - Three AWS c8g.16xlarge instances with at least 2 TB of EBS storage
    - Python 3 installed on each instance
    - Access to Meta's gated repository for the Llama 3.1 model family and a Hugging Face token to download models
    - Familiarity with the Learning Path [Deploy a Large Language Model (LLM) chatbot with llama.cpp using KleidiAI on Arm servers](/learning-paths/servers-and-cloud-computing/llama-cpu)
File: `content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-1.md`
---
title: Convert model to GGUF and quantize
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Overview

This example runs on three AWS Graviton4 `c8g.16xlarge` instances. Each instance has 64 cores, 128 GB of RAM, and 2 TB of disk storage to store the downloaded and quantized model weights.

You will perform these steps in this Learning Path:

- Download and build `llama.cpp`, a C++ library for efficient CPU inference of LLaMA and similar large language models, optimized for local and embedded environments.
- Convert Meta's `safetensors` files to a single GGUF file.
- Quantize the 16-bit GGUF weights file to 4-bit weights.
- Load and run the model.

{{% notice Note %}}
The **Reading time** shown on the **Introduction** page does not include downloading, converting, and quantizing the model. These steps can take more than six hours. If you already have a quantized GGUF file, you can skip the download and quantization.
{{% /notice %}}
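As a preview, the convert and quantize steps can be sketched with the tools that ship with `llama.cpp`. The file paths and output names below are illustrative, and script flags can vary between llama.cpp versions:

```shell
# Convert Meta's safetensors checkpoint into a single 16-bit GGUF file.
# convert_hf_to_gguf.py lives in the root of the llama.cpp repository.
python3 convert_hf_to_gguf.py ./Llama-3.1-405B \
    --outfile model-f16.gguf --outtype f16

# Quantize the 16-bit GGUF file down to 4-bit (Q4_0) weights.
bin/llama-quantize model-f16.gguf model.gguf Q4_0
```

Both steps are disk- and memory-intensive for a 405B model, which is why the Note above budgets several hours for them.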

## Set up dependencies

Before you start, make sure you have permission to access Meta's [Llama 3.1 405B parameter model](https://huggingface.co/meta-llama/Llama-3.1-405B).

{{% notice Note %}}
You must repeat the install steps on each device. However, only run the download and quantization steps once as `llama.cpp` caches the tensors for reuse across devices.
{{% /notice %}}

## Create a virtual environment

```bash
apt update
apt install python3.12-venv
python3 -m venv myenv
source myenv/bin/activate
```

## Clone the llama.cpp repo and build dependencies

```bash
git clone https://github.com/ggerganov/llama.cpp
apt install -y cmake build-essential
cd llama.cpp
mkdir build-rpc
cd build-rpc
cmake .. -DGGML_RPC=ON -DLLAMA_BUILD_SERVER=ON
cmake --build . --config Release
```

The build output is placed in the `build-rpc/bin` directory.

Verify that the build succeeded by running the help command:

```bash
cd build-rpc
bin/llama-cli -h
```

## Download the model (single instance)

Install Hugging Face Hub in your virtual environment:
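A sketch of this step, assuming the `huggingface_hub` package and a Hugging Face token exported in the `HF_TOKEN` environment variable; the local directory name is an example:

```shell
# Install the Hugging Face Hub client inside the active virtual environment.
pip install huggingface_hub

# Download the gated model weights; requires approved access to the repo
# and a valid token (exported here as HF_TOKEN).
huggingface-cli download meta-llama/Llama-3.1-405B \
    --local-dir Llama-3.1-405B --token "$HF_TOKEN"
```

Run the download on the master node only; the other nodes receive tensors over RPC.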

## Overview of the cluster

`llama.cpp` is a C++ library that enables efficient inference of LLaMA and similar large language models on CPUs, optimized for local and embedded environments.

Just over a year before this Learning Path was published, Radoslav Gerganov's (rgerganov) RPC code was merged into `llama.cpp`. This feature enables distributed inference of large LLMs across multiple CPU-based machines, even when the models don't fit into the memory of a single machine.
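To see why a single instance cannot hold this model, a rough back-of-the-envelope calculation helps. The figure of roughly 4.5 bits per weight for Q4_0 (4-bit values plus per-block scale factors) is an approximation, not an exact specification:

```shell
# Estimate Q4_0 weight storage for a 405B-parameter model.
params=405          # parameters, in billions
tenth_bits=45       # ~4.5 bits/weight, scaled by 10 for integer math
model_gb=$(( params * tenth_bits / 8 / 10 ))
echo "Approximate Q4_0 model size: ${model_gb} GB"
echo "RAM on one c8g.16xlarge:     128 GB"
echo "RAM across three instances:  $(( 3 * 128 )) GB"
```

At roughly 227 GB of weights, plus KV cache and compute buffers, the model exceeds a single node's 128 GB but fits in the cluster's combined 384 GB.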

In this Learning Path, you'll explore how to run a 405B parameter model on Arm-based CPUs.

For this demonstration, the experimental setup includes:

- Number of instances: 3
- Instance type: `c8g.16xlarge`
- Model: `model.gguf` (Llama-3.1-405B_Q4_0)

One of the three nodes serves as the master node, which physically hosts the model file. The other two nodes act as worker nodes. In `llama.cpp`, remote procedure calls (RPC) offload both the model and the computation over TCP connections between nodes. The master node forwards inference requests to the worker nodes, where computation is performed.

## Set up the worker nodes

Choose two of the three devices to act as backend workers. If the devices have varying compute capacities, select the ones with the highest compute, especially for a 405B model. Because all three devices in this setup are identical, you can select any two to serve as backend workers.

Communication between the master node and the worker nodes occurs through a socket created on each worker. This socket listens for incoming data from the master, such as model parameters, tokens, hidden states, and other inference-related information.

{{% notice Note %}}
The RPC feature in `llama.cpp` is not secure by default, so you should never expose it to the open internet. To reduce this risk, ensure that the security groups for all your EC2 instances are configured to restrict access to trusted IPs or internal VPC traffic only. This prevents unauthorized access to the RPC endpoints.
{{% /notice %}}

Start the worker nodes with the following command:

```bash
bin/rpc-server -c -p 50052 -H 0.0.0.0 -t 64
```

## Review RPC server options

The following flags are available with the `rpc-server` command:

```output
-h, --help            show this help message and exit
...
-m MEM, --mem MEM     backend memory size (in MB)
-c, --cache           enable local file cache
```

Although setting the host to `0.0.0.0` might seem counterintuitive given the earlier security warning, it is acceptable here because the EC2 security groups are configured to block unintended or unauthorized access.
You can find the IP addresses of your AWS instances in the AWS console.

Verify communication with a worker node by running the following command on the master node:

```bash
telnet 172.31.110.11 50052
```

If the backend server is set up correctly, the output should look like:

```output
Trying 172.31.110.11...
Connected to 172.31.110.11.
Escape character is '^]'.
```
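The inference command below expects the worker endpoints in a shell variable. The variable name `worker_ips` and the second worker's address are illustrative; the first address matches the telnet check above:

```shell
# Comma-separated host:port list of the rpc-server endpoints.
# Substitute the private IPs of your own worker instances.
worker_ips="172.31.110.11:50052,172.31.110.12:50052"
echo "$worker_ips"
```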

Run distributed inference using `llama-cli`:

```bash
bin/llama-cli -m ../../model.gguf -p "Tell me a joke" -n 128 --rpc "$worker_ips" -ngl 999
```

{{% notice Note %}}
Loading tensors on the worker nodes can take up to 30 minutes. Pre-loaded tensors are a requested enhancement for llama.cpp.
{{% /notice %}}

## Understand the command flags

- `-n`: maximum number of output tokens
- `--rpc`: list of backend workers
- `-ngl`: number of layers to offload to backend workers (`999` offloads all layers)

{{% notice Note %}}At the time of publication, llama.cpp only supports up to 16 backend workers.{{% /notice %}}

## Review example output

```output
build: 5935 (2adf8d83) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for aarch64-linux-gnu
main: llama backend init
...
llama_perf_context_print: eval time     = 77429.95 ms / 127 runs
llama_perf_context_print: total time    = 79394.06 ms / 132 tokens
llama_perf_context_print: graphs reused = 0
```

That's it! You have successfully run the Llama 3.1 405B model on CPUs with the power of llama.cpp RPC functionality.

The following table provides a brief description of the metrics from `llama_perf`:

| Metric | Description |
|--------|-------------|
| sampling time | Time spent choosing next tokens using the sampling strategy (for example, top-k, top-p) |
| load time | Time required to load the model into memory and initialize weights and buffers |
| prompt eval time | Time to process the input prompt tokens before generation (fills KV cache) |
| eval time | Time to generate output tokens by forward-passing through the model |
| total time | Total time for both prompt processing and token generation (excludes model load) |
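From the `llama_perf` numbers shown above, you can derive the generation speed. This short awk sketch uses the eval figures (77429.95 ms for 127 generated tokens):

```shell
# Tokens per second = generated tokens / eval time in seconds.
awk 'BEGIN {
    eval_ms = 77429.95; tokens = 127
    printf "eval speed: %.2f tokens/s\n", tokens / (eval_ms / 1000.0)
}'
```

Roughly 1.6 tokens per second is modest, but this is a very large model running entirely on CPUs across three machines.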

## Run distributed inference with llama-server

Lastly, to set up an OpenAI-compatible API, you can use the `llama-server` functionality. The process is described in [Deploy a Large Language Model (LLM) chatbot with llama.cpp using KleidiAI on Arm servers](/learning-paths/servers-and-cloud-computing/llama-cpu) under the "Access the chatbot using the OpenAI-compatible API" section.
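A sketch of such an invocation is shown below; the `--host` and `--port` values are examples, and `$worker_ips` is the same worker list used with `llama-cli`:

```shell
# Serve the model with an OpenAI-compatible HTTP API, offloading
# layers to the RPC workers exactly as llama-cli does.
bin/llama-server -m ../../model.gguf --rpc "$worker_ips" -ngl 999 \
    --host 0.0.0.0 --port 8080
```

Clients can then point any OpenAI-compatible SDK at `http://<master-ip>:8080/v1`. As with `rpc-server`, keep the port restricted to trusted traffic via your security groups.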