Commit 2396b66

Content development review
1 parent 61aa16c commit 2396b66

4 files changed: 77 additions, 48 deletions

content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/_index.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ prerequisites:
 - Three AWS c8g.16xlarge instances with at least 2 TB of EBS storage
 - Python 3 installed on each instance
 - Access to Meta's gated repository for the Llama 3.1 model family and a Hugging Face token to download models
-- Familiarity with [Deploy a Large Language Model (LLM) chatbot with llama.cpp using KleidiAI on Arm servers](/learning-paths/servers-and-cloud-computing/llama-cpu)
+- Familiarity with the Learning Path [Deploy a Large Language Model (LLM) chatbot with llama.cpp using KleidiAI on Arm servers](/learning-paths/servers-and-cloud-computing/llama-cpu)
 - Familiarity with AWS

 author: Aryan Bhusari

content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-1.md

Lines changed: 10 additions & 10 deletions
@@ -1,5 +1,5 @@
 ---
-title: Convert model to GGUP and quantize
+title: Convert model to GGUF and quantize
 weight: 2

 ### FIXED, DO NOT MODIFY
@@ -12,22 +12,22 @@ This example runs on three AWS Graviton4 `c8g.16xlarge` instances. Each instance

 In this Learning Path, you will:

-1. Download Meta's [Llama 3.1 405B parameter model](https://huggingface.co/meta-llama/Llama-3.1-405B).
-2. Download and build `llama.cpp`, a C++ library for efficient CPU inference of LLaMA and similar large language models on CPUs, optimized for local and embedded environments.
-3. Convert Meta's `safetensors` files to a single GGUF file.
-4. Quantize the 16-bit GGUF weights file to 4-bit weights.
-5. Load and run the model.
+- Download Meta's [Llama 3.1 405B parameter model](https://huggingface.co/meta-llama/Llama-3.1-405B).
+- Download and build `llama.cpp`, a C++ library for efficient CPU inference of LLaMA and similar large language models on CPUs, optimized for local and embedded environments.
+- Convert Meta's `safetensors` files to a single GGUF file.
+- Quantize the 16-bit GGUF weights file to 4-bit weights.
+- Load and run the model.

 {{% notice Note %}}
-The **Reading time** shown on the Introduction page does not include downloading, converting, and quantizing the model. These steps can take more than six hours. If you already have a quantized GGUF file, you can skip the download and quantization.
+The **Reading time** shown on the **Introduction** page does not include downloading, converting, and quantizing the model. These steps can take more than six hours. If you already have a quantized GGUF file, you can skip the download and quantization.
 {{% /notice %}}

 ## Set up dependencies

 Before you start, make sure you have permission to access Meta's [Llama 3.1 405B parameter model](https://huggingface.co/meta-llama/Llama-3.1-405B).

 {{% notice Note %}}
-You must repeat the install steps on each device. However, only run the download and quantization steps once. `llama.cpp` will cache the tensors for reuse across devices.
+You must repeat the install steps on each device. However, only run the download and quantization steps once, as `llama.cpp` caches the tensors for reuse across devices.
 {{% /notice %}}

 ## Create a virtual environment
@@ -62,7 +62,7 @@ cd build-rpc
 bin/llama-cli -h
 ```

-## Download the model (single instance only)
+## Download the model (single instance)

 Install Hugging Face Hub in your virtual environment:

@@ -102,7 +102,7 @@ Run the script:
 python3 download.py
 ```

-## Convert and quantize the model (single instance only)
+## Convert and quantize the model (single instance)

 Install the conversion dependencies:

content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-2.md
Lines changed: 31 additions & 15 deletions
@@ -1,31 +1,46 @@
 ---
-title: Worker Node Configuration
+title: Configure the worker nodes
 weight: 3

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
-## Cluster overview
-llama.cpp is a C++ library that enables efficient inference of LLaMA and similar large language models on CPUs, optimized for local and embedded environments. Just over a year ago from the publication date of this article, rgerganov’s RPC code was merged into llama.cpp, enabling distributed inference of large LLMs across multiple CPU-based machines—even when the models don’t fit into the memory of a single machine. In this learning path, we’ll explore how to run a 405B parameter model on Arm-based CPUs.

-For the purposes of this demonstration, the following experimental setup will be used:
-- Total number of instances: 3
-- Instance type: c8g.16xlarge
-- Model: model.gguf (Llama-3.1-405B_Q4_0)
+## Overview of the cluster

-One of the three nodes will serve as the master node, which physically hosts the model file. The other two nodes will act as worker nodes. In llama.cpp, remote procedure calls (RPC) are used to offload both the model and the computation over TCP connections between nodes. The master node forwards inference requests to the worker nodes, where all the actual computation is performed.
+`llama.cpp` is a C++ library that enables efficient inference of LLaMA and similar large language models on CPUs, optimized for local and embedded environments.

-## Cluster setup
+Just over a year before this Learning Path was published, Radoslav Gerganov's (rgerganov) RPC code was merged into `llama.cpp`. This feature enables distributed inference of large LLMs across multiple CPU-based machines, even when the models don’t fit into the memory of a single machine.

-Choose two of the three devices to act as backend workers. If the devices had varying compute capacities, the ones with the highest compute should be selected—especially for a 405B model. However, since all three devices have identical compute capabilities in this case, you can select any two to serve as backend workers.
+In this Learning Path, you’ll explore how to run a 405B parameter model on Arm-based CPUs.
+
+For this demonstration, the experimental setup includes:
+
+- Number of instances: 3
+- Instance type: `c8g.16xlarge`
+- Model: `model.gguf` (Llama-3.1-405B_Q4_0)
+
+One of the three nodes serves as the master node, which physically hosts the model file. The other two nodes act as worker nodes. In `llama.cpp`, remote procedure calls (RPC) offload both the model and the computation over TCP connections between nodes. The master node forwards inference requests to the worker nodes, where computation is performed.
+
+## Set up the worker nodes
+
+Choose two of the three devices to act as backend workers. If the devices have varying compute capacities, select the ones with the highest compute, especially for a 405B model. Because all three devices in this setup are identical, you can select any two to serve as backend workers.
+
+Communication between the master node and the worker nodes occurs through a socket created on each worker. This socket listens for incoming data from the master, such as model parameters, tokens, hidden states, and other inference-related information.
+
+{{% notice Note %}}
+The RPC feature in `llama.cpp` is not secure by default, so you should never expose it to the open internet. To reduce this risk, ensure that the security groups for all your EC2 instances are configured to restrict access to trusted IPs or internal VPC traffic only. This prevents unauthorized access to the RPC endpoints.
+{{% /notice %}}
+
+Start the worker nodes with the following command:

-Communication between the master node and the worker nodes occurs through a socket created on each worker. This socket listens for incoming data from the master—such as model parameters, tokens, hidden states, and other inference-related information.
-{{% notice Note %}}The RPC feature in llama.cpp is not secure by default, so you should never expose it to the open internet. To mitigate this risk, ensure that the security groups for all your EC2 instances are properly configured—restricting access to only trusted IPs or internal VPC traffic. This helps prevent unauthorized access to the RPC endpoints.{{% /notice %}}
-Use the following command to start the listening on the worker nodes:
 ```bash
 bin/rpc-server -c -p 50052 -H 0.0.0.0 -t 64
 ```
-Below are the available flag options that can be used with the rpc-server functionality:
+
+## Review RPC server options
+
+The following flags are available with the `rpc-server` command:

 ```output
 -h, --help show this help message and exit
@@ -36,4 +51,5 @@ Below are the available flag options that can be used with the rpc-server functi
 -m MEM, --mem MEM backend memory size (in MB)
 -c, --cache enable local file cache
 ```
-Setting the host to 0.0.0.0 might seem counterintuitive given the earlier security warning, but it’s acceptable in this case because the security groups have been properly configured to block any unintended or unauthorized access.
+
+Although setting the host to `0.0.0.0` might seem counterintuitive given the earlier security warning, it is acceptable here because the EC2 security groups are configured to block unintended or unauthorized access.
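As a side note on the worker launch in this change: the `-t 64` thread count matches the 64 vCPUs of a `c8g.16xlarge`, but it need not be hardcoded. A minimal sketch, assuming `rpc-server` was built at `bin/rpc-server` as in the earlier steps; deriving the thread count from `nproc` is an illustrative choice, and the launch itself is left commented out:

```shell
# Sketch: compose the rpc-server launch command for a worker node.
# On a c8g.16xlarge, nproc reports 64, matching the -t 64 used above.
PORT=50052
THREADS=$(nproc)
CMD="bin/rpc-server -c -p $PORT -H 0.0.0.0 -t $THREADS"
echo "Would run: $CMD"
# On a real worker node, run the command directly instead:
# $CMD
```

Binding to `0.0.0.0` here carries the same caveat as above: it is only acceptable because the security groups restrict access.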

content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-3.md

Lines changed: 35 additions & 22 deletions
@@ -1,45 +1,54 @@
 ---
-title: Configuring Master Node
+title: Configure the master node
 weight: 4

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
-## Master node setup
-In this learning path, we will use the following two IP addresses for the worker nodes. Replace these with your own node IPs.
+
+## Set up the master node
+
+In this section, you configure the master node and verify communication with worker nodes before running distributed inference.
+
+Export the worker node IP addresses, replacing the example values with your own node IPs:

 ```bash
-export worker_ips = "172.31.110.11:50052,172.31.110.12:50052"
+export worker_ips="172.31.110.11:50052,172.31.110.12:50052"
 ```
+
 You can find the IP addresses of your AWS instances in the AWS console.

-You can verify communication with the worker nodes using the following command on master node:
+Verify communication with a worker node by running the following command on the master node:
+
 ```bash
 telnet 172.31.110.11 50052
 ```
-If the backend server is set up correctly, the output of the `telnet` command should look like the following:
-```bash
+If the backend server is set up correctly, the output should look like:
+
+```output
 Trying 172.31.110.11...
 Connected to 172.31.110.11.
 Escape character is '^]'.
 ```
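As an aside to this change: rather than running `telnet` against each worker by hand, the check can be scripted from the `worker_ips` list. A minimal sketch, using the example IPs from this step; the actual probe uses `nc` from netcat and is left commented out so it can be run on the master node:

```shell
# Sketch: split the comma-separated worker list into host:port pairs.
worker_ips="172.31.110.11:50052,172.31.110.12:50052"   # replace with your worker IPs
for w in $(echo "$worker_ips" | tr ',' ' '); do
    host=${w%:*}   # strip the :port suffix
    port=${w#*:}   # strip the host prefix
    echo "checking $host on port $port"
    # On the master node, probe the socket (requires netcat):
    # nc -z -w 3 "$host" "$port" && echo "$w reachable"
done
```

This uses the same `worker_ips` format that the `--rpc` flag expects, so one variable drives both the check and the inference command.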
-Finally, you can execute the following command, to execute distributed inference:
+Run distributed inference using `llama-cli`:
+
 ```bash
 bin/llama-cli -m ../../model.gguf -p "Tell me a joke" -n 128 --rpc "$worker_ips" -ngl 999
 ```

 {{% notice Note %}}
-It will take a significant amount of time (~30 minutes) to load the tensors on the worker nodes. Pre-loaded tensors are a current development request for llama.cpp.
+Loading tensors on the worker nodes can take up to 30 minutes. Pre-loaded tensors are a requested enhancement for llama.cpp.
 {{% /notice %}}
+
+## Understand the command flags

-Here are short definitions of the flags used in above command:
--n => Number of maximum output tokens
---rpc => list of backend workers
--ngl => Number of layers to be placed on backend workers (999 means offload all layers on workers)
+- `-n`: maximum number of output tokens
+- `--rpc`: list of backend workers
+- `-ngl`: number of layers to offload to backend workers (`999` offloads all layers)

 {{% notice Note %}}At the time of publication, llama.cpp only supports up to 16 backend workers.{{% /notice %}}

-The output:
+## Review example output
+
 ```output
 build: 5935 (2adf8d83) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for aarch64-linux-gnu
 main: llama backend init
@@ -195,18 +204,22 @@ llama_perf_context_print: eval time = 77429.95 ms / 127 runs ( 609
 llama_perf_context_print: total time = 79394.06 ms / 132 tokens
 llama_perf_context_print: graphs reused = 0
 ```
-That's it! You have successfully run the llama-3.1-8B model on CPUs with the power of llama.cpp RPC functionality. The following table provides brief description of the metrics from `llama_perf`:
+That's it! You have successfully run the Llama 3.1 405B model on CPUs with the power of llama.cpp RPC functionality.
+
+The following table provides a brief description of the metrics from `llama_perf`:


-| Log Line | Description |
+| Log line | Description |
 |-------------------|-----------------------------------------------------------------------------|
-| sampling time | Time spent choosing next tokens using sampling strategy (e.g., top-k, top-p). |
-| load time | Time to load the model into memory and initialize weights/buffers. |
-| prompt eval time | Time to process the input prompt tokens before generation (fills KV cache). |
-| eval time | Time to generate output tokens by forward-passing through the model. |
-| total time | Total time for both prompt processing and token generation (excludes model load). |
+| sampling time | Time spent choosing next tokens using the sampling strategy (for example, top-k, top-p) |
+| load time | Time required to load the model into memory and initialize weights and buffers |
+| prompt eval time | Time to process the input prompt tokens before generation (fills KV cache) |
+| eval time | Time to generate output tokens by forward-passing through the model |
+| total time | Total time for both prompt processing and token generation (excludes model load) |
+
+## Run distributed inference with llama-server

-Lastly to set up OpenAI compatible API, you can use the `llama-server` functionality. The process of implementing this is described [here](/learning-paths/servers-and-cloud-computing/llama-cpu) under the "Access the chatbot using the OpenAI-compatible API" section. Here is a snippet, for how to set up llama-server for distributed inference:
+Lastly, to set up an OpenAI-compatible API, you can use the `llama-server` functionality. The process is described in the "Access the chatbot using the OpenAI-compatible API" section of [this Learning Path](/learning-paths/servers-and-cloud-computing/llama-cpu). Here is a snippet showing how to set up `llama-server` for distributed inference:
 ```bash
 bin/llama-server -m ../../model.gguf --port 8080 --rpc "$worker_ips" -ngl 99
 ```
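Once `llama-server` is listening, its OpenAI-compatible endpoint can be exercised with `curl`. A hedged sketch: the port matches the command above, the `max_tokens` value mirrors the `-n 128` used earlier, and the actual call is left commented out so it can be run on the master node once the server is up:

```shell
# Sketch: compose a chat-completion request for the OpenAI-compatible API.
URL="http://localhost:8080/v1/chat/completions"
PAYLOAD='{"messages":[{"role":"user","content":"Tell me a joke"}],"max_tokens":128}'
echo "POST $URL"
echo "$PAYLOAD"
# On the master node, once llama-server is running:
# curl -s "$URL" -H "Content-Type: application/json" -d "$PAYLOAD"
```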
