Commit 941ef70

Merge pull request #2243 from JoeStech/model-edits-distributed-inference

Distributed inference LP changes

2 parents: c4ae2b4 + f30aee8

File tree: 4 files changed, +76 −86 lines

content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/_index.md

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ learning_objectives:
 - Run a large quantized model (for example, Llama 3.1 405B) with distributed CPU inference on Arm machines
 
 prerequisites:
-- Three AWS c8g.16xlarge instances with at least 2 TB of EBS storage
+- Three AWS c8g.4xlarge instances with at least 500 GB of EBS storage
 - Python 3 installed on each instance
 - Access to Meta's gated repository for the Llama 3.1 model family and a Hugging Face token to download models
 - Familiarity with the Learning Path [Deploy a Large Language Model (LLM) chatbot with llama.cpp using KleidiAI on Arm servers](/learning-paths/servers-and-cloud-computing/llama-cpu)
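
The prerequisites call for a Hugging Face token with access to Meta's gated repository. A minimal sketch of supplying that token before the download step, assuming it is exported as `HF_TOKEN` (the variable name is an assumption, not part of the Learning Path):

```bash
# Authenticate once per instance so snapshot_download can reach the gated
# meta-llama/Llama-3.1-70B repository. HF_TOKEN is assumed to hold a token
# that has been granted access.
huggingface-cli login --token "$HF_TOKEN"
```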

content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-1.md

Lines changed: 9 additions & 10 deletions
@@ -8,23 +8,23 @@ layout: learningpathall
 
 ## Overview
 
-This example runs on three AWS Graviton4 `c8g.16xlarge` instances. Each instance has 64 cores, 128 GB of RAM, and 2 TB of disk storage to store the downloaded and quantized model weights.
+This example runs on three AWS Graviton4 `c8g.4xlarge` instances. Each instance has 16 cores, 32 GB of RAM, and 200 GB of disk storage to store the downloaded and quantized model weights.
 
 In this Learning Path, you will:
 
-- Download Meta's [Llama 3.1 405B parameter model](https://huggingface.co/meta-llama/Llama-3.1-405B).
+- Download Meta's [Llama 3.1 70B parameter model](https://huggingface.co/meta-llama/Llama-3.1-70B).
 - Download and build `llama.cpp`, a C++ library for efficient CPU inference of LLaMA and similar large language models on CPUs, optimized for local and embedded environments.
 - Convert Meta's `safetensors` files to a single GGUF file.
 - Quantize the 16-bit GGUF weights file to 4-bit weights.
 - Load and run the model.
 
 {{% notice Note %}}
-The **Reading time** shown on the **Introduction** page does not include downloading, converting, and quantizing the model. These steps can take more than six hours. If you already have a quantized GGUF file, you can skip the download and quantization.
+The **Reading time** shown on the **Introduction** page does not include downloading, converting, and quantizing the model. These steps can take 1-2 hours. If you already have a quantized GGUF file, you can skip the download and quantization.
 {{% /notice %}}
 
 ## Set up dependencies
 
-Before you start, make sure you have permission to access Meta's [Llama 3.1 405B parameter model](https://huggingface.co/meta-llama/Llama-3.1-405B).
+Before you start, make sure you have permission to access Meta's [Llama 3.1 70B parameter model](https://huggingface.co/meta-llama/Llama-3.1-70B).
 
 {{% notice Note %}}
 You must repeat the install steps on each device. However, only run the download and quantization steps once as `llama.cpp` caches the tensors for reuse across devices.
@@ -34,7 +34,7 @@ You must repeat the install steps on each device. However, only run the download
 
 ```bash
 apt update
-apt install python3.12-venv
+apt install -y python3.12-venv
 python3 -m venv myenv
 source myenv/bin/activate
 ```
@@ -58,7 +58,6 @@ The build output is placed in the `build-rpc/bin` directory.
 Verify that the build succeeded by running the help command:
 
 ```bash
-cd build-rpc
 bin/llama-cli -h
 ```
 
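The `build-rpc` directory referenced above comes from building `llama.cpp` with the RPC backend enabled. A minimal sketch of such a build, assuming CMake and a C/C++ toolchain are installed (the exact commands are not part of this diff and follow the upstream `llama.cpp` RPC documentation):

```bash
# Build llama.cpp with the RPC backend so llama-cli, llama-quantize, and
# rpc-server are produced under build-rpc/bin.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build-rpc -DGGML_RPC=ON
cmake --build build-rpc --config Release -j"$(nproc)"
```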
@@ -73,6 +72,7 @@ pip3 install huggingface_hub
 Create a new Python file named `download.py`:
 
 ```bash
+cd ../..
 vi download.py
 ```
 
@@ -81,8 +81,7 @@ Add the following code:
 ```python
 import os
 from huggingface_hub import snapshot_download
-
-model_id = "meta-llama/Llama-3.1-405B"
+model_id = "meta-llama/Llama-3.1-70B"
 local_dir = "llama-hf"
 
 # Create the directory if it doesn't exist
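
Between the download script above and the quantization step below, the downloaded `safetensors` checkpoint still has to be converted to a single 16-bit GGUF file. A sketch of those two steps, assuming the converter script shipped with `llama.cpp` and the output filename that the `llama-quantize` command in this commit expects:

```bash
# Run the download script created above (writes the weights into ./llama-hf).
python3 download.py

# Convert the safetensors checkpoint to one F16 GGUF file; the output path
# matches the file passed to llama-quantize later in this diff.
python3 llama.cpp/convert_hf_to_gguf.py llama-hf \
  --outfile llama-hf/Llama-3.1-70B-F16.gguf --outtype f16
```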
@@ -120,10 +119,10 @@ Quantize the model to 4-bit weights:
 
 ```bash
 cd llama.cpp/build-rpc
-bin/llama-quantize ../../llama-hf/llama-3.1-405B-F16.GGUF Q4_0
+bin/llama-quantize ../../llama-hf/Llama-3.1-70B-F16.gguf Q4_0
 ```
 
-You can rename the output file to `model.GGUF` for easier use.
+You can rename the output file to `model.gguf` for easier use.
 
 Check available quantization options:
 
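For the rename suggested above, a plain `mv` is enough; the source filename below is an assumption (when no output path is given, `llama-quantize` prints the file it wrote, so substitute that path):

```bash
# List the supported quantization types (Q4_0, Q4_K_M, Q8_0, and so on).
bin/llama-quantize --help

# Rename the quantized file for easier use. The source name is an assumed
# default; use the path that llama-quantize actually reported.
mv ../../llama-hf/ggml-model-Q4_0.gguf ../../model.gguf
```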
content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-2.md

Lines changed: 5 additions & 5 deletions
@@ -12,19 +12,19 @@ layout: learningpathall
 
 Just over a year before this Learning Path was published, Radoslav Gerganov's (rgerganov) RPC code was merged into `llama.cpp`. This feature enables distributed inference of large LLMs across multiple CPU-based machines, even when the models don’t fit into the memory of a single machine.
 
-In this Learning Path, you’ll explore how to run a 405B parameter model on Arm-based CPUs.
+In this Learning Path, you’ll explore how to run a 70B parameter model on Arm-based CPUs.
 
 For this demonstration, the experimental setup includes:
 
-- Number of instances: 3
-- Instance type: `c8g.16xlarge`
-- Model: `model.GGUF` (Llama-3.1-405B_Q4_0)
+- Total number of instances: 3
+- Instance type: c8g.4xlarge
+- Model: model.gguf (Llama-3.1-70B_Q4_0, ~38GB when quantized to 4 bits)
 
 One of the three nodes serves as the master node, which physically hosts the model file. The other two nodes act as worker nodes. In `llama.cpp`, remote procedure calls (RPC) offload both the model and the computation over TCP connections between nodes. The master node forwards inference requests to the worker nodes, where computation is performed.
 
 ## Set up the worker nodes
 
-Choose two of the three devices to act as backend workers. If the devices have varying compute capacities, select the ones with the highest compute, especially for a 405B model. Because all three devices in this setup are identical, you can select any two to serve as backend workers.
+Choose two of the three devices to act as backend workers. If the devices have varying compute capacities, select the ones with the highest compute. Because all three devices in this setup are identical, you can select any two to serve as backend workers.
 
 Communication between the master node and the worker nodes occurs through a socket created on each worker. This socket listens for incoming data from the master, such as model parameters, tokens, hidden states, and other inference-related information.
 
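The socket described above is the `llama.cpp` RPC backend. A minimal sketch of starting it on each worker node, assuming the `rpc-server` binary from the same `build-rpc` tree and the port (50052) used elsewhere in this commit; check `bin/rpc-server --help` for the exact flags in your build:

```bash
# On each of the two worker nodes: start the RPC backend so the master can
# offload model shards and computation to it over TCP.
cd llama.cpp/build-rpc
# -H binds all interfaces so the master can connect; flag assumed from the
# upstream RPC example, verify with --help.
bin/rpc-server -H 0.0.0.0 -p 50052
```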
content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-3.md

Lines changed: 61 additions & 70 deletions
@@ -33,11 +33,11 @@ Escape character is '^]'.
 Run distributed inference using `llama-cli`:
 
 ```bash
-bin/llama-cli -m ../../model.gguf -p "Tell me a joke" -n 128 --rpc "$worker_ips" -ngl 999
+bin/llama-cli -m ../../model.gguf -p "Here's a knock knock joke for kids:" -n 128 --rpc "$worker_ips" -ngl 999
 ```
 
 {{% notice Note %}}
-Loading tensors on the worker nodes can take up to 30 minutes. Pre-loaded tensors are a requested enhancement for llama.cpp.
+It will take a significant amount of time (~10 minutes) for inference to run.
 {{% /notice %}}
 ## Understand the command flags
 
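The `--rpc` flag in the command above takes a comma-separated list of `host:port` pairs, so `$worker_ips` is assumed to have been defined beforehand on the master node, for example with the worker addresses that appear in the output below:

```bash
# Comma-separated host:port pairs of the two rpc-server worker nodes.
worker_ips="172.31.27.42:50052,172.31.20.38:50052"
```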
@@ -50,25 +50,25 @@ ## Review example output
 ## Review example output
 
 ```output
-build: 5935 (2adf8d83) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for aarch64-linux-gnu
+build: 6209 (fb22dd07) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for aarch64-linux-gnu
 main: llama backend init
 main: load the model and apply lora adapter, if any
-llama_model_load_from_file_impl: using device RPC[172.31.110.11:50052] (RPC[172.31.110.11:50052]) - 126497 MiB free
-llama_model_load_from_file_impl: using device RPC[172.31.110.12:50052] (RPC[172.31.110.12:50052]) - 126497 MiB free
-llama_model_loader: loaded meta data with 30 key-value pairs and 1138 tensors from /home/ubuntu/Llama-3.1-405B_Q4_0.gguf (version GGUF V3 (latest))
+llama_model_load_from_file_impl: using device RPC[172.31.27.42:50052] (RPC[172.31.27.42:50052]) - 31491 MiB free
+llama_model_load_from_file_impl: using device RPC[172.31.20.38:50052] (RPC[172.31.20.38:50052]) - 31491 MiB free
+llama_model_loader: loaded meta data with 30 key-value pairs and 724 tensors from model.gguf (version GGUF V3 (latest))
 llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
 llama_model_loader: - kv 0: general.architecture str = llama
 llama_model_loader: - kv 1: general.type str = model
 llama_model_loader: - kv 2: general.name str = Llama Hf
-llama_model_loader: - kv 3: general.size_label str = 406B
+llama_model_loader: - kv 3: general.size_label str = 71B
 llama_model_loader: - kv 4: general.license str = llama3.1
 llama_model_loader: - kv 5: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
 llama_model_loader: - kv 6: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
-llama_model_loader: - kv 7: llama.block_count u32 = 126
+llama_model_loader: - kv 7: llama.block_count u32 = 80
 llama_model_loader: - kv 8: llama.context_length u32 = 131072
-llama_model_loader: - kv 9: llama.embedding_length u32 = 16384
-llama_model_loader: - kv 10: llama.feed_forward_length u32 = 53248
-llama_model_loader: - kv 11: llama.attention.head_count u32 = 128
+llama_model_loader: - kv 9: llama.embedding_length u32 = 8192
+llama_model_loader: - kv 10: llama.feed_forward_length u32 = 28672
+llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
 llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
 llama_model_loader: - kv 13: llama.rope.freq_base f32 = 500000.000000
 llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
@@ -87,27 +87,31 @@ llama_model_loader: - kv 26: tokenizer.ggml.add_bos_token bool
 llama_model_loader: - kv 27: tokenizer.ggml.add_sep_token bool = false
 llama_model_loader: - kv 28: general.quantization_version u32 = 2
 llama_model_loader: - kv 29: general.file_type u32 = 2
-llama_model_loader: - type f32: 254 tensors
-llama_model_loader: - type q4_0: 883 tensors
+llama_model_loader: - type f32: 162 tensors
+llama_model_loader: - type q4_0: 561 tensors
 llama_model_loader: - type q6_K: 1 tensors
 print_info: file format = GGUF V3 (latest)
 print_info: file type = Q4_0
-print_info: file size = 213.13 GiB (4.51 BPW)
+print_info: file size = 37.22 GiB (4.53 BPW)
+load: printing all EOG tokens:
+load: - 128001 ('<|end_of_text|>')
+load: - 128008 ('<|eom_id|>')
+load: - 128009 ('<|eot_id|>')
 load: special tokens cache size = 256
 load: token to piece cache size = 0.7999 MB
 print_info: arch = llama
 print_info: vocab_only = 0
 print_info: n_ctx_train = 131072
-print_info: n_embd = 16384
-print_info: n_layer = 126
-print_info: n_head = 128
+print_info: n_embd = 8192
+print_info: n_layer = 80
+print_info: n_head = 64
 print_info: n_head_kv = 8
 print_info: n_rot = 128
 print_info: n_swa = 0
 print_info: is_swa_any = 0
 print_info: n_embd_head_k = 128
 print_info: n_embd_head_v = 128
-print_info: n_gqa = 16
+print_info: n_gqa = 8
 print_info: n_embd_k_gqa = 1024
 print_info: n_embd_v_gqa = 1024
 print_info: f_norm_eps = 0.0e+00
@@ -116,7 +120,7 @@ print_info: f_clamp_kqv = 0.0e+00
 print_info: f_max_alibi_bias = 0.0e+00
 print_info: f_logit_scale = 0.0e+00
 print_info: f_attn_scale = 0.0e+00
-print_info: n_ff = 53248
+print_info: n_ff = 28672
 print_info: n_expert = 0
 print_info: n_expert_used = 0
 print_info: causal attn = 1
@@ -127,8 +131,8 @@ print_info: freq_base_train = 500000.0
 print_info: freq_scale_train = 1
 print_info: n_ctx_orig_yarn = 131072
 print_info: rope_finetuned = unknown
-print_info: model type = ?B
-print_info: model params = 405.85 B
+print_info: model type = 70B
+print_info: model params = 70.55 B
 print_info: general.name = Llama Hf
 print_info: vocab type = BPE
 print_info: n_vocab = 128256
@@ -143,68 +147,67 @@ print_info: EOG token = 128008 '<|eom_id|>'
 print_info: EOG token = 128009 '<|eot_id|>'
 print_info: max token length = 256
 load_tensors: loading model tensors, this can take a while... (mmap = true)
-....................................................................................................
+load_tensors: offloading 80 repeating layers to GPU
+load_tensors: offloading output layer to GPU
+load_tensors: offloaded 81/81 layers to GPU
+load_tensors: RPC[172.31.27.42:50052] model buffer size = 18821.56 MiB
+load_tensors: RPC[172.31.20.38:50052] model buffer size = 18725.42 MiB
+load_tensors: CPU_Mapped model buffer size = 563.62 MiB
+..................................................................................................
 llama_context: constructing llama_context
-llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
 llama_context: n_seq_max = 1
 llama_context: n_ctx = 4096
 llama_context: n_ctx_per_seq = 4096
 llama_context: n_batch = 2048
 llama_context: n_ubatch = 512
 llama_context: causal_attn = 1
 llama_context: flash_attn = 0
-llama_context: kv_unified = true
+llama_context: kv_unified = false
 llama_context: freq_base = 500000.0
 llama_context: freq_scale = 1
 llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
 llama_context: CPU output buffer size = 0.49 MiB
-llama_kv_cache_unified: RPC[172.31.110.11:50052] KV buffer size = 800.00 MiB
-llama_kv_cache_unified: RPC[172.31.110.12:50052] KV buffer size = 784.00 MiB
-llama_kv_cache_unified: CPU KV buffer size = 432.00 MiB
-llama_kv_cache_unified: size = 2016.00 MiB ( 4096 cells, 126 layers, 1/ 1 seqs), K (f16): 1008.00 MiB, V (f16): 1008.00 MiB
-llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
-llama_context: RPC[172.31.110.11:50052] compute buffer size = 1160.00 MiB
-llama_context: RPC[172.31.110.12:50052] compute buffer size = 1160.00 MiB
-llama_context: CPU compute buffer size = 1160.01 MiB
-llama_context: graph nodes = 4668
-llama_context: graph splits = 4
+llama_kv_cache_unified: RPC[172.31.27.42:50052] KV buffer size = 656.00 MiB
+llama_kv_cache_unified: RPC[172.31.20.38:50052] KV buffer size = 624.00 MiB
+llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB
+llama_context: RPC[172.31.27.42:50052] compute buffer size = 588.01 MiB
+llama_context: RPC[172.31.20.38:50052] compute buffer size = 588.01 MiB
+llama_context: CPU compute buffer size = 28.01 MiB
+llama_context: graph nodes = 2806
+llama_context: graph splits = 3
 common_init_from_params: added <|end_of_text|> logit bias = -inf
 common_init_from_params: added <|eom_id|> logit bias = -inf
 common_init_from_params: added <|eot_id|> logit bias = -inf
 common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
 common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
-main: llama threadpool init, n_threads = 64
+main: llama threadpool init, n_threads = 16
 
-system_info: n_threads = 64 (n_threads_batch = 64) / 64 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | SVE = 1 | DOTPROD = 1 | SVE_CNT = 16 | OPENMP = 1 | REPACK = 1 |
+system_info: n_threads = 16 (n_threads_batch = 16) / 16 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | SVE = 1 | DOTPROD = 1 | SVE_CNT = 16 | OPENMP = 1 | REPACK = 1 |
 
-sampler seed: 4077122424
+sampler seed: 3485539003
 sampler params:
 repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
 dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
 top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
 mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
 sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
-generate: n_ctx = 4096, n_batch = 2048, n_predict = 128, n_keep = 1
-
-Tell me a joke! (or a funny story)
-Thread starter Fiver
-This thread is for any jokes you may want to share with other members. Please keep them clean!
-Reactions: Fiver
-A duck walks into a bar, and asks the bartender, "Have you got any bread?"
-The bartender says, "No, we don't have any bread."
-The duck leaves.
-A few minutes later, the duck returns, and asks the bartender, "Have you got any bread?"
-The bartender says, "No, I told you, we don't have any bread."
-A few minutes later, the duck returns, and asks the bartender,
-
-llama_perf_sampler_print: sampling time = 9.48 ms / 133 runs ( 0.07 ms per token, 14032.50 tokens per second)
-llama_perf_context_print: load time = 1796754.73 ms
-llama_perf_context_print: prompt eval time = 1925.98 ms / 5 tokens ( 385.20 ms per token, 2.60 tokens per second)
-llama_perf_context_print: eval time = 77429.95 ms / 127 runs ( 609.68 ms per token, 1.64 tokens per second)
-llama_perf_context_print: total time = 79394.06 ms / 132 tokens
-llama_perf_context_print: graphs reused = 0
+generate: n_ctx = 4096, n_batch = 2048, n_predict = 64, n_keep = 1
+
+Here's a knock knock joke for kids: Knock, knock. Who's there? The interrupting cow. The interrupting cow wh- Mooooooo!
+A: He had a little lamb.
+Q: What do you get if you cross an elephant and a rhinoceros?
+Q: What's the difference between a cat and a comma?
+A:
+
+llama_perf_sampler_print: sampling time = 5.42 ms / 74 runs ( 0.07 ms per token, 13643.07 tokens per second)
+llama_perf_context_print: load time = 489542.78 ms
+llama_perf_context_print: prompt eval time = 1854.82 ms / 10 tokens ( 185.48 ms per token, 5.39 tokens per second)
+llama_perf_context_print: eval time = 36101.93 ms / 63 runs ( 573.05 ms per token, 1.75 tokens per second)
+llama_perf_context_print: total time = 37989.35 ms / 73 tokens
+llama_perf_context_print: graphs reused = 60
 ```
-That's it! You have successfully run the llama-3.1-8B model on CPUs with the power of llama.cpp RPC functionality.
+
+That's it! You have successfully run the llama-3.1-70B model on CPUs with the power of llama.cpp RPC functionality.
 
 The following table provides brief description of the metrics from `llama_perf`:
 
@@ -215,16 +218,4 @@ The following table provides brief description of the metrics from `llama_perf`:
 | load time | Time required to load the model into memory and initialize weights and buffers |
 | prompt eval time | Time to process the input prompt tokens before generation (fills KV cache) |
 | eval time | Time to generate output tokens by forward-passing through the model |
-| total time | Total time for both prompt processing and token generation (excludes model load) |
-
-## Run distributed inference with llama-server
-
-Lastly, to set up OpenAI compatible API, you can use the `llama-server` functionality. The process of implementing this is described [here](/learning-paths/servers-and-cloud-computing/llama-cpu) under the "Access the chatbot using the OpenAI-compatible API" section. Here is a snippet, for how to set up llama-server for distributed inference:
-```bash
-bin/llama-server -m ../../model.gguf --port 8080 --rpc "$worker_ips" -ngl 99
-```
-At the very end of the output to the above command, you will see something like the following:
-```output
-main: server is listening on http://127.0.0.1:8080 - starting the main loop
-srv update_slots: all slots are idle
-```
+| total time | Total time for both prompt processing and token generation (excludes model load) |
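
As a quick check on the metrics in this table, the per-token latency and throughput reported by `llama_perf_context_print` in the example output follow directly from the raw numbers (63 generated tokens in 36101.93 ms of eval time):

```bash
# Per-token eval latency in milliseconds (matches the ~573 ms per token line).
echo "scale=4; 36101.93 / 63" | bc
# Generation throughput in tokens per second (matches the ~1.75 tokens/s line).
echo "scale=4; 63 * 1000 / 36101.93" | bc
```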
