---
title: Configure the master node
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---
## Set up the master node

In this section, you configure the master node and verify communication with the worker nodes before running distributed inference.

Export the worker node IP addresses, replacing the example values with the IPs for your own nodes:

```bash
export worker_ips="172.31.110.11:50052,172.31.110.12:50052"
```

You can find the IP addresses of your AWS instances in the AWS console.

Verify communication with a worker node by running the following command on the master node:

```bash
telnet 172.31.110.11 50052
```

If the backend server is set up correctly, the output should look like:

```output
Trying 172.31.110.11...
Connected to 172.31.110.11.
Escape character is '^]'.
```
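
With more than one worker, probing each endpoint by hand gets tedious. As a sketch, the loop below splits `$worker_ips` into host:port pairs and probes each one using bash's built-in `/dev/tcp` redirection instead of telnet (the IPs shown are the example placeholders from above):

```shell
# Split the comma-separated worker list into host:port pairs and probe each
# one. Uses bash's /dev/tcp redirection, so no telnet or nc is required.
worker_ips="172.31.110.11:50052,172.31.110.12:50052"  # replace with your node IPs

IFS=',' read -r -a workers <<< "$worker_ips"
for w in "${workers[@]}"; do
  host=${w%:*}
  port=${w##*:}
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "$host:$port reachable"
  else
    echo "$host:$port unreachable"
  fi
done
```

A worker only shows as reachable once its `rpc-server` backend is listening on the configured port.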

Run distributed inference using `llama-cli`:

```bash
bin/llama-cli -m ../../model.gguf -p "Tell me a joke" -n 128 --rpc "$worker_ips" -ngl 999
```

{{% notice Note %}}
Loading tensors on the worker nodes can take up to 30 minutes. Pre-loaded tensors are a requested enhancement for llama.cpp.
{{% /notice %}}

## Understand the command flags

- `-n`: maximum number of output tokens
- `--rpc`: list of backend workers
- `-ngl`: number of layers to offload to backend workers (`999` offloads all layers)

{{% notice Note %}}At the time of publication, llama.cpp only supports up to 16 backend workers.{{% /notice %}}
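
Because the backend list is capped at 16, it can be worth validating the worker count in `$worker_ips` before launching. A minimal sketch:

```shell
# Count the entries in the comma-separated worker list and warn if the
# llama.cpp limit of 16 RPC backends would be exceeded.
worker_ips="172.31.110.11:50052,172.31.110.12:50052"  # replace with your node IPs

IFS=',' read -r -a workers <<< "$worker_ips"
num_workers=${#workers[@]}
echo "Configured workers: $num_workers"

if [ "$num_workers" -gt 16 ]; then
  echo "Warning: llama.cpp supports at most 16 backend workers" >&2
fi
```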

## Review example output

```output
build: 5935 (2adf8d83) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for aarch64-linux-gnu
main: llama backend init
...
llama_perf_context_print: eval time = 77429.95 ms / 127 runs ( 609...
llama_perf_context_print: total time = 79394.06 ms / 132 tokens
llama_perf_context_print: graphs reused = 0
```
That's it! You have successfully run the llama-3.1-8B model on CPUs using the llama.cpp RPC functionality.

The following table provides a brief description of the metrics from `llama_perf`:

| Log line | Description |
|-------------------|-----------------------------------------------------------------------------|
| sampling time | Time spent choosing next tokens using the sampling strategy (for example, top-k, top-p) |
| load time | Time required to load the model into memory and initialize weights and buffers |
| prompt eval time | Time to process the input prompt tokens before generation (fills KV cache) |
| eval time | Time to generate output tokens by forward-passing through the model |
| total time | Total time for both prompt processing and token generation (excludes model load) |
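
These log values can be turned into a throughput figure. For example, using the eval numbers from the output above (77429.95 ms for 127 generated tokens), a quick calculation gives the generation rate:

```shell
# Compute generation throughput from the llama_perf eval line:
# eval time = 77429.95 ms for 127 generated tokens.
eval_ms=77429.95
n_tokens=127
tps=$(awk -v ms="$eval_ms" -v n="$n_tokens" 'BEGIN { printf "%.2f", n / (ms / 1000) }')
echo "Throughput: $tps tokens/s"
# → Throughput: 1.64 tokens/s
```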

## Run distributed inference with llama-server

Lastly, to set up an OpenAI-compatible API, you can use the `llama-server` functionality. The process is described [here](/learning-paths/servers-and-cloud-computing/llama-cpu) under the "Access the chatbot using the OpenAI-compatible API" section. Start `llama-server` for distributed inference:

```bash
bin/llama-server -m ../../model.gguf --port 8080 --rpc "$worker_ips" -ngl 99
```
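
Once the server is up, requests go to the OpenAI-compatible `/v1/chat/completions` route. The snippet below builds an example request body; the `curl` line is commented out because it needs the live server started above (host and port assume the `--port 8080` command):

```shell
# Example OpenAI-style chat request body for the llama-server endpoint.
payload='{"messages":[{"role":"user","content":"Tell me a joke"}],"max_tokens":128}'
echo "$payload"

# Send it once llama-server is running on the master node:
# curl -s http://localhost:8080/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$payload"
```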