Now, on the master node, you can verify communication with the worker nodes using the following command:
```bash
telnet 172.31.110.11 50052
```
If the backend server is set up correctly, the output of the `telnet` command should look like the following:
```output
Trying 172.31.110.11...
Connected to 172.31.110.11.
Escape character is '^]'.
```
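Rather than running `telnet` by hand for every worker, a small helper can probe each worker's RPC port over TCP. The sketch below assumes `bash`; the `check_worker` function name and the IP list are illustrative, so adjust them to your cluster:

```shell
#!/usr/bin/env bash
# Probe each worker's RPC port with a short TCP connection attempt.
# check_worker is a hypothetical helper; edit the host list for your setup.
check_worker() {
  local host=$1 port=$2
  # bash's /dev/tcp pseudo-device opens a TCP connection; timeout bounds the wait.
  if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "${host}:${port} reachable"
  else
    echo "${host}:${port} not reachable"
  fi
}

for host in 172.31.110.11 172.31.110.12; do
  check_worker "$host" 50052
done
```

Any worker reported as not reachable should be checked for a running `rpc-server` process and open firewall/security-group rules before proceeding.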
Finally, you can run the following command to execute distributed inference:
```bash
bin/llama-cli -m /home/ubuntu/model.gguf -p "Tell me a joke" -n 128 --rpc "$worker_ips" -ngl 99
```
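Here, `$worker_ips` is expected to hold a comma-separated list of `IP:port` pairs for the workers. One way to build it, using the worker addresses and port from this example (edit the array for your own cluster):

```shell
# Build a comma-separated list of worker RPC endpoints (IP:port).
# The addresses match the example workers used above; adjust as needed.
workers=(172.31.110.11 172.31.110.12)
port=50052

worker_ips=""
for ip in "${workers[@]}"; do
  # ${worker_ips:+,} inserts a comma only when the list is non-empty.
  worker_ips+="${worker_ips:+,}${ip}:${port}"
done

echo "$worker_ips"   # 172.31.110.11:50052,172.31.110.12:50052
```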
{{% notice Note %}}At the time of publication, llama.cpp only supports up to 16 backend workers.{{% /notice %}} <br>
The model file for this experiment is hosted on Arm’s private AWS S3 bucket. If you don’t have access to it, you can find a publicly available version of the model on Hugging Face.
The output should look similar to the following:
```output
...
```

That's it! You have successfully run the llama-3.1-8B model on CPUs with the power of distributed inference.
Lastly, to set up an OpenAI-compatible API, you can use the `llama-server` functionality. The process of implementing this is described [here](/learning-paths/servers-and-cloud-computing/llama-cpu) under the "Access the chatbot using the OpenAI-compatible API" section. Here is a snippet showing how to set up `llama-server` for distributed inference:
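A minimal sketch of such an invocation, reusing the model path and worker addresses from the `llama-cli` example above; the `--host` and `--port` values are illustrative, so adapt them to your environment:

```shell
# Start an OpenAI-compatible server that offloads layers to the RPC workers.
# Model path, worker addresses, and host/port are example values.
bin/llama-server -m /home/ubuntu/model.gguf \
  --rpc 172.31.110.11:50052,172.31.110.12:50052 \
  -ngl 99 \
  --host 0.0.0.0 --port 8080
```

Once the server is running, clients can send requests to the `/v1/chat/completions` endpoint on the chosen port, as described in the linked learning path.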