Update the README file to match the newly added functionality of exposing multiple devices from a single server.

Co-authored-by: Diego Devesa <[email protected]>
You can control the set of exposed CUDA devices with the `CUDA_VISIBLE_DEVICES` environment variable or the `--device` command line option. The following two commands have the same effect:
```bash
$ CUDA_VISIBLE_DEVICES=0 bin/rpc-server -p 50052
$ bin/rpc-server --device CUDA0 -p 50052
```
This way you can run multiple `rpc-server` instances on the same host, each with a different CUDA device.
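For example, a minimal sketch of exposing two GPUs from two separate `rpc-server` instances on the same host (the port numbers are arbitrary, and the second device name `CUDA1` is assumed):

```bash
# expose the first GPU on port 50052
$ bin/rpc-server --device CUDA0 -p 50052 &
# expose the second GPU on port 50053 (assumes a second device named CUDA1)
$ bin/rpc-server --device CUDA1 -p 50053 &
```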
### Main host
On the main host, build `llama.cpp` with the backends for the local devices and add `-DGGML_RPC=ON` to the build options.
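For example, a build that enables RPC support alongside a local CUDA backend could look like this (adjust the backend flags to match your local devices):

```bash
# configure with the CUDA backend and RPC support, then build
$ cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON
$ cmake --build build --config Release
```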
Finally, when running `llama-cli` or `llama-server`, use the `--rpc` option to specify the host and port of each `rpc-server`:
```bash
$ bin/llama-cli -m ../models/tinyllama-1b/ggml-model-f16.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 --rpc 192.168.88.10:50052,192.168.88.11:50052 -ngl 99
```

This way you can offload model layers to both local and remote devices.
By default, llama.cpp distributes model weights and the KV cache across all available devices -- both local and remote -- in proportion to each device's available memory.
You can override this behavior with the `--tensor-split` option and set custom proportions when splitting tensor data across devices.
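For example, a hypothetical 3:1 split between the first and second device could be requested like this (`model.gguf` and the address are placeholders):

```bash
# put ~75% of the tensor data on the first device and ~25% on the second
$ bin/llama-cli -m model.gguf --rpc 192.168.88.10:50052 -ngl 99 --tensor-split 3,1
```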
### Local cache
Enable the local cache by starting `rpc-server` with the `-c` option:

```bash
$ bin/rpc-server -c
```
By default, the cache is stored in the `$HOME/.cache/llama.cpp/rpc` directory and can be controlled via the `LLAMA_CACHE` environment variable.
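For example, to keep the cache on a different filesystem (the path below is just a placeholder):

```bash
# store cached tensors under a custom directory instead of $HOME/.cache/llama.cpp/rpc
$ LLAMA_CACHE=/mnt/scratch/llama-rpc-cache bin/rpc-server -c -p 50052
```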
### Troubleshooting
Use the `GGML_RPC_DEBUG` environment variable to enable debug messages from `rpc-server`:
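```bash
# assumes setting the variable to 1 turns on the debug output
$ GGML_RPC_DEBUG=1 bin/rpc-server -p 50052
```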