
Commit 8e9b2e2

dtrawins, mzegla, and Copilot authored

add example for llama-swap integration (#3818)
### 🛠 Summary

Add an example showing how to integrate OVMS with llama-swap for managing idle models.

### 🧪 Checklist

- [ ] Unit tests added.
- [ ] The documentation updated.
- [ ] Change follows security best practices.

---------

Co-authored-by: Miłosz Żeglarski <[email protected]>
Co-authored-by: Copilot <[email protected]>
1 parent 1280294 commit 8e9b2e2

File tree

3 files changed: +189 -0 lines changed

ci/lib_search.py

Lines changed: 1 addition & 0 deletions
@@ -151,6 +151,7 @@ def check_dir(start_dir):
     "results.txt",
     "windows_bdba.bat",
     "windows_sign.bat",
+    "config.yaml",
     "kserve-openvino.yaml",
 ]

extras/llama_swap/README.md

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
# OpenVINO Model Server service integration with llama_swap

In scenarios where OVMS is installed on a client platform, the host often doesn't have the capacity to keep all the desired models loaded at the same time.

[Llama_swap](https://github.com/mostlygeek/llama-swap) provides the capability to load models on demand and unload them when they are not needed.

While this tool was implemented for the llama.cpp project, it can easily be enabled for OpenVINO Model Server as well.

## Prerequisites

- OVMS installed as a [Windows service](../../docs/windows_service.md)

## Pull the models needed for the deployment

```bat
ovms pull --task embeddings --model_name OpenVINO/Qwen3-Embedding-0.6B-int8-ov --target_device GPU --cache_dir .ov_cache --pooling LAST
ovms pull --task text_generation --model_name OpenVINO/Qwen3-4B-int4-ov --target_device GPU --cache_dir .ov_cache --tool_parser hermes3
ovms pull --task text_generation --model_name OpenVINO/InternVL2-2B-int4-ov --target_device GPU --cache_dir .ov_cache
ovms pull --task text_generation --model_name OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --target_device GPU --cache_dir .ov_cache --tool_parser mistral
```

## Configure config.yaml for llama_swap

Follow the [installation steps](https://github.com/mostlygeek/llama-swap/tree/main?tab=readme-ov-file#installation). Using the Windows binary package is recommended.

The important elements of the OVMS integration, for each model, are:

```yaml
cmd: |
  powershell -NoProfile -Command "ovms.exe --add_to_config --model_name ${MODEL_ID}; Start-Sleep -Seconds 999999"
cmdStop: |
  powershell -NoProfile -Command "ovms.exe --remove_from_config --model_name ${MODEL_ID}"
proxy: ${base_url}
checkEndpoint: models/${MODEL_ID}
name: ${MODEL_ID}
```

This configuration adds a model to, and removes it from, the OVMS config.json on demand, which automatically loads or unloads the model in the running service.
Thanks to the `cache_dir`, which stores the model compilation results, reloading a model is faster.

Here is an example of a complete [config.yaml](./config.yaml).

Models that should act together in a workflow should be grouped to minimize the impact of model loading time; check the llama-swap documentation for details. Be aware that reloading a model clears the KV cache.
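
For illustration, here is a minimal sketch of the readiness probe implied by the settings above. It assumes llama-swap appends `checkEndpoint` to the proxy URL, i.e. it issues `GET http://127.0.0.1:8000/v3/models/<model_name>` against OVMS, and that a 200 response means the model is loaded; the constants are just the values used in this example.

```python
import time

import requests

BASE_URL = "http://127.0.0.1:8000/v3"   # the ${base_url} macro from config.yaml
MODEL_ID = "OpenVINO/Qwen3-4B-int4-ov"  # any model name from the config

# Poll the same endpoint llama-swap uses as checkEndpoint until the model
# reports ready, or give up after roughly healthCheckTimeout seconds.
deadline = time.time() + 120
while time.time() < deadline:
    try:
        if requests.get(f"{BASE_URL}/models/{MODEL_ID}", timeout=5).status_code == 200:
            print(f"{MODEL_ID} is loaded and ready")
            break
    except requests.ConnectionError:
        pass  # OVMS may still be starting or reloading its config
    time.sleep(2)
else:
    print(f"{MODEL_ID} did not become ready in time")
```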

## Connect from the client

Start the llama-swap proxy as:
```
llama-swap.exe -listen 127.0.0.1:8080 -watch-config
```
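
Once the proxy is running, you can optionally check which models it exposes. Below is a minimal sketch using the OpenAI Python client, assuming llama-swap serves the standard `/v1/models` listing on the address above:

```python
from openai import OpenAI

# Point the client at the llama-swap proxy started above.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="unused")

# The returned IDs should match the entries under `models:` in config.yaml.
for model in client.models.list():
    print(model.id)
```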

On the OpenAI client, connect using `base_url=http://127.0.0.1:8080/v1`.

For example:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8080/v1",
    api_key="unused"
)

stream = client.chat.completions.create(
    model="OpenVINO/Qwen3-4B-int4-ov",
    messages=[{"role": "user", "content": "Hello."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
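
The same proxy also swaps in the embeddings model on demand. A minimal sketch, reusing the endpoint above and the `OpenVINO/Qwen3-Embedding-0.6B-int8-ov` model pulled earlier (the exact swap behaviour depends on your ttl and grouping settings):

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="unused")

# The first embeddings request makes llama-swap load the embeddings model
# into OVMS; the previously used model may be unloaded per the configuration.
response = client.embeddings.create(
    model="OpenVINO/Qwen3-Embedding-0.6B-int8-ov",
    input=["OpenVINO Model Server integrates with llama-swap."],
)
print(len(response.data[0].embedding))
```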

## Limitations

Currently, llama-swap doesn't support the `image` and `rerank` endpoints. It can be used for the `chat/completions`, `embeddings` and `audio` endpoints.

extras/llama_swap/config.yaml

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
# llama-swap YAML configuration example
# -------------------------------------
#
# 💡 Tip - Use an LLM with this file!
# ====================================
# This example configuration is written to be LLM friendly. Try
# copying this file into an LLM and asking it to explain or generate
# sections for you.
# ====================================

# Usage notes:
# - Below are all the available configuration options for llama-swap.
# - Settings noted as "required" must be in your configuration file
# - Settings noted as "optional" can be omitted

# healthCheckTimeout: number of seconds to wait for a model to be ready to serve requests
# - optional, default: 120
# - minimum value is 15 seconds, anything less will be set to this value
healthCheckTimeout: 120

# logLevel: sets the logging value
# - optional, default: info
# - Valid log levels: debug, info, warn, error
logLevel: debug

# metricsMaxInMemory: maximum number of metrics to keep in memory
# - optional, default: 1000
# - controls how many metrics are stored in memory before older ones are discarded
# - useful for limiting memory usage when processing large volumes of metrics
metricsMaxInMemory: 1000

ttl: 300

# startPort: sets the starting port number for the automatic ${PORT} macro.
# - optional, default: 5800
# - the ${PORT} macro can be used in model.cmd and model.proxy settings
# - it is automatically incremented for every model that uses it
startPort: 10001

# macros: a dictionary of string substitutions
# - optional, default: empty dictionary
# - macros are reusable snippets
# - used in a model's cmd, cmdStop, proxy and checkEndpoint
# - useful for reducing common configuration settings
macros:
  "base_url": http://127.0.0.1:8000/v3

# models: a dictionary of model configurations
# - required
# - each key is the model's ID, used in API requests
# - model settings have default values that are used if they are not defined here
# - below are examples of the various settings a model can have:
#   - available model settings: env, cmd, cmdStop, proxy, aliases, checkEndpoint, ttl, unlisted
models:

  OpenVINO/Qwen3-Embedding-0.6B-int8-ov:
    # cmd: the command to run to start the inference server.
    # - required
    # - it is just a string, similar to what you would run on the CLI
    # - using `|` allows for comments in the command, these will be parsed out
    # - macros can be used within cmd
    cmd: |
      powershell -NoProfile -Command "ovms.exe --add_to_config --model_name ${MODEL_ID}; Start-Sleep -Seconds 999999"
    cmdStop: |
      powershell -NoProfile -Command "ovms.exe --remove_from_config --model_name ${MODEL_ID}"
    proxy: ${base_url}
    checkEndpoint: models/${MODEL_ID}
    name: ${MODEL_ID}

  OpenVINO/Qwen3-4B-int4-ov:
    # cmd: the command to run to start the inference server.
    # - required
    # - it is just a string, similar to what you would run on the CLI
    # - using `|` allows for comments in the command, these will be parsed out
    # - macros can be used within cmd
    cmd: |
      powershell -NoProfile -Command "ovms.exe --add_to_config --model_name ${MODEL_ID}; Start-Sleep -Seconds 999999"
    cmdStop: |
      powershell -NoProfile -Command "ovms.exe --remove_from_config --model_name ${MODEL_ID}"
    proxy: ${base_url}
    checkEndpoint: models/${MODEL_ID}
    name: ${MODEL_ID}

  OpenVINO/InternVL2-2B-int4-ov:
    # cmd: the command to run to start the inference server.
    # - required
    # - it is just a string, similar to what you would run on the CLI
    # - using `|` allows for comments in the command, these will be parsed out
    # - macros can be used within cmd
    cmd: |
      powershell -NoProfile -Command "ovms.exe --add_to_config --model_name ${MODEL_ID}; Start-Sleep -Seconds 999999"
    cmdStop: |
      powershell -NoProfile -Command "ovms.exe --remove_from_config --model_name ${MODEL_ID}"
    proxy: ${base_url}
    checkEndpoint: models/${MODEL_ID}
    name: ${MODEL_ID}

  OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov:
    # cmd: the command to run to start the inference server.
    # - required
    # - it is just a string, similar to what you would run on the CLI
    # - using `|` allows for comments in the command, these will be parsed out
    # - macros can be used within cmd
    cmd: |
      powershell -NoProfile -Command "ovms.exe --add_to_config --model_name ${MODEL_ID}; Start-Sleep -Seconds 999999"
    cmdStop: |
      powershell -NoProfile -Command "ovms.exe --remove_from_config --model_name ${MODEL_ID}"
    proxy: ${base_url}
    checkEndpoint: models/${MODEL_ID}
    name: ${MODEL_ID}

0 commit comments

Comments
 (0)