
Commit 8e9b2e2

dtrawins, mzegla, and Copilot authored

add example for llama-swap integration (#3818)
### 🛠 Summary

Add an example showing how to integrate OVMS with llama-swap for managing idle models.

### 🧪 Checklist

- [ ] Unit tests added.
- [ ] The documentation updated.
- [ ] Change follows security best practices.

---------

Co-authored-by: Miłosz Żeglarski <[email protected]>
Co-authored-by: Copilot <[email protected]>
1 parent 1280294 commit 8e9b2e2

File tree

3 files changed: +189 -0 lines changed

ci/lib_search.py

Lines changed: 1 addition & 0 deletions
@@ -151,6 +151,7 @@ def check_dir(start_dir):
     "results.txt",
     "windows_bdba.bat",
     "windows_sign.bat",
+    "config.yaml",
     "kserve-openvino.yaml",
 ]

extras/llama_swap/README.md

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
# OpenVINO Model Server service integration with llama_swap

In scenarios where OVMS is installed on a client platform, the host often doesn't have the capacity to keep all the desired models loaded at the same time.

[Llama_swap](https://github.com/mostlygeek/llama-swap) provides the capability to load models on demand and unload them when they are not needed.

While this tool was implemented for the llama.cpp project, it can easily be enabled for OpenVINO Model Server as well.

## Prerequisites

- OVMS installed as a [Windows service](../../docs/windows_service.md)

## Pull the models needed for the deployment

```bat
ovms pull --task embeddings --model_name OpenVINO/Qwen3-Embedding-0.6B-int8-ov --target_device GPU --cache_dir .ov_cache --pooling LAST
ovms pull --task text_generation --model_name OpenVINO/Qwen3-4B-int4-ov --target_device GPU --cache_dir .ov_cache --tool_parser hermes3
ovms pull --task text_generation --model_name OpenVINO/InternVL2-2B-int4-ov --target_device GPU --cache_dir .ov_cache
ovms pull --task text_generation --model_name OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --target_device GPU --cache_dir .ov_cache --tool_parser mistral
```

## Configure config.yaml for llama_swap

Follow the [installation steps](https://github.com/mostlygeek/llama-swap/tree/main?tab=readme-ov-file#installation). Using the Windows binary package is recommended.

The important elements of the OVMS integration, for each model, are:

```yaml
cmd: |
  powershell -NoProfile -Command "ovms.exe --add_to_config --model_name ${MODEL_ID}; Start-Sleep -Seconds 999999"
cmdStop: |
  powershell -NoProfile -Command "ovms.exe --remove_from_config --model_name ${MODEL_ID}"
proxy: ${base_url}
checkEndpoint: models/${MODEL_ID}
name: ${MODEL_ID}
```

This configuration adds a model to, and removes it from, the OVMS config.json on demand, which automatically loads or unloads the model in the running service.
Thanks to the `cache_dir`, which stores the model compilation results, reloading a model is faster.

Here is an example of a complete [config.yaml](./config.yaml).

Models that should act together in a workflow should be grouped to minimize the impact of model loading time; check the llama-swap documentation for details. Be aware that reloading a model clears the KV cache.
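
For illustration, here is a minimal sketch of the readiness probe implied by the settings above. It assumes llama-swap appends `checkEndpoint` to the proxy URL, i.e. it issues `GET http://127.0.0.1:8000/v3/models/<model_name>` against OVMS, and that a 200 response means the model is loaded; the constants are just the values used in this example.

```python
import time

import requests

BASE_URL = "http://127.0.0.1:8000/v3"   # the ${base_url} macro from config.yaml
MODEL_ID = "OpenVINO/Qwen3-4B-int4-ov"  # any model name from the config

# Poll the same endpoint llama-swap uses as checkEndpoint until the model
# reports ready, or give up after roughly healthCheckTimeout seconds.
deadline = time.time() + 120
while time.time() < deadline:
    try:
        if requests.get(f"{BASE_URL}/models/{MODEL_ID}", timeout=5).status_code == 200:
            print(f"{MODEL_ID} is loaded and ready")
            break
    except requests.ConnectionError:
        pass  # OVMS may still be starting or reloading its config
    time.sleep(2)
else:
    print(f"{MODEL_ID} did not become ready in time")
```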

## Connect from the client

Start the llama-swap proxy as:
```
llama-swap.exe -listen 127.0.0.1:8080 -watch-config
```
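
Once the proxy is running, you can optionally check which models it exposes. Below is a minimal sketch using the OpenAI Python client, assuming llama-swap serves the standard `/v1/models` listing on the address above:

```python
from openai import OpenAI

# Point the client at the llama-swap proxy started above.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="unused")

# The returned IDs should match the entries under `models:` in config.yaml.
for model in client.models.list():
    print(model.id)
```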

On the OpenAI client, connect using `base_url=http://127.0.0.1:8080/v1`.

For example:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8080/v1",
    api_key="unused"
)

stream = client.chat.completions.create(
    model="OpenVINO/Qwen3-4B-int4-ov",
    messages=[{"role": "user", "content": "Hello."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
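
The same proxy also swaps in the embeddings model on demand. A minimal sketch, reusing the endpoint above and the `OpenVINO/Qwen3-Embedding-0.6B-int8-ov` model pulled earlier (the exact swap behaviour depends on your ttl and grouping settings):

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="unused")

# The first embeddings request makes llama-swap load the embeddings model
# into OVMS; the previously used model may be unloaded per the configuration.
response = client.embeddings.create(
    model="OpenVINO/Qwen3-Embedding-0.6B-int8-ov",
    input=["OpenVINO Model Server integrates with llama-swap."],
)
print(len(response.data[0].embedding))
```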

## Limitations

Currently, llama-swap doesn't support the `image` and `rerank` endpoints. It can be used for the `chat/completions`, `embeddings` and `audio` endpoints.

extras/llama_swap/config.yaml

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
# llama-swap YAML configuration example
# -------------------------------------
#
# 💡 Tip - Use an LLM with this file!
# ====================================
# This example configuration is written to be LLM friendly. Try
# copying this file into an LLM and asking it to explain or generate
# sections for you.
# ====================================

# Usage notes:
# - Below are all the available configuration options for llama-swap.
# - Settings noted as "required" must be in your configuration file
# - Settings noted as "optional" can be omitted

# healthCheckTimeout: number of seconds to wait for a model to be ready to serve requests
# - optional, default: 120
# - minimum value is 15 seconds, anything less will be set to this value
healthCheckTimeout: 120

# logLevel: sets the logging value
# - optional, default: info
# - Valid log levels: debug, info, warn, error
logLevel: debug

# metricsMaxInMemory: maximum number of metrics to keep in memory
# - optional, default: 1000
# - controls how many metrics are stored in memory before older ones are discarded
# - useful for limiting memory usage when processing large volumes of metrics
metricsMaxInMemory: 1000

ttl: 300

# startPort: sets the starting port number for the automatic ${PORT} macro.
# - optional, default: 5800
# - the ${PORT} macro can be used in model.cmd and model.proxy settings
# - it is automatically incremented for every model that uses it
startPort: 10001

# macros: a dictionary of string substitutions
# - optional, default: empty dictionary
# - macros are reusable snippets
# - used in a model's cmd, cmdStop, proxy and checkEndpoint
# - useful for reducing common configuration settings
macros:
  "base_url": http://127.0.0.1:8000/v3

# models: a dictionary of model configurations
# - required
# - each key is the model's ID, used in API requests
# - model settings have default values that are used if they are not defined here
# - below are examples of the various settings a model can have:
#   - available model settings: env, cmd, cmdStop, proxy, aliases, checkEndpoint, ttl, unlisted
models:

  OpenVINO/Qwen3-Embedding-0.6B-int8-ov:
    # cmd: the command to run to start the inference server.
    # - required
    # - it is just a string, similar to what you would run on the CLI
    # - using `|` allows for comments in the command, these will be parsed out
    # - macros can be used within cmd
    cmd: |
      powershell -NoProfile -Command "ovms.exe --add_to_config --model_name ${MODEL_ID}; Start-Sleep -Seconds 999999"
    cmdStop: |
      powershell -NoProfile -Command "ovms.exe --remove_from_config --model_name ${MODEL_ID}"
    proxy: ${base_url}
    checkEndpoint: models/${MODEL_ID}
    name: ${MODEL_ID}

  OpenVINO/Qwen3-4B-int4-ov:
    # cmd: the command to run to start the inference server.
    # - required
    # - it is just a string, similar to what you would run on the CLI
    # - using `|` allows for comments in the command, these will be parsed out
    # - macros can be used within cmd
    cmd: |
      powershell -NoProfile -Command "ovms.exe --add_to_config --model_name ${MODEL_ID}; Start-Sleep -Seconds 999999"
    cmdStop: |
      powershell -NoProfile -Command "ovms.exe --remove_from_config --model_name ${MODEL_ID}"
    proxy: ${base_url}
    checkEndpoint: models/${MODEL_ID}
    name: ${MODEL_ID}

  OpenVINO/InternVL2-2B-int4-ov:
    # cmd: the command to run to start the inference server.
    # - required
    # - it is just a string, similar to what you would run on the CLI
    # - using `|` allows for comments in the command, these will be parsed out
    # - macros can be used within cmd
    cmd: |
      powershell -NoProfile -Command "ovms.exe --add_to_config --model_name ${MODEL_ID}; Start-Sleep -Seconds 999999"
    cmdStop: |
      powershell -NoProfile -Command "ovms.exe --remove_from_config --model_name ${MODEL_ID}"
    proxy: ${base_url}
    checkEndpoint: models/${MODEL_ID}
    name: ${MODEL_ID}

  OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov:
    # cmd: the command to run to start the inference server.
    # - required
    # - it is just a string, similar to what you would run on the CLI
    # - using `|` allows for comments in the command, these will be parsed out
    # - macros can be used within cmd
    cmd: |
      powershell -NoProfile -Command "ovms.exe --add_to_config --model_name ${MODEL_ID}; Start-Sleep -Seconds 999999"
    cmdStop: |
      powershell -NoProfile -Command "ovms.exe --remove_from_config --model_name ${MODEL_ID}"
    proxy: ${base_url}
    checkEndpoint: models/${MODEL_ID}
    name: ${MODEL_ID}

0 commit comments

Comments
 (0)