# OpenVINO Model Server service integration with llama_swap
When OVMS is installed on a client platform, the host often doesn't have the capacity to load all of the desired models at the same time.
[llama-swap](https://github.com/mostlygeek/llama-swap) provides the capability to load models on demand and unload them when they are no longer needed.
While this tool was implemented for the llama.cpp project, it can also be easily enabled for OpenVINO Model Server.
## Prerequisites
- OVMS installed as a [Windows service](../../docs/windows_service.md)
- llama-swap installed following the [installation steps](https://github.com/mostlygeek/llama-swap/tree/main?tab=readme-ov-file#installation); using the Windows binary package is recommended
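
Before configuring llama-swap, you can sanity-check that the OVMS service is up. This is only a sketch: the service name `ovms` is an assumption and depends on the name chosen during installation.

```bat
:: Query the OVMS Windows service status (service name "ovms" is an assumption)
sc query ovms
```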
The important elements of the OVMS integration are the per-model commands that add and remove the model, as sketched below.
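
A minimal sketch of one model entry follows. It assumes an OVMS build that supports the `--add_to_config` and `--remove_from_config` options for editing config.json; all names, paths, and ports are illustrative, so adjust them to your deployment (see the linked [config.yaml](./config.yaml) for the actual configuration):

```yaml
models:
  "model1":
    # Add the model to the OVMS config.json when llama-swap activates it
    cmd: ovms --add_to_config c:\ovms\config.json --model_name model1 --model_path c:\models\model1
    # Remove the model from config.json when llama-swap deactivates it
    cmdStop: ovms --remove_from_config c:\ovms\config.json --model_name model1
    # Forward client requests to the OVMS REST endpoint
    proxy: http://localhost:8000
```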
This configuration adds and removes a model on demand from the OVMS config.json, which automatically loads or unloads the model in the running service.
Thanks to cache_dir, which stores the model compilation results, reloading a model is faster.
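
As an illustration (the paths here are hypothetical; `--cache_dir` is a standard OVMS startup parameter), the service can be started with a cache directory so compiled models are reused across reloads:

```bat
ovms --rest_port 8000 --config_path c:\ovms\config.json --cache_dir c:\ovms\cache
```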
Here is an example of a complete [config.yaml](./config.yaml).
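
Once llama-swap is running with such a configuration, clients call its OpenAI-compatible API and llama-swap activates the requested model on first use. A hypothetical request (the port and model name depend on your configuration):

```bat
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"model1\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello\"}]}"
```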
Models that work together in a workflow should be grouped to minimize the impact of model loading time; check the llama-swap documentation for the details, and see the sketch below. Be aware that reloading a model clears its KV cache.
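
For illustration only, assuming llama-swap's documented `groups` section (group and model names here are hypothetical), a group that keeps two cooperating models resident at the same time might look like this:

```yaml
groups:
  "workflow":
    # With swap disabled, members of this group can run concurrently,
    # so a workflow using both models avoids repeated load/unload cycles
    swap: false
    members:
      - "model1"
      - "model2"
```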