Starting the server with GPU acceleration requires installation of the runtime drivers and the `ocl-icd-libopencl1` package, as described in the [configuration guide](https://docs.openvino.ai/2025/get-started/install-openvino/configurations/configurations-intel-gpu.html).
Start the model server with GPU acceleration using the command:
```bat
ovms --model_path model --model_name resnet --port 9000 --target_device GPU
```
`docs/llm/reference.md`
## Overview
With the rapid development of generative AI, new techniques and algorithms for performance optimization and better resource utilization are introduced to make the best use of the hardware and provide the best generation performance. OpenVINO implements those state-of-the-art methods in its [GenAI Library](https://github.com/openvinotoolkit/openvino.genai), like:
- Continuous Batching
- Paged Attention
- Dynamic Split Fuse
Check out the [quickstart guide](quickstart.md) for a simple example that shows how to use this feature.
## Servable Types
Starting with the 2025.1 release, we can distinguish four servable types. The distinction is based on the input type and the underlying GenAI pipeline.
The servable types are:
- Language Model Continuous Batching,
- Language Model Stateful,
- Visual Language Model Continuous Batching,
- Visual Language Model Stateful.
The first part - Language Model / Visual Language Model - determines whether the servable accepts only text or both text and images on input.
The second part - Continuous Batching / Stateful - determines what kind of GenAI pipeline is used as the engine. By default, CPU and GPU devices work with Continuous Batching pipelines. The NPU device works only with the Stateful servable type.
The user does not have to explicitly select the servable type. It is inferred from the model directory contents and the selected target device.
The model directory contents also determine whether the model works with text only or with visual input as well. As for the target device, setting it to `NPU` always picks the Stateful servable, while any other device results in deploying a Continuous Batching servable.
Stateful servables ignore most of the configuration used by Continuous Batching, as described later in this document. Some servable types have additional restrictions listed in the limitations section at the end of this document.
Despite all the differences, all servable types share the same LLM calculator, which imposes a common flow in every GenAI-based endpoint.
## LLM Calculator
As you can see in the quickstart, a big part of the configuration resides in the `graph.pbtxt` file. That's because model server text generation servables are deployed as MediaPipe graphs with a dedicated LLM calculator that works with the latest [OpenVINO GenAI](https://github.com/openvinotoolkit/openvino.genai/tree/releases/2025/1/src/cpp/include/openvino/genai) library. The calculator is designed to run in cycles and return chunks of responses to the client.
On the input, it expects an `HttpPayload` struct passed by the Model Server frontend:
```cpp
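// Approximate shape of the payload struct (a hedged sketch based on the
// surrounding description; exact field names and types may differ between
// model server versions):
struct HttpPayload {
    std::string uri;                                           // request path
    std::vector<std::pair<std::string, std::string>> headers;  // HTTP headers
    std::string body;                                          // raw JSON request body
    rapidjson::Document* parsedJson;                           // pre-parsed body (null on parsing errors)
};
```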
The `dynamic_split_fuse` [algorithm](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen#b-dynamic-splitfuse-) is enabled by default to boost throughput by splitting tokens into even chunks. In some conditions, such as very low concurrency or very short prompts, it might be beneficial to disable this algorithm. When it is disabled, the `max_num_batched_tokens` parameter should also be set to match the model's maximum context length.
`plugin_config` accepts a JSON dictionary of tuning parameters for the OpenVINO plugin. It can tune the behavior of the inference runtime. For example, you can enable KV cache compression or set the quantization group size: `{"KV_CACHE_PRECISION": "u8", "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32"}`.
**Important for NPU users**: the NPU plugin sets a limit on the prompt length (1024 tokens by default) that can be modified by setting `MAX_PROMPT_LEN` in `plugin_config`. For example, to double that limit, set `{"MAX_PROMPT_LEN": 2048}`.
The LLM calculator config can also restrict the range of sampling parameters in client requests. If needed, change the default value of `best_of_limit` or set `max_tokens_limit`. This is meant to avoid memory overconsumption caused by invalid requests.
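For illustration, these limits apply to fields such as `max_tokens` and `best_of` in the client request. A hedged sketch of such a request is shown below; the endpoint path, port, and model name are assumptions, not taken from this document:

```console
curl http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "my_model",
        "max_tokens": 512,
        "best_of": 2,
        "messages": [{"role": "user", "content": "Write a haiku about GPUs."}]
      }'
```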
**Note that the following options are ignored in Stateful servables (so in deployments on NPU): `cache_size`, `dynamic_split_fuse`, `max_num_batched_tokens`, `max_num_seq`, `enable_prefix_caching`**
## Canceling the generation
The main model as well as the tokenizer and detokenizer are loaded from `.xml` and `.bin` files, and all of them are required. `tokenizer_config.json` and `template.jinja` are loaded to read the information required for chat template processing.
Additionally, Visual Language Models have encoder and decoder models for text and vision, and potentially other auxiliary models.
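Put together, a text generation model directory typically looks roughly like the listing below (a hedged sketch; the path is hypothetical and the exact file names depend on the export tooling):

```console
tree models/my_llm
models/my_llm
├── graph.pbtxt
├── openvino_model.xml
├── openvino_model.bin
├── openvino_tokenizer.xml
├── openvino_tokenizer.bin
├── openvino_detokenizer.xml
├── openvino_detokenizer.bin
├── tokenizer_config.json
└── template.jinja
```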
This model directory can be created based on models from the Hugging Face Hub or from a PyTorch model stored on the local filesystem. Exporting the models to Intermediate Representation format is a one-time operation and can speed up loading time and reduce storage volume when combined with quantization and compression.
We recommend using the [export script](../../demos/common/export_models/README.md) to prepare the models directory structure for serving.
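A hedged usage sketch is shown below; the script options and the model name are assumptions, so check the export script README for the exact interface:

```console
pip3 install -r demos/common/export_models/requirements.txt
python demos/common/export_models/export_model.py text_generation \
    --source_model meta-llama/Llama-3.2-1B-Instruct \
    --weight-format int4 \
    --config_file_path models/config.json \
    --model_repository_path models
```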
Errors during configuration file processing (access issues, corrupted files, incorrect content) result in servable loading failure.
## Limitations
There are several known limitations which are expected to be addressed in the coming releases:
- `logprobs` parameter is currently not supported in streaming mode. It includes only a single logprob and does not include values for input tokens.
- Server logs might sporadically include the message "PCRE2 substitution failed with error code -55" - this message can be safely ignored. It will be removed in the next version.
Some servable types introduce additional limitations:
### Stateful servable limitations
- `finish_reason` is not supported (always set to `stop`),
- `logprobs` is not supported,
- sequential request processing (only one request is handled at a time).
### Visual Language servable limitations
- works only with the `/chat/completions` endpoint,
- `image_url` input supports only base64-encoded images, not actual URLs (see the hedged request sketch below).
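For example, an image can be passed as a base64-encoded data URL inside the message content. A hedged sketch follows; the endpoint path, port, and model name are assumptions, not taken from this document:

```console
# Encode the image and embed it as a data URL in the request
IMG_B64=$(base64 -w0 image.jpg)
curl http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "my_vlm",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,'"$IMG_B64"'"}}
          ]
        }]
      }'
```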