
Commit f243950

VLM docs update [releases/2025/1] (openvinotoolkit#3210)
1 parent a272fd9 commit f243950

File tree: 6 files changed, +140 -38 lines changed

demos/image_classification_using_tf_model/python/README.md

Lines changed: 1 addition & 1 deletion
@@ -42,7 +42,7 @@ Make sure to:
on every shell that will start OpenVINO Model Server.

And start Model Server using the following command:
-```console
+```bat
ovms --model_name resnet --model_path model/ --port 9000
```
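For context, the command above exposes the resnet model over gRPC on port 9000. A minimal Python client sketch using the `ovmsclient` package is shown below; the input name "0" and the dummy NCHW tensor are placeholders and should be replaced based on the actual model metadata.

```python
# Minimal client sketch for the deployment above (assumes: pip install ovmsclient numpy,
# server reachable on localhost:9000, and a single image input named "0" -- verify the
# real input name with client.get_model_metadata(model_name="resnet")).
import numpy as np
from ovmsclient import make_grpc_client

client = make_grpc_client("localhost:9000")

# Dummy NCHW float tensor standing in for a preprocessed image.
image = np.zeros((1, 3, 224, 224), dtype=np.float32)

outputs = client.predict(inputs={"0": image}, model_name="resnet")
print(np.argmax(outputs))  # index of the highest-scoring class
```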

demos/image_classification_with_string_output/README.md

Lines changed: 1 addition & 1 deletion
@@ -39,7 +39,7 @@ Make sure to:
on every shell that will start OpenVINO Model Server.

And start Model Server using the following command:
-```console
+```bat
ovms --model_name mobile_net --model_path model/ --rest_port 8000
```
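For context, this deployment exposes the mobile_net model over the REST interface on port 8000. A rough Python client sketch with the `ovmsclient` HTTP client follows; the input name and the raw JPEG payload are assumptions to be checked against the demo README and the model metadata.

```python
# Rough REST client sketch for the deployment above (assumes: pip install ovmsclient,
# REST endpoint on localhost:8000, and one image input accepting encoded JPEG bytes).
from ovmsclient import make_http_client

client = make_http_client("localhost:8000")
print(client.get_model_metadata(model_name="mobile_net"))  # inspect real input/output names

with open("zebra.jpeg", "rb") as f:
    image = f.read()

# The input name "image" is a placeholder; in this demo the model is expected to
# return a human readable label as a string output.
outputs = client.predict(inputs={"image": image}, model_name="mobile_net")
print(outputs)
```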

demos/universal-sentence-encoder/README.md

Lines changed: 1 addition & 1 deletion
@@ -56,7 +56,7 @@ Make sure to:
on every shell that will start OpenVINO Model Server.

And start Model Server using the following command:
-```console
+```bat
ovms --model_name usem --model_path universal-sentence-encoder-multilingual/ --plugin_config "{\"NUM_STREAMS\": 1}" --port 9000 --rest_port 8000
```
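For context, the usem model served above takes sentences and returns their embeddings. A tentative Python client sketch over gRPC follows; passing strings directly and the input name "inputs" are assumptions, so verify them against the demo README and the model metadata before use.

```python
# Tentative gRPC client sketch for the universal sentence encoder served above
# (assumes: pip install ovmsclient, gRPC endpoint on localhost:9000, and a string
# input named "inputs" -- check client.get_model_metadata(model_name="usem")).
from ovmsclient import make_grpc_client

client = make_grpc_client("localhost:9000")
embeddings = client.predict(
    inputs={"inputs": ["dogs are very cute animals"]},
    model_name="usem",
)
print(embeddings.shape)  # expected: one embedding vector per input sentence
```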

docs/accelerators.md

Lines changed: 8 additions & 8 deletions
@@ -65,7 +65,7 @@ docker run --rm -it --device=/dev/dxg --volume /usr/lib/wsl:/usr/lib/wsl -u $(i
Starting the server with GPU acceleration requires installation of runtime drivers and ocl-icd-libopencl1 package like described on [configuration guide](https://docs.openvino.ai/2025/get-started/install-openvino/configurations/configurations-intel-gpu.html)

Start the model server with GPU accelerations using a command:
-```console
+```bat
ovms --model_path model --model_name resnet --port 9000 --target_device GPU
```

@@ -83,7 +83,7 @@ docker run --device /dev/accel -p 9000:9000 --group-add=$(stat -c "%g" /dev/dri/

### Binary package
Start the model server with NPU accelerations using a command:
-```console
+```bat
ovms --model_path model --model_name resnet --port 9000 --target_device NPU --batch_size 1
```

@@ -113,7 +113,7 @@ docker run --rm -d --device=/dev/dri --group-add=$(stat -c "%g" /dev/dri/render*

### Binary package

-```console
+```bat
ovms --model_path model --model_name resnet --port 9000 --target_device "HETERO:GPU,CPU"
```

@@ -181,22 +181,22 @@ docker run --rm -d --device=/dev/dri --group-add=$(stat -c "%g" /dev/dri/render*
Below is the equivalent of the deployment command with a binary package at below:

AUTO
-```console
+```bat
ovms --model_path model --model_name resnet --port 9000 --target_device AUTO:GPU,CPU
```

THROUGHPUT
-```console
+```bat
ovms --model_path model --model_name resnet --port 9000 --plugin_config "{\"PERFORMANCE_HINT\": \"THROUGHPUT\"}" --target_device AUTO:GPU,CPU
```

LATENCY
-```console
+```bat
ovms --model_path model --model_name resnet --port 9000 --plugin_config "{\"PERFORMANCE_HINT\": \"LATENCY\"}" --target_device AUTO:GPU,CPU
```

CUMULATIVE_THROUGHPUT
-```console
+```bat
ovms --model_path model --model_name resnet --port 9000 --plugin_config "{\"PERFORMANCE_HINT\": \"CUMULATIVE_THROUGHPUT\"}" --target_device AUTO:GPU,CPU
```

@@ -223,6 +223,6 @@ In the example above, there will be 200ms timeout to wait for filling the batch
### Binary package

The same deployment with a binary package can be completed with a command:
-```console
+```bat
ovms --model_path model --model_name resnet --port 9000 --plugin_config "{\"AUTO_BATCH_TIMEOUT\": 200}" --target_device "BATCH:CPU(16)"
```
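For illustration, with automatic batching the clients still send single-sample requests and the BATCH plugin groups them on the server side within the AUTO_BATCH_TIMEOUT window. A rough sketch of such concurrent traffic with the `ovmsclient` package follows; the input name "0" and the tensor shape are placeholders to adjust to the served model.

```python
# Rough sketch of concurrent single-image requests against the deployment above
# (assumes: pip install ovmsclient numpy, gRPC endpoint on localhost:9000, and a
# single NCHW input named "0"). With BATCH:CPU(16) the plugin can group up to 16
# such requests arriving within the 200 ms AUTO_BATCH_TIMEOUT window.
from concurrent.futures import ThreadPoolExecutor
import numpy as np
from ovmsclient import make_grpc_client

client = make_grpc_client("localhost:9000")

def infer(_):
    image = np.random.rand(1, 3, 224, 224).astype(np.float32)
    return client.predict(inputs={"0": image}, model_name="resnet")

# Fire 64 parallel single-sample requests so the server has something to batch.
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(infer, range(64)))
print(len(results), "responses received")
```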

docs/clients_genai.md

Lines changed: 87 additions & 0 deletions
@@ -74,6 +74,47 @@ curl http://localhost:8000/v3/chat/completions \
:::
::::

+### Request chat completions with unary calls (with image input)
+
+::::{tab-set}
+:::{tab-item} python [OpenAI]
+:sync: python-openai
+```{code} python
+import base64
+from openai import OpenAI
+
+def encode_image(image_path):
+    with open(image_path, "rb") as image_file:
+        return base64.b64encode(image_file.read()).decode("utf-8")
+
+image_path = "/path/to/image"
+image = encode_image(image_path)
+
+client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")
+response = client.chat.completions.create(
+    model="openbmb/MiniCPM-V-2_6",
+    messages=[
+        {
+            "role": "user",
+            "content": [
+                {
+                    "type": "text",
+                    "text": "What is in this image?",
+                },
+                {
+                    "type": "image_url",
+                    "image_url": {"url": f"data:image/jpeg;base64,{image}"},
+                },
+            ],
+        }
+    ],
+    stream=False,
+)
+print(response.choices[0].message)
+```
+:::
+::::
+
Check [LLM quick start](./llm/quickstart.md) and [end to end demo of text generation](../demos/continuous_batching/README.md).

### Request completions with unary calls
@@ -137,6 +178,52 @@ for chunk in stream:
```
:::
::::
+
+### Request chat completions with streaming (with image input)
+
+::::{tab-set}
+:::{tab-item} python [OpenAI]
+:sync: python-openai
+```{code} python
+import base64
+from openai import OpenAI
+
+def encode_image(image_path):
+    with open(image_path, "rb") as image_file:
+        return base64.b64encode(image_file.read()).decode("utf-8")
+
+image_path = "/path/to/image"
+image = encode_image(image_path)
+
+client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")
+
+stream = client.chat.completions.create(
+    model="openbmb/MiniCPM-V-2_6",
+    messages=[
+        {
+            "role": "user",
+            "content": [
+                {
+                    "type": "text",
+                    "text": "What is in this image?",
+                },
+                {
+                    "type": "image_url",
+                    "image_url": {"url": f"data:image/jpeg;base64,{image}"},
+                },
+            ],
+        }
+    ],
+    stream=True,
+)
+
+for chunk in stream:
+    if chunk.choices[0].delta.content is not None:
+        print(chunk.choices[0].delta.content, end="")
+```
+:::
+::::
+
Check [LLM quick start](./llm/quickstart.md) and [end to end demo of text generation](../demos/continuous_batching/README.md).

### Request completions with streaming

docs/llm/reference.md

Lines changed: 42 additions & 27 deletions
@@ -2,7 +2,7 @@

## Overview

-With rapid development of generative AI, new techniques and algorithms for performance optimization and better resource utilization are introduced to make best use of the hardware and provide best generation performance. OpenVINO implements those state of the art methods in it's [GenAI Library](https://github.com/ilya-lavrenov/openvino.genai/tree/ct-beam-search/text_generation/causal_lm/cpp/continuous_batching/library) like:
+With rapid development of generative AI, new techniques and algorithms for performance optimization and better resource utilization are introduced to make best use of the hardware and provide best generation performance. OpenVINO implements those state of the art methods in its [GenAI Library](https://github.com/openvinotoolkit/openvino.genai) like:
- Continuous Batching
- Paged Attention
- Dynamic Split Fuse
@@ -12,8 +12,27 @@ It is now integrated into OpenVINO Model Server providing efficient way to run g

Check out the [quickstart guide](quickstart.md) for a simple example that shows how to use this feature.

+## Servable Types
+
+Starting with 2025.1, we can highlight four servable types. The distinction is made based on the input type and the underlying GenAI pipeline.
+The servable types are:
+- Language Model Continuous Batching,
+- Language Model Stateful,
+- Visual Language Model Continuous Batching,
+- Visual Language Model Stateful.
+
+The first part - Language Model / Visual Language Model - determines whether the servable accepts only text or both text and images on the input.
+The second part - Continuous Batching / Stateful - determines what kind of GenAI pipeline is used as the engine. By default, CPU and GPU devices work with Continuous Batching pipelines. The NPU device works only with the Stateful servable type.
+
+The user does not have to explicitly select the servable type. It is inferred from the model directory contents and the selected target device.
+The model directory contents also determine whether the model works with text only or with visual input as well. As for the target device, setting it to `NPU` always picks the Stateful servable, while any other device results in deploying the Continuous Batching servable.
+
+Stateful servables ignore most of the configuration used by Continuous Batching, as noted later in this document. Some servable types have additional limitations listed in the limitations section at the end of this document.
+
+Despite all the differences, all servable types share the same LLM calculator, which imposes a common flow in every GenAI-based endpoint.
+
## LLM Calculator
-As you can see in the quickstart above, big part of the configuration resides in `graph.pbtxt` file. That's because model server text generation servables are deployed as MediaPipe graphs with dedicated LLM calculator that works with latest [OpenVINO GenAI](https://github.com/openvinotoolkit/openvino.genai/tree/master/src/cpp/include/openvino/genai) library. The calculator is designed to run in cycles and return the chunks of responses to the client.
+As you can see in the quickstart, a big part of the configuration resides in the `graph.pbtxt` file. That's because model server text generation servables are deployed as MediaPipe graphs with a dedicated LLM calculator that works with the latest [OpenVINO GenAI](https://github.com/openvinotoolkit/openvino.genai/tree/releases/2025/1/src/cpp/include/openvino/genai) library. The calculator is designed to run in cycles and return the chunks of responses to the client.

On the input it expects a HttpPayload struct passed by the Model Server frontend:
```cpp
@@ -99,10 +118,14 @@ utilization of resource will be lower. Old cache will be cleared automatically b

`dynamic_split_fuse` [algorithm](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen#b-dynamic-splitfuse-) is enabled by default to boost the throughput by splitting the tokens to even chunks. In some conditions like with very low concurrency or with very short prompts, it might be beneficial to disable this algorithm. When it is disabled, there should be set also the parameter `max_num_batched_tokens` to match the model max context length.

-`plugin_config` accepts a json dictionary of tuning parameters for the OpenVINO plugin. It can tune the behavior of the inference runtime. For example you can include there kv cache compression or the group size '{"KV_CACHE_PRECISION": "u8", "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32"}'.
+`plugin_config` accepts a JSON dictionary of tuning parameters for the OpenVINO plugin. It can tune the behavior of the inference runtime. For example, you can include KV cache compression or the group size there: `{"KV_CACHE_PRECISION": "u8", "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32"}`.
+
+**Important for NPU users**: The NPU plugin sets a limit on the prompt length (1024 tokens by default). It can be modified by setting `MAX_PROMPT_LEN` in `plugin_config`; for example, to double that limit set `{"MAX_PROMPT_LEN": 2048}`.

The LLM calculator config can also restrict the range of sampling parameters in the client requests. If needed change the default values for `best_of_limit` or set `max_tokens_limit`. It is meant to avoid the result of memory overconsumption by invalid requests.

+**Note that the following options are ignored in Stateful servables (so in deployments on NPU): cache_size, dynamic_split_fuse, max_num_batched_tokens, max_num_seq, enable_prefix_caching**
+

## Canceling the generation

@@ -136,29 +159,11 @@ In node configuration we set `models_path` indicating location of the directory

Main model as well as tokenizer and detokenizer are loaded from `.xml` and `.bin` files and all of them are required. `tokenizer_config.json` and `template.jinja` are loaded to read information required for chat template processing.

-This model directory can be created based on the models from Hugging Face Hub or from the PyTorch model stored on the local filesystem. Exporting the models to Intermediate Representation format is one time operation and can speed up the loading time and reduce the storage volume, if it's combined with quantization and compression.
-
-In your python environment install required dependencies:
-```
-pip3 install "optimum-intel[nncf,openvino]
-```
+Additionally, Visual Language Models have encoder and decoder models for text and vision and potentially other auxiliary models.

-Because there is very dynamic development in optimum-intel and openvino, it is recommended to use the latest versions of the dependencies:
-```
-export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu https://storage.openvinotoolkit.org/simple/wheels/pre-release"
-pip3 install --pre "optimum-intel[nncf,openvino]"@git+https://github.com/huggingface/optimum-intel.git openvino_tokenizers openvino
-```
-
-LLM model can be exported with a command:
-```
-optimum-cli export openvino --disable-convert-tokenizer --model {LLM model in HF hub or Pytorch model folder} --weight-format {fp32/fp16/int8/int4/int4_sym_g128/int4_asym_g128/int4_sym_g64/int4_asym_g64} {target folder name}
-```
-Precision parameter is important and can influence performance, accuracy and memory usage. It is recommended to start experiments with `fp16`. The precision `int8` can reduce the memory consumption and improve latency with low impact on accuracy. Try `int4` to minimize memory usage and check various algorithm to achieve optimal results.
+This model directory can be created based on the models from Hugging Face Hub or from a PyTorch model stored on the local filesystem. Exporting the models to Intermediate Representation format is a one-time operation and can speed up the loading time and reduce the storage volume, if it's combined with quantization and compression.

-Export the tokenizer model with a command:
-```
-convert_tokenizer -o {target folder name} --utf8_replace_mode replace --with-detokenizer --skip-special-tokens --streaming-detokenizer --not-add-special-tokens {tokenizer model in HF hub or Pytorch model folder}
-```
+We recommend using the [export script](../../demos/common/export_models/README.md) to prepare the models directory structure for serving.

Check [tested models](https://github.com/openvinotoolkit/openvino.genai/blob/master/tests/python_tests/models/real_models).

@@ -190,8 +195,6 @@ When default template is loaded, servable accepts `/chat/completions` calls when

Errors during configuration files processing (access issue, corrupted file, incorrect content) result in servable loading failure.

-
-
## Limitations

There are several known limitations which are expected to be addressed in the coming releases:
@@ -201,7 +204,19 @@ There are several known limitations which are expected to be addressed in the co
- `logprobs` parameter is not supported currently in streaming mode. It includes only a single logprob and do not include values for input tokens.
- Server logs might sporadically include a message "PCRE2 substitution failed with error code -55" - this message can be safely ignored. It will be removed in next version.

+Some servable types introduce additional limitations:
+
+### Stateful servable limitations
+- `finish_reason` not supported (always set to `stop`),
+- `logprobs` not supported,
+- sequential request processing (only one request is handled at a time)
+
+### Visual Language servable limitations
+- works only on the `/chat/completions` endpoint,
+- `image_url` input supports only base64 encoded images, not an actual URL
+
## References
- [Chat Completions API](../model_server_rest_api_chat.md)
- [Completions API](../model_server_rest_api_completions.md)
-- [Demo](../../demos/continuous_batching/README.md)
+- Demos on [CPU/GPU](../../demos/continuous_batching/README.md) and [NPU](../../demos/llm_npu/README.md)
+- VLM demos on [CPU/GPU](../../demos/continuous_batching/vlm/README.md) and [NPU](../../demos/vlm_npu/README.md)
