Starting the server with GPU acceleration requires installation of the runtime drivers and the `ocl-icd-libopencl1` package, as described in the [configuration guide](https://docs.openvino.ai/2025/get-started/install-openvino/configurations/configurations-intel-gpu.html).
Start the model server with GPU acceleration using the command:
```bat
ovms --model_path model --model_name resnet --port 9000 --target_device GPU
```
`docs/llm/reference.md`
## Overview
With the rapid development of generative AI, new techniques and algorithms for performance optimization and better resource utilization are introduced to make the best use of the hardware and provide the best generation performance. OpenVINO implements those state-of-the-art methods in its [GenAI Library](https://github.com/openvinotoolkit/openvino.genai), like:
- Continuous Batching
- Paged Attention
- Dynamic Split Fuse
Check out the [quickstart guide](quickstart.md) for a simple example that shows how to use this feature.
## Servable Types
Starting with the 2025.1 release, we can distinguish four servable types. The distinction is based on the input type and the underlying GenAI pipeline.
The servable types are:
- Language Model Continuous Batching,
- Language Model Stateful,
- Visual Language Model Continuous Batching,
- Visual Language Model Stateful.
The first part - Language Model / Visual Language Model - determines whether the servable accepts only text or both text and images on input.
The second part - Continuous Batching / Stateful - determines what kind of GenAI pipeline is used as the engine. By default, CPU and GPU devices work with Continuous Batching pipelines. The NPU device works only with the Stateful servable type.
The user does not have to explicitly select the servable type. It is inferred from the model directory contents and the selected target device.
The model directory contents also determine whether the model works with text only or with visual input as well. As for the target device, setting it to `NPU` always picks the Stateful servable, while any other device results in deploying a Continuous Batching servable.
Stateful servables ignore most of the configuration used by Continuous Batching, as described later in this document. Some servable types have additional restrictions listed in the limitations section at the end of this document.
Despite all the differences, all servable types share the same LLM calculator, which imposes a common flow in every GenAI-based endpoint.
## LLM Calculator
As you can see in the quickstart, a big part of the configuration resides in the `graph.pbtxt` file. That's because model server text generation servables are deployed as MediaPipe graphs with a dedicated LLM calculator that works with the latest [OpenVINO GenAI](https://github.com/openvinotoolkit/openvino.genai/tree/releases/2025/1/src/cpp/include/openvino/genai) library. The calculator is designed to run in cycles and return chunks of responses to the client.
On the input, it expects an `HttpPayload` struct passed by the Model Server frontend:
```cpp
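// Approximate shape of the payload struct (a hedged sketch based on the
// surrounding description; exact field names and types may differ between
// model server versions):
struct HttpPayload {
    std::string uri;                                           // request path
    std::vector<std::pair<std::string, std::string>> headers;  // HTTP headers
    std::string body;                                          // raw JSON request body
    rapidjson::Document* parsedJson;                           // pre-parsed body (null on parsing errors)
};
```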
The `dynamic_split_fuse` [algorithm](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen#b-dynamic-splitfuse-) is enabled by default to boost throughput by splitting tokens into even chunks. In some conditions, such as very low concurrency or very short prompts, it might be beneficial to disable this algorithm. When it is disabled, the `max_num_batched_tokens` parameter should also be set to match the model's maximum context length.
`plugin_config` accepts a JSON dictionary of tuning parameters for the OpenVINO plugin. It can tune the behavior of the inference runtime. For example, you can enable KV cache compression or set the quantization group size: `{"KV_CACHE_PRECISION": "u8", "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32"}`.
**Important for NPU users**: the NPU plugin sets a limit on the prompt length (1024 tokens by default) that can be modified by setting `MAX_PROMPT_LEN` in `plugin_config`. For example, to double that limit, set `{"MAX_PROMPT_LEN": 2048}`.
The LLM calculator config can also restrict the range of sampling parameters in client requests. If needed, change the default value of `best_of_limit` or set `max_tokens_limit`. This is meant to avoid memory overconsumption caused by invalid requests.
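For illustration, these limits apply to fields such as `max_tokens` and `best_of` in the client request. A hedged sketch of such a request is shown below; the endpoint path, port, and model name are assumptions, not taken from this document:

```console
curl http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "my_model",
        "max_tokens": 512,
        "best_of": 2,
        "messages": [{"role": "user", "content": "Write a haiku about GPUs."}]
      }'
```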
**Note that the following options are ignored in Stateful servables (so in deployments on NPU): `cache_size`, `dynamic_split_fuse`, `max_num_batched_tokens`, `max_num_seq`, `enable_prefix_caching`**
## Canceling the generation
The main model as well as the tokenizer and detokenizer are loaded from `.xml` and `.bin` files, and all of them are required. `tokenizer_config.json` and `template.jinja` are loaded to read the information required for chat template processing.
Additionally, Visual Language Models have encoder and decoder models for text and vision, and potentially other auxiliary models.
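Put together, a text generation model directory typically looks roughly like the listing below (a hedged sketch; the path is hypothetical and the exact file names depend on the export tooling):

```console
tree models/my_llm
models/my_llm
├── graph.pbtxt
├── openvino_model.xml
├── openvino_model.bin
├── openvino_tokenizer.xml
├── openvino_tokenizer.bin
├── openvino_detokenizer.xml
├── openvino_detokenizer.bin
├── tokenizer_config.json
└── template.jinja
```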
This model directory can be created based on models from the Hugging Face Hub or from a PyTorch model stored on the local filesystem. Exporting the models to Intermediate Representation format is a one-time operation and can speed up loading time and reduce storage volume when combined with quantization and compression.
We recommend using the [export script](../../demos/common/export_models/README.md) to prepare the models directory structure for serving.
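A hedged usage sketch is shown below; the script options and the model name are assumptions, so check the export script README for the exact interface:

```console
pip3 install -r demos/common/export_models/requirements.txt
python demos/common/export_models/export_model.py text_generation \
    --source_model meta-llama/Llama-3.2-1B-Instruct \
    --weight-format int4 \
    --config_file_path models/config.json \
    --model_repository_path models
```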
Errors during configuration file processing (access issues, corrupted files, incorrect content) result in servable loading failure.
## Limitations
There are several known limitations which are expected to be addressed in the coming releases:
- `logprobs` parameter is currently not supported in streaming mode. It includes only a single logprob and does not include values for input tokens.
- Server logs might sporadically include the message "PCRE2 substitution failed with error code -55" - this message can be safely ignored. It will be removed in the next version.
Some servable types introduce additional limitations:
### Stateful servable limitations
- `finish_reason` is not supported (always set to `stop`),
- `logprobs` is not supported,
- sequential request processing (only one request is handled at a time).
### Visual Language servable limitations
- works only with the `/chat/completions` endpoint,
- `image_url` input supports only base64-encoded images, not actual URLs (see the hedged request sketch below).
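For example, an image can be passed as a base64-encoded data URL inside the message content. A hedged sketch follows; the endpoint path, port, and model name are assumptions, not taken from this document:

```console
# Encode the image and embed it as a data URL in the request
IMG_B64=$(base64 -w0 image.jpg)
curl http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "my_vlm",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,'"$IMG_B64"'"}}
          ]
        }]
      }'
```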