The VLM OpenVINO serving microservice enables support for VLM models that are not yet supported in OpenVINO model serving. This section provides step-by-step instructions to:
- Set up the microservice using a pre-built Docker image for quick deployment.
- Run predefined tasks to explore its functionality.
- Learn how to modify basic configurations to suit specific requirements.
Before you begin, ensure the following:
- System Requirements: Verify that your system meets the minimum requirements.
- Docker Installed: Install Docker. For installation instructions, see Get Docker.
This guide assumes basic familiarity with Docker commands and terminal usage. If you are new to Docker, see Docker Documentation for an introduction.
First, set the required `VLM_MODEL_NAME` environment variable:

```bash
export VLM_MODEL_NAME=Qwen/Qwen2.5-VL-3B-Instruct
```

Refer to the model list for the supported models that can be used.

NOTE: You can change the model name, model compression format, device, and the number of Uvicorn workers by editing the `setup.sh` file.
The VLM OpenVINO Serving microservice supports many optional environment variables for customizing behavior, performance, and logging. For complete details on all available environment variables, including examples and advanced configurations, see the Environment Variables Guide.
Quick Configuration Examples:

```bash
# Basic CPU setup (default)
export VLM_MODEL_NAME=Qwen/Qwen2.5-VL-3B-Instruct

# GPU acceleration
export VLM_MODEL_NAME=Qwen/Qwen2.5-VL-3B-Instruct
export VLM_DEVICE=GPU

# Performance optimization
export VLM_MODEL_NAME=Qwen/Qwen2.5-VL-3B-Instruct
export OV_CONFIG='{"PERFORMANCE_HINT": "THROUGHPUT"}'

# Production setup with clean logging
export VLM_MODEL_NAME=Qwen/Qwen2.5-VL-3B-Instruct
export VLM_LOG_LEVEL=warning
export VLM_ACCESS_LOG_FILE="/dev/null"
```

Key Environment Variables:
- `VLM_DEVICE`: Set to `CPU` (default) or `GPU` for device selection
- `OV_CONFIG`: JSON string for OpenVINO performance tuning
- `VLM_LOG_LEVEL`: Control logging verbosity (`debug`, `info`, `warning`, `error`)
- `VLM_MAX_COMPLETION_TOKENS`: Limit response length
- `HUGGINGFACE_TOKEN`: Required for gated models
- `VLM_TELEMETRY_PATH` / `VLM_TELEMETRY_MAX_RECORDS`: Configure where `/v1/telemetry` data is stored and how many records are retained
For detailed information about each variable, configuration examples, and advanced setups, refer to the Environment Variables Guide.
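Because `OV_CONFIG` must be a JSON object, a stray quote or comma in the exported string is an easy mistake that only surfaces at service startup. A minimal sketch of checking the value before exporting it; the `validate_ov_config` helper is ours for illustration, not part of the service:

```python
import json

def validate_ov_config(raw: str) -> dict:
    """Parse an OV_CONFIG-style JSON string, failing loudly if it is malformed.

    Illustrative helper only: the service performs its own parsing internally.
    """
    cfg = json.loads(raw)  # raises json.JSONDecodeError on malformed input
    if not isinstance(cfg, dict):
        raise ValueError("OV_CONFIG must be a JSON object, "
                         "e.g. '{\"PERFORMANCE_HINT\": \"THROUGHPUT\"}'")
    return cfg

print(validate_ov_config('{"PERFORMANCE_HINT": "THROUGHPUT"}'))
```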
Set the environment with default values by running the following script:

```bash
source setup.sh
```

NOTE: For a complete reference of all environment variables, their descriptions, and usage examples, see the Environment Variables Guide.
You can either build the Docker images or use the prebuilt images, as documented below.
Configure the registry: The VLM OpenVINO Serving microservice uses a registry URL and tag to pull the required image.

```bash
export REGISTRY_URL=intel
export TAG=latest
```

To run the server using Docker Compose, use the following command:

```bash
docker compose -f docker/compose.yaml up -d
```

To run the server with GPU acceleration, follow these steps:
Configure your GPU device using the instructions in the Device Configuration section of the Environment Variables Guide. For GPU setup:

```bash
# For single GPU or automatic GPU selection
export VLM_DEVICE=GPU

# For a specific GPU device (if multiple GPUs are available)
export VLM_DEVICE=GPU.0  # Use first GPU
export VLM_DEVICE=GPU.1  # Use second GPU

source setup.sh
```

Note: When `VLM_DEVICE=GPU` is set, the setup script automatically optimizes settings for GPU performance (changes the compression format to `int4` and sets the number of workers to 1).

Start the server:

```bash
docker compose -f docker/compose.yaml up -d
```

After starting the service, verify your GPU setup:
```bash
# Check service health
curl --location --request GET 'http://localhost:9764/health'

# Check available devices and current configuration
curl --location --request GET 'http://localhost:9764/device'
```

Note: For detailed GPU configuration options, device discovery, and performance tuning recommendations, refer to the Device Configuration section in the Environment Variables Guide.
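The first startup can be slow because the model is downloaded and converted, so the health check may fail for a while before the service is ready. A small polling sketch; the `wait_until_ready` helper and its injected `probe` callable are ours (in real use, `probe` would issue an HTTP GET against the `/health` endpoint shown above):

```python
import time

def wait_until_ready(probe, timeout_s: float = 300.0, interval_s: float = 1.0) -> bool:
    """Poll `probe` (a zero-arg callable returning True once /health succeeds)
    until it reports ready or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False

# Real use could pass, e.g.:
#   lambda: urllib.request.urlopen("http://localhost:9764/health").status == 200
print(wait_until_ready(lambda: True, timeout_s=5))
```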
To stop and remove the Docker containers, use the following command:

```bash
docker compose -f docker/compose.yaml down
```

To send a chat completion request with an image URL:

```bash
curl --location 'http://localhost:9764/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "Qwen/Qwen2.5-VL-3B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe the activities and events captured in the image. Provide a detailed description of what is happening. While referring to an object or person or entity, identify them as uniquely as possible such that it can be tracked in future. Keep attention to detail, but avoid speculation or unnecessary attribution of details."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/d5fbbd1a-d484-415c-88cb-9986625b7b11"
                    }
                }
            ]
        }
    ],
    "max_completion_tokens": 500,
    "temperature": 0.1,
    "top_p": 0.3,
    "frequency_penalty": 1
}'
```

To send an image as base64 data:

```bash
curl --location 'http://localhost:9764/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "Qwen/Qwen2.5-VL-3B-Instruct",
    "max_completion_tokens": 100,
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this image."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "data:image/jpeg;base64,<base64 image value>"
                    }
                }
            ]
        }
    ]
}'
```

To send multiple images in a single request:

```bash
curl --location 'http://localhost:9764/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "Qwen/Qwen2.5-VL-3B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe these images. Generate the output in json format as {image_1:Description1, image_2:Description2}"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/d5fbbd1a-d484-415c-88cb-9986625b7b11"
                    }
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/d5fbbd1a-d484-415c-88cb-9986625b7b11"
                    }
                }
            ]
        }
    ],
    "max_completion_tokens": 200
}'
```
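The base64 variant above expects a `data:` URL in the `image_url` field. A sketch of building one from raw JPEG bytes; the `jpeg_bytes_to_data_url` helper name is ours, not part of the service:

```python
import base64

def jpeg_bytes_to_data_url(data: bytes) -> str:
    """Encode raw JPEG bytes as a data: URL suitable for the image_url field."""
    b64 = base64.b64encode(data).decode("ascii")
    return f"data:image/jpeg;base64,{b64}"

# For a file on disk:
#   jpeg_bytes_to_data_url(open("image.jpg", "rb").read())
print(jpeg_bytes_to_data_url(b"\xff\xd8\xff\xe0")[:30])
```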
- Method 1 (using a curl call):

  ```bash
  curl --location 'http://localhost:9764/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
      "model": "Qwen/Qwen2.5-VL-3B-Instruct",
      "messages": [
          {
              "role": "user",
              "content": "Describe this video and remember this number: 4245"
          },
          {
              "role": "assistant",
              "content": "The video appears to be taken at night, as indicated by the darkness and artificial lighting. The timestamp on the video suggests it was recorded early in the morning on August 25, 2024, in the Eastern Time Zone (ET). The camera is labeled indicates that it is a body-worn camera used by law enforcement.\n\nThe scene shows a sidewalk bordered by a metal fence on both sides. There are trees lining the sidewalk, and some people can be seen walking in the distance. In the background, there are parked cars and what appears to be a building with illuminated windows. The overall atmosphere seems calm, with no immediate signs of distress or urgency.\n\nRemember the number: 4245"
          },
          {
              "role": "user",
              "content": "What is the number ?"
          }
      ],
      "max_completion_tokens": 1000
  }'
  ```
- Method 2 (using the openai Python client):

  ```python
  from openai import OpenAI

  client = OpenAI(
      base_url="http://localhost:9764/v1",
      api_key="EMPTY",
  )

  # Define the conversation history
  messages = [
      {
          "role": "user",
          "content": "Describe this video and remember this number: 4245"
      },
      {
          "role": "assistant",
          "content": "The video appears to be taken at night, as indicated by the darkness and artificial lighting. The timestamp on the video suggests it was recorded early in the morning on August 25, 2024, in the Eastern Time Zone (ET). The camera is labeled indicates that it is a body-worn camera used by law enforcement.\n\nThe scene shows a sidewalk bordered by a metal fence on both sides. There are trees lining the sidewalk, and some people can be seen walking in the distance. In the background, there are parked cars and what appears to be a building with illuminated windows. The overall atmosphere seems calm, with no immediate signs of distress or urgency.\n\nRemember the number: 4245"
      },
      {
          "role": "user",
          "content": "What did I ask you to do? What is the number?"
      }
  ]

  # Send the request to the model
  response = client.chat.completions.create(
      model="Qwen/Qwen2.5-VL-3B-Instruct",
      messages=messages,
      max_completion_tokens=1000,
  )

  # Print the model's response
  print(response.choices[0].message.content)
  ```
NOTE: `video_url` type input is only supported with the `Qwen/Qwen2.5-VL-3B-Instruct`, `Qwen/Qwen2.5-VL-7B-Instruct`, or `Qwen/Qwen2-VL-2B-Instruct` models. Other models will accept input of `video` type, but internally they will process it as multi-image input only.
```bash
curl --location 'http://localhost:9764/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "Qwen/Qwen2.5-VL-3B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Consider these images as frames of a single video. Describe this video and the sequence of events."
                },
                {
                    "type": "video",
                    "video": [
                        "http://localhost:8080/chunk_6_frame_3.jpeg",
                        "http://localhost:8080/chunk_6_frame_4.jpeg"
                    ]
                }
            ]
        }
    ],
    "max_completion_tokens": 1000
}'
```

NOTE: `video_url` type input is only supported with the `Qwen/Qwen2.5-VL-3B-Instruct`, `Qwen/Qwen2.5-VL-7B-Instruct`, or `Qwen/Qwen2-VL-2B-Instruct` models.

NOTE: `max_pixels` and `fps` are optional parameters.
```bash
curl --location 'http://localhost:9764/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "Qwen/Qwen2.5-VL-3B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this video"
                },
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "http://localhost:8080/original-1sec.mp4"
                    },
                    "max_pixels": "360*420",
                    "fps": 1
                }
            ]
        }
    ],
    "max_completion_tokens": 1000,
    "stream": true
}'
```

NOTE: `video_url` type input is only supported with the `Qwen/Qwen2.5-VL-3B-Instruct`, `Qwen/Qwen2.5-VL-7B-Instruct`, or `Qwen/Qwen2-VL-2B-Instruct` models.

NOTE: `max_pixels` and `fps` are optional parameters.
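With `"stream": true`, as in the request above, the response arrives as a sequence of chunks; with the openai Python client, each chunk carries a `choices[0].delta.content` fragment that must be concatenated to recover the full answer. A sketch of that accumulation, using stand-in chunk objects so it runs without a live server:

```python
from types import SimpleNamespace

def collect_stream(chunks) -> str:
    """Concatenate delta.content fragments from a streamed chat completion."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta
        if delta.content:  # the final chunk may carry no content
            parts.append(delta.content)
    return "".join(parts)

def fake_chunk(text):
    """Stand-in for a streamed chunk; mirrors the attribute shape only."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])

# With the real client, pass the iterator returned by
# client.chat.completions.create(..., stream=True) instead.
print(collect_stream([fake_chunk("Hel"), fake_chunk("lo"), fake_chunk(None)]))  # -> Hello
```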
```bash
# Encode video to base64 (ensure you have a video file named 'test.mp4')
export VIDEO_B64=$(base64 -w 0 test.mp4)

# Create JSON payload
cat <<EOF > payload.json
{
    "model": "Qwen/Qwen2.5-VL-3B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this video"
                },
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "data:video/mp4;base64,$VIDEO_B64"
                    }
                }
            ]
        }
    ],
    "max_completion_tokens": 1000,
    "stream": true
}
EOF

# Send request
curl --location 'http://localhost:9764/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data @payload.json
```

The microservice exposes `/v1/telemetry` to inspect the most recent (up to 100) inference requests. Each entry contains high-level request parameters, media counts, usage metrics, and performance telemetry captured from the model backend.
Tip: Use `VLM_TELEMETRY_PATH` to move the JSONL file to a different mount (for persistent storage or easier scraping) and `VLM_TELEMETRY_MAX_RECORDS` to adjust how many records are kept.
Default: The endpoint returns up to 100 entries when no `limit` value is provided.
Note: Telemetry metrics are available for all models that execute inference through the `openvino_genai` pipeline. The only exception today is `HuggingFaceTB/SmolVLM2-2.2B-Instruct`, which relies on `OVModelForVisualCausalLM` from `optimum-intel` and therefore does not emit `openvino_genai` PerfMetrics.
```bash
curl --location 'http://localhost:9764/v1/telemetry?limit=5'
```

The response follows the `TelemetryListResponse` schema:

- `count`: number of items returned (newest first)
- `items[]`: individual telemetry records with `id`, `timestamp`, `status`, `request.parameters`, `request.media`, `usage`, and `telemetry`
Use this endpoint to verify request history across multiple workers or to collect quick performance snapshots without accessing container logs.
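For a quick snapshot, the listed fields can be aggregated client-side. A sketch over a hand-built sample payload shaped like the schema above; the `summarize_telemetry` helper and the sample values are ours for illustration:

```python
def summarize_telemetry(payload: dict) -> dict:
    """Count records by status from a TelemetryListResponse-shaped payload."""
    by_status = {}
    for item in payload.get("items", []):
        by_status[item["status"]] = by_status.get(item["status"], 0) + 1
    return {"count": payload.get("count", 0), "by_status": by_status}

# Sample payload; in real use, fetch http://localhost:9764/v1/telemetry instead.
sample = {
    "count": 3,
    "items": [
        {"id": "a", "status": "success"},
        {"id": "b", "status": "success"},
        {"id": "c", "status": "error"},
    ],
}
print(summarize_telemetry(sample))
```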
To get the list of available devices:

```bash
curl --location --request GET 'http://localhost:9764/device'
```

To get specific device details:

```bash
curl --location --request GET 'http://localhost:9764/device/CPU' \
--header 'Content-Type: application/json'
```

To ensure the functionality of the microservice and measure test coverage, follow these steps:
- Install Dependencies

  Install the required dependencies, including development dependencies, using:

  ```bash
  poetry install --with test
  ```

- Run Tests with Poetry

  Use the following command to run all tests:

  ```bash
  poetry run pytest
  ```

- Run Tests with Coverage

  To collect coverage data while running tests, use:

  ```bash
  poetry run coverage run --source=src -m pytest
  ```

- Generate Coverage Report

  After running the tests, generate a coverage report:

  ```bash
  poetry run coverage report -m
  ```

- Generate HTML Coverage Report (Optional)

  For a detailed view, generate an HTML report:

  ```bash
  poetry run coverage html
  ```

  Open the `htmlcov/index.html` file in your browser to view the report.
These steps will help you verify the functionality of the microservice and ensure adequate test coverage.
- Docker Container Fails to Start:

  - Run `docker logs {{container-name}}` to identify the issue.
  - Check if the required port is available.

- Cannot Access the Microservice:

  - Confirm the container is running:

    ```bash
    docker ps
    ```
-