
OpenVINO Model Server 2025.4.1


@atobiszei released this 18 Dec 12:55 · 2 commits to releases/2025/4 since this release · 0385f29

2025.4.1 is a minor release with bug fixes and improvements based on OpenVINO 2025.4.1.

Preview:

Added preview support for the GPT-OSS agentic use case.
As of 2025.4.1, the best accuracy is achieved with:

  • --pipeline_type LM (without continuous batching and concurrency)
  • --target_device GPU (this configuration was validated on Lunar Lake, Arrow Lake-H, and Intel Arc Battlemage dGPU with >=16 GB VRAM)
  • INT4 model precision (required)
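
The settings above can be sketched as a launch command. This is a hedged example: the binary name, model path, model name, and port are assumptions for illustration; only the --pipeline_type and --target_device flags and the INT4 requirement come from the notes.

```python
import subprocess

# Hypothetical launch arguments for the GPT-OSS preview configuration.
# Only --pipeline_type LM and --target_device GPU are taken from the
# release notes; the other values are placeholders for your deployment.
cmd = [
    "ovms",
    "--model_path", "/models/gpt-oss-int4",  # INT4 weights, as required
    "--model_name", "gpt-oss",
    "--rest_port", "8000",
    "--pipeline_type", "LM",       # no continuous batching / concurrency
    "--target_device", "GPU",
]

# Uncomment to actually start the server in your environment:
# subprocess.run(cmd)
print(" ".join(cmd))
```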

Bug fixes:

  • Fixed escaping of whitespace characters in string arguments in the qwen3coder tool-call parser.
  • Changed handling of streaming requests with usage tracking on the chat/completions endpoint for LLM pipelines without continuous batching. Such pipelines do not track generated tokens; previously the last chunk was not delivered to the client, which could leave a token missing from the response. The last chunk is now delivered with token usage set to 0, which clients should ignore.
  • Minor documentation and demo fixes
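
The streaming fix above can be illustrated from the client side. This is a minimal sketch: the chunk payloads are illustrative stand-ins rather than captured server output, and the key point is that the final chunk's all-zero usage report should be ignored rather than recorded.

```python
import json

# Illustrative SSE lines from a streaming chat/completions response of a
# pipeline without continuous batching: the last content chunk carries a
# "usage" object with all counts set to 0, which the client should ignore.
sse_lines = [
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": " world"}}], '
    '"usage": {"completion_tokens": 0, "prompt_tokens": 0, "total_tokens": 0}}',
    'data: [DONE]',
]

def collect_text(lines):
    """Concatenate streamed delta text, skipping all-zero usage reports."""
    text = []
    for line in lines:
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        usage = chunk.get("usage")
        if usage and any(usage.values()):
            pass  # a real (non-zero) token count would be recorded here
        for choice in chunk.get("choices", []):
            text.append(choice.get("delta", {}).get("content", ""))
    return "".join(text)

print(collect_text(sse_lines))  # -> Hello world
```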