From a3c9fbe545d2b01bcf6c937f98ee951f4a7ecf44 Mon Sep 17 00:00:00 2001
From: Dariusz Trawinski
Date: Sun, 10 Aug 2025 01:44:17 +0200
Subject: [PATCH 01/16] minor fixes in demos and docs

---
 demos/code_local_assistant/README.md          | 17 +++--
 demos/common/export_models/README.md          | 18 +++++-
 demos/common/export_models/export_model.py    |  7 ++-
 demos/continuous_batching/accuracy/README.md  |  2 +-
 .../accuracy/{gorila.patch => gorilla.patch}  |  0
 .../continuous_batching/agentic_ai/README.md  | 18 +++++-
 .../agentic_ai/openai_agent.py                |  2 +-
 demos/continuous_batching/scaling/README.md   | 13 ++--
 .../speculative_decoding/README.md            | 63 ++++++++++---------
 docs/deploying_server_baremetal.md            | 16 ++---
 10 files changed, 97 insertions(+), 59 deletions(-)
 rename demos/continuous_batching/accuracy/{gorila.patch => gorilla.patch} (100%)

diff --git a/demos/code_local_assistant/README.md b/demos/code_local_assistant/README.md
index 15cbc0a183..2ebb9068e8 100644
--- a/demos/code_local_assistant/README.md
+++ b/demos/code_local_assistant/README.md
@@ -9,8 +9,7 @@ With the rise of AI PC capabilities, hosting own Visual Studio code assistant is
 - Intel Meteor Lake, Lunar Lake, Arrow Lake or newer Intel CPU.
 
 ## Prepare Code Chat/Edit Model
-We need to use medium size model in order to keep 50ms/word for human to feel the chat responsive.
-This will work in streaming mode, meaning we will see the chat response/code diff generation slowly roll out in real-time.
+We need to use a medium size model to get reliable responses but also to fit into the available memory on the host or discrete GPU.
 
 Download export script, install its dependencies and create directory for the models:
 ```console
@@ -22,10 +21,10 @@ mkdir models
 
 Export `codellama/CodeLlama-7b-Instruct-hf`:
 ```console
-python export_model.py text_generation --source_model codellama/CodeLlama-7b-Instruct-hf --weight-format int4 --config_file_path models/config_all.json --model_repository_path models --target_device NPU --overwrite_models
+python export_model.py text_generation --source_model codellama/CodeLlama-7b-Instruct-hf --weight-format int4 --config_file_path models/config_all.json --model_repository_path models --target_device GPU --overwrite_models
 ```
 
-> **Note:** Use `--target_device GPU` for Intel GPU or omit this parameter to run on Intel CPU
+> **Note:** Use `--target_device NPU` for Intel NPU or omit this parameter to run on Intel CPU
 
 ## Prepare Code Completion Model
 For this task we need smaller, lighter model that will produce code quicker than chat task.
@@ -104,10 +103,16 @@ Please refer to OpenVINO Model Server installation first: [link](../../docs/depl ovms --rest_port 8000 --config_path ./models/config_all.json ``` -### Linux: via Docker +### Linux: via Docker with GPU +```bash +docker run -d --rm --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -u $(id -u):$(id -g) \ + -p 8000:8000 -v $(pwd)/:/workspace/ openvino/model_server:latest-gpu --rest_port 8000 --config_path /workspace/models/config_all.json +``` + +### Linux: via Docker with NPU ```bash docker run -d --rm --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -u $(id -u):$(id -g) \ - -p 8000:8000 -v $(pwd)/:/workspace/ openvino/model_server:2025.2 --rest_port 8000 --config_path /workspace/models/config_all.json + -p 8000:8000 -v $(pwd)/:/workspace/ openvino/model_server:latest-gpu --rest_port 8000 --config_path /workspace/models/config_all.json ``` ## Set Up Visual Studio Code diff --git a/demos/common/export_models/README.md b/demos/common/export_models/README.md index 3ff84026dd..8927c65496 100644 --- a/demos/common/export_models/README.md +++ b/demos/common/export_models/README.md @@ -111,19 +111,31 @@ Text generation for NPU target device. Command below sets max allowed prompt siz ```console python export_model.py text_generation --source_model meta-llama/Llama-3.2-3B-Instruct --config_file_path models/config_all.json --model_repository_path models --target_device NPU --max_prompt_len 2048 --ov_cache_dir ./models/.ov_cache ``` +> **Note:** Some models like `mistralai/Mistral-7B-Instruct-v0.3` might fail to export because the task can't be determined automatically. In such situation it can be set in `--extra_quantization_parameters`. For example: +```console +python export_model.py text_generation --source_model mistralai/Mistral-7B-Instruct-v0.3 --model_repository_path models --extra_quantization_params "--task text-generation-with-past" +``` +> **Note:** Model `microsoft/Phi-3.5-vision-instruct` requires one manual adjustments ofter export in the file `generation_config.json` like in the [PR](https://huggingface.co/microsoft/Phi-3.5-vision-instruct/discussions/40/files). +It will ensure, the generation stops after eos token. ### Embedding Models #### Embeddings with deployment on a single CPU host: ```console -python export_model.py embeddings --source_model Alibaba-NLP/gte-large-en-v1.5 --weight-format int8 --config_file_path models/config_all.json +python export_model.py embeddings_ov --source_model Alibaba-NLP/gte-large-en-v1.5 --weight-format int8 --config_file_path models/config_all.json ``` #### Embeddings with deployment on a dual CPU host: ```console -python export_model.py embeddings --source_model Alibaba-NLP/gte-large-en-v1.5 --weight-format int8 --config_file_path models/config_all.json --num_streams 2 +python export_model.py embeddings_ov --source_model Alibaba-NLP/gte-large-en-v1.5 --weight-format int8 --config_file_path models/config_all.json --num_streams 2 +``` + +#### Embeddings with pooling parameter +```console +python export_model.py embeddings_ov --source_model Qwen/Qwen3-Embedding-0.6B --weight-format fp16 --config_file_path models/config_all.json ``` + #### With Input Truncation By default, embeddings endpoint returns an error when the input exceed the maximum model context length. It is possible to change the behavior to truncate prompts automatically to fit the model. Add `--truncate` option in the export command. 
@@ -138,7 +150,7 @@ python export_model.py embeddings \
 
 ### Reranking Models
 ```console
-python export_model.py rerank \
+python export_model.py rerank_ov \
     --source_model BAAI/bge-reranker-large \
     --weight-format int8 \
     --config_file_path models/config_all.json \
diff --git a/demos/common/export_models/export_model.py b/demos/common/export_models/export_model.py
index e892bd6a91..a4862ad01b 100644
--- a/demos/common/export_models/export_model.py
+++ b/demos/common/export_models/export_model.py
@@ -401,7 +401,12 @@ def export_text_generation_model(model_repository_path, source_model, model_name
         task_parameters['extra_quantization_params'] = "--sym --ratio 1.0 --group-size -1"
     optimum_command = "optimum-cli export openvino --model {} --weight-format {} {} --trust-remote-code {}".format(source_model, precision, task_parameters['extra_quantization_params'], llm_model_path)
     if os.system(optimum_command):
-        raise ValueError("Failed to export llm model", source_model)
+        raise ValueError("Failed to export llm model", source_model)
+    if not (os.path.isfile(os.path.join(llm_model_path, 'openvino_detokenizer.xml'))):
+        print("Tokenizer and detokenizer not found in the exported model. Exporting tokenizer and detokenizer from HF model")
+        convert_tokenizer_command = "convert_tokenizer --with-detokenizer -o {} {}".format(llm_model_path, source_model)
+        if os.system(convert_tokenizer_command):
+            raise ValueError("Failed to export tokenizer and detokenizer", source_model)
     ### Export draft model for speculative decoding
     draft_source_model = task_parameters.get("draft_source_model", None)
     draft_model_dir_name = None
diff --git a/demos/continuous_batching/accuracy/README.md b/demos/continuous_batching/accuracy/README.md
index 74865a3cbe..89e06912b5 100644
--- a/demos/continuous_batching/accuracy/README.md
+++ b/demos/continuous_batching/accuracy/README.md
@@ -113,7 +113,7 @@ Use [Berkeley function call leaderboard ](https://github.com/ShishirPatil/gorill
 git clone https://github.com/ShishirPatil/gorilla
 cd gorilla/berkeley-function-call-leaderboard
 git checkout ac37049f00022af54cc44b6aa0cad4402c22d1a0
-curl -s https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/agent-accuracy/demos/continuous_batching/accuracy/gorila.patch | git apply -v
+curl -s https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/continuous_batching/accuracy/gorilla.patch | git apply -v
 pip install -e .
 ```
 The commands below assumes the models is deployed with the name `openvino-qwen3-8b-int8`. It must match the name set in the `bfcl_eval/constants/model_config.py`.
diff --git a/demos/continuous_batching/accuracy/gorila.patch b/demos/continuous_batching/accuracy/gorilla.patch
similarity index 100%
rename from demos/continuous_batching/accuracy/gorila.patch
rename to demos/continuous_batching/accuracy/gorilla.patch
diff --git a/demos/continuous_batching/agentic_ai/README.md b/demos/continuous_batching/agentic_ai/README.md
index 8a2816905b..0c888fca36 100644
--- a/demos/continuous_batching/agentic_ai/README.md
+++ b/demos/continuous_batching/agentic_ai/README.md
@@ -1,5 +1,7 @@
 # Agentic AI with OpenVINO Model Server {#ovms_demos_continuous_batching_agent}
 
+This demo version requires OVMS version 2025.3. Build it from [source](../../../docs/build_from_source.md) until it is published.
+
 OpenVINO Model Server can be used to serve language models for AI Agents. It supports the usage of tools in the context of content generation.
It can be integrated with MCP servers and AI agent frameworks. You can learn more about [tools calling based on OpenAI API](https://platform.openai.com/docs/guides/function-calling?api-mode=responses) @@ -10,10 +12,14 @@ Here are presented required steps to deploy language models trained for tools su The application employing OpenAI agent SDK is using MCP server. It is equipped with a set of tools to providing context for the content generation. The tools can also be used for automation purposes based on input in text format. + + ## Export LLM model Currently supported models: - Qwen/Qwen3-8B +- Qwen/Qwen3-4B - meta-llama/Llama-3.1-8B-Instruct +- meta-llama/Llama-3.2-3B-Instruct - NousResearch/Hermes-3-Llama-3.1-8B - microsoft/Phi-4-mini-instruct @@ -23,7 +29,7 @@ The model response with tool call follow a specific syntax which is process by a Download export script, install it's dependencies and create directory for the models: ```console curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/export_model.py -o export_model.py -pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt +pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/2/demos/common/export_models/requirements.txt mkdir models ``` Run `export_model.py` script to download and quantize the model: @@ -47,7 +53,13 @@ python export_model.py text_generation --source_model Qwen/Qwen3-8B --weight-for :::: You can use similar commands for different models. Change the source_model and the tools_model_type (note that as of today the following types as available: `[phi4, llama3, qwen3, hermes3]`). -> **Note:** The tuned chat template will be copied to the model folder as template.jinja and the response parser will be set in the graph.pbtxt +> **Note:** Some models give more reliable responses with tunned chat template. Copy custom template to the model folder like below: +``` +curl -L -o models/meta-llama/Llama-3.1-8B-Instruct/template.jinja https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.9.0/examples/tool_chat_template_llama3.1_json.jinja +curl -L -o models/meta-llama/Llama-3.2-3B-Instruct/template.jinja https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.9.0/examples/tool_chat_template_llama3.2_json.jinja +curl -L -o models/NousResearch/Hermes-3-Llama-3.1-8B/template.jinja https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.9.0/examples/tool_chat_template_hermes.jinja +curl -L -o models/microsoft/Phi-4-mini-instruct/template.jinja https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.9.0/examples/tool_chat_template_phi4_mini.jinja +``` ## Start OVMS @@ -74,7 +86,7 @@ In case you want to use GPU device to run the generation, add extra docker param to `docker run` command, use the image with GPU support. Export the models with precision matching the GPU capacity and adjust pipeline configuration. 
It can be applied using the commands below: ```bash -docker run -d --rm -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -v $(pwd)/models:/models:ro openvino/model_server:2025.2-gpu \ +docker run -d --rm -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -v $(pwd)/models:/models:ro openvino/model_server:latest-gpu \ --rest_port 8000 --model_path /models/Qwen/Qwen3-8B --model_name Qwen/Qwen3-8B ``` ::: diff --git a/demos/continuous_batching/agentic_ai/openai_agent.py b/demos/continuous_batching/agentic_ai/openai_agent.py index 436e397ac4..b960bdf72c 100644 --- a/demos/continuous_batching/agentic_ai/openai_agent.py +++ b/demos/continuous_batching/agentic_ai/openai_agent.py @@ -117,7 +117,7 @@ def get_model(self, _) -> Model: agent = Agent( name="Assistant", mcp_servers=[fs_server, weather_server], - model_settings=ModelSettings(tool_choice="auto", temperature=0.0), + model_settings=ModelSettings(tool_choice="auto", temperature=0.0,max_tokens=1000, extra_body={"chat_template_kwargs":{"enable_thinking": False}}), ) loop = asyncio.new_event_loop() loop.run_until_complete(run(args.query, agent, OVMS_MODEL_PROVIDER, args.stream)) diff --git a/demos/continuous_batching/scaling/README.md b/demos/continuous_batching/scaling/README.md index 60139c0632..822342e452 100644 --- a/demos/continuous_batching/scaling/README.md +++ b/demos/continuous_batching/scaling/README.md @@ -137,10 +137,10 @@ python export_model.py text_generation --source_model meta-llama/Meta-Llama-3-8B ``` Start the Model Server instances: ```bash -docker run --device /dev/dri/renderD128 -d --rm -p 8003:8003 -u 0 -v $(pwd)/models/Meta-Llama-3-8B-Instruct_INT4:/model:ro openvino/model_server:latest --rest_port 8003 --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path /model -docker run --device /dev/dri/renderD129 -d --rm -p 8004:8004 -u 0 -v $(pwd)/models/Meta-Llama-3-8B-Instruct_INT4:/model:ro openvino/model_server:latest --rest_port 8004 --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path /model -docker run --device /dev/dri/renderD130 -d --rm -p 8005:8005 -u 0 -v $(pwd)/models/Meta-Llama-3-8B-Instruct_INT4:/model:ro openvino/model_server:latest --rest_port 8005 --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path /model -docker run --device /dev/dri/renderD131 -d --rm -p 8006:8006 -u 0 -v $(pwd)/models/Meta-Llama-3-8B-Instruct_INT4:/model:ro openvino/model_server:latest --rest_port 8006 --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path /model +docker run --device /dev/dri/renderD128 -d --rm -p 8003:8003 -u 0 -v $(pwd)/models/Meta-Llama-3-8B-Instruct_INT4:/model:ro openvino/model_server:latest-gpu --rest_port 8003 --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path /model +docker run --device /dev/dri/renderD129 -d --rm -p 8004:8004 -u 0 -v $(pwd)/models/Meta-Llama-3-8B-Instruct_INT4:/model:ro openvino/model_server:latest-gpu --rest_port 8004 --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path /model +docker run --device /dev/dri/renderD130 -d --rm -p 8005:8005 -u 0 -v $(pwd)/models/Meta-Llama-3-8B-Instruct_INT4:/model:ro openvino/model_server:latest-gpu --rest_port 8005 --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path /model +docker run --device /dev/dri/renderD131 -d --rm -p 8006:8006 -u 0 -v $(pwd)/models/Meta-Llama-3-8B-Instruct_INT4:/model:ro openvino/model_server:latest-gpu --rest_port 8006 --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path /model ``` Confirm in logs 
if the containers loaded the models successfully.
@@ -211,11 +211,12 @@ Continuous batching with Multi GPU configuration will be added soon.
 
 Export the model:
 ```bash
-python export_model.py text_generation --source_model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --model_name DeepSeek-R1-Distill-Qwen-32B_INT4 --weight-format int4 --model_repository_path models --target_device HETERO:GPU.0,GPU.1 --pipeline_type LM
+python export_model.py text_generation --source_model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --model_name DeepSeek-R1-Distill-Qwen-32B_INT4 --weight-format int4 --model_repository_path models --target_device HETERO:GPU.0,GPU.1 --pipeline_type LM_CB
 ```
+> **Note**: Using the pipeline type LM_CB, which includes continuous batching, requires OVMS version 2025.3. Build it from source until it is published.
 
 ```bash
-docker run --device /dev/dri -d --rm -p 8000:8000 -u 0 -v $(pwd)/models/DeepSeek-R1-Distill-Qwen-32B_INT4:/model:ro openvino/model_server:latest --rest_port 8000 --model_name deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --model_path /model
+docker run --device /dev/dri -d --rm -p 8000:8000 -u 0 -v $(pwd)/models/DeepSeek-R1-Distill-Qwen-32B_INT4:/model:ro openvino/model_server:latest-gpu --rest_port 8000 --model_name deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --model_path /model
 ```
 
 ### Testing the scalability
diff --git a/demos/continuous_batching/speculative_decoding/README.md b/demos/continuous_batching/speculative_decoding/README.md
index 38cc00a7c5..09de27c5a8 100644
--- a/demos/continuous_batching/speculative_decoding/README.md
+++ b/demos/continuous_batching/speculative_decoding/README.md
@@ -141,40 +141,43 @@ Models used in this demo - `meta-llama/CodeLlama-7b-hf` and `AMD-Llama-135m` are
 Below you can see an exemplary unary request (you can switch `stream` parameter to enable streamed response). Compared to calls to regular continuous batching model, this request has additional parameter `num_assistant_tokens` which specifies how many tokens should a draft model generate before main model validates them.
-
 ```console
-curl http://localhost:8000/v3/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "meta-llama/CodeLlama-7b-hf",
-    "temperature": 0,
-    "max_tokens":100,
-    "stream":false,
-    "prompt": "def quicksort(numbers):",
-    "num_assistant_tokens": 5
-  }'| jq .
+pip3 install openai
+```
+```python
+from openai import OpenAI
+
+client = OpenAI(
+  base_url="http://localhost:8000/v3",
+  api_key="unused"
+)
+
+stream = client.completions.create(
+    model="meta-llama/CodeLlama-7b-hf",
+    prompt="def quicksort(numbers):",
+    temperature=0,
+    max_tokens=100,
+    extra_body={"num_assistant_tokens": 5},
+    stream=True,
+)
+for chunk in stream:
+    if chunk.choices[0].text is not None:
+        print(chunk.choices[0].text, end="", flush=True)
 ```
 
-```json
-{
-  "choices": [
-    {
-      "finish_reason": "length",
-      "index": 0,
-      "logprobs": null,
-      "text": "\n    if len(numbers) <= 1:\n        return numbers\n    else:\n        pivot = numbers[0]\n        lesser = [x for x in numbers[1:] if x <= pivot]\n        greater = [x for x in numbers[1:] if x > pivot]\n        return quicksort(lesser) + [pivot] + quicksort(greater)\n\n\ndef quicksort_recursive(numbers):\n    if"
-    }
-  ],
-  "created": 1737547359,
-  "model": "meta-llama/CodeLlama-7b-hf-sd",
-  "object": "text_completion",
-  "usage": {
-    "prompt_tokens": 9,
-    "completion_tokens": 100,
-    "total_tokens": 109
-  }
-}
+```
+if len(numbers) <= 1:
+    return numbers
+else:
+    pivot = numbers[0]
+    lesser = [x for x in numbers[1:] if x <= pivot]
+    greater = [x for x in numbers[1:] if x > pivot]
+    return quicksort(lesser) + [pivot] + quicksort(greater)
+
+def quicksort_recursive(numbers):
+    if
+```
+
 High value for `num_assistant_tokens` brings profit when tokens generated by the draft model mostly match the main model. If they don't, tokens are dropped and both models do additional work. For low values such risk is lower, but the potential performance boost is limited. Usually the value of `5` is a good compromise.
 
diff --git a/docs/deploying_server_baremetal.md b/docs/deploying_server_baremetal.md
index 07b3d40606..d8279cd66a 100644
--- a/docs/deploying_server_baremetal.md
+++ b/docs/deploying_server_baremetal.md
@@ -15,12 +15,12 @@ You can download model server package in two configurations.
One with Python sup :sync: ubuntu-22-04 Download precompiled package (without python): ```{code} sh -wget https://github.com/openvinotoolkit/model_server/releases/download/v2025.2/ovms_ubuntu22.tar.gz +wget https://github.com/openvinotoolkit/model_server/releases/download/v2025.2.1/ovms_ubuntu22.tar.gz tar -xzvf ovms_ubuntu22.tar.gz ``` or precompiled package (with python): ```{code} sh -wget https://github.com/openvinotoolkit/model_server/releases/download/v2025.2/ovms_ubuntu22_python_on.tar.gz +wget https://github.com/openvinotoolkit/model_server/releases/download/v2025.2.1/ovms_ubuntu22_python_on.tar.gz tar -xzvf ovms_ubuntu22_python_on.tar.gz ``` Install required libraries: @@ -43,12 +43,12 @@ pip3 install "Jinja2==3.1.6" "MarkupSafe==3.0.2" :sync: ubuntu-24-04 Download precompiled package (without python): ```{code} sh -wget https://github.com/openvinotoolkit/model_server/releases/download/v2025.2/ovms_ubuntu24.tar.gz +wget https://github.com/openvinotoolkit/model_server/releases/download/v2025.2.1/ovms_ubuntu24.tar.gz tar -xzvf ovms_ubuntu24.tar.gz ``` or precompiled package (with python): ```{code} sh -wget https://github.com/openvinotoolkit/model_server/releases/download/v2025.2/ovms_ubuntu24_python_on.tar.gz +wget https://github.com/openvinotoolkit/model_server/releases/download/v2025.2.1/ovms_ubuntu24_python_on.tar.gz tar -xzvf ovms_ubuntu24_python_on.tar.gz ``` Install required libraries: @@ -71,12 +71,12 @@ pip3 install "Jinja2==3.1.6" "MarkupSafe==3.0.2" :sync: rhel-9.6 Download precompiled package (without python): ```{code} sh -wget https://github.com/openvinotoolkit/model_server/releases/download/v2025.2/ovms_redhat.tar.gz +wget https://github.com/openvinotoolkit/model_server/releases/download/v2025.2.1/ovms_redhat.tar.gz tar -xzvf ovms_redhat.tar.gz ``` or precompiled package (with python): ```{code} sh -wget https://github.com/openvinotoolkit/model_server/releases/download/v2025.2/ovms_redhat_python_on.tar.gz +wget https://github.com/openvinotoolkit/model_server/releases/download/v2025.2.1/ovms_redhat_python_on.tar.gz tar -xzvf ovms_redhat_python_on.tar.gz ``` Install required libraries: @@ -102,14 +102,14 @@ Make sure you have [Microsoft Visual C++ Redistributable](https://aka.ms/vs/17/r Download and unpack model server archive for Windows(with python): ```bat -curl -L https://github.com/openvinotoolkit/model_server/releases/download/v2025.2/ovms_windows_python_on.zip -o ovms.zip +curl -L https://github.com/openvinotoolkit/model_server/releases/download/v2025.2.1/ovms_windows_python_on.zip -o ovms.zip tar -xf ovms.zip ``` or archive without python: ```bat -curl -L https://github.com/openvinotoolkit/model_server/releases/download/v2025.2/ovms_windows_python_off.zip -o ovms.zip +curl -L https://github.com/openvinotoolkit/model_server/releases/download/v2025.2.1 /ovms_windows_python_off.zip -o ovms.zip tar -xf ovms.zip ``` From 797484e50458258082aac186815344e987015bfe Mon Sep 17 00:00:00 2001 From: Dariusz Trawinski Date: Sun, 10 Aug 2025 01:51:47 +0200 Subject: [PATCH 02/16] fix export script for new params --- demos/common/export_models/export_model.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/demos/common/export_models/export_model.py b/demos/common/export_models/export_model.py index a4862ad01b..673a651800 100644 --- a/demos/common/export_models/export_model.py +++ b/demos/common/export_models/export_model.py @@ -231,7 +231,8 @@ def add_common_arguments(parser): reasoning_parser: "{{reasoning_parser}}",{% endif %} {%- if tool_parser %} 
tool_parser: "{{tool_parser}}",{% endif %} - enable_tool_guided_generation: {% if not enable_tool_guided_generation %}false{% else %} true{% endif%}, + {%- if enable_tool_guided_generation %} + enable_tool_guided_generation: {% if not enable_tool_guided_generation %}false{% else %} true{% endif%},{% endif %} } } input_stream_handler { From 99d2a9eae6d97b8e619d94639eae3dcfb8ba55b3 Mon Sep 17 00:00:00 2001 From: Dariusz Trawinski Date: Sun, 10 Aug 2025 02:15:26 +0200 Subject: [PATCH 03/16] fix sdl --- ci/lib_search.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ci/lib_search.py b/ci/lib_search.py index 9db3f4c722..97d9906b86 100644 --- a/ci/lib_search.py +++ b/ci/lib_search.py @@ -107,7 +107,7 @@ def check_dir(start_dir): 'net_http.patch', 'partial.patch', 'ovms_drogon_trantor.patch', - 'gorila.patch', + 'gorilla.patch', 'opencv_cmake_flags.txt', 'ovms-c/dist', 'requirements.txt', From 932b412d7f39e3f9caf33cdd884d8079517b918d Mon Sep 17 00:00:00 2001 From: Dariusz Trawinski Date: Mon, 11 Aug 2025 09:22:54 +0200 Subject: [PATCH 04/16] spelling --- demos/continuous_batching/agentic_ai/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/demos/continuous_batching/agentic_ai/README.md b/demos/continuous_batching/agentic_ai/README.md index 0c888fca36..b1bc6255d0 100644 --- a/demos/continuous_batching/agentic_ai/README.md +++ b/demos/continuous_batching/agentic_ai/README.md @@ -53,7 +53,7 @@ python export_model.py text_generation --source_model Qwen/Qwen3-8B --weight-for :::: You can use similar commands for different models. Change the source_model and the tools_model_type (note that as of today the following types as available: `[phi4, llama3, qwen3, hermes3]`). -> **Note:** Some models give more reliable responses with tunned chat template. Copy custom template to the model folder like below: +> **Note:** Some models give more reliable responses with tuned chat template. Copy custom template to the model folder like below: ``` curl -L -o models/meta-llama/Llama-3.1-8B-Instruct/template.jinja https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.9.0/examples/tool_chat_template_llama3.1_json.jinja curl -L -o models/meta-llama/Llama-3.2-3B-Instruct/template.jinja https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.9.0/examples/tool_chat_template_llama3.2_json.jinja From 58ab619c0d0a99e4e32dae9449ff5f56e5883206 Mon Sep 17 00:00:00 2001 From: Dariusz Trawinski Date: Mon, 11 Aug 2025 10:25:39 +0200 Subject: [PATCH 05/16] spelling --- demos/common/export_models/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/demos/common/export_models/README.md b/demos/common/export_models/README.md index 8927c65496..d6124b6ca4 100644 --- a/demos/common/export_models/README.md +++ b/demos/common/export_models/README.md @@ -115,7 +115,7 @@ python export_model.py text_generation --source_model meta-llama/Llama-3.2-3B-In ```console python export_model.py text_generation --source_model mistralai/Mistral-7B-Instruct-v0.3 --model_repository_path models --extra_quantization_params "--task text-generation-with-past" ``` -> **Note:** Model `microsoft/Phi-3.5-vision-instruct` requires one manual adjustments ofter export in the file `generation_config.json` like in the [PR](https://huggingface.co/microsoft/Phi-3.5-vision-instruct/discussions/40/files). 
+> **Note:** Model `microsoft/Phi-3.5-vision-instruct` requires one manual adjustments after export in the file `generation_config.json` like in the [PR](https://huggingface.co/microsoft/Phi-3.5-vision-instruct/discussions/40/files). It will ensure, the generation stops after eos token. ### Embedding Models From 0b0364a2f77f35e2ed6048565d4536cb17308599 Mon Sep 17 00:00:00 2001 From: "Trawinski, Dariusz" Date: Mon, 11 Aug 2025 12:57:20 +0200 Subject: [PATCH 06/16] Apply suggestions from code review Co-authored-by: ngrozae <104074686+ngrozae@users.noreply.github.com> --- demos/continuous_batching/agentic_ai/openai_agent.py | 2 +- demos/continuous_batching/speculative_decoding/README.md | 2 +- docs/deploying_server_baremetal.md | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/demos/continuous_batching/agentic_ai/openai_agent.py b/demos/continuous_batching/agentic_ai/openai_agent.py index b960bdf72c..74fc610866 100644 --- a/demos/continuous_batching/agentic_ai/openai_agent.py +++ b/demos/continuous_batching/agentic_ai/openai_agent.py @@ -117,7 +117,7 @@ def get_model(self, _) -> Model: agent = Agent( name="Assistant", mcp_servers=[fs_server, weather_server], - model_settings=ModelSettings(tool_choice="auto", temperature=0.0,max_tokens=1000, extra_body={"chat_template_kwargs":{"enable_thinking": False}}), + model_settings=ModelSettings(tool_choice="auto", temperature=0.0, max_tokens=1000, extra_body={"chat_template_kwargs": {"enable_thinking": False}}), ) loop = asyncio.new_event_loop() loop.run_until_complete(run(args.query, agent, OVMS_MODEL_PROVIDER, args.stream)) diff --git a/demos/continuous_batching/speculative_decoding/README.md b/demos/continuous_batching/speculative_decoding/README.md index 09de27c5a8..827c479b54 100644 --- a/demos/continuous_batching/speculative_decoding/README.md +++ b/demos/continuous_batching/speculative_decoding/README.md @@ -165,7 +165,7 @@ for chunk in stream: print(chunk.choices[0].text, end="", flush=True) ``` -``` +Output: if len(numbers) <= 1: return numbers else: diff --git a/docs/deploying_server_baremetal.md b/docs/deploying_server_baremetal.md index d8279cd66a..968cbd1661 100644 --- a/docs/deploying_server_baremetal.md +++ b/docs/deploying_server_baremetal.md @@ -109,7 +109,7 @@ tar -xf ovms.zip or archive without python: ```bat -curl -L https://github.com/openvinotoolkit/model_server/releases/download/v2025.2.1 /ovms_windows_python_off.zip -o ovms.zip +curl -L https://github.com/openvinotoolkit/model_server/releases/download/v2025.2.1/ovms_windows_python_off.zip -o ovms.zip tar -xf ovms.zip ``` From 5991053069fe5a5d1f54f0edb456706cbb4e25b7 Mon Sep 17 00:00:00 2001 From: Dariusz Trawinski Date: Mon, 11 Aug 2025 23:32:49 +0200 Subject: [PATCH 07/16] fix model export for image generation --- demos/common/export_models/export_model.py | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/demos/common/export_models/export_model.py b/demos/common/export_models/export_model.py index 673a651800..f08198822c 100644 --- a/demos/common/export_models/export_model.py +++ b/demos/common/export_models/export_model.py @@ -672,11 +672,12 @@ def export_image_generation_model(model_repository_path, source_model, model_nam args['draft_model_name'] = args['draft_source_model'] ### +if args['extra_quantization_params'] is None: + args['extra_quantization_params'] = "" + template_parameters = {k: v for k, v in args.items() if k not in ['model_repository_path', 'source_model', 'model_name', 'precision', 'version', 'config_file_path', 
'overwrite_models']} print("template params:", template_parameters) -if template_parameters['extra_quantization_params'] is None: - template_parameters['extra_quantization_params'] = "" if args['task'] == 'text_generation': export_text_generation_model(args['model_repository_path'], args['source_model'], args['model_name'], args['precision'], template_parameters, args['config_file_path']) From 314b60df36c03a1bce32d1fffa8410be01f2789f Mon Sep 17 00:00:00 2001 From: Dariusz Trawinski Date: Mon, 11 Aug 2025 23:57:37 +0200 Subject: [PATCH 08/16] add auto_gptq to export script --- demos/common/export_models/requirements.txt | 1 + 1 file changed, 1 insertion(+) diff --git a/demos/common/export_models/requirements.txt b/demos/common/export_models/requirements.txt index a5dde79fe4..668cc3bfe9 100644 --- a/demos/common/export_models/requirements.txt +++ b/demos/common/export_models/requirements.txt @@ -13,4 +13,5 @@ transformers<4.52 einops torchvision timm==1.0.15 +auto_gptq==0.7.1 # for GPTQ models diffusers==0.33.1 # for image generation From 98a4c8ce705d5a2c996455e413a1006d98404048 Mon Sep 17 00:00:00 2001 From: Dariusz Trawinski Date: Tue, 12 Aug 2025 12:10:57 +0200 Subject: [PATCH 09/16] fix installing auto_gptq --- demos/common/export_models/README.md | 2 +- demos/common/export_models/requirements.txt | 2 +- demos/python_demos/requirements.txt | 3 ++- prepare_llm_models.sh | 2 +- windows_prepare_llm_models.bat | 2 +- 5 files changed, 6 insertions(+), 5 deletions(-) diff --git a/demos/common/export_models/README.md b/demos/common/export_models/README.md index d6124b6ca4..aaff9e50b3 100644 --- a/demos/common/export_models/README.md +++ b/demos/common/export_models/README.md @@ -10,7 +10,7 @@ This script automates exporting models from Hugging Faces hub or fine-tuned in P ```console git clone https://github.com/openvinotoolkit/model_server cd model_server/demos/common/export_models -pip install -q -r requirements.txt +BUILD_CUDA_EXT=0 pip install -q -r requirements.txt mkdir models python export_model.py --help ``` diff --git a/demos/common/export_models/requirements.txt b/demos/common/export_models/requirements.txt index 668cc3bfe9..125d8ed785 100644 --- a/demos/common/export_models/requirements.txt +++ b/demos/common/export_models/requirements.txt @@ -9,7 +9,7 @@ nncf>=2.11.0 sentence_transformers sentencepiece==0.2.0 openai -transformers<4.52 +transformers<4.53 einops torchvision timm==1.0.15 diff --git a/demos/python_demos/requirements.txt b/demos/python_demos/requirements.txt index b1d180601b..e2a4c2bbb3 100644 --- a/demos/python_demos/requirements.txt +++ b/demos/python_demos/requirements.txt @@ -7,8 +7,9 @@ huggingface_hub==0.32.0 nncf>=2.11.0 sentence_transformers sentencepiece==0.2.0 -transformers<4.52 +transformers<4.53 einops torchvision timm==1.0.15 +auto_gptq==0.7.1 # for GPTQ models diffusers==0.33.1 # for image generation diff --git a/prepare_llm_models.sh b/prepare_llm_models.sh index cb3ffdd18f..02ea874f17 100755 --- a/prepare_llm_models.sh +++ b/prepare_llm_models.sh @@ -59,7 +59,7 @@ else python3.10 -m venv .venv . .venv/bin/activate pip3 install -U pip - pip3 install -U -r demos/common/export_models/requirements.txt + BUILD_CUDA_EXT=0 pip3 install -U -r demos/common/export_models/requirements.txt fi mkdir -p $1 diff --git a/windows_prepare_llm_models.bat b/windows_prepare_llm_models.bat index 58521134b5..77af09f433 100644 --- a/windows_prepare_llm_models.bat +++ b/windows_prepare_llm_models.bat @@ -63,7 +63,7 @@ if !errorlevel! neq 0 exit /b !errorlevel! 
set python -m pip install --upgrade pip if !errorlevel! neq 0 exit /b !errorlevel! -pip install -U -r demos\common\export_models\requirements.txt +BUILD_CUDA_EXT=0 pip install -U -r demos\common\export_models\requirements.txt if !errorlevel! neq 0 exit /b !errorlevel! if not exist "%~1" mkdir "%~1" From 44b1e36f53f53879ec4d81d5f2e564de02e4722c Mon Sep 17 00:00:00 2001 From: Dariusz Trawinski Date: Tue, 12 Aug 2025 13:03:25 +0200 Subject: [PATCH 10/16] fix prepare model on windows --- windows_prepare_llm_models.bat | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/windows_prepare_llm_models.bat b/windows_prepare_llm_models.bat index 77af09f433..88ca4c68b5 100644 --- a/windows_prepare_llm_models.bat +++ b/windows_prepare_llm_models.bat @@ -63,7 +63,8 @@ if !errorlevel! neq 0 exit /b !errorlevel! set python -m pip install --upgrade pip if !errorlevel! neq 0 exit /b !errorlevel! -BUILD_CUDA_EXT=0 pip install -U -r demos\common\export_models\requirements.txt +set BUILD_CUDA_EXT=0 +pip install -U -r demos\common\export_models\requirements.txt if !errorlevel! neq 0 exit /b !errorlevel! if not exist "%~1" mkdir "%~1" From 739719f8ce0df3fe8b50dba0610fed62b45e6636 Mon Sep 17 00:00:00 2001 From: Dariusz Trawinski Date: Tue, 12 Aug 2025 13:26:38 +0200 Subject: [PATCH 11/16] restore quen3 tool call type --- demos/common/export_models/export_model.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/demos/common/export_models/export_model.py b/demos/common/export_models/export_model.py index f08198822c..86f60fe57b 100644 --- a/demos/common/export_models/export_model.py +++ b/demos/common/export_models/export_model.py @@ -52,7 +52,7 @@ def add_common_arguments(parser): 'Not effective if target device is not NPU', dest='max_prompt_len') parser_text.add_argument('--prompt_lookup_decoding', action='store_true', help='Set pipeline to use prompt lookup decoding', dest='prompt_lookup_decoding') parser_text.add_argument('--reasoning_parser', choices=["qwen3"], help='Set the type of the reasoning parser for reasoning content extraction', dest='reasoning_parser') -parser_text.add_argument('--tool_parser', choices=["llama3","phi4","hermes3"], help='Set the type of the tool parser for tool calls extraction', dest='tool_parser') +parser_text.add_argument('--tool_parser', choices=["llama3","phi4","hermes3", "qwen3"], help='Set the type of the tool parser for tool calls extraction', dest='tool_parser') parser_text.add_argument('--enable_tool_guided_generation', action='store_true', help='Enables enforcing tool schema during generation. Requires setting tool_parser', dest='enable_tool_guided_generation') parser_embeddings = subparsers.add_parser('embeddings', help='[deprecated] export model for embeddings endpoint with models split into separate, versioned directories') From fc75c8cb9da74d0178b414c92aae7980991f37be Mon Sep 17 00:00:00 2001 From: Dariusz Trawinski Date: Tue, 12 Aug 2025 14:02:42 +0200 Subject: [PATCH 12/16] fix embedding demo command --- demos/embeddings/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/demos/embeddings/README.md b/demos/embeddings/README.md index f515155965..71318a2a36 100644 --- a/demos/embeddings/README.md +++ b/demos/embeddings/README.md @@ -65,7 +65,7 @@ For example: > **Note:** By default OVMS returns first token embeddings as sequence embeddings (called CLS pooling). It can be changed using `--pooling` option if needed by the model. Supported values are CLS and LAST. 
For example: ```console -python export_model.py embeddings_ov --source_model Qwen/Qwen3-Embedding-0.6B --weight-format fp16 --pooling LAST --config_file_path models/config.json` +python export_model.py embeddings_ov --source_model Qwen/Qwen3-Embedding-0.6B --weight-format fp16 --pooling LAST --config_file_path models/config.json ``` ## Tested models From 44fce55488d7e77b838181c66c776e2d8b38a106 Mon Sep 17 00:00:00 2001 From: Dariusz Trawinski Date: Tue, 12 Aug 2025 14:13:48 +0200 Subject: [PATCH 13/16] fix building python --- demos/python_demos/Dockerfile.redhat | 2 +- demos/python_demos/Dockerfile.ubuntu | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/demos/python_demos/Dockerfile.redhat b/demos/python_demos/Dockerfile.redhat index cacd66da79..31b79b0ecb 100644 --- a/demos/python_demos/Dockerfile.redhat +++ b/demos/python_demos/Dockerfile.redhat @@ -21,6 +21,6 @@ ENV PYTHONPATH=/ovms/lib/python RUN if [ -f /usr/bin/dnf ] ; then export DNF_TOOL=dnf ; else export DNF_TOOL=microdnf ; fi ; \ $DNF_TOOL install -y python3-pip git COPY requirements.txt . -RUN pip3 install -r requirements.txt +RUN BUILD_CUDA_EXT=0 pip3 install -r requirements.txt USER ovms ENTRYPOINT [ "/ovms/bin/ovms" ] diff --git a/demos/python_demos/Dockerfile.ubuntu b/demos/python_demos/Dockerfile.ubuntu index 11e4839d42..3308772278 100644 --- a/demos/python_demos/Dockerfile.ubuntu +++ b/demos/python_demos/Dockerfile.ubuntu @@ -21,7 +21,7 @@ ENV PYTHONPATH=/ovms/lib/python RUN apt update && apt install -y python3-pip git COPY requirements.txt . ENV PIP_BREAK_SYSTEM_PACKAGES=1 -RUN pip3 install -r requirements.txt --no-cache-dir +RUN BUILD_CUDA_EXT=0 pip3 install -r requirements.txt --no-cache-dir RUN opt_in_out --opt_out USER ovms ENTRYPOINT [ "/ovms/bin/ovms" ] From 49412f99fca99457ffa792925bf3a36dfbfce713 Mon Sep 17 00:00:00 2001 From: Dariusz Trawinski Date: Tue, 12 Aug 2025 15:38:27 +0200 Subject: [PATCH 14/16] fix docs automation --- demos/common/export_models/README.md | 5 ++++- demos/common/export_models/requirements.txt | 1 - demos/image_generation/README.md | 10 +++++----- prepare_llm_models.sh | 2 +- windows_prepare_llm_models.bat | 1 - 5 files changed, 10 insertions(+), 9 deletions(-) diff --git a/demos/common/export_models/README.md b/demos/common/export_models/README.md index aaff9e50b3..0c5aa9c09a 100644 --- a/demos/common/export_models/README.md +++ b/demos/common/export_models/README.md @@ -10,7 +10,7 @@ This script automates exporting models from Hugging Faces hub or fine-tuned in P ```console git clone https://github.com/openvinotoolkit/model_server cd model_server/demos/common/export_models -BUILD_CUDA_EXT=0 pip install -q -r requirements.txt +pip install -q -r requirements.txt mkdir models python export_model.py --help ``` @@ -118,6 +118,9 @@ python export_model.py text_generation --source_model mistralai/Mistral-7B-Instr > **Note:** Model `microsoft/Phi-3.5-vision-instruct` requires one manual adjustments after export in the file `generation_config.json` like in the [PR](https://huggingface.co/microsoft/Phi-3.5-vision-instruct/discussions/40/files). It will ensure, the generation stops after eos token. +> **Note:** In oder to export GPTQ models, you need to install also package `auto_gptq` via command `BUILD_CUDA_EXT=0 pip install auto_gptq` on Linux and `set BUILD_CUDA_EXT=0 && pip install auto_gptq` on Windows. 
+ + ### Embedding Models #### Embeddings with deployment on a single CPU host: diff --git a/demos/common/export_models/requirements.txt b/demos/common/export_models/requirements.txt index 125d8ed785..25f19ac39d 100644 --- a/demos/common/export_models/requirements.txt +++ b/demos/common/export_models/requirements.txt @@ -13,5 +13,4 @@ transformers<4.53 einops torchvision timm==1.0.15 -auto_gptq==0.7.1 # for GPTQ models diffusers==0.33.1 # for image generation diff --git a/demos/image_generation/README.md b/demos/image_generation/README.md index c758aef3d4..2aa4dec73e 100644 --- a/demos/image_generation/README.md +++ b/demos/image_generation/README.md @@ -37,7 +37,7 @@ mkdir -p models docker run -d --rm --user $(id -u):$(id -g) -p 8000:8000 -v $(pwd)/models:/models/:rw \ -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e no_proxy=$no_proxy \ - openvino/model_server:2025.2 \ + openvino/model_server:latest \ --rest_port 8000 \ --model_repository_path /models/ \ --task image_generation \ @@ -81,7 +81,7 @@ mkdir -p models docker run -d --rm -p 8000:8000 -v $(pwd)/models:/models/:rw \ --user $(id -u):$(id -g) --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) \ -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e no_proxy=$no_proxy \ - openvino/model_server:2025.2-gpu \ + openvino/model_server:latest-gpu \ --rest_port 8000 \ --model_repository_path /models/ \ --task image_generation \ @@ -165,7 +165,7 @@ Here, the original models in `safetensors` format and the tokenizers will be con Quantization ensures faster initialization time, better performance and lower memory consumption. Image generation pipeline parameters will be defined inside the `graph.pbtxt` file. -Download export script (2025.2 and later), install it's dependencies and create directory for the models: +Download export script, install it's dependencies and create directory for the models: ```console curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/2/demos/common/export_models/export_model.py -o export_model.py pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/2/demos/common/export_models/requirements.txt @@ -247,7 +247,7 @@ Running this command starts the container with CPU only target device: Start docker container: ```bash docker run -d --rm -p 8000:8000 -v $(pwd)/models:/models:ro \ - openvino/model_server:2025.2 \ + openvino/model_server:latest \ --rest_port 8000 \ --model_name OpenVINO/FLUX.1-schnell-int4-ov \ --model_path /models/black-forest-labs/FLUX.1-schnell @@ -286,7 +286,7 @@ It can be applied using the commands below: ```bash docker run -d --rm -p 8000:8000 -v $(pwd)/models:/models:ro \ --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) \ - openvino/model_server:2025.2-gpu \ + openvino/model_server:latest-gpu \ --rest_port 8000 \ --model_name OpenVINO/FLUX.1-schnell-int4-ov \ --model_path /models/black-forest-labs/FLUX.1-schnell diff --git a/prepare_llm_models.sh b/prepare_llm_models.sh index 02ea874f17..cb3ffdd18f 100755 --- a/prepare_llm_models.sh +++ b/prepare_llm_models.sh @@ -59,7 +59,7 @@ else python3.10 -m venv .venv . 
.venv/bin/activate pip3 install -U pip - BUILD_CUDA_EXT=0 pip3 install -U -r demos/common/export_models/requirements.txt + pip3 install -U -r demos/common/export_models/requirements.txt fi mkdir -p $1 diff --git a/windows_prepare_llm_models.bat b/windows_prepare_llm_models.bat index 88ca4c68b5..58521134b5 100644 --- a/windows_prepare_llm_models.bat +++ b/windows_prepare_llm_models.bat @@ -63,7 +63,6 @@ if !errorlevel! neq 0 exit /b !errorlevel! set python -m pip install --upgrade pip if !errorlevel! neq 0 exit /b !errorlevel! -set BUILD_CUDA_EXT=0 pip install -U -r demos\common\export_models\requirements.txt if !errorlevel! neq 0 exit /b !errorlevel! From d073fb0e999c4a2b123213224170daba59af8d67 Mon Sep 17 00:00:00 2001 From: Dariusz Trawinski Date: Wed, 13 Aug 2025 11:44:39 +0200 Subject: [PATCH 15/16] refresh command help in readme --- demos/common/export_models/README.md | 21 +++++++++++++-------- 1 file changed, 13 insertions(+), 8 deletions(-) diff --git a/demos/common/export_models/README.md b/demos/common/export_models/README.md index 0c5aa9c09a..d95cc3c122 100644 --- a/demos/common/export_models/README.md +++ b/demos/common/export_models/README.md @@ -37,10 +37,11 @@ python export_model.py text_generation --help ``` Expected Output: ```console -usage: export_model.py text_generation [-h] [--model_repository_path MODEL_REPOSITORY_PATH] --source_model SOURCE_MODEL [--model_name MODEL_NAME] [--weight-format PRECISION] [--config_file_path CONFIG_FILE_PATH] [--overwrite_models] [--target_device TARGET_DEVICE] - [--ov_cache_dir OV_CACHE_DIR] [--pipeline_type {LM,LM_CB,VLM,VLM_CB,AUTO}] [--kv_cache_precision {u8}] [--extra_quantization_params EXTRA_QUANTIZATION_PARAMS] [--enable_prefix_caching] [--disable_dynamic_split_fuse] - [--max_num_batched_tokens MAX_NUM_BATCHED_TOKENS] [--max_num_seqs MAX_NUM_SEQS] [--cache_size CACHE_SIZE] [--draft_source_model DRAFT_SOURCE_MODEL] [--draft_model_name DRAFT_MODEL_NAME] [--max_prompt_len MAX_PROMPT_LEN] [--prompt_lookup_decoding] - [--tools_model_type {llama3,phi4,hermes3,qwen3}] +usage: export_model.py text_generation [-h] [--model_repository_path MODEL_REPOSITORY_PATH] --source_model SOURCE_MODEL [--model_name MODEL_NAME] [--weight-format PRECISION] [--config_file_path CONFIG_FILE_PATH] + [--overwrite_models] [--target_device TARGET_DEVICE] [--ov_cache_dir OV_CACHE_DIR] [--extra_quantization_params EXTRA_QUANTIZATION_PARAMS] [--pipeline_type {LM,LM_CB,VLM,VLM_CB,AUTO}] + [--kv_cache_precision {u8}] [--enable_prefix_caching] [--disable_dynamic_split_fuse] [--max_num_batched_tokens MAX_NUM_BATCHED_TOKENS] [--max_num_seqs MAX_NUM_SEQS] + [--cache_size CACHE_SIZE] [--draft_source_model DRAFT_SOURCE_MODEL] [--draft_model_name DRAFT_MODEL_NAME] [--max_prompt_len MAX_PROMPT_LEN] [--prompt_lookup_decoding] + [--reasoning_parser {qwen3}] [--tool_parser {llama3,phi4,hermes3,qwen3}] [--enable_tool_guided_generation] options: -h, --help show this help message and exit @@ -59,12 +60,12 @@ options: CPU, GPU, NPU or HETERO, default is CPU --ov_cache_dir OV_CACHE_DIR Folder path for compilation cache to speedup initialization time + --extra_quantization_params EXTRA_QUANTIZATION_PARAMS + Add advanced quantization parameters. Check optimum-intel documentation. Example: "--sym --group-size -1 --ratio 1.0 --awq --scale-estimation --dataset wikitext2" --pipeline_type {LM,LM_CB,VLM,VLM_CB,AUTO} Type of the pipeline to be used. AUTO is used by default --kv_cache_precision {u8} u8 or empty (model default). 
Reduced kv cache precision to u8 lowers the cache size consumption. - --extra_quantization_params EXTRA_QUANTIZATION_PARAMS - Add advanced quantization parameters. Check optimum-intel documentation. Example: "--sym --group-size -1 --ratio 1.0 --awq --scale-estimation --dataset wikitext2" --enable_prefix_caching This algorithm is used to cache the prompt tokens. --disable_dynamic_split_fuse @@ -83,8 +84,12 @@ options: Sets NPU specific property for maximum number of tokens in the prompt. Not effective if target device is not NPU --prompt_lookup_decoding Set pipeline to use prompt lookup decoding - --tools_model_type {llama3,phi4,hermes3,qwen3} - Set the type of model chat template and output parser + --reasoning_parser {qwen3} + Set the type of the reasoning parser for reasoning content extraction + --tool_parser {llama3,phi4,hermes3,qwen3} + Set the type of the tool parser for tool calls extraction + --enable_tool_guided_generation + Enables enforcing tool schema during generation. Requires setting tool_parser ``` ## Model Export Examples From 64e883bc93afa610b03e4d860cda376e2a1d91d4 Mon Sep 17 00:00:00 2001 From: Dariusz Trawinski Date: Wed, 13 Aug 2025 13:07:59 +0200 Subject: [PATCH 16/16] spelling --- demos/common/export_models/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/demos/common/export_models/README.md b/demos/common/export_models/README.md index d95cc3c122..cc366e3386 100644 --- a/demos/common/export_models/README.md +++ b/demos/common/export_models/README.md @@ -123,7 +123,7 @@ python export_model.py text_generation --source_model mistralai/Mistral-7B-Instr > **Note:** Model `microsoft/Phi-3.5-vision-instruct` requires one manual adjustments after export in the file `generation_config.json` like in the [PR](https://huggingface.co/microsoft/Phi-3.5-vision-instruct/discussions/40/files). It will ensure, the generation stops after eos token. -> **Note:** In oder to export GPTQ models, you need to install also package `auto_gptq` via command `BUILD_CUDA_EXT=0 pip install auto_gptq` on Linux and `set BUILD_CUDA_EXT=0 && pip install auto_gptq` on Windows. +> **Note:** In order to export GPTQ models, you need to install also package `auto_gptq` via command `BUILD_CUDA_EXT=0 pip install auto_gptq` on Linux and `set BUILD_CUDA_EXT=0 && pip install auto_gptq` on Windows. ### Embedding Models