
Commit 3ac4002

dtrawins and ngrozae authored

Minor docs fixes (#3565)

- info about exporting GPTQ models
- fixed export of image generation models without extra_quantization_params
- updated image tags in demos
- spelling mistake fixes
- fixes in export_model script
- fixes in agentic demo

Co-authored-by: ngrozae <[email protected]>

1 parent a3786a9 commit 3ac4002

File tree

17 files changed: +132, -83 lines


ci/lib_search.py

Lines changed: 1 addition & 1 deletion
@@ -107,7 +107,7 @@ def check_dir(start_dir):
 'net_http.patch',
 'partial.patch',
 'ovms_drogon_trantor.patch',
-'gorila.patch',
+'gorilla.patch',
 'opencv_cmake_flags.txt',
 'ovms-c/dist',
 'requirements.txt',

demos/code_local_assistant/README.md

Lines changed: 11 additions & 6 deletions
@@ -9,8 +9,7 @@ With the rise of AI PC capabilities, hosting own Visual Studio code assistant is
 - Intel Meteor Lake, Lunar Lake, Arrow Lake or newer Intel CPU.

 ## Prepare Code Chat/Edit Model
-We need to use medium size model in order to keep 50ms/word for human to feel the chat responsive.
-This will work in streaming mode, meaning we will see the chat response/code diff generation slowly roll out in real-time.
+We need to use a medium size model to get reliable responses, but also to fit it into the available memory on the host or discrete GPU.

 Download export script, install its dependencies and create directory for the models:
 ```console
@@ -22,10 +21,10 @@ mkdir models

 Export `codellama/CodeLlama-7b-Instruct-hf`:
 ```console
-python export_model.py text_generation --source_model codellama/CodeLlama-7b-Instruct-hf --weight-format int4 --config_file_path models/config_all.json --model_repository_path models --target_device NPU --overwrite_models
+python export_model.py text_generation --source_model codellama/CodeLlama-7b-Instruct-hf --weight-format int4 --config_file_path models/config_all.json --model_repository_path models --target_device GPU --overwrite_models
 ```

-> **Note:** Use `--target_device GPU` for Intel GPU or omit this parameter to run on Intel CPU
+> **Note:** Use `--target_device NPU` for Intel NPU or omit this parameter to run on Intel CPU

 ## Prepare Code Completion Model
 For this task we need smaller, lighter model that will produce code quicker than chat task.
@@ -104,10 +103,16 @@ Please refer to OpenVINO Model Server installation first: [link](../../docs/depl
 ovms --rest_port 8000 --config_path ./models/config_all.json
 ```

-### Linux: via Docker
+### Linux: via Docker with GPU
+```bash
+docker run -d --rm --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -u $(id -u):$(id -g) \
+-p 8000:8000 -v $(pwd)/:/workspace/ openvino/model_server:latest-gpu --rest_port 8000 --config_path /workspace/models/config_all.json
+```
+
+### Linux: via Docker with NPU
 ```bash
 docker run -d --rm --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -u $(id -u):$(id -g) \
--p 8000:8000 -v $(pwd)/:/workspace/ openvino/model_server:2025.2 --rest_port 8000 --config_path /workspace/models/config_all.json
+-p 8000:8000 -v $(pwd)/:/workspace/ openvino/model_server:latest-gpu --rest_port 8000 --config_path /workspace/models/config_all.json
 ```

 ## Set Up Visual Studio Code
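
A quick way to confirm that either container variant above came up correctly is to query the server's REST configuration endpoint. This is a minimal sketch, assuming the default port 8000 used throughout this demo:

```bash
# Hypothetical smoke test: a JSON response listing the models from config_all.json
# indicates the server loaded the configuration (port 8000 assumed, as above).
curl -s http://localhost:8000/v1/config
```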

demos/common/export_models/README.md

Lines changed: 31 additions & 11 deletions
@@ -37,10 +37,11 @@ python export_model.py text_generation --help
 ```
 Expected Output:
 ```console
-usage: export_model.py text_generation [-h] [--model_repository_path MODEL_REPOSITORY_PATH] --source_model SOURCE_MODEL [--model_name MODEL_NAME] [--weight-format PRECISION] [--config_file_path CONFIG_FILE_PATH] [--overwrite_models] [--target_device TARGET_DEVICE]
-[--ov_cache_dir OV_CACHE_DIR] [--pipeline_type {LM,LM_CB,VLM,VLM_CB,AUTO}] [--kv_cache_precision {u8}] [--extra_quantization_params EXTRA_QUANTIZATION_PARAMS] [--enable_prefix_caching] [--disable_dynamic_split_fuse]
-[--max_num_batched_tokens MAX_NUM_BATCHED_TOKENS] [--max_num_seqs MAX_NUM_SEQS] [--cache_size CACHE_SIZE] [--draft_source_model DRAFT_SOURCE_MODEL] [--draft_model_name DRAFT_MODEL_NAME] [--max_prompt_len MAX_PROMPT_LEN] [--prompt_lookup_decoding]
-[--tools_model_type {llama3,phi4,hermes3,qwen3}]
+usage: export_model.py text_generation [-h] [--model_repository_path MODEL_REPOSITORY_PATH] --source_model SOURCE_MODEL [--model_name MODEL_NAME] [--weight-format PRECISION] [--config_file_path CONFIG_FILE_PATH]
+[--overwrite_models] [--target_device TARGET_DEVICE] [--ov_cache_dir OV_CACHE_DIR] [--extra_quantization_params EXTRA_QUANTIZATION_PARAMS] [--pipeline_type {LM,LM_CB,VLM,VLM_CB,AUTO}]
+[--kv_cache_precision {u8}] [--enable_prefix_caching] [--disable_dynamic_split_fuse] [--max_num_batched_tokens MAX_NUM_BATCHED_TOKENS] [--max_num_seqs MAX_NUM_SEQS]
+[--cache_size CACHE_SIZE] [--draft_source_model DRAFT_SOURCE_MODEL] [--draft_model_name DRAFT_MODEL_NAME] [--max_prompt_len MAX_PROMPT_LEN] [--prompt_lookup_decoding]
+[--reasoning_parser {qwen3}] [--tool_parser {llama3,phi4,hermes3,qwen3}] [--enable_tool_guided_generation]

 options:
 -h, --help show this help message and exit
@@ -59,12 +60,12 @@ options:
 CPU, GPU, NPU or HETERO, default is CPU
 --ov_cache_dir OV_CACHE_DIR
 Folder path for compilation cache to speedup initialization time
+--extra_quantization_params EXTRA_QUANTIZATION_PARAMS
+Add advanced quantization parameters. Check optimum-intel documentation. Example: "--sym --group-size -1 --ratio 1.0 --awq --scale-estimation --dataset wikitext2"
 --pipeline_type {LM,LM_CB,VLM,VLM_CB,AUTO}
 Type of the pipeline to be used. AUTO is used by default
 --kv_cache_precision {u8}
 u8 or empty (model default). Reduced kv cache precision to u8 lowers the cache size consumption.
---extra_quantization_params EXTRA_QUANTIZATION_PARAMS
-Add advanced quantization parameters. Check optimum-intel documentation. Example: "--sym --group-size -1 --ratio 1.0 --awq --scale-estimation --dataset wikitext2"
 --enable_prefix_caching
 This algorithm is used to cache the prompt tokens.
 --disable_dynamic_split_fuse
@@ -83,8 +84,12 @@ options:
 Sets NPU specific property for maximum number of tokens in the prompt. Not effective if target device is not NPU
 --prompt_lookup_decoding
 Set pipeline to use prompt lookup decoding
---tools_model_type {llama3,phi4,hermes3,qwen3}
-Set the type of model chat template and output parser
+--reasoning_parser {qwen3}
+Set the type of the reasoning parser for reasoning content extraction
+--tool_parser {llama3,phi4,hermes3,qwen3}
+Set the type of the tool parser for tool calls extraction
+--enable_tool_guided_generation
+Enables enforcing tool schema during generation. Requires setting tool_parser
 ```

 ## Model Export Examples
@@ -111,19 +116,34 @@ Text generation for NPU target device. Command below sets max allowed prompt siz
 ```console
 python export_model.py text_generation --source_model meta-llama/Llama-3.2-3B-Instruct --config_file_path models/config_all.json --model_repository_path models --target_device NPU --max_prompt_len 2048 --ov_cache_dir ./models/.ov_cache
 ```
+> **Note:** Some models like `mistralai/Mistral-7B-Instruct-v0.3` might fail to export because the task can't be determined automatically. In such a situation it can be set via `--extra_quantization_params`. For example:
+```console
+python export_model.py text_generation --source_model mistralai/Mistral-7B-Instruct-v0.3 --model_repository_path models --extra_quantization_params "--task text-generation-with-past"
+```
+> **Note:** Model `microsoft/Phi-3.5-vision-instruct` requires one manual adjustment after export in the file `generation_config.json` like in the [PR](https://huggingface.co/microsoft/Phi-3.5-vision-instruct/discussions/40/files).
+It will ensure the generation stops after the eos token.
+
+> **Note:** In order to export GPTQ models, you also need to install the `auto_gptq` package via command `BUILD_CUDA_EXT=0 pip install auto_gptq` on Linux and `set BUILD_CUDA_EXT=0 && pip install auto_gptq` on Windows.
+

 ### Embedding Models

 #### Embeddings with deployment on a single CPU host:
 ```console
-python export_model.py embeddings --source_model Alibaba-NLP/gte-large-en-v1.5 --weight-format int8 --config_file_path models/config_all.json
+python export_model.py embeddings_ov --source_model Alibaba-NLP/gte-large-en-v1.5 --weight-format int8 --config_file_path models/config_all.json
 ```

 #### Embeddings with deployment on a dual CPU host:
 ```console
-python export_model.py embeddings --source_model Alibaba-NLP/gte-large-en-v1.5 --weight-format int8 --config_file_path models/config_all.json --num_streams 2
+python export_model.py embeddings_ov --source_model Alibaba-NLP/gte-large-en-v1.5 --weight-format int8 --config_file_path models/config_all.json --num_streams 2
+```
+
+#### Embeddings with pooling parameter
+```console
+python export_model.py embeddings_ov --source_model Qwen/Qwen3-Embedding-0.6B --weight-format fp16 --config_file_path models/config_all.json
 ```

+
 #### With Input Truncation
 By default, embeddings endpoint returns an error when the input exceed the maximum model context length.
 It is possible to change the behavior to truncate prompts automatically to fit the model. Add `--truncate` option in the export command.
@@ -138,7 +158,7 @@ python export_model.py embeddings \

 ### Reranking Models
 ```console
-python export_model.py rerank \
+python export_model.py rerank_ov \
 --source_model BAAI/bge-reranker-large \
 --weight-format int8 \
 --config_file_path models/config_all.json \
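
The parser-related options added to the help text above can be combined with a regular text_generation export. The command below is only a sketch of such a combination (model and precision are chosen for illustration, not prescribed by this commit):

```console
python export_model.py text_generation --source_model Qwen/Qwen3-8B --weight-format int4 \
    --config_file_path models/config_all.json --model_repository_path models \
    --reasoning_parser qwen3 --tool_parser qwen3 --enable_tool_guided_generation
```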

demos/common/export_models/export_model.py

Lines changed: 12 additions & 5 deletions
@@ -52,7 +52,7 @@ def add_common_arguments(parser):
 'Not effective if target device is not NPU', dest='max_prompt_len')
 parser_text.add_argument('--prompt_lookup_decoding', action='store_true', help='Set pipeline to use prompt lookup decoding', dest='prompt_lookup_decoding')
 parser_text.add_argument('--reasoning_parser', choices=["qwen3"], help='Set the type of the reasoning parser for reasoning content extraction', dest='reasoning_parser')
-parser_text.add_argument('--tool_parser', choices=["llama3","phi4","hermes3"], help='Set the type of the tool parser for tool calls extraction', dest='tool_parser')
+parser_text.add_argument('--tool_parser', choices=["llama3","phi4","hermes3", "qwen3"], help='Set the type of the tool parser for tool calls extraction', dest='tool_parser')
 parser_text.add_argument('--enable_tool_guided_generation', action='store_true', help='Enables enforcing tool schema during generation. Requires setting tool_parser', dest='enable_tool_guided_generation')

 parser_embeddings = subparsers.add_parser('embeddings', help='[deprecated] export model for embeddings endpoint with models split into separate, versioned directories')
@@ -231,7 +231,8 @@ def add_common_arguments(parser):
 reasoning_parser: "{{reasoning_parser}}",{% endif %}
 {%- if tool_parser %}
 tool_parser: "{{tool_parser}}",{% endif %}
-enable_tool_guided_generation: {% if not enable_tool_guided_generation %}false{% else %} true{% endif%},
+{%- if enable_tool_guided_generation %}
+enable_tool_guided_generation: {% if not enable_tool_guided_generation %}false{% else %} true{% endif%},{% endif %}
 }
 }
 input_stream_handler {
@@ -401,7 +402,12 @@ def export_text_generation_model(model_repository_path, source_model, model_name
 task_parameters['extra_quantization_params'] = "--sym --ratio 1.0 --group-size -1"
 optimum_command = "optimum-cli export openvino --model {} --weight-format {} {} --trust-remote-code {}".format(source_model, precision, task_parameters['extra_quantization_params'], llm_model_path)
 if os.system(optimum_command):
-raise ValueError("Failed to export llm model", source_model)
+raise ValueError("Failed to export llm model", source_model)
+if not (os.path.isfile(os.path.join(llm_model_path, 'openvino_detokenizer.xml'))):
+print("Tokenizer and detokenizer not found in the exported model. Exporting tokenizer and detokenizer from HF model")
+convert_tokenizer_command = "convert_tokenizer --with-detokenizer -o {} {}".format(llm_model_path, source_model)
+if os.system(convert_tokenizer_command):
+raise ValueError("Failed to export tokenizer and detokenizer", source_model)
 ### Export draft model for speculative decoding
 draft_source_model = task_parameters.get("draft_source_model", None)
 draft_model_dir_name = None
@@ -666,11 +672,12 @@ def export_image_generation_model(model_repository_path, source_model, model_nam
 args['draft_model_name'] = args['draft_source_model']
 ###

+if args['extra_quantization_params'] is None:
+args['extra_quantization_params'] = ""
+
 template_parameters = {k: v for k, v in args.items() if k not in ['model_repository_path', 'source_model', 'model_name', 'precision', 'version', 'config_file_path', 'overwrite_models']}
 print("template params:", template_parameters)

-if template_parameters['extra_quantization_params'] is None:
-template_parameters['extra_quantization_params'] = ""
 if args['task'] == 'text_generation':
 export_text_generation_model(args['model_repository_path'], args['source_model'], args['model_name'], args['precision'], template_parameters, args['config_file_path'])
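
Two behavioral notes on the changes above: the graph template now writes `enable_tool_guided_generation` into `graph.pbtxt` only when the flag is actually passed, and the export path gains a fallback that calls the `convert_tokenizer` CLI (from the openvino-tokenizers package) when `openvino_detokenizer.xml` is missing after the optimum-cli export. The equivalent manual invocation would look roughly like the sketch below, with placeholders standing in for the values the script formats into `convert_tokenizer_command`:

```bash
# Hypothetical manual fallback; <llm_model_path> and <source_model> are placeholders.
convert_tokenizer --with-detokenizer -o <llm_model_path> <source_model>
```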

demos/common/export_models/requirements.txt

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ nncf>=2.11.0
 sentence_transformers
 sentencepiece==0.2.0
 openai
-transformers<4.52
+transformers<4.53
 einops
 torchvision
 timm==1.0.15

demos/continuous_batching/accuracy/README.md

Lines changed: 1 addition & 1 deletion
@@ -113,7 +113,7 @@ Use [Berkeley function call leaderboard ](https://github.com/ShishirPatil/gorill
 git clone https://github.com/ShishirPatil/gorilla
 cd gorilla/berkeley-function-call-leaderboard
 git checkout ac37049f00022af54cc44b6aa0cad4402c22d1a0
-curl -s https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/agent-accuracy/demos/continuous_batching/accuracy/gorila.patch | git apply -v
+curl -s https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/continuous_batching/accuracy/gorilla.patch | git apply -v
 pip install -e .
 ```
 The commands below assumes the models is deployed with the name `openvino-qwen3-8b-int8`. It must match the name set in the `bfcl_eval/constants/model_config.py`.
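
Because the patch URL now points at the main branch rather than a feature branch, it may be worth checking that it still applies cleanly to the pinned gorilla commit before running the install. A hedged dry run using git's check mode:

```bash
# Run inside gorilla/berkeley-function-call-leaderboard at the commit checked out above;
# --check reports conflicts without modifying any files.
curl -s https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/continuous_batching/accuracy/gorilla.patch | git apply --check -v
```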

demos/continuous_batching/agentic_ai/README.md

Lines changed: 15 additions & 3 deletions
@@ -1,5 +1,7 @@
 # Agentic AI with OpenVINO Model Server {#ovms_demos_continuous_batching_agent}

+This demo version requires OVMS 2025.3. Build it from [source](../../../docs/build_from_source.md) until that version is published.
+
 OpenVINO Model Server can be used to serve language models for AI Agents. It supports the usage of tools in the context of content generation.
 It can be integrated with MCP servers and AI agent frameworks.
 You can learn more about [tools calling based on OpenAI API](https://platform.openai.com/docs/guides/function-calling?api-mode=responses)
@@ -10,10 +12,14 @@ Here are presented required steps to deploy language models trained for tools su
 The application employing OpenAI agent SDK is using MCP server. It is equipped with a set of tools to providing context for the content generation.
 The tools can also be used for automation purposes based on input in text format.

+
+
 ## Export LLM model
 Currently supported models:
 - Qwen/Qwen3-8B
+- Qwen/Qwen3-4B
 - meta-llama/Llama-3.1-8B-Instruct
+- meta-llama/Llama-3.2-3B-Instruct
 - NousResearch/Hermes-3-Llama-3.1-8B
 - microsoft/Phi-4-mini-instruct

@@ -23,7 +29,7 @@ The model response with tool call follow a specific syntax which is process by a
 Download export script, install it's dependencies and create directory for the models:
 ```console
 curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/export_model.py -o export_model.py
-pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt
+pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/2/demos/common/export_models/requirements.txt
 mkdir models
 ```
 Run `export_model.py` script to download and quantize the model:
@@ -47,7 +53,13 @@ python export_model.py text_generation --source_model Qwen/Qwen3-8B --weight-for
 ::::

 You can use similar commands for different models. Change the source_model and the tools_model_type (note that as of today the following types as available: `[phi4, llama3, qwen3, hermes3]`).
-> **Note:** The tuned chat template will be copied to the model folder as template.jinja and the response parser will be set in the graph.pbtxt
+> **Note:** Some models give more reliable responses with a tuned chat template. Copy a custom template to the model folder as shown below:
+```
+curl -L -o models/meta-llama/Llama-3.1-8B-Instruct/template.jinja https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.9.0/examples/tool_chat_template_llama3.1_json.jinja
+curl -L -o models/meta-llama/Llama-3.2-3B-Instruct/template.jinja https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.9.0/examples/tool_chat_template_llama3.2_json.jinja
+curl -L -o models/NousResearch/Hermes-3-Llama-3.1-8B/template.jinja https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.9.0/examples/tool_chat_template_hermes.jinja
+curl -L -o models/microsoft/Phi-4-mini-instruct/template.jinja https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.9.0/examples/tool_chat_template_phi4_mini.jinja
+```


 ## Start OVMS
@@ -74,7 +86,7 @@ In case you want to use GPU device to run the generation, add extra docker param
 to `docker run` command, use the image with GPU support. Export the models with precision matching the GPU capacity and adjust pipeline configuration.
 It can be applied using the commands below:
 ```bash
-docker run -d --rm -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -v $(pwd)/models:/models:ro openvino/model_server:2025.2-gpu \
+docker run -d --rm -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -v $(pwd)/models:/models:ro openvino/model_server:latest-gpu \
 --rest_port 8000 --model_path /models/Qwen/Qwen3-8B --model_name Qwen/Qwen3-8B
 ```
 :::
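
Once OVMS is running with one of the exported models, the tool-calling path this demo depends on can be exercised directly, assuming the OpenAI-compatible chat/completions endpoint exposed by the server on the port used above. The `get_weather` tool schema below is invented purely for illustration:

```bash
curl -s http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-8B",
  "messages": [{"role": "user", "content": "What is the weather in Krakow?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Return the current weather for a city",
      "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}
    }
  }]
}'
```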

demos/continuous_batching/agentic_ai/openai_agent.py

Lines changed: 1 addition & 1 deletion
@@ -117,7 +117,7 @@ def get_model(self, _) -> Model:
 agent = Agent(
 name="Assistant",
 mcp_servers=[fs_server, weather_server],
-model_settings=ModelSettings(tool_choice="auto", temperature=0.0),
+model_settings=ModelSettings(tool_choice="auto", temperature=0.0, max_tokens=1000, extra_body={"chat_template_kwargs": {"enable_thinking": False}}),
 )
 loop = asyncio.new_event_loop()
 loop.run_until_complete(run(args.query, agent, OVMS_MODEL_PROVIDER, args.stream))
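
The new `extra_body` entry is passed through to the underlying chat completions request, so each call made by the agent now carries `max_tokens` and a `chat_template_kwargs` field that turns off Qwen3 thinking mode. A rough sketch of the resulting request payload (model name, port and prompt are illustrative):

```bash
curl -s http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-8B",
  "max_tokens": 1000,
  "chat_template_kwargs": {"enable_thinking": false},
  "messages": [{"role": "user", "content": "List the files in the shared workspace"}]
}'
```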
