
Commit 211d338

atobiszei and dtrawins authored
Pull docs update (#3381)
Co-authored-by: Trawinski, Dariusz <[email protected]>
1 parent dd5934c commit 211d338

File tree: 6 files changed, +345 / -195 lines


demos/common/export_models/README.md

Lines changed: 5 additions & 2 deletions
@@ -1,7 +1,10 @@
-# Exporting GEN AI Models {#ovms_demos_common_export}
+# Exporting models using a script {#ovms_demos_common_export}

-This script automates exporting models from Hugging Faces hub or fine-tuned in PyTorch format to the `models` repository for deployment with OpenVINO Model Server. In one step it prepares a complete set of resources in the `models` repository for a supported GenAI use case.
+This document describes how to export, optimize and configure models prior to server deployment using the provided Python script. This approach is more flexible than the [pull feature](../../../docs/pull_hf_models.md) of OVMS, because it also works with models that were not optimized beforehand and published in the OpenVINO organization on Hugging Face, but it requires a Python environment. You can find the script [here](https://github.com/openvinotoolkit/model_server/blob/main/demos/common/export_models/export_model.py). If your model is available in the [OpenVINO organization](https://huggingface.co/OpenVINO), you can follow the steps described [here](../../../docs/pull_hf_models.md).
+
+## What it does

+This script automates exporting models from the Hugging Face hub, or models fine-tuned in PyTorch format, to the `models` repository for deployment with OpenVINO Model Server. In one step it prepares a complete set of resources in the `models` repository for a supported GenAI use case.

## Quick Start
```console

docs/models_repository.md

Lines changed: 3 additions & 3 deletions
@@ -7,14 +7,14 @@ hidden:
---
ovms_docs_models_repository_classic
ovms_docs_models_repository_graph
-ovms_demos_common_export
+ovms_docs_prepare_genai

```

+Depending on what kind of models are to be served, follow the steps below for:

[Classical models](./models_repository_classic.md)

[Graphs](./models_repository_graph.md)

-[Generative use cases](../demos/common/export_models/README.md)
-
+[Generative AI use cases](./prepare_generative_use_cases.md)

docs/parameters.md

Lines changed: 61 additions & 2 deletions
@@ -22,8 +22,6 @@
| `"metrics_list"` | `string` | Comma separated list of [metrics](metrics.md). If unset, only default metrics will be enabled.|
| `"allowed_local_media_path"` | `string` | Path to the directory containing images to include in requests. If unset, local filesystem images in requests are not supported.|

-
-
> **Note** : Specifying config_path is mutually exclusive with putting model parameters in the CLI ([serving multiple models](./starting_server.md)).

| Option | Value format | Description |
@@ -55,4 +53,65 @@ Configuration options for the server are defined only via command-line options a
| `help` | `NA` | Shows help message and exit |
| `version` | `NA` | Shows binary version |

+## Pull mode configuration options
+
+Shared configuration options for the pull and the pull & start modes. When the `--pull` parameter is present, OVMS only pulls the model without serving it.
+
+### Pull Mode Options
+
+| Option | Value format | Description |
+|-----------------------------|--------------|---------------------------------------------------------------------------------------------------------------|
+| `--pull` | `NA` | Runs the server in pull mode to download the model from the Hugging Face repository. |
+| `--source_model` | `string` | Name of the model in the Hugging Face repository. If not set, `model_name` is used. `Required` |
+| `--model_repository_path` | `string` | Directory where all required model files will be saved. |
+| `--model_name` | `string` | Name of the model as exposed externally by the server. |
+| `--target_device` | `string` | Device name to be used to execute inference operations. Accepted values are: `"CPU"/"GPU"/"MULTI"/"HETERO"` |
+| `--task` | `string` | Task type the model will support (`text_generation`, `embedding`, `rerank`, `image_generation`). Default: `text_generation` |
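For illustration, the sketch below shows how these shared options combine on the command line, using the same placeholders as the rest of the documentation. With `--pull` the server only prepares the model repository; the pull & start variant described above omits `--pull` and also serves the model (the `--rest_port` value here is an assumption for illustration):

```text
# Pull only: download the model and prepare the configuration files, then exit
ovms --pull --source_model <model_name_in_HF> --model_repository_path /models --model_name <external_model_name> --target_device CPU --task text_generation

# Pull & start: the same options without --pull also start serving the model
ovms --source_model <model_name_in_HF> --model_repository_path /models --model_name <external_model_name> --target_device CPU --task text_generation --rest_port 8000
```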
+
+There are also additional environment variables that may change the behavior of pulling:
+
+### Environment Variables for Pull Mode
+
+| Variable | Value format | Description |
+|-----------------|--------------|--------------------------------------------------------------------------------------------------------------------------|
+| `HF_ENDPOINT` | `string` | Default: `huggingface.co`. For users in China, set to `https://hf-mirror.com` if needed. |
+| `HF_TOKEN` | `string` | Authentication token required for accessing some models from Hugging Face. |
+| `https_proxy` | `string` | If set, model downloads will use this proxy. |
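As an example, on a baremetal host these variables can be exported before running the pull command; the values below are placeholders, and in a Docker deployment they would instead be passed to the container with Docker's `-e` option:

```text
export HF_ENDPOINT=https://hf-mirror.com
export HF_TOKEN=<your_hf_access_token>
export https_proxy=http://<proxy_address>:<port>
ovms --pull --source_model <model_name_in_HF> --model_repository_path /models --model_name <external_model_name> --task text_generation
```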
+
+Task-specific parameters for the different tasks (text generation / image generation / embeddings / rerank) are listed below:
+
+### Text generation
+| Option | Value format | Description |
+|-------------------------------|--------------|----------------------------------------------------------------------------------------------------------------|
+| `--max_num_seqs` | `integer` | The maximum number of sequences that can be processed together. Default: 256. |
+| `--pipeline_type` | `string` | Type of the pipeline to be used. Choices: `LM`, `LM_CB`, `VLM`, `VLM_CB`, `AUTO`. Default: `AUTO`. |
+| `--enable_prefix_caching` | `bool` | Enables the algorithm to cache the prompt tokens. Default: true. |
+| `--max_num_batched_tokens` | `integer` | The maximum number of tokens that can be batched together. |
+| `--cache_size` | `integer` | Cache size in GB. Default: 10. |
+| `--draft_source_model` | `string` | HF model name or path to the local folder with a PyTorch or OpenVINO draft model. |
+| `--dynamic_split_fuse` | `bool` | Enables the dynamic split fuse algorithm. Default: true. |
+| `--max_prompt_len` | `integer` | Sets the NPU-specific property for the maximum number of tokens in the prompt. |
+| `--kv_cache_precision` | `string` | Reducing the kv cache precision to `u8` lowers the cache size consumption. Accepted values: `u8` or empty (default). |
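For example, a text generation pull could tune a few of these options; the values below are purely illustrative and should be adjusted to the target model and hardware:

```text
ovms --pull --source_model <model_name_in_HF> --model_repository_path /models --model_name <external_model_name> --task text_generation --pipeline_type LM --max_num_seqs 128 --cache_size 4 --kv_cache_precision u8
```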
+
+### Image generation
+| Option | Value format | Description |
+|-----------------------------------|--------------|---------------------------------------------------------------------------------------------------------------------|
+| `--max_resolution` | `string` | Maximum allowed resolution in the format `WxH` (W = width, H = height). If not specified, inherited from the model. |
+| `--default_resolution` | `string` | Default resolution in the format `WxH` when not specified by the client. If not specified, inherited from the model.|
+| `--max_num_images_per_prompt` | `integer` | Maximum number of images a client can request per prompt in a single request. In the 2025.2 release only 1 generated image per request is supported. |
+| `--default_num_inference_steps` | `integer` | Default number of inference steps when not specified by the client. |
+| `--max_num_inference_steps` | `integer` | Maximum number of inference steps a client can request for a given model. |
+| `--num_streams` | `integer` | Number of parallel execution streams for image generation models. Use at least 2 on 2-socket CPU systems. |
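An image generation pull might look like the sketch below; the resolutions and step count are illustrative values only:

```text
ovms --pull --source_model <model_name_in_HF> --model_repository_path /models --model_name <external_model_name> --task image_generation --max_resolution 1024x1024 --default_resolution 512x512 --default_num_inference_steps 20 --num_streams 2
```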
+
+### Embeddings
+| Option | Value format | Description |
+|---------------------------|--------------|--------------------------------------------------------------------------------|
+| `--num_streams` | `integer` | The number of parallel execution streams to use for the model. Use at least 2 on 2-socket CPU systems. Default: 1. |
+| `--normalize` | `bool` | Normalize the embeddings. Default: true. |
+| `--mean_pooling` | `bool` | Mean pooling option. Default: false. |

+### Rerank
+| Option | Value format | Description |
+|---------------------------|--------------|--------------------------------------------------------------------------------|
+| `--num_streams` | `integer` | The number of parallel execution streams to use for the model. Use at least 2 on 2-socket CPU systems. Default: 1. |
+| `--max_allowed_chunks` | `integer` | Maximum allowed chunks. Default: 10000. |
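The remaining tasks follow the same pattern; for instance (the `--task` values follow the list in the Pull Mode Options table, and the numbers are only examples):

```text
# Embeddings model
ovms --pull --source_model <model_name_in_HF> --model_repository_path /models --model_name <embeddings_model_name> --task embedding --num_streams 2

# Rerank model
ovms --pull --source_model <model_name_in_HF> --model_repository_path /models --model_name <rerank_model_name> --task rerank --max_allowed_chunks 5000
```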
docs/prepare_generative_use_cases.md

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+# Exporting GEN AI Models {#ovms_docs_prepare_genai}
+
+```{toctree}
+---
+maxdepth: 1
+hidden:
+---
+
+ovms_docs_pull
+ovms_demos_common_export
+
+```
+
+Prepare the model using the OVMS [pull mode](./pull_hf_models.md) when it is available in the [OpenVINO organization](https://huggingface.co/OpenVINO).
+
+Prepare models using the [Python script](./export_model_script.md) otherwise.
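A rough sketch of the script-based path is shown below; the file paths, subcommand and flags are assumptions for illustration, so check the export script's `--help` and its README for the authoritative interface:

```text
pip install -r demos/common/export_models/requirements.txt
python demos/common/export_models/export_model.py text_generation --source_model <model_name_in_HF> --model_repository_path models
```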

docs/pull_hf_models.md

Lines changed: 39 additions & 131 deletions
@@ -1,152 +1,60 @@
-*Note:*
-This functionality is a work in progress
-
-# Pulling the models {#ovms_pul}
-
-There is a special mode to make OVMS pull the model from Hugging Face before starting the service:
-
-```
-docker run -d --rm -v <model_repository_path>:/models openvino/model_server:latest --pull --source_model <model_name_in_HF> --model_repository_path /models --model_name <external_model_name> --task <task> --task_params <task_params>
-```
-
-| option | description |
-|---------------------------|-----------------------------------------------------------------------------------------------|
-| `--pull` | Instructs the server to run in pulling mode to get the model from the Hugging Face repository |
-| `--source_model` | Specifies the model name in the Hugging Face model repository (optional - if empty model_name is used) |
-| `--model_repository_path` | Directory where all required model files will be saved |
-| `--model_name` | Name of the model as exposed externally by the server |
-| `--task` | Defines the task the model will support (e.g., text_generation/embedding, rerank, etc.) |
-| `--task_params` | Task-specific parameters in a format to be determined (TBD FIXME) |
-
+# OVMS Pull mode {#ovms_docs_pull}

-It will prepare all needed configuration files to support LLMS with OVMS in model repository
+This document describes how to leverage the OpenVINO Model Server (OVMS) pull feature to automate deployment configuration with Generative AI models from the OpenVINO organization in HuggingFace (HF). This approach assumes that you are pulling models from the [OpenVINO organization](https://huggingface.co/OpenVINO) on HF. If the model is not from that organization, follow the steps described in [this document](../demos/common/export_models/README.md).

-# Starting the mediapipe graph or LLM models
-Now you can start server with single mediapipe graph, or LLM model that is already present in local filesystem with:
+### Pulling the models

-```
-docker run -d --rm -v <model_repository_path>:/models -p 9000:9000 -p 8000:8000 openvino/model_server:latest \
---model_path <path_to_model> --model_name <model_name> --port 9000 --rest_port 8000
-```
-
-Server will detect the type of requested servable (model or mediapipe graph) and load it accordingly. This detection is based on the presence of a `.pbtxt` file, which defines the Mediapipe graph structure.
-
-*Note*: There is no online model modification nor versioning capability as of now for graphs, LLM like models.
-
-# Starting the LLM model from HF directly
+There is a special mode to make OVMS pull the model from Hugging Face before starting the service:

-In case you do not want to prepare model repository before starting the server in one command you can run OVMS with:
+::::{tab-set}
+:::{tab-item} With Docker
+:sync: docker
+**Required:** Docker Engine installed

+```text
+docker run --user $(id -u):$(id -g) --rm -v <model_repository_path>:/models:rw openvino/model_server:latest --pull --source_model <model_name_in_HF> --model_repository_path /models --model_name <external_model_name> --target_device <DEVICE> --task <task> [TASK_SPECIFIC_PARAMETERS]
```
-docker run -d --rm -v <model_repository_path>:/models openvino/model_server:latest --source_model <model_name_in_HF> --model_repository_path /models --model_name <ovms_servable_name> --task <task> --task_params <task_params>
-```
-
-It will download required model files, prepare configuration for OVMS and start serving the model.
-
-# Starting the LLM model from local storage
+:::

-In case you have predownloaded the model files from HF but you lack OVMS configuration files you can start OVMS with
-```
-docker run -d --rm -v <model_repository_path>:/models openvino/model_server:latest --source_model <model_name_in_HF> --model_repository_path <path_where_to_store_ovms_config_files> --model_name <external_model_name> --task <task> --task_params <task_params>
-```
-This command will create graph.pbtxt in the ```model_repository_path/source_model``` path.
-
-# Simplified mediapipe graphs and LLM models loading
-
-Now there is an easier way to specify LLM configurations in `config.json`. In the `model_config` section, it is sufficient to specify `model_name` and `base_path`, and the server will detect if there is a graph configuration file (`.pbtxt`) present and load the servable accordingly.
-
-For example, the `model_config` section in `config.json` could look like this:
-
-```json
-{
-    "model_config_list": [
-        {
-            "config": {
-                "name": "text_generation_model",
-                "base_path": "/models/text_generation_model"
-            }
-        },
-        {
-            "config": {
-                "name": "embedding_model",
-                "base_path": "/models/embedding_model"
-            }
-        },
-        {
-            "config": {
-                "name": "mediapipe_graph",
-                "base_path": "/models/mediapipe_graph"
-            }
-        }
-    ]
-}
-```
-# List models
+:::{tab-item} On Baremetal Host
+:sync: baremetal
+**Required:** OpenVINO Model Server package - see [deployment instructions](../deploying_server_baremetal.md) for details.

-To check what models are servable from specified model repository:
-```
-docker run -d --rm -v <model_repository_path>:/models openvino/model_server:latest \
---model_repository_path /models --list_models
+```text
+ovms --pull --source_model <model_name_in_HF> --model_repository_path <model_repository_path> --model_name <external_model_name> --target_device <DEVICE> --task <task> [TASK_SPECIFIC_PARAMETERS]
```
+:::
+::::

-For following directory structure:
-```
-/models
-├── meta
-│   ├── llama4
-│   │   └── graph.pbtxt
-│   ├── llama3.1
-│   │   └── graph.pbtxt
-├── LLama3.2
-│   └── graph.pbtxt
-└── resnet
-    └── 1
-        └── saved_model.pb
-```
+Example for pulling `OpenVINO/Phi-3-mini-FastDraft-50M-int8-ov`:

-The output would be:
-```
-meta/llama4
-meta/llama3.1
-LLama3.2
-resnet
+```text
+ovms --pull --source_model "OpenVINO/Phi-3-mini-FastDraft-50M-int8-ov" --model_repository_path /models --model_name Phi-3-mini-FastDraft-50M-int8-ov --target_device CPU --task text_generation
```
+::::{tab-set}
+:::{tab-item} With Docker
+:sync: docker
+**Required:** Docker Engine installed

-# Enable model
-
-To add model to ovms configuration file with specific model use either:
-
+```text
+docker run --user $(id -u):$(id -g) --rm -v <model_repository_path>:/models:rw openvino/model_server:latest --pull --source_model "OpenVINO/Phi-3-mini-FastDraft-50M-int8-ov" --model_repository_path /models --model_name Phi-3-mini-FastDraft-50M-int8-ov --task text_generation
```
-docker run -d --rm -v <model_repository_path>:/models openvino/model_server:latest \
---model_repository_path /models/<model_path> --add_to_config <config_file_directory_path> --model_name <name>
-```
-
-When model is directly inside `/models`.
+:::

-Or
+:::{tab-item} On Baremetal Host
+:sync: baremetal
+**Required:** OpenVINO Model Server package - see [deployment instructions](../deploying_server_baremetal.md) for details.

+```text
+ovms --pull --source_model "OpenVINO/Phi-3-mini-FastDraft-50M-int8-ov" --model_repository_path /models --model_name Phi-3-mini-FastDraft-50M-int8-ov --task text_generation
```
-docker run -d --rm -v <model_repository_path>:/models openvino/model_server:latest \
---add_to_config <config_file_directory_path> --model_name <name> --model_path <model_path>
-```
-when there is no model_repository specified.
+:::
+::::

-## TIP: Use relative paths to make the config.json transferable in model_repository across ovms instances.
-For example:
-```
-cd model_repository_path
-ovms --add_to_config . --model_name OpenVINO/DeepSeek-R1-Distill-Qwen-1.5B-int4-ov --model_repository_path .
-```

-# Disable model
+It will prepare all needed configuration files to support LLMs with OVMS in the model repository. Check the [parameters page](./parameters.md) for detailed descriptions of the configuration options and parameter usage.

-If you want to remove model from configuration file you can do it either manually or use command:
+In case you want to set up the model and start the server in one step, follow the instructions on [this page](./starting_server.md).

-```
-docker run -d --rm -v <model_repository_path>:/models openvino/model_server:latest \
---remove_from_config <config_file_directory_path> --model_name <name>
-```
-
-FIXME TODO TBD
-- adjust existing documentation to link with this doc
-- task, task_params to be updated explained
+*Note:*
+When using pull mode you need both read and write access rights to the models repository.
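For completeness, the one-step pull & start flow referenced above can be sketched like this; it is an illustrative variant of the Docker example, and the published port and `--rest_port` value are assumptions, so see [starting the server](./starting_server.md) for the authoritative syntax:

```text
docker run --user $(id -u):$(id -g) --rm -v <model_repository_path>:/models:rw -p 8000:8000 openvino/model_server:latest --source_model "OpenVINO/Phi-3-mini-FastDraft-50M-int8-ov" --model_repository_path /models --model_name Phi-3-mini-FastDraft-50M-int8-ov --task text_generation --rest_port 8000
```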

0 commit comments
