
Commit 211d338

atobiszei and dtrawins authored
Pull docs update (#3381)
Co-authored-by: Trawinski, Dariusz <[email protected]>
1 parent dd5934c commit 211d338

File tree: 6 files changed, +345 / -195 lines


demos/common/export_models/README.md

Lines changed: 5 additions & 2 deletions
@@ -1,7 +1,10 @@
-# Exporting GEN AI Models {#ovms_demos_common_export}
+# Exporting models using a script {#ovms_demos_common_export}

-This script automates exporting models from Hugging Faces hub or fine-tuned in PyTorch format to the `models` repository for deployment with OpenVINO Model Server. In one step it prepares a complete set of resources in the `models` repository for a supported GenAI use case.
+This document describes how to export, optimize and configure models prior to server deployment using the provided Python script. This approach is more flexible than the [pull feature](../../../docs/pull_hf_models.md) of OVMS, because it also works with models that were not optimized beforehand and published in the OpenVINO organization on Hugging Face, but it requires a Python environment. You can find the script [here](https://github.com/openvinotoolkit/model_server/blob/main/demos/common/export_models/export_model.py). If your model is available in the [OpenVINO organization](https://huggingface.co/OpenVINO), you can follow the steps described [here](../../../docs/pull_hf_models.md).
+
+## What it does

+This script automates exporting models from the Hugging Face hub, or models fine-tuned in PyTorch format, to the `models` repository for deployment with OpenVINO Model Server. In one step it prepares a complete set of resources in the `models` repository for a supported GenAI use case.

## Quick Start
```console

docs/models_repository.md

Lines changed: 3 additions & 3 deletions
@@ -7,14 +7,14 @@ hidden:
---
ovms_docs_models_repository_classic
ovms_docs_models_repository_graph
-ovms_demos_common_export
+ovms_docs_prepare_genai

```

+Depending on what kind of models are to be served, follow the steps below for:

[Classical models](./models_repository_classic.md)

[Graphs](./models_repository_graph.md)

-[Generative use cases](../demos/common/export_models/README.md)
-
+[Generative AI use cases](./prepare_generative_use_cases.md)

docs/parameters.md

Lines changed: 61 additions & 2 deletions
@@ -22,8 +22,6 @@
| `"metrics_list"` | `string` | Comma separated list of [metrics](metrics.md). If unset, only default metrics will be enabled.|
| `"allowed_local_media_path"` | `string` | Path to the directory containing images to include in requests. If unset, local filesystem images in requests are not supported.|

-
-
> **Note** : Specifying config_path is mutually exclusive with putting model parameters in the CLI ([serving multiple models](./starting_server.md)).

| Option | Value format | Description |
@@ -55,4 +53,65 @@ Configuration options for the server are defined only via command-line options a
| `help` | `NA` | Shows help message and exit |
| `version` | `NA` | Shows binary version |

+## Pull mode configuration options
+
+Shared configuration options for the pull and the pull & start modes. When the `--pull` parameter is present, OVMS only pulls the model without serving it.
+
+### Pull Mode Options
+
+| Option | Value format | Description |
+|-----------------------------|--------------|---------------------------------------------------------------------------------------------------------------|
+| `--pull` | `NA` | Runs the server in pull mode to download the model from the Hugging Face repository. |
+| `--source_model` | `string` | Name of the model in the Hugging Face repository. If not set, `model_name` is used. `Required` |
+| `--model_repository_path` | `string` | Directory where all required model files will be saved. |
+| `--model_name` | `string` | Name of the model as exposed externally by the server. |
+| `--target_device` | `string` | Device name to be used to execute inference operations. Accepted values are: `"CPU"/"GPU"/"MULTI"/"HETERO"` |
+| `--task` | `string` | Task type the model will support (`text_generation`, `embedding`, `rerank`, `image_generation`). Default: `text_generation` |
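For illustration, the sketch below shows how these shared options combine on the command line, using the same placeholders as the rest of the documentation. With `--pull` the server only prepares the model repository; the pull & start variant described above omits `--pull` and also serves the model (the `--rest_port` value here is an assumption for illustration):

```text
# Pull only: download the model and prepare the configuration files, then exit
ovms --pull --source_model <model_name_in_HF> --model_repository_path /models --model_name <external_model_name> --target_device CPU --task text_generation

# Pull & start: the same options without --pull also start serving the model
ovms --source_model <model_name_in_HF> --model_repository_path /models --model_name <external_model_name> --target_device CPU --task text_generation --rest_port 8000
```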
+
+There are also additional environment variables that may change the behavior of pulling:
+
+### Environment Variables for Pull Mode
+
+| Variable | Value format | Description |
+|-----------------|--------------|--------------------------------------------------------------------------------------------------------------------------|
+| `HF_ENDPOINT` | `string` | Default: `huggingface.co`. For users in China, set to `https://hf-mirror.com` if needed. |
+| `HF_TOKEN` | `string` | Authentication token required for accessing some models from Hugging Face. |
+| `https_proxy` | `string` | If set, model downloads will use this proxy. |
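As an example, on a baremetal host these variables can be exported before running the pull command; the values below are placeholders, and in a Docker deployment they would instead be passed to the container with Docker's `-e` option:

```text
export HF_ENDPOINT=https://hf-mirror.com
export HF_TOKEN=<your_hf_access_token>
export https_proxy=http://<proxy_address>:<port>
ovms --pull --source_model <model_name_in_HF> --model_repository_path /models --model_name <external_model_name> --task text_generation
```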
+
+Task-specific parameters for the different tasks (text generation / image generation / embeddings / rerank) are listed below:
+
+### Text generation
+| Option | Value format | Description |
+|-------------------------------|--------------|----------------------------------------------------------------------------------------------------------------|
+| `--max_num_seqs` | `integer` | The maximum number of sequences that can be processed together. Default: 256. |
+| `--pipeline_type` | `string` | Type of the pipeline to be used. Choices: `LM`, `LM_CB`, `VLM`, `VLM_CB`, `AUTO`. Default: `AUTO`. |
+| `--enable_prefix_caching` | `bool` | Enables the algorithm to cache the prompt tokens. Default: true. |
+| `--max_num_batched_tokens` | `integer` | The maximum number of tokens that can be batched together. |
+| `--cache_size` | `integer` | Cache size in GB. Default: 10. |
+| `--draft_source_model` | `string` | HF model name or path to the local folder with a PyTorch or OpenVINO draft model. |
+| `--dynamic_split_fuse` | `bool` | Enables the dynamic split fuse algorithm. Default: true. |
+| `--max_prompt_len` | `integer` | Sets the NPU-specific property for the maximum number of tokens in the prompt. |
+| `--kv_cache_precision` | `string` | Reducing the kv cache precision to `u8` lowers the cache size consumption. Accepted values: `u8` or empty (default). |
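For example, a text generation pull could tune a few of these options; the values below are purely illustrative and should be adjusted to the target model and hardware:

```text
ovms --pull --source_model <model_name_in_HF> --model_repository_path /models --model_name <external_model_name> --task text_generation --pipeline_type LM --max_num_seqs 128 --cache_size 4 --kv_cache_precision u8
```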
+
+### Image generation
+| Option | Value format | Description |
+|-----------------------------------|--------------|---------------------------------------------------------------------------------------------------------------------|
+| `--max_resolution` | `string` | Maximum allowed resolution in the format `WxH` (W = width, H = height). If not specified, inherited from the model. |
+| `--default_resolution` | `string` | Default resolution in the format `WxH` when not specified by the client. If not specified, inherited from the model.|
+| `--max_num_images_per_prompt` | `integer` | Maximum number of images a client can request per prompt in a single request. In the 2025.2 release only 1 generated image per request is supported. |
+| `--default_num_inference_steps` | `integer` | Default number of inference steps when not specified by the client. |
+| `--max_num_inference_steps` | `integer` | Maximum number of inference steps a client can request for a given model. |
+| `--num_streams` | `integer` | Number of parallel execution streams for image generation models. Use at least 2 on 2-socket CPU systems. |
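An image generation pull might look like the sketch below; the resolutions and step count are illustrative values only:

```text
ovms --pull --source_model <model_name_in_HF> --model_repository_path /models --model_name <external_model_name> --task image_generation --max_resolution 1024x1024 --default_resolution 512x512 --default_num_inference_steps 20 --num_streams 2
```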
+
+### Embeddings
+| Option | Value format | Description |
+|---------------------------|--------------|--------------------------------------------------------------------------------|
+| `--num_streams` | `integer` | The number of parallel execution streams to use for the model. Use at least 2 on 2-socket CPU systems. Default: 1. |
+| `--normalize` | `bool` | Normalize the embeddings. Default: true. |
+| `--mean_pooling` | `bool` | Mean pooling option. Default: false. |

+### Rerank
+| Option | Value format | Description |
+|---------------------------|--------------|--------------------------------------------------------------------------------|
+| `--num_streams` | `integer` | The number of parallel execution streams to use for the model. Use at least 2 on 2-socket CPU systems. Default: 1. |
+| `--max_allowed_chunks` | `integer` | Maximum allowed chunks. Default: 10000. |
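The remaining tasks follow the same pattern; for instance (the `--task` values follow the list in the Pull Mode Options table, and the numbers are only examples):

```text
# Embeddings model
ovms --pull --source_model <model_name_in_HF> --model_repository_path /models --model_name <embeddings_model_name> --task embedding --num_streams 2

# Rerank model
ovms --pull --source_model <model_name_in_HF> --model_repository_path /models --model_name <rerank_model_name> --task rerank --max_allowed_chunks 5000
```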
docs/prepare_generative_use_cases.md

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+# Exporting GEN AI Models {#ovms_docs_prepare_genai}
+
+```{toctree}
+---
+maxdepth: 1
+hidden:
+---
+
+ovms_docs_pull
+ovms_demos_common_export
+
+```
+
+Prepare the model using the OVMS [pull mode](./pull_hf_models.md) when it is available in the [OpenVINO organization](https://huggingface.co/OpenVINO).
+
+Prepare models using the [Python script](./export_model_script.md) otherwise.
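A rough sketch of the script-based path is shown below; the file paths, subcommand and flags are assumptions for illustration, so check the export script's `--help` and its README for the authoritative interface:

```text
pip install -r demos/common/export_models/requirements.txt
python demos/common/export_models/export_model.py text_generation --source_model <model_name_in_HF> --model_repository_path models
```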

docs/pull_hf_models.md

Lines changed: 39 additions & 131 deletions
@@ -1,152 +1,60 @@
-*Note:*
-This functionality is a work in progress
-
-# Pulling the models {#ovms_pul}
-
-There is a special mode to make OVMS pull the model from Hugging Face before starting the service:
-
-```
-docker run -d --rm -v <model_repository_path>:/models openvino/model_server:latest --pull --source_model <model_name_in_HF> --model_repository_path /models --model_name <external_model_name> --task <task> --task_params <task_params>
-```
-
-| option | description |
-|---------------------------|-----------------------------------------------------------------------------------------------|
-| `--pull` | Instructs the server to run in pulling mode to get the model from the Hugging Face repository |
-| `--source_model` | Specifies the model name in the Hugging Face model repository (optional - if empty model_name is used) |
-| `--model_repository_path` | Directory where all required model files will be saved |
-| `--model_name` | Name of the model as exposed externally by the server |
-| `--task` | Defines the task the model will support (e.g., text_generation/embedding, rerank, etc.) |
-| `--task_params` | Task-specific parameters in a format to be determined (TBD FIXME) |
-
+# OVMS Pull mode {#ovms_docs_pull}

-It will prepare all needed configuration files to support LLMS with OVMS in model repository
+This document describes how to leverage the OpenVINO Model Server (OVMS) pull feature to automate deployment configuration with Generative AI models from the OpenVINO organization in HuggingFace (HF). This approach assumes that you are pulling models from the [OpenVINO organization](https://huggingface.co/OpenVINO) on HF. If the model is not from that organization, follow the steps described in [this document](../demos/common/export_models/README.md).

-# Starting the mediapipe graph or LLM models
-Now you can start server with single mediapipe graph, or LLM model that is already present in local filesystem with:
+### Pulling the models

-```
-docker run -d --rm -v <model_repository_path>:/models -p 9000:9000 -p 8000:8000 openvino/model_server:latest \
---model_path <path_to_model> --model_name <model_name> --port 9000 --rest_port 8000
-```
-
-Server will detect the type of requested servable (model or mediapipe graph) and load it accordingly. This detection is based on the presence of a `.pbtxt` file, which defines the Mediapipe graph structure.
-
-*Note*: There is no online model modification nor versioning capability as of now for graphs, LLM like models.
-
-# Starting the LLM model from HF directly
+There is a special mode to make OVMS pull the model from Hugging Face before starting the service:

-In case you do not want to prepare model repository before starting the server in one command you can run OVMS with:
+::::{tab-set}
+:::{tab-item} With Docker
+:sync: docker
+**Required:** Docker Engine installed

+```text
+docker run --user $(id -u):$(id -g) --rm -v <model_repository_path>:/models:rw openvino/model_server:latest --pull --source_model <model_name_in_HF> --model_repository_path /models --model_name <external_model_name> --target_device <DEVICE> --task <task> [TASK_SPECIFIC_PARAMETERS]
```
-docker run -d --rm -v <model_repository_path>:/models openvino/model_server:latest --source_model <model_name_in_HF> --model_repository_path /models --model_name <ovms_servable_name> --task <task> --task_params <task_params>
-```
-
-It will download required model files, prepare configuration for OVMS and start serving the model.
-
-# Starting the LLM model from local storage
+:::

-In case you have predownloaded the model files from HF but you lack OVMS configuration files you can start OVMS with
-```
-docker run -d --rm -v <model_repository_path>:/models openvino/model_server:latest --source_model <model_name_in_HF> --model_repository_path <path_where_to_store_ovms_config_files> --model_name <external_model_name> --task <task> --task_params <task_params>
-```
-This command will create graph.pbtxt in the ```model_repository_path/source_model``` path.
-
-# Simplified mediapipe graphs and LLM models loading
-
-Now there is an easier way to specify LLM configurations in `config.json`. In the `model_config` section, it is sufficient to specify `model_name` and `base_path`, and the server will detect if there is a graph configuration file (`.pbtxt`) present and load the servable accordingly.
-
-For example, the `model_config` section in `config.json` could look like this:
-
-```json
-{
-    "model_config_list": [
-        {
-            "config": {
-                "name": "text_generation_model",
-                "base_path": "/models/text_generation_model"
-            }
-        },
-        {
-            "config": {
-                "name": "embedding_model",
-                "base_path": "/models/embedding_model"
-            }
-        },
-        {
-            "config": {
-                "name": "mediapipe_graph",
-                "base_path": "/models/mediapipe_graph"
-            }
-        }
-    ]
-}
-```
-# List models
+:::{tab-item} On Baremetal Host
+:sync: baremetal
+**Required:** OpenVINO Model Server package - see [deployment instructions](../deploying_server_baremetal.md) for details.

-To check what models are servable from specified model repository:
-```
-docker run -d --rm -v <model_repository_path>:/models openvino/model_server:latest \
---model_repository_path /models --list_models
+```text
+ovms --pull --source_model <model_name_in_HF> --model_repository_path <model_repository_path> --model_name <external_model_name> --target_device <DEVICE> --task <task> [TASK_SPECIFIC_PARAMETERS]
```
+:::
+::::

-For following directory structure:
-```
-/models
-├── meta
-│   ├── llama4
-│   │   └── graph.pbtxt
-│   ├── llama3.1
-│   │   └── graph.pbtxt
-├── LLama3.2
-│   └── graph.pbtxt
-└── resnet
-    └── 1
-        └── saved_model.pb
-```
+Example for pulling `OpenVINO/Phi-3-mini-FastDraft-50M-int8-ov`:

-The output would be:
-```
-meta/llama4
-meta/llama3.1
-LLama3.2
-resnet
+```text
+ovms --pull --source_model "OpenVINO/Phi-3-mini-FastDraft-50M-int8-ov" --model_repository_path /models --model_name Phi-3-mini-FastDraft-50M-int8-ov --target_device CPU --task text_generation
```
+::::{tab-set}
+:::{tab-item} With Docker
+:sync: docker
+**Required:** Docker Engine installed

-# Enable model
-
-To add model to ovms configuration file with specific model use either:
-
+```text
+docker run --user $(id -u):$(id -g) --rm -v <model_repository_path>:/models:rw openvino/model_server:latest --pull --source_model "OpenVINO/Phi-3-mini-FastDraft-50M-int8-ov" --model_repository_path /models --model_name Phi-3-mini-FastDraft-50M-int8-ov --task text_generation
```
-docker run -d --rm -v <model_repository_path>:/models openvino/model_server:latest \
---model_repository_path /models/<model_path> --add_to_config <config_file_directory_path> --model_name <name>
-```
-
-When model is directly inside `/models`.
+:::

-Or
+:::{tab-item} On Baremetal Host
+:sync: baremetal
+**Required:** OpenVINO Model Server package - see [deployment instructions](../deploying_server_baremetal.md) for details.

+```text
+ovms --pull --source_model "OpenVINO/Phi-3-mini-FastDraft-50M-int8-ov" --model_repository_path /models --model_name Phi-3-mini-FastDraft-50M-int8-ov --task text_generation
```
-docker run -d --rm -v <model_repository_path>:/models openvino/model_server:latest \
---add_to_config <config_file_directory_path> --model_name <name> --model_path <model_path>
-```
-when there is no model_repository specified.
+:::
+::::

-## TIP: Use relative paths to make the config.json transferable in model_repository across ovms instances.
-For example:
-```
-cd model_repository_path
-ovms --add_to_config . --model_name OpenVINO/DeepSeek-R1-Distill-Qwen-1.5B-int4-ov --model_repository_path .
-```

-# Disable model
+It will prepare all needed configuration files to support LLMs with OVMS in the model repository. Check the [parameters page](./parameters.md) for detailed descriptions of the configuration options and parameter usage.

-If you want to remove model from configuration file you can do it either manually or use command:
+In case you want to set up the model and start the server in one step, follow the instructions on [this page](./starting_server.md).

-```
-docker run -d --rm -v <model_repository_path>:/models openvino/model_server:latest \
---remove_from_config <config_file_directory_path> --model_name <name>
-```
-
-FIXME TODO TBD
-- adjust existing documentation to link with this doc
-- task, task_params to be updated explained
+*Note:*
+When using pull mode you need both read and write access rights to the models repository.
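For completeness, the one-step pull & start flow referenced above can be sketched like this; it is an illustrative variant of the Docker example, and the published port and `--rest_port` value are assumptions, so see [starting the server](./starting_server.md) for the authoritative syntax:

```text
docker run --user $(id -u):$(id -g) --rm -v <model_repository_path>:/models:rw -p 8000:8000 openvino/model_server:latest --source_model "OpenVINO/Phi-3-mini-FastDraft-50M-int8-ov" --model_repository_path /models --model_name Phi-3-mini-FastDraft-50M-int8-ov --task text_generation --rest_port 8000
```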

0 commit comments
