parser.add_argument('--weight-format', default='int8', help='precision of the exported model', dest='precision')
parser.add_argument('--config_file_path', default='config.json', help='path to the config file', dest='config_file_path')
parser.add_argument('--overwrite_models', default=False, action='store_true', help='Overwrite the model if it already exists in the models repository', dest='overwrite_models')
- parser.add_argument('--target_device', default="CPU", help='CPUor GPU, default is CPU', dest='target_device')
+ parser.add_argument('--target_device', default="CPU", help='CPU, GPU, NPU or HETERO, default is CPU', dest='target_device')
parser=argparse.ArgumentParser(description='Export Hugging face models to OVMS models repository including all configuration for deployments')
parser_text=subparsers.add_parser('text_generation', help='export model for chat and completion endpoints')
add_common_arguments(parser_text)
- parser_text.add_argument('--pipeline_type', default=None, help='Type of the pipeline to be used. Can be either TEXT_CB or VLM_CB. When undefined, it will be autodetected', dest='pipeline_type')
+ parser_text.add_argument('--pipeline_type', default=None, choices=["LM", "LM_CB", "VLM", "VLM_CB", "AUTO"], help='Type of the pipeline to be used. AUTO is used by default', dest='pipeline_type')
parser_text.add_argument('--kv_cache_precision', default=None, choices=["u8"], help='u8 or empty (model default). Reducing kv cache precision to u8 lowers the cache size consumption.', dest='kv_cache_precision')
parser_text.add_argument('--enable_prefix_caching', action='store_true', help='This algorithm is used to cache the prompt tokens.', dest='enable_prefix_caching')
parser_text.add_argument('--draft_source_model', required=False, default=None, help='HF model name or path to the local folder with PyTorch or OpenVINO draft model. '
'Using this option will create configuration for speculative decoding', dest='draft_source_model')
parser_text.add_argument('--draft_model_name', required=False, default=None, help='Draft model name that should be used in the deployment. '
'Equal to draft_source_model if HF model name is used. Available only if draft_source_model has been specified.', dest='draft_model_name')
+ parser_text.add_argument('--max_prompt_len', required=False, type=int, default=None, help='Sets NPU specific property for maximum number of tokens in the prompt. '
+ 'Not effective if target device is not NPU', dest='max_prompt_len')
parser_embeddings=subparsers.add_parser('embeddings', help='export model for embeddings endpoint')
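For orientation, the fragments above can be read against a minimal, self-contained sketch of the script's argparse skeleton. The `subparsers` creation and the `__main__` wiring are assumptions for illustration; only the argument definitions appear in this diff:

```python
import argparse

def add_common_arguments(parser):
    # Options shared by all subcommands, mirroring the fragments above.
    parser.add_argument('--weight-format', default='int8',
                        help='precision of the exported model', dest='precision')
    parser.add_argument('--config_file_path', default='config.json',
                        help='path to the config file', dest='config_file_path')
    parser.add_argument('--target_device', default='CPU',
                        help='CPU, GPU, NPU or HETERO, default is CPU', dest='target_device')

parser = argparse.ArgumentParser(
    description='Export Hugging face models to OVMS models repository '
                'including all configuration for deployments')
subparsers = parser.add_subparsers(dest='task', required=True)  # assumed wiring
parser_text = subparsers.add_parser(
    'text_generation', help='export model for chat and completion endpoints')
add_common_arguments(parser_text)
parser_text.add_argument('--pipeline_type', default=None,
                         choices=['LM', 'LM_CB', 'VLM', 'VLM_CB', 'AUTO'],
                         help='Type of the pipeline to be used. AUTO is used by default',
                         dest='pipeline_type')
parser_text.add_argument('--max_prompt_len', required=False, type=int, default=None,
                         help='Sets NPU specific property for maximum number of tokens '
                              'in the prompt. Not effective if target device is not NPU',
                         dest='max_prompt_len')

if __name__ == '__main__':
    args = parser.parse_args()
    print(vars(args))  # e.g. {'precision': 'int8', 'target_device': 'NPU', ...}
```

Because each `add_argument` call sets `dest`, the parsed namespace uses those names (`args.precision`, `args.max_prompt_len`) regardless of the flag spelling.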
demos/llm_npu/README.md (+3 −1)
@@ -72,7 +72,9 @@ models
└── tokenizer.json
```
- The default configuration should work in most cases but the parameters can be tuned via `export_model.py` script arguments. Run the script with `--help` argument to check available parameters and see the [LLM calculator documentation](../../docs/llm/reference.md) to learn more about configuration options.
+ The default configuration should work in most cases but the parameters can be tuned via `export_model.py` script arguments.
+ Note that by default, NPU limits the prompt length to 1024 tokens. You can modify that limit by using the `--max_prompt_len` parameter.
+ Run the script with the `--help` argument to check available parameters and see the [LLM calculator documentation](../../docs/llm/reference.md) to learn more about configuration options.
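For example, a hypothetical export that raises the limit to 2048 tokens could look like the following; only flags visible in this diff are shown, and the model-selection arguments are omitted (run the script with `--help` for the full list):

```bash
python export_model.py text_generation \
    --target_device NPU \
    --max_prompt_len 2048 \
    --config_file_path models/config.json
```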
+ Note that by default, NPU limits the prompt length (which in VLM also includes image tokens) to 1024 tokens. You can modify that limit by using the `--max_prompt_len` parameter.
> **Note:** You can change the model used in the demo to any topology [tested](https://github.com/openvinotoolkit/openvino.genai/blob/master/SUPPORTED_MODELS.md#visual-language-models) with OpenVINO.
There are several known limitations which are expected to be addressed in the coming releases:
- Metrics related to text generation are not exposed via the `metrics` endpoint. Key metrics from LLM calculators are included in the server logs with information about active requests, requests scheduled for text generation, and KV cache usage. It is possible to track the number of active generation requests using the metric `ovms_current_graphs`. Tracking statistics for requests and responses is also possible. [Learn more](../metrics.md)
- - Multi modal models are not supported yet. Images can't be sent now as the context.
- `logprobs` parameter is currently not supported in streaming mode. It includes only a single logprob and does not include values for input tokens.
- Server logs might sporadically include the message "PCRE2 substitution failed with error code -55" - this message can be safely ignored. It will be removed in the next version.
@@ -210,10 +209,14 @@ Some servable types introduce additional limitations:
- `finish_reason` not supported (always set to `stop`),
- `logprobs` not supported,
- sequential request processing (only one request is handled at a time)
+ - only a single response can be returned. Parameter `n` is not supported.
+ - **[NPU only]** beam_search algorithm is not supported with NPU. Greedy search and multinomial algorithms are supported.
+ - **[NPU only]** models must be exported with INT4 precision and `--sym --ratio 1.0 --group-size -1` params. This is enforced in the export_model.py script when the target_device is NPU.
### Visual Language servable limitations
- works only on `/chat/completions` endpoint,
- `image_url` input supports only base64 encoded image, not an actual URL
+ - **[NPU only]** requests MUST include one and only one image in the messages context. Other requests will be rejected.
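To illustrate the constraints above, here is a sketch of a client request carrying exactly one base64-encoded image. The server address, `/v3` endpoint prefix, and model name are assumptions based on the OpenAI-compatible API; adjust them to your deployment:

```python
import base64
import requests  # third-party; pip install requests

# Assumed server address and model name -- adjust to your deployment.
BASE_URL = "http://localhost:8000/v3"
MODEL = "OpenGVLab/InternVL2-2B"  # illustrative only

with open("image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": MODEL,
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            # Exactly one image per request on NPU; base64 data URL, not a remote URL.
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
}

resp = requests.post(f"{BASE_URL}/chat/completions", json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```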