demos/README.md
OpenVINO Model Server demos have been created to showcase the usage of the model server and to demonstrate its capabilities.

### Check Out New Generative AI Demos

| Demo | Description |
|---|---|
|[LLM Text Generation with continuous batching](continuous_batching/README.md)|Generate text with LLM models and a continuous batching pipeline|
|[VLM Text Generation with continuous batching](continuous_batching/vlm/README.md)|Generate text with VLM models and a continuous batching pipeline|
|[OpenAI API text embeddings](embeddings/README.md)|Get text embeddings via an endpoint compatible with the OpenAI API|
|[Reranking with Cohere API](rerank/README.md)|Rerank documents via an endpoint compatible with the Cohere API|
|[RAG with OpenAI API endpoint and langchain](https://github.com/openvinotoolkit/model_server/blob/main/demos/continuous_batching/rag/rag_demo.ipynb)|Example of how to use RAG with model server endpoints|
|[LLM on NPU](./llm_npu/README.md)|Generate text with LLM models and NPU acceleration|
|[VLM on NPU](./vlm_npu/README.md)|Generate text with VLM models and NPU acceleration|
|[VisualCode assistant](./code_completion_copilot/README.md)|Use the Continue extension in Visual Studio Code with a local OVMS|

Check out the list below to see complete step-by-step examples of using OpenVINO Model Server with real-world use cases:
:::{dropdown} **Unary call with cURL**

```console
curl http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"meta-llama/Llama-3.1-8B-Instruct\", \"max_tokens\":30,\"stream\":false, \"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},{\"role\": \"user\",\"content\": \"What is OpenVINO?\"}]}"
```
```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "OpenVINO (Open Visual Inference and Optimization for computational resources) is an open-source toolkit that automates neural network model computations across various platforms and",
        "role": "assistant"
      }
    }
  ],
  "created": 1742944805,
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 47,
    "completion_tokens": 30,
    "total_tokens": 77
  }
}
```

A similar call can be made with the `completions` endpoint:
```console
curl http://localhost:8000/v3/completions -H "Content-Type: application/json" -d "{\"model\": \"meta-llama/Llama-3.1-8B-Instruct\",\"max_tokens\":30,\"stream\":false,\"prompt\": \"You are a helpful assistant. What is OpenVINO? \"}"
```
```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "text": " Introduction\nOpenVINO can be used in automation of various business processes, which brings timely assistance in operations with these models. Additionally OpenVINO simpl"
    }
  ],
  "created": 1742944929,
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "object": "text_completion",
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 30,
    "total_tokens": 44
  }
}
```
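
The same chat request can also be streamed by setting `stream` to true. A minimal sketch: the server then returns incremental server-sent `data:` chunks rather than a single JSON body.

```console
curl http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"meta-llama/Llama-3.1-8B-Instruct\", \"max_tokens\":30,\"stream\":true, \"messages\": [{\"role\": \"user\",\"content\": \"What is OpenVINO?\"}]}"
```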

The same generation can be run from Python with the OpenAI client, pointed at the server address used in the cURL examples above (the `api_key` value is a placeholder, since the server does not validate it):

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v3",
    api_key="unused"  # placeholder; the model server does not validate the key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say this is a test"}],
)
print(response.choices[0].message.content)
```
Check the [guide on using lm-evaluation-harness](https://github.com/openvinotoolkit/model_server/blob/releases/2025/1/demos/continuous_batching/accuracy/README.md).

> **Note:** Text generation on NPU does not return log_probs, which are required to calculate some of the metrics. Only tasks of type `generate_until` can be used, for example `--tasks leaderboard_ifeval`.
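
A hedged sketch of such a run, using the `local-completions` backend of lm-evaluation-harness; the model name, endpoint URL, and flags below are assumptions, not taken from the linked guide:

```console
lm_eval --model local-completions \
    --model_args model=meta-llama/Llama-3.1-8B-Instruct,base_url=http://localhost:8000/v3/completions,tokenized_requests=False \
    --tasks leaderboard_ifeval
```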
## Limitations

- beam_search algorithm is not supported with NPU. Greedy search and multinomial algorithms are supported.
- models must be exported with INT4 precision and the `--sym --ratio 1.0 --group-size -1` params. This is enforced in the export_model.py script when the target_device is NPU; see the sketch after this list.
- log_probs are not supported
- finish reason is always set to "stop".
- only a single response can be returned. Parameter `n` is not supported.
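
For illustration, a minimal sketch of such an export, assuming the export_model.py script from the model server repository; the model name and exact flags are assumptions, so check `export_model.py --help` for the authoritative options:

```console
# Hypothetical invocation: the INT4 symmetric quantization params
# are applied automatically by the script when --target_device is NPU.
python export_model.py text_generation \
    --source_model meta-llama/Llama-3.1-8B-Instruct \
    --weight-format int4 \
    --target_device NPU \
    --config_file_path models/config.json
```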