
Commit 67365b1

1 parent f243950 commit 67365b1

File tree

4 files changed: +79 -46 lines changed

- demos/README.md
- demos/continuous_batching/vlm/README.md
- demos/llm_npu/README.md
- demos/vlm_npu/README.md

demos/README.md

Lines changed: 11 additions & 5 deletions
@@ -44,11 +44,17 @@ ovms_string_output_model_demo
 
 OpenVINO Model Server demos have been created to showcase the usage of the model server as well as demonstrate its capabilities.
 ### Check Out New Generative AI Demos
-- [LLM Text Generation with continuous batching](continuous_batching/README.md)
-- [VLM Text Generation with continuous batching](continuous_batching/vlm/README.md)
-- [OpenAI API text embeddings ](embeddings/README.md)
-- [Reranking with Cohere API](rerank/README.md)
-- [RAG with OpenAI API endpoint and langchain](https://github.com/openvinotoolkit/model_server/blob/releases/2025/1/demos/continuous_batching/rag/rag_demo.ipynb)
+| Demo | Description |
+|---|---|
+| [LLM Text Generation with continuous batching](continuous_batching/README.md) | Generate text with LLM models and the continuous batching pipeline |
+| [VLM Text Generation with continuous batching](continuous_batching/vlm/README.md) | Generate text with VLM models and the continuous batching pipeline |
+| [OpenAI API text embeddings](embeddings/README.md) | Get text embeddings via an endpoint compatible with the OpenAI API |
+| [Reranking with Cohere API](rerank/README.md) | Rerank documents via an endpoint compatible with the Cohere API |
+| [RAG with OpenAI API endpoint and langchain](https://github.com/openvinotoolkit/model_server/blob/main/demos/continuous_batching/rag/rag_demo.ipynb) | Example of using RAG with model server endpoints |
+| [LLM on NPU](./llm_npu/README.md) | Generate text with LLM models and NPU acceleration |
+| [VLM on NPU](./vlm_npu/README.md) | Generate text with VLM models and NPU acceleration |
+| [VisualCode assistant](./code_completion_copilot/README.md) | Use the Continue extension in Visual Studio Code with a local OVMS |
+
 
 Check out the list below to see complete step-by-step examples of using OpenVINO Model Server with real world use cases:
 
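Several of the demos in the new table expose OpenAI-compatible REST endpoints. As an illustrative sketch only (not taken from the repository), a minimal request against the embeddings endpoint could look like the following, assuming OVMS is running locally with REST port 8000 and an embeddings model named `Alibaba-NLP/gte-large-en-v1.5` has already been deployed as described in the embeddings demo:

```python
# Hedged sketch of calling the OpenAI-compatible embeddings endpoint.
# The model name and port are assumptions taken from a typical demo setup.
import requests

base_url = "http://localhost:8000/v3"
payload = {
    "model": "Alibaba-NLP/gte-large-en-v1.5",  # assumed deployed embeddings model
    "input": ["OpenVINO Model Server demo"],
}
response = requests.post(f"{base_url}/embeddings", json=payload)
response.raise_for_status()
vector = response.json()["data"][0]["embedding"]
print(len(vector))  # dimensionality of the returned embedding vector
```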

demos/continuous_batching/vlm/README.md

Lines changed: 1 addition & 1 deletion
@@ -155,7 +155,7 @@ curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/r
 ```python
 import requests
 import base64
-base_url='http://localhost:8080/v3'
+base_url='http://localhost:8000/v3'
 model_name = "OpenGVLab/InternVL2_5-8B"
 
 def convert_image(Image):
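The patched snippet above is only the beginning of the client code. A self-contained sketch of how such a request can be completed is shown below; it assumes the demo server is listening on port 8000 with `OpenGVLab/InternVL2_5-8B` deployed and that `zebra.jpeg` is present locally. The prompt and `max_tokens` value are illustrative, not the demo's exact code.

```python
# Illustrative continuation of the patched snippet (not the demo's exact code).
import base64
import requests

base_url = 'http://localhost:8000/v3'          # REST port fixed by this commit
model_name = "OpenGVLab/InternVL2_5-8B"

def convert_image(path):
    # Read the image file and return its base64-encoded contents as a string.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": model_name,
    "max_tokens": 100,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{convert_image('zebra.jpeg')}"}},
            ],
        }
    ],
}
response = requests.post(f"{base_url}/chat/completions", json=payload)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```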

demos/llm_npu/README.md

Lines changed: 54 additions & 31 deletions
@@ -1,4 +1,4 @@
-# Text generation serving with NPU acceleration #ovms_demos_llm_npu
+# Text generation serving with NPU acceleration {#ovms_demos_llm_npu}
 
 
 This demo shows how to deploy LLM models in the OpenVINO Model Server with NPU acceleration.
@@ -38,7 +38,7 @@ Run `export_model.py` script to download and quantize the model:
 
 **LLM**
 ```console
-python export_model.py text_generation --source_model mistralai/Mistral-7B-Instruct-v0.2 --target_device NPU --config_file_path models/config.json --model_repository_path models --overwrite_models
+python export_model.py text_generation --source_model meta-llama/Llama-3.1-8B-Instruct --target_device NPU --config_file_path models/config.json --model_repository_path models --overwrite_models
 ```
 Below is a list of tested models:
 - meta-llama/Meta-Llama-3-8B-Instruct
@@ -81,7 +81,7 @@ The default configuration should work in most cases but the parameters can be tu
 
 Running this command starts the container with NPU enabled:
 ```bash
-docker run -d --rm --device /dev/accel -p 9000:9000 --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -u $(id -u):$(id -g) \
+docker run -d --rm --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -u $(id -u):$(id -g) \
 -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server:latest-gpu --rest_port 8000 --config_path /workspace/config.json
 ```
 :::
@@ -110,7 +110,7 @@ curl http://localhost:8000/v1/config
 ```
 ```json
 {
-  "mistralai/Mistral-7B-Instruct-v0.2": {
+  "meta-llama/Llama-3.1-8B-Instruct": {
     "model_version_status": [
       {
         "version": "1",
@@ -133,53 +133,51 @@ Completion endpoint should be used to pass the prompt directly by the client and
 
 :::{dropdown} **Unary call with cURL**
 ```console
-curl http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"mistralai/Mistral-7B-Instruct-v0.2\", \"max_tokens\":30,\"stream\":false, \"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},{\"role\": \"user\",\"content\": \"What is OpenVINO?\"}]}"
+curl http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"meta-llama/Llama-3.1-8B-Instruct\", \"max_tokens\":30,\"stream\":false, \"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},{\"role\": \"user\",\"content\": \"What is OpenVINO?\"}]}"
 ```
 ```json
 {
   "choices": [
     {
-      "finish_reason": "length",
+      "finish_reason": "stop",
       "index": 0,
-      "logprobs": null,
       "message": {
-        "content": "OpenVINO is an open-source software framework developed by Intel for optimizing and deploying computer vision, machine learning, and deep learning models on various devices,",
+        "content": "OpenVINO (Open Visual Inference and Optimization for computational resources) is an open-source toolkit that automates neural network model computations across various platforms and",
         "role": "assistant"
       }
     }
   ],
-  "created": 1724405301,
-  "model": "mistralai/Mistral-7B-Instruct-v0.2",
+  "created": 1742944805,
+  "model": "meta-llama/Llama-3.1-8B-Instruct",
   "object": "chat.completion",
   "usage": {
-    "prompt_tokens": 27,
+    "prompt_tokens": 47,
     "completion_tokens": 30,
-    "total_tokens": 57
+    "total_tokens": 77
   }
 }
 ```
 
 A similar call can be made with the `completions` endpoint:
 ```console
-curl http://localhost:8000/v3/completions -H "Content-Type: application/json"-d "{\"model\": \"mistralai/Mistral-7B-Instruct-v0.2\",\"max_tokens\":30,\"stream\":false,\"prompt\": \"You are a helpful assistant. What is OpenVINO? \"}"
+curl http://localhost:8000/v3/completions -H "Content-Type: application/json" -d "{\"model\": \"meta-llama/Llama-3.1-8B-Instruct\",\"max_tokens\":30,\"stream\":false,\"prompt\": \"You are a helpful assistant. What is OpenVINO? \"}"
 ```
 ```json
 {
   "choices": [
     {
-      "finish_reason": "length",
+      "finish_reason": "stop",
       "index": 0,
-      "logprobs": null,
-      "text": "\n\nOpenVINO is an open-source computer vision platform developed by Intel for deploying and optimizing computer vision, machine learning, and autonomous driving applications. It"
+      "text": " Introduction\nOpenVINO can be used in automation of various business processes, which brings timely assistance in operations with these models. Additionally OpenVINO simpl"
     }
   ],
-  "created": 1724405354,
-  "model": "mistralai/Mistral-7B-Instruct-v0.2",
+  "created": 1742944929,
+  "model": "meta-llama/Llama-3.1-8B-Instruct",
   "object": "text_completion",
   "usage": {
-    "prompt_tokens": 23,
+    "prompt_tokens": 14,
     "completion_tokens": 30,
-    "total_tokens": 53
+    "total_tokens": 44
   }
 }
 ```
@@ -203,7 +201,7 @@ client = OpenAI(
 )
 
 response = client.chat.completions.create(
-    model="mistralai/Mistral-7B-Instruct-v0.2",
+    model="meta-llama/Llama-3.1-8B-Instruct",
     messages=[{"role": "user", "content": "Say this is a test"}],
     stream=False,
 )
@@ -212,7 +210,7 @@ print(response.choices[0].message.content)
 
 Output:
 ```
-It looks like you're testing me!
+This is only a test.
 ```
 
 Similar code can be applied for the completion endpoint:
@@ -228,7 +226,7 @@ client = OpenAI(
 )
 
 response = client.completions.create(
-    model="mistralai/Mistral-7B-Instruct-v0.2",
+    model="meta-llama/Llama-3.1-8B-Instruct",
     prompt="Say this is a test.",
     stream=False,
 )
@@ -237,7 +235,7 @@ print(response.choices[0].text)
 
 Output:
 ```
-It looks like you're testing me!
+This is only a test.
 ```
 :::
 
@@ -258,7 +256,7 @@ client = OpenAI(
 )
 
 stream = client.chat.completions.create(
-    model="mistralai/Mistral-7B-Instruct-v0.2",
+    model="meta-llama/Llama-3.1-8B-Instruct",
     messages=[{"role": "user", "content": "Say this is a test"}],
     stream=True,
 )
@@ -269,7 +267,7 @@ for chunk in stream:
 
 Output:
 ```
-It looks like you're testing me!
+This is only a test.
 ```
 
 Similar code can be applied for the completion endpoint:
@@ -285,7 +283,7 @@ client = OpenAI(
 )
 
 stream = client.completions.create(
-    model="mistralai/Mistral-7B-Instruct-v0.2",
+    model="meta-llama/Llama-3.1-8B-Instruct",
     prompt="Say this is a test.",
     stream=True,
 )
@@ -296,7 +294,7 @@ for chunk in stream:
 
 Output:
 ```
-It looks like you're testing me!
+This is only a test.
 ```
 :::
 
@@ -310,22 +308,47 @@ cd vllm
 pip3 install -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
 cd benchmarks
 curl -L https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json -o ShareGPT_V3_unfiltered_cleaned_split.json # sample dataset
-python benchmark_serving.py --host localhost --port 8000 --endpoint /v3/chat/completions --backend openai-chat --model mistralai/Mistral-7B-Instruct-v0.2 --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 100 --request-rate inf --max-concurrency 1
-
-
+python benchmark_serving.py --host localhost --port 8000 --endpoint /v3/chat/completions --backend openai-chat --model meta-llama/Llama-3.1-8B-Instruct --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 30 --max-concurrency 1
+Maximum request concurrency: 1
+
+============ Serving Benchmark Result ============
+Successful requests:                     30
+Benchmark duration (s):                  480.20
+Total input tokens:                      6434
+Total generated tokens:                  6113
+Request throughput (req/s):              0.06
+Output token throughput (tok/s):         12.73
+Total Token throughput (tok/s):          26.13
+---------------Time to First Token----------------
+Mean TTFT (ms):                          1922.09
+Median TTFT (ms):                        1920.85
+P99 TTFT (ms):                           1952.11
+-----Time per Output Token (excl. 1st token)------
+Mean TPOT (ms):                          65.74
+Median TPOT (ms):                        68.95
+P99 TPOT (ms):                           70.40
+---------------Inter-token Latency----------------
+Mean ITL (ms):                           83.65
+Median ITL (ms):                         70.11
+P99 ITL (ms):                            212.48
+==================================================
 ```
 
 ## Testing the model accuracy over serving API
 
 Check the [guide of using lm-evaluation-harness](https://github.com/openvinotoolkit/model_server/blob/releases/2025/1/demos/continuous_batching/accuracy/README.md)
 
+> **Note:** Text generation on NPU does not return log_probs, which are required to calculate some of the metrics. Only tasks of type `generate_until` can be used, for example `--tasks leaderboard_ifeval`.
+
 
 ## Limitations
 
 - beam_search algorithm is not supported with NPU. Greedy search and multinomial algorithms are supported.
 - models must be exported with INT4 precision and `--sym --ratio 1.0 --group-size -1` params. This is enforced in the export_model.py script when the target_device is NPU.
 - log_probs are not supported
 - finish reason is always set to "stop".
+- only a single response can be returned. Parameter `n` is not supported.
 
 ## References
 - [Chat Completions API](../../docs/model_server_rest_api_chat.md)
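Following the note above about accuracy testing on NPU: because log_probs are not returned, only `generate_until` tasks (such as `leaderboard_ifeval`) can be evaluated. Below is a hedged sketch of such a run using the lm-evaluation-harness Python API; the exact `model_args` accepted by the `local-chat-completions` backend are assumptions and may differ between harness versions, so treat this as an illustration rather than the demo's prescribed command.

```python
# Hedged sketch: evaluate the served model on a generate_until task
# (e.g. leaderboard_ifeval) with lm-evaluation-harness. Argument names for the
# local-chat-completions backend are assumptions and may vary by version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="local-chat-completions",
    model_args=(
        "model=meta-llama/Llama-3.1-8B-Instruct,"
        "base_url=http://localhost:8000/v3/chat/completions,"
        "num_concurrent=1"
    ),
    tasks=["leaderboard_ifeval"],
)
print(results["results"])
```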

demos/vlm_npu/README.md

Lines changed: 13 additions & 9 deletions
@@ -1,4 +1,4 @@
-# Serving for Text generation with Visual Language Models with NPU acceleration #ovms_demos_vlm_npu
+# Serving for Text generation with Visual Language Models with NPU acceleration {#ovms_demos_vlm_npu}
 
 
 This demo shows how to deploy VLM models in the OpenVINO Model Server with NPU acceleration.
@@ -72,7 +72,7 @@ The default configuration should work in most cases but the parameters can be tu
 
 Running this command starts the container with NPU enabled:
 ```bash
-docker run -d --rm --device /dev/accel -p 9000:9000 --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -u $(id -u):$(id -g) \
+docker run -d --rm --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -u $(id -u):$(id -g) \
 -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server:latest-gpu --rest_port 8000 --config_path /workspace/config.json
 ```
 :::
@@ -118,16 +118,18 @@ curl http://localhost:8000/v1/config
 
 ## Request Generation
 
-
-:::{dropdown} **Unary call with python requests library**
 ```console
 pip3 install requests
 curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/1/demos/common/static/images/zebra.jpeg -o zebra.jpeg
 ```
+![zebra](https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/static/images/zebra.jpeg)
+
+:::{dropdown} **Unary call with python requests library**
+
 ```python
 import requests
 import base64
-base_url='http://localhost:8080/v3'
+base_url='http://localhost:8000/v3'
 model_name = "microsoft/Phi-3.5-vision-instruct"
 
 def convert_image(Image):
@@ -136,7 +138,8 @@ def convert_image(Image):
     return base64_image
 
 import requests
-payload = {"model": "microsoft/Phi-3.5-vision-instruct",
+payload = {
+    "model": model_name,
     "messages": [
         {
             "role": "user",
@@ -191,8 +194,8 @@ pip3 install openai
 ```python
 from openai import OpenAI
 import base64
-base_url='http://localhost:8080/v3'
-model_name = "OpenGVLab/InternVL2_5-8B"
+base_url='http://localhost:8000/v3'
+model_name = "microsoft/Phi-3.5-vision-instruct"
 
 client = OpenAI(api_key='unused', base_url=base_url)
 
@@ -237,7 +240,7 @@ cd vllm
 pip3 install -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
 cd benchmarks
 curl -L https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json -o ShareGPT_V3_unfiltered_cleaned_split.json # sample dataset
-python benchmark_serving.py --backend openai-chat --dataset-name hf --dataset-path lmarena-ai/vision-arena-bench-v0.1 --hf-split train --host localhost --port 8000 --model OpenGVLab/InternVL2_5-8B --endpoint /v1/chat/completions --request-rate 1 --num-prompts 10 --trust-remote-code --max-concurrency 1
+python benchmark_serving.py --backend openai-chat --dataset-name hf --dataset-path lmarena-ai/vision-arena-bench-v0.1 --hf-split train --host localhost --port 8000 --model microsoft/Phi-3.5-vision-instruct --endpoint /v3/chat/completions --num-prompts 10 --trust-remote-code --max-concurrency 1
 
 ```
 
@@ -252,6 +255,7 @@ Check the [guide of using lm-evaluation-harness](https://github.com/openvinotool
 - models must be exported with INT4 precision and `--sym --ratio 1.0 --group-size -1` params. This is enforced in the export_model.py script when the target_device is NPU.
 - log_probs are not supported
 - finish reason is always set to "stop".
+- only a single response can be returned. Parameter `n` is not supported.
 
 ## References
 - [Chat Completions API](../../docs/model_server_rest_api_chat.md)
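The OpenAI-client snippet patched in this file stops at the client construction. A rough, self-contained sketch of a streaming request with an image follows; it assumes the server from this demo listens on port 8000 and that `zebra.jpeg` was downloaded as shown earlier. The prompt and `max_tokens` value are illustrative, not the demo's exact code.

```python
# Hedged sketch of a streaming VLM request through the OpenAI SDK.
from openai import OpenAI
import base64

base_url = 'http://localhost:8000/v3'
model_name = "microsoft/Phi-3.5-vision-instruct"

client = OpenAI(api_key='unused', base_url=base_url)

# Encode the locally downloaded image as base64 for the data URI.
with open("zebra.jpeg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

stream = client.chat.completions.create(
    model=model_name,
    max_tokens=100,
    stream=True,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
for chunk in stream:
    # Print tokens as they arrive; delta.content may be None on some chunks.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```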
