It is targeted at client machines equipped with an NPU accelerator.

## Prerequisites

**Model Server deployment**: Installed Docker Engine or OVMS binary package according to the [baremetal deployment guide](../../docs/deploying_server_baremetal.md)

**(Optional) Client**: git and Python for using the OpenAI client package
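For the optional client steps later in this demo, the OpenAI Python client can be installed with pip; a minimal sketch (package name as published on PyPI):

```console
pip install openai
```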
## Model preparation
Multiple [OpenVINO models optimized for NPU](https://huggingface.co/collections/OpenVINO/llms-optimized-for-npu) are available and can be downloaded directly using OVMS with the `--pull` parameter.

**Note:** The parameter `--ov_cache` stores the model compilation cache to speed up initialization on subsequent startups. Drop this parameter if you don't want to store the compilation cache.
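For reference, a pull invocation along these lines downloads one of the pre-optimized models into a local model repository. This is a sketch rather than a verbatim command from this demo: the flag names follow the OVMS `--pull` workflow and should be verified against `ovms --help` for your version, and the `--ov_cache` path is an assumption based on the note above.

```bash
# Sketch: download a pre-optimized NPU model from the HuggingFace collection
# into ./models. Verify flag names against `ovms --help` for your build.
ovms --pull \
  --source_model OpenVINO/Qwen3-8B-int4-cw-ov \
  --model_repository_path models \
  --task text_generation \
  --target_device NPU \
  --ov_cache ./ov_cache
```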
## Server Deployment

:::{dropdown} **Deploying with Docker**

Running this command starts the container with NPU enabled:

```bash
docker run -d --rm --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -u $(id -u):$(id -g) \
  ... # remaining arguments not shown in this excerpt
```
:::
Depending on how you prepared the models in the first step of this demo, they are deployed to either CPU or GPU (this is defined in `config.json`). If you run on GPU, make sure the appropriate drivers are installed so the device is accessible for the model server.
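Before sending generation requests, you can verify that the model finished loading. A quick check using the model server's standard REST configuration endpoint:

```console
curl http://localhost:8000/v1/config
```

The served model should be reported with state `AVAILABLE`. Note that the first startup on NPU includes model compilation, so readiness can take noticeably longer unless a populated `--ov_cache` is reused.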
:::{dropdown} **Unary call with cURL**

```console
curl http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"OpenVINO/Qwen3-8B-int4-cw-ov\", \"max_tokens\":50, \"stream\":false, \"chat_template_kwargs\":{\"enable_thinking\":false}, \"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},{\"role\": \"user\",\"content\": \"What is OpenVINO Model Server?\"}]}"
```
```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "**OpenVINO Model Server** (also known as **Model Server** or **OVMS**) is a high-performance, open-source inference server that allows you to deploy and serve deep learning models as RESTful or gRPC endpoints. It is part",
        "role": "assistant",
        "tool_calls": []
      }
    }
  ],
  "created": 1763718082,
  "model": "OpenVINO/Qwen3-8B-int4-cw-ov",
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 31,
    "completion_tokens": 50,
    "total_tokens": 81
  }
}
```
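When only the generated text is needed, the JSON can be trimmed client-side. A convenience sketch, assuming `jq` is installed (it is not required by the demo):

```console
curl -s http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" \
  -d "{\"model\": \"OpenVINO/Qwen3-8B-int4-cw-ov\", \"max_tokens\":50, \"stream\":false, \"messages\": [{\"role\": \"user\", \"content\": \"What is OpenVINO Model Server?\"}]}" \
  | jq -r '.choices[0].message.content'
```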
A similar call can be made with a `completion` endpoint:

```console
curl http://localhost:8000/v3/completions -H "Content-Type: application/json" -d "{\"model\": \"OpenVINO/Qwen3-8B-int4-cw-ov\",\"max_tokens\":50, \"stream\":false,\"prompt\": \"What are the 3 main tourist attractions in Paris?\"}"
```
```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "text": " The three main tourist attractions in Paris are the Eiffel Tower, the Louvre, and the Notre-Dame de Paris. The Eiffel Tower is one of the most iconic landmarks in Paris and is a must-see for most visitors."
    }
  ],
  "created": 1763976213,
  "model": "OpenVINO/Qwen3-8B-int4-cw-ov",
  "object": "text_completion",
  "usage": {
    "prompt_tokens": 11,
    "completion_tokens": 50,
    "total_tokens": 61
  }
}
```
:::

:::{dropdown} **Unary call with OpenAI Python package**

```python
from openai import OpenAI

# The base_url points at the OVMS OpenAI-compatible API started above; the
# api_key is required by the client object but not validated by the server.
client = OpenAI(
    base_url="http://localhost:8000/v3",
    api_key="unused",
)

response = client.chat.completions.create(
    model="OpenVINO/Qwen3-8B-int4-cw-ov",
    messages=[{"role": "user", "content": "What is OpenVINO Model Server?"}],
    max_tokens=100,
    stream=False,
)
print(response.choices[0].message.content)
```

Output:
```
**OpenVINO™ Model Server** is a high-performance, open-source inference server that allows you to deploy and serve deep learning models as a RESTful API. It is part of the **Intel® OpenVINO™ toolkit**, which is a comprehensive development toolkit for optimizing and deploying deep learning models on Intel®-based hardware.

---

## ✅ What is OpenVINO Model Server?

The **OpenVINO Model Server** is a **lightweight**, **highly optimized** and ...
```
Similar code can be applied for the completion endpoint:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v3",
    api_key="unused",
)

response = client.completions.create(
    model="OpenVINO/Qwen3-8B-int4-cw-ov",
    prompt="What are the 3 main tourist attractions in Paris?",
    max_tokens=100,
    stream=False,
)
print(response.choices[0].text)
```

Output:
```
The three main tourist attractions in Paris are the Eiffel Tower, the Louvre Museum, and the Notre-Dame de Paris. The Eiffel Tower is a symbol of Paris and one of the most visited landmarks in the world. The Louvre Museum is home to the Mona Lisa and other famous artworks. The Notre-Dame de Paris is a famous cathedral and a symbol of the city's rich history and architecture. These three attractions are the most popular among tourists visiting Paris.
```
:::
:::{dropdown} **Streaming call with OpenAI Python package**

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v3",
    api_key="unused",
)

stream = client.chat.completions.create(
    model="OpenVINO/Qwen3-8B-int4-cw-ov",
    messages=[{"role": "user", "content": "What is OpenVINO Model Server?"}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

Output:
```
**OpenVINO™ Model Server** (formerly known as **OpenVINO™ Toolkit Model Server**) is a high-performance, open-source server that allows you to deploy and serve deep learning models in a production environment. It is part of the **Intel® OpenVINO™ Toolkit**, which is designed to optimize and deploy deep learning models for inference on Intel hardware.

---

## 📌 What is OpenVINO Model Server?

The **OpenVINO Model Server** is a **lightweight**...
```
Similar code can be applied for the completion endpoint:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v3",
    api_key="unused",
)

stream = client.completions.create(
    model="OpenVINO/Qwen3-8B-int4-cw-ov",
    prompt="What are the 3 main tourist attractions in Paris?",
    max_tokens=100,
    stream=True,
)
@@ -302,46 +278,10 @@ for chunk in stream:
302
278
303
279
Output:
304
280
```
305
-
This is only a test.
281
+
The three main tourist attractions in Paris are the Eiffel Tower, the Louvre, and the Notre-Dame de Paris. The Eiffel Tower is the most iconic landmark and offers a great view of the city. The Louvre is a world-famous art museum that houses the Mona Lisa and other famous artworks. The Notre-Dame de Paris is a stunning example of French Gothic architecture and is the cathedral of the city. These three attractions are the most visited and most famous in Paris,
306
282
```
307
283
:::
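Streaming also makes it easy to gauge interactive latency on the NPU. The snippet below is an illustrative sketch (not part of the original demo) that reuses the same client to measure time to first token and total generation time:

```python
import time
from openai import OpenAI

# Assumes the model server from this demo is listening on localhost:8000.
client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

start = time.perf_counter()
first_token_latency = None
chunks = 0

stream = client.chat.completions.create(
    model="OpenVINO/Qwen3-8B-int4-cw-ov",
    messages=[{"role": "user", "content": "What is OpenVINO Model Server?"}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_latency is None:
            # Record latency when the first non-empty content chunk arrives.
            first_token_latency = time.perf_counter() - start
        chunks += 1

total = time.perf_counter() - start
ttft = first_token_latency if first_token_latency is not None else float("nan")
print(f"first token: {ttft:.2f}s, total: {total:.2f}s, chunks: {chunks}")
```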
To evaluate the accuracy of the served model, check the [guide to using lm-evaluation-harness](https://github.com/openvinotoolkit/model_server/blob/main/demos/continuous_batching/accuracy/README.md).