
Commit f30ebd7

przepeck, Copilot, and dtrawins authored
LLM NPU demo improvements (#3817)
### 🛠 Summary

[CVS-176669](https://jira.devtools.intel.com/browse/CVS-176669) Adjust LLM NPU demo to new parameters, add models from OV organization

### 🧪 Checklist

- [ ] Unit tests added.
- [ ] The documentation updated.
- [ ] Change follows security best practices.

---------

Co-authored-by: Copilot <[email protected]>
Co-authored-by: Trawinski, Dariusz <[email protected]>
1 parent fc06ee1 commit f30ebd7

File tree

1 file changed: +93 -153 lines changed


demos/llm_npu/README.md

Lines changed: 93 additions & 153 deletions
@@ -11,78 +11,37 @@ It is targeted on client machines equipped with NPU accelerator.

## Prerequisites

-**OVMS 2025.1 or higher**
-
-**Model preparation**: Python 3.9 or higher with pip and HuggingFace account
-
**Model Server deployment**: Installed Docker Engine or OVMS binary package according to the [baremetal deployment guide](../../docs/deploying_server_baremetal.md)

-**(Optional) Client**: git and Python for using OpenAI client package and vLLM benchmark app
+**(Optional) Client**: git and Python for using OpenAI client package

## Model preparation
-Here, the original Pytorch LLM model and the tokenizer will be converted to IR format and optionally quantized.
-That ensures faster initialization time, better performance and lower memory consumption.
-LLM engine parameters will be defined inside the `graph.pbtxt` file.

-Download export script, install it's dependencies and create directory for the models:
-```console
-curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/export_model.py -o export_model.py
-pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt
-mkdir models
-```
-
-Run `export_model.py` script to download and quantize the model:
+Multiple [OpenVINO models optimized for NPU](https://huggingface.co/collections/OpenVINO/llms-optimized-for-npu) are available and can be downloaded directly using OVMS with the `--pull` parameter.

-> **Note:** The users in China need to set environment variable HF_ENDPOINT="https://hf-mirror.com" before running the export script to connect to the HF Hub.
+### Pulling model

-**LLM**
-```console
-python export_model.py text_generation --source_model meta-llama/Llama-3.1-8B-Instruct --target_device NPU --config_file_path models/config.json --ov_cache_dir ./models/.ov_cache --model_repository_path models --overwrite_models
-```
-**Note:** The parameter `--ov_cache` stores the model compilation cache to speedup initialization time for sequential startup. Drop this parameter if you don't want to store the compilation cache.
-
-Below is a list of tested models:
-- meta-llama/Meta-Llama-3-8B-Instruct
-- meta-llama/Llama-3.1-8B
-- microsoft/Phi-3-mini-4k-instruct
-- Qwen/Qwen2-7B
-- mistralai/Mistral-7B-Instruct-v0.2
-- openbmb/MiniCPM-1B-sft-bf16
-- TinyLlama/TinyLlama-1.1B-Chat-v1.0
-- TheBloke/Llama-2-7B-Chat-GPTQ
-- Qwen/Qwen2-7B-Instruct-GPTQ-Int4
-
-You should have a model folder like below:
+::::{tab-set}
+:::{tab-item} Linux
+:sync: Linux
+```bash
+docker run -d --rm -u $(id -u):$(id -g) -v $(pwd)/models:/models:rw openvino/model_server:latest-gpu --pull --source_model OpenVINO/Qwen3-8B-int4-cw-ov --model_repository_path /models --target_device NPU --task text_generation --tool_parser hermes3 --cache_dir .ov_cache --enable_prefix_caching true --max_prompt_len 2000
+docker run -d --rm -u $(id -u):$(id -g) -v $(pwd)/models:/models:rw openvino/model_server:latest-gpu --add_to_config --config_path /models/config.json --model_name OpenVINO/Qwen3-8B-int4-cw-ov --model_path /models/OpenVINO/Qwen3-8B-int4-cw-ov
```
-tree models
-models
-├── config.json
-└── mistralai
-    └── Mistral-7B-Instruct-v0.2
-        ├── config.json
-        ├── generation_config.json
-        ├── graph.pbtxt
-        ├── openvino_detokenizer.bin
-        ├── openvino_detokenizer.xml
-        ├── openvino_model.bin
-        ├── openvino_model.xml
-        ├── openvino_tokenizer.bin
-        ├── openvino_tokenizer.xml
-        ├── special_tokens_map.json
-        ├── tokenizer_config.json
-        └── tokenizer.json
+:::
+:::{tab-item} Windows
+:sync: Windows
+```bat
+ovms.exe --pull --source_model OpenVINO/Qwen3-8B-int4-cw-ov --model_repository_path models --target_device NPU --task text_generation --tool_parser hermes3 --cache_dir .ov_cache --enable_prefix_caching true --max_prompt_len 2000
+ovms.exe --add_to_config --config_path models\config.json --model_name OpenVINO/Qwen3-8B-int4-cw-ov --model_path OpenVINO\Qwen3-8B-int4-cw-ov
```
-
-The default configuration should work in most cases but the parameters can be tuned via `export_model.py` script arguments.
-Note that by default, NPU sets limitation on the prompt length to 1024 tokens. You can modify that limit by using `--max_prompt_len` parameter.
-Run the script with `--help` argument to check available parameters and see the [LLM calculator documentation](../../docs/llm/reference.md) to learn more about configuration options.
+:::
+::::

## Server Deployment

:::{dropdown} **Deploying with Docker**

-
Running this command starts the container with NPU enabled:
```bash
docker run -d --rm --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -u $(id -u):$(id -g) \
@@ -102,7 +61,7 @@ as mentioned in [deployment guide](../../docs/deploying_server_baremetal.md), in
Depending on how you prepared models in the first step of this demo, they are deployed to either CPU or GPU (it's defined in `config.json`). If you run on GPU make sure to have appropriate drivers installed, so the device is accessible for the model server.

```bat
-ovms --rest_port 8000 --config_path ./models/config.json
+ovms --rest_port 8000 --config_path models\config.json
```
:::

@@ -114,18 +73,18 @@ curl http://localhost:8000/v1/config
```
```json
{
-  "meta-llama/Llama-3.1-8B-Instruct": {
-    "model_version_status": [
-      {
-        "version": "1",
-        "state": "AVAILABLE",
-        "status": {
-          "error_code": "OK",
-          "error_message": "OK"
-        }
-      }
-    ]
-  }
+  "OpenVINO/Qwen3-8B-int4-cw-ov": {
+    "model_version_status": [
+      {
+        "version": "1",
+        "state": "AVAILABLE",
+        "status": {
+          "error_code": "OK",
+          "error_message": "OK"
+        }
+      }
+    ]
+  }
}
```
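For scripted deployments, this `/v1/config` endpoint can be polled until the pulled model becomes available before any generation request is sent. Below is a minimal sketch using the Python `requests` package; the model name matches the pull commands above, while the polling interval and overall timeout are arbitrary assumptions, not part of the demo.

```python
# Minimal readiness check (sketch): poll /v1/config until the model reports "AVAILABLE".
import time
import requests

MODEL = "OpenVINO/Qwen3-8B-int4-cw-ov"
CONFIG_URL = "http://localhost:8000/v1/config"

for _ in range(60):  # assumed budget of ~60 seconds
    try:
        config = requests.get(CONFIG_URL, timeout=5).json()
        statuses = config.get(MODEL, {}).get("model_version_status", [])
        if any(s.get("state") == "AVAILABLE" for s in statuses):
            print(f"{MODEL} is ready")
            break
    except requests.RequestException:
        pass  # the server may still be starting up
    time.sleep(1)
else:
    raise SystemExit(f"{MODEL} did not become ready in time")
```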

@@ -137,52 +96,55 @@ Completion endpoint should be used to pass the prompt directly by the client and

:::{dropdown} **Unary call with cURL**
```console
-curl http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"meta-llama/Llama-3.1-8B-Instruct\", \"max_tokens\":30,\"stream\":false, \"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},{\"role\": \"user\",\"content\": \"What is OpenVINO?\"}]}"
+curl http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"OpenVINO/Qwen3-8B-int4-cw-ov\", \"max_tokens\":50, \"stream\":false, \"chat_template_kwargs\":{\"enable_thinking\":false}, \"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},{\"role\": \"user\",\"content\": \"What is OpenVINO Model Server?\"}]}"
```
```json
{
-  "choices": [
-    {
-      "finish_reason": "stop",
-      "index": 0,
-      "message": {
-        "content": "OpenVINO (Open Visual Inference and Optimization for computational resources) is an open-source toolkit that automates neural network model computations across various platforms and",
-        "role": "assistant"
+  "choices":[
+    {
+      "finish_reason":"stop",
+      "index":0,
+      "message":{
+        "content":"**OpenVINO Model Server** (also known as **Model Server** or **OVMS**) is a high-performance, open-source inference server that allows you to deploy and serve deep learning models as RESTful or gRPC endpoints. It is part",
+        "role":"assistant",
+        "tool_calls":[
+
+        ]
+      }
      }
-    }
-  ],
-  "created": 1742944805,
-  "model": "meta-llama/Llama-3.1-8B-Instruct",
-  "object": "chat.completion",
-  "usage": {
-    "prompt_tokens": 47,
-    "completion_tokens": 30,
-    "total_tokens": 77
-  }
+  ],
+  "created":1763718082,
+  "model":"OpenVINO/Qwen3-8B-int4-cw-ov",
+  "object":"chat.completion",
+  "usage":{
+    "prompt_tokens":31,
+    "completion_tokens":50,
+    "total_tokens":81
+  }
}
```

A similar call can be made with a `completion` endpoint:
```console
-curl http://localhost:8000/v3/completions -H "Content-Type: application/json" -d "{\"model\": \"meta-llama/Llama-3.1-8B-Instruct\",\"max_tokens\":30,\"stream\":false,\"prompt\": \"You are a helpful assistant. What is OpenVINO? \"}"
+curl http://localhost:8000/v3/completions -H "Content-Type: application/json" -d "{\"model\": \"OpenVINO/Qwen3-8B-int4-cw-ov\", \"max_tokens\":50, \"stream\":false, \"prompt\": \"What are the 3 main tourist attractions in Paris?\"}"
```
```json
{
-  "choices": [
-    {
-      "finish_reason": "stop",
-      "index": 0,
-      "text": " Introduction\nOpenVINO can be used in automation of various business processes, which brings timely assistance in operations with these models. Additionally OpenVINO simpl"
-    }
-  ],
-  "created": 1742944929,
-  "model": "meta-llama/Llama-3.1-8B-Instruct",
-  "object": "text_completion",
-  "usage": {
-    "prompt_tokens": 14,
-    "completion_tokens": 30,
-    "total_tokens": 44
-  }
+  "choices":[
+    {
+      "finish_reason":"stop",
+      "index":0,
+      "text":" The three main tourist attractions in Paris are the Eiffel Tower, the Louvre, and the Notre-Dame de Paris. The Eiffel Tower is one of the most iconic landmarks in Paris and is a must-see for most visitors."
+    }
+  ],
+  "created":1763976213,
+  "model":"OpenVINO/Qwen3-8B-int4-cw-ov",
+  "object":"text_completion",
+  "usage":{
+    "prompt_tokens":11,
+    "completion_tokens":50,
+    "total_tokens":61
+  }
}
```
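The same chat request can also be issued from Python without the OpenAI client, which is handy for quick smoke tests. This is a minimal sketch built on the `requests` package; the endpoint, model name and payload fields simply mirror the cURL example above.

```python
# Sketch: the chat completion call above, sent with the requests package.
import requests

payload = {
    "model": "OpenVINO/Qwen3-8B-int4-cw-ov",
    "max_tokens": 50,
    "stream": False,
    "chat_template_kwargs": {"enable_thinking": False},
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is OpenVINO Model Server?"},
    ],
}

response = requests.post("http://localhost:8000/v3/chat/completions", json=payload, timeout=120)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```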

@@ -205,17 +167,24 @@ client = OpenAI(
)

response = client.chat.completions.create(
-    model="meta-llama/Llama-3.1-8B-Instruct",
-    messages=[{"role": "user", "content": "Say this is a test"}],
+    model="OpenVINO/Qwen3-8B-int4-cw-ov",
+    messages=[{"role": "user", "content": "What is OpenVINO Model Server?"}],
    max_tokens=100,
    stream=False,
+    extra_body={"chat_template_kwargs":{"enable_thinking": False}}
)
print(response.choices[0].message.content)
```

Output:
```
-This is only a test.
+**OpenVINO™ Model Server** is a high-performance, open-source inference server that allows you to deploy and serve deep learning models as a RESTful API. It is part of the **Intel® OpenVINO™ toolkit**, which is a comprehensive development toolkit for optimizing and deploying deep learning models on Intel®-based hardware.
+
+---
+
+## ✅ What is OpenVINO Model Server?
+
+The **OpenVINO Model Server** is a **lightweight**, **highly optimized** and ...
```
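Since the server in this demo is started with `--tool_parser hermes3`, chat responses also carry a `tool_calls` field (empty in the example above). The sketch below shows how a request with a tool definition could look through the same OpenAI client, assuming the client is constructed with the server's base URL (`http://localhost:8000/v3`) as in the surrounding snippets; the `get_weather` function is purely hypothetical and whether the model actually emits a tool call depends on the model and the parser.

```python
# Hypothetical tool-calling sketch; the tool definition is illustrative only.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, defined only for this sketch
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="OpenVINO/Qwen3-8B-int4-cw-ov",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
    max_tokens=100,
    stream=False,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

message = response.choices[0].message
# The parser may return structured tool calls or a plain text answer.
print(message.tool_calls or message.content)
```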

A similar code can be applied for the completion endpoint:
@@ -231,8 +200,8 @@ client = OpenAI(
)

response = client.completions.create(
-    model="meta-llama/Llama-3.1-8B-Instruct",
-    prompt="Say this is a test.",
+    model="OpenVINO/Qwen3-8B-int4-cw-ov",
+    prompt="What are the 3 main tourist attractions in Paris?",
    max_tokens=100,
    stream=False,
)
@@ -241,7 +210,7 @@ print(response.choices[0].text)

Output:
```
-This is only a test.
+The three main tourist attractions in Paris are the Eiffel Tower, the Louvre Museum, and the Notre-Dame de Paris. The Eiffel Tower is a symbol of Paris and one of the most visited landmarks in the world. The Louvre Museum is home to the Mona Lisa and other famous artworks. The Notre-Dame de Paris is a famous cathedral and a symbol of the city's rich history and architecture. These three attractions are the most popular among tourists visiting Paris.
```
:::

@@ -262,10 +231,11 @@ client = OpenAI(
)

stream = client.chat.completions.create(
-    model="meta-llama/Llama-3.1-8B-Instruct",
-    messages=[{"role": "user", "content": "Say this is a test"}],
+    model="OpenVINO/Qwen3-8B-int4-cw-ov",
+    messages=[{"role": "user", "content": "What is OpenVINO Model Server?"}],
    max_tokens=100,
    stream=True,
+    extra_body={"chat_template_kwargs":{"enable_thinking": False}}
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
@@ -274,7 +244,13 @@ for chunk in stream:

Output:
```
-This is only a test.
+**OpenVINO™ Model Server** (formerly known as **OpenVINO™ Toolkit Model Server**) is a high-performance, open-source server that allows you to deploy and serve deep learning models in a production environment. It is part of the **Intel® OpenVINO™ Toolkit**, which is designed to optimize and deploy deep learning models for inference on Intel hardware.
+
+---
+
+## 📌 What is OpenVINO Model Server?
+
+The **OpenVINO Model Server** is a **lightweight**...
```

A similar code can be applied for the completion endpoint:
@@ -290,8 +266,8 @@ client = OpenAI(
)

stream = client.completions.create(
-    model="meta-llama/Llama-3.1-8B-Instruct",
-    prompt="Say this is a test.",
+    model="OpenVINO/Qwen3-8B-int4-cw-ov",
+    prompt="What are the 3 main tourist attractions in Paris?",
    max_tokens=100,
    stream=True,
)
@@ -302,46 +278,10 @@ for chunk in stream:

Output:
```
-This is only a test.
+The three main tourist attractions in Paris are the Eiffel Tower, the Louvre, and the Notre-Dame de Paris. The Eiffel Tower is the most iconic landmark and offers a great view of the city. The Louvre is a world-famous art museum that houses the Mona Lisa and other famous artworks. The Notre-Dame de Paris is a stunning example of French Gothic architecture and is the cathedral of the city. These three attractions are the most visited and most famous in Paris,
```
:::

-## Benchmarking text generation with high concurrency
-
-OpenVINO Model Server employs efficient parallelization for text generation. It can be used to generate text also in high concurrency in the environment shared by multiple clients.
-It can be demonstrated using benchmarking app from vLLM repository:
-```console
-git clone --branch v0.7.3 --depth 1 https://github.com/vllm-project/vllm
-cd vllm
-pip3 install -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
-cd benchmarks
-curl -L https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json -o ShareGPT_V3_unfiltered_cleaned_split.json # sample dataset
-python benchmark_serving.py --host localhost --port 8000 --endpoint /v3/chat/completions --backend openai-chat --model meta-llama/Llama-3.1-8B-Instruct --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 30 --max-concurrency 1
-Maximum request concurrency: 1
-
-============ Serving Benchmark Result ============
-Successful requests: 30
-Benchmark duration (s): 480.20
-Total input tokens: 6434
-Total generated tokens: 6113
-Request throughput (req/s): 0.06
-Output token throughput (tok/s): 12.73
-Total Token throughput (tok/s): 26.13
----------------Time to First Token----------------
-Mean TTFT (ms): 1922.09
-Median TTFT (ms): 1920.85
-P99 TTFT (ms): 1952.11
------Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): 65.74
-Median TPOT (ms): 68.95
-P99 TPOT (ms): 70.40
----------------Inter-token Latency----------------
-Mean ITL (ms): 83.65
-Median ITL (ms): 70.11
-P99 ITL (ms): 212.48
-==================================================
-```
-
## Testing the model accuracy over serving API

Check the [guide of using lm-evaluation-harness](https://github.com/openvinotoolkit/model_server/blob/main/demos/continuous_batching/accuracy/README.md)
