The vLLM used in this docker image is based on [v0.10.2](https://github.com/vllm-project/vllm).
| IPEX | 2.8.10 |
| OneCCL | 2021.15.4 |
## 1. What's New in This Release?
* GPT-OSS 20B and 120B are supported in MXFP4 with optimized performance (see the serving sketch after this list).
* Attention kernel optimizations for the decoding phase bring >10% end-to-end throughput improvement on 10+ models with 1K/512 input/output lengths.
* MoE models are optimized with a persistent MoE GEMM kernel and a fused activation kernel to reduce kernel bubbles. Qwen3-30B-A3B achieves a 2.6x end-to-end improvement and DeepSeek-V2-Lite achieves a 1.5x end-to-end improvement.
* vLLM 0.10.2 brings new features: P/D (prefill/decode) disaggregation, data parallelism (DP), tool calling, reasoning outputs, and structured outputs.
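
As a quick illustration of the GPT-OSS support mentioned above, here is a minimal serving sketch; the model ID, port and any omitted tuning flags are assumptions for this example rather than values taken from this guide:

```
# Minimal sketch (assumed example, not the exact command from this guide):
# GPT-OSS 20B ships with MXFP4 weights, so serving it directly exercises the MXFP4 path.
vllm serve openai/gpt-oss-20b --port 8000
```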
## 2. What's Supported?
Intel GPUs benefit from enhancements brought by the [vLLM V1 engine](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html), including:
Besides, following up the vLLM V1 design, corresponding optimized kernels and features are supported:
* Multi Modality Support
In this release, image/audio input can be processed using Qwen2.5-VL series models, like Qwen/Qwen2.5-VL-32B-Instruct on 4 BMG cards.
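
As a rough sketch of how such a deployment might look (the port, request payload and image URL below are illustrative assumptions, not values from this guide), the model could be served across 4 cards and queried through the OpenAI-compatible API:

```
# Sketch only: serve Qwen2.5-VL-32B-Instruct with tensor parallelism across 4 GPUs.
vllm serve Qwen/Qwen2.5-VL-32B-Instruct --tensor-parallel-size 4 --port 8000

# Example image request against the OpenAI-compatible chat endpoint.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-VL-32B-Instruct",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}}
          ]
        }]
      }'
```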
* Pooling Models Support
* Data Parallelism
vLLM supports [Data Parallel](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment.html) deployment, where model weights are replicated across separate instances/GPUs to process independent batches of requests. This works with both dense and MoE models. Note that expert parallelism (EP) is still being enabled and will be supported soon.
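
A minimal data-parallel launch sketch, assuming vLLM's `--data-parallel-size` option and an illustrative model choice:

```
# Sketch only: two data-parallel ranks, each holding a full replica of the model weights
# and serving its own batches of requests.
vllm serve meta-llama/Llama-3.1-8B-Instruct --data-parallel-size 2 --port 8000
```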
* MoE models
Models with MoE structure like GPT-OSS 20B/120B in MXFP4 format, DeepSeek-V2-Lite and Qwen/Qwen3-30B-A3B are now supported.
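
As an example, a hedged sketch of serving one of these MoE models; the tensor-parallel size is an assumption, and the reduced memory utilization mirrors the note in the Limitations section below:

```
# Sketch only: serve Qwen3-30B-A3B; --gpu-memory-utilization=0.8 is recommended in the
# Limitations section due to the model's high memory consumption.
vllm serve Qwen/Qwen3-30B-A3B --tensor-parallel-size 4 --gpu-memory-utilization 0.8
```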
Other features like [reasoning_outputs](https://docs.vllm.ai/en/latest/features/reasoning_outputs.html), [structured_outputs](https://docs.vllm.ai/en/latest/features/structured_outputs.html) and [tool calling](https://docs.vllm.ai/en/latest/features/tool_calling.html) are supported now. We also support some experimental features, including:
* **torch.compile**: Can be enabled for the fp16/bf16 path.
* **speculative decoding**: Supports the `n-gram`, `EAGLE` and `EAGLE3` methods.
* **async scheduling**: Can be enabled with `--async-scheduling`. This may help reduce CPU overheads, leading to better latency and throughput. However, async scheduling is currently not supported with some features such as structured outputs, speculative decoding, and pipeline parallelism. A launch sketch for these options follows this list.
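
As a hedged illustration of these experimental options (the model and the speculative-decoding settings are assumptions, not values from this guide):

```
# Sketch only: async scheduling and speculative decoding are shown as two separate
# launches, since async scheduling is not supported together with speculative decoding.
vllm serve meta-llama/Llama-3.1-8B-Instruct --async-scheduling

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-config '{"method": "ngram", "num_speculative_tokens": 3, "prompt_lookup_max": 4}'
```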
## Supported Models
The table below lists models that have been verified by Intel. However, there should be more models that work on Intel GPUs beyond this list.
| Text Generation | Qwen/Qwen3-14B |✅︎|✅︎||
| Text Generation | Qwen/Qwen3-32B |✅︎|✅︎||
| Text Generation | Qwen/Qwen3-30B-A3B |✅︎|✅︎||
| Text Generation | Qwen/Qwen3-coder-30B-A3B-Instruct |✅︎|✅︎||
| Text Generation | Qwen/QwQ-32B |✅︎|✅︎||
| Text Generation | OpenGVLab/InternVL3_5-8B |✅︎|✅︎||
| Text Generation | OpenGVLab/InternVL3_5-14B |✅︎|✅︎||
| Text Generation | OpenGVLab/InternVL3_5-38B |✅︎|✅︎||
| Text Generation | openbmb/MiniCPM-V-4 |✅︎|✅︎||
| Text Generation | deepseek-ai/DeepSeek-V2-Lite |✅︎|✅︎||
| Text Generation | meta-llama/Llama-3.1-8B-Instruct |✅︎|✅︎||
| Text Generation | baichuan-inc/Baichuan2-13B-Chat |✅︎|✅︎||
| Embedding Model | Qwen/Qwen3-Embedding-8B |✅︎|✅︎||
| Reranker Model | Qwen/Qwen3-Reranker-8B |✅︎|✅︎||
## 3. Limitations
Some vLLM V1 features may need extra support, including LoRA (Low-Rank Adaptation), pipeline parallelism on Ray, EP (Expert Parallelism) and MLA (Multi-head Latent Attention).
The following are known issues:
* Qwen/Qwen3-30B-A3B in FP16/BF16 needs `--gpu-memory-utilization=0.8` due to its high memory consumption.
* W8A8 models quantized with llm_compressor, like RedHatAI/DeepSeek-R1-Distill-Qwen-32B-FP8-dynamic, are not supported yet.
## 4. How to Get Started
### 4.1. Prerequisite
| OS | Hardware |
| ---------- | ---------- |
| Ubuntu 25.04 | Intel® Arc™ B-Series |
### 4.2. Prepare a Serving Environment
1. Get the released Docker image with the command `docker pull intel/vllm:0.10.2-xpu`.
2. Instantiate a Docker container with the command `docker run -t -d --shm-size 10g --net=host --ipc=host --privileged -v /dev/dri/by-path:/dev/dri/by-path --name=vllm-test --device /dev/dri:/dev/dri --entrypoint= intel/vllm:0.10.2-xpu /bin/bash`.
In both environments, you may then wish to set a `HUGGING_FACE_HUB_TOKEN` environment variable:

```
export HUGGING_FACE_HUB_TOKEN=xxxxxx
```
### 4.3. Launch Workloads
#### 4.3.1. Launch Server in the Server Environment
It may take some time. The message `INFO: Application startup complete.` indicates that the server is ready.
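
If you need a starting point, the following is only an illustrative sketch of launching an OpenAI-compatible server inside the container; the model and flags are assumptions rather than the exact command from this guide:

```
# Illustrative sketch only: start an OpenAI-compatible server and wait for
# "INFO: Application startup complete." before sending requests.
vllm serve meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 8000
```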
#### 4.3.2. Raise Requests for Benchmarking in the Client Environment
We leverage a [benchmarking script](https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py) provided in vLLM to perform performance benchmarking. You can use your own client scripts as well.
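
A hedged example invocation of that script (the model, dataset settings, request count and port are assumptions; align them with the server you launched):

```
# Sketch only: benchmark an already-running server with random prompts of
# 1024 input / 512 output tokens.
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 512 \
  --num-prompts 100 \
  --host 127.0.0.1 --port 8000
```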
Should you encounter any issues or have any questions, please submit an issue ticket at [vLLM Github Issues](https://github.com/vllm-project/vllm/issues). Include the text `[Intel GPU]` in the issue title to ensure it gets noticed.