
Commit d84fe3b (parent: 13ed748)

add what's new and new supported models

Signed-off-by: Yan Ma <[email protected]>

vllm/0.10.2-xpu.md

Lines changed: 29 additions & 19 deletions
@@ -14,7 +14,14 @@ The vLLM used in this docker image is based on [v0.10.2](https://github.com/vllm
| IPEX   | 2.8.10 |
| OneCCL   | 2021.15.4 |

-## 1. What's Supported?
+## 1. What's new in this release?
+
+* GPT-OSS 20B and 120B are supported in MXFP4 with optimized performance.
+* Attention kernel optimizations for the decoding phase bring a >10% end-to-end throughput improvement on 10+ models with 1k/512 input/output lengths.
+* MoE models are optimized with a persistent MoE GEMM kernel and a fused activation kernel to reduce kernel bubbles: Qwen3-30B-A3B achieves a 2.6x end-to-end improvement and DeepSeek-V2-Lite a 1.5x improvement.
+* vLLM 0.10.2 brings new features: P/D (prefill/decode) disaggregation, data parallelism (DP), tool calling, reasoning outputs, and structured outputs.
+
+## 2. What's Supported?

Intel GPUs benefit from enhancements brought by [vLLM V1 engine](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html), including:
@@ -49,7 +56,7 @@ Besides, following up vLLM V1 design, corresponding optimized kernels and featur
* Multi Modality Support

-In this release, image/audio input can be processed using Qwen2.5-VL series models and.............
+In this release, image/audio input can be processed using Qwen2.5-VL series models, such as Qwen/Qwen2.5-VL-32B-Instruct on 4 BMG cards.
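As a hedged illustration of the multi-modality note above, such a model could be served across 4 cards with vLLM's standard tensor-parallel option; the exact command is an assumption, not part of the commit:

```bash
# Illustrative sketch only: serve Qwen2.5-VL-32B-Instruct across 4 GPUs with tensor parallelism.
vllm serve Qwen/Qwen2.5-VL-32B-Instruct --tensor-parallel-size 4
```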
* Pooling Models Support
@@ -61,20 +68,17 @@ Besides, following up vLLM V1 design, corresponding optimized kernels and featur
* Data Parallelism

-vLLM supports [Data Parallel](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment.html) deployment, where model weights are replicated across separate instances/GPUs to process independent batches of requests. This will work with both dense and MoE models. But for Intel® GPUs, we currently don't support DP + EP for now.
+vLLM supports [Data Parallel](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment.html) deployment, where model weights are replicated across separate instances/GPUs to process independent batches of requests. This works with both dense and MoE models. Note that expert parallelism (EP) is still being enabled and will be supported soon.
+
+* MoE models
+
+Models with an MoE structure, such as GPT-OSS 20B/120B in MXFP4 format, DeepSeek-V2-Lite and Qwen/Qwen3-30B-A3B, are now supported.
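A minimal sketch of the data-parallel deployment described above, using vLLM's documented `--data-parallel-size` option; the model choice and rank count are assumptions for illustration, and `--gpu-memory-utilization=0.8` follows the known issue listed later for this model:

```bash
# Illustrative sketch only: replicate an MoE model across 2 data-parallel ranks on one node.
vllm serve Qwen/Qwen3-30B-A3B \
    --data-parallel-size 2 \
    --gpu-memory-utilization 0.8
```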
Other features like [reasoning_outputs](https://docs.vllm.ai/en/latest/features/reasoning_outputs.html), [structured_outputs](https://docs.vllm.ai/en/latest/features/structured_outputs.html) and [tool calling](https://docs.vllm.ai/en/latest/features/tool_calling.html) are supported now. We also have some experimental features supported, including:

* **torch.compile**: Can be enabled for fp16/bf16 path.
* **speculative decoding**: Supports methods `n-gram`, `EAGLE` and `EAGLE3`.
* **async scheduling**: Can be enabled by `--async-scheduling`. This may help reduce the CPU overheads, leading to better latency and throughput. However, async scheduling is currently not supported with some features such as structured outputs, speculative decoding, and pipeline parallelism.
-* **MoE models**: Models with MoE structure like gpt-oss, Deepseek-v2-lite and Qwen/Qwen3-30B-A3B are now supported.
-
-## Optimizations
-
-* FMHA Optimizations: XXXXX.
-* Tensor parallel inference: Intel® oneAPI Collective Communications Library(oneCCL) is optimized to provide boosted performance in Intel® Arc™ B-Series graphics cards. For details, please refer [2021.15.5](https://github.com/uxlfoundation/oneCCL/releases/download/2021.15.5/intel-oneccl-2021.15.5.4_offline.sh).
-* oneDNN GEMM optimization: fp8 gemm performance with batch size ranging from 1 to 128 are all optimized to above 80% TFLOPs efficiency.
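A hedged sketch of enabling the experimental `--async-scheduling` option from the feature list above; the model is an arbitrary pick from the supported-models table below, and the combination shown is an assumption:

```bash
# Illustrative sketch only: enable experimental async scheduling (not compatible with
# structured outputs, speculative decoding, or pipeline parallelism per the note above).
vllm serve meta-llama/Llama-3.1-8B-Instruct --async-scheduling
```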
## Supported Models
@@ -92,6 +96,12 @@ The table below lists models that have been verified by Intel. However, there sh
| Text Generation | Qwen/Qwen3-14B |✅︎|✅︎| |
| Text Generation | Qwen/Qwen3-32B |✅︎|✅︎| |
| Text Generation | Qwen/Qwen3-30B-A3B |✅︎|✅︎| |
+| Text Generation | Qwen/Qwen3-coder-30B-A3B-Instruct |✅︎|✅︎| |
+| Text Generation | Qwen/QwQ-32B |✅︎|✅︎| |
+| Text Generation | OpenGVLab/InternVL3_5-8B |✅︎|✅︎| |
+| Text Generation | OpenGVLab/InternVL3_5-14B |✅︎|✅︎| |
+| Text Generation | OpenGVLab/InternVL3_5-38B |✅︎|✅︎| |
+| Text Generation | openbmb/MiniCPM-V-4 |✅︎|✅︎| |
| Text Generation | deepseek-ai/DeepSeek-V2-Lite |✅︎|✅︎| |
| Text Generation | meta-llama/Llama-3.1-8B-Instruct |✅︎|✅︎| |
| Text Generation | baichuan-inc/Baichuan2-13B-Chat |✅︎|✅︎| |
@@ -112,24 +122,24 @@ The table below lists models that have been verified by Intel. However, there sh
| Embedding Model | Qwen/Qwen3-Embedding-8B |✅︎|✅︎| |
| Reranker Model | Qwen/Qwen3-Reranker-8B |✅︎|✅︎| |

-## 2. Limitations
+## 3. Limitations

Some vLLM V1 features may need extra support, including LoRA (Low-Rank Adaptation), pipeline parallelism on Ray, EP (Expert Parallelism) and MLA (Multi-head Latent Attention).

The following are known issues:

-* Qwen/Qwen3-30B-A3B need set `--gpu-memory-utilization=0.8` due to its high memory consumption.
+* Qwen/Qwen3-30B-A3B in FP16/BF16 needs `--gpu-memory-utilization=0.8` set due to its high memory consumption.
* W8A8 quantized models through llm_compressor are not supported yet, like RedHatAI/DeepSeek-R1-Distill-Qwen-32B-FP8-dynamic.

-## 3. How to Get Started
+## 4. How to Get Started

-### 3.1. Prerequisite
+### 4.1. Prerequisite

| OS | Hardware |
| ---------- | ---------- |
| Ubuntu 25.04 | Intel® Arc™ B-Series |

-### 3.2. Prepare a Serving Environment
+### 4.2. Prepare a Serving Environment

1. Get the released docker image with command `docker pull intel/vllm:0.10.2-xpu`
2. Instantiate a docker container with command `docker run -t -d --shm-size 10g --net=host --ipc=host --privileged -v /dev/dri/by-path:/dev/dri/by-path --name=vllm-test --device /dev/dri:/dev/dri --entrypoint= intel/vllm:0.10.2-xpu /bin/bash`
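Once the container from step 2 is up, a typical next step (an assumption about the workflow, not shown in this diff) is to open a shell inside it:

```bash
# Illustrative sketch only: attach an interactive shell to the container started above.
docker exec -it vllm-test /bin/bash
```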
@@ -143,9 +153,9 @@ In both environments, you may then wish to set a `HUGGING_FACE_HUB_TOKEN` enviro
export HUGGING_FACE_HUB_TOKEN=xxxxxx
```

-### 3.3. Launch Workloads
+### 4.3. Launch Workloads

-#### 3.3.1. Launch Server in the Server Environment
+#### 4.3.1. Launch Server in the Server Environment

Command:
@@ -188,7 +198,7 @@ INFO: Application startup complete.
It may take some time. When `INFO: Application startup complete.` appears, the server is ready.

-#### 3.3.2. Raise Requests for Benchmarking in the Client Environment
+#### 4.3.2. Raise Requests for Benchmarking in the Client Environment

We leverage a [benchmarking script](https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py) provided in vLLM to perform performance benchmarking. You can use your own client scripts as well.
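A hedged sketch of invoking that script against a running server, mirroring the 1k/512 input/output lengths cited in the release notes above; the flag set is an assumption about the script's common options and is not shown in this commit:

```bash
# Illustrative sketch only: random-prompt benchmark with 1k input / 512 output tokens per request.
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name random \
    --random-input-len 1024 \
    --random-output-len 512 \
    --num-prompts 128
```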
@@ -227,6 +237,6 @@ P99 ITL (ms): xxx
==================================================
```

-## 4. Need Assistance?
+## 5. Need Assistance?

Should you encounter any issues or have any questions, please submit an issue ticket at [vLLM Github Issues](https://github.com/vllm-project/vllm/issues). Include the text `[Intel GPU]` in the issue title to ensure it gets noticed.
