| IPEX | 2.8.10 |
| OneCCL | 2021.15.4 |
## 1. What's new in this release?

* GPT-OSS 20B and 120B are supported in MXFP4 format with optimized performance.
* Attention kernel optimizations for the decoding phase bring a >10% end-to-end throughput improvement on 10+ models with 1k/512 input/output lengths.
* MoE models are optimized with a persistent MoE GEMM kernel and a fused activation kernel to reduce kernel bubbles: Qwen3-30B-A3B achieves a 2.6x end-to-end improvement and DeepSeek-V2-Lite a 1.5x improvement.
* Rebase to vLLM 0.10.2, bringing new features: P/D disaggregation, data parallelism (DP), tool calling, reasoning outputs, and structured outputs.

## 2. What's Supported?
Intel GPUs benefit from enhancements brought by [vLLM V1 engine](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html), including:
* Multi Modality Support

In this release, image/audio inputs can be processed with Qwen2.5-VL series models, e.g. Qwen/Qwen2.5-VL-32B-Instruct on 4 BMG cards.
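
As a rough reference, a 4-card launch of the 32B VL model could look like the sketch below, assuming the `vllm serve` entrypoint is available inside the container; the port, context length and image URL are placeholders to adapt to your deployment.

```bash
# Minimal sketch: serve Qwen2.5-VL-32B-Instruct across 4 Arc B-series GPUs.
# Context length and port are illustrative; tune them for your setup.
vllm serve Qwen/Qwen2.5-VL-32B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 8192 \
    --port 8000

# From a second shell, query it through the OpenAI-compatible API with an
# image URL (placeholder URL shown here).
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen/Qwen2.5-VL-32B-Instruct",
          "messages": [{
            "role": "user",
            "content": [
              {"type": "text", "text": "Describe this image."},
              {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}}
            ]
          }]
        }'
```
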
* Pooling Models Support
* Data Parallelism

vLLM supports [Data Parallel](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment.html) deployment, where model weights are replicated across separate instances/GPUs to process independent batches of requests. This works with both dense and MoE models. Note that expert parallelism (EP) is still being enabled and will be supported in a future release.
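
For example, a two-replica data-parallel launch might look like the sketch below; the flag spelling follows the upstream vLLM data parallel deployment guide linked above, and the model choice is only an example.

```bash
# Minimal sketch: replicate the same weights on 2 GPUs, each replica serving
# its own independent batches behind a single endpoint.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --data-parallel-size 2 \
    --port 8000
```
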

* MoE models

Models with an MoE structure, like GPT-OSS 20B/120B in MXFP4 format, DeepSeek-V2-Lite and Qwen/Qwen3-30B-A3B, are now supported.
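
A GPT-OSS launch could look like the following sketch; the Hugging Face model id and context length are assumptions to adjust to your checkpoint and card memory.

```bash
# Minimal sketch: serve the 20B MoE model; the assumed openai/gpt-oss-20b
# checkpoint already ships MXFP4 weights, so no extra quantization flag should
# be needed. Add --tensor-parallel-size for the 120B variant.
vllm serve openai/gpt-oss-20b --max-model-len 8192
```
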

Other features like [reasoning_outputs](https://docs.vllm.ai/en/latest/features/reasoning_outputs.html), [structured_outputs](https://docs.vllm.ai/en/latest/features/structured_outputs.html) and [tool calling](https://docs.vllm.ai/en/latest/features/tool_calling.html) are now supported. We also provide some experimental features, including:

* **torch.compile**: Can be enabled for the fp16/bf16 path.
* **speculative decoding**: Supports the `n-gram`, `EAGLE` and `EAGLE3` methods.
* **async scheduling**: Can be enabled by `--async-scheduling`. This may help reduce CPU overheads, leading to better latency and throughput. However, async scheduling is currently not supported together with some features such as structured outputs, speculative decoding, and pipeline parallelism. A launch sketch for speculative decoding and async scheduling follows this list.
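
Below is a sketch of both experimental serving options; the `--speculative-config` JSON keys are assumed to follow upstream vLLM conventions, so double-check them against the vLLM version shipped in this image.

```bash
# Minimal sketch: n-gram speculative decoding (key names assumed from the
# upstream --speculative-config JSON format).
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --speculative-config '{"method": "ngram", "num_speculative_tokens": 3, "prompt_lookup_max": 4}'

# Minimal sketch: async scheduling. Run it as a separate deployment, since it
# is currently not compatible with speculative decoding or structured outputs.
vllm serve meta-llama/Llama-3.1-8B-Instruct --async-scheduling
```
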
## Supported Models

The table below lists models that have been verified by Intel.

| Text Generation | Qwen/Qwen3-14B |✅︎|✅︎||
| Text Generation | Qwen/Qwen3-32B |✅︎|✅︎||
| Text Generation | Qwen/Qwen3-30B-A3B |✅︎|✅︎||
| Text Generation | Qwen/Qwen3-coder-30B-A3B-Instruct |✅︎|✅︎||
| Text Generation | Qwen/QwQ-32B |✅︎|✅︎||
| Text Generation | OpenGVLab/InternVL3_5-8B |✅︎|✅︎||
| Text Generation | OpenGVLab/InternVL3_5-14B |✅︎|✅︎||
| Text Generation | OpenGVLab/InternVL3_5-38B |✅︎|✅︎||
| Text Generation | openbmb/MiniCPM-V-4 |✅︎|✅︎||
| Text Generation | deepseek-ai/DeepSeek-V2-Lite |✅︎|✅︎||
| Text Generation | meta-llama/Llama-3.1-8B-Instruct |✅︎|✅︎||
| Text Generation | baichuan-inc/Baichuan2-13B-Chat |✅︎|✅︎||
| Embedding Model | Qwen/Qwen3-Embedding-8B |✅︎|✅︎||
| Reranker Model | Qwen/Qwen3-Reranker-8B |✅︎|✅︎||

## 3. Limitations

Some vLLM V1 features may still need extra enabling, including LoRA (Low-Rank Adaptation), pipeline parallelism on Ray, EP (Expert Parallelism) and MLA (Multi-head Latent Attention).

The following are known issues:

* Qwen/Qwen3-30B-A3B in FP16/BF16 needs `--gpu-memory-utilization=0.8` due to its high memory consumption; see the launch sketch after this list.
* W8A8 models quantized with llm_compressor, such as RedHatAI/DeepSeek-R1-Distill-Qwen-32B-FP8-dynamic, are not supported yet.
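
For the first known issue above, a working launch could look like this sketch; the tensor-parallel degree is an assumption for a multi-card setup.

```bash
# Minimal sketch: cap the per-GPU memory fraction at 0.8 for Qwen3-30B-A3B,
# as noted in the known issues.
vllm serve Qwen/Qwen3-30B-A3B \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.8
```
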

Should you encounter any issues or have any questions, please submit an issue ticket at [vLLM GitHub Issues](https://github.com/vllm-project/vllm/issues). Include the text `[Intel GPU]` in the issue title to ensure it gets noticed.