The vLLM used in this docker image is based on [v0.10.2](https://github.com/vllm-project/vllm).
| IPEX | 2.8.10 |
| OneCCL | 2021.15.4 |
## 1. What's New in This Release?
* GPT-OSS 20B and 120B are supported in MXFP4 with optimized performance (see the serving sketch after this list).
* Attention kernel optimizations for the decoding phase bring >10% end-to-end throughput improvement on 10+ models with 1K/512 input/output lengths.
* MoE models are optimized with a persistent MoE GEMM kernel and a fused activation kernel to reduce kernel bubbles. Qwen3-30B-A3B achieves a 2.6x end-to-end improvement and DeepSeek-V2-Lite achieves a 1.5x end-to-end improvement.
* vLLM 0.10.2 brings new features: P/D (prefill/decode) disaggregation, data parallelism (DP), tool calling, reasoning outputs, and structured outputs.
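
As a quick illustration of the GPT-OSS support mentioned above, here is a minimal serving sketch; the model ID, port and any omitted tuning flags are assumptions for this example rather than values taken from this guide:

```
# Minimal sketch (assumed example, not the exact command from this guide):
# GPT-OSS 20B ships with MXFP4 weights, so serving it directly exercises the MXFP4 path.
vllm serve openai/gpt-oss-20b --port 8000
```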
## 2. What's Supported?
Intel GPUs benefit from enhancements brought by the [vLLM V1 engine](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html), including:
Besides, following up the vLLM V1 design, corresponding optimized kernels and features are supported:
* Multi Modality Support
In this release, image/audio input can be processed using Qwen2.5-VL series models, like Qwen/Qwen2.5-VL-32B-Instruct on 4 BMG cards.
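
As a rough sketch of how such a deployment might look (the port, request payload and image URL below are illustrative assumptions, not values from this guide), the model could be served across 4 cards and queried through the OpenAI-compatible API:

```
# Sketch only: serve Qwen2.5-VL-32B-Instruct with tensor parallelism across 4 GPUs.
vllm serve Qwen/Qwen2.5-VL-32B-Instruct --tensor-parallel-size 4 --port 8000

# Example image request against the OpenAI-compatible chat endpoint.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-VL-32B-Instruct",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}}
          ]
        }]
      }'
```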
* Pooling Models Support
* Data Parallelism
vLLM supports [Data Parallel](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment.html) deployment, where model weights are replicated across separate instances/GPUs to process independent batches of requests. This works with both dense and MoE models. Note that expert parallelism (EP) is still being enabled and will be supported soon.
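
A minimal data-parallel launch sketch, assuming vLLM's `--data-parallel-size` option and an illustrative model choice:

```
# Sketch only: two data-parallel ranks, each holding a full replica of the model weights
# and serving its own batches of requests.
vllm serve meta-llama/Llama-3.1-8B-Instruct --data-parallel-size 2 --port 8000
```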
* MoE models
Models with MoE structure like GPT-OSS 20B/120B in MXFP4 format, DeepSeek-V2-Lite and Qwen/Qwen3-30B-A3B are now supported.
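
As an example, a hedged sketch of serving one of these MoE models; the tensor-parallel size is an assumption, and the reduced memory utilization mirrors the note in the Limitations section below:

```
# Sketch only: serve Qwen3-30B-A3B; --gpu-memory-utilization=0.8 is recommended in the
# Limitations section due to the model's high memory consumption.
vllm serve Qwen/Qwen3-30B-A3B --tensor-parallel-size 4 --gpu-memory-utilization 0.8
```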
Other features like [reasoning_outputs](https://docs.vllm.ai/en/latest/features/reasoning_outputs.html), [structured_outputs](https://docs.vllm.ai/en/latest/features/structured_outputs.html) and [tool calling](https://docs.vllm.ai/en/latest/features/tool_calling.html) are supported now. We also support some experimental features, including:
* **torch.compile**: Can be enabled for the fp16/bf16 path.
* **speculative decoding**: Supports the `n-gram`, `EAGLE` and `EAGLE3` methods.
* **async scheduling**: Can be enabled with `--async-scheduling`. This may help reduce CPU overheads, leading to better latency and throughput. However, async scheduling is currently not supported with some features such as structured outputs, speculative decoding, and pipeline parallelism. A launch sketch for these options follows this list.
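
As a hedged illustration of these experimental options (the model and the speculative-decoding settings are assumptions, not values from this guide):

```
# Sketch only: async scheduling and speculative decoding are shown as two separate
# launches, since async scheduling is not supported together with speculative decoding.
vllm serve meta-llama/Llama-3.1-8B-Instruct --async-scheduling

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-config '{"method": "ngram", "num_speculative_tokens": 3, "prompt_lookup_max": 4}'
```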
## Supported Models
The table below lists models that have been verified by Intel. However, there should be more models that work on Intel GPUs beyond this list.
| Text Generation | Qwen/Qwen3-14B |✅︎|✅︎||
| Text Generation | Qwen/Qwen3-32B |✅︎|✅︎||
| Text Generation | Qwen/Qwen3-30B-A3B |✅︎|✅︎||
| Text Generation | Qwen/Qwen3-coder-30B-A3B-Instruct |✅︎|✅︎||
| Text Generation | Qwen/QwQ-32B |✅︎|✅︎||
| Text Generation | OpenGVLab/InternVL3_5-8B |✅︎|✅︎||
| Text Generation | OpenGVLab/InternVL3_5-14B |✅︎|✅︎||
| Text Generation | OpenGVLab/InternVL3_5-38B |✅︎|✅︎||
| Text Generation | openbmb/MiniCPM-V-4 |✅︎|✅︎||
| Text Generation | deepseek-ai/DeepSeek-V2-Lite |✅︎|✅︎||
| Text Generation | meta-llama/Llama-3.1-8B-Instruct |✅︎|✅︎||
| Text Generation | baichuan-inc/Baichuan2-13B-Chat |✅︎|✅︎||
| Embedding Model | Qwen/Qwen3-Embedding-8B |✅︎|✅︎||
| Reranker Model | Qwen/Qwen3-Reranker-8B |✅︎|✅︎||
## 3. Limitations
Some vLLM V1 features may need extra support, including LoRA (Low-Rank Adaptation), pipeline parallelism on Ray, EP (Expert Parallelism) and MLA (Multi-head Latent Attention).
The following are known issues:
* Qwen/Qwen3-30B-A3B in FP16/BF16 needs `--gpu-memory-utilization=0.8` due to its high memory consumption.
* W8A8 models quantized with llm_compressor, like RedHatAI/DeepSeek-R1-Distill-Qwen-32B-FP8-dynamic, are not supported yet.
## 4. How to Get Started
### 4.1. Prerequisite
| OS | Hardware |
| ---------- | ---------- |
| Ubuntu 25.04 | Intel® Arc™ B-Series |
### 4.2. Prepare a Serving Environment
1. Get the released Docker image with the command `docker pull intel/vllm:0.10.2-xpu`.
2. Instantiate a Docker container with the command `docker run -t -d --shm-size 10g --net=host --ipc=host --privileged -v /dev/dri/by-path:/dev/dri/by-path --name=vllm-test --device /dev/dri:/dev/dri --entrypoint= intel/vllm:0.10.2-xpu /bin/bash`.
In both environments, you may then wish to set a `HUGGING_FACE_HUB_TOKEN` environment variable:

```
export HUGGING_FACE_HUB_TOKEN=xxxxxx
```
### 4.3. Launch Workloads
#### 4.3.1. Launch Server in the Server Environment
It may take some time. The message `INFO: Application startup complete.` indicates that the server is ready.
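
If you need a starting point, the following is only an illustrative sketch of launching an OpenAI-compatible server inside the container; the model and flags are assumptions rather than the exact command from this guide:

```
# Illustrative sketch only: start an OpenAI-compatible server and wait for
# "INFO: Application startup complete." before sending requests.
vllm serve meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 8000
```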
#### 4.3.2. Raise Requests for Benchmarking in the Client Environment
We leverage a [benchmarking script](https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py) provided in vLLM to perform performance benchmarking. You can use your own client scripts as well.
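
A hedged example invocation of that script (the model, dataset settings, request count and port are assumptions; align them with the server you launched):

```
# Sketch only: benchmark an already-running server with random prompts of
# 1024 input / 512 output tokens.
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 512 \
  --num-prompts 100 \
  --host 127.0.0.1 --port 8000
```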
Should you encounter any issues or have any questions, please submit an issue ticket at [vLLM Github Issues](https://github.com/vllm-project/vllm/issues). Include the text `[Intel GPU]` in the issue title to ensure it gets noticed.