
@yma11 commented Sep 24, 2025

This PR provides release notes for the vLLM v0.10.2 release on Intel Multi-Arc, including key features, optimizations, and HowTos.

| Component | Version    |
|-----------|------------|
| oneAPI    | 2025.1.3-0 |
| PyTorch   | 2.8        |
| IPEX      | 2.8.10     |
| oneCCL    | 2021.15.4  |

The oneCCL version is likely to change; keep it as a placeholder and update it once the BKC release happens.


vLLM supports pooling models such as embedding, classification, and reward models. All of these models are now supported on Intel® GPUs. For detailed usage, refer to the [pooling models guide](https://docs.vllm.ai/en/latest/models/pooling_models.html).
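A minimal sketch of serving an embedding model and querying it, assuming the standard `vllm serve` CLI and its OpenAI-compatible `/v1/embeddings` endpoint; the model name and the `--task embed` flag are illustrative, not specific to this release:

```bash
# Illustrative: serve an embedding (pooling) model.
vllm serve BAAI/bge-base-en-v1.5 --task embed

# Query the OpenAI-compatible embeddings endpoint.
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "BAAI/bge-base-en-v1.5", "input": "Intel Arc GPUs running vLLM"}'
```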

* Pipeline Parallelism
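A minimal sketch of enabling pipeline parallelism, assuming the upstream `--pipeline-parallel-size` flag; the model name and GPU count are illustrative:

```bash
# Illustrative: split the model across 2 GPUs with pipeline parallelism.
vllm serve meta-llama/Llama-3.1-8B-Instruct --pipeline-parallel-size 2
```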

If we roll back the oneCCL release to 2021.15.3, PP falls back to the naive implementation without the performance gains, and we would lose this feature.


* Data Parallelism

vLLM supports [Data Parallel](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment.html) deployment, where model weights are replicated across separate instances/GPUs to process independent batches of requests. This works with both dense and MoE models. However, DP + EP is not yet supported on Intel® GPUs.
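A minimal sketch of a data-parallel deployment, assuming the `--data-parallel-size` flag described in the linked guide; the model name and replica count are illustrative:

```bash
# Illustrative: 2 data-parallel replicas, each processing independent request batches.
vllm serve Qwen/Qwen2.5-7B-Instruct --data-parallel-size 2
```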

"This will work with both dense and MoE models. But for Intel® GPUs, we currently don't support DP + EP for now."
-> This will work with both dense and MoE models. Note export parallelism is under enabling that will be supported soon.

* **torch.compile**: Can be enabled for the fp16/bf16 path.
* **speculative decoding**: Supports the `n-gram`, `EAGLE`, and `EAGLE3` methods.
* **async scheduling**: Can be enabled with `--async-scheduling`. This may help reduce CPU overhead, leading to better latency and throughput. However, async scheduling is currently not supported together with some features such as structured outputs, speculative decoding, and pipeline parallelism (see the sketch after this list).
* **MoE models**: Models with an MoE structure, such as gpt-oss, DeepSeek-V2-Lite, and Qwen/Qwen3-30B-A3B, are now supported.
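Minimal sketches of the two scheduling-related options above; the model name is illustrative, and the `--speculative-config` JSON keys follow upstream vLLM documentation rather than anything verified against this release. Note the two flags cannot be combined, per the limitation above:

```bash
# Illustrative: enable async scheduling (not combinable with speculative decoding).
vllm serve meta-llama/Llama-3.1-8B-Instruct --async-scheduling

# Illustrative: n-gram speculative decoding configured via --speculative-config.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 4}'
```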

@rogerxfeng8 commented Sep 25, 2025

MoE models are officially supported in this release, not "experimental". MoE is actually one of the key model families we optimized, besides multimodality.
Let's move the MoE models to the official feature list. GPT-OSS 20B and 120B in the mxfp4 data type should be highlighted here.


The following are known issues:

* Qwen/Qwen3-30B-A3B needs `--gpu-memory-utilization=0.8` due to its high memory consumption.
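For concreteness, a sketch applying the workaround from the note above; the value 0.8 is taken directly from that note:

```bash
# Illustrative: cap GPU memory utilization for Qwen3-30B-A3B as noted above.
vllm serve Qwen/Qwen3-30B-A3B --gpu-memory-utilization 0.8
```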

Is this still the case, or does it only apply to fp16/bf16? For fp8, my understanding is that it can work with `--gpu-memory-utilization=0.9`.


## Optimizations

* FMHA Optimizations: XXXXX.

- Attention kernel optimizations for the decoding steps.
- MoE model optimizations using a persistent MoE GEMM kernel and a fused activation kernel to reduce kernel bubbles.
