---
layout: post
title: "Announcing vLLM-Omni: Easy, Fast, and Cheap Omni-Modality Model Serving"
author: "The vLLM-Omni Team"
image: /assets/figures/2025-11-30-vllm-omni/vllm-omni-logo-text-dark.png
---

We are excited to announce the official release of **vLLM-Omni**, a major extension of the vLLM ecosystem designed to support the next generation of AI: omni-modality models.

<p align="center">
<img src="/assets/figures/2025-11-30-vllm-omni/vllm-omni-logo-text-dark.png" alt="vllm-omni logo" width="80%">
</p>


Since its inception, vLLM has focused on high-throughput, memory-efficient serving for Large Language Models (LLMs). However, the landscape of generative AI is shifting rapidly. Models are no longer just about text-in, text-out. Today's state-of-the-art models reason across text, images, audio, and video, and they generate heterogeneous outputs using diverse architectures.

**vLLM-Omni** is the first open-source framework for omni-modality model serving, extending vLLM's legendary performance to the world of multi-modal and non-autoregressive inference.

<p align="center">
<img src="/assets/figures/2025-11-30-vllm-omni/omni-modality-model-architecture.png" alt="omni-modality model architecture" width="80%">
</p>

## **Why vLLM-Omni?**

Traditional serving engines were optimized for text-based Autoregressive (AR) tasks. As models evolve into "omni" agents—capable of seeing, hearing, and speaking—the serving infrastructure must evolve with them.

vLLM-Omni addresses three critical shifts in model architecture:

1. **True Omni-Modality:** Processing and generating Text, Image, Video, and Audio seamlessly.
2. **Beyond Autoregression:** Extending vLLM's efficient memory management to **Diffusion Transformers (DiT)** and other parallel generation models.
3. **Heterogeneous Pipelines:** Managing complex workflows where a single request might trigger a visual encoder, an AR reasoning step, and a diffusion-based video generation step.

## **Inside the Architecture**

vLLM-Omni is not just a wrapper; it is a re-imagining of how vLLM handles data flow. It introduces a fully disaggregated pipeline that allows dynamic resource allocation across the different stages of generation. As shown above, the architecture unifies distinct phases, sketched conceptually after the list below:

* **Modality Encoders:** efficiently process inputs (ViT, T5, etc.).
* **LLM Core:** leverages vLLM's PagedAttention for the autoregressive reasoning stage.
* **Modality Generators:** serve DiT and other decoding heads at high performance to produce rich media outputs.
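
To make this data flow concrete, here is a deliberately simplified Python sketch of a request passing through the three phases. It is illustrative only: the request object and the function names are ours, not the actual OmniStage API, and in the real engine these stages run as disaggregated components that are scheduled and scaled independently.

```python
# Illustrative sketch of the omni-modality data flow (not the vLLM-Omni API).
# Each stage is shown as a plain function; in the real engine these are
# disaggregated components that can be scheduled and scaled independently.
from dataclasses import dataclass, field


@dataclass
class OmniRequest:
    prompt: str                                       # text part of the request
    media_inputs: list = field(default_factory=list)  # e.g. images or audio clips
    embeddings: list = field(default_factory=list)    # filled by the encoder stage
    reasoning: str = ""                               # filled by the autoregressive LLM core
    outputs: list = field(default_factory=list)       # filled by the modality generators


def encode_inputs(req: OmniRequest) -> OmniRequest:
    """Modality encoders (e.g. ViT for images, T5 for text) turn raw inputs into embeddings."""
    req.embeddings = [f"embedding({m})" for m in req.media_inputs]
    return req


def reason(req: OmniRequest) -> OmniRequest:
    """The autoregressive LLM core consumes the prompt plus embeddings and produces a plan."""
    req.reasoning = f"plan for '{req.prompt}' using {len(req.embeddings)} encoded inputs"
    return req


def generate_media(req: OmniRequest) -> OmniRequest:
    """Modality generators (e.g. a DiT head) render the final image/audio/video outputs."""
    req.outputs = [f"generated media from: {req.reasoning}"]
    return req


if __name__ == "__main__":
    request = OmniRequest(prompt="draw a cat surfing", media_inputs=["reference.png"])
    for stage in (encode_inputs, reason, generate_media):
        request = stage(request)
    print(request.outputs)
```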

### **Key Features**

<p align="center">
<img src="/assets/figures/2025-11-30-vllm-omni/vllm-omni-user-interface.png" alt="vllm-omni user interface" width="80%">
</p>

* **Simplicity:** If you know how to use vLLM, you know how to use vLLM-Omni. We maintain seamless integration with Hugging Face models and offer an OpenAI-compatible API server.

* **Flexibility:** With the OmniStage abstraction, we provide a straightforward way to support various omni-modality models, including Qwen-Omni, Qwen-Image, and other state-of-the-art models.

* **Performance:** We use pipelined stage execution to overlap computation across stages for high throughput, ensuring that while one stage is processing, the others are not idle; the figure and the sketch below illustrate this overlap.

<p align="center">
<img src="/assets/figures/2025-11-30-vllm-omni/vllm-omni-pipeline-async-stage.png" alt="vllm-omni pipelined stage execution" width="80%">
</p>
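
The snippet below is a toy Python illustration of the same pipelining idea: two stages connected by a queue, so the encoder can start on the next request while the generator is still busy with the current one. The coroutines, durations, and queue size are invented for the example and say nothing about vLLM-Omni's actual scheduler.

```python
# Toy illustration of pipelined stage execution (not the vLLM-Omni scheduler).
# Two stages are connected by a queue: while the "generate" stage works on
# request N, the "encode" stage is already processing request N + 1.
import asyncio
import time

START = time.monotonic()


def elapsed() -> float:
    return time.monotonic() - START


async def encode_stage(requests: list[str], queue: asyncio.Queue) -> None:
    for req in requests:
        await asyncio.sleep(0.5)   # stand-in for encoder compute
        print(f"[{elapsed():4.1f}s] encoded   {req}")
        await queue.put(req)
    await queue.put(None)          # sentinel: no more requests


async def generate_stage(queue: asyncio.Queue) -> None:
    while (req := await queue.get()) is not None:
        await asyncio.sleep(1.0)   # stand-in for diffusion / decoding compute
        print(f"[{elapsed():4.1f}s] generated {req}")


async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue(maxsize=2)
    requests = [f"request-{i}" for i in range(4)]
    # Running both stages concurrently overlaps their compute: total latency
    # approaches the slower stage's total time rather than the sum of both.
    await asyncio.gather(encode_stage(requests, queue), generate_stage(queue))


if __name__ == "__main__":
    asyncio.run(main())
```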

We benchmarked vLLM-Omni against Hugging Face Transformers to demonstrate the efficiency gains in omni-modal serving.

<p align="center">
<img src="/assets/figures/2025-11-30-vllm-omni/vllm-omni-vs-hf.png" alt="vLLM-Omni against Hugging Face Transformers" width="80%">
</p>


## **Future Roadmap**

vLLM-Omni is evolving rapidly. Our roadmap is focused on expanding model support and pushing the boundaries of efficient inference even further.

* **Expanded Model Support:** We plan to support a wider range of open-source omni-models and diffusion transformers as they emerge.
* **Deeper vLLM Integration:** We will merge core omni features upstream to make multi-modality a first-class citizen across the vLLM ecosystem.
* **Diffusion Acceleration:** Parallel inference (DP/TP/SP/USP, etc.), cache acceleration (TeaCache, DBCache, etc.), and compute acceleration (quantization, sparse attention, etc.).
* **Full Disaggregation:** Building on the OmniStage abstraction, we expect to disaggregate the encoder, prefill, decode, and generation stages to improve throughput and reduce latency.
* **Hardware Support:** Following the hardware plugin system, we plan to expand support for additional hardware backends so that vLLM-Omni runs efficiently everywhere.


## **Getting Started**

Getting started with vLLM-Omni is straightforward. The initial vllm-omni v0.11.0rc release is built on top of vLLM v0.11.0.

### **Installation**

Check out our [Installation Doc](https://vllm-omni.readthedocs.io/en/latest/getting_started/installation/) for details.

### **Serving omni-modality models**

Check out our [examples directory](https://github.com/vllm-project/vllm-omni/tree/main/examples) for scripts that launch image, audio, and video generation workflows. vLLM-Omni also provides Gradio support to improve the user experience; the screenshot below shows a demo of serving Qwen-Image, followed by a minimal Python client sketch.

<p align="center">
<img src="/assets/figures/2025-11-30-vllm-omni/vllm-omni-gradio-serving-demo.png" alt="vllm-omni serving qwen-image with gradio" width="80%">
</p>
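
Because the server speaks the OpenAI-compatible API, any standard OpenAI client can query it. The snippet below is a minimal sketch under assumed values: the endpoint `http://localhost:8000/v1` and the model name `Qwen/Qwen2.5-Omni-7B` are placeholders chosen for illustration; substitute the host, port, and model from your own serving command (see the examples directory and docs above).

```python
# Minimal sketch of querying a locally served OpenAI-compatible endpoint.
# The base_url, api_key, and model name below are assumptions for illustration;
# use the values from your own vLLM-Omni deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local endpoint
    api_key="EMPTY",                      # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Omni-7B",         # assumed model name; match your deployment
    messages=[
        {"role": "user", "content": "Describe this scene and suggest background music."},
    ],
)
print(response.choices[0].message.content)
```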

## **Join the Community**

This is just the beginning for omni-modality serving. We are actively developing support for more architectures and invite the community to help shape the future of vLLM-Omni.

* **Code & Docs:** [GitHub Repository](https://github.com/vllm-project/vllm-omni) | [Documentation](https://vllm-omni.readthedocs.io/en/latest/)
* **Weekly Meeting:** Join us every Wednesday at 11:30 (UTC+8) to discuss the roadmap and upcoming features. [Join here](https://tinyurl.com/vllm-omni-meeting).

Let's build the future of omni-modal serving together!