---
layout: post
title: "Announcing vLLM-Omni: Easy, Fast, and Cheap Omni-Modality Model Serving"
author: "The vLLM-Omni Team"
image: /assets/figures/2025-11-30-vllm-omni/vllm-omni-logo-text-dark.png
---

We are excited to announce the official release of **vLLM-Omni**, a major extension of the vLLM ecosystem designed to support the next generation of AI: omni-modality models.

<p align="center">
<img src="/assets/figures/2025-11-30-vllm-omni/vllm-omni-logo-text-dark.png" alt="vllm-omni logo" width="80%">
</p>


Since its inception, vLLM has focused on high-throughput, memory-efficient serving for Large Language Models (LLMs). However, the landscape of generative AI is shifting rapidly. Models are no longer just about text-in, text-out. Today's state-of-the-art models reason across text, images, audio, and video, and they generate heterogeneous outputs using diverse architectures.

**vLLM-Omni** is the first open-source framework for omni-modality model serving, extending vLLM's legendary performance to the world of multi-modal and non-autoregressive inference.

<p align="center">
<img src="/assets/figures/2025-11-30-vllm-omni/omni-modality-model-architecture.png" alt="omni-modality model architecture" width="80%">
</p>

## **Why vLLM-Omni?**

Traditional serving engines were optimized for text-based Autoregressive (AR) tasks. As models evolve into "omni" agents—capable of seeing, hearing, and speaking—the serving infrastructure must evolve with them.

vLLM-Omni addresses three critical shifts in model architecture:

1. **True Omni-Modality:** Processing and generating Text, Image, Video, and Audio seamlessly.
2. **Beyond Autoregression:** Extending vLLM's efficient memory management to **Diffusion Transformers (DiT)** and other parallel generation models.
3. **Heterogeneous Pipelines:** Managing complex workflows where a single request might trigger a visual encoder, an AR reasoning step, and a diffusion-based video generation step.

## **Inside the Architecture**

vLLM-Omni is not just a wrapper; it is a re-imagining of how vLLM handles data flow. It introduces a fully disaggregated pipeline that allows dynamic resource allocation across the different stages of generation. As shown above, the architecture unifies distinct phases, sketched conceptually after the list below:

* **Modality Encoders:** efficiently process inputs (ViT, T5, etc.).
* **LLM Core:** leverages vLLM's PagedAttention for the autoregressive reasoning stage.
* **Modality Generators:** serve DiT and other decoding heads at high performance to produce rich media outputs.
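
To make this data flow concrete, here is a deliberately simplified Python sketch of a request passing through the three phases. It is illustrative only: the request object and the function names are ours, not the actual OmniStage API, and in the real engine these stages run as disaggregated components that are scheduled and scaled independently.

```python
# Illustrative sketch of the omni-modality data flow (not the vLLM-Omni API).
# Each stage is shown as a plain function; in the real engine these are
# disaggregated components that can be scheduled and scaled independently.
from dataclasses import dataclass, field


@dataclass
class OmniRequest:
    prompt: str                                       # text part of the request
    media_inputs: list = field(default_factory=list)  # e.g. images or audio clips
    embeddings: list = field(default_factory=list)    # filled by the encoder stage
    reasoning: str = ""                               # filled by the autoregressive LLM core
    outputs: list = field(default_factory=list)       # filled by the modality generators


def encode_inputs(req: OmniRequest) -> OmniRequest:
    """Modality encoders (e.g. ViT for images, T5 for text) turn raw inputs into embeddings."""
    req.embeddings = [f"embedding({m})" for m in req.media_inputs]
    return req


def reason(req: OmniRequest) -> OmniRequest:
    """The autoregressive LLM core consumes the prompt plus embeddings and produces a plan."""
    req.reasoning = f"plan for '{req.prompt}' using {len(req.embeddings)} encoded inputs"
    return req


def generate_media(req: OmniRequest) -> OmniRequest:
    """Modality generators (e.g. a DiT head) render the final image/audio/video outputs."""
    req.outputs = [f"generated media from: {req.reasoning}"]
    return req


if __name__ == "__main__":
    request = OmniRequest(prompt="draw a cat surfing", media_inputs=["reference.png"])
    for stage in (encode_inputs, reason, generate_media):
        request = stage(request)
    print(request.outputs)
```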

### **Key Features**

<p align="center">
<img src="/assets/figures/2025-11-30-vllm-omni/vllm-omni-user-interface.png" alt="vllm-omni user interface" width="80%">
</p>

* **Simplicity:** If you know how to use vLLM, you know how to use vLLM-Omni. We maintain seamless integration with Hugging Face models and offer an OpenAI-compatible API server.

* **Flexibility:** With the OmniStage abstraction, we provide a straightforward way to support various omni-modality models, including Qwen-Omni, Qwen-Image, and other state-of-the-art models.

* **Performance:** We use pipelined stage execution to overlap computation across stages for high throughput, ensuring that while one stage is processing, the others are not idle; the figure and the sketch below illustrate this overlap.

<p align="center">
<img src="/assets/figures/2025-11-30-vllm-omni/vllm-omni-pipeline-async-stage.png" alt="vllm-omni pipelined stage execution" width="80%">
</p>
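
The snippet below is a toy Python illustration of the same pipelining idea: two stages connected by a queue, so the encoder can start on the next request while the generator is still busy with the current one. The coroutines, durations, and queue size are invented for the example and say nothing about vLLM-Omni's actual scheduler.

```python
# Toy illustration of pipelined stage execution (not the vLLM-Omni scheduler).
# Two stages are connected by a queue: while the "generate" stage works on
# request N, the "encode" stage is already processing request N + 1.
import asyncio
import time

START = time.monotonic()


def elapsed() -> float:
    return time.monotonic() - START


async def encode_stage(requests: list[str], queue: asyncio.Queue) -> None:
    for req in requests:
        await asyncio.sleep(0.5)   # stand-in for encoder compute
        print(f"[{elapsed():4.1f}s] encoded   {req}")
        await queue.put(req)
    await queue.put(None)          # sentinel: no more requests


async def generate_stage(queue: asyncio.Queue) -> None:
    while (req := await queue.get()) is not None:
        await asyncio.sleep(1.0)   # stand-in for diffusion / decoding compute
        print(f"[{elapsed():4.1f}s] generated {req}")


async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue(maxsize=2)
    requests = [f"request-{i}" for i in range(4)]
    # Running both stages concurrently overlaps their compute: total latency
    # approaches the slower stage's total time rather than the sum of both.
    await asyncio.gather(encode_stage(requests, queue), generate_stage(queue))


if __name__ == "__main__":
    asyncio.run(main())
```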

We benchmarked vLLM-Omni against Hugging Face Transformers to demonstrate the efficiency gains in omni-modal serving.

<p align="center">
<img src="/assets/figures/2025-11-30-vllm-omni/vllm-omni-vs-hf.png" alt="vLLM-Omni against Hugging Face Transformers" width="80%">
</p>


## **Future Roadmap**

vLLM-Omni is evolving rapidly. Our roadmap is focused on expanding model support and pushing the boundaries of efficient inference even further.

* **Expanded Model Support:** We plan to support a wider range of open-source omni-models and diffusion transformers as they emerge.
* **Deeper vLLM Integration:** We will merge core omni features upstream to make multi-modality a first-class citizen across the vLLM ecosystem.
* **Diffusion Acceleration:** Parallel inference (DP/TP/SP/USP, etc.), cache acceleration (TeaCache, DBCache, etc.), and compute acceleration (quantization, sparse attention, etc.).
* **Full Disaggregation:** Building on the OmniStage abstraction, we expect to disaggregate the encoder, prefill, decode, and generation stages to improve throughput and reduce latency.
* **Hardware Support:** Following the hardware plugin system, we plan to expand support for additional hardware backends so that vLLM-Omni runs efficiently everywhere.


## **Getting Started**

Getting started with vLLM-Omni is straightforward. The initial vllm-omni v0.11.0rc release is built on top of vLLM v0.11.0.

### **Installation**

Check out our [Installation Doc](https://vllm-omni.readthedocs.io/en/latest/getting_started/installation/) for details.

### **Serving omni-modality models**

Check out our [examples directory](https://github.com/vllm-project/vllm-omni/tree/main/examples) for scripts that launch image, audio, and video generation workflows. vLLM-Omni also provides Gradio support to improve the user experience; the screenshot below shows a demo of serving Qwen-Image, followed by a minimal Python client sketch.

<p align="center">
<img src="/assets/figures/2025-11-30-vllm-omni/vllm-omni-gradio-serving-demo.png" alt="vllm-omni serving qwen-image with gradio" width="80%">
</p>
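
Because the server speaks the OpenAI-compatible API, any standard OpenAI client can query it. The snippet below is a minimal sketch under assumed values: the endpoint `http://localhost:8000/v1` and the model name `Qwen/Qwen2.5-Omni-7B` are placeholders chosen for illustration; substitute the host, port, and model from your own serving command (see the examples directory and docs above).

```python
# Minimal sketch of querying a locally served OpenAI-compatible endpoint.
# The base_url, api_key, and model name below are assumptions for illustration;
# use the values from your own vLLM-Omni deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local endpoint
    api_key="EMPTY",                      # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Omni-7B",         # assumed model name; match your deployment
    messages=[
        {"role": "user", "content": "Describe this scene and suggest background music."},
    ],
)
print(response.choices[0].message.content)
```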

## **Join the Community**

This is just the beginning for omni-modality serving. We are actively developing support for more architectures and invite the community to help shape the future of vLLM-Omni.

* **Code & Docs:** [GitHub Repository](https://github.com/vllm-project/vllm-omni) | [Documentation](https://vllm-omni.readthedocs.io/en/latest/)
* **Weekly Meeting:** Join us every Wednesday at 11:30 (UTC+8) to discuss the roadmap and upcoming features. [Join here](https://tinyurl.com/vllm-omni-meeting).

Let's build the future of omni-modal serving together!