
vllm-omni framework and support for qwen2.5-omni [WIP] #7

Closed
Gaohan123 wants to merge 7 commits into vllm-project:main from Gaohan123:dev

Conversation

@Gaohan123
Collaborator

  • Reorganize worker modules: rename gpu_diffusion_* to new naming scheme
  • Add new AR GPU worker and model runner
  • Add diffusion scheduler and utilities
  • Add stage configs for qwen2.5 omni model
  • Update __init__.py files across modules for new imports
  • Add output processor and arg utilities
  • Add test file for omni LLM

@gemini-code-assist

Summary of Changes

Hello @Gaohan123, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request lays the foundational framework for vllm-omni, enabling multi-stage, multimodal model inference within vLLM. It introduces a modular architecture where different model stages (e.g., autoregressive, diffusion) can be chained together, each with its own specialized worker and scheduler. The changes include significant refactoring of core components, new configuration mechanisms, and enhanced output processing to seamlessly handle diverse data types like text, images, and audio across these stages. This allows for complex generative workflows, exemplified by the integration of the Qwen2.5-Omni model.

Highlights

  • Framework Reorganization: Worker modules have been reorganized, renaming gpu_diffusion_* components to a new, more consistent naming scheme, and introducing new core modules for scheduling and distributed components.
  • New AR GPU Worker and Model Runner: A new ARModelRunner and ARGPUWorker have been added to handle autoregressive (AR) tasks, designed to expose per-request hidden representations and integrate multimodal embeddings.
  • Diffusion Scheduler and Utilities: A specialized DiffusionScheduler has been implemented, extending OmniScheduler, to optimize scheduling for diffusion models by allocating all required tokens at once and immediately marking requests as finished, suitable for single-step generation tasks.
  • Qwen2.5-Omni Model Support: Dedicated stage configurations (qwen2_5_omni.yaml) have been added to support the Qwen2.5-Omni-7B model, defining its 'thinker', 'talker', and 'code2wav' stages with specific worker and scheduler classes.
  • Enhanced Multimodal Output Processing: The MultimodalOutputProcessor has been refactored to inherit from VLLMOutputProcessor, introducing OmniRequestState to track and accumulate multimodal tensors (e.g., images, latents, audio) and route outputs based on their type.
  • Refactored OmniLLM Entrypoints: The OmniLLM class has been simplified to extend the base vllm.entrypoints.llm.LLM, while a new OmniLM class now acts as the orchestrator, loading stage configurations and managing multiple OmniLLM instances for multi-stage processing.
  • New Configuration and Argument Utilities: New configuration classes like OmniModelConfig and OmniConfig have been introduced, along with OmniEngineArgs to extend EngineArgs with Omni-specific parameters, facilitating flexible stage definition.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant architectural refactoring to support multi-stage, multi-modal models, specifically qwen2.5-omni, within a new vllm-omni framework. The changes are extensive and well-structured, introducing concepts like stage-based engine configurations, specialized workers (AR and Diffusion), and corresponding schedulers. The refactoring of OmniLLM into a pipeline orchestrator (OmniLM) and a single-stage executor (OmniLLM) is a clean design. The new output processor is also a major improvement, providing robust handling for various multimodal outputs. While the work is still in progress, the foundational framework is solid. I've identified a critical syntax error and a few medium-severity issues related to maintainability and dead code that should be addressed.

Comment on lines +143 to +149
"""
Scheduler for the diffusion model.
This scheduler is modified to stop the request immediately for the diffusion model.
This is because the diffusion model can generate the final image/audio in one step.
Note: This is just a minimal modification to the original scheduler, and there should be some further efforts to optimize the scheduler.
The original scheduler is still used for the AR model.
"""


critical

This block of text is formatted as a multi-line string (docstring), but it sits in the middle of the class definition rather than at the start of a module, class, or function. If it interrupts a statement, the Python interpreter will raise a SyntaxError when parsing this file; even where it parses, the string is evaluated and discarded as a no-op, so it is dead text either way. To fix this, convert it into a block comment by prefixing each line with a #.

# Scheduler for the diffusion model.
# This scheduler is modified to stop the request immediately for the diffusion model.
# This is because the diffusion model can generate the final image/audio in one step.
# Note: This is just a minimal modification to the original scheduler, and there should be some further efforts to optimize the scheduler. 
# The original scheduler is still used for the AR model.

Comment on lines +14 to +17
"""Diffusion fast path (translated from the original Chinese):
- Feed all of this request's input tokens at once (if zero, allocate one placeholder token).
- If the token budget cannot be satisfied in one shot, fall back to upstream vLLM's default scheduling.
"""


medium

This file contains numerous comments written in Chinese. For better maintainability and accessibility in an open-source project, it's highly recommended to write all comments in English. This ensures that a wider range of contributors can understand the code's intent. Please consider translating these and other Chinese comments in this file.
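For readers skimming the review, the fast-path logic quoted above amounts to the following sketch. This is a minimal illustration with assumed names and a scalar token budget, not this PR's code.

```python
def schedule_diffusion_request(num_input_tokens: int, token_budget: int):
    """Return how many tokens to allocate for a diffusion request in one
    step, or None to signal a fallback to vLLM's default scheduling.

    Mirrors the fast path described in the comment: feed the whole
    request at once, with a single placeholder token for empty inputs.
    """
    needed = num_input_tokens if num_input_tokens > 0 else 1
    if needed > token_budget:
        return None  # cannot satisfy in one shot; defer to default path
    return needed

print(schedule_diffusion_request(512, 8192))    # 512
print(schedule_diffusion_request(0, 8192))      # 1
print(schedule_diffusion_request(10_000, 8192)) # None
```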

@@ -0,0 +1,714 @@
"""


medium

This file appears to be a backup of the old omni_llm.py implementation. Including backup or temporary files in a pull request can lead to confusion and adds unnecessary code to the repository. It's best practice to rely on version control (like Git) for history and remove such files before merging.

Comment on lines +185 to +190
# if hasattr(self.model, "sample"):
# return self.model.sample(**kwargs)
# if hasattr(self.model, "forward"):
# return self.model.forward(**kwargs)
# if hasattr(self.model, "diffuse"):
# return self.model.diffuse(**kwargs)


medium

In the _run_diffusion method, the checks for model.sample and model.diffuse are commented out, leaving only model.forward as the active path. If the intention is to support only the forward method for the current model, it would be clearer to remove the commented-out code and add a comment explaining this limitation. However, to make the runner more generic for future diffusion models, consider uncommenting this logic.

Suggested change
# if hasattr(self.model, "sample"):
# return self.model.sample(**kwargs)
# if hasattr(self.model, "forward"):
# return self.model.forward(**kwargs)
# if hasattr(self.model, "diffuse"):
# return self.model.diffuse(**kwargs)
if hasattr(self.model, "sample"):
return self.model.sample(**kwargs)
if hasattr(self.model, "forward"):
return self.model.forward(**kwargs)
if hasattr(self.model, "diffuse"):
return self.model.diffuse(**kwargs)

@hsliuustc0106
Collaborator

based on my understanding, this huge PR can be split into the following mini PRs:

PR #1: Foundation & Infrastructure
Qwen3-Omni model configuration, loading, and tokenizer support
Scope: All foundational components needed for Qwen3-Omni support
Files:
vllm_omni/config/qwen3_omni_config.py (new)
vllm_omni/model_executor/model_loader/qwen3_omni_loader.py (new)
vllm_omni/model_executor/layers/qwen3_omni_tokenizer.py (new)
Update existing __init__.py files
Update vllm_omni/config/__init__.py
Changes:
Add Qwen3OmniConfig dataclass
Add model detection and loading logic
Add multimodal tokenizer wrapper
Add image preprocessing for Qwen3-Omni
Add configuration validation
Dependencies: None (can be merged independently)
Testing: Unit tests for all components

@hsliuustc0106
Collaborator

PR #2: Core Processing Components
Model execution, request processing, and output handling
Scope: Core processing logic for Qwen3-Omni
Files:

  1. vllm_omni/model_executor/models/qwen3_omni_model.py (new)
  2. vllm_omni/request/qwen3_omni_request.py (new)
  3. vllm_omni/engine/qwen3_omni_output_processor.py (new)
  4. Update vllm_omni/request.py
  5. Update vllm_omni/engine/output_processor.py

Changes:

  1. Implement Qwen3OmniModel class with forward pass logic
  2. Add Qwen3OmniRequest class with multimodal input validation
  3. Implement Qwen3OmniOutputProcessor class
  4. Add multimodal attention mechanisms
  5. Add request preprocessing and output post-processing

Dependencies: PR #1
Testing: Unit tests for model execution, request processing, and output handling

@hsliuustc0106
Collaborator

PR #3: Integration & API
Worker integration and API endpoints
Scope: Integration with existing vLLM-omni infrastructure
Files:

  1. vllm_omni/worker/qwen3_omni_worker.py (new)
  2. vllm_omni/entrypoints/qwen3_omni_api.py (new)
  3. Update vllm_omni/worker/gpu_diffusion_worker.py
  4. Update vllm_omni/entrypoints/api_server.py

Changes:

  1. Add Qwen3-Omni worker class
  2. Integrate with existing worker framework
  3. Add GPU memory management
  4. Add Qwen3-Omni specific API endpoints
  5. Add request/response schemas
  6. Add API documentation

Dependencies: PR #2
Testing: Integration tests and API tests

@hsliuustc0106
Collaborator

PR #4: Examples & Documentation
Usage examples, configurations, and comprehensive documentation
Scope: Complete the Qwen3-Omni support with examples and docs
Files:

  1. examples/qwen3_omni/ (new directory)
  2. basic_usage.py
  3. configs/qwen3_omni_local.yaml
  4. README.md
  5. docs/models/qwen3_omni.md (new)
  6. Update docs/api/ with Qwen3-Omni endpoints
  7. Update main README.md

Changes:

  1. Add basic usage example
  2. Add configuration examples
  3. Add comprehensive model documentation
  4. Add API documentation
  5. Add troubleshooting guide
  6. Update project documentation

Dependencies: PR #3
Testing: Example validation and documentation review

elif 'max_model_len' in kwargs:
    # If max_model_len is set but max_num_batched_tokens is not, set it to max_model_len
    kwargs['max_num_batched_tokens'] = kwargs['max_model_len']
class OmniLM:
Collaborator


what's the difference between OmniLM and OmniLLM

Collaborator Author


OmniLM is a higher-level abstraction that serves as the entry class and initializes multiple OmniLLMs. Each OmniLLM initializes one engine.
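The relationship can be sketched in a few lines. This is an illustrative toy, not the PR's implementation; everything beyond the two class names (the constructor arguments, the generate method, the string outputs) is assumed.

```python
class OmniLLM:
    """Single-stage executor: in the real code, wraps one engine."""
    def __init__(self, stage_name):
        self.stage_name = stage_name

    def generate(self, inputs):
        # Placeholder for a real engine call.
        return f"{self.stage_name}({inputs})"

class OmniLM:
    """Entry-point orchestrator: owns one OmniLLM per stage, chains them."""
    def __init__(self, stage_names):
        self.stages = [OmniLLM(n) for n in stage_names]

    def generate(self, inputs):
        out = inputs
        for stage in self.stages:
            out = stage.generate(out)  # each stage feeds the next
        return out

lm = OmniLM(["thinker", "talker", "code2wav"])
print(lm.generate("prompt"))  # code2wav(talker(thinker(prompt)))
```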

self.request_counter = Counter()
self.default_sampling_params: Union[dict[str, Any], None] = None

if envs.VLLM_USE_V1:
Collaborator


V0 has been removed

Collaborator Author


The code is based on a vLLM version from around Aug 2025. Later we will adapt it to the latest stable vLLM, v0.11.0.

data_parallel_rank: Optional[int] = None,
) -> tuple[Optional[str], OmniEngineCoreRequest]:

# TODO(woosuk): Support pooling models.
Collaborator


do we copy this file from vllm?

Collaborator Author


We modified some functions to add additional parameters.

from vllm_omni.config import OmniStageConfig


def load_stage_configs(omni_args: OmniEngineArgs) -> List[OmniStageConfig]:
Collaborator


load_stage_configs_from_engine_args

Collaborator Author


Removed.


logger = logging.getLogger(__name__)

IMAGE_FACTOR = 28
Collaborator


do these magic numbers work for all models?

Collaborator Author


The file is for qwen2.5-omni only. It has been moved to examples.
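For context on the constant: Qwen2.5-Omni's vision preprocessing snaps image sides to multiples of its patch-grid factor, which is why IMAGE_FACTOR = 28 is model-specific rather than universal. A simplified sketch follows; the helper name and the exact rounding and bounds behavior are assumptions, not this file's code.

```python
IMAGE_FACTOR = 28  # model-specific patch-grid factor; not universal

def snap_to_factor(height: int, width: int, factor: int = IMAGE_FACTOR):
    """Round each image side to the nearest positive multiple of `factor`."""
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    return h, w

print(snap_to_factor(1080, 1920))  # (1092, 1932)
print(snap_to_factor(10, 10))      # (28, 28)
```

A model with a different patch size would need a different factor, which supports the reviewer's point that these numbers don't transfer across models.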

@@ -0,0 +1,376 @@
# SPDX-License-Identifier: Apache-2.0
Collaborator


for the model_runner and worker name, I think gpu_xx_model_runner and gpu_xx_worker are aligned with the AFD case: gpu_ffn_model_runner and gpu_ffn_worker.

@hsliuustc0106 hsliuustc0106 linked an issue Oct 18, 2025 that may be closed by this pull request
@hsliuustc0106 hsliuustc0106 added the enhancement New feature or request label Oct 18, 2025
…ve old test files and deprecated implementations - Add offline inference examples for Qwen 2.5 Omni - Add stage_input_processors for better modularity - Update architecture documentation - Refactor entrypoints (omni_llm -> omni_lm) - Update core scheduler and cache management
@hsliuustc0106
Collaborator

stale PR closed

@Gaohan123 Gaohan123 deleted the dev branch November 1, 2025 02:26
R2-Y pushed a commit to R2-Y/vllm-omni that referenced this pull request Jan 17, 2026
