
vllm-omni framework and support for qwen2.5-omni [WIP] #7

Closed
Gaohan123 wants to merge 7 commits into vllm-project:main from Gaohan123:dev

Conversation

@Gaohan123
Collaborator

  • Reorganize worker modules: rename gpu_diffusion_* to new naming scheme
  • Add new AR GPU worker and model runner
  • Add diffusion scheduler and utilities
  • Add stage configs for qwen2.5 omni model
  • Update __init__.py files across modules for new imports
  • Add output processor and arg utilities
  • Add test file for omni LLM

@gemini-code-assist

Summary of Changes

Hello @Gaohan123, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request lays the foundational framework for vllm-omni, enabling multi-stage, multimodal model inference within vLLM. It introduces a modular architecture where different model stages (e.g., autoregressive, diffusion) can be chained together, each with its own specialized worker and scheduler. The changes include significant refactoring of core components, new configuration mechanisms, and enhanced output processing to seamlessly handle diverse data types like text, images, and audio across these stages. This allows for complex generative workflows, exemplified by the integration of the Qwen2.5-Omni model.

Highlights

  • Framework Reorganization: Worker modules have been reorganized, renaming gpu_diffusion_* components to a new, more consistent naming scheme, and introducing new core modules for scheduling and distributed components.
  • New AR GPU Worker and Model Runner: A new ARModelRunner and ARGPUWorker have been added to handle autoregressive (AR) tasks, designed to expose per-request hidden representations and integrate multimodal embeddings.
  • Diffusion Scheduler and Utilities: A specialized DiffusionScheduler has been implemented, extending OmniScheduler, to optimize scheduling for diffusion models by allocating all required tokens at once and immediately marking requests as finished, suitable for single-step generation tasks.
  • Qwen2.5-Omni Model Support: Dedicated stage configurations (qwen2_5_omni.yaml) have been added to support the Qwen2.5-Omni-7B model, defining its 'thinker', 'talker', and 'code2wav' stages with specific worker and scheduler classes.
  • Enhanced Multimodal Output Processing: The MultimodalOutputProcessor has been refactored to inherit from VLLMOutputProcessor, introducing OmniRequestState to track and accumulate multimodal tensors (e.g., images, latents, audio) and route outputs based on their type.
  • Refactored OmniLLM Entrypoints: The OmniLLM class has been simplified to extend the base vllm.entrypoints.llm.LLM, while a new OmniLM class now acts as the orchestrator, loading stage configurations and managing multiple OmniLLM instances for multi-stage processing.
  • New Configuration and Argument Utilities: New configuration classes like OmniModelConfig and OmniConfig have been introduced, along with OmniEngineArgs to extend EngineArgs with Omni-specific parameters, facilitating flexible stage definition.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant architectural refactoring to support multi-stage, multi-modal models, specifically qwen2.5-omni, within a new vllm-omni framework. The changes are extensive and well-structured, introducing concepts like stage-based engine configurations, specialized workers (AR and Diffusion), and corresponding schedulers. The refactoring of OmniLLM into a pipeline orchestrator (OmniLM) and a single-stage executor (OmniLLM) is a clean design. The new output processor is also a major improvement, providing robust handling for various multimodal outputs. While the work is still in progress, the foundational framework is solid. I've identified a critical syntax error and a few medium-severity issues related to maintainability and dead code that should be addressed.

Comment on lines +143 to +149
"""
Scheduler for the diffusion model.
This scheduler is modified to stop the request immediately for the diffusion model.
This is because the diffusion model can generate the final image/audio in one step.
Note: This is just a minimal modification to the original scheduler, and there should be some further efforts to optimize the scheduler.
The original scheduler is still used for the AR model.
"""


critical

This block of text is formatted as a multi-line string (docstring), but it sits in the middle of the class definition rather than at the start of a module, class, or function. If it interrupts a statement, the Python interpreter will raise a SyntaxError when parsing this file; even where it parses, the string is evaluated and discarded as a no-op, so it is dead text either way. To fix this, convert it into a block comment by prefixing each line with a #.

# Scheduler for the diffusion model.
# This scheduler is modified to stop the request immediately for the diffusion model.
# This is because the diffusion model can generate the final image/audio in one step.
# Note: This is just a minimal modification to the original scheduler, and there should be some further efforts to optimize the scheduler. 
# The original scheduler is still used for the AR model.

Comment on lines +14 to +17
"""Diffusion fast path (translated from the original Chinese):
- Feed all of this request's input tokens at once (if zero, allocate one placeholder token).
- If the token budget cannot be satisfied in one shot, fall back to upstream vLLM's default scheduling.
"""


medium

This file contains numerous comments written in Chinese. For better maintainability and accessibility in an open-source project, it's highly recommended to write all comments in English. This ensures that a wider range of contributors can understand the code's intent. Please consider translating these and other Chinese comments in this file.
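For readers skimming the review, the fast-path logic quoted above amounts to the following sketch. This is a minimal illustration with assumed names and a scalar token budget, not this PR's code.

```python
def schedule_diffusion_request(num_input_tokens: int, token_budget: int):
    """Return how many tokens to allocate for a diffusion request in one
    step, or None to signal a fallback to vLLM's default scheduling.

    Mirrors the fast path described in the comment: feed the whole
    request at once, with a single placeholder token for empty inputs.
    """
    needed = num_input_tokens if num_input_tokens > 0 else 1
    if needed > token_budget:
        return None  # cannot satisfy in one shot; defer to default path
    return needed

print(schedule_diffusion_request(512, 8192))    # 512
print(schedule_diffusion_request(0, 8192))      # 1
print(schedule_diffusion_request(10_000, 8192)) # None
```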

@@ -0,0 +1,714 @@
"""


medium

This file appears to be a backup of the old omni_llm.py implementation. Including backup or temporary files in a pull request can lead to confusion and adds unnecessary code to the repository. It's best practice to rely on version control (like Git) for history and remove such files before merging.

Comment on lines +185 to +190
# if hasattr(self.model, "sample"):
# return self.model.sample(**kwargs)
# if hasattr(self.model, "forward"):
# return self.model.forward(**kwargs)
# if hasattr(self.model, "diffuse"):
# return self.model.diffuse(**kwargs)


medium

In the _run_diffusion method, the checks for model.sample and model.diffuse are commented out, leaving only model.forward as the active path. If the intention is to support only the forward method for the current model, it would be clearer to remove the commented-out code and add a comment explaining this limitation. However, to make the runner more generic for future diffusion models, consider uncommenting this logic.

Suggested change
# if hasattr(self.model, "sample"):
# return self.model.sample(**kwargs)
# if hasattr(self.model, "forward"):
# return self.model.forward(**kwargs)
# if hasattr(self.model, "diffuse"):
# return self.model.diffuse(**kwargs)
if hasattr(self.model, "sample"):
return self.model.sample(**kwargs)
if hasattr(self.model, "forward"):
return self.model.forward(**kwargs)
if hasattr(self.model, "diffuse"):
return self.model.diffuse(**kwargs)

@hsliuustc0106
Collaborator

based on my understanding, this huge PR can be split into the following mini PRs:

PR #1: Foundation & Infrastructure
Qwen3-Omni model configuration, loading, and tokenizer support
Scope: All foundational components needed for Qwen3-Omni support
Files:
vllm_omni/config/qwen3_omni_config.py (new)
vllm_omni/model_executor/model_loader/qwen3_omni_loader.py (new)
vllm_omni/model_executor/layers/qwen3_omni_tokenizer.py (new)
Update existing __init__.py files
Update vllm_omni/config/__init__.py
Changes:
Add Qwen3OmniConfig dataclass
Add model detection and loading logic
Add multimodal tokenizer wrapper
Add image preprocessing for Qwen3-Omni
Add configuration validation
Dependencies: None (can be merged independently)
Testing: Unit tests for all components

@hsliuustc0106
Collaborator

PR #2: Core Processing Components
Model execution, request processing, and output handling
Scope: Core processing logic for Qwen3-Omni
Files:

  1. vllm_omni/model_executor/models/qwen3_omni_model.py (new)
  2. vllm_omni/request/qwen3_omni_request.py (new)
  3. vllm_omni/engine/qwen3_omni_output_processor.py (new)
  4. Update vllm_omni/request.py
  5. Update vllm_omni/engine/output_processor.py

Changes:

  1. Implement Qwen3OmniModel class with forward pass logic
  2. Add Qwen3OmniRequest class with multimodal input validation
  3. Implement Qwen3OmniOutputProcessor class
  4. Add multimodal attention mechanisms
  5. Add request preprocessing and output post-processing

Dependencies: PR #1
Testing: Unit tests for model execution, request processing, and output handling

@hsliuustc0106
Collaborator

PR #3: Integration & API
Worker integration and API endpoints
Scope: Integration with existing vLLM-omni infrastructure
Files:

  1. vllm_omni/worker/qwen3_omni_worker.py (new)
  2. vllm_omni/entrypoints/qwen3_omni_api.py (new)
  3. Update vllm_omni/worker/gpu_diffusion_worker.py
  4. Update vllm_omni/entrypoints/api_server.py

Changes:

  1. Add Qwen3-Omni worker class
  2. Integrate with existing worker framework
  3. Add GPU memory management
  4. Add Qwen3-Omni specific API endpoints
  5. Add request/response schemas
  6. Add API documentation

Dependencies: PR #2
Testing: Integration tests and API tests

@hsliuustc0106
Collaborator

PR #4: Examples & Documentation
Usage examples, configurations, and comprehensive documentation
Scope: Complete the Qwen3-Omni support with examples and docs
Files:

  1. examples/qwen3_omni/ (new directory)
  2. basic_usage.py
  3. configs/qwen3_omni_local.yaml
  4. README.md
  5. docs/models/qwen3_omni.md (new)
  6. Update docs/api/ with Qwen3-Omni endpoints
  7. Update main README.md

Changes:

  1. Add basic usage example
  2. Add configuration examples
  3. Add comprehensive model documentation
  4. Add API documentation
  5. Add troubleshooting guide
  6. Update project documentation

Dependencies: PR #3
Testing: Example validation and documentation review

elif 'max_model_len' in kwargs:
    # If max_model_len is set but max_num_batched_tokens is not, set it to max_model_len
    kwargs['max_num_batched_tokens'] = kwargs['max_model_len']
class OmniLM:
Collaborator


what's the difference between OmniLM and OmniLLM

Collaborator Author


OmniLM is a higher-level abstraction that serves as the entry class and initializes multiple OmniLLMs. Each OmniLLM initializes one engine.
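The relationship can be sketched in a few lines. This is an illustrative toy, not the PR's implementation; everything beyond the two class names (the constructor arguments, the generate method, the string outputs) is assumed.

```python
class OmniLLM:
    """Single-stage executor: in the real code, wraps one engine."""
    def __init__(self, stage_name):
        self.stage_name = stage_name

    def generate(self, inputs):
        # Placeholder for a real engine call.
        return f"{self.stage_name}({inputs})"

class OmniLM:
    """Entry-point orchestrator: owns one OmniLLM per stage, chains them."""
    def __init__(self, stage_names):
        self.stages = [OmniLLM(n) for n in stage_names]

    def generate(self, inputs):
        out = inputs
        for stage in self.stages:
            out = stage.generate(out)  # each stage feeds the next
        return out

lm = OmniLM(["thinker", "talker", "code2wav"])
print(lm.generate("prompt"))  # code2wav(talker(thinker(prompt)))
```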

self.request_counter = Counter()
self.default_sampling_params: Union[dict[str, Any], None] = None

if envs.VLLM_USE_V1:
Collaborator


V0 has been removed

Collaborator Author


The code is based on a vLLM version from around Aug 2025. Later we will adapt it to the latest stable vLLM, v0.11.0.

data_parallel_rank: Optional[int] = None,
) -> tuple[Optional[str], OmniEngineCoreRequest]:

# TODO(woosuk): Support pooling models.
Collaborator


do we copy this file from vllm?

Collaborator Author


We modified some functions to add additional parameters.

from vllm_omni.config import OmniStageConfig


def load_stage_configs(omni_args: OmniEngineArgs) -> List[OmniStageConfig]:
Collaborator


load_stage_configs_from_engine_args

Collaborator Author


Removed.


logger = logging.getLogger(__name__)

IMAGE_FACTOR = 28
Collaborator


do these magic numbers work for all models?

Collaborator Author


The file is for qwen2.5-omni only. It has been moved to examples.
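For context on the constant: Qwen2.5-Omni's vision preprocessing snaps image sides to multiples of its patch-grid factor, which is why IMAGE_FACTOR = 28 is model-specific rather than universal. A simplified sketch follows; the helper name and the exact rounding and bounds behavior are assumptions, not this file's code.

```python
IMAGE_FACTOR = 28  # model-specific patch-grid factor; not universal

def snap_to_factor(height: int, width: int, factor: int = IMAGE_FACTOR):
    """Round each image side to the nearest positive multiple of `factor`."""
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    return h, w

print(snap_to_factor(1080, 1920))  # (1092, 1932)
print(snap_to_factor(10, 10))      # (28, 28)
```

A model with a different patch size would need a different factor, which supports the reviewer's point that these numbers don't transfer across models.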

@@ -0,0 +1,376 @@
# SPDX-License-Identifier: Apache-2.0
Collaborator


for the model_runner and worker name, I think gpu_xx_model_runner and gpu_xx_worker are aligned with the AFD case: gpu_ffn_model_runner and gpu_ffn_worker.

@hsliuustc0106 hsliuustc0106 linked an issue Oct 18, 2025 that may be closed by this pull request
@hsliuustc0106 hsliuustc0106 added the enhancement New feature or request label Oct 18, 2025
…ve old test files and deprecated implementations - Add offline inference examples for Qwen 2.5 Omni - Add stage_input_processors for better modularity - Update architecture documentation - Refactor entrypoints (omni_llm -> omni_lm) - Update core scheduler and cache management
@hsliuustc0106
Collaborator

stale PR closed

@Gaohan123 Gaohan123 deleted the dev branch November 1, 2025 02:26
R2-Y pushed a commit to R2-Y/vllm-omni that referenced this pull request Jan 17, 2026
