vllm-omni framework and support for qwen2.5-omni [WIP] #7
Gaohan123 wants to merge 7 commits into vllm-project:main
Conversation
Gaohan123
commented
Oct 13, 2025
- Reorganize worker modules: rename gpu_diffusion_* to new naming scheme
- Add new AR GPU worker and model runner
- Add diffusion scheduler and utilities
- Add stage configs for qwen2.5 omni model
- Update init.py files across modules for new imports
- Add output processor and arg utilities
- Add test file for omni LLM
Summary of Changes: Hello @Gaohan123, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request lays the foundational framework for multi-stage, multi-modal model support.
Code Review
This pull request introduces a significant architectural refactoring to support multi-stage, multi-modal models, specifically qwen2.5-omni, within a new vllm-omni framework. The changes are extensive and well-structured, introducing concepts like stage-based engine configurations, specialized workers (AR and Diffusion), and corresponding schedulers. The refactoring of OmniLLM into a pipeline orchestrator (OmniLM) and a single-stage executor (OmniLLM) is a clean design. The new output processor is also a major improvement, providing robust handling for various multimodal outputs. While the work is still in progress, the foundational framework is solid. I've identified a critical syntax error and a few medium-severity issues related to maintainability and dead code that should be addressed.
| """ | ||
| Scheduler for the diffusion model. | ||
| This scheduler is modified to stop the request immediately for the diffusion model. | ||
| This is because the diffusion model can generate the final image/audio in one step. | ||
| Note: This is just a minimal modification to the original scheduler, and there should be some further efforts to optimize the scheduler. | ||
| The original scheduler is still used for the AR model. | ||
| """ |
This block of text is formatted as a multi-line string (docstring), but it's located in the middle of the class definition, not at the beginning of a module, class, or function. This will cause a SyntaxError when the Python interpreter parses this file. To fix this, you should convert it into a block comment by prefixing each line with a #.
# Scheduler for the diffusion model.
# This scheduler is modified to stop the request immediately for the diffusion model.
# This is because the diffusion model can generate the final image/audio in one step.
# Note: This is just a minimal modification to the original scheduler, and there should be some further efforts to optimize the scheduler.
# The original scheduler is still used for the AR model.

"""Diffusion fast path:
- Feed the request's entire input tokens in one shot (if there are 0, allocate 1 placeholder token).
- If the token budget cannot be satisfied in one pass, fall back to upstream vLLM's default scheduling.
"""
This file contains numerous comments written in Chinese. For better maintainability and accessibility in an open-source project, it's highly recommended to write all comments in English. This ensures that a wider range of contributors can understand the code's intent. Please consider translating these and other Chinese comments in this file.
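The fast path that docstring describes can be sketched roughly as follows. This is a hypothetical illustration with made-up names, not the PR's scheduler code:

```python
# Hypothetical sketch of the diffusion fast path described in the docstring:
# schedule all of a request's input tokens in one step, or fall back to the
# default scheduler. Function and parameter names are illustrative only.
def schedule_diffusion_request(num_input_tokens: int, token_budget: int):
    """Return (tokens_to_schedule, used_fast_path)."""
    # A request with 0 input tokens still gets 1 placeholder token.
    needed = max(num_input_tokens, 1)
    if needed <= token_budget:
        return needed, True   # fast path: everything in one step
    return 0, False           # defer to vLLM's default scheduling


print(schedule_diffusion_request(0, 100))    # → (1, True)
print(schedule_diffusion_request(500, 100))  # → (0, False)
```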
| # if hasattr(self.model, "sample"): | ||
| # return self.model.sample(**kwargs) | ||
| # if hasattr(self.model, "forward"): | ||
| # return self.model.forward(**kwargs) | ||
| # if hasattr(self.model, "diffuse"): | ||
| # return self.model.diffuse(**kwargs) |
In the _run_diffusion method, the checks for model.sample and model.diffuse are commented out, leaving only model.forward as the active path. If the intention is to support only the forward method for the current model, it would be clearer to remove the commented-out code and add a comment explaining this limitation. However, to make the runner more generic for future diffusion models, consider uncommenting this logic.
| # if hasattr(self.model, "sample"): | |
| # return self.model.sample(**kwargs) | |
| # if hasattr(self.model, "forward"): | |
| # return self.model.forward(**kwargs) | |
| # if hasattr(self.model, "diffuse"): | |
| # return self.model.diffuse(**kwargs) | |
| if hasattr(self.model, "sample"): | |
| return self.model.sample(**kwargs) | |
| if hasattr(self.model, "forward"): | |
| return self.model.forward(**kwargs) | |
| if hasattr(self.model, "diffuse"): | |
| return self.model.diffuse(**kwargs) |
Based on my understanding, this huge PR can be split into the following mini PRs:

PR #1: Foundation & Infrastructure

PR #2: Core Processing Components
Changes:
Dependencies: PR #1

PR #3: Integration & API
Changes:
Dependencies: PR #2

PR #4: Examples & Documentation
Changes:
Dependencies: PR #3
elif 'max_model_len' in kwargs:
    # If max_model_len is set but max_num_batched_tokens is not, set it to max_model_len
    kwargs['max_num_batched_tokens'] = kwargs['max_model_len']

class OmniLM:
What's the difference between OmniLM and OmniLLM?
OmniLM is a higher-level abstraction serving as the entry class; it initializes multiple OmniLLMs. Each OmniLLM initializes one engine.
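That relationship can be sketched roughly as below. This is a minimal illustration of the described design only; `StageConfig` and all attribute names are assumptions, not the PR's actual classes:

```python
# Hypothetical sketch of the OmniLM / OmniLLM relationship described above.
# Only the class names OmniLM and OmniLLM come from the PR; everything else
# is illustrative.
from dataclasses import dataclass


@dataclass
class StageConfig:
    name: str   # e.g. "ar" or "diffusion"
    model: str  # model path for this stage


class OmniLLM:
    """Single-stage executor: owns exactly one engine."""
    def __init__(self, config: StageConfig):
        self.config = config
        self.engine = f"engine({config.name})"  # stand-in for a real engine

    def generate(self, prompt):
        return f"{self.config.name}:{prompt}"


class OmniLM:
    """Entry class: builds one OmniLLM per stage and chains them."""
    def __init__(self, stage_configs):
        self.stages = [OmniLLM(c) for c in stage_configs]

    def generate(self, prompt):
        out = prompt
        for stage in self.stages:  # output of one stage feeds the next
            out = stage.generate(out)
        return out


lm = OmniLM([StageConfig("ar", "m1"), StageConfig("diffusion", "m2")])
print(lm.generate("hi"))  # → diffusion:ar:hi
```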
self.request_counter = Counter()
self.default_sampling_params: Union[dict[str, Any], None] = None

if envs.VLLM_USE_V1:
V0 has been removed
The code is based on a vLLM version from around Aug 2025. Later we will adapt it to the latest stable vLLM, v0.11.0.
data_parallel_rank: Optional[int] = None,
) -> tuple[Optional[str], OmniEngineCoreRequest]:

# TODO(woosuk): Support pooling models.
Do we copy this file from vLLM?
We modified some functions to add additional parameters.
vllm_omni/entrypoints/utils.py
Outdated
from vllm_omni.config import OmniStageConfig


def load_stage_configs(omni_args: OmniEngineArgs) -> List[OmniStageConfig]:
load_stage_configs_from_engine_args
logger = logging.getLogger(__name__)

IMAGE_FACTOR = 28
Do these magic numbers work for all models?
The file is for qwen2.5-omni only. It has been moved to examples.
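For context, a factor like `IMAGE_FACTOR = 28` is typically used in Qwen-VL-style preprocessing to snap image dimensions to the vision encoder's patch grid. A minimal sketch, assuming a rounding helper of this shape (`round_by_factor` is a hypothetical name, not necessarily the PR's function):

```python
# Illustrative sketch (not the PR's code): snapping an image dimension to the
# nearest multiple of IMAGE_FACTOR, as Qwen-VL-style preprocessors commonly do.
IMAGE_FACTOR = 28


def round_by_factor(x: int, factor: int = IMAGE_FACTOR) -> int:
    """Round x to the nearest positive multiple of factor."""
    return max(factor, round(x / factor) * factor)


print(round_by_factor(1000))  # → 1008 (36 * 28)
print(round_by_factor(5))     # → 28 (clamped to one factor)
```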
@@ -0,0 +1,376 @@
# SPDX-License-Identifier: Apache-2.0
For the model_runner and worker names, I think gpu_xx_model_runner and gpu_xx_worker align with the AFD case: gpu_ffn_model_runner and gpu_ffn_worker.
…ve old test files and deprecated implementations
- Add offline inference examples for Qwen 2.5 Omni
- Add stage_input_processors for better modularity
- Update architecture documentation
- Refactor entrypoints (omni_llm -> omni_lm)
- Update core scheduler and cache management
Stale PR closed.
fix stream audio output