
add support for MammothModa2 model#336

Open
HonestDeng wants to merge 125 commits into vllm-project:main from HonestDeng:add-mammoth-moda2-support

Conversation


@HonestDeng HonestDeng commented Dec 16, 2025


Purpose

Resolves #314: add support for the MammothModa2 model (https://github.com/bytedance/mammothmoda).

Test Plan

Machine:

  • H200(140GB) x 1

Parallel:

  • TP: None

Image:

  • Size: 1024 x 1024
  • DiT Step: 50
  1. Image Summary

Machine:

  • H200(140GB) x 1

Parallel:

  • TP: None

Image:

  • Size: 1024 x 1024

Test Result

The image on the left is generated by the official MammothModa2 implementation, while the one on the right is from vllm-omni:
image

This table compares the performance of the two implementations:

Stage       official-impl   vllm-omni
AR stage    83.529 s        74.06 s
DiT stage   10.320 s        9.65 s

Transfer time: 4.012 ms

vllm-omni achieves better performance in both stages.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.


Signed-off-by: HonestDeng <2958906959@qq.com>
For simplicity, most of the DiT-stage code is copied from https://github.com/bytedance/mammothmoda.
This code will be simplified and reviewed once the pipeline runs
successfully.

Signed-off-by: HonestDeng <2958906959@qq.com>
because the preview version of MammothModa2 only uses the last hidden state

Signed-off-by: HonestDeng <2958906959@qq.com>
@hsliuustc0106
Collaborator

Hi, will the model be ready before the 12/30 release?

@HonestDeng
Author

HonestDeng commented Dec 20, 2025

Yes.

MammothModa2-Preview combines Qwen2.5-VL (with extra gen-experts in the MLP layers) with a DiT module for image generation. I have already implemented the Qwen2.5-VL part of MammothModa2-Preview by reusing vLLM code such as Qwen2Attention and Qwen2MLP, and it can take text and images as input to generate text tokens.

I'm currently working on the DiT part. Hopefully I will finish it this weekend and have my code reviewed before 12/30.

I'm not very familiar with supporting new models, so if there is any problem in my code, please correct me. Thanks!

Signed-off-by: HonestDeng <2958906959@qq.com>
@hsliuustc0106
Collaborator


The model seems quite similar to the Qwen-Image structure, with a Qwen-VL for encoding and a DiT module for image generation.

@HonestDeng HonestDeng force-pushed the add-mammoth-moda2-support branch from 8e2db46 to c6deeb1 Compare March 1, 2026 10:54
@HonestDeng HonestDeng marked this pull request as ready for review March 1, 2026 11:00

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 982e321f6d


@princepride
Collaborator

I've run test_mammoth_moda2.py on my local machine and all the test cases pass. Is it OK?

Thanks! I will review it tomorrow.

HonestDeng and others added 2 commits March 1, 2026 20:31
Signed-off-by: HonestDeng <2958906959@qq.com>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
@hsliuustc0106
Collaborator

Thanks for reviewing. I've fixed these 5 issues. The first issue is strange: answer_start_index = max(L - 10, 0).

Actually, I just followed the original project in using the last 10 tokens. I didn't figure out the reason; it's quite weird.

Anyway, I've fixed the problem by using all generated tokens as the answer.

You could open an issue at the original project.

@hsliuustc0106
Collaborator

I've run test_mammoth_moda2.py on my local machine and all the test cases pass. Is it OK?

Any performance speedup updates?

Collaborator


Why can't we directly use vllm_omni/model_executor/stage_configs/mammoth_moda2.yaml? Maybe you can refer to qwen3-omni, which lets final_output be produced at different stages.

Author

@HonestDeng HonestDeng Mar 2, 2026


The two tasks need different stage topologies: summarization uses engine_output_type: text and terminates at Stage 0, while T2I uses engine_output_type: latent and must continue to Stage 1's ar2dit processor — routing a comprehension request through the two-stage config would break ar2dit on the incompatible format.

The Qwen3-Omni pattern works because every request always goes through all stages sequentially. MammothModa2 needs a true branch (stop at Stage 0 for text, continue to Stage 1 for image), which requires per-request dynamic stage skipping.

Therefore, we can't directly use vllm_omni/model_executor/stage_configs/mammoth_moda2.yaml for the summarize task.
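The branch described above can be sketched as a per-task config selection. The YAML paths follow this PR; the selector function itself is hypothetical, for illustration only:

```python
# Hypothetical sketch: pick a stage config per request type, since a request
# cannot dynamically skip stages inside a single two-stage topology.
T2I_CONFIG = "vllm_omni/model_executor/stage_configs/mammoth_moda2.yaml"
AR_ONLY_CONFIG = "vllm_omni/model_executor/stage_configs/mammoth_moda2_ar.yaml"

def select_stage_config(task: str) -> str:
    if task == "t2i":
        # Stage 0 (AR, engine_output_type: latent) -> Stage 1 (DiT)
        return T2I_CONFIG
    if task == "image_summarize":
        # Single stage (AR, engine_output_type: text), terminates at Stage 0
        return AR_ONLY_CONFIG
    raise ValueError(f"unknown task: {task}")
```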

I've moved examples/offline_inference/mammothmodal2_preview/mammoth_moda2_t2i.yaml and examples/offline_inference/mammothmodal2_preview/mammoth_moda2_image_summarize.yaml to vllm_omni/model_executor/stage_configs/mammoth_moda2.yaml and vllm_omni/model_executor/stage_configs/mammoth_moda2_ar.yaml for simplicity.

Collaborator


Same here.

Author


Now we use vllm_omni/model_executor/stage_configs/mammoth_moda2.yaml for the t2i task.

Collaborator


I don't think we need to provide an example where one stage is deployed on two devices.

Author


I've deleted this config file.

Collaborator


I think we can directly use the DiT model under the diffusion folder.

Author


Thanks for the suggestion! The real implementation is already in mammoth_moda2_dit.py — the file under model_executor/ is just a thin re-export shim.

The shim is needed because OmniModelRegistry in registry.py hardcodes the prefix vllm_omni.model_executor.models when resolving module paths, so a model living under vllm_omni.diffusion can't be registered there directly without the shim.

The DiffusionModelRegistry in registry.py does use the correct vllm_omni.diffusion.models. prefix, but it's a separate registry for pipeline-style models instantiated with OmniDiffusionConfig. MammothModa2DiTForConditionalGeneration is a vLLM nn.Module loaded with VllmConfig, so it can't be plugged into that registry either.
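A minimal sketch of the constraint and the shim described above. Module paths mirror the PR discussion; the resolver below is illustrative, not the actual OmniModelRegistry code:

```python
# OmniModelRegistry is described as hardcoding this prefix when resolving
# model modules; the resolver here is a stand-in for illustration.
OMNI_MODEL_PREFIX = "vllm_omni.model_executor.models"

def resolve_module_path(model_file: str) -> str:
    # Anything outside this prefix (e.g. vllm_omni.diffusion.*) can never
    # be produced here, hence the thin re-export shim under models/.
    return f"{OMNI_MODEL_PREFIX}.{model_file}"

# The shim itself is just a re-export module, e.g.
# vllm_omni/model_executor/models/mammoth_moda2_dit.py:
#
#   from vllm_omni.diffusion.models.mammoth_moda2.mammoth_moda2_dit import (
#       MammothModa2DiTForConditionalGeneration,
#   )
```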

Collaborator


Please create a new folder like https://github.com/vllm-project/vllm/tree/main/vllm/transformers_utils/configs and put the custom config under it.

Collaborator


Typically, in vLLM, we put the processor and the model implementation in the same file. Please combine mammoth_moda2_ar.py and mammoth_moda2.py.

Author


Done

from .registry import OmniModelRegistry # noqa: F401

__all__ = ["Qwen3OmniMoeForConditionalGeneration"]
__all__ = ["Qwen3OmniMoeForConditionalGeneration", "Mammothmoda2Config"]
Collaborator


After putting Mammothmoda2Config under transformers_utils/configs, we can remove it from here.

Author


I've moved Mammothmoda2Config to transformers_utils/configs and deleted the code in vllm_omni/model_executor/models/__init__.py that imports Mammothmoda2Config.

However, we need an eager import to register Mammothmoda2Config for the model_type mammothmoda2. Therefore, I added some code in vllm_omni/__init__.py to import these configs.
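The eager-import requirement can be illustrated with a pure-Python registry. The real code registers Mammothmoda2Config with transformers-style config auto-mapping; the decorator below is a stand-in:

```python
# Stand-in registry illustrating why the import must be eager: the mapping
# from model_type to config class only exists after this module is imported.
CONFIG_REGISTRY: dict[str, type] = {}

def register_config(model_type: str):
    def wrap(cls: type) -> type:
        CONFIG_REGISTRY[model_type] = cls
        return cls
    return wrap

@register_config("mammothmoda2")
class Mammothmoda2Config:
    model_type = "mammothmoda2"

# Importing this module from vllm_omni/__init__.py runs the decorator at
# package import time; without that eager import, CONFIG_REGISTRY stays
# empty and a checkpoint with model_type "mammothmoda2" cannot be resolved.
```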

from .registry import OmniModelRegistry # noqa: F401

__all__ = ["Qwen3OmniMoeForConditionalGeneration"]
__all__ = ["Qwen3OmniMoeForConditionalGeneration", "Mammothmoda2Config"]
Collaborator


@hsliuustc0106 @ZJY0516 Why do we have Qwen3OmniMoeForConditionalGeneration in this file? Is there something special about it?

Collaborator


I don't think we need a mammothmoda2_dit_layer folder to store the model's module files. You can refer to other DiT models' file structure.

Author


Done

@princepride
Collaborator

@HonestDeng PTAL


@hsliuustc0106 hsliuustc0106 left a comment


PR #336 Review: Add support for MammothModa2 model

Overview

This PR adds support for MammothModa2, a multi-modal image generation model with a two-stage architecture:

  • AR Stage: Based on Qwen2.5-VL with MoE (Mixture of Experts) for dual vocabulary handling
  • DiT Stage: Diffusion transformer for image generation via flow-matching

Scale: 4,151 additions across 27 files

Critical Issues: 0 found ✓

Important Issues: 4 found

1. Potential Shape Mismatch in moe_forward Not Fully Validated

File: vllm_omni/model_executor/models/mammoth_moda2/mammoth_moda2_ar.py:126-131

The flat_mask.reshape(-1) operation could silently produce incorrect results if the original shape has different dimensions that happen to have the same product.

Suggestion: Validate gen_token_mask.numel() != total_tokens before the reshape operation.

2. Chinese Comment in Production Code

File: vllm_omni/diffusion/models/mammoth_moda2/mammoth_moda2_dit.py:196

text_hidden_states=inputs_embeds,  # 占位,runner 不会用到

Please replace with English: # placeholder, not used by runner

3. Unused Parameter

File: vllm_omni/model_executor/stage_input_processors/mammoth_moda2.py:13

The requires_multimodal_data parameter is accepted but never used. Either use it or remove it.

4. Hardcoded num_reqs=1 Silently Ignores Caller's Argument

File: vllm_omni/diffusion/models/mammoth_moda2/mammoth_moda2_dit.py:84

def get_dummy_runtime_additional_information(self, num_reqs: int) -> list[dict[str, object]]:
    num_reqs = 1  # TODO: support num_reqs > 1

Consider raising NotImplementedError for num_reqs > 1 instead of silently ignoring.
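A sketch of the suggested fix. The method signature follows the excerpt above; the class wrapper and the return value are assumptions:

```python
class MammothModa2DiTStub:
    # Hypothetical wrapper; only the method below mirrors the excerpt.
    def get_dummy_runtime_additional_information(
        self, num_reqs: int
    ) -> list[dict[str, object]]:
        if num_reqs > 1:
            # Fail loudly instead of silently pinning num_reqs to 1.
            raise NotImplementedError("num_reqs > 1 is not supported yet")
        return [{}]
```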

Strengths

  1. Well-structured architecture: Clean separation between AR and DiT stages
  2. Comprehensive config design: Proper handling of dual vocabulary
  3. Good test coverage: Unit tests for config parsing, stage processor, and e2e tests
  4. Proper weight loading: Weight mappers correctly filter stage-specific weights
  5. Token constraints: _apply_t2i_token_constraints properly constrains sampling

MRO Pattern Check ✓

All model classes follow proper inheritance order (nn.Module before mixins). No MRO issues detected.

Recommendation

After addressing the 4 important issues (especially the Chinese comment and the num_reqs handling), this PR should be ready for merge.

raise ValueError(f"Unexpected hidden_states shape: {tuple(hidden_states.shape)}")

# mask: [num_tokens] or [B, L] -> flatten to [total_tokens]
flat_mask = gen_token_mask.reshape(-1) # type: ignore[union-attr]
Collaborator


Shape validation order

Consider validating gen_token_mask.numel() != total_tokens before the reshape operation. The current reshape(-1) could succeed but produce semantically incorrect results if the original shape has different dimensions that happen to have the same product.
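That validation can be sketched as follows. Names follow the excerpt; gen_token_mask is assumed to be a torch.Tensor, but the check itself is duck-typed here:

```python
def flatten_gen_token_mask(gen_token_mask, total_tokens: int):
    # Validate before reshaping: reshape(-1) would happily succeed on a
    # mask whose dimensions multiply out to the wrong token count.
    if gen_token_mask.numel() != total_tokens:
        raise ValueError(
            f"gen_token_mask has {gen_token_mask.numel()} elements, "
            f"expected {total_tokens}"
        )
    # mask: [num_tokens] or [B, L] -> flatten to [total_tokens]
    return gen_token_mask.reshape(-1)
```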

Author


Done

# Prepare negative prompt (for CFG). If none provided, fall back to unconditional.
negative_prompt_embeds = None
negative_prompt_attention_mask = None
if text_guidance_scale > 1.0:
Collaborator


Chinese comment in production code

Please replace with English: # placeholder, not used by runner

Author


Done


def ar2dit(
stage_list: list[Any],
engine_input_source: list[int],
Collaborator


Unused parameter

The requires_multimodal_data parameter is accepted but never used in the function body. Either use this parameter or remove it to match the interface contract.

Author


It's an interface parameter, so it cannot be removed.

theta=10000,
)

# vLLM PP interface compatibility
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silently ignores caller argument

This hardcoded num_reqs = 1 silently ignores the caller's num_reqs argument, which could cause issues in batched scenarios. Consider raising NotImplementedError for num_reqs > 1 instead.

Author


Done

Signed-off-by: HonestDeng <2958906959@qq.com>
@HonestDeng
Author

HonestDeng commented Mar 2, 2026

I've addressed the requested changes. PTAL when you have a moment.

@hsliuustc0106
Collaborator

Code review

Found 1 issue:

  1. Unclosed file handle - open(special_tokens_file) is called without using a context manager, causing a file handle leak. While Python's garbage collector will eventually close it, the timing is non-deterministic. In long-running inference servers or with repeated model loads, unclosed handles can accumulate.

self.mergeable_ranks = _load_tiktoken_bpe(vocab_file)
vision_tokens = [t.strip() for t in open(special_tokens_file).readlines() if len(t.strip()) > 0]
SPECIAL_TOKENS = tuple(

Fix: Use with open(special_tokens_file) as f: vision_tokens = [t.strip() for t in f.readlines() if len(t.strip()) > 0]
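The fix can be sketched as a small helper; the function name and return type are illustrative:

```python
def load_vision_tokens(special_tokens_file: str) -> list[str]:
    # The with-block closes the handle deterministically, even if an
    # exception is raised mid-parse, instead of relying on GC timing.
    with open(special_tokens_file) as f:
        return [t.strip() for t in f if t.strip()]
```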

🤖 Generated with Claude Code


Signed-off-by: HonestDeng <2958906959@qq.com>
@HonestDeng
Author


Done



Development

Successfully merging this pull request may close these issues.

[New Model]: bytedance-research/MammothModa2-Preview