
[Model] Add Moondream3 model support #32325

Open
sniper35 wants to merge 18 commits into vllm-project:main from sniper35:add-moondream3-model

Conversation


@sniper35 sniper35 commented Jan 14, 2026

Purpose

Closes #25215.

Test Plan

Offline inference examples:

  from vllm import LLM, SamplingParams

  llm = LLM(
      model="moondream/moondream3-preview",
      tokenizer="moondream/starmie-v1",
      trust_remote_code=True, dtype="bfloat16",
      max_model_len=2048, enforce_eager=True,
      limit_mm_per_prompt={"image": 1},
  )

  # --- Query ---
  llm.generate(
      {"prompt": "<|endoftext|><image><|md_reserved_0|>query<|md_reserved_1|>What is this?<|md_reserved_2|>",
       "multi_modal_data": {"image": image}},
      SamplingParams(max_tokens=50, temperature=0),
  )

  # --- Caption ---
  llm.generate(
      {"prompt": "<|endoftext|><image><|md_reserved_0|>describe<|md_reserved_1|>normal<|md_reserved_2|>",
       "multi_modal_data": {"image": image}},
      SamplingParams(max_tokens=100, temperature=0),
  )

  # --- Detect (needs extra_args) ---
  llm.generate(
      {"prompt": "<|endoftext|><image><|md_reserved_0|>detect<|md_reserved_1|> sign<|md_reserved_2|>",
       "multi_modal_data": {"image": image}},
      SamplingParams(max_tokens=500, temperature=0,
                     extra_args={"moondream3_task": "detect"}),
  )
  # Returns JSON: {"objects": [{"x_min": ..., "y_min": ..., "x_max": ..., "y_max": ...}]}

  # --- Point (needs extra_args) ---
  llm.generate(
      {"prompt": "<|endoftext|><image><|md_reserved_0|>point<|md_reserved_1|> sign<|md_reserved_2|>",
       "multi_modal_data": {"image": image}},
      SamplingParams(max_tokens=500, temperature=0,
                     extra_args={"moondream3_task": "point"}),
  )
  # Returns JSON: {"points": [{"x": ..., "y": ...}]}
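The four prompts above share the same reserved-token scaffolding, so they can be assembled with a small helper. This is a hypothetical convenience function, not part of this PR; the token names and ordering are copied verbatim from the examples above.

```python
# Hypothetical helper (not part of this PR) that assembles the Moondream3
# prompt format used in the examples above. The reserved-token names are
# copied verbatim from the PR description.
def build_moondream3_prompt(task: str, argument: str) -> str:
    """Wrap a task name ('query', 'describe', 'detect', 'point') and its
    argument in Moondream3's reserved tokens."""
    return (
        "<|endoftext|><image>"
        f"<|md_reserved_0|>{task}"
        f"<|md_reserved_1|>{argument}"
        "<|md_reserved_2|>"
    )

print(build_moondream3_prompt("query", "What is this?"))
```

For example, `build_moondream3_prompt("detect", " sign")` reproduces the detect prompt above, including its leading space before the object name.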

Test Result

Comparison of the outputs from vLLM and HF:

  "inputs": {
    "caption_image": "cherry_blossom",
    "detect_image": "stop_sign",
    "object": "sign",
    "point_image": "stop_sign",
    "query_image": "stop_sign"
  },

  "hf_outputs": {
    "caption": "A tall, slender tower with a white top and gray framework stands against a bright blue sky. Branches with pink blossoms frame the tower in the foreground, creating a layered effect. The blossoms appear dense and full, with some showing hints of orange or yellow at the edges. The perspective is from below, looking up at the tower.",
    "detect": {
      "objects": [
        {
          "x_max": 0.3590589910745621,
          "x_min": 0.16633163392543793,
          "y_max": 0.4200967699289322,
          "y_min": 0.1345907300710678
        }
      ]
    },
    "point": {
      "points": [
        {
          "x": 0.2822265625,
          "y": 0.318359375
        }
      ]
    },
    "query": "The image shows a red stop sign mounted on a pole in the foreground of a street. Behind the sign, there is a red Chinese archway with Chinese characters on it. In the background, a black SUV is driving down the street. Buildings are visible on both sides of the street, and there are several pedestrians walking along the sidewalk. A tree is visible behind the archway. The scene captures a typical urban street setting with Asian architectural elements."
  },

  "vllm_outputs": {
    "caption": "A tall tower with a white top and light blue-green horizontal bands is visible through a dense canopy of pink cherry blossom trees. The sky is a clear, bright blue. The trees frame the tower, creating a layered effect with branches and blossoms in the foreground.",
    "detect": {
      "objects": [
        {
          "x_max": 0.3590589910745621,
          "x_min": 0.16633163392543793,
          "y_max": 0.4210672974586487,
          "y_min": 0.13362020254135132
        }
      ]
    },
    "point": {
      "points": [
        {
          "x": 0.2822265625,
          "y": 0.3095703125
        }
      ]
    },
    "query": "A black SUV is parked on a city street near a red Chinese-style gate or archway. A red octagonal stop sign is mounted on a pole in the foreground. Buildings with signage are visible in the background, and there are decorative stone statues flanking the gate. A tree is visible behind the"
  }

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small, essential subset of tests to catch errors quickly.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added the new-model (Requests to new models) label Jan 14, 2026

mergify bot commented Jan 14, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @sniper35.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 14, 2026
@sniper35 sniper35 marked this pull request as draft January 14, 2026 11:20
@mergify mergify bot removed the needs-rebase label Jan 14, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for the Moondream3 model. The implementation includes the model architecture for both the vision and text components, a custom processor for handling Moondream3's specific image tiling and tokenization, and necessary registrations. The code is comprehensive, but I've found a couple of critical issues in the model implementation that would prevent it from working correctly. One is a fragile dependency in the image encoding logic, and the other is an incorrect weight name remapping during model loading. Addressing these will be crucial for the model to function as intended.

pixel_values = pixel_values.to(device=device, dtype=dtype)

features = self.vision(pixel_values)
grid_size = self.config.vision.enc_n_layers

critical

The grid_size is being set to self.config.vision.enc_n_layers. While the value (27) is coincidentally correct for the default configuration (crop_size 378 / patch_size 14 = 27), this is semantically incorrect and very brittle. The grid size of the vision encoder output depends on the image crop size and patch size, not the number of encoder layers. This will break if the model configuration changes in a way that decouples these values. The grid size should be calculated from the vision config's crop_size and enc_patch_size for correctness and robustness.

Suggested change
grid_size = self.config.vision.enc_n_layers
grid_size = self.config.vision.crop_size // self.config.vision.enc_patch_size
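Under the default configuration cited in the review, the suggested formula can be checked numerically. The values below are hard-coded for illustration; the variable names mirror the vision-config fields.

```python
# Default Moondream3 vision-config values cited in the review, hard-coded
# here for illustration: 378 // 14 = 27, which only coincidentally equals
# the encoder depth.
crop_size = 378
enc_patch_size = 14
enc_n_layers = 27

grid_size = crop_size // enc_patch_size  # the suggested, robust formula
assert grid_size == enc_n_layers  # holds today, but only by coincidence
print(grid_size)
```

If a future checkpoint changed the crop size or patch size without changing the encoder depth, the formula above would still yield the correct grid size while `enc_n_layers` would not.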

Comment on lines +1256 to +1257
name = name.replace(".attn.qkv.", ".attn.qkv_proj.")
name = name.replace(".attn.proj.", ".attn.out_proj.")

critical

The weight name remapping for the attention layers is incorrect. The code replaces .attn.qkv. with .attn.qkv_proj. and .attn.proj. with .attn.out_proj.. However, the Moondream3Attention module defines its layers with prefixes that result in parameter names containing ...attn.qkv.weight and ...attn.proj.weight. This mismatch will cause the weights for the attention QKV and output projections to fail to load, leading to model errors. These remapping lines should be removed to match the parameter names defined in Moondream3Attention.
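The mismatch is easy to reproduce in isolation: applying the remap to a parameter name of the shape Moondream3Attention actually produces yields a key that no longer exists in the module. The layer path below is made up for the example.

```python
# Illustration of the reviewer's point (the layer path is invented for
# this example): the remap rewrites names that the module's own parameters
# never use, so the rewritten key fails to match at load time.
param_names = {"model.layers.0.attn.qkv.weight",
               "model.layers.0.attn.proj.weight"}

name = "model.layers.0.attn.qkv.weight"
remapped = name.replace(".attn.qkv.", ".attn.qkv_proj.")
remapped = remapped.replace(".attn.proj.", ".attn.out_proj.")

assert remapped not in param_names  # this weight would fail to load
print(remapped)
```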

@sniper35 sniper35 changed the title from "[Model] Add Moondream3 model support" to "[Model] Add Moondream3 model support [WIP]" Jan 14, 2026
@DarkLight1337
Copy link
Member

Heads up that you might need to update some imports after #32327

@mergify
Copy link

mergify bot commented Jan 20, 2026

Documentation preview: https://vllm--32325.org.readthedocs.build/en/32325/

@mergify mergify bot added the documentation (Improvements or additions to documentation) and multi-modality (Related to multi-modality, #4194) labels Jan 20, 2026
<sup>E</sup> Pre-computed embeddings can be inputted for this modality.

Author

This is to keep the documentation from becoming cluttered.

Author

Current:
(screenshot taken 2026-01-19 at 6:52 PM)

After:
(screenshot taken 2026-01-19 at 6:51 PM)

@sniper35 sniper35 force-pushed the add-moondream3-model branch 2 times, most recently from 2544de1 to f8ded0c, on January 20, 2026 02:59
@sniper35 sniper35 force-pushed the add-moondream3-model branch 3 times, most recently from 3e885c4 to c82c1bd, on February 11, 2026 07:36

mergify bot commented Feb 23, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @sniper35.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@sniper35
Author

Hey @copumpkin, you can pull my branch to test it yourself. All four skills are supported. Here are instructions for running it.
moondream3_testing.md

Signed-off-by: Dong Wang <dongw2019@gmail.com>
(cherry picked from commit 03c4c7c)
(cherry picked from commit 008cdac)
@sniper35 sniper35 force-pushed the add-moondream3-model branch from 5aee540 to c7be284 on February 24, 2026 09:49
@sniper35 sniper35 marked this pull request as ready for review February 24, 2026 10:40
@sniper35 sniper35 changed the title from "[Model] Add Moondream3 model support [WIP]" to "[Model] Add Moondream3 model support" Feb 24, 2026

Labels

documentation (Improvements or additions to documentation), multi-modality (Related to multi-modality, #4194), new-model (Requests to new models), v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Could support moondream vlm model?

3 participants