[Model] Add Moondream3 model support #32325
sniper35 wants to merge 18 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI, which covers only a small and essential subset of tests to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of fastcheck CI. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request adds support for the Moondream3 model. The implementation includes the model architecture for both the vision and text components, a custom processor for handling Moondream3's specific image tiling and tokenization, and necessary registrations. The code is comprehensive, but I've found a couple of critical issues in the model implementation that would prevent it from working correctly. One is a fragile dependency in the image encoding logic, and the other is an incorrect weight name remapping during model loading. Addressing these will be crucial for the model to function as intended.
```python
pixel_values = pixel_values.to(device=device, dtype=dtype)
features = self.vision(pixel_values)
grid_size = self.config.vision.enc_n_layers
```
The grid_size is being set to self.config.vision.enc_n_layers. While the value (27) is coincidentally correct for the default configuration (crop_size 378 / patch_size 14 = 27), this is semantically incorrect and very brittle. The grid size of the vision encoder output depends on the image crop size and patch size, not the number of encoder layers. This will break if the model configuration changes in a way that decouples these values. The grid size should be calculated from the vision config's crop_size and enc_patch_size for correctness and robustness.
```diff
- grid_size = self.config.vision.enc_n_layers
+ grid_size = self.config.vision.crop_size // self.config.vision.enc_patch_size
```
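The derivation above can be sketched as follows. The config values are the Moondream3 defaults quoted in this review (crop size 378, patch size 14), and `SimpleNamespace` stands in for the real vision config object:

```python
from types import SimpleNamespace

# Default Moondream3 vision config values quoted in the review.
vision = SimpleNamespace(crop_size=378, enc_patch_size=14, enc_n_layers=27)

# Brittle: only correct because enc_n_layers happens to equal 27.
grid_from_layers = vision.enc_n_layers

# Robust: one grid cell per patch along each spatial axis.
grid_size = vision.crop_size // vision.enc_patch_size

print(grid_size)  # 27 for the default configuration
```

With the default configuration the two expressions coincide, but only the second one tracks the actual encoder output shape if crop or patch size changes.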
```python
name = name.replace(".attn.qkv.", ".attn.qkv_proj.")
name = name.replace(".attn.proj.", ".attn.out_proj.")
```
The weight name remapping for the attention layers is incorrect. The code replaces .attn.qkv. with .attn.qkv_proj. and .attn.proj. with .attn.out_proj.. However, the Moondream3Attention module defines its layers with prefixes that result in parameter names containing ...attn.qkv.weight and ...attn.proj.weight. This mismatch will cause the weights for the attention QKV and output projections to fail to load, leading to model errors. These remapping lines should be removed to match the parameter names defined in Moondream3Attention.
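To see the failure mode, here is a minimal sketch (the layer index and module path are illustrative, not taken from the PR) showing that the remapped names no longer match any declared parameter, so those weights would be skipped or trigger a loading error:

```python
# Illustrative parameter names as Moondream3Attention would declare them.
params = {
    "model.layers.0.attn.qkv.weight",
    "model.layers.0.attn.proj.weight",
}

# The checkpoint already uses the same names as the module parameters.
checkpoint_names = sorted(params)

remapped = []
for name in checkpoint_names:
    # The problematic remapping from the PR under review:
    name = name.replace(".attn.qkv.", ".attn.qkv_proj.")
    name = name.replace(".attn.proj.", ".attn.out_proj.")
    remapped.append(name)

# After remapping, none of the names match an actual parameter.
print(all(n not in params for n in remapped))  # True
```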
Heads up that you might need to update some imports after #32327
Documentation preview: https://vllm--32325.org.readthedocs.build/en/32325/
<sup>E</sup> Pre-computed embeddings can be inputted for this modality.
This is to keep the documentation from getting cluttered.
Force-pushed from 2544de1 to f8ded0c
Force-pushed from 3e885c4 to c82c1bd
This pull request has merge conflicts that must be resolved before it can be merged.
Hey @copumpkin, you can pull my branch to test it yourself. All four skills are supported. Here are instructions to run it.
Signed-off-by: Dong Wang <dongw2019@gmail.com>
Signed-off-by: Dong Wang <dongw2019@gmail.com> (cherry picked from commit 03c4c7c)
Force-pushed from 5aee540 to c7be284


Purpose
Closes #25215.
Test Plan
Offline commands:
Test Result
Compare the outputs from vLLM and HF:
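The actual comparison commands were not captured on this page; a hypothetical helper for such a check might normalize whitespace before comparing the two completions, so that tokenizer-level spacing differences do not cause spurious mismatches:

```python
# Hypothetical helper: check a vLLM completion against the HF reference
# while tolerating whitespace-only differences.
def outputs_match(vllm_out: str, hf_out: str) -> bool:
    def normalize(s: str) -> str:
        return " ".join(s.split())
    return normalize(vllm_out) == normalize(hf_out)

print(outputs_match("A cat  sitting on a mat.", "A cat sitting on a mat."))  # True
```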
Essential Elements of an Effective PR Description Checklist
Updated supported_models.md and examples for a new model.