[model] support LLaVA-OneVision-1.5 #6284

slin000111 · 2025-10-24T03:43:37Z

PR type

Bug Fix
New Feature
Document Updates
More Models or Datasets Support

PR information

Write the detailed information that belongs to this PR.
support LLaVA-OneVision-1.5, #6123

Experiment results

Paste your experiment results here (if needed).

update faq.

Fixed the inconsistencies between the Chinese and English documentation.

Update link to sequence parallel example

gemini-code-assist · 2025-10-24T03:43:56Z

Summary of Changes

Hello @slin000111, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds support for the LLaVA-OneVision-1.5 series of multimodal models. Beyond listing the new models in the documentation, it registers a new model type and architecture in the framework's core and adds a dedicated template for processing multimodal inputs (images and videos).

Highlights

New Model Support: Added comprehensive support for the LLaVA-OneVision-1.5 family of multimodal models, including 4B-Instruct, 8B-Instruct, 4B-Base, and 8B-Base variants.
Documentation Updates: Updated both Chinese and English documentation to reflect the newly integrated LLaVA-OneVision-1.5 models, detailing their requirements and capabilities.
Core Model Integration: Integrated the new LLaVA-OneVision-1.5 model type and architecture into the system's constants and model registration mechanisms, ensuring proper recognition and handling.
Custom Multimodal Template: Introduced a specialized template (LLavaOneVision1_5Template) designed to handle the unique tokenization, media processing (images and videos), and embedding generation requirements for LLaVA-OneVision-1.5 models.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request adds support for the LLaVA-OneVision-1.5 model, including necessary changes to model registration, architecture definitions, and template handling for multimodal inputs. The implementation is mostly sound, with updates to documentation to reflect the new model. My review focuses on improving maintainability by replacing magic numbers with constants, ensuring dependency declarations are complete and accurate, and enhancing the robustness of the video processing logic.

gemini-code-assist · 2025-10-24T03:45:37Z

swift/llm/model/model/llava.py

+        requires=['transformers>=4.53.0', 'qwen_vl_utils'],
+        tags=['vision'],


The requires list is missing the decord dependency, which is needed for video processing, and it does not specify a minimum version for qwen_vl_utils. The tags list is also missing 'video'. The documentation already includes these; please update both lists so that all dependencies are declared and the model's capabilities are tagged correctly.

Suggested change

-        requires=['transformers>=4.53.0', 'qwen_vl_utils'],
-        tags=['vision'],
+        requires=['transformers>=4.53.0', 'qwen_vl_utils>=0.0.6', 'decord'],
+        tags=['vision', 'video'],
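A gap like the missing decord entry can be caught early by checking the declared requirements against the environment. The following is only a minimal sketch: `requirement_name` and `missing_requirements` are hypothetical helpers written for illustration, not functions from this repository, and the name parsing is deliberately simplistic.

```python
import re
from importlib import metadata

# Requirement strings from the suggested change above.
REQUIRES = ['transformers>=4.53.0', 'qwen_vl_utils>=0.0.6', 'decord']


def requirement_name(requirement):
    """Return the distribution name at the start of a requirement string."""
    match = re.match(r'[A-Za-z0-9_.\-]+', requirement.strip())
    if match is None:
        raise ValueError(f'Cannot parse requirement: {requirement!r}')
    return match.group(0)


def missing_requirements(requires, is_installed=None):
    """List the requirement strings whose distribution is not installed.

    `is_installed` is an injectable check (hypothetical, used for testing);
    by default the installed metadata is queried. Versions are not compared.
    """
    if is_installed is None:
        def is_installed(name):
            try:
                metadata.version(name)
                return True
            except metadata.PackageNotFoundError:
                return False
    return [r for r in requires if not is_installed(requirement_name(r))]


if __name__ == '__main__':
    print(missing_requirements(REQUIRES))
```

The sketch only checks that each distribution is present; enforcing the version constraints would need a proper requirement parser.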

gemini-code-assist · 2025-10-24T03:45:37Z

swift/llm/model/model/llava.py

+                                              model_dir)
+    model_cls._no_split_modules = ['LLaVAOneVision1_5_DecoderLayer', 'RiceBlock']
+    model, processor = get_model_tokenizer_multimodal(model_dir, *args, **kwargs)
+    model.config.vision_start_token_id = 151652


The value 151652 is a magic number. To improve code readability and maintainability, it's better to define it as a constant with a descriptive name (e.g., LLAVA_ONEVISION_VISION_START_TOKEN_ID) at the module level and use the constant here.
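One way to follow this suggestion, as a minimal sketch: the constant name is the one proposed in the comment, while `apply_vision_start_token` and the `SimpleNamespace` stand-in for `model.config` are hypothetical and used only so the example runs without loading a model.

```python
from types import SimpleNamespace

# Token ID taken from the diff above; the name follows the reviewer's suggestion.
LLAVA_ONEVISION_VISION_START_TOKEN_ID = 151652


def apply_vision_start_token(config):
    """Set the vision start token ID on a config-like object (hypothetical helper)."""
    config.vision_start_token_id = LLAVA_ONEVISION_VISION_START_TOKEN_ID
    return config


# Stand-in for model.config in the real loading function.
config = apply_vision_start_token(SimpleNamespace())
print(config.vision_start_token_id)
```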

gemini-code-assist · 2025-10-24T03:45:37Z

swift/llm/template/template/llava.py

+    image_token_id = 151655
+    video_token_id = 151656


The token IDs 151655 and 151656 are used as magic numbers. For better readability and maintainability, please define them as constants with descriptive names at the module level and use them here.
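The same pattern could be applied in the template, sketched here under assumptions: the constant names and the `TemplateSketch` class are hypothetical stand-ins, and `count_media_tokens` is included only to show the constants in use.

```python
# Token IDs taken from the diff above; the constant names are assumptions.
LLAVA_ONEVISION_IMAGE_TOKEN_ID = 151655
LLAVA_ONEVISION_VIDEO_TOKEN_ID = 151656


class TemplateSketch:
    """Hypothetical stand-in for the template class touched by the diff."""
    image_token_id = LLAVA_ONEVISION_IMAGE_TOKEN_ID
    video_token_id = LLAVA_ONEVISION_VIDEO_TOKEN_ID

    def count_media_tokens(self, input_ids):
        """Count image and video placeholder tokens in a list of token IDs."""
        images = sum(1 for token in input_ids if token == self.image_token_id)
        videos = sum(1 for token in input_ids if token == self.video_token_id)
        return images, videos


counts = TemplateSketch().count_media_tokens([1, 151655, 151655, 151656, 2])
print(counts)
```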

gemini-code-assist · 2025-10-24T03:45:37Z

swift/llm/template/template/llava.py

+                    if hasattr(processor, 'video_processor'):
+                        processor_func = processor.video_processor
+                    else:
+                        processor_func = processor.image_processor
+                        kwargs['images'] = None


The fallback to processor.image_processor for video processing is fragile. It assumes image_processor can handle videos if video_processor is absent. This might not hold for other models, potentially causing future bugs. A more robust implementation would be to either check if image_processor supports video or raise an error if a video is provided without a video_processor.
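The second option in this comment (raising an error when no video processor exists) might look like the sketch below. `select_media_processor` is a hypothetical helper, and the `SimpleNamespace` objects merely imitate a processor with or without a `video_processor` attribute; this is not the code in the PR.

```python
from types import SimpleNamespace


def select_media_processor(processor, media_type):
    """Pick the sub-processor for a media type, failing loudly for videos.

    Instead of silently falling back to the image processor, a video input
    without a dedicated video processor raises a ValueError.
    """
    if media_type == 'video':
        video_processor = getattr(processor, 'video_processor', None)
        if video_processor is None:
            raise ValueError(
                'A video was provided, but the processor has no video_processor.')
        return video_processor
    return processor.image_processor


# Stand-ins for real processor objects, for illustration only.
image_only = SimpleNamespace(image_processor='image-proc')
with_video = SimpleNamespace(image_processor='image-proc', video_processor='video-proc')

print(select_media_processor(with_video, 'video'))
```

Whether the fallback should instead be kept for this model family depends on whether its image processor is known to accept video frames, which the review comment leaves open.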

slin000111 and others added 16 commits October 10, 2025 16:17

update faq

c4a0133

Merge branch 'update_qa' into main

662c6ff

update faq.

Fixed the inconsistencies between the Chinese and English FAQ documentation.

0236c2d

Merge branch 'update_qa' into main

0dd6511

Fixed the inconsistencies between the Chinese and English documentation.

Update link to sequence parallel example

dfbed96

Merge branch 'modelscope:main' into main

7be980c

Merge branch 'fix_sequence_parallel_link' into main

3b60a9d

Update link to sequence parallel example

Merge branch 'main' into main

c3a2628

Merge branch 'modelscope:main' into main

05dbcce

Merge branch 'modelscope:main' into main

f72659e

Merge branch 'modelscope:main' into main

10edf1e

Merge branch 'modelscope:main' into main

1367fe1

Merge branch 'modelscope:main' into main

5943294

support llava-onevision-1.5

2f72847

update model list

1e99d19

update model list

988cb7c

gemini-code-assist bot reviewed Oct 24, 2025

add test

c5a2a9f
