
Conversation

@slin000111
Collaborator

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Write the detailed information belonging to this PR.
support LLaVA-OneVision-1.5, #6123

Experiment results

Paste your experiment results here (if needed).

@gemini-code-assist
Contributor

Summary of Changes

Hello @slin000111, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the framework's capabilities by integrating the LLaVA-OneVision-1.5 series of multimodal models. The changes involve not only listing these new models in the documentation but also deeply embedding their support within the system's core, including new model types, architectures, and a dedicated template for processing complex multimodal inputs. This ensures that users can leverage the advanced vision and language understanding of LLaVA-OneVision-1.5 models seamlessly.

Highlights

  • New Model Support: Added comprehensive support for the LLaVA-OneVision-1.5 family of multimodal models, including 4B-Instruct, 8B-Instruct, 4B-Base, and 8B-Base variants.
  • Documentation Updates: Updated both Chinese and English documentation to reflect the newly integrated LLaVA-OneVision-1.5 models, detailing their requirements and capabilities.
  • Core Model Integration: Integrated the new LLaVA-OneVision-1.5 model type and architecture into the system's constants and model registration mechanisms, ensuring proper recognition and handling.
  • Custom Multimodal Template: Introduced a specialized template (LLavaOneVision1_5Template) designed to handle the unique tokenization, media processing (images and videos), and embedding generation requirements for LLaVA-OneVision-1.5 models. A hedged sketch of the registration pattern follows this list.
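
For orientation, the registration this PR adds plausibly follows ms-swift's usual pattern. The sketch below is an assumption-heavy reconstruction, not the PR's exact code: only the requires and tags values are taken verbatim from the diff further down, while the model IDs, model-type key, get-function, and architecture name are illustrative, and import paths may differ by ms-swift version.

```python
# Hypothetical sketch of the model registration added in this PR;
# exact names and signatures should be checked against the ms-swift source.
from swift.llm import Model, ModelGroup, ModelMeta, register_model


def get_model_tokenizer_llava_onevision1_5(model_dir, *args, **kwargs):
    ...  # the PR's loader patches _no_split_modules and vision_start_token_id


register_model(
    ModelMeta(
        'llava_onevision1_5',                       # assumed model-type key
        [ModelGroup([
            Model('lmms-lab/LLaVA-OneVision-1.5-4B-Instruct'),  # assumed IDs
            Model('lmms-lab/LLaVA-OneVision-1.5-8B-Instruct'),
        ])],
        'llava_onevision1_5',                       # wired to LLavaOneVision1_5Template
        get_model_tokenizer_llava_onevision1_5,
        architectures=['LLaVAOneVision1_5_ForConditionalGeneration'],  # assumed
        requires=['transformers>=4.53.0', 'qwen_vl_utils'],  # from the diff
        tags=['vision'],                                      # from the diff
    ))
```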

@gemini-code-assist bot left a comment

Code Review

This pull request adds support for the LLaVA-OneVision-1.5 model, including necessary changes to model registration, architecture definitions, and template handling for multimodal inputs. The implementation is mostly sound, with updates to documentation to reflect the new model. My review focuses on improving maintainability by replacing magic numbers with constants, ensuring dependency declarations are complete and accurate, and enhancing the robustness of the video processing logic.

Comment on lines +419 to +420

```python
requires=['transformers>=4.53.0', 'qwen_vl_utils'],
tags=['vision'],
```
Severity: high

The requires list is missing the decord dependency, which is necessary for video processing, and a version constraint for qwen_vl_utils. The tags list is also missing 'video'. The documentation correctly includes these. Please update them so that all dependencies are declared and the model's capabilities are correctly tagged.

Suggested change

```diff
-requires=['transformers>=4.53.0', 'qwen_vl_utils'],
-tags=['vision'],
+requires=['transformers>=4.53.0', 'qwen_vl_utils>=0.0.6', 'decord'],
+tags=['vision', 'video'],
```

```python
    model_dir)
model_cls._no_split_modules = ['LLaVAOneVision1_5_DecoderLayer', 'RiceBlock']
model, processor = get_model_tokenizer_multimodal(model_dir, *args, **kwargs)
model.config.vision_start_token_id = 151652
```
Severity: medium

The value 151652 is a magic number. To improve code readability and maintainability, it's better to define it as a constant with a descriptive name (e.g., LLAVA_ONEVISION_VISION_START_TOKEN_ID) at the module level and use the constant here.
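
A minimal sketch of the suggested refactor; 151652 is <|vision_start|> in the Qwen2-VL tokenizer family this model appears to reuse (an assumption worth verifying against the released tokenizer config):

```python
# Module-level constant with a descriptive name, per the suggestion above.
# 151652 is <|vision_start|> in Qwen2-VL-style tokenizers (assumed to carry
# over to LLaVA-OneVision-1.5; verify against the model's tokenizer config).
LLAVA_ONEVISION_VISION_START_TOKEN_ID = 151652


def _patch_vision_config(config) -> None:
    # Hypothetical helper showing where the constant replaces the literal.
    config.vision_start_token_id = LLAVA_ONEVISION_VISION_START_TOKEN_ID
```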

Comment on lines +314 to +315

```python
image_token_id = 151655
video_token_id = 151656
```
Severity: medium

The token IDs 151655 and 151656 are used as magic numbers. For better readability and maintainability, please define them as constants with descriptive names at the module level and use them here.
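
The same pattern works here; in Qwen2-VL-style tokenizers these values are the <|image_pad|> and <|video_pad|> placeholder IDs (again an assumption to verify):

```python
# Descriptive module-level constants for the placeholder token IDs.
LLAVA_ONEVISION_IMAGE_TOKEN_ID = 151655  # <|image_pad|> in Qwen2-VL tokenizers
LLAVA_ONEVISION_VIDEO_TOKEN_ID = 151656  # <|video_pad|> in Qwen2-VL tokenizers
```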

Comment on lines +371 to +375
if hasattr(processor, 'video_processor'):
processor_func = processor.video_processor
else:
processor_func = processor.image_processor
kwargs['images'] = None
Severity: medium

The fallback to processor.image_processor for video processing is fragile. It assumes image_processor can handle videos if video_processor is absent. This might not hold for other models, potentially causing future bugs. A more robust implementation would be to either check if image_processor supports video or raise an error if a video is provided without a video_processor.
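
One shape the more defensive variant could take, failing loudly when no dedicated video processor exists instead of silently reusing the image processor (the helper name is illustrative):

```python
def _get_video_processor(processor):
    # Prefer the dedicated video processor when the processor exposes one.
    video_processor = getattr(processor, 'video_processor', None)
    if video_processor is not None:
        return video_processor
    # No silent fallback to image_processor: surface the unsupported case.
    raise ValueError(
        f'{type(processor).__name__} does not expose a video_processor; '
        'video inputs are not supported for this model.')
```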
