support qwen2.5vl(fastv,dart,visionzip) #423
Conversation
Summary of Changes
Hello @SmudgedWings, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly extends the system's capabilities by integrating comprehensive support for the Qwen2.5-VL multimodal model. It adapts existing token reduction techniques—DART, FastV, and VisionZip—to work seamlessly with Qwen2.5-VL's architecture, particularly focusing on accurate vision token identification and efficient pruning. The changes involve introducing a new model definition, refining token handling logic across the sparsification methods, and adding visualization tools to aid in analysis.
Highlights
- Qwen2.5-VL Model Integration: Added a new `Qwen2_5VL` model class and registered it, enabling the use of this model within the system. This includes defining its specific architecture and layer subsets for pruning.
- Token Reduction Method Adaptation: Modified DART, FastV, and VisionZip to correctly identify and process vision tokens (`vision_token_index`, `vision_token_length`, `vision_token_start_index`) for Qwen2.5-VL and other VLM models, ensuring compatibility and proper pruning (a minimal, illustrative sketch follows this list).
- Qwen2.5-VL Specific VisionZip Logic: Implemented a dedicated `Qwen2_5_VLVisionAttention_forward` function and associated hooks within VisionZip to handle the unique attention mechanism and token merging for Qwen2.5-VL's visual tokens.
- Generalized Token Configuration: Updated the base `TokenReductionModule` to retrieve vision token configuration parameters more robustly using `.get()` methods, improving flexibility.
- Visualization Utilities: Introduced a new `visualizer.py` module with functions to visualize kept image patches and attention maps, which can be valuable for debugging and understanding the token reduction process.
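For a concrete picture of what "vision token identification" means here, below is a minimal, illustrative sketch of locating the contiguous span of image placeholder tokens in an input sequence. The helper name, the placeholder id, and the sample token ids are assumptions for the example, not code from this pull request.

```python
import torch


def find_vision_token_span(input_ids: torch.Tensor, image_token_id: int):
    """Return (start_index, length) of the contiguous vision-token span.

    Assumes a single contiguous block of image placeholder tokens, as is
    typical for Qwen2.5-VL-style inputs. The helper name and the placeholder
    id used below are hypothetical.
    """
    positions = (input_ids == image_token_id).nonzero(as_tuple=True)[0]
    if positions.numel() == 0:
        return None, 0  # no image in this sample
    return positions[0].item(), positions.numel()


# Made-up token ids: three image placeholder tokens embedded in text tokens.
input_ids = torch.tensor([1, 15, 151655, 151655, 151655, 42, 2])
start, length = find_vision_token_span(input_ids, image_token_id=151655)
print(start, length)  # 2 3
```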
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request adds support for the qwen2.5vl model and refactors several token reduction methods (DART, FastV, VisionZip) to be more generic and support the new model architecture. The changes are extensive and introduce new model definitions, evaluation wrappers, and specific pruning logic for qwen2.5vl.
My review has identified a critical bug in the new pruning logic that would likely cause it to fail. I've also pointed out several medium-severity issues related to code correctness, robustness, and adherence to best practices. Addressing these points will improve the quality and stability of the new features.
```python
dominant_num = int(self.dominant * attn_logits.size(0))
contextual_num = max(int(self.contextual * attn_logits.size(0)), 1)
```
The calculation for `dominant_num` and `contextual_num` appears to be incorrect. `self.dominant` and `self.contextual` are loaded from the configuration as absolute token counts. However, they are being multiplied by `attn_logits.size(0)` (the sequence length), which produces values that are far too large and will likely cause an error in the subsequent `torch.topk` call.
The multiplication should be removed, and the values from the configuration should be used directly.
```diff
-dominant_num = int(self.dominant * attn_logits.size(0))
-contextual_num = max(int(self.contextual * attn_logits.size(0)), 1)
+dominant_num = self.dominant
+contextual_num = max(self.contextual, 1)
```
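To make the failure mode concrete (the tensor size and configured count below are made up): if `self.dominant` is already an absolute token count, multiplying it by the sequence length inflates `k` far beyond the number of available scores, and `torch.topk` rejects any `k` larger than the dimension it selects over.

```python
import torch

attn_logits = torch.randn(600)   # e.g. 600 vision-token scores (made-up size)
dominant = 191                   # configured as an absolute token count

k_right = dominant
k_wrong = int(dominant * attn_logits.size(0))  # 191 * 600 = 114600

torch.topk(attn_logits, k_right)      # fine: keeps the 191 highest-scoring tokens
try:
    torch.topk(attn_logits, k_wrong)  # fails: k exceeds the number of elements
except RuntimeError as err:
    print(f'topk failed as expected: {err}')
```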
```python
self.dominant = special_config['dominant']
self.contextual = special_config['contextual']
```
Using direct dictionary access `special_config['...']` is less safe than `special_config.get('...', default_value)`. This change makes the code less robust, as it will raise a `KeyError` if the keys are not present in the configuration. It's better to retain the safe access pattern with default values.
```diff
-self.dominant = special_config['dominant']
-self.contextual = special_config['contextual']
+self.dominant = special_config.get('dominant', 191)
+self.contextual = special_config.get('contextual', 30)
```
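A quick illustration of the behavioral difference, using made-up config contents:

```python
special_config = {'dominant': 128}  # 'contextual' intentionally missing

# Direct indexing raises KeyError for the missing key.
try:
    contextual = special_config['contextual']
except KeyError:
    print("KeyError: 'contextual'")

# .get() falls back to a default instead.
contextual = special_config.get('contextual', 30)
print(contextual)  # 30
```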
```python
first, last = st_idx[0].item(), st_idx[-1].item()
img_mask[first: last + 1] = ~select_mask
img_mask = ~img_mask
contexual_input_idx = false_pos[target_indices] + first
```
```python
inputs_embeds[:, contexual_input_idx] = contextual_tokens
kwargs['inputs_embeds'] = inputs_embeds[:, img_mask]
del contextual_tokens, hidden_states_filtered, hidden_to_merge, aggregated_hidden
torch.cuda.empty_cache()
```
The use of `torch.cuda.empty_cache()` can negatively impact performance, as it is a blocking, synchronizing call. While it can help release memory, it is often a sign of high memory pressure that might be better addressed by optimizing tensor lifetimes. Please consider whether this call is strictly necessary, especially in library code.
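For illustration, one way to address the underlying memory pressure is to keep the large temporaries local to a helper so they are released when it returns, letting the caching allocator reuse that memory without an explicit `empty_cache()`. The helper name, signature, and shapes below are assumptions, not code from this pull request.

```python
import torch


def prune_inputs(inputs_embeds, img_mask, contexual_input_idx, contextual_tokens):
    """Hypothetical helper mirroring the quoted code; names and signature are assumptions.

    Large temporaries created while merging tokens stay local to this function,
    so their memory returns to PyTorch's caching allocator when it exits and
    can be reused by later allocations without torch.cuda.empty_cache().
    """
    inputs_embeds[:, contexual_input_idx] = contextual_tokens
    return inputs_embeds[:, img_mask]


# Tiny CPU-only usage example with made-up shapes.
embeds = torch.zeros(1, 6, 4)
mask = torch.tensor([True, True, False, True, True, True])
idx = torch.tensor([3])
ctx = torch.ones(1, 1, 4)
print(prune_inputs(embeds, mask, idx, ctx).shape)  # torch.Size([1, 5, 4])
```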
```python
'vision_token_start_index': 15
}


# todo: check
```
llmc/models/qwen2_5vl.py (Outdated)
```python
logger.warning(
    'Can not import lmms_eval. '
    'If you need it, please upgrade transformers.'
)
```
llmc/utils/visualizer.py (Outdated)
```python
from PIL import Image, ImageDraw
import matplotlib.pyplot as plt


def save_image(imgae_tensor,mean,std,save_path):
```
support qwen2.5vl (fastv, dart, visionzip); set token_start_idx, token_length, vision_token_idx in the model