
Conversation

@chengtao-lv (Collaborator)

No description provided.
@gemini-code-assist gemini-code-assist bot left a comment


Hello @chengtao-lv, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini here, providing a summary of this pull request. This PR introduces a new token reduction method called 'DyCoke' specifically designed for vision models, integrated into the Llava OneVision architecture. The core idea appears to be reducing the number of visual tokens by merging similar ones across frames based on cosine similarity. This involves adding the DyCoke implementation, registering it within the token reduction framework, and updating the Llava OneVision model to correctly identify its vision blocks for applying this method. A new configuration file is also added to define the parameters for the DyCoke method.

Highlights

  • New Token Reduction Method: Adds the implementation for a new token reduction method named 'DyCoke', aimed at reducing visual tokens in models like Llava OneVision.
  • Vision Model Integration: Modifies the Llava OneVision model to correctly identify its vision encoder layers, enabling the application of token reduction methods to the visual modality.
  • Configuration: Includes a new YAML configuration file (dycoke.yml) to specify parameters for the DyCoke token reduction method.
  • Token Merging Logic: Implements a function (dycole_ttm) that calculates cosine similarity between tokens in adjacent frames and merges tokens based on a defined merging ratio.
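The pairwise merging idea described in the highlights can be illustrated with a small self-contained sketch. This is a simplification, not the actual `dycole_ttm` implementation: the helper name `merge_adjacent_frames` is hypothetical, and the real code iterates over all frame pairs in a flattened token sequence.

```python
import torch

def merge_adjacent_frames(frame1_tokens, frame2_tokens, merging_ratio):
    # Hypothetical helper illustrating the DyCoke merging idea:
    # keep all of frame1, and from frame2 keep only the tokens that are
    # LEAST similar to their frame1 counterparts (largest=False).
    sim = torch.nn.functional.cosine_similarity(frame1_tokens, frame2_tokens, dim=1)
    num_keep = int(merging_ratio * frame2_tokens.shape[0])
    keep = sim.topk(num_keep, largest=False).indices
    return torch.cat([frame1_tokens, frame2_tokens[keep]], dim=0)

torch.manual_seed(0)
f1 = torch.rand(10, 4)  # 10 tokens of dim 4 from frame 1
f2 = torch.rand(10, 4)  # 10 tokens of dim 4 from frame 2
merged = merge_adjacent_frames(f1, f2, 0.7)  # 10 + int(0.7 * 10) = 17 tokens
```

Tokens of frame 2 that closely match frame 1 are treated as redundant and dropped, which is how merging across adjacent video frames reduces the visual token count.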

Changelog

Click here to see the changelog
  • configs/sparsification/methods/DyCoke/dycoke.yml
    • Added a new configuration file for the DyCoke token reduction method.
    • Defines parameters such as dycoke_layer_idx, num_tokens_per_frame, merging_ratio, and dycoke_radio.
  • llmc/compression/token_reduction/__init__.py
    • Added an import statement to include the new DyCoke module, making it available in the token_reduction package.
  • llmc/compression/token_reduction/dycoke.py
    • Added the core implementation of the DyCoke token reduction method.
    • Includes the dycole_ttm function for token merging based on similarity.
    • Registers DyCoke with the TOKEN_REDUCTION_REGISTRY.
    • Adds a hook (add_dycole_ttm_to_get_2dPool) to integrate the merging logic into the model's processing pipeline, specifically for LlavaMetaForCausalLM.
  • llmc/models/llava_onevision.py
    • Added the find_blocks method to the LLaVA_OV class.
    • This method now correctly identifies the vision encoder layers (self.vision_model.vision_tower.vision_model.encoder.layers) when the modality is 'vision', in addition to handling the 'language' modality.
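The hook mechanism described in the changelog, wrapping a model method so a token-reduction function runs on its output, can be sketched generically. The names below (`add_post_hook`, `Dummy`) are illustrative, not the actual llmc implementation of `add_dycole_ttm_to_get_2dPool`.

```python
# Minimal sketch of attaching a post-processing hook to a model method,
# analogous in spirit to wrapping get_2dPool on LlavaMetaForCausalLM.
def add_post_hook(obj, method_name, post_fn, paras):
    orig = getattr(obj, method_name)  # bound method on this instance
    def wrapped(*args, **kwargs):
        # run the original method, then the token-reduction post-hook
        return post_fn(orig(*args, **kwargs), paras)
    setattr(obj, method_name, wrapped)

class Dummy:
    def get_2dPool(self, x):
        return x * 2

model = Dummy()
add_post_hook(model, "get_2dPool", lambda out, paras: out + paras["offset"], {"offset": 1})
result = model.get_2dPool(3)  # original returns 6, hook adds 1
```

Patching the bound attribute on the instance (rather than the class) keeps the hook scoped to the one model being compressed.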
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces the DyCoke token reduction method. The changes include a new configuration file, the core implementation of the DyCoke algorithm, and necessary integrations into the existing model and token reduction framework.

Overall, the PR lays a good foundation for the DyCoke method. However, there are several points that need attention, including a potential bug in handling an odd number of frames, typos in configuration and code, and areas where clarity could be improved. Addressing these will enhance the robustness and maintainability of the new feature.

Summary of Findings

  • Potential Bug: Data Loss with Odd Number of Frames: In dycoke.py, the dycole_ttm function's first loop does not correctly handle cases where num_frames is odd and greater than 1, potentially leading to the last frame being lost. This is a high-severity issue.
  • Configuration Clarity and Typos: The dycoke.yml configuration file contains placeholder paths that need user attention. There's also a parameter dycoke_radio which appears to be a typo for dycoke_ratio or merging_ratio and is currently unused by the Python code.
  • Naming Inconsistencies: The core function dycole_ttm and its registration helper add_dycole_ttm_to_get_2dPool in dycoke.py seem to have a typo ('dycole' instead of 'dycoke'). Consistent naming aligning with 'DyCoke' is recommended.
  • Code Clarity: The dycole_ttm function in dycoke.py is complex. A docstring explaining its logic, inputs, and outputs would greatly improve its understandability. Additionally, a comment regarding the second processing loop could be more descriptive.
  • Unused Imports: The file llmc/compression/token_reduction/dycoke.py imports functools, List, Optional, Tuple, Union from typing, Cache, DynamicCache from transformers.cache_utils, and prefill_wrapper from .utils, but these imports are not used in the new code. (Severity: low, not commented inline due to review settings)
  • Redundant Tensor Conversion: In llmc/compression/token_reduction/dycoke.py (lines 40-42 and 76-78), torch.tensor(similarity) is applied to values that are already tensors, which is redundant; calling torch.stack(similarities) directly should suffice. (Severity: low, not commented inline due to review settings)
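To illustrate the redundant-conversion finding: `torch.stack` already accepts a list of tensors, so re-wrapping each element with `torch.tensor(...)` adds nothing (and triggers a copy/warning on newer PyTorch versions).

```python
import torch

# similarities, as produced by cosine_similarity, are already tensors
similarities = [torch.rand(5) for _ in range(3)]

# redundant: torch.stack([torch.tensor(s) for s in similarities])
stacked = torch.stack(similarities)  # stacks directly into shape (3, 5)
```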

Merge Readiness

This pull request introduces promising functionality with the DyCoke method. However, due to the identified high-severity issue regarding data loss with an odd number of frames, and several medium-severity issues related to configuration, naming, and clarity, I recommend that these changes be addressed before merging. Once these points are resolved, the PR will be in a much stronger state. As a reviewer, I am not authorized to approve pull requests, so please ensure further review and approval from authorized maintainers after addressing the feedback.

Comment on lines +23 to +64
```python
for i in range(0, num_frames - 1, 2):
    # Get tokens for adjacent frames
    frame1_tokens = image_feature[
        i * num_tokens_per_frame: (i + 1) * num_tokens_per_frame
    ]
    frame2_tokens = image_feature[
        (i + 1) * num_tokens_per_frame: (i + 2) * num_tokens_per_frame
    ]

    # Calculate cosine similarity between normalized tokens
    frame1_norm = torch.nn.functional.normalize(frame1_tokens, p=2, dim=1)
    frame2_norm = torch.nn.functional.normalize(frame2_tokens, p=2, dim=1)
    similarity = torch.nn.functional.cosine_similarity(
        frame1_norm, frame2_norm, dim=1
    )
    similarities.append(similarity)

similarities = torch.stack(
    [torch.tensor(similarity) for similarity in similarities]
)

# Process even frames
modified_image_feature = []
for i in range(0, num_frames - 1, 2):
    frame1_tokens = image_feature[
        i * num_tokens_per_frame: (i + 1) * num_tokens_per_frame
    ]
    frame2_tokens = image_feature[
        (i + 1) * num_tokens_per_frame: (i + 2) * num_tokens_per_frame
    ]

    avg_similarity = similarities[i // 2]
    num_tokens_to_keep = int(merging_ratio * num_tokens_per_frame)
    tokens_to_keep = avg_similarity.topk(num_tokens_to_keep, largest=False).indices

    modified_image_feature.append(frame1_tokens)
    modified_image_feature.append(frame2_tokens[tokens_to_keep])
```

Severity: high

The first loop for i in range(0, num_frames - 1, 2) processes pairs of frames. If num_frames is odd and greater than 1 (e.g., 3, 5), the last frame will be omitted from modified_image_feature. For example, if num_frames = 3, the loop runs for i=0, processing frame0 and frame1. frame2 is never added to modified_image_feature, leading to data loss.

Could you review the logic for handling an odd number of frames? One common approach is to append the last frame unprocessed if num_frames is odd.

A docstring explaining the overall algorithm, inputs, and outputs of this function would also be very beneficial for future understanding and maintenance.
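One way to implement the suggested fix is sketched below. The function name is hypothetical and the per-pair pruning step is omitted for brevity; the point is only the odd-frame handling: after pairing frames, an unpartnered final frame is appended untouched instead of being dropped.

```python
import torch

def split_frames_keep_last(image_feature, num_frames, num_tokens_per_frame):
    # Illustrative sketch: walk frames in pairs, as the original loop does,
    # but append the final frame unprocessed when num_frames is odd.
    parts = []
    for i in range(0, num_frames - 1, 2):
        parts.append(image_feature[i * num_tokens_per_frame:(i + 1) * num_tokens_per_frame])
        parts.append(image_feature[(i + 1) * num_tokens_per_frame:(i + 2) * num_tokens_per_frame])
    if num_frames % 2 == 1:  # last frame has no partner; keep it as-is
        parts.append(image_feature[(num_frames - 1) * num_tokens_per_frame:])
    return parts

feature = torch.rand(3 * 196, 64)  # 3 frames, 196 tokens each
parts = split_frames_keep_last(feature, 3, 196)  # 3 chunks, no tokens lost
```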

```yaml
seed: &seed 42
model:
    type: Llava OneVision
    path: model path
```


Severity: medium

The model path is set to a placeholder model path. Is this intended to be a generic template? If so, it might be helpful to add a comment indicating that this needs to be replaced with an actual path. If a default or example path could be provided, that would be even better for usability.

```yaml
    type: vqa
    name: [mme]
    download: False
    path: MME dataset path
```


Severity: medium

Similar to the model path, the dataset path MME dataset path is a placeholder. Could you clarify if this should be updated by the user, or perhaps provide an example or default?

```yaml
    dycoke_layer_idx: 3
    num_tokens_per_frame: 196
    merging_ratio: 0.7
    dycoke_radio: 0.7
```


Severity: medium

There's a parameter dycoke_radio: 0.7. Is 'radio' a typo for 'ratio'?

Additionally, the Python code (llmc/compression/token_reduction/dycoke.py) uses pruning_paras['merging_ratio'] (defined on line 21) but does not seem to use dycoke_radio.

Could you clarify:

  1. If dycoke_radio should be dycoke_ratio or perhaps merging_ratio?
  2. If it's a distinct parameter, how is it intended to be used?

Suggested change:

```yaml
dycoke_ratio: 0.7 # Or perhaps remove if redundant with merging_ratio
```

```yaml
save:
    save_trans: False
    save_fake: False
    save_path: /path/to/save/
```


Severity: medium

The save_path is set to a generic placeholder /path/to/save/. It would be helpful to either note that this needs user configuration or provide a more concrete example/default.

```python
from .utils import prefill_wrapper


def dycole_ttm(image_feature, pruning_paras):
```


Severity: medium

There seems to be a typo in the function name. Should dycole_ttm be dycoke_ttm to align with the module name 'DyCoke' and the class name? This typo appears in its definition and usages (lines 100, 126). Consistent naming would improve readability and maintainability.

Suggested change:

```diff
-def dycole_ttm(image_feature, pruning_paras):
+def dycoke_ttm(image_feature, pruning_paras):
```

```python
    modified_image_feature.append(frame1_tokens)
    modified_image_feature.append(frame2_tokens[tokens_to_keep])

# Process odd frames
```


Severity: medium

The comment # Process odd frames for the loop starting on line 63 might be misleading. This loop iterates with i taking values 0, 4, 8, ... and processes frame_i and frame_{i+2}. These are originally even-indexed frames.

Perhaps a more descriptive comment like # Further prune even-indexed frames based on similarity to preceding even frames or similar would better reflect the logic?

```python
    return combined_tokens


def add_dycole_ttm_to_get_2dPool(model, post_hook_fn, pruning_paras):
```


Severity: medium

Typo in function name: add_dycole_ttm_to_get_2dPool. Should this be add_dycoke_ttm_to_get_2dPool for consistency with the intended 'DyCoke' naming?

Suggested change:

```diff
-def add_dycole_ttm_to_get_2dPool(model, post_hook_fn, pruning_paras):
+def add_dycoke_ttm_to_get_2dPool(model, post_hook_fn, pruning_paras):
```

Comment on lines +125 to +131

```python
add_dycole_ttm_to_get_2dPool(
    self.model.model, dycole_ttm, self.model.model.pruning_paras
```


Severity: medium

Typo in function name usage: dycole_ttm. Should this be dycoke_ttm?

Suggested change:

```diff
-add_dycole_ttm_to_get_2dPool(
-    self.model.model, dycole_ttm, self.model.model.pruning_paras
+add_dycoke_ttm_to_get_2dPool(
+    self.model.model, dycoke_ttm, self.model.model.pruning_paras
```

@helloyongyang helloyongyang merged commit 62d47ef into main Jun 6, 2025
2 checks passed
@helloyongyang helloyongyang deleted the vlm branch June 6, 2025 07:30