Conversation

@chengtao-lv
Collaborator

No description provided.


@gemini-code-assist gemini-code-assist bot left a comment


Hello @chengtao-lv, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team, gemini-code-assist here to provide a summary of this pull request. This PR introduces a new token reduction method called 'PruneVid', specifically designed for processing video or sequential image inputs within models like Llava. The core idea is to dynamically merge visual tokens based on temporal segments and spatial characteristics within those segments, aiming to reduce the number of tokens processed by the language model while preserving relevant information. The implementation involves adding a new configuration, registering the method, and providing the core logic for temporal segmentation, static/dynamic token splitting, and spatial merging. Additionally, a utility function used for hooking into the model's processing pipeline has been refactored into a shared utility file.

Highlights

  • New Token Reduction Method: PruneVid: This PR adds a new token reduction method named 'PruneVid' to the framework. This method is intended for sparsifying visual tokens, particularly in the context of video or sequential image inputs for models like Llava.
  • Dynamic Temporal and Spatial Merging: The core logic of PruneVid involves dynamically segmenting input frames temporally based on clustering. Within these temporal segments, tokens are split into 'static' and 'dynamic' groups based on similarity and then potentially merged spatially using clustering and averaging, aiming to reduce redundancy.
  • Utility Function Refactoring: A utility function add_post_hook_to_get_2dPool, which is used to add custom processing logic after the model's get_2dPool method, has been moved from dycoke.py to a shared utils.py file for better code organization and reusability.
  • Configuration Added: A new YAML configuration file (prunevid.yml) is added to define the parameters for the PruneVid method, including ratios for temporal segmentation and spatial clustering, and a similarity threshold.
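The static/dynamic split described above can be illustrated with a small NumPy sketch (not the actual torch implementation in `prunevid.py`): within one temporal segment, a token position whose mean pairwise cosine similarity across frames exceeds the threshold `tau` is treated as static and averaged over time, while the rest are kept per frame. The function name and shapes here are illustrative assumptions.

```python
import numpy as np

def split_static_dynamic(frames, tau):
    """Toy sketch of PruneVid's static/dynamic token split for one
    temporal segment. `frames` has shape (T, L, C): T frames, L token
    positions, C channels. Positions with mean pairwise cosine
    similarity across frames above `tau` are 'static' and merged by
    averaging over time; the rest stay 'dynamic' (kept per frame)."""
    T, L, C = frames.shape
    normed = frames / np.linalg.norm(frames, axis=-1, keepdims=True)
    # (T, T, L): cosine similarity of each token position across frame pairs
    sim = np.einsum('tlc,slc->tsl', normed, normed)
    if T > 1:
        # subtract the T diagonal self-similarities, average the rest
        mean_sim = (sim.sum(axis=(0, 1)) - T) / (T * (T - 1))  # (L,)
    else:
        # a single-frame segment has no static component
        mean_sim = np.full(L, -np.inf)
    static_mask = mean_sim > tau
    static_tokens = frames[:, static_mask].mean(axis=0)   # (L_static, C)
    dynamic_tokens = frames[:, ~static_mask]              # (T, L_dynamic, C)
    return static_tokens, dynamic_tokens
```

Note the `T > 1` guard: the normalization term `T * (T - 1)` is exactly the expression the review below flags as a potential division by zero.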

Changelog

Click here to see the changelog
  • configs/sparsification/methods/PruneVid/prunevid.yml
    • Added a new configuration file for the PruneVid token reduction method.
    • Defines parameters such as lora_alpha, selected_layers, alphas, taus, temporal_segment_ratios, and cluster_ratios.
  • llmc/compression/token_reduction/__init__.py
    • Added PruneVid to the list of imported token reduction modules, making it available for registration and use.
  • llmc/compression/token_reduction/dycoke.py
    • Changed the import from .utils to include add_post_hook_to_get_2dPool (line 17).
    • Removed the local definition of add_dycole_ttm_to_get_2dPool (lines 105-112).
    • Updated the call in register_reduction_modules to use the newly imported add_post_hook_to_get_2dPool (line 120).
  • llmc/compression/token_reduction/prunevid.py
    • Added a new file containing the implementation of the PruneVid token reduction method.
    • Includes helper functions for DPC-KNN clustering (cluster_dpc_knn, lines 45-112), refining clusters (refine_clusters, lines 115-222), calculating segment lengths (segment_lengths, lines 225-258), computing cluster average vectors (compute_cluster_vectors, lines 261-306), spatial token merging (spatial_merge_tokens, lines 309-312), and the main dynamic frame merging logic (merge_frames_dynamic, lines 315-397).
    • Registers the PruneVid class in the TOKEN_REDUCTION_REGISTRY (line 400).
    • Registers the merge_frames_dynamic function as a post-hook to the model's get_2dPool method (lines 409-411).
  • llmc/compression/token_reduction/utils.py
    • Added the add_post_hook_to_get_2dPool utility function (lines 98-105), refactored from dycoke.py.
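The refactored utility amounts to a wrapper that runs extra processing on the output of the model's `get_2dPool`. A minimal sketch of how such a post-hook helper might look (the exact signature in llmc's `utils.py` is an assumption inferred from the changelog, not the real code):

```python
import functools

def add_post_hook_to_get_2dPool(model, post_fn, pruning_paras):
    """Wrap model.get_2dPool so that post_fn runs on its output.
    post_fn receives the pooled features plus the pruning parameters
    and returns the (possibly reduced) features."""
    original = model.get_2dPool

    @functools.wraps(original)
    def wrapped(*args, **kwargs):
        out = original(*args, **kwargs)
        return post_fn(out, pruning_paras)

    # Shadow the bound method on the instance with the wrapper.
    model.get_2dPool = wrapped
```

Both DyCoke's TTM and PruneVid's `merge_frames_dynamic` can then be registered through the same entry point, which is the modularity win the review notes below.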
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new token reduction method called PruneVid, along with its configuration and supporting utility functions. The core logic resides in the new prunevid.py file, which implements several algorithms for clustering, segmenting, and merging frame tokens.

The refactoring in dycoke.py and utils.py to create a more generic add_post_hook_to_get_2dPool function is a good improvement for modularity.

Overall, the implementation of PruneVid is comprehensive, but there are a few critical and high-severity issues related to potential runtime errors and maintainability that need to be addressed. I've detailed these in the comments below. As no specific style guide was provided, my review considers general Python best practices (like PEP 8) and code clarity.

Summary of Findings

  • Potential Division by Zero: In merge_frames_dynamic, there's a division by (window_size - 1). If window_size is 1, this will cause a runtime error. This needs to be handled (critical severity).
  • Batch Size Handling: In merge_frames_dynamic, B (batch size) is hardcoded to 1. If the input frames can be batched, this could lead to incorrect behavior or errors (high severity).
  • Maintainability: Magic Numbers: The constant 14 is used in merge_frames_dynamic for feature shape checks. This should be a named constant or configurable parameter (medium severity).
  • Maintainability: Internationalization: Docstrings and comments in refine_clusters are in Chinese, which could be a barrier for non-Chinese speaking contributors. Translation to English is recommended (medium severity).
  • Code Style: Unused Imports: The import functools in llmc/compression/token_reduction/prunevid.py is not used. (low severity - not commented directly due to review settings)
  • Code Style: Redundant Imports: import torch is present inside the refine_clusters function in llmc/compression/token_reduction/prunevid.py, while torch is already imported at the module level. (low severity - not commented directly due to review settings)
  • Configuration: Placeholder Paths: The prunevid.yml configuration file contains placeholder paths like model path. This is acceptable if it's a template, but worth noting. (low severity - not commented directly due to review settings)
  • Code Quality: Dead Code: There are several commented-out lines of code in llmc/compression/token_reduction/prunevid.py (e.g., around lines 380, 394, 397) that seem to be dead code. (low severity - not commented directly due to review settings)

Merge Readiness

This pull request introduces valuable functionality with the PruneVid method. However, due to the identified critical issue (potential division by zero) and a high-severity concern (batch size handling), I recommend that these changes not be merged until these points are addressed. The medium-severity issues regarding maintainability should also be considered for improvement.

I am not authorized to approve pull requests, so please ensure these changes are reviewed and approved by a maintainer after the necessary modifications are made.

Comment on lines +342 to +344
frames_sim = (frames_sim.sum(dim=-2) - 1).sum(dim=-2) / (
window_size * (window_size - 1)
) # B L


critical

The calculation for frames_sim involves division by (window_size * (window_size - 1)). If window_size is 1 (which can happen if segment_lengths produces a segment of a single frame), this will result in a division by zero error.

How should cases where window_size == 1 be handled? Perhaps these frames are always considered dynamic, or their similarity calculation needs a special case (e.g., similarity is 0 or not computed, and the frame is passed through).

        frames_sim = einsum('b w l c, b t l c -> b w t l', frames_normed, frames_normed)
        if window_size > 1:
            frames_sim = (frames_sim.sum(dim=-2) - 1).sum(dim=-2) / (
                window_size * (window_size - 1)
            )  # B L
        else:
            # Handle cases where window_size is 1. 
            # For example, set similarity to a default (e.g., 0 or -infinity)
            # or ensure these frames are treated as fully dynamic.
            # Below is one way: assume no static components if window_size is 1.
            frames_sim = torch.full((B, L), float('-inf'), device=frames.device, dtype=frames.dtype) # B L

Comment on lines +317 to +322
B = 1
num_frames, L, C = frames.shape
threshold = pruning_paras['taus']
cluster_ratio = pruning_paras['cluster_ratios']
temporal_segment_ratio = pruning_paras['temporal_segment_ratios']
frames = frames.view(B, num_frames, L, C) # B T L C


high

There's a hardcoded B = 1 (batch size) here, and frames are reshaped using this. The input frames to this function come from model.get_2dPool.
Could you clarify the expected shape of frames?

  1. If frames is guaranteed to be 3D (num_frames, L, C) (i.e., features for a single video instance), then B=1 and frames.view(B, num_frames, L, C) is essentially frames.unsqueeze(0), which is fine.
  2. However, if get_2dPool can return batched features, e.g., (ActualBatchSize, num_frames, L, C), then num_frames, L, C = frames.shape (line 318) would be incorrect if ActualBatchSize > 1, and hardcoding B=1 would lead to errors or incorrect processing for batches.

If ActualBatchSize can be greater than 1, the batch dimension needs to be handled dynamically. For example:

if frames.ndim == 3:  # (T, L, C)
    actual_B = 1
    num_frames_dim, L_dim, C_dim = frames.shape
    frames_for_processing = frames.unsqueeze(0)
elif frames.ndim == 4:  # (B, T, L, C)
    actual_B, num_frames_dim, L_dim, C_dim = frames.shape
    frames_for_processing = frames
else:
    raise ValueError(f"Unexpected frames ndim: {frames.ndim}, shape: {frames.shape}")

# ... later use frames_for_processing ...
# idx_clusters, _ = cluster_dpc_knn(
#     frames_for_processing.mean(dim=2), # Assuming dim 2 is L (tokens per frame)
#     cluster_num=int(num_frames_dim * temporal_segment_ratio), k=k
# )

This would also affect how window_list (derived from idx_clusters) is iterated if actual_B > 1.

Comment on lines +116 to +123
"""根据给定的聚类结果,对每个批次进行精炼处理。

Args:
cluster_idx: Tensor of shape (B, N),每个元素是聚类的索引。

Returns:
refined_cluster_idx: Tensor of shape (B, N),精炼后的聚类结果。
"""


medium

The docstring for refine_clusters is in Chinese. For consistency and broader understanding within the project (assuming the primary language is English), could this be translated to English? The internal comments detailing the steps (e.g., "步骤1", i.e. "Step 1") are also in Chinese and would benefit from translation if this function's logic is complex enough to warrant them.

Comment on lines +357 to +360
if static_feat.shape[1] > 14:
static_feat = spatial_merge_tokens(
static_feat, num_cluster=int(static_feat.shape[1] * cluster_ratio), k=7
)


medium

The number 14 in static_feat.shape[1] > 14 (and similarly on line 372 for dynamic_feat_window) appears to be a magic number.

Could this be defined as a named constant at the beginning of the file or passed as a parameter (e.g., from pruning_paras)? This would improve readability and make it easier to adjust if needed.
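A minimal sketch of the suggested refactor, with the constant name and helper entirely hypothetical (the real `spatial_merge_tokens` call site in `prunevid.py` would use the returned count directly):

```python
# Hypothetical named constant replacing the magic number 14 flagged above.
MIN_TOKENS_TO_MERGE = 14  # below this, spatial merging is not worthwhile

def target_cluster_count(num_tokens, cluster_ratio,
                         min_tokens=MIN_TOKENS_TO_MERGE):
    """Return the target cluster count for spatial merging, or None
    when the token count is too small to benefit from merging."""
    if num_tokens > min_tokens:
        return int(num_tokens * cluster_ratio)
    return None
```

Exposing `min_tokens` via `pruning_paras` would also let the threshold track different vision-tower resolutions without a code change.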

@helloyongyang helloyongyang merged commit bbe1658 into main Jun 9, 2025
2 checks passed
@helloyongyang helloyongyang deleted the vlm branch June 9, 2025 19:06