prunevid #389
Conversation
Hello @chengtao-lv, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello team, gemini-code-assist here to provide a summary of this pull request. This PR introduces a new token reduction method called 'PruneVid', specifically designed for processing video or sequential image inputs within models like Llava. The core idea is to dynamically merge visual tokens based on temporal segments and spatial characteristics within those segments, aiming to reduce the number of tokens processed by the language model while preserving relevant information. The implementation involves adding a new configuration, registering the method, and providing the core logic for temporal segmentation, static/dynamic token splitting, and spatial merging. Additionally, a utility function used for hooking into the model's processing pipeline has been refactored into a shared utility file.
Highlights
- New Token Reduction Method: PruneVid: This PR adds a new token reduction method named 'PruneVid' to the framework. This method is intended for sparsifying visual tokens, particularly in the context of video or sequential image inputs for models like Llava.
- Dynamic Temporal and Spatial Merging: The core logic of PruneVid involves dynamically segmenting input frames temporally based on clustering. Within these temporal segments, tokens are split into 'static' and 'dynamic' groups based on similarity and then potentially merged spatially using clustering and averaging, aiming to reduce redundancy.
- Utility Function Refactoring: A utility function `add_post_hook_to_get_2dPool`, which is used to add custom processing logic after the model's `get_2dPool` method, has been moved from `dycoke.py` to a shared `utils.py` file for better code organization and reusability.
- Configuration Added: A new YAML configuration file (`prunevid.yml`) is added to define the parameters for the PruneVid method, including ratios for temporal segmentation and spatial clustering, and a similarity threshold.
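The static/dynamic split described in the highlights can be illustrated roughly as follows. This is a hedged reconstruction, not the PR's code: the name `split_static_dynamic`, the `(T, L, C)` token layout, and the exact averaging rule are all assumptions, and it presumes a temporal segment of more than one frame.

```python
import torch

def split_static_dynamic(frames, tau=0.8):
    """Split each spatial token into static vs. dynamic based on its mean
    pairwise cosine similarity across the frames of one temporal segment.

    Illustrative sketch only; assumes frames has shape (T, L, C) with T > 1.
    """
    T = frames.shape[0]
    normed = torch.nn.functional.normalize(frames, dim=-1)
    sim = torch.einsum('tlc,slc->tsl', normed, normed)      # (T, T, L)
    # Average over ordered frame pairs, excluding the T self-similarities (each = 1)
    mean_sim = (sim.sum(dim=(0, 1)) - T) / (T * (T - 1))    # (L,)
    static_mask = mean_sim >= tau
    static_feat = frames[:, static_mask].mean(dim=0)        # static tokens: averaged over time
    dynamic_feat = frames[:, ~static_mask]                  # dynamic tokens: kept per frame
    return static_feat, dynamic_feat, static_mask
```

A token whose feature barely changes across the segment's frames gets a mean similarity near 1 and is merged temporally; a token that changes (e.g., a moving object) stays per-frame.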
Changelog
Click here to see the changelog
- configs/sparsification/methods/PruneVid/prunevid.yml
  - Added a new configuration file for the PruneVid token reduction method.
  - Defines parameters such as `lora_alpha`, `selected_layers`, `alphas`, `taus`, `temporal_segment_ratios`, and `cluster_ratios`.
- llmc/compression/token_reduction/__init__.py
  - Added `PruneVid` to the list of imported token reduction modules, making it available for registration and use.
- llmc/compression/token_reduction/dycoke.py
  - Changed the import from `.utils` to include `add_post_hook_to_get_2dPool` (line 17).
  - Removed the local definition of `add_dycole_ttm_to_get_2dPool` (lines 105-112).
  - Updated the call in `register_reduction_modules` to use the newly imported `add_post_hook_to_get_2dPool` (line 120).
- llmc/compression/token_reduction/prunevid.py
  - Added a new file containing the implementation of the `PruneVid` token reduction method.
  - Includes helper functions for DPC-KNN clustering (`cluster_dpc_knn`, lines 45-112), refining clusters (`refine_clusters`, lines 115-222), calculating segment lengths (`segment_lengths`, lines 225-258), computing cluster average vectors (`compute_cluster_vectors`, lines 261-306), spatial token merging (`spatial_merge_tokens`, lines 309-312), and the main dynamic frame merging logic (`merge_frames_dynamic`, lines 315-397).
  - Registers the `PruneVid` class in the `TOKEN_REDUCTION_REGISTRY` (line 400).
  - Registers the `merge_frames_dynamic` function as a post-hook to the model's `get_2dPool` method (lines 409-411).
- llmc/compression/token_reduction/utils.py
  - Added the `add_post_hook_to_get_2dPool` utility function (lines 98-105), refactored from `dycoke.py`.
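The post-hook refactoring described in the changelog follows a common wrap-and-replace pattern, sketched generically below. The helper name `add_post_hook` and the dummy model are illustrative; the project's actual `add_post_hook_to_get_2dPool` may differ in signature.

```python
def add_post_hook(obj, method_name, post_fn):
    """Replace obj.<method_name> with a wrapper that feeds its return
    value through post_fn. Sketch of the post-hook pattern; names are
    illustrative, not the project's actual API.
    """
    original = getattr(obj, method_name)  # bound method

    def wrapped(*args, **kwargs):
        return post_fn(original(*args, **kwargs))

    # Instance attribute shadows the class method for this object only
    setattr(obj, method_name, wrapped)


class DummyModel:
    def get_2dPool(self, x):
        return x * 2

model = DummyModel()
add_post_hook(model, 'get_2dPool', lambda out: out + 1)
# model.get_2dPool(3) now returns 3 * 2 + 1 = 7
```

Setting the wrapper on the instance (rather than the class) keeps the hook scoped to the one model object being sparsified.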
Code Review
This pull request introduces a new token reduction method called PruneVid, along with its configuration and supporting utility functions. The core logic resides in the new prunevid.py file, which implements several algorithms for clustering, segmenting, and merging frame tokens.
The refactoring in dycoke.py and utils.py to create a more generic add_post_hook_to_get_2dPool function is a good improvement for modularity.
Overall, the implementation of PruneVid is comprehensive, but there are a few critical and high-severity issues related to potential runtime errors and maintainability that need to be addressed. I've detailed these in the comments below. As no specific style guide was provided, my review considers general Python best practices (like PEP 8) and code clarity.
Summary of Findings
- Potential Division by Zero: In `merge_frames_dynamic`, there's a division by `(window_size - 1)`. If `window_size` is 1, this will cause a runtime error. This needs to be handled (critical severity).
- Batch Size Handling: In `merge_frames_dynamic`, `B` (batch size) is hardcoded to 1. If the input `frames` can be batched, this could lead to incorrect behavior or errors (high severity).
- Maintainability: Magic Numbers: The constant `14` is used in `merge_frames_dynamic` for feature shape checks. This should be a named constant or configurable parameter (medium severity).
- Maintainability: Internationalization: Docstrings and comments in `refine_clusters` are in Chinese, which could be a barrier for non-Chinese-speaking contributors. Translation to English is recommended (medium severity).
- Code Style: Unused Imports: The import `functools` in `llmc/compression/token_reduction/prunevid.py` is not used. (low severity - not commented directly due to review settings)
- Code Style: Redundant Imports: `import torch` is present inside the `refine_clusters` function in `llmc/compression/token_reduction/prunevid.py`, while `torch` is already imported at the module level. (low severity - not commented directly due to review settings)
- Configuration: Placeholder Paths: The `prunevid.yml` configuration file contains placeholder paths like `model path`. This is acceptable if it's a template, but worth noting. (low severity - not commented directly due to review settings)
- Code Quality: Dead Code: There are several commented-out lines of code in `llmc/compression/token_reduction/prunevid.py` (e.g., around lines 380, 394, 397) that seem to be dead code. (low severity - not commented directly due to review settings)
Merge Readiness
This pull request introduces valuable functionality with the PruneVid method. However, due to the identified critical issue (potential division by zero) and a high-severity concern (batch size handling), I recommend that these changes not be merged until these points are addressed. The medium-severity issues regarding maintainability should also be considered for improvement.
I am not authorized to approve pull requests, so please ensure these changes are reviewed and approved by a maintainer after the necessary modifications are made.
```python
frames_sim = (frames_sim.sum(dim=-2) - 1).sum(dim=-2) / (
    window_size * (window_size - 1)
)  # B L
```
The calculation for `frames_sim` involves division by `(window_size * (window_size - 1))`. If `window_size` is 1 (which can happen if `segment_lengths` produces a segment of a single frame), this will result in a division-by-zero error.

How should cases where `window_size == 1` be handled? Perhaps these frames are always considered dynamic, or their similarity calculation needs a special case (e.g., similarity is 0 or not computed, and the frame is passed through).

```python
frames_sim = einsum('b w l c, b t l c -> b w t l', frames_normed, frames_normed)
if window_size > 1:
    frames_sim = (frames_sim.sum(dim=-2) - 1).sum(dim=-2) / (
        window_size * (window_size - 1)
    )  # B L
else:
    # Handle cases where window_size is 1.
    # For example, set similarity to a default (e.g., 0 or -infinity)
    # or ensure these frames are treated as fully dynamic.
    # Below is one way: assume no static components if window_size is 1.
    frames_sim = torch.full((B, L), float('-inf'), device=frames.device, dtype=frames.dtype)  # B L
```

```python
B = 1
num_frames, L, C = frames.shape
threshold = pruning_paras['taus']
cluster_ratio = pruning_paras['cluster_ratios']
temporal_segment_ratio = pruning_paras['temporal_segment_ratios']
frames = frames.view(B, num_frames, L, C)  # B T L C
```
There's a hardcoded `B = 1` (batch size) here, and frames are reshaped using this. The input `frames` to this function comes from `model.get_2dPool`.

Could you clarify the expected shape of `frames`?

- If `frames` is guaranteed to be 3D `(num_frames, L, C)` (i.e., features for a single video instance), then `B=1` and `frames.view(B, num_frames, L, C)` is essentially `frames.unsqueeze(0)`, which is fine.
- However, if `get_2dPool` can return batched features, e.g., `(ActualBatchSize, num_frames, L, C)`, then `num_frames, L, C = frames.shape` (line 318) would be incorrect if `ActualBatchSize > 1`, and hardcoding `B=1` would lead to errors or incorrect processing for batches.

If `ActualBatchSize` can be greater than 1, the batch dimension needs to be handled dynamically. For example:

```python
if frames.ndim == 3:  # (T, L, C)
    actual_B = 1
    num_frames_dim, L_dim, C_dim = frames.shape
    frames_for_processing = frames.unsqueeze(0)
elif frames.ndim == 4:  # (B, T, L, C)
    actual_B, num_frames_dim, L_dim, C_dim = frames.shape
    frames_for_processing = frames
else:
    raise ValueError(f"Unexpected frames ndim: {frames.ndim}, shape: {frames.shape}")

# ... later use frames_for_processing ...
# idx_clusters, _ = cluster_dpc_knn(
#     frames_for_processing.mean(dim=2),  # Assuming dim 2 is L (tokens per frame)
#     cluster_num=int(num_frames_dim * temporal_segment_ratio), k=k
# )
```

This would also affect how `window_list` (derived from `idx_clusters`) is iterated if `actual_B > 1`.
```python
"""根据给定的聚类结果,对每个批次进行精炼处理。

Args:
    cluster_idx: Tensor of shape (B, N),每个元素是聚类的索引。

Returns:
    refined_cluster_idx: Tensor of shape (B, N),精炼后的聚类结果。
"""
```
The docstring for `refine_clusters` is in Chinese. For consistency and broader understanding within the project (assuming the primary language is English), could this be translated to English? The internal comments detailing the steps (e.g., "步骤1", i.e. "Step 1") are also in Chinese and would benefit from translation if this function's logic is complex enough to warrant them.
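For reference, a possible English rendering of the quoted docstring (a suggested translation, not code from the PR; the function body is elided):

```python
def refine_clusters(cluster_idx):
    """Refine the clustering result for each batch, given the cluster assignments.

    Args:
        cluster_idx: Tensor of shape (B, N); each element is a cluster index.

    Returns:
        refined_cluster_idx: Tensor of shape (B, N), the refined cluster assignments.
    """
    ...  # implementation unchanged; only the docstring is translated
```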
```python
if static_feat.shape[1] > 14:
    static_feat = spatial_merge_tokens(
        static_feat, num_cluster=int(static_feat.shape[1] * cluster_ratio), k=7
    )
```
The number `14` in `static_feat.shape[1] > 14` (and similarly on line 372 for `dynamic_feat_window`) appears to be a magic number.
Could this be defined as a named constant at the beginning of the file or passed as a parameter (e.g., from `pruning_paras`)? This would improve readability and make it easier to adjust if needed.
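One way this could look, sketched under assumed names (`DEFAULT_MIN_MERGE_TOKENS` and the `pruning_paras` key `min_merge_tokens` are illustrative and not part of the PR):

```python
# Current hardcoded threshold, lifted to a named constant (assumed name)
DEFAULT_MIN_MERGE_TOKENS = 14

def min_merge_tokens(pruning_paras):
    # Fall back to today's behavior when the config omits the new key
    return pruning_paras.get('min_merge_tokens', DEFAULT_MIN_MERGE_TOKENS)

def should_spatially_merge(num_tokens, pruning_paras):
    # Mirrors the static_feat.shape[1] > 14 check with a configurable bound
    return num_tokens > min_merge_tokens(pruning_paras)
```

The `.get` fallback keeps existing `prunevid.yml` configs valid while letting new configs tune the threshold.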