fastv,fastervlm,visionzip for llava1.5 #393

zhangbilang · 2025-06-22T22:50:18Z

Accuracy Alignment

gemini-code-assist

Summary of Changes

Hello @SmudgedWings, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on enhancing the compatibility and performance alignment of FastV, FasterVLM, and VisionZip token reduction methods with the LLaVA 1.5 model. The changes involve updating configuration parameters, refactoring the core logic of these methods to support different LLaVA model variants, and ensuring proper handling of model-specific settings like use_cache.

Highlights

Configuration Updates: I've updated the sparsification rates for FastV and FasterVLM in their respective YAML configuration files to 0.778. Additionally, the VisionZip configuration has been restructured to explicitly group vision-related parameters under a new vision key.
LLaVA 1.5 Model Compatibility: I've adapted the FastV, FasterVLM, and VisionZip token reduction implementations to ensure compatibility with different LLaVA model architectures (specifically LlavaHf and Llava). This involved introducing conditional logic for hook registrations and adjusting internal parameter handling (pruning_pars renamed to pruning_paras) based on the detected model class.
VisionZip Refactoring: The VisionZip integration has undergone significant refactoring. The custom CLIPVisionTower_VisionZip class has been removed, and its core logic is now applied by directly patching the CLIPEncoderLayer's forward method for Llava models. This allows for a more direct and integrated approach to token reduction within the vision tower.
Model Cache Management: I've enforced use_cache to be True for LLaVA model configurations, as it appears to be a necessary setting for these sparsification methods. Correspondingly, an explicit use_cache = False setting within the vlm_model.forward wrapper has been removed to prevent conflicts.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in issue comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces support for FastV, FasterVLM, and VisionZip token reduction methods for LLaVA 1.5 models. The changes primarily involve adding conditional logic to handle different LLaVA model architectures and implementing the necessary hooks and model modifications for these compression techniques. The review identified critical issues related to global monkey-patching and potential runtime errors, along with suggestions for code robustness and maintainability.

gemini-code-assist · 2025-06-22T22:52:21Z

llmc/compression/token_reduction/fastervlm.py

+            else:
+                pruning_paras['image_attentions'] = pruning_paras['image_attentions_list']


If len(pruning_paras['image_attentions_list']) is greater than 1, pruning_paras['image_attentions'] will be a list of tensors. The pruning_hook at line 59 expects a single tensor, which will raise an AttributeError. Stack and average the attentions here.

Suggested change

else:

pruning_paras['image_attentions'] = pruning_paras['image_attentions_list']

else:

# If multiple attention tensors are collected, stack and average them.

pruning_paras['image_attentions'] = torch.stack(pruning_paras['image_attentions_list']).mean(dim=0)

gemini-code-assist · 2025-06-22T22:52:22Z

llmc/compression/token_reduction/visionzip.py

-        self.model.vlm_model.__class__.forward = visionzip_forward
-        self.model.vlm_model.vision_tower.register_forward_pre_hook(
+        if self.model.__class__.__name__ == 'LlavaHf':
+            self.model.vlm_model.__class__.forward = visionzip_forward


Modifying __class__.forward monkey-patches the method for all instances of this class. Modify the forward method on the instance instead using types.MethodType.

from types import MethodType self.model.vlm_model.forward = MethodType(visionzip_forward, self.model.vlm_model)

gemini-code-assist · 2025-06-22T22:52:22Z

llmc/compression/token_reduction/visionzip.py

+            self.model.vlm_model.__class__.forward = visionzip_forward
+        elif self.model.__class__.__name__ == 'Llava':
+            from transformers.models.clip.modeling_clip import CLIPEncoderLayer
+            CLIPEncoderLayer.forward = CLIP_EncoderLayer_forward


Globally monkey-patching CLIPEncoderLayer.forward is extremely risky. Modify the method only for the specific instances you are working with, or if you must patch the class, restore the original method afterward to limit the scope of the change.

gemini-code-assist · 2025-06-22T22:52:22Z

llmc/compression/token_reduction/fastervlm.py

+            elif pruning_paras['select_feature'] in ('full', 'cls_patch'):
+                image_attention = image_attentions
            else:
                raise ValueError(f'Unexpected select feature: {self.select_feature}')


The variable self.select_feature is not defined in this class context. Use pruning_paras['select_feature'] instead.

Suggested change

raise ValueError(f'Unexpected select feature: {self.select_feature}')

raise ValueError(f"Unexpected select feature: {pruning_paras['select_feature']}")

gemini-code-assist · 2025-06-22T22:52:22Z

llmc/compression/token_reduction/fastv.py

+            hook_fn = make_hook_prepare_inputs_labels_for_multimodal(self.pruning_paras)
+            self.model.vlm_model._original_prepare_inputs_labels_for_multimodal = (
+                self.model.vlm_model.prepare_inputs_labels_for_multimodal
+            )
+            self.model.vlm_model.prepare_inputs_labels_for_multimodal = MethodType(
+                hook_fn, self.model.vlm_model
+            )


Monkey-patching methods at runtime can make the code difficult to understand and maintain. Consider a more explicit mechanism, like subclassing the model or using a wrapper class. If monkey-patching is the only viable option, add a comment explaining why it's necessary.

gemini-code-assist · 2025-06-22T22:52:22Z

llmc/compression/token_reduction/fastervlm.py

+        if self.model.__class__.__name__ == 'LlavaHf':
+            self.model.vision_model.register_forward_pre_hook(
+                update_output_attentions_hook,
+                with_kwargs=True
+            )
+
+            self.model.vision_model.register_forward_hook(
+                functools.partial(store_attention_hook, pruning_paras=self.pruning_paras),
+            )
+        elif self.model.__class__.__name__ == 'Llava':
+            self.model.vision_model.register_forward_pre_hook(
+                functools.partial(clear_attentions_hook, pruning_paras=self.pruning_paras),
+            )
+
+            self.model.vision_model.register_forward_hook(
+                functools.partial(update_attentions_hook, pruning_paras=self.pruning_paras),
+            )
+
+            self.model.vision_model.vision_tower.register_forward_pre_hook(
+                update_output_attentions_hook,
+                with_kwargs=True
+            )
+
+            self.model.vision_model.vision_tower.register_forward_hook(
+                functools.partial(store_attention_hook, pruning_paras=self.pruning_paras),
+            )

        self.model.vision_projector.register_forward_pre_hook(
-            functools.partial(pruning_hook, pruning_pars=self.model.model.parameters),
+            functools.partial(pruning_hook, pruning_paras=self.pruning_paras),
            with_kwargs=True
        )

-        self.model.vlm_model.register_forward_pre_hook(
-            functools.partial(get_image_mask_hook, pruning_pars=self.model.model.parameters),
-            with_kwargs=True
-        )
+        if self.model.__class__.__name__ == 'LlavaHf':
+            self.model.vlm_model.register_forward_pre_hook(
+                functools.partial(get_image_mask_hook, pruning_paras=self.pruning_paras),
+                with_kwargs=True
+            )

-        self.model.model.register_forward_pre_hook(
-            functools.partial(
-                prepare_inputs_for_llm_hook, pruning_pars=self.model.model.parameters
-            ),
-            with_kwargs=True
-        )
+            self.model.model.register_forward_pre_hook(
+                functools.partial(
+                    prepare_inputs_for_llm_hook, pruning_paras=self.pruning_paras
+                ),
+                with_kwargs=True
+            )
+        elif self.model.__class__.__name__ == 'Llava':
+            self.model.vision_projector.register_forward_hook(
+                functools.partial(prepare_inputs_hook, pruning_paras=self.pruning_paras),
+            )


Using self.model.__class__.__name__ for conditional logic is brittle. Use isinstance with the specific model classes (e.g., isinstance(self.model, LlavaHf)) for a more robust approach.

gemini-code-assist · 2025-06-22T22:52:22Z

llmc/compression/token_reduction/visionzip.py

-                torch.bmm(assign_one_hot.transpose(1, 2), hidden_to_merge) / counts
-            )
-            target_hidden = hidden_states_filtered[:, target_indices, :]
+    r = self.self_attn.k_proj._info['r'].pop(0)


Using pop(0) on a list is an O(n) operation. Consider using collections.deque, which provides an O(1) popleft() operation.

gemini-code-assist · 2025-06-22T22:52:22Z

llmc/compression/token_reduction/visionzip.py

+        if self.model.__class__.__name__ == 'LlavaHf':
+            self.model.vlm_model.__class__.forward = visionzip_forward
+        elif self.model.__class__.__name__ == 'Llava':
+            from transformers.models.clip.modeling_clip import CLIPEncoderLayer


Imports should generally be at the top of the file, as per PEP 8 guidelines. Move this import to the top of the file.

fastv,fastervlm,visionzip for llava1.5

14ae841

gemini-code-assist bot reviewed Jun 22, 2025

View reviewed changes

llmc-reviewer approved these changes Jun 22, 2025

View reviewed changes

gemini-code-assist bot reviewed Jun 22, 2025

View reviewed changes

zhangbilang merged commit ff18600 into ModelTC:main Jun 22, 2025
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fastv,fastervlm,visionzip for llava1.5 #393

fastv,fastervlm,visionzip for llava1.5 #393

Uh oh!

zhangbilang commented Jun 22, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Jun 22, 2025

Uh oh!

gemini-code-assist bot Jun 22, 2025

Uh oh!

gemini-code-assist bot Jun 22, 2025

Uh oh!

gemini-code-assist bot Jun 22, 2025

Uh oh!

gemini-code-assist bot Jun 22, 2025

Uh oh!

gemini-code-assist bot Jun 22, 2025

Uh oh!

gemini-code-assist bot Jun 22, 2025

Uh oh!

gemini-code-assist bot Jun 22, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		else:
		pruning_paras['image_attentions'] = pruning_paras['image_attentions_list']

-            else:
-                pruning_paras['image_attentions'] = pruning_paras['image_attentions_list']
+            else:
+                # If multiple attention tensors are collected, stack and average them.
+                pruning_paras['image_attentions'] = torch.stack(pruning_paras['image_attentions_list']).mean(dim=0)

	raise ValueError(f'Unexpected select feature: {self.select_feature}')
	raise ValueError(f"Unexpected select feature: {pruning_paras['select_feature']}")

fastv,fastervlm,visionzip for llava1.5 #393

fastv,fastervlm,visionzip for llava1.5 #393

Uh oh!

Conversation

zhangbilang commented Jun 22, 2025

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Jun 22, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jun 22, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jun 22, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jun 22, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jun 22, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jun 22, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jun 22, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jun 22, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants