
Conversation

@chengtao-lv (Collaborator)

No description provided.

@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @chengtao-lv, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the llmc framework by integrating advanced performance optimizations and detailed evaluation capabilities. It introduces a sophisticated token reduction technique for Qwen2 models to improve inference efficiency, complemented by a robust system for collecting and reporting critical runtime statistics like memory consumption and processing times during evaluation.

Highlights

  • New Feature: Token Reduction (HOLITOM): Implemented a novel HOLITOM-based token reduction strategy specifically for Qwen2 models. This strategy prunes image tokens based on their attention scores at a configurable layer (HOLITOM_k) and merges the hidden states of less important tokens into more salient ones, aiming to reduce sequence length and improve efficiency (a minimal illustrative sketch follows this list).
  • New Feature: Performance Statistics Collection: Introduced a comprehensive system to collect and report key performance metrics during model evaluation. This includes tracking average prefill time, average decode time, and peak CUDA memory usage, providing valuable insights into model efficiency.
  • Bug Fix & Configuration Improvement: Resolved issues related to configuration management for token reduction parameters (HOLITOM_k, HOLITOM_r) by transitioning from environment variables (os.getenv) to instance-specific self.pruning_paras. This also includes a fix to ensure these parameters are correctly propagated to deeply nested model structures.
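
For orientation, here is a minimal, self-contained sketch of the attention-based prune-and-merge step described in the first highlight. Everything in it (the function name, tensor names and shapes, the batch-size-1 simplification, and the use of a plain group mean for merging) is an illustrative assumption rather than the PR's actual API; the real implementation lives in holitom.py.

import torch
import torch.nn.functional as F

# Illustrative sketch only (batch size 1):
#   attn:      (H, L, L) attention weights captured at layer HOLITOM_k
#   hidden:    (L, D) hidden states at that layer
#   image_pos: (n_img,) LongTensor with the positions of image tokens
#   r:         fraction of image tokens to prune (HOLITOM_r)
def prune_and_merge(attn, hidden, image_pos, r):
    # Score each image token by the attention it receives from the last token.
    scores = attn.mean(dim=0)[-1, image_pos]
    n_keep = max(1, int(image_pos.numel() * (1 - r)))
    keep = scores.topk(n_keep).indices
    drop = scores.topk(image_pos.numel() - n_keep, largest=False).indices

    kept_h, drop_h = hidden[image_pos[keep]], hidden[image_pos[drop]]
    # Assign each pruned token to its most similar kept token.
    target = F.cosine_similarity(drop_h.unsqueeze(1), kept_h.unsqueeze(0), dim=-1).argmax(dim=1)

    # Mean-merge pruned tokens into their targets, then remove them from the sequence.
    merged = kept_h.clone().index_add_(0, target, drop_h)
    counts = torch.ones(n_keep, device=hidden.device, dtype=hidden.dtype)
    counts.index_add_(0, target, torch.ones_like(target, dtype=hidden.dtype))
    hidden = hidden.clone()
    hidden[image_pos[keep]] = merged / counts.unsqueeze(1)

    keep_mask = torch.ones(hidden.size(0), dtype=torch.bool, device=hidden.device)
    keep_mask[image_pos[drop]] = False
    return hidden[keep_mask]
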
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature               Command              Description
Code Review           /gemini review       Performs a code review for the current pull request in its current state.
Pull Request Summary  /gemini summary      Provides a summary of the current pull request in its current state.
Comment               @gemini-code-assist  Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                  /gemini help         Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces performance statistics collection and includes bug fixes. The changes in holitom.py introduce several high- and critical-severity issues, including potential crashes, incorrect logic, and inefficient code. Please prioritize fixing the identified issues to ensure the code is correct and robust.

Comment on lines +245 to +250
prefill = (
    llmc_model.vlm_model.prefill_time / llmc_model.vlm_model.prefill_count
)
decode = (
    llmc_model.vlm_model.decode_time / llmc_model.vlm_model.decode_count
)

critical

The calculation of prefill and decode average times can lead to a ZeroDivisionError if prefill_count or decode_count is zero. Add checks to prevent division by zero.

Suggested change
-prefill = (
-    llmc_model.vlm_model.prefill_time / llmc_model.vlm_model.prefill_count
-)
-decode = (
-    llmc_model.vlm_model.decode_time / llmc_model.vlm_model.decode_count
-)
+prefill = (
+    (llmc_model.vlm_model.prefill_time / llmc_model.vlm_model.prefill_count)
+    if llmc_model.vlm_model.prefill_count > 0 else 0.0
+)
+decode = (
+    (llmc_model.vlm_model.decode_time / llmc_model.vlm_model.decode_count)
+    if llmc_model.vlm_model.decode_count > 0 else 0.0
+)

def add_sparse_config(self):
    special_config = self.config.get('special', {})
    self.model.model.pruning_paras = special_config
    self.model.model.model.pruning_paras = special_config

high

The line self.model.model.model.pruning_paras = special_config appears redundant, since self.model.model.pruning_paras is set on the preceding line. Verify whether this assignment is necessary and remove it if not.
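
If the deeper assignment is in fact intentional (the PR summary above mentions propagating these parameters to deeply nested model structures), a hedged alternative is to walk the nested .model chain instead of hard-coding the depth. This is only a sketch under that assumption, not the PR's code:

def add_sparse_config(self):
    special_config = self.config.get('special', {})
    # Attach pruning_paras to every nested submodel (e.g. the VLM wrapper and the
    # inner language model) without hard-coding how many .model levels exist.
    target = self.model
    while hasattr(target, 'model'):
        target = target.model
        target.pruning_paras = special_config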

Comment on lines +1410 to +1421
for b in range(hidden_states.size(0)):
    for i in range(len(non_topk_indices)):
        non_topk_idx = non_topk_indices[i]
        most_similar_topk_idx = (
            top_attention_rank_index[
                sim_max_index[b, i]
            ]
        )
        hidden_states[b, most_similar_topk_idx, :] = (
            hidden_states[b, most_similar_topk_idx, :]
            + hidden_states[b, non_topk_idx, :]
        ) / 2

high

The nested Python loops are inefficient for tensor operations on a GPU. Refactor this section to use vectorized PyTorch operations (e.g., torch.scatter_add_).
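
One possible vectorized form is sketched below. It assumes, based on the excerpt above, that hidden_states is (B, L, D), non_topk_indices and top_attention_rank_index are 1-D index tensors, and sim_max_index has shape (B, len(non_topk_indices)). Note that scatter_add_ computes a true mean over each merge group, which differs slightly from the loop's sequential pairwise averaging whenever several pruned tokens share the same target.

import torch

B, L, D = hidden_states.shape
# Sequence position each pruned token merges into, per batch element.
target_idx = top_attention_rank_index[sim_max_index]                 # (B, N)
src = hidden_states[:, non_topk_indices, :]                          # (B, N, D)

# Accumulate pruned-token states onto their targets, then divide by group size.
merged = hidden_states.clone()
merged.scatter_add_(1, target_idx.unsqueeze(-1).expand(-1, -1, D), src)
counts = torch.ones(B, L, device=hidden_states.device, dtype=hidden_states.dtype)
counts.scatter_add_(1, target_idx, torch.ones_like(target_idx, dtype=hidden_states.dtype))
hidden_states = merged / counts.unsqueeze(-1)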

Comment on lines +1460 to +1480
if layer_idx == HOLITOM_k - 1:
    output_attentions = True
else:
    output_attentions = False

layer_outputs = decoder_layer(
    hidden_states,
    attention_mask=causal_mask,
    position_ids=position_ids,
    past_key_value=past_key_values,
    output_attentions=output_attentions,
    use_cache=use_cache,
    cache_position=cache_position,
    position_embeddings=position_embeddings,
    **flash_attn_kwargs,
)

hidden_states = layer_outputs[0]

# if output_attentions:
#     all_self_attns += (layer_outputs[1],)

high

The logic for handling output_attentions incorrectly overwrites the user-provided output_attentions value in each iteration. This prevents users from getting attention outputs for all layers if they request it.

Suggested change
-if layer_idx == HOLITOM_k - 1:
-    output_attentions = True
-else:
-    output_attentions = False
-layer_outputs = decoder_layer(
-    hidden_states,
-    attention_mask=causal_mask,
-    position_ids=position_ids,
-    past_key_value=past_key_values,
-    output_attentions=output_attentions,
-    use_cache=use_cache,
-    cache_position=cache_position,
-    position_embeddings=position_embeddings,
-    **flash_attn_kwargs,
-)
-hidden_states = layer_outputs[0]
-# if output_attentions:
-#     all_self_attns += (layer_outputs[1],)
+# Determine if we need attentions for this layer, either for pruning or because the user requested them.
+_get_attentions = output_attentions or (layer_idx == HOLITOM_k - 1)
+layer_outputs = decoder_layer(
+    hidden_states,
+    attention_mask=causal_mask,
+    position_ids=position_ids,
+    past_key_value=past_key_values,
+    output_attentions=_get_attentions,
+    use_cache=use_cache,
+    cache_position=cache_position,
+    position_embeddings=position_embeddings,
+    **flash_attn_kwargs,
+)
+hidden_states = layer_outputs[0]
+if output_attentions:
+    all_self_attns += (layer_outputs[1],)

def end_time_hook(module, inputs, kwargs, layer_outputs):
    torch.cuda.synchronize()
    elapsed_prefill = time.time() - module.start_time
    if kwargs['inputs_embeds'] is not None:

high

Accessing kwargs['inputs_embeds'] directly can raise a KeyError if inputs_embeds is not present in the kwargs dictionary. Use kwargs.get('inputs_embeds') is not None to avoid a potential crash.

Suggested change
-    if kwargs['inputs_embeds'] is not None:
+    if kwargs.get('inputs_embeds') is not None:
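
For context on how these statistics can be gathered end to end, here is a minimal sketch of a prefill/decode timing and peak-memory collector built on PyTorch forward hooks. The hook and attribute names are illustrative assumptions, not necessarily those used in this PR.

import time
import torch

def start_time_hook(module, args, kwargs):
    torch.cuda.synchronize()
    module.start_time = time.time()

def end_time_hook(module, args, kwargs, outputs):
    torch.cuda.synchronize()
    elapsed = time.time() - module.start_time
    # A multimodal prefill step passes inputs_embeds; decode steps pass token ids only.
    if kwargs.get('inputs_embeds') is not None:
        module.prefill_time = getattr(module, 'prefill_time', 0.0) + elapsed
        module.prefill_count = getattr(module, 'prefill_count', 0) + 1
    else:
        module.decode_time = getattr(module, 'decode_time', 0.0) + elapsed
        module.decode_count = getattr(module, 'decode_count', 0) + 1

# model.register_forward_pre_hook(start_time_hook, with_kwargs=True)
# model.register_forward_hook(end_time_hook, with_kwargs=True)
# peak_mem_gb = torch.cuda.max_memory_allocated() / 1024 ** 3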

@gushiqiao merged commit d6e0724 into main on Jul 3, 2025
2 checks passed
@gushiqiao deleted the vlm branch on July 3, 2025 at 08:47