
Conversation

@chengtao-lv (Collaborator)

No description provided.

@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @chengtao-lv, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the llmc framework by integrating advanced performance optimizations and detailed evaluation capabilities. It introduces a sophisticated token reduction technique for Qwen2 models to improve inference efficiency, complemented by a robust system for collecting and reporting critical runtime statistics like memory consumption and processing times during evaluation.

Highlights

  • New Feature: Token Reduction (HOLITOM): Implemented a novel HOLITOM-based token reduction strategy specifically for Qwen2 models. This strategy prunes image tokens based on their attention scores at a configurable layer (HOLITOM_k) and merges the hidden states of less important tokens into more salient ones, aiming to reduce sequence length and improve efficiency (a minimal illustrative sketch follows this list).
  • New Feature: Performance Statistics Collection: Introduced a comprehensive system to collect and report key performance metrics during model evaluation. This includes tracking average prefill time, average decode time, and peak CUDA memory usage, providing valuable insights into model efficiency.
  • Bug Fix & Configuration Improvement: Resolved issues related to configuration management for token reduction parameters (HOLITOM_k, HOLITOM_r) by transitioning from environment variables (os.getenv) to instance-specific self.pruning_paras. This also includes a fix to ensure these parameters are correctly propagated to deeply nested model structures.
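
For orientation, here is a minimal, self-contained sketch of the attention-based prune-and-merge step described in the first highlight. Everything in it (the function name, tensor names and shapes, the batch-size-1 simplification, and the use of a plain group mean for merging) is an illustrative assumption rather than the PR's actual API; the real implementation lives in holitom.py.

import torch
import torch.nn.functional as F

# Illustrative sketch only (batch size 1):
#   attn:      (H, L, L) attention weights captured at layer HOLITOM_k
#   hidden:    (L, D) hidden states at that layer
#   image_pos: (n_img,) LongTensor with the positions of image tokens
#   r:         fraction of image tokens to prune (HOLITOM_r)
def prune_and_merge(attn, hidden, image_pos, r):
    # Score each image token by the attention it receives from the last token.
    scores = attn.mean(dim=0)[-1, image_pos]
    n_keep = max(1, int(image_pos.numel() * (1 - r)))
    keep = scores.topk(n_keep).indices
    drop = scores.topk(image_pos.numel() - n_keep, largest=False).indices

    kept_h, drop_h = hidden[image_pos[keep]], hidden[image_pos[drop]]
    # Assign each pruned token to its most similar kept token.
    target = F.cosine_similarity(drop_h.unsqueeze(1), kept_h.unsqueeze(0), dim=-1).argmax(dim=1)

    # Mean-merge pruned tokens into their targets, then remove them from the sequence.
    merged = kept_h.clone().index_add_(0, target, drop_h)
    counts = torch.ones(n_keep, device=hidden.device, dtype=hidden.dtype)
    counts.index_add_(0, target, torch.ones_like(target, dtype=hidden.dtype))
    hidden = hidden.clone()
    hidden[image_pos[keep]] = merged / counts.unsqueeze(1)

    keep_mask = torch.ones(hidden.size(0), dtype=torch.bool, device=hidden.device)
    keep_mask[image_pos[drop]] = False
    return hidden[keep_mask]
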
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature               Command              Description
Code Review           /gemini review       Performs a code review for the current pull request in its current state.
Pull Request Summary  /gemini summary      Provides a summary of the current pull request in its current state.
Comment               @gemini-code-assist  Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                  /gemini help         Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces performance statistics collection and includes bug fixes. The changes in holitom.py introduce several high- and critical-severity issues, including potential crashes, incorrect logic, and inefficient code. Please prioritize fixing the identified issues to ensure the code is correct and robust.

Comment on lines +245 to +250
prefill = (
    llmc_model.vlm_model.prefill_time / llmc_model.vlm_model.prefill_count
)
decode = (
    llmc_model.vlm_model.decode_time / llmc_model.vlm_model.decode_count
)

critical

The calculation of prefill and decode average times can lead to a ZeroDivisionError if prefill_count or decode_count is zero. Add checks to prevent division by zero.

Suggested change
-prefill = (
-    llmc_model.vlm_model.prefill_time / llmc_model.vlm_model.prefill_count
-)
-decode = (
-    llmc_model.vlm_model.decode_time / llmc_model.vlm_model.decode_count
-)
+prefill = (
+    (llmc_model.vlm_model.prefill_time / llmc_model.vlm_model.prefill_count)
+    if llmc_model.vlm_model.prefill_count > 0 else 0.0
+)
+decode = (
+    (llmc_model.vlm_model.decode_time / llmc_model.vlm_model.decode_count)
+    if llmc_model.vlm_model.decode_count > 0 else 0.0
+)

def add_sparse_config(self):
    special_config = self.config.get('special', {})
    self.model.model.pruning_paras = special_config
    self.model.model.model.pruning_paras = special_config

high

The line self.model.model.model.pruning_paras = special_config appears redundant, since self.model.model.pruning_paras is set on the preceding line. Verify whether this assignment is necessary and remove it if not.
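
If the deeper assignment is in fact intentional (the PR summary above mentions propagating these parameters to deeply nested model structures), a hedged alternative is to walk the nested .model chain instead of hard-coding the depth. This is only a sketch under that assumption, not the PR's code:

def add_sparse_config(self):
    special_config = self.config.get('special', {})
    # Attach pruning_paras to every nested submodel (e.g. the VLM wrapper and the
    # inner language model) without hard-coding how many .model levels exist.
    target = self.model
    while hasattr(target, 'model'):
        target = target.model
        target.pruning_paras = special_config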

Comment on lines +1410 to +1421
for b in range(hidden_states.size(0)):
    for i in range(len(non_topk_indices)):
        non_topk_idx = non_topk_indices[i]
        most_similar_topk_idx = (
            top_attention_rank_index[
                sim_max_index[b, i]
            ]
        )
        hidden_states[b, most_similar_topk_idx, :] = (
            hidden_states[b, most_similar_topk_idx, :]
            + hidden_states[b, non_topk_idx, :]
        ) / 2

high

The nested Python loops are inefficient for tensor operations on a GPU. Refactor this section to use vectorized PyTorch operations (e.g., torch.scatter_add_).
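
One possible vectorized form is sketched below. It assumes, based on the excerpt above, that hidden_states is (B, L, D), non_topk_indices and top_attention_rank_index are 1-D index tensors, and sim_max_index has shape (B, len(non_topk_indices)). Note that scatter_add_ computes a true mean over each merge group, which differs slightly from the loop's sequential pairwise averaging whenever several pruned tokens share the same target.

import torch

B, L, D = hidden_states.shape
# Sequence position each pruned token merges into, per batch element.
target_idx = top_attention_rank_index[sim_max_index]                 # (B, N)
src = hidden_states[:, non_topk_indices, :]                          # (B, N, D)

# Accumulate pruned-token states onto their targets, then divide by group size.
merged = hidden_states.clone()
merged.scatter_add_(1, target_idx.unsqueeze(-1).expand(-1, -1, D), src)
counts = torch.ones(B, L, device=hidden_states.device, dtype=hidden_states.dtype)
counts.scatter_add_(1, target_idx, torch.ones_like(target_idx, dtype=hidden_states.dtype))
hidden_states = merged / counts.unsqueeze(-1)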

Comment on lines +1460 to +1480
if layer_idx == HOLITOM_k - 1:
    output_attentions = True
else:
    output_attentions = False

layer_outputs = decoder_layer(
    hidden_states,
    attention_mask=causal_mask,
    position_ids=position_ids,
    past_key_value=past_key_values,
    output_attentions=output_attentions,
    use_cache=use_cache,
    cache_position=cache_position,
    position_embeddings=position_embeddings,
    **flash_attn_kwargs,
)

hidden_states = layer_outputs[0]

# if output_attentions:
#     all_self_attns += (layer_outputs[1],)

high

The logic for handling output_attentions incorrectly overwrites the user-provided output_attentions value in each iteration. This prevents users from getting attention outputs for all layers if they request it.

Suggested change
-if layer_idx == HOLITOM_k - 1:
-    output_attentions = True
-else:
-    output_attentions = False
-layer_outputs = decoder_layer(
-    hidden_states,
-    attention_mask=causal_mask,
-    position_ids=position_ids,
-    past_key_value=past_key_values,
-    output_attentions=output_attentions,
-    use_cache=use_cache,
-    cache_position=cache_position,
-    position_embeddings=position_embeddings,
-    **flash_attn_kwargs,
-)
-hidden_states = layer_outputs[0]
-# if output_attentions:
-#     all_self_attns += (layer_outputs[1],)
+# Determine if we need attentions for this layer, either for pruning or because the user requested them.
+_get_attentions = output_attentions or (layer_idx == HOLITOM_k - 1)
+layer_outputs = decoder_layer(
+    hidden_states,
+    attention_mask=causal_mask,
+    position_ids=position_ids,
+    past_key_value=past_key_values,
+    output_attentions=_get_attentions,
+    use_cache=use_cache,
+    cache_position=cache_position,
+    position_embeddings=position_embeddings,
+    **flash_attn_kwargs,
+)
+hidden_states = layer_outputs[0]
+if output_attentions:
+    all_self_attns += (layer_outputs[1],)

def end_time_hook(module, inputs, kwargs, layer_outputs):
    torch.cuda.synchronize()
    elapsed_prefill = time.time() - module.start_time
    if kwargs['inputs_embeds'] is not None:

high

Accessing kwargs['inputs_embeds'] directly can raise a KeyError if inputs_embeds is not present in the kwargs dictionary. Use kwargs.get('inputs_embeds') is not None to avoid a potential crash.

Suggested change
-    if kwargs['inputs_embeds'] is not None:
+    if kwargs.get('inputs_embeds') is not None:
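
For context on how these statistics can be gathered end to end, here is a minimal sketch of a prefill/decode timing and peak-memory collector built on PyTorch forward hooks. The hook and attribute names are illustrative assumptions, not necessarily those used in this PR.

import time
import torch

def start_time_hook(module, args, kwargs):
    torch.cuda.synchronize()
    module.start_time = time.time()

def end_time_hook(module, args, kwargs, outputs):
    torch.cuda.synchronize()
    elapsed = time.time() - module.start_time
    # A multimodal prefill step passes inputs_embeds; decode steps pass token ids only.
    if kwargs.get('inputs_embeds') is not None:
        module.prefill_time = getattr(module, 'prefill_time', 0.0) + elapsed
        module.prefill_count = getattr(module, 'prefill_count', 0) + 1
    else:
        module.decode_time = getattr(module, 'decode_time', 0.0) + elapsed
        module.decode_count = getattr(module, 'decode_count', 0) + 1

# model.register_forward_pre_hook(start_time_hook, with_kwargs=True)
# model.register_forward_hook(end_time_hook, with_kwargs=True)
# peak_mem_gb = torch.cuda.max_memory_allocated() / 1024 ** 3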

@gushiqiao merged commit d6e0724 into main on Jul 3, 2025
2 checks passed
@gushiqiao deleted the vlm branch on July 3, 2025 at 08:47