add stat and fix bugs #399
Conversation
Summary of Changes
Hello @chengtao-lv, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the llmc framework by integrating advanced performance optimizations and detailed evaluation capabilities. It introduces a sophisticated token reduction technique for Qwen2 models to improve inference efficiency, complemented by a robust system for collecting and reporting critical runtime statistics like memory consumption and processing times during evaluation.
Highlights
- New Feature: Token Reduction (HOLITOM): Implemented a novel HOLITOM-based token reduction strategy specifically for Qwen2 models. The strategy prunes image tokens based on their attention scores at a configurable layer (HOLITOM_k) and merges the hidden states of less important tokens into more salient ones, aiming to reduce sequence length and improve efficiency (see the illustrative sketch after this list).
- New Feature: Performance Statistics Collection: Introduced a comprehensive system to collect and report key performance metrics during model evaluation, including average prefill time, average decode time, and peak CUDA memory usage, providing valuable insights into model efficiency.
- Bug Fix & Configuration Improvement: Resolved issues in the configuration management of the token reduction parameters (HOLITOM_k, HOLITOM_r) by switching from environment variables (os.getenv) to instance-specific self.pruning_paras. This also includes a fix to ensure these parameters are correctly propagated to deeply nested model structures.
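
As a reader's aid, here is a minimal, hypothetical sketch of the prune-and-merge idea described in the first highlight. It is not the PR's actual implementation: the function name, the keep_ratio parameter, and the batch-averaged attention scoring are assumptions for illustration. In the PR itself, HOLITOM_k selects the pruning layer (the attention scores come from layer index HOLITOM_k - 1, as the review snippets below show) and HOLITOM_r presumably controls how many image tokens are kept.

```python
import torch
import torch.nn.functional as F

def holitom_prune_and_merge(hidden_states, attn_scores, image_token_idx, keep_ratio=0.5):
    """Illustrative only: keep the image tokens that receive the most attention
    and fold each dropped token into its most similar kept token."""
    # hidden_states: (B, T, D); attn_scores: (B, T) attention received per token;
    # image_token_idx: 1-D LongTensor with the positions of the image tokens.
    B, T, _ = hidden_states.shape
    n_keep = max(1, int(keep_ratio * image_token_idx.numel()))

    # Rank image tokens by attention score (batch-averaged here for simplicity).
    order = attn_scores[:, image_token_idx].mean(dim=0).argsort(descending=True)
    topk_idx = image_token_idx[order[:n_keep]]          # image tokens to keep
    non_topk_idx = image_token_idx[order[n_keep:]]      # image tokens to drop

    for b in range(B):
        kept = hidden_states[b, topk_idx]                # (n_keep, D)
        dropped = hidden_states[b, non_topk_idx]         # (n_drop, D)
        # Most similar kept token for each dropped token (cosine similarity).
        sim = F.cosine_similarity(dropped.unsqueeze(1), kept.unsqueeze(0), dim=-1)
        target = sim.argmax(dim=1)
        for i, src in enumerate(non_topk_idx):
            dst = topk_idx[target[i]]
            hidden_states[b, dst] = (hidden_states[b, dst] + hidden_states[b, src]) / 2

    keep_mask = torch.ones(T, dtype=torch.bool, device=hidden_states.device)
    keep_mask[non_topk_idx] = False
    return hidden_states[:, keep_mask]                   # shortened sequence
```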
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution. ↩
Code Review
This pull request introduces performance statistics collection and includes bug fixes. The changes in holitom.py introduce several high- and critical-severity issues, including potential crashes, incorrect logic, and inefficient code. Please prioritize fixing the identified issues to ensure the code is correct and robust.
```python
prefill = (
    llmc_model.vlm_model.prefill_time / llmc_model.vlm_model.prefill_count
)
decode = (
    llmc_model.vlm_model.decode_time / llmc_model.vlm_model.decode_count
)
```
The calculation of prefill and decode average times can lead to a ZeroDivisionError if prefill_count or decode_count is zero. Add checks to prevent division by zero.
Suggested change:

```diff
 prefill = (
-    llmc_model.vlm_model.prefill_time / llmc_model.vlm_model.prefill_count
+    (llmc_model.vlm_model.prefill_time / llmc_model.vlm_model.prefill_count)
+    if llmc_model.vlm_model.prefill_count > 0 else 0.0
 )
 decode = (
-    llmc_model.vlm_model.decode_time / llmc_model.vlm_model.decode_count
+    (llmc_model.vlm_model.decode_time / llmc_model.vlm_model.decode_count)
+    if llmc_model.vlm_model.decode_count > 0 else 0.0
 )
```
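
For reference, a hedged sketch of how the reported statistics might be assembled in one place with the zero-count guard applied. The stats_report helper name and the use of torch.cuda.max_memory_allocated for the peak-memory figure are illustrative assumptions, not code from this PR.

```python
import torch

def stats_report(vlm_model):
    """Hypothetical helper: average prefill/decode times and peak CUDA memory.
    Guards against zero counts so an empty evaluation does not crash."""
    prefill = (
        vlm_model.prefill_time / vlm_model.prefill_count
        if getattr(vlm_model, 'prefill_count', 0) > 0 else 0.0
    )
    decode = (
        vlm_model.decode_time / vlm_model.decode_count
        if getattr(vlm_model, 'decode_count', 0) > 0 else 0.0
    )
    peak_mem_gb = (
        torch.cuda.max_memory_allocated() / (1024 ** 3)
        if torch.cuda.is_available() else 0.0
    )
    return {
        'avg_prefill_time_s': prefill,
        'avg_decode_time_s': decode,
        'peak_cuda_memory_gb': peak_mem_gb,
    }
```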
```python
def add_sparse_config(self):
    special_config = self.config.get('special', {})
    self.model.model.pruning_paras = special_config
    self.model.model.model.pruning_paras = special_config
```
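
For context on why the parameters are assigned at two nesting levels above, here is a hedged sketch of a generic alternative; the attach_pruning_paras helper and the depth-limited attribute walk are illustrative assumptions, not code from this PR.

```python
def attach_pruning_paras(root, paras, max_depth=4):
    """Hypothetical helper (not in this PR): walk the chain of nested `.model`
    wrappers (e.g. VLM wrapper -> language model -> decoder) and attach the
    pruning parameters at each level, so deeply nested modules can read them
    without falling back to environment variables."""
    node, depth = root, 0
    while node is not None and depth < max_depth:
        node.pruning_paras = paras
        node = getattr(node, 'model', None)
        depth += 1
```

With such a helper, add_sparse_config could reduce to a single attach_pruning_paras(self.model, special_config) call.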
```python
for b in range(hidden_states.size(0)):
    for i in range(len(non_topk_indices)):
        non_topk_idx = non_topk_indices[i]
        most_similar_topk_idx = (
            top_attention_rank_index[
                sim_max_index[b, i]
            ]
        )
        hidden_states[b, most_similar_topk_idx, :] = (
            hidden_states[b, most_similar_topk_idx, :]
            + hidden_states[b, non_topk_idx, :]
        ) / 2
```
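
The review summary mentions inefficient code, and a plausible target is the nested Python loop above. Below is a hedged, vectorized sketch of the same merge using index_add_. It assumes the same tensors as the snippet above (non_topk_indices, top_attention_rank_index, sim_max_index), and its averaging differs slightly when several dropped tokens share a target, so it is an illustration rather than a drop-in replacement.

```python
import torch

# Vectorized sketch of the merge: only the inner per-token Python loop is removed.
# Note: when several dropped tokens map to the same kept token this averages them
# jointly, while the sequential loop averages them one pair at a time.
for b in range(hidden_states.size(0)):
    targets = top_attention_rank_index[sim_max_index[b]]   # (n_drop,) kept-token positions
    summed = torch.zeros_like(hidden_states[b])            # (T, D)
    counts = torch.zeros(hidden_states.size(1),
                         dtype=hidden_states.dtype, device=hidden_states.device)
    summed.index_add_(0, targets, hidden_states[b, non_topk_indices])
    counts.index_add_(0, targets, torch.ones_like(targets, dtype=hidden_states.dtype))
    merged = (hidden_states[b] + summed) / (1 + counts).unsqueeze(-1)
    hidden_states[b] = torch.where((counts > 0).unsqueeze(-1), merged, hidden_states[b])
```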
```python
if layer_idx == HOLITOM_k - 1:
    output_attentions = True
else:
    output_attentions = False

layer_outputs = decoder_layer(
    hidden_states,
    attention_mask=causal_mask,
    position_ids=position_ids,
    past_key_value=past_key_values,
    output_attentions=output_attentions,
    use_cache=use_cache,
    cache_position=cache_position,
    position_embeddings=position_embeddings,
    **flash_attn_kwargs,
)

hidden_states = layer_outputs[0]

# if output_attentions:
#     all_self_attns += (layer_outputs[1],)
```
The logic for handling output_attentions incorrectly overwrites the user-provided output_attentions value in each iteration. This prevents users from getting attention outputs for all layers when they request them.
Suggested change:

```diff
-if layer_idx == HOLITOM_k - 1:
-    output_attentions = True
-else:
-    output_attentions = False
+# Determine if we need attentions for this layer, either for pruning
+# or because the user requested them.
+_get_attentions = output_attentions or (layer_idx == HOLITOM_k - 1)
 layer_outputs = decoder_layer(
     hidden_states,
     attention_mask=causal_mask,
     position_ids=position_ids,
     past_key_value=past_key_values,
-    output_attentions=output_attentions,
+    output_attentions=_get_attentions,
     use_cache=use_cache,
     cache_position=cache_position,
     position_embeddings=position_embeddings,
     **flash_attn_kwargs,
 )
 hidden_states = layer_outputs[0]
-# if output_attentions:
-#     all_self_attns += (layer_outputs[1],)
+if output_attentions:
+    all_self_attns += (layer_outputs[1],)
```
```python
def end_time_hook(module, inputs, kwargs, layer_outputs):
    torch.cuda.synchronize()
    elapsed_prefill = time.time() - module.start_time
    if kwargs['inputs_embeds'] is not None:
```
No description provided.
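
For context on how a hook like end_time_hook above could be wired up, here is a hedged sketch using PyTorch forward hooks. The register_timing_hooks name, the start-time pre-hook, the attribute names, and the prefill/decode heuristic based on input_ids length are assumptions for illustration; the PR's hook keys off inputs_embeds instead.

```python
import time
import torch

def register_timing_hooks(language_model):
    """Hypothetical sketch: measure prefill vs. decode time around the language
    model's forward pass using pre/post forward hooks. with_kwargs=True matches
    the (module, inputs, kwargs, outputs) signature seen in end_time_hook."""
    language_model.prefill_time = 0.0
    language_model.prefill_count = 0
    language_model.decode_time = 0.0
    language_model.decode_count = 0

    def start_time_hook(module, args, kwargs):
        torch.cuda.synchronize()
        module.start_time = time.time()

    def end_time_hook(module, args, kwargs, outputs):
        torch.cuda.synchronize()
        elapsed = time.time() - module.start_time
        # Heuristic (assumption): a multi-token input is the prefill step,
        # a single-token input is a decode step.
        ids = kwargs.get('input_ids')
        is_prefill = ids is not None and ids.shape[1] > 1
        if is_prefill:
            module.prefill_time += elapsed
            module.prefill_count += 1
        else:
            module.decode_time += elapsed
            module.decode_count += 1

    language_model.register_forward_pre_hook(start_time_hook, with_kwargs=True)
    language_model.register_forward_hook(end_time_hook, with_kwargs=True)
```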