Fix KV cache calibration for attention modules not named self_attn #2477
changjonathanc wants to merge 3 commits into vllm-project:main
Conversation
Summary of Changes: Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request resolves a critical issue in KV cache calibration where attention modules not explicitly named self_attn were missed during calibration.
Code Review
This pull request addresses an issue where KV cache calibration was not correctly applied to attention modules unless they were named self_attn. The proposed solution adds a fallback pass in both start_calibration and end_calibration to identify any missed attention modules using the name-agnostic is_attention_module function. This ensures all relevant modules are properly calibrated and frozen. The changes are well-reasoned and are accompanied by a targeted unit test that validates the fix. My review includes one suggestion to refactor a small amount of duplicated code for better maintainability.
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
compressed_tensors' _apply_kv_cache_scheme uses is_attention_module() to set quantization_scheme on attention modules (name-agnostic), but start_calibration only iterates modules matching resolved_targets, which includes "re:.*self_attn$". This regex misses attention modules with different names (e.g. "attention", "self_attention"), leaving their KV cache observers uninitialized and their scales as garbage values.

Add a fallback pass in start_calibration and end_calibration that uses is_attention_module() to catch any attention modules missed by the regex-based target matching. This closes the gap noted in the existing TODO: "decouple reliance on this regex for matching attention".

Signed-off-by: Jonathan Chang <changjonathanc@users.noreply.github.com>
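The fallback pass this commit describes can be sketched in a few lines. Everything below is a simplified stand-in: the classes, the `start_calibration` signature, and the `is_attention_module` heuristic are illustrative, not the real llm-compressor/compressed-tensors code.

```python
import re

def is_attention_module(module) -> bool:
    # Name-agnostic check in the spirit of compressed-tensors: look at the
    # module's *class* name, not its path in the model tree.
    return "attention" in type(module).__name__.lower()

class Attention:  # stand-in for e.g. LlamaAttention
    pass

class MLP:
    pass

def start_calibration(named_modules, resolved_targets, kv_cache_scheme):
    calibrated = []
    # Pass 1: existing behavior, regex-based matching on module *paths*.
    for path, module in named_modules:
        if any(re.fullmatch(t.removeprefix("re:"), path) for t in resolved_targets):
            calibrated.append(path)
    # Pass 2 (the fix): name-agnostic fallback, gated on kv_cache_scheme
    # so there is no cost when KV cache quantization is off.
    if kv_cache_scheme is not None:
        for path, module in named_modules:
            if path not in calibrated and is_attention_module(module):
                calibrated.append(path)
    return calibrated

modules = [("layers.0.attention", Attention()), ("layers.0.mlp", MLP())]
print(start_calibration(modules, {"re:.*self_attn$"}, kv_cache_scheme="fp8"))
# -> ['layers.0.attention']; the regex alone would have matched nothing
```

With `kv_cache_scheme=None` the second pass is skipped entirely, matching the "zero cost" claim in the PR summary.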
Force-pushed from d27cc26 to e534e49.
Hi @changjonathanc, can you point me to where you're seeing this? These lines in compressed-tensors? Is it just a matter of updating the targets list to include
(llm-compressor/src/llmcompressor/modifiers/quantization/quantization/mixin.py, lines 204 to 206 at 370c04c)
Hi @changjonathanc, can't we just update the regex to re:.*(self_attn|attention)$?
Hi @brian-dellabetta, honestly I'm not 100% sure, so here is what Claude told me, and I think it makes sense:

When you do KV cache quantization, the library needs to: (1) find the attention modules, (2) attach a quantization_scheme to them, and (3) initialize calibration observers on them. The bug is that steps 1-2 and step 3 use different methods to find the attention modules, so they don't always agree on which modules to process.

Steps 1 & 2: how modules get their scheme attached. This happens in compressed-tensors, in _apply_kv_cache_scheme, which keys on is_attention_module(). So a module whose Python class is named LlamaAttention passes this check, regardless of what it's called inside the model tree (self_attn, attention, mha, whatever). After this step, any attention module has quantization_scheme set on it.

Step 3: how calibration observers get initialized. This happens in llm-compressor, in start_calibration. match_named_modules uses self.resolved_targets, which contains "re:.*self_attn$". This regex matches module paths like model.layers.0.self_attn; it checks the name you'd use to look up the module, not the class name.

The bug: imagine a model where the attention module is stored at path layers.0.attention (not self_attn), but the class is still LlamaAttention. Steps 1-2 attach a scheme to it, but step 3 never initializes its observers.

What the PR does: it adds a fallback in start_calibration. After the regex loop, do a second pass using is_attention_module(), the exact same check that _apply_kv_cache_scheme uses, to catch anything the regex missed. Now both sides use is_attention_module() as the source of truth, so they always agree.
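A minimal illustration of that disagreement (the helper names below are illustrative stand-ins, not the real library code):

```python
import re

class LlamaAttention:  # the *class* name contains "attention"
    pass

def scheme_check(module) -> bool:
    # roughly what compressed-tensors' _apply_kv_cache_scheme keys on:
    # the module's class name
    return "attention" in type(module).__name__.lower()

def target_check(path: str) -> bool:
    # roughly what start_calibration keys on: the module's *path*
    return re.fullmatch(r".*self_attn", path) is not None

module, path = LlamaAttention(), "model.layers.0.attention"
print(scheme_check(module), target_check(path))
# -> True False: scheme gets attached, but observers are never initialized
```

The same module passes one check and fails the other, which is exactly the gap the fallback closes.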
@changjonathanc rather than re-applying after the fact, we just need to update the regex used to match on both self_attn and attention. Unfortunately there's no guaranteed way to determine if a module is an attention module, and we have to rely on these fuzzy matches, so we'll have to update them as naming conventions drift. Can you try that one-line change to the regex, rather than the changes you have here?
Hi @brian-dellabetta, I think your suggestion passes the test, but the confusion still exists, where one part of the code matches by the class name of a module while the other matches the path. wdyt?
Hi @changjonathanc, these are two separate checks: (1) the is_attention_module check on the module's class name in compressed-tensors, and (2) the regex targets match on the module's path in llm-compressor. In case (1) we will have to update the conditional logic to capture any drift in module class naming conventions, and in case (2) we update the targets regex as path naming conventions drift.
Hi @brian-dellabetta, that makes sense. But since compressed-tensors already decided which modules get quantization_scheme, couldn't we reuse that same decision instead of re-matching by path?
Thanks @changjonathanc for the detail, good to know. I almost replied that a better solution would be to try to consolidate the logic, but that might be better as an RFC. I will raise this issue with the team next week and discuss internally |
brian-dellabetta left a comment:
Hi @changjonathanc, I was able to discuss this internally today. The is_attention_module check is needed because compressed-tensors does not have access to targets when applying the kv cache scheme (see here). The kv cache scheme must be applied model-wide based on our data model; QuantizationScheme has no notion of targets. To avoid any breaking changes to that API, I suggest we:
- patch your use case by updating targets, and
- update is_attention_module in compressed-tensors, if necessary, if it's not catching your use case.
will ping over vllm slack in case it's easier to discuss over slack
```python
if self.resolved_config.kv_cache_scheme is not None:
    # TODO: decouple reliance on this regex for matching attention
    # TODO: also apply is_attention_module() fallback in initialize_quantization
    targets.add("re:.*self_attn$")
```
Suggested change:

```diff
-    targets.add("re:.*self_attn$")
+    targets.add("re:.*(self_attn|attention)$")
```
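A quick sanity check of the suggested pattern against a few module paths (the paths are made up for illustration):

```python
import re

old = re.compile(r".*self_attn$")
new = re.compile(r".*(self_attn|attention)$")

for path in ["model.layers.0.self_attn",
             "model.layers.0.attention",
             "model.layers.0.self_attention",
             "model.layers.0.mlp"]:
    print(f"{path}: old={bool(old.match(path))} new={bool(new.match(path))}")
# old matches only the self_attn path; new matches the first three
```

Note that the alternation also happens to catch self_attention, since that path ends in "attention".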
```python
self._start_calibrating_module(module)

# Fallback: catch attention modules missed by the "re:.*self_attn$" regex.
if self.resolved_config.kv_cache_scheme is not None:
```
with the update to targets, ideally we can remove this fallback
```python
freeze_module_quantization(module)  # remove observers

# Also freeze attention modules missed by the regex fallback in start_calibration.
if self.resolved_config.kv_cache_scheme is not None:
```
with update to targets, ideally we can remove this fallback
```python
if self.resolved_config.kv_cache_scheme is not None:
    for _, module in model.named_modules():
        if (
            is_attention_module(module)
```
note if we need to update is_attention_module we can do that in a compressed-tensors PR
SUMMARY:
_apply_kv_cache_scheme (in compressed-tensors) discovers attention modules via is_attention_module(), which is name-agnostic. However, start_calibration only iterates modules matching resolved_targets, which includes "re:.*self_attn$" for the KV cache. This regex misses attention modules with different names (e.g. "attention", "self_attention"), leaving their observers uninitialized and KV cache scales as garbage values.

Add a fallback pass in start_calibration and end_calibration that uses is_attention_module() to catch any attention modules missed by the regex. The pass is gated by kv_cache_scheme is not None, so there is zero cost when KV cache quantization is not used. This addresses the existing TODO: "decouple reliance on this regex for matching attention".

TEST PLAN:
Added a unit test with a model whose attention module is named attention (not self_attn). It verifies observers are initialized, hooks are registered, and modules are frozen correctly.
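The test plan could look roughly like the following pure-Python sketch. The real test presumably builds a small torch model; the class and function names here are hypothetical stand-ins for the llm-compressor internals.

```python
import re

def is_attention_module(m) -> bool:
    # hypothetical stand-in for the compressed-tensors class-name check
    return "attention" in type(m).__name__.lower()

class TinyAttention:
    def __init__(self):
        self.observer_initialized = False
        self.frozen = False

def _matches(path, targets):
    return any(re.fullmatch(t.removeprefix("re:"), path) for t in targets)

def start_calibration(named_modules, targets, kv_cache_scheme):
    for path, m in named_modules:
        if _matches(path, targets) or (kv_cache_scheme and is_attention_module(m)):
            m.observer_initialized = True  # stand-in for attaching observers/hooks

def end_calibration(named_modules, targets, kv_cache_scheme):
    for path, m in named_modules:
        if _matches(path, targets) or (kv_cache_scheme and is_attention_module(m)):
            m.frozen = True  # stand-in for freeze_module_quantization

# the attention module lives at a path the regex does NOT match
model = [("layers.0.attention", TinyAttention())]
start_calibration(model, {"re:.*self_attn$"}, "fp8")
end_calibration(model, {"re:.*self_attn$"}, "fp8")
attn = model[0][1]
assert attn.observer_initialized and attn.frozen
print("ok")  # -> ok
```

Without the fallback branch, both assertions would fail for this module, which is exactly the regression the test guards against.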