fix: sliding window KV cache for Gemma-3 models (issue #2145) #2153
Conversation
Hi @Chelsi-create, thank you for the PR!
Hi @Chelsi-create, thanks very much for working on this! I am the original filer of the bug and I really appreciate you taking up the issue. I have several questions related to your testing and findings.

(1) It looks like you are focused on an issue with sliding window attention in Gemma, but in my original runlitgpt.py script (using LLM.generate), I also see the same problem of repetition when running Llama. Were you able to replicate the problem with Llama using my runlitgpt.py script? And if you were, did that problem somehow get fixed as a side effect of the fix you made for sliding window KV cache management in Gemma? I realize that Llama only uses global attention, so that's why I am confused how the problem with Llama was also apparently fixed (you mention Llama passing all three "long form" generation tests in your testing summary).

Another thing worth mentioning: the problem with Gemma (before your fix) occurs both when using LLM.generate (as in my runlitgpt.py script) and when running Gemma with litgpt chat (for the same prompts and parameter settings). With Llama, however, I only see the repetition problem when using LLM.generate and NOT when I use the same prompt and parameters in litgpt chat. This suggests there are two different bugs. Gemma probably fails in both cases because of the sliding window context issue you are addressing with this PR, but why would Llama fail when using LLM.generate (with very similar symptoms to the Gemma failure) and not fail when using litgpt chat? This is quite puzzling to me. Of course, my comments here assume you are able to reproduce what I just described: Llama shows the repetitive behavior (just like Gemma) before your fix when using LLM.generate() but not when using litgpt chat.

(2) You make a distinction between LLM.generate and chat_generate(), but your linked files runlitgpt.py and runlitgpt_v1.py both seem to use LLM.generate; the only difference seems to be the prompts. So I am a little confused about the difference between LLM.generate and chat_generate() and why this distinction matters in the context of this bug.

(3) You mention in your testing summary: "After fix: coherent outputs with limited repetition (7-9 tokens)." I'm somewhat concerned to hear that there is still any repetition at all. Can you share an example of the "limited repetition" you are seeing?

I'd prefer we not close the original issue until these questions are addressed. Please let me know if there is any way I can help in resolving them. I'd really like to use litgpt, but I'm uneasy about that until I understand the answers to the questions above. Thanks!
Hi @drwslacy47, thank you so much for the detailed and thoughtful questions. I have tried to address all your questions below: (1) Llama Behavior: Why it Failed with
fixes #2145
Overview
This PR addresses a critical issue affecting Gemma-3 models (4B-IT, 12B-IT, 27B-IT) that caused them to produce gibberish or repetitive text after approximately 800-1000 tokens of continuous long-form generation. The fix introduces correct sliding-window KV cache management for models using hybrid attention architectures.
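For context, the failure mode shows up when a single call generates well past the 1024-token sliding window. The reproduction sketch below uses the high-level Python API; the checkpoint id, prompt, and token budget are illustrative assumptions rather than the exact settings from the issue's runlitgpt.py script.

```python
# Illustrative repro sketch: generate past ~1000 tokens in one continuous call
# and inspect the tail of the output for repeated phrases. The checkpoint id,
# prompt, and token budget below are placeholders, not the original repro.
from litgpt import LLM

llm = LLM.load("google/gemma-3-4b-it")  # assumed Gemma-3 checkpoint id
prompt = "Write a detailed, multi-section essay on the history of computing."
text = llm.generate(prompt, max_new_tokens=1500)

# Before the fix, the tail of the output tends to degrade into repeated
# n-grams once generation exceeds the 1024-token sliding window.
print(text[-2000:])
```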
Key Fixes
1. Sliding Window KV Cache Limiting (`litgpt/model.py`, `build_kv_cache`): the cache for sliding-window layers is sized to `sliding_window_size` (1024 tokens) instead of the full sequence length.
2. Circular Buffer Implementation (`litgpt/model.py`, `KVCache` class); see the sketch after this list.
3. Attention Mask Dimension Fix (`litgpt/model.py`, `CausalSelfAttention.forward`).
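To make items 1 and 2 concrete, here is a minimal, self-contained sketch of a sliding-window KV cache backed by a circular buffer. It illustrates the general technique rather than reproducing the code added to `litgpt/model.py`; the class and method names are invented for this example.

```python
import torch


class SlidingWindowKVCache:
    """Illustrative sketch only: keep at most `window` past positions of K and V."""

    def __init__(self, batch, heads, window, head_dim, device=None, dtype=None):
        self.window = window
        self.k = torch.zeros(batch, heads, window, head_dim, device=device, dtype=dtype)
        self.v = torch.zeros(batch, heads, window, head_dim, device=device, dtype=dtype)
        self.length = 0  # total number of tokens seen so far

    def append(self, k_new, v_new):
        # k_new, v_new: (batch, heads, 1, head_dim) during single-token decoding.
        slot = self.length % self.window  # circular write position
        self.k[:, :, slot : slot + 1] = k_new
        self.v[:, :, slot : slot + 1] = v_new
        self.length += 1

        valid = min(self.length, self.window)  # number of real cached entries
        if self.length <= self.window:
            # Buffer not yet full: slots 0..valid-1 are already in temporal order.
            return self.k[:, :, :valid], self.v[:, :, :valid]
        # Buffer full: the oldest entry sits at the next write position, so
        # reorder the slots to restore temporal order for causal attention.
        start = self.length % self.window
        idx = (torch.arange(self.window, device=self.k.device) + start) % self.window
        return self.k[:, :, idx], self.v[:, :, idx]
```

Under this scheme the attention mask for a sliding-window layer only has to cover the `valid` cached positions (at most `window`) rather than the full sequence length, which is the kind of dimension mismatch item 3 addresses.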
Why This Matters
The Hidden Bug
This issue was difficult to detect in production.
Testing Summary:
- Gemma-3-4B-IT: all three long-form test prompts passed.
- Llama-3.2-3B: all three long-form test prompts passed.

Configuration:
- `chat_generate()` (low-level API)
- `max_new_tokens=1500`

Known Limitations
- `LLM.generate()` (high-level API) still requires integration work.
- The `chat_generate()` API is fully functional.
- Workaround: use `chat_generate()` directly (example in the full PR description).

Notes
- `litgpt/model.py` (~40 lines changed)
- `runlitgpt_v1.py`: uses the high-level API implementation. There are still some bugs in `litgpt/api.py`, which will be addressed in another PR.
- `runlitgpt.py`: uses the low-level API implementation and now works for both models.