Update llama.py - Fix embedding generation error #5
…in Gemma3ChatHandler
Replace llama_kv_cache_clear -> llama_kv_self_clear
Pull Request Overview
This PR fixes an embedding generation error by replacing calls to llama_kv_cache_clear with llama_kv_self_clear.
- Calls to llama_kv_cache_clear have been updated to llama_kv_self_clear in the embedding flow.
- Ensures the context cache is properly cleared before and after decoding batches.
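The clear-before-and-after pattern described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `resolve_kv_clear`, `embed_batches`, the `bindings` object, and the `decode` callable are hypothetical names, not code from llama.py, and the fallback between the two function names is an assumption about how a caller might cope with either name being available.

```python
from typing import Callable, List


def resolve_kv_clear(bindings) -> Callable:
    """Return whichever cache-clear function the (hypothetical) bindings object exposes."""
    for name in ("llama_kv_self_clear", "llama_kv_cache_clear"):
        fn = getattr(bindings, name, None)
        if callable(fn):
            return fn
    raise AttributeError("no KV cache clear function found on bindings")


def embed_batches(bindings, ctx, batches: List[list], decode: Callable) -> list:
    """Decode each batch, clearing the context cache before and after each decode."""
    clear = resolve_kv_clear(bindings)
    results = []
    for batch in batches:
        clear(ctx)  # start from an empty cache
        results.append(decode(ctx, batch))
        clear(ctx)  # leave no state behind for the next batch
    return results
```

Resolving the function by name keeps the call site unchanged if the bindings switch between the two names again.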
Comments suppressed due to low confidence (1)
llama_cpp/llama.py:982
- Add or update unit tests for the embed function to verify that embeddings are generated correctly with llama_kv_self_clear and that the cache is fully cleared before and after decoding.
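One possible shape for such a test, as a sketch only: `embed_once` is a hypothetical stand-in for the embed flow (clear, decode, clear), and the mocks replace the real clear and decode functions, so nothing here exercises the actual library.

```python
import unittest
from unittest import mock


def embed_once(clear, decode, ctx, tokens):
    """Hypothetical stand-in for the embed flow: clear, decode, clear."""
    clear(ctx)
    result = decode(ctx, tokens)
    clear(ctx)
    return result


class EmbedClearsCacheTest(unittest.TestCase):
    def test_cache_cleared_before_and_after_decode(self):
        # A parent mock records calls to its children in order.
        parent = mock.Mock()
        parent.decode.return_value = [0.1, 0.2]
        ctx = object()

        out = embed_once(parent.clear, parent.decode, ctx, [1, 2, 3])

        self.assertEqual(out, [0.1, 0.2])
        self.assertEqual(
            parent.mock_calls,
            [
                mock.call.clear(ctx),
                mock.call.decode(ctx, [1, 2, 3]),
                mock.call.clear(ctx),
            ],
        )
```

A test against the real embed function would need a loaded model, which this sketch deliberately avoids.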
        # decode and fetch embeddings
        data: Union[List[List[float]], List[List[List[float]]]] = []

        def decode_batch(seq_sizes: List[int]):
Copilot AI, Jul 12, 2025
[nitpick] Consider adding a brief comment explaining why llama_kv_self_clear is used here instead of the previous llama_kv_cache_clear, to clarify the intended cache-clearing behavior for future maintainers.
        def decode_batch(seq_sizes: List[int]):
            # Clear the self-attention key-value cache to prepare for decoding the next batch.
            # `llama_kv_self_clear` is used here instead of `llama_kv_cache_clear` because it specifically
            # clears the cache for self-attention mechanisms, which is required for accurate embedding generation.
We can close this PR if
ef28569 to a096d51
Replace llama_kv_cache_clear -> llama_kv_self_clear.
Revert until the llama_kv_cache_clear function is fixed.