fix: use eager attention for SDPA compatibility with transformers >=4.36 #398
Open
majiayu000 wants to merge 2 commits into resemble-ai:master from
Conversation
Fixes resemble-ai#339

Set attn_implementation='eager' when creating LlamaConfig to avoid a ValueError when using output_attentions=True with transformers >=4.36. SDPA (Scaled Dot Product Attention) became the default in transformers >=4.36 but doesn't support output_attentions=True. Using eager attention ensures compatibility with all features, including voice cloning.

Co-Authored-By: Claude <noreply@anthropic.com>
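A minimal sketch of the idea, using placeholder model sizes rather than Chatterbox's actual T3 config:

```python
import torch
from transformers import LlamaConfig, LlamaModel

# Placeholder dimensions for illustration only; the real LlamaConfig in
# Chatterbox uses its own sizes.
config = LlamaConfig(
    vocab_size=1000,
    hidden_size=256,
    num_hidden_layers=4,
    num_attention_heads=4,
    attn_implementation="eager",  # force eager so attention weights can be returned
)
model = LlamaModel(config)

input_ids = torch.randint(0, 1000, (1, 8))
# With eager attention, output_attentions=True returns one attention map per layer,
# even on transformers >= 4.36 where SDPA is the default.
out = model(input_ids=input_ids, output_attentions=True)
print(len(out.attentions), out.attentions[0].shape)
```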
Looking at this comment, it is likely intended for a single layer. Is it possible to apply eager only to that layer and keep the fast SDPA kernel on the other layers?
Instead of globally disabling SDPA optimization for the entire model, this change applies eager attention only to the specific layers (9, 12, 13) that need attention weights for alignment analysis.

Changes:
- Remove global attn_implementation='eager' from LlamaConfig
- Modify _add_attention_spy to wrap the forward method of the specific layers
- Temporarily switch to eager attention only during those layers' forward pass
- Restore the original attn_implementation after each layer completes

This preserves SDPA performance benefits for all other layers while still supporting output_attentions=True for alignment stream analysis.

Addresses review feedback from @George0828Zhang

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
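A rough sketch of what such a layer-scoped wrapper could look like. The helper name, layer indices, and the idea of toggling the attention implementation around each wrapped forward call follow the commit message above; the exact code is an assumption, not the PR's diff, and how the toggle takes effect can vary across transformers versions:

```python
import functools

ALIGNMENT_LAYERS = (9, 12, 13)  # layers whose attention maps feed alignment analysis


def _add_attention_spy(model, layer_indices=ALIGNMENT_LAYERS):
    """Wrap selected decoder layers so only they run with eager attention."""
    for idx in layer_indices:
        layer = model.layers[idx]
        original_forward = layer.forward

        @functools.wraps(original_forward)
        def forward(*args, _orig=original_forward, **kwargs):
            previous = model.config._attn_implementation
            model.config._attn_implementation = "eager"  # eager can return attention weights
            try:
                kwargs["output_attentions"] = True  # ask this layer for its attention map
                return _orig(*args, **kwargs)
            finally:
                # restore the previous setting so every other layer keeps the fast SDPA kernel
                model.config._attn_implementation = previous

        layer.forward = forward
```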
Author
@George0828Zhang Thanks for the great suggestion! I've updated the implementation to apply eager attention only to the specific layers that need attention weights. Key change: eager attention is enabled just for the forward pass of layers 9, 12, and 13, and the original attn_implementation is restored once each layer completes.
This preserves SDPA performance for all other layers while supporting alignment stream analysis. Please take a look and let me know if this addresses your concern!
Thanks. I installed your commit and it seemed to work fine.
Author
Happy to help! |
Summary
- Set attn_implementation='eager' when creating LlamaConfig
- output_attentions=True works correctly

Problem
Fixes #339
In transformers >=4.36, SDPA became the default attention implementation. However, SDPA doesn't support output_attentions=True, which Chatterbox uses during inference with voice references. This causes a ValueError when attention weights are requested.
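A hedged illustration of the failure mode with a tiny placeholder config; the exact behavior depends on the transformers version (some versions fall back to eager with a warning, others raise, as reported in #339):

```python
import torch
from transformers import LlamaConfig, LlamaModel

# No attn_implementation given, so transformers >= 4.36 picks SDPA by default.
config = LlamaConfig(vocab_size=1000, hidden_size=256,
                     num_hidden_layers=2, num_attention_heads=4)
model = LlamaModel(config)

input_ids = torch.randint(0, 1000, (1, 8))
# SDPA never materializes the attention matrix, so asking for it is unsupported:
out = model(input_ids=input_ids, output_attentions=True)
```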
Solution
Explicitly set attn_implementation='eager' in LlamaConfig. Eager attention fully supports all features, including output_attentions.

Impact