fix: use eager attention for SDPA compatibility with transformers >=4.36 #398

Open

majiayu000 wants to merge 2 commits into resemble-ai:master from majiayu000:fix/sdpa-attention-compatibility

Conversation

@majiayu000

Summary

  • Set attn_implementation='eager' when creating LlamaConfig
  • Ensures output_attentions=True works correctly

Problem

Fixes #339

In transformers >=4.36, SDPA became the default attention implementation. However, SDPA doesn't support output_attentions=True, which Chatterbox uses during inference with voice references.

This causes:

ValueError: The `output_attentions` attribute is not supported when using the `attn_implementation` set to sdpa.

Solution

Explicitly set attn_implementation='eager' in LlamaConfig. Eager attention fully supports all features including output_attentions.
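For reference, a minimal sketch of what the change amounts to. The config values below are placeholders rather than Chatterbox's actual settings; attn_implementation is a standard kwarg accepted by LlamaConfig in transformers >=4.36:

import torch
from transformers import LlamaConfig, LlamaModel

# Placeholder dimensions; Chatterbox's real T3 config differs.
cfg = LlamaConfig(
    hidden_size=64,
    intermediate_size=128,
    num_hidden_layers=2,
    num_attention_heads=4,
    attn_implementation="eager",  # the fix: don't fall back to the SDPA default
)
model = LlamaModel(cfg)

# Requesting attention weights now works, since eager attention can return them.
ids = torch.randint(0, cfg.vocab_size, (1, 8))
out = model(ids, output_attentions=True)
print(len(out.attentions), out.attentions[0].shape)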

Impact

  • Voice cloning and voice conversion features now work with modern transformers versions
  • No performance regression for typical use cases, since inference already requests output_attentions=True

Fixes resemble-ai#339

Set attn_implementation='eager' when creating LlamaConfig to avoid
ValueError when using output_attentions=True with transformers >=4.36.

SDPA (Scaled Dot Product Attention) became the default in transformers
>=4.36 but doesn't support output_attentions=True. Using eager attention
ensures compatibility with all features including voice cloning.

Co-Authored-By: Claude <noreply@anthropic.com>
@George0828Zhang

Looking at this comment:

# using it for all layers slows things down too much. We can apply it to just one layer

It is likely intended for a single layer. Is it possible to apply eager only to that layer, and keep the fast SDPA kernel on the other layers?

Instead of globally disabling SDPA optimization for the entire model,
this change applies eager attention only to the specific layers (9, 12, 13)
that need attention weights for alignment analysis.

Changes:
- Remove global attn_implementation='eager' from LlamaConfig
- Modify _add_attention_spy to wrap forward method of specific layers
- Temporarily switch to eager attention only during those layers' forward pass
- Restore original attn_implementation after each layer completes

This preserves SDPA performance benefits for all other layers while
still supporting output_attentions=True for alignment stream analysis.

Addresses review feedback from @George0828Zhang

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@majiayu000
Author

@George0828Zhang Thanks for the great suggestion!

I've updated the implementation to apply eager attention only to the specific layers that need output_attentions=True.

Key changes:

  • Removed global attn_implementation='eager' from LlamaConfig
  • Modified _add_attention_spy() to wrap the forward method of layers 9, 12, 13 (based on LLAMA_ALIGNED_HEADS)
  • Each wrapped layer temporarily switches to eager attention, calls forward with output_attentions=True, then restores the original implementation

This preserves SDPA performance for all other layers while supporting alignment stream analysis.
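Roughly, the wrapper works like the sketch below. The name _add_attention_spy and the layer indices come from this PR, but the attribute paths (model.layers, layer.self_attn.config) and the runtime toggling of config._attn_implementation are illustrative assumptions about how the hook is wired, not the exact Chatterbox code:

ALIGNED_LAYERS = (9, 12, 13)  # layers whose attention maps feed alignment analysis

def _add_attention_spy(model, layer_indices=ALIGNED_LAYERS):
    for idx in layer_indices:
        layer = model.layers[idx]
        orig_forward = layer.forward

        def spy_forward(*args, _orig=orig_forward, _layer=layer, **kwargs):
            cfg = _layer.self_attn.config
            prev = cfg._attn_implementation
            cfg._attn_implementation = "eager"  # eager can return attention weights
            kwargs["output_attentions"] = True  # ask this layer for its weights
            try:
                return _orig(*args, **kwargs)
            finally:
                cfg._attn_implementation = prev  # restore SDPA for subsequent calls

        layer.forward = spy_forward

The default arguments (_orig, _layer) pin each wrapper to its own layer, and the try/finally restores the original implementation even if the layer's forward raises.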

Please take a look and let me know if this addresses your concern!

@George0828Zhang

Thanks. I installed your commit and it seemed to work fine.

@majiayu000
Author

Happy to help!

