Issue
The config.json in the Step-Audio-EditX model is missing the tie_word_embeddings configuration key. This causes transformers 4.54+ to incorrectly tie the lm_head and embed_tokens weights together, even though they have different values in the checkpoint.
Root Cause
- config.json does not contain "tie_word_embeddings"
- transformers 4.54+ defaults to tie_word_embeddings=True when this key is missing
- The model checkpoint has separate weights for lm_head and embed_tokens (different norms)
- Tying overwrites the correct lm_head weights with embed_tokens weights
- This causes the model to generate text tokens instead of audio tokens
- Result: Silent/gibberish audio generation and generation ignoring max_new_tokens
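To confirm that a checkpoint really stores lm_head and embed_tokens as separate tensors, the safetensors header can be inspected without loading any weights. The following is a minimal sketch, not part of the original report: it assumes a single-file .safetensors checkpoint, and the tensor names `lm_head.weight` and `model.embed_tokens.weight` are assumptions that may differ for this model. Both tensors being present only shows they are stored separately; comparing their values would still require loading them.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensors."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header length as a little-endian uint64.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))


def stores_separate_lm_head(
    path,
    head_key="lm_head.weight",  # assumed tensor name
    embed_key="model.embed_tokens.weight",  # assumed tensor name
):
    """True if the checkpoint file contains both tensors as distinct entries."""
    header = read_safetensors_header(path)
    return head_key in header and embed_key in header
```

If this returns True while config.json lacks the key, the checkpoint layout and the loader's default disagree, which is the situation described above.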
Solution
Add the following line to config.json:
"tie_word_embeddings": false
This tells transformers to keep the weights separate, which matches your checkpoint structure.
Impact
This affects all users of Step-Audio-EditX with transformers 4.54+. Users have to implement workarounds to restore weights after model loading.
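Until the published config.json is updated, the change from the Solution section can be applied to a local copy of the model. The sketch below is one possible way to do that with the Python standard library; the helper name `set_untied_embeddings` is hypothetical, and the path to the local config.json is whatever applies to your setup.

```python
import json
from pathlib import Path


def set_untied_embeddings(config_path):
    """Ensure "tie_word_embeddings": false in a local config.json.

    Returns True if the file was modified, False if it was already correct.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    # A missing key or a true value both need fixing.
    changed = config.get("tie_word_embeddings") is not False
    if changed:
        config["tie_word_embeddings"] = False
        path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return changed
```

Patching the file before loading avoids having to restore lm_head weights after the model has been constructed.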
References
- transformers PR addressing similar issues: "Fix: Don't tie weights when checkpoint has different values" (huggingface/transformers#42612)
- Workaround implemented in TTS-Audio-Suite: "‘Step Audio EditX Engine’ does not generate according to the actual ‘input tokens’, but instead follows ‘max_new_tokens’." (diodiogod/TTS-Audio-Suite#202)