Checks
- This template is only for bug reports, usage problems go with 'Help Wanted'.
- I have thoroughly reviewed the project documentation but couldn't find information to solve my problem.
- I have searched for existing issues, including closed ones, and couldn't find a solution.
- I am using English to submit this issue to facilitate community communication.
This question is already solved. I'll share the fix here to help others who face the same issue.
Your model is probably fine; the bug actually occurs in the inference part.
The code I provided has been tested on a Chinese dataset; whether it adapts to English still needs to be verified.
utils_infer_changed_version.py
Problem Description
While analyzing the F5-TTS preprocessing pipeline during model testing (the issue occurs in the "test model" tab of the GUI), I discovered a critical issue that degrades inference quality.
Current Behavior:
- When the reference audio exceeds 12 seconds, it gets truncated to the first 12 seconds or shorter
- However, the corresponding reference text remains completely unchanged
- This causes the model to learn an incorrect mapping:
short audio ↔ long text
This leads to:
- ❌ Attention mechanism confusion (the generated audio repeats phrases: e.g., input reference text "I am the Shore Keeper, your assistant" → generated audio "your assistant, I am the Shore Keeper, your assistant, Shore Keeper, assistant")
- ❌ Duration prediction distortion (the generated speech is extremely short and spoken very fast)
- ❌ Prosody learning errors (pauses are completely wrong, punctuation is entirely ignored)
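The mismatch above can even be detected before synthesis with a simple heuristic: if the reference text implies an implausibly high speaking rate for the (clipped) audio duration, the pair is broken. A minimal sketch, assuming a rough threshold of 25 characters per second (the function name and threshold are my own, not part of F5-TTS):

```python
def text_audio_ratio_ok(ref_text: str, audio_ms: int,
                        max_chars_per_sec: float = 25.0) -> bool:
    """Hypothetical sanity check: return False when the reference
    text is implausibly long for the clipped audio duration."""
    if audio_ms <= 0:
        return False
    chars_per_sec = len(ref_text) / (audio_ms / 1000.0)
    return chars_per_sec <= max_chars_per_sec
```

For the example above, 37 characters over 12 s of audio (~3 chars/s) passes, while the same 12 s paired with several hundred characters of untrimmed text would be flagged.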
Root Cause:
In the `preprocess_ref_audio_text()` function in `src/f5_tts/infer/utils_infer.py`:

```python
# Audio cutting:
aseg = aseg[:12000]  # or cutting based on silence
# But the text remains the same:
return ref_audio, ref_text  # ❌ no text cutting!
```

The fix is provided in the attached Python file. It adds a function that cuts the reference text correspondingly, so that the model references a well-matched audio-text pair.
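The core idea of the fix can be sketched as follows: when the audio is clipped, trim the reference text by the same ratio, backing off to a word boundary. This is a simplified illustration with hypothetical names for space-delimited languages only; the complete, tested implementation (which handles Chinese text properly) is in the attached utils_infer_changed_version.py:

```python
def clip_ref_text(ref_text: str, orig_ms: int, kept_ms: int) -> str:
    """Trim ref_text proportionally to the fraction of audio kept.

    Hypothetical helper, not the actual patch: character-ratio
    trimming is only a rough proxy for real audio-text alignment.
    """
    if kept_ms >= orig_ms:
        return ref_text  # audio was not clipped, keep text as-is
    keep_ratio = kept_ms / orig_ms
    # rough character budget, then back off to the last word boundary
    budget = max(1, int(len(ref_text) * keep_ratio))
    clipped = ref_text[:budget]
    cut = clipped.rfind(" ")
    if cut > 0:
        clipped = clipped[:cut]
    return clipped
```

For example, if an 18 s reference is clipped to 12 s, the text is trimmed to roughly the first two-thirds of its characters, rounded down to a word boundary.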
Steps to Reproduce
conda activate f5-tts
f5-tts_finetune-gradio
……(training part)
Converting audio...
Audio is over 12s, clipping short.
Generating audio in 3 batches...
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:06<00:00, 2.01s/it]
When listening to the generated audio, it is really fast and the words are misordered.
✔️ Expected Behavior
The generated audio should be well paced, with words in the correct order.
❌ Actual Behavior
When listening to the generated audio, it is really fast and the words are misordered.