
finetune-gradio "test model" function has inference bug with speech & text mismatch, caused by clipped ref_audio and unchanged ref_text (sampled from dataset) #1207

@KangZENG50025543

Description


Checks

  • This template is only for bug reports, usage problems go with 'Help Wanted'.
  • I have thoroughly reviewed the project documentation but couldn't find information to solve my problem.
  • I have searched for existing issues, including closed ones, and couldn't find a solution.
  • I am using English to submit this issue to facilitate community communication.

This question is already solved. I'm posting the details here to help others who face the same issue.

Your model is probably fine. The bug actually occurs in the infer part.

The code I provided has been tested on a Chinese dataset; whether it adapts well to English still needs to be verified.
utils_infer_changed_version.py

Problem Description

While analyzing the F5-TTS preprocessing pipeline used when testing the model (the issue occurs in the "test model" tab of the Gradio finetune GUI), a critical issue affecting generation quality was discovered.

Current Behavior:

  • When the reference audio exceeds 12 seconds, it gets truncated to the first 12 seconds or shorter
  • However, the corresponding reference text remains completely unchanged
  • This conditions the model on an incorrect mapping: short audio ↔ long text

This leads to:

  • ❌ Attention mechanism confusion (Generated content shows repeated phrases: e.g., Input reference text: "I am the Shore Keeper, your assistant" → Output generated audio: "your assistant, I am the Shore Keeper, your assistant, Shore Keeper, assistant")
  • ❌ Duration prediction distortion (Generated speech is extremely short and spoken very fast; see the sketch after this list)
  • ❌ Prosody learning errors (Pauses are completely wrong, punctuation is entirely ignored)
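
For intuition on the duration problem, here is a rough, hypothetical sketch (simplified numbers, loosely modeled on the idea that F5-TTS scales the generation length by the reference speech rate; none of this is the actual repo code): because the text is not clipped along with the audio, the estimated seconds-per-character is far too low, so the generated speech comes out short and fast.

# Rough, hypothetical illustration of the duration distortion (not actual F5-TTS code):
# the generation length is scaled by a speech rate estimated from the CLIPPED audio
# but the UNclipped text, so the rate looks much faster than it really is.
ref_audio_s = 12.0         # reference audio after clipping to 12 s
ref_text_chars = 400       # full transcript of the original ~40 s clip (assumed)
gen_text_chars = 40        # text we actually want to synthesize

est_sec_per_char = ref_audio_s / ref_text_chars       # 0.03 s/char, far too low
est_gen_duration = gen_text_chars * est_sec_per_char  # ~1.2 s instead of ~4 s
print(f"estimated generated duration: {est_gen_duration:.1f} s")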

Root Cause:
In the preprocess_ref_audio_text() function in file 'src/f5_tts/infer/utils_infer.py':

# Audio clipping (pydub AudioSegment slicing is in milliseconds, so this keeps the first 12 s):
aseg = aseg[:12000]  # or clipped earlier at a detected silence

# But the reference text is returned unchanged:
return ref_audio, ref_text  # ❌ no corresponding text clipping!

The fix is provided in the attached Python file. It adds a function that clips the reference text to match the clipped audio, so the model is conditioned on a well-matched audio-text pair.
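
For readers who just want the idea without opening the attached file, a minimal sketch of the approach is below. This is not the attached utils_infer_changed_version.py: the helper name and the clip-by-character-ratio strategy are my own simplification. It behaves reasonably for Chinese, where characters map to roughly uniform durations; English would additionally need to cut at word boundaries.

# Minimal sketch (hypothetical helper, not the attached file): clip ref_text in
# proportion to how much of the reference audio was kept.
def clip_ref_text(ref_text: str, orig_duration_s: float, clipped_duration_s: float) -> str:
    if orig_duration_s <= 0 or clipped_duration_s >= orig_duration_s:
        return ref_text  # nothing was clipped, keep the full text
    keep_ratio = clipped_duration_s / orig_duration_s
    keep_chars = max(1, int(len(ref_text) * keep_ratio))
    return ref_text[:keep_chars]

# e.g. a 36 s reference truncated to 12 s keeps roughly the first third of the text:
# clipped_text = clip_ref_text(full_text, orig_duration_s=36.0, clipped_duration_s=12.0)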

Steps to Reproduce

conda activate f5-tts

f5-tts_finetune-gradio

……(training part)

Converting audio...
Audio is over 12s, clipping short.

Generating audio in 3 batches...
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:06<00:00, 2.01s/it]

When listening to the generated audio, it is really fast and the words are misordered.
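
To confirm the mismatch outside the GUI, a quick check like the one below can be run. The paths and transcript are placeholders, and it assumes preprocess_ref_audio_text takes (ref_audio_path, ref_text) and returns the processed audio path plus the reference text, as in the current utils_infer.py.

# Quick sanity check (placeholder paths/transcript): the clipped audio gets
# shorter while the returned reference text keeps its full length.
from pydub import AudioSegment
from f5_tts.infer.utils_infer import preprocess_ref_audio_text

ref_audio_path = "ref_over_12s.wav"  # placeholder: any reference clip longer than 12 s
ref_text = "..."                     # placeholder: its full transcript

proc_audio_path, proc_text = preprocess_ref_audio_text(ref_audio_path, ref_text)

orig_s = len(AudioSegment.from_file(ref_audio_path)) / 1000   # pydub length is in ms
proc_s = len(AudioSegment.from_file(proc_audio_path)) / 1000
print(f"audio: {orig_s:.1f} s -> {proc_s:.1f} s")
print(f"text chars: {len(ref_text)} -> {len(proc_text)}")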

✔️ Expected Behavior

The generated audio should be well paced and correctly ordered.

❌ Actual Behavior

When listening to the generated audio, it is really fast and the words are misordered.
