
finetune-gradio "test model" function has inference bug with speech & text mismatch, caused by clipped ref_audio and unchanged ref_text (sampled from dataset) #1207

@KangZENG50025543

Description


Checks

  • This template is only for bug reports, usage problems go with 'Help Wanted'.
  • I have thoroughly reviewed the project documentation but couldn't find information to solve my problem.
  • I have searched for existing issues, including closed ones, and couldn't find a solution.
  • I am using English to submit this issue to facilitate community communication.

This question is already solved. I'm posting the details here to help others who face the same issue.

Your model is probably fine. The bug actually occurs in the infer part.

The code I provided has been tested on a Chinese dataset; whether it adapts well to English still needs to be verified.
utils_infer_changed_version.py

Problem Description

While analyzing the F5-TTS preprocessing pipeline used when testing the model (the issue occurs in the "test model" tab of the Gradio finetune GUI), a critical issue affecting generation quality was discovered.

Current Behavior:

  • When the reference audio exceeds 12 seconds, it gets truncated to the first 12 seconds or shorter
  • However, the corresponding reference text remains completely unchanged
  • This conditions the model on an incorrect mapping: short audio ↔ long text

This leads to:

  • ❌ Attention mechanism confusion (Generated content shows repeated phrases: e.g., Input reference text: "I am the Shore Keeper, your assistant" → Output generated audio: "your assistant, I am the Shore Keeper, your assistant, Shore Keeper, assistant")
  • ❌ Duration prediction distortion (Generated speech is extremely short and spoken very fast; see the sketch after this list)
  • ❌ Prosody learning errors (Pauses are completely wrong, punctuation is entirely ignored)
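
For intuition on the duration problem, here is a rough, hypothetical sketch (simplified numbers, loosely modeled on the idea that F5-TTS scales the generation length by the reference speech rate; none of this is the actual repo code): because the text is not clipped along with the audio, the estimated seconds-per-character is far too low, so the generated speech comes out short and fast.

# Rough, hypothetical illustration of the duration distortion (not actual F5-TTS code):
# the generation length is scaled by a speech rate estimated from the CLIPPED audio
# but the UNclipped text, so the rate looks much faster than it really is.
ref_audio_s = 12.0         # reference audio after clipping to 12 s
ref_text_chars = 400       # full transcript of the original ~40 s clip (assumed)
gen_text_chars = 40        # text we actually want to synthesize

est_sec_per_char = ref_audio_s / ref_text_chars       # 0.03 s/char, far too low
est_gen_duration = gen_text_chars * est_sec_per_char  # ~1.2 s instead of ~4 s
print(f"estimated generated duration: {est_gen_duration:.1f} s")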

Root Cause:
In the preprocess_ref_audio_text() function in file 'src/f5_tts/infer/utils_infer.py':

# Audio clipping (pydub AudioSegment slicing is in milliseconds, so this keeps the first 12 s):
aseg = aseg[:12000]  # or clipped earlier at a detected silence

# But the reference text is returned unchanged:
return ref_audio, ref_text  # ❌ no corresponding text clipping!

The fix is provided in the attached Python file. It adds a function that clips the reference text to match the clipped audio, so the model is conditioned on a well-matched audio-text pair.
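
For readers who just want the idea without opening the attached file, a minimal sketch of the approach is below. This is not the attached utils_infer_changed_version.py: the helper name and the clip-by-character-ratio strategy are my own simplification. It behaves reasonably for Chinese, where characters map to roughly uniform durations; English would additionally need to cut at word boundaries.

# Minimal sketch (hypothetical helper, not the attached file): clip ref_text in
# proportion to how much of the reference audio was kept.
def clip_ref_text(ref_text: str, orig_duration_s: float, clipped_duration_s: float) -> str:
    if orig_duration_s <= 0 or clipped_duration_s >= orig_duration_s:
        return ref_text  # nothing was clipped, keep the full text
    keep_ratio = clipped_duration_s / orig_duration_s
    keep_chars = max(1, int(len(ref_text) * keep_ratio))
    return ref_text[:keep_chars]

# e.g. a 36 s reference truncated to 12 s keeps roughly the first third of the text:
# clipped_text = clip_ref_text(full_text, orig_duration_s=36.0, clipped_duration_s=12.0)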

Steps to Reproduce

conda activate f5-tts

f5-tts_finetune-gradio

……(training part)

Converting audio...
Audio is over 12s, clipping short.

Generating audio in 3 batches...
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:06<00:00, 2.01s/it]

When listening to the generated audio, it is really fast and the words are misordered.
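
To confirm the mismatch outside the GUI, a quick check like the one below can be run. The paths and transcript are placeholders, and it assumes preprocess_ref_audio_text takes (ref_audio_path, ref_text) and returns the processed audio path plus the reference text, as in the current utils_infer.py.

# Quick sanity check (placeholder paths/transcript): the clipped audio gets
# shorter while the returned reference text keeps its full length.
from pydub import AudioSegment
from f5_tts.infer.utils_infer import preprocess_ref_audio_text

ref_audio_path = "ref_over_12s.wav"  # placeholder: any reference clip longer than 12 s
ref_text = "..."                     # placeholder: its full transcript

proc_audio_path, proc_text = preprocess_ref_audio_text(ref_audio_path, ref_text)

orig_s = len(AudioSegment.from_file(ref_audio_path)) / 1000   # pydub length is in ms
proc_s = len(AudioSegment.from_file(proc_audio_path)) / 1000
print(f"audio: {orig_s:.1f} s -> {proc_s:.1f} s")
print(f"text chars: {len(ref_text)} -> {len(proc_text)}")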

✔️ Expected Behavior

The generated audio should be well paced and correctly ordered.

❌ Actual Behavior

When listening to the generated audio, it is really fast and the words are misordered.
