
StyleTTS2 base model first training with LJSpeech dataset issue: Inference results are bad #351

@cagil-suslu

Description


I am trying to train the StyleTTS2 model on the LJSpeech dataset. I am using the base model and have made no code changes. I downloaded the LJSpeech dataset from the link provided in the recipe and upsampled it to a 24,000 Hz sample rate using librosa.resample(res_type="soxr_hq"). I have only updated root_path in config.yaml to point it at the upsampled wav files.

A sample line from Data/train_list.txt can be found below.

Image

A sample line from Data/OOD_texts.txt can be found below.

Image

GPU: RTX 6000 Pro

You can see my config file below.

Image
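For context, the only edited part of the config is a single path, something like the fragment below (key names follow the StyleTTS2 LJSpeech recipe; the directory name is hypothetical):

```yaml
# config.yaml fragment (illustrative; root_path is whatever holds the 24 kHz wavs)
data_params:
  train_data: "Data/train_list.txt"
  val_data: "Data/val_list.txt"
  root_path: "Data/wavs_24k"      # directory containing the resampled wav files
  OOD_data: "Data/OOD_texts.txt"
```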

I have only completed the first training stage and then checked the inference results. I used Inference_LJSpeech.ipynb as the inference code without making any code changes. However, the result of the first stage was very bad: the synthesized audio for the sample text in Inference_LJSpeech.ipynb was approximately 2 minutes long, which is bizarre, and the output was nothing but noise.
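A quick way to sanity-check an output file's length and level without listening (the helper name and file path are hypothetical; an unexpectedly long duration for a single sentence suggests the predicted durations are off, while a high flat RMS is consistent with pure noise):

```python
import wave
import numpy as np

def wav_stats(path):
    """Return (duration_seconds, rms) for a mono 16-bit PCM WAV file."""
    with wave.open(path, "rb") as w:
        sr = w.getframerate()
        n = w.getnframes()
        pcm = np.frombuffer(w.readframes(n), dtype=np.int16)
    samples = pcm.astype(np.float32) / 32768.0  # scale to [-1, 1)
    return n / sr, float(np.sqrt(np.mean(samples ** 2)))
```

For example, wav_stats("out.wav") on the notebook's saved output would confirm whether the 2-minute duration is really in the file or an artifact of playback.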

I did not encounter any NaN or otherwise problematic loss values during training.

Here are my training and evaluation TensorBoard graphs.

Image Image

I have also tested the inference code with the pre-trained LJSpeech model and could not find any problem; the results seem fine.

I have checked my code repeatedly but could not find any bugs.
Can anyone help me solve this issue?

Thanks in advance.
