
StyleTTS2 base model first training with LJSpeech dataset issue: Inference results are bad #351

@cagil-suslu

Description


I am trying to train the StyleTTS2 model on the LJSpeech dataset. I am using the base model and have made no code changes. I downloaded the LJSpeech dataset from the link provided in the recipe and upsampled it to a 24,000 Hz sample rate using librosa.resample(res_type="soxr_hq"). I have only updated root_path in config.yaml to point it at the upsampled wav files.

A sample line from Data/train_list.txt can be found below.

Image

A sample line from Data/OOD_texts.txt can be found below.

Image

GPU: RTX 6000 Pro

You can see my config file below.

Image
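For context, the only edited part of the config is a single path, something like the fragment below (key names follow the StyleTTS2 LJSpeech recipe; the directory name is hypothetical):

```yaml
# config.yaml fragment (illustrative; root_path is whatever holds the 24 kHz wavs)
data_params:
  train_data: "Data/train_list.txt"
  val_data: "Data/val_list.txt"
  root_path: "Data/wavs_24k"      # directory containing the resampled wav files
  OOD_data: "Data/OOD_texts.txt"
```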

I have only completed the first training stage and then checked the inference results. I used Inference_LJSpeech.ipynb as the inference code without making any code changes. However, the result of the first stage was very bad: the synthesized audio for the sample text in Inference_LJSpeech.ipynb was approximately 2 minutes long, which is bizarre, and the output was nothing but noise.
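A quick way to sanity-check an output file's length and level without listening (the helper name and file path are hypothetical; an unexpectedly long duration for a single sentence suggests the predicted durations are off, while a high flat RMS is consistent with pure noise):

```python
import wave
import numpy as np

def wav_stats(path):
    """Return (duration_seconds, rms) for a mono 16-bit PCM WAV file."""
    with wave.open(path, "rb") as w:
        sr = w.getframerate()
        n = w.getnframes()
        pcm = np.frombuffer(w.readframes(n), dtype=np.int16)
    samples = pcm.astype(np.float32) / 32768.0  # scale to [-1, 1)
    return n / sr, float(np.sqrt(np.mean(samples ** 2)))
```

For example, wav_stats("out.wav") on the notebook's saved output would confirm whether the 2-minute duration is really in the file or an artifact of playback.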

I did not encounter any NaN or otherwise problematic loss values during training.

Here are my training and evaluation TensorBoard graphs.

Image Image

I have also tested the inference code with the pre-trained LJSpeech model and could not find any problem; the results seem fine.

I have checked my code repeatedly but could not find any bugs.
Can anyone help me solve this issue?

Thanks in advance.
