Description
I am trying to train the StyleTTS2 model on the LJSpeech dataset. I am using the base model and did not make any code changes. I downloaded the LJSpeech dataset from the link provided in the recipe and resampled it to a 24000 Hz sample rate using librosa.resample(res_type="soxr_hq"). The only change I made was updating root_path in config.yaml to point to the resampled wav files.
A sample line from Data/train_list.txt can be found below.
A sample line from Data/OOD_texts.txt can be found below.
GPU: RTX 6000 pro
You can see my config file below.
I have only completed the first training stage and then checked the inference results. I used the Inference_LJSpeech.ipynb file as the inference code without making any changes. However, the result of the first training was very bad: synthesizing the sample text from Inference_LJSpeech.ipynb produced roughly 2 minutes of audio, which is bizarre, and the output was nothing but noise.
I have not encountered any NaN or problematic loss values during training.
Here are my training and evaluation TensorBoard graphs.
I have also tested the inference code with the pre-trained LJSpeech model and could not find any problem; those results seem fine.
I have checked my code repeatedly but could not find any bugs.
Can anyone help me solve this issue?
Thanks in advance.