End-to-end TTS system - PyTorch Implementation

This is a PyTorch implementation of a non-autoregressive Transformer TTS model with unsupervised duration learning, built on Transformer & Conformer blocks. It supports a family of supervised and unsupervised duration modeling methods, aiming at end-to-end text-to-speech for Vietnamese, researched and developed by Dean Ng.


- Phonemes Presentation
- Architecture Design
- Phonemes Acoustic
- Audio Upsampler
- Linguistic Encoder
- Duration Modeling
- Speaker Embeddings

1. Installation

    conda create --name venv python=3.10
    conda install conda-forge::ffmpeg
    # GPU
    conda install pytorch==2.0.0 torchvision==0.15.0 torchaudio==2.0.0 pytorch-cuda=11.7 -c pytorch -c nvidia
    # CPU only
    conda install pytorch==2.0.0 torchvision==0.15.0 torchaudio==2.0.0 -c pytorch
    pip install -r requirements.txt
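
After installing, a quick sanity check (a generic PyTorch snippet, not part of this repo) confirms that the expected versions and the CUDA backend are visible:

    import torch
    import torchaudio

    # Expect 2.0.0 for both; CUDA is optional but needed for GPU training.
    print(torch.__version__, torchaudio.__version__)
    print("CUDA available:", torch.cuda.is_available())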

2. Train & Test

Model Configuration

    ├── config/
    │   ├── model_config.yaml
    │   ├── preprocessing_config.yaml
    │   └── train_config.yaml
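
As a minimal sketch (assuming standard PyYAML is available via `requirements.txt`), the three files can be loaded like this; the keys inside each file are repo-specific and not shown here:

    import yaml

    # Hypothetical loader; adjust the paths if config/ lives elsewhere.
    with open("config/model_config.yaml") as f:
        model_cfg = yaml.safe_load(f)
    with open("config/preprocessing_config.yaml") as f:
        preprocess_cfg = yaml.safe_load(f)
    with open("config/train_config.yaml") as f:
        train_cfg = yaml.safe_load(f)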

Dataset Format

    /path_to_dataset (with `speakers.json`):
        ├── dataset/
        │   ├── speakers.json (JSON object {`speaker_id`: `index`})
        │   ├── accents.json (JSON object {`accent_id`: `index`})
        │   ├── speaker no.1/
        │   │   ├── wavs/
        │   │   └── metadata.csv
        │   ├── speaker no.2/
        │   │   ├── wavs/
        │   │   └── metadata.csv
        │   ...
        │   └── speaker no.{n}/
        │       ├── wavs/
        │       └── metadata.csv

    /path_to_dataset (without `speakers.json`):
        ├── dataset/
        │   ├── wavs/
        │   │   └── ...
        │   └── metadata.csv
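
The hypothetical helper below (not part of the repo) checks that a folder follows one of the two layouts above, assuming the keys in `speakers.json` match the speaker folder names:

    import json
    from pathlib import Path

    def check_dataset(root: str) -> None:
        """Verify the dataset tree described above (illustrative helper).

        `root` is the folder that holds `speakers.json` (or, for the
        single-speaker layout, `wavs/` and `metadata.csv` directly).
        """
        base = Path(root)
        speakers_file = base / "speakers.json"
        if speakers_file.exists():
            # Multi-speaker layout: one sub-folder per speaker.
            speakers = json.loads(speakers_file.read_text(encoding="utf-8"))
            for speaker_id in speakers:
                spk = base / speaker_id
                assert (spk / "wavs").is_dir(), f"missing wavs/ for {speaker_id}"
                assert (spk / "metadata.csv").is_file(), f"missing metadata.csv for {speaker_id}"
        else:
            # Single-speaker layout: wavs/ and metadata.csv at the top level.
            assert (base / "wavs").is_dir() and (base / "metadata.csv").is_file()
        print("dataset layout OK")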

Train & Test

[*] python train.py
    --task /task_name {text2wav, fastspeech2, adaspeech, jets, vits2, matcha, hifigan}
    --input_folder /path_to_input_dataset
    --data_folder /path_to_data_save_folder
    # use to continue training from a checkpoint
    --checkpoint /path_to_pretrained_checkpoint
    # use for joint training from separately pre-trained models
    --acoustic_checkpoint /path_to_pretrained_acoustic
    --vocoder_checkpoint /path_to_pretrained_vocoder
    # config for joint training
    --version {fastspeech2, matcha, adaspeech}
    # use for training a new speaker from a base model
    --is_finetune
[*] python test.py
    --new_id /path_to_test_file
    --acoustic_path /path_to_acoustic_checkpoint
    # use when running inference with two separate models
    --vocoder_path /path_to_vocoder_checkpoint
    --model_type {JOINT, JETS}
    --output_folder /path_to_output_folder
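
Before pointing `test.py` at a checkpoint, a generic PyTorch snippet (an assumption about the checkpoint format, not the repo's API) can confirm that the file loads and show its top-level keys:

    import torch

    # Hypothetical sanity check; assumes the checkpoint was written with
    # torch.save() and therefore deserializes to a plain Python object.
    ckpt = torch.load("/path_to_acoustic_checkpoint", map_location="cpu")
    print(type(ckpt))
    if isinstance(ckpt, dict):
        print(list(ckpt.keys()))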

Experiments

During training, we made the following observations:

i) Increasing the encoder and decoder from 4 to 6 blocks and the hidden size from 256 to 386 gives better results for 22050 Hz input audio.

ii) The unsupervised duration model outperforms the supervised one after long-term training (for quick experiments, use the supervised variant, which trains faster).

iii) Conformer blocks consume much more GPU memory during training, but give better results than Transformer blocks.
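
As a rough illustration of observation (i), the sketch below compares parameter counts of the two stack sizes using the stock `torch.nn.TransformerEncoder` as a stand-in for this repo's blocks; the head count of 2 is an arbitrary choice that divides both hidden sizes:

    import torch.nn as nn

    def param_count(num_layers: int, d_model: int, nhead: int = 2) -> int:
        """Parameters in a plain Transformer encoder stack (illustrative only)."""
        encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead),
            num_layers=num_layers,
        )
        return sum(p.numel() for p in encoder.parameters())

    print("4 blocks x 256 dims:", param_count(4, 256))
    print("6 blocks x 386 dims:", param_count(6, 386))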

References

Citation

If you use this code in your research, please cite:

@misc{deanng_2024,
    author = {Dean Nguyen},
    title = {End-to-end TTS system - PyTorch Implementation},
    year = {2024},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/ducnt18121997/Viet-Transformer-TTS}}
}
