This is a PyTorch implementation of a non-autoregressive Transformer TTS with unsupervised duration learning, built on Transformer & Conformer blocks. It supports a family of supervised and unsupervised duration modeling approaches and aims at high-quality end-to-end text-to-speech for Vietnamese datasets. Researched and developed by Dean Ng.
- Text-to-Speech Synthesis using Phoneme Concatenation (Mahwash & Shibli, 2014)
- The Effect of Tone Modeling in Vietnamese LVCSR System (Quoc Bao et al., 2016)
- HMM-Based Vietnamese Speech Synthesis (Thu Trang et al., 2015)
- JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech (Lim et al., 2022)
- VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech (Kim et al., 2021)
- FastSpeech 2: Fast and High-Quality End-to-End Text to Speech (Ren et al., 2020)
- Matcha-TTS: A fast TTS architecture with conditional flow matching (Mehta et al., 2023)
- HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis (Kong et al., 2020)
- Attention Is All You Need (Vaswani et al., 2017)
- Conformer: Convolution-augmented Transformer for Speech Recognition (Gulati et al., 2020)
- Differentiable Duration Modeling for End-to-End Text-to-Speech (Nguyen et al., 2022)
- One TTS Alignment To Rule Them All (Badlani et al., 2021)
- ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification (Desplanques et al., 2020)
conda create --name venv python=3.10
conda install conda-forge::ffmpeg
# GPU
conda install pytorch==2.0.0 torchvision==0.15.0 torchaudio==2.0.0 pytorch-cuda=11.7 -c pytorch -c nvidia
# CPU Only
conda install pytorch==2.0.0 torchvision==0.15.0 torchaudio==2.0.0 -c pytorch
pip install -r requirements.txt
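To confirm the environment works before moving on (optional; printed versions depend on your install):

# CUDA check prints False on CPU-only installs
python -c "import torch, torchaudio; print(torch.__version__, torchaudio.__version__, torch.cuda.is_available())"
ffmpeg -version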
├── config/
│   ├── model_config.yaml
│   ├── preprocessing_config.yaml
│   └── train_config.yaml
/path_to_dataset (with `speakers.json`):
├── dataset/
│   ├── speakers.json (JSON object {`speaker_id`: `index`}; example below)
│   ├── accents.json (JSON object {`accent_id`: `index`}; example below)
│   ├── speaker no.1/
│   │   ├── wavs/
│   │   └── metadata.csv
│   ├── speaker no.2/
│   │   ├── wavs/
│   │   └── metadata.csv
│   ├── ...
│   └── speaker no.{n}/
│       ├── wavs/
│       └── metadata.csv
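A hedged example of the two mapping files above; the speaker and accent names here are placeholders, only the `{id: index}` shape comes from the layout description:

# illustrative contents only -- use your own speaker/accent names
cat > dataset/speakers.json << 'EOF'
{"speaker no.1": 0, "speaker no.2": 1}
EOF
cat > dataset/accents.json << 'EOF'
{"north": 0, "central": 1, "south": 2}
EOF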
/path_to_dataset (without `speakers.json`):
├── dataset/
│   ├── wavs/
│   │   └── ...
│   └── metadata.csv
python train.py
--task {text2wav, fastspeech2, adaspeech, jets, vits2, matcha, hifigan}
--input_folder /path_to_input_dataset
--data_folder /path_to_data_save_folder
# use to resume training from a checkpoint
--checkpoint /path_to_pretrained_checkpoint
# use for joint training from separately pre-trained checkpoints
--acoustic_checkpoint /path_to_pretrained_acoustic
--vocoder_checkpoint /path_to_pretrained_vocoder
# acoustic model version for joint training
--version {fastspeech2, matcha, adaspeech}
# use when fine-tuning new speakers from a base model
--is_finetune
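A hedged example run, assuming a FastSpeech2 acoustic model and the dataset layout above (all paths are placeholders):

python train.py \
    --task fastspeech2 \
    --input_folder /path_to_input_dataset \
    --data_folder /path_to_data_save_folder
# to resume an interrupted run, append: --checkpoint /path_to_pretrained_checkpoint
# to fine-tune new speakers from a base model, append: --is_finetune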
python test.py
--new_id /path_to_test_file
--acoustic_path /path_to_acoustic_checkpoint
# use when running inference with two models (acoustic + vocoder)
--vocoder_path /path_to_vocoder_checkpoint
--model_type {JOINT, JETS}
--output_folder /path_to_output_folder
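A hedged example of two-model inference (separate acoustic and vocoder checkpoints; paths are placeholders):

python test.py \
    --new_id /path_to_test_file \
    --acoustic_path /path_to_acoustic_checkpoint \
    --vocoder_path /path_to_vocoder_checkpoint \
    --model_type JOINT \
    --output_folder /path_to_output_folder
# presumably, a single end-to-end checkpoint skips --vocoder_path and uses --model_type JETS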
During training, we found the following:
i) Encoder and decoder stacks with 6 blocks (up from 4) and 386 hidden dims (up from 256) give better results for 22050 Hz input audio.
ii) With long-term training, the unsupervised-duration model outperforms the supervised one (for quick experiments, use the supervised variant, which trains faster).
iii) The Conformer block uses considerably more GPU memory during training but gives better results than the Transformer block.
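Finding (i) corresponds to the encoder/decoder settings in config/model_config.yaml; the exact key names vary by repo, so rather than guess them, locate the relevant fields and raise the block count (4 -> 6) and hidden dims (256 -> 386) there:

# find the fields controlling block count and hidden dims (key names are repo-specific)
grep -n -i -E "layer|block|hidden" config/model_config.yaml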
- ESPnet
- jik876's HiFi-GAN
- ming024's FastSpeech2
- jaywalnut310's VITS
- shivammehta25's Matcha-TTS
- TaoRuijie's ECAPA-TDNN
- keonlee9420's Transformer-TTS
If you use this code in your research, please cite:
@misc{deanng_2024,
  author       = {Dean Nguyen},
  title        = {End-to-end TTS system - PyTorch Implementation},
  year         = {2024},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/ducnt18121997/Viet-Transformer-TTS}}
}