This is a PyTorch implementation of a non-autoregressive Transformer TTS with unsupervised duration learning, built on Transformer & Conformer blocks. It supports a family of supervised and unsupervised duration modeling approaches and aims at high-quality end-to-end text-to-speech for Vietnamese datasets. Researched and developed by Dean Ng.
- Text-to-Speech Synthesis using Phoneme Concatenation (Mahwash & Shibli, 2014)
- The Effect of Tone Modeling in Vietnamese LVCSR System (Quoc Bao et al., 2016)
- HMM-Based Vietnamese Speech Synthesis (Thu Trang et al., 2015)
- JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech (Lim et al., 2022)
- VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech (Kim et al., 2021)
- FastSpeech 2: Fast and High-Quality End-to-End Text to Speech (Ren et al., 2020)
- Matcha-TTS: A fast TTS architecture with conditional flow matching (Mehta et al., 2023)
- HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis (Kong et al., 2020)
- Attention Is All You Need (Vaswani et al., 2017)
- Conformer: Convolution-augmented Transformer for Speech Recognition (Gulati et al., 2020)
- Differentiable Duration Modeling for End-to-End Text-to-Speech (Nguyen et al., 2022)
- One TTS Alignment To Rule Them All (Badlani et al., 2021)
- ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification (Desplanques et al., 2020)
conda create --name venv python=3.10
conda install conda-forge::ffmpeg
# GPU
conda install pytorch==2.0.0 torchvision==0.15.0 torchaudio==2.0.0 pytorch-cuda=11.7 -c pytorch -c nvidia
# CPU Only
conda install pytorch==2.0.0 torchvision==0.15.0 torchaudio==2.0.0 -c pytorch
pip install -r requirements.txt
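To confirm the environment works before moving on (optional; printed versions depend on your install):

# CUDA check prints False on CPU-only installs
python -c "import torch, torchaudio; print(torch.__version__, torchaudio.__version__, torch.cuda.is_available())"
ffmpeg -version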
├── config/
│   ├── model_config.yaml
│   ├── preprocessing_config.yaml
│   └── train_config.yaml
/path_to_dataset (with `speakers.json`):
├── dataset/
│   ├── speakers.json (JSON object {`speaker_id`: `index`}; example below)
│   ├── accents.json (JSON object {`accent_id`: `index`}; example below)
│   ├── speaker no.1/
│   │   ├── wavs/
│   │   └── metadata.csv
│   ├── speaker no.2/
│   │   ├── wavs/
│   │   └── metadata.csv
│   ├── ...
│   └── speaker no.{n}/
│       ├── wavs/
│       └── metadata.csv
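A hedged example of the two mapping files above; the speaker and accent names here are placeholders, only the `{id: index}` shape comes from the layout description:

# illustrative contents only -- use your own speaker/accent names
cat > dataset/speakers.json << 'EOF'
{"speaker no.1": 0, "speaker no.2": 1}
EOF
cat > dataset/accents.json << 'EOF'
{"north": 0, "central": 1, "south": 2}
EOF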
/path_to_dataset (without `speakers.json`):
├── dataset/
│   ├── wavs/
│   │   └── ...
│   └── metadata.csv
python train.py
--task {text2wav, fastspeech2, adaspeech, jets, vits2, matcha, hifigan}
--input_folder /path_to_input_dataset
--data_folder /path_to_data_save_folder
# use to resume training from a checkpoint
--checkpoint /path_to_pretrained_checkpoint
# use for joint training from separately pre-trained checkpoints
--acoustic_checkpoint /path_to_pretrained_acoustic
--vocoder_checkpoint /path_to_pretrained_vocoder
# acoustic model version for joint training
--version {fastspeech2, matcha, adaspeech}
# use when fine-tuning new speakers from a base model
--is_finetune
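A hedged example run, assuming a FastSpeech2 acoustic model and the dataset layout above (all paths are placeholders):

python train.py \
    --task fastspeech2 \
    --input_folder /path_to_input_dataset \
    --data_folder /path_to_data_save_folder
# to resume an interrupted run, append: --checkpoint /path_to_pretrained_checkpoint
# to fine-tune new speakers from a base model, append: --is_finetune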
python test.py
--new_id /path_to_test_file
--acoustic_path /path_to_acoustic_checkpoint
# use when running inference with two models (acoustic + vocoder)
--vocoder_path /path_to_vocoder_checkpoint
--model_type {JOINT, JETS}
--output_folder /path_to_output_folder
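A hedged example of two-model inference (separate acoustic and vocoder checkpoints; paths are placeholders):

python test.py \
    --new_id /path_to_test_file \
    --acoustic_path /path_to_acoustic_checkpoint \
    --vocoder_path /path_to_vocoder_checkpoint \
    --model_type JOINT \
    --output_folder /path_to_output_folder
# presumably, a single end-to-end checkpoint skips --vocoder_path and uses --model_type JETS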
During training, we found the following:
i) Encoder and decoder stacks with 6 blocks (up from 4) and 386 hidden dims (up from 256) give better results for 22050 Hz input audio.
ii) With long-term training, the unsupervised-duration model outperforms the supervised one (for quick experiments, use the supervised variant, which trains faster).
iii) The Conformer block uses considerably more GPU memory during training but gives better results than the Transformer block.
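Finding (i) corresponds to the encoder/decoder settings in config/model_config.yaml; the exact key names vary by repo, so rather than guess them, locate the relevant fields and raise the block count (4 -> 6) and hidden dims (256 -> 386) there:

# find the fields controlling block count and hidden dims (key names are repo-specific)
grep -n -i -E "layer|block|hidden" config/model_config.yaml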
- ESPnet
- jik876's HiFi-GAN
- ming024's FastSpeech2
- jaywalnut310's VITS
- shivammehta25's Matcha-TTS
- TaoRuijie's ECAPA-TDNN
- keonlee9420's Transformer-TTS
If you use this code in your research, please cite:
@misc{deanng_2024,
  author       = {Dean Nguyen},
  title        = {End-to-end TTS system - PyTorch Implementation},
  year         = {2024},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/ducnt18121997/Viet-Transformer-TTS}}
}