Commit 7c1a0d9

Merge remote-tracking branch 'upstream/master' into lju
2 parents a4b3d64 + 2959501 commit 7c1a0d9

12 files changed: +487 -15 lines changed

README.md

Lines changed: 13 additions & 9 deletions
@@ -19,9 +19,10 @@
 :zany_face: TensorFlowTTS provides real-time state-of-the-art speech synthesis architectures such as Tacotron-2, Melgan, Multiband-Melgan, FastSpeech, FastSpeech2 based-on TensorFlow 2. With Tensorflow 2, we can speed-up training/inference progress, optimizer further by using [fake-quantize aware](https://www.tensorflow.org/model_optimization/guide/quantization/training_comprehensive_guide) and [pruning](https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras), make TTS models can be run faster than real-time and be able to deploy on mobile devices or embedded systems.
 
 ## What's new
-- 2021/06/01 (**NEW!**) Integrated with [Huggingface Hub](https://huggingface.co/tensorspeech). See the [PR](https://github.com/TensorSpeech/TensorFlowTTS/pull/555). Thanks [patrickvonplaten](https://github.com/patrickvonplaten) and [osanseviero](https://github.com/osanseviero)
-- 2021/03/18 (**NEW!**) Support IOS for FastSpeech2 and MB MelGAN. Thanks [kewlbear](https://github.com/kewlbear). See [here](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/ios)
-- 2021/01/18 (**NEW!**) Support TFLite C++ inference. Thanks [luan78zaoha](https://github.com/luan78zaoha). See [here](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/cpptflite)
+- 2021/08/12 (**NEW!**) Support French TTS (Tacotron2, Multiband MelGAN). Pls see the [colab](https://colab.research.google.com/drive/1jd3u46g-fGQw0rre8fIwWM9heJvrV1c0?usp=sharing). Many Thanks [Samuel Delalez](https://github.com/samuel-lunii)
+- 2021/06/01 Integrated with [Huggingface Hub](https://huggingface.co/tensorspeech). See the [PR](https://github.com/TensorSpeech/TensorFlowTTS/pull/555). Thanks [patrickvonplaten](https://github.com/patrickvonplaten) and [osanseviero](https://github.com/osanseviero)
+- 2021/03/18 Support IOS for FastSpeech2 and MB MelGAN. Thanks [kewlbear](https://github.com/kewlbear). See [here](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/ios)
+- 2021/01/18 Support TFLite C++ inference. Thanks [luan78zaoha](https://github.com/luan78zaoha). See [here](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/cpptflite)
 - 2020/12/02 Support German TTS with [Thorsten dataset](https://github.com/thorstenMueller/deep-learning-german-tts). See the [Colab](https://colab.research.google.com/drive/1W0nSFpsz32M0OcIkY9uMOiGrLTPKVhTy?usp=sharing). Thanks [thorstenMueller](https://github.com/thorstenMueller) and [monatis](https://github.com/monatis)
 - 2020/11/24 Add HiFi-GAN vocoder. See [here](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/hifigan)
 - 2020/11/19 Add Multi-GPU gradient accumulator. See [here](https://github.com/TensorSpeech/TensorFlowTTS/pull/377)
@@ -47,7 +48,7 @@
 - Support both Single/Multi GPU in base trainer class.
 - TFlite conversion for all supported models.
 - Android example.
-- Support many languages (currently, we support Chinese, Korean, English.)
+- Support many languages (currently, we support Chinese, Korean, English, French and German)
 - Support C++ inference.
 - Support Convert weight for some models from PyTorch to TensorFlow to accelerate speed.
 
@@ -115,7 +116,7 @@ Prepare a dataset in the following format:
 
 Where `metadata.csv` has the following format: `id|transcription`. This is a ljspeech-like format; you can ignore preprocessing steps if you have other format datasets.
 
-Note that `NAME_DATASET` should be `[ljspeech/kss/baker/libritts]` for example.
+Note that `NAME_DATASET` should be `[ljspeech/kss/baker/libritts/synpaflex]` for example.
 
 ## Preprocessing
 
@@ -131,14 +132,17 @@ The preprocessing has two steps:
 
 To reproduce the steps above:
 ```
-tensorflow-tts-preprocess --rootdir ./[ljspeech/kss/baker/libritts/thorsten] --outdir ./dump_[ljspeech/kss/baker/libritts/thorsten] --config preprocess/[ljspeech/kss/baker/thorsten]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts/thorsten]
-tensorflow-tts-normalize --rootdir ./dump_[ljspeech/kss/baker/libritts/thorsten] --outdir ./dump_[ljspeech/kss/baker/libritts/thorsten] --config preprocess/[ljspeech/kss/baker/libritts/thorsten]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts/thorsten]
+tensorflow-tts-preprocess --rootdir ./[ljspeech/kss/baker/libritts/thorsten/synpaflex] --outdir ./dump_[ljspeech/kss/baker/libritts/thorsten/synpaflex] --config preprocess/[ljspeech/kss/baker/thorsten/synpaflex]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts/thorsten/synpaflex]
+tensorflow-tts-normalize --rootdir ./dump_[ljspeech/kss/baker/libritts/thorsten/synpaflex] --outdir ./dump_[ljspeech/kss/baker/libritts/thorsten/synpaflex] --config preprocess/[ljspeech/kss/baker/libritts/thorsten/synpaflex]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts/thorsten/synpaflex]
 ```
 
-Right now we only support [`ljspeech`](https://keithito.com/LJ-Speech-Dataset/), [`kss`](https://www.kaggle.com/bryanpark/korean-single-speaker-speech-dataset), [`baker`](https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar), [`libritts`](http://www.openslr.org/60/) and [`thorsten`](https://github.com/thorstenMueller/deep-learning-german-tts) for dataset argument. In the future, we intend to support more datasets.
+Right now we only support [`ljspeech`](https://keithito.com/LJ-Speech-Dataset/), [`kss`](https://www.kaggle.com/bryanpark/korean-single-speaker-speech-dataset), [`baker`](https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar), [`libritts`](http://www.openslr.org/60/), [`thorsten`](https://github.com/thorstenMueller/deep-learning-german-tts) and
+[`synpaflex`](https://www.ortolang.fr/market/corpora/synpaflex-corpus/) for dataset argument. In the future, we intend to support more datasets.
 
 **Note**: To run `libritts` preprocessing, please first read the instruction in [examples/fastspeech2_libritts](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/fastspeech2_libritts). We need to reformat it first before run preprocessing.
 
+**Note**: To run `synpaflex` preprocessing, please first run the notebook [notebooks/prepare_synpaflex.ipynb](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/notebooks/prepare_synpaflex.ipynb). We need to reformat it first before run preprocessing.
+
 After preprocessing, the structure of the project folder should be:
 ```
 |- [NAME_DATASET]/
@@ -254,7 +258,7 @@ A detail implementation of base_trainer from [tensorflow_tts/trainer/base_traine
 All models on this repo are trained based-on **GanBasedTrainer** (see [train_melgan.py](https://github.com/tensorspeech/TensorFlowTTS/blob/master/examples/melgan/train_melgan.py), [train_melgan_stft.py](https://github.com/tensorspeech/TensorFlowTTS/blob/master/examples/melgan.stft/train_melgan_stft.py), [train_multiband_melgan.py](https://github.com/tensorspeech/TensorFlowTTS/blob/master/examples/multiband_melgan/train_multiband_melgan.py)) and **Seq2SeqBasedTrainer** (see [train_tacotron2.py](https://github.com/tensorspeech/TensorFlowTTS/blob/master/examples/tacotron2/train_tacotron2.py), [train_fastspeech.py](https://github.com/tensorspeech/TensorFlowTTS/blob/master/examples/fastspeech/train_fastspeech.py)).
 
 # End-to-End Examples
-You can know how to inference each model at [notebooks](https://github.com/tensorspeech/TensorFlowTTS/tree/master/notebooks) or see a [colab](https://colab.research.google.com/drive/1akxtrLZHKuMiQup00tzO2olCaN-y3KiD?usp=sharing) (for English), [colab](https://colab.research.google.com/drive/1ybWwOS5tipgPFttNulp77P6DAB5MtiuN?usp=sharing) (for Korean). Here is an example code for end2end inference with fastspeech2 and multi-band melgan. We uploaded all our pretrained in [HuggingFace Hub](https://huggingface.co/tensorspeech).
+You can know how to inference each model at [notebooks](https://github.com/tensorspeech/TensorFlowTTS/tree/master/notebooks) or see a [colab](https://colab.research.google.com/drive/1akxtrLZHKuMiQup00tzO2olCaN-y3KiD?usp=sharing) (for English), [colab](https://colab.research.google.com/drive/1ybWwOS5tipgPFttNulp77P6DAB5MtiuN?usp=sharing) (for Korean), [colab](https://colab.research.google.com/drive/1YpSHRBRPBI7cnTkQn1UcVTWEQVbsUm1S?usp=sharing) (for Chinese), [colab](https://colab.research.google.com/drive/1jd3u46g-fGQw0rre8fIwWM9heJvrV1c0?usp=sharing) (for French), [colab](https://colab.research.google.com/drive/1W0nSFpsz32M0OcIkY9uMOiGrLTPKVhTy?usp=sharing) (for German). Here is an example code for end2end inference with fastspeech2 and multi-band melgan. We uploaded all our pretrained in [HuggingFace Hub](https://huggingface.co/tensorspeech).
 
 ```python
 import numpy as np
Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
+
+# This is the hyperparameter configuration file for Multi-Band MelGAN.
+# Please make sure this is adjusted for the LJSpeech dataset. If you want to
+# apply to the other dataset, you might need to carefully change some parameters.
+# This configuration performs 1000k iters.
+
+###########################################################
+# FEATURE EXTRACTION SETTING #
+###########################################################
+sampling_rate: 22050
+hop_size: 256 # Hop size.
+format: "npy"
+
+
+###########################################################
+# GENERATOR NETWORK ARCHITECTURE SETTING #
+###########################################################
+model_type: "multiband_melgan_generator"
+
+multiband_melgan_generator_params:
+    out_channels: 4 # Number of output channels (number of subbands).
+    kernel_size: 7 # Kernel size of initial and final conv layers.
+    filters: 384 # Initial number of channels for conv layers.
+    upsample_scales: [8, 4, 2] # List of Upsampling scales.
+    stack_kernel_size: 3 # Kernel size of dilated conv layers in residual stack.
+    stacks: 4 # Number of stacks in a single residual stack module.
+    is_weight_norm: false # Use weight-norm or not.
+
+###########################################################
+# DISCRIMINATOR NETWORK ARCHITECTURE SETTING #
+###########################################################
+multiband_melgan_discriminator_params:
+    out_channels: 1 # Number of output channels.
+    scales: 3 # Number of multi-scales.
+    downsample_pooling: "AveragePooling1D" # Pooling type for the input downsampling.
+    downsample_pooling_params: # Parameters of the above pooling function.
+        pool_size: 4
+        strides: 2
+    kernel_sizes: [5, 3] # List of kernel size.
+    filters: 16 # Number of channels of the initial conv layer.
+    max_downsample_filters: 512 # Maximum number of channels of downsampling layers.
+    downsample_scales: [4, 4, 4] # List of downsampling scales.
+    nonlinear_activation: "LeakyReLU" # Nonlinear activation function.
+    nonlinear_activation_params: # Parameters of nonlinear activation function.
+        alpha: 0.2
+    is_weight_norm: false # Use weight-norm or not.
+
+###########################################################
+# STFT LOSS SETTING #
+###########################################################
+stft_loss_params:
+    fft_lengths: [1024, 2048, 512] # List of FFT size for STFT-based loss.
+    frame_steps: [120, 240, 50] # List of hop size for STFT-based loss
+    frame_lengths: [600, 1200, 240] # List of window length for STFT-based loss.
+
+subband_stft_loss_params:
+    fft_lengths: [384, 683, 171] # List of FFT size for STFT-based loss.
+    frame_steps: [30, 60, 10] # List of hop size for STFT-based loss
+    frame_lengths: [150, 300, 60] # List of window length for STFT-based loss.
+
+###########################################################
+# ADVERSARIAL LOSS SETTING #
+###########################################################
+lambda_feat_match: 10.0 # Loss balancing coefficient for feature matching loss
+lambda_adv: 2.5 # Loss balancing coefficient for adversarial loss.
+
+###########################################################
+# DATA LOADER SETTING #
+###########################################################
+batch_size: 64 # Batch size for each GPU with assuming that gradient_accumulation_steps == 1.
+batch_max_steps: 8192 # Length of each audio in batch for training. Make sure dividable by hop_size.
+batch_max_steps_valid: 8192 # Length of each audio for validation. Make sure dividable by hope_size.
+remove_short_samples: true # Whether to remove samples the length of which are less than batch_max_steps.
+allow_cache: true # Whether to allow cache in dataset. If true, it requires cpu memory.
+is_shuffle: true # shuffle dataset after each epoch.
+
+###########################################################
+# OPTIMIZER & SCHEDULER SETTING #
+###########################################################
+generator_optimizer_params:
+    lr_fn: "PiecewiseConstantDecay"
+    lr_params:
+        boundaries: [100000, 200000, 300000, 400000, 500000, 600000, 700000]
+        values: [0.0005, 0.0005, 0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
+    amsgrad: false
+
+discriminator_optimizer_params:
+    lr_fn: "PiecewiseConstantDecay"
+    lr_params:
+        boundaries: [100000, 200000, 300000, 400000, 500000]
+        values: [0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
+
+    amsgrad: false
+
+gradient_accumulation_steps: 1
+###########################################################
+# INTERVAL SETTING #
+###########################################################
+discriminator_train_start_steps: 200000 # steps begin training discriminator
+train_max_steps: 4000000 # Number of training steps.
+save_interval_steps: 20000 # Interval steps to save checkpoint.
+eval_interval_steps: 5000 # Interval steps to evaluate the network.
+log_interval_steps: 200 # Interval steps to record the training log.
+
+###########################################################
+# OTHER SETTING #
+###########################################################
+num_save_intermediate_results: 1 # Number of batch to be saved as intermediate results.
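
Review note: a few settings in the Multi-Band MelGAN configuration above depend on each other, and a quick check can catch mismatches when adapting it to another dataset. The sketch below is not part of the commit. It copies the values from the file and assumes that (a) the product of `upsample_scales` times the number of subbands should equal `hop_size`, and (b) `PiecewiseConstantDecay` follows the Keras convention where a step equal to a boundary still uses the lower interval's value. The helper name `piecewise_constant` is hypothetical.

```python
from bisect import bisect_left
from math import prod

# Values copied from the configuration above.
hop_size = 256
out_channels = 4                      # number of subbands
upsample_scales = [8, 4, 2]
batch_max_steps = 8192
boundaries = [100000, 200000, 300000, 400000, 500000, 600000, 700000]
values = [0.0005, 0.0005, 0.00025, 0.000125,
          0.0000625, 0.00003125, 0.000015625, 0.000001]


def piecewise_constant(step, boundaries, values):
    """Hypothetical helper: look up the learning rate for a training step.

    Assumes the Keras convention: values[0] while step <= boundaries[0],
    values[i] while boundaries[i-1] < step <= boundaries[i], and the last
    value after the final boundary.
    """
    if len(values) != len(boundaries) + 1:
        raise ValueError("values must have exactly one more entry than boundaries")
    return values[bisect_left(boundaries, step)]


# Each mel frame is upsampled per subband, then the subbands are merged,
# so one frame should expand to exactly hop_size audio samples.
samples_per_frame = prod(upsample_scales) * out_channels

# Training crops must contain a whole number of frames.
frames_per_crop, remainder = divmod(batch_max_steps, hop_size)

if __name__ == "__main__":
    print("samples per frame:", samples_per_frame, "(hop_size:", hop_size, ")")
    print("frames per crop:", frames_per_crop, "remainder:", remainder)
    for step in (1, 100000, 100001, 250000, 700001):
        print(step, piecewise_constant(step, boundaries, values))
```

Note that the header comment of the file mentions 1000k iterations while `train_max_steps` is set to 4000000; with the boundaries above, the generator learning rate stays at its final value for every step after 700000.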
Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
+# This is the hyperparameter configuration file for Tacotron2 v1.
+# Please make sure this is adjusted for the LJSpeech dataset. If you want to
+# apply to the other dataset, you might need to carefully change some parameters.
+# This configuration performs 200k iters but 65k iters is enough to get a good models.
+
+###########################################################
+# FEATURE EXTRACTION SETTING #
+###########################################################
+hop_size: 256 # Hop size.
+format: "npy"
+
+
+###########################################################
+# NETWORK ARCHITECTURE SETTING #
+###########################################################
+model_type: "tacotron2"
+
+tacotron2_params:
+    dataset: synpaflex
+    embedding_hidden_size: 512
+    initializer_range: 0.02
+    embedding_dropout_prob: 0.1
+    n_speakers: 1
+    n_conv_encoder: 5
+    encoder_conv_filters: 512
+    encoder_conv_kernel_sizes: 5
+    encoder_conv_activation: 'relu'
+    encoder_conv_dropout_rate: 0.5
+    encoder_lstm_units: 256
+    n_prenet_layers: 2
+    prenet_units: 256
+    prenet_activation: 'relu'
+    prenet_dropout_rate: 0.5
+    n_lstm_decoder: 1
+    reduction_factor: 1
+    decoder_lstm_units: 1024
+    attention_dim: 128
+    attention_filters: 32
+    attention_kernel: 31
+    n_mels: 80
+    n_conv_postnet: 5
+    postnet_conv_filters: 512
+    postnet_conv_kernel_sizes: 5
+    postnet_dropout_rate: 0.1
+    attention_type: "lsa"
+
+###########################################################
+# DATA LOADER SETTING #
+###########################################################
+batch_size: 32 # Batch size for each GPU with assuming that gradient_accumulation_steps == 1.
+remove_short_samples: true # Whether to remove samples the length of which are less than batch_max_steps.
+allow_cache: true # Whether to allow cache in dataset. If true, it requires cpu memory.
+mel_length_threshold: 32 # remove all targets has mel_length <= 32
+is_shuffle: true # shuffle dataset after each epoch.
+use_fixed_shapes: true # use_fixed_shapes for training (2x speed-up)
+                       # refer (https://github.com/dathudeptrai/TensorflowTTS/issues/34#issuecomment-642309118)
+
+###########################################################
+# OPTIMIZER & SCHEDULER SETTING #
+###########################################################
+optimizer_params:
+    initial_learning_rate: 0.001
+    end_learning_rate: 0.00001
+    decay_steps: 150000 # < train_max_steps is recommend.
+    warmup_proportion: 0.02
+    weight_decay: 0.001
+
+gradient_accumulation_steps: 1
+var_train_expr: null # trainable variable expr (eg. 'embeddings|decoder_cell' )
+                     # must separate by |. if var_train_expr is null then we
+                     # training all variables.
+###########################################################
+# INTERVAL SETTING #
+###########################################################
+train_max_steps: 200000 # Number of training steps.
+save_interval_steps: 2000 # Interval steps to save checkpoint.
+eval_interval_steps: 500 # Interval steps to evaluate the network.
+log_interval_steps: 200 # Interval steps to record the training log.
+start_schedule_teacher_forcing: 200001 # don't need to apply schedule teacher forcing.
+start_ratio_value: 0.5 # start ratio of scheduled teacher forcing.
+schedule_decay_steps: 50000 # decay step scheduled teacher forcing.
+end_ratio_value: 0.0 # end ratio of scheduled teacher forcing.
+###########################################################
+# OTHER SETTING #
+###########################################################
+num_save_intermediate_results: 1 # Number of results to be saved as intermediate results.
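
Review note: the optimizer and interval settings of the Tacotron2 configuration above interact, and a rough sketch may help when retuning them. The code below is not part of the commit. It assumes a linear warm-up over `train_max_steps * warmup_proportion` steps followed by a polynomial decay with power 1.0, evaluated on the global step; how the actual trainer combines warm-up and decay is an assumption here, and the function name `learning_rate` is hypothetical.

```python
# Values copied from the configuration above.
initial_learning_rate = 0.001
end_learning_rate = 0.00001
decay_steps = 150000
warmup_proportion = 0.02
train_max_steps = 200000
start_schedule_teacher_forcing = 200001

# Assumption: warm-up length is a fixed proportion of the whole run.
warmup_steps = int(train_max_steps * warmup_proportion)


def learning_rate(step, power=1.0):
    """Hypothetical schedule: linear warm-up, then polynomial decay.

    The decay term is the standard polynomial-decay formula:
    (initial - end) * (1 - min(step, decay_steps) / decay_steps) ** power + end
    """
    if step < warmup_steps:
        return initial_learning_rate * step / warmup_steps
    progress = min(step, decay_steps) / decay_steps
    span = initial_learning_rate - end_learning_rate
    return span * (1.0 - progress) ** power + end_learning_rate


# Scheduled teacher forcing only starts once the step counter reaches
# start_schedule_teacher_forcing, which lies beyond the last training step.
teacher_forcing_schedule_active = start_schedule_teacher_forcing <= train_max_steps

if __name__ == "__main__":
    print("warm-up steps:", warmup_steps)
    for step in (0, warmup_steps, decay_steps, train_max_steps):
        print(step, learning_rate(step))
    print("scheduled teacher forcing used:", teacher_forcing_schedule_active)
```

Under these assumptions the learning rate reaches `end_learning_rate` at step 150000 and stays there for the remaining 50000 steps, and the teacher-forcing schedule never activates, which matches the comment on `start_schedule_teacher_forcing`.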
