
Commit 4ae9582

Merge branch 'master' into parallel-wavegan
2 parents fca393a + 5f21de9 commit 4ae9582

File tree

76 files changed: +3874 additions, −627 deletions

Note: large commits have some content hidden by default, so file headers for several of the changed files below are not shown.

.gitignore

Lines changed: 10 additions & 0 deletions

@@ -32,3 +32,13 @@ ljspeech
 /datasets
 /examples/tacotron2/exp/
 /temp/
+LibriTTS/
+dataset/
+mfa/
+kss/
+baker/
+libritts/
+dump_baker/
+dump_ljspeech/
+dump_kss/
+dump_libritts/

README.md

Lines changed: 22 additions & 12 deletions

@@ -19,6 +19,8 @@
 :zany_face: TensorflowTTS provides real-time state-of-the-art speech synthesis architectures such as Tacotron-2, MelGAN, Multiband-MelGAN, FastSpeech, and FastSpeech2, based on TensorFlow 2. With TensorFlow 2 we can speed up training and inference, optimize further using [fake-quantize aware training](https://www.tensorflow.org/model_optimization/guide/quantization/training_comprehensive_guide) and [pruning](https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras), and make TTS models run faster than real time, ready to deploy on mobile devices or embedded systems.

 ## What's new
+- 2020/08/18 **(NEW!)** Updated the [base processor](https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/processor/base_processor.py). Added [AutoProcessor](https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/inference/auto_processor.py) and [pretrained processor](https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/processor/pretrained/) JSON files.
+- 2020/08/14 **(NEW!)** Support Chinese TTS. Please see the [colab](https://colab.research.google.com/drive/1YpSHRBRPBI7cnTkQn1UcVTWEQVbsUm1S?usp=sharing). Thanks to [@azraelkuan](https://github.com/azraelkuan).
 - 2020/08/05 **(NEW!)** Support Korean TTS. Please see the [colab](https://colab.research.google.com/drive/1ybWwOS5tipgPFttNulp77P6DAB5MtiuN?usp=sharing). Thanks to [@crux153](https://github.com/crux153).
 - 2020/07/17 Support multi-GPU for all trainers.
 - 2020/07/05 Support converting Tacotron-2 and FastSpeech to TFLite. Please see the [colab](https://colab.research.google.com/drive/1HudLLpT9CQdh2k04c06bHUwLubhGTWxA?usp=sharing). Thanks to @jaeyoo from the TFLite team for his support.
@@ -35,15 +37,17 @@
 - Mixed precision to speed up training when possible.
 - Support for both single and multi GPU in the base trainer class.
 - TFLite conversion for all supported models.
+- Android example.
+- Support for many languages (currently Chinese, Korean, and English).

 ## Requirements
 This repository is tested on Ubuntu 18.04 with:

-- Python 3.6+
+- Python 3.7+
 - CUDA 10.1
 - CuDNN 7.6.5
 - TensorFlow 2.2/2.3
-- [Tensorflow Addons](https://github.com/tensorflow/addons) 0.10.0
+- [Tensorflow Addons](https://github.com/tensorflow/addons) >= 0.10.0

 Different TensorFlow versions should work but have not been tested yet. This repo will try to track the latest stable TensorFlow version. **We recommend installing TensorFlow 2.3.0 for training if you want to use multi-GPU.**
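If you want a quick environment matching these requirements, here is a minimal sketch (assuming a plain pip setup; the repo's own install instructions, not shown in this diff, take precedence):

```
pip install tensorflow==2.3.0 "tensorflow-addons>=0.10.0"
```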

@@ -90,7 +94,7 @@ Here is an audio sample on the valid set. [tacotron-2](https://drive.google.com/ope

 Prepare a dataset in the following format:
 ```
-|- datasets/
+|- [NAME_DATASET]/
 |  |- metadata.csv
 |  |- wav/
 |     |- file1.wav

@@ -99,6 +103,8 @@ Prepare a dataset in the following format:

 where `metadata.csv` has the following format: `id|transcription`. This is an LJSpeech-like format; you can skip these preprocessing steps if your dataset is in a different format.

+Note that `NAME_DATASET` should be one of `ljspeech`, `kss`, `baker`, or `libritts`.
+
 ## Preprocessing

 The preprocessing has two steps:
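For illustration, a hypothetical `metadata.csv` in the `id|transcription` format described above (the IDs follow the LJSpeech naming style; the transcriptions here are made up):

```
LJ001-0001|A short example sentence.
LJ001-0002|Another example transcription.
```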
@@ -113,20 +119,22 @@ The preprocessing has two steps:

 To reproduce the steps above:
 ```
-tensorflow-tts-preprocess --rootdir ./datasets --outdir ./dump --config preprocess/ljspeech_preprocess.yaml --dataset ljspeech
-tensorflow-tts-normalize --rootdir ./dump --outdir ./dump --config preprocess/ljspeech_preprocess.yaml --dataset ljspeech
+tensorflow-tts-preprocess --rootdir ./[ljspeech/kss/baker/libritts] --outdir ./dump_[ljspeech/kss/baker/libritts] --config preprocess/[ljspeech/kss/baker]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts]
+tensorflow-tts-normalize --rootdir ./dump_[ljspeech/kss/baker/libritts] --outdir ./dump_[ljspeech/kss/baker/libritts] --config preprocess/[ljspeech/kss/baker/libritts]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts]
 ```

-Right now we only support [`ljspeech`](https://keithito.com/LJ-Speech-Dataset/) and [`kss`](https://www.kaggle.com/bryanpark/korean-single-speaker-speech-dataset) for the dataset argument. In the future, we intend to support more datasets.
+Right now we only support [`ljspeech`](https://keithito.com/LJ-Speech-Dataset/), [`kss`](https://www.kaggle.com/bryanpark/korean-single-speaker-speech-dataset), [`baker`](https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar) and [`libritts`](http://www.openslr.org/60/) for the dataset argument. In the future, we intend to support more datasets.
+
+**Note**: To run `libritts` preprocessing, please first read the instructions in [examples/fastspeech2_libritts](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/fastspeech2_libritts); the dataset needs to be reformatted before running preprocessing.

 After preprocessing, the structure of the project folder should be:
 ```
-|- datasets/
+|- [NAME_DATASET]/
 |  |- metadata.csv
 |  |- wav/
 |     |- file1.wav
 |     |- ...
-|- dump/
+|- dump_[ljspeech/kss/baker/libritts]/
 |  |- train/
 |     |- ids/
 |        |- LJ001-0001-ids.npy
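As a concrete instance of the bracketed commands above, the LJSpeech case would be (a sketch following the `dump_[dataset]` and `[dataset]_preprocess.yaml` patterns shown in this diff):

```
tensorflow-tts-preprocess --rootdir ./ljspeech --outdir ./dump_ljspeech --config preprocess/ljspeech_preprocess.yaml --dataset ljspeech
tensorflow-tts-normalize --rootdir ./dump_ljspeech --outdir ./dump_ljspeech --config preprocess/ljspeech_preprocess.yaml --dataset ljspeech
```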
@@ -184,8 +192,10 @@ After preprocessing, the structure of the project folder should be:

 We use a suffix (`ids`, `raw-feats`, `raw-energy`, `raw-f0`, `norm-feats`, and `wave`) for each type of input.

+
 **IMPORTANT NOTES**:
 - This preprocessing step is based on [ESPnet](https://github.com/espnet/espnet), so you can combine all models here with other models from the ESPnet repository.
+- Regardless of how your dataset is formatted, the final structure of the `dump` folder **SHOULD** follow the structure above to be able to use the training script, or you can modify it yourself 😄.

 ## Training models

@@ -194,6 +204,7 @@ To learn how to train a model from scratch or fine-tune with other datasets/lang
 - For the Tacotron-2 tutorial, please see [example/tacotron2](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/tacotron2)
 - For the FastSpeech tutorial, please see [example/fastspeech](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/fastspeech)
 - For the FastSpeech2 tutorial, please see [example/fastspeech2](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/fastspeech2)
+- For the FastSpeech2 + MFA tutorial, please see [example/fastspeech2_libritts](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/fastspeech2_libritts)
 - For the MelGAN tutorial, please see [example/melgan](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/melgan)
 - For the MelGAN + STFT Loss tutorial, please see [example/melgan.stft](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/melgan.stft)
 - For the Multiband-MelGAN tutorial, please see [example/multiband_melgan](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/multiband_melgan)
@@ -237,10 +248,9 @@ import yaml

 import tensorflow as tf

-from tensorflow_tts.processor import LJSpeechProcessor
-
 from tensorflow_tts.inference import AutoConfig
 from tensorflow_tts.inference import TFAutoModel
+from tensorflow_tts.inference import AutoProcessor

 # initialize fastspeech model.
 fs_config = AutoConfig.from_pretrained('/examples/fastspeech/conf/fastspeech.v1.yaml')
@@ -259,7 +269,7 @@ melgan = TFAutoModel.from_pretrained(

 # inference
-processor = LJSpeechProcessor(None, cleaner_names="english_cleaners")
+processor = AutoProcessor.from_pretrained(pretrained_path="./test/files/ljspeech_mapper.json")

 ids = processor.text_to_sequence("Recent research at Harvard has shown meditating for as little as 8 weeks, can actually increase the grey matter in the parts of the brain responsible for emotional regulation, and learning.")
 ids = tf.expand_dims(ids, 0)
@@ -281,7 +291,7 @@ sf.write('./audio_after.wav', audio_after, 22050, "PCM_16")
 ```

 # Contact
-[Minh Nguyen Quan Anh](https://github.com/dathudeptrai): [email protected], [erogol](https://github.com/erogol): [email protected], [Kuan Chen](https://github.com/azraelkuan): [email protected], [Takuya Ebata](https://github.com/MokkeMeguru): [email protected], [Trinh Le Quang](https://github.com/l4zyf9x): [email protected]
+[Minh Nguyen Quan Anh](https://github.com/dathudeptrai): [email protected], [erogol](https://github.com/erogol): [email protected], [Kuan Chen](https://github.com/azraelkuan): [email protected], [Dawid Kobus](https://github.com/machineko): [email protected], [Takuya Ebata](https://github.com/MokkeMeguru): [email protected], [Trinh Le Quang](https://github.com/l4zyf9x): trinhle.cse@gmail.com, [Yunchao He](https://github.com/candlewill): [email protected], [Alejandro Miguel Velasquez](https://github.com/ZDisket): xml506ok@gmail.com

 # License
 Overall, almost all models here are licensed under [Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0) for all countries in the world, except in **Viet Nam**, where this framework cannot be used for production in any way without permission from TensorflowTTS's authors. There is one exception: Tacotron-2 can be used for any purpose. So, if you are Vietnamese and want to use this framework for production, you **must** contact us in advance.
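Putting the inference fragments from the hunks above together, the updated quick-start flow looks roughly like this. This is a sketch assembled from the diff: the pretrained-weight paths, the melgan config path, and the `.inference(...)` call signature are not shown in these hunks, so treat them as assumptions to check against the full README.

```python
import tensorflow as tf
import soundfile as sf

from tensorflow_tts.inference import AutoConfig
from tensorflow_tts.inference import TFAutoModel
from tensorflow_tts.inference import AutoProcessor

# initialize fastspeech model (config path from the diff; weight path is a placeholder).
fs_config = AutoConfig.from_pretrained('/examples/fastspeech/conf/fastspeech.v1.yaml')
fastspeech = TFAutoModel.from_pretrained(config=fs_config, pretrained_path="fastspeech.h5")

# initialize melgan vocoder (both paths below are placeholders).
melgan_config = AutoConfig.from_pretrained('/examples/melgan/conf/melgan.v1.yaml')
melgan = TFAutoModel.from_pretrained(config=melgan_config, pretrained_path="melgan.h5")

# inference: text -> ids -> mel -> waveform.
processor = AutoProcessor.from_pretrained(pretrained_path="./test/files/ljspeech_mapper.json")
ids = processor.text_to_sequence("Hello, world.")
ids = tf.expand_dims(ids, 0)

# assumed inference signature; check the model's docstring before relying on it.
mel_before, mel_after, duration_outputs = fastspeech.inference(
    ids,
    speaker_ids=tf.zeros(shape=[tf.shape(ids)[0]], dtype=tf.int32),
    speed_ratios=tf.ones(shape=[tf.shape(ids)[0]], dtype=tf.float32),
)
audio_after = melgan(mel_after)[0, :, 0]

sf.write('./audio_after.wav', audio_after, 22050, "PCM_16")
```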
Lines changed: 78 additions & 0 deletions

@@ -0,0 +1,78 @@
# This is the hyperparameter configuration file for FastSpeech2 v2.
# The difference between v2 and v1 is that v2 applies the Linformer technique.
# Please make sure this is adjusted for the Baker dataset. If you want to
# apply it to another dataset, you might need to carefully change some parameters.
# This configuration trains for 200k iterations, but the best checkpoint is around 150k iterations.

###########################################################
#              FEATURE EXTRACTION SETTING                 #
###########################################################
hop_size: 256            # Hop size.
format: "npy"


###########################################################
#            NETWORK ARCHITECTURE SETTING                 #
###########################################################
model_type: "fastspeech2"

fastspeech2_params:
    dataset: baker
    n_speakers: 1
    encoder_hidden_size: 256
    encoder_num_hidden_layers: 3
    encoder_num_attention_heads: 2
    encoder_attention_head_size: 16   # in v1, = 384 // 2
    encoder_intermediate_size: 1024
    encoder_intermediate_kernel_size: 3
    encoder_hidden_act: "mish"
    decoder_hidden_size: 256
    decoder_num_hidden_layers: 3
    decoder_num_attention_heads: 2
    decoder_attention_head_size: 16   # in v1, = 384 // 2
    decoder_intermediate_size: 1024
    decoder_intermediate_kernel_size: 3
    decoder_hidden_act: "mish"
    variant_prediction_num_conv_layers: 2
    variant_predictor_filter: 256
    variant_predictor_kernel_size: 3
    variant_predictor_dropout_rate: 0.5
    num_mels: 80
    hidden_dropout_prob: 0.2
    attention_probs_dropout_prob: 0.1
    max_position_embeddings: 2048
    initializer_range: 0.02
    output_attentions: False
    output_hidden_states: False

###########################################################
#                DATA LOADER SETTING                      #
###########################################################
batch_size: 16              # Batch size.
remove_short_samples: true  # Whether to remove samples whose length is less than batch_max_steps.
allow_cache: true           # Whether to allow caching in the dataset. If true, it requires CPU memory.
mel_length_threshold: 32    # Remove all targets with mel_length <= 32.
is_shuffle: true            # Shuffle the dataset after each epoch.
###########################################################
#           OPTIMIZER & SCHEDULER SETTING                 #
###########################################################
optimizer_params:
    initial_learning_rate: 0.001
    end_learning_rate: 0.00005
    decay_steps: 150000       # A value < train_max_steps is recommended.
    warmup_proportion: 0.02
    weight_decay: 0.001


###########################################################
#                  INTERVAL SETTING                       #
###########################################################
train_max_steps: 200000           # Number of training steps.
save_interval_steps: 5000         # Interval steps to save a checkpoint.
eval_interval_steps: 500          # Interval steps to evaluate the network.
log_interval_steps: 200           # Interval steps to record the training log.
delay_f0_energy_steps: 3          # 2 steps use LR outputs only, then 1 step uses LR + F0 + Energy.
###########################################################
#                   OTHER SETTING                         #
###########################################################
num_save_intermediate_results: 1  # Number of batches to be saved as intermediate results.
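As a sanity check when editing a config like this, the nested structure can be inspected with PyYAML. A minimal sketch; the file name below is a placeholder, since the real path is hidden in this commit view:

```python
import yaml

# Placeholder file name: the real path is hidden in this commit view.
with open("fastspeech2.baker.v2.yaml") as f:
    config = yaml.safe_load(f)

# Model hyperparameters live under the nested fastspeech2_params block.
params = config["fastspeech2_params"]
print(params["dataset"], params["n_speakers"])    # baker 1
print(config["optimizer_params"]["decay_steps"])  # 150000
```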
Lines changed: 64 additions & 0 deletions

@@ -0,0 +1,64 @@
# FastSpeech2 multi-speaker (English)

## Prepare
Everything is done from the main repo folder, i.e. TensorflowTTS/.

0. (Optional) [Download](http://www.openslr.org/60/) and prepare LibriTTS (a helper notebook is in examples/fastspeech2_libritts/libri_experiment/prepare_libri.ipynb).
   Dataset structure after finishing this step:
   ```
   |- TensorFlowTTS/
   |  |- LibriTTS/
   |  |  |- train-clean-100/
   |  |  |- SPEAKERS.txt
   |  |  |- ...
   |  |- libritts/
   |  |  |- 200/
   |  |  |  |- 200_124139_000001_000000.txt
   |  |  |  |- 200_124139_000001_000000.wav
   |  |  |  |- ...
   |  |  |- 250/
   |  |  |- ...
   |  |- tensorflow_tts/
   |  |- models/
   |  |- ...
   ```
1. Extract durations (use examples/mfa_extraction or a pretrained Tacotron-2).
2. (Optional) Build the docker image:
   ```
   bash examples/fastspeech2_libritts/scripts/build.sh
   ```
3. (Optional) Run the docker container:
   ```
   bash examples/fastspeech2_libritts/scripts/interactive.sh
   ```
4. Preprocessing:
   ```
   tensorflow-tts-preprocess --rootdir ./libritts \
     --outdir ./dump_libritts \
     --config preprocess/preprocess_libritts.yaml \
     --dataset libritts
   ```
5. Normalization:
   ```
   tensorflow-tts-normalize --rootdir ./dump_libritts \
     --outdir ./dump_libritts \
     --config preprocess/preprocess_libritts.yaml \
     --dataset libritts
   ```
6. Change the CharactorDurationF0EnergyMelDataset speaker mapper in fastspeech2_dataset to match your dataset (if you use LibriTTS with mfa_extraction, you don't need to change anything).
7. Change train_libri.sh to match your dataset and run:
   ```
   bash examples/fastspeech2_libritts/scripts/train_libri.sh
   ```
8. (Optional) If you have problems with tensor size mismatches, check step 5 in the `examples/mfa_extraction` directory.

## Comments

This version uses the popular train.txt '|' split used in other repos. Training files should look like this (see the parsing sketch below):

Wav Path | Text | Speaker Name

Wav Path2 | Text | Speaker Name
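A minimal sketch of reading that pipe-split format (`train.txt` is an assumed file name; the columns are as listed above):

```python
# Parse the pipe-split training file described above.
# "train.txt" is an assumed name; adjust to your dataset.
with open("train.txt", encoding="utf-8") as f:
    for line in f:
        wav_path, text, speaker_name = [field.strip() for field in line.strip().split("|")]
        print(wav_path, speaker_name)
```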
Lines changed: 76 additions & 0 deletions

@@ -0,0 +1,76 @@
# This is the hyperparameter configuration file for FastSpeech2 v1.
# Please make sure this is adjusted for the LibriTTS dataset. If you want to
# apply it to another dataset, you might need to carefully change some parameters.
# This configuration trains for 200k iterations, but the best checkpoint is around 150k iterations.

###########################################################
#              FEATURE EXTRACTION SETTING                 #
###########################################################
hop_size: 256            # Hop size.
format: "npy"

###########################################################
#            NETWORK ARCHITECTURE SETTING                 #
###########################################################
model_type: fastspeech2

fastspeech2_params:
    dataset: "libritts"
    n_speakers: 20
    encoder_hidden_size: 384
    encoder_num_hidden_layers: 4
    encoder_num_attention_heads: 2
    encoder_attention_head_size: 192   # hidden_size // num_attention_heads
    encoder_intermediate_size: 1024
    encoder_intermediate_kernel_size: 3
    encoder_hidden_act: "mish"
    decoder_hidden_size: 384
    decoder_num_hidden_layers: 4
    decoder_num_attention_heads: 2
    decoder_attention_head_size: 192   # hidden_size // num_attention_heads
    decoder_intermediate_size: 1024
    decoder_intermediate_kernel_size: 3
    decoder_hidden_act: "mish"
    variant_prediction_num_conv_layers: 2
    variant_predictor_filter: 256
    variant_predictor_kernel_size: 3
    variant_predictor_dropout_rate: 0.5
    num_mels: 80
    hidden_dropout_prob: 0.2
    attention_probs_dropout_prob: 0.1
    max_position_embeddings: 2048
    initializer_range: 0.02
    output_attentions: False
    output_hidden_states: False

###########################################################
#                DATA LOADER SETTING                      #
###########################################################
batch_size: 32              # Batch size.
remove_short_samples: true  # Whether to remove samples whose length is less than batch_max_steps.
allow_cache: true           # Whether to allow caching in the dataset. If true, it requires CPU memory.
mel_length_threshold: 48    # Remove all targets with mel_length <= 48.
is_shuffle: true            # Shuffle the dataset after each epoch.
###########################################################
#           OPTIMIZER & SCHEDULER SETTING                 #
###########################################################
optimizer_params:
    initial_learning_rate: 0.0001
    end_learning_rate: 0.00001
    decay_steps: 120000       # A value < train_max_steps is recommended.
    warmup_proportion: 0.02
    weight_decay: 0.001


###########################################################
#                  INTERVAL SETTING                       #
###########################################################
train_max_steps: 150000           # Number of training steps.
save_interval_steps: 5000         # Interval steps to save a checkpoint.
eval_interval_steps: 5000         # Interval steps to evaluate the network.
log_interval_steps: 200           # Interval steps to record the training log.
###########################################################
#                   OTHER SETTING                         #
###########################################################
use_griffin: true                 # Whether to use Griffin-Lim on evaluation.
num_save_intermediate_results: 1  # Number of batches to be saved as intermediate results.
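The `hidden_size // num_attention_heads` comments above encode an invariant worth asserting when tweaking these configs. A hedged sketch (the file name is a placeholder, since the real path is hidden in this commit view):

```python
import yaml

# Placeholder file name: the real path is hidden in this commit view.
with open("fastspeech2.libritts.v1.yaml") as f:
    p = yaml.safe_load(f)["fastspeech2_params"]

# From the inline comments: attention_head_size = hidden_size // num_attention_heads.
assert p["encoder_attention_head_size"] == p["encoder_hidden_size"] // p["encoder_num_attention_heads"]  # 384 // 2 == 192
assert p["decoder_attention_head_size"] == p["decoder_hidden_size"] // p["decoder_num_attention_heads"]
```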
