
Commit 2182343

🔧 Merge master to branch and fix conflict.
2 parents 138f11c + 3c388e7 · commit 2182343


58 files changed: +2840 −99 lines changed

.gitignore

Lines changed: 3 additions & 1 deletion
````diff
@@ -32,5 +32,7 @@ ljspeech
 /datasets
 /examples/tacotron2/exp/
 /temp/
+LibriTTS/
+dataset/
+mfa/
 kss
-LibriTTS
````

README.md

Lines changed: 9 additions & 5 deletions
````diff
@@ -19,6 +19,7 @@
 :zany_face: TensorflowTTS provides real-time state-of-the-art speech synthesis architectures such as Tacotron-2, Melgan, Multiband-Melgan, FastSpeech, and FastSpeech2, based on TensorFlow 2. With TensorFlow 2, we can speed up training/inference and optimize further using [fake-quantize aware](https://www.tensorflow.org/model_optimization/guide/quantization/training_comprehensive_guide) training and [pruning](https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras), so TTS models can run faster than real-time and be deployed on mobile devices or embedded systems.
 
 ## What's new
+- 2020/08/14 **(NEW!)** Support Chinese TTS. Please see the [colab](https://colab.research.google.com/drive/1YpSHRBRPBI7cnTkQn1UcVTWEQVbsUm1S?usp=sharing). Thanks [@azraelkuan](https://github.com/azraelkuan).
 - 2020/08/05 **(NEW!)** Support Korean TTS. Please see the [colab](https://colab.research.google.com/drive/1ybWwOS5tipgPFttNulp77P6DAB5MtiuN?usp=sharing). Thanks [@crux153](https://github.com/crux153).
 - 2020/07/17 Support multi-GPU for all trainers.
 - 2020/07/05 Support converting Tacotron-2 and FastSpeech to TFLite. Please see the [colab](https://colab.research.google.com/drive/1HudLLpT9CQdh2k04c06bHUwLubhGTWxA?usp=sharing). Thanks @jaeyoo from the TFLite team for his support.
````
````diff
@@ -35,15 +36,17 @@
 - Mixed precision to speed up training if possible.
 - Support both single and multi-GPU in the base trainer class.
 - TFLite conversion for all supported models.
+- Android example.
+- Support many languages (currently Chinese, Korean, and English).
 
 ## Requirements
 This repository is tested on Ubuntu 18.04 with:
 
-- Python 3.6+
+- Python 3.7+
 - CUDA 10.1
 - CuDNN 7.6.5
 - TensorFlow 2.2/2.3
-- [Tensorflow Addons](https://github.com/tensorflow/addons) 0.10.0
+- [Tensorflow Addons](https://github.com/tensorflow/addons) >= 0.10.0
 
 Different TensorFlow versions should work but have not been tested yet. This repo tries to track the latest stable TensorFlow version. **We recommend installing TensorFlow 2.3.0 for training if you want to use multi-GPU.**
 
````
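As a quick aside, one way to satisfy the updated requirements above in a fresh environment; a minimal sketch assuming pip on a Python 3.7+ interpreter, with illustrative version pins that are not taken from this commit:

```bash
# Illustrative install matching the updated requirements;
# the exact pins are examples, only the lower bounds come from the README.
pip install "tensorflow==2.3.0" "tensorflow-addons>=0.10.0"
```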

````diff
@@ -113,11 +116,11 @@ The preprocessing has two steps:
 
 To reproduce the steps above:
 ```
-tensorflow-tts-preprocess --rootdir ./datasets --outdir ./dump --config preprocess/ljspeech_preprocess.yaml --dataset ljspeech
-tensorflow-tts-normalize --rootdir ./dump --outdir ./dump --config preprocess/ljspeech_preprocess.yaml --dataset ljspeech
+tensorflow-tts-preprocess --rootdir ./datasets --outdir ./dump --config preprocess/[ljspeech/kss/baker]_preprocess.yaml --dataset [ljspeech/kss/baker]
+tensorflow-tts-normalize --rootdir ./dump --outdir ./dump --config preprocess/[ljspeech/kss/baker]_preprocess.yaml --dataset [ljspeech/kss/baker]
 ```
 
-Right now we only support [`ljspeech`](https://keithito.com/LJ-Speech-Dataset/) and [`kss`](https://www.kaggle.com/bryanpark/korean-single-speaker-speech-dataset) for the dataset argument. In the future, we intend to support more datasets.
+Right now we only support [`ljspeech`](https://keithito.com/LJ-Speech-Dataset/), [`kss`](https://www.kaggle.com/bryanpark/korean-single-speaker-speech-dataset), and [`baker`](https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar) for the dataset argument. In the future, we intend to support more datasets.
 
 After preprocessing, the structure of the project folder should be:
 ```
````
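For instance, substituting `baker` for the bracketed placeholders gives a concrete pair of commands, assuming `preprocess/baker_preprocess.yaml` follows the same naming pattern as the ljspeech config in the removed lines:

```bash
# Concrete instance of the placeholder commands above for the baker dataset.
tensorflow-tts-preprocess --rootdir ./datasets --outdir ./dump \
  --config preprocess/baker_preprocess.yaml --dataset baker
tensorflow-tts-normalize --rootdir ./dump --outdir ./dump \
  --config preprocess/baker_preprocess.yaml --dataset baker
```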
````diff
@@ -184,6 +187,7 @@ After preprocessing, the structure of the project folder should be:
 
 We use a suffix (`ids`, `raw-feats`, `raw-energy`, `raw-f0`, `norm-feats`, and `wave`) for each type of input.
 
+
 **IMPORTANT NOTES**:
 - This preprocessing step is based on [ESPnet](https://github.com/espnet/espnet) so you can combine all models here with other models from the ESPnet repository.
````

Lines changed: 78 additions & 0 deletions
```yaml
# This is the hyperparameter configuration file for FastSpeech2 v2.
# The difference between v2 and v1 is that v2 applies the Linformer technique.
# Please make sure this is adjusted for the Baker dataset. If you want to
# apply it to another dataset, you might need to carefully change some parameters.
# This configuration performs 200k iters, but the best checkpoint is around 150k iters.

###########################################################
#                FEATURE EXTRACTION SETTING                #
###########################################################
hop_size: 256    # Hop size.
format: "npy"


###########################################################
#              NETWORK ARCHITECTURE SETTING                #
###########################################################
model_type: "fastspeech2"

fastspeech2_params:
    dataset: baker
    n_speakers: 1
    encoder_hidden_size: 256
    encoder_num_hidden_layers: 3
    encoder_num_attention_heads: 2
    encoder_attention_head_size: 16  # in v1, = 384//2
    encoder_intermediate_size: 1024
    encoder_intermediate_kernel_size: 3
    encoder_hidden_act: "mish"
    decoder_hidden_size: 256
    decoder_num_hidden_layers: 3
    decoder_num_attention_heads: 2
    decoder_attention_head_size: 16  # in v1, = 384//2
    decoder_intermediate_size: 1024
    decoder_intermediate_kernel_size: 3
    decoder_hidden_act: "mish"
    variant_prediction_num_conv_layers: 2
    variant_predictor_filter: 256
    variant_predictor_kernel_size: 3
    variant_predictor_dropout_rate: 0.5
    num_mels: 80
    hidden_dropout_prob: 0.2
    attention_probs_dropout_prob: 0.1
    max_position_embeddings: 2048
    initializer_range: 0.02
    output_attentions: False
    output_hidden_states: False

###########################################################
#                   DATA LOADER SETTING                    #
###########################################################
batch_size: 16              # Batch size.
remove_short_samples: true  # Whether to remove samples shorter than batch_max_steps.
allow_cache: true           # Whether to cache the dataset; if true, it requires CPU memory.
mel_length_threshold: 32    # Remove all targets with mel_length <= 32.
is_shuffle: true            # Shuffle the dataset after each epoch.
###########################################################
#              OPTIMIZER & SCHEDULER SETTING               #
###########################################################
optimizer_params:
    initial_learning_rate: 0.001
    end_learning_rate: 0.00005
    decay_steps: 150000      # A value < train_max_steps is recommended.
    warmup_proportion: 0.02
    weight_decay: 0.001


###########################################################
#                     INTERVAL SETTING                     #
###########################################################
train_max_steps: 200000      # Number of training steps.
save_interval_steps: 5000    # Interval steps to save checkpoints.
eval_interval_steps: 500     # Interval steps to evaluate the network.
log_interval_steps: 200      # Interval steps to record the training log.
delay_f0_energy_steps: 3     # 2 steps use LR (length regulator) outputs only, then 1 step uses LR + F0 + energy.
###########################################################
#                      OTHER SETTING                       #
###########################################################
num_save_intermediate_results: 1  # Number of batches to be saved as intermediate results.
```
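For context, a config like this is consumed by the FastSpeech2 training entry point via its `--config` flag. Below is a hypothetical launch, assuming the `examples/fastspeech2/train_fastspeech2.py` script and flags from this repo's FastSpeech2 example; the config file path is an assumption, since the new file's path is not visible in this view:

```bash
# Hypothetical launch; the script path, flags, and config file name are assumed
# from the repo's FastSpeech2 example, not confirmed by this commit view.
CUDA_VISIBLE_DEVICES=0 python examples/fastspeech2/train_fastspeech2.py \
  --train-dir ./dump/train/ \
  --dev-dir ./dump/valid/ \
  --outdir ./examples/fastspeech2/exp/train.fastspeech2.baker.v2/ \
  --config ./examples/fastspeech2/conf/fastspeech2.baker.v2.yaml \
  --use-norm 1 \
  --f0-stat ./dump/stats_f0.npy \
  --energy-stat ./dump/stats_energy.npy \
  --mixed_precision 1
```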
Lines changed: 64 additions & 0 deletions
````markdown
# FastSpeech2 multi-speaker, English-language based

## Prepare
Everything is done from the main repo folder, i.e. TensorflowTTS/.

0. (Optional) [Download](http://www.openslr.org/60/) and prepare LibriTTS (a helper to prepare LibriTTS is in examples/fastspeech2_multispeaker/libri_experiment/prepare_libri.ipynb).
   - Dataset structure after finishing this step:
   ```
   |- TensorFlowTTS/
   | |- LibriTTS/
   | |- |- train-clean-100/
   | |- |- SPEAKERS.txt
   | |- |- ...
   | |- dataset/
   | |- |- 200/
   | |- |- |- 200_124139_000001_000000.txt
   | |- |- |- 200_124139_000001_000000.wav
   | |- |- |- ...
   | |- |- 250/
   | |- |- ...
   | |- tensorflow_tts/
   | |- models/
   | |- ...
   ```
1. Extract durations (use examples/mfa_extraction or a pretrained Tacotron-2).
2. (Optional) Build the Docker image:
   ```
   bash examples/fastspeech2_multispeaker/scripts/build.sh
   ```
3. (Optional) Run the Docker container:
   ```
   bash examples/fastspeech2_multispeaker/scripts/interactive.sh
   ```
4. Preprocessing:
   ```
   tensorflow-tts-preprocess --rootdir ./dataset \
     --outdir ./dump \
     --config preprocess/preprocess_libritts.yaml \
     --dataset multispeaker
   ```
5. Normalization:
   ```
   tensorflow-tts-normalize --rootdir ./dump \
     --outdir ./dump \
     --config preprocess/preprocess_libritts.yaml \
     --dataset multispeaker
   ```
6. Change the CharactorDurationF0EnergyMelDataset speaker mapper in fastspeech2_dataset to match your dataset (if you use LibriTTS with mfa_extraction, you don't need to change anything).
7. Change train_libri.sh to match your dataset and run:
   ```
   bash examples/fastspeech2_multispeaker/scripts/train_libri.sh
   ```
8. (Optional) If you have problems with tensor size mismatches, check step 5 in the `examples/mfa_extraction` directory.

## Comments

This version uses the popular '|'-separated train.txt format used in other repos. Training files should look like this:

Wav Path | Text | Speaker Name

Wav Path2 | Text | Speaker Name
````
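For concreteness, here is a hypothetical two-line train.txt in that format. The first path reuses a file name from the dataset structure above; the second path and both transcripts are invented for illustration, and whether the split keeps spaces around '|' depends on the loader:

```
dataset/200/200_124139_000001_000000.wav|An example transcript for this utterance.|200
dataset/250/250_100000_000001_000000.wav|Another invented transcript.|250
```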
Lines changed: 75 additions & 0 deletions
```yaml
# This is the hyperparameter configuration file for FastSpeech2 v1.
# Please make sure this is adjusted for the LibriTTS dataset. If you want to
# apply it to another dataset, you might need to carefully change some parameters.
# This configuration performs 200k iters, but the best checkpoint is around 150k iters.

###########################################################
#                FEATURE EXTRACTION SETTING                #
###########################################################
hop_size: 256    # Hop size.
format: "npy"

###########################################################
#              NETWORK ARCHITECTURE SETTING                #
###########################################################
model_type: fastspeech2

fastspeech2_params:
    n_speakers: 20
    encoder_hidden_size: 384
    encoder_num_hidden_layers: 4
    encoder_num_attention_heads: 2
    encoder_attention_head_size: 192  # hidden_size // num_attention_heads
    encoder_intermediate_size: 1024
    encoder_intermediate_kernel_size: 3
    encoder_hidden_act: "mish"
    decoder_hidden_size: 384
    decoder_num_hidden_layers: 4
    decoder_num_attention_heads: 2
    decoder_attention_head_size: 192  # hidden_size // num_attention_heads
    decoder_intermediate_size: 1024
    decoder_intermediate_kernel_size: 3
    decoder_hidden_act: "mish"
    variant_prediction_num_conv_layers: 2
    variant_predictor_filter: 256
    variant_predictor_kernel_size: 3
    variant_predictor_dropout_rate: 0.5
    num_mels: 80
    hidden_dropout_prob: 0.2
    attention_probs_dropout_prob: 0.1
    max_position_embeddings: 2048
    initializer_range: 0.02
    output_attentions: False
    output_hidden_states: False

###########################################################
#                   DATA LOADER SETTING                    #
###########################################################
batch_size: 32              # Batch size.
remove_short_samples: true  # Whether to remove samples shorter than batch_max_steps.
allow_cache: true           # Whether to cache the dataset; if true, it requires CPU memory.
mel_length_threshold: 48    # Remove all targets with mel_length <= 48.
is_shuffle: true            # Shuffle the dataset after each epoch.
###########################################################
#              OPTIMIZER & SCHEDULER SETTING               #
###########################################################
optimizer_params:
    initial_learning_rate: 0.0001
    end_learning_rate: 0.00001
    decay_steps: 120000      # A value < train_max_steps is recommended.
    warmup_proportion: 0.02
    weight_decay: 0.001


###########################################################
#                     INTERVAL SETTING                     #
###########################################################
train_max_steps: 150000      # Number of training steps.
save_interval_steps: 5000    # Interval steps to save checkpoints.
eval_interval_steps: 5000    # Interval steps to evaluate the network.
log_interval_steps: 200      # Interval steps to record the training log.
###########################################################
#                      OTHER SETTING                       #
###########################################################
use_griffin: true                 # Whether to use Griffin-Lim during evaluation.
num_save_intermediate_results: 1  # Number of batches to be saved as intermediate results.
```
