Commit 93ec250

Merge branch 'dev/chinese_example' of https://github.com/TensorSpeech/TensorflowTTS into dev/chinese_example
2 parents 1adb09e + 3813890 commit 93ec250

File tree

21 files changed

+944
-35
lines changed

README.md

Lines changed: 17 additions & 14 deletions
@@ -1,11 +1,11 @@
11
<h2 align="center">
22
<p> :yum: TensorflowTTS
33
<p align="center">
4-
<a href="https://github.com/dathudeptrai/TensorflowTTS/actions">
5-
<img alt="Build" src="https://github.com/dathudeptrai/TensorflowTTS/workflows/CI/badge.svg?branch=master">
4+
<a href="https://github.com/tensorspeech/TensorFlowTTS/actions">
5+
<img alt="Build" src="https://github.com/tensorspeech/TensorFlowTTS/workflows/CI/badge.svg?branch=master">
66
</a>
7-
<a href="https://github.com/dathudeptrai/TensorflowTTS/blob/master/LICENSE">
8-
<img alt="GitHub" src="https://img.shields.io/github/license/dathudeptrai/TensorflowTTS?color=red">
7+
<a href="https://github.com/tensorspeech/TensorFlowTTS/blob/master/LICENSE">
8+
<img alt="GitHub" src="https://img.shields.io/github/license/tensorspeech/TensorflowTTS?color=red">
99
</a>
1010
<a href="https://colab.research.google.com/drive/1akxtrLZHKuMiQup00tzO2olCaN-y3KiD?usp=sharing">
1111
<img alt="Colab" src="https://colab.research.google.com/assets/colab-badge.svg">
@@ -19,8 +19,9 @@
1919
:zany_face: TensorflowTTS provides real-time state-of-the-art speech synthesis architectures such as Tacotron-2, Melgan, Multiband-Melgan, FastSpeech, FastSpeech2 based-on TensorFlow 2. With Tensorflow 2, we can speed-up training/inference progress, optimizer further by using [fake-quantize aware](https://www.tensorflow.org/model_optimization/guide/quantization/training_comprehensive_guide) and [pruning](https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras), make TTS models can be run faster than real-time and be able to deploy on mobile devices or embedded systems.
2020

2121
## What's new
22-
- 2020/07/17 **(NEW!)** Support MultiGPU for all Trainer.
23-
- 2020/07/05 **(New!)** Support Convert Tacotron-2, FastSpeech to Tflite. Pls see the [colab](https://colab.research.google.com/drive/1HudLLpT9CQdh2k04c06bHUwLubhGTWxA?usp=sharing). Thank @jaeyoo from TFlite team for his support.
22+
- 2020/08/05 **(NEW!)** Support Korean TTS. Pls see the [colab](https://colab.research.google.com/drive/1ybWwOS5tipgPFttNulp77P6DAB5MtiuN?usp=sharing). Thank [@crux153](https://github.com/crux153).
23+
- 2020/07/17 Support MultiGPU for all Trainer.
24+
- 2020/07/05 Support Convert Tacotron-2, FastSpeech to Tflite. Pls see the [colab](https://colab.research.google.com/drive/1HudLLpT9CQdh2k04c06bHUwLubhGTWxA?usp=sharing). Thank @jaeyoo from TFlite team for his support.
2425
- 2020/06/20 [FastSpeech2](https://arxiv.org/abs/2006.04558) implementation with Tensorflow is supported.
2526
- 2020/06/07 [Multi-band MelGAN (MB MelGAN)](https://github.com/dathudeptrai/TensorflowTTS/blob/master/examples/multiband_melgan/) implementation with Tensorflow is supported.
2627

@@ -41,7 +42,7 @@ This repository is tested on Ubuntu 18.04 with:
4142
- Python 3.6+
4243
- Cuda 10.1
4344
- CuDNN 7.6.5
44-
- Tensorflow 2.2
45+
- Tensorflow 2.2/2.3
4546
- [Tensorflow Addons](https://github.com/tensorflow/addons) 0.10.0
4647

4748
Different Tensorflow version should be working but not tested yet. This repo will try to work with latest stable tensorflow version. **We recommend you install tensorflow 2.3.0 to training in case you want to use MultiGPU.**
@@ -54,8 +55,8 @@ $ pip install TensorflowTTS
5455
### From source
5556
Examples are included in the repository but are not shipped with the framework. Therefore, in order to run the latest verion of examples, you need install from source following bellow.
5657
```bash
57-
$ git clone https://github.com/TensorSpeech/TensorflowTTS.git
58-
$ cd TensorflowTTS
58+
$ git clone https://github.com/TensorSpeech/TensorFlowTTS.git
59+
$ cd TensorFlowTTS
5960
$ pip install .
6061
```
6162
If you want upgrade the repository and its dependencies:
@@ -112,10 +113,12 @@ The preprocessing has two steps:
112113

113114
To reproduce the steps above:
114115
```
115-
tensorflow-tts-preprocess --rootdir ./datasets --outdir ./dump --config preprocess/ljspeech_preprocess.yaml
116-
tensorflow-tts-normalize --rootdir ./dump --outdir ./dump --config preprocess/ljspeech_preprocess.yaml
116+
tensorflow-tts-preprocess --rootdir ./datasets --outdir ./dump --config preprocess/ljspeech_preprocess.yaml --dataset ljspeech
117+
tensorflow-tts-normalize --rootdir ./dump --outdir ./dump --config preprocess/ljspeech_preprocess.yaml --dataset ljspeech
117118
```
118119

120+
Right now we only support [`ljspeech`](https://keithito.com/LJ-Speech-Dataset/) and [`kss`](https://www.kaggle.com/bryanpark/korean-single-speaker-speech-dataset) for dataset argument. In the future, we intend to support more datasets.
121+
119122
After preprocessing, the structure of the project folder should be:
120123
```
121124
|- datasets/
@@ -225,7 +228,7 @@ A detail implementation of base_trainer from [tensorflow_tts/trainer/base_traine
225228
All models on this repo are trained based-on **GanBasedTrainer** (see [train_melgan.py](https://github.com/dathudeptrai/TensorflowTTS/blob/master/examples/melgan/train_melgan.py), [train_melgan_stft.py](https://github.com/dathudeptrai/TensorflowTTS/blob/master/examples/melgan.stft/train_melgan_stft.py), [train_multiband_melgan.py](https://github.com/dathudeptrai/TensorflowTTS/blob/master/examples/multiband_melgan/train_multiband_melgan.py)) and **Seq2SeqBasedTrainer** (see [train_tacotron2.py](https://github.com/dathudeptrai/TensorflowTTS/blob/master/examples/tacotron2/train_tacotron2.py), [train_fastspeech.py](https://github.com/dathudeptrai/TensorflowTTS/blob/master/examples/fastspeech/train_fastspeech.py)).
226229

227230
# End-to-End Examples
228-
You can know how to inference each model at [notebooks](https://github.com/dathudeptrai/TensorflowTTS/tree/master/notebooks) or see a [colab](https://colab.research.google.com/drive/1akxtrLZHKuMiQup00tzO2olCaN-y3KiD?usp=sharing). Here is an example code for end2end inference with fastspeech and melgan.
231+
You can know how to inference each model at [notebooks](https://github.com/dathudeptrai/TensorflowTTS/tree/master/notebooks) or see a [colab](https://colab.research.google.com/drive/1akxtrLZHKuMiQup00tzO2olCaN-y3KiD?usp=sharing) (for English), [colab](https://colab.research.google.com/drive/1ybWwOS5tipgPFttNulp77P6DAB5MtiuN?usp=sharing) (for Korean). Here is an example code for end2end inference with fastspeech and melgan.
229232

230233
```python
231234
import numpy as np
@@ -242,15 +245,15 @@ from tensorflow_tts.inference import TFAutoModel
242245
# initialize fastspeech model.
243246
fs_config = AutoConfig.from_pretrained('/examples/fastspeech/conf/fastspeech.v1.yaml')
244247
fastspeech = TFAutoModel.from_pretrained(
245-
config=config,
248+
config=fs_config,
246249
pretrained_path="./examples/fastspeech/pretrained/model-195000.h5"
247250
)
248251

249252

250253
# initialize melgan model
251254
melgan_config = AutoConfig.from_pretrained('./examples/melgan/conf/melgan.v1.yaml')
252255
melgan = TFAutoModel.from_pretrained(
253-
config=config,
256+
config=melgan_config,
254257
pretrained_path="./examples/melgan/checkpoint/generator-1500000.h5"
255258
)
256259

examples/fastspeech2/README.md

Lines changed: 3 additions & 1 deletion
@@ -52,7 +52,9 @@ CUDA_VISIBLE_DEVICES=0 python examples/fastspeech2/decode_fastspeech2.py \
5252
## Pretrained Models and Audio samples
5353
| Model | Conf | Lang | Fs [Hz] | Mel range [Hz] | FFT / Hop / Win [pt] | # iters |
5454
| :------ | :---: | :---: | :----: | :--------: | :---------------: | :-----: |
55-
| [fastspeech2.v1](https://drive.google.com/drive/folders/158vFyC2pxw9xKdxp-C5WPEtgtUiWZYE0?usp=sharing) | [link](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/fastspeech2/conf/fastspeech2.v1.yaml) | EN | 22.05k | 80-7600 | 1024 / 256 / None | 150k |
55+
| [fastspeech2.v1](https://drive.google.com/drive/folders/158vFyC2pxw9xKdxp-C5WPEtgtUiWZYE0?usp=sharing) | [link](https://github.com/TensorSpeech/TensorFlowTTS/blob/master/examples/fastspeech2/conf/fastspeech2.v1.yaml) | EN | 22.05k | 80-7600 | 1024 / 256 / None | 150k |
56+
| [fastspeech2.kss.v1](https://drive.google.com/drive/folders/1DU952--jVnJ5SZDSINRs7dVVSpdB7tC_?usp=sharing) | [link](https://github.com/TensorSpeech/TensorFlowTTS/blob/master/examples/fastspeech2/conf/fastspeech2.kss.v1.yaml) | KO | 22.05k | 80-7600 | 1024 / 256 / None | 200k |
57+
| [fastspeech2.kss.v2](https://drive.google.com/drive/folders/1G3-AJnEsu2rYXYgo2iGIVJfCqqfbpwMu?usp=sharing) | [link](https://github.com/TensorSpeech/TensorFlowTTS/blob/master/examples/fastspeech2/conf/fastspeech2.kss.v2.yaml) | KO | 22.05k | 80-7600 | 1024 / 256 / None | 200k |
5658

5759
## Reference
5860

Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
1+
# This is the hyperparameter configuration file for FastSpeech2 v1.
2+
# Please make sure this is adjusted for the KSS dataset. If you want to
3+
# apply to the other dataset, you might need to carefully change some parameters.
4+
# This configuration performs 200k iters but a best checkpoint is around 150k iters.
5+
6+
###########################################################
7+
# FEATURE EXTRACTION SETTING #
8+
###########################################################
9+
hop_size: 256 # Hop size.
10+
format: "npy"
11+
12+
13+
###########################################################
14+
# NETWORK ARCHITECTURE SETTING #
15+
###########################################################
16+
model_type: "fastspeech2"
17+
18+
fastspeech2_params:
19+
dataset: "kss"
20+
n_speakers: 1
21+
encoder_hidden_size: 384
22+
encoder_num_hidden_layers: 4
23+
encoder_num_attention_heads: 2
24+
encoder_attention_head_size: 192 # hidden_size // num_attention_heads
25+
encoder_intermediate_size: 1024
26+
encoder_intermediate_kernel_size: 3
27+
encoder_hidden_act: "mish"
28+
decoder_hidden_size: 384
29+
decoder_num_hidden_layers: 4
30+
decoder_num_attention_heads: 2
31+
decoder_attention_head_size: 192 # hidden_size // num_attention_heads
32+
decoder_intermediate_size: 1024
33+
decoder_intermediate_kernel_size: 3
34+
decoder_hidden_act: "mish"
35+
variant_prediction_num_conv_layers: 2
36+
variant_predictor_filter: 256
37+
variant_predictor_kernel_size: 3
38+
variant_predictor_dropout_rate: 0.5
39+
num_mels: 80
40+
hidden_dropout_prob: 0.2
41+
attention_probs_dropout_prob: 0.1
42+
max_position_embeddings: 2048
43+
initializer_range: 0.02
44+
output_attentions: False
45+
output_hidden_states: False
46+
47+
###########################################################
48+
# DATA LOADER SETTING #
49+
###########################################################
50+
batch_size: 16 # Batch size.
51+
remove_short_samples: true # Whether to remove samples the length of which are less than batch_max_steps.
52+
allow_cache: true # Whether to allow cache in dataset. If true, it requires cpu memory.
53+
mel_length_threshold: 32 # remove all targets has mel_length <= 32
54+
is_shuffle: true # shuffle dataset after each epoch.
55+
###########################################################
56+
# OPTIMIZER & SCHEDULER SETTING #
57+
###########################################################
58+
optimizer_params:
59+
initial_learning_rate: 0.001
60+
end_learning_rate: 0.00005
61+
decay_steps: 150000 # < train_max_steps is recommend.
62+
warmup_proportion: 0.02
63+
weight_decay: 0.001
64+
65+
66+
###########################################################
67+
# INTERVAL SETTING #
68+
###########################################################
69+
train_max_steps: 200000 # Number of training steps.
70+
save_interval_steps: 5000 # Interval steps to save checkpoint.
71+
eval_interval_steps: 500 # Interval steps to evaluate the network.
72+
log_interval_steps: 200 # Interval steps to record the training log.
73+
###########################################################
74+
# OTHER SETTING #
75+
###########################################################
76+
num_save_intermediate_results: 1 # Number of batch to be saved as intermediate results.
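Review note: the inline comment `hidden_size // num_attention_heads` on the two `*_attention_head_size` keys can be checked mechanically. The sketch below is a minimal sanity check, not part of the commit. The `params` dict is copied by hand from the config above, and `expected_head_size` is a hypothetical helper name. The v2 KSS config in this same commit sets the head size to 16 with a hidden size of 256, so the relationship appears to be a v1 convention rather than a hard requirement.

```python
# Values copied by hand from the fastspeech2_params block of the v1 KSS config.
params = {
    "encoder_hidden_size": 384,
    "encoder_num_attention_heads": 2,
    "encoder_attention_head_size": 192,
    "decoder_hidden_size": 384,
    "decoder_num_attention_heads": 2,
    "decoder_attention_head_size": 192,
}


def expected_head_size(hidden_size, num_heads):
    """Per-head width when the hidden vector is split evenly across heads.

    Hypothetical helper for illustration; it is not a TensorFlowTTS function.
    """
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    return hidden_size // num_heads


for side in ("encoder", "decoder"):
    expected = expected_head_size(
        params[f"{side}_hidden_size"],
        params[f"{side}_num_attention_heads"],
    )
    actual = params[f"{side}_attention_head_size"]
    # Report whether the configured head size follows the v1 convention.
    print(f"{side}: expected={expected} configured={actual} match={expected == actual}")
```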
Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
1+
# This is the hyperparameter configuration file for FastSpeech2 v2.
2+
# the different of v2 and v1 is that v2 apply linformer technique.
3+
# Please make sure this is adjusted for the KSS dataset. If you want to
4+
# apply to the other dataset, you might need to carefully change some parameters.
5+
# This configuration performs 200k iters but a best checkpoint is around 150k iters.
6+
7+
###########################################################
8+
# FEATURE EXTRACTION SETTING #
9+
###########################################################
10+
hop_size: 256 # Hop size.
11+
format: "npy"
12+
13+
14+
###########################################################
15+
# NETWORK ARCHITECTURE SETTING #
16+
###########################################################
17+
model_type: "fastspeech2"
18+
19+
fastspeech2_params:
20+
dataset: "kss"
21+
n_speakers: 1
22+
encoder_hidden_size: 256
23+
encoder_num_hidden_layers: 3
24+
encoder_num_attention_heads: 2
25+
encoder_attention_head_size: 16 # in v1, = 384//2
26+
encoder_intermediate_size: 1024
27+
encoder_intermediate_kernel_size: 3
28+
encoder_hidden_act: "mish"
29+
decoder_hidden_size: 256
30+
decoder_num_hidden_layers: 3
31+
decoder_num_attention_heads: 2
32+
decoder_attention_head_size: 16 # in v1, = 384//2
33+
decoder_intermediate_size: 1024
34+
decoder_intermediate_kernel_size: 3
35+
decoder_hidden_act: "mish"
36+
variant_prediction_num_conv_layers: 2
37+
variant_predictor_filter: 256
38+
variant_predictor_kernel_size: 3
39+
variant_predictor_dropout_rate: 0.5
40+
num_mels: 80
41+
hidden_dropout_prob: 0.2
42+
attention_probs_dropout_prob: 0.1
43+
max_position_embeddings: 2048
44+
initializer_range: 0.02
45+
output_attentions: False
46+
output_hidden_states: False
47+
48+
###########################################################
49+
# DATA LOADER SETTING #
50+
###########################################################
51+
batch_size: 16 # Batch size.
52+
remove_short_samples: true # Whether to remove samples the length of which are less than batch_max_steps.
53+
allow_cache: true # Whether to allow cache in dataset. If true, it requires cpu memory.
54+
mel_length_threshold: 32 # remove all targets has mel_length <= 32
55+
is_shuffle: true # shuffle dataset after each epoch.
56+
###########################################################
57+
# OPTIMIZER & SCHEDULER SETTING #
58+
###########################################################
59+
optimizer_params:
60+
initial_learning_rate: 0.001
61+
end_learning_rate: 0.00005
62+
decay_steps: 150000 # < train_max_steps is recommend.
63+
warmup_proportion: 0.02
64+
weight_decay: 0.001
65+
66+
67+
###########################################################
68+
# INTERVAL SETTING #
69+
###########################################################
70+
train_max_steps: 200000 # Number of training steps.
71+
save_interval_steps: 5000 # Interval steps to save checkpoint.
72+
eval_interval_steps: 500 # Interval steps to evaluate the network.
73+
log_interval_steps: 200 # Interval steps to record the training log.
74+
delay_f0_energy_steps: 3 # 2 steps use LR outputs only then 1 steps LR + F0 + Energy.
75+
###########################################################
76+
# OTHER SETTING #
77+
###########################################################
78+
num_save_intermediate_results: 1 # Number of batch to be saved as intermediate results.
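Review note: the `optimizer_params` block, together with `train_max_steps`, implies a learning-rate curve that is easier to judge when written out. The sketch below is one plausible reading, not the trainer's actual code. It assumes a linear warmup lasting `warmup_proportion * train_max_steps` steps, followed by a linear (power 1) decay from `initial_learning_rate` to `end_learning_rate` over `decay_steps`, holding the end value afterwards. The function name `learning_rate` and these schedule details are assumptions; the training script in the repository is authoritative.

```python
# Values copied from the config above; the schedule shape itself is an assumption.
INITIAL_LR = 0.001
END_LR = 0.00005
DECAY_STEPS = 150000
WARMUP_PROPORTION = 0.02
TRAIN_MAX_STEPS = 200000

# Assumed: warmup length is a fraction of the total number of training steps.
WARMUP_STEPS = round(TRAIN_MAX_STEPS * WARMUP_PROPORTION)


def learning_rate(step):
    """Hypothetical schedule: linear warmup, then linear decay to END_LR."""
    if step < WARMUP_STEPS:
        # Ramp up from 0 towards INITIAL_LR during warmup.
        return INITIAL_LR * step / WARMUP_STEPS
    # Linear interpolation from INITIAL_LR to END_LR, clamped at DECAY_STEPS.
    progress = min(step, DECAY_STEPS) / DECAY_STEPS
    return (INITIAL_LR - END_LR) * (1.0 - progress) + END_LR


for step in (0, 2000, WARMUP_STEPS, 75000, DECAY_STEPS, TRAIN_MAX_STEPS):
    print(f"step={step:>6} lr={learning_rate(step):.6f}")
```

Under this reading, the comment `decay_steps: 150000 # < train_max_steps is recommend.` means that the last 50k of the 200k steps run at the constant `end_learning_rate`.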

examples/multiband_melgan/README.md

Lines changed: 2 additions & 1 deletion
@@ -74,7 +74,8 @@ Here is a learning curves of melgan based on this config [`multiband_melgan.v1.y
7474
## Pretrained Models and Audio samples
7575
| Model | Conf | Lang | Fs [Hz] | Mel range [Hz] | FFT / Hop / Win [pt] | # iters |
7676
| :------ | :---: | :---: | :----: | :--------: | :---------------: | :-----: |
77-
| [multiband_melgan.v1](https://drive.google.com/drive/folders/1Hg82YnPbX6dfF7DxVs4c96RBaiFbh-cT?usp=sharing) | [link](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/multiband_melgan/conf/multiband_melgan.v1.yaml) | EN | 22.05k | 80-7600 | 1024 / 256 / None | 940K |
77+
| [multiband_melgan.v1](https://drive.google.com/drive/folders/1Hg82YnPbX6dfF7DxVs4c96RBaiFbh-cT?usp=sharing) | [link](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/multiband_melgan/conf/multiband_melgan.v1.yaml) | EN | 22.05k | 80-7600 | 1024 / 256 / None | 940K |
78+
| [multiband_melgan.v1](https://drive.google.com/drive/folders/199XCXER51PWf_VzUpOwxfY_8XDfeXuZl?usp=sharing) | [link](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/multiband_melgan/conf/multiband_melgan.v1.yaml) | KO | 22.05k | 80-7600 | 1024 / 256 / None | 1000K |
7879

7980
## Reference
8081

examples/multiband_melgan/decode_mb_melgan.py

Lines changed: 1 addition & 1 deletion
@@ -95,7 +95,7 @@ def main():
     config.update(vars(args))

     if config["format"] == "npy":
-        mel_query = "*-norm-feats.npy" if args.use_norm == 1 else "*-raw-feats.npy"
+        mel_query = "*-fs-after-feats.npy" if "fastspeech" in args.rootdir else "*-norm-feats.npy" if args.use_norm == 1 else "*-raw-feats.npy"
         mel_load_fn = np.load
     else:
         raise ValueError("Only npy is supported.")
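Review note: the new `mel_query` line chains two conditional expressions, which Python evaluates right to left as `A if c1 else (B if c2 else C)`. The sketch below spells out the same selection as explicit branches to make the precedence visible. `select_mel_query` is a hypothetical helper name and is not part of the commit. One consequence worth noting: the check is a substring match on the path, so any `rootdir` containing "fastspeech" selects the FastSpeech outputs regardless of `use_norm`.

```python
def select_mel_query(rootdir, use_norm):
    """Return the glob pattern for mel feature files.

    Hypothetical helper that mirrors the chained conditional in the diff above.
    """
    # A path containing "fastspeech" takes priority over the use_norm flag.
    if "fastspeech" in rootdir:
        return "*-fs-after-feats.npy"
    # Otherwise fall back to the pre-existing normalized/raw choice.
    if use_norm == 1:
        return "*-norm-feats.npy"
    return "*-raw-feats.npy"


# Example directory names below are illustrative only.
for rootdir, use_norm in (
    ("./prediction/fastspeech2-150k", 1),
    ("./dump/valid", 1),
    ("./dump/valid", 0),
):
    print(rootdir, use_norm, select_mel_query(rootdir, use_norm))
```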

examples/tacotron2/README.md

Lines changed: 2 additions & 1 deletion
@@ -109,7 +109,8 @@ Here is a result of tacotron2 based on this config [`tacotron2.v1.yaml`](https:/
109109
## Pretrained Models and Audio samples
110110
| Model | Conf | Lang | Fs [Hz] | Mel range [Hz] | FFT / Hop / Win [pt] | # iters | reduction factor|
111111
| :------ | :---: | :---: | :----: | :--------: | :---------------: | :-----: | :-----: |
112-
| [tacotron2.v1](https://drive.google.com/open?id=1kaPXRdLg9gZrll9KtvH3-feOBMM8sn3_) | [link](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/tacotron2/conf/tacotron2.v1.yaml) | EN | 22.05k | 80-7600 | 1024 / 256 / None | 65k | 1
112+
| [tacotron2.v1](https://drive.google.com/open?id=1kaPXRdLg9gZrll9KtvH3-feOBMM8sn3_) | [link](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/tacotron2/conf/tacotron2.v1.yaml) | EN | 22.05k | 80-7600 | 1024 / 256 / None | 65K | 1
113+
| [tacotron2.v1](https://drive.google.com/drive/folders/1WMBe01BBnYf3sOxMhbvnF2CUHaRTpBXJ?usp=sharing) | [link](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/tacotron2/conf/tacotron2.kss.v1.yaml) | KO | 22.05k | 80-7600 | 1024 / 256 / None | 100K | 1
113114

114115
## Reference
115116
