
Commit 39185be

Merge pull request #105 from TensorSpeech/dev/chinese_example
init chinese example (tacotron2 and mb-melgan)
2 parents (1303ab8 + 0a9d774) · commit 39185be

File tree

20 files changed: +966 −17 lines

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -35,3 +35,4 @@ ljspeech
 LibriTTS/
 dataset/
 mfa/
+kss

README.md

Lines changed: 23 additions & 0 deletions
@@ -184,6 +184,29 @@ After preprocessing, the structure of the project folder should be:
 
 We use suffix (`ids`, `raw-feats`, `raw-energy`, `raw-f0`, `norm-feats` and `wave`) for each type of input.
 
+### Preprocessing Chinese Dataset
+Please download the open dataset from [Data-Baker](https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar) and extract the data like this:
+```
+.
+├── PhoneLabeling
+│   ├── 000001.interval
+│   ├── ...
+│   └── 010000.interval
+├── ProsodyLabeling
+│   └── 000001-010000.txt
+└── Wave
+    ├── 000001.wav
+    ├── ...
+    └── 010000.wav
+```
+
+After installing TensorFlowTTS, you can preprocess the data like this:
+```shell
+tensorflow-tts-preprocess --dataset baker --rootdir ./baker --outdir ./dump --config ./preprocess/baker_preprocess.yaml
+tensorflow-tts-normalize --rootdir ./dump --outdir ./dump --config ./preprocess/baker_preprocess.yaml --dataset baker
+```
+
+
 **IMPORTANT NOTES**:
 - This preprocessing step is based on [ESPnet](https://github.com/espnet/espnet) so you can combine all models here with other models from ESPnet repository.
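Before running the two preprocessing commands added above, it can help to confirm the extracted corpus really matches the layout the README expects. The snippet below is an illustrative addition (not part of this commit); the `./baker` path simply mirrors the `--rootdir` argument used above.

```python
# Illustrative sanity check, not part of this commit: verify the extracted
# Data-Baker (BZNSYP) corpus matches the layout shown in the README diff.
from pathlib import Path

root = Path("./baker")  # same directory passed as --rootdir above

wavs = sorted((root / "Wave").glob("*.wav"))
intervals = sorted((root / "PhoneLabeling").glob("*.interval"))
prosody = root / "ProsodyLabeling" / "000001-010000.txt"

# The corpus ships utterances numbered 000001 through 010000.
assert len(wavs) == 10000, f"expected 10000 wav files, found {len(wavs)}"
assert len(intervals) == 10000, f"expected 10000 interval files, found {len(intervals)}"
assert prosody.exists(), "missing ProsodyLabeling/000001-010000.txt"
print(f"Baker layout looks good: {len(wavs)} utterances")
```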

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
# This is the hyperparameter configuration file for FastSpeech2 v2.
# The difference between v2 and v1 is that v2 applies the Linformer technique.
# Please make sure this is adjusted for the Baker dataset. If you want to
# apply it to another dataset, you might need to carefully change some parameters.
# This configuration performs 200k iters, but the best checkpoint is around 150k iters.

###########################################################
#                FEATURE EXTRACTION SETTING               #
###########################################################
hop_size: 256            # Hop size.
format: "npy"


###########################################################
#              NETWORK ARCHITECTURE SETTING               #
###########################################################
model_type: "fastspeech2"

fastspeech2_params:
  dataset: baker
  n_speakers: 1
  encoder_hidden_size: 256
  encoder_num_hidden_layers: 3
  encoder_num_attention_heads: 2
  encoder_attention_head_size: 16  # in v1, = 384 // 2
  encoder_intermediate_size: 1024
  encoder_intermediate_kernel_size: 3
  encoder_hidden_act: "mish"
  decoder_hidden_size: 256
  decoder_num_hidden_layers: 3
  decoder_num_attention_heads: 2
  decoder_attention_head_size: 16  # in v1, = 384 // 2
  decoder_intermediate_size: 1024
  decoder_intermediate_kernel_size: 3
  decoder_hidden_act: "mish"
  variant_prediction_num_conv_layers: 2
  variant_predictor_filter: 256
  variant_predictor_kernel_size: 3
  variant_predictor_dropout_rate: 0.5
  num_mels: 80
  hidden_dropout_prob: 0.2
  attention_probs_dropout_prob: 0.1
  max_position_embeddings: 2048
  initializer_range: 0.02
  output_attentions: False
  output_hidden_states: False

###########################################################
#                   DATA LOADER SETTING                   #
###########################################################
batch_size: 16               # Batch size.
remove_short_samples: true   # Whether to remove samples whose length is less than batch_max_steps.
allow_cache: true            # Whether to cache the dataset. If true, it requires cpu memory.
mel_length_threshold: 32     # Remove all targets that have mel_length <= 32.
is_shuffle: true             # Shuffle the dataset after each epoch.

###########################################################
#              OPTIMIZER & SCHEDULER SETTING              #
###########################################################
optimizer_params:
  initial_learning_rate: 0.001
  end_learning_rate: 0.00005
  decay_steps: 150000          # < train_max_steps is recommended.
  warmup_proportion: 0.02
  weight_decay: 0.001


###########################################################
#                     INTERVAL SETTING                    #
###########################################################
train_max_steps: 200000       # Number of training steps.
save_interval_steps: 5000     # Interval steps to save checkpoint.
eval_interval_steps: 500      # Interval steps to evaluate the network.
log_interval_steps: 200       # Interval steps to record the training log.
delay_f0_energy_steps: 3      # 2 steps use LR outputs only, then 1 step uses LR + F0 + Energy.

###########################################################
#                      OTHER SETTING                      #
###########################################################
num_save_intermediate_results: 1  # Number of batches to be saved as intermediate results.
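The new file above is the Baker FastSpeech2 v2 training configuration. As a rough sketch of how such a file is typically consumed (an illustrative addition, not part of the commit; the import paths, class names, and yaml filename below are assumptions to verify against the TensorFlowTTS source), the `fastspeech2_params` block maps directly onto the model configuration:

```python
# Illustrative sketch, not part of this commit. Import paths and the yaml
# filename are assumptions; check the TensorFlowTTS examples for the real ones.
import yaml

from tensorflow_tts.configs import FastSpeech2Config  # assumed import path
from tensorflow_tts.models import TFFastSpeech2       # assumed import path

with open("fastspeech2.baker.v2.yaml") as f:           # hypothetical filename
    config = yaml.load(f, Loader=yaml.SafeLoader)

# The "fastspeech2_params" block above becomes the model configuration.
fastspeech2 = TFFastSpeech2(config=FastSpeech2Config(**config["fastspeech2_params"]))
```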
Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@

# This is the hyperparameter configuration file for Multi-Band MelGAN.
# Please make sure this is adjusted for the Baker dataset. If you want to
# apply it to another dataset, you might need to carefully change some parameters.
# This configuration performs 1000k iters.

###########################################################
#                FEATURE EXTRACTION SETTING               #
###########################################################
sampling_rate: 24000
hop_size: 300            # Hop size.
format: "npy"


###########################################################
#         GENERATOR NETWORK ARCHITECTURE SETTING          #
###########################################################
model_type: "multiband_melgan_generator"

multiband_melgan_generator_params:
  out_channels: 4               # Number of output channels (number of subbands).
  kernel_size: 7                # Kernel size of initial and final conv layers.
  filters: 384                  # Initial number of channels for conv layers.
  upsample_scales: [3, 5, 5]    # List of upsampling scales.
  stack_kernel_size: 3          # Kernel size of dilated conv layers in residual stack.
  stacks: 4                     # Number of stacks in a single residual stack module.
  is_weight_norm: false         # Use weight-norm or not.

###########################################################
#       DISCRIMINATOR NETWORK ARCHITECTURE SETTING        #
###########################################################
discriminator_params:
  out_channels: 1                         # Number of output channels.
  scales: 3                               # Number of multi-scales.
  downsample_pooling: "AveragePooling1D"  # Pooling type for the input downsampling.
  downsample_pooling_params:              # Parameters of the above pooling function.
    pool_size: 4
    strides: 2
  kernel_sizes: [5, 3]                    # List of kernel sizes.
  filters: 16                             # Number of channels of the initial conv layer.
  max_downsample_filters: 512             # Maximum number of channels of downsampling layers.
  downsample_scales: [4, 4, 4]            # List of downsampling scales.
  nonlinear_activation: "LeakyReLU"       # Nonlinear activation function.
  nonlinear_activation_params:            # Parameters of the nonlinear activation function.
    alpha: 0.2
  is_weight_norm: false                   # Use weight-norm or not.

###########################################################
#                    STFT LOSS SETTING                    #
###########################################################
stft_loss_params:
  fft_lengths: [1024, 2048, 512]   # List of FFT sizes for STFT-based loss.
  frame_steps: [120, 240, 50]      # List of hop sizes for STFT-based loss.
  frame_lengths: [600, 1200, 240]  # List of window lengths for STFT-based loss.

subband_stft_loss_params:
  fft_lengths: [384, 683, 171]     # List of FFT sizes for STFT-based loss.
  frame_steps: [30, 60, 10]        # List of hop sizes for STFT-based loss.
  frame_lengths: [150, 300, 60]    # List of window lengths for STFT-based loss.

###########################################################
#                 ADVERSARIAL LOSS SETTING                #
###########################################################
lambda_feat_match: 10.0  # Loss balancing coefficient for the feature matching loss.
lambda_adv: 2.5          # Loss balancing coefficient for the adversarial loss.

###########################################################
#                   DATA LOADER SETTING                   #
###########################################################
batch_size: 64                 # Batch size.
batch_max_steps: 9600          # Length of each audio in the batch for training. Make sure it is divisible by hop_size.
batch_max_steps_valid: 48000   # Length of each audio for validation. Make sure it is divisible by hop_size.
remove_short_samples: true     # Whether to remove samples whose length is less than batch_max_steps.
allow_cache: true              # Whether to cache the dataset. If true, it requires cpu memory.
is_shuffle: true               # Shuffle the dataset after each epoch.

###########################################################
#              OPTIMIZER & SCHEDULER SETTING              #
###########################################################
generator_optimizer_params:
  lr_fn: "PiecewiseConstantDecay"
  lr_params:
    boundaries: [100000, 200000, 300000, 400000, 500000, 600000, 700000]
    values: [0.001, 0.0005, 0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
  amsgrad: false

discriminator_optimizer_params:
  lr_fn: "PiecewiseConstantDecay"
  lr_params:
    boundaries: [100000, 200000, 300000, 400000, 500000]
    values: [0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
  amsgrad: false

###########################################################
#                     INTERVAL SETTING                    #
###########################################################
discriminator_train_start_steps: 200000  # Step at which to begin training the discriminator.
train_max_steps: 4000000                 # Number of training steps.
save_interval_steps: 20000               # Interval steps to save checkpoint.
eval_interval_steps: 5000                # Interval steps to evaluate the network.
log_interval_steps: 200                  # Interval steps to record the training log.

###########################################################
#                      OTHER SETTING                      #
###########################################################
num_save_intermediate_results: 1  # Number of batches to be saved as intermediate results.
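One internal consistency worth noting in the generator settings above: each mel frame covers `hop_size` samples, the generator upsamples every subband by `prod(upsample_scales)` samples per frame, and PQMF synthesis interleaves the `out_channels` subbands, so the product has to equal `hop_size`. A quick check (an illustrative addition, assuming the standard Multi-Band MelGAN subband layout):

```python
# Illustrative consistency check, not part of this commit.
import math

hop_size = 300              # FEATURE EXTRACTION SETTING above
out_channels = 4            # number of subbands
upsample_scales = [3, 5, 5]

samples_per_frame = math.prod(upsample_scales) * out_channels  # 75 * 4
assert samples_per_frame == hop_size, (samples_per_frame, hop_size)
print("ok:", samples_per_frame, "samples per mel frame ==", hop_size)
```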

examples/multiband_melgan/decode_mb_melgan.py

Lines changed: 2 additions & 2 deletions
@@ -110,14 +110,14 @@ def main():
 
     # define model and load checkpoint
     mb_melgan = TFMelGANGenerator(
-        config=MultiBandMelGANGeneratorConfig(**config["multiband_melgan_generator"]),
+        config=MultiBandMelGANGeneratorConfig(**config["multiband_melgan_generator_params"]),
         name="multiband_melgan_generator",
     )
     mb_melgan._build()
     mb_melgan.load_weights(args.checkpoint)
 
     pqmf = TFPQMF(
-        config=MultiBandMelGANGeneratorConfig(**config["multiband_melgan_generator"]), name="pqmf"
+        config=MultiBandMelGANGeneratorConfig(**config["multiband_melgan_generator_params"]), name="pqmf"
     )
 
     for data in tqdm(dataset, desc="[Decoding]"):
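This fix makes `decode_mb_melgan.py` read the generator settings from the `multiband_melgan_generator_params` key, which is the key used in the new Baker Multi-Band MelGAN config above. A rough sketch of the corrected usage (an illustrative addition, not part of the commit; the import paths, yaml filename, and checkpoint path are assumptions):

```python
# Illustrative sketch, not part of this commit: load a trained MB-MelGAN
# generator using the corrected "multiband_melgan_generator_params" key.
import yaml

from tensorflow_tts.configs import MultiBandMelGANGeneratorConfig  # assumed path
from tensorflow_tts.models import TFMelGANGenerator, TFPQMF        # assumed path

with open("multiband_melgan.baker.v1.yaml") as f:   # hypothetical filename
    config = yaml.load(f, Loader=yaml.SafeLoader)

gen_config = MultiBandMelGANGeneratorConfig(**config["multiband_melgan_generator_params"])

mb_melgan = TFMelGANGenerator(config=gen_config, name="multiband_melgan_generator")
mb_melgan._build()
mb_melgan.load_weights("checkpoints/generator-940000.h5")  # hypothetical checkpoint

# The PQMF synthesis filter is built from the same generator config.
pqmf = TFPQMF(config=gen_config, name="pqmf")
```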
Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
# This is the hyperparameter configuration file for Tacotron2 v1.
# Please make sure this is adjusted for the Baker dataset. If you want to
# apply it to another dataset, you might need to carefully change some parameters.
# This configuration performs 200k iters, but 65k iters is enough to get a good model.

###########################################################
#                FEATURE EXTRACTION SETTING               #
###########################################################
hop_size: 256            # Hop size.
format: "npy"


###########################################################
#              NETWORK ARCHITECTURE SETTING               #
###########################################################
model_type: "tacotron2"

tacotron2_params:
  dataset: baker
  embedding_hidden_size: 512
  initializer_range: 0.5
  embedding_dropout_prob: 0.1
  n_speakers: 1
  n_conv_encoder: 5
  encoder_conv_filters: 512
  encoder_conv_kernel_sizes: 5
  encoder_conv_activation: 'relu'
  encoder_conv_dropout_rate: 0.5
  encoder_lstm_units: 256
  n_prenet_layers: 2
  prenet_units: 256
  prenet_activation: 'relu'
  prenet_dropout_rate: 0.5
  n_lstm_decoder: 1
  reduction_factor: 2
  decoder_lstm_units: 1024
  attention_dim: 128
  attention_filters: 32
  attention_kernel: 31
  n_mels: 80
  n_conv_postnet: 5
  postnet_conv_filters: 512
  postnet_conv_kernel_sizes: 5
  postnet_dropout_rate: 0.1
  attention_type: "lsa"

###########################################################
#                   DATA LOADER SETTING                   #
###########################################################
batch_size: 32               # Batch size.
remove_short_samples: true   # Whether to remove samples whose length is less than batch_max_steps.
allow_cache: true            # Whether to cache the dataset. If true, it requires cpu memory.
mel_length_threshold: 32     # Remove all targets that have mel_length <= 32.
is_shuffle: true             # Shuffle the dataset after each epoch.
use_fixed_shapes: true       # use_fixed_shapes for training (2x speed-up)
                             # refer (https://github.com/dathudeptrai/TensorflowTTS/issues/34#issuecomment-642309118)

###########################################################
#              OPTIMIZER & SCHEDULER SETTING              #
###########################################################
optimizer_params:
  initial_learning_rate: 0.001
  end_learning_rate: 0.00001
  decay_steps: 150000          # < train_max_steps is recommended.
  warmup_proportion: 0.02
  weight_decay: 0.001


###########################################################
#                     INTERVAL SETTING                    #
###########################################################
train_max_steps: 200000                 # Number of training steps.
save_interval_steps: 5000               # Interval steps to save checkpoint.
eval_interval_steps: 500                # Interval steps to evaluate the network.
log_interval_steps: 100                 # Interval steps to record the training log.
start_schedule_teacher_forcing: 200001  # No need to apply scheduled teacher forcing.
start_ratio_value: 0.5                  # Start ratio of scheduled teacher forcing.
schedule_decay_steps: 50000             # Decay steps of scheduled teacher forcing.
end_ratio_value: 0.0                    # End ratio of scheduled teacher forcing.

###########################################################
#                      OTHER SETTING                      #
###########################################################
num_save_intermediate_results: 1  # Number of results to be saved as intermediate results.
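One detail in the interval settings above: `start_schedule_teacher_forcing` (200001) is one step past `train_max_steps` (200000), so scheduled teacher forcing never actually activates for this Baker recipe, exactly as the inline comment says. A small check (an illustrative addition, not part of the commit; the yaml filename is an assumption):

```python
# Illustrative check, not part of this commit.
import yaml

with open("tacotron2.baker.v1.yaml") as f:   # hypothetical filename
    config = yaml.load(f, Loader=yaml.SafeLoader)

disabled = config["start_schedule_teacher_forcing"] > config["train_max_steps"]
print("scheduled teacher forcing disabled:", disabled)  # True, since 200001 > 200000
```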

examples/tacotron2/conf/tacotron2.v1.yaml

Lines changed: 1 addition & 0 deletions
@@ -16,6 +16,7 @@ format: "npy"
 model_type: "tacotron2"
 
 tacotron2_params:
+    dataset: ljspeech
     embedding_hidden_size: 512
     initializer_range: 0.02
     embedding_dropout_prob: 0.1

examples/tacotron2/decode_tacotron2.py

Lines changed: 3 additions & 0 deletions
@@ -25,6 +25,7 @@
 import tensorflow as tf
 import yaml
 from tqdm import tqdm
+import matplotlib.pyplot as plt
 
 from examples.tacotron2.tacotron_dataset import CharactorMelDataset
 from tensorflow_tts.configs import Tacotron2Config
@@ -109,11 +110,13 @@ def main():
 
     # define data-loader
     dataset = CharactorMelDataset(
+        dataset=config["tacotron2_params"]["dataset"],
         root_dir=args.rootdir,
         charactor_query=char_query,
         mel_query=mel_query,
         charactor_load_fn=char_load_fn,
         mel_load_fn=mel_load_fn,
+        reduction_factor=config["tacotron2_params"]["reduction_factor"]
     )
     dataset = dataset.create(allow_cache=True, batch_size=args.batch_size)
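The two new `CharactorMelDataset` arguments are read from the `tacotron2_params` block, so decoding picks up the dataset name and reduction factor from whichever config is passed in (e.g. `baker` with `reduction_factor: 2` in the Chinese example above). A rough usage sketch (an illustrative addition, not part of the commit; the query patterns, paths, and yaml filename are assumptions, see decode_tacotron2.py for the real values):

```python
# Illustrative sketch, not part of this commit: build the data loader the same
# way the updated decode_tacotron2.py does. Query patterns and paths are
# assumptions; check the script for the actual values.
import numpy as np
import yaml

from examples.tacotron2.tacotron_dataset import CharactorMelDataset

with open("tacotron2.baker.v1.yaml") as f:       # hypothetical filename
    config = yaml.load(f, Loader=yaml.SafeLoader)

dataset = CharactorMelDataset(
    dataset=config["tacotron2_params"]["dataset"],      # "baker" for the Chinese example
    root_dir="./dump/valid",                             # hypothetical dump directory
    charactor_query="*-ids.npy",                         # assumed naming, matches the README suffixes
    mel_query="*-norm-feats.npy",                        # assumed naming, matches the README suffixes
    charactor_load_fn=np.load,
    mel_load_fn=np.load,
    reduction_factor=config["tacotron2_params"]["reduction_factor"],
)
tf_dataset = dataset.create(allow_cache=True, batch_size=32)
```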
