
Commit 2182343

🔧 Merge master to branch and fix conflict.
2 parents 138f11c + 3c388e7 · commit 2182343


58 files changed: +2840 −99 lines changed

.gitignore

Lines changed: 3 additions & 1 deletion
````diff
@@ -32,5 +32,7 @@ ljspeech
 /datasets
 /examples/tacotron2/exp/
 /temp/
+LibriTTS/
+dataset/
+mfa/
 kss
-LibriTTS
````

README.md

Lines changed: 9 additions & 5 deletions
````diff
@@ -19,6 +19,7 @@
 :zany_face: TensorflowTTS provides real-time state-of-the-art speech synthesis architectures such as Tacotron-2, Melgan, Multiband-Melgan, FastSpeech, and FastSpeech2, based on TensorFlow 2. With TensorFlow 2, we can speed up training/inference and optimize further using [fake-quantize aware](https://www.tensorflow.org/model_optimization/guide/quantization/training_comprehensive_guide) training and [pruning](https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras), so TTS models can run faster than real-time and be deployed on mobile devices or embedded systems.
 
 ## What's new
+- 2020/08/14 **(NEW!)** Support Chinese TTS. Please see the [colab](https://colab.research.google.com/drive/1YpSHRBRPBI7cnTkQn1UcVTWEQVbsUm1S?usp=sharing). Thanks [@azraelkuan](https://github.com/azraelkuan).
 - 2020/08/05 **(NEW!)** Support Korean TTS. Please see the [colab](https://colab.research.google.com/drive/1ybWwOS5tipgPFttNulp77P6DAB5MtiuN?usp=sharing). Thanks [@crux153](https://github.com/crux153).
 - 2020/07/17 Support multi-GPU for all trainers.
 - 2020/07/05 Support converting Tacotron-2 and FastSpeech to TFLite. Please see the [colab](https://colab.research.google.com/drive/1HudLLpT9CQdh2k04c06bHUwLubhGTWxA?usp=sharing). Thanks @jaeyoo from the TFLite team for his support.
````
````diff
@@ -35,15 +36,17 @@
 - Mixed precision to speed up training if possible.
 - Support both single and multi-GPU in the base trainer class.
 - TFLite conversion for all supported models.
+- Android example.
+- Support many languages (currently Chinese, Korean, and English).
 
 ## Requirements
 This repository is tested on Ubuntu 18.04 with:
 
-- Python 3.6+
+- Python 3.7+
 - CUDA 10.1
 - CuDNN 7.6.5
 - TensorFlow 2.2/2.3
-- [Tensorflow Addons](https://github.com/tensorflow/addons) 0.10.0
+- [Tensorflow Addons](https://github.com/tensorflow/addons) >= 0.10.0
 
 Different TensorFlow versions should work but have not been tested yet. This repo tries to track the latest stable TensorFlow version. **We recommend installing TensorFlow 2.3.0 for training if you want to use multi-GPU.**
 
````
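As a quick aside, one way to satisfy the updated requirements above in a fresh environment; a minimal sketch assuming pip on a Python 3.7+ interpreter, with illustrative version pins that are not taken from this commit:

```bash
# Illustrative install matching the updated requirements;
# the exact pins are examples, only the lower bounds come from the README.
pip install "tensorflow==2.3.0" "tensorflow-addons>=0.10.0"
```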

````diff
@@ -113,11 +116,11 @@ The preprocessing has two steps:
 
 To reproduce the steps above:
 ```
-tensorflow-tts-preprocess --rootdir ./datasets --outdir ./dump --config preprocess/ljspeech_preprocess.yaml --dataset ljspeech
-tensorflow-tts-normalize --rootdir ./dump --outdir ./dump --config preprocess/ljspeech_preprocess.yaml --dataset ljspeech
+tensorflow-tts-preprocess --rootdir ./datasets --outdir ./dump --config preprocess/[ljspeech/kss/baker]_preprocess.yaml --dataset [ljspeech/kss/baker]
+tensorflow-tts-normalize --rootdir ./dump --outdir ./dump --config preprocess/[ljspeech/kss/baker]_preprocess.yaml --dataset [ljspeech/kss/baker]
 ```
 
-Right now we only support [`ljspeech`](https://keithito.com/LJ-Speech-Dataset/) and [`kss`](https://www.kaggle.com/bryanpark/korean-single-speaker-speech-dataset) for the dataset argument. In the future, we intend to support more datasets.
+Right now we only support [`ljspeech`](https://keithito.com/LJ-Speech-Dataset/), [`kss`](https://www.kaggle.com/bryanpark/korean-single-speaker-speech-dataset), and [`baker`](https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar) for the dataset argument. In the future, we intend to support more datasets.
 
 After preprocessing, the structure of the project folder should be:
 ```
````
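For instance, substituting `baker` for the bracketed placeholders gives a concrete pair of commands, assuming `preprocess/baker_preprocess.yaml` follows the same naming pattern as the ljspeech config in the removed lines:

```bash
# Concrete instance of the placeholder commands above for the baker dataset.
tensorflow-tts-preprocess --rootdir ./datasets --outdir ./dump \
  --config preprocess/baker_preprocess.yaml --dataset baker
tensorflow-tts-normalize --rootdir ./dump --outdir ./dump \
  --config preprocess/baker_preprocess.yaml --dataset baker
```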
````diff
@@ -184,6 +187,7 @@ After preprocessing, the structure of the project folder should be:
 
 We use a suffix (`ids`, `raw-feats`, `raw-energy`, `raw-f0`, `norm-feats`, and `wave`) for each type of input.
 
+
 **IMPORTANT NOTES**:
 - This preprocessing step is based on [ESPnet](https://github.com/espnet/espnet) so you can combine all models here with other models from the ESPnet repository.
````

Lines changed: 78 additions & 0 deletions
```yaml
# This is the hyperparameter configuration file for FastSpeech2 v2.
# The difference between v2 and v1 is that v2 applies the Linformer technique.
# Please make sure this is adjusted for the Baker dataset. If you want to
# apply it to another dataset, you might need to carefully change some parameters.
# This configuration performs 200k iters, but the best checkpoint is around 150k iters.

###########################################################
#                FEATURE EXTRACTION SETTING                #
###########################################################
hop_size: 256    # Hop size.
format: "npy"


###########################################################
#              NETWORK ARCHITECTURE SETTING                #
###########################################################
model_type: "fastspeech2"

fastspeech2_params:
    dataset: baker
    n_speakers: 1
    encoder_hidden_size: 256
    encoder_num_hidden_layers: 3
    encoder_num_attention_heads: 2
    encoder_attention_head_size: 16  # in v1, = 384//2
    encoder_intermediate_size: 1024
    encoder_intermediate_kernel_size: 3
    encoder_hidden_act: "mish"
    decoder_hidden_size: 256
    decoder_num_hidden_layers: 3
    decoder_num_attention_heads: 2
    decoder_attention_head_size: 16  # in v1, = 384//2
    decoder_intermediate_size: 1024
    decoder_intermediate_kernel_size: 3
    decoder_hidden_act: "mish"
    variant_prediction_num_conv_layers: 2
    variant_predictor_filter: 256
    variant_predictor_kernel_size: 3
    variant_predictor_dropout_rate: 0.5
    num_mels: 80
    hidden_dropout_prob: 0.2
    attention_probs_dropout_prob: 0.1
    max_position_embeddings: 2048
    initializer_range: 0.02
    output_attentions: False
    output_hidden_states: False

###########################################################
#                   DATA LOADER SETTING                    #
###########################################################
batch_size: 16              # Batch size.
remove_short_samples: true  # Whether to remove samples shorter than batch_max_steps.
allow_cache: true           # Whether to cache the dataset; if true, it requires CPU memory.
mel_length_threshold: 32    # Remove all targets with mel_length <= 32.
is_shuffle: true            # Shuffle the dataset after each epoch.
###########################################################
#              OPTIMIZER & SCHEDULER SETTING               #
###########################################################
optimizer_params:
    initial_learning_rate: 0.001
    end_learning_rate: 0.00005
    decay_steps: 150000      # A value < train_max_steps is recommended.
    warmup_proportion: 0.02
    weight_decay: 0.001


###########################################################
#                     INTERVAL SETTING                     #
###########################################################
train_max_steps: 200000      # Number of training steps.
save_interval_steps: 5000    # Interval steps to save checkpoints.
eval_interval_steps: 500     # Interval steps to evaluate the network.
log_interval_steps: 200      # Interval steps to record the training log.
delay_f0_energy_steps: 3     # 2 steps use LR (length regulator) outputs only, then 1 step uses LR + F0 + energy.
###########################################################
#                      OTHER SETTING                       #
###########################################################
num_save_intermediate_results: 1  # Number of batches to be saved as intermediate results.
```
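For context, a config like this is consumed by the FastSpeech2 training entry point via its `--config` flag. Below is a hypothetical launch, assuming the `examples/fastspeech2/train_fastspeech2.py` script and flags from this repo's FastSpeech2 example; the config file path is an assumption, since the new file's path is not visible in this view:

```bash
# Hypothetical launch; the script path, flags, and config file name are assumed
# from the repo's FastSpeech2 example, not confirmed by this commit view.
CUDA_VISIBLE_DEVICES=0 python examples/fastspeech2/train_fastspeech2.py \
  --train-dir ./dump/train/ \
  --dev-dir ./dump/valid/ \
  --outdir ./examples/fastspeech2/exp/train.fastspeech2.baker.v2/ \
  --config ./examples/fastspeech2/conf/fastspeech2.baker.v2.yaml \
  --use-norm 1 \
  --f0-stat ./dump/stats_f0.npy \
  --energy-stat ./dump/stats_energy.npy \
  --mixed_precision 1
```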
Lines changed: 64 additions & 0 deletions
````markdown
# FastSpeech2 multi-speaker, English-language based

## Prepare
Everything is done from the main repo folder, i.e. TensorflowTTS/.

0. (Optional) [Download](http://www.openslr.org/60/) and prepare LibriTTS (a helper to prepare LibriTTS is in examples/fastspeech2_multispeaker/libri_experiment/prepare_libri.ipynb).
   - Dataset structure after finishing this step:
   ```
   |- TensorFlowTTS/
   | |- LibriTTS/
   | |- |- train-clean-100/
   | |- |- SPEAKERS.txt
   | |- |- ...
   | |- dataset/
   | |- |- 200/
   | |- |- |- 200_124139_000001_000000.txt
   | |- |- |- 200_124139_000001_000000.wav
   | |- |- |- ...
   | |- |- 250/
   | |- |- ...
   | |- tensorflow_tts/
   | |- models/
   | |- ...
   ```
1. Extract durations (use examples/mfa_extraction or a pretrained Tacotron-2).
2. (Optional) Build the Docker image:
   ```
   bash examples/fastspeech2_multispeaker/scripts/build.sh
   ```
3. (Optional) Run the Docker container:
   ```
   bash examples/fastspeech2_multispeaker/scripts/interactive.sh
   ```
4. Preprocessing:
   ```
   tensorflow-tts-preprocess --rootdir ./dataset \
     --outdir ./dump \
     --config preprocess/preprocess_libritts.yaml \
     --dataset multispeaker
   ```
5. Normalization:
   ```
   tensorflow-tts-normalize --rootdir ./dump \
     --outdir ./dump \
     --config preprocess/preprocess_libritts.yaml \
     --dataset multispeaker
   ```
6. Change the CharactorDurationF0EnergyMelDataset speaker mapper in fastspeech2_dataset to match your dataset (if you use LibriTTS with mfa_extraction, you don't need to change anything).
7. Change train_libri.sh to match your dataset and run:
   ```
   bash examples/fastspeech2_multispeaker/scripts/train_libri.sh
   ```
8. (Optional) If you have problems with tensor size mismatches, check step 5 in the `examples/mfa_extraction` directory.

## Comments

This version uses the popular '|'-separated train.txt format used in other repos. Training files should look like this:

Wav Path | Text | Speaker Name

Wav Path2 | Text | Speaker Name
````
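For concreteness, here is a hypothetical two-line train.txt in that format. The first path reuses a file name from the dataset structure above; the second path and both transcripts are invented for illustration, and whether the split keeps spaces around '|' depends on the loader:

```
dataset/200/200_124139_000001_000000.wav|An example transcript for this utterance.|200
dataset/250/250_100000_000001_000000.wav|Another invented transcript.|250
```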
Lines changed: 75 additions & 0 deletions
```yaml
# This is the hyperparameter configuration file for FastSpeech2 v1.
# Please make sure this is adjusted for the LibriTTS dataset. If you want to
# apply it to another dataset, you might need to carefully change some parameters.
# This configuration performs 200k iters, but the best checkpoint is around 150k iters.

###########################################################
#                FEATURE EXTRACTION SETTING                #
###########################################################
hop_size: 256    # Hop size.
format: "npy"

###########################################################
#              NETWORK ARCHITECTURE SETTING                #
###########################################################
model_type: fastspeech2

fastspeech2_params:
    n_speakers: 20
    encoder_hidden_size: 384
    encoder_num_hidden_layers: 4
    encoder_num_attention_heads: 2
    encoder_attention_head_size: 192  # hidden_size // num_attention_heads
    encoder_intermediate_size: 1024
    encoder_intermediate_kernel_size: 3
    encoder_hidden_act: "mish"
    decoder_hidden_size: 384
    decoder_num_hidden_layers: 4
    decoder_num_attention_heads: 2
    decoder_attention_head_size: 192  # hidden_size // num_attention_heads
    decoder_intermediate_size: 1024
    decoder_intermediate_kernel_size: 3
    decoder_hidden_act: "mish"
    variant_prediction_num_conv_layers: 2
    variant_predictor_filter: 256
    variant_predictor_kernel_size: 3
    variant_predictor_dropout_rate: 0.5
    num_mels: 80
    hidden_dropout_prob: 0.2
    attention_probs_dropout_prob: 0.1
    max_position_embeddings: 2048
    initializer_range: 0.02
    output_attentions: False
    output_hidden_states: False

###########################################################
#                   DATA LOADER SETTING                    #
###########################################################
batch_size: 32              # Batch size.
remove_short_samples: true  # Whether to remove samples shorter than batch_max_steps.
allow_cache: true           # Whether to cache the dataset; if true, it requires CPU memory.
mel_length_threshold: 48    # Remove all targets with mel_length <= 48.
is_shuffle: true            # Shuffle the dataset after each epoch.
###########################################################
#              OPTIMIZER & SCHEDULER SETTING               #
###########################################################
optimizer_params:
    initial_learning_rate: 0.0001
    end_learning_rate: 0.00001
    decay_steps: 120000      # A value < train_max_steps is recommended.
    warmup_proportion: 0.02
    weight_decay: 0.001


###########################################################
#                     INTERVAL SETTING                     #
###########################################################
train_max_steps: 150000      # Number of training steps.
save_interval_steps: 5000    # Interval steps to save checkpoints.
eval_interval_steps: 5000    # Interval steps to evaluate the network.
log_interval_steps: 200      # Interval steps to record the training log.
###########################################################
#                      OTHER SETTING                       #
###########################################################
use_griffin: true                 # Whether to use Griffin-Lim during evaluation.
num_save_intermediate_results: 1  # Number of batches to be saved as intermediate results.
```
