
Commit 4ae9582

Merge branch 'master' into parallel-wavegan
2 parents fca393a + 5f21de9 commit 4ae9582

File tree

76 files changed: +3874 additions, −627 deletions

Note: large commits have some content hidden by default, so file headers for several of the changed files below are not shown.

.gitignore

Lines changed: 10 additions & 0 deletions

@@ -32,3 +32,13 @@ ljspeech
 /datasets
 /examples/tacotron2/exp/
 /temp/
+LibriTTS/
+dataset/
+mfa/
+kss/
+baker/
+libritts/
+dump_baker/
+dump_ljspeech/
+dump_kss/
+dump_libritts/

README.md

Lines changed: 22 additions & 12 deletions

@@ -19,6 +19,8 @@
 :zany_face: TensorflowTTS provides real-time state-of-the-art speech synthesis architectures such as Tacotron-2, MelGAN, Multiband-MelGAN, FastSpeech, and FastSpeech2, based on TensorFlow 2. With TensorFlow 2 we can speed up training and inference, optimize further using [fake-quantize aware training](https://www.tensorflow.org/model_optimization/guide/quantization/training_comprehensive_guide) and [pruning](https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras), and make TTS models run faster than real time, ready to deploy on mobile devices or embedded systems.

 ## What's new
+- 2020/08/18 **(NEW!)** Updated the [base processor](https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/processor/base_processor.py). Added [AutoProcessor](https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/inference/auto_processor.py) and [pretrained processor](https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/processor/pretrained/) JSON files.
+- 2020/08/14 **(NEW!)** Support Chinese TTS. Please see the [colab](https://colab.research.google.com/drive/1YpSHRBRPBI7cnTkQn1UcVTWEQVbsUm1S?usp=sharing). Thanks to [@azraelkuan](https://github.com/azraelkuan).
 - 2020/08/05 **(NEW!)** Support Korean TTS. Please see the [colab](https://colab.research.google.com/drive/1ybWwOS5tipgPFttNulp77P6DAB5MtiuN?usp=sharing). Thanks to [@crux153](https://github.com/crux153).
 - 2020/07/17 Support multi-GPU for all trainers.
 - 2020/07/05 Support converting Tacotron-2 and FastSpeech to TFLite. Please see the [colab](https://colab.research.google.com/drive/1HudLLpT9CQdh2k04c06bHUwLubhGTWxA?usp=sharing). Thanks to @jaeyoo from the TFLite team for his support.
@@ -35,15 +37,17 @@
 - Mixed precision to speed up training when possible.
 - Support for both single and multi GPU in the base trainer class.
 - TFLite conversion for all supported models.
+- Android example.
+- Support for many languages (currently Chinese, Korean, and English).

 ## Requirements
 This repository is tested on Ubuntu 18.04 with:

-- Python 3.6+
+- Python 3.7+
 - CUDA 10.1
 - CuDNN 7.6.5
 - TensorFlow 2.2/2.3
-- [Tensorflow Addons](https://github.com/tensorflow/addons) 0.10.0
+- [Tensorflow Addons](https://github.com/tensorflow/addons) >= 0.10.0

 Different TensorFlow versions should work but have not been tested yet. This repo will try to track the latest stable TensorFlow version. **We recommend installing TensorFlow 2.3.0 for training if you want to use multi-GPU.**
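If you want a quick environment matching these requirements, here is a minimal sketch (assuming a plain pip setup; the repo's own install instructions, not shown in this diff, take precedence):

```
pip install tensorflow==2.3.0 "tensorflow-addons>=0.10.0"
```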

@@ -90,7 +94,7 @@ Here is an audio sample on the valid set. [tacotron-2](https://drive.google.com/ope

 Prepare a dataset in the following format:
 ```
-|- datasets/
+|- [NAME_DATASET]/
 |  |- metadata.csv
 |  |- wav/
 |     |- file1.wav

@@ -99,6 +103,8 @@ Prepare a dataset in the following format:

 where `metadata.csv` has the following format: `id|transcription`. This is an LJSpeech-like format; you can skip these preprocessing steps if your dataset is in a different format.

+Note that `NAME_DATASET` should be one of `ljspeech`, `kss`, `baker`, or `libritts`.
+
 ## Preprocessing

 The preprocessing has two steps:
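For illustration, a hypothetical `metadata.csv` in the `id|transcription` format described above (the IDs follow the LJSpeech naming style; the transcriptions here are made up):

```
LJ001-0001|A short example sentence.
LJ001-0002|Another example transcription.
```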
@@ -113,20 +119,22 @@ The preprocessing has two steps:

 To reproduce the steps above:
 ```
-tensorflow-tts-preprocess --rootdir ./datasets --outdir ./dump --config preprocess/ljspeech_preprocess.yaml --dataset ljspeech
-tensorflow-tts-normalize --rootdir ./dump --outdir ./dump --config preprocess/ljspeech_preprocess.yaml --dataset ljspeech
+tensorflow-tts-preprocess --rootdir ./[ljspeech/kss/baker/libritts] --outdir ./dump_[ljspeech/kss/baker/libritts] --config preprocess/[ljspeech/kss/baker]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts]
+tensorflow-tts-normalize --rootdir ./dump_[ljspeech/kss/baker/libritts] --outdir ./dump_[ljspeech/kss/baker/libritts] --config preprocess/[ljspeech/kss/baker/libritts]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts]
 ```

-Right now we only support [`ljspeech`](https://keithito.com/LJ-Speech-Dataset/) and [`kss`](https://www.kaggle.com/bryanpark/korean-single-speaker-speech-dataset) for the dataset argument. In the future, we intend to support more datasets.
+Right now we only support [`ljspeech`](https://keithito.com/LJ-Speech-Dataset/), [`kss`](https://www.kaggle.com/bryanpark/korean-single-speaker-speech-dataset), [`baker`](https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar) and [`libritts`](http://www.openslr.org/60/) for the dataset argument. In the future, we intend to support more datasets.
+
+**Note**: To run `libritts` preprocessing, please first read the instructions in [examples/fastspeech2_libritts](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/fastspeech2_libritts); the dataset needs to be reformatted before running preprocessing.

 After preprocessing, the structure of the project folder should be:
 ```
-|- datasets/
+|- [NAME_DATASET]/
 |  |- metadata.csv
 |  |- wav/
 |     |- file1.wav
 |     |- ...
-|- dump/
+|- dump_[ljspeech/kss/baker/libritts]/
 |  |- train/
 |     |- ids/
 |        |- LJ001-0001-ids.npy
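As a concrete instance of the bracketed commands above, the LJSpeech case would be (a sketch following the `dump_[dataset]` and `[dataset]_preprocess.yaml` patterns shown in this diff):

```
tensorflow-tts-preprocess --rootdir ./ljspeech --outdir ./dump_ljspeech --config preprocess/ljspeech_preprocess.yaml --dataset ljspeech
tensorflow-tts-normalize --rootdir ./dump_ljspeech --outdir ./dump_ljspeech --config preprocess/ljspeech_preprocess.yaml --dataset ljspeech
```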
@@ -184,8 +192,10 @@ After preprocessing, the structure of the project folder should be:

 We use a suffix (`ids`, `raw-feats`, `raw-energy`, `raw-f0`, `norm-feats`, and `wave`) for each type of input.

+
 **IMPORTANT NOTES**:
 - This preprocessing step is based on [ESPnet](https://github.com/espnet/espnet), so you can combine all models here with other models from the ESPnet repository.
+- Regardless of how your dataset is formatted, the final structure of the `dump` folder **SHOULD** follow the structure above to be able to use the training script, or you can modify it yourself 😄.

 ## Training models

@@ -194,6 +204,7 @@ To learn how to train a model from scratch or fine-tune with other datasets/lang
 - For the Tacotron-2 tutorial, please see [example/tacotron2](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/tacotron2)
 - For the FastSpeech tutorial, please see [example/fastspeech](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/fastspeech)
 - For the FastSpeech2 tutorial, please see [example/fastspeech2](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/fastspeech2)
+- For the FastSpeech2 + MFA tutorial, please see [example/fastspeech2_libritts](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/fastspeech2_libritts)
 - For the MelGAN tutorial, please see [example/melgan](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/melgan)
 - For the MelGAN + STFT Loss tutorial, please see [example/melgan.stft](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/melgan.stft)
 - For the Multiband-MelGAN tutorial, please see [example/multiband_melgan](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/multiband_melgan)
@@ -237,10 +248,9 @@ import yaml

 import tensorflow as tf

-from tensorflow_tts.processor import LJSpeechProcessor
-
 from tensorflow_tts.inference import AutoConfig
 from tensorflow_tts.inference import TFAutoModel
+from tensorflow_tts.inference import AutoProcessor

 # initialize fastspeech model.
 fs_config = AutoConfig.from_pretrained('/examples/fastspeech/conf/fastspeech.v1.yaml')
@@ -259,7 +269,7 @@ melgan = TFAutoModel.from_pretrained(

 # inference
-processor = LJSpeechProcessor(None, cleaner_names="english_cleaners")
+processor = AutoProcessor.from_pretrained(pretrained_path="./test/files/ljspeech_mapper.json")

 ids = processor.text_to_sequence("Recent research at Harvard has shown meditating for as little as 8 weeks, can actually increase the grey matter in the parts of the brain responsible for emotional regulation, and learning.")
 ids = tf.expand_dims(ids, 0)
@@ -281,7 +291,7 @@ sf.write('./audio_after.wav', audio_after, 22050, "PCM_16")
 ```

 # Contact
-[Minh Nguyen Quan Anh](https://github.com/dathudeptrai): [email protected], [erogol](https://github.com/erogol): [email protected], [Kuan Chen](https://github.com/azraelkuan): [email protected], [Takuya Ebata](https://github.com/MokkeMeguru): [email protected], [Trinh Le Quang](https://github.com/l4zyf9x): [email protected]
+[Minh Nguyen Quan Anh](https://github.com/dathudeptrai): [email protected], [erogol](https://github.com/erogol): [email protected], [Kuan Chen](https://github.com/azraelkuan): [email protected], [Dawid Kobus](https://github.com/machineko): [email protected], [Takuya Ebata](https://github.com/MokkeMeguru): [email protected], [Trinh Le Quang](https://github.com/l4zyf9x): trinhle.cse@gmail.com, [Yunchao He](https://github.com/candlewill): [email protected], [Alejandro Miguel Velasquez](https://github.com/ZDisket): xml506ok@gmail.com

 # License
 Overall, almost all models here are licensed under [Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0) for all countries in the world, except in **Viet Nam**, where this framework cannot be used for production in any way without permission from TensorflowTTS's authors. There is one exception: Tacotron-2 can be used for any purpose. So, if you are Vietnamese and want to use this framework for production, you **must** contact us in advance.
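Putting the inference fragments from the hunks above together, the updated quick-start flow looks roughly like this. This is a sketch assembled from the diff: the pretrained-weight paths, the melgan config path, and the `.inference(...)` call signature are not shown in these hunks, so treat them as assumptions to check against the full README.

```python
import tensorflow as tf
import soundfile as sf

from tensorflow_tts.inference import AutoConfig
from tensorflow_tts.inference import TFAutoModel
from tensorflow_tts.inference import AutoProcessor

# initialize fastspeech model (config path from the diff; weight path is a placeholder).
fs_config = AutoConfig.from_pretrained('/examples/fastspeech/conf/fastspeech.v1.yaml')
fastspeech = TFAutoModel.from_pretrained(config=fs_config, pretrained_path="fastspeech.h5")

# initialize melgan vocoder (both paths below are placeholders).
melgan_config = AutoConfig.from_pretrained('/examples/melgan/conf/melgan.v1.yaml')
melgan = TFAutoModel.from_pretrained(config=melgan_config, pretrained_path="melgan.h5")

# inference: text -> ids -> mel -> waveform.
processor = AutoProcessor.from_pretrained(pretrained_path="./test/files/ljspeech_mapper.json")
ids = processor.text_to_sequence("Hello, world.")
ids = tf.expand_dims(ids, 0)

# assumed inference signature; check the model's docstring before relying on it.
mel_before, mel_after, duration_outputs = fastspeech.inference(
    ids,
    speaker_ids=tf.zeros(shape=[tf.shape(ids)[0]], dtype=tf.int32),
    speed_ratios=tf.ones(shape=[tf.shape(ids)[0]], dtype=tf.float32),
)
audio_after = melgan(mel_after)[0, :, 0]

sf.write('./audio_after.wav', audio_after, 22050, "PCM_16")
```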
Lines changed: 78 additions & 0 deletions

@@ -0,0 +1,78 @@
# This is the hyperparameter configuration file for FastSpeech2 v2.
# The difference between v2 and v1 is that v2 applies the Linformer technique.
# Please make sure this is adjusted for the Baker dataset. If you want to
# apply it to another dataset, you might need to carefully change some parameters.
# This configuration trains for 200k iterations, but the best checkpoint is around 150k iterations.

###########################################################
#              FEATURE EXTRACTION SETTING                 #
###########################################################
hop_size: 256            # Hop size.
format: "npy"


###########################################################
#            NETWORK ARCHITECTURE SETTING                 #
###########################################################
model_type: "fastspeech2"

fastspeech2_params:
    dataset: baker
    n_speakers: 1
    encoder_hidden_size: 256
    encoder_num_hidden_layers: 3
    encoder_num_attention_heads: 2
    encoder_attention_head_size: 16   # in v1, = 384 // 2
    encoder_intermediate_size: 1024
    encoder_intermediate_kernel_size: 3
    encoder_hidden_act: "mish"
    decoder_hidden_size: 256
    decoder_num_hidden_layers: 3
    decoder_num_attention_heads: 2
    decoder_attention_head_size: 16   # in v1, = 384 // 2
    decoder_intermediate_size: 1024
    decoder_intermediate_kernel_size: 3
    decoder_hidden_act: "mish"
    variant_prediction_num_conv_layers: 2
    variant_predictor_filter: 256
    variant_predictor_kernel_size: 3
    variant_predictor_dropout_rate: 0.5
    num_mels: 80
    hidden_dropout_prob: 0.2
    attention_probs_dropout_prob: 0.1
    max_position_embeddings: 2048
    initializer_range: 0.02
    output_attentions: False
    output_hidden_states: False

###########################################################
#                DATA LOADER SETTING                      #
###########################################################
batch_size: 16              # Batch size.
remove_short_samples: true  # Whether to remove samples whose length is less than batch_max_steps.
allow_cache: true           # Whether to allow caching in the dataset. If true, it requires CPU memory.
mel_length_threshold: 32    # Remove all targets with mel_length <= 32.
is_shuffle: true            # Shuffle the dataset after each epoch.
###########################################################
#           OPTIMIZER & SCHEDULER SETTING                 #
###########################################################
optimizer_params:
    initial_learning_rate: 0.001
    end_learning_rate: 0.00005
    decay_steps: 150000       # A value < train_max_steps is recommended.
    warmup_proportion: 0.02
    weight_decay: 0.001


###########################################################
#                  INTERVAL SETTING                       #
###########################################################
train_max_steps: 200000           # Number of training steps.
save_interval_steps: 5000         # Interval steps to save a checkpoint.
eval_interval_steps: 500          # Interval steps to evaluate the network.
log_interval_steps: 200           # Interval steps to record the training log.
delay_f0_energy_steps: 3          # 2 steps use LR outputs only, then 1 step uses LR + F0 + Energy.
###########################################################
#                   OTHER SETTING                         #
###########################################################
num_save_intermediate_results: 1  # Number of batches to be saved as intermediate results.
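As a sanity check when editing a config like this, the nested structure can be inspected with PyYAML. A minimal sketch; the file name below is a placeholder, since the real path is hidden in this commit view:

```python
import yaml

# Placeholder file name: the real path is hidden in this commit view.
with open("fastspeech2.baker.v2.yaml") as f:
    config = yaml.safe_load(f)

# Model hyperparameters live under the nested fastspeech2_params block.
params = config["fastspeech2_params"]
print(params["dataset"], params["n_speakers"])    # baker 1
print(config["optimizer_params"]["decay_steps"])  # 150000
```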
Lines changed: 64 additions & 0 deletions

@@ -0,0 +1,64 @@
# FastSpeech2 multi-speaker (English)

## Prepare
Everything is done from the main repo folder, i.e. TensorflowTTS/.

0. (Optional) [Download](http://www.openslr.org/60/) and prepare LibriTTS (a helper notebook is in examples/fastspeech2_libritts/libri_experiment/prepare_libri.ipynb).
   Dataset structure after finishing this step:
   ```
   |- TensorFlowTTS/
   |  |- LibriTTS/
   |  |  |- train-clean-100/
   |  |  |- SPEAKERS.txt
   |  |  |- ...
   |  |- libritts/
   |  |  |- 200/
   |  |  |  |- 200_124139_000001_000000.txt
   |  |  |  |- 200_124139_000001_000000.wav
   |  |  |  |- ...
   |  |  |- 250/
   |  |  |- ...
   |  |- tensorflow_tts/
   |  |- models/
   |  |- ...
   ```
1. Extract durations (use examples/mfa_extraction or a pretrained Tacotron-2).
2. (Optional) Build the docker image:
   ```
   bash examples/fastspeech2_libritts/scripts/build.sh
   ```
3. (Optional) Run the docker container:
   ```
   bash examples/fastspeech2_libritts/scripts/interactive.sh
   ```
4. Preprocessing:
   ```
   tensorflow-tts-preprocess --rootdir ./libritts \
     --outdir ./dump_libritts \
     --config preprocess/preprocess_libritts.yaml \
     --dataset libritts
   ```
5. Normalization:
   ```
   tensorflow-tts-normalize --rootdir ./dump_libritts \
     --outdir ./dump_libritts \
     --config preprocess/preprocess_libritts.yaml \
     --dataset libritts
   ```
6. Change the CharactorDurationF0EnergyMelDataset speaker mapper in fastspeech2_dataset to match your dataset (if you use LibriTTS with mfa_extraction, you don't need to change anything).
7. Change train_libri.sh to match your dataset and run:
   ```
   bash examples/fastspeech2_libritts/scripts/train_libri.sh
   ```
8. (Optional) If you have problems with tensor size mismatches, check step 5 in the `examples/mfa_extraction` directory.

## Comments

This version uses the popular train.txt '|' split used in other repos. Training files should look like this (see the parsing sketch below):

Wav Path | Text | Speaker Name

Wav Path2 | Text | Speaker Name
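A minimal sketch of reading that pipe-split format (`train.txt` is an assumed file name; the columns are as listed above):

```python
# Parse the pipe-split training file described above.
# "train.txt" is an assumed name; adjust to your dataset.
with open("train.txt", encoding="utf-8") as f:
    for line in f:
        wav_path, text, speaker_name = [field.strip() for field in line.strip().split("|")]
        print(wav_path, speaker_name)
```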
Lines changed: 76 additions & 0 deletions

@@ -0,0 +1,76 @@
# This is the hyperparameter configuration file for FastSpeech2 v1.
# Please make sure this is adjusted for the LibriTTS dataset. If you want to
# apply it to another dataset, you might need to carefully change some parameters.
# This configuration trains for 200k iterations, but the best checkpoint is around 150k iterations.

###########################################################
#              FEATURE EXTRACTION SETTING                 #
###########################################################
hop_size: 256            # Hop size.
format: "npy"

###########################################################
#            NETWORK ARCHITECTURE SETTING                 #
###########################################################
model_type: fastspeech2

fastspeech2_params:
    dataset: "libritts"
    n_speakers: 20
    encoder_hidden_size: 384
    encoder_num_hidden_layers: 4
    encoder_num_attention_heads: 2
    encoder_attention_head_size: 192   # hidden_size // num_attention_heads
    encoder_intermediate_size: 1024
    encoder_intermediate_kernel_size: 3
    encoder_hidden_act: "mish"
    decoder_hidden_size: 384
    decoder_num_hidden_layers: 4
    decoder_num_attention_heads: 2
    decoder_attention_head_size: 192   # hidden_size // num_attention_heads
    decoder_intermediate_size: 1024
    decoder_intermediate_kernel_size: 3
    decoder_hidden_act: "mish"
    variant_prediction_num_conv_layers: 2
    variant_predictor_filter: 256
    variant_predictor_kernel_size: 3
    variant_predictor_dropout_rate: 0.5
    num_mels: 80
    hidden_dropout_prob: 0.2
    attention_probs_dropout_prob: 0.1
    max_position_embeddings: 2048
    initializer_range: 0.02
    output_attentions: False
    output_hidden_states: False

###########################################################
#                DATA LOADER SETTING                      #
###########################################################
batch_size: 32              # Batch size.
remove_short_samples: true  # Whether to remove samples whose length is less than batch_max_steps.
allow_cache: true           # Whether to allow caching in the dataset. If true, it requires CPU memory.
mel_length_threshold: 48    # Remove all targets with mel_length <= 48.
is_shuffle: true            # Shuffle the dataset after each epoch.
###########################################################
#           OPTIMIZER & SCHEDULER SETTING                 #
###########################################################
optimizer_params:
    initial_learning_rate: 0.0001
    end_learning_rate: 0.00001
    decay_steps: 120000       # A value < train_max_steps is recommended.
    warmup_proportion: 0.02
    weight_decay: 0.001


###########################################################
#                  INTERVAL SETTING                       #
###########################################################
train_max_steps: 150000           # Number of training steps.
save_interval_steps: 5000         # Interval steps to save a checkpoint.
eval_interval_steps: 5000         # Interval steps to evaluate the network.
log_interval_steps: 200           # Interval steps to record the training log.
###########################################################
#                   OTHER SETTING                         #
###########################################################
use_griffin: true                 # Whether to use Griffin-Lim on evaluation.
num_save_intermediate_results: 1  # Number of batches to be saved as intermediate results.
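The `hidden_size // num_attention_heads` comments above encode an invariant worth asserting when tweaking these configs. A hedged sketch (the file name is a placeholder, since the real path is hidden in this commit view):

```python
import yaml

# Placeholder file name: the real path is hidden in this commit view.
with open("fastspeech2.libritts.v1.yaml") as f:
    p = yaml.safe_load(f)["fastspeech2_params"]

# From the inline comments: attention_head_size = hidden_size // num_attention_heads.
assert p["encoder_attention_head_size"] == p["encoder_hidden_size"] // p["encoder_num_attention_heads"]  # 384 // 2 == 192
assert p["decoder_attention_head_size"] == p["decoder_hidden_size"] // p["decoder_num_attention_heads"]
```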
