
Commit 5b15bb9

🚩 Adjust configs and readmes
1 parent 2493011 commit 5b15bb9


5 files changed: +32, -28 lines changed


examples/multiband_melgan_hf/README.md

Lines changed: 25 additions & 25 deletions
@@ -1,23 +1,24 @@
-# Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech
-Based on the script [`train_multiband_melgan.py`](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/multiband_melgan/train_multiband_melgan.py).
+
+# Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech
+Based on the script [`train_multiband_melgan_hf.py`](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/multiband_melgan_hf/train_multiband_melgan_hf.py).

## Training Multi-band MelGAN from scratch with LJSpeech dataset.
-This example code show you how to train MelGAN from scratch with Tensorflow 2 based on custom training loop and tf.function. The data used for this example is LJSpeech, you can download the dataset at [link](https://keithito.com/LJ-Speech-Dataset/).
+This example shows you how to train MelGAN from scratch with TensorFlow 2 using a custom training loop and tf.function. The data used for this example is LJSpeech Ultimate; you can download the dataset at [link](https://machineexperiments.tumblr.com/post/662408083204685824/ljspeech-ultimate).

### Step 1: Create Tensorflow based Dataloader (tf.dataset)
Please see details at [examples/melgan/](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/melgan#step-1-create-tensorflow-based-dataloader-tfdataset)

### Step 2: Training from scratch
-After you re-define your dataloader, pls modify an input arguments, train_dataset and valid_dataset from [`train_multiband_melgan.py`](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/multiband_melgan/train_multiband_melgan.py). Here is an example command line to training melgan-stft from scratch:
+After you re-define your dataloader, please modify the input arguments train_dataset and valid_dataset in [`train_multiband_melgan_hf.py`](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/multiband_melgan_hf/train_multiband_melgan_hf.py). Here is an example command line to train melgan-stft from scratch:

First, you need to train the generator with only the stft loss:

```bash
-CUDA_VISIBLE_DEVICES=0 python examples/multiband_melgan/train_multiband_melgan.py \
+CUDA_VISIBLE_DEVICES=0 python examples/multiband_melgan_hf/train_multiband_melgan_hf.py \
  --train-dir ./dump/train/ \
  --dev-dir ./dump/valid/ \
-  --outdir ./examples/multiband_melgan/exp/train.multiband_melgan.v1/ \
-  --config ./examples/multiband_melgan/conf/multiband_melgan.v1.yaml \
+  --outdir ./examples/multiband_melgan_hf/exp/train.multiband_melgan_hf.v1/ \
+  --config ./examples/multiband_melgan_hf/conf/multiband_melgan_hf.lju.v1.yml \
  --use-norm 1 \
  --generator_mixed_precision 1 \
  --resume ""
@@ -26,27 +27,30 @@ CUDA_VISIBLE_DEVICES=0 python examples/multiband_melgan/train_multiband_melgan.p
Then resume and start training generator + discriminator:

```bash
-CUDA_VISIBLE_DEVICES=0 python examples/multiband_melgan/train_multiband_melgan.py \
+CUDA_VISIBLE_DEVICES=0 python examples/multiband_melgan_hf/train_multiband_melgan_hf.py \
  --train-dir ./dump/train/ \
  --dev-dir ./dump/valid/ \
-  --outdir ./examples/multiband_melgan/exp/train.multiband_melgan.v1/ \
-  --config ./examples/multiband_melgan/conf/multiband_melgan.v1.yaml \
+  --outdir ./examples/multiband_melgan_hf/exp/train.multiband_melgan_hf.v1/ \
+  --config ./examples/multiband_melgan_hf/conf/multiband_melgan_hf.lju.v1.yml \
  --use-norm 1 \
-  --resume ./examples/multiband_melgan/exp/train.multiband_melgan.v1/checkpoints/ckpt-200000
+  --resume ./examples/multiband_melgan_hf/exp/train.multiband_melgan_hf.v1/checkpoints/ckpt-200000
```

If you want to use multiple GPUs for training, you can replace `CUDA_VISIBLE_DEVICES=0` with `CUDA_VISIBLE_DEVICES=0,1,2,3`, for example; a minimal multi-GPU sketch follows below. You also need to tune the `batch_size` for each GPU (in the config file) yourself to maximize performance. Note that multi-GPU is currently supported for training but not yet for decoding.
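As a hedged sketch (not part of this commit), the multi-GPU variant only changes the `CUDA_VISIBLE_DEVICES` list; every other flag is the one shown above:

```bash
# Hedged sketch: identical to the single-GPU command above, only the visible devices change.
# Tune batch_size in the config file for your GPUs; decoding remains single-GPU.
CUDA_VISIBLE_DEVICES=0,1,2,3 python examples/multiband_melgan_hf/train_multiband_melgan_hf.py \
  --train-dir ./dump/train/ \
  --dev-dir ./dump/valid/ \
  --outdir ./examples/multiband_melgan_hf/exp/train.multiband_melgan_hf.v1/ \
  --config ./examples/multiband_melgan_hf/conf/multiband_melgan_hf.lju.v1.yml \
  --use-norm 1 \
  --resume ./examples/multiband_melgan_hf/exp/train.multiband_melgan_hf.v1/checkpoints/ckpt-200000
```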

In case you want to resume training, please follow the example command line below:

```bash
-  --resume ./examples/multiband_melgan/exp/train.multiband_melgan.v1/checkpoints/ckpt-100000
+  --resume ./examples/multiband_melgan_hf/exp/train.multiband_melgan_hf.v1/checkpoints/ckpt-100000
```

-If you want to finetune a model, use `--pretrained` like this with the filename of the generator
+If you want to finetune a model, use `--pretrained` like this, with the filenames of the generator and discriminator separated by a comma:
```bash
-  --pretrained ptgenerator.h5
+  --pretrained ptgenerator.h5,ptdiscriminator.h5
```
+It is recommended that you first train the text2mel model and then extract its postnets, so that the vocoder learns to compensate for its flaws. If you do so, append `--postnets 1` to the arguments; a hedged sketch of the full command follows below.
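A hedged sketch of what the full finetuning command might look like once postnets have been extracted. The `--pretrained` filenames are the placeholders from the block above, the config is the `.lju.v1ft.yml` file changed in this commit, and the exact combination of flags is an assumption rather than something taken verbatim from the repo:

```bash
# Hedged sketch: finetune generator + discriminator on postnet features.
# ptgenerator.h5 / ptdiscriminator.h5 and the output directory are placeholders.
CUDA_VISIBLE_DEVICES=0 python examples/multiband_melgan_hf/train_multiband_melgan_hf.py \
  --train-dir ./dump/train/ \
  --dev-dir ./dump/valid/ \
  --outdir ./examples/multiband_melgan_hf/exp/finetune.multiband_melgan_hf.lju.v1ft/ \
  --config ./examples/multiband_melgan_hf/conf/multiband_melgan_hf.lju.v1ft.yml \
  --use-norm 1 \
  --resume "" \
  --pretrained ptgenerator.h5,ptdiscriminator.h5 \
  --postnets 1
```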
+
+

**IMPORTANT NOTES**:

@@ -58,20 +62,20 @@ If you want to finetune a model, use `--pretrained` like this with the filename
To run inference on a folder of mel-spectrograms (e.g. the valid folder), run the command line below:

```bash
-CUDA_VISIBLE_DEVICES=0 python examples/multiband_melgan/decode_mb_melgan.py \
+CUDA_VISIBLE_DEVICES=0 python examples/multiband_melgan_hf/decode_mb_melgan.py \
  --rootdir ./dump/valid/ \
-  --outdir ./prediction/multiband_melgan.v1/ \
-  --checkpoint ./examples/multiband_melgan/exp/train.multiband_melgan.v1/checkpoints/generator-940000.h5 \
-  --config ./examples/multiband_melgan/conf/multiband_melgan.v1.yaml \
+  --outdir ./prediction/multiband_melgan_hf.v1/ \
+  --checkpoint ./examples/multiband_melgan_hf/exp/train.multiband_melgan_hf.v1/checkpoints/generator-920000.h5 \
+  --config ./examples/multiband_melgan_hf/conf/multiband_melgan_hf.lju.v1.yml \
  --batch-size 32 \
  --use-norm 1
```

## Finetune MelGAN STFT with ljspeech pretrained on other languages
-Just load pretrained model and training from scratch with other languages. **DO NOT FORGET** re-preprocessing on your dataset if needed. A hop_size should be 256 if you want to use our pretrained.
+Just load the pretrained model and train from scratch with other languages. **DO NOT FORGET** to re-preprocess your dataset if needed. The hop_size should be 512 if you want to use our pretrained model; a hedged sketch of the matching preprocessing parameters follows below.
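As a hedged reference only: the pretrained `multiband_melgan_hf.lju.v1` model in the table below uses 44.1 kHz audio with 2048 / 512 / 2048 FFT / hop / window and a 20-11025 Hz mel range, so re-preprocessing should match those values. A sketch of the corresponding preprocessing parameters, with field names assumed from typical TensorFlowTTS preprocessing configs:

```yaml
# Hedged sketch -- field names are assumptions; values come from the pretrained-model table below.
sampling_rate: 44100   # Fs [Hz]
fft_size: 2048         # FFT size [pt]
hop_size: 512          # hop size [pt]; must match the pretrained vocoder
win_length: 2048       # window length [pt]
fmin: 20               # lower mel-range bound [Hz]
fmax: 11025            # upper mel-range bound [Hz]
```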

## Learning Curves
-Here is a learning curves of melgan based on this config [`multiband_melgan.v1.yaml`](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/multiband_melgan/conf/multiband_melgan.v1.yaml)
+Here are the learning curves of melgan based on this config [`multiband_melgan_hf.v1.yaml`](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/multiband_melgan_hf/conf/multiband_melgan_hf.v1.yaml)

<img src="fig/eval.png" height="300" width="850">

@@ -80,11 +84,7 @@ Here is a learning curves of melgan based on this config [`multiband_melgan.v1.y
## Pretrained Models and Audio samples
| Model | Conf | Lang | Fs [Hz] | Mel range [Hz] | FFT / Hop / Win [pt] | # iters | Notes |
| :------ | :---: | :---: | :----: | :--------: | :---------------: | :-----: | :-----: |
-| [multiband_melgan.v1](https://drive.google.com/drive/folders/1Hg82YnPbX6dfF7DxVs4c96RBaiFbh-cT?usp=sharing) | [link](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/multiband_melgan/conf/multiband_melgan.v1.yaml) | EN | 22.05k | 80-7600 | 1024 / 256 / None | 940K | -|
-| [multiband_melgan.v1](https://drive.google.com/drive/folders/199XCXER51PWf_VzUpOwxfY_8XDfeXuZl?usp=sharing) | [link](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/multiband_melgan/conf/multiband_melgan.v1.yaml) | KO | 22.05k | 80-7600 | 1024 / 256 / None | 1000K | -|
-| [multiband_melgan.v1_24k](https://drive.google.com/drive/folders/14H6Oa8kGxlIhfZZFf6JFzWL5NVHDpKai?usp=sharing) | [link](https://drive.google.com/file/d/1l2jBwTWVVsRuT5FLDOIDToEhqWBmuCMe/view?usp=sharing) | EN | 24k | 80-7600 | 2048 / 300 / 1200 | 1000K | Converted from [kan-bayashi's model](https://drive.google.com/drive/folders/1jfB15igea6tOQ0hZJGIvnpf3QyNhTLnq?usp=sharing); good universal vocoder|
-
-
+| [multiband_melgan_hf.lju.v1](https://drive.google.com/drive/folders/1tOMzik_Nr4eY63gooKYSmNTJyXC6Pp55?usp=sharing) | [link](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/multiband_melgan_hf/conf/multiband_melgan_hf.lju.v1.yml) | EN | 44.1k | 20-11025 | 2048 / 512 / 2048 | 920K | -|


## Reference

examples/multiband_melgan_hf/conf/multiband_melgan_hf.lju.v1ft.yml

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@

#This is the hyperparameter configuration file for finetuning MB-MelGAN + MPD. It is intended to be used for finetuning the generator and discriminator.
-#Trains fast, it is mostly done at 30k steps, although one can still see small improvements beyond that
+#Trains fast and adapts to a new voice within about 30k steps, although it is beneficial to keep training beyond that, up to roughly 200k steps, depending on dataset size.

###########################################################
# FEATURE EXTRACTION SETTING #
@@ -98,7 +98,7 @@ discriminator_optimizer_params:
    lr_fn: "PiecewiseConstantDecay"
    lr_params:
        boundaries: [100000, 200000]
-        values: [0.00003125, 0.00003125, 0.00003125]
+        values: [0.000015625, 0.000015625, 0.000015625]
    amsgrad: false

###########################################################

examples/tacotron2/README.md

Lines changed: 3 additions & 0 deletions
@@ -75,6 +75,8 @@ CUDA_VISIBLE_DEVICES=0 python examples/tacotron2/extract_duration.py \
  --win-back 3
```

+To extract postnets for training the vocoder, follow the steps above but with `extract_postnets.py`; a hedged sketch follows below.
+
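A hedged sketch, assuming `extract_postnets.py` accepts the same style of arguments as `extract_duration.py` above; the checkpoint path, step count, and output directory are placeholders:

```bash
# Hedged sketch: the flag set and paths are assumptions mirroring extract_duration.py.
CUDA_VISIBLE_DEVICES=0 python examples/tacotron2/extract_postnets.py \
  --rootdir ./dump/train/ \
  --outdir ./dump/train/postnets/ \
  --checkpoint ./examples/tacotron2/exp/train.tacotron2.v1/checkpoints/model-65000.h5 \
  --use-norm 1 \
  --config ./examples/tacotron2/conf/tacotron2.v1.yaml \
  --batch-size 32
```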
You can also download my extracted durations at 40k steps at [link](https://drive.google.com/drive/u/1/folders/1kaPXRdLg9gZrll9KtvH3-feOBMM8sn3_?usp=drive_open).

## Finetune Tacotron-2 with ljspeech pretrained on other languages
@@ -116,6 +118,7 @@ Here is a result of tacotron2 based on this config [`tacotron2.v1.yaml`](https:/
| :------ | :---: | :---: | :----: | :--------: | :---------------: | :-----: | :-----: |
| [tacotron2.v1](https://drive.google.com/open?id=1kaPXRdLg9gZrll9KtvH3-feOBMM8sn3_) | [link](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/tacotron2/conf/tacotron2.v1.yaml) | EN | 22.05k | 80-7600 | 1024 / 256 / None | 65K | 1
| [tacotron2.v1](https://drive.google.com/drive/folders/1WMBe01BBnYf3sOxMhbvnF2CUHaRTpBXJ?usp=sharing) | [link](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/tacotron2/conf/tacotron2.kss.v1.yaml) | KO | 22.05k | 80-7600 | 1024 / 256 / None | 100K | 1
+| [tacotron2.lju.v1](https://drive.google.com/drive/folders/1tOMzik_Nr4eY63gooKYSmNTJyXC6Pp55?usp=sharing) | [link](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/tacotron2/conf/tacotron2.lju.v1.yaml) | EN | 44.1k | 20-11025 | 2048 / 512 / None | 126K | 1

## Reference

examples/tacotron2/conf/tacotron2.lju.v1.yaml

Lines changed: 1 addition & 1 deletion
@@ -73,7 +73,7 @@ var_train_expr: null # trainable variable expr (eg. 'embeddings|decoder_cell' )
###########################################################
# INTERVAL SETTING #
###########################################################
-train_max_steps: 170000 # Number of training steps.
+train_max_steps: 200000 # Number of training steps.
save_interval_steps: 2000 # Interval steps to save checkpoint.
eval_interval_steps: 500 # Interval steps to evaluate the network.
log_interval_steps: 200 # Interval steps to record the training log.
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
{"symbol_to_id": {"pad": 0, "-": 1, "!": 2, "'": 3, "(": 4, ")": 5, ",": 6, ".": 7, ":": 8, ";": 9, "?": 10, "@AA": 11, "@AA0": 12, "@AA1": 13, "@AA2": 14, "@AE": 15, "@AE0": 16, "@AE1": 17, "@AE2": 18, "@AH": 19, "@AH0": 20, "@AH1": 21, "@AH2": 22, "@AO": 23, "@AO0": 24, "@AO1": 25, "@AO2": 26, "@AW": 27, "@AW0": 28, "@AW1": 29, "@AW2": 30, "@AY": 31, "@AY0": 32, "@AY1": 33, "@AY2": 34, "@B": 35, "@CH": 36, "@D": 37, "@DH": 38, "@EH": 39, "@EH0": 40, "@EH1": 41, "@EH2": 42, "@ER": 43, "@ER0": 44, "@ER1": 45, "@ER2": 46, "@EY": 47, "@EY0": 48, "@EY1": 49, "@EY2": 50, "@F": 51, "@G": 52, "@HH": 53, "@IH": 54, "@IH0": 55, "@IH1": 56, "@IH2": 57, "@IY": 58, "@IY0": 59, "@IY1": 60, "@IY2": 61, "@JH": 62, "@K": 63, "@L": 64, "@M": 65, "@N": 66, "@NG": 67, "@OW": 68, "@OW0": 69, "@OW1": 70, "@OW2": 71, "@OY": 72, "@OY0": 73, "@OY1": 74, "@OY2": 75, "@P": 76, "@R": 77, "@S": 78, "@SH": 79, "@T": 80, "@TH": 81, "@UH": 82, "@UH0": 83, "@UH1": 84, "@UH2": 85, "@UW": 86, "@UW0": 87, "@UW1": 88, "@UW2": 89, "@V": 90, "@W": 91, "@Y": 92, "@Z": 93, "@ZH": 94, "eos": 95}, "id_to_symbol": {"0": "pad", "1": "-", "2": "!", "3": "'", "4": "(", "5": ")", "6": ",", "7": ".", "8": ":", "9": ";", "10": "?", "11": "@AA", "12": "@AA0", "13": "@AA1", "14": "@AA2", "15": "@AE", "16": "@AE0", "17": "@AE1", "18": "@AE2", "19": "@AH", "20": "@AH0", "21": "@AH1", "22": "@AH2", "23": "@AO", "24": "@AO0", "25": "@AO1", "26": "@AO2", "27": "@AW", "28": "@AW0", "29": "@AW1", "30": "@AW2", "31": "@AY", "32": "@AY0", "33": "@AY1", "34": "@AY2", "35": "@B", "36": "@CH", "37": "@D", "38": "@DH", "39": "@EH", "40": "@EH0", "41": "@EH1", "42": "@EH2", "43": "@ER", "44": "@ER0", "45": "@ER1", "46": "@ER2", "47": "@EY", "48": "@EY0", "49": "@EY1", "50": "@EY2", "51": "@F", "52": "@G", "53": "@HH", "54": "@IH", "55": "@IH0", "56": "@IH1", "57": "@IH2", "58": "@IY", "59": "@IY0", "60": "@IY1", "61": "@IY2", "62": "@JH", "63": "@K", "64": "@L", "65": "@M", "66": "@N", "67": "@NG", "68": "@OW", "69": "@OW0", "70": "@OW1", "71": "@OW2", "72": "@OY", "73": "@OY0", "74": "@OY1", "75": "@OY2", "76": "@P", "77": "@R", "78": "@S", "79": "@SH", "80": "@T", "81": "@TH", "82": "@UH", "83": "@UH0", "84": "@UH1", "85": "@UH2", "86": "@UW", "87": "@UW0", "88": "@UW1", "89": "@UW2", "90": "@V", "91": "@W", "92": "@Y", "93": "@Z", "94": "@ZH", "95": "eos"}, "speakers_map": {"ljspeech": 0}, "processor_name": "LJSpeechUltimateProcessor"}
