TensorSpeech
diff --git a/examples/multiband_pwgan/README.md b/examples/multiband_pwgan/README.md
Lines changed: 84 additions & 0 deletions
diff --git a/examples/multiband_pwgan/conf/multiband_pwgan.v1.yaml b/examples/multiband_pwgan/conf/multiband_pwgan.v1.yaml
Lines changed: 100 additions & 0 deletions
diff --git a/examples/multiband_pwgan/conf/multiband_pwgan.v1ft.yaml b/examples/multiband_pwgan/conf/multiband_pwgan.v1ft.yaml
Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,84 @@
+# Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech (With ParallelWaveGAN discriminator)
+Based on the script [`train_multiband_pwgan.py`](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/multiband_pwgan/train_multiband_pwgan.py).
+
+## Training Multi-band MelGAN from scratch with the LJSpeech dataset
+This example shows how to train Multi-band MelGAN from scratch with TensorFlow 2, using a custom training loop and `tf.function`. The data used for this example is LJSpeech, which you can download [here](https://keithito.com/LJ-Speech-Dataset/).
+
+### Step 1: Create Tensorflow based Dataloader (tf.dataset)
+Please see the details at [examples/melgan/](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/melgan#step-1-create-tensorflow-based-dataloader-tfdataset).
+
+### Step 2: Training from scratch
+After you have redefined your dataloader, modify the input arguments `train_dataset` and `valid_dataset` in [`train_multiband_pwgan.py`](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/multiband_pwgan/train_multiband_pwgan.py). Here is an example command line for training Multi-band MelGAN from scratch:
+
+First, train the generator with only the STFT loss:
+
+```bash
+CUDA_VISIBLE_DEVICES=0 python examples/multiband_pwgan/train_multiband_pwgan.py \
+  --train-dir ./dump/train/ \
+  --dev-dir ./dump/valid/ \
+  --outdir ./examples/multiband_pwgan/exp/train.multiband_melgan.v1/ \
+  --config ./examples/multiband_pwgan/conf/multiband_pwgan.v1.yaml \
+  --use-norm 1 \
+  --generator_mixed_precision 1 \
+  --resume ""
+```
+
+Then resume from that checkpoint and train the generator and discriminator together:
+
+```bash
+CUDA_VISIBLE_DEVICES=0 python examples/multiband_pwgan/train_multiband_pwgan.py \
+  --train-dir ./dump/train/ \
+  --dev-dir ./dump/valid/ \
+  --outdir ./examples/multiband_pwgan/exp/train.multiband_melgan.v1/ \
+  --config ./examples/multiband_pwgan/conf/multiband_pwgan.v1.yaml \
+  --use-norm 1 \
+  --resume ./examples/multiband_pwgan/exp/train.multiband_melgan.v1/checkpoints/ckpt-200000
+```
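The two commands above correspond to a two-phase schedule: the generator first trains on the STFT losses alone, and the discriminator joins at `discriminator_train_start_steps` (200000 in `multiband_pwgan.v1.yaml`, which matches the `ckpt-200000` checkpoint passed to `--resume`). A minimal Python sketch of that gating follows. The function name `active_losses`, the loss labels, and the exact comparison (`>=`) are illustrative assumptions, not code taken from `train_multiband_pwgan.py`.

```python
def active_losses(step, discriminator_train_start_steps=200_000):
    """Return the loss terms assumed to be active at a given training step."""
    # Phase 1: generator-only training on full-band and sub-band STFT losses.
    losses = ["stft", "subband_stft"]
    # Phase 2 (assumed boundary): the adversarial loss is added once the
    # discriminator starts training.
    if step >= discriminator_train_start_steps:
        losses.append("adversarial")
    return losses


if __name__ == "__main__":
    for step in (0, 199_999, 200_000):
        print(step, active_losses(step))
```

With the finetuning config (`discriminator_train_start_steps: 0`), the same sketch would report the adversarial loss as active from the first step.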
+
+If you want to train on multiple GPUs, replace `CUDA_VISIBLE_DEVICES=0` with, for example, `CUDA_VISIBLE_DEVICES=0,1,2,3`. You also need to tune the `batch_size` for each GPU (in the config file) yourself to maximize performance. Note that multi-GPU is currently supported for training but not yet for decoding.
+
+To resume training from a checkpoint, pass the `--resume` argument as in the example below:
+
+```bash
+--resume ./examples/multiband_pwgan/exp/train.multiband_melgan.v1/checkpoints/ckpt-100000
+```
+
+**IMPORTANT NOTES**:
+
+- If your dataset is 16 kHz, `upsample_scales = [2, 4, 8]` worked.
+- If your dataset is above 16 kHz (22 kHz, 24 kHz, ...), `upsample_scales = [2, 4, 8]` did not work; use `[8, 4, 2]` instead.
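One way to sanity-check a choice of `upsample_scales`: the Multi-band MelGAN generator emits `out_channels` sub-band signals that are merged back into a full-band waveform, so the product of `upsample_scales` multiplied by the number of sub-bands should equal `hop_size`. The sketch below uses the values from `multiband_pwgan.v1.yaml`; the helper name `total_upsampling` is hypothetical. Both orderings give the same product, so the notes above concern the order of the layers, not the overall upsampling factor.

```python
from math import prod


def total_upsampling(upsample_scales, out_channels):
    """Samples produced per mel frame: generator upsampling times sub-band count."""
    return prod(upsample_scales) * out_channels


if __name__ == "__main__":
    hop_size = 256  # from the config file
    for scales in ([8, 4, 2], [2, 4, 8]):
        factor = total_upsampling(scales, out_channels=4)
        print(scales, factor, "matches hop_size" if factor == hop_size else "MISMATCH")
```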
+
+### Step 3: Decode audio from a folder of mel-spectrograms
+To run inference on a folder of mel-spectrograms (e.g. the valid folder), run the command below:
+
+```bash
+CUDA_VISIBLE_DEVICES=0 python examples/multiband_pwgan/decode_mb_melgan.py \
+  --rootdir ./dump/valid/ \
+  --outdir ./prediction/multiband_melgan.v1/ \
+  --checkpoint ./examples/multiband_pwgan/exp/train.multiband_melgan.v1/checkpoints/generator-940000.h5 \
+  --config ./examples/multiband_pwgan/conf/multiband_pwgan.v1.yaml \
+  --batch-size 32 \
+  --use-norm 1
+```
+
+## Finetune the LJSpeech pretrained model on other languages
+Just load the pretrained model and train on the other language as you would from scratch. **DO NOT FORGET** to re-run preprocessing on your dataset if needed. The `hop_size` should be 256 if you want to use our pretrained model.
+
+## Learning Curves
+Here are the learning curves of the model trained with the config [`multiband_pwgan.v1.yaml`](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/multiband_pwgan/conf/multiband_pwgan.v1.yaml):
+
+<img src="fig/eval.png" height="300" width="850">
+
+<img src="fig/train.png" height="300" width="850">
+
+## Pretrained Models and Audio samples
+| Model                                                                                                          | Conf                                                                                                                        | Lang  | Fs [Hz] | Mel range [Hz] | FFT / Hop / Win [pt] | # iters |
+| :------                                                                                                        | :---:                                                                                                                       | :---: | :----:  | :--------:     | :---------------:    | :-----: |
+| [multiband_melgan.v1](https://drive.google.com/drive/folders/1Hg82YnPbX6dfF7DxVs4c96RBaiFbh-cT?usp=sharing)             | [link](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/multiband_pwgan/conf/multiband_pwgan.v1.yaml)          | EN    | 22.05k  | 80-7600        | 1024 / 256 / None    | 940K    |
+| [multiband_melgan.v1](https://drive.google.com/drive/folders/199XCXER51PWf_VzUpOwxfY_8XDfeXuZl?usp=sharing)             | [link](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/multiband_pwgan/conf/multiband_pwgan.v1.yaml)          | KO    | 22.05k  | 80-7600        | 1024 / 256 / None    | 1000K    |
+
+## Reference
+
+1. https://github.com/kan-bayashi/ParallelWaveGAN
+2. [Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram](https://arxiv.org/abs/1910.11480)
+3. [Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech](https://arxiv.org/abs/2005.05106)
@@ -0,0 +1,100 @@
+
+# This is the hyperparameter configuration file for Multi-Band MelGAN with PWGAN discriminator.
+# Note that this is adjusted for the LJSpeech dataset. If you want to
+# apply it to another dataset, you might need to carefully change some parameters.
+# This configuration performs 1000k iters.
+
+###########################################################
+#                FEATURE EXTRACTION SETTING               #
+###########################################################
+sampling_rate: 22050
+hop_size: 256            # Hop size.
+format: "npy"
+
+
+###########################################################
+#         GENERATOR NETWORK ARCHITECTURE SETTING          #
+###########################################################
+model_type: "multiband_melgan_generator"
+
+multiband_melgan_generator_params:
+    out_channels: 4               # Number of output channels (number of subbands).
+    kernel_size: 7                # Kernel size of initial and final conv layers.
+    filters: 384                  # Initial number of channels for conv layers.
+    upsample_scales: [8, 4, 2]    # List of Upsampling scales.
+    stack_kernel_size: 3          # Kernel size of dilated conv layers in residual stack.
+    stacks: 4                     # Number of stacks in a single residual stack module.
+    is_weight_norm: false         # Use weight-norm or not.
+
+###########################################################
+#       DISCRIMINATOR NETWORK ARCHITECTURE SETTING        #
+###########################################################
+parallel_wavegan_discriminator_params:
+    out_channels: 1       # Number of output channels.
+    kernel_size: 3        # Kernel size of conv layers.
+    n_layers: 10            # Number of conv layers.
+    conv_channels: 64     # Number of channels in conv layers.
+    use_bias: true            # Whether to use bias parameter in conv.
+    nonlinear_activation: "LeakyReLU" # Nonlinear function after each conv.
+    nonlinear_activation_params:      # Nonlinear function parameters
+        alpha: 0.2           # Alpha in LeakyReLU.
+
+###########################################################
+#                   STFT LOSS SETTING                     #
+###########################################################
+stft_loss_params:
+    fft_lengths: [1024, 2048, 512]  # List of FFT size for STFT-based loss.
+    frame_steps: [120, 240, 50]     # List of hop size for STFT-based loss
+    frame_lengths: [600, 1200, 240] # List of window length for STFT-based loss.
+
+subband_stft_loss_params:
+    fft_lengths: [384, 683, 171]  # List of FFT size for STFT-based loss.
+    frame_steps: [30, 60, 10]     # List of hop size for STFT-based loss
+    frame_lengths: [150, 300, 60] # List of window length for STFT-based loss.
+
+###########################################################
+#               ADVERSARIAL LOSS SETTING                  #
+###########################################################
+lambda_feat_match: 10.0      # Loss balancing coefficient for feature matching loss
+lambda_adv: 2.5              # Loss balancing coefficient for adversarial loss.
+
+###########################################################
+#                  DATA LOADER SETTING                    #
+###########################################################
+batch_size: 64                 # Batch size.
+batch_max_steps: 8192          # Length of each audio in batch for training. Make sure it is divisible by hop_size.
+batch_max_steps_valid: 81920   # Length of each audio for validation. Make sure it is divisible by hop_size.
+remove_short_samples: true     # Whether to remove samples the length of which are less than batch_max_steps.
+allow_cache: true              # Whether to allow cache in dataset. If true, it requires cpu memory.
+is_shuffle: true               # shuffle dataset after each epoch.
+
+###########################################################
+#             OPTIMIZER & SCHEDULER SETTING               #
+###########################################################
+generator_optimizer_params:
+    lr_fn: "PiecewiseConstantDecay"
+    lr_params: 
+        boundaries: [100000, 200000, 300000, 400000, 500000, 600000, 700000]
+        values: [0.0005, 0.0005, 0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
+    amsgrad: false
+
+discriminator_optimizer_params:
+    lr_fn: "PiecewiseConstantDecay"
+    lr_params: 
+        boundaries: [100000, 200000, 300000, 400000, 500000]
+        values: [0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
+    amsgrad: false
+
+###########################################################
+#                    INTERVAL SETTING                     #
+###########################################################
+discriminator_train_start_steps: 200000  # Step at which discriminator training begins.
+train_max_steps: 4000000                 # Number of training steps.
+save_interval_steps: 20000               # Interval steps to save checkpoint.
+eval_interval_steps: 5000                # Interval steps to evaluate the network.
+log_interval_steps: 200                  # Interval steps to record the training log.
+
+###########################################################
+#                     OTHER SETTING                       #
+###########################################################
+num_save_intermediate_results: 1  # Number of batch to be saved as intermediate results.
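For readers unfamiliar with `PiecewiseConstantDecay` as used in the optimizer settings above: the schedule holds `values[i]` until the step passes `boundaries[i]`, so `values` needs exactly one more entry than `boundaries`. The pure-Python sketch below mirrors the documented behaviour of `tf.keras.optimizers.schedules.PiecewiseConstantDecay` (a step equal to a boundary still uses the earlier value) with the generator settings from this file. The helper `piecewise_constant` is hypothetical and not part of the repository.

```python
from bisect import bisect_left


def piecewise_constant(step, boundaries, values):
    """Look up the learning rate for `step` in a piecewise-constant schedule."""
    if len(values) != len(boundaries) + 1:
        raise ValueError("values must have exactly one more entry than boundaries")
    # bisect_left counts the boundaries strictly below `step`, so a step that
    # equals a boundary still maps to the earlier value.
    return values[bisect_left(boundaries, step)]


if __name__ == "__main__":
    # Generator schedule copied from the config above.
    boundaries = [100000, 200000, 300000, 400000, 500000, 600000, 700000]
    values = [0.0005, 0.0005, 0.00025, 0.000125, 0.0000625,
              0.00003125, 0.000015625, 0.000001]
    for step in (0, 200000, 200001, 800000):
        print(step, piecewise_constant(step, boundaries, values))
```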
@@ -0,0 +1,98 @@
+
+# This is the hyperparameter configuration file for Multi-Band MelGAN with PWGAN discriminator.
+# This one is adjusted for finetuning
+
+###########################################################
+#                FEATURE EXTRACTION SETTING               #
+###########################################################
+sampling_rate: 22050
+hop_size: 256            # Hop size.
+format: "npy"
+
+
+###########################################################
+#         GENERATOR NETWORK ARCHITECTURE SETTING          #
+###########################################################
+model_type: "multiband_melgan_generator"
+
+multiband_melgan_generator_params:
+    out_channels: 4               # Number of output channels (number of subbands).
+    kernel_size: 7                # Kernel size of initial and final conv layers.
+    filters: 384                  # Initial number of channels for conv layers.
+    upsample_scales: [8, 4, 2]    # List of Upsampling scales.
+    stack_kernel_size: 3          # Kernel size of dilated conv layers in residual stack.
+    stacks: 4                     # Number of stacks in a single residual stack module.
+    is_weight_norm: false         # Use weight-norm or not.
+
+###########################################################
+#       DISCRIMINATOR NETWORK ARCHITECTURE SETTING        #
+###########################################################
+parallel_wavegan_discriminator_params:
+    out_channels: 1       # Number of output channels.
+    kernel_size: 3        # Kernel size of conv layers.
+    n_layers: 10            # Number of conv layers.
+    conv_channels: 64     # Number of channels in conv layers.
+    use_bias: true            # Whether to use bias parameter in conv.
+    nonlinear_activation: "LeakyReLU" # Nonlinear function after each conv.
+    nonlinear_activation_params:      # Nonlinear function parameters
+        alpha: 0.2           # Alpha in LeakyReLU.
+
+###########################################################
+#                   STFT LOSS SETTING                     #
+###########################################################
+stft_loss_params:
+    fft_lengths: [1024, 2048, 512]  # List of FFT size for STFT-based loss.
+    frame_steps: [120, 240, 50]     # List of hop size for STFT-based loss
+    frame_lengths: [600, 1200, 240] # List of window length for STFT-based loss.
+
+subband_stft_loss_params:
+    fft_lengths: [384, 683, 171]  # List of FFT size for STFT-based loss.
+    frame_steps: [30, 60, 10]     # List of hop size for STFT-based loss
+    frame_lengths: [150, 300, 60] # List of window length for STFT-based loss.
+
+###########################################################
+#               ADVERSARIAL LOSS SETTING                  #
+###########################################################
+lambda_feat_match: 10.0      # Loss balancing coefficient for feature matching loss
+lambda_adv: 2.5              # Loss balancing coefficient for adversarial loss.
+
+###########################################################
+#                  DATA LOADER SETTING                    #
+###########################################################
+batch_size: 64                 # Batch size.
+batch_max_steps: 8192          # Length of each audio in batch for training. Make sure it is divisible by hop_size.
+batch_max_steps_valid: 81920   # Length of each audio for validation. Make sure it is divisible by hop_size.
+remove_short_samples: true     # Whether to remove samples the length of which are less than batch_max_steps.
+allow_cache: true              # Whether to allow cache in dataset. If true, it requires cpu memory.
+is_shuffle: true               # shuffle dataset after each epoch.
+
+###########################################################
+#             OPTIMIZER & SCHEDULER SETTING               #
+###########################################################
+generator_optimizer_params:
+    lr_fn: "PiecewiseConstantDecay"
+    lr_params: 
+        boundaries: [100000, 200000, 300000, 400000, 500000, 600000, 700000]
+        values: [0.0005, 0.0005, 0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
+    amsgrad: false
+
+discriminator_optimizer_params:
+    lr_fn: "PiecewiseConstantDecay"
+    lr_params: 
+        boundaries: [100000, 200000, 300000, 400000, 500000]
+        values: [0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
+    amsgrad: false
+
+###########################################################
+#                    INTERVAL SETTING                     #
+###########################################################
+discriminator_train_start_steps: 0  # Step at which discriminator training begins.
+train_max_steps: 200000                 # Number of training steps.
+save_interval_steps: 5000               # Interval steps to save checkpoint.
+eval_interval_steps: 1000                # Interval steps to evaluate the network.
+log_interval_steps: 200                  # Interval steps to record the training log.
+
+###########################################################
+#                     OTHER SETTING                       #
+###########################################################
+num_save_intermediate_results: 1  # Number of batch to be saved as intermediate results.
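The data loader comments in both configs ask that `batch_max_steps` and `batch_max_steps_valid` be divisible by `hop_size`, since each mel frame covers `hop_size` waveform samples. A small check along these lines could be run before training on a new dataset. The dictionary below copies values from this file, and the helper `frames_per_segment` is a hypothetical name, not a function from the repository.

```python
def frames_per_segment(num_samples, hop_size):
    """Number of mel frames covering `num_samples` waveform samples."""
    if num_samples % hop_size != 0:
        raise ValueError(f"{num_samples} is not divisible by hop_size={hop_size}")
    return num_samples // hop_size


if __name__ == "__main__":
    # Values copied from the data loader section of the config.
    config = {"hop_size": 256, "batch_max_steps": 8192, "batch_max_steps_valid": 81920}
    for key in ("batch_max_steps", "batch_max_steps_valid"):
        print(key, frames_per_segment(config[key], config["hop_size"]), "frames")
```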