diff --git a/examples/multiband_pwgan/README.md b/examples/multiband_pwgan/README.md
diff --git a/examples/multiband_pwgan/conf/multiband_pwgan.v1.yaml b/examples/multiband_pwgan/conf/multiband_pwgan.v1.yaml
diff --git a/examples/multiband_pwgan/conf/multiband_pwgan.v1ft.yaml b/examples/multiband_pwgan/conf/multiband_pwgan.v1ft.yaml
@@ -0,0 +1,86 @@
+# Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech (With ParallelWaveGAN discriminator)
+Based on the script [`train_multiband_pwgan.py`](https://github.com/tensorspeech/TensorflowTTS/tree/master/examples/multiband_pwgan/train_multiband_pwgan.py).
+
+## Training Multi-band MelGAN with PWGAN discriminator from scratch with LJSpeech dataset
+This example shows how to train Multi-band MelGAN from scratch with TensorFlow 2, using a custom training loop and `tf.function`. The data used in this example is LJSpeech, which you can download [here](https://keithito.com/LJ-Speech-Dataset/).
+
+### Step 1: Create Tensorflow based Dataloader (tf.dataset)
+See the details in [examples/melgan/](https://github.com/tensorspeech/TensorflowTTS/tree/master/examples/melgan#step-1-create-tensorflow-based-dataloader-tfdataset).
+
+### Step 2: Training from scratch
+After you redefine your dataloader, modify the input arguments `train_dataset` and `valid_dataset` in [`train_multiband_pwgan.py`](https://github.com/tensorspeech/TensorflowTTS/tree/master/examples/multiband_pwgan/train_multiband_pwgan.py). The example commands below train Multi-band MelGAN with the PWGAN discriminator from scratch in two stages.
+
+First, train the generator with only the STFT loss:
+
+```bash
+CUDA_VISIBLE_DEVICES=0 python examples/multiband_pwgan/train_multiband_pwgan.py \
+  --train-dir ./dump/train/ \
+  --dev-dir ./dump/valid/ \
+  --outdir ./examples/multiband_pwgan/exp/train.multiband_pwgan.v1/ \
+  --config ./examples/multiband_pwgan/conf/multiband_pwgan.v1.yaml \
+  --use-norm 1 \
+  --generator_mixed_precision 1 \
+  --resume ""
+```
+
+Then resume from the saved checkpoint and train the generator and discriminator together:
+
+```bash
+CUDA_VISIBLE_DEVICES=0 python examples/multiband_pwgan/train_multiband_pwgan.py \
+  --train-dir ./dump/train/ \
+  --dev-dir ./dump/valid/ \
+  --outdir ./examples/multiband_pwgan/exp/train.multiband_pwgan.v1/ \
+  --config ./examples/multiband_pwgan/conf/multiband_pwgan.v1.yaml \
+  --use-norm 1 \
+  --resume ./examples/multiband_pwgan/exp/train.multiband_pwgan.v1/checkpoints/ckpt-200000
+```
+
+If you want to train on multiple GPUs, replace `CUDA_VISIBLE_DEVICES=0` with, for example, `CUDA_VISIBLE_DEVICES=0,1,2,3`. You also need to tune the per-GPU `batch_size` (in the config file) yourself to maximize performance. Note that multi-GPU is currently supported for training but not yet for decoding.
+
+To resume training, pass the checkpoint to `--resume`, as in the example below:
+
+```bash
+--resume ./examples/multiband_pwgan/exp/train.multiband_pwgan.v1/checkpoints/ckpt-100000
+```
+
+**IMPORTANT NOTES**:
+
+- If your dataset is sampled at 16 kHz, `upsample_scales = [2, 4, 8]` works.
+- If your dataset is sampled above 16 kHz (22 kHz, 24 kHz, ...), `upsample_scales = [2, 4, 8]` does not work; use `[8, 4, 2]` instead.
+
+### Step 3: Decode audio from a folder of mel-spectrograms
+To run inference on a folder of mel-spectrograms (e.g. the valid folder), run the command below:
+
+```bash
+CUDA_VISIBLE_DEVICES=0 python examples/multiband_pwgan/decode_mb_melgan.py \
+  --rootdir ./dump/valid/ \
+  --outdir ./prediction/multiband_melgan.v1/ \
+  --checkpoint ./examples/multiband_pwgan/exp/train.multiband_pwgan.v1/checkpoints/generator-940000.h5 \
+  --config ./examples/multiband_pwgan/conf/multiband_pwgan.v1.yaml \
+  --batch-size 32 \
+  --use-norm 1
+```
+
+## Finetune the LJSpeech-pretrained Multi-Band MelGAN + PWGAN discriminator on other languages
+Download the generator weights of any Multi-Band MelGAN model and pass them to the `--pretrained` argument.
+It's recommended to use (and tune if necessary) the dedicated finetuning config `multiband_pwgan.v1ft.yaml`:
+
+```bash
+CUDA_VISIBLE_DEVICES=0 python examples/multiband_pwgan/train_multiband_pwgan.py \
+  --train-dir ./dump/train/ \
+  --dev-dir ./dump/valid/ \
+  --outdir ./examples/multiband_pwgan/exp/train.multiband_pwgan.v1/ \
+  --config ./examples/multiband_pwgan/conf/multiband_pwgan.v1ft.yaml \
+  --use-norm 1 \
+  --generator_mixed_precision 1 \
+  --pretrained "ptgen.h5"
+```
+
+## Notes
+1. The discriminator is trained with RAdam.
+
+## Reference
+
+1. [kan-bayashi/ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)
+2. [Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram](https://arxiv.org/abs/1910.11480)
+3. [Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech](https://arxiv.org/abs/2005.05106)
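
Note on the `--resume` paths used in the README above: they point at a checkpoint prefix such as `ckpt-200000` rather than at a single file. The sketch below is one hypothetical way to find the newest prefix in a `checkpoints/` directory. It assumes each checkpoint is stored with a `ckpt-<step>.index` file next to its data shards, which is how TensorFlow checkpoints are usually laid out. The helper name `latest_checkpoint_prefix` is made up for illustration and is not part of `train_multiband_pwgan.py`.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint_prefix(checkpoint_dir: str) -> Optional[str]:
    """Return the 'ckpt-<step>' prefix with the highest step, or None.

    Assumption: every checkpoint has a file named 'ckpt-<step>.index'.
    Other files (for example 'generator-940000.h5') are ignored.
    """
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return None

    pattern = re.compile(r"^ckpt-(\d+)\.index$")
    best_step = -1
    best_prefix = None
    for path in directory.iterdir():
        match = pattern.match(path.name)
        if match is None:
            continue
        step = int(match.group(1))  # compare numerically, not as strings
        if step > best_step:
            best_step = step
            best_prefix = str(path.with_suffix(""))  # drop '.index'
    return best_prefix
```

The returned string could then be passed to `--resume`. When the directory also contains TensorFlow's `checkpoint` state file, `tf.train.latest_checkpoint(checkpoint_dir)` serves a similar purpose.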
@@ -0,0 +1,101 @@
+
+# This is the hyperparameter configuration file for Multi-Band MelGAN with PWGAN discriminator.
+# It is adjusted for the LJSpeech dataset. If you want to
+# apply it to another dataset, you may need to carefully change some parameters.
+# This configuration performs 4000k iters (see train_max_steps).
+
+###########################################################
+#                FEATURE EXTRACTION SETTING               #
+###########################################################
+sampling_rate: 22050
+hop_size: 256            # Hop size.
+format: "npy"
+
+
+###########################################################
+#         GENERATOR NETWORK ARCHITECTURE SETTING          #
+###########################################################
+model_type: "multiband_melgan_generator"
+
+multiband_melgan_generator_params:
+    out_channels: 4               # Number of output channels (number of subbands).
+    kernel_size: 7                # Kernel size of initial and final conv layers.
+    filters: 384                  # Initial number of channels for conv layers.
+    upsample_scales: [8, 4, 2]    # List of Upsampling scales.
+    stack_kernel_size: 3          # Kernel size of dilated conv layers in residual stack.
+    stacks: 4                     # Number of stacks in a single residual stack module.
+    is_weight_norm: false         # Use weight-norm or not.
+
+###########################################################
+#       DISCRIMINATOR NETWORK ARCHITECTURE SETTING        #
+###########################################################
+parallel_wavegan_discriminator_params:
+    out_channels: 1       # Number of output channels.
+    kernel_size: 3        # Kernel size of conv layers.
+    n_layers: 10            # Number of conv layers.
+    conv_channels: 64     # Number of channels in conv layers.
+    use_bias: true            # Whether to use bias parameter in conv.
+    nonlinear_activation: "LeakyReLU" # Nonlinear function after each conv.
+    nonlinear_activation_params:      # Nonlinear function parameters
+        alpha: 0.2           # Alpha in LeakyReLU.
+
+###########################################################
+#                   STFT LOSS SETTING                     #
+###########################################################
+stft_loss_params:
+    fft_lengths: [1024, 2048, 512]  # List of FFT size for STFT-based loss.
+    frame_steps: [120, 240, 50]     # List of hop size for STFT-based loss
+    frame_lengths: [600, 1200, 240] # List of window length for STFT-based loss.
+
+subband_stft_loss_params:
+    fft_lengths: [384, 683, 171]  # List of FFT size for STFT-based loss.
+    frame_steps: [30, 60, 10]     # List of hop size for STFT-based loss
+    frame_lengths: [150, 300, 60] # List of window length for STFT-based loss.
+
+###########################################################
+#               ADVERSARIAL LOSS SETTING                  #
+###########################################################
+lambda_feat_match: 10.0      # Loss balancing coefficient for feature matching loss
+lambda_adv: 2.5              # Loss balancing coefficient for adversarial loss.
+
+###########################################################
+#                  DATA LOADER SETTING                    #
+###########################################################
+batch_size: 64                 # Batch size.
+batch_max_steps: 8192          # Length of each audio in batch for training. Make sure it is divisible by hop_size.
+batch_max_steps_valid: 81920   # Length of each audio for validation. Make sure it is divisible by hop_size.
+remove_short_samples: true     # Whether to remove samples the length of which are less than batch_max_steps.
+allow_cache: true              # Whether to allow cache in dataset. If true, it requires cpu memory.
+is_shuffle: true               # shuffle dataset after each epoch.
+
+###########################################################
+#             OPTIMIZER & SCHEDULER SETTING               #
+###########################################################
+generator_optimizer_params:
+    lr_fn: "PiecewiseConstantDecay"
+    lr_params: 
+        boundaries: [100000, 200000, 300000, 400000, 500000, 600000, 700000]
+        values: [0.0005, 0.0005, 0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
+    amsgrad: false
+
+discriminator_optimizer_params:
+    lr_fn: "ExponentialDecay"
+    lr_params: 
+        initial_learning_rate: 0.0005
+        decay_steps: 200000
+        decay_rate: 0.5
+
+
+###########################################################
+#                    INTERVAL SETTING                     #
+###########################################################
+discriminator_train_start_steps: 200000  # Step at which discriminator training begins.
+train_max_steps: 4000000                 # Number of training steps.
+save_interval_steps: 20000               # Interval steps to save checkpoint.
+eval_interval_steps: 5000                # Interval steps to evaluate the network.
+log_interval_steps: 200                  # Interval steps to record the training log.
+
+###########################################################
+#                     OTHER SETTING                       #
+###########################################################
+num_save_intermediate_results: 1  # Number of batch to be saved as intermediate results.
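
Note on the optimizer settings in the config above: the generator uses a piecewise-constant schedule with 7 boundaries and 8 values, and the discriminator uses exponential decay. The sketch below evaluates both in plain Python so the learning rate at a given step can be checked without starting a training run. It assumes the `lr_fn` names refer to the Keras schedules of the same name (`tf.keras.optimizers.schedules.PiecewiseConstantDecay` and `ExponentialDecay`, the latter without staircase mode). The two helper functions are illustrative re-implementations, not code from the repository.

```python
from bisect import bisect_left

# Values copied from generator_optimizer_params in multiband_pwgan.v1.yaml.
GEN_BOUNDARIES = [100000, 200000, 300000, 400000, 500000, 600000, 700000]
GEN_VALUES = [0.0005, 0.0005, 0.00025, 0.000125,
              0.0000625, 0.00003125, 0.000015625, 0.000001]


def piecewise_constant(step, boundaries, values):
    """values[0] while step <= boundaries[0], values[i] up to boundaries[i]."""
    if len(values) != len(boundaries) + 1:
        raise ValueError("values must have exactly one more entry than boundaries")
    # bisect_left keeps a step equal to a boundary in the earlier interval.
    return values[bisect_left(boundaries, step)]


def exponential_decay(step, initial_learning_rate, decay_steps, decay_rate):
    """Continuous (non-staircase) decay: lr * rate ** (step / decay_steps)."""
    return initial_learning_rate * decay_rate ** (step / decay_steps)


for step in (0, 200000, 250000, 800000):
    gen_lr = piecewise_constant(step, GEN_BOUNDARIES, GEN_VALUES)
    # Discriminator values copied from discriminator_optimizer_params.
    disc_lr = exponential_decay(step, 0.0005, 200000, 0.5)
    print(f"step={step:>7}  generator_lr={gen_lr:.9f}  discriminator_lr={disc_lr:.9f}")
```

Under these assumptions the generator stays at 0.0005 through step 200000, the same step at which `discriminator_train_start_steps` enables the discriminator. It then halves at each later boundary and drops to 0.000001 after step 700000.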
@@ -0,0 +1,105 @@
+
+# This is the hyperparameter configuration file for Multi-Band MelGAN with PWGAN discriminator.
+# This one is adjusted for finetuning. It was used to finetune the LJSpeech-pretrained Multi-Band MelGAN generator on a 50-minute male speaker dataset.
+# You may have to tune it for your own dataset.
+
+# Main differences from regular training config are: 
+# 1. We start training the discriminator from the start
+# 2. The learning rate is very low
+# 3. Max iterations, save intervals, and related settings are lowered because finetuning finishes very quickly
+
+###########################################################
+#                FEATURE EXTRACTION SETTING               #
+###########################################################
+sampling_rate: 22050
+hop_size: 256            # Hop size.
+format: "npy"
+
+
+###########################################################
+#         GENERATOR NETWORK ARCHITECTURE SETTING          #
+###########################################################
+model_type: "multiband_melgan_generator"
+
+multiband_melgan_generator_params:
+    out_channels: 4               # Number of output channels (number of subbands).
+    kernel_size: 7                # Kernel size of initial and final conv layers.
+    filters: 384                  # Initial number of channels for conv layers.
+    upsample_scales: [8, 4, 2]    # List of Upsampling scales.
+    stack_kernel_size: 3          # Kernel size of dilated conv layers in residual stack.
+    stacks: 4                     # Number of stacks in a single residual stack module.
+    is_weight_norm: false         # Use weight-norm or not.
+
+###########################################################
+#       DISCRIMINATOR NETWORK ARCHITECTURE SETTING        #
+###########################################################
+parallel_wavegan_discriminator_params:
+    out_channels: 1       # Number of output channels.
+    kernel_size: 3        # Kernel size of conv layers.
+    n_layers: 10            # Number of conv layers.
+    conv_channels: 64     # Number of channels in conv layers.
+    use_bias: true            # Whether to use bias parameter in conv.
+    nonlinear_activation: "LeakyReLU" # Nonlinear function after each conv.
+    nonlinear_activation_params:      # Nonlinear function parameters
+        alpha: 0.2           # Alpha in LeakyReLU.
+
+###########################################################
+#                   STFT LOSS SETTING                     #
+###########################################################
+stft_loss_params:
+    fft_lengths: [1024, 2048, 512]  # List of FFT size for STFT-based loss.
+    frame_steps: [120, 240, 50]     # List of hop size for STFT-based loss
+    frame_lengths: [600, 1200, 240] # List of window length for STFT-based loss.
+
+subband_stft_loss_params:
+    fft_lengths: [384, 683, 171]  # List of FFT size for STFT-based loss.
+    frame_steps: [30, 60, 10]     # List of hop size for STFT-based loss
+    frame_lengths: [150, 300, 60] # List of window length for STFT-based loss.
+
+###########################################################
+#               ADVERSARIAL LOSS SETTING                  #
+###########################################################
+lambda_feat_match: 10.0      # Loss balancing coefficient for feature matching loss
+lambda_adv: 2.5              # Loss balancing coefficient for adversarial loss.
+
+###########################################################
+#                  DATA LOADER SETTING                    #
+###########################################################
+batch_size: 64                 # Batch size.
+batch_max_steps: 8192          # Length of each audio in batch for training. Make sure it is divisible by hop_size.
+batch_max_steps_valid: 81920   # Length of each audio for validation. Make sure it is divisible by hop_size.
+remove_short_samples: true     # Whether to remove samples the length of which are less than batch_max_steps.
+allow_cache: true              # Whether to allow cache in dataset. If true, it requires cpu memory.
+is_shuffle: true               # shuffle dataset after each epoch.
+
+###########################################################
+#             OPTIMIZER & SCHEDULER SETTING               #
+###########################################################
+generator_optimizer_params:
+    lr_fn: "PiecewiseConstantDecay"
+    lr_params: 
+        boundaries: [1000, 5000, 10000, 20000]
+        values: [0.00000000001, 0.000000000005, 0.000000000002, 0.0000000000005, 0.0000000000002]
+    amsgrad: false
+
+
+discriminator_optimizer_params:
+    lr_fn: "ExponentialDecay"
+    lr_params: 
+        initial_learning_rate: 0.0000000005
+        decay_steps: 70000
+        decay_rate: 0.5
+
+###########################################################
+#                    INTERVAL SETTING                     #
+###########################################################
+discriminator_train_start_steps: 0  # Step at which discriminator training begins (0 = from the start).
+train_max_steps: 10000                 # Number of training steps.
+save_interval_steps: 1500               # Interval steps to save checkpoint.
+eval_interval_steps: 500                # Interval steps to evaluate the network.
+log_interval_steps: 100                  # Interval steps to record the training log.
+
+###########################################################
+#                     OTHER SETTING                       #
+###########################################################
+num_save_intermediate_results: 1  # Number of batch to be saved as intermediate results.
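
Note on adapting either config to another dataset: a few of the settings above depend on each other. The sketch below checks three relationships on values copied from the finetuning config. Two come from the config comments (the batch lengths must be divisible by `hop_size`) and from the shape of the schedule (one more value than boundaries). The third, that the product of `upsample_scales` times the number of subbands equals `hop_size`, is an assumption about how the generator and the subband synthesis combine, not something the diff states. `check_config` is a hypothetical helper, not part of the repository.

```python
from math import prod

# Values copied from multiband_pwgan.v1ft.yaml, flattened for brevity.
FINETUNE_CONFIG = {
    "hop_size": 256,
    "out_channels": 4,  # number of subbands
    "upsample_scales": [8, 4, 2],
    "batch_max_steps": 8192,
    "batch_max_steps_valid": 81920,
    "boundaries": [1000, 5000, 10000, 20000],
    "values": [1e-11, 5e-12, 2e-12, 5e-13, 2e-13],
}


def check_config(cfg):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    # Assumption: generator upsampling times subband count spans one hop.
    total_upsampling = prod(cfg["upsample_scales"]) * cfg["out_channels"]
    if total_upsampling != cfg["hop_size"]:
        problems.append(
            f"upsampling {total_upsampling} does not match hop_size {cfg['hop_size']}"
        )

    # Stated in the config comments: batch lengths divisible by hop_size.
    for key in ("batch_max_steps", "batch_max_steps_valid"):
        if cfg[key] % cfg["hop_size"] != 0:
            problems.append(f"{key} is not divisible by hop_size")

    # A piecewise-constant schedule needs one value per interval.
    if len(cfg["values"]) != len(cfg["boundaries"]) + 1:
        problems.append("values must have one more entry than boundaries")

    return problems


print(check_config(FINETUNE_CONFIG) or "config looks consistent")
```

Both `[8, 4, 2]` and `[2, 4, 8]` pass the first check, since only the product matters there. The ordering advice in the README's notes is a separate, empirical observation.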