egs/tts/VALLE/README.md
## 1. Data Preparation
### Dataset Download
You can use a commonly used TTS dataset, e.g., LibriTTS, to train the VALL-E model. We strongly recommend using LibriTTS for your first VALL-E training run. How to download the dataset is detailed [here](../../datasets/README.md).
### Configuration
Specify the `processed_dir` and the `log_dir` for saving the processed data and the logs.
### Run
Run the `run.sh` as the preprocess stage (set `--stage 1`):
```bash
sh egs/tts/VALLE/run.sh --stage 1
```
### Configuration
We provide the default hyperparameters in `exp_config.json`. They can work on a single NVIDIA GPU with 24GB of memory. You can adjust them based on your GPU machines.
```json
"train": {
"batch_size": 4,
}
```
### Train From Scratch
Run the `run.sh` as the training stage (set `--stage 2`). Specify an experiment name to run the following command. The tensorboard logs and checkpoints will be saved in `Amphion/ckpts/tts/[YourExptName]`.
Specifically, VALL-E needs to train an autoregressive (AR) model first and then a non-autoregressive (NAR) model. Set `--model_train_stage 1` to train the AR model, and set `--model_train_stage 2` to train the NAR model, where `--ar_model_ckpt_dir` should be set to the checkpoint path of the trained AR model.
To train an AR model, just run:
```bash
sh egs/tts/VALLE/run.sh --stage 2 --model_train_stage 1 --name [YourExptName]
```

To train a NAR model, just run:
```bash
sh egs/tts/VALLE/run.sh --stage 2 --model_train_stage 2 --ar_model_ckpt_dir [ARModelPath] --name [YourExptName]
```
<!-- > **NOTE:** To train a NAR model, `--checkpoint_path` should be set as the checkpoint path to the trained AR model. -->
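Putting the two stages together, here is a hedged sketch of a small driver that builds both commands; the checkpoint directory layout assumed below is illustrative, not Amphion's documented API:

```python
import shlex

def build_stage_cmds(expt):
    """Build the AR (model_train_stage 1) and NAR (model_train_stage 2) commands."""
    ar = f"sh egs/tts/VALLE/run.sh --stage 2 --model_train_stage 1 --name {expt}"
    ckpt_dir = f"Amphion/ckpts/tts/{expt}/checkpoint"  # assumed layout
    nar = (
        f"sh egs/tts/VALLE/run.sh --stage 2 --model_train_stage 2 "
        f"--ar_model_ckpt_dir {ckpt_dir} --name {expt}"
    )
    # Return argv-style lists, ready for e.g. subprocess.run
    return [shlex.split(ar), shlex.split(nar)]

cmds = build_stage_cmds("MyVALLE")
```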
### Train From Existing Source
We support training from existing sources for various purposes. You can resume training the model from a checkpoint or fine-tune a model from another checkpoint.
By setting `--resume true`, the training will resume from the **latest checkpoint** from the current `[YourExptName]` by default. For example, if you want to resume training from the latest checkpoint in `Amphion/ckpts/tts/[YourExptName]/checkpoint`,
To train an AR model, just run:
```bash
sh egs/tts/VALLE/run.sh --stage 2 --model_train_stage 1 --name [YourExptName] \
    --resume true
```
You can also choose a **specific checkpoint** for retraining via the `--resume_from_ckpt_path` argument. For example, if you want to resume training from the checkpoint `Amphion/ckpts/tts/[YourExptName]/checkpoint/[SpecificCheckpoint]`,
To train an AR model, just run:
```bash
sh egs/tts/VALLE/run.sh --stage 2 --model_train_stage 1 --name [YourExptName] \
    --resume true \
    --resume_from_ckpt_path "Amphion/ckpts/tts/[YourExptName]/checkpoint/[SpecificCheckpoint]"
```
If you want to **fine-tune from another checkpoint**, just use `--resume_type` and set it to `"finetune"`. For example, if you want to fine-tune the model from the checkpoint `Amphion/ckpts/tts/[AnotherExperiment]/checkpoint/[SpecificCheckpoint]`,
To train an AR model, just run:
```bash
sh egs/tts/VALLE/run.sh --stage 2 --model_train_stage 1 --name [YourExptName] \
    --resume true \
    --resume_from_ckpt_path "Amphion/ckpts/tts/[AnotherExperiment]/checkpoint/[SpecificCheckpoint]" \
    --resume_type "finetune"
```
> **NOTE:** The `--resume_type` is set to `"resume"` by default. It's not necessary to specify it when resuming training.
>
> The difference between `"resume"` and `"finetune"` is that the `"finetune"` will **only** load the pretrained model weights from the checkpoint, while the `"resume"` will load all the training states (including optimizer, scheduler, etc.) from the checkpoint.
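The distinction can be sketched in Python. This is only an illustration of the described behavior, not Amphion's actual loading code, and the checkpoint keys are hypothetical:

```python
def load_checkpoint(ckpt, resume_type="resume"):
    """Return (model_weights, training_states) according to resume_type."""
    if resume_type == "finetune":
        # "finetune": only the pretrained model weights are restored
        return ckpt["model"], None
    # "resume": restore all training states (optimizer, scheduler, step, ...)
    training_states = {k: v for k, v in ckpt.items() if k != "model"}
    return ckpt["model"], training_states

# Hypothetical checkpoint contents for illustration
ckpt = {"model": {"w": 0.5}, "optimizer": {"lr": 2e-4}, "scheduler": {"step": 1000}}
weights, states = load_checkpoint(ckpt, resume_type="finetune")
```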
> **NOTE:** The `CUDA_VISIBLE_DEVICES` is set as `"0"` in default. You can change it when running `run.sh` by specifying such as `--gpu "0,1,2,3"`.
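As an illustrative sketch (not Amphion's actual implementation), a `--gpu` string maps onto `CUDA_VISIBLE_DEVICES` like this:

```python
import os

def set_visible_gpus(gpu_arg="0"):
    """Export CUDA_VISIBLE_DEVICES from a '--gpu'-style comma-separated string."""
    os.environ["CUDA_VISIBLE_DEVICES"] = gpu_arg
    return gpu_arg.split(",")

devices = set_visible_gpus("0,1,2,3")
```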
We have released pre-trained VALL-E models, so you can download a pre-trained model and then generate speech following the above inference instructions. Specifically,
1. The pre-trained VALL-E trained on [LibriTTS](https://github.com/open-mmlab/Amphion/tree/main/egs/datasets#libritts) can be downloaded [here](https://huggingface.co/amphion/valle-libritts).
2. The pre-trained VALL-E trained on a part of [Libri-light](https://ai.meta.com/tools/libri-light/) (about 6k hours) can be downloaded [here](https://huggingface.co/amphion/valle_librilight_6k).
The `--resume_type` option is parsed in `run.sh`'s argument loop (excerpt):

```bash
# [Only for Training] `resume` for loading all the things (including model weights,
# optimizer, scheduler, and random states). `finetune` for loading only the model weights.
--resume_type) shift; resume_type=$1; shift ;;
--) shift; break ;;
*) echo "Invalid option: $1"; exit 1 ;;
esac
```
egs/tts/VITS/README.md
In this recipe, we will show how to train VITS using Amphion's infrastructure. [VITS](https://arxiv.org/abs/2106.06103) is an end-to-end TTS architecture that utilizes a conditional variational autoencoder with adversarial learning.
There are four stages in total:
## 1. Data Preparation
### Dataset Download
You can use commonly used TTS datasets, e.g., LJSpeech, VCTK, Hi-Fi TTS, LibriTTS, etc., to train the TTS model. We strongly recommend using LJSpeech to train a single-speaker TTS model for the first time. For training a multi-speaker TTS model for the first time, we recommend using Hi-Fi TTS. The process of downloading the dataset has been detailed [here](../../datasets/README.md).
### Configuration
### Run

Run the `run.sh` as the preprocess stage (set `--stage 1`):

```bash
sh egs/tts/VITS/run.sh --stage 1
```
### Configuration
We provide the default hyperparameters in `exp_config.json`. They can work on a single NVIDIA GPU with 24GB of memory. You can adjust them based on your GPU machines.
For training a multi-speaker TTS model, specify the `n_speakers` value to be greater than (useful for new-speaker fine-tuning) or equal to the number of speakers in your dataset(s), and set `multi_speaker_training` to `true`.
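The rule above can be checked with a small helper. This is an illustrative sketch, and the function name is hypothetical:

```python
def check_speaker_config(n_speakers, speakers_per_dataset):
    """Validate n_speakers >= total speakers; return slots left for new speakers."""
    total = sum(speakers_per_dataset)
    if n_speakers < total:
        raise ValueError(f"n_speakers={n_speakers} < total speakers {total}")
    return n_speakers - total  # free slots usable for new-speaker fine-tuning

# Two datasets with 10 and 90 speakers, n_speakers=110 -> 10 free slots
free_slots = check_speaker_config(110, [10, 90])
```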
### Train From Scratch

Run the `run.sh` as the training stage (set `--stage 2`):

```bash
sh egs/tts/VITS/run.sh --stage 2 --name [YourExptName]
```
### Train From Existing Source
We support training from existing sources for various purposes. You can resume training the model from a checkpoint or fine-tune a model from another checkpoint.
By setting `--resume true`, the training will resume from the **latest checkpoint** of the current `[YourExptName]` by default. For example, if you want to resume training from the latest checkpoint in `Amphion/ckpts/tts/[YourExptName]/checkpoint`, run:
```bash
sh egs/tts/VITS/run.sh --stage 2 --name [YourExptName] \
    --resume true
```

You can also choose a **specific checkpoint** for retraining via the `--resume_from_ckpt_path` argument. For example, if you want to resume training from the checkpoint `Amphion/ckpts/tts/[YourExptName]/checkpoint/[SpecificCheckpoint]`, run:
```bash
sh egs/tts/VITS/run.sh --stage 2 --name [YourExptName] \
    --resume true \
    --resume_from_ckpt_path "Amphion/ckpts/tts/[YourExptName]/checkpoint/[SpecificCheckpoint]"
```
If you want to **fine-tune from another checkpoint**, just use `--resume_type` and set it to `"finetune"`. For example, if you want to fine-tune the model from the checkpoint `Amphion/ckpts/tts/[AnotherExperiment]/checkpoint/[SpecificCheckpoint]`, run:
```bash
sh egs/tts/VITS/run.sh --stage 2 --name [YourExptName] \
    --resume true \
    --resume_from_ckpt_path "Amphion/ckpts/tts/[AnotherExperiment]/checkpoint/[SpecificCheckpoint]" \
    --resume_type "finetune"
```
We released a pre-trained Amphion VITS model trained on LJSpeech. So, you can download the pre-trained model [here](https://huggingface.co/amphion/vits-ljspeech) and generate speech following the above inference instructions. Meanwhile, the pre-trained multi-speaker VITS model trained on Hi-Fi TTS will be released soon. Stay tuned.
```bibtex
@inproceedings{kim2021conditional,
title={Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech},
  author={Kim, Jaehyeon and Kong, Jungil and Son, Juhee},
  booktitle={International Conference on Machine Learning},
  year={2021},
  organization={PMLR}
}
```