egs/tts/VALLE/README.md
## 1. Data Preparation
### Dataset Download
You can use a commonly used TTS dataset, e.g., LibriTTS, to train the VALL-E model. We strongly recommend using LibriTTS for your first VALL-E training run. How to download the dataset is detailed [here](../../datasets/README.md).
### Configuration
Specify the `processed_dir` and the `log_dir` for saving the processed data and the logs.
### Run
Run the `run.sh` as the preprocess stage (set `--stage 1`):
```bash
sh egs/tts/VALLE/run.sh --stage 1
```
### Configuration
We provide the default hyperparameters in `exp_config.json`. They can work on a single NVIDIA GPU with 24GB of memory. You can adjust them based on your GPU machines.
```json
"train": {
"batch_size": 4,
}
```
### Train From Scratch
Run the `run.sh` as the training stage (set `--stage 2`). Specify an experiment name to run the following command. The tensorboard logs and checkpoints will be saved in `Amphion/ckpts/tts/[YourExptName]`.
Specifically, VALL-E needs to train an autoregressive (AR) model first and then a non-autoregressive (NAR) model. Set `--model_train_stage 1` to train the AR model, and set `--model_train_stage 2` to train the NAR model, where `--ar_model_ckpt_dir` should be set to the checkpoint path of the trained AR model.
To train an AR model, just run:
```bash
sh egs/tts/VALLE/run.sh --stage 2 --model_train_stage 1 --name [YourExptName]
```

To train a NAR model, just run:
```bash
sh egs/tts/VALLE/run.sh --stage 2 --model_train_stage 2 --ar_model_ckpt_dir [ARModelPath] --name [YourExptName]
```
<!-- > **NOTE:** To train a NAR model, `--checkpoint_path` should be set as the checkpoint path to the trained AR model. -->
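Putting the two stages together, here is a hedged sketch of a small driver that builds both commands; the checkpoint directory layout assumed below is illustrative, not Amphion's documented API:

```python
import shlex

def build_stage_cmds(expt):
    """Build the AR (model_train_stage 1) and NAR (model_train_stage 2) commands."""
    ar = f"sh egs/tts/VALLE/run.sh --stage 2 --model_train_stage 1 --name {expt}"
    ckpt_dir = f"Amphion/ckpts/tts/{expt}/checkpoint"  # assumed layout
    nar = (
        f"sh egs/tts/VALLE/run.sh --stage 2 --model_train_stage 2 "
        f"--ar_model_ckpt_dir {ckpt_dir} --name {expt}"
    )
    # Return argv-style lists, ready for e.g. subprocess.run
    return [shlex.split(ar), shlex.split(nar)]

cmds = build_stage_cmds("MyVALLE")
```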
### Train From Existing Source
We support training from existing sources for various purposes. You can resume training the model from a checkpoint or fine-tune a model from another checkpoint.
By setting `--resume true`, the training will resume from the **latest checkpoint** from the current `[YourExptName]` by default. For example, if you want to resume training from the latest checkpoint in `Amphion/ckpts/tts/[YourExptName]/checkpoint`,
To train an AR model, just run:
```bash
sh egs/tts/VALLE/run.sh --stage 2 --model_train_stage 1 --name [YourExptName] \
    --resume true
```
You can also choose a **specific checkpoint** for retraining via the `--resume_from_ckpt_path` argument. For example, if you want to resume training from the checkpoint `Amphion/ckpts/tts/[YourExptName]/checkpoint/[SpecificCheckpoint]`,
To train an AR model, just run:
```bash
sh egs/tts/VALLE/run.sh --stage 2 --model_train_stage 1 --name [YourExptName] \
    --resume true \
    --resume_from_ckpt_path "Amphion/ckpts/tts/[YourExptName]/checkpoint/[SpecificCheckpoint]"
```
If you want to **fine-tune from another checkpoint**, just use `--resume_type` and set it to `"finetune"`. For example, if you want to fine-tune the model from the checkpoint `Amphion/ckpts/tts/[AnotherExperiment]/checkpoint/[SpecificCheckpoint]`,
To train an AR model, just run:
```bash
sh egs/tts/VALLE/run.sh --stage 2 --model_train_stage 1 --name [YourExptName] \
    --resume true \
    --resume_from_ckpt_path "Amphion/ckpts/tts/[AnotherExperiment]/checkpoint/[SpecificCheckpoint]" \
    --resume_type "finetune"
```
> **NOTE:** The `--resume_type` is set to `"resume"` by default. It's not necessary to specify it when resuming training.
>
> The difference between `"resume"` and `"finetune"` is that the `"finetune"` will **only** load the pretrained model weights from the checkpoint, while the `"resume"` will load all the training states (including optimizer, scheduler, etc.) from the checkpoint.
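The distinction can be sketched in Python. This is only an illustration of the described behavior, not Amphion's actual loading code, and the checkpoint keys are hypothetical:

```python
def load_checkpoint(ckpt, resume_type="resume"):
    """Return (model_weights, training_states) according to resume_type."""
    if resume_type == "finetune":
        # "finetune": only the pretrained model weights are restored
        return ckpt["model"], None
    # "resume": restore all training states (optimizer, scheduler, step, ...)
    training_states = {k: v for k, v in ckpt.items() if k != "model"}
    return ckpt["model"], training_states

# Hypothetical checkpoint contents for illustration
ckpt = {"model": {"w": 0.5}, "optimizer": {"lr": 2e-4}, "scheduler": {"step": 1000}}
weights, states = load_checkpoint(ckpt, resume_type="finetune")
```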
> **NOTE:** The `CUDA_VISIBLE_DEVICES` is set as `"0"` in default. You can change it when running `run.sh` by specifying such as `--gpu "0,1,2,3"`.
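As an illustrative sketch (not Amphion's actual implementation), a `--gpu` string maps onto `CUDA_VISIBLE_DEVICES` like this:

```python
import os

def set_visible_gpus(gpu_arg="0"):
    """Export CUDA_VISIBLE_DEVICES from a '--gpu'-style comma-separated string."""
    os.environ["CUDA_VISIBLE_DEVICES"] = gpu_arg
    return gpu_arg.split(",")

devices = set_visible_gpus("0,1,2,3")
```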
We have released pre-trained VALL-E models, so you can download a pre-trained model and then generate speech following the above inference instructions. Specifically,
1. The pre-trained VALL-E trained on [LibriTTS](https://github.com/open-mmlab/Amphion/tree/main/egs/datasets#libritts) can be downloaded [here](https://huggingface.co/amphion/valle-libritts).
2. The pre-trained VALL-E trained on a part of [Libri-light](https://ai.meta.com/tools/libri-light/) (about 6k hours) can be downloaded [here](https://huggingface.co/amphion/valle_librilight_6k).
The `--resume_type` option is parsed in `run.sh`'s argument loop (excerpt):

```bash
# [Only for Training] `resume` for loading all the things (including model weights,
# optimizer, scheduler, and random states). `finetune` for loading only the model weights.
--resume_type) shift; resume_type=$1; shift ;;
--) shift; break ;;
*) echo "Invalid option: $1"; exit 1 ;;
esac
```
egs/tts/VITS/README.md
In this recipe, we will show how to train VITS using Amphion's infrastructure. [VITS](https://arxiv.org/abs/2106.06103) is an end-to-end TTS architecture that utilizes a conditional variational autoencoder with adversarial learning.
There are four stages in total:
## 1. Data Preparation
### Dataset Download
You can use commonly used TTS datasets, e.g., LJSpeech, VCTK, Hi-Fi TTS, LibriTTS, etc., to train the TTS model. We strongly recommend using LJSpeech to train a single-speaker TTS model for the first time. For training a multi-speaker TTS model for the first time, we recommend using Hi-Fi TTS. The process of downloading the dataset has been detailed [here](../../datasets/README.md).
### Configuration
### Run

Run the `run.sh` as the preprocess stage (set `--stage 1`):

```bash
sh egs/tts/VITS/run.sh --stage 1
```
### Configuration
We provide the default hyperparameters in `exp_config.json`. They can work on a single NVIDIA GPU with 24GB of memory. You can adjust them based on your GPU machines.
For training a multi-speaker TTS model, specify the `n_speakers` value to be greater than (useful for new-speaker fine-tuning) or equal to the number of speakers in your dataset(s), and set `multi_speaker_training` to `true`.
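The rule above can be checked with a small helper. This is an illustrative sketch, and the function name is hypothetical:

```python
def check_speaker_config(n_speakers, speakers_per_dataset):
    """Validate n_speakers >= total speakers; return slots left for new speakers."""
    total = sum(speakers_per_dataset)
    if n_speakers < total:
        raise ValueError(f"n_speakers={n_speakers} < total speakers {total}")
    return n_speakers - total  # free slots usable for new-speaker fine-tuning

# Two datasets with 10 and 90 speakers, n_speakers=110 -> 10 free slots
free_slots = check_speaker_config(110, [10, 90])
```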
### Train From Scratch

Run the `run.sh` as the training stage (set `--stage 2`):

```bash
sh egs/tts/VITS/run.sh --stage 2 --name [YourExptName]
```
### Train From Existing Source
We support training from existing sources for various purposes. You can resume training the model from a checkpoint or fine-tune a model from another checkpoint.
By setting `--resume true`, the training will resume from the **latest checkpoint** of the current `[YourExptName]` by default. For example, if you want to resume training from the latest checkpoint in `Amphion/ckpts/tts/[YourExptName]/checkpoint`, run:
```bash
sh egs/tts/VITS/run.sh --stage 2 --name [YourExptName] \
    --resume true
```

You can also choose a **specific checkpoint** for retraining via the `--resume_from_ckpt_path` argument. For example, if you want to resume training from the checkpoint `Amphion/ckpts/tts/[YourExptName]/checkpoint/[SpecificCheckpoint]`, run:
```bash
sh egs/tts/VITS/run.sh --stage 2 --name [YourExptName] \
    --resume true \
    --resume_from_ckpt_path "Amphion/ckpts/tts/[YourExptName]/checkpoint/[SpecificCheckpoint]"
```
If you want to **fine-tune from another checkpoint**, just use `--resume_type` and set it to `"finetune"`. For example, if you want to fine-tune the model from the checkpoint `Amphion/ckpts/tts/[AnotherExperiment]/checkpoint/[SpecificCheckpoint]`, run:
```bash
sh egs/tts/VITS/run.sh --stage 2 --name [YourExptName] \
    --resume true \
    --resume_from_ckpt_path "Amphion/ckpts/tts/[AnotherExperiment]/checkpoint/[SpecificCheckpoint]" \
    --resume_type "finetune"
```
We released a pre-trained Amphion VITS model trained on LJSpeech. So, you can download the pre-trained model [here](https://huggingface.co/amphion/vits-ljspeech) and generate speech following the above inference instructions. Meanwhile, the pre-trained multi-speaker VITS model trained on Hi-Fi TTS will be released soon. Stay tuned.
```bibtex
@inproceedings{kim2021conditional,
title={Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech},
  author={Kim, Jaehyeon and Kong, Jungil and Son, Juhee},
  booktitle={International Conference on Machine Learning},
  year={2021},
  organization={PMLR}
}
```