
Commit 743212e: "update README"
1 parent: 92f82a3

3 files changed (+63, -69 lines)

helpers/training_configs/librispeech_tts_r_300M_dummy.json
Lines changed: 2 additions & 2 deletions

@@ -13,12 +13,12 @@
     "output_dir": "./output_dir_training",
 
     "train_dataset_name": "blabble-io/libritts_r",
-    "train_metadata_dataset_name": "stable-speech/libritts_r_tags_tagged_10k_generated",
+    "train_metadata_dataset_name": "parler-tts/libritts_r_tags_tagged_10k_generated",
     "train_dataset_config_name": "clean",
     "train_split_name": "test.clean",
 
     "eval_dataset_name": "blabble-io/libritts_r",
-    "eval_metadata_dataset_name": "stable-speech/libritts_r_tags_tagged_10k_generated",
+    "eval_metadata_dataset_name": "parler-tts/libritts_r_tags_tagged_10k_generated",
     "eval_dataset_config_name": "clean",
     "eval_split_name": "test.clean",
 
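For context on how a config file like this one is consumed: the JSON is a flat set of keyword arguments for the training script. Below is a minimal sketch of that pattern, assuming the script follows the usual `transformers` convention of accepting a single `.json` path and routing it through `HfArgumentParser`; the dataclass is a made-up subset for illustration, not the script's real argument classes.

```python
# Sketch only: a hypothetical, trimmed-down argument dataclass. The real
# run_parler_tts_training.py defines its own, much larger argument classes.
import sys
from dataclasses import dataclass, field
from typing import Optional

from transformers import HfArgumentParser


@dataclass
class DataArguments:
    train_dataset_name: Optional[str] = field(default=None)
    train_metadata_dataset_name: Optional[str] = field(default=None)
    train_dataset_config_name: Optional[str] = field(default=None)
    train_split_name: Optional[str] = field(default=None)


parser = HfArgumentParser(DataArguments)
if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
    # A single .json argument: read every matching key from the config file,
    # ignoring keys this toy dataclass does not declare.
    (data_args,) = parser.parse_json_file(json_file=sys.argv[1], allow_extra_keys=True)
else:
    # Otherwise, fall back to regular --flag command-line parsing.
    (data_args,) = parser.parse_args_into_dataclasses()

print(data_args.train_metadata_dataset_name)  # e.g. "parler-tts/libritts_r_tags_tagged_10k_generated"
```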

helpers/training_configs/starting_point_0.01.json
Lines changed: 5 additions & 5 deletions

@@ -12,13 +12,13 @@
     "overwrite_output_dir": true,
     "output_dir": "./output_dir_training",
 
-    "train_dataset_name": "blabble-io/libritts_r+blabble-io/libritts_r+blabble-io/libritts_r+stable-speech/mls_eng_10k",
-    "train_metadata_dataset_name": "stable-speech/libritts_r_tags_tagged_10k_generated+stable-speech/libritts_r_tags_tagged_10k_generated+stable-speech/libritts_r_tags_tagged_10k_generated+stable-speech/mls-eng-10k-tags_tagged_10k_generated",
+    "train_dataset_name": "blabble-io/libritts_r+blabble-io/libritts_r+blabble-io/libritts_r+parler-tts/mls_eng_10k",
+    "train_metadata_dataset_name": "parler-tts/libritts_r_tags_tagged_10k_generated+parler-tts/libritts_r_tags_tagged_10k_generated+parler-tts/libritts_r_tags_tagged_10k_generated+parler-tts/mls-eng-10k-tags_tagged_10k_generated",
     "train_dataset_config_name": "clean+clean+other+default",
     "train_split_name": "train.clean.360+train.clean.100+train.other.500+train",
 
-    "eval_dataset_name": "blabble-io/libritts_r+stable-speech/mls_eng_10k",
-    "eval_metadata_dataset_name": "stable-speech/libritts_r_tags_tagged_10k_generated+stable-speech/mls-eng-10k-tags_tagged_10k_generated",
+    "eval_dataset_name": "blabble-io/libritts_r+parler-tts/mls_eng_10k",
+    "eval_metadata_dataset_name": "parler-tts/libritts_r_tags_tagged_10k_generated+parler-tts/mls-eng-10k-tags_tagged_10k_generated",
     "eval_dataset_config_name": "other+default",
     "eval_split_name": "test.other+test",
 
@@ -41,7 +41,7 @@
 
     "do_train": true,
     "num_train_epochs": 40,
-    "gradient_accumulation_steps": 1,
+    "gradient_accumulation_steps": 8,
     "gradient_checkpointing": false,
     "per_device_train_batch_size": 3,
     "learning_rate": 0.00095,

training/README.md
Lines changed: 56 additions & 62 deletions
@@ -1,5 +1,12 @@
 # Training Parler-TTS
 
+> [!IMPORTANT]
+> **TL;DR:** After following the [installation steps](#requirements), you can reproduce the Parler-TTS v0.1 training recipe with the following command line:
+
+```sh
+accelerate launch ./training/run_parler_tts_training.py ./helpers/training_configs/starting_point_0.01.json
+```
+
 This sub-folder contains all the information to train or fine-tune your own Parler-TTS model. It consists of:
 - [1. An introduction to the Parler-TTS architecture](#a-architecture)
 - [2. First steps to get started](#b-getting-started)
@@ -72,7 +79,7 @@ You can also train you own model from scratch. You can find [here](/helpers/mode
 python helpers/model_init_scripts/init_dummy_model.py ./parler-tts-untrained-dummy --text_model "google-t5/t5-small" --audio_model "parler-tts/dac_44khZ_8kbps"
 ```
 
-In the rest of this guide, we'll use a 300-M parameters that we'll initialize with:
+In the rest of this guide, and to reproduce the Parler-TTS v0.1 training recipe, we'll use a 300M-parameter model that we'll initialize with:
 
 ```sh
 python helpers/model_init_scripts/init_model_300M.py ./parler-tts-untrained-300M --text_model "google/flan-t5-base" --audio_model "parler-tts/dac_44khZ_8kbps"
@@ -88,7 +95,11 @@ To train your own Parler-TTS, you need datasets with 3 main features:
 
 Note that we made the choice to use descriptions of the main speech characteristics (speaker pitch, speaking rate, level of noise, etc.), but you are free to use any handmade or generated text description that makes sense.
 
-In the rest of this guide, and to make it simple, we'll use the [4.8K-samples clean test split](https://huggingface.co/datasets/blabble-io/libritts_r/viewer/clean/test.clean) of [LibriTTS-R](https://huggingface.co/datasets/blabble-io/libritts_r/). We've annotated LibriTTS-R using [Data-Speech](https://github.com/huggingface/dataspeech) and shared the resulting dataset here: [parler-tts/libritts_r_tags_tagged_10k_generated](https://huggingface.co/datasets/parler-tts/libritts_r_tags_tagged_10k_generated).
+To train Parler-TTS v0.1, we used:
+* The full [LibriTTS-R dataset](https://huggingface.co/datasets/blabble-io/libritts_r), a 1K-hour high-quality speech dataset.
+* A [10K-hour subset](https://huggingface.co/datasets/parler-tts/mls_eng_10k) of [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech).
+
+Both datasets have been annotated using the [Data-Speech](https://github.com/huggingface/dataspeech) recipe, respectively [here](https://huggingface.co/datasets/parler-tts/libritts_r_tags_tagged_10k_generated) and [here](https://huggingface.co/datasets/parler-tts/mls-eng-10k-tags_tagged_10k_generated).
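Since these Data-Speech metadata datasets are published without the audio column, they have to be lined back up with the original audio datasets. A rough sketch of that pairing with the `datasets` library is shown below; the metadata config/split names and the positional column-wise join are assumptions for illustration, not the training script's actual merging code.

```python
# Rough sketch: pairing an audio dataset with its Data-Speech style metadata,
# which is published without the audio column. Config/split names for the
# metadata dataset are assumed to mirror the audio dataset; the real training
# script performs its own merging logic, which is not part of this diff.
from datasets import load_dataset, concatenate_datasets

audio = load_dataset("blabble-io/libritts_r", "clean", split="test.clean")
meta = load_dataset("parler-tts/libritts_r_tags_tagged_10k_generated", "clean", split="test.clean")

# Drop any columns the two datasets share, then join column-wise
# (row i of one is assumed to describe row i of the other).
overlap = [c for c in meta.column_names if c in audio.column_names]
combined = concatenate_datasets([audio, meta.remove_columns(overlap)], axis=1)

print(combined.column_names)  # audio columns + generated text descriptions
```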
 
 
 ## 3. Training
@@ -98,7 +109,7 @@ The script [`run_parler_tts_training.py`](/training/run_parler_tts_training.py)
 2. pre-compute audio tokens
 3. train Parler-TTS
 
-In this example, we will train and evaluate on a subsample of the test split. This is purely to demonstrate how to use the training script, rather than recommended advice for defining train/validation splits. We advise that you train on the train splits of your dataset, evaluate and tune hyper-parameters on the validation split, and only test the final checkpoint on the test split.
+To train Parler-TTS v0.1, we used roughly the following command:
 
 ```sh
 accelerate launch ./training/run_parler_tts_training.py \
@@ -108,104 +119,87 @@ accelerate launch ./training/run_parler_tts_training.py \
     --prompt_tokenizer_name "google/flan-t5-base" \
     --report_to "wandb" \
     --overwrite_output_dir true \
-    --train_dataset_name "blabble-io/libritts_r" \
-    --train_metadata_dataset_name "parler-tts/libritts_r_tags_tagged_10k_generated" \
-    --train_dataset_config_name "clean" \
-    --train_split_name "test.clean" \
-    --eval_dataset_name "blabble-io/libritts_r" \
-    --eval_metadata_dataset_name "parler-tts/libritts_r_tags_tagged_10k_generated" \
-    --eval_dataset_config_name "clean" \
-    --eval_split_name "test.clean" \
+    --train_dataset_name "blabble-io/libritts_r+blabble-io/libritts_r+blabble-io/libritts_r+parler-tts/mls_eng_10k" \
+    --train_metadata_dataset_name "parler-tts/libritts_r_tags_tagged_10k_generated+parler-tts/libritts_r_tags_tagged_10k_generated+parler-tts/libritts_r_tags_tagged_10k_generated+parler-tts/mls-eng-10k-tags_tagged_10k_generated" \
+    --train_dataset_config_name "clean+clean+other+default" \
+    --train_split_name "train.clean.360+train.clean.100+train.other.500+train" \
+    --eval_dataset_name "blabble-io/libritts_r+parler-tts/mls_eng_10k" \
+    --eval_metadata_dataset_name "parler-tts/libritts_r_tags_tagged_10k_generated+parler-tts/mls-eng-10k-tags_tagged_10k_generated" \
+    --eval_dataset_config_name "other+default" \
+    --eval_split_name "test.other+test" \
     --target_audio_column_name "audio" \
     --description_column_name "text_description" \
     --prompt_column_name "text" \
-    --max_duration_in_seconds 20 \
+    --max_duration_in_seconds 30 \
     --min_duration_in_seconds 2.0 \
+    --max_text_length 400 \
     --add_audio_samples_to_wandb true \
    --id_column_name "id" \
     --preprocessing_num_workers 8 \
     --do_train true \
-    --num_train_epochs 50 \
-    --gradient_accumulation_steps 1 \
+    --num_train_epochs 40 \
+    --gradient_accumulation_steps 8 \
     --gradient_checkpointing false \
-    --per_device_train_batch_size 4 \
-    --learning_rate 1e-3 \
+    --per_device_train_batch_size 3 \
+    --learning_rate 0.00095 \
     --adam_beta1 0.9 \
     --adam_beta2 0.99 \
     --weight_decay 0.01 \
-    --lr_scheduler_type "cosine" \
-    --warmup_steps 40 \
-    --logging_steps 2 \
+    --lr_scheduler_type "constant_with_warmup" \
+    --warmup_steps 20000 \
+    --logging_steps 1000 \
     --freeze_text_encoder true \
     --do_eval true \
     --predict_with_generate true \
     --include_inputs_for_metrics true \
     --evaluation_strategy steps \
-    --eval_steps 500 \
-    --save_steps 5000 \
+    --eval_steps 10000 \
+    --save_steps 10000 \
     --per_device_eval_batch_size 12 \
-    --audio_encoder_per_device_batch_size 24 \
+    --audio_encoder_per_device_batch_size 20 \
     --dtype "bfloat16" \
-    --dataloader_num_workers "16" \
     --seed 456 \
     --output_dir "./output_dir_training/" \
     --temporary_save_to_disk "./audio_code_tmp/" \
     --save_to_disk "./tmp_dataset_audio/" \
-    --max_eval_samples 48 \
-    --max_train_samples 96 \
-    --dataloader_num_workers 8
+    --max_eval_samples 96 \
+    --dataloader_num_workers 8 \
+    --group_by_length true
 ```
 
-
-> [!TIP]
-> Fine-tuning is as easy as modifying `model_name_or_path` to a pre-trained model.
-> For example: `--model_name_or_path parler-tts/parler_tts_300M_v0.1`.
+In particular, note how multiple training datasets, metadata datasets, configurations and splits can be loaded by separating the dataset arguments with `+` symbols:
+```sh
+"train_dataset_name": "blabble-io/libritts_r+blabble-io/libritts_r+blabble-io/libritts_r+parler-tts/mls_eng_10k",
+"train_metadata_dataset_name": "parler-tts/libritts_r_tags_tagged_10k_generated+parler-tts/libritts_r_tags_tagged_10k_generated+parler-tts/libritts_r_tags_tagged_10k_generated+parler-tts/mls-eng-10k-tags_tagged_10k_generated",
+"train_dataset_config_name": "clean+clean+other+default",
+"train_split_name": "train.clean.360+train.clean.100+train.other.500+train",
+```
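The `+`-separated values are positional: the i-th dataset name goes with the i-th config name and the i-th split name. A tiny, purely illustrative sketch of that pairing (not the training script's own parsing code):

```python
# Illustration only: how "+"-separated dataset arguments line up one-to-one.
train_dataset_name = "blabble-io/libritts_r+blabble-io/libritts_r+blabble-io/libritts_r+parler-tts/mls_eng_10k"
train_dataset_config_name = "clean+clean+other+default"
train_split_name = "train.clean.360+train.clean.100+train.other.500+train"

for name, config, split in zip(
    train_dataset_name.split("+"),
    train_dataset_config_name.split("+"),
    train_split_name.split("+"),
):
    print(f"{name} (config={config}, split={split})")
```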
 
 
-Additionally, you can also write a JSON config file. Here, [librispeech_tts_r_300M_dummy.json](/helpers/training_configs/librispeech_tts_r_300M_dummy.json) contains the exact same hyper-parameters than above and can be launched like that:
+Additionally, you can also write a JSON config file. Here, [starting_point_0.01.json](helpers/training_configs/starting_point_0.01.json) contains the exact same hyper-parameters as above and can be launched as follows:
 ```sh
-accelerate launch ./training/run_parler_tts_training.py ./helpers/training_configs/librispeech_tts_r_300M_dummy.json
+accelerate launch ./training/run_parler_tts_training.py ./helpers/training_configs/starting_point_0.01.json
 ```
 
-The above training script is a dummy example on only 96 training samples. It will take approximately 20 mn to complete on an 80 GB A100 GPU.
+Training logs will be reported to wandb, provided that you passed `--report_to "wandb"` to the arguments. An example of what a training log from the above training looks like can be found [here](https://wandb.ai/ylacombe/parler-tts-300M-punctuated/runs/q6h7hspc?nw=nwuserylacombe).
 
-Scaling to multiple GPUs using [distributed data parallelism (DDP)](https://pytorch.org/tutorials/beginner/ddp_series_theory.html) is trivial: simply run `accelerate config` and select the multi-GPU option, specifying the IDs of the GPUs you wish to use. The above script can then be run using DDP with no code changes.
+> [!TIP]
+> Starting training a new model from scratch can easily be overwhelming, so here's what training looked like for v0.1: [logs](https://api.wandb.ai/links/ylacombe/ea449l81)
 
-Training logs will be reported to wandb, provided that you passed `--report_to "wandb"` to the arguments. An example of what a training log from the above training looks like can be found [here](https://wandb.ai/ylacombe/parler-speech/runs/gp55k6nj). Other examples of training log on scaled up training logs can be found in the next section.
+Scaling to multiple GPUs using [distributed data parallelism (DDP)](https://pytorch.org/tutorials/beginner/ddp_series_theory.html) is trivial: simply run `accelerate config` and select the multi-GPU option, specifying the IDs of the GPUs you wish to use. The above script can then be run using DDP with no code changes. In our case, we used a node of 8 H100 80GB GPUs to train Parler-TTS v0.1 for around 4 days.
 
 
-There are a few noteworthy arguments:
-1. `train_metadata_dataset_name` and `eval_metadata_dataset_name` precise, if necessary, the names of the dataset(s) that contain(s) the conditionning text descriptions. For example, the [dataset resulting from the Data-Speech annotation process](https://huggingface.co/datasets/parler-tts/libritts_r_tags_tagged_10k_generated) is saved without the audio column, as it's costly to write and push audio data, so it needs to be concatenated back to the original LibriTTS-R dataset.
+There are a few other noteworthy arguments:
+1. `train_metadata_dataset_name` and `eval_metadata_dataset_name` specify, if necessary, the names of the dataset(s) that contain the conditioning text descriptions. For example, this [dataset resulting from the Data-Speech annotation process](https://huggingface.co/datasets/parler-tts/libritts_r_tags_tagged_10k_generated) is saved without the audio column, as it's costly to write and push audio data, so it needs to be concatenated back to the original LibriTTS-R dataset.
 2. As noted above, the script pre-computes audio tokens, since computing audio codes is costly and only needs to be done once given that we're freezing the audio encoder. `audio_encoder_per_device_batch_size` sets the per-device batch size for this pre-processing step.
 3. Additionally, when scaling up the training data and iterating on the hyper-parameters or the model architecture, we might want to avoid recomputing the audio tokens at each training run. That's why we introduced two additional parameters, `save_to_disk` and `temporary_save_to_disk`, which serve as temporary buffers to save intermediary datasets (see the sketch after this list). Note that processed data is made of text and audio tokens, which are much more memory-efficient, so the additional required space is negligible.
 4. `predict_with_generate` and `add_audio_samples_to_wandb` are required to store generated audios and to compute WER and CLAP similarity.
-5. `freeze_text_encoder`: which allows to freeze the text encoder, to save compute resources. Note that our released model freeze the text encoder.
+5. `freeze_text_encoder`: allows you to freeze the text encoder to save compute resources.
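As a generic illustration of the write-once/reload caching pattern that `save_to_disk` and `temporary_save_to_disk` allude to in point 3 above (the script's actual caching logic is not part of this diff), pre-computed token datasets can be written once and reused on later runs:

```python
# Generic illustration of the caching pattern behind `save_to_disk` and
# `temporary_save_to_disk`. The dataset here is a toy stand-in; the real
# script stores pre-computed audio tokens and tokenized text.
from datasets import Dataset, load_from_disk

tokens = Dataset.from_dict({"id": ["a", "b"], "audio_tokens": [[1, 2, 3], [4, 5, 6]]})
tokens.save_to_disk("./tmp_dataset_audio/")        # written once after pre-processing

reloaded = load_from_disk("./tmp_dataset_audio/")  # reused on subsequent training runs
print(reloaded)
```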

 And finally, two additional comments:
 1. `lr_scheduler_type`: defines the learning rate schedule, one of `constant_with_warmup` or `cosine` (see the sketch below). When experimenting with a training set-up or training for very few epochs, using `constant_with_warmup` is typically beneficial, since the learning rate remains high over the short training run. When performing longer training runs, using a `cosine` schedule should give better results.
 2. `dtype`: the data type (dtype) in which the model computation should be performed. Note that this only controls the dtype of the computations (forward and backward pass), and not the dtype of the parameters or optimiser states.
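For comment 1, here is a minimal sketch of the two schedules using `transformers.get_scheduler`; the hyper-parameter values are taken from this commit, while the toy model, the optimizer wiring, and the cosine step count are illustrative assumptions rather than the training script's own setup.

```python
# Sketch of the two schedules discussed above; the training script builds its
# own optimizer and scheduler, so treat this purely as an illustration.
import torch
from transformers import get_scheduler

model = torch.nn.Linear(8, 8)  # toy model standing in for Parler-TTS
optimizer = torch.optim.AdamW(
    model.parameters(), lr=0.00095, betas=(0.9, 0.99), weight_decay=0.01
)

# Used for the v0.1 run: warm up for 20k steps, then hold the learning rate flat.
scheduler = get_scheduler("constant_with_warmup", optimizer, num_warmup_steps=20_000)

# For longer runs, a cosine decay also needs the total step count, e.g.:
# scheduler = get_scheduler("cosine", optimizer, num_warmup_steps=20_000, num_training_steps=600_000)
```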

-
-
-## 4. Scaling up - Discussions and tips
-
-[starting_point_0.01.json](helpers/training_configs/starting_point_0.01.json) offers a good hyper-paramters starting to scale-up the training recipe to thousand of hours of data:
-
-```sh
-accelerate launch ./training/run_parler_tts_training.py ./helpers/training_configs/starting_point_0.01.json
-```
-
-In particular, note how multiple training datasets, metadataset, configurations and splits can be loaded by separating the dataset arguments by + symbols:
-```sh
-"train_dataset_name": "blabble-io/libritts_r+blabble-io/libritts_r+blabble-io/libritts_r+stable-speech/mls_eng_10k",
-"train_metadata_dataset_name": "stable-speech/libritts_r_tags_tagged_10k_generated+stable-speech/libritts_r_tags_tagged_10k_generated+stable-speech/libritts_r_tags_tagged_10k_generated+stable-speech/mls-eng-10k-tags_tagged_10k_generated",
-"train_dataset_config_name": "clean+clean+other+default",
-"train_split_name": "train.clean.360+train.clean.100+train.other.500+train",
-```
-
-Thus, the script generalises to any number of training datasets.
-
-
-> [!IMPORTANT]
-> Starting training a new model from scratch can easily be overwhelming,so here's what training looked like for v0.1: [logs](https://api.wandb.ai/links/ylacombe/ea449l81)
-
+> [!TIP]
+> Fine-tuning is as easy as modifying `model_name_or_path` to a pre-trained model.
+> For example: `--model_name_or_path parler-tts/parler_tts_300M_v0.1`.
