# Training Parler-TTS

> [!IMPORTANT]
> **TL;DR:** After following the [installation steps](#requirements), you can reproduce the Parler-TTS v0.1 training recipe with the following command line:
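>
> A minimal sketch of that command (assuming `accelerate` is configured as per the installation steps, and using the [starting_point_0.01.json](helpers/training_configs/starting_point_0.01.json) config discussed later in this guide):
>
> ```sh
> accelerate launch ./training/run_parler_tts_training.py ./helpers/training_configs/starting_point_0.01.json
> ```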

To train your own Parler-TTS, you need datasets with 3 main features:

Note that we chose to use descriptions of the main speech characteristics (speaker pitch, speaking rate, level of noise, etc.), but you are free to use any handmade or generated text description that makes sense.

In the rest of this guide, and to keep things simple, we'll use the [4.8K-sample clean test split](https://huggingface.co/datasets/blabble-io/libritts_r/viewer/clean/test.clean) of [LibriTTS-R](https://huggingface.co/datasets/blabble-io/libritts_r/). We've annotated LibriTTS-R using [Data-Speech](https://github.com/huggingface/dataspeech) and shared the resulting dataset here: [parler-tts/libritts_r_tags_tagged_10k_generated](https://huggingface.co/datasets/parler-tts/libritts_r_tags_tagged_10k_generated).

To train Parler-TTS v0.1, we used:

* The full [LibriTTS-R dataset](https://huggingface.co/datasets/blabble-io/libritts_r), a 1K-hour high-quality speech dataset.
* A [10K-hour subset](https://huggingface.co/datasets/parler-tts/mls_eng_10k) of [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech).

Both datasets have been annotated using the [Data-Speech](https://github.com/huggingface/dataspeech) recipe, respectively [here](https://huggingface.co/datasets/parler-tts/libritts_r_tags_tagged_10k_generated) and [here](https://huggingface.co/datasets/parler-tts/mls-eng-10k-tags_tagged_10k_generated).

## 3. Training

The script [`run_parler_tts_training.py`](/training/run_parler_tts_training.py):

2. pre-compute audio tokens
3. train Parler-TTS

In this example, we will train and evaluate on a subsample of the test split. This is purely to demonstrate how to use the training script, rather than recommended advice for defining train/validation splits. We advise that you train on the train splits of your dataset, evaluate and tune hyper-parameters on the validation split, and only test the final checkpoint on the test split.

Additionally, you can also write a JSON config file. Here, [librispeech_tts_r_300M_dummy.json](/helpers/training_configs/librispeech_tts_r_300M_dummy.json) contains the exact same hyper-parameters as above and can be launched like this:
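
A sketch of that launch, assuming `accelerate` is set up as described in the installation steps (the config path is passed as the single positional argument):

```sh
accelerate launch ./training/run_parler_tts_training.py ./helpers/training_configs/librispeech_tts_r_300M_dummy.json
```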

The above training run is a dummy example using only 96 training samples. It will take approximately 20 minutes to complete on an 80 GB A100 GPU.

Training logs will be reported to wandb, provided that you pass `--report_to "wandb"` in the training arguments. An example of a training log from the run above can be found [here](https://wandb.ai/ylacombe/parler-tts-300M-punctuated/runs/q6h7hspc?nw=nwuserylacombe).
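
If you haven't set wandb up yet, a minimal setup sketch using the standard `wandb` CLI:

```sh
pip install wandb
wandb login   # paste your API key when prompted
```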
> [!TIP]
> Starting training a new model from scratch can easily be overwhelming, so here's what training looked like for v0.1: [logs](https://api.wandb.ai/links/ylacombe/ea449l81)

Scaling to multiple GPUs using [distributed data parallelism (DDP)](https://pytorch.org/tutorials/beginner/ddp_series_theory.html) is trivial: simply run `accelerate config` and select the multi-GPU option, specifying the IDs of the GPUs you wish to use. The above script can then be run using DDP with no code changes. In our case, we used a node of 8 H100 80GB GPUs to train Parler-TTS v0.1 for around 4 days.
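
For example, a sketch of the multi-GPU workflow (the `accelerate config` answers are given interactively; the launch command itself is unchanged):

```sh
accelerate config   # select multi-GPU and the GPU IDs to use
accelerate launch ./training/run_parler_tts_training.py ./helpers/training_configs/starting_point_0.01.json
```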

There are a few other noteworthy arguments (a combined sketch of how they might be passed follows this list):

1. `train_metadata_dataset_name` and `eval_metadata_dataset_name` specify, if necessary, the names of the dataset(s) containing the conditioning text descriptions. For example, this [dataset resulting from the Data-Speech annotation process](https://huggingface.co/datasets/parler-tts/libritts_r_tags_tagged_10k_generated) is saved without the audio column, as it's costly to write and push audio data, so it needs to be concatenated back to the original LibriTTS-R dataset.

2. As noted above, the script pre-computes audio tokens, as computing audio codes is costly and only needs to be done once since we're freezing the audio encoder. `audio_encoder_per_device_batch_size` is used to set the per-device batch size for this pre-processing step.

3. Additionally, when scaling up the training data and iterating on the hyper-parameters or the model architecture, we might want to avoid recomputing the audio tokens at each training run. That's why we introduced two additional parameters, `save_to_disk` and `temporary_save_to_disk`, that serve as temporary buffers for saving intermediate datasets. Note that processed data consists of text and audio tokens, which are much more memory-efficient, so the additional space required is negligible.

4. `predict_with_generate` and `add_audio_samples_to_wandb` are required to store generated audio samples and to compute WER and CLAP similarity.

5. `freeze_text_encoder`: allows you to freeze the text encoder to save compute resources.
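
As an illustration, here is a sketch of how the flags above might be combined on the command line. The batch size and paths are placeholder values, and these flags would be added to whatever dataset and model arguments you already pass:

```sh
# Placeholder values only -- combine with the dataset and model arguments you already pass.
accelerate launch ./training/run_parler_tts_training.py \
    --audio_encoder_per_device_batch_size 4 \
    --temporary_save_to_disk "./audio_code_tmp/" \
    --save_to_disk "./tmp_dataset_audio/" \
    --predict_with_generate true \
    --add_audio_samples_to_wandb true \
    --freeze_text_encoder true
```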

And finally, two additional comments:

1. `lr_scheduler_type`: defines the learning rate schedule, one of `constant_with_warmup` or `cosine`. When experimenting with a training set-up or training for very few epochs, using `constant_with_warmup` is typically beneficial, since the learning rate remains high over the short training run. When performing longer training runs, a `cosine` schedule should give better results.

2. `dtype`: the data type (dtype) in which the model computation should be performed. Note that this only controls the dtype of the computations (forward and backward passes), and not the dtype of the parameters or optimiser states.
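
For instance, a longer run might use a cosine schedule with half-precision compute. A sketch of the corresponding flags, appended to the launch command above (assuming the scheduler flag follows the usual `lr_scheduler_type` spelling and that your hardware supports `bfloat16`):

```sh
# Appended to the launch command; values are illustrative.
--lr_scheduler_type "cosine" \
--dtype "bfloat16"
```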

## 4. Scaling up - Discussions and tips

[starting_point_0.01.json](helpers/training_configs/starting_point_0.01.json) offers a good hyper-parameter starting point for scaling up the training recipe to thousands of hours of data.

In particular, note how multiple training datasets, metadata datasets, configurations and splits can be loaded by separating the dataset arguments with `+` symbols. For example:
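
A sketch with two hypothetical dataset/metadata pairs (all names, configs and splits below are placeholders, and the argument names are assumed to mirror `train_metadata_dataset_name`):

```sh
# Placeholder dataset names -- each "+"-separated entry lines up across the arguments.
--train_dataset_name "first_dataset+second_dataset" \
--train_metadata_dataset_name "first_metadata+second_metadata" \
--train_dataset_config_name "default+default" \
--train_split_name "train+train"
```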
Thus, the script generalises to any number of training datasets.
> [!TIP]
> Fine-tuning is as easy as modifying `model_name_or_path` to a pre-trained model.
> For example: `--model_name_or_path parler-tts/parler_tts_300M_v0.1`.