Commit 92f82a3

add TL;DR for training

1 parent 59d717e commit 92f82a3

2 files changed: +22 −12 lines changed

README.md

Lines changed: 11 additions & 1 deletion
@@ -58,7 +58,17 @@ pip install git+https://github.com/huggingface/parler-tts.git
 
 ## Training
 
-TODO
+The [training folder](/training/) contains all the information to train or fine-tune your own Parler-TTS model. It consists of:
+- [1. An introduction to the Parler-TTS architecture](/training/README.md#1-architecture)
+- [2. The first steps to get started](/training/README.md#2-getting-started)
+- [3. A training guide](/training/README.md#3-training)
+
+> [!IMPORTANT]
+> **TL;DR:** After having followed the [installation steps](/training/README.md#requirements), you can reproduce the Parler-TTS v0.1 training recipe with the following command line:
+
+```sh
+accelerate launch ./training/run_parler_tts_training.py ./helpers/training_configs/starting_point_0.01.json
+```
 
 ## Acknowledgements
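Before launching that command, it can help to inspect the recipe's hyper-parameters; a minimal sketch, assuming it runs from the repository root and that the config file is a flat JSON object:

```python
import json

# Print the hyper-parameters defined by the v0.1 training recipe
# referenced in the TL;DR above. Assumes the script is run from the
# repository root so the relative path resolves.
with open("./helpers/training_configs/starting_point_0.01.json") as f:
    config = json.load(f)

for key, value in sorted(config.items()):
    print(f"{key}: {value}")
```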

training/README.md

Lines changed: 11 additions & 11 deletions
@@ -1,13 +1,13 @@
 # Training Parler-TTS
 
 This sub-folder contains all the information to train or fine-tune your own Parler-TTS model. It consists of:
-- [A. An introduction to the Parler-TTS architecture](#a-architecture)
-- [B. First steps to get started](#b-getting-started)
-- [C. Training guide](#c-training)
-- [E. Scaling up to 10.5K hours](#d-scaling-up---discussions-and-tips)
+- [1. An introduction to the Parler-TTS architecture](#a-architecture)
+- [2. First steps to get started](#b-getting-started)
+- [3. Training guide](#c-training)
+- [4. Scaling up to 10.5K hours](#d-scaling-up---discussions-and-tips)
 
 
-## A. Architecture
+## 1. Architecture
 
 At the moment, Parler-TTS architecture is a carbon copy of the [MusicGen architecture](https://huggingface.co/docs/transformers/v4.39.3/en/model_doc/musicgen#model-structure) and can be decomposed into three distinct stages:
 >1. Text encoder: maps the text descriptions to a sequence of hidden-state representations. Parler-TTS uses a frozen text encoder initialised entirely from Flan-T5
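As a rough illustration of that first stage, the sketch below encodes a description with a frozen Flan-T5 encoder via transformers; the `google/flan-t5-base` checkpoint is an assumption here, chosen only for illustration, and Parler-TTS may use a different Flan-T5 size.

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Frozen Flan-T5 encoder mapping a text description to hidden states,
# as in the first stage described above. The checkpoint choice is an
# illustrative assumption, not necessarily the one Parler-TTS uses.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
encoder = T5EncoderModel.from_pretrained("google/flan-t5-base")
encoder.requires_grad_(False)  # frozen, as described above

inputs = tokenizer(
    "A female speaker with a calm, slightly expressive voice.",
    return_tensors="pt",
)
with torch.no_grad():
    hidden_states = encoder(**inputs).last_hidden_state
print(hidden_states.shape)  # (batch, sequence_length, hidden_size)
```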
@@ -20,14 +20,14 @@ Parler-TTS however introduces some small tweaks:
 - The audio encoder used is [**DAC**](https://descript.notion.site/Descript-Audio-Codec-11389fce0ce2419891d6591a68f814d5) instead of [Encodec](https://github.com/facebookresearch/encodec), as it exhibits better quality.
 
 
-## B. Getting started
+## 2. Getting started
 
 To get started, you need to follow a few steps:
 1. Install the requirements.
 2. Find or initialize the model you'll train on.
 3. Find and/or annotate the dataset you'll train your model on.
 
-### 1. Requirements
+### Requirements
 
 The Parler-TTS code is written in [PyTorch](https://pytorch.org) and [Accelerate](https://huggingface.co/docs/accelerate/index). It uses some additional requirements, like [wandb](https://wandb.ai/), especially for logging and evaluation.
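Since wandb is used for logging and evaluation, it helps to authenticate once before the first run; a one-line sketch, assuming you already have a wandb account and API key:

```python
import wandb

# One-time authentication for the wandb logging mentioned above;
# prompts for an API key if none is cached locally.
wandb.login()
```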

@@ -60,7 +60,7 @@ huggingface-cli login
 ```
 And then enter an authentication token from https://huggingface.co/settings/tokens. Create a new token if you do not have one already. You should make sure that this token has "write" privileges.
 
-### 2. Initialize a model from scratch or use a pre-trained one.
+### Initialize a model from scratch or use a pre-trained one.
 
 Depending on your compute resources and your dataset, you need to choose between fine-tuning a pre-trained model and training a new model from scratch.
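For scripted setups, `huggingface_hub` offers a programmatic equivalent of the CLI login shown above; the token in this sketch is a placeholder to be replaced with a real token that has "write" privileges.

```python
from huggingface_hub import login

# Programmatic alternative to `huggingface-cli login`; the token below
# is a placeholder, paste one with "write" privileges from
# https://huggingface.co/settings/tokens.
login(token="hf_xxx")
```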

@@ -79,7 +79,7 @@ python helpers/model_init_scripts/init_model_300M.py ./parler-tts-untrained-300M
 ```
 
 
-### 3. Create or find datasets
+### Create or find datasets
 
 To train your own Parler-TTS, you need datasets with 3 main features:
 - speech data
@@ -91,7 +91,7 @@ Note that we made the choice to use description of the main speech characteristi
 In the rest of this guide, and to make it simple, we'll use the [4.8K-samples clean test split](https://huggingface.co/datasets/blabble-io/libritts_r/viewer/clean/test.clean) of [LibriTTS-R](https://huggingface.co/datasets/blabble-io/libritts_r/). We've annotated LibriTTS-R using [Data-Speech](https://github.com/huggingface/dataspeech) and shared the resulting dataset here: [parler-tts/libritts_r_tags_tagged_10k_generated](https://huggingface.co/datasets/parler-tts/libritts_r_tags_tagged_10k_generated).
 
 
-## C. Training
+## 3. Training
 
 The script [`run_parler_tts_training.py`](/training/run_parler_tts_training.py) is an end-to-end script that:
 1. load dataset(s) and merge them to the annotation dataset(s) if necessary
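To poke at that running example locally, a minimal `datasets` sketch that pulls the LibriTTS-R clean test split and the annotated companion dataset; the config and split names for the speech data come from the viewer URL above, while loading the annotation dataset with its default configuration is an assumption.

```python
from datasets import load_dataset

# The ~4.8K-sample clean test split of LibriTTS-R used as the running
# example in the guide above.
speech = load_dataset("blabble-io/libritts_r", "clean", split="test.clean")

# The Data-Speech annotations shared by the Parler-TTS team; default
# configuration assumed here.
annotations = load_dataset("parler-tts/libritts_r_tags_tagged_10k_generated")

print(speech)
print(annotations)
```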
@@ -187,7 +187,7 @@ And finally, two additional comments:
 
 
 
-## D. Scaling up - Discussions and tips
+## 4. Scaling up - Discussions and tips
 
 [starting_point_0.01.json](helpers/training_configs/starting_point_0.01.json) offers a good hyper-parameters starting point to scale up the training recipe to thousands of hours of data:
