The [training folder](/training/) contains all the information to train or fine-tune your own Parler-TTS model. It consists of:
- [1. An introduction to the Parler-TTS architecture](/training/README.md#1-architecture)
- [2. The first steps to get started](/training/README.md#2-getting-started)
- [3. A training guide](/training/README.md#3-training)

> [!IMPORTANT]
> **TL;DR:** After having followed the [installation steps](/training/README.md#requirements), you can reproduce the Parler-TTS v0.1 training recipe with the following command line:
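> (A sketch of the launch, assuming the end-to-end script and starting-point config referenced in the training guide below; the exact arguments may differ.)
> ```sh
> accelerate launch ./training/run_parler_tts_training.py ./helpers/training_configs/starting_point_0.01.json
> ```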
# Training Parler-TTS
This sub-folder contains all the information to train or fine-tune your own Parler-TTS model. It consists of:
- [1. An introduction to the Parler-TTS architecture](#1-architecture)
- [2. First steps to get started](#2-getting-started)
- [3. Training guide](#3-training)
- [4. Scaling up to 10.5K hours](#4-scaling-up---discussions-and-tips)

## 1. Architecture

At the moment, the Parler-TTS architecture is a carbon copy of the [MusicGen architecture](https://huggingface.co/docs/transformers/v4.39.3/en/model_doc/musicgen#model-structure) and can be decomposed into three distinct stages:
>1. Text encoder: maps the text descriptions to a sequence of hidden-state representations. Parler-TTS uses a frozen text encoder initialised entirely from Flan-T5

Parler-TTS however introduces some small tweaks:
- The audio encoder used is [**DAC**](https://descript.notion.site/Descript-Audio-Codec-11389fce0ce2419891d6591a68f814d5) instead of [Encodec](https://github.com/facebookresearch/encodec), as it exhibits better quality.

## 2. Getting started

To get started, you need to follow a few steps:
1. Install the requirements.
2. Find or initialize the model you'll train on.
3. Find and/or annotate the dataset you'll train your model on.

### Requirements

The Parler-TTS code is written in [PyTorch](https://pytorch.org) and [Accelerate](https://huggingface.co/docs/accelerate/index). It uses some additional requirements, like [wandb](https://wandb.ai/), especially for logging and evaluation.
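In practice, a typical from-source setup looks like the sketch below (the repository URL is the Parler-TTS GitHub repo; the `[train]` extra is an assumption about how the training dependencies are packaged):

```sh
# Sketch: clone the repository and install it in editable mode together with the
# training dependencies (the name of the extra is an assumption).
git clone https://github.com/huggingface/parler-tts.git
cd parler-tts
pip install -e .[train]
```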
To log in to the Hugging Face Hub, run:

```
huggingface-cli login
```
And then enter an authentication token from https://huggingface.co/settings/tokens. Create a new token if you do not have one already. You should make sure that this token has "write" privileges.
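Since wandb is used for logging and evaluation (see the requirements above), you will likely also want to authenticate with it:

```sh
# Log in to Weights & Biases so that training runs can be tracked.
wandb login
```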

### Initialize a model from scratch or use a pre-trained one.

Depending on your compute resources and your dataset, you need to choose between fine-tuning a pre-trained model and training a new model from scratch.
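For the fine-tuning route, a minimal sketch of loading an existing checkpoint as a starting point is shown below (the `ParlerTTSForConditionalGeneration` class and the `parler-tts/parler_tts_mini_v0.1` checkpoint name are assumed here; check the repository and the Hub for the exact identifiers):

```python
# Sketch: load a pre-trained Parler-TTS checkpoint to fine-tune, instead of
# initializing a new model from scratch.
from parler_tts import ParlerTTSForConditionalGeneration

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1")
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```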
To train your own Parler-TTS, you need datasets with 3 main features:
- speech data
In the rest of this guide, to keep things simple, we'll use the [4.8K-samples clean test split](https://huggingface.co/datasets/blabble-io/libritts_r/viewer/clean/test.clean) of [LibriTTS-R](https://huggingface.co/datasets/blabble-io/libritts_r/). We've annotated LibriTTS-R using [Data-Speech](https://github.com/huggingface/dataspeech) and shared the resulting dataset here: [parler-tts/libritts_r_tags_tagged_10k_generated](https://huggingface.co/datasets/parler-tts/libritts_r_tags_tagged_10k_generated).
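To double-check what the annotated dataset contains before plugging it into the training script, here is a quick sketch (only the dataset identifier above is taken from this guide; split and column names are whatever the Hub repository defines):

```python
# Sketch: download the annotated dataset referenced above and list its splits,
# sizes and column names.
from datasets import load_dataset

dataset = load_dataset("parler-tts/libritts_r_tags_tagged_10k_generated")
for split_name, split in dataset.items():
    print(split_name, split.num_rows, split.column_names)
```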

## 3. Training

The script [`run_parler_tts_training.py`](/training/run_parler_tts_training.py) is an end-to-end script that:
1. loads the dataset(s) and merges them with the annotation dataset(s) if necessary

## 4. Scaling up - Discussions and tips

[starting_point_0.01.json](helpers/training_configs/starting_point_0.01.json) offers a good hyper-parameter starting point for scaling the training recipe up to thousands of hours of data.