
Commit a53577f

Merge pull request #2 from ylacombe/main
Release
2 parents 85b8cac + 5eae102, commit a53577f

25 files changed: +3186 -2126 lines

LICENSE

Lines changed: 1 addition & 1 deletion
@@ -186,7 +186,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.

- Copyright [yyyy] [name of copyright owner]
+ Copyright [2024] [The HuggingFace Inc. team]

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.

README.md

Lines changed: 118 additions & 82 deletions
@@ -1,93 +1,129 @@
# Stable Speech

Work-in-progress reproduction of the text-to-speech (TTS) model from the paper [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://www.text-description-to-speech.com) by Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively.

Reproducing the TTS model requires completing the following 5 steps in order:
1. Train the Accent Classifier
2. Annotate the Training Set
3. Aggregate Statistics
4. Create Descriptions
5. Train the Model

## Step 1: Train the Accent Classifier

The script [`run_audio_classification.py`](run_audio_classification.py) can be used to train an audio encoder model from the [Transformers library](https://github.com/huggingface/transformers) (e.g. Wav2Vec2, MMS, or Whisper) for the accent classification task.

Starting from a pre-trained audio encoder model, a simple linear classifier is appended to the last hidden layer to map the audio embeddings to class-label predictions. The audio encoder can either be frozen (`--freeze_base_model`) or trained. The linear classifier is randomly initialised, and is thus always trained.

The script can be used to train on a single accent dataset or on a combination of datasets, specified by separating the dataset names, configs and splits with the `+` character in the launch command (see below for an example).

In the following example, we follow Stability's approach by taking audio embeddings from a frozen [MMS-LID](https://huggingface.co/facebook/mms-lid-126) model, and training the linear classifier on a combination of three open-source datasets:
1. The English Accented (`en_accented`) subset of [Voxpopuli](https://huggingface.co/datasets/facebook/voxpopuli)
2. The train split of [VCTK](https://huggingface.co/datasets/vctk)
3. The dev split of [EdAcc](https://huggingface.co/datasets/edinburghcstr/edacc)

The model is subsequently evaluated on the test split of [EdAcc](https://huggingface.co/datasets/edinburghcstr/edacc) to give the final classification accuracy.

```bash
#!/usr/bin/env bash

python run_audio_classification.py \
    --model_name_or_path "facebook/mms-lid-126" \
    --train_dataset_name "vctk+facebook/voxpopuli+edinburghcstr/edacc" \
    --train_dataset_config_name "main+en_accented+default" \
    --train_split_name "train+test+validation" \
    --train_label_column_name "accent+accent+accent" \
    --eval_dataset_name "edinburghcstr/edacc" \
    --eval_dataset_config_name "default" \
    --eval_split_name "test" \
    --eval_label_column_name "accent" \
    --output_dir "./" \
    --do_train \
    --do_eval \
    --overwrite_output_dir \
    --remove_unused_columns False \
    --fp16 \
    --learning_rate 1e-4 \
    --max_length_seconds 20 \
    --attention_mask False \
    --warmup_ratio 0.1 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 32 \
    --preprocessing_num_workers 16 \
    --dataloader_num_workers 4 \
    --logging_strategy "steps" \
    --logging_steps 10 \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --load_best_model_at_end True \
    --metric_for_best_model "accuracy" \
    --save_total_limit 3 \
    --freeze_base_model \
    --push_to_hub \
    --trust_remote_code
```

Tips:
1. **Number of labels:** normalisation should be applied to the target class labels to group linguistically similar accents together (e.g. "Southern Irish" and "Irish" should both be mapped to "Irish"). This helps _balance_ the dataset by removing labels with very few examples. You can modify the function `preprocess_labels` to implement any custom normalisation strategy.

## Step 2: Annotate the Training Set

Annotate the training dataset with information on SNR, C50, pitch and speaking rate.

## Step 3: Aggregate Statistics

Aggregate the statistics from Step 2 and convert the continuous values to discrete labels (see the illustrative sketch at the end of this section).

## Step 4: Create Descriptions

Convert the sequence of discrete labels to a text description (using an LLM).

## Step 5: Train the Model

Train a MusicGen-style model on the TTS task.
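To make Step 3 concrete, here is a minimal sketch of turning a continuous annotation (speaking rate, in this case) into discrete text labels by binning against the training-set distribution. The bin edges, label names and example values below are illustrative placeholders, not the project's actual implementation:

```py
import numpy as np

# hypothetical per-utterance speaking rates (e.g. phonemes per second) from Step 2
speaking_rates = np.array([9.2, 12.5, 14.1, 16.8, 11.0, 13.3])

# illustrative bin edges: quartiles of the training-set distribution
bin_edges = np.quantile(speaking_rates, [0.25, 0.5, 0.75])
labels = ["very slowly", "slowly", "quite fast", "very fast"]

# map each continuous value to the discrete label of the bin it falls into
discrete = [labels[np.searchsorted(bin_edges, rate)] for rate in speaking_rates]
print(discrete)  # one text label per utterance
```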
# Parler-TTS

Parler-TTS is a lightweight text-to-speech (TTS) model that can generate high-quality, natural-sounding speech in the style of a given speaker (gender, pitch, speaking style, etc.). It is a reproduction of the work from the paper [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://www.text-description-to-speech.com) by Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively.

Unlike other TTS models, Parler-TTS is a **fully open-source** release. All of the datasets, pre-processing, training code and weights are released publicly under a permissive license, enabling the community to build on our work and develop their own powerful TTS models.

This repository contains the inference and training code for Parler-TTS. It is designed to accompany the [Data-Speech](https://github.com/huggingface/dataspeech) repository for dataset annotation.

> [!IMPORTANT]
> We're proud to release Parler-TTS v0.1, our first 300M-parameter model, trained on 10.5K hours of audio data.
> In the coming weeks, we'll be working on scaling up to 50k hours of data in preparation for the v1 model.

## 📖 Quick Index
* [Installation](#installation)
* [Usage](#usage)
* [Training](#training)
* [Demo](https://huggingface.co/spaces/parler-tts/parler_tts_mini)
* [Model weights and datasets](https://huggingface.co/parler-tts)


## Usage

> [!TIP]
> You can try it out directly in an interactive demo [here](https://huggingface.co/spaces/parler-tts/parler_tts_mini)!

Using Parler-TTS is as simple as "bonjour". Just run the following inference snippet.

```py
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_300M_v0.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_300M_v0.1")

prompt = "Hey, how are you doing today?"
description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
```
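If you want reproducible outputs across runs when sampling, you can fix the random seed before generating, as the Gradio demo added in this commit (`helpers/gradio_demo/app.py`) does. A minimal sketch, reusing the `model`, `input_ids` and `prompt_input_ids` objects from the snippet above:

```py
from transformers import set_seed

set_seed(41)  # same seed as in helpers/gradio_demo/app.py

# sampling parameters mirror the demo: do_sample=True, temperature=1.0
generation = model.generate(
    input_ids=input_ids,
    prompt_input_ids=prompt_input_ids,
    do_sample=True,
    temperature=1.0,
)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_seeded_out.wav", audio_arr, model.config.sampling_rate)
```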
## Installation

Parler-TTS has lightweight dependencies and can be installed in one line:

```sh
pip install git+https://github.com/huggingface/parler-tts.git
```

## Training

The [training folder](/training/) contains all the information needed to train or fine-tune your own Parler-TTS model. It consists of:
- [1. An introduction to the Parler-TTS architecture](/training/README.md#1-architecture)
- [2. The first steps to get started](/training/README.md#2-getting-started)
- [3. A training guide](/training/README.md#3-training)

> [!IMPORTANT]
> **TL;DR:** After following the [installation steps](/training/README.md#requirements), you can reproduce the Parler-TTS v0.1 training recipe with the following command line:

```sh
accelerate launch ./training/run_parler_tts_training.py ./helpers/training_configs/starting_point_0.01.json
```
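Once training has finished, the resulting checkpoint can be loaded for inference in exactly the same way as the pretrained model in the [Usage](#usage) section. A minimal sketch, assuming a hypothetical local output directory `./output_dir` (use whatever output path your training config specifies):

```py
import torch
from parler_tts import ParlerTTSForConditionalGeneration

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# "./output_dir" is a placeholder for the output directory of your training run
model = ParlerTTSForConditionalGeneration.from_pretrained("./output_dir").to(device)
```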

## Acknowledgements

This library builds on top of a number of open-source giants, to whom we'd like to extend our warmest thanks for providing these tools!

Special thanks to:
- Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively, for publishing such a promising and clear research paper: [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://arxiv.org/abs/2402.01912).
- the many libraries used, namely [🤗 datasets](https://huggingface.co/docs/datasets/v2.17.0/en/index), [🤗 accelerate](https://huggingface.co/docs/accelerate/en/index), [jiwer](https://github.com/jitsi/jiwer), [wandb](https://wandb.ai/), and [🤗 transformers](https://huggingface.co/docs/transformers/index).
- Descript for the [DAC codec model](https://github.com/descriptinc/descript-audio-codec).
- Hugging Face 🤗 for providing compute resources and time to explore!

## Citation

If you found this repository useful, please consider citing this work and also the original Stability AI paper:

```
@misc{lacombe-etal-2024-parler-tts,
  author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
  title = {Parler-TTS},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/parler-tts}}
}
```

```
@misc{lyth2024natural,
  title = {Natural language guidance of high-fidelity text-to-speech with synthetic annotations},
  author = {Dan Lyth and Simon King},
  year = {2024},
  eprint = {2402.01912},
  archivePrefix = {arXiv},
  primaryClass = {cs.SD}
}
```

## Contribution

Contributions are welcome, as the project offers many possibilities for improvement and exploration.

Namely, we're looking at ways to improve both quality and speed:
- Datasets:
  - Train on more data
  - Add more features, such as accents
- Training:
  - Add PEFT compatibility for LoRA fine-tuning
  - Add the possibility to train without a description column
  - Add notebook training
  - Explore multilingual training
  - Explore mono-speaker fine-tuning
  - Explore more architectures
- Optimization:
  - Compilation and static cache
  - Support for FA2 and SDPA
- Evaluation:
  - Add more evaluation metrics

audio_classification_scripts/run_wav2vec2_dummy.sh

Lines changed: 0 additions & 38 deletions
This file was deleted.

helpers/gradio_demo/app.py

Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
import gradio as gr
import torch

from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer, AutoFeatureExtractor, set_seed

device = "cuda:0" if torch.cuda.is_available() else "cpu"

repo_id = "parler-tts/parler_tts_300M_v0.1"

# Load the model, tokenizer and feature extractor once at startup.
model = ParlerTTSForConditionalGeneration.from_pretrained(repo_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
feature_extractor = AutoFeatureExtractor.from_pretrained(repo_id)


SAMPLE_RATE = feature_extractor.sampling_rate
SEED = 41

default_text = "Please surprise me and speak in whatever voice you enjoy."

title = "# Parler-TTS </div>"

# Example (transcript, description) pairs shown in the demo.
examples = [
    [
        "'This is the best time of my life, Bartley,' she said happily.",
        "A female speaker with a slightly low-pitched, quite monotone voice delivers her words at a slightly faster-than-average pace in a confined space with very clear audio.",
    ],
    [
        "Montrose also, after having experienced still more variety of good and bad fortune, threw down his arms, and retired out of the kingdom. ",
        "A male speaker with a slightly high-pitched voice delivering his words at a slightly slow pace in a small, confined space with a touch of background noise and a quite monotone tone.",
    ],
    [
        "montrose also after having experienced still more variety of good and bad fortune threw down his arms and retired out of the kingdom",
        "A male speaker with a low-pitched voice delivering his words at a fast pace in a small, confined space with a lot of background noise and an animated tone.",
    ],
]


def gen_tts(text, description):
    # Tokenize the voice description (conditioning) and the transcript (prompt).
    inputs = tokenizer(description, return_tensors="pt").to(device)
    prompt = tokenizer(text, return_tensors="pt").to(device)

    # Fix the seed so the same inputs always produce the same audio.
    set_seed(SEED)
    generation = model.generate(
        input_ids=inputs.input_ids, prompt_input_ids=prompt.input_ids, do_sample=True, temperature=1.0
    )
    audio_arr = generation.cpu().numpy().squeeze()

    return (SAMPLE_RATE, audio_arr)


# Custom CSS for the share-button styling.
css = """
#share-btn-container {
    display: flex;
    padding-left: 0.5rem !important;
    padding-right: 0.5rem !important;
    background-color: #000000;
    justify-content: center;
    align-items: center;
    border-radius: 9999px !important;
    width: 13rem;
    margin-top: 10px;
    margin-left: auto;
    flex: unset !important;
}
#share-btn {
    all: initial;
    color: #ffffff;
    font-weight: 600;
    cursor: pointer;
    font-family: 'IBM Plex Sans', sans-serif;
    margin-left: 0.5rem !important;
    padding-top: 0.25rem !important;
    padding-bottom: 0.25rem !important;
    right: 0;
}
#share-btn * {
    all: unset !important;
}
#share-btn-container div:nth-child(-n+2){
    width: auto !important;
    min-height: 0px !important;
}
#share-btn-container .wrap {
    display: none !important;
}
"""

with gr.Blocks(css=css) as block:
    gr.Markdown(title)
    with gr.Row():
        with gr.Column():
            input_text = gr.Textbox(label="Input Text", lines=2, value=default_text, elem_id="input_text")
            description = gr.Textbox(label="Description", lines=2, value="", elem_id="input_description")
            run_button = gr.Button("Generate Audio", variant="primary")
        with gr.Column():
            audio_out = gr.Audio(label="Parler-TTS generation", type="numpy", elem_id="audio_out")

    inputs = [input_text, description]
    outputs = [audio_out]
    gr.Examples(examples=examples, fn=gen_tts, inputs=inputs, outputs=outputs, cache_examples=True)
    run_button.click(fn=gen_tts, inputs=inputs, outputs=outputs, queue=True)

block.queue()
block.launch(share=True)
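The demo can also be run locally from the repository root (assuming `parler_tts` and `gradio` are installed); `share=True` additionally exposes a temporary public Gradio link:

```sh
python helpers/gradio_demo/app.py
```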
