# Stable Speech

Work-in-progress reproduction of the text-to-speech (TTS) model from the paper [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://www.text-description-to-speech.com) by Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively.

Reproducing the TTS model requires the following 5 steps to be completed in order:
1. Train the Accent Classifier
2. Annotate the Training Set
3. Aggregate Statistics
4. Create Descriptions
5. Train the Model

## Step 1: Train the Accent Classifier

The script [`run_audio_classification.py`](run_audio_classification.py) can be used to train an audio encoder model from the [Transformers library](https://github.com/huggingface/transformers) (e.g. Wav2Vec2, MMS, or Whisper) for the accent classification task.

Starting with a pre-trained audio encoder model, a simple linear classifier is appended to the last hidden layer to map the audio embeddings to class-label predictions. The audio encoder can either be frozen (`--freeze_base_model`) or trained. The linear classifier is randomly initialised, and is thus always trained.

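For illustration, the snippet below is a minimal sketch of this setup (separate from `run_audio_classification.py`): a frozen pre-trained encoder with a freshly initialised classification head. The accent labels are placeholders.

```py
# Minimal sketch of a frozen audio encoder with a new classification head.
# The label set below is a placeholder, not the labels used in the project.
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

labels = ["american", "irish", "scottish"]
model = AutoModelForAudioClassification.from_pretrained(
    "facebook/mms-lid-126",
    num_labels=len(labels),
    label2id={label: i for i, label in enumerate(labels)},
    id2label={i: label for i, label in enumerate(labels)},
    ignore_mismatched_sizes=True,  # swap the 126-way LID head for a new accent head
)
model.freeze_base_model()  # freeze the encoder; only the classification head is trained

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/mms-lid-126")
waveform = torch.randn(16_000).numpy()  # stand-in for 1 second of 16 kHz audio
inputs = feature_extractor(waveform, sampling_rate=16_000, return_tensors="pt")
logits = model(**inputs).logits  # shape: (batch_size, num_labels)
predicted_accent = labels[int(logits.argmax(-1))]
```
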
The script can be used to train on a single accent dataset, or a combination of datasets, which should be specified by separating dataset names, configs and splits by the `+` character in the launch command (see below for an example).

In the example below, we follow Stability's approach by taking audio embeddings from a frozen [MMS-LID](https://huggingface.co/facebook/mms-lid-126) model, and training the linear classifier on a combination of three open-source datasets:
1. The English Accented (`en_accented`) subset of [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli)
2. The train split of [VCTK](https://huggingface.co/datasets/vctk)
3. The dev split of [EdAcc](https://huggingface.co/datasets/edinburghcstr/edacc)

The model is subsequently evaluated on the test split of [EdAcc](https://huggingface.co/datasets/edinburghcstr/edacc) to give the final classification accuracy.

```bash
#!/usr/bin/env bash

python run_audio_classification.py \
    --model_name_or_path "facebook/mms-lid-126" \
    --train_dataset_name "vctk+facebook/voxpopuli+edinburghcstr/edacc" \
    --train_dataset_config_name "main+en_accented+default" \
    --train_split_name "train+test+validation" \
    --train_label_column_name "accent+accent+accent" \
    --eval_dataset_name "edinburghcstr/edacc" \
    --eval_dataset_config_name "default" \
    --eval_split_name "test" \
    --eval_label_column_name "accent" \
    --output_dir "./" \
    --do_train \
    --do_eval \
    --overwrite_output_dir \
    --remove_unused_columns False \
    --fp16 \
    --learning_rate 1e-4 \
    --max_length_seconds 20 \
    --attention_mask False \
    --warmup_ratio 0.1 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 32 \
    --preprocessing_num_workers 16 \
    --dataloader_num_workers 4 \
    --logging_strategy "steps" \
    --logging_steps 10 \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --load_best_model_at_end True \
    --metric_for_best_model "accuracy" \
    --save_total_limit 3 \
    --freeze_base_model \
    --push_to_hub \
    --trust_remote_code
```

Tips:
1. **Number of labels:** normalisation should be applied to the target class labels to group linguistically similar accents together (e.g. "Southern Irish" and "Irish" should both be mapped to "Irish"). This helps _balance_ the dataset by removing labels with very few examples. You can modify the function `preprocess_labels` to implement any custom normalisation strategy, as sketched below.

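A minimal sketch of one possible normalisation strategy is given below; the mapping and the function signature are illustrative assumptions, not the script's actual implementation.

```py
# Illustrative accent-label normalisation; the mapping below is an assumption.
ACCENT_MAP = {
    "southern irish": "irish",
    "northern irish": "irish",
    "scottish highlands": "scottish",
}

def preprocess_labels(label: str) -> str:
    """Lower-case the raw label and collapse linguistically similar accents."""
    label = label.strip().lower()
    return ACCENT_MAP.get(label, label)

assert preprocess_labels("Southern Irish") == "irish"
```
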
## Step 2: Annotate the Training Set

Annotate the training dataset with information on SNR (signal-to-noise ratio), C50 (speech clarity), pitch and speaking rate.

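As a rough illustration, speaking rate can be derived directly from the transcript and the audio duration with 🤗 Datasets; SNR, C50 and pitch typically come from dedicated estimators (a room-acoustics model and a pitch tracker), which are omitted here. The dataset and column names are placeholders.

```py
# Rough illustration of annotating speaking rate; SNR, C50 and pitch estimation are omitted.
from datasets import Audio, load_dataset

dataset = load_dataset("vctk", "main", split="train")  # any dataset with "audio" and "text" columns
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

def annotate(example):
    audio = example["audio"]
    duration = len(audio["array"]) / audio["sampling_rate"]
    example["speaking_rate"] = len(example["text"].split()) / duration  # words per second
    return example

dataset = dataset.map(annotate)
```
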
## Step 3: Aggregate Statistics

Aggregate the statistics computed in Step 2 and convert the continuous values to discrete labels.

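A minimal sketch of one way to perform this binning with NumPy is shown below; the bin edges and label names are illustrative assumptions.

```py
# Illustrative binning of a continuous attribute (e.g. speaking rate) into discrete labels.
import numpy as np

speaking_rates = np.array([2.1, 2.8, 3.5, 4.2, 5.0])  # words per second, from Step 2
bin_edges = np.quantile(speaking_rates, [0.25, 0.75])  # dataset-level statistics
bin_names = ["slowly", "moderately", "quickly"]

labels = [bin_names[i] for i in np.digitize(speaking_rates, bin_edges)]
# -> ['slowly', 'moderately', 'moderately', 'quickly', 'quickly']
```
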
## Step 4: Create Descriptions

Convert the sequence of discrete labels to a natural-language text description (using an LLM).

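One possible approach is to prompt an instruction-tuned LLM with the discrete labels; the prompt wording and the model used below are illustrative assumptions.

```py
# Illustrative prompting of an instruction-tuned LLM to turn discrete labels
# into a natural-language description; the prompt and model choice are assumptions.
from transformers import pipeline

keywords = {
    "gender": "female",
    "pitch": "slightly low-pitched",
    "speaking rate": "very fast",
    "reverberation": "very confined sounding",
    "noise": "clear audio quality",
}
prompt = (
    "Write a one-sentence description of a speaker with the following characteristics: "
    + ", ".join(f"{k}: {v}" for k, v in keywords.items())
)

generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")
description = generator(prompt, max_new_tokens=60, do_sample=True)[0]["generated_text"]
```
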
## Step 5: Train the Model

Train a MusicGen-style model on the TTS task.

# Parler-TTS

Parler-TTS is a lightweight text-to-speech (TTS) model that can generate high-quality, natural-sounding speech in the style of a given speaker (gender, pitch, speaking style, etc.). It is a reproduction of work from the paper [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://www.text-description-to-speech.com) by Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively.

In contrast to other TTS models, Parler-TTS is a **fully open-source** release. All of the datasets, pre-processing, training code and weights are released publicly under a permissive license, enabling the community to build on our work and develop their own powerful TTS models.

This repository contains the inference and training code for Parler-TTS. It is designed to accompany the [Data-Speech](https://github.com/huggingface/dataspeech) repository for dataset annotation.

> [!IMPORTANT]
> We're proud to release Parler-TTS v0.1, our first 300M-parameter model, trained on 10.5k hours of audio data.
> In the coming weeks, we'll be working on scaling up to 50k hours of data, in preparation for the v1 model.

## 📖 Quick Index
* [Installation](#installation)
* [Usage](#usage)
* [Training](#training)
* [Demo](https://huggingface.co/spaces/parler-tts/parler_tts_mini)
* [Model weights and datasets](https://huggingface.co/parler-tts)

## Usage

> [!TIP]
> You can directly try it out in an interactive demo [here](https://huggingface.co/spaces/parler-tts/parler_tts_mini)!

Using Parler-TTS is as simple as "bonjour". Simply use the following inference snippet.

```py
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_300M_v0.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_300M_v0.1")

# the description controls the voice and speaking style; the prompt is the text to be spoken
prompt = "Hey, how are you doing today?"
description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)  # save as a .wav file
```

## Installation

Parler-TTS has lightweight dependencies and can be installed in one line:

```sh
pip install git+https://github.com/huggingface/parler-tts.git
```

## Training

The [training folder](/training/) contains all the information to train or fine-tune your own Parler-TTS model. It consists of:
- [1. An introduction to the Parler-TTS architecture](/training/README.md#1-architecture)
- [2. The first steps to get started](/training/README.md#2-getting-started)
- [3. A training guide](/training/README.md#3-training)

> [!IMPORTANT]
> **TL;DR:** After having followed the [installation steps](/training/README.md#requirements), you can reproduce the Parler-TTS v0.1 training recipe with the following command line:

```sh
accelerate launch ./training/run_parler_tts_training.py ./helpers/training_configs/starting_point_0.01.json
```

## Acknowledgements

This library builds on top of a number of open-source giants, to whom we'd like to extend our warmest thanks for providing these tools!

Special thanks to:
- Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively, for publishing such a promising and clear research paper: [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://arxiv.org/abs/2402.01912).
- The many libraries used, namely [🤗 datasets](https://huggingface.co/docs/datasets/v2.17.0/en/index), [🤗 accelerate](https://huggingface.co/docs/accelerate/en/index), [jiwer](https://github.com/jitsi/jiwer), [wandb](https://wandb.ai/), and [🤗 transformers](https://huggingface.co/docs/transformers/index).
- Descript for the [DAC codec model](https://github.com/descriptinc/descript-audio-codec).
- Hugging Face 🤗 for providing compute resources and time to explore!

## Citation

If you found this repository useful, please consider citing this work and also the original Stability AI paper:

```bibtex
@misc{lacombe-etal-2024-parler-tts,
  author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
  title = {Parler-TTS},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/parler-tts}}
}
```

```bibtex
@misc{lyth2024natural,
  title={Natural language guidance of high-fidelity text-to-speech with synthetic annotations},
  author={Dan Lyth and Simon King},
  year={2024},
  eprint={2402.01912},
  archivePrefix={arXiv},
  primaryClass={cs.SD}
}
```

## Contribution

Contributions are welcome, as the project offers many possibilities for improvement and exploration.

Namely, we're looking at ways to improve both quality and speed:
- Datasets:
  - Train on more data.
  - Add more features, such as accents.
- Training:
  - Add PEFT compatibility to do LoRA fine-tuning.
  - Add the possibility to train without a description column.
  - Add notebook training.
  - Explore multilingual training.
  - Explore mono-speaker fine-tuning.
  - Explore more architectures.
- Optimization:
  - Compilation and static cache.
  - Support for FA2 and SDPA.
- Evaluation:
  - Add more evaluation metrics.