
Commit 7f97328

Fix #130 Voice consistency in documentation (#143)
* documentation of voice consistencies
* fix typos
1 parent 31816bd commit 7f97328

2 files changed: +67, -2 lines changed


INFERENCE.md

Lines changed: 63 additions & 1 deletion
@@ -7,6 +7,7 @@ Parler-TTS benefits from a number of optimizations that can make the model up to
 * [Compilation](#compilation)
 * [Streaming](#streaming)
 * [Batch generation](#batch-generation)
+* [Speaker Consistency](#speaker-consistency)
 
 ## Efficient Attention implementations
 
@@ -199,4 +200,65 @@ audio_2 = generation.sequences[1, :generation.audios_length[1]]
 print(audio_1.shape, audio_2.shape)
 scipy.io.wavfile.write("sample_out.wav", rate=feature_extractor.sampling_rate, data=audio_1.cpu().numpy().squeeze())
 scipy.io.wavfile.write("sample_out_2.wav", rate=feature_extractor.sampling_rate, data=audio_2.cpu().numpy().squeeze())
-```
+```
+
+## Speaker Consistency
+
+The checkpoint was trained on 34 speakers. The full list of available speakers includes:
+Laura, Gary, Jon, Lea, Karen, Rick, Brenda, David, Eileen, Jordan, Mike, Yann, Joy, James, Eric, Lauren, Rose, Will, Jason, Aaron, Naomie, Alisa, Patrick, Jerry, Tina, Jenna, Bill, Tom, Carol, Barbara, Rebecca, Anna, Bruce, and Emily.
+
+However, the models performed better with certain speakers. Below are the top 20 speakers for each model variant, ranked by their average speaker similarity scores:
+
+### Large Model - Top 20 Speakers
+
+| Speaker | Similarity Score |
+|---------|------------------|
+| Will | 0.906055 |
+| Eric | 0.887598 |
+| Laura | 0.877930 |
+| Alisa | 0.877393 |
+| Patrick | 0.873682 |
+| Rose | 0.873047 |
+| Jerry | 0.871582 |
+| Jordan | 0.870703 |
+| Lauren | 0.867432 |
+| Jenna | 0.866455 |
+| Karen | 0.866309 |
+| Rick | 0.863135 |
+| Bill | 0.862207 |
+| James | 0.856934 |
+| Yann | 0.856787 |
+| Emily | 0.856543 |
+| Anna | 0.848877 |
+| Jon | 0.848828 |
+| Brenda | 0.848291 |
+| Barbara | 0.847998 |
+
+### Mini Model - Top 20 Speakers
+
+| Speaker | Similarity Score |
+|---------|------------------|
+| Jon | 0.908301 |
+| Lea | 0.904785 |
+| Gary | 0.903516 |
+| Jenna | 0.901807 |
+| Mike | 0.885742 |
+| Laura | 0.882666 |
+| Lauren | 0.878320 |
+| Eileen | 0.875635 |
+| Alisa | 0.874219 |
+| Karen | 0.872363 |
+| Barbara | 0.871509 |
+| Carol | 0.863623 |
+| Emily | 0.854932 |
+| Rose | 0.852246 |
+| Will | 0.851074 |
+| Patrick | 0.850977 |
+| Eric | 0.845459 |
+| Rick | 0.845020 |
+| Anna | 0.844922 |
+| Tina | 0.839160 |
+
+The numbers represent the average speaker similarity between a random snippet of the person speaking and a random Parler-generated snippet. Higher scores indicate better model performance in maintaining voice consistency.
+
+These scores are derived from [dataset for Mini](https://huggingface.co/datasets/ylacombe/parler-tts-mini-v1_speaker_similarity) and [dataset for Large](https://huggingface.co/datasets/ylacombe/parler-large-v1-og_speaker_similarity).
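The two tables lend themselves to a small lookup when choosing a voice. The sketch below is a hypothetical helper, not part of the `parler_tts` package or of this commit. It hard-codes the published top-20 scores, returns the highest-scoring speakers for one variant, lists speakers that rank in the top 10 for both Mini and Large (ordered by their mean score), and formats a description in the `<Name>'s voice is ...` pattern that README.md uses. The names `top_speakers`, `strong_on_both`, and `describe` are illustrative assumptions.

```python
# Hypothetical helper (not part of parler_tts): pick a named speaker from the
# published similarity tables and format a text description for it.

# The 34 speaker names listed in INFERENCE.md.
SPEAKERS = (
    "Laura", "Gary", "Jon", "Lea", "Karen", "Rick", "Brenda", "David",
    "Eileen", "Jordan", "Mike", "Yann", "Joy", "James", "Eric", "Lauren",
    "Rose", "Will", "Jason", "Aaron", "Naomie", "Alisa", "Patrick", "Jerry",
    "Tina", "Jenna", "Bill", "Tom", "Carol", "Barbara", "Rebecca", "Anna",
    "Bruce", "Emily",
)

# Top-20 average speaker similarity scores, copied from the two tables.
LARGE_SCORES = {
    "Will": 0.906055, "Eric": 0.887598, "Laura": 0.877930, "Alisa": 0.877393,
    "Patrick": 0.873682, "Rose": 0.873047, "Jerry": 0.871582, "Jordan": 0.870703,
    "Lauren": 0.867432, "Jenna": 0.866455, "Karen": 0.866309, "Rick": 0.863135,
    "Bill": 0.862207, "James": 0.856934, "Yann": 0.856787, "Emily": 0.856543,
    "Anna": 0.848877, "Jon": 0.848828, "Brenda": 0.848291, "Barbara": 0.847998,
}
MINI_SCORES = {
    "Jon": 0.908301, "Lea": 0.904785, "Gary": 0.903516, "Jenna": 0.901807,
    "Mike": 0.885742, "Laura": 0.882666, "Lauren": 0.878320, "Eileen": 0.875635,
    "Alisa": 0.874219, "Karen": 0.872363, "Barbara": 0.871509, "Carol": 0.863623,
    "Emily": 0.854932, "Rose": 0.852246, "Will": 0.851074, "Patrick": 0.850977,
    "Eric": 0.845459, "Rick": 0.845020, "Anna": 0.844922, "Tina": 0.839160,
}


def top_speakers(scores, n=5):
    """Return the n speaker names with the highest similarity score."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


def strong_on_both(a, b, n=10):
    """Speakers in the top n of both tables, ordered by their mean score."""
    common = set(top_speakers(a, n)) & set(top_speakers(b, n))
    return sorted(common, key=lambda s: (a[s] + b[s]) / 2, reverse=True)


def describe(speaker, style):
    """Build a description in the "<Name>'s voice is ..." pattern from README.md."""
    if speaker not in SPEAKERS:
        raise ValueError(f"unknown speaker: {speaker!r}")
    return f"{speaker}'s voice is {style}"


if __name__ == "__main__":
    print(top_speakers(MINI_SCORES, 3))  # ['Jon', 'Lea', 'Gary']
    print(strong_on_both(LARGE_SCORES, MINI_SCORES))  # ['Jenna', 'Laura', 'Alisa', 'Lauren']
    print(describe("Jenna", "calm and clear, with almost no background noise."))
```

The string returned by `describe` is meant to be used as the text description in the README's generation example. Ranking by the mean score across both variants is only one reasonable way to combine the tables, chosen here for illustration.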

README.md

Lines changed: 4 additions & 1 deletion
@@ -79,10 +79,13 @@ sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
 
 ### 🎯 Using a specific speaker
 
-To ensure speaker consistency across generations, this checkpoint was also trained on 34 speakers, characterized by name (e.g. Jon, Lea, Gary, Jenna, Mike, Laura).
+To ensure speaker consistency across generations, this checkpoint was also trained on 34 speakers, characterized by name. The full list of available speakers includes:
+Laura, Gary, Jon, Lea, Karen, Rick, Brenda, David, Eileen, Jordan, Mike, Yann, Joy, James, Eric, Lauren, Rose, Will, Jason, Aaron, Naomie, Alisa, Patrick, Jerry, Tina, Jenna, Bill, Tom, Carol, Barbara, Rebecca, Anna, Bruce, Emily.
 
 To take advantage of this, simply adapt your text description to specify which speaker to use: `Jon's voice is monotone yet slightly fast in delivery, with a very close recording that almost has no background noise.`
 
+You can replace "Jon" with any of the names from the list above to utilize different speaker characteristics. Each speaker has unique vocal qualities that can be leveraged to suit your specific needs. For more detailed information on speaker performance with voice consistency, please refer to the [inference guide](INFERENCE.md#speaker-consistency).
+
 ```py
 import torch
 from parler_tts import ParlerTTSForConditionalGeneration
