Commit 7bf5733 ("tts faq scrub")
1 parent fbe5bda

File tree: 1 file changed (+32, -67 lines)


articles/cognitive-services/Speech-Service/faq-tts.yml

Lines changed: 32 additions & 67 deletions
@@ -9,7 +9,7 @@ metadata:
   ms.service: cognitive-services
   ms.subservice: speech-service
   ms.topic: faq
-  ms.date: 03/13/2023
+  ms.date: 03/27/2023
   ms.author: eur
 title: Text-to-speech FAQ
 summary: |
@@ -24,17 +24,17 @@ sections:
         answer: |
           There are several ways to disclose the synthetic nature of the voice including implicit and explicit byline. Refer to [Disclosure design guidelines](/legal/cognitive-services/speech-service/custom-neural-voice/concepts-disclosure-guidelines?context=/azure/cognitive-services/speech-service/context/context).
       - question: |
-          Can Text-to-Speech support PCM 48khz 24bit mono .WAV or PCM 48khz 16bit mono .WAV?
+          What audio formats does Text-to-Speech support?
         answer: |
-          We can support 48khz 16bit mono output. Each prebuilt neural voice model is available at 24kHz and high-fidelity 48kHz. Other sample rates can be obtained through up-sampling or down-sampling when synthesizing.
+          The Speech service supports 48-kHz, 24-kHz, 16-kHz, and 8-kHz audio outputs. Each prebuilt neural voice model is available at 24 kHz and high-fidelity 48 kHz. See [Audio outputs](rest-text-to-speech.md?tabs=streaming#audio-outputs).
       - question: |
-          How are teams balancing dynamic with static to save cost?
+          How can we balance dynamic and static content to limit cost?
         answer: |
-          You can cache the real-time generated content, still fall back to cloud with best quality, as we are improving the model quality over time. If you do not have strong demand on latency, you can use our real-time API to generate all content, cache it and serve it on an as-needed basis.
+          You can cache the real-time generated content, and fall back to the cloud for best quality, as we are improving the model quality over time. If you don't have a strong latency requirement, you can use our real-time API to generate all content, cache it, and serve it on an as-needed basis.
       - question: |
-          This amazing TTS technology has reached the point where it's almost indistinguishable between real voice actors, except it's all very "clean" sounding. But is there any work being considered around making the dialogue sound more natural by inserting er, um, stutter, pause, or repeated words and so on?
+          Can we make the dialogue sound even more natural by inserting er, um, stutter, pause, or repeated words and so on?
         answer: |
-          If you have such requirements from studios or other customers, we can plan to develop it. A latest update is that we will start a prototype of spontaneous speech synthesis where we'll automatically insert the filled pause (such as um, uh) and synthesize the speech accordingly. This could be a start towards a fully spontaneous speech as you mentioned. We'll make an update when we get some results.
+          We are evaluating spontaneous speech synthesis to automatically insert filled pauses (such as um, uh) and synthesize the speech accordingly. But this isn't on the current roadmap.
       - question: |
           Is there a mapping between Viseme IDs and mouth shape?
         answer: |
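The updated answer above recommends caching real-time generated content and serving it on an as-needed basis, falling back to the cloud only for new lines. A minimal sketch of that pattern in Python; the `synthesize` callable, the voice name, and the format string are illustrative stand-ins for a real TTS client call, not part of this FAQ:

```python
import hashlib
from typing import Callable, Dict

class TtsCache:
    """Serve cached audio when available; fall back to real-time synthesis."""

    def __init__(self, synthesize: Callable[[str, str, str], bytes]):
        self._synthesize = synthesize  # hypothetical real-time TTS call
        self._store: Dict[str, bytes] = {}
        self.misses = 0  # counts real-time (billed) synthesis calls

    def _key(self, text: str, voice: str, fmt: str) -> str:
        # Key on everything that affects the rendered audio, so a voice
        # or format change never serves stale audio.
        return hashlib.sha256(f"{voice}|{fmt}|{text}".encode("utf-8")).hexdigest()

    def get_audio(self, text: str, voice: str = "en-US-JennyNeural",
                  fmt: str = "Riff24Khz16BitMonoPcm") -> bytes:
        key = self._key(text, voice, fmt)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._synthesize(text, voice, fmt)
        return self._store[key]
```

Static lines are synthesized once and replayed from the cache; only new dynamic lines trigger a real-time API call.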
@@ -43,90 +43,68 @@ sections:
           Can the Visemes be mapped to UE5 MetaHuman blend shapes weights?
         answer: |
           UE5 MetaHuman is now using a newly defined driven parameter named Expression, however, we could still use blend shapes to drive MetaHuman. One of our customers has done this and successfully driven the prefix UE5 Avatar.
-      - question: |
-          What?
-        answer: |
-          Because.
-      - question: |
-          What?
-        answer: |
-          Because.
-
 
   - name: Speech synthesis markup language (SSML)
     questions:
       - question: |
           Can the voice be customized to stress specific words?
         answer: |
-          Some of the prebuilt `en-US` voices support the [emphasis tag](speech-synthesis-markup-voice.md#adjust-emphasis) which can be used to emphasize one or group of words. Support for more voices are coming.
-      - question: |
-          Are the neural voice emotions discrete states or do they have associated sliders to define, for example, how panicky or friendly you want the voice to be?
-        answer: |
-          So far, they're discrete states. Meanwhile, we have a tuning knob to control the style degree. We have style degree control on each emotion, and it can be used for SSML in real-time synthesizing.
+          Adjusting the emphasis is supported for some voices depending on the locale. See the [emphasis tag](speech-synthesis-markup-voice.md#adjust-emphasis).
       - question: |
           Can we have multiple strength for each emotions, like very sad, slightly sad and so on in?
         answer: |
-          Microsoft currently only supports 'style degree' in Chinese voices. And you can use SSML to adjust the style degree.
+          Adjusting the style degree is supported for some voices depending on the locale. See the [mstts:express-as tag](speech-synthesis-markup-voice.md#speaking-styles-and-roles).
+
+  - name: Custom Neural Voice
+    questions:
       - question: |
-          What?
+          How much data is required to create a custom neural voice?
         answer: |
-          Because.
+          At least 300 lines of recordings (or approximately 30 minutes of speech) are required as training data for Custom Neural Voice. We recommend 2,000 lines of recordings (or approximately 2-3 hours of speech) to create a voice for production use. The more data you record, the better the quality of the voice. For the script selection criteria, see [Record custom voice samples](record-custom-voice-samples.md).
       - question: |
-          What?
+          Can we create multi-styles for the same custom neural voice and host multi-styles in one model to save our cost?
         answer: |
-          Because.
-
-
-  - name: Custom Neural Voice
-    questions:
+          The [multi-style voice training](how-to-custom-voice-create-voice.md?tabs=multistyle#train-your-custom-neural-voice-model) feature supports multiple styles for the same voice in one model.
       - question: |
-          How many resources are required to create a custom neural voice?
+          What about languages that have different pronunciation structure and assembly? For example, English and Japanese don't form a sentence in the same way. If a voice talent has a fast to slow cadence on a phrase in English, would that map correctly across the same phrase in Japanese?
         answer: |
-          We require at least 300 lines of recordings (or, around 30 minutes of speech) to be prepared as training data for Custom Neural Voice, and we recommend 2,000 lines of recordings (2-3 hours of speech) to create a voice for production use. The more data you record the better the quality of the cloned voice. But if hundreds of lines are provided, we can build a voice with pretty good quality (Mean Opinion Score> 4.0).
+          Each neural voice is trained with audio data recorded by native speaking voice talent. For cross-lingual voice, we transfer the major features like timbre to sound like the original speaker and preserve the right pronunciation. It will use the native way to speak Japanese and still sound similar (but not exactly) like the original English speaker.
       - question: |
-          What about languages that have different pronunciation structure and assembly? For example, English and Japanese don’t form a sentence in the same way. If a voice talent has a fast to slow cadence on a phrase in English, that wouldn't map correctly across the same phrase in Japanese.
+          Can we include duplicate text sentences in the same set of training data?
         answer: |
-          Each neural voice is trained with audio data recorded by native speaking voice talent. For cross lingual voice, we transfer the major features like timbre to sound like the original speaker, to preserve the right pronunciation. It may not sound exactly like the original speaker and will use the native way to speak Japanese but sound similar like the original English speaker.
+          No. The service will flag the duplicate sentences and keep only the first imported one. For the script selection criteria, see [Record custom voice samples](record-custom-voice-samples.md).
       - question: |
-          Can we include duplicate text sentences with multi styles in our training data?
+          Can we include multiple styles in the same set of training data?
         answer: |
-          No, you can't. The service will flag the duplicate sentences and just keep the first imported one. We recommend that you don’t include duplicate text sentences in one training data, and you keep the style consistent in one training data. If the styles are different, put them into different training sets. In this case, you may consider using the multi-style voice training feature of Custom Neural Voice. For the script selection criteria, see [Record custom voice samples](record-custom-voice-samples.md).
+          We recommend that you keep the style consistent in one set of training data. If the styles are different, put them into different training sets. In this case, you may consider using the multi-style voice training feature of Custom Neural Voice. For the script selection criteria, see [Record custom voice samples](record-custom-voice-samples.md).
       - question: |
-          To train a better custom neural voice model, should we prepare a large number of sentences containing the game specific terminology as the training data?
+          Do you have tips for domain-specific training, such as gaming?
         answer: |
-          We usually recommend that if you have domain specific terminology for a specific scenario you put as much of that domain content as possible in the training data. We recommend the recording scripts include both general sentences and domain-specific sentences. For example, if you plan to record 2,000 sentences, 1,000 of them could be general sentences, another 1,000 of them could be sentences from your target domain or the use case of your application.
+          The training data should include as much domain-specific terminology as you have for a specific scenario. We recommend the recording scripts include both general sentences and domain-specific sentences. For example, if you plan to record 2,000 sentences, then 1,000 of them could be general sentences, and another 1,000 of them could be sentences from your target domain or the use case of your application.
       - question: |
           Is the model version the same as the engine version?
         answer: |
-          No, the model version is different from the engine version. The model version means the version of the training recipe for your model and varies by the features supported and model training time. Azure Text-to-Speech engines are updated from time to time to capture the latest language model that defines the pronunciation of the language. After you've trained your voice, you can apply your voice to the new language model by updating to the latest engine version. When a new engine is available, you're prompted to update your neural voice model.
+          No. The model version is different from the engine version. The model version means the version of the training recipe for your model and varies by the features supported and model training time. Azure Cognitive Services text-to-speech engines are updated from time to time to capture the latest language model that defines the pronunciation of the language. After you've trained your voice, you can apply your voice to the new language model by updating to the latest engine version. When a new engine is available, you're prompted to update your neural voice model. See [Update engine version for your voice model](how-to-custom-voice-create-voice.md?tabs=neural#update-engine-version-for-your-voice-model).
       - question: |
           What kind of script should be prepared for a domain specific scenario such as gaming?
         answer: |
           For general script, you can use the sample scripts per locale on [GitHub](https://github.com/Azure-Samples/Cognitive-Speech-TTS/tree/master/CustomVoice/script). For a domain specific script, you can do selection from the sentences that custom neural voice will be used to read. And you can refer to the script selection criteria [Record custom voice samples](record-custom-voice-samples.md#script-selection-criteria) to create a good corpus.
       - question: |
-          Can we create multi-styles for the same custom neural voice and host multi-styles in one model to save our cost?
-        answer: |
-          Microsoft has a new preview feature multi-style voice training feature that supports multi-style for the same voice in one model.
-      - question: |
-          Are there any successful cases of using custom neural voice in gaming for our reference?
+          Are there any successful cases of using custom neural voice in gaming?
         answer: |
-          Microsoft flight simulator used our custom neural voice service to create the voices for their air traffic controllers in several different languages using our [cross lingual transfer feature](how-to-custom-voice-create-voice.md?tabs=crosslingual#train-your-custom-neural-voice-model).
+          Microsoft Flight Simulator used the Custom Neural Voice service to create the voices for the air traffic controllers in several different languages using the [cross lingual](how-to-custom-voice-create-voice.md?tabs=crosslingual#train-your-custom-neural-voice-model) feature.
       - question: |
           Switching styles via SSML only works for prebuilt neural voices, right?
         answer: |
-          Switching styles via SSML is only for prebuilt 'multi-style' voices available in the platform. And Custom Neural Voice now supports a new preview feature ”multi-style training for one model, so you can also adjust the styles via SSML. Only for the speaking styles you have created for CNV.
+          Switching styles via SSML is only for prebuilt multi-style voices. Custom Neural Voice does support [multi-style training](how-to-custom-voice-create-voice.md?tabs=multistyle#train-your-custom-neural-voice-model) for the same model, so you can also adjust the styles via SSML, but only for the speaking styles that you have created for your custom neural voice.
       - question: |
           Is it correct that after one training we can't train again unless we upload a corpus file?
         answer: |
-          You can train again. There's no limit to this. However, each training will count as a new training cost wise.
+          You can train again. There's no limit to this. However, each training will be charged the same as a new training.
       - question: |
-          Can it support more languages than we may record on initial release?
+          Can we have a more dynamic events system and characters that can respond to each other using equally dynamic voice lines, like "Whoa, that [plane] just crashed into that [tank]."?
         answer: |
-          Yes. We worked with the gaming localization team on it. You could use TTS to light up all languages which are not recorded. You could leverage the cross lingual transfer to build a custom voice that sounds like the recording speaker but speaks natively in different languages. For accessibility use, you can consider using our prebuilt neural voices in each language. We have minimal two voices (1 female, 1 male) in all languages.
-      - question: |
-          Can we have a more dynamic events system and AI characters that can respond to each other using equally dynamic voice lines, like "Whoa, that [plane] just crashed into that [tank]."?
-        answer: |
-          Yes, the benefit of the TTS is that you can have dynamic content built in with static pattern. Flight Simulator is using TTS to generate dynamic content, as their flight location information is massive, and it is not feasible to pre-record all of them offline.
+          Yes. You can build dynamic content into a static pattern. In some cases with large datasets, it isn't feasible to pre-record all variations.
       - question: |
           Can we limit the number of trainings using Azure Policy or other features? Or is there any way to avoid false training?
         answer: |
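Several answers above point at adjusting styles and style degree through SSML with the `mstts:express-as` element. As an illustration, a small Python helper that assembles such a fragment; the voice and style names are examples, and which style values are valid depends on the voice:

```python
def express_as_ssml(text: str, voice: str = "zh-CN-XiaoxiaoNeural",
                    style: str = "sad", styledegree: float = 2.0) -> str:
    """Wrap text in an mstts:express-as element with a style intensity."""
    # styledegree scales the style intensity; the SSML docs describe a
    # 0.01-2 range, with 1 as the default.
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="zh-CN">'
        f'<voice name="{voice}">'
        f'<mstts:express-as style="{style}" styledegree="{styledegree}">'
        f"{text}</mstts:express-as></voice></speak>"
    )
```

The resulting string can be passed to the service's SSML synthesis endpoint; for prebuilt voices only the styles listed for that voice apply, and for Custom Neural Voice only the styles you trained.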
@@ -135,10 +113,6 @@ sections:
           Can Microsoft add a mechanism to prevent unauthorized use or misuse of our voice when it is created?
         answer: |
           The voice model can only be used by yourselves using your own token. Microsoft also doesn't use your data. See [Data, privacy, and security](/legal/cognitive-services/speech-service/custom-neural-voice/data-privacy-security-custom-neural-voice?context=/azure/cognitive-services/speech-service/context/context).
-      - question: |
-          If we are providing a real voice as the base data to be trained, how does it work on the legal side of things? Do we need the actor's permission to generate new dialogues, etc?
-        answer: |
-          Yes, it's needed. We can refer you to multiple practices. TTS team has a standard talent template as an example which we can share offline. When you try out self-serve training from the portal, you should read and agree the terms. We also worked with Flight Simulator and Localization teams on their talent agreement to fulfill the TTS use need prior to their release. Meanwhile Undead Lab is leading the effort to set up the agreement reviewed with Xbox legal, and they have a version you can leverage.
       - question: |
           Is it necessary to place the Disclosure in the startup flow (the first screen that the user always passes)?
         answer: |
@@ -154,16 +128,7 @@ sections:
       - question: |
           Do we need to return the written permission from the voice talent back to Microsoft?
         answer: |
-          Microsoft doesn't need the written permission, but we require you obtain it from your voice talent. The voice talent will also be required to record the consent statement and it must be uploaded into Speech Studio before training can begin.
-      - question: |
-          What?
-        answer: |
-          Because.
-      - question: |
-          What?
-        answer: |
-          Because.
-
+          Microsoft doesn't need the written permission, but you must obtain consent from your voice talent. The voice talent will also be required to record the consent statement, and it must be uploaded into Speech Studio before training can begin. See [Set up voice talent for Custom Neural Voice](how-to-custom-voice-talent.md).
 
 
 additionalContent: |
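The SSML section of this FAQ also mentions the emphasis tag for stressing specific words. A sketch of the markup it produces, again with an illustrative voice name (the W3C SSML `emphasis` element; per the FAQ, only some voices support it):

```python
def emphasis_ssml(before: str, stressed: str, after: str,
                  voice: str = "en-US-JennyNeural",
                  level: str = "strong") -> str:
    """Stress one span of a sentence with the SSML emphasis element."""
    # The SSML emphasis element accepts levels such as
    # "reduced", "moderate", and "strong".
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xml:lang="en-US">'
        f'<voice name="{voice}">{before}'
        f'<emphasis level="{level}">{stressed}</emphasis>'
        f"{after}</voice></speak>"
    )
```

For example, `emphasis_ssml("I can ", "really", " hear you.")` stresses only the word "really" while the rest of the sentence is spoken normally.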
