
Commit 6641d4b

committed per Qinying
1 parent 3e32aa9 commit 6641d4b

1 file changed: +30 -43 lines changed

articles/cognitive-services/Speech-Service/faq-tts.yml

Lines changed: 30 additions & 43 deletions
@@ -1,3 +1,6 @@
+
+
+
 ### YamlMime:FAQ
 metadata:
   title: Text-to-speech FAQ
@@ -9,7 +12,7 @@ metadata:
   ms.service: cognitive-services
   ms.subservice: speech-service
   ms.topic: faq
-  ms.date: 03/27/2023
+  ms.date: 03/29/2023
   ms.author: eur
 title: Text-to-speech FAQ
 summary: |
@@ -20,32 +23,25 @@ sections:
   - name: General
     questions:
       - question: |
-          How would we disclose to the end user that the voice is a synthetic voice?
+          How does the billing work for text to speech?
         answer: |
-          There are several ways to disclose the synthetic nature of the voice including implicit and explicit byline. Refer to [Disclosure design guidelines](/legal/cognitive-services/speech-service/custom-neural-voice/concepts-disclosure-guidelines?context=/azure/cognitive-services/speech-service/context/context).
+          Text to speech usage is billed per character. Check the definition of billable characters in the [pricing note](text-to-speech.md#pricing-note).
       - question: |
-          What audio formats does Text-to-Speech support?
+          What is the rate limit for text to speech synthesis requests?
         answer: |
-          The Speech service supports 48-kHz, 24-kHz, 16-kHz, and 8-kHz audio outputs. Each prebuilt neural voice model is available at 24kHz and high-fidelity 48kHz. See [Audio outputs](rest-text-to-speech.md?tabs=streaming#audio-outputs).
+          The text to speech synthesis rate scales automatically as more requests are received. A default rate limit is set per speech resource. The rate is adjustable with business justification, and no additional charge is incurred for a rate limit increase. For details, see [Speech service quotas and limits](speech-services-quotas-and-limits.md#text-to-speech-quotas-and-limits-per-resource).
       - question: |
-          How can we balance dynamic and static content to limit cost?
+          How would we disclose to the end user that the voice is a synthetic voice?
         answer: |
-          You can cache the real-time generated content, and fall back to cloud with best quality, as we are improving the model quality over time. If you do not have strong demand on latency, you can use our real-time API to generate all content, cache it and serve it on an as-needed basis.
+          We recommend that every user follow our [code of conduct](/legal/cognitive-services/speech-service/tts-code-of-conduct?context=/azure/cognitive-services/speech-service/context/context) when using the TTS service. There are several ways to disclose the synthetic nature of the voice, including an implicit or explicit byline. Refer to [Disclosure design guidelines](/legal/cognitive-services/speech-service/custom-neural-voice/concepts-disclosure-guidelines?context=/azure/cognitive-services/speech-service/context/context).
       - question: |
-          Can we make the dialogue sound even more natural by inserting er, um, stutter, pause, or repeated words and so on?
+          How can I reduce the latency for my voice app?
         answer: |
-          We are evaluating spontaneous speech synthesis to automatically insert the filled pause (such as um, uh) and synthesize the speech accordingly. But this isn't on the current roadmap.
+          We provide several tips to help you lower the latency and bring the best performance to your users. See [Lower speech synthesis latency using Speech SDK](how-to-lower-speech-synthesis-latency.md).
       - question: |
-          Is there a mapping between Viseme IDs and mouth shape?
+          What output audio formats does TTS support?
         answer: |
-          Yes. See [Get facial position with viseme](how-to-speech-synthesis-viseme.md?tabs=visemeid#map-phonemes-to-visemes).
-      - question: |
-          Can the Visemes be mapped to UE5 MetaHuman blend shapes weights?
-        answer: |
-          UE5 MetaHuman is now using a newly defined driven parameter named Expression, however, we could still use blend shapes to drive MetaHuman. One of our customers has done this and successfully driven the prefix UE5 Avatar.
-
-  - name: Speech synthesis markup language (SSML)
-    questions:
+          The TTS service supports various streaming and non-streaming audio formats with the commonly used sampling rates. All TTS prebuilt neural voices are created to support high-fidelity audio output, with 48 kHz and 24 kHz as the default sampling rates, and can be resampled to other rates. See [Audio outputs](rest-text-to-speech.md#audio-outputs).
       - question: |
           Can the voice be customized to stress specific words?
         answer: |
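The billing answer in this hunk says usage is billed per character. As a rough, hedged illustration (the helper name and the per-million-character rate are invented for this sketch; the real rates and the exact definition of billable characters are in the pricing note):

```python
def estimate_tts_cost(text: str, price_per_million_chars: float) -> float:
    """Rough cost estimate for one plain-text synthesis request.

    Assumes simple per-character billing; consult the official pricing
    note for the real rate and the definition of billable characters
    (SSML requests are counted differently from plain text).
    """
    billable_chars = len(text)  # every character in the request counts
    return billable_chars / 1_000_000 * price_per_million_chars


# A 1,500-character script at a hypothetical $16 per million characters:
print(estimate_tts_cost("a" * 1500, 16.0))  # -> 0.024
```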
@@ -54,13 +50,18 @@ sections:
           Can we have multiple strengths for each emotion, like very sad, slightly sad, and so on?
         answer: |
           Adjusting the style degree is supported for some voices depending on the locale. See the [mstts:express-as tag](speech-synthesis-markup-voice.md#speaking-styles-and-roles).
+      - question: |
+          Is there a mapping between Viseme IDs and mouth shape?
+        answer: |
+          Yes. See [Get facial position with viseme](how-to-speech-synthesis-viseme.md?tabs=visemeid#map-phonemes-to-visemes).
+
 
 
   - name: Custom Neural Voice
     questions:
       - question: |
           How much data is required to create a custom neural voice?
         answer: |
-          At least 300 lines of recordings (or approximately 30 minutes of speech) are required as training data for Custom Neural Voice. We recommend 2,000 lines of recordings (or approximately 2-3 hours of speech) to create a voice for production use. The more data you record the better the quality of the voice. For the script selection criteria, see [Record custom voice samples](record-custom-voice-samples.md).
+          Custom Neural Voice (CNV) supports two project types: CNV Pro and CNV Lite. With CNV Pro, at least 300 lines of recordings (or approximately 30 minutes of speech) are required as training data. We recommend 2,000 lines of recordings (or approximately 2-3 hours of speech) to create a voice for production use. With CNV Lite, you can create a voice with just 20 recorded samples. CNV Lite is best for quick trials, or when you don't have access to professional voice actors. For the script selection criteria, see [Record custom voice samples](record-custom-voice-samples.md).
       - question: |
           Can we include duplicate text sentences in the same set of training data?
         answer: |
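The style-degree answer in this hunk can be made concrete with a small SSML sketch. This is a hedged example: the voice name and style are placeholders, and you should check the `mstts:express-as` documentation for the styles and degree range (0.01-2) each voice actually supports:

```python
def express_as_ssml(text: str, voice: str = "en-US-JennyNeural",
                    style: str = "sad", degree: float = 2) -> str:
    """Build an SSML document that applies a speaking style with a degree.

    styledegree scales the style's intensity: 2 doubles it, 0.01 nearly
    removes it. Availability depends on the voice and locale.
    """
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">'
        f'<voice name="{voice}">'
        f'<mstts:express-as style="{style}" styledegree="{degree}">'
        f'{text}</mstts:express-as></voice></speak>'
    )
```

Pass the resulting string to the speak-SSML call of your client library instead of plain text.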
@@ -70,29 +71,21 @@ sections:
         answer: |
           We recommend that you keep the style consistent in one set of training data. If the styles are different, put them into different training sets. In this case, you may consider using the multi-style voice training feature of Custom Neural Voice. For the script selection criteria, see [Record custom voice samples](record-custom-voice-samples.md).
       - question: |
-          Can we create multi-styles for the same custom neural voice and host multi-styles in one model to save our cost?
+          Does switching styles via SSML work for Custom Neural Voices?
         answer: |
-          The [cross lingual](how-to-custom-voice-create-voice.md?tabs=crosslingual#train-your-custom-neural-voice-model) feature supports multi-style for the same voice in one model.
+          Switching styles via SSML works for both prebuilt multi-style voices and CNV multi-style voices. With multi-style training, you can create a voice that speaks in different styles, and you can also adjust those styles via SSML.
       - question: |
           How does cross-lingual voice work with languages that have different pronunciation structure and assembly?
         answer: |
           Sentence structure and pronunciation naturally vary across languages such as English and Japanese. Each neural voice is trained with audio data recorded by native speaking voice talent. For [cross lingual](how-to-custom-voice-create-voice.md?tabs=crosslingual#train-your-custom-neural-voice-model) voice, we transfer the major features like timbre to sound like the original speaker and preserve the right pronunciation. For example, a cross-lingual voice will use the native way to speak Japanese and still sound similar (but not exactly) like the original English speaker.
       - question: |
-          Does switching styles via SSML only works for Custom Neural Voices?
-        answer: |
-          Switching styles via SSML is only for prebuilt multi-style voices. Custom Neural Voice does support [multi-style training](how-to-custom-voice-create-voice.md?tabs=multistyle#train-your-custom-neural-voice-model) for the same model, so you can also adjust the styles via SSML. Only for the speaking styles you have created for CNV.
-      - question: |
-          Do you tips for domain specific training?
-        answer: |
-          The training data should include as much domain specific terminology that you have for a specific scenario such as gaming. We recommend the recording scripts include both general sentences and domain specific sentences. For example, if you plan to record 2,000 sentences, then 1,000 of them could be general sentences, and another 1,000 of them could be sentences from your target domain or the use case of your application.
-      - question: |
-          Is it correct that after one training we can't train again unless we upload a corpus file?
+          Can I use Custom Neural Voice to customize pronunciation for my domain?
         answer: |
-          You can train again. There's no limit to this. However, each training will be charged the same as a new training.
+          Custom Neural Voice enables you to create a brand voice for your business, and you can optimize it for your domain. We recommend that you include domain-specific samples in your training data for higher naturalness. However, the pronunciation is defined by the TTS engine by default, and we don't support pronunciation customization during CNV training. If you want to customize pronunciation for your voice, use SSML. See [Pronunciation with Speech Synthesis Markup Language (SSML)](speech-synthesis-markup-pronunciation.md).
       - question: |
-          Can we have a more dynamic events system and characters that can respond to each other using equally dynamic voice lines, like "Whoa, that [plane] just crashed into that [tank]."?
+          After one training, can I train my voice again?
         answer: |
-          Yes. You can have dynamic content built in with static pattern. In some cases with large datasets it isn't feasible to pre-record all variations.
+          You can train again. Each training creates a new voice model, and you're charged for each training.
       - question: |
           Is the model version the same as the engine version?
         answer: |
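The pronunciation answer in this hunk recommends SSML rather than training-time customization. As a minimal sketch (the voice name and IPA string are illustrative), the `phoneme` element overrides the engine's default pronunciation for a single word:

```python
def phoneme_ssml(word: str, ipa: str, voice: str = "en-US-JennyNeural") -> str:
    """Wrap one word in a <phoneme> element with an IPA pronunciation."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xml:lang="en-US">'
        f'<voice name="{voice}">'
        f'I say <phoneme alphabet="ipa" ph="{ipa}">{word}</phoneme>.'
        '</voice></speak>'
    )
```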
@@ -104,24 +97,18 @@
       - question: |
           Can Microsoft add a mechanism to prevent unauthorized use or misuse of our voice when it is created?
         answer: |
-          The voice model can only be used by yourselves using your own token. Microsoft also doesn't use your data. See [Data, privacy, and security](/legal/cognitive-services/speech-service/custom-neural-voice/data-privacy-security-custom-neural-voice?context=/azure/cognitive-services/speech-service/context/context).
-      - question: |
-          Is it necessary to place the Disclosure in the startup flow (the first screen that the user always passes)?
-        answer: |
-          It's not necessary to place it in the first screen. It's also fine to put it in the end role As long as they indicate it somewhere, that's fine.
-      - question: |
-          Are Microsoft rights and credit unnecessary?
-        answer: |
-          It's not necessary to give disclosure or credit for our neural text to speech voices, however it would be great for them to do so. They must give disclosure for custom neural voices. It's pretty apparent that the character they are using the voice for is not real. But when they do use custom neural voice for characters these are the design patterns.
+          Your voice model can only be used with your own token, and Microsoft doesn't use your data. See [Data, privacy, and security](/legal/cognitive-services/speech-service/custom-neural-voice/data-privacy-security-custom-neural-voice?context=/azure/cognitive-services/speech-service/context/context). You can also request that watermarks be added to your voice to protect your model. See [Microsoft Azure Neural TTS introduces the watermark algorithm for synthetic voice identification](https://techcommunity.microsoft.com/t5/ai-cognitive-services-blog/introducing-the-watermark-algorithm-for-synthetic-voice/ba-p/3298548).
       - question: |
           Do you have any cases about contracts or negotiation with voice actors?
         answer: |
-          We have no recommendation on contract and it's up to the customer and the voice talent to negotiate the terms.
+          We have no recommendations on contracts; it's up to the customer and the voice talent to negotiate the terms. However, you should make sure the voice talent understands the TTS technology, what it can do, and its potential risks, and provides explicit consent to the creation of a synthetic version of their voice in both the contract and a verbal statement. See [Disclosure for voice talent](/legal/cognitive-services/speech-service/custom-neural-voice/disclosure-voice-talent?context=/azure/cognitive-services/speech-service/context/context).
       - question: |
           Do we need to return the written permission from the voice talent back to Microsoft?
         answer: |
           Microsoft doesn't need the written permission, but you must obtain consent from your voice talent. The voice talent will also be required to record the consent statement and it must be uploaded into Speech Studio before training can begin. See [Set up voice talent for Custom Neural Voice](how-to-custom-voice-talent.md).
+
+
 
 
 additionalContent: |
