Commit 7bf5733 ("tts faq scrub")
1 parent fbe5bda

File tree: 1 file changed (+32, -67 lines)


articles/cognitive-services/Speech-Service/faq-tts.yml

Lines changed: 32 additions & 67 deletions
@@ -9,7 +9,7 @@ metadata:
   ms.service: cognitive-services
   ms.subservice: speech-service
   ms.topic: faq
-  ms.date: 03/13/2023
+  ms.date: 03/27/2023
   ms.author: eur
 title: Text-to-speech FAQ
 summary: |
@@ -24,17 +24,17 @@ sections:
         answer: |
           There are several ways to disclose the synthetic nature of the voice including implicit and explicit byline. Refer to [Disclosure design guidelines](/legal/cognitive-services/speech-service/custom-neural-voice/concepts-disclosure-guidelines?context=/azure/cognitive-services/speech-service/context/context).
       - question: |
-          Can Text-to-Speech support PCM 48khz 24bit mono .WAV or PCM 48khz 16bit mono .WAV?
+          What audio formats does Text-to-Speech support?
         answer: |
-          We can support 48khz 16bit mono output. Each prebuilt neural voice model is available at 24kHz and high-fidelity 48kHz. Other sample rates can be obtained through up-sampling or down-sampling when synthesizing.
+          The Speech service supports 48-kHz, 24-kHz, 16-kHz, and 8-kHz audio outputs. Each prebuilt neural voice model is available at 24 kHz and high-fidelity 48 kHz. See [Audio outputs](rest-text-to-speech.md?tabs=streaming#audio-outputs).
       - question: |
-          How are teams balancing dynamic with static to save cost?
+          How can we balance dynamic and static content to limit cost?
         answer: |
-          You can cache the real-time generated content, still fall back to cloud with best quality, as we are improving the model quality over time. If you do not have strong demand on latency, you can use our real-time API to generate all content, cache it and serve it on an as-needed basis.
+          You can cache the real-time generated content, and fall back to the cloud for best quality, as we are improving the model quality over time. If you don't have a strong latency requirement, you can use our real-time API to generate all content, cache it, and serve it on an as-needed basis.
       - question: |
-          This amazing TTS technology has reached the point where it's almost indistinguishable between real voice actors, except it's all very "clean" sounding. But is there any work being considered around making the dialogue sound more natural by inserting er, um, stutter, pause, or repeated words and so on?
+          Can we make the dialogue sound even more natural by inserting er, um, stutter, pause, or repeated words and so on?
         answer: |
-          If you have such requirements from studios or other customers, we can plan to develop it. A latest update is that we will start a prototype of spontaneous speech synthesis where we'll automatically insert the filled pause (such as um, uh) and synthesize the speech accordingly. This could be a start towards a fully spontaneous speech as you mentioned. We'll make an update when we get some results.
+          We are evaluating spontaneous speech synthesis to automatically insert filled pauses (such as um, uh) and synthesize the speech accordingly. But this isn't on the current roadmap.
       - question: |
           Is there a mapping between Viseme IDs and mouth shape?
         answer: |
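The updated answer above recommends caching real-time generated content and serving it on an as-needed basis, falling back to the cloud only for new lines. A minimal sketch of that pattern in Python; the `synthesize` callable, the voice name, and the format string are illustrative stand-ins for a real TTS client call, not part of this FAQ:

```python
import hashlib
from typing import Callable, Dict

class TtsCache:
    """Serve cached audio when available; fall back to real-time synthesis."""

    def __init__(self, synthesize: Callable[[str, str, str], bytes]):
        self._synthesize = synthesize  # hypothetical real-time TTS call
        self._store: Dict[str, bytes] = {}
        self.misses = 0  # counts real-time (billed) synthesis calls

    def _key(self, text: str, voice: str, fmt: str) -> str:
        # Key on everything that affects the rendered audio, so a voice
        # or format change never serves stale audio.
        return hashlib.sha256(f"{voice}|{fmt}|{text}".encode("utf-8")).hexdigest()

    def get_audio(self, text: str, voice: str = "en-US-JennyNeural",
                  fmt: str = "Riff24Khz16BitMonoPcm") -> bytes:
        key = self._key(text, voice, fmt)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._synthesize(text, voice, fmt)
        return self._store[key]
```

Static lines are synthesized once and replayed from the cache; only new dynamic lines trigger a real-time API call.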
@@ -43,90 +43,68 @@ sections:
           Can the Visemes be mapped to UE5 MetaHuman blend shapes weights?
         answer: |
           UE5 MetaHuman is now using a newly defined driven parameter named Expression, however, we could still use blend shapes to drive MetaHuman. One of our customers has done this and successfully driven the prefix UE5 Avatar.
-      - question: |
-          What?
-        answer: |
-          Because.
-      - question: |
-          What?
-        answer: |
-          Because.
-
 
   - name: Speech synthesis markup language (SSML)
     questions:
       - question: |
           Can the voice be customized to stress specific words?
         answer: |
-          Some of the prebuilt `en-US` voices support the [emphasis tag](speech-synthesis-markup-voice.md#adjust-emphasis) which can be used to emphasize one or group of words. Support for more voices are coming.
-      - question: |
-          Are the neural voice emotions discrete states or do they have associated sliders to define, for example, how panicky or friendly you want the voice to be?
-        answer: |
-          So far, they're discrete states. Meanwhile, we have a tuning knob to control the style degree. We have style degree control on each emotion, and it can be used for SSML in real-time synthesizing.
+          Adjusting the emphasis is supported for some voices depending on the locale. See the [emphasis tag](speech-synthesis-markup-voice.md#adjust-emphasis).
       - question: |
           Can we have multiple strength for each emotions, like very sad, slightly sad and so on in?
         answer: |
-          Microsoft currently only supports 'style degree' in Chinese voices. And you can use SSML to adjust the style degree.
+          Adjusting the style degree is supported for some voices depending on the locale. See the [mstts:express-as tag](speech-synthesis-markup-voice.md#speaking-styles-and-roles).
+
+  - name: Custom Neural Voice
+    questions:
       - question: |
-          What?
+          How much data is required to create a custom neural voice?
         answer: |
-          Because.
+          At least 300 lines of recordings (or approximately 30 minutes of speech) are required as training data for Custom Neural Voice. We recommend 2,000 lines of recordings (or approximately 2-3 hours of speech) to create a voice for production use. The more data you record, the better the quality of the voice. For the script selection criteria, see [Record custom voice samples](record-custom-voice-samples.md).
       - question: |
-          What?
+          Can we create multi-styles for the same custom neural voice and host multi-styles in one model to save our cost?
         answer: |
-          Because.
-
-
-  - name: Custom Neural Voice
-    questions:
+          The [multi-style voice training](how-to-custom-voice-create-voice.md?tabs=multistyle#train-your-custom-neural-voice-model) feature supports multiple styles for the same voice in one model.
       - question: |
-          How many resources are required to create a custom neural voice?
+          What about languages that have different pronunciation structure and assembly? For example, English and Japanese don't form a sentence in the same way. If a voice talent has a fast to slow cadence on a phrase in English, would that map correctly across the same phrase in Japanese?
         answer: |
-          We require at least 300 lines of recordings (or, around 30 minutes of speech) to be prepared as training data for Custom Neural Voice, and we recommend 2,000 lines of recordings (2-3 hours of speech) to create a voice for production use. The more data you record the better the quality of the cloned voice. But if hundreds of lines are provided, we can build a voice with pretty good quality (Mean Opinion Score> 4.0).
+          Each neural voice is trained with audio data recorded by native speaking voice talent. For cross-lingual voice, we transfer the major features like timbre to sound like the original speaker and preserve the right pronunciation. It will use the native way to speak Japanese and still sound similar (but not exactly) like the original English speaker.
       - question: |
-          What about languages that have different pronunciation structure and assembly? For example, English and Japanese don’t form a sentence in the same way. If a voice talent has a fast to slow cadence on a phrase in English, that wouldn't map correctly across the same phrase in Japanese.
+          Can we include duplicate text sentences in the same set of training data?
         answer: |
-          Each neural voice is trained with audio data recorded by native speaking voice talent. For cross lingual voice, we transfer the major features like timbre to sound like the original speaker, to preserve the right pronunciation. It may not sound exactly like the original speaker and will use the native way to speak Japanese but sound similar like the original English speaker.
+          No. The service will flag the duplicate sentences and keep only the first imported one. For the script selection criteria, see [Record custom voice samples](record-custom-voice-samples.md).
       - question: |
-          Can we include duplicate text sentences with multi styles in our training data?
+          Can we include multiple styles in the same set of training data?
         answer: |
-          No, you can't. The service will flag the duplicate sentences and just keep the first imported one. We recommend that you don’t include duplicate text sentences in one training data, and you keep the style consistent in one training data. If the styles are different, put them into different training sets. In this case, you may consider using the multi-style voice training feature of Custom Neural Voice. For the script selection criteria, see [Record custom voice samples](record-custom-voice-samples.md).
+          We recommend that you keep the style consistent in one set of training data. If the styles are different, put them into different training sets. In this case, you may consider using the multi-style voice training feature of Custom Neural Voice. For the script selection criteria, see [Record custom voice samples](record-custom-voice-samples.md).
       - question: |
-          To train a better custom neural voice model, should we prepare a large number of sentences containing the game specific terminology as the training data?
+          Do you have tips for domain-specific training, such as gaming?
         answer: |
-          We usually recommend that if you have domain specific terminology for a specific scenario you put as much of that domain content as possible in the training data. We recommend the recording scripts include both general sentences and domain-specific sentences. For example, if you plan to record 2,000 sentences, 1,000 of them could be general sentences, another 1,000 of them could be sentences from your target domain or the use case of your application.
+          The training data should include as much domain-specific terminology as you have for a specific scenario. We recommend the recording scripts include both general sentences and domain-specific sentences. For example, if you plan to record 2,000 sentences, then 1,000 of them could be general sentences, and another 1,000 of them could be sentences from your target domain or the use case of your application.
       - question: |
           Is the model version the same as the engine version?
         answer: |
-          No, the model version is different from the engine version. The model version means the version of the training recipe for your model and varies by the features supported and model training time. Azure Text-to-Speech engines are updated from time to time to capture the latest language model that defines the pronunciation of the language. After you've trained your voice, you can apply your voice to the new language model by updating to the latest engine version. When a new engine is available, you're prompted to update your neural voice model.
+          No. The model version is different from the engine version. The model version means the version of the training recipe for your model and varies by the features supported and model training time. Azure Cognitive Services text-to-speech engines are updated from time to time to capture the latest language model that defines the pronunciation of the language. After you've trained your voice, you can apply your voice to the new language model by updating to the latest engine version. When a new engine is available, you're prompted to update your neural voice model. See [Update engine version for your voice model](how-to-custom-voice-create-voice.md?tabs=neural#update-engine-version-for-your-voice-model).
       - question: |
           What kind of script should be prepared for a domain specific scenario such as gaming?
         answer: |
           For general script, you can use the sample scripts per locale on [GitHub](https://github.com/Azure-Samples/Cognitive-Speech-TTS/tree/master/CustomVoice/script). For a domain specific script, you can do selection from the sentences that custom neural voice will be used to read. And you can refer to the script selection criteria [Record custom voice samples](record-custom-voice-samples.md#script-selection-criteria) to create a good corpus.
       - question: |
-          Can we create multi-styles for the same custom neural voice and host multi-styles in one model to save our cost?
-        answer: |
-          Microsoft has a new preview feature multi-style voice training feature that supports multi-style for the same voice in one model.
-      - question: |
-          Are there any successful cases of using custom neural voice in gaming for our reference?
+          Are there any successful cases of using custom neural voice in gaming?
         answer: |
-          Microsoft flight simulator used our custom neural voice service to create the voices for their air traffic controllers in several different languages using our [cross lingual transfer feature](how-to-custom-voice-create-voice.md?tabs=crosslingual#train-your-custom-neural-voice-model).
+          Microsoft Flight Simulator used the Custom Neural Voice service to create the voices for the air traffic controllers in several different languages using the [cross lingual](how-to-custom-voice-create-voice.md?tabs=crosslingual#train-your-custom-neural-voice-model) feature.
       - question: |
           Switching styles via SSML only works for prebuilt neural voices, right?
         answer: |
-          Switching styles via SSML is only for prebuilt 'multi-style' voices available in the platform. And Custom Neural Voice now supports a new preview feature ”multi-style training for one model, so you can also adjust the styles via SSML. Only for the speaking styles you have created for CNV.
+          Switching styles via SSML is only for prebuilt multi-style voices. Custom Neural Voice does support [multi-style training](how-to-custom-voice-create-voice.md?tabs=multistyle#train-your-custom-neural-voice-model) for the same model, so you can also adjust the styles via SSML, but only for the speaking styles that you have created for your custom neural voice.
       - question: |
           Is it correct that after one training we can't train again unless we upload a corpus file?
         answer: |
-          You can train again. There's no limit to this. However, each training will count as a new training cost wise.
+          You can train again. There's no limit to this. However, each training will be charged the same as a new training.
       - question: |
-          Can it support more languages than we may record on initial release?
+          Can we have a more dynamic events system and characters that can respond to each other using equally dynamic voice lines, like "Whoa, that [plane] just crashed into that [tank]."?
         answer: |
-          Yes. We worked with the gaming localization team on it. You could use TTS to light up all languages which are not recorded. You could leverage the cross lingual transfer to build a custom voice that sounds like the recording speaker but speaks natively in different languages. For accessibility use, you can consider using our prebuilt neural voices in each language. We have minimal two voices (1 female, 1 male) in all languages.
-      - question: |
-          Can we have a more dynamic events system and AI characters that can respond to each other using equally dynamic voice lines, like "Whoa, that [plane] just crashed into that [tank]."?
-        answer: |
-          Yes, the benefit of the TTS is that you can have dynamic content built in with static pattern. Flight Simulator is using TTS to generate dynamic content, as their flight location information is massive, and it is not feasible to pre-record all of them offline.
+          Yes. You can build dynamic content into a static pattern. In some cases with large datasets, it isn't feasible to pre-record all variations.
       - question: |
           Can we limit the number of trainings using Azure Policy or other features? Or is there any way to avoid false training?
         answer: |
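Several answers above point at adjusting styles and style degree through SSML with the `mstts:express-as` element. As an illustration, a small Python helper that assembles such a fragment; the voice and style names are examples, and which style values are valid depends on the voice:

```python
def express_as_ssml(text: str, voice: str = "zh-CN-XiaoxiaoNeural",
                    style: str = "sad", styledegree: float = 2.0) -> str:
    """Wrap text in an mstts:express-as element with a style intensity."""
    # styledegree scales the style intensity; the SSML docs describe a
    # 0.01-2 range, with 1 as the default.
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="zh-CN">'
        f'<voice name="{voice}">'
        f'<mstts:express-as style="{style}" styledegree="{styledegree}">'
        f"{text}</mstts:express-as></voice></speak>"
    )
```

The resulting string can be passed to the service's SSML synthesis endpoint; for prebuilt voices only the styles listed for that voice apply, and for Custom Neural Voice only the styles you trained.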
@@ -135,10 +113,6 @@ sections:
           Can Microsoft add a mechanism to prevent unauthorized use or misuse of our voice when it is created?
         answer: |
           The voice model can only be used by yourselves using your own token. Microsoft also doesn't use your data. See [Data, privacy, and security](/legal/cognitive-services/speech-service/custom-neural-voice/data-privacy-security-custom-neural-voice?context=/azure/cognitive-services/speech-service/context/context).
-      - question: |
-          If we are providing a real voice as the base data to be trained, how does it work on the legal side of things? Do we need the actor's permission to generate new dialogues, etc?
-        answer: |
-          Yes, it's needed. We can refer you to multiple practices. TTS team has a standard talent template as an example which we can share offline. When you try out self-serve training from the portal, you should read and agree the terms. We also worked with Flight Simulator and Localization teams on their talent agreement to fulfill the TTS use need prior to their release. Meanwhile Undead Lab is leading the effort to set up the agreement reviewed with Xbox legal, and they have a version you can leverage.
       - question: |
           Is it necessary to place the Disclosure in the startup flow (the first screen that the user always passes)?
         answer: |
@@ -154,16 +128,7 @@ sections:
       - question: |
           Do we need to return the written permission from the voice talent back to Microsoft?
         answer: |
-          Microsoft doesn't need the written permission, but we require you obtain it from your voice talent. The voice talent will also be required to record the consent statement and it must be uploaded into Speech Studio before training can begin.
-      - question: |
-          What?
-        answer: |
-          Because.
-      - question: |
-          What?
-        answer: |
-          Because.
-
+          Microsoft doesn't need the written permission, but you must obtain consent from your voice talent. The voice talent will also be required to record the consent statement, and it must be uploaded into Speech Studio before training can begin. See [Set up voice talent for Custom Neural Voice](how-to-custom-voice-talent.md).
 
 
 additionalContent: |
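The SSML section of this FAQ also mentions the emphasis tag for stressing specific words. A sketch of the markup it produces, again with an illustrative voice name (the W3C SSML `emphasis` element; per the FAQ, only some voices support it):

```python
def emphasis_ssml(before: str, stressed: str, after: str,
                  voice: str = "en-US-JennyNeural",
                  level: str = "strong") -> str:
    """Stress one span of a sentence with the SSML emphasis element."""
    # The SSML emphasis element accepts levels such as
    # "reduced", "moderate", and "strong".
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xml:lang="en-US">'
        f'<voice name="{voice}">{before}'
        f'<emphasis level="{level}">{stressed}</emphasis>'
        f"{after}</voice></speak>"
    )
```

For example, `emphasis_ssml("I can ", "really", " hear you.")` stresses only the word "really" while the rest of the sentence is spoken normally.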
