Commit 3e32aa9

tts faq structure
1 parent 7a58f77 commit 3e32aa9

File tree

1 file changed: +13 -21 lines changed


articles/cognitive-services/Speech-Service/faq-tts.yml

Lines changed: 13 additions & 21 deletions
@@ -20,7 +20,7 @@ sections:
   - name: General
     questions:
       - question: |
-          How would we disclose to the end user that the voice used in the game is a synthetic voice?
+          How would we disclose to the end user that the voice is a synthetic voice?
         answer: |
           There are several ways to disclose the synthetic nature of the voice including implicit and explicit byline. Refer to [Disclosure design guidelines](/legal/cognitive-services/speech-service/custom-neural-voice/concepts-disclosure-guidelines?context=/azure/cognitive-services/speech-service/context/context).
       - question: |
@@ -61,14 +61,6 @@ sections:
           How much data is required to create a custom neural voice?
         answer: |
           At least 300 lines of recordings (or approximately 30 minutes of speech) are required as training data for Custom Neural Voice. We recommend 2,000 lines of recordings (or approximately 2-3 hours of speech) to create a voice for production use. The more data you record the better the quality of the voice. For the script selection criteria, see [Record custom voice samples](record-custom-voice-samples.md).
-      - question: |
-          Can we create multi-styles for the same custom neural voice and host multi-styles in one model to save our cost?
-        answer: |
-          The [cross lingual](how-to-custom-voice-create-voice.md?tabs=crosslingual#train-your-custom-neural-voice-model) feature supports multi-style for the same voice in one model.
-      - question: |
-          What about languages that have different pronunciation structure and assembly? For example, English and Japanese don't form a sentence in the same way. If a voice talent has a fast to slow cadence on a phrase in English, would that map correctly across the same phrase in Japanese?
-        answer: |
-          Each neural voice is trained with audio data recorded by native speaking voice talent. For cross-lingual voice, we transfer the major features like timbre to sound like the original speaker and preserve the right pronunciation. It will use the native way to speak Japanese and still sound similar (but not exactly) like the original English speaker.
       - question: |
           Can we include duplicate text sentences in the same set of training data?
         answer: |
@@ -78,25 +70,21 @@ sections:
         answer: |
           We recommend that you keep the style consistent in one set of training data. If the styles are different, put them into different training sets. In this case, you may consider using the multi-style voice training feature of Custom Neural Voice. For the script selection criteria, see [Record custom voice samples](record-custom-voice-samples.md).
       - question: |
-          Do you tips for domain specific training such as gaming?
-        answer: |
-          The training data should include as much domain specific terminology that you have for a specific scenario. We recommend the recording scripts include both general sentences and domain specific sentences. For example, if you plan to record 2,000 sentences, then 1,000 of them could be general sentences, and another 1,000 of them could be sentences from your target domain or the use case of your application.
-      - question: |
-          Is the model version the same as the engine version?
+          Can we create multi-styles for the same custom neural voice and host multi-styles in one model to save our cost?
         answer: |
-          No. The model version is different from the engine version. The model version means the version of the training recipe for your model and varies by the features supported and model training time. Azure Cognitive Services text-to-speech engines are updated from time to time to capture the latest language model that defines the pronunciation of the language. After you've trained your voice, you can apply your voice to the new language model by updating to the latest engine version. When a new engine is available, you're prompted to update your neural voice model. See [Update engine version for your voice model](how-to-custom-voice-create-voice.md?tabs=neural#update-engine-version-for-your-voice-model).
+          The [cross lingual](how-to-custom-voice-create-voice.md?tabs=crosslingual#train-your-custom-neural-voice-model) feature supports multi-style for the same voice in one model.
       - question: |
-          What kind of script should be prepared for a domain specific scenario such as gaming?
+          How does cross-lingual voice work with languages that have different pronunciation structure and assembly?
         answer: |
-          For general script, you can use the sample scripts per locale on [GitHub](https://github.com/Azure-Samples/Cognitive-Speech-TTS/tree/master/CustomVoice/script). For a domain specific script, you can do selection from the sentences that custom neural voice will be used to read. And you can refer to the script selection criteria [Record custom voice samples](record-custom-voice-samples.md#script-selection-criteria) to create a good corpus.
+          Sentence structure and pronunciation naturally vary across languages such as English and Japanese. Each neural voice is trained with audio data recorded by native speaking voice talent. For [cross lingual](how-to-custom-voice-create-voice.md?tabs=crosslingual#train-your-custom-neural-voice-model) voice, we transfer the major features like timbre to sound like the original speaker and preserve the right pronunciation. For example, a cross-lingual voice will use the native way to speak Japanese and still sound similar (but not exactly) like the original English speaker.
       - question: |
-          Are there any successful cases of using custom neural voice in gaming?
+          Does switching styles via SSML only works for Custom Neural Voices?
         answer: |
-          Microsoft Flight Simulator used custom neural voice service to create the voices for the air traffic controllers in several different languages using the [cross lingual](how-to-custom-voice-create-voice.md?tabs=crosslingual#train-your-custom-neural-voice-model) feature.
+          Switching styles via SSML is only for prebuilt multi-style voices. Custom Neural Voice does support [multi-style training](how-to-custom-voice-create-voice.md?tabs=multistyle#train-your-custom-neural-voice-model) for the same model, so you can also adjust the styles via SSML. Only for the speaking styles you have created for CNV.
       - question: |
-          Switching styles via SSML only works for prebuilt neural voices, right?
+          Do you tips for domain specific training?
         answer: |
-          Switching styles via SSML is only for prebuilt multi-style voices. Custom Neural Voice does support [multi-style training](how-to-custom-voice-create-voice.md?tabs=multistyle#train-your-custom-neural-voice-model) for the same model, so you can also adjust the styles via SSML. Only for the speaking styles you have created for CNV.
+          The training data should include as much domain specific terminology that you have for a specific scenario such as gaming. We recommend the recording scripts include both general sentences and domain specific sentences. For example, if you plan to record 2,000 sentences, then 1,000 of them could be general sentences, and another 1,000 of them could be sentences from your target domain or the use case of your application.
       - question: |
           Is it correct that after one training we can't train again unless we upload a corpus file?
         answer: |
@@ -105,6 +93,10 @@ sections:
           Can we have a more dynamic events system and characters that can respond to each other using equally dynamic voice lines, like "Whoa, that [plane] just crashed into that [tank]."?
         answer: |
           Yes. You can have dynamic content built in with static pattern. In some cases with large datasets it isn't feasible to pre-record all variations.
+      - question: |
+          Is the model version the same as the engine version?
+        answer: |
+          No. The model version is different from the engine version. The model version means the version of the training recipe for your model and varies by the features supported and model training time. Azure Cognitive Services text-to-speech engines are updated from time to time to capture the latest language model that defines the pronunciation of the language. After you've trained your voice, you can apply your voice to the new language model by updating to the latest engine version. When a new engine is available, you're prompted to update your neural voice model. See [Update engine version for your voice model](how-to-custom-voice-create-voice.md?tabs=neural#update-engine-version-for-your-voice-model).
       - question: |
           Can we limit the number of trainings using Azure Policy or other features? Or is there any way to avoid false training?
         answer: |

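The "dynamic voice lines" answer in the diff (static pattern, dynamic content) can be sketched as follows. This is a hedged illustration of the pattern only, assuming the line templates live in application code; `fill_voice_line` is a hypothetical helper, and the filled line would then be sent to text-to-speech instead of pre-recording every variation.

```python
import string


def fill_voice_line(pattern, **slots):
    """Substitute dynamic entities into a static voice-line pattern.

    Hypothetical helper for illustration: the static pattern uses
    string.Template's $identifier placeholders for the dynamic slots.
    """
    return string.Template(pattern).substitute(**slots)


line = fill_voice_line(
    "Whoa, that $attacker just crashed into that $target.",
    attacker="plane",
    target="tank",
)
print(line)  # → Whoa, that plane just crashed into that tank.
```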