Commit b861070

Merge pull request #2520 from MicrosoftDocs/main
Publish to live, Monday 4 AM PST, 1/27
2 parents 9cc44af + d3ddef2 commit b861070

2 files changed, +16 -17 lines changed

articles/ai-services/speech-service/how-to-custom-voice-training-data.md

Lines changed: 6 additions & 8 deletions
@@ -87,15 +87,12 @@ It's important that the transcripts are 100% accurate transcriptions of the corr
 ## Long audio + transcript (Preview)

 > [!NOTE]
-> For **Long audio + transcript (Preview)**, only these languages are supported: Chinese (Mandarin, Simplified), English (India), English (United Kingdom), English (United States), French (France), German (Germany), Italian (Italy), Japanese (Japan), Portuguese (Brazil), and Spanish (Mexico).
+> For **Long audio + transcript (Preview)**, only these languages are supported: Chinese (Mandarin, Simplified), Chinese (Cantonese, Traditional), Chinese (Taiwanese Mandarin), English (India), English (United Kingdom), English (United States), French (France), German (Germany), Italian (Italy), Japanese (Japan), Portuguese (Brazil), Spanish (Spain) and Spanish (Mexico).

 In some cases, you might not have segmented audio available. The Speech Studio can help you segment long audio files and create transcriptions. The long-audio segmentation service uses the [Batch Transcription API](batch-transcription.md) feature of speech to text.

 During the processing of the segmentation, your audio files and the transcripts are also sent to the custom speech service to refine the recognition model so the accuracy can be improved for your data. No data is retained during this process. After the segmentation is done, only the utterances segmented and their mapping transcripts will be stored for your downloading and training.

-> [!NOTE]
-> This service will be charged toward your speech to text subscription usage. The long-audio segmentation service is only supported with standard (S0) Speech resources.
-
 ### Audio data for Long audio + transcript

 Follow these guidelines when preparing audio for segmentation.
@@ -112,6 +109,8 @@ Follow these guidelines when preparing audio for segmentation.

 > [!NOTE]
 > The default sampling rate for a custom neural voice is 24 KHz. Audio files with a sampling rate lower than 16,000 Hz will be rejected. Your audio files with a sampling rate higher than 16,000 Hz and lower than 24 KHz will be up-sampled to 24 KHz to train a neural voice. It's recommended that you should use a sample rate of 24 KHz and higher for your training data.
+>
+> Segmented utterances should ideally be between 5 and 15 seconds long. For optimal segmentation results, it is recommended to include natural pauses of 0.5 to 1 second every 5 to 15 seconds of speech, preferably at the end of phrases or sentences.

 All audio files should be grouped into a zip file. It's OK to put .wav files and .mp3 files into the same zip file. For example, you can upload a 45-second audio file named 'kingstory.wav' and a 200-second long audio file named 'queenstory.mp3' in the same zip file. All .mp3 files will be transformed into the .wav format after processing.
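
The zip-packaging guidance in this hunk could be scripted roughly as follows. This is a minimal sketch using only the Python standard library; the function name `build_training_zip`, the flat archive layout, and the choice to skip every other file type are assumptions for illustration, not something the changed article specifies.

```python
import zipfile
from pathlib import Path

# Extensions named in the guidance: .wav and .mp3 may share one zip file.
ALLOWED_EXTENSIONS = {".wav", ".mp3"}


def build_training_zip(audio_dir, zip_path):
    """Zip every .wav/.mp3 file directly under audio_dir and return the archived names."""
    audio_dir = Path(audio_dir)
    archived = []
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(audio_dir.iterdir()):
            if path.is_file() and path.suffix.lower() in ALLOWED_EXTENSIONS:
                # Files are stored flat under their own names (assumed layout).
                archive.write(path, arcname=path.name)
                archived.append(path.name)
    return archived
```

For a folder holding `kingstory.wav`, `queenstory.mp3`, and an unrelated text file, the call returns only the two audio names and writes them into the archive.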

@@ -134,15 +133,12 @@ After your dataset is successfully uploaded, we'll help you segment the audio fi
 ## Audio only (Preview)

 > [!NOTE]
-> For **Audio only (Preview)**, only these languages are supported: Chinese (Mandarin, Simplified), English (India), English (United Kingdom), English (United States), French (France), German (Germany), Italian (Italy), Japanese (Japan), Portuguese (Brazil), and Spanish (Mexico).
+> For **Audio only (Preview)**, only these languages are supported: Chinese (Mandarin, Simplified), Chinese (Cantonese, Traditional), Chinese (Taiwanese Mandarin), English (India), English (United Kingdom), English (United States), French (France), German (Germany), Italian (Italy), Japanese (Japan), Portuguese (Brazil), Spanish (Spain) and Spanish (Mexico).

 If you don't have transcriptions for your audio recordings, use the **Audio only** option to upload your data. Our system can help you segment and transcribe your audio files. Keep in mind, this service is charged toward your speech to text subscription usage.

 Follow these guidelines when preparing audio.

-> [!NOTE]
-> The long-audio segmentation service will leverage the batch transcription feature of speech to text, which only supports standard subscription (S0) users.
-
 | Property | Value |
 | -------- | ----- |
 | File format | RIFF (.wav) or .mp3, grouped into a .zip file |
@@ -155,6 +151,8 @@ Follow these guidelines when preparing audio.

 > [!NOTE]
 > The default sampling rate for a custom neural voice is 24 KHz. Your audio files with a sampling rate higher than 16,000 Hz and lower than 24 KHz will be up-sampled to 24 KHz to train a neural voice. It's recommended that you should use a sample rate of 24 KHz and higher for your training data.
+>
+> Segmented utterances should ideally be between 5 and 15 seconds long. For optimal segmentation results, it is recommended to include natural pauses of 0.5 to 1 second every 5 to 15 seconds of speech, preferably at the end of phrases or sentences.

 All audio files should be grouped into a zip file. Once your dataset is successfully uploaded, the Speech service helps you segment the audio file into utterances based on our speech batch transcription service. Unique IDs are assigned to the segmented utterances automatically. Matching transcripts are generated through speech recognition. All .mp3 files will be transformed into the .wav format after processing. You can check the segmented utterances and the matching transcripts by downloading the dataset.
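
The sampling-rate notes in this file can be turned into a quick pre-upload check. The sketch below is illustrative only: the function names and return labels are hypothetical, the rejection rule is stated only in the **Long audio + transcript** note, and the notes do not say how a file at exactly 16,000 Hz is handled, so it is treated here as accepted and up-sampled.

```python
import wave

MIN_ACCEPTED_HZ = 16_000  # below this, the Long audio note says files are rejected
TARGET_HZ = 24_000        # default sampling rate for a custom neural voice


def wav_sample_rate(path):
    """Read the sampling rate from a RIFF .wav file header."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getframerate()


def classify_sample_rate(sample_rate_hz):
    """Map a sampling rate to the outcome described in the notes (labels are ours)."""
    if sample_rate_hz < MIN_ACCEPTED_HZ:
        return "rejected"
    if sample_rate_hz < TARGET_HZ:
        # Exactly 16,000 Hz is an assumption: the notes only cover "higher than".
        return "upsampled"
    return "recommended"
```

Calling `classify_sample_rate(wav_sample_rate("kingstory.wav"))` on each .wav file before zipping would flag recordings that fall under the 16,000 Hz floor; .mp3 files would need a separate decoder, which this sketch does not cover.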

articles/ai-services/speech-service/record-custom-voice-samples.md

Lines changed: 10 additions & 9 deletions
@@ -13,7 +13,7 @@ ms.author: eur

 # Recording voice samples for custom neural voice

-This article provides you with instructions on preparing high-quality voice samples for creating a professional voice model using the custom neural voice Pro project.
+This article provides you with best practices on preparing high-quality voice samples for creating a professional voice model using the custom neural voice Pro project. To understand how the data is processed and the minimum requirements for data acceptance, please refer to [upload your data](professional-voice-create-training-set.md#upload-your-data).

 Creating a high-quality production custom neural voice from scratch isn't a casual undertaking. The central component of a custom neural voice is a large collection of audio samples of human speech. It's vital that these audio recordings be of high quality. Choose a voice talent who has experience making these kinds of recordings, and have them recorded by a recording engineer using professional equipment.

@@ -74,20 +74,21 @@ We provide [sample scripts in the 'General', 'Chat' and 'Customer Service' domai

 Below are some general guidelines that you can follow to create a good corpus (recorded audio samples) for custom neural voice training.

-- Balance your script to cover different sentence types in your domain including statements, questions, exclamations, long sentences, and short sentences.
-
-Each sentence should contain four words to 30 words, and no duplicate sentences should be included in your script.<br>
+- For most use cases, sentences are recommended to be between 2 and 15 seconds long, containing 5 to 30 words for Latin-based languages or 4 to 80 words for non-Latin languages. Aim to balance your script to include a variety of sentence types and lengths. Ensure your script does not include any duplicate sentences.<br>
+
+If your use case requires a high emphasis on questions, exclamations, or a mix of particularly long and short sentences, it is recommended to include a good portion of sentences as questions or exclamations, along with very short phrases and longer phrases up to 20 seconds in length.
+
 For how to balance the different sentence types, refer to the following table:

 | Sentence types | Coverage |
 | :--------- | :--------------------------- |
 | Statement sentences | Statement sentences should be 70-80% of the script.|
-| Question sentences | Question sentences should be about 10%-20% of your domain script, including 5%-10% of rising and 5%-10% of falling tones. |
-| Exclamation sentences| Exclamation sentences should be about 10%-20% of your script.|
-| Short word/phrase| Short word/phrase scripts should be about 10% of total utterances, with 5 to 7 words per case. |
+| Short word/phrase| Short word/phrase scripts should be about 10% of total utterances, with 5 to 7 words per case.<br> Short words or phrases should be separated by commas to help remind voice talent to pause briefly while reading.|
+| Question sentences (Optional) | Question sentences should be about 10%-20% of your domain script, including 5%-10% of rising and 5%-10% of falling tones.<br> These sentences are required if you want the generated voice to accurately convey questions.|
+| Exclamation sentences (Optional) | Exclamation sentences should be about 10%-20% of your script.<br> These sentences are required if you want the generated voice to accurately convey exclamations.|

-> [!NOTE]
-> Short words/phrases should be separated with a commas. They help remind your voice talent to pause briefly when reading them.
+> [!NOTE]
+> You can estimate the number of words in a sentence by assuming a speech rate in words per second based on your language.

 Best practices include:
 - Balanced coverage for Parts of Speech, like verbs, nouns, adjectives, and so on.
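
The added note about estimating word counts from a speech rate can be illustrated with a small sketch. The 2 to 15 second and 5 to 30 word ranges come from the added guidance for Latin-based languages; the default rate of 2.5 words per second is a placeholder assumption, not a figure from the article, and the function names are hypothetical.

```python
# Ranges quoted from the added guidance for Latin-based languages.
MIN_WORDS, MAX_WORDS = 5, 30
MIN_SECONDS, MAX_SECONDS = 2.0, 15.0


def words_for_duration(seconds, words_per_second):
    """Estimate how many words fit in a duration at an assumed speech rate."""
    return seconds * words_per_second


def check_sentence(sentence, words_per_second=2.5):
    """Return guideline warnings for one script sentence (2.5 words/s is a placeholder)."""
    warnings = []
    word_count = len(sentence.split())
    seconds = word_count / words_per_second  # estimated spoken duration
    if not MIN_WORDS <= word_count <= MAX_WORDS:
        warnings.append(f"{word_count} words is outside {MIN_WORDS}-{MAX_WORDS}")
    if not MIN_SECONDS <= seconds <= MAX_SECONDS:
        warnings.append(f"about {seconds:.1f} s is outside {MIN_SECONDS:g}-{MAX_SECONDS:g} s")
    return warnings
```

Running `check_sentence` over every line of a script gives a rough first pass before recording; the rate should be replaced with one measured for the actual language and voice talent.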
