**articles/ai-services/speech-service/how-to-custom-voice-training-data.md**
A voice training dataset includes audio recordings and a text file with the associated transcriptions. Each audio file should contain a single utterance (a single sentence or a single turn for a dialog system) and be less than 15 seconds long.
In some cases, you might not have the right dataset ready. You can test the custom neural voice training with available audio files, short or long, with or without transcripts.
This table lists data types and how each is used to create a custom Text to speech voice model.
| Data type | Description | When to use | Extra processing required | Processed as |
|-----------|-------------|-------------|---------------------------|--------------|
|[Individual utterances + matching transcript](#individual-utterances--matching-transcript)| A collection (.zip) of audio files (.wav) as individual utterances. Each audio file should be 15 seconds or less in length, paired with a formatted transcript (.txt). | Professional recordings with matching transcripts | Ready for training. | Segmented |
|[Long audio + transcript](#long-audio--transcript-preview)| A collection (.zip) of long, unsegmented audio files (.wav or .mp3, longer than 30 seconds, at most 1,000 audio files), paired with a collection (.zip) of transcripts that contains all spoken words. | You have audio files and matching transcripts, but they aren't segmented into utterances. | Segmentation (using batch transcription).<br>Audio format transformation where required. | Segmented, Contextual |
|[Audio only (Preview)](#audio-only-preview)| A collection (.zip) of audio files (.wav or .mp3, at most 1,000 audio files) without a transcript. | You only have audio files available, without transcripts. | Segmentation + transcript generation (using batch transcription).<br>Audio format transformation where required. | Segmented, Contextual |
Files should be grouped by type into a dataset and uploaded as a zip file. Each dataset can only contain a single data type.
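As an illustration of this packaging step, here's a minimal Python sketch that zips one folder of .wav files (plus an optional transcript .txt) into a single dataset archive. The helper name `package_dataset` and the flat layout inside the archive are assumptions for illustration, not part of the service.

```python
import zipfile
from pathlib import Path

def package_dataset(audio_dir: str, zip_path: str) -> list[str]:
    """Package all .wav files (plus any transcript .txt files) from one
    folder into a single .zip dataset. One dataset = one data type."""
    folder = Path(audio_dir)
    files = sorted(folder.glob("*.wav")) + sorted(folder.glob("*.txt"))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in files:
            zf.write(f, arcname=f.name)  # keep a flat layout in the archive
    return [f.name for f in files]
```

Because each dataset can only contain a single data type, run this once per folder of same-type files rather than mixing types in one archive.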
> [!NOTE]
> The maximum number of datasets allowed to be imported per subscription is 500 zip files for standard subscription (S0) users.
>
> Data processed as Contextual retains the audio as a whole to preserve the contextual information needed for more natural intonation.
## Individual utterances + matching transcript
## Long audio + transcript (Preview)
> [!NOTE]
> For **Long audio + transcript (Preview)**, only these languages are supported: Chinese (Mandarin, Simplified), Chinese (Cantonese, Traditional), Chinese (Taiwanese Mandarin), English (India), English (United Kingdom), English (United States), French (France), German (Germany), Hindi (India), Italian (Italy), Japanese (Japan), Portuguese (Brazil), Spanish (Spain) and Spanish (Mexico).
>
> **Processed as Contextual** is currently only available for Chinese (Mandarin, Simplified) and English (United States).
In some cases, you might not have segmented audio available. The Speech Studio can help you segment long audio files and create transcriptions. The long-audio segmentation service uses the [Batch Transcription API](batch-transcription.md) feature of speech to text.
The service offers two processing modes:
- **Segmented**: The default processing mode that works with all supported languages.
- **Contextual**: An enhanced mode that retains the audio as a whole to keep the contextual information for more natural intonation.
101
+
94
102
During segmentation, your audio files and transcripts are also sent to the custom speech service to refine the recognition model and improve accuracy for your data. No data is retained during this process. After segmentation is done, only the segmented utterances and their matching transcripts are stored for you to download and use for training.
### Audio data for Long audio + transcript
Follow these guidelines when preparing audio for segmentation.
| Property | Value |
|----------|-------|
| File name | File name characters supported by Windows OS, with .wav extension. <br>The characters `\ / : * ? " < > \|` aren't allowed. <br>It can't start or end with a space, and can't start with a dot. <br>No duplicate file names allowed. |
| Sampling rate | 24 kHz and higher required when creating a custom neural voice. |
| Sample format |RIFF(.wav): PCM, at least 16-bit.<br/><br/>mp3: At least 256 KBps bit rate.|
| Audio length | Longer than 30 seconds |
| Archive format | .zip |
| Maximum archive size | 2048 MB, at most 1,000 audio files included |
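The table's per-file rules can be sanity-checked locally before uploading. The following is a hedged sketch: `check_long_audio` is a hypothetical helper (not a service API) that only covers the name-character, sampling-rate, and length rules for .wav files.

```python
import wave
from pathlib import Path

FORBIDDEN = set('\\/:*?"<>|')

def check_long_audio(path: str) -> list[str]:
    """Return guideline violations for one long-audio .wav file:
    name characters, sampling rate >= 24 kHz, and length > 30 seconds."""
    issues = []
    name = Path(path).name
    stem = name[:-4] if name.lower().endswith(".wav") else name
    if not name.lower().endswith(".wav"):
        issues.append("extension must be .wav")
    if any(c in FORBIDDEN for c in stem):
        issues.append("file name contains a forbidden character")
    if stem != stem.strip() or stem.startswith("."):
        issues.append("file name can't start or end with a space, or start with a dot")
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        seconds = w.getnframes() / rate
    if rate < 24000:
        issues.append(f"sampling rate {rate} Hz is below 24 kHz")
    if seconds <= 30:
        issues.append(f"audio is {seconds:.1f} s; must be longer than 30 seconds")
    return issues
```

Note this sketch reads only RIFF (.wav) files via the standard `wave` module; .mp3 inputs and the archive-level limits (zip size, file count) would need separate checks.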
## Audio only (Preview)
> [!NOTE]
> For **Audio only (Preview)**, only these languages are supported: Chinese (Mandarin, Simplified), Chinese (Cantonese, Traditional), Chinese (Taiwanese Mandarin), English (India), English (United Kingdom), English (United States), French (France), German (Germany), Hindi (India), Italian (Italy), Japanese (Japan), Portuguese (Brazil), Spanish (Spain) and Spanish (Mexico).
>
> **Processed as Contextual** is currently only available for Chinese (Mandarin, Simplified) and English (United States).
If you don't have transcriptions for your audio recordings, use the **Audio only** option to upload your data. Our system can help you segment and transcribe your audio files.
The service offers two processing modes:
- **Segmented**: The default processing mode that works with all supported languages.
- **Contextual**: An enhanced mode that retains the audio as a whole to keep the contextual information for more natural intonation.
Follow these guidelines when preparing audio.
>
> Segmented utterances should ideally be between 5 and 15 seconds long. For optimal segmentation results, it is recommended to include natural pauses of 0.5 to 1 second every 5 to 15 seconds of speech, preferably at the end of phrases or sentences.
All audio files should be grouped into a zip file. Once your dataset is successfully uploaded, the Speech service helps you segment the audio file into utterances based on our speech batch transcription service. You can select either the Segmented or Contextual processing mode, depending on your language and requirements. Unique IDs are assigned to the segmented utterances automatically. Matching transcripts are generated through speech recognition. All .mp3 files are transformed into the .wav format after processing. You can check the segmented utterances and the matching transcripts by downloading the dataset.
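After downloading the segmented dataset, the 5-to-15-second guideline above can be checked locally. This is a small illustrative helper; the name `flag_segment_lengths` and the duration-list input are assumptions, not part of the Speech service.

```python
def flag_segment_lengths(durations_s: list[float],
                         lo: float = 5.0, hi: float = 15.0) -> list[int]:
    """Given per-utterance durations in seconds (for example, measured from
    the downloaded segmented dataset), return the indices of utterances
    that fall outside the ideal 5-15 second window."""
    return [i for i, d in enumerate(durations_s) if not lo <= d <= hi]
```

Utterances flagged here aren't necessarily rejected, but many out-of-window segments can indicate the source audio lacked the recommended natural pauses.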
**articles/ai-services/speech-service/includes/how-to/professional-voice/create-training-set/speech-studio.md**
To upload training data, follow these steps:
1. Sign in to the [Speech Studio](https://aka.ms/speechstudio/customvoice).
1. Select **Custom voice** > Your project name > **Prepare training data** > **Upload data**.
1. In the **Upload data** wizard, choose a [data type](../../../../how-to-custom-voice-training-data.md) and then select **Next**.
1. Select local files from your computer or enter the Azure Blob storage URL to upload data.
1. If you selected the **Long audio + transcript** or **Audio only** data type in a project that supports contextual processing, you'll see an option to choose the processing mode:
   - **Processed as Contextual**: Processes audio while preserving contextual information for enhanced conversational abilities and more natural speech patterns.
   - **Segmented**: Processes audio and transcript into individual utterances using standard segmentation.
1. Under **Specify the target training set**, select an existing training set or create a new one. If you created a new training set, make sure it's selected in the drop-down list before you continue.
1. Select **Next**.
1. Enter a name and description for your data and then select **Next**.
> Duplicate IDs are not accepted. Utterances with the same ID will be removed.
>
> Duplicate audio names are removed from training. Make sure the data you select doesn't contain the same audio names within the .zip file or across multiple .zip files. If utterance IDs (either in audio or script files) are duplicated, they're rejected.
>
> A training set can only contain data processed in the same mode. For example, if you upload data with **Processed as Contextual** mode to a training set, all subsequent uploads to that same training set must also use the **Processed as Contextual** mode.
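If you automate uploads, the same-mode rule above can be modeled as a simple client-side guard. This is only a sketch of how you might track your own uploads; the function and the mode strings are assumptions, not a service API.

```python
def can_add_to_training_set(existing_modes: set[str], new_mode: str) -> bool:
    """A training set may only contain data processed in one mode, so a new
    upload is valid only if the set is empty or the mode matches."""
    return not existing_modes or existing_modes == {new_mode}
```

Checking this before submitting avoids a failed upload when a training set already contains data in the other mode.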
Data files are automatically validated when you select **Submit**. Data validation includes a series of checks on the audio files to verify their file format, size, and sampling rate. If there are any errors, fix them and submit again.
After upload, you can check the data details of the training set. Before continuing to [train your voice model](../../../../professional-voice-train-voice.md), you should try to resolve any data issues.
You can identify and resolve data issues per utterance in [Speech Studio](https://aka.ms/custom-voice-portal).
#### Processed as Segmented
1. On the detail page, go to the **Accepted data** or **Rejected data** page. Select individual utterances you want to change, then select **Edit**.
You can also delete utterances with issues by selecting them and clicking **Delete**.
#### Processed as Contextual
89
+
90
+
Unlike **Processed as Segmented**, **Processed as Contextual** preserves the original audio files and generates corresponding contextual information.
1. On the detail page, select individual utterances that show a "Number of issues" count.
:::image type="content" source="../../../../media/custom-voice/contextual-training/contextual-details.png" alt-text="Screenshot of contextual utterances details page.":::
   Contextual information is presented as segments. Select the segment you want to modify, then select **Edit**.
:::image type="content" source="../../../../media/custom-voice/contextual-training/contextual-segments.png" alt-text="Screenshot of contextual segments to be displayed.":::
   You can choose which data issues are displayed based on your criteria.
:::image type="content" source="../../../../media/custom-voice/cnv-issues-display-criteria.png" alt-text="Screenshot of choosing which data issues to be displayed.":::
1. Edit the transcript in the text box according to the issue description, then select **Done**.
:::image type="content" source="../../../../media/custom-voice/contextual-training/contextual-edit-segment.png" alt-text="Screenshot of selecting Done button after editing transcript.":::
1. After you've made changes to your data, check the data quality by selecting **Analyze data** before using this dataset for training.
   You can't select this training set for model training until the analysis is complete.
:::image type="content" source="../../../../media/custom-voice/cnv-edit-trainingset-analyze.png" alt-text="Screenshot of selecting Analyze data on Data details page.":::
> [!NOTE]
> Deleting a segment of contextual information will not exclude that content from training. Only delete segments when their information is already included or will be included in adjacent segments.
>
> Rejected segments will not be used for training.
### Typical data issues
The issues are divided into three types. Refer to the following tables to check the respective types of errors.
**articles/ai-services/speech-service/includes/how-to/professional-voice/train-voice/bilingual-training.md**
If you select the **Neural** training type, you can train a voice to speak in multiple languages. The `zh-CN`, `zh-HK`, and `zh-TW` locales support bilingual training for the voice to speak both Chinese and English. Depending in part on your training data, the synthesized voice can speak English with an English native accent or English with the same accent as the training data.
> [!NOTE]
> To enable a voice in the `zh-CN` locale to speak English with the same accent as the sample data, you should upload English data to a **Contextual** training set, choose `Chinese (Mandarin, Simplified), English bilingual` when creating a project, or specify the `zh-CN (English bilingual)` locale for the training set data via the REST API.
>
> In your contextual training set, include at least 100 sentences or 10 minutes of English content, and make sure the English content doesn't exceed the amount of Chinese content.
The following table shows the differences among the locales:
| Speech Studio locale | REST API locale | Bilingual support |
|----------------------|-----------------|-------------------|
|`Chinese (Mandarin, Simplified)`|`zh-CN`|If your sample data includes English, the synthesized voice speaks English with an English native accent, instead of the same accent as the sample data, regardless of the amount of English data. |
|`Chinese (Mandarin, Simplified), English bilingual`|`zh-CN (English bilingual)`|If you want the synthesized voice to speak English with the same accent as the sample data, we recommend including over 10% English data in your training set. Otherwise, the English speaking accent might not be ideal. |
|`Chinese (Cantonese, Simplified)`|`zh-HK`| If you want to train a synthesized voice capable of speaking English with the same accent as your sample data, make sure to provide over 10% English data in your training set. Otherwise, it defaults to an English native accent. The 10% threshold is calculated based on the data accepted after successful uploading, not the data before uploading. If some uploaded English data is rejected due to defects and doesn't meet the 10% threshold, the synthesized voice defaults to an English native accent. |
|`Chinese (Taiwanese Mandarin, Traditional)`|`zh-TW`| If you want to train a synthesized voice capable of speaking English with the same accent as your sample data, make sure to provide over 10% English data in your training set. Otherwise, it defaults to an English native accent. The 10% threshold is calculated based on the data accepted after successful uploading, not the data before uploading. If some uploaded English data is rejected due to defects and doesn't meet the 10% threshold, the synthesized voice defaults to an English native accent. |
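The 10% threshold in the table can be estimated locally from the durations of your data. A minimal sketch follows; the helper name `english_share_ok` and the duration-based measure are assumptions, and since the service computes the threshold on data accepted after upload, treat this only as a pre-upload estimate.

```python
def english_share_ok(english_seconds: float, total_seconds: float,
                     threshold: float = 0.10) -> bool:
    """Return True if English data exceeds the given share of the training
    set (over 10% is recommended for zh-HK / zh-TW so the voice keeps the
    sample data's English accent)."""
    if total_seconds <= 0:
        return False
    return english_seconds / total_seconds > threshold
```

For example, 12 minutes of English in a 100-minute training set passes the check, while 5 minutes does not; leave headroom above 10% in case some English utterances are rejected during validation.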