articles/ai-services/speech-service/how-to-custom-voice-training-data.md (24 additions, 24 deletions)

author: eric-urban
manager: nitinme
ms.service: azure-ai-speech
ms.topic: how-to
ms.date: 1/21/2024
ms.author: eur
---
# Training data for custom neural voice
When you're ready to create a custom Text to speech voice for your application, the first step is to gather audio recordings and associated scripts to start training the voice model. The Speech service uses this data to create a unique voice tuned to match the voice in the recordings. After you train the voice, you can start synthesizing speech in your applications.
> [!TIP]
> To create a voice for production use, we recommend you use a professional recording studio and voice talent. For more information, see [record voice samples to create a custom neural voice](record-custom-voice-samples.md).
A voice training dataset includes audio recordings and a text file with the associated transcriptions. Each audio file should contain a single utterance (a single sentence or a single turn for a dialog system) and be less than 15 seconds long.
In some cases, you might not have the right dataset ready. You can test the custom neural voice training with available audio files, short or long, with or without transcripts.
This table lists data types and how each is used to create a custom Text to speech voice model.
Follow these guidelines when preparing audio.
| Property | Value |
| -------- | ----- |
| File format | RIFF (.wav), grouped into a .zip file |
| File name | File name characters supported by Windows OS, with .wav extension.<br>The characters `\ / : * ? " < > \|` aren't allowed. <br>It can't start or end with a space, and can't start with a dot. <br>No duplicate file names allowed. |
| Sampling rate | When you create a custom neural voice, 24,000 Hz is required. |
| Sample format | PCM, at least 16-bit |
| Audio length | Shorter than 15 seconds |
| Archive format | .zip |
| Maximum archive size | 2048 MB |
> [!NOTE]
> The default sampling rate for a custom neural voice is 24,000 Hz. Audio files with a sampling rate lower than 16,000 Hz are rejected. If a .zip file contains .wav files with different sample rates, only those equal to or higher than 16,000 Hz are imported. Audio files with a sampling rate higher than 16,000 Hz and lower than 24,000 Hz are up-sampled to 24,000 Hz to train a neural voice. We recommend that you use a sample rate of 24,000 Hz for your training data.
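
If you want to sanity-check your recordings against these requirements before you zip them, a small script can help. The following is a minimal sketch (not official tooling) that uses only the Python standard library; the folder name `training_audio` is a hypothetical example.

```python
import wave
from pathlib import Path

REQUIRED_RATE_HZ = 24000      # sampling rate required for a custom neural voice
MIN_SAMPLE_WIDTH_BYTES = 2    # at least 16-bit PCM
MAX_LENGTH_SECONDS = 15       # each utterance must be shorter than 15 seconds

def check_wav(path: Path) -> list[str]:
    """Return a list of problems found in one .wav file (empty means OK)."""
    problems = []
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
        if rate != REQUIRED_RATE_HZ:
            problems.append(f"sampling rate is {rate} Hz, expected {REQUIRED_RATE_HZ} Hz")
        if wav.getsampwidth() < MIN_SAMPLE_WIDTH_BYTES:
            problems.append("sample format is narrower than 16-bit PCM")
        if duration >= MAX_LENGTH_SECONDS:
            problems.append(f"audio is {duration:.1f} s, expected shorter than {MAX_LENGTH_SECONDS} s")
    return problems

for wav_path in sorted(Path("training_audio").glob("*.wav")):
    for problem in check_wav(wav_path):
        print(f"{wav_path.name}: {problem}")
```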
### Transcription data for Individual utterances + matching transcript
The transcription file is a plain text file. Use these guidelines to prepare your transcriptions.

| Property | Value |
| -------- | ----- |
| File format | Plain text (.txt) |
| Encoding format | ANSI, ASCII, UTF-8, UTF-8-BOM, UTF-16-LE, or UTF-16-BE. For zh-CN, ANSI and ASCII encoding aren't supported. |
| # of utterances per line |**One** - Each line of the transcription file should contain the name of one of the audio files, followed by the corresponding transcription. You must use a tab (\t) to separate the file name and transcription. |
| Maximum file size | 2048 MB |
Here's an example of how the transcripts are organized utterance by utterance in one .txt file:
```
0000000001[tab] This is the waistline, and it's falling.
0000000002[tab] We have trouble scoring.
0000000003[tab] It was Janet Maslin.
```
It's important that the transcripts are 100% accurate transcriptions of the corresponding audio. Errors in the transcripts introduce quality loss during the training.
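
If you generate the transcription file programmatically, the main things to get right are the tab separator, one utterance per line, and a supported encoding. Here's a minimal sketch under those assumptions; the utterance IDs and sentences simply reuse the example above, and the output file name `transcripts.txt` is hypothetical.

```python
# Write a tab-separated transcript file: one line per utterance,
# audio file name first, then a tab, then the transcription.
transcripts = {
    "0000000001": "This is the waistline, and it's falling.",
    "0000000002": "We have trouble scoring.",
    "0000000003": "It was Janet Maslin.",
}

with open("transcripts.txt", "w", encoding="utf-8") as f:
    for file_name, text in transcripts.items():
        f.write(f"{file_name}\t{text}\n")
```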
## Long audio + transcript (Preview)
> [!NOTE]
> For **Long audio + transcript (Preview)**, only these languages are supported: Chinese (Mandarin, Simplified), English (India), English (United Kingdom), English (United States), French (France), German (Germany), Italian (Italy), Japanese (Japan), Portuguese (Brazil), and Spanish (Mexico).
In some cases, you might not have segmented audio available. The Speech Studio can help you segment long audio files and create transcriptions. The long-audio segmentation service uses the [Batch Transcription API](batch-transcription.md) feature of speech to text.
During segmentation processing, your audio files and transcripts are also sent to the custom speech service to refine the recognition model so that accuracy can be improved for your data. No data is retained during this process. After segmentation is done, only the segmented utterances and their matching transcripts are stored for you to download and use for training.
> [!NOTE]
> This service is charged toward your speech to text subscription usage. The long-audio segmentation service is only supported with standard (S0) Speech resources.

Follow these guidelines when preparing audio for segmentation.
| Property | Value |
| -------- | ----- |
| File format | RIFF (.wav) or .mp3, grouped into a .zip file |
| File name | File name characters supported by Windows OS, with .wav extension. <br>The characters `\ / : * ? " < > \|` aren't allowed. <br>It can't start or end with a space, and can't start with a dot. <br>No duplicate file names allowed. |
| Sampling rate | When you create a custom neural voice, 24,000 Hz is required. |
| Sample format | RIFF (.wav): PCM, at least 16-bit.<br/><br/>mp3: At least 256 KBps bit rate. |
| Audio length | Longer than 20 seconds |
| Archive format | .zip |
| Maximum archive size | 2048 MB, at most 1000 audio files included |
> [!NOTE]
> The default sampling rate for a custom neural voice is 24,000 Hz. Audio files with a sampling rate lower than 16,000 Hz are rejected. Audio files with a sampling rate higher than 16,000 Hz and lower than 24,000 Hz are up-sampled to 24,000 Hz to train a neural voice. We recommend that you use a sample rate of 24,000 Hz for your training data.
All audio files should be grouped into a zip file. It's OK to put .wav files and .mp3 files into the same zip file. For example, you can upload a 45-second audio file named 'kingstory.wav' and a 200-second audio file named 'queenstory.mp3' in the same zip file. All .mp3 files are transformed into the .wav format after processing.
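
As a rough illustration of the packaging step, the sketch below groups mixed .wav and .mp3 files into one zip archive and checks the file-count and archive-size limits from the table. It uses only the Python standard library; the folder name `long_audio` and the archive name `long_audio.zip` are hypothetical.

```python
import zipfile
from pathlib import Path

AUDIO_DIR = Path("long_audio")    # hypothetical folder containing .wav and .mp3 files
MAX_FILES = 1000                  # at most 1,000 audio files per archive
MAX_ARCHIVE_MB = 2048             # maximum archive size

files = sorted(p for p in AUDIO_DIR.iterdir() if p.suffix.lower() in {".wav", ".mp3"})
assert len(files) <= MAX_FILES, f"too many audio files: {len(files)}"

with zipfile.ZipFile("long_audio.zip", "w", compression=zipfile.ZIP_DEFLATED) as archive:
    for path in files:
        archive.write(path, arcname=path.name)   # flat layout, file name only

size_mb = Path("long_audio.zip").stat().st_size / (1024 * 1024)
assert size_mb <= MAX_ARCHIVE_MB, f"archive is {size_mb:.0f} MB, over the {MAX_ARCHIVE_MB} MB limit"
print(f"Packaged {len(files)} files into long_audio.zip ({size_mb:.1f} MB)")
```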
### Transcription data for Long audio + transcript
Transcripts must be prepared to the specifications listed in this table. Each audio file should be matched with a transcript.

| Property | Value |
| -------- | ----- |
| # of utterances per line | No limit |
| Maximum file size | 2048 MB |
All transcript files in this data type should be grouped into a zip file. For example, you might upload a 45-second audio file named 'kingstory.wav' and a 200-second audio file named 'queenstory.mp3' in the same zip file. You need to upload another zip file containing the corresponding two transcripts: one named 'kingstory.txt' and the other named 'queenstory.txt'. Within each plain text file, you provide the full correct transcription for the matching audio.
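
Because each transcript shares its base name with the matching audio file, it's easy to verify the pairing before you upload. The following is a minimal sketch under that assumption; the folder names `long_audio` and `long_audio_transcripts` are hypothetical.

```python
from pathlib import Path

audio_dir = Path("long_audio")                   # hypothetical folder with .wav/.mp3 files
transcript_dir = Path("long_audio_transcripts")  # hypothetical folder with .txt transcripts

audio_stems = {p.stem for p in audio_dir.iterdir() if p.suffix.lower() in {".wav", ".mp3"}}
transcript_stems = {p.stem for p in transcript_dir.glob("*.txt")}

# Every audio file needs a transcript with the same base name, and vice versa.
for stem in sorted(audio_stems - transcript_stems):
    print(f"No transcript found for audio '{stem}'")
for stem in sorted(transcript_stems - audio_stems):
    print(f"Transcript '{stem}.txt' has no matching audio file")
```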
After your dataset is successfully uploaded, we'll help you segment the audio file into utterances based on the transcript provided. You can check the segmented utterances and the matching transcripts by downloading the dataset. Unique IDs are assigned to the segmented utterances automatically. It's important that you make sure the transcripts you provide are 100% accurate. Errors in the transcripts can reduce the accuracy during the audio segmentation and further introduce quality loss in the training phase that comes later.
## Audio only (Preview)
> [!NOTE]
> For **Audio only (Preview)**, only these languages are supported: Chinese (Mandarin, Simplified), English (India), English (United Kingdom), English (United States), French (France), German (Germany), Italian (Italy), Japanese (Japan), Portuguese (Brazil), and Spanish (Mexico).
If you don't have transcriptions for your audio recordings, use the **Audio only** option to upload your data. Our system can help you segment and transcribe your audio files. Keep in mind, this service is charged toward your speech to text subscription usage.
Follow these guidelines when preparing audio.
| Property | Value |
| -------- | ----- |
| File format | RIFF (.wav) or .mp3, grouped into a .zip file |
| File name | File name characters supported by Windows OS, with .wav extension. <br>The characters `\ / : * ? " < > \|` aren't allowed. <br>It can't start or end with a space, and can't start with a dot. <br>No duplicate file names allowed. |
| Sampling rate | When you create a custom neural voice, 24,000 Hz is required. |
| Sample format | RIFF (.wav): PCM, at least 16-bit.<br>mp3: At least 256 KBps bit rate. |
| Audio length | No limit |
| Archive format | .zip |
| Maximum archive size | 2048 MB, at most 1000 audio files included |
> [!NOTE]
> The default sampling rate for a custom neural voice is 24,000 Hz. Audio files with a sampling rate higher than 16,000 Hz and lower than 24,000 Hz are up-sampled to 24,000 Hz to train a neural voice. We recommend that you use a sample rate of 24,000 Hz for your training data.
All audio files should be grouped into a zip file. After your dataset is successfully uploaded, the Speech service helps you segment the audio files into utterances by using the batch transcription service. Unique IDs are assigned to the segmented utterances automatically. Matching transcripts are generated through speech recognition. All .mp3 files are transformed into the .wav format after processing. You can check the segmented utterances and the matching transcripts by downloading the dataset.