Commit 986f74c

Merge pull request #263762 from MicrosoftDocs/main
Publish to live, Sunday 4:00 PM PST, 01/21
2 parents 971c954 + 3a86c78 commit 986f74c

68 files changed: +266 -301 lines changed

articles/ai-services/speech-service/how-to-custom-voice-training-data.md

Lines changed: 24 additions & 24 deletions
@@ -6,13 +6,13 @@ author: eric-urban
 manager: nitinme
 ms.service: azure-ai-speech
 ms.topic: how-to
-ms.date: 10/27/2022
+ms.date: 1/21/2024
 ms.author: eur
 ---

 # Training data for custom neural voice

-When you're ready to create a custom Text to speech voice for your application, the first step is to gather audio recordings and associated scripts to start training the voice model. The Speech service uses this data to create a unique voice tuned to match the voice in the recordings. After you've trained the voice, you can start synthesizing speech in your applications.
+When you're ready to create a custom Text to speech voice for your application, the first step is to gather audio recordings and associated scripts to start training the voice model. The Speech service uses this data to create a unique voice tuned to match the voice in the recordings. After you train the voice, you can start synthesizing speech in your applications.

 > [!TIP]
 > To create a voice for production use, we recommend you use a professional recording studio and voice talent. For more information, see [record voice samples to create a custom neural voice](record-custom-voice-samples.md).
@@ -21,7 +21,7 @@ When you're ready to create a custom Text to speech voice for your application,

 A voice training dataset includes audio recordings, and a text file with the associated transcriptions. Each audio file should contain a single utterance (a single sentence or a single turn for a dialog system), and be less than 15 seconds long.

-In some cases, you may not have the right dataset ready and will want to test the custom neural voice training with available audio files, short or long, with or without transcripts.
+In some cases, you might not have the right dataset ready. You can test the custom neural voice training with available audio files, short or long, with or without transcripts.

 This table lists data types and how each is used to create a custom Text to speech voice model.

@@ -53,15 +53,15 @@ Follow these guidelines when preparing audio.
 | Property | Value |
 | -------- | ----- |
 | File format | RIFF (.wav), grouped into a .zip file |
-| File name | File name characters supported by Windows OS, with .wav extension.<br>The characters \ / : * ? " < > \| aren't allowed. <br>It can't start or end with a space, and can't start with a dot. <br>No duplicate file names allowed. |
-| Sampling rate | When creating a custom neural voice, 24,000 Hz is required. |
+| File name | File name characters supported by Windows OS, with .wav extension.<br>The characters `\ / : * ? " < > \|` aren't allowed. <br>It can't start or end with a space, and can't start with a dot. <br>No duplicate file names allowed. |
+| Sampling rate | When you create a custom neural voice, 24,000 Hz is required. |
 | Sample format | PCM, at least 16-bit |
 | Audio length | Shorter than 15 seconds |
 | Archive format | .zip |
 | Maximum archive size | 2048 MB |

 > [!NOTE]
-> The default sampling rate for a custom neural voice is 24,000 Hz. Audio files with a sampling rate lower than 16,000 Hz will be rejected. If a .zip file contains .wav files with different sample rates, only those equal to or higher than 16,000 Hz will be imported. Your audio files with a sampling rate higher than 16,000 Hz and lower than 24,000 Hz will be up-sampled to 24,000 Hz to train a neural voice. Its recommended that you should use a sample rate of 24,000 Hz for your training data.
+> The default sampling rate for a custom neural voice is 24,000 Hz. Audio files with a sampling rate lower than 16,000 Hz will be rejected. If a .zip file contains .wav files with different sample rates, only those equal to or higher than 16,000 Hz will be imported. Your audio files with a sampling rate higher than 16,000 Hz and lower than 24,000 Hz will be up-sampled to 24,000 Hz to train a neural voice. It's recommended that you should use a sample rate of 24,000 Hz for your training data.

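
Note on this hunk: the audio requirements in the table and note above can be checked locally before uploading. The following is a minimal sketch using only the Python standard library `wave` module. The function name `check_utterance_wav` and the message wording are illustrative assumptions and aren't part of the Speech service or its SDK; the thresholds are the ones stated in the table and note.

```python
import wave

# Thresholds from the table and note above.
REQUIRED_RATE_HZ = 24_000    # rate stated as required for custom neural voice
MIN_RATE_HZ = 16_000         # files below this rate are rejected on import
MIN_SAMPLE_WIDTH_BYTES = 2   # "PCM, at least 16-bit"
MAX_SECONDS = 15.0           # "Shorter than 15 seconds"


def check_utterance_wav(source):
    """Return a list of findings for one .wav file (an empty list means OK).

    `source` is a file path or a binary file-like object. The `wave` module
    only opens PCM data, so other encodings raise `wave.Error`.
    """
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        seconds = wav.getnframes() / rate if rate else 0.0

    findings = []
    if rate < MIN_RATE_HZ:
        findings.append(f"sampling rate {rate} Hz is below 16,000 Hz and will be rejected")
    elif rate < REQUIRED_RATE_HZ:
        findings.append(f"sampling rate {rate} Hz will be up-sampled to 24,000 Hz")
    if width < MIN_SAMPLE_WIDTH_BYTES:
        findings.append(f"sample format is {width * 8}-bit; at least 16-bit is expected")
    if seconds >= MAX_SECONDS:
        findings.append(f"audio is {seconds:.1f} s long; it should be shorter than 15 s")
    return findings
```

For example, `check_utterance_wav("0000000001.wav")` returns an empty list for a 24,000 Hz, 16-bit recording shorter than 15 seconds. The check doesn't cover the file name rules or the 2048 MB archive limit.
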
 ### Transcription data for Individual utterances + matching transcript

@@ -71,26 +71,26 @@ The transcription file is a plain text file. Use these guidelines to prepare you
 | -------- | ----- |
 | File format | Plain text (.txt) |
 | Encoding format | ANSI, ASCII, UTF-8, UTF-8-BOM, UTF-16-LE, or UTF-16-BE. For zh-CN, ANSI and ASCII encoding aren't supported. |
-| # of utterances per line | **One** - Each line of the transcription file should contain the name of one of the audio files, followed by the corresponding transcription. The file name and transcription should be separated by a tab (\t). |
+| # of utterances per line | **One** - Each line of the transcription file should contain the name of one of the audio files, followed by the corresponding transcription. You must use a tab (\t) to separate the file name and transcription. |
 | Maximum file size | 2048 MB |

-Below is an example of how the transcripts are organized utterance by utterance in one .txt file:
+Here's an example of how the transcripts are organized utterance by utterance in one .txt file:

 ```
 0000000001[tab] This is the waistline, and it's falling.
 0000000002[tab] We have trouble scoring.
 0000000003[tab] It was Janet Maslin.
 ```
-Its important that the transcripts are 100% accurate transcriptions of the corresponding audio. Errors in the transcripts will introduce quality loss during the training.
+It's important that the transcripts are 100% accurate transcriptions of the corresponding audio. Errors in the transcripts introduce quality loss during the training.
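
Note on this hunk: the one-utterance-per-line, tab-separated layout described above is easy to verify before upload. This is a minimal sketch, not the service's own parser. The function name `parse_transcript` is hypothetical, and skipping blank lines and rejecting duplicate file names are assumptions made for the illustration.

```python
def parse_transcript(text):
    """Map each audio file name to its transcription.

    Expects one utterance per line: file name, a tab, then the transcription.
    Raises ValueError when a line doesn't follow that layout.
    """
    entries = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        name, separator, transcription = line.partition("\t")
        if not separator:
            raise ValueError(f"line {line_number}: no tab between file name and transcription")
        name = name.strip()
        transcription = transcription.strip()
        if not name or not transcription:
            raise ValueError(f"line {line_number}: empty file name or transcription")
        if name in entries:
            raise ValueError(f"line {line_number}: duplicate file name {name!r}")
        entries[name] = transcription
    return entries


sample = (
    "0000000001\tThis is the waistline, and it's falling.\n"
    "0000000002\tWe have trouble scoring.\n"
    "0000000003\tIt was Janet Maslin.\n"
)
transcripts = parse_transcript(sample)
```

The resulting keys can then be compared with the audio file names in the .zip archive to find recordings that have no transcript. A check like this only validates the layout; it can't confirm that a transcription matches the audio.
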

 ## Long audio + transcript (Preview)

 > [!NOTE]
 > For **Long audio + transcript (Preview)**, only these languages are supported: Chinese (Mandarin, Simplified), English (India), English (United Kingdom), English (United States), French (France), German (Germany), Italian (Italy), Japanese (Japan), Portuguese (Brazil), and Spanish (Mexico).

-In some cases, you may not have segmented audio available. The Speech Studio can help you segment long audio files and create transcriptions. The long-audio segmentation service will use the [Batch Transcription API](batch-transcription.md) feature of speech to text.
+In some cases, you might not have segmented audio available. The Speech Studio can help you segment long audio files and create transcriptions. The long-audio segmentation service uses the [Batch Transcription API](batch-transcription.md) feature of speech to text.

-During the processing of the segmentation, your audio files and the transcripts will also be sent to the custom speech service to refine the recognition model so the accuracy can be improved for your data. No data will be retained during this process. After the segmentation is done, only the utterances segmented and their mapping transcripts will be stored for your downloading and training.
+During the processing of the segmentation, your audio files and the transcripts are also sent to the custom speech service to refine the recognition model so the accuracy can be improved for your data. No data is retained during this process. After the segmentation is done, only the utterances segmented and their mapping transcripts will be stored for your downloading and training.

 > [!NOTE]
 > This service will be charged toward your speech to text subscription usage. The long-audio segmentation service is only supported with standard (S0) Speech resources.
@@ -102,17 +102,17 @@ Follow these guidelines when preparing audio for segmentation.
 | Property | Value |
 | -------- | ----- |
 | File format | RIFF (.wav) or .mp3, grouped into a .zip file |
-| File name | File name characters supported by Windows OS, with .wav extension. <br>The characters \ / : * ? " < > \| aren't allowed. <br>It can't start or end with a space, and can't start with a dot. <br>No duplicate file names allowed. |
-| Sampling rate | When creating a custom neural voice, 24,000 Hz is required. |
-| Sample format |RIFF(.wav): PCM, at least 16-bit<br>mp3: at least 256 KBps bit rate|
+| File name | File name characters supported by Windows OS, with .wav extension. <br>The characters `\ / : * ? " < > \|` aren't allowed. <br>It can't start or end with a space, and can't start with a dot. <br>No duplicate file names allowed. |
+| Sampling rate | When you create a custom neural voice, 24,000 Hz is required. |
+| Sample format |RIFF(.wav): PCM, at least 16-bit.<br/><br/>mp3: At least 256 KBps bit rate.|
 | Audio length | Longer than 20 seconds |
 | Archive format | .zip |
 | Maximum archive size | 2048 MB, at most 1000 audio files included |

 > [!NOTE]
-> The default sampling rate for a custom neural voice is 24,000 Hz. Audio files with a sampling rate lower than 16,000 Hz will be rejected. Your audio files with a sampling rate higher than 16,000 Hz and lower than 24,000 Hz will be up-sampled to 24,000 Hz to train a neural voice. Its recommended that you should use a sample rate of 24,000 Hz for your training data.
+> The default sampling rate for a custom neural voice is 24,000 Hz. Audio files with a sampling rate lower than 16,000 Hz will be rejected. Your audio files with a sampling rate higher than 16,000 Hz and lower than 24,000 Hz will be up-sampled to 24,000 Hz to train a neural voice. It's recommended that you should use a sample rate of 24,000 Hz for your training data.

-All audio files should be grouped into a zip file. Its OK to put .wav files and .mp3 files into one audio zip. For example, you can upload a zip file containing an audio file named kingstory.wav’, 45 second long, and another audio named queenstory.mp3’, 200 second long. All .mp3 files will be transformed into the .wav format after processing.
+All audio files should be grouped into a zip file. It's OK to put .wav files and .mp3 files into the same zip file. For example, you can upload a 45 second audio file named 'kingstory.wav' and a 200 second long audio file named 'queenstory.mp3' in the same zip file. All .mp3 files will be transformed into the .wav format after processing.

 ### Transcription data for Long audio + transcript

@@ -126,16 +126,16 @@ Transcripts must be prepared to the specifications listed in this table. Each au
 | # of utterances per line | No limit |
 | Maximum file size | 2048 MB |

-All transcripts files in this data type should be grouped into a zip file. For example, you've uploaded a zip file containing an audio file named kingstory.wav’, 45 seconds long, and another one named queenstory.mp3’, 200 seconds long. You'll need to upload another zip file containing two transcripts, one named kingstory.txt’, the other one queenstory.txt. Within each plain text file, you'll provide the full correct transcription for the matching audio.
+All transcripts files in this data type should be grouped into a zip file. For example, you might upload a 45 second audio file named 'kingstory.wav' and a 200 second long audio file named 'queenstory.mp3' in the same zip file. You need to upload another zip file containing the corresponding two transcripts--one named 'kingstory.txt' and the other one named 'queenstory.txt'. Within each plain text file, you provide the full correct transcription for the matching audio.

-After your dataset is successfully uploaded, we'll help you segment the audio file into utterances based on the transcript provided. You can check the segmented utterances and the matching transcripts by downloading the dataset. Unique IDs will be assigned to the segmented utterances automatically. Its important that you make sure the transcripts you provide are 100% accurate. Errors in the transcripts can reduce the accuracy during the audio segmentation and further introduce quality loss in the training phase that comes later.
+After your dataset is successfully uploaded, we'll help you segment the audio file into utterances based on the transcript provided. You can check the segmented utterances and the matching transcripts by downloading the dataset. Unique IDs are assigned to the segmented utterances automatically. It's important that you make sure the transcripts you provide are 100% accurate. Errors in the transcripts can reduce the accuracy during the audio segmentation and further introduce quality loss in the training phase that comes later.
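
Note on this hunk: the pairing of an audio .zip file with a transcript .zip file can be sanity-checked before upload. This sketch uses the standard library `zipfile` module and assumes, based only on the 'kingstory.wav' / 'kingstory.txt' example above, that an audio file and its transcript share the same base name. The function names are hypothetical.

```python
import zipfile
from pathlib import PurePosixPath

AUDIO_EXTENSIONS = {".wav", ".mp3"}
TRANSCRIPT_EXTENSIONS = {".txt"}


def base_names(zip_source, extensions):
    """Collect base names (without extension) of matching files in a zip archive."""
    with zipfile.ZipFile(zip_source) as archive:
        return {
            PurePosixPath(name).stem
            for name in archive.namelist()
            if PurePosixPath(name).suffix.lower() in extensions
        }


def unmatched_files(audio_zip, transcript_zip):
    """Return (audio without transcript, transcripts without audio), both sorted."""
    audio = base_names(audio_zip, AUDIO_EXTENSIONS)
    transcripts = base_names(transcript_zip, TRANSCRIPT_EXTENSIONS)
    return sorted(audio - transcripts), sorted(transcripts - audio)
```

With the example from the text, an audio archive containing 'kingstory.wav' and 'queenstory.mp3' and a transcript archive containing only 'kingstory.txt' yields `(["queenstory"], [])`. Both arguments can be file paths or binary file-like objects.
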

 ## Audio only (Preview)

 > [!NOTE]
 > For **Audio only (Preview)**, only these languages are supported: Chinese (Mandarin, Simplified), English (India), English (United Kingdom), English (United States), French (France), German (Germany), Italian (Italy), Japanese (Japan), Portuguese (Brazil), and Spanish (Mexico).

-If you don't have transcriptions for your audio recordings, use the **Audio only** option to upload your data. Our system can help you segment and transcribe your audio files. Keep in mind, this service will be charged toward your speech to text subscription usage.
+If you don't have transcriptions for your audio recordings, use the **Audio only** option to upload your data. Our system can help you segment and transcribe your audio files. Keep in mind, this service is charged toward your speech to text subscription usage.

 Follow these guidelines when preparing audio.

@@ -145,17 +145,17 @@ Follow these guidelines when preparing audio.
 | Property | Value |
 | -------- | ----- |
 | File format | RIFF (.wav) or .mp3, grouped into a .zip file |
-| File name | File name characters supported by Windows OS, with .wav extension. <br>The characters \ / : * ? " < > \| aren't allowed. <br>It can't start or end with a space, and can't start with a dot. <br>No duplicate file names allowed. |
-| Sampling rate | When creating a custom neural voice, 24,000 Hz is required. |
-| Sample format |RIFF(.wav): PCM, at least 16-bit<br>mp3: at least 256 KBps bit rate|
+| File name | File name characters supported by Windows OS, with .wav extension. <br>The characters `\ / : * ? " < > \|` aren't allowed. <br>It can't start or end with a space, and can't start with a dot. <br>No duplicate file names allowed. |
+| Sampling rate | When you create a custom neural voice, 24,000 Hz is required. |
+| Sample format |RIFF(.wav): PCM, at least 16-bit<br>mp3: At least 256 KBps bit rate.|
 | Audio length | No limit |
 | Archive format | .zip |
 | Maximum archive size | 2048 MB, at most 1000 audio files included |

 > [!NOTE]
-> The default sampling rate for a custom neural voice is 24,000 Hz. Your audio files with a sampling rate higher than 16,000 Hz and lower than 24,000 Hz will be up-sampled to 24,000 Hz to train a neural voice. Its recommended that you should use a sample rate of 24,000 Hz for your training data.
+> The default sampling rate for a custom neural voice is 24,000 Hz. Your audio files with a sampling rate higher than 16,000 Hz and lower than 24,000 Hz will be up-sampled to 24,000 Hz to train a neural voice. It's recommended that you should use a sample rate of 24,000 Hz for your training data.

-All audio files should be grouped into a zip file. Once your dataset is successfully uploaded, we'll help you segment the audio file into utterances based on our speech batch transcription service. Unique IDs will be assigned to the segmented utterances automatically. Matching transcripts will be generated through speech recognition. All .mp3 files will be transformed into the .wav format after processing. You can check the segmented utterances and the matching transcripts by downloading the dataset.
+All audio files should be grouped into a zip file. Once your dataset is successfully uploaded, the Speech service helps you segment the audio file into utterances based on our speech batch transcription service. Unique IDs are assigned to the segmented utterances automatically. Matching transcripts are generated through speech recognition. All .mp3 files will be transformed into the .wav format after processing. You can check the segmented utterances and the matching transcripts by downloading the dataset.

 ## Next steps
