You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: articles/ai-services/speech-service/how-to-custom-voice-training-data.md
+6-6Lines changed: 6 additions & 6 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -55,14 +55,14 @@ Follow these guidelines when preparing audio.
55
55
| -------- | ----- |
56
56
| File format | RIFF (.wav), grouped into a .zip file |
57
57
| File name | File name characters supported by Windows OS, with .wav extension.<br>The characters `\ / : * ? " < > \|` aren't allowed. <br>It can't start or end with a space, and can't start with a dot. <br>No duplicate file names allowed. |
58
-
| Sampling rate |When you create a custom neural voice, 24,000 Hz is required. |
58
+
| Sampling rate |24 KHz and higher required when creating a custom neural voice. |
59
59
| Sample format | PCM, at least 16-bit |
60
60
| Audio length | Shorter than 15 seconds |
61
61
| Archive format | .zip |
62
62
| Maximum archive size | 2048 MB |
63
63
64
64
> [!NOTE]
65
-
> The default sampling rate for a custom neural voice is 24,000 Hz. Audio files with a sampling rate lower than 16,000 Hz will be rejected. If a .zip file contains .wav files with different sample rates, only those equal to or higher than 16,000 Hz will be imported. Your audio files with a sampling rate higher than 16,000 Hz and lower than 24,000 Hz will be up-sampled to 24,000 Hz to train a neural voice. It's recommended that you should use a sample rate of 24,000 Hz for your training data.
65
+
> The default sampling rate for a custom neural voice is 24 KHz. Audio files with a sampling rate lower than 16,000 Hz will be rejected. If a .zip file contains .wav files with different sample rates, only those equal to or higher than 16,000 Hz will be imported. Your audio files with a sampling rate higher than 16,000 Hz and lower than 24 KHz will be up-sampled to 24 KHz to train a neural voice. It's recommended that you should use a sample rate of 24 KHz and higher for your training data.
66
66
67
67
### Transcription data for Individual utterances + matching transcript
68
68
@@ -104,14 +104,14 @@ Follow these guidelines when preparing audio for segmentation.
104
104
| -------- | ----- |
105
105
| File format | RIFF (.wav) or .mp3, grouped into a .zip file |
106
106
| File name | File name characters supported by Windows OS, with .wav extension. <br>The characters `\ / : * ? " < > \|` aren't allowed. <br>It can't start or end with a space, and can't start with a dot. <br>No duplicate file names allowed. |
107
-
| Sampling rate |When you create a custom neural voice, 24,000 Hz is required. |
107
+
| Sampling rate |24 KHz and higher required when creating a custom neural voice. |
108
108
| Sample format |RIFF(.wav): PCM, at least 16-bit.<br/><br/>mp3: At least 256 KBps bit rate.|
109
109
| Audio length | Longer than 20 seconds |
110
110
| Archive format | .zip |
111
111
| Maximum archive size | 2048 MB, at most 1,000 audio files included |
112
112
113
113
> [!NOTE]
114
-
> The default sampling rate for a custom neural voice is 24,000 Hz. Audio files with a sampling rate lower than 16,000 Hz will be rejected. Your audio files with a sampling rate higher than 16,000 Hz and lower than 24,000 Hz will be up-sampled to 24,000 Hz to train a neural voice. It's recommended that you should use a sample rate of 24,000 Hz for your training data.
114
+
> The default sampling rate for a custom neural voice is 24 KHz. Audio files with a sampling rate lower than 16,000 Hz will be rejected. Your audio files with a sampling rate higher than 16,000 Hz and lower than 24 KHz will be up-sampled to 24 KHz to train a neural voice. It's recommended that you should use a sample rate of 24 KHz and higher for your training data.
115
115
116
116
All audio files should be grouped into a zip file. It's OK to put .wav files and .mp3 files into the same zip file. For example, you can upload a 45-second audio file named 'kingstory.wav' and a 200-second long audio file named 'queenstory.mp3' in the same zip file. All .mp3 files will be transformed into the .wav format after processing.
117
117
@@ -147,14 +147,14 @@ Follow these guidelines when preparing audio.
147
147
| -------- | ----- |
148
148
| File format | RIFF (.wav) or .mp3, grouped into a .zip file |
149
149
| File name | File name characters supported by Windows OS, with .wav extension. <br>The characters `\ / : * ? " < > \|` aren't allowed. <br>It can't start or end with a space, and can't start with a dot. <br>No duplicate file names allowed. |
150
-
| Sampling rate |When you create a custom neural voice, 24,000 Hz is required. |
150
+
| Sampling rate |24 KHz and higher required when creating a custom neural voice. |
151
151
| Sample format |RIFF(.wav): PCM, at least 16-bit<br>mp3: At least 256 KBps bit rate.|
152
152
| Audio length | No limit |
153
153
| Archive format | .zip |
154
154
| Maximum archive size | 2048 MB, at most 1,000 audio files included |
155
155
156
156
> [!NOTE]
157
-
> The default sampling rate for a custom neural voice is 24,000 Hz. Your audio files with a sampling rate higher than 16,000 Hz and lower than 24,000 Hz will be up-sampled to 24,000 Hz to train a neural voice. It's recommended that you should use a sample rate of 24,000 Hz for your training data.
157
+
> The default sampling rate for a custom neural voice is 24 KHz. Your audio files with a sampling rate higher than 16,000 Hz and lower than 24 KHz will be up-sampled to 24 KHz to train a neural voice. It's recommended that you should use a sample rate of 24 KHz and higher for your training data.
158
158
159
159
All audio files should be grouped into a zip file. Once your dataset is successfully uploaded, the Speech service helps you segment the audio file into utterances based on our speech batch transcription service. Unique IDs are assigned to the segmented utterances automatically. Matching transcripts are generated through speech recognition. All .mp3 files will be transformed into the .wav format after processing. You can check the segmented utterances and the matching transcripts by downloading the dataset.
Copy file name to clipboardExpand all lines: articles/ai-services/speech-service/record-custom-voice-samples.md
+2-2Lines changed: 2 additions & 2 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -336,15 +336,15 @@ Take regular breaks and provide a beverage to help your voice talent keep their
336
336
337
337
### After the session
338
338
339
-
Modern recording studios run on computers. At the end of the session, you receive one or more audio files, not a tape. These files are probably WAV or AIFF format in CD quality (44.1 KHz 16-bit) or better. 24 KHz 16-bit is common and desirable. The default sampling rate for a custom neural voice is 24 KHz. It's recommended that you should use a sample rate of 24 KHz for your training data. Higher sampling rates, such as 96 KHz, aren't usually needed.
339
+
Modern recording studios run on computers. At the end of the session, you receive one or more audio files, not a tape. These files are probably WAV or AIFF format in CD quality (44.1 KHz 16-bit) or better. 24 KHz 16-bit is common and desirable. The default sampling rate for a custom neural voice is 24 KHz. It's recommended that you should use a sample rate of 24 KHz and higher for your training data. Higher sampling rates, such as 96 KHz, aren't usually needed.
340
340
341
341
Speech Studio requires each provided utterance to be in its own file. Each audio file delivered by the studio contains multiple utterances. So the primary post-production task is to split up the recordings and prepare them for submission. The recording engineer might have placed markers in the file (or provided a separate cue sheet) to indicate where each utterance starts.
342
342
343
343
Use your notes to find the exact takes you want, and then use a sound editing utility, such as [Avid Pro Tools](https://www.avid.com/en/pro-tools), [Adobe Audition](https://www.adobe.com/products/audition.html), or the free [Audacity](https://www.audacityteam.org/), to copy each utterance into a new file.
344
344
345
345
Listen to each file carefully. At this stage, you can edit out small unwanted sounds that you missed during recording, like a slight lip smack before a line, but be careful not to remove any actual speech. If you can't fix a file, remove it from your dataset and note that you've done so.
346
346
347
-
Convert each file to 16 bits and a sample rate of 24 KHz before saving and if you recorded the studio chatter, remove the second channel. Save each file in WAV format, naming the files with the utterance number from your script.
347
+
Convert each file to 16 bits and a sample rate of 24 KHz and higher before saving and if you recorded the studio chatter, remove the second channel. Save each file in WAV format, naming the files with the utterance number from your script.
348
348
349
349
Finally, create the transcript that associates each WAV file with a text version of the corresponding utterance. [Train your voice model](./professional-voice-train-voice.md) includes details of the required format. You can copy the text directly from your script. Then create a Zip file of the WAV files and the text transcript.
0 commit comments