
Commit d466580

Merge pull request #209764 from sally-baolian/patch-58
Update how-to-custom-voice-prepare-data.md
2 parents 6fa0f8f + f3424d3 commit d466580


1 file changed, +4 -4 lines changed


articles/cognitive-services/Speech-Service/how-to-custom-voice-prepare-data.md

Lines changed: 4 additions & 4 deletions
@@ -41,8 +41,8 @@ This table lists data types and how each is used to create a custom Text-to-Spee
 | Data type | Description | When to use | Additional processing required |
 | --------- | ----------- | ----------- | ------------------------------ |
 | **Individual utterances + matching transcript** | A collection (.zip) of audio files (.wav) as individual utterances. Each audio file should be 15 seconds or less in length, paired with a formatted transcript (.txt). | Professional recordings with matching transcripts | Ready for training. |
-| **Long audio + transcript (beta)** | A collection (.zip) of long, unsegmented audio files (.wav or .mp3, longer than 20 seconds), paired with a collection (.zip) of transcripts that contains all spoken words. | You have audio files and matching transcripts, but they aren't segmented into utterances. | Segmentation (using batch transcription).<br>Audio format transformation where required. |
-| **Audio only (beta)** | A collection (.zip) of audio files (.wav or .mp3) without a transcript. | You only have audio files available, without transcripts. | Segmentation + transcript generation (using batch transcription).<br>Audio format transformation where required.|
+| **Long audio + transcript (beta)** | A collection (.zip) of long, unsegmented audio files (.wav or .mp3, longer than 20 seconds, at most 1000 audio files), paired with a collection (.zip) of transcripts that contains all spoken words. | You have audio files and matching transcripts, but they aren't segmented into utterances. | Segmentation (using batch transcription).<br>Audio format transformation where required. |
+| **Audio only (beta)** | A collection (.zip) of audio files (.wav or .mp3, at most 1000 audio files) without a transcript. | You only have audio files available, without transcripts. | Segmentation + transcript generation (using batch transcription).<br>Audio format transformation where required.|

 Files should be grouped by type into a dataset and uploaded as a zip file. Each dataset can only contain a single data type.

@@ -118,7 +118,7 @@ Follow these guidelines when preparing audio for segmentation.
 | Sample format |RIFF(.wav): PCM, at least 16-bit<br>mp3: at least 256 KBps bit rate|
 | Audio length | Longer than 20 seconds |
 | Archive format | .zip |
-| Maximum archive size | 2048 MB |
+| Maximum archive size | 2048 MB, at most 1000 audio files included |

 > [!NOTE]
 > The default sampling rate for a custom neural voice is 24,000 Hz. Audio files with a sampling rate lower than 16,000 Hz will be rejected. Your audio files with a sampling rate higher than 16,000 Hz and lower than 24,000 Hz will be up-sampled to 24,000 Hz to train a neural voice. It’s recommended that you should use a sample rate of 24,000 Hz for your training data.
@@ -158,7 +158,7 @@ Follow these guidelines when preparing audio.
 | Sample format |RIFF(.wav): PCM, at least 16-bit<br>mp3: at least 256 KBps bit rate|
 | Audio length | No limit |
 | Archive format | .zip |
-| Maximum archive size | 2048 MB |
+| Maximum archive size | 2048 MB, at most 1000 audio files included |

 > [!NOTE]
 > The default sampling rate for a custom neural voice is 24,000 Hz. Your audio files with a sampling rate higher than 16,000 Hz and lower than 24,000 Hz will be up-sampled to 24,000 Hz to train a neural voice. It’s recommended that you should use a sample rate of 24,000 Hz for your training data.
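This commit adds a cap of 1000 audio files per archive, alongside the existing 2048 MB archive limit and the sampling-rate rules in the notes. A local pre-upload check can catch these limits before a dataset is submitted. The following is a minimal sketch using only the Python standard library (`zipfile`, `wave`). The function names `sample_rate_action`, `check_zip`, and `check_archive` are hypothetical and are not part of any Speech service SDK. The sketch makes several assumptions: "2048 MB" is read as 2048 MiB; a rate of exactly 16,000 Hz is treated as up-sampled rather than rejected; the rejection rule from the long-audio note is applied to every .wav file; and .mp3 files are counted but not inspected.

```python
import os
import wave
import zipfile
from typing import Optional

# Limits quoted from the updated tables in how-to-custom-voice-prepare-data.md.
MAX_ARCHIVE_BYTES = 2048 * 1024 * 1024  # assumption: "2048 MB" read as 2048 MiB
MAX_AUDIO_FILES = 1000                   # "at most 1000 audio files included"
MIN_SAMPLE_RATE = 16_000                 # lower rates are rejected (long-audio note)
TARGET_SAMPLE_RATE = 24_000              # default rate for a custom neural voice
AUDIO_EXTENSIONS = (".wav", ".mp3")


def sample_rate_action(rate_hz: int) -> str:
    """Map a sampling rate to the behavior described in the NOTE blocks."""
    if rate_hz < MIN_SAMPLE_RATE:
        return "reject"
    if rate_hz < TARGET_SAMPLE_RATE:
        return "upsample"  # assumption: exactly 16,000 Hz is up-sampled, not rejected
    return "ok"


def check_zip(zf: zipfile.ZipFile, min_seconds: Optional[float] = None) -> list:
    """Return human-readable problems found in an open archive (empty if none).

    Pass min_seconds=20 for "Long audio + transcript" datasets; leave it as
    None for "Audio only" datasets, whose audio length has no limit.
    """
    problems = []
    audio = [n for n in zf.namelist() if n.lower().endswith(AUDIO_EXTENSIONS)]
    if len(audio) > MAX_AUDIO_FILES:
        problems.append(f"{len(audio)} audio files; at most {MAX_AUDIO_FILES} are allowed")
    for name in audio:
        if not name.lower().endswith(".wav"):
            continue  # this sketch does not parse .mp3 headers, so bit rate is unchecked
        try:
            # The wave module only reads uncompressed PCM RIFF files.
            with zf.open(name) as f, wave.open(f) as w:
                rate, bits = w.getframerate(), 8 * w.getsampwidth()
                seconds = w.getnframes() / rate if rate else 0.0
        except (wave.Error, EOFError):
            problems.append(f"{name}: not a PCM RIFF (.wav) file readable by the wave module")
            continue
        if sample_rate_action(rate) == "reject":
            problems.append(f"{name}: {rate} Hz is below {MIN_SAMPLE_RATE} Hz")
        if bits < 16:
            problems.append(f"{name}: {bits}-bit samples; at least 16-bit PCM is required")
        if min_seconds is not None and seconds <= min_seconds:
            problems.append(f"{name}: {seconds:.1f} s; must be longer than {min_seconds:g} s")
    return problems


def check_archive(path: str, min_seconds: Optional[float] = None) -> list:
    """Check the archive size on disk, then the audio files inside it."""
    problems = []
    size = os.path.getsize(path)
    if size > MAX_ARCHIVE_BYTES:
        problems.append(f"archive is {size} bytes; the limit is 2048 MB")
    with zipfile.ZipFile(path) as zf:
        problems.extend(check_zip(zf, min_seconds))
    return problems
```

For a long-audio dataset, `check_archive("long_audio.zip", min_seconds=20)` returns a list of problem strings. An empty list only means that none of these particular checks failed; it does not guarantee that the service will accept the dataset.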
