Commit acb4928

Merge pull request #98354 from IEvangelist/speechClarifications
[CogSvcs] added clarifications to file sizes.
2 parents fa7d5e5 + 9ba558e commit acb4928

1 file changed, +7 -1 lines changed

articles/cognitive-services/Speech-Service/how-to-custom-speech-test-data.md

Lines changed: 7 additions & 1 deletion
@@ -57,6 +57,9 @@ Use this table to ensure that your audio files are formatted correctly for use w
 | Archive format | .zip |
 | Maximum archive size | 2 GB |
 
+> [!TIP]
+> When uploading training and testing data, the .zip file size cannot exceed 2 GB. If you require more data for training and testing, divide it into several .zip files and upload them separately. Later, you can choose to train and test from *multiple* datasets.
+
 If your audio doesn’t satisfy these properties or you want to check if it does, we suggest downloading [sox](http://sox.sourceforge.net) to check or convert the audio. Below are some examples of how each of these activities can be done through the command line:
 
 | Activity | Description | Sox command |
@@ -66,7 +69,7 @@ If your audio doesn’t satisfy these properties or you want to check if it does
 
 ## Audio + human-labeled transcript data for testing/training
 
-To measure the accuracy of Microsoft's speech-to-text accuracy when processing your audio files, you must provide human-labeled transcriptions (word-by-word) for comparison. While human-labeled transcription is often time consuming, it's necessary to evaluate accuracy and to train the model for your use cases. Keep in mind, the improvements in recognition will only be as good as the data provided. For that reason, it's important that only high-quality transcripts are uploaded.
+To measure the accuracy of Microsoft's speech-to-text accuracy when processing your audio files, you must provide human-labeled transcriptions (word-by-word) for comparison. While human-labeled transcription is often time consuming, it's necessary to evaluate accuracy and to train the model for your use cases. Keep in mind, the improvements in recognition will only be as good as the data provided. For that reason, it's important that only high-quality transcripts are uploaded.
 
 | Property | Value |
 |----------|-------|
@@ -78,6 +81,9 @@ To measure the accuracy of Microsoft's speech-to-text accuracy when processing y
 | Archive format | .zip |
 | Maximum zip size | 2 GB |
 
+> [!TIP]
+> When uploading training and testing data, the .zip file size cannot exceed 2 GB. If you require more data for training and testing, divide it into several .zip files and upload them separately. Later, you can choose to train and test from *multiple* datasets.
+
 To address issues like word deletion or substitution, a significant amount of data is required to improve recognition. Generally, it's recommended to provide word-by-word transcriptions for roughly 10 to 1,000 hours of audio. The transcriptions for all WAV files should be contained in a single plain-text file. Each line of the transcription file should contain the name of one of the audio files, followed by the corresponding transcription. The file name and transcription should be separated by a tab (\t).
 
 For example:
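
The diff context ends before the documentation's own example, and the following is not a reconstruction of it. It is a separate, minimal Python sketch of the two points the changed text makes: keep each .zip under 2 GB by dividing the data into several archives, and pair each audio file name with its transcription on a tab-separated line. Everything named here is an assumption for illustration and does not come from the commit: the functions `plan_batches` and `write_archives`, the archive names `dataset_001.zip` and so on, the transcript file name `trans.txt`, the choice to write one transcript file per archive, treating 1 GB as 1024**3 bytes, and the 5% headroom. On-disk file sizes are used only as an estimate of the final archive size.

```python
import os
import zipfile

# 2 GB limit from the tip; assumes 1 GB = 1024**3 bytes (an assumption).
MAX_ZIP_BYTES = 2 * 1024**3


def plan_batches(items, limit):
    """Greedily group (audio_path, transcript) pairs so that the summed
    on-disk size of each group stays at or below `limit` bytes."""
    batches, current, used = [], [], 0
    for path, text in items:
        size = os.path.getsize(path)
        if size > limit:
            raise ValueError(f"{path} alone exceeds the {limit}-byte limit")
        if current and used + size > limit:
            batches.append(current)
            current, used = [], 0
        current.append((path, text))
        used += size
    if current:
        batches.append(current)
    return batches


def write_archives(items, out_dir, limit=int(MAX_ZIP_BYTES * 0.95),
                   transcript_name="trans.txt"):
    """Write one .zip per batch. Each archive holds the audio files plus a
    transcript file with one '<file name><TAB><transcription>' line per file.
    The transcript file name is hypothetical."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for index, batch in enumerate(plan_batches(items, limit), start=1):
        zip_path = os.path.join(out_dir, f"dataset_{index:03d}.zip")
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
            lines = []
            for path, text in batch:
                name = os.path.basename(path)
                archive.write(path, arcname=name)
                lines.append(f"{name}\t{text}")
            archive.writestr(transcript_name, "\n".join(lines) + "\n")
        written.append(zip_path)
    return written
```

A call such as `write_archives([("audio/clip1.wav", "hello world")], "upload")` would write `upload/dataset_001.zip`. Because the size estimate ignores compression and archive overhead, it is worth checking each resulting file's size before uploading.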
