
Commit ed1a24c

Update documentation to reflect the latest limits for audio data used as training data for speech customization.
1 parent: e47aa5f

4 files changed: +15 additions, -11 deletions

articles/ai-services/speech-service/faq-stt.yml

Lines changed: 1 addition & 1 deletion
@@ -157,7 +157,7 @@ sections:
 
 In general, Speech service processes approximately 10 hours of audio data per day in regions that have dedicated hardware. Training with text only is faster and ordinarily finishes within minutes.
 
-Use one of the regions where dedicated hardware is available for training. The Speech service uses up to 20 hours of audio for training in these regions.
+Use one of the regions where dedicated hardware is available for training. The Speech service uses up to 100 hours of audio for training in these regions.
 
 - name: Accuracy testing
   questions:
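Worked out, these limits mean that a maximum-size training dataset takes on the order of 10 days to process: 100 hours of audio at roughly 10 hours per day, assuming a region with dedicated hardware and no queueing.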

articles/ai-services/speech-service/how-to-custom-speech-human-labeled-transcriptions.md

Lines changed: 8 additions & 4 deletions
@@ -12,9 +12,13 @@ ms.author: eur
 
 # How to create human-labeled transcriptions
 
-Human-labeled transcriptions are word-by-word transcriptions of an audio file. You use human-labeled transcriptions to improve recognition accuracy, especially when words are deleted or incorrectly replaced. This guide can help you create high-quality transcriptions.
+Human-labeled transcriptions are word-by-word transcriptions of an audio file. You use human-labeled transcriptions to evaluate model accuracy and to improve recognition accuracy, especially when words are deleted or incorrectly replaced. This guide can help you create high-quality transcriptions.
 
-A large sample of transcription data is required to improve recognition. We suggest providing between 1 and 20 hours of audio data. The Speech service uses up to 20 hours of audio for training. This guide has sections for US English, Mandarin Chinese, and German locales.
+A representative sample of transcription data is recommended to evaluate model accuracy. The data should cover various speakers and utterances that are representative of what users say to the application. For test data, the maximum duration of each individual audio file is 2 hours.
+
+A large sample of transcription data is required to improve recognition. We suggest providing between 1 and 100 hours of audio data. The Speech service uses up to 100 hours of audio for training (up to 20 hours for older models that don't charge for training). Each individual audio file shouldn't be longer than 40 seconds (up to 30 seconds for Whisper customization).
+
+This guide has sections for US English, Mandarin Chinese, and German locales.
 
 The transcriptions for all WAV files are contained in a single plain-text file (.txt or .tsv). Each line of the transcription file contains the name of one of the audio files, followed by the corresponding transcription. The file name and transcription are separated by a tab (`\t`).
 
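For reference, a transcription file in this format might look like the following sketch (the file names and utterances are hypothetical); each audio file name is separated from its transcription by a single tab character:

```
speech01.wav	the weather in seattle is cloudy today
speech02.wav	set a reminder for nine thirty
```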
@@ -38,7 +42,7 @@ Here are a few examples:
 
 | Characters to avoid | Substitution | Notes |
 | ------------------- | ------------ | ----- |
-| Hello world | "Hello world" | The opening and closing quotation marks are substituted with appropriate ASCII characters. |
+| “Hello world” | "Hello world" | The opening and closing quotation marks are substituted with appropriate ASCII characters. |
 | John’s day | John's day | The apostrophe is substituted with the appropriate ASCII character. |
 | It was good—no, it was great! | it was good--no, it was great! | The em dash is substituted with two hyphens. |

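To make the substitutions concrete, here's a minimal Python sketch (a hypothetical helper, not part of any Speech service tooling) that maps the Unicode punctuation from the table to its ASCII equivalents:

```python
# Minimal sketch: apply the ASCII substitutions from the table above.
SUBSTITUTIONS = {
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2019": "'",   # right single quotation mark (apostrophe)
    "\u2014": "--",  # em dash becomes two hyphens
}

def to_ascii_punctuation(line: str) -> str:
    for char, replacement in SUBSTITUTIONS.items():
        line = line.replace(char, replacement)
    return line

print(to_ascii_punctuation("It was good\u2014no, it was great!"))
# It was good--no, it was great!
```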
@@ -48,7 +52,7 @@ Text normalization is the transformation of words into a consistent format used
 
 - Write out abbreviations in words.
 - Write out nonstandard numeric strings in words (such as accounting terms).
-- Non-alphabetic characters or mixed alphanumeric characters should be transcribed as pronounced.
+- Nonalphabetic characters or mixed alphanumeric characters should be transcribed as pronounced.
 - Abbreviations that are pronounced as words shouldn't be edited (such as "radar", "laser", "RAM", or "NATO").
 - Write out abbreviations that are pronounced as separate letters with each letter separated by a space.
 - If you use audio, transcribe numbers as words that match the audio (for example, "101" could be pronounced as "one oh one" or "one hundred and one").
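As a toy illustration of these rules (this is not the Speech service's internal normalizer, and the lexicon here is invented for the example), a hand-built lookup might be applied like this:

```python
# Toy normalization sketch with an invented, domain-specific lexicon.
SPOKEN_FORMS = {
    "dr.": "doctor",      # abbreviation written out in words
    "101": "one oh one",  # numbers transcribed as pronounced
    "fbi": "F B I",       # initialisms as space-separated letters
}

def normalize(text: str) -> str:
    # Words pronounced as words (such as "radar") pass through unchanged.
    tokens = text.lower().split()
    return " ".join(SPOKEN_FORMS.get(token, token) for token in tokens)

print(normalize("Dr. Smith saw 101 radar contacts"))
# doctor smith saw one oh one radar contacts
```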

articles/ai-services/speech-service/how-to-custom-speech-test-and-train.md

Lines changed: 5 additions & 5 deletions
@@ -29,8 +29,8 @@ The following table lists accepted data types, when each data type should be used
 
 | Data type | Used for testing | Recommended for testing | Used for training | Recommended for training |
 |-----------|------------------|-------------------------|-------------------|--------------------------|
-| [Audio only](#audio-data-for-training-or-testing) | Yes (visual inspection) | 5+ audio files | Yes (Preview for `en-US`) | 1-20 hours of audio |
-| [Audio + human-labeled transcripts](#audio--human-labeled-transcript-data-for-training-or-testing) | Yes (evaluation of accuracy) | 0.5-5 hours of audio | Yes | 1-20 hours of audio |
+| [Audio only](#audio-data-for-training-or-testing) | Yes (visual inspection) | 5+ audio files | Yes (Preview for `en-US`) | 1-100 hours of audio |
+| [Audio + human-labeled transcripts](#audio--human-labeled-transcript-data-for-training-or-testing) | Yes (evaluation of accuracy) | 0.5-5 hours of audio | Yes | 1-100 hours of audio |
 | [Plain text](#plain-text-data-for-training) | No | Not applicable | Yes | 1-200 MB of related text |
 | [Structured text](#structured-text-data-for-training) | No | Not applicable | Yes | Up to 10 classes with up to 4,000 items and up to 50,000 training sentences |
 | [Pronunciation](#pronunciation-data-for-training) | No | Not applicable | Yes | 1 KB to 1 MB of pronunciation text |
@@ -43,7 +43,7 @@ Training with plain text or structured text usually finishes within a few minutes
 >
 > Start with small sets of sample data that match the language, acoustics, and hardware where your model will be used. Small datasets of representative data can expose problems before you invest in gathering larger datasets for training. For sample custom speech data, see <a href="https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/sampledata/customspeech" target="_target">this GitHub repository</a>.
 
-If you train a custom model with audio data, choose a Speech resource region with dedicated hardware for training audio data. For more information, see footnotes in the [regions](regions.md#speech-service) table. In regions with dedicated hardware for custom speech training, the Speech service uses up to 20 hours of your audio training data, and can process about 10 hours of data per day. In other regions, the Speech service uses up to 8 hours of your audio data, and can process about 1 hour of data per day. After the model is trained, you can copy the model to another region as needed with the [Models_CopyTo](/rest/api/speechtotext/models/copy-to) REST API.
+If you train a custom model with audio data, choose a Speech resource region with dedicated hardware for training audio data. For more information, see footnotes in the [regions](regions.md#speech-service) table. In regions with dedicated hardware for custom speech training, the Speech service uses up to 100 hours of your audio training data, and can process about 10 hours of data per day. After the model is trained, you can copy the model to another region as needed with the [Models_CopyTo](/rest/api/speechtotext/models/copy-to) REST API.
 
 ## Consider datasets by scenario
 
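As a rough illustration of the copy step, the sketch below posts to the v3.0-style `copyto` path with Python's `requests`. The region, model ID, and keys are placeholders, and newer API versions may differ, so treat the exact path and payload as assumptions and confirm them against the linked [Models_CopyTo](/rest/api/speechtotext/models/copy-to) reference:

```python
import requests

# Placeholders; substitute your own values.
SOURCE_REGION = "eastus"  # region where the model was trained
MODEL_ID = "00000000-0000-0000-0000-000000000000"
SOURCE_KEY = "<source-resource-key>"
TARGET_KEY = "<target-resource-key>"  # key of the Speech resource to copy to

# v3.0-style copyto call; verify the path and body against the
# current REST API reference before using.
url = (
    f"https://{SOURCE_REGION}.api.cognitive.microsoft.com"
    f"/speechtotext/v3.0/models/{MODEL_ID}/copyto"
)
response = requests.post(
    url,
    headers={"Ocp-Apim-Subscription-Key": SOURCE_KEY},
    json={"targetSubscriptionKey": TARGET_KEY},
)
response.raise_for_status()
print("Copy accepted:", response.headers.get("Location"))
```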
@@ -89,7 +89,7 @@ Consider these details:
 * The Speech service automatically uses the transcripts to improve the recognition of domain-specific words and phrases, as though they were added as related text.
 * It can take several days for a training operation to finish. To improve the speed of training, be sure to create your Speech service subscription in a region with dedicated hardware for training.
 
-A large training dataset is required to improve recognition. Generally, we recommend that you provide word-by-word transcriptions for 1 to 20 hours of audio. However, even as little as 30 minutes can help improve recognition results. Although creating human-labeled transcription can take time, improvements in recognition are only as good as the data that you provide. You should upload only high-quality transcripts.
+A large training dataset is required to improve recognition. Generally, we recommend that you provide word-by-word transcriptions for 1 to 100 hours of audio (up to 20 hours for older models that do not charge for training). However, even as little as 30 minutes can help improve recognition results. Although creating human-labeled transcription can take time, improvements in recognition are only as good as the data that you provide. You should upload only high-quality transcripts.
 
 Audio files can have silence at the beginning and end of the recording. If possible, include at least a half-second of silence before and after speech in each sample file. Although audio with low recording volume or disruptive background noise isn't helpful, it shouldn't limit or degrade your custom model. Always consider upgrading your microphones and signal processing hardware before gathering audio samples.
 
@@ -106,7 +106,7 @@ Custom speech projects require audio files with these properties:
 | File format | RIFF (WAV) |
 | Sample rate | 8,000 Hz or 16,000 Hz |
 | Channels | 1 (mono) |
-| Maximum length per audio | Two hours (testing) / 60 s (training)<br/><br/>Training with audio has a maximum audio length of 60 seconds per file. For audio files longer than 60 seconds, only the corresponding transcription files are used for training. If all audio files are longer than 60 seconds, the training fails.|
+| Maximum length per audio | Two hours (testing) / 40 s (training)<br/><br/>Training with audio has a maximum audio length of 40 seconds per file (up to 30 seconds for Whisper customization). For audio files longer than 40 seconds, only the corresponding text from the transcription files is used for training. If all audio files are longer than 40 seconds, the training fails.|
 | Sample format | PCM, 16-bit |
 | Archive format | .zip |
 | Maximum zip size | 2 GB or 10,000 files |
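Because these constraints are easy to get wrong when assembling a dataset, a minimal pre-upload check with Python's standard `wave` module might look like the following sketch (the file name is hypothetical, and the 40-second training limit is taken from the table above):

```python
import wave

MAX_TRAINING_SECONDS = 40  # per-file training limit from the table above

def check_training_wav(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks usable."""
    problems = []
    with wave.open(path, "rb") as wav:  # wave reads RIFF/WAV PCM only
        if wav.getframerate() not in (8000, 16000):
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels (mono required)")
        if wav.getsampwidth() != 2:
            problems.append(f"{8 * wav.getsampwidth()}-bit samples (16-bit PCM required)")
        duration = wav.getnframes() / wav.getframerate()
        if duration > MAX_TRAINING_SECONDS:
            problems.append(f"{duration:.1f} s long (over the {MAX_TRAINING_SECONDS} s training limit)")
    return problems

print(check_training_wav("speech01.wav") or "OK")
```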

articles/ai-services/speech-service/how-to-custom-speech-train-model.md

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ You can use a custom model for a limited time after it was trained.
 > [!IMPORTANT]
 > If you will train a custom model with audio data, choose a Speech resource region with dedicated hardware for training audio data. After a model is trained, you can [copy it to a Speech resource](#copy-a-model) in another region as needed.
 >
-> In regions with dedicated hardware for custom speech training, the Speech service will use up to 20 hours of your audio training data, and can process about 10 hours of data per day. In other regions, the Speech service uses up to 8 hours of your audio data, and can process about 1 hour of data per day. See footnotes in the [regions](regions.md#speech-service) table for more information.
+> In regions with dedicated hardware for custom speech training, the Speech service will use up to 100 hours of your audio training data, and can process about 10 hours of data per day. See footnotes in the [regions](regions.md#speech-service) table for more information.
 
 ## Create a model
 