articles/ai-services/speech-service/faq-stt.yml (1 addition & 1 deletion)
@@ -157,7 +157,7 @@ sections:
In general, Speech service processes approximately 10 hours of audio data per day in regions that have dedicated hardware. Training with text only is faster and ordinarily finishes within minutes.
- Use one of the regions where dedicated hardware is available for training. The Speech service uses up to 20 hours of audio for training in these regions.
+ Use one of the regions where dedicated hardware is available for training. The Speech service uses up to 100 hours of audio for training in these regions.
articles/ai-services/speech-service/how-to-custom-speech-human-labeled-transcriptions.md (8 additions & 4 deletions)
@@ -12,9 +12,13 @@ ms.author: eur
# How to create human-labeled transcriptions
- Human-labeled transcriptions are word-by-word transcriptions of an audio file. You use human-labeled transcriptions to improve recognition accuracy, especially when words are deleted or incorrectly replaced. This guide can help you create high-quality transcriptions.
+ Human-labeled transcriptions are word-by-word transcriptions of an audio file. You use human-labeled transcriptions to evaluate model accuracy and to improve recognition accuracy, especially when words are deleted or incorrectly replaced. This guide can help you create high-quality transcriptions.
- A large sample of transcription data is required to improve recognition. We suggest providing between 1 and 20 hours of audio data. The Speech service uses up to 20 hours of audio for training. This guide has sections for US English, Mandarin Chinese, and German locales.
+ A representative sample of transcription data is recommended to evaluate model accuracy. The data should cover various speakers and utterances that are representative of what users say to the application. For test data, the maximum duration of each individual audio file is 2 hours.
+
+ A large sample of transcription data is required to improve recognition. We suggest providing between 1 and 100 hours of audio data. The Speech service uses up to 100 hours of audio for training (up to 20 hours for older models that don't charge for training). Each individual audio file shouldn't be longer than 40 seconds (up to 30 seconds for Whisper customization).
+
+ This guide has sections for US English, Mandarin Chinese, and German locales.
The transcriptions for all WAV files are contained in a single plain-text file (.txt or .tsv). Each line of the transcription file contains the name of one of the audio files, followed by the corresponding transcription. The file name and transcription are separated by a tab (`\t`).
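For illustration, the tab-separated layout described above is easy to produce in a script. A minimal Python sketch follows; the file names and transcriptions are hypothetical:

```python
# Minimal sketch (hypothetical file names): write a transcription file
# where each line is "<WAV file name><TAB><transcription>".
entries = [
    ("speech01.wav", "speech recognition is awesome"),
    ("speech02.wav", "the weather in seattle is rainy today"),
]

with open("trans.txt", "w", encoding="utf-8") as f:
    for wav_name, transcription in entries:
        f.write(f"{wav_name}\t{transcription}\n")
```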
@@ -38,7 +42,7 @@ Here are a few examples:
| Characters to avoid | Substitution | Notes |
| ------------------- | ------------ | ----- |
- |“Hello world”| "Hello world" | The opening and closing quotations marks are substituted with appropriate ASCII characters. |
+ |"Hello world"| "Hello world" | The opening and closing quotations marks are substituted with appropriate ASCII characters. |
| John’s day | John's day | The apostrophe is substituted with the appropriate ASCII character. |
| It was good—no, it was great! | it was good--no, it was great! | The em dash is substituted with two hyphens. |
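These substitutions can be automated before uploading transcripts. A minimal Python sketch; the mapping is illustrative, not exhaustive:

```python
# Illustrative sketch: replace common Unicode punctuation with the ASCII
# substitutions shown in the table above.
UNICODE_TO_ASCII = {
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (apostrophe)
    "\u2014": "--",  # em dash becomes two hyphens
}

def to_ascii_punctuation(text: str) -> str:
    for char, replacement in UNICODE_TO_ASCII.items():
        text = text.replace(char, replacement)
    return text

print(to_ascii_punctuation("It was good\u2014no, it was great!"))
# It was good--no, it was great!
```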
@@ -48,7 +52,7 @@ Text normalization is the transformation of words into a consistent format used
- Write out abbreviations in words.
- Write out nonstandard numeric strings in words (such as accounting terms).
- - Non-alphabetic characters or mixed alphanumeric characters should be transcribed as pronounced.
+ - Nonalphabetic characters or mixed alphanumeric characters should be transcribed as pronounced.
- Abbreviations that are pronounced as words shouldn't be edited (such as "radar", "laser", "RAM", or "NATO").
- Write out abbreviations that are pronounced as separate letters with each letter separated by a space.
- If you use audio, transcribe numbers as words that match the audio (for example, "101" could be pronounced as "one oh one" or "one hundred and one").
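Taken together, the rules in this list map raw text onto spoken forms. A few hypothetical before/after pairs, for illustration only:

```python
# Hypothetical before/after pairs illustrating the normalization rules above.
examples = [
    ("Dr. Smith", "doctor smith"),                       # abbreviation written out in words
    ("The radar and the RAM", "The radar and the RAM"),  # pronounced as words: leave as-is
    ("Play it on TV", "Play it on T V"),                 # spelled letter by letter: separate with spaces
    ("Room 101", "room one oh one"),                     # numbers transcribed as pronounced in the audio
]
for before, after in examples:
    print(f"{before} -> {after}")
```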
- |[Audio only](#audio-data-for-training-or-testing)| Yes (visual inspection) | 5+ audio files | Yes (Preview for `en-US`) | 1-20 hours of audio |
- |[Audio + human-labeled transcripts](#audio--human-labeled-transcript-data-for-training-or-testing)| Yes (evaluation of accuracy) | 0.5-5 hours of audio | Yes | 1-20 hours of audio |
+ |[Audio only](#audio-data-for-training-or-testing)| Yes (visual inspection) | 5+ audio files | Yes (Preview for `en-US`) | 1-100 hours of audio |
+ |[Audio + human-labeled transcripts](#audio--human-labeled-transcript-data-for-training-or-testing)| Yes (evaluation of accuracy) | 0.5-5 hours of audio | Yes | 1-100 hours of audio |
|[Plain text](#plain-text-data-for-training)| No | Not applicable | Yes | 1-200 MB of related text |
|[Structured text](#structured-text-data-for-training)| No | Not applicable | Yes | Up to 10 classes with up to 4,000 items and up to 50,000 training sentences |
|[Pronunciation](#pronunciation-data-for-training)| No | Not applicable | Yes | 1 KB to 1 MB of pronunciation text |
@@ -43,7 +43,7 @@ Training with plain text or structured text usually finishes within a few minute
>
> Start with small sets of sample data that match the language, acoustics, and hardware where your model will be used. Small datasets of representative data can expose problems before you invest in gathering larger datasets for training. For sample custom speech data, see <a href="https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/sampledata/customspeech" target="_target">this GitHub repository</a>.
- If you train a custom model with audio data, choose a Speech resource region with dedicated hardware for training audio data. For more information, see footnotes in the [regions](regions.md#speech-service) table. In regions with dedicated hardware for custom speech training, the Speech service uses up to 20 hours of your audio training data, and can process about 10 hours of data per day. In other regions, the Speech service uses up to 8 hours of your audio data, and can process about 1 hour of data per day. After the model is trained, you can copy the model to another region as needed with the [Models_CopyTo](/rest/api/speechtotext/models/copy-to) REST API.
+ If you train a custom model with audio data, choose a Speech resource region with dedicated hardware for training audio data. For more information, see footnotes in the [regions](regions.md#speech-service) table. In regions with dedicated hardware for custom speech training, the Speech service uses up to 100 hours of your audio training data, and can process about 10 hours of data per day. After the model is trained, you can copy the model to another region as needed with the [Models_CopyTo](/rest/api/speechtotext/models/copy-to) REST API.
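The copy step itself is one REST call. A hedged sketch follows; the exact route shape varies by Speech to text REST API version, so the v3.1-style path, the endpoint, and the IDs below are assumptions to verify against the linked Models_CopyTo reference:

```python
# Hedged sketch of Models_CopyTo (assumed v3.1-style route; verify against
# the REST reference). Endpoint, keys, and model ID are hypothetical.
import requests  # third-party: pip install requests

SOURCE_ENDPOINT = "https://eastus.api.cognitive.microsoft.com"  # hypothetical
SOURCE_KEY = "<source-speech-resource-key>"
TARGET_KEY = "<target-speech-resource-key>"  # key of the resource in the destination region
MODEL_ID = "<custom-model-id>"               # hypothetical model GUID

response = requests.post(
    f"{SOURCE_ENDPOINT}/speechtotext/v3.1/models/{MODEL_ID}:copyto",  # assumed route
    headers={"Ocp-Apim-Subscription-Key": SOURCE_KEY},
    json={"targetSubscriptionKey": TARGET_KEY},
)
response.raise_for_status()
print(response.json()["self"])  # URL of the copied model in the target region
```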
## Consider datasets by scenario
@@ -89,7 +89,7 @@ Consider these details:
* The Speech service automatically uses the transcripts to improve the recognition of domain-specific words and phrases, as though they were added as related text.
* It can take several days for a training operation to finish. To improve the speed of training, be sure to create your Speech service subscription in a region with dedicated hardware for training.
- A large training dataset is required to improve recognition. Generally, we recommend that you provide word-by-word transcriptions for 1 to 20 hours of audio. However, even as little as 30 minutes can help improve recognition results. Although creating human-labeled transcription can take time, improvements in recognition are only as good as the data that you provide. You should upload only high-quality transcripts.
+ A large training dataset is required to improve recognition. Generally, we recommend that you provide word-by-word transcriptions for 1 to 100 hours of audio (up to 20 hours for older models that do not charge for training). However, even as little as 30 minutes can help improve recognition results. Although creating human-labeled transcription can take time, improvements in recognition are only as good as the data that you provide. You should upload only high-quality transcripts.
Audio files can have silence at the beginning and end of the recording. If possible, include at least a half-second of silence before and after speech in each sample file. Although audio with low recording volume or disruptive background noise isn't helpful, it shouldn't limit or degrade your custom model. Always consider upgrading your microphones and signal processing hardware before gathering audio samples.
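Adding that half-second of leading and trailing silence can be scripted. A minimal sketch for PCM WAV files, using only the Python standard library (file names hypothetical):

```python
# Minimal sketch: pad ~0.5 s of silence before and after speech in a PCM WAV
# file. Zero-valued bytes are silence for signed PCM audio.
import wave

def pad_with_silence(src_path: str, dst_path: str, seconds: float = 0.5) -> None:
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())
    frame_size = params.sampwidth * params.nchannels
    silence = b"\x00" * (int(params.framerate * seconds) * frame_size)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(silence + frames + silence)

pad_with_silence("speech01.wav", "speech01_padded.wav")  # hypothetical file names
```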
@@ -106,7 +106,7 @@ Custom speech projects require audio files with these properties:
| File format | RIFF (WAV) |
| Sample rate | 8,000 Hz or 16,000 Hz |
| Channels | 1 (mono) |
- | Maximum length per audio | Two hours (testing) / 60 s (training)<br/><br/>Training with audio has a maximum audio length of 60 seconds per file. For audio files longer than 60 seconds, only the corresponding transcription files are used for training. If all audio files are longer than 60 seconds, the training fails.|
+ | Maximum length per audio | Two hours (testing) / 40 s (training)<br/><br/>Training with audio has a maximum audio length of 40 seconds per file (up to 30 seconds for Whisper customization). For audio files longer than 40 seconds, only the corresponding text from the transcription files is used for training. If all audio files are longer than 40 seconds, the training fails.|
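A quick pre-upload check against this table is straightforward. A minimal sketch using the standard-library wave module (file name hypothetical), reflecting the updated 40-second training limit:

```python
# Minimal sketch: flag WAV files that don't match the property table above
# (8 kHz or 16 kHz, mono) or exceed the 40 s training limit (30 s for
# Whisper customization).
import wave

MAX_TRAINING_SECONDS = 40

def check_training_wav(path: str) -> list[str]:
    problems = []
    with wave.open(path, "rb") as wav:  # wave reads RIFF (WAV) files only
        if wav.getframerate() not in (8000, 16000):
            problems.append(f"sample rate {wav.getframerate()} Hz; expected 8000 or 16000")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels; expected mono")
        duration = wav.getnframes() / wav.getframerate()
        if duration > MAX_TRAINING_SECONDS:
            problems.append(f"{duration:.1f} s; training limit is {MAX_TRAINING_SECONDS} s")
    return problems

print(check_training_wav("speech01.wav"))  # hypothetical file name
```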
articles/ai-services/speech-service/how-to-custom-speech-train-model.md (1 addition & 1 deletion)
@@ -25,7 +25,7 @@ You can use a custom model for a limited time after it was trained. You must per
> [!IMPORTANT]
> If you will train a custom model with audio data, choose a Speech resource region with dedicated hardware for training audio data. After a model is trained, you can [copy it to a Speech resource](#copy-a-model) in another region as needed.
>
- > In regions with dedicated hardware for custom speech training, the Speech service will use up to 20 hours of your audio training data, and can process about 10 hours of data per day. In other regions, the Speech service uses up to 8 hours of your audio data, and can process about 1 hour of data per day. See footnotes in the [regions](regions.md#speech-service) table for more information.
+ > In regions with dedicated hardware for custom speech training, the Speech service will use up to 100 hours of your audio training data, and can process about 10 hours of data per day. See footnotes in the [regions](regions.md#speech-service) table for more information.