Commit ade7980

committed
To address merge conflicts in PR113767, I've moved the change to this PR and will merge from here.
1 parent b330b7c commit ade7980

File tree

6 files changed: +109 -6 lines changed
articles/cognitive-services/Speech-Service/how-to-custom-speech-improve-accuracy.md (new file)

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
---
title: Improve a model for Custom Speech - Speech service
titleSuffix: Azure Cognitive Services
description: Particular kinds of human-labeled transcriptions and related text can improve recognition accuracy for a speech-to-text model based on the speaking scenario.
services: cognitive-services
author: v-demjoh
manager: nitinme

ms.service: cognitive-services
ms.subservice: speech-service
ms.topic: conceptual
ms.date: 05/20/2020
ms.author: v-demjoh
---

# Improve Custom Speech accuracy

In this article, you'll learn how to improve the quality of your custom model by adding audio, human-labeled transcripts, and related text.

## Accuracy in different scenarios

Speech recognition scenarios vary by audio quality and language (vocabulary and speaking style). The following table examines four common scenarios:

| Scenario | Audio Quality | Vocabulary | Speaking Style |
|----------|---------------|------------|----------------|
| Call center | Low, 8 kHz, could be 2 humans on 1 audio channel, could be compressed | Narrow, unique to domain and products | Conversational, loosely structured |
| Voice assistant (such as Cortana, or a drive-through window) | High, 16 kHz | Entity heavy (song titles, products, locations) | Clearly stated words and phrases |
| Dictation (instant message, notes, search) | High, 16 kHz | Varied | Note-taking |
| Video closed captioning | Varied, including varied microphone use, added music | Varied, from meetings, recited speech, musical lyrics | Read, prepared, or loosely structured |

Different scenarios produce different quality outcomes. The following table shows the typical [word error rate (WER)](how-to-custom-speech-evaluate-data.md) for content from these four scenarios, and which error types are most common in each.

| Scenario | Speech Recognition Quality | Insertion Errors | Deletion Errors | Substitution Errors |
|----------|----------------------------|------------------|-----------------|---------------------|
| Call center | Medium (< 30% WER) | Low, except when other people talk in the background | Can be high. Call centers can be noisy, and overlapping speakers can confuse the model | Medium. Products and people's names can cause these errors |
| Voice assistant | High (can be < 10% WER) | Low | Low | Medium, due to song titles, product names, or locations |
| Dictation | High (can be < 10% WER) | Low | Low | High |
| Video closed captioning | Depends on video type (can be < 50% WER) | Low | Can be high due to music, noises, microphone quality | Jargon may cause these errors |

Determining the components of the WER (the number of insertion, deletion, and substitution errors) helps you decide what kind of data to add to improve the model. Use the [Custom Speech portal](https://speech.microsoft.com/customspeech) to view the quality of a baseline model. The portal reports insertion, substitution, and deletion error rates, which together make up the WER.
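
WER is the sum of the three error counts divided by the number of words in the human-labeled reference transcript. A minimal sketch with hypothetical counts (the portal computes these values for you):

```python
# Hypothetical error counts from an accuracy test.
insertions, deletions, substitutions = 2, 3, 5
reference_words = 100  # words in the human-labeled reference transcript

# WER = (insertions + deletions + substitutions) / reference word count
wer = (insertions + deletions + substitutions) / reference_words
print(f"WER = {wer:.0%}")  # WER = 10%
```

A high substitution count suggests adding related text or pronunciation data, while high deletion counts usually point back at the audio itself (noise, overlapping speakers).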

## Improve model recognition

You can reduce recognition errors by adding training data in the [Custom Speech portal](https://speech.microsoft.com/customspeech).

Plan to maintain your custom model by adding source materials periodically. Your custom model needs additional training to stay aware of changes to your entities, such as updated product names, song titles, or new service locations.

The following sections describe how each kind of additional training data can reduce errors.

### Add related text sentences

Related text sentences primarily reduce substitution errors caused by misrecognition of common words and domain-specific words, by showing those words in context. Domain-specific words can be uncommon or made-up, but their pronunciation must be straightforward for them to be recognized.

> [!NOTE]
> Avoid related text sentences that include noise such as unrecognizable characters or words.
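
For illustration, related text sentences are uploaded as plain text, typically one sentence per line; see [Prepare and test your data](how-to-custom-speech-test-data.md) for the exact requirements. A few hypothetical sentences for a retail call center (the product names are invented):

```text
Does the Contoso Cabana swimsuit come in teal?
I'd like to return a pair of Fabrikam polarized sunglasses.
Can I exchange these sunglasses for a larger size?
```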

### Add audio with human-labeled transcripts

Audio with human-labeled transcripts offers the greatest accuracy improvements if the audio comes from the target use case. Samples must cover the full scope of speech. For example, a call center for a retail store would get most calls about swimwear and sunglasses during summer months. Ensure that your sample includes the full scope of speech you want to detect.

Consider these details:

* Custom Speech can only capture word context to reduce substitution errors, not insertion or deletion errors.
* Avoid samples that include transcription errors, but do include a diversity of audio quality.
* Avoid sentences that are unrelated to your problem domain. Unrelated sentences can harm your model.
* When the quality of transcripts varies, you can duplicate exceptionally good sentences (like excellent transcriptions that include key phrases) to increase their weight.
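
As a sketch of the expected shape (not an official sample), a training transcript file lists each audio file name, a tab, and the transcription, one utterance per line; see [Prepare and test your data](how-to-custom-speech-test-data.md) for the exact requirements:

```text
call-0001.wav	Do you carry the swimsuit in a size medium?
call-0002.wav	I'd like to return a pair of sunglasses.
```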

### Add new words with pronunciation

Words that are made-up or highly specialized may have unique pronunciations. These words can be recognized if they can be broken down into smaller words to pronounce them. For example, to recognize **Xbox**, pronounce it as **X box**. This approach won't increase overall accuracy, but it can increase recognition of these keywords.

> [!NOTE]
> This technique is only available for some languages at this time. See customization for pronunciation in [the Speech-to-text table](language-support.md) for details.
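
For illustration, pronunciation data is a plain-text file with the recognized form, a tab, and the spoken form on each line; the entries below are hypothetical, and the exact format is described in [Prepare and test your data](how-to-custom-speech-test-data.md):

```text
Xbox	x box
3CPO	three c p o
CNTK	c n t k
```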

## Sources by scenario

The following table shows voice recognition scenarios and lists source materials to consider within the three training content categories listed above.

| Scenario | Related text sentences | Audio + human-labeled transcripts | New words with pronunciation |
|----------|------------------------|------------------------------|------------------------------|
| Call center | Marketing documents, website, and product reviews related to call center activity | Call center calls transcribed by humans | Terms that have ambiguous pronunciations (see **Xbox** above) |
| Voice assistant | Lists of sentences that use all combinations of commands and entities | Recordings of voices speaking commands into the device, transcribed into text | Names (movies, songs, products) that have unique pronunciations |
| Dictation | Written input, like instant messages or emails | Similar to above | Similar to above |
| Video closed captioning | TV show scripts, movies, marketing content, video summaries | Exact transcripts of videos | Similar to above |

## Next steps

- [Train your model](how-to-custom-speech-train-model.md)

## Additional resources

- [Prepare and test your data](how-to-custom-speech-test-data.md)
- [Inspect your data](how-to-custom-speech-inspect-data.md)

articles/cognitive-services/Speech-Service/how-to-custom-speech-test-and-train.md

Lines changed: 3 additions & 1 deletion
@@ -22,7 +22,7 @@ This table lists accepted data types, when each data type should be used, and th
| Data type | Used for testing | Recommended quantity | Used for training | Recommended quantity |
|-----------|-----------------|----------|-------------------|----------|
-| [Audio](#audio-data-for-testing) | Yes<br>Used for visual inspection | 5+ audio files | No | N/a |
+| [Audio](#audio-data-for-testing) | Yes<br>Used for visual inspection | 5+ audio files | No | N/A |
| [Audio + Human-labeled transcripts](#audio--human-labeled-transcript-data-for-testingtraining) | Yes<br>Used to evaluate accuracy | 0.5-5 hours of audio | Yes | 1-1,000 hours of audio |
| [Related text](#related-text-data-for-training) | No | N/a | Yes | 1-200 MB of related text |

@@ -76,6 +76,8 @@ Use <a href="http://sox.sourceforge.net" target="_blank" rel="noopener">SoX <spa

To measure Microsoft's speech-to-text accuracy when processing your audio files, you must provide human-labeled transcriptions (word-by-word) for comparison. While human-labeled transcription is often time consuming, it's necessary to evaluate accuracy and to train the model for your use cases. Keep in mind, the improvements in recognition will only be as good as the data provided. For that reason, it's important that only high-quality transcripts are uploaded.

+Audio files can have silence at the beginning and end of the recording. If possible, include at least a half-second of silence before and after speech in each sample file. While audio with low recording volume or disruptive background noise is not helpful, it should not hurt your custom model. Always consider upgrading your microphones and signal processing hardware before gathering audio samples.
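
As a minimal sketch of padding a sample with silence, assuming the third-party pydub package and ffmpeg are installed (neither is part of the Speech service):

```python
from pydub import AudioSegment

# Load a sample and generate 500 ms of silence at the same frame rate.
speech = AudioSegment.from_wav("sample.wav")
silence = AudioSegment.silent(duration=500, frame_rate=speech.frame_rate)

# Pad with at least a half-second of silence before and after the speech.
padded = silence + speech + silence
padded.export("sample-padded.wav", format="wav")
```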

| Property | Value |
|--------------------------|-------------------------------------|
| File format | RIFF (WAV) |

articles/cognitive-services/Speech-Service/how-to-custom-speech-train-model.md

Lines changed: 2 additions & 2 deletions
@@ -38,7 +38,7 @@ The first step to train a model is to upload training data. Use [Prepare and tes
2. Navigate to **Speech-to-text > Custom Speech > Training**.
3. Click **Train model**.
4. Next, give your training a **Name** and **Description**.
-5. From the **Scenario and Baseline model** drop-down menu, select the scenario that best fits your domain. If you're unsure of which scenario to choose, select **General**. The baseline model is the starting point for training. If you don't have a preference, you can use the latest.
+5. From the **Scenario and Baseline model** drop-down menu, select the scenario that best fits your domain. If you're unsure of which scenario to choose, select **General**. The baseline model is the starting point for training. The latest model is usually the best choice.
6. From the **Select training data** page, choose one or multiple audio + human-labeled transcription datasets that you'd like to use for training.
7. Once the training is complete, you can choose to perform accuracy testing on the newly trained model. This step is optional.
8. Select **Create** to build your custom model.
@@ -52,7 +52,7 @@ You can inspect the data and evaluate model accuracy using these documents:
- [Inspect your data](how-to-custom-speech-inspect-data.md)
- [Evaluate your data](how-to-custom-speech-evaluate-data.md)

-If you chose to test accuracy, it's important to select an acoustic dataset that's different from the one you used with your model to get a realistic sense of the models performance.
+If you chose to test accuracy, it's important to select an acoustic dataset that's different from the one you used with your model to get a realistic sense of the model's performance.

## Next steps

articles/cognitive-services/Speech-Service/how-to-custom-speech.md

Lines changed: 4 additions & 2 deletions
@@ -32,9 +32,11 @@ This diagram highlights the pieces that make up the [Custom Speech portal](https

4. [Evaluate accuracy](how-to-custom-speech-evaluate-data.md) - Evaluate the accuracy of the speech-to-text model. The [Custom Speech portal](https://speech.microsoft.com/customspeech) will provide a *Word Error Rate*, which can be used to determine if additional training is required. If you're satisfied with the accuracy, you can use the Speech service APIs directly. If you'd like to improve accuracy by a relative average of 5% - 20%, use the **Training** tab in the portal to upload additional training data, such as human-labeled transcripts and related text.

-5. [Train the model](how-to-custom-speech-train-model.md) - Improve the accuracy of your speech-to-text model by providing written transcripts (10-1,000 hours) and related text (<200 MB) along with your audio test data. This data helps to train the speech-to-text model. After training, retest, and if you're satisfied with the result, you can deploy your model.
+5. [Improve accuracy](how-to-custom-speech-improve-accuracy.md) - Choose additional training data strategically to improve the quality of the speech-to-text model based on your scenario.

-6. [Deploy the model](how-to-custom-speech-deploy-model.md) - Create a custom endpoint for your speech-to-text model and use it in your applications, tools, or products.
+6. [Train the model](how-to-custom-speech-train-model.md) - Improve the accuracy of your speech-to-text model by providing written transcripts (10-1,000 hours) and related text (<200 MB) along with your audio test data. This data helps to train the speech-to-text model. After training, retest, and if you're satisfied with the result, you can deploy your model.
+7. [Deploy the model](how-to-custom-speech-deploy-model.md) - Create a custom endpoint for your speech-to-text model and use it in your applications, tools, or products.

## Set up your Azure account

articles/cognitive-services/Speech-Service/language-support.md

Lines changed: 5 additions & 1 deletion
@@ -19,7 +19,11 @@ Language support varies by Speech service functionality. The following tables su

## Speech-to-text

-Both the Microsoft Speech SDK and the REST API support the following languages (locales). To improve accuracy, customization is offered for a subset of the languages through uploading Audio + Human-labeled Transcripts or Related Text: Sentences. Pronunciation customization is offered through uploading Related Text: Pronunciation. Learn more about customization [here](how-to-custom-speech.md).
+Both the Microsoft Speech SDK and the REST API support the following languages (locales).
+
+To improve accuracy, customization is offered for a subset of the languages through uploading **Audio + Human-labeled Transcripts** or **Related Text: Sentences**. To learn more about customization, see [Get started with Custom Speech](how-to-custom-speech.md).
+
+For more information about how you can improve pronunciation, see [Improve a model for Custom Speech](how-to-custom-speech-improve-accuracy.md#add-new-words-with-pronunciation).

<!--
To get the AM and ML bits:

articles/cognitive-services/Speech-Service/toc.yml

Lines changed: 2 additions & 0 deletions
@@ -63,6 +63,8 @@
  href: how-to-custom-speech-inspect-data.md
- name: Evaluate Custom Speech accuracy
  href: how-to-custom-speech-evaluate-data.md
+- name: Improve Custom Speech accuracy
+  href: how-to-custom-speech-improve-accuracy.md
- name: Train a model for Custom Speech
  href: how-to-custom-speech-train-model.md
- name: Deploy a custom model
