---
title: Improve a model for Custom Speech - Speech service
titleSuffix: Azure Cognitive Services
description: Particular kinds of human-labeled transcriptions and related text can improve recognition accuracy for a speech-to-text model based on the speaking scenario.
services: cognitive-services
author: v-demjoh
manager: nitinme

ms.service: cognitive-services
ms.subservice: speech-service
ms.topic: conceptual
ms.date: 05/20/2020
ms.author: v-demjoh
---

# Improve Custom Speech accuracy

In this article, you'll learn how to improve the quality of your custom model by adding audio, human-labeled transcripts, and related text.

## Accuracy in different scenarios

Speech recognition scenarios vary by audio quality and language (vocabulary and speaking style). The following table examines four common scenarios:

| Scenario | Audio Quality | Vocabulary | Speaking Style |
|----------|---------------|------------|----------------|
| Call center | Low, 8 kHz, could be 2 humans on 1 audio channel, could be compressed | Narrow, unique to domain and products | Conversational, loosely structured |
| Voice assistant (such as Cortana, or a drive-through window) | High, 16 kHz | Entity heavy (song titles, products, locations) | Clearly stated words and phrases |
| Dictation (instant message, notes, search) | High, 16 kHz | Varied | Note-taking |
| Video closed captioning | Varied, including varied microphone use, added music | Varied, from meetings, recited speech, musical lyrics | Read, prepared, or loosely structured |

Different scenarios produce different quality outcomes. The following table shows the typical [word error rate (WER)](how-to-custom-speech-evaluate-data.md) for each of these four scenarios, along with the error types that are most common in each.

| Scenario | Speech Recognition Quality | Insertion Errors | Deletion Errors | Substitution Errors |
|----------|----------------------------|------------------|-----------------|---------------------|
| Call center | Medium (< 30% WER) | Low, except when other people talk in the background | Can be high. Call centers can be noisy, and overlapping speakers can confuse the model | Medium. Products and people's names can cause these errors |
| Voice assistant | High (can be < 10% WER) | Low | Low | Medium, due to song titles, product names, or locations |
| Dictation | High (can be < 10% WER) | Low | Low | High |
| Video closed captioning | Depends on video type (can be < 50% WER) | Low | Can be high due to music, noises, microphone quality | Jargon may cause these errors |

Breaking the WER down into its components (the numbers of insertion, deletion, and substitution errors) helps you decide what kind of data to add to improve the model. Use the [Custom Speech portal](https://speech.microsoft.com/customspeech) to view the quality of a baseline model. The portal reports the insertion, substitution, and deletion error rates that together make up the WER.

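To make the arithmetic concrete, here's a minimal sketch (our illustration, not part of any Speech SDK) that computes the WER from the error counts an evaluation reports:

```python
def word_error_rate(insertions: int, deletions: int, substitutions: int,
                    reference_words: int) -> float:
    """WER = (insertions + deletions + substitutions) / number of words
    in the human-labeled reference transcript."""
    return (insertions + deletions + substitutions) / reference_words

# Hypothetical call center evaluation with 1,000 reference words:
# few insertions, many deletions (overlapping speakers), some substitutions.
wer = word_error_rate(insertions=20, deletions=150, substitutions=80,
                      reference_words=1000)
print(f"WER: {wer:.1%}")  # WER: 25.0% -- within the "Medium (< 30% WER)" band
```
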
## Improve model recognition

You can reduce recognition errors by adding training data in the [Custom Speech portal](https://speech.microsoft.com/customspeech).

Plan to maintain your custom model by adding source materials periodically. Your custom model needs additional training to keep up with changes to your entities, such as updated product names, song titles, or new service locations.

The following sections describe how each kind of additional training data can reduce errors.

### Add related text sentences

Additional related text sentences primarily reduce substitution errors caused by misrecognition of common words and domain-specific words, by showing those words in context. Domain-specific words can be uncommon or made-up, but their pronunciation must be straightforward to be recognized.

> [!NOTE]
> Avoid related text sentences that include noise such as unrecognizable characters or words.

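As a minimal illustration of that guidance, the sketch below drops candidate sentences that show signs of encoding damage before writing them out one per line as plain UTF-8 text. The sentences and the cleaning heuristic are illustrative assumptions, not part of any SDK:

```python
import unicodedata

def is_clean(sentence: str) -> bool:
    """Reject sentences that show signs of encoding damage: the Unicode
    replacement character (U+FFFD) or embedded control characters."""
    if "\ufffd" in sentence:
        return False
    return not any(unicodedata.category(ch).startswith("C") for ch in sentence)

candidates = [
    "Do you carry polarized sunglasses?",
    "My order con\ufffdrmation never arrived.",   # garbled: dropped
    "Can I exchange swimwear that I bought online?",
]

# One sentence per line, UTF-8 plain text.
with open("related-text.txt", "w", encoding="utf-8") as f:
    for sentence in filter(is_clean, candidates):
        f.write(sentence + "\n")
```
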
### Add audio with human-labeled transcripts

Audio with human-labeled transcripts offers the greatest accuracy improvements if the audio comes from the target use case. Samples must cover the full scope of speech you want to recognize. For example, a call center for a retail store gets most of its calls about swimwear and sunglasses during the summer months, so samples collected only in summer would underrepresent the rest of the year. Make sure your sample covers the full scope of speech you want to recognize.

Consider these details:

* Custom Speech can only capture word context to reduce substitution errors, not insertion or deletion errors.
* Avoid samples that include transcription errors, but do include a diversity of audio quality.
* Avoid sentences that are not related to your problem domain. Unrelated sentences can harm your model.
* When the quality of transcripts varies, you can duplicate exceptionally good sentences (like excellent transcriptions that include key phrases) to increase their weight, as shown in the sketch after this list.

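Here's one hypothetical way to apply that weighting. The sketch assumes the tab-separated `audio file name<TAB>transcription` layout that Custom Speech transcript files use, and the quality scores stand in for your own review process:

```python
import shutil
from pathlib import Path

# (audio file, human-labeled transcription, reviewer quality score 0-1).
# Scores are hypothetical, and the .wav files are assumed to exist locally.
pairs = [
    ("call_001.wav", "I'd like to return a pair of sunglasses.", 0.98),
    ("call_002.wav", "Um, so, about the thing I ordered.", 0.60),
]

DUPLICATE_ABOVE = 0.95  # duplicate only exceptionally good samples

with open("trans.txt", "w", encoding="utf-8") as manifest:
    for audio, text, quality in pairs:
        copies = 2 if quality >= DUPLICATE_ABOVE else 1
        for i in range(copies):
            if i == 0:
                name = audio
            else:
                # Each manifest entry needs its own audio file, so copy it.
                name = f"{Path(audio).stem}_dup{i}{Path(audio).suffix}"
                shutil.copy(audio, name)
            manifest.write(f"{name}\t{text}\n")
```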

### Add new words with pronunciation

Words that are made-up or highly specialized may have unique pronunciations. The service can recognize these words if they can be broken down into smaller words that spell out the pronunciation. For example, to recognize **Xbox**, provide the pronunciation **X box**. This approach won't increase overall accuracy, but it can improve recognition of these keywords.

> [!NOTE]
> This technique is only available for some languages at this time. See customization for pronunciation in [the Speech-to-text table](language-support.md) for details.

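A minimal sketch of building such a pronunciation file follows, using the tab-separated `display form<TAB>spoken form` layout that Custom Speech pronunciation files use. The entries other than **Xbox** are illustrative:

```python
# Display form on the left, spoken (phonetic) form on the right.
entries = {
    "Xbox": "x box",
    "3CPO": "three c p o",
    "CNTK": "c n t k",
}

with open("pronunciations.txt", "w", encoding="utf-8") as f:
    for display, spoken in entries.items():
        f.write(f"{display}\t{spoken}\n")
```
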
## Sources by scenario

The following table shows speech recognition scenarios and lists source materials to consider within the three training content categories listed above.

| Scenario | Related text sentences | Audio + human-labeled transcripts | New words with pronunciation |
|----------|------------------------|------------------------------|------------------------------|
| Call center | Marketing documents, website, product reviews related to call center activity | Call center calls transcribed by humans | Terms that have ambiguous pronunciations (see **Xbox** above) |
| Voice assistant | List sentences using all combinations of commands and entities | Record voices speaking commands into the device, and transcribe into text | Names (movies, songs, products) that have unique pronunciations |
| Dictation | Written input, like instant messages or emails | Similar to above | Similar to above |
| Video closed captioning | TV show scripts, movies, marketing content, video summaries | Exact transcripts of videos | Similar to above |

## Next steps

- [Train your model](how-to-custom-speech-train-model.md)

## Additional resources

- [Prepare and test your data](how-to-custom-speech-test-data.md)
- [Inspect your data](how-to-custom-speech-inspect-data.md)