---
title: Improve a model for Custom Speech - Speech service
titleSuffix: Azure Cognitive Services
description: Particular kinds of human-labeled transcriptions and related text can improve recognition accuracy for a speech-to-text model based on the speaking scenario.
services: cognitive-services
author: v-demjoh
manager: nitinme

ms.service: cognitive-services
ms.subservice: speech-service
ms.topic: conceptual
ms.date: 05/20/2020
ms.author: v-demjoh
---

# Improve Custom Speech accuracy

In this article, you'll learn how to improve the quality of your custom model by adding audio, human-labeled transcripts, and related text.

## Accuracy in different scenarios

Speech recognition scenarios vary by audio quality and language (vocabulary and speaking style). The following table examines four common scenarios:

| Scenario | Audio Quality | Vocabulary | Speaking Style |
|----------|---------------|------------|----------------|
| Call center | Low, 8 kHz, could be 2 humans on 1 audio channel, could be compressed | Narrow, unique to domain and products | Conversational, loosely structured |
| Voice assistant (such as Cortana, or a drive-through window) | High, 16 kHz | Entity heavy (song titles, products, locations) | Clearly stated words and phrases |
| Dictation (instant message, notes, search) | High, 16 kHz | Varied | Note-taking |
| Video closed captioning | Varied, including varied microphone use, added music | Varied, from meetings, recited speech, musical lyrics | Read, prepared, or loosely structured |

Different scenarios produce different quality outcomes. The following table shows the typical [word error rate (WER)](how-to-custom-speech-evaluate-data.md) for each of these four scenarios, along with the error types that are most common in each.

| Scenario | Speech Recognition Quality | Insertion Errors | Deletion Errors | Substitution Errors |
|----------|----------------------------|------------------|-----------------|---------------------|
| Call center | Medium (< 30% WER) | Low, except when other people talk in the background | Can be high. Call centers can be noisy, and overlapping speakers can confuse the model | Medium. Products and people's names can cause these errors |
| Voice assistant | High (can be < 10% WER) | Low | Low | Medium, due to song titles, product names, or locations |
| Dictation | High (can be < 10% WER) | Low | Low | High |
| Video closed captioning | Depends on video type (can be < 50% WER) | Low | Can be high due to music, noises, microphone quality | Jargon may cause these errors |

Breaking the WER down into its components (the numbers of insertion, deletion, and substitution errors) helps you decide what kind of data to add to improve the model. Use the [Custom Speech portal](https://speech.microsoft.com/customspeech) to view the quality of a baseline model. The portal reports the insertion, substitution, and deletion error rates that together make up the WER.

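To make the arithmetic concrete, here's a minimal sketch (our illustration, not part of any Speech SDK) that computes the WER from the error counts an evaluation reports:

```python
def word_error_rate(insertions: int, deletions: int, substitutions: int,
                    reference_words: int) -> float:
    """WER = (insertions + deletions + substitutions) / number of words
    in the human-labeled reference transcript."""
    return (insertions + deletions + substitutions) / reference_words

# Hypothetical call center evaluation with 1,000 reference words:
# few insertions, many deletions (overlapping speakers), some substitutions.
wer = word_error_rate(insertions=20, deletions=150, substitutions=80,
                      reference_words=1000)
print(f"WER: {wer:.1%}")  # WER: 25.0% -- within the "Medium (< 30% WER)" band
```
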
## Improve model recognition

You can reduce recognition errors by adding training data in the [Custom Speech portal](https://speech.microsoft.com/customspeech).

Plan to maintain your custom model by adding source materials periodically. Your custom model needs additional training to keep up with changes to your entities, such as updated product names, song titles, or new service locations.

The following sections describe how each kind of additional training data can reduce errors.

### Add related text sentences

Additional related text sentences primarily reduce substitution errors caused by misrecognition of common words and domain-specific words, by showing those words in context. Domain-specific words can be uncommon or made-up, but their pronunciation must be straightforward to be recognized.

> [!NOTE]
> Avoid related text sentences that include noise such as unrecognizable characters or words.

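As a minimal illustration of that guidance, the sketch below drops candidate sentences that show signs of encoding damage before writing them out one per line as plain UTF-8 text. The sentences and the cleaning heuristic are illustrative assumptions, not part of any SDK:

```python
import unicodedata

def is_clean(sentence: str) -> bool:
    """Reject sentences that show signs of encoding damage: the Unicode
    replacement character (U+FFFD) or embedded control characters."""
    if "\ufffd" in sentence:
        return False
    return not any(unicodedata.category(ch).startswith("C") for ch in sentence)

candidates = [
    "Do you carry polarized sunglasses?",
    "My order con\ufffdrmation never arrived.",   # garbled: dropped
    "Can I exchange swimwear that I bought online?",
]

# One sentence per line, UTF-8 plain text.
with open("related-text.txt", "w", encoding="utf-8") as f:
    for sentence in filter(is_clean, candidates):
        f.write(sentence + "\n")
```
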
### Add audio with human-labeled transcripts

Audio with human-labeled transcripts offers the greatest accuracy improvements if the audio comes from the target use case. Samples must cover the full scope of speech you want to recognize. For example, a call center for a retail store gets most of its calls about swimwear and sunglasses during the summer months, so samples collected only in summer would underrepresent the rest of the year. Make sure your sample covers the full scope of speech you want to recognize.

Consider these details:

* Custom Speech can only capture word context to reduce substitution errors, not insertion or deletion errors.
* Avoid samples that include transcription errors, but do include a diversity of audio quality.
* Avoid sentences that are not related to your problem domain. Unrelated sentences can harm your model.
* When the quality of transcripts varies, you can duplicate exceptionally good sentences (like excellent transcriptions that include key phrases) to increase their weight, as shown in the sketch after this list.

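Here's one hypothetical way to apply that weighting. The sketch assumes the tab-separated `audio file name<TAB>transcription` layout that Custom Speech transcript files use, and the quality scores stand in for your own review process:

```python
import shutil
from pathlib import Path

# (audio file, human-labeled transcription, reviewer quality score 0-1).
# Scores are hypothetical, and the .wav files are assumed to exist locally.
pairs = [
    ("call_001.wav", "I'd like to return a pair of sunglasses.", 0.98),
    ("call_002.wav", "Um, so, about the thing I ordered.", 0.60),
]

DUPLICATE_ABOVE = 0.95  # duplicate only exceptionally good samples

with open("trans.txt", "w", encoding="utf-8") as manifest:
    for audio, text, quality in pairs:
        copies = 2 if quality >= DUPLICATE_ABOVE else 1
        for i in range(copies):
            if i == 0:
                name = audio
            else:
                # Each manifest entry needs its own audio file, so copy it.
                name = f"{Path(audio).stem}_dup{i}{Path(audio).suffix}"
                shutil.copy(audio, name)
            manifest.write(f"{name}\t{text}\n")
```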

### Add new words with pronunciation

Words that are made-up or highly specialized may have unique pronunciations. The service can recognize these words if they can be broken down into smaller words that spell out the pronunciation. For example, to recognize **Xbox**, provide the pronunciation **X box**. This approach won't increase overall accuracy, but it can improve recognition of these keywords.

> [!NOTE]
> This technique is only available for some languages at this time. See customization for pronunciation in [the Speech-to-text table](language-support.md) for details.

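A minimal sketch of building such a pronunciation file follows, using the tab-separated `display form<TAB>spoken form` layout that Custom Speech pronunciation files use. The entries other than **Xbox** are illustrative:

```python
# Display form on the left, spoken (phonetic) form on the right.
entries = {
    "Xbox": "x box",
    "3CPO": "three c p o",
    "CNTK": "c n t k",
}

with open("pronunciations.txt", "w", encoding="utf-8") as f:
    for display, spoken in entries.items():
        f.write(f"{display}\t{spoken}\n")
```
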
## Sources by scenario

The following table shows speech recognition scenarios and lists source materials to consider within the three training content categories listed above.

| Scenario | Related text sentences | Audio + human-labeled transcripts | New words with pronunciation |
|----------|------------------------|------------------------------|------------------------------|
| Call center | Marketing documents, website, product reviews related to call center activity | Call center calls transcribed by humans | Terms that have ambiguous pronunciations (see **Xbox** above) |
| Voice assistant | List sentences using all combinations of commands and entities | Record voices speaking commands into the device, and transcribe into text | Names (movies, songs, products) that have unique pronunciations |
| Dictation | Written input, like instant messages or emails | Similar to above | Similar to above |
| Video closed captioning | TV show scripts, movies, marketing content, video summaries | Exact transcripts of videos | Similar to above |

## Next steps

- [Train your model](how-to-custom-speech-train-model.md)

## Additional resources

- [Prepare and test your data](how-to-custom-speech-test-data.md)
- [Inspect your data](how-to-custom-speech-inspect-data.md)