
Commit a21a24f

Merge pull request #261698 from eric-urban/eur/ter
add TER docs
2 parents 30a0dd0 + 946cbdc commit a21a24f

4 files changed: +28 -9 lines changed

articles/ai-services/speech-service/faq-stt.yml

Lines changed: 1 addition & 1 deletion
@@ -164,7 +164,7 @@ sections:
   - question: |
       What is word error rate (WER), and how is it computed?
     answer: |
-      WER is the evaluation metric for speech recognition. WER is calculated as the total number of errors (insertions, deletions, and substitutions), divided by the total number of words in the reference transcription. For more information, see [Test model quantitatively](how-to-custom-speech-evaluate-data.md#evaluate-word-error-rate).
+      WER is the evaluation metric for speech recognition. WER is calculated as the total number of errors (insertions, deletions, and substitutions), divided by the total number of words in the reference transcription. For more information, see [Test model quantitatively](how-to-custom-speech-evaluate-data.md#evaluate-word-error-rate-wer).
 
   - question: |
       How do I determine whether the results of an accuracy test are good?
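For context on the WER definition shown in this diff, here is a hypothetical worked example (illustrative only, not part of the committed change): if a reference transcription contains 20 words and recognition produces 1 insertion, 2 deletions, and 3 substitutions, then

$$
WER = {{1+2+3}\over 20} \times 100 = 30\%
$$

A lower WER indicates a more accurate transcription.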

articles/ai-services/speech-service/how-to-custom-speech-continuous-integration-continuous-deployment.md

Lines changed: 2 additions & 2 deletions
@@ -30,7 +30,7 @@ Along the way, the workflows should name and store data, tests, test files, mode
 
 ### CI workflow for testing data updates
 
-The principal purpose of the CI/CD workflows is to build a new model using the training data, and to test that model using the testing data to establish whether the [Word Error Rate](how-to-custom-speech-evaluate-data.md#evaluate-word-error-rate) (WER) has improved compared to the previous best-performing model (the "benchmark model"). If the new model performs better, it becomes the new benchmark model against which future models are compared.
+The principal purpose of the CI/CD workflows is to build a new model using the training data, and to test that model using the testing data to establish whether the [Word Error Rate](how-to-custom-speech-evaluate-data.md#evaluate-word-error-rate-wer) (WER) has improved compared to the previous best-performing model (the "benchmark model"). If the new model performs better, it becomes the new benchmark model against which future models are compared.
 
 The CI workflow for testing data updates should retest the current benchmark model with the updated test data to calculate the revised WER. This ensures that when the WER of a new model is compared to the WER of the benchmark, both models have been tested against the same test data and you're comparing like with like.
 
@@ -78,7 +78,7 @@ The [Speech DevOps template repo](https://github.com/Azure-Samples/Speech-Servic
 - Copy the template repository to your GitHub account, then create Azure resources and a [service principal](../../active-directory/develop/app-objects-and-service-principals.md#service-principal-object) for the GitHub Actions CI/CD workflows.
 - Walk through the "[dev inner loop](/dotnet/architecture/containerized-lifecycle/design-develop-containerized-apps/docker-apps-inner-loop-workflow)." Update training and testing data from a feature branch, test the changes with a temporary development model, and raise a pull request to propose and review the changes.
 - When training data is updated in a pull request to *main*, train models with the GitHub Actions CI workflow.
-- Perform automated accuracy testing to establish a model's [Word Error Rate](how-to-custom-speech-evaluate-data.md#evaluate-word-error-rate) (WER). Store the test results in Azure Blob.
+- Perform automated accuracy testing to establish a model's [Word Error Rate](how-to-custom-speech-evaluate-data.md#evaluate-word-error-rate-wer) (WER). Store the test results in Azure Blob.
 - Execute the CD workflow to create an endpoint when the WER improves.
 
 ## Next steps

articles/ai-services/speech-service/how-to-custom-speech-create-project.md

Lines changed: 1 addition & 1 deletion
@@ -137,7 +137,7 @@ There are a few approaches to using Custom Speech models:
 - A custom model augments the base model to include domain-specific vocabulary shared across all areas of the custom domain.
 - Multiple custom models can be used when the custom domain has multiple areas, each with a specific vocabulary.
 
-One recommended way to see if the base model will suffice is to analyze the transcription produced from the base model and compare it with a human-generated transcript for the same audio. You can compare the transcripts and obtain a [word error rate (WER)](how-to-custom-speech-evaluate-data.md#evaluate-word-error-rate) score. If the WER score is high, training a custom model to recognize the incorrectly identified words is recommended.
+One recommended way to see if the base model will suffice is to analyze the transcription produced from the base model and compare it with a human-generated transcript for the same audio. You can compare the transcripts and obtain a [word error rate (WER)](how-to-custom-speech-evaluate-data.md#evaluate-word-error-rate-wer) score. If the WER score is high, training a custom model to recognize the incorrectly identified words is recommended.
 
 Multiple models are recommended if the vocabulary varies across the domain areas. For instance, Olympic commentators report on various events, each associated with its own vernacular. Because each Olympic event vocabulary differs significantly from others, building a custom model specific to an event increases accuracy by limiting the utterance data relative to that particular event. As a result, the model doesn't need to sift through unrelated data to make a match. Regardless, training still requires a decent variety of training data. Include audio from various commentators who have different accents, gender, age, etcetera.
 
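For the base-model-versus-human-transcript comparison described in the hunk above, a WER score can be computed with a standard word-level edit distance. This is a minimal, self-contained sketch (the transcripts and the normalization are hypothetical; it is not the Speech service's implementation):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length, as a percentage."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = minimum edits (insertions, deletions, substitutions) to turn
    # the first i reference words into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return 100 * dp[len(ref)][len(hyp)] / len(ref)


# Hypothetical human-labeled and base-model transcripts, for illustration only.
human = "that will cost nine hundred dollars"
base_model = "that will cost nine hundred collars"
print(f"WER: {word_error_rate(human, base_model):.1f}%")  # one substitution over six words, about 16.7%
```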
articles/ai-services/speech-service/how-to-custom-speech-evaluate-data.md

Lines changed: 24 additions & 5 deletions
@@ -6,9 +6,8 @@ author: eric-urban
 manager: nitinme
 ms.service: azure-ai-speech
 ms.topic: how-to
-ms.date: 11/29/2022
+ms.date: 12/20/2023
 ms.author: eur
-ms.custom: ignite-fall-2021
 zone_pivot_groups: speech-studio-cli-rest
 show_latex: true
 no-loc: [$$, '\times', '\over']
@@ -22,7 +21,7 @@ In this article, you learn how to quantitatively measure and improve the accurac
 
 ## Create a test
 
-You can test the accuracy of your custom model by creating a test. A test requires a collection of audio files and their corresponding transcriptions. You can compare a custom model's accuracy with a speech to text base model or another custom model. After you [get](#get-test-results) the test results, [evaluate](#evaluate-word-error-rate) the word error rate (WER) compared to speech recognition results.
+You can test the accuracy of your custom model by creating a test. A test requires a collection of audio files and their corresponding transcriptions. You can compare a custom model's accuracy with a speech to text base model or another custom model. After you [get](#get-test-results) the test results, [evaluate the word error rate (WER)](#evaluate-word-error-rate-wer) compared to speech recognition results.
 
 ::: zone pivot="speech-studio"
 
@@ -222,7 +221,7 @@ The top-level `self` property in the response body is the evaluation's URI. Use
 
 ## Get test results
 
-You should get the test results and [evaluate](#evaluate-word-error-rate) the word error rate (WER) compared to speech recognition results.
+You should get the test results and [evaluate](#evaluate-word-error-rate-wer) the word error rate (WER) compared to speech recognition results.
 
 ::: zone pivot="speech-studio"
 
@@ -386,7 +385,7 @@ You should receive a response body in the following format:
 ::: zone-end
 
 
-## Evaluate word error rate
+## Evaluate word error rate (WER)
 
 The industry standard for measuring model accuracy is [word error rate (WER)](https://en.wikipedia.org/wiki/Word_error_rate). WER counts the number of incorrect words identified during recognition, and divides the sum by the total number of words provided in the human-labeled transcript (N).
 
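The sentence above describes the standard WER formula. Expressed in the same notation this article uses for TER later in the diff (I = insertions, D = deletions, S = substitutions, N = words in the human-labeled transcript):

$$
WER = {{I+D+S}\over N} \times 100
$$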
@@ -423,6 +422,26 @@ How the errors are distributed is important. When many deletion errors are encou
 
 By analyzing individual files, you can determine what type of errors exist, and which errors are unique to a specific file. Understanding issues at the file level will help you target improvements.
 
+## Evaluate token error rate (TER)
+
+In addition to [word error rate](#evaluate-word-error-rate-wer), you can use the extended measurement of **Token Error Rate (TER)** to evaluate quality on the final end-to-end display format. Beyond the lexical format (`That will cost $900.` instead of `that will cost nine hundred dollars`), TER takes into account display format aspects such as punctuation, capitalization, and ITN. Learn more about [Display output formatting with speech to text](display-text-format.md).
+
+TER counts the number of incorrect tokens identified during recognition, and divides the sum by the total number of tokens provided in the human-labeled transcript (N).
+
+$$
+TER = {{I+D+S}\over N} \times 100
+$$
+
+The formula for calculating TER is very similar to the WER formula. The only difference is that TER is calculated at the token level instead of the word level.
+* Insertion (I): Tokens that are incorrectly added in the hypothesis transcript
+* Deletion (D): Tokens that are undetected in the hypothesis transcript
+* Substitution (S): Tokens that were substituted between reference and hypothesis
+
+In a real-world case, you can analyze both WER and TER results to get the desired improvements.
+
+> [!NOTE]
+> To measure TER, you need to make sure the [audio + transcript testing data](./how-to-custom-speech-test-and-train.md#audio--human-labeled-transcript-data-for-training-or-testing) includes transcripts with display formatting such as punctuation, capitalization, and ITN.
+
 ## Example scenario outcomes
 
 Speech recognition scenarios vary by audio quality and language (vocabulary and speaking style). The following table examines four common scenarios:
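To make the TER addition concrete, here is a hypothetical worked example. The tokenization is an illustrative simplification that treats each punctuation mark as its own token. Suppose the human-labeled display transcript is `That will cost $900.` and the recognized display output is `that will cost $900`. The capitalization error counts as one substitution and the missing period as one deletion, so with N = 5 reference tokens:

$$
TER = {{0+1+1}\over 5} \times 100 = 40\%
$$

Because the underlying words are identical, the lexical WER for this utterance would be 0%; this is the kind of display-format difference that TER is designed to surface.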
