
Commit a21a24f

Merge pull request #261698 from eric-urban/eur/ter
add TER docs
2 parents 30a0dd0 + 946cbdc commit a21a24f

4 files changed: +28 -9 lines changed

articles/ai-services/speech-service/faq-stt.yml

Lines changed: 1 addition & 1 deletion
@@ -164,7 +164,7 @@ sections:
   - question: |
       What is word error rate (WER), and how is it computed?
     answer: |
-      WER is the evaluation metric for speech recognition. WER is calculated as the total number of errors (insertions, deletions, and substitutions), divided by the total number of words in the reference transcription. For more information, see [Test model quantitatively](how-to-custom-speech-evaluate-data.md#evaluate-word-error-rate).
+      WER is the evaluation metric for speech recognition. WER is calculated as the total number of errors (insertions, deletions, and substitutions), divided by the total number of words in the reference transcription. For more information, see [Test model quantitatively](how-to-custom-speech-evaluate-data.md#evaluate-word-error-rate-wer).
 
   - question: |
       How do I determine whether the results of an accuracy test are good?
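For context on the WER definition shown in this diff, here is a hypothetical worked example (illustrative only, not part of the committed change): if a reference transcription contains 20 words and recognition produces 1 insertion, 2 deletions, and 3 substitutions, then

$$
WER = {{1+2+3}\over 20} \times 100 = 30\%
$$

A lower WER indicates a more accurate transcription.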

articles/ai-services/speech-service/how-to-custom-speech-continuous-integration-continuous-deployment.md

Lines changed: 2 additions & 2 deletions
@@ -30,7 +30,7 @@ Along the way, the workflows should name and store data, tests, test files, mode
 
 ### CI workflow for testing data updates
 
-The principal purpose of the CI/CD workflows is to build a new model using the training data, and to test that model using the testing data to establish whether the [Word Error Rate](how-to-custom-speech-evaluate-data.md#evaluate-word-error-rate) (WER) has improved compared to the previous best-performing model (the "benchmark model"). If the new model performs better, it becomes the new benchmark model against which future models are compared.
+The principal purpose of the CI/CD workflows is to build a new model using the training data, and to test that model using the testing data to establish whether the [Word Error Rate](how-to-custom-speech-evaluate-data.md#evaluate-word-error-rate-wer) (WER) has improved compared to the previous best-performing model (the "benchmark model"). If the new model performs better, it becomes the new benchmark model against which future models are compared.
 
 The CI workflow for testing data updates should retest the current benchmark model with the updated test data to calculate the revised WER. This ensures that when the WER of a new model is compared to the WER of the benchmark, both models have been tested against the same test data and you're comparing like with like.
 
@@ -78,7 +78,7 @@ The [Speech DevOps template repo](https://github.com/Azure-Samples/Speech-Servic
 - Copy the template repository to your GitHub account, then create Azure resources and a [service principal](../../active-directory/develop/app-objects-and-service-principals.md#service-principal-object) for the GitHub Actions CI/CD workflows.
 - Walk through the "[dev inner loop](/dotnet/architecture/containerized-lifecycle/design-develop-containerized-apps/docker-apps-inner-loop-workflow)." Update training and testing data from a feature branch, test the changes with a temporary development model, and raise a pull request to propose and review the changes.
 - When training data is updated in a pull request to *main*, train models with the GitHub Actions CI workflow.
-- Perform automated accuracy testing to establish a model's [Word Error Rate](how-to-custom-speech-evaluate-data.md#evaluate-word-error-rate) (WER). Store the test results in Azure Blob.
+- Perform automated accuracy testing to establish a model's [Word Error Rate](how-to-custom-speech-evaluate-data.md#evaluate-word-error-rate-wer) (WER). Store the test results in Azure Blob.
 - Execute the CD workflow to create an endpoint when the WER improves.
 
 ## Next steps

articles/ai-services/speech-service/how-to-custom-speech-create-project.md

Lines changed: 1 addition & 1 deletion
@@ -137,7 +137,7 @@ There are a few approaches to using Custom Speech models:
 - A custom model augments the base model to include domain-specific vocabulary shared across all areas of the custom domain.
 - Multiple custom models can be used when the custom domain has multiple areas, each with a specific vocabulary.
 
-One recommended way to see if the base model will suffice is to analyze the transcription produced from the base model and compare it with a human-generated transcript for the same audio. You can compare the transcripts and obtain a [word error rate (WER)](how-to-custom-speech-evaluate-data.md#evaluate-word-error-rate) score. If the WER score is high, training a custom model to recognize the incorrectly identified words is recommended.
+One recommended way to see if the base model will suffice is to analyze the transcription produced from the base model and compare it with a human-generated transcript for the same audio. You can compare the transcripts and obtain a [word error rate (WER)](how-to-custom-speech-evaluate-data.md#evaluate-word-error-rate-wer) score. If the WER score is high, training a custom model to recognize the incorrectly identified words is recommended.
 
 Multiple models are recommended if the vocabulary varies across the domain areas. For instance, Olympic commentators report on various events, each associated with its own vernacular. Because each Olympic event vocabulary differs significantly from others, building a custom model specific to an event increases accuracy by limiting the utterance data relative to that particular event. As a result, the model doesn't need to sift through unrelated data to make a match. Regardless, training still requires a decent variety of training data. Include audio from various commentators who have different accents, gender, age, etcetera.
 
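For the base-model-versus-human-transcript comparison described in the hunk above, a WER score can be computed with a standard word-level edit distance. This is a minimal, self-contained sketch (the transcripts and the normalization are hypothetical; it is not the Speech service's implementation):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length, as a percentage."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = minimum edits (insertions, deletions, substitutions) to turn
    # the first i reference words into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return 100 * dp[len(ref)][len(hyp)] / len(ref)


# Hypothetical human-labeled and base-model transcripts, for illustration only.
human = "that will cost nine hundred dollars"
base_model = "that will cost nine hundred collars"
print(f"WER: {word_error_rate(human, base_model):.1f}%")  # one substitution over six words, about 16.7%
```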
articles/ai-services/speech-service/how-to-custom-speech-evaluate-data.md

Lines changed: 24 additions & 5 deletions
@@ -6,9 +6,8 @@ author: eric-urban
 manager: nitinme
 ms.service: azure-ai-speech
 ms.topic: how-to
-ms.date: 11/29/2022
+ms.date: 12/20/2023
 ms.author: eur
-ms.custom: ignite-fall-2021
 zone_pivot_groups: speech-studio-cli-rest
 show_latex: true
 no-loc: [$$, '\times', '\over']
@@ -22,7 +21,7 @@ In this article, you learn how to quantitatively measure and improve the accurac
 
 ## Create a test
 
-You can test the accuracy of your custom model by creating a test. A test requires a collection of audio files and their corresponding transcriptions. You can compare a custom model's accuracy with a speech to text base model or another custom model. After you [get](#get-test-results) the test results, [evaluate](#evaluate-word-error-rate) the word error rate (WER) compared to speech recognition results.
+You can test the accuracy of your custom model by creating a test. A test requires a collection of audio files and their corresponding transcriptions. You can compare a custom model's accuracy with a speech to text base model or another custom model. After you [get](#get-test-results) the test results, [evaluate the word error rate (WER)](#evaluate-word-error-rate-wer) compared to speech recognition results.
 
 ::: zone pivot="speech-studio"
 
@@ -222,7 +221,7 @@ The top-level `self` property in the response body is the evaluation's URI. Use
 
 ## Get test results
 
-You should get the test results and [evaluate](#evaluate-word-error-rate) the word error rate (WER) compared to speech recognition results.
+You should get the test results and [evaluate](#evaluate-word-error-rate-wer) the word error rate (WER) compared to speech recognition results.
 
 ::: zone pivot="speech-studio"
 
@@ -386,7 +385,7 @@ You should receive a response body in the following format:
 ::: zone-end
 
 
-## Evaluate word error rate
+## Evaluate word error rate (WER)
 
 The industry standard for measuring model accuracy is [word error rate (WER)](https://en.wikipedia.org/wiki/Word_error_rate). WER counts the number of incorrect words identified during recognition, and divides the sum by the total number of words provided in the human-labeled transcript (N).
 
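The sentence above describes the standard WER formula. Expressed in the same notation this article uses for TER later in the diff (I = insertions, D = deletions, S = substitutions, N = words in the human-labeled transcript):

$$
WER = {{I+D+S}\over N} \times 100
$$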
@@ -423,6 +422,26 @@ How the errors are distributed is important. When many deletion errors are encou
 
 By analyzing individual files, you can determine what type of errors exist, and which errors are unique to a specific file. Understanding issues at the file level will help you target improvements.
 
+## Evaluate token error rate (TER)
+
+In addition to [word error rate](#evaluate-word-error-rate-wer), you can use the extended measurement of **Token Error Rate (TER)** to evaluate quality on the final end-to-end display format. Beyond the lexical format (`That will cost $900.` instead of `that will cost nine hundred dollars`), TER takes into account display format aspects such as punctuation, capitalization, and ITN. Learn more about [Display output formatting with speech to text](display-text-format.md).
+
+TER counts the number of incorrect tokens identified during recognition, and divides the sum by the total number of tokens provided in the human-labeled transcript (N).
+
+$$
+TER = {{I+D+S}\over N} \times 100
+$$
+
+The formula for calculating TER is very similar to the WER formula. The only difference is that TER is calculated at the token level instead of the word level.
+* Insertion (I): Tokens that are incorrectly added in the hypothesis transcript
+* Deletion (D): Tokens that are undetected in the hypothesis transcript
+* Substitution (S): Tokens that were substituted between reference and hypothesis
+
+In a real-world case, you can analyze both WER and TER results to get the desired improvements.
+
+> [!NOTE]
+> To measure TER, you need to make sure the [audio + transcript testing data](./how-to-custom-speech-test-and-train.md#audio--human-labeled-transcript-data-for-training-or-testing) includes transcripts with display formatting such as punctuation, capitalization, and ITN.
+
 ## Example scenario outcomes
 
 Speech recognition scenarios vary by audio quality and language (vocabulary and speaking style). The following table examines four common scenarios:
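To make the TER addition concrete, here is a hypothetical worked example. The tokenization is an illustrative simplification that treats each punctuation mark as its own token. Suppose the human-labeled display transcript is `That will cost $900.` and the recognized display output is `that will cost $900`. The capitalization error counts as one substitution and the missing period as one deletion, so with N = 5 reference tokens:

$$
TER = {{0+1+1}\over 5} \times 100 = 40\%
$$

Because the underlying words are identical, the lexical WER for this utterance would be 0%; this is the kind of display-format difference that TER is designed to surface.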
