articles/ai-services/speech-service/faq-stt.yml (+1 −1)

@@ -164,7 +164,7 @@ sections:
   - question: |
       What is word error rate (WER), and how is it computed?
     answer: |
-      WER is the evaluation metric for speech recognition. WER is calculated as the total number of errors (insertions, deletions, and substitutions), divided by the total number of words in the reference transcription. For more information, see [Test model quantitatively](how-to-custom-speech-evaluate-data.md#evaluate-word-error-rate).
+      WER is the evaluation metric for speech recognition. WER is calculated as the total number of errors (insertions, deletions, and substitutions), divided by the total number of words in the reference transcription. For more information, see [Test model quantitatively](how-to-custom-speech-evaluate-data.md#evaluate-word-error-rate-wer).

   - question: |
       How do I determine whether the results of an accuracy test are good?
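As a quick illustration of the WER formula cited in that FAQ answer (an editorial example, not part of the diff): a reference transcription of 20 words recognized with 1 insertion, 2 deletions, and 1 substitution gives

$$
WER = {{1+2+1}\over 20} \times 100 = 20\%
$$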
articles/ai-services/speech-service/how-to-custom-speech-continuous-integration-continuous-deployment.md (+2 −2)

@@ -30,7 +30,7 @@ Along the way, the workflows should name and store data, tests, test files, mode
 
 ### CI workflow for testing data updates
 
-The principal purpose of the CI/CD workflows is to build a new model using the training data, and to test that model using the testing data to establish whether the [Word Error Rate](how-to-custom-speech-evaluate-data.md#evaluate-word-error-rate) (WER) has improved compared to the previous best-performing model (the "benchmark model"). If the new model performs better, it becomes the new benchmark model against which future models are compared.
+The principal purpose of the CI/CD workflows is to build a new model using the training data, and to test that model using the testing data to establish whether the [Word Error Rate](how-to-custom-speech-evaluate-data.md#evaluate-word-error-rate-wer) (WER) has improved compared to the previous best-performing model (the "benchmark model"). If the new model performs better, it becomes the new benchmark model against which future models are compared.
 
 The CI workflow for testing data updates should retest the current benchmark model with the updated test data to calculate the revised WER. This ensures that when the WER of a new model is compared to the WER of the benchmark, both models have been tested against the same test data and you're comparing like with like.

@@ -78,7 +78,7 @@ The [Speech DevOps template repo](https://github.com/Azure-Samples/Speech-Servic
 - Copy the template repository to your GitHub account, then create Azure resources and a [service principal](../../active-directory/develop/app-objects-and-service-principals.md#service-principal-object) for the GitHub Actions CI/CD workflows.
 - Walk through the "[dev inner loop](/dotnet/architecture/containerized-lifecycle/design-develop-containerized-apps/docker-apps-inner-loop-workflow)." Update training and testing data from a feature branch, test the changes with a temporary development model, and raise a pull request to propose and review the changes.
 - When training data is updated in a pull request to *main*, train models with the GitHub Actions CI workflow.
-- Perform automated accuracy testing to establish a model's [Word Error Rate](how-to-custom-speech-evaluate-data.md#evaluate-word-error-rate) (WER). Store the test results in Azure Blob.
+- Perform automated accuracy testing to establish a model's [Word Error Rate](how-to-custom-speech-evaluate-data.md#evaluate-word-error-rate-wer) (WER). Store the test results in Azure Blob.
 - Execute the CD workflow to create an endpoint when the WER improves.
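To make the WER-gating logic described in this workflow concrete, here's a minimal Python sketch of how a CI job might decide whether a newly trained model should replace the benchmark model. The function names and numbers are hypothetical illustrations; this isn't code from the Speech DevOps template repo.

```python
# Hypothetical sketch of the CI gate: compare the WER of a newly trained model
# against the current benchmark model, and promote the new model only if its
# WER improves. Both models are assumed to be tested on the same test data.

def word_error_rate(insertions: int, deletions: int, substitutions: int, reference_words: int) -> float:
    """WER = (I + D + S) / N * 100, as defined in the evaluation article."""
    return (insertions + deletions + substitutions) / reference_words * 100


def should_promote(new_model_wer: float, benchmark_wer: float) -> bool:
    """Promote the new model to benchmark only when its WER is strictly lower."""
    return new_model_wer < benchmark_wer


if __name__ == "__main__":
    benchmark_wer = word_error_rate(3, 5, 10, 400)   # 4.5% on the shared test set
    new_model_wer = word_error_rate(2, 4, 8, 400)    # 3.5% on the same test set
    if should_promote(new_model_wer, benchmark_wer):
        print(f"New model WER {new_model_wer:.2f}% beats benchmark {benchmark_wer:.2f}%; run the CD workflow.")
    else:
        print("Benchmark model retained; skip deployment.")
```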
articles/ai-services/speech-service/how-to-custom-speech-create-project.md (+1 −1)

@@ -137,7 +137,7 @@ There are a few approaches to using Custom Speech models:
 - A custom model augments the base model to include domain-specific vocabulary shared across all areas of the custom domain.
 - Multiple custom models can be used when the custom domain has multiple areas, each with a specific vocabulary.
 
-One recommended way to see if the base model will suffice is to analyze the transcription produced from the base model and compare it with a human-generated transcript for the same audio. You can compare the transcripts and obtain a [word error rate (WER)](how-to-custom-speech-evaluate-data.md#evaluate-word-error-rate) score. If the WER score is high, training a custom model to recognize the incorrectly identified words is recommended.
+One recommended way to see if the base model will suffice is to analyze the transcription produced from the base model and compare it with a human-generated transcript for the same audio. You can compare the transcripts and obtain a [word error rate (WER)](how-to-custom-speech-evaluate-data.md#evaluate-word-error-rate-wer) score. If the WER score is high, training a custom model to recognize the incorrectly identified words is recommended.
 
 Multiple models are recommended if the vocabulary varies across the domain areas. For instance, Olympic commentators report on various events, each associated with its own vernacular. Because each Olympic event vocabulary differs significantly from others, building a custom model specific to an event increases accuracy by limiting the utterance data relative to that particular event. As a result, the model doesn't need to sift through unrelated data to make a match. Regardless, training still requires a decent variety of training data. Include audio from various commentators who have different accents, gender, age, etcetera.
articles/ai-services/speech-service/how-to-custom-speech-evaluate-data.md (+24 −5)

@@ -6,9 +6,8 @@ author: eric-urban
 manager: nitinme
 ms.service: azure-ai-speech
 ms.topic: how-to
-ms.date: 11/29/2022
+ms.date: 12/20/2023
 ms.author: eur
-ms.custom: ignite-fall-2021
 zone_pivot_groups: speech-studio-cli-rest
 show_latex: true
 no-loc: [$$, '\times', '\over']

@@ -22,7 +21,7 @@ In this article, you learn how to quantitatively measure and improve the accurac
 
 ## Create a test
 
-You can test the accuracy of your custom model by creating a test. A test requires a collection of audio files and their corresponding transcriptions. You can compare a custom model's accuracy with a speech to text base model or another custom model. After you [get](#get-test-results) the test results, [evaluate](#evaluate-word-error-rate) the word error rate (WER) compared to speech recognition results.
+You can test the accuracy of your custom model by creating a test. A test requires a collection of audio files and their corresponding transcriptions. You can compare a custom model's accuracy with a speech to text base model or another custom model. After you [get](#get-test-results) the test results, [evaluate the word error rate (WER)](#evaluate-word-error-rate-wer) compared to speech recognition results.
 
 ::: zone pivot="speech-studio"

@@ -222,7 +221,7 @@ The top-level `self` property in the response body is the evaluation's URI. Use
 
 ## Get test results
 
-You should get the test results and [evaluate](#evaluate-word-error-rate) the word error rate (WER) compared to speech recognition results.
+You should get the test results and [evaluate](#evaluate-word-error-rate-wer) the word error rate (WER) compared to speech recognition results.
 
 ::: zone pivot="speech-studio"

@@ -386,7 +385,7 @@ You should receive a response body in the following format:
 ::: zone-end
 
-## Evaluate word error rate
+## Evaluate word error rate (WER)
 
 The industry standard for measuring model accuracy is [word error rate (WER)](https://en.wikipedia.org/wiki/Word_error_rate). WER counts the number of incorrect words identified during recognition, and divides the sum by the total number of words provided in the human-labeled transcript (N).

@@ -423,6 +422,26 @@ How the errors are distributed is important. When many deletion errors are encou
 
 By analyzing individual files, you can determine what type of errors exist, and which errors are unique to a specific file. Understanding issues at the file level will help you target improvements.
 
+## Evaluate token error rate (TER)
+
+Besides [word error rate](#evaluate-word-error-rate-wer), you can also use the extended measurement of **Token Error Rate (TER)** to evaluate quality on the final end-to-end display format. In addition to the lexical format (`That will cost $900.` instead of `that will cost nine hundred dollars`), TER takes into account the display format aspects such as punctuation, capitalization, and ITN. Learn more about [Display output formatting with speech to text](display-text-format.md).
+
+TER counts the number of incorrect tokens identified during recognition, and divides the sum by the total number of tokens provided in the human-labeled transcript (N).
+
+$$
+TER = {{I+D+S}\over N} \times 100
+$$
+
+The formula of TER calculation is also very similar to WER. The only difference is that TER is calculated based on the token level instead of word level.
+* Insertion (I): Tokens that are incorrectly added in the hypothesis transcript
+* Deletion (D): Tokens that are undetected in the hypothesis transcript
+* Substitution (S): Tokens that were substituted between reference and hypothesis
+
+In a real-world case, you may analyze both WER and TER results to get the desired improvements.
+
+> [!NOTE]
+> To measure TER, you need to make sure the [audio + transcript testing data](./how-to-custom-speech-test-and-train.md#audio--human-labeled-transcript-data-for-training-or-testing) includes transcripts with display formatting such as punctuation, capitalization, and ITN.
+
 ## Example scenario outcomes
 
 Speech recognition scenarios vary by audio quality and language (vocabulary and speaking style). The following table examines four common scenarios:
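For readers who want to reproduce the WER and TER arithmetic added in this file, here's an illustrative sketch. It applies a generic Levenshtein (edit-distance) alignment over tokens under the (I + D + S) / N definition above, using naive whitespace tokenization; it is not the Speech service's internal scoring or tokenizer.

```python
# Illustrative sketch: compute an error rate from a reference and a hypothesis
# transcript by aligning tokens with Levenshtein edit distance. The same routine
# serves WER (lexical words) and TER (display-format tokens), given suitable input.

def error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """Return (I + D + S) / N * 100, where N = len(reference)."""
    n, m = len(reference), len(hypothesis)
    # dp[i][j] = minimum edits to turn the first i reference tokens into the first j hypothesis tokens.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i          # all deletions
    for j in range(m + 1):
        dp[0][j] = j          # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1   # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[n][m] / n * 100


# WER compares lexical words; TER compares display-format tokens (punctuation,
# capitalization, ITN). Whitespace splitting is a simplification for this sketch.
wer = error_rate("that will cost nine hundred dollars".split(),
                 "that will cost nine hundred dollar".split())
ter = error_rate("That will cost $900.".split(),
                 "That will cost $900".split())
print(f"WER: {wer:.1f}%  TER: {ter:.1f}%")   # 16.7% and 25.0% in this toy case
```

The same routine serves both metrics: feed it lexical words for WER, or display-format tokens (with punctuation, capitalization, and ITN preserved) for TER.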