articles/ai-foundry/concepts/evaluation-evaluators/custom-evaluators.md
+14 −10 (14 additions & 10 deletions)
@@ -1,11 +1,11 @@
 ---
-title: Custom evaluators
+title: Custom Evaluators
 titleSuffix: Azure AI Foundry
 description: Learn how to create custom evaluators for your AI applications using code-based or prompt-based approaches.
 author: lgayhardt
 ms.author: lagayhar
 ms.reviewer: mithigpe
-ms.date: 07/31/2025
+ms.date: 10/16/2025
 ms.service: azure-ai-foundry
 ms.topic: reference
 ms.custom:
@@ -15,11 +15,11 @@ ms.custom:
 # Custom evaluators
 
-Built-in evaluators are great out of the box to start evaluating your application's generations. However you might want to build your own code-based or prompt-based evaluator to cater to your specific evaluation needs.
+To start evaluating your application's generations, built-in evaluators are great out of the box. To cater to your evaluation needs, you can build your own code-based or prompt-based evaluator.
 
 ## Code-based evaluators
 
-Sometimes a large language model isn't needed for certain evaluation metrics. This is when code-based evaluators can give you the flexibility to define metrics based on functions or callable class. You can build your own code-based evaluator, for example, by creating a simple Python class that calculates the length of an answer in `answer_length.py` under directory `answer_len/`:
+You don't need a large language model for certain evaluation metrics. Code-based evaluators can give you the flexibility to define metrics based on functions or callable classes. You can build your own code-based evaluator, for example, by creating a simple Python class that calculates the length of an answer in `answer_length.py` under directory `answer_len/`, as in the following example.
 
 ### Code-based evaluator example: Answer length
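For reference, here's a minimal sketch of what the full `answer_length.py` evaluator could look like. It's assembled from the fragments visible in this diff (the `return {"answer_length": len(answer)}` line and the keyword call `answer_length_evaluator(answer=...)` in the next hunk headers), so treat it as an illustration rather than the exact file in the article:

```python
# answer_len/answer_length.py -- illustrative sketch of a code-based evaluator
class AnswerLengthEvaluator:
    """Returns the character length of an answer. No LLM is involved."""

    def __call__(self, *, answer: str, **kwargs):
        # A callable class acts as the evaluator; it just returns a dict of metrics.
        return {"answer_length": len(answer)}
```

Calling an instance with `answer_length_evaluator(answer="What is the speed of light?")`, as shown later in the diff, would then return `{"answer_length": 27}`.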
@@ -32,7 +32,7 @@ class AnswerLengthEvaluator:
         return {"answer_length": len(answer)}
 ```
 
-Then run the evaluator on a row of data by importing a callable class:
+Run the evaluator on a row of data by importing a callable class:
 
 ```python
 from answer_len.answer_length import AnswerLengthEvaluator
@@ -49,13 +49,17 @@ answer_length = answer_length_evaluator(answer="What is the speed of light?")
 ## Prompt-based evaluators
 
-To build your own prompt-based large language model evaluator or AI-assisted annotator, you can create a custom evaluator based on a **Prompty** file. Prompty is a file with `.prompty` extension for developing prompt template. The Prompty asset is a markdown file with a modified front matter. The front matter is in YAML format that contains many metadata fields that define model configuration and expected inputs of the Prompty. Let's create a custom evaluator `FriendlinessEvaluator` to measure friendliness of a response.
+To build your own prompt-based large language model evaluator or AI-assisted annotator, you can create a custom evaluator based on a *Prompty* file.
+
+Prompty is a file with the `.prompty` extension for developing prompt templates. The Prompty asset is a markdown file with a modified front matter. The front matter is in YAML format. It contains metadata fields that define model configuration and expected inputs of the Prompty.
+
+To measure friendliness of a response, you can create a custom evaluator `FriendlinessEvaluator`:
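The diff ends before the `FriendlinessEvaluator` code itself, so as rough orientation only: a Prompty-backed evaluator is usually a small callable class that loads the `.prompty` file and forwards inputs to it. The sketch below assumes a `friendliness.prompty` file next to the script and promptflow's `load_flow` helper; the file name, parameters, and wiring are assumptions, not the article's exact code.

```python
# friendliness_evaluator.py -- illustrative sketch only
import json
import os

from promptflow.client import load_flow


class FriendlinessEvaluator:
    def __init__(self):
        prompty_path = os.path.join(os.path.dirname(__file__), "friendliness.prompty")
        # Load the Prompty asset; its YAML front matter declares the model
        # configuration and the expected inputs.
        self._flow = load_flow(source=prompty_path)

    def __call__(self, *, response: str, **kwargs):
        llm_output = self._flow(response=response)
        try:
            # Prompty templates for evaluators typically ask the model to
            # return JSON with a score and a reason.
            return json.loads(llm_output)
        except json.JSONDecodeError:
            return {"score": None, "reason": llm_output}
```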
articles/ai-foundry/concepts/evaluation-evaluators/textual-similarity-evaluators.md
+21 −16 (21 additions & 16 deletions)
@@ -1,11 +1,11 @@
 ---
-title: Textual similarity evaluators for generative AI
+title: Textual Similarity Evaluators for Generative AI
 titleSuffix: Azure AI Foundry
 description: Learn about textual similarity evaluators for generative AI, including semantic similarity, F1 score, BLEU, GLEU, ROUGE, and METEOR metrics.
 author: lgayhardt
 ms.author: lagayhar
 ms.reviewer: changliu2
-ms.date: 07/31/2025
+ms.date: 10/16/2025
 ms.service: azure-ai-foundry
 ms.topic: reference
 ms.custom:
@@ -15,7 +15,9 @@ ms.custom:
 # Textual similarity evaluators
 
-It's important to compare how closely the textual response generated by your AI system matches the response you would expect, typically called the "ground truth". Use LLM-judge metric like [`SimilarityEvaluator`](#similarity) with a focus on the semantic similarity between the generated response and the ground truth, or use metrics from the field of natural language processing (NLP) including [F1 Score](#f1-score), [BLEU](#bleu-score), [GLEU](#gleu-score), [ROUGE](#rouge-score), and [METEOR](#meteor-score) with a focus on the overlaps of tokens or n-grams between the two.
+It's important to compare how closely the textual response generated by your AI system matches the response you would expect. The expected response is called the *ground truth*.
+
+Use an LLM-judge metric like [`SimilarityEvaluator`](#similarity) with a focus on the semantic similarity between the generated response and the ground truth. Or, use metrics from the field of natural language processing (NLP), including [F1 score](#f1-score), [BLEU](#bleu-score), [GLEU](#gleu-score), [ROUGE](#rouge-score), and [METEOR](#meteor-score), with a focus on the overlaps of tokens or n-grams between the two.
-> We recommend using `o3-mini` for a balance of reasoning capability and cost efficiency.
+> We recommend that you use `o3-mini` to balance reasoning capability and cost efficiency.
 
 ## Similarity
 
-`SimilarityEvaluator` measures the degrees of semantic similarity between the generated text and its ground truth with respect to a query. Compared to other text-similarity metrics that require ground truths, this metric focuses on semantics of a response (instead of simple overlap in tokens or n-grams) and also considers the broader context of a query.
+`SimilarityEvaluator` measures the degree of semantic similarity between the generated text and its ground truth with respect to a query. Compared to other text-similarity metrics that require ground truths, this metric focuses on the semantics of a response, instead of simple overlap in tokens or n-grams. It also considers the broader context of a query.
 
 ### Similarity example
@@ -57,7 +59,7 @@ similarity(
 ### Similarity output
 
-The numerical score on a likert scale (integer 1 to 5) and a higher score means a higher degree of similarity. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+The output is a numerical score on a Likert scale (integer 1 to 5). A higher score means a higher degree of similarity. Given a numerical threshold (default of 3), the evaluator also outputs *pass* if the score >= threshold, or *fail* otherwise. Use the reason field to understand why the score is high or low.
 
 ```python
 {
@@ -70,7 +72,10 @@ The numerical score on a likert scale (integer 1 to 5) and a higher score means
 ## F1 score
 
-`F1ScoreEvaluator` measures the similarity by shared tokens between the generated text and the ground truth, focusing on both precision and recall. The F1-score computes the ratio of the number of shared words between the model generation and the ground truth. Ratio is computed over the individual words in the generated response against those in the ground truth answer. The number of shared words between the generation and the truth is the basis of the F1 score. Precision is the ratio of the number of shared words to the total number of words in the generation. Recall is the ratio of the number of shared words to the total number of words in the ground truth.
+`F1ScoreEvaluator` measures the similarity by shared tokens between the generated text and the ground truth. It focuses on both precision and recall. The F1 score computes the ratio of the number of shared words between the model generation and the ground truth. The ratio is computed over the individual words in the generated response against those words in the ground truth answer. The number of shared words between the generation and the truth is the basis of the F1 score.
+
+- *Precision* is the ratio of the number of shared words to the total number of words in the generation.
+- *Recall* is the ratio of the number of shared words to the total number of words in the ground truth.
 
 ### F1 score example
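As a conceptual illustration of the precision/recall arithmetic described in this hunk (not the `F1ScoreEvaluator` implementation itself), a token-overlap F1 can be computed in a few lines of plain Python:

```python
import re
from collections import Counter


def _tokens(text: str) -> list[str]:
    # Lowercase and strip punctuation so "Tokyo." and "Tokyo" count as the same word.
    return re.findall(r"[a-z0-9']+", text.lower())


def token_f1(response: str, ground_truth: str) -> float:
    """Word-overlap F1 between a generated response and its ground truth."""
    response_tokens = _tokens(response)
    truth_tokens = _tokens(ground_truth)
    shared = sum((Counter(response_tokens) & Counter(truth_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(response_tokens)  # shared words / words in the generation
    recall = shared / len(truth_tokens)        # shared words / words in the ground truth
    return 2 * precision * recall / (precision + recall)


print(token_f1("Tokyo is the capital of Japan.", "The capital of Japan is Tokyo."))  # 1.0
```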
@@ -86,7 +91,7 @@ f1_score(
 ### F1 score output
 
-The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.
+The numerical score is a 0-1 float. A higher score is better. Given a numerical threshold (default of 0.5), it also outputs *pass* if the score >= threshold, or *fail* otherwise.
 
 ```python
 {
@@ -98,7 +103,7 @@ The numerical score is a 0-1 float and a higher score is better. Given a numeric
 ## BLEU score
 
-`BleuScoreEvaluator` computes the BLEU (Bilingual Evaluation Understudy) score commonly used in natural language processing (NLP) and machine translation. It measures how closely the generated text matches the reference text.
+`BleuScoreEvaluator` computes the Bilingual Evaluation Understudy (BLEU) score commonly used in natural language processing and machine translation. It measures how closely the generated text matches the reference text.
 
 ### BLEU example
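For a hands-on feel for the metric (separate from `BleuScoreEvaluator`, which computes it for you), NLTK's reference implementation can score a single sentence; the smoothing function is there only to avoid zero scores on short strings:

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

reference = "The capital of Japan is Tokyo.".lower().split()
candidate = "Tokyo is the capital of Japan.".lower().split()

# sentence_bleu takes a list of tokenized references and one tokenized candidate.
score = sentence_bleu([reference], candidate, smoothing_function=SmoothingFunction().method1)
print(round(score, 3))
```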
@@ -114,7 +119,7 @@ bleu_score(
 ### BLEU output
 
-The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.
+The numerical score is a 0-1 float. A higher score is better. Given a numerical threshold (default of 0.5), it also outputs *pass* if the score >= threshold, or *fail* otherwise.
 
 ```python
 {
@@ -126,7 +131,7 @@ The numerical score is a 0-1 float and a higher score is better. Given a numeric
 ## GLEU score
 
-`GleuScoreEvaluator` computes the GLEU (Google-BLEU) score. It measures the similarity by shared n-grams between the generated text and ground truth, similar to the BLEU score, focusing on both precision and recall. But it addresses the drawbacks of the BLEU score using a per-sentence reward objective.
+`GleuScoreEvaluator` computes the Google-BLEU (GLEU) score. It measures the similarity by shared n-grams between the generated text and ground truth. Similar to the BLEU score, it focuses on both precision and recall. It addresses the drawbacks of the BLEU score by using a per-sentence reward objective.
 
 ### GLEU score example
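NLTK also ships a sentence-level GLEU implementation if you want to sanity-check scores outside the SDK; this is an illustration, not necessarily what `GleuScoreEvaluator` uses internally:

```python
from nltk.translate.gleu_score import sentence_gleu

reference = "The capital of Japan is Tokyo.".lower().split()
candidate = "Tokyo is the capital of Japan.".lower().split()

# Like sentence_bleu, sentence_gleu takes a list of tokenized references plus a candidate.
print(sentence_gleu([reference], candidate))
```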
@@ -142,7 +147,7 @@ gleu_score(
 ### GLEU score output
 
-The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.
+The numerical score is a 0-1 float. A higher score is better. Given a numerical threshold (default of 0.5), it also outputs *pass* if the score >= threshold, or *fail* otherwise.
 
 ```python
 {
@@ -154,7 +159,7 @@ The numerical score is a 0-1 float and a higher score is better. Given a numeric
 ## ROUGE score
 
-`RougeScoreEvaluator` computes the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores, a set of metrics used to evaluate automatic summarization and machine translation. It measures the overlap between generated text and reference summaries. ROUGE focuses on recall-oriented measures to assess how well the generated text covers the reference text. The ROUGE score is composed of precision, recall, and F1 score.
+`RougeScoreEvaluator` computes the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores, a set of metrics used to evaluate automatic summarization and machine translation. It measures the overlap between generated text and reference summaries. ROUGE focuses on recall-oriented measures to assess how well the generated text covers the reference text. The ROUGE score is composed of precision, recall, and F1 score.
 
 ### ROUGE score example
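To see the precision/recall/F1 decomposition this hunk mentions, the open-source `rouge-score` package (shown here as an illustration, not the SDK's internals) reports all three for each ROUGE variant:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "The capital of Japan is Tokyo.",  # target: reference / ground truth
    "Tokyo is the capital of Japan.",  # prediction: generated text
)
# Each entry is a named tuple with precision, recall, and fmeasure.
rl = scores["rougeL"]
print(rl.precision, rl.recall, rl.fmeasure)
```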
@@ -170,7 +175,7 @@ rouge(
 ### ROUGE score output
 
-The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.
+The numerical score is a 0-1 float. A higher score is better. Given a numerical threshold (default of 0.5), it also outputs *pass* if the score >= threshold, or *fail* otherwise.
 
 ```python
 {
@@ -188,7 +193,7 @@ The numerical score is a 0-1 float and a higher score is better. Given a numeric
 ## METEOR score
 
-`MeteorScoreEvaluator` measures the similarity by shared n-grams between the generated text and the ground truth, similar to the BLEU score, focusing on precision and recall. But it addresses limitations of other metrics like the BLEU score by considering synonyms, stemming, and paraphrasing for content alignment.
+`MeteorScoreEvaluator` measures the similarity by shared n-grams between the generated text and the ground truth. Similar to the BLEU score, it focuses on precision and recall. It addresses limitations of other metrics like the BLEU score by considering synonyms, stemming, and paraphrasing for content alignment.
 
 ### METEOR score example
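Because METEOR credits synonyms and stems, it needs WordNet data. NLTK's implementation (again, an illustration rather than what `MeteorScoreEvaluator` necessarily runs internally) makes that dependency explicit; recent NLTK versions expect pre-tokenized input:

```python
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet")  # METEOR matches synonyms and stems via WordNet

reference = "The capital of Japan is Tokyo.".lower().split()
candidate = "Tokyo is the capital of Japan.".lower().split()

# meteor_score takes a list of tokenized references and one tokenized candidate.
print(meteor_score([reference], candidate))
```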
@@ -204,7 +209,7 @@ meteor_score(
 ### METEOR score output
 
-The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.
+The numerical score is a 0-1 float. A higher score is better. Given a numerical threshold (default of 0.5), it also outputs *pass* if the score >= threshold, or *fail* otherwise.