
Commit 15f299c

Freshness, in progress.
1 parent da18e4b commit 15f299c

3 files changed (+52, -49 lines changed)

articles/ai-foundry/concepts/evaluation-evaluators/custom-evaluators.md

Lines changed: 8 additions & 8 deletions
@@ -5,7 +5,7 @@ description: Learn how to create custom evaluators for your AI applications usin
author: lgayhardt
ms.author: lagayhar
ms.reviewer: mithigpe
-ms.date: 07/31/2025
+ms.date: 10/16/2025
ms.service: azure-ai-foundry
ms.topic: reference
ms.custom:
@@ -15,11 +15,11 @@ ms.custom:

# Custom evaluators

-Built-in evaluators are great out of the box to start evaluating your application's generations. However you might want to build your own code-based or prompt-based evaluator to cater to your specific evaluation needs.
+To start evaluating your application's generations, built-in evaluators are great out of the box. To cater to your specific evaluation needs, you can build your own code-based or prompt-based evaluator.

## Code-based evaluators

-Sometimes a large language model isn't needed for certain evaluation metrics. This is when code-based evaluators can give you the flexibility to define metrics based on functions or callable class. You can build your own code-based evaluator, for example, by creating a simple Python class that calculates the length of an answer in `answer_length.py` under directory `answer_len/`:
+You don't need a large language model for certain evaluation metrics. Code-based evaluators give you the flexibility to define metrics based on functions or a callable class. You can build your own code-based evaluator, for example, by creating a simple Python class that calculates the length of an answer in `answer_length.py` under the directory `answer_len/`, as in the following example.

### Code-based evaluator example: Answer length

@@ -32,7 +32,7 @@ class AnswerLengthEvaluator:
        return {"answer_length": len(answer)}
```

-Then run the evaluator on a row of data by importing a callable class:
+Run the evaluator on a row of data by importing a callable class:

```python
from answer_len.answer_length import AnswerLengthEvaluator
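For reference, the `answer_len/answer_length.py` class this hunk touches is small enough to sketch in full. Only the `return` line and the import appear in the diff, so the `__init__` and `__call__` signatures below are assumptions about the surrounding code, not part of this commit:

```python
# answer_len/answer_length.py -- minimal sketch; only the return line is
# confirmed by the diff above, the method signatures are assumed.
class AnswerLengthEvaluator:
    def __init__(self):
        pass

    # Implementing __call__ makes the class a callable evaluator that the
    # evaluation framework can invoke once per row of data.
    def __call__(self, *, answer: str, **kwargs):
        return {"answer_length": len(answer)}
```

The next hunk's context line, `answer_length = answer_length_evaluator(answer="What is the speed of light?")`, shows how an instance of this class is then called on a single row.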
@@ -49,13 +49,13 @@ answer_length = answer_length_evaluator(answer="What is the speed of light?")

## Prompt-based evaluators

-To build your own prompt-based large language model evaluator or AI-assisted annotator, you can create a custom evaluator based on a **Prompty** file. Prompty is a file with `.prompty` extension for developing prompt template. The Prompty asset is a markdown file with a modified front matter. The front matter is in YAML format that contains many metadata fields that define model configuration and expected inputs of the Prompty. Let's create a custom evaluator `FriendlinessEvaluator` to measure friendliness of a response.
+To build your own prompt-based large language model evaluator or AI-assisted annotator, you can create a custom evaluator based on a *Prompty* file. Prompty is a file with the `.prompty` extension for developing prompt templates. The Prompty asset is a markdown file with a modified front matter. The front matter is in YAML format. It contains metadata fields that define the model configuration and expected inputs of the Prompty. To measure the friendliness of a response, you can create a custom evaluator `FriendlinessEvaluator`:

### Prompt-based evaluator example: Friendliness evaluator

First, create a `friendliness.prompty` file that describes the definition of the friendliness metric and its grading rubric:

-```markdown
+```md
---
name: Friendliness Evaluator
description: Friendliness Evaluator to measure warmth and approachability of answers.
@@ -108,7 +108,7 @@ generated_query: {{response}}
output:
```

-Then create a class `FriendlinessEvaluator` to load the Prompty file and process the outputs with json format:
+Then create a class `FriendlinessEvaluator` to load the Prompty file and process the outputs in JSON format:

```python
import os
@@ -132,7 +132,7 @@ class FriendlinessEvaluator:
        return response
```

-Now, you can create your own Prompty-based evaluator and run it on a row of data:
+Now, create your own Prompty-based evaluator and run it on a row of data:

```python
from friendliness.friend import FriendlinessEvaluator
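The snippet referenced by this hunk is cut off after its first line. A plausible continuation, assuming `FriendlinessEvaluator` takes the judge-model configuration in its constructor and is called with the response to grade (both the constructor argument and the `response` keyword are assumptions, and the sample string is illustrative):

```python
from friendliness.friend import FriendlinessEvaluator

# Assumption: the evaluator wraps friendliness.prompty and needs the judge-model
# configuration (model_config) defined elsewhere in the article, not shown in this diff.
friendliness_eval = FriendlinessEvaluator(model_config)

# Assumption: the callable takes the generated response to grade.
friendliness_score = friendliness_eval(response="Sorry to hear that. Let me see how I can help.")
print(friendliness_score)  # e.g. a dict with a 1-5 score and a reason, per the Prompty rubric
```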

articles/ai-foundry/concepts/evaluation-evaluators/textual-similarity-evaluators.md

Lines changed: 17 additions & 14 deletions
@@ -5,7 +5,7 @@ description: Learn about textual similarity evaluators for generative AI, includ
author: lgayhardt
ms.author: lagayhar
ms.reviewer: changliu2
-ms.date: 07/31/2025
+ms.date: 10/16/2025
ms.service: azure-ai-foundry
ms.topic: reference
ms.custom:
@@ -15,7 +15,7 @@ ms.custom:

# Textual similarity evaluators

-It's important to compare how closely the textual response generated by your AI system matches the response you would expect, typically called the "ground truth". Use LLM-judge metric like [`SimilarityEvaluator`](#similarity) with a focus on the semantic similarity between the generated response and the ground truth, or use metrics from the field of natural language processing (NLP) including [F1 Score](#f1-score), [BLEU](#bleu-score), [GLEU](#gleu-score), [ROUGE](#rouge-score), and [METEOR](#meteor-score) with a focus on the overlaps of tokens or n-grams between the two.
+It's important to compare how closely the textual response generated by your AI system matches the response you would expect. The expected response is the *ground truth*. Use an LLM-judge metric like [`SimilarityEvaluator`](#similarity) with a focus on the semantic similarity between the generated response and the ground truth. Or, use metrics from the field of natural language processing (NLP), including [F1 score](#f1-score), [BLEU](#bleu-score), [GLEU](#gleu-score), [ROUGE](#rouge-score), and [METEOR](#meteor-score), with a focus on the overlaps of tokens or n-grams between the two.

## Model configuration for AI-assisted evaluators

@@ -40,7 +40,7 @@ model_config = AzureOpenAIModelConfiguration(

## Similarity

-`SimilarityEvaluator` measures the degrees of semantic similarity between the generated text and its ground truth with respect to a query. Compared to other text-similarity metrics that require ground truths, this metric focuses on semantics of a response (instead of simple overlap in tokens or n-grams) and also considers the broader context of a query.
+`SimilarityEvaluator` measures the degree of semantic similarity between the generated text and its ground truth with respect to a query. Compared to other text-similarity metrics that require ground truths, this metric focuses on the semantics of a response, instead of simple overlap in tokens or n-grams. It also considers the broader context of a query.

### Similarity example

@@ -57,7 +57,7 @@ similarity(

### Similarity output

-The numerical score on a likert scale (integer 1 to 5) and a higher score means a higher degree of similarity. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+The output is a numerical score on a Likert scale (integer 1 to 5). A higher score means a higher degree of similarity. Given a numerical threshold (default 3), it also outputs *pass* if the score >= threshold, or *fail* otherwise. Use the reason field to understand why the score is high or low.

```python
{
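The `similarity(` call and its output are truncated in the hunks above. As a rough sketch of how this evaluator is typically wired up, assuming it comes from the `azure.ai.evaluation` package used by these articles, takes the model configuration from the earlier section, and accepts `query`, `response`, and `ground_truth` keywords (the configuration values and sample strings are placeholders, not the article's own):

```python
from azure.ai.evaluation import AzureOpenAIModelConfiguration, SimilarityEvaluator

# Placeholder judge-model configuration; the article's earlier
# "Model configuration for AI-assisted evaluators" section shows the real one.
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    azure_deployment="<your-gpt-deployment>",
    api_version="<api-version>",
)

similarity = SimilarityEvaluator(model_config)

# Illustrative inputs: the evaluator judges how close the response is to the
# ground truth in meaning, given the query.
similarity(
    query="What is the capital of Japan?",
    response="The capital of Japan is Tokyo.",
    ground_truth="Tokyo is Japan's capital city.",
)
```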
@@ -70,7 +70,10 @@ The numerical score on a likert scale (integer 1 to 5) and a higher score means

## F1 score

-`F1ScoreEvaluator` measures the similarity by shared tokens between the generated text and the ground truth, focusing on both precision and recall. The F1-score computes the ratio of the number of shared words between the model generation and the ground truth. Ratio is computed over the individual words in the generated response against those in the ground truth answer. The number of shared words between the generation and the truth is the basis of the F1 score. Precision is the ratio of the number of shared words to the total number of words in the generation. Recall is the ratio of the number of shared words to the total number of words in the ground truth.
+`F1ScoreEvaluator` measures the similarity by shared tokens between the generated text and the ground truth. It focuses on both precision and recall. The F1 score is computed from the number of words shared between the model generation and the ground truth: the individual words in the generated response are compared against those in the ground truth answer, and the count of shared words is the basis of the score.
+
+- *Precision* is the ratio of the number of shared words to the total number of words in the generation.
+- *Recall* is the ratio of the number of shared words to the total number of words in the ground truth.

### F1 score example

@@ -86,7 +89,7 @@ f1_score(

### F1 score output

-The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.
+The numerical score is a 0-1 float. A higher score is better. Given a numerical threshold (default 0.5), it also outputs *pass* if the score >= threshold, or *fail* otherwise.

```python
{
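The precision and recall definitions added in the F1 score hunk above reduce to a few lines of arithmetic. The following is only an illustration of that arithmetic on whitespace tokens, not the actual `F1ScoreEvaluator` implementation:

```python
from collections import Counter


def token_f1(response: str, ground_truth: str) -> float:
    """Illustrative token-overlap F1; not the F1ScoreEvaluator source."""
    gen_tokens = response.lower().split()
    truth_tokens = ground_truth.lower().split()

    # Shared words, counting duplicates only as many times as both sides have them.
    shared = sum((Counter(gen_tokens) & Counter(truth_tokens)).values())
    if shared == 0:
        return 0.0

    precision = shared / len(gen_tokens)  # shared words / words in the generation
    recall = shared / len(truth_tokens)   # shared words / words in the ground truth
    return 2 * precision * recall / (precision + recall)


print(token_f1("The capital of France is Paris", "Paris is the capital of France"))  # 1.0
```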
@@ -98,7 +101,7 @@ The numerical score is a 0-1 float and a higher score is better. Given a numeric

## BLEU score

-`BleuScoreEvaluator` computes the BLEU (Bilingual Evaluation Understudy) score commonly used in natural language processing (NLP) and machine translation. It measures how closely the generated text matches the reference text.
+`BleuScoreEvaluator` computes the Bilingual Evaluation Understudy (BLEU) score commonly used in natural language processing (NLP) and machine translation. It measures how closely the generated text matches the reference text.

### BLEU example

@@ -114,7 +117,7 @@ bleu_score(

### BLEU output

-The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.
+The numerical score is a 0-1 float. A higher score is better. Given a numerical threshold (default 0.5), it also outputs *pass* if the score >= threshold, or *fail* otherwise.

```python
{
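The `bleu_score(` example above is truncated. A minimal sketch of the usual call shape, assuming the evaluator lives in the `azure.ai.evaluation` package and takes `response` and `ground_truth` keywords (the sample strings are illustrative); the GLEU and METEOR evaluators below follow the same no-LLM, response-versus-ground-truth pattern:

```python
from azure.ai.evaluation import BleuScoreEvaluator

# BLEU is a pure n-gram overlap metric, so no judge model or model_config is needed.
bleu = BleuScoreEvaluator()

result = bleu(
    response="Tokyo is the capital of Japan.",      # illustrative strings
    ground_truth="The capital of Japan is Tokyo.",
)
print(result)  # a 0-1 score plus the pass/fail field described under "BLEU output"
```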
@@ -126,7 +129,7 @@ The numerical score is a 0-1 float and a higher score is better. Given a numeric

## GLEU score

-`GleuScoreEvaluator` computes the GLEU (Google-BLEU) score. It measures the similarity by shared n-grams between the generated text and ground truth, similar to the BLEU score, focusing on both precision and recall. But it addresses the drawbacks of the BLEU score using a per-sentence reward objective.
+`GleuScoreEvaluator` computes the Google-BLEU (GLEU) score. It measures the similarity by shared n-grams between the generated text and the ground truth. Similar to the BLEU score, it focuses on both precision and recall. It addresses the drawbacks of the BLEU score by using a per-sentence reward objective.

### GLEU score example

@@ -142,7 +145,7 @@ gleu_score(

### GLEU score output

-The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.
+The numerical score is a 0-1 float. A higher score is better. Given a numerical threshold (default 0.5), it also outputs *pass* if the score >= threshold, or *fail* otherwise.

```python
{
@@ -154,7 +157,7 @@ The numerical score is a 0-1 float and a higher score is better. Given a numeric

## ROUGE score

-`RougeScoreEvaluator` computes the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores, a set of metrics used to evaluate automatic summarization and machine translation. It measures the overlap between generated text and reference summaries. ROUGE focuses on recall-oriented measures to assess how well the generated text covers the reference text. The ROUGE score is composed of precision, recall, and F1 score.
+`RougeScoreEvaluator` computes the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores, a set of metrics used to evaluate automatic summarization and machine translation. It measures the overlap between generated text and reference summaries. ROUGE focuses on recall-oriented measures to assess how well the generated text covers the reference text. The ROUGE score is composed of precision, recall, and F1 score.

### ROUGE score example

@@ -170,7 +173,7 @@ rouge(

### ROUGE score output

-The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.
+The numerical score is a 0-1 float. A higher score is better. Given a numerical threshold (default 0.5), it also outputs *pass* if the score >= threshold, or *fail* otherwise.

```python
{
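ROUGE comes in several variants (for example, ROUGE-1, ROUGE-2, ROUGE-L), so the evaluator generally needs to be told which variant to compute. The truncated `rouge(` call above doesn't show this; the sketch below assumes a `rouge_type` constructor argument and a `RougeType` enum in the same package, which this diff does not itself confirm:

```python
from azure.ai.evaluation import RougeScoreEvaluator, RougeType

# rouge_type and RougeType are assumptions about the SDK surface, not shown in the diff.
rouge = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_L)

scores = rouge(
    response="Paris is the capital of France.",     # illustrative strings
    ground_truth="France's capital city is Paris.",
)
print(scores)  # per the text above: precision, recall, and F1 components
```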
@@ -188,7 +191,7 @@ The numerical score is a 0-1 float and a higher score is better. Given a numeric

## METEOR score

-`MeteorScoreEvaluator` measures the similarity by shared n-grams between the generated text and the ground truth, similar to the BLEU score, focusing on precision and recall. But it addresses limitations of other metrics like the BLEU score by considering synonyms, stemming, and paraphrasing for content alignment.
+`MeteorScoreEvaluator` measures the similarity by shared n-grams between the generated text and the ground truth. Similar to the BLEU score, it focuses on precision and recall. It addresses limitations of other metrics like the BLEU score by considering synonyms, stemming, and paraphrasing for content alignment.

### METEOR score example

@@ -204,7 +207,7 @@ meteor_score(

### METEOR score output

-The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.
+The numerical score is a 0-1 float. A higher score is better. Given a numerical threshold (default 0.5), it also outputs *pass* if the score >= threshold, or *fail* otherwise.

```python
{
