---
title: General Purpose Evaluators for Generative AI
titleSuffix: Azure AI Foundry
description: Learn about general-purpose evaluators for generative AI, including coherence, fluency, and question-answering composite evaluation.
author: lgayhardt
ms.author: lagayhar
ms.reviewer: changliu2
ms.date: 10/17/2025
ms.service: azure-ai-foundry
ms.topic: reference
ms.custom:
---

# General purpose evaluators
AI systems might generate textual responses that are incoherent, or that lack general writing quality beyond minimum grammatical correctness. To address these issues, Azure AI Foundry supports evaluating:
- [Coherence](#coherence)
- [Fluency](#fluency)
If you have a question-answering (QA) scenario with both `context` and `ground truth` data in addition to `query` and `response`, you can also use our [QAEvaluator](#question-answering-composite-evaluator), which is a composite evaluator that uses relevant evaluators for judgment.
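As a minimal sketch of the call pattern, assuming the `azure-ai-evaluation` Python SDK (the endpoint, key, deployment, and sample inputs are placeholders):

```python
from azure.ai.evaluation import QAEvaluator

# Placeholder judge configuration; replace with your Azure OpenAI resource details.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<your-deployment>",
}

qa_eval = QAEvaluator(model_config=model_config)
result = qa_eval(
    query="Where was Marie Curie born?",
    response="Marie Curie was born in Warsaw.",
    context="Marie Curie was a physicist and chemist, born in Warsaw in 1867.",
    ground_truth="Marie Curie was born in Warsaw, Poland.",
)
print(result)  # composite scores from the underlying evaluators
```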
Azure AI Foundry supports Azure OpenAI or OpenAI [reasoning models](../../../ai-services/openai/how-to/reasoning.md) and non-reasoning models for the large language model judge (LLM judge), depending on the evaluator:
| Evaluators | Reasoning Models as Judge (example: o-series models from Azure OpenAI / OpenAI) | Non-reasoning models as Judge (example: gpt-4.1, gpt-4o) | To enable |
|---|---|---|---|
| Other quality evaluators | Not Supported | Supported | -- |
For complex evaluation that requires refined reasoning, we recommend a strong reasoning model, such as `o3-mini` and the o-series mini models released after it, which balance reasoning performance and cost efficiency.
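As a sketch of how the judge model is supplied, assuming the `azure-ai-evaluation` SDK's `AzureOpenAIModelConfiguration` (all resource values are placeholders):

```python
from azure.ai.evaluation import AzureOpenAIModelConfiguration

# Placeholder values; replace with your Azure OpenAI resource details.
# Per the table above, evaluators in the "Other quality evaluators" row need a
# non-reasoning judge (for example, gpt-4o); evaluators that support a
# reasoning judge could point at a deployment such as o3-mini instead.
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    azure_deployment="gpt-4o",
    api_version="<api-version>",
)
```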
## Coherence
`CoherenceEvaluator` measures the logical and orderly presentation of ideas in a response, which allows the reader to easily follow and understand the writer's train of thought. A *coherent* response directly addresses the question with clear connections between sentences and paragraphs, using appropriate transitions and a logical sequence of ideas. Higher scores mean better coherence.
### Coherence example
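A minimal sketch of the call pattern, assuming the `azure-ai-evaluation` SDK (the judge configuration and sample inputs are placeholders):

```python
from azure.ai.evaluation import CoherenceEvaluator

# Placeholder judge configuration; see the model configuration sketch above.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "gpt-4o",
}

coherence = CoherenceEvaluator(model_config=model_config, threshold=3)
result = coherence(
    query="Is Marie Curie famous for her work on radioactivity?",
    response="Yes. Marie Curie pioneered research on radioactivity and won two Nobel Prizes.",
)
print(result)
```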
### Coherence output
The output is a numerical score on a Likert scale (integer 1 to 5), and a higher score is better. Given a numerical threshold (default 3), the evaluator also outputs *pass* if the score >= threshold, or *fail* otherwise. Use the reason field to understand why the score is high or low.
```python
{
    # Illustrative values; the reason text is generated by the LLM judge.
    "coherence": 4.0,
    "coherence_reason": "The response directly addresses the query with a clear and logical flow of ideas.",
    "coherence_result": "pass",
    "coherence_threshold": 3
}
```
## Fluency
`FluencyEvaluator` measures the effectiveness and clarity of written communication. This measure focuses on grammatical accuracy, vocabulary range, sentence complexity, coherence, and overall readability. It assesses how smoothly ideas are conveyed and how easily the reader can understand the text.
### Fluency example
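A minimal sketch of the call pattern, assuming the `azure-ai-evaluation` SDK; in this sketch the evaluator is called with only a `response`, which is the assumed input (the judge configuration and sample text are placeholders):

```python
from azure.ai.evaluation import FluencyEvaluator

# Placeholder judge configuration; see the model configuration sketch above.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "gpt-4o",
}

fluency = FluencyEvaluator(model_config=model_config, threshold=3)
result = fluency(
    response="Marie Curie pioneered research on radioactivity and won two Nobel Prizes.",
)
print(result)
```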
### Fluency output
The output is a numerical score on a Likert scale (integer 1 to 5), and a higher score is better. Given a numerical threshold (default 3), the evaluator also outputs *pass* if the score >= threshold, or *fail* otherwise. Use the reason field to understand why the score is high or low.
```python
{
    # Illustrative values; the reason text is generated by the LLM judge.
    "fluency": 3.0,
    "fluency_reason": "The response is grammatically correct and readable, with a moderate vocabulary range.",
    "fluency_result": "pass",
    "fluency_threshold": 3
}
```
### QA output
While the F1 score is a float on a 0-1 scale, the other evaluators output numerical scores on a Likert scale (integer 1 to 5), and a higher score is better. Given a numerical threshold (default 3), they also output *pass* if the score >= threshold, or *fail* otherwise. Use the reason field to understand why each score is high or low.
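For illustration only, here's a representative slice of the composite output; the field names and values are assumptions about the output shape rather than verbatim SDK output:

```python
{
    # Illustrative subset: the composite returns a score, reason, result, and
    # threshold per child evaluator (coherence shown here), plus the F1 score.
    "f1_score": 0.62,
    "coherence": 4.0,
    "coherence_reason": "The response directly addresses the query with a logical flow of ideas.",
    "coherence_result": "pass",
    "coherence_threshold": 3
}
```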