
Commit 7392a49

Merge pull request #7710 from TimShererWithAquent/us496641-09
Freshness Edit: AI Foundry: General purpose evaluators
2 parents 9176476 + e9e4f62 commit 7392a49


articles/ai-foundry/concepts/evaluation-evaluators/general-purpose-evaluators.md

Lines changed: 12 additions & 12 deletions
@@ -1,11 +1,11 @@
 ---
-title: General purpose evaluators for generative AI
+title: General Purpose Evaluators for Generative AI
 titleSuffix: Azure AI Foundry
 description: Learn about general-purpose evaluators for generative AI, including coherence, fluency, and question-answering composite evaluation.
 author: lgayhardt
 ms.author: lagayhar
 ms.reviewer: changliu2
-ms.date: 07/16/2025
+ms.date: 10/17/2025
 ms.service: azure-ai-foundry
 ms.topic: reference
 ms.custom:
@@ -15,12 +15,12 @@ ms.custom:

 # General purpose evaluators

-AI systems might generate textual responses that are incoherent, or lack the general writing quality you might desire beyond minimum grammatical correctness. To address these issues, we support evaluating:
+AI systems might generate textual responses that are incoherent, or lack the general writing quality beyond minimum grammatical correctness. To address these issues, Azure AI Foundry supports evaluating:

 - [Coherence](#coherence)
 - [Fluency](#fluency)

-If you have a question-answering (QA) scenario with both `context` and `ground truth` data in addition to `query` and `response`, you can also use our [QAEvaluator](#question-answering-composite-evaluator) a composite evaluator that uses relevant evaluators for judgment.
+If you have a question-answering (QA) scenario with both `context` and `ground truth` data in addition to `query` and `response`, you can also use our [QAEvaluator](#question-answering-composite-evaluator), which is a composite evaluator that uses relevant evaluators for judgment.

 ## Model configuration for AI-assisted evaluators
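The AI-assisted evaluators take a judge-model configuration at initialization. As a minimal sketch, assuming the `azure-ai-evaluation` SDK and placeholder environment variables for the endpoint, key, and deployment:

```python
import os

from azure.ai.evaluation import AzureOpenAIModelConfiguration

# Placeholder values; point these at your own Azure OpenAI resource and judge deployment.
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT"],
    api_version="2024-10-21",  # example API version
)
```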

@@ -42,18 +42,18 @@ model_config = AzureOpenAIModelConfiguration(

 ### Evaluator model support

-We support AzureOpenAI or OpenAI [reasoning models](../../../ai-services/openai/how-to/reasoning.md) and non-reasoning models for the LLM-judge depending on the evaluators:
+Azure AI Foundry supports AzureOpenAI or OpenAI [reasoning models](../../../ai-services/openai/how-to/reasoning.md) and non-reasoning models for the large language model judge (LLM-judge) depending on the evaluators:

-| Evaluators | Reasoning Models as Judge (example: o-series models from Azure OpenAI / OpenAI) | Non-reasoning models as Judge (example: gpt-4.1, gpt-4o, etc.) | To enable |
+| Evaluators | Reasoning Models as Judge (example: o-series models from Azure OpenAI / OpenAI) | Non-reasoning models as Judge (example: gpt-4.1, gpt-4o) | To enable |
 |------------|-----------------------------------------------------------------------------|-------------------------------------------------------------|-------|
 | `Intent Resolution`, `Task Adherence`, `Tool Call Accuracy`, `Response Completeness` | Supported | Supported | Set additional parameter `is_reasoning_model=True` in initializing evaluators |
 | Other quality evaluators| Not Supported | Supported | -- |

-For complex evaluation that requires refined reasoning, we recommend a strong reasoning model like `o3-mini` and o-series mini models released afterwards with a balance of reasoning performance and cost efficiency.
+For complex evaluation that requires refined reasoning, we recommend a strong reasoning model with a balance of reasoning performance and cost efficiency, like `o3-mini` and o-series mini models released afterwards.

 ## Coherence

-`CoherenceEvaluator` measures the logical and orderly presentation of ideas in a response, allowing the reader to easily follow and understand the writer's train of thought. A coherent response directly addresses the question with clear connections between sentences and paragraphs, using appropriate transitions and a logical sequence of ideas. Higher scores mean better coherence.
+`CoherenceEvaluator` measures the logical and orderly presentation of ideas in a response, which allows the reader to easily follow and understand the writer's train of thought. A *coherent* response directly addresses the question with clear connections between sentences and paragraphs, using appropriate transitions and a logical sequence of ideas. Higher scores mean better coherence.

 ### Coherence example
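A minimal sketch of initializing and calling `CoherenceEvaluator` with the `model_config` above, assuming the `azure-ai-evaluation` SDK and an illustrative query/response pair; the `is_reasoning_model=True` parameter from the table applies only to the evaluators listed in its first row:

```python
from azure.ai.evaluation import CoherenceEvaluator, IntentResolutionEvaluator

# model_config: the AzureOpenAIModelConfiguration from the earlier sketch.
coherence = CoherenceEvaluator(model_config=model_config)
result = coherence(
    query="What is the capital of France?",
    response="Paris is the capital of France, and it is also its largest city.",
)
print(result)  # score, pass/fail result, threshold, and reason

# For the evaluators that support a reasoning-model judge (see the table above),
# set the additional parameter when initializing the evaluator.
intent_resolution = IntentResolutionEvaluator(model_config=model_config, is_reasoning_model=True)
```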

@@ -69,7 +69,7 @@ coherence(

 ### Coherence output

-The numerical score on a Likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+The numerical score on a Likert scale (integer 1 to 5). A higher score is better. Given a numerical threshold (default to 3), it also outputs *pass* if the score >= threshold, or *fail* otherwise. Use the reason field to understand why the score is high or low.

 ```python
 {
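Illustratively, the *pass*/*fail* verdict is just a comparison of the returned score against the threshold; a sketch of the rule the output section describes, not the SDK's internal code:

```python
# Sketch of the verdict rule described above.
score = 4       # Likert score from the judge, integer 1 to 5
threshold = 3   # default threshold
verdict = "pass" if score >= threshold else "fail"
```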
@@ -83,7 +83,7 @@ The numerical score on a Likert scale (integer 1 to 5) and a higher score is bet

 ## Fluency

-`FluencyEvaluator`measures the effectiveness and clarity of written communication, focusing on grammatical accuracy, vocabulary range, sentence complexity, coherence, and overall readability. It assesses how smoothly ideas are conveyed and how easily the reader can understand the text.
+`FluencyEvaluator` measures the effectiveness and clarity of written communication. This measure focuses on grammatical accuracy, vocabulary range, sentence complexity, coherence, and overall readability. It assesses how smoothly ideas are conveyed and how easily the reader can understand the text.

 ### Fluency example
@@ -98,7 +98,7 @@ fluency(

 ### Fluency output

-The numerical score on a Likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+The numerical score on a Likert scale (integer 1 to 5). A higher score is better. Given a numerical threshold (default to 3), it also outputs *pass* if the score >= threshold, or *fail* otherwise. Use the reason field to understand why the score is high or low.

 ```python
 {
@@ -137,7 +137,7 @@ qa_eval(

 ### QA output

-While F1 score outputs a numerical score on 0-1 float scale, the other evaluators output numerical scores on a Likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+While F1 score outputs a numerical score on 0-1 float scale, the other evaluators output numerical scores on a Likert scale (integer 1 to 5). A higher score is better. Given a numerical threshold (default to 3), it also outputs *pass* if the score >= threshold, or *fail* otherwise. Use the reason field to understand why the score is high or low.

 ```python
 {
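A minimal sketch of calling `QAEvaluator` with the four inputs named earlier (`query`, `response`, `context`, and ground truth), again assuming the `azure-ai-evaluation` SDK and illustrative data:

```python
from azure.ai.evaluation import QAEvaluator

# model_config: the AzureOpenAIModelConfiguration from the earlier sketch.
qa_eval = QAEvaluator(model_config=model_config)
result = qa_eval(
    query="Where was Marie Curie born?",
    response="Marie Curie was born in Warsaw.",
    context="Marie Curie was a Polish-born physicist and chemist, born in Warsaw in 1867.",
    ground_truth="Marie Curie was born in Warsaw, Poland.",
)
print(result)  # per-evaluator scores, including the F1 score and the Likert-scale metrics
```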
