Agents are powerful productivity assistants. They can plan, make decisions, and execute actions. Agents typically first [reason through user intents in conversations](#intent-resolution), [select the correct tools](#tool-call-accuracy) to call and satisfy the user requests, and [complete various tasks](#task-adherence) according to their instructions. We currently support these agent-specific evaluators for agentic workflows:
- [Intent resolution](#intent-resolution)
- [Tool call accuracy](#tool-call-accuracy)
- [Task adherence](#task-adherence)
## Evaluating Azure AI agents
Agents emit messages, and providing the above inputs typically requires parsing messages and extracting the relevant information. If you're building agents using Azure AI Agent Service, we provide native integration for evaluation that directly takes their agent messages. To learn more, see an [end-to-end example of evaluating agents in Azure AI Agent Service](https://aka.ms/e2e-agent-eval-sample).
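As a rough sketch (not the linked sample itself), the following shows how the `AIAgentConverter` helper in the `azure-ai-evaluation` package can turn an agent thread and run into evaluator-ready input. The endpoint, thread ID, and run ID are placeholders, and the exact project client construction can vary by SDK version.

```python
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.evaluation import AIAgentConverter

# Connect to the Azure AI project that hosts the agent (endpoint is a placeholder).
project_client = AIProjectClient(
    endpoint="https://<your-project-endpoint>",
    credential=DefaultAzureCredential(),
)

# Convert a specific agent thread and run into input that the evaluators can consume.
converter = AIAgentConverter(project_client)
evaluation_data = converter.convert(thread_id="<thread-id>", run_id="<run-id>")
```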
Besides `IntentResolution`, `ToolCallAccuracy`, and `TaskAdherence`, which are specific to agentic workflows, you can also assess other quality and safety aspects of your agentic workflows by using our comprehensive suite of built-in evaluators that support Azure AI agent messages from our converter.
In this article we show examples of `IntentResolution`, `ToolCallAccuracy`, and `TaskAdherence`. For examples of using other evaluators with Azure AI agent messages, see [evaluating Azure AI agents](../../how-to/develop/agent-evaluate-sdk.md#evaluate-azure-ai-agents).
> We recommend using `o3-mini` for a balance of reasoning capability and cost efficiency.
### Evaluator model support
We support Azure OpenAI or OpenAI [reasoning models](../../../ai-services/openai/how-to/reasoning.md) and non-reasoning models as the LLM judge, depending on the evaluator:
| Evaluators | Reasoning models as judge (example: o-series models from Azure OpenAI / OpenAI) | Non-reasoning models as judge (example: gpt-4.1, gpt-4o, etc.) | To enable |
| --- | --- | --- | --- |
| Other quality evaluators | Not supported | Supported | -- |
For complex evaluation that requires refined reasoning, we recommend a strong reasoning model like `o3-mini`, or the o-series mini models released after it, which balance reasoning performance with cost efficiency.
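For illustration, here's a minimal sketch of a judge model configuration using `AzureOpenAIModelConfiguration` from `azure-ai-evaluation`. The environment variable names, deployment name, and API version are placeholders you'd replace with your own values; the resulting `model_config` is reused by the evaluator sketches later in this article.

```python
import os
from azure.ai.evaluation import AzureOpenAIModelConfiguration

# Judge model configuration shared by the LLM-judged evaluators below.
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # placeholder env var
    api_key=os.environ["AZURE_OPENAI_API_KEY"],          # placeholder env var
    azure_deployment="o3-mini",                          # your judge deployment name
    api_version="2024-12-01-preview",                    # example API version
)
```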
## Intent resolution
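`IntentResolutionEvaluator` measures how well the agent identifies and resolves the user's intent. A minimal usage sketch follows, assuming the `model_config` defined in the evaluator model support section above; the query and response are illustrative.

```python
from azure.ai.evaluation import IntentResolutionEvaluator

intent_resolution = IntentResolutionEvaluator(model_config=model_config, threshold=3)
result = intent_resolution(
    query="What are the opening hours of the Eiffel Tower?",
    response="The Eiffel Tower is open daily from 9:00 AM to 11:00 PM.",
)
print(result)
```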
### Intent resolution output
The evaluator outputs a numerical score on a Likert scale (integer 1 to 5), and a higher score is better. Given a numerical threshold (default 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason and additional fields can help you understand why the score is high or low.
```python
{
    "intent_resolution": 5.0,            # score on a 1-5 Likert scale
    "intent_resolution_result": "pass",  # pass/fail against the threshold
    "intent_resolution_threshold": 3,
    "intent_resolution_reason": "..."    # illustrative output shape; fields abridged
}
```
## Tool call accuracy
`ToolCallAccuracyEvaluator` measures an agent's ability to select appropriate tools and to extract and process the correct parameters from previous steps of the agentic workflow. It detects whether each tool call made is accurate (binary) and reports the average score, which can be interpreted as a passing rate across the tool calls made.
> [!NOTE]
> `ToolCallAccuracyEvaluator` only supports Azure AI Agent's Function Tool evaluation, but doesn't support Built-in Tool evaluation. The agent messages must have at least one Function Tool actually called to be evaluated.
### Tool call accuracy example
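A minimal sketch, assuming the `model_config` defined in the evaluator model support section above. The `fetch_weather` function tool, its definition, and the tool call record are illustrative stand-ins for your agent's own function tools.

```python
from azure.ai.evaluation import ToolCallAccuracyEvaluator

tool_call_accuracy = ToolCallAccuracyEvaluator(model_config=model_config)
result = tool_call_accuracy(
    query="How is the weather in Seattle?",
    # Tool calls the agent actually made during the run.
    tool_calls=[
        {
            "type": "tool_call",
            "tool_call_id": "call_1",
            "name": "fetch_weather",
            "arguments": {"location": "Seattle"},
        }
    ],
    # Definitions of the function tools available to the agent.
    tool_definitions=[
        {
            "id": "fetch_weather",
            "name": "fetch_weather",
            "description": "Fetches the weather information for the specified location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The location to fetch weather for.",
                    }
                },
            },
        }
    ],
)
print(result)
```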
## Task adherence
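`TaskAdherenceEvaluator` measures how well the agent's response adheres to its assigned task and instructions. A minimal usage sketch follows, again assuming the `model_config` from the evaluator model support section; the gardening query and response are illustrative.

```python
from azure.ai.evaluation import TaskAdherenceEvaluator

task_adherence = TaskAdherenceEvaluator(model_config=model_config)
result = task_adherence(
    query="What are the best practices for maintaining a healthy rose garden during the summer?",
    response="Water deeply in the early morning, mulch to retain moisture, and inspect weekly for pests such as aphids.",
)
print(result)
```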
### Task adherence output
The evaluator outputs a numerical score on a Likert scale (integer 1 to 5), and a higher score is better. Given a numerical threshold (default 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
`articles/ai-foundry/concepts/evaluation-evaluators/general-purpose-evaluators.md`
author: lgayhardt
ms.author: lagayhar
manager: scottpolly
ms.reviewer: changliu2
ms.date: 07/16/2025
ms.service: azure-ai-foundry
ms.topic: reference
ms.custom:
# General purpose evaluators
AI systems might generate textual responses that are incoherent, or lack the general writing quality you might desire beyond minimum grammatical correctness. To address these issues, we support evaluating:
- [Coherence](#coherence)
- [Fluency](#fluency)
If you have a question-answering (QA) scenario with both `context` and `ground truth` data in addition to `query` and `response`, you can also use our [QAEvaluator](#question-answering-composite-evaluator), a composite evaluator that uses the relevant evaluators for judgment.
> We recommend using `o3-mini` for a balance of reasoning capability and cost efficiency.
### Evaluator model support
We support Azure OpenAI or OpenAI [reasoning models](../../../ai-services/openai/how-to/reasoning.md) and non-reasoning models as the LLM judge, depending on the evaluator:
| Evaluators | Reasoning models as judge (example: o-series models from Azure OpenAI / OpenAI) | Non-reasoning models as judge (example: gpt-4.1, gpt-4o, etc.) | To enable |
| --- | --- | --- | --- |
| Other quality evaluators | Not supported | Supported | -- |
For complex evaluation that requires refined reasoning, we recommend a strong reasoning model like `o3-mini`, or the o-series mini models released after it, which balance reasoning performance with cost efficiency.
## Coherence
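`CoherenceEvaluator` measures how logically organized and easy to follow a response is. A minimal usage sketch follows; the judge `model_config` values and the query/response pair are placeholders for illustration.

```python
from azure.ai.evaluation import CoherenceEvaluator

# Judge model configuration; endpoint, key, and deployment are placeholders.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "o3-mini",
}

coherence = CoherenceEvaluator(model_config=model_config, threshold=3)
result = coherence(
    query="Is Marie Curie known for her work on radioactivity?",
    response="Yes. Marie Curie is famous for her pioneering research on radioactivity, for which she won two Nobel Prizes.",
)
print(result)
```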
### Coherence output
The evaluator outputs a numerical score on a Likert scale (integer 1 to 5), and a higher score is better. Given a numerical threshold (default 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
```python
{
    "coherence": 4.0,            # score on a 1-5 Likert scale
    "coherence_result": "pass",  # pass/fail against the threshold
    "coherence_threshold": 3,
    "coherence_reason": "..."    # illustrative output shape; fields abridged
}
```
## Fluency
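`FluencyEvaluator` measures grammatical correctness and natural writing quality. A minimal usage sketch follows, reusing the `model_config` from the coherence sketch above and assuming the evaluator accepts a standalone response; the response text is illustrative.

```python
from azure.ai.evaluation import FluencyEvaluator

fluency = FluencyEvaluator(model_config=model_config, threshold=3)
result = fluency(
    response="The capital of Japan is Tokyo, which is also the country's largest city."
)
print(result)
```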
### Fluency output
The evaluator outputs a numerical score on a Likert scale (integer 1 to 5), and a higher score is better. Given a numerical threshold (default 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
```python
{
    "fluency": 4.0,            # score on a 1-5 Likert scale
    "fluency_result": "pass",  # pass/fail against the threshold
    "fluency_threshold": 3,
    "fluency_reason": "..."    # illustrative output shape; fields abridged
}
```
## Question answering composite evaluator
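`QAEvaluator` combines the relevant built-in evaluators for a question-answering scenario. A minimal usage sketch follows, reusing the `model_config` from the coherence sketch above; the query, response, context, and ground truth values are illustrative.

```python
from azure.ai.evaluation import QAEvaluator

qa_eval = QAEvaluator(model_config=model_config)
result = qa_eval(
    query="Where was Marie Curie born?",
    response="Marie Curie was born in Warsaw.",
    context="Marie Curie was a Polish-born physicist and chemist born in Warsaw in 1867.",
    ground_truth="Marie Curie was born in Warsaw, Poland.",
)
print(result)
```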
### QA output
While F1 score outputs a numerical score on a 0-1 float scale, the other evaluators output numerical scores on a Likert scale (integer 1 to 5), and a higher score is better. Given a numerical threshold (default 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.