Commit 08c1a9f

Merge pull request #7714 from TimShererWithAquent/us496641-12
Freshness Edit: AI Foundry: Agent evaluators
2 parents c1c5136 + 9da7f01 commit 08c1a9f

File tree

1 file changed: +26 -25 lines changed

articles/ai-foundry/concepts/evaluation-evaluators/agent-evaluators.md

Lines changed: 26 additions & 25 deletions
@@ -1,11 +1,11 @@
 ---
-title: Agent evaluators for generative AI
+title: Agent Evaluators for Generative AI
 titleSuffix: Azure AI Foundry
 description: Learn how to evaluate Azure AI agents using intent resolution, tool call accuracy, and task adherence evaluators.
 author: lgayhardt
 ms.author: lagayhar
 ms.reviewer: changliu2
-ms.date: 07/15/2025
+ms.date: 10/17/2025
 ms.service: azure-ai-foundry
 ms.topic: reference
 ms.custom:
@@ -17,26 +17,26 @@ ms.custom:
 
 [!INCLUDE [feature-preview](../../includes/feature-preview.md)]
 
-Agents are powerful productivity assistants. They can plan, make decisions, and execute actions. Agents typically first [reason through user intents in conversations](#intent-resolution), [select the correct tools](#tool-call-accuracy) to call and satisfy the user requests, and [complete various tasks](#task-adherence) according to their instructions. We currently support these agent-specific evaluators for agentic workflows:
+Agents are powerful productivity assistants. They can plan, make decisions, and execute actions. Agents typically first [reason through user intents in conversations](#intent-resolution), [select the correct tools](#tool-call-accuracy) to call and satisfy the user requests, and [complete various tasks](#task-adherence) according to their instructions. Azure AI Foundry currently supports these agent-specific evaluators for agentic workflows:
 
 - [Intent resolution](#intent-resolution)
 - [Tool call accuracy](#tool-call-accuracy)
 - [Task adherence](#task-adherence)
 
 ## Evaluating Azure AI agents
 
-Agents emit messages, and providing the above inputs typically require parsing messages and extracting the relevant information. If you're building agents using Azure AI Agent Service, we provide native integration for evaluation that directly takes their agent messages. To learn more, see an [end-to-end example of evaluating agents in Azure AI Agent Service](https://aka.ms/e2e-agent-eval-sample).
+Agents emit messages. Providing inputs typically requires parsing messages and extracting the relevant information. If you're building agents using Azure AI Agent Service, the service provides native integration for evaluation that directly takes their agent messages. For an example, see [Evaluate AI agents](https://aka.ms/e2e-agent-eval-sample).
 
-Besides `IntentResolution`, `ToolCallAccuracy`, `TaskAdherence` specific to agentic workflows, you can also assess other quality and safety aspects of your agentic workflows, using our comprehensive suite of built-in evaluators. We support this list of evaluators for Azure AI agent messages from our converter:
+Besides `IntentResolution`, `ToolCallAccuracy`, and `TaskAdherence` specific to agentic workflows, you can also assess other quality and safety aspects of your agentic workflows, using a comprehensive suite of built-in evaluators. Azure AI Foundry supports this list of evaluators for Azure AI agent messages from our converter:
 
 - **Quality**: `IntentResolution`, `ToolCallAccuracy`, `TaskAdherence`, `Relevance`, `Coherence`, `Fluency`
-- **Safety**: `CodeVulnerabilities`, `Violence`, `Self-harm`, `Sexual`, `HateUnfairness`, `IndirectAttack`, `ProtectedMaterials`.
+- **Safety**: `CodeVulnerabilities`, `Violence`, `Self-harm`, `Sexual`, `HateUnfairness`, `IndirectAttack`, `ProtectedMaterials`
 
-In this article we show examples of `IntentResolution`, `ToolCallAccuracy`, and `TaskAdherence`. For examples of using other evaluators with Azure AI agent messages, see [evaluating Azure AI agents](../../how-to/develop/agent-evaluate-sdk.md#evaluate-azure-ai-agents).
+This article shows examples of `IntentResolution`, `ToolCallAccuracy`, and `TaskAdherence`. For examples of using other evaluators with Azure AI agent messages, see [evaluating Azure AI agents](../../how-to/develop/agent-evaluate-sdk.md#evaluate-azure-ai-agents).
 
 ## Model configuration for AI-assisted evaluators
 
-For reference in the following code snippets, the AI-assisted evaluators use a model configuration for the LLM-judge:
+For reference in the following code snippets, the AI-assisted evaluators use a model configuration for the large language model-judge (LLM-judge):
 
 ```python
 import os
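The diff truncates the configuration snippet here. As a minimal sketch of the full block it opens, matching the `model_config = AzureOpenAIModelConfiguration(` context shown in the next hunk (the environment variable names are illustrative assumptions, not part of the original):

```python
import os
from azure.ai.evaluation import AzureOpenAIModelConfiguration

# Illustrative environment variable names; substitute your own deployment details.
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION"],
)
```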
@@ -54,18 +54,18 @@ model_config = AzureOpenAIModelConfiguration(
 
 ### Evaluator models support
 
-We support AzureOpenAI or OpenAI [reasoning models](../../../ai-services/openai/how-to/reasoning.md) and non-reasoning models for the LLM-judge depending on the evaluators:
+Azure AI Agent Service supports AzureOpenAI or OpenAI [reasoning models](../../../ai-services/openai/how-to/reasoning.md) and non-reasoning models for the LLM-judge depending on the evaluators:
 
-| Evaluators | Reasoning Models as Judge (example: o-series models from Azure OpenAI / OpenAI) | Non-reasoning models as Judge (example: gpt-4.1, gpt-4o, etc.) | To enable |
+| Evaluators | Reasoning Models as Judge (example: o-series models from Azure OpenAI / OpenAI) | Non-reasoning models as Judge (example: gpt-4.1 or gpt-4o) | To enable |
 |--|--|--|--|
 | `Intent Resolution`, `Task Adherence`, `Tool Call Accuracy`, `Response Completeness` | Supported | Supported | Set additional parameter `is_reasoning_model=True` in initializing evaluators |
-| Other quality evaluators| Not Supported | Supported | -- |
+| Other quality evaluators| Not Supported | Supported |--|
 
-For complex evaluation that requires refined reasoning, we recommend a strong reasoning model like `o3-mini` and o-series mini models released afterwards with a balance of reasoning performance and cost efficiency.
+For complex evaluation that requires refined reasoning, we recommend a strong reasoning model with a balance of reasoning performance and cost efficiency, like `o3-mini` and o-series mini models released afterwards.
 
 ## Intent resolution
 
-`IntentResolutionEvaluator` measures how well the system identifies and understands a user's request, including how well it scopes the user's intent, asks clarifying questions, and reminds end users of its scope of capabilities. Higher score means better identification of user intent.
+`IntentResolutionEvaluator` measures how well the system identifies and understands a user's request. This understanding includes how well it scopes the user's intent, asks questions to clarify, and reminds end users of its scope of capabilities. Higher score means better identification of user intent.
 
 ### Intent resolution example
 
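Per the table above, the agentic evaluators take an additional `is_reasoning_model` parameter when the LLM-judge is a reasoning model. A minimal sketch, assuming the `model_config` from the earlier snippet points at an o-series deployment:

```python
from azure.ai.evaluation import IntentResolutionEvaluator

# Set is_reasoning_model=True when the judge deployment is an o-series
# reasoning model (see the "Evaluator models support" table above).
intent_resolution = IntentResolutionEvaluator(
    model_config=model_config,
    is_reasoning_model=True,
)
```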
@@ -82,7 +82,7 @@ intent_resolution(
 
 ### Intent resolution output
 
-The numerical score is on a Likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason and additional fields can help you understand why the score is high or low.
+The numerical score is on a Likert scale (integer 1 to 5). A higher score is better. Given a numerical threshold (default to 3), the evaluator also outputs *pass* if the score >= threshold, or *fail* otherwise. Using the reason and other fields can help you understand why the score is high or low.
 
 ```python
 {
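The example call referenced by the `intent_resolution(` hunk context isn't shown in this diff. As a hedged sketch of its typical shape (the query and response strings are invented for illustration; real inputs can also be agent message lists):

```python
# Illustrative single-turn inputs for the evaluator initialized above.
result = intent_resolution(
    query="What are your opening hours?",
    response="We're open Monday through Friday, 9 AM to 5 PM.",
)
print(result)  # Includes the score, a pass/fail result, and a reason field.
```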
@@ -101,14 +101,15 @@ The numerical score is on a Likert scale (integer 1 to 5) and a higher score is
 
 ```
 
-If you're building agents outside of Azure AI Foundry Agent Service, this evaluator accepts a schema typical for agent messages. To learn more, see our sample notebook for [Intent Resolution](https://aka.ms/intentresolution-sample).
+If you're building agents outside of Azure AI Foundry Agent Service, this evaluator accepts a schema typical for agent messages. For a sample notebook, see [Intent Resolution](https://aka.ms/intentresolution-sample).
 
 ## Tool call accuracy
 
-`ToolCallAccuracyEvaluator` measures the accuracy and efficiency of tool calls made by an agent in a run. It provides a 1-5 score based on:
-- the relevance and helpfulness of the tool invoked;
-- the correctness of parameters used in tool calls;
-- the counts of missing or excessive calls.
+`ToolCallAccuracyEvaluator` measures the accuracy and efficiency of tool calls made by an agent in a run. It provides a 1-5 score based on:
+
+- The relevance and helpfulness of the tool invoked
+- The correctness of parameters used in tool calls
+- The counts of missing or excessive calls
 
 #### Tool call evaluation support
 
@@ -124,7 +125,7 @@ If you're building agents outside of Azure AI Foundry Agent Service, this evalua
 - OpenAPI
 - Function Tool (user-defined tools)
 
-However, if a non-supported tool is used in the agent run, it outputs a "pass" and a reason that evaluating the invoked tool(s) isn't supported, for ease of filtering out these cases. It's recommended that you wrap non-supported tools as user-defined tools to enable evaluation.
+If a non-supported tool is used in the agent run, the evaluator outputs a *pass* and a reason that evaluating the invoked tools isn't supported. This approach makes it easy to filter out these cases. We recommend that you wrap non-supported tools as user-defined tools to enable evaluation.
 
 ### Tool call accuracy example
 
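The tool call accuracy example itself is elided from this diff. As a rough sketch of how such an evaluation might be wired up, assuming the evaluator accepts a query plus tool calls and tool definitions (the `fetch_weather` tool and all payload shapes below are illustrative assumptions):

```python
from azure.ai.evaluation import ToolCallAccuracyEvaluator

tool_call_accuracy = ToolCallAccuracyEvaluator(model_config=model_config)

# Hypothetical tool call and definition; a real run would pass the agent's
# actual tool invocations and the tool schemas it had available.
result = tool_call_accuracy(
    query="How is the weather in Seattle?",
    tool_calls=[
        {
            "type": "tool_call",
            "tool_call_id": "call_1",
            "name": "fetch_weather",
            "arguments": {"location": "Seattle"},
        }
    ],
    tool_definitions=[
        {
            "name": "fetch_weather",
            "description": "Fetches weather information for the specified location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name."}
                },
            },
        }
    ],
)
```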
@@ -235,7 +236,7 @@ tool_call_accuracy(
 
 ### Tool call accuracy output
 
-The numerical score is on a Likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason and tool call detail fields can help you understand why the score is high or low.
+The numerical score is on a Likert scale (integer 1 to 5). A higher score is better. Given a numerical threshold (default to 3), the evaluator also outputs *pass* if the score >= threshold, or *fail* otherwise. Use the reason and tool call detail fields to understand why the score is high or low.
 
 ```python
 {
@@ -267,11 +268,11 @@ The numerical score is on a Likert scale (integer 1 to 5) and a higher score is
 }
 ```
 
-If you're building agents outside of Azure AI Agent Service, this evaluator accepts a schema typical for agent messages. To learn more, see, our sample notebook for [Tool Call Accuracy](https://aka.ms/toolcallaccuracy-sample).
+If you're building agents outside of Azure AI Agent Service, this evaluator accepts a schema typical for agent messages. For a sample notebook, see [Tool Call Accuracy](https://aka.ms/toolcallaccuracy-sample).
 
 ## Task adherence
 
-In various task-oriented AI systems such as agentic systems, it's important to assess whether the agent has stayed on track to complete a given task instead of making inefficient or out-of-scope steps. `TaskAdherenceEvaluator` measures how well an agent's response adheres to their assigned tasks, according to their task instruction (extracted from system message and user query), and available tools. Higher score means better adherence of the system instruction to resolve the given task.
+In various task-oriented AI systems, such as agentic systems, it's important to assess whether the agent stays on track to complete a task instead of making inefficient or out-of-scope steps. `TaskAdherenceEvaluator` measures how well an agent's response adheres to their assigned tasks, according to their task instruction and available tools. The task instruction is extracted from system message and user query. Higher score means better adherence of the system instruction to resolve the task.
 
 ### Task adherence example
 
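The task adherence example is likewise elided from this diff. A minimal sketch under the same assumptions as the earlier snippets (the query and response strings are invented for illustration):

```python
from azure.ai.evaluation import TaskAdherenceEvaluator

task_adherence = TaskAdherenceEvaluator(model_config=model_config)

# Illustrative inputs; a real run would pass the agent's actual task
# instruction (system message plus user query) and its response.
result = task_adherence(
    query="Recommend three beginner-friendly hiking trails near Seattle.",
    response="Here are three easy trails near Seattle: Twin Falls, Rattlesnake Ledge, and Franklin Falls.",
)
```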
@@ -287,7 +288,7 @@ task_adherence(
 
 ### Task adherence output
 
-The numerical score is on a Likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+The numerical score is on a Likert scale (integer 1 to 5). A higher score is better. Given a numerical threshold (default to 3), the evaluator also outputs *pass* if the score >= threshold, or *fail* otherwise. Use the reason field to understand why the score is high or low.
 
 ```python
 {
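The pass/fail rule described in the output sections above is simple to restate in code. This standalone sketch mirrors the documented behavior (the function name is generic, not an SDK API):

```python
def to_pass_fail(score: int, threshold: int = 3) -> str:
    """Mirror the documented rule: pass when the Likert score meets the threshold."""
    return "pass" if score >= threshold else "fail"

assert to_pass_fail(4) == "pass"
assert to_pass_fail(2) == "fail"
```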
@@ -298,7 +299,7 @@ The numerical score is on a Likert scale (integer 1 to 5) and a higher score is
 }
 ```
 
-If you're building agents outside of Azure AI Agent Service, this evaluator accepts a schema typical for agent messages. To learn more, see our sample notebook for [Task Adherence](https://aka.ms/taskadherence-sample).
+If you're building agents outside of Azure AI Agent Service, this evaluator accepts a schema typical for agent messages. For a sample notebook, see [Task Adherence](https://aka.ms/taskadherence-sample).
 
 ## Related content
 