Agents are powerful productivity assistants. They can plan, make decisions, and execute actions. Agents typically first [reason through user intents in conversations](#intent-resolution), [select the correct tools](#tool-call-accuracy) to call and satisfy the user requests, and [complete various tasks](#task-adherence) according to their instructions. Azure AI Foundry currently supports these agent-specific evaluators for agentic workflows:

- [Intent resolution](#intent-resolution)
- [Tool call accuracy](#tool-call-accuracy)
- [Task adherence](#task-adherence)

## Evaluating Azure AI agents
Agents emit messages. Providing inputs typically requires parsing messages and extracting the relevant information. If you're building agents using Azure AI Agent Service, the service provides native integration for evaluation that directly takes their agent messages. For an example, see [Evaluate AI agents](https://aka.ms/e2e-agent-eval-sample).
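
For orientation, the following is a minimal sketch of that flow. It assumes the `azure-ai-projects` and `azure-ai-evaluation` packages; the connection string, thread ID, and run ID are placeholders, and class and method names can differ across SDK versions, so treat this as an outline rather than a definitive implementation.

```python
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import AIAgentConverter

# Connect to the Azure AI Foundry project that hosts the agent.
project_client = AIProjectClient.from_connection_string(
    conn_str="<your-project-connection-string>",
    credential=DefaultAzureCredential(),
)

# Convert the messages of an existing agent thread and run into
# evaluator-ready input (query, response, tool calls, and tool definitions).
converter = AIAgentConverter(project_client)
evaluation_data = converter.convert(thread_id="<thread-id>", run_id="<run-id>")
```
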
Besides `IntentResolution`, `ToolCallAccuracy`, and `TaskAdherence`, which are specific to agentic workflows, you can also assess other quality and safety aspects of your agentic workflows by using a comprehensive suite of built-in evaluators. Azure AI Foundry supports many of these built-in evaluators for Azure AI agent messages from the converter.

This article shows examples of `IntentResolution`, `ToolCallAccuracy`, and `TaskAdherence`. For examples of using other evaluators with Azure AI agent messages, see [evaluating Azure AI agents](../../how-to/develop/agent-evaluate-sdk.md#evaluate-azure-ai-agents).
## Model configuration for AI-assisted evaluators
For reference in the following code snippets, the AI-assisted evaluators use a model configuration for the large language model (LLM) judge.
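
A minimal sketch of such a configuration, assuming the `azure-ai-evaluation` package and Azure OpenAI settings supplied through environment variables (the variable names here are illustrative), looks like this:

```python
import os
from azure.ai.evaluation import AzureOpenAIModelConfiguration

# Configuration for the LLM judge used by the AI-assisted evaluators.
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_deployment=os.environ["MODEL_DEPLOYMENT_NAME"],
    api_version=os.environ["MODEL_API_VERSION"],
)
```
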
The evaluators support Azure OpenAI or OpenAI [reasoning models](../../../ai-services/openai/how-to/reasoning.md) and non-reasoning models as the LLM judge, depending on the evaluator:

| Evaluators | Reasoning Models as Judge (example: o-series models from Azure OpenAI / OpenAI) | Non-reasoning models as Judge (example: gpt-4.1 or gpt-4o) | To enable |
|--|--|--|--|
| Other quality evaluators | Not Supported | Supported | -- |

For complex evaluation that requires refined reasoning, we recommend a strong reasoning model that balances reasoning performance and cost efficiency, such as `o3-mini` or a later o-series mini model.
## Intent resolution
`IntentResolutionEvaluator` measures how well the system identifies and understands a user's request. This includes how well it scopes the user's intent, asks clarifying questions, and reminds end users of its scope of capabilities. A higher score means better identification of user intent.
### Intent resolution example
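
As a sketch, assuming the `azure-ai-evaluation` package and the `model_config` defined earlier (the query and response strings are illustrative), a call looks like this:

```python
from azure.ai.evaluation import IntentResolutionEvaluator

# Create the evaluator with the LLM-judge configuration defined earlier.
intent_resolution = IntentResolutionEvaluator(model_config=model_config)

# Evaluate a single query/response pair; the strings are illustrative.
result = intent_resolution(
    query="What are the opening hours of the Eiffel Tower?",
    response="The Eiffel Tower is open from 9:00 AM to 11:00 PM.",
)
print(result)
```
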
### Intent resolution output
The numerical score is on a Likert scale (integer 1 to 5). A higher score is better. Given a numerical threshold (default of 3), the evaluator also outputs *pass* if the score >= threshold, or *fail* otherwise. The reason and other output fields can help you understand why the score is high or low.

```python
{
    ...
}
```
If you're building agents outside of Azure AI Foundry Agent Service, this evaluator accepts a schema typical for agent messages. For a sample notebook, see [Intent Resolution](https://aka.ms/intentresolution-sample).
## Tool call accuracy
`ToolCallAccuracyEvaluator` measures the accuracy and efficiency of tool calls made by an agent in a run. It provides a 1-5 score based on:

- The relevance and helpfulness of the tool invoked
- The correctness of parameters used in tool calls
- The counts of missing or excessive calls

#### Tool call evaluation support
Supported tool types include:

- OpenAPI
- Function Tool (user-defined tools)

If an unsupported tool is used in the agent run, the evaluator outputs a *pass* and a reason that evaluating the invoked tools isn't supported, which makes it easy to filter out these cases. We recommend that you wrap unsupported tools as user-defined tools to enable evaluation.
### Tool call accuracy example
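
As a sketch, assuming the `azure-ai-evaluation` package and the `model_config` defined earlier (the tool call and tool definition payloads are illustrative), an invocation looks like this:

```python
from azure.ai.evaluation import ToolCallAccuracyEvaluator

tool_call_accuracy = ToolCallAccuracyEvaluator(model_config=model_config)

# Evaluate the tool calls the agent made for a query, against the tool
# definitions it had available. The payloads below are illustrative.
result = tool_call_accuracy(
    query="How is the weather in Seattle?",
    tool_calls=[
        {
            "type": "tool_call",
            "tool_call_id": "call_weather_1",
            "name": "fetch_weather",
            "arguments": {"location": "Seattle"},
        }
    ],
    tool_definitions=[
        {
            "name": "fetch_weather",
            "description": "Fetches the weather information for the specified location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The location to fetch weather for.",
                    }
                },
            },
        }
    ],
)
print(result)
```
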
### Tool call accuracy output
The numerical score is on a Likert scale (integer 1 to 5). A higher score is better. Given a numerical threshold (default of 3), the evaluator also outputs *pass* if the score >= threshold, or *fail* otherwise. Use the reason and tool call detail fields to understand why the score is high or low.

```python
{
    ...
}
```
If you're building agents outside of Azure AI Agent Service, this evaluator accepts a schema typical for agent messages. For a sample notebook, see [Tool Call Accuracy](https://aka.ms/toolcallaccuracy-sample).
## Task adherence
In task-oriented AI systems such as agentic systems, it's important to assess whether the agent stays on track to complete a given task instead of taking inefficient or out-of-scope steps. `TaskAdherenceEvaluator` measures how well an agent's response adheres to its assigned task, according to its task instruction and available tools. The task instruction is extracted from the system message and the user query. A higher score means better adherence to the system instructions for resolving the given task.
### Task adherence example
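
As a sketch, assuming the `azure-ai-evaluation` package and the `model_config` defined earlier (the query and response strings are illustrative), an invocation looks like this:

```python
from azure.ai.evaluation import TaskAdherenceEvaluator

task_adherence = TaskAdherenceEvaluator(model_config=model_config)

# Evaluate whether the agent's response stayed on task for the user's query.
# The strings below are illustrative.
result = task_adherence(
    query="What are the best practices for maintaining a healthy rose garden during the summer?",
    response=(
        "Water deeply in the early morning, apply two to three inches of mulch to retain "
        "moisture, and deadhead spent blooms regularly to encourage new growth."
    ),
)
print(result)
```
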
### Task adherence output
The numerical score is on a Likert scale (integer 1 to 5). A higher score is better. Given a numerical threshold (default of 3), the evaluator also outputs *pass* if the score >= threshold, or *fail* otherwise. Use the reason field to understand why the score is high or low.

```python
{
    ...
}
```
If you're building agents outside of Azure AI Agent Service, this evaluator accepts a schema typical for agent messages. For a sample notebook, see [Task Adherence](https://aka.ms/taskadherence-sample).