Commit 995d67d

committed
updates
1 parent b6e90ad commit 995d67d

File tree

2 files changed: +18 -16 lines changed


articles/ai-foundry/how-to/develop/agent-evaluate-sdk.md

Lines changed: 18 additions & 16 deletions
@@ -5,47 +5,48 @@ description: This article provides instructions on how to evaluate an AI agent w
  manager: scottpolly
  ms.service: azure-ai-foundry
  ms.custom:
- - build-2024
+ - build-2025
  - references_regions
- - ignite-2024
  ms.topic: how-to
- ms.date: 03/26/2025
+ ms.date: 04/04/2025
  ms.reviewer: changliu2
  ms.author: lagayhar
  author: lgayhardt
  ---
- # Evaluate your AI agents locally with the Azure AI Evaluation SDK
+ # Evaluate your AI agents locally with the Azure AI Evaluation SDK (preview)

  [!INCLUDE [feature-preview](../../includes/feature-preview.md)]

- AI Agents are powerful productivity assistants to create workflows for business needs. However, they come with challenges for observability due to their complex interaction patterns. To build production-ready agentic applications and enable observability and transparency, developers need tools to assess not just the final output from an agent’s workflows, but the quality and efficiency of the workflows themselves. For example, consider a typical agentic workflow:
+ AI Agents are powerful productivity assistants to create workflows for business needs. However, they come with challenges for observability due to their complex interaction patterns. In this article, you learn how to run built-in evaluators locally on simple agent data or agent messages to thoroughly assess the performance of your AI agents.

- ![alt text](agent-eval-10-sec-gif.gif)
+ To build production-ready agentic applications and enable observability and transparency, developers need tools to assess not just the final output from an agent's workflows, but the quality and efficiency of the workflows themselves. For example, consider a typical agentic workflow:

+ :::image type="content" source="../../media/evaluations/agent-workflow-eval.gif" alt-text="Animation of the agent's workflow from user query to intent resolution to tool calls to final response." lightbox="../../media/evaluations/agent-workflow-eval.gif":::

- The agentic workflow is triggered by a user query "weather tomorrow". It starts to execute multiple steps, such as reasoning through user intents, tool calling, and utilizing retrieval-augmented generation to produce a final response. In this process, evaluating each steps of the workflow—along with the quality and safety of the final output—is crucial. Specifically, we formulate these steps into the following evaluators for agents:
+ The agentic workflow is triggered by a user query "weather tomorrow". It starts to execute multiple steps, such as reasoning through user intents, tool calling, and utilizing retrieval-augmented generation to produce a final response. In this process, evaluating each step of the workflow, along with the quality and safety of the final output, is crucial. Specifically, we formulate these evaluation aspects into the following evaluators for agents:
  - [Intent resolution](https://aka.ms/intentresolution-sample): Measures how well the agent identifies the user’s request, including how well it scopes the user’s intent, asks clarifying questions, and reminds end users of its scope of capabilities.
  - [Tool call accuracy](https://aka.ms/toolcallaccuracy-sample): Evaluates the agent’s ability to select the appropriate tools, and process correct parameters from previous steps.
- - [Task adherence](https://aka.ms/taskadherence-sample): Measures how well the agent’s response adheres to its assigned tasks, according to its system message and prior steps.
+ - [Task adherence](https://aka.ms/taskadherence-sample): Measures how well the agent’s final response adheres to its assigned tasks, according to its system message and prior steps.

+ For other quality and risk-and-safety evaluation aspects, you can use other [built-in evaluators](./evaluate-sdk.md#data-requirements-for-built-in-evaluators) to assess the content in the process where appropriate.
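
To make the three evaluation aspects concrete, here's a toy sketch of the workflow stages they score, using the "weather tomorrow" example from above. Every function name and branch below is invented for illustration; none of it comes from the azure-ai-evaluation SDK:

```python
# Toy sketch of the agentic workflow stages that the evaluators above assess.
# All names and logic here are illustrative, not part of azure-ai-evaluation.

def resolve_intent(query: str) -> str:
    # Intent resolution: scope the user's request to a known capability.
    return "get_weather" if "weather" in query.lower() else "unknown"

def call_tool(intent: str) -> dict:
    # Tool call accuracy: pick the right tool and bind correct parameters
    # from the previous step.
    if intent == "get_weather":
        return {"tool": "fetch_weather", "arguments": {"when": "tomorrow"}}
    return {"tool": None, "arguments": {}}

def final_response(tool_result: dict) -> str:
    # Task adherence: the final answer should follow from the prior steps
    # and the agent's assigned task.
    return f"Tomorrow's forecast: {tool_result['forecast']}."

intent = resolve_intent("weather tomorrow")
tool_call = call_tool(intent)
tool_result = {"forecast": "sunny"}  # stand-in for the tool's output
answer = final_response(tool_result)
print(intent, tool_call["tool"], answer)
```

Each evaluator targets one of these stages rather than only the final string, which is why they need richer inputs than plain query/response pairs.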

- In this article, you learn how to run built-in evaluators locally on simple agent data or agent messages with built-in evaluators to thoroughly assess the performance of your AI agents.

  ## Getting started

  First, install the evaluators package from the Azure AI Evaluation SDK:

  ```bash
  pip install azure-ai-evaluation
  ```

  ### Evaluators with agent message support

- Agents typically emit messages to interact with a user or other agents. Our built-in evaluators can accept simple data types such as strings in `query`, `response`, `ground_truth` according to the [single-turn data input requirements](./evaluate-sdk.md#data-requirements-for-built-in-evaluators). However, to extract simple data from agent messages can be a challenge, due to its complex interaction patterns. For example, as mentioned, a single user query can trigger a long list of agent messages, typically with multiple tool calls invoked.
+ Agents typically emit messages to interact with a user or other agents. Our built-in evaluators can accept simple data types such as strings in `query`, `response`, and `ground_truth` according to the [single-turn data input requirements](./evaluate-sdk.md#data-requirements-for-built-in-evaluators). However, extracting this simple data from agent messages can be a challenge, due to the complex interaction patterns of agents and framework differences. For example, as mentioned, a single user query can trigger a long list of agent messages, typically with multiple tool calls invoked.

- As illustrated in the example, we enabled agent message support specifically for these built-in evaluators to evaluate these aspects of agentic workflow. These evaluators take `tool_calls` or `tool_definitions` as parameters as they're unique to agents.
+ As illustrated in the example, we enabled agent message support specifically for these built-in evaluators to evaluate these aspects of agentic workflows. These evaluators take `tool_calls` or `tool_definitions` as parameters, as they're unique to agents.

  | Evaluator | `query` | `response` | `tool_calls` | `tool_definitions` |
  |----------------|---------------|---------------|---------------|---------------|
@@ -57,15 +58,16 @@ As illustrated in the example, we enabled agent message support specifically for
  - `ToolCall`: `dict` specifying tool calls invoked during agent interactions with a user.
  - `ToolDefinition`: `dict` describing the tools available to an agent.

- For `ToolCallAccuracyEvaluator`, either `response` or `tool_calls` must be provided. We will demonstrate some examples of the two data formats: simple agent data, and agent messages. However, due to the unique requirements of these evaluators, we recommend referring to the [sample notebooks](#sample-notebooks) which illustrate the possible input paths for each evaluator.
+ For `ToolCallAccuracyEvaluator`, either `response` or `tool_calls` must be provided.

+ We demonstrate some examples of the two data formats: simple agent data and agent messages. However, due to the unique requirements of these evaluators, we recommend referring to the [sample notebooks](#sample-notebooks), which illustrate the possible input paths for each evaluator.

  As with other [built-in AI-assisted quality evaluators](#performance-and-quality-evaluators), `IntentResolutionEvaluator` and `TaskAdherenceEvaluator` output a Likert score (integer 1-5), where a higher score indicates a better result. `ToolCallAccuracyEvaluator` outputs the passing rate of all tool calls made (a float between 0 and 1) based on the user query. To further improve intelligibility, all evaluators accept a binary threshold and output two new keys. For the binarization threshold, a default is set and users can override it. The two new keys are:

  - `{metric_name}_result`: a "pass" or "fail" string based on the binarization threshold.
  - `{metric_name}_threshold`: a numerical binarization threshold set by default or by the user.
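
To make the thresholding concrete, here's a minimal sketch of how a Likert score maps to the two extra output keys. The logic mimics the behavior described above; it is not the SDK's internal code, and the default threshold of 3 is an assumption for this sketch:

```python
# Illustrative binarization of an evaluator score into the two extra keys
# described above. Not the SDK's internal implementation.

def binarize(metric_name: str, score: int, threshold: int = 3) -> dict:
    return {
        metric_name: score,
        f"{metric_name}_result": "pass" if score >= threshold else "fail",
        f"{metric_name}_threshold": threshold,
    }

result = binarize("intent_resolution", score=4)
print(result)
# {'intent_resolution': 4, 'intent_resolution_result': 'pass', 'intent_resolution_threshold': 3}
```

Overriding the threshold only changes where the pass/fail cutoff sits; the underlying score is reported unchanged.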

-
  #### Simple agent data

  In simple agent data format, `query` and `response` are simple Python strings. For example:
@@ -242,7 +244,7 @@ print(result)

  #### Converter support

- Transforming agent messages into the right evaluation data to use our evaluators can be a nontrivial task. If you use [Azure AI Agent Service](https://learn.microsoft.com/azure/ai-services/agents/overview), however, you can seamlessly evaluate your agents via our converter support for Azure AI agent threads and runs. Here's an example to create an Azure AI agent and some data for evaluation:
+ Transforming agent messages into the right evaluation data to use our evaluators can be a nontrivial task. If you use [Azure AI Agent Service](../../ai-services/agents/overview.md), however, you can seamlessly evaluate your agents via our converter support for Azure AI agent threads and runs. Here's an example to create an Azure AI agent and some data for evaluation:

  ```bash
  pip install azure-ai-projects azure-identity
  ```
@@ -343,7 +345,7 @@ from azure.ai.evaluation import AIAgentConverter
  # Initialize the converter for Azure AI agents
  converter = AIAgentConverter(project_client)

- # Sepcify the thread and run id
+ # Specify the thread and run id
  thread_id = thread.id
  run_id = run.id
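
With the thread and run IDs captured, the converter can turn the thread into evaluator-ready records. As a hedged sketch of what you might do next, the snippet below persists a hand-written stand-in record to JSONL for batch evaluation. The field names are assumptions modeled on the evaluator inputs described earlier, not actual converter output:

```python
import json
import os
import tempfile

# Hand-written stand-in for a converted record; a real one would come from the
# converter, and its exact field names may differ from this assumption.
converted_record = {
    "query": "weather tomorrow",
    "response": "Tomorrow will be sunny.",
    "tool_calls": [{"name": "fetch_weather", "arguments": {"when": "tomorrow"}}],
}

# Persist one JSON object per line (JSONL), a common batch-evaluation input format.
path = os.path.join(tempfile.gettempdir(), "evaluation_data.jsonl")
with open(path, "w") as f:
    f.write(json.dumps(converted_record) + "\n")

# Read it back to confirm the round trip.
with open(path) as f:
    loaded = json.loads(f.readline())
```

Writing one record per line lets you append each converted thread/run pair to the same file and feed the whole file to a batch evaluation later.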
