
Commit b66b634
headings, edits, acrolinx
1 parent 668eaf2 commit b66b634

1 file changed (+22, -20 lines)

articles/ai-foundry/how-to/develop/agent-evaluate-sdk.md

Lines changed: 22 additions & 20 deletions
@@ -13,23 +13,23 @@ ms.reviewer: changliu2
ms.author: lagayhar
author: lgayhardt
---
-# Evaluate your AI agents locally with the Azure AI Evaluation SDK (preview)
+# Evaluate your AI agents locally with Azure AI Evaluation SDK (preview)

[!INCLUDE [feature-preview](../../includes/feature-preview.md)]

-AI Agents are powerful productivity assistants to create workflows for business needs. However, they come with challenges for observability due to their complex interaction patterns. In this article, you learn how to run built-in evaluators locally on simple agent data or agent messages to thoroughly assess the performance of your AI agents.
+AI Agents are powerful productivity assistants to create workflows for business needs. However, they come with challenges for observability due to their complex interaction patterns. In this article, you learn how to run built-in evaluators locally on simple agent data or agent messages to thoroughly assess the performance of your AI agents.

To build production-ready agentic applications and enable observability and transparency, developers need tools to assess not just the final output from an agent's workflows, but the quality and efficiency of the workflows themselves. For example, consider a typical agentic workflow:

:::image type="content" source="../../media/evaluations/agent-workflow-eval.gif" alt-text="Animation of the agent's workflow from user query to intent resolution to tool calls to final response." lightbox="../../media/evaluations/agent-workflow-eval.gif":::

-The agentic workflow is triggered by a user query "weather tomorrow". It starts to execute multiple steps, such as reasoning through user intents, tool calling, and utilizing retrieval-augmented generation to produce a final response. In this process, evaluating each steps of the workflow—along with the quality and safety of the final output—is crucial. Specifically, we formulate these evaluation aspects into the following evaluators for agents:
+The agentic workflow is triggered by a user query "weather tomorrow". It starts to execute multiple steps, such as reasoning through user intents, tool calling, and utilizing retrieval-augmented generation to produce a final response. In this process, evaluating each step of the workflow—along with the quality and safety of the final output—is crucial. Specifically, we formulate these evaluation aspects into the following evaluators for agents:

- [Intent resolution](https://aka.ms/intentresolution-sample): Measures how well the agent identifies the user’s request, including how well it scopes the user’s intent, asks clarifying questions, and reminds end users of its scope of capabilities.
- [Tool call accuracy](https://aka.ms/toolcallaccuracy-sample): Evaluates the agent’s ability to select the appropriate tools, and process correct parameters from previous steps.
- [Task adherence](https://aka.ms/taskadherence-sample): Measures how well the agent’s final response adheres to its assigned tasks, according to its system message and prior steps.

-To see more quality and risk and safety evaluators, refer to [built-in evaluators](./evaluate-sdk.md#data-requirements-for-built-in-evaluators) to assess the content in the process where appropriate.
+To see more quality and risk and safety evaluators, refer to [built-in evaluators](./evaluate-sdk.md#data-requirements-for-built-in-evaluators) to assess the content in the process where appropriate.

## Getting started

@@ -39,32 +39,32 @@ First install the evaluators package from Azure AI evaluation SDK:
pip install azure-ai-evaluation
```

-### Evaluators with agent message support
+## Evaluators with agent message support

-Agents typically emit messages to interact with a user or other agents. Our built-in evaluators can accept simple data types such as strings in `query`, `response`, `ground_truth` according to the [single-turn data input requirements](./evaluate-sdk.md#data-requirements-for-built-in-evaluators). However, to extract these simple data from agent messages can be a challenge, due to the complex interaction patterns of agents and framework differences. For example, as mentioned, a single user query can trigger a long list of agent messages, typically with multiple tool calls invoked.
+Agents typically emit messages to interact with a user or other agents. Our built-in evaluators can accept simple data types such as strings in `query`, `response`, `ground_truth` according to the [single-turn data input requirements](./evaluate-sdk.md#data-requirements-for-built-in-evaluators). However, to extract these simple data types from agent messages can be a challenge, due to the complex interaction patterns of agents and framework differences. For example, as mentioned, a single user query can trigger a long list of agent messages, typically with multiple tool calls invoked.

As illustrated in the example, we enabled agent message support specifically for these built-in evaluators to evaluate these aspects of an agentic workflow. These evaluators take `tool_calls` or `tool_definitions` as parameters unique to agents.

| Evaluator | `query` | `response` | `tool_calls` | `tool_definitions` |
|----------------|---------------|---------------|---------------|---------------|
-| `IntentResolutionEvaluator` | Required: Union[String, list[Message]] | Required: Union[String, list[Message]] | N/A | Optional: list[dict] |
-| `ToolCallAccuracyEvaluator` | Required: Union[String, list[Message]] | Optional: Union[String, list[Message]]| Optional: Union[dict, list[ToolCall]] | Required: list[ToolDefinition] |
-| `TaskAdherenceEvaluator` | Required: Union[String, list[Message]] | Required: Union[String, list[Message]] | N/A | Optional: list[dict] |
+| `IntentResolutionEvaluator` | Required: `Union[str, list[Message]]` | Required: `Union[str, list[Message]]` | N/A | Optional: `list[ToolCall]` |
+| `ToolCallAccuracyEvaluator` | Required: `Union[str, list[Message]]` | Optional: `Union[str, list[Message]]` | Optional: `Union[dict, list[ToolCall]]` | Required: `list[ToolDefinition]` |
+| `TaskAdherenceEvaluator` | Required: `Union[str, list[Message]]` | Required: `Union[str, list[Message]]` | N/A | Optional: `list[ToolCall]` |

- `Message`: `dict` OpenAI-style message describing agent interactions with a user, where `query` must include a system message as the first message.
- `ToolCall`: `dict` specifying tool calls invoked during agent interactions with a user.
- `ToolDefinition`: `dict` describing the tools available to an agent.

For `ToolCallAccuracyEvaluator`, either `response` or `tool_calls` must be provided.

-We will demonstrate some examples of the two data formats: simple agent data, and agent messages. However, due to the unique requirements of these evaluators, we recommend referring to the [sample notebooks](#sample-notebooks) which illustrate the possible input paths for each evaluator.
+We'll demonstrate some examples of the two data formats: simple agent data, and agent messages. However, due to the unique requirements of these evaluators, we recommend referring to the [sample notebooks](#sample-notebooks) which illustrate the possible input paths for each evaluator.

As with other [built-in AI-assisted quality evaluators](./evaluate-sdk.md#performance-and-quality-evaluators), `IntentResolutionEvaluator` and `TaskAdherenceEvaluator` output a Likert score (integer 1-5; a higher score is better). `ToolCallAccuracyEvaluator` outputs the passing rate of all tool calls made (a float between 0-1) based on the user query. To further improve intelligibility, all evaluators accept a binary threshold and output two new keys. For the binarization threshold, a default is set and users can override it. The two new keys are:

- `{metric_name}_result`: a "pass" or "fail" string based on the binarization threshold.
- `{metric_name}_threshold`: a numerical binarization threshold set by default or by the user.

-#### Simple agent data
+### Simple agent data

In simple agent data format, `query` and `response` are simple Python strings. For example:
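
The example code itself falls between hunks and isn't shown in this diff. As a rough, illustrative sketch (not part of this commit), a plain-string call might look like the following; the model configuration values and the sample query and response are placeholders:

```python
import os
from azure.ai.evaluation import IntentResolutionEvaluator

# Judge-model configuration; the endpoint, key, and deployment names are placeholders.
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": os.environ["MODEL_DEPLOYMENT_NAME"],
}

intent_resolution = IntentResolutionEvaluator(model_config=model_config)

# Simple agent data: plain strings for query and response.
result = intent_resolution(
    query="What is the weather tomorrow in Seattle?",
    response="Tomorrow in Seattle is expected to be partly cloudy with a high of 64 °F.",
)

# The result carries the Likert score plus the binarized keys described earlier.
print(result)
```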

@@ -140,7 +140,7 @@ response = tool_call_accuracy(query=query, tool_calls=tool_calls, tool_definitio
print(response)
```

-#### Agent messages
+### Agent messages

In agent message format, `query` and `response` are lists of OpenAI-style messages. Specifically, `query` carries the past agent-user interactions leading up to the last user query and requires the agent's system message at the top of the list, and `response` carries the agent's last message in response to the last user query. Example:
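
The full example is again elided by the hunk boundary; the sample notebooks show the exact message schema. A rough, illustrative sketch of the shape of these inputs (not part of this commit; roles, contents, and variable names are placeholders, and `model_config` refers to the judge configuration from the earlier sketch):

```python
from azure.ai.evaluation import TaskAdherenceEvaluator

# `query` carries the agent's system message first, then the prior turns up to the
# last user query; `response` carries the agent's final message.
query = [
    {"role": "system", "content": "You are a helpful weather assistant."},
    {"role": "user", "content": "What's the weather tomorrow?"},
]
response = [
    {"role": "assistant", "content": "Tomorrow will be partly cloudy with a high of 64 °F."},
]

task_adherence = TaskAdherenceEvaluator(model_config=model_config)
result = task_adherence(query=query, response=response)
print(result)
```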

@@ -237,11 +237,11 @@ print(result)

```

-#### Converter support
+## Converter support

Transforming agent messages into the right evaluation data to use our evaluators can be a nontrivial task. If you use [Azure AI Agent Service](../../../ai-services/agents/overview.md), however, you can seamlessly evaluate your agents via our converter support for Azure AI agent threads and runs. Here's an example that creates an Azure AI agent and some data for evaluation. Separately from evaluation, Azure AI Agent Service requires `pip install azure-ai-projects azure-identity`, an Azure AI project connection string, and a supported model.

-#### Create agent threads and runs
+### Create agent threads and runs

```python
import os, json
@@ -327,7 +327,7 @@ for message in project_client.agents.list_messages(thread.id, order="asc").data:
    print("-" * 40)
```

-##### Convert agent runs (single-run)
+#### Convert agent runs (single-run)

Now you can use our converter to transform the Azure AI agent thread or run data into the required evaluation data that the evaluators can understand.
```python
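# Sketch only (not part of this commit): the lines elided by the hunk boundary
# presumably construct the converter before convert() is called. The names
# `project_client`, `thread`, and `run` are assumed to come from the setup above.
from azure.ai.evaluation import AIAgentConverter

converter = AIAgentConverter(project_client)
thread_id = thread.id
run_id = run.id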
@@ -345,7 +345,8 @@ converted_data = converter.convert(thread_id, run_id)
print(json.dumps(converted_data, indent=4))
```

-##### Convert agent threads (multi-run)
+#### Convert agent threads (multi-run)
+
```python
import json
from azure.ai.evaluation import AIAgentConverter
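# Sketch only (not part of this commit): the elided lines batch-convert one or more
# thread IDs and save the rows to a JSONL file for evaluation. The helper name
# `prepare_evaluation_data` and the variable names below are assumptions.
thread_ids = [thread.id]
filename = "evaluation_data.jsonl"
evaluation_data = AIAgentConverter.prepare_evaluation_data(thread_ids=thread_ids, filename=filename)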
@@ -363,7 +364,7 @@ print(f"Evaluation data saved to {filename}")

#### Batch evaluation on agent thread data

-With the evaluation data prepared in one line of code, you can simply select the evaluators to assess the agent quality (for example, intent resolution, tool call accuracy, and task adherence), and submit a batch evaluation run:
+With the evaluation data prepared in one line of code, you can select the evaluators to assess the agent quality (for example, intent resolution, tool call accuracy, and task adherence), and submit a batch evaluation run:
```python
from azure.ai.evaluation import IntentResolutionEvaluator, TaskAdherenceEvaluator, ToolCallAccuracyEvaluator
from azure.ai.projects.models import ConnectionType
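# Sketch only (not part of this commit): the remainder of this block sits outside the
# hunk. A typical continuation initializes the evaluators and submits a batch run with
# evaluate(); `file_name`, `model_config`, and `azure_ai_project` are assumed names.
from azure.ai.evaluation import evaluate

response = evaluate(
    data=file_name,  # JSONL produced by the converter step above
    evaluators={
        "intent_resolution": IntentResolutionEvaluator(model_config=model_config),
        "tool_call_accuracy": ToolCallAccuracyEvaluator(model_config=model_config),
        "task_adherence": TaskAdherenceEvaluator(model_config=model_config),
    },
    azure_ai_project=azure_ai_project,  # optional: log the run to your AI Foundry project
)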
@@ -414,17 +415,18 @@ print(response["metrics"])
# use the URL to inspect the results on the UI
print(f'AI Foundry URL: {response.get("studio_url")}')
```
-With Azure AI Evaluation SDK, you can seamlessly evaluate your Azure AI agents via our converter support, which enables observability and transparency into agentic workflows. You can evaluate other agents by preparing the right data for the evaluators of your choice.

-### Sample notebooks
+With Azure AI Evaluation SDK, you can seamlessly evaluate your Azure AI agents via our converter support, which enables observability and transparency into agentic workflows. You can evaluate other agents by preparing the right data for the evaluators of your choice.
+
+## Sample notebooks
+
Now you're ready to try a sample for each of these evaluators:
- [Intent resolution](https://aka.ms/intentresolution-sample)
- [Tool call accuracy](https://aka.ms/toolcallaccuracy-sample)
- [Task adherence](https://aka.ms/taskadherence-sample)
- [Response Completeness](https://aka.ms/rescompleteness-sample)
- [End-to-end Azure AI agent evaluation](https://aka.ms/e2e-agent-eval-sample)

-
## Related content

- [Azure AI Evaluation Python SDK client reference documentation](https://aka.ms/azureaieval-python-ref)
