Commit b6e90ad: added agent evaluation
1 parent 068f009 commit b6e90ad

File tree: 1 file changed (+9 -9 lines changed)

1 file changed

+9
-9
lines changed

articles/ai-foundry/how-to/develop/agent-evaluate-sdk.md

Lines changed: 9 additions & 9 deletions
@@ -19,17 +19,17 @@ author: lgayhardt
[!INCLUDE [feature-preview](../../includes/feature-preview.md)]

-AI Agents are powerful productivity assistants to create workflows for business needs. However, they come with additional challenges for observability due to their complex interaction patterns. To build production-ready agentic applications and enable observability and transparency, developers need tools to assess not just the final output from an agent’s workflows, but the quality and efficiency of the workflows themselves. For example, consider a typical agentic workflow:
+AI agents are powerful productivity assistants for creating workflows that meet business needs. However, they come with challenges for observability due to their complex interaction patterns. To build production-ready agentic applications and enable observability and transparency, developers need tools to assess not just the final output of an agent’s workflow, but also the quality and efficiency of the workflow itself. For example, consider a typical agentic workflow:

![Animation of an agentic workflow](agent-eval-10-sec-gif.gif)


-Triggered by a user query about “weather tomorrow”, the agentic workflow may include multiple steps, such as reasoning through user intents, tool calling, and utilizing retrieval-augmented generation to produce a final response. In this process, evaluating each step of the workflow, along with the quality and safety of the final output, is crucial. Specifically, we formulate these steps into the following evaluators for agents:
+The agentic workflow is triggered by a user query such as “weather tomorrow”. It then executes multiple steps, such as reasoning through user intent, calling tools, and using retrieval-augmented generation to produce a final response. In this process, it's crucial to evaluate each step of the workflow, along with the quality and safety of the final output. Specifically, we formulate these steps into the following evaluators for agents:
- [Intent resolution](https://aka.ms/intentresolution-sample): Measures how well the agent identifies the user’s request, including how well it scopes the user’s intent, asks clarifying questions, and reminds end users of its scope of capabilities.
- [Tool call accuracy](https://aka.ms/toolcallaccuracy-sample): Evaluates the agent’s ability to select the appropriate tools, and process correct parameters from previous steps.
- [Task adherence](https://aka.ms/taskadherence-sample): Measures how well the agent’s response adheres to its assigned tasks, according to its system message and prior steps.
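To make the workflow steps concrete, here is a hypothetical trace of the “weather tomorrow” run expressed as a list of messages. The roles and field names mirror common chat-message conventions, and every value is invented for illustration; it is not output captured from a real agent run:

```python
# Hypothetical message trace for the "weather tomorrow" workflow.
# Roles and field names mirror common chat-message conventions; all
# values are invented for illustration.
agent_run = [
    {"role": "system", "content": "You are a helpful weather assistant."},
    {"role": "user", "content": "What's the weather tomorrow?"},
    # The agent resolves the intent and decides to call a tool.
    {
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "call_001",
                "name": "fetch_weather",
                "arguments": {"location": "Seattle", "date": "tomorrow"},
            }
        ],
    },
    # The tool result comes back and grounds the final answer.
    {
        "role": "tool",
        "tool_call_id": "call_001",
        "content": [
            {"type": "tool_result", "tool_result": {"forecast": "partly cloudy, 18 C"}}
        ],
    },
    {"role": "assistant", "content": "Tomorrow will be partly cloudy with a high of 18 C."},
]

# Each evaluator inspects a different slice of a trace like this:
# intent resolution -> the user query and the final answer,
# tool call accuracy -> the tool_call entries,
# task adherence -> the system message plus the final answer.
```

Each evaluator focuses on a different slice of such a trace, which is why assessing only the final answer misses most of the workflow.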

-In this article, you learn how to run built-in evaluators locally on simple agent data as well as agent messages with built-in evaluators to thoroughly assess the performance of your AI agents.
+In this article, you learn how to run built-in evaluators locally on simple agent data or agent messages to thoroughly assess the performance of your AI agents.

## Getting started

@@ -45,7 +45,7 @@ pip install azure-ai-evaluation

Agents typically emit messages to interact with a user or other agents. Our built-in evaluators can accept simple data types such as strings in `query`, `response`, and `ground_truth`, according to the [single-turn data input requirements](./evaluate-sdk.md#data-requirements-for-built-in-evaluators). However, extracting simple data from agent messages can be challenging due to their complex interaction patterns. For example, as mentioned, a single user query can trigger a long list of agent messages, typically with multiple tool calls invoked.
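For instance, a single-turn input row built from simple data types might look like the following sketch (all string values are invented for illustration):

```python
# A hypothetical single-turn row using only simple data types.
# All string values are invented for illustration.
row = {
    "query": "What's the weather tomorrow?",
    "response": "Tomorrow will be partly cloudy with a high of 18 C.",
    "ground_truth": "Partly cloudy with a high of 18 C.",
}

# Simple rows like this satisfy the single-turn data requirements directly;
# the challenge discussed above is distilling such a row out of a long,
# multi-message agent interaction.
assert all(isinstance(value, str) for value in row.values())
```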

-As illustrated in the example, we have enabled agent message suport specifically for these built-in evaluators to evaluate these aspects of agentic workflow. These evaluators take `tool_calls` or `tool_definitions` as parameters as they are unique to agents.
+As illustrated in the example, we enabled agent message support specifically for these built-in evaluators to evaluate these aspects of an agentic workflow. These evaluators take `tool_calls` or `tool_definitions` as parameters, because these inputs are unique to agents.

| Evaluator | `query` | `response` | `tool_calls` | `tool_definitions` |
|----------------|---------------|---------------|---------------|---------------|
@@ -57,7 +57,7 @@ As illustrated in the example, we have enabled agent message suport specifically
- `ToolCall`: `dict` specifying tool calls invoked during agent interactions with a user.
- `ToolDefinition`: `dict` describing the tools available to an agent.
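As a sketch, a `ToolCall` and a matching `ToolDefinition` could look like the following. The field names follow common function-calling conventions and the values are invented; refer to the sample notebooks for the exact shapes each evaluator expects:

```python
# Hypothetical ToolCall and ToolDefinition dicts for a weather tool.
# Field names follow common function-calling conventions; values are
# invented for illustration.
tool_call = {
    "type": "tool_call",
    "tool_call_id": "call_001",
    "name": "fetch_weather",
    "arguments": {"location": "Seattle", "date": "tomorrow"},
}

tool_definition = {
    "name": "fetch_weather",
    "description": "Fetches the weather forecast for a given location and date.",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City name."},
            "date": {"type": "string", "description": "Date of the forecast."},
        },
    },
}

# A tool call should reference a defined tool, and its arguments should
# stay within the definition's parameter schema.
assert tool_call["name"] == tool_definition["name"]
assert set(tool_call["arguments"]) <= set(tool_definition["parameters"]["properties"])
```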

-For `ToolCallAccuracyEvaluator`, either `response` or `tool_calls` must be provided. See examples below to showcase the two data formats: simple agent data, and agent messages. However, due to the unique requirements of these evaluators, we recommend referring to the [sample notebooks](#sample-notebooks) which illustrate the possible input paths for each one.
+For `ToolCallAccuracyEvaluator`, either `response` or `tool_calls` must be provided. The following examples demonstrate the two data formats: simple agent data and agent messages. However, due to the unique requirements of these evaluators, we recommend referring to the [sample notebooks](#sample-notebooks), which illustrate the possible input paths for each evaluator.

As with other [built-in AI-assisted quality evaluators](#performance-and-quality-evaluators), `IntentResolutionEvaluator` and `TaskAdherenceEvaluator` output a Likert score (an integer from 1 to 5), where a higher score indicates a better result. `ToolCallAccuracyEvaluator` outputs the passing rate of all tool calls made (a float between 0 and 1) based on the user query. To further improve intelligibility, all evaluators accept a binary threshold and output two new keys. A default binarization threshold is set, and the user can override it. The two new keys are:
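To make the score-to-verdict binarization concrete, here is a minimal sketch. The function name and the default threshold value are assumptions for illustration, not the SDK's internals:

```python
# Minimal sketch of binarizing a Likert score against a threshold.
# `binarize` and the default threshold of 3 are illustrative assumptions,
# not the azure-ai-evaluation internals.
def binarize(score: float, threshold: float = 3.0) -> str:
    """Map a 1-5 Likert score onto a pass/fail verdict."""
    return "pass" if score >= threshold else "fail"

print(binarize(4))               # -> pass (meets the default threshold)
print(binarize(2))               # -> fail
print(binarize(4, threshold=5))  # -> fail under a stricter user override
```

Overriding the threshold lets you tighten or relax the verdict without changing the underlying score.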

@@ -242,7 +242,7 @@ print(result)

#### Converter support

-Transforming agent messages into the right evaluation data to use our evaluators can be a non-trivial task. If you use [Azure AI Agent Service](https://learn.microsoft.com/azure/ai-services/agents/overview), however, you can seamlessly evaluate your agents via our converter support for Azure AI agent threads and runs. Here is an example to create an Azure AI agent and some data for evaluation:
+Transforming agent messages into the right evaluation data for our evaluators can be a nontrivial task. If you use [Azure AI Agent Service](https://learn.microsoft.com/azure/ai-services/agents/overview), however, you can seamlessly evaluate your agents via our converter support for Azure AI agent threads and runs. Here's an example that creates an Azure AI agent and some data for evaluation:

```bash
pip install azure-ai-projects azure-identity
@@ -417,13 +417,13 @@ response = evaluate(
)
# look at the average scores
print(response["metrics"])
# use the URL to inspect the results on the UI
print(f'AI Foundry URL: {response.get("studio_url")}')
```
With Azure AI Evaluation SDK, you can seamlessly evaluate your Azure AI agents via our converter support, which enables observability and transparency into agentic workflows. You can evaluate other agents by preparing the right data for the evaluators of your choice.
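As a rough illustration of the kind of transformation the converter performs, here is a simplified local stand-in. It is not the SDK's implementation; it just shows the idea of flattening a message trace into the `query`/`response`/`tool_calls` fields the evaluators accept, using an invented trace:

```python
# Simplified, local stand-in for converter-style flattening. This is NOT
# the azure-ai-evaluation implementation, only an illustration of the idea.
def flatten_run(messages: list[dict]) -> dict:
    """Distill a message trace into simple evaluation inputs."""
    query = next(m["content"] for m in messages if m["role"] == "user")
    # The last assistant message with string content is the final answer.
    response = next(
        m["content"]
        for m in reversed(messages)
        if m["role"] == "assistant" and isinstance(m["content"], str)
    )
    # Collect every tool_call part emitted by the assistant along the way.
    tool_calls = [
        part
        for m in messages
        if m["role"] == "assistant" and isinstance(m["content"], list)
        for part in m["content"]
        if part.get("type") == "tool_call"
    ]
    return {"query": query, "response": response, "tool_calls": tool_calls}


# Invented trace for illustration.
trace = [
    {"role": "user", "content": "What's the weather tomorrow?"},
    {
        "role": "assistant",
        "content": [
            {"type": "tool_call", "name": "fetch_weather", "arguments": {"date": "tomorrow"}}
        ],
    },
    {"role": "assistant", "content": "Partly cloudy, high of 18 C."},
]
data = flatten_run(trace)
```

The real converter handles far more structure (threads, runs, tool results, retries), but the end goal is the same: simple fields the evaluators can consume.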

### Sample notebooks
-Now, you are ready to try a sample for each of these evaluators:
+Now you're ready to try a sample for each of these evaluators:
- [Intent resolution](https://aka.ms/intentresolution-sample)
- [Tool call accuracy](https://aka.ms/toolcallaccuracy-sample)
- [Task adherence](https://aka.ms/taskadherence-sample)
