Commit 068f009

added agent evaluation

1 parent c58c732 commit 068f009

File tree

2 files changed (+56, -27 lines)

articles/ai-foundry/how-to/develop/agent-evaluate-sdk.md

Lines changed: 47 additions & 25 deletions
@@ -24,7 +24,10 @@ AI Agents are powerful productivity assistants to create workflows for business

![alt text](agent-eval-10-sec-gif.gif)

Triggered by a user query about “weather tomorrow”, the agentic workflow may include multiple steps, such as reasoning through user intent, tool calling, and utilizing retrieval-augmented generation to produce a final response. In this process, evaluating each step of the workflow, along with the quality and safety of the final output, is crucial. Specifically, we formulate these steps into the following evaluators for agents:

- [Intent resolution](https://aka.ms/intentresolution-sample): Measures how well the agent identifies the user’s request, including how well it scopes the user’s intent, asks clarifying questions, and reminds end users of its scope of capabilities.
- [Tool call accuracy](https://aka.ms/toolcallaccuracy-sample): Evaluates the agent’s ability to select the appropriate tools and process correct parameters from previous steps.
- [Task adherence](https://aka.ms/taskadherence-sample): Measures how well the agent’s response adheres to its assigned tasks, according to its system message and prior steps.

In this article, you learn how to run built-in evaluators locally on simple agent data as well as agent messages to thoroughly assess the performance of your AI agents.

@@ -40,22 +43,27 @@ pip install azure-ai-evaluation

### Evaluators with agent message support

Agents typically emit messages to interact with a user or other agents. Our built-in evaluators can accept simple data types such as strings in `query`, `response`, and `ground_truth`, according to the [single-turn data input requirements](./evaluate-sdk.md#data-requirements-for-built-in-evaluators). However, extracting simple data from agent messages can be a challenge, due to their complex interaction patterns. For example, as mentioned, a single user query can trigger a long list of agent messages, typically with multiple tool calls invoked.

As illustrated in the example, we have enabled agent message support specifically for these built-in evaluators to evaluate these aspects of an agentic workflow. These evaluators take `tool_calls` or `tool_definitions` as parameters, as they are unique to agents.

| Evaluator | `query` | `response` | `tool_calls` | `tool_definitions` |
|----------------|---------------|---------------|---------------|---------------|
| `IntentResolutionEvaluator` | Required: Union[String, list[Message]] | Required: Union[String, list[Message]] | N/A | Optional: list[dict] |
| `ToolCallAccuracyEvaluator` | Required: Union[String, list[Message]] | Optional: Union[String, list[Message]] | Optional: Union[dict, list[ToolCall]] | Required: list[ToolDefinition] |
| `TaskAdherenceEvaluator` | Required: Union[String, list[Message]] | Required: Union[String, list[Message]] | N/A | Optional: list[dict] |

- `Message`: OpenAI-style message `dict` describing agent interactions with a user, where `query` must include a system message as the first message.
- `ToolCall`: `dict` specifying tool calls invoked during agent interactions with a user.
- `ToolDefinition`: `dict` describing the tools available to an agent.

For `ToolCallAccuracyEvaluator`, either `response` or `tool_calls` must be provided. The examples below showcase the two data formats: simple agent data and agent messages. However, due to the unique requirements of these evaluators, we recommend referring to the [sample notebooks](#sample-notebooks), which illustrate the possible input paths for each one.

As with other [built-in AI-assisted quality evaluators](#performance-and-quality-evaluators), `IntentResolutionEvaluator` and `TaskAdherenceEvaluator` output a Likert score (an integer from 1 to 5), where a higher score indicates a better result. `ToolCallAccuracyEvaluator` outputs the passing rate of all tool calls made (a float between 0 and 1) based on the user query. To further improve intelligibility, all evaluators accept a binarization threshold and output two new keys. A default threshold is set, and users can override it. The two new keys are:

- `{metric_name}_result`: a "pass" or "fail" string based on the binarization threshold.
- `{metric_name}_threshold`: the numerical binarization threshold, set by default or overridden by the user.
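To make the binarization concrete, here is a minimal pure-Python sketch of how a score and threshold yield the `{metric_name}_result` and `{metric_name}_threshold` keys. The `binarize` helper is illustrative only, not the SDK's implementation:

```python
def binarize(metric_name: str, score: float, threshold: float) -> dict:
    """Illustrative sketch (not the SDK implementation): derive the
    '{metric_name}_result' and '{metric_name}_threshold' keys from a score."""
    return {
        metric_name: score,
        f"{metric_name}_result": "pass" if score >= threshold else "fail",
        f"{metric_name}_threshold": threshold,
    }

# A Likert-scored evaluator (integer 1-5) with an example threshold of 3
print(binarize("intent_resolution", 4, 3))
# → {'intent_resolution': 4, 'intent_resolution_result': 'pass', 'intent_resolution_threshold': 3}

# A passing-rate evaluator (float 0-1) with a stricter example threshold
print(binarize("tool_call_accuracy", 0.5, 0.8))
# → {'tool_call_accuracy': 0.5, 'tool_call_accuracy_result': 'fail', 'tool_call_accuracy_threshold': 0.8}
```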
#### Simple agent data
@@ -66,7 +74,7 @@ In simple agent data format, `query` and `response` are simple python strings. F

import os
from azure.ai.evaluation import AzureOpenAIModelConfiguration
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import IntentResolutionEvaluator, ResponseCompletenessEvaluator

model_config = AzureOpenAIModelConfiguration(
@@ -77,20 +85,29 @@ model_config = AzureOpenAIModelConfiguration(
)

intent_resolution_evaluator = IntentResolutionEvaluator(model_config)
completeness_evaluator = ResponseCompletenessEvaluator(model_config=model_config)

# Evaluating query and response as strings
# A positive example: intent is identified and understood, and the response correctly resolves user intent
result = intent_resolution_evaluator(
    query="What are the opening hours of the Eiffel Tower?",
    response="Opening hours of the Eiffel Tower are 9:00 AM to 11:00 PM.",
)
print(result)

# A negative example: only half of the statements in the response are complete according to the ground truth
result = completeness_evaluator(
    response="Itinerary: Day 1 take a train to visit Disneyland outside of the city; Day 2 rest in the hotel.",
    ground_truth="Itinerary: Day 1 take a train to visit the downtown area for city sightseeing; Day 2 rest in the hotel."
)
print(result)
```
Examples of `tool_calls` and `tool_definitions` for `ToolCallAccuracyEvaluator`:

```python
query = "How is the weather in Seattle?"
tool_calls = [{
    "type": "tool_call",
    "tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ",
@@ -127,14 +144,17 @@ print(response)

#### Agent messages

In agent message format, `query` and `response` are lists of OpenAI-style messages. Specifically, `query` carries the past agent-user interactions leading up to the last user query, and requires the agent's system message at the top of the list; `response` carries the agent's last message, in response to the last user query. Example:

```python
# user asked a question
query = [
    {
        "role": "system",
        "content": "You are a friendly and helpful customer service agent."
    },
    # past interactions omitted
    # ...
    {
        "createdAt": "2025-03-14T06:14:20Z",
        "role": "user",
@@ -174,7 +194,8 @@ response = [
}
]
},
# many more messages omitted
# ...
# here is the agent's final response
{
"createdAt": "2025-03-14T06:15:05Z",
@@ -204,7 +225,8 @@ tool_definitions = [
}
}
},
# other tool definitions omitted
# ...
]

result = intent_resolution_evaluator(
@@ -215,14 +237,12 @@ result = intent_resolution_evaluator(
)
print(result)
```
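To give a feel for how `query` and `response` relate to a raw message thread, here is a small hypothetical helper that divides a chronological OpenAI-style message list at the last user turn. `split_thread` is not part of azure-ai-evaluation; it is purely an illustration of the split described above:

```python
def split_thread(messages: list[dict]) -> tuple[list[dict], list[dict]]:
    """Hypothetical helper (not part of the SDK): split a chronological
    OpenAI-style message list into (query, response), where query holds the
    system message and all turns up to and including the last user message,
    and response holds the agent messages that follow it."""
    last_user = max(i for i, m in enumerate(messages) if m["role"] == "user")
    return messages[: last_user + 1], messages[last_user + 1 :]

thread = [
    {"role": "system", "content": "You are a friendly and helpful customer service agent."},
    {"role": "user", "content": "What is the weather in Seattle?"},
    {"role": "assistant", "content": "It is 70 degrees and sunny in Seattle."},
]
query, response = split_thread(thread)
```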
#### Converter support

Transforming agent messages into the right evaluation data for our evaluators can be a non-trivial task. If you use [Azure AI Agent Service](https://learn.microsoft.com/azure/ai-services/agents/overview), however, you can seamlessly evaluate your agents via our converter support for Azure AI agent threads and runs. Here is an example of creating an Azure AI agent and some data for evaluation:

```bash
pip install azure-ai-projects azure-identity
@@ -238,7 +258,6 @@ from azure.ai.projects.models import FunctionTool, ToolSet

from dotenv import load_dotenv

load_dotenv()

# Define some custom python function
@@ -316,7 +335,7 @@ for message in project_client.agents.list_messages(thread.id, order="asc").data:

##### Convert agent runs (single-run)

Now you use our converter to transform the Azure AI agent thread or run data into the required evaluation data that the evaluators can understand.
```python
import json
from azure.ai.evaluation import AIAgentConverter
@@ -350,12 +369,15 @@ print(f"Evaluation data saved to (unknown)")

#### Batch evaluation on agent thread data

With the evaluation data prepared in one line of code, you can simply select the evaluators to assess the agent quality (for example, intent resolution, tool call accuracy, and task adherence) and submit a batch evaluation run:
```python
from azure.ai.evaluation import IntentResolutionEvaluator, TaskAdherenceEvaluator, ToolCallAccuracyEvaluator
from azure.ai.projects.models import ConnectionType
import os

from dotenv import load_dotenv
load_dotenv()

project_client = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
@@ -370,7 +392,7 @@ model_config = project_client.connections.get_default(
    include_credentials=True
)

# select evaluators
intent_resolution = IntentResolutionEvaluator(model_config=model_config)
task_adherence = TaskAdherenceEvaluator(model_config=model_config)
tool_call_accuracy = ToolCallAccuracyEvaluator(model_config=model_config)
@@ -400,11 +422,11 @@ print(f'AI Foundry URL: {response.get("studio_url")}')
```

### Sample notebooks

Now, you are ready to try a sample for each of these evaluators:
- [Intent resolution](https://aka.ms/intentresolution-sample)
- [Tool call accuracy](https://aka.ms/toolcallaccuracy-sample)
- [Task adherence](https://aka.ms/taskadherence-sample)
- [End-to-end Azure AI agent evaluation](https://aka.ms/e2e-agent-eval-sample)

articles/ai-foundry/how-to/develop/evaluate-sdk.md

Lines changed: 9 additions & 2 deletions
@@ -292,10 +292,17 @@ For

The result of the AI-assisted quality evaluators for a query and response pair is a dictionary containing:

- `{metric_name}` provides a numerical score, on a Likert scale (an integer from 1 to 5) or a float between 0 and 1.
- `{metric_name}_label` provides a binary label (if the metric naturally outputs a binary score).
- `{metric_name}_reason` explains why a certain score or label was given for each data point.

To further improve intelligibility, all evaluators accept a binarization threshold and output two new keys. A default threshold is set, and users can override it. The two new keys are:

- `{metric_name}_result`: a "pass" or "fail" string based on the binarization threshold.
- `{metric_name}_threshold`: the numerical binarization threshold, set by default or overridden by the user.
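As a usage sketch, these keys make result dictionaries easy to post-process, for example to collect only failing metrics. The `failing_metrics` helper and the example row below are made-up illustrations, not real evaluator output:

```python
def failing_metrics(row: dict) -> list[str]:
    """Illustrative helper (not part of the SDK): return the metric names
    in an evaluation result row whose '{metric_name}_result' is 'fail'."""
    return [
        key.removesuffix("_result")
        for key, value in row.items()
        if key.endswith("_result") and value == "fail"
    ]

# Made-up example row combining two metrics
row = {
    "intent_resolution": 2,
    "intent_resolution_result": "fail",
    "intent_resolution_threshold": 3,
    "task_adherence": 5,
    "task_adherence_result": "pass",
    "task_adherence_threshold": 3,
}
print(failing_metrics(row))  # → ['intent_resolution']
```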
#### Comparing quality and custom evaluators

For NLP evaluators, only a score is given in the `{metric_name}` key.
