
Commit 180dbd4

Merge branch 'agent_eval_update' of https://github.com/changliu2/azure-ai-docs-pr into agenteval0725

2 parents a3da3fc + d0640d2

3 files changed (+207, -56 lines)


articles/ai-foundry/concepts/evaluation-evaluators/agent-evaluators.md

Lines changed: 31 additions & 13 deletions
@@ -83,7 +83,7 @@ intent_resolution(

### Intent resolution output

-The numerical score on a Likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason and additional fields can help you understand why the score is high or low.
+The numerical score is on a Likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason and additional fields can help you understand why the score is high or low.

```python
{
@@ -108,10 +108,13 @@ If you're building agents outside of Azure AI Agent Service, this evaluator accep

## Tool call accuracy

-`ToolCallAccuracyEvaluator` measures an agent's ability to select appropriate tools, extract, and process correct parameters from previous steps of the agentic workflow. It detects whether each tool call made is accurate (binary) and reports back the average scores, which can be interpreted as a passing rate across tool calls made.
+`ToolCallAccuracyEvaluator` measures the accuracy and efficiency of tool calls made by an agent in a run. It provides a 1-5 score based on:
+- the relevance and helpfulness of the tool invoked;
+- the correctness of parameters used in tool calls;
+- the counts of missing or excessive calls.

> [!NOTE]
-> `ToolCallAccuracyEvaluator` only supports Azure AI Agent's Function Tool evaluation, but doesn't support Built-in Tool evaluation. The agent messages must have at least one Function Tool actually called to be evaluated.
+> `ToolCallAccuracyEvaluator` only supports Azure AI Agent's Function Tool evaluation, but doesn't support Built-in Tool evaluation. The agent run must have at least one Function Tool call and no Built-in Tool calls made to be evaluated.

### Tool call accuracy example

@@ -150,20 +153,35 @@ tool_call_accuracy(

### Tool call accuracy output

-The numerical score (passing rate of correct tool calls) is 0-1 and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason and tool call detail fields can help you understand why the score is high or low.
+The numerical score is on a Likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason and tool call detail fields can help you understand why the score is high or low.

```python
{
-    "tool_call_accuracy": 1.0,
+    "tool_call_accuracy": 5,
    "tool_call_accuracy_result": "pass",
-    "tool_call_accuracy_threshold": 0.8,
-    "per_tool_call_details": [
-        {
-            "tool_call_accurate": True,
-            "tool_call_accurate_reason": "The input Data should get a Score of 1 because the TOOL CALL is directly relevant to the user's question about the weather in Seattle, includes appropriate parameters that match the TOOL DEFINITION, and the parameter values are correct and relevant to the user's query.",
-            "tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ"
+    "tool_call_accuracy_threshold": 3,
+    "details": {
+        "tool_calls_made_by_agent": 1,
+        "correct_tool_calls_made_by_agent": 1,
+        "per_tool_call_details": [
+            {
+                "tool_name": "fetch_weather",
+                "total_calls_required": 1,
+                "correct_calls_made_by_agent": 1,
+                "correct_tool_percentage": 1.0,
+                "tool_call_errors": 0,
+                "tool_success_result": "pass"
+            }
+        ],
+        "excess_tool_calls": {
+            "total": 0,
+            "details": []
+        },
+        "missing_tool_calls": {
+            "total": 0,
+            "details": []
        }
-    ]
+    }
}
```
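The pass/fail fields in these outputs are simply the numerical score compared against the configured threshold. A minimal sketch of that rule, using hypothetical values rather than a real evaluator result:

```python
# Minimal sketch of the pass/fail rule described above; the values are hypothetical.
score = 5        # Likert score returned by an evaluator (integer 1 to 5)
threshold = 3    # default threshold documented above

result = "pass" if score >= threshold else "fail"
print(result)    # "pass", because 5 >= 3
```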

@@ -187,7 +205,7 @@ task_adherence(

### Task adherence output

-The numerical score on a Likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+The numerical score is on a Likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.

```python
{

articles/ai-foundry/how-to/develop/agent-evaluate-sdk.md

Lines changed: 164 additions & 16 deletions
@@ -245,20 +245,20 @@ See the following example output for some evaluators:
```
{
    "intent_resolution": 5.0, # likert scale: 1-5 integer
-    "intent_resolution_result": "pass", # pass because 5 > 3 the threshold
    "intent_resolution_threshold": 3,
+    "intent_resolution_result": "pass", # pass because 5 > 3 the threshold
    "intent_resolution_reason": "The assistant correctly understood the user's request to fetch the weather in Seattle. It used the appropriate tool to get the weather information and provided a clear and accurate response with the current weather conditions in Seattle. The response fully resolves the user's query with all necessary information."
}
{
    "task_adherence": 5.0, # likert scale: 1-5 integer
-    "task_adherence_result": "pass", # pass because 5 > 3 the threshold
    "task_adherence_threshold": 3,
+    "task_adherence_result": "pass", # pass because 5 > 3 the threshold
    "task_adherence_reason": "The response accurately follows the instructions, fetches the correct weather information, and relays it back to the user without any errors or omissions."
}
{
    "tool_call_accuracy": 5, # a score between 1-5, higher is better
-    "tool_call_accuracy_result": "pass", # pass because 1.0 > 0.8 the threshold
    "tool_call_accuracy_threshold": 3,
+    "tool_call_accuracy_result": "pass", # pass because 5 > 3 the threshold
    "details": { ... } # helpful details for debugging the tool calls made by the agent
}
```
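For orientation, here's a minimal sketch of how these three built-in evaluators might be instantiated and called to produce dictionaries shaped like the example above. It assumes `model_config`, `query`, `response`, and `tool_definitions` are defined as elsewhere in this article; treat it as illustrative rather than the article's exact code.

```python
from azure.ai.evaluation import (
    IntentResolutionEvaluator,
    TaskAdherenceEvaluator,
    ToolCallAccuracyEvaluator,
)

# Assumes model_config, query, response, and tool_definitions are already defined
# elsewhere in this article; they aren't redefined here.
intent_resolution = IntentResolutionEvaluator(model_config)
task_adherence = TaskAdherenceEvaluator(model_config)
tool_call_accuracy = ToolCallAccuracyEvaluator(model_config)

print(intent_resolution(query=query, response=response))
print(task_adherence(query=query, response=response))
print(tool_call_accuracy(query=query, response=response, tool_definitions=tool_definitions))
```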
@@ -316,7 +316,7 @@ If you're using agents outside Azure AI Foundry Agent Service, you can still eva

Agents typically emit messages to interact with a user or other agents. Our built-in evaluators can accept simple data types such as strings in `query`, `response`, and `ground_truth` according to the [single-turn data input requirements](./evaluate-sdk.md#data-requirements-for-built-in-evaluators). However, it can be a challenge to extract these simple data types from agent messages, due to the complex interaction patterns of agents and framework differences. For example, a single user query can trigger a long list of agent messages, typically with multiple tool calls invoked.

-As illustrated in the following example, we enable agent message support specifically for the built-in evaluators `IntentResolution`, `ToolCallAccuracy`, and `TaskAdherence` to evaluate these aspects of agentic workflow. These evaluators take `tool_calls` or `tool_definitions` as parameters unique to agents.
+As illustrated in the following example, we enable agent message support specifically for the built-in evaluators `IntentResolutionEvaluator`, `ToolCallAccuracyEvaluator`, and `TaskAdherenceEvaluator` to evaluate these aspects of agentic workflow. These evaluators take `tool_calls` or `tool_definitions` as parameters unique to agents.

| Evaluator | `query` | `response` | `tool_calls` | `tool_definitions` |
|----------------|---------------|---------------|---------------|---------------|
@@ -375,16 +375,9 @@ See the following output (reference [Output format](#output-format) for details)
    "intent_resolution_result": "pass",
    "intent_resolution_threshold": 3,
    "intent_resolution_reason": "The response provides the opening hours of the Eiffel Tower, which directly addresses the user's query. The information is clear, accurate, and complete, fully resolving the user's intent.",
-    "additional_details": {
-        "conversation_has_intent": true,
-        "agent_perceived_intent": "inquire about the opening hours of the Eiffel Tower",
-        "actual_user_intent": "inquire about the opening hours of the Eiffel Tower",
-        "correct_intent_detected": true,
-        "intent_resolved": true
-    }
}
```
-
+### Agent tool calls and definitions
See the following examples of `tool_calls` and `tool_definitions` for `ToolCallAccuracyEvaluator`:

```python
@@ -421,6 +414,10 @@ tool_definitions = [{
}
}
}]
+
+from azure.ai.evaluation import ToolCallAccuracyEvaluator
+
+tool_call_accuracy = ToolCallAccuracyEvaluator(model_config) # reuse the config defined above
response = tool_call_accuracy(query=query, tool_calls=tool_calls, tool_definitions=tool_definitions)
print(json.dumps(response, indent=4))
```
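The `model_config` reused above is defined earlier in the article. For readers landing directly on this hunk, a hypothetical example of what such a configuration can look like (the environment variable names and API version are placeholders):

```python
import os
from azure.ai.evaluation import AzureOpenAIModelConfiguration

# Hypothetical judge-model configuration; the article defines the actual one earlier.
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT"],
    api_version="2024-06-01",
)
```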
@@ -436,9 +433,161 @@ See the following output (reference [Output format](#output-format) for details)
}
```

-### Agent messages
+### Agent message schema
+
+In agent message format, `query` and `response` are lists of OpenAI-style messages. Specifically, `query` carries the past agent-user interactions leading up to the last user query and requires the system message (of the agent) on top of the list, and `response` carries the last message of the agent in response to the last user query.
+
+The expected input format for the evaluators is a Python list of messages as follows:
+```
+[
+    {
+        "role": "system" | "user" | "assistant" | "tool",
+        "createdAt": "ISO 8601 timestamp", // Optional for 'system'
+        "run_id": "string", // Optional, only for assistant/tool in tool call context
+        "tool_call_id": "string", // Optional, only for tool/tool_result
+        "name": "string", // Present if it's a tool call
+        "arguments": { ... }, // Parameters passed to the tool (if tool call)
+        "content": [
+            {
+                "type": "text" | "tool_call" | "tool_result",
+                "text": "string", // if type == text
+                "tool_call_id": "string", // if type == tool_call
+                "name": "string", // tool name if type == tool_call
+                "arguments": { ... }, // tool args if type == tool_call
+                "tool_result": { ... } // result if type == tool_result
+            }
+        ]
+    }
+]
+```

-In agent message format, `query` and `response` are a list of OpenAI-style messages. Specifically, `query` carries the past agent-user interactions leading up to the last user query and requires the system message (of the agent) on top of the list; and `response` carries the last message of the agent in response to the last user query. See the following example:
+Sample query and response objects:
+
+```python
+query = [
+    {
+        "role": "system",
+        "content": "You are an AI assistant interacting with Azure Maps services to serve user requests."
+    },
+    {
+        "createdAt": "2025-04-25T23:55:43Z",
+        "role": "user",
+        "content": [
+            {
+                "type": "text",
+                "text": "Find the address for coordinates 41.8781,-87.6298."
+            }
+        ]
+    },
+    {
+        "createdAt": "2025-04-25T23:55:45Z",
+        "run_id": "run_DGE8RWPS8A9SmfCg61waRx9u",
+        "role": "assistant",
+        "content": [
+            {
+                "type": "tool_call",
+                "tool_call_id": "call_nqNyhOFRw4FmF50jaCCq2rDa",
+                "name": "azure_maps_reverse_address_search",
+                "arguments": {
+                    "lat": "41.8781",
+                    "lon": "-87.6298"
+                }
+            }
+        ]
+    },
+    {
+        "createdAt": "2025-04-25T23:55:47Z",
+        "run_id": "run_DGE8RWPS8A9SmfCg61waRx9u",
+        "tool_call_id": "call_nqNyhOFRw4FmF50jaCCq2rDa",
+        "role": "tool",
+        "content": [
+            {
+                "type": "tool_result",
+                "tool_result": {
+                    "address": "300 South Federal Street, Chicago, IL 60604",
+                    "position": {
+                        "lat": "41.8781",
+                        "lon": "-87.6298"
+                    }
+                }
+            }
+        ]
+    },
+    {
+        "createdAt": "2025-04-25T23:55:48Z",
+        "run_id": "run_DGE8RWPS8A9SmfCg61waRx9u",
+        "role": "assistant",
+        "content": [
+            {
+                "type": "text",
+                "text": "The address for the coordinates 41.8781, -87.6298 is 300 South Federal Street, Chicago, IL 60604."
+            }
+        ]
+    },
+    {
+        "createdAt": "2025-04-25T23:55:50Z",
+        "role": "user",
+        "content": [
+            {
+                "type": "text",
+                "text": "What timezone corresponds to 41.8781,-87.6298?"
+            }
+        ]
+    },
+]
+
+response = [
+    {
+        "createdAt": "2025-04-25T23:55:52Z",
+        "run_id": "run_DmnhUGqYd1vCBolcjjODVitB",
+        "role": "assistant",
+        "content": [
+            {
+                "type": "tool_call",
+                "tool_call_id": "call_qi2ug31JqzDuLy7zF5uiMbGU",
+                "name": "azure_maps_timezone",
+                "arguments": {
+                    "lat": 41.878100000000003,
+                    "lon": -87.629800000000003
+                }
+            }
+        ]
+    },
+    {
+        "createdAt": "2025-04-25T23:55:54Z",
+        "run_id": "run_DmnhUGqYd1vCBolcjjODVitB",
+        "tool_call_id": "call_qi2ug31JqzDuLy7zF5uiMbGU",
+        "role": "tool",
+        "content": [
+            {
+                "type": "tool_result",
+                "tool_result": {
+                    "ianaId": "America/Chicago",
+                    "utcOffset": None,
+                    "abbreviation": None,
+                    "isDaylightSavingTime": None
+                }
+            }
+        ]
+    },
+    {
+        "createdAt": "2025-04-25T23:55:55Z",
+        "run_id": "run_DmnhUGqYd1vCBolcjjODVitB",
+        "role": "assistant",
+        "content": [
+            {
+                "type": "text",
+                "text": "The timezone for the coordinates 41.8781, -87.6298 is America/Chicago."
+            }
+        ]
+    }
+]
+```
+
+> [!NOTE]
+> The evaluator emits a warning when the query (that is, the conversation history up to the current run) or the agent response (the response to that query) can't be parsed because it isn't in the expected format.
+
+See an example of evaluating the agent messages with `ToolCallAccuracyEvaluator`:

```python
import json
@@ -525,10 +674,9 @@ tool_definitions = [
    # ...
]

-result = intent_resolution_evaluator(
+result = tool_call_accuracy(
    query=query,
    response=response,
-    # Optionally, provide the tool definitions.
    tool_definitions=tool_definitions
)
print(json.dumps(result, indent=4))
