
Commit acd94aa

Merge pull request #6160 from lgayhardt/agenteval0725
Agent and eval ask updates
2 parents 72cbd86 + 87bbb49 commit acd94aa

File tree: 3 files changed, +215 -61 lines changed


articles/ai-foundry/concepts/evaluation-evaluators/agent-evaluators.md

Lines changed: 33 additions & 15 deletions
@@ -47,7 +47,7 @@ load_dotenv()
 
 model_config = AzureOpenAIModelConfiguration(
     azure_endpoint=os.environ["AZURE_ENDPOINT"],
-    api_key=os.environ.get["AZURE_API_KEY"],
+    api_key=os.environ.get("AZURE_API_KEY"),
     azure_deployment=os.environ.get("AZURE_DEPLOYMENT_NAME"),
     api_version=os.environ.get("AZURE_API_VERSION"),
 )
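Reviewer note: the removed line subscripts the `os.environ.get` method instead of calling it, which raises a `TypeError` at runtime. A minimal sketch of the corrected pattern (the fallback value shown is illustrative, not from the article):

```python
import os

# Calling .get() returns None (or a supplied default) when the variable is unset;
# bracket-indexing the method object itself is a TypeError.
api_key = os.environ.get("AZURE_API_KEY")
api_version = os.environ.get("AZURE_API_VERSION", "2024-06-01")  # illustrative default
```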
@@ -83,7 +83,7 @@ intent_resolution(
 
 ### Intent resolution output
 
-The numerical score on a Likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason and additional fields can help you understand why the score is high or low.
+The numerical score is on a Likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason and additional fields can help you understand why the score is high or low.
 
 ```python
 {
@@ -104,14 +104,17 @@ The numerical score on a Likert scale (integer 1 to 5) and a higher score is bet
 
 ```
 
-If you're building agents outside of Azure AI Agent Serice, this evaluator accepts a schema typical for agent messages. To learn more, see our sample notebook for [Intent Resolution](https://aka.ms/intentresolution-sample).
+If you're building agents outside of Azure AI Agent Service, this evaluator accepts a schema typical for agent messages. To learn more, see our sample notebook for [Intent Resolution](https://aka.ms/intentresolution-sample).
 
 ## Tool call accuracy
 
-`ToolCallAccuracyEvaluator` measures an agent's ability to select appropriate tools, extract, and process correct parameters from previous steps of the agentic workflow. It detects whether each tool call made is accurate (binary) and reports back the average scores, which can be interpreted as a passing rate across tool calls made.
+`ToolCallAccuracyEvaluator` measures the accuracy and efficiency of tool calls made by an agent in a run. It provides a 1-5 score based on:
+- the relevance and helpfulness of the tool invoked;
+- the correctness of parameters used in tool calls;
+- the counts of missing or excessive calls.
 
 > [!NOTE]
-> `ToolCallAccuracyEvaluator` only supports Azure AI Agent's Function Tool evaluation, but doesn't support Built-in Tool evaluation. The agent messages must have at least one Function Tool actually called to be evaluated.
+> `ToolCallAccuracyEvaluator` only supports Azure AI Agent's Function Tool evaluation, but doesn't support Built-in Tool evaluation. The agent run must have at least one Function Tool call and no Built-in Tool calls made to be evaluated.
 
 ### Tool call accuracy example
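Reviewer note: a minimal sketch of how the evaluator described above can be called, mirroring the article's own `fetch_weather` example; the query, tool call, and tool definition values are illustrative, and `model_config` is assumed from the earlier hunk:

```python
from azure.ai.evaluation import ToolCallAccuracyEvaluator

tool_call_accuracy = ToolCallAccuracyEvaluator(model_config)  # model_config as defined above

result = tool_call_accuracy(
    query="What's the weather in Seattle?",  # illustrative user query
    tool_calls=[{
        "type": "tool_call",
        "tool_call_id": "call_123",  # illustrative ID
        "name": "fetch_weather",
        "arguments": {"location": "Seattle"},
    }],
    tool_definitions=[{
        "name": "fetch_weather",
        "description": "Fetches weather information for a location.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
        },
    }],
)
print(result["tool_call_accuracy"], result["tool_call_accuracy_result"])
```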

@@ -150,20 +153,35 @@ tool_call_accuracy(
 
 ### Tool call accuracy output
 
-The numerical score (passing rate of correct tool calls) is 0-1 and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason and tool call detail fields can help you understand why the score is high or low.
+The numerical score is on a Likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason and tool call detail fields can help you understand why the score is high or low.
 
 ```python
 {
-    "tool_call_accuracy": 1.0,
+    "tool_call_accuracy": 5,
     "tool_call_accuracy_result": "pass",
-    "tool_call_accuracy_threshold": 0.8,
-    "per_tool_call_details": [
-        {
-            "tool_call_accurate": True,
-            "tool_call_accurate_reason": "The input Data should get a Score of 1 because the TOOL CALL is directly relevant to the user's question about the weather in Seattle, includes appropriate parameters that match the TOOL DEFINITION, and the parameter values are correct and relevant to the user's query.",
-            "tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ"
+    "tool_call_accuracy_threshold": 3,
+    "details": {
+        "tool_calls_made_by_agent": 1,
+        "correct_tool_calls_made_by_agent": 1,
+        "per_tool_call_details": [
+            {
+                "tool_name": "fetch_weather",
+                "total_calls_required": 1,
+                "correct_calls_made_by_agent": 1,
+                "correct_tool_percentage": 1.0,
+                "tool_call_errors": 0,
+                "tool_success_result": "pass"
+            }
+        ],
+        "excess_tool_calls": {
+            "total": 0,
+            "details": []
+        },
+        "missing_tool_calls": {
+            "total": 0,
+            "details": []
         }
-    ]
+    }
 }
 ```
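Reviewer note: since the output shape changed from a flat passing rate to a nested `details` object, a short sketch of consuming the new fields (names taken from the sample above; the `result` dict is assumed to come from a `ToolCallAccuracyEvaluator` call):

```python
def summarize_tool_call_accuracy(result: dict) -> str:
    """Summarize the evaluator output using the new nested 'details' shape."""
    details = result.get("details", {})
    return (
        f"score={result['tool_call_accuracy']} ({result['tool_call_accuracy_result']}), "
        f"correct={details.get('correct_tool_calls_made_by_agent', 0)}/"
        f"{details.get('tool_calls_made_by_agent', 0)}, "
        f"excess={details.get('excess_tool_calls', {}).get('total', 0)}, "
        f"missing={details.get('missing_tool_calls', {}).get('total', 0)}"
    )
```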

@@ -187,7 +205,7 @@ task_adherence(
 
 ### Task adherence output
 
-The numerical score on a Likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+The numerical score is on a Likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
 
 ```python
 {

articles/ai-foundry/how-to/develop/agent-evaluate-sdk.md

Lines changed: 169 additions & 17 deletions
@@ -59,6 +59,7 @@ Agents can use tool. Here's an example of creating custom tools you intend the a
 ```python
 from azure.ai.projects.models import FunctionTool, ToolSet
 from typing import Set, Callable, Any
+import json
 
 # Define a custom Python function.
 def fetch_weather(location: str) -> str:
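Reviewer note: the newly added `import json` is needed because the custom tool functions in this article return JSON strings. A minimal sketch of that pattern with mocked data (the weather values are illustrative):

```python
import json

def fetch_weather(location: str) -> str:
    """Fetch weather for a location; returns a JSON string (mock data for illustration)."""
    mock_weather = {"Seattle": "Rainy, 14 C", "Tokyo": "Sunny, 22 C"}
    return json.dumps({"weather": mock_weather.get(location, "unavailable")})

print(fetch_weather("Seattle"))  # {"weather": "Rainy, 14 C"}
```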
@@ -177,7 +178,7 @@ And that's it! `converted_data` contains all inputs required for [these evaluato
 
 For complex tasks that require refined reasoning for the evaluation, we recommend a strong reasoning model like `o3-mini` or the o-series mini models released afterwards with a balance of reasoning performance and cost efficiency.
 
-We set up a list of quality and safety evaluator in `quality_evaluators` and `safety_evaluators` and reference them in [evaluating multiples agent runs or a thread](#evaluate-multiple-agent-runs-or-threads).
+We set up a list of quality and safety evaluators in `quality_evaluators` and `safety_evaluators` and reference them in [evaluating multiples agent runs or a thread](#evaluate-multiple-agent-runs-or-threads).
 
 ```python
 # This is specific to agentic workflows.
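Reviewer note: the hunk header above references `converted_data`; in the published article that object comes from converting an agent thread into evaluator inputs. A rough sketch of that step — the converter class and `convert` signature are assumptions based on the article, not part of this diff:

```python
import os
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.evaluation import AIAgentConverter  # assumed helper, not shown in this diff

project_client = AIProjectClient(
    endpoint=os.environ["AZURE_AI_PROJECT"],
    credential=DefaultAzureCredential(),
)

# Convert a single agent thread/run into the input schema the evaluators expect.
converter = AIAgentConverter(project_client)
converted_data = converter.convert(thread_id="<thread-id>", run_id="<run-id>")
```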
@@ -213,7 +214,7 @@ quality_evaluators.update({ evaluator.__name__: evaluator(model_config=model_con
 ## Using Azure AI Foundry (non-Hub) project endpoint, example: AZURE_AI_PROJECT=https://your-account.services.ai.azure.com/api/projects/your-project
 azure_ai_project = os.environ.get("AZURE_AI_PROJECT")
 
-safety_evaluators = {evaluator.__name__: evaluator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential()) for evaluator in[ContentSafetyEvaluator, IndirectAttackEvaluator, CodeVulnerabilityEvaluator]}
+safety_evaluators = {evaluator.__name__: evaluator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential()) for evaluator in [ContentSafetyEvaluator, IndirectAttackEvaluator, CodeVulnerabilityEvaluator]}
 
 # Reference the quality and safety evaluator list above.
 quality_and_safety_evaluators = {**quality_evaluators, **safety_evaluators}
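Reviewer note: a brief sketch of how the merged evaluator dictionary is typically consumed with the SDK's `evaluate` entry point; the JSONL path and result handling are illustrative:

```python
from azure.ai.evaluation import evaluate

# Batch-evaluate previously converted agent data with all quality and safety evaluators.
results = evaluate(
    data="converted_agent_runs.jsonl",  # illustrative path to converted data
    evaluators=quality_and_safety_evaluators,
)
print(results["metrics"])  # aggregate metrics across rows
```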
@@ -245,20 +246,20 @@ See the following example output for some evaluators:
 ```
 {
     "intent_resolution": 5.0, # likert scale: 1-5 integer
-    "intent_resolution_result": "pass", # pass because 5 > 3 the threshold
     "intent_resolution_threshold": 3,
+    "intent_resolution_result": "pass", # pass because 5 > 3 the threshold
     "intent_resolution_reason": "The assistant correctly understood the user's request to fetch the weather in Seattle. It used the appropriate tool to get the weather information and provided a clear and accurate response with the current weather conditions in Seattle. The response fully resolves the user's query with all necessary information."
 }
 {
     "task_adherence": 5.0, # likert scale: 1-5 integer
-    "task_adherence_result": "pass", # pass because 5 > 3 the threshold
     "task_adherence_threshold": 3,
+    "task_adherence_result": "pass", # pass because 5 > 3 the threshold
     "task_adherence_reason": "The response accurately follows the instructions, fetches the correct weather information, and relays it back to the user without any errors or omissions."
 }
 {
     "tool_call_accuracy": 5, # a score between 1-5, higher is better
-    "tool_call_accuracy_result": "pass", # pass because 1.0 > 0.8 the threshold
     "tool_call_accuracy_threshold": 3,
+    "tool_call_accuracy_result": "pass", # pass because 5 > 3 the threshold
     "details": { ... } # helpful details for debugging the tool calls made by the agent
 }
 ```
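Reviewer note: every evaluator shown above now follows the same `<metric>`, `<metric>_threshold`, `<metric>_result` convention, so pass/fail can be checked generically. A small sketch over that shape (the helper name is ours, not the SDK's):

```python
def metric_passes(output: dict, metric: str) -> bool:
    """Return True when the metric meets its threshold, per the convention shown above."""
    return output.get(metric, 0) >= output.get(f"{metric}_threshold", 3)

# e.g. metric_passes(intent_output, "intent_resolution") -> True for the sample output
```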
@@ -316,7 +317,7 @@ If you're using agents outside Azure AI Foundry Agent Service, you can still eva
 
 Agents typically emit messages to interact with a user or other agents. Our built-in evaluators can accept simple data types such as strings in `query`, `response`, and `ground_truth` according to the [single-turn data input requirements](./evaluate-sdk.md#data-requirements-for-built-in-evaluators). However, it can be a challenge to extract these simple data types from agent messages, due to the complex interaction patterns of agents and framework differences. For example, a single user query can trigger a long list of agent messages, typically with multiple tool calls invoked.
 
-As illustrated in the following example, we enable agent message support specifically for the built-in evaluators `IntentResolution`, `ToolCallAccuracy`, and `TaskAdherence` to evaluate these aspects of agentic workflow. These evaluators take `tool_calls` or `tool_definitions` as parameters unique to agents.
+As illustrated in the following example, we enable agent message support specifically for the built-in evaluators `IntentResolutionEvaluator`, `ToolCallAccuracyEvaluator`, and `TaskAdherenceEvaluator` to evaluate these aspects of agentic workflow. These evaluators take `tool_calls` or `tool_definitions` as parameters unique to agents.
 
 | Evaluator | `query` | `response` | `tool_calls` | `tool_definitions` |
 |----------------|---------------|---------------|---------------|---------------|
@@ -375,16 +376,11 @@ See the following output (reference [Output format](#output-format) for details)
     "intent_resolution_result": "pass",
     "intent_resolution_threshold": 3,
     "intent_resolution_reason": "The response provides the opening hours of the Eiffel Tower, which directly addresses the user's query. The information is clear, accurate, and complete, fully resolving the user's intent.",
-    "additional_details": {
-        "conversation_has_intent": true,
-        "agent_perceived_intent": "inquire about the opening hours of the Eiffel Tower",
-        "actual_user_intent": "inquire about the opening hours of the Eiffel Tower",
-        "correct_intent_detected": true,
-        "intent_resolved": true
-    }
 }
 ```
 
+### Agent tool calls and definitions
+
 See the following examples of `tool_calls` and `tool_definitions` for `ToolCallAccuracyEvaluator`:
 
 ```python
@@ -421,6 +417,10 @@ tool_definitions = [{
         }
     }
 }]
+
+from azure.ai.evaluation import ToolCallAccuracyEvaluator
+
+tool_call_accuracy = ToolCallAccuracyEvaluator(model_config) # reuse the config defined above
 response = tool_call_accuracy(query=query, tool_calls=tool_calls, tool_definitions=tool_definitions)
 print(json.dumps(response, indent=4))
 ```
@@ -436,9 +436,162 @@ See the following output (reference [Output format](#output-format) for details)
 }
 ```
 
-### Agent messages
+### Agent message schema
+
+In agent message format, `query` and `response` are a list of OpenAI-style messages. Specifically, `query` carries the past agent-user interactions leading up to the last user query and requires the system message (of the agent) on top of the list; and `response` carries the last message of the agent in response to the last user query.
 
-In agent message format, `query` and `response` are a list of OpenAI-style messages. Specifically, `query` carries the past agent-user interactions leading up to the last user query and requires the system message (of the agent) on top of the list; and `response` carries the last message of the agent in response to the last user query. See the following example:
+The expected input format for the evaluators is a Python list of messages as follows:
+
+```
+[
+    {
+        "role": "system" | "user" | "assistant" | "tool",
+        "createdAt": "ISO 8601 timestamp", // Optional for 'system'
+        "run_id": "string", // Optional, only for assistant/tool in tool call context
+        "tool_call_id": "string", // Optional, only for tool/tool_result
+        "name": "string", // Present if it's a tool call
+        "arguments": { ... }, // Parameters passed to the tool (if tool call)
+        "content": [
+            {
+                "type": "text" | "tool_call" | "tool_result",
+                "text": "string", // if type == text
+                "tool_call_id": "string", // if type == tool_call
+                "name": "string", // tool name if type == tool_call
+                "arguments": { ... }, // tool args if type == tool_call
+                "tool_result": { ... } // result if type == tool_result
+            }
+        ]
+    }
+]
+```
+
+Sample query and response objects:
+
+```python
+query = [
+    {
+        "role": "system",
+        "content": "You are an AI assistant interacting with Azure Maps services to serve user requests."
+    },
+    {
+        "createdAt": "2025-04-25T23:55:43Z",
+        "role": "user",
+        "content": [
+            {
+                "type": "text",
+                "text": "Find the address for coordinates 41.8781,-87.6298."
+            }
+        ]
+    },
+    {
+        "createdAt": "2025-04-25T23:55:45Z",
+        "run_id": "run_DGE8RWPS8A9SmfCg61waRx9u",
+        "role": "assistant",
+        "content": [
+            {
+                "type": "tool_call",
+                "tool_call_id": "call_nqNyhOFRw4FmF50jaCCq2rDa",
+                "name": "azure_maps_reverse_address_search",
+                "arguments": {
+                    "lat": "41.8781",
+                    "lon": "-87.6298"
+                }
+            }
+        ]
+    },
+    {
+        "createdAt": "2025-04-25T23:55:47Z",
+        "run_id": "run_DGE8RWPS8A9SmfCg61waRx9u",
+        "tool_call_id": "call_nqNyhOFRw4FmF50jaCCq2rDa",
+        "role": "tool",
+        "content": [
+            {
+                "type": "tool_result",
+                "tool_result": {
+                    "address": "300 South Federal Street, Chicago, IL 60604",
+                    "position": {
+                        "lat": "41.8781",
+                        "lon": "-87.6298"
+                    }
+                }
+            }
+        ]
+    },
+    {
+        "createdAt": "2025-04-25T23:55:48Z",
+        "run_id": "run_DGE8RWPS8A9SmfCg61waRx9u",
+        "role": "assistant",
+        "content": [
+            {
+                "type": "text",
+                "text": "The address for the coordinates 41.8781, -87.6298 is 300 South Federal Street, Chicago, IL 60604."
+            }
+        ]
+    },
+    {
+        "createdAt": "2025-04-25T23:55:50Z",
+        "role": "user",
+        "content": [
+            {
+                "type": "text",
+                "text": "What timezone corresponds to 41.8781,-87.6298?"
+            }
+        ]
+    },
+]
+
+response = [
+    {
+        "createdAt": "2025-04-25T23:55:52Z",
+        "run_id": "run_DmnhUGqYd1vCBolcjjODVitB",
+        "role": "assistant",
+        "content": [
+            {
+                "type": "tool_call",
+                "tool_call_id": "call_qi2ug31JqzDuLy7zF5uiMbGU",
+                "name": "azure_maps_timezone",
+                "arguments": {
+                    "lat": 41.878100000000003,
+                    "lon": -87.629800000000003
+                }
+            }
+        ]
+    },
+    {
+        "createdAt": "2025-04-25T23:55:54Z",
+        "run_id": "run_DmnhUGqYd1vCBolcjjODVitB",
+        "tool_call_id": "call_qi2ug31JqzDuLy7zF5uiMbGU",
+        "role": "tool",
+        "content": [
+            {
+                "type": "tool_result",
+                "tool_result": {
+                    "ianaId": "America/Chicago",
+                    "utcOffset": None,
+                    "abbreviation": None,
+                    "isDaylightSavingTime": None
+                }
+            }
+        ]
+    },
+    {
+        "createdAt": "2025-04-25T23:55:55Z",
+        "run_id": "run_DmnhUGqYd1vCBolcjjODVitB",
+        "role": "assistant",
+        "content": [
+            {
+                "type": "text",
+                "text": "The timezone for the coordinates 41.8781, -87.6298 is America/Chicago."
+            }
+        ]
+    }
+]
+```
+
+> [!NOTE]
+> The evaluator throws a warning that query (the conversation history till the current run) or agent response (the response to the query) can't be parsed when their format isn't the expected one.
+
+See an example of evaluating the agent messages with `ToolCallAccuracyEvaluator`:
 
 ```python
 import json
@@ -525,10 +678,9 @@ tool_definitions = [
     # ...
 ]
 
-result = intent_resolution_evaluator(
+result = tool_call_accuracy(
     query=query,
     response=response,
-    # Optionally, provide the tool definitions.
     tool_definitions=tool_definitions
 )
 print(json.dumps(result, indent=4))
