Commit 08c1a9f

Merge pull request #7714 from TimShererWithAquent/us496641-12
Freshness Edit: AI Foundry: Agent evaluators
2 parents c1c5136 + 9da7f01 commit 08c1a9f

File tree

1 file changed: +26 -25 lines changed

articles/ai-foundry/concepts/evaluation-evaluators/agent-evaluators.md

Lines changed: 26 additions & 25 deletions
@@ -1,11 +1,11 @@
 ---
-title: Agent evaluators for generative AI
+title: Agent Evaluators for Generative AI
 titleSuffix: Azure AI Foundry
 description: Learn how to evaluate Azure AI agents using intent resolution, tool call accuracy, and task adherence evaluators.
 author: lgayhardt
 ms.author: lagayhar
 ms.reviewer: changliu2
-ms.date: 07/15/2025
+ms.date: 10/17/2025
 ms.service: azure-ai-foundry
 ms.topic: reference
 ms.custom:
@@ -17,26 +17,26 @@ ms.custom:
 
 [!INCLUDE [feature-preview](../../includes/feature-preview.md)]
 
-Agents are powerful productivity assistants. They can plan, make decisions, and execute actions. Agents typically first [reason through user intents in conversations](#intent-resolution), [select the correct tools](#tool-call-accuracy) to call and satisfy the user requests, and [complete various tasks](#task-adherence) according to their instructions. We currently support these agent-specific evaluators for agentic workflows:
+Agents are powerful productivity assistants. They can plan, make decisions, and execute actions. Agents typically first [reason through user intents in conversations](#intent-resolution), [select the correct tools](#tool-call-accuracy) to call and satisfy the user requests, and [complete various tasks](#task-adherence) according to their instructions. Azure AI Foundry currently supports these agent-specific evaluators for agentic workflows:
 
 - [Intent resolution](#intent-resolution)
 - [Tool call accuracy](#tool-call-accuracy)
 - [Task adherence](#task-adherence)
 
 ## Evaluating Azure AI agents
 
-Agents emit messages, and providing the above inputs typically require parsing messages and extracting the relevant information. If you're building agents using Azure AI Agent Service, we provide native integration for evaluation that directly takes their agent messages. To learn more, see an [end-to-end example of evaluating agents in Azure AI Agent Service](https://aka.ms/e2e-agent-eval-sample).
+Agents emit messages. Providing inputs typically requires parsing messages and extracting the relevant information. If you're building agents using Azure AI Agent Service, the service provides native integration for evaluation that directly takes their agent messages. For an example, see [Evaluate AI agents](https://aka.ms/e2e-agent-eval-sample).
 
-Besides `IntentResolution`, `ToolCallAccuracy`, `TaskAdherence` specific to agentic workflows, you can also assess other quality and safety aspects of your agentic workflows, using our comprehensive suite of built-in evaluators. We support this list of evaluators for Azure AI agent messages from our converter:
+Besides `IntentResolution`, `ToolCallAccuracy`, and `TaskAdherence` specific to agentic workflows, you can also assess other quality and safety aspects of your agentic workflows, using a comprehensive suite of built-in evaluators. Azure AI Foundry supports this list of evaluators for Azure AI agent messages from our converter:
 
 - **Quality**: `IntentResolution`, `ToolCallAccuracy`, `TaskAdherence`, `Relevance`, `Coherence`, `Fluency`
-- **Safety**: `CodeVulnerabilities`, `Violence`, `Self-harm`, `Sexual`, `HateUnfairness`, `IndirectAttack`, `ProtectedMaterials`.
+- **Safety**: `CodeVulnerabilities`, `Violence`, `Self-harm`, `Sexual`, `HateUnfairness`, `IndirectAttack`, `ProtectedMaterials`
 
-In this article we show examples of `IntentResolution`, `ToolCallAccuracy`, and `TaskAdherence`. For examples of using other evaluators with Azure AI agent messages, see [evaluating Azure AI agents](../../how-to/develop/agent-evaluate-sdk.md#evaluate-azure-ai-agents).
+This article shows examples of `IntentResolution`, `ToolCallAccuracy`, and `TaskAdherence`. For examples of using other evaluators with Azure AI agent messages, see [evaluating Azure AI agents](../../how-to/develop/agent-evaluate-sdk.md#evaluate-azure-ai-agents).
 
 ## Model configuration for AI-assisted evaluators
 
-For reference in the following code snippets, the AI-assisted evaluators use a model configuration for the LLM-judge:
+For reference in the following code snippets, the AI-assisted evaluators use a model configuration for the large language model-judge (LLM-judge):
 
 ```python
 import os
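The diff truncates the configuration snippet here. As a minimal sketch of the full block it opens, matching the `model_config = AzureOpenAIModelConfiguration(` context shown in the next hunk (the environment variable names are illustrative assumptions, not part of the original):

```python
import os
from azure.ai.evaluation import AzureOpenAIModelConfiguration

# Illustrative environment variable names; substitute your own deployment details.
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION"],
)
```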
@@ -54,18 +54,18 @@ model_config = AzureOpenAIModelConfiguration(
 
 ### Evaluator models support
 
-We support AzureOpenAI or OpenAI [reasoning models](../../../ai-services/openai/how-to/reasoning.md) and non-reasoning models for the LLM-judge depending on the evaluators:
+Azure AI Agent Service supports AzureOpenAI or OpenAI [reasoning models](../../../ai-services/openai/how-to/reasoning.md) and non-reasoning models for the LLM-judge depending on the evaluators:
 
-| Evaluators | Reasoning Models as Judge (example: o-series models from Azure OpenAI / OpenAI) | Non-reasoning models as Judge (example: gpt-4.1, gpt-4o, etc.) | To enable |
+| Evaluators | Reasoning Models as Judge (example: o-series models from Azure OpenAI / OpenAI) | Non-reasoning models as Judge (example: gpt-4.1 or gpt-4o) | To enable |
 |--|--|--|--|
 | `Intent Resolution`, `Task Adherence`, `Tool Call Accuracy`, `Response Completeness` | Supported | Supported | Set additional parameter `is_reasoning_model=True` in initializing evaluators |
-| Other quality evaluators| Not Supported | Supported | -- |
+| Other quality evaluators| Not Supported | Supported |--|
 
-For complex evaluation that requires refined reasoning, we recommend a strong reasoning model like `o3-mini` and o-series mini models released afterwards with a balance of reasoning performance and cost efficiency.
+For complex evaluation that requires refined reasoning, we recommend a strong reasoning model with a balance of reasoning performance and cost efficiency, like `o3-mini` and o-series mini models released afterwards.
 
 ## Intent resolution
 
-`IntentResolutionEvaluator` measures how well the system identifies and understands a user's request, including how well it scopes the user's intent, asks clarifying questions, and reminds end users of its scope of capabilities. Higher score means better identification of user intent.
+`IntentResolutionEvaluator` measures how well the system identifies and understands a user's request. This understanding includes how well it scopes the user's intent, asks questions to clarify, and reminds end users of its scope of capabilities. Higher score means better identification of user intent.
 
 ### Intent resolution example
 
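Per the table above, the agentic evaluators take an additional `is_reasoning_model` parameter when the LLM-judge is a reasoning model. A minimal sketch, assuming the `model_config` from the earlier snippet points at an o-series deployment:

```python
from azure.ai.evaluation import IntentResolutionEvaluator

# Set is_reasoning_model=True when the judge deployment is an o-series
# reasoning model (see the "Evaluator models support" table above).
intent_resolution = IntentResolutionEvaluator(
    model_config=model_config,
    is_reasoning_model=True,
)
```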
@@ -82,7 +82,7 @@ intent_resolution(
 
 ### Intent resolution output
 
-The numerical score is on a Likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason and additional fields can help you understand why the score is high or low.
+The numerical score is on a Likert scale (integer 1 to 5). A higher score is better. Given a numerical threshold (default to 3), the evaluator also outputs *pass* if the score >= threshold, or *fail* otherwise. Using the reason and other fields can help you understand why the score is high or low.
 
 ```python
 {
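The example call referenced by the `intent_resolution(` hunk context isn't shown in this diff. As a hedged sketch of its typical shape (the query and response strings are invented for illustration; real inputs can also be agent message lists):

```python
# Illustrative single-turn inputs for the evaluator initialized above.
result = intent_resolution(
    query="What are your opening hours?",
    response="We're open Monday through Friday, 9 AM to 5 PM.",
)
print(result)  # Includes the score, a pass/fail result, and a reason field.
```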
@@ -101,14 +101,15 @@ The numerical score is on a Likert scale (integer 1 to 5) and a higher score is
 
 ```
 
-If you're building agents outside of Azure AI Foundry Agent Service, this evaluator accepts a schema typical for agent messages. To learn more, see our sample notebook for [Intent Resolution](https://aka.ms/intentresolution-sample).
+If you're building agents outside of Azure AI Foundry Agent Service, this evaluator accepts a schema typical for agent messages. For a sample notebook, see [Intent Resolution](https://aka.ms/intentresolution-sample).
 
 ## Tool call accuracy
 
-`ToolCallAccuracyEvaluator` measures the accuracy and efficiency of tool calls made by an agent in a run. It provides a 1-5 score based on:
-- the relevance and helpfulness of the tool invoked;
-- the correctness of parameters used in tool calls;
-- the counts of missing or excessive calls.
+`ToolCallAccuracyEvaluator` measures the accuracy and efficiency of tool calls made by an agent in a run. It provides a 1-5 score based on:
+
+- The relevance and helpfulness of the tool invoked
+- The correctness of parameters used in tool calls
+- The counts of missing or excessive calls
 
 #### Tool call evaluation support
 
@@ -124,7 +125,7 @@ If you're building agents outside of Azure AI Foundry Agent Service, this evalua
 - OpenAPI
 - Function Tool (user-defined tools)
 
-However, if a non-supported tool is used in the agent run, it outputs a "pass" and a reason that evaluating the invoked tool(s) isn't supported, for ease of filtering out these cases. It's recommended that you wrap non-supported tools as user-defined tools to enable evaluation.
+If a non-supported tool is used in the agent run, the evaluator outputs a *pass* and a reason that evaluating the invoked tools isn't supported. This approach makes it easy to filter out these cases. We recommend that you wrap non-supported tools as user-defined tools to enable evaluation.
 
 ### Tool call accuracy example
 
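The tool call accuracy example itself is elided from this diff. As a rough sketch of how such an evaluation might be wired up, assuming the evaluator accepts a query plus tool calls and tool definitions (the `fetch_weather` tool and all payload shapes below are illustrative assumptions):

```python
from azure.ai.evaluation import ToolCallAccuracyEvaluator

tool_call_accuracy = ToolCallAccuracyEvaluator(model_config=model_config)

# Hypothetical tool call and definition; a real run would pass the agent's
# actual tool invocations and the tool schemas it had available.
result = tool_call_accuracy(
    query="How is the weather in Seattle?",
    tool_calls=[
        {
            "type": "tool_call",
            "tool_call_id": "call_1",
            "name": "fetch_weather",
            "arguments": {"location": "Seattle"},
        }
    ],
    tool_definitions=[
        {
            "name": "fetch_weather",
            "description": "Fetches weather information for the specified location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name."}
                },
            },
        }
    ],
)
```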
@@ -235,7 +236,7 @@ tool_call_accuracy(
 
 ### Tool call accuracy output
 
-The numerical score is on a Likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason and tool call detail fields can help you understand why the score is high or low.
+The numerical score is on a Likert scale (integer 1 to 5). A higher score is better. Given a numerical threshold (default to 3), the evaluator also outputs *pass* if the score >= threshold, or *fail* otherwise. Use the reason and tool call detail fields to understand why the score is high or low.
 
 ```python
 {
@@ -267,11 +268,11 @@ The numerical score is on a Likert scale (integer 1 to 5) and a higher score is
 }
 ```
 
-If you're building agents outside of Azure AI Agent Service, this evaluator accepts a schema typical for agent messages. To learn more, see, our sample notebook for [Tool Call Accuracy](https://aka.ms/toolcallaccuracy-sample).
+If you're building agents outside of Azure AI Agent Service, this evaluator accepts a schema typical for agent messages. For a sample notebook, see [Tool Call Accuracy](https://aka.ms/toolcallaccuracy-sample).
 
 ## Task adherence
 
-In various task-oriented AI systems such as agentic systems, it's important to assess whether the agent has stayed on track to complete a given task instead of making inefficient or out-of-scope steps. `TaskAdherenceEvaluator` measures how well an agent's response adheres to their assigned tasks, according to their task instruction (extracted from system message and user query), and available tools. Higher score means better adherence of the system instruction to resolve the given task.
+In various task-oriented AI systems, such as agentic systems, it's important to assess whether the agent stays on track to complete a task instead of making inefficient or out-of-scope steps. `TaskAdherenceEvaluator` measures how well an agent's response adheres to their assigned tasks, according to their task instruction and available tools. The task instruction is extracted from system message and user query. Higher score means better adherence of the system instruction to resolve the task.
 
 ### Task adherence example
 
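The task adherence example is likewise elided from this diff. A minimal sketch under the same assumptions as the earlier snippets (the query and response strings are invented for illustration):

```python
from azure.ai.evaluation import TaskAdherenceEvaluator

task_adherence = TaskAdherenceEvaluator(model_config=model_config)

# Illustrative inputs; a real run would pass the agent's actual task
# instruction (system message plus user query) and its response.
result = task_adherence(
    query="Recommend three beginner-friendly hiking trails near Seattle.",
    response="Here are three easy trails near Seattle: Twin Falls, Rattlesnake Ledge, and Franklin Falls.",
)
```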
@@ -287,7 +288,7 @@ task_adherence(
 
 ### Task adherence output
 
-The numerical score is on a Likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+The numerical score is on a Likert scale (integer 1 to 5). A higher score is better. Given a numerical threshold (default to 3), the evaluator also outputs *pass* if the score >= threshold, or *fail* otherwise. Use the reason field to understand why the score is high or low.
 
 ```python
 {
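The pass/fail rule described in the output sections above is simple to restate in code. This standalone sketch mirrors the documented behavior (the function name is generic, not an SDK API):

```python
def to_pass_fail(score: int, threshold: int = 3) -> str:
    """Mirror the documented rule: pass when the Likert score meets the threshold."""
    return "pass" if score >= threshold else "fail"

assert to_pass_fail(4) == "pass"
assert to_pass_fail(2) == "fail"
```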
@@ -298,7 +299,7 @@ The numerical score is on a Likert scale (integer 1 to 5) and a higher score is
 }
 ```
 
-If you're building agents outside of Azure AI Agent Service, this evaluator accepts a schema typical for agent messages. To learn more, see our sample notebook for [Task Adherence](https://aka.ms/taskadherence-sample).
+If you're building agents outside of Azure AI Agent Service, this evaluator accepts a schema typical for agent messages. For a sample notebook, see [Task Adherence](https://aka.ms/taskadherence-sample).
 
 ## Related content
 