
Commit f1f1aae

committed: minor updates to address reviewer comments
1 parent efb1e26 · commit f1f1aae

File tree: 4 files changed, +53 −21 lines


articles/ai-foundry/concepts/evaluation-evaluators/agent-evaluators.md

Lines changed: 3 additions & 3 deletions
@@ -6,7 +6,7 @@ author: lgayhardt
 ms.author: lagayhar
 manager: scottpolly
 ms.reviewer: changliu2
-ms.date: 06/26/2025
+ms.date: 07/15/2025
 ms.service: azure-ai-foundry
 ms.topic: reference
 ms.custom:
@@ -18,7 +18,7 @@ ms.custom:
 
 [!INCLUDE [feature-preview](../../includes/feature-preview.md)]
 
-Agents are powerful productivity assistants. They can plan, make decisions, and execute actions. Agents typically first [reason through user intents in conversations](#intent-resolution), [select the correct tools](#tool-call-accuracy) to call and satisfy the user requests, and [complete various tasks](#task-adherence) according to their instructions. We current support evaluating:
+Agents are powerful productivity assistants. They can plan, make decisions, and execute actions. Agents typically first [reason through user intents in conversations](#intent-resolution), [select the correct tools](#tool-call-accuracy) to call and satisfy the user requests, and [complete various tasks](#task-adherence) according to their instructions. We currently support these agent-specific evaluators for agentic workflows:
 - [Intent resolution](#intent-resolution)
 - [Tool call accuracy](#tool-call-accuracy)
 - [Task adherence](#task-adherence)
@@ -184,7 +184,7 @@ task_adherence(
 
 ### Task adherence output
 
-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+The numerical score on a Likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
 
 ```python
 {
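The hunk above truncates the output snippet. For readers of this diff, that output follows the same score/result/threshold/reason pattern used by the other evaluators in this commit; a representative sketch (field values are illustrative, not taken from the article):

```python
{
    "task_adherence": 5,  # integer score on a 1-5 Likert scale; higher is better
    "task_adherence_result": "pass",  # "pass" if score >= threshold, otherwise "fail"
    "task_adherence_threshold": 3,  # default threshold of 3
    "task_adherence_reason": "The response follows the instructions and completes the requested task."
}
```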

articles/ai-foundry/concepts/evaluation-evaluators/rag-evaluators.md

Lines changed: 6 additions & 4 deletions
@@ -6,7 +6,7 @@ author: lgayhardt
 ms.author: lagayhar
 manager: scottpolly
 ms.reviewer: changliu2
-ms.date: 06/26/2025
+ms.date: 07/15/2025
 ms.service: azure-ai-foundry
 ms.topic: reference
 ms.custom:
@@ -25,9 +25,9 @@ A retrieval-augmented generation (RAG) system tries to generate the most relevan
 - [Response Completeness](#response-completeness)
 
 These evaluators address three aspects:
-1. The relevance of the retrieval results to the user's query: use [Document Retrieval](#document-retrieval) if you have labels for query-specific document relevance, or query relevance judgement (qrels) for more accurate measurements. Use [Retrieval](#retrieval) if you only have the retrieved context, but you don't have such labels and have a higher tolerance for a less fine-grained measurement.
-2. The consistency of the generated response with respect to the grounding documents: use [Groundedness](#groundedness) if you want to potentially customize the definition of groundedness in our open-source LLM-judge prompt, [Groundedness Pro](#groundedness-pro) if you want a straightforward definition.
-3. The relevance of the final response to the query: [Relevance](#relevance) if you don't have ground truth, and [Response Completeness](#response-completeness) if you have ground truth and don't want your response to miss critical information.
+- The relevance of the retrieval results to the user's query: use [Document Retrieval](#document-retrieval) if you have labels for query-specific document relevance, or query relevance judgement (qrels) for more accurate measurements. Use [Retrieval](#retrieval) if you only have the retrieved context, but you don't have such labels and have a higher tolerance for a less fine-grained measurement.
+- The consistency of the generated response with respect to the grounding documents: use [Groundedness](#groundedness) if you want to potentially customize the definition of groundedness in our open-source LLM-judge prompt, [Groundedness Pro](#groundedness-pro) if you want a straightforward definition.
+- The relevance of the final response to the query: [Relevance](#relevance) if you don't have ground truth, and [Response Completeness](#response-completeness) if you have ground truth and don't want your response to miss critical information.
 
 A good way to think about **Groundedness** and **Response Completeness** is: groundedness is about the **precision** aspect of the response that it shouldn't contain content outside of the grounding context, whereas response completeness is about the **recall** aspect of the response that it shouldn't miss critical information compared to the expected response (ground truth).
 
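As an editorial aside to the precision/recall framing in the hunk above, here is a minimal sketch using the `azure-ai-evaluation` package (assuming an Azure OpenAI judge configuration in `model_config`; the exact constructor and call parameters may differ across preview versions of the SDK):

```python
import os
from azure.ai.evaluation import GroundednessEvaluator, ResponseCompletenessEvaluator

# Judge model configuration (endpoint/key/deployment assumed to come from environment variables)
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": os.environ["AZURE_OPENAI_DEPLOYMENT"],
}

context = "The store opens at 9 AM and closes at 6 PM on weekdays."

# Precision: is everything in the response supported by the grounding context?
groundedness = GroundednessEvaluator(model_config=model_config)
print(groundedness(query="When does the store open?", context=context, response="The store opens at 9 AM."))

# Recall: does the response cover the critical information in the ground truth?
completeness = ResponseCompletenessEvaluator(model_config=model_config)
print(completeness(response="The store opens at 9 AM.", ground_truth=context))
```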
@@ -162,8 +162,10 @@ retrieved_documents = [
 ]
 
 document_retrieval_evaluator = DocumentRetrievalEvaluator(
+    # Specify the ground truth label range
     ground_truth_label_min=ground_truth_label_min,
     ground_truth_label_max=ground_truth_label_max,
+    # Optionally override the binarization threshold for pass/fail output
     ndcg_threshold = 0.5,
     xdcg_threshold = 50.0,
     fidelity_threshold = 0.5,
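For context beyond this hunk, the constructor above is then called on qrels-style inputs; a rough sketch (the `document_id`, `query_relevance_label`, and `relevance_score` field names and the surrounding variables are assumptions based on the evaluator's documented input shape, not part of this commit):

```python
# Query-specific relevance labels (qrels) within the declared ground truth label range
retrieval_ground_truth = [
    {"document_id": "1", "query_relevance_label": 4},
    {"document_id": "2", "query_relevance_label": 2},
    {"document_id": "3", "query_relevance_label": 0},
]

# Documents returned by the retrieval system, with their retrieval scores
retrieved_documents = [
    {"document_id": "2", "relevance_score": 45.1},
    {"document_id": "1", "relevance_score": 35.8},
    {"document_id": "3", "relevance_score": 29.2},
]

result = document_retrieval_evaluator(
    retrieval_ground_truth=retrieval_ground_truth,
    retrieved_documents=retrieved_documents,
)
print(result)  # NDCG/XDCG/fidelity metrics plus pass/fail fields judged against the thresholds set above
```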

articles/ai-foundry/how-to/develop/agent-evaluate-sdk.md

Lines changed: 6 additions & 6 deletions
@@ -50,7 +50,7 @@ If you use [Foundry Agent Service](../../../ai-services/agents/overview.md), you
 
 
 > [!NOTE]
-> `ToolCallAccuracyEvaluator` only supports Foundry Agent's Function Tool evaluation (user-defined python functions), but doesn't support other Tool evaluation. If an agent run invoked a tool other than Function Tool, it will output a "pass" and a reason that evaluating the invoked tool(s) is not supported.
+> `ToolCallAccuracyEvaluator` only supports Foundry Agent's Function Tool evaluation (user-defined Python functions), but doesn't support other Tool evaluation. If an agent run invoked a tool other than Function Tool, it will output a "pass" and a reason that evaluating the invoked tool(s) is not supported.
 
 
 Here's an example that shows you how to seamlessly build and evaluate an Azure AI agent. Separately from evaluation, Azure AI Foundry Agent Service requires `pip install azure-ai-projects azure-identity`, an Azure AI project connection string, and the supported models.
@@ -97,7 +97,7 @@ AGENT_NAME = "Seattle Tourist Assistant"
 If you are using [Azure AI Foundry (non-Hub) project](../create-projects.md?tabs=ai-foundry&pivots=fdp-project), create an agent with the toolset as follows:
 
 > [!NOTE]
-> If you are using a [Foundry Hub-based project](../create-projects.md?tabs=ai-foundry&pivots=hub-project) (which only supports lower versions of `azure-ai-projects<1.0.0b10 azure-ai-agents<1.0.0b10`), we strongly recommend migrating to [the latest Foundry Agent Service SDK python client library](../../agents/quickstart.md?pivots=programming-language-python-azure) with a [Foundry project set up for loggging batch evaluation results](../../how-to/develop/evaluate-sdk.md#prerequisite-set-up-steps-for-azure-ai-foundry-projects).
+> If you are using a [Foundry Hub-based project](../create-projects.md?tabs=ai-foundry&pivots=hub-project) (which only supports lower versions of `azure-ai-projects<1.0.0b10 azure-ai-agents<1.0.0b10`), we strongly recommend migrating to [the latest Foundry Agent Service SDK Python client library](../../agents/quickstart.md?pivots=programming-language-python-azure) with a [Foundry project set up for logging batch evaluation results](../../how-to/develop/evaluate-sdk.md#prerequisite-set-up-steps-for-azure-ai-foundry-projects).
 
 ```python
 import os
@@ -184,7 +184,7 @@ And that's it! `converted_data` will contain all inputs required for [these eval
 | `Intent Resolution` / `Task Adherence` / `Tool Call Accuracy` / `Response Completeness`) | Supported | Supported | Set additional parameter `is_reasoning_model=True` in initializing evaluators |
 | Other quality evaluators| Not Supported | Supported | -- |
 
-For complex tasks that requires refined reasoning for the evaluation, we recommend a strong reasoning model like `o3-mini` or the o-series mini models released afterwards with a balance of reasoning performance and cost efficiency.
+For complex tasks that require refined reasoning for the evaluation, we recommend a strong reasoning model like `o3-mini` or the o-series mini models released afterwards with a balance of reasoning performance and cost efficiency.
 
 We set up a list of quality and safety evaluator in `quality_evaluators` and `safety_evaluators` and reference them in [evaluating multiples agent runs or a thread](#evaluate-multiple-agent-runs-or-threads).
 
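The `is_reasoning_model=True` flag referenced in the table above is passed when constructing the AI-assisted evaluators; a minimal sketch (assuming `model_config` points at a reasoning-model judge deployment such as `o3-mini`, and that the dictionary mirrors the `quality_evaluators` setup the article describes):

```python
from azure.ai.evaluation import IntentResolutionEvaluator, TaskAdherenceEvaluator, ToolCallAccuracyEvaluator

# The judge deployment is a reasoning model, so enable reasoning-model handling in each evaluator.
quality_evaluators = {
    "intent_resolution": IntentResolutionEvaluator(model_config=model_config, is_reasoning_model=True),
    "task_adherence": TaskAdherenceEvaluator(model_config=model_config, is_reasoning_model=True),
    "tool_call_accuracy": ToolCallAccuracyEvaluator(model_config=model_config, is_reasoning_model=True),
}
```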
@@ -266,7 +266,7 @@ See the following example output for some evaluators:
     "task_adherence_reason": "The response accurately follows the instructions, fetches the correct weather information, and relays it back to the user without any errors or omissions."
 }
 {
-    "tool_call_accuracy": 5, # a score bewteen 1-5, higher is better
+    "tool_call_accuracy": 5, # a score between 1-5, higher is better
     "tool_call_accuracy_result": "pass", # pass because 1.0 > 0.8 the threshold
     "tool_call_accuracy_threshold": 3,
     "details": { ... } # helpful details for debugging the tool calls made by the agent
@@ -439,7 +439,7 @@ See the following output (reference [Output format](#output-format) for details)
 
 ```
 {
-    "tool_call_accuracy": 3, # a score bewteen 1-5, higher is better
+    "tool_call_accuracy": 3, # a score between 1-5, higher is better
     "tool_call_accuracy_result": "fail",
     "tool_call_accuracy_threshold": 4,
     "details": { ... } # helpful details for debugging the tool calls made by the agent
@@ -549,7 +549,7 @@ See the following output (reference [Output format](#output-format) for details)
 
 ```
 {
-    "tool_call_accuracy": 2, # a score bewteen 1-5, higher is better
+    "tool_call_accuracy": 2, # a score between 1-5, higher is better
    "tool_call_accuracy_result": "fail",
    "tool_call_accuracy_threshold": 3,
    "details": { ... } # helpful details for debugging the tool calls made by the agent

articles/ai-foundry/how-to/develop/evaluate-sdk.md

Lines changed: 38 additions & 8 deletions
@@ -6,7 +6,7 @@ author: lgayhardt
 ms.author: lagayhar
 manager: scottpolly
 ms.reviewer: minthigpen
-ms.date: 05/19/2025
+ms.date: 07/15/2025
 ms.service: azure-ai-foundry
 ms.topic: how-to
 ms.custom:
@@ -51,14 +51,44 @@ Built-in quality and safety metrics take in query and response pairs, along with
 ### Data requirements for built-in evaluators
 
 Built-in evaluators can accept query and response pairs, a list of conversations in JSON Lines (JSONL) format, or both.
+| Evaluator | Conversation & single-turn support for text | Conversation & single-turn support for text and image | Single-turn support for text only | Requires `ground_truth` | Supports [agent inputs](./agent-evaluate-sdk.md#agent-messages) |
+|-----------|---------------------------------------------|-------------------------------------------------------|-----------------------------------|---------------------|----------------------|
+| **Quality Evaluators** |
+| `IntentResolutionEvaluator` | | | | | ✅ |
+| `ToolCallAccuracyEvaluator` | | | | | ✅ |
+| `TaskAdherenceEvaluator` | | | | | ✅ |
+| `GroundednessEvaluator` | ✅ | | | | |
+| `GroundednessProEvaluator` | ✅ | | | | |
+| `RetrievalEvaluator` | ✅ | | | | |
+| `DocumentRetrievalEvaluator` | ✅ | | | ✅ | |
+| `RelevanceEvaluator` | ✅ | | | | ✅ |
+| `CoherenceEvaluator` | ✅ | | | | ✅ |
+| `FluencyEvaluator` | ✅ | | | | ✅ |
+| `ResponseCompletenessEvaluator` | ✅ | | ✅ | ✅ | |
+| `QAEvaluator` | | | ✅ | ✅ | |
+| **NLP Evaluators** |
+| `SimilarityEvaluator` | | | ✅ | ✅ | |
+| `F1ScoreEvaluator` | | | ✅ | ✅ | |
+| `RougeScoreEvaluator` | | | ✅ | ✅ | |
+| `GleuScoreEvaluator` | | | ✅ | ✅ | |
+| `BleuScoreEvaluator` | | | ✅ | ✅ | |
+| `MeteorScoreEvaluator` | | | ✅ | ✅ | |
+| **Safety Evaluators** |
+| `ViolenceEvaluator` | | ✅ | | | ✅ |
+| `SexualEvaluator` | | ✅ | | | ✅ |
+| `SelfHarmEvaluator` | | ✅ | | | ✅ |
+| `HateUnfairnessEvaluator` | | ✅ | | | ✅ |
+| `ProtectedMaterialEvaluator` | | ✅ | | | ✅ |
+| `ContentSafetyEvaluator` | | ✅ | | | ✅ |
+| `UngroundedAttributesEvaluator` | | | ✅ | | |
+| `CodeVulnerabilityEvaluator` | | | ✅ | | ✅ |
+| `IndirectAttackEvaluator` | ✅ | | | | ✅ |
+| **Azure OpenAI Graders** |
+| `AzureOpenAILabelGrader` | ✅ | | | | |
+| `AzureOpenAIStringCheckGrader` | ✅ | | | | |
+| `AzureOpenAITextSimilarityGrader` | ✅ | | | ✅ | |
+| `AzureOpenAIGrader` | ✅ | | | | |
 
-| Conversation *and* single-turn support for text | Conversation *and* single-turn support for text and image | Single-turn support for text only |
-|--------------------|------------------------------|---------------|
-| `GroundednessEvaluator`, `GroundednessProEvaluator`, `RetrievalEvaluator`, `DocumentRetrievalEvaluator`, `RelevanceEvaluator`, `CoherenceEvaluator`, `FluencyEvaluator`, `ResponseCompletenessEvaluator`, `IndirectAttackEvaluator`, `AzureOpenAILabelGrader`, `AzureOpenAIStringCheckGrader`, `AzureOpenAITextSimilarityGrader`, `AzureOpenAIGrader` | `ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`, `ProtectedMaterialEvaluator`, `ContentSafetyEvaluator` | `UngroundedAttributesEvaluator`, `CodeVulnerabilityEvaluator`, `ResponseCompletenessEvaluator`, `SimilarityEvaluator`, `F1ScoreEvaluator`, `RougeScoreEvaluator`, `GleuScoreEvaluator`, `BleuScoreEvaluator`, `MeteorScoreEvaluator`, `QAEvaluator` |
-
-| Evaluators requiring `ground_truth` | Evaluators supporting agent inputs |
-|--------------------|------------------------------|
-| `DocumentRetrievalEvaluator`, `ResponseCompletenessEvaluator`,`AzureOpenAITextSimilarityGrader`, `SimilarityEvaluator`, `F1ScoreEvaluator`, `RougeScoreEvaluator`, `GleuScoreEvaluator`, `BleuScoreEvaluator`, `MeteorScoreEvaluator`, `QAEvaluator` | `IntentResolutionEvaluator`, `ToolCallAccuracyEvaluator`, `TaskAdherenceEvaluator`, `RelevanceEvaluator`, `CoherenceEvaluator`, `FluencyEvaluator`, `CodeVulnerabilitiesEvaluator`, `ViolenceEvaluator`, `Self-harEvaluator`, `SexualEvaluator`, `HateUnfairnessEvaluator`, `IndirectAttackEvaluator`, `ProtectedMaterialsEvaluator`, `ContentSafetyEvaluator` |
 
 > [!NOTE]
 > AI-assisted quality evaluators except for `SimilarityEvaluator` come with a reason field. They employ techniques including chain-of-thought reasoning to generate an explanation for the score. Therefore they consume more token usage in generation as a result of improved evaluation quality. Specifically, `max_token` for evaluator generation has been set to 800 for all AI-assisted evaluators (and 1600 for `RetrievalEvaluator` to accommodate for longer inputs.)
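As a concrete companion to the data-requirements table added above, a single-turn JSONL dataset for the query/response evaluators can be written roughly like this (a sketch; the `query`/`response`/`context`/`ground_truth` keys follow the column semantics in the table, and the file name is illustrative):

```python
import json

# One JSON object per line; include context for groundedness-style evaluators
# and ground_truth for evaluators that require it (see the table above).
rows = [
    {
        "query": "When does the store open?",
        "response": "The store opens at 9 AM.",
        "context": "The store opens at 9 AM and closes at 6 PM on weekdays.",
        "ground_truth": "The store opens at 9 AM and closes at 6 PM on weekdays.",
    },
]

with open("evaluation_data.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```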
