
Commit f1f1aae

committed: minor updates to address reviewer comments
1 parent efb1e26 · commit f1f1aae

File tree: 4 files changed, +53 −21 lines


articles/ai-foundry/concepts/evaluation-evaluators/agent-evaluators.md

Lines changed: 3 additions & 3 deletions
@@ -6,7 +6,7 @@ author: lgayhardt
 ms.author: lagayhar
 manager: scottpolly
 ms.reviewer: changliu2
-ms.date: 06/26/2025
+ms.date: 07/15/2025
 ms.service: azure-ai-foundry
 ms.topic: reference
 ms.custom:
@@ -18,7 +18,7 @@ ms.custom:
 
 [!INCLUDE [feature-preview](../../includes/feature-preview.md)]
 
-Agents are powerful productivity assistants. They can plan, make decisions, and execute actions. Agents typically first [reason through user intents in conversations](#intent-resolution), [select the correct tools](#tool-call-accuracy) to call and satisfy the user requests, and [complete various tasks](#task-adherence) according to their instructions. We current support evaluating:
+Agents are powerful productivity assistants. They can plan, make decisions, and execute actions. Agents typically first [reason through user intents in conversations](#intent-resolution), [select the correct tools](#tool-call-accuracy) to call and satisfy the user requests, and [complete various tasks](#task-adherence) according to their instructions. We currently support these agent-specific evaluators for agentic workflows:
 - [Intent resolution](#intent-resolution)
 - [Tool call accuracy](#tool-call-accuracy)
 - [Task adherence](#task-adherence)
@@ -184,7 +184,7 @@ task_adherence(
 
 ### Task adherence output
 
-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+The numerical score on a Likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
 
 ```python
 {
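The hunk above truncates the output snippet. For readers of this diff, that output follows the same score/result/threshold/reason pattern used by the other evaluators in this commit; a representative sketch (field values are illustrative, not taken from the article):

```python
{
    "task_adherence": 5,  # integer score on a 1-5 Likert scale; higher is better
    "task_adherence_result": "pass",  # "pass" if score >= threshold, otherwise "fail"
    "task_adherence_threshold": 3,  # default threshold of 3
    "task_adherence_reason": "The response follows the instructions and completes the requested task."
}
```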

articles/ai-foundry/concepts/evaluation-evaluators/rag-evaluators.md

Lines changed: 6 additions & 4 deletions
@@ -6,7 +6,7 @@ author: lgayhardt
 ms.author: lagayhar
 manager: scottpolly
 ms.reviewer: changliu2
-ms.date: 06/26/2025
+ms.date: 07/15/2025
 ms.service: azure-ai-foundry
 ms.topic: reference
 ms.custom:
@@ -25,9 +25,9 @@ A retrieval-augmented generation (RAG) system tries to generate the most relevan
 - [Response Completeness](#response-completeness)
 
 These evaluators address three aspects:
-1. The relevance of the retrieval results to the user's query: use [Document Retrieval](#document-retrieval) if you have labels for query-specific document relevance, or query relevance judgement (qrels) for more accurate measurements. Use [Retrieval](#retrieval) if you only have the retrieved context, but you don't have such labels and have a higher tolerance for a less fine-grained measurement.
-2. The consistency of the generated response with respect to the grounding documents: use [Groundedness](#groundedness) if you want to potentially customize the definition of groundedness in our open-source LLM-judge prompt, [Groundedness Pro](#groundedness-pro) if you want a straightforward definition.
-3. The relevance of the final response to the query: [Relevance](#relevance) if you don't have ground truth, and [Response Completeness](#response-completeness) if you have ground truth and don't want your response to miss critical information.
+- The relevance of the retrieval results to the user's query: use [Document Retrieval](#document-retrieval) if you have labels for query-specific document relevance, or query relevance judgement (qrels) for more accurate measurements. Use [Retrieval](#retrieval) if you only have the retrieved context, but you don't have such labels and have a higher tolerance for a less fine-grained measurement.
+- The consistency of the generated response with respect to the grounding documents: use [Groundedness](#groundedness) if you want to potentially customize the definition of groundedness in our open-source LLM-judge prompt, [Groundedness Pro](#groundedness-pro) if you want a straightforward definition.
+- The relevance of the final response to the query: [Relevance](#relevance) if you don't have ground truth, and [Response Completeness](#response-completeness) if you have ground truth and don't want your response to miss critical information.
 
 A good way to think about **Groundedness** and **Response Completeness** is: groundedness is about the **precision** aspect of the response that it shouldn't contain content outside of the grounding context, whereas response completeness is about the **recall** aspect of the response that it shouldn't miss critical information compared to the expected response (ground truth).
 
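As an editorial aside to the precision/recall framing in the hunk above, here is a minimal sketch using the `azure-ai-evaluation` package (assuming an Azure OpenAI judge configuration in `model_config`; the exact constructor and call parameters may differ across preview versions of the SDK):

```python
import os
from azure.ai.evaluation import GroundednessEvaluator, ResponseCompletenessEvaluator

# Judge model configuration (endpoint/key/deployment assumed to come from environment variables)
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": os.environ["AZURE_OPENAI_DEPLOYMENT"],
}

context = "The store opens at 9 AM and closes at 6 PM on weekdays."

# Precision: is everything in the response supported by the grounding context?
groundedness = GroundednessEvaluator(model_config=model_config)
print(groundedness(query="When does the store open?", context=context, response="The store opens at 9 AM."))

# Recall: does the response cover the critical information in the ground truth?
completeness = ResponseCompletenessEvaluator(model_config=model_config)
print(completeness(response="The store opens at 9 AM.", ground_truth=context))
```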
@@ -162,8 +162,10 @@ retrieved_documents = [
 ]
 
 document_retrieval_evaluator = DocumentRetrievalEvaluator(
+    # Specify the ground truth label range
     ground_truth_label_min=ground_truth_label_min,
     ground_truth_label_max=ground_truth_label_max,
+    # Optionally override the binarization threshold for pass/fail output
     ndcg_threshold = 0.5,
     xdcg_threshold = 50.0,
     fidelity_threshold = 0.5,
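For context beyond this hunk, the constructor above is then called on qrels-style inputs; a rough sketch (the `document_id`, `query_relevance_label`, and `relevance_score` field names and the surrounding variables are assumptions based on the evaluator's documented input shape, not part of this commit):

```python
# Query-specific relevance labels (qrels) within the declared ground truth label range
retrieval_ground_truth = [
    {"document_id": "1", "query_relevance_label": 4},
    {"document_id": "2", "query_relevance_label": 2},
    {"document_id": "3", "query_relevance_label": 0},
]

# Documents returned by the retrieval system, with their retrieval scores
retrieved_documents = [
    {"document_id": "2", "relevance_score": 45.1},
    {"document_id": "1", "relevance_score": 35.8},
    {"document_id": "3", "relevance_score": 29.2},
]

result = document_retrieval_evaluator(
    retrieval_ground_truth=retrieval_ground_truth,
    retrieved_documents=retrieved_documents,
)
print(result)  # NDCG/XDCG/fidelity metrics plus pass/fail fields judged against the thresholds set above
```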

articles/ai-foundry/how-to/develop/agent-evaluate-sdk.md

Lines changed: 6 additions & 6 deletions
@@ -50,7 +50,7 @@ If you use [Foundry Agent Service](../../../ai-services/agents/overview.md), you
 
 
 > [!NOTE]
-> `ToolCallAccuracyEvaluator` only supports Foundry Agent's Function Tool evaluation (user-defined python functions), but doesn't support other Tool evaluation. If an agent run invoked a tool other than Function Tool, it will output a "pass" and a reason that evaluating the invoked tool(s) is not supported.
+> `ToolCallAccuracyEvaluator` only supports Foundry Agent's Function Tool evaluation (user-defined Python functions), but doesn't support other Tool evaluation. If an agent run invoked a tool other than Function Tool, it will output a "pass" and a reason that evaluating the invoked tool(s) is not supported.
 
 
 Here's an example that shows you how to seamlessly build and evaluate an Azure AI agent. Separately from evaluation, Azure AI Foundry Agent Service requires `pip install azure-ai-projects azure-identity`, an Azure AI project connection string, and the supported models.
@@ -97,7 +97,7 @@ AGENT_NAME = "Seattle Tourist Assistant"
 If you are using [Azure AI Foundry (non-Hub) project](../create-projects.md?tabs=ai-foundry&pivots=fdp-project), create an agent with the toolset as follows:
 
 > [!NOTE]
-> If you are using a [Foundry Hub-based project](../create-projects.md?tabs=ai-foundry&pivots=hub-project) (which only supports lower versions of `azure-ai-projects<1.0.0b10 azure-ai-agents<1.0.0b10`), we strongly recommend migrating to [the latest Foundry Agent Service SDK python client library](../../agents/quickstart.md?pivots=programming-language-python-azure) with a [Foundry project set up for loggging batch evaluation results](../../how-to/develop/evaluate-sdk.md#prerequisite-set-up-steps-for-azure-ai-foundry-projects).
+> If you are using a [Foundry Hub-based project](../create-projects.md?tabs=ai-foundry&pivots=hub-project) (which only supports lower versions of `azure-ai-projects<1.0.0b10 azure-ai-agents<1.0.0b10`), we strongly recommend migrating to [the latest Foundry Agent Service SDK Python client library](../../agents/quickstart.md?pivots=programming-language-python-azure) with a [Foundry project set up for logging batch evaluation results](../../how-to/develop/evaluate-sdk.md#prerequisite-set-up-steps-for-azure-ai-foundry-projects).
 
 ```python
 import os
@@ -184,7 +184,7 @@ And that's it! `converted_data` will contain all inputs required for [these eval
 | `Intent Resolution` / `Task Adherence` / `Tool Call Accuracy` / `Response Completeness`) | Supported | Supported | Set additional parameter `is_reasoning_model=True` in initializing evaluators |
 | Other quality evaluators| Not Supported | Supported | -- |
 
-For complex tasks that requires refined reasoning for the evaluation, we recommend a strong reasoning model like `o3-mini` or the o-series mini models released afterwards with a balance of reasoning performance and cost efficiency.
+For complex tasks that require refined reasoning for the evaluation, we recommend a strong reasoning model like `o3-mini` or the o-series mini models released afterwards with a balance of reasoning performance and cost efficiency.
 
 We set up a list of quality and safety evaluator in `quality_evaluators` and `safety_evaluators` and reference them in [evaluating multiples agent runs or a thread](#evaluate-multiple-agent-runs-or-threads).
 
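The `is_reasoning_model=True` flag referenced in the table above is passed when constructing the AI-assisted evaluators; a minimal sketch (assuming `model_config` points at a reasoning-model judge deployment such as `o3-mini`, and that the dictionary mirrors the `quality_evaluators` setup the article describes):

```python
from azure.ai.evaluation import IntentResolutionEvaluator, TaskAdherenceEvaluator, ToolCallAccuracyEvaluator

# The judge deployment is a reasoning model, so enable reasoning-model handling in each evaluator.
quality_evaluators = {
    "intent_resolution": IntentResolutionEvaluator(model_config=model_config, is_reasoning_model=True),
    "task_adherence": TaskAdherenceEvaluator(model_config=model_config, is_reasoning_model=True),
    "tool_call_accuracy": ToolCallAccuracyEvaluator(model_config=model_config, is_reasoning_model=True),
}
```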
@@ -266,7 +266,7 @@ See the following example output for some evaluators:
     "task_adherence_reason": "The response accurately follows the instructions, fetches the correct weather information, and relays it back to the user without any errors or omissions."
 }
 {
-    "tool_call_accuracy": 5, # a score bewteen 1-5, higher is better
+    "tool_call_accuracy": 5, # a score between 1-5, higher is better
     "tool_call_accuracy_result": "pass", # pass because 1.0 > 0.8 the threshold
     "tool_call_accuracy_threshold": 3,
     "details": { ... } # helpful details for debugging the tool calls made by the agent
@@ -439,7 +439,7 @@ See the following output (reference [Output format](#output-format) for details)
 
 ```
 {
-    "tool_call_accuracy": 3, # a score bewteen 1-5, higher is better
+    "tool_call_accuracy": 3, # a score between 1-5, higher is better
     "tool_call_accuracy_result": "fail",
     "tool_call_accuracy_threshold": 4,
     "details": { ... } # helpful details for debugging the tool calls made by the agent
@@ -549,7 +549,7 @@ See the following output (reference [Output format](#output-format) for details)
 
 ```
 {
-    "tool_call_accuracy": 2, # a score bewteen 1-5, higher is better
+    "tool_call_accuracy": 2, # a score between 1-5, higher is better
    "tool_call_accuracy_result": "fail",
    "tool_call_accuracy_threshold": 3,
    "details": { ... } # helpful details for debugging the tool calls made by the agent

articles/ai-foundry/how-to/develop/evaluate-sdk.md

Lines changed: 38 additions & 8 deletions
@@ -6,7 +6,7 @@ author: lgayhardt
 ms.author: lagayhar
 manager: scottpolly
 ms.reviewer: minthigpen
-ms.date: 05/19/2025
+ms.date: 07/15/2025
 ms.service: azure-ai-foundry
 ms.topic: how-to
 ms.custom:
@@ -51,14 +51,44 @@ Built-in quality and safety metrics take in query and response pairs, along with
 ### Data requirements for built-in evaluators
 
 Built-in evaluators can accept query and response pairs, a list of conversations in JSON Lines (JSONL) format, or both.
+| Evaluator | Conversation & single-turn support for text | Conversation & single-turn support for text and image | Single-turn support for text only | Requires `ground_truth` | Supports [agent inputs](./agent-evaluate-sdk.md#agent-messages) |
+|-----------|---------------------------------------------|-------------------------------------------------------|-----------------------------------|---------------------|----------------------|
+| **Quality Evaluators** |
+| `IntentResolutionEvaluator` | | | | | ✅ |
+| `ToolCallAccuracyEvaluator` | | | | | ✅ |
+| `TaskAdherenceEvaluator` | | | | | ✅ |
+| `GroundednessEvaluator` | ✅ | | | | |
+| `GroundednessProEvaluator` | ✅ | | | | |
+| `RetrievalEvaluator` | ✅ | | | | |
+| `DocumentRetrievalEvaluator` | ✅ | | | ✅ | |
+| `RelevanceEvaluator` | ✅ | | | | ✅ |
+| `CoherenceEvaluator` | ✅ | | | | ✅ |
+| `FluencyEvaluator` | ✅ | | | | ✅ |
+| `ResponseCompletenessEvaluator` | ✅ | | ✅ | ✅ | |
+| `QAEvaluator` | | | ✅ | ✅ | |
+| **NLP Evaluators** |
+| `SimilarityEvaluator` | | | ✅ | ✅ | |
+| `F1ScoreEvaluator` | | | ✅ | ✅ | |
+| `RougeScoreEvaluator` | | | ✅ | ✅ | |
+| `GleuScoreEvaluator` | | | ✅ | ✅ | |
+| `BleuScoreEvaluator` | | | ✅ | ✅ | |
+| `MeteorScoreEvaluator` | | | ✅ | ✅ | |
+| **Safety Evaluators** |
+| `ViolenceEvaluator` | | ✅ | | | ✅ |
+| `SexualEvaluator` | | ✅ | | | ✅ |
+| `SelfHarmEvaluator` | | ✅ | | | ✅ |
+| `HateUnfairnessEvaluator` | | ✅ | | | ✅ |
+| `ProtectedMaterialEvaluator` | | ✅ | | | ✅ |
+| `ContentSafetyEvaluator` | | ✅ | | | ✅ |
+| `UngroundedAttributesEvaluator` | | | ✅ | | |
+| `CodeVulnerabilityEvaluator` | | | ✅ | | ✅ |
+| `IndirectAttackEvaluator` | ✅ | | | | ✅ |
+| **Azure OpenAI Graders** |
+| `AzureOpenAILabelGrader` | ✅ | | | | |
+| `AzureOpenAIStringCheckGrader` | ✅ | | | | |
+| `AzureOpenAITextSimilarityGrader` | ✅ | | | ✅ | |
+| `AzureOpenAIGrader` | ✅ | | | | |
 
-| Conversation *and* single-turn support for text | Conversation *and* single-turn support for text and image | Single-turn support for text only |
-|--------------------|------------------------------|---------------|
-| `GroundednessEvaluator`, `GroundednessProEvaluator`, `RetrievalEvaluator`, `DocumentRetrievalEvaluator`, `RelevanceEvaluator`, `CoherenceEvaluator`, `FluencyEvaluator`, `ResponseCompletenessEvaluator`, `IndirectAttackEvaluator`, `AzureOpenAILabelGrader`, `AzureOpenAIStringCheckGrader`, `AzureOpenAITextSimilarityGrader`, `AzureOpenAIGrader` | `ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`, `ProtectedMaterialEvaluator`, `ContentSafetyEvaluator` | `UngroundedAttributesEvaluator`, `CodeVulnerabilityEvaluator`, `ResponseCompletenessEvaluator`, `SimilarityEvaluator`, `F1ScoreEvaluator`, `RougeScoreEvaluator`, `GleuScoreEvaluator`, `BleuScoreEvaluator`, `MeteorScoreEvaluator`, `QAEvaluator` |
-
-| Evaluators requiring `ground_truth` | Evaluators supporting agent inputs |
-|--------------------|------------------------------|
-| `DocumentRetrievalEvaluator`, `ResponseCompletenessEvaluator`,`AzureOpenAITextSimilarityGrader`, `SimilarityEvaluator`, `F1ScoreEvaluator`, `RougeScoreEvaluator`, `GleuScoreEvaluator`, `BleuScoreEvaluator`, `MeteorScoreEvaluator`, `QAEvaluator` | `IntentResolutionEvaluator`, `ToolCallAccuracyEvaluator`, `TaskAdherenceEvaluator`, `RelevanceEvaluator`, `CoherenceEvaluator`, `FluencyEvaluator`, `CodeVulnerabilitiesEvaluator`, `ViolenceEvaluator`, `Self-harEvaluator`, `SexualEvaluator`, `HateUnfairnessEvaluator`, `IndirectAttackEvaluator`, `ProtectedMaterialsEvaluator`, `ContentSafetyEvaluator` |
 
 > [!NOTE]
 > AI-assisted quality evaluators except for `SimilarityEvaluator` come with a reason field. They employ techniques including chain-of-thought reasoning to generate an explanation for the score. Therefore they consume more token usage in generation as a result of improved evaluation quality. Specifically, `max_token` for evaluator generation has been set to 800 for all AI-assisted evaluators (and 1600 for `RetrievalEvaluator` to accommodate for longer inputs.)
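As a concrete companion to the data-requirements table added above, a single-turn JSONL dataset for the query/response evaluators can be written roughly like this (a sketch; the `query`/`response`/`context`/`ground_truth` keys follow the column semantics in the table, and the file name is illustrative):

```python
import json

# One JSON object per line; include context for groundedness-style evaluators
# and ground_truth for evaluators that require it (see the table above).
rows = [
    {
        "query": "When does the store open?",
        "response": "The store opens at 9 AM.",
        "context": "The store opens at 9 AM and closes at 6 PM on weekdays.",
        "ground_truth": "The store opens at 9 AM and closes at 6 PM on weekdays.",
    },
]

with open("evaluation_data.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```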
