-Agents are powerful productivity assistants. They can plan, make decisions, and execute actions. Agents typically first [reason through user intents in conversations](#intent-resolution), [select the correct tools](#tool-call-accuracy) to call and satisfy the user requests, and [complete various tasks](#task-adherence) according to their instructions. We current support evaluating:
+Agents are powerful productivity assistants. They can plan, make decisions, and execute actions. Agents typically first [reason through user intents in conversations](#intent-resolution), [select the correct tools](#tool-call-accuracy) to call and satisfy the user requests, and [complete various tasks](#task-adherence) according to their instructions. We currently support these agent-specific evaluators for agentic workflows:

- [Intent resolution](#intent-resolution)
- [Tool call accuracy](#tool-call-accuracy)
- [Task adherence](#task-adherence)
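For orientation, here is a minimal sketch of instantiating and calling these evaluators with the `azure-ai-evaluation` package; the endpoint, deployment, and query/response values are placeholders, not values from this article:

```python
# A minimal sketch, not this article's exact setup: replace the placeholder
# model configuration values with your own Azure OpenAI details.
from azure.ai.evaluation import (
    IntentResolutionEvaluator,
    TaskAdherenceEvaluator,
    ToolCallAccuracyEvaluator,
)

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",  # placeholder
    "api_key": "<your-api-key>",                                   # placeholder
    "azure_deployment": "<your-deployment>",                       # placeholder
}

# Each evaluator is an LLM judge driven by the model in model_config.
intent_resolution = IntentResolutionEvaluator(model_config=model_config)
tool_call_accuracy = ToolCallAccuracyEvaluator(model_config=model_config)
task_adherence = TaskAdherenceEvaluator(model_config=model_config)

# Single-turn example with an invented query/response pair.
result = intent_resolution(
    query="What are the opening hours of the Eiffel Tower?",
    response="The Eiffel Tower is open from 9:00 AM to 11:00 PM daily.",
)
print(result)  # includes a score, a pass/fail result, and a reason field
```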
@@ -184,7 +184,7 @@ task_adherence(
### Task adherence output
-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+The numerical score on a Likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
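As a sketch of that score/threshold contract (assuming the evaluator accepts a `threshold` argument and returns `task_adherence`, `task_adherence_result`, and `task_adherence_reason` keys, as in the example outputs shown below):

```python
# Sketch of the pass/fail contract: the score is a 1-5 Likert integer and the
# verdict compares it against the threshold (default 3). Sample data invented.
task_adherence = TaskAdherenceEvaluator(model_config=model_config, threshold=3)

result = task_adherence(
    query="Fetch today's weather for Seattle and summarize it briefly.",
    response="Today's weather in Seattle is 60 degrees and cloudy.",
)

score = result["task_adherence"]           # integer 1 to 5, higher is better
verdict = result["task_adherence_result"]  # "pass" if score >= threshold, else "fail"
print(score, verdict, result["task_adherence_reason"])
```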
articles/ai-foundry/concepts/evaluation-evaluators/rag-evaluators.md (+6 -4)
@@ -6,7 +6,7 @@ author: lgayhardt
ms.author: lagayhar
manager: scottpolly
ms.reviewer: changliu2
-ms.date: 06/26/2025
+ms.date: 07/15/2025
ms.service: azure-ai-foundry
ms.topic: reference
ms.custom:
@@ -25,9 +25,9 @@ A retrieval-augmented generation (RAG) system tries to generate the most relevan
- [Response Completeness](#response-completeness)

These evaluators address three aspects:
-1. The relevance of the retrieval results to the user's query: use [Document Retrieval](#document-retrieval) if you have labels for query-specific document relevance, or query relevance judgement (qrels) for more accurate measurements. Use [Retrieval](#retrieval) if you only have the retrieved context, but you don't have such labels and have a higher tolerance for a less fine-grained measurement.
-2. The consistency of the generated response with respect to the grounding documents: use [Groundedness](#groundedness) if you want to potentially customize the definition of groundedness in our open-source LLM-judge prompt, [Groundedness Pro](#groundedness-pro) if you want a straightforward definition.
-3. The relevance of the final response to the query: [Relevance](#relevance) if you don't have ground truth, and [Response Completeness](#response-completeness) if you have ground truth and don't want your response to miss critical information.
+- The relevance of the retrieval results to the user's query: use [Document Retrieval](#document-retrieval) if you have labels for query-specific document relevance, or query relevance judgement (qrels) for more accurate measurements. Use [Retrieval](#retrieval) if you only have the retrieved context, but you don't have such labels and have a higher tolerance for a less fine-grained measurement.
+- The consistency of the generated response with respect to the grounding documents: use [Groundedness](#groundedness) if you want to potentially customize the definition of groundedness in our open-source LLM-judge prompt, [Groundedness Pro](#groundedness-pro) if you want a straightforward definition.
+- The relevance of the final response to the query: [Relevance](#relevance) if you don't have ground truth, and [Response Completeness](#response-completeness) if you have ground truth and don't want your response to miss critical information.

A good way to think about **Groundedness** and **Response Completeness** is: groundedness covers the **precision** aspect of the response, in that it shouldn't contain content outside of the grounding context, whereas response completeness covers the **recall** aspect, in that it shouldn't miss critical information compared to the expected response (ground truth).
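That precision/recall framing can be made concrete with a short sketch (assuming `model_config` is defined as in the earlier sketch; the policy text and responses are invented for illustration):

```python
from azure.ai.evaluation import GroundednessEvaluator, ResponseCompletenessEvaluator

groundedness = GroundednessEvaluator(model_config=model_config)
completeness = ResponseCompletenessEvaluator(model_config=model_config)

# Invented sample data for illustration.
context = "Store policy: items can be returned within 30 days with a receipt."
ground_truth = "Returns are accepted within 30 days of purchase with a receipt."
response = "You can return items within 30 days if you have the receipt."

# Precision aspect: does the response stay inside the grounding context?
print(groundedness(query="What is the return policy?", response=response, context=context))

# Recall aspect: does the response cover everything in the ground truth?
print(completeness(response=response, ground_truth=ground_truth))
```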
articles/ai-foundry/how-to/develop/agent-evaluate-sdk.md (+6 -6)
@@ -50,7 +50,7 @@ If you use [Foundry Agent Service](../../../ai-services/agents/overview.md), you
> [!NOTE]
-> `ToolCallAccuracyEvaluator` only supports Foundry Agent's Function Tool evaluation (user-defined python functions), but doesn't support other Tool evaluation. If an agent run invoked a tool other than Function Tool, it will output a "pass" and a reason that evaluating the invoked tool(s) is not supported.
+> `ToolCallAccuracyEvaluator` only supports Foundry Agent's Function Tool evaluation (user-defined Python functions), but doesn't support other Tool evaluation. If an agent run invoked a tool other than Function Tool, it will output a "pass" and a reason that evaluating the invoked tool(s) is not supported.
Here's an example that shows you how to seamlessly build and evaluate an Azure AI agent. Separately from evaluation, Azure AI Foundry Agent Service requires `pip install azure-ai-projects azure-identity`, an Azure AI project connection string, and the supported models.
If you are using an [Azure AI Foundry (non-Hub) project](../create-projects.md?tabs=ai-foundry&pivots=fdp-project), create an agent with the toolset as follows:
> [!NOTE]
-> If you are using a [Foundry Hub-based project](../create-projects.md?tabs=ai-foundry&pivots=hub-project) (which only supports lower versions of `azure-ai-projects<1.0.0b10 azure-ai-agents<1.0.0b10`), we strongly recommend migrating to [the latest Foundry Agent Service SDK python client library](../../agents/quickstart.md?pivots=programming-language-python-azure) with a [Foundry project set up for loggging batch evaluation results](../../how-to/develop/evaluate-sdk.md#prerequisite-set-up-steps-for-azure-ai-foundry-projects).
+> If you are using a [Foundry Hub-based project](../create-projects.md?tabs=ai-foundry&pivots=hub-project) (which only supports lower versions of `azure-ai-projects<1.0.0b10 azure-ai-agents<1.0.0b10`), we strongly recommend migrating to [the latest Foundry Agent Service SDK Python client library](../../agents/quickstart.md?pivots=programming-language-python-azure) with a [Foundry project set up for logging batch evaluation results](../../how-to/develop/evaluate-sdk.md#prerequisite-set-up-steps-for-azure-ai-foundry-projects).
```python
import os
@@ -184,7 +184,7 @@ And that's it! `converted_data` will contain all inputs required for [these eval
| Other quality evaluators | Not Supported | Supported | -- |
-For complex tasks that requires refined reasoning for the evaluation, we recommend a strong reasoning model like `o3-mini` or the o-series mini models released afterwards with a balance of reasoning performance and cost efficiency.
+For complex tasks that require refined reasoning for the evaluation, we recommend a strong reasoning model like `o3-mini` or the o-series mini models released afterwards with a balance of reasoning performance and cost efficiency.

We set up a list of quality and safety evaluators in `quality_evaluators` and `safety_evaluators` and reference them in [evaluating multiple agent runs or a thread](#evaluate-multiple-agent-runs-or-threads), as sketched below.
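A sketch of what that setup might look like (the evaluator selection and project details are illustrative; `model_config` is assumed from earlier, and the `azure_ai_project` values are placeholders):

```python
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import (
    IntentResolutionEvaluator,
    TaskAdherenceEvaluator,
    ToolCallAccuracyEvaluator,
    ViolenceEvaluator,
)

credential = DefaultAzureCredential()
azure_ai_project = {
    "subscription_id": "<subscription-id>",      # placeholder
    "resource_group_name": "<resource-group>",   # placeholder
    "project_name": "<project-name>",            # placeholder
}

# AI-assisted quality evaluators are driven by the judge model in model_config.
quality_evaluators = {
    "intent_resolution": IntentResolutionEvaluator(model_config=model_config),
    "task_adherence": TaskAdherenceEvaluator(model_config=model_config),
    "tool_call_accuracy": ToolCallAccuracyEvaluator(model_config=model_config),
}

# Safety evaluators call the Azure AI Foundry evaluation service, so they take
# the project and a credential instead of a model configuration.
safety_evaluators = {
    "violence": ViolenceEvaluator(azure_ai_project=azure_ai_project, credential=credential),
}
```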
@@ -266,7 +266,7 @@ See the following example output for some evaluators:
    "task_adherence_reason": "The response accurately follows the instructions, fetches the correct weather information, and relays it back to the user without any errors or omissions."
}
{
-    "tool_call_accuracy": 5, # a score bewteen 1-5, higher is better
+    "tool_call_accuracy": 5, # a score between 1-5, higher is better
    "tool_call_accuracy_result": "pass", # pass because the score 5 >= the threshold 3
    "tool_call_accuracy_threshold": 3,
    "details": { ... } # helpful details for debugging the tool calls made by the agent
@@ -439,7 +439,7 @@ See the following output (reference [Output format](#output-format) for details)
```
{
-    "tool_call_accuracy": 3, # a score bewteen 1-5, higher is better
+    "tool_call_accuracy": 3, # a score between 1-5, higher is better
    "tool_call_accuracy_result": "fail",
    "tool_call_accuracy_threshold": 4,
    "details": { ... } # helpful details for debugging the tool calls made by the agent
@@ -549,7 +549,7 @@ See the following output (reference [Output format](#output-format) for details)
```
{
-    "tool_call_accuracy": 2, # a score bewteen 1-5, higher is better
+    "tool_call_accuracy": 2, # a score between 1-5, higher is better
    "tool_call_accuracy_result": "fail",
    "tool_call_accuracy_threshold": 3,
    "details": { ... } # helpful details for debugging the tool calls made by the agent
articles/ai-foundry/how-to/develop/evaluate-sdk.md (+38 -8)
@@ -6,7 +6,7 @@ author: lgayhardt
ms.author: lagayhar
manager: scottpolly
ms.reviewer: minthigpen
-ms.date: 05/19/2025
+ms.date: 07/15/2025
ms.service: azure-ai-foundry
ms.topic: how-to
ms.custom:
@@ -51,14 +51,44 @@ Built-in quality and safety metrics take in query and response pairs, along with
### Data requirements for built-in evaluators

Built-in evaluators can accept query and response pairs, a list of conversations in JSON Lines (JSONL) format, or both.

+| Evaluator | Conversation & single-turn support for text | Conversation & single-turn support for text and image | Single-turn support for text only | Requires `ground_truth` | Supports [agent inputs](./agent-evaluate-sdk.md#agent-messages) |
> AI-assisted quality evaluators except for `SimilarityEvaluator` come with a reason field. They employ techniques including chain-of-thought reasoning to generate an explanation for the score, and therefore consume more tokens in generation as a result of the improved evaluation quality. Specifically, `max_token` for evaluator generation is set to 800 for all AI-assisted evaluators (and 1,600 for `RetrievalEvaluator` to accommodate longer inputs).
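To illustrate the two input shapes named in the data requirements above, here is a sketch of calling one evaluator in single-turn and conversation modes (sample content is invented; per-turn aggregation into a mean score is an assumption based on the SDK's documented conversation handling):

```python
from azure.ai.evaluation import GroundednessEvaluator

groundedness = GroundednessEvaluator(model_config=model_config)

# Single-turn: pass the fields directly.
single_turn_result = groundedness(
    query="Which tent is the most waterproof?",
    response="The Alpine Explorer Tent is the most waterproof.",
    context="Catalog: the Alpine Explorer Tent has a 3000 mm waterproof rating.",
)

# Conversation: pass a dict with a list of messages; results are computed
# per turn and aggregated across the conversation.
conversation = {
    "messages": [
        {"role": "user", "content": "Which tent is the most waterproof?"},
        {
            "role": "assistant",
            "content": "The Alpine Explorer Tent is the most waterproof.",
            "context": "Catalog: the Alpine Explorer Tent has a 3000 mm waterproof rating.",
        },
    ]
}
conversation_result = groundedness(conversation=conversation)
```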