Agents are powerful productivity assistants. They can plan, make decisions, and execute actions. Agents typically first [reason through user intents in conversations](#intent-resolution), [select the correct tools](#tool-call-accuracy) to call and satisfy the user requests, and [complete various tasks](#task-adherence) according to their instructions. We currently support these agent-specific evaluators for agentic workflows:
> We recommend using `o3-mini` for a balance of reasoning capability and cost efficiency.
### Evaluator model support
We support Azure OpenAI or OpenAI [reasoning models](../../../ai-services/openai/how-to/reasoning.md) and non-reasoning models as the LLM judge, depending on the evaluator:
| Evaluators | Reasoning Models as Judge (ex: o-series models from Azure OpenAI / OpenAI) | Non-reasoning models as Judge (ex: gpt-4.1, gpt-4o, etc.) | To enable |
| --- | --- | --- | --- |
| Other quality evaluators | Not Supported | Supported | -- |
For complex evaluation that requires refined reasoning, we recommend a strong reasoning model like `o3-mini`, or the o-series mini models released after it, which balance reasoning performance and cost efficiency.
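To illustrate, here's a minimal sketch of pointing an LLM-judge evaluator at a reasoning-model deployment with the `azure-ai-evaluation` Python SDK. The endpoint, key, and deployment names are placeholders, and the evaluator class is just one example; adapt the configuration to your own resources.

```python
# Minimal sketch (illustrative, not the exact article sample): configure an
# LLM-judge evaluator to use a reasoning-model deployment such as o3-mini.
# Endpoint, key, and deployment names below are placeholders.
import os
from azure.ai.evaluation import IntentResolutionEvaluator

model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "o3-mini",  # reasoning-model deployment used as the judge
}

intent_resolution = IntentResolutionEvaluator(model_config=model_config)
result = intent_resolution(
    query="When do you close on weekends?",
    response="We're open 9 AM to 5 PM on Saturdays and closed on Sundays.",
)
print(result)
```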
## Intent resolution
### Task adherence output
The numerical score is on a Likert scale (integer 1 to 5), and a higher score is better. Given a numerical threshold (default 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
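As an illustration only, here's a sketch of reading that output in code. It assumes the `task_adherence` evaluator instance created earlier in this article, and the output field names shown follow the SDK's usual `<evaluator>_*` naming convention rather than being guaranteed verbatim.

```python
# Illustrative sketch: interpreting the task adherence output.
# Assumes `task_adherence` is the evaluator instance created earlier in this article;
# the exact output field names may differ slightly in your SDK version.
result = task_adherence(
    query="Summarize this support ticket and open a follow-up task.",
    response="I summarized the ticket and created a follow-up task for the support team.",
)

score = result.get("task_adherence")           # Likert score, integer 1 to 5
verdict = result.get("task_adherence_result")  # "pass" if score >= threshold (default 3)
reason = result.get("task_adherence_reason")   # why the score is high or low
print(score, verdict, reason)
```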
`articles/ai-foundry/concepts/evaluation-evaluators/general-purpose-evaluators.md`
author: lgayhardt
ms.author: lagayhar
manager: scottpolly
ms.reviewer: changliu2
ms.date: 06/26/2025
ms.service: azure-ai-foundry
ms.topic: reference
ms.custom:
# General purpose evaluators
AI systems might generate textual responses that are incoherent, or lack the general writing quality you might desire beyond minimum grammatical correctness. To address these issues, we currently support evaluating the following (a usage sketch follows the list):
- [Coherence](#coherence)
- [Fluency](#fluency)
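Here's a minimal sketch of running both evaluators with the `azure-ai-evaluation` Python SDK, assuming `model_config` is a dict that points at your LLM-judge deployment (endpoint, API key, and deployment name):

```python
# Minimal sketch: score a response for coherence and fluency with an LLM judge.
# `model_config` is assumed to hold your Azure OpenAI endpoint, API key, and deployment.
from azure.ai.evaluation import CoherenceEvaluator, FluencyEvaluator

coherence = CoherenceEvaluator(model_config=model_config)
fluency = FluencyEvaluator(model_config=model_config)

query = "What is the capital of France?"
response = "Paris is the capital of France."

print(coherence(query=query, response=response))  # coherence score with reason
print(fluency(response=response))                 # fluency score with reason
```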
If you have a question-answering (QA) scenario with both `context` and `ground truth` data in addition to `query` and `response`, you can also use our [QAEvaluator](#question-answering-composite-evaluator), a composite evaluator that uses the relevant evaluators for judgment.
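For example, here's a minimal sketch of the composite evaluator with the `azure-ai-evaluation` Python SDK, again assuming `model_config` points at your LLM-judge deployment and using illustrative inputs:

```python
# Minimal sketch: QAEvaluator runs several quality evaluators in one call
# when query, response, context, and ground truth are all available.
from azure.ai.evaluation import QAEvaluator

qa_eval = QAEvaluator(model_config=model_config)

result = qa_eval(
    query="Which tent is the most waterproof?",
    response="The Alpine Explorer Tent has the highest rainfly waterproof rating.",
    context="Catalog note: the Alpine Explorer Tent's rainfly is rated at 3000 mm.",
    ground_truth="The Alpine Explorer Tent is the most waterproof tent in the catalog.",
)
print(result)  # one score per underlying evaluator (relevance, groundedness, and so on)
```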
> We recommend using `o3-mini` for a balance of reasoning capability and cost efficiency.
### Evaluator model support
We support Azure OpenAI or OpenAI [reasoning models](../../../ai-services/openai/how-to/reasoning.md) and non-reasoning models as the LLM judge, depending on the evaluator:
| Evaluators | Reasoning Models as Judge (ex: o-series models from Azure OpenAI / OpenAI) | Non-reasoning models as Judge (ex: gpt-4.1, gpt-4o, etc.) | To enable |
| --- | --- | --- | --- |
| Other quality evaluators | Not Supported | Supported | -- |
For complex evaluation that requires refined reasoning, we recommend a strong reasoning model like `o3-mini`, or the o-series mini models released after it, which balance reasoning performance and cost efficiency.
`articles/ai-foundry/concepts/evaluation-evaluators/rag-evaluators.md`
author: lgayhardt
ms.author: lagayhar
manager: scottpolly
ms.reviewer: changliu2
ms.date: 07/15/2025
ms.service: azure-ai-foundry
ms.topic: reference
ms.custom:
# Retrieval-augmented Generation (RAG) evaluators
A retrieval-augmented generation (RAG) system tries to generate the most relevant answer consistent with grounding documents in response to a user's query. At a high level, a user's query triggers a search retrieval in the corpus of grounding documents to provide grounding context for the AI model to generate a response. It's important to evaluate:
- [Document Retrieval](#document-retrieval)
- [Retrieval](#retrieval)
- [Groundedness](#groundedness)
- [Groundedness Pro](#groundedness-pro)
- [Relevance](#relevance)
- [Response Completeness](#response-completeness)
These evaluators address three aspects:
- The relevance of the retrieval results to the user's query: use [Document Retrieval](#document-retrieval) if you have query-specific document relevance labels (query relevance judgments, or qrels) for a more accurate measurement. Use [Retrieval](#retrieval) if you only have the retrieved context without such labels and can tolerate a less fine-grained measurement.
- The consistency of the generated response with respect to the grounding documents: use [Groundedness](#groundedness) if you want the option to customize the definition of groundedness in our open-source LLM-judge prompt, or [Groundedness Pro](#groundedness-pro) if you want a straightforward definition.
- The relevance of the final response to the query: use [Relevance](#relevance) if you don't have ground truth, and [Response Completeness](#response-completeness) if you have ground truth and don't want your response to miss critical information. A minimal usage sketch follows this list.
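To make those choices concrete, here's a minimal sketch with the `azure-ai-evaluation` Python SDK that scores a RAG response for groundedness against its retrieved context and for relevance to the query. It assumes `model_config` points at your LLM-judge deployment, and the inputs are illustrative.

```python
# Minimal sketch: judge a RAG response for groundedness (against the retrieved
# context) and relevance (against the query). Inputs are illustrative only.
from azure.ai.evaluation import GroundednessEvaluator, RelevanceEvaluator

groundedness = GroundednessEvaluator(model_config=model_config)
relevance = RelevanceEvaluator(model_config=model_config)

query = "What is the return policy?"
context = "Policy document: items can be returned within 30 days with a receipt."
response = "You can return items within 30 days as long as you have a receipt."

print(groundedness(query=query, context=context, response=response))
print(relevance(query=query, response=response))
```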
> We recommend using `o3-mini` for a balance of reasoning capability and cost efficiency.
### Evaluator model support
We support Azure OpenAI or OpenAI [reasoning models](../../../ai-services/openai/how-to/reasoning.md) and non-reasoning models as the LLM judge, depending on the evaluator:
| Evaluators | Reasoning Models as Judge (ex: o-series models from Azure OpenAI / OpenAI) | Non-reasoning models as Judge (ex: gpt-4.1, gpt-4o, etc.) | To enable |
| --- | --- | --- | --- |
| Other quality evaluators | Not Supported | Supported | -- |
For complex evaluation that requires refined reasoning, we recommend a strong reasoning model like `o3-mini`, or the o-series mini models released after it, which balance reasoning performance and cost efficiency.