Commit ece2abd

Merge branch 'july_update' of https://github.com/changliu2/azure-ai-docs-pr into evalsdkagentmetrics0725
2 parents: c7023d8 + f1f1aae

5 files changed (+201, -143 lines)


articles/ai-foundry/concepts/evaluation-evaluators/agent-evaluators.md

Lines changed: 15 additions & 5 deletions
@@ -6,7 +6,7 @@ author: lgayhardt
 ms.author: lagayhar
 manager: scottpolly
 ms.reviewer: changliu2
-ms.date: 05/19/2025
+ms.date: 07/15/2025
 ms.service: azure-ai-foundry
 ms.topic: reference
 ms.custom:
@@ -18,7 +18,10 @@ ms.custom:

 [!INCLUDE [feature-preview](../../includes/feature-preview.md)]

-Agents are powerful productivity assistants. They can plan, make decisions, and execute actions. Agents typically first [reason through user intents in conversations](#intent-resolution), [select the correct tools](#tool-call-accuracy) to call and satisfy the user requests, and [complete various tasks](#task-adherence) according to their instructions.
+Agents are powerful productivity assistants. They can plan, make decisions, and execute actions. Agents typically first [reason through user intents in conversations](#intent-resolution), [select the correct tools](#tool-call-accuracy) to call and satisfy user requests, and [complete various tasks](#task-adherence) according to their instructions. We currently support these agent-specific evaluators for agentic workflows:
+- [Intent resolution](#intent-resolution)
+- [Tool call accuracy](#tool-call-accuracy)
+- [Task adherence](#task-adherence)

 ## Evaluating Azure AI agents

@@ -48,8 +51,15 @@ model_config = AzureOpenAIModelConfiguration(
 )
 ```

-> [!TIP]
-> We recommend using `o3-mini` for a balance of reasoning capability and cost efficiency.
+### Evaluator model support
+We support Azure OpenAI or OpenAI [reasoning models](../../../ai-services/openai/how-to/reasoning.md) and non-reasoning models as the LLM judge, depending on the evaluator:
+
+| Evaluators | Reasoning models as judge (for example, o-series models from Azure OpenAI / OpenAI) | Non-reasoning models as judge (for example, gpt-4.1, gpt-4o) | To enable |
+|------------|------------|------------|-----------|
+| `Intent Resolution` / `Task Adherence` / `Tool Call Accuracy` / `Response Completeness` | Supported | Supported | Set the additional parameter `is_reasoning_model=True` when initializing the evaluator |
+| Other quality evaluators | Not supported | Supported | -- |
+
+For complex evaluation that requires refined reasoning, we recommend a strong reasoning model like `o3-mini`, or the o-series mini models released afterward, for a balance of reasoning performance and cost efficiency.

 ## Intent resolution

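For illustration, a minimal sketch of the `is_reasoning_model` flag described in the table above, assuming the preview `azure-ai-evaluation` package and the `model_config` defined earlier; the sample inputs are hypothetical:

```python
# Minimal sketch: initialize an agent evaluator with a reasoning-model judge.
# Assumes model_config points at an o-series deployment such as o3-mini.
from azure.ai.evaluation import IntentResolutionEvaluator

intent_resolution = IntentResolutionEvaluator(
    model_config=model_config,
    is_reasoning_model=True,  # omit for non-reasoning judges such as gpt-4o
)

# Hypothetical sample inputs.
result = intent_resolution(
    query="What are the opening hours of the Eiffel Tower?",
    response="The Eiffel Tower is open from 9:00 AM to 11:00 PM daily.",
)
print(result)  # includes the score, a pass/fail result, and a reason
```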
@@ -174,7 +184,7 @@ task_adherence(

 ### Task adherence output

-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+The numerical score is on a Likert scale (integer 1 to 5), and a higher score is better. Given a numerical threshold (default 3), we also output "pass" if the score >= threshold, or "fail" otherwise. The reason field can help you understand why the score is high or low.

 ```python
 {

articles/ai-foundry/concepts/evaluation-evaluators/general-purpose-evaluators.md

Lines changed: 13 additions & 4 deletions
@@ -6,7 +6,7 @@ author: lgayhardt
 ms.author: lagayhar
 manager: scottpolly
 ms.reviewer: changliu2
-ms.date: 05/19/2025
+ms.date: 06/26/2025
 ms.service: azure-ai-foundry
 ms.topic: reference
 ms.custom:
@@ -16,7 +16,9 @@ ms.custom:

 # General purpose evaluators

-AI systems might generate textual responses that are incoherent, or lack the general writing quality you might desire beyond minimum grammatical correctness. To address these issues, use [Coherence](#coherence) and [Fluency](#fluency).
+AI systems might generate textual responses that are incoherent, or lack the general writing quality you might desire beyond minimum grammatical correctness. To address these issues, we currently support evaluating:
+- [Coherence](#coherence)
+- [Fluency](#fluency)

 If you have a question-answering (QA) scenario with both `context` and `ground truth` data in addition to `query` and `response`, you can also use our [QAEvaluator](#question-answering-composite-evaluator), a composite evaluator that uses relevant evaluators for judgment.

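A minimal sketch of the composite `QAEvaluator` mentioned above, assuming the preview `azure-ai-evaluation` package and a `model_config` set up as in the snippet below; the sample inputs are hypothetical:

```python
# Minimal sketch: the composite QA evaluator runs the relevant quality
# evaluators over a QA pair with context and ground truth.
from azure.ai.evaluation import QAEvaluator

qa_evaluator = QAEvaluator(model_config=model_config)

# Hypothetical sample inputs.
result = qa_evaluator(
    query="Where is Italy located?",
    context="Italy is a peninsula in Southern Europe, bordered by the Mediterranean Sea.",
    response="Italy is located in Southern Europe.",
    ground_truth="Italy is a country in Southern Europe.",
)
print(result)  # one set of scores per underlying evaluator
```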
@@ -38,8 +40,15 @@ model_config = AzureOpenAIModelConfiguration(
 )
 ```

-> [!TIP]
-> We recommend using `o3-mini` for a balance of reasoning capability and cost efficiency.
+### Evaluator model support
+We support Azure OpenAI or OpenAI [reasoning models](../../../ai-services/openai/how-to/reasoning.md) and non-reasoning models as the LLM judge, depending on the evaluator:
+
+| Evaluators | Reasoning models as judge (for example, o-series models from Azure OpenAI / OpenAI) | Non-reasoning models as judge (for example, gpt-4.1, gpt-4o) | To enable |
+|------------|------------|------------|-----------|
+| `Intent Resolution` / `Task Adherence` / `Tool Call Accuracy` / `Response Completeness` | Supported | Supported | Set the additional parameter `is_reasoning_model=True` when initializing the evaluator |
+| Other quality evaluators | Not supported | Supported | -- |
+
+For complex evaluation that requires refined reasoning, we recommend a strong reasoning model like `o3-mini`, or the o-series mini models released afterward, for a balance of reasoning performance and cost efficiency.

 ## Coherence

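Per the table above, Coherence and Fluency fall under "other quality evaluators," so they take a non-reasoning judge and no `is_reasoning_model` flag. A minimal sketch, assuming the preview `azure-ai-evaluation` package and a `model_config` pointing at a non-reasoning deployment such as `gpt-4o`; the sample inputs are hypothetical, and call signatures may vary across preview versions:

```python
# Minimal sketch: quality evaluators with a non-reasoning judge; per the
# table above, is_reasoning_model isn't supported here.
from azure.ai.evaluation import CoherenceEvaluator, FluencyEvaluator

coherence = CoherenceEvaluator(model_config=model_config)
fluency = FluencyEvaluator(model_config=model_config)

# Hypothetical sample inputs.
query = "What is the capital of France?"
response = "Paris is the capital of France."

print(coherence(query=query, response=response))
print(fluency(response=response))  # fluency judges the response text alone
```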
articles/ai-foundry/concepts/evaluation-evaluators/rag-evaluators.md

Lines changed: 20 additions & 4 deletions
@@ -6,7 +6,7 @@ author: lgayhardt
 ms.author: lagayhar
 manager: scottpolly
 ms.reviewer: changliu2
-ms.date: 05/19/2025
+ms.date: 07/15/2025
 ms.service: azure-ai-foundry
 ms.topic: reference
 ms.custom:
@@ -17,7 +17,14 @@ ms.custom:
 # Retrieval-augmented Generation (RAG) evaluators

 A retrieval-augmented generation (RAG) system tries to generate the most relevant answer consistent with grounding documents in response to a user's query. At a high level, a user's query triggers a search retrieval in the corpus of grounding documents to provide grounding context for the AI model to generate a response. It's important to evaluate:
-
+- [Document Retrieval](#document-retrieval)
+- [Retrieval](#retrieval)
+- [Groundedness](#groundedness)
+- [Groundedness Pro](#groundedness-pro)
+- [Relevance](#relevance)
+- [Response Completeness](#response-completeness)
+
+These evaluators address three aspects:
 - The relevance of the retrieval results to the user's query: use [Document Retrieval](#document-retrieval) if you have labels for query-specific document relevance, or query relevance judgments (qrels), for more accurate measurements. Use [Retrieval](#retrieval) if you only have the retrieved context but don't have such labels and have a higher tolerance for a less fine-grained measurement.
 - The consistency of the generated response with respect to the grounding documents: use [Groundedness](#groundedness) if you want to potentially customize the definition of groundedness in our open-source LLM-judge prompt, or [Groundedness Pro](#groundedness-pro) if you want a straightforward definition.
 - The relevance of the final response to the query: use [Relevance](#relevance) if you don't have ground truth, and [Response Completeness](#response-completeness) if you have ground truth and don't want your response to miss critical information.
@@ -42,8 +49,15 @@ model_config = AzureOpenAIModelConfiguration(
 )
 ```

-> [!TIP]
-> We recommend using `o3-mini` for a balance of reasoning capability and cost efficiency.
+### Evaluator model support
+We support Azure OpenAI or OpenAI [reasoning models](../../../ai-services/openai/how-to/reasoning.md) and non-reasoning models as the LLM judge, depending on the evaluator:
+
+| Evaluators | Reasoning models as judge (for example, o-series models from Azure OpenAI / OpenAI) | Non-reasoning models as judge (for example, gpt-4.1, gpt-4o) | To enable |
+|------------|------------|------------|-----------|
+| `Intent Resolution` / `Task Adherence` / `Tool Call Accuracy` / `Response Completeness` | Supported | Supported | Set the additional parameter `is_reasoning_model=True` when initializing the evaluator |
+| Other quality evaluators | Not supported | Supported | -- |
+
+For complex evaluation that requires refined reasoning, we recommend a strong reasoning model like `o3-mini`, or the o-series mini models released afterward, for a balance of reasoning performance and cost efficiency.

 ## Retrieval

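For the RAG-side evaluator that does support a reasoning-model judge, a minimal sketch of `ResponseCompletenessEvaluator` with the `is_reasoning_model` flag from the table above, assuming the preview `azure-ai-evaluation` package; the sample inputs are hypothetical:

```python
# Minimal sketch: Response Completeness supports a reasoning-model judge,
# enabled with is_reasoning_model=True at initialization.
from azure.ai.evaluation import ResponseCompletenessEvaluator

response_completeness = ResponseCompletenessEvaluator(
    model_config=model_config,
    is_reasoning_model=True,  # set only when the judge is an o-series model
)

# Hypothetical sample inputs.
result = response_completeness(
    response="The capital of Japan is Tokyo.",
    ground_truth="Tokyo is the capital of Japan, with a population of about 14 million.",
)
print(result)  # score (1-5), pass/fail against the threshold, and a reason
```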
@@ -148,8 +162,10 @@ retrieved_documents = [
 ]

 document_retrieval_evaluator = DocumentRetrievalEvaluator(
+    # Specify the ground truth label range
     ground_truth_label_min=ground_truth_label_min,
     ground_truth_label_max=ground_truth_label_max,
+    # Optionally override the binarization threshold for pass/fail output
     ndcg_threshold = 0.5,
     xdcg_threshold = 50.0,
     fidelity_threshold = 0.5,
