
Commit e812913

July eval sdk, agents and other evaluator updates

1 parent ece2abd commit e812913

File tree: 6 files changed, +77 -67 lines changed

articles/ai-foundry/concepts/evaluation-evaluators/agent-evaluators.md

Lines changed: 7 additions & 4 deletions
@@ -19,6 +19,7 @@ ms.custom:
 [!INCLUDE [feature-preview](../../includes/feature-preview.md)]

 Agents are powerful productivity assistants. They can plan, make decisions, and execute actions. Agents typically first [reason through user intents in conversations](#intent-resolution), [select the correct tools](#tool-call-accuracy) to call and satisfy the user requests, and [complete various tasks](#task-adherence) according to their instructions. We currently support these agent-specific evaluators for agentic workflows:
+
 - [Intent resolution](#intent-resolution)
 - [Tool call accuracy](#tool-call-accuracy)
 - [Task adherence](#task-adherence)
@@ -27,11 +28,12 @@ Agents are powerful productivity assistants. They can plan, make decisions, and

 Agents emit messages, and providing the above inputs typically require parsing messages and extracting the relevant information. If you're building agents using Azure AI Agent Service, we provide native integration for evaluation that directly takes their agent messages. To learn more, see an [end-to-end example of evaluating agents in Azure AI Agent Service](https://aka.ms/e2e-agent-eval-sample).

-Besides `IntentResolution`, `ToolCallAccuracy`, `TaskAdherence` specific to agentic workflows, you can also assess other quality as well as safety aspects of your agentic workflows, leveraging out comprehensive suite of built-in evaluators. We support this list of evaluators for Azure AI agent messages from our converter:
+Besides `IntentResolution`, `ToolCallAccuracy`, `TaskAdherence` specific to agentic workflows, you can also assess other quality and safety aspects of your agentic workflows, using our comprehensive suite of built-in evaluators. We support this list of evaluators for Azure AI agent messages from our converter:
+
 - **Quality**: `IntentResolution`, `ToolCallAccuracy`, `TaskAdherence`, `Relevance`, `Coherence`, `Fluency`
 - **Safety**: `CodeVulnerabilities`, `Violence`, `Self-harm`, `Sexual`, `HateUnfairness`, `IndirectAttack`, `ProtectedMaterials`.

-We will show examples of `IntentResolution`, `ToolCallAccuracy`, `TaskAdherence` here. See more examples in [evaluating Azure AI agents](../../how-to/develop/agent-evaluate-sdk.md#evaluate-azure-ai-agents) for other evaluators with Azure AI agent message support.
+In this article we show examples of `IntentResolution`, `ToolCallAccuracy`, and `TaskAdherence`. For examples of using other evaluators with Azure AI agent messages, see [evaluating Azure AI agents](../../how-to/develop/agent-evaluate-sdk.md#evaluate-azure-ai-agents).

 ## Model configuration for AI-assisted evaluators

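A minimal sketch of the converter flow this hunk describes, assuming the `azure-ai-evaluation` preview SDK; the `AIAgentConverter` class, its constructor argument, and the `convert` parameters follow the linked end-to-end sample and may differ across SDK versions:

```python
# Sketch only: turn Azure AI Agent Service thread messages into evaluator-ready input.
# Assumes an existing project client and a completed agent run (thread_id, run_id).
from azure.ai.evaluation import AIAgentConverter, IntentResolutionEvaluator

converter = AIAgentConverter(project_client)  # project_client: azure.ai.projects.AIProjectClient (assumed)
converted_data = converter.convert(thread_id=thread_id, run_id=run_id)

# Any evaluator with agent-message support can then consume the converted payload.
intent_resolution = IntentResolutionEvaluator(model_config=model_config)
print(intent_resolution(**converted_data))
```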
@@ -52,6 +54,7 @@ model_config = AzureOpenAIModelConfiguration(
 ```

 ### Evaluator model support
+
 We support AzureOpenAI or OpenAI [reasoning models](../../../ai-services/openai/how-to/reasoning.md) and non-reasoning models for the LLM-judge depending on the evaluators:

 | Evaluators | Reasoning Models as Judge (ex: o-series models from Azure OpenAI / OpenAI) | Non-reasoning models as Judge (ex: gpt-4.1, gpt-4o, etc.) | To enable |
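Because the next hunk starts midway through an `intent_resolution(` call, here's a minimal sketch of how that evaluator is typically wired to the model configuration above; it assumes the `azure-ai-evaluation` package, and the `threshold` argument and sample query/response are illustrative:

```python
# Sketch: instantiate an AI-assisted evaluator with the AzureOpenAI judge configured above.
import os
from azure.ai.evaluation import AzureOpenAIModelConfiguration, IntentResolutionEvaluator

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_ENDPOINT"],
    api_key=os.environ.get("AZURE_API_KEY"),
    azure_deployment=os.environ.get("AZURE_DEPLOYMENT_NAME"),
    api_version=os.environ.get("AZURE_API_VERSION"),
)

intent_resolution = IntentResolutionEvaluator(model_config=model_config, threshold=3)
result = intent_resolution(
    query="What are the opening hours of the Eiffel Tower?",
    response="Opening hours of the Eiffel Tower are 9:00 AM to 11:00 PM.",
)
print(result)
```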
@@ -80,7 +83,7 @@ intent_resolution(

 ### Intent resolution output

-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason and additional fields can help you understand why the score is high or low.
+The numerical score on a Likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason and additional fields can help you understand why the score is high or low.

 ```python
 {
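The pass/fail logic described in this hunk is a plain threshold comparison; the sketch below illustrates it against an output shaped like the snippet the diff truncates (the key names follow the `<metric>`, `<metric>_result`, `<metric>_threshold`, `<metric>_reason` pattern and are assumptions here):

```python
# Sketch: deriving the "pass"/"fail" label from the Likert score and threshold.
output = {
    "intent_resolution": 5.0,                  # integer-valued Likert score, 1 to 5
    "intent_resolution_result": "pass",
    "intent_resolution_threshold": 3,          # default threshold
    "intent_resolution_reason": "The response directly resolves the user's intent ...",
}

label = "pass" if output["intent_resolution"] >= output["intent_resolution_threshold"] else "fail"
assert label == output["intent_resolution_result"]
```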
@@ -108,7 +111,7 @@ If you're building agents outside of Azure AI Agent Serice, this evaluator accep
 `ToolCallAccuracyEvaluator` measures an agent's ability to select appropriate tools, extract, and process correct parameters from previous steps of the agentic workflow. It detects whether each tool call made is accurate (binary) and reports back the average scores, which can be interpreted as a passing rate across tool calls made.

 > [!NOTE]
-> `ToolCallAccuracyEvaluator` only supports Azure AI Agent's Function Tool evaluation, but does not support Built-in Tool evaluation. The agent messages must have at least one Function Tool actually called to be evaluated.
+> `ToolCallAccuracyEvaluator` only supports Azure AI Agent's Function Tool evaluation, but doesn't support Built-in Tool evaluation. The agent messages must have at least one Function Tool actually called to be evaluated.

 ### Tool call accuracy example

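For agents built outside Azure AI Agent Service, a minimal sketch of calling `ToolCallAccuracyEvaluator` with explicit tool calls and definitions, assuming the `azure-ai-evaluation` package (the tool shapes shown are illustrative and `model_config` comes from the configuration section above):

```python
# Sketch: score whether the agent picked the right function tool with the right arguments.
from azure.ai.evaluation import ToolCallAccuracyEvaluator

tool_call_accuracy = ToolCallAccuracyEvaluator(model_config=model_config)
result = tool_call_accuracy(
    query="How is the weather in Seattle?",
    tool_calls=[{
        "type": "tool_call",
        "tool_call_id": "call_1",
        "name": "fetch_weather",
        "arguments": {"location": "Seattle"},
    }],
    tool_definitions=[{
        "name": "fetch_weather",
        "description": "Fetches the weather information for the specified location.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string", "description": "City name."}},
        },
    }],
)
print(result)  # per-call binary accuracy averaged into a passing rate
```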
articles/ai-foundry/concepts/evaluation-evaluators/azure-openai-graders.md

Lines changed: 2 additions & 2 deletions
@@ -6,7 +6,7 @@ author: lgayhardt
 ms.author: lagayhar
 manager: scottpolly
 ms.reviewer: mithigpe
-ms.date: 05/19/2025
+ms.date: 07/16/2025
 ms.service: azure-ai-foundry
 ms.topic: reference
 ms.custom:
@@ -47,7 +47,7 @@ model_config = AzureOpenAIModelConfiguration(

 Here's an example `data.jsonl` that is used in the following code snippets:

-```json
+```jsonl
 [
   {
     "query": "What is the importance of choosing the right provider in getting the most value out of your health insurance plan?",

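A minimal sketch of how a dataset file like the `data.jsonl` above is typically fed to a grader through the SDK's `evaluate()` entry point; it assumes the `azure-ai-evaluation` package, and the grader instance named `label_grader` is illustrative rather than taken from this article:

```python
# Sketch: run a configured grader over every record in data.jsonl.
from azure.ai.evaluation import evaluate

results = evaluate(
    data="data.jsonl",                         # the dataset shown above
    evaluators={"label_grader": label_grader}, # label_grader: an already-configured grader (assumed)
)
print(results["metrics"])                      # aggregate metrics across all rows
```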
articles/ai-foundry/concepts/evaluation-evaluators/general-purpose-evaluators.md

Lines changed: 8 additions & 6 deletions
@@ -6,7 +6,7 @@ author: lgayhardt
 ms.author: lagayhar
 manager: scottpolly
 ms.reviewer: changliu2
-ms.date: 06/26/2025
+ms.date: 07/16/2025
 ms.service: azure-ai-foundry
 ms.topic: reference
 ms.custom:
@@ -16,7 +16,8 @@ ms.custom:

 # General purpose evaluators

-AI systems might generate textual responses that are incoherent, or lack the general writing quality you might desire beyond minimum grammatical correctness. To address these issues, we current support evaluating:
+AI systems might generate textual responses that are incoherent, or lack the general writing quality you might desire beyond minimum grammatical correctness. To address these issues, we support evaluating:
+
 - [Coherence](#coherence)
 - [Fluency](#fluency)

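A short sketch of the two evaluators this hunk lists, assuming the `azure-ai-evaluation` package; the `threshold` argument and input fields are illustrative and may vary by SDK version (recent versions score fluency from the response alone):

```python
# Sketch: coherence and fluency checks on a single response.
from azure.ai.evaluation import CoherenceEvaluator, FluencyEvaluator

coherence = CoherenceEvaluator(model_config=model_config, threshold=3)
fluency = FluencyEvaluator(model_config=model_config, threshold=3)

print(coherence(
    query="Where was Marie Curie born?",
    response="Marie Curie was born in Warsaw, and later moved to Paris to study at the Sorbonne.",
))
print(fluency(
    response="Marie Curie was born in Warsaw, and later moved to Paris to study at the Sorbonne.",
))
```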
@@ -34,13 +35,14 @@ load_dotenv()

 model_config = AzureOpenAIModelConfiguration(
     azure_endpoint=os.environ["AZURE_ENDPOINT"],
-    api_key=os.environ.get["AZURE_API_KEY"],
+    api_key=os.environ.get("AZURE_API_KEY"),
     azure_deployment=os.environ.get("AZURE_DEPLOYMENT_NAME"),
     api_version=os.environ.get("AZURE_API_VERSION"),
 )
 ```

 ### Evaluator model support
+
 We support AzureOpenAI or OpenAI [reasoning models](../../../ai-services/openai/how-to/reasoning.md) and non-reasoning models for the LLM-judge depending on the evaluators:

 | Evaluators | Reasoning Models as Judge (ex: o-series models from Azure OpenAI / OpenAI) | Non-reasoning models as Judge (ex: gpt-4.1, gpt-4o, etc.) | To enable |
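The `api_key` change in this hunk fixes a runtime bug rather than style: `os.environ.get` is a method, so subscripting it with square brackets raises a `TypeError`. A quick illustration:

```python
import os

os.environ.setdefault("AZURE_API_KEY", "placeholder")

# Correct: call the method; returns None (or a supplied default) when the variable is unset.
api_key = os.environ.get("AZURE_API_KEY")

# Incorrect (the removed line): square brackets try to subscript the bound method itself,
# raising a TypeError before any request is made.
# api_key = os.environ.get["AZURE_API_KEY"]
```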
@@ -68,7 +70,7 @@ coherence(

 ### Coherence output

-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+The numerical score on a Likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.

 ```python
 {
@@ -97,7 +99,7 @@ fluency(

 ### Fluency output

-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+The numerical score on a Likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.

 ```python
 {
@@ -136,7 +138,7 @@ qa_eval(

 ### QA output

-While F1 score outputs a numerical score on 0-1 float scale, the other evaluators output numerical scores on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+While F1 score outputs a numerical score on 0-1 float scale, the other evaluators output numerical scores on a Likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.

 ```python
 {
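For reference, a minimal sketch of the composite `qa_eval(` call whose output this hunk documents, assuming the `azure-ai-evaluation` package; the input values are illustrative and `model_config` comes from the earlier configuration section:

```python
# Sketch: QAEvaluator runs several metrics at once; F1 is on a 0-1 float scale,
# while the LLM-judged metrics use the 1-5 Likert scale described above.
from azure.ai.evaluation import QAEvaluator

qa_eval = QAEvaluator(model_config=model_config)
result = qa_eval(
    query="Where was Marie Curie born?",
    context="Marie Curie was a Polish-born physicist and chemist who later moved to Paris.",
    response="Marie Curie was born in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw, Poland.",
)
print(result)
```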
