
Commit 59725f4

Freshness, in progress.
1 parent ff4df10 commit 59725f4

File tree: 1 file changed (+25, -23 lines)

articles/ai-foundry/how-to/develop/evaluate-sdk.md

Lines changed: 25 additions & 23 deletions
@@ -1,11 +1,11 @@
 ---
 title: Local Evaluation with the Azure AI Evaluation SDK
 titleSuffix: Azure AI Foundry
-description: This article provides instructions on how to evaluate a generative AI application with the Azure AI Evaluation SDK.
+description: Learn how to run evaluators on a single row of data and a larger test dataset to evaluate a generative AI application with the Azure AI Evaluation SDK.
 author: lgayhardt
 ms.author: lagayhar
 ms.reviewer: minthigpen
-ms.date: 07/15/2025
+ms.date: 10/10/2025
 ms.service: azure-ai-foundry
 ms.topic: how-to
 ms.custom:
@@ -19,7 +19,7 @@ ms.custom:

 [!INCLUDE [feature-preview](../../includes/feature-preview.md)]

-If you want to thoroughly assess the performance of your generative AI application when you apply it to a substantial dataset, you can evaluate it in your development environment with the Azure AI Evaluation SDK. When you provide either a test dataset or a target, your generative AI application outputs are quantitatively measured with both mathematical-based metrics and AI-assisted quality and safety evaluators. Built-in or custom evaluators can provide you with comprehensive insights into the application's capabilities and limitations.
+You can thoroughly assess the performance of your generative AI application by applying it to a substantial dataset. Evaluate the application in your development environment with the Azure AI Evaluation SDK. When you provide either a test dataset or a target, your generative AI application outputs are quantitatively measured with both mathematical-based metrics and AI-assisted quality and safety evaluators. Built-in or custom evaluators can provide you with comprehensive insights into the application's capabilities and limitations.

 In this article, you learn how to run evaluators on a single row of data and a larger test dataset on an application target. You use built-in evaluators that use the Azure AI Evaluation SDK locally. Then, you learn to track the results and evaluation logs in an Azure AI project.

@@ -45,7 +45,7 @@ pip install azure-ai-evaluation
 | [Agentic](../../concepts/evaluation-evaluators/agent-evaluators.md) | `IntentResolutionEvaluator`, `ToolCallAccuracyEvaluator`, `TaskAdherenceEvaluator` |
 | [Azure OpenAI](../../concepts/evaluation-evaluators/azure-openai-graders.md) | `AzureOpenAILabelGrader`, `AzureOpenAIStringCheckGrader`, `AzureOpenAITextSimilarityGrader`, `AzureOpenAIGrader` |

-Built-in quality and safety metrics take in query and response pairs, along with additional information for specific evaluators.
+Built-in quality and safety metrics accept query and response pairs, along with additional information for specific evaluators.

 ### Data requirements for built-in evaluators

@@ -91,9 +91,11 @@ Built-in evaluators can accept query and response pairs, a list of conversations


 > [!NOTE]
-> AI-assisted quality evaluators except for `SimilarityEvaluator` come with a reason field. They employ techniques including chain-of-thought reasoning to generate an explanation for the score. Therefore they consume more token usage in generation as a result of improved evaluation quality. Specifically, `max_token` for evaluator generation is set to 800 for all AI-assisted evaluators, except that it has the value 1600 for `RetrievalEvaluator` and 3000 for `ToolCallAccuracyEvaluator` to accommodate for longer inputs.
+> AI-assisted quality evaluators except for `SimilarityEvaluator` come with a reason field. They employ techniques including chain-of-thought reasoning to generate an explanation for the score.
+>
+> They consume more token usage in generation as a result of improved evaluation quality. Specifically, `max_token` for evaluator generation is set to 800 for most AI-assisted evaluators. It has the value 1600 for `RetrievalEvaluator` and 3000 for `ToolCallAccuracyEvaluator` to accommodate longer inputs.

-Azure OpenAI graders require a template that describes how their input columns are turned into the *real* input that the grader uses. Example: If you have two inputs called *query* and *response*, and a template that was formatted as `{{item.query}}`, then only the query would be used. Similarly, you could have something like `{{item.conversation}}` to accept a conversation input, but the ability of the system to handle that depends on how you configure the rest of the grader to expect that input.
+Azure OpenAI graders require a template that describes how their input columns are turned into the *real* input that the grader uses. For example, if you have two inputs called *query* and *response*, and a template formatted as `{{item.query}}`, only the query is used. Similarly, you could have something like `{{item.conversation}}` to accept a conversation input, but the ability of the system to handle that depends on how you configure the rest of the grader to expect that input.

 For more information on data requirements for agentic evaluators, see [Evaluate your AI agents](agent-evaluate-sdk.md).

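The grader template described in the preceding hunk isn't shown with code in this diff. As an illustration only, here's a minimal sketch of a string-check grader wired to a `{{item.query}}` template; the constructor parameters shown (`input`, `name`, `operation`, `reference`) and the `model_config` placeholder values are assumptions to verify against the [Azure OpenAI graders](../../concepts/evaluation-evaluators/azure-openai-graders.md) article.

```python
from azure.ai.evaluation import AzureOpenAIStringCheckGrader

# Assumed Azure OpenAI model configuration; replace the placeholders with your values.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "gpt-4o-mini",
}

# Sketch only: the "{{item.query}}" template pulls just the `query` column from each
# row into the grader's input. Parameter names are assumptions based on the Azure
# OpenAI graders article.
string_grader = AzureOpenAIStringCheckGrader(
    model_config=model_config,
    input="{{item.query}}",
    name="query mentions France",
    operation="like",      # assumed string-check operation: input contains the reference
    reference="France",
)
```
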
@@ -112,26 +114,26 @@ relevance_eval = RelevanceEvaluator(model_config)
 relevance_eval(query=query, response=response)
 ```

-To run batch evaluations by using [local evaluation](#local-evaluation-on-test-datasets-using-evaluate) or [upload your dataset to run a cloud evaluation](./cloud-evaluation.md#uploading-evaluation-data), you need to represent the dataset in JSONL format. The previous single-turn data (a query-and-response pair) is equivalent to a line of a dataset like the following (we show three lines as an example):
+To run batch evaluations by using [local evaluation](#local-evaluation-on-test-datasets-using-evaluate) or [upload your dataset to run a cloud evaluation](./cloud-evaluation.md#uploading-evaluation-data), represent the dataset in JSONL format. The previous single-turn data, which is a query-and-response pair, is equivalent to a line of a dataset like the following example, which shows three lines:

 ```json
 {"query":"What is the capital of France?","response":"Paris."}
 {"query":"What atoms compose water?","response":"Hydrogen and oxygen."}
 {"query":"What color is my shirt?","response":"Blue."}
 ```

-The evaluation test dataset can contain the following, depending on the requirements of each built-in evaluator:
+The evaluation test dataset can contain the following elements, depending on the requirements of each built-in evaluator:

 - **Query**: The query sent in to the generative AI application.
 - **Response**: The response to the query generated by the generative AI application.
-- **Context**: The source the generated response is based on (that is, the grounding documents).
+- **Context**: The source the generated response is based on. That is, the grounding documents.
 - **Ground truth**: The response generated by a user or human as the true answer.

 To see what each evaluator requires, see [Evaluators](/azure/ai-foundry/concepts/observability#what-are-evaluators).

 #### Conversation support for text

-For evaluators that support conversations for text, you can provide `conversation` as input, which includes a Python dictionary with a list of `messages` (which include `content`, `role`, and optionally `context`).
+For evaluators that support conversations for text, you can provide `conversation` as input. This input includes a Python dictionary with a list of `messages`, which includes `content`, `role`, and optionally `context`.

 See the following two-turn conversation in Python:

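The two-turn conversation example itself falls outside this hunk. As an illustration of the structure the paragraph above describes, a conversation dictionary might look like the following sketch; the queries, responses, and context strings here are placeholders, not the article's actual example.

```python
# Illustrative sketch of a two-turn conversation: a dict with a "messages" list,
# where each message carries "role", "content", and optionally "context".
conversation = {
    "messages": [
        {"role": "user", "content": "Which tent is the most waterproof?"},
        {
            "role": "assistant",
            "content": "The Alpine Explorer Tent is the most waterproof.",
            "context": "From the product catalog: the Alpine Explorer Tent has a 3000 mm rainfly.",
        },
        {"role": "user", "content": "How much does it cost?"},
        {
            "role": "assistant",
            "content": "The Alpine Explorer Tent costs $120.",
            # "context" can be omitted (or null) for a turn; see the note in the next hunk.
        },
    ]
}
```
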
@@ -192,7 +194,7 @@ To run batch evaluations by using [local evaluation](#local-evaluation-on-test-d
 Our evaluators understand that the first turn of the conversation provides valid `query` from `user`, `context` from `assistant`, and `response` from `assistant` in the query-response format. Conversations are then evaluated per turn and results are aggregated over all turns for a conversation score.

 > [!NOTE]
-> In the second turn, even if `context` is `null` or a missing key, the evaluator interprets the turn as an empty string instead of erroring out, which might lead to misleading results. We strongly recommend that you validate your evaluation data to comply with the data requirements.
+> In the second turn, even if `context` is `null` or a missing key, the evaluator interprets the turn as an empty string instead of failing with an error, which might lead to misleading results. We strongly recommend that you validate your evaluation data to comply with the data requirements.

 For conversation mode, here's an example for `GroundednessEvaluator`:

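The `GroundednessEvaluator` example itself sits outside this hunk. A minimal sketch of what a conversation-mode call might look like, assuming an Azure OpenAI judge configuration with placeholder values and the `conversation` dictionary sketched earlier:

```python
from azure.ai.evaluation import GroundednessEvaluator

# Assumed Azure OpenAI judge configuration; replace the placeholders with your values.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "gpt-4o-mini",
}

groundedness_eval = GroundednessEvaluator(model_config)

# Pass the conversation dict sketched earlier; turns are scored individually and
# aggregated into an overall conversation-level groundedness result.
groundedness_conv_score = groundedness_eval(conversation=conversation)
print(groundedness_conv_score)
```
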
@@ -259,7 +261,7 @@ For conversation outputs, per-turn results are stored in a list and the overall
 ```

 > [!NOTE]
-> We recommend that users migrate their code to use the key without prefixes (for example, `groundedness.groundedness`) to allow your code to support more evaluator models.
+> We recommend that you migrate your code to use the key without prefixes to allow your code to support more evaluator models. For example, `groundedness.groundedness`.

 #### Conversation support for images and multi-modal text and image

@@ -332,29 +334,29 @@ conversation_base64 = {
 safety_score = safety_evaluator(conversation=conversation_image_url)
 ```

-Currently the image and multi-modal evaluators support:
+Currently, the image and multi-modal evaluators support:

-- Single turn only (a conversation can have only one user message and one assistant message).
+- Single turn only: a conversation can have only one user message and one assistant message.
 - Conversations that have only one system message.
-- Conversation payloads that are smaller than 10 MB (including images).
+- Conversation payloads that are smaller than 10 MB, including images.
 - Absolute URLs and Base64-encoded images.
 - Multiple images in a single turn.
 - JPG/JPEG, PNG, and GIF file formats.

 #### Set up

-For AI-assisted quality evaluators (except for `GroundednessProEvaluator` preview), you must specify a GPT model (`gpt-35-turbo`, `gpt-4`, `gpt-4-turbo`, `gpt-4o`, or `gpt-4o-mini`) in your `model_config`. The GPT model acts as a judge to score the evaluation data. We support both Azure OpenAI or OpenAI model configuration schemas. For the best performance and parseable responses with our evaluators, we recommend using GPT models that aren't in preview.
+For AI-assisted quality evaluators, except for `GroundednessProEvaluator` (preview), you must specify a GPT model (`gpt-35-turbo`, `gpt-4`, `gpt-4-turbo`, `gpt-4o`, or `gpt-4o-mini`) in your `model_config`. The GPT model acts as a judge to score the evaluation data. We support both Azure OpenAI and OpenAI model configuration schemas. For the best performance and parseable responses with our evaluators, we recommend using GPT models that aren't in preview.

 > [!NOTE]
-> We strongly recommend that you replace `gpt-3.5-turbo` with `gpt-4o-mini` for your evaluator model. The latter is cheaper, more capable, and as fast, according to [OpenAI](https://platform.openai.com/docs/models/gpt-4#gpt-3-5-turbo).
+> We strongly recommend that you replace `gpt-3.5-turbo` with `gpt-4o-mini` for your evaluator model. According to [OpenAI](https://platform.openai.com/docs/models/gpt-4#gpt-3-5-turbo), `gpt-4o-mini` is cheaper, more capable, and as fast.
 >
 > Make sure that you have at least the `Cognitive Services OpenAI User` role for the Azure OpenAI resource to make inference calls with the API key. To learn more about permissions, see [Permissions for an Azure OpenAI resource](../../../ai-services/openai/how-to/role-based-access-control.md#summary).

-For all risk and safety evaluators and `GroundednessProEvaluator` (preview), instead of a GPT deployment in `model_config`, you must provide your `azure_ai_project` information. This accesses the back end evaluation service via your Azure AI project.
+For all risk and safety evaluators and `GroundednessProEvaluator` (preview), instead of a GPT deployment in `model_config`, you must provide your `azure_ai_project` information. This accesses the back-end evaluation service by using your Azure AI project.

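As an illustration of the `azure_ai_project` setup described above, a risk-and-safety evaluator might be constructed roughly like the following sketch; the project fields shown (`subscription_id`, `resource_group_name`, `project_name`) and the pairing of `ViolenceEvaluator` with `DefaultAzureCredential` are assumptions to check against your SDK version.

```python
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import ViolenceEvaluator

# Assumed shape of the Azure AI project information; replace with your own values.
azure_ai_project = {
    "subscription_id": "<your-subscription-id>",
    "resource_group_name": "<your-resource-group>",
    "project_name": "<your-project-name>",
}

# Safety evaluators call the back-end evaluation service through the project,
# so no GPT deployment (model_config) is needed here.
violence_eval = ViolenceEvaluator(
    credential=DefaultAzureCredential(),
    azure_ai_project=azure_ai_project,
)

violence_score = violence_eval(query="What is the capital of France?", response="Paris.")
print(violence_score)
```
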
 #### Prompts for AI-assisted built-in evaluators

-We open source the prompts of our quality evaluators in our Evaluator Library and the Azure AI Evaluation Python SDK repository for transparency, except for the Safety Evaluators and `GroundednessProEvaluator` (powered by Azure AI Content Safety). These prompts serve as instructions for a language model to perform their evaluation task, which requires a human-friendly definition of the metric and its associated scoring rubrics. We highly recommend that users customize the definitions and grading rubrics to their scenario specifics. See details in [Custom evaluators](../../concepts/evaluation-evaluators/custom-evaluators.md).
+We open-source the prompts of our quality evaluators in our Evaluator Library and the Azure AI Evaluation Python SDK repository for transparency, except for the Safety Evaluators and `GroundednessProEvaluator`, powered by Azure AI Content Safety. These prompts serve as instructions for a language model to perform their evaluation task, which requires a human-friendly definition of the metric and its associated scoring rubrics. We highly recommend that you customize the definitions and grading rubrics to your scenario specifics. For more information, see [Custom evaluators](../../concepts/evaluation-evaluators/custom-evaluators.md).

 ### Composite evaluators

@@ -374,12 +376,12 @@ After you spot-check your built-in or custom evaluators on a single row of data,
 If this session is your first time running evaluations and logging it to your Azure AI Foundry project, you might need to do a few other setup steps:

 1. [Create and connect your storage account](https://github.com/azure-ai-foundry/foundry-samples/blob/main/samples/microsoft/infrastructure-setup/01-connections/connection-storage-account.bicep) to your Azure AI Foundry project at the resource level. This bicep template provisions and connects a storage account to your Foundry project with key authentication.
-2. Make sure the connected storage account has access to all projects.
-3. If you connected your storage account with Microsoft Entra ID, make sure to give MSI (Microsoft Identity) permissions for **Storage Blob Data Owner** to both your account and Foundry project resource in Azure portal.
+1. Make sure the connected storage account has access to all projects.
+1. If you connected your storage account with Microsoft Entra ID, make sure to give MSI (Microsoft Identity) permissions for **Storage Blob Data Owner** to both your account and Foundry project resource in the Azure portal.

 ### Evaluate on a dataset and log results to Azure AI Foundry

-To ensure the `evaluate()` API can correctly parse the data, you must specify column mapping to map the column from the dataset to key words that the evaluators accept. In this case, we specify the data mapping for `query`, `response`, and `context`.
+To ensure the `evaluate()` API can correctly parse the data, you must specify column mapping to map the column from the dataset to key words that the evaluators accept. This example specifies the data mapping for `query`, `response`, and `context`.

 ```python
 from azure.ai.evaluation import evaluate
@@ -513,7 +515,7 @@ result = evaluate(

 If you have a list of queries that you want to run and then evaluate, the `evaluate()` API also supports a `target` parameter. This parameter can send queries to an application to collect answers, and then run your evaluators on the resulting query and response.

-A target can be any callable class in your directory. In this case, we have a Python script `askwiki.py` with a callable class `askwiki()` that we can set as our target. If we have a dataset of queries we can send into our simple `askwiki` app, we can evaluate the groundedness of the outputs. Make sure that you specify the proper column mapping for your data in `"column_mapping"`. You can use `"default"` to specify column mapping for all evaluators.
+A target can be any callable class in your directory. In this example, there's a Python script `askwiki.py` with a callable class `askwiki()` that is set as the target. If you have a dataset of queries that you can send into the simple `askwiki` app, you can evaluate the groundedness of the outputs. Make sure that you specify the proper column mapping for your data in `"column_mapping"`. You can use `"default"` to specify column mapping for all evaluators.

 Here's the content in `"data.jsonl"`:

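The actual `data.jsonl` contents and the corresponding `evaluate()` call fall outside this hunk. As an illustration only, a target evaluation might be wired up roughly as follows; the query-only rows, the `${data.*}`/`${outputs.*}` column-mapping keys, and the `askwiki` import path are assumptions to verify against the full example in the article.

```python
from azure.ai.evaluation import evaluate, GroundednessEvaluator
from askwiki import askwiki  # hypothetical import of the callable target described above

# Illustrative data.jsonl rows for a target run (queries only; the target produces responses):
#   {"query": "What is the capital of France?"}
#   {"query": "Who wrote Romeo and Juliet?"}

# Assumed Azure OpenAI judge configuration; replace the placeholders with your values.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "gpt-4o-mini",
}

result = evaluate(
    data="data.jsonl",
    target=askwiki,
    evaluators={"groundedness": GroundednessEvaluator(model_config)},
    evaluator_config={
        "default": {
            # Assumed convention: "${data.<column>}" reads from the dataset and
            # "${outputs.<column>}" reads from what the target returns.
            "column_mapping": {
                "query": "${data.query}",
                "context": "${outputs.context}",
                "response": "${outputs.response}",
            }
        }
    },
)
print(result)
```
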