
Commit abddd3a

Merge pull request #7605 from TimShererWithAquent/us496641-13
Freshness Edit: Evaluate your generative AI application locally with the Azure AI Evaluation SDK
2 parents 64646a3 + 2363aa1 commit abddd3a

File tree

1 file changed: +43 -37 lines changed


articles/ai-foundry/how-to/develop/evaluate-sdk.md

Lines changed: 43 additions & 37 deletions
@@ -1,11 +1,11 @@
---
title: Local Evaluation with the Azure AI Evaluation SDK
titleSuffix: Azure AI Foundry
-description: This article provides instructions on how to evaluate a generative AI application with the Azure AI Evaluation SDK.
+description: Learn how to run evaluators on a single row of data and a larger test dataset to evaluate a generative AI application with the Azure AI Evaluation SDK.
author: lgayhardt
ms.author: lagayhar
ms.reviewer: minthigpen
-ms.date: 07/15/2025
+ms.date: 10/10/2025
ms.service: azure-ai-foundry
ms.topic: how-to
ms.custom:
@@ -19,7 +19,9 @@ ms.custom:

[!INCLUDE [feature-preview](../../includes/feature-preview.md)]

-If you want to thoroughly assess the performance of your generative AI application when you apply it to a substantial dataset, you can evaluate it in your development environment with the Azure AI Evaluation SDK. When you provide either a test dataset or a target, your generative AI application outputs are quantitatively measured with both mathematical-based metrics and AI-assisted quality and safety evaluators. Built-in or custom evaluators can provide you with comprehensive insights into the application's capabilities and limitations.
+You can thoroughly assess the performance of your generative AI application by applying it to a substantial dataset. Evaluate the application in your development environment with the Azure AI Evaluation SDK.
+
+When you provide either a test dataset or a target, your generative AI application outputs are quantitatively measured with both mathematical-based metrics and AI-assisted quality and safety evaluators. Built-in or custom evaluators can provide you with comprehensive insights into the application's capabilities and limitations.

In this article, you learn how to run evaluators on a single row of data and a larger test dataset on an application target. You use built-in evaluators that use the Azure AI Evaluation SDK locally. Then, you learn to track the results and evaluation logs in an Azure AI project.

@@ -32,10 +34,12 @@ pip install azure-ai-evaluation
```

> [!NOTE]
-> For more detailed information, see the [API reference documentation for the Azure AI Evaluation SDK](https://aka.ms/azureaieval-python-ref).
+> For more information, see [Azure AI Evaluation client library for Python](https://aka.ms/azureaieval-python-ref).

## Built-in evaluators

+Built-in quality and safety metrics accept query and response pairs, along with additional information for specific evaluators.
+
| Category | Evaluators |
|--------------------------|-----------------------------|
| [General purpose](../../concepts/evaluation-evaluators/general-purpose-evaluators.md) | `CoherenceEvaluator`, `FluencyEvaluator`, `QAEvaluator` |
@@ -45,8 +49,6 @@ pip install azure-ai-evaluation
| [Agentic](../../concepts/evaluation-evaluators/agent-evaluators.md) | `IntentResolutionEvaluator`, `ToolCallAccuracyEvaluator`, `TaskAdherenceEvaluator` |
| [Azure OpenAI](../../concepts/evaluation-evaluators/azure-openai-graders.md) | `AzureOpenAILabelGrader`, `AzureOpenAIStringCheckGrader`, `AzureOpenAITextSimilarityGrader`, `AzureOpenAIGrader` |

-Built-in quality and safety metrics take in query and response pairs, along with additional information for specific evaluators.
-
### Data requirements for built-in evaluators

Built-in evaluators can accept query and response pairs, a list of conversations in JSON Lines (JSONL) format, or both.
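To make the two accepted input shapes concrete, here's a minimal sketch that calls `GroundednessEvaluator` once with a query-and-response pair and once with a conversation. The endpoint, key, and deployment values are placeholders, and the exact `model_config` schema is covered later in the Set up section.

```python
import os

from azure.ai.evaluation import GroundednessEvaluator

# Placeholder Azure OpenAI judge configuration; see the "Set up" section for details.
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o-mini",
}

groundedness_eval = GroundednessEvaluator(model_config)

# Shape 1: a single query-and-response pair, with grounding context.
single_turn_score = groundedness_eval(
    query="What is the capital of France?",
    response="Paris.",
    context="France's capital city is Paris.",
)

# Shape 2: a conversation, a dictionary with a list of messages
# (each message has content, role, and optionally context).
conversation = {
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {
            "role": "assistant",
            "content": "Paris.",
            "context": "France's capital city is Paris.",
        },
    ]
}
conversation_score = groundedness_eval(conversation=conversation)

print(single_turn_score)
print(conversation_score)
```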
@@ -91,11 +93,13 @@ Built-in evaluators can accept query and response pairs, a list of conversations


> [!NOTE]
-> AI-assisted quality evaluators except for `SimilarityEvaluator` come with a reason field. They employ techniques including chain-of-thought reasoning to generate an explanation for the score. Therefore they consume more token usage in generation as a result of improved evaluation quality. Specifically, `max_token` for evaluator generation has been set to 800 for all AI-assisted evaluators, except that it will be 1600 for `RetrievalEvaluator` and 3000 for `ToolCallAccuracyEvaluator` to accommodate for longer inputs.
+> AI-assisted quality evaluators, except for `SimilarityEvaluator`, come with a reason field. They employ techniques including chain-of-thought reasoning to generate an explanation for the score.
+>
+> They consume more tokens during generation as a result of the improved evaluation quality. Specifically, `max_token` for evaluator generation is set to 800 for most AI-assisted evaluators. It's 1600 for `RetrievalEvaluator` and 3000 for `ToolCallAccuracyEvaluator` to accommodate longer inputs.

-Azure OpenAI graders require a template that describes how their input columns are turned into the *real* input that the grader uses. Example: If you have two inputs called *query* and *response*, and a template that was formatted as `{{item.query}}`, then only the query would be used. Similarly, you could have something like `{{item.conversation}}` to accept a conversation input, but the ability of the system to handle that depends on how you configure the rest of the grader to expect that input.
+Azure OpenAI graders require a template that describes how their input columns are turned into the *real* input that the grader uses. For example, if you have two inputs called *query* and *response*, and a template formatted as `{{item.query}}`, only the query is used. Similarly, you could have something like `{{item.conversation}}` to accept a conversation input, but the ability of the system to handle that depends on how you configure the rest of the grader to expect that input.
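As a rough illustration of how such a template selects input columns, the following sketch configures a string-check grader whose `input` template pulls only the *query* column. The class name comes from the table earlier in this article, but treat the parameter names (`input`, `name`, `operation`, `reference`) as assumptions to verify against the Azure OpenAI graders documentation for your SDK version.

```python
import os

from azure.ai.evaluation import AzureOpenAIStringCheckGrader

# Placeholder Azure OpenAI configuration for the grader.
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o-mini",
}

# The {{item.query}} template means only the dataset's query column
# reaches the grader; the response column is ignored.
starts_with_grader = AzureOpenAIStringCheckGrader(
    model_config=model_config,
    input="{{item.query}}",       # template over the input columns
    name="query_starts_with_what",  # hypothetical grader name
    operation="like",               # assumed operation value
    reference="What",
)
```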

-For more information on data requirements for agentic evaluators, go to [Run agent evaluations locally with the Azure AI Evaluation SDK](agent-evaluate-sdk.md).
+For more information on data requirements for agentic evaluators, see [Evaluate your AI agents](agent-evaluate-sdk.md).

#### Single-turn support for text

@@ -112,26 +116,26 @@ relevance_eval = RelevanceEvaluator(model_config)
relevance_eval(query=query, response=response)
```

-To run batch evaluations by using [local evaluation](#local-evaluation-on-test-datasets-using-evaluate) or [upload your dataset to run a cloud evaluation](./cloud-evaluation.md#uploading-evaluation-data), you need to represent the dataset in JSONL format. The previous single-turn data (a query-and-response pair) is equivalent to a line of a dataset like the following (we show three lines as an example):
+To run batch evaluations by using [local evaluation](#local-evaluation-on-test-datasets-using-evaluate) or [upload your dataset to run a cloud evaluation](./cloud-evaluation.md#uploading-evaluation-data), represent the dataset in JSONL format. The previous single-turn data, a query-and-response pair, is equivalent to one line of a dataset like the following three-line example:

```json
{"query":"What is the capital of France?","response":"Paris."}
{"query":"What atoms compose water?","response":"Hydrogen and oxygen."}
{"query":"What color is my shirt?","response":"Blue."}
```

-The evaluation test dataset can contain the following, depending on the requirements of each built-in evaluator:
+The evaluation test dataset can contain the following elements, depending on the requirements of each built-in evaluator:

- **Query**: The query sent in to the generative AI application.
- **Response**: The response to the query generated by the generative AI application.
-- **Context**: The source the generated response is based on (that is, the grounding documents).
+- **Context**: The source that the generated response is based on; that is, the grounding documents.
- **Ground truth**: The response generated by a user or human as the true answer.

-To see what each evaluator requires, you can learn more in the [built-in evaluators documents](/azure/ai-foundry/concepts/observability#what-are-evaluators).
+To see what each evaluator requires, see [Evaluators](/azure/ai-foundry/concepts/observability#what-are-evaluators).
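For a dataset that exercises evaluators with different requirements, each JSONL line can carry all four elements. Here's a small sketch that writes such a file; the field names `query`, `response`, `context`, and `ground_truth` match the keywords the built-in evaluators accept, and the file name is only an example.

```python
import json

# Example rows that include all four elements; individual evaluators
# use only the fields they require.
rows = [
    {
        "query": "What is the capital of France?",
        "response": "Paris.",
        "context": "France's capital city is Paris.",
        "ground_truth": "Paris",
    },
    {
        "query": "What atoms compose water?",
        "response": "Hydrogen and oxygen.",
        "context": "A water molecule is made of two hydrogen atoms and one oxygen atom.",
        "ground_truth": "Hydrogen and oxygen",
    },
]

# Write the dataset in JSON Lines format: one JSON object per line.
with open("evaluation_dataset.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```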

#### Conversation support for text

-For evaluators that support conversations for text, you can provide `conversation` as input, which includes a Python dictionary with a list of `messages` (which include `content`, `role`, and optionally `context`).
+For evaluators that support conversations for text, you can provide `conversation` as input. This input is a Python dictionary with a list of `messages`, where each message includes `content`, `role`, and optionally `context`.

See the following two-turn conversation in Python:

@@ -192,7 +196,9 @@ To run batch evaluations by using [local evaluation](#local-evaluation-on-test-d
Our evaluators understand that the first turn of the conversation provides valid `query` from `user`, `context` from `assistant`, and `response` from `assistant` in the query-response format. Conversations are then evaluated per turn and results are aggregated over all turns for a conversation score.

> [!NOTE]
-> In the second turn, even if `context` is `null` or a missing key, it's interpreted as an empty string instead of erroring out, which might lead to misleading results. We strongly recommend that you validate your evaluation data to comply with the data requirements.
+> In the second turn, even if `context` is `null` or a missing key, the evaluator interprets it as an empty string instead of failing with an error, which might lead to misleading results.
+>
+> We strongly recommend that you validate your evaluation data to comply with the data requirements.
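One lightweight way to follow that recommendation is to check conversations before you run an evaluation. The following sketch is a hypothetical helper, not part of the SDK; it only flags assistant turns whose `context` is missing or empty.

```python
def find_missing_context(conversation: dict) -> list[int]:
    """Return the indexes of assistant messages that lack a non-empty context."""
    problems = []
    for index, message in enumerate(conversation.get("messages", [])):
        if message.get("role") != "assistant":
            continue
        context = message.get("context")
        if context is None or context == "":
            problems.append(index)
    return problems


# Example: the second assistant turn has no context, so index 3 is flagged.
conversation = {
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "Paris.", "context": "France's capital city is Paris."},
        {"role": "user", "content": "What about Germany?"},
        {"role": "assistant", "content": "Berlin."},
    ]
}
print(find_missing_context(conversation))  # [3]
```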

For conversation mode, here's an example for `GroundednessEvaluator`:

@@ -259,7 +265,7 @@ For conversation outputs, per-turn results are stored in a list and the overall
```

> [!NOTE]
-> We recommend that users migrate their code to use the key without prefixes (for example, `groundedness.groundedness`) to allow your code to support more evaluator models.
+> We recommend that you migrate your code to use keys without prefixes, for example, `groundedness.groundedness`, so that your code supports more evaluator models.

#### Conversation support for images and multi-modal text and image

@@ -332,33 +338,33 @@ conversation_base64 = {
safety_score = safety_evaluator(conversation=conversation_image_url)
```

-Currently the image and multi-modal evaluators support:
+Currently, the image and multi-modal evaluators support:

-- Single turn only (a conversation can have only one user message and one assistant message).
+- Single turn only: a conversation can have only one user message and one assistant message.
- Conversations that have only one system message.
-- Conversation payloads that are smaller than 10 MB (including images).
+- Conversation payloads that are smaller than 10 MB, including images.
- Absolute URLs and Base64-encoded images.
- Multiple images in a single turn.
- JPG/JPEG, PNG, and GIF file formats.

#### Set up

-For AI-assisted quality evaluators (except for `GroundednessProEvaluator` preview), you must specify a GPT model (`gpt-35-turbo`, `gpt-4`, `gpt-4-turbo`, `gpt-4o`, or `gpt-4o-mini`) in your `model_config`. The GPT model acts as a judge to score the evaluation data. We support both Azure OpenAI or OpenAI model configuration schemas. For the best performance and parseable responses with our evaluators, we recommend using GPT models that aren't in preview.
+For AI-assisted quality evaluators, except for `GroundednessProEvaluator` preview, you must specify a GPT model (`gpt-35-turbo`, `gpt-4`, `gpt-4-turbo`, `gpt-4o`, or `gpt-4o-mini`) in your `model_config`. The GPT model acts as a judge to score the evaluation data. We support both Azure OpenAI and OpenAI model configuration schemas. For the best performance and parseable responses with our evaluators, we recommend using GPT models that aren't in preview.

> [!NOTE]
-> We strongly recommend that you replace `gpt-3.5-turbo` with `gpt-4o-mini` for your evaluator model, because the latter is cheaper, more capable, and just as fast, according to [OpenAI](https://platform.openai.com/docs/models/gpt-4#gpt-3-5-turbo).
+> We strongly recommend that you replace `gpt-3.5-turbo` with `gpt-4o-mini` for your evaluator model. According to [OpenAI](https://platform.openai.com/docs/models/gpt-4#gpt-3-5-turbo), `gpt-4o-mini` is cheaper, more capable, and just as fast.
>
> Make sure that you have at least the `Cognitive Services OpenAI User` role for the Azure OpenAI resource to make inference calls with the API key. To learn more about permissions, see [Permissions for an Azure OpenAI resource](../../../ai-services/openai/how-to/role-based-access-control.md#summary).

-For all risk and safety evaluators and `GroundednessProEvaluator` (preview), instead of a GPT deployment in `model_config`, you must provide your `azure_ai_project` information. This accesses the back end evaluation service via your Azure AI project.
+For all risk and safety evaluators and `GroundednessProEvaluator` (preview), instead of a GPT deployment in `model_config`, you must provide your `azure_ai_project` information. This configuration accesses the back-end evaluation service through your Azure AI project.
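As a sketch of the two configuration shapes just described, the following shows a `model_config` for an AI-assisted quality evaluator next to an `azure_ai_project` for a safety evaluator. The dictionary keys and the `ViolenceEvaluator` constructor arguments reflect one commonly documented form; treat them as assumptions to check against the SDK reference for your version, which might instead accept a project endpoint URL.

```python
import os

from azure.ai.evaluation import FluencyEvaluator, ViolenceEvaluator
from azure.identity import DefaultAzureCredential

# Judge model configuration for AI-assisted quality evaluators (Azure OpenAI schema).
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o-mini",
}
fluency_eval = FluencyEvaluator(model_config)

# Azure AI project information for risk and safety evaluators (assumed dictionary form).
azure_ai_project = {
    "subscription_id": os.environ["AZURE_SUBSCRIPTION_ID"],
    "resource_group_name": os.environ["AZURE_RESOURCE_GROUP"],
    "project_name": os.environ["AZURE_AI_PROJECT_NAME"],
}
violence_eval = ViolenceEvaluator(
    azure_ai_project=azure_ai_project,
    credential=DefaultAzureCredential(),
)
```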

#### Prompts for AI-assisted built-in evaluators

-We open source the prompts of our quality evaluators in our Evaluator Library and the Azure AI Evaluation Python SDK repository for transparency, except for the Safety Evaluators and `GroundednessProEvaluator` (powered by Azure AI Content Safety). These prompts serve as instructions for a language model to perform their evaluation task, which requires a human-friendly definition of the metric and its associated scoring rubrics. We highly recommend that users customize the definitions and grading rubrics to their scenario specifics. See details in [Custom evaluators](../../concepts/evaluation-evaluators/custom-evaluators.md).
+We open-source the prompts of our quality evaluators in our Evaluator Library and the Azure AI Evaluation Python SDK repository for transparency, except for the Safety Evaluators and `GroundednessProEvaluator`, which is powered by Azure AI Content Safety. These prompts serve as instructions for a language model to perform their evaluation task, which requires a human-friendly definition of the metric and its associated scoring rubrics. We highly recommend that you customize the definitions and grading rubrics to your scenario specifics. For more information, see [Custom evaluators](../../concepts/evaluation-evaluators/custom-evaluators.md).

### Composite evaluators

-Composite evaluators are built-in evaluators that combine individual quality or safety metrics. They easily provide a wide range of metrics right out of the box for both query response pairs or chat messages.
+Composite evaluators are built-in evaluators that combine individual quality or safety metrics. They provide a wide range of metrics right out of the box for both query-response pairs and chat messages.

| Composite evaluator | Contains | Description |
|--|--|--|
@@ -371,15 +377,15 @@ After you spot-check your built-in or custom evaluators on a single row of data,

### Prerequisite set up steps for Azure AI Foundry projects

-If this is your first time running evaluations and logging it to your Azure AI Foundry project, you might need to do a few additional setup steps:
+If this is the first time you run evaluations and log them to your Azure AI Foundry project, you might need to do the following setup steps:

1. [Create and connect your storage account](https://github.com/azure-ai-foundry/foundry-samples/blob/main/samples/microsoft/infrastructure-setup/01-connections/connection-storage-account.bicep) to your Azure AI Foundry project at the resource level. This bicep template provisions and connects a storage account to your Foundry project with key authentication.
-2. Make sure the connected storage account has access to all projects.
-3. If you connected your storage account with Microsoft Entra ID, make sure to give MSI (Microsoft Identity) permissions for **Storage Blob Data Owner** to both your account and Foundry project resource in Azure portal.
+1. Make sure the connected storage account has access to all projects.
+1. If you connected your storage account with Microsoft Entra ID, make sure to give Microsoft Identity permissions for **Storage Blob Data Owner** to both your account and the Foundry project resource in the Azure portal.

### Evaluate on a dataset and log results to Azure AI Foundry

-To ensure the `evaluate()` API can correctly parse the data, you must specify column mapping to map the column from the dataset to key words that the evaluators accept. In this case, we specify the data mapping for `query`, `response`, and `context`.
+To ensure the `evaluate()` API can correctly parse the data, you must specify column mapping to map the columns from the dataset to keywords that the evaluators accept. This example specifies the data mapping for `query`, `response`, and `context`.

```python
from azure.ai.evaluation import evaluate
@@ -410,7 +416,7 @@ result = evaluate(
> [!TIP]
> Get the contents of the `result.studio_url` property for a link to view your logged evaluation results in your Azure AI project.

-The evaluator outputs results in a dictionary, which contains aggregate `metrics` and row-level data and metrics. See the following example of an output:
+The evaluator outputs results in a dictionary, which contains aggregate `metrics` and row-level data and metrics. See the following example output:

```python
{'metrics': {'answer_length.value': 49.333333333333336,
@@ -513,7 +519,7 @@ result = evaluate(

If you have a list of queries that you want to run and then evaluate, the `evaluate()` API also supports a `target` parameter. This parameter can send queries to an application to collect answers, and then run your evaluators on the resulting query and response.

-A target can be any callable class in your directory. In this case, we have a Python script `askwiki.py` with a callable class `askwiki()` that we can set as our target. If we have a dataset of queries we can send into our simple `askwiki` app, we can evaluate the groundedness of the outputs. Make sure that you specify the proper column mapping for your data in `"column_mapping"`. You can use `"default"` to specify column mapping for all evaluators.
+A target can be any callable class in your directory. In this example, a Python script `askwiki.py` provides a callable class `askwiki()` that is set as the target. If you have a dataset of queries that you can send into the simple `askwiki` app, you can evaluate the groundedness of the outputs. Make sure that you specify the proper column mapping for your data in `"column_mapping"`. You can use `"default"` to specify column mapping for all evaluators.
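For orientation, here's a minimal sketch of what a callable target might look like, separate from the `askwiki` example: a class whose instances take the dataset's `query` column as a keyword argument and return a dictionary of outputs that you can then reference in `"column_mapping"` (for example, as `"${target.response}"`). The actual `askwiki()` implementation is more involved.

```python
class EchoTarget:
    """Hypothetical stand-in for an application target such as askwiki()."""

    def __call__(self, *, query: str) -> dict:
        # A real target would call your generative AI application here.
        response = f"You asked: {query}"
        context = "No retrieval performed; this is a canned answer."
        return {"response": response, "context": context}


# The returned keys can then be mapped in column_mapping,
# for example "response": "${target.response}".
target = EchoTarget()
print(target(query="What is the capital of France?"))
```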

Here's the content in `"data.jsonl"`:

@@ -547,11 +553,11 @@

## Related content

-- [Azure AI Evaluation Python SDK client reference documentation](https://aka.ms/azureaieval-python-ref)
-- [Azure AI Evaluation SDK client troubleshooting guide](https://aka.ms/azureaieval-tsg)
-- [Learn more about the evaluation metrics](../../concepts/evaluation-metrics-built-in.md)
-- [Evaluate your generative AI applications remotely on the cloud](./cloud-evaluation.md)
-- [Learn more about simulating test datasets for evaluation](./simulator-interaction-data.md)
-- [View your evaluation results in an Azure AI project](../../how-to/evaluate-results.md)
-- [Get started building a chat app by using the Azure AI Foundry SDK](../../quickstarts/get-started-code.md)
+- [Azure AI Evaluation client library for Python](https://aka.ms/azureaieval-python-ref)
+- [Troubleshoot AI Evaluation SDK Issues](https://aka.ms/azureaieval-tsg)
+- [Observability in generative AI](../../concepts/evaluation-metrics-built-in.md)
+- [Run evaluations in the cloud by using the Azure AI Foundry SDK](./cloud-evaluation.md)
+- [Generate synthetic and simulated data for evaluation](./simulator-interaction-data.md)
+- [See evaluation results in the Azure AI Foundry portal](../../how-to/evaluate-results.md)
+- [Get started with Azure AI Foundry](../../quickstarts/get-started-code.md)
- [Get started with evaluation samples](https://aka.ms/aistudio/eval-samples)