
Commit 59725f4

Freshness, in progress.
1 parent ff4df10 commit 59725f4

File tree: 1 file changed (+25, -23 lines)

articles/ai-foundry/how-to/develop/evaluate-sdk.md

Lines changed: 25 additions & 23 deletions
@@ -1,11 +1,11 @@
 ---
 title: Local Evaluation with the Azure AI Evaluation SDK
 titleSuffix: Azure AI Foundry
-description: This article provides instructions on how to evaluate a generative AI application with the Azure AI Evaluation SDK.
+description: Learn how to run evaluators on a single row of data and a larger test dataset to evaluate a generative AI application with the Azure AI Evaluation SDK.
 author: lgayhardt
 ms.author: lagayhar
 ms.reviewer: minthigpen
-ms.date: 07/15/2025
+ms.date: 10/10/2025
 ms.service: azure-ai-foundry
 ms.topic: how-to
 ms.custom:
@@ -19,7 +19,7 @@ ms.custom:

 [!INCLUDE [feature-preview](../../includes/feature-preview.md)]

-If you want to thoroughly assess the performance of your generative AI application when you apply it to a substantial dataset, you can evaluate it in your development environment with the Azure AI Evaluation SDK. When you provide either a test dataset or a target, your generative AI application outputs are quantitatively measured with both mathematical-based metrics and AI-assisted quality and safety evaluators. Built-in or custom evaluators can provide you with comprehensive insights into the application's capabilities and limitations.
+You can thoroughly assess the performance of your generative AI application by applying it to a substantial dataset. Evaluate the application in your development environment with the Azure AI Evaluation SDK. When you provide either a test dataset or a target, your generative AI application outputs are quantitatively measured with both mathematical-based metrics and AI-assisted quality and safety evaluators. Built-in or custom evaluators can provide you with comprehensive insights into the application's capabilities and limitations.

 In this article, you learn how to run evaluators on a single row of data and a larger test dataset on an application target. You use built-in evaluators that use the Azure AI Evaluation SDK locally. Then, you learn to track the results and evaluation logs in an Azure AI project.

@@ -45,7 +45,7 @@ pip install azure-ai-evaluation
 | [Agentic](../../concepts/evaluation-evaluators/agent-evaluators.md) | `IntentResolutionEvaluator`, `ToolCallAccuracyEvaluator`, `TaskAdherenceEvaluator` |
 | [Azure OpenAI](../../concepts/evaluation-evaluators/azure-openai-graders.md) | `AzureOpenAILabelGrader`, `AzureOpenAIStringCheckGrader`, `AzureOpenAITextSimilarityGrader`, `AzureOpenAIGrader` |

-Built-in quality and safety metrics take in query and response pairs, along with additional information for specific evaluators.
+Built-in quality and safety metrics accept query and response pairs, along with additional information for specific evaluators.

 ### Data requirements for built-in evaluators

@@ -91,9 +91,11 @@ Built-in evaluators can accept query and response pairs, a list of conversations


 > [!NOTE]
-> AI-assisted quality evaluators except for `SimilarityEvaluator` come with a reason field. They employ techniques including chain-of-thought reasoning to generate an explanation for the score. Therefore they consume more token usage in generation as a result of improved evaluation quality. Specifically, `max_token` for evaluator generation is set to 800 for all AI-assisted evaluators, except that it has the value 1600 for `RetrievalEvaluator` and 3000 for `ToolCallAccuracyEvaluator` to accommodate for longer inputs.
+> AI-assisted quality evaluators except for `SimilarityEvaluator` come with a reason field. They employ techniques including chain-of-thought reasoning to generate an explanation for the score.
+>
+> They consume more token usage in generation as a result of improved evaluation quality. Specifically, `max_token` for evaluator generation is set to 800 for most AI-assisted evaluators. It has the value 1600 for `RetrievalEvaluator` and 3000 for `ToolCallAccuracyEvaluator` to accommodate longer inputs.

-Azure OpenAI graders require a template that describes how their input columns are turned into the *real* input that the grader uses. Example: If you have two inputs called *query* and *response*, and a template that was formatted as `{{item.query}}`, then only the query would be used. Similarly, you could have something like `{{item.conversation}}` to accept a conversation input, but the ability of the system to handle that depends on how you configure the rest of the grader to expect that input.
+Azure OpenAI graders require a template that describes how their input columns are turned into the *real* input that the grader uses. For example, if you have two inputs called *query* and *response*, and a template formatted as `{{item.query}}`, only the query is used. Similarly, you could have something like `{{item.conversation}}` to accept a conversation input, but the ability of the system to handle that depends on how you configure the rest of the grader to expect that input.

 For more information on data requirements for agentic evaluators, see [Evaluate your AI agents](agent-evaluate-sdk.md).

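The grader template described in the preceding hunk isn't shown with code in this diff. As an illustration only, here's a minimal sketch of a string-check grader wired to a `{{item.query}}` template; the constructor parameters shown (`input`, `name`, `operation`, `reference`) and the `model_config` placeholder values are assumptions to verify against the [Azure OpenAI graders](../../concepts/evaluation-evaluators/azure-openai-graders.md) article.

```python
from azure.ai.evaluation import AzureOpenAIStringCheckGrader

# Assumed Azure OpenAI model configuration; replace the placeholders with your values.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "gpt-4o-mini",
}

# Sketch only: the "{{item.query}}" template pulls just the `query` column from each
# row into the grader's input. Parameter names are assumptions based on the Azure
# OpenAI graders article.
string_grader = AzureOpenAIStringCheckGrader(
    model_config=model_config,
    input="{{item.query}}",
    name="query mentions France",
    operation="like",      # assumed string-check operation: input contains the reference
    reference="France",
)
```
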
@@ -112,26 +114,26 @@ relevance_eval = RelevanceEvaluator(model_config)
 relevance_eval(query=query, response=response)
 ```

-To run batch evaluations by using [local evaluation](#local-evaluation-on-test-datasets-using-evaluate) or [upload your dataset to run a cloud evaluation](./cloud-evaluation.md#uploading-evaluation-data), you need to represent the dataset in JSONL format. The previous single-turn data (a query-and-response pair) is equivalent to a line of a dataset like the following (we show three lines as an example):
+To run batch evaluations by using [local evaluation](#local-evaluation-on-test-datasets-using-evaluate) or [upload your dataset to run a cloud evaluation](./cloud-evaluation.md#uploading-evaluation-data), represent the dataset in JSONL format. The previous single-turn data, which is a query-and-response pair, is equivalent to a line of a dataset like the following example, which shows three lines:

 ```json
 {"query":"What is the capital of France?","response":"Paris."}
 {"query":"What atoms compose water?","response":"Hydrogen and oxygen."}
 {"query":"What color is my shirt?","response":"Blue."}
 ```

-The evaluation test dataset can contain the following, depending on the requirements of each built-in evaluator:
+The evaluation test dataset can contain the following elements, depending on the requirements of each built-in evaluator:

 - **Query**: The query sent in to the generative AI application.
 - **Response**: The response to the query generated by the generative AI application.
-- **Context**: The source the generated response is based on (that is, the grounding documents).
+- **Context**: The source the generated response is based on. That is, the grounding documents.
 - **Ground truth**: The response generated by a user or human as the true answer.

 To see what each evaluator requires, see [Evaluators](/azure/ai-foundry/concepts/observability#what-are-evaluators).

 #### Conversation support for text

-For evaluators that support conversations for text, you can provide `conversation` as input, which includes a Python dictionary with a list of `messages` (which include `content`, `role`, and optionally `context`).
+For evaluators that support conversations for text, you can provide `conversation` as input. This input includes a Python dictionary with a list of `messages`, which includes `content`, `role`, and optionally `context`.

 See the following two-turn conversation in Python:

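The two-turn conversation example itself falls outside this hunk. As an illustration of the structure the paragraph above describes, a conversation dictionary might look like the following sketch; the queries, responses, and context strings here are placeholders, not the article's actual example.

```python
# Illustrative sketch of a two-turn conversation: a dict with a "messages" list,
# where each message carries "role", "content", and optionally "context".
conversation = {
    "messages": [
        {"role": "user", "content": "Which tent is the most waterproof?"},
        {
            "role": "assistant",
            "content": "The Alpine Explorer Tent is the most waterproof.",
            "context": "From the product catalog: the Alpine Explorer Tent has a 3000 mm rainfly.",
        },
        {"role": "user", "content": "How much does it cost?"},
        {
            "role": "assistant",
            "content": "The Alpine Explorer Tent costs $120.",
            # "context" can be omitted (or null) for a turn; see the note in the next hunk.
        },
    ]
}
```
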
@@ -192,7 +194,7 @@ To run batch evaluations by using [local evaluation](#local-evaluation-on-test-d
 Our evaluators understand that the first turn of the conversation provides valid `query` from `user`, `context` from `assistant`, and `response` from `assistant` in the query-response format. Conversations are then evaluated per turn and results are aggregated over all turns for a conversation score.

 > [!NOTE]
-> In the second turn, even if `context` is `null` or a missing key, the evaluator interprets the turn as an empty string instead of erroring out, which might lead to misleading results. We strongly recommend that you validate your evaluation data to comply with the data requirements.
+> In the second turn, even if `context` is `null` or a missing key, the evaluator interprets the turn as an empty string instead of failing with an error, which might lead to misleading results. We strongly recommend that you validate your evaluation data to comply with the data requirements.

 For conversation mode, here's an example for `GroundednessEvaluator`:

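The `GroundednessEvaluator` example itself sits outside this hunk. A minimal sketch of what a conversation-mode call might look like, assuming an Azure OpenAI judge configuration with placeholder values and the `conversation` dictionary sketched earlier:

```python
from azure.ai.evaluation import GroundednessEvaluator

# Assumed Azure OpenAI judge configuration; replace the placeholders with your values.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "gpt-4o-mini",
}

groundedness_eval = GroundednessEvaluator(model_config)

# Pass the conversation dict sketched earlier; turns are scored individually and
# aggregated into an overall conversation-level groundedness result.
groundedness_conv_score = groundedness_eval(conversation=conversation)
print(groundedness_conv_score)
```
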
@@ -259,7 +261,7 @@ For conversation outputs, per-turn results are stored in a list and the overall
 ```

 > [!NOTE]
-> We recommend that users migrate their code to use the key without prefixes (for example, `groundedness.groundedness`) to allow your code to support more evaluator models.
+> We recommend that you migrate your code to use the key without prefixes to allow your code to support more evaluator models. For example, `groundedness.groundedness`.

 #### Conversation support for images and multi-modal text and image

@@ -332,29 +334,29 @@ conversation_base64 = {
 safety_score = safety_evaluator(conversation=conversation_image_url)
 ```

-Currently the image and multi-modal evaluators support:
+Currently, the image and multi-modal evaluators support:

-- Single turn only (a conversation can have only one user message and one assistant message).
+- Single turn only: a conversation can have only one user message and one assistant message.
 - Conversations that have only one system message.
-- Conversation payloads that are smaller than 10 MB (including images).
+- Conversation payloads that are smaller than 10 MB, including images.
 - Absolute URLs and Base64-encoded images.
 - Multiple images in a single turn.
 - JPG/JPEG, PNG, and GIF file formats.

 #### Set up

-For AI-assisted quality evaluators (except for `GroundednessProEvaluator` preview), you must specify a GPT model (`gpt-35-turbo`, `gpt-4`, `gpt-4-turbo`, `gpt-4o`, or `gpt-4o-mini`) in your `model_config`. The GPT model acts as a judge to score the evaluation data. We support both Azure OpenAI or OpenAI model configuration schemas. For the best performance and parseable responses with our evaluators, we recommend using GPT models that aren't in preview.
+For AI-assisted quality evaluators, except for `GroundednessProEvaluator` (preview), you must specify a GPT model (`gpt-35-turbo`, `gpt-4`, `gpt-4-turbo`, `gpt-4o`, or `gpt-4o-mini`) in your `model_config`. The GPT model acts as a judge to score the evaluation data. We support both Azure OpenAI and OpenAI model configuration schemas. For the best performance and parseable responses with our evaluators, we recommend using GPT models that aren't in preview.

 > [!NOTE]
-> We strongly recommend that you replace `gpt-3.5-turbo` with `gpt-4o-mini` for your evaluator model. The latter is cheaper, more capable, and as fast, according to [OpenAI](https://platform.openai.com/docs/models/gpt-4#gpt-3-5-turbo).
+> We strongly recommend that you replace `gpt-3.5-turbo` with `gpt-4o-mini` for your evaluator model. According to [OpenAI](https://platform.openai.com/docs/models/gpt-4#gpt-3-5-turbo), `gpt-4o-mini` is cheaper, more capable, and as fast.
 >
 > Make sure that you have at least the `Cognitive Services OpenAI User` role for the Azure OpenAI resource to make inference calls with the API key. To learn more about permissions, see [Permissions for an Azure OpenAI resource](../../../ai-services/openai/how-to/role-based-access-control.md#summary).

-For all risk and safety evaluators and `GroundednessProEvaluator` (preview), instead of a GPT deployment in `model_config`, you must provide your `azure_ai_project` information. This accesses the back end evaluation service via your Azure AI project.
+For all risk and safety evaluators and `GroundednessProEvaluator` (preview), instead of a GPT deployment in `model_config`, you must provide your `azure_ai_project` information. This accesses the back-end evaluation service by using your Azure AI project.

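As an illustration of the `azure_ai_project` setup described above, a risk-and-safety evaluator might be constructed roughly like the following sketch; the project fields shown (`subscription_id`, `resource_group_name`, `project_name`) and the pairing of `ViolenceEvaluator` with `DefaultAzureCredential` are assumptions to check against your SDK version.

```python
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import ViolenceEvaluator

# Assumed shape of the Azure AI project information; replace with your own values.
azure_ai_project = {
    "subscription_id": "<your-subscription-id>",
    "resource_group_name": "<your-resource-group>",
    "project_name": "<your-project-name>",
}

# Safety evaluators call the back-end evaluation service through the project,
# so no GPT deployment (model_config) is needed here.
violence_eval = ViolenceEvaluator(
    credential=DefaultAzureCredential(),
    azure_ai_project=azure_ai_project,
)

violence_score = violence_eval(query="What is the capital of France?", response="Paris.")
print(violence_score)
```
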
 #### Prompts for AI-assisted built-in evaluators

-We open source the prompts of our quality evaluators in our Evaluator Library and the Azure AI Evaluation Python SDK repository for transparency, except for the Safety Evaluators and `GroundednessProEvaluator` (powered by Azure AI Content Safety). These prompts serve as instructions for a language model to perform their evaluation task, which requires a human-friendly definition of the metric and its associated scoring rubrics. We highly recommend that users customize the definitions and grading rubrics to their scenario specifics. See details in [Custom evaluators](../../concepts/evaluation-evaluators/custom-evaluators.md).
+We open-source the prompts of our quality evaluators in our Evaluator Library and the Azure AI Evaluation Python SDK repository for transparency, except for the Safety Evaluators and `GroundednessProEvaluator`, powered by Azure AI Content Safety. These prompts serve as instructions for a language model to perform their evaluation task, which requires a human-friendly definition of the metric and its associated scoring rubrics. We highly recommend that you customize the definitions and grading rubrics to your scenario specifics. For more information, see [Custom evaluators](../../concepts/evaluation-evaluators/custom-evaluators.md).

 ### Composite evaluators

@@ -374,12 +376,12 @@ After you spot-check your built-in or custom evaluators on a single row of data,
 If this session is your first time running evaluations and logging it to your Azure AI Foundry project, you might need to do a few other setup steps:

 1. [Create and connect your storage account](https://github.com/azure-ai-foundry/foundry-samples/blob/main/samples/microsoft/infrastructure-setup/01-connections/connection-storage-account.bicep) to your Azure AI Foundry project at the resource level. This bicep template provisions and connects a storage account to your Foundry project with key authentication.
-2. Make sure the connected storage account has access to all projects.
-3. If you connected your storage account with Microsoft Entra ID, make sure to give MSI (Microsoft Identity) permissions for **Storage Blob Data Owner** to both your account and Foundry project resource in Azure portal.
+1. Make sure the connected storage account has access to all projects.
+1. If you connected your storage account with Microsoft Entra ID, make sure to give MSI (Microsoft Identity) permissions for **Storage Blob Data Owner** to both your account and Foundry project resource in the Azure portal.

 ### Evaluate on a dataset and log results to Azure AI Foundry

-To ensure the `evaluate()` API can correctly parse the data, you must specify column mapping to map the column from the dataset to key words that the evaluators accept. In this case, we specify the data mapping for `query`, `response`, and `context`.
+To ensure the `evaluate()` API can correctly parse the data, you must specify column mapping to map the column from the dataset to key words that the evaluators accept. This example specifies the data mapping for `query`, `response`, and `context`.

 ```python
 from azure.ai.evaluation import evaluate
@@ -513,7 +515,7 @@ result = evaluate(

 If you have a list of queries that you want to run and then evaluate, the `evaluate()` API also supports a `target` parameter. This parameter can send queries to an application to collect answers, and then run your evaluators on the resulting query and response.

-A target can be any callable class in your directory. In this case, we have a Python script `askwiki.py` with a callable class `askwiki()` that we can set as our target. If we have a dataset of queries we can send into our simple `askwiki` app, we can evaluate the groundedness of the outputs. Make sure that you specify the proper column mapping for your data in `"column_mapping"`. You can use `"default"` to specify column mapping for all evaluators.
+A target can be any callable class in your directory. In this example, there's a Python script `askwiki.py` with a callable class `askwiki()` that is set as the target. If you have a dataset of queries that you can send into the simple `askwiki` app, you can evaluate the groundedness of the outputs. Make sure that you specify the proper column mapping for your data in `"column_mapping"`. You can use `"default"` to specify column mapping for all evaluators.

 Here's the content in `"data.jsonl"`:

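The actual `data.jsonl` contents and the corresponding `evaluate()` call fall outside this hunk. As an illustration only, a target evaluation might be wired up roughly as follows; the query-only rows, the `${data.*}`/`${outputs.*}` column-mapping keys, and the `askwiki` import path are assumptions to verify against the full example in the article.

```python
from azure.ai.evaluation import evaluate, GroundednessEvaluator
from askwiki import askwiki  # hypothetical import of the callable target described above

# Illustrative data.jsonl rows for a target run (queries only; the target produces responses):
#   {"query": "What is the capital of France?"}
#   {"query": "Who wrote Romeo and Juliet?"}

# Assumed Azure OpenAI judge configuration; replace the placeholders with your values.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "gpt-4o-mini",
}

result = evaluate(
    data="data.jsonl",
    target=askwiki,
    evaluators={"groundedness": GroundednessEvaluator(model_config)},
    evaluator_config={
        "default": {
            # Assumed convention: "${data.<column>}" reads from the dataset and
            # "${outputs.<column>}" reads from what the target returns.
            "column_mapping": {
                "query": "${data.query}",
                "context": "${outputs.context}",
                "response": "${outputs.response}",
            }
        }
    },
)
print(result)
```
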