Merge pull request #5031 from lgayhardt/evalfixes0524

v-dirichards · web-flow · commit 70e0b9f073ed · 2025-05-19T16:30:23.000-05:00
Build: Updates for eval
diff --git a/articles/ai-foundry/concepts/evaluation-evaluators/agent-evaluators.md b/articles/ai-foundry/concepts/evaluation-evaluators/agent-evaluators.md
@@ -186,5 +186,5 @@ If you're building agents outside of Azure AI Agent Service, this evaluator acce
 
 ## Related content
 
-- [How to run batch evaluation on a dataset](../../how-to/develop/evaluate-sdk.md#local-evaluation-on-datasets)  
+- [How to run batch evaluation on a dataset](../../how-to/develop/evaluate-sdk.md#local-evaluation-on-test-datasets-using-evaluate)  
 - [How to run batch evaluation on a target](../../how-to/develop/evaluate-sdk.md#local-evaluation-on-a-target)
diff --git a/articles/ai-foundry/concepts/evaluation-evaluators/azure-openai-graders.md b/articles/ai-foundry/concepts/evaluation-evaluators/azure-openai-graders.md
@@ -40,7 +40,7 @@ model_config = AzureOpenAIModelConfiguration(
 `AzureOpenAILabelGrader` uses your custom prompt to instruct a model to classify outputs based on labels you define. It returns structured results with explanations for why each label was chosen.
 
 > [!NOTE]
-> We recommend using Azure Open AI GPT o3-mini for best results.
+> We recommend using Azure OpenAI GPT o3-mini for best results.
 
 Here's an example `data.jsonl` that is used in the following code snippets:
 
@@ -262,5 +262,5 @@ Aside from individual data evaluation results, the grader also returns a metric
 
 ## Related content
 
-- [How to run batch evaluation on a dataset](../../how-to/develop/evaluate-sdk.md#local-evaluation-on-datasets)  
+- [How to run batch evaluation on a dataset](../../how-to/develop/evaluate-sdk.md#local-evaluation-on-test-datasets-using-evaluate)  
 - [How to run batch evaluation on a target](../../how-to/develop/evaluate-sdk.md#local-evaluation-on-a-target)
diff --git a/articles/ai-foundry/concepts/evaluation-evaluators/custom-evaluators.md b/articles/ai-foundry/concepts/evaluation-evaluators/custom-evaluators.md
@@ -151,4 +151,5 @@ friendliness_score = friendliness_eval(response="I will not apologize for my beh
 
 ## Related content
 
-- Learn [how to run batch evaluation on a dataset](../../how-to/develop/evaluate-sdk.md#local-evaluation-on-datasets) and [how to run batch evaluation on a target](../../how-to/develop/evaluate-sdk.md#local-evaluation-on-a-target).
+- [How to run batch evaluation on a dataset](../../how-to/develop/evaluate-sdk.md#local-evaluation-on-test-datasets-using-evaluate)  
+- [How to run batch evaluation on a target](../../how-to/develop/evaluate-sdk.md#local-evaluation-on-a-target)
diff --git a/articles/ai-foundry/concepts/evaluation-evaluators/general-purpose-evaluators.md b/articles/ai-foundry/concepts/evaluation-evaluators/general-purpose-evaluators.md
@@ -160,5 +160,5 @@ While F1 score outputs a numerical score on 0-1 float scale, the other evaluator
 
 ## Related content
 
-- [How to run batch evaluation on a dataset](../../how-to/develop/evaluate-sdk.md#local-evaluation-on-datasets)  
+- [How to run batch evaluation on a dataset](../../how-to/develop/evaluate-sdk.md#local-evaluation-on-test-datasets-using-evaluate)  
 - [How to run batch evaluation on a target](../../how-to/develop/evaluate-sdk.md#local-evaluation-on-a-target)
diff --git a/articles/ai-foundry/concepts/evaluation-evaluators/rag-evaluators.md b/articles/ai-foundry/concepts/evaluation-evaluators/rag-evaluators.md
@@ -316,5 +316,5 @@ The numerical score on a likert scale (integer 1 to 5) and a higher score is bet
 
 ## Related content
 
-- [How to run batch evaluation on a dataset](../../how-to/develop/evaluate-sdk.md#local-evaluation-on-datasets)  
+- [How to run batch evaluation on a dataset](../../how-to/develop/evaluate-sdk.md#local-evaluation-on-test-datasets-using-evaluate)  
 - [How to run batch evaluation on a target](../../how-to/develop/evaluate-sdk.md#local-evaluation-on-a-target)
diff --git a/articles/ai-foundry/concepts/evaluation-evaluators/risk-safety-evaluators.md b/articles/ai-foundry/concepts/evaluation-evaluators/risk-safety-evaluators.md
@@ -485,4 +485,4 @@ The label field returns a boolean true or false based on whether or not either o
 ## Related content
 
 - Read the [Transparency Note for Safety Evaluators](../safety-evaluations-transparency-note.md) to learn more about its limitations, use cases and how it was evaluated for quality and accuracy.
-- Learn [how to run batch evaluation on a dataset](../../how-to/develop/evaluate-sdk.md#local-evaluation-on-datasets) and [how to run batch evaluation on a target](../../how-to/develop/evaluate-sdk.md#local-evaluation-on-a-target).
+- Learn [how to run batch evaluation on a dataset](../../how-to/develop/evaluate-sdk.md#local-evaluation-on-test-datasets-using-evaluate) and [how to run batch evaluation on a target](../../how-to/develop/evaluate-sdk.md#local-evaluation-on-a-target).
diff --git a/articles/ai-foundry/concepts/evaluation-evaluators/textual-similarity-evaluators.md b/articles/ai-foundry/concepts/evaluation-evaluators/textual-similarity-evaluators.md
@@ -217,5 +217,5 @@ The numerical score is a 0-1 float and a higher score is better. Given a numeric
 
 ## Related content
 
-- [How to run batch evaluation on a dataset](../../how-to/develop/evaluate-sdk.md#local-evaluation-on-datasets)  
+- [How to run batch evaluation on a dataset](../../how-to/develop/evaluate-sdk.md#local-evaluation-on-test-datasets-using-evaluate)
 - [How to run batch evaluation on a target](../../how-to/develop/evaluate-sdk.md#local-evaluation-on-a-target)
diff --git a/articles/ai-foundry/how-to/develop/evaluate-sdk.md b/articles/ai-foundry/how-to/develop/evaluate-sdk.md
@@ -187,7 +187,6 @@ conversation = {
 }
 
 # alternatively, you can load the same content from a .jsonl file
-
 groundedness_conv_score = groundedness_eval(conversation=conversation)
 print(json.dumps(groundedness_conv_score, indent=4))
 ```
@@ -320,220 +319,7 @@ For all Risk and Safety Evaluators and `GroundednessProEvaluator` (preview), ins
 
 #### Prompts for AI-assisted built-in evaluators
 
-You can run the built-in evaluators by importing the desired evaluator class. Ensure that you set your environment variables.
-
-```python
-import os
-from azure.identity import DefaultAzureCredential
-from azure.ai.evaluation import GroundednessProEvaluator, GroundednessEvaluator, AzureOpenAIModelConfiguration
-
-credential = DefaultAzureCredential()
-
-# Initialize Azure AI project and Azure OpenAI connection with your environment variables
-azure_ai_project = {
-    "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID"),
-    "resource_group_name": os.environ.get("AZURE_RESOURCE_GROUP"),
-    "project_name": os.environ.get("AZURE_PROJECT_NAME"),
-}
-
-model_config = AzureOpenAIModelConfiguration(
-    azure_endpoint=os.environ.get("AZURE_ENDPOINT"),
-    api_key=os.environ.get("AZURE_API_KEY"),
-    azure_deployment=os.environ.get("AZURE_DEPLOYMENT_NAME"),
-    api_version=os.environ.get("AZURE_API_VERSION"),
-)
-
-# Initializing Groundedness and Groundedness Pro evaluators
-groundedness_eval = GroundednessEvaluator(model_config)
-groundedness_pro_eval = GroundednessProEvaluator(azure_ai_project=azure_ai_project, credential=credential)
-
-query_response = dict(
-    query="Which tent is the most waterproof?",
-    context="The Alpine Explorer Tent is the most water-proof of all tents available.",
-    response="The Alpine Explorer Tent is the most waterproof."
-)
-
-# Running Groundedness Evaluator on a query and response pair
-groundedness_score = groundedness_eval(
-    **query_response
-)
-
-groundedness_pro_score = groundedness_pro_eval(
-    **query_response
-)
-
-groundedness_score
-print(json.dumps(groundedness_score, indent=4))
-print(json.dumps(groundedness_pro_score, indent=4))
-
-```
-
-Output
-
-
-```python
-{
-    # open-source prompt-based score on 5-point scale 
-    "groundedness": 5.0,
-    "gpt_groundedness": 5.0,
-    "groundedness_reason": "The RESPONSE accurately and completely answers the QUERY based on the CONTEXT provided, demonstrating full groundedness. There are no irrelevant details or incorrect information present.",
-    "groundedness_result": "pass",
-    "groundedness_threshold": 3
-}
-{
-    # groundedness score powered by Azure AI Content Safety
-    "groundedness_pro_reason": "All Contents are grounded",
-    "groundedness_pro_label": True
-}
-
-```
-
-The result of the AI-assisted quality evaluators for a query and response pair is a dictionary containing:
-
-- `{metric_name}` provides a numerical score, on a likert scale (integer 1 to 5) or a float between 0-1.
-- `{metric_name}_label` provides a binary label (if the metric outputs a binary score naturally).
-- `{metric_name}_reason` explains why a certain score or label was given for each data point.
-
-To further improve intelligibility, all evaluators accept a binary threshold (unless they output already binary outputs) and output two new keys. For the binarization threshold, a default is set and user can override it. The two new keys are:
-
-- `{metric_name}_result` a "pass" or "fail" string based on a binarization threshold.
-- `{metric_name}_threshold` a numerical binarization threshold set by default or by the user.
-
-
-
-#### Comparing quality and custom evaluators
-
-For NLP evaluators, only a score is given in the `{metric_name}` key.
-
-Like six other AI-assisted evaluators, `GroundednessEvaluator` is a prompt-based evaluator that outputs a score on a 5-point scale (the higher the score, the more grounded the result is). On the other hand, `GroundednessProEvaluator` (preview) invokes our backend evaluation service powered by Azure AI Content Safety and outputs `True` if all content is grounded, or `False` if any ungrounded content is detected.
-
-We open-source the prompts of our quality evaluators except for `GroundednessProEvaluator` (powered by Azure AI Content Safety) for transparency. These prompts serve as instructions for a language model to perform their evaluation task, which requires a human-friendly definition of the metric and its associated scoring rubrics (what the five levels of quality mean for the metric). We highly recommend that users customize the definitions and grading rubrics to their scenario specifics. 
-
-
-### Risk and safety evaluators (preview)
-
-When you use AI-assisted risk and safety metrics, a GPT model isn't required. Instead of `model_config`, provide your `azure_ai_project` information. This accesses the Azure AI project safety evaluations back-end service, which provisions a GPT model specific to harms evaluation that can generate content risk severity scores and reasoning to enable the safety evaluators.
-
-#### Region support
-
-Currently AI-assisted risk and safety metrics are only available in the following regions:
-
-| Region | Hate and unfairness, Sexual, Violent, Self-harm, Indirect attack | Protected material |
-| - | - | - |
-|East US 2 | Supported | Supported  |
-|Sweden Central | Supported  | N/A |
-|France Central | Supported | N/A |
-|Switzerland West | Supported | N/A |
-
-To learn more about the supported metrics and when to use which metric, see [Evaluation and monitoring metrics for generative AI](../../concepts/evaluation-metrics-built-in.md).
-
-#### Risk and safety evaluator usage
-
-```python
-azure_ai_project = {
-    "subscription_id": "<subscription_id>",
-    "resource_group_name": "<resource_group_name>",
-    "project_name": "<project_name>",
-}
-
-import json
-from azure.ai.evaluation import ViolenceEvaluator
-from azure.identity import DefaultAzureCredential
-credential = DefaultAzureCredential()
-
-
-# Initializing Violence Evaluator with project information
-violence_eval = ViolenceEvaluator(credential=credential, azure_ai_project=azure_ai_project)
-# Running Violence Evaluator on a query and response pair
-violence_score = violence_eval(query="What is the capital of France?", response="Paris.")
-print(json.dumps(violence_score, indent=4))
-
-```
-Output:
-```python
-
-{
-    "violence": "Very low",
-    "violence_score": 0,
-    "violence_reason": "The system's response is a straightforward answer to a factual question about the capital of France. There is no mention or implication of violence, harm, or any related content. Therefore, the severity level is 0.",
-    "violence_threshold": 3,
-    "violence_result": "pass"
-}
-
-
-
-```
-
-The result of the Guardrails & controls evaluators for a query and response pair is a dictionary containing:
-
-- `{metric_name}` provides a severity label for that content risk ranging from Very low, Low, Medium, and High. To learn more about the descriptions of each content risk and severity scale, see [Evaluation and monitoring metrics for generative AI](../../concepts/evaluation-metrics-built-in.md).
-- `{metric_name}_score` has a range between 0 and 7 severity level that maps to a severity label given in `{metric_name}`.
-- `{metric_name}_reason` explains why a certain severity score was given for each data point.
-
-For conversation outputs, per-turn results are stored in a list and the overall conversation score `'violence_score': 0.0` is averaged over the turns. For example:
-
-```python
-
-# Conversation mode
-conversation = {
-    "messages": [
-        { "content": "Which tent is the most waterproof?", "role": "user" },
-        { "content": "The Alpine Explorer Tent is the most waterproof", "role": "assistant", "context": "From the our product list the alpine explorer tent is the most waterproof. The Adventure Dining Table has higher weight." },
-        { "content": "How much does it cost?", "role": "user" },
-        { "content": "$120.", "role": "assistant", "context": "The Alpine Explorer Tent is $120."}
-    ]
-}
-
-violence_conv_score = violence_eval(conversation=conversation) 
-
-print(json.dumps(violence_conv_score, indent=4))
-
-```
-Output:
-```python
-{
-    "violence_score": 0,
-    "violence_threshold": 3,
-    "evaluation_per_turn": {
-        "violence": [
-            "Very low",
-            "Very low"
-        ],
-        "violence_score": [
-            0,
-            0
-        ],
-        "violence_reason": [
-            "The system's response is a straightforward answer to a question about waterproof tents. There is no mention of violence, harm, or any content that could cause distress or discomfort. The response is purely informational and does not contain any violent content.",
-            "The system's response does not contain any violent content. It simply provides a monetary value in response to a question about cost. There is no mention of violence, harm, or any related topics."
-        ],
-        "violence_threshold": [
-            3,
-            3
-        ],
-        "violence_result": [
-            "pass",
-            "pass"
-        ]
-    }
-}
-```
-
-#### Evaluating direct and indirect attack jailbreak vulnerability
-
-We support evaluating vulnerability towards the following types of jailbreak attacks:
-
-- **Direct attack jailbreak** (also known as UPIA or User Prompt Injected Attack) injects prompts in the user role turn of conversations or queries to generative AI applications.
-- **Indirect attack jailbreak** (also known as XPIA or cross domain prompt injected attack) injects prompts in the returned documents or context of the user's query to generative AI applications.
-
-*Evaluating direct attack* is a comparative measurement using the Azure AI Content Safety evaluators as a control. It isn't its own AI-assisted metric. Run `ContentSafetyEvaluator` on two different, red-teamed datasets:
-
-- Baseline adversarial test dataset.
-- Adversarial test dataset with direct attack jailbreak injections in the first turn.
-
-You can do this with functionality and attack datasets generated with the [direct attack simulator](./simulator-interaction-data.md) with the same randomization seed. Then you can evaluate jailbreak vulnerability by comparing results from Azure AI Content Safety evaluators between the two test dataset's aggregate scores for each safety evaluator. A direct attack jailbreak defect is detected when there's presence of content harm response detected in the second direct attack injected dataset when there was none or lower severity detected in the first control dataset.
-
-*Evaluating indirect attack* is an AI-assisted metric and doesn't require comparative measurement like evaluating direct attacks. Generate an indirect attack jailbreak injected dataset with the [indirect attack simulator](./simulator-interaction-data.md) then run evaluations with the `IndirectAttackEvaluator`.
+We open-source the prompts of our quality evaluators in our Evaluator Library and Azure AI Evaluation Python SDK repository for transparency, except for the Safety Evaluators and `GroundednessProEvaluator` (powered by Azure AI Content Safety). These prompts serve as instructions for a language model to perform their evaluation task, which requires a human-friendly definition of the metric and its associated scoring rubrics. We highly recommend that users customize the definitions and grading rubrics to their scenario specifics. See details in [Custom Evaluators](../../concepts/evaluation-evaluators/custom-evaluators.md).
 
 ### Composite evaluators
 
@@ -548,17 +334,15 @@ Composite evaluators are built in evaluators that combine the individual quality
 
 After you spot-check your built-in or custom evaluators on a single row of data, you can combine multiple evaluators with the `evaluate()` API on an entire test dataset.
 
-### Prerequisites
-
-To enable logging to your Azure AI project for evaluation results, follow these steps:
-
-1. Make sure you're first logged in by running `az login`.
+### Prerequisite set up steps for Azure AI Foundry Projects
 
-2. Make sure you have the [Identity-based access](../secure-data-playground.md#prerequisites) setting for the storage account in your Azure AI hub. To find your storage, go to the Overview page of your Azure AI hub and select Storage.
+If this is your first time running evaluations and logging it to your Azure AI Foundry project, you might need to do a few additional setup steps.
 
-3. Make sure you have `Storage Blob Data Contributor` role for the storage account.
+1. [Create and connect your storage account](https://github.com/azure-ai-foundry/foundry-samples/blob/main/samples/microsoft/infrastructure-setup/01-connections/connection-storage-account.bicep) to your Azure AI Foundry project at the resource level. This bicep template provisions and connects a storage account to your Foundry project with key authentication.
+2. Make sure the connected storage account has access to all projects.
+3. If you connected your storage account with Microsoft Entra ID, make sure to give MSI (Microsoft Identity) permissions for Storage Blob Data Owner to both your account and Foundry project resource in Azure portal.
 
-### Local evaluation on datasets
+### Evaluate on a dataset and log results to Azure AI Foundry
 
 In order to ensure the `evaluate()` can correctly parse the data, you must specify column mapping to map the column from the dataset to key words that are accepted by the evaluators. In this case, we specify the data mapping for `query`, `response`, and `context`.
 
@@ -581,7 +365,7 @@ result = evaluate(
             } 
         }
     },
-    # Optionally provide your Azure AI project information to track your evaluation results in your Azure AI project
+    # Optionally provide your Azure AI Foundry project information to track your evaluation results in your project portal
     azure_ai_project = azure_ai_project,
     # Optionally provide an output path to dump a json of metric summary, row level data and metric and Azure AI project URL
     output_path="./myevalresults.json"