41 changes: 29 additions & 12 deletions scenarios/evaluate/README.md
@@ -12,7 +12,24 @@ description: Evaluate.

### Overview

This tutorial provides a step-by-step guide on how to evaluate Generative AI models with Azure. Each of these samples uses the `azure-ai-evaluation` SDK.
This tutorial provides a step-by-step guide on how to evaluate Generative AI base models or AI applications with Azure. Each of these samples uses the [`azure-ai-evaluation`](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk) SDK.

When selecting a base model for building an application—or after building an AI application (such as a Retrieval-Augmented Generation (RAG) system or a multi-agent framework)—evaluation plays a pivotal role. Effective evaluation ensures that the chosen or developed AI model or application meets the intended safety, quality, and performance benchmarks.

In both cases, running evaluations requires specific tools, methods, and datasets. Here’s a breakdown of the key components involved:

* Testing with Evaluation Datasets

  - Bring Your Own Data: Use datasets tailored to your application or domain.
  - Red-Teaming Queries: Design adversarial prompts to test robustness.
  - [Azure AI Simulators](Simulators/): Leverage Azure AI's context-specific or adversarial dataset generators to create relevant test cases.

* Selecting the Appropriate Evaluators or Building Custom Ones

  - Pre-Built Evaluators: Azure AI provides a range of [generation safety](Supported_Evaluation_Metrics/AI_Judge_Evaluators_Safety_Risks/) and [quality/NLP evaluators](Supported_Evaluation_Metrics/AI_Judge_Evaluators_Quality/) ready for immediate use.
  - [Custom Evaluators](Supported_Evaluation_Metrics/Custom_Evaluators/): Using the Azure AI Evaluation SDK, you can design and implement evaluators that align with the unique requirements of your application.

* Generating and Visualizing Evaluation Results: The Azure AI Evaluation SDK lets you evaluate target functions (such as [endpoints of your AI application](Supported_Evaluation_Targets/Evaluate_App_Endpoint/) or your [model endpoints](Supported_Evaluation_Targets/Evaluate_Base_Model_Endpoint/)) on your dataset with either built-in or custom evaluators. You can run evaluations [remotely](Supported_Evaluation_Targets/Evaluate_On_Cloud/) in the cloud or locally on your own machine, as sketched below.

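For a quick sense of the local workflow, here is a minimal sketch of running a built-in evaluator over a JSONL dataset with the `azure-ai-evaluation` SDK. The file name, deployment settings, and environment variable names are illustrative placeholders, and exact evaluator names and result keys may vary slightly between SDK versions, so treat this as a sketch rather than a definitive recipe.

```python
import os

from azure.ai.evaluation import evaluate, RelevanceEvaluator

# Model configuration for the AI-assisted (LLM-as-judge) evaluators.
# The endpoint, key, and deployment values are placeholders for your own resource.
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": os.environ["AZURE_OPENAI_DEPLOYMENT"],
}

# data.jsonl is a hypothetical dataset: one JSON object per line with fields
# such as "query" and "response" (see the Simulators section for ways to build it).
result = evaluate(
    data="data.jsonl",
    evaluators={"relevance": RelevanceEvaluator(model_config)},
)

print(result["metrics"])  # aggregate scores; per-row details live in result["rows"]
```

Passing a `target=` callable to `evaluate` lets the same call generate responses from your application or model endpoint before scoring them, which is the pattern the endpoint notebooks below use.
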
### Objective

@@ -26,17 +43,17 @@ The main objective of this tutorial is to help users understand the process of e

| Sample name | adversarial | simulator | conversation starter | index | raw text | against model endpoint | against app | qualitative metrics | custom metrics | quantitative NLP metrics |
|----------------------------------------|-------------|-----------|---------------------|-------|----------|-----------------------|-------------|---------------------|----------------|----------------------|
| simulate_adversarial.ipynb | X | X | | | | X | | | | |
| simulate_conversation_starter.ipynb | | X | X | | | X | | | | |
| simulate_input_index.ipynb | | X | | X | | X | | | | |
| simulate_input_text.ipynb | | X | | | X | X | | | | |
| evaluate_endpoints.ipynb | | | | | | X | | X | | |
| evaluate_app.ipynb | | | | | | | X | X | | |
| evaluate_qualitative.ipynb | | | | | | X | | X | | |
| evaluate_custom.ipynb | | | | | | X | | | X | |
| evaluate_quantitative.ipynb | | | | | | X | | | | X |
| evaluate_safety_risk.ipynb | X | | | | | X | | | | |
| simulate_and_evaluate_endpoint.py | | X | | | X | X | | X | | |
| [Simulate_Adversarial.ipynb](Simulators/Simulate_Adversarial_Data/Simulate_Adversarial.ipynb) | X | X | | | | X | | | | |
| [Simulate_From_Conversation_Starter.ipynb](Simulators/Simulate_Context-Relevant_Data/Simulate_From_Conversation_Starter/Simulate_From_Conversation_Starter.ipynb) | | X | X | | | X | | | | |
| [Simulate_From_Azure_Search_Index.ipynb](Simulators/Simulate_Context-Relevant_Data/Simulate_From_Azure_Search_Index/Simulate_From_Azure_Search_Index.ipynb) | | X | | X | | X | | | | |
| [Simulate_From_Input_Text.ipynb](Simulators/Simulate_Context-Relevant_Data/Simulate_From_Input_Text/Simulate_From_Input_Text.ipynb) | | X | | | X | X | | | | |
| [Evaluate_Base_Model_Endpoint.ipynb](Supported_Evaluation_Targets/Evaluate_Base_Model_Endpoint/Evaluate_Base_Model_Endpoint.ipynb) | | | | | | X | | X | | |
| [Evaluate_App_Endpoint.ipynb](Supported_Evaluation_Targets/Evaluate_App_Endpoint/Evaluate_App_Endpoint.ipynb) | | | | | | | X | X | | |
| [AI_Judge_Evaluators_Quality.ipynb](Supported_Evaluation_Metrics/AI_Judge_Evaluators_Quality/AI_Judge_Evaluators_Quality.ipynb) | | | | | | X | | X | | |
| [Custom_Evaluators.ipynb](Supported_Evaluation_Metrics/Custom_Evaluators/Custom_Evaluators.ipynb) | | | | | | X | | | X | |
| [NLP_Evaluators.ipynb](Supported_Evaluation_Metrics/NLP_Evaluators/NLP_Evaluators.ipynb) | | | | | | X | | | | X |
| [AI_Judge_Evaluators_Safety_Risks.ipynb](Supported_Evaluation_Metrics/AI_Judge_Evaluators_Safety_Risks/AI_Judge_Evaluators_Safety_Risks.ipynb) | X | | | | | X | | | | |
| [Simulate_Evaluate_Groundedness.ipynb](Simulators/Simulate_Evaluate_Groundedness/Simulate_Evaluate_Groundedness.ipynb) | | X | | | X | X | | X | | |



24 changes: 24 additions & 0 deletions scenarios/evaluate/Simulators/README.md
@@ -0,0 +1,24 @@
---
page_type: sample
languages:
- python
products:
- ai-services
- azure-openai
description: Evaluate.
---

## Simulate Evaluation (Test) Data



Relevant, robust evaluation data is essential for effective evaluations. This data can be generated manually, can include production data, or can be assembled with the help of AI. There are two main types of evaluation data:

- Bring Your Own Data: You can create and update a “golden dataset” with realistic customer questions or inputs paired with expert answers, ensuring quality for generative AI experiences. This dataset can also include samples from production data, offering a realistic evaluation dataset derived from actual queries your AI application has encountered. A minimal example of such a dataset follows this list.
- Simulators: If evaluation data is not available, simulators can play a crucial role in generating evaluation data by creating both topic-related and adversarial queries.

  - Context-related simulators test the AI system’s ability to handle relevant interactions within a specific context, ensuring it performs well under typical use scenarios.
  - Adversarial simulators, on the other hand, generate queries designed to challenge the AI system, mimicking potential security threats or attempting to provoke undesirable behaviors. This approach helps identify the model's limitations and prepares it to perform well in unexpected or hostile conditions.

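For illustration, here is a hedged sketch of what a few rows of such a golden dataset might look like when stored as JSONL for the `azure-ai-evaluation` evaluators. The field names (`query`, `context`, `response`, `ground_truth`) follow the convention used by the built-in evaluators, but the fields you actually need depend on which evaluators you run; the example content and file name are hypothetical.

```python
import json

# Hypothetical golden-dataset rows; in practice these come from domain experts
# or from curated production traffic.
rows = [
    {
        "query": "What is the capacity of the Model X battery?",
        "context": "The Model X ships with a 75 kWh battery pack.",
        "response": "The Model X has a 75 kWh battery.",
        "ground_truth": "75 kWh",
    },
    {
        "query": "Does the warranty cover accidental damage?",
        "context": "The standard warranty covers manufacturing defects only.",
        "response": "No, accidental damage is not covered by the standard warranty.",
        "ground_truth": "No, only manufacturing defects are covered.",
    },
]

# Write one JSON object per line, the format expected by evaluate(data=...).
with open("golden_dataset.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```
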
Azure AI Studio provides tools for both topic-related and adversarial simulations, enabling comprehensive evaluation and enhancing confidence in deployment. For topic-related simulations, Azure AI enables you to simulate relevant conversations using [your data](Simulate_Context-Relevant_Data/Simulate_From_Input_Text/Simulate_From_Input_Text.ipynb), [your Azure Search Index](Simulate_Context-Relevant_Data/Simulate_From_Azure_Search_Index/Simulate_From_Azure_Search_Index.ipynb), or [your pre-defined conversation starters](Simulate_Context-Relevant_Data/Simulate_From_Conversation_Starter/Simulate_From_Conversation_Starter.ipynb).

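To illustrate the adversarial side, here is a hedged sketch of the `AdversarialSimulator` from the `azure-ai-evaluation` SDK. The project details and the callback target are placeholders, and scenario names, parameters, and helper methods may differ across SDK versions; the notebooks linked above are the authoritative walk-throughs.

```python
import asyncio

from azure.ai.evaluation.simulator import AdversarialScenario, AdversarialSimulator
from azure.identity import DefaultAzureCredential

# Placeholder project details; use your own subscription, resource group, and project.
azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<project-name>",
}


async def callback(messages, stream=False, session_state=None, context=None):
    # Placeholder target: echo a canned reply. In practice, call your application here.
    last_query = messages["messages"][-1]["content"]
    reply = {"role": "assistant", "content": f"I cannot help with: {last_query}"}
    messages["messages"].append(reply)
    return {
        "messages": messages["messages"],
        "stream": stream,
        "session_state": session_state,
        "context": context,
    }


async def main():
    simulator = AdversarialSimulator(
        azure_ai_project=azure_ai_project, credential=DefaultAzureCredential()
    )
    outputs = await simulator(
        scenario=AdversarialScenario.ADVERSARIAL_QA,  # single-turn adversarial Q&A
        target=callback,
        max_simulation_results=5,
    )
    # Helper name may vary by SDK version; the goal is JSONL you can feed to evaluate().
    print(outputs.to_eval_qr_json_lines())


asyncio.run(main())
```
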
@@ -5,18 +5,18 @@ languages:
products:
- ai-services
- azure-openai
description: Simulator which simulates adversarial questions to ask wiki a custom application
description: Simulator which simulates adversarial questions
---

## Adversarial Simulator for Custom Application (askwiki)
## Adversarial Simulator

### Overview

This tutorial provides a step-by-step guide on how to use the adversarial simulator to simulate against a custom application
This tutorial provides a step-by-step guide on how to use the adversarial simulator.

### Objective

The main objective of this tutorial is to help users understand the process of creating and using an adversarial simulator and use it with a custom application
The main objective of this tutorial is to help users understand the process of creating and using an adversarial simulator.
By the end of this tutorial, you should be able to:
- Use the simulator
- Run the simulator to have an adversarial question answering scenario
@@ -4,7 +4,7 @@ languages:
- python
products:
- azure-openai
description: Use the Simulator to generate high-quality query and response interactions with your AI applications from your data using LLMs."
description: Simulate from Azure Search Index
---

## Generate Query and Response from your Azure Search Index
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Simulate Queries and Responses from input text"
"# Simulate Queries and Responses from Azure Search Index"
]
},
{
@@ -13,7 +13,7 @@
"source": [
"## Objective\n",
"\n",
"Use the Simulator to generate high-quality queries and responses from your data using LLMs.\n",
"Use the Simulator to generate high-quality queries and responses from your data in Azure Search using LLMs.\n",
"\n",
"This tutorial uses the following Azure AI services:\n",
"\n",
@@ -4,7 +4,7 @@ languages:
- python
products:
- azure-openai
description: Use the Simulator to generate high-quality query and response interactions with your AI applications from your data using LLMs."
description: Use the Simulator to generate high-quality query and response interactions with your AI applications from your data using LLMs
---

## Generate Query and Response from your data
@@ -4,13 +4,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Evaluate model endpoints using Prompt Flow Eval APIs\n",
"# Evaluate using AI as Judge Quality Evaluators with Azure AI Evaluation SDK\n",
"\n",
"## Objective\n",
"\n",
"This tutorial provides a step-by-step guide on how to evaluate prompts against variety of model endpoints deployed on Azure AI Platform or non Azure AI platforms. \n",
"\n",
"This guide uses Python Class as an application target which is passed to Evaluate API provided by PromptFlow SDK to evaluate results generated by LLM models against provided prompts. \n",
"This guide uses Python Class as an application target which is passed to Evaluate API provided by Azure AI Evaluation SDK to evaluate results generated by LLM models against provided prompts. \n",
"\n",
"This tutorial uses the following Azure AI services:\n",
"\n",
@@ -5,18 +5,18 @@ languages:
products:
- ai-services
- azure-openai
description: Evaluating qualitative metrics
description: Evaluate using AI as Judge Quality Evaluators with Azure AI Evaluation SDK
---

## Evaluating qualitative metrics
## Evaluate using AI as Judge Quality Evaluators with Azure AI Evaluation SDK

### Overview

This tutorial provides a step-by-step guide on how to evaluate prompts against variety of model endpoint using qualitative metrics.
This tutorial provides a step-by-step guide on how to evaluate prompts against a variety of model endpoints using AI as Judge evaluators.

### Objective

The main objective of this tutorial is to help users understand the process of evaluating model endpoints using qualitative metrics. By the end of this tutorial, you should be able to:
The main objective of this tutorial is to help users understand the process of evaluating model endpoints using AI as Judge quality evaluators. By the end of this tutorial, you should be able to:

- Learn about evaluations
- Evaluate prompts against a model endpoint of your choice.
29 changes: 29 additions & 0 deletions scenarios/evaluate/Supported_Evaluation_Metrics/README.md
@@ -0,0 +1,29 @@
---
page_type: sample
languages:
- python
products:
- ai-services
- azure-openai
description: Evaluate.
---

## Supported Evaluators


The Azure AI Evaluation SDK supports a group of evaluators that measure different aspects of a response’s alignment with expectations. They can be customized, versioned, and shared across an organization, ensuring consistent evaluation metrics and parameters across projects. The choice of evaluators depends on the specific goals of the evaluation, such as assessing quality, safety, or custom requirements tailored to a particular use case. Below are the three main categories of evaluators supported via the Azure AI SDK and Studio UI:

![Types of Evaluators](./AutomatedEvaluationAzureAIFoundry.jpg)

* [Risk and safety evaluators](AI_Judge_Evaluators_Safety_Risks/): Evaluating potential risks associated with AI-generated content is essential for safeguarding against content risks with varying degrees of severity. This includes evaluating an AI system's predisposition towards generating harmful or inappropriate content.

* Generation quality evaluators: This involves assessing metrics such as the groundedness, coherence, and relevance of generated content using robust [AI-assisted](AI_Judge_Evaluators_Quality/) and [NLP](NLP_Evaluators/) metrics (a row-level sketch follows this list).


* [Custom evaluators](Custom_Evaluators/): Tailored evaluation metrics can be designed to meet specific needs and goals, providing flexibility and precision in assessing unique aspects of AI-generated content. These custom evaluators allow for more detailed and specific analyses, addressing particular concerns or requirements that standard metrics may not cover. A minimal custom-evaluator sketch is also included below.

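To make the quality category concrete, here is a hedged, row-level sketch of one AI-assisted evaluator. The endpoint and deployment values are placeholders, and evaluator names, accepted fields, and result keys may differ slightly between SDK versions.

```python
import os

from azure.ai.evaluation import GroundednessEvaluator

# Model configuration for the LLM judge; the values below are placeholders.
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": os.environ["AZURE_OPENAI_DEPLOYMENT"],
}

groundedness = GroundednessEvaluator(model_config)

# Score a single row: is the response grounded in the provided context?
score = groundedness(
    context="Online orders can be returned within 30 days of delivery.",
    response="You can return online orders within 30 days of delivery.",
)
print(score)  # e.g. a groundedness rating; exact keys depend on the SDK version
```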


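And here is a hedged sketch of the third category: a custom evaluator is a callable (a function or a class with `__call__`) that takes the fields it needs and returns a dictionary of scores, so it can be passed to `evaluate()` alongside the built-in evaluators. The metric below (answer length) is purely illustrative.

```python
class AnswerLengthEvaluator:
    """Illustrative custom evaluator: reports the length of the response in words."""

    def __call__(self, *, response: str, **kwargs) -> dict:
        return {"answer_length": len(response.split())}


# Custom evaluators can be mixed with built-in ones, e.g.:
#   evaluate(data="data.jsonl", evaluators={"length": AnswerLengthEvaluator(), ...})
answer_length = AnswerLengthEvaluator()
print(answer_length(response="The Model X has a 75 kWh battery."))  # {'answer_length': 8}
```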
You can run evaluators locally or [remotely](../Supported_Evaluation_Targets/Evaluate_On_Cloud/Evaluate_On_Cloud.ipynb), log results in the cloud using the evaluation SDK, or integrate them into automated evaluations within the Azure AI Studio UI.
@@ -5,18 +5,7 @@
"id": "2e932e4c-5d55-461e-a313-3a087d8983b5",
"metadata": {},
"source": [
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"# Evaluate app using Azure AI Evaluation APIs\n"
"# Evaluate application endpoint using Azure AI Evaluation APIs\n"
]
},
{
@@ -25,7 +14,7 @@
"metadata": {},
"source": [
"## Objective\n",
"In this notebook we will demonstrate how to use the target functions with the standard evaluators to evaluate an app.\n",
"In this notebook we will demonstrate how to use the target functions with the standard evaluators to evaluate an application endpoint.\n",
"\n",
"This tutorial provides a step-by-step guide on how to evaluate a function\n",
"\n",
@@ -5,10 +5,10 @@ languages:
products:
- ai-services
- azure-openai
description: Evaluating an endpoint
description: Evaluating an application endpoint
---

## Evaluating an endpoint
## Evaluating an application endpoint

### Overview

@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Evaluate model endpoints using Azure AI Evaluation APIs\n",
"# Evaluate Base Model Endpoints using Azure AI Evaluation APIs\n",
"\n",
"## Objective\n",
"\n",
@@ -5,10 +5,10 @@ languages:
products:
- ai-services
- azure-openai
description: Evaluating model endpoints
description: Evaluating base model endpoints
---

## Evaluating model endpoints
## Evaluating base model endpoints

### Overview
