diff --git a/scenarios/evaluate/README.md b/scenarios/evaluate/README.md index 15d1d4fc..68072df8 100644 --- a/scenarios/evaluate/README.md +++ b/scenarios/evaluate/README.md @@ -12,7 +12,24 @@ description: Evaluate. ### Overview -This tutorial provides a step-by-step guide on how to evaluate Generative AI models with Azure. Each of these samples uses the `azure-ai-evaluation` SDK. +This tutorial provides a step-by-step guide on how to evaluate Generative AI base models or AI applications with Azure. Each of these samples uses the [`azure-ai-evaluation`](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk) SDK. + +When selecting a base model for building an application, or after building an AI application (such as a Retrieval-Augmented Generation (RAG) system or a multi-agent framework), evaluation plays a pivotal role. Effective evaluation ensures that the chosen or developed AI model or application meets the intended safety, quality, and performance benchmarks. + +In both cases, running evaluations requires specific tools, methods, and datasets. Here’s a breakdown of the key components involved: + +* Testing with Evaluation Datasets + + - Bring Your Own Data: Use datasets tailored to your application or domain. + - Red-Teaming Queries: Design adversarial prompts to test robustness. + - [Azure AI Simulators](Simulators/): Leverage Azure AI's context-specific or adversarial dataset generators to create relevant test cases. + +* Selecting the Appropriate Evaluators or Building Custom Ones + + - Pre-Built Evaluators: Azure AI provides a range of [generation safety](Supported_Evaluation_Metrics/AI_Judge_Evaluators_Safety_Risks/) and [quality/NLP evaluators](Supported_Evaluation_Metrics/AI_Judge_Evaluators_Quality/) ready for immediate use. + - [Custom Evaluators](Supported_Evaluation_Metrics/Custom_Evaluators/): Using the Azure AI Evaluation SDK, you can design and implement evaluators that align with the unique requirements of your application. + +* Generating and Visualizing Evaluation Results: The Azure AI Evaluation SDK enables you to evaluate target functions (such as the [endpoints of your AI application](Supported_Evaluation_Targets/Evaluate_App_Endpoint/) or your [model endpoints](Supported_Evaluation_Targets/Evaluate_Base_Model_Endpoint/)) on your dataset with either built-in or custom evaluators. You can run evaluations [remotely](Supported_Evaluation_Targets/Evaluate_On_Cloud/) in the cloud or locally on your own machine.
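To make the workflow above concrete, here is a minimal, hedged sketch of a local evaluation run with the `azure-ai-evaluation` SDK. The endpoint, deployment name, and `data.jsonl` file are illustrative placeholders, and the dataset is assumed to already contain `query` and `response` columns.

```python
# Minimal local evaluation sketch (endpoint, deployment, and data file are placeholders).
from azure.ai.evaluation import evaluate, RelevanceEvaluator

# Model configuration used by the AI-assisted (LLM-as-judge) evaluator.
model_config = {
    "azure_endpoint": "https://<your-aoai-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "gpt-4",
}

relevance = RelevanceEvaluator(model_config=model_config)

# A custom evaluator can be any callable that returns a dict of metrics.
def response_length(*, response: str, **kwargs):
    return {"response_length": len(response)}

# Run built-in and custom evaluators over a JSONL file of query/response pairs.
result = evaluate(
    data="data.jsonl",  # each line: {"query": "...", "response": "..."}
    evaluators={"relevance": relevance, "response_length": response_length},
)
print(result["metrics"])
```

The same `evaluate()` call can also log results to an Azure AI project for viewing in the Studio UI; the samples below cover those variations in detail.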
### Objective @@ -26,17 +43,17 @@ The main objective of this tutorial is to help users understand the process of e | Sample name | adversarial | simulator | conversation starter | index | raw text | against model endpoint | against app | qualitative metrics | custom metrics | quantitative NLP metrics | |----------------------------------------|-------------|-----------|---------------------|-------|----------|-----------------------|-------------|---------------------|----------------|----------------------| -| simulate_adversarial.ipynb | X | X | | | | X | | | | | -| simulate_conversation_starter.ipynb | | X | X | | | X | | | | | -| simulate_input_index.ipynb | | X | | X | | X | | | | | -| simulate_input_text.ipynb | | X | | | X | X | | | | | -| evaluate_endpoints.ipynb | | | | | | X | | X | | | -| evaluate_app.ipynb | | | | | | | X | X | | | -| evaluate_qualitative.ipynb | | | | | | X | | X | | | -| evaluate_custom.ipynb | | | | | | X | | | X | | -| evaluate_quantitative.ipynb | | | | | | X | | | | X | -| evaluate_safety_risk.ipynb | X | | | | | X | | | | | -| simulate_and_evaluate_endpoint.py | | X | | | X | X | | X | | | +| [Simulate_Adversarial.ipynb](Simulators/Simulate_Adversarial_Data/Simulate_Adversarial.ipynb) | X | X | | | | X | | | | | +| [Simulate_From_Conversation_Starter.ipynb](Simulators/Simulate_Context-Relevant_Data/Simulate_From_Conversation_Starter/Simulate_From_Conversation_Starter.ipynb) | | X | X | | | X | | | | | +| [Simulate_From_Azure_Search_Index.ipynb](Simulators/Simulate_Context-Relevant_Data/Simulate_From_Azure_Search_Index/Simulate_From_Azure_Search_Index.ipynb) | | X | | X | | X | | | | | +| [Simulate_From_Input_Text.ipynb](Simulators/Simulate_Context-Relevant_Data/Simulate_From_Input_Text/Simulate_From_Input_Text.ipynb) | | X | | | X | X | | | | | +| [Evaluate_Base_Model_Endpoint.ipynb](Supported_Evaluation_Targets/Evaluate_Base_Model_Endpoint/Evaluate_Base_Model_Endpoint.ipynb) | | | | | | X | | X | | | +| [Evaluate_App_Endpoint.ipynb](Supported_Evaluation_Targets/Evaluate_App_Endpoint/Evaluate_App_Endpoint.ipynb) | | | | | | | X | X | | | +| [AI_Judge_Evaluators_Quality.ipynb](Supported_Evaluation_Metrics/AI_Judge_Evaluators_Quality/AI_Judge_Evaluators_Quality.ipynb) | | | | | | X | | X | | | +| [Custom_Evaluators.ipynb](Supported_Evaluation_Metrics/Custom_Evaluators/Custom_Evaluators.ipynb) | | | | | | X | | | X | | +| [NLP_Evaluators.ipynb](Supported_Evaluation_Metrics/NLP_Evaluators/NLP_Evaluators.ipynb) | | | | | | X | | | | X | +| [AI_Judge_Evaluators_Safety_Risks.ipynb](Supported_Evaluation_Metrics/AI_Judge_Evaluators_Safety_Risks/AI_Judge_Evaluators_Safety_Risks.ipynb) | X | | | | | X | | | | | +| [Simulate_Evaluate_Groundedness.ipynb](Simulators/Simulate_Evaluate_Groundedness/Simulate_Evaluate_Groundedness.ipynb) | | X | | | X | X | | X | | | diff --git a/scenarios/evaluate/Simulators/README.md b/scenarios/evaluate/Simulators/README.md new file mode 100644 index 00000000..0d99482f --- /dev/null +++ b/scenarios/evaluate/Simulators/README.md @@ -0,0 +1,24 @@ +--- +page_type: sample +languages: +- python +products: +- ai-services +- azure-openai +description: Evaluate. +--- + +## Simulate Evaluation (Test) Data + + + +Relevant, robust evaluation data is essential for effective evaluations. This data can be generated manually, can include production data, or can be assembled with the help of AI.
There are two main types of evaluation data: + +- Bring Your Own Data: You can create and update a “golden dataset” with realistic customer questions or inputs paired with expert answers, ensuring quality for generative AI experiences. This dataset can also include samples from production data, offering a realistic evaluation dataset derived from actual queries your AI application has encountered. +- Simulators: If evaluation data is not available, simulators can play a crucial role in generating evaluation data by creating both topic-related and adversarial queries. + - Context-related simulators test the AI system’s ability to handle relevant interactions within a specific context, ensuring it performs well under typical use scenarios. + + - Adversarial simulators, on the other hand, generate queries designed to challenge the AI system, mimicking potential security threats or attempting to provoke undesirable behaviors. This approach helps identify the model's limitations and prepares it to perform well in unexpected or hostile conditions. + +Azure AI Studio provides tools for both topic-related and adversarial simulations, enabling comprehensive evaluation and enhancing confidence in deployment. For topic-related simulations, Azure AI enables you to simulate relevant conversations using [your data](Simulate_Context-Relevant_Data/Simulate_From_Input_Text/Simulate_From_Input_Text.ipynb), [your Azure Search Index](Simulate_Context-Relevant_Data/Simulate_From_Azure_Search_Index/Simulate_From_Azure_Search_Index.ipynb), or [your pre-defined conversation starters](Simulate_Context-Relevant_Data/Simulate_From_Conversation_Starter/Simulate_From_Conversation_Starter.ipynb). + diff --git a/scenarios/evaluate/simulate_adversarial/README.md b/scenarios/evaluate/Simulators/Simulate_Adversarial_Data/README.md similarity index 86% rename from scenarios/evaluate/simulate_adversarial/README.md rename to scenarios/evaluate/Simulators/Simulate_Adversarial_Data/README.md index ff039bbb..55e6cd6f 100644 --- a/scenarios/evaluate/simulate_adversarial/README.md +++ b/scenarios/evaluate/Simulators/Simulate_Adversarial_Data/README.md @@ -5,18 +5,18 @@ languages: products: - ai-services - azure-openai -description: Simulator which simulates adversarial questions to ask wiki a custom application +description: Simulator which simulates adversarial questions --- -## Adversarial Simulator for Custom Application (askwiki) +## Adversarial Simulator ### Overview -This tutorial provides a step-by-step guide on how to use the adversarial simulator to simulate against a custom application +This tutorial provides a step-by-step guide on how to use the adversarial simulator. ### Objective -The main objective of this tutorial is to help users understand the process of creating and using an adversarial simulator and use it with a custom application +The main objective of this tutorial is to help users understand the process of creating and using an adversarial simulator. By the end of this tutorial, you should be able to: - Use the simulator - Run the simulator to have an adversarial question answering scenario diff --git a/scenarios/evaluate/simulate_adversarial/simulate_adversarial.ipynb b/scenarios/evaluate/Simulators/Simulate_Adversarial_Data/Simulate_Adversarial.ipynb similarity index 100% rename from scenarios/evaluate/simulate_adversarial/simulate_adversarial.ipynb rename to scenarios/evaluate/Simulators/Simulate_Adversarial_Data/Simulate_Adversarial.ipynb diff --git a/scenarios/evaluate/simulate_input_index/README.md
b/scenarios/evaluate/Simulators/Simulate_Context-Relevant_Data/Simulate_From_Azure_Search_Index/README.md similarity index 89% rename from scenarios/evaluate/simulate_input_index/README.md rename to scenarios/evaluate/Simulators/Simulate_Context-Relevant_Data/Simulate_From_Azure_Search_Index/README.md index a459aefc..6a06c407 100644 --- a/scenarios/evaluate/simulate_input_index/README.md +++ b/scenarios/evaluate/Simulators/Simulate_Context-Relevant_Data/Simulate_From_Azure_Search_Index/README.md @@ -4,7 +4,7 @@ languages: - python products: - azure-openai -description: Use the Simulator to generate high-quality query and response interactions with your AI applications from your data using LLMs." +description: Simulate from Azure Search Index --- ## Generate Query and Response from your Azure Search Index diff --git a/scenarios/evaluate/simulate_input_index/simulate_input_index.ipynb b/scenarios/evaluate/Simulators/Simulate_Context-Relevant_Data/Simulate_From_Azure_Search_Index/Simulate_From_Azure_Search_Index.ipynb similarity index 99% rename from scenarios/evaluate/simulate_input_index/simulate_input_index.ipynb rename to scenarios/evaluate/Simulators/Simulate_Context-Relevant_Data/Simulate_From_Azure_Search_Index/Simulate_From_Azure_Search_Index.ipynb index bfdb2d12..21cca016 100644 --- a/scenarios/evaluate/simulate_input_index/simulate_input_index.ipynb +++ b/scenarios/evaluate/Simulators/Simulate_Context-Relevant_Data/Simulate_From_Azure_Search_Index/Simulate_From_Azure_Search_Index.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Simulate Queries and Responses from input text" + "# Simulate Queries and Responses from Azure Search Index" ] }, { @@ -13,7 +13,7 @@ "source": [ "## Objective\n", "\n", - "Use the Simulator to generate high-quality queries and responses from your data using LLMs.\n", + "Use the Simulator to generate high-quality queries and responses from your data in Azure Search using LLMs.\n", "\n", "This tutorial uses the following Azure AI services:\n", "\n", diff --git a/scenarios/evaluate/simulate_conversation_starter/README.md b/scenarios/evaluate/Simulators/Simulate_Context-Relevant_Data/Simulate_From_Conversation_Starter/README.md similarity index 100% rename from scenarios/evaluate/simulate_conversation_starter/README.md rename to scenarios/evaluate/Simulators/Simulate_Context-Relevant_Data/Simulate_From_Conversation_Starter/README.md diff --git a/scenarios/evaluate/simulate_conversation_starter/simulate_conversation_starter.ipynb b/scenarios/evaluate/Simulators/Simulate_Context-Relevant_Data/Simulate_From_Conversation_Starter/Simulate_From_Conversation_Starter.ipynb similarity index 100% rename from scenarios/evaluate/simulate_conversation_starter/simulate_conversation_starter.ipynb rename to scenarios/evaluate/Simulators/Simulate_Context-Relevant_Data/Simulate_From_Conversation_Starter/Simulate_From_Conversation_Starter.ipynb diff --git a/scenarios/evaluate/simulate_input_text/README.md b/scenarios/evaluate/Simulators/Simulate_Context-Relevant_Data/Simulate_From_Input_Text/README.md similarity index 98% rename from scenarios/evaluate/simulate_input_text/README.md rename to scenarios/evaluate/Simulators/Simulate_Context-Relevant_Data/Simulate_From_Input_Text/README.md index 1a13f717..d2e7f8bf 100644 --- a/scenarios/evaluate/simulate_input_text/README.md +++ b/scenarios/evaluate/Simulators/Simulate_Context-Relevant_Data/Simulate_From_Input_Text/README.md @@ -4,7 +4,7 @@ languages: - python products: - azure-openai -description: Use 
the Simulator to generate high-quality query and response interactions with your AI applications from your data using LLMs." +description: Use the Simulator to generate high-quality query and response interactions with your AI applications from your data using LLMs --- ## Generate Query and Response from your data diff --git a/scenarios/evaluate/simulate_input_text/simulate_input_text.ipynb b/scenarios/evaluate/Simulators/Simulate_Context-Relevant_Data/Simulate_From_Input_Text/Simulate_From_Input_Text.ipynb similarity index 100% rename from scenarios/evaluate/simulate_input_text/simulate_input_text.ipynb rename to scenarios/evaluate/Simulators/Simulate_Context-Relevant_Data/Simulate_From_Input_Text/Simulate_From_Input_Text.ipynb diff --git a/scenarios/evaluate/simulate_input_text/user_override.prompty b/scenarios/evaluate/Simulators/Simulate_Context-Relevant_Data/Simulate_From_Input_Text/user_override.prompty similarity index 100% rename from scenarios/evaluate/simulate_input_text/user_override.prompty rename to scenarios/evaluate/Simulators/Simulate_Context-Relevant_Data/Simulate_From_Input_Text/user_override.prompty diff --git a/scenarios/evaluate/simulate_evaluate_groundedness/README.md b/scenarios/evaluate/Simulators/Simulate_Evaluate_Groundedness/README.md similarity index 100% rename from scenarios/evaluate/simulate_evaluate_groundedness/README.md rename to scenarios/evaluate/Simulators/Simulate_Evaluate_Groundedness/README.md diff --git a/scenarios/evaluate/simulate_evaluate_groundedness/simulate_evaluate_groundedness.ipynb b/scenarios/evaluate/Simulators/Simulate_Evaluate_Groundedness/Simulate_Evaluate_Groundedness.ipynb similarity index 100% rename from scenarios/evaluate/simulate_evaluate_groundedness/simulate_evaluate_groundedness.ipynb rename to scenarios/evaluate/Simulators/Simulate_Evaluate_Groundedness/Simulate_Evaluate_Groundedness.ipynb diff --git a/scenarios/evaluate/evaluate_qualitative_metrics/evaluate_qualitative_metrics.ipynb b/scenarios/evaluate/Supported_Evaluation_Metrics/AI_Judge_Evaluators_Quality/AI_Judge_Evaluators_Quality.ipynb similarity index 97% rename from scenarios/evaluate/evaluate_qualitative_metrics/evaluate_qualitative_metrics.ipynb rename to scenarios/evaluate/Supported_Evaluation_Metrics/AI_Judge_Evaluators_Quality/AI_Judge_Evaluators_Quality.ipynb index ce75eb58..f950d1fb 100644 --- a/scenarios/evaluate/evaluate_qualitative_metrics/evaluate_qualitative_metrics.ipynb +++ b/scenarios/evaluate/Supported_Evaluation_Metrics/AI_Judge_Evaluators_Quality/AI_Judge_Evaluators_Quality.ipynb @@ -4,13 +4,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Evaluate model endpoints using Prompt Flow Eval APIs\n", + "# Evaluate using AI as Judge Quality Evaluators with Azure AI Evaluation SDK\n", "\n", "## Objective\n", "\n", "This tutorial provides a step-by-step guide on how to evaluate prompts against variety of model endpoints deployed on Azure AI Platform or non Azure AI platforms. \n", "\n", - "This guide uses Python Class as an application target which is passed to Evaluate API provided by PromptFlow SDK to evaluate results generated by LLM models against provided prompts. \n", + "This guide uses a Python class as an application target, which is passed to the Evaluate API provided by the Azure AI Evaluation SDK to evaluate results generated by LLM models against provided prompts. 
\n", "\n", "This tutorial uses the following Azure AI services:\n", "\n", diff --git a/scenarios/evaluate/evaluate_qualitative_metrics/README.md b/scenarios/evaluate/Supported_Evaluation_Metrics/AI_Judge_Evaluators_Quality/README.md similarity index 51% rename from scenarios/evaluate/evaluate_qualitative_metrics/README.md rename to scenarios/evaluate/Supported_Evaluation_Metrics/AI_Judge_Evaluators_Quality/README.md index def15bfc..9ea9fbb3 100644 --- a/scenarios/evaluate/evaluate_qualitative_metrics/README.md +++ b/scenarios/evaluate/Supported_Evaluation_Metrics/AI_Judge_Evaluators_Quality/README.md @@ -5,18 +5,18 @@ languages: products: - ai-services - azure-openai -description: Evaluating qualitative metrics +description: Evaluate using AI as Judge Quality Evaluators with Azure AI Evaluation SDK --- -## Evaluating qualitative metrics +## Evaluate using AI as Judge Quality Evaluators with Azure AI Evaluation SDK ### Overview -This tutorial provides a step-by-step guide on how to evaluate prompts against variety of model endpoint using qualitative metrics. +This tutorial provides a step-by-step guide on how to evaluate prompts against variety of model endpoint using AI as Judge evaluators. ### Objective -The main objective of this tutorial is to help users understand the process of evaluating model endpoints using qualitative metrics. By the end of this tutorial, you should be able to: +The main objective of this tutorial is to help users understand the process of evaluating model endpoints using AI as Judge quality evaluators. By the end of this tutorial, you should be able to: - Learn about evaluations - Evaluate prompt against model endpoint of your choice. diff --git a/scenarios/evaluate/evaluate_endpoints/data.jsonl b/scenarios/evaluate/Supported_Evaluation_Metrics/AI_Judge_Evaluators_Quality/data.jsonl similarity index 100% rename from scenarios/evaluate/evaluate_endpoints/data.jsonl rename to scenarios/evaluate/Supported_Evaluation_Metrics/AI_Judge_Evaluators_Quality/data.jsonl diff --git a/scenarios/evaluate/evaluate_safety_risk/evaluate_safety_risk.ipynb b/scenarios/evaluate/Supported_Evaluation_Metrics/AI_Judge_Evaluators_Safety_Risks/AI_Judge_Evaluators_Safety_Risks.ipynb similarity index 100% rename from scenarios/evaluate/evaluate_safety_risk/evaluate_safety_risk.ipynb rename to scenarios/evaluate/Supported_Evaluation_Metrics/AI_Judge_Evaluators_Safety_Risks/AI_Judge_Evaluators_Safety_Risks.ipynb diff --git a/scenarios/evaluate/evaluate_safety_risk/README.md b/scenarios/evaluate/Supported_Evaluation_Metrics/AI_Judge_Evaluators_Safety_Risks/README.md similarity index 100% rename from scenarios/evaluate/evaluate_safety_risk/README.md rename to scenarios/evaluate/Supported_Evaluation_Metrics/AI_Judge_Evaluators_Safety_Risks/README.md diff --git a/scenarios/evaluate/Supported_Evaluation_Metrics/AutomatedEvaluationAzureAIFoundry.jpg b/scenarios/evaluate/Supported_Evaluation_Metrics/AutomatedEvaluationAzureAIFoundry.jpg new file mode 100644 index 00000000..2aa7e1f6 Binary files /dev/null and b/scenarios/evaluate/Supported_Evaluation_Metrics/AutomatedEvaluationAzureAIFoundry.jpg differ diff --git a/scenarios/evaluate/evaluate_custom/evaluate_custom.ipynb b/scenarios/evaluate/Supported_Evaluation_Metrics/Custom_Evaluators/Custom_Evaluators.ipynb similarity index 100% rename from scenarios/evaluate/evaluate_custom/evaluate_custom.ipynb rename to scenarios/evaluate/Supported_Evaluation_Metrics/Custom_Evaluators/Custom_Evaluators.ipynb diff --git 
a/scenarios/evaluate/evaluate_custom/blocklist.py b/scenarios/evaluate/Supported_Evaluation_Metrics/Custom_Evaluators/blocklist.py similarity index 100% rename from scenarios/evaluate/evaluate_custom/blocklist.py rename to scenarios/evaluate/Supported_Evaluation_Metrics/Custom_Evaluators/blocklist.py diff --git a/scenarios/evaluate/evaluate_app/data.jsonl b/scenarios/evaluate/Supported_Evaluation_Metrics/Custom_Evaluators/data.jsonl similarity index 100% rename from scenarios/evaluate/evaluate_app/data.jsonl rename to scenarios/evaluate/Supported_Evaluation_Metrics/Custom_Evaluators/data.jsonl diff --git a/scenarios/evaluate/evaluate_nlp_metrics/evaluate_nlp.ipynb b/scenarios/evaluate/Supported_Evaluation_Metrics/NLP_Evaluators/NLP_Evaluators.ipynb similarity index 100% rename from scenarios/evaluate/evaluate_nlp_metrics/evaluate_nlp.ipynb rename to scenarios/evaluate/Supported_Evaluation_Metrics/NLP_Evaluators/NLP_Evaluators.ipynb diff --git a/scenarios/evaluate/evaluate_nlp_metrics/README.md b/scenarios/evaluate/Supported_Evaluation_Metrics/NLP_Evaluators/README.md similarity index 100% rename from scenarios/evaluate/evaluate_nlp_metrics/README.md rename to scenarios/evaluate/Supported_Evaluation_Metrics/NLP_Evaluators/README.md diff --git a/scenarios/evaluate/evaluate_nlp_metrics/data.jsonl b/scenarios/evaluate/Supported_Evaluation_Metrics/NLP_Evaluators/data.jsonl similarity index 100% rename from scenarios/evaluate/evaluate_nlp_metrics/data.jsonl rename to scenarios/evaluate/Supported_Evaluation_Metrics/NLP_Evaluators/data.jsonl diff --git a/scenarios/evaluate/Supported_Evaluation_Metrics/README.md b/scenarios/evaluate/Supported_Evaluation_Metrics/README.md new file mode 100644 index 00000000..8b9e3dd4 --- /dev/null +++ b/scenarios/evaluate/Supported_Evaluation_Metrics/README.md @@ -0,0 +1,29 @@ +--- +page_type: sample +languages: +- python +products: +- ai-services +- azure-openai +description: Evaluate. +--- + +## Supported Evaluators + + +The Azure AI Evaluation SDK supports a group of evaluators that measure different aspects of a response’s alignment with expectations. They can be customized, versioned, and shared across an organization, ensuring consistent evaluation metrics and parameters across various projects. The choice of evaluators will depend on the specific goals of the evaluation, such as assessing quality, safety, or custom requirements tailored to a particular use case. + +Currently, the Azure AI Evaluation SDK and the Studio UI support three main categories of evaluators: + +![Types of Evaluators](./AutomatedEvaluationAzureAIFoundry.jpg) + +* [Risk and safety evaluators](AI_Judge_Evaluators_Safety_Risks/): Evaluating potential risks associated with AI-generated content is essential for safeguarding against content risks with varying degrees of severity. This includes evaluating an AI system's predisposition towards generating harmful or inappropriate content. + +* Generation quality evaluators: This involves assessing metrics such as the groundedness, coherence, and relevance of generated content using robust [AI-assisted](AI_Judge_Evaluators_Quality/) and [NLP](NLP_Evaluators/) metrics. + + +* [Custom evaluators](Custom_Evaluators/): Tailored evaluation metrics can be designed to meet specific needs and goals, providing flexibility and precision in assessing unique aspects of AI-generated content. 
These custom evaluators allow for more detailed and specific analyses, addressing particular concerns or requirements that standard metrics may not cover. + + + +You can run evaluators locally or [remotely](../Supported_Evaluation_Targets/Evaluate_On_Cloud/Evaluate_On_Cloud.ipynb), log results in the cloud using the evaluation SDK, or integrate them into automated evaluations within the Azure AI Studio UI. diff --git a/scenarios/evaluate/evaluate_app/evaluate_app.ipynb b/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_App_Endpoint/Evaluate_App_Endpoint.ipynb similarity index 96% rename from scenarios/evaluate/evaluate_app/evaluate_app.ipynb rename to scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_App_Endpoint/Evaluate_App_Endpoint.ipynb index d0b442e1..494ab159 100644 --- a/scenarios/evaluate/evaluate_app/evaluate_app.ipynb +++ b/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_App_Endpoint/Evaluate_App_Endpoint.ipynb @@ -5,18 +5,7 @@ "id": "2e932e4c-5d55-461e-a313-3a087d8983b5", "metadata": {}, "source": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "# Evaluate app using Azure AI Evaluation APIs\n" + "# Evaluate application endpoint using Azure AI Evaluation APIs\n" ] }, { @@ -25,7 +14,7 @@ "metadata": {}, "source": [ "## Objective\n", - "In this notebook we will demonstrate how to use the target functions with the standard evaluators to evaluate an app.\n", + "In this notebook we will demonstrate how to use the target functions with the standard evaluators to evaluate an application endpoint.\n", "\n", "This tutorial provides a step-by-step guide on how to evaluate a function\n", "\n", diff --git a/scenarios/evaluate/evaluate_app/README.md b/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_App_Endpoint/README.md similarity index 85% rename from scenarios/evaluate/evaluate_app/README.md rename to scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_App_Endpoint/README.md index 60a3c578..d4ccd7e9 100644 --- a/scenarios/evaluate/evaluate_app/README.md +++ b/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_App_Endpoint/README.md @@ -5,10 +5,10 @@ languages: products: - ai-services - azure-openai -description: Evaluating an endpoint +description: Evaluating an application endpoint --- -## Evaluating an endpoint +## Evaluating an application endpoint ### Overview diff --git a/scenarios/evaluate/evaluate_app/askwiki.py b/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_App_Endpoint/askwiki.py similarity index 100% rename from scenarios/evaluate/evaluate_app/askwiki.py rename to scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_App_Endpoint/askwiki.py diff --git a/scenarios/evaluate/evaluate_custom/data.jsonl b/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_App_Endpoint/data.jsonl similarity index 100% rename from scenarios/evaluate/evaluate_custom/data.jsonl rename to scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_App_Endpoint/data.jsonl diff --git a/scenarios/evaluate/evaluate_app/system-message.jinja2 b/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_App_Endpoint/system-message.jinja2 similarity index 100% rename from scenarios/evaluate/evaluate_app/system-message.jinja2 rename to scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_App_Endpoint/system-message.jinja2 diff --git a/scenarios/evaluate/evaluate_endpoints/evaluate_endpoints.ipynb b/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_Base_Model_Endpoint/Evaluate_Base_Model_Endpoint.ipynb 
similarity index 99% rename from scenarios/evaluate/evaluate_endpoints/evaluate_endpoints.ipynb rename to scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_Base_Model_Endpoint/Evaluate_Base_Model_Endpoint.ipynb index 550c2c2b..10b78aa2 100644 --- a/scenarios/evaluate/evaluate_endpoints/evaluate_endpoints.ipynb +++ b/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_Base_Model_Endpoint/Evaluate_Base_Model_Endpoint.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Evaluate model endpoints using Azure AI Evaluation APIs\n", + "# Evaluate Base Model Endpoints using Azure AI Evaluation APIs\n", "\n", "## Objective\n", "\n", diff --git a/scenarios/evaluate/evaluate_endpoints/README.md b/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_Base_Model_Endpoint/README.md similarity index 87% rename from scenarios/evaluate/evaluate_endpoints/README.md rename to scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_Base_Model_Endpoint/README.md index 438b54c9..1c135a50 100644 --- a/scenarios/evaluate/evaluate_endpoints/README.md +++ b/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_Base_Model_Endpoint/README.md @@ -5,10 +5,10 @@ languages: products: - ai-services - azure-openai -description: Evaluating model endpoints +description: Evaluating base model endpoints --- -## Evaluating model endpoints +## Evaluating base model endpoints ### Overview diff --git a/scenarios/evaluate/evaluate_qualitative_metrics/data.jsonl b/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_Base_Model_Endpoint/data.jsonl similarity index 100% rename from scenarios/evaluate/evaluate_qualitative_metrics/data.jsonl rename to scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_Base_Model_Endpoint/data.jsonl diff --git a/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_On_Cloud/Evaluate_On_Cloud.ipynb b/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_On_Cloud/Evaluate_On_Cloud.ipynb new file mode 100644 index 00000000..b6b5f351 --- /dev/null +++ b/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_On_Cloud/Evaluate_On_Cloud.ipynb @@ -0,0 +1,223 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Cloud evaluation: Evaluating AI app data remotely in the cloud \n", + "\n", + "## Objective\n", + "\n", + "This tutorial provides a step-by-step guide on how to evaluate data generated by AI applications or LLMs remotely in the cloud. \n", + "\n", + "This tutorial uses the following Azure AI services:\n", + "\n", + "- [Azure AI Safety Evaluation](https://aka.ms/azureaistudiosafetyeval)\n", + "- [azure-ai-evaluation](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk)\n", + "\n", + "## Time\n", + "\n", + "You should expect to spend 20 minutes running this sample. \n", + "\n", + "## About this example\n", + "\n", + "This example demonstrates the cloud evaluation of query and response pairs that were generated by an AI app or an LLM. It is important to have access to Azure OpenAI credentials and an Azure AI project. **To create data to use in your own evaluation, learn more [here](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/simulator-interaction-data)** .
This example demonstrates: \n", + "\n", + "- Single-instance, triggered cloud evaluation on a test dataset (to be used for pre-deployment evaluation of an AI application).\n", + "\n", + "## Before you begin\n", + "### Prerequisite\n", + "- Have an Azure OpenAI Deployment with a GPT model supporting `chat completion`, for example `gpt-4`.\n", + "- Make sure you're first logged into your Azure subscription by running `az login`.\n", + "- You have some test data you want to evaluate, which includes the user queries and responses (and perhaps context, or ground truth) from your AI applications. See [data requirements for our built-in evaluators](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk#data-requirements-for-built-in-evaluators). Alternatively, if you want to simulate data against your application endpoints using the Azure AI Evaluation SDK, see our samples on simulation. \n", + "\n", + "### Installation\n", + "\n", + "Install the following packages required to execute this notebook. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install -U azure-identity\n", + "%pip install -U azure-ai-project\n", + "%pip install -U azure-ai-evaluation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azure.ai.project import AIProjectClient\n", + "from azure.identity import DefaultAzureCredential\n", + "from azure.ai.project.models import (\n", + " Evaluation,\n", + " Dataset,\n", + " EvaluatorConfiguration,\n", + " ConnectionType,\n", + ")\n", + "from azure.ai.evaluation import F1ScoreEvaluator, ViolenceEvaluator" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Connect to your Azure Open AI deployment\n", + "To evaluate your LLM-generated data remotely in the cloud, we must connect to your Azure Open AI deployment. This deployment must be a GPT model which supports `chat completion`, such as `gpt-4`. To see the proper value for `conn_str`, navigate to the connection string at the \"Project Overview\" page for your Azure AI project. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "project_client = AIProjectClient.from_connection_string(\n", + " credential=DefaultAzureCredential(),\n", + " conn_str=\"\", # At the moment, it should be in the format \".api.azureml.ms;;;\" Ex: eastus2.api.azureml.ms;xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxx;rg-sample;sample-project-eastus2\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Connect to your AOAI resource, you must use an AOAI GPT model\n", + "deployment_name = \"gpt-4\"\n", + "api_version = \"2024-06-01\"\n", + "default_connection = project_client.connections.get_default(connection_type=ConnectionType.AZURE_OPEN_AI)\n", + "model_config = default_connection.to_evaluator_model_config(deployment_name=deployment_name, api_version=api_version)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data\n", + "The following code demonstrates how to upload the data for evaluation to your Azure AI project. Below we use `evaluate_test_data.jsonl` which exemplifies LLM-generated data in the query-response format expected by the Azure AI Evaluation SDK. 
For your use case, you should upload data in the same format, which can be generated using the [`Simulator`](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/simulator-interaction-data) from Azure AI Evaluation SDK. \n", + "\n", + "Alternatively, if you already have an existing dataset for evaluation, you can use that by finding the link to your dataset in your [registry](https://ml.azure.com/registries) or find the dataset ID." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# # Upload data for evaluation\n", + "data_id, _ = project_client.upload_file(\"./evaluate_test_data.jsonl\")\n", + "# data_id = \"azureml://registries//data//versions/\"\n", + "# To use an existing dataset, replace the above line with the following line\n", + "# data_id = \"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Configure Evaluators to Run\n", + "The code below demonstrates how to configure the evaluators you want to run. In this example, we use the `F1ScoreEvaluator`, `RelevanceEvaluator` and the `ViolenceEvaluator`, but all evaluators supported by [Azure AI Evaluation](https://learn.microsoft.com/en-us/azure/ai-studio/concepts/evaluation-metrics-built-in?tabs=warning) are supported by cloud evaluation and can be configured here. You can either import the classes from the SDK and reference them with the `.id` property, or you can find the fully formed `id` of the evaluator in the AI Studio registry of evaluators, and use it here. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# id for each evaluator can be found in your AI Studio registry - please see documentation for more information\n", + "# init_params is the configuration for the model to use to perform the evaluation\n", + "# data_mapping is used to map the output columns of your query to the names required by the evaluator\n", + "evaluators = {\n", + " \"f1_score\": EvaluatorConfiguration(\n", + " id=F1ScoreEvaluator.id,\n", + " ),\n", + " \"relevance\": EvaluatorConfiguration(\n", + " id=\"azureml://registries/azureml-staging/models/Relevance-Evaluator/versions/4\",\n", + " init_params={\"model_config\": model_config},\n", + " data_mapping={\"query\": \"${data.Input}\", \"response\": \"${data.Output}\"},\n", + " ),\n", + " \"violence\": EvaluatorConfiguration(\n", + " id=ViolenceEvaluator.id,\n", + " init_params={\"azure_ai_project\": project_client.scope},\n", + " data_mapping={\"query\": \"${data.Input}\", \"response\": \"${data.Output}\"},\n", + " ),\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create cloud evaluation\n", + "Below we demonstrate how to trigger a single-instance Cloud Evaluation remotely on a test dataset. This can be used for pre-deployment testing of your AI application. \n", + " \n", + "Here we pass in the `data_id` we would like to use for the evaluation. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "evaluation = Evaluation(\n", + " display_name=\"Cloud Evaluation\",\n", + " description=\"Cloud Evaluation of dataset\",\n", + " data=Dataset(id=data_id),\n", + " evaluators=evaluators,\n", + ")\n", + "\n", + "# Create evaluation\n", + "evaluation_response = project_client.evaluations.create(\n", + " evaluation=evaluation,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Get evaluation\n", + "get_evaluation_response = project_client.evaluations.get(evaluation_response.id)\n", + "\n", + "print(\"----------------------------------------------------------------\")\n", + "print(\"Created evaluation, evaluation ID: \", get_evaluation_response.id)\n", + "print(\"Evaluation status: \", get_evaluation_response.status)\n", + "print(\"AI Foundry Portal URI: \", get_evaluation_response.properties[\"AiStudioEvaluationUri\"])\n", + "print(\"----------------------------------------------------------------\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "azureai-samples313", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_On_Cloud/README.md b/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_On_Cloud/README.md new file mode 100644 index 00000000..fe017505 --- /dev/null +++ b/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_On_Cloud/README.md @@ -0,0 +1,40 @@ +--- +page_type: sample +languages: +- python +products: +- ai-services +- azure-openai +description: Cloud evaluation +--- + +## Evaluating in the cloud + +### Overview + +This tutorial provides a step-by-step guide on how to evaluate generative AI applications or LLMs remotely using a triggered cloud evaluation. + +### Objective + +The main objective of this tutorial is to help users understand the process of evaluating model remotely in the cloud by triggering an evaluation. This type of evaluation can be used for pre-deployment testing. By the end of this tutorial, you should be able to: + + - Learn about evaluations + - Evaluate LLM using various evaluators from Azure AI Evaluations SDK remotely in the cloud. + +### Note +Cloud evaluations do not support `ContentSafetyEvaluator`, and `QAEvaluator`. 
+ +#### Region Support for Cloud evaluations + +| Region | Hate and Unfairness, Sexual, Violent, Self-Harm, XPIA | Groundedness Pro | Protected Material | +| - | - | - | - | +| UK South | Will be deprecated 12/1/24 | no | no | +| East US 2 | yes | yes | yes | +| Sweden Central | yes | yes | no | +| France Central | yes | no | no | +| Switzerland West | yes | no | no | + +### Programming Languages + - Python + +### Estimated Runtime: 20 mins \ No newline at end of file diff --git a/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_On_Cloud/evaluate_test_data.jsonl b/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_On_Cloud/evaluate_test_data.jsonl new file mode 100644 index 00000000..ddca8949 --- /dev/null +++ b/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_On_Cloud/evaluate_test_data.jsonl @@ -0,0 +1,3 @@ +{"query":"How do you create a run?","context":"AML API only","response":"To create a run using the Azure Machine Learning API, you first need to create an Experiment. Once you have an experiment, you can create a Run object that is associated with that experiment. Here is some Python code that demonstrates this process:\n\n```\nfrom azureml.core import Experiment, Run\nfrom azureml.core.workspace import Workspace\n\n# Define workspace and experiment\nws = Workspace.from_config()\nexp = Experiment(workspace=ws, name='my_experiment')\n\n# Create a new run\nrun = exp.start_logging()\n```\n\nIn this code, the `from_config()` method reads the configuration file that you created when you set up your Azure Machine Learning workspace. The `Experiment` constructor creates an Experiment object that is associated with your workspace, and the `start_logging()` method creates a new Run object that is associated with the Experiment. Now you can use the `run` object to log metrics, upload files, and track other information related to your machine learning experiment.", "ground_truth":"Paris is the capital of France."} +{"query":"How do you log a model?","context":"Logging can be done using any OSS Sdk","response":"There are a few ways to log models in Azure Machine Learning. \n\nOne way is to use the `register_model()` method of the `Run` object. The `register_model()` method logs a model file in the Azure Machine Learning service workspace and makes it available for deployment. Here's an example:\n\n```python\nfrom azureml.core import Model\n\nmodel_path = '.\/outputs\/my_model.pkl'\nmodel = Model.register(workspace=ws, model_path=model_path, model_name='my_model')\n```\n\nThis code registers the model file located at `model_path` to the Azure Machine Learning service workspace with the name `my_model`. \n\nAnother way to log a model is to save it as an output of a `Run`. If your model generation code is part of a script or Jupyter notebook that runs as an Azure Machine Learning experiment, you can save the model file as an output of the `Run` object. 
Here's an example:\n\n```python\nfrom sklearn.linear_model import LogisticRegression\nfrom azureml.core.run import Run\n\n# Initialize a run object\nrun = Run.get_context()\n\n# Train your model\nX_train, y_train = ...\nlog_reg = LogisticRegression().fit(X_train, y_train)\n\n# Save the model to the Run object's outputs directory\nmodel_path = 'outputs\/model.pkl'\njoblib.dump(value=log_reg, filename=model_path)\n\n# Log the model as a run artifact\nrun.upload_file(name=model_path, path_or_stream=model_path)\n```\n\nIn this code, `Run.get_context()` retrieves the current run context object, which you can use to track metadata and metrics for the run. After training your model, you can use `joblib.dump()` to save the model to a file, and then log the file as an artifact of the run using `run.upload_file()`.","ground_truth":"Paris is the capital of France."} {"query":"What is the capital of France?","context":"France is in Europe","response":"Paris is the capital of France.", "ground_truth":"Paris is the capital of France."} diff --git a/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_Online/Evaluate_Online.ipynb b/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_Online/Evaluate_Online.ipynb new file mode 100644 index 00000000..995f64f1 --- /dev/null +++ b/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_Online/Evaluate_Online.ipynb @@ -0,0 +1,248 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Online Evaluations: Evaluating in the Cloud on a Schedule\n", + "\n", + "## Objective\n", + "\n", + "This tutorial provides a step-by-step guide on how to evaluate data generated by LLMs online on a schedule. \n", + "\n", + "This tutorial uses the following Azure AI services:\n", + "\n", + "- [Azure AI Safety Evaluation](https://aka.ms/azureaistudiosafetyeval)\n", + "- [azure-ai-evaluation](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk)\n", + "\n", + "## Time\n", + "\n", + "You should expect to spend 30 minutes running this sample. \n", + "\n", + "## About this example\n", + "\n", + "This example demonstrates the online evaluation of an LLM. It is important to have access to Azure OpenAI credentials and an Azure AI project. This example demonstrates: \n", + "\n", + "- Recurring, Online Evaluation (to be used to monitor LLMs once they are deployed)\n", + "\n", + "## Before you begin\n", + "### Prerequisite\n", + "- Configure resources to support Online Evaluation as per [Online Evaluation documentation](https://aka.ms/GenAIMonitoringDoc)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install -U azure-identity\n", + "%pip install -U azure-ai-project\n", + "%pip install -U azure-ai-evaluation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azure.ai.project import AIProjectClient\n", + "from azure.identity import DefaultAzureCredential\n", + "from azure.ai.project.models import (\n", + " ApplicationInsightsConfiguration,\n", + " EvaluatorConfiguration,\n", + " ConnectionType,\n", + " EvaluationSchedule,\n", + " RecurrenceTrigger,\n", + ")\n", + "from azure.ai.evaluation import F1ScoreEvaluator, ViolenceEvaluator" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Connect to your Azure Open AI deployment\n", + "To evaluate your LLM-generated data remotely in the cloud, we must connect to your Azure Open AI deployment. 
This deployment must be a GPT model which supports `chat completion`, such as `gpt-4`. To see the proper value for `conn_str`, navigate to the connection string at the \"Project Overview\" page for your Azure AI project. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "project_client = AIProjectClient.from_connection_string(\n", + " credential=DefaultAzureCredential(),\n", + " conn_str=\"\", # At the moment, it should be in the format \".api.azureml.ms;;;\" Ex: eastus2.api.azureml.ms;xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxx;rg-sample;sample-project-eastus2\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Please see [Online Evaluation documentation](https://aka.ms/GenAIMonitoringDoc) for configuration of Application Insights. `service_name` is a unique name you provide to define your Generative AI application and identify it within your Application Insights resource. This property will be logged in the `traces` table in Application Insights and can be found in the `customDimensions[\"service.name\"]` field. `evaluation_name` is a unique name you provide for your Online Evaluation schedule. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Your Application Insights resource ID\n", + "# At the moment, it should be something in the format \"/subscriptions//resourceGroups//providers/Microsoft.Insights/components/\"\"\n", + "app_insights_resource_id = \"\"\n", + "\n", + "# Name of your generative AI application (will be available in trace data in Application Insights)\n", + "service_name = \"\"\n", + "\n", + "# Name of your online evaluation schedule\n", + "evaluation_name = \"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Below is the Kusto Query Language (KQL) query to query data from Application Insights resource. This query is compatible with data logged by the Azure AI Inferencing Tracing SDK (linked in [documentation](https://aka.ms/GenAIMonitoringDoc)). You can modify it depending on your data schema. The KQL query must output several columns: `operation_ID`, `operation_ParentID`, and `gen_ai_response_id`. You can choose which other columns to output as required by the evaluators you are using." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "kusto_query = 'let gen_ai_spans=(dependencies | where isnotnull(customDimensions[\"gen_ai.system\"]) | extend response_id = tostring(customDimensions[\"gen_ai.response.id\"]) | project id, operation_Id, operation_ParentId, timestamp, response_id); let gen_ai_events=(traces | where message in (\"gen_ai.choice\", \"gen_ai.user.message\", \"gen_ai.system.message\") or tostring(customDimensions[\"event.name\"]) in (\"gen_ai.choice\", \"gen_ai.user.message\", \"gen_ai.system.message\") | project id= operation_ParentId, operation_Id, operation_ParentId, user_input = iff(message == \"gen_ai.user.message\" or tostring(customDimensions[\"event.name\"]) == \"gen_ai.user.message\", parse_json(iff(message == \"gen_ai.user.message\", tostring(customDimensions[\"gen_ai.event.content\"]), message)).content, \"\"), system = iff(message == \"gen_ai.system.message\" or tostring(customDimensions[\"event.name\"]) == \"gen_ai.system.message\", parse_json(iff(message == \"gen_ai.system.message\", tostring(customDimensions[\"gen_ai.event.content\"]), message)).content, \"\"), llm_response = iff(message == \"gen_ai.choice\", parse_json(tostring(parse_json(tostring(customDimensions[\"gen_ai.event.content\"])).message)).content, iff(tostring(customDimensions[\"event.name\"]) == \"gen_ai.choice\", parse_json(parse_json(message).message).content, \"\")) | summarize operation_ParentId = any(operation_ParentId), Input = maxif(user_input, user_input != \"\"), System = maxif(system, system != \"\"), Output = maxif(llm_response, llm_response != \"\") by operation_Id, id); gen_ai_spans | join kind=inner (gen_ai_events) on id, operation_Id | project Input, System, Output, operation_Id, operation_ParentId, gen_ai_response_id = response_id'\n", + "\n", + "# AzureMSIClientId is the clientID of the User-assigned managed identity created during set-up - see documentation for how to find it\n", + "properties = {\"AzureMSIClientId\": \"your_client_id\"}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Connect to your Application Insights resource\n", + "app_insights_config = ApplicationInsightsConfiguration(\n", + " resource_id=app_insights_resource_id, query=kusto_query, service_name=service_name\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Connect to your AOAI resource, you must use an AOAI GPT model\n", + "deployment_name = \"gpt-4\"\n", + "api_version = \"2024-06-01\"\n", + "default_connection = project_client.connections.get_default(connection_type=ConnectionType.AZURE_OPEN_AI)\n", + "model_config = default_connection.to_evaluator_model_config(deployment_name=deployment_name, api_version=api_version)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Configure Evaluators to Run\n", + "The code below demonstrates how to configure the evaluators you want to run. In this example, we use the `F1ScoreEvaluator`, `RelevanceEvaluator` and the `ViolenceEvaluator`, but all evaluators supported by [Azure AI Evaluation](https://learn.microsoft.com/en-us/azure/ai-studio/concepts/evaluation-metrics-built-in?tabs=warning) are supported by Online Evaluation and can be configured here. 
You can either import the classes from the SDK and reference them with the `.id` property, or you can find the fully formed `id` of the evaluator in the AI Studio registry of evaluators, and use it here. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# id for each evaluator can be found in your AI Studio registry - please see documentation for more information\n", + "# init_params is the configuration for the model to use to perform the evaluation\n", + "# data_mapping is used to map the output columns of your query to the names required by the evaluator\n", + "evaluators = {\n", + " \"f1_score\": EvaluatorConfiguration(\n", + " id=F1ScoreEvaluator.id,\n", + " ),\n", + " \"relevance\": EvaluatorConfiguration(\n", + " id=\"azureml://registries/azureml-staging/models/Relevance-Evaluator/versions/4\",\n", + " init_params={\"model_config\": model_config},\n", + " data_mapping={\"query\": \"${data.Input}\", \"response\": \"${data.Output}\"},\n", + " ),\n", + " \"violence\": EvaluatorConfiguration(\n", + " id=ViolenceEvaluator.id,\n", + " init_params={\"azure_ai_project\": project_client.scope},\n", + " data_mapping={\"query\": \"${data.Input}\", \"response\": \"${data.Output}\"},\n", + " ),\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Evaluate in the Cloud on a Schedule with Online Evaluation\n", + "\n", + "You can configure the `RecurrenceTrigger` based on the class definition [here](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.entities.recurrencetrigger?view=azure-python)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Frequency to run the schedule\n", + "recurrence_trigger = RecurrenceTrigger(frequency=\"day\", interval=1)\n", + "\n", + "# Configure the online evaluation schedule\n", + "evaluation_schedule = EvaluationSchedule(\n", + " data=app_insights_config,\n", + " evaluators=evaluators,\n", + " trigger=recurrence_trigger,\n", + " description=f\"{service_name} evaluation schedule\",\n", + " properties=properties,\n", + ")\n", + "\n", + "# Create the online evaluation schedule\n", + "created_evaluation_schedule = project_client.evaluations.create_or_replace_schedule(service_name, evaluation_schedule)\n", + "print(\n", + " f\"Successfully submitted the online evaluation schedule creation request - {created_evaluation_schedule.name}, currently in {created_evaluation_schedule.provisioning_state} state.\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Next steps \n", + "\n", + "Navigate to the \"Tracing\" tab in [Azure AI Studio](https://ai.azure.com/) to view your logged trace data alongside the evaluations produced by the Online Evaluation schedule. You can use the reference link provided in the \"Tracing\" tab to navigate to a comprehensive workbook in Application Insights for more details on how your application is performing. 
" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "azureai-samples313", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_Online/README.md b/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_Online/README.md new file mode 100644 index 00000000..f7af45a1 --- /dev/null +++ b/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_Online/README.md @@ -0,0 +1,41 @@ +--- +page_type: sample +languages: +- python +products: +- ai-services +- azure-openai +description: Evaluating online +--- + +## Evaluating in the cloud on a schedule + +### Overview + +This tutorial provides a step-by-step guide on how to evaluate generative AI or LLMs on a scheduling using online evaluation. + +### Objective + +The main objective of this tutorial is to help users understand the process of evaluating model remotely in the cloud by triggering an evaluation. This type of evaluation can be used for monitoring LLMs and Generative AI that has been deployed. By the end of this tutorial, you should be able to: + + - Learn about evaluations + - Evaluate LLM using various evaluators from Azure AI Evaluations SDK online in the cloud. + +### Note +All evaluators supported by [Azure AI Evaluation](https://learn.microsoft.com/en-us/azure/ai-studio/concepts/evaluation-metrics-built-in?tabs=warning) are supported by Online Evaluation. For updated documentation, please see [Online Evaluation documentation](https://aka.ms/GenAIMonitoringDoc). + +#### Region Support for Evaluations + +| Region | Hate and Unfairness, Sexual, Violent, Self-Harm, XPIA | Groundedness Pro | Protected Material | +| - | - | - | - | +| UK South | Will be deprecated 12/1/24 | no | no | +| East US 2 | yes | yes | yes | +| Sweden Central | yes | yes | no | +| US North Central | yes | no | no | +| France Central | yes | no | no | +| Switzerland West | yes | no | no | + +### Programming Languages + - Python + +### Estimated Runtime: 30 mins \ No newline at end of file diff --git a/scenarios/evaluate/Supported_Evaluation_Targets/README.md b/scenarios/evaluate/Supported_Evaluation_Targets/README.md new file mode 100644 index 00000000..ff05e8e8 --- /dev/null +++ b/scenarios/evaluate/Supported_Evaluation_Targets/README.md @@ -0,0 +1,39 @@ +--- +page_type: sample +languages: +- python +products: +- ai-services +- azure-openai +description: Evaluate. +--- + +## Evaluation Target + +Evaluation plays a critical role at three pivotal stages: + +* Base Model Evaluation for Model Selection: Initial assessments to identify the most promising base models. + +* Pre-Production AI Application Evaluation: Comprehensive testing to ensure models perform reliably before deployment. + +* Post-Production AI Application Evaluation (Monitoring): Continuous monitoring to maintain performance and adapt to new challenges. + + +### Base Model Evaluation for Model Selection + +The first stage of the AI lifecycle involves selecting an appropriate base model. With over 1,600 base models available in the Azure AI Studio's model catalog, it's crucial to choose one that best aligns with your specific use case. 
This involves comparing models based on quality and safety metrics like groundedness, relevance, and safety scores, as well as considering cost, computational efficiency, and latency. + +Follow [this tutorial](Evaluate_Base_Model_Endpoint/Evaluate_Base_Model_Endpoint.ipynb) to evaluate and compare base models on your own data. + +### Pre-Production AI Application Evaluation + +Once a base model is selected, the next stage is building an AI application around it, such as an AI-powered copilot, a retrieval-augmented generation (RAG) system, an agentic AI system, or any other generative AI-based tool. Before deploying the AI application into a production environment, rigorous testing is essential to ensure the model is truly ready for real-world use. + +Follow [this tutorial](Evaluate_App_Endpoint/Evaluate_App_Endpoint.ipynb) to evaluate any AI application endpoint on your own data. + +### Post-Production AI Application Evaluation (Monitoring) + +Once the AI application is approved for production and deployed, it’s crucial to ensure it continues to perform safely and generate high-quality responses while maintaining satisfactory performance. + +Learn how you can [use Azure AI to monitor](https://aka.ms/azureaimonitoring) your AI applications post-production. +
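As a closing illustration of the pre-production pattern described above, here is a small, hedged sketch of evaluating an application by passing a wrapper around its endpoint as the `target` of `evaluate()`. The `call_my_app` function, deployment name, and `queries.jsonl` file are placeholders for your own application and data.

```python
# Sketch: evaluate an application target on a set of queries (all names are placeholders).
from azure.ai.evaluation import evaluate, GroundednessEvaluator

model_config = {
    "azure_endpoint": "https://<your-aoai-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "gpt-4",
}

def call_my_app(query: str) -> dict:
    # Placeholder wrapper around your application endpoint; it should return
    # the columns your evaluators need, e.g. the response and retrieved context.
    response, context = "...", "..."  # replace with a real call to your app
    return {"response": response, "context": context}

result = evaluate(
    data="queries.jsonl",  # each line: {"query": "..."}
    target=call_my_app,    # called once per row; its outputs become new columns
    evaluators={"groundedness": GroundednessEvaluator(model_config=model_config)},
    evaluator_config={
        "groundedness": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${target.response}",
                "context": "${target.context}",
            }
        }
    },
)
print(result["metrics"])
```

The same pattern extends to the remote and scheduled evaluations shown earlier: the evaluator configuration stays the same, and only where the evaluation runs changes.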