Adding sample to evaluate groundedness #142
Merged
Commits (10)
- 40ca32c: update promptflow-eval dependencies to azure-ai-evaluation (slister1001)
- 3c98269: clear local variables (slister1001)
- 2ccdfb2: fix errors and remove 'question' col from data (slister1001)
- fc46d6c: small fix in evaluator config (slister1001)
- c6d52a4: Merge branch 'Azure-Samples:main' into main (slister1001)
- 4d6fc68: Merge branch 'Azure-Samples:main' into main (slister1001)
- d5cd237: Merge branch 'Azure-Samples:main' into main (slister1001)
- 724c315: Merge branch 'Azure-Samples:main' into main (slister1001)
- 800c15b: add groundedness sample (slister1001)
- 6fff308: adding and fixing readme (slister1001)
326 changes: 326 additions & 0 deletions
scenarios/evaluate/simulate_evaluate_groundedness/simulate_evaluate_groundedness.ipynb
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,326 @@ | ||
| { | ||
| "cells": [ | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "# Evaluating Model Groundedness with Azure AI Evaluation SDK\n", | ||
| "\n", | ||
| "This notebook demonstrates how to simulate interactions with a model endpoint and evaluate the groundedness of its responses using the Azure AI Evaluation SDK. Groundedness refers to the extent to which the responses generated by a model are based on reliable and verifiable information. Ensuring that a model's outputs are grounded is crucial for maintaining the accuracy and trustworthiness of AI systems.\n", | ||
| "\n", | ||
| "In this notebook, we will:\n", | ||
| "\n", | ||
| "1. Set up the Azure AI Evaluation SDK.\n", | ||
| "2. Define the dataset for evaluating groundedness, which will vary based on the specific use case of your model.\n", | ||
| "3. Simulate the model endpoint and generate responses.\n", | ||
| "4. Evaluate the groundedness of the model's responses using the Azure AI Evaluation SDK.\n", | ||
| "\n", | ||
| "The dataset used for evaluating groundedness will be tailored to the particular application of your model. For instance, if your model is designed for customer support, the dataset might consist of common customer queries and the corresponding accurate responses. If your model is used for medical diagnosis, the dataset would include medical cases and verified diagnostic information.\n", | ||
| "\n", | ||
| "By the end of this notebook, you will have a clear understanding of how to assess the groundedness of your model's outputs and ensure that they are based on solid and reliable information.\n", | ||
| "\n", | ||
| "This tutorial uses the following Azure AI services:\n", | ||
| "\n", | ||
| "- [azure-ai-evaluation](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk)\n", | ||
| "\n", | ||
| "## Time\n", | ||
| "\n", | ||
| "You should expect to spend 30 minutes running this sample. \n", | ||
| "\n", | ||
| "## About this example\n", | ||
| "\n", | ||
| "This example demonstrates how to evaluate a model endpoint's responses against provided prompts using azure-ai-evaluation.\n", | ||
| "\n", | ||
| "## Before you begin\n", | ||
| "\n", | ||
| "### Installation\n", | ||
| "\n", | ||
| "Install the following packages required to execute this notebook. " | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "%pip install azure-ai-evaluation --upgrade\n", | ||
| "%pip install promptflow-azure\n", | ||
| "%pip install azure-identity" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### Parameters and imports" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "First, import the required packages and set the environment variables and configuration used throughout this notebook. " | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "import os\n", | ||
| "from typing import Any, Dict, List, Optional\n", | ||
| "import json\n", | ||
| "from pathlib import Path\n", | ||
| "\n", | ||
| "from azure.ai.evaluation import evaluate\n", | ||
| "from azure.ai.evaluation import GroundednessEvaluator\n", | ||
| "from azure.ai.evaluation.simulator import Simulator\n", | ||
| "from openai import AzureOpenAI\n", | ||
| "import importlib.resources as pkg_resources\n", | ||
| "from azure.identity import DefaultAzureCredential, get_bearer_token_provider" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "os.environ[\"AZURE_SUBSCRIPTION_ID\"] = \"<your-subscription-id>\"\n", | ||
| "os.environ[\"RESOURCE_GROUP\"] = \"<your-resource-group>\"\n", | ||
| "os.environ[\"PROJECT_NAME\"] = \"<your-project-name>\"\n", | ||
| "os.environ[\"AZURE_OPENAI_ENDPOINT\"] = \"<your-endpoint>\"\n", | ||
| "os.environ[\"AZURE_DEPLOYMENT_NAME\"] = \"<your-deployment-name>\"\n", | ||
| "os.environ[\"AZURE_API_VERSION\"] = \"<api-version>\"" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "project_scope = {\n", | ||
| " \"subscription_id\": os.environ.get(\"AZURE_SUBSCRIPTION_ID\"),\n", | ||
| " \"resource_group_name\": os.environ.get(\"RESOURCE_GROUP\"),\n", | ||
| " \"project_name\": os.environ.get(\"PROJECT_NAME\"),\n", | ||
| "}\n", | ||
| "\n", | ||
| "model_config = {\n", | ||
| " \"azure_endpoint\": os.environ.get(\"AZURE_OPENAI_ENDPOINT\"),\n", | ||
| " \"azure_deployment\": os.environ.get(\"AZURE_DEPLOYMENT_NAME\"),\n", | ||
| "}" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Data\n", | ||
| "Here we define the data, `grounding.json`, on which we will simulate query and response pairs to help us evaluate the groundedness of our model's responses. The data you use to evaluate groundedness may differ depending on your model's use case. " | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "resource_name = \"grounding.json\"\n", | ||
| "package = \"azure.ai.evaluation.simulator._data_sources\"\n", | ||
| "conversation_turns = []\n", | ||
| "\n", | ||
| "with pkg_resources.path(package, resource_name) as grounding_file, grounding_file.open(\"r\") as file:\n", | ||
| " data = json.load(file)\n", | ||
| "\n", | ||
| "for item in data:\n", | ||
| " conversation_turns.append([item])\n", | ||
| " if len(conversation_turns) == 2:\n", | ||
| " break" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Target Endpoint\n", | ||
| "\n", | ||
| "We will use the Evaluate API provided by the Azure AI Evaluation SDK. It requires a target application or Python function that handles calls to the LLM and retrieves its responses. " | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "def example_application_response(query: str, context: str) -> str:\n", | ||
| " deployment = os.environ.get(\"AZURE_DEPLOYMENT_NAME\")\n", | ||
| " endpoint = os.environ.get(\"AZURE_OPENAI_ENDPOINT\")\n", | ||
| " token_provider = get_bearer_token_provider(DefaultAzureCredential(), \"https://cognitiveservices.azure.com/.default\")\n", | ||
| "\n", | ||
| " # Get a client handle for the AOAI model\n", | ||
| " client = AzureOpenAI(\n", | ||
| " azure_endpoint=endpoint,\n", | ||
| " api_version=os.environ.get(\"AZURE_API_VERSION\"),\n", | ||
| " azure_ad_token_provider=token_provider,\n", | ||
| " )\n", | ||
| "\n", | ||
| " # Prepare the messages\n", | ||
| " messages = [\n", | ||
| " {\n", | ||
| " \"role\": \"system\",\n", | ||
| " \"content\": f\"You are a user assistant who helps answer questions based on some context.\\n\\nContext: '{context}'\",\n", | ||
| " },\n", | ||
| " {\"role\": \"user\", \"content\": query},\n", | ||
| " ]\n", | ||
| " # Call the model\n", | ||
| " completion = client.chat.completions.create(\n", | ||
| " model=deployment,\n", | ||
| " messages=messages,\n", | ||
| " max_tokens=800,\n", | ||
| " temperature=0.7,\n", | ||
| " top_p=0.95,\n", | ||
| " frequency_penalty=0,\n", | ||
| " presence_penalty=0,\n", | ||
| " stop=None,\n", | ||
| " stream=False,\n", | ||
| " )\n", | ||
| "\n", | ||
| " message = completion.to_dict()[\"choices\"][0][\"message\"]\n", | ||
| " if isinstance(message, dict):\n", | ||
| " message = message[\"content\"]\n", | ||
| " return message" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Run the simulator\n", | ||
| "\n", | ||
| "The interactions between your endpoint (in this case, `example_application_response`) and the simulator are managed by a callback method, `custom_simulator_callback`. This method formats the request sent to your endpoint and the response returned from it." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "async def custom_simulator_callback(\n", | ||
| " messages: Dict[str, List[Dict]],\n", | ||
| " stream: bool = False,\n", | ||
| " session_state: Optional[str] = None,\n", | ||
| " context: Optional[Dict[str, Any]] = None,\n", | ||
| ") -> dict:\n", | ||
| " messages_list = messages[\"messages\"]\n", | ||
| " # get last message\n", | ||
| " latest_message = messages_list[-1]\n", | ||
| " application_input = latest_message[\"content\"]\n", | ||
| " context = latest_message.get(\"context\", None)\n", | ||
| " # call your endpoint or ai application here\n", | ||
| " response = example_application_response(query=application_input, context=context)\n", | ||
| " # we are formatting the response to follow the openAI chat protocol format\n", | ||
| " message = {\n", | ||
| " \"content\": response,\n", | ||
| " \"role\": \"assistant\",\n", | ||
| " \"context\": context,\n", | ||
| " }\n", | ||
| " messages[\"messages\"].append(message)\n", | ||
| " return {\"messages\": messages[\"messages\"], \"stream\": stream, \"session_state\": session_state, \"context\": context}" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "custom_simulator = Simulator(model_config=model_config)" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "outputs = await custom_simulator(\n", | ||
| " target=custom_simulator_callback,\n", | ||
| " conversation_turns=conversation_turns,\n", | ||
| " max_conversation_turns=1,\n", | ||
| " concurrent_async_tasks=10,\n", | ||
| ")" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### Convert the outputs to a format that can be evaluated" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "output_file = \"ground_sim_output.jsonl\"\n", | ||
| "with Path(output_file).open(\"w\") as file:\n", | ||
| " for output in outputs:\n", | ||
| " file.write(output.to_eval_qr_json_lines())" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "\n", | ||
| "## Run the evaluation\n", | ||
| "\n", | ||
| "In this section, we will run the evaluation using the `GroundednessEvaluator` and the `evaluate` function from the Azure AI Evaluation SDK. The evaluation will assess the groundedness of the model's responses based on the dataset produced by the `Simulator` above." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "groundedness_evaluator = GroundednessEvaluator(model_config=model_config)\n", | ||
| "eval_output = evaluate(\n", | ||
| " data=output_file,\n", | ||
| " evaluators={\n", | ||
| " \"groundedness\": groundedness_evaluator,\n", | ||
| " },\n", | ||
| " azure_ai_project=project_scope,\n", | ||
| ")\n", | ||
| "print(eval_output)" | ||
| ] | ||
| } | ||
| ], | ||
| "metadata": { | ||
| "kernelspec": { | ||
| "display_name": "Python 3", | ||
| "language": "python", | ||
| "name": "python3" | ||
| }, | ||
| "language_info": { | ||
| "codemirror_mode": { | ||
| "name": "ipython", | ||
| "version": 3 | ||
| }, | ||
| "file_extension": ".py", | ||
| "mimetype": "text/x-python", | ||
| "name": "python", | ||
| "nbconvert_exporter": "python", | ||
| "pygments_lexer": "ipython3" | ||
| } | ||
| }, | ||
| "nbformat": 4, | ||
| "nbformat_minor": 2 | ||
| } |
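The callback contract shown in the notebook (a dict with a `messages` list goes in; the same dict, with an appended assistant message in OpenAI chat-protocol shape, comes back) can be exercised offline before wiring it to a real endpoint. The following is a minimal sketch, not part of the sample itself; `stub_application` is a hypothetical stand-in for `example_application_response` so no Azure credentials are needed:

```python
import asyncio
from typing import Any, Dict, List, Optional

def stub_application(query: str, context: str) -> str:
    # Hypothetical stand-in for the Azure OpenAI call in the notebook.
    return f"(stub answer to: {query})"

async def callback(
    messages: Dict[str, List[Dict[str, Any]]],
    stream: bool = False,
    session_state: Optional[str] = None,
    context: Optional[Dict[str, Any]] = None,
) -> dict:
    latest = messages["messages"][-1]
    response = stub_application(latest["content"], latest.get("context"))
    # Append the reply in OpenAI chat-protocol shape, as the simulator expects.
    messages["messages"].append(
        {"role": "assistant", "content": response, "context": latest.get("context")}
    )
    return {
        "messages": messages["messages"],
        "stream": stream,
        "session_state": session_state,
        "context": context,
    }

result = asyncio.run(
    callback({"messages": [{"role": "user", "content": "Hi", "context": "ctx"}]})
)
print(result["messages"][-1]["role"])
```

Running a dry run like this confirms the dict shape your callback returns before the simulator spends real tokens against your deployment.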
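Before handing `ground_sim_output.jsonl` to `evaluate`, it can be useful to sanity-check the file the simulator produced. This is a standard-library-only sketch; the exact field names inside each JSONL record depend on what `to_eval_qr_json_lines` emits, so the `query`/`response` keys below are assumptions to check against your actual output:

```python
import json
from pathlib import Path

def load_jsonl(path: str) -> list:
    """Parse a JSONL file into a list of dicts, skipping blank lines."""
    rows = []
    with Path(path).open() as f:
        for line in f:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows

# Example: write a tiny file in the assumed query/response shape and read it back.
sample = [{"query": "What is X?", "response": "X is ..."}]
with Path("sample_output.jsonl").open("w") as f:
    for row in sample:
        f.write(json.dumps(row) + "\n")

rows = load_jsonl("sample_output.jsonl")
print(len(rows), rows[0]["query"])
```

A quick pass like this catches empty or malformed simulator output early, with a clearer error than a failed evaluation run.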