New sample for Remote and Online Evaluation #149
Merged
slister1001 merged 14 commits into Azure-Samples:main from slister1001:remote-online-evals on Nov 18, 2024
Changes from 12 commits
Commits
40ca32c  update promptflow-eval dependencies to azure-ai-evaluation (slister1001)
3c98269  clear local variables (slister1001)
2ccdfb2  fix errors and remove 'question' col from data (slister1001)
fc46d6c  small fix in evaluator config (slister1001)
c6d52a4  Merge branch 'Azure-Samples:main' into main (slister1001)
4d6fc68  Merge branch 'Azure-Samples:main' into main (slister1001)
d5cd237  Merge branch 'Azure-Samples:main' into main (slister1001)
724c315  Merge branch 'Azure-Samples:main' into main (slister1001)
57b9d02  Merge branch 'Azure-Samples:main' into main (slister1001)
554f358  Merge branch 'Azure-Samples:main' into main (slister1001)
419b5b1  Add sample for Remote Evaluation (slister1001)
b32b674  code review updates (slister1001)
b212a5c  adding online evals, code review updates (slister1001)
5a40105  Separate remote and online samples (slister1001)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,41 @@ | ||
| --- | ||
| page_type: sample | ||
| languages: | ||
| - python | ||
| products: | ||
| - ai-services | ||
| - azure-openai | ||
| description: Evaluating remotely | ||
| --- | ||
| | ||
| ## Evaluating in the cloud | ||
| | ||
| ### Overview | ||
| | ||
| This tutorial provides a step-by-step guide on how to evaluate generative AI models or LLMs remotely using a triggered evaluation. | ||
| | ||
| ### Objective | ||
| | ||
| The main objective of this tutorial is to help users understand the process of evaluating a model remotely in the cloud by triggering an evaluation. This type of evaluation can be used for pre-deployment testing. By the end of this tutorial, you should be able to: | ||
| | ||
| - Learn about evaluations | ||
| - Evaluate an LLM using various evaluators from the Azure AI Evaluation SDK remotely in the cloud. | ||
| | ||
| ### Note | ||
| Remote evaluations do not support `Groundedness-Pro-Evaluator`, `Retrieval-Evaluator`, `Protected-Material-Evaluator`, `Indirect-Attack-Evaluator`, `ContentSafetyEvaluator`, and `QAEvaluator`. | ||
| | ||
| #### Region Support for Evaluations | ||
| | ||
| | Region | Hate and Unfairness, Sexual, Violent, Self-Harm, XPIA | Groundedness | Protected Material | | ||
| | - | - | - | - | | ||
| | UK South | Will be deprecated 12/1/24 | no | no | | ||
| | East US 2 | yes | yes | yes | | ||
| | Sweden Central | yes | yes | no | | ||
| | US North Central | yes | no | no | | ||
| | France Central | yes | no | no | | ||
| | Switzerland West | yes | no | no | | ||
| | ||
| ### Programming Languages | ||
| - Python | ||
| | ||
| ### Estimated Runtime: 20 mins | ||
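Condensed from the notebook added below, here is a minimal sketch of what triggering a remote evaluation looks like in code. It assumes the preview `azure-ai-project` and `azure-ai-evaluation` packages used in this sample, and the placeholder values (connection string, data file) must be replaced with your own.

```python
from azure.identity import DefaultAzureCredential
from azure.ai.project import AIProjectClient
from azure.ai.project.models import Evaluation, Dataset, EvaluatorConfiguration
from azure.ai.evaluation import F1ScoreEvaluator

# Connect to your Azure AI project (replace the placeholder connection string).
project_client = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str="<connection_string>",
)

# Upload the evaluation data and trigger a remote (cloud) evaluation run.
data_id = project_client.upload_file("./evaluate_test_data.jsonl")
evaluation_response = project_client.evaluations.create(
    evaluation=Evaluation(
        display_name="Remote Evaluation",
        description="Evaluation of dataset",
        data=Dataset(id=data_id),
        evaluators={"f1_score": EvaluatorConfiguration(id=F1ScoreEvaluator.id)},
    )
)
print(evaluation_response.id)
```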
195 changes: 195 additions & 0 deletions
scenarios/evaluate/evaluate_remotely/evaluate_remotely.ipynb
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,195 @@ | ||
| { | ||
| "cells": [ | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "# Remote Evaluations: Evaluating in the Cloud \n", | ||
| "\n", | ||
| "## Objective\n", | ||
| "\n", | ||
| "This tutorial provides a step-by-step guide on how to evaluate data generated by LLMs remotely in the cloud. \n", | ||
| "\n", | ||
| "This tutorial uses the following Azure AI services:\n", | ||
| "\n", | ||
| "- [Azure AI Safety Evaluation](https://aka.ms/azureaistudiosafetyeval)\n", | ||
| "- [azure-ai-evaluation](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk)\n", | ||
| "\n", | ||
| "## Time\n", | ||
| "\n", | ||
| "You should expect to spend 20 minutes running this sample. \n", | ||
| "\n", | ||
| "## About this example\n", | ||
| "\n", | ||
| "This example demonstrates the remote evaluation of query and response pairs that were generated by an LLM model. It is important to have access to AzureOpenAI credentials and an AzureAI project. **To create data to use in your own evaluation, learn more [here](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/simulator-interaction-data)** . This example demonstrates: \n", | ||
| "\n", | ||
| "- Single-instance, triggered Remote Evaluation (to be used for pre-deployment evaluation of LLMs)\n", | ||
| "\n", | ||
| "## Before you begin\n", | ||
| "### Prerequesite\n", | ||
| "- [Have an online deployment on Azure Open AI studio supporting `chat completion` such as `gpt-4`](https://learn.microsoft.com/en-us/azure/machine-learning/concept-endpoints-online?view=azureml-api-2)\n", | ||
| "- You also might want to evaluate data generated by your LLM, to see how to generate data to be evaluated using the Azure AI Evaluation SDK, see our samples on simulation \n", | ||
| "\n", | ||
| "### Installation\n", | ||
| "\n", | ||
| "Install the following packages required to execute this notebook. " | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# %pip uninstall azure-ai-project azure-ai-ml azure-ai-evaluation\n", | ||
| "# %pip install azure-identity\n", | ||
| "# %pip install azure-ai-project\n", | ||
| "# %pip install azure-ai-evaluation" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "from azure.ai.project import AIProjectClient\n", | ||
| "from azure.identity import DefaultAzureCredential\n", | ||
| "from azure.ai.project.models import Evaluation, Dataset, EvaluatorConfiguration, ConnectionType\n", | ||
| "from azure.ai.evaluation import F1ScoreEvaluator, RelevanceEvaluator, ViolenceEvaluator" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### Connect to your Azure Open AI deployment\n", | ||
| "To evaluate your LLM-generated data remotely in the cloud, we must first connect to your Azure Open AI deployment. This deployment must be a GPT model which supports `chat completion`, such as `gpt-4`. To see the connection string, navigate to the \"Project Overview\" page for your Azure AI project. " | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "project_client = AIProjectClient.from_connection_string(\n", | ||
| " credential=DefaultAzureCredential(),\n", | ||
| " conn_str=\"<connection_string>\", # At the moment, it should be in the format \"<Region>.api.azureml.ms;<AzureSubscriptionId>;<ResourceGroup>;<HubName>\" Ex: eastus2.api.azureml.ms;xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxx;rg-sample;sample-project-eastus2\n", | ||
| ")" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "deployment_name = \"<deployment_name>\"\n", | ||
| "api_version = \"<api_version>\"\n", | ||
| "default_connection = project_client.connections.get_default(connection_type=ConnectionType.AZURE_OPEN_AI)\n", | ||
| "model_config = default_connection.to_evaluator_model_config(deployment_name=deployment_name, api_version=api_version)" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### Data\n", | ||
| "The following code demonstrates how to upload the data for evaluation to your Azure AI project. Below we use `evaluate_test_data.jsonl` which exemplifies LLM-generated data in the query-response format expected by the Azure AI Evaluation SDK. For your use case, you should upload data in the same format, which can be generated using the [`Simulator`](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/simulator-interaction-data) from Azure AI Evaluation SDK. \n", | ||
| "\n", | ||
| "Alternatively, if you already have an existing dataset for evaluation, you can use that by finding the link to your dataset in your [registry](https://ml.azure.com/registries) or find the dataset ID." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# Upload data for evaluation\n", | ||
| "data_id = project_client.upload_file(\"./evaluate_test_data.jsonl\")\n", | ||
| "# To use an existing dataset instead of uploading a file, set data_id to its dataset ID, for example:\n", | ||
| "# data_id = \"azureml://registries/<registry_name>/data/<dataset_name>/versions/1\"\n", | ||
| "# data_id = \"<dataset_id>\"" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### Evaluate in the Cloud with Remote Evaluation\n", | ||
| "Below we demonstrate how to trigger a single-instance Remote Evaluation in the Cloud. This can be used for pre-deployment testing of an LLM. \n", | ||
| " \n", | ||
| "Here we pass in the `data_id` we would like to use for the evaluation and the `EvaluatorConfiguration` for each of the evaluators we would like to include. Below we demonstrate how to use the `F1ScoreEvaluator`, `RelevanceEvaluator`, and the `ViolenceEvaluator`" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "evaluation = Evaluation(\n", | ||
| " display_name=\"Remote Evaluation\",\n", | ||
| " description=\"Evaluation of dataset\",\n", | ||
| " data=Dataset(id=data_id),\n", | ||
| " evaluators={\n", | ||
| " \"f1_score\": EvaluatorConfiguration(\n", | ||
| " id=F1ScoreEvaluator.id,\n", | ||
| " ),\n", | ||
| " \"relevance\": EvaluatorConfiguration(\n", | ||
| " id=RelevanceEvaluator.id,\n", | ||
| " init_params={\"model_config\": model_config},\n", | ||
| " ),\n", | ||
| " \"violence\": EvaluatorConfiguration(\n", | ||
| " id=ViolenceEvaluator.id,\n", | ||
| " init_params={\"azure_ai_project\": project_client.scope},\n", | ||
| " ),\n", | ||
| " },\n", | ||
| ")\n", | ||
| "\n", | ||
| "# Create evaluation\n", | ||
| "evaluation_response = project_client.evaluations.create(\n", | ||
| " evaluation=evaluation,\n", | ||
| ")" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# Get evaluation\n", | ||
| "get_evaluation_response = project_client.evaluations.get(evaluation_response.id)\n", | ||
| "\n", | ||
| "print(\"----------------------------------------------------------------\")\n", | ||
| "print(\"Created evaluation, evaluation ID: \", get_evaluation_response.id)\n", | ||
| "print(\"Evaluation status: \", get_evaluation_response.status)\n", | ||
| "print(\"AI Foundry Portal URI: \", get_evaluation_response.properties[\"AiFoundryPortalUri\"])\n", | ||
| "print(\"----------------------------------------------------------------\")" | ||
| ] | ||
| } | ||
| ], | ||
| "metadata": { | ||
| "kernelspec": { | ||
| "display_name": "azureai-samples313", | ||
| "language": "python", | ||
| "name": "python3" | ||
| }, | ||
| "language_info": { | ||
| "codemirror_mode": { | ||
| "name": "ipython", | ||
| "version": 3 | ||
| }, | ||
| "file_extension": ".py", | ||
| "mimetype": "text/x-python", | ||
| "name": "python", | ||
| "nbconvert_exporter": "python", | ||
| "pygments_lexer": "ipython3" | ||
| } | ||
| }, | ||
| "nbformat": 4, | ||
| "nbformat_minor": 2 | ||
| } | ||
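The final cell above fetches the evaluation status once. A minimal polling sketch follows, reusing `project_client` and `evaluation_response` from the notebook; the terminal status strings are assumptions, so check them against the values your service actually returns.

```python
import time

# Poll the remote evaluation until it reaches a terminal state.
# NOTE: the terminal status values below are assumptions; verify them against
# the statuses returned by project_client.evaluations.get() in your environment.
while True:
    evaluation = project_client.evaluations.get(evaluation_response.id)
    print(f"Evaluation status: {evaluation.status}")
    if evaluation.status in ("Completed", "Failed", "Canceled"):
        break
    time.sleep(30)  # wait 30 seconds between checks
```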
3 changes: 3 additions & 0 deletions
scenarios/evaluate/evaluate_remotely/evaluate_test_data.jsonl
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| {"query":"How do you create a run?","context":"AML API only","response":"To create a run using the Azure Machine Learning API, you first need to create an Experiment. Once you have an experiment, you can create a Run object that is associated with that experiment. Here is some Python code that demonstrates this process:\n\n```\nfrom azureml.core import Experiment, Run\nfrom azureml.core.workspace import Workspace\n\n# Define workspace and experiment\nws = Workspace.from_config()\nexp = Experiment(workspace=ws, name='my_experiment')\n\n# Create a new run\nrun = exp.start_logging()\n```\n\nIn this code, the `from_config()` method reads the configuration file that you created when you set up your Azure Machine Learning workspace. The `Experiment` constructor creates an Experiment object that is associated with your workspace, and the `start_logging()` method creates a new Run object that is associated with the Experiment. Now you can use the `run` object to log metrics, upload files, and track other information related to your machine learning experiment.", "ground_truth":"Paris is the capital of France."} | ||
| {"query":"How do you log a model?","context":"Logging can be done using any OSS Sdk","response":"There are a few ways to log models in Azure Machine Learning. \n\nOne way is to use the `register_model()` method of the `Run` object. The `register_model()` method logs a model file in the Azure Machine Learning service workspace and makes it available for deployment. Here's an example:\n\n```python\nfrom azureml.core import Model\n\nmodel_path = '.\/outputs\/my_model.pkl'\nmodel = Model.register(workspace=ws, model_path=model_path, model_name='my_model')\n```\n\nThis code registers the model file located at `model_path` to the Azure Machine Learning service workspace with the name `my_model`. \n\nAnother way to log a model is to save it as an output of a `Run`. If your model generation code is part of a script or Jupyter notebook that runs as an Azure Machine Learning experiment, you can save the model file as an output of the `Run` object. Here's an example:\n\n```python\nfrom sklearn.linear_model import LogisticRegression\nfrom azureml.core.run import Run\n\n# Initialize a run object\nrun = Run.get_context()\n\n# Train your model\nX_train, y_train = ...\nlog_reg = LogisticRegression().fit(X_train, y_train)\n\n# Save the model to the Run object's outputs directory\nmodel_path = 'outputs\/model.pkl'\njoblib.dump(value=log_reg, filename=model_path)\n\n# Log the model as a run artifact\nrun.upload_file(name=model_path, path_or_stream=model_path)\n```\n\nIn this code, `Run.get_context()` retrieves the current run context object, which you can use to track metadata and metrics for the run. After training your model, you can use `joblib.dump()` to save the model to a file, and then log the file as an artifact of the run using `run.upload_file()`.","ground_truth":"Paris is the capital of France."} | ||
| {"query":"What is the capital of France?","context":"France is in Europe","response":"Paris is the capital of France.", "ground_truth":"Paris is the capital of France."} |