Commit 88a00f1

New sample for Remote and Online Evaluation (#149)
* update promptflow-eval dependencies to azure-ai-evaluation
* clear local variables
* fix errors and remove 'question' col from data
* small fix in evaluator config
* Add sample for Remote Evaluation
* code review updates
* adding online evals, code review updates
* Separate remote and online samples
1 parent 52ab914 commit 88a00f1

5 files changed (+555 -0 lines changed)

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
1+
---
2+
page_type: sample
3+
languages:
4+
- python
5+
products:
6+
- ai-services
7+
- azure-openai
8+
description: Evaluating online
9+
---
10+
11+
## Evaluating in the cloud on a schedule
12+
13+
### Overview
14+
15+
This tutorial provides a step-by-step guide on how to evaluate generative AI applications or LLMs on a schedule using online evaluation.
16+
17+
### Objective
18+
19+
The main objective of this tutorial is to help users understand the process of evaluating a model in the cloud on a recurring schedule. This type of evaluation can be used to monitor LLMs and generative AI applications that have been deployed. By the end of this tutorial, you should be able to:
20+
21+
- Learn about evaluations
22+
- Evaluate an LLM using various evaluators from the Azure AI Evaluation SDK online in the cloud (a condensed sketch appears at the end of this README).
23+
24+
### Note
25+
All evaluators supported by [Azure AI Evaluation](https://learn.microsoft.com/en-us/azure/ai-studio/concepts/evaluation-metrics-built-in?tabs=warning) are supported by Online Evaluation. For updated documentation, please see [Online Evaluation documentation](https://aka.ms/GenAIMonitoringDoc).
26+
27+
#### Region Support for Evaluations
28+
29+
| Region | Hate and Unfairness, Sexual, Violent, Self-Harm, XPIA | Groundedness Pro | Protected Material |
30+
| - | - | - | - |
31+
| UK South | Will be deprecated 12/1/24 | no | no |
32+
| East US 2 | yes | yes | yes |
33+
| Sweden Central | yes | yes | no |
34+
| North Central US | yes | no | no |
35+
| France Central | yes | no | no |
36+
| Switzerland West | yes | no | no |
37+
38+
### Programming Languages
39+
- Python
40+
41+
### Estimated Runtime: 30 mins
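
For orientation, here is a condensed sketch of the flow that the accompanying notebook in this commit walks through. It assumes the preview `azure-ai-project` and `azure-ai-evaluation` packages used by the notebook; the placeholder connection string, resource IDs, and names are illustrative and must be replaced with your own values.

```python
from azure.identity import DefaultAzureCredential
from azure.ai.project import AIProjectClient
from azure.ai.project.models import (
    ApplicationInsightsConfiguration,
    EvaluatorConfiguration,
    EvaluationSchedule,
    RecurrenceTrigger,
)
from azure.ai.evaluation import F1ScoreEvaluator

# Connect to your Azure AI project.
project_client = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str="<connection_string>",  # "<Region>.api.azureml.ms;<SubscriptionId>;<ResourceGroup>;<HubName>"
)

# Pull evaluation inputs from Application Insights with a KQL query (see the notebook for a full query).
app_insights_config = ApplicationInsightsConfiguration(
    resource_id="<app_insights_resource_id>",
    query="<kql_query>",
    service_name="<service_name>",
)

# One or more evaluators, keyed by the name you want to see in the results.
evaluators = {"f1_score": EvaluatorConfiguration(id=F1ScoreEvaluator.id)}

# Run the evaluation once a day against newly logged trace data.
schedule = EvaluationSchedule(
    data=app_insights_config,
    evaluators=evaluators,
    trigger=RecurrenceTrigger(frequency="day", interval=1),
    description="<service_name> evaluation schedule",
)
project_client.evaluations.create_or_replace_schedule("<evaluation_name>", schedule)
```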
Lines changed: 248 additions & 0 deletions
@@ -0,0 +1,248 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# Online Evaluations: Evaluating in the Cloud on a Schedule\n",
8+
"\n",
9+
"## Objective\n",
10+
"\n",
11+
"This tutorial provides a step-by-step guide on how to evaluate data generated by LLMs online on a schedule. \n",
12+
"\n",
13+
"This tutorial uses the following Azure AI services:\n",
14+
"\n",
15+
"- [Azure AI Safety Evaluation](https://aka.ms/azureaistudiosafetyeval)\n",
16+
"- [azure-ai-evaluation](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk)\n",
17+
"\n",
18+
"## Time\n",
19+
"\n",
20+
"You should expect to spend 30 minutes running this sample. \n",
21+
"\n",
22+
"## About this example\n",
23+
"\n",
24+
"This example demonstrates the online evaluation of a LLM. It is important to have access to AzureOpenAI credentials and an AzureAI project. This example demonstrates: \n",
25+
"\n",
26+
"- Recurring, Online Evaluation (to be used to monitor LLMs once they are deployed)\n",
27+
"\n",
28+
"## Before you begin\n",
29+
"### Prerequesite\n",
30+
"- Configure resources to support Online Evaluation as per [Online Evaluation documentation](https://aka.ms/GenAIMonitoringDoc)"
31+
]
32+
},
33+
{
34+
"cell_type": "code",
35+
"execution_count": null,
36+
"metadata": {},
37+
"outputs": [],
38+
"source": [
39+
"%pip install -U azure-identity\n",
40+
"%pip install -U azure-ai-project\n",
41+
"%pip install -U azure-ai-evaluation"
42+
]
43+
},
44+
{
45+
"cell_type": "code",
46+
"execution_count": null,
47+
"metadata": {},
48+
"outputs": [],
49+
"source": [
50+
"from azure.ai.project import AIProjectClient\n",
51+
"from azure.identity import DefaultAzureCredential\n",
52+
"from azure.ai.project.models import (\n",
53+
" ApplicationInsightsConfiguration,\n",
54+
" EvaluatorConfiguration,\n",
55+
" ConnectionType,\n",
56+
" EvaluationSchedule,\n",
57+
" RecurrenceTrigger,\n",
58+
")\n",
59+
"from azure.ai.evaluation import F1ScoreEvaluator, ViolenceEvaluator"
60+
]
61+
},
62+
{
63+
"cell_type": "markdown",
64+
"metadata": {},
65+
"source": [
66+
"### Connect to your Azure Open AI deployment\n",
67+
"To evaluate your LLM-generated data remotely in the cloud, we must connect to your Azure Open AI deployment. This deployment must be a GPT model which supports `chat completion`, such as `gpt-4`. To see the proper value for `conn_str`, navigate to the connection string at the \"Project Overview\" page for your Azure AI project. "
68+
]
69+
},
70+
{
71+
"cell_type": "code",
72+
"execution_count": null,
73+
"metadata": {},
74+
"outputs": [],
75+
"source": [
76+
"project_client = AIProjectClient.from_connection_string(\n",
77+
" credential=DefaultAzureCredential(),\n",
78+
" conn_str=\"<connection_string>\", # At the moment, it should be in the format \"<Region>.api.azureml.ms;<AzureSubscriptionId>;<ResourceGroup>;<HubName>\" Ex: eastus2.api.azureml.ms;xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxx;rg-sample;sample-project-eastus2\n",
79+
")"
80+
]
81+
},
82+
{
83+
"cell_type": "markdown",
84+
"metadata": {},
85+
"source": [
86+
"Please see [Online Evaluation documentation](https://aka.ms/GenAIMonitoringDoc) for configuration of Application Insights. `service_name` is a unique name you provide to define your Generative AI application and identify it within your Application Insights resource. This property will be logged in the `traces` table in Application Insights and can be found in the `customDimensions[\"service.name\"]` field. `evaluation_name` is a unique name you provide for your Online Evaluation schedule. "
87+
]
88+
},
89+
{
90+
"cell_type": "code",
91+
"execution_count": null,
92+
"metadata": {},
93+
"outputs": [],
94+
"source": [
95+
"# Your Application Insights resource ID\n",
96+
"# At the moment, it should be something in the format \"/subscriptions/<AzureSubscriptionId>/resourceGroups/<ResourceGroup>/providers/Microsoft.Insights/components/<ApplicationInsights>\"\"\n",
97+
"app_insights_resource_id = \"<app_insights_resource_id>\"\n",
98+
"\n",
99+
"# Name of your generative AI application (will be available in trace data in Application Insights)\n",
100+
"service_name = \"<service_name>\"\n",
101+
"\n",
102+
"# Name of your online evaluation schedule\n",
103+
"evaluation_name = \"<evaluation_name>\""
104+
]
105+
},
106+
{
107+
"cell_type": "markdown",
108+
"metadata": {},
109+
"source": [
110+
"Below is the Kusto Query Language (KQL) query to query data from Application Insights resource. This query is compatible with data logged by the Azure AI Inferencing Tracing SDK (linked in [documentation](https://aka.ms/GenAIMonitoringDoc)). You can modify it depending on your data schema. The KQL query must output several columns: `operation_ID`, `operation_ParentID`, and `gen_ai_response_id`. You can choose which other columns to output as required by the evaluators you are using."
111+
]
112+
},
113+
{
114+
"cell_type": "code",
115+
"execution_count": null,
116+
"metadata": {},
117+
"outputs": [],
118+
"source": [
119+
"kusto_query = 'let gen_ai_spans=(dependencies | where isnotnull(customDimensions[\"gen_ai.system\"]) | extend response_id = tostring(customDimensions[\"gen_ai.response.id\"]) | project id, operation_Id, operation_ParentId, timestamp, response_id); let gen_ai_events=(traces | where message in (\"gen_ai.choice\", \"gen_ai.user.message\", \"gen_ai.system.message\") or tostring(customDimensions[\"event.name\"]) in (\"gen_ai.choice\", \"gen_ai.user.message\", \"gen_ai.system.message\") | project id= operation_ParentId, operation_Id, operation_ParentId, user_input = iff(message == \"gen_ai.user.message\" or tostring(customDimensions[\"event.name\"]) == \"gen_ai.user.message\", parse_json(iff(message == \"gen_ai.user.message\", tostring(customDimensions[\"gen_ai.event.content\"]), message)).content, \"\"), system = iff(message == \"gen_ai.system.message\" or tostring(customDimensions[\"event.name\"]) == \"gen_ai.system.message\", parse_json(iff(message == \"gen_ai.system.message\", tostring(customDimensions[\"gen_ai.event.content\"]), message)).content, \"\"), llm_response = iff(message == \"gen_ai.choice\", parse_json(tostring(parse_json(tostring(customDimensions[\"gen_ai.event.content\"])).message)).content, iff(tostring(customDimensions[\"event.name\"]) == \"gen_ai.choice\", parse_json(parse_json(message).message).content, \"\")) | summarize operation_ParentId = any(operation_ParentId), Input = maxif(user_input, user_input != \"\"), System = maxif(system, system != \"\"), Output = maxif(llm_response, llm_response != \"\") by operation_Id, id); gen_ai_spans | join kind=inner (gen_ai_events) on id, operation_Id | project Input, System, Output, operation_Id, operation_ParentId, gen_ai_response_id = response_id'\n",
120+
"\n",
121+
"# AzureMSIClientId is the clientID of the User-assigned managed identity created during set-up - see documentation for how to find it\n",
122+
"properties = {\"AzureMSIClientId\": \"your_client_id\"}"
123+
]
124+
},
125+
{
126+
"cell_type": "code",
127+
"execution_count": null,
128+
"metadata": {},
129+
"outputs": [],
130+
"source": [
131+
"# Connect to your Application Insights resource\n",
132+
"app_insights_config = ApplicationInsightsConfiguration(\n",
133+
" resource_id=app_insights_resource_id, query=kusto_query, service_name=service_name\n",
134+
")"
135+
]
136+
},
137+
{
138+
"cell_type": "code",
139+
"execution_count": null,
140+
"metadata": {},
141+
"outputs": [],
142+
"source": [
143+
"# Connect to your AOAI resource, you must use an AOAI GPT model\n",
144+
"deployment_name = \"gpt-4\"\n",
145+
"api_version = \"2024-06-01\"\n",
146+
"default_connection = project_client.connections.get_default(connection_type=ConnectionType.AZURE_OPEN_AI)\n",
147+
"model_config = default_connection.to_evaluator_model_config(deployment_name=deployment_name, api_version=api_version)"
148+
]
149+
},
150+
{
151+
"cell_type": "markdown",
152+
"metadata": {},
153+
"source": [
154+
"### Configure Evaluators to Run\n",
155+
"The code below demonstrates how to configure the evaluators you want to run. In this example, we use the `F1ScoreEvaluator`, `RelevanceEvaluator` and the `ViolenceEvaluator`, but all evaluators supported by [Azure AI Evaluation](https://learn.microsoft.com/en-us/azure/ai-studio/concepts/evaluation-metrics-built-in?tabs=warning) are supported by Online Evaluation and can be configured here. You can either import the classes from the SDK and reference them with the `.id` property, or you can find the fully formed `id` of the evaluator in the AI Studio registry of evaluators, and use it here. "
156+
]
157+
},
158+
{
159+
"cell_type": "code",
160+
"execution_count": null,
161+
"metadata": {},
162+
"outputs": [],
163+
"source": [
164+
"# id for each evaluator can be found in your AI Studio registry - please see documentation for more information\n",
165+
"# init_params is the configuration for the model to use to perform the evaluation\n",
166+
"# data_mapping is used to map the output columns of your query to the names required by the evaluator\n",
167+
"evaluators = {\n",
168+
" \"f1_score\": EvaluatorConfiguration(\n",
169+
" id=F1ScoreEvaluator.id,\n",
170+
" ),\n",
171+
" \"relevance\": EvaluatorConfiguration(\n",
172+
" id=\"azureml://registries/azureml-staging/models/Relevance-Evaluator/versions/4\",\n",
173+
" init_params={\"model_config\": model_config},\n",
174+
" data_mapping={\"query\": \"${data.Input}\", \"response\": \"${data.Output}\"},\n",
175+
" ),\n",
176+
" \"violence\": EvaluatorConfiguration(\n",
177+
" id=ViolenceEvaluator.id,\n",
178+
" init_params={\"azure_ai_project\": project_client.scope},\n",
179+
" data_mapping={\"query\": \"${data.Input}\", \"response\": \"${data.Output}\"},\n",
180+
" ),\n",
181+
"}"
182+
]
183+
},
184+
{
185+
"cell_type": "markdown",
186+
"metadata": {},
187+
"source": [
188+
"### Evaluate in the Cloud on a Schedule with Online Evaluation\n",
189+
"\n",
190+
"You can configure the `RecurrenceTrigger` based on the class definition [here](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.entities.recurrencetrigger?view=azure-python)."
191+
]
192+
},
193+
{
194+
"cell_type": "code",
195+
"execution_count": null,
196+
"metadata": {},
197+
"outputs": [],
198+
"source": [
199+
"# Frequency to run the schedule\n",
200+
"recurrence_trigger = RecurrenceTrigger(frequency=\"day\", interval=1)\n",
201+
"\n",
202+
"# Configure the online evaluation schedule\n",
203+
"evaluation_schedule = EvaluationSchedule(\n",
204+
" data=app_insights_config,\n",
205+
" evaluators=evaluators,\n",
206+
" trigger=recurrence_trigger,\n",
207+
" description=f\"{service_name} evaluation schedule\",\n",
208+
" properties=properties,\n",
209+
")\n",
210+
"\n",
211+
"# Create the online evaluation schedule\n",
212+
"created_evaluation_schedule = project_client.evaluations.create_or_replace_schedule(service_name, evaluation_schedule)\n",
213+
"print(\n",
214+
" f\"Successfully submitted the online evaluation schedule creation request - {created_evaluation_schedule.name}, currently in {created_evaluation_schedule.provisioning_state} state.\"\n",
215+
")"
216+
]
217+
},
218+
{
219+
"cell_type": "markdown",
220+
"metadata": {},
221+
"source": [
222+
"### Next steps \n",
223+
"\n",
224+
"Navigate to the \"Tracing\" tab in [Azure AI Studio](https://ai.azure.com/) to view your logged trace data alongside the evaluations produced by the Online Evaluation schedule. You can use the reference link provided in the \"Tracing\" tab to navigate to a comprehensive workbook in Application Insights for more details on how your application is performing. "
225+
]
226+
}
227+
],
228+
"metadata": {
229+
"kernelspec": {
230+
"display_name": "azureai-samples313",
231+
"language": "python",
232+
"name": "python3"
233+
},
234+
"language_info": {
235+
"codemirror_mode": {
236+
"name": "ipython",
237+
"version": 3
238+
},
239+
"file_extension": ".py",
240+
"mimetype": "text/x-python",
241+
"name": "python",
242+
"nbconvert_exporter": "python",
243+
"pygments_lexer": "ipython3"
244+
}
245+
},
246+
"nbformat": 4,
247+
"nbformat_minor": 2
248+
}
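
A brief, hypothetical follow-up to the notebook above: once the schedule exists you will likely want to inspect or pause it. The `get_schedule` and `disable_schedule` method names below are assumptions modeled on the `create_or_replace_schedule` call the notebook uses; confirm the exact operations in the [Online Evaluation documentation](https://aka.ms/GenAIMonitoringDoc).

```python
from azure.identity import DefaultAzureCredential
from azure.ai.project import AIProjectClient

project_client = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str="<connection_string>",
)

# Check the provisioning state of an existing online evaluation schedule
# (method name assumed; see the Online Evaluation documentation).
schedule = project_client.evaluations.get_schedule("<evaluation_name>")
print(schedule.name, schedule.provisioning_state)

# Stop the recurring evaluation when it is no longer needed (method name assumed).
project_client.evaluations.disable_schedule("<evaluation_name>")
```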
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
1+
---
2+
page_type: sample
3+
languages:
4+
- python
5+
products:
6+
- ai-services
7+
- azure-openai
8+
description: Evaluating remotely
9+
---
10+
11+
## Evaluating in the cloud
12+
13+
### Overview
14+
15+
This tutorial provides a step-by-step guide on how to evaluate generative AI applications or LLMs remotely in the cloud using a triggered evaluation.
16+
17+
### Objective
18+
19+
The main objective of this tutorial is to help users understand the process of evaluating a model remotely in the cloud by triggering an evaluation. This type of evaluation can be used for pre-deployment testing. By the end of this tutorial, you should be able to:
20+
21+
- Learn about evaluations
22+
- Evaluate an LLM using various evaluators from the Azure AI Evaluation SDK remotely in the cloud (see the sketch at the end of this README).
23+
24+
### Note
25+
Remote evaluations do not support `RetrievalEvaluator`, `ContentSafetyEvaluator`, and `QAEvaluator`.
26+
27+
#### Region Support for Evaluations
28+
29+
| Region | Hate and Unfairness, Sexual, Violent, Self-Harm, XPIA | Groundedness Pro | Protected Material |
30+
| - | - | - | - |
31+
| UK South | Will be deprecated 12/1/24 | no | no |
32+
| East US 2 | yes | yes | yes |
33+
| Sweden Central | yes | yes | no |
34+
| North Central US | yes | no | no |
35+
| France Central | yes | no | no |
36+
| Switzerland West | yes | no | no |
37+
38+
### Programming Languages
39+
- Python
40+
41+
### Estimated Runtime: 20 mins
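
Since this README describes triggering a one-off evaluation in the cloud, here is a hedged sketch of what that trigger can look like. The `Evaluation` and `Dataset` model names and the `evaluations.create` call are assumptions modeled on the online-evaluation notebook in this commit; refer to the remote evaluation sample and the [azure-ai-evaluation documentation](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk) for the exact API.

```python
from azure.identity import DefaultAzureCredential
from azure.ai.project import AIProjectClient
from azure.ai.project.models import Evaluation, Dataset, EvaluatorConfiguration
from azure.ai.evaluation import F1ScoreEvaluator, ViolenceEvaluator

# Connect to your Azure AI project.
project_client = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str="<connection_string>",
)

# Describe the evaluation: a dataset registered in the project plus the evaluators to run.
evaluation = Evaluation(
    display_name="Remote Evaluation",
    description="Pre-deployment evaluation of an LLM application",
    data=Dataset(id="<registered_dataset_id>"),  # JSONL data registered in your AI project
    evaluators={
        "f1_score": EvaluatorConfiguration(id=F1ScoreEvaluator.id),
        "violence": EvaluatorConfiguration(
            id=ViolenceEvaluator.id,
            init_params={"azure_ai_project": project_client.scope},
        ),
    },
)

# Trigger the evaluation; results appear under "Evaluations" in your Azure AI project.
response = project_client.evaluations.create(evaluation=evaluation)
print(response.id, response.status)
```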
