Commit 015cdbf

Add new sample for evaluator input types (#187)
* input types sample
* run black
* change singleton references to query/response
* CI
* run black again
* that's ruff buddy
* pre commit run
1 parent cc3ffd0 commit 015cdbf

5 files changed: +328 -0 lines changed
Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
---
page_type: sample
languages:
- python
products:
- ai-services
- azure-openai
description: "Evaluate with various inputs: query/response versus conversation, csv versus jsonl."
---

## Evaluate With Queries and Responses, Conversations, Jsonl, and Csv

### Overview

This notebook walks through a local toy evaluation using a variety of input types and formats. Many built-in evaluators in the Evaluation SDK can evaluate either a set of inputs, such as a query and response, or they can derive a list of query/response pairs from a conversation and evaluate those instead.

Additionally, the Evaluation SDK supports reading datasets from both jsonl files and csv files. This notebook showcases evaluation using both file types.

### Objective

The main objective of this tutorial is to help users understand how to use the azure-ai-evaluation SDK to evaluate chat data stored in any valid combination of variable type (query/response versus conversation) and file type (jsonl or csv). By the end of this tutorial, you should be able to:

- Understand how to perform an evaluation using either query/response values or conversations as inputs.
- Use both jsonl and csv files as inputs for `evaluate`.

A minimal sketch of both input shapes appears right after this README.

### Basic requirements

This notebook is meant to be very lightweight; as such, it only requires the Evaluation SDK and the files included in this directory to function.

### Programming Languages

- Python

### Estimated Runtime: 10 mins
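
As a minimal illustration of the two input shapes described above, the sketch below defines a self-contained stand-in for the sample's toy evaluator (`ratio_evaluator` is a hypothetical name, not an SDK API) and calls it both ways:

```python
# A minimal, self-contained sketch of the two evaluator input shapes.
# `ratio_evaluator` is a hypothetical stand-in for the notebook's toy evaluator, not an SDK API.


def ratio_evaluator(query=None, response=None, conversation=None):
    """Return the query/response length ratio, averaged over turns for conversations."""
    if conversation is not None:
        turns = conversation["messages"]
        queries = [t["content"] for t in turns if t["role"] == "user"]
        responses = [t["content"] for t in turns if t["role"] == "assistant"]
        ratios = [len(q) / len(r) for q, r in zip(queries, responses)]
        return {"result": sum(ratios) / len(ratios)}
    return {"result": len(query) / len(response)}


# Shape 1: a single query/response pair.
print(ratio_evaluator(query="What atoms compose water?", response="Hydrogen and oxygen."))

# Shape 2: a conversation, i.e. a dict holding a "messages" list of user/assistant turns.
conversation = {
    "messages": [
        {"role": "user", "content": "What is the meaning of life?"},
        {"role": "assistant", "content": "42."},
    ]
}
print(ratio_evaluator(conversation=conversation))
```

Both calls return a dictionary with a `result` key, mirroring the output shape of the toy evaluator defined in the sample notebook.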
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
{"conversation" : {"messages": [{"content": "What is the meaning of life?", "role" :"user"}, {"content": "42.", "role" :"assistant"}]}}
{"conversation" : {"messages": [{"content": "What atoms compose water?", "role" :"user"}, {"content": "Hydrogen and oxygen", "role" :"assistant"}, {"content": "What color is my shirt?", "role" :"user"}, {"content": "How would I know? I don't have eyes.", "role" :"assistant"}]}}
Lines changed: 284 additions & 0 deletions
@@ -0,0 +1,284 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Evaluate with various inputs\n",
    "\n",
    "## Objective\n",
    "\n",
    "This notebook walks through how to use jsonl and csv files as inputs for evaluation, as well as both query/response and conversation-based inputs within those files.\n",
    "\n",
    "Note: When this notebook refers to 'conversations', we are referring to the definition of a conversation given [here](https://learn.microsoft.com/en-us/python/api/azure-ai-evaluation/azure.ai.evaluation.conversation?view=azure-python#attributes). This is a simplified variant of the broader Chat Protocol standard defined [here](https://github.com/microsoft/ai-chat-protocol).\n",
    "\n",
    "## Time\n",
    "\n",
    "You should expect to spend about 10 minutes running this notebook.\n",
    "\n",
    "## Setup\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install the Evaluation SDK package\n",
    "%pip install azure-ai-evaluation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Imports\n",
    "Run this cell to import everything that is needed for this sample."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azure.ai.evaluation import evaluate\n",
    "from typing import List, Tuple, Dict, Optional, TypedDict\n",
    "from pathlib import Path"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluator definition\n",
    "\n",
    "We define a toy math evaluator below to showcase multi-input handling. A variety of built-in evaluators, such as the `ContentSafetyEvaluator` and the `ProtectedMaterialEvaluator`, have an input structure similar to the evaluator below. However, they all require API connections to function. To avoid that setup and keep this sample offline-capable, this toy evaluator requires no external support."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Underlying evaluation: the ratio of query length to response length\n",
    "def query_response_ratio(query: str, response: str) -> float:\n",
    "    return len(query) / len(response)\n",
    "\n",
    "\n",
    "# Helper function that converts a conversation into a list of query-response pairs\n",
    "def unwrap_conversation(conversation: Dict) -> List[Tuple[str, str]]:\n",
    "    queries = []\n",
    "    responses = []\n",
    "    for turn in conversation[\"messages\"]:\n",
    "        if turn[\"role\"] == \"user\":\n",
    "            queries.append(turn[\"content\"])\n",
    "        else:\n",
    "            responses.append(turn[\"content\"])\n",
    "    return list(zip(queries, responses))\n",
    "\n",
    "\n",
    "# Define the output of the evaluation to satisfy the sample repo's type requirements.\n",
    "class EvalOutput(TypedDict, total=False):\n",
    "    result: float\n",
    "    per_turn_results: List[float]\n",
    "\n",
    "\n",
    "# Actual evaluation function, which handles either a single query-response pair or a conversation\n",
    "def simple_evaluator_function(\n",
    "    query: Optional[str] = None, response: Optional[str] = None, conversation: Optional[Dict] = None\n",
    ") -> EvalOutput:\n",
    "    if conversation is not None and query is None and response is None:\n",
    "        per_turn_results = [query_response_ratio(q, r) for q, r in unwrap_conversation(conversation)]\n",
    "        return {\"result\": sum(per_turn_results) / len(per_turn_results), \"per_turn_results\": per_turn_results}\n",
    "    if conversation is None and query is not None and response is not None:\n",
    "        return {\"result\": query_response_ratio(query, response)}\n",
    "    raise ValueError(\"Either a conversation or a query-response pair must be provided.\")\n",
    "\n",
    "\n",
    "# Feel free to replace this assignment with more complex evaluation functions for further testing.\n",
    "my_evaluator = simple_evaluator_function"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the evaluator defined above, we can pass in either a query and response together, or a conversation, to receive a result:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Query+response evaluation\n",
    "qr_result = my_evaluator(query=\"Hello\", response=\"world\")\n",
    "print(f\"query/response output: {qr_result}\")\n",
    "\n",
    "conversation_input = {\n",
    "    \"messages\": [\n",
    "        {\"role\": \"user\", \"content\": \"Hello\"},\n",
    "        {\"role\": \"assistant\", \"content\": \"world\"},\n",
    "        {\"role\": \"user\", \"content\": \"Hello\"},\n",
    "        {\"role\": \"assistant\", \"content\": \"world and more words to change ratio\"},\n",
    "    ]\n",
    "}\n",
    "\n",
    "# Conversation evaluation\n",
    "conversation_result = my_evaluator(conversation=conversation_input)\n",
    "print(f\"conversation output: {conversation_result}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Datasets\n",
    "\n",
    "Direct inputs into evaluators as shown above are useful for sanity checks, but for larger datasets we typically pass the evaluator and a dataset file into the `evaluate` method. For that, we will need some data files.\n",
    "\n",
    "Included in this sample directory are 3 files:\n",
    "- qr_data.jsonl contains query/response inputs in jsonl format.\n",
    "- qr_data.csv contains query/response inputs in csv format.\n",
    "- conversation_data.jsonl contains conversation inputs in jsonl format.\n",
    "\n",
    "Conversations and other complex inputs are not supported via csv inputs, so there is no corresponding \"conversation_data.csv\" file. Each file contains the same three query/response pairs, but in the conversation dataset, the second and third pairs are wrapped into a single, 4-turn conversation.\n",
    "\n",
    "Double-check the contents of these files by changing the print statement below. You might need to alter the `path_to_data` value depending on where your notebook is running:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Change this depending on where your notebook is running.\n",
    "# Default value assumes that the notebook is running in the root of the repository.\n",
    "path_to_data = \"./scenarios/evaluate/evaluate_with_various_inputs\"\n",
    "# Define data path variables.\n",
    "qr_js_data = path_to_data + \"/qr_data.jsonl\"\n",
    "qr_csv_data = path_to_data + \"/qr_data.csv\"\n",
    "conversation_js_data = path_to_data + \"/conversation_data.jsonl\"\n",
    "\n",
    "# Change the variable referenced here to check different files\n",
    "with Path(qr_js_data).open() as f:\n",
    "    print(f.read())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation\n",
    "\n",
    "Now that we have some datasets and an evaluator, we can pass both of them into `evaluate`. Starting with query/response jsonl inputs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "js_qr_output = evaluate(\n",
    "    data=qr_js_data,\n",
    "    evaluators={\"test\": my_evaluator},\n",
    "    _use_pf_client=False,  # Avoid using PF dependencies to further simplify the example\n",
    ")\n",
    "\n",
    "eval_row_results = [row[\"outputs.test.result\"] for row in js_qr_output[\"rows\"]]\n",
    "metrics = js_qr_output[\"metrics\"]\n",
    "\n",
    "print(f\"query/response jsonl results: {eval_row_results} \\nwith overall metrics: {metrics}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's run the evaluation using the conversation-based jsonl data. Notice that the evaluator works both for conversations that convert into a single query/response pair and for conversations that convert into multiple query/response pairs. It also produces an extra output called `per_turn_results`, which lets you check the result of each query/response evaluation that makes up a conversation, since the top-level result is an average of these values. This `per_turn_results` value is also produced by built-in evaluators when evaluating conversations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "js_convo_output = evaluate(\n",
    "    data=conversation_js_data,\n",
    "    evaluators={\"test\": my_evaluator},\n",
    "    _use_pf_client=False,\n",
    ")\n",
    "\n",
    "eval_row_results = [row[\"outputs.test.result\"] for row in js_convo_output[\"rows\"]]\n",
    "per_turn_results = [row[\"outputs.test.per_turn_results\"] for row in js_convo_output[\"rows\"]]\n",
    "metrics = js_convo_output[\"metrics\"]\n",
    "\n",
    "print(\n",
    "    f\"\"\"conversation jsonl results: {eval_row_results}\n",
    "with per turn results: {per_turn_results}\n",
    "and overall metrics: {metrics}\"\"\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we run the evaluation using the csv file as input. As expected, the results are the same as with the equivalent jsonl file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "csv_qr_output = evaluate(\n",
    "    data=qr_csv_data,\n",
    "    evaluators={\"test\": my_evaluator},\n",
    "    _use_pf_client=False,\n",
    ")\n",
    "\n",
    "eval_row_results = [row[\"outputs.test.result\"] for row in csv_qr_output[\"rows\"]]\n",
    "metrics = csv_qr_output[\"metrics\"]\n",
    "\n",
    "print(f\"Query/response csv results: {eval_row_results} \\nwith overall metrics: {metrics}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "This sample has shown various ways to input data using `evaluate`, as well as the difference between query/response and conversation-based inputs. As the SDK evolves, built-in evaluators will continue to support a wider variety of input schemes. We encourage users to use whichever options suit their needs."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
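
As a follow-up to the notebook's conclusion, the loop below is a minimal sketch of running all three bundled datasets through `evaluate` in one pass. It assumes the notebook's `my_evaluator` and data path variables (`qr_js_data`, `qr_csv_data`, `conversation_js_data`) are already defined in the session, and reuses the same private `_use_pf_client=False` toggle the notebook uses:

```python
from azure.ai.evaluation import evaluate

# Assumes my_evaluator, qr_js_data, qr_csv_data, and conversation_js_data
# are already defined by the notebook cells above.
datasets = {
    "query/response jsonl": qr_js_data,
    "query/response csv": qr_csv_data,
    "conversation jsonl": conversation_js_data,
}

for label, data_path in datasets.items():
    output = evaluate(
        data=data_path,
        evaluators={"test": my_evaluator},
        _use_pf_client=False,  # same private toggle the notebook uses to skip PF dependencies
    )
    row_results = [row["outputs.test.result"] for row in output["rows"]]
    print(f"{label}: {row_results} (metrics: {output['metrics']})")
```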
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
query,response
What is the meaning of life?,42.
What atoms compose water?,Hydrogen and oxygen.
What color is my shirt?,How would I know? I don't have eyes.
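
For reference, each csv row maps directly onto the evaluator's `query` and `response` keyword arguments. A minimal sketch of that mapping (standard library only; `my_evaluator` is assumed to be the toy evaluator defined in the notebook above):

```python
import csv

# Each row of qr_data.csv becomes one query/response evaluation.
# `my_evaluator` is assumed to be the toy evaluator defined in the notebook above.
with open("qr_data.csv", newline="") as f:
    for row in csv.DictReader(f):
        result = my_evaluator(query=row["query"], response=row["response"])
        print(f"{row['query']!r} -> {result}")
```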
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
{"query":"What is the meaning of life?","response":"42."}
{"query":"What atoms compose water?","response":"Hydrogen and oxygen."}
{"query":"What color is my shirt?","response":"How would I know? I don't have eyes."}
