34 changes: 34 additions & 0 deletions scenarios/evaluate/evaluate_with_various_inputs/README.md
@@ -0,0 +1,34 @@
---
page_type: sample
languages:
- python
products:
- ai-services
- azure-openai
description: "Evaluate with various inputs: query/response versus conversation, csv versus jsonl."
---

## Evaluate With Queries and Responses, Conversations, Jsonl, and Csv

### Overview

This notebook walks through a local toy evaluation using a variety of input types and formats. Many built-in evaluators in the Evaluation SDK can evaluate either a set of inputs, such as a query and response, or derive a list of query/response pairs from a conversation and evaluate each of those pairs.

Additionally, the Evaluation SDK supports reading datasets from both jsonl and csv files. This notebook showcases evaluation using both file types.

### Objective

The main objective of this tutorial is to help users understand how to use the azure-ai-evaluation SDK to evaluate chat data stored in any valid combination of input type (query/response versus conversation) and file type (jsonl or csv). By the end of this tutorial, you should be able to:


- Understand how to perform an evaluation using either query/response values or conversations as inputs.
- Use both jsonl and csv files as inputs for `evaluate`.
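
For orientation, here is a minimal sketch of what an `evaluate` call looks like for both file types (the file paths and the toy evaluator are illustrative; the notebook builds the full version step by step):

```python
from azure.ai.evaluation import evaluate


# Toy evaluator: any callable whose parameters match the dataset's column names.
def length_ratio(query: str, response: str) -> dict:
    return {"result": len(query) / len(response)}


# The same call handles jsonl and csv files.
jsonl_results = evaluate(data="qr_data.jsonl", evaluators={"ratio": length_ratio})
csv_results = evaluate(data="qr_data.csv", evaluators={"ratio": length_ratio})
```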

### Basic requirements

This notebook is meant to be very lightweight; as such, it requires only the Evaluation SDK and the files included in this directory to function.

### Programming Languages
- Python

### Estimated Runtime: 10 mins
3 changes: 3 additions & 0 deletions scenarios/evaluate/evaluate_with_various_inputs/conversation_data.jsonl
@@ -0,0 +1,3 @@

{"conversation" : {"messages": [{"content": "What is the meaning of life?", "role" :"user"}, {"content": "42.", "role" :"assistant"}]}}
{"conversation" : {"messages": [{"content": "What atoms compose water?", "role" :"user"}, {"content": "Hydrogen and oxygen", "role" :"assistant"}, {"content": "What color is my shirt?", "role" :"user"}, {"content": "How would I know? I don't have eyes.", "role" :"assistant"}]}}
@@ -0,0 +1,284 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Evaluate with various inputs\n",
"\n",
"## Objective\n",
"\n",
"This notebook walks through how to use jsonl and csv files as inputs for evaluation, as well as both query/response and conversation-based inputs within those files. \n",
"\n",
"Note: When this notebook refers to 'conversations', we are referring to the definition of conversations defined [here](https://learn.microsoft.com/en-us/python/api/azure-ai-evaluation/azure.ai.evaluation.conversation?view=azure-python#attributes). This is a simplified variant on the broader Chat Protocol standard that is defined [here](https://github.com/microsoft/ai-chat-protocol)\n",
"\n",
"## Time\n",
"\n",
"You should expect to spend about 10 minutes running this notebook.\n",
"\n",
"## Setup\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Install the Evaluation SDK package\n",
"%pip install azure-ai-evaluation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Imports\n",
"Run this cell to import everything that is needed for this sample"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.evaluation import evaluate\n",
"from typing import List, Tuple, Dict, Optional, TypedDict\n",
"from pathlib import Path"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluator definition\n",
"\n",
"We define a toy math evaluator below to showcase multi-input handling. A variety of built-in evaluators have a similar input structure to the evaluator below, like the `ContentSafetyEvaluator` and the `ProtectedMaterialEvaluator`. However they all require API connections to function. To avoid that setup and keep this sample offline-capable, this toy evaluator requires no external support."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Underlying evaluation: The return ratio of the query to response lengths\n",
"def query_response_ratio(query: str, response: str) -> float:\n",
" return len(query) / len(response)\n",
"\n",
"\n",
"# Helper function that converts a conversation into a list of query-response pairs\n",
"def unwrap_conversation(conversation: Dict) -> List[Tuple[str, str]]:\n",
" queries = []\n",
" responses = []\n",
" for turn in conversation[\"messages\"]:\n",
" if turn[\"role\"] == \"user\":\n",
" queries.append(turn[\"content\"])\n",
" else:\n",
" responses.append(turn[\"content\"])\n",
" return zip(queries, responses)\n",
"\n",
"\n",
"# Define the output of the evaluation to make the sample repo's robust type requirements happy.\n",
"class EvalOutput(TypedDict, total=False):\n",
" result: float\n",
"\n",
"\n",
"# Actual evaluation function, which handles either a single query-response pair or a conversation\n",
"def simple_evaluator_function(\n",
" query: Optional[str] = None, response: Optional[str] = None, conversation: Optional[str] = None\n",
") -> EvalOutput:\n",
" if conversation is not None and query is None and response is None:\n",
" per_turn_results = [query_response_ratio(q, r) for q, r in unwrap_conversation(conversation)]\n",
" return {\"result\": sum(per_turn_results) / len(per_turn_results), \"per_turn_results\": per_turn_results}\n",
" if conversation is None and query is not None and response is not None:\n",
" return {\"result\": query_response_ratio(query, response)}\n",
" raise ValueError(\"Either a conversation or a query-response pair must be provided.\")\n",
"\n",
"\n",
"# Feel free to replace this assignment with more complex evaluation functions for further testing.\n",
"my_evaluator = simple_evaluator_function"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the evaluator defined above, we can input either a query and response together, or a conversation to receive a result:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Query+response evaluation\n",
"qr_result = my_evaluator(query=\"Hello\", response=\"world\")\n",
"print(f\"query/response output: {qr_result}\")\n",
"\n",
"conversation_input = {\n",
" \"messages\": [\n",
" {\"role\": \"user\", \"content\": \"Hello\"},\n",
" {\"role\": \"assistant\", \"content\": \"world\"},\n",
" {\"role\": \"user\", \"content\": \"Hello\"},\n",
" {\"role\": \"assistant\", \"content\": \"world and more words to change ratio\"},\n",
" ]\n",
"}\n",
"\n",
"# Conversation evaluation\n",
"conversation_result = my_evaluator(conversation=conversation_input)\n",
"print(f\"conversation output: {conversation_result}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Datasets\n",
"\n",
"Direct inputs into evaluators as shown above are useful for sanity checks. But for larger datasets we typically input the evaluator and a dataset file into the `evaluate` method. For that, we will need some data files.\n",
"\n",
"Included in this sample directory are 3 files:\n",
"- qr_data.jsonl contains query/response inputs in jsonl format.\n",
"- qr_data.csv contains query/response inputs in csv format.\n",
"- conversation_data.jsonl contains conversation inputs in jsonl format.\n",
"\n",
"Conversations and other complex inputs are not supported via csv inputs, so there is no corresponding \"conversation_data.csv\" file. Each file contains the same three query/response pairs, but in the conversation dataset, the second and third pairs are wrapped into a single, 4-turn conversation.\n",
"\n",
"Double check the contents of these files by changing the print statement below. You might need to alter the `path_to_data` value depending on where your notebook is running:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Change this depending on where your notebook is running.\n",
"# Default value assumes that the notebook is running in the root of the repository.\n",
"path_to_data = \"./scenarios/evaluate/evaluate_with_various_inputs\"\n",
"# Define data path variables.\n",
"qr_js_data = path_to_data + \"/qr_data.jsonl\"\n",
"qr_csv_data = path_to_data + \"/qr_data.csv\"\n",
"conversation_js_data = path_to_data + \"/conversation_data.jsonl\"\n",
"\n",
"# Change variable referenced here to check different files\n",
"with Path(qr_js_data).open() as f:\n",
" print(f.read())"
]
},
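{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before running a full evaluation, it can help to see how a csv row maps onto the evaluator: each column name becomes a keyword argument. The sketch below imitates that mapping with the standard library; the SDK's actual column-mapping logic is more involved:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import csv\n",
"\n",
"# Sketch: feed each csv row to the evaluator by column name.\n",
"with Path(qr_csv_data).open(newline=\"\") as f:\n",
"    for row in csv.DictReader(f):\n",
"        print(my_evaluator(**row))"
]
},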
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluation\n",
"\n",
"Now that we have some datasets and an evaluator, and can pass both of them into evaluate. Starting with query/response jsonl inputs:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"js_qr_output = evaluate(\n",
" data=qr_js_data,\n",
" evaluators={\"test\": my_evaluator},\n",
" _use_pf_client=False, # Avoid using PF dependencies to further simplify the example\n",
")\n",
"\n",
"eval_row_results = [row[\"outputs.test.result\"] for row in js_qr_output[\"rows\"]]\n",
"metrics = js_qr_output[\"metrics\"]\n",
"\n",
"print(f\"query/response jsonl results: {eval_row_results} \\nwith overall metrics: {metrics}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's run the evaluation using the conversation-based jsonl data. Notice that the evaluator works for both conversations that only convert into a single query response pair, and for conversations that convert into multiple query response pairs. It also produces an extra output called `per_turn_results`, which allows you to check the results of each query-response evaluation that comprised a conversation, since the top-level result is an average of these values. This `per_turn_results` value is also produced by built-in evaluators when evaluating conversations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"js_convo_output = evaluate(\n",
" data=conversation_js_data,\n",
" evaluators={\"test\": my_evaluator},\n",
" _use_pf_client=False,\n",
")\n",
"\n",
"eval_row_results = [row[\"outputs.test.result\"] for row in js_convo_output[\"rows\"]]\n",
"per_turn_results = [row[\"outputs.test.per_turn_results\"] for row in js_convo_output[\"rows\"]]\n",
"metrics = js_convo_output[\"metrics\"]\n",
"\n",
"print(\n",
" f\"\"\"conversation jsonl results: {eval_row_results} \n",
"with per turn results: {per_turn_results} \n",
"and overall metrics: {metrics}\"\"\"\n",
")"
]
},
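{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make that relationship concrete, each conversation row's top-level `result` should equal the mean of its `per_turn_results`. A quick sanity check over the outputs above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Each conversation's result should be the mean of its per-turn results.\n",
"for result, per_turn in zip(eval_row_results, per_turn_results):\n",
"    assert abs(result - sum(per_turn) / len(per_turn)) < 1e-9\n",
"print(\"Top-level results match the per-turn averages.\")"
]
},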
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we run the evaluation using the csv file as input. As expected, the results are the same as the equivalent jsonl file:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"csv_qr_output = evaluate(\n",
" data=qr_csv_data,\n",
" evaluators={\"test\": my_evaluator},\n",
" _use_pf_client=False,\n",
")\n",
"\n",
"eval_row_results = [row[\"outputs.test.result\"] for row in csv_qr_output[\"rows\"]]\n",
"metrics = csv_qr_output[\"metrics\"]\n",
"\n",
"print(f\"Query/response csv results: {eval_row_results} \\nwith overall metrics: {metrics}\")"
]
},
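{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, `evaluate` can also persist its full result object to disk via its `output_path` argument. A minimal sketch (the output file name is illustrative; check the SDK reference for your installed version):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: write the full result object to a json file for later inspection.\n",
"saved_output = evaluate(\n",
"    data=qr_csv_data,\n",
"    evaluators={\"test\": my_evaluator},\n",
"    output_path=\"./eval_results.json\",  # illustrative output location\n",
"    _use_pf_client=False,\n",
")"
]
},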
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"This sample has shown various ways to input data using `evaluate`, and the difference between query/response and conversation-based inputs. As the SDK is improved, more of the built-in evaluators will continue to support a larger variety of input schemes. We encourage users to leverage which ever options suit their needs."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
4 changes: 4 additions & 0 deletions scenarios/evaluate/evaluate_with_various_inputs/qr_data.csv
@@ -0,0 +1,4 @@
query,response
What is the meaning of life?,42.
What atoms compose water?,Hydrogen and oxygen.
What color is my shirt?,How would I know? I don't have eyes.
3 changes: 3 additions & 0 deletions scenarios/evaluate/evaluate_with_various_inputs/qr_data.jsonl
@@ -0,0 +1,3 @@
{"query":"What is the meaning of life?","response":"42."}
{"query":"What atoms compose water?","response":"Hydrogen and oxygen."}
{"query":"What color is my shirt?","response":"How would I know? I don't have eyes."}