|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "# Evaluate with various inputs\n", |
| 8 | + "\n", |
| 9 | + "## Objective\n", |
| 10 | + "\n", |
| 11 | + "This notebook walks through how to use jsonl and csv files as inputs for evaluation, as well as both query/response and conversation-based inputs within those files. \n", |
| 12 | + "\n", |
| 13 | + "Note: When this notebook refers to 'conversations', we are referring to the definition of conversations defined [here](https://learn.microsoft.com/en-us/python/api/azure-ai-evaluation/azure.ai.evaluation.conversation?view=azure-python#attributes). This is a simplified variant on the broader Chat Protocol standard that is defined [here](https://github.com/microsoft/ai-chat-protocol)\n", |
| 14 | + "\n", |
| 15 | + "## Time\n", |
| 16 | + "\n", |
| 17 | + "You should expect to spend about 10 minutes running this notebook.\n", |
| 18 | + "\n", |
| 19 | + "## Setup\n" |
| 20 | + ] |
| 21 | + }, |
| 22 | + { |
| 23 | + "cell_type": "code", |
| 24 | + "execution_count": null, |
| 25 | + "metadata": {}, |
| 26 | + "outputs": [], |
| 27 | + "source": [ |
| 28 | + "# Install the Evaluation SDK package\n", |
| 29 | + "%pip install azure-ai-evaluation" |
| 30 | + ] |
| 31 | + }, |
| 32 | + { |
| 33 | + "cell_type": "markdown", |
| 34 | + "metadata": {}, |
| 35 | + "source": [ |
| 36 | + "### Imports\n", |
| 37 | + "Run this cell to import everything that is needed for this sample" |
| 38 | + ] |
| 39 | + }, |
| 40 | + { |
| 41 | + "cell_type": "code", |
| 42 | + "execution_count": null, |
| 43 | + "metadata": {}, |
| 44 | + "outputs": [], |
| 45 | + "source": [ |
| 46 | + "from azure.ai.evaluation import evaluate\n", |
| 47 | + "from typing import List, Tuple, Dict, Optional, TypedDict\n", |
| 48 | + "from pathlib import Path" |
| 49 | + ] |
| 50 | + }, |
| 51 | + { |
| 52 | + "cell_type": "markdown", |
| 53 | + "metadata": {}, |
| 54 | + "source": [ |
| 55 | + "## Evaluator definition\n", |
| 56 | + "\n", |
| 57 | + "We define a toy math evaluator below to showcase multi-input handling. A variety of built-in evaluators have a similar input structure to the evaluator below, like the `ContentSafetyEvaluator` and the `ProtectedMaterialEvaluator`. However they all require API connections to function. To avoid that setup and keep this sample offline-capable, this toy evaluator requires no external support." |
| 58 | + ] |
| 59 | + }, |
| 60 | + { |
| 61 | + "cell_type": "code", |
| 62 | + "execution_count": null, |
| 63 | + "metadata": {}, |
| 64 | + "outputs": [], |
| 65 | + "source": [ |
| 66 | + "# Underlying evaluation: The return ratio of the query to response lengths\n", |
| 67 | + "def query_response_ratio(query: str, response: str) -> float:\n", |
| 68 | + " return len(query) / len(response)\n", |
| 69 | + "\n", |
| 70 | + "\n", |
| 71 | + "# Helper function that converts a conversation into a list of query-response pairs\n", |
| 72 | + "def unwrap_conversation(conversation: Dict) -> List[Tuple[str, str]]:\n", |
| 73 | + " queries = []\n", |
| 74 | + " responses = []\n", |
| 75 | + " for turn in conversation[\"messages\"]:\n", |
| 76 | + " if turn[\"role\"] == \"user\":\n", |
| 77 | + " queries.append(turn[\"content\"])\n", |
| 78 | + " else:\n", |
| 79 | + " responses.append(turn[\"content\"])\n", |
| 80 | + " return zip(queries, responses)\n", |
| 81 | + "\n", |
| 82 | + "\n", |
| 83 | + "# Define the output of the evaluation to make the sample repo's robust type requirements happy.\n", |
| 84 | + "class EvalOutput(TypedDict, total=False):\n", |
| 85 | + " result: float\n", |
| 86 | + "\n", |
| 87 | + "\n", |
| 88 | + "# Actual evaluation function, which handles either a single query-response pair or a conversation\n", |
| 89 | + "def simple_evaluator_function(\n", |
| 90 | + " query: Optional[str] = None, response: Optional[str] = None, conversation: Optional[str] = None\n", |
| 91 | + ") -> EvalOutput:\n", |
| 92 | + " if conversation is not None and query is None and response is None:\n", |
| 93 | + " per_turn_results = [query_response_ratio(q, r) for q, r in unwrap_conversation(conversation)]\n", |
| 94 | + " return {\"result\": sum(per_turn_results) / len(per_turn_results), \"per_turn_results\": per_turn_results}\n", |
| 95 | + " if conversation is None and query is not None and response is not None:\n", |
| 96 | + " return {\"result\": query_response_ratio(query, response)}\n", |
| 97 | + " raise ValueError(\"Either a conversation or a query-response pair must be provided.\")\n", |
| 98 | + "\n", |
| 99 | + "\n", |
| 100 | + "# Feel free to replace this assignment with more complex evaluation functions for further testing.\n", |
| 101 | + "my_evaluator = simple_evaluator_function" |
| 102 | + ] |
| 103 | + }, |
| 104 | + { |
| 105 | + "cell_type": "markdown", |
| 106 | + "metadata": {}, |
| 107 | + "source": [ |
| 108 | + "With the evaluator defined above, we can input either a query and response together, or a conversation to receive a result:" |
| 109 | + ] |
| 110 | + }, |
| 111 | + { |
| 112 | + "cell_type": "code", |
| 113 | + "execution_count": null, |
| 114 | + "metadata": {}, |
| 115 | + "outputs": [], |
| 116 | + "source": [ |
| 117 | + "# Query+response evaluation\n", |
| 118 | + "qr_result = my_evaluator(query=\"Hello\", response=\"world\")\n", |
| 119 | + "print(f\"query/response output: {qr_result}\")\n", |
| 120 | + "\n", |
| 121 | + "conversation_input = {\n", |
| 122 | + " \"messages\": [\n", |
| 123 | + " {\"role\": \"user\", \"content\": \"Hello\"},\n", |
| 124 | + " {\"role\": \"assistant\", \"content\": \"world\"},\n", |
| 125 | + " {\"role\": \"user\", \"content\": \"Hello\"},\n", |
| 126 | + " {\"role\": \"assistant\", \"content\": \"world and more words to change ratio\"},\n", |
| 127 | + " ]\n", |
| 128 | + "}\n", |
| 129 | + "\n", |
| 130 | + "# Conversation evaluation\n", |
| 131 | + "conversation_result = my_evaluator(conversation=conversation_input)\n", |
| 132 | + "print(f\"conversation output: {conversation_result}\")" |
| 133 | + ] |
| 134 | + }, |
| 135 | + { |
| 136 | + "cell_type": "markdown", |
| 137 | + "metadata": {}, |
| 138 | + "source": [ |
| 139 | + "## Datasets\n", |
| 140 | + "\n", |
| 141 | + "Direct inputs into evaluators as shown above are useful for sanity checks. But for larger datasets we typically input the evaluator and a dataset file into the `evaluate` method. For that, we will need some data files.\n", |
| 142 | + "\n", |
| 143 | + "Included in this sample directory are 3 files:\n", |
| 144 | + "- qr_data.jsonl contains query/response inputs in jsonl format.\n", |
| 145 | + "- qr_data.csv contains query/response inputs in csv format.\n", |
| 146 | + "- conversation_data.jsonl contains conversation inputs in jsonl format.\n", |
| 147 | + "\n", |
| 148 | + "Conversations and other complex inputs are not supported via csv inputs, so there is no corresponding \"conversation_data.csv\" file. Each file contains the same three query/response pairs, but in the conversation dataset, the second and third pairs are wrapped into a single, 4-turn conversation.\n", |
| 149 | + "\n", |
| 150 | + "Double check the contents of these files by changing the print statement below. You might need to alter the `path_to_data` value depending on where your notebook is running:" |
| 151 | + ] |
| 152 | + }, |
| 153 | + { |
| 154 | + "cell_type": "code", |
| 155 | + "execution_count": null, |
| 156 | + "metadata": {}, |
| 157 | + "outputs": [], |
| 158 | + "source": [ |
| 159 | + "# Change this depending on where your notebook is running.\n", |
| 160 | + "# Default value assumes that the notebook is running in the root of the repository.\n", |
| 161 | + "path_to_data = \"./scenarios/evaluate/evaluate_with_various_inputs\"\n", |
| 162 | + "# Define data path variables.\n", |
| 163 | + "qr_js_data = path_to_data + \"/qr_data.jsonl\"\n", |
| 164 | + "qr_csv_data = path_to_data + \"/qr_data.csv\"\n", |
| 165 | + "conversation_js_data = path_to_data + \"/conversation_data.jsonl\"\n", |
| 166 | + "\n", |
| 167 | + "# Change variable referenced here to check different files\n", |
| 168 | + "with Path(qr_js_data).open() as f:\n", |
| 169 | + " print(f.read())" |
| 170 | + ] |
| 171 | + }, |
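| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "If the previous cell failed because the data files are not present at `path_to_data` (for example, when running this notebook outside the sample repository), the optional cell below is a minimal sketch that writes toy stand-ins with the same schema. It assumes the column names match the evaluator's parameter names (`query`, `response`, and `conversation`), and its row contents are illustrative placeholders rather than the repository's actual data. After running it, re-run the cell above to inspect the generated files." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "# Optional: write toy stand-in data files. The pairs below are placeholders,\n", |
| | + "# not the repository's actual rows; only the column schema matters here.\n", |
| | + "import csv\n", |
| | + "import json\n", |
| | + "\n", |
| | + "pairs = [(\"What is 2+2?\", \"4\"), (\"Name a color.\", \"Blue\"), (\"Say hi.\", \"Hi there!\")]\n", |
| | + "\n", |
| | + "Path(path_to_data).mkdir(parents=True, exist_ok=True)\n", |
| | + "\n", |
| | + "# Query/response jsonl: one {\"query\": ..., \"response\": ...} object per line.\n", |
| | + "with Path(qr_js_data).open(\"w\") as f:\n", |
| | + " for q, r in pairs:\n", |
| | + " f.write(json.dumps({\"query\": q, \"response\": r}) + \"\\n\")\n", |
| | + "\n", |
| | + "# Query/response csv: a header row followed by one row per pair.\n", |
| | + "with Path(qr_csv_data).open(\"w\", newline=\"\") as f:\n", |
| | + " writer = csv.writer(f)\n", |
| | + " writer.writerow([\"query\", \"response\"])\n", |
| | + " writer.writerows(pairs)\n", |
| | + "\n", |
| | + "\n", |
| | + "# Conversation jsonl: the first pair becomes a 2-turn conversation, and the\n", |
| | + "# second and third pairs are wrapped into a single 4-turn conversation.\n", |
| | + "def to_conversation(*qr_pairs):\n", |
| | + " messages = []\n", |
| | + " for q, r in qr_pairs:\n", |
| | + " messages.append({\"role\": \"user\", \"content\": q})\n", |
| | + " messages.append({\"role\": \"assistant\", \"content\": r})\n", |
| | + " return {\"messages\": messages}\n", |
| | + "\n", |
| | + "\n", |
| | + "with Path(conversation_js_data).open(\"w\") as f:\n", |
| | + " f.write(json.dumps({\"conversation\": to_conversation(pairs[0])}) + \"\\n\")\n", |
| | + " f.write(json.dumps({\"conversation\": to_conversation(pairs[1], pairs[2])}) + \"\\n\")" |
| | + ] |
| | + }, |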
| 172 | + { |
| 173 | + "cell_type": "markdown", |
| 174 | + "metadata": {}, |
| 175 | + "source": [ |
| 176 | + "## Evaluation\n", |
| 177 | + "\n", |
| 178 | + "Now that we have some datasets and an evaluator, and can pass both of them into evaluate. Starting with query/response jsonl inputs:" |
| 179 | + ] |
| 180 | + }, |
| 181 | + { |
| 182 | + "cell_type": "code", |
| 183 | + "execution_count": null, |
| 184 | + "metadata": {}, |
| 185 | + "outputs": [], |
| 186 | + "source": [ |
| 187 | + "js_qr_output = evaluate(\n", |
| 188 | + " data=qr_js_data,\n", |
| 189 | + " evaluators={\"test\": my_evaluator},\n", |
| 190 | + " _use_pf_client=False, # Avoid using PF dependencies to further simplify the example\n", |
| 191 | + ")\n", |
| 192 | + "\n", |
| 193 | + "eval_row_results = [row[\"outputs.test.result\"] for row in js_qr_output[\"rows\"]]\n", |
| 194 | + "metrics = js_qr_output[\"metrics\"]\n", |
| 195 | + "\n", |
| 196 | + "print(f\"query/response jsonl results: {eval_row_results} \\nwith overall metrics: {metrics}\")" |
| 197 | + ] |
| 198 | + }, |
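| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "The jsonl columns in this dataset happen to share names with the evaluator's parameters (`query` and `response`), which is what lets `evaluate` wire them up automatically. If your column names differ, the SDK's `evaluator_config` argument can remap them. The cell below is an illustrative sketch using an identity mapping; the exact shape shown (the `column_mapping` key and the `${data.<column>}` syntax) is an assumption to verify against the documentation for your installed version of `azure-ai-evaluation`." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "# Illustrative only: an explicit (identity) column mapping for the 'test' evaluator.\n", |
| | + "# The evaluator_config shape is an assumption; check it against your SDK version.\n", |
| | + "mapped_output = evaluate(\n", |
| | + " data=qr_js_data,\n", |
| | + " evaluators={\"test\": my_evaluator},\n", |
| | + " evaluator_config={\"test\": {\"column_mapping\": {\"query\": \"${data.query}\", \"response\": \"${data.response}\"}}},\n", |
| | + " _use_pf_client=False,\n", |
| | + ")\n", |
| | + "\n", |
| | + "print([row[\"outputs.test.result\"] for row in mapped_output[\"rows\"]])" |
| | + ] |
| | + }, |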
| 199 | + { |
| 200 | + "cell_type": "markdown", |
| 201 | + "metadata": {}, |
| 202 | + "source": [ |
| 203 | + "Now let's run the evaluation using the conversation-based jsonl data. Notice that the evaluator works for both conversations that only convert into a single query response pair, and for conversations that convert into multiple query response pairs. It also produces an extra output called `per_turn_results`, which allows you to check the results of each query-response evaluation that comprised a conversation, since the top-level result is an average of these values. This `per_turn_results` value is also produced by built-in evaluators when evaluating conversations." |
| 204 | + ] |
| 205 | + }, |
| 206 | + { |
| 207 | + "cell_type": "code", |
| 208 | + "execution_count": null, |
| 209 | + "metadata": {}, |
| 210 | + "outputs": [], |
| 211 | + "source": [ |
| 212 | + "js_convo_output = evaluate(\n", |
| 213 | + " data=conversation_js_data,\n", |
| 214 | + " evaluators={\"test\": my_evaluator},\n", |
| 215 | + " _use_pf_client=False,\n", |
| 216 | + ")\n", |
| 217 | + "\n", |
| 218 | + "eval_row_results = [row[\"outputs.test.result\"] for row in js_convo_output[\"rows\"]]\n", |
| 219 | + "per_turn_results = [row[\"outputs.test.per_turn_results\"] for row in js_convo_output[\"rows\"]]\n", |
| 220 | + "metrics = js_convo_output[\"metrics\"]\n", |
| 221 | + "\n", |
| 222 | + "print(\n", |
| 223 | + " f\"\"\"conversation jsonl results: {eval_row_results} \n", |
| 224 | + "with per turn results: {per_turn_results} \n", |
| 225 | + "and overall metrics: {metrics}\"\"\"\n", |
| 226 | + ")" |
| 227 | + ] |
| 228 | + }, |
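| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "As a quick sanity check, we can verify the relationship described above: each conversation's top-level result is the mean of its per-turn results." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "# Each conversation-level result should be the mean of its per-turn results.\n", |
| | + "for result, turns in zip(eval_row_results, per_turn_results):\n", |
| | + " assert abs(result - sum(turns) / len(turns)) < 1e-9\n", |
| | + "print(\"Each conversation result equals the mean of its per-turn results.\")" |
| | + ] |
| | + }, |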
| 229 | + { |
| 230 | + "cell_type": "markdown", |
| 231 | + "metadata": {}, |
| 232 | + "source": [ |
| 233 | + "Next we run the evaluation using the csv file as input. As expected, the results are the same as the equivalent jsonl file:" |
| 234 | + ] |
| 235 | + }, |
| 236 | + { |
| 237 | + "cell_type": "code", |
| 238 | + "execution_count": null, |
| 239 | + "metadata": {}, |
| 240 | + "outputs": [], |
| 241 | + "source": [ |
| 242 | + "csv_qr_output = evaluate(\n", |
| 243 | + " data=qr_csv_data,\n", |
| 244 | + " evaluators={\"test\": my_evaluator},\n", |
| 245 | + " _use_pf_client=False,\n", |
| 246 | + ")\n", |
| 247 | + "\n", |
| 248 | + "eval_row_results = [row[\"outputs.test.result\"] for row in csv_qr_output[\"rows\"]]\n", |
| 249 | + "metrics = csv_qr_output[\"metrics\"]\n", |
| 250 | + "\n", |
| 251 | + "print(f\"Query/response csv results: {eval_row_results} \\nwith overall metrics: {metrics}\")" |
| 252 | + ] |
| 253 | + }, |
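| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "To confirm that claim programmatically, we can compare the per-row results of the csv run with those of the earlier jsonl run:" |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "# The csv and jsonl datasets contain the same pairs, so their per-row results should match.\n", |
| | + "jsonl_results = [row[\"outputs.test.result\"] for row in js_qr_output[\"rows\"]]\n", |
| | + "csv_results = [row[\"outputs.test.result\"] for row in csv_qr_output[\"rows\"]]\n", |
| | + "assert jsonl_results == csv_results\n", |
| | + "print(\"csv and jsonl runs produced identical results\")" |
| | + ] |
| | + }, |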
| 254 | + { |
| 255 | + "cell_type": "markdown", |
| 256 | + "metadata": {}, |
| 257 | + "source": [ |
| 258 | + "## Conclusion\n", |
| 259 | + "\n", |
| 260 | + "This sample has shown various ways to input data using `evaluate`, and the difference between query/response and conversation-based inputs. As the SDK is improved, more of the built-in evaluators will continue to support a larger variety of input schemes. We encourage users to leverage which ever options suit their needs." |
| 261 | + ] |
| 262 | + } |
| 263 | + ], |
| 264 | + "metadata": { |
| 265 | + "kernelspec": { |
| 266 | + "display_name": "Python 3 (ipykernel)", |
| 267 | + "language": "python", |
| 268 | + "name": "python3" |
| 269 | + }, |
| 270 | + "language_info": { |
| 271 | + "codemirror_mode": { |
| 272 | + "name": "ipython", |
| 273 | + "version": 3 |
| 274 | + }, |
| 275 | + "file_extension": ".py", |
| 276 | + "mimetype": "text/x-python", |
| 277 | + "name": "python", |
| 278 | + "nbconvert_exporter": "python", |
| 279 | + "pygments_lexer": "ipython3" |
| 280 | + } |
| 281 | + }, |
| 282 | + "nbformat": 4, |
| 283 | + "nbformat_minor": 2 |
| 284 | +} |