|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "# Evaluate with various inputs\n", |
| 8 | + "\n", |
| 9 | + "## Objective\n", |
| 10 | + "\n", |
| 11 | + "This notebook walks through how to use jsonl and csv files as inputs for evaluation, as well as both query/response and conversation-based inputs within those files. \n", |
| 12 | + "\n", |
| 13 | + "Note: When this notebook refers to 'conversations', we are referring to the definition of conversations defined [here](https://learn.microsoft.com/en-us/python/api/azure-ai-evaluation/azure.ai.evaluation.conversation?view=azure-python#attributes). This is a simplified variant on the broader Chat Protocol standard that is defined [here](https://github.com/microsoft/ai-chat-protocol)\n", |
| 14 | + "\n", |
| 15 | + "## Time\n", |
| 16 | + "\n", |
| 17 | + "You should expect to spend about 10 minutes running this notebook.\n", |
| 18 | + "\n", |
| 19 | + "## Setup\n" |
| 20 | + ] |
| 21 | + }, |
| 22 | + { |
| 23 | + "cell_type": "code", |
| 24 | + "execution_count": null, |
| 25 | + "metadata": {}, |
| 26 | + "outputs": [], |
| 27 | + "source": [ |
| 28 | + "# Install the Evaluation SDK package\n", |
| 29 | + "%pip install azure-ai-evaluation" |
| 30 | + ] |
| 31 | + }, |
| 32 | + { |
| 33 | + "cell_type": "markdown", |
| 34 | + "metadata": {}, |
| 35 | + "source": [ |
| 36 | + "### Imports\n", |
| 37 | + "Run this cell to import everything that is needed for this sample" |
| 38 | + ] |
| 39 | + }, |
| 40 | + { |
| 41 | + "cell_type": "code", |
| 42 | + "execution_count": null, |
| 43 | + "metadata": {}, |
| 44 | + "outputs": [], |
| 45 | + "source": [ |
| 46 | + "from azure.ai.evaluation import evaluate\n", |
| 47 | + "from typing import List, Tuple, Dict, Optional, TypedDict\n", |
| 48 | + "from pathlib import Path" |
| 49 | + ] |
| 50 | + }, |
| 51 | + { |
| 52 | + "cell_type": "markdown", |
| 53 | + "metadata": {}, |
| 54 | + "source": [ |
| 55 | + "## Evaluator definition\n", |
| 56 | + "\n", |
| 57 | + "We define a toy math evaluator below to showcase multi-input handling. A variety of built-in evaluators have a similar input structure to the evaluator below, like the `ContentSafetyEvaluator` and the `ProtectedMaterialEvaluator`. However they all require API connections to function. To avoid that setup and keep this sample offline-capable, this toy evaluator requires no external support." |
| 58 | + ] |
| 59 | + }, |
| 60 | + { |
| 61 | + "cell_type": "code", |
| 62 | + "execution_count": null, |
| 63 | + "metadata": {}, |
| 64 | + "outputs": [], |
| 65 | + "source": [ |
| 66 | + "# Underlying evaluation: The return ratio of the query to response lengths\n", |
| 67 | + "def query_response_ratio(query: str, response: str) -> float:\n", |
| 68 | + " return len(query) / len(response)\n", |
| 69 | + "\n", |
| 70 | + "\n", |
| 71 | + "# Helper function that converts a conversation into a list of query-response pairs\n", |
| 72 | + "def unwrap_conversation(conversation: Dict) -> List[Tuple[str, str]]:\n", |
| 73 | + " queries = []\n", |
| 74 | + " responses = []\n", |
| 75 | + " for turn in conversation[\"messages\"]:\n", |
| 76 | + " if turn[\"role\"] == \"user\":\n", |
| 77 | + " queries.append(turn[\"content\"])\n", |
| 78 | + " else:\n", |
| 79 | + " responses.append(turn[\"content\"])\n", |
| 80 | + " return zip(queries, responses)\n", |
| 81 | + "\n", |
| 82 | + "\n", |
| 83 | + "# Define the output of the evaluation to make the sample repo's robust type requirements happy.\n", |
| 84 | + "class EvalOutput(TypedDict, total=False):\n", |
| 85 | + " result: float\n", |
| 86 | + "\n", |
| 87 | + "\n", |
| 88 | + "# Actual evaluation function, which handles either a single query-response pair or a conversation\n", |
| 89 | + "def simple_evaluator_function(\n", |
| 90 | + " query: Optional[str] = None, response: Optional[str] = None, conversation: Optional[str] = None\n", |
| 91 | + ") -> EvalOutput:\n", |
| 92 | + " if conversation is not None and query is None and response is None:\n", |
| 93 | + " per_turn_results = [query_response_ratio(q, r) for q, r in unwrap_conversation(conversation)]\n", |
| 94 | + " return {\"result\": sum(per_turn_results) / len(per_turn_results), \"per_turn_results\": per_turn_results}\n", |
| 95 | + " if conversation is None and query is not None and response is not None:\n", |
| 96 | + " return {\"result\": query_response_ratio(query, response)}\n", |
| 97 | + " raise ValueError(\"Either a conversation or a query-response pair must be provided.\")\n", |
| 98 | + "\n", |
| 99 | + "\n", |
| 100 | + "# Feel free to replace this assignment with more complex evaluation functions for further testing.\n", |
| 101 | + "my_evaluator = simple_evaluator_function" |
| 102 | + ] |
| 103 | + }, |
| 104 | + { |
| 105 | + "cell_type": "markdown", |
| 106 | + "metadata": {}, |
| 107 | + "source": [ |
| 108 | + "With the evaluator defined above, we can input either a query and response together, or a conversation to receive a result:" |
| 109 | + ] |
| 110 | + }, |
| 111 | + { |
| 112 | + "cell_type": "code", |
| 113 | + "execution_count": null, |
| 114 | + "metadata": {}, |
| 115 | + "outputs": [], |
| 116 | + "source": [ |
| 117 | + "# Query+response evaluation\n", |
| 118 | + "qr_result = my_evaluator(query=\"Hello\", response=\"world\")\n", |
| 119 | + "print(f\"query/response output: {qr_result}\")\n", |
| 120 | + "\n", |
| 121 | + "conversation_input = {\n", |
| 122 | + " \"messages\": [\n", |
| 123 | + " {\"role\": \"user\", \"content\": \"Hello\"},\n", |
| 124 | + " {\"role\": \"assistant\", \"content\": \"world\"},\n", |
| 125 | + " {\"role\": \"user\", \"content\": \"Hello\"},\n", |
| 126 | + " {\"role\": \"assistant\", \"content\": \"world and more words to change ratio\"},\n", |
| 127 | + " ]\n", |
| 128 | + "}\n", |
| 129 | + "\n", |
| 130 | + "# Conversation evaluation\n", |
| 131 | + "conversation_result = my_evaluator(conversation=conversation_input)\n", |
| 132 | + "print(f\"conversation output: {conversation_result}\")" |
| 133 | + ] |
| 134 | + }, |
| 135 | + { |
| 136 | + "cell_type": "markdown", |
| 137 | + "metadata": {}, |
| 138 | + "source": [ |
| 139 | + "## Datasets\n", |
| 140 | + "\n", |
| 141 | + "Direct inputs into evaluators as shown above are useful for sanity checks. But for larger datasets we typically input the evaluator and a dataset file into the `evaluate` method. For that, we will need some data files.\n", |
| 142 | + "\n", |
| 143 | + "Included in this sample directory are 3 files:\n", |
| 144 | + "- qr_data.jsonl contains query/response inputs in jsonl format.\n", |
| 145 | + "- qr_data.csv contains query/response inputs in csv format.\n", |
| 146 | + "- conversation_data.jsonl contains conversation inputs in jsonl format.\n", |
| 147 | + "\n", |
| 148 | + "Conversations and other complex inputs are not supported via csv inputs, so there is no corresponding \"conversation_data.csv\" file. Each file contains the same three query/response pairs, but in the conversation dataset, the second and third pairs are wrapped into a single, 4-turn conversation.\n", |
| 149 | + "\n", |
| 150 | + "Double check the contents of these files by changing the print statement below. You might need to alter the `path_to_data` value depending on where your notebook is running:" |
| 151 | + ] |
| 152 | + }, |
| 153 | + { |
| 154 | + "cell_type": "code", |
| 155 | + "execution_count": null, |
| 156 | + "metadata": {}, |
| 157 | + "outputs": [], |
| 158 | + "source": [ |
| 159 | + "# Change this depending on where your notebook is running.\n", |
| 160 | + "# Default value assumes that the notebook is running in the root of the repository.\n", |
| 161 | + "path_to_data = \"./scenarios/evaluate/evaluate_with_various_inputs\"\n", |
| 162 | + "# Define data path variables.\n", |
| 163 | + "qr_js_data = path_to_data + \"/qr_data.jsonl\"\n", |
| 164 | + "qr_csv_data = path_to_data + \"/qr_data.csv\"\n", |
| 165 | + "conversation_js_data = path_to_data + \"/conversation_data.jsonl\"\n", |
| 166 | + "\n", |
| 167 | + "# Change variable referenced here to check different files\n", |
| 168 | + "with Path(qr_js_data).open() as f:\n", |
| 169 | + " print(f.read())" |
| 170 | + ] |
| 171 | + }, |
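| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "If the previous cell failed because the data files are not present at `path_to_data` (for example, when running this notebook outside the sample repository), the optional cell below is a minimal sketch that writes toy stand-ins with the same schema. It assumes the column names match the evaluator's parameter names (`query`, `response`, and `conversation`), and its row contents are illustrative placeholders rather than the repository's actual data. After running it, re-run the cell above to inspect the generated files." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "# Optional: write toy stand-in data files. The pairs below are placeholders,\n", |
| | + "# not the repository's actual rows; only the column schema matters here.\n", |
| | + "import csv\n", |
| | + "import json\n", |
| | + "\n", |
| | + "pairs = [(\"What is 2+2?\", \"4\"), (\"Name a color.\", \"Blue\"), (\"Say hi.\", \"Hi there!\")]\n", |
| | + "\n", |
| | + "Path(path_to_data).mkdir(parents=True, exist_ok=True)\n", |
| | + "\n", |
| | + "# Query/response jsonl: one {\"query\": ..., \"response\": ...} object per line.\n", |
| | + "with Path(qr_js_data).open(\"w\") as f:\n", |
| | + " for q, r in pairs:\n", |
| | + " f.write(json.dumps({\"query\": q, \"response\": r}) + \"\\n\")\n", |
| | + "\n", |
| | + "# Query/response csv: a header row followed by one row per pair.\n", |
| | + "with Path(qr_csv_data).open(\"w\", newline=\"\") as f:\n", |
| | + " writer = csv.writer(f)\n", |
| | + " writer.writerow([\"query\", \"response\"])\n", |
| | + " writer.writerows(pairs)\n", |
| | + "\n", |
| | + "\n", |
| | + "# Conversation jsonl: the first pair becomes a 2-turn conversation, and the\n", |
| | + "# second and third pairs are wrapped into a single 4-turn conversation.\n", |
| | + "def to_conversation(*qr_pairs):\n", |
| | + " messages = []\n", |
| | + " for q, r in qr_pairs:\n", |
| | + " messages.append({\"role\": \"user\", \"content\": q})\n", |
| | + " messages.append({\"role\": \"assistant\", \"content\": r})\n", |
| | + " return {\"messages\": messages}\n", |
| | + "\n", |
| | + "\n", |
| | + "with Path(conversation_js_data).open(\"w\") as f:\n", |
| | + " f.write(json.dumps({\"conversation\": to_conversation(pairs[0])}) + \"\\n\")\n", |
| | + " f.write(json.dumps({\"conversation\": to_conversation(pairs[1], pairs[2])}) + \"\\n\")" |
| | + ] |
| | + }, |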
| 172 | + { |
| 173 | + "cell_type": "markdown", |
| 174 | + "metadata": {}, |
| 175 | + "source": [ |
| 176 | + "## Evaluation\n", |
| 177 | + "\n", |
| 178 | + "Now that we have some datasets and an evaluator, and can pass both of them into evaluate. Starting with query/response jsonl inputs:" |
| 179 | + ] |
| 180 | + }, |
| 181 | + { |
| 182 | + "cell_type": "code", |
| 183 | + "execution_count": null, |
| 184 | + "metadata": {}, |
| 185 | + "outputs": [], |
| 186 | + "source": [ |
| 187 | + "js_qr_output = evaluate(\n", |
| 188 | + " data=qr_js_data,\n", |
| 189 | + " evaluators={\"test\": my_evaluator},\n", |
| 190 | + " _use_pf_client=False, # Avoid using PF dependencies to further simplify the example\n", |
| 191 | + ")\n", |
| 192 | + "\n", |
| 193 | + "eval_row_results = [row[\"outputs.test.result\"] for row in js_qr_output[\"rows\"]]\n", |
| 194 | + "metrics = js_qr_output[\"metrics\"]\n", |
| 195 | + "\n", |
| 196 | + "print(f\"query/response jsonl results: {eval_row_results} \\nwith overall metrics: {metrics}\")" |
| 197 | + ] |
| 198 | + }, |
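| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "The jsonl columns in this dataset happen to share names with the evaluator's parameters (`query` and `response`), which is what lets `evaluate` wire them up automatically. If your column names differ, the SDK's `evaluator_config` argument can remap them. The cell below is an illustrative sketch using an identity mapping; the exact shape shown (the `column_mapping` key and the `${data.<column>}` syntax) is an assumption to verify against the documentation for your installed version of `azure-ai-evaluation`." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "# Illustrative only: an explicit (identity) column mapping for the 'test' evaluator.\n", |
| | + "# The evaluator_config shape is an assumption; check it against your SDK version.\n", |
| | + "mapped_output = evaluate(\n", |
| | + " data=qr_js_data,\n", |
| | + " evaluators={\"test\": my_evaluator},\n", |
| | + " evaluator_config={\"test\": {\"column_mapping\": {\"query\": \"${data.query}\", \"response\": \"${data.response}\"}}},\n", |
| | + " _use_pf_client=False,\n", |
| | + ")\n", |
| | + "\n", |
| | + "print([row[\"outputs.test.result\"] for row in mapped_output[\"rows\"]])" |
| | + ] |
| | + }, |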
| 199 | + { |
| 200 | + "cell_type": "markdown", |
| 201 | + "metadata": {}, |
| 202 | + "source": [ |
| 203 | + "Now let's run the evaluation using the conversation-based jsonl data. Notice that the evaluator works for both conversations that only convert into a single query response pair, and for conversations that convert into multiple query response pairs. It also produces an extra output called `per_turn_results`, which allows you to check the results of each query-response evaluation that comprised a conversation, since the top-level result is an average of these values. This `per_turn_results` value is also produced by built-in evaluators when evaluating conversations." |
| 204 | + ] |
| 205 | + }, |
| 206 | + { |
| 207 | + "cell_type": "code", |
| 208 | + "execution_count": null, |
| 209 | + "metadata": {}, |
| 210 | + "outputs": [], |
| 211 | + "source": [ |
| 212 | + "js_convo_output = evaluate(\n", |
| 213 | + " data=conversation_js_data,\n", |
| 214 | + " evaluators={\"test\": my_evaluator},\n", |
| 215 | + " _use_pf_client=False,\n", |
| 216 | + ")\n", |
| 217 | + "\n", |
| 218 | + "eval_row_results = [row[\"outputs.test.result\"] for row in js_convo_output[\"rows\"]]\n", |
| 219 | + "per_turn_results = [row[\"outputs.test.per_turn_results\"] for row in js_convo_output[\"rows\"]]\n", |
| 220 | + "metrics = js_convo_output[\"metrics\"]\n", |
| 221 | + "\n", |
| 222 | + "print(\n", |
| 223 | + " f\"\"\"conversation jsonl results: {eval_row_results} \n", |
| 224 | + "with per turn results: {per_turn_results} \n", |
| 225 | + "and overall metrics: {metrics}\"\"\"\n", |
| 226 | + ")" |
| 227 | + ] |
| 228 | + }, |
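| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "As a quick sanity check, we can verify the relationship described above: each conversation's top-level result is the mean of its per-turn results." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "# Each conversation-level result should be the mean of its per-turn results.\n", |
| | + "for result, turns in zip(eval_row_results, per_turn_results):\n", |
| | + " assert abs(result - sum(turns) / len(turns)) < 1e-9\n", |
| | + "print(\"Each conversation result equals the mean of its per-turn results.\")" |
| | + ] |
| | + }, |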
| 229 | + { |
| 230 | + "cell_type": "markdown", |
| 231 | + "metadata": {}, |
| 232 | + "source": [ |
| 233 | + "Next we run the evaluation using the csv file as input. As expected, the results are the same as the equivalent jsonl file:" |
| 234 | + ] |
| 235 | + }, |
| 236 | + { |
| 237 | + "cell_type": "code", |
| 238 | + "execution_count": null, |
| 239 | + "metadata": {}, |
| 240 | + "outputs": [], |
| 241 | + "source": [ |
| 242 | + "csv_qr_output = evaluate(\n", |
| 243 | + " data=qr_csv_data,\n", |
| 244 | + " evaluators={\"test\": my_evaluator},\n", |
| 245 | + " _use_pf_client=False,\n", |
| 246 | + ")\n", |
| 247 | + "\n", |
| 248 | + "eval_row_results = [row[\"outputs.test.result\"] for row in csv_qr_output[\"rows\"]]\n", |
| 249 | + "metrics = csv_qr_output[\"metrics\"]\n", |
| 250 | + "\n", |
| 251 | + "print(f\"Query/response csv results: {eval_row_results} \\nwith overall metrics: {metrics}\")" |
| 252 | + ] |
| 253 | + }, |
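| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "To confirm that claim programmatically, we can compare the per-row results of the csv run with those of the earlier jsonl run:" |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "# The csv and jsonl datasets contain the same pairs, so their per-row results should match.\n", |
| | + "jsonl_results = [row[\"outputs.test.result\"] for row in js_qr_output[\"rows\"]]\n", |
| | + "csv_results = [row[\"outputs.test.result\"] for row in csv_qr_output[\"rows\"]]\n", |
| | + "assert jsonl_results == csv_results\n", |
| | + "print(\"csv and jsonl runs produced identical results\")" |
| | + ] |
| | + }, |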
| 254 | + { |
| 255 | + "cell_type": "markdown", |
| 256 | + "metadata": {}, |
| 257 | + "source": [ |
| 258 | + "## Conclusion\n", |
| 259 | + "\n", |
| 260 | + "This sample has shown various ways to input data using `evaluate`, and the difference between query/response and conversation-based inputs. As the SDK is improved, more of the built-in evaluators will continue to support a larger variety of input schemes. We encourage users to leverage which ever options suit their needs." |
| 261 | + ] |
| 262 | + } |
| 263 | + ], |
| 264 | + "metadata": { |
| 265 | + "kernelspec": { |
| 266 | + "display_name": "Python 3 (ipykernel)", |
| 267 | + "language": "python", |
| 268 | + "name": "python3" |
| 269 | + }, |
| 270 | + "language_info": { |
| 271 | + "codemirror_mode": { |
| 272 | + "name": "ipython", |
| 273 | + "version": 3 |
| 274 | + }, |
| 275 | + "file_extension": ".py", |
| 276 | + "mimetype": "text/x-python", |
| 277 | + "name": "python", |
| 278 | + "nbconvert_exporter": "python", |
| 279 | + "pygments_lexer": "ipython3" |
| 280 | + } |
| 281 | + }, |
| 282 | + "nbformat": 4, |
| 283 | + "nbformat_minor": 2 |
| 284 | +} |