diff --git a/authors.yaml b/authors.yaml index ef087642a6..d84aa18f87 100644 --- a/authors.yaml +++ b/authors.yaml @@ -391,3 +391,8 @@ corwin: name: "Corwin Cheung" website: "https://www.linkedin.com/in/corwincubes/" avatar: "https://avatars.githubusercontent.com/u/85517581?v=4" + +daisyshe-oai: + name: "Daisy Sheng" + website: "https://www.linkedin.com/in/daisysheng/" + avatar: "https://avatars.githubusercontent.com/u/212609991?v=4" diff --git a/examples/evaluation/use-cases/EvalsAPI_Image_Inputs.ipynb b/examples/evaluation/use-cases/EvalsAPI_Image_Inputs.ipynb new file mode 100644 index 0000000000..12a65e5efc --- /dev/null +++ b/examples/evaluation/use-cases/EvalsAPI_Image_Inputs.ipynb @@ -0,0 +1,545 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Evals API: Image Inputs\n", + "\n", + "This cookbook demonstrates how to use OpenAI's Evals framework for image-based tasks. Leveraging the Evals API, we will use **sampling** to generate model responses to an image and prompt, and **model grading** (LLM as a Judge) to score those responses against the image, prompt, and reference answer.\n", + "\n", + "In this example, we will evaluate how well our model can:\n", + "1. **Generate appropriate responses** to user prompts about images\n", + "2. **Align with reference answers** that represent high-quality responses" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Installing Dependencies + Setup" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "# Install required packages\n", + "!pip install openai datasets pandas --quiet" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Import libraries\n", + "from datasets import load_dataset\n", + "from openai import OpenAI\n", + "import os\n", + "import json\n", + "import time\n", + "import pandas as pd" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Dataset Preparation\n", + "\n", + "We use the [VibeEval](https://huggingface.co/datasets/RekaAI/VibeEval) dataset that's hosted on Hugging Face. Each example contains a user prompt, an accompanying image, and a reference answer. First, we load the dataset." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "dataset = load_dataset(\"RekaAI/VibeEval\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We extract the relevant fields and put them in a JSON-like format to pass in as a data source to the Evals API. Input image data can be in the form of a web URL or a base64-encoded string. Here, we use the provided web URLs."
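If your images are local files rather than hosted URLs, one option (not used in the rest of this cookbook) is to base64-encode them into data URLs and pass those in the `media_url` field instead. A minimal sketch, assuming a hypothetical local file `pizza.jpg`:

```python
import base64

# Hypothetical local file; adjust the path and MIME type to match your image.
with open("pizza.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

# A data URL can be used anywhere this cookbook passes a web URL.
media_url = f"data:image/jpeg;base64,{encoded}"
```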
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "evals_data_source = []\n", + "\n", + "# select the first 3 examples in the dataset to use for this cookbook\n", + "for example in dataset[\"test\"].select(range(3)):\n", + " evals_data_source.append({\n", + " \"item\": {\n", + " \"media_url\": example[\"media_url\"], # image web URL\n", + " \"reference\": example[\"reference\"], # reference answer\n", + " \"prompt\": example[\"prompt\"] # prompt\n", + " }\n", + " })" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you print the data source list, each item should be of a similar form to:\n", + "\n", + "```python\n", + "{\n", + " \"item\": {\n", + " \"media_url\": \"https://storage.googleapis.com/reka-annotate.appspot.com/vibe-eval/difficulty-normal_food1_7e5c2cb9c8200d70.jpg\"\n", + " \"reference\": \"This appears to be a classic Margherita pizza, which has the following ingredients...\"\n", + " \"prompt\": \"What ingredients do I need to make this?\"\n", + " }\n", + "}\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Eval Configuration\n", + "\n", + "Now that we have our data source and task, we will create our evals. For the OpenAI Evals API docs, visit [API docs](https://platform.openai.com/docs/evals/overview).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "client = OpenAI(\n", + " api_key=os.getenv(\"OPENAI_API_KEY\")\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Evals have two parts, the \"Eval\" and the \"Run\". In the \"Eval\", we define the expected structure of the data and the testing criteria (grader)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data Source Config\n", + "\n", + "Based on the data that we have compiled, our data source config is as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [], + "source": [ + "data_source_config = {\n", + " \"type\": \"custom\",\n", + " \"item_schema\": {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"media_url\": { \"type\": \"string\" },\n", + " \"reference\": { \"type\": \"string\" },\n", + " \"prompt\": { \"type\": \"string\" }\n", + " },\n", + " \"required\": [\"media_url\", \"reference\", \"prompt\"]\n", + " },\n", + " \"include_sample_schema\": True, # enables sampling\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Testing Criteria" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For our testing criteria, we set up our grader config. In this example, it is a model grader that takes in an image, reference answer, and sampled model response (in the `sample` namespace), and then outputs a score between 0 and 1 based on how closely the model response matches the reference answer and its general suitability for the conversation. For more info on model graders, visit [API Grader docs](hhttps://platform.openai.com/docs/api-reference/graders). \n", + "\n", + "Getting the both the data and the grader right are key for an effective evaluation. So, you will likely want to iteratively refine the prompts for your graders. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Note**: The image url field / templating need to be placed in an input image object to be interpreted as an image. 
+ ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [], + "source": [ + "grader_config = {\n", + "    \"type\": \"score_model\",\n", + "    \"name\": \"Score Model Grader\",\n", + "    \"input\": [\n", + "        {\n", + "            \"role\": \"system\",\n", + "            \"content\": \"You are an expert grader. Judge how well the model response suits the image and prompt, and how closely it matches the meaning of the reference answer. Output a score of 1 if great. If it's somewhat compatible, output a score around 0.5. Otherwise, give a score of 0.\"\n", + "        },\n", + "        {\n", + "            \"role\": \"user\",\n", + "            \"content\": [\n", + "                { \"type\": \"input_text\", \"text\": \"Prompt: {{ item.prompt }}.\" },\n", + "                { \"type\": \"input_image\", \"image_url\": \"{{ item.media_url }}\", \"detail\": \"auto\" },\n", + "                { \"type\": \"input_text\", \"text\": \"Reference answer: {{ item.reference }}. Model response: {{ sample.output_text }}.\" }\n", + "            ]\n", + "        }\n", + "    ],\n", + "    \"pass_threshold\": 0.9,\n", + "    \"range\": [0, 1],\n", + "    \"model\": \"o4-mini\"  # model for grading; check that the model you use supports image inputs\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, we create the eval object." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "eval_object = client.evals.create(\n", + "    name=\"Image Grading\",\n", + "    data_source_config=data_source_config,\n", + "    testing_criteria=[grader_config],\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Eval Run" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To create the run, we pass in the eval object ID, the data source (i.e., the data we compiled earlier), and the chat message input we will use for sampling to generate the model response. While we won't dive into it in this cookbook, the Evals API also supports stored completions containing images as a data source.\n", + "\n", + "Here's the sampling message input we'll use for this example." + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [], + "source": [ + "sampling_messages = [\n", + "    {\n", + "        \"role\": \"user\",\n", + "        \"type\": \"message\",\n", + "        \"content\": {\n", + "            \"type\": \"input_text\",\n", + "            \"text\": \"{{ item.prompt }}\"\n", + "        }\n", + "    },\n", + "    {\n", + "        \"role\": \"user\",\n", + "        \"type\": \"message\",\n", + "        \"content\": {\n", + "            \"type\": \"input_image\",\n", + "            \"image_url\": \"{{ item.media_url }}\",\n", + "            \"detail\": \"auto\"\n", + "        }\n", + "    }\n", + "]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We now kick off an eval run."
+ ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [], + "source": [ + "eval_run = client.evals.runs.create(\n", + " name=\"Image Input Eval Run\",\n", + " eval_id=eval_object.id,\n", + " data_source={\n", + " \"type\": \"responses\", # sample using responses API\n", + " \"source\": {\n", + " \"type\": \"file_content\",\n", + " \"content\": evals_data_source\n", + " },\n", + " \"model\": \"gpt-4o-mini\", # model used to generate the response; check that the model you use supports image inputs\n", + " \"input_messages\": {\n", + " \"type\": \"template\", \n", + " \"template\": sampling_messages}\n", + " }\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Poll and Display the Results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When the run finishes, we can take a look at the result. You can also check in your org's OpenAI evals dashboard to see the progress and results. " + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
promptreferencemodel_responsegrading_results
0Please provide latex code to replicate this tableBelow is the latex code for your table:\\n```te...Certainly! Below is the LaTeX code to replicat...{\"steps\":[{\"description\":\"Assess if the provid...
1What ingredients do I need to make this?This appears to be a classic Margherita pizza,...To make a classic Margherita pizza like the on...{\"steps\":[{\"description\":\"Check if model ident...
2Is this safe for a vegan to eat?Based on the image, this dish appears to be a ...To determine if the dish is safe for a vegan t...{\"steps\":[{\"description\":\"Compare model respon...
\n", + "
" + ], + "text/plain": [ + " prompt \\\n", + "0 Please provide latex code to replicate this table \n", + "1 What ingredients do I need to make this? \n", + "2 Is this safe for a vegan to eat? \n", + "\n", + " reference \\\n", + "0 Below is the latex code for your table:\\n```te... \n", + "1 This appears to be a classic Margherita pizza,... \n", + "2 Based on the image, this dish appears to be a ... \n", + "\n", + " model_response \\\n", + "0 Certainly! Below is the LaTeX code to replicat... \n", + "1 To make a classic Margherita pizza like the on... \n", + "2 To determine if the dish is safe for a vegan t... \n", + "\n", + " grading_results \n", + "0 {\"steps\":[{\"description\":\"Assess if the provid... \n", + "1 {\"steps\":[{\"description\":\"Check if model ident... \n", + "2 {\"steps\":[{\"description\":\"Compare model respon... " + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "while True:\n", + " run = client.evals.runs.retrieve(run_id=eval_run.id, eval_id=eval_object.id)\n", + " if run.status == \"completed\" or run.status == \"failed\": # check if the run is finished\n", + " output_items = list(client.evals.runs.output_items.list(\n", + " run_id=run.id, eval_id=eval_object.id\n", + " ))\n", + " df = pd.DataFrame({\n", + " \"prompt\": [item.datasource_item[\"prompt\"]for item in output_items],\n", + " \"reference\": [item.datasource_item[\"reference\"] for item in output_items],\n", + " \"model_response\": [item.sample.output[0].content for item in output_items],\n", + " \"grading_results\": [item.results[0][\"sample\"][\"output\"][0][\"content\"]\n", + " for item in output_items]\n", + " })\n", + " display(df)\n", + " break\n", + " time.sleep(5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Viewing Individual Output Items" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To see a full output item, we can do the following. The structure of an output item is specified in the API docs [here](https://platform.openai.com/docs/api-reference/evals/run-output-item-object)." + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\n", + " \"id\": \"outputitem_687833f102ec8191a6e53a5461b970c2\",\n", + " \"created_at\": 1752708081,\n", + " \"datasource_item\": {\n", + " \"prompt\": \"Please provide latex code to replicate this table\",\n", + " \"media_url\": \"https://storage.googleapis.com/reka-annotate.appspot.com/vibe-eval/difficulty-normal_table0_b312eea68bcd0de6.png\",\n", + " \"reference\": \"Below is the latex code for your table:\\n```tex\\n\\\\begin{table}\\n\\\\begin{tabular}{c c c c} \\\\hline & \\\\(S2\\\\) & Expert & Layman & PoelM \\\\\\\\ \\\\cline{2-4} \\\\(S1\\\\) & Expert & \\u2013 & 54.0 & 62.7 \\\\\\\\ & Layman & 46.0 & \\u2013 & 60.7 \\\\\\\\ &,PoelM,LM,LM,LM,LM,LM,,L,M,,L,M,,L,M,,L,M,,,\\u2013&39.3 \\\\\\\\\\n[-1ex] \\\\end{tabular}\\n\\\\end{table}\\n```.\"\n", + " },\n", + " \"datasource_item_id\": 1,\n", + " \"eval_id\": \"eval_687833d68e888191bc4bd8b965368f22\",\n", + " \"object\": \"eval.run.output_item\",\n", + " \"results\": [\n", + " {\n", + " \"name\": \"Score Model Grader-73fe48a0-8090-46eb-aa8e-d426ad074eb3\",\n", + " \"sample\": {\n", + " \"input\": [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": \"You are an expert grader. Judge how well the model response suits the image and prompt as well as matches the meaniing of the reference answer. 
Output a score of 1 if great. If it's somewhat compatible, output a score around 0.5. Otherwise, give a score of 0.\"\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": \"Prompt: Please provide latex code to replicate this table. https://storage.googleapis.com/reka-annotate.appspot.com/vibe-eval/difficulty-normal_table0_b312eea68bcd0de6.png Reference answer: Below is the latex code for your table:\\n```tex\\n\\\\begin{table}\\n\\\\begin{tabular}{c c c c} \\\\hline & \\\\(S2\\\\) & Expert & Layman & PoelM \\\\\\\\ \\\\cline{2-4} \\\\(S1\\\\) & Expert & \\u2013 & 54.0 & 62.7 \\\\\\\\ & Layman & 46.0 & \\u2013 & 60.7 \\\\\\\\ &,PoelM,LM,LM,LM,LM,LM,,L,M,,L,M,,L,M,,L,M,,,\\u2013&39.3 \\\\\\\\\\n[-1ex] \\\\end{tabular}\\n\\\\end{table}\\n```.. Model response: Certainly! Below is the LaTeX code to replicate the table you provided:\\n\\n```latex\\n\\\\documentclass{article}\\n\\\\usepackage{array}\\n\\\\usepackage{multirow}\\n\\\\usepackage{booktabs}\\n\\n\\\\begin{document}\\n\\n\\\\begin{table}[ht]\\n \\\\centering\\n \\\\begin{tabular}{c|c|c|c}\\n \\\\multirow{2}{*}{S1} & \\\\multirow{2}{*}{S2} & \\\\multicolumn{3}{c}{Methods} \\\\\\\\ \\n \\\\cline{3-5}\\n & & Expert & Layman & PoeLM \\\\\\\\\\n \\\\hline\\n Expert & & - & 54.0 & 62.7 \\\\\\\\\\n Layman & & 46.0 & - & 60.7 \\\\\\\\\\n PoeLM & & 37.3 & 39.3 & - \\\\\\\\\\n \\\\end{tabular}\\n \\\\caption{Comparison of different methods}\\n \\\\label{tab:methods_comparison}\\n\\\\end{table}\\n\\n\\\\end{document}\\n```\\n\\n### Explanation:\\n- The `multirow` package is used to create the multi-row header for `S1` and `S2`.\\n- The `booktabs` package is used for improved table formatting (with `\\\\hline` for horizontal lines).\\n- Adjust the table's caption and label as needed..\"\n", + " }\n", + " ],\n", + " \"output\": [\n", + " {\n", + " \"role\": \"assistant\",\n", + " \"content\": \"{\\\"steps\\\":[{\\\"description\\\":\\\"Assess if the provided LaTeX code correctly matches the structure of the target table, including the diagonal header, column counts, and alignment.\\\",\\\"conclusion\\\":\\\"The code fails to create the diagonal split between S1 and S2 and mismatches column counts (defines 4 columns but uses 5).\\\"},{\\\"description\\\":\\\"Check the header layout: the target table has a single diagonal cell spanning two axes and three following columns labeled Expert, Layman, PoeLM. 
The model uses \\\\\\\\multirow and a \\\\\\\\multicolumn block named 'Methods', which does not replicate the diagonal or correct labeling.\\\",\\\"conclusion\\\":\\\"Header structure is incorrect and does not match the prompt's table.\\\"},{\\\"description\\\":\\\"Verify the data rows: the model code includes two empty cells after S1 and before the data, misaligning all data entries relative to the intended columns.\\\",\\\"conclusion\\\":\\\"Data rows are misaligned due to incorrect column definitions.\\\"},{\\\"description\\\":\\\"Overall compatibility: the code is syntactically flawed for the target table and conceptually does not replicate the diagonal header or correct column count.\\\",\\\"conclusion\\\":\\\"The response does not satisfy the prompt.\\\"}],\\\"result\\\":0.0}\"\n", + " }\n", + " ],\n", + " \"finish_reason\": \"stop\",\n", + " \"model\": \"o4-mini-2025-04-16\",\n", + " \"usage\": {\n", + " \"total_tokens\": 2185,\n", + " \"completion_tokens\": 712,\n", + " \"prompt_tokens\": 1473,\n", + " \"cached_tokens\": 0\n", + " },\n", + " \"error\": null,\n", + " \"seed\": null,\n", + " \"temperature\": 1.0,\n", + " \"top_p\": 1.0,\n", + " \"reasoning_effort\": null,\n", + " \"max_completions_tokens\": 4096\n", + " },\n", + " \"passed\": false,\n", + " \"score\": 0.0\n", + " }\n", + " ],\n", + " \"run_id\": \"evalrun_687833dbadd081919a0f9fbfb817baf4\",\n", + " \"sample\": \"Sample(error=None, finish_reason='stop', input=[SampleInput(content='Please provide latex code to replicate this table', role='user'), SampleInput(content='https://storage.googleapis.com/reka-annotate.appspot.com/vibe-eval/difficulty-normal_table0_b312eea68bcd0de6.png', role='user')], max_completion_tokens=None, model='gpt-4o-mini-2024-07-18', output=[SampleOutput(content=\\\"Certainly! 
Below is the LaTeX code to replicate the table you provided:\\\\n\\\\n```latex\\\\n\\\\\\\\documentclass{article}\\\\n\\\\\\\\usepackage{array}\\\\n\\\\\\\\usepackage{multirow}\\\\n\\\\\\\\usepackage{booktabs}\\\\n\\\\n\\\\\\\\begin{document}\\\\n\\\\n\\\\\\\\begin{table}[ht]\\\\n \\\\\\\\centering\\\\n \\\\\\\\begin{tabular}{c|c|c|c}\\\\n \\\\\\\\multirow{2}{*}{S1} & \\\\\\\\multirow{2}{*}{S2} & \\\\\\\\multicolumn{3}{c}{Methods} \\\\\\\\\\\\\\\\ \\\\n \\\\\\\\cline{3-5}\\\\n & & Expert & Layman & PoeLM \\\\\\\\\\\\\\\\\\\\n \\\\\\\\hline\\\\n Expert & & - & 54.0 & 62.7 \\\\\\\\\\\\\\\\\\\\n Layman & & 46.0 & - & 60.7 \\\\\\\\\\\\\\\\\\\\n PoeLM & & 37.3 & 39.3 & - \\\\\\\\\\\\\\\\\\\\n \\\\\\\\end{tabular}\\\\n \\\\\\\\caption{Comparison of different methods}\\\\n \\\\\\\\label{tab:methods_comparison}\\\\n\\\\\\\\end{table}\\\\n\\\\n\\\\\\\\end{document}\\\\n```\\\\n\\\\n### Explanation:\\\\n- The `multirow` package is used to create the multi-row header for `S1` and `S2`.\\\\n- The `booktabs` package is used for improved table formatting (with `\\\\\\\\hline` for horizontal lines).\\\\n- Adjust the table's caption and label as needed.\\\", role='assistant')], seed=None, temperature=1.0, top_p=1.0, usage=SampleUsage(cached_tokens=0, completion_tokens=295, prompt_tokens=14187, total_tokens=14482), max_completions_tokens=4096)\",\n", + " \"status\": \"fail\",\n", + " \"_datasource_item_content_hash\": \"bb2090df47ea2ca0aa67337709ce2ff7382d639118d3358068b0cc7031c12f82\"\n", + "}\n" + ] + } + ], + "source": [ + "first_item = output_items[0]\n", + "\n", + "print(json.dumps(dict(first_item), indent=2, default=str))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "In this cookbook, we covered a workflow for evaluating an image-based task using the OpenAI Evals API's. By using the image input functionality for both sampling and model grading, we were able to streamline our evals process for the task.\n", + "\n", + "We're excited to see you extend this to your own image-based use cases, whether it's OCR accuracy, image generation grading, and more!" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.17" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/registry.yaml b/registry.yaml index 649b52d42e..6d1efb45da 100644 --- a/registry.yaml +++ b/registry.yaml @@ -4,6 +4,15 @@ # should build pages for, and indicates metadata such as tags, creation date and # authors for each page. +- title: Using Evals API on Image Inputs + path: examples/evaluation/use-cases/EvalsAPI_Image_Inputs.ipynb + date: 2025-07-15 + authors: + - daisyshe-oai + tags: + - evals-api + - images + - title: Optimize Prompts path: examples/Optimize_Prompts.ipynb date: 2025-07-14