From 9ca3d00264cde8a66ceda6a43b7490312c776239 Mon Sep 17 00:00:00 2001 From: Daisy Sheng Date: Tue, 15 Jul 2025 13:07:35 -0700 Subject: [PATCH 1/3] cookbook and info --- authors.yaml | 5 + .../use-cases/EvalsAPI_Image_Inputs.ipynb | 593 ++++++++++++++++++ registry.yaml | 9 + 3 files changed, 607 insertions(+) create mode 100644 examples/evaluation/use-cases/EvalsAPI_Image_Inputs.ipynb diff --git a/authors.yaml b/authors.yaml index ef087642a6..d84aa18f87 100644 --- a/authors.yaml +++ b/authors.yaml @@ -391,3 +391,8 @@ corwin: name: "Corwin Cheung" website: "https://www.linkedin.com/in/corwincubes/" avatar: "https://avatars.githubusercontent.com/u/85517581?v=4" + +daisyshe-oai: + name: "Daisy Sheng" + website: "https://www.linkedin.com/in/daisysheng/" + avatar: "https://avatars.githubusercontent.com/u/212609991?v=4" diff --git a/examples/evaluation/use-cases/EvalsAPI_Image_Inputs.ipynb b/examples/evaluation/use-cases/EvalsAPI_Image_Inputs.ipynb new file mode 100644 index 0000000000..4d27e2fec7 --- /dev/null +++ b/examples/evaluation/use-cases/EvalsAPI_Image_Inputs.ipynb @@ -0,0 +1,593 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Evals API: Image Inputs\n", + "\n", + "OpenAI’s Evals API now supports image inputs, in its step toward multimodal functionality! API users can use OpenAI's Evals API to evaluate their image use cases to see how their LLM integration is performing and improve it.\n", + "\n", + "In this cookbook, we'll walk through an image example with the Evals API. 
More specifically, we will use Evals API to evaluate model-generated responses to an image and its corresponding prompt, using **sampling** to generate model responses and **model grading** (LLM as a Judge) to score those model responses against the image and reference answer.\n", + "\n", + "Based on your use case, you might only need the sampling functionality or the model grader, and you can revise what you pass in during the eval and run creation to fit your needs. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Dataset\n", + "\n", + "For this example, we will use the [VibeEval](https://huggingface.co/datasets/RekaAI/VibeEval) dataset that's hosted on Hugging Face. It contains a collection of image, prompt, and reference answer data. First, we load the dataset." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Requirement already satisfied: datasets in /opt/homebrew/lib/python3.10/site-packages (4.0.0)\n", + "Requirement already satisfied: filelock in /opt/homebrew/lib/python3.10/site-packages (from datasets) (3.18.0)\n", + "Requirement already satisfied: numpy>=1.17 in /opt/homebrew/lib/python3.10/site-packages (from datasets) (2.1.3)\n", + "Requirement already satisfied: pyarrow>=15.0.0 in /opt/homebrew/lib/python3.10/site-packages (from datasets) (20.0.0)\n", + "Requirement already satisfied: dill<0.3.9,>=0.3.0 in /opt/homebrew/lib/python3.10/site-packages (from datasets) (0.3.8)\n", + "Requirement already satisfied: pandas in /opt/homebrew/lib/python3.10/site-packages (from datasets) (2.3.1)\n", + "Requirement already satisfied: requests>=2.32.2 in /opt/homebrew/lib/python3.10/site-packages (from datasets) (2.32.4)\n", + "Requirement already satisfied: tqdm>=4.66.3 in /opt/homebrew/lib/python3.10/site-packages (from datasets) (4.67.1)\n", + "Requirement already satisfied: xxhash in 
/opt/homebrew/lib/python3.10/site-packages (from datasets) (3.5.0)\n", + "Requirement already satisfied: multiprocess<0.70.17 in /opt/homebrew/lib/python3.10/site-packages (from datasets) (0.70.16)\n", + "Requirement already satisfied: fsspec<=2025.3.0,>=2023.1.0 in /opt/homebrew/lib/python3.10/site-packages (from fsspec[http]<=2025.3.0,>=2023.1.0->datasets) (2025.3.0)\n", + "Requirement already satisfied: huggingface-hub>=0.24.0 in /opt/homebrew/lib/python3.10/site-packages (from datasets) (0.33.4)\n", + "Requirement already satisfied: packaging in /Users/daisyshe/Library/Python/3.10/lib/python/site-packages (from datasets) (25.0)\n", + "Requirement already satisfied: pyyaml>=5.1 in /opt/homebrew/lib/python3.10/site-packages (from datasets) (6.0.2)\n", + "Requirement already satisfied: aiohttp!=4.0.0a0,!=4.0.0a1 in /opt/homebrew/lib/python3.10/site-packages (from fsspec[http]<=2025.3.0,>=2023.1.0->datasets) (3.12.14)\n", + "Requirement already satisfied: aiohappyeyeballs>=2.5.0 in /opt/homebrew/lib/python3.10/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets) (2.6.1)\n", + "Requirement already satisfied: aiosignal>=1.4.0 in /opt/homebrew/lib/python3.10/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets) (1.4.0)\n", + "Requirement already satisfied: async-timeout<6.0,>=4.0 in /opt/homebrew/lib/python3.10/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets) (5.0.1)\n", + "Requirement already satisfied: attrs>=17.3.0 in /opt/homebrew/lib/python3.10/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets) (25.3.0)\n", + "Requirement already satisfied: frozenlist>=1.1.1 in /opt/homebrew/lib/python3.10/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets) (1.7.0)\n", + "Requirement already satisfied: multidict<7.0,>=4.5 in /opt/homebrew/lib/python3.10/site-packages (from 
aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets) (6.6.3)\n", + "Requirement already satisfied: propcache>=0.2.0 in /opt/homebrew/lib/python3.10/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets) (0.3.2)\n", + "Requirement already satisfied: yarl<2.0,>=1.17.0 in /opt/homebrew/lib/python3.10/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets) (1.20.1)\n", + "Requirement already satisfied: typing-extensions>=4.1.0 in /Users/daisyshe/Library/Python/3.10/lib/python/site-packages (from multidict<7.0,>=4.5->aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets) (4.14.1)\n", + "Requirement already satisfied: idna>=2.0 in /opt/homebrew/lib/python3.10/site-packages (from yarl<2.0,>=1.17.0->aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets) (3.10)\n", + "Requirement already satisfied: hf-xet<2.0.0,>=1.1.2 in /opt/homebrew/lib/python3.10/site-packages (from huggingface-hub>=0.24.0->datasets) (1.1.5)\n", + "Requirement already satisfied: charset_normalizer<4,>=2 in /opt/homebrew/lib/python3.10/site-packages (from requests>=2.32.2->datasets) (3.4.2)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/homebrew/lib/python3.10/site-packages (from requests>=2.32.2->datasets) (2.5.0)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /opt/homebrew/lib/python3.10/site-packages (from requests>=2.32.2->datasets) (2025.7.14)\n", + "Requirement already satisfied: python-dateutil>=2.8.2 in /Users/daisyshe/Library/Python/3.10/lib/python/site-packages (from pandas->datasets) (2.9.0.post0)\n", + "Requirement already satisfied: pytz>=2020.1 in /opt/homebrew/lib/python3.10/site-packages (from pandas->datasets) (2025.2)\n", + "Requirement already satisfied: tzdata>=2022.7 in /opt/homebrew/lib/python3.10/site-packages (from pandas->datasets) (2025.2)\n", + "Requirement already satisfied: six>=1.5 in 
/Users/daisyshe/Library/Python/3.10/lib/python/site-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.17.0)\n",
+      "Note: you may need to restart the kernel to use updated packages.\n"
+     ]
+    }
+   ],
+   "source": [
+    "pip install datasets"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/opt/homebrew/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
+      "  from .autonotebook import tqdm as notebook_tqdm\n"
+     ]
+    }
+   ],
+   "source": [
+    "from datasets import load_dataset\n",
+    "\n",
+    "dataset = load_dataset(\"RekaAI/VibeEval\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We extract the relevant fields and put them in a JSON-like format to pass in as a data source in the Evals API. Input image data can be in the form of a web URL or a base64-encoded string. Here, we use the provided web URLs. 
"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "evals_data_source = []\n",
+    "\n",
+    "# select the first 5 examples in the dataset to use for this cookbook\n",
+    "for example in dataset[\"test\"].select(range(5)):\n",
+    "    evals_data_source.append({\n",
+    "        \"item\": {\n",
+    "            \"media_url\": example[\"media_url\"],  # image web URL\n",
+    "            \"reference\": example[\"reference\"],  # reference answer\n",
+    "            \"prompt\": example[\"prompt\"]  # prompt\n",
+    "        }\n",
+    "    })"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you print the data source list, each item should be of a similar form to:\n",
+    "\n",
+    "```python\n",
+    "{\n",
+    "    \"item\": {\n",
+    "        \"media_url\": \"https://storage.googleapis.com/reka-annotate.appspot.com/vibe-eval/difficulty-normal_food1_7e5c2cb9c8200d70.jpg\",\n",
+    "        \"reference\": \"This appears to be a classic Margherita pizza, which has the following ingredients...\",\n",
+    "        \"prompt\": \"What ingredients do I need to make this?\"\n",
+    "    }\n",
+    "}\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Evals Structure\n",
+    "\n",
+    "Now that we have our data source and task, we will create our evals. 
For the evals API docs, visit [API docs](https://platform.openai.com/docs/evals/overview).\n" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Requirement already satisfied: openai in /opt/homebrew/lib/python3.10/site-packages (1.95.1)\n", + "Requirement already satisfied: anyio<5,>=3.5.0 in /opt/homebrew/lib/python3.10/site-packages (from openai) (4.9.0)\n", + "Requirement already satisfied: distro<2,>=1.7.0 in /opt/homebrew/lib/python3.10/site-packages (from openai) (1.9.0)\n", + "Requirement already satisfied: httpx<1,>=0.23.0 in /opt/homebrew/lib/python3.10/site-packages (from openai) (0.28.1)\n", + "Requirement already satisfied: jiter<1,>=0.4.0 in /opt/homebrew/lib/python3.10/site-packages (from openai) (0.10.0)\n", + "Requirement already satisfied: pydantic<3,>=1.9.0 in /opt/homebrew/lib/python3.10/site-packages (from openai) (2.11.7)\n", + "Requirement already satisfied: sniffio in /opt/homebrew/lib/python3.10/site-packages (from openai) (1.3.1)\n", + "Requirement already satisfied: tqdm>4 in /opt/homebrew/lib/python3.10/site-packages (from openai) (4.67.1)\n", + "Requirement already satisfied: typing-extensions<5,>=4.11 in /Users/daisyshe/Library/Python/3.10/lib/python/site-packages (from openai) (4.14.1)\n", + "Requirement already satisfied: exceptiongroup>=1.0.2 in /Users/daisyshe/Library/Python/3.10/lib/python/site-packages (from anyio<5,>=3.5.0->openai) (1.3.0)\n", + "Requirement already satisfied: idna>=2.8 in /opt/homebrew/lib/python3.10/site-packages (from anyio<5,>=3.5.0->openai) (3.10)\n", + "Requirement already satisfied: certifi in /opt/homebrew/lib/python3.10/site-packages (from httpx<1,>=0.23.0->openai) (2025.7.14)\n", + "Requirement already satisfied: httpcore==1.* in /opt/homebrew/lib/python3.10/site-packages (from httpx<1,>=0.23.0->openai) (1.0.9)\n", + "Requirement already satisfied: h11>=0.16 in 
/opt/homebrew/lib/python3.10/site-packages (from httpcore==1.*->httpx<1,>=0.23.0->openai) (0.16.0)\n", + "Requirement already satisfied: annotated-types>=0.6.0 in /opt/homebrew/lib/python3.10/site-packages (from pydantic<3,>=1.9.0->openai) (0.7.0)\n", + "Requirement already satisfied: pydantic-core==2.33.2 in /opt/homebrew/lib/python3.10/site-packages (from pydantic<3,>=1.9.0->openai) (2.33.2)\n", + "Requirement already satisfied: typing-inspection>=0.4.0 in /opt/homebrew/lib/python3.10/site-packages (from pydantic<3,>=1.9.0->openai) (0.4.1)\n", + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "pip install openai" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [], + "source": [ + "from openai import OpenAI\n", + "import os\n", + "\n", + "client = OpenAI(\n", + " api_key=os.getenv(\"OPENAI_API_KEY\"),\n", + " base_url=\"https://api.openai.com/v1\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Evals have two parts, the \"Eval\" and the \"Run\". In the \"Eval\", we define the expected structure of the data and the testing criteria (grader). Based on the data that we have compiled, our data source config is as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [], + "source": [ + "data_source_config = {\n", + " \"type\": \"custom\",\n", + " \"item_schema\": {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"media_url\": { \"type\": \"string\" },\n", + " \"reference\": { \"type\": \"string\" },\n", + " \"prompt\": { \"type\": \"string\" }\n", + " },\n", + " \"required\": [\"media_url\", \"reference\", \"prompt\"]\n", + " },\n", + " \"include_sample_schema\": True, # enables sampling\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For our testing criteria, we set up our grader config. 
In this example, it is a model grader that takes in an image, reference answer, and sampled model response (in the `sample` namespace), and then outputs a score between 0 and 1 based on how closely the model response matches the reference answer and its general suitability for the conversation. For more info on model graders, visit the [API docs](https://platform.openai.com/docs/api-reference/graders). \n",
+    "\n",
+    "Getting both the data and the grader right is key to an effective evaluation, so you will likely want to iteratively refine the prompts for your graders. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Note**: The image URL field / template needs to be placed in an input image object to be interpreted as an image. Otherwise, the image will be interpreted as a text string. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 24,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "grader_config = {\n",
+    "\t \"type\": \"score_model\",\n",
+    "    \"name\": \"Score Model Grader\",\n",
+    "    \"input\":[\n",
+    "        {\n",
+    "            \"role\": \"system\",\n",
+    "\t\t \"content\": \"You are an expert grader. Judge how well the model response suits the image and prompt as well as matches the meaning of the reference answer. Output a score of 1 if great. If it's somewhat compatible, output a score around 0.5. Otherwise, give a score of 0.\"\n",
+    "\t },\n",
+    "\t {\n",
+    "\t\t \"role\": \"user\",\n",
+    "\t\t \"content\": [{ \"type\": \"input_text\", \"text\": \"Prompt: {{ item.prompt }}.\"},\n",
+    "\t\t\t\t\t\t\t{ \"type\": \"input_image\", \"image_url\": \"{{ item.media_url }}\", \"detail\": \"auto\" },\n",
+    "\t\t\t\t\t\t\t{ \"type\": \"input_text\", \"text\": \"Reference answer: {{ item.reference }}. 
Model response: {{ sample.output_text }}.\"}\n", + "\t\t\t\t]\n", + "\t }\n", + "\t\t],\n", + "\t\t\"pass_threshold\": 0.9,\n", + "\t \"range\": [0, 1],\n", + "\t \"model\": \"o4-mini\" # model for grading; check that the model you use supports image inputs\n", + "\t}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, we create the eval object." + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [], + "source": [ + "eval_object = client.evals.create(\n", + " name=\"Image Grading\",\n", + " data_source_config=data_source_config,\n", + " testing_criteria=[grader_config],\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To create the run, we pass in the eval object id and the data source (i.e., the data we compiled earlier) in addition to the chat message trajectory we'd like for sampling to get the model response. While we won't dive into it in this cookbook, EvalsAPI also supports stored completions containing images as a data source. \n", + "\n", + "Here's the sampling message trajectory we'll use for this example." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [], + "source": [ + "sampling_messages = [{\n", + " \"role\": \"user\",\n", + " \"type\": \"message\",\n", + " \"content\": {\n", + " \"type\": \"input_text\",\n", + " \"text\": \"{{ item.prompt }}\"\n", + " }\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"type\": \"message\",\n", + " \"content\": {\n", + " \"type\": \"input_image\",\n", + " \"image_url\": \"{{ item.media_url }}\",\n", + " \"detail\": \"auto\"\n", + " }\n", + " }]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "eval_run = client.evals.runs.create(\n", + " name=\"Image Input Eval Run\",\n", + " eval_id=eval_object.id,\n", + " data_source={\n", + " \"type\": \"responses\", # sample using responses API\n", + " \"source\": {\n", + " \"type\": \"file_content\",\n", + " \"content\": evals_data_source\n", + " },\n", + " \"model\": \"gpt-4o-mini\", # model used to generate the response; check that the model you use supports image inputs\n", + " \"input_messages\": {\n", + " \"type\": \"template\", \n", + " \"template\": sampling_messages}\n", + " }\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When the run finishes, we can take a look at the result. You can also check in your org's OpenAI evals dashboard to see the progress and results. " + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
promptreferencemodel_responsegrading_results
0Please provide latex code to replicate this tableBelow is the latex code for your table:\\n```te...Here is the LaTeX code to replicate the table ...{\"steps\":[{\"description\":\"Check if the model’s...
1What ingredients do I need to make this?This appears to be a classic Margherita pizza,...To make a classic Margherita pizza like the on...{\"steps\":[{\"description\":\"Compare the model re...
2Is this safe for a vegan to eat?Based on the image, this dish appears to be a ...To determine if this dish is safe for a vegan ...{\"steps\":[{\"description\":\"The reference answer...
3Where was this taken?This image is of the seafront in San Sebastián...I can't determine the exact location of the im...{\"steps\":[{\"description\":\"Compare model respon...
4What is the man in the picture doing?The man on the postcard is playing bagpipes, w...The man in the picture is playing the bagpipes...{\"steps\":[{\"description\":\"Compare the model re...
\n", + "
" + ], + "text/plain": [ + " prompt \\\n", + "0 Please provide latex code to replicate this table \n", + "1 What ingredients do I need to make this? \n", + "2 Is this safe for a vegan to eat? \n", + "3 Where was this taken? \n", + "4 What is the man in the picture doing? \n", + "\n", + " reference \\\n", + "0 Below is the latex code for your table:\\n```te... \n", + "1 This appears to be a classic Margherita pizza,... \n", + "2 Based on the image, this dish appears to be a ... \n", + "3 This image is of the seafront in San Sebastián... \n", + "4 The man on the postcard is playing bagpipes, w... \n", + "\n", + " model_response \\\n", + "0 Here is the LaTeX code to replicate the table ... \n", + "1 To make a classic Margherita pizza like the on... \n", + "2 To determine if this dish is safe for a vegan ... \n", + "3 I can't determine the exact location of the im... \n", + "4 The man in the picture is playing the bagpipes... \n", + "\n", + " grading_results \n", + "0 {\"steps\":[{\"description\":\"Check if the model’s... \n", + "1 {\"steps\":[{\"description\":\"Compare the model re... \n", + "2 {\"steps\":[{\"description\":\"The reference answer... \n", + "3 {\"steps\":[{\"description\":\"Compare model respon... \n", + "4 {\"steps\":[{\"description\":\"Compare the model re... 
"
+     ]
+    },
+    "metadata": {},
+    "output_type": "display_data"
+   }
+  ],
+  "source": [
+    "import pandas as pd\n",
+    "import time\n",
+    "while True:\n",
+    "    run = client.evals.runs.retrieve(run_id=eval_run.id, eval_id=eval_object.id)\n",
+    "    if run.status == \"completed\" or run.status == \"failed\": # check if the run is finished\n",
+    "        output_items = client.evals.runs.output_items.list(\n",
+    "            run_id=run.id, eval_id=eval_object.id\n",
+    "        )\n",
+    "        df = pd.DataFrame({\n",
+    "            \"prompt\": [item.datasource_item[\"prompt\"] for item in output_items],\n",
+    "            \"reference\": [item.datasource_item[\"reference\"] for item in output_items],\n",
+    "            \"model_response\": [item.sample.output[0].content for item in output_items],\n",
+    "            \"grading_results\": [item.results[0][\"sample\"][\"output\"][0][\"content\"]\n",
+    "                                for item in output_items]\n",
+    "        })\n",
+    "        display(df)\n",
+    "        break\n",
+    "    time.sleep(5)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To see the full output item, such as for the pizza ingredients image, we can do the following. The structure of the output item is specified in the API docs [here](https://platform.openai.com/docs/api-reference/evals/run-output-item-object)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 68,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{\n",
+      "  \"id\": \"outputitem_68768c0f7658819187d4f128c2e0ff8c\",\n",
+      "  \"created_at\": 1752599567,\n",
+      "  \"datasource_item\": {\n",
+      "    \"prompt\": \"What ingredients do I need to make this?\",\n",
+      "    \"media_url\": \"https://storage.googleapis.com/reka-annotate.appspot.com/vibe-eval/difficulty-normal_food1_7e5c2cb9c8200d70.jpg\",\n",
+      "    \"reference\": \"This appears to be a classic Margherita pizza, which has the following ingredients:\\n\\n- Pizza Dough: You'll need yeast, flour, salt, and water to make the dough. 
A simple recipe is 500g of flour, 1 tsp of salt, 1 tbsp of sugar, and about 300ml warm water.\\n\\n- Tomatoes: Fresh or canned San Marzano tomatoes are traditionally used for their sweet flavor. If using fresh tomatoes, you can blend them into a sauce.\\n\\n- Mozzarella Cheese: Traditionally mozzarella di bufala campana D.O.P., but Fior di Latte or other fresh mozzarella work well too.\\n\\n- Basil Leaves: Fresh basil leaves add a burst of flavor.\\n\\n- Olive Oil: Extra virgin olive oil is drizzled over the pizza before baking for added flavor.\\n\\n- Salt & Pepper\\n\\nYou would also need a pizza stone or baking sheet preheated in an oven set to around 475\\u00b0F (246\\u00b0C). Once your dough is prepared and shaped into a circle (use parchment paper if it's homemade), spread your tomato sauce on top leaving some space at the edge. Add dollops of cheese on top then gently press them down with your fingers. Drizzle with olive oil and season with salt & pepper. Finally add your basil leaves before placing it in the oven to bake until the crust is golden brown and bubbly - about 10 minutes depending on thickness.\"\n", + " },\n", + " \"datasource_item_id\": 2,\n", + " \"eval_id\": \"eval_687689442f7c8191aa614761671be57c\",\n", + " \"object\": \"eval.run.output_item\",\n", + " \"results\": [\n", + " {\n", + " \"name\": \"Score Model Grader-3510e5e4-b0f8-4cfb-a051-f5440152ae1e\",\n", + " \"sample\": {\n", + " \"input\": [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": \"You are an expert grader. Judge how well the model response suits the image and prompt as well as matches the meaniing of the reference answer. Output a score of 1 if great. If it's somewhat compatible, output a score around 0.5. Otherwise, give a score of 0.\"\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": \"Prompt: What ingredients do I need to make this?. 
https://storage.googleapis.com/reka-annotate.appspot.com/vibe-eval/difficulty-normal_food1_7e5c2cb9c8200d70.jpg Reference answer: This appears to be a classic Margherita pizza, which has the following ingredients:\\n\\n- Pizza Dough: You'll need yeast, flour, salt, and water to make the dough. A simple recipe is 500g of flour, 1 tsp of salt, 1 tbsp of sugar, and about 300ml warm water.\\n\\n- Tomatoes: Fresh or canned San Marzano tomatoes are traditionally used for their sweet flavor. If using fresh tomatoes, you can blend them into a sauce.\\n\\n- Mozzarella Cheese: Traditionally mozzarella di bufala campana D.O.P., but Fior di Latte or other fresh mozzarella work well too.\\n\\n- Basil Leaves: Fresh basil leaves add a burst of flavor.\\n\\n- Olive Oil: Extra virgin olive oil is drizzled over the pizza before baking for added flavor.\\n\\n- Salt & Pepper\\n\\nYou would also need a pizza stone or baking sheet preheated in an oven set to around 475\\u00b0F (246\\u00b0C). Once your dough is prepared and shaped into a circle (use parchment paper if it's homemade), spread your tomato sauce on top leaving some space at the edge. Add dollops of cheese on top then gently press them down with your fingers. Drizzle with olive oil and season with salt & pepper. Finally add your basil leaves before placing it in the oven to bake until the crust is golden brown and bubbly - about 10 minutes depending on thickness.. 
Model response: To make a classic Margherita pizza like the one in the image, you'll need the following ingredients:\\n\\n### For the Dough:\\n- **Flour** (preferably Type \\\"00\\\" pizza flour)\\n- **Water**\\n- **Yeast** (active dry or fresh)\\n- **Salt**\\n- **Olive oil** (optional)\\n\\n### For the Topping:\\n- **Tomato Sauce** (preferably San Marzano tomatoes, crushed)\\n- **Fresh Mozzarella Cheese** (preferably buffalo mozzarella)\\n- **Fresh Basil Leaves**\\n- **Olive Oil** (for drizzling)\\n- **Salt** (to taste)\\n\\n### Optional:\\n- **Parmesan Cheese** (for extra flavor)\\n- **Crushed Red Pepper Flakes** (for heat)\\n\\n### Instructions Summary:\\n1. Prepare the dough by mixing flour, water, yeast, and salt, then let it rise.\\n2. Shape the dough into a pizza base.\\n3. Spread tomato sauce over the base.\\n4. Add sliced mozzarella and basil leaves.\\n5. Bake in a hot oven or pizza stone until the crust is golden and the cheese is bubbly.\\n6. Drizzle with olive oil before serving. \\n\\nEnjoy your pizza-making!.\"\n", + " }\n", + " ],\n", + " \"output\": [\n", + " {\n", + " \"role\": \"assistant\",\n", + " \"content\": \"{\\\"steps\\\":[{\\\"description\\\":\\\"Compare the model response ingredients list with the reference ingredients list. Both include flour, water, yeast, salt, olive oil, tomato sauce (San Marzano), fresh mozzarella, basil, and salt. 
The model also adds optional parmesan and red pepper flakes, which do not conflict with the core Margherita recipe.\\\",\\\"conclusion\\\":\\\"Ingredient lists match well, with minor optional additions that don\\u2019t detract.\\\"},{\\\"description\\\":\\\"Assess whether the response answers the prompt: The user asks \\u201cWhat ingredients do I need to make this?\\u201d and the model response provides a clear, structured list of ingredients along with an optional instructions summary.\\\",\\\"conclusion\\\":\\\"The response directly addresses the prompt.\\\"},{\\\"description\\\":\\\"Compare the overall completeness and accuracy against the reference: The model includes all key ingredients (dough components, sauce, cheese, basil, olive oil, salt) and optional extras. Instructions are concise and helpful.\\\",\\\"conclusion\\\":\\\"The model response is comprehensive and accurate.\\\"}],\\\"result\\\":1.0}\"\n", + " }\n", + " ],\n", + " \"finish_reason\": \"stop\",\n", + " \"model\": \"o4-mini-2025-04-16\",\n", + " \"usage\": {\n", + " \"total_tokens\": 2395,\n", + " \"completion_tokens\": 420,\n", + " \"prompt_tokens\": 1975,\n", + " \"cached_tokens\": 0\n", + " },\n", + " \"error\": null,\n", + " \"seed\": null,\n", + " \"temperature\": 1.0,\n", + " \"top_p\": 1.0,\n", + " \"reasoning_effort\": null,\n", + " \"max_completions_tokens\": 4096\n", + " },\n", + " \"passed\": true,\n", + " \"score\": 1.0\n", + " }\n", + " ],\n", + " \"run_id\": \"evalrun_68768bfd44fc8191a359121443dab061\",\n", + " \"sample\": \"Sample(error=None, finish_reason='stop', input=[SampleInput(content='What ingredients do I need to make this?', role='user'), SampleInput(content='https://storage.googleapis.com/reka-annotate.appspot.com/vibe-eval/difficulty-normal_food1_7e5c2cb9c8200d70.jpg', role='user')], max_completion_tokens=None, model='gpt-4o-mini-2024-07-18', output=[SampleOutput(content='To make a classic Margherita pizza like the one in the image, you\\\\'ll need the following 
ingredients:\\\\n\\\\n### For the Dough:\\\\n- **Flour** (preferably Type \\\"00\\\" pizza flour)\\\\n- **Water**\\\\n- **Yeast** (active dry or fresh)\\\\n- **Salt**\\\\n- **Olive oil** (optional)\\\\n\\\\n### For the Topping:\\\\n- **Tomato Sauce** (preferably San Marzano tomatoes, crushed)\\\\n- **Fresh Mozzarella Cheese** (preferably buffalo mozzarella)\\\\n- **Fresh Basil Leaves**\\\\n- **Olive Oil** (for drizzling)\\\\n- **Salt** (to taste)\\\\n\\\\n### Optional:\\\\n- **Parmesan Cheese** (for extra flavor)\\\\n- **Crushed Red Pepper Flakes** (for heat)\\\\n\\\\n### Instructions Summary:\\\\n1. Prepare the dough by mixing flour, water, yeast, and salt, then let it rise.\\\\n2. Shape the dough into a pizza base.\\\\n3. Spread tomato sauce over the base.\\\\n4. Add sliced mozzarella and basil leaves.\\\\n5. Bake in a hot oven or pizza stone until the crust is golden and the cheese is bubbly.\\\\n6. Drizzle with olive oil before serving. \\\\n\\\\nEnjoy your pizza-making!', role='assistant')], seed=None, temperature=1.0, top_p=1.0, usage=SampleUsage(cached_tokens=0, completion_tokens=249, prompt_tokens=36856, total_tokens=37105), max_completions_tokens=4096)\",\n",
+      "  \"status\": \"pass\",\n",
+      "  \"_datasource_item_content_hash\": \"4baa1b4e9daaee8cce285a14b3d7e8155eef3a7770ebbbb50ee16f78f5024768\"\n",
+      "}\n"
+     ]
+    }
+   ],
+   "source": [
+    "import json\n",
+    "\n",
+    "pizza_item = next(\n",
+    "    item for item in output_items \n",
+    "    if \"What ingredients do I need to make this?\" in item.datasource_item[\"prompt\"]\n",
+    ")\n",
+    "\n",
+    "print(json.dumps(dict(pizza_item), indent=2, default=str))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now, feel free to extend this to your own use cases! Some examples include grading image generation results with our Evals API model graders, evaluating your OCR use cases using model sampling, and more. 
" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.17" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/registry.yaml b/registry.yaml index 649b52d42e..6d1efb45da 100644 --- a/registry.yaml +++ b/registry.yaml @@ -4,6 +4,15 @@ # should build pages for, and indicates metadata such as tags, creation date and # authors for each page. +- title: Using Evals API on Image Inputs + path: examples/evaluation/use-cases/EvalsAPI_Image_Inputs.ipynb + date: 2025-07-15 + authors: + - daisyshe-oai + tags: + - evals-api + - images + - title: Optimize Prompts path: examples/Optimize_Prompts.ipynb date: 2025-07-14 From cbfe4749b1c44ac88ee1aee89a50a97cd84008eb Mon Sep 17 00:00:00 2001 From: Daisy Sheng Date: Tue, 15 Jul 2025 14:19:57 -0700 Subject: [PATCH 2/3] revised from feedback --- .../use-cases/EvalsAPI_Image_Inputs.ipynb | 15 +++++---------- 1 file changed, 5 insertions(+), 10 deletions(-) diff --git a/examples/evaluation/use-cases/EvalsAPI_Image_Inputs.ipynb b/examples/evaluation/use-cases/EvalsAPI_Image_Inputs.ipynb index 4d27e2fec7..b03b642aa8 100644 --- a/examples/evaluation/use-cases/EvalsAPI_Image_Inputs.ipynb +++ b/examples/evaluation/use-cases/EvalsAPI_Image_Inputs.ipynb @@ -6,11 +6,7 @@ "source": [ "# Evals API: Image Inputs\n", "\n", - "OpenAI’s Evals API now supports image inputs, in its step toward multimodal functionality! API users can use OpenAI's Evals API to evaluate their image use cases to see how their LLM integration is performing and improve it.\n", - "\n", - "In this cookbook, we'll walk through an image example with the Evals API. 
More specifically, we will use Evals API to evaluate model-generated responses to an image and its corresponding prompt, using **sampling** to generate model responses and **model grading** (LLM as a Judge) to score those model responses against the image and reference answer.\n", - "\n", - "Based on your use case, you might only need the sampling functionality or the model grader, and you can revise what you pass in during the eval and run creation to fit your needs. " + "In this cookbook, we will use Evals API to grade model-generated responses to an image and prompt, using **sampling** to generate model responses and **model grading** (LLM as a Judge) to score the model responses against the image, prompt, and reference answer." ] }, { @@ -180,7 +176,7 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -188,8 +184,7 @@ "import os\n", "\n", "client = OpenAI(\n", - " api_key=os.getenv(\"OPENAI_API_KEY\"),\n", - " base_url=\"https://api.openai.com/v1\",\n", + " api_key=os.getenv(\"OPENAI_API_KEY\")\n", ")" ] }, @@ -289,9 +284,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "To create the run, we pass in the eval object id and the data source (i.e., the data we compiled earlier) in addition to the chat message trajectory we'd like for sampling to get the model response. While we won't dive into it in this cookbook, EvalsAPI also supports stored completions containing images as a data source. \n", + "To create the run, we pass in the eval object id and the data source (i.e., the data we compiled earlier) in addition to the chat message input we'd like for sampling to get the model response. While we won't dive into it in this cookbook, EvalsAPI also supports stored completions containing images as a data source. \n", "\n", - "Here's the sampling message trajectory we'll use for this example." + "Here's the sampling message input we'll use for this example." 
] }, {

From fee6383fc35c465fdd94e9c3a8df4613c3b2a7dc Mon Sep 17 00:00:00 2001
From: Daisy Sheng
Date: Wed, 16 Jul 2025 16:31:46 -0700
Subject: [PATCH 3/3] revisions from Shikhar's feedback

---
 .../use-cases/EvalsAPI_Image_Inputs.ipynb | 315 ++++++++----------
 1 file changed, 136 insertions(+), 179 deletions(-)

diff --git a/examples/evaluation/use-cases/EvalsAPI_Image_Inputs.ipynb b/examples/evaluation/use-cases/EvalsAPI_Image_Inputs.ipynb
index b03b642aa8..12a65e5efc 100644
--- a/examples/evaluation/use-cases/EvalsAPI_Image_Inputs.ipynb
+++ b/examples/evaluation/use-cases/EvalsAPI_Image_Inputs.ipynb
@@ -6,85 +6,60 @@
 "source": [
 "# Evals API: Image Inputs\n",
 "\n",
- "In this cookbook, we will use Evals API to grade model-generated responses to an image and prompt, using **sampling** to generate model responses and **model grading** (LLM as a Judge) to score the model responses against the image, prompt, and reference answer."
+ "This cookbook demonstrates how to use OpenAI's Evals framework for image-based tasks. Leveraging the Evals API, we will grade model-generated responses to an image and prompt by using **sampling** to generate model responses and **model grading** (LLM as a Judge) to score the model responses against the image, prompt, and reference answer.\n",
+ "\n",
+ "In this example, we will evaluate how well our model can:\n",
+ "1. **Generate appropriate responses** to user prompts about images\n",
+ "2. **Align with reference answers** that represent high-quality responses"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "## Dataset\n",
- "\n",
- "For this example, we will use the [VibeEval](https://huggingface.co/datasets/RekaAI/VibeEval) dataset that's hosted on Hugging Face. It contains a collection of image, prompt, and reference answer data. First, we load the dataset."
+ "## Installing Dependencies + Setup" ] }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 1, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Requirement already satisfied: datasets in /opt/homebrew/lib/python3.10/site-packages (4.0.0)\n", - "Requirement already satisfied: filelock in /opt/homebrew/lib/python3.10/site-packages (from datasets) (3.18.0)\n", - "Requirement already satisfied: numpy>=1.17 in /opt/homebrew/lib/python3.10/site-packages (from datasets) (2.1.3)\n", - "Requirement already satisfied: pyarrow>=15.0.0 in /opt/homebrew/lib/python3.10/site-packages (from datasets) (20.0.0)\n", - "Requirement already satisfied: dill<0.3.9,>=0.3.0 in /opt/homebrew/lib/python3.10/site-packages (from datasets) (0.3.8)\n", - "Requirement already satisfied: pandas in /opt/homebrew/lib/python3.10/site-packages (from datasets) (2.3.1)\n", - "Requirement already satisfied: requests>=2.32.2 in /opt/homebrew/lib/python3.10/site-packages (from datasets) (2.32.4)\n", - "Requirement already satisfied: tqdm>=4.66.3 in /opt/homebrew/lib/python3.10/site-packages (from datasets) (4.67.1)\n", - "Requirement already satisfied: xxhash in /opt/homebrew/lib/python3.10/site-packages (from datasets) (3.5.0)\n", - "Requirement already satisfied: multiprocess<0.70.17 in /opt/homebrew/lib/python3.10/site-packages (from datasets) (0.70.16)\n", - "Requirement already satisfied: fsspec<=2025.3.0,>=2023.1.0 in /opt/homebrew/lib/python3.10/site-packages (from fsspec[http]<=2025.3.0,>=2023.1.0->datasets) (2025.3.0)\n", - "Requirement already satisfied: huggingface-hub>=0.24.0 in /opt/homebrew/lib/python3.10/site-packages (from datasets) (0.33.4)\n", - "Requirement already satisfied: packaging in /Users/daisyshe/Library/Python/3.10/lib/python/site-packages (from datasets) (25.0)\n", - "Requirement already satisfied: pyyaml>=5.1 in /opt/homebrew/lib/python3.10/site-packages (from datasets) (6.0.2)\n", - "Requirement already 
satisfied: aiohttp!=4.0.0a0,!=4.0.0a1 in /opt/homebrew/lib/python3.10/site-packages (from fsspec[http]<=2025.3.0,>=2023.1.0->datasets) (3.12.14)\n", - "Requirement already satisfied: aiohappyeyeballs>=2.5.0 in /opt/homebrew/lib/python3.10/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets) (2.6.1)\n", - "Requirement already satisfied: aiosignal>=1.4.0 in /opt/homebrew/lib/python3.10/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets) (1.4.0)\n", - "Requirement already satisfied: async-timeout<6.0,>=4.0 in /opt/homebrew/lib/python3.10/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets) (5.0.1)\n", - "Requirement already satisfied: attrs>=17.3.0 in /opt/homebrew/lib/python3.10/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets) (25.3.0)\n", - "Requirement already satisfied: frozenlist>=1.1.1 in /opt/homebrew/lib/python3.10/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets) (1.7.0)\n", - "Requirement already satisfied: multidict<7.0,>=4.5 in /opt/homebrew/lib/python3.10/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets) (6.6.3)\n", - "Requirement already satisfied: propcache>=0.2.0 in /opt/homebrew/lib/python3.10/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets) (0.3.2)\n", - "Requirement already satisfied: yarl<2.0,>=1.17.0 in /opt/homebrew/lib/python3.10/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets) (1.20.1)\n", - "Requirement already satisfied: typing-extensions>=4.1.0 in /Users/daisyshe/Library/Python/3.10/lib/python/site-packages (from multidict<7.0,>=4.5->aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets) (4.14.1)\n", - "Requirement already satisfied: idna>=2.0 in /opt/homebrew/lib/python3.10/site-packages 
(from yarl<2.0,>=1.17.0->aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.3.0,>=2023.1.0->datasets) (3.10)\n", - "Requirement already satisfied: hf-xet<2.0.0,>=1.1.2 in /opt/homebrew/lib/python3.10/site-packages (from huggingface-hub>=0.24.0->datasets) (1.1.5)\n", - "Requirement already satisfied: charset_normalizer<4,>=2 in /opt/homebrew/lib/python3.10/site-packages (from requests>=2.32.2->datasets) (3.4.2)\n", - "Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/homebrew/lib/python3.10/site-packages (from requests>=2.32.2->datasets) (2.5.0)\n", - "Requirement already satisfied: certifi>=2017.4.17 in /opt/homebrew/lib/python3.10/site-packages (from requests>=2.32.2->datasets) (2025.7.14)\n", - "Requirement already satisfied: python-dateutil>=2.8.2 in /Users/daisyshe/Library/Python/3.10/lib/python/site-packages (from pandas->datasets) (2.9.0.post0)\n", - "Requirement already satisfied: pytz>=2020.1 in /opt/homebrew/lib/python3.10/site-packages (from pandas->datasets) (2025.2)\n", - "Requirement already satisfied: tzdata>=2022.7 in /opt/homebrew/lib/python3.10/site-packages (from pandas->datasets) (2025.2)\n", - "Requirement already satisfied: six>=1.5 in /Users/daisyshe/Library/Python/3.10/lib/python/site-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.17.0)\n", - "Note: you may need to restart the kernel to use updated packages.\n" - ] - } - ], + "outputs": [], "source": [ - "pip install datasets" + "# Install required packages\n", + "!pip install openai datasets pandas --quiet" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/opt/homebrew/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. 
See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", - " from .autonotebook import tqdm as notebook_tqdm\n" - ] - } - ], + "outputs": [], "source": [ + "# Import libraries\n", "from datasets import load_dataset\n", + "from openai import OpenAI\n", + "import os\n", + "import json\n", + "import time\n", + "import pandas as pd" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Dataset Preparation\n", "\n", + "We use the [VibeEval](https://huggingface.co/datasets/RekaAI/VibeEval) dataset that's hosted on Hugging Face. It contains a collection of user prompt, accompanying image, and reference answer data. First, we load the dataset." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ "dataset = load_dataset(\"RekaAI/VibeEval\")" ] }, @@ -97,14 +72,14 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ "evals_data_source = []\n", "\n", - "# select the first 5 examples in the dataset to use for this cookbook\n", - "for example in dataset[\"test\"].select(range(5)):\n", + "# select the first 3 examples in the dataset to use for this cookbook\n", + "for example in dataset[\"test\"].select(range(3)):\n", " evals_data_source.append({\n", " \"item\": {\n", " \"media_url\": example[\"media_url\"], # image web URL\n", @@ -135,43 +110,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Evals Structure\n", + "## Eval Configuration\n", "\n", - "Now that we have our data source and task, we will create our evals. 
For the evals API docs, visit [API docs](https://platform.openai.com/docs/evals/overview).\n" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Requirement already satisfied: openai in /opt/homebrew/lib/python3.10/site-packages (1.95.1)\n", - "Requirement already satisfied: anyio<5,>=3.5.0 in /opt/homebrew/lib/python3.10/site-packages (from openai) (4.9.0)\n", - "Requirement already satisfied: distro<2,>=1.7.0 in /opt/homebrew/lib/python3.10/site-packages (from openai) (1.9.0)\n", - "Requirement already satisfied: httpx<1,>=0.23.0 in /opt/homebrew/lib/python3.10/site-packages (from openai) (0.28.1)\n", - "Requirement already satisfied: jiter<1,>=0.4.0 in /opt/homebrew/lib/python3.10/site-packages (from openai) (0.10.0)\n", - "Requirement already satisfied: pydantic<3,>=1.9.0 in /opt/homebrew/lib/python3.10/site-packages (from openai) (2.11.7)\n", - "Requirement already satisfied: sniffio in /opt/homebrew/lib/python3.10/site-packages (from openai) (1.3.1)\n", - "Requirement already satisfied: tqdm>4 in /opt/homebrew/lib/python3.10/site-packages (from openai) (4.67.1)\n", - "Requirement already satisfied: typing-extensions<5,>=4.11 in /Users/daisyshe/Library/Python/3.10/lib/python/site-packages (from openai) (4.14.1)\n", - "Requirement already satisfied: exceptiongroup>=1.0.2 in /Users/daisyshe/Library/Python/3.10/lib/python/site-packages (from anyio<5,>=3.5.0->openai) (1.3.0)\n", - "Requirement already satisfied: idna>=2.8 in /opt/homebrew/lib/python3.10/site-packages (from anyio<5,>=3.5.0->openai) (3.10)\n", - "Requirement already satisfied: certifi in /opt/homebrew/lib/python3.10/site-packages (from httpx<1,>=0.23.0->openai) (2025.7.14)\n", - "Requirement already satisfied: httpcore==1.* in /opt/homebrew/lib/python3.10/site-packages (from httpx<1,>=0.23.0->openai) (1.0.9)\n", - "Requirement already satisfied: h11>=0.16 in 
/opt/homebrew/lib/python3.10/site-packages (from httpcore==1.*->httpx<1,>=0.23.0->openai) (0.16.0)\n", - "Requirement already satisfied: annotated-types>=0.6.0 in /opt/homebrew/lib/python3.10/site-packages (from pydantic<3,>=1.9.0->openai) (0.7.0)\n", - "Requirement already satisfied: pydantic-core==2.33.2 in /opt/homebrew/lib/python3.10/site-packages (from pydantic<3,>=1.9.0->openai) (2.33.2)\n", - "Requirement already satisfied: typing-inspection>=0.4.0 in /opt/homebrew/lib/python3.10/site-packages (from pydantic<3,>=1.9.0->openai) (0.4.1)\n", - "Note: you may need to restart the kernel to use updated packages.\n" - ] - } - ], - "source": [ - "pip install openai" + "Now that we have our data source and task, we will create our evals. For the OpenAI Evals API docs, visit [API docs](https://platform.openai.com/docs/evals/overview).\n" ] }, { @@ -180,9 +121,6 @@ "metadata": {}, "outputs": [], "source": [ - "from openai import OpenAI\n", - "import os\n", - "\n", "client = OpenAI(\n", " api_key=os.getenv(\"OPENAI_API_KEY\")\n", ")" @@ -192,12 +130,21 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Evals have two parts, the \"Eval\" and the \"Run\". In the \"Eval\", we define the expected structure of the data and the testing criteria (grader). Based on the data that we have compiled, our data source config is as follows:" + "Evals have two parts, the \"Eval\" and the \"Run\". In the \"Eval\", we define the expected structure of the data and the testing criteria (grader)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data Source Config\n", + "\n", + "Based on the data that we have compiled, our data source config is as follows:" ] }, { "cell_type": "code", - "execution_count": 23, + "execution_count": 28, "metadata": {}, "outputs": [], "source": [ @@ -220,7 +167,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For our testing criteria, we set up our grader config. 
In this example, it is a model grader that takes in an image, reference answer, and sampled model response (in the `sample` namespace), and then outputs a score between 0 and 1 based on how closely the model response matches the reference answer and its general suitability for the conversation. For more info on model graders, visit [API docs](hhttps://platform.openai.com/docs/api-reference/graders). \n",
+ "### Testing Criteria"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "For our testing criteria, we set up our grader config. In this example, it is a model grader that takes in an image, reference answer, and sampled model response (in the `sample` namespace), and then outputs a score between 0 and 1 based on how closely the model response matches the reference answer and its general suitability for the conversation. For more info on model graders, visit [API Grader docs](https://platform.openai.com/docs/api-reference/graders). \n",
 "\n",
 "Getting both the data and the grader right is key for an effective evaluation. So, you will likely want to iteratively refine the prompts for your graders. "
 ]
@@ -234,7 +188,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 24,
+ "execution_count": 29,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -269,7 +223,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 25,
+ "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -284,14 +238,21 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "To create the run, we pass in the eval object id and the data source (i.e., the data we compiled earlier) in addition to the chat message input we'd like for sampling to get the model response. While we won't dive into it in this cookbook, EvalsAPI also supports stored completions containing images as a data source. 
\n",
+ "## Eval Run"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To create the run, we pass in the eval object id, the data source (i.e., the data we compiled earlier), and the chat message input we will use for sampling to generate the model response. While we won't dive into it in this cookbook, EvalsAPI also supports stored completions containing images as a data source. \n",
 "\n",
 "Here's the sampling message input we'll use for this example."
 ]
 },
 {
 "cell_type": "code",
- "execution_count": 26,
+ "execution_count": 31,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -314,9 +275,16 @@
 " }]"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We now kick off an eval run."
+ ]
+ },
 {
 "cell_type": "code",
- "execution_count": null,
+ "execution_count": 32,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -337,6 +305,13 @@
 " )"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Poll and Display the Results"
+ ]
+ },
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -346,7 +321,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 62,
+ "execution_count": 46,
 "metadata": {},
 "outputs": [
 {
@@ -381,37 +356,23 @@
 " 0\n",
 " Please provide latex code to replicate this table\n",
 " Below is the latex code for your table:\\n```te...\n",
- " Here is the LaTeX code to replicate the table ...\n",
- " {\"steps\":[{\"description\":\"Check if the model’s...\n",
+ " Certainly! 
Below is the LaTeX code to replicat...\n", + " {\"steps\":[{\"description\":\"Assess if the provid...\n", " \n", " \n", " 1\n", " What ingredients do I need to make this?\n", " This appears to be a classic Margherita pizza,...\n", " To make a classic Margherita pizza like the on...\n", - " {\"steps\":[{\"description\":\"Compare the model re...\n", + " {\"steps\":[{\"description\":\"Check if model ident...\n", " \n", " \n", " 2\n", " Is this safe for a vegan to eat?\n", " Based on the image, this dish appears to be a ...\n", - " To determine if this dish is safe for a vegan ...\n", - " {\"steps\":[{\"description\":\"The reference answer...\n", - " \n", - " \n", - " 3\n", - " Where was this taken?\n", - " This image is of the seafront in San Sebastián...\n", - " I can't determine the exact location of the im...\n", + " To determine if the dish is safe for a vegan t...\n", " {\"steps\":[{\"description\":\"Compare model respon...\n", " \n", - " \n", - " 4\n", - " What is the man in the picture doing?\n", - " The man on the postcard is playing bagpipes, w...\n", - " The man in the picture is playing the bagpipes...\n", - " {\"steps\":[{\"description\":\"Compare the model re...\n", - " \n", " \n", "\n", "" @@ -421,29 +382,21 @@ "0 Please provide latex code to replicate this table \n", "1 What ingredients do I need to make this? \n", "2 Is this safe for a vegan to eat? \n", - "3 Where was this taken? \n", - "4 What is the man in the picture doing? \n", "\n", " reference \\\n", "0 Below is the latex code for your table:\\n```te... \n", "1 This appears to be a classic Margherita pizza,... \n", "2 Based on the image, this dish appears to be a ... \n", - "3 This image is of the seafront in San Sebastián... \n", - "4 The man on the postcard is playing bagpipes, w... \n", "\n", " model_response \\\n", - "0 Here is the LaTeX code to replicate the table ... \n", + "0 Certainly! Below is the LaTeX code to replicat... \n", "1 To make a classic Margherita pizza like the on... 
\n",
- "2 To determine if this dish is safe for a vegan ... \n",
- "3 I can't determine the exact location of the im... \n",
- "4 The man in the picture is playing the bagpipes... \n",
+ "2 To determine if the dish is safe for a vegan t... \n",
 "\n",
 " grading_results \n",
- "0 {\"steps\":[{\"description\":\"Check if the model’s... \n",
- "1 {\"steps\":[{\"description\":\"Compare the model re... \n",
- "2 {\"steps\":[{\"description\":\"The reference answer... \n",
- "3 {\"steps\":[{\"description\":\"Compare model respon... \n",
- "4 {\"steps\":[{\"description\":\"Compare the model re... "
+ "0 {\"steps\":[{\"description\":\"Assess if the provid... \n",
+ "1 {\"steps\":[{\"description\":\"Check if model ident... \n",
+ "2 {\"steps\":[{\"description\":\"Compare model respon... "
 ]
 },
 "metadata": {},
 "output_type": "display_data"
 }
 ],
 "source": [
- "import pandas as pd\n",
- "\n",
 "while True:\n",
 " run = client.evals.runs.retrieve(run_id=eval_run.id, eval_id=eval_object.id)\n",
 " if run.status == \"completed\" or run.status == \"failed\": # check if the run is finished\n",
- " output_items = client.evals.runs.output_items.list(\n",
+ " output_items = list(client.evals.runs.output_items.list(\n",
 " run_id=run.id, eval_id=eval_object.id\n",
- " )\n",
+ " ))\n",
 " df = pd.DataFrame({\n",
- " \"prompt\": [item.datasource_item[\"prompt\"]for item in output_items],\n",
- " \"reference\": [item.datasource_item[\"reference\"] for item in output_items],\n",
- " \"model_response\": [item.sample.output[0].content for item in output_items],\n",
- " \"grading_results\": [item.results[0][\"sample\"][\"output\"][0][\"content\"]\n",
- " for item in output_items]\n",
- " })\n",
+ " \"prompt\": [item.datasource_item[\"prompt\"] for item in output_items],\n",
+ " \"reference\": [item.datasource_item[\"reference\"] for item in output_items],\n",
+ " \"model_response\": [item.sample.output[0].content for item in output_items],\n",
+ " \"grading_results\": 
[item.results[0][\"sample\"][\"output\"][0][\"content\"]\n", + " for item in output_items]\n", + " })\n", " display(df)\n", " break\n", " time.sleep(5)" @@ -475,12 +426,19 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "To see the full output item, such as for the pizza ingredients image, we can do the following. The structure of the output item is specified in the API docs [here](https://platform.openai.com/docs/api-reference/evals/run-output-item-object)." + "### Viewing Individual Output Items" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To see a full output item, we can do the following. The structure of an output item is specified in the API docs [here](https://platform.openai.com/docs/api-reference/evals/run-output-item-object)." ] }, { "cell_type": "code", - "execution_count": 68, + "execution_count": 47, "metadata": {}, "outputs": [ { @@ -488,19 +446,19 @@ "output_type": "stream", "text": [ "{\n", - " \"id\": \"outputitem_68768c0f7658819187d4f128c2e0ff8c\",\n", - " \"created_at\": 1752599567,\n", + " \"id\": \"outputitem_687833f102ec8191a6e53a5461b970c2\",\n", + " \"created_at\": 1752708081,\n", " \"datasource_item\": {\n", - " \"prompt\": \"What ingredients do I need to make this?\",\n", - " \"media_url\": \"https://storage.googleapis.com/reka-annotate.appspot.com/vibe-eval/difficulty-normal_food1_7e5c2cb9c8200d70.jpg\",\n", - " \"reference\": \"This appears to be a classic Margherita pizza, which has the following ingredients:\\n\\n- Pizza Dough: You'll need yeast, flour, salt, and water to make the dough. A simple recipe is 500g of flour, 1 tsp of salt, 1 tbsp of sugar, and about 300ml warm water.\\n\\n- Tomatoes: Fresh or canned San Marzano tomatoes are traditionally used for their sweet flavor. 
If using fresh tomatoes, you can blend them into a sauce.\\n\\n- Mozzarella Cheese: Traditionally mozzarella di bufala campana D.O.P., but Fior di Latte or other fresh mozzarella work well too.\\n\\n- Basil Leaves: Fresh basil leaves add a burst of flavor.\\n\\n- Olive Oil: Extra virgin olive oil is drizzled over the pizza before baking for added flavor.\\n\\n- Salt & Pepper\\n\\nYou would also need a pizza stone or baking sheet preheated in an oven set to around 475\\u00b0F (246\\u00b0C). Once your dough is prepared and shaped into a circle (use parchment paper if it's homemade), spread your tomato sauce on top leaving some space at the edge. Add dollops of cheese on top then gently press them down with your fingers. Drizzle with olive oil and season with salt & pepper. Finally add your basil leaves before placing it in the oven to bake until the crust is golden brown and bubbly - about 10 minutes depending on thickness.\"\n", + " \"prompt\": \"Please provide latex code to replicate this table\",\n", + " \"media_url\": \"https://storage.googleapis.com/reka-annotate.appspot.com/vibe-eval/difficulty-normal_table0_b312eea68bcd0de6.png\",\n", + " \"reference\": \"Below is the latex code for your table:\\n```tex\\n\\\\begin{table}\\n\\\\begin{tabular}{c c c c} \\\\hline & \\\\(S2\\\\) & Expert & Layman & PoelM \\\\\\\\ \\\\cline{2-4} \\\\(S1\\\\) & Expert & \\u2013 & 54.0 & 62.7 \\\\\\\\ & Layman & 46.0 & \\u2013 & 60.7 \\\\\\\\ &,PoelM,LM,LM,LM,LM,LM,,L,M,,L,M,,L,M,,L,M,,,\\u2013&39.3 \\\\\\\\\\n[-1ex] \\\\end{tabular}\\n\\\\end{table}\\n```.\"\n", " },\n", - " \"datasource_item_id\": 2,\n", - " \"eval_id\": \"eval_687689442f7c8191aa614761671be57c\",\n", + " \"datasource_item_id\": 1,\n", + " \"eval_id\": \"eval_687833d68e888191bc4bd8b965368f22\",\n", " \"object\": \"eval.run.output_item\",\n", " \"results\": [\n", " {\n", - " \"name\": \"Score Model Grader-3510e5e4-b0f8-4cfb-a051-f5440152ae1e\",\n", + " \"name\": \"Score Model 
Grader-73fe48a0-8090-46eb-aa8e-d426ad074eb3\",\n", " \"sample\": {\n", " \"input\": [\n", " {\n", @@ -509,21 +467,21 @@ " },\n", " {\n", " \"role\": \"user\",\n", - " \"content\": \"Prompt: What ingredients do I need to make this?. https://storage.googleapis.com/reka-annotate.appspot.com/vibe-eval/difficulty-normal_food1_7e5c2cb9c8200d70.jpg Reference answer: This appears to be a classic Margherita pizza, which has the following ingredients:\\n\\n- Pizza Dough: You'll need yeast, flour, salt, and water to make the dough. A simple recipe is 500g of flour, 1 tsp of salt, 1 tbsp of sugar, and about 300ml warm water.\\n\\n- Tomatoes: Fresh or canned San Marzano tomatoes are traditionally used for their sweet flavor. If using fresh tomatoes, you can blend them into a sauce.\\n\\n- Mozzarella Cheese: Traditionally mozzarella di bufala campana D.O.P., but Fior di Latte or other fresh mozzarella work well too.\\n\\n- Basil Leaves: Fresh basil leaves add a burst of flavor.\\n\\n- Olive Oil: Extra virgin olive oil is drizzled over the pizza before baking for added flavor.\\n\\n- Salt & Pepper\\n\\nYou would also need a pizza stone or baking sheet preheated in an oven set to around 475\\u00b0F (246\\u00b0C). Once your dough is prepared and shaped into a circle (use parchment paper if it's homemade), spread your tomato sauce on top leaving some space at the edge. Add dollops of cheese on top then gently press them down with your fingers. Drizzle with olive oil and season with salt & pepper. Finally add your basil leaves before placing it in the oven to bake until the crust is golden brown and bubbly - about 10 minutes depending on thickness.. 
Model response: To make a classic Margherita pizza like the one in the image, you'll need the following ingredients:\\n\\n### For the Dough:\\n- **Flour** (preferably Type \\\"00\\\" pizza flour)\\n- **Water**\\n- **Yeast** (active dry or fresh)\\n- **Salt**\\n- **Olive oil** (optional)\\n\\n### For the Topping:\\n- **Tomato Sauce** (preferably San Marzano tomatoes, crushed)\\n- **Fresh Mozzarella Cheese** (preferably buffalo mozzarella)\\n- **Fresh Basil Leaves**\\n- **Olive Oil** (for drizzling)\\n- **Salt** (to taste)\\n\\n### Optional:\\n- **Parmesan Cheese** (for extra flavor)\\n- **Crushed Red Pepper Flakes** (for heat)\\n\\n### Instructions Summary:\\n1. Prepare the dough by mixing flour, water, yeast, and salt, then let it rise.\\n2. Shape the dough into a pizza base.\\n3. Spread tomato sauce over the base.\\n4. Add sliced mozzarella and basil leaves.\\n5. Bake in a hot oven or pizza stone until the crust is golden and the cheese is bubbly.\\n6. Drizzle with olive oil before serving. \\n\\nEnjoy your pizza-making!.\"\n", + " \"content\": \"Prompt: Please provide latex code to replicate this table. https://storage.googleapis.com/reka-annotate.appspot.com/vibe-eval/difficulty-normal_table0_b312eea68bcd0de6.png Reference answer: Below is the latex code for your table:\\n```tex\\n\\\\begin{table}\\n\\\\begin{tabular}{c c c c} \\\\hline & \\\\(S2\\\\) & Expert & Layman & PoelM \\\\\\\\ \\\\cline{2-4} \\\\(S1\\\\) & Expert & \\u2013 & 54.0 & 62.7 \\\\\\\\ & Layman & 46.0 & \\u2013 & 60.7 \\\\\\\\ &,PoelM,LM,LM,LM,LM,LM,,L,M,,L,M,,L,M,,L,M,,,\\u2013&39.3 \\\\\\\\\\n[-1ex] \\\\end{tabular}\\n\\\\end{table}\\n```.. Model response: Certainly! 
Below is the LaTeX code to replicate the table you provided:\\n\\n```latex\\n\\\\documentclass{article}\\n\\\\usepackage{array}\\n\\\\usepackage{multirow}\\n\\\\usepackage{booktabs}\\n\\n\\\\begin{document}\\n\\n\\\\begin{table}[ht]\\n \\\\centering\\n \\\\begin{tabular}{c|c|c|c}\\n \\\\multirow{2}{*}{S1} & \\\\multirow{2}{*}{S2} & \\\\multicolumn{3}{c}{Methods} \\\\\\\\ \\n \\\\cline{3-5}\\n & & Expert & Layman & PoeLM \\\\\\\\\\n \\\\hline\\n Expert & & - & 54.0 & 62.7 \\\\\\\\\\n Layman & & 46.0 & - & 60.7 \\\\\\\\\\n PoeLM & & 37.3 & 39.3 & - \\\\\\\\\\n \\\\end{tabular}\\n \\\\caption{Comparison of different methods}\\n \\\\label{tab:methods_comparison}\\n\\\\end{table}\\n\\n\\\\end{document}\\n```\\n\\n### Explanation:\\n- The `multirow` package is used to create the multi-row header for `S1` and `S2`.\\n- The `booktabs` package is used for improved table formatting (with `\\\\hline` for horizontal lines).\\n- Adjust the table's caption and label as needed..\"\n", " }\n", " ],\n", " \"output\": [\n", " {\n", " \"role\": \"assistant\",\n", - " \"content\": \"{\\\"steps\\\":[{\\\"description\\\":\\\"Compare the model response ingredients list with the reference ingredients list. Both include flour, water, yeast, salt, olive oil, tomato sauce (San Marzano), fresh mozzarella, basil, and salt. 
The model also adds optional parmesan and red pepper flakes, which do not conflict with the core Margherita recipe.\\\",\\\"conclusion\\\":\\\"Ingredient lists match well, with minor optional additions that don\\u2019t detract.\\\"},{\\\"description\\\":\\\"Assess whether the response answers the prompt: The user asks \\u201cWhat ingredients do I need to make this?\\u201d and the model response provides a clear, structured list of ingredients along with an optional instructions summary.\\\",\\\"conclusion\\\":\\\"The response directly addresses the prompt.\\\"},{\\\"description\\\":\\\"Compare the overall completeness and accuracy against the reference: The model includes all key ingredients (dough components, sauce, cheese, basil, olive oil, salt) and optional extras. Instructions are concise and helpful.\\\",\\\"conclusion\\\":\\\"The model response is comprehensive and accurate.\\\"}],\\\"result\\\":1.0}\"\n", + " \"content\": \"{\\\"steps\\\":[{\\\"description\\\":\\\"Assess if the provided LaTeX code correctly matches the structure of the target table, including the diagonal header, column counts, and alignment.\\\",\\\"conclusion\\\":\\\"The code fails to create the diagonal split between S1 and S2 and mismatches column counts (defines 4 columns but uses 5).\\\"},{\\\"description\\\":\\\"Check the header layout: the target table has a single diagonal cell spanning two axes and three following columns labeled Expert, Layman, PoeLM. 
The model uses \\\\\\\\multirow and a \\\\\\\\multicolumn block named 'Methods', which does not replicate the diagonal or correct labeling.\\\",\\\"conclusion\\\":\\\"Header structure is incorrect and does not match the prompt's table.\\\"},{\\\"description\\\":\\\"Verify the data rows: the model code includes two empty cells after S1 and before the data, misaligning all data entries relative to the intended columns.\\\",\\\"conclusion\\\":\\\"Data rows are misaligned due to incorrect column definitions.\\\"},{\\\"description\\\":\\\"Overall compatibility: the code is syntactically flawed for the target table and conceptually does not replicate the diagonal header or correct column count.\\\",\\\"conclusion\\\":\\\"The response does not satisfy the prompt.\\\"}],\\\"result\\\":0.0}\"\n", " }\n", " ],\n", " \"finish_reason\": \"stop\",\n", " \"model\": \"o4-mini-2025-04-16\",\n", " \"usage\": {\n", - " \"total_tokens\": 2395,\n", - " \"completion_tokens\": 420,\n", - " \"prompt_tokens\": 1975,\n", + " \"total_tokens\": 2185,\n", + " \"completion_tokens\": 712,\n", + " \"prompt_tokens\": 1473,\n", " \"cached_tokens\": 0\n", " },\n", " \"error\": null,\n", @@ -533,34 +491,33 @@ " \"reasoning_effort\": null,\n", " \"max_completions_tokens\": 4096\n", " },\n", - " \"passed\": true,\n", - " \"score\": 1.0\n", + " \"passed\": false,\n", + " \"score\": 0.0\n", " }\n", " ],\n", - " \"run_id\": \"evalrun_68768bfd44fc8191a359121443dab061\",\n", - " \"sample\": \"Sample(error=None, finish_reason='stop', input=[SampleInput(content='What ingredients do I need to make this?', role='user'), SampleInput(content='https://storage.googleapis.com/reka-annotate.appspot.com/vibe-eval/difficulty-normal_food1_7e5c2cb9c8200d70.jpg', role='user')], max_completion_tokens=None, model='gpt-4o-mini-2024-07-18', output=[SampleOutput(content='To make a classic Margherita pizza like the one in the image, you\\\\'ll need the following ingredients:\\\\n\\\\n### For the Dough:\\\\n- **Flour** 
(preferably Type \\\"00\\\" pizza flour)\\\\n- **Water**\\\\n- **Yeast** (active dry or fresh)\\\\n- **Salt**\\\\n- **Olive oil** (optional)\\\\n\\\\n### For the Topping:\\\\n- **Tomato Sauce** (preferably San Marzano tomatoes, crushed)\\\\n- **Fresh Mozzarella Cheese** (preferably buffalo mozzarella)\\\\n- **Fresh Basil Leaves**\\\\n- **Olive Oil** (for drizzling)\\\\n- **Salt** (to taste)\\\\n\\\\n### Optional:\\\\n- **Parmesan Cheese** (for extra flavor)\\\\n- **Crushed Red Pepper Flakes** (for heat)\\\\n\\\\n### Instructions Summary:\\\\n1. Prepare the dough by mixing flour, water, yeast, and salt, then let it rise.\\\\n2. Shape the dough into a pizza base.\\\\n3. Spread tomato sauce over the base.\\\\n4. Add sliced mozzarella and basil leaves.\\\\n5. Bake in a hot oven or pizza stone until the crust is golden and the cheese is bubbly.\\\\n6. Drizzle with olive oil before serving. \\\\n\\\\nEnjoy your pizza-making!', role='assistant')], seed=None, temperature=1.0, top_p=1.0, usage=SampleUsage(cached_tokens=0, completion_tokens=249, prompt_tokens=36856, total_tokens=37105), max_completions_tokens=4096)\",\n", - " \"status\": \"pass\",\n", - " \"_datasource_item_content_hash\": \"4baa1b4e9daaee8cce285a14b3d7e8155eef3a7770ebbbb50ee16f78f5024768\"\n", + " \"run_id\": \"evalrun_687833dbadd081919a0f9fbfb817baf4\",\n", + " \"sample\": \"Sample(error=None, finish_reason='stop', input=[SampleInput(content='Please provide latex code to replicate this table', role='user'), SampleInput(content='https://storage.googleapis.com/reka-annotate.appspot.com/vibe-eval/difficulty-normal_table0_b312eea68bcd0de6.png', role='user')], max_completion_tokens=None, model='gpt-4o-mini-2024-07-18', output=[SampleOutput(content=\\\"Certainly! 
Below is the LaTeX code to replicate the table you provided:\\\\n\\\\n```latex\\\\n\\\\\\\\documentclass{article}\\\\n\\\\\\\\usepackage{array}\\\\n\\\\\\\\usepackage{multirow}\\\\n\\\\\\\\usepackage{booktabs}\\\\n\\\\n\\\\\\\\begin{document}\\\\n\\\\n\\\\\\\\begin{table}[ht]\\\\n \\\\\\\\centering\\\\n \\\\\\\\begin{tabular}{c|c|c|c}\\\\n \\\\\\\\multirow{2}{*}{S1} & \\\\\\\\multirow{2}{*}{S2} & \\\\\\\\multicolumn{3}{c}{Methods} \\\\\\\\\\\\\\\\ \\\\n \\\\\\\\cline{3-5}\\\\n & & Expert & Layman & PoeLM \\\\\\\\\\\\\\\\\\\\n \\\\\\\\hline\\\\n Expert & & - & 54.0 & 62.7 \\\\\\\\\\\\\\\\\\\\n Layman & & 46.0 & - & 60.7 \\\\\\\\\\\\\\\\\\\\n PoeLM & & 37.3 & 39.3 & - \\\\\\\\\\\\\\\\\\\\n \\\\\\\\end{tabular}\\\\n \\\\\\\\caption{Comparison of different methods}\\\\n \\\\\\\\label{tab:methods_comparison}\\\\n\\\\\\\\end{table}\\\\n\\\\n\\\\\\\\end{document}\\\\n```\\\\n\\\\n### Explanation:\\\\n- The `multirow` package is used to create the multi-row header for `S1` and `S2`.\\\\n- The `booktabs` package is used for improved table formatting (with `\\\\\\\\hline` for horizontal lines).\\\\n- Adjust the table's caption and label as needed.\\\", role='assistant')], seed=None, temperature=1.0, top_p=1.0, usage=SampleUsage(cached_tokens=0, completion_tokens=295, prompt_tokens=14187, total_tokens=14482), max_completions_tokens=4096)\",\n", + " \"status\": \"fail\",\n", + " \"_datasource_item_content_hash\": \"bb2090df47ea2ca0aa67337709ce2ff7382d639118d3358068b0cc7031c12f82\"\n", "}\n" ] } ], "source": [ - "import json\n", + "first_item = output_items[0]\n", "\n", - "pizza_item = next(\n", - " item for item in output_items \n", - " if \"What ingredients do I need to make this?\" in item.datasource_item[\"prompt\"]\n", - ")\n", - "\n", - "print(json.dumps(dict(pizza_item), indent=2, default=str))" + "print(json.dumps(dict(first_item), indent=2, default=str))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Now, feel free to extend this to your own use cases! 
Some examples include grading image generation results with our Evals API model graders, evaluating your OCR use cases using model sampling, and more. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "In this cookbook, we covered a workflow for evaluating an image-based task using the OpenAI Evals API. By using the image input functionality for both sampling and model grading, we were able to streamline our evals process for the task.\n",
    "\n",
    "We're excited to see you extend this to your own image-based use cases, whether that's OCR accuracy, image generation grading, or something else entirely!"
   ]
  }
 ],