diff --git a/authors.yaml b/authors.yaml index ef087642a6..d84aa18f87 100644 --- a/authors.yaml +++ b/authors.yaml @@ -391,3 +391,8 @@ corwin: name: "Corwin Cheung" website: "https://www.linkedin.com/in/corwincubes/" avatar: "https://avatars.githubusercontent.com/u/85517581?v=4" + +daisyshe-oai: + name: "Daisy Sheng" + website: "https://www.linkedin.com/in/daisysheng/" + avatar: "https://avatars.githubusercontent.com/u/212609991?v=4" diff --git a/examples/evaluation/use-cases/EvalsAPI_Image_Inputs.ipynb b/examples/evaluation/use-cases/EvalsAPI_Image_Inputs.ipynb new file mode 100644 index 0000000000..12a65e5efc --- /dev/null +++ b/examples/evaluation/use-cases/EvalsAPI_Image_Inputs.ipynb @@ -0,0 +1,545 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Evals API: Image Inputs\n", + "\n", + "This cookbook demonstrates how to use OpenAI's Evals framework for image-based tasks. Leveraging the Evals API, we will grade model-generated responses to an image and prompt by using **sampling** to generate model responses and **model grading** (LLM as a Judge) to score the model responses against the image, prompt, and reference answer.\n", + "\n", + "In this example, we will evaluate how well our model can:\n", + "1. **Generate appropriate responses** to user prompts about images\n", + "2. **Align with reference answers** that represent high-quality responses" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Installing Dependencies + Setup" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "# Install required packages\n", + "!pip install openai datasets pandas --quiet" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Import libraries\n", + "from datasets import load_dataset\n", + "from openai import OpenAI\n", + "import os\n", + "import json\n", + "import time\n", + "import pandas as pd" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Dataset Preparation\n", + "\n", + "We use the [VibeEval](https://huggingface.co/datasets/RekaAI/VibeEval) dataset that's hosted on Hugging Face. It contains a collection of user prompts, accompanying images, and reference answers. First, we load the dataset." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "dataset = load_dataset(\"RekaAI/VibeEval\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We extract the relevant fields and put them in a JSON-like format to pass in as a data source in the Evals API. Input image data can be in the form of a web URL or a base64 encoded string. Here, we use the provided web URLs.
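If your images are local files rather than hosted URLs, the base64 option mentioned above can be used instead: encode the file and pass a data URL in place of the web URL. Below is a minimal sketch of that approach; the helper name and file name are illustrative and not part of the dataset or this cookbook.

```python
import base64

def image_to_data_url(path: str, mime_type: str = "image/jpeg") -> str:
    # Read a local image and return it as a base64 data URL,
    # i.e. the "base64 encoded string" form of image input mentioned above.
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"

# Example (hypothetical local file): use the data URL wherever a web URL would go, e.g.
# {"item": {"media_url": image_to_data_url("my_image.jpg"), "reference": "...", "prompt": "..."}}
```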
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "evals_data_source = []\n", + "\n", + "# select the first 3 examples in the dataset to use for this cookbook\n", + "for example in dataset[\"test\"].select(range(3)):\n", + " evals_data_source.append({\n", + " \"item\": {\n", + " \"media_url\": example[\"media_url\"], # image web URL\n", + " \"reference\": example[\"reference\"], # reference answer\n", + " \"prompt\": example[\"prompt\"] # prompt\n", + " }\n", + " })" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you print the data source list, each item should be of a similar form to:\n", + "\n", + "```python\n", + "{\n", + " \"item\": {\n", + " \"media_url\": \"https://storage.googleapis.com/reka-annotate.appspot.com/vibe-eval/difficulty-normal_food1_7e5c2cb9c8200d70.jpg\"\n", + " \"reference\": \"This appears to be a classic Margherita pizza, which has the following ingredients...\"\n", + " \"prompt\": \"What ingredients do I need to make this?\"\n", + " }\n", + "}\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Eval Configuration\n", + "\n", + "Now that we have our data source and task, we will create our evals. For the OpenAI Evals API docs, visit [API docs](https://platform.openai.com/docs/evals/overview).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "client = OpenAI(\n", + " api_key=os.getenv(\"OPENAI_API_KEY\")\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Evals have two parts, the \"Eval\" and the \"Run\". In the \"Eval\", we define the expected structure of the data and the testing criteria (grader)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data Source Config\n", + "\n", + "Based on the data that we have compiled, our data source config is as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [], + "source": [ + "data_source_config = {\n", + " \"type\": \"custom\",\n", + " \"item_schema\": {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"media_url\": { \"type\": \"string\" },\n", + " \"reference\": { \"type\": \"string\" },\n", + " \"prompt\": { \"type\": \"string\" }\n", + " },\n", + " \"required\": [\"media_url\", \"reference\", \"prompt\"]\n", + " },\n", + " \"include_sample_schema\": True, # enables sampling\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Testing Criteria" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For our testing criteria, we set up our grader config. In this example, it is a model grader that takes in an image, reference answer, and sampled model response (in the `sample` namespace), and then outputs a score between 0 and 1 based on how closely the model response matches the reference answer and its general suitability for the conversation. For more info on model graders, visit [API Grader docs](hhttps://platform.openai.com/docs/api-reference/graders). \n", + "\n", + "Getting the both the data and the grader right are key for an effective evaluation. So, you will likely want to iteratively refine the prompts for your graders. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Note**: The image url field / templating need to be placed in an input image object to be interpreted as an image. 
Otherwise, the image will be interpreted as a text string. " + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [], + "source": [ + "grader_config = {\n", + "    \"type\": \"score_model\",\n", + "    \"name\": \"Score Model Grader\",\n", + "    \"input\": [\n", + "        {\n", + "            \"role\": \"system\",\n", + "            \"content\": \"You are an expert grader. Judge how well the model response suits the image and prompt as well as matches the meaning of the reference answer. Output a score of 1 if great. If it's somewhat compatible, output a score around 0.5. Otherwise, give a score of 0.\"\n", + "        },\n", + "        {\n", + "            \"role\": \"user\",\n", + "            \"content\": [\n", + "                { \"type\": \"input_text\", \"text\": \"Prompt: {{ item.prompt }}.\" },\n", + "                { \"type\": \"input_image\", \"image_url\": \"{{ item.media_url }}\", \"detail\": \"auto\" },\n", + "                { \"type\": \"input_text\", \"text\": \"Reference answer: {{ item.reference }}. Model response: {{ sample.output_text }}.\" }\n", + "            ]\n", + "        }\n", + "    ],\n", + "    \"pass_threshold\": 0.9,\n", + "    \"range\": [0, 1],\n", + "    \"model\": \"o4-mini\"  # model for grading; check that the model you use supports image inputs\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, we create the eval object." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "eval_object = client.evals.create(\n", + "    name=\"Image Grading\",\n", + "    data_source_config=data_source_config,\n", + "    testing_criteria=[grader_config],\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Eval Run" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To create the run, we pass in the eval object id, the data source (i.e., the data we compiled earlier), and the chat message input we will use for sampling to generate the model response. While we won't dive into it in this cookbook, the Evals API also supports stored completions containing images as a data source. \n", + "\n", + "Here's the sampling message input we'll use for this example." + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [], + "source": [ + "sampling_messages = [{\n", + "    \"role\": \"user\",\n", + "    \"type\": \"message\",\n", + "    \"content\": {\n", + "        \"type\": \"input_text\",\n", + "        \"text\": \"{{ item.prompt }}\"\n", + "    }\n", + "},\n", + "{\n", + "    \"role\": \"user\",\n", + "    \"type\": \"message\",\n", + "    \"content\": {\n", + "        \"type\": \"input_image\",\n", + "        \"image_url\": \"{{ item.media_url }}\",\n", + "        \"detail\": \"auto\"\n", + "    }\n", + "}]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We now kick off an eval run."
+ ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [], + "source": [ + "eval_run = client.evals.runs.create(\n", + " name=\"Image Input Eval Run\",\n", + " eval_id=eval_object.id,\n", + " data_source={\n", + " \"type\": \"responses\", # sample using responses API\n", + " \"source\": {\n", + " \"type\": \"file_content\",\n", + " \"content\": evals_data_source\n", + " },\n", + " \"model\": \"gpt-4o-mini\", # model used to generate the response; check that the model you use supports image inputs\n", + " \"input_messages\": {\n", + " \"type\": \"template\", \n", + " \"template\": sampling_messages}\n", + " }\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Poll and Display the Results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When the run finishes, we can take a look at the result. You can also check in your org's OpenAI evals dashboard to see the progress and results. " + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + " | prompt | \n", + "reference | \n", + "model_response | \n", + "grading_results | \n", + "
---|---|---|---|---|
0 | \n", + "Please provide latex code to replicate this table | \n", + "Below is the latex code for your table:\\n```te... | \n", + "Certainly! Below is the LaTeX code to replicat... | \n", + "{\"steps\":[{\"description\":\"Assess if the provid... | \n", + "
1 | \n", + "What ingredients do I need to make this? | \n", + "This appears to be a classic Margherita pizza,... | \n", + "To make a classic Margherita pizza like the on... | \n", + "{\"steps\":[{\"description\":\"Check if model ident... | \n", + "
2 | \n", + "Is this safe for a vegan to eat? | \n", + "Based on the image, this dish appears to be a ... | \n", + "To determine if the dish is safe for a vegan t... | \n", + "{\"steps\":[{\"description\":\"Compare model respon... | \n", + "
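For reference, polling the run from the SDK can look roughly like the sketch below, reusing the `eval_object` and `eval_run` created earlier. This is a sketch rather than the cookbook's exact code, and the run fields referenced (`status`, `report_url`, `result_counts`) should be checked against the Evals API reference.

```python
# Poll the run until it reaches a terminal state, then print a summary.
while True:
    run = client.evals.runs.retrieve(run_id=eval_run.id, eval_id=eval_object.id)
    if run.status in ("completed", "failed", "canceled"):
        break
    time.sleep(5)

print(run.status)
print(run.report_url)      # link to this run in the evals dashboard
print(run.result_counts)   # counts of passed / failed / errored items

# Per-item details (data source item, model response, grader results) can be listed with:
# client.evals.runs.output_items.list(run_id=eval_run.id, eval_id=eval_object.id)
```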