openai · hendrytl · Aug 14, 2025 · Aug 13, 2025 · Aug 13, 2025 · Aug 13, 2025
diff --git a/examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb b/examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb
@@ -6,7 +6,9 @@
    "source": [
     "# Evals API: Audio Inputs\n",
     "\n",
-    "This cookbook demonstrates how to use OpenAI's Evals framework for audio-based tasks. Leveraging the Evals API, we will grade model-generated responses to an audio message and prompt by using **sampling** to generate model responses and **model grading** (LLM as a Judge) to score the model responses against the output audio transcript, prompt, and reference answer. Note that grading will be on text outputs from the sampled response. Graders that can grade audio input are not currently supported.\n",
+    "This cookbook demonstrates how to use OpenAI's Evals framework for audio-based tasks. Leveraging the Evals API, we will grade model-generated responses to an audio message and prompt by using **sampling** to generate model responses and **string match grading** to score the text transcript of each sampled response against the reference answer. Note that grading will be on text outputs from the sampled response. Graders that can grade audio input are not currently supported.\n",
+    "\n",
+    "Before audio support was added, audio conversations had to be transcribed to text before they could be evaluated. Now you can use the original audio and get samples from the model in audio as well. This more accurately represents workflows such as a customer support agent where both the user and the agent are using audio. For grading, we use the text transcript of the sampled audio so that we can leverage the existing suite of text graders.\n",
     "\n",
     "In this example, we will evaluate how well our model can:\n",
     "1. **Generate appropriate responses** to user prompts about an audio message\n",
@@ -27,7 +29,7 @@
    "outputs": [],
    "source": [
     "# Install required packages\n",
-    "!pip install openai datasets pandas soundfile torch torchcodec --quiet"
+    "%pip install openai datasets pandas soundfile torch torchcodec --quiet"
    ]
   },
   {
@@ -55,7 +57,7 @@
    "source": [
     "## Dataset Preparation\n",
     "\n",
-    "We use the [big_bench_audio](https://huggingface.co/datasets/ArtificialAnalysis/big_bench_audio) dataset that's hosted on Hugging Face. Big Bench Audio is an audio version of a subset of Big Bench Hard questions. The dataset can be used for evaluating the reasoning capabilities of models that support audio input."
+    "We use the [big_bench_audio](https://huggingface.co/datasets/ArtificialAnalysis/big_bench_audio) dataset that's hosted on Hugging Face. Big Bench Audio is an audio version of a subset of Big Bench Hard questions. The dataset can be used for evaluating the reasoning capabilities of models that support audio input. Each example contains an audio clip describing a logic problem, a category, and an official answer."
    ]
   },
   {
@@ -86,11 +88,9 @@
     "def get_base64(audio_path_or_datauri: str) -> str:\n",
     "    if audio_path_or_datauri.startswith(\"data:\"):\n",
     "        # Already base64, just strip prefix\n",
-    "        print(\"Already base64, just strip prefix\")\n",
     "        return audio_path_or_datauri.split(\",\", 1)[1]\n",
     "    else:\n",
     "        # It's a real file path\n",
-    "        print(\"It's a real file path\")\n",
     "        with open(audio_path_or_datauri, \"rb\") as f:\n",
     "            return base64.b64encode(f.read()).decode(\"ascii\")\n",
     "\n",
@@ -154,6 +154,7 @@
     "evals_data_source = []\n",
     "audio_base64 = None\n",
     "\n",
+    "# Will use the first 3 examples for testing\n",
     "for example in dataset[\"train\"].select(range(3)):\n",
     "    audio_val = example[\"audio\"]\n",
     "    try:\n",
@@ -205,7 +206,7 @@
    "outputs": [],
    "source": [
     "client = OpenAI(\n",
-    "    api_key=os.getenv(\"OPENAI_API_KEY\")\n",
+    "    api_key=os.getenv(\"OPENAI_API_KEY_DISTILLATION\")\n",
     ")"
    ]
   },
@@ -300,7 +301,7 @@
     "  \"name\": \"String check grader\",\n",
     "  \"input\": \"{{sample.output_text}}\",\n",
     "  \"reference\": \"{{item.official_answer}}\",\n",
-    "  \"operation\": \"like\"\n",
+    "  \"operation\": \"ilike\"\n",
     "}"
    ]
   },
@@ -349,7 +350,15 @@
     "sampling_messages = [\n",
     "    {\n",
     "        \"role\": \"system\",\n",
-    "        \"content\": \"You are a helpful assistant that can answer questions with the audio input. You will be given an audio input and a question. You will need to answer the question based on the audio input.\"\n",
+    "        \"content\": \"You are a helpful and obedient assistant that can answer questions with audio input. You will be given an audio input containing a question and instructions on exactly how to answer. For example, if the user asks for a single word response, then you should only reply with a single word answer.\"\n",
+    "    },\n",
+    "    {\n",
+    "        \"role\": \"user\",\n",
+    "        \"type\": \"message\",\n",
+    "        \"content\": {\n",
+    "            \"type\": \"input_text\",\n",
+    "            \"text\": \"Answer the following question by replying with a single word answer: 'valid' or 'invalid'.\"\n",
+    "        }\n",
     "    },\n",
     "    {\n",
     "        \"role\": \"user\",\n",
@@ -387,6 +396,9 @@
     "                \"id\": file.id\n",
     "            },\n",
     "            \"model\": \"gpt-4o-audio-preview\", # model used to generate the response; check that the model you use supports audio inputs\n",
+    "            \"sampling_params\": {\n",
+    "                \"temperature\": 0.0,\n",
+    "            },\n",
     "            \"input_messages\": {\n",
     "                \"type\": \"template\", \n",
     "                \"template\": sampling_messages}\n",
@@ -467,7 +479,11 @@
    "source": [
     "## Conclusion\n",
     "\n",
-    "In this cookbook, we covered a workflow for evaluating native audio inputs to a model using the OpenAI Evals API's. We could additionally add model based graders for additional flexibility in grading in future.\n"
+    "In this cookbook, we covered a workflow for evaluating native audio inputs to a model using the OpenAI Evals API. We demonstrated using a simple text grader to grade the text transcript of the audio response.\n",
+    "### Next steps\n",
+    "- Adapt this example to your use case.\n",
+    "- Try using model-based graders for additional flexibility in grading.\n",
+    "- If you have large audio clips, try using the [uploads API](https://platform.openai.com/docs/api-reference/uploads/create), which supports files up to 8 GB.\n"
    ]
   }
  ],
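
Note on the `like` to `ilike` change in the string check grader: the sketch below is a hypothetical local approximation, not the Evals API implementation. It assumes `like` means "the input contains the reference" and `ilike` is the case-insensitive version of the same check. The function name `string_check` is invented for illustration. Under that assumption, a sampled answer of "invalid" would still match a reference of "valid", since one word contains the other, so it may be worth checking how the grader treats that case.

```python
def string_check(sample_text: str, reference: str, operation: str) -> bool:
    """Hypothetical stand-in for a string check grader (assumed semantics)."""
    if operation == "eq":
        return sample_text == reference
    if operation == "ne":
        return sample_text != reference
    if operation == "like":
        # Assumed: case-sensitive substring containment
        return reference in sample_text
    if operation == "ilike":
        # Assumed: case-insensitive substring containment
        return reference.lower() in sample_text.lower()
    raise ValueError(f"Unsupported operation: {operation}")


# A capitalized transcript only matches with the case-insensitive operation.
print(string_check("Valid.", "valid", "like"))
print(string_check("Valid.", "valid", "ilike"))
# Containment caveat: "invalid" contains "valid".
print(string_check("invalid", "valid", "ilike"))
```

If the containment caveat applies to the real grader, an exact-match operation on a normalized transcript would be stricter, at the cost of failing on trailing punctuation.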