Commit 7cfb312

address review comments
1 parent c01cd5e commit 7cfb312

File tree

1 file changed: +5, -7 lines changed


examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb

Lines changed: 5 additions & 7 deletions
@@ -6,11 +6,11 @@
 "source": [
 "# Evals API: Audio Inputs\n",
 "\n",
-"This cookbook demonstrates how to use OpenAI's Evals framework for audio-based tasks. Leveraging the Evals API, we will grade model-generated responses to an audio message and prompt by using **sampling** to generate model responses and **model grading** (LLM as a Judge) to score the model responses against the output audio transcript, prompt, and reference answer.\n",
+"This cookbook demonstrates how to use OpenAI's Evals framework for audio-based tasks. Leveraging the Evals API, we will grade model-generated responses to an audio message and prompt by using **sampling** to generate model responses and **model grading** (LLM as a Judge) to score the model responses against the output audio transcript, prompt, and reference answer. Note that grading is performed on the text transcript of the sampled response; graders that grade audio input directly are not currently supported.\n",
 "\n",
 "In this example, we will evaluate how well our model can:\n",
 "1. **Generate appropriate responses** to user prompts about an audio message\n",
-"3. **Align with reference answers** that represent high-quality responses"
+"2. **Align with reference answers** that represent high-quality responses"
 ]
 },
 {
@@ -284,9 +284,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"For our testing criteria, we set up our grader config. In this example, it is a simple string_check grader that takes in the official answer and sampled model response (in the `sample` namespace), and then outputs a score between 0 and 1 based if the model response matches the reference answer. For more info on graders, visit [API Grader docs](https://platform.openai.com/docs/api-reference/graders). \n",
+"For our testing criteria, we set up our grader config. In this example, it is a simple string_check grader that takes in the official answer and the sampled model response (in the `sample` namespace), and then outputs a score between 0 and 1 based on whether the model response contains the reference answer. The response contains both audio and the text transcript of the audio; we will use the text transcript in the grader. For more info on graders, visit the [API Grader docs](https://platform.openai.com/docs/api-reference/graders).\n",
 "\n",
-"Getting the both the data and the grader right are key for an effective evaluation. So, you will likely want to iteratively refine the prompts for your graders. "
+"Getting both the data and the grader right is key for an effective evaluation. While this example uses a simple string_check grader, a more powerful model grader could be used instead, and you will likely want to iteratively refine the prompts for your graders."
 ]
 },
 {
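Reviewer's note on the hunk above: the string_check testing criterion it describes could be sketched roughly as below. This is a minimal illustration under assumptions, not the notebook's actual code: the eval name, the item field `correct_answer`, and the custom data source schema are hypothetical, and `{{sample.output_text}}` is the text transcript of the sampled response that the grader checks.

```python
# Minimal sketch of the string_check grader described above (not the
# notebook's exact code). The item field `correct_answer` and the eval
# name are hypothetical; adjust to your own data.
from openai import OpenAI

client = OpenAI()

string_check_grader = {
    "type": "string_check",
    "name": "Audio response contains reference answer",
    # The sampled response's text transcript, exposed in the `sample`
    # namespace; the grader never sees the audio itself.
    "input": "{{sample.output_text}}",
    "reference": "{{item.correct_answer}}",
    # `ilike` is a case-insensitive containment check: score 1 when the
    # transcript contains the reference answer, else 0.
    "operation": "ilike",
}

eval_obj = client.evals.create(
    name="audio-inputs-eval-sketch",
    data_source_config={
        "type": "custom",
        "item_schema": {
            "type": "object",
            "properties": {"correct_answer": {"type": "string"}},
            "required": ["correct_answer"],
        },
        # Expose the `sample` namespace so testing criteria can reference
        # the sampled model output.
        "include_sample_schema": True,
    },
    testing_criteria=[string_check_grader],
)
print(eval_obj.id)
```

Swapping the string_check entry for a `score_model` grader would be one way to get the "more powerful model grader" the cell mentions.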
@@ -467,9 +467,7 @@
 "source": [
 "## Conclusion\n",
 "\n",
-"In this cookbook, we covered a workflow for evaluating an audio-based task using the OpenAI Evals API's. By using the audio input functionality, we were able to streamline our evals process for the task. It could also be useful to use the audio transcript as input to a model grader for additional flexibility in grading the response.\n",
-"\n",
-"We're excited to see you extend this to your own audio-based use cases!"
+"In this cookbook, we covered a workflow for evaluating native audio inputs to a model using the OpenAI Evals API. In the future, model-based graders could be added for additional flexibility in grading.\n"
 ]
 }
],
