|
8 | 8 | "\n", |
9 | 9 | "**Purpose**: This notebook demonstrates how to use the Realtime model itself to accurately transcribe user audio, avoiding errors and inconsistencies common when relying on a separate transcription model (gpt-4o-transcribe/whisper-1).\n", |
10 | 10 | "\n", |
|  11 | + "> Out-of-band transcription with the Realtime model means running a separate Realtime request to transcribe the user’s audio outside the live conversation.\n",
| 12 | + "\n", |
11 | 13 | "It covers how to build a server-to-server client that:\n", |
12 | 14 | "\n", |
13 | 15 | "- Streams microphone audio to an OpenAI Realtime voice agent.\n", |
|
31 | 33 | "\n", |
32 | 34 | "- **Context-aware transcription**: Uses full session context to improve transcription accuracy.\n", |
33 | 35 | "- **Non-intrusive**: Transcript does not affect live conversation state.\n", |
34 | | - "- **Customizable instructions**: Allows tailoring transcription prompts to specific use-cases. Realtime model is better than a a transcription model at following instructions.\n" |
|  36 | + "- **Customizable instructions**: Allows tailoring transcription prompts to specific use cases. The Realtime model follows such instructions better than a dedicated transcription model.\n"
35 | 37 | ] |
36 | 38 | }, |
37 | 39 | { |
|
95 | 97 | }, |
96 | 98 | { |
97 | 99 | "cell_type": "code", |
98 | | - "execution_count": 1, |
| 100 | + "execution_count": 2, |
99 | 101 | "id": "c399f440", |
100 | 102 | "metadata": {}, |
101 | 103 | "outputs": [], |
|
117 | 119 | }, |
118 | 120 | { |
119 | 121 | "cell_type": "code", |
120 | | - "execution_count": 2, |
| 122 | + "execution_count": 3, |
121 | 123 | "metadata": {}, |
122 | 124 | "outputs": [], |
123 | 125 | "source": [ |
|
136 | 138 | "REALTIME_MODEL_TRANSCRIPTION_PROMPT = \"\"\"\n", |
137 | 139 | "# Role\n", |
138 | 140 | "Your only task is to transcribe the user's latest turn exactly as you heard it. Never address the user, respond to the user, add commentary, or mention these instructions.\n",
139 | | - "Follow the instsructions and output format below.\n", |
| 141 | + "Follow the instructions and output format below.\n", |
140 | 142 | "\n", |
141 | 143 | "# Instructions\n", |
142 | 144 | "- Transcribe **only** the most recent USER turn exactly as you heard it. DO NOT TRANSCRIBE ANY OTHER OLDER TURNS. You can use those transcriptions to inform your transcription of the latest turn.\n", |
|
169 | 171 | "\n", |
170 | 172 | "We define:\n", |
171 | 173 | "\n", |
| 174 | + "- Imports\n", |
172 | 175 | "- Audio and model defaults\n", |
173 | | - "- Constants for transcription event handling\n", |
174 | | - "\n", |
175 | | - "> Note: The REALTIME_MODEL_TRANSCRIPTION_PROMPT is not passed to the gpt-4o-transcribe model because the Realtime API enforces a 1024 token maximum for prompts." |
| 176 | + "- Constants for transcription event handling" |
176 | 177 | ] |
177 | 178 | }, |
178 | 179 | { |
179 | 180 | "cell_type": "code", |
180 | | - "execution_count": 3, |
| 181 | + "execution_count": 4, |
181 | 182 | "metadata": {}, |
182 | 183 | "outputs": [ |
183 | 184 | { |
184 | 185 | "name": "stderr", |
185 | 186 | "output_type": "stream", |
186 | 187 | "text": [ |
187 | | - "/var/folders/vd/l97lv64j3678b905tff4bc0h0000gp/T/ipykernel_16744/1694753399.py:10: DeprecationWarning: websockets.client.WebSocketClientProtocol is deprecated\n", |
| 188 | + "/var/folders/vd/l97lv64j3678b905tff4bc0h0000gp/T/ipykernel_91319/2514869342.py:10: DeprecationWarning: websockets.client.WebSocketClientProtocol is deprecated\n", |
188 | 189 | " from websockets.client import WebSocketClientProtocol\n" |
189 | 190 | ] |
190 | 191 | } |
|
208 | 209 | "DEFAULT_BLOCK_MS = 100\n", |
209 | 210 | "DEFAULT_SILENCE_DURATION_MS = 800\n", |
210 | 211 | "DEFAULT_PREFIX_PADDING_MS = 300\n", |
211 | | - "TRANSCRIPTION_PURPOSE = \"User turn transcription\"\n", |
| 212 | + "TRANSCRIPTION_PURPOSE = \"User turn transcription\"" |
| 213 | + ] |
| 214 | + }, |
| 215 | + { |
| 216 | + "cell_type": "code", |
| 217 | + "execution_count": 5, |
| 218 | + "id": "7254080a", |
| 219 | + "metadata": {}, |
| 220 | + "outputs": [], |
| 221 | + "source": [ |
| 222 | + "# Event grouping constants\n", |
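|  | + "# Each set groups alternative event names (current and older Realtime API variants) for the same logical event.\n",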
212 | 223 | "TRANSCRIPTION_DELTA_TYPES = {\n", |
213 | 224 | " \"input_audio_buffer.transcription.delta\",\n", |
214 | 225 | " \"input_audio_transcription.delta\",\n", |
|
221 | 232 | " \"input_audio_transcription.done\",\n", |
222 | 233 | " \"conversation.item.input_audio_transcription.completed\",\n", |
223 | 234 | " \"conversation.item.input_audio_transcription.done\",\n", |
| 235 | + "}\n", |
| 236 | + "INPUT_SPEECH_END_EVENT_TYPES = {\n", |
| 237 | + " \"input_audio_buffer.speech_stopped\",\n", |
| 238 | + " \"input_audio_buffer.committed\",\n", |
| 239 | + "}\n", |
| 240 | + "RESPONSE_AUDIO_DELTA_TYPES = {\n", |
| 241 | + " \"response.output_audio.delta\",\n", |
| 242 | + " \"response.audio.delta\",\n", |
| 243 | + "}\n", |
| 244 | + "RESPONSE_TEXT_DELTA_TYPES = {\n", |
| 245 | + " \"response.output_text.delta\",\n", |
| 246 | + " \"response.text.delta\",\n", |
| 247 | + "}\n", |
| 248 | + "RESPONSE_AUDIO_TRANSCRIPT_DELTA_TYPES = {\n", |
| 249 | + " \"response.output_audio_transcript.delta\",\n", |
| 250 | + " \"response.audio_transcript.delta\",\n", |
224 | 251 | "}" |
225 | 252 | ] |
226 | 253 | }, |
|
240 | 267 | "The out‑of‑band transcription is a `response.create` request triggered after the user's input audio is committed (`input_audio_buffer.committed`):\n",
241 | 268 | "\n", |
242 | 269 | "- `conversation: \"none\"` – use session state but don’t write to the main conversation\n", |
243 | | - "- `output_modalities: [\"text\"]` – get a text transcript only\n" |
| 270 | + "- `output_modalities: [\"text\"]` – get a text transcript only\n", |
| 271 | + "\n", |
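|  | + "A minimal sketch of what such a request payload can look like (the `metadata` tag below is an assumption used only to label the out-of-band response; the notebook builds the real payload in `build_transcription_request`):\n",
|  | + "\n",
|  | + "```python\n",
|  | + "# Sketch only: out-of-band transcription request sent over the Realtime WebSocket.\n",
|  | + "out_of_band_request = {\n",
|  | + "    \"type\": \"response.create\",\n",
|  | + "    \"response\": {\n",
|  | + "        \"conversation\": \"none\",            # do not write to the live conversation\n",
|  | + "        \"output_modalities\": [\"text\"],     # return a text transcript only\n",
|  | + "        \"instructions\": REALTIME_MODEL_TRANSCRIPTION_PROMPT,\n",
|  | + "        \"metadata\": {\"purpose\": TRANSCRIPTION_PURPOSE},  # assumption: tag for identifying transcription responses\n",
|  | + "    },\n",
|  | + "}\n",
|  | + "```\n",
|  | + "\n",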
| 272 | + "> Note: The REALTIME_MODEL_TRANSCRIPTION_PROMPT is not passed to the gpt-4o-transcribe model because the Realtime API enforces a 1024 token maximum for prompts.\n" |
244 | 273 | ] |
245 | 274 | }, |
246 | 275 | { |
247 | 276 | "cell_type": "code", |
248 | | - "execution_count": 4, |
| 277 | + "execution_count": 6, |
249 | 278 | "metadata": {}, |
250 | 279 | "outputs": [], |
251 | 280 | "source": [ |
|
257 | 286 | " prefix_padding_ms: int,\n", |
258 | 287 | " idle_timeout_ms: int | None,\n", |
259 | 288 | " input_audio_transcription_model: str | None = None,\n", |
260 | | - " transcription_instructions: str = REALTIME_MODEL_TRANSCRIPTION_PROMPT,\n", |
261 | 289 | ") -> dict[str, object]:\n", |
262 | 290 | " \"\"\"Configure the Realtime session: audio in/out, server VAD, etc.\"\"\"\n", |
263 | 291 | "\n", |
|
340 | 368 | }, |
341 | 369 | { |
342 | 370 | "cell_type": "code", |
343 | | - "execution_count": 5, |
| 371 | + "execution_count": 7, |
344 | 372 | "metadata": {}, |
345 | 373 | "outputs": [], |
346 | 374 | "source": [ |
|
440 | 468 | "source": [ |
441 | 469 | "## 7. Extracting and comparing transcripts\n", |
442 | 470 | "\n", |
443 | | - "Each user turn generates **two transcripts**:\n", |
|  471 | + "The function below lets us generate **two transcripts** for each user turn:\n",
444 | 472 | "\n", |
445 | 473 | "- **Realtime model transcript**: from our out-of-band `response.create` call.\n", |
446 | 474 | "- **Built-in ASR transcript**: from the standard transcription model (`input_audio_transcription_model`).\n", |
|
457 | 485 | }, |
458 | 486 | { |
459 | 487 | "cell_type": "code", |
460 | | - "execution_count": 6, |
| 488 | + "execution_count": 8, |
461 | 489 | "metadata": {}, |
462 | 490 | "outputs": [], |
463 | 491 | "source": [ |
|
498 | 526 | }, |
499 | 527 | { |
500 | 528 | "cell_type": "code", |
501 | | - "execution_count": 7, |
| 529 | + "execution_count": 9, |
502 | 530 | "metadata": {}, |
503 | 531 | "outputs": [], |
504 | 532 | "source": [ |
|
534 | 562 | " print(\"\\n[client] Speech detected; streaming...\", flush=True)\n", |
535 | 563 | " awaiting_transcription_prompt = True\n", |
536 | 564 | "\n", |
537 | | - " elif message_type in {\n", |
538 | | - " \"input_audio_buffer.speech_stopped\",\n", |
539 | | - " \"input_audio_buffer.committed\",\n", |
540 | | - " }:\n", |
| 565 | + " elif message_type in INPUT_SPEECH_END_EVENT_TYPES:\n", |
541 | 566 | " if message_type == \"input_audio_buffer.speech_stopped\":\n", |
542 | 567 | " print(\"[client] Detected silence; preparing transcript...\", flush=True)\n", |
543 | 568 | "\n", |
| 569 | + " # This is where the out-of-band transcription request is sent. <-------\n", |
544 | 570 | " if awaiting_transcription_prompt:\n", |
545 | 571 | " request_payload = build_transcription_request(\n", |
546 | 572 | " transcription_instructions\n", |
|
549 | 575 | " awaiting_transcription_prompt = False\n", |
550 | 576 | "\n", |
551 | 577 | " # --- Built-in transcription model stream -------------------------------\n", |
552 | | - "\n", |
553 | 578 | " elif message_type in TRANSCRIPTION_DELTA_TYPES:\n", |
554 | 579 | " buffer_id = message.get(\"buffer_id\") or message.get(\"item_id\") or \"default\"\n", |
555 | 580 | " delta_text = (\n", |
|
593 | 618 | " \"done\": False,\n", |
594 | 619 | " }\n", |
595 | 620 | "\n", |
596 | | - " elif message_type in {\n", |
597 | | - " \"response.output_audio.delta\",\n", |
598 | | - " \"response.audio.delta\",\n", |
599 | | - " }:\n", |
| 621 | + " elif message_type in RESPONSE_AUDIO_DELTA_TYPES:\n", |
600 | 622 | " response_id = message.get(\"response_id\")\n", |
601 | 623 | " if response_id is None:\n", |
602 | 624 | " continue\n", |
|
616 | 638 | "\n", |
617 | 639 | " await playback_queue.put(audio_chunk)\n", |
618 | 640 | "\n", |
619 | | - " elif message_type in {\"response.output_text.delta\", \"response.text.delta\"}:\n", |
| 641 | + " elif message_type in RESPONSE_TEXT_DELTA_TYPES:\n", |
620 | 642 | " response_id = message.get(\"response_id\")\n", |
621 | 643 | " if response_id is None:\n", |
622 | 644 | " continue\n", |
623 | 645 | " buffers[response_id] += message.get(\"delta\", \"\")\n", |
624 | 646 | " \n", |
625 | 647 | "\n", |
626 | | - " elif message_type in {\n", |
627 | | - " \"response.output_audio_transcript.delta\",\n", |
628 | | - " \"response.audio_transcript.delta\",\n", |
629 | | - " }:\n", |
| 648 | + " elif message_type in RESPONSE_AUDIO_TRANSCRIPT_DELTA_TYPES:\n", |
630 | 649 | " response_id = message.get(\"response_id\")\n", |
631 | 650 | " if response_id is None:\n", |
632 | 651 | " continue\n", |
|
685 | 704 | "- Starts concurrent tasks:\n", |
686 | 705 | " - `listen_for_events` (handle incoming messages)\n", |
687 | 706 | " - `stream_microphone_audio` (send microphone audio)\n", |
| 707 | + " - Mutes mic when assistant is speaking\n", |
688 | 708 | " - `playback_audio` (play assistant responses)\n", |
|  709 | + "    - Prints the Realtime and transcription-model transcripts once both have been returned, using `shared_state` to ensure both are available before printing.\n",
689 | 710 | "- Runs the session until you `interrupt`\n",
690 | 711 | "\n", |
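|  | + "Conceptually, `run_voice_agent` runs these helpers concurrently, roughly like the sketch below (the argument lists are illustrative assumptions, not the exact signatures used in the notebook):\n",
|  | + "\n",
|  | + "```python\n",
|  | + "# Illustrative sketch of the task orchestration inside run_voice_agent.\n",
|  | + "# The helper signatures here are assumptions; see the full cell below for the real code.\n",
|  | + "await asyncio.gather(\n",
|  | + "    listen_for_events(ws, shared_state, playback_queue),\n",
|  | + "    stream_microphone_audio(ws, shared_state),\n",
|  | + "    playback_audio(playback_queue),\n",
|  | + ")\n",
|  | + "```\n",
|  | + "\n",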
691 | 712 | "Output should look like:\n", |
|
707 | 728 | }, |
708 | 729 | { |
709 | 730 | "cell_type": "code", |
710 | | - "execution_count": 8, |
| 731 | + "execution_count": 12, |
711 | 732 | "metadata": {}, |
712 | 733 | "outputs": [], |
713 | 734 | "source": [ |
|
716 | 737 | " server: str = \"wss://api.openai.com/v1/realtime\",\n", |
717 | 738 | " model: str = DEFAULT_MODEL,\n", |
718 | 739 | " voice: str = DEFAULT_VOICE,\n", |
719 | | - " instructions: str | None = None,\n", |
720 | | - " transcription_instructions: str | None = None,\n", |
| 740 | + " instructions: str = REALTIME_MODEL_PROMPT,\n", |
| 741 | + " transcription_instructions: str = REALTIME_MODEL_TRANSCRIPTION_PROMPT,\n", |
721 | 742 | " summary_instructions: str | None = None,\n", |
722 | 743 | " input_audio_transcription_model: str | None = \"gpt-4o-transcribe\",\n", |
723 | 744 | " silence_duration_ms: int = DEFAULT_SILENCE_DURATION_MS,\n", |
|
729 | 750 | ") -> None:\n", |
730 | 751 | " \"\"\"Connect to the Realtime API, stream audio both ways, and print transcripts.\"\"\"\n", |
731 | 752 | " api_key = api_key or os.environ.get(\"OPENAI_API_KEY\")\n", |
732 | | - " if not api_key:\n", |
733 | | - " raise SystemExit(\"Set OPENAI_API_KEY or pass --api-key.\")\n", |
734 | | - "\n", |
735 | | - " server = server or \"wss://api.openai.com/v1/realtime\"\n", |
736 | | - " model = model or DEFAULT_MODEL\n", |
737 | | - " voice = voice or DEFAULT_VOICE\n", |
738 | | - " silence_duration_ms = int(\n", |
739 | | - " silence_duration_ms\n", |
740 | | - " if silence_duration_ms is not None\n", |
741 | | - " else DEFAULT_SILENCE_DURATION_MS\n", |
742 | | - " )\n", |
743 | | - " prefix_padding_ms = int(\n", |
744 | | - " prefix_padding_ms if prefix_padding_ms is not None else DEFAULT_PREFIX_PADDING_MS\n", |
745 | | - " )\n", |
746 | | - " vad_threshold = float(vad_threshold if vad_threshold is not None else 0.6)\n", |
747 | | - " idle_timeout_ms = int(idle_timeout_ms) if idle_timeout_ms is not None else None\n", |
748 | | - " max_turns = int(max_turns) if max_turns else None\n", |
749 | | - " timeout_seconds = int(timeout_seconds or 0)\n", |
750 | | - " instructions = instructions or REALTIME_MODEL_PROMPT\n", |
751 | | - " transcription_instructions = (\n", |
752 | | - " transcription_instructions\n", |
753 | | - " or summary_instructions\n", |
754 | | - " or REALTIME_MODEL_TRANSCRIPTION_PROMPT\n", |
755 | | - " )\n", |
756 | | - "\n", |
757 | 753 | " ws_url = f\"{server}?model={model}\"\n", |
758 | 754 | " headers = {\n", |
759 | 755 | " \"Authorization\": f\"Bearer {api_key}\",\n", |
|
776 | 772 | " \"pending_transcription_prints\": deque(),\n", |
777 | 773 | " }\n", |
778 | 774 | "\n", |
779 | | - "\n", |
780 | 775 | " async with websockets.connect(\n", |
781 | 776 | " ws_url, additional_headers=headers, max_size=None\n", |
782 | 777 | " ) as ws:\n", |
|
933 | 928 | "id": "efabdbf5", |
934 | 929 | "metadata": {}, |
935 | 930 | "source": [ |
936 | | - "Key observations from the example above:\n", |
|  931 | + "From the example above, we can see:\n",
937 | 932 | "- The Realtime Model Transcription quality matches or surpasses that of the transcription model in various turns.\n", |
938 | 933 | "- The realtime model correctly applies rules for Policy Number formatting (XXXX-XXXX).\n", |
939 | | - "- With context from the entire session—including previous turns where I spelled out my name, the realtime model accurately transcribes my name when the assistant asked my name again while the transcription model makes errors (e.g., \"Minhaj ul Haq\")." |
|  934 | + "With context from the entire session, including previous turns where I spelled out my name, the realtime model accurately transcribes my name when the assistant asks for it again, while the transcription model makes errors (e.g., \"Minhaj ul Haq\")."
940 | 935 | ] |
941 | 936 | } |
942 | 937 | ], |
|
0 commit comments