|
8 | 8 | "\n", |
9 | 9 | "**Purpose**: This notebook demonstrates how to use the Realtime model itself to accurately transcribe user audio, avoiding errors and inconsistencies common when relying on a separate transcription model (gpt-4o-transcribe/whisper-1).\n", |
10 | 10 | "\n", |
|  11 | + "> Out-of-band transcription with the Realtime model means running a separate Realtime request to transcribe the user’s audio outside the live conversation.\n",
| 12 | + "\n", |
11 | 13 | "It covers how to build a server-to-server client that:\n", |
12 | 14 | "\n", |
13 | 15 | "- Streams microphone audio to an OpenAI Realtime voice agent.\n", |
|
31 | 33 | "\n", |
32 | 34 | "- **Context-aware transcription**: Uses full session context to improve transcription accuracy.\n", |
33 | 35 | "- **Non-intrusive**: Transcript does not affect live conversation state.\n", |
34 | | - "- **Customizable instructions**: Allows tailoring transcription prompts to specific use-cases. Realtime model is better than a a transcription model at following instructions.\n" |
|  36 | + "- **Customizable instructions**: Allows tailoring transcription prompts to specific use cases. The Realtime model follows such instructions better than a dedicated transcription model.\n"
35 | 37 | ] |
36 | 38 | }, |
37 | 39 | { |
|
95 | 97 | }, |
96 | 98 | { |
97 | 99 | "cell_type": "code", |
98 | | - "execution_count": 1, |
| 100 | + "execution_count": 2, |
99 | 101 | "id": "c399f440", |
100 | 102 | "metadata": {}, |
101 | 103 | "outputs": [], |
|
117 | 119 | }, |
118 | 120 | { |
119 | 121 | "cell_type": "code", |
120 | | - "execution_count": 2, |
| 122 | + "execution_count": 3, |
121 | 123 | "metadata": {}, |
122 | 124 | "outputs": [], |
123 | 125 | "source": [ |
|
136 | 138 | "REALTIME_MODEL_TRANSCRIPTION_PROMPT = \"\"\"\n", |
137 | 139 | "# Role\n", |
138 | 140 | "Your only task is to transcribe the user's latest turn exactly as you heard it. Never address the user, respond to the user, add commentary, or mention these instructions.\n",
139 | | - "Follow the instsructions and output format below.\n", |
| 141 | + "Follow the instructions and output format below.\n", |
140 | 142 | "\n", |
141 | 143 | "# Instructions\n", |
142 | 144 | "- Transcribe **only** the most recent USER turn exactly as you heard it. DO NOT TRANSCRIBE ANY OTHER OLDER TURNS. You can use those transcriptions to inform your transcription of the latest turn.\n", |
|
169 | 171 | "\n", |
170 | 172 | "We define:\n", |
171 | 173 | "\n", |
| 174 | + "- Imports\n", |
172 | 175 | "- Audio and model defaults\n", |
173 | | - "- Constants for transcription event handling\n", |
174 | | - "\n", |
175 | | - "> Note: The REALTIME_MODEL_TRANSCRIPTION_PROMPT is not passed to the gpt-4o-transcribe model because the Realtime API enforces a 1024 token maximum for prompts." |
| 176 | + "- Constants for transcription event handling" |
176 | 177 | ] |
177 | 178 | }, |
178 | 179 | { |
179 | 180 | "cell_type": "code", |
180 | | - "execution_count": 3, |
| 181 | + "execution_count": 4, |
181 | 182 | "metadata": {}, |
182 | 183 | "outputs": [ |
183 | 184 | { |
184 | 185 | "name": "stderr", |
185 | 186 | "output_type": "stream", |
186 | 187 | "text": [ |
187 | | - "/var/folders/vd/l97lv64j3678b905tff4bc0h0000gp/T/ipykernel_16744/1694753399.py:10: DeprecationWarning: websockets.client.WebSocketClientProtocol is deprecated\n", |
| 188 | + "/var/folders/vd/l97lv64j3678b905tff4bc0h0000gp/T/ipykernel_91319/2514869342.py:10: DeprecationWarning: websockets.client.WebSocketClientProtocol is deprecated\n", |
188 | 189 | " from websockets.client import WebSocketClientProtocol\n" |
189 | 190 | ] |
190 | 191 | } |
|
208 | 209 | "DEFAULT_BLOCK_MS = 100\n", |
209 | 210 | "DEFAULT_SILENCE_DURATION_MS = 800\n", |
210 | 211 | "DEFAULT_PREFIX_PADDING_MS = 300\n", |
211 | | - "TRANSCRIPTION_PURPOSE = \"User turn transcription\"\n", |
| 212 | + "TRANSCRIPTION_PURPOSE = \"User turn transcription\"" |
| 213 | + ] |
| 214 | + }, |
| 215 | + { |
| 216 | + "cell_type": "code", |
| 217 | + "execution_count": 5, |
| 218 | + "id": "7254080a", |
| 219 | + "metadata": {}, |
| 220 | + "outputs": [], |
| 221 | + "source": [ |
| 222 | + "# Event grouping constants\n", |
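|  | + "# Each set groups alternative event names (current and older Realtime API variants) for the same logical event.\n",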
212 | 223 | "TRANSCRIPTION_DELTA_TYPES = {\n", |
213 | 224 | " \"input_audio_buffer.transcription.delta\",\n", |
214 | 225 | " \"input_audio_transcription.delta\",\n", |
|
221 | 232 | " \"input_audio_transcription.done\",\n", |
222 | 233 | " \"conversation.item.input_audio_transcription.completed\",\n", |
223 | 234 | " \"conversation.item.input_audio_transcription.done\",\n", |
| 235 | + "}\n", |
| 236 | + "INPUT_SPEECH_END_EVENT_TYPES = {\n", |
| 237 | + " \"input_audio_buffer.speech_stopped\",\n", |
| 238 | + " \"input_audio_buffer.committed\",\n", |
| 239 | + "}\n", |
| 240 | + "RESPONSE_AUDIO_DELTA_TYPES = {\n", |
| 241 | + " \"response.output_audio.delta\",\n", |
| 242 | + " \"response.audio.delta\",\n", |
| 243 | + "}\n", |
| 244 | + "RESPONSE_TEXT_DELTA_TYPES = {\n", |
| 245 | + " \"response.output_text.delta\",\n", |
| 246 | + " \"response.text.delta\",\n", |
| 247 | + "}\n", |
| 248 | + "RESPONSE_AUDIO_TRANSCRIPT_DELTA_TYPES = {\n", |
| 249 | + " \"response.output_audio_transcript.delta\",\n", |
| 250 | + " \"response.audio_transcript.delta\",\n", |
224 | 251 | "}" |
225 | 252 | ] |
226 | 253 | }, |
|
240 | 267 | "The out‑of‑band transcription is a `response.create` request triggered after the user's input audio is committed (`input_audio_buffer.committed`):\n",
241 | 268 | "\n", |
242 | 269 | "- `conversation: \"none\"` – use session state but don’t write to the main conversation\n", |
243 | | - "- `output_modalities: [\"text\"]` – get a text transcript only\n" |
| 270 | + "- `output_modalities: [\"text\"]` – get a text transcript only\n", |
| 271 | + "\n", |
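|  | + "A minimal sketch of what such a request payload can look like (the `metadata` tag below is an assumption used only to label the out-of-band response; the notebook builds the real payload in `build_transcription_request`):\n",
|  | + "\n",
|  | + "```python\n",
|  | + "# Sketch only: out-of-band transcription request sent over the Realtime WebSocket.\n",
|  | + "out_of_band_request = {\n",
|  | + "    \"type\": \"response.create\",\n",
|  | + "    \"response\": {\n",
|  | + "        \"conversation\": \"none\",            # do not write to the live conversation\n",
|  | + "        \"output_modalities\": [\"text\"],     # return a text transcript only\n",
|  | + "        \"instructions\": REALTIME_MODEL_TRANSCRIPTION_PROMPT,\n",
|  | + "        \"metadata\": {\"purpose\": TRANSCRIPTION_PURPOSE},  # assumption: tag for identifying transcription responses\n",
|  | + "    },\n",
|  | + "}\n",
|  | + "```\n",
|  | + "\n",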
| 272 | + "> Note: The REALTIME_MODEL_TRANSCRIPTION_PROMPT is not passed to the gpt-4o-transcribe model because the Realtime API enforces a 1024 token maximum for prompts.\n" |
244 | 273 | ] |
245 | 274 | }, |
246 | 275 | { |
247 | 276 | "cell_type": "code", |
248 | | - "execution_count": 4, |
| 277 | + "execution_count": 6, |
249 | 278 | "metadata": {}, |
250 | 279 | "outputs": [], |
251 | 280 | "source": [ |
|
257 | 286 | " prefix_padding_ms: int,\n", |
258 | 287 | " idle_timeout_ms: int | None,\n", |
259 | 288 | " input_audio_transcription_model: str | None = None,\n", |
260 | | - " transcription_instructions: str = REALTIME_MODEL_TRANSCRIPTION_PROMPT,\n", |
261 | 289 | ") -> dict[str, object]:\n", |
262 | 290 | " \"\"\"Configure the Realtime session: audio in/out, server VAD, etc.\"\"\"\n", |
263 | 291 | "\n", |
|
340 | 368 | }, |
341 | 369 | { |
342 | 370 | "cell_type": "code", |
343 | | - "execution_count": 5, |
| 371 | + "execution_count": 7, |
344 | 372 | "metadata": {}, |
345 | 373 | "outputs": [], |
346 | 374 | "source": [ |
|
440 | 468 | "source": [ |
441 | 469 | "## 7. Extracting and comparing transcripts\n", |
442 | 470 | "\n", |
443 | | - "Each user turn generates **two transcripts**:\n", |
|  471 | + "The function below lets us generate **two transcripts** for each user turn:\n",
444 | 472 | "\n", |
445 | 473 | "- **Realtime model transcript**: from our out-of-band `response.create` call.\n", |
446 | 474 | "- **Built-in ASR transcript**: from the standard transcription model (`input_audio_transcription_model`).\n", |
|
457 | 485 | }, |
458 | 486 | { |
459 | 487 | "cell_type": "code", |
460 | | - "execution_count": 6, |
| 488 | + "execution_count": 8, |
461 | 489 | "metadata": {}, |
462 | 490 | "outputs": [], |
463 | 491 | "source": [ |
|
498 | 526 | }, |
499 | 527 | { |
500 | 528 | "cell_type": "code", |
501 | | - "execution_count": 7, |
| 529 | + "execution_count": 9, |
502 | 530 | "metadata": {}, |
503 | 531 | "outputs": [], |
504 | 532 | "source": [ |
|
534 | 562 | " print(\"\\n[client] Speech detected; streaming...\", flush=True)\n", |
535 | 563 | " awaiting_transcription_prompt = True\n", |
536 | 564 | "\n", |
537 | | - " elif message_type in {\n", |
538 | | - " \"input_audio_buffer.speech_stopped\",\n", |
539 | | - " \"input_audio_buffer.committed\",\n", |
540 | | - " }:\n", |
| 565 | + " elif message_type in INPUT_SPEECH_END_EVENT_TYPES:\n", |
541 | 566 | " if message_type == \"input_audio_buffer.speech_stopped\":\n", |
542 | 567 | " print(\"[client] Detected silence; preparing transcript...\", flush=True)\n", |
543 | 568 | "\n", |
| 569 | + " # This is where the out-of-band transcription request is sent. <-------\n", |
544 | 570 | " if awaiting_transcription_prompt:\n", |
545 | 571 | " request_payload = build_transcription_request(\n", |
546 | 572 | " transcription_instructions\n", |
|
549 | 575 | " awaiting_transcription_prompt = False\n", |
550 | 576 | "\n", |
551 | 577 | " # --- Built-in transcription model stream -------------------------------\n", |
552 | | - "\n", |
553 | 578 | " elif message_type in TRANSCRIPTION_DELTA_TYPES:\n", |
554 | 579 | " buffer_id = message.get(\"buffer_id\") or message.get(\"item_id\") or \"default\"\n", |
555 | 580 | " delta_text = (\n", |
|
593 | 618 | " \"done\": False,\n", |
594 | 619 | " }\n", |
595 | 620 | "\n", |
596 | | - " elif message_type in {\n", |
597 | | - " \"response.output_audio.delta\",\n", |
598 | | - " \"response.audio.delta\",\n", |
599 | | - " }:\n", |
| 621 | + " elif message_type in RESPONSE_AUDIO_DELTA_TYPES:\n", |
600 | 622 | " response_id = message.get(\"response_id\")\n", |
601 | 623 | " if response_id is None:\n", |
602 | 624 | " continue\n", |
|
616 | 638 | "\n", |
617 | 639 | " await playback_queue.put(audio_chunk)\n", |
618 | 640 | "\n", |
619 | | - " elif message_type in {\"response.output_text.delta\", \"response.text.delta\"}:\n", |
| 641 | + " elif message_type in RESPONSE_TEXT_DELTA_TYPES:\n", |
620 | 642 | " response_id = message.get(\"response_id\")\n", |
621 | 643 | " if response_id is None:\n", |
622 | 644 | " continue\n", |
623 | 645 | " buffers[response_id] += message.get(\"delta\", \"\")\n", |
624 | 646 | " \n", |
625 | 647 | "\n", |
626 | | - " elif message_type in {\n", |
627 | | - " \"response.output_audio_transcript.delta\",\n", |
628 | | - " \"response.audio_transcript.delta\",\n", |
629 | | - " }:\n", |
| 648 | + " elif message_type in RESPONSE_AUDIO_TRANSCRIPT_DELTA_TYPES:\n", |
630 | 649 | " response_id = message.get(\"response_id\")\n", |
631 | 650 | " if response_id is None:\n", |
632 | 651 | " continue\n", |
|
685 | 704 | "- Starts concurrent tasks:\n", |
686 | 705 | " - `listen_for_events` (handle incoming messages)\n", |
687 | 706 | " - `stream_microphone_audio` (send microphone audio)\n", |
| 707 | + " - Mutes mic when assistant is speaking\n", |
688 | 708 | " - `playback_audio` (play assistant responses)\n", |
|  709 | + "    - Prints the Realtime and transcription-model transcripts once both have been returned, using `shared_state` to ensure both are available before printing.\n",
689 | 710 | "- Runs the session until you `interrupt`\n",
690 | 711 | "\n", |
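|  | + "Conceptually, `run_voice_agent` runs these helpers concurrently, roughly like the sketch below (the argument lists are illustrative assumptions, not the exact signatures used in the notebook):\n",
|  | + "\n",
|  | + "```python\n",
|  | + "# Illustrative sketch of the task orchestration inside run_voice_agent.\n",
|  | + "# The helper signatures here are assumptions; see the full cell below for the real code.\n",
|  | + "await asyncio.gather(\n",
|  | + "    listen_for_events(ws, shared_state, playback_queue),\n",
|  | + "    stream_microphone_audio(ws, shared_state),\n",
|  | + "    playback_audio(playback_queue),\n",
|  | + ")\n",
|  | + "```\n",
|  | + "\n",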
691 | 712 | "Output should look like:\n", |
|
707 | 728 | }, |
708 | 729 | { |
709 | 730 | "cell_type": "code", |
710 | | - "execution_count": 8, |
| 731 | + "execution_count": 12, |
711 | 732 | "metadata": {}, |
712 | 733 | "outputs": [], |
713 | 734 | "source": [ |
|
716 | 737 | " server: str = \"wss://api.openai.com/v1/realtime\",\n", |
717 | 738 | " model: str = DEFAULT_MODEL,\n", |
718 | 739 | " voice: str = DEFAULT_VOICE,\n", |
719 | | - " instructions: str | None = None,\n", |
720 | | - " transcription_instructions: str | None = None,\n", |
| 740 | + " instructions: str = REALTIME_MODEL_PROMPT,\n", |
| 741 | + " transcription_instructions: str = REALTIME_MODEL_TRANSCRIPTION_PROMPT,\n", |
721 | 742 | " summary_instructions: str | None = None,\n", |
722 | 743 | " input_audio_transcription_model: str | None = \"gpt-4o-transcribe\",\n", |
723 | 744 | " silence_duration_ms: int = DEFAULT_SILENCE_DURATION_MS,\n", |
|
729 | 750 | ") -> None:\n", |
730 | 751 | " \"\"\"Connect to the Realtime API, stream audio both ways, and print transcripts.\"\"\"\n", |
731 | 752 | " api_key = api_key or os.environ.get(\"OPENAI_API_KEY\")\n", |
732 | | - " if not api_key:\n", |
733 | | - " raise SystemExit(\"Set OPENAI_API_KEY or pass --api-key.\")\n", |
734 | | - "\n", |
735 | | - " server = server or \"wss://api.openai.com/v1/realtime\"\n", |
736 | | - " model = model or DEFAULT_MODEL\n", |
737 | | - " voice = voice or DEFAULT_VOICE\n", |
738 | | - " silence_duration_ms = int(\n", |
739 | | - " silence_duration_ms\n", |
740 | | - " if silence_duration_ms is not None\n", |
741 | | - " else DEFAULT_SILENCE_DURATION_MS\n", |
742 | | - " )\n", |
743 | | - " prefix_padding_ms = int(\n", |
744 | | - " prefix_padding_ms if prefix_padding_ms is not None else DEFAULT_PREFIX_PADDING_MS\n", |
745 | | - " )\n", |
746 | | - " vad_threshold = float(vad_threshold if vad_threshold is not None else 0.6)\n", |
747 | | - " idle_timeout_ms = int(idle_timeout_ms) if idle_timeout_ms is not None else None\n", |
748 | | - " max_turns = int(max_turns) if max_turns else None\n", |
749 | | - " timeout_seconds = int(timeout_seconds or 0)\n", |
750 | | - " instructions = instructions or REALTIME_MODEL_PROMPT\n", |
751 | | - " transcription_instructions = (\n", |
752 | | - " transcription_instructions\n", |
753 | | - " or summary_instructions\n", |
754 | | - " or REALTIME_MODEL_TRANSCRIPTION_PROMPT\n", |
755 | | - " )\n", |
756 | | - "\n", |
757 | 753 | " ws_url = f\"{server}?model={model}\"\n", |
758 | 754 | " headers = {\n", |
759 | 755 | " \"Authorization\": f\"Bearer {api_key}\",\n", |
|
776 | 772 | " \"pending_transcription_prints\": deque(),\n", |
777 | 773 | " }\n", |
778 | 774 | "\n", |
779 | | - "\n", |
780 | 775 | " async with websockets.connect(\n", |
781 | 776 | " ws_url, additional_headers=headers, max_size=None\n", |
782 | 777 | " ) as ws:\n", |
|
933 | 928 | "id": "efabdbf5", |
934 | 929 | "metadata": {}, |
935 | 930 | "source": [ |
936 | | - "Key observations from the example above:\n", |
|  931 | + "From the example above, we can see:\n",
937 | 932 | "- The Realtime Model Transcription quality matches or surpasses that of the transcription model in various turns.\n", |
938 | 933 | "- The realtime model correctly applies rules for Policy Number formatting (XXXX-XXXX).\n", |
939 | | - "- With context from the entire session—including previous turns where I spelled out my name, the realtime model accurately transcribes my name when the assistant asked my name again while the transcription model makes errors (e.g., \"Minhaj ul Haq\")." |
|  934 | + "With context from the entire session, including previous turns where I spelled out my name, the realtime model accurately transcribes my name when the assistant asks for it again, while the transcription model makes errors (e.g., \"Minhaj ul Haq\")."
940 | 935 | ] |
941 | 936 | } |
942 | 937 | ], |
|
0 commit comments