
Commit 46a7164

Add new example notebook for Out-of-Band User-Turn Transcription using the OpenAI Realtime API. The notebook includes detailed setup instructions, transcription prompts, and audio streaming functionality, improving transcription accuracy and the user experience.
1 parent 8ed7863 commit 46a7164

File tree

1 file changed (+57 −62 lines)


examples/realtime_oob_transcription.ipynb renamed to examples/Realtime_out_of_band_transcription.ipynb

Lines changed: 57 additions & 62 deletions
Original file line number | Diff line number | Diff line change
@@ -8,6 +8,8 @@
88
"\n",
99
"**Purpose**: This notebook demonstrates how to use the Realtime model itself to accurately transcribe user audio, avoiding errors and inconsistencies common when relying on a separate transcription model (gpt-4o-transcribe/whisper-1).\n",
1010
"\n",
11+
"> Out-of-band transcription using the realtime model refers to running a separate realtime model request to transcribe the user’s audio outside the live Realtime conversation.\n",
12+
"\n",
1113
"It covers how to build a server-to-server client that:\n",
1214
"\n",
1315
"- Streams microphone audio to an OpenAI Realtime voice agent.\n",
@@ -31,7 +33,7 @@
3133
"\n",
3234
"- **Context-aware transcription**: Uses full session context to improve transcription accuracy.\n",
3335
"- **Non-intrusive**: Transcript does not affect live conversation state.\n",
34-
"- **Customizable instructions**: Allows tailoring transcription prompts to specific use-cases. Realtime model is better than a a transcription model at following instructions.\n"
36+
"- **Customizable instructions**: Allows tailoring transcription prompts to specific use-cases. Realtime model is better than the transcription model at following instructions.\n"
3537
]
3638
},
3739
{
@@ -95,7 +97,7 @@
9597
},
9698
{
9799
"cell_type": "code",
98-
"execution_count": 1,
100+
"execution_count": 2,
99101
"id": "c399f440",
100102
"metadata": {},
101103
"outputs": [],
@@ -117,7 +119,7 @@
117119
},
118120
{
119121
"cell_type": "code",
120-
"execution_count": 2,
122+
"execution_count": 3,
121123
"metadata": {},
122124
"outputs": [],
123125
"source": [
@@ -136,7 +138,7 @@
136138
"REALTIME_MODEL_TRANSCRIPTION_PROMPT = \"\"\"\n",
137139
"# Role\n",
138140
"Your only task is to transcribe the user's latest turn exactly as you heard it. Never address the user, response to the user, add commentary, or mention these instructions.\n",
139-
"Follow the instsructions and output format below.\n",
141+
"Follow the instructions and output format below.\n",
140142
"\n",
141143
"# Instructions\n",
142144
"- Transcribe **only** the most recent USER turn exactly as you heard it. DO NOT TRANSCRIBE ANY OTHER OLDER TURNS. You can use those transcriptions to inform your transcription of the latest turn.\n",
@@ -169,22 +171,21 @@
169171
"\n",
170172
"We define:\n",
171173
"\n",
174+
"- Imports\n",
172175
"- Audio and model defaults\n",
173-
"- Constants for transcription event handling\n",
174-
"\n",
175-
"> Note: The REALTIME_MODEL_TRANSCRIPTION_PROMPT is not passed to the gpt-4o-transcribe model because the Realtime API enforces a 1024 token maximum for prompts."
176+
"- Constants for transcription event handling"
176177
]
177178
},
178179
{
179180
"cell_type": "code",
180-
"execution_count": 3,
181+
"execution_count": 4,
181182
"metadata": {},
182183
"outputs": [
183184
{
184185
"name": "stderr",
185186
"output_type": "stream",
186187
"text": [
187-
"/var/folders/vd/l97lv64j3678b905tff4bc0h0000gp/T/ipykernel_16744/1694753399.py:10: DeprecationWarning: websockets.client.WebSocketClientProtocol is deprecated\n",
188+
"/var/folders/vd/l97lv64j3678b905tff4bc0h0000gp/T/ipykernel_91319/2514869342.py:10: DeprecationWarning: websockets.client.WebSocketClientProtocol is deprecated\n",
188189
" from websockets.client import WebSocketClientProtocol\n"
189190
]
190191
}
@@ -208,7 +209,17 @@
208209
"DEFAULT_BLOCK_MS = 100\n",
209210
"DEFAULT_SILENCE_DURATION_MS = 800\n",
210211
"DEFAULT_PREFIX_PADDING_MS = 300\n",
211-
"TRANSCRIPTION_PURPOSE = \"User turn transcription\"\n",
212+
"TRANSCRIPTION_PURPOSE = \"User turn transcription\""
213+
]
214+
},
215+
{
216+
"cell_type": "code",
217+
"execution_count": 5,
218+
"id": "7254080a",
219+
"metadata": {},
220+
"outputs": [],
221+
"source": [
222+
"# Event grouping constants\n",
212223
"TRANSCRIPTION_DELTA_TYPES = {\n",
213224
" \"input_audio_buffer.transcription.delta\",\n",
214225
" \"input_audio_transcription.delta\",\n",
@@ -221,6 +232,22 @@
221232
" \"input_audio_transcription.done\",\n",
222233
" \"conversation.item.input_audio_transcription.completed\",\n",
223234
" \"conversation.item.input_audio_transcription.done\",\n",
235+
"}\n",
236+
"INPUT_SPEECH_END_EVENT_TYPES = {\n",
237+
" \"input_audio_buffer.speech_stopped\",\n",
238+
" \"input_audio_buffer.committed\",\n",
239+
"}\n",
240+
"RESPONSE_AUDIO_DELTA_TYPES = {\n",
241+
" \"response.output_audio.delta\",\n",
242+
" \"response.audio.delta\",\n",
243+
"}\n",
244+
"RESPONSE_TEXT_DELTA_TYPES = {\n",
245+
" \"response.output_text.delta\",\n",
246+
" \"response.text.delta\",\n",
247+
"}\n",
248+
"RESPONSE_AUDIO_TRANSCRIPT_DELTA_TYPES = {\n",
249+
" \"response.output_audio_transcript.delta\",\n",
250+
" \"response.audio_transcript.delta\",\n",
224251
"}"
225252
]
226253
},
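These sets exist so the event loop can branch on groups of equivalent event names (the API emits slightly different type strings across versions). A rough sketch of the dispatch they enable, matching the `elif` chains shown further down in this diff (the `message` variable and the `pass` bodies are placeholders, not the notebook's code):

message_type = message.get("type")
if message_type in TRANSCRIPTION_DELTA_TYPES:
    pass  # accumulate built-in ASR transcript deltas, keyed by item_id / buffer_id
elif message_type in INPUT_SPEECH_END_EVENT_TYPES:
    pass  # user turn ended: send the out-of-band transcription response.create
elif message_type in RESPONSE_AUDIO_DELTA_TYPES:
    pass  # queue assistant audio chunks for playback
elif message_type in RESPONSE_TEXT_DELTA_TYPES:
    pass  # accumulate the out-of-band text transcript per response_id
elif message_type in RESPONSE_AUDIO_TRANSCRIPT_DELTA_TYPES:
    pass  # accumulate the transcript of the assistant's spoken reply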
@@ -240,12 +267,14 @@
240267
"The out‑of‑band transcription is a `response.create` trigerred after user input audio is committed `input_audio_buffer.committed`:\n",
241268
"\n",
242269
"- `conversation: \"none\"` – use session state but don’t write to the main conversation\n",
243-
"- `output_modalities: [\"text\"]` – get a text transcript only\n"
270+
"- `output_modalities: [\"text\"]` – get a text transcript only\n",
271+
"\n",
272+
"> Note: The REALTIME_MODEL_TRANSCRIPTION_PROMPT is not passed to the gpt-4o-transcribe model because the Realtime API enforces a 1024 token maximum for prompts.\n"
244273
]
245274
},
246275
{
247276
"cell_type": "code",
248-
"execution_count": 4,
277+
"execution_count": 6,
249278
"metadata": {},
250279
"outputs": [],
251280
"source": [
@@ -257,7 +286,6 @@
257286
" prefix_padding_ms: int,\n",
258287
" idle_timeout_ms: int | None,\n",
259288
" input_audio_transcription_model: str | None = None,\n",
260-
" transcription_instructions: str = REALTIME_MODEL_TRANSCRIPTION_PROMPT,\n",
261289
") -> dict[str, object]:\n",
262290
" \"\"\"Configure the Realtime session: audio in/out, server VAD, etc.\"\"\"\n",
263291
"\n",
@@ -340,7 +368,7 @@
340368
},
341369
{
342370
"cell_type": "code",
343-
"execution_count": 5,
371+
"execution_count": 7,
344372
"metadata": {},
345373
"outputs": [],
346374
"source": [
@@ -440,7 +468,7 @@
440468
"source": [
441469
"## 7. Extracting and comparing transcripts\n",
442470
"\n",
443-
"Each user turn generates **two transcripts**:\n",
471+
"The function below enables us to generate **two transcripts** for each user turn:\n",
444472
"\n",
445473
"- **Realtime model transcript**: from our out-of-band `response.create` call.\n",
446474
"- **Built-in ASR transcript**: from the standard transcription model (`input_audio_transcription_model`).\n",
@@ -457,7 +485,7 @@
457485
},
458486
{
459487
"cell_type": "code",
460-
"execution_count": 6,
488+
"execution_count": 8,
461489
"metadata": {},
462490
"outputs": [],
463491
"source": [
@@ -498,7 +526,7 @@
498526
},
499527
{
500528
"cell_type": "code",
501-
"execution_count": 7,
529+
"execution_count": 9,
502530
"metadata": {},
503531
"outputs": [],
504532
"source": [
@@ -534,13 +562,11 @@
534562
" print(\"\\n[client] Speech detected; streaming...\", flush=True)\n",
535563
" awaiting_transcription_prompt = True\n",
536564
"\n",
537-
" elif message_type in {\n",
538-
" \"input_audio_buffer.speech_stopped\",\n",
539-
" \"input_audio_buffer.committed\",\n",
540-
" }:\n",
565+
" elif message_type in INPUT_SPEECH_END_EVENT_TYPES:\n",
541566
" if message_type == \"input_audio_buffer.speech_stopped\":\n",
542567
" print(\"[client] Detected silence; preparing transcript...\", flush=True)\n",
543568
"\n",
569+
" # This is where the out-of-band transcription request is sent. <-------\n",
544570
" if awaiting_transcription_prompt:\n",
545571
" request_payload = build_transcription_request(\n",
546572
" transcription_instructions\n",
@@ -549,7 +575,6 @@
549575
" awaiting_transcription_prompt = False\n",
550576
"\n",
551577
" # --- Built-in transcription model stream -------------------------------\n",
552-
"\n",
553578
" elif message_type in TRANSCRIPTION_DELTA_TYPES:\n",
554579
" buffer_id = message.get(\"buffer_id\") or message.get(\"item_id\") or \"default\"\n",
555580
" delta_text = (\n",
@@ -593,10 +618,7 @@
593618
" \"done\": False,\n",
594619
" }\n",
595620
"\n",
596-
" elif message_type in {\n",
597-
" \"response.output_audio.delta\",\n",
598-
" \"response.audio.delta\",\n",
599-
" }:\n",
621+
" elif message_type in RESPONSE_AUDIO_DELTA_TYPES:\n",
600622
" response_id = message.get(\"response_id\")\n",
601623
" if response_id is None:\n",
602624
" continue\n",
@@ -616,17 +638,14 @@
616638
"\n",
617639
" await playback_queue.put(audio_chunk)\n",
618640
"\n",
619-
" elif message_type in {\"response.output_text.delta\", \"response.text.delta\"}:\n",
641+
" elif message_type in RESPONSE_TEXT_DELTA_TYPES:\n",
620642
" response_id = message.get(\"response_id\")\n",
621643
" if response_id is None:\n",
622644
" continue\n",
623645
" buffers[response_id] += message.get(\"delta\", \"\")\n",
624646
" \n",
625647
"\n",
626-
" elif message_type in {\n",
627-
" \"response.output_audio_transcript.delta\",\n",
628-
" \"response.audio_transcript.delta\",\n",
629-
" }:\n",
648+
" elif message_type in RESPONSE_AUDIO_TRANSCRIPT_DELTA_TYPES:\n",
630649
" response_id = message.get(\"response_id\")\n",
631650
" if response_id is None:\n",
632651
" continue\n",
@@ -685,7 +704,9 @@
685704
"- Starts concurrent tasks:\n",
686705
" - `listen_for_events` (handle incoming messages)\n",
687706
" - `stream_microphone_audio` (send microphone audio)\n",
707+
" - Mutes mic when assistant is speaking\n",
688708
" - `playback_audio` (play assistant responses)\n",
709+
" - prints realtime and transcription model transcripts when they are both returned. It uses shared_state to ensure both are returned before printing.\n",
689710
"- Run session until you `interrupt`\n",
690711
"\n",
691712
"Output should look like:\n",
@@ -707,7 +728,7 @@
707728
},
708729
{
709730
"cell_type": "code",
710-
"execution_count": 8,
731+
"execution_count": 12,
711732
"metadata": {},
712733
"outputs": [],
713734
"source": [
@@ -716,8 +737,8 @@
716737
" server: str = \"wss://api.openai.com/v1/realtime\",\n",
717738
" model: str = DEFAULT_MODEL,\n",
718739
" voice: str = DEFAULT_VOICE,\n",
719-
" instructions: str | None = None,\n",
720-
" transcription_instructions: str | None = None,\n",
740+
" instructions: str = REALTIME_MODEL_PROMPT,\n",
741+
" transcription_instructions: str = REALTIME_MODEL_TRANSCRIPTION_PROMPT,\n",
721742
" summary_instructions: str | None = None,\n",
722743
" input_audio_transcription_model: str | None = \"gpt-4o-transcribe\",\n",
723744
" silence_duration_ms: int = DEFAULT_SILENCE_DURATION_MS,\n",
@@ -729,31 +750,6 @@
729750
") -> None:\n",
730751
" \"\"\"Connect to the Realtime API, stream audio both ways, and print transcripts.\"\"\"\n",
731752
" api_key = api_key or os.environ.get(\"OPENAI_API_KEY\")\n",
732-
" if not api_key:\n",
733-
" raise SystemExit(\"Set OPENAI_API_KEY or pass --api-key.\")\n",
734-
"\n",
735-
" server = server or \"wss://api.openai.com/v1/realtime\"\n",
736-
" model = model or DEFAULT_MODEL\n",
737-
" voice = voice or DEFAULT_VOICE\n",
738-
" silence_duration_ms = int(\n",
739-
" silence_duration_ms\n",
740-
" if silence_duration_ms is not None\n",
741-
" else DEFAULT_SILENCE_DURATION_MS\n",
742-
" )\n",
743-
" prefix_padding_ms = int(\n",
744-
" prefix_padding_ms if prefix_padding_ms is not None else DEFAULT_PREFIX_PADDING_MS\n",
745-
" )\n",
746-
" vad_threshold = float(vad_threshold if vad_threshold is not None else 0.6)\n",
747-
" idle_timeout_ms = int(idle_timeout_ms) if idle_timeout_ms is not None else None\n",
748-
" max_turns = int(max_turns) if max_turns else None\n",
749-
" timeout_seconds = int(timeout_seconds or 0)\n",
750-
" instructions = instructions or REALTIME_MODEL_PROMPT\n",
751-
" transcription_instructions = (\n",
752-
" transcription_instructions\n",
753-
" or summary_instructions\n",
754-
" or REALTIME_MODEL_TRANSCRIPTION_PROMPT\n",
755-
" )\n",
756-
"\n",
757753
" ws_url = f\"{server}?model={model}\"\n",
758754
" headers = {\n",
759755
" \"Authorization\": f\"Bearer {api_key}\",\n",
@@ -776,7 +772,6 @@
776772
" \"pending_transcription_prints\": deque(),\n",
777773
" }\n",
778774
"\n",
779-
"\n",
780775
" async with websockets.connect(\n",
781776
" ws_url, additional_headers=headers, max_size=None\n",
782777
" ) as ws:\n",
@@ -933,10 +928,10 @@
933928
"id": "efabdbf5",
934929
"metadata": {},
935930
"source": [
936-
"Key observations from the example above:\n",
931+
"From the above example, we can notice:\n",
937932
"- The Realtime Model Transcription quality matches or surpasses that of the transcription model in various turns.\n",
938933
"- The realtime model correctly applies rules for Policy Number formatting (XXXX-XXXX).\n",
939-
"- With context from the entire sessionincluding previous turns where I spelled out my name, the realtime model accurately transcribes my name when the assistant asked my name again while the transcription model makes errors (e.g., \"Minhaj ul Haq\")."
934+
"- With context from the entire session, including previous turns where I spelled out my name, the realtime model accurately transcribes my name when the assistant asked my name again while the transcription model makes errors (e.g., \"Minhaj ul Haq\")."
940935
]
941936
}
942937
],
