|
4 | 4 | "cell_type": "markdown",
|
5 | 5 | "metadata": {},
|
6 | 6 | "source": [
|
7 |
| - "# 🎙️ Context Summarization with Realtime API\n", |
8 |
| - "## 1 · Overview\n", |
| 7 | + "# Context Summarization with Realtime API\n", |
| 8 | + "## 1. Overview\n", |
9 | 9 | "Build an end‑to‑end **voice bot** that listens to your mic, speaks back in real time and **summarises long conversations** so quality never drops.\n",
|
10 |
| - "### 🏃♂️ What You’ll Build\n", |
| 10 | + "\n", |
| 11 | + "### What You’ll Learn\n", |
11 | 12 | "1. **Live microphone streaming** → OpenAI *Realtime* (voice‑to‑voice) endpoint.\n",
|
12 | 13 | "2. **Instant transcripts & speech playback** on every turn.\n",
|
13 | 14 | "3. **Conversation state container** that stores **every** user/assistant message.\n",
|
14 | 15 | "4. **Automatic “context trim”** – when the token window becomes very large (configurable), older turns are compressed into a summary.\n",
|
15 | 16 | "5. **Extensible design** you can adapt to support customer‑support bots, kiosks, or multilingual assistants.\n",
|
16 | 17 | "\n",
|
17 | 18 | "\n",
|
18 |
| - "### 🎯 Learning Objectives\n", |
19 |
| - "By the end of this notebook you can:\n", |
20 |
| - "\n", |
21 |
| - "| Skill | Why it matters |\n", |
22 |
| - "|-------|----------------|\n", |
23 |
| - "| Capture audio with `sounddevice` | Low‑latency input is critical for natural UX |\n", |
24 |
| - "| Use WebSockets with the OpenAI **Realtime** API | Streams beats polling for speed & simplicity |\n", |
25 |
| - "| Track token usage and detect when to summarize context | Prevents quality loss in long chats |\n", |
26 |
| - "| Summarise & prune history on‑the‑fly | Keeps conversations coherent without manual resets |\n", |
27 |
| - "\n", |
28 |
| - "\n", |
29 |
| - "### 🔧 Prerequisites\n", |
| 19 | + "### Prerequisites\n", |
30 | 20 | "\n",
|
31 | 21 | "| Requirement | Details |\n",
|
32 | 22 | "|-------------|---------|\n",
|
|
43 | 33 | "> 1. GPT-4o-Realtime supports a 128k token context window, though in certain use cases, you may notice performance degrade as you stuff more tokens into the context window.\n",
|
44 | 34 | "> 2. Token window = all tokens (words and audio tokens) the model currently keeps in memory for the session.x\n",
|
45 | 35 | "\n",
|
46 |
| - "### 🚀 One‑liner install (run in a fresh cell)" |
| 36 | + "### One‑liner install (run in a fresh cell)" |
47 | 37 | ]
|
48 | 38 | },
|
49 | 39 | {
|
|
62 | 52 | "metadata": {},
|
63 | 53 | "outputs": [],
|
64 | 54 | "source": [
|
65 |
| - "# Essential imports & constants\n", |
66 |
| - "\n", |
67 | 55 | "# Standard library imports\n",
|
68 | 56 | "import os\n",
|
69 | 57 | "import sys\n",
|
|
100 | 88 | "cell_type": "markdown",
|
101 | 89 | "metadata": {},
|
102 | 90 | "source": [
|
103 |
| - "## 2 · Key Concepts Behind the Realtime Voice API\n", |
104 |
| - "\n", |
105 |
| - "This section gives you the mental model you’ll need before diving into code. Skim it now; refer back whenever something in the notebook feels “magic”.\n", |
106 |
| - "\n", |
107 |
| - "\n", |
108 |
| - "### 2.1 Realtime vs Chat Completions — Why WebSockets?\n", |
109 |
| - "\n", |
110 |
| - "| | **Chat Completions (HTTP)** | **Realtime (WebSocket)** |\n", |
111 |
| - "|---|---|---|\n", |
112 |
| - "| Transport | Stateless request → response | Persistent, bi‑directional socket |\n", |
113 |
| - "| Best for | Plain text or batched jobs | *Live* audio + incremental text |\n", |
114 |
| - "| Latency model | 1 RTT per message | Sub‑200 ms deltas during one open session |\n", |
115 |
| - "| Event types | *None* (single JSON) | `session.*`, `input_audio_buffer.append`, `response.*`, … |\n", |
116 |
| - "\n", |
| 91 | + "## 2. Token Utilisation – Text vs Voice\n", |
117 | 92 | "\n",
|
118 |
| - "**Flow**: you talk ▸ server transcribes ▸ assistant replies ▸ you talk again. \n", |
119 |
| - "> Mirrors natural conversation while keeping event handling simple.\n", |
120 |
| - "\n", |
121 |
| - "\n", |
122 |
| - "### 2.2 Audio Encoding Fundamentals\n", |
123 |
| - "\n", |
124 |
| - "| Parameter | Value | Why it matters |\n", |
125 |
| - "|-----------|-------|----------------|\n", |
126 |
| - "| **Format** | PCM‑16 (signed 16‑bit) | Widely supported; no compression delay |\n", |
127 |
| - "| **Sample rate** | 24 kHz | Required by Realtime endpoint |\n", |
128 |
| - "| **Chunk size** | ≈ 40 ms | Lower chunk → snappier response ↔ higher packet overhead |\n", |
129 |
| - "\n", |
130 |
| - "`chunk_bytes = sample_rate * bytes_per_sample * chunk_duration_s`\n", |
| 93 | + "Large‑token windows are precious, every extra token you use costs latency + money. \n", |
| 94 | + "For **audio** the input token window increases much faster than for plain text because amplitude, timing, and other acoustic details must be represented.\n", |
131 | 95 | "\n",
|
| 96 | + "In practice you’ll often see **≈ 10 ×** more tokens for the *same* sentence in audio versus text.\n", |
132 | 97 | "\n",
|
133 |
| - "### 2.3 Token Context Windows\n", |
134 | 98 | "\n",
|
135 |
| - "* GPT‑4o Realtime accepts **up to 128 K tokens** in theory. \n", |
136 |
| - "* In practice, answer quality starts to drift as you increase **input token size**. \n", |
| 99 | + "* GPT-4o realtime accepts up to **128k tokens** and as the token size increases instruction adherence can drifts.\n", |
137 | 100 | "* Every user/assistant turn consumes tokens → the window **only grows**.\n",
|
138 |
| - "* **Strategy**: Summarise older turns into a single assistant message, keep the last few verbatim turns, and continue.\n", |
139 |
| - "\n", |
140 |
| - "\n", |
141 |
| - "### 2.4 Conversation State\n", |
| 101 | + "* **Strategy**: Summarise older turns into a single assistant message, keep the last few verbatim turns, and continue.\n" |
| 102 | + ] |
| 103 | + }, |
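To make the ≈ 10 × figure concrete, here is a minimal sketch of how you could measure both sides yourself. It assumes the `openai` client configured in the setup cell; the audio figure comes from the `response.done` usage fields the Realtime API returns later in the notebook.

```python
# Sketch: token cost of the same sentence as text vs. speech.
# Assumes `openai.api_key` is already set (see the setup cell).
import openai

TEXT = "Hello there, I am measuring tokens for text versus voice."

# 1) Text: a throwaway Chat Completions call reports usage.prompt_tokens.
chat = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": TEXT}],
    max_tokens=1,
)
print("text prompt tokens:", chat.usage.prompt_tokens)

# 2) Audio: after streaming the spoken version of TEXT to the Realtime API,
#    read the audio input cost from the `response.done` event at
#    response.usage.input_token_details.audio_tokens, which is typically
#    noticeably larger than the text figure above.
```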
| 104 | + { |
| 105 | + "cell_type": "markdown", |
| 106 | + "metadata": {}, |
| 107 | + "source": [ |
| 108 | + "## 3. Helper Functions\n", |
| 109 | + "The following helper functions will enable us to run the full script." |
| 110 | + ] |
| 111 | + }, |
| 112 | + { |
| 113 | + "cell_type": "markdown", |
| 114 | + "metadata": {}, |
| 115 | + "source": [ |
| 116 | + "### 3.1 Conversation State\n", |
142 | 117 | "Unlike HTTP-based Chat Completions, the Realtime API maintains an open, **stateful** session with two key components:\n",
|
143 | 118 | "\n",
|
144 | 119 | "| Component | Purpose |\n",
|
|
199 | 174 | "cell_type": "markdown",
|
200 | 175 | "metadata": {},
|
201 | 176 | "source": [
|
202 |
| - "## 3 · Token Utilisation – Text vs Voice\n", |
203 |
| - "\n", |
204 |
| - "Large‑token windows are precious: every extra token you use costs latency + money. \n", |
205 |
| - "For **audio** the input token window increases much faster than for plain text because amplitude, timing, and other acoustic details must be represented.\n", |
206 |
| - "\n", |
207 |
| - "In practice you’ll often see **≈ 10 ×** more tokens for the *same* sentence in audio versus text.\n", |
208 |
| - "\n", |
209 |
| - "### 3.1 Hands‑on comparison 📊\n", |
210 |
| - "\n", |
211 |
| - "The cells below:\n", |
212 |
| - "\n", |
213 |
| - "1. **Sends `TEXT` to Chat Completions** → reads `prompt_tokens`. \n", |
214 |
| - "2. **Turns the same `TEXT` into speech** with TTS. \n", |
215 |
| - "3. **Feeds the speech back into the Realtime API Transcription endpoint** → reads `audio input tokens`. \n", |
216 |
| - "4. Prints a ratio so you can see the multiplier on *your* hardware / account." |
217 |
| - ] |
218 |
| - }, |
219 |
| - { |
220 |
| - "cell_type": "code", |
221 |
| - "execution_count": 67, |
222 |
| - "metadata": {}, |
223 |
| - "outputs": [ |
224 |
| - { |
225 |
| - "name": "stdout", |
226 |
| - "output_type": "stream", |
227 |
| - "text": [ |
228 |
| - "📄 Text prompt tokens : 42\n", |
229 |
| - "🔊 Audio length (s) : 11.55\n" |
230 |
| - ] |
231 |
| - } |
232 |
| - ], |
233 |
| - "source": [ |
234 |
| - "TEXT = \"Hello there, I am measuring tokens for text versus voice because we want to better compare the number of tokens used when sending a message as text versus when converting it to speech..\"\n", |
235 |
| - "CHAT_MODEL = \"gpt-4o-mini\"\n", |
236 |
| - "STT_MODEL = \"gpt-4o-transcribe\"\n", |
237 |
| - "TTS_MODEL = \"gpt-4o-mini-tts\"\n", |
238 |
| - "RT_MODEL = \"gpt-4o-realtime-preview\" # S2S model\n", |
239 |
| - "VOICE = \"shimmer\"\n", |
240 |
| - "\n", |
241 |
| - "TARGET_SR = 24_000\n", |
242 |
| - "PCM_SCALE = 32_767\n", |
243 |
| - "CHUNK_MS = 120 # stream step\n", |
244 |
| - "\n", |
245 |
| - "\n", |
246 |
| - "HEADERS = {\n", |
247 |
| - " \"Authorization\": f\"Bearer {openai.api_key}\",\n", |
248 |
| - " \"OpenAI-Beta\": \"realtime=v1\",\n", |
249 |
| - "}\n", |
250 |
| - "\n", |
251 |
| - "show = lambda l, v: print(f\"{l:<28}: {v}\")\n", |
252 |
| - "\n", |
253 |
| - "# ─── Helpers ─────────────────────────────────────────────────────────────\n", |
254 |
| - "def float_to_pcm16(x: np.ndarray) -> bytes:\n", |
255 |
| - " return (np.clip(x, -1, 1) * PCM_SCALE).astype(\"<i2\").tobytes()\n", |
256 |
| - "\n", |
257 |
| - "def chunk_pcm(pcm: bytes, ms: int = CHUNK_MS) -> List[bytes]:\n", |
258 |
| - " step = TARGET_SR * 2 * ms // 1000\n", |
259 |
| - " return [pcm[i:i + step] for i in range(0, len(pcm), step)]\n", |
260 |
| - "\n", |
261 |
| - "# ─── 1 · Count text tokens ──────────────────────────────────────────────\n", |
262 |
| - "chat = openai.chat.completions.create(\n", |
263 |
| - " model=CHAT_MODEL,\n", |
264 |
| - " messages=[{\"role\": \"user\", \"content\": TEXT}],\n", |
265 |
| - " max_tokens=1,\n", |
266 |
| - " temperature=0,\n", |
267 |
| - ")\n", |
268 |
| - "text_tokens = chat.usage.prompt_tokens\n", |
269 |
| - "show(\"📄 Text prompt tokens\", text_tokens)\n", |
270 |
| - "\n", |
271 |
| - "# ─── 2 · Synthesis to WAV & PCM16 ───────────────────────────────────────\n", |
272 |
| - "wav_bytes = openai.audio.speech.create(\n", |
273 |
| - " model=TTS_MODEL, input=TEXT, voice=VOICE, response_format=\"wav\"\n", |
274 |
| - ").content\n", |
275 |
| - "\n", |
276 |
| - "with wave.open(io.BytesIO(wav_bytes)) as w:\n", |
277 |
| - " pcm_bytes = w.readframes(w.getnframes())\n", |
278 |
| - "duration_sec = len(pcm_bytes) / (2 * TARGET_SR)\n", |
279 |
| - "show(\"🔊 Audio length (s)\", f\"{duration_sec:.2f}\")" |
280 |
| - ] |
281 |
| - }, |
282 |
| - { |
283 |
| - "cell_type": "code", |
284 |
| - "execution_count": 73, |
285 |
| - "metadata": {}, |
286 |
| - "outputs": [ |
287 |
| - { |
288 |
| - "name": "stdout", |
289 |
| - "output_type": "stream", |
290 |
| - "text": [ |
291 |
| - "🎤 Audio input tokens : 112\n", |
292 |
| - "⚖️ Audio/Text ratio : 2.7×\n", |
293 |
| - "\n", |
294 |
| - "≈9 audio‑tokens / sec vs ≈1 token / word.\n" |
295 |
| - ] |
296 |
| - } |
297 |
| - ], |
298 |
| - "source": [ |
299 |
| - "# ─── 3 · Realtime streaming & token harvest ─────────────────────────────\n", |
300 |
| - "async def count_audio_tokens(pcm: bytes) -> int:\n", |
301 |
| - " url = f\"wss://api.openai.com/v1/realtime?model={RT_MODEL}\"\n", |
302 |
| - " chunks = chunk_pcm(pcm)\n", |
303 |
| - "\n", |
304 |
| - " async with websockets.connect(url, extra_headers=HEADERS,\n", |
305 |
| - " max_size=1 << 24) as ws:\n", |
306 |
| - "\n", |
307 |
| - " # Wait for session.created\n", |
308 |
| - " while json.loads(await ws.recv())[\"type\"] != \"session.created\":\n", |
309 |
| - " pass\n", |
310 |
| - "\n", |
311 |
| - " # Configure modalities + voice\n", |
312 |
| - " await ws.send(json.dumps({\n", |
313 |
| - " \"type\": \"session.update\",\n", |
314 |
| - " \"session\": {\n", |
315 |
| - " \"modalities\": [\"audio\", \"text\"],\n", |
316 |
| - " \"voice\": VOICE,\n", |
317 |
| - " \"input_audio_format\": \"pcm16\",\n", |
318 |
| - " \"output_audio_format\": \"pcm16\",\n", |
319 |
| - " \"input_audio_transcription\": {\"model\": STT_MODEL},\n", |
320 |
| - " }\n", |
321 |
| - " }))\n", |
322 |
| - "\n", |
323 |
| - " # Stream user audio chunks (no manual commit; server VAD handles it)\n", |
324 |
| - " for c in chunks:\n", |
325 |
| - " await ws.send(json.dumps({\n", |
326 |
| - " \"type\": \"input_audio_buffer.append\",\n", |
327 |
| - " \"audio\": base64.b64encode(c).decode(),\n", |
328 |
| - " }))\n", |
329 |
| - "\n", |
330 |
| - " async for raw in ws:\n", |
331 |
| - " ev = json.loads(raw)\n", |
332 |
| - " t = ev.get(\"type\")\n", |
333 |
| - "\n", |
334 |
| - " if t == \"response.done\":\n", |
335 |
| - " return ev[\"response\"][\"usage\"]\\\n", |
336 |
| - " [\"input_token_details\"][\"audio_tokens\"]\n", |
337 |
| - "\n", |
338 |
| - "audio_tokens = await count_audio_tokens(pcm_bytes)\n", |
339 |
| - "show(\"🎤 Audio input tokens\", audio_tokens)\n", |
| 177 | + "### 3.2 · Streaming Audio\n", |
| 178 | + "We’ll stream raw PCM‑16 microphone data straight into the Realtime API.\n", |
340 | 179 | "\n",
|
341 |
| - "# ─── 4 · Comparison ─────────────────────────────────────────────────────\n", |
342 |
| - "ratio = audio_tokens / text_tokens if text_tokens else float(\"inf\")\n", |
343 |
| - "show(\"⚖️ Audio/Text ratio\", f\"{ratio:.1f}×\")\n", |
344 |
| - "print(f\"\\n≈{int(audio_tokens/duration_sec)} audio‑tokens / sec vs ≈1 token / word.\")" |
345 |
| - ] |
346 |
| - }, |
347 |
| - { |
348 |
| - "cell_type": "markdown", |
349 |
| - "metadata": {}, |
350 |
| - "source": [ |
351 |
| - "This toy example uses a short input, but as transcripts get longer, the difference between text token count and voice token count grows substantially." |
| 180 | + "The pipeline is: mic ─► async.Queue ─► WebSocket ─► Realtime API" |
352 | 181 | ]
|
353 | 182 | },
|
354 | 183 | {
|
355 | 184 | "cell_type": "markdown",
|
356 | 185 | "metadata": {},
|
357 | 186 | "source": [
|
358 |
| - "## 4 · Streaming Audio\n", |
359 |
| - "We’ll stream raw PCM‑16 microphone data straight into the Realtime API.\n", |
360 |
| - "\n", |
361 |
| - "The pipeline is: mic ─► async.Queue ─► WebSocket ─► Realtime API\n", |
362 |
| - "\n", |
363 |
| - "### 4.1 Capture Microphone Input\n", |
| 187 | + "#### 3.2.1 Capture Microphone Input\n", |
364 | 188 | "We’ll start with a coroutine that:\n",
|
365 | 189 | "\n",
|
366 | 190 | "* Opens the default mic at **24 kHz, mono, PCM‑16** (one of the [format](https://platform.openai.com/docs/api-reference/realtime-sessions/create#realtime-sessions-create-input_audio_format) Realtime accepts). \n",
|
367 | 191 | "* Slices the stream into **≈ 40 ms** blocks. \n",
|
368 |
| - "* Dumps each block into an `asyncio.Queue` so another task (next section) can forward it to OpenAI.\n" |
| 192 | + "* Dumps each block into an `asyncio.Queue` so another task (next section) can forward it to OpenAI." |
369 | 193 | ]
|
370 | 194 | },
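A minimal sketch of that coroutine, assuming `sounddevice` for capture (the notebook's own implementation follows in the next code cell; `mic_queue` is an illustrative name):

```python
import asyncio
import sounddevice as sd

SAMPLE_RATE = 24_000                         # required by the Realtime endpoint
CHUNK_MS = 40                                # ≈ 40 ms blocks
BLOCKSIZE = SAMPLE_RATE * CHUNK_MS // 1000   # frames per block

async def capture_mic(mic_queue: asyncio.Queue) -> None:
    loop = asyncio.get_running_loop()

    def on_audio(indata, frames, time_info, status):
        # PortAudio calls this from its own thread: hand the raw PCM-16
        # bytes over to the asyncio world without blocking.
        loop.call_soon_threadsafe(mic_queue.put_nowait, bytes(indata))

    # 24 kHz, mono, 16-bit PCM: one of the formats Realtime accepts.
    with sd.RawInputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16",
                           blocksize=BLOCKSIZE, callback=on_audio):
        while True:                          # run until the task is cancelled
            await asyncio.sleep(1)
```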
|
371 | 195 | {
|
|
414 | 238 | "cell_type": "markdown",
|
415 | 239 | "metadata": {},
|
416 | 240 | "source": [
|
417 |
| - "### 4.2 Send Audio Chunks to the API\n", |
| 241 | + "#### 3.2.2 Send Audio Chunks to the API\n", |
418 | 242 | "\n",
|
419 | 243 | "Our mic task is now filling an `asyncio.Queue` with raw PCM‑16 blocks. \n",
|
420 | 244 | "Next step: pull chunks off that queue, **base‑64 encode** them (the protocol requires JSON‑safe text), and ship each block to the Realtime WebSocket as an `input_audio_buffer.append` event."
|
|
444 | 268 | "cell_type": "markdown",
|
445 | 269 | "metadata": {},
|
446 | 270 | "source": [
|
447 |
| - "### 4.3 Handle Incoming Events \n", |
| 271 | + "#### 3.2.3 Handle Incoming Events \n", |
448 | 272 | "Once audio reaches the server, the Realtime API pushes a stream of JSON events back over the **same** WebSocket. \n",
|
449 | 273 | "Understanding these events is critical for:\n",
|
450 | 274 | "\n",
|
451 | 275 | "* Printing live transcripts \n",
|
452 | 276 | "* Playing incremental audio back to the user \n",
|
453 |
| - "* Keeping an accurate [`ConversationState`](https://platform.openai.com/docs/api-reference/realtime-server-events/conversation/created) so context trimming works later \n", |
| 277 | + "* Keeping an accurate [`Conversation State`](https://platform.openai.com/docs/api-reference/realtime-server-events/conversation/created) so context trimming works later \n", |
454 | 278 | "\n",
|
455 | 279 | "| Event type | Typical timing | What you should do with it |\n",
|
456 | 280 | "|------------|----------------|----------------------------|\n",
|
|
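As a rough sketch of the receive loop (the exact event set depends on your session configuration; `play_audio` and `state` stand in for the notebook's own playback helper and conversation-state object):

```python
import base64
import json

async def handle_events(ws, state):
    async for raw in ws:
        event = json.loads(raw)
        etype = event.get("type")
        if etype == "response.audio.delta":
            # Incremental assistant speech: decode and hand to the
            # playback helper (placeholder name here).
            play_audio(base64.b64decode(event["delta"]))
        elif etype == "response.audio_transcript.delta":
            # Live transcript of the assistant's reply.
            print(event["delta"], end="", flush=True)
        elif etype == "response.done":
            # Token accounting used later to decide when to summarise.
            state.latest_tokens = event["response"]["usage"]["total_tokens"]
```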
466 | 290 | "cell_type": "markdown",
|
467 | 291 | "metadata": {},
|
468 | 292 | "source": [
|
469 |
| - "## 5 · Dynamic Context Management & Summarisation\n", |
470 |
| - "\n", |
| 293 | + "### 3.3 Detect When to Summarise\n", |
471 | 294 | "The Realtime model keeps a **large 128 k‑token window**, but quality can drift long before that as you stuff more context into the model.\n",
|
| 295 | + "\n", |
472 | 296 | "Our goal: **auto‑summarise** once the running window nears a safe threshold (default **2 000 tokens** for the notebook), then prune the superseded turns both locally *and* server‑side.\n",
|
473 | 297 | "\n",
|
474 |
| - "### 5.1 Detect When to Summarise\n", |
475 | 298 | "We monitor latest_tokens returned in response.done. When it exceeds SUMMARY_TRIGGER and we have more than KEEP_LAST_TURNS, we spin up a background summarisation coroutine.\n",
|
476 | 299 | "\n",
|
477 |
| - "### 5.2 Generate & Insert a Summary\n", |
478 | 300 | "We compress everything except the last 2 turns into a single French paragraph, then:\n",
|
479 | 301 | "\n",
|
480 | 302 | "1. Insert that paragraph as a new assistant message at the top of the conversation.\n",
|
|
603 | 425 | "cell_type": "markdown",
|
604 | 426 | "metadata": {},
|
605 | 427 | "source": [
|
606 |
| - "## 6 · End‑to‑End Workflow Demonstration\n", |
| 428 | + "## 4. End‑to‑End Workflow Demonstration\n", |
607 | 429 | "\n",
|
608 | 430 | "Run the two cells below to launch an interactive session. Interrupt the cell stop recording.\n",
|
609 | 431 | "\n",
|
|
835 | 657 | "cell_type": "markdown",
|
836 | 658 | "metadata": {},
|
837 | 659 | "source": [
|
838 |
| - "## 7 · Real‑World Applications\n", |
| 660 | + "## 5 · Real‑World Applications\n", |
839 | 661 | "\n",
|
840 | 662 | "Context summarisation can be useful for **long‑running voice experiences**. \n",
|
841 | 663 | "Here are a use case ideas:\n",
|
|
852 | 674 | "cell_type": "markdown",
|
853 | 675 | "metadata": {},
|
854 | 676 | "source": [
|
855 |
| - "## 8 · Next Steps & Further Reading\n", |
| 677 | + "## 6 · Next Steps & Further Reading\n", |
856 | 678 | "Try out the notebook and try integrating context summary into your application.\n",
|
857 | 679 | "\n",
|
858 | 680 | "Few things you can try:\n",
|
|