|
4 | 4 | "cell_type": "markdown", |
5 | 5 | "metadata": {}, |
6 | 6 | "source": [ |
7 | | - "# 🎙️ Context Summarization with Realtime API\n", |
8 | | - "## 1 · Overview\n", |
| 7 | + "# Context Summarization with Realtime API\n", |
| 8 | + "## 1. Overview\n", |
9 | 9 | "Build an end‑to‑end **voice bot** that listens to your mic, speaks back in real time and **summarises long conversations** so quality never drops.\n", |
10 | | - "### 🏃♂️ What You’ll Build\n", |
| 10 | + "\n", |
| 11 | + "### What You’ll Learn\n", |
11 | 12 | "1. **Live microphone streaming** → OpenAI *Realtime* (voice‑to‑voice) endpoint.\n", |
12 | 13 | "2. **Instant transcripts & speech playback** on every turn.\n", |
13 | 14 | "3. **Conversation state container** that stores **every** user/assistant message.\n", |
14 | 15 | "4. **Automatic “context trim”** – when the token window becomes very large (configurable), older turns are compressed into a summary.\n", |
15 | 16 | "5. **Extensible design** you can adapt to support customer‑support bots, kiosks, or multilingual assistants.\n", |
16 | 17 | "\n", |
17 | 18 | "\n", |
18 | | - "### 🎯 Learning Objectives\n", |
19 | | - "By the end of this notebook you can:\n", |
20 | | - "\n", |
21 | | - "| Skill | Why it matters |\n", |
22 | | - "|-------|----------------|\n", |
23 | | - "| Capture audio with `sounddevice` | Low‑latency input is critical for natural UX |\n", |
24 | | - "| Use WebSockets with the OpenAI **Realtime** API | Streams beats polling for speed & simplicity |\n", |
25 | | - "| Track token usage and detect when to summarize context | Prevents quality loss in long chats |\n", |
26 | | - "| Summarise & prune history on‑the‑fly | Keeps conversations coherent without manual resets |\n", |
27 | | - "\n", |
28 | | - "\n", |
29 | | - "### 🔧 Prerequisites\n", |
| 19 | + "### Prerequisites\n", |
30 | 20 | "\n", |
31 | 21 | "| Requirement | Details |\n", |
32 | 22 | "|-------------|---------|\n", |
|
43 | 33 | "> 1. GPT-4o-Realtime supports a 128k token context window, though in certain use cases, you may notice performance degrade as you stuff more tokens into the context window.\n", |
44 | 34 | "> 2. Token window = all tokens (words and audio tokens) the model currently keeps in memory for the session.x\n", |
45 | 35 | "\n", |
46 | | - "### 🚀 One‑liner install (run in a fresh cell)" |
| 36 | + "### One‑liner install (run in a fresh cell)" |
47 | 37 | ] |
48 | 38 | }, |
49 | 39 | { |
|
62 | 52 | "metadata": {}, |
63 | 53 | "outputs": [], |
64 | 54 | "source": [ |
65 | | - "# Essential imports & constants\n", |
66 | | - "\n", |
67 | 55 | "# Standard library imports\n", |
68 | 56 | "import os\n", |
69 | 57 | "import sys\n", |
|
100 | 88 | "cell_type": "markdown", |
101 | 89 | "metadata": {}, |
102 | 90 | "source": [ |
103 | | - "## 2 · Key Concepts Behind the Realtime Voice API\n", |
104 | | - "\n", |
105 | | - "This section gives you the mental model you’ll need before diving into code. Skim it now; refer back whenever something in the notebook feels “magic”.\n", |
106 | | - "\n", |
107 | | - "\n", |
108 | | - "### 2.1 Realtime vs Chat Completions — Why WebSockets?\n", |
109 | | - "\n", |
110 | | - "| | **Chat Completions (HTTP)** | **Realtime (WebSocket)** |\n", |
111 | | - "|---|---|---|\n", |
112 | | - "| Transport | Stateless request → response | Persistent, bi‑directional socket |\n", |
113 | | - "| Best for | Plain text or batched jobs | *Live* audio + incremental text |\n", |
114 | | - "| Latency model | 1 RTT per message | Sub‑200 ms deltas during one open session |\n", |
115 | | - "| Event types | *None* (single JSON) | `session.*`, `input_audio_buffer.append`, `response.*`, … |\n", |
116 | | - "\n", |
| 91 | + "## 2. Token Utilisation – Text vs Voice\n", |
117 | 92 | "\n", |
118 | | - "**Flow**: you talk ▸ server transcribes ▸ assistant replies ▸ you talk again. \n", |
119 | | - "> Mirrors natural conversation while keeping event handling simple.\n", |
120 | | - "\n", |
121 | | - "\n", |
122 | | - "### 2.2 Audio Encoding Fundamentals\n", |
123 | | - "\n", |
124 | | - "| Parameter | Value | Why it matters |\n", |
125 | | - "|-----------|-------|----------------|\n", |
126 | | - "| **Format** | PCM‑16 (signed 16‑bit) | Widely supported; no compression delay |\n", |
127 | | - "| **Sample rate** | 24 kHz | Required by Realtime endpoint |\n", |
128 | | - "| **Chunk size** | ≈ 40 ms | Lower chunk → snappier response ↔ higher packet overhead |\n", |
129 | | - "\n", |
130 | | - "`chunk_bytes = sample_rate * bytes_per_sample * chunk_duration_s`\n", |
| 93 | + "Large‑token windows are precious, every extra token you use costs latency + money. \n", |
| 94 | + "For **audio** the input token window increases much faster than for plain text because amplitude, timing, and other acoustic details must be represented.\n", |
131 | 95 | "\n", |
| 96 | + "In practice you’ll often see **≈ 10 ×** more tokens for the *same* sentence in audio versus text.\n", |
132 | 97 | "\n", |
133 | | - "### 2.3 Token Context Windows\n", |
134 | 98 | "\n", |
135 | | - "* GPT‑4o Realtime accepts **up to 128 K tokens** in theory. \n", |
136 | | - "* In practice, answer quality starts to drift as you increase **input token size**. \n", |
| 99 | + "* GPT-4o realtime accepts up to **128k tokens** and as the token size increases instruction adherence can drifts.\n", |
137 | 100 | "* Every user/assistant turn consumes tokens → the window **only grows**.\n", |
138 | | - "* **Strategy**: Summarise older turns into a single assistant message, keep the last few verbatim turns, and continue.\n", |
139 | | - "\n", |
140 | | - "\n", |
141 | | - "### 2.4 Conversation State\n", |
| 101 | + "* **Strategy**: Summarise older turns into a single assistant message, keep the last few verbatim turns, and continue.\n" |
| 102 | + ] |
| 103 | + }, |
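| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "To see how many text vs audio input tokens a session is actually consuming, read the usage block attached to every `response.done` server event. The snippet below is a minimal sketch, assuming `event` already holds the parsed JSON of such an event (field names follow the Realtime usage payload):\n",
| | + "\n",
| | + "```python\n",
| | + "# Illustrative only: pull input-token usage out of a parsed `response.done` event.\n",
| | + "def input_token_breakdown(event: dict) -> tuple[int, int]:\n",
| | + "    details = event[\"response\"][\"usage\"][\"input_token_details\"]\n",
| | + "    # Default to 0 so the helper also works on text-only turns.\n",
| | + "    return details.get(\"text_tokens\", 0), details.get(\"audio_tokens\", 0)\n",
| | + "```"
| | + ]
| | + },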
| 104 | + { |
| 105 | + "cell_type": "markdown", |
| 106 | + "metadata": {}, |
| 107 | + "source": [ |
| 108 | + "## 3. Helper Functions\n", |
| 109 | + "The following helper functions will enable us to run the full script." |
| 110 | + ] |
| 111 | + }, |
| 112 | + { |
| 113 | + "cell_type": "markdown", |
| 114 | + "metadata": {}, |
| 115 | + "source": [ |
| 116 | + "### 3.1 Conversation State\n", |
142 | 117 | "Unlike HTTP-based Chat Completions, the Realtime API maintains an open, **stateful** session with two key components:\n", |
143 | 118 | "\n", |
144 | 119 | "| Component | Purpose |\n", |
|
199 | 174 | "cell_type": "markdown", |
200 | 175 | "metadata": {}, |
201 | 176 | "source": [ |
202 | | - "## 3 · Token Utilisation – Text vs Voice\n", |
203 | | - "\n", |
204 | | - "Large‑token windows are precious: every extra token you use costs latency + money. \n", |
205 | | - "For **audio** the input token window increases much faster than for plain text because amplitude, timing, and other acoustic details must be represented.\n", |
206 | | - "\n", |
207 | | - "In practice you’ll often see **≈ 10 ×** more tokens for the *same* sentence in audio versus text.\n", |
208 | | - "\n", |
209 | | - "### 3.1 Hands‑on comparison 📊\n", |
210 | | - "\n", |
211 | | - "The cells below:\n", |
212 | | - "\n", |
213 | | - "1. **Sends `TEXT` to Chat Completions** → reads `prompt_tokens`. \n", |
214 | | - "2. **Turns the same `TEXT` into speech** with TTS. \n", |
215 | | - "3. **Feeds the speech back into the Realtime API Transcription endpoint** → reads `audio input tokens`. \n", |
216 | | - "4. Prints a ratio so you can see the multiplier on *your* hardware / account." |
217 | | - ] |
218 | | - }, |
219 | | - { |
220 | | - "cell_type": "code", |
221 | | - "execution_count": 67, |
222 | | - "metadata": {}, |
223 | | - "outputs": [ |
224 | | - { |
225 | | - "name": "stdout", |
226 | | - "output_type": "stream", |
227 | | - "text": [ |
228 | | - "📄 Text prompt tokens : 42\n", |
229 | | - "🔊 Audio length (s) : 11.55\n" |
230 | | - ] |
231 | | - } |
232 | | - ], |
233 | | - "source": [ |
234 | | - "TEXT = \"Hello there, I am measuring tokens for text versus voice because we want to better compare the number of tokens used when sending a message as text versus when converting it to speech..\"\n", |
235 | | - "CHAT_MODEL = \"gpt-4o-mini\"\n", |
236 | | - "STT_MODEL = \"gpt-4o-transcribe\"\n", |
237 | | - "TTS_MODEL = \"gpt-4o-mini-tts\"\n", |
238 | | - "RT_MODEL = \"gpt-4o-realtime-preview\" # S2S model\n", |
239 | | - "VOICE = \"shimmer\"\n", |
240 | | - "\n", |
241 | | - "TARGET_SR = 24_000\n", |
242 | | - "PCM_SCALE = 32_767\n", |
243 | | - "CHUNK_MS = 120 # stream step\n", |
244 | | - "\n", |
245 | | - "\n", |
246 | | - "HEADERS = {\n", |
247 | | - " \"Authorization\": f\"Bearer {openai.api_key}\",\n", |
248 | | - " \"OpenAI-Beta\": \"realtime=v1\",\n", |
249 | | - "}\n", |
250 | | - "\n", |
251 | | - "show = lambda l, v: print(f\"{l:<28}: {v}\")\n", |
252 | | - "\n", |
253 | | - "# ─── Helpers ─────────────────────────────────────────────────────────────\n", |
254 | | - "def float_to_pcm16(x: np.ndarray) -> bytes:\n", |
255 | | - " return (np.clip(x, -1, 1) * PCM_SCALE).astype(\"<i2\").tobytes()\n", |
256 | | - "\n", |
257 | | - "def chunk_pcm(pcm: bytes, ms: int = CHUNK_MS) -> List[bytes]:\n", |
258 | | - " step = TARGET_SR * 2 * ms // 1000\n", |
259 | | - " return [pcm[i:i + step] for i in range(0, len(pcm), step)]\n", |
260 | | - "\n", |
261 | | - "# ─── 1 · Count text tokens ──────────────────────────────────────────────\n", |
262 | | - "chat = openai.chat.completions.create(\n", |
263 | | - " model=CHAT_MODEL,\n", |
264 | | - " messages=[{\"role\": \"user\", \"content\": TEXT}],\n", |
265 | | - " max_tokens=1,\n", |
266 | | - " temperature=0,\n", |
267 | | - ")\n", |
268 | | - "text_tokens = chat.usage.prompt_tokens\n", |
269 | | - "show(\"📄 Text prompt tokens\", text_tokens)\n", |
270 | | - "\n", |
271 | | - "# ─── 2 · Synthesis to WAV & PCM16 ───────────────────────────────────────\n", |
272 | | - "wav_bytes = openai.audio.speech.create(\n", |
273 | | - " model=TTS_MODEL, input=TEXT, voice=VOICE, response_format=\"wav\"\n", |
274 | | - ").content\n", |
275 | | - "\n", |
276 | | - "with wave.open(io.BytesIO(wav_bytes)) as w:\n", |
277 | | - " pcm_bytes = w.readframes(w.getnframes())\n", |
278 | | - "duration_sec = len(pcm_bytes) / (2 * TARGET_SR)\n", |
279 | | - "show(\"🔊 Audio length (s)\", f\"{duration_sec:.2f}\")" |
280 | | - ] |
281 | | - }, |
282 | | - { |
283 | | - "cell_type": "code", |
284 | | - "execution_count": 73, |
285 | | - "metadata": {}, |
286 | | - "outputs": [ |
287 | | - { |
288 | | - "name": "stdout", |
289 | | - "output_type": "stream", |
290 | | - "text": [ |
291 | | - "🎤 Audio input tokens : 112\n", |
292 | | - "⚖️ Audio/Text ratio : 2.7×\n", |
293 | | - "\n", |
294 | | - "≈9 audio‑tokens / sec vs ≈1 token / word.\n" |
295 | | - ] |
296 | | - } |
297 | | - ], |
298 | | - "source": [ |
299 | | - "# ─── 3 · Realtime streaming & token harvest ─────────────────────────────\n", |
300 | | - "async def count_audio_tokens(pcm: bytes) -> int:\n", |
301 | | - " url = f\"wss://api.openai.com/v1/realtime?model={RT_MODEL}\"\n", |
302 | | - " chunks = chunk_pcm(pcm)\n", |
303 | | - "\n", |
304 | | - " async with websockets.connect(url, extra_headers=HEADERS,\n", |
305 | | - " max_size=1 << 24) as ws:\n", |
306 | | - "\n", |
307 | | - " # Wait for session.created\n", |
308 | | - " while json.loads(await ws.recv())[\"type\"] != \"session.created\":\n", |
309 | | - " pass\n", |
310 | | - "\n", |
311 | | - " # Configure modalities + voice\n", |
312 | | - " await ws.send(json.dumps({\n", |
313 | | - " \"type\": \"session.update\",\n", |
314 | | - " \"session\": {\n", |
315 | | - " \"modalities\": [\"audio\", \"text\"],\n", |
316 | | - " \"voice\": VOICE,\n", |
317 | | - " \"input_audio_format\": \"pcm16\",\n", |
318 | | - " \"output_audio_format\": \"pcm16\",\n", |
319 | | - " \"input_audio_transcription\": {\"model\": STT_MODEL},\n", |
320 | | - " }\n", |
321 | | - " }))\n", |
322 | | - "\n", |
323 | | - " # Stream user audio chunks (no manual commit; server VAD handles it)\n", |
324 | | - " for c in chunks:\n", |
325 | | - " await ws.send(json.dumps({\n", |
326 | | - " \"type\": \"input_audio_buffer.append\",\n", |
327 | | - " \"audio\": base64.b64encode(c).decode(),\n", |
328 | | - " }))\n", |
329 | | - "\n", |
330 | | - " async for raw in ws:\n", |
331 | | - " ev = json.loads(raw)\n", |
332 | | - " t = ev.get(\"type\")\n", |
333 | | - "\n", |
334 | | - " if t == \"response.done\":\n", |
335 | | - " return ev[\"response\"][\"usage\"]\\\n", |
336 | | - " [\"input_token_details\"][\"audio_tokens\"]\n", |
337 | | - "\n", |
338 | | - "audio_tokens = await count_audio_tokens(pcm_bytes)\n", |
339 | | - "show(\"🎤 Audio input tokens\", audio_tokens)\n", |
| 177 | + "### 3.2 · Streaming Audio\n", |
| 178 | + "We’ll stream raw PCM‑16 microphone data straight into the Realtime API.\n", |
340 | 179 | "\n", |
341 | | - "# ─── 4 · Comparison ─────────────────────────────────────────────────────\n", |
342 | | - "ratio = audio_tokens / text_tokens if text_tokens else float(\"inf\")\n", |
343 | | - "show(\"⚖️ Audio/Text ratio\", f\"{ratio:.1f}×\")\n", |
344 | | - "print(f\"\\n≈{int(audio_tokens/duration_sec)} audio‑tokens / sec vs ≈1 token / word.\")" |
345 | | - ] |
346 | | - }, |
347 | | - { |
348 | | - "cell_type": "markdown", |
349 | | - "metadata": {}, |
350 | | - "source": [ |
351 | | - "This toy example uses a short input, but as transcripts get longer, the difference between text token count and voice token count grows substantially." |
| 180 | + "The pipeline is: mic ─► async.Queue ─► WebSocket ─► Realtime API" |
352 | 181 | ] |
353 | 182 | }, |
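| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "Before building the individual helpers, here is that queue wiring as a bare pattern sketch. Every name here (`producer`, `consumer`, `ws`) is a placeholder; the real coroutines are defined in the next sections:\n",
| | + "\n",
| | + "```python\n",
| | + "import asyncio, base64, json\n",
| | + "\n",
| | + "audio_queue: asyncio.Queue = asyncio.Queue()\n",
| | + "\n",
| | + "async def producer(pcm_block: bytes) -> None:\n",
| | + "    # The mic callback drops raw PCM-16 blocks onto the queue.\n",
| | + "    await audio_queue.put(pcm_block)\n",
| | + "\n",
| | + "async def consumer(ws) -> None:\n",
| | + "    # Forward each block as a JSON-safe, base64-encoded append event.\n",
| | + "    while True:\n",
| | + "        block = await audio_queue.get()\n",
| | + "        await ws.send(json.dumps({\n",
| | + "            \"type\": \"input_audio_buffer.append\",\n",
| | + "            \"audio\": base64.b64encode(block).decode(),\n",
| | + "        }))\n",
| | + "```"
| | + ]
| | + },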
354 | 183 | { |
355 | 184 | "cell_type": "markdown", |
356 | 185 | "metadata": {}, |
357 | 186 | "source": [ |
358 | | - "## 4 · Streaming Audio\n", |
359 | | - "We’ll stream raw PCM‑16 microphone data straight into the Realtime API.\n", |
360 | | - "\n", |
361 | | - "The pipeline is: mic ─► async.Queue ─► WebSocket ─► Realtime API\n", |
362 | | - "\n", |
363 | | - "### 4.1 Capture Microphone Input\n", |
| 187 | + "#### 3.2.1 Capture Microphone Input\n", |
364 | 188 | "We’ll start with a coroutine that:\n", |
365 | 189 | "\n", |
366 | 190 | "* Opens the default mic at **24 kHz, mono, PCM‑16** (one of the [format](https://platform.openai.com/docs/api-reference/realtime-sessions/create#realtime-sessions-create-input_audio_format) Realtime accepts). \n", |
367 | 191 | "* Slices the stream into **≈ 40 ms** blocks. \n", |
368 | | - "* Dumps each block into an `asyncio.Queue` so another task (next section) can forward it to OpenAI.\n" |
| 192 | + "* Dumps each block into an `asyncio.Queue` so another task (next section) can forward it to OpenAI." |
369 | 193 | ] |
370 | 194 | }, |
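| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "As a quick sanity check on block size (simple arithmetic, assuming the 24 kHz mono PCM‑16 settings above):\n",
| | + "\n",
| | + "```python\n",
| | + "SAMPLE_RATE = 24_000      # Hz, mono\n",
| | + "BYTES_PER_SAMPLE = 2      # PCM-16 = 2 bytes per sample\n",
| | + "CHUNK_MS = 40             # target block duration\n",
| | + "\n",
| | + "frames_per_chunk = SAMPLE_RATE * CHUNK_MS // 1000    # 960 frames\n",
| | + "chunk_bytes = frames_per_chunk * BYTES_PER_SAMPLE    # 1_920 bytes per block\n",
| | + "```"
| | + ]
| | + },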
371 | 195 | { |
|
414 | 238 | "cell_type": "markdown", |
415 | 239 | "metadata": {}, |
416 | 240 | "source": [ |
417 | | - "### 4.2 Send Audio Chunks to the API\n", |
| 241 | + "#### 3.2.2 Send Audio Chunks to the API\n", |
418 | 242 | "\n", |
419 | 243 | "Our mic task is now filling an `asyncio.Queue` with raw PCM‑16 blocks. \n", |
420 | 244 | "Next step: pull chunks off that queue, **base‑64 encode** them (the protocol requires JSON‑safe text), and ship each block to the Realtime WebSocket as an `input_audio_buffer.append` event." |
|
444 | 268 | "cell_type": "markdown", |
445 | 269 | "metadata": {}, |
446 | 270 | "source": [ |
447 | | - "### 4.3 Handle Incoming Events \n", |
| 271 | + "#### 3.2.3 Handle Incoming Events \n", |
448 | 272 | "Once audio reaches the server, the Realtime API pushes a stream of JSON events back over the **same** WebSocket. \n", |
449 | 273 | "Understanding these events is critical for:\n", |
450 | 274 | "\n", |
451 | 275 | "* Printing live transcripts \n", |
452 | 276 | "* Playing incremental audio back to the user \n", |
453 | | - "* Keeping an accurate [`ConversationState`](https://platform.openai.com/docs/api-reference/realtime-server-events/conversation/created) so context trimming works later \n", |
| 277 | + "* Keeping an accurate [`Conversation State`](https://platform.openai.com/docs/api-reference/realtime-server-events/conversation/created) so context trimming works later \n", |
454 | 278 | "\n", |
455 | 279 | "| Event type | Typical timing | What you should do with it |\n", |
456 | 280 | "|------------|----------------|----------------------------|\n", |
|
466 | 290 | "cell_type": "markdown", |
467 | 291 | "metadata": {}, |
468 | 292 | "source": [ |
469 | | - "## 5 · Dynamic Context Management & Summarisation\n", |
470 | | - "\n", |
| 293 | + "### 3.3 Detect When to Summarise\n", |
471 | 294 | "The Realtime model keeps a **large 128 k‑token window**, but quality can drift long before that as you stuff more context into the model.\n", |
| 295 | + "\n", |
472 | 296 | "Our goal: **auto‑summarise** once the running window nears a safe threshold (default **2 000 tokens** for the notebook), then prune the superseded turns both locally *and* server‑side.\n", |
473 | 297 | "\n", |
474 | | - "### 5.1 Detect When to Summarise\n", |
475 | 298 | "We monitor latest_tokens returned in response.done. When it exceeds SUMMARY_TRIGGER and we have more than KEEP_LAST_TURNS, we spin up a background summarisation coroutine.\n", |
476 | 299 | "\n", |
477 | | - "### 5.2 Generate & Insert a Summary\n", |
478 | 300 | "We compress everything except the last 2 turns into a single French paragraph, then:\n", |
479 | 301 | "\n", |
480 | 302 | "1. Insert that paragraph as a new assistant message at the top of the conversation.\n", |
|
603 | 425 | "cell_type": "markdown", |
604 | 426 | "metadata": {}, |
605 | 427 | "source": [ |
606 | | - "## 6 · End‑to‑End Workflow Demonstration\n", |
| 428 | + "## 4. End‑to‑End Workflow Demonstration\n", |
607 | 429 | "\n", |
608 | 430 | "Run the two cells below to launch an interactive session. Interrupt the cell stop recording.\n", |
609 | 431 | "\n", |
|
835 | 657 | "cell_type": "markdown", |
836 | 658 | "metadata": {}, |
837 | 659 | "source": [ |
838 | | - "## 7 · Real‑World Applications\n", |
| 660 | + "## 5 · Real‑World Applications\n", |
839 | 661 | "\n", |
840 | 662 | "Context summarisation can be useful for **long‑running voice experiences**. \n", |
841 | 663 | "Here are a use case ideas:\n", |
|
852 | 674 | "cell_type": "markdown", |
853 | 675 | "metadata": {}, |
854 | 676 | "source": [ |
855 | | - "## 8 · Next Steps & Further Reading\n", |
| 677 | + "## 6 · Next Steps & Further Reading\n", |
856 | 678 | "Try out the notebook and try integrating context summary into your application.\n", |
857 | 679 | "\n", |
858 | 680 | "Few things you can try:\n", |
|