Commit 8b08e9e

Simplified notebook

1 parent 8e1db5a commit 8b08e9e

examples/Context_summarization_with_realtime_api.ipynb

Lines changed: 40 additions & 218 deletions
@@ -4,29 +4,19 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-  "# 🎙️ Context Summarization with Realtime API\n",
-  "## 1 · Overview\n",
+  "# Context Summarization with Realtime API\n",
+  "## 1. Overview\n",
   "Build an end‑to‑end **voice bot** that listens to your mic, speaks back in real time and **summarises long conversations** so quality never drops.\n",
-  "### 🏃‍♂️ What You’ll Build\n",
+  "\n",
+  "### What You’ll Learn\n",
   "1. **Live microphone streaming** → OpenAI *Realtime* (voice‑to‑voice) endpoint.\n",
   "2. **Instant transcripts & speech playback** on every turn.\n",
   "3. **Conversation state container** that stores **every** user/assistant message.\n",
   "4. **Automatic “context trim”** – when the token window becomes very large (configurable), older turns are compressed into a summary.\n",
   "5. **Extensible design** you can adapt to support customer‑support bots, kiosks, or multilingual assistants.\n",
   "\n",
   "\n",
-  "### 🎯 Learning Objectives\n",
-  "By the end of this notebook you can:\n",
-  "\n",
-  "| Skill | Why it matters |\n",
-  "|-------|----------------|\n",
-  "| Capture audio with `sounddevice` | Low‑latency input is critical for natural UX |\n",
-  "| Use WebSockets with the OpenAI **Realtime** API | Streams beats polling for speed & simplicity |\n",
-  "| Track token usage and detect when to summarize context | Prevents quality loss in long chats |\n",
-  "| Summarise & prune history on‑the‑fly | Keeps conversations coherent without manual resets |\n",
-  "\n",
-  "\n",
-  "### 🔧 Prerequisites\n",
+  "### Prerequisites\n",
   "\n",
   "| Requirement | Details |\n",
   "|-------------|---------|\n",
@@ -43,7 +33,7 @@
   "> 1. GPT-4o-Realtime supports a 128k token context window, though in certain use cases, you may notice performance degrade as you stuff more tokens into the context window.\n",
   "> 2. Token window = all tokens (words and audio tokens) the model currently keeps in memory for the session.\n",
   "\n",
-  "### 🚀 One‑liner install (run in a fresh cell)"
+  "### One‑liner install (run in a fresh cell)"
   ]
  },
  {
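The install cell itself is not shown in this diff; judging by the libraries the notebook imports later (`openai`, `websockets`, `sounddevice`, `numpy`), it is presumably something like:

```python
# Assumed one-liner — the actual install cell is not part of this diff.
%pip install -q openai websockets sounddevice numpy
```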
@@ -62,8 +52,6 @@
   "metadata": {},
   "outputs": [],
   "source": [
-  "# Essential imports & constants\n",
-  "\n",
   "# Standard library imports\n",
   "import os\n",
   "import sys\n",
@@ -100,45 +88,32 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-  "## 2 · Key Concepts Behind the Realtime Voice API\n",
-  "\n",
-  "This section gives you the mental model you’ll need before diving into code. Skim it now; refer back whenever something in the notebook feels “magic”.\n",
-  "\n",
-  "\n",
-  "### 2.1 Realtime vs Chat Completions — Why WebSockets?\n",
-  "\n",
-  "| | **Chat Completions (HTTP)** | **Realtime (WebSocket)** |\n",
-  "|---|---|---|\n",
-  "| Transport | Stateless request → response | Persistent, bi‑directional socket |\n",
-  "| Best for | Plain text or batched jobs | *Live* audio + incremental text |\n",
-  "| Latency model | 1 RTT per message | Sub‑200 ms deltas during one open session |\n",
-  "| Event types | *None* (single JSON) | `session.*`, `input_audio_buffer.append`, `response.*`, … |\n",
-  "\n",
+  "## 2. Token Utilisation – Text vs Voice\n",
   "\n",
-  "**Flow**: you talk ▸ server transcribes ▸ assistant replies ▸ you talk again. \n",
-  "> Mirrors natural conversation while keeping event handling simple.\n",
-  "\n",
-  "\n",
-  "### 2.2 Audio Encoding Fundamentals\n",
-  "\n",
-  "| Parameter | Value | Why it matters |\n",
-  "|-----------|-------|----------------|\n",
-  "| **Format** | PCM‑16 (signed 16‑bit) | Widely supported; no compression delay |\n",
-  "| **Sample rate** | 24 kHz | Required by Realtime endpoint |\n",
-  "| **Chunk size** | ≈ 40 ms | Lower chunk → snappier response ↔ higher packet overhead |\n",
-  "\n",
-  "`chunk_bytes = sample_rate * bytes_per_sample * chunk_duration_s`\n",
+  "Large‑token windows are precious: every extra token you use costs latency + money. \n",
+  "For **audio**, the input token window grows much faster than for plain text because amplitude, timing, and other acoustic details must be represented.\n",
   "\n",
+  "In practice you’ll often see **≈ 10 ×** more tokens for the *same* sentence in audio versus text.\n",
   "\n",
-  "### 2.3 Token Context Windows\n",
   "\n",
-  "* GPT‑4o Realtime accepts **up to 128 K tokens** in theory. \n",
-  "* In practice, answer quality starts to drift as you increase **input token size**. \n",
+  "* GPT‑4o Realtime accepts up to **128k tokens**, and as the token count grows, instruction adherence can drift.\n",
   "* Every user/assistant turn consumes tokens → the window **only grows**.\n",
-  "* **Strategy**: Summarise older turns into a single assistant message, keep the last few verbatim turns, and continue.\n",
-  "\n",
-  "\n",
-  "### 2.4 Conversation State\n",
+  "* **Strategy**: Summarise older turns into a single assistant message, keep the last few verbatim turns, and continue.\n"
+  ]
+ },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+  "## 3. Helper Functions\n",
+  "The following helper functions will enable us to run the full script."
+  ]
+ },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+  "### 3.1 Conversation State\n",
   "Unlike HTTP-based Chat Completions, the Realtime API maintains an open, **stateful** session with two key components:\n",
   "\n",
   "| Component | Purpose |\n",
@@ -199,173 +174,22 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-  "## 3 · Token Utilisation – Text vs Voice\n",
-  "\n",
-  "Large‑token windows are precious: every extra token you use costs latency + money. \n",
-  "For **audio** the input token window increases much faster than for plain text because amplitude, timing, and other acoustic details must be represented.\n",
-  "\n",
-  "In practice you’ll often see **≈ 10 ×** more tokens for the *same* sentence in audio versus text.\n",
-  "\n",
-  "### 3.1 Hands‑on comparison 📊\n",
-  "\n",
-  "The cells below:\n",
-  "\n",
-  "1. **Sends `TEXT` to Chat Completions** → reads `prompt_tokens`. \n",
-  "2. **Turns the same `TEXT` into speech** with TTS. \n",
-  "3. **Feeds the speech back into the Realtime API Transcription endpoint** → reads `audio input tokens`. \n",
-  "4. Prints a ratio so you can see the multiplier on *your* hardware / account."
-  ]
- },
- {
-  "cell_type": "code",
-  "execution_count": 67,
-  "metadata": {},
-  "outputs": [
-  {
-  "name": "stdout",
-  "output_type": "stream",
-  "text": [
-  "📄 Text prompt tokens : 42\n",
-  "🔊 Audio length (s) : 11.55\n"
-  ]
-  }
-  ],
-  "source": [
-  "TEXT = \"Hello there, I am measuring tokens for text versus voice because we want to better compare the number of tokens used when sending a message as text versus when converting it to speech..\"\n",
-  "CHAT_MODEL = \"gpt-4o-mini\"\n",
-  "STT_MODEL = \"gpt-4o-transcribe\"\n",
-  "TTS_MODEL = \"gpt-4o-mini-tts\"\n",
-  "RT_MODEL = \"gpt-4o-realtime-preview\" # S2S model\n",
-  "VOICE = \"shimmer\"\n",
-  "\n",
-  "TARGET_SR = 24_000\n",
-  "PCM_SCALE = 32_767\n",
-  "CHUNK_MS = 120 # stream step\n",
-  "\n",
-  "\n",
-  "HEADERS = {\n",
-  "    \"Authorization\": f\"Bearer {openai.api_key}\",\n",
-  "    \"OpenAI-Beta\": \"realtime=v1\",\n",
-  "}\n",
-  "\n",
-  "show = lambda l, v: print(f\"{l:<28}: {v}\")\n",
-  "\n",
-  "# ─── Helpers ─────────────────────────────────────────────────────────────\n",
-  "def float_to_pcm16(x: np.ndarray) -> bytes:\n",
-  "    return (np.clip(x, -1, 1) * PCM_SCALE).astype(\"<i2\").tobytes()\n",
-  "\n",
-  "def chunk_pcm(pcm: bytes, ms: int = CHUNK_MS) -> List[bytes]:\n",
-  "    step = TARGET_SR * 2 * ms // 1000\n",
-  "    return [pcm[i:i + step] for i in range(0, len(pcm), step)]\n",
-  "\n",
-  "# ─── 1 · Count text tokens ──────────────────────────────────────────────\n",
-  "chat = openai.chat.completions.create(\n",
-  "    model=CHAT_MODEL,\n",
-  "    messages=[{\"role\": \"user\", \"content\": TEXT}],\n",
-  "    max_tokens=1,\n",
-  "    temperature=0,\n",
-  ")\n",
-  "text_tokens = chat.usage.prompt_tokens\n",
-  "show(\"📄 Text prompt tokens\", text_tokens)\n",
-  "\n",
-  "# ─── 2 · Synthesis to WAV & PCM16 ───────────────────────────────────────\n",
-  "wav_bytes = openai.audio.speech.create(\n",
-  "    model=TTS_MODEL, input=TEXT, voice=VOICE, response_format=\"wav\"\n",
-  ").content\n",
-  "\n",
-  "with wave.open(io.BytesIO(wav_bytes)) as w:\n",
-  "    pcm_bytes = w.readframes(w.getnframes())\n",
-  "duration_sec = len(pcm_bytes) / (2 * TARGET_SR)\n",
-  "show(\"🔊 Audio length (s)\", f\"{duration_sec:.2f}\")"
-  ]
- },
- {
-  "cell_type": "code",
-  "execution_count": 73,
-  "metadata": {},
-  "outputs": [
-  {
-  "name": "stdout",
-  "output_type": "stream",
-  "text": [
-  "🎤 Audio input tokens : 112\n",
-  "⚖️ Audio/Text ratio : 2.7×\n",
-  "\n",
-  "≈9 audio‑tokens / sec vs ≈1 token / word.\n"
-  ]
-  }
-  ],
-  "source": [
-  "# ─── 3 · Realtime streaming & token harvest ─────────────────────────────\n",
-  "async def count_audio_tokens(pcm: bytes) -> int:\n",
-  "    url = f\"wss://api.openai.com/v1/realtime?model={RT_MODEL}\"\n",
-  "    chunks = chunk_pcm(pcm)\n",
-  "\n",
-  "    async with websockets.connect(url, extra_headers=HEADERS,\n",
-  "                                  max_size=1 << 24) as ws:\n",
-  "\n",
-  "        # Wait for session.created\n",
-  "        while json.loads(await ws.recv())[\"type\"] != \"session.created\":\n",
-  "            pass\n",
-  "\n",
-  "        # Configure modalities + voice\n",
-  "        await ws.send(json.dumps({\n",
-  "            \"type\": \"session.update\",\n",
-  "            \"session\": {\n",
-  "                \"modalities\": [\"audio\", \"text\"],\n",
-  "                \"voice\": VOICE,\n",
-  "                \"input_audio_format\": \"pcm16\",\n",
-  "                \"output_audio_format\": \"pcm16\",\n",
-  "                \"input_audio_transcription\": {\"model\": STT_MODEL},\n",
-  "            }\n",
-  "        }))\n",
-  "\n",
-  "        # Stream user audio chunks (no manual commit; server VAD handles it)\n",
-  "        for c in chunks:\n",
-  "            await ws.send(json.dumps({\n",
-  "                \"type\": \"input_audio_buffer.append\",\n",
-  "                \"audio\": base64.b64encode(c).decode(),\n",
-  "            }))\n",
-  "\n",
-  "        async for raw in ws:\n",
-  "            ev = json.loads(raw)\n",
-  "            t = ev.get(\"type\")\n",
-  "\n",
-  "            if t == \"response.done\":\n",
-  "                return ev[\"response\"][\"usage\"]\\\n",
-  "                    [\"input_token_details\"][\"audio_tokens\"]\n",
-  "\n",
-  "audio_tokens = await count_audio_tokens(pcm_bytes)\n",
-  "show(\"🎤 Audio input tokens\", audio_tokens)\n",
+  "### 3.2 Streaming Audio\n",
+  "We’ll stream raw PCM‑16 microphone data straight into the Realtime API.\n",
   "\n",
-  "# ─── 4 · Comparison ─────────────────────────────────────────────────────\n",
-  "ratio = audio_tokens / text_tokens if text_tokens else float(\"inf\")\n",
-  "show(\"⚖️ Audio/Text ratio\", f\"{ratio:.1f}×\")\n",
-  "print(f\"\\n≈{int(audio_tokens/duration_sec)} audio‑tokens / sec vs ≈1 token / word.\")"
-  ]
- },
- {
-  "cell_type": "markdown",
-  "metadata": {},
-  "source": [
-  "This toy example uses a short input, but as transcripts get longer, the difference between text token count and voice token count grows substantially."
+  "The pipeline is: mic ─► asyncio.Queue ─► WebSocket ─► Realtime API"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-  "## 4 · Streaming Audio\n",
-  "We’ll stream raw PCM‑16 microphone data straight into the Realtime API.\n",
-  "\n",
-  "The pipeline is: mic ─► async.Queue ─► WebSocket ─► Realtime API\n",
-  "\n",
-  "### 4.1 Capture Microphone Input\n",
+  "#### 3.2.1 Capture Microphone Input\n",
   "We’ll start with a coroutine that:\n",
   "\n",
   "* Opens the default mic at **24 kHz, mono, PCM‑16** (one of the [formats](https://platform.openai.com/docs/api-reference/realtime-sessions/create#realtime-sessions-create-input_audio_format) Realtime accepts). \n",
   "* Slices the stream into **≈ 40 ms** blocks. \n",
-  "* Dumps each block into an `asyncio.Queue` so another task (next section) can forward it to OpenAI.\n"
+  "* Dumps each block into an `asyncio.Queue` so another task (next section) can forward it to OpenAI."
   ]
  },
  {
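A minimal sketch of the capture coroutine that §3.2.1 describes — 24 kHz mono PCM‑16 in ≈40 ms blocks pushed onto an `asyncio.Queue`. Function and constant names are assumptions; the notebook's actual cell (not shown in this hunk) may differ:

```python
import asyncio
import sounddevice as sd

SAMPLE_RATE = 24_000                            # format the Realtime endpoint accepts
CHUNK_MS = 40                                   # ~40 ms per block
BLOCK_FRAMES = SAMPLE_RATE * CHUNK_MS // 1000

async def mic_to_queue(queue: asyncio.Queue) -> None:
    """Capture mono PCM-16 audio and push ~40 ms blocks onto the queue."""
    loop = asyncio.get_running_loop()

    def on_audio(indata, frames, time_info, status):
        # Runs on PortAudio's thread; hand the bytes to the event loop safely.
        loop.call_soon_threadsafe(queue.put_nowait, bytes(indata))

    with sd.RawInputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16",
                           blocksize=BLOCK_FRAMES, callback=on_audio):
        while True:                             # stream stays open until cancelled
            await asyncio.sleep(1)
```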
@@ -414,7 +238,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-  "### 4.2 Send Audio Chunks to the API\n",
+  "#### 3.2.2 Send Audio Chunks to the API\n",
   "\n",
   "Our mic task is now filling an `asyncio.Queue` with raw PCM‑16 blocks. \n",
   "Next step: pull chunks off that queue, **base‑64 encode** them (the protocol requires JSON‑safe text), and ship each block to the Realtime WebSocket as an `input_audio_buffer.append` event."
@@ -444,13 +268,13 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-  "### 4.3 Handle Incoming Events \n",
+  "#### 3.2.3 Handle Incoming Events\n",
   "Once audio reaches the server, the Realtime API pushes a stream of JSON events back over the **same** WebSocket. \n",
   "Understanding these events is critical for:\n",
   "\n",
   "* Printing live transcripts \n",
   "* Playing incremental audio back to the user \n",
-  "* Keeping an accurate [`ConversationState`](https://platform.openai.com/docs/api-reference/realtime-server-events/conversation/created) so context trimming works later \n",
+  "* Keeping an accurate [`Conversation State`](https://platform.openai.com/docs/api-reference/realtime-server-events/conversation/created) so context trimming works later \n",
   "\n",
   "| Event type | Typical timing | What you should do with it |\n",
   "|------------|----------------|----------------------------|\n",
@@ -466,15 +290,13 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-  "## 5 · Dynamic Context Management & Summarisation\n",
-  "\n",
+  "### 3.3 Detect When to Summarise\n",
   "The Realtime model keeps a **large 128 k‑token window**, but quality can drift long before that as you stuff more context into the model.\n",
+  "\n",
   "Our goal: **auto‑summarise** once the running window nears a safe threshold (default **2 000 tokens** for the notebook), then prune the superseded turns both locally *and* server‑side.\n",
   "\n",
-  "### 5.1 Detect When to Summarise\n",
   "We monitor `latest_tokens` returned in `response.done`. When it exceeds `SUMMARY_TRIGGER` and we have more than `KEEP_LAST_TURNS` stored turns, we spin up a background summarisation coroutine.\n",
   "\n",
-  "### 5.2 Generate & Insert a Summary\n",
   "We compress everything except the last 2 turns into a single French paragraph, then:\n",
   "\n",
   "1. Insert that paragraph as a new assistant message at the top of the conversation.\n",
@@ -603,7 +425,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-  "## 6 · End‑to‑End Workflow Demonstration\n",
+  "## 4. End‑to‑End Workflow Demonstration\n",
   "\n",
   "Run the two cells below to launch an interactive session. Interrupt the cell to stop recording.\n",
   "\n",
@@ -835,7 +657,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-  "## 7 · Real‑World Applications\n",
+  "## 5. Real‑World Applications\n",
   "\n",
   "Context summarisation can be useful for **long‑running voice experiences**. \n",
   "Here are a few use‑case ideas:\n",
@@ -852,7 +674,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-  "## 8 · Next Steps & Further Reading\n",
+  "## 6. Next Steps & Further Reading\n",
   "Try out the notebook and consider integrating context summarisation into your application.\n",
   "\n",
   "A few things you can try:\n",
