|
4 | 4 | "cell_type": "markdown", |
5 | 5 | "metadata": {}, |
6 | 6 | "source": [ |
7 | | - "# 🎙️ Context Summarization with Realtime API\n", |
8 | | - "## 1 · Overview\n", |
| 7 | + "# Context Summarization with Realtime API\n", |
| 8 | + "## 1. Overview\n", |
9 | 9 | "Build an end‑to‑end **voice bot** that listens to your mic, speaks back in real time and **summarises long conversations** so quality never drops.\n", |
10 | | - "### 🏃♂️ What You’ll Build\n", |
| 10 | + "\n", |
| 11 | + "### What You’ll Learn\n", |
11 | 12 | "1. **Live microphone streaming** → OpenAI *Realtime* (voice‑to‑voice) endpoint.\n", |
12 | 13 | "2. **Instant transcripts & speech playback** on every turn.\n", |
13 | 14 | "3. **Conversation state container** that stores **every** user/assistant message.\n", |
14 | 15 | "4. **Automatic “context trim”** – when the token window becomes very large (configurable), older turns are compressed into a summary.\n", |
15 | 16 | "5. **Extensible design** you can adapt to support customer‑support bots, kiosks, or multilingual assistants.\n", |
16 | 17 | "\n", |
17 | 18 | "\n", |
18 | | - "### 🎯 Learning Objectives\n", |
19 | | - "By the end of this notebook you can:\n", |
20 | | - "\n", |
21 | | - "| Skill | Why it matters |\n", |
22 | | - "|-------|----------------|\n", |
23 | | - "| Capture audio with `sounddevice` | Low‑latency input is critical for natural UX |\n", |
24 | | - "| Use WebSockets with the OpenAI **Realtime** API | Streams beats polling for speed & simplicity |\n", |
25 | | - "| Track token usage and detect when to summarize context | Prevents quality loss in long chats |\n", |
26 | | - "| Summarise & prune history on‑the‑fly | Keeps conversations coherent without manual resets |\n", |
27 | | - "\n", |
28 | | - "\n", |
29 | | - "### 🔧 Prerequisites\n", |
| 19 | + "### Prerequisites\n", |
30 | 20 | "\n", |
31 | 21 | "| Requirement | Details |\n", |
32 | 22 | "|-------------|---------|\n", |
|
43 | 33 | "> 1. GPT-4o-Realtime supports a 128k token context window, though in certain use cases, you may notice performance degrade as you stuff more tokens into the context window.\n", |
44 | 34 | "> 2. Token window = all tokens (words and audio tokens) the model currently keeps in memory for the session.x\n", |
45 | 35 | "\n", |
46 | | - "### 🚀 One‑liner install (run in a fresh cell)" |
| 36 | + "### One‑liner install (run in a fresh cell)" |
47 | 37 | ] |
48 | 38 | }, |
49 | 39 | { |
|
62 | 52 | "metadata": {}, |
63 | 53 | "outputs": [], |
64 | 54 | "source": [ |
65 | | - "# Essential imports & constants\n", |
66 | | - "\n", |
67 | 55 | "# Standard library imports\n", |
68 | 56 | "import os\n", |
69 | 57 | "import sys\n", |
|
100 | 88 | "cell_type": "markdown", |
101 | 89 | "metadata": {}, |
102 | 90 | "source": [ |
103 | | - "## 2 · Key Concepts Behind the Realtime Voice API\n", |
104 | | - "\n", |
105 | | - "This section gives you the mental model you’ll need before diving into code. Skim it now; refer back whenever something in the notebook feels “magic”.\n", |
106 | | - "\n", |
107 | | - "\n", |
108 | | - "### 2.1 Realtime vs Chat Completions — Why WebSockets?\n", |
109 | | - "\n", |
110 | | - "| | **Chat Completions (HTTP)** | **Realtime (WebSocket)** |\n", |
111 | | - "|---|---|---|\n", |
112 | | - "| Transport | Stateless request → response | Persistent, bi‑directional socket |\n", |
113 | | - "| Best for | Plain text or batched jobs | *Live* audio + incremental text |\n", |
114 | | - "| Latency model | 1 RTT per message | Sub‑200 ms deltas during one open session |\n", |
115 | | - "| Event types | *None* (single JSON) | `session.*`, `input_audio_buffer.append`, `response.*`, … |\n", |
116 | | - "\n", |
| 91 | + "## 2. Token Utilisation – Text vs Voice\n", |
117 | 92 | "\n", |
118 | | - "**Flow**: you talk ▸ server transcribes ▸ assistant replies ▸ you talk again. \n", |
119 | | - "> Mirrors natural conversation while keeping event handling simple.\n", |
120 | | - "\n", |
121 | | - "\n", |
122 | | - "### 2.2 Audio Encoding Fundamentals\n", |
123 | | - "\n", |
124 | | - "| Parameter | Value | Why it matters |\n", |
125 | | - "|-----------|-------|----------------|\n", |
126 | | - "| **Format** | PCM‑16 (signed 16‑bit) | Widely supported; no compression delay |\n", |
127 | | - "| **Sample rate** | 24 kHz | Required by Realtime endpoint |\n", |
128 | | - "| **Chunk size** | ≈ 40 ms | Lower chunk → snappier response ↔ higher packet overhead |\n", |
129 | | - "\n", |
130 | | - "`chunk_bytes = sample_rate * bytes_per_sample * chunk_duration_s`\n", |
| 93 | + "Large‑token windows are precious, every extra token you use costs latency + money. \n", |
| 94 | + "For **audio** the input token window increases much faster than for plain text because amplitude, timing, and other acoustic details must be represented.\n", |
131 | 95 | "\n", |
| 96 | + "In practice you’ll often see **≈ 10 ×** more tokens for the *same* sentence in audio versus text.\n", |
132 | 97 | "\n", |
133 | | - "### 2.3 Token Context Windows\n", |
134 | 98 | "\n", |
135 | | - "* GPT‑4o Realtime accepts **up to 128 K tokens** in theory. \n", |
136 | | - "* In practice, answer quality starts to drift as you increase **input token size**. \n", |
| 99 | + "* GPT-4o realtime accepts up to **128k tokens** and as the token size increases instruction adherence can drifts.\n", |
137 | 100 | "* Every user/assistant turn consumes tokens → the window **only grows**.\n", |
138 | | - "* **Strategy**: Summarise older turns into a single assistant message, keep the last few verbatim turns, and continue.\n", |
139 | | - "\n", |
140 | | - "\n", |
141 | | - "### 2.4 Conversation State\n", |
| 101 | + "* **Strategy**: Summarise older turns into a single assistant message, keep the last few verbatim turns, and continue.\n" |
| 102 | + ] |
| 103 | + }, |
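| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "To see how many text vs audio input tokens a session is actually consuming, read the usage block attached to every `response.done` server event. The snippet below is a minimal sketch, assuming `event` already holds the parsed JSON of such an event (field names follow the Realtime usage payload):\n",
| | + "\n",
| | + "```python\n",
| | + "# Illustrative only: pull input-token usage out of a parsed `response.done` event.\n",
| | + "def input_token_breakdown(event: dict) -> tuple[int, int]:\n",
| | + "    details = event[\"response\"][\"usage\"][\"input_token_details\"]\n",
| | + "    # Default to 0 so the helper also works on text-only turns.\n",
| | + "    return details.get(\"text_tokens\", 0), details.get(\"audio_tokens\", 0)\n",
| | + "```"
| | + ]
| | + },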
| 104 | + { |
| 105 | + "cell_type": "markdown", |
| 106 | + "metadata": {}, |
| 107 | + "source": [ |
| 108 | + "## 3. Helper Functions\n", |
| 109 | + "The following helper functions will enable us to run the full script." |
| 110 | + ] |
| 111 | + }, |
| 112 | + { |
| 113 | + "cell_type": "markdown", |
| 114 | + "metadata": {}, |
| 115 | + "source": [ |
| 116 | + "### 3.1 Conversation State\n", |
142 | 117 | "Unlike HTTP-based Chat Completions, the Realtime API maintains an open, **stateful** session with two key components:\n", |
143 | 118 | "\n", |
144 | 119 | "| Component | Purpose |\n", |
|
199 | 174 | "cell_type": "markdown", |
200 | 175 | "metadata": {}, |
201 | 176 | "source": [ |
202 | | - "## 3 · Token Utilisation – Text vs Voice\n", |
203 | | - "\n", |
204 | | - "Large‑token windows are precious: every extra token you use costs latency + money. \n", |
205 | | - "For **audio** the input token window increases much faster than for plain text because amplitude, timing, and other acoustic details must be represented.\n", |
206 | | - "\n", |
207 | | - "In practice you’ll often see **≈ 10 ×** more tokens for the *same* sentence in audio versus text.\n", |
208 | | - "\n", |
209 | | - "### 3.1 Hands‑on comparison 📊\n", |
210 | | - "\n", |
211 | | - "The cells below:\n", |
212 | | - "\n", |
213 | | - "1. **Sends `TEXT` to Chat Completions** → reads `prompt_tokens`. \n", |
214 | | - "2. **Turns the same `TEXT` into speech** with TTS. \n", |
215 | | - "3. **Feeds the speech back into the Realtime API Transcription endpoint** → reads `audio input tokens`. \n", |
216 | | - "4. Prints a ratio so you can see the multiplier on *your* hardware / account." |
217 | | - ] |
218 | | - }, |
219 | | - { |
220 | | - "cell_type": "code", |
221 | | - "execution_count": 67, |
222 | | - "metadata": {}, |
223 | | - "outputs": [ |
224 | | - { |
225 | | - "name": "stdout", |
226 | | - "output_type": "stream", |
227 | | - "text": [ |
228 | | - "📄 Text prompt tokens : 42\n", |
229 | | - "🔊 Audio length (s) : 11.55\n" |
230 | | - ] |
231 | | - } |
232 | | - ], |
233 | | - "source": [ |
234 | | - "TEXT = \"Hello there, I am measuring tokens for text versus voice because we want to better compare the number of tokens used when sending a message as text versus when converting it to speech..\"\n", |
235 | | - "CHAT_MODEL = \"gpt-4o-mini\"\n", |
236 | | - "STT_MODEL = \"gpt-4o-transcribe\"\n", |
237 | | - "TTS_MODEL = \"gpt-4o-mini-tts\"\n", |
238 | | - "RT_MODEL = \"gpt-4o-realtime-preview\" # S2S model\n", |
239 | | - "VOICE = \"shimmer\"\n", |
240 | | - "\n", |
241 | | - "TARGET_SR = 24_000\n", |
242 | | - "PCM_SCALE = 32_767\n", |
243 | | - "CHUNK_MS = 120 # stream step\n", |
244 | | - "\n", |
245 | | - "\n", |
246 | | - "HEADERS = {\n", |
247 | | - " \"Authorization\": f\"Bearer {openai.api_key}\",\n", |
248 | | - " \"OpenAI-Beta\": \"realtime=v1\",\n", |
249 | | - "}\n", |
250 | | - "\n", |
251 | | - "show = lambda l, v: print(f\"{l:<28}: {v}\")\n", |
252 | | - "\n", |
253 | | - "# ─── Helpers ─────────────────────────────────────────────────────────────\n", |
254 | | - "def float_to_pcm16(x: np.ndarray) -> bytes:\n", |
255 | | - " return (np.clip(x, -1, 1) * PCM_SCALE).astype(\"<i2\").tobytes()\n", |
256 | | - "\n", |
257 | | - "def chunk_pcm(pcm: bytes, ms: int = CHUNK_MS) -> List[bytes]:\n", |
258 | | - " step = TARGET_SR * 2 * ms // 1000\n", |
259 | | - " return [pcm[i:i + step] for i in range(0, len(pcm), step)]\n", |
260 | | - "\n", |
261 | | - "# ─── 1 · Count text tokens ──────────────────────────────────────────────\n", |
262 | | - "chat = openai.chat.completions.create(\n", |
263 | | - " model=CHAT_MODEL,\n", |
264 | | - " messages=[{\"role\": \"user\", \"content\": TEXT}],\n", |
265 | | - " max_tokens=1,\n", |
266 | | - " temperature=0,\n", |
267 | | - ")\n", |
268 | | - "text_tokens = chat.usage.prompt_tokens\n", |
269 | | - "show(\"📄 Text prompt tokens\", text_tokens)\n", |
270 | | - "\n", |
271 | | - "# ─── 2 · Synthesis to WAV & PCM16 ───────────────────────────────────────\n", |
272 | | - "wav_bytes = openai.audio.speech.create(\n", |
273 | | - " model=TTS_MODEL, input=TEXT, voice=VOICE, response_format=\"wav\"\n", |
274 | | - ").content\n", |
275 | | - "\n", |
276 | | - "with wave.open(io.BytesIO(wav_bytes)) as w:\n", |
277 | | - " pcm_bytes = w.readframes(w.getnframes())\n", |
278 | | - "duration_sec = len(pcm_bytes) / (2 * TARGET_SR)\n", |
279 | | - "show(\"🔊 Audio length (s)\", f\"{duration_sec:.2f}\")" |
280 | | - ] |
281 | | - }, |
282 | | - { |
283 | | - "cell_type": "code", |
284 | | - "execution_count": 73, |
285 | | - "metadata": {}, |
286 | | - "outputs": [ |
287 | | - { |
288 | | - "name": "stdout", |
289 | | - "output_type": "stream", |
290 | | - "text": [ |
291 | | - "🎤 Audio input tokens : 112\n", |
292 | | - "⚖️ Audio/Text ratio : 2.7×\n", |
293 | | - "\n", |
294 | | - "≈9 audio‑tokens / sec vs ≈1 token / word.\n" |
295 | | - ] |
296 | | - } |
297 | | - ], |
298 | | - "source": [ |
299 | | - "# ─── 3 · Realtime streaming & token harvest ─────────────────────────────\n", |
300 | | - "async def count_audio_tokens(pcm: bytes) -> int:\n", |
301 | | - " url = f\"wss://api.openai.com/v1/realtime?model={RT_MODEL}\"\n", |
302 | | - " chunks = chunk_pcm(pcm)\n", |
303 | | - "\n", |
304 | | - " async with websockets.connect(url, extra_headers=HEADERS,\n", |
305 | | - " max_size=1 << 24) as ws:\n", |
306 | | - "\n", |
307 | | - " # Wait for session.created\n", |
308 | | - " while json.loads(await ws.recv())[\"type\"] != \"session.created\":\n", |
309 | | - " pass\n", |
310 | | - "\n", |
311 | | - " # Configure modalities + voice\n", |
312 | | - " await ws.send(json.dumps({\n", |
313 | | - " \"type\": \"session.update\",\n", |
314 | | - " \"session\": {\n", |
315 | | - " \"modalities\": [\"audio\", \"text\"],\n", |
316 | | - " \"voice\": VOICE,\n", |
317 | | - " \"input_audio_format\": \"pcm16\",\n", |
318 | | - " \"output_audio_format\": \"pcm16\",\n", |
319 | | - " \"input_audio_transcription\": {\"model\": STT_MODEL},\n", |
320 | | - " }\n", |
321 | | - " }))\n", |
322 | | - "\n", |
323 | | - " # Stream user audio chunks (no manual commit; server VAD handles it)\n", |
324 | | - " for c in chunks:\n", |
325 | | - " await ws.send(json.dumps({\n", |
326 | | - " \"type\": \"input_audio_buffer.append\",\n", |
327 | | - " \"audio\": base64.b64encode(c).decode(),\n", |
328 | | - " }))\n", |
329 | | - "\n", |
330 | | - " async for raw in ws:\n", |
331 | | - " ev = json.loads(raw)\n", |
332 | | - " t = ev.get(\"type\")\n", |
333 | | - "\n", |
334 | | - " if t == \"response.done\":\n", |
335 | | - " return ev[\"response\"][\"usage\"]\\\n", |
336 | | - " [\"input_token_details\"][\"audio_tokens\"]\n", |
337 | | - "\n", |
338 | | - "audio_tokens = await count_audio_tokens(pcm_bytes)\n", |
339 | | - "show(\"🎤 Audio input tokens\", audio_tokens)\n", |
| 177 | + "### 3.2 · Streaming Audio\n", |
| 178 | + "We’ll stream raw PCM‑16 microphone data straight into the Realtime API.\n", |
340 | 179 | "\n", |
341 | | - "# ─── 4 · Comparison ─────────────────────────────────────────────────────\n", |
342 | | - "ratio = audio_tokens / text_tokens if text_tokens else float(\"inf\")\n", |
343 | | - "show(\"⚖️ Audio/Text ratio\", f\"{ratio:.1f}×\")\n", |
344 | | - "print(f\"\\n≈{int(audio_tokens/duration_sec)} audio‑tokens / sec vs ≈1 token / word.\")" |
345 | | - ] |
346 | | - }, |
347 | | - { |
348 | | - "cell_type": "markdown", |
349 | | - "metadata": {}, |
350 | | - "source": [ |
351 | | - "This toy example uses a short input, but as transcripts get longer, the difference between text token count and voice token count grows substantially." |
| 180 | + "The pipeline is: mic ─► async.Queue ─► WebSocket ─► Realtime API" |
352 | 181 | ] |
353 | 182 | }, |
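| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "Before building the individual helpers, here is that queue wiring as a bare pattern sketch. Every name here (`producer`, `consumer`, `ws`) is a placeholder; the real coroutines are defined in the next sections:\n",
| | + "\n",
| | + "```python\n",
| | + "import asyncio, base64, json\n",
| | + "\n",
| | + "audio_queue: asyncio.Queue = asyncio.Queue()\n",
| | + "\n",
| | + "async def producer(pcm_block: bytes) -> None:\n",
| | + "    # The mic callback drops raw PCM-16 blocks onto the queue.\n",
| | + "    await audio_queue.put(pcm_block)\n",
| | + "\n",
| | + "async def consumer(ws) -> None:\n",
| | + "    # Forward each block as a JSON-safe, base64-encoded append event.\n",
| | + "    while True:\n",
| | + "        block = await audio_queue.get()\n",
| | + "        await ws.send(json.dumps({\n",
| | + "            \"type\": \"input_audio_buffer.append\",\n",
| | + "            \"audio\": base64.b64encode(block).decode(),\n",
| | + "        }))\n",
| | + "```"
| | + ]
| | + },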
354 | 183 | { |
355 | 184 | "cell_type": "markdown", |
356 | 185 | "metadata": {}, |
357 | 186 | "source": [ |
358 | | - "## 4 · Streaming Audio\n", |
359 | | - "We’ll stream raw PCM‑16 microphone data straight into the Realtime API.\n", |
360 | | - "\n", |
361 | | - "The pipeline is: mic ─► async.Queue ─► WebSocket ─► Realtime API\n", |
362 | | - "\n", |
363 | | - "### 4.1 Capture Microphone Input\n", |
| 187 | + "#### 3.2.1 Capture Microphone Input\n", |
364 | 188 | "We’ll start with a coroutine that:\n", |
365 | 189 | "\n", |
366 | 190 | "* Opens the default mic at **24 kHz, mono, PCM‑16** (one of the [format](https://platform.openai.com/docs/api-reference/realtime-sessions/create#realtime-sessions-create-input_audio_format) Realtime accepts). \n", |
367 | 191 | "* Slices the stream into **≈ 40 ms** blocks. \n", |
368 | | - "* Dumps each block into an `asyncio.Queue` so another task (next section) can forward it to OpenAI.\n" |
| 192 | + "* Dumps each block into an `asyncio.Queue` so another task (next section) can forward it to OpenAI." |
369 | 193 | ] |
370 | 194 | }, |
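| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "As a quick sanity check on block size (simple arithmetic, assuming the 24 kHz mono PCM‑16 settings above):\n",
| | + "\n",
| | + "```python\n",
| | + "SAMPLE_RATE = 24_000      # Hz, mono\n",
| | + "BYTES_PER_SAMPLE = 2      # PCM-16 = 2 bytes per sample\n",
| | + "CHUNK_MS = 40             # target block duration\n",
| | + "\n",
| | + "frames_per_chunk = SAMPLE_RATE * CHUNK_MS // 1000    # 960 frames\n",
| | + "chunk_bytes = frames_per_chunk * BYTES_PER_SAMPLE    # 1_920 bytes per block\n",
| | + "```"
| | + ]
| | + },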
371 | 195 | { |
|
414 | 238 | "cell_type": "markdown", |
415 | 239 | "metadata": {}, |
416 | 240 | "source": [ |
417 | | - "### 4.2 Send Audio Chunks to the API\n", |
| 241 | + "#### 3.2.2 Send Audio Chunks to the API\n", |
418 | 242 | "\n", |
419 | 243 | "Our mic task is now filling an `asyncio.Queue` with raw PCM‑16 blocks. \n", |
420 | 244 | "Next step: pull chunks off that queue, **base‑64 encode** them (the protocol requires JSON‑safe text), and ship each block to the Realtime WebSocket as an `input_audio_buffer.append` event." |
|
444 | 268 | "cell_type": "markdown", |
445 | 269 | "metadata": {}, |
446 | 270 | "source": [ |
447 | | - "### 4.3 Handle Incoming Events \n", |
| 271 | + "#### 3.2.3 Handle Incoming Events \n", |
448 | 272 | "Once audio reaches the server, the Realtime API pushes a stream of JSON events back over the **same** WebSocket. \n", |
449 | 273 | "Understanding these events is critical for:\n", |
450 | 274 | "\n", |
451 | 275 | "* Printing live transcripts \n", |
452 | 276 | "* Playing incremental audio back to the user \n", |
453 | | - "* Keeping an accurate [`ConversationState`](https://platform.openai.com/docs/api-reference/realtime-server-events/conversation/created) so context trimming works later \n", |
| 277 | + "* Keeping an accurate [`Conversation State`](https://platform.openai.com/docs/api-reference/realtime-server-events/conversation/created) so context trimming works later \n", |
454 | 278 | "\n", |
455 | 279 | "| Event type | Typical timing | What you should do with it |\n", |
456 | 280 | "|------------|----------------|----------------------------|\n", |
|
466 | 290 | "cell_type": "markdown", |
467 | 291 | "metadata": {}, |
468 | 292 | "source": [ |
469 | | - "## 5 · Dynamic Context Management & Summarisation\n", |
470 | | - "\n", |
| 293 | + "### 3.3 Detect When to Summarise\n", |
471 | 294 | "The Realtime model keeps a **large 128 k‑token window**, but quality can drift long before that as you stuff more context into the model.\n", |
| 295 | + "\n", |
472 | 296 | "Our goal: **auto‑summarise** once the running window nears a safe threshold (default **2 000 tokens** for the notebook), then prune the superseded turns both locally *and* server‑side.\n", |
473 | 297 | "\n", |
474 | | - "### 5.1 Detect When to Summarise\n", |
475 | 298 | "We monitor latest_tokens returned in response.done. When it exceeds SUMMARY_TRIGGER and we have more than KEEP_LAST_TURNS, we spin up a background summarisation coroutine.\n", |
476 | 299 | "\n", |
477 | | - "### 5.2 Generate & Insert a Summary\n", |
478 | 300 | "We compress everything except the last 2 turns into a single French paragraph, then:\n", |
479 | 301 | "\n", |
480 | 302 | "1. Insert that paragraph as a new assistant message at the top of the conversation.\n", |
|
603 | 425 | "cell_type": "markdown", |
604 | 426 | "metadata": {}, |
605 | 427 | "source": [ |
606 | | - "## 6 · End‑to‑End Workflow Demonstration\n", |
| 428 | + "## 4. End‑to‑End Workflow Demonstration\n", |
607 | 429 | "\n", |
608 | 430 | "Run the two cells below to launch an interactive session. Interrupt the cell stop recording.\n", |
609 | 431 | "\n", |
|
835 | 657 | "cell_type": "markdown", |
836 | 658 | "metadata": {}, |
837 | 659 | "source": [ |
838 | | - "## 7 · Real‑World Applications\n", |
| 660 | + "## 5 · Real‑World Applications\n", |
839 | 661 | "\n", |
840 | 662 | "Context summarisation can be useful for **long‑running voice experiences**. \n", |
841 | 663 | "Here are a use case ideas:\n", |
|
852 | 674 | "cell_type": "markdown", |
853 | 675 | "metadata": {}, |
854 | 676 | "source": [ |
855 | | - "## 8 · Next Steps & Further Reading\n", |
| 677 | + "## 6 · Next Steps & Further Reading\n", |
856 | 678 | "Try out the notebook and try integrating context summary into your application.\n", |
857 | 679 | "\n", |
858 | 680 | "Few things you can try:\n", |
|