|
11 | 11 | "1. **Live microphone streaming** → OpenAI *Realtime* (voice‑to‑voice) endpoint.\n", |
12 | 12 | "2. **Instant transcripts & speech playback** on every turn.\n", |
13 | 13 | "3. **Conversation state container** that stores **every** user/assistant message.\n", |
14 | | - "4. **Automatic “context trim”** – when the token window nears 32 k, older turns are compressed into a summary.\n", |
| 14 | + "4. **Automatic “context trim”** – when the token window becomes very large (configurable), older turns are compressed into a summary.\n", |
15 | 15 | "5. **Extensible design** you can adapt to support customer‑support bots, kiosks, or multilingual assistants.\n", |
16 | 16 | "\n", |
17 | 17 | "\n", |
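Feature 3 above (the conversation state container) can be as simple as a list of turn dicts plus a slot for the running summary. Below is a minimal, hypothetical sketch — `ConversationState`, `add_turn`, and the field names are illustrative assumptions, not the notebook's actual code:

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical sketch — the notebook's real container may differ.
@dataclass
class ConversationState:
    """Stores every user/assistant turn plus a running summary."""
    turns: list = field(default_factory=list)  # each: {"role", "text", "item_id"}
    summary: str = ""                          # filled in by the context-trim step
    latest_tokens: int = 0                     # last total reported by response.done

    def add_turn(self, role: str, text: str, item_id: Optional[str] = None) -> None:
        self.turns.append({"role": role, "text": text, "item_id": item_id})

state = ConversationState()
state.add_turn("user", "Hello there!")
```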
|
40 | 40 | "\n", |
41 | 41 | "\n", |
42 | 42 | "*Notes:*\n", |
43 | | - "> 1. Why 32 k? OpenAI’s public guidance notes that quality begins to decline well before the full 128 k token limit; 32 k is a conservative threshold observed in practice.\n", |
| 43 | + "> 1. GPT-4o-Realtime supports a 128k token context window, though in certain use cases, you may notice performance degrade as you stuff more tokens into the context window.\n", |
44 | 44 | "> 2. Token window = all tokens (words and audio tokens) the model currently keeps in memory for the session.x\n", |
45 | 45 | "\n", |
46 | 46 | "### 🚀 One‑liner install (run in a fresh cell)" |
|
136 | 136 | "### 2.3 Token Context Windows\n", |
137 | 137 | "\n", |
138 | 138 | "* GPT‑4o Realtime accepts **up to 128 K tokens** in theory. \n", |
139 | | - "* In practice, answer quality starts to drift around **≈ 32 K tokens**. \n", |
| 139 | + "* In practice, answer quality starts to drift as you increase **input token size**. \n", |
140 | 140 | "* Every user/assistant turn consumes tokens → the window **only grows**.\n", |
141 | 141 | "* **Strategy**: Summarise older turns into a single assistant message, keep the last few verbatim turns, and continue.\n", |
142 | 142 | "\n", |
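To make the strategy above concrete, here is a minimal sketch of the trim step. `trim_context`, `summarise_text`, and the `KEEP_LAST_TURNS` default are placeholder assumptions standing in for whatever the notebook actually defines:

```python
KEEP_LAST_TURNS = 4  # illustrative default: verbatim tail to preserve

def trim_context(turns: list, summarise_text) -> list:
    """Compress all but the newest turns into a single assistant summary."""
    if len(turns) <= KEEP_LAST_TURNS:
        return turns  # nothing old enough to compress yet
    old, tail = turns[:-KEEP_LAST_TURNS], turns[-KEEP_LAST_TURNS:]
    transcript = "\n".join(f"{t['role']}: {t['text']}" for t in old)
    summary_turn = {"role": "assistant", "text": summarise_text(transcript)}
    return [summary_turn, *tail]

# e.g. trim_context(state.turns, summarise_text=lambda s: s[:200])
```

Replacing many old turns with one summary turn keeps the window bounded while the last few exchanges stay verbatim.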
|
204 | 204 | "source": [ |
205 | 205 | "## 3 · Token Utilisation – Text vs Voice\n", |
206 | 206 | "\n", |
207 | | - "Large‑token windows are precious: every extra token you burn costs latency + money. \n", |
208 | | - "For **audio** the bill climbs much faster than for plain text because amplitude, timing, and other acoustic details must be represented.\n", |
| 207 | + "Large‑token windows are precious: every extra token you use costs latency + money. \n", |
| 208 | + "For **audio** the input token window increases much faster than for plain text because amplitude, timing, and other acoustic details must be represented.\n", |
209 | 209 | "\n", |
210 | | - "*Rule of thumb*: **1 word of text ≈ 1 token**, but **1 second of 24‑kHz PCM‑16 ≈ ~150 audio tokens**. \n", |
211 | | - "In practice you’ll often see **≈ 10 ×** more tokens for the *same* sentence spoken aloud than typed.\n", |
| 210 | + "In practice you’ll often see **≈ 10 ×** more tokens for the *same* sentence in audio versus text.\n", |
212 | 211 | "\n", |
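One way to observe this split yourself is to inspect the `usage` block of a `response.done` event. The sketch below assumes the usage payload shape documented for the Realtime API (`input_token_details` with separate `text_tokens` and `audio_tokens` counts); adjust the field names if your server events differ:

```python
def log_token_split(event: dict) -> None:
    """Print the text/audio token breakdown from a `response.done` event."""
    usage = event["response"]["usage"]
    inp = usage.get("input_token_details", {})
    print(
        f"total={usage.get('total_tokens', 0)}  "
        f"input text={inp.get('text_tokens', 0)}  "
        f"input audio={inp.get('audio_tokens', 0)}"
    )
```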
213 | 212 | "### 3.1 Hands‑on comparison 📊\n", |
214 | 213 | "\n", |
|
472 | 471 | "source": [ |
473 | 472 | "## 5 · Dynamic Context Management & Summarisation\n", |
474 | 473 | "\n", |
475 | | - "The Realtime model keeps a **gargantuan 128 k‑token window**, but quality drifts long before that. \n", |
476 | | - "Our goal: **auto‑summarise** once the running window nears a safe threshold (default **4 000 tokens**), then prune the superseded turns both locally *and* server‑side.\n", |
| 474 | + "The Realtime model keeps a **large 128 k‑token window**, but quality can drift long before that as you stuff more context into the model.\n", |
| 475 | + "Our goal: **auto‑summarise** once the running window nears a safe threshold (default **2 000 tokens** for the notebook), then prune the superseded turns both locally *and* server‑side.\n", |
477 | 476 | "\n", |
478 | 477 | "### 5.1 Detect When to Summarise\n", |
479 | 478 | "We monitor latest_tokens returned in response.done. When it exceeds SUMMARY_TRIGGER and we have more than KEEP_LAST_TURNS, we spin up a background summarisation coroutine.\n", |
|