|
96 | 96 | "In practice you’ll often see **≈ 10 ×** more tokens for the *same* sentence in audio versus text.\n", |
97 | 97 | "\n", |
98 | 98 | "\n", |
99 | | - "* GPT-4o realtime accepts up to **128k tokens** and as the token size increases instruction adherence can drifts.\n", |
| 99 | + "* GPT-4o realtime accepts up to **128k tokens** and as the token size increases, instruction adherence can drift.\n", |
100 | 100 | "* Every user/assistant turn consumes tokens → the window **only grows**.\n", |
101 | 101 | "* **Strategy**: Summarise older turns into a single assistant message, keep the last few verbatim turns, and continue.\n", |
102 | 102 | "\n", |
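| | + "The trimming strategy above can be sketched in a few lines. This is a minimal illustration, assuming `history` is a list of `{'role': ..., 'text': ...}` dicts and `summarise()` is a hypothetical helper that turns the older turns into one short paragraph:\n",
| | + "\n",
| | + "```python\n",
| | + "KEEP_LAST_TURNS = 2  # recent turns kept verbatim (assumed default)\n",
| | + "\n",
| | + "def trim_history(history, summarise, keep_last=KEEP_LAST_TURNS):\n",
| | + "    # Nothing to compress while the conversation is still short\n",
| | + "    if len(history) <= keep_last:\n",
| | + "        return history\n",
| | + "    older, recent = history[:-keep_last], history[-keep_last:]\n",
| | + "    # Collapse the older turns into a single assistant message ...\n",
| | + "    summary = summarise(older)\n",
| | + "    # ... and keep only the last few turns verbatim after it\n",
| | + "    return [{'role': 'assistant', 'text': summary}] + recent\n",
| | + "```\n",
| | + "\n",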
|
252 | 252 | "metadata": {}, |
253 | 253 | "outputs": [], |
254 | 254 | "source": [ |
| 255 | + "# Helper function to encode audio chunks in base64\n", |
255 | 256 | "b64 = lambda blob: base64.b64encode(blob).decode()\n", |
256 | 257 | "\n", |
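| | + "# Forward PCM chunks from the local capture queue to the Realtime WebSocket\n",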
257 | 258 | "async def queue_to_websocket(pcm_queue: asyncio.Queue[bytes], ws):\n", |
|
278 | 279 | "* Playing incremental audio back to the user \n", |
279 | 280 | "* Keeping an accurate [`Conversation State`](https://platform.openai.com/docs/api-reference/realtime-server-events/conversation/created) so context trimming works later \n", |
280 | 281 | "\n", |
281 | | - "| Event type | Typical timing | What you should do with it |\n", |
282 | | - "|------------|----------------|----------------------------|\n", |
283 | | - "| **`session.created`** | Immediately after connection | Verify the handshake; stash the `session_id` if you need it for server logs. |\n", |
284 | | - "| **`conversation.item.created`** (user) | Right after the user stops talking | Place a *placeholder* `Turn` in `state.history`. Transcript may still be `null`. |\n", |
285 | | - "| **`conversation.item.retrieved`** | A few hundred ms later | Fill in any missing user transcript once STT completes. |\n", |
286 | | - "| **`response.audio.delta`** | Streaming chunks while the assistant speaks | Append bytes to a local buffer, play them (low‑latency) as they arrive. |\n", |
287 | | - "| **`response.done`** | After final assistant token | Add assistant text + usage stats, update `state.latest_tokens`. |\n", |
288 | | - "| **`conversation.item.deleted`** | Whenever you prune old turns | Remove superseded items from `conversation.item`. |\n" |
| 282 | + "\n", |
| 283 | + "| Event type | When it arrives | Why it matters | Typical handler logic |\n", |
| 284 | + "|------------|-----------------|---------------|-----------------------|\n", |
| 285 | + "| **`session.created`** | Immediately after the WebSocket handshake | Confirms the session is open and provides the `session.id`. | Log the ID for traceability and verify the connection. |\n", |
| 286 | + "| **`session.updated`** | After you send a `session.update` call | Acknowledges that the server applied new session settings. | Inspect the echoed settings and update any local cache. |\n", |
| 287 | + "| **`conversation.item.created`** (user) | A few ms after the user stops speaking (client VAD fires) | Reserves a timeline slot; transcript may still be **`null`**. | Insert a *placeholder* user turn in `state.history` marked “pending transcript”. |\n", |
| 288 | + "| **`conversation.item.retrieved`** | ~100 – 300 ms later, once audio transcription is complete | Supplies the final user transcript (with timing). | Replace the placeholder with the transcript and print it if desired. |\n", |
| 289 | + "| **`response.audio.delta`** | Every 20 – 60 ms while the assistant is speaking | Streams PCM‑16 audio chunks (and optional incremental text). | Buffer each chunk and play it; optionally show partial text in the console. |\n", |
| 290 | + "| **`response.done`** | After the assistant’s last token | Signals both audio & text are complete; includes usage stats. | Finalize the assistant turn, update `state.latest_tokens`, and log usage. |\n", |
| 291 | + "| **`conversation.item.deleted`** | Whenever you prune with `conversation.item.delete` | Confirms a turn was removed, freeing tokens on the server. | Mirror the deletion locally so your context window matches the server’s. |\n", |
| 292 | + "\n" |
289 | 293 | ] |
290 | 294 | }, |
291 | 295 | { |
292 | 296 | "cell_type": "markdown", |
293 | 297 | "metadata": {}, |
294 | 298 | "source": [ |
295 | 299 | "### 3.3 Detect When to Summarise\n", |
296 | | - "The Realtime model keeps a **large 128 k‑token window**, but quality can drift long before that as you stuff more context into the model.\n", |
| 300 | + "The Realtime model keeps a **large 128 k‑token window**, but quality can drift long before that limit as you stuff more context into the model.\n", |
297 | 301 | "\n", |
298 | 302 | "Our goal: **auto‑summarise** once the running window nears a safe threshold (default **2 000 tokens** for the notebook), then prune the superseded turns both locally *and* server‑side.\n", |
299 | 303 | "\n", |
300 | | - "We monitor latest_tokens returned in response.done. When it exceeds SUMMARY_TRIGGER and we have more than KEEP_LAST_TURNS, we spin up a background summarisation coroutine.\n", |
| 304 | + "We monitor latest_tokens returned in `response.done`. When it exceeds SUMMARY_TRIGGER and we have more than KEEP_LAST_TURNS, we spin up a background summarisation coroutine.\n", |
301 | 305 | "\n", |
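| | + "As a rough sketch (assuming `state` carries `latest_tokens` and `history` as above, and `run_summary_llm()` is a hypothetical coroutine that produces the summary):\n",
| | + "\n",
| | + "```python\n",
| | + "import asyncio\n",
| | + "\n",
| | + "SUMMARY_TRIGGER = 2_000   # safe threshold used in this notebook\n",
| | + "KEEP_LAST_TURNS = 2       # turns that always stay verbatim\n",
| | + "\n",
| | + "def maybe_summarise(state):\n",
| | + "    # Only summarise when the window is large AND there is something to prune\n",
| | + "    if state.latest_tokens > SUMMARY_TRIGGER and len(state.history) > KEEP_LAST_TURNS:\n",
| | + "        # Run in the background so the audio loop is never blocked\n",
| | + "        asyncio.create_task(run_summary_llm(state))\n",
| | + "```\n",
| | + "\n",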
302 | 306 | "We compress everything except the last 2 turns into a single French paragraph, then:\n", |
303 | 307 | "\n", |
|