|
5 | 5 | "id": "26a10eea", |
6 | 6 | "metadata": {}, |
7 | 7 | "source": [ |
8 | | - "# 🗣️ Methods of Speech-to-Text using OpenAI API\n", |
| 8 | + "# 🗣️ Methods of Speech-to-Text using OpenAI API & Agents SDK\n", |
9 | 9 | "\n", |
10 | | - "**Updated : April 27 2025** \n", |
| 10 | + "**Updated : April 29 2025** \n", |
11 | 11 | "This notebook provides a clear, hands-on guide for beginners to quickly get started with Speech-to-Text (STT) using the OpenAI API. You'll explore multiple practical methods, their use cases, and considerations.\n", |
12 | 12 | "\n", |
13 | 13 | "Assumption: For simplicity and ease of use, this notebook uses WAV audio files. Real-time microphone streaming (e.g., from web apps or microphones) is not utilized." |
|
22 | 22 | "| Mode | Latency to **first token** | Best for (real examples) | What you still need to handle / key limits |\n", |
23 | 23 | "|--------------------------------|-------------------------|--------------------------------------------------------------|-----------------------------------------------------------|\n", |
24 | 24 | "| File upload + `stream=False` (blocking) | > 1 s | Voicemail, meeting recordings | • No partial results, users see nothing until file finishes <br>• Max 25 MB per request (you must chunk long audio) |\n", |
25 | | - "| File upload + `stream=True` | < 1 s | voice memos in mobile apps | • Still requires a completed file <br>• You implement progress bars / chunked uploads |\n", |
| 25 | + "| File upload + `stream=True` | < 1 s | Voice memos in mobile apps | • Still requires a completed file <br>• You implement progress bars / chunked uploads |\n", |
26 | 26 | "| Realtime WebSocket | < 1 s | Live captions in webinars | • Audio must be pcm16, g711_ulaw, or g711_alaw <br>• Session ≤ 30 min, reconnect & stitch <br>• You handle speaker-turn formatting to build the full transcript |\n", |
27 | 27 | "| Agents SDK VoicePipeline | < 1 s | Internal help-desk assistant | • Python-only beta <br>• API surface may change, limited customisation |" |
28 | 28 | ] |
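As a quick point of reference for the first row of the table, the blocking file-upload path is the shortest route from an audio file to a transcript. The sketch below is only an illustration, assuming `OPENAI_API_KEY` is set in the environment and reusing the sample WAV file that later cells of this notebook work with.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Blocking file upload: the whole file is sent, no partial results are streamed back.
with open("./data/sample_audio_files/18_sec_food_story.wav", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

print(result.text)  # full transcript, available only after the request completes
```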
|
68 | 68 | }, |
69 | 69 | { |
70 | 70 | "cell_type": "code", |
71 | | - "execution_count": 13, |
| 71 | + "execution_count": 6, |
72 | 72 | "id": "e4078915", |
73 | 73 | "metadata": {}, |
74 | 74 | "outputs": [ |
|
83 | 83 | "source": [ |
84 | 84 | "import os, nest_asyncio, asyncio\n", |
85 | 85 | "from openai import OpenAI\n", |
| 86 | + "import time\n", |
86 | 87 | "nest_asyncio.apply()\n", |
87 | 88 | "\n", |
88 | 89 | "# ✏️ Put your key in an env-var or just replace the call below.\n", |
|
145 | 146 | }, |
146 | 147 | { |
147 | 148 | "cell_type": "code", |
148 | | - "execution_count": 14, |
| 149 | + "execution_count": 2, |
149 | 150 | "id": "ab545e4c", |
150 | 151 | "metadata": {}, |
151 | 152 | "outputs": [ |
|
179 | 180 | }, |
180 | 181 | { |
181 | 182 | "cell_type": "code", |
182 | | - "execution_count": 15, |
| 183 | + "execution_count": 10, |
183 | 184 | "id": "7ae4af8d", |
184 | 185 | "metadata": {}, |
185 | 186 | "outputs": [ |
|
255 | 256 | }, |
256 | 257 | { |
257 | 258 | "cell_type": "code", |
258 | | - "execution_count": 16, |
| 259 | + "execution_count": null, |
259 | 260 | "id": "d027fdb9", |
260 | 261 | "metadata": {}, |
261 | 262 | "outputs": [ |
262 | 263 | { |
263 | 264 | "name": "stdout", |
264 | 265 | "output_type": "stream", |
265 | 266 | "text": [ |
266 | | - "TranscriptionTextDeltaEvent(delta='The', type='transcript.text.delta', logprobs=None)\n", |
267 | | - "TranscriptionTextDeltaEvent(delta=' stale', type='transcript.text.delta', logprobs=None)\n", |
268 | | - "TranscriptionTextDeltaEvent(delta=' smell', type='transcript.text.delta', logprobs=None)\n", |
269 | | - "TranscriptionTextDeltaEvent(delta=' of', type='transcript.text.delta', logprobs=None)\n", |
270 | | - "TranscriptionTextDeltaEvent(delta=' old', type='transcript.text.delta', logprobs=None)\n", |
271 | | - "TranscriptionTextDeltaEvent(delta=' beer', type='transcript.text.delta', logprobs=None)\n", |
272 | | - "TranscriptionTextDeltaEvent(delta=' l', type='transcript.text.delta', logprobs=None)\n", |
273 | | - "TranscriptionTextDeltaEvent(delta='ingers', type='transcript.text.delta', logprobs=None)\n", |
274 | | - "TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)\n", |
275 | | - "TranscriptionTextDeltaEvent(delta=' It', type='transcript.text.delta', logprobs=None)\n", |
276 | | - "TranscriptionTextDeltaEvent(delta=' takes', type='transcript.text.delta', logprobs=None)\n", |
277 | | - "TranscriptionTextDeltaEvent(delta=' heat', type='transcript.text.delta', logprobs=None)\n", |
278 | | - "TranscriptionTextDeltaEvent(delta=' to', type='transcript.text.delta', logprobs=None)\n", |
279 | | - "TranscriptionTextDeltaEvent(delta=' bring', type='transcript.text.delta', logprobs=None)\n", |
280 | | - "TranscriptionTextDeltaEvent(delta=' out', type='transcript.text.delta', logprobs=None)\n", |
281 | | - "TranscriptionTextDeltaEvent(delta=' the', type='transcript.text.delta', logprobs=None)\n", |
282 | | - "TranscriptionTextDeltaEvent(delta=' odor', type='transcript.text.delta', logprobs=None)\n", |
283 | | - "TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)\n", |
284 | | - "TranscriptionTextDeltaEvent(delta=' A', type='transcript.text.delta', logprobs=None)\n", |
285 | | - "TranscriptionTextDeltaEvent(delta=' cold', type='transcript.text.delta', logprobs=None)\n", |
286 | | - "TranscriptionTextDeltaEvent(delta=' dip', type='transcript.text.delta', logprobs=None)\n", |
287 | | - "TranscriptionTextDeltaEvent(delta=' restores', type='transcript.text.delta', logprobs=None)\n", |
288 | | - "TranscriptionTextDeltaEvent(delta=' health', type='transcript.text.delta', logprobs=None)\n", |
289 | | - "TranscriptionTextDeltaEvent(delta=' and', type='transcript.text.delta', logprobs=None)\n", |
290 | | - "TranscriptionTextDeltaEvent(delta=' zest', type='transcript.text.delta', logprobs=None)\n", |
291 | | - "TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)\n", |
292 | | - "TranscriptionTextDeltaEvent(delta=' A', type='transcript.text.delta', logprobs=None)\n", |
293 | | - "TranscriptionTextDeltaEvent(delta=' salt', type='transcript.text.delta', logprobs=None)\n", |
294 | | - "TranscriptionTextDeltaEvent(delta=' pickle', type='transcript.text.delta', logprobs=None)\n", |
295 | | - "TranscriptionTextDeltaEvent(delta=' tastes', type='transcript.text.delta', logprobs=None)\n", |
296 | | - "TranscriptionTextDeltaEvent(delta=' fine', type='transcript.text.delta', logprobs=None)\n", |
297 | | - "TranscriptionTextDeltaEvent(delta=' with', type='transcript.text.delta', logprobs=None)\n", |
298 | | - "TranscriptionTextDeltaEvent(delta=' ham', type='transcript.text.delta', logprobs=None)\n", |
299 | | - "TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)\n", |
300 | | - "TranscriptionTextDeltaEvent(delta=' T', type='transcript.text.delta', logprobs=None)\n", |
301 | | - "TranscriptionTextDeltaEvent(delta='acos', type='transcript.text.delta', logprobs=None)\n", |
302 | | - "TranscriptionTextDeltaEvent(delta=' al', type='transcript.text.delta', logprobs=None)\n", |
303 | | - "TranscriptionTextDeltaEvent(delta=' pastor', type='transcript.text.delta', logprobs=None)\n", |
304 | | - "TranscriptionTextDeltaEvent(delta=' are', type='transcript.text.delta', logprobs=None)\n", |
305 | | - "TranscriptionTextDeltaEvent(delta=' my', type='transcript.text.delta', logprobs=None)\n", |
306 | | - "TranscriptionTextDeltaEvent(delta=' favorite', type='transcript.text.delta', logprobs=None)\n", |
307 | | - "TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)\n", |
308 | | - "TranscriptionTextDeltaEvent(delta=' A', type='transcript.text.delta', logprobs=None)\n", |
309 | | - "TranscriptionTextDeltaEvent(delta=' zest', type='transcript.text.delta', logprobs=None)\n", |
310 | | - "TranscriptionTextDeltaEvent(delta='ful', type='transcript.text.delta', logprobs=None)\n", |
311 | | - "TranscriptionTextDeltaEvent(delta=' food', type='transcript.text.delta', logprobs=None)\n", |
312 | | - "TranscriptionTextDeltaEvent(delta=' is', type='transcript.text.delta', logprobs=None)\n", |
313 | | - "TranscriptionTextDeltaEvent(delta=' the', type='transcript.text.delta', logprobs=None)\n", |
314 | | - "TranscriptionTextDeltaEvent(delta=' hot', type='transcript.text.delta', logprobs=None)\n", |
315 | | - "TranscriptionTextDeltaEvent(delta='-cross', type='transcript.text.delta', logprobs=None)\n", |
316 | | - "TranscriptionTextDeltaEvent(delta=' bun', type='transcript.text.delta', logprobs=None)\n", |
317 | | - "TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)\n", |
318 | | - "TranscriptionTextDoneEvent(text='The stale smell of old beer lingers. It takes heat to bring out the odor. A cold dip restores health and zest. A salt pickle tastes fine with ham. Tacos al pastor are my favorite. A zestful food is the hot-cross bun.', type='transcript.text.done', logprobs=None)\n" |
| 267 | + "The stale smell of old beer lingers.\n", |
| 268 | + "The stale smell of old beer lingers.\n" |
319 | 269 | ] |
320 | 270 | } |
321 | 271 | ], |
|
330 | 280 | ")\n", |
331 | 281 | "\n", |
332 | 282 | "for event in stream:\n", |
333 | | - " print(event)" |
| 283 | + " # If this is an incremental update …\n", |
| 284 | + " if getattr(event, \"delta\", None): \n", |
| 285 | + " print(event.delta, end=\"\", flush=True)\n", |
| 286 | + " time.sleep(0.05) # simulate real-time pacing\n", |
| 287 | + " # … otherwise it’s the final transcript chunk\n", |
| 288 | + " elif getattr(event, \"text\", None):\n", |
| 289 | + " print(\"\\n\" + event.text)" |
334 | 290 | ] |
335 | 291 | }, |
336 | 292 | { |
|
375 | 331 | "#### Limitations\n", |
376 | 332 | "- **Complex integration:** Requires managing WebSockets, Base64 encoding, and robust error handling. \n", |
377 | 333 | "- **Session constraints:** Limited to 30-minute sessions. \n", |
378 | | - "- **Restricted formats:** Accepts only raw PCM (no MP3 or Opus); bandwidth ≈ 1.5 Mbit/s for 16-kHz mono audio. " |
| 334 | + "- **Restricted formats:** Accepts only raw PCM (no MP3 or Opus); For pcm16, input audio must be 16-bit PCM at a 24kHz sample rate, single channel (mono), and little-endian byte order." |
379 | 335 | ] |
380 | 336 | }, |
381 | 337 | { |
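To make the `pcm16` constraint above concrete: the audio has to end up as mono samples at 24 kHz, scaled to 16-bit little-endian integers, and (because the WebSocket payload is JSON) base64-encoded. The notebook's own `load_and_resample` helper in the next cell handles the resampling and int16 conversion (base64 encoding happens right before each WebSocket send); the sketch below only illustrates the same steps end to end, with `wav_to_pcm16_base64` a hypothetical name and `soundfile`/`resampy` assumed available (both are imported later in this notebook).

```python
import base64

import resampy
import soundfile as sf


def wav_to_pcm16_base64(path: str, target_sr: int = 24_000) -> str:
    """Read a WAV file and return base64-encoded 16-bit, 24 kHz, mono, little-endian PCM."""
    audio, sr = sf.read(path, dtype="float32")   # float32 samples in [-1, 1]
    if audio.ndim > 1:                           # stereo → average down to mono
        audio = audio.mean(axis=1)
    if sr != target_sr:                          # resample to the 24 kHz the session expects
        audio = resampy.resample(audio, sr, target_sr)
    pcm16 = (audio * 32_767).astype("<i2")       # "<i2" = little-endian int16
    return base64.b64encode(pcm16.tobytes()).decode()
```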
|
396 | 352 | "import websockets # asyncio-based WebSocket client\n", |
397 | 353 | "\n", |
398 | 354 | "\n", |
399 | | - "TARGET_SR = 16_000 # GPT-4o-Transcribe expects 16 kHz mono input\n", |
400 | | - "PCM_SCALE = 32_767 # float32 (-1…1) → int16 conversion factor\n", |
| 355 | + "TARGET_SR = 24_000 # Realtime STT expects 24-kHz mono input\n", |
| 356 | + "PCM_SCALE = 32_767 # float32 (−1…1) → int16 conversion factor\n", |
| 357 | + "DEFAULT_CHUNK = 3_072 # 3 072 / 24 000 ≈ 128 ms (server-VAD sweet-spot)\n", |
401 | 358 | "REALTIME_URL = \"wss://api.openai.com/v1/realtime?intent=transcription\"\n", |
402 | 359 | "\n", |
403 | 360 | "def load_and_resample(path: str, target_sr: int = TARGET_SR) -> bytes:\n", |
|
433 | 390 | " # WebSocket payload must be JSON → encode the binary chunk as base64\n", |
434 | 391 | " b64 = base64.b64encode(chunk).decode()\n", |
435 | 392 | " try:\n", |
436 | | - " await ws.send({\n", |
437 | | - " \"type\": \"input_audio_buffer.append\",\n", |
438 | | - " \"audio\": b64,\n", |
439 | | - " } | {} if False else json.dumps({ # keep original semantics\n", |
440 | | - " \"type\": \"input_audio_buffer.append\",\n", |
441 | | - " \"audio\": b64,\n", |
442 | | - " }))\n", |
| 393 | + " payload = {\n", |
| 394 | + " \"type\": \"input_audio_buffer.append\",\n", |
| 395 | + " \"audio\": b64,\n", |
| 396 | + " }\n", |
| 397 | + " await ws.send(json.dumps(payload))\n", |
443 | 398 | " except websockets.ConnectionClosed:\n", |
444 | 399 | " # Receiver closed early (e.g., after final VAD turn) – just stop.\n", |
445 | 400 | " break\n", |
|
490 | 445 | " api_key: str,\n", |
491 | 446 | " *,\n", |
492 | 447 | " model: str = \"gpt-4o-transcribe\",\n", |
493 | | - " chunk: int = 2_048, # ≈128 ms @16 kHz → good server-VAD latency\n", |
| 448 | + " chunk: int = DEFAULT_CHUNK, # now sized for 24 kHz\n", |
494 | 449 | " debug: bool = False,\n", |
495 | 450 | ") -> str:\n", |
496 | 451 | " \"\"\"\n", |
|
519 | 474 | " await ws.send(json.dumps({\n", |
520 | 475 | " \"type\": \"transcription_session.update\",\n", |
521 | 476 | " \"session\": {\n", |
522 | | - " \"input_audio_format\": \"pcm16\", # little-endian, mono\n", |
| 477 | + " \"input_audio_format\": \"pcm16\", # 24 kHz mono, little-endian\n", |
523 | 478 | " \"turn_detection\": {\n", |
524 | 479 | " \"type\": \"server_vad\", # server does VAD\n", |
525 | 480 | " \"threshold\": 0.5 # adjust if needed\n", |
|
545 | 500 | }, |
546 | 501 | { |
547 | 502 | "cell_type": "code", |
548 | | - "execution_count": null, |
| 503 | + "execution_count": 16, |
549 | 504 | "id": "d90de5b9", |
550 | 505 | "metadata": {}, |
551 | | - "outputs": [], |
| 506 | + "outputs": [ |
| 507 | + { |
| 508 | + "name": "stdout", |
| 509 | + "output_type": "stream", |
| 510 | + "text": [ |
| 511 | + "The stale smell of old beer lingers.It takes heat to bring out the odor.A cold dip restores health and zest.A salt pickle tastes fine with ham.Tacos al pastor are my favorite." |
| 512 | + ] |
| 513 | + }, |
| 514 | + { |
| 515 | + "data": { |
| 516 | + "text/plain": [ |
| 517 | + "'The stale smell of old beer lingers. It takes heat to bring out the odor. A cold dip restores health and zest. A salt pickle tastes fine with ham. Tacos al pastor are my favorite.'" |
| 518 | + ] |
| 519 | + }, |
| 520 | + "execution_count": 16, |
| 521 | + "metadata": {}, |
| 522 | + "output_type": "execute_result" |
| 523 | + } |
| 524 | + ], |
552 | 525 | "source": [ |
553 | | - "WAV_PATH = \"./data/sample_audio_files/18_sec_food_story.wav\" # any 16-bit PCM WAV file (mono or stereo OK)\n", |
554 | | - "\n", |
| 526 | + "WAV_PATH = \"./data/sample_audio_files/18_sec_food_story.wav\"\n", |
555 | 527 | "# IMPORTANT: this cell must be prefixed with `await`\n", |
556 | 528 | "transcript = await transcribe_wav_async(WAV_PATH, OPENAI_API_KEY, debug=False)\n", |
557 | 529 | "transcript" |
|
600 | 572 | }, |
601 | 573 | { |
602 | 574 | "cell_type": "code", |
603 | | - "execution_count": 17, |
| 575 | + "execution_count": 13, |
604 | 576 | "id": "754a846b", |
605 | 577 | "metadata": {}, |
606 | | - "outputs": [], |
| 578 | + "outputs": [ |
| 579 | + { |
| 580 | + "name": "stdout", |
| 581 | + "output_type": "stream", |
| 582 | + "text": [ |
| 583 | + "\n", |
| 584 | + "[User]: The stale smell of old beer lingers.\n", |
| 585 | + "[Assistant]: L'odeur rance de la vieille bière persiste.\n", |
| 586 | + "[User]: A cold dip restores health and zest.\n", |
| 587 | + "[Assistant]: Un bain froid restaure la santé et l'énergie.\n", |
| 588 | + "[User]: Tacos al pastor are my favorite.\n", |
| 589 | + "[Assistant]: Les tacos al pastor sont mes préférés." |
| 590 | + ] |
| 591 | + }, |
| 592 | + { |
| 593 | + "name": "stdout", |
| 594 | + "output_type": "stream", |
| 595 | + "text": [ |
| 596 | + "\n", |
| 597 | + "[User]: A zestful food is the hot cross bun.\n", |
| 598 | + "[Assistant]: Un aliment plein de dynamisme est le petit pain de Pâques." |
| 599 | + ] |
| 600 | + } |
| 601 | + ], |
607 | 602 | "source": [ |
608 | 603 | "import asyncio, numpy as np, soundfile as sf, resampy\n", |
609 | 604 | "from agents import Agent\n", |
|
678 | 673 | "source": [ |
679 | 674 | "## Conclusion \n", |
680 | 675 | "\n", |
681 | | - "In this notebook you explored multiple ways to convert speech to text with the OpenAI API and the Agents SDK—ranging from simple file uploads to fully-interactive, real-time streaming. Each workflow shines in a different scenario, so pick the one that best matches your product’s needs.\n", |
| 676 | + "In this notebook you explored multiple ways to convert speech to text with the OpenAI API and the Agents SDK, ranging from simple file uploads to fully-interactive, real-time streaming. Each workflow shines in a different scenario, so pick the one that best matches your product’s needs.\n", |
682 | 677 | "\n", |
683 | 678 | "### Key takeaways\n", |
684 | 679 | "- **Match the method to the use-case:** \n", |
|