|
18 | 18 | "Speech-to-speech systems are essential for enabling voice as a core AI interface. Today’s release enhances robustness and usability, giving enterprises the confidence to deploy mission-critical voice agents at scale.\n",
|
19 | 19 | "\n",
|
20 | 20 | "\n",
|
21 |
| - "The new gpt-4o-realtime-08-11 model delivers stronger instruction following, more reliable tool calling, noticeably better voice quality, and an overall smoother feel. These gains make it practical to move from chained approaches to true realtime experiences, cutting latency while keeping a consistent conversational tone.\n", |
| 21 | + "The new gpt-realtime model delivers stronger instruction following, more reliable tool calling, noticeably better voice quality, and an overall smoother feel. These gains make it practical to move from chained approaches to true realtime experiences, cutting latency, carrying over more nuance from spoken input, and producing responses that sound more natural and expressive.\n", |
22 | 22 | "\n",
|
23 | 23 | "Realtime models benefit from different prompting techniques that wouldn't directly apply to text-based models. This prompting guide starts with a simple prompt skeleton, then walks through each part with practical tips, small patterns you can copy, and examples you can adapt to your use case.\n",
|
24 | 24 | "\n",
|
25 | 25 | "# Table of Contents\n",
|
26 | 26 | "\n",
|
27 | 27 | "- [Realtime Prompting Guide](#realtime-prompting-guide)\n",
|
| 28 | + "- [General Tips](#general-tips)\n", |
28 | 29 | "- [Prompt Structure](#prompt-structure)\n",
|
29 | 30 | "- [Role and Objective](#role-and-objective)\n",
|
30 | 31 | "- [Personality and Tone](#personality-and-tone)\n",
|
|
65 | 66 | "from IPython.display import Audio, display"
|
66 | 67 | ]
|
67 | 68 | },
|
| 69 | + { |
| 70 | + "cell_type": "markdown", |
| 71 | + "id": "cfec8f54", |
| 72 | + "metadata": {}, |
| 73 | + "source": [ |
| 74 | + "# General Tips\n", |
| 75 | + "- **Iterate relentlessly**: Small wording changes can make or break behavior.\n", |
| 76 | + " - Example: swapping “inaudible” → “intelligible” boosted noisy input handling.\n", |
| 77 | + "- **Prefer bullets over paragraphs**: Clear, short bullets outperform long paragraphs.\n", |
| 78 | + "- **Guide with examples**: The model closely follows sample phrases.\n", |
| 79 | + "- **Be precise**: Ambiguous or conflicting instructions degrade performance, as with GPT-5.\n", |
| 80 | + "- **Control language**: Pin output to a target language if you see drift.\n", |
| 81 | + "- **Fight repetition**: Add a Variety rule to reduce robotic phrasing.\n" |
| 82 | + ] |
| 83 | + }, |
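To illustrate the "prefer bullets over paragraphs" tip, here is a minimal sketch of assembling session instructions from short bulleted sections. The helper name `build_instructions` and the section contents are hypothetical and only serve the example; the notebook itself writes its prompts by hand.

```python
def build_instructions(sections: dict[str, list[str]]) -> str:
    """Render each section as a '## Title' heading followed by short bullets."""
    blocks = []
    for title, rules in sections.items():
        bullets = "\n".join(f"- {rule}" for rule in rules)
        blocks.append(f"## {title}\n{bullets}")
    # Separate sections with a blank line so each heading stands out.
    return "\n\n".join(blocks)


# Hypothetical sections, reusing rules that appear later in this guide.
instructions = build_instructions(
    {
        "Language": ["The conversation will be only in English."],
        "Variety": ["Do not repeat the same sentence twice."],
    }
)
print(instructions)
```

Keeping each rule as its own list entry also makes it easier to iterate on one wording change at a time.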
68 | 84 | {
|
69 | 85 | "cell_type": "markdown",
|
70 | 86 | "id": "629a2289",
|
|
136 | 152 | },
|
137 | 153 | {
|
138 | 154 | "cell_type": "code",
|
139 |
| - "execution_count": 26, |
| 155 | + "execution_count": 14, |
140 | 156 | "id": "093a6b4f",
|
141 | 157 | "metadata": {},
|
142 | 158 | "outputs": [
|
|
154 | 170 | "<IPython.lib.display.Audio object>"
|
155 | 171 | ]
|
156 | 172 | },
|
157 |
| - "execution_count": 26, |
| 173 | + "execution_count": 14, |
158 | 174 | "metadata": {},
|
159 | 175 | "output_type": "execute_result"
|
160 | 176 | }
|
161 | 177 | ],
|
162 | 178 | "source": [
|
163 |
| - "Audio(\"./data/audio/obj_o6.wav\")" |
| 179 | + "Audio(\"./data/audio/obj_06.wav\")" |
164 | 180 | ]
|
165 | 181 | },
|
166 | 182 | {
|
|
319 | 335 | "Friendly, calm and approachable expert customer service assistant.\n",
|
320 | 336 | "\n",
|
321 | 337 | "## Tone\n",
|
322 |
| - "Tone: Warm, concise, confident, never fawning.\n", |
| 338 | + "Warm, concise, confident, never fawning.\n", |
323 | 339 | "\n",
|
324 | 340 | "## Length\n",
|
325 | 341 | "2–3 sentences per turn.\n",
|
|
342 | 358 | },
|
343 | 359 | {
|
344 | 360 | "cell_type": "code",
|
345 |
| - "execution_count": 4, |
| 361 | + "execution_count": null, |
346 | 362 | "id": "6dce17d5",
|
347 | 363 | "metadata": {},
|
348 | 364 | "outputs": [
|
|
366 | 382 | }
|
367 | 383 | ],
|
368 | 384 | "source": [
|
369 |
| - "import soundfile as sf\n", |
370 |
| - "from IPython.display import Audio, display\n", |
371 |
| - "# Load AIFC\n", |
372 |
| - "data, samplerate = sf.read(\"/Users/minh.hoque/work/github/openai-cookbook/examples/data/audio/multi-emotion.aifc\")\n", |
373 |
| - "\n", |
374 |
| - "# Save as WAV for better browser compatibility\n", |
375 |
| - "sf.write(\"/Users/minh.hoque/work/github/openai-cookbook/examples/data/audio/multi-emotion.wav\", data, samplerate)\n", |
376 |
| - "\n", |
377 | 385 | "Audio(\"./data/audio/multi-emotion.wav\")"
|
378 | 386 | ]
|
379 | 387 | },
|
|
410 | 418 | "Friendly, calm and approachable expert customer service assistant.\n",
|
411 | 419 | "\n",
|
412 | 420 | "## Tone\n",
|
413 |
| - "Tone: Warm, concise, confident, never fawning.\n", |
| 421 | + "Warm, concise, confident, never fawning.\n", |
414 | 422 | "\n",
|
415 | 423 | "## Length\n",
|
416 | 424 | "2–3 sentences per turn.\n",
|
417 | 425 | "\n",
|
418 | 426 | "## Pacing\n",
|
419 |
| - "Deliver your audio response fast, but do not sound rushed. Do not modify the content of your response, only increase speaking speed for the same response.\n", |
| 427 | + "- Deliver your audio response fast, but do not sound rushed.\n", |
| 428 | + "- Do not modify the content of your response, only increase speaking speed for the same response.\n", |
420 | 429 | "```"
|
421 | 430 | ]
|
422 | 431 | },
|
|
430 | 439 | },
|
431 | 440 | {
|
432 | 441 | "cell_type": "code",
|
433 |
| - "execution_count": null, |
| 442 | + "execution_count": 15, |
434 | 443 | "id": "9754730f",
|
435 | 444 | "metadata": {},
|
436 | 445 | "outputs": [
|
|
454 | 463 | }
|
455 | 464 | ],
|
456 | 465 | "source": [
|
457 |
| - "import soundfile as sf\n", |
458 |
| - "\n", |
459 |
| - "# Load AIFC\n", |
460 |
| - "data, samplerate = sf.read(\"/Users/minh.hoque/work/github/openai-cookbook-internal/examples/data/audio/pace_06.wav.aif\")\n", |
461 |
| - "\n", |
462 |
| - "# Save as WAV for better browser compatibility\n", |
463 |
| - "sf.write(\"/Users/minh.hoque/work/github/openai-cookbook-internal/examples/data/audio/pace_06.wav\", data, samplerate)\n", |
464 |
| - "\n", |
465 | 466 | "Audio(\"./data/audio/pace_06.wav\")"
|
466 | 467 | ]
|
467 | 468 | },
|
|
534 | 535 | "Friendly, calm and approachable expert customer service assistant.\n",
|
535 | 536 | "\n",
|
536 | 537 | "## Tone\n",
|
537 |
| - "Tone: Warm, concise, confident, never fawning.\n", |
| 538 | + "Warm, concise, confident, never fawning.\n", |
538 | 539 | "\n",
|
539 | 540 | "## Length\n",
|
540 | 541 | "2–3 sentences per turn.\n",
|
541 | 542 | "\n",
|
542 | 543 | "## Language\n",
|
543 |
| - "The conversation will only be in English. Do not respond in any other language even if the user asks for it, only respond in english.\n", |
| 544 | + "- The conversation will be only in English.\n", |
| 545 | + "- Do not respond in any other language even if the user asks.\n", |
| 546 | + "- If the user speaks another language, politely explain that support is limited to English.\n", |
544 | 547 | "```"
|
545 | 548 | ]
|
546 | 549 | },
|
|
575 | 578 | "Friendly, calm and approachable expert voice tutor.\n",
|
576 | 579 | "\n",
|
577 | 580 | "## Tone\n",
|
578 |
| - "Tone: Warm, concise, confident, never fawning.\n", |
| 581 | + "Warm, concise, confident, never fawning.\n", |
579 | 582 | "\n",
|
580 | 583 | "## Length\n",
|
581 | 584 | "2–3 sentences per turn.\n",
|
|
629 | 632 | "Friendly, calm and approachable expert customer service assistant.\n",
|
630 | 633 | "\n",
|
631 | 634 | "## Tone\n",
|
632 |
| - "Tone: Warm, concise, confident, never fawning.\n", |
| 635 | + "Warm, concise, confident, never fawning.\n", |
633 | 636 | "\n",
|
634 | 637 | "## Length\n",
|
635 | 638 | "2–3 sentences per turn.\n",
|
636 | 639 | "\n",
|
637 | 640 | "## Language\n",
|
638 |
| - "The conversation will only be in English. Do not respond in any other language even if the user asks for it, only respond in english.\n", |
| 641 | + "- The conversation will be only in English.\n", |
| 642 | + "- Do not respond in any other language even if the user asks.\n", |
| 643 | + "- If the user speaks another language, politely explain that support is limited to English.\n", |
639 | 644 | "\n",
|
640 | 645 | "## Variety\n",
|
641 |
| - "Variety: Do not repeat the same sentence twice. Vary your responses so it doesn't sound robotic.\n", |
| 646 | + "- Do not repeat the same sentence twice.\n", |
| 647 | + "- Vary your responses so they don't sound robotic.\n", |
642 | 648 | "```"
|
643 | 649 | ]
|
644 | 650 | },
|
|
798 | 804 | "### Example\n",
|
799 | 805 | "```\n",
|
800 | 806 | "# Instructions/Rules\n",
|
801 |
| - "- When reading numbers or codes, speak each character separately, separated by hyphens (e.g., 4-1-5). Repeat EXACTLY the provided number, do not forget any.\n", |
| 807 | + "- When reading numbers or codes, speak each character separately, separated by hyphens (e.g., 4-1-5).\n", |
| 808 | + "- Repeat EXACTLY the provided number; do not omit any characters.\n", |
802 | 809 | "```"
|
803 | 810 | ]
|
804 | 811 | },
|
|
943 | 950 | "...\n",
|
944 | 951 | "\n",
|
945 | 952 | "\n",
|
946 |
| - "## Unclear audio\n", |
947 |
| - "- Always respond in the same language the user is speaking in, if intelligible. (optional)\n", |
948 |
| - "- Only respond to clear speech or text.\n", |
949 |
| - "- If the user's audio is not clear (e.g. background noise/inaudible/silent/unintelligible) or if you did not fully hear or understand the user, ask for clarification using English phrases such as “I didn’t catch that—mind repeating?”. Vary the phrases.'\n", |
950 |
| - "\n", |
951 |
| - "\n", |
952 |
| - "## Preferred Response Language\n", |
953 |
| - "- Always respond in the same language the user is speaking. This is the preferred language for the session.\n", |
954 |
| - "- If you cannot clearly determine the user's language (e.g., due to background noise, inaudible, silent, unintelligible, or ambiguous input), do not guess or fabricate. Instead, politely ask the user to repeat or clarify.\n", |
955 |
| - "- If you cannot determine the user's preferred language, ask for clarification in English.\n", |
| 953 | + "## Unclear audio \n", |
| 954 | + "- Always respond in the same language the user is speaking in, if intelligible.\n", |
| 955 | + "- Only respond to clear audio or text. \n", |
| 956 | + "- If the user's audio is not clear (e.g. ambiguous input/background noise/silent/unintelligible) or if you did not fully hear or understand the user, ask for clarification using {preferred_language} phrases.\n", |
956 | 957 | "```"
|
957 | 958 | ]
|
958 | 959 | },
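The `{preferred_language}` placeholder in the "Unclear audio" rules has to be filled in before the prompt is sent. A minimal sketch, assuming the template is kept as a Python string and the hypothetical helper `render_unclear_audio_rules` is called once per session:

```python
# Template text copied from the "Unclear audio" section above.
UNCLEAR_AUDIO_TEMPLATE = (
    "## Unclear audio\n"
    "- Always respond in the same language the user is speaking in, if intelligible.\n"
    "- Only respond to clear audio or text.\n"
    "- If the user's audio is not clear (e.g. ambiguous input/background noise/"
    "silent/unintelligible) or if you did not fully hear or understand the user, "
    "ask for clarification using {preferred_language} phrases."
)


def render_unclear_audio_rules(preferred_language: str) -> str:
    """Fill the placeholder so the model never sees a raw '{preferred_language}'."""
    return UNCLEAR_AUDIO_TEMPLATE.format(preferred_language=preferred_language)


rules = render_unclear_audio_rules("Spanish")
print(rules)
```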
|
|
1030 | 1031 | "...\n",
|
1031 | 1032 | "```\n",
|
1032 | 1033 | "\n",
|
1033 |
| - "We need to ensure the tool list has the same availability tools and do not contradict each other:\n", |
| 1034 | + "We need to ensure the tool list contains the same available tools and that **the descriptions do not contradict each other**:\n", |
1034 | 1035 | "\n",
|
1035 | 1036 | "```json\n",
|
1036 | 1037 | "[\n",
|
|
1104 | 1105 | "id": "a752a4a6",
|
1105 | 1106 | "metadata": {},
|
1106 | 1107 | "source": [
|
1107 |
| - "### Example\n", |
| 1108 | + "#### Example\n", |
1108 | 1109 | "```python\n",
|
1109 | 1110 | "tools = [\n",
|
1110 | 1111 | " {\n",
|
|
1182 | 1183 | "id": "30ccc1d9",
|
1183 | 1184 | "metadata": {},
|
1184 | 1185 | "source": [
|
1185 |
| - "*Note: If you notice the model disregarding instructions or constraints about your tool call, there’s a chance this might be due to the fact you are asking it to be very proactive.*" |
| 1186 | + "*Tip: If a tool call can fail unpredictably, add clear failure-handling instructions so the model responds gracefully.*" |
1186 | 1187 | ]
|
1187 | 1188 | },
|
1188 | 1189 | {
|
|
1232 | 1233 | "```"
|
1233 | 1234 | ]
|
1234 | 1235 | },
|
| 1236 | + { |
| 1237 | + "cell_type": "markdown", |
| 1238 | + "id": "24579f54", |
| 1239 | + "metadata": {}, |
| 1240 | + "source": [ |
| 1241 | + "*Tip: If your tool call can fail in unpredictable ways, try adding instructions on how to handle tool call output failures so that the model can respond gracefully.*" |
| 1242 | + ] |
| 1243 | + }, |
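On the application side, one way to support this tip is to return failures to the model as structured tool output rather than letting the exception escape. The sketch below is an assumption about how an app might do that: `run_tool_safely` and `lookup_order` are hypothetical names, and the `status`/`message` fields are an illustrative convention, not part of any API.

```python
import json


def run_tool_safely(tool_fn, **arguments) -> str:
    """Call a tool and always return a JSON string the model can read."""
    try:
        result = tool_fn(**arguments)
        return json.dumps({"status": "ok", "result": result})
    except Exception as exc:
        # Surface the failure as data so the prompt's failure-handling
        # instructions have something concrete to react to.
        return json.dumps({"status": "error", "message": str(exc)})


def lookup_order(order_id: str) -> dict:
    """Hypothetical tool that fails, to demonstrate the error path."""
    raise TimeoutError("order service timed out")


output = run_tool_safely(lookup_order, order_id="A123")
print(output)
```

The prompt can then refer to the same convention, for example by telling the model what to say when a tool output has `status` set to `error`.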
1235 | 1244 | {
|
1236 | 1245 | "cell_type": "markdown",
|
1237 | 1246 | "id": "edebafe2",
|
|
1610 | 1619 | "As use case complexity grows, you need maintainable structure without overloading the model.\n",
|
1611 | 1620 | "\n",
|
1612 | 1621 | "Two patterns for complex scenarios: \n",
|
1613 |
| - "1. State machine conversation flow\n", |
1614 |
| - "2. Dynamic flow via session updates.\n" |
| 1622 | + "1. Conversation Flow as State Machine\n", |
| 1623 | + "2. Dynamic Conversation Flow via session.update\n" |
1615 | 1624 | ]
|
1616 | 1625 | },
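As a rough sketch of the second pattern, the application can swap the instructions whenever the conversation enters a new state by sending a `session.update` event. The state names and prompt text below are hypothetical, and the payload shows only the `instructions` field; the exact set of required session fields depends on the API version, so check the Realtime API reference before relying on this shape.

```python
import json

# Hypothetical per-state prompts; a real app would load fuller instructions.
STATE_INSTRUCTIONS = {
    "greeting": "Greet the caller and ask how you can help.",
    "verify_identity": "Collect the caller's phone number and confirm it back.",
}


def make_session_update(state: str) -> str:
    """Build the JSON text of a session.update event for the given state."""
    event = {
        "type": "session.update",
        "session": {"instructions": STATE_INSTRUCTIONS[state]},
    }
    return json.dumps(event)


payload = make_session_update("verify_identity")
print(payload)  # The app would send this string over its Realtime connection.
```

Sending only the prompt for the current state keeps the active instructions short, which is the point of this pattern.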
|
1617 | 1626 | {
|
|
1628 | 1637 | "id": "b77ddc02",
|
1629 | 1638 | "metadata": {},
|
1630 | 1639 | "source": [
|
1631 |
| - "### Example\n", |
| 1640 | + "#### Example\n", |
1632 | 1641 | "```json\n",
|
1633 | 1642 | "# Conversation States\n",
|
1634 | 1643 | "[\n",
|
|
1824 | 1833 | "### Example\n",
|
1825 | 1834 | "```\n",
|
1826 | 1835 | "# Safety & Escalation\n",
|
1827 |
| - "**When to escalate (no extra troubleshooting):**\n", |
1828 |
| - "- Safety risk (self-harm, threats, harassment)\n", |
1829 |
| - "- User explicitly asks for a human\n", |
1830 |
| - "- Severe dissatisfaction (e.g., “extremely frustrated,” repeated complaints, profanity)\n", |
1831 |
| - "- **2** failed tool attempts on the same task **or** **3** consecutive no-match/no-input events\n", |
1832 |
| - "- Out-of-scope or restricted (e.g., real-time news, financial/legal/medical advice)\n", |
1833 |
| - "\n", |
1834 |
| - "\n", |
1835 |
| - "**Examples of what to say (Mandatory phrase before handoff):**\n", |
1836 |
| - "- “I'm sorry for the trouble — I'm transferring you to a specialist now. **.”\n", |
1837 |
| - "\n", |
| 1836 | + "- When to escalate (no extra troubleshooting):\n", |
| 1837 | + " - Safety risk (self-harm, threats, harassment)\n", |
| 1838 | + " - User explicitly asks for a human\n", |
| 1839 | + " - Severe dissatisfaction (e.g., “extremely frustrated,” repeated complaints, profanity)\n", |
| 1840 | + " - **2** failed tool attempts on the same task **or** **3** consecutive no-match/no-input events\n", |
| 1841 | + " - Out-of-scope or restricted (e.g., real-time news, financial/legal/medical advice)\n", |
1838 | 1842 | "\n",
|
1839 |
| - "**Then call the tool:** `escalate_to_human`\n", |
| 1843 | + "- Examples of what to say (Mandatory phrase before handoff):\n", |
| 1844 | + "  - “I'm sorry for the trouble — I'm transferring you to a specialist now.”\n", |
1840 | 1845 | "\n",
|
| 1846 | + "- Then call the tool: `escalate_to_human`\n", |
1841 | 1847 | "\n",
|
1842 |
| - "Examples that would require escalation:\n", |
1843 |
| - "- “This is the third time the reset didn’t work. Just get me a person.”\n", |
1844 |
| - "- “I am extremely frustrated!”\n", |
| 1848 | + "- Examples that would require escalation:\n", |
| 1849 | + " - “This is the third time the reset didn’t work. Just get me a person.”\n", |
| 1850 | + " - “I am extremely frustrated!”\n", |
1845 | 1851 | "```"
|
1846 | 1852 | ]
|
1847 | 1853 | },
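The numeric thresholds in the escalation rules (2 failed tool attempts on the same task, or 3 consecutive no-match/no-input events) can also be tracked in application code as a backstop to the prompt. This is a minimal sketch under that assumption; the `EscalationTracker` class and its method names are hypothetical and not part of the guide's prompt.

```python
from dataclasses import dataclass, field


@dataclass
class EscalationTracker:
    """Counts failures so the app knows when the thresholds are reached."""

    max_tool_failures: int = 2   # failed tool attempts on the same task
    max_no_match: int = 3        # consecutive no-match/no-input events
    tool_failures: dict = field(default_factory=dict)
    consecutive_no_match: int = 0

    def record_tool_failure(self, task: str) -> None:
        self.tool_failures[task] = self.tool_failures.get(task, 0) + 1

    def record_no_match(self) -> None:
        self.consecutive_no_match += 1

    def record_valid_input(self) -> None:
        # Any understood turn breaks the "consecutive" streak.
        self.consecutive_no_match = 0

    def should_escalate(self) -> bool:
        too_many_failures = any(
            count >= self.max_tool_failures for count in self.tool_failures.values()
        )
        return too_many_failures or self.consecutive_no_match >= self.max_no_match


tracker = EscalationTracker()
tracker.record_tool_failure("reset_password")
tracker.record_tool_failure("reset_password")
print(tracker.should_escalate())
```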
|
|