---
title: OpenAI Realtime
subtitle: Build voice assistants with OpenAI's native speech-to-speech models for ultra-low latency conversations
slug: openai-realtime
---

## Overview

OpenAI’s Realtime API enables developers to use a native speech-to-speech model. Unlike other Vapi configurations, which orchestrate a transcriber, a model, and a voice API to simulate speech-to-speech, OpenAI’s Realtime API processes audio in and audio out natively.

**In this guide, you'll learn to:**
- Choose the right realtime model for your use case
- Configure voice assistants with realtime capabilities
- Implement best practices for production deployments
- Optimize prompts specifically for realtime models

## Available models

<Tip>
  The `gpt-realtime-2025-08-28` model is production-ready.
</Tip>

OpenAI offers three realtime models, each with different capabilities and cost/performance trade-offs:

| Model | Status | Best For | Key Features |
|-------|--------|----------|--------------|
| `gpt-realtime-2025-08-28` | **Production** | Production workloads | Structured outputs, complex function orchestration, highest voice quality |
| `gpt-4o-realtime-preview-2024-12-17` | Preview | Development & testing | Balanced performance/cost |
| `gpt-4o-mini-realtime-preview-2024-12-17` | Preview | Cost-sensitive apps | Lower latency, reduced cost |

## Voice options

Realtime models support a specific set of OpenAI voices optimized for speech-to-speech:

<CardGroup cols={2}>
  <Card title="Standard Voices" icon="microphone">
    Available across all realtime models:
    - `alloy` - Neutral and balanced
    - `echo` - Warm and engaging
    - `shimmer` - Energetic and expressive
  </Card>
  <Card title="Realtime-Exclusive Voices" icon="sparkles">
    Only available with realtime models:
    - `marin` - Professional and clear
    - `cedar` - Natural and conversational
  </Card>
</CardGroup>

<Warning>
  The following voices are **NOT** supported by realtime models: `ash`, `ballad`, `coral`, `fable`, `onyx`, and `nova`.
</Warning>
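
If you set voices programmatically, it can help to validate the voice ID before creating the assistant. Here's a minimal TypeScript sketch; the `REALTIME_VOICES` list and `assertRealtimeVoice` helper are illustrative, not part of any SDK:

```typescript
// Voices supported by realtime models, per the cards and warning above.
const REALTIME_VOICES = ["alloy", "echo", "shimmer", "marin", "cedar"] as const;
type RealtimeVoice = (typeof REALTIME_VOICES)[number];

// Fail fast at configuration time rather than at call time.
function assertRealtimeVoice(voiceId: string): asserts voiceId is RealtimeVoice {
  if (!REALTIME_VOICES.includes(voiceId as RealtimeVoice)) {
    throw new Error(
      `Voice "${voiceId}" is not realtime-compatible. ` +
        `Use one of: ${REALTIME_VOICES.join(", ")}`
    );
  }
}

assertRealtimeVoice("marin"); // passes
// assertRealtimeVoice("onyx"); // throws: listed as unsupported above
```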

## Configuration

### Basic setup

Configure a realtime assistant with function calling:

<CodeBlocks>
```json title="Assistant Configuration"
{
  "model": {
    "provider": "openai",
    "model": "gpt-realtime-2025-08-28",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant. Be concise and friendly."
      }
    ],
    "temperature": 0.7,
    "maxTokens": 250,
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "getWeather",
          "description": "Get the current weather",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "The city name"
              }
            },
            "required": ["location"]
          }
        }
      }
    ]
  },
  "voice": {
    "provider": "openai",
    "voiceId": "alloy"
  }
}
```
```typescript title="TypeScript SDK"
import { VapiClient } from '@vapi-ai/server-sdk';

const vapi = new VapiClient({ token: process.env.VAPI_API_KEY });

const assistant = await vapi.assistants.create({
  model: {
    provider: "openai",
    model: "gpt-realtime-2025-08-28",
    messages: [{
      role: "system",
      content: "You are a helpful assistant. Be concise and friendly."
    }],
    temperature: 0.7,
    maxTokens: 250,
    tools: [{
      type: "function",
      function: {
        name: "getWeather",
        description: "Get the current weather",
        parameters: {
          type: "object",
          properties: {
            location: {
              type: "string",
              description: "The city name"
            }
          },
          required: ["location"]
        }
      }
    }]
  },
  voice: {
    provider: "openai",
    voiceId: "alloy"
  }
});
```
```python title="Python SDK"
import os

from vapi import Vapi

vapi = Vapi(token=os.getenv("VAPI_API_KEY"))

assistant = vapi.assistants.create(
    model={
        "provider": "openai",
        "model": "gpt-realtime-2025-08-28",
        "messages": [{
            "role": "system",
            "content": "You are a helpful assistant. Be concise and friendly."
        }],
        "temperature": 0.7,
        "maxTokens": 250,
        "tools": [{
            "type": "function",
            "function": {
                "name": "getWeather",
                "description": "Get the current weather",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city name"
                        }
                    },
                    "required": ["location"]
                }
            }
        }]
    },
    voice={
        "provider": "openai",
        "voiceId": "alloy"
    }
)
```
</CodeBlocks>
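
With the assistant created, you can place a call to it. A hedged sketch using the TypeScript client from above; the phone number ID and customer number are placeholders you'd replace with your own:

```typescript
// Start an outbound phone call handled by the realtime assistant.
// `vapi` and `assistant` come from the TypeScript example above;
// the IDs below are placeholders, not real values.
const call = await vapi.calls.create({
  assistantId: assistant.id,
  phoneNumberId: "YOUR_PHONE_NUMBER_ID",
  customer: { number: "+15551234567" },
});

console.log(`Call started: ${call.id}`);
```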

### Using realtime-exclusive voices

To use the enhanced `marin` or `cedar` voices, which are only available with realtime models:

```json
{
  "voice": {
    "provider": "openai",
    "voiceId": "marin"
  }
}
```
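
The same change can be applied to an assistant that already exists. A sketch assuming the server SDK's update method; verify the exact call against the current SDK reference:

```typescript
// Switch a previously created assistant to a realtime-exclusive voice.
// Assumes `vapi` and `assistant` from the basic setup above.
await vapi.assistants.update(assistant.id, {
  voice: {
    provider: "openai",
    voiceId: "cedar",
  },
});
```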

### Handling instructions

<Info>
  Unlike traditional OpenAI models, realtime models receive instructions through the session configuration. Vapi automatically converts your system messages to session instructions during WebSocket initialization.
</Info>

The system message in your model configuration is automatically optimized for realtime processing:

1. System messages are converted to session instructions
2. Instructions are sent during WebSocket session initialization
3. The instructions field supports the same prompting strategies as system messages
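
For intuition, the payload Vapi derives from your assistant is roughly a Realtime API `session.update` event. This sketch is illustrative only, since Vapi manages the WebSocket for you:

```typescript
// Approximate shape of the session configuration Vapi sends on your behalf.
// You never send this yourself when using Vapi.
const sessionUpdate = {
  type: "session.update",
  session: {
    // Your system message becomes the session instructions.
    instructions: "You are a helpful assistant. Be concise and friendly.",
    voice: "alloy",
  },
};

// Over a raw Realtime WebSocket this would be:
// ws.send(JSON.stringify(sessionUpdate));
```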

## Prompting best practices

<Note>
  Realtime models benefit from different prompting techniques than text-based models. These guidelines are based on [OpenAI's official prompting guide](https://cookbook.openai.com/examples/realtime_prompting_guide).
</Note>

### General tips

- **Iterate relentlessly**: Small wording changes can significantly impact behavior
- **Use bullet points over paragraphs**: Clear, short bullets outperform long text blocks
- **Guide with examples**: The model closely follows sample phrases you provide
- **Be precise**: Ambiguity or conflicting instructions degrade performance
- **Control language**: Pin output to a target language to prevent unwanted switching
- **Reduce repetition**: Add variety rules to avoid robotic phrasing
- **Capitalize for emphasis**: Use CAPS for key rules to make them stand out

### Prompt structure

Organize your prompts with clear sections for better model comprehension:

```
# Role & Objective
You are a customer service agent for Acme Corp. Your goal is to resolve issues quickly.

# Personality & Tone
- Friendly, professional, and empathetic
- Speak naturally at a moderate pace
- Keep responses to 2-3 sentences

# Instructions
- Greet callers warmly
- Ask clarifying questions before offering solutions
- Always confirm understanding before proceeding

# Tools
Use the available tools to look up account information and process requests.

# Safety
If a caller becomes aggressive or requests something outside your scope,
politely offer to transfer them to a specialist.
```

### Realtime-specific techniques

<Tabs>
  <Tab title="Speaking Speed">
    Control the model's speaking pace with explicit instructions:

    ```
    ## Pacing
    - Deliver responses at a natural, conversational speed
    - Do not rush through information
    - Pause briefly between key points
    ```
  </Tab>
  <Tab title="Personality">
    Realtime models excel at maintaining consistent personality:

    ```
    ## Personality
    - Warm and approachable like a trusted advisor
    - Professional but not robotic
    - Show genuine interest in helping
    ```
  </Tab>
  <Tab title="Conversation Flow">
    Guide natural conversation progression:

    ```
    ## Conversation Flow
    1. Greeting: Welcome caller and ask how you can help
    2. Discovery: Understand their specific needs
    3. Solution: Offer the best available option
    4. Confirmation: Ensure they're satisfied before ending
    ```
  </Tab>
</Tabs>

## Migration guide

Transitioning from standard STT/TTS to realtime models:

<Steps>
  <Step title="Update your model configuration">
    Change your model (for example, from `gpt-4`) to one of the realtime options:
    ```json
    {
      "model": {
        "provider": "openai",
        "model": "gpt-realtime-2025-08-28"
      }
    }
    ```
  </Step>

  <Step title="Verify voice compatibility">
    Ensure your selected voice is supported (`alloy`, `echo`, `shimmer`, `marin`, or `cedar`)
  </Step>

  <Step title="Remove transcriber configuration">
    Realtime models handle speech-to-speech natively, so transcriber settings are not needed
  </Step>

  <Step title="Test function calling">
    Your existing function configurations work unchanged with realtime models
  </Step>

  <Step title="Optimize your prompts">
    Apply realtime-specific prompting techniques for best results
  </Step>
</Steps>
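
Put together, the before/after looks roughly like this. The `transcriber` block below is a representative STT/TTS setup, not a required starting point:

```typescript
// Before: an orchestrated pipeline with an explicit transcriber
// (the Deepgram settings are just a representative example).
const pipelineConfig = {
  transcriber: { provider: "deepgram", model: "nova-2" }, // drop this
  model: { provider: "openai", model: "gpt-4" },
  voice: { provider: "openai", voiceId: "alloy" },
};

// After: the realtime model consumes and produces audio directly,
// so the transcriber block is removed entirely.
const realtimeConfig = {
  model: { provider: "openai", model: "gpt-realtime-2025-08-28" },
  voice: { provider: "openai", voiceId: "alloy" }, // must be realtime-compatible
};
```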

## Best practices

### Model selection strategy

<AccordionGroup>
  <Accordion title="When to use gpt-realtime-2025-08-28">
    **Best for production workloads requiring:**
    - Structured outputs for form filling or data collection
    - Complex function orchestration
    - Highest quality voice interactions
    - Responses API integration
  </Accordion>

  <Accordion title="When to use gpt-4o-realtime-preview">
    **Best for development and testing:**
    - Prototyping voice applications
    - Balanced cost/performance during development
    - Testing conversation flows before production
  </Accordion>

  <Accordion title="When to use gpt-4o-mini-realtime-preview">
    **Best for cost-sensitive applications:**
    - High-volume voice interactions
    - Simple Q&A or routing scenarios
    - Applications where latency is critical
  </Accordion>
</AccordionGroup>

### Performance optimization

- **Temperature settings**: Use 0.5-0.7 for consistent yet natural responses
- **Max tokens**: Set appropriate limits (200-300) for conversational responses
- **Voice selection**: Test different voices to match your brand personality
- **Function design**: Keep function schemas simple for faster execution
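
As a starting point, those ranges translate into model settings like the following; the specific values are suggestions to tune, not requirements:

```typescript
// Conversational defaults within the ranges suggested above.
const modelSettings = {
  temperature: 0.6, // 0.5-0.7: consistent yet natural
  maxTokens: 250,   // 200-300: keeps spoken replies short
};
```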

### Error handling

Handle edge cases gracefully:

```json
{
  "messages": [{
    "role": "system",
    "content": "If you don't understand the user, politely ask them to repeat. Never make assumptions about unclear requests."
  }]
}
```

## Current limitations

<Warning>
  Be aware of these limitations when implementing realtime models:
</Warning>

- **Knowledge Bases** are not currently supported with the Realtime API
- **Endpointing and interruption** models are managed by Vapi's orchestration layer
- **Custom voice cloning** is not available for realtime models
- **Some OpenAI voices** (`ash`, `ballad`, `coral`, `fable`, `onyx`, `nova`) are incompatible
- **Transcripts** may differ slightly from traditional STT output

## Additional resources

- [OpenAI Realtime Documentation](https://platform.openai.com/docs/guides/realtime)
- [Realtime Prompting Guide](https://platform.openai.com/docs/guides/realtime-models-prompting)
- [Prompting Cookbook](https://cookbook.openai.com/examples/realtime_prompting_guide)
- [Vapi Discord Community](https://discord.com/invite/pUFNcf2WmH)

## Next steps

Now that you understand OpenAI Realtime models:

- **[Phone Calling Guide](/phone-calling):** Set up inbound and outbound calling
- **[Assistant Hooks](/assistants/assistant-hooks):** Add custom logic to your conversations
- **[Voice Providers](/providers/voice/openai):** Explore other voice options