articles/ai-services/openai/how-to/realtime-audio.md (90 additions, 9 deletions)
@@ -22,7 +22,7 @@ Most users of the Realtime API need to deliver and receive audio from an end-use
 
 ## Supported models
 
-The GPT 4o realtime models are available for global deployments in [East US 2 and Sweden Central regions](../concepts/models.md#global-standard-model-availability).
+The GPT-4o real-time models are available for global deployments in [East US 2 and Sweden Central regions](../concepts/models.md#global-standard-model-availability).
 
 - `gpt-4o-realtime-preview` (2024-12-17)
 - `gpt-4o-realtime-preview` (2024-10-01)
 
@@ -116,7 +116,7 @@ Often, the first event sent by the caller on a newly established `/realtime` ses
 
 The [`session.update`](../realtime-audio-reference.md#realtimeclienteventsessionupdate) event can be used to configure the following aspects of the session:
 
 - Transcription of user input audio is opted into via the session's `input_audio_transcription` property. Specifying a transcription model (`whisper-1`) in this configuration enables the delivery of [`conversation.item.audio_transcription.completed`](../realtime-audio-reference.md#realtimeservereventconversationiteminputaudiotranscriptioncompleted) events.
-- Turn handling is controlled by the `turn_detection` property. This propertycan be set to `none` or `server_vad` as described in the [input audio buffer and turn handling](#input-audio-buffer-and-turn-handling) section.
+- Turn handling is controlled by the `turn_detection` property. This property's type can be set to `none` or `server_vad` as described in the [voice activity detection (VAD) and the audio buffer](#voice-activity-detection-vad-and-the-audio-buffer) section.
 - Tools can be configured to enable the server to call out to external services or functions to enrich the conversation. Tools are defined as part of the `tools` property in the session configuration.
 
 An example `session.update` that configures several aspects of the session, including tools, follows. All session parameters are optional and can be omitted if not needed.
@@ -135,7 +135,8 @@ An example `session.update` that configures several aspects of the session, incl
       "type": "server_vad",
       "threshold": 0.5,
       "prefix_padding_ms": 300,
-      "silence_duration_ms": 200
+      "silence_duration_ms": 200,
+      "create_response": true
     },
     "tools": []
   }
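
The fragment above shows only the changed portion of the `session.update` example. As a complementary sketch, a minimal `session.update` that opts into input audio transcription (the option called out in the earlier bullet list) might look like the following; treat the exact shape as illustrative:

```json
{
  "type": "session.update",
  "session": {
    "input_audio_transcription": {
      "model": "whisper-1"
    }
  }
}
```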
@@ -144,15 +145,75 @@ An example `session.update` that configures several aspects of the session, incl
 
 The server responds with a [`session.updated`](../realtime-audio-reference.md#realtimeservereventsessionupdated) event to confirm the session configuration.
 
-## Input audio buffer and turn handling
+## Out-of-band responses
+
+By default, responses generated during a session are added to the default conversation state. In some cases, you might want to generate responses outside the default conversation. This can be useful for generating multiple responses concurrently or for generating responses that don't affect the default conversation state. For example, you can limit the number of turns considered by the model when generating a response.
+
+You can create out-of-band responses by setting the [`response.conversation`](../realtime-audio-reference.md#realtimeresponseoptions) field to the string `none` when creating a response with the [`response.create`](../realtime-audio-reference.md#realtimeclienteventresponsecreate) client event.
 
-The server maintains an input audio buffer containing client-provided audio that has not yet been committed to the conversation state.
+In the same [`response.create`](../realtime-audio-reference.md#realtimeclienteventresponsecreate) client event, you can also set the [`response.metadata`](../realtime-audio-reference.md#realtimeresponseoptions) field to help you identify which response is being generated for this client-sent event.
+
+```json
+{
+  "type": "response.create",
+  "response": {
+    "conversation": "none",
+    "metadata": {
+      "topic": "world_capitals"
+    },
+    "modalities": ["text"],
+    "prompt": "What is the capital of France?"
+  }
+}
+```
+
+When the server responds with a [`response.done`](../realtime-audio-reference.md#realtimeservereventresponsedone) event, the response contains the metadata you provided. You can identify the corresponding response for the client-sent event via the `response.metadata` field.
+
+> [!IMPORTANT]
+> If you create any responses outside the default conversation, be sure to always check the `response.metadata` field to help you identify the corresponding response for the client-sent event. You should even check the `response.metadata` field for responses that are part of the default conversation. That way, you can ensure that you're handling the correct response for the client-sent event.
+
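
For illustration, a trimmed `response.done` payload for the out-of-band request above might look like the following sketch. The event ID, response ID, status, and output text are placeholders rather than captured output:

```json
{
  "type": "response.done",
  "event_id": "event_abc123",
  "response": {
    "object": "realtime.response",
    "id": "resp_abc123",
    "status": "completed",
    "metadata": {
      "topic": "world_capitals"
    },
    "output": [
      {
        "type": "message",
        "role": "assistant",
        "content": [
          {
            "type": "text",
            "text": "The capital of France is Paris."
          }
        ]
      }
    ]
  }
}
```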
+### Custom context for out-of-band responses
+
+You can also construct a custom context that the model uses outside of the session's default conversation. To create a response with custom context, set the `conversation` field to `none` and provide the custom context in the `input` array. The `input` array can contain new inputs or references to existing conversation items.
+
+```json
+{
+  "type": "response.create",
+  "response": {
+    "conversation": "none",
+    "modalities": ["text"],
+    "prompt": "What is the capital of France?",
+    "input": [
+      {
+        "type": "item_reference",
+        "id": "existing_conversation_item_id"
+      },
+      {
+        "type": "message",
+        "role": "user",
+        "content": [
+          {
+            "type": "input_text",
+            "text": "The capital of France is Paris."
+          }
+        ]
+      }
+    ]
+  }
+}
+```
+
+## Voice activity detection (VAD) and the audio buffer
+
+The server maintains an input audio buffer containing client-provided audio that hasn't yet been committed to the conversation state.
 
 One of the key [session-wide](#session-configuration) settings is `turn_detection`, which controls how data flow is handled between the caller and model. The `turn_detection` setting can be set to `none` or `server_vad` (to use [server-side voice activity detection](#server-decision-mode)).
 
+By default, voice activity detection (VAD) is enabled, and the server automatically generates responses when it detects the end of speech in the input audio buffer. You can change the behavior by setting the `turn_detection` property in the session configuration.
+
 ### Without server decision mode
 
-By default, the session is configured with the `turn_detection` type effectively set to `none`.
+By default, the session is configured with the `turn_detection` type effectively set to `none`. Voice activity detection (VAD) is disabled, and the server doesn't automatically generate responses when it detects the end of speech in the input audio buffer.
 
 The session relies on caller-initiated [`input_audio_buffer.commit`](../realtime-audio-reference.md#realtimeclienteventinputaudiobuffercommit) and [`response.create`](../realtime-audio-reference.md#realtimeclienteventresponsecreate) events to progress conversations and produce output. This setting is useful for push-to-talk applications or situations that have external audio flow control (such as caller-side VAD component). These manual signals can still be used in `server_vad` mode to supplement VAD-initiated response generation.
 
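As a sketch of this manual flow, a push-to-talk client might send the following client events when the user releases the talk button. Each JSON object is a separate event message, and the audio payload is a placeholder:

```json
{ "type": "input_audio_buffer.append", "audio": "<base64-encoded audio chunk>" }
{ "type": "input_audio_buffer.commit" }
{ "type": "response.create" }
```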
@@ -177,7 +238,9 @@ sequenceDiagram
 
 ### Server decision mode
 
-The session can be configured with the `turn_detection` type set to `server_vad`. In this case, the server evaluates user audio from the client (as sent via [`input_audio_buffer.append`](../realtime-audio-reference.md#realtimeclienteventinputaudiobufferappend)) using a voice activity detection (VAD) component. The server automatically uses that audio to initiate response generation on applicable conversations when an end of speech is detected. Silence detection for the VAD can be configured when specifying `server_vad` detection mode.
+You can configure the session to use server-side voice activity detection (VAD). Set the `turn_detection` type to `server_vad` to enable VAD.
+
+In this case, the server evaluates user audio from the client (as sent via [`input_audio_buffer.append`](../realtime-audio-reference.md#realtimeclienteventinputaudiobufferappend)) using a voice activity detection (VAD) component. The server automatically uses that audio to initiate response generation on applicable conversations when an end of speech is detected. Silence detection for the VAD can also be configured when specifying `server_vad` detection mode.
 
 - The server sends the [`input_audio_buffer.speech_started`](../realtime-audio-reference.md#realtimeservereventinputaudiobufferspeechstarted) event when it detects the start of speech.
 - At any time, the client can optionally append audio to the buffer by sending the [`input_audio_buffer.append`](../realtime-audio-reference.md#realtimeclienteventinputaudiobufferappend) event.
@@ -201,9 +264,27 @@ sequenceDiagram
     Server->>Client: conversation.item.created
 -->
 
+### VAD without automatic response generation
+
+You can use server-side voice activity detection (VAD) without automatic response generation. This approach can be useful when you want to implement some degree of moderation.
+
+Set [`turn_detection.create_response`](../realtime-audio-reference.md#realtimeturndetection) to `false` via the [session.update](../realtime-audio-reference.md#realtimeclienteventsessionupdate) event. VAD detects the end of speech but the server doesn't generate a response until you send a [`response.create`](../realtime-audio-reference.md#realtimeclienteventresponsecreate) event.
+
+```json
+{
+  "turn_detection": {
+    "type": "server_vad",
+    "threshold": 0.5,
+    "prefix_padding_ms": 300,
+    "silence_duration_ms": 200,
+    "create_response": false
+  }
+}
+```
+
 ## Conversation and response generation
 
-The Realtime API is designed to handle real-time, low-latency conversational interactions. The API is built on a series of events that allow the client to send and receive messages, control the flow of the conversation, and manage the state of the session.
+The GPT-4o real-time audio models are designed for real-time, low-latency conversational interactions. The API is built on a series of events that allow the client to send and receive messages, control the flow of the conversation, and manage the state of the session.
 
 ### Conversation sequence and items
 
@@ -256,7 +337,7 @@ A user might want to interrupt the assistant's response or ask the assistant to
 
 Here's an example of the event sequence for a simple text-in, audio-out conversation:
 
-When you connect to the `/realtime` endpoint, the server responds with a [`session.created`](../realtime-audio-reference.md#realtimeservereventsessioncreated) event.
+When you connect to the `/realtime` endpoint, the server responds with a [`session.created`](../realtime-audio-reference.md#realtimeservereventsessioncreated) event. The maximum session duration is 30 minutes.
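
For illustration, a trimmed `session.created` payload might look like the following sketch. The exact fields vary by API version, and the IDs shown here are placeholders:

```json
{
  "type": "session.created",
  "event_id": "event_abc123",
  "session": {
    "id": "sess_abc123",
    "object": "realtime.session",
    "model": "gpt-4o-realtime-preview",
    "modalities": ["text", "audio"],
    "voice": "alloy",
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16"
  }
}
```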
articles/ai-services/openai/realtime-audio-reference.md (30 additions, 4 deletions)
@@ -1080,7 +1080,7 @@ The server `session.updated` event is returned when a session is updated by the
 | Field | Type | Description |
 |-------|------|-------------|
 | type |[RealtimeClientEventType](#realtimeclienteventtype)| The type of the client event. |
-| event_id | string | The unique ID of the event. The ID can be specified by the client to help identify the event. |
+| event_id | string | The unique ID of the event. The client can specify the ID to help identify the event. |
 
 ### RealtimeClientEventType
 
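For example, a client can tag an event it sends so the event is easier to correlate later (the `event_id` value is a placeholder):

```json
{
  "type": "input_audio_buffer.commit",
  "event_id": "event_abc123"
}
```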
@@ -1100,7 +1100,11 @@ The server `session.updated` event is returned when a session is updated by the
 
 | Field | Type | Description |
 |-------|------|-------------|
-| type |[RealtimeContentPartType](#realtimecontentparttype)| The type of the content part. |
+| type |[RealtimeContentPartType](#realtimecontentparttype)| The content type.<br><br>Allowed values: `input_text`, `input_audio`, `item_reference`, `text`. |
+| text | string | The text content. This property is applicable for the `input_text` and `text` content types. |
+| id | string | The ID of a previous conversation item to reference in both client- and server-created items. This property is applicable for the `item_reference` content type in `response.create` events. |
+| audio | string | The base64-encoded audio bytes. This property is applicable for the `input_audio` content type. |
+| transcript | string | The transcript of the audio. This property is applicable for the `input_audio` content type. |
 
 ### RealtimeContentPartType
 
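For illustration, here's one content part of each type described in the table. Each JSON object is a separate content part, and the text, audio bytes, and item ID are placeholders:

```json
{ "type": "input_text", "text": "What is the capital of France?" }
{ "type": "input_audio", "audio": "<base64-encoded audio bytes>", "transcript": "What is the capital of France?" }
{ "type": "item_reference", "id": "existing_conversation_item_id" }
```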
@@ -1115,14 +1119,29 @@ The server `session.updated` event is returned when a session is updated by the
 
 The item to add to the conversation.
 
+This table describes all `RealtimeConversationItem` properties. The properties that are applicable per event depend on the [RealtimeItemType](#realtimeitemtype).
+
+| Field | Type | Description |
+|-------|------|-------------|
+| id | string | The unique ID of the item. The client can specify the ID to help manage server-side context. If the client doesn't provide an ID, the server generates one. |
+| type |[RealtimeItemType](#realtimeitemtype)| The type of the item.<br><br>Allowed values: `message`, `function_call`, `function_call_output`|
+| object | string | The identifier for the API object being returned. The value is always `realtime.item`. |
+| status |[RealtimeItemStatus](#realtimeitemstatus)| The status of the item. This field doesn't affect the conversation, but it's accepted for consistency with the `conversation.item.created` event.<br><br>Allowed values: `completed`, `incomplete`|
+| role |[RealtimeMessageRole](#realtimemessagerole)| The role of the message sender. This property is only applicable for `message` items.<br><br>Allowed values: `system`, `user`, `assistant`|
+| content | array of [RealtimeContentPart](#realtimecontentpart)| The content of the message. This property is only applicable for `message` items.<br><br>- Message items of role `system` support only `input_text` content.<br>- Message items of role `user` support `input_text` and `input_audio` content.<br>- Message items of role `assistant` support `text` content. |
+| call_id | string | The ID of the function call (for `function_call` and `function_call_output` items). If passed on a `function_call_output` item, the server checks that a `function_call` item with the same ID exists in the conversation history. |
+| name | string | The name of the function being called (for `function_call` items). |
+| arguments | string | The arguments of the function call (for `function_call` items). |
+| output | string | The output of the function call (for `function_call_output` items). |
+
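For illustration, a `message` item with the shape described above might be sent in a [conversation.item.create](#realtimeclienteventconversationitemcreate) event. The item ID is a placeholder; omit it to let the server generate one:

```json
{
  "type": "conversation.item.create",
  "item": {
    "id": "item_abc123",
    "type": "message",
    "role": "user",
    "content": [
      {
        "type": "input_text",
        "text": "What is the capital of France?"
      }
    ]
  }
}
```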
 ### RealtimeConversationRequestItem
 
 You use the `RealtimeConversationRequestItem` object to create a new item in the conversation via the [conversation.item.create](#realtimeclienteventconversationitemcreate) event.
 
 | Field | Type | Description |
 |-------|------|-------------|
 | type |[RealtimeItemType](#realtimeitemtype)| The type of the item. |
-| id | string | The unique ID of the item. The ID can be specified by the client to help manage server-side context. If the client doesn't provide an ID, the server generates one. |
+| id | string | The unique ID of the item. The client can specify the ID to help manage server-side context. If the client doesn't provide an ID, the server generates one. |
 
 ### RealtimeConversationResponseItem
 
@@ -1138,7 +1157,7 @@ The `RealtimeConversationResponseItem` object represents an item in the conversa
 |-------|------|-------------|
 | object | string | The identifier for the returned API object.<br><br>Allowed values: `realtime.item`|
 | type |[RealtimeItemType](#realtimeitemtype)| The type of the item.<br><br>Allowed values: `message`, `function_call`, `function_call_output`|
-| id | string | The unique ID of the item. The ID can be specified by the client to help manage server-side context. If the client doesn't provide an ID, the server generates one.<br><br>This property is nullable. |
+| id | string | The unique ID of the item. The client can specify the ID to help manage server-side context. If the client doesn't provide an ID, the server generates one.<br><br>This property is nullable. |
 
 ### RealtimeFunctionTool
 
@@ -1333,6 +1352,9 @@ The response resource.
 | tool_choice |[RealtimeToolChoice](#realtimetoolchoice)| The tool choice for the session. |
 | temperature | number | The sampling temperature for the model. The allowed temperature values are limited to [0.6, 1.2]. Defaults to 0.8. |
 | max_response_output_tokens | integer or "inf" | The maximum number of output tokens per assistant response, inclusive of tool calls.<br><br>Specify an integer between 1 and 4096 to limit the output tokens. Otherwise, set the value to "inf" to allow the maximum number of tokens.<br><br>For example, to limit the output tokens to 1000, set `"max_response_output_tokens": 1000`. To allow the maximum number of tokens, set `"max_response_output_tokens": "inf"`.<br><br>Defaults to `"inf"`. |
+| conversation | string | Controls which conversation the response is added to. The supported values are `auto` and `none`.<br><br>The `auto` value (or not setting this property) ensures that the contents of the response are added to the session's default conversation.<br><br>Set this property to `none` to create an out-of-band response where items won't be added to the default conversation. For more information, see the [how-to guide](./how-to/realtime-audio.md#out-of-band-responses).<br><br>Defaults to `"auto"`. |
+| metadata | map | Set of up to 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format. Keys can be a maximum of 64 characters long, and values can be a maximum of 512 characters long.<br/><br/>For example: `metadata: { topic: "classification" }` |
+| input | array | Input items to include in the prompt for the model. Creates a new context for this response, without including the default conversation. Can include references to items from the default conversation.<br><br>Array items: [RealtimeConversationItemBase](#realtimeconversationitembase) |
 
 ### RealtimeResponseSession
 
@@ -1496,6 +1518,10 @@ Currently, only 'function' tools are supported.
 | Field | Type | Description |
 |-------|------|-------------|
 | type |[RealtimeTurnDetectionType](#realtimeturndetectiontype)| The type of turn detection.<br><br>Allowed values: `server_vad`|
+| threshold | number | The activation threshold for the server VAD turn detection. In noisy environments, you might need to increase the threshold to avoid false positives. In quiet environments, you might need to decrease the threshold to avoid false negatives.<br><br>Defaults to `0.5`. You can set the threshold to a value between `0.0` and `1.0`. |
+| prefix_padding_ms | string | The duration of speech audio (in milliseconds) to include before the start of detected speech.<br><br>Defaults to `300` milliseconds. |
+| silence_duration_ms | string | The duration of silence (in milliseconds) used to detect the end of speech. You want to detect the end of speech as soon as possible, but not so soon that you cut off the last part of the speech.<br><br>If you set this value to a lower number, the model responds more quickly, but it might cut off the last part of the speech. If you set this value to a higher number, the model waits longer to detect the end of speech, but it might take longer to respond.<br><br>Defaults to `500` milliseconds. |
+| create_response | boolean | Indicates whether the server automatically creates a response when VAD is enabled and speech stops.<br><br>Defaults to `true`. |