
Commit c2a799c

Merge pull request #2416 from eric-urban/eur/realtime-how-to-ref-updates-364769
[AOAI] realtime how-to and ref updates
2 parents: 03fbe92 + c31ead2

File tree

2 files changed: +120 −13 lines

articles/ai-services/openai/how-to/realtime-audio.md

Lines changed: 90 additions & 9 deletions
@@ -22,7 +22,7 @@ Most users of the Realtime API need to deliver and receive audio from an end-use

 ## Supported models

-The GPT 4o realtime models are available for global deployments in [East US 2 and Sweden Central regions](../concepts/models.md#global-standard-model-availability).
+The GPT-4o real-time models are available for global deployments in [East US 2 and Sweden Central regions](../concepts/models.md#global-standard-model-availability).

 - `gpt-4o-realtime-preview` (2024-12-17)
 - `gpt-4o-realtime-preview` (2024-10-01)

@@ -116,7 +116,7 @@ Often, the first event sent by the caller on a newly established `/realtime` ses

 The [`session.update`](../realtime-audio-reference.md#realtimeclienteventsessionupdate) event can be used to configure the following aspects of the session:
 - Transcription of user input audio is opted into via the session's `input_audio_transcription` property. Specifying a transcription model (`whisper-1`) in this configuration enables the delivery of [`conversation.item.audio_transcription.completed`](../realtime-audio-reference.md#realtimeservereventconversationiteminputaudiotranscriptioncompleted) events.
-- Turn handling is controlled by the `turn_detection` property. This property can be set to `none` or `server_vad` as described in the [input audio buffer and turn handling](#input-audio-buffer-and-turn-handling) section.
+- Turn handling is controlled by the `turn_detection` property. Its type can be set to `none` or `server_vad` as described in the [voice activity detection (VAD) and the audio buffer](#voice-activity-detection-vad-and-the-audio-buffer) section.
 - Tools can be configured to enable the server to call out to external services or functions to enrich the conversation. Tools are defined as part of the `tools` property in the session configuration.

 An example `session.update` that configures several aspects of the session, including tools, follows. All session parameters are optional and can be omitted if not needed.
@@ -135,7 +135,8 @@ An example `session.update` that configures several aspects of the session, incl
       "type": "server_vad",
       "threshold": 0.5,
       "prefix_padding_ms": 300,
-      "silence_duration_ms": 200
+      "silence_duration_ms": 200,
+      "create_response": true
     },
     "tools": []
 }
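A `session.update` payload like the one in this hunk can be assembled and serialized client-side before it's sent as a WebSocket text message. A minimal sketch follows; the field values are taken from the example above, and the helper name is hypothetical:

```python
import json

def build_session_update(create_response: bool = True) -> str:
    # Hypothetical helper: assembles the session.update event shown above.
    event = {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "server_vad",
                "threshold": 0.5,
                "prefix_padding_ms": 300,
                "silence_duration_ms": 200,
                "create_response": create_response,
            },
            "tools": [],
        },
    }
    return json.dumps(event)

# The serialized string is what you'd send over the /realtime WebSocket.
payload = build_session_update()
print(json.loads(payload)["session"]["turn_detection"]["type"])  # server_vad
```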
@@ -144,15 +145,75 @@ An example `session.update` that configures several aspects of the session, incl

 The server responds with a [`session.updated`](../realtime-audio-reference.md#realtimeservereventsessionupdated) event to confirm the session configuration.

-## Input audio buffer and turn handling
+## Out-of-band responses
+
+By default, responses generated during a session are added to the default conversation state. In some cases, you might want to generate responses outside the default conversation. This can be useful for generating multiple responses concurrently, or for generating responses that don't affect the default conversation state. For example, you can limit the number of turns considered by the model when generating a response.
+
+You can create out-of-band responses by setting the [`response.conversation`](../realtime-audio-reference.md#realtimeresponseoptions) field to the string `none` when creating a response with the [`response.create`](../realtime-audio-reference.md#realtimeclienteventresponsecreate) client event.

-The server maintains an input audio buffer containing client-provided audio that has not yet been committed to the conversation state.
+In the same [`response.create`](../realtime-audio-reference.md#realtimeclienteventresponsecreate) client event, you can also set the [`response.metadata`](../realtime-audio-reference.md#realtimeresponseoptions) field to help you identify which response is being generated for this client-sent event.
+
+```json
+{
+  "type": "response.create",
+  "response": {
+    "conversation": "none",
+    "metadata": { "topic": "world_capitals" },
+    "modalities": ["text"],
+    "instructions": "What is the capital of France?"
+  }
+}
+```
+When the server responds with a [`response.done`](../realtime-audio-reference.md#realtimeservereventresponsedone) event, the response contains the metadata you provided. You can identify the corresponding response for the client-sent event via the `response.metadata` field.
+
+> [!IMPORTANT]
+> If you create any responses outside the default conversation, be sure to always check the `response.metadata` field to identify the corresponding response for the client-sent event. Check the `response.metadata` field even for responses that are part of the default conversation, so that you're sure you're handling the correct response for each client-sent event.
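Matching `response.done` events back to their originating `response.create` calls by metadata can be sketched as follows. The event shapes are abbreviated, the `topic` key comes from the `world_capitals` example in the doc, and the routing helper is hypothetical:

```python
import json

# Pending requests keyed by the metadata "topic" sent in response.create.
pending = {"world_capitals": "capital question"}

def route_response_done(raw_event: str):
    # Look up the originating request via response.metadata; None if no match.
    event = json.loads(raw_event)
    if event.get("type") != "response.done":
        return None
    metadata = event.get("response", {}).get("metadata") or {}
    return pending.get(metadata.get("topic"))

server_event = json.dumps({
    "type": "response.done",
    "response": {"metadata": {"topic": "world_capitals"}},
})
print(route_response_done(server_event))  # capital question
```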
+### Custom context for out-of-band responses
+
+You can also construct a custom context that the model uses outside of the session's default conversation. To create a response with custom context, set the `conversation` field to `none` and provide the custom context in the `input` array. The `input` array can contain new inputs or references to existing conversation items.
+
+```json
+{
+  "type": "response.create",
+  "response": {
+    "conversation": "none",
+    "modalities": ["text"],
+    "instructions": "What is the capital of France?",
+    "input": [
+      {
+        "type": "item_reference",
+        "id": "existing_conversation_item_id"
+      },
+      {
+        "type": "message",
+        "role": "user",
+        "content": [
+          {
+            "type": "input_text",
+            "text": "The capital of France is Paris."
+          }
+        ]
+      }
+    ]
+  }
+}
+```
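Building the `input` array programmatically keeps item references and new messages well-formed. A small sketch, with hypothetical helper names and the illustrative item ID from the example above:

```python
import json

def item_reference(item_id: str) -> dict:
    # Reference an existing conversation item by its ID.
    return {"type": "item_reference", "id": item_id}

def user_text(text: str) -> dict:
    # A new user message carrying input_text content.
    return {
        "type": "message",
        "role": "user",
        "content": [{"type": "input_text", "text": text}],
    }

inputs = [
    item_reference("existing_conversation_item_id"),
    user_text("The capital of France is Paris."),
]
print(json.dumps(inputs, indent=2))
```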
+## Voice activity detection (VAD) and the audio buffer
+
+The server maintains an input audio buffer containing client-provided audio that hasn't yet been committed to the conversation state.

 One of the key [session-wide](#session-configuration) settings is `turn_detection`, which controls how data flow is handled between the caller and model. The `turn_detection` setting can be set to `none` or `server_vad` (to use [server-side voice activity detection](#server-decision-mode)).

+By default, voice activity detection (VAD) is enabled, and the server automatically generates responses when it detects the end of speech in the input audio buffer. You can change the behavior by setting the `turn_detection` property in the session configuration.
+
 ### Without server decision mode

-By default, the session is configured with the `turn_detection` type effectively set to `none`.
+When the session is configured with the `turn_detection` type set to `none`, voice activity detection (VAD) is disabled, and the server doesn't automatically generate responses when it detects the end of speech in the input audio buffer.

 The session relies on caller-initiated [`input_audio_buffer.commit`](../realtime-audio-reference.md#realtimeclienteventinputaudiobuffercommit) and [`response.create`](../realtime-audio-reference.md#realtimeclienteventresponsecreate) events to progress conversations and produce output. This setting is useful for push-to-talk applications or situations that have external audio flow control (such as a caller-side VAD component). These manual signals can still be used in `server_vad` mode to supplement VAD-initiated response generation.
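With `turn_detection` set to `none`, a push-to-talk client drives the exchange itself: append audio, commit the buffer, then explicitly request a response. A sketch of that client event sequence (audio bytes and transport omitted; the helper name is hypothetical):

```python
import base64
import json

def push_to_talk_events(audio_chunks: list) -> list:
    # Hypothetical helper: append audio chunks, commit the buffer,
    # then explicitly request a response (required when turn_detection is none).
    events = [
        json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
        for chunk in audio_chunks
    ]
    events.append(json.dumps({"type": "input_audio_buffer.commit"}))
    events.append(json.dumps({"type": "response.create"}))
    return events

events = push_to_talk_events([b"\x00\x01", b"\x02\x03"])
print([json.loads(e)["type"] for e in events])
```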

@@ -177,7 +238,9 @@ sequenceDiagram

 ### Server decision mode

-The session can be configured with the `turn_detection` type set to `server_vad`. In this case, the server evaluates user audio from the client (as sent via [`input_audio_buffer.append`](../realtime-audio-reference.md#realtimeclienteventinputaudiobufferappend)) using a voice activity detection (VAD) component. The server automatically uses that audio to initiate response generation on applicable conversations when an end of speech is detected. Silence detection for the VAD can be configured when specifying `server_vad` detection mode.
+You can configure the session to use server-side voice activity detection (VAD). Set the `turn_detection` type to `server_vad` to enable VAD.
+
+In this case, the server evaluates user audio from the client (as sent via [`input_audio_buffer.append`](../realtime-audio-reference.md#realtimeclienteventinputaudiobufferappend)) using a VAD component. The server automatically uses that audio to initiate response generation on applicable conversations when an end of speech is detected. Silence detection for the VAD can also be configured when specifying `server_vad` detection mode.

 - The server sends the [`input_audio_buffer.speech_started`](../realtime-audio-reference.md#realtimeservereventinputaudiobufferspeechstarted) event when it detects the start of speech.
 - At any time, the client can optionally append audio to the buffer by sending the [`input_audio_buffer.append`](../realtime-audio-reference.md#realtimeclienteventinputaudiobufferappend) event.
@@ -201,9 +264,27 @@ sequenceDiagram
     Server->>Client: conversation.item.created
 -->

+### VAD without automatic response generation
+
+You can use server-side voice activity detection (VAD) without automatic response generation. This approach can be useful when you want to implement some degree of moderation.
+
+Set [`turn_detection.create_response`](../realtime-audio-reference.md#realtimeturndetection) to `false` via the [session.update](../realtime-audio-reference.md#realtimeclienteventsessionupdate) event. VAD detects the end of speech but the server doesn't generate a response until you send a [`response.create`](../realtime-audio-reference.md#realtimeclienteventresponsecreate) event.
+
+```json
+{
+  "turn_detection": {
+    "type": "server_vad",
+    "threshold": 0.5,
+    "prefix_padding_ms": 300,
+    "silence_duration_ms": 200,
+    "create_response": false
+  }
+}
+```
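A moderation gate on top of this configuration might inspect the user's transcript when VAD signals the end of speech and only then decide whether to request a response. A rough sketch; the `is_allowed` check is a hypothetical placeholder, not part of the API:

```python
import json

def is_allowed(transcript: str) -> bool:
    # Hypothetical moderation check; replace with a real classifier or service.
    return "forbidden" not in transcript.lower()

def events_to_send(transcript: str) -> list:
    # With create_response=false, the server waits for an explicit response.create.
    if is_allowed(transcript):
        return [json.dumps({"type": "response.create"})]
    # Otherwise send no response request and, for example, clear the input buffer.
    return [json.dumps({"type": "input_audio_buffer.clear"})]

print([json.loads(e)["type"] for e in events_to_send("hello")])  # ['response.create']
```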
 ## Conversation and response generation

-The Realtime API is designed to handle real-time, low-latency conversational interactions. The API is built on a series of events that allow the client to send and receive messages, control the flow of the conversation, and manage the state of the session.
+The GPT-4o real-time audio models are designed for real-time, low-latency conversational interactions. The API is built on a series of events that allow the client to send and receive messages, control the flow of the conversation, and manage the state of the session.

 ### Conversation sequence and items
@@ -256,7 +337,7 @@ A user might want to interrupt the assistant's response or ask the assistant to

 Here's an example of the event sequence for a simple text-in, audio-out conversation:

-When you connect to the `/realtime` endpoint, the server responds with a [`session.created`](../realtime-audio-reference.md#realtimeservereventsessioncreated) event.
+When you connect to the `/realtime` endpoint, the server responds with a [`session.created`](../realtime-audio-reference.md#realtimeservereventsessioncreated) event. The maximum session duration is 30 minutes.

 ```json
 {
articles/ai-services/openai/realtime-audio-reference.md

Lines changed: 30 additions & 4 deletions
@@ -1080,7 +1080,7 @@ The server `session.updated` event is returned when a session is updated by the

 | Field | Type | Description |
 |-------|------|-------------|
 | type | [RealtimeClientEventType](#realtimeclienteventtype) | The type of the client event. |
-| event_id | string | The unique ID of the event. The ID can be specified by the client to help identify the event. |
+| event_id | string | The unique ID of the event. The client can specify the ID to help identify the event. |

 ### RealtimeClientEventType

@@ -1100,7 +1100,11 @@ The server `session.updated` event is returned when a session is updated by the

 | Field | Type | Description |
 |-------|------|-------------|
-| type | [RealtimeContentPartType](#realtimecontentparttype) | The type of the content part. |
+| type | [RealtimeContentPartType](#realtimecontentparttype) | The content type.<br><br>Allowed values: `input_text`, `input_audio`, `item_reference`, `text`. |
+| text | string | The text content. This property is applicable for the `input_text` and `text` content types. |
+| id | string | The ID of a previous conversation item to reference in both client-created and server-created items. This property is applicable for the `item_reference` content type in `response.create` events. |
+| audio | string | The base64-encoded audio bytes. This property is applicable for the `input_audio` content type. |
+| transcript | string | The transcript of the audio. This property is applicable for the `input_audio` content type. |

 ### RealtimeContentPartType

@@ -1115,14 +1119,29 @@ The server `session.updated` event is returned when a session is updated by the

 The item to add to the conversation.

+This table describes all `RealtimeConversationItem` properties. The properties that are applicable per event depend on the [RealtimeItemType](#realtimeitemtype).
+
+| Field | Type | Description |
+|-------|------|-------------|
+| id | string | The unique ID of the item. The client can specify the ID to help manage server-side context. If the client doesn't provide an ID, the server generates one. |
+| type | [RealtimeItemType](#realtimeitemtype) | The type of the item.<br><br>Allowed values: `message`, `function_call`, `function_call_output` |
+| object | string | The identifier for the API object being returned. The value is always `realtime.item`. |
+| status | [RealtimeItemStatus](#realtimeitemstatus) | The status of the item. This field doesn't affect the conversation, but it's accepted for consistency with the `conversation.item.created` event.<br><br>Allowed values: `completed`, `incomplete` |
+| role | [RealtimeMessageRole](#realtimemessagerole) | The role of the message sender. This property is only applicable for `message` items.<br><br>Allowed values: `system`, `user`, `assistant` |
+| content | array of [RealtimeContentPart](#realtimecontentpart) | The content of the message. This property is only applicable for `message` items.<br><br>- Message items of role `system` support only `input_text` content.<br>- Message items of role `user` support `input_text` and `input_audio` content.<br>- Message items of role `assistant` support `text` content. |
+| call_id | string | The ID of the function call (for `function_call` and `function_call_output` items). If passed on a `function_call_output` item, the server checks that a `function_call` item with the same ID exists in the conversation history. |
+| name | string | The name of the function being called (for `function_call` items). |
+| arguments | string | The arguments of the function call (for `function_call` items). |
+| output | string | The output of the function call (for `function_call_output` items). |
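The `function_call_output` row above implies a round-trip: after the model emits a `function_call`, the client returns the result as a new conversation item. A sketch of building that event; the IDs and output value are illustrative:

```python
import json

def function_output_item(call_id: str, output: str) -> str:
    # conversation.item.create carrying a function result back to the model.
    # call_id must match the function_call item the server produced earlier.
    return json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": call_id,
            "output": output,
        },
    })

event = json.loads(function_output_item("call_123", '{"temperature_c": 21}'))
print(event["item"]["type"])  # function_call_output
```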
 ### RealtimeConversationRequestItem

 You use the `RealtimeConversationRequestItem` object to create a new item in the conversation via the [conversation.item.create](#realtimeclienteventconversationitemcreate) event.

 | Field | Type | Description |
 |-------|------|-------------|
 | type | [RealtimeItemType](#realtimeitemtype) | The type of the item. |
-| id | string | The unique ID of the item. The ID can be specified by the client to help manage server-side context. If the client doesn't provide an ID, the server generates one. |
+| id | string | The unique ID of the item. The client can specify the ID to help manage server-side context. If the client doesn't provide an ID, the server generates one. |

 ### RealtimeConversationResponseItem

@@ -1138,7 +1157,7 @@ The `RealtimeConversationResponseItem` object represents an item in the conversa
 |-------|------|-------------|
 | object | string | The identifier for the returned API object.<br><br>Allowed values: `realtime.item` |
 | type | [RealtimeItemType](#realtimeitemtype) | The type of the item.<br><br>Allowed values: `message`, `function_call`, `function_call_output` |
-| id | string | The unique ID of the item. The ID can be specified by the client to help manage server-side context. If the client doesn't provide an ID, the server generates one.<br><br>This property is nullable. |
+| id | string | The unique ID of the item. The client can specify the ID to help manage server-side context. If the client doesn't provide an ID, the server generates one.<br><br>This property is nullable. |

 ### RealtimeFunctionTool
@@ -1333,6 +1352,9 @@ The response resource.
 | tool_choice | [RealtimeToolChoice](#realtimetoolchoice) | The tool choice for the session. |
 | temperature | number | The sampling temperature for the model. The allowed temperature values are limited to [0.6, 1.2]. Defaults to 0.8. |
 | max_output_tokens | integer or "inf" | The maximum number of output tokens per assistant response, inclusive of tool calls.<br><br>Specify an integer between 1 and 4096 to limit the output tokens. Otherwise, set the value to "inf" to allow the maximum number of tokens.<br><br>For example, to limit the output tokens to 1000, set `"max_response_output_tokens": 1000`. To allow the maximum number of tokens, set `"max_response_output_tokens": "inf"`.<br><br>Defaults to `"inf"`. |
+| conversation | string | Controls which conversation the response is added to. The supported values are `auto` and `none`.<br><br>The `auto` value (or not setting this property) ensures that the contents of the response are added to the session's default conversation.<br><br>Set this property to `none` to create an out-of-band response whose items aren't added to the default conversation. For more information, see the [how-to guide](./how-to/realtime-audio.md#out-of-band-responses).<br><br>Defaults to `"auto"`. |
+| metadata | map | Set of up to 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format. Keys can be a maximum of 64 characters long, and values can be a maximum of 512 characters long.<br/><br/>For example: `metadata: { topic: "classification" }` |
+| input | array | Input items to include in the prompt for the model. Creates a new context for this response, without including the default conversation. Can include references to items from the default conversation.<br><br>Array items: [RealtimeConversationItemBase](#realtimeconversationitembase) |
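The `metadata` limits stated in the table (at most 16 pairs, 64-character keys, 512-character values) can be checked client-side before sending a `response.create`. A sketch, with the limits taken from the table:

```python
def metadata_is_valid(metadata: dict) -> bool:
    # Enforce the documented limits: <=16 pairs, keys <=64 chars, values <=512 chars.
    if len(metadata) > 16:
        return False
    return all(
        len(str(key)) <= 64 and len(str(value)) <= 512
        for key, value in metadata.items()
    )

print(metadata_is_valid({"topic": "classification"}))  # True
```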

### RealtimeResponseSession

@@ -1496,6 +1518,10 @@ Currently, only 'function' tools are supported.
 | Field | Type | Description |
 |-------|------|-------------|
 | type | [RealtimeTurnDetectionType](#realtimeturndetectiontype) | The type of turn detection.<br><br>Allowed values: `server_vad` |
+| threshold | number | The activation threshold for server VAD turn detection. In noisy environments, you might need to raise the threshold to avoid false positives. In quiet environments, you might need to lower it to avoid false negatives.<br><br>Defaults to `0.5`. You can set the threshold to a value between `0.0` and `1.0`. |
+| prefix_padding_ms | integer | The duration of speech audio (in milliseconds) to include before the start of detected speech.<br><br>Defaults to `300` milliseconds. |
+| silence_duration_ms | integer | The duration of silence (in milliseconds) used to detect the end of speech. A lower value makes the model respond more quickly, but it might cut off the last part of the user's speech. A higher value avoids cutting off speech, but increases response latency.<br><br>Defaults to `500` milliseconds. |
+| create_response | boolean | Indicates whether the server automatically creates a response when VAD is enabled and speech stops.<br><br>Defaults to `true`. |
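A client-side sanity check over these fields before sending `session.update` can be sketched as follows; the validation rules are an assumption drawn from the table above, not something the API enforces on the client:

```python
def validate_turn_detection(config: dict) -> list:
    # Collect problems instead of raising, so callers can report them all at once.
    errors = []
    if config.get("type") != "server_vad":
        errors.append("type must be 'server_vad'")
    threshold = config.get("threshold", 0.5)
    if not 0.0 <= threshold <= 1.0:
        errors.append("threshold must be between 0.0 and 1.0")
    for field in ("prefix_padding_ms", "silence_duration_ms"):
        value = config.get(field, 0)
        if not isinstance(value, int) or value < 0:
            errors.append(f"{field} must be a non-negative integer")
    return errors

print(validate_turn_detection({
    "type": "server_vad",
    "threshold": 0.5,
    "prefix_padding_ms": 300,
    "silence_duration_ms": 200,
}))  # []
```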

### RealtimeTurnDetectionType
