
Commit ddfadb7

controlled response timing

1 parent e14999c

File tree

2 files changed: +33 −6 lines

articles/ai-services/openai/how-to/realtime-audio.md

Lines changed: 29 additions & 6 deletions
@@ -116,7 +116,7 @@ Often, the first event sent by the caller on a newly established `/realtime` ses

 The [`session.update`](../realtime-audio-reference.md#realtimeclienteventsessionupdate) event can be used to configure the following aspects of the session:
 - Transcription of user input audio is opted into via the session's `input_audio_transcription` property. Specifying a transcription model (`whisper-1`) in this configuration enables the delivery of [`conversation.item.audio_transcription.completed`](../realtime-audio-reference.md#realtimeservereventconversationiteminputaudiotranscriptioncompleted) events.
-- Turn handling is controlled by the `turn_detection` property. This property can be set to `none` or `server_vad` as described in the [input audio buffer and turn handling](#input-audio-buffer-and-turn-handling) section.
+- Turn handling is controlled by the `turn_detection` property. This property's type can be set to `none` or `server_vad` as described in the [input audio buffer and turn handling](#input-audio-buffer-and-turn-handling) section.
 - Tools can be configured to enable the server to call out to external services or functions to enrich the conversation. Tools are defined as part of the `tools` property in the session configuration.

 An example `session.update` that configures several aspects of the session, including tools, follows. All session parameters are optional and can be omitted if not needed.
@@ -135,7 +135,8 @@ An example `session.update` that configures several aspects of the session, incl
       "type": "server_vad",
       "threshold": 0.5,
       "prefix_padding_ms": 300,
-      "silence_duration_ms": 200
+      "silence_duration_ms": 200,
+      "create_response": true
     },
     "tools": []
   }
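The `turn_detection` settings added in this hunk can be assembled into a `session.update` payload programmatically. The following sketch is illustrative and not part of any SDK; it assumes the full event wraps the configuration in a `session` object, matching the article's complete example, and uses only field names shown in the diff.

```python
import json

def build_session_update(create_response: bool = True) -> str:
    """Assemble a session.update event carrying the turn_detection
    settings from the article's example (field names from the diff).
    The wrapping "session" object is an assumption based on the
    article's full example, not shown in this hunk."""
    event = {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "server_vad",
                "threshold": 0.5,
                "prefix_padding_ms": 300,
                "silence_duration_ms": 200,
                "create_response": create_response,
            },
            "tools": [],
        },
    }
    return json.dumps(event)
```

Sending the serialized string over the websocket connection is omitted here; per the article, the server confirms the configuration with a `session.updated` event.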
@@ -144,15 +145,17 @@ An example `session.update` that configures several aspects of the session, incl

 The server responds with a [`session.updated`](../realtime-audio-reference.md#realtimeservereventsessionupdated) event to confirm the session configuration.

-## Input audio buffer and turn handling
+## Voice activity detection (VAD) and the audio buffer

 The server maintains an input audio buffer containing client-provided audio that has not yet been committed to the conversation state.

 One of the key [session-wide](#session-configuration) settings is `turn_detection`, which controls how data flow is handled between the caller and model. The `turn_detection` setting can be set to `none` or `server_vad` (to use [server-side voice activity detection](#server-decision-mode)).

+By default, voice activity detection (VAD) is enabled, and the server automatically generates responses when it detects the end of speech in the input audio buffer. You can change this behavior by setting the `turn_detection` property in the session configuration.
+
 ### Without server decision mode

-By default, the session is configured with the `turn_detection` type effectively set to `none`.
+By default, the session is configured with the `turn_detection` type effectively set to `none`. Voice activity detection (VAD) is disabled, and the server doesn't automatically generate responses when it detects the end of speech in the input audio buffer.

 The session relies on caller-initiated [`input_audio_buffer.commit`](../realtime-audio-reference.md#realtimeclienteventinputaudiobuffercommit) and [`response.create`](../realtime-audio-reference.md#realtimeclienteventresponsecreate) events to progress conversations and produce output. This setting is useful for push-to-talk applications or situations that have external audio flow control (such as a caller-side VAD component). These manual signals can still be used in `server_vad` mode to supplement VAD-initiated response generation.

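The manual flow described above amounts to three client events. The event names come from this article; the base64-encoded `audio` field and the helper functions are assumptions for illustration only.

```python
import base64
import json

def append_audio_event(pcm_bytes: bytes) -> str:
    # Audio is assumed to be sent base64-encoded in an "audio" field;
    # check the event reference for the exact wire format.
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })

def commit_event() -> str:
    # Commits the buffered audio to the conversation state.
    return json.dumps({"type": "input_audio_buffer.commit"})

def response_create_event() -> str:
    # Asks the server to produce output for the committed audio.
    return json.dumps({"type": "response.create"})

# Push-to-talk: append while the button is held, then commit and request a response.
events = [append_audio_event(b"\x00\x01"), commit_event(), response_create_event()]
```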
@@ -177,7 +180,9 @@ sequenceDiagram

 ### Server decision mode

-The session can be configured with the `turn_detection` type set to `server_vad`. In this case, the server evaluates user audio from the client (as sent via [`input_audio_buffer.append`](../realtime-audio-reference.md#realtimeclienteventinputaudiobufferappend)) using a voice activity detection (VAD) component. The server automatically uses that audio to initiate response generation on applicable conversations when an end of speech is detected. Silence detection for the VAD can be configured when specifying `server_vad` detection mode.
+You can configure the session to use server-side voice activity detection (VAD). Set the `turn_detection` type to `server_vad` to enable VAD.
+
+In this case, the server evaluates user audio from the client (as sent via [`input_audio_buffer.append`](../realtime-audio-reference.md#realtimeclienteventinputaudiobufferappend)) using a voice activity detection (VAD) component. The server automatically uses that audio to initiate response generation on applicable conversations when an end of speech is detected. Silence detection for the VAD can also be configured when specifying `server_vad` detection mode.

 - The server sends the [`input_audio_buffer.speech_started`](../realtime-audio-reference.md#realtimeservereventinputaudiobufferspeechstarted) event when it detects the start of speech.
 - At any time, the client can optionally append audio to the buffer by sending the [`input_audio_buffer.append`](../realtime-audio-reference.md#realtimeclienteventinputaudiobufferappend) event.
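A minimal client-side dispatch for the server-VAD events above might look like the following. `input_audio_buffer.speech_started` is cited in this article; the end-of-speech counterpart name (`input_audio_buffer.speech_stopped`) and the handler itself are assumptions for illustration.

```python
import json

def handle_server_event(raw: str, log: list) -> None:
    # Route incoming server events by type; in a real client these
    # branches would pause playback, update UI state, and so on.
    event = json.loads(raw)
    etype = event.get("type")
    if etype == "input_audio_buffer.speech_started":
        log.append("speech started")  # e.g., stop any local audio playback
    elif etype == "input_audio_buffer.speech_stopped":
        log.append("speech stopped")  # server commits the buffer and responds
    else:
        log.append(f"unhandled: {etype}")

log: list = []
handle_server_event('{"type": "input_audio_buffer.speech_started"}', log)
```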
@@ -201,6 +206,24 @@ sequenceDiagram
     Server->>Client: conversation.item.created
 -->

+### VAD without automatic response generation
+
+You can use server-side voice activity detection (VAD) without automatic response generation. This can be useful when you want to implement some degree of moderation.
+
+Set [`turn_detection.create_response`](../realtime-audio-reference.md#realtimeturndetection) to `false` via the [session.update](../realtime-audio-reference.md#realtimeclienteventsessionupdate) event. VAD will detect the end of speech, but the server won't generate a response until you send a [`response.create`](../realtime-audio-reference.md#realtimeclienteventresponsecreate) event.
+
+```json
+{
+  "turn_detection": {
+    "type": "server_vad",
+    "threshold": 0.5,
+    "prefix_padding_ms": 300,
+    "silence_duration_ms": 200,
+    "create_response": false
+  }
+}
+```
+
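The moderation gate described in the added section can be sketched as a client-side decision point. The `is_allowed` check and the helper name are hypothetical placeholders; the `response.create` event name is from the article.

```python
import json
from typing import Callable, Optional

def on_speech_ended(transcript: str,
                    is_allowed: Callable[[str], bool]) -> Optional[str]:
    # With turn_detection.create_response set to false, the server detects
    # the end of speech but waits; the client decides whether to reply.
    if is_allowed(transcript):
        return json.dumps({"type": "response.create"})
    return None  # suppress the response for disallowed input

event = on_speech_ended("hello", lambda t: "blocked" not in t)
```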
 ## Conversation and response generation

 The Realtime API is designed to handle real-time, low-latency conversational interactions. The API is built on a series of events that allow the client to send and receive messages, control the flow of the conversation, and manage the state of the session.
@@ -256,7 +279,7 @@ A user might want to interrupt the assistant's response or ask the assistant to

 Here's an example of the event sequence for a simple text-in, audio-out conversation:

-When you connect to the `/realtime` endpoint, the server responds with a [`session.created`](../realtime-audio-reference.md#realtimeservereventsessioncreated) event.
+When you connect to the `/realtime` endpoint, the server responds with a [`session.created`](../realtime-audio-reference.md#realtimeservereventsessioncreated) event. The maximum session duration is 30 minutes.

 ```json
 {

articles/ai-services/openai/realtime-audio-reference.md

Lines changed: 4 additions & 0 deletions
@@ -1496,6 +1496,10 @@ Currently, only 'function' tools are supported.
 | Field | Type | Description |
 |-------|------|-------------|
 | type | [RealtimeTurnDetectionType](#realtimeturndetectiontype) | The type of turn detection.<br><br>Allowed values: `server_vad` |
+| threshold | number | The activation threshold for server VAD turn detection. In noisy environments, you might need to increase the threshold to avoid false positives; in quiet environments, you might need to decrease it to avoid false negatives.<br><br>Defaults to `0.5`. You can set the threshold to a value between `0.0` and `1.0`. |
+| prefix_padding_ms | string | The duration of speech audio (in milliseconds) to include before the start of detected speech.<br><br>Defaults to `300` milliseconds. |
+| silence_duration_ms | string | The duration of silence (in milliseconds) used to detect the end of speech. A lower value makes the model respond more quickly but risks cutting off the last part of the speech; a higher value waits longer before ending the turn, which increases response latency.<br><br>Defaults to `500` milliseconds. |
+| create_response | boolean | Indicates whether the server automatically creates a response when VAD is enabled and speech stops.<br><br>Defaults to `true`. |

 ### RealtimeTurnDetectionType
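The defaults and the documented `threshold` range from the table above can be captured in a small validation sketch. The helper is illustrative only and not part of any SDK; it simply mirrors the values listed in the reference table.

```python
def normalized_turn_detection(cfg: dict) -> dict:
    """Fill in the defaults listed in the reference table and check
    the documented constraints (illustrative helper, not an SDK API)."""
    out = {
        "type": cfg.get("type", "server_vad"),
        "threshold": cfg.get("threshold", 0.5),
        "prefix_padding_ms": cfg.get("prefix_padding_ms", 300),
        "silence_duration_ms": cfg.get("silence_duration_ms", 500),
        "create_response": cfg.get("create_response", True),
    }
    if out["type"] != "server_vad":
        raise ValueError("allowed values: server_vad")
    if not 0.0 <= out["threshold"] <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return out
```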

0 commit comments
