Once the WebSocket connection session to `/realtime` is established and authenticated, the functional interaction takes place via events for sending and receiving WebSocket messages. These events each take the form of a JSON object.
:::image type="content" source="../media/how-to/real-time/realtime-api-sequence.png" alt-text="Diagram of the Realtime API authentication and connection sequence." lightbox="../media/how-to/real-time/realtime-api-sequence.png":::
<!--
sequenceDiagram
actor User as End User
participant MiddleTier as /realtime host
participant AOAI as Azure OpenAI
User->>MiddleTier: Begin interaction
MiddleTier->>MiddleTier: Authenticate/Validate User
AOAI--)MiddleTier: (within items) create/stream/finish content parts
-->
Events can be sent and received in parallel, and applications should generally handle them concurrently and asynchronously.
- A client-side caller establishes a connection to `/realtime`, which starts a new [`session`](#session-configuration).
- A `session` automatically creates a default `conversation`. Multiple concurrent conversations aren't supported.
- The `conversation` accumulates input signals until a `response` is started, either via a direct event by the caller or automatically by voice activity detection (VAD).
- Each `response` consists of one or more `items`, which can encapsulate messages, function calls, and other information.
- Each message `item` has `content_part` elements, allowing multiple modalities (text and audio) to be represented across a single item.
- The `session` manages configuration of caller input handling (for example, user audio) and common output generation handling.
- Each caller-initiated [`response.create`](../realtime-audio-reference.md#realtimeclienteventresponsecreate) can override some of the output [`response`](../realtime-audio-reference.md#realtimeresponse) behavior, if desired.
- Server-created `items` and the `content_part` entries in messages can be populated asynchronously and in parallel. For example, the caller can receive audio, text, and function information concurrently in a round-robin fashion.
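
Because every Realtime API event is a JSON object with a `type` field, a client often routes incoming messages through a small dispatcher. The following is a minimal sketch, not part of any official client library; the `make_dispatcher` helper and the handler wiring are hypothetical illustrations of the concurrent, event-driven flow described above.

```python
import json

def make_dispatcher(handlers):
    """Return a function that routes a raw JSON event to a handler by its "type"."""
    def dispatch(raw_message):
        event = json.loads(raw_message)
        # Fall back to a "default" handler for event types we don't recognize.
        handler = handlers.get(event["type"], handlers.get("default"))
        if handler:
            handler(event)
        return event["type"]
    return dispatch

# Hypothetical handlers for a couple of server events.
received = []
dispatch = make_dispatcher({
    "session.created": lambda e: received.append("session ready"),
    "response.done": lambda e: received.append("response finished"),
    "default": lambda e: received.append("unhandled: " + e["type"]),
})

dispatch(json.dumps({"type": "session.created", "session": {"id": "sess_001"}}))
dispatch(json.dumps({"type": "response.done", "response": {"id": "resp_001"}}))
```

In a real application, each received WebSocket text frame would be passed to `dispatch`, typically from an asynchronous receive loop so that sending and receiving can proceed in parallel.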
## Session configuration
Often, the first event sent by the caller on a newly established `/realtime` session is a [`session.update`](../realtime-audio-reference.md#realtimeclienteventsessionupdate) payload. This event controls a wide set of input and output behavior, with output and response generation properties then later overridable using the [`response.create`](../realtime-audio-reference.md#realtimeclienteventresponsecreate) event.
The [`session.update`](../realtime-audio-reference.md#realtimeclienteventsessionupdate) event can be used to configure the following aspects of the session:
- Transcription of user input audio is opted into via the session's `input_audio_transcription` property. Specifying a transcription model (`whisper-1`) in this configuration enables the delivery of [`conversation.item.audio_transcription.completed`](../realtime-audio-reference.md#realtimeservereventconversationiteminputaudiotranscriptioncompleted) events.
- Turn handling is controlled by the `turn_detection` property. This property can be set to `none` or `server_vad` as described in the [input audio buffer and turn handling](#input-audio-buffer-and-turn-handling) section.
- Tools can be configured to enable the server to call out to external services or functions to enrich the conversation. Tools are defined as part of the `tools` property in the session configuration.
### Session update example
An example `session.update` that configures several aspects of the session, including tools, follows. All session parameters are optional and can be omitted if not needed.
```json
{
  "type": "session.update",
  "session": {
    "voice": "alloy",
    "input_audio_format": "pcm16",
    "input_audio_transcription": {
      "model": "whisper-1"
    },
    "turn_detection": {
      "type": "server_vad"
    },
    "tools": [
      {
        "type": "function",
        "name": "get_weather_for_location",
        "description": "gets the weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state such as San Francisco, CA"
            },
            "unit": {
              "type": "string",
              "enum": ["c", "f"]
            }
          },
          "required": ["location", "unit"]
        }
      }
    ]
  }
}
```
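
As a sketch of how a client might send a `session.update` payload, the following builds the event and serializes it to the JSON text frame the service expects. The commented-out `ws.send(...)` call is a placeholder for whatever WebSocket client you use; it isn't a specific library API.

```python
import json

# Build a session.update event; all session fields are optional,
# so only the settings you want to change need to be included.
session_update = {
    "type": "session.update",
    "session": {
        "input_audio_transcription": {"model": "whisper-1"},
        "turn_detection": {"type": "server_vad"},
    },
}

# Serialize to a JSON text frame for the WebSocket connection.
message = json.dumps(session_update)
# ws.send(message)  # placeholder: send over your established /realtime WebSocket
```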
## Input audio buffer and turn handling
The server maintains an input audio buffer containing client-provided audio that has not yet been committed to the conversation state.
One of the key [session-wide](#session-configuration) settings is `turn_detection`, which controls how data flow is handled between the caller and model. The `turn_detection` setting can be set to `none` or `server_vad` (to use [server-side voice activity detection](#server-decision-mode)).
### Without server decision mode
By default, the session is configured with the `turn_detection` type effectively set to `none`.
The session relies on caller-initiated [`input_audio_buffer.commit`](../realtime-audio-reference.md#realtimeclienteventinputaudiobuffercommit) and [`response.create`](../realtime-audio-reference.md#realtimeclienteventresponsecreate) events to progress conversations and produce output. This setting is useful for push-to-talk applications or situations that have external audio flow control (such as a caller-side VAD component). These manual signals can still be used in `server_vad` mode to supplement VAD-initiated response generation.
- The client can append audio to the buffer by sending the [`input_audio_buffer.append`](../realtime-audio-reference.md#realtimeclienteventinputaudiobufferappend) event.
- The client commits the input audio buffer by sending the [`input_audio_buffer.commit`](../realtime-audio-reference.md#realtimeclienteventinputaudiobuffercommit) event. The commit creates a new user message item in the conversation.
- The server responds by sending the [`input_audio_buffer.committed`](../realtime-audio-reference.md#realtimeservereventinputaudiobuffercommitted) event.
- The server responds by sending the [`conversation.item.created`](../realtime-audio-reference.md#realtimeservereventconversationitemcreated) event.
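
The push-to-talk sequence above can be sketched as follows. The audio bytes here are dummy placeholder data; a real client would append base64-encoded audio (for example, PCM16 captured from a microphone) in `input_audio_buffer.append` events.

```python
import base64
import json

def append_event(audio_bytes):
    # input_audio_buffer.append carries base64-encoded audio.
    return {"type": "input_audio_buffer.append",
            "audio": base64.b64encode(audio_bytes).decode("ascii")}

def commit_event():
    # Committing the buffer creates a new user message item in the conversation.
    return {"type": "input_audio_buffer.commit"}

def response_create_event():
    # With turn_detection set to none, the caller explicitly requests a response.
    return {"type": "response.create"}

# Push-to-talk: append one or more audio chunks, commit, then request a response.
frames = [json.dumps(e) for e in (
    append_event(b"\x00\x01" * 160),  # dummy audio chunk
    commit_event(),
    response_create_event(),
)]
```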
:::image type="content" source="../media/how-to/real-time/input-audio-buffer-client-managed.png" alt-text="Diagram of the Realtime API input audio sequence without server decision mode." lightbox="../media/how-to/real-time/input-audio-buffer-client-managed.png":::
<!--
sequenceDiagram
participant Client as Client
participant Server as Server
Client->>Server: input_audio_buffer.append
Server->>Server: Append audio to buffer
Client->>Server: input_audio_buffer.commit
Server->>Server: Commit audio buffer
Server->>Client: input_audio_buffer.committed
Server->>Client: conversation.item.created
-->
### Server decision mode
The session can be configured with the `turn_detection` type set to `server_vad`. In this case, the server evaluates user audio from the client (as sent via [`input_audio_buffer.append`](../realtime-audio-reference.md#realtimeclienteventinputaudiobufferappend)) using a voice activity detection (VAD) component. The server automatically uses that audio to initiate response generation on applicable conversations when an end of speech is detected. Silence detection for the VAD can be configured when specifying `server_vad` detection mode.
- The server sends the [`input_audio_buffer.speech_started`](../realtime-audio-reference.md#realtimeservereventinputaudiobufferspeechstarted) event when it detects the start of speech.
- At any time, the client can optionally append audio to the buffer by sending the [`input_audio_buffer.append`](../realtime-audio-reference.md#realtimeclienteventinputaudiobufferappend) event.
- The server sends the [`input_audio_buffer.speech_stopped`](../realtime-audio-reference.md#realtimeservereventinputaudiobufferspeechstopped) event when it detects the end of speech.
- The server commits the input audio buffer by sending the [`input_audio_buffer.committed`](../realtime-audio-reference.md#realtimeservereventinputaudiobuffercommitted) event.
- The server sends the [`conversation.item.created`](../realtime-audio-reference.md#realtimeservereventconversationitemcreated) event with the user message item created from the audio buffer.
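
In `server_vad` mode, the client mostly reacts to these server events. The following sketch tracks speech state from the event stream; the `SpeechTracker` class is illustrative (not part of any SDK), and the event payloads are simplified.

```python
import json

class SpeechTracker:
    """Track whether the server currently detects user speech."""

    def __init__(self):
        self.speaking = False
        self.committed_items = []

    def on_event(self, raw_message):
        event = json.loads(raw_message)
        if event["type"] == "input_audio_buffer.speech_started":
            self.speaking = True
        elif event["type"] == "input_audio_buffer.speech_stopped":
            self.speaking = False
        elif event["type"] == "input_audio_buffer.committed":
            # The server committed the buffer; a user message item follows.
            self.committed_items.append(event.get("item_id"))
        return event["type"]

tracker = SpeechTracker()
tracker.on_event(json.dumps({"type": "input_audio_buffer.speech_started"}))
tracker.on_event(json.dumps({"type": "input_audio_buffer.speech_stopped"}))
tracker.on_event(json.dumps({"type": "input_audio_buffer.committed", "item_id": "item_001"}))
```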
:::image type="content" source="../media/how-to/real-time/input-audio-buffer-server-vad.png" alt-text="Diagram of the Realtime API input audio sequence with server decision mode." lightbox="../media/how-to/real-time/input-audio-buffer-server-vad.png":::

## Conversation and response generation

You can have one active conversation per session. The conversation accumulates input signals until a response is started, either via a direct event by the caller or automatically by voice activity detection (VAD).
- The server [`conversation.created`](../realtime-audio-reference.md#realtimeservereventconversationcreated) event is returned right after session creation.
- The client adds new items to the conversation with a [`conversation.item.create`](../realtime-audio-reference.md#realtimeclienteventconversationitemcreate) event.
- The server [`conversation.item.created`](../realtime-audio-reference.md#realtimeservereventconversationitemcreated) event is returned when the client adds a new item to the conversation.
Optionally, the client can truncate or delete items in the conversation:
- The client truncates an earlier assistant audio message item with a [`conversation.item.truncate`](../realtime-audio-reference.md#realtimeclienteventconversationitemtruncate) event.
- The server [`conversation.item.truncated`](../realtime-audio-reference.md#realtimeservereventconversationitemtruncated) event is returned to sync the client and server state.
- The client deletes an item in the conversation with a [`conversation.item.delete`](../realtime-audio-reference.md#realtimeclienteventconversationitemdelete) event.
- The server [`conversation.item.deleted`](../realtime-audio-reference.md#realtimeservereventconversationitemdeleted) event is returned to sync the client and server state.
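
The item management events above are plain JSON payloads. The following sketch shows how a client might build them; the item ID and field values are illustrative placeholders, not values from a real session.

```python
import json

# Create a user text message item in the conversation.
create_item = {
    "type": "conversation.item.create",
    "item": {
        "type": "message",
        "role": "user",
        "content": [{"type": "input_text", "text": "What's the weather like?"}],
    },
}

# Truncate an earlier assistant audio message (for example, after the user
# interrupts playback) at a given point in the audio.
truncate_item = {
    "type": "conversation.item.truncate",
    "item_id": "item_001",  # illustrative ID from a prior server event
    "content_index": 0,
    "audio_end_ms": 1500,
}

# Remove an item from the conversation entirely.
delete_item = {"type": "conversation.item.delete", "item_id": "item_001"}

frames = [json.dumps(e) for e in (create_item, truncate_item, delete_item)]
```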
:::image type="content" source="../media/how-to/real-time/conversation-item-sequence.png" alt-text="Diagram of the Realtime API conversation item sequence." lightbox="../media/how-to/real-time/conversation-item-sequence.png":::