Commit 8114afd: Merge pull request #2012 from eric-urban/eur/realtime-howto ("realtime how-to updates"; 2 parents: 1ea0e52 + 9bde1c0)

File tree: 6 files changed (+625, -499 lines)

articles/ai-services/openai/how-to/realtime-audio.md

Lines changed: 129 additions & 17 deletions
@@ -85,31 +85,54 @@ To authenticate:

## Realtime API architecture

Once the WebSocket connection to `/realtime` is established and authenticated, the functional interaction takes place through events for sending and receiving WebSocket messages. These events each take the form of a JSON object.

:::image type="content" source="../media/how-to/real-time/realtime-api-sequence.png" alt-text="Diagram of the Realtime API authentication and connection sequence." lightbox="../media/how-to/real-time/realtime-api-sequence.png":::

<!--
sequenceDiagram
    actor User as End User
    participant MiddleTier as /realtime host
    participant AOAI as Azure OpenAI
    User->>MiddleTier: Begin interaction
    MiddleTier->>MiddleTier: Authenticate/Validate User
    MiddleTier--)User: audio information
    User--)MiddleTier:
    MiddleTier--)User: text information
    User--)MiddleTier:
    MiddleTier--)User: control information
    User--)MiddleTier:
    MiddleTier->>AOAI: connect to /realtime
    MiddleTier->>AOAI: configure session
    AOAI->>MiddleTier: session start
    MiddleTier--)AOAI: send/receive WS commands
    AOAI--)MiddleTier:
    AOAI--)MiddleTier: create/start conversation responses
    AOAI--)MiddleTier: (within responses) create/start/add/finish items
    AOAI--)MiddleTier: (within items) create/stream/finish content parts
-->

Events can be sent and received in parallel, and applications should generally handle them both concurrently and asynchronously.

- A client-side caller establishes a connection to `/realtime`, which starts a new [`session`](#session-configuration).
- A `session` automatically creates a default `conversation`. Multiple concurrent conversations aren't supported.
- The `conversation` accumulates input signals until a `response` is started, either via a direct event from the caller or automatically by voice activity detection (VAD).
- Each `response` consists of one or more `items`, which can encapsulate messages, function calls, and other information.
- Each message `item` has `content_part`, allowing multiple modalities (text and audio) to be represented in a single item.
- The `session` manages configuration of caller input handling (for example, user audio) and common output generation handling.
- Each caller-initiated [`response.create`](../realtime-audio-reference.md#realtimeclienteventresponsecreate) can override some of the output [`response`](../realtime-audio-reference.md#realtimeresponse) behavior, if desired.
- Server-created `item` elements and the `content_part` in messages can be populated asynchronously and in parallel. For example, receiving audio, text, and function information concurrently in a round-robin fashion.
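Handled in code, this exchange amounts to a small asynchronous WebSocket client with a single receive loop that routes each inbound JSON event by its `type` field. The following Python sketch is illustrative rather than an official SDK: the endpoint URL and authentication headers (covered earlier in this article) are omitted, the returned handler strings are placeholders, and the third-party `websockets` package is assumed.

```python
import json


def dispatch(event: dict) -> str:
    """Route a received Realtime API event by its JSON `type` field.

    The returned strings stand in for real handler logic.
    """
    event_type = event.get("type", "")
    if event_type == "session.created":
        return "session ready"
    if event_type.startswith("response."):
        return "handle response event"
    if event_type == "error":
        return "handle error"
    return "ignore"


async def run(url: str) -> None:
    # Assumes `pip install websockets`; auth headers omitted for brevity.
    import websockets

    async with websockets.connect(url) as ws:
        # Sends can happen concurrently from other tasks; here we only
        # consume inbound messages, each of which is one JSON event.
        async for message in ws:
            dispatch(json.loads(message))
```

The key point is the concurrency model: inbound events arrive continuously on one loop while outbound events are sent from other tasks, matching the parallel, asynchronous handling described above.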

## Session configuration

Often, the first event sent by the caller on a newly established `/realtime` session is a [`session.update`](../realtime-audio-reference.md#realtimeclienteventsessionupdate) payload. This event controls a wide set of input and output behavior, with the output and response generation properties then later overridable using the [`response.create`](../realtime-audio-reference.md#realtimeclienteventresponsecreate) event.

The [`session.update`](../realtime-audio-reference.md#realtimeclienteventsessionupdate) event can be used to configure the following aspects of the session:

- Transcription of user input audio is opted into via the session's `input_audio_transcription` property. Specifying a transcription model (`whisper-1`) in this configuration enables the delivery of [`conversation.item.audio_transcription.completed`](../realtime-audio-reference.md#realtimeservereventconversationiteminputaudiotranscriptioncompleted) events.
- Turn handling is controlled by the `turn_detection` property. This property can be set to `none` or `server_vad` as described in the [input audio buffer and turn handling](#input-audio-buffer-and-turn-handling) section.
- Tools can be configured to enable the server to call out to external services or functions to enrich the conversation. Tools are defined as part of the `tools` property in the session configuration.

An example `session.update` that configures several aspects of the session, including tools, follows. All session parameters are optional and can be omitted if not needed.

```json
{
```

@@ -136,7 +159,7 @@

```json
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state such as San Francisco, CA"
            },
            "unit": {
              "type": "string",
```

@@ -157,6 +180,95 @@

```json
}
```
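Because most of the JSON example is elided in this diff view, the sketch below reconstructs only an illustrative shape: a `session.update` event built as a plain Python dictionary and serialized to the JSON text frame that would be sent over the WebSocket. The `get_weather` tool name and its parameter schema are hypothetical, not taken from the elided example.

```python
import json

# Illustrative session.update payload; all session parameters are optional.
session_update = {
    "type": "session.update",
    "session": {
        "turn_detection": {"type": "server_vad"},
        "input_audio_transcription": {"model": "whisper-1"},
        "tools": [
            {
                "type": "function",
                "name": "get_weather",  # hypothetical tool name
                "description": "Get the current weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state such as San Francisco, CA",
                        },
                        "unit": {"type": "string"},
                    },
                    "required": ["location"],
                },
            }
        ],
    },
}

# Serialize to the JSON text frame sent over the WebSocket connection.
payload = json.dumps(session_update)
```
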

## Input audio buffer and turn handling

The server maintains an input audio buffer containing client-provided audio that hasn't yet been committed to the conversation state.

One of the key [session-wide](#session-configuration) settings is `turn_detection`, which controls how data flow is handled between the caller and model. The `turn_detection` setting can be set to `none` or `server_vad` (to use [server-side voice activity detection](#server-decision-mode)).

### Without server decision mode

By default, the session is configured with the `turn_detection` type effectively set to `none`.

The session relies on caller-initiated [`input_audio_buffer.commit`](../realtime-audio-reference.md#realtimeclienteventinputaudiobuffercommit) and [`response.create`](../realtime-audio-reference.md#realtimeclienteventresponsecreate) events to progress conversations and produce output. This setting is useful for push-to-talk applications or situations that have external audio flow control (such as a caller-side VAD component). These manual signals can still be used in `server_vad` mode to supplement VAD-initiated response generation.

- The client can append audio to the buffer by sending the [`input_audio_buffer.append`](../realtime-audio-reference.md#realtimeclienteventinputaudiobufferappend) event.
- The client commits the input audio buffer by sending the [`input_audio_buffer.commit`](../realtime-audio-reference.md#realtimeclienteventinputaudiobuffercommit) event. The commit creates a new user message item in the conversation.
- The server responds by sending the [`input_audio_buffer.committed`](../realtime-audio-reference.md#realtimeservereventinputaudiobuffercommitted) event.
- The server responds by sending the [`conversation.item.created`](../realtime-audio-reference.md#realtimeservereventconversationitemcreated) event.

:::image type="content" source="../media/how-to/real-time/input-audio-buffer-client-managed.png" alt-text="Diagram of the Realtime API input audio sequence without server decision mode." lightbox="../media/how-to/real-time/input-audio-buffer-client-managed.png":::

<!--
sequenceDiagram
    participant Client as Client
    participant Server as Server
    Client->>Server: input_audio_buffer.append
    Server->>Server: Append audio to buffer
    Client->>Server: input_audio_buffer.commit
    Server->>Server: Commit audio buffer
    Server->>Client: input_audio_buffer.committed
    Server->>Client: conversation.item.created
-->

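A push-to-talk client in this mode drives the turn itself. The following Python sketch shows the three client events involved; the helper function and the raw PCM bytes are illustrative, and the audio payload is base64-encoded as the event reference requires.

```python
import base64
import json


def audio_append_event(pcm_bytes: bytes) -> str:
    """Build an input_audio_buffer.append event carrying base64-encoded audio."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })


# When the user releases the push-to-talk button, commit the buffered audio
# (this creates a new user message item in the conversation)...
commit_event = json.dumps({"type": "input_audio_buffer.commit"})

# ...then explicitly ask the server to generate output.
response_event = json.dumps({"type": "response.create"})
```

Each serialized string would be sent as a separate WebSocket text message; the server answers with `input_audio_buffer.committed` and `conversation.item.created` as listed above.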
### Server decision mode

The session can be configured with the `turn_detection` type set to `server_vad`. In this case, the server evaluates user audio from the client (as sent via [`input_audio_buffer.append`](../realtime-audio-reference.md#realtimeclienteventinputaudiobufferappend)) using a voice activity detection (VAD) component. The server automatically uses that audio to initiate response generation on applicable conversations when an end of speech is detected. Silence detection for the VAD can be configured when specifying `server_vad` detection mode.

- The server sends the [`input_audio_buffer.speech_started`](../realtime-audio-reference.md#realtimeservereventinputaudiobufferspeechstarted) event when it detects the start of speech.
- At any time, the client can optionally append audio to the buffer by sending the [`input_audio_buffer.append`](../realtime-audio-reference.md#realtimeclienteventinputaudiobufferappend) event.
- The server sends the [`input_audio_buffer.speech_stopped`](../realtime-audio-reference.md#realtimeservereventinputaudiobufferspeechstopped) event when it detects the end of speech.
- The server commits the input audio buffer by sending the [`input_audio_buffer.committed`](../realtime-audio-reference.md#realtimeservereventinputaudiobuffercommitted) event.
- The server sends the [`conversation.item.created`](../realtime-audio-reference.md#realtimeservereventconversationitemcreated) event with the user message item created from the audio buffer.

:::image type="content" source="../media/how-to/real-time/input-audio-buffer-server-vad.png" alt-text="Diagram of the Realtime API input audio sequence with server decision mode." lightbox="../media/how-to/real-time/input-audio-buffer-server-vad.png":::

<!--
sequenceDiagram
    participant Client as Client
    participant Server as Server
    Server->>Client: input_audio_buffer.speech_started
    Client->>Server: input_audio_buffer.append (optional)
    Server->>Server: Append audio to buffer
    Server->>Client: input_audio_buffer.speech_stopped
    Server->>Server: Commit audio buffer
    Server->>Client: input_audio_buffer.committed
    Server->>Client: conversation.item.created
-->

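On the client side, server decision mode mostly means reacting to the server's VAD events. A minimal sketch follows; the event names come from the reference links above, while the state dictionary and its keys are illustrative assumptions (for example, a real client might pause assistant audio playback when speech starts).

```python
import json


def handle_vad_event(message: str, state: dict) -> dict:
    """Update illustrative client state from server VAD events."""
    event = json.loads(message)
    if event["type"] == "input_audio_buffer.speech_started":
        state["user_speaking"] = True   # e.g. pause assistant audio playback
    elif event["type"] == "input_audio_buffer.speech_stopped":
        state["user_speaking"] = False  # server will commit the buffer next
    elif event["type"] == "input_audio_buffer.committed":
        state["pending_user_item"] = True  # a user message item is being created
    return state


# Example: feed one serialized server event through the handler.
state = handle_vad_event(
    json.dumps({"type": "input_audio_buffer.speech_started"}), {}
)
```
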
## Conversation and response generation

You can have one active conversation per session. The conversation accumulates input signals until a response is started, either via a direct event from the caller or automatically by voice activity detection (VAD).

- The server [`conversation.created`](../realtime-audio-reference.md#realtimeservereventconversationcreated) event is returned right after session creation.
- The client adds new items to the conversation with a [`conversation.item.create`](../realtime-audio-reference.md#realtimeclienteventconversationitemcreate) event.
- The server [`conversation.item.created`](../realtime-audio-reference.md#realtimeservereventconversationitemcreated) event is returned when the client adds a new item to the conversation.

Optionally, the client can truncate or delete items in the conversation:

- The client truncates an earlier assistant audio message item with a [`conversation.item.truncate`](../realtime-audio-reference.md#realtimeclienteventconversationitemtruncate) event.
- The server [`conversation.item.truncated`](../realtime-audio-reference.md#realtimeservereventconversationitemtruncated) event is returned to sync the client and server state.
- The client deletes an item in the conversation with a [`conversation.item.delete`](../realtime-audio-reference.md#realtimeclienteventconversationitemdelete) event.
- The server [`conversation.item.deleted`](../realtime-audio-reference.md#realtimeservereventconversationitemdeleted) event is returned to sync the client and server state.

:::image type="content" source="../media/how-to/real-time/conversation-item-sequence.png" alt-text="Diagram of the Realtime API conversation item sequence." lightbox="../media/how-to/real-time/conversation-item-sequence.png":::

<!--
sequenceDiagram
    participant Client as Client
    participant Server as Server
    Server->>Client: conversation.created
    Client->>Server: conversation.item.create
    Server->>Server: Create item
    Server->>Client: conversation.item.created
    Client->>Server: conversation.item.truncate
    Server->>Server: Truncate item
    Server->>Client: conversation.item.truncated
    Client->>Server: conversation.item.delete
    Server->>Server: Delete item
    Server->>Client: conversation.item.deleted
-->

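As a sketch, a client that detects the user interrupting playback might truncate the assistant's last audio item, and later remove it entirely. The item ID, content index, and millisecond offset below are placeholders, not values from this article.

```python
import json

# Truncate an earlier assistant audio message item at a playback offset.
truncate_event = json.dumps({
    "type": "conversation.item.truncate",
    "item_id": "item_abc123",   # placeholder item ID
    "content_index": 0,         # which content part of the item to truncate
    "audio_end_ms": 1500,       # placeholder: how much audio the user heard
})

# Remove an item from the conversation entirely.
delete_event = json.dumps({
    "type": "conversation.item.delete",
    "item_id": "item_abc123",   # placeholder item ID
})
```

The server acknowledges each with the corresponding `conversation.item.truncated` or `conversation.item.deleted` event, keeping client and server state in sync.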
## Related content

(4 binary image files changed: 92.5 KB, 96.7 KB, 100 KB, 134 KB)
