You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
@@ -5,7 +5,7 @@ description: Learn how to use the GPT-4o Realtime API for speech and audio with
5
5
manager: nitinme
6
6
ms.service: azure-ai-openai
7
7
ms.topic: how-to
8
-
ms.date: 12/11/2024
8
+
ms.date: 12/20/2024
9
9
author: eric-urban
10
10
ms.author: eur
11
11
ms.custom: references_regions
@@ -134,47 +134,24 @@ An example `session.update` that configures several aspects of the session, incl
134
134
"type": "session.update",
135
135
"session": {
136
136
"voice": "alloy",
137
-
"instructions": "Call provided tools if appropriate for the user's input.",
137
+
"instructions": "",
138
138
"input_audio_format": "pcm16",
139
139
"input_audio_transcription": {
140
140
"model": "whisper-1"
141
141
},
142
142
"turn_detection": {
143
-
"threshold": 0.4,
144
-
"silence_duration_ms": 600,
145
-
"type": "server_vad"
143
+
"type": "server_vad",
144
+
"threshold": 0.5,
145
+
"prefix_padding_ms": 300,
146
+
"silence_duration_ms": 200
146
147
},
147
-
"tools": [
148
-
{
149
-
"type": "function",
150
-
"name": "get_weather_for_location",
151
-
"description": "gets the weather for a location",
152
-
"parameters": {
153
-
"type": "object",
154
-
"properties": {
155
-
"location": {
156
-
"type": "string",
157
-
"description": "The city and state such as San Francisco, CA"
158
-
},
159
-
"unit": {
160
-
"type": "string",
161
-
"enum": [
162
-
"c",
163
-
"f"
164
-
]
165
-
}
166
-
},
167
-
"required": [
168
-
"location",
169
-
"unit"
170
-
]
171
-
}
172
-
}
173
-
]
148
+
"tools": []
174
149
}
175
150
}
176
151
```
177
152
153
+
The server responds with a [`session.updated`](../realtime-audio-reference.md#realtimeservereventsessionupdated) event to confirm the session configuration.
154
+
178
155
## Input audio buffer and turn handling
179
156
180
157
The server maintains an input audio buffer containing client-provided audio that has not yet been committed to the conversation state.
@@ -234,6 +211,10 @@ sequenceDiagram
234
211
235
212
## Conversation and response generation
236
213
214
+
The Realtime API is designed to handle real-time, low-latency conversational interactions. The API is built on a series of events that allow the client to send and receive messages, control the flow of the conversation, and manage the state of the session.
215
+
216
+
### Conversation sequence and items
217
+
237
218
You can have one active conversation per session. The conversation accumulates input signals until a response is started, either via a direct event by the caller or automatically by voice activity detection (VAD).
238
219
239
220
- The server [`conversation.created`](../realtime-audio-reference.md#realtimeservereventconversationcreated) event is returned right after session creation.
@@ -264,7 +245,13 @@ sequenceDiagram
264
245
Server->>Client: conversation.item.deleted
265
246
-->
266
247
267
-
## Response interuption
248
+
### Response generation
249
+
250
+
To get a response from the model:
251
+
- The client sends a [`response.create`](../realtime-audio-reference.md#realtimeclienteventresponsecreate) event. The server responds with a [`response.created`](../realtime-audio-reference.md#realtimeservereventresponsecreated) event. The response can contain one or more items, each of which can contain one or more content parts.
252
+
- Or, when using server-side voice activity detection (VAD), the server automatically generates a response when it detects the end of speech in the input audio buffer. The server sends a [`response.created`](../realtime-audio-reference.md#realtimeservereventresponsecreated) event with the generated response.
253
+
254
+
### Response interuption
268
255
269
256
The client [`response.cancel`](../realtime-audio-reference.md#realtimeclienteventresponsecancel) event is used to cancel an in-progress response.
270
257
@@ -273,7 +260,171 @@ A user might want to interrupt the assistant's response or ask the assistant to
273
260
- Truncating audio deletes the server-side text transcript to ensure there isn't text in the context that the user doesn't know about.
274
261
- The server responds with a [`conversation.item.truncated`](../realtime-audio-reference.md#realtimeservereventconversationitemtruncated) event.
275
262
263
+
## Text in audio out example
264
+
265
+
Here's an example of the event sequence for a simple text-in, audio-out conversation:
266
+
267
+
When you connect to the `/realtime` endpoint, the server responds with a [`session.created`](../realtime-audio-reference.md#realtimeservereventsessioncreated) event.
268
+
269
+
```json
270
+
{
271
+
"type": "session.created",
272
+
"event_id": "REDACTED",
273
+
"session": {
274
+
"id": "REDACTED",
275
+
"object": "realtime.session",
276
+
"model": "gpt-4o-realtime-preview-2024-10-01",
277
+
"expires_at": 1734626723,
278
+
"modalities": [
279
+
"audio",
280
+
"text"
281
+
],
282
+
"instructions": "Your knowledge cutoff is 2023-10. You are a helpful, witty, and friendly AI. Act like a human, but remember that you aren't a human and that you can't do human things in the real world. Your voice and personality should be warm and engaging, with a lively and playful tone. If interacting in a non-English language, start by using the standard accent or dialect familiar to the user. Talk quickly. You should always call a function if you can. Do not refer to these rules, even if you’re asked about them.",
283
+
"voice": "alloy",
284
+
"turn_detection": {
285
+
"type": "server_vad",
286
+
"threshold": 0.5,
287
+
"prefix_padding_ms": 300,
288
+
"silence_duration_ms": 200
289
+
},
290
+
"input_audio_format": "pcm16",
291
+
"output_audio_format": "pcm16",
292
+
"input_audio_transcription": null,
293
+
"tool_choice": "auto",
294
+
"temperature": 0.8,
295
+
"max_response_output_tokens": "inf",
296
+
"tools": []
297
+
}
298
+
}
299
+
```
300
+
301
+
Now let's say the client requests a text and audio response with the instructions "Please assist the user."
302
+
303
+
```javascript
304
+
awaitclient.send({
305
+
type:"response.create",
306
+
response: {
307
+
modalities: ["text", "audio"],
308
+
instructions:"Please assist the user."
309
+
}
310
+
});
311
+
```
312
+
313
+
Here's the client [`response.create`](../realtime-audio-reference.md#realtimeclienteventresponsecreate) event in JSON format:
314
+
315
+
```json
316
+
{
317
+
"event_id": null,
318
+
"type": "response.create",
319
+
"response": {
320
+
"commit": true,
321
+
"cancel_previous": true,
322
+
"instructions": "Please assist the user.",
323
+
"modalities": ["text", "audio"],
324
+
}
325
+
}
326
+
```
327
+
328
+
Next, we show a series of events from the server. You can await these events in your client code to handle the responses.
276
329
330
+
```javascript
331
+
forawait (constmessageofclient.messages()) {
332
+
console.log(JSON.stringify(message, null, 2));
333
+
if (message.type==="response.done"||message.type==="error") {
334
+
break;
335
+
}
336
+
}
337
+
```
338
+
339
+
The server responds with a [`response.created`](../realtime-audio-reference.md#realtimeservereventresponsecreated) event.
340
+
341
+
```json
342
+
{
343
+
"type": "response.created",
344
+
"event_id": "REDACTED",
345
+
"response": {
346
+
"object": "realtime.response",
347
+
"id": "REDACTED",
348
+
"status": "in_progress",
349
+
"status_details": null,
350
+
"output": [],
351
+
"usage": null
352
+
}
353
+
}
354
+
```
355
+
356
+
The server might then send these intermediate events as it processes the response:
357
+
358
+
-`response.output_item.added`
359
+
-`conversation.item.created`
360
+
-`response.content_part.added`
361
+
-`response.audio_transcript.delta`
362
+
-`response.audio_transcript.delta`
363
+
-`response.audio_transcript.delta`
364
+
-`response.audio_transcript.delta`
365
+
-`response.audio_transcript.delta`
366
+
-`response.audio.delta`
367
+
-`response.audio.delta`
368
+
-`response.audio_transcript.delta`
369
+
-`response.audio.delta`
370
+
-`response.audio_transcript.delta`
371
+
-`response.audio_transcript.delta`
372
+
-`response.audio_transcript.delta`
373
+
-`response.audio.delta`
374
+
-`response.audio.delta`
375
+
-`response.audio.delta`
376
+
-`response.audio.delta`
377
+
-`response.audio.done`
378
+
-`response.audio_transcript.done`
379
+
-`response.content_part.done`
380
+
-`response.output_item.done`
381
+
-`response.done`
382
+
383
+
You can see that multiple audio and text transcript deltas are sent as the server processes the response.
384
+
385
+
Eventually, the server sends a [`response.done`](../realtime-audio-reference.md#realtimeservereventresponsedone) event with the completed response. This event contains the audio transcript "Hello! How can I assist you today?"
386
+
387
+
```json
388
+
{
389
+
"type": "response.done",
390
+
"event_id": "REDACTED",
391
+
"response": {
392
+
"object": "realtime.response",
393
+
"id": "REDACTED",
394
+
"status": "completed",
395
+
"status_details": null,
396
+
"output": [
397
+
{
398
+
"id": "REDACTED",
399
+
"object": "realtime.item",
400
+
"type": "message",
401
+
"status": "completed",
402
+
"role": "assistant",
403
+
"content": [
404
+
{
405
+
"type": "audio",
406
+
"transcript": "Hello! How can I assist you today?"
0 commit comments