Commit a96c931

real-time how to

1 parent e88b6f8 commit a96c931

File tree

3 files changed: +80 −83 lines changed

articles/ai-services/openai/how-to/realtime-audio.md

Lines changed: 67 additions & 82 deletions
@@ -5,7 +5,7 @@ description: Learn how to use the GPT-4o Realtime API for speech and audio with
 manager: nitinme
 ms.service: azure-ai-openai
 ms.topic: how-to
-ms.date: 12/6/2024
+ms.date: 12/11/2024
 author: eric-urban
 ms.author: eur
 ms.custom: references_regions
@@ -14,7 +14,9 @@ recommendations: false
 
 # How to use the GPT-4o Realtime API for speech and audio (Preview)
 
-Azure OpenAI GPT-4o Realtime API for speech and audio is part of the GPT-4o model family that supports low-latency, "speech in, speech out" conversational interactions. The GPT-4o Realtime API is designed to handle real-time, low-latency conversational interactions, making it a great fit for use cases involving live interactions between a user and a model, such as customer support agents, voice assistants, and real-time translators.
+[!INCLUDE [Feature preview](../includes/preview-feature.md)]
+
+Azure OpenAI GPT-4o Realtime API for speech and audio is part of the GPT-4o model family that supports low-latency, "speech in, speech out" conversational interactions. The GPT-4o Realtime API is designed to handle real-time, low-latency conversational interactions. The Realtime API is a great fit for use cases involving live interactions between a user and a model, such as customer support agents, voice assistants, and real-time translators.
 
 Most users of the Realtime API need to deliver and receive audio from an end-user in real time, including applications that use WebRTC or a telephony system. The Realtime API isn't designed to connect directly to end user devices and relies on client integrations to terminate end user audio streams.

@@ -52,13 +54,14 @@ Right now, the fastest way to get started development with the GPT-4o Realtime A
 
 [The Azure-Samples/aisearch-openai-rag-audio repo](https://github.com/Azure-Samples/aisearch-openai-rag-audio) contains an example of how to implement RAG support in applications that use voice as their user interface, powered by the GPT-4o realtime API for audio.
 
-## Architecture
+## Connection and authentication
 
-The Realtime API (via `/realtime`) is built on [the WebSockets API](https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API) to facilitate fully asynchronous streaming communication between the end user and model. Device details like capturing and rendering audio data are outside the scope of the Realtime API. It should be used in the context of a trusted, intermediate service that manages both connections to end users and model endpoint connections. Don't use it directly from untrusted end user devices.
+The Realtime API (via `/realtime`) is built on [the WebSockets API](https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API) to facilitate fully asynchronous streaming communication between the end user and model.
 
-### Connection and authentication with the Realtime API
+> [!IMPORTANT]
+> Device details like capturing and rendering audio data are outside the scope of the Realtime API. It should be used in the context of a trusted, intermediate service that manages both connections to end users and model endpoint connections. Don't use it directly from untrusted end user devices.
 
-The Realtime API requires an existing Azure OpenAI resource endpoint in a supported region. The API is accessed via a secure WebSocket connection to the `/realtime` endpoint of your Azure OpenAI resource.
+The Realtime API is accessed via a secure WebSocket connection to the `/realtime` endpoint of your Azure OpenAI resource.
 
 You can construct a full request URI by concatenating:

@@ -71,7 +74,7 @@ You can construct a full request URI by concatenating:
 The following example is a well-constructed `/realtime` request URI:
 
 ```http
-wss://my-eastus2-openai-resource.openai.azure.com/openai/realtime?api-version=2024-10-01-preview&deployment=gpt-4o-realtime-preview-1001
+wss://my-eastus2-openai-resource.openai.azure.com/openai/realtime?api-version=2024-10-01-preview&deployment=gpt-4o-realtime-preview-deployment-name
 ```
 
 To authenticate:
@@ -80,98 +83,80 @@ To authenticate:
 - Using an `api-key` connection header on the prehandshake connection. This option isn't available in a browser environment.
 - Using an `api-key` query string parameter on the request URI. Query string parameters are encrypted when using https/wss.
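For orientation, here's a minimal sketch of opening an authenticated connection from a trusted middle tier using the third-party Python `websockets` package. The resource name, deployment name, and environment variable are placeholders, not values from this article:

```python
# Minimal connection sketch, assuming the third-party `websockets` package
# (`pip install websockets`); resource and deployment names are placeholders.
import asyncio
import json
import os

import websockets

URI = (
    "wss://my-eastus2-openai-resource.openai.azure.com/openai/realtime"
    "?api-version=2024-10-01-preview"
    "&deployment=gpt-4o-realtime-preview-deployment-name"
)

async def main() -> None:
    # Authenticate with an `api-key` header on the prehandshake connection.
    # Note: versions of `websockets` before 14 call this parameter `extra_headers`.
    headers = {"api-key": os.environ["AZURE_OPENAI_API_KEY"]}
    async with websockets.connect(URI, additional_headers=headers) as ws:
        # The first server event acknowledges the newly created session.
        event = json.loads(await ws.recv())
        print(event["type"])

asyncio.run(main())
```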
 
+## Realtime API architecture
 
-### API concepts
+Once the WebSocket connection session to `/realtime` is established and authenticated, the functional interaction takes place via events sent and received as WebSocket messages. These events each take the form of a JSON object. Events can be sent and received in parallel, and applications should generally handle them both concurrently and asynchronously.
 
 - A caller establishes a connection to `/realtime`, which starts a new `session`.
 - A `session` automatically creates a default `conversation`. Multiple concurrent conversations aren't supported.
-- The `conversation` accumulates input signals until a `response` is started, either via a direct command by the caller or automatically by voice-activity-based (VAD) turn detection.
+- The `conversation` accumulates input signals until a `response` is started, either via a direct event by the caller or automatically by voice activity detection (VAD) turn detection.
 - Each `response` consists of one or more `items`, which can encapsulate messages, function calls, and other information.
 - Each message `item` has `content_part` elements, allowing multiple modalities (text and audio) to be represented across a single item.
 - The `session` manages configuration of caller input handling (for example, user audio) and common output generation handling.
 - Each caller-initiated `response.create` can override some of the output `response` behavior, if desired.
 - Server-created `item` entries and the `content_part` elements in messages can be populated asynchronously and in parallel. For example, receiving audio, text, and function information concurrently in a round-robin fashion.
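To make the concurrent, asynchronous event handling described above concrete, here's a minimal receive-loop sketch. It assumes the `ws` connection from the earlier example; the dispatch logic is illustrative only:

```python
# Receive-loop sketch: `ws` is assumed to be an open, authenticated /realtime
# connection (see the connection example earlier in this article).
import json

def handle_event(event: dict) -> None:
    # Placeholder dispatch; a real app routes deltas to audio/text sinks and
    # tracks response and item lifecycle events as they arrive.
    print(event.get("type"))

async def receive_events(ws) -> None:
    # Server events arrive in parallel with whatever the caller is sending,
    # so run this loop as its own task alongside the sending coroutine, for
    # example: asyncio.gather(receive_events(ws), send_caller_audio(ws)).
    async for message in ws:
        handle_event(json.loads(message))
```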
 
-## API details
-
-Once the WebSocket connection session to `/realtime` is established and authenticated, the functional interaction takes place via sending and receiving WebSocket messages, that we refer to as "commands" to avoid ambiguity with the content-bearing "message" concept already present for inference. These commands each take the form of a JSON object. Commands can be sent and received in parallel and applications should generally handle them both concurrently and asynchronously.
+## Session configuration and turn handling mode
 
-### Session configuration and turn handling mode
-
-Often, the first command sent by the caller on a newly established `/realtime` session is a `session.update` payload. This command controls a wide set of input and output behavior, with output and response generation portions then later overridable via `update_conversation_config` or other properties in `response.create`.
+Often, the first event sent by the caller on a newly established `/realtime` session is a `session.update` payload. This event controls a wide set of input and output behavior, with output and response generation portions then later overridable via `response.create` properties.
 
 One of the key session-wide settings is `turn_detection`, which controls how data flow is handled between the caller and model:
 
-- `server_vad` evaluates incoming user audio (as sent via `add_user_audio`) using a voice activity detector (VAD) component and automatically use that audio to initiate response generation on applicable conversations when an end of speech is detected. Silence detection for the VAD can be configured when specifying `server_vad` detection mode.
-- `none` relies on caller-initiated `input_audio_buffer.commit` and `response.create` commands to progress conversations and produce output. This setting is useful for push-to-talk applications or situations that have external audio flow control (such as caller-side VAD component). These manual signals can still be used in `server_vad` mode to supplement VAD-initiated response generation.
+- `server_vad` evaluates incoming user audio (as sent via `input_audio_buffer.append`) using a voice activity detector (VAD) component and automatically uses that audio to initiate response generation on applicable conversations when an end of speech is detected. Silence detection for the VAD can be configured when specifying `server_vad` detection mode.
+- `none` relies on caller-initiated `input_audio_buffer.commit` and `response.create` events to progress conversations and produce output. This setting is useful for push-to-talk applications or situations that have external audio flow control (such as a caller-side VAD component). These manual signals can still be used in `server_vad` mode to supplement VAD-initiated response generation.
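As a sketch of the manual flow that `none` mode describes, assuming an open connection `ws` and pre-encoded audio (the field names are assumptions based on the 2024-10-01-preview event shapes):

```python
# Push-to-talk sketch for `turn_detection: none`. Assumes `ws` is an open,
# authenticated /realtime connection and `audio_b64` holds base64-encoded
# audio in the session's configured input format.
import json

async def push_to_talk_turn(ws, audio_b64: str) -> None:
    # Stream caller audio into the shared user input buffer.
    await ws.send(json.dumps({"type": "input_audio_buffer.append", "audio": audio_b64}))
    # Commit the buffered audio, ending the caller's logical turn.
    await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
    # Explicitly request response generation; server VAD isn't doing it for us.
    await ws.send(json.dumps({"type": "response.create"}))
```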
 
 Transcription of user input audio is opted into via the `input_audio_transcription` property. Specifying a transcription model (`whisper-1`) in this configuration enables the delivery of `conversation.item.audio_transcription.completed` events.
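For reference, the opt-in can be expressed as a small `session.update` fragment; this is a sketch, and the full session update example later in this article shows the same property in context:

```python
# Fragment of a `session.update` payload that opts into input audio
# transcription; see the full session update example later in this article.
transcription_opt_in = {
    "type": "session.update",
    "session": {
        "input_audio_transcription": {"model": "whisper-1"},
    },
}
```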
 
-## Summary of commands
-
-Here's a summary of the commands that can be [sent](#requests) and [received](#responses) via the `/realtime` endpoint.
-
-### Requests
-
-The following table describes commands sent from the caller to the `/realtime` endpoint.
-
-| `type` | Description |
-|---|---|
-| **Session Configuration** | |
-| `session.update` | Configures the connection-wide behavior of the conversation session such as shared audio input handling and common response generation characteristics. This is typically sent immediately after connecting, but can also be sent at any point during a session to reconfigure behavior after the current response (if in progress) is complete. |
-| **Input Audio** | |
-| `input_audio_buffer_append` | Appends audio data to the shared user input buffer. This audio isn't processed until an end of speech is detected in the `server_vad` `turn_detection` mode or until a manual `response.create` is sent (in either `turn_detection` configuration). |
-| `input_audio_buffer_clear` | Clears the current audio input buffer. This doesn't affect responses already in progress. |
-| `input_audio_buffer_commit` | Commits the current state of the user input buffer to subscribed conversations, including it as information for the next response. |
-| **Item Management** | For establishing history or including nonaudio item information. |
-| `item_create` | Inserts a new item into the conversation, optionally positioned according to `previous_item_id`. This property can provide new, nonaudio input from the user (such as a text message), tool responses, or historical information from another interaction to form a conversation history before generation. |
-| `item_delete` | Removes an item from an existing conversation. |
-| `item_truncate` | Manually shortens text and audio content in a message. This property can be useful in situations where faster-than-realtime model generation produced more data that's later skipped by an interruption. |
-| **Response Management** | |
-| `response.create` | Initiates model processing of unprocessed conversation input, signifying the end of the caller's logical turn. `server_vad` `turn_detection` mode automatically triggers generation at end of speech, but `response.create` must be called in other circumstances (such as text input, tool responses, and `none` mode) to signal that the conversation should continue. The `response.create` should be invoked after the `response.done` command from the model that confirms all tool calls and other messages are provided. |
-| `response.cancel` | Cancels an in-progress response. |
-
-### Responses
-
-The following table describes commands sent by the `/realtime` endpoint to the caller.
-
-| `type` | Description |
-|---|---|
-| **Session** | |
-| `session_created` | Sent as soon as the connection is successfully established. Provides a connection-specific ID that might be useful for debugging or logging. |
-| **Caller Item Acknowledgement** | |
-| `item_created` | Provides acknowledgment that a new conversation item is inserted into a conversation. |
-| `item_deleted` | Provides acknowledgment that an existing conversation item is removed from a conversation. |
-| `item_truncated` | Provides acknowledgment that an existing item in a conversation is truncated. |
-| **Response Flow** | |
-| `response_created` | Notifies that a new response is started for a conversation. This snapshots input state and begins generation of new items. Until `response_done` signifies the end of the response, a response can create items via `response_output_item_added` that are then populated via `delta` commands. |
-| `response_done` | Notifies that a response generation is complete for a conversation. |
-| `response_cancelled` | Confirms that a response was canceled in response to a caller-initiated or internal signal. |
-| `rate_limits_updated` | This response is sent immediately after `response.done`, this property provides the current rate limit information reflecting updated status after the consumption of the just-finished response. |
-| **Item Flow in a Response** | |
-| `response_output_item_added` | Notifies that a new, server-generated conversation item *is being created*; content is then be populated via incremental `add_content` messages with a final `response_output_item_done` command signifying the item creation completed. |
-| `response_output_item_done` | Notifies that a new conversation item is added to a conversation. For model-generated messages, this property is preceded by `response_output_item_added` and `delta` commands which begin and populate the new item, respectively. |
-| **Content Flow within Response Items** | |
-| `response_content_part_added` | Notifies that a new content part is being created within a conversation item in an ongoing response. Until `response_content_part_done` arrives, content is then incrementally provided via the appropriate `delta` commands. |
-| `response_content_part_done` | Signals that a newly created content part is complete and receives no further incremental updates. |
-| `response_audio_delta` | Provides an incremental update to a binary audio data content part generated by the model. |
-| `response_audio_done` | Signals that an audio content part's incremental updates are complete. |
-| `response_audio_transcript_delta` | Provides an incremental update to the audio transcription associated with the output audio content generated by the model. |
-| `response_audio_transcript_done` | Signals that the incremental updates to audio transcription of output audio are complete. |
-| `response_text_delta` | Provides an incremental update to a text content part within a conversation message item. |
-| `response_text_done` | Signals that the incremental updates to a text content part are complete. |
-| `response_function_call_arguments_delta` | Provides an incremental update to the arguments of a function call, as represented within an item in a conversation. |
-| `response_function_call_arguments_done` | Signals that incremental function call arguments are complete and that accumulated arguments can now be used in their entirety. |
-| **User Input Audio** | |
-| `input_audio_buffer_speech_started` | When you use configured voice activity detection, this command notifies that a start of user speech is detected within the input audio buffer at a specific audio sample index. |
-| `input_audio_buffer_speech_stopped` | When you use configured voice activity detection, this command notifies that an end of user speech is detected within the input audio buffer at a specific audio sample index. This setting automatically triggers response generation when configured. |
-| `item_input_audio_transcription_completed` | Notifies that a supplementary transcription of the user's input audio buffer is available. This behavior must be opted into via the `input_audio_transcription` property in `session.update`. |
-| `item_input_audio_transcription_failed` | Notifies that input audio transcription failed. |
-| `input_audio_buffer_committed` | Provides acknowledgment that the current state of the user audio input buffer is submitted to subscribed conversations. |
-| `input_audio_buffer_cleared` | Provides acknowledgment that the pending user audio input buffer is cleared. |
-| **Other** | |
-| `error` | Indicates that something went wrong while processing data on the session. Includes an `error` message that provides more detail. |
+### Session update example
+
+The following example `session.update` configures several aspects of the session, including tools. All session parameters are optional; not everything needs to be configured.
+
+```json
+{
+  "type": "session.update",
+  "session": {
+    "voice": "alloy",
+    "instructions": "Call provided tools if appropriate for the user's input.",
+    "input_audio_format": "pcm16",
+    "input_audio_transcription": {
+      "model": "whisper-1"
+    },
+    "turn_detection": {
+      "threshold": 0.4,
+      "silence_duration_ms": 600,
+      "type": "server_vad"
+    },
+    "tools": [
+      {
+        "type": "function",
+        "name": "get_weather_for_location",
+        "description": "gets the weather for a location",
+        "parameters": {
+          "type": "object",
+          "properties": {
+            "location": {
+              "type": "string",
+              "description": "The city and state e.g. San Francisco, CA"
+            },
+            "unit": {
+              "type": "string",
+              "enum": ["c", "f"]
+            }
+          },
+          "required": ["location", "unit"]
+        }
+      }
+    ]
+  }
+}
+```
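With the `turn_detection` settings shown, server VAD uses a 0.4 activation threshold and treats 600 milliseconds of silence as the end of speech. The `input_audio_transcription` entry opts the session into transcription of user audio, and each tool's `parameters` object follows JSON Schema conventions.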
 
 ## Related content

articles/ai-services/openai/includes/preview-feature.md
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+---
+title: include file
+description: include file
+ms.topic: include
+ms.date: 12/11/2024
+ms.custom: include
+---
+
+> [!NOTE]
+> This feature is currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/).

articles/ai-services/openai/realtime-audio-quickstart.md

Lines changed: 3 additions & 1 deletion
@@ -5,7 +5,7 @@ description: Learn how to use GPT-4o Realtime API for speech and audio with Azur
 manager: nitinme
 ms.service: azure-ai-openai
 ms.topic: how-to
-ms.date: 10/3/2024
+ms.date: 12/11/2024
 author: eric-urban
 ms.author: eur
 ms.custom: references_regions, ignite-2024
@@ -15,6 +15,8 @@ recommendations: false
 
 # GPT-4o Realtime API for speech and audio (Preview)
 
+[!INCLUDE [Feature preview](includes/preview-feature.md)]
+
 Azure OpenAI GPT-4o Realtime API for speech and audio is part of the GPT-4o model family that supports low-latency, "speech in, speech out" conversational interactions. The GPT-4o audio `realtime` API is designed to handle real-time, low-latency conversational interactions, making it a great fit for use cases involving live interactions between a user and a model, such as customer support agents, voice assistants, and real-time translators.
 
 Most users of the Realtime API need to deliver and receive audio from an end-user in real time, including applications that use WebRTC or a telephony system. The Realtime API isn't designed to connect directly to end user devices and relies on client integrations to terminate end user audio streams.
