articles/ai-services/openai/how-to/realtime-audio.md (+23 −16)
Before you can use GPT-4o real-time audio, you need:

- An Azure subscription - <a href="https://azure.microsoft.com/free/cognitive-services" target="_blank">Create one for free</a>.
- An Azure OpenAI resource created in a [supported region](#supported-models). For more information, see [Create a resource and deploy a model with Azure OpenAI](create-resource.md).
- A deployment of the `gpt-4o-realtime-preview` model in a supported region as described in the [supported models](#supported-models) section. You can deploy the model from the [Azure AI Foundry portal model catalog](../../../ai-studio/how-to/model-catalog-overview.md) or from your project in the Azure AI Foundry portal.

For steps to deploy and use the `gpt-4o-realtime-preview` model, see [the real-time audio quickstart](../realtime-audio-quickstart.md).

For more information about the API and architecture, see the remaining sections in this guide.

## Sample code

Right now, the fastest way to get started developing with the GPT-4o Realtime API is to download the sample code from the [Azure OpenAI GPT-4o real-time audio repository on GitHub](https://github.com/azure-samples/aoai-realtime-audio-sdk).

[The Azure-Samples/aisearch-openai-rag-audio repo](https://github.com/Azure-Samples/aisearch-openai-rag-audio) contains an example of how to implement RAG support in applications that use voice as their user interface, powered by the GPT-4o Realtime API for audio.

## Architecture

The Realtime API (via `/realtime`) is built on [the WebSockets API](https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API).

The Realtime API requires an existing Azure OpenAI resource endpoint in a supported region. The API is accessed via a secure WebSocket connection to the `/realtime` endpoint of your Azure OpenAI resource.

You can construct a full request URI by concatenating:

- The secure WebSocket (`wss://`) protocol.
- Your Azure OpenAI resource endpoint hostname, for example, `my-aoai-resource.openai.azure.com`.
- The `openai/realtime` API path.
- An `api-version` query string parameter for a supported API version, such as `2024-10-01-preview`.
- A `deployment` query string parameter with the name of your `gpt-4o-realtime-preview` model deployment.

The following example is a well-constructed `/realtime` request URI:
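The example URI itself isn't shown in this excerpt. As a sketch, the components listed above can be assembled programmatically; the resource hostname and deployment name below are placeholders, not values from this article:

```python
# Assemble a /realtime request URI from the components listed above.
# The resource hostname and deployment name are placeholders.
from urllib.parse import urlencode

resource_host = "my-aoai-resource.openai.azure.com"
query = urlencode({
    "api-version": "2024-10-01-preview",
    "deployment": "gpt-4o-realtime-preview",  # your model deployment name
})
request_uri = f"wss://{resource_host}/openai/realtime?{query}"
print(request_uri)
```

This prints `wss://my-aoai-resource.openai.azure.com/openai/realtime?api-version=2024-10-01-preview&deployment=gpt-4o-realtime-preview`.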
To authenticate:

- **Microsoft Entra** (recommended): Use token-based authentication with the `/realtime` API for an Azure OpenAI Service resource with managed identity enabled. Apply a retrieved authentication token using a `Bearer` token with the `Authorization` header.
- **API key**: An `api-key` can be provided in one of two ways:
  - Using an `api-key` connection header on the prehandshake connection. This option isn't available in a browser environment.
  - Using an `api-key` query string parameter on the request URI. Query string parameters are encrypted when using https/wss.

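As a minimal sketch of these two options, the following helper builds the connection headers for either scheme. The function name and shape are illustrative, not part of any SDK:

```python
# Build WebSocket connection headers for /realtime authentication.
# Helper name and signature are illustrative, not from an SDK.
def auth_headers(api_key=None, entra_token=None):
    if entra_token is not None:
        # Microsoft Entra (recommended): Bearer token in the Authorization header.
        return {"Authorization": f"Bearer {entra_token}"}
    if api_key is not None:
        # API key option 1: api-key connection header (not usable in browsers).
        return {"api-key": api_key}
    raise ValueError("Provide either an API key or a Microsoft Entra token.")

print(auth_headers(api_key="YOUR_API_KEY"))
```

Option 2, the `api-key` query string parameter, would instead be appended to the request URI.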
### API concepts
## API details

Once the WebSocket connection session to `/realtime` is established and authenticated, the functional interaction takes place via sending and receiving WebSocket messages, which we refer to as "commands" to avoid ambiguity with the content-bearing "message" concept already present for inference. These commands each take the form of a JSON object. Commands can be sent and received in parallel, and applications should generally handle them both concurrently and asynchronously.
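Because every command is a JSON object, a client can treat frames uniformly. A minimal sketch, with helper names that are mine rather than from the API:

```python
import json

# Commands travel as JSON text frames; each carries a "type" field.
def encode_command(command_type, **fields):
    return json.dumps({"type": command_type, **fields})

def decode_command(frame):
    return json.loads(frame)

frame = encode_command("input_audio_buffer_clear")
print(decode_command(frame))
```

The command type used here (`input_audio_buffer_clear`) is taken from the tables later in this guide.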
### Session configuration and turn handling mode

Often, the first command sent by the caller on a newly established `/realtime` session is a `session.update` payload. This command controls a wide set of input and output behavior, with output and response generation portions then later overridable via `update_conversation_config` or other properties in `response.create`.
One of the key session-wide settings is `turn_detection`, which controls how data flow is handled between the caller and model:

Transcription of user input audio is opted into via the `input_audio_transcription` property.

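Putting these session settings together, here's a sketch of an initial `session.update` payload. The nested field shapes (`turn_detection`, `input_audio_transcription`, and the `whisper-1` model name) are assumptions based on the preview API surface rather than values confirmed by this article:

```python
import json

# Sketch of a first session.update command; nested field shapes are assumed.
session_update = {
    "type": "session.update",
    "session": {
        "turn_detection": {"type": "server_vad"},             # service detects end of user speech
        "input_audio_transcription": {"model": "whisper-1"},  # opt in to input transcription
    },
}
frame = json.dumps(session_update)  # send this text frame on the /realtime WebSocket
```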
## Summary of commands

Here's a summary of the commands that can be [sent](#requests) and [received](#responses) via the `/realtime` endpoint.

### Requests

The following table describes commands sent from the caller to the `/realtime` endpoint.

| **Session Configuration** | |
| `session.update` | Configures the connection-wide behavior of the conversation session, such as shared audio input handling and common response generation characteristics. This command is typically sent immediately after connecting, but it can also be sent at any point during a session to reconfigure behavior after the current response (if in progress) is complete. |
| **Input Audio** | |
| `input_audio_buffer_append` | Appends audio data to the shared user input buffer. This audio isn't processed until an end of speech is detected in the `server_vad` `turn_detection` mode or until a manual `response.create` is sent (in either `turn_detection` configuration). |
| `input_audio_buffer_clear` | Clears the current audio input buffer. This doesn't affect responses already in progress. |
| `input_audio_buffer_commit` | Commits the current state of the user input buffer to subscribed conversations, including it as information for the next response. |
| **Item Management** | For establishing history or including nonaudio item information. |
| `item_create` | Inserts a new item into the conversation, optionally positioned according to `previous_item_id`. This command can provide new, nonaudio input from the user (such as a text message), tool responses, or historical information from another interaction to form a conversation history before generation. |
| `item_delete` | Removes an item from an existing conversation. |
| `item_truncate` | Manually shortens text and audio content in a message. This command can be useful in situations where faster-than-realtime model generation produced more data that's later skipped by an interruption. |
| **Response Management** | |

### Responses

The following table describes commands sent by the `/realtime` endpoint to the caller.

| `response_cancelled` | Confirms that a response was canceled in response to a caller-initiated or internal signal. |
| `rate_limits_updated` | Sent immediately after `response.done`, this command provides the current rate limit information, reflecting updated status after the consumption of the just-finished response. |
| **Item Flow in a Response** | |
| `response_output_item_added` | Notifies that a new, server-generated conversation item *is being created*; content is then populated via incremental `add_content` messages, with a final `response_output_item_done` command signifying that the item creation is complete. |
| `response_output_item_done` | Notifies that a new conversation item is added to a conversation. For model-generated messages, this command is preceded by `response_output_item_added` and `delta` commands, which begin and populate the new item, respectively. |
| **Content Flow within Response Items** | |
| `response_content_part_added` | Notifies that a new content part is being created within a conversation item in an ongoing response. Until `response_content_part_done` arrives, content is then incrementally provided via the appropriate `delta` commands. |
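Because response commands arrive asynchronously and interleaved, a client typically dispatches on each command's `type` field. A minimal sketch; the handler registry is illustrative, and the command type is taken from the table above:

```python
import json

# Dispatch received /realtime frames to handlers keyed by command type.
def handle_frame(frame, handlers):
    command = json.loads(frame)
    handler = handlers.get(command["type"])
    if handler is not None:  # unknown command types are ignored
        handler(command)

done_items = []
handlers = {"response_output_item_done": done_items.append}
handle_frame(json.dumps({"type": "response_output_item_done", "item": {}}), handlers)
print(len(done_items))
```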

## Related content

* Try the [real-time audio quickstart](../realtime-audio-quickstart.md)
* Learn more about Azure OpenAI [quotas and limits](../quotas-limits.md)