
Commit dc9b931

Merge pull request #1876 from eric-urban/eur/how-to-real-time
how to real time
2 parents 31034cb + 8e91645 commit dc9b931

File tree

4 files changed: +180 -2 lines changed
Lines changed: 164 additions & 0 deletions
@@ -0,0 +1,164 @@
---
title: 'How to use the GPT-4o Realtime API for speech and audio with Azure OpenAI Service'
titleSuffix: Azure OpenAI
description: Learn how to use the GPT-4o Realtime API for speech and audio with Azure OpenAI Service.
manager: nitinme
ms.service: azure-ai-openai
ms.topic: how-to
ms.date: 12/11/2024
author: eric-urban
ms.author: eur
ms.custom: references_regions
recommendations: false
---

# How to use the GPT-4o Realtime API for speech and audio (Preview)

[!INCLUDE [Feature preview](../includes/preview-feature.md)]

Azure OpenAI GPT-4o Realtime API for speech and audio is part of the GPT-4o model family that supports low-latency, "speech in, speech out" conversational interactions. The Realtime API is designed to handle real-time, low-latency conversational interactions, making it a great fit for use cases involving live interactions between a user and a model, such as customer support agents, voice assistants, and real-time translators.

Most users of the Realtime API need to deliver and receive audio from an end-user in real time, including applications that use WebRTC or a telephony system. The Realtime API isn't designed to connect directly to end user devices and relies on client integrations to terminate end user audio streams.

## Supported models

Currently only the `gpt-4o-realtime-preview` model version `2024-10-01-preview` supports real-time audio.

The `gpt-4o-realtime-preview` model is available for global deployments in [East US 2 and Sweden Central regions](../concepts/models.md#global-standard-model-availability).

> [!IMPORTANT]
> The system stores your prompts and completions as described in the "Data Use and Access for Abuse Monitoring" section of the service-specific Product Terms for Azure OpenAI Service, except that the Limited Exception does not apply. Abuse monitoring will be turned on for use of the `gpt-4o-realtime-preview` API even for customers who otherwise are approved for modified abuse monitoring.

## API support

Support for the Realtime API was first added in API version `2024-10-01-preview`.

> [!NOTE]
> For more information about the API and architecture, see the [Azure OpenAI GPT-4o real-time audio repository on GitHub](https://github.com/azure-samples/aoai-realtime-audio-sdk).

## Get started

Before you can use GPT-4o real-time audio, you need:

- An Azure subscription - <a href="https://azure.microsoft.com/free/cognitive-services" target="_blank">Create one for free</a>.
- An Azure OpenAI resource created in a [supported region](#supported-models). For more information, see [Create a resource and deploy a model with Azure OpenAI](create-resource.md).
- A deployment of the `gpt-4o-realtime-preview` model in a supported region as described in the [supported models](#supported-models) section. You can deploy the model from the [Azure AI Foundry portal model catalog](../../../ai-studio/how-to/model-catalog-overview.md) or from your project in the AI Foundry portal.

For steps to deploy and use the `gpt-4o-realtime-preview` model, see [the real-time audio quickstart](../realtime-audio-quickstart.md).

For more information about the API and architecture, see the remaining sections in this guide.

## Sample code

Right now, the fastest way to start developing with the GPT-4o Realtime API is to download the sample code from the [Azure OpenAI GPT-4o real-time audio repository on GitHub](https://github.com/azure-samples/aoai-realtime-audio-sdk).

[The Azure-Samples/aisearch-openai-rag-audio repo](https://github.com/Azure-Samples/aisearch-openai-rag-audio) contains an example of how to implement RAG support in applications that use voice as their user interface, powered by the GPT-4o Realtime API for audio.

## Connection and authentication

The Realtime API (via `/realtime`) is built on [the WebSockets API](https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API) to facilitate fully asynchronous streaming communication between the end user and model.

> [!IMPORTANT]
> Device details like capturing and rendering audio data are outside the scope of the Realtime API. It should be used in the context of a trusted, intermediate service that manages both connections to end users and model endpoint connections. Don't use it directly from untrusted end user devices.

The Realtime API is accessed via a secure WebSocket connection to the `/realtime` endpoint of your Azure OpenAI resource.

You can construct a full request URI by concatenating:

- The secure WebSocket (`wss://`) protocol
- Your Azure OpenAI resource endpoint hostname, for example, `my-aoai-resource.openai.azure.com`
- The `openai/realtime` API path
- An `api-version` query string parameter for a supported API version such as `2024-10-01-preview`
- A `deployment` query string parameter with the name of your `gpt-4o-realtime-preview` model deployment

The following example is a well-constructed `/realtime` request URI:

```http
wss://my-eastus2-openai-resource.openai.azure.com/openai/realtime?api-version=2024-10-01-preview&deployment=gpt-4o-realtime-preview-deployment-name
```

To authenticate:
- **Microsoft Entra** (recommended): Use token-based authentication with the `/realtime` API for an Azure OpenAI Service resource with managed identity enabled. Apply a retrieved authentication token using a `Bearer` token with the `Authorization` header.
- **API key**: An `api-key` can be provided in one of two ways:
  - Using an `api-key` connection header on the prehandshake connection. This option isn't available in a browser environment.
  - Using an `api-key` query string parameter on the request URI. Query string parameters are encrypted when using https/wss.

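As a concrete illustration, the following Python sketch (not part of the official samples) combines the request URI with an `api-key` header to open a connection. It assumes the third-party `websockets` package, and the resource hostname, deployment name, and environment variable are placeholders; the header keyword argument name can differ between versions of that package.

```python
# Minimal sketch (not an official sample): open a /realtime WebSocket connection
# with an API key. Assumes the third-party `websockets` package; in some versions
# the header argument is named `additional_headers` instead of `extra_headers`.
import asyncio
import json
import os

import websockets

RESOURCE_HOST = "my-aoai-resource.openai.azure.com"      # placeholder endpoint hostname
DEPLOYMENT = "gpt-4o-realtime-preview-deployment-name"   # placeholder deployment name
API_KEY = os.environ["AZURE_OPENAI_API_KEY"]             # assumed environment variable

URI = (
    f"wss://{RESOURCE_HOST}/openai/realtime"
    f"?api-version=2024-10-01-preview&deployment={DEPLOYMENT}"
)

async def main() -> None:
    # The `api-key` header is sent on the prehandshake connection, as described above.
    async with websockets.connect(URI, extra_headers={"api-key": API_KEY}) as ws:
        # Print the first event the server sends once the session is established.
        print(json.loads(await ws.recv()))

asyncio.run(main())
```

A Microsoft Entra token could be supplied in the same way by sending an `Authorization` header with a `Bearer` token instead of the `api-key` header.
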
## Realtime API architecture

Once the WebSocket connection to `/realtime` is established and authenticated, the functional interaction takes place via events for sending and receiving WebSocket messages. Each event takes the form of a JSON object. Events can be sent and received in parallel, and applications should generally handle them both concurrently and asynchronously. A minimal event-handling sketch in Python follows this list.

- A caller establishes a connection to `/realtime`, which starts a new `session`.
- A `session` automatically creates a default `conversation`. Multiple concurrent conversations aren't supported.
- The `conversation` accumulates input signals until a `response` is started, either via a direct event from the caller or automatically by voice activity detection (VAD) based turn detection.
- Each `response` consists of one or more `items`, which can encapsulate messages, function calls, and other information.
- Each message `item` has `content_part` elements, allowing multiple modalities (text and audio) to be represented across a single item.
- The `session` manages configuration of caller input handling (for example, user audio) and common output generation handling.
- Each caller-initiated `response.create` can override some of the output `response` behavior, if desired.
- Server-created `item` entries and their `content_part` elements in messages can be populated asynchronously and in parallel, for example, receiving audio, text, and function information concurrently in a round-robin fashion.

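The following sketch is an illustration, not an official sample: it continues the connection example above by sending a caller-initiated `session.update` event while concurrently reading server events from the same WebSocket and dispatching on each event's `type` field.

```python
# Illustrative sketch: send a session.update and handle incoming events concurrently.
# `ws` is an open WebSocket connection to /realtime (see the connection sketch above).
import asyncio
import json

async def configure_session(ws) -> None:
    # Caller-initiated events are JSON objects identified by their "type" field.
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {"voice": "alloy", "instructions": "You are a helpful assistant."},
    }))

async def receive_events(ws) -> None:
    # Server events arrive asynchronously; dispatch on the "type" field.
    async for message in ws:
        event = json.loads(message)
        if event["type"] == "conversation.item.audio_transcription.completed":
            # Field name for the transcribed text is an assumption here.
            print("User said:", event.get("transcript"))
        else:
            print("Received event:", event["type"])

async def run(ws) -> None:
    # Sending and receiving happen in parallel.
    await asyncio.gather(configure_session(ws), receive_events(ws))
```
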
## Session configuration and turn handling mode

Often, the first event sent by the caller on a newly established `/realtime` session is a `session.update` payload. This event controls a wide set of input and output behavior, with output and response generation portions then later overridable via `response.create` properties.

One of the key session-wide settings is `turn_detection`, which controls how data flow is handled between the caller and model:

- `server_vad` evaluates incoming user audio (as sent via `input_audio_buffer.append`) using a voice activity detector (VAD) component and automatically uses that audio to initiate response generation on applicable conversations when an end of speech is detected. Silence detection for the VAD can be configured when specifying the `server_vad` detection mode.
- `none` relies on caller-initiated `input_audio_buffer.commit` and `response.create` events to progress conversations and produce output. This setting is useful for push-to-talk applications or situations that have external audio flow control (such as a caller-side VAD component). These manual signals can still be used in `server_vad` mode to supplement VAD-initiated response generation. A push-to-talk sketch follows this list.

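For example, a push-to-talk client that uses the `none` turn detection mode might drive each turn manually with the events named above. The following Python sketch is illustrative only: the exact `turn_detection` payload shape and the audio field name are assumptions, and capturing audio from the device is out of scope.

```python
# Illustrative push-to-talk sketch with the "none" turn detection mode.
# `ws` is an open /realtime WebSocket; `audio_chunks` is an iterable of raw PCM16
# byte chunks captured by the client (placeholder - audio capture is out of scope).
import base64
import json

async def push_to_talk_turn(ws, audio_chunks) -> None:
    # Disable server-side VAD so the caller controls turn boundaries.
    # The exact payload shape may differ; this mirrors the session example below.
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {"turn_detection": {"type": "none"}},
    }))

    # Stream the captured audio while the user holds the talk button.
    for chunk in audio_chunks:
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),  # field name assumed
        }))

    # When the button is released, commit the buffer and request a response.
    await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
    await ws.send(json.dumps({"type": "response.create"}))
```
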
You can opt into transcription of user input audio via the `input_audio_transcription` property. Specifying a transcription model (`whisper-1`) in this configuration enables the delivery of `conversation.item.audio_transcription.completed` events.

### Session update example

An example `session.update` that configures several aspects of the session, including tools, follows. All session parameters are optional; not everything needs to be configured!

```json
{
  "type": "session.update",
  "session": {
    "voice": "alloy",
    "instructions": "Call provided tools if appropriate for the user's input.",
    "input_audio_format": "pcm16",
    "input_audio_transcription": {
      "model": "whisper-1"
    },
    "turn_detection": {
      "threshold": 0.4,
      "silence_duration_ms": 600,
      "type": "server_vad"
    },
    "tools": [
      {
        "type": "function",
        "name": "get_weather_for_location",
        "description": "gets the weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state e.g. San Francisco, CA"
            },
            "unit": {
              "type": "string",
              "enum": [
                "c",
                "f"
              ]
            }
          },
          "required": [
            "location",
            "unit"
          ]
        }
      }
    ]
  }
}
```

## Related content

* Try the [real-time audio quickstart](../realtime-audio-quickstart.md)
* Learn more about Azure OpenAI [quotas and limits](../quotas-limits.md)
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
---
title: include file
description: include file
ms.topic: include
ms.date: 12/11/2024
ms.custom: include
---

> [!NOTE]
> This feature is currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/).

articles/ai-services/openai/realtime-audio-quickstart.md

Lines changed: 4 additions & 2 deletions
@@ -5,7 +5,7 @@ description: Learn how to use GPT-4o Realtime API for speech and audio with Azur
manager: nitinme
ms.service: azure-ai-openai
ms.topic: how-to
-ms.date: 10/3/2024
+ms.date: 12/11/2024
author: eric-urban
ms.author: eur
ms.custom: references_regions, ignite-2024
@@ -15,6 +15,8 @@ recommendations: false

# GPT-4o Realtime API for speech and audio (Preview)

+[!INCLUDE [Feature preview](includes/preview-feature.md)]
+
Azure OpenAI GPT-4o Realtime API for speech and audio is part of the GPT-4o model family that supports low-latency, "speech in, speech out" conversational interactions. The GPT-4o audio `realtime` API is designed to handle real-time, low-latency conversational interactions, making it a great fit for use cases involving live interactions between a user and a model, such as customer support agents, voice assistants, and real-time translators.
Most users of the Realtime API need to deliver and receive audio from an end-user in real time, including applications that use WebRTC or a telephony system. The Realtime API isn't designed to connect directly to end user devices and relies on client integrations to terminate end user audio streams.
@@ -125,5 +127,5 @@ You can run the sample code locally on your machine by following these steps. Re

## Related content

-* Learn more about Azure OpenAI [deployment types](./how-to/deployment-types.md)
+* Learn more about [How to use the Realtime API](./how-to/realtime-audio.md)
* Learn more about Azure OpenAI [quotas and limits](quotas-limits.md)

articles/ai-services/openai/toc.yml

Lines changed: 2 additions & 0 deletions
@@ -183,6 +183,8 @@ items:
  href: ./how-to/azure-developer-cli.md
- name: Troubleshooting and best practices
  href: ./how-to/on-your-data-best-practices.md
+- name: Use the Realtime API (preview)
+  href: ./how-to/realtime-audio.md
- name: Migrate to OpenAI Python v1.x
  href: ./how-to/migration.md
- name: Migrate to OpenAI JavaScript v4.x
