
Commit 9a941dc

realtime webrtc
1 parent 33ab1e6 commit 9a941dc

File tree

7 files changed (+458, -147 lines)

articles/ai-services/openai/how-to/realtime-audio-webrtc.md

Lines changed: 319 additions & 0 deletions
Large diffs are not rendered by default.

articles/ai-services/openai/how-to/realtime-audio-websockets.md

Lines changed: 118 additions & 0 deletions

@@ -0,0 +1,118 @@
---
title: 'How to use the GPT-4o Realtime API via WebSockets (Preview)'
titleSuffix: Azure OpenAI Service
description: Learn how to use the GPT-4o Realtime API for speech and audio via WebSockets.
manager: nitinme
ms.service: azure-ai-openai
ms.topic: how-to
ms.date: 4/28/2025
author: eric-urban
ms.author: eur
ms.custom: references_regions
recommendations: false
---

# How to use the GPT-4o Realtime API via WebSockets (Preview)

[!INCLUDE [Feature preview](../includes/preview-feature.md)]

Azure OpenAI GPT-4o Realtime API for speech and audio is part of the GPT-4o model family that supports low-latency, "speech in, speech out" conversational interactions.

You can use the Realtime API via WebRTC or WebSockets to send audio input to the model and receive audio responses in real time. Follow the instructions in this article to get started with the Realtime API via WebSockets.

In most cases, we recommend using the [Realtime API via WebRTC](./realtime-audio-webrtc.md) for real-time audio streaming. WebRTC is a web standard that enables real-time communication (RTC) between browsers and mobile applications.

WebSockets aren't recommended for real-time audio streaming from client devices because they have higher latency than WebRTC. Use the Realtime API via WebSockets if you need to stream audio data from a server to a client, or if you need to send and receive data in real time between a client and server.

## Supported models

The GPT-4o real-time models are available for global deployments in [East US 2 and Sweden Central regions](../concepts/models.md#global-standard-model-availability).

- `gpt-4o-mini-realtime-preview` (2024-12-17)
- `gpt-4o-realtime-preview` (2024-12-17)
- `gpt-4o-realtime-preview` (2024-10-01)

See the [models and versions documentation](../concepts/models.md#audio-models) for more information.

## Prerequisites

Before you can use GPT-4o real-time audio, you need:

- An Azure subscription - <a href="https://azure.microsoft.com/free/cognitive-services" target="_blank">Create one for free</a>.
- An Azure OpenAI resource created in a [supported region](#supported-models). For more information, see [Create a resource and deploy a model with Azure OpenAI](create-resource.md).
- A deployment of the `gpt-4o-realtime-preview` or `gpt-4o-mini-realtime-preview` model in a supported region as described in the [supported models](#supported-models) section. You can deploy the model from the [Azure AI Foundry portal model catalog](../../../ai-foundry/how-to/model-catalog-overview.md) or from your project in the Azure AI Foundry portal.

## Connection and authentication

The Realtime API (via `/realtime`) is built on [the WebSockets API](https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API) to facilitate fully asynchronous streaming communication between the end user and model.

The Realtime API is accessed via a secure WebSocket connection to the `/realtime` endpoint of your Azure OpenAI resource.

You can construct a full request URI by concatenating:

- The secure WebSocket (`wss://`) protocol.
- Your Azure OpenAI resource endpoint hostname, for example, `my-aoai-resource.openai.azure.com`.
- The `openai/realtime` API path.
- An `api-version` query string parameter for a supported API version such as `2024-12-17`.
- A `deployment` query string parameter with the name of your `gpt-4o-realtime-preview` or `gpt-4o-mini-realtime-preview` model deployment.

The following example is a well-constructed `/realtime` request URI:

```http
wss://my-eastus2-openai-resource.openai.azure.com/openai/realtime?api-version=2024-12-17&deployment=gpt-4o-mini-realtime-preview-deployment-name
```

To authenticate:

- **Microsoft Entra** (recommended): Use token-based authentication with the `/realtime` API for an Azure OpenAI resource with managed identity enabled. Apply a retrieved authentication token as a `Bearer` token in the `Authorization` header.
- **API key**: An `api-key` can be provided in one of two ways:
  - Using an `api-key` connection header on the prehandshake connection. This option isn't available in a browser environment.
  - Using an `api-key` query string parameter on the request URI. Query string parameters are encrypted when using https/wss.
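
For illustration, here's a minimal Node.js sketch of connecting to `/realtime` with the `ws` package and API key authentication. The endpoint, deployment, and environment variable names are placeholder assumptions; a production service should prefer Microsoft Entra token authentication and keep keys out of client code.

```javascript
import WebSocket from "ws"; // npm install ws

// Placeholder values: substitute your own resource and deployment names.
const endpoint = "my-eastus2-openai-resource.openai.azure.com";
const deployment = "gpt-4o-mini-realtime-preview-deployment-name";
const apiVersion = "2024-12-17";

const url = `wss://${endpoint}/openai/realtime?api-version=${apiVersion}&deployment=${deployment}`;

// The api-key connection header isn't available in browsers,
// so run this from a trusted server environment.
const ws = new WebSocket(url, {
  headers: { "api-key": process.env.AZURE_OPENAI_API_KEY },
});

ws.on("open", () => console.log("Connected to /realtime"));
ws.on("error", (error) => console.error("Connection failed:", error));
```
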
## Realtime API via WebSockets architecture

Once the WebSocket connection session to `/realtime` is established and authenticated, the functional interaction takes place via events for sending and receiving WebSocket messages. These events each take the form of a JSON object.

:::image type="content" source="../media/how-to/real-time/realtime-api-sequence.png" alt-text="Diagram of the Realtime API authentication and connection sequence." lightbox="../media/how-to/real-time/realtime-api-sequence.png":::

<!--
sequenceDiagram
actor User as End User
participant MiddleTier as /realtime host
participant AOAI as Azure OpenAI
User->>MiddleTier: Begin interaction
MiddleTier->>MiddleTier: Authenticate/Validate User
MiddleTier--)User: audio information
User--)MiddleTier:
MiddleTier--)User: text information
User--)MiddleTier:
MiddleTier--)User: control information
User--)MiddleTier:
MiddleTier->>AOAI: connect to /realtime
MiddleTier->>AOAI: configure session
AOAI->>MiddleTier: session start
MiddleTier--)AOAI: send/receive WS commands
AOAI--)MiddleTier:
AOAI--)MiddleTier: create/start conversation responses
AOAI--)MiddleTier: (within responses) create/start/add/finish items
AOAI--)MiddleTier: (within items) create/stream/finish content parts
-->

Events can be sent and received in parallel, and applications should generally handle them concurrently and asynchronously.

- A client-side caller establishes a connection to `/realtime`, which starts a new [`session`](../realtime-audio-reference.md#realtimerequestsession).
- A `session` automatically creates a default `conversation`. Multiple concurrent conversations aren't supported.
- The `conversation` accumulates input signals until a `response` is started, either via a direct event by the caller or automatically by voice activity detection (VAD).
- Each `response` consists of one or more `items`, which can encapsulate messages, function calls, and other information.
- Each message `item` has `content_part` elements, allowing multiple modalities (text and audio) to be represented across a single item.
- The `session` manages configuration of caller input handling (for example, user audio) and common output generation handling.
- Each caller-initiated [`response.create`](../realtime-audio-reference.md#realtimeclienteventresponsecreate) can override some of the output [`response`](../realtime-audio-reference.md#realtimeresponse) behavior, if desired.
- Server-created `item` and `content_part` objects in messages can be populated asynchronously and in parallel, for example, receiving audio, text, and function information concurrently in a round-robin fashion. A minimal event-handling sketch follows this list.
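
To make the event flow concrete, here's a hedged sketch that continues the connection example above: it sends a `session.update` as the first client event and dispatches incoming server events by type. Event names follow the [Realtime API reference](../realtime-audio-reference.md); the handler bodies are placeholders.

```javascript
// First client event: configure session-wide input and output behavior.
ws.on("open", () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      modalities: ["text", "audio"],
      instructions: "You are a helpful assistant.",
      turn_detection: { type: "server_vad" }, // let the server detect speech turns
    },
  }));
});

// Server events arrive as JSON objects; handle each type asynchronously.
ws.on("message", (data) => {
  const event = JSON.parse(data.toString());
  switch (event.type) {
    case "session.created":
      console.log("Session started:", event.session.id);
      break;
    case "response.audio.delta":
      // event.delta is a base64-encoded audio chunk; decode and queue it for playback.
      break;
    case "response.done":
      console.log("Response complete.");
      break;
    default:
      // Item and content part events for a response stream in parallel.
      break;
  }
});
```
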
## Try the quickstart

Now that you have the prerequisites, you can follow the instructions in the [Realtime API quickstart](../realtime-audio-quickstart.md) to get started with the Realtime API via WebSockets.

## Related content

* Try the [real-time audio quickstart](../realtime-audio-quickstart.md)
* See the [Realtime API reference](../realtime-audio-reference.md)
* Learn more about Azure OpenAI [quotas and limits](../quotas-limits.md)

articles/ai-services/openai/how-to/realtime-audio.md

Lines changed: 9 additions & 74 deletions
````diff
@@ -1,11 +1,11 @@
 ---
-title: 'How to use the GPT-4o Realtime API for speech and audio with Azure OpenAI Service'
-titleSuffix: Azure OpenAI
-description: Learn how to use the GPT-4o Realtime API for speech and audio with Azure OpenAI Service.
+title: 'How to use the GPT-4o Realtime API for speech and audio with Azure OpenAI'
+titleSuffix: Azure OpenAI Service
+description: Learn how to use the GPT-4o Realtime API for speech and audio with Azure OpenAI.
 manager: nitinme
 ms.service: azure-ai-openai
 ms.topic: how-to
-ms.date: 3/20/2025
+ms.date: 4/28/2025
 author: eric-urban
 ms.author: eur
 ms.custom: references_regions
@@ -20,6 +20,10 @@ Azure OpenAI GPT-4o Realtime API for speech and audio is part of the GPT-4o mode
 Most users of the Realtime API need to deliver and receive audio from an end-user in real time, including applications that use WebRTC or a telephony system. The Realtime API isn't designed to connect directly to end user devices and relies on client integrations to terminate end user audio streams.
 
+You can use the Realtime API via WebRTC or WebSockets to send audio input to the model and receive audio responses in real time. In most cases, we recommend using the WebRTC API for low-latency real-time audio streaming. For more information, see:
+- [Realtime API via WebRTC](./realtime-audio-webrtc.md)
+- [Realtime API via WebSockets](./realtime-audio-websockets.md)
+
 ## Supported models
 
 The GPT 4o real-time models are available for global deployments in [East US 2 and Sweden Central regions](../concepts/models.md#global-standard-model-availability).
@@ -39,78 +43,9 @@ Before you can use GPT-4o real-time audio, you need:
 
 Here are some of the ways you can get started with the GPT-4o Realtime API for speech and audio:
 - For steps to deploy and use the `gpt-4o-realtime-preview` or `gpt-4o-mini-realtime-preview` model, see [the real-time audio quickstart](../realtime-audio-quickstart.md).
-- Download the sample code from the [Azure OpenAI GPT-4o real-time audio repository on GitHub](https://github.com/azure-samples/aoai-realtime-audio-sdk).
+- Try the [WebRTC via HTML and JavaScript example](./realtime-audio-webrtc.md#webrtc-example-via-html-and-javascript) to get started with the Realtime API via WebRTC.
 - [The Azure-Samples/aisearch-openai-rag-audio repo](https://github.com/Azure-Samples/aisearch-openai-rag-audio) contains an example of how to implement RAG support in applications that use voice as their user interface, powered by the GPT-4o realtime API for audio.
 
-## Connection and authentication
-
-The Realtime API (via `/realtime`) is built on [the WebSockets API](https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API) to facilitate fully asynchronous streaming communication between the end user and model.
-
-> [!IMPORTANT]
-> Device details like capturing and rendering audio data are outside the scope of the Realtime API. It should be used in the context of a trusted, intermediate service that manages both connections to end users and model endpoint connections. Don't use it directly from untrusted end user devices.
-
-The Realtime API is accessed via a secure WebSocket connection to the `/realtime` endpoint of your Azure OpenAI resource.
-
-You can construct a full request URI by concatenating:
-
-- The secure WebSocket (`wss://`) protocol.
-- Your Azure OpenAI resource endpoint hostname, for example, `my-aoai-resource.openai.azure.com`
-- The `openai/realtime` API path.
-- An `api-version` query string parameter for a supported API version such as `2024-12-17`
-- A `deployment` query string parameter with the name of your `gpt-4o-realtime-preview` or `gpt-4o-mini-realtime-preview` model deployment.
-
-The following example is a well-constructed `/realtime` request URI:
-
-```http
-wss://my-eastus2-openai-resource.openai.azure.com/openai/realtime?api-version=2024-12-17&deployment=gpt-4o-mini-realtime-preview-deployment-name
-```
-
-To authenticate:
-- **Microsoft Entra** (recommended): Use token-based authentication with the `/realtime` API for an Azure OpenAI Service resource with managed identity enabled. Apply a retrieved authentication token using a `Bearer` token with the `Authorization` header.
-- **API key**: An `api-key` can be provided in one of two ways:
-  - Using an `api-key` connection header on the prehandshake connection. This option isn't available in a browser environment.
-  - Using an `api-key` query string parameter on the request URI. Query string parameters are encrypted when using https/wss.
-
-## Realtime API architecture
-
-Once the WebSocket connection session to `/realtime` is established and authenticated, the functional interaction takes place via events for sending and receiving WebSocket messages. These events each take the form of a JSON object.
-
-:::image type="content" source="../media/how-to/real-time/realtime-api-sequence.png" alt-text="Diagram of the Realtime API authentication and connection sequence." lightbox="../media/how-to/real-time/realtime-api-sequence.png":::
-
-<!--
-sequenceDiagram
-actor User as End User
-participant MiddleTier as /realtime host
-participant AOAI as Azure OpenAI
-User->>MiddleTier: Begin interaction
-MiddleTier->>MiddleTier: Authenticate/Validate User
-MiddleTier--)User: audio information
-User--)MiddleTier:
-MiddleTier--)User: text information
-User--)MiddleTier:
-MiddleTier--)User: control information
-User--)MiddleTier:
-MiddleTier->>AOAI: connect to /realtime
-MiddleTier->>AOAI: configure session
-AOAI->>MiddleTier: session start
-MiddleTier--)AOAI: send/receive WS commands
-AOAI--)MiddleTier:
-AOAI--)MiddleTier: create/start conversation responses
-AOAI--)MiddleTier: (within responses) create/start/add/finish items
-AOAI--)MiddleTier: (within items) create/stream/finish content parts
--->
-
-Events can be sent and received in parallel and applications should generally handle them both concurrently and asynchronously.
-
-- A client-side caller establishes a connection to `/realtime`, which starts a new [`session`](#session-configuration).
-- A `session` automatically creates a default `conversation`. Multiple concurrent conversations aren't supported.
-- The `conversation` accumulates input signals until a `response` is started, either via a direct event by the caller or automatically by voice activity detection (VAD).
-- Each `response` consists of one or more `items`, which can encapsulate messages, function calls, and other information.
-- Each message `item` has `content_part`, allowing multiple modalities (text and audio) to be represented across a single item.
-- The `session` manages configuration of caller input handling (for example, user audio) and common output generation handling.
-- Each caller-initiated [`response.create`](../realtime-audio-reference.md#realtimeclienteventresponsecreate) can override some of the output [`response`](../realtime-audio-reference.md#realtimeresponse) behavior, if desired.
-- Server-created `item` and the `content_part` in messages can be populated asynchronously and in parallel. For example, receiving audio, text, and function information concurrently in a round robin fashion.
-
 ## Session configuration
 
 Often, the first event sent by the caller on a newly established `/realtime` session is a [`session.update`](../realtime-audio-reference.md#realtimeclienteventsessionupdate) payload. This event controls a wide set of input and output behavior, with output and response generation properties then later overridable using the [`response.create`](../realtime-audio-reference.md#realtimeclienteventresponsecreate) event.
````
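
As an illustrative aside (a sketch under the same assumptions as the WebSockets examples earlier on this page, with `ws` an open connection): a `session.update` sets session-wide defaults, and a later `response.create` can override output behavior for a single response.

```javascript
// Session-wide defaults, typically sent once after connecting.
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    voice: "alloy",
    instructions: "Answer concisely.",
    input_audio_format: "pcm16",
    turn_detection: { type: "server_vad", threshold: 0.5 },
  },
}));

// Later, override output behavior for one specific response.
ws.send(JSON.stringify({
  type: "response.create",
  response: {
    modalities: ["text"], // text-only output for this response
    instructions: "Summarize the conversation so far.",
  },
}));
```
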

articles/ai-services/openai/includes/realtime-javascript.md

Lines changed: 0 additions & 42 deletions
````diff
@@ -247,45 +247,3 @@ Received 26400 bytes of audio data.
 
 Connection closed!
 ```
-
-## Web application sample
-
-Our JavaScript web sample [on GitHub](https://github.com/azure-samples/aoai-realtime-audio-sdk) demonstrates how to use the GPT-4o Realtime API to interact with the model in real time. The sample code includes a simple web interface that captures audio from the user's microphone and sends it to the model for processing. The model responds with text and audio, which the sample code renders in the web interface.
-
-You can run the sample code locally on your machine by following these steps. Refer to the [repository on GitHub](https://github.com/azure-samples/aoai-realtime-audio-sdk) for the most up-to-date instructions.
-1. If you don't have Node.js installed, download and install the [LTS version of Node.js](https://nodejs.org/).
-
-1. Clone the repository to your local machine:
-
-    ```bash
-    git clone https://github.com/Azure-Samples/aoai-realtime-audio-sdk.git
-    ```
-
-1. Go to the `javascript/samples/web` folder in your preferred code editor.
-
-    ```bash
-    cd ./javascript/samples
-    ```
-
-1. Run `download-pkg.ps1` or `download-pkg.sh` to download the required packages.
-
-1. Go to the `web` folder from the `./javascript/samples` folder.
-
-    ```bash
-    cd ./web
-    ```
-
-1. Run `npm install` to install package dependencies.
-
-1. Run `npm run dev` to start the web server, navigating any firewall permissions prompts as needed.
-1. Go to any of the provided URIs from the console output (such as `http://localhost:5173/`) in a browser.
-1. Enter the following information in the web interface:
-    - **Endpoint**: The resource endpoint of an Azure OpenAI resource. You don't need to append the `/realtime` path. An example structure might be `https://my-azure-openai-resource-from-portal.openai.azure.com`.
-    - **API Key**: A corresponding API key for the Azure OpenAI resource.
-    - **Deployment**: The name of the `gpt-4o-mini-realtime-preview` model that [you deployed in the previous section](#deploy-a-model-for-real-time-audio).
-    - **System Message**: Optionally, you can provide a system message such as "You always talk like a friendly pirate."
-    - **Temperature**: Optionally, you can provide a custom temperature.
-    - **Voice**: Optionally, you can select a voice.
-1. Select the **Record** button to start the session. Accept permissions to use your microphone if prompted.
-1. You should see a `<< Session Started >>` message in the main output. Then you can speak into the microphone to start a chat.
-1. You can interrupt the chat at any time by speaking. You can end the chat by selecting the **Stop** button.
````
132 KB binary file; preview not rendered.

articles/ai-services/openai/realtime-audio-reference.md

Lines changed: 2 additions & 28 deletions
````diff
@@ -5,13 +5,13 @@ description: Learn how to use the Realtime API to interact with the Azure OpenAI
 manager: nitinme
 ms.service: azure-ai-openai
 ms.topic: conceptual
-ms.date: 3/20/2025
+ms.date: 4/28/2025
 author: eric-urban
 ms.author: eur
 recommendations: false
 ---
 
-# Realtime API (Preview) reference
+# Realtime events reference
 
 [!INCLUDE [Feature preview](includes/preview-feature.md)]
 
@@ -22,32 +22,6 @@ The Realtime API (via `/realtime`) is built on [the WebSockets API](https://deve
 > [!TIP]
 > To get started with the Realtime API, see the [quickstart](realtime-audio-quickstart.md) and [how-to guide](./how-to/realtime-audio.md).
 
-## Connection
-
-The Realtime API requires an existing Azure OpenAI resource endpoint in a supported region. The API is accessed via a secure WebSocket connection to the `/realtime` endpoint of your Azure OpenAI resource.
-
-You can construct a full request URI by concatenating:
-
-- The secure WebSocket (`wss://`) protocol.
-- Your Azure OpenAI resource endpoint hostname, for example, `my-aoai-resource.openai.azure.com`
-- The `openai/realtime` API path.
-- An `api-version` query string parameter for a supported API version such as `2024-10-01-preview`.
-- A `deployment` query string parameter with the name of your `gpt-4o-realtime-preview` or `gpt-4o-mini-realtime-preview` model deployment.
-
-The following example is a well-constructed `/realtime` request URI:
-
-```http
-wss://my-eastus2-openai-resource.openai.azure.com/openai/realtime?api-version=2024-10-01-preview&deployment=gpt-4o-realtime-preview
-```
-
-## Authentication
-
-To authenticate:
-- **Microsoft Entra** (recommended): Use token-based authentication with the `/realtime` API for an Azure OpenAI Service resource with managed identity enabled. Apply a retrieved authentication token using a `Bearer` token with the `Authorization` header.
-- **API key**: An `api-key` can be provided in one of two ways:
-  - Using an `api-key` connection header on the prehandshake connection. This option isn't available in a browser environment.
-  - Using an `api-key` query string parameter on the request URI. Query string parameters are encrypted when using https/wss.
-
 ## Client events
 
 There are nine client events that can be sent from the client to the server:
````
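
The diff view truncates the list that follows the line above; for context, here's a sketch of the nine client event types as they might appear in code. The names follow the Realtime API reference; the inline notes are summaries, not the reference's own wording.

```javascript
// The nine client event types in the Realtime API.
const clientEvents = [
  "session.update",              // configure session behavior
  "input_audio_buffer.append",   // stream caller audio chunks
  "input_audio_buffer.commit",   // finalize buffered audio as user input
  "input_audio_buffer.clear",    // discard buffered audio
  "conversation.item.create",    // add a message, function call, or function output
  "conversation.item.truncate",  // trim a previous item's audio content
  "conversation.item.delete",    // remove an item from the conversation
  "response.create",             // request model output generation
  "response.cancel",             // cancel an in-progress response
];
```
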
