
Commit 0854005

Merge pull request #6938 from PatrickFarley/openai-audio
OpenAI audio
2 parents: 0e3ef9b + f916e7a

File tree: 7 files changed (+87 −36 lines)

articles/ai-foundry/openai/concepts/audio.md

Lines changed: 2 additions & 2 deletions
@@ -22,7 +22,7 @@ For information about the available audio models per region in Azure OpenAI, see
 
 ## GPT-4o audio Realtime API
 
-GPT-4o real-time audio is designed to handle real-time, low-latency conversational interactions, making it a great fit for support agents, assistants, translators, and other use cases that need highly responsive back-and-forth with a user. For more information on how to use GPT-4o real-time audio, see the [GPT-4o real-time audio quickstart](../realtime-audio-quickstart.md) and [how to use GPT-4o audio](../how-to/realtime-audio.md).
+GPT real-time audio is designed to handle real-time, low-latency conversational interactions, making it a great fit for support agents, assistants, translators, and other use cases that need highly responsive back-and-forth with a user. For more information on how to use GPT real-time audio, see the [GPT real-time audio quickstart](../realtime-audio-quickstart.md) and [how to use GPT-4o audio](../how-to/realtime-audio.md).
 
 ## GPT-4o audio completions
 
@@ -40,4 +40,4 @@ The audio models via the `/audio` API can be used for speech to text, translatio
 - [Audio models](models.md#audio-models)
 - [Whisper quickstart](../whisper-quickstart.md)
 - [Audio generation quickstart](../audio-completions-quickstart.md)
-- [GPT-4o real-time audio quickstart](../realtime-audio-quickstart.md)
+- [GPT real-time audio quickstart](../realtime-audio-quickstart.md)

articles/ai-foundry/openai/how-to/realtime-audio-webrtc.md

Lines changed: 9 additions & 9 deletions
@@ -1,7 +1,7 @@
 ---
-title: 'How to use the GPT-4o Realtime API via WebRTC'
+title: 'How to use the GPT Realtime API via WebRTC'
 titleSuffix: Azure OpenAI in Azure AI Foundry Models
-description: Learn how to use the GPT-4o Realtime API for speech and audio via WebRTC.
+description: Learn how to use the GPT Realtime API for speech and audio via WebRTC.
 manager: nitinme
 ms.service: azure-ai-openai
 ms.topic: how-to
@@ -12,10 +12,10 @@ ms.custom: references_regions
 recommendations: false
 ---
 
-# How to use the GPT-4o Realtime API via WebRTC
+# How to use the GPT Realtime API via WebRTC
 
-Azure OpenAI GPT-4o Realtime API for speech and audio is part of the GPT-4o model family that supports low-latency, "speech in, speech out" conversational interactions.
+Azure OpenAI GPT Realtime API for speech and audio is part of the GPT-4o model family that supports low-latency, "speech in, speech out" conversational interactions.
 
 You can use the Realtime API via WebRTC or WebSocket to send audio input to the model and receive audio responses in real time. Follow the instructions in this article to get started with the Realtime API via WebRTC.
 
@@ -29,7 +29,7 @@ Use the [Realtime API via WebSockets](./realtime-audio-websockets.md) if you nee
 ## Supported models
 
-The GPT 4o real-time models are available for global deployments in [East US 2 and Sweden Central regions](../concepts/models.md#global-standard-model-availability).
+The GPT real-time models are available for global deployments in [East US 2 and Sweden Central regions](../concepts/models.md#global-standard-model-availability).
 - `gpt-4o-mini-realtime-preview` (2024-12-17)
 - `gpt-4o-realtime-preview` (2024-12-17)
 - `gpt-realtime` (version 2025-08-28)
@@ -40,7 +40,7 @@ For more information about supported models, see the [models and versions docume
 ## Prerequisites
 
-Before you can use GPT-4o real-time audio, you need:
+Before you can use GPT real-time audio, you need:
 
 - An Azure subscription - <a href="https://azure.microsoft.com/free/cognitive-services" target="_blank">Create one for free</a>.
 - An Azure OpenAI resource created in a [supported region](#supported-models). For more information, see [Create a resource and deploy a model with Azure OpenAI](create-resource.md).
@@ -113,9 +113,9 @@ sequenceDiagram
 ## WebRTC example via HTML and JavaScript
 
-The following code sample demonstrates how to use the GPT-4o Realtime API via WebRTC. The sample uses the [WebRTC API](https://developer.mozilla.org/en-US/docs/Web/API/WebRTC_API) to establish a real-time audio connection with the model.
+The following code sample demonstrates how to use the GPT Realtime API via WebRTC. The sample uses the [WebRTC API](https://developer.mozilla.org/en-US/docs/Web/API/WebRTC_API) to establish a real-time audio connection with the model.
 
-The sample code is an HTML page that allows you to start a session with the GPT-4o Realtime API and send audio input to the model. The model's responses are played back in real-time.
+The sample code is an HTML page that allows you to start a session with the GPT Realtime API and send audio input to the model. The model's responses are played back in real time.
 
 > [!WARNING]
 > The sample code includes the API key hardcoded in the JavaScript. This code isn't recommended for production use. In a production environment, you should use a secure backend service to generate an ephemeral key and return it to the client.
@@ -299,7 +299,7 @@ The sample code is an HTML page that allows you to start a session with the GPT-
 </html>
 ```
 
-1. Select **Start Session** to start a session with the GPT-4o Realtime API. The session ID and ephemeral key are displayed in the log container.
+1. Select **Start Session** to start a session with the GPT Realtime API. The session ID and ephemeral key are displayed in the log container.
 1. Allow the browser to access your microphone when prompted.
 1. Confirmation messages are displayed in the log container as the session progresses. Here's an example of the log messages:
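Aside: the body of the HTML sample is collapsed in this commit view. As a rough sketch of the flow such a page implements (mint an ephemeral key, wire up microphone capture and playback, then exchange SDP over HTTPS), the following JavaScript is illustrative only; the endpoint URLs, `api-version`, and payload fields are assumptions to verify against the published sample.

```javascript
// Minimal sketch (not taken from this commit): start a WebRTC session with a
// Realtime deployment. URLs, api-version, and fields are assumptions.
const SESSIONS_URL =
  "https://<your-resource>.openai.azure.com/openai/realtimeapi/sessions" +
  "?api-version=2025-04-01-preview";
const WEBRTC_URL = "https://<region>.realtimeapi-preview.ai.azure.com/v1/realtimertc";
const DEPLOYMENT = "gpt-realtime";

async function startSession(apiKey) {
  // 1. Mint a short-lived ephemeral key (do this on a backend in production).
  const sessionResponse = await fetch(SESSIONS_URL, {
    method: "POST",
    headers: { "api-key": apiKey, "Content-Type": "application/json" },
    body: JSON.stringify({ model: DEPLOYMENT, voice: "verse" }),
  });
  const ephemeralKey = (await sessionResponse.json()).client_secret?.value;

  // 2. Create the peer connection and play remote audio as it arrives.
  const pc = new RTCPeerConnection();
  pc.ontrack = (event) => {
    const audio = new Audio();
    audio.srcObject = event.streams[0];
    audio.play();
  };

  // 3. Attach the microphone track.
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  mic.getTracks().forEach((track) => pc.addTrack(track, mic));

  // 4. Exchange SDP with the Realtime endpoint using the ephemeral key.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  const sdpResponse = await fetch(`${WEBRTC_URL}?model=${DEPLOYMENT}`, {
    method: "POST",
    headers: { Authorization: `Bearer ${ephemeralKey}`, "Content-Type": "application/sdp" },
    body: offer.sdp,
  });
  await pc.setRemoteDescription({ type: "answer", sdp: await sdpResponse.text() });
  return pc;
}
```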

articles/ai-foundry/openai/how-to/realtime-audio-websockets.md

Lines changed: 6 additions & 6 deletions
@@ -1,7 +1,7 @@
 ---
-title: 'How to use the GPT-4o Realtime API via WebSockets'
+title: 'How to use the GPT Realtime API via WebSockets'
 titleSuffix: Azure OpenAI in Azure AI Foundry Models
-description: Learn how to use the GPT-4o Realtime API for speech and audio via WebSockets.
+description: Learn how to use the GPT Realtime API for speech and audio via WebSockets.
 manager: nitinme
 ms.service: azure-ai-openai
 ms.topic: how-to
@@ -12,10 +12,10 @@ ms.custom: references_regions
 recommendations: false
 ---
 
-# How to use the GPT-4o Realtime API via WebSockets
+# How to use the GPT Realtime API via WebSockets
 
-Azure OpenAI GPT-4o Realtime API for speech and audio is part of the GPT-4o model family that supports low-latency, "speech in, speech out" conversational interactions.
+Azure OpenAI GPT Realtime API for speech and audio is part of the GPT-4o model family that supports low-latency, "speech in, speech out" conversational interactions.
 
 You can use the Realtime API via WebRTC or WebSocket to send audio input to the model and receive audio responses in real time.
 
@@ -26,7 +26,7 @@ Follow the instructions in this article to get started with the Realtime API via
 
 ## Supported models
 
-The GPT-4o real-time models are available for global deployments in [East US 2 and Sweden Central regions](../concepts/models.md#global-standard-model-availability).
+The GPT real-time models are available for global deployments in [East US 2 and Sweden Central regions](../concepts/models.md#global-standard-model-availability).
 - `gpt-4o-mini-realtime-preview` (2024-12-17)
 - `gpt-4o-realtime-preview` (2024-12-17)
 - `gpt-realtime` (version 2025-08-28)
@@ -37,7 +37,7 @@ For more information about supported models, see the [models and versions docume
 
 ## Prerequisites
 
-Before you can use GPT-4o real-time audio, you need:
+Before you can use GPT real-time audio, you need:
 
 - An Azure subscription - <a href="https://azure.microsoft.com/free/cognitive-services" target="_blank">Create one for free</a>.
 - An Azure OpenAI resource created in a [supported region](#supported-models). For more information, see [Create a resource and deploy a model with Azure OpenAI](create-resource.md).
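Aside: to make the WebSocket flow concrete, here's a minimal Node.js sketch. It isn't taken from this commit; the URL shape, `api-version`, header name, and event payloads are assumptions to check against the article itself.

```javascript
// Minimal sketch: connect to a Realtime deployment over WebSocket.
import WebSocket from "ws"; // npm install ws

const url =
  "wss://<your-resource>.openai.azure.com/openai/realtime" +
  "?api-version=2024-10-01-preview&deployment=gpt-realtime";

const ws = new WebSocket(url, {
  headers: { "api-key": process.env.AZURE_OPENAI_API_KEY },
});

ws.on("open", () => {
  // Configure the session, add a user message, then request a response.
  ws.send(JSON.stringify({ type: "session.update", session: { voice: "verse" } }));
  ws.send(JSON.stringify({
    type: "conversation.item.create",
    item: { type: "message", role: "user", content: [{ type: "input_text", text: "Hello!" }] },
  }));
  ws.send(JSON.stringify({ type: "response.create" }));
});

ws.on("message", (data) => {
  const event = JSON.parse(data.toString());
  console.log(event.type); // e.g. session.created, response.audio.delta, response.done
});
```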

articles/ai-foundry/openai/how-to/realtime-audio.md

Lines changed: 61 additions & 10 deletions
@@ -1,7 +1,7 @@
 ---
-title: 'How to use the GPT-4o Realtime API for speech and audio with Azure OpenAI'
+title: 'How to use the GPT Realtime API for speech and audio with Azure OpenAI'
 titleSuffix: Azure OpenAI in Azure AI Foundry Models
-description: Learn how to use the GPT-4o Realtime API for speech and audio with Azure OpenAI.
+description: Learn how to use the GPT Realtime API for speech and audio with Azure OpenAI.
 manager: nitinme
 ms.service: azure-ai-openai
 ms.topic: how-to
@@ -12,9 +12,9 @@ ms.custom: references_regions
 recommendations: false
 ---
 
-# How to use the GPT-4o Realtime API for speech and audio
+# How to use the GPT Realtime API for speech and audio
 
-Azure OpenAI GPT-4o Realtime API for speech and audio is part of the GPT-4o model family that supports low-latency, "speech in, speech out" conversational interactions. The GPT-4o Realtime API is designed to handle real-time, low-latency conversational interactions. Realtime API is a great fit for use cases involving live interactions between a user and a model, such as customer support agents, voice assistants, and real-time translators.
+Azure OpenAI GPT Realtime API for speech and audio is part of the GPT-4o model family that supports low-latency, "speech in, speech out" conversational interactions. The GPT Realtime API is designed to handle real-time, low-latency conversational interactions. Realtime API is a great fit for use cases involving live interactions between a user and a model, such as customer support agents, voice assistants, and real-time translators.
 
 Most users of the Realtime API need to deliver and receive audio from an end-user in real time, including applications that use WebRTC or a telephony system. The Realtime API isn't designed to connect directly to end user devices and relies on client integrations to terminate end user audio streams.
 
@@ -24,7 +24,7 @@ You can use the Realtime API via WebRTC or WebSocket to send audio input to the
 
 ## Supported models
 
-The GPT 4o real-time models are available for global deployments in [East US 2 and Sweden Central regions](../concepts/models.md#global-standard-model-availability).
+The GPT real-time models are available for global deployments in [East US 2 and Sweden Central regions](../concepts/models.md#global-standard-model-availability).
 - `gpt-4o-mini-realtime-preview` (2024-12-17)
 - `gpt-4o-realtime-preview` (2024-12-17)
 - `gpt-realtime` (version 2025-08-28)
@@ -35,16 +35,16 @@ See the [models and versions documentation](../concepts/models.md#audio-models)
 
 ## Get started
 
-Before you can use GPT-4o real-time audio, you need:
+Before you can use GPT real-time audio, you need:
 
 - An Azure subscription - <a href="https://azure.microsoft.com/free/cognitive-services" target="_blank">Create one for free</a>.
 - An Azure OpenAI resource created in a [supported region](#supported-models). For more information, see [Create a resource and deploy a model with Azure OpenAI](create-resource.md).
 - You need a deployment of the `gpt-4o-realtime-preview`, `gpt-4o-mini-realtime-preview`, or `gpt-realtime` model in a supported region as described in the [supported models](#supported-models) section. You can deploy the model from the [Azure AI Foundry portal model catalog](../../../ai-foundry/how-to/model-catalog-overview.md) or from your project in Azure AI Foundry portal.
 
-Here are some of the ways you can get started with the GPT-4o Realtime API for speech and audio:
+Here are some of the ways you can get started with the GPT Realtime API for speech and audio:
 - For steps to deploy and use the `gpt-4o-realtime-preview`, `gpt-4o-mini-realtime-preview`, or `gpt-realtime` model, see [the real-time audio quickstart](../realtime-audio-quickstart.md).
 - Try the [WebRTC via HTML and JavaScript example](./realtime-audio-webrtc.md#webrtc-example-via-html-and-javascript) to get started with the Realtime API via WebRTC.
-- [The Azure-Samples/aisearch-openai-rag-audio repo](https://github.com/Azure-Samples/aisearch-openai-rag-audio) contains an example of how to implement RAG support in applications that use voice as their user interface, powered by the GPT-4o realtime API for audio.
+- [The Azure-Samples/aisearch-openai-rag-audio repo](https://github.com/Azure-Samples/aisearch-openai-rag-audio) contains an example of how to implement RAG support in applications that use voice as their user interface, powered by the GPT realtime API for audio.
 
 ## Session configuration
 
@@ -229,7 +229,7 @@ Set [`turn_detection.create_response`](../realtime-audio-reference.md#realtimetu
 
 ## Conversation and response generation
 
-The GPT-4o real-time audio models are designed for real-time, low-latency conversational interactions. The API is built on a series of events that allow the client to send and receive messages, control the flow of the conversation, and manage the state of the session.
+The GPT real-time audio models are designed for real-time, low-latency conversational interactions. The API is built on a series of events that allow the client to send and receive messages, control the flow of the conversation, and manage the state of the session.
 
 ### Conversation sequence and items
 
@@ -278,7 +278,58 @@ A user might want to interrupt the assistant's response or ask the assistant to
 - Truncating audio deletes the server-side text transcript to ensure there isn't text in the context that the user doesn't know about.
 - The server responds with a [`conversation.item.truncated`](../realtime-audio-reference.md#realtimeservereventconversationitemtruncated) event.
 
-## Text in audio out example
+## Image input
+
+The `gpt-realtime` model supports image input as part of the conversation. The model can ground responses in what the user is currently seeing. You can send images to the model as part of a conversation item. The model can then generate responses that reference the images.
+
+The following example JSON body adds an image to the conversation:
+
+```json
+{
+  "type": "conversation.item.create",
+  "previous_item_id": null,
+  "item": {
+    "type": "message",
+    "role": "user",
+    "content": [
+      {
+        "type": "input_image",
+        "image_url": "data:image/{format(example: png)};base64,{some_base64_image_bytes}"
+      }
+    ]
+  }
+}
+```
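Aside: only the event shape above comes from the commit. As a hedged sketch of how a client might build and send that event from Node.js, assuming `ws` is an already-open Realtime WebSocket and the PNG path is hypothetical:

```javascript
// Hypothetical helper: read a local PNG and send it as an input_image item.
import { readFileSync } from "node:fs";

function sendImage(ws, path) {
  const base64 = readFileSync(path).toString("base64"); // raw bytes -> base64
  ws.send(JSON.stringify({
    type: "conversation.item.create",
    previous_item_id: null,
    item: {
      type: "message",
      role: "user",
      content: [
        { type: "input_image", image_url: `data:image/png;base64,${base64}` },
      ],
    },
  }));
}
```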
+
+## MCP server support
+
+To enable MCP support in a Realtime API session, provide the URL of a remote MCP server in your session configuration. After connecting, the API automatically manages tool calls on your behalf.
+
+You can enhance your agent's functionality by specifying a different MCP server in the session configuration; any tools available on that server become accessible immediately.
+
+The following example JSON body sets up an MCP server:
+
+```json
+{
+  "session": {
+    "type": "realtime",
+    "tools": [
+      {
+        "type": "mcp",
+        "server_label": "stripe",
+        "server_url": "https://mcp.stripe.com",
+        "authorization": "{access_token}",
+        "require_approval": "never"
+      }
+    ]
+  }
+}
+```
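Aside: one plausible way to apply this configuration from a connected client is a `session.update` event; this sketch is not part of the commit, and whether `session.update` is the right vehicle should be verified against the Realtime API reference.

```javascript
// Hedged sketch: apply the MCP tool configuration to a live session.
// The access token is a placeholder for one valid on your MCP server.
function configureMcp(ws, accessToken) {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      type: "realtime",
      tools: [{
        type: "mcp",
        server_label: "stripe",
        server_url: "https://mcp.stripe.com",
        authorization: accessToken,
        require_approval: "never", // let the API call tools without pausing
      }],
    },
  }));
}
```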
+
+## Text-in, audio-out example
 
 Here's an example of the event sequence for a simple text-in, audio-out conversation:

articles/ai-foundry/openai/includes/realtime-portal.md

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ ms.date: 3/20/2025
 
 [!INCLUDE [Deploy model](realtime-deploy-model.md)]
 
-## Use the GPT-4o real-time audio
+## Use the GPT real-time audio
 
 To chat with your deployed `gpt-realtime` model in the [Azure AI Foundry](https://ai.azure.com/?cid=learnDocs) **Real-time audio** playground, follow these steps:

articles/ai-foundry/openai/realtime-audio-quickstart.md

Lines changed: 5 additions & 5 deletions
@@ -1,7 +1,7 @@
 ---
-title: 'How to use GPT-4o Realtime API for speech and audio with Azure OpenAI in Azure AI Foundry Models'
+title: 'How to use GPT Realtime API for speech and audio with Azure OpenAI in Azure AI Foundry Models'
 titleSuffix: Azure OpenAI
-description: Learn how to use GPT-4o Realtime API for speech and audio with Azure OpenAI.
+description: Learn how to use GPT Realtime API for speech and audio with Azure OpenAI.
 manager: nitinme
 ms.service: azure-ai-openai
 ms.topic: how-to
@@ -13,10 +13,10 @@ zone_pivot_groups: openai-portal-js-python-ts
 recommendations: false
 ---
 
-# GPT-4o Realtime API for speech and audio
+# GPT Realtime API for speech and audio
 
-Azure OpenAI GPT-4o Realtime API for speech and audio is part of the GPT-4o model family that supports low-latency, "speech in, speech out" conversational interactions.
+Azure OpenAI GPT Realtime API for speech and audio is part of the GPT-4o model family that supports low-latency, "speech in, speech out" conversational interactions.
 
 You can use the Realtime API via WebRTC or WebSocket to send audio input to the model and receive audio responses in real time.
 
@@ -27,7 +27,7 @@ Follow the instructions in this article to get started with the Realtime API via
 
 ## Supported models
 
-The GPT 4o real-time models are available for global deployments.
+The GPT real-time models are available for global deployments.
 - `gpt-4o-realtime-preview` (version `2024-12-17`)
 - `gpt-4o-mini-realtime-preview` (version `2024-12-17`)
 - `gpt-realtime` (version `2025-08-28`)

articles/ai-foundry/openai/whats-new.md

Lines changed: 3 additions & 3 deletions
@@ -22,7 +22,7 @@ This article provides a summary of the latest releases and major documentation u
 
 ### Realtime API audio model GA
 
-OpenAI's GPT-4o RealTime and Audio models are now generally available on Azure AI Foundry Direct Models.
+OpenAI's GPT RealTime and Audio models are now generally available on Azure AI Foundry Direct Models.
 
 Model improvements:
 - Improved instruction following: Enhanced capabilities to follow tone, pacing, and escalation instructions more accurately and reliably. Can also switch languages.
@@ -210,15 +210,15 @@ The `gpt-4o-audio-preview` model introduces the audio modality into the existing
 > [!NOTE]
 > The [Realtime API](./realtime-audio-quickstart.md) uses the same underlying GPT-4o audio model as the completions API, but is optimized for low-latency, real-time audio interactions.
 
-### GPT-4o Realtime API 2024-12-17
+### GPT Realtime API 2024-12-17
 
 The `gpt-4o-realtime-preview` model version 2024-12-17 is available for global deployments in [East US 2 and Sweden Central regions](./concepts/models.md#global-standard-model-availability). Use the `gpt-4o-realtime-preview` version 2024-12-17 model instead of the `gpt-4o-realtime-preview` version 2024-10-01-preview model for real-time audio interactions.
 
 - Added support for [prompt caching](./how-to/prompt-caching.md) with the `gpt-4o-realtime-preview` model.
 - Added support for new voices. The `gpt-4o-realtime-preview` models now support the following voices: `alloy`, `ash`, `ballad`, `coral`, `echo`, `sage`, `shimmer`, `verse`.
 - Rate limits are no longer based on connections per minute. Rate limiting is now based on RPM (requests per minute) and TPM (tokens per minute) for the `gpt-4o-realtime-preview` model. The rate limits for each `gpt-4o-realtime-preview` model deployment are 100 K TPM and 1 K RPM. During the preview, [Azure AI Foundry portal](https://ai.azure.com/?cid=learnDocs) and APIs might inaccurately show different rate limits. Even if you try to set a different rate limit, the actual rate limit is 100 K TPM and 1 K RPM.
 
-For more information, see the [GPT-4o real-time audio quickstart](realtime-audio-quickstart.md) and the [how-to guide](./how-to/realtime-audio.md).
+For more information, see the [GPT real-time audio quickstart](realtime-audio-quickstart.md) and the [how-to guide](./how-to/realtime-audio.md).
 
 ## December 2024
