
Commit 0854005

Merge pull request #6938 from PatrickFarley/openai-audio
OpenAI audio
2 parents: 0e3ef9b + f916e7a

File tree: 7 files changed (+87 −36 lines)

articles/ai-foundry/openai/concepts/audio.md

Lines changed: 2 additions & 2 deletions
@@ -22,7 +22,7 @@ For information about the available audio models per region in Azure OpenAI, see
 
 ## GPT-4o audio Realtime API
 
-GPT-4o real-time audio is designed to handle real-time, low-latency conversational interactions, making it a great fit for support agents, assistants, translators, and other use cases that need highly responsive back-and-forth with a user. For more information on how to use GPT-4o real-time audio, see the [GPT-4o real-time audio quickstart](../realtime-audio-quickstart.md) and [how to use GPT-4o audio](../how-to/realtime-audio.md).
+GPT real-time audio is designed to handle real-time, low-latency conversational interactions, making it a great fit for support agents, assistants, translators, and other use cases that need highly responsive back-and-forth with a user. For more information on how to use GPT real-time audio, see the [GPT real-time audio quickstart](../realtime-audio-quickstart.md) and [how to use GPT-4o audio](../how-to/realtime-audio.md).
 
 ## GPT-4o audio completions
 
@@ -40,4 +40,4 @@ The audio models via the `/audio` API can be used for speech to text, translatio
 - [Audio models](models.md#audio-models)
 - [Whisper quickstart](../whisper-quickstart.md)
 - [Audio generation quickstart](../audio-completions-quickstart.md)
-- [GPT-4o real-time audio quickstart](../realtime-audio-quickstart.md)
+- [GPT real-time audio quickstart](../realtime-audio-quickstart.md)

articles/ai-foundry/openai/how-to/realtime-audio-webrtc.md

Lines changed: 9 additions & 9 deletions
@@ -1,7 +1,7 @@
 ---
-title: 'How to use the GPT-4o Realtime API via WebRTC'
+title: 'How to use the GPT Realtime API via WebRTC'
 titleSuffix: Azure OpenAI in Azure AI Foundry Models
-description: Learn how to use the GPT-4o Realtime API for speech and audio via WebRTC.
+description: Learn how to use the GPT Realtime API for speech and audio via WebRTC.
 manager: nitinme
 ms.service: azure-ai-openai
 ms.topic: how-to
@@ -12,10 +12,10 @@ ms.custom: references_regions
 recommendations: false
 ---
 
-# How to use the GPT-4o Realtime API via WebRTC
+# How to use the GPT Realtime API via WebRTC
 
-Azure OpenAI GPT-4o Realtime API for speech and audio is part of the GPT-4o model family that supports low-latency, "speech in, speech out" conversational interactions.
+Azure OpenAI GPT Realtime API for speech and audio is part of the GPT-4o model family that supports low-latency, "speech in, speech out" conversational interactions.
 
 You can use the Realtime API via WebRTC or WebSocket to send audio input to the model and receive audio responses in real time. Follow the instructions in this article to get started with the Realtime API via WebRTC.
 
@@ -29,7 +29,7 @@ Use the [Realtime API via WebSockets](./realtime-audio-websockets.md) if you nee
 ## Supported models
 
-The GPT 4o real-time models are available for global deployments in [East US 2 and Sweden Central regions](../concepts/models.md#global-standard-model-availability).
+The GPT real-time models are available for global deployments in [East US 2 and Sweden Central regions](../concepts/models.md#global-standard-model-availability).
 - `gpt-4o-mini-realtime-preview` (2024-12-17)
 - `gpt-4o-realtime-preview` (2024-12-17)
 - `gpt-realtime` (version 2025-08-28)
@@ -40,7 +40,7 @@ For more information about supported models, see the [models and versions docume
 ## Prerequisites
 
-Before you can use GPT-4o real-time audio, you need:
+Before you can use GPT real-time audio, you need:
 
 - An Azure subscription - <a href="https://azure.microsoft.com/free/cognitive-services" target="_blank">Create one for free</a>.
 - An Azure OpenAI resource created in a [supported region](#supported-models). For more information, see [Create a resource and deploy a model with Azure OpenAI](create-resource.md).
@@ -113,9 +113,9 @@ sequenceDiagram
 ## WebRTC example via HTML and JavaScript
 
-The following code sample demonstrates how to use the GPT-4o Realtime API via WebRTC. The sample uses the [WebRTC API](https://developer.mozilla.org/en-US/docs/Web/API/WebRTC_API) to establish a real-time audio connection with the model.
+The following code sample demonstrates how to use the GPT Realtime API via WebRTC. The sample uses the [WebRTC API](https://developer.mozilla.org/en-US/docs/Web/API/WebRTC_API) to establish a real-time audio connection with the model.
 
-The sample code is an HTML page that allows you to start a session with the GPT-4o Realtime API and send audio input to the model. The model's responses are played back in real-time.
+The sample code is an HTML page that allows you to start a session with the GPT Realtime API and send audio input to the model. The model's responses are played back in real time.
 
 > [!WARNING]
 > The sample code includes the API key hardcoded in the JavaScript. This code isn't recommended for production use. In a production environment, you should use a secure backend service to generate an ephemeral key and return it to the client.
@@ -299,7 +299,7 @@ The sample code is an HTML page that allows you to start a session with the GPT-
 </html>
 ```
 
-1. Select **Start Session** to start a session with the GPT-4o Realtime API. The session ID and ephemeral key are displayed in the log container.
+1. Select **Start Session** to start a session with the GPT Realtime API. The session ID and ephemeral key are displayed in the log container.
 1. Allow the browser to access your microphone when prompted.
 1. Confirmation messages are displayed in the log container as the session progresses. Here's an example of the log messages:
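Aside: the body of the HTML sample is collapsed in this commit view. As a rough sketch of the flow such a page implements (mint an ephemeral key, wire up microphone capture and playback, then exchange SDP over HTTPS), the following JavaScript is illustrative only; the endpoint URLs, `api-version`, and payload fields are assumptions to verify against the published sample.

```javascript
// Minimal sketch (not taken from this commit): start a WebRTC session with a
// Realtime deployment. URLs, api-version, and fields are assumptions.
const SESSIONS_URL =
  "https://<your-resource>.openai.azure.com/openai/realtimeapi/sessions" +
  "?api-version=2025-04-01-preview";
const WEBRTC_URL = "https://<region>.realtimeapi-preview.ai.azure.com/v1/realtimertc";
const DEPLOYMENT = "gpt-realtime";

async function startSession(apiKey) {
  // 1. Mint a short-lived ephemeral key (do this on a backend in production).
  const sessionResponse = await fetch(SESSIONS_URL, {
    method: "POST",
    headers: { "api-key": apiKey, "Content-Type": "application/json" },
    body: JSON.stringify({ model: DEPLOYMENT, voice: "verse" }),
  });
  const ephemeralKey = (await sessionResponse.json()).client_secret?.value;

  // 2. Create the peer connection and play remote audio as it arrives.
  const pc = new RTCPeerConnection();
  pc.ontrack = (event) => {
    const audio = new Audio();
    audio.srcObject = event.streams[0];
    audio.play();
  };

  // 3. Attach the microphone track.
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  mic.getTracks().forEach((track) => pc.addTrack(track, mic));

  // 4. Exchange SDP with the Realtime endpoint using the ephemeral key.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  const sdpResponse = await fetch(`${WEBRTC_URL}?model=${DEPLOYMENT}`, {
    method: "POST",
    headers: { Authorization: `Bearer ${ephemeralKey}`, "Content-Type": "application/sdp" },
    body: offer.sdp,
  });
  await pc.setRemoteDescription({ type: "answer", sdp: await sdpResponse.text() });
  return pc;
}
```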

articles/ai-foundry/openai/how-to/realtime-audio-websockets.md

Lines changed: 6 additions & 6 deletions
@@ -1,7 +1,7 @@
 ---
-title: 'How to use the GPT-4o Realtime API via WebSockets'
+title: 'How to use the GPT Realtime API via WebSockets'
 titleSuffix: Azure OpenAI in Azure AI Foundry Models
-description: Learn how to use the GPT-4o Realtime API for speech and audio via WebSockets.
+description: Learn how to use the GPT Realtime API for speech and audio via WebSockets.
 manager: nitinme
 ms.service: azure-ai-openai
 ms.topic: how-to
@@ -12,10 +12,10 @@ ms.custom: references_regions
 recommendations: false
 ---
 
-# How to use the GPT-4o Realtime API via WebSockets
+# How to use the GPT Realtime API via WebSockets
 
-Azure OpenAI GPT-4o Realtime API for speech and audio is part of the GPT-4o model family that supports low-latency, "speech in, speech out" conversational interactions.
+Azure OpenAI GPT Realtime API for speech and audio is part of the GPT-4o model family that supports low-latency, "speech in, speech out" conversational interactions.
 
 You can use the Realtime API via WebRTC or WebSocket to send audio input to the model and receive audio responses in real time.
 
@@ -26,7 +26,7 @@ Follow the instructions in this article to get started with the Realtime API via
 
 ## Supported models
 
-The GPT-4o real-time models are available for global deployments in [East US 2 and Sweden Central regions](../concepts/models.md#global-standard-model-availability).
+The GPT real-time models are available for global deployments in [East US 2 and Sweden Central regions](../concepts/models.md#global-standard-model-availability).
 - `gpt-4o-mini-realtime-preview` (2024-12-17)
 - `gpt-4o-realtime-preview` (2024-12-17)
 - `gpt-realtime` (version 2025-08-28)
@@ -37,7 +37,7 @@ For more information about supported models, see the [models and versions docume
 
 ## Prerequisites
 
-Before you can use GPT-4o real-time audio, you need:
+Before you can use GPT real-time audio, you need:
 
 - An Azure subscription - <a href="https://azure.microsoft.com/free/cognitive-services" target="_blank">Create one for free</a>.
 - An Azure OpenAI resource created in a [supported region](#supported-models). For more information, see [Create a resource and deploy a model with Azure OpenAI](create-resource.md).
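Aside: to make the WebSocket flow concrete, here's a minimal Node.js sketch. It isn't taken from this commit; the URL shape, `api-version`, header name, and event payloads are assumptions to check against the article itself.

```javascript
// Minimal sketch: connect to a Realtime deployment over WebSocket.
import WebSocket from "ws"; // npm install ws

const url =
  "wss://<your-resource>.openai.azure.com/openai/realtime" +
  "?api-version=2024-10-01-preview&deployment=gpt-realtime";

const ws = new WebSocket(url, {
  headers: { "api-key": process.env.AZURE_OPENAI_API_KEY },
});

ws.on("open", () => {
  // Configure the session, add a user message, then request a response.
  ws.send(JSON.stringify({ type: "session.update", session: { voice: "verse" } }));
  ws.send(JSON.stringify({
    type: "conversation.item.create",
    item: { type: "message", role: "user", content: [{ type: "input_text", text: "Hello!" }] },
  }));
  ws.send(JSON.stringify({ type: "response.create" }));
});

ws.on("message", (data) => {
  const event = JSON.parse(data.toString());
  console.log(event.type); // e.g. session.created, response.audio.delta, response.done
});
```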

articles/ai-foundry/openai/how-to/realtime-audio.md

Lines changed: 61 additions & 10 deletions
@@ -1,7 +1,7 @@
 ---
-title: 'How to use the GPT-4o Realtime API for speech and audio with Azure OpenAI'
+title: 'How to use the GPT Realtime API for speech and audio with Azure OpenAI'
 titleSuffix: Azure OpenAI in Azure AI Foundry Models
-description: Learn how to use the GPT-4o Realtime API for speech and audio with Azure OpenAI.
+description: Learn how to use the GPT Realtime API for speech and audio with Azure OpenAI.
 manager: nitinme
 ms.service: azure-ai-openai
 ms.topic: how-to
@@ -12,9 +12,9 @@ ms.custom: references_regions
 recommendations: false
 ---
 
-# How to use the GPT-4o Realtime API for speech and audio
+# How to use the GPT Realtime API for speech and audio
 
-Azure OpenAI GPT-4o Realtime API for speech and audio is part of the GPT-4o model family that supports low-latency, "speech in, speech out" conversational interactions. The GPT-4o Realtime API is designed to handle real-time, low-latency conversational interactions. Realtime API is a great fit for use cases involving live interactions between a user and a model, such as customer support agents, voice assistants, and real-time translators.
+Azure OpenAI GPT Realtime API for speech and audio is part of the GPT-4o model family that supports low-latency, "speech in, speech out" conversational interactions. The GPT Realtime API is designed to handle real-time, low-latency conversational interactions. Realtime API is a great fit for use cases involving live interactions between a user and a model, such as customer support agents, voice assistants, and real-time translators.
 
 Most users of the Realtime API need to deliver and receive audio from an end-user in real time, including applications that use WebRTC or a telephony system. The Realtime API isn't designed to connect directly to end user devices and relies on client integrations to terminate end user audio streams.
 
@@ -24,7 +24,7 @@ You can use the Realtime API via WebRTC or WebSocket to send audio input to the
 
 ## Supported models
 
-The GPT 4o real-time models are available for global deployments in [East US 2 and Sweden Central regions](../concepts/models.md#global-standard-model-availability).
+The GPT real-time models are available for global deployments in [East US 2 and Sweden Central regions](../concepts/models.md#global-standard-model-availability).
 - `gpt-4o-mini-realtime-preview` (2024-12-17)
 - `gpt-4o-realtime-preview` (2024-12-17)
 - `gpt-realtime` (version 2025-08-28)
@@ -35,16 +35,16 @@ See the [models and versions documentation](../concepts/models.md#audio-models)
 
 ## Get started
 
-Before you can use GPT-4o real-time audio, you need:
+Before you can use GPT real-time audio, you need:
 
 - An Azure subscription - <a href="https://azure.microsoft.com/free/cognitive-services" target="_blank">Create one for free</a>.
 - An Azure OpenAI resource created in a [supported region](#supported-models). For more information, see [Create a resource and deploy a model with Azure OpenAI](create-resource.md).
 - You need a deployment of the `gpt-4o-realtime-preview`, `gpt-4o-mini-realtime-preview`, or `gpt-realtime` model in a supported region as described in the [supported models](#supported-models) section. You can deploy the model from the [Azure AI Foundry portal model catalog](../../../ai-foundry/how-to/model-catalog-overview.md) or from your project in Azure AI Foundry portal.
 
-Here are some of the ways you can get started with the GPT-4o Realtime API for speech and audio:
+Here are some of the ways you can get started with the GPT Realtime API for speech and audio:
 - For steps to deploy and use the `gpt-4o-realtime-preview`, `gpt-4o-mini-realtime-preview`, or `gpt-realtime` model, see [the real-time audio quickstart](../realtime-audio-quickstart.md).
 - Try the [WebRTC via HTML and JavaScript example](./realtime-audio-webrtc.md#webrtc-example-via-html-and-javascript) to get started with the Realtime API via WebRTC.
-- [The Azure-Samples/aisearch-openai-rag-audio repo](https://github.com/Azure-Samples/aisearch-openai-rag-audio) contains an example of how to implement RAG support in applications that use voice as their user interface, powered by the GPT-4o realtime API for audio.
+- [The Azure-Samples/aisearch-openai-rag-audio repo](https://github.com/Azure-Samples/aisearch-openai-rag-audio) contains an example of how to implement RAG support in applications that use voice as their user interface, powered by the GPT realtime API for audio.
 
 ## Session configuration
 
@@ -229,7 +229,7 @@ Set [`turn_detection.create_response`](../realtime-audio-reference.md#realtimetu
 
 ## Conversation and response generation
 
-The GPT-4o real-time audio models are designed for real-time, low-latency conversational interactions. The API is built on a series of events that allow the client to send and receive messages, control the flow of the conversation, and manage the state of the session.
+The GPT real-time audio models are designed for real-time, low-latency conversational interactions. The API is built on a series of events that allow the client to send and receive messages, control the flow of the conversation, and manage the state of the session.
 
 ### Conversation sequence and items
 
@@ -278,7 +278,58 @@ A user might want to interrupt the assistant's response or ask the assistant to
 - Truncating audio deletes the server-side text transcript to ensure there isn't text in the context that the user doesn't know about.
 - The server responds with a [`conversation.item.truncated`](../realtime-audio-reference.md#realtimeservereventconversationitemtruncated) event.
 
-## Text in audio out example
+## Image input
+
+The `gpt-realtime` model supports image input as part of the conversation. The model can ground responses in what the user is currently seeing. You can send images to the model as part of a conversation item. The model can then generate responses that reference the images.
+
+The following example JSON body adds an image to the conversation:
+
+```json
+{
+  "type": "conversation.item.create",
+  "previous_item_id": null,
+  "item": {
+    "type": "message",
+    "role": "user",
+    "content": [
+      {
+        "type": "input_image",
+        "image_url": "data:image/{format(example: png)};base64,{some_base64_image_bytes}"
+      }
+    ]
+  }
+}
+```
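Aside: only the event shape above comes from the commit. As a hedged sketch of how a client might build and send that event from Node.js, assuming `ws` is an already-open Realtime WebSocket and the PNG path is hypothetical:

```javascript
// Hypothetical helper: read a local PNG and send it as an input_image item.
import { readFileSync } from "node:fs";

function sendImage(ws, path) {
  const base64 = readFileSync(path).toString("base64"); // raw bytes -> base64
  ws.send(JSON.stringify({
    type: "conversation.item.create",
    previous_item_id: null,
    item: {
      type: "message",
      role: "user",
      content: [
        { type: "input_image", image_url: `data:image/png;base64,${base64}` },
      ],
    },
  }));
}
```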
+
+## MCP server support
+
+To enable MCP support in a Realtime API session, provide the URL of a remote MCP server in your session configuration. After connecting, the API automatically manages tool calls on your behalf.
+
+You can enhance your agent's functionality by specifying a different MCP server in the session configuration; any tools available on that server become accessible immediately.
+
+The following example JSON body sets up an MCP server:
+
+```json
+{
+  "session": {
+    "type": "realtime",
+    "tools": [
+      {
+        "type": "mcp",
+        "server_label": "stripe",
+        "server_url": "https://mcp.stripe.com",
+        "authorization": "{access_token}",
+        "require_approval": "never"
+      }
+    ]
+  }
+}
+```
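Aside: one plausible way to apply this configuration from a connected client is a `session.update` event; this sketch is not part of the commit, and whether `session.update` is the right vehicle should be verified against the Realtime API reference.

```javascript
// Hedged sketch: apply the MCP tool configuration to a live session.
// The access token is a placeholder for one valid on your MCP server.
function configureMcp(ws, accessToken) {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      type: "realtime",
      tools: [{
        type: "mcp",
        server_label: "stripe",
        server_url: "https://mcp.stripe.com",
        authorization: accessToken,
        require_approval: "never", // let the API call tools without pausing
      }],
    },
  }));
}
```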
+
+## Text-in, audio-out example
 
 Here's an example of the event sequence for a simple text-in, audio-out conversation:

articles/ai-foundry/openai/includes/realtime-portal.md

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ ms.date: 3/20/2025
 
 [!INCLUDE [Deploy model](realtime-deploy-model.md)]
 
-## Use the GPT-4o real-time audio
+## Use the GPT real-time audio
 
 To chat with your deployed `gpt-realtime` model in the [Azure AI Foundry](https://ai.azure.com/?cid=learnDocs) **Real-time audio** playground, follow these steps:

articles/ai-foundry/openai/realtime-audio-quickstart.md

Lines changed: 5 additions & 5 deletions
@@ -1,7 +1,7 @@
 ---
-title: 'How to use GPT-4o Realtime API for speech and audio with Azure OpenAI in Azure AI Foundry Models'
+title: 'How to use GPT Realtime API for speech and audio with Azure OpenAI in Azure AI Foundry Models'
 titleSuffix: Azure OpenAI
-description: Learn how to use GPT-4o Realtime API for speech and audio with Azure OpenAI.
+description: Learn how to use GPT Realtime API for speech and audio with Azure OpenAI.
 manager: nitinme
 ms.service: azure-ai-openai
 ms.topic: how-to
@@ -13,10 +13,10 @@ zone_pivot_groups: openai-portal-js-python-ts
 recommendations: false
 ---
 
-# GPT-4o Realtime API for speech and audio
+# GPT Realtime API for speech and audio
 
-Azure OpenAI GPT-4o Realtime API for speech and audio is part of the GPT-4o model family that supports low-latency, "speech in, speech out" conversational interactions.
+Azure OpenAI GPT Realtime API for speech and audio is part of the GPT-4o model family that supports low-latency, "speech in, speech out" conversational interactions.
 
 You can use the Realtime API via WebRTC or WebSocket to send audio input to the model and receive audio responses in real time.
 
@@ -27,7 +27,7 @@ Follow the instructions in this article to get started with the Realtime API via
 
 ## Supported models
 
-The GPT 4o real-time models are available for global deployments.
+The GPT real-time models are available for global deployments.
 - `gpt-4o-realtime-preview` (version `2024-12-17`)
 - `gpt-4o-mini-realtime-preview` (version `2024-12-17`)
 - `gpt-realtime` (version `2025-08-28`)

articles/ai-foundry/openai/whats-new.md

Lines changed: 3 additions & 3 deletions
@@ -22,7 +22,7 @@ This article provides a summary of the latest releases and major documentation u
 
 ### Realtime API audio model GA
 
-OpenAI's GPT-4o RealTime and Audio models are now generally available on Azure AI Foundry Direct Models.
+OpenAI's GPT RealTime and Audio models are now generally available on Azure AI Foundry Direct Models.
 
 Model improvements:
 - Improved instruction following: Enhanced capabilities to follow tone, pacing, and escalation instructions more accurately and reliably. Can also switch languages.
@@ -210,15 +210,15 @@ The `gpt-4o-audio-preview` model introduces the audio modality into the existing
 > [!NOTE]
 > The [Realtime API](./realtime-audio-quickstart.md) uses the same underlying GPT-4o audio model as the completions API, but is optimized for low-latency, real-time audio interactions.
 
-### GPT-4o Realtime API 2024-12-17
+### GPT Realtime API 2024-12-17
 
 The `gpt-4o-realtime-preview` model version 2024-12-17 is available for global deployments in [East US 2 and Sweden Central regions](./concepts/models.md#global-standard-model-availability). Use the `gpt-4o-realtime-preview` version 2024-12-17 model instead of the `gpt-4o-realtime-preview` version 2024-10-01-preview model for real-time audio interactions.
 
 - Added support for [prompt caching](./how-to/prompt-caching.md) with the `gpt-4o-realtime-preview` model.
 - Added support for new voices. The `gpt-4o-realtime-preview` models now support the following voices: `alloy`, `ash`, `ballad`, `coral`, `echo`, `sage`, `shimmer`, `verse`.
 - Rate limits are no longer based on connections per minute. Rate limiting is now based on RPM (requests per minute) and TPM (tokens per minute) for the `gpt-4o-realtime-preview` model. The rate limits for each `gpt-4o-realtime-preview` model deployment are 100 K TPM and 1 K RPM. During the preview, [Azure AI Foundry portal](https://ai.azure.com/?cid=learnDocs) and APIs might inaccurately show different rate limits. Even if you try to set a different rate limit, the actual rate limit is 100 K TPM and 1 K RPM.
 
-For more information, see the [GPT-4o real-time audio quickstart](realtime-audio-quickstart.md) and the [how-to guide](./how-to/realtime-audio.md).
+For more information, see the [GPT real-time audio quickstart](realtime-audio-quickstart.md) and the [how-to guide](./how-to/realtime-audio.md).
 
 ## December 2024
