
Commit bc1a808

audio completions preview
1 parent 711839a · commit bc1a808

37 files changed: 2,634 additions and 62 deletions

articles/ai-services/openai/audio-completions-quickstart.md

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
---
title: Quickstart - Getting started with Azure OpenAI audio generation
titleSuffix: Azure OpenAI
description: Walkthrough on how to get started with audio generation using Azure OpenAI.
manager: nitinme
ms.service: azure-ai-openai
ms.topic: how-to
ms.date: 1/21/2025
author: eric-urban
ms.author: eur
ms.custom: references_regions
zone_pivot_groups: audio-completions-quickstart
recommendations: false
---

# Quickstart: Get started using Azure OpenAI audio generation

::: zone pivot="ai-foundry-portal"

[!INCLUDE [AI Foundry](includes/audio-completions-ai-foundry.md)]

::: zone-end

::: zone pivot="programming-language-javascript"

[!INCLUDE [JavaScript quickstart](includes/audio-completions-javascript.md)]

::: zone-end

::: zone pivot="programming-language-python"

[!INCLUDE [Python SDK quickstart](includes/audio-completions-python.md)]

::: zone-end

::: zone pivot="rest-api"

[!INCLUDE [REST API quickstart](includes/audio-completions-rest.md)]

::: zone-end

::: zone pivot="programming-language-typescript"

[!INCLUDE [TypeScript quickstart](includes/audio-completions-typescript.md)]

::: zone-end

## Clean up resources

If you want to clean up and remove an Azure OpenAI resource, you can delete the resource. Before deleting the resource, you must first delete any deployed models.

- [Azure portal](../multi-service-resource.md?pivots=azportal#clean-up-resources)
- [Azure CLI](../multi-service-resource.md?pivots=azcli#clean-up-resources)

## Related content

* Learn more about Azure OpenAI [deployment types](./how-to/deployment-types.md)
* Learn more about Azure OpenAI [quotas and limits](quotas-limits.md)

articles/ai-services/openai/concepts/models.md

Lines changed: 11 additions & 8 deletions
@@ -20,7 +20,7 @@ Azure OpenAI Service is powered by a diverse set of models with different capabi
 |--|--|
 | [o1 & o1-mini](#o1-and-o1-mini-models-limited-access) | Limited access models, specifically designed to tackle reasoning and problem-solving tasks with increased focus and capability. |
 | [GPT-4o & GPT-4o mini & GPT-4 Turbo](#gpt-4o-and-gpt-4-turbo) | The latest most capable Azure OpenAI models with multimodal versions, which can accept both text and images as input. |
-| [GPT-4o-Realtime-Preview](#gpt-4o-realtime-preview) | A GPT-4o model that supports low-latency, "speech in, speech out" conversational interactions. |
+| [GPT-4o audio](#gpt-4o-audio) | GPT-4o audio models that support either low-latency, "speech in, speech out" conversational interactions or audio generation. |
 | [GPT-4](#gpt-4) | A set of models that improve on GPT-3.5 and can understand and generate natural language and code. |
 | [GPT-3.5](#gpt-35) | A set of models that improve on GPT-3 and can understand and generate natural language and code. |
 | [Embeddings](#embeddings-models) | A set of models that can convert text into numerical vector form to facilitate text similarity. |
@@ -56,20 +56,23 @@ To learn more about the advanced `o1` series models see, [getting started with o
 | `o1-preview` | See the [models table](#global-standard-model-availability). |
 | `o1-mini` | See the [models table](#global-provisioned-managed-model-availability). |

-## GPT-4o-Realtime-Preview
+## GPT-4o audio

-The GPT 4o audio models are part of the GPT-4o model family and support low-latency, "speech in, speech out" conversational interactions. GPT-4o audio is designed to handle real-time, low-latency conversational interactions, making it a great fit for support agents, assistants, translators, and other use cases that need highly responsive back-and-forth with a user.
+The GPT-4o audio models are part of the GPT-4o model family and support either low-latency, "speech in, speech out" conversational interactions or audio generation.
+- GPT-4o real-time audio is designed to handle real-time, low-latency conversational interactions, making it a great fit for support agents, assistants, translators, and other use cases that need highly responsive back-and-forth with a user. For more information on how to use GPT-4o real-time audio, see the [GPT-4o real-time audio quickstart](../realtime-audio-quickstart.md) and [how to use GPT-4o audio](../how-to/realtime-audio.md).
+- GPT-4o audio completion is designed to generate audio from audio or text prompts, making it a great fit for generating audiobooks, audio content, and other use cases that require audio generation. The GPT-4o audio completions model introduces the audio modality into the existing `/chat/completions` API. For more information on how to use GPT-4o audio completions, see the [audio generation quickstart](../audio-completions-quickstart.md).

-GPT-4o audio is available in the East US 2 (`eastus2`) and Sweden Central (`swedencentral`) regions. To use GPT-4o audio, you need to [create](../how-to/create-resource.md) or use an existing resource in one of the supported regions.
+GPT-4o audio is available in the East US 2 (`eastus2`) and Sweden Central (`swedencentral`) regions. To use GPT-4o real-time audio, you need [an Azure OpenAI resource](../how-to/create-resource.md) in one of the supported regions.

-When your resource is created, you can [deploy](../how-to/create-resource.md#deploy-a-model) the GPT-4o audio model. For more information on how to use GPT-4o audio, see the [GPT-4o audio quickstart](../realtime-audio-quickstart.md) and [how to use GPT-4o audio](../how-to/realtime-audio.md).
+When your resource is created, you can [deploy](../how-to/create-resource.md#deploy-a-model) the GPT-4o audio model.

 Details about maximum request tokens and training data are available in the following table.

 | Model ID | Description | Max Request (tokens) | Training Data (up to) |
 |---|---|---|---|
-|`gpt-4o-realtime-preview` (2024-10-01) <br> **GPT-4o audio** | **Audio model** for real-time audio processing |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
-|`gpt-4o-realtime-preview` (2024-12-17) <br> **GPT-4o audio** | **Audio model** for real-time audio processing |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
+|`gpt-4o-audio-preview` (2024-12-17) <br> **GPT-4o audio** | **Audio model** for audio and text generation. |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
+|`gpt-4o-realtime-preview` (2024-12-17) <br> **GPT-4o audio** | **Audio model** for real-time audio processing. |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
+|`gpt-4o-realtime-preview` (2024-10-01) <br> **GPT-4o audio** | **Audio model** for real-time audio processing. |Input: 128,000 <br> Output: 4,096 | Oct 2023 |

 ## GPT-4o and GPT-4 Turbo

@@ -126,7 +129,7 @@ See [model versions](../concepts/model-versions.md) to learn about how Azure Ope
 | `gpt-4` (0314) | **Older GA model** <br> - [Retirement information](./model-retirements.md#current-models) | 8,192 | Sep 2021 |

 > [!CAUTION]
-> We don't recommend using preview models in production. We will upgrade all deployments of preview models to either future preview versions or to the latest stable GA version. Models designated preview do not follow the standard Azure OpenAI model lifecycle.
+> We don't recommend using preview models in production. We will upgrade all deployments of preview models to either future preview versions or to the latest stable GA version. Models that are designated preview don't follow the standard Azure OpenAI model lifecycle.

 - GPT-4 version 0125-preview is an updated version of the GPT-4 Turbo preview previously released as version 1106-preview.
 - GPT-4 version 0125-preview completes tasks such as code generation more completely compared to gpt-4-1106-preview. Because of this, depending on the task, customers may find that GPT-4-0125-preview generates more output compared to the gpt-4-1106-preview. We recommend customers compare the outputs of the new model. GPT-4-0125-preview also addresses bugs in gpt-4-1106-preview with UTF-8 handling for non-English languages.

articles/ai-services/openai/how-to/realtime-audio.md

Lines changed: 5 additions & 5 deletions
@@ -26,7 +26,7 @@ The GPT 4o real-time models are available for global deployments in [East US 2 a
 - `gpt-4o-realtime-preview` (2024-12-17)
 - `gpt-4o-realtime-preview` (2024-10-01)

-See the [models and versions documentation](../concepts/models.md#gpt-4o-realtime-preview) for more information.
+See the [models and versions documentation](../concepts/models.md#gpt-4o-audio) for more information.

 ## Get started

@@ -248,7 +248,7 @@ In this case, the server evaluates user audio from the client (as sent via [`inp
 - The server commits the input audio buffer by sending the [`input_audio_buffer.committed`](../realtime-audio-reference.md#realtimeservereventinputaudiobuffercommitted) event.
 - The server sends the [`conversation.item.created`](../realtime-audio-reference.md#realtimeservereventconversationitemcreated) event with the user message item created from the audio buffer.

-:::image type="content" source="../media/how-to/real-time/input-audio-buffer-server-vad.png" alt-text="Diagram of the Realtime API input audio sequence with server decision mode." lightbox="../media/how-to/real-time/input-audio-buffer-server-vad.png":::
+:::image type="content" source="../media/how-to/real-time/input-audio-buffer-server-vad.png" alt-text="Diagram of the real-time API input audio sequence with server decision mode." lightbox="../media/how-to/real-time/input-audio-buffer-server-vad.png":::


 <!--
@@ -300,7 +300,7 @@ Optionally, the client can truncate or delete items in the conversation:
 - The client deletes an item in the conversation with a [`conversation.item.delete`](../realtime-audio-reference.md#realtimeclienteventconversationitemdelete) event.
 - The server [`conversation.item.deleted`](../realtime-audio-reference.md#realtimeservereventconversationitemdeleted) event is returned to sync the client and server state.

-:::image type="content" source="../media/how-to/real-time/conversation-item-sequence.png" alt-text="Diagram of the Realtime API conversation item sequence." lightbox="../media/how-to/real-time/conversation-item-sequence.png":::
+:::image type="content" source="../media/how-to/real-time/conversation-item-sequence.png" alt-text="Diagram of the real-time API conversation item sequence." lightbox="../media/how-to/real-time/conversation-item-sequence.png":::

 <!--
 sequenceDiagram
@@ -324,11 +324,11 @@ To get a response from the model:
 - The client sends a [`response.create`](../realtime-audio-reference.md#realtimeclienteventresponsecreate) event. The server responds with a [`response.created`](../realtime-audio-reference.md#realtimeservereventresponsecreated) event. The response can contain one or more items, each of which can contain one or more content parts.
 - Or, when using server-side voice activity detection (VAD), the server automatically generates a response when it detects the end of speech in the input audio buffer. The server sends a [`response.created`](../realtime-audio-reference.md#realtimeservereventresponsecreated) event with the generated response.

-### Response interuption
+### Response interruption

 The client [`response.cancel`](../realtime-audio-reference.md#realtimeclienteventresponsecancel) event is used to cancel an in-progress response.

-A user might want to interrupt the assistant's response or ask the assistant to stop talking. The server produces audio faster than realtime. The client can send a [`conversation.item.truncate`](../realtime-audio-reference.md#realtimeclienteventconversationitemtruncate) event to truncate the audio before it's played.
+A user might want to interrupt the assistant's response or ask the assistant to stop talking. The server produces audio faster than real-time. The client can send a [`conversation.item.truncate`](../realtime-audio-reference.md#realtimeclienteventconversationitemtruncate) event to truncate the audio before it's played.
 - The server's understanding of the audio with the client's playback is synchronized.
 - Truncating audio deletes the server-side text transcript to ensure there isn't text in the context that the user doesn't know about.
 - The server responds with a [`conversation.item.truncated`](../realtime-audio-reference.md#realtimeservereventconversationitemtruncated) event.
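
To make the interruption flow concrete, the two client events can be sent back to back over the existing Realtime API WebSocket session. The following is a minimal sketch, not part of this commit: it assumes an already-open session object `ws` with a synchronous `send` method (for example, from the `websocket-client` package), and the `item_id`, `content_index`, and `audio_end_ms` fields follow the Realtime API reference linked above rather than anything shown in this diff.

```python
import json

def interrupt_assistant(ws, item_id: str, audio_end_ms: int) -> None:
    """Cancel the in-progress response, then trim the audio item at the point
    where client playback actually stopped."""
    # Ask the server to stop generating the current response.
    ws.send(json.dumps({"type": "response.cancel"}))

    # Truncate the already-generated audio so the server-side transcript matches
    # what the user really heard (the server produces audio faster than real-time).
    ws.send(json.dumps({
        "type": "conversation.item.truncate",
        "item_id": item_id,            # ID of the assistant message item being played
        "content_index": 0,            # index of the audio content part within that item
        "audio_end_ms": audio_end_ms,  # client playback position, in milliseconds
    }))
```

If the events are accepted, the server replies with the `conversation.item.truncated` event described in the last bullet, which the client can use to keep its local state in sync.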

articles/ai-services/openai/includes/assistants-javascript.md

Lines changed: 1 addition & 1 deletion
@@ -65,7 +65,7 @@ For the recommended keyless authentication with Microsoft Entra ID, you need to:

 ## Retrieve resource information

-[!INCLUDE [resource authentication](resource-auth.md)]
+[!INCLUDE [resource authentication](resource-authentication.md)]

 > [!CAUTION]
 > To use the recommended keyless authentication with the SDK, make sure that the `AZURE_OPENAI_API_KEY` environment variable isn't set.

articles/ai-services/openai/includes/assistants-typescript.md

Lines changed: 1 addition & 1 deletion
@@ -65,7 +65,7 @@ For the recommended keyless authentication with Microsoft Entra ID, you need to:

 ## Retrieve resource information

-[!INCLUDE [resource authentication](resource-auth.md)]
+[!INCLUDE [resource authentication](resource-authentication.md)]

 > [!CAUTION]
 > To use the recommended keyless authentication with the SDK, make sure that the `AZURE_OPENAI_API_KEY` environment variable isn't set.

articles/ai-services/openai/includes/audio-completions-ai-foundry.md

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
---
manager: nitinme
author: eric-urban
ms.author: eur
ms.service: azure-ai-openai
ms.topic: include
ms.date: 1/7/2025
---

[!INCLUDE [Audio completions introduction](audio-completions-intro.md)]

## Deploy a model for audio generation

[!INCLUDE [Deploy model](audio-completions-deploy-model.md)]

## Use GPT-4o audio generation

To chat with your deployed `gpt-4o-audio-preview` model in the **Chat** playground of [Azure AI Foundry portal](https://ai.azure.com), follow these steps:

1. Go to the [Azure OpenAI Service page](https://ai.azure.com/resource/overview) in Azure AI Foundry portal. Make sure you're signed in with the Azure subscription that has your Azure OpenAI Service resource and the deployed `gpt-4o-audio-preview` model.
1. Select the **Chat** playground from under **Resource playground** in the left pane.
1. Select your deployed `gpt-4o-audio-preview` model from the **Deployment** dropdown.
1. Start chatting with the model and listen to the audio responses.

:::image type="content" source="../media/quickstarts/audio-completions-chat-playground.png" alt-text="Screenshot of the Chat playground page." lightbox="../media/quickstarts/audio-completions-chat-playground.png":::

You can:
- Record audio prompts.
- Attach audio files to the chat.
- Enter text prompts.

articles/ai-services/openai/includes/audio-completions-deploy-model.md

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
---
manager: nitinme
author: eric-urban
ms.author: eur
ms.service: azure-ai-openai
ms.topic: include
ms.date: 1/21/2025
---

To deploy the `gpt-4o-audio-preview` model in the Azure AI Foundry portal:
1. Go to the [Azure OpenAI Service page](https://ai.azure.com/resource/overview) in Azure AI Foundry portal. Make sure you're signed in with the Azure subscription that has your Azure OpenAI Service resource.
1. Select the **Chat** playground from under **Playgrounds** in the left pane.
1. Select **+ Create new deployment** > **From base models** to open the deployment window.
1. Search for and select the `gpt-4o-audio-preview` model and then select **Deploy to selected resource**.
1. In the deployment wizard, select the `2024-12-17` model version.
1. Follow the wizard to finish deploying the model.

Now that you have a deployment of the `gpt-4o-audio-preview` model, you can interact with it in the Azure AI Foundry portal **Chat** playground or via the chat completions API.

articles/ai-services/openai/includes/audio-completions-intro.md

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
---
manager: nitinme
author: eric-urban
ms.author: eur
ms.service: azure-ai-openai
ms.topic: include
ms.date: 1/21/2025
---

The `gpt-4o-audio-preview` model introduces the audio modality into the existing `/chat/completions` API. The audio model expands the potential for AI applications in text and voice-based interactions and audio analysis. Modalities supported in the `gpt-4o-audio-preview` model include text, audio, and text + audio.

Here's a table of the supported modalities with example use cases:

| Modality input | Modality output | Example use case |
| --- | --- | --- |
| Text | Text + audio | Text to speech, audio book generation |
| Audio | Text + audio | Audio transcription, audio book generation |
| Audio | Text | Audio transcription |
| Text + audio | Text + audio | Audio book generation |
| Text + audio | Text | Audio transcription |

By using audio generation capabilities, you can achieve more dynamic and interactive AI applications. Models that support audio inputs and outputs allow you to generate spoken audio responses to prompts and use audio inputs to prompt the model.
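
To make the table concrete, here's a minimal sketch of a text-in, audio-out call against a `gpt-4o-audio-preview` deployment using the `openai` Python package. It isn't part of this commit: the endpoint, key, and deployment names are placeholders, key-based authentication is shown only for brevity, and the request shape should be checked against the language-specific quickstarts added in this commit.

```python
import base64
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # e.g. https://<resource>.openai.azure.com
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2025-01-01-preview",
)

# Ask for both text and spoken audio in the response (text in -> text + audio out).
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",  # your deployment name
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[{"role": "user", "content": "Is a golden retriever a good family dog?"}],
)

# The audio arrives base64-encoded alongside a text transcript.
choice = completion.choices[0].message
with open("answer.wav", "wb") as f:
    f.write(base64.b64decode(choice.audio.data))
print(choice.audio.transcript)
```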

## Supported models

Currently, only `gpt-4o-audio-preview` version `2024-12-17` supports audio generation.

The `gpt-4o-audio-preview` model is available for global deployments in [East US 2 and Sweden Central regions](../concepts/models.md#global-standard-model-availability).

Currently, the following voices are supported for audio output: Alloy, Echo, and Shimmer.

> [!NOTE]
> The [Realtime API](../realtime-audio-quickstart.md) uses the same underlying GPT-4o audio model as the completions API, but is optimized for low-latency, real-time audio interactions.

## API support

Support for audio completions was first added in API version `2025-01-01-preview`.
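
The audio-input rows of the modality table work through the same `/chat/completions` call. As another sketch that isn't part of this commit, a recorded prompt can be base64-encoded and passed as an `input_audio` content part; the content-part field names are assumptions based on the public chat completions audio reference, not on this diff.

```python
import base64
import os

from openai import AzureOpenAI

# Same client setup as the earlier sketch (endpoint and key names are placeholders).
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2025-01-01-preview",
)

# Read a local WAV file and base64-encode it for the request.
with open("question.wav", "rb") as f:
    encoded_audio = base64.b64encode(f.read()).decode("utf-8")

# Audio in -> text out (the "Audio | Text" row of the table above).
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",  # your deployment name
    modalities=["text"],           # request a text-only response
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this recording, then answer the question in it."},
                {"type": "input_audio", "input_audio": {"data": encoded_audio, "format": "wav"}},
            ],
        }
    ],
)

print(completion.choices[0].message.content)
```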
