Skip to content

Commit 3996a6e

Browse files
committed
audio models for AOAI
1 parent 3df8bfc commit 3996a6e

File tree

4 files changed

+140
-67
lines changed

4 files changed

+140
-67
lines changed
Lines changed: 79 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,79 @@
1+
---
2+
title: Azure OpenAI Service audio
3+
titleSuffix: Azure OpenAI
4+
description: Learn about the audio capabilities of Azure OpenAI Service.
5+
author: eric-urban
6+
ms.author: eur
7+
ms.service: azure-ai-openai
8+
ms.topic: conceptual
9+
ms.date: 4/15/2025
10+
ms.custom: template-concept
11+
manager: nitinme
12+
---
13+
14+
# Audio capabilities in Azure OpenAI Service
15+
16+
> [!IMPORTANT]
17+
> The content filtering system isn't applied to prompts and completions processed by the audio models such as Whisper in Azure OpenAI Service. Learn more about the [Audio API in Azure OpenAI](models.md?tabs=standard-audio#standard-models-by-endpoint).
18+
19+
20+
### GPT-4o audio models
21+
22+
The GPT 4o audio models are part of the GPT-4o model family and support either low-latency, "speech in, speech out" conversational interactions or audio generation.
23+
- GPT-4o real-time audio is designed to handle real-time, low-latency conversational interactions, making it a great fit for support agents, assistants, translators, and other use cases that need highly responsive back-and-forth with a user. For more information on how to use GPT-4o real-time audio, see the [GPT-4o real-time audio quickstart](../realtime-audio-quickstart.md) and [how to use GPT-4o audio](../how-to/realtime-audio.md).
24+
- GPT-4o audio completion is designed to generate audio from audio or text prompts, making it a great fit for generating audio books, audio content, and other use cases that require audio generation. The GPT-4o audio completions model introduces the audio modality into the existing `/chat/completions` API. For more information on how to use GPT-4o audio completions, see the [audio generation quickstart](../audio-completions-quickstart.md).
25+
26+
> [!CAUTION]
27+
> We don't recommend using preview models in production. We will upgrade all deployments of preview models to either future preview versions or to the latest stable GA version. Models that are designated preview don't follow the standard Azure OpenAI model lifecycle.
28+
29+
To use GPT-4o audio, you need [an Azure OpenAI resource](../how-to/create-resource.md) in one of the [supported regions](#global-standard-model-availability).
30+
31+
When your resource is created, you can [deploy](../how-to/create-resource.md#deploy-a-model) the GPT-4o audio model.
32+
33+
Details about maximum request tokens and training data are available in the following table.
34+
35+
| Model ID | Description | Max Request (tokens) | Training Data (up to) |
36+
|---|---|---|---|
37+
|`gpt-4o-mini-audio-preview` (2024-12-17) <br> **GPT-4o audio** | **Audio model** for audio and text generation. |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
38+
|`gpt-4o-mini-realtime-preview` (2024-12-17) <br> **GPT-4o audio** | **Audio model** for real-time audio processing. |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
39+
|`gpt-4o-audio-preview` (2024-12-17) <br> **GPT-4o audio** | **Audio model** for audio and text generation. |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
40+
|`gpt-4o-realtime-preview` (2024-12-17) <br> **GPT-4o audio** | **Audio model** for real-time audio processing. |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
41+
|`gpt-4o-realtime-preview` (2024-10-01) <br> **GPT-4o audio** | **Audio model** for real-time audio processing. |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
42+
43+
To compare the availability of GPT-4o audio models across all regions, see the [models table](#global-standard-model-availability).
44+
45+
### Audio API
46+
47+
The audio models via the `/audio` API can be used for speech to text, translation, and text to speech.
48+
49+
#### Speech to text models
50+
51+
| Model ID | Description | Max Request (audio file size) |
52+
| ----- | ----- | ----- |
53+
| `whisper` | General-purpose speech recognition model. | 25 MB |
54+
| `gpt-4o-transcribe` | Speech to text powered by GPT-4o. | 25 MB|
55+
| `gpt-4o-mini-transcribe` | Speech to text powered by GPT-4o mini. | 25 MB|
56+
57+
You can also use the Whisper model via Azure AI Speech [batch transcription](../../speech-service/batch-transcription-create.md) API. Check out [What is the Whisper model?](../../speech-service/whisper-overview.md) to learn more about when to use Azure AI Speech vs. Azure OpenAI Service.
58+
59+
#### Speech translation models
60+
61+
| Model ID | Description | Max Request (audio file size) |
62+
| ----- | ----- | ----- |
63+
| `whisper` | General-purpose speech recognition model. | 25 MB |
64+
65+
#### Text to speech models (Preview)
66+
67+
| Model ID | Description |
68+
| --- | :--- |
69+
| `tts` | Text to speech optimized for speed. |
70+
| `tts-hd` | Text to speech optimized for quality.|
71+
| `gpt-4o-mini-tts` | Text to speech model powered by GPT-4o mini. |
72+
73+
You can also use the OpenAI text to speech voices via Azure AI Speech. To learn more, see [OpenAI text to speech voices via Azure OpenAI Service or via Azure AI Speech](../../speech-service/openai-voices.md#openai-text-to-speech-voices-via-azure-openai-service-or-via-azure-ai-speech) guide.
74+
75+
For more information see [Audio models region availability](?tabs=standard-audio#standard-models-by-endpoint) in this article.
76+
77+
78+
## Related content
79+

articles/ai-services/openai/concepts/models.md

Lines changed: 57 additions & 65 deletions
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,7 @@ titleSuffix: Azure OpenAI
44
description: Learn about the different model capabilities that are available with Azure OpenAI.
55
ms.service: azure-ai-openai
66
ms.topic: conceptual
7-
ms.date: 04/01/2025
7+
ms.date: 4/15/2025
88
ms.custom: references_regions, build-2023, build-2023-dataai, refefences_regions
99
manager: nitinme
1010
author: mrbullwinkle #ChrisHMSFT
@@ -22,12 +22,11 @@ Azure OpenAI Service is powered by a diverse set of models with different capabi
2222
| [GPT-4.5 Preview](#gpt-45-preview) |The latest GPT model that excels at diverse text and image tasks. |
2323
| [o-series models](#o-series-models) |[Reasoning models](../how-to/reasoning.md) with advanced problem-solving and increased focus and capability. |
2424
| [GPT-4o & GPT-4o mini & GPT-4 Turbo](#gpt-4o-and-gpt-4-turbo) | The latest most capable Azure OpenAI models with multimodal versions, which can accept both text and images as input. |
25-
| [GPT-4o audio](#gpt-4o-audio) | GPT-4o audio models that support either low-latency, "speech in, speech out" conversational interactions or audio generation. |
2625
| [GPT-4](#gpt-4) | A set of models that improve on GPT-3.5 and can understand and generate natural language and code. |
2726
| [GPT-3.5](#gpt-35) | A set of models that improve on GPT-3 and can understand and generate natural language and code. |
2827
| [Embeddings](#embeddings-models) | A set of models that can convert text into numerical vector form to facilitate text similarity. |
2928
| [DALL-E](#dall-e-models) | A series of models that can generate original images from natural language. |
30-
| [Audio](?tabs=standard-audio#standard-models-by-endpoint) | A series of models for speech to text, translation, and text to speech. |
29+
| [Audio](#audio-models) | A series of models for speech to text, translation, and text to speech. GPT-4o audio models support either low-latency, "speech in, speech out" conversational interactions or audio generation. |
3130

3231
## computer-use-preview
3332

@@ -98,40 +97,6 @@ To learn more about the advanced `o-series` models see, [getting started with re
9897
| `o1-preview` | See the [models table](#model-summary-table-and-region-availability). This model is only available for customers who were granted access as part of the original limited access |
9998
| `o1-mini` | See the [models table](#model-summary-table-and-region-availability). |
10099

101-
## GPT-4o audio
102-
103-
The GPT 4o audio models are part of the GPT-4o model family and support either low-latency, "speech in, speech out" conversational interactions or audio generation.
104-
- GPT-4o real-time audio is designed to handle real-time, low-latency conversational interactions, making it a great fit for support agents, assistants, translators, and other use cases that need highly responsive back-and-forth with a user. For more information on how to use GPT-4o real-time audio, see the [GPT-4o real-time audio quickstart](../realtime-audio-quickstart.md) and [how to use GPT-4o audio](../how-to/realtime-audio.md).
105-
- GPT-4o audio completion is designed to generate audio from audio or text prompts, making it a great fit for generating audio books, audio content, and other use cases that require audio generation. The GPT-4o audio completions model introduces the audio modality into the existing `/chat/completions` API. For more information on how to use GPT-4o audio completions, see the [audio generation quickstart](../audio-completions-quickstart.md).
106-
107-
> [!CAUTION]
108-
> We don't recommend using preview models in production. We will upgrade all deployments of preview models to either future preview versions or to the latest stable GA version. Models that are designated preview don't follow the standard Azure OpenAI model lifecycle.
109-
110-
To use GPT-4o audio, you need [an Azure OpenAI resource](../how-to/create-resource.md) in one of the [supported regions](#global-standard-model-availability).
111-
112-
When your resource is created, you can [deploy](../how-to/create-resource.md#deploy-a-model) the GPT-4o audio model.
113-
114-
Details about maximum request tokens and training data are available in the following table.
115-
116-
| Model ID | Description | Max Request (tokens) | Training Data (up to) |
117-
|---|---|---|---|
118-
|`gpt-4o-mini-audio-preview` (2024-12-17) <br> **GPT-4o audio** | **Audio model** for audio and text generation. |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
119-
|`gpt-4o-mini-realtime-preview` (2024-12-17) <br> **GPT-4o audio** | **Audio model** for real-time audio processing. |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
120-
|`gpt-4o-audio-preview` (2024-12-17) <br> **GPT-4o audio** | **Audio model** for audio and text generation. |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
121-
|`gpt-4o-realtime-preview` (2024-12-17) <br> **GPT-4o audio** | **Audio model** for real-time audio processing. |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
122-
|`gpt-4o-realtime-preview` (2024-10-01) <br> **GPT-4o audio** | **Audio model** for real-time audio processing. |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
123-
124-
### Region availability
125-
126-
| Model | Region |
127-
|---|---|
128-
|`gpt-4o-mini-audio-preview` | East US2 (Global Standard) |
129-
|`gpt-4o-mini-realtime-preview` | East US2 (Global Standard) <br> Sweden Central (Global Standard) |
130-
|`gpt-4o-audio-preview` | East US2 (Global Standard) <br> Sweden Central (Global Standard) |
131-
|`gpt-4o-realtime-preview` | East US2 (Global Standard) <br> Sweden Central (Global Standard) |
132-
133-
To compare the availability of GPT-4o audio models across all regions, see the [models table](#global-standard-model-availability).
134-
135100
## GPT-4o and GPT-4 Turbo
136101

137102
GPT-4o integrates text and images in a single model, enabling it to handle multiple data types simultaneously. This multimodal approach enhances accuracy and responsiveness in human-computer interactions. GPT-4o matches GPT-4 Turbo in English text and coding tasks while offering superior performance in non-English languages and vision tasks, setting new benchmarks for AI capabilities.
@@ -235,11 +200,64 @@ OpenAI's MTEB benchmark testing found that even when the third generation model'
235200

236201
The DALL-E models generate images from text prompts that the user provides. DALL-E 3 is generally available for use with the REST APIs. DALL-E 2 and DALL-E 3 with client SDKs are in preview.
237202

238-
## Audio
203+
## Audio models
204+
205+
### GPT-4o audio models
206+
207+
The GPT 4o audio models are part of the GPT-4o model family and support either low-latency, "speech in, speech out" conversational interactions or audio generation.
208+
- GPT-4o real-time audio is designed to handle real-time, low-latency conversational interactions, making it a great fit for support agents, assistants, translators, and other use cases that need highly responsive back-and-forth with a user. For more information on how to use GPT-4o real-time audio, see the [GPT-4o real-time audio quickstart](../realtime-audio-quickstart.md) and [how to use GPT-4o audio](../how-to/realtime-audio.md).
209+
- GPT-4o audio completion is designed to generate audio from audio or text prompts, making it a great fit for generating audio books, audio content, and other use cases that require audio generation. The GPT-4o audio completions model introduces the audio modality into the existing `/chat/completions` API. For more information on how to use GPT-4o audio completions, see the [audio generation quickstart](../audio-completions-quickstart.md).
210+
211+
> [!CAUTION]
212+
> We don't recommend using preview models in production. We will upgrade all deployments of preview models to either future preview versions or to the latest stable GA version. Models that are designated preview don't follow the standard Azure OpenAI model lifecycle.
213+
214+
To use GPT-4o audio, you need [an Azure OpenAI resource](../how-to/create-resource.md) in one of the [supported regions](#global-standard-model-availability).
215+
216+
When your resource is created, you can [deploy](../how-to/create-resource.md#deploy-a-model) the GPT-4o audio model.
217+
218+
Details about maximum request tokens and training data are available in the following table.
219+
220+
| Model ID | Description | Max Request (tokens) | Training Data (up to) |
221+
|---|---|---|---|
222+
|`gpt-4o-mini-audio-preview` (2024-12-17) <br> **GPT-4o audio** | **Audio model** for audio and text generation. |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
223+
|`gpt-4o-mini-realtime-preview` (2024-12-17) <br> **GPT-4o audio** | **Audio model** for real-time audio processing. |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
224+
|`gpt-4o-audio-preview` (2024-12-17) <br> **GPT-4o audio** | **Audio model** for audio and text generation. |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
225+
|`gpt-4o-realtime-preview` (2024-12-17) <br> **GPT-4o audio** | **Audio model** for real-time audio processing. |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
226+
|`gpt-4o-realtime-preview` (2024-10-01) <br> **GPT-4o audio** | **Audio model** for real-time audio processing. |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
227+
228+
To compare the availability of GPT-4o audio models across all regions, see the [models table](#global-standard-model-availability).
229+
230+
### Audio API
239231

240232
The audio models via the `/audio` API can be used for speech to text, translation, and text to speech.
241233

242-
For more information see [Audio models](?tabs=standard-audio#standard-models-by-endpoint) in this article.
234+
#### Speech to text models
235+
236+
| Model ID | Description | Max Request (audio file size) |
237+
| ----- | ----- | ----- |
238+
| `whisper` | General-purpose speech recognition model. | 25 MB |
239+
| `gpt-4o-transcribe` | Speech to text powered by GPT-4o. | 25 MB|
240+
| `gpt-4o-mini-transcribe` | Speech to text powered by GPT-4o mini. | 25 MB|
241+
242+
You can also use the Whisper model via Azure AI Speech [batch transcription](../../speech-service/batch-transcription-create.md) API. Check out [What is the Whisper model?](../../speech-service/whisper-overview.md) to learn more about when to use Azure AI Speech vs. Azure OpenAI Service.
243+
244+
#### Speech translation models
245+
246+
| Model ID | Description | Max Request (audio file size) |
247+
| ----- | ----- | ----- |
248+
| `whisper` | General-purpose speech recognition model. | 25 MB |
249+
250+
#### Text to speech models (Preview)
251+
252+
| Model ID | Description |
253+
| --- | :--- |
254+
| `tts` | Text to speech optimized for speed. |
255+
| `tts-hd` | Text to speech optimized for quality.|
256+
| `gpt-4o-mini-tts` | Text to speech model powered by GPT-4o mini. |
257+
258+
You can also use the OpenAI text to speech voices via Azure AI Speech. To learn more, see [OpenAI text to speech voices via Azure OpenAI Service or via Azure AI Speech](../../speech-service/openai-voices.md#openai-text-to-speech-voices-via-azure-openai-service-or-via-azure-ai-speech) guide.
259+
260+
For more information see [Audio models region availability](?tabs=standard-audio#standard-models-by-endpoint) in this article.
243261

244262
## Model summary table and region availability
245263

@@ -392,32 +410,6 @@ These models can only be used with Embedding API requests.
392410

393411
[!INCLUDE [Audio](../includes/model-matrix/standard-audio.md)]
394412

395-
### Speech to text models
396-
397-
| Model ID | Description | Max Request (audio file size) |
398-
| ----- | ----- | ----- |
399-
| `whisper` | General-purpose speech recognition model. | 25 MB |
400-
| `gpt-4o-transcribe` | Speech to text powered by GPT-4o. | 25 MB|
401-
| `gpt-4o-mini-transcribe` | Speech to text powered by GPT-4o mini. | 25 MB|
402-
403-
You can also use the Whisper model via Azure AI Speech [batch transcription](../../speech-service/batch-transcription-create.md) API. Check out [What is the Whisper model?](../../speech-service/whisper-overview.md) to learn more about when to use Azure AI Speech vs. Azure OpenAI Service.
404-
405-
### Speech translation models
406-
407-
| Model ID | Description | Max Request (audio file size) |
408-
| ----- | ----- | ----- |
409-
| `whisper` | General-purpose speech recognition model. | 25 MB |
410-
411-
### Text to speech models (Preview)
412-
413-
| Model ID | Description |
414-
| --- | :--- |
415-
| `tts` | Text to speech optimized for speed. |
416-
| `tts-hd` | Text to speech optimized for quality.|
417-
| `gpt-4o-mini-tts` | Text to speech model powered by GPT-4o mini. |
418-
419-
You can also use the OpenAI text to speech voices via Azure AI Speech. To learn more, see [OpenAI text to speech voices via Azure OpenAI Service or via Azure AI Speech](../../speech-service/openai-voices.md#openai-text-to-speech-voices-via-azure-openai-service-or-via-azure-ai-speech) guide.
420-
421413
# [Completions (Legacy)](#tab/standard-completions)
422414

423415
### Completions models

articles/ai-services/openai/includes/model-matrix/standard-audio.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -6,10 +6,10 @@ manager: nitinme
66
ms.service: azure-ai-openai
77
ms.topic: include
88
ms.custom: references_regions
9-
ms.date: 10/25/2024
9+
ms.date: 4/15/2025
1010
---
1111

12-
| **Region** | **tts**, **001** | **tts-hd**, **001** | **whisper**, **001** | **gpt-4o-mini-tts**, **001** | **gpt-4o-transcribe**, **001** | **gpt-4o-mini-transcribe **, **001** |
12+
| **Region** | **tts**, **001** | **tts-hd**, **001** | **whisper**, **001** | **gpt-4o-mini-tts**, **001** | **gpt-4o-transcribe**, **001** | **gpt-4o-mini-transcribe**, **001** |
1313
|:-----------------|:----------------:|:-------------------:|:--------------------:|:--------------------:|:--------------------:|:--------------------:|
1414
| eastus2 | - | - || - |||
1515
| northcentralus |||||||

articles/ai-services/openai/toc.yml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -54,6 +54,8 @@ items:
5454
href: ./concepts/assistants.md
5555
- name: Abuse monitoring
5656
href: ./concepts/abuse-monitoring.md
57+
- name: Audio
58+
href: ./concepts/audio.md
5759
- name: Content filtering
5860
href: ./concepts/content-filter.md
5961
- name: Default safety policies

0 commit comments

Comments
 (0)