articles/ai-foundry/model-inference/concepts/content-filter.md (1 addition & 1 deletion)
@@ -14,7 +14,7 @@ manager: nitinme
# Content filtering for model inference in Azure AI services
> [!IMPORTANT]
-> The content filtering system isn't applied to prompts and completions processed by the Whisper model in Azure OpenAI. Learn more about the [Whisper model in Azure OpenAI](../../../ai-services/openai/concepts/models.md#whisper).
+> The content filtering system isn't applied to prompts and completions processed by audio models such as Whisper in Azure OpenAI Service. Learn more about the [audio models in Azure OpenAI](../../../ai-services/openai/concepts/models.md?tabs=standard-audio#standard-models-by-endpoint).
Azure AI model inference in Azure AI Services includes a content filtering system that works alongside core models and is powered by [Azure AI Content Safety](https://azure.microsoft.com/products/cognitive-services/ai-content-safety). This system runs both the prompt and completion through an ensemble of classification models designed to detect and prevent the output of harmful content, and it takes action on specific categories of potentially harmful content in both input prompts and output completions. Variations in API configurations and application design might affect completions and thus filtering behavior.
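For context, the sketch below shows how an application might read the filter's verdict from a chat completions response. It's a minimal illustration, assuming the `openai` Python SDK (v1.x), placeholder `AZURE_OPENAI_*` environment variables, and an API version that supports these annotations; `content_filter_results` and the `content_filter` finish reason are the Azure-specific response fields.

```python
import os

from openai import AzureOpenAI

# Assumptions: placeholder environment variables and deployment name;
# verify the api_version against the versions your resource supports.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-10-21",
)

response = client.chat.completions.create(
    model=os.environ["AZURE_OPENAI_DEPLOYMENT"],  # your deployment name
    messages=[{"role": "user", "content": "Tell me about content filtering."}],
)

choice = response.choices[0]
# A completion cut off by the filter reports finish_reason == "content_filter".
if choice.finish_reason == "content_filter":
    print("The completion was filtered.")

# Azure returns per-category annotations as extra fields on each choice.
for category, result in choice.model_dump().get("content_filter_results", {}).items():
    print(category, "->", result)
```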
+description: Learn about the audio capabilities of Azure OpenAI Service.
+author: eric-urban
+ms.author: eur
+ms.service: azure-ai-openai
+ms.topic: conceptual
+ms.date: 4/15/2025
+ms.custom: template-concept
+manager: nitinme
+---
+
+# Audio capabilities in Azure OpenAI Service
+
+> [!IMPORTANT]
+> The content filtering system isn't applied to prompts and completions processed by audio models such as Whisper in Azure OpenAI Service.
+
+Audio models in Azure OpenAI are available via the `realtime`, `completions`, and `audio` APIs. The audio models are designed to handle a variety of tasks, including speech recognition, translation, and text to speech.
+
+For information about audio model availability per region in Azure OpenAI Service, see the [standard models by endpoint](models.md?tabs=standard-audio#standard-models-by-endpoint) and [global standard model availability](models.md?tabs=standard-audio#global-standard-model-availability) documentation.
+
+## GPT-4o audio Realtime API
+
+GPT-4o real-time audio is designed to handle real-time, low-latency conversational interactions, making it a great fit for support agents, assistants, translators, and other use cases that need highly responsive back-and-forth with a user. For more information on how to use GPT-4o real-time audio, see the [GPT-4o real-time audio quickstart](../realtime-audio-quickstart.md) and [how to use GPT-4o audio](../how-to/realtime-audio.md).
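For illustration, here's a minimal sketch of opening a `/realtime` WebSocket session, assuming the third-party `websockets` package, a `gpt-4o-realtime-preview` deployment, and placeholder environment variables. The URL shape and `api-key` header follow this article's how-to guidance, but treat the exact `api-version` value as an assumption to verify against current documentation.

```python
import asyncio
import json
import os

import websockets  # third-party package; `pip install websockets`

async def main() -> None:
    # Assumptions: AZURE_OPENAI_RESOURCE is the resource name and
    # AZURE_OPENAI_DEPLOYMENT is a gpt-4o-realtime-preview deployment.
    url = (
        f"wss://{os.environ['AZURE_OPENAI_RESOURCE']}.openai.azure.com/openai/realtime"
        "?api-version=2024-10-01-preview"
        f"&deployment={os.environ['AZURE_OPENAI_DEPLOYMENT']}"
    )
    headers = {"api-key": os.environ["AZURE_OPENAI_API_KEY"]}

    # `additional_headers` is the keyword in websockets >= 14;
    # older releases call it `extra_headers`.
    async with websockets.connect(url, additional_headers=headers) as ws:
        # Configure the session, then read server events as they arrive.
        await ws.send(json.dumps({"type": "session.update", "session": {"voice": "alloy"}}))
        async for message in ws:
            print(json.loads(message)["type"])

asyncio.run(main())
```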
+
+## GPT-4o audio completions
+
+GPT-4o audio completions is designed to generate audio from audio or text prompts, making it a great fit for audio books, audio content creation, and other use cases that require audio generation. The GPT-4o audio completions model introduces the audio modality into the existing `/chat/completions` API. For more information on how to use GPT-4o audio completions, see the [audio generation quickstart](../audio-completions-quickstart.md).
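For illustration, a minimal sketch of audio generation through `/chat/completions`, assuming the `openai` Python SDK, a deployment named after the `gpt-4o-audio-preview` model, and an API version recent enough to accept the `modalities` and `audio` parameters.

```python
import base64
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2025-01-01-preview",  # assumption: verify against current docs
)

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",  # your deployment name
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[{"role": "user", "content": "Read this sentence aloud, please."}],
)

# The generated audio arrives base64-encoded on the assistant message.
wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open("reply.wav", "wb") as f:
    f.write(wav_bytes)
```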
+
+## Audio API
+
+The audio models via the `/audio` API can be used for speech to text, translation, and text to speech. To get started with the audio API, see the [Whisper quickstart](../whisper-quickstart.md) for speech to text.
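For illustration, a minimal speech to text sketch against the `/audio` API, assuming the `openai` Python SDK, a Whisper deployment named `whisper`, and a local `sample.wav` file.

```python
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-10-21",
)

# Transcribe a local audio file with a Whisper deployment.
with open("sample.wav", "rb") as audio_file:
    result = client.audio.transcriptions.create(model="whisper", file=audio_file)

print(result.text)
```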
+
+> [!NOTE]
+> To help you decide whether to use Azure AI Speech or Azure OpenAI Service, see the [Azure AI Speech batch transcription](../../speech-service/batch-transcription-create.md), [What is the Whisper model?](../../speech-service/whisper-overview.md), and [OpenAI text to speech voices](../../speech-service/openai-voices.md#openai-text-to-speech-voices-via-azure-openai-service-or-via-azure-ai-speech) guides.
articles/ai-services/openai/concepts/content-filter.md (1 addition & 1 deletion)
@@ -14,7 +14,7 @@ manager: nitinme
# Content filtering
> [!IMPORTANT]
-> The content filtering system isn't applied to prompts and completions processed by the Whisper model in Azure OpenAI Service. Learn more about the [Whisper model in Azure OpenAI](models.md#whisper).
+> The content filtering system isn't applied to prompts and completions processed by audio models such as Whisper in Azure OpenAI Service. Learn more about the [audio models in Azure OpenAI](models.md?tabs=standard-audio#standard-models-by-endpoint).
Azure OpenAI Service includes a content filtering system that works alongside core models, including DALL-E image generation models. This system works by running both the prompt and completion through an ensemble of classification models designed to detect and prevent the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. Variations in API configurations and application design might affect completions and thus filtering behavior.
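As a companion illustration, the sketch below handles a prompt that the filter blocks outright: Azure OpenAI rejects such a request with HTTP 400 and error code `content_filter`. It assumes the `openai` Python SDK (v1.x) and a hypothetical deployment name.

```python
import os

import openai
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-10-21",
)

try:
    client.chat.completions.create(
        model="my-gpt-4o-deployment",  # hypothetical deployment name
        messages=[{"role": "user", "content": "some user input"}],
    )
except openai.BadRequestError as err:
    # A filtered prompt surfaces as a 400 whose error code is "content_filter".
    if err.code == "content_filter":
        print("The prompt was blocked by the content filtering system.")
    else:
        raise
```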
articles/ai-services/openai/concepts/models.md
@@ -23,13 +23,11 @@ Azure OpenAI Service is powered by a diverse set of models with different capabi
|[GPT-4.5 Preview](#gpt-45-preview)|The latest GPT model that excels at diverse text and image tasks. |
|[o-series models](#o-series-models)|[Reasoning models](../how-to/reasoning.md) with advanced problem-solving and increased focus and capability. |
|[GPT-4o & GPT-4o mini & GPT-4 Turbo](#gpt-4o-and-gpt-4-turbo)| The latest most capable Azure OpenAI models with multimodal versions, which can accept both text and images as input. |
-|[GPT-4o audio](#gpt-4o-audio)| GPT-4o audio models that support either low-latency, "speech in, speech out" conversational interactions or audio generation. |
|[GPT-4](#gpt-4)| A set of models that improve on GPT-3.5 and can understand and generate natural language and code. |
|[GPT-3.5](#gpt-35)| A set of models that improve on GPT-3 and can understand and generate natural language and code. |
|[Embeddings](#embeddings-models)| A set of models that can convert text into numerical vector form to facilitate text similarity. |
|[DALL-E](#dall-e-models)| A series of models that can generate original images from natural language. |
-|[Whisper](#whisper-models)| A series of models in preview that can transcribe and translate speech to text. |
-|[Text to speech](#text-to-speech-models-preview) (Preview) | A series of models in preview that can synthesize text to speech. |
+|[Audio](#audio-models)| A series of models for speech to text, translation, and text to speech. GPT-4o audio models support either low-latency, "speech in, speech out" conversational interactions or audio generation. |
## GPT-4.1 series
@@ -119,40 +117,6 @@ To learn more about the advanced `o-series` models see, [getting started with re
|`o1-preview`| See the [models table](#model-summary-table-and-region-availability). This model is only available for customers who were granted access as part of the original limited access. |
|`o1-mini`| See the [models table](#model-summary-table-and-region-availability). |
-## GPT-4o audio
-
-The GPT 4o audio models are part of the GPT-4o model family and support either low-latency, "speech in, speech out" conversational interactions or audio generation.
-- GPT-4o real-time audio is designed to handle real-time, low-latency conversational interactions, making it a great fit for support agents, assistants, translators, and other use cases that need highly responsive back-and-forth with a user. For more information on how to use GPT-4o real-time audio, see the [GPT-4o real-time audio quickstart](../realtime-audio-quickstart.md) and [how to use GPT-4o audio](../how-to/realtime-audio.md).
-- GPT-4o audio completion is designed to generate audio from audio or text prompts, making it a great fit for generating audio books, audio content, and other use cases that require audio generation. The GPT-4o audio completions model introduces the audio modality into the existing `/chat/completions` API. For more information on how to use GPT-4o audio completions, see the [audio generation quickstart](../audio-completions-quickstart.md).
-
-> [!CAUTION]
-> We don't recommend using preview models in production. We will upgrade all deployments of preview models to either future preview versions or to the latest stable GA version. Models that are designated preview don't follow the standard Azure OpenAI model lifecycle.
-
-To use GPT-4o audio, you need [an Azure OpenAI resource](../how-to/create-resource.md) in one of the [supported regions](#global-standard-model-availability).
-
-When your resource is created, you can [deploy](../how-to/create-resource.md#deploy-a-model) the GPT-4o audio model.
-
-Details about maximum request tokens and training data are available in the following table.
-
-| Model ID | Description | Max Request (tokens) | Training Data (up to) |
-|---|---|---|---|
-|`gpt-4o-mini-audio-preview` (2024-12-17) <br> **GPT-4o audio**|**Audio model** for audio and text generation. |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
-|`gpt-4o-mini-realtime-preview` (2024-12-17) <br> **GPT-4o audio**|**Audio model** for real-time audio processing. |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
-|`gpt-4o-audio-preview` (2024-12-17) <br> **GPT-4o audio**|**Audio model** for audio and text generation. |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
-|`gpt-4o-realtime-preview` (2024-12-17) <br> **GPT-4o audio**|**Audio model** for real-time audio processing. |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
-|`gpt-4o-realtime-preview` (2024-10-01) <br> **GPT-4o audio**|**Audio model** for real-time audio processing. |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
-
-### Region availability
-
-| Model | Region |
-|---|---|
-|`gpt-4o-mini-audio-preview`| East US2 (Global Standard) |
-|`gpt-4o-mini-realtime-preview`| East US2 (Global Standard) <br> Sweden Central (Global Standard) |
-|`gpt-4o-audio-preview`| East US2 (Global Standard) <br> Sweden Central (Global Standard) |
-|`gpt-4o-realtime-preview`| East US2 (Global Standard) <br> Sweden Central (Global Standard) |
-
-To compare the availability of GPT-4o audio models across all regions, see the [models table](#global-standard-model-availability).
-
## GPT-4o and GPT-4 Turbo
GPT-4o integrates text and images in a single model, enabling it to handle multiple data types simultaneously. This multimodal approach enhances accuracy and responsiveness in human-computer interactions. GPT-4o matches GPT-4 Turbo in English text and coding tasks while offering superior performance in non-English languages and vision tasks, setting new benchmarks for AI capabilities.
@@ -256,17 +220,56 @@ OpenAI's MTEB benchmark testing found that even when the third generation model'
The DALL-E models generate images from text prompts that the user provides. DALL-E 3 is generally available for use with the REST APIs. DALL-E 2 and DALL-E 3 with client SDKs are in preview.
-## Whisper
+## Audio models

-The Whisper models can be used for speech to text.
+Audio models in Azure OpenAI are available via the `realtime`, `completions`, and `audio` APIs.

-You can also use the Whisper model via Azure AI Speech [batch transcription](../../speech-service/batch-transcription-create.md) API. Check out [What is the Whisper model?](../../speech-service/whisper-overview.md) to learn more about when to use Azure AI Speech vs. Azure OpenAI Service.
+### GPT-4o audio models

-## Text to speech (Preview)
+The GPT-4o audio models are part of the GPT-4o model family and support either low-latency, "speech in, speech out" conversational interactions or audio generation.

-The OpenAI text to speech models, currently in preview, can be used to synthesize text to speech.
+> [!CAUTION]
+> We don't recommend using preview models in production. We will upgrade all deployments of preview models to either future preview versions or to the latest stable GA version. Models that are designated preview don't follow the standard Azure OpenAI model lifecycle.

-You can also use the OpenAI text to speech voices via Azure AI Speech. To learn more, see [OpenAI text to speech voices via Azure OpenAI Service or via Azure AI Speech](../../speech-service/openai-voices.md#openai-text-to-speech-voices-via-azure-openai-service-or-via-azure-ai-speech) guide.
+Details about maximum request tokens and training data are available in the following table.
+
+| Model ID | Description | Max Request (tokens) | Training Data (up to) |
+|---|---|---|---|
+|`gpt-4o-mini-audio-preview` (2024-12-17) <br> **GPT-4o audio**|**Audio model** for audio and text generation. |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
+|`gpt-4o-mini-realtime-preview` (2024-12-17) <br> **GPT-4o audio**|**Audio model** for real-time audio processing. |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
+|`gpt-4o-audio-preview` (2024-12-17) <br> **GPT-4o audio**|**Audio model** for audio and text generation. |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
+|`gpt-4o-realtime-preview` (2024-12-17) <br> **GPT-4o audio**|**Audio model** for real-time audio processing. |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
+|`gpt-4o-realtime-preview` (2024-10-01) <br> **GPT-4o audio**|**Audio model** for real-time audio processing. |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
+
+To compare the availability of GPT-4o audio models across all regions, see the [models table](#global-standard-model-availability).
+
+### Audio API
+
+The audio models via the `/audio` API can be used for speech to text, translation, and text to speech.
+
+#### Speech to text models
+
+| Model ID | Description | Max Request (audio file size) |
articles/ai-services/openai/how-to/realtime-audio.md (2 additions & 2 deletions)
@@ -27,7 +27,7 @@ The GPT 4o real-time models are available for global deployments in [East US 2 a
- `gpt-4o-realtime-preview` (2024-12-17)
- `gpt-4o-realtime-preview` (2024-10-01)
-See the [models and versions documentation](../concepts/models.md#gpt-4o-audio) for more information.
+See the [models and versions documentation](../concepts/models.md#audio-models) for more information.
## Get started
@@ -116,7 +116,7 @@ Events can be sent and received in parallel and applications should generally ha
Often, the first event sent by the caller on a newly established `/realtime` session is a [`session.update`](../realtime-audio-reference.md#realtimeclienteventsessionupdate) payload. This event controls a wide set of input and output behavior, with output and response generation properties then later overridable using the [`response.create`](../realtime-audio-reference.md#realtimeclienteventresponsecreate) event.
The [`session.update`](../realtime-audio-reference.md#realtimeclienteventsessionupdate) event can be used to configure the following aspects of the session (a sample payload sketch follows this list):
-- Transcription of user input audio is opted into via the session's `input_audio_transcription` property. Specifying a transcription model (`whisper-1`) in this configuration enables the delivery of [`conversation.item.audio_transcription.completed`](../realtime-audio-reference.md#realtimeservereventconversationiteminputaudiotranscriptioncompleted) events.
+- Transcription of user input audio is opted into via the session's `input_audio_transcription` property. Specifying a transcription model (such as `whisper-1`) in this configuration enables the delivery of [`conversation.item.audio_transcription.completed`](../realtime-audio-reference.md#realtimeservereventconversationiteminputaudiotranscriptioncompleted) events.
- Turn handling is controlled by the `turn_detection` property. This property's type can be set to `none` or `server_vad` as described in the [voice activity detection (VAD) and the audio buffer](#voice-activity-detection-vad-and-the-audio-buffer) section.
- Tools can be configured to enable the server to call out to external services or functions to enrich the conversation. Tools are defined as part of the `tools` property in the session configuration.
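For illustration, a sketch of a `session.update` payload that exercises all three aspects above; the property shapes follow the realtime reference linked in this article, while the values and the `get_weather` tool are hypothetical.

```python
import json

session_update = {
    "type": "session.update",
    "session": {
        # Opt in to input audio transcription with a transcription model.
        "input_audio_transcription": {"model": "whisper-1"},
        # Let the server detect turn boundaries with voice activity detection.
        "turn_detection": {"type": "server_vad"},
        # Expose a (hypothetical) function the model may call mid-conversation.
        "tools": [
            {
                "type": "function",
                "name": "get_weather",
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            }
        ],
    },
}

# Send it as the first client event on an established /realtime connection:
# await ws.send(json.dumps(session_update))
print(json.dumps(session_update, indent=2))
```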