**articles/ai-foundry/openai/includes/api-versions/latest-inference-preview.md** (+2 −2)
@@ -4553,7 +4553,7 @@ It responds with a session object, plus a `client_secret` key which contains a u
| └─ type | enum | Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones.<br><br>Possible values: `near_field`, `far_field`| No ||
| input_audio_transcription | object | Configuration for input audio transcription. Defaults to off and can be set to `null` to turn off once on. Input audio transcription isn't native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the Transcriptions endpoint](/azure/ai-foundry/openai/reference-preview#transcriptions---create) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.<br> | No ||
| └─ language | string | The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format improves accuracy and latency.<br> | No ||
- | └─ model | string | The model to use for transcription, current options are `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, and `whisper-1`.<br> | No ||
+ | └─ model | string | The model to use for transcription, current options are `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, `gpt-4o-mini-transcribe`, and `whisper-1`.<br> | No ||
| └─ prompt | string | An optional text to guide the model's style or continue a previous audio segment.<br>For `whisper-1`, the prompt is a list of keywords.<br>For `gpt-4o-transcribe` models, the prompt is a free text string, for example "expect words related to technology".<br> | No ||
| instructions | string | The default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions aren't guaranteed to be followed by the model, but they provide guidance on the desired behavior.<br><br>Note that the server sets default instructions, which are used if this field isn't set and are visible in the `session.created` event at the start of the session.<br> | No ||
| max_response_output_tokens | integer or string | Maximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or `inf` for the maximum available tokens for a given model. Defaults to `inf`.<br> | No ||
@@ … @@
| └─ type | enum | Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones.<br><br>Possible values: `near_field`, `far_field`| No ||
| input_audio_transcription | object | Configuration for input audio transcription. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.<br> | No ||
| └─ language | string | The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format improves accuracy and latency.<br> | No ||
- | └─ model | enum | The model to use for transcription, current options are `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, and `whisper-1`.<br><br>Possible values: `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, `whisper-1`| No ||
+ | └─ model | enum | The model to use for transcription, current options are `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, `gpt-4o-mini-transcribe`, and `whisper-1`.<br><br>Possible values: `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, `gpt-4o-mini-transcribe`, `whisper-1`| No ||
| └─ prompt | string | An optional text to guide the model's style or continue a previous audio segment.<br>For `whisper-1`, the prompt is a list of keywords.<br>For `gpt-4o-transcribe` models, the prompt is a free text string, for example "expect words related to technology".<br> | No ||
| modalities || The set of modalities the model can respond with. To disable audio, set this to ["text"].<br> | No ||
| turn_detection | object | Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to `null` to turn off, in which case the client must manually trigger a model response.<br>Server VAD means that the model detects the start and end of speech based on audio volume and responds at the end of user speech.<br>Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with `uhhm`, the model scores a low probability of turn end and waits longer for the user to continue speaking. This can be useful for more natural conversations, but may have higher latency.<br> | No ||
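To make the session fields above concrete, the following is a minimal sketch of a `session.update` event payload, assuming it's serialized to JSON and sent over an established Realtime WebSocket connection; the values are illustrative, and connection and auth handling are omitted.

```python
# Minimal sketch of a Realtime `session.update` payload built from the
# fields documented above. Assumption: the event is sent as JSON over the
# Realtime WebSocket; all values here are illustrative, not defaults.
import json

session_update = {
    "type": "session.update",
    "session": {
        "input_audio_transcription": {
            "model": "gpt-4o-transcribe-diarize",  # also: gpt-4o-transcribe, gpt-4o-mini-transcribe, whisper-1
            "language": "en",  # ISO-639-1 hint improves accuracy and latency
            "prompt": "expect words related to technology",  # free text for gpt-4o models
        },
        "turn_detection": {"type": "server_vad"},  # or semantic VAD; null disables turn detection
        "instructions": "Be extremely succinct.",
        "max_response_output_tokens": "inf",
    },
}

print(json.dumps(session_update, indent=2))
```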
**articles/ai-foundry/openai/includes/api-versions/new-inference-preview.md** (+4 −4)
@@ -127,7 +127,7 @@ Transcribes audio into the input language.
| └─ type | enum | Must be set to `server_vad` to enable manual chunking using server side VAD.<br>Possible values: `server_vad` | No | |
| file | string | | Yes | |
| filename | string | The optional filename or descriptive identifier to associate with the audio data. | No | |
- | include[] | array | Additional information to include in the transcription response.<br>`logprobs` will return the log probabilities of the tokens in the<br>response to understand the model's confidence in the transcription.<br>`logprobs` only works with response_format set to `json` and only with<br>the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`. | No | |
+ | include[] | array | Additional information to include in the transcription response.<br>`logprobs` will return the log probabilities of the tokens in the<br>response to understand the model's confidence in the transcription.<br>`logprobs` only works with response_format set to `json` and only with<br>the models `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-4o-mini-transcribe`. | No | |
| language | string | The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency. | No | |
| model | string | The model to use for this transcription request. | No | |
| prompt | string | An optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language. | No | |
@@ -4225,7 +4225,7 @@ The configuration information for an audio transcription request.
| └─ type | enum | Must be set to `server_vad` to enable manual chunking using server side VAD.<br>Possible values: `server_vad` | No | |
| file | string | | Yes | |
| filename | string | The optional filename or descriptive identifier to associate with the audio data. | No | |
- | include[] | array | Additional information to include in the transcription response.<br>`logprobs` will return the log probabilities of the tokens in the<br>response to understand the model's confidence in the transcription.<br>`logprobs` only works with response_format set to `json` and only with<br>the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`. | No | |
+ | include[] | array | Additional information to include in the transcription response.<br>`logprobs` will return the log probabilities of the tokens in the<br>response to understand the model's confidence in the transcription.<br>`logprobs` only works with response_format set to `json` and only with<br>the models `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-4o-mini-transcribe`. | No | |
| language | string | The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency. | No | |
| model | string | The model to use for this transcription request. | No | |
| prompt | string | An optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language. | No | |
@@ -4684,11 +4684,11 @@ A citation for a web resource used to generate a model response.
### OpenAI.AudioResponseFormat

- The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, or `vtt`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`.
+ The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, or `vtt`. For `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-4o-mini-transcribe`, the only supported format is `json`.

| Property | Value |
|----------|-------|
- | **Description** | The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, or `vtt`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`. |
+ | **Description** | The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, or `vtt`. For `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-4o-mini-transcribe`, the only supported format is `json`. |
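To show how these transcription request parameters fit together, here's a hedged sketch of a transcriptions call. The endpoint shape and `api-version` value are assumptions, and `YOUR-RESOURCE`, `YOUR-DEPLOYMENT`, the key, and the audio file name are placeholders, not documented values.

```python
# Sketch only: endpoint shape and api-version are assumptions; substitute
# your own resource, deployment, and key. Demonstrates `include[]=logprobs`,
# which requires response_format "json" and a gpt-4o transcribe model
# (the deployment in the URL selects the model on Azure).
import requests

url = (
    "https://YOUR-RESOURCE.openai.azure.com/openai/deployments/"
    "YOUR-DEPLOYMENT/audio/transcriptions?api-version=2025-04-01-preview"
)

with open("meeting.wav", "rb") as audio:
    response = requests.post(
        url,
        headers={"api-key": "YOUR-API-KEY"},
        files={"file": audio},
        data={
            "language": "en",           # ISO-639-1 hint
            "response_format": "json",  # the only format the gpt-4o transcribe models support
            "include[]": "logprobs",    # token log probabilities in the response
        },
    )

print(response.json())
```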
**articles/ai-foundry/openai/realtime-audio-reference.md** (+3 −1)
@@ -1198,14 +1198,16 @@ The server `session.updated` event is returned when a session is updated by the
* `whisper-1`
* `gpt-4o-transcribe`
* `gpt-4o-mini-transcribe`
+ * `gpt-4o-transcribe-diarize`
+

### RealtimeAudioInputTranscriptionSettings

| Field | Type | Description |
|-------|------|-------------|
| language | string | The language of the input audio. Supplying the input language in ISO-639-1 format (such as `en`) will improve accuracy and latency. |
| model |[RealtimeAudioInputTranscriptionModel](#realtimeaudioinputtranscriptionmodel)| The model for audio input transcription. For example, `whisper-1`. |
- | prompt | string | The prompt for the audio input transcription. Optional text to guide the model's style or continue a previous audio segment. For the `whisper-1` model, the prompt is a list of keywords. For the `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` models, the prompt is a free text string such as "expect words related to technology."|
+ | prompt | string | The prompt for the audio input transcription. Optional text to guide the model's style or continue a previous audio segment. For the `whisper-1` model, the prompt is a list of keywords. For the `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-4o-mini-transcribe` models, the prompt is a free text string such as "expect words related to technology."|
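As a small illustration of these settings, assuming plain JSON-shaped objects matching the table above, note that the prompt convention differs by model:

```python
# Illustrative values only; field names follow the table above. whisper-1
# takes a keyword-list prompt, while the gpt-4o models take free text.
whisper_settings = {
    "model": "whisper-1",
    "language": "en",
    "prompt": "Azure, OpenAI, diarization",  # keywords for whisper-1
}

gpt4o_settings = {
    "model": "gpt-4o-transcribe-diarize",
    "language": "en",
    "prompt": "expect words related to technology",  # free text for gpt-4o models
}
```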
**articles/ai-foundry/openai/whats-new.md** (+10 −2)
@@ -19,7 +19,15 @@ ms.custom:
This article provides a summary of the latest releases and major documentation updates for Azure OpenAI.

- ## October 2025
+ ## October 2025
+
+ ### GPT-4o audio model released
+
+ - The `gpt-4o-transcribe-diarize` speech to text model is released. This is an Automatic Speech Recognition (ASR) model that converts spoken language into text in real time. It enables organizations to unlock insights from conversations instantly, with ultra-low latency and high accuracy across 100+ languages. This capability is essential for workflows where voice data drives decisions, such as customer support, virtual meetings, and live events.
+
+ Diarization is the process of identifying who spoke when in an audio stream. It transforms conversations into speaker-attributed transcripts, enabling businesses to extract actionable insights from meetings, customer calls, and live events. With advanced models like `gpt-4o-transcribe-diarize`, organizations gain real-time clarity and context, turning voice into structured data that drives smarter decisions and improves productivity.
+
+ Use this model via the `/audio` and `/realtime` APIs.
### GPT-image-1-mini
@@ -36,7 +44,7 @@ Personally identifiable information (PII) detection is now available as a built-
## September 2025

- ## GPT-5-codex is now available
+ ### GPT-5-codex is now available

- To learn more about `gpt-5-codex`, see the [getting started with reasoning models page](./how-to/reasoning.md).
- `gpt-5-codex` is specifically designed to be used with the [Codex CLI and the Visual Studio Code Codex extension](./how-to/codex.md).