Commit 998cd58
Merge pull request #7696 from PatrickFarley/aoai-updates
add diarize model
2 parents 1210313 + 584a6b3 commit 998cd58

6 files changed: +22 -9 lines changed

articles/ai-foundry/openai/includes/api-versions/latest-inference-preview.md

Lines changed: 2 additions & 2 deletions
@@ -4553,7 +4553,7 @@ It responds with a session object, plus a `client_secret` key which contains a u
| └─ type | enum | Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones.<br><br>Possible values: `near_field`, `far_field` | No | |
| input_audio_transcription | object | Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription isn't native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the Transcriptions endpoint](/azure/ai-foundry/openai/reference-preview#transcriptions---create) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.<br> | No | |
| └─ language | string | The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format improves accuracy and latency.<br> | No | |
- | └─ model | string | The model to use for transcription, current options are `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, and `whisper-1`.<br> | No | |
+ | └─ model | string | The model to use for transcription, current options are `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, `gpt-4o-mini-transcribe`, and `whisper-1`.<br> | No | |
| └─ prompt | string | An optional text to guide the model's style or continue a previous audio segment.<br>For `whisper-1`, the prompt is a list of keywords.<br>For `gpt-4o-transcribe` models, the prompt is a free text string, for example "expect words related to technology".<br> | No | |
| instructions | string | The default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance on the desired behavior.<br><br>Note that the server sets default instructions which are used if this field isn't set; they're visible in the `session.created` event at the start of the session.<br> | No | |
| max_response_output_tokens | integer or string | Maximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or `inf` for the maximum available tokens for a given model. Defaults to `inf`.<br> | No | |
@@ -8885,7 +8885,7 @@ Realtime transcription session object configuration.
| └─ type | enum | Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones.<br><br>Possible values: `near_field`, `far_field` | No | |
| input_audio_transcription | object | Configuration for input audio transcription. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.<br> | No | |
| └─ language | string | The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format improves accuracy and latency.<br> | No | |
- | └─ model | enum | The model to use for transcription, current options are `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, and `whisper-1`.<br><br>Possible values: `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, `whisper-1` | No | |
+ | └─ model | enum | The model to use for transcription, current options are `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, `gpt-4o-mini-transcribe`, and `whisper-1`.<br><br>Possible values: `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, `gpt-4o-mini-transcribe`, `whisper-1` | No | |
| └─ prompt | string | An optional text to guide the model's style or continue a previous audio segment.<br>For `whisper-1`, the prompt is a list of keywords.<br>For `gpt-4o-transcribe` models, the prompt is a free text string, for example "expect words related to technology".<br> | No | |
| modalities | | The set of modalities the model can respond with. To disable audio, set this to ["text"].<br> | No | |
| turn_detection | object | Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to `null` to turn off, in which case the client must manually trigger model response.<br>Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.<br>Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with `uhhm`, the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.<br> | No | |
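These transcription settings plug into the realtime session configuration. As a minimal sketch (assuming an already-open realtime WebSocket connection, which is omitted, and assuming `input_audio_noise_reduction` as the parent field for the noise-reduction `type` documented above), a `session.update` event enabling the new model could look like:

```python
import json

# Sketch of a `session.update` event that turns on input audio transcription
# with the newly added diarize model. The realtime WebSocket connection that
# would carry this event is assumed and omitted.
session_update = {
    "type": "session.update",
    "session": {
        "input_audio_transcription": {
            "model": "gpt-4o-transcribe-diarize",
            "language": "en",  # ISO-639-1 code improves accuracy and latency
            "prompt": "expect words related to technology",
        },
        # Parent field name is an assumption; the table above documents only its `type` values.
        "input_audio_noise_reduction": {"type": "near_field"},
    },
}

payload = json.dumps(session_update)  # serialize before sending on the socket
print(payload)
```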

articles/ai-foundry/openai/includes/api-versions/new-inference-preview.md

Lines changed: 4 additions & 4 deletions
@@ -127,7 +127,7 @@ Transcribes audio into the input language.
| └─ type | enum | Must be set to `server_vad` to enable manual chunking using server side VAD.<br>Possible values: `server_vad` | No | |
| file | string | | Yes | |
| filename | string | The optional filename or descriptive identifier to associate with the audio data. | No | |
- | include[] | array | Additional information to include in the transcription response.<br>`logprobs` will return the log probabilities of the tokens in the response to understand the model's confidence in the transcription.<br>`logprobs` only works with response_format set to `json` and only with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`. | No | |
+ | include[] | array | Additional information to include in the transcription response.<br>`logprobs` will return the log probabilities of the tokens in the response to understand the model's confidence in the transcription.<br>`logprobs` only works with response_format set to `json` and only with the models `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-4o-mini-transcribe`. | No | |
| language | string | The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency. | No | |
| model | string | The model to use for this transcription request. | No | |
| prompt | string | An optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language. | No | |
@@ -4225,7 +4225,7 @@ The configuration information for an audio transcription request.
| └─ type | enum | Must be set to `server_vad` to enable manual chunking using server side VAD.<br>Possible values: `server_vad` | No | |
| file | string | | Yes | |
| filename | string | The optional filename or descriptive identifier to associate with the audio data. | No | |
- | include[] | array | Additional information to include in the transcription response.<br>`logprobs` will return the log probabilities of the tokens in the response to understand the model's confidence in the transcription.<br>`logprobs` only works with response_format set to `json` and only with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`. | No | |
+ | include[] | array | Additional information to include in the transcription response.<br>`logprobs` will return the log probabilities of the tokens in the response to understand the model's confidence in the transcription.<br>`logprobs` only works with response_format set to `json` and only with the models `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-4o-mini-transcribe`. | No | |
| language | string | The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency. | No | |
| model | string | The model to use for this transcription request. | No | |
| prompt | string | An optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language. | No | |
@@ -4684,11 +4684,11 @@ A citation for a web resource used to generate a model response.
### OpenAI.AudioResponseFormat

- The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, or `vtt`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`.
+ The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, or `vtt`. For `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-4o-mini-transcribe`, the only supported format is `json`.

| Property | Value |
|----------|-------|
- | **Description** | The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, or `vtt`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`. |
+ | **Description** | The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, or `vtt`. For `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-4o-mini-transcribe`, the only supported format is `json`. |
| **Type** | string |
| **Values** | `json`<br>`text`<br>`srt`<br>`verbose_json`<br>`vtt` |
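As a usage sketch of the transcription request parameters above, assuming the `openai` Python package against an Azure OpenAI resource (the endpoint, key, API version, audio file name, and deployment name below are placeholders, not values from this commit):

```python
from openai import AzureOpenAI

# Placeholders only; substitute your own resource values.
client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
    api_key="YOUR-API-KEY",
    api_version="2025-03-01-preview",
)

with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",  # your deployment of a gpt-4o transcription model
        file=audio_file,
        response_format="json",     # the only supported format for the gpt-4o transcribe models
        include=["logprobs"],       # logprobs require response_format set to `json`
    )

print(transcript.text)
```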

articles/ai-foundry/openai/includes/models-azure-direct-openai.md

Lines changed: 2 additions & 0 deletions
@@ -344,6 +344,8 @@ The audio models via the `/audio` API can be used for speech to text, translatio
| `whisper` | General-purpose speech recognition model. | 25 MB |
| `gpt-4o-transcribe` | Speech-to-text model powered by GPT-4o. | 25 MB |
| `gpt-4o-mini-transcribe` | Speech-to-text model powered by GPT-4o mini. | 25 MB |
+ | `gpt-4o-transcribe-diarize` | Speech-to-text model with speaker diarization, powered by GPT-4o. | 25 MB |

#### Speech translation models

articles/ai-foundry/openai/includes/retirement/models.md

Lines changed: 1 addition & 0 deletions
@@ -57,6 +57,7 @@ ms.custom: references_regions, build-2025
| `gpt-4o-transcribe` | 2025-03-20 | Preview | No earlier than January 14, 2026 | |
| `gpt-4o-mini-tts` | 2025-03-20 | Preview | No earlier than January 14, 2026 | |
| `gpt-4o-mini-transcribe` | 2025-03-20 | Preview | No earlier than January 14, 2026 | |
+ | `gpt-4o-transcribe-diarize` | 2025-03-20 | Preview | No earlier than January 14, 2026 | |
| `tts` | 001 | Generally Available | No earlier than February 1, 2026 | |
| `tts-hd` | 001 | Generally Available | No earlier than February 1, 2026 | |
| `whisper` | 001 | Generally Available | No earlier than February 1, 2026 | |

articles/ai-foundry/openai/realtime-audio-reference.md

Lines changed: 3 additions & 1 deletion
@@ -1198,14 +1198,16 @@ The server `session.updated` event is returned when a session is updated by the
* `whisper-1`
* `gpt-4o-transcribe`
* `gpt-4o-mini-transcribe`
+ * `gpt-4o-transcribe-diarize`

### RealtimeAudioInputTranscriptionSettings

| Field | Type | Description |
|-------|------|-------------|
| language | string | The language of the input audio. Supplying the input language in ISO-639-1 format (such as `en`) will improve accuracy and latency. |
| model | [RealtimeAudioInputTranscriptionModel](#realtimeaudioinputtranscriptionmodel) | The model for audio input transcription. For example, `whisper-1`. |
- | prompt | string | The prompt for the audio input transcription. Optional text to guide the model's style or continue a previous audio segment. For the `whisper-1` model, the prompt is a list of keywords. For the `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` models, the prompt is a free text string such as "expect words related to technology." |
+ | prompt | string | The prompt for the audio input transcription. Optional text to guide the model's style or continue a previous audio segment. For the `whisper-1` model, the prompt is a list of keywords. For the `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-4o-mini-transcribe` models, the prompt is a free text string such as "expect words related to technology." |

### RealtimeAudioInputAudioNoiseReductionSettings

articles/ai-foundry/openai/whats-new.md

Lines changed: 10 additions & 2 deletions
@@ -19,7 +19,15 @@ ms.custom:

This article provides a summary of the latest releases and major documentation updates for Azure OpenAI.

- ## October 2025
+ ## October 2025
+
+ ### GPT-4o audio model released
+
+ - The `gpt-4o-transcribe-diarize` speech-to-text model is released. This is an Automatic Speech Recognition (ASR) model that converts spoken language into text in real time, letting organizations unlock insights from conversations instantly, with ultra-low latency and high accuracy across 100+ languages. This capability is essential for workflows where voice data drives decisions, such as customer support, virtual meetings, and live events.
+
+ Diarization is the process of identifying who spoke when in an audio stream. It transforms conversations into speaker-attributed transcripts, enabling businesses to extract actionable insights from meetings, customer calls, and live events. With models like `gpt-4o-transcribe-diarize`, organizations gain real-time clarity and context, turning voice into structured data that drives smarter decisions and improves productivity.
+
+ Use this model via the `/audio` and `/realtime` APIs.
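To make speaker attribution concrete, here is a minimal sketch that assumes a simplified, hypothetical shape for a diarized result; the actual response schema may differ:

```python
# Hypothetical, simplified shape of a diarized transcription result: a list of
# segments, each attributed to a detected speaker. The real schema may differ.
segments = [
    {"speaker": "speaker_0", "start": 0.0, "end": 3.2, "text": "Welcome, everyone."},
    {"speaker": "speaker_1", "start": 3.4, "end": 6.1, "text": "Thanks. Shall we start?"},
]

# Reduce the segments to a speaker-attributed transcript, one line per turn.
for seg in segments:
    print(f"[{seg['speaker']} {seg['start']:.1f}-{seg['end']:.1f}s] {seg['text']}")
```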

### GPT-image-1-mini

@@ -36,7 +44,7 @@ Personally identifiable information (PII) detection is now available as a built-

## September 2025

- ## GPT-5-codex is now available
+ ### GPT-5-codex is now available

- To learn more about `gpt-5-codex`, see the [getting started with reasoning models page](./how-to/reasoning.md).
- `gpt-5-codex` is specifically designed to be used with the [Codex CLI and the Visual Studio Code Codex extension](./how-to/codex.md).
