Commit 8415067: add diarize model
1 parent 53d0dee commit 8415067
File tree: 6 files changed, +19 -8 lines changed

articles/ai-foundry/openai/includes/api-versions/latest-inference-preview.md

Lines changed: 2 additions & 2 deletions

@@ -4553,7 +4553,7 @@ It responds with a session object, plus a `client_secret` key which contains a u
  | └─ type | enum | Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones.<br><br>Possible values: `near_field`, `far_field` | No | |
  | input_audio_transcription | object | Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription isn't native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the Transcriptions endpoint](/azure/ai-foundry/openai/reference-preview#transcriptions---create) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.<br> | No | |
  | └─ language | string | The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format improves accuracy and latency.<br> | No | |
- | └─ model | string | The model to use for transcription, current options are `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, and `whisper-1`.<br> | No | |
+ | └─ model | string | The model to use for transcription, current options are `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, `gpt-4o-mini-transcribe`, and `whisper-1`.<br> | No | |
  | └─ prompt | string | An optional text to guide the model's style or continue a previous audio segment.<br>For `whisper-1`, the prompt is a list of keywords.<br>For `gpt-4o-transcribe` models, the prompt is a free text string, for example "expect words related to technology".<br> | No | |
  | instructions | string | The default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behavior.<br><br>Note that the server sets default instructions which will be used if this field isn't set and are visible in the `session.created` event at the start of the session.<br> | No | |
  | max_response_output_tokens | integer or string | Maximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or `inf` for the maximum available tokens for a given model. Defaults to `inf`.<br> | No | |

@@ -8885,7 +8885,7 @@ Realtime transcription session object configuration.
  | └─ type | enum | Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones.<br><br>Possible values: `near_field`, `far_field` | No | |
  | input_audio_transcription | object | Configuration for input audio transcription. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.<br> | No | |
  | └─ language | string | The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format improves accuracy and latency.<br> | No | |
- | └─ model | enum | The model to use for transcription, current options are `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, and `whisper-1`.<br><br>Possible values: `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, `whisper-1` | No | |
+ | └─ model | enum | The model to use for transcription, current options are `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, `gpt-4o-mini-transcribe`, and `whisper-1`.<br><br>Possible values: `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, `gpt-4o-mini-transcribe`, `whisper-1` | No | |
  | └─ prompt | string | An optional text to guide the model's style or continue a previous audio segment.<br>For `whisper-1`, the prompt is a list of keywords.<br>For `gpt-4o-transcribe` models, the prompt is a free text string, for example "expect words related to technology".<br> | No | |
  | modalities | | The set of modalities the model can respond with. To disable audio, set this to ["text"].<br> | No | |
  | turn_detection | object | Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to `null` to turn off, in which case the client must manually trigger model response.<br>Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.<br>Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with `uhhm`, the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.<br> | No | |
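The session fields above can be combined into a `session.update` payload. The sketch below only assembles the JSON as a Python dict using field names from the table; it does not open a `/realtime` connection, and the specific values chosen are illustrative.

```python
import json

# Sketch of a realtime `session.update` payload using the fields documented
# above. This only builds the JSON; no connection or request is made.
session_update = {
    "type": "session.update",
    "session": {
        "input_audio_noise_reduction": {"type": "near_field"},
        "input_audio_transcription": {
            # `gpt-4o-transcribe-diarize` is the option added by this commit.
            "model": "gpt-4o-transcribe-diarize",
            # ISO-639-1 code; supplying it improves accuracy and latency.
            "language": "en",
            # Free-text guidance for gpt-4o-transcribe models.
            "prompt": "expect words related to technology",
        },
        "instructions": "Be extremely succinct.",
        "max_response_output_tokens": "inf",
    },
}

print(json.dumps(session_update, indent=2))
```

Setting `input_audio_transcription` to `null` instead of an object would turn transcription back off, per the table.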

articles/ai-foundry/openai/includes/api-versions/new-inference-preview.md

Lines changed: 4 additions & 4 deletions

@@ -127,7 +127,7 @@ Transcribes audio into the input language.
  | └─ type | enum | Must be set to `server_vad` to enable manual chunking using server side VAD.<br>Possible values: `server_vad` | No | |
  | file | string | | Yes | |
  | filename | string | The optional filename or descriptive identifier to associate with the audio data. | No | |
- | include[] | array | Additional information to include in the transcription response.<br>`logprobs` will return the log probabilities of the tokens in the<br>response to understand the model's confidence in the transcription.<br>`logprobs` only works with response_format set to `json` and only with<br>the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`. | No | |
+ | include[] | array | Additional information to include in the transcription response.<br>`logprobs` will return the log probabilities of the tokens in the<br>response to understand the model's confidence in the transcription.<br>`logprobs` only works with response_format set to `json` and only with<br>the models `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-4o-mini-transcribe`. | No | |
  | language | string | The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency. | No | |
  | model | string | The model to use for this transcription request. | No | |
  | prompt | string | An optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language. | No | |

@@ -4225,7 +4225,7 @@ The configuration information for an audio transcription request.
  | └─ type | enum | Must be set to `server_vad` to enable manual chunking using server side VAD.<br>Possible values: `server_vad` | No | |
  | file | string | | Yes | |
  | filename | string | The optional filename or descriptive identifier to associate with the audio data. | No | |
- | include[] | array | Additional information to include in the transcription response.<br>`logprobs` will return the log probabilities of the tokens in the<br>response to understand the model's confidence in the transcription.<br>`logprobs` only works with response_format set to `json` and only with<br>the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`. | No | |
+ | include[] | array | Additional information to include in the transcription response.<br>`logprobs` will return the log probabilities of the tokens in the<br>response to understand the model's confidence in the transcription.<br>`logprobs` only works with response_format set to `json` and only with<br>the models `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-4o-mini-transcribe`. | No | |
  | language | string | The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency. | No | |
  | model | string | The model to use for this transcription request. | No | |
  | prompt | string | An optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language. | No | |

@@ -4684,11 +4684,11 @@ A citation for a web resource used to generate a model response.

  ### OpenAI.AudioResponseFormat

- The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, or `vtt`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`.
+ The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, or `vtt`. For `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-4o-mini-transcribe`, the only supported format is `json`.

  | Property | Value |
  |----------|-------|
- | **Description** | The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, or `vtt`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`. |
+ | **Description** | The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, or `vtt`. For `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-4o-mini-transcribe`, the only supported format is `json`. |
  | **Type** | string |
  | **Values** | `json`<br>`text`<br>`srt`<br>`verbose_json`<br>`vtt` |

articles/ai-foundry/openai/includes/models-azure-direct-openai.md

Lines changed: 2 additions & 0 deletions

@@ -329,6 +329,8 @@ The audio models via the `/audio` API can be used for speech to text, translatio
  | `whisper` | General-purpose speech recognition model. | 25 MB |
  | `gpt-4o-transcribe` | Speech-to-text model powered by GPT-4o. | 25 MB |
  | `gpt-4o-mini-transcribe` | Speech-to-text model powered by GPT-4o mini. | 25 MB |
+ | `gpt-4o-transcribe-diarize` | Speech-to-text model powered by GPT-4o with speaker diarization. | 25 MB |
+
  
  #### Speech translation models
articles/ai-foundry/openai/includes/retirement/models.md

Lines changed: 1 addition & 0 deletions

@@ -56,6 +56,7 @@ ms.custom: references_regions, build-2025
  | `gpt-4o-audio-preview` | 2024-12-17 | Preview | No earlier than September 17, 2025 | |
  | `gpt-4o-audio-preview` | 2024-12-17 | Preview | No earlier than September 17, 2025 | |
  | `gpt-4o-transcribe` | 2025-03-20 | Preview | No earlier than September 17, 2025 | |
+ | `gpt-4o-transcribe-diarize` | 2025-03-20 | Preview | No earlier than September 17, 2025 | |
  | `gpt-4o-mini-tts` | 2025-03-20 | Preview | No earlier than September 17, 2025 | |
  | `gpt-4o-mini-transcribe` | 2025-03-20 | Preview | No earlier than September 17, 2025 | |
  | `tts` | 001 | Generally Available | No earlier than February 1, 2026 | |

articles/ai-foundry/openai/realtime-audio-reference.md

Lines changed: 3 additions & 1 deletion

@@ -1198,14 +1198,16 @@ The server `session.updated` event is returned when a session is updated by the
  * `whisper-1`
  * `gpt-4o-transcribe`
  * `gpt-4o-mini-transcribe`
+ * `gpt-4o-transcribe-diarize`
+

  ### RealtimeAudioInputTranscriptionSettings

  | Field | Type | Description |
  |-------|------|-------------|
  | language | string | The language of the input audio. Supplying the input language in ISO-639-1 format (such as `en`) will improve accuracy and latency. |
  | model | [RealtimeAudioInputTranscriptionModel](#realtimeaudioinputtranscriptionmodel) | The model for audio input transcription. For example, `whisper-1`. |
- | prompt | string | The prompt for the audio input transcription. Optional text to guide the model's style or continue a previous audio segment. For the `whisper-1` model, the prompt is a list of keywords. For the `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` models, the prompt is a free text string such as "expect words related to technology." |
+ | prompt | string | The prompt for the audio input transcription. Optional text to guide the model's style or continue a previous audio segment. For the `whisper-1` model, the prompt is a list of keywords. For the `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-4o-mini-transcribe` models, the prompt is a free text string such as "expect words related to technology." |

  ### RealtimeAudioInputAudioNoiseReductionSettings
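The model list above maps naturally onto an enum. This sketch mirrors the documented `RealtimeAudioInputTranscriptionModel` values; the Python class and helper are illustrative (the name follows the doc section, not a shipped SDK type).

```python
from enum import Enum

# Illustrative enum mirroring the documented transcription model values;
# this is not an actual SDK class.
class RealtimeAudioInputTranscriptionModel(str, Enum):
    WHISPER_1 = "whisper-1"
    GPT_4O_TRANSCRIBE = "gpt-4o-transcribe"
    GPT_4O_MINI_TRANSCRIBE = "gpt-4o-mini-transcribe"
    GPT_4O_TRANSCRIBE_DIARIZE = "gpt-4o-transcribe-diarize"  # added by this commit

def validate_transcription_settings(settings: dict) -> dict:
    """Check a RealtimeAudioInputTranscriptionSettings-shaped dict."""
    # Enum lookup raises ValueError for any model name not in the list above.
    RealtimeAudioInputTranscriptionModel(settings["model"])
    return settings

ok = validate_transcription_settings({
    "model": "gpt-4o-transcribe-diarize",
    "language": "en",
    "prompt": "expect words related to technology",
})
print(ok["model"])
```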

articles/ai-foundry/openai/whats-new.md

Lines changed: 7 additions & 1 deletion

@@ -19,9 +19,15 @@ ms.custom:

  This article provides a summary of the latest releases and major documentation updates for Azure OpenAI.

+ ## October 2025
+
+ ### GPT-4o audio model released
+
+ - The `gpt-4o-transcribe-diarize` speech-to-text model is released, supporting automatic speech recognition with speaker diarization. Use this model via the `/audio` and `/realtime` APIs.
+
  ## September 2025

- ## GPT-5-codex is now available
+ ### GPT-5-codex is now available

  - To learn more about `gpt-5-codex`, see the [getting started with reasoning models page](./how-to/reasoning.md).
  - `gpt-5-codex` is specifically designed to be used with the [Codex CLI and the Visual Studio Code Codex extension](./how-to/codex.md).
