Commit 998cd58
Merge pull request #7696 from PatrickFarley/aoai-updates
add diarize model
2 parents 1210313 + 584a6b3 commit 998cd58

6 files changed: +22 -9 lines changed

articles/ai-foundry/openai/includes/api-versions/latest-inference-preview.md

Lines changed: 2 additions & 2 deletions
@@ -4553,7 +4553,7 @@ It responds with a session object, plus a `client_secret` key which contains a u
| └─ type | enum | Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones.<br><br>Possible values: `near_field`, `far_field` | No | |
| input_audio_transcription | object | Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription isn't native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the Transcriptions endpoint](/azure/ai-foundry/openai/reference-preview#transcriptions---create) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.<br> | No | |
| └─ language | string | The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format improves accuracy and latency.<br> | No | |
- | └─ model | string | The model to use for transcription, current options are `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, and `whisper-1`.<br> | No | |
+ | └─ model | string | The model to use for transcription, current options are `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, `gpt-4o-mini-transcribe`, and `whisper-1`.<br> | No | |
| └─ prompt | string | An optional text to guide the model's style or continue a previous audio segment.<br>For `whisper-1`, the prompt is a list of keywords.<br>For `gpt-4o-transcribe` models, the prompt is a free text string, for example "expect words related to technology".<br> | No | |
| instructions | string | The default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance on the desired behavior.<br><br>Note that the server sets default instructions which are used if this field isn't set; they're visible in the `session.created` event at the start of the session.<br> | No | |
| max_response_output_tokens | integer or string | Maximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or `inf` for the maximum available tokens for a given model. Defaults to `inf`.<br> | No | |
@@ -8885,7 +8885,7 @@ Realtime transcription session object configuration.
| └─ type | enum | Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones.<br><br>Possible values: `near_field`, `far_field` | No | |
| input_audio_transcription | object | Configuration for input audio transcription. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.<br> | No | |
| └─ language | string | The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format improves accuracy and latency.<br> | No | |
- | └─ model | enum | The model to use for transcription, current options are `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, and `whisper-1`.<br><br>Possible values: `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, `whisper-1` | No | |
+ | └─ model | enum | The model to use for transcription, current options are `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, `gpt-4o-mini-transcribe`, and `whisper-1`.<br><br>Possible values: `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, `gpt-4o-mini-transcribe`, `whisper-1` | No | |
| └─ prompt | string | An optional text to guide the model's style or continue a previous audio segment.<br>For `whisper-1`, the prompt is a list of keywords.<br>For `gpt-4o-transcribe` models, the prompt is a free text string, for example "expect words related to technology".<br> | No | |
| modalities | | The set of modalities the model can respond with. To disable audio, set this to ["text"].<br> | No | |
| turn_detection | object | Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to `null` to turn off, in which case the client must manually trigger model response.<br>Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.<br>Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with `uhhm`, the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.<br> | No | |
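These transcription settings plug into the realtime session configuration. As a minimal sketch (assuming an already-open realtime WebSocket connection, which is omitted, and assuming `input_audio_noise_reduction` as the parent field for the noise-reduction `type` documented above), a `session.update` event enabling the new model could look like:

```python
import json

# Sketch of a `session.update` event that turns on input audio transcription
# with the newly added diarize model. The realtime WebSocket connection that
# would carry this event is assumed and omitted.
session_update = {
    "type": "session.update",
    "session": {
        "input_audio_transcription": {
            "model": "gpt-4o-transcribe-diarize",
            "language": "en",  # ISO-639-1 code improves accuracy and latency
            "prompt": "expect words related to technology",
        },
        # Parent field name is an assumption; the table above documents only its `type` values.
        "input_audio_noise_reduction": {"type": "near_field"},
    },
}

payload = json.dumps(session_update)  # serialize before sending on the socket
print(payload)
```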

articles/ai-foundry/openai/includes/api-versions/new-inference-preview.md

Lines changed: 4 additions & 4 deletions
@@ -127,7 +127,7 @@ Transcribes audio into the input language.
| └─ type | enum | Must be set to `server_vad` to enable manual chunking using server side VAD.<br>Possible values: `server_vad` | No | |
| file | string | | Yes | |
| filename | string | The optional filename or descriptive identifier to associate with the audio data. | No | |
- | include[] | array | Additional information to include in the transcription response.<br>`logprobs` will return the log probabilities of the tokens in the response to understand the model's confidence in the transcription.<br>`logprobs` only works with response_format set to `json` and only with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`. | No | |
+ | include[] | array | Additional information to include in the transcription response.<br>`logprobs` will return the log probabilities of the tokens in the response to understand the model's confidence in the transcription.<br>`logprobs` only works with response_format set to `json` and only with the models `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-4o-mini-transcribe`. | No | |
| language | string | The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency. | No | |
| model | string | The model to use for this transcription request. | No | |
| prompt | string | An optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language. | No | |
@@ -4225,7 +4225,7 @@ The configuration information for an audio transcription request.
| └─ type | enum | Must be set to `server_vad` to enable manual chunking using server side VAD.<br>Possible values: `server_vad` | No | |
| file | string | | Yes | |
| filename | string | The optional filename or descriptive identifier to associate with the audio data. | No | |
- | include[] | array | Additional information to include in the transcription response.<br>`logprobs` will return the log probabilities of the tokens in the response to understand the model's confidence in the transcription.<br>`logprobs` only works with response_format set to `json` and only with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`. | No | |
+ | include[] | array | Additional information to include in the transcription response.<br>`logprobs` will return the log probabilities of the tokens in the response to understand the model's confidence in the transcription.<br>`logprobs` only works with response_format set to `json` and only with the models `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-4o-mini-transcribe`. | No | |
| language | string | The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency. | No | |
| model | string | The model to use for this transcription request. | No | |
| prompt | string | An optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language. | No | |
@@ -4684,11 +4684,11 @@ A citation for a web resource used to generate a model response.
### OpenAI.AudioResponseFormat

- The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, or `vtt`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`.
+ The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, or `vtt`. For `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-4o-mini-transcribe`, the only supported format is `json`.

| Property | Value |
|----------|-------|
- | **Description** | The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, or `vtt`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`. |
+ | **Description** | The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, or `vtt`. For `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-4o-mini-transcribe`, the only supported format is `json`. |
| **Type** | string |
| **Values** | `json`<br>`text`<br>`srt`<br>`verbose_json`<br>`vtt` |
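As a usage sketch of the transcription request parameters above, assuming the `openai` Python package against an Azure OpenAI resource (the endpoint, key, API version, audio file name, and deployment name below are placeholders, not values from this commit):

```python
from openai import AzureOpenAI

# Placeholders only; substitute your own resource values.
client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
    api_key="YOUR-API-KEY",
    api_version="2025-03-01-preview",
)

with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",  # your deployment of a gpt-4o transcription model
        file=audio_file,
        response_format="json",     # the only supported format for the gpt-4o transcribe models
        include=["logprobs"],       # logprobs require response_format set to `json`
    )

print(transcript.text)
```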

articles/ai-foundry/openai/includes/models-azure-direct-openai.md

Lines changed: 2 additions & 0 deletions
@@ -344,6 +344,8 @@ The audio models via the `/audio` API can be used for speech to text, translatio
| `whisper` | General-purpose speech recognition model. | 25 MB |
| `gpt-4o-transcribe` | Speech-to-text model powered by GPT-4o. | 25 MB |
| `gpt-4o-mini-transcribe` | Speech-to-text model powered by GPT-4o mini. | 25 MB |
+ | `gpt-4o-transcribe-diarize` | Speech-to-text model with speaker diarization, powered by GPT-4o. | 25 MB |

#### Speech translation models

articles/ai-foundry/openai/includes/retirement/models.md

Lines changed: 1 addition & 0 deletions
@@ -57,6 +57,7 @@ ms.custom: references_regions, build-2025
| `gpt-4o-transcribe` | 2025-03-20 | Preview | No earlier than January 14, 2026 | |
| `gpt-4o-mini-tts` | 2025-03-20 | Preview | No earlier than January 14, 2026 | |
| `gpt-4o-mini-transcribe` | 2025-03-20 | Preview | No earlier than January 14, 2026 | |
+ | `gpt-4o-transcribe-diarize` | 2025-03-20 | Preview | No earlier than January 14, 2026 | |
| `tts` | 001 | Generally Available | No earlier than February 1, 2026 | |
| `tts-hd` | 001 | Generally Available | No earlier than February 1, 2026 | |
| `whisper` | 001 | Generally Available | No earlier than February 1, 2026 | |

articles/ai-foundry/openai/realtime-audio-reference.md

Lines changed: 3 additions & 1 deletion
@@ -1198,14 +1198,16 @@ The server `session.updated` event is returned when a session is updated by the
* `whisper-1`
* `gpt-4o-transcribe`
* `gpt-4o-mini-transcribe`
+ * `gpt-4o-transcribe-diarize`

### RealtimeAudioInputTranscriptionSettings

| Field | Type | Description |
|-------|------|-------------|
| language | string | The language of the input audio. Supplying the input language in ISO-639-1 format (such as `en`) will improve accuracy and latency. |
| model | [RealtimeAudioInputTranscriptionModel](#realtimeaudioinputtranscriptionmodel) | The model for audio input transcription. For example, `whisper-1`. |
- | prompt | string | The prompt for the audio input transcription. Optional text to guide the model's style or continue a previous audio segment. For the `whisper-1` model, the prompt is a list of keywords. For the `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` models, the prompt is a free text string such as "expect words related to technology." |
+ | prompt | string | The prompt for the audio input transcription. Optional text to guide the model's style or continue a previous audio segment. For the `whisper-1` model, the prompt is a list of keywords. For the `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-4o-mini-transcribe` models, the prompt is a free text string such as "expect words related to technology." |

### RealtimeAudioInputAudioNoiseReductionSettings

articles/ai-foundry/openai/whats-new.md

Lines changed: 10 additions & 2 deletions
@@ -19,7 +19,15 @@ ms.custom:

This article provides a summary of the latest releases and major documentation updates for Azure OpenAI.

- ## October 2025
+ ## October 2025
+
+ ### GPT-4o audio model released
+
+ - The `gpt-4o-transcribe-diarize` speech-to-text model is released. This is an Automatic Speech Recognition (ASR) model that converts spoken language into text in real time, letting organizations unlock insights from conversations instantly, with ultra-low latency and high accuracy across 100+ languages. This capability is essential for workflows where voice data drives decisions, such as customer support, virtual meetings, and live events.
+
+ Diarization is the process of identifying who spoke when in an audio stream. It transforms conversations into speaker-attributed transcripts, enabling businesses to extract actionable insights from meetings, customer calls, and live events. With models like `gpt-4o-transcribe-diarize`, organizations gain real-time clarity and context, turning voice into structured data that drives smarter decisions and improves productivity.
+
+ Use this model via the `/audio` and `/realtime` APIs.
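To make speaker attribution concrete, here is a minimal sketch that assumes a simplified, hypothetical shape for a diarized result; the actual response schema may differ:

```python
# Hypothetical, simplified shape of a diarized transcription result: a list of
# segments, each attributed to a detected speaker. The real schema may differ.
segments = [
    {"speaker": "speaker_0", "start": 0.0, "end": 3.2, "text": "Welcome, everyone."},
    {"speaker": "speaker_1", "start": 3.4, "end": 6.1, "text": "Thanks. Shall we start?"},
]

# Reduce the segments to a speaker-attributed transcript, one line per turn.
for seg in segments:
    print(f"[{seg['speaker']} {seg['start']:.1f}-{seg['end']:.1f}s] {seg['text']}")
```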

### GPT-image-1-mini

@@ -36,7 +44,7 @@ Personally identifiable information (PII) detection is now available as a built-

## September 2025

- ## GPT-5-codex is now available
+ ### GPT-5-codex is now available

- To learn more about `gpt-5-codex`, see the [getting started with reasoning models page](./how-to/reasoning.md).
- `gpt-5-codex` is specifically designed to be used with the [Codex CLI and the Visual Studio Code Codex extension](./how-to/codex.md).
