Commit ee3f158

Resolve comments 1-3 from Qinying. Comment 4 to be resolved in separate GA release PR!
1 parent bd1355b commit ee3f158

7 files changed: +27 additions, -29 deletions

articles/ai-services/speech-service/includes/quickstarts/voice-live-api/python.md

Lines changed: 2 additions & 2 deletions
@@ -602,14 +602,14 @@ The sample code in this quickstart uses either Microsoft Entra ID or an API key
         "--model",
         help="VoiceLive model to use",
         type=str,
-        default=os.environ.get("VOICE_LIVE_MODEL", "gpt-4o-realtime-preview"),
+        default=os.environ.get("VOICE_LIVE_MODEL", "gpt-realtime"),
     )

     parser.add_argument(
         "--voice",
         help="Voice to use for the assistant",
         type=str,
-        default=os.environ.get("VOICE_LIVE_VOICE", "en-US-AvaNeural"),
+        default=os.environ.get("VOICE_LIVE_VOICE", "en-US-Ava:DragonHDLatestNeural"),
         help="Voice to use for the assistant. E.g. alloy, echo, fable, en-US-AvaNeural, en-US-GuyNeural",
     )
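The two new defaults can be sanity-checked in isolation. The following sketch reconstructs the quickstart's environment-variable fallback pattern; the `VOICE_LIVE_MODEL`/`VOICE_LIVE_VOICE` variable names, flags, and default values come from the diff, while the surrounding parser setup is an illustrative reconstruction (the diff's duplicated `help=` keyword is dropped, since Python rejects repeated keyword arguments):

```python
import argparse
import os

# Minimal reproduction of the quickstart's argument parsing with the
# defaults introduced by this commit.
parser = argparse.ArgumentParser(description="Voice Live quickstart options")
parser.add_argument(
    "--model",
    help="VoiceLive model to use",
    type=str,
    default=os.environ.get("VOICE_LIVE_MODEL", "gpt-realtime"),
)
parser.add_argument(
    "--voice",
    help="Voice to use for the assistant",
    type=str,
    default=os.environ.get("VOICE_LIVE_VOICE", "en-US-Ava:DragonHDLatestNeural"),
)

# Parse with no CLI arguments: values fall back to the environment
# variables if set, else to the hard-coded defaults.
args = parser.parse_args([])
print(args.model, args.voice)
```

Because `os.environ.get` is evaluated when the parser is built, exporting `VOICE_LIVE_MODEL` or `VOICE_LIVE_VOICE` before launching the script overrides the defaults without any CLI flags.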

articles/ai-services/speech-service/includes/release-notes/release-notes-tts.md

Lines changed: 2 additions & 2 deletions
@@ -15,9 +15,9 @@ Our new “DragonV2.1” model brings improvements to the naturalness of speech,
 ### June 2025 release

 #### VoiceLive API update
-- Support more GenAI models: GPT-4.1, GPT-4.1 Mini and GPT-4.1 Nano, Phi-4 mini and Phi-4 Multimodal models are now natively supported.
+- Support more GenAI models: GPT-4.1, GPT-4.1 Mini, Phi-4 mini and Phi-4 Multimodal models are now natively supported.
 - Support more customization capabilities
-- Azure Semantic VAD is extended to support GPT-4o-Realtime and GPT-4o-Mini-Realtime.
+- Azure Semantic VAD is extended to support GPT-Realtime and GPT-4o-Mini-Realtime.
 - Availability in more regions

 #### Public preview of Voice Conversion feature on selected en-US voices

articles/ai-services/speech-service/regions.md

Lines changed: 12 additions & 12 deletions
@@ -174,18 +174,18 @@ The regions in these tables support most of the core features of the Speech serv

 # [Voice live](#tab/voice-live)

-| **Region** | **gpt-realtime** | **gpt-4o-realtime** | **gpt-4o-mini-realtime** | **gpt-4o** | **gpt-4o-mini** | **gpt-4.1** | **gpt-4.1-mini** | **gpt-4.1-nano** | **gpt-5** | **gpt-5-mini** | **gpt-5-nano** | **phi4-mm-realtime** | **phi4-mini** |
-|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
-| centralindia | Cross-region<sup>1</sup> | Cross-region<sup>1</sup> | Cross-region<sup>1</sup> | Global standard | Global standard | Global standard | Global standard | Global standard | - | - | - | - | - |
-| eastus2 | Global standard | Global standard | Global standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Regional | Regional |
-| southeastasia | - | - | - | - | - | Global standard | Global standard | Global standard | - | - | - | Regional | Regional |
-| swedencentral | Global standard | Global standard | Global standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Regional | Regional |
-| westus2 | Cross-region<sup>2</sup> | Cross-region<sup>2</sup> | Cross-region<sup>2</sup> | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | - | - | - | Regional | Regional |
-|australiaeast| - | - | - | Global standard | Global standard | Global standard | Global standard | Global standard | - | - | - | - | - |
-|japaneast| - | - | - | Global standard | Global standard | Global standard | Global standard | Global standard | - | - | - | Regional | Regional |
-|eastus| - | - | - | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | - | - | - | - | - |
-|uksouth| - | - | - | Global standard | Global standard | Global standard | Global standard | Global standard | - | - | - | - | - |
-|westeurope| - | - | - | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | - | - | - | - | - |
+| **Region** | **gpt-realtime** | **gpt-4o-mini-realtime** (Preview) | **gpt-4o** | **gpt-4o-mini** | **gpt-4.1** | **gpt-4.1-mini** | **gpt-5** (Preview) | **gpt-5-mini** (Preview) | **gpt-5-nano** (Preview) | **phi4-mm-realtime** (Preview) | **phi4-mini** (Preview) |
+|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
+| centralindia | Cross-region<sup>1</sup> | Cross-region<sup>1</sup> | Global standard | Global standard | Global standard | Global standard | - | - | - | - | - |
+| eastus2 | Global standard | Global standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Regional | Regional |
+| southeastasia | - | - | - | - | Global standard | Global standard | - | - | - | Regional | Regional |
+| swedencentral | Global standard | Global standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Regional | Regional |
+| westus2 | Cross-region<sup>2</sup> | Cross-region<sup>2</sup> | Data zone standard | Data zone standard | Data zone standard | Data zone standard | - | - | - | Regional | Regional |
+|australiaeast| - | - | Global standard | Global standard | Global standard | Global standard | - | - | - | - | - |
+|japaneast| - | - | Global standard | Global standard | Global standard | Global standard | - | - | - | Regional | Regional |
+|eastus| - | - | Data zone standard | Data zone standard | Data zone standard | Data zone standard | - | - | - | - | - |
+|uksouth| - | - | Global standard | Global standard | Global standard | Global standard | - | - | - | - | - |
+|westeurope| - | - | Data zone standard | Data zone standard | Data zone standard | Data zone standard | - | - | - | - | - |

 <sup>1</sup> The Azure AI Foundry resource must be in Central India. Azure AI Speech features remain in Central India. The voice live API uses Sweden Central as needed for generative AI load balancing.

articles/ai-services/speech-service/voice-live-how-to-customize.md

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ Use phrase list for lightweight just-in-time customization on audio input. To co
 ```

 > [!NOTE]
-> Phrase list currently doesn't support gpt-realtime, gpt-4o-realtime, gpt-4o-mini-realtime, and phi4-mm-realtime. To learn more about phrase list, see [phrase list for speech to text](./improve-accuracy-phrase-list.md).
+> Phrase list currently doesn't support gpt-realtime, gpt-4o-mini-realtime, and phi4-mm-realtime. To learn more about phrase list, see [phrase list for speech to text](./improve-accuracy-phrase-list.md).

 ### Custom speech configuration

articles/ai-services/speech-service/voice-live-how-to.md

Lines changed: 2 additions & 2 deletions
@@ -139,7 +139,7 @@ Turn detection is the process of detecting when the end-user started or stopped
 | `speech_duration_ms` | integer | Optional | The duration of user's speech audio required to start detection. If not set or under 80 ms, the detector uses a default value of 80 ms. |
 | `silence_duration_ms` | integer | Optional | The duration of user's silence, measured in milliseconds, to detect the end of speech. |
 | `remove_filler_words` | boolean | Optional | Determines whether to remove filler words to reduce the false alarm rate. This property must be set to `true` when using `azure_semantic_vad`.<br/><br/>The default value is `false`. |
-| `end_of_utterance_detection` | object | Optional | Configuration for end of utterance detection. The voice live API offers advanced end-of-turn detection to indicate when the end-user stopped speaking while allowing for natural pauses. End of utterance detection can significantly reduce premature end-of-turn signals without adding user-perceivable latency. End of utterance detection can be used with either VAD selection.<br/><br/>Properties of `end_of_utterance_detection` include:<br/>-`model`: The model to use for end of utterance detection. The supported values are:<br/>&nbsp;&nbsp;`semantic_detection_v1` supporting English.<br/>&nbsp;&nbsp;`semantic_detection_v1_multilingual` supporting English, Spanish, French, Italian, German (DE), Japanese, Portuguese, Chinese, Korean, Hindi.<br/>Other languages will be bypassed.<br/>- `threshold`: Threshold to determine the end of utterance (0.0 to 1.0). The default value is 0.01.<br/>- `timeout`: Timeout in seconds. The default value is 2 seconds. <br/><br/>End of utterance detection currently doesn't support gpt-realtime, gpt-4o-realtime, gpt-4o-mini-realtime, and phi4-mm-realtime.|
+| `end_of_utterance_detection` | object | Optional | Configuration for end of utterance detection. The voice live API offers advanced end-of-turn detection to indicate when the end-user stopped speaking while allowing for natural pauses. End of utterance detection can significantly reduce premature end-of-turn signals without adding user-perceivable latency. End of utterance detection can be used with either VAD selection.<br/><br/>Properties of `end_of_utterance_detection` include:<br/>-`model`: The model to use for end of utterance detection. The supported values are:<br/>&nbsp;&nbsp;`semantic_detection_v1` supporting English.<br/>&nbsp;&nbsp;`semantic_detection_v1_multilingual` supporting English, Spanish, French, Italian, German (DE), Japanese, Portuguese, Chinese, Korean, Hindi.<br/>Other languages will be bypassed.<br/>- `threshold`: Threshold to determine the end of utterance (0.0 to 1.0). The default value is 0.01.<br/>- `timeout`: Timeout in seconds. The default value is 2 seconds. <br/><br/>End of utterance detection currently doesn't support gpt-realtime, gpt-4o-mini-realtime, and phi4-mm-realtime.|

 Here's an example of end of utterance detection in a session object:

@@ -166,7 +166,7 @@ Here's an example of end of utterance detection in a session object:

 ## Audio input through Azure speech to text

-Azure speech to text will automatically be active when you are using a non-multimodal model like gpt-4o-realtime.
+Azure speech to text will automatically be active when you are using a non-multimodal model like gpt-4o.

 In order to explicitly configure it you can set the `model` to `azure-speech` in `input_audio_transcription`. This can be useful to improve the recognition quality for specific language situations. See [How to customize voice live input and output](./voice-live-how-to-customize) learn more about speech input customization configuration.
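The `end_of_utterance_detection` properties listed in the table can be assembled into a turn-detection payload. A minimal sketch follows: the field names, supported model values, and defaults (`threshold` 0.01, `timeout` 2 seconds) come from the documentation being changed here, while the `session.update` envelope and the `azure_semantic_vad` wrapper are assumptions based on the surrounding docs, not confirmed by this diff:

```python
import json

# Illustrative session configuration combining Azure semantic VAD with
# end-of-utterance detection, using the fields described in the table.
session_update = {
    "type": "session.update",  # assumed message envelope
    "session": {
        "turn_detection": {
            "type": "azure_semantic_vad",
            # Must be true when using azure_semantic_vad, per the table.
            "remove_filler_words": True,
            "end_of_utterance_detection": {
                "model": "semantic_detection_v1_multilingual",
                "threshold": 0.01,  # documented default (0.0 to 1.0)
                "timeout": 2,       # seconds, documented default
            },
        }
    },
}

# The payload would be sent as JSON over the Voice Live WebSocket session.
print(json.dumps(session_update, indent=2))
```

Note that per the updated table, this configuration would apply only to models that support end of utterance detection, which now excludes gpt-realtime, gpt-4o-mini-realtime, and phi4-mm-realtime.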

articles/ai-services/speech-service/voice-live-language-support.md

Lines changed: 4 additions & 4 deletions
@@ -22,7 +22,7 @@ The voice live API supports multiple languages and configuration options. In thi

 ## [Speech input](#tab/speechinput)

-Depending on which model is being used voice live speech input is processed either by one of the multimodal models (for example, `gpt-4o-realtime-preview`, `gpt-4o-mini-realtime-preview`, and `phi4-mm-realtime`) or by `azure speech to text` models.
+Depending on which model is being used voice live speech input is processed either by one of the multimodal models (for example, `gpt-realtime`, `gpt-4o-mini-realtime`, and `phi4-mm-realtime`) or by `azure speech to text` models.

 ### Azure speech to text supported languages

@@ -78,11 +78,11 @@ To configure a single or multiple languages not supported by the multimodal mode
 }
 ```

-### gpt-4o-realtime-preview and gpt-4o-mini-realtime-preview supported languages
+### gpt-realtime and gpt-4o-mini-realtime supported languages

 While the underlying model was trained on 98 languages, OpenAI only lists the languages that exceeded <50% word error rate (WER) which is an industry standard benchmark for speech to text model accuracy. The model returns results for languages not listed but the quality will be low.

-The following languages are supported by `gpt-4o-realtime-preview` and `gpt-4o-mini-realtime-preview`:
+The following languages are supported by `gpt-realtime` and `gpt-4o-mini-realtime`:
 - Afrikaans
 - Arabic
 - Armenian

@@ -175,7 +175,7 @@ Multimodal models don't require a language configuration for the general process

 ## [Speech output](#tab/speechoutput)

-Depending on which model is being used voice live speech output is processed either by one of the multimodal OpenAI voices integrated into `gpt-4o-realtime-preview` and `gpt-4o-mini-realtime-preview` or by `azure text to speech` voices.
+Depending on which model is being used voice live speech output is processed either by one of the multimodal OpenAI voices integrated into `gpt-realtime` and `gpt-4o-mini-realtime` or by `azure text to speech` voices.

 ### Azure text to speech supported languages

articles/ai-services/speech-service/voice-live.md

Lines changed: 4 additions & 6 deletions
@@ -74,13 +74,11 @@ The voice live API supports the following models. For supported regions, see the
 | Model | Description |
 | ------------------------------ | ----------- |
 | `gpt-realtime` | GPT real-time + option to use Azure text to speech voices including custom voice for audio. |
-| `gpt-4o-realtime` | GPT-4o real-time + option to use Azure text to speech voices including custom voice for audio. |
 | `gpt-4o-mini-realtime` | GPT-4o mini real-time + option to use Azure text to speech voices including custom voice for audio. |
 | `gpt-4o` | GPT-4o + audio input through Azure speech to text + audio output through Azure text to speech voices including custom voice. |
 | `gpt-4o-mini` | GPT-4o mini + audio input through Azure speech to text + audio output through Azure text to speech voices including custom voice. |
 | `gpt-4.1` | GPT-4.1 + audio input through Azure speech to text + audio output through Azure text to speech voices including custom voice. |
 | `gpt-4.1-mini` | GPT-4.1 mini + audio input through Azure speech to text + audio output through Azure text to speech voices including custom voice. |
-| `gpt-4.1-nano` | GPT-4.1 nano + audio input through Azure speech to text + audio output through Azure text to speech voices including custom voice. |
 | `gpt-5` | GPT-5 + audio input through Azure speech to text + audio output through Azure text to speech voices including custom voice. |
 | `gpt-5-mini` | GPT-5 mini + audio input through Azure speech to text + audio output through Azure text to speech voices including custom voice. |
 | `gpt-5-nano` | GPT-5 nano + audio input through Azure speech to text + audio output through Azure text to speech voices including custom voice. |

@@ -113,9 +111,9 @@ You don't select a tier. You choose a generative AI model and the corresponding

 | Pricing category | Models |
 | ----- | ------ |
-| Voice live pro | `gpt-realtime`, `gpt-4o-realtime`, `gpt-4o`, `gpt-4.1`, `gpt-5` |
+| Voice live pro | `gpt-realtime`, `gpt-4o`, `gpt-4.1`, `gpt-5` |
 | Voice live basic | `gpt-4o-mini-realtime`, `gpt-4o-mini`, `gpt-4.1-mini`, `gpt-5-mini` |
-| Voice live lite | `gpt-4.1-nano`, `gpt-5-nano`,`phi4-mm-realtime`, `phi4-mini` |
+| Voice live lite | `gpt-5-nano`,`phi4-mm-realtime`, `phi4-mini` |

 If you choose to use custom voice for your speech output, you're charged separately for custom voice model training and hosting. Refer to the [Text to Speech – Custom Voice – Professional](https://azure.microsoft.com/pricing/details/cognitive-services/speech-services) pricing for details. Custom voice is a limited access feature. [Learn more about how to create custom voices.](https://aka.ms/CNVPro)

@@ -142,11 +140,11 @@ You're charged separately for the training and model hosting of:

 #### Scenario 2

-A learning agent built with `gpt-4o-realtime` native audio input and standard Azure AI Speech output.
+A learning agent built with `gpt-realtime` native audio input and standard Azure AI Speech output.

 You're charged at the voice live pro rate for:
 - Text
-- Native audio with `gpt-4o-realtime`
+- Native audio with `gpt-realtime`
 - Audio with Azure AI Speech - Standard

 #### Scenario 3
