Commit ee3f158

Resolve comments 1-3 from Qinying. Comment 4 to be resolved in separate GA release PR!
1 parent bd1355b commit ee3f158

7 files changed: +27 additions, -29 deletions

articles/ai-services/speech-service/includes/quickstarts/voice-live-api/python.md

Lines changed: 2 additions & 2 deletions
@@ -602,14 +602,14 @@ The sample code in this quickstart uses either Microsoft Entra ID or an API key
         "--model",
         help="VoiceLive model to use",
         type=str,
-        default=os.environ.get("VOICE_LIVE_MODEL", "gpt-4o-realtime-preview"),
+        default=os.environ.get("VOICE_LIVE_MODEL", "gpt-realtime"),
     )

     parser.add_argument(
         "--voice",
         help="Voice to use for the assistant",
         type=str,
-        default=os.environ.get("VOICE_LIVE_VOICE", "en-US-AvaNeural"),
+        default=os.environ.get("VOICE_LIVE_VOICE", "en-US-Ava:DragonHDLatestNeural"),
         help="Voice to use for the assistant. E.g. alloy, echo, fable, en-US-AvaNeural, en-US-GuyNeural",
     )
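The two new defaults can be sanity-checked in isolation. The following sketch reconstructs the quickstart's environment-variable fallback pattern; the `VOICE_LIVE_MODEL`/`VOICE_LIVE_VOICE` variable names, flags, and default values come from the diff, while the surrounding parser setup is an illustrative reconstruction (the diff's duplicated `help=` keyword is dropped, since Python rejects repeated keyword arguments):

```python
import argparse
import os

# Minimal reproduction of the quickstart's argument parsing with the
# defaults introduced by this commit.
parser = argparse.ArgumentParser(description="Voice Live quickstart options")
parser.add_argument(
    "--model",
    help="VoiceLive model to use",
    type=str,
    default=os.environ.get("VOICE_LIVE_MODEL", "gpt-realtime"),
)
parser.add_argument(
    "--voice",
    help="Voice to use for the assistant",
    type=str,
    default=os.environ.get("VOICE_LIVE_VOICE", "en-US-Ava:DragonHDLatestNeural"),
)

# Parse with no CLI arguments: values fall back to the environment
# variables if set, else to the hard-coded defaults.
args = parser.parse_args([])
print(args.model, args.voice)
```

Because `os.environ.get` is evaluated when the parser is built, exporting `VOICE_LIVE_MODEL` or `VOICE_LIVE_VOICE` before launching the script overrides the defaults without any CLI flags.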

articles/ai-services/speech-service/includes/release-notes/release-notes-tts.md

Lines changed: 2 additions & 2 deletions
@@ -15,9 +15,9 @@ Our new “DragonV2.1” model brings improvements to the naturalness of speech,
 ### June 2025 release

 #### VoiceLive API update
-- Support more GenAI models: GPT-4.1, GPT-4.1 Mini and GPT-4.1 Nano, Phi-4 mini and Phi-4 Multimodal models are now natively supported.
+- Support more GenAI models: GPT-4.1, GPT-4.1 Mini, Phi-4 mini and Phi-4 Multimodal models are now natively supported.
 - Support more customization capabilities
-- Azure Semantic VAD is extended to support GPT-4o-Realtime and GPT-4o-Mini-Realtime.
+- Azure Semantic VAD is extended to support GPT-Realtime and GPT-4o-Mini-Realtime.
 - Availability in more regions

 #### Public preview of Voice Conversion feature on selected en-US voices

articles/ai-services/speech-service/regions.md

Lines changed: 12 additions & 12 deletions
@@ -174,18 +174,18 @@ The regions in these tables support most of the core features of the Speech serv

 # [Voice live](#tab/voice-live)

-| **Region** | **gpt-realtime** | **gpt-4o-realtime** | **gpt-4o-mini-realtime** | **gpt-4o** | **gpt-4o-mini** | **gpt-4.1** | **gpt-4.1-mini** | **gpt-4.1-nano** | **gpt-5** | **gpt-5-mini** | **gpt-5-nano** | **phi4-mm-realtime** | **phi4-mini** |
-|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
-| centralindia | Cross-region<sup>1</sup> | Cross-region<sup>1</sup> | Cross-region<sup>1</sup> | Global standard | Global standard | Global standard | Global standard | Global standard | - | - | - | - | - |
-| eastus2 | Global standard | Global standard | Global standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Regional | Regional |
-| southeastasia | - | - | - | - | - | Global standard | Global standard | Global standard | - | - | - | Regional | Regional |
-| swedencentral | Global standard | Global standard | Global standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Regional | Regional |
-| westus2 | Cross-region<sup>2</sup> | Cross-region<sup>2</sup> | Cross-region<sup>2</sup> | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | - | - | - | Regional | Regional |
-|australiaeast| - | - | - | Global standard | Global standard | Global standard | Global standard | Global standard | - | - | - | - | - |
-|japaneast| - | - | - | Global standard | Global standard | Global standard | Global standard | Global standard | - | - | - | Regional | Regional |
-|eastus| - | - | - | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | - | - | - | - | - |
-|uksouth| - | - | - | Global standard | Global standard | Global standard | Global standard | Global standard | - | - | - | - | - |
-|westeurope| - | - | - | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | - | - | - | - | - |
+| **Region** | **gpt-realtime** | **gpt-4o-mini-realtime** (Preview) | **gpt-4o** | **gpt-4o-mini** | **gpt-4.1** | **gpt-4.1-mini** | **gpt-5** (Preview) | **gpt-5-mini** (Preview) | **gpt-5-nano** (Preview) | **phi4-mm-realtime** (Preview) | **phi4-mini** (Preview) |
+|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
+| centralindia | Cross-region<sup>1</sup> | Cross-region<sup>1</sup> | Global standard | Global standard | Global standard | Global standard | - | - | - | - | - |
+| eastus2 | Global standard | Global standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Regional | Regional |
+| southeastasia | - | - | - | - | Global standard | Global standard | - | - | - | Regional | Regional |
+| swedencentral | Global standard | Global standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Regional | Regional |
+| westus2 | Cross-region<sup>2</sup> | Cross-region<sup>2</sup> | Data zone standard | Data zone standard | Data zone standard | Data zone standard | - | - | - | Regional | Regional |
+|australiaeast| - | - | Global standard | Global standard | Global standard | Global standard | - | - | - | - | - |
+|japaneast| - | - | Global standard | Global standard | Global standard | Global standard | - | - | - | Regional | Regional |
+|eastus| - | - | Data zone standard | Data zone standard | Data zone standard | Data zone standard | - | - | - | - | - |
+|uksouth| - | - | Global standard | Global standard | Global standard | Global standard | - | - | - | - | - |
+|westeurope| - | - | Data zone standard | Data zone standard | Data zone standard | Data zone standard | - | - | - | - | - |

 <sup>1</sup> The Azure AI Foundry resource must be in Central India. Azure AI Speech features remain in Central India. The voice live API uses Sweden Central as needed for generative AI load balancing.

articles/ai-services/speech-service/voice-live-how-to-customize.md

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ Use phrase list for lightweight just-in-time customization on audio input. To co
 ```

 > [!NOTE]
-> Phrase list currently doesn't support gpt-realtime, gpt-4o-realtime, gpt-4o-mini-realtime, and phi4-mm-realtime. To learn more about phrase list, see [phrase list for speech to text](./improve-accuracy-phrase-list.md).
+> Phrase list currently doesn't support gpt-realtime, gpt-4o-mini-realtime, and phi4-mm-realtime. To learn more about phrase list, see [phrase list for speech to text](./improve-accuracy-phrase-list.md).

 ### Custom speech configuration

articles/ai-services/speech-service/voice-live-how-to.md

Lines changed: 2 additions & 2 deletions
@@ -139,7 +139,7 @@ Turn detection is the process of detecting when the end-user started or stopped
 | `speech_duration_ms` | integer | Optional | The duration of user's speech audio required to start detection. If not set or under 80 ms, the detector uses a default value of 80 ms. |
 | `silence_duration_ms` | integer | Optional | The duration of user's silence, measured in milliseconds, to detect the end of speech. |
 | `remove_filler_words` | boolean | Optional | Determines whether to remove filler words to reduce the false alarm rate. This property must be set to `true` when using `azure_semantic_vad`.<br/><br/>The default value is `false`. |
-| `end_of_utterance_detection` | object | Optional | Configuration for end of utterance detection. The voice live API offers advanced end-of-turn detection to indicate when the end-user stopped speaking while allowing for natural pauses. End of utterance detection can significantly reduce premature end-of-turn signals without adding user-perceivable latency. End of utterance detection can be used with either VAD selection.<br/><br/>Properties of `end_of_utterance_detection` include:<br/>-`model`: The model to use for end of utterance detection. The supported values are:<br/>&nbsp;&nbsp;`semantic_detection_v1` supporting English.<br/>&nbsp;&nbsp;`semantic_detection_v1_multilingual` supporting English, Spanish, French, Italian, German (DE), Japanese, Portuguese, Chinese, Korean, Hindi.<br/>Other languages will be bypassed.<br/>- `threshold`: Threshold to determine the end of utterance (0.0 to 1.0). The default value is 0.01.<br/>- `timeout`: Timeout in seconds. The default value is 2 seconds. <br/><br/>End of utterance detection currently doesn't support gpt-realtime, gpt-4o-realtime, gpt-4o-mini-realtime, and phi4-mm-realtime.|
+| `end_of_utterance_detection` | object | Optional | Configuration for end of utterance detection. The voice live API offers advanced end-of-turn detection to indicate when the end-user stopped speaking while allowing for natural pauses. End of utterance detection can significantly reduce premature end-of-turn signals without adding user-perceivable latency. End of utterance detection can be used with either VAD selection.<br/><br/>Properties of `end_of_utterance_detection` include:<br/>-`model`: The model to use for end of utterance detection. The supported values are:<br/>&nbsp;&nbsp;`semantic_detection_v1` supporting English.<br/>&nbsp;&nbsp;`semantic_detection_v1_multilingual` supporting English, Spanish, French, Italian, German (DE), Japanese, Portuguese, Chinese, Korean, Hindi.<br/>Other languages will be bypassed.<br/>- `threshold`: Threshold to determine the end of utterance (0.0 to 1.0). The default value is 0.01.<br/>- `timeout`: Timeout in seconds. The default value is 2 seconds. <br/><br/>End of utterance detection currently doesn't support gpt-realtime, gpt-4o-mini-realtime, and phi4-mm-realtime.|

 Here's an example of end of utterance detection in a session object:

@@ -166,7 +166,7 @@ Here's an example of end of utterance detection in a session object:

 ## Audio input through Azure speech to text

-Azure speech to text will automatically be active when you are using a non-multimodal model like gpt-4o-realtime.
+Azure speech to text will automatically be active when you are using a non-multimodal model like gpt-4o.

 In order to explicitly configure it you can set the `model` to `azure-speech` in `input_audio_transcription`. This can be useful to improve the recognition quality for specific language situations. See [How to customize voice live input and output](./voice-live-how-to-customize) learn more about speech input customization configuration.
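The `end_of_utterance_detection` properties listed in the table can be assembled into a turn-detection payload. A minimal sketch follows: the field names, supported model values, and defaults (`threshold` 0.01, `timeout` 2 seconds) come from the documentation being changed here, while the `session.update` envelope and the `azure_semantic_vad` wrapper are assumptions based on the surrounding docs, not confirmed by this diff:

```python
import json

# Illustrative session configuration combining Azure semantic VAD with
# end-of-utterance detection, using the fields described in the table.
session_update = {
    "type": "session.update",  # assumed message envelope
    "session": {
        "turn_detection": {
            "type": "azure_semantic_vad",
            # Must be true when using azure_semantic_vad, per the table.
            "remove_filler_words": True,
            "end_of_utterance_detection": {
                "model": "semantic_detection_v1_multilingual",
                "threshold": 0.01,  # documented default (0.0 to 1.0)
                "timeout": 2,       # seconds, documented default
            },
        }
    },
}

# The payload would be sent as JSON over the Voice Live WebSocket session.
print(json.dumps(session_update, indent=2))
```

Note that per the updated table, this configuration would apply only to models that support end of utterance detection, which now excludes gpt-realtime, gpt-4o-mini-realtime, and phi4-mm-realtime.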

articles/ai-services/speech-service/voice-live-language-support.md

Lines changed: 4 additions & 4 deletions
@@ -22,7 +22,7 @@ The voice live API supports multiple languages and configuration options. In thi

 ## [Speech input](#tab/speechinput)

-Depending on which model is being used voice live speech input is processed either by one of the multimodal models (for example, `gpt-4o-realtime-preview`, `gpt-4o-mini-realtime-preview`, and `phi4-mm-realtime`) or by `azure speech to text` models.
+Depending on which model is being used voice live speech input is processed either by one of the multimodal models (for example, `gpt-realtime`, `gpt-4o-mini-realtime`, and `phi4-mm-realtime`) or by `azure speech to text` models.

 ### Azure speech to text supported languages

@@ -78,11 +78,11 @@ To configure a single or multiple languages not supported by the multimodal mode
 }
 ```

-### gpt-4o-realtime-preview and gpt-4o-mini-realtime-preview supported languages
+### gpt-realtime and gpt-4o-mini-realtime supported languages

 While the underlying model was trained on 98 languages, OpenAI only lists the languages that exceeded <50% word error rate (WER) which is an industry standard benchmark for speech to text model accuracy. The model returns results for languages not listed but the quality will be low.

-The following languages are supported by `gpt-4o-realtime-preview` and `gpt-4o-mini-realtime-preview`:
+The following languages are supported by `gpt-realtime` and `gpt-4o-mini-realtime`:
 - Afrikaans
 - Arabic
 - Armenian

@@ -175,7 +175,7 @@ Multimodal models don't require a language configuration for the general process

 ## [Speech output](#tab/speechoutput)

-Depending on which model is being used voice live speech output is processed either by one of the multimodal OpenAI voices integrated into `gpt-4o-realtime-preview` and `gpt-4o-mini-realtime-preview` or by `azure text to speech` voices.
+Depending on which model is being used voice live speech output is processed either by one of the multimodal OpenAI voices integrated into `gpt-realtime` and `gpt-4o-mini-realtime` or by `azure text to speech` voices.

 ### Azure text to speech supported languages

articles/ai-services/speech-service/voice-live.md

Lines changed: 4 additions & 6 deletions
@@ -74,13 +74,11 @@ The voice live API supports the following models. For supported regions, see the
 | Model | Description |
 | ------------------------------ | ----------- |
 | `gpt-realtime` | GPT real-time + option to use Azure text to speech voices including custom voice for audio. |
-| `gpt-4o-realtime` | GPT-4o real-time + option to use Azure text to speech voices including custom voice for audio. |
 | `gpt-4o-mini-realtime` | GPT-4o mini real-time + option to use Azure text to speech voices including custom voice for audio. |
 | `gpt-4o` | GPT-4o + audio input through Azure speech to text + audio output through Azure text to speech voices including custom voice. |
 | `gpt-4o-mini` | GPT-4o mini + audio input through Azure speech to text + audio output through Azure text to speech voices including custom voice. |
 | `gpt-4.1` | GPT-4.1 + audio input through Azure speech to text + audio output through Azure text to speech voices including custom voice. |
 | `gpt-4.1-mini` | GPT-4.1 mini + audio input through Azure speech to text + audio output through Azure text to speech voices including custom voice. |
-| `gpt-4.1-nano` | GPT-4.1 nano + audio input through Azure speech to text + audio output through Azure text to speech voices including custom voice. |
 | `gpt-5` | GPT-5 + audio input through Azure speech to text + audio output through Azure text to speech voices including custom voice. |
 | `gpt-5-mini` | GPT-5 mini + audio input through Azure speech to text + audio output through Azure text to speech voices including custom voice. |
 | `gpt-5-nano` | GPT-5 nano + audio input through Azure speech to text + audio output through Azure text to speech voices including custom voice. |

@@ -113,9 +111,9 @@ You don't select a tier. You choose a generative AI model and the corresponding

 | Pricing category | Models |
 | ----- | ------ |
-| Voice live pro | `gpt-realtime`, `gpt-4o-realtime`, `gpt-4o`, `gpt-4.1`, `gpt-5` |
+| Voice live pro | `gpt-realtime`, `gpt-4o`, `gpt-4.1`, `gpt-5` |
 | Voice live basic | `gpt-4o-mini-realtime`, `gpt-4o-mini`, `gpt-4.1-mini`, `gpt-5-mini` |
-| Voice live lite | `gpt-4.1-nano`, `gpt-5-nano`,`phi4-mm-realtime`, `phi4-mini` |
+| Voice live lite | `gpt-5-nano`,`phi4-mm-realtime`, `phi4-mini` |

 If you choose to use custom voice for your speech output, you're charged separately for custom voice model training and hosting. Refer to the [Text to Speech – Custom Voice – Professional](https://azure.microsoft.com/pricing/details/cognitive-services/speech-services) pricing for details. Custom voice is a limited access feature. [Learn more about how to create custom voices.](https://aka.ms/CNVPro)

@@ -142,11 +140,11 @@ You're charged separately for the training and model hosting of:

 #### Scenario 2

-A learning agent built with `gpt-4o-realtime` native audio input and standard Azure AI Speech output.
+A learning agent built with `gpt-realtime` native audio input and standard Azure AI Speech output.

 You're charged at the voice live pro rate for:
 - Text
-- Native audio with `gpt-4o-realtime`
+- Native audio with `gpt-realtime`
 - Audio with Azure AI Speech - Standard

 #### Scenario 3
