-| centralindia | Cross-region<sup>1</sup> | Cross-region<sup>1</sup> | Global standard | Global standard | Global standard | Global standard | - | - | - | - | - |
-| eastus2 | Global standard | Global standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Regional | Regional |
-| southeastasia | - | - | - | - | Global standard | Global standard | - | - | - | Regional | Regional |
-| swedencentral | Global standard | Global standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Regional | Regional |
-| westus2 | Cross-region<sup>2</sup> | Cross-region<sup>2</sup> | Data zone standard | Data zone standard | Data zone standard | Data zone standard | - | - | - | Regional | Regional |
-|australiaeast| - | - | Global standard | Global standard | Global standard | Global standard | - | - | - | - | - |
-|japaneast| - | - | Global standard | Global standard | Global standard | Global standard | - | - | - | Regional | Regional |
-|eastus| - | - | Data zone standard | Data zone standard | Data zone standard | Data zone standard | - | - | - | - | - |
-|uksouth| - | - | Global standard | Global standard | Global standard | Global standard | - | - | - | - | - |
-|westeurope| - | - | Data zone standard | Data zone standard | Data zone standard | Data zone standard | - | - | - | - | - |
-
-<sup>1</sup> The Azure AI Foundry resource must be in Central India. Azure AI Speech features remain in Central India. The voice live API uses Sweden Central as needed for generative AI load balancing.
+| centralindia | Cross-region<sup>1</sup> | Cross-region<sup>1</sup> |Cross-region<sup>1</sup> |Global standard | Global standard | Global standard | Global standard| -| - | - | - | - | - |
+| eastus2 | Global standard | Global standard |Global standard |Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Global standard | Regional | Regional |
+| swedencentral | Global standard | Global standard |Global standard |Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Global standard | Regional | Regional |
+| westus2 | Cross-region<sup>2</sup> | Cross-region<sup>2</sup> |Cross-region<sup>2</sup> |Data zone standard | Data zone standard | Data zone standard | Data zone standard| -| - | - | - | Regional | Regional |
+|australiaeast| - | - |- |Global standard | Global standard | Global standard | Global standard| -| - | - | - | - | - |
+|japaneast| - | - |- |Global standard | Global standard | Global standard | Global standard| -| - | - | - | Regional | Regional |
+|eastus| - | - |- |Data zone standard | Data zone standard | Data zone standard | Data zone standard| -| - | - | - | - | - |
+|uksouth| - | - |- |Global standard | Global standard | Global standard | Global standard| -| - | - | - | - | - |
+|westeurope| - | - |- |Data zone standard | Data zone standard | Data zone standard | Data zone standard| -| - | - | - | - | - |
+
+<sup>1</sup> The Azure AI Foundry resource must be in Central India. Azure AI Speech features remain in Central India. The voice live API uses Sweden Central as needed for generative AI load balancing.
 
 <sup>2</sup> The Azure AI Foundry resource must be in West US 2. Azure AI Speech features remain in West US 2. The voice live API uses East US 2 as needed for generative AI load balancing.
 
@@ -267,7 +267,7 @@ The regions in these tables support most of the core features of the Speech serv
articles/ai-services/speech-service/voice-live-how-to.md (+1 -1: 1 addition & 1 deletion)
@@ -33,7 +33,7 @@ An [Azure AI Foundry resource](../multi-service-resource.md) is required to acce
 The WebSocket endpoint for the voice live API is `wss://<your-ai-foundry-resource-name>.services.ai.azure.com/voice-live/realtime?api-version=2025-05-01-preview` or, for older resources, `wss://<your-ai-foundry-resource-name>.cognitiveservices.azure.com/voice-live/realtime?api-version=2025-05-01-preview`.
 The endpoint is the same for all models. The only difference is the required `model` query parameter, or, when using the Agent service, the `agent_id` and `project_id` parameters.
 
-For example, an endpoint for a resource with a custom domain would be `wss://<your-ai-foundry-resource-name>.services.ai.azure.com/voice-live/realtime?api-version=2025-05-01-preview&model=gpt-4o-mini-realtime`
+For example, an endpoint for a resource with a custom domain would be `wss://<your-ai-foundry-resource-name>.services.ai.azure.com/voice-live/realtime?api-version=2025-05-01-preview&model=gpt-realtime`
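Review note: the endpoint rules in this hunk can be sketched as a small URL builder. This is a minimal illustration, not part of the changed article. The function name `build_voice_live_url`, the `legacy_domain` flag, and the choice to treat `model` and the agent parameters as mutually exclusive are assumptions made for the example; the host names, path, API version, and query parameter names are taken from the text above.

```python
from urllib.parse import urlencode

# API version quoted in the how-to text above.
API_VERSION = "2025-05-01-preview"


def build_voice_live_url(resource_name, *, model=None, agent_id=None,
                         project_id=None, legacy_domain=False):
    """Build a voice live WebSocket URL (hypothetical helper for illustration).

    Pass either `model`, or both `agent_id` and `project_id` when using the
    Agent service. Treating the two modes as mutually exclusive is an
    assumption based on the wording "model ... or ... agent_id and project_id".
    """
    using_agent = agent_id is not None or project_id is not None
    if model is not None and using_agent:
        raise ValueError("pass either model, or agent_id and project_id")
    if model is None and not (agent_id and project_id):
        raise ValueError("model, or both agent_id and project_id, is required")

    # Older resources use the cognitiveservices.azure.com domain.
    domain = "cognitiveservices.azure.com" if legacy_domain else "services.ai.azure.com"

    query = {"api-version": API_VERSION}
    if model is not None:
        query["model"] = model
    else:
        query["agent_id"] = agent_id
        query["project_id"] = project_id

    return f"wss://{resource_name}.{domain}/voice-live/realtime?{urlencode(query)}"


if __name__ == "__main__":
    print(build_voice_live_url("my-resource", model="gpt-realtime"))
```

Building the query string with `urlencode` keeps parameter values escaped correctly if an agent or project identifier contains reserved characters.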
articles/ai-services/speech-service/voice-live-language-support.md (+4 -4: 4 additions & 4 deletions)
@@ -22,7 +22,7 @@ The voice live API supports multiple languages and configuration options. In thi
 
 ## [Speech input](#tab/speechinput)
 
-Depending on which model is being used voice live speech input is processed either by one of the multimodal models (for example, `gpt-realtime`, `gpt-4o-mini-realtime`, and `phi4-mm-realtime`) or by `azure speech to text` models.
+Depending on which model is being used voice live speech input is processed either by one of the multimodal models (for example, `gpt-realtime`, `gpt-4o-realtime-preview`, `gpt-4o-mini-realtime-preview`, and `phi4-mm-realtime`) or by `azure speech to text` models.
 
 ### Azure speech to text supported languages
 
@@ -78,11 +78,11 @@ To configure a single or multiple languages not supported by the multimodal mode
}
```
 
-### gpt-realtime and gpt-4o-mini-realtime supported languages
+### gpt-realtime, gpt-4o-realtime-preview and gpt-4o-mini-realtime-preview supported languages
 
 While the underlying model was trained on 98 languages, OpenAI only lists the languages that exceeded <50% word error rate (WER) which is an industry standard benchmark for speech to text model accuracy. The model returns results for languages not listed but the quality will be low.
 
-The following languages are supported by `gpt-realtime` and `gpt-4o-mini-realtime`:
+The following languages are supported by `gpt-realtime`, `gpt-4o-realtime-preview` and `gpt-4o-mini-realtime-preview`:
 - Afrikaans
 - Arabic
 - Armenian
@@ -175,7 +175,7 @@ Multimodal models don't require a language configuration for the general process
 
 ## [Speech output](#tab/speechoutput)
 
-Depending on which model is being used voice live speech output is processed either by one of the multimodal OpenAI voices integrated into `gpt-realtime` and `gpt-4o-mini-realtime` or by `azure text to speech` voices.
+Depending on which model is being used voice live speech output is processed either by one of the multimodal OpenAI voices integrated into `gpt-realtime`, `gpt-4o-realtime-preview`, and `gpt-4o-mini-realtime-preview` or by `azure text to speech` voices.
articles/ai-services/speech-service/voice-live.md (+16 -14: 16 additions & 14 deletions)
@@ -59,7 +59,7 @@ The voice live API is fully managed, eliminating the need for customers to handl
 
 The voice live API is designed for compatibility with the Azure OpenAI Realtime API. The supported real-time events are mostly in parity with the [Azure OpenAI Realtime API events](/azure/ai-foundry/openai/realtime-audio-reference?context=/azure/ai-services/speech-service/context/context), with some exceptions as described in the [voice live API how to guide](./voice-live-how-to.md).
 
-Features that are unique to the voice live API are designed to be optional and additive. You can add Azure AI Speech capabilities such as noise suppression, echo cancellation, and advanced end-of-turn detection to your existing applications without needing to change your existing architecture.
+Features that are unique to the voice live API are designed to be optional and additive. You can add Azure AI Speech capabilities such as noise suppression, echo cancellation, and advanced end-of-turn detection to your existing applications without needing to change your existing architecture.
 
 The API is supported through WebSocket events, allowing for an easy server-to-server integration. Your backend or middle-tier service connects to the voice live API via WebSockets. You can use the WebSocket messages directly to interact with the API.
 
@@ -74,14 +74,16 @@ The voice live API supports the following models. For supported regions, see the
 | Model | Description |
 | ------------------------------ | ----------- |
 |`gpt-realtime`| GPT real-time + option to use Azure text to speech voices including custom voice for audio. |
-|`gpt-4o-mini-realtime`| GPT-4o mini real-time + option to use Azure text to speech voices including custom voice for audio. |
+|`gpt-4o-realtime-preview`| GPT-4o real-time preview + option to use Azure text to speech voices including custom voice for audio. |
+|`gpt-4o-mini-realtime-preview`| GPT-4o mini real-time preview + option to use Azure text to speech voices including custom voice for audio. |
 |`gpt-4o`| GPT-4o + audio input through Azure speech to text + audio output through Azure text to speech voices including custom voice. |
 |`gpt-4o-mini`| GPT-4o mini + audio input through Azure speech to text + audio output through Azure text to speech voices including custom voice. |
 |`gpt-4.1`| GPT-4.1 + audio input through Azure speech to text + audio output through Azure text to speech voices including custom voice. |
 |`gpt-4.1-mini`| GPT-4.1 mini + audio input through Azure speech to text + audio output through Azure text to speech voices including custom voice. |
 |`gpt-5`| GPT-5 + audio input through Azure speech to text + audio output through Azure text to speech voices including custom voice. |
 |`gpt-5-mini`| GPT-5 mini + audio input through Azure speech to text + audio output through Azure text to speech voices including custom voice. |
 |`gpt-5-nano`| GPT-5 nano + audio input through Azure speech to text + audio output through Azure text to speech voices including custom voice. |
+|`gpt-5-chat`| GPT-5 chat + audio input through Azure speech to text + audio output through Azure text to speech voices including custom voice. |
 |`phi4-mm-realtime`| Phi4-mm + audio output through Azure text to speech voices including custom voice. |
 |`phi4-mini`| Phi4-mm + audio input through Azure speech to text + audio output through Azure text to speech voices including custom voice. |
 
@@ -103,35 +105,35 @@ To meet your requirements, you can either build your own solution or use the voi
 
 ## Pricing
 
-Pricing for the voice live API is in effect from July 1, 2025.
+Pricing for the voice live API is in effect from July 1, 2025.
 
 Pricing for the voice live API is tiered (**Pro**, **Basic**, and **Lite**) based on the generative AI model used.
 
 You don't select a tier. You choose a generative AI model and the corresponding pricing applies.
 
 | Pricing category | Models |
 | ----- | ------ |
-| Voice live pro |`gpt-realtime`, `gpt-4o`, `gpt-4.1`, `gpt-5`|
-| Voice live basic |`gpt-4o-mini-realtime`, `gpt-4o-mini`, `gpt-4.1-mini`, `gpt-5-mini`|
+| Voice live pro |`gpt-realtime`, `gpt-4o-realtime`, `gpt-4o`, `gpt-4.1`, `gpt-5`, `gpt-5-chat`|
+| Voice live basic |`gpt-4o-mini-realtime-preview`, `gpt-4o-mini`, `gpt-4.1-mini`, `gpt-5-mini`|
 | Voice live lite |`gpt-5-nano`, `phi4-mm-realtime`, `phi4-mini`|
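Review note: because the tier follows from the model rather than from a separate setting, the updated pricing table amounts to a lookup from model name to tier. The sketch below is only an illustration of that rule. The names `PRICING_TIERS`, `MODEL_TO_TIER`, and `pricing_tier` are hypothetical, and mapping `gpt-4o-realtime-preview` (the spelling used in the models table) to the same tier as the `gpt-4o-realtime` entry in the pricing table is an assumption that both names refer to one model.

```python
# Tier membership as listed on the added (+) rows of the pricing table.
PRICING_TIERS = {
    "pro": ["gpt-realtime", "gpt-4o-realtime", "gpt-4o", "gpt-4.1", "gpt-5", "gpt-5-chat"],
    "basic": ["gpt-4o-mini-realtime-preview", "gpt-4o-mini", "gpt-4.1-mini", "gpt-5-mini"],
    "lite": ["gpt-5-nano", "phi4-mm-realtime", "phi4-mini"],
}

# Invert the table so a tier can be looked up by model name.
MODEL_TO_TIER = {
    model: tier for tier, models in PRICING_TIERS.items() for model in models
}

# Assumption: the models table spells the pro-tier realtime model with a
# "-preview" suffix, so that spelling is accepted as an alias here.
MODEL_TO_TIER["gpt-4o-realtime-preview"] = "pro"


def pricing_tier(model: str) -> str:
    """Return the voice live pricing tier ("pro", "basic", or "lite") for a model."""
    try:
        return MODEL_TO_TIER[model]
    except KeyError:
        raise ValueError(f"no voice live pricing tier listed for model {model!r}") from None


if __name__ == "__main__":
    for name in ("gpt-4.1", "gpt-4o-mini-realtime-preview", "phi4-mm-realtime"):
        print(name, "->", pricing_tier(name))
```

Under this mapping, the four example scenarios later in the article resolve to pro (GPT-4.1), pro (`gpt-realtime`), basic (`gpt-4o-mini-realtime-preview`), and lite (`phi4-mm-realtime`), which matches the rates they state.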
 
 If you choose to use custom voice for your speech output, you're charged separately for custom voice model training and hosting. Refer to the [Text to Speech – Custom Voice – Professional](https://azure.microsoft.com/pricing/details/cognitive-services/speech-services) pricing for details. Custom voice is a limited access feature. [Learn more about how to create custom voices.](https://aka.ms/CNVPro)
 
-Avatars are charged separately with [the interactive avatar pricing published here.](https://azure.microsoft.com/pricing/details/cognitive-services/speech-services)
+Avatars are charged separately with [the interactive avatar pricing published here.](https://azure.microsoft.com/pricing/details/cognitive-services/speech-services)
 
-For more details regarding custom voice and avatar training charges, [refer to this pricing note.](/azure/ai-services/speech-service/text-to-speech#model-training-and-hosting-time-for-custom-voice)
+For more details regarding custom voice and avatar training charges, [refer to this pricing note.](/azure/ai-services/speech-service/text-to-speech#model-training-and-hosting-time-for-custom-voice)
 
 ### Example pricing scenarios
 
 Here are some example pricing scenarios to help you understand how the voice live API is charged:
 
 #### Scenario 1
 
-A customer service agent built with standard Azure AI Speech input, GPT-4.1, custom Azure AI Speech output, and a custom avatar.
+A customer service agent built with standard Azure AI Speech input, GPT-4.1, custom Azure AI Speech output, and a custom avatar.
 
 You're charged at the voice live pro rate for:
 - Text
-- Audio with Azure AI Speech - Standard
+- Audio with Azure AI Speech - Standard
 - Audio with Azure AI Speech - Custom
 
 You're charged separately for the training and model hosting of:
@@ -140,7 +142,7 @@ You're charged separately for the training and model hosting of:
 
 #### Scenario 2
 
-A learning agent built with `gpt-realtime` native audio input and standard Azure AI Speech output.
+A learning agent built with `gpt-realtime` native audio input and standard Azure AI Speech output.
 
 You're charged at the voice live pro rate for:
 - Text
@@ -149,19 +151,19 @@ You're charged at the voice live pro rate for:
 
 #### Scenario 3
 
-A talent interview agent built with `gpt-4o-mini-realtime` native audio input, and standard Azure AI Speech output and standard avatar.
+A talent interview agent built with `gpt-4o-mini-realtime-preview` native audio input, and standard Azure AI Speech output and standard avatar.
 
 You're charged at the voice live basic rate for:
 - Text
-- Native audio with `gpt-4o-mini-realtime`
+- Native audio with `gpt-4o-mini-realtime-preview`
 - Audio with Azure AI Speech - Standard
 
 You're charged separately for:
 - Text to speech avatar (standard)
 
 #### Scenario 4
 
-An in-car assistant built with `phi4-mm-realtime` and Azure custom voice.
+An in-car assistant built with `phi4-mm-realtime` and Azure custom voice.
 
 You're charged at the voice live lite rate for:
 - Text
@@ -184,7 +186,7 @@ You can estimate token usage for different model families with the voice live AP
 | Azure OpenAI models |~10 tokens |~20 tokens |
 | Phi models |~12.5 tokens |~20 tokens |
 
-You're also charged for cached audio and text inputs, including the prompt and the context of the conversations.
+You're also charged for cached audio and text inputs, including the prompt and the context of the conversations.