articles/ai-services/speech-service/regions.md (6 additions, 5 deletions)
@@ -6,7 +6,7 @@ author: goergenj
 manager: nitinme
 ms.service: azure-ai-speech
 ms.topic: conceptual
-ms.date: 9/26/2025
+ms.date: 9/29/2025
 ms.author: jagoerge
 ms.custom: references_regions
 #Customer intent: As a developer, I want to learn about the available regions and endpoints for the Speech service.
@@ -100,7 +100,7 @@ The regions in these tables support most of the core features of the Speech serv
 | westus2 | ✅ | ✅ | ✅ |
 | westus3 | ✅ | ✅ ||
 
-<sup>1</sup> The region has dedicated hardware for custom speech training. If you plan to train a custom model with audio data, you must use one of the regions with dedicated hardware. Then you can [copy the trained model](how-to-custom-speech-train-model.md#copy-a-model) to another region.
+<sup>1</sup> The region uses dedicated hardware for custom speech training. If you plan to train a custom model with audio data, you must use one of the regions with dedicated hardware. Then you can [copy the trained model](how-to-custom-speech-train-model.md#copy-a-model) to another region.
 
 # [Text to speech](#tab/tts)
 
@@ -180,16 +180,17 @@ The regions in these tables support most of the core features of the Speech serv
 | eastus2 | Global standard | Global standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Global standard | Regional | Regional |
 | southeastasia | - | - | - | - | Global standard | Global standard | - | - | - | - | Regional | Regional |
 | swedencentral | Global standard | Global standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Data zone standard | Global standard | Regional | Regional |
-| westus2| Cross-region<sup>2</sup> | Cross-region<sup>2</sup> | Data zone standard | Data zone standard | Data zone standard | Data zone standard | - | - | - | - | Regional | Regional |
+| westus2<sup>3</sup>| Cross-region<sup>2</sup> | Cross-region<sup>2</sup> | Data zone standard | Data zone standard | Data zone standard | Data zone standard | - | - | - | - | Regional | Regional |
 |australiaeast| - | - | Global standard | Global standard | Global standard | Global standard | - | - | - | - | - | - |
 |japaneast| - | - | Global standard | Global standard | Global standard | Global standard | - | - | - | - | Regional | Regional |
 |eastus| - | - | Data zone standard | Data zone standard | Data zone standard | Data zone standard | - | - | - | - | - | - |
 |uksouth| - | - | Global standard | Global standard | Global standard | Global standard | - | - | - | - | - | - |
-|westeurope| - | - | Data zone standard | Data zone standard | Data zone standard | Data zone standard | - | - | - | - | - | - |
 
 <sup>1</sup> The Azure AI Foundry resource must be in Central India. Azure AI Speech features remain in Central India. The Voice live API uses Sweden Central as needed for generative AI load balancing.
 
-<sup>2</sup> The Azure AI Foundry resource must be in West US 2. Azure AI Speech features remain in West US 2. The Voice live API uses East US 2 as needed for generative AI load balancing.
+<sup>2</sup> The resource must be in West US 2. Azure AI Speech features remain in West US 2. The Voice live API uses East US 2 as needed for generative AI load balancing.
+
+<sup>3</sup> Currently, West US 2 supports only Speech Services resources (not Azure AI Foundry resources). Use one of the other regions if you want to use an Azure AI Foundry resource and get the best integration with Azure AI Foundry Agent Service and bring-your-own-model (BYOM) support.
articles/ai-services/speech-service/voice-live-how-to.md (12 additions, 8 deletions)
@@ -7,7 +7,7 @@ author: goergenj
 ms.author: jagoerge
 ms.service: azure-ai-speech
 ms.topic: how-to
-ms.date: 9/26/2025
+ms.date: 9/29/2025
 ms.custom: references_regions
 # Customer intent: As a developer, I want to learn how to use the Voice live API for real-time voice agents.
 ---
@@ -24,7 +24,11 @@ For a table of supported models and regions, see the [Voice live API overview](.
 
 ## Authentication
 
-An [Azure AI Foundry resource](../multi-service-resource.md) is required to access the Voice live API.
+An [Azure AI Foundry resource](../multi-service-resource.md) or an [Azure AI Speech Services resource](https://portal.azure.com/#create/Microsoft.CognitiveServicesSpeechServices) is required to use the Voice live API.
+
+> [!NOTE]
+> The Voice live API is optimized for Azure AI Foundry resources. We recommend using an Azure AI Foundry resource for full feature availability and the best Azure AI Foundry integration experience.
+> **Azure AI Speech Services resources** don't support Azure AI Foundry Agent Service integration or bring-your-own-model (BYOM).
 
 ### WebSocket endpoint
 
@@ -80,7 +84,7 @@ Here's an example `session.update` message that configures several aspects of th
 ```
 
 > [!IMPORTANT]
-> The `"instructions"` property is not supported when you're using a custom agent.
+> The `"instructions"` property isn't supported when you're using a custom agent.
 
 The server responds with a [`session.updated`](../openai/realtime-audio-reference.md?context=/azure/ai-services/speech-service/context/context#realtimeservereventsessionupdated) event to confirm the session configuration.
 
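For orientation, here's a minimal sketch of how a client might assemble a `session.update` event. The envelope shape (`type` plus a `session` object) follows the realtime event pattern referenced above; the specific field values are illustrative assumptions, not documented defaults.

```python
import json

# Minimal sketch of a session.update event (illustrative values, not documented defaults).
session_update = {
    "type": "session.update",
    "session": {
        # Not supported when using a custom agent (see the IMPORTANT note above).
        "instructions": "You are a helpful voice assistant. Keep answers short.",
        # server_vad is the default turn detection type (see the turn detection table later in this article).
        "turn_detection": {"type": "server_vad"},
    },
}

# The client sends this JSON text over the WebSocket connection; the service
# confirms the configuration with a session.updated event.
print(json.dumps(session_update, indent=2))
```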
@@ -115,7 +119,7 @@ Here's an example of input audio properties is a session object:
 
 Noise suppression enhances the input audio quality by suppressing or removing environmental background noise. Noise suppression helps the model understand the end-user with higher accuracy and improves accuracy of signals like interruption detection and end-of-turn detection.
 
-Server echo cancellation enhances the input audio quality by removing the echo from the model's own voice. In this way, client-side echo cancellation isn't required. Server echo cancellation is useful when the model's voice is played back to the end-user through a speaker and the microphone picks up the model's own voice.
+Server echo cancellation enhances the input audio quality by removing the echo from the model's own voice. In this way, client-side echo cancellation isn't required. Server echo cancellation is useful when the model's voice is played back to the end-user through a speaker. This helps avoid the microphone picking up the model's own voice.
 
 > [!NOTE]
 > The service assumes the client plays response audio as soon as it receives them. If playback is delayed for more than two seconds, echo cancellation quality is impacted.
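As a rough sketch, noise suppression and server echo cancellation are enabled through input audio properties in the session object. The property names and `type` values below are assumptions for illustration only; check the article's input audio properties example for the exact fields.

```python
import json

# Sketch only: the property names and type values here are assumptions,
# not confirmed by this article's input audio properties example.
session_update = {
    "type": "session.update",
    "session": {
        # Suppresses environmental background noise in the input audio (assumed property name).
        "input_audio_noise_reduction": {"type": "azure_deep_noise_suppression"},
        # Removes the echo of the model's own voice picked up by the microphone (assumed property name).
        "input_audio_echo_cancellation": {"type": "server_echo_cancellation"},
    },
}

print(json.dumps(session_update, indent=2))
```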
@@ -131,13 +135,13 @@ Turn detection is the process of detecting when the end-user started or stopped
 
 | Property | Type | Required or optional | Description |
 |----------|----------|----------|------------|
-|`type`| string | Optional | The type of turn detection system to use. Type `server_vad` detects start and end of speech based on audio volume.<br/><br/>Type `azure_semantic_vad` detects start and end of speech based on semantic meaning. It primarily supports English. Type `azure_semantic_vad_multilingual` is also available to support a wider variety of languages: English, Spanish, French, Italian, German (DE), Japanese, Portuguese, Chinese, Korean, Hindi. Azure semantic voice activity detection (VAD) can improve turn detection by removing filler words to reduce the false alarm rate. The `remove_filler_words` property must be set to `true` (it is`false` by default). The detected filler words in English are `['ah', 'umm', 'mm', 'uh', 'huh', 'oh', 'yeah', 'hmm']`. The service ignores these words when there's an ongoing response. Remove filler words feature assumes the client plays response audio as soon as it receives them.<br/><br/>The default value is `server_vad`. |
+|`type`| string | Optional | The type of turn detection system to use. Type `server_vad` detects start and end of speech based on audio volume.<br/><br/>Type `azure_semantic_vad` detects start and end of speech based on semantic meaning. It primarily supports English. Type `azure_semantic_vad_multilingual` is also available to support a wider variety of languages: English, Spanish, French, Italian, German (DE), Japanese, Portuguese, Chinese, Korean, Hindi. Azure semantic voice activity detection (VAD) can improve turn detection by removing filler words to reduce the false alarm rate. The `remove_filler_words` property must be set to `true` (it's `false` by default). The detected filler words in English are `['ah', 'umm', 'mm', 'uh', 'huh', 'oh', 'yeah', 'hmm']`. The service ignores these words when there's an ongoing response. The remove filler words feature assumes the client plays response audio as soon as it receives it.<br/><br/>The default value is `server_vad`. |
 |`threshold`| number | Optional | A higher threshold requires a higher confidence signal of the user trying to speak. |
 |`prefix_padding_ms`| integer | Optional | The amount of audio, measured in milliseconds, to include before the start of speech detection signal. |
 |`speech_duration_ms`| integer | Optional | The duration of user's speech audio required to start detection. If not set or under 80 ms, the detector uses a default value of 80 ms. |
 |`silence_duration_ms`| integer | Optional | The duration of user's silence, measured in milliseconds, to detect the end of speech. |
 |`remove_filler_words`| boolean | Optional | Determines whether to remove filler words to reduce the false alarm rate. This property must be set to `true` when using `azure_semantic_vad`.<br/><br/>The default value is `false`. |
-| `end_of_utterance_detection` | object | Optional | Configuration for end of utterance detection. The Voice live API offers advanced end-of-turn detection to indicate when the end-user stopped speaking while allowing for natural pauses. End of utterance detection can significantly reduce premature end-of-turn signals without adding user-perceivable latency. End of utterance detection can be used with either VAD selection.<br/><br/>Properties of `end_of_utterance_detection` include:<br/>-`model`: The model to use for end of utterance detection. The supported values are:<br/> `semantic_detection_v1` supporting English.<br/> `semantic_detection_v1_multilingual` supporting English, Spanish, French, Italian, German (DE), Japanese, Portuguese, Chinese, Korean, Hindi.<br/>Other languages will be bypassed.<br/>- `threshold`: Threshold to determine the end of utterance (0.0 to 1.0). The default value is 0.01.<br/>- `timeout`: Timeout in seconds. The default value is 2 seconds. <br/><br/>End of utterance detection currently doesn't support gpt-realtime, gpt-4o-mini-realtime, and phi4-mm-realtime.|
+| `end_of_utterance_detection` | object | Optional | Configuration for end of utterance detection. The Voice live API offers advanced end-of-turn detection to indicate when the end-user stopped speaking while allowing for natural pauses. End of utterance detection can significantly reduce premature end-of-turn signals without adding user-perceivable latency. End of utterance detection can be used with either VAD selection.<br/><br/>Properties of `end_of_utterance_detection` include:<br/>- `model`: The model to use for end of utterance detection. The supported values are:<br/> `semantic_detection_v1` supporting English.<br/> `semantic_detection_v1_multilingual` supporting English, Spanish, French, Italian, German (DE), Japanese, Portuguese, Chinese, Korean, Hindi.<br/>Other languages are bypassed.<br/>- `threshold`: Threshold to determine the end of utterance (0.0 to 1.0). The default value is 0.01.<br/>- `timeout`: Timeout in seconds. The default value is 2 seconds. <br/><br/>End of utterance detection currently doesn't support gpt-realtime, gpt-4o-mini-realtime, and phi4-mm-realtime.|
 
 Here's an example of end of utterance detection in a session object:
 
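As an illustrative sketch only, using the property names from the table above (the threshold, padding, and silence values are assumptions, not recommended settings), a combined turn detection and end of utterance detection configuration might be assembled like this:

```python
import json

# Sketch of a turn detection configuration; numeric values are illustrative assumptions.
turn_detection = {
    "type": "azure_semantic_vad",
    "threshold": 0.5,                  # higher values require a more confident speech signal
    "prefix_padding_ms": 300,          # audio to include before the detected start of speech
    "silence_duration_ms": 500,        # silence required to detect the end of speech
    "remove_filler_words": True,       # must be true when using azure_semantic_vad
    "end_of_utterance_detection": {
        "model": "semantic_detection_v1",  # English; semantic_detection_v1_multilingual covers more languages
        "threshold": 0.01,                 # default value per the table above
        "timeout": 2,                      # seconds; default value per the table above
    },
}

session_update = {"type": "session.update", "session": {"turn_detection": turn_detection}}
print(json.dumps(session_update, indent=2))
```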
@@ -164,9 +168,9 @@ Here's an example of end of utterance detection in a session object:
 
 ## Audio input through Azure speech to text
 
-Azure speech to text will automatically be active when you are using a non-multimodal model like gpt-4o.
+Azure speech to text is automatically active when you're using a non-multimodal model like gpt-4o.
 
-In order to explicitly configure it you can set the `model` to `azure-speech` in `input_audio_transcription`. This can be useful to improve the recognition quality for specific language situations. See [How to customize voice live input and output](./voice-live-how-to-customize.md) learn more about speech input customization configuration.
+To configure it explicitly, set the `model` to `azure-speech` in `input_audio_transcription`. This can be useful to improve recognition quality for specific language situations. See [How to customize voice live input and output](./voice-live-how-to-customize.md) to learn more about speech input customization configuration.
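A minimal sketch of that explicit configuration, assuming `input_audio_transcription` sits alongside the other session fields shown earlier:

```python
import json

# Explicitly select Azure speech to text for input audio transcription.
session_update = {
    "type": "session.update",
    "session": {
        "input_audio_transcription": {"model": "azure-speech"},
    },
}

print(json.dumps(session_update, indent=2))
```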