articles/ai-services/speech-service/voice-live-how-to.md
Lines changed: 10 additions & 19 deletions
@@ -18,7 +18,7 @@ ms.custom: references_regions
The voice live API provides a capable WebSocket interface compared to the [Azure OpenAI Realtime API](../../ai-foundry/openai/how-to/realtime-audio.md).
-Unless otherwise noted, the voice live API uses the same events as the [Azure OpenAI Realtime API](/azure/ai-foundry/openai/realtime-audio-reference?context=/azure/ai-services/speech-service/context/context). This document provides a reference for the event message properties that are specific to the voice live API.
+Unless otherwise noted, the voice live API uses the [same events](/azure/ai-foundry/openai/realtime-audio-reference?context=/azure/ai-services/speech-service/context/context) as the Azure OpenAI Realtime API. This document provides a reference for the event message properties that are specific to the voice live API.
## Supported models and regions
@@ -30,8 +30,8 @@ An [Azure AI Foundry resource](../multi-service-resource.md) is required to acce
### WebSocket endpoint
-The WebSocket endpoint for the voice live API is `wss://<your-ai-foundry-resource-name>.cognitiveservices.azure.com/voice-live/realtime?api-version=2025-05-01-preview`.
-The endpoint is the same for all models. The only difference is the required `model` query parameter.
+The WebSocket endpoint for the voice live API is `wss://<your-ai-foundry-resource-name>.services.ai.azure.com/voice-live/realtime?api-version=2025-05-01-preview` or, for older resources, `wss://<your-ai-foundry-resource-name>.cognitiveservices.azure.com/voice-live/realtime?api-version=2025-05-01-preview`.
+The endpoint is the same for all models. The only difference is the required `model` query parameter or, when using the Agent service, the `agent_id` and `project_id` parameters.
For example, an endpoint for a resource with a custom domain would be `wss://<your-ai-foundry-resource-name>.cognitiveservices.azure.com/voice-live/realtime?api-version=2025-05-01-preview&model=gpt-4o-mini-realtime-preview`
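As a hedged sketch of the endpoint guidance above (not part of the article's diff), the following Python builds the connection URL for either a model deployment or an agent. The resource name, agent ID, and project ID are placeholders.

```python
# Illustrative sketch: building the voice live WebSocket URL for a model
# deployment or for an agent. Resource name and IDs are placeholders.
from urllib.parse import urlencode

resource = "<your-ai-foundry-resource-name>"
base_url = f"wss://{resource}.services.ai.azure.com/voice-live/realtime"
api_version = "2025-05-01-preview"

# Connecting to a model deployment requires the `model` query parameter.
model_url = f"{base_url}?{urlencode({'api-version': api_version, 'model': 'gpt-4o-mini-realtime-preview'})}"

# Connecting to an agent uses `agent_id` and `project_id` instead (placeholder values).
agent_url = f"{base_url}?{urlencode({'api-version': api_version, 'agent_id': '<agent-id>', 'project_id': '<project-id>'})}"

print(model_url)
```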
@@ -47,7 +47,7 @@ The voice live API supports two authentication methods:
For the recommended keyless authentication with Microsoft Entra ID, you need to:
- Assign the `Cognitive Services User` role to your user account or a managed identity. You can assign roles in the Azure portal under **Access control (IAM)** > **Add role assignment**.
-- Generate a token using the Azure CLI or Azure SDKs. The token must be generated with the `https://cognitiveservices.azure.com/.default` scope.
+- Generate a token using the Azure CLI or Azure SDKs. The token must be generated with the `https://ai.azure.com/.default` scope, or the legacy `https://cognitiveservices.azure.com/.default` scope.
- Use the token in the `Authorization` header of the WebSocket connection request, with the format `Bearer <token>`.
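The keyless steps above might look like the following sketch. It assumes the `azure-identity` and `websocket-client` packages; the article doesn't prescribe a particular client library, so the package choice is illustrative.

```python
# Illustrative sketch of keyless (Microsoft Entra ID) authentication.
import websocket  # pip install websocket-client
from azure.identity import DefaultAzureCredential  # pip install azure-identity

url = (
    "wss://<your-ai-foundry-resource-name>.services.ai.azure.com/voice-live/realtime"
    "?api-version=2025-05-01-preview&model=gpt-4o-mini-realtime-preview"
)

# Generate a token with the voice live scope (or the legacy
# https://cognitiveservices.azure.com/.default scope for older resources).
credential = DefaultAzureCredential()
access_token = credential.get_token("https://ai.azure.com/.default").token

# Use the token in the Authorization header of the WebSocket connection request.
ws = websocket.create_connection(
    url,
    header=[f"Authorization: Bearer {access_token}"],
)
```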
## Session configuration
@@ -81,6 +81,9 @@ Here's an example `session.update` message that configures several aspects of th
}
```
+> [!IMPORTANT]
+> The `"instructions"` property is not supported when you're using a custom agent.
+
The server responds with a [`session.updated`](../openai/realtime-audio-reference.md?context=/azure/ai-services/speech-service/context/context#realtimeservereventsessionupdated) event to confirm the session configuration.
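As a hedged sketch of this exchange, the following sends a `session.update` event over the WebSocket connection `ws` from the earlier sketch and waits for the `session.updated` confirmation. The property values are placeholders.

```python
# Illustrative sketch: configure the session and wait for confirmation.
import json

session_update = {
    "type": "session.update",
    "session": {
        # Omit "instructions" when using a custom agent (see the note above).
        "instructions": "You are a helpful assistant.",
    },
}
ws.send(json.dumps(session_update))

# The server confirms the configuration with a session.updated event.
event = json.loads(ws.recv())
if event.get("type") == "session.updated":
    print("Session configuration confirmed.")
```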
## Session Properties
@@ -117,20 +120,8 @@ Noise suppression enhances the input audio quality by suppressing or removing en
Server echo cancellation enhances the input audio quality by removing the echo from the model's own voice. In this way, client-side echo cancellation isn't required. Server echo cancellation is useful when the model's voice is played back to the end-user through a speaker and the microphone picks up the model's own voice.
> [!NOTE]
-> The service assumes the client plays response audio as soon as it receives them. If playback is delayed for more than 3 seconds, echo cancellation quality is impacted.
+> The service assumes the client plays response audio as soon as it's received. If playback is delayed for more than two seconds, echo cancellation quality is impacted.
-```json
-{
-    "session": {
-        "input_audio_noise_reduction": {
-            "type": "azure_deep_noise_suppression"
-        },
-        "input_audio_echo_cancellation": {
-            "type": "server_echo_cancellation"
-        }
-    }
-}
-```
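Although the snippet above is removed by this change, the noise suppression and echo cancellation settings it showed are still valid session properties. As a hedged illustration, the same configuration could be sent as a single `session.update` event over the open connection `ws` from the earlier sketches:

```python
# Illustrative sketch, based on the removed snippet: enable deep noise
# suppression and server echo cancellation together.
import json

audio_enhancements = {
    "type": "session.update",
    "session": {
        "input_audio_noise_reduction": {"type": "azure_deep_noise_suppression"},
        "input_audio_echo_cancellation": {"type": "server_echo_cancellation"},
    },
}
ws.send(json.dumps(audio_enhancements))
```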
## Conversational enhancements
@@ -142,12 +133,12 @@ Turn detection is the process of detecting when the end-user started or stopped
| Property | Type | Required or optional | Description |
|----------|----------|----------|------------|
-|`type`| string | Optional | The type of turn detection system to use. Type `server_vad` detects start and end of speech based on audio volume.<br/><br/>Type `azure_semantic_vad` detects start and end of speech based on semantic meaning. Azure semantic voice activity detection (VAD) improves turn detection by removing filler words to reduce the false alarm rate. The `remove_filler_words` property must be set to `true`. The current list of filler words are `['ah', 'umm', 'mm', 'uh', 'huh', 'oh', 'yeah', 'hmm']`. The service ignores these words when there's an ongoing response. Remove feature words feature assumes the client plays response audio as soon as it receives them.<br/><br/>The default value is `server_vad`. |
+|`type`| string | Optional | The type of turn detection system to use. Type `server_vad` detects start and end of speech based on audio volume.<br/><br/>Type `azure_semantic_vad` detects start and end of speech based on semantic meaning. Type `azure_semantic_vad_multilingual` is also available to support a wider variety of languages: English, Spanish, French, Italian, German, Japanese, Portuguese, Chinese, Korean, and Hindi. Azure semantic voice activity detection (VAD) can improve turn detection by removing filler words to reduce the false alarm rate. The `remove_filler_words` property must be set to `true` (it is `false` by default). The detected filler words in English are `['ah', 'umm', 'mm', 'uh', 'huh', 'oh', 'yeah', 'hmm']`. The service ignores these words when there's an ongoing response. The filler word removal feature assumes the client plays response audio as soon as it's received.<br/><br/>The default value is `server_vad`. |
|`threshold`| number | Optional | A higher threshold requires a higher confidence signal of the user trying to speak. |
|`prefix_padding_ms`| integer | Optional | The amount of audio, measured in milliseconds, to include before the start of speech detection signal. |
|`silence_duration_ms`| integer | Optional | The duration of user's silence, measured in milliseconds, to detect the end of speech. |
|`remove_filler_words`| boolean | Optional | Determines whether to remove filler words to reduce the false alarm rate. This property must be set to `true` when using `azure_semantic_vad`.<br/><br/>The default value is `false`. |
-|`end_of_utterance_detection`| object | Optional | Configuration for end of utterance detection. The voice live API offers advanced end-of-turn detection to indicate when the end-user stopped speaking while allowing for natural pauses. End of utterance detection can significantly reduce premature end-of-turn signals without adding user-perceivable latency. End of utterance detection is only available when using `azure_semantic_vad`.<br/><br/>Properties of `end_of_utterance_detection` include:<br/>-`model`: The model to use for end of utterance detection. The supported value is `semantic_detection_v1`.<br/>- `threshold`: Threshold to determine the end of utterance (0.0 to 1.0). The default value is 0.01.<br/>- `timeout`: Timeout in seconds. The default value is 2 seconds.|
+|`end_of_utterance_detection`| object | Optional | Configuration for end of utterance detection. The voice live API offers advanced end-of-turn detection to indicate when the end-user stopped speaking while allowing for natural pauses. End of utterance detection can significantly reduce premature end-of-turn signals without adding user-perceivable latency. End of utterance detection can be used with either VAD selection.<br/><br/>Properties of `end_of_utterance_detection` include:<br/>- `model`: The model to use for end of utterance detection. The supported value is `semantic_detection_v1`.<br/>- `threshold`: Threshold to determine the end of utterance (0.0 to 1.0). The default value is 0.01.<br/>- `timeout`: Timeout in seconds. The default value is 2 seconds.|
Here's an example of end of utterance detection in a session object:
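The article's own example isn't included in this diff view. As a hedged, hypothetical illustration built only from the properties in the table above, a `session.update` payload enabling semantic VAD with end of utterance detection might look like the following; the threshold, padding, and silence values are placeholders, not recommendations.

```python
# Hypothetical illustration: turn detection with azure_semantic_vad and
# end of utterance detection, sent over an open WebSocket connection `ws`.
import json

turn_detection_update = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "azure_semantic_vad",
            "threshold": 0.3,              # placeholder value
            "prefix_padding_ms": 200,      # placeholder value
            "silence_duration_ms": 200,    # placeholder value
            "remove_filler_words": True,   # must be true with azure_semantic_vad
            "end_of_utterance_detection": {
                "model": "semantic_detection_v1",
                "threshold": 0.01,  # default per the table above
                "timeout": 2,       # seconds, default per the table above
            },
        },
    },
}
ws.send(json.dumps(turn_detection_update))
```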