Realtime events are used to communicate between the client and server in real-time audio applications. The events are sent as JSON objects over supported connection methods, such as WebSockets or WebRTC. The events are used to manage the conversation, audio buffers, and responses in real time.

You can use audio client and server events with these APIs:

- [Azure AI Voice Live API](/azure/ai-services/speech-service/voice-live)

Unless otherwise specified, the events described in this document are applicable to both APIs.

The Realtime API (via `/realtime`) is built on [the WebSockets API](https://developer.mozilla.org/docs/Web/API/WebSockets_API) to facilitate fully asynchronous streaming communication between the end user and model. Device details like capturing and rendering audio data are outside the scope of the Realtime API. It should be used in the context of a trusted, intermediate service that manages both connections to end users and model endpoint connections. Don't use it directly from untrusted end user devices.

> [!TIP]
> To get started with the Realtime API, see the [quickstart](realtime-audio-quickstart.md) and [how-to guide](./how-to/realtime-audio.md).
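
As an illustration of this event flow, here's a minimal Python sketch that opens a WebSocket connection, sends a `session.update` client event as JSON, and prints the type of each server event it receives. The endpoint URL is a placeholder and authentication is omitted; both depend on which API and service you connect to.

```python
# Minimal sketch of the client/server event flow. The URL is a placeholder and
# authentication is intentionally omitted; see the quickstart for real values.
import asyncio
import json

import websockets


async def main() -> None:
    url = "wss://<your-endpoint>/realtime?api-version=<api-version>"  # placeholder

    async with websockets.connect(url) as ws:
        # Client events are JSON objects with a "type" field.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "instructions": "You are a helpful AI assistant responding in natural, engaging language."
            },
        }))

        # Server events arrive on the same connection as JSON messages.
        async for message in ws:
            event = json.loads(message)
            print(event.get("type"))


asyncio.run(main())
```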

Changes to `articles/ai-services/speech-service/includes/quickstarts/voice-live-api/realtime-python.md` (4 additions and 3 deletions):

In the front matter (`author: eric-urban`, `ms.author: eur`, `ms.service: azure-ai-openai`, `ms.topic: include`), the `ms.date` value is updated from `5/19/2025` to `6/27/2025`.

In the quickstart's `session_update` message, an `instructions` property is added to the `session` object, and the high definition voice name is changed from `en-US-Aria:DragonHDLatestNeural` to `en-US-Ava:DragonHDLatestNeural`:

```python
session_update = {
    "type": "session.update",
    "session": {
        "instructions": "You are a helpful AI assistant responding in natural, engaging language.",
        "turn_detection": {
            "type": "azure_semantic_vad",
            "threshold": 0.3,
            # ...
        },
        # ... unchanged input audio settings, including "type": "server_echo_cancellation" ...
        "voice": {
            "name": "en-US-Ava:DragonHDLatestNeural",
            "type": "azure-standard",
            "temperature": 0.8,
        },
        # ...
    }
}
```
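
As a sketch of how such a message is typically sent over the connection (the function and connection object names here are assumptions, not taken from the quickstart):

```python
import json


async def send_session_update(connection, session_update: dict) -> None:
    # Serialize the session.update event to JSON and send it over the open
    # WebSocket connection. "connection" is an assumed name for that object.
    await connection.send(json.dumps(session_update))
```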

The output of the script is printed to the console. You see messages indicating the status of the connection, audio stream, and playback. The audio is played back through your speakers or headphones.

The remaining changes update the documentation for input audio, turn detection, and voice output properties.

You can use input audio properties to configure the input audio stream.

| Property | Type | Required or optional | Description |
|----------|----------|----------|------------|
|`input_audio_sampling_rate`| integer | Optional | The sampling rate of the input audio.<br/><br/>The supported values are `16000` and `24000`. The default value is `24000`. |
|`input_audio_echo_cancellation`| object | Optional | Enhances the input audio quality by removing the echo from the model's own voice without requiring any client-side echo cancellation.<br/><br/>Set the `type` property of `input_audio_echo_cancellation` to enable echo cancellation.<br/><br/>The supported value for `type` is `server_echo_cancellation`, which is used when the model's voice is played back to the end-user through a speaker, and the microphone picks up the model's own voice. |
|`input_audio_noise_reduction`| object | Optional | Enhances the input audio quality by suppressing or removing environmental background noise.<br/><br/>Set the `type` property of `input_audio_noise_reduction` to enable noise suppression.<br/><br/>The supported value for `type` is `azure_deep_noise_suppression`, which optimizes for speakers closest to the microphone. |

Here's an example of input audio properties in a session object:
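
As an illustrative sketch (in the style of the Python quickstart's `session_update` dictionary), a `session.update` payload that sets the three documented properties could look like this:

```python
# Illustrative sketch only: property names and values come from the table above,
# but the exact example in the article may differ.
session_update = {
    "type": "session.update",
    "session": {
        "input_audio_sampling_rate": 24000,
        "input_audio_echo_cancellation": {"type": "server_echo_cancellation"},
        "input_audio_noise_reduction": {"type": "azure_deep_noise_suppression"},
    },
}
```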

Noise suppression enhances the input audio quality by suppressing or removing environmental background noise. Noise suppression helps the model understand the end-user with higher accuracy and improves the accuracy of signals like interruption detection and end-of-turn detection.

Server echo cancellation enhances the input audio quality by removing the echo from the model's own voice. In this way, client-side echo cancellation isn't required. Server echo cancellation is useful when the model's voice is played back to the end-user through a speaker and the microphone picks up the model's own voice.

> [!NOTE]
> The service assumes the client plays response audio as soon as it receives it. If playback is delayed for more than 3 seconds, echo cancellation quality is impacted.

Turn detection is the process of detecting when the end-user started or stopped speaking.

| Property | Type | Required or optional | Description |
|----------|----------|----------|------------|
|`type`| string | Optional | The type of turn detection system to use. Type `server_vad` detects the start and end of speech based on audio volume.<br/><br/>Type `azure_semantic_vad` detects the start and end of speech based on semantic meaning. Azure semantic voice activity detection (VAD) improves turn detection by removing filler words to reduce the false alarm rate. The current list of filler words is `['ah', 'umm', 'mm', 'uh', 'huh', 'oh', 'yeah', 'hmm']`. The service ignores these words when there's an ongoing response. The remove filler words feature assumes that the client plays response audio as soon as it receives it.<br/><br/>The default value is `server_vad`. |
|`threshold`| number | Optional | A higher threshold requires a higher confidence signal of the user trying to speak. |
|`prefix_padding_ms`| integer | Optional | The amount of audio, measured in milliseconds, to include before the start of speech detection signal. |
|`silence_duration_ms`| integer | Optional | The duration of the user's silence, measured in milliseconds, to detect the end of speech. |

Here's an example of end of utterance detection in a session object:

```json
{
  "session": {
    "instructions": "You are a helpful AI assistant responding in natural, engaging language.",
    "turn_detection": {
      "type": "azure_semantic_vad",
      "threshold": 0.3,
      // ...
    }
  }
}
```

### Phrase list

Use a phrase list for lightweight, just-in-time customization of audio input. To configure the phrase list, set `phrase_list` in the `session.update` message.
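
Here's a minimal sketch, assuming `phrase_list` is set directly on the `session` object (the exact placement and the phrases themselves are illustrative, not taken from the article):

```python
# Illustrative sketch: the placement of "phrase_list" inside the session object is an
# assumption based on the description above, and the phrases are example values.
session_update = {
    "type": "session.update",
    "session": {
        "phrase_list": ["Contoso", "Rehaan", "Jessie"],
    },
}
```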

> [!NOTE]
> Phrase list isn't currently supported with the `gpt-4o-realtime-preview`, `gpt-4o-mini-realtime-preview`, and `phi4-mm-realtime` models. To learn more about phrase lists, see [phrase list for speech to text](./improve-accuracy-phrase-list.md).

### Custom lexicon

Use the `custom_lexicon_url` string property to customize pronunciation for both standard Azure text to speech voices and custom voices. To learn more about how to format the custom lexicon (the format is the same as for Speech Synthesis Markup Language (SSML)), see [custom lexicon for text to speech](./speech-synthesis-markup-pronunciation.md#custom-lexicon).

```json
{
  "voice": {
    "name": "en-US-Ava:DragonHDLatestNeural",
    "type": "azure-standard",
    "temperature": 0.8, // optional
    "custom_lexicon_url": "<custom lexicon url>"
  }
}
```

### Speaking rate

Use the `rate` string property to adjust the speaking speed for any standard Azure text to speech voice or custom voice.

The rate value should range from 0.5 to 1.5, with higher values indicating faster speeds.

```json
{
  "voice": {
    "name": "en-US-Ava:DragonHDLatestNeural",
    "type": "azure-standard",
    "temperature": 0.8, // optional
    "rate": "1.2"
  }
}
```

### Audio output through Azure text to speech

You can use the `voice` parameter to specify a standard or custom voice. The voice is used for audio output.

The `voice` object has the following properties:

| Property | Type | Required or optional | Description |
|----------|----------|----------|------------|
|`name`| string | Required | Specifies the name of the voice. For example, `en-US-AvaNeural`. |
|`type`| string | Required | The type of Azure voice: `azure-standard` or `azure-custom`. |
|`temperature`| number | Optional | Specifies the temperature applicable to Azure HD voices. Higher values provide higher levels of variability in intonation, prosody, and so on. |

Here's a partial message example for a standard (`azure-standard`) voice:

```json
{
  "voice": {
    "name": "en-US-AvaNeural",
    "type": "azure-standard"
  }
}
```

Here's an example `session.update` message for a standard high definition voice:

```json
{
  "voice": {
    "name": "en-US-Ava:DragonHDLatestNeural",
    "type": "azure-standard",
    "temperature": 0.8 // optional
  }
}
```

To configure the viseme, you can set the `animation.outputs` in the `session.update` message. For example:

```json
{
  "event_id": "your-session-id",
  "session": {
    "voice": {
      "name": "en-US-AvaNeural",
      "type": "azure-standard"
    },
    "modalities": ["text", "audio"],
    "instructions": "You are a helpful AI assistant responding in natural, engaging language.",
    "turn_detection": {
      "type": "server_vad"
    },
    // ... animation output settings ...
  }
}
```

A `response.animation_viseme.done` message is sent when all viseme messages are sent.
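
As an illustration, a client reading server events could collect viseme-related events until this done event arrives. The following is a minimal sketch: only the `response.animation_viseme.done` event type comes from this article; the connection object and loop structure are assumptions.

```python
import json


async def collect_viseme_events(ws) -> list[dict]:
    # Minimal sketch: "ws" is an assumed, already-open WebSocket connection that
    # yields server events as JSON text messages.
    viseme_events: list[dict] = []
    async for message in ws:
        event = json.loads(message)
        event_type = event.get("type", "")
        if event_type.startswith("response.animation_viseme"):
            viseme_events.append(event)
        if event_type == "response.animation_viseme.done":
            break  # all viseme messages for this response have been sent
    return viseme_events
```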

## Related content

- Try out the [Voice Live API quickstart](./voice-live-quickstart.md)
- See the [audio events reference](/azure/ai-services/openai/realtime-audio-reference?context=/azure/ai-services/speech-service/context/context)