
Commit 6084d17

voice live API updates

1 parent ed4baa6 commit 6084d17

File tree

7 files changed: +80 -29 lines


articles/ai-services/openai/realtime-audio-reference.md

Lines changed: 9 additions & 10 deletions
@@ -1,26 +1,25 @@
 ---
-title: Azure OpenAI in Azure AI Foundry Models Realtime API Reference
+title: Audio events reference
 titleSuffix: Azure OpenAI
-description: Learn how to use the Realtime API to interact with the Azure OpenAI in real-time.
+description: Learn how to use events with the Realtime API and Voice Live API.
 manager: nitinme
 ms.service: azure-ai-openai
 ms.topic: conceptual
-ms.date: 4/28/2025
+ms.date: 6/27/2025
 author: eric-urban
 ms.author: eur
 recommendations: false
 ---

-# Realtime events reference
+# Audio events reference

-[!INCLUDE [Feature preview](includes/preview-feature.md)]
+Realtime events are used to communicate between the client and server in real-time audio applications. The events are sent as JSON objects over various endpoints, such as WebSockets or WebRTC. The events are used to manage the conversation, audio buffers, and responses in real-time.

-The Realtime API is a WebSocket-based API that allows you to interact with the Azure OpenAI in real-time.
+You can use audio client and server events with these APIs:
+- [Azure OpenAI Realtime API](/azure/ai-services/openai/realtime-audio-quickstart)
+- [Azure AI Voice Live API](/azure/ai-services/speech-service/voice-live)

-The Realtime API (via `/realtime`) is built on [the WebSockets API](https://developer.mozilla.org/docs/Web/API/WebSockets_API) to facilitate fully asynchronous streaming communication between the end user and model. Device details like capturing and rendering audio data are outside the scope of the Realtime API. It should be used in the context of a trusted, intermediate service that manages both connections to end users and model endpoint connections. Don't use it directly from untrusted end user devices.
-
-> [!TIP]
-> To get started with the Realtime API, see the [quickstart](realtime-audio-quickstart.md) and [how-to guide](./how-to/realtime-audio.md).
+Unless otherwise specified, the events described in this document are applicable to both APIs.

 ## Client events

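The rewritten intro above describes events as JSON objects exchanged over transports such as WebSockets. As a minimal, hypothetical sketch of what a client event payload looks like on the wire (the `session.update` event type appears in this commit; the helper function and default values here are invented for illustration):

```python
import json


def build_session_update(instructions: str, voice_name: str) -> str:
    """Serialize a session.update client event to its JSON wire format."""
    event = {
        "type": "session.update",
        "session": {
            "instructions": instructions,
            "voice": {"name": voice_name, "type": "azure-standard"},
        },
    }
    return json.dumps(event)


payload = build_session_update(
    "You are a helpful AI assistant responding in natural, engaging language.",
    "en-US-Ava:DragonHDLatestNeural",
)
print(payload)
```

In a real client this string would be sent over the open WebSocket connection; the sketch only covers serialization.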
articles/ai-services/openai/toc.yml

Lines changed: 1 addition & 1 deletion
@@ -441,7 +441,7 @@ items:
 displayName: RAG, rag
 - name: Azure OpenAI monitoring data reference
 href: monitor-openai-reference.md
-- name: Realtime API (preview) events reference
+- name: Audio events reference
 href: realtime-audio-reference.md
 - name: Resources
 items:

articles/ai-services/speech-service/includes/quickstarts/voice-live-api/realtime-python.md

Lines changed: 4 additions & 3 deletions
@@ -4,7 +4,7 @@ author: eric-urban
 ms.author: eur
 ms.service: azure-ai-openai
 ms.topic: include
-ms.date: 5/19/2025
+ms.date: 6/27/2025
 ---

 ## Prerequisites
@@ -151,6 +151,7 @@ For the recommended keyless authentication with Microsoft Entra ID, you need to:
 session_update = {
     "type": "session.update",
     "session": {
+        "instructions": "You are a helpful AI assistant responding in natural, engaging language.",
         "turn_detection": {
             "type": "azure_semantic_vad",
             "threshold": 0.3,
@@ -170,7 +171,7 @@ For the recommended keyless authentication with Microsoft Entra ID, you need to:
             "type": "server_echo_cancellation"
         },
         "voice": {
-            "name": "en-US-Aria:DragonHDLatestNeural",
+            "name": "en-US-Ava:DragonHDLatestNeural",
             "type": "azure-standard",
             "temperature": 0.8,
         },
@@ -417,7 +418,7 @@ For the recommended keyless authentication with Microsoft Entra ID, you need to:
 The output of the script is printed to the console. You see messages indicating the status of the connection, audio stream, and playback. The audio is played back through your speakers or headphones.

 ```text
-Session created: {"type": "session.update", "session": {"turn_detection": {"type": "azure_semantic_vad", "threshold": 0.3, "prefix_padding_ms": 200, "silence_duration_ms": 200, "remove_filler_words": false, "end_of_utterance_detection": {"model": "semantic_detection_v1", "threshold": 0.1, "timeout": 4}}, "input_audio_noise_reduction": {"type": "azure_deep_noise_suppression"}, "input_audio_echo_cancellation": {"type": "server_echo_cancellation"}, "voice": {"name": "en-US-Aria:DragonHDLatestNeural", "type": "azure-standard", "temperature": 0.8}}, "event_id": ""}
+Session created: {"type": "session.update", "session": {"instructions": "You are a helpful AI assistant responding in natural, engaging language.", "turn_detection": {"type": "azure_semantic_vad", "threshold": 0.3, "prefix_padding_ms": 200, "silence_duration_ms": 200, "remove_filler_words": false, "end_of_utterance_detection": {"model": "semantic_detection_v1", "threshold": 0.1, "timeout": 4}}, "input_audio_noise_reduction": {"type": "azure_deep_noise_suppression"}, "input_audio_echo_cancellation": {"type": "server_echo_cancellation"}, "voice": {"name": "en-US-Ava:DragonHDLatestNeural", "type": "azure-standard", "temperature": 0.8}}, "event_id": ""}
 Starting the chat ...
 Received event: {'session.created'}
 Press 'q' and Enter to quit the chat.
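The hunks above add an `instructions` field to the quickstart's `session_update` payload and switch the HD voice from Aria to Ava. A standalone sketch of the resulting payload (field values copied from the diff; the WebSocket connection code that surrounds it in the quickstart is omitted):

```python
import json

# session_update as it reads after this commit's changes.
session_update = {
    "type": "session.update",
    "session": {
        "instructions": "You are a helpful AI assistant responding in natural, engaging language.",
        "turn_detection": {
            "type": "azure_semantic_vad",
            "threshold": 0.3,
            "prefix_padding_ms": 200,
            "silence_duration_ms": 200,
            "remove_filler_words": False,
            "end_of_utterance_detection": {
                "model": "semantic_detection_v1",
                "threshold": 0.1,
                "timeout": 4,
            },
        },
        "input_audio_noise_reduction": {"type": "azure_deep_noise_suppression"},
        "input_audio_echo_cancellation": {"type": "server_echo_cancellation"},
        "voice": {
            "name": "en-US-Ava:DragonHDLatestNeural",
            "type": "azure-standard",
            "temperature": 0.8,
        },
    },
}

# Serializes to the JSON shown in the "Session created" console output.
print(json.dumps(session_update))
```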

articles/ai-services/speech-service/toc.yml

Lines changed: 1 addition & 1 deletion
@@ -239,7 +239,7 @@ items:
 href: voice-live-quickstart.md
 - name: How to use Voice Live API
 href: voice-live-how-to.md
-- name: Realtime API events reference documentation
+- name: Audio events reference
 href: /azure/ai-services/openai/realtime-audio-reference?context=/azure/ai-services/speech-service/context/context
 - name: Intent recognition
 items:

articles/ai-services/speech-service/voice-live-how-to.md

Lines changed: 63 additions & 12 deletions
@@ -7,7 +7,7 @@ author: eric-urban
 ms.author: eur
 ms.service: azure-ai-speech
 ms.topic: how-to
-ms.date: 5/19/2025
+ms.date: 6/27/2025
 ms.custom: references_regions
 # Customer intent: As a developer, I want to learn how to use the Voice Live API for real-time voice agents.
 ---
@@ -58,6 +58,7 @@ Here's an example `session.update` message that configures several aspects of th

 ```json
 {
+    "instructions": "You are a helpful AI assistant responding in natural, engaging language.",
     "turn_detection": {
         "type": "azure_semantic_vad",
         "threshold": 0.3,
@@ -73,7 +74,7 @@ Here's an example `session.update` message that configures several aspects of th
     "input_audio_noise_reduction": {"type": "azure_deep_noise_suppression"},
     "input_audio_echo_cancellation": {"type": "server_echo_cancellation"},
     "voice": {
-        "name": "en-US-Aria:DragonHDLatestNeural",
+        "name": "en-US-Ava:DragonHDLatestNeural",
         "type": "azure-standard",
         "temperature": 0.8,
     },
@@ -96,8 +97,8 @@ You can use input audio properties to configure the input audio stream.
 | Property | Type | Required or optional | Description |
 |----------|----------|----------|------------|
 | `input_audio_sampling_rate` | integer | Optional | The sampling rate of the input audio.<br/><br/>The supported values are `16000` and `24000`. The default value is `24000`. |
-| `input_audio_echo_cancellation` | object | Optional | Enhances the input audio quality by removing the echo from the model's own voice without requiring any client-side echo cancellation.<br/><br/>Set the `type` property of `input_audio_echo_cancellation` to enable echo cancellation.<br/><br/>The supported value for `type` is `server_echo_cancellation` which is used when the model's voice is played back to the end-user through a speaker, and the microphone picks up the model's own voice. |
-| `input_audio_noise_reduction` | object | Optional | Enhances the input audio quality by suppressing or removing environmental background noise.<br/><br/>Set the `type` property of `input_audio_noise_reduction` to enable noise suppression.<br/><br/>The supported value for `type` is `azure_deep_noise_suppression` which optimizes for speakers closest to the microphone. |
+| `input_audio_echo_cancellation` | object | Optional | Enhances the input audio quality by removing the echo from the model's own voice without requiring any client-side echo cancellation.<br/><br/>Set the `type` property of `input_audio_echo_cancellation` to enable echo cancellation.<br/><br/>The supported value for `type` is `server_echo_cancellation`, which is used when the model's voice is played back to the end-user through a speaker, and the microphone picks up the model's own voice. |
+| `input_audio_noise_reduction` | object | Optional | Enhances the input audio quality by suppressing or removing environmental background noise.<br/><br/>Set the `type` property of `input_audio_noise_reduction` to enable noise suppression.<br/><br/>The supported value for `type` is `azure_deep_noise_suppression`, which optimizes for speakers closest to the microphone. |

 Here's an example of input audio properties is a session object:

@@ -113,7 +114,7 @@ Here's an example of input audio properties is a session object:

 Noise suppression enhances the input audio quality by suppressing or removing environmental background noise. Noise suppression helps the model understand the end-user with higher accuracy and improves accuracy of signals like interruption detection and end-of-turn detection.

-Server echo cancellation enhances the input audio quality by removing the echo from the model's own voice without requiring any client-side echo cancellation. Server echo cancellation is useful when the model's voice is played back to the end-user through a speaker, and the microphone picks up the model's own voice.
+Server echo cancellation enhances the input audio quality by removing the echo from the model's own voice. In this way, client-side echo cancellation isn't required. Server echo cancellation is useful when the model's voice is played back to the end-user through a speaker and the microphone picks up the model's own voice.

 > [!NOTE]
 > The service assumes the client plays response audio as soon as it receives them. If playback is delayed for more than 3 seconds, echo cancellation quality is impacted.
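The echo cancellation and noise suppression properties in the hunk above are both optional objects keyed by `type`. A hypothetical helper that assembles them into a session object (the function name and boolean flags are invented for illustration; the property names, `type` values, and the `24000` default come from the table in the diff):

```python
import json


def input_audio_settings(echo_cancellation: bool = True, noise_reduction: bool = True) -> dict:
    """Assemble the optional input-audio enhancement properties for a session."""
    session: dict = {"input_audio_sampling_rate": 24000}  # default per the table
    if echo_cancellation:
        # Server-side: removes the model's own voice picked up by the microphone.
        session["input_audio_echo_cancellation"] = {"type": "server_echo_cancellation"}
    if noise_reduction:
        # Optimizes for speakers closest to the microphone.
        session["input_audio_noise_reduction"] = {"type": "azure_deep_noise_suppression"}
    return {"session": session}


print(json.dumps(input_audio_settings()))
```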
@@ -141,7 +142,7 @@ Turn detection is the process of detecting when the end-user started or stopped

 | Property | Type | Required or optional | Description |
 |----------|----------|----------|------------|
-| `type` | string | Optional | The type of turn detection system to use. Type `server_vad` detects start and end of speech based on audio volume.<br/><br/>Type `azure_semantic_vad` detects start and end of speech based on semantic meaning. Azure semantic voice activity detection (VAD) improves turn detection by removing filler words to reduce the false alarm rate. The current list of filler words are `['ah', 'umm', 'mm', 'uh', 'huh', 'oh', 'yeah', 'hmm']`. The service ignores these words when there's an ongoing response. Remove feature words feature assumes the client plays response audio as soon as it receives them. The `azure_semantic_vad` type isn't supported with the `gpt-4o-realtime-preview` and `gpt-4o-mini-realtime-preview` models.<br/><br/>The default value is `server_vad`. |
+| `type` | string | Optional | The type of turn detection system to use. Type `server_vad` detects start and end of speech based on audio volume.<br/><br/>Type `azure_semantic_vad` detects start and end of speech based on semantic meaning. Azure semantic voice activity detection (VAD) improves turn detection by removing filler words to reduce the false alarm rate. The current list of filler words are `['ah', 'umm', 'mm', 'uh', 'huh', 'oh', 'yeah', 'hmm']`. The service ignores these words when there's an ongoing response. Remove feature words feature assumes the client plays response audio as soon as it receives them.<br/><br/>The default value is `server_vad`. |
 | `threshold` | number | Optional | A higher threshold requires a higher confidence signal of the user trying to speak. |
 | `prefix_padding_ms` | integer | Optional | The amount of audio, measured in milliseconds, to include before the start of speech detection signal. |
 | `silence_duration_ms` | integer | Optional | The duration of user's silence, measured in milliseconds, to detect the end of speech. |
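The turn detection properties in the table above can be sketched as a configuration dict. The values below are the ones used in this commit's examples; the dict is illustrative, not an API call:

```python
import json

# turn_detection config using Azure semantic VAD, per the property table.
turn_detection = {
    "type": "azure_semantic_vad",  # semantic (vs. volume-based server_vad) detection
    "threshold": 0.3,              # higher = require a more confident speech signal
    "prefix_padding_ms": 200,      # audio kept before the detected start of speech
    "silence_duration_ms": 200,    # silence needed to mark the end of speech
    "remove_filler_words": False,  # keep 'ah', 'umm', ... in the transcript
}

print(json.dumps({"session": {"turn_detection": turn_detection}}))
```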
@@ -152,6 +153,7 @@ Here's an example of end of utterance detection in a session object:
 ```json
 {
     "session": {
+        "instructions": "You are a helpful AI assistant responding in natural, engaging language.",
         "turn_detection": {
             "type": "azure_semantic_vad",
             "threshold": 0.3,
@@ -168,6 +170,55 @@ Here's an example of end of utterance detection in a session object:
 }
 ```

+### Phrase list
+
+Use phrase list for lightweight just-in-time customization on audio input. To configure phrase list, you can set the phrase_list in the `session.update` message.
+
+```json
+{
+    "session": {
+        "input_audio": {
+            "phrase_list": ["Neo QLED TV", "TUF Gaming", "AutoQuote Explorer"]
+        }
+    }
+}
+```
+
+> [!NOTE]
+> Phrase list currently doesn't support gpt-4o-realtime-preview, gpt-4o-mini-realtime-preview, and phi4-mm-realtime. To learn more about phrase list, see [phrase list for speech to text](./improve-accuracy-phrase-list.md).
+
+### Custom lexicon
+
+Use the `custom_lexicon_url` string property to customize pronunciation for both standard Azure text to speech voices and custom voices. To learn more about how to format the custom lexicon (the same as Speech Synthesis Markup Language (SSML)), see [custom lexicon for text to speech](./speech-synthesis-markup-pronunciation.md#custom-lexicon).
+
+```json
+{
+    "voice": {
+        "name": "en-US-Ava:DragonHDLatestNeural",
+        "type": "azure-standard",
+        "temperature": 0.8, // optional
+        "custom_lexicon_url": "<custom lexicon url>"
+    }
+}
+```
+
+### Speaking rate
+
+Use the `rate` string property to adjust the speaking speed for any standard Azure text to speech voices and custom voices.
+
+The rate value should range from 0.5 to 1.5, with higher values indicating faster speeds.
+
+```json
+{
+    "voice": {
+        "name": "en-US-Ava:DragonHDLatestNeural",
+        "type": "azure-standard",
+        "temperature": 0.8, // optional
+        "rate": "1.2"
+    }
+}
+```
+
 ### Audio output through Azure text to speech

 You can use the `voice` parameter to specify a standard or custom voice. The voice is used for audio output.
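The new speaking rate section above states that `rate` is a string whose numeric value should fall between 0.5 and 1.5. A hypothetical validation helper (the function name is invented; only the `rate` range and its string encoding come from the diff):

```python
def voice_with_rate(name: str, rate: float) -> dict:
    """Build a voice config, rejecting rates outside the documented 0.5-1.5 range."""
    if not 0.5 <= rate <= 1.5:
        raise ValueError(f"rate must be between 0.5 and 1.5, got {rate}")
    return {
        "voice": {
            "name": name,
            "type": "azure-standard",
            "rate": str(rate),  # the property is a string, e.g. "1.2"
        }
    }


print(voice_with_rate("en-US-Ava:DragonHDLatestNeural", 1.2))
```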
@@ -176,7 +227,7 @@ The `voice` object has the following properties:

 | Property | Type | Required or optional | Description |
 |----------|----------|----------|------------|
-| `name` | string | Required | Specifies the name of the voice. For example, `en-US-AriaNeural`. |
+| `name` | string | Required | Specifies the name of the voice. For example, `en-US-AvaNeural`. |
 | `type` | string | Required | Configuration of the type of Azure voice between `azure-standard` and `azure-custom`. |
 | `temperature` | number | Optional | Specifies temperature applicable to Azure HD voices. Higher values provide higher levels of variability in intonation, prosody, etc. |

@@ -187,7 +238,7 @@ Here's a partial message example for a standard (`azure-standard`) voice:
 ```json
 {
     "voice": {
-        "name": "en-US-AriaNeural",
+        "name": "en-US-AvaNeural",
         "type": "azure-standard"
     }
 }
@@ -202,7 +253,7 @@ Here's an example `session.update` message for a standard high definition voice:
 ```json
 {
     "voice": {
-        "name": "en-US-Aria:DragonHDLatestNeural",
+        "name": "en-US-Ava:DragonHDLatestNeural",
         "type": "azure-standard",
         "temperature": 0.8 // optional
     }
@@ -341,11 +392,11 @@ To configure the viseme, you can set the `animation.outputs` in the `session.upd
 "event_id": "your-session-id",
 "session": {
     "voice": {
-        "name": "en-US-AriaNeural",
+        "name": "en-US-AvaNeural",
         "type": "azure-standard",
     },
     "modalities": ["text", "audio"],
-    "instructions": "You are a helpful assistant.",
+    "instructions": "You are a helpful AI assistant responding in natural, engaging language.",
     "turn_detection": {
         "type": "server_vad"
     },
@@ -388,4 +439,4 @@ And a `response.animation_viseme.done` message is sent when all viseme messages
 ## Related content

 - Try out the [Voice Live API quickstart](./voice-live-quickstart.md)
-- See the [Azure OpenAI Realtime API reference](/azure/ai-services/openai/realtime-audio-reference?context=/azure/ai-services/speech-service/context/context)
+- See the [audio events reference](/azure/ai-services/openai/realtime-audio-reference?context=/azure/ai-services/speech-service/context/context)

articles/ai-services/speech-service/voice-live-quickstart.md

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ description: Learn how to use Voice Live API for real-time voice agents with Azu
 manager: nitinme
 ms.service: azure-ai-openai
 ms.topic: how-to
-ms.date: 5/19/2025
+ms.date: 6/27/2025
 author: eric-urban
 ms.author: eur
 ms.custom: build-2025

articles/ai-services/speech-service/voice-live.md

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ author: eric-urban
 ms.author: eur
 ms.service: azure-ai-speech
 ms.topic: how-to
-ms.date: 5/19/2025
+ms.date: 6/27/2025
 ms.custom: references_regions
 # Customer intent: As a developer, I want to learn about the Voice Live API for real-time voice agents.
 ---
