Commit 9d7794a

Merge pull request #4810 from eric-urban/eur/voice-live-api-edits
edits based on PM feedback
2 parents 1158c5d + bd4590f commit 9d7794a

3 files changed, +44 -37 lines changed

articles/ai-services/speech-service/voice-live-how-to.md

Lines changed: 9 additions & 18 deletions
Original file line number | Diff line number | Diff line change
@@ -1,15 +1,15 @@
11
---
22
title: How to use the Voice Live API (Preview)
33
titleSuffix: Azure AI services
4-
description: Learn how to use the Voice Live API for real-time voice conversation.
4+
description: Learn how to use the Voice Live API for real-time voice agents.
55
manager: nitinme
66
author: eric-urban
77
ms.author: eur
88
ms.service: azure-ai-speech
99
ms.topic: how-to
1010
ms.date: 5/19/2025
1111
ms.custom: references_regions
12-
# Customer intent: As a developer, I want to learn how to use the Voice Live API for real-time voice conversation.
12+
# Customer intent: As a developer, I want to learn how to use the Voice Live API for real-time voice agents.
1313
---
1414

1515
# How to use the Voice Live API (Preview)
@@ -22,16 +22,7 @@ Unless otherwise noted, the Voice Live API uses the same events as the [Azure Op
2222

2323
## Supported models and regions
2424

25-
The Voice Live API supports the following models and regions:
26-
27-
| Model | Description | Supported regions |
28-
| ------------------------------ | ----------- | ----------- |
29-
| `gpt-4o-realtime-preview` | GPT-4o realtime + option to use Azure text to speech voices including custom neural voice for audio. | `eastus2`<br/>`swedencentral` |
30-
| `gpt-4o-mini-realtime-preview` | GPT-4o mini realtime + option to use Azure text to speech voices including custom neural voice for audio. | `eastus2`<br/>`swedencentral` |
31-
| `gpt-4o` | GPT-4o + audio input through Azure speech to text + audio output through Azure text to speech voices including custom neural voice. | `centralindia`<br/>`eastus2`<br/>`swedencentral`<br/>`westus2` |
32-
| `gpt-4o-mini` | GPT-4o mini + audio input through Azure speech to text + audio output through Azure text to speech voices including custom neural voice. | `centralindia`<br/>`eastus2`<br/>`swedencentral`<br/>`westus2` |
33-
| `phi4-mm-realtime` | Phi4-mm + audio output through Azure text to speech voices including custom neural voice. | `centralindia`<br/>`eastus2`<br/>`swedencentral`<br/>`westus2` |
34-
| `phi4` | Phi4-mm + audio input through Azure speech to text + audio output through Azure text to speech voices including custom neural voice. | `centralindia`<br/>`eastus2`<br/>`swedencentral`<br/>`westus2` |
25+
For a table of supported models and regions, see the [Voice Live API overview](./voice-live.md#supported-models-and-regions).
3526

3627
## Authentication
3728

@@ -140,13 +131,17 @@ Server echo cancellation enhances the input audio quality by removing the echo f
140131
}
141132
```
142133

134+
## Conversational enhancements
135+
136+
The Voice Live API offers conversational enhancements to provide robustness to the natural end-user conversation flow.
137+
143138
### Turn Detection Parameters
144139

145-
Turn detection is the process of detecting when the end-user started or stopped speaking. The Voice Live API provides a `turn_detection` property to configure turn detection. The `azure_semantic_vad` type is one differentiator between the Voice Live API and the Azure OpenAI Realtime API.
140+
Turn detection is the process of detecting when the end-user started or stopped speaking. The Voice Live API builds on the Azure OpenAI Realtime API `turn_detection` property to configure turn detection. The `azure_semantic_vad` type is one differentiator between the Voice Live API and the Azure OpenAI Realtime API.
146141

147142
| Property | Type | Required or optional | Description |
148143
|----------|----------|----------|------------|
149-
| `type` | string | Optional | The type of turn detection system to use. Type `server_vad` detects start and end of speech based on audio volume.<br/><br/>Type `azure_semantic_vad` detects start and end of speech based on semantic meaning. The `azure_semantic_vad` type is only available when using the `gpt-4o` model. Azure semantic voice activity detection (VAD) improves turn detection by removing filler words to reduce the false alarm rate. The current list of filler words are `['ah', 'umm', 'mm', 'uh', 'huh', 'oh', 'yeah', 'hmm']`. The service ignores these words when there's an ongoing response. Remove feature words feature assumes the client plays response audio as soon as it receives them.<br/><br/>The default value is `server_vad`. |
144+
| `type` | string | Optional | The type of turn detection system to use. Type `server_vad` detects start and end of speech based on audio volume.<br/><br/>Type `azure_semantic_vad` detects start and end of speech based on semantic meaning. Azure semantic voice activity detection (VAD) improves turn detection by removing filler words to reduce the false alarm rate. The current list of filler words are `['ah', 'umm', 'mm', 'uh', 'huh', 'oh', 'yeah', 'hmm']`. The service ignores these words when there's an ongoing response. Remove feature words feature assumes the client plays response audio as soon as it receives them. The `azure_semantic_vad` type isn't supported with the `gpt-4o-realtime-preview` and `gpt-4o-mini-realtime-preview` models.<br/><br/>The default value is `server_vad`. |
150145
| `threshold` | number | Optional | A higher threshold requires a higher confidence signal of the user trying to speak. |
151146
| `prefix_padding_ms` | integer | Optional | The amount of audio, measured in milliseconds, to include before the start of speech detection signal. |
152147
| `silence_duration_ms` | integer | Optional | The duration of user's silence, measured in milliseconds, to detect the end of speech. |
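
As an illustration of the turn detection properties in the table above, here is a minimal sketch (not part of the commit) that builds a configuration message in Python. It assumes the settings are sent in a `session.update` event under `session.turn_detection`, as in the Azure OpenAI Realtime API whose events the Voice Live API reuses. The helper name `build_turn_detection_update` and the numeric values are hypothetical, and only the properties visible in this hunk are covered.

```python
import json

# Per the updated `type` row, azure_semantic_vad isn't supported with these models.
NO_SEMANTIC_VAD_MODELS = {"gpt-4o-realtime-preview", "gpt-4o-mini-realtime-preview"}

# Optional properties listed in the table rows shown in this hunk.
ALLOWED_OPTIONS = {"threshold", "prefix_padding_ms", "silence_duration_ms"}


def build_turn_detection_update(model, vad_type="server_vad", **options):
    """Hypothetical helper: build a session.update event carrying turn_detection settings."""
    unknown = set(options) - ALLOWED_OPTIONS
    if unknown:
        raise ValueError(f"Unknown turn detection options: {sorted(unknown)}")
    if vad_type not in ("server_vad", "azure_semantic_vad"):
        raise ValueError(f"Unsupported turn detection type: {vad_type}")
    if vad_type == "azure_semantic_vad" and model in NO_SEMANTIC_VAD_MODELS:
        raise ValueError(f"azure_semantic_vad isn't supported with {model}")
    return {
        "type": "session.update",  # assumed event name, taken from the Realtime API
        "session": {"turn_detection": {"type": vad_type, **options}},
    }


event = build_turn_detection_update(
    "gpt-4o",
    "azure_semantic_vad",
    threshold=0.5,            # illustrative value only
    prefix_padding_ms=300,    # illustrative value only
    silence_duration_ms=500,  # illustrative value only
)
print(json.dumps(event, indent=2))
```

The printed JSON is what a client would send over its WebSocket connection. Rejecting the unsupported model and type combination locally is a design choice of this sketch, not documented service behavior.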
@@ -292,10 +287,6 @@ And the service responds with the server SDP.
292287

293288
Then you can connect the avatar with the server SDP.
294289

295-
## Conversational enhancements
296-
297-
The Voice Live API offers several conversational enhancements to provide robustness to the natural end-user conversation flow.
298-
299290
### Audio timestamps
300291

301292
When you use Azure voices, and `output_audio_timestamp_types` is configured, the service returns the `response.audio_timestamp.delta` in the response, and `response.audio_timestamp.done` when the all timestamps message are returned.

articles/ai-services/speech-service/voice-live-quickstart.md

Lines changed: 3 additions & 3 deletions
@@ -1,7 +1,7 @@
11
---
2-
title: 'How to use Voice Live API for speech and audio with Azure AI Speech'
2+
title: 'How to use Voice Live API for real-time voice agents with Azure AI Speech'
33
titleSuffix: Azure AI services
4-
description: Learn how to use Voice Live API for speech and audio with Azure AI Speech.
4+
description: Learn how to use Voice Live API for real-time voice agents with Azure AI Speech.
55
manager: nitinme
66
ms.service: azure-ai-openai
77
ms.topic: how-to
@@ -12,7 +12,7 @@ ms.custom: build-2025
1212
recommendations: false
1313
---
1414

15-
# Voice Live API for speech and audio (Preview)
15+
# Quickstart: Voice Live API for real-time voice agents (Preview)
1616

1717
[!INCLUDE [Feature preview](./includes/previews/preview-generic.md)]
1818

articles/ai-services/speech-service/voice-live.md

Lines changed: 32 additions & 16 deletions
@@ -1,25 +1,26 @@
11
---
22
title: Voice Live API overview
33
titleSuffix: Azure AI services
4-
description: Learn about the Voice Live API and how to use it for real-time voice conversation.
4+
description: Learn about the Voice Live API for real-time voice agents.
55
manager: nitinme
66
author: eric-urban
77
ms.author: eur
88
ms.service: azure-ai-speech
99
ms.topic: how-to
1010
ms.date: 5/19/2025
11-
# Customer intent: As a developer, I want to learn about the Voice Live API and how to use it for real-time voice conversation.
11+
ms.custom: references_regions
12+
# Customer intent: As a developer, I want to learn about the Voice Live API for real-time voice agents.
1213
---
1314

14-
# Voice Live API for real-time voice conversation (Preview)
15+
# Voice Live API for real-time voice agents (Preview)
1516

1617
[!INCLUDE [Feature preview](./includes/previews/preview-generic.md)]
1718

1819
## What is the Voice Live API?
1920

2021
The Voice Live API is a solution enabling low-latency, high-quality speech to speech interactions for voice agents. The API is designed for developers seeking scalable and efficient voice-driven experiences as it eliminates the need to manually orchestrate multiple components. By integrating speech recognition, generative AI, and text to speech functionalities into a single, unified interface, it provides an end-to-end solution for creating seamless experiences.
2122

22-
## Understanding speech to speech Experiences
23+
## Understanding speech to speech experiences
2324

2425
Speech to speech technology is revolutionizing how humans interact with systems, offering intuitive voice-based solutions. Traditional implementations involved combining disparate modules such as speech to text, intent recognition, dialog management, text to speech, and more. Such chaining can lead to increased engineering complexity and end-user perceived latency.
2526

@@ -40,8 +41,8 @@ Azure AI Voice Live API is ideal for scenarios where voice-driven interactions i
4041
The Voice Live API includes a comprehensive set of features to support diverse use cases and ensure superior voice interactions:
4142

4243
- **Broad locale coverage**: Supports over 15 locales for speech to text and offers over 600 prebuilt voices across 140+ locales for text to speech, ensuring global accessibility.
43-
- **Customizable input and output**: Use customized speech recognition models for domain-specific recognition and phrase list for lightweight just-in-time customization on audio input, and Custom Neural Voice to create unique, brand-aligned voices for audio output.
44-
- **Flexible generative AI model options**: Choose from multiple models, including GPT-4o, GPT-4o-mini, and Phi, tailored to conversational requirements.
44+
- **Customizable input and output**: Use phrase list for lightweight just-in-time customization on audio input. Use custom neural voice to create unique, brand-aligned voices for audio output.
45+
- **Flexible generative AI model options**: [Choose from multiple models](#supported-models-and-regions), including GPT-4o, GPT-4o-mini, and Phi, tailored to conversational requirements.
4546
- **Advanced conversational features**:
4647
- Noise suppression: Reduces environmental noise for clearer communication.
4748
- Echo cancellation: Prevents the agent from picking up its own responses.
@@ -62,23 +63,38 @@ Features that are unique to the Voice Live API are designed to be optional and a
6263

6364
The API is supported through WebSocket events, allowing for an easy server-to-server integration. Your backend or middle-tier service connects to the Voice Live API via WebSockets. You can use the WebSocket messages directly to interact with the API.
6465

65-
## Models supported natively
66+
## Supported models and regions
6667

6768
To power the intelligence of your voice agent, you have flexibility and choice in the generative AI model between GPT-4o, GPT-4o-mini, and Phi. Different generative AI models provide different types of capabilities, levels of intelligence, speed/latency of inferencing, and cost. Depending on what matters most for your business and use case, you can choose the model that best suits your needs.
6869

6970
All natively supported models – GPT-4o, GPT-4o-mini, and Phi – are fully managed, meaning you don’t have to deploy models, worry about capacity planning, or provisioning throughputs. You can simply use the model you need, and the Voice Live API takes care of the rest.
7071

72+
The Voice Live API supports the following models and regions:
73+
74+
| Model | Description | Supported regions |
75+
| ------------------------------ | ----------- | ----------- |
76+
| `gpt-4o-realtime-preview` | GPT-4o realtime + option to use Azure text to speech voices including custom neural voice for audio. | `eastus2`<br/>`swedencentral` |
77+
| `gpt-4o-mini-realtime-preview` | GPT-4o mini realtime + option to use Azure text to speech voices including custom neural voice for audio. | `eastus2`<br/>`swedencentral` |
78+
| `gpt-4o` | GPT-4o + audio input through Azure speech to text + audio output through Azure text to speech voices including custom neural voice. | `centralindia`<br/>`eastus2`<br/>`swedencentral`<br/>`westus2` |
79+
| `gpt-4o-mini` | GPT-4o mini + audio input through Azure speech to text + audio output through Azure text to speech voices including custom neural voice. | `centralindia`<br/>`eastus2`<br/>`swedencentral`<br/>`westus2` |
80+
| `phi4-mm-realtime` | Phi4-mm + audio output through Azure text to speech voices including custom neural voice. | `centralindia`<br/>`eastus2`<br/>`swedencentral`<br/>`westus2` |
81+
| `phi4` | Phi4-mm + audio input through Azure speech to text + audio output through Azure text to speech voices including custom neural voice. | `centralindia`<br/>`eastus2`<br/>`swedencentral`<br/>`westus2` |
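
For readers who want to check a deployment region in code, the table above can be mirrored as a small lookup. This is a sketch only: the data is copied from the table as of this commit and may change while the API is in preview, and the names `SUPPORTED_REGIONS`, `is_supported`, and `models_in_region` are hypothetical.

```python
# Regions shared by the non-realtime GPT-4o models and the Phi models in the table.
BROAD_REGIONS = frozenset({"centralindia", "eastus2", "swedencentral", "westus2"})
# Regions listed for the GPT-4o realtime preview models.
REALTIME_REGIONS = frozenset({"eastus2", "swedencentral"})

# Snapshot of the "Supported models and regions" table from this commit.
SUPPORTED_REGIONS = {
    "gpt-4o-realtime-preview": REALTIME_REGIONS,
    "gpt-4o-mini-realtime-preview": REALTIME_REGIONS,
    "gpt-4o": BROAD_REGIONS,
    "gpt-4o-mini": BROAD_REGIONS,
    "phi4-mm-realtime": BROAD_REGIONS,
    "phi4": BROAD_REGIONS,
}


def is_supported(model, region):
    """Return True if the table lists `region` for `model`."""
    return region in SUPPORTED_REGIONS.get(model, frozenset())


def models_in_region(region):
    """Return the models the table lists for `region`, sorted by name."""
    return sorted(m for m, regions in SUPPORTED_REGIONS.items() if region in regions)


print(is_supported("gpt-4o-realtime-preview", "westus2"))
print(models_in_region("westus2"))
```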
82+
7183
## Comparing Voice Live API with other speech to speech solutions
7284

73-
| Application requirement | Do it yourself | Speech real-time | Voice Live API |
74-
|-----|-----|-----|-----|
75-
| Broad locale coverage with high accuracy (audio input) ||||
76-
| Maintain brand and character personality (audio output) ||||
77-
| Conversational enhancements ||||
78-
| Choice of generative AI models ||||
79-
| Visual output with text to speech avatar ||||
80-
| Low engineering cost ||||
81-
| Low latency perceived by end user ||||
85+
The Voice Live API is an alternative to orchestrating multiple components such as speech recognition, generative AI, and text to speech. This orchestration can be complex and time-consuming, requiring significant engineering effort to integrate and maintain. The Voice Live API simplifies this process by providing a single interface for all these components, allowing developers to focus on building their applications rather than managing the underlying infrastructure.
86+
87+
To meet your requirements, you can either build your own solution or use the Voice Live API. The table below compares the two approaches:
88+
89+
| Application requirement | Do it yourself | Voice Live API |
90+
|-----|-----|-----|
91+
| Broad locale coverage with high accuracy (audio input) |||
92+
| Maintain brand and character personality (audio output) |||
93+
| Conversational enhancements |||
94+
| Choice of generative AI models |||
95+
| Visual output with text to speech avatar |||
96+
| Low engineering cost |||
97+
| Low latency perceived by end user |||
8298

8399
## Related content
84100
