Commit 88fbeb5

Merge pull request #4612 from eric-urban/eur/voice-live-1
voice live API
2 parents e633d34 + 441e4cc commit 88fbeb5

2 files changed

+92
-0
lines changed

articles/ai-services/speech-service/toc.yml

Lines changed: 6 additions & 0 deletions
@@ -234,6 +234,12 @@ items:
234234
href: video-translation-overview.md
235235
- name: How to use video translation
236236
href: video-translation-get-started.md
237+
- name: Voice Live API
238+
items:
239+
- name: Voice Live API overview
240+
href: voice-live.md
241+
- name: Realtime events reference documentation
242+
href: /azure/ai-services/openai/realtime-audio-reference?context=/azure/ai-services/speech-service/context/context
237243
- name: Intent recognition
238244
items:
239245
- name: Intent recognition overview
Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
1+
---
2+
title: Voice Live API
3+
titleSuffix: Azure AI services
4+
description: Learn how to use the Voice Live API for real-time voice conversation.
5+
manager: nitinme
6+
author: eric-urban
7+
ms.author: eur
8+
ms.service: azure-ai-speech
9+
ms.topic: how-to
10+
ms.date: 5/19/2025
11+
# Customer intent:
12+
---
13+
14+
# Voice Live API for real-time voice conversation
15+
16+
## What is the Voice Live API?
17+
18+
The Voice Live API enables low-latency, high-quality speech-to-speech interactions for voice agents. It's designed for developers who want scalable, efficient voice-driven experiences without manually orchestrating multiple components. By integrating speech recognition, generative AI, and text to speech into a single interface, it provides an end-to-end solution for building voice agents.
19+
20+
## Understanding Speech-to-Speech Experiences
21+
22+
Speech-to-speech technology lets people interact with systems through natural, voice-based conversation. Traditional implementations combine disparate modules such as speech to text, intent recognition, dialog management, and text to speech. Chaining these modules increases engineering complexity and the latency that end users perceive.
23+
24+
With advancements in Large Language Models (LLMs) and multimodal AI, the Voice Live API consolidates these functionalities, simplifying workflows for developers. This approach enhances real-time interactions and ensures high-quality, natural communication, making it suitable for industries requiring instant, voice-enabled solutions.
25+
26+
## Key Scenarios for Voice Live API
27+
28+
The Azure AI Voice Live API is suited to scenarios where voice-driven interactions improve the user experience. Examples include:
29+
30+
- **Contact Centers**: Develop interactive voice bots for customer support, product catalog navigation, and self-service solutions.
31+
- **Automotive Assistants**: Enable hands-free, in-car voice assistants for command execution, navigation, and general inquiries.
32+
- **Education**: Create voice-enabled learning companions and virtual tutors for interactive training and education.
33+
- **Public Services**: Build voice agents to assist citizens with administrative queries and public service information.
34+
- **Human Resources**: Enhance HR processes with voice-enabled tools for employee support, career development, and training.
35+
36+
## Features of the Voice Live API
37+
38+
The Voice Live API includes a comprehensive set of features to support diverse use cases and ensure superior voice interactions:
39+
40+
- **Broad Locale Coverage**: Supports over 15 locales for speech to text and offers over 600 prebuilt voices across 140+ locales for text to speech, ensuring global accessibility.
41+
- **Customizable Input and Output**: Use customized speech recognition models for domain-specific recognition and phrase lists for lightweight, just-in-time customization of audio input. Use Custom Neural Voice to create unique, brand-aligned voices for audio output.
42+
- **Flexible Generative AI Model Options**: Choose from multiple models, including GPT-4o, GPT-4o-mini, and Phi, tailored to conversational requirements.
43+
- **Advanced Conversational Features**:
44+
- Noise Suppression: Reduces environmental noise for clearer communication.
45+
- Echo Cancellation: Prevents the agent from picking up its own responses.
46+
- Robust Interruption Detection: Ensures accurate recognition of interruptions during conversations.
47+
- Advanced End-of-Turn Detection: Allows natural pauses without prematurely concluding interactions.
48+
- **Avatar Integration**: Provides prebuilt or customizable avatars synchronized with audio output, offering a visual identity for voice agents.
49+
- **Function Calling**: Enables external actions, use of tools, and grounded responses using the VoiceRAG pattern.
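
The function calling feature listed above can be sketched in Python as follows. This is a minimal illustration, not the service's confirmed contract: the tool declaration and the `function_call` / `function_call_output` item shapes follow the Azure OpenAI Realtime API conventions, and it is an assumption that the Voice Live API accepts them unchanged. The `lookup_order_status` tool and its return value are hypothetical.

```python
import json


# Hypothetical local tool; the name and the returned data are made up.
def lookup_order_status(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}


# Registry that maps tool names to local Python callables.
TOOLS = {"lookup_order_status": lookup_order_status}

# Tool declaration in the style of a Realtime API `session.tools` entry
# (assumed to carry over to the Voice Live API).
TOOL_SPEC = {
    "type": "function",
    "name": "lookup_order_status",
    "description": "Look up the shipping status of an order.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}


def answer_function_call(item: dict) -> str:
    """Run the tool named in a `function_call` item and return the
    `conversation.item.create` event that carries its output."""
    args = json.loads(item["arguments"])  # arguments arrive as a JSON string
    result = TOOLS[item["name"]](**args)
    return json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": item["call_id"],
            "output": json.dumps(result),
        },
    })
```

In a Realtime-style flow, the client would send the returned event over the WebSocket connection and then request a new response so that the model can speak the grounded answer.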
50+
51+
## How It Works
52+
53+
The Voice Live API is fully managed, eliminating the need for customers to handle backend orchestration or component integration. Developers provide audio input and receive audio output, avatar visuals, and action triggers—all with minimal latency.
54+
55+
## API Design & Compatibility
56+
57+
The Azure AI Voice Live API is designed for compatibility with the Azure OpenAI Realtime API, so developers who already use that API can onboard with few changes. By adding a few more configuration parameters, developers can unlock advanced capabilities, such as text to speech avatar, without overhauling their existing systems.
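
As a rough illustration of "adding a few more configuration parameters", the sketch below builds a Realtime-style `session.update` event and merges extra settings into it. The `session.update` event type and the `instructions` and `modalities` fields follow the Azure OpenAI Realtime API. The keys and values in `VOICE_LIVE_EXTRAS` are assumptions made for illustration only, so check them against the Realtime events reference before relying on them.

```python
import json

# Fields in the style of the Azure OpenAI Realtime API `session.update` event.
BASE_SESSION = {
    "instructions": "You are a helpful voice agent.",
    "modalities": ["text", "audio"],
}

# Hypothetical extra settings: these key names and values are assumptions,
# not confirmed by this article.
VOICE_LIVE_EXTRAS = {
    "input_audio_noise_reduction": {"type": "azure_deep_noise_suppression"},
    "input_audio_echo_cancellation": {"type": "server_echo_cancellation"},
}


def build_session_update(base: dict, extras: dict) -> str:
    """Return a JSON `session.update` event that merges base and extra settings."""
    session = {**base, **extras}  # extras win if a key appears in both
    return json.dumps({"type": "session.update", "session": session})


event = build_session_update(BASE_SESSION, VOICE_LIVE_EXTRAS)
```

Keeping the extra settings in a separate dictionary leaves the existing Realtime configuration untouched, which matches the idea of extending a working setup rather than replacing it.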
58+
59+
## WebSocket Interface
60+
61+
The API is exposed through WebSocket events, which allows for straightforward server-to-server integration. Your backend or middle-tier service connects to the Voice Live API via WebSockets. You can either use the WebSocket event messages directly to interact with the API, or use the lightweight SDK, which is available in JavaScript and Python.
62+
63+
The supported real-time events are mostly in parity with the Azure OpenAI Realtime API, with a few exceptions. See the [Realtime events reference documentation](/azure/ai-services/openai/realtime-audio-reference?context=/azure/ai-services/speech-service/context/context) for more details.
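
To make the event-based interaction more concrete, here is a minimal Python sketch of encoding and decoding event messages without any network code. The event names (`input_audio_buffer.append`, `response.audio.delta`, `error`) and their fields follow the Azure OpenAI Realtime API. Because the Voice Live API is described as mostly, not fully, in parity with that API, treat their exact shape here as an assumption.

```python
import base64
import json


def audio_append_event(pcm_chunk: bytes) -> str:
    """Wrap raw audio bytes in an `input_audio_buffer.append` event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })


def handle_server_event(message: str, audio_out: bytearray) -> str:
    """Dispatch one server event by its `type` and return that type."""
    event = json.loads(message)
    event_type = event.get("type", "")
    if event_type == "response.audio.delta":
        # Audio arrives as base64 text in the `delta` field.
        audio_out.extend(base64.b64decode(event["delta"]))
    elif event_type == "error":
        raise RuntimeError(event.get("error", {}).get("message", "unknown error"))
    return event_type
```

A backend service would send the string returned by `audio_append_event` over its WebSocket connection and pass each received text frame to `handle_server_event`, collecting the agent's audio in `audio_out`.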
64+
65+
## Models Supported Natively
66+
67+
To power the intelligence of your voice agent, you can choose between the GPT-4o, GPT-4o-mini, and Phi generative AI models. These models differ in capabilities, level of intelligence, inference speed and latency, and cost. Choose the model that best fits what matters most for your business and use case.
68+
69+
All natively supported models (GPT-4o, GPT-4o-mini, and Phi) are fully managed, so you don't have to handle tasks such as capacity planning or provisioning throughput.
70+
71+
## Comparing Voice Live API with other Speech Services
72+
73+
| Application requirement | Do it yourself | Speech real-time | Voice Live API |
74+
|-----|-----|-----|-----|
75+
| Broad locale coverage with high accuracy (audio input) ||||
76+
| Maintain brand and character personality (audio output) ||||
77+
| Conversational enhancements ||||
78+
| Choice of generative AI models ||||
79+
| Visual output with text to speech avatar ||||
80+
| Low engineering cost ||||
81+
| Low latency perceived by end user ||||
82+
83+
## Related content
84+
85+
- [Azure OpenAI Realtime API](../openai/realtime-audio-reference.md)
86+
- [Whisper model](./whisper-overview.md)
