
Commit c1256c8

Add customization documentation and region information on features in how to.

1 parent ceae670

File tree

4 files changed: +130 additions, −38 deletions


articles/ai-services/speech-service/toc.yml

Lines changed: 4 additions & 2 deletions
```diff
@@ -236,14 +236,16 @@ items:
   items:
   - name: Voice live overview
     href: voice-live.md
+  - name: Voice live language support
+    href: voice-live-language-support.md
   - name: Voice live with Foundry models quickstart
     href: voice-live-quickstart.md
   - name: Voice live with Foundry agents quickstart
    href: voice-live-agents-quickstart.md
   - name: How to use voice live
     href: voice-live-how-to.md
-  - name: Voice live language support
-    href: voice-live-language-support.md
+  - name: How to customize voice live input and output
+    href: voice-live-how-to-customize.md
   - name: Audio events reference
     href: /azure/ai-foundry/openai/realtime-audio-reference?context=/azure/ai-services/speech-service/context/context
   - name: Keyword recognition
```
articles/ai-services/speech-service/voice-live-how-to-customize.md

Lines changed: 115 additions & 0 deletions

@@ -0,0 +1,115 @@
---
title: How to customize voice live input and output
titleSuffix: Azure AI services
description: Learn how to use the voice live API with customized models.
manager: nitinme
author: goergenj
ms.author: jagoerge
ms.service: azure-ai-speech
ms.topic: how-to
ms.date: 9/16/2025
ms.custom: custom speech, custom voice, custom avatar, fine-tuning
# Customer intent: As a developer, I want to learn how to use custom models with the voice live API for real-time voice agents.
---

# How to customize voice live input and output

[!INCLUDE [Feature preview](./includes/previews/preview-generic.md)]

Voice live provides multiple options to optimize performance and quality by using custom models. The following customization options are currently available:

- Speech input customization:
  - Phrase list: A lightweight just-in-time customization based on a list of words or phrases provided as part of the session configuration to help improve recognition quality. See [Improve recognition accuracy with phrase list](./improve-accuracy-phrase-list.md) to learn more.
  - Custom speech: With custom speech, you can evaluate and improve the accuracy of speech recognition for your applications and products, and fine-tune the recognition quality to your business needs. See [What is custom speech?](./custom-speech-overview.md) to learn more.
- Speech output customization:
  - Custom lexicon: A custom lexicon lets you customize pronunciation for both standard Azure text to speech voices and custom voices to improve speech synthesis quality. See [custom lexicon for text to speech](./speech-synthesis-markup-pronunciation.md#custom-lexicon) to learn more.
  - Custom voice: Custom voice lets you create a one-of-a-kind, customized, synthetic voice for your applications. With custom voice, you can build a highly natural-sounding voice for your brand or characters by providing human speech samples as fine-tuning data. See [What is custom voice?](./custom-neural-voice.md) to learn more.
  - Custom avatar: Custom text to speech avatar allows you to create a customized, one-of-a-kind synthetic talking avatar for your application. With custom text to speech avatar, you can build a unique and natural-looking avatar for your product or brand by providing video recording data of your selected actors. See [What is custom text to speech avatar?](./text-to-speech-avatar/what-is-custom-text-to-speech-avatar.md) to learn more.
## Speech input customization

### Phrase list

Use a phrase list for lightweight just-in-time customization on audio input. To configure the phrase list, set `phrase_list` in the `session.update` message.

```json
{
    "session": {
        "input_audio_transcription": {
            "model": "azure-speech",
            "phrase_list": ["Neo QLED TV", "TUF Gaming", "AutoQuote Explorer"]
        }
    }
}
```

> [!NOTE]
> Phrase list currently doesn't support gpt-realtime, gpt-4o-realtime, gpt-4o-mini-realtime, and phi4-mm-realtime. To learn more about phrase list, see [phrase list for speech to text](./improve-accuracy-phrase-list.md).
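The `session.update` body shown above can also be assembled programmatically before it's sent over your realtime connection. Here's a minimal sketch in Python; the `build_phrase_list_session` helper name is illustrative, not part of the API, and only the JSON shape from the example above is assumed:

```python
import json

def build_phrase_list_session(phrases):
    """Build the session.update body that attaches a phrase list
    to the azure-speech transcription model."""
    return {
        "session": {
            "input_audio_transcription": {
                "model": "azure-speech",
                # Phrase list entries are plain strings, as in the JSON example.
                "phrase_list": list(phrases),
            }
        }
    }

body = build_phrase_list_session(["Neo QLED TV", "TUF Gaming", "AutoQuote Explorer"])
print(json.dumps(body, indent=2))
```

Serializing with `json.dumps` keeps the payload identical to the hand-written JSON example.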
### Custom speech configuration

You can use the `custom_speech` field to specify your custom speech models. This field is defined as a dictionary, where each key represents a locale code and each value corresponds to the `Model ID` of the custom speech model. For more information about custom speech, see [What is custom speech?](./custom-speech-overview.md).

Voice live supports using a combination of base models and custom models, as long as each type is unique per locale, with a maximum of 10 languages specified in total.

Here's an example session configuration with custom speech models. In this case, if the detected language is English, the base model is used, and if the detected language is Chinese, the custom speech model is used.

```json
{
    "session": {
        "input_audio_transcription": {
            "model": "azure-speech",
            "language": "en",
            "custom_speech": {
                "zh-CN": "847cb03d-7f22-4b11-444-e1be1d77bf17"
            }
        }
    }
}
```

> [!NOTE]
> To use a custom speech model with the voice live API, the model must be available on the same Azure AI Foundry resource you use to call the voice live API. If you trained the model on a different Azure AI Foundry or Azure AI Speech resource, copy the model to the resource you use to call the voice live API.
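A configuration like this can be validated before it's sent. The following sketch applies the constraints stated above; the helper name, and the assumption that the base language counts toward the documented 10-language maximum, are illustrative rather than confirmed API behavior:

```python
def build_custom_speech_session(base_language, custom_models):
    """Assemble input_audio_transcription with base and custom speech models.

    base_language: locale for the base model, for example "en" (or None).
    custom_models: dict mapping locale code -> custom speech Model ID.
    """
    # Assumed interpretation: the base language plus each custom-speech
    # locale counts toward the documented maximum of 10 languages.
    total = len(custom_models) + (1 if base_language else 0)
    if total > 10:
        raise ValueError("voice live supports at most 10 languages in total")

    transcription = {"model": "azure-speech"}
    if base_language:
        transcription["language"] = base_language
    if custom_models:
        transcription["custom_speech"] = dict(custom_models)
    return {"session": {"input_audio_transcription": transcription}}

# English falls back to the base model; Chinese uses the custom model ID.
session = build_custom_speech_session(
    "en", {"zh-CN": "847cb03d-7f22-4b11-444-e1be1d77bf17"}
)
```

Because `custom_speech` keys are locales, the per-locale uniqueness rule falls out of the dictionary structure itself.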
## Speech output customization

### Custom lexicon

Use the `custom_lexicon_url` string property to customize pronunciation for both standard Azure text to speech voices and custom voices. To learn more about how to format the custom lexicon (the same as Speech Synthesis Markup Language (SSML)), see [custom lexicon for text to speech](./speech-synthesis-markup-pronunciation.md#custom-lexicon).

```json
{
    "voice": {
        "name": "en-US-Ava:DragonHDLatestNeural",
        "type": "azure-standard",
        "temperature": 0.8, // optional
        "custom_lexicon_url": "<custom lexicon url>"
    }
}
```
### Azure custom voices

You can use a custom voice for audio output. For information about how to create a custom voice, see [What is custom voice?](./custom-neural-voice.md).

```json
{
    "voice": {
        "name": "en-US-CustomNeural",
        "type": "azure-custom",
        "endpoint_id": "your-endpoint-id", // a guid string
        "temperature": 0.8 // optional, value range 0.0-1.0, only takes effect when using HD voices
    }
}
```
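As with the input settings, the voice payload can be built and sanity-checked in code. A minimal sketch, assuming only the fields shown in the JSON above; the helper name and the strict range check on `temperature` are illustrative:

```python
def build_custom_voice(name, endpoint_id, temperature=None):
    """Build the voice object for an azure-custom voice.

    temperature is documented as 0.0-1.0 and only takes effect for
    HD voices, so it's validated and included only when provided.
    """
    voice = {
        "name": name,
        "type": "azure-custom",
        "endpoint_id": endpoint_id,  # the GUID of your custom voice deployment
    }
    if temperature is not None:
        if not 0.0 <= temperature <= 1.0:
            raise ValueError("temperature must be between 0.0 and 1.0")
        voice["temperature"] = temperature
    return {"voice": voice}

payload = build_custom_voice("en-US-CustomNeural", "your-endpoint-id", temperature=0.8)
```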
### Azure custom avatar

[Text to speech avatar](./text-to-speech-avatar/what-is-text-to-speech-avatar.md) converts text into a digital video of a photorealistic human (either a standard avatar or a [custom text to speech avatar](./text-to-speech-avatar/what-is-custom-text-to-speech-avatar.md)) speaking with a natural-sounding voice.

The configuration for a custom avatar doesn't differ from the configuration of a standard avatar. See [How to use the voice live API - Azure text to speech avatar](./voice-live-how-to.md#azure-text-to-speech-avatar) for a detailed example.

## Related content

- Try out the [voice live API quickstart](./voice-live-quickstart.md)
- See the [audio events reference](/azure/ai-foundry/openai/realtime-audio-reference?context=/azure/ai-services/speech-service/context/context)

articles/ai-services/speech-service/voice-live-how-to.md

Lines changed: 10 additions & 35 deletions
````diff
@@ -164,24 +164,21 @@ Here's an example of end of utterance detection in a session object:
 
 ## Audio input through Azure speech to text
 
-### Phrase list
+Azure speech to text is automatically active when you use a non-multimodal model like gpt-4o-realtime.
 
-Use phrase list for lightweight just-in-time customization on audio input. To configure phrase list, you can set the phrase_list in the `session.update` message.
+To configure it explicitly, set the `model` to `azure-speech` in `input_audio_transcription`. This can be useful to improve recognition quality for specific language situations. See [How to customize voice live input and output](./voice-live-how-to-customize.md) to learn more about speech input customization.
 
 ```json
 {
     "session": {
         "input_audio_transcription": {
             "model": "azure-speech",
-            "phrase_list": ["Neo QLED TV", "TUF Gaming", "AutoQuote Explorer"]
+            "language": "en"
         }
     }
 }
 ```
 
-> [!NOTE]
-> Phrase list currently doesn't support gpt-realtime, gpt-4o-realtime, gpt-4o-mini-realtime, and phi4-mm-realtime. To learn more about phrase list, see [phrase list for speech to text](./improve-accuracy-phrase-list.md).
-
 ## Audio output through Azure text to speech
 
 You can use the `voice` parameter to specify a standard or custom voice. The voice is used for audio output.
````
```diff
@@ -194,6 +191,8 @@ The `voice` object has the following properties:
 | `type` | string | Required | Configuration of the type of Azure voice between `azure-standard` and `azure-custom`. |
 | `temperature` | number | Optional | Specifies temperature applicable to Azure HD voices. Higher values provide higher levels of variability in intonation, prosody, etc. |
 
+See [How to customize voice live input and output](./voice-live-how-to-customize.md) to learn more about speech output customization.
+
 ### Azure standard voices
 
 Here's a partial message example for a standard (`azure-standard`) voice:
```
````diff
@@ -225,35 +224,8 @@ Here's an example `session.update` message for a standard high definition voice:
 
 For the full list of standard high definition voices, see [high definition voices documentation](high-definition-voices.md#supported-azure-ai-speech-hd-voices).
 
-### Azure custom voices
-
-You can use a custom voice for audio output. For information about how to create a custom voice, see [What is custom voice](./custom-neural-voice.md).
-
-```json
-{
-    "voice": {
-        "name": "en-US-CustomNeural",
-        "type": "azure-custom",
-        "endpoint_id": "your-endpoint-id", // a guid string
-        "temperature": 0.8 // optional, value range 0.0-1.0, only take effect when using HD voices
-    }
-}
-```
-
-### Custom lexicon
-
-Use the `custom_lexicon_url` string property to customize pronunciation for both standard Azure text to speech voices and custom voices. To learn more about how to format the custom lexicon (the same as Speech Synthesis Markup Language (SSML)), see [custom lexicon for text to speech](./speech-synthesis-markup-pronunciation.md#custom-lexicon).
-
-```json
-{
-    "voice": {
-        "name": "en-US-Ava:DragonHDLatestNeural",
-        "type": "azure-standard",
-        "temperature": 0.8, // optional
-        "custom_lexicon_url": "<custom lexicon url>"
-    }
-}
-```
+> [!NOTE]
+> High definition voices are currently supported in the following regions only: southeastasia, centralindia, swedencentral, westeurope, eastus, eastus2, westus2.
 
 ### Speaking rate
````

```diff
@@ -440,6 +412,9 @@ And the service responds with the server SDP.
 
 Then you can connect the avatar with the server SDP.
 
+> [!NOTE]
+> Azure text to speech avatar is currently supported in the following regions only: southeastasia, centralindia, swedencentral, westeurope, eastus2, westus2.
+
 ## Related content
 
 - Try out the [voice live API quickstart](./voice-live-quickstart.md)
```

articles/ai-services/speech-service/voice-live.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -41,7 +41,7 @@ Azure AI voice live API is ideal for scenarios where voice-driven interactions i
 The voice live API includes a comprehensive set of features to support diverse use cases and ensure superior voice interactions:
 
 - **Broad locale coverage**: Supports over 15 locales for speech to text and offers over 600 standard voices across 140+ locales for text to speech, ensuring global accessibility.
-- **Customizable input and output**: Use phrase list for lightweight just-in-time customization on audio input. Use custom voice to create unique, brand-aligned voices for audio output.
+- **Customizable input and output**: Use phrase list for lightweight just-in-time customization on audio input, or custom speech models for advanced speech recognition fine-tuning. Use custom voice to create unique, brand-aligned voices for audio output. See [How to customize voice live input and output](./voice-live-how-to-customize.md) to learn more.
 - **Flexible generative AI model options**: [Choose from multiple models](#supported-models-and-regions), including GPT-5, GPT-4.1, GPT-4o, Phi, and more tailored to conversational requirements.
 - **Advanced conversational features**:
   - Noise suppression: Reduces environmental noise for clearer communication.
```
