Commit 8d184c8

Merge pull request #6497 from goergenj/jagoerge-voicelive-language-support
Jagoerge voicelive language support
2 parents 4aa083b + ca6ebf0 commit 8d184c8

2 files changed: +214, -0 lines changed


articles/ai-services/speech-service/toc.yml

Lines changed: 2 additions & 0 deletions
```diff
@@ -244,6 +244,8 @@ items:
       href: voice-live-agents-quickstart.md
     - name: How to use voice live
       href: voice-live-how-to.md
+    - name: Voice live language support
+      href: voice-live-language-support.md
     - name: Audio events reference
       href: /azure/ai-foundry/openai/realtime-audio-reference?context=/azure/ai-services/speech-service/context/context
     - name: Keyword recognition
```
articles/ai-services/speech-service/voice-live-language-support.md

Lines changed: 212 additions & 0 deletions

---
title: Voice live API language support
titleSuffix: Azure AI services
description: Learn about the languages supported by the voice live API and how to configure them.
manager: nitinme
author: goergenj
ms.author: jagoerge
ms.service: azure-ai-speech
ms.topic: conceptual
ms.date: 8/11/2025
ms.custom: languages
# Customer intent: As a developer, I want to learn about which languages are supported by the voice live API and how to configure them.
---

# Voice live API supported languages (Preview)

[!INCLUDE [Feature preview](./includes/previews/preview-generic.md)]

## Introduction

The voice live API supports multiple languages and configuration options. In this document, you learn which languages the voice live API supports and how to configure them.

## [Speech input](#tab/speechinput)

Depending on which model is being used, voice live speech input is processed either by one of the multimodal models (for example, `gpt-4o-realtime-preview`, `gpt-4o-mini-realtime-preview`, and `phi4-mm-realtime`) or by Azure speech to text models.

### Azure speech to text supported languages

Azure speech to text is used for all configurations where a non-multimodal model is used, and for speech input transcription with `phi4-mm-realtime`.
It supports all languages documented on the [Language and voice support for the Speech service - Speech to text](./language-support.md?tabs=stt) tab.

There are three options for voice live language processing:
- Automatic multilingual configuration using the multilingual model (default)
- Single language configuration
- Multilingual configuration using up to 10 defined languages

The current multilingual model supports the following languages:
- Chinese (China) [zh-CN]
- English (Australia) [en-AU]
- English (Canada) [en-CA]
- English (India) [en-IN]
- English (United Kingdom) [en-GB]
- English (United States) [en-US]
- French (Canada) [fr-CA]
- French (France) [fr-FR]
- German (Germany) [de-DE]
- Hindi (India) [hi-IN]
- Italian (Italy) [it-IT]
- Japanese (Japan) [ja-JP]
- Korean (Korea) [ko-KR]
- Spanish (Mexico) [es-MX]
- Spanish (Spain) [es-ES]

To use the **automatic multilingual configuration using the multilingual model** (the default), no extra configuration is required. If you do add the `language` string to the `session.update` message, make sure to leave it empty.

```json
{
  "session": {
    "input_audio_transcription": {
      "model": "azure-speech",
      "language": ""
    }
  }
}
```

> [!NOTE]
> If no language is defined, the multilingual model still generates results for unsupported languages, but the transcription quality is low. Be sure to configure defined languages if you're setting up an application with languages that the multilingual model doesn't support.

To configure a single language or multiple languages that the multilingual model doesn't support, add them to the `language` string in the `session.update` message. A maximum of 10 languages is supported.

```json
{
  "session": {
    "input_audio_transcription": {
      "model": "azure-speech",
      "language": "en-US,fr-FR,de-DE"
    }
  }
}
```
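
For the single language configuration, set the `language` string to exactly one locale code. A minimal sketch, using German (`de-DE`) as an illustrative locale:

```json
{
  "session": {
    "input_audio_transcription": {
      "model": "azure-speech",
      "language": "de-DE"
    }
  }
}
```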

### gpt-4o-realtime-preview and gpt-4o-mini-realtime-preview supported languages

While the underlying model was trained on 98 languages, OpenAI only lists the languages that achieved less than 50% word error rate (WER), which is an industry-standard benchmark for speech to text model accuracy. The model returns results for languages not listed, but the quality is low.

The following languages are supported by `gpt-4o-realtime-preview` and `gpt-4o-mini-realtime-preview`:
- Afrikaans
- Arabic
- Armenian
- Azerbaijani
- Belarusian
- Bosnian
- Bulgarian
- Catalan
- Chinese
- Croatian
- Czech
- Danish
- Dutch
- English
- Estonian
- Finnish
- French
- Galician
- German
- Greek
- Hebrew
- Hindi
- Hungarian
- Icelandic
- Indonesian
- Italian
- Japanese
- Kannada
- Kazakh
- Korean
- Latvian
- Lithuanian
- Macedonian
- Malay
- Marathi
- Maori
- Nepali
- Norwegian
- Persian
- Polish
- Portuguese
- Romanian
- Russian
- Serbian
- Slovak
- Slovenian
- Spanish
- Swahili
- Swedish
- Tagalog
- Tamil
- Thai
- Turkish
- Ukrainian
- Urdu
- Vietnamese
- Welsh

Multimodal models don't require a language configuration for general processing. If you configure input audio transcription, you can provide the transcription model with a language hint to improve transcription quality. In this case, you need to add the `language` string to the `session.update` message.

```json
{
  "session": {
    "input_audio_transcription": {
      "model": "gpt-4o-transcribe",
      "language": "English, German, French"
    }
  }
}
```

> [!NOTE]
> Multimodal gpt models only support the following transcription models: `whisper-1`, `gpt-4o-transcribe`, and `gpt-4o-mini-transcribe`.

### phi4-mm-realtime supported languages

The following languages are supported by `phi4-mm-realtime`:
- Chinese
- English
- French
- German
- Italian
- Japanese
- Portuguese
- Spanish

Multimodal models don't require a language configuration for general processing. If you configure input audio transcription for `phi4-mm-realtime`, use the same configuration as for the non-multimodal models, where `azure-speech` is used for transcription as described earlier.
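
For example, a sketch of input audio transcription for `phi4-mm-realtime` that mirrors the `azure-speech` configuration shown earlier (the locale list is illustrative):

```json
{
  "session": {
    "input_audio_transcription": {
      "model": "azure-speech",
      "language": "en-US,de-DE"
    }
  }
}
```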

> [!NOTE]
> Multimodal phi models only support `azure-speech` as the transcription model.

## [Speech output](#tab/speechoutput)

Depending on which model is being used, voice live speech output is processed either by one of the multimodal OpenAI voices integrated into `gpt-4o-realtime-preview` and `gpt-4o-mini-realtime-preview`, or by Azure text to speech voices.

### Azure text to speech supported languages

Azure text to speech is used by default for all configurations where a non-multimodal OpenAI model is used, and it can be configured manually in all configurations.
It supports all voices documented on the [Language and voice support for the Speech service - Text to speech](./language-support.md?tabs=tts) tab.

The following types of voices are supported:
- Monolingual voices
- Multilingual voices
- Custom voices

The supported language is tied to the voice used. To configure specific Azure text to speech voices, you need to add the `voice` configuration to the `session.update` message.

```json
{
  "session": {
    "voice": {
      "name": "en-US-Ava:DragonHDLatestNeural",
      "type": "azure-standard",
      "temperature": 0.8
    }
  }
}
```

For more information, see how to configure [Audio output through Azure text to speech](./voice-live-how-to.md#audio-output-through-azure-text-to-speech).

If *multilingual voices* are used, the language output can optionally be controlled by setting specific SSML tags. You can learn more about SSML tags in the [Customize voice and sound with SSML](./speech-synthesis-markup-voice.md#lang-examples) how-to guide.
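
As a rough sketch of what such SSML can look like, assuming a multilingual voice such as `en-US-AvaMultilingualNeural` (the voice name and text are illustrative; see the linked article for the authoritative syntax):

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-AvaMultilingualNeural">
    Hello!
    <!-- The <lang> element switches the output language for this segment. -->
    <lang xml:lang="de-DE">Hallo, wie kann ich helfen?</lang>
  </voice>
</speak>
```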

## Related content

- Learn more about [How to use the voice live API](./voice-live-how-to.md)
- Try out the [voice live API quickstart](./voice-live-quickstart.md)
- See the [audio events reference](/azure/ai-services/openai/realtime-audio-reference?context=/azure/ai-services/speech-service/context/context)