Commit 9f126c5

First draft language support docs voice live

1 parent cb8ccef commit 9f126c5

File tree

2 files changed: +220 -0 lines changed

articles/ai-services/speech-service/toc.yml

Lines changed: 2 additions & 0 deletions

@@ -244,6 +244,8 @@ items:
        href: voice-live-agents-quickstart.md
      - name: How to use voice live
        href: voice-live-how-to.md
+     - name: Voice live language support
+       href: voice-live-language-support.md
      - name: Audio events reference
        href: /azure/ai-foundry/openai/realtime-audio-reference?context=/azure/ai-services/speech-service/context/context
      - name: Keyword recognition
articles/ai-services/speech-service/voice-live-language-support.md

Lines changed: 218 additions & 0 deletions

@@ -0,0 +1,218 @@
---
title: Voice live API language support
titleSuffix: Azure AI services
description: Learn about the languages supported by the voice live API and how to configure them.
manager: nitinme
author: goergenj
ms.author: jagoerge
ms.service: azure-ai-speech
ms.topic: conceptual
ms.date: 8/8/2025
ms.custom: languages
# Customer intent: As a developer, I want to learn about which languages are supported by the voice live API and how to configure them.
---
# Voice live API supported languages (Preview)

[!INCLUDE [Feature preview](./includes/previews/preview-generic.md)]

## Introduction

The voice live API supports multiple languages and configuration options. In this document, you learn which languages are supported by the voice live API and how to configure them.

## [Speech input](#tab/speechinput)
Depending on which model is being used, voice live speech input is processed either by one of the multimodal models (for example, `gpt-4o-realtime-preview`, `gpt-4o-mini-realtime-preview`, and `phi4-mm-realtime`) or by Azure speech to text models.

### Azure speech to text supported languages

Azure speech to text is used for all configurations where a non-multimodal model is being used, and for speech input transcription with `phi4-mm-realtime`.
It supports all languages documented on the [Language and voice support for the Speech service - Speech to text](./language-support.md?tabs=stt) tab.
There are three options for voice live language processing:

1. Automatic multilingual configuration using the multilingual model (default)
1. Single language configuration
1. Multilingual configuration using up to 10 defined languages

Custom speech models can be used with **Single language configuration** or **Multilingual configuration using up to 10 defined languages**.
The current multilingual model supports the following languages:

- Chinese (China) [zh-CN]
- English (Australia) [en-AU]
- English (Canada) [en-CA]
- English (India) [en-IN]
- English (United Kingdom) [en-GB]
- English (United States) [en-US]
- French (Canada) [fr-CA]
- French (France) [fr-FR]
- German (Germany) [de-DE]
- Hindi (India) [hi-IN]
- Italian (Italy) [it-IT]
- Japanese (Japan) [ja-JP]
- Korean (Korea) [ko-KR]
- Spanish (Mexico) [es-MX]
- Spanish (Spain) [es-ES]
To use **Automatic multilingual configuration using the multilingual model**, no additional configuration is required. If you do add the `language` string to the `session.update` message, make sure to leave it empty.

```json
{
  "session": {
    "input_audio_transcription": {
      "model": "azure-speech",
      "language": ""
    }
  }
}
```
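Note that the fragments in this article show only the `session` payload. On the WebSocket connection, this payload is presumably carried inside a full `session.update` event envelope along these lines (a sketch assuming the standard realtime event shape):

```json
{
  "type": "session.update",
  "session": {
    "input_audio_transcription": {
      "model": "azure-speech",
      "language": ""
    }
  }
}
```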
> [!NOTE]
> The multilingual model also generates results for unsupported languages. In these cases, transcription quality is low. Make sure to configure defined languages if you're setting up an application with languages that the multilingual model doesn't support.

To configure a single language or multiple languages that aren't supported by the multilingual model, you must add them to the `language` string in the `session.update` message. A maximum of 10 languages is supported.

```json
{
  "session": {
    "input_audio_transcription": {
      "model": "azure-speech",
      "language": "en-US,fr-FR,de-DE"
    }
  }
}
```
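For the **Single language configuration** option (which also applies when you use a custom speech model trained for one locale), the `language` string presumably takes a single locale code — an assumption extrapolated from the multi-language example above:

```json
{
  "session": {
    "input_audio_transcription": {
      "model": "azure-speech",
      "language": "en-US"
    }
  }
}
```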
### gpt-4o-realtime-preview and gpt-4o-mini-realtime-preview supported languages

While the underlying model was trained on 98 languages, OpenAI only lists the languages that achieved less than 50% word error rate (WER), which is an industry-standard benchmark for speech to text model accuracy. The model returns results for languages that aren't listed below, but the quality will be low.

The following languages are supported by `gpt-4o-realtime-preview` and `gpt-4o-mini-realtime-preview`:
- Afrikaans
- Arabic
- Armenian
- Azerbaijani
- Belarusian
- Bosnian
- Bulgarian
- Catalan
- Chinese
- Croatian
- Czech
- Danish
- Dutch
- English
- Estonian
- Finnish
- French
- Galician
- German
- Greek
- Hebrew
- Hindi
- Hungarian
- Icelandic
- Indonesian
- Italian
- Japanese
- Kannada
- Kazakh
- Korean
- Latvian
- Lithuanian
- Macedonian
- Malay
- Marathi
- Maori
- Nepali
- Norwegian
- Persian
- Polish
- Portuguese
- Romanian
- Russian
- Serbian
- Slovak
- Slovenian
- Spanish
- Swahili
- Swedish
- Tagalog
- Tamil
- Thai
- Turkish
- Ukrainian
- Urdu
- Vietnamese
- Welsh
Multimodal models don't require a language configuration for the general processing. If you configure input audio transcription, you can provide the transcription model with a language hint to improve transcription quality. In this case, you need to add the `language` string to the `session.update` message.

```json
{
  "session": {
    "input_audio_transcription": {
      "model": "gpt-4o-transcribe",
      "language": "English, German, French"
    }
  }
}
```
> [!NOTE]
> Multimodal gpt models only support the following transcription models: `whisper-1`, `gpt-4o-transcribe`, and `gpt-4o-mini-transcribe`.
### phi4-mm-realtime supported languages

The following languages are supported by `phi4-mm-realtime`:

- Chinese
- English
- French
- German
- Italian
- Japanese
- Portuguese
- Spanish
Multimodal models don't require a language configuration for the general processing. If you configure input audio transcription for `phi4-mm-realtime`, you need to use the same configuration as for all non-multimodal model configurations where `azure-speech` is used for transcription, as described earlier. A sketch follows the note below.

> [!NOTE]
> Multimodal phi models only support the `azure-speech` transcription model.
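For example, a `phi4-mm-realtime` session could configure input audio transcription by reusing the `azure-speech` pattern shown earlier (the locale list here is illustrative):

```json
{
  "session": {
    "input_audio_transcription": {
      "model": "azure-speech",
      "language": "en-US,de-DE"
    }
  }
}
```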
## [Speech output](#tab/speechoutput)

Depending on which model is being used, voice live speech output is generated either by one of the multimodal OpenAI voices integrated into `gpt-4o-realtime-preview` and `gpt-4o-mini-realtime-preview` or by Azure text to speech voices.
### Azure text to speech supported languages

Azure text to speech is used by default for all configurations where a non-multimodal OpenAI model is being used, and it can be configured manually in all configurations.
It supports all voices documented on the [Language and voice support for the Speech service - Text to speech](./language-support.md?tabs=tts) tab.

The following types of voices are supported:

1. Monolingual voices
1. Multilingual voices
1. Custom voices
The supported language is tied to the voice used. To configure a specific Azure text to speech voice, you need to add the `voice` configuration to the `session.update` message.

```json
{
  "session": {
    "voice": {
      "name": "en-US-Ava:DragonHDLatestNeural",
      "type": "azure-standard",
      "temperature": 0.8
    }
  }
}
```
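To get multilingual output, you could select a multilingual voice in the same way; a sketch follows, where the voice name is illustrative (check the text to speech voice list for availability in your region):

```json
{
  "session": {
    "voice": {
      "name": "en-US-AvaMultilingualNeural",
      "type": "azure-standard",
      "temperature": 0.8
    }
  }
}
```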
For more details, see how to configure [Audio output through Azure text to speech](./voice-live-how-to.md#audio-output-through-azure-text-to-speech).

For *multilingual voices*, the output language can optionally be controlled by setting specific SSML tags. You can learn more in the [Customize voice and sound with SSML](./speech-synthesis-markup-voice.md#lang-examples) how-to guide.
### OpenAI supported languages

***INPUT NEEDED QINYING!!!***
## Related content

- Learn more about [How to use the voice live API](./voice-live-how-to.md)
- Try out the [voice live API quickstart](./voice-live-quickstart.md)
- See the [audio events reference](/azure/ai-services/openai/realtime-audio-reference?context=/azure/ai-services/speech-service/context/context)
