Commit aecb9f8

Merge pull request #722 from MicrosoftDocs/main
10/9/2024 AM Publish
2 parents 5719991 + 1186891 commit aecb9f8

8 files changed

+198
-43
lines changed

Lines changed: 140 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,140 @@
1+
---
2+
title: What are neural text to speech HD voices?
3+
titleSuffix: Azure AI services
4+
description: Learn about neural text to speech HD voices that you can use with speech synthesis.
5+
author: eric-urban
6+
ms.author: eur
7+
ms.reviewer: v-baolianzou
8+
manager: nitinme
9+
ms.service: azure-ai-speech
10+
ms.topic: overview
11+
ms.date: 10/9/2024
12+
ms.custom: references_regions
13+
#customer intent: As a user who implements text to speech, I want to understand the options and differences between available neural text to speech HD voices in Azure AI Speech.
14+
---
15+
16+
# What are high definition voices? (Preview)
17+
18+
[!INCLUDE [Feature preview](../includes/preview-feature.md)]
19+
20+
Azure AI Speech continues to advance text to speech technology with the introduction of neural text to speech high definition (HD) voices. The HD voices can understand the content, automatically detect emotions in the input text, and adjust the speaking tone in real time to match the sentiment. HD voices keep the same voice persona as their neural (non-HD) counterparts, and deliver even more value through enhanced features.
21+
22+
## Key features of neural text to speech HD voices
23+
24+
The following are the key features of Azure AI Speech HD voices:
25+
26+
| Key features | Description |
27+
|--------------|-------------|
28+
| **Human-like speech generation** | Neural text to speech HD voices can generate highly natural and human-like speech. The model is trained on millions of hours of multilingual data, enabling it to accurately interpret input text and generate speech with the appropriate emotion, pace, and rhythm without manual adjustments. |
29+
| **Conversational** | Neural text to speech HD voices can replicate natural speech patterns, including spontaneous pauses and emphasis. When given conversational text, the model can reproduce common conversational features such as pauses and filler words. The generated voice sounds as if someone is conversing directly with you. |
30+
| **Prosody variations** | Neural text to speech HD voices introduce slight variations in each output to enhance realism. These variations make the speech sound more natural, as human voices naturally exhibit variation. |
31+
| **High fidelity** | The primary objective of neural text to speech HD voices is to generate high-fidelity audio. The synthetic speech produced by our system can closely mimic human speech in both quality and naturalness. |
32+
| **Version control** | With neural text to speech HD voices, we release different versions of the same voice, each with a unique base model size and recipe. This offers you the opportunity to experience new voice variations or continue using a specific version of a voice. |
33+
34+
## Comparison of Azure AI Speech HD voices to other Azure text to speech voices
35+
36+
How do Azure AI Speech HD voices compare to other Azure text to speech voices? How do they differ in terms of features and capabilities?
37+
38+
Here's a comparison of features between Azure AI Speech HD voices, Azure OpenAI HD voices, and Azure AI Speech voices:
39+
40+
| Feature | Azure AI Speech HD voices | Azure OpenAI HD voices | Azure AI Speech voices (not HD) |
41+
|---------|---------------|------------------------|------------------------|
42+
| **Region** | North Central US, Sweden Central | North Central US, Sweden Central | Available in dozens of regions. See the [region list](regions.md#speech-service).|
43+
| **Number of voices** | 12 | 6 | More than 500 |
44+
| **Multilingual** | No (perform on primary language only) | Yes | Yes (applicable only to multilingual voices) |
45+
| **SSML support** | Support for [a subset of SSML elements](#supported-and-unsupported-ssml-elements-for-azure-ai-speech-hd-voices).| Support for [a subset of SSML elements](openai-voices.md#ssml-elements-supported-by-openai-text-to-speech-voices-in-azure-ai-speech). | Support for the [full set of SSML](speech-synthesis-markup-structure.md) in Azure AI Speech. |
46+
| **Development options** | Speech SDK, Speech CLI, REST API | Speech SDK, Speech CLI, REST API | Speech SDK, Speech CLI, REST API |
47+
| **Deployment options** | Cloud only | Cloud only | Cloud, embedded, hybrid, and containers. |
48+
| **Real-time or batch synthesis** | Real-time only | Real-time and batch synthesis | Real-time and batch synthesis |
49+
| **Latency** | Less than 300 ms | Greater than 500 ms | Less than 300 ms |
50+
| **Sample rate of synthesized audio** | 8, 16, 22.05, 24, 44.1, and 48 kHz | 8, 16, 24, and 48 kHz | 8, 16, 22.05, 24, 44.1, and 48 kHz |
51+
| **Speech output audio format** | opus, mp3, pcm, truesilk | opus, mp3, pcm, truesilk | opus, mp3, pcm, truesilk |
52+
53+
## Supported Azure AI Speech HD voices
54+
55+
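The Azure AI Speech HD voice values are in the format `voicename:basemodel` followed directly by a version, for example `en-US-Ava:DragonHDLatestNeural`. The name before the colon, such as `en-US-Ava`, is the voice persona name and its original locale. The part after the colon combines the base model (such as `DragonHD`) with its version (such as `LatestNeural`). New versions of the base model are released in subsequent updates.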
56+
57+
Currently, `DragonHD` is the only base model available for Azure AI Speech HD voices. One version of the base model (`v1Neural`) is available for each voice persona. To ensure that you're using the latest version of the base model that we provide without having to make a code change, use the `LatestNeural` version.
58+
59+
For example, for the persona `en-US-Ava` you can specify two HD voice values:
60+
- `en-US-Ava:DragonHDLatestNeural`: Always uses the latest version of the base model, including versions that we release later.
61+
- `en-US-Ava:DragonHDv1Neural`: Always uses the `v1Neural` version of the base model. When we release a new version of the base model, you need to update your code to use the new version.
62+
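The naming scheme described above can be sketched in Python. This is a minimal illustration, not part of any Azure SDK: the names `BASE_MODEL`, `KNOWN_VERSIONS`, and `build_hd_voice_name` are hypothetical, and the base model and version strings are taken only from this article.

```python
# Hypothetical helper for illustration; not part of the Speech SDK.
BASE_MODEL = "DragonHD"                        # the only base model listed in this article
KNOWN_VERSIONS = {"LatestNeural", "v1Neural"}  # versions listed in this article


def build_hd_voice_name(persona: str, version: str = "LatestNeural") -> str:
    """Join a voice persona, the base model, and a version into an HD voice value."""
    if version not in KNOWN_VERSIONS:
        raise ValueError(f"Unknown version: {version!r}")
    # The persona and base model are separated by a colon;
    # the version follows the base model with no separator.
    return f"{persona}:{BASE_MODEL}{version}"


# Follows the newest base model version without a code change.
print(build_hd_voice_name("en-US-Ava"))
# Pinned to a specific version until the code is updated.
print(build_hd_voice_name("en-US-Ava", "v1Neural"))
```

Pinning to `v1Neural` keeps the output stable across releases, while `LatestNeural` trades that stability for automatic upgrades.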
63+
The following table lists the Azure AI Speech HD voices that are currently available.
64+
65+
| Neural voice persona | HD voices |
66+
|----------------------|-----------|
67+
| de-DE-Seraphina | de-DE-Seraphina:DragonHDLatestNeural<br/>de-DE-Seraphina:DragonHDv1Neural |
68+
| en-US-Andrew | en-US-Andrew:DragonHDLatestNeural<br/>en-US-Andrew:DragonHDv1Neural |
69+
| en-US-Andrew2 | en-US-Andrew2:DragonHDLatestNeural<br/>en-US-Andrew2:DragonHDv1Neural |
70+
| en-US-Aria | en-US-Aria:DragonHDLatestNeural<br/>en-US-Aria:DragonHDv1Neural |
71+
| en-US-Ava | en-US-Ava:DragonHDLatestNeural<br/>en-US-Ava:DragonHDv1Neural |
72+
| en-US-Davis | en-US-Davis:DragonHDLatestNeural<br/>en-US-Davis:DragonHDv1Neural |
73+
| en-US-Emma | en-US-Emma:DragonHDLatestNeural<br/>en-US-Emma:DragonHDv1Neural |
74+
| en-US-Emma2 | en-US-Emma2:DragonHDLatestNeural<br/>en-US-Emma2:DragonHDv1Neural |
75+
| en-US-Jenny | en-US-Jenny:DragonHDLatestNeural<br/>en-US-Jenny:DragonHDv1Neural |
76+
| en-US-Steffan | en-US-Steffan:DragonHDLatestNeural<br/>en-US-Steffan:DragonHDv1Neural |
77+
| ja-JP-Masaru | ja-JP-Masaru:DragonHDLatestNeural<br/>ja-JP-Masaru:DragonHDv1Neural |
78+
| zh-CN-Xiaochen | zh-CN-Xiaochen:DragonHDLatestNeural<br/>zh-CN-Xiaochen:DragonHDv1Neural |
79+
80+
81+
## How to use Azure AI Speech HD voices
82+
83+
You can use HD voices with the same Speech SDK and REST APIs as the non HD voices.
84+
85+
Here are some key points to consider when using Azure AI Speech HD voices:
86+
87+
- **Voice locale**: The locale in the voice name indicates its original language and region.
88+
- **Base models**:
89+
- HD voices come with a base model that understands the input text and predicts the speaking pattern accordingly. You can specify the desired model (such as `DragonHDLatestNeural`) according to the availability of each voice.
90+
- **SSML usage**: To reference a voice in SSML, use the format `voicename:basemodel` followed directly by a version, for example `de-DE-Seraphina:DragonHDLatestNeural`. The name before the colon, such as `de-DE-Seraphina`, is the voice persona name and its original locale. New versions of the base model are released in subsequent updates.
91+
- **Temperature parameter**:
92+
- The temperature value is a float ranging from 0 to 1 that controls the randomness of the output. Less randomness yields more stable results, while more randomness offers variety but less consistency.
93+
- Lower temperature results in less randomness, leading to more predictable outputs. Higher temperature increases randomness, allowing for more diverse outputs. The default temperature is set at 1.0.
94+
95+
Here's an example of how to use Azure AI Speech HD voices in SSML:
96+
97+
```ssml
98+
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='en-US'>
99+
<voice name='en-US-Ava:DragonHDLatestNeural' parameters='temperature=0.8'>Here is a test</voice>
100+
</speak>
101+
```
102+
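If you build the SSML in code, a small helper can validate the temperature range and escape the input text before the request is sent. The following Python sketch is an assumption-laden illustration: `build_hd_ssml` is a hypothetical name, and the only facts it relies on are the SSML shape and the 0 to 1 temperature range shown above.

```python
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "https://www.w3.org/2001/mstts"


def build_hd_ssml(text: str, voice: str, temperature: float = 1.0,
                  lang: str = "en-US") -> str:
    """Build an SSML document for an HD voice (hypothetical helper)."""
    # The article states that temperature is a float from 0 to 1.
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0 and 1")
    params = quoteattr(f"temperature={temperature}")
    return (
        f"<speak version='1.0' xmlns='{SSML_NS}' "
        f"xmlns:mstts='{MSTTS_NS}' xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)} parameters={params}>"
        f"{escape(text)}</voice></speak>"  # escape &, <, > in the input text
    )


ssml = build_hd_ssml("Here is a test", "en-US-Ava:DragonHDLatestNeural", 0.8)
print(ssml)
```

The resulting string can then be passed to whichever synthesis call you already use for SSML input with non-HD voices.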
103+
## Supported and unsupported SSML elements for Azure AI Speech HD voices
104+
105+
The Speech Synthesis Markup Language (SSML) with input text determines the structure, content, and other characteristics of the text to speech output. For example, you can use SSML to define a paragraph, a sentence, a break or a pause, or silence. You can wrap text with event tags such as bookmark or viseme that your application processes later.
106+
107+
The Azure AI Speech HD voices don't support all SSML elements or events that other Azure AI Speech voices support. Of particular note, Azure AI Speech HD voices don't support [word boundary events](./how-to-speech-synthesis.md#subscribe-to-synthesizer-events).
108+
109+
For detailed information on the supported and unsupported SSML elements for Azure AI Speech HD voices, refer to the following table. For instructions on how to use SSML elements, refer to the [Speech Synthesis Markup Language (SSML) documentation](speech-synthesis-markup-structure.md).
110+
111+
| SSML element | Description | Supported in Azure AI Speech HD voices |
112+
|------------------------------|--------------------------------|-----------------------------------|
113+
| `<voice>` | Specifies the voice and optional effects (`eq_car` and `eq_telecomhp8k`). | Yes |
114+
| `<mstts:express-as>` | Specifies speaking styles and roles. | No |
115+
| `<mstts:ttsembedding>` | Specifies the `speakerProfileId` property for a personal voice. | No |
116+
| `<lang xml:lang>` | Specifies the speaking language. | Yes |
117+
| `<prosody>` | Adjusts pitch, contour, range, rate, and volume. | No |
118+
| `<emphasis>`| Adds or removes word-level stress for the text. | No|
119+
| `<audio>`| Embeds prerecorded audio into an SSML document. | No|
120+
| `<mstts:audioduration>` | Specifies the duration of the output audio. | No |
121+
| `<mstts:backgroundaudio>` | Adds background audio to your SSML documents or mixes an audio file with text to speech. | No |
122+
| `<phoneme>` |Specifies phonetic pronunciation in SSML documents. | No |
123+
| `<lexicon>` | Defines how multiple entities are read in SSML. | Yes (only supports alias) |
124+
| `<say-as>` | Indicates the content type, such as number or date, of the element's text. | Yes |
125+
| `<sub>` | Indicates that the alias attribute's text value should be pronounced instead of the element's enclosed text. | Yes |
126+
| `<math>` | Uses the MathML as input text to properly pronounce mathematical notations in the output audio. | No |
127+
| `<bookmark>` | Gets the offset of each marker in the audio stream. | No |
128+
| `<break>` | Overrides the default behavior of breaks or pauses between words. | No |
129+
| `<mstts:silence>` | Inserts pause before or after text, or between two adjacent sentences. | No |
130+
| `<mstts:viseme>` | Defines the position of the face and mouth while a person is speaking. | No |
131+
| `<p>` | Denotes paragraphs in SSML documents. | Yes |
132+
| `<s>` | Denotes sentences in SSML documents. | Yes |
133+
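As a rough way to apply the table above, you might scan an SSML document for elements that HD voices don't support before you send it. The following Python sketch uses only the standard library. The names `UNSUPPORTED_LOCAL_NAMES` and `find_unsupported` are hypothetical, and the check is a simplification: it matches element local names only, ignores namespaces, and doesn't inspect the `<lexicon>` alias restriction.

```python
import xml.etree.ElementTree as ET

# Elements marked "No" in the table above, written as local names
# (the mstts: prefix is dropped).
UNSUPPORTED_LOCAL_NAMES = {
    "express-as", "ttsembedding", "prosody", "emphasis", "audio",
    "audioduration", "backgroundaudio", "phoneme", "math", "bookmark",
    "break", "silence", "viseme",
}


def find_unsupported(ssml: str) -> list[str]:
    """Return unsupported element names in document order, without duplicates."""
    root = ET.fromstring(ssml)
    found: list[str] = []
    for element in root.iter():
        # ElementTree reports namespaced tags as '{namespace}local'.
        local = element.tag.rsplit("}", 1)[-1]
        if local in UNSUPPORTED_LOCAL_NAMES and local not in found:
            found.append(local)
    return found


sample = (
    "<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' "
    "xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='en-US'>"
    "<voice name='en-US-Ava:DragonHDLatestNeural'>"
    "<prosody rate='+10%'>Hello</prosody><break time='500ms'/>"
    "<mstts:silence type='Sentenceboundary' value='200ms'/>"
    "</voice></speak>"
)
print(find_unsupported(sample))
```

A check like this is only a local convenience; the table remains the reference for what the service accepts.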
134+
> [!NOTE]
135+
> Although a [previous section in this guide](#comparison-of-azure-ai-speech-hd-voices-to-other-azure-text-to-speech-voices) also compared Azure AI Speech HD voices to Azure OpenAI HD voices, the SSML elements supported by Azure AI Speech aren't applicable to Azure OpenAI voices.
136+
137+
## Related content
138+
139+
- [Try the text to speech quickstart in Azure AI Speech](get-started-text-to-speech.md)
140+
- [Learn more about how to use SSML and events](speech-synthesis-markup-structure.md)

articles/ai-services/speech-service/includes/quickstarts/openai-speech/csharp.md

Lines changed: 2 additions & 2 deletions
@@ -228,8 +228,8 @@ PS C:\dev\openai\csharp>
228228

229229
Here are some more considerations:
230230

231-
- To change the speech recognition language, replace `en-US` with another [supported language](~/articles/ai-services/speech-service/language-support.md). For example, `es-ES` for Spanish (Spain). The default language is `en-US`. For details about how to identify one of multiple languages that might be spoken, see [language identification](~/articles/ai-services/speech-service/language-identification.md).
232-
- To change the voice that you hear, replace `en-US-JennyMultilingualNeural` with another [supported voice](~/articles/ai-services/speech-service/language-support.md#prebuilt-neural-voices). If the voice doesn't speak the language of the text returned from Azure OpenAI, the Speech service doesn't output synthesized audio.
231+
- To change the speech recognition language, replace `en-US` with another [supported language](~/articles/ai-services/speech-service/language-support.md?tabs=tts). For example, `es-ES` for Spanish (Spain). The default language is `en-US`. For details about how to identify one of multiple languages that might be spoken, see [language identification](~/articles/ai-services/speech-service/language-identification.md).
232+
- To change the voice that you hear, replace `en-US-JennyMultilingualNeural` with another [supported voice](~/articles/ai-services/speech-service/language-support.md?tabs=tts#prebuilt-neural-voices). If the voice doesn't speak the language of the text returned from Azure OpenAI, the Speech service doesn't output synthesized audio.
233233
- To reduce latency for text to speech output, use the text streaming feature, which enables real-time text processing for fast audio generation and minimizes latency, enhancing the fluidity and responsiveness of real-time audio outputs. Refer to [how to use text streaming](~/articles/ai-services/speech-service/how-to-lower-speech-synthesis-latency.md#input-text-streaming).
234234
- To enable [TTS Avatar](~/articles/ai-services/speech-service/text-to-speech-avatar/what-is-text-to-speech-avatar.md) as a visual experience of speech output, refer to [real-time synthesis for text to speech avatar](~/articles/ai-services/speech-service/text-to-speech-avatar/real-time-synthesis-avatar.md) and [sample code](https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/samples/js/browser/avatar#chat-sample) for chat scenario with avatar.
235235
- Azure OpenAI also performs content moderation on the prompt inputs and generated outputs. The prompts or responses might be filtered if harmful content is detected. For more information, see the [content filtering](/azure/ai-services/openai/concepts/content-filter) article.

articles/ai-services/speech-service/includes/release-notes/release-notes-tts.md

Lines changed: 7 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,11 +2,17 @@
22
author: eric-urban
33
ms.service: azure-ai-speech
44
ms.topic: include
5-
ms.date: 9/30/2024
5+
ms.date: 10/9/2024
66
ms.author: eur
77
ms.custom: references_regions
88
---
99

10+
### October 2024 release
11+
12+
#### Prebuilt high definition (HD) neural voice
13+
14+
Azure AI Speech high definition (HD) voices are available in public preview. The HD voices can understand the content, automatically detect emotions in the input text, and adjust the speaking tone in real time to match the sentiment. HD voices keep the same voice persona as their neural (non-HD) counterparts, and deliver even more value through enhanced features. For more information, see [What are Azure AI Speech high definition (HD) voices?](../../high-definition-voices.md).
15+
1016
### September 2024 release
1117

1218
#### Prebuilt neural voice

articles/ai-services/speech-service/releasenotes.md

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,7 @@ author: eric-urban
77
ms.author: eur
88
ms.service: azure-ai-speech
99
ms.topic: release-notes
10-
ms.date: 9/30/2024
10+
ms.date: 10/9/2024
1111
ms.custom: references_regions
1212
# Customer intent: As a developer, I want to learn about new releases and features for Azure AI Speech.
1313
---
@@ -18,9 +18,9 @@ Azure AI Speech is updated on an ongoing basis. To stay up-to-date with recent d
1818

1919
## Recent highlights
2020

21-
* Fast transcription is now available in public preview. Fast transcription allows you to transcribe audio file to text accurately and synchronously, and supports diarization to recognize and separate multiple speakers on mono channel audio. It can transcribe audio much faster than the actual audio length. For more information, see the [fast transcription API guide](fast-transcription-create.md).
21+
* Azure AI Speech high definition (HD) voices are available in public preview. The HD voices can understand the content, automatically detect emotions in the input text, and adjust the speaking tone in real time to match the sentiment. For more information, see [What are Azure AI Speech high definition (HD) voices?](high-definition-voices.md).
22+
* Fast transcription is now available in public preview. It can transcribe audio much faster than the actual audio length. For more information, see the [fast transcription API guide](fast-transcription-create.md).
2223
* Video translation is now available in the Azure AI Speech service. For more information, see [What is video translation?](./video-translation-overview.md).
23-
* Personal voice is now generally available. For more information, see [What is personal voice?](./personal-voice-overview.md).
2424
* The Azure AI Speech service supports OpenAI text to speech voices. For more information, see [What are OpenAI text to speech voices?](./openai-voices.md).
2525
* The custom voice API is available for creating and managing [professional](./professional-voice-create-project.md) and [personal](./personal-voice-create-project.md) custom neural voice models.
2626

articles/ai-services/speech-service/speech-synthesis-markup-structure.md

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -17,6 +17,14 @@ The Speech Synthesis Markup Language (SSML) with input text determines the struc
1717

1818
Refer to the sections below for details about how to structure elements in the SSML document.
1919

20+
> [!NOTE]
21+
> In addition to Azure AI Speech neural (non HD) voices, you can also use [Azure AI Speech high definition (HD) voices](high-definition-voices.md) and [Azure OpenAI neural (HD and non HD) voices](openai-voices.md). The HD voices provide a higher quality for more versatile scenarios.
22+
>
23+
> Some voices don't support all [Speech Synthesis Markup Language (SSML)](speech-synthesis-markup-structure.md) tags. This includes neural text to speech HD voices, personal voices, and embedded voices.
24+
> - For Azure AI Speech high definition (HD) voices, see the [supported and unsupported SSML elements for HD voices](high-definition-voices.md#supported-and-unsupported-ssml-elements-for-azure-ai-speech-hd-voices).
25+
> - For personal voice, see the [supported and unsupported SSML elements for personal voice](personal-voice-how-to-use.md#supported-and-unsupported-ssml-elements-for-personal-voice).
26+
> - For embedded voices, see the [embedded voices capabilities](embedded-speech.md#embedded-voices-capabilities).
27+
2028
## Document structure
2129

2230
The Speech service implementation of SSML is based on the World Wide Web Consortium's [Speech Synthesis Markup Language Version 1.0](https://www.w3.org/TR/2004/REC-speech-synthesis-20040907/). The elements supported by the Speech service can differ from the W3C standard.
