Commit 94cb8f2: Merge pull request #109209 from trevorbye/master ([Cog Svcs] Text to speech basics article)

---
author: trevorbye
ms.service: cognitive-services
ms.topic: include
ms.date: 03/25/2020
ms.author: trbye
---

## Prerequisites

This article assumes that you have an Azure account and Speech service subscription. If you don't have an account and subscription, [try the Speech service for free](../../../get-started.md).

## Install the Speech SDK

Before you can do anything, you'll need to install the Speech SDK. Depending on your platform, use the following instructions:

* <a href="https://docs.microsoft.com/azure/cognitive-services/speech-service/quickstarts/setup-platform?tabs=linux&pivots=programming-language-cpp" target="_blank">Linux <span class="docon docon-navigate-external x-hidden-focus"></span></a>
* <a href="https://docs.microsoft.com/azure/cognitive-services/speech-service/quickstarts/setup-platform?tabs=macos&pivots=programming-language-cpp" target="_blank">macOS <span class="docon docon-navigate-external x-hidden-focus"></span></a>
* <a href="https://docs.microsoft.com/azure/cognitive-services/speech-service/quickstarts/setup-platform?tabs=windows&pivots=programming-language-cpp" target="_blank">Windows <span class="docon docon-navigate-external x-hidden-focus"></span></a>

## Import dependencies

To run the examples in this article, include the following `#include` and `using` statements at the top of your script.

```cpp
#include <iostream>
#include <fstream>
#include <string>
#include <speechapi_cxx.h>

using namespace std;
using namespace Microsoft::CognitiveServices::Speech;
using namespace Microsoft::CognitiveServices::Speech::Audio;
```

## Create a speech configuration

To call the Speech service using the Speech SDK, you need to create a [`SpeechConfig`](https://docs.microsoft.com/cpp/cognitive-services/speech/speechconfig). This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.

> [!NOTE]
> Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

There are a few ways that you can initialize a [`SpeechConfig`](https://docs.microsoft.com/cpp/cognitive-services/speech/speechconfig):

* With a subscription: pass in a key and the associated region.
* With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
* With a host: pass in a host address. A key or authorization token is optional.
* With an authorization token: pass in an authorization token and the associated region.

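The options above map to static factory functions on `SpeechConfig`. As a sketch only (the endpoint, host, and token strings below are placeholders, not real resources):

```cpp
#include <speechapi_cxx.h>

using namespace Microsoft::CognitiveServices::Speech;

void createConfigs()
{
    // From a subscription key and its region.
    auto fromSubscription = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");

    // From a service endpoint, plus a key.
    auto fromEndpoint = SpeechConfig::FromEndpoint("YourEndpointUrl", "YourSubscriptionKey");

    // From a host address, plus a key.
    auto fromHost = SpeechConfig::FromHost("YourHostUrl", "YourSubscriptionKey");

    // From an authorization token and its region.
    auto fromToken = SpeechConfig::FromAuthorizationToken("YourAuthorizationToken", "YourServiceRegion");
}
```

Each factory returns a `shared_ptr` to a configured `SpeechConfig`; the rest of this article uses the subscription variant.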

In this example, you create a [`SpeechConfig`](https://docs.microsoft.com/cpp/cognitive-services/speech/speechconfig) using a subscription key and region. See the [region support](https://docs.microsoft.com/azure/cognitive-services/speech-service/regions#speech-sdk) page to find your region identifier. You also create some basic boilerplate code to use for the rest of this article, which you modify for different customizations.

```cpp
void synthesizeSpeech();

int wmain()
{
    try
    {
        synthesizeSpeech();
    }
    catch (const exception& e)
    {
        cout << e.what();
    }
    return 0;
}

void synthesizeSpeech()
{
    auto config = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");
}
```

## Synthesize speech to a file

Next, you create a [`SpeechSynthesizer`](https://docs.microsoft.com/cpp/cognitive-services/speech/speechsynthesizer) object, which executes text-to-speech conversions and outputs to speakers, files, or other output streams. The [`SpeechSynthesizer`](https://docs.microsoft.com/cpp/cognitive-services/speech/speechsynthesizer) accepts as parameters the [`SpeechConfig`](https://docs.microsoft.com/cpp/cognitive-services/speech/speechconfig) object created in the previous step, and an [`AudioConfig`](https://docs.microsoft.com/cpp/cognitive-services/speech/audio-audioconfig) object that specifies how output results should be handled.

To start, create an `AudioConfig` to automatically write the output to a `.wav` file, using the `FromWavFileOutput()` function.

```cpp
void synthesizeSpeech()
{
    auto config = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    auto audioConfig = AudioConfig::FromWavFileOutput("path/to/write/file.wav");
}
```

Next, instantiate a `SpeechSynthesizer`, passing your `config` object and the `audioConfig` object as parameters. Then, executing speech synthesis and writing to a file is as simple as running `SpeakTextAsync()` with a string of text.

```cpp
void synthesizeSpeech()
{
    auto config = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    auto audioConfig = AudioConfig::FromWavFileOutput("path/to/write/file.wav");
    auto synthesizer = SpeechSynthesizer::FromConfig(config, audioConfig);
    auto result = synthesizer->SpeakTextAsync("A simple test to write to a file.").get();
}
```

Run the program, and a synthesized `.wav` file is written to the location you specified. This is a good example of the most basic usage; next, you look at customizing output and handling the output response as an in-memory stream for custom scenarios.

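Synthesis can fail quietly when the key or region is wrong, so it's worth inspecting the result rather than assuming the file was written. A minimal sketch, assuming the `ResultReason` and `SpeechSynthesisCancellationDetails` types from the Speech SDK:

```cpp
#include <iostream>
#include <memory>
#include <speechapi_cxx.h>

using namespace Microsoft::CognitiveServices::Speech;

// Report whether a synthesis result succeeded, or why it was canceled.
void checkResult(std::shared_ptr<SpeechSynthesisResult> result)
{
    if (result->Reason == ResultReason::SynthesizingAudioCompleted)
    {
        std::cout << "Synthesis completed." << std::endl;
    }
    else if (result->Reason == ResultReason::Canceled)
    {
        // Cancellation details carry a human-readable error message.
        auto cancellation = SpeechSynthesisCancellationDetails::FromResult(result);
        std::cout << "Canceled: " << cancellation->ErrorDetails << std::endl;
    }
}
```

You could call `checkResult(result)` after any `SpeakTextAsync()` call in this article.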
## Synthesize to speaker output

In some cases, you may want to output synthesized speech directly to a speaker. To do this, omit the `AudioConfig` parameter when creating the `SpeechSynthesizer` in the example above. This outputs to the current active output device.

```cpp
void synthesizeSpeech()
{
    auto config = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    auto synthesizer = SpeechSynthesizer::FromConfig(config);
    auto result = synthesizer->SpeakTextAsync("Synthesizing directly to speaker output.").get();
}
```

## Get result as an in-memory stream

For many scenarios in speech application development, you likely need the resulting audio data as an in-memory stream rather than writing directly to a file. This allows you to build custom behavior, including:

* Abstract the resulting byte array as a seekable stream for custom downstream services.
* Integrate the result with other APIs or services.
* Modify the audio data, write custom `.wav` headers, and so on.

It's simple to make this change from the previous example. First, remove the `AudioConfig`, as you will manage the output behavior manually from this point onward for increased control. Then pass `NULL` for the `AudioConfig` in the `SpeechSynthesizer` constructor.

> [!NOTE]
> Passing `NULL` for the `AudioConfig`, rather than omitting it as in the speaker output example above, does not play the audio by default on the current active output device.

This time, you save the result to a [`SpeechSynthesisResult`](https://docs.microsoft.com/cpp/cognitive-services/speech/speechsynthesisresult) variable. The `GetAudioData` getter returns a byte vector of the output data. You can work with this data manually, or you can use the [`AudioDataStream`](https://docs.microsoft.com/cpp/cognitive-services/speech/audiodatastream) class to manage the in-memory stream. In this example, you use the `AudioDataStream::FromResult()` static function to get a stream from the result.

```cpp
void synthesizeSpeech()
{
    auto config = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    auto synthesizer = SpeechSynthesizer::FromConfig(config, NULL);

    auto result = synthesizer->SpeakTextAsync("Getting the response as an in-memory stream.").get();
    auto stream = AudioDataStream::FromResult(result);
}
```

From here you can implement any custom behavior using the resulting `stream` object.

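As one illustration of such custom behavior, the sketch below drains the stream into a contiguous buffer chunk by chunk. It assumes the `ReadData()` member function of `AudioDataStream` from the Speech SDK, which copies up to the requested number of bytes and returns how many were actually read:

```cpp
#include <cstdint>
#include <memory>
#include <vector>
#include <speechapi_cxx.h>

using namespace Microsoft::CognitiveServices::Speech;

// Read an AudioDataStream into a single byte buffer, 4 KB at a time.
std::vector<uint8_t> drainStream(std::shared_ptr<AudioDataStream> stream)
{
    std::vector<uint8_t> audio;
    uint8_t chunk[4096];
    // ReadData returns the number of bytes copied; 0 signals end of stream.
    while (uint32_t read = stream->ReadData(chunk, sizeof(chunk)))
    {
        audio.insert(audio.end(), chunk, chunk + read);
    }
    return audio;
}
```

The resulting vector can then be forwarded to any downstream service that expects raw audio bytes.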
## Customize audio format

The following section shows how to customize audio output attributes, including:

* Audio file type
* Sample rate
* Bit depth

To change the audio format, you use the `SetSpeechSynthesisOutputFormat()` function on the `SpeechConfig` object. This function expects an `enum` of type [`SpeechSynthesisOutputFormat`](https://docs.microsoft.com/cpp/cognitive-services/speech/microsoft-cognitiveservices-speech-namespace#speechsynthesisoutputformat), which you use to select the output format. See the reference docs for a [list of audio formats](https://docs.microsoft.com/cpp/cognitive-services/speech/microsoft-cognitiveservices-speech-namespace#speechsynthesisoutputformat) that are available.

There are various options for different file types, depending on your requirements. Note that by definition, raw formats like `Raw24Khz16BitMonoPcm` do not include audio headers. Use raw formats only when you know your downstream implementation can decode a raw bitstream, or if you plan on manually building headers based on bit depth, sample rate, number of channels, and so on.

In this example, you specify a high-fidelity RIFF format, `Riff24Khz16BitMonoPcm`, by setting the `SpeechSynthesisOutputFormat` on the `SpeechConfig` object. Similar to the example in the previous section, you use [`AudioDataStream`](https://docs.microsoft.com/cpp/cognitive-services/speech/audiodatastream) to get an in-memory stream of the result, and then write it to a file.

```cpp
void synthesizeSpeech()
{
    auto config = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    config->SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat::Riff24Khz16BitMonoPcm);

    auto synthesizer = SpeechSynthesizer::FromConfig(config, NULL);
    auto result = synthesizer->SpeakTextAsync("A simple test to write to a file.").get();

    auto stream = AudioDataStream::FromResult(result);
    stream->SaveToWavFileAsync("path/to/write/file.wav").get();
}
```

Running your program again will write a `.wav` file to the specified path.

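If you choose a raw format, you own the header. As a self-contained sketch (this helper is not part of the Speech SDK), the function below builds the standard 44-byte RIFF/WAV header for linear PCM, which you could prepend to the raw bytes produced by a format like `Raw24Khz16BitMonoPcm`:

```cpp
#include <cstdint>
#include <vector>

// Build a 44-byte RIFF/WAV header for raw linear PCM data.
// dataSize is the size of the PCM payload in bytes.
std::vector<uint8_t> makeWavHeader(uint32_t sampleRate, uint16_t bitsPerSample,
                                   uint16_t channels, uint32_t dataSize)
{
    std::vector<uint8_t> h;
    // Little-endian writers for 16-bit and 32-bit fields, and 4-char tags.
    auto u16 = [&h](uint16_t v) { h.push_back(v & 0xFF); h.push_back(v >> 8); };
    auto u32 = [&h](uint32_t v) { for (int i = 0; i < 4; ++i) h.push_back((v >> (8 * i)) & 0xFF); };
    auto tag = [&h](const char* s) { h.insert(h.end(), s, s + 4); };

    uint16_t blockAlign = channels * bitsPerSample / 8;
    tag("RIFF"); u32(36 + dataSize); tag("WAVE");
    tag("fmt "); u32(16);            // PCM format chunk is 16 bytes
    u16(1);                          // audio format 1 = linear PCM
    u16(channels);
    u32(sampleRate);
    u32(sampleRate * blockAlign);    // byte rate
    u16(blockAlign);
    u16(bitsPerSample);
    tag("data"); u32(dataSize);
    return h;
}
```

For `Raw24Khz16BitMonoPcm`, you would call `makeWavHeader(24000, 16, 1, dataSize)` and write the header followed by the raw audio bytes.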
## Use SSML to customize speech characteristics

Speech Synthesis Markup Language (SSML) allows you to fine-tune the pitch, pronunciation, speaking rate, volume, and more of the text-to-speech output by submitting your requests using an XML schema. This section shows a few practical usage examples, but for a more detailed guide, see the [SSML how-to article](../../../speech-synthesis-markup.md).

To start using SSML for customization, you make a simple change that switches the voice. First, create a new XML file for the SSML config in your root project directory, in this example `ssml.xml`. The root element is always `<speak>`, and wrapping the text in a `<voice>` element allows you to change the voice using the `name` attribute. This example changes the voice to a male English (UK) voice. Note that this voice is a **standard** voice, which has different pricing and availability than **neural** voices. See the [full list](https://docs.microsoft.com/azure/cognitive-services/speech-service/language-support#standard-voices) of supported **standard** voices.

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-GB-George-Apollo">
    When you're on the motorway, it's a good idea to use a sat-nav.
  </voice>
</speak>
```

Next, you need to change the speech synthesis request to reference your XML file. The request is mostly the same, but instead of using the `SpeakTextAsync()` function, you use `SpeakSsmlAsync()`. This function expects an XML string, so you first load your SSML config as a string. From here, the result object is exactly the same as in previous examples.

```cpp
void synthesizeSpeech()
{
    auto config = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    auto synthesizer = SpeechSynthesizer::FromConfig(config, NULL);

    std::ifstream file("./ssml.xml");
    std::string ssml, line;
    while (std::getline(file, line))
    {
        ssml += line;
        ssml.push_back('\n');
    }
    auto result = synthesizer->SpeakSsmlAsync(ssml).get();

    auto stream = AudioDataStream::FromResult(result);
    stream->SaveToWavFileAsync("path/to/write/file.wav").get();
}
```

The output works, but there are a few simple additional changes you can make to help it sound more natural. The overall speaking speed is a little too fast, so we'll add a `<prosody>` tag and reduce the speed to **90%** of the default rate. Additionally, the pause after the comma in the sentence is a little too short and unnatural sounding. To fix this issue, add a `<break>` tag to delay the speech, and set the `time` attribute to **200ms**. Re-run the synthesis to see how these customizations affected the output.

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-GB-George-Apollo">
    <prosody rate="0.9">
      When you're on the motorway,<break time="200ms"/> it's a good idea to use a sat-nav.
    </prosody>
  </voice>
</speak>
```

## Neural voices

Neural voices are speech synthesis algorithms powered by deep neural networks. When you use a neural voice, the synthesized speech is nearly indistinguishable from human recordings. With human-like natural prosody and clear articulation of words, neural voices significantly reduce listening fatigue when users interact with AI systems.

To switch to a neural voice, change the `name` to one of the [neural voice options](https://docs.microsoft.com/azure/cognitive-services/speech-service/language-support#neural-voices). Then, add an XML namespace for `mstts`, and wrap your text in the `<mstts:express-as>` tag. Use the `style` attribute to customize the speaking style. This example uses `cheerful`, but try setting it to `customerservice` or `chat` to see the difference in speaking style.

> [!IMPORTANT]
> Neural voices are **only** supported for Speech resources created in the *East US*, *South East Asia*, and *West Europe* regions.

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-AriaNeural">
    <mstts:express-as style="cheerful">
      This is awesome!
    </mstts:express-as>
  </voice>
</speak>
```
