---
author: trevorbye
ms.service: cognitive-services
ms.topic: include
ms.date: 03/25/2020
ms.author: trbye
---

## Prerequisites

This article assumes that you have an Azure account and Speech service subscription. If you don't have an account and subscription, [try the Speech service for free](../../../get-started.md).

## Install the Speech SDK

Before you can do anything, you'll need to install the Speech SDK. Depending on your platform, use the following instructions:

* <a href="https://docs.microsoft.com/azure/cognitive-services/speech-service/quickstarts/setup-platform?tabs=dotnet&pivots=programming-language-csharp" target="_blank">.NET Framework <span class="docon docon-navigate-external x-hidden-focus"></span></a>
* <a href="https://docs.microsoft.com/azure/cognitive-services/speech-service/quickstarts/setup-platform?tabs=dotnetcore&pivots=programming-language-csharp" target="_blank">.NET Core <span class="docon docon-navigate-external x-hidden-focus"></span></a>
* <a href="https://docs.microsoft.com/azure/cognitive-services/speech-service/quickstarts/setup-platform?tabs=unity&pivots=programming-language-csharp" target="_blank">Unity <span class="docon docon-navigate-external x-hidden-focus"></span></a>
* <a href="https://docs.microsoft.com/azure/cognitive-services/speech-service/quickstarts/setup-platform?tabs=uwps&pivots=programming-language-csharp" target="_blank">UWP <span class="docon docon-navigate-external x-hidden-focus"></span></a>
* <a href="https://docs.microsoft.com/azure/cognitive-services/speech-service/quickstarts/setup-platform?tabs=xaml&pivots=programming-language-csharp" target="_blank">Xamarin <span class="docon docon-navigate-external x-hidden-focus"></span></a>

## Import dependencies

To run the examples in this article, include the following `using` statements at the top of your script.

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;
using System.Xml.Linq;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
```

## Create a speech configuration

To call the Speech service using the Speech SDK, you need to create a [`SpeechConfig`](https://docs.microsoft.com/dotnet/api/microsoft.cognitiveservices.speech.speechconfig?view=azure-dotnet). This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.

> [!NOTE]
> Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

There are a few ways that you can initialize a [`SpeechConfig`](https://docs.microsoft.com/dotnet/api/microsoft.cognitiveservices.speech.speechconfig?view=azure-dotnet):

* With a subscription: pass in a key and the associated region.
* With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
* With a host: pass in a host address. A key or authorization token is optional.
* With an authorization token: pass in an authorization token and the associated region.

In this example, you create a [`SpeechConfig`](https://docs.microsoft.com/dotnet/api/microsoft.cognitiveservices.speech.speechconfig?view=azure-dotnet) using a subscription key and region. You also create some basic boilerplate code to use for the rest of this article, which you modify for different customizations.

```csharp
public class Program
{
    public static async Task SynthesizeAudioAsync()
    {
        var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    }

    static void Main(string[] args)
    {
        SynthesizeAudioAsync().Wait();
    }
}
```

## Synthesize speech to a file

Next, you create a [`SpeechSynthesizer`](https://docs.microsoft.com/dotnet/api/microsoft.cognitiveservices.speech.speechsynthesizer?view=azure-dotnet) object, which executes text-to-speech conversions and outputs to speakers, files, or other output streams. The [`SpeechSynthesizer`](https://docs.microsoft.com/dotnet/api/microsoft.cognitiveservices.speech.speechsynthesizer?view=azure-dotnet) accepts as params the [`SpeechConfig`](https://docs.microsoft.com/dotnet/api/microsoft.cognitiveservices.speech.speechconfig?view=azure-dotnet) object created in the previous step, and an [`AudioConfig`](https://docs.microsoft.com/dotnet/api/microsoft.cognitiveservices.speech.audio.audioconfig?view=azure-dotnet) object that specifies how output results should be handled.

To start, create an `AudioConfig` to automatically write the output to a `.wav` file using the `FromWavFileOutput()` function, and wrap it in a `using` block.

```csharp
public static async Task SynthesizeAudioAsync()
{
    var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    using (var audioConfig = AudioConfig.FromWavFileOutput("path/to/write/file.wav"))
    {
    }
}
```

Next, inside the `using` block you just created, create a nested `using` block and initialize the `SpeechSynthesizer`. Pass your `config` object and the `audioConfig` object as params. Then, executing speech synthesis and writing to a file is as simple as running `SpeakTextAsync()` with a string of text.

```csharp
public static async Task SynthesizeAudioAsync()
{
    var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    using (var audioConfig = AudioConfig.FromWavFileOutput("path/to/write/file.wav"))
    {
        using (var synthesizer = new SpeechSynthesizer(config, audioConfig))
        {
            await synthesizer.SpeakTextAsync("A simple test to write to a file.");
        }
    }
}
```

Run the program, and a synthesized `.wav` file is written to the location you specified. This is a good example of the most basic usage; next, you look at customizing output and handling the output response as an in-memory stream for working with custom scenarios.

## Get result as an in-memory stream

For many scenarios in speech application development, you likely need the resulting audio data as an in-memory stream rather than written directly to a file. This allows you to build custom behavior, including:

* Abstract the resulting byte array as a seek-able stream for custom downstream services.
* Integrate the result with other APIs or services.
* Modify the audio data, write custom `.wav` headers, and so on.
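
The first of these, exposing the result as a seek-able stream, can be sketched without the Speech SDK at all; the byte array below is a stand-in for `result.AudioData`:

```csharp
using System;
using System.IO;

public static class SeekableStreamSketch
{
    public static void Main()
    {
        // Stand-in for result.AudioData; real bytes would come from the synthesizer.
        byte[] audioData = { 0x01, 0x02, 0x03, 0x04, 0x05, 0x06 };

        using (var stream = new MemoryStream(audioData, writable: false))
        {
            // One downstream consumer reads part of the data...
            var buffer = new byte[4];
            stream.Read(buffer, 0, buffer.Length);

            // ...then the stream is rewound so another consumer can re-read it.
            stream.Seek(0, SeekOrigin.Begin);
            Console.WriteLine(stream.Position);  // 0
            Console.WriteLine(stream.Length);    // 6
        }
    }
}
```

Any API that accepts a `Stream` can consume the `MemoryStream` this way without touching the file system.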

It's simple to make this change from the previous example. First, remove the `AudioConfig` block, as you will manage the output behavior manually from this point onward for increased control. Then, pass `null` for the `AudioConfig` in the `SpeechSynthesizer` constructor.

This time, you save the result to a [`SpeechSynthesisResult`](https://docs.microsoft.com/dotnet/api/microsoft.cognitiveservices.speech.speechsynthesisresult?view=azure-dotnet) variable. The `AudioData` property contains a `byte []` of the output data. Grab the `byte []` and write it to a new `MemoryStream`. From here you can implement any custom behavior using the resulting output, but in this example you write to a file manually.

```csharp
public static async Task SynthesizeAudioAsync()
{
    var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    using (var synthesizer = new SpeechSynthesizer(config, null))
    {
        var result = await synthesizer.SpeakTextAsync("Getting the response as a memory stream.");
        MemoryStream stream = new MemoryStream();
        stream.Write(result.AudioData, 0, result.AudioData.Length);

        FileStream fs = File.Create("path/to/write/file.wav");
        stream.WriteTo(fs);
        fs.Close();
        stream.Close();
    }
}
```

## Customize audio format

The following section shows how to customize audio output attributes, including:

* Audio file type
* Sample rate
* Bit depth

To change the audio format, you use the `SetSpeechSynthesisOutputFormat()` function on the `SpeechConfig` object. This function expects an `enum` of type [`SpeechSynthesisOutputFormat`](https://docs.microsoft.com/dotnet/api/microsoft.cognitiveservices.speech.speechsynthesisoutputformat?view=azure-dotnet), which you use to select the output format. See the reference docs for a [list of audio formats](https://docs.microsoft.com/dotnet/api/microsoft.cognitiveservices.speech.speechsynthesisoutputformat?view=azure-dotnet) that are available.

There are various options for different file types, depending on your requirements. In this example, you specify the high-fidelity raw format `Raw24Khz16BitMonoPcm`, which requires writing the `.wav` headers manually. If you choose a compressed format such as `Audio24Khz96KBitRateMonoMp3`, you don't need to write headers.
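
To see the tradeoff between these two formats, compare the data rates implied by the enum names (the figures below are read directly from the names themselves):

```csharp
using System;

public static class BitRateComparison
{
    public static void Main()
    {
        // Raw24Khz16BitMonoPcm: 24,000 samples/sec * 16 bits/sample * 1 channel.
        int rawBitsPerSecond = 24000 * 16 * 1;
        Console.WriteLine(rawBitsPerSecond / 1000);  // 384 kbps uncompressed

        // Audio24Khz96KBitRateMonoMp3 encodes the same audio at 96 kbps,
        // roughly a 4x reduction in size.
        Console.WriteLine(384 / 96);                 // 4
    }
}
```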

First, create a function `WriteWavHeader()` to write the necessary audio metadata to the front of your `MemoryStream`. Since `Raw24Khz16BitMonoPcm` is a raw audio format, you need to write standardized audio file headers so that other software knows information like the number of channels, sample rate, and bit depth when your file is played.

```csharp
private static void WriteWavHeader(MemoryStream stream, bool isFloatingPoint, ushort channelCount, ushort bitDepth, int sampleRate, int totalSampleCount)
{
    stream.Position = 0;

    // RIFF chunk descriptor: the chunk size is the data size plus the 36 header bytes that follow.
    stream.Write(Encoding.ASCII.GetBytes("RIFF"), 0, 4);
    stream.Write(BitConverter.GetBytes(((bitDepth / 8) * totalSampleCount) + 36), 0, 4);
    stream.Write(Encoding.ASCII.GetBytes("WAVE"), 0, 4);

    // "fmt " sub-chunk: 16 bytes of format metadata.
    stream.Write(Encoding.ASCII.GetBytes("fmt "), 0, 4);
    stream.Write(BitConverter.GetBytes(16), 0, 4);

    // Audio format: floating point (3) or PCM (1). Any other format indicates compression.
    stream.Write(BitConverter.GetBytes((ushort)(isFloatingPoint ? 3 : 1)), 0, 2);

    stream.Write(BitConverter.GetBytes(channelCount), 0, 2);
    stream.Write(BitConverter.GetBytes(sampleRate), 0, 4);

    // Byte rate, block align, and bits per sample.
    stream.Write(BitConverter.GetBytes(sampleRate * channelCount * (bitDepth / 8)), 0, 4);
    stream.Write(BitConverter.GetBytes((ushort)(channelCount * (bitDepth / 8))), 0, 2);
    stream.Write(BitConverter.GetBytes(bitDepth), 0, 2);

    // "data" sub-chunk: the size of the raw audio that follows.
    stream.Write(Encoding.ASCII.GetBytes("data"), 0, 4);
    stream.Write(BitConverter.GetBytes((bitDepth / 8) * totalSampleCount), 0, 4);
}
```
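
As a quick sanity check on the header layout (independent of the Speech SDK), the following standalone sketch rebuilds the same 44-byte PCM header with a `BinaryWriter` and verifies the chunk sizes; the one-second sample count is arbitrary:

```csharp
using System;
using System.IO;
using System.Text;

public static class WavHeaderCheck
{
    // Standalone copy of the header layout above, written with BinaryWriter
    // for brevity. The values match Raw24Khz16BitMonoPcm: 1 channel, 16-bit, 24 kHz.
    public static byte[] BuildHeader(ushort channels, ushort bitDepth, int sampleRate, int totalSampleCount)
    {
        var stream = new MemoryStream();
        var writer = new BinaryWriter(stream);
        int dataSize = (bitDepth / 8) * totalSampleCount;

        writer.Write(Encoding.ASCII.GetBytes("RIFF"));
        writer.Write(dataSize + 36);                            // RIFF chunk size
        writer.Write(Encoding.ASCII.GetBytes("WAVE"));
        writer.Write(Encoding.ASCII.GetBytes("fmt "));
        writer.Write(16);                                       // "fmt " sub-chunk size
        writer.Write((ushort)1);                                // PCM
        writer.Write(channels);
        writer.Write(sampleRate);
        writer.Write(sampleRate * channels * (bitDepth / 8));   // byte rate
        writer.Write((ushort)(channels * (bitDepth / 8)));      // block align
        writer.Write(bitDepth);
        writer.Write(Encoding.ASCII.GetBytes("data"));
        writer.Write(dataSize);                                 // data sub-chunk size
        return stream.ToArray();
    }

    public static void Main()
    {
        // One second of 24 kHz 16-bit mono audio: 24,000 samples, 48,000 data bytes.
        byte[] header = BuildHeader(1, 16, 24000, 24000);
        Console.WriteLine(header.Length);                       // 44
        Console.WriteLine(BitConverter.ToInt32(header, 4));     // 48036 (data + 36)
    }
}
```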

Next, set the `SpeechSynthesisOutputFormat` on the `SpeechConfig` object. Similar to the example in the previous section, you write the `byte []` from the result to a `MemoryStream`, but first you must write the custom `.wav` headers for the chosen file type. Use the function you created above, passing the memory stream by reference. For the other params, the number of **channels** is 1 (mono), the **bit depth** is 16, the **sample rate** is 24,000 (24 kHz), and the **total samples** is the length of the raw `byte []` from the `SpeechSynthesisResult` divided by 2, since each 16-bit sample occupies 2 bytes.

```csharp
public static async Task SynthesizeAudioAsync()
{
    var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    config.SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Raw24Khz16BitMonoPcm);

    using (var synthesizer = new SpeechSynthesizer(config, null))
    {
        var result = await synthesizer.SpeakTextAsync("Customizing audio output.");
        MemoryStream stream = new MemoryStream();

        // First write the headers to the front of the stream. Each 16-bit
        // sample is 2 bytes, so the sample count is half the byte length.
        WriteWavHeader(stream, false, 1, 16, 24000, result.AudioData.Length / 2);
        stream.Write(result.AudioData, 0, result.AudioData.Length);

        FileStream fs = File.Create("path/to/write/file.wav");
        stream.WriteTo(fs);
        fs.Close();
        stream.Close();
    }
}
```

Running your program again writes a custom-formatted `.wav` file to the specified path.

## Use SSML to customize speech characteristics

Speech Synthesis Markup Language (SSML) allows you to fine-tune the pitch, pronunciation, speaking rate, volume, and more of the text-to-speech output by submitting your requests from an XML schema. This section shows a few practical usage examples, but for a more detailed guide, see the [SSML how-to article](../../../speech-synthesis-markup.md).

To start using SSML for customization, you make a simple change that switches the voice. First, create a new XML file for the SSML config in your root project directory, in this example `ssml.xml`. The root element is always `<speak>`, and wrapping the text in a `<voice>` element allows you to change the voice using the `name` param. This example changes the voice to a male English (UK) voice. See the [full list](https://docs.microsoft.com/azure/cognitive-services/speech-service/language-support#standard-voices) of supported standard voices for additional options.

```xml
<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-GB-George-Apollo">
    When you're on the motorway, it's a good idea to use a sat nav.
  </voice>
</speak>
```

Next, you need to change the speech synthesis request to reference your XML file. The request is mostly the same, but instead of using the `SpeakTextAsync()` function, you use `SpeakSsmlAsync()`. This function expects an XML string, so you first load your SSML config as a string using `XDocument.Load()`. From here, the result object is exactly the same as in previous examples.

> [!NOTE]
> If you're using Visual Studio, your build config likely will not find your XML file by default. To fix this, right-click the XML file and
> select **Properties**. Change **Build Action** to *Content*, and change **Copy to Output Directory** to *Copy always*.

```csharp
public static async Task SynthesizeAudioAsync()
{
    var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    using (var synthesizer = new SpeechSynthesizer(config, null))
    {
        string ssml = XDocument.Load(@"./ssml.xml").ToString();
        var result = await synthesizer.SpeakSsmlAsync(ssml);

        MemoryStream stream = new MemoryStream();
        stream.Write(result.AudioData, 0, result.AudioData.Length);

        FileStream fs = File.Create("path/to/write/file.wav");
        stream.WriteTo(fs);
        fs.Close();
        stream.Close();
    }
}
```

The output works, but there are a few simple changes you can make to help it sound more natural. The overall speaking speed is a little too fast, so we'll add a `<prosody>` tag and reduce the speed to **90%** of the default rate. Additionally, the pause after the comma in the sentence is a little too short and unnatural-sounding. To fix these issues, add a `<break>` tag to delay the speech, and set the time param to **200ms**. Re-run the synthesis to see how these customizations affect the output.

```xml
<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-GB-George-Apollo">
    <prosody rate="0.9">
      When you're on the motorway,<break time="200ms"/> it's a good idea to use a sat-nav.
    </prosody>
  </voice>
</speak>
```

### Neural voices

Neural voices are speech synthesis algorithms powered by deep neural networks. When you use a neural voice, synthesized speech is nearly indistinguishable from human recordings. With human-like natural prosody and clear articulation of words, neural voices significantly reduce listening fatigue when users interact with AI systems.

To switch to a neural voice, change the `name` to one of the [neural voice options](https://docs.microsoft.com/azure/cognitive-services/speech-service/language-support#neural-voices). Then, add an XML namespace for `mstts`, and wrap your text in the `<mstts:express-as>` tag. Use the `style` param to customize the speaking style. This example uses `cheerful`, but try setting it to `customerservice` or `chat` to see the difference in speaking style.

> [!IMPORTANT]
> Neural voices are **only** supported for Speech resources created in the *East US*, *South East Asia*, and *West Europe* regions.

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-AriaNeural">
    <mstts:express-as style="cheerful">
      This is awesome!
    </mstts:express-as>
  </voice>
</speak>
```
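
If you'd rather build the SSML in code than maintain an XML file, you can assemble an equivalent document with the `XElement` API from `System.Xml.Linq` (a sketch, mirroring the namespaces and elements shown above):

```csharp
using System;
using System.Xml.Linq;

public static class SsmlBuilder
{
    public static string Build()
    {
        // Namespaces as used in the XML example above.
        XNamespace ns = "http://www.w3.org/2001/10/synthesis";
        XNamespace mstts = "https://www.w3.org/2001/mstts";

        var speak = new XElement(ns + "speak",
            new XAttribute("version", "1.0"),
            new XAttribute(XNamespace.Xmlns + "mstts", mstts),
            new XAttribute(XNamespace.Xml + "lang", "en-US"),
            new XElement(ns + "voice",
                new XAttribute("name", "en-US-AriaNeural"),
                new XElement(mstts + "express-as",
                    new XAttribute("style", "cheerful"),
                    "This is awesome!")));

        return speak.ToString();
    }

    public static void Main()
    {
        // The resulting string can be passed to SpeakSsmlAsync() in place of
        // the XDocument.Load() approach shown earlier.
        Console.WriteLine(Build());
    }
}
```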