
Commit 8f4e36c

committed
Initial pass of JS for synthesis
1 parent a274864 commit 8f4e36c

2 files changed: +273 -1 lines changed

Lines changed: 268 additions & 0 deletions
@@ -0,0 +1,268 @@
---
author: trevorbye
ms.service: cognitive-services
ms.topic: include
ms.date: 04/14/2020
ms.author: trbye
---

## Prerequisites

This article assumes that you have an Azure account and Speech service subscription. If you don't have an account and subscription, [try the Speech service for free](../../../get-started.md).

## Install the Speech SDK

Before you can do anything, you'll need to install the <a href="https://www.npmjs.com/package/microsoft-cognitiveservices-speech-sdk" target="_blank">JavaScript Speech SDK <span class="docon docon-navigate-external x-hidden-focus"></span></a>. Depending on your platform, use the following instructions:

- <a href="https://docs.microsoft.com/azure/cognitive-services/speech-service/speech-sdk?tabs=nodejs#get-the-speech-sdk" target="_blank">Node.js <span class="docon docon-navigate-external x-hidden-focus"></span></a>
- <a href="https://docs.microsoft.com/azure/cognitive-services/speech-service/speech-sdk?tabs=browser#get-the-speech-sdk" target="_blank">Web Browser <span class="docon docon-navigate-external x-hidden-focus"></span></a>

Additionally, depending on the target environment, use one of the following:

# [import](#tab/import)

```javascript
import * as sdk from "microsoft-cognitiveservices-speech-sdk";
```

For more information on `import`, see <a href="https://javascript.info/import-export" target="_blank">export and import <span class="docon docon-navigate-external x-hidden-focus"></span></a>.

# [require](#tab/require)

```javascript
const sdk = require("microsoft-cognitiveservices-speech-sdk");
```

For more information on `require`, see <a href="https://nodejs.org/en/knowledge/getting-started/what-is-require/" target="_blank">what is require? <span class="docon docon-navigate-external x-hidden-focus"></span></a>.

# [script](#tab/script)

Download and extract the <a href="https://aka.ms/csspeech/jsbrowserpackage" target="_blank">JavaScript Speech SDK <span class="docon docon-navigate-external x-hidden-focus"></span></a> *microsoft.cognitiveservices.speech.sdk.bundle.js* file, and place it in a folder accessible to your HTML file.

```html
<script src="microsoft.cognitiveservices.speech.sdk.bundle.js"></script>
```

> [!TIP]
> If you're targeting a web browser and using the `<script>` tag, the `sdk` prefix is not needed. The `sdk` prefix is an alias we use to name our `import` or `require` module.

---

## Create a speech configuration

To call the Speech service using the Speech SDK, you need to create a [`SpeechConfig`](https://docs.microsoft.com/javascript/api/microsoft-cognitiveservices-speech-sdk/speechconfig?view=azure-node-latest). This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.

> [!NOTE]
> Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

There are a few ways that you can initialize a [`SpeechConfig`](https://docs.microsoft.com/javascript/api/microsoft-cognitiveservices-speech-sdk/speechconfig?view=azure-node-latest), as sketched below:

* With a subscription: pass in a key and the associated region.
* With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
* With a host: pass in a host address. A key or authorization token is optional.
* With an authorization token: pass in an authorization token and the associated region.

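For reference, the corresponding factory functions look something like the following sketch. The keys, region, and URLs are placeholders, and you typically use only one of these options:

```javascript
// 1. Subscription key and region
const fromKey = sdk.SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");

// 2. Endpoint (a key or authorization token may still be required, depending on the endpoint)
const fromEndpoint = sdk.SpeechConfig.fromEndpoint(new URL("wss://YourEndpointUrl"), "YourSubscriptionKey");

// 3. Host address
const fromHost = sdk.SpeechConfig.fromHost(new URL("wss://YourHostUrl"), "YourSubscriptionKey");

// 4. Authorization token and region
const fromToken = sdk.SpeechConfig.fromAuthorizationToken("YourAuthorizationToken", "YourServiceRegion");
```
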
In this example, you create a [`SpeechConfig`](https://docs.microsoft.com/javascript/api/microsoft-cognitiveservices-speech-sdk/speechconfig?view=azure-node-latest) using a subscription key and region. See the [region support](https://docs.microsoft.com/azure/cognitive-services/speech-service/regions#speech-sdk) page to find your region identifier. You also create some basic boilerplate code to use for the rest of this article, which you modify for different customizations.

```javascript
function synthesizeSpeech() {
    const speechConfig = sdk.SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    const synthesizer = new sdk.SpeechSynthesizer(speechConfig);
}
```

## Synthesize speech to a file

Next, you create a [`SpeechSynthesizer`](https://docs.microsoft.com/javascript/api/microsoft-cognitiveservices-speech-sdk/speechsynthesizer?view=azure-node-latest) object, which executes text-to-speech conversions and outputs to speakers, files, or other output streams. The [`SpeechSynthesizer`](https://docs.microsoft.com/javascript/api/microsoft-cognitiveservices-speech-sdk/speechsynthesizer?view=azure-node-latest) accepts as params the [`SpeechConfig`](https://docs.microsoft.com/javascript/api/microsoft-cognitiveservices-speech-sdk/speechconfig?view=azure-node-latest) object created in the previous step, and an [`AudioConfig`](https://docs.microsoft.com/javascript/api/microsoft-cognitiveservices-speech-sdk/audioconfig?view=azure-node-latest) object that specifies how output results should be handled.

To start, create an `AudioConfig` to automatically write the output to a `.wav` file using the `fromWavFileOutput()` static function.

```javascript
function synthesizeSpeech() {
    const speechConfig = sdk.SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    const audioConfig = sdk.AudioConfig.fromWavFileOutput("path/to/write/file.wav");
}
```

Next, instantiate a `SpeechSynthesizer`, passing your `speechConfig` object and the `audioConfig` object as params. Then, executing speech synthesis and writing to a file is as simple as running `speakTextAsync()` with a string of text.

```javascript
function synthesizeSpeech() {
    const speechConfig = sdk.SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    const audioConfig = sdk.AudioConfig.fromWavFileOutput("path/to/write/file.wav");

    const synthesizer = new sdk.SpeechSynthesizer(speechConfig, audioConfig);
    synthesizer.speakTextAsync("A simple test to write to a file.");
}
```

Run the program, and a synthesized `.wav` file is written to the location you specified. This is a good example of the most basic usage. Next, you look at customizing output and handling the output response as an in-memory stream to support custom scenarios.
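
The call above is fire-and-forget. If you want to know when synthesis finishes, or release the synthesizer when you're done, `speakTextAsync()` also accepts optional result and error callbacks. The following sketch shows that pattern; the `synthesizeSpeechWithCallbacks` name is just for illustration:

```javascript
function synthesizeSpeechWithCallbacks() {
    const speechConfig = sdk.SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    const audioConfig = sdk.AudioConfig.fromWavFileOutput("path/to/write/file.wav");
    const synthesizer = new sdk.SpeechSynthesizer(speechConfig, audioConfig);

    synthesizer.speakTextAsync(
        "A simple test to write to a file.",
        result => {
            // Synthesis finished; release the underlying resources.
            synthesizer.close();
        },
        error => {
            console.log(error);
            synthesizer.close();
        });
}
```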

## Synthesize to speaker output

In some cases, you may want to output synthesized speech directly to a speaker. To do this, instantiate the `AudioConfig` using the `fromDefaultSpeakerOutput()` static function. This outputs to the current active output device.

```javascript
function synthesizeSpeech() {
    const speechConfig = sdk.SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    const audioConfig = sdk.AudioConfig.fromDefaultSpeakerOutput();

    const synthesizer = new sdk.SpeechSynthesizer(speechConfig, audioConfig);
    synthesizer.speakTextAsync("Synthesizing directly to speaker output.");
}
```

## Get result as an in-memory stream

For many scenarios in speech application development, you likely need the resulting audio data as an in-memory stream rather than writing it directly to a file. This allows you to build custom behavior, including:

* Abstract the resulting byte array as a seek-able stream for custom downstream services.
* Integrate the result with other APIs or services.
* Modify the audio data, write custom `.wav` headers, etc.

It's simple to make this change from the previous example. First, remove the `AudioConfig` block, as you will manage the output behavior manually from this point onward for increased control. Then pass `null` for the `AudioConfig` in the `SpeechSynthesizer` constructor.

> [!NOTE]
> Passing `null` for the `AudioConfig`, rather than omitting it as in the earlier boilerplate example, will not play the audio by default on the current active output device.

This time, you save the result to a [`SpeechSynthesisResult`](https://docs.microsoft.com/javascript/api/microsoft-cognitiveservices-speech-sdk/speechsynthesisresult?view=azure-node-latest) variable. The `SpeechSynthesisResult.audioData` property returns an `ArrayBuffer` of the output data. You can work with this `ArrayBuffer` manually.

```javascript
function synthesizeSpeech() {
    const speechConfig = sdk.SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    const synthesizer = new sdk.SpeechSynthesizer(speechConfig, null);

    synthesizer.speakTextAsync(
        "Getting the response as an in-memory stream.",
        result => {
            // Interact with the audio ArrayBuffer data
            const audioData = result.audioData;
        },
        error => console.log(error));
}
```

From here you can implement any custom behavior using the resulting `ArrayBuffer` object.
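
For example, in Node.js you can write the buffer to disk yourself. The following sketch assumes a RIFF output format (like the `Riff24Khz16BitMonoPcm` format used in the next section), so the data already contains `.wav` headers; `saveToFile` is a hypothetical helper:

```javascript
const fs = require("fs");

// Write the synthesized ArrayBuffer to a .wav file (hypothetical helper).
function saveToFile(audioData, filePath) {
    // Wrap the ArrayBuffer in a Node.js Buffer before writing it out.
    fs.writeFileSync(filePath, Buffer.from(audioData));
}
```

You could call `saveToFile(result.audioData, "result.wav")` from inside the `result` callback shown above.
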
## Customize audio format

The following section shows how to customize audio output attributes, including:

* Audio file type
* Sample rate
* Bit depth

To change the audio format, you set the `speechSynthesisOutputFormat` property on the `SpeechConfig` object. This property expects an `enum` of type [`SpeechSynthesisOutputFormat`](https://docs.microsoft.com/javascript/api/microsoft-cognitiveservices-speech-sdk/speechsynthesisoutputformat?view=azure-node-latest), which you use to select the output format. See the reference docs for a [list of audio formats](https://docs.microsoft.com/javascript/api/microsoft-cognitiveservices-speech-sdk/speechsynthesisoutputformat?view=azure-node-latest) that are available.

There are various options for different file types depending on your requirements. Note that by definition, raw formats like `Raw24Khz16BitMonoPcm` do not include audio headers. Use raw formats only when you know your downstream implementation can decode a raw bitstream, or if you plan on manually building headers based on bit depth, sample rate, number of channels, and so on.

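The Speech SDK doesn't build these headers for you. If you choose a raw format and later need a playable `.wav` file, one option is to prepend a standard 44-byte RIFF/WAVE header yourself. The sketch below assumes uncompressed PCM data and uses a hypothetical `addWavHeader()` helper:

```javascript
// Prepend a minimal RIFF/WAVE header to raw PCM data (hypothetical helper).
// For Raw24Khz16BitMonoPcm you would call addWavHeader(rawPcm, 24000, 16, 1).
function addWavHeader(rawPcm, sampleRate, bitDepth, channels) {
    const byteRate = sampleRate * channels * bitDepth / 8;
    const blockAlign = channels * bitDepth / 8;
    const header = new ArrayBuffer(44);
    const view = new DataView(header);

    const writeString = (offset, text) => {
        for (let i = 0; i < text.length; i++) {
            view.setUint8(offset + i, text.charCodeAt(i));
        }
    };

    writeString(0, "RIFF");
    view.setUint32(4, 36 + rawPcm.byteLength, true);  // RIFF chunk size
    writeString(8, "WAVE");
    writeString(12, "fmt ");
    view.setUint32(16, 16, true);                     // fmt chunk size
    view.setUint16(20, 1, true);                      // audio format: PCM
    view.setUint16(22, channels, true);
    view.setUint32(24, sampleRate, true);
    view.setUint32(28, byteRate, true);
    view.setUint16(32, blockAlign, true);
    view.setUint16(34, bitDepth, true);
    writeString(36, "data");
    view.setUint32(40, rawPcm.byteLength, true);      // data chunk size

    // Concatenate the header and the raw PCM bytes into one buffer.
    const wav = new Uint8Array(44 + rawPcm.byteLength);
    wav.set(new Uint8Array(header), 0);
    wav.set(new Uint8Array(rawPcm), 44);
    return wav.buffer;
}
```
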
In this example, you specify a high-fidelity RIFF format `Riff24Khz16BitMonoPcm` by setting the `SpeechSynthesisOutputFormat` on the `SpeechConfig` object. Similar to the example in the previous section, get the audio `ArrayBuffer` data and interact with it.

```javascript
function synthesizeSpeech() {
    const speechConfig = sdk.SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");

    // Set the output format
    speechConfig.speechSynthesisOutputFormat = sdk.SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm;

    const synthesizer = new sdk.SpeechSynthesizer(speechConfig, null);
    synthesizer.speakTextAsync(
        "Customizing audio output format.",
        result => {
            // Interact with the audio ArrayBuffer data
            const audioData = result.audioData;
        },
        error => console.log(error));
}
```

Running your program again returns the audio `ArrayBuffer` in the customized format, which you can handle the same way as in the previous section.

## Use SSML to customize speech characteristics

Speech Synthesis Markup Language (SSML) allows you to fine-tune the pitch, pronunciation, speaking rate, volume, and more of the text-to-speech output by submitting your requests from an XML schema. This section shows a few practical usage examples, but for a more detailed guide, see the [SSML how-to article](../../../speech-synthesis-markup.md).

To start using SSML for customization, you make a simple change that switches the voice. First, create a new XML file for the SSML config in your root project directory, in this example `ssml.xml`. The root element is always `<speak>`, and wrapping the text in a `<voice>` element allows you to change the voice using the `name` param. This example changes the voice to a male English (UK) voice. Note that this voice is a **standard** voice, which has different pricing and availability than **neural** voices. See the [full list](https://docs.microsoft.com/azure/cognitive-services/speech-service/language-support#standard-voices) of supported **standard** voices.

```xml
<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-GB-George-Apollo">
    When you're on the motorway, it's a good idea to use a sat-nav.
  </voice>
</speak>
```

Next, you need to change the speech synthesis request to reference your XML file. The request is mostly the same, but instead of using the `speakTextAsync()` function, you use `speakSsmlAsync()`. This function expects an XML string, so first you create a function to load an XML file and return it as a string.

# [import](#tab/import)

```javascript
import { readFileSync } from "fs";

function xmlToString(filePath) {
    const xml = readFileSync(filePath, "utf8");
    return xml;
}
```

# [require](#tab/require)

```javascript
const fs = require("fs");

function xmlToString(filePath) {
    const xml = fs.readFileSync(filePath, "utf8");
    return xml;
}
```

---

From here, the result object is exactly the same as previous examples.

```javascript
function synthesizeSpeech() {
    const speechConfig = sdk.SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    const synthesizer = new sdk.SpeechSynthesizer(speechConfig, null);

    const ssml = xmlToString("ssml.xml");
    synthesizer.speakSsmlAsync(
        ssml,
        result => {
            // Interact with the audio ArrayBuffer data
            const audioData = result.audioData;
        },
        error => console.log(error));
}
```

The output works, but there are a few simple additional changes you can make to help it sound more natural. The overall speaking speed is a little too fast, so we'll add a `<prosody>` tag and reduce the speed to **90%** of the default rate. Additionally, the pause after the comma in the sentence is a little too short and unnatural sounding. To fix this issue, add a `<break>` tag to delay the speech, and set the time param to **200ms**. Re-run the synthesis to see how these customizations affected the output.

```xml
<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-GB-George-Apollo">
    <prosody rate="0.9">
      When you're on the motorway,<break time="200ms"/> it's a good idea to use a sat-nav.
    </prosody>
  </voice>
</speak>
```

## Neural voices

Neural voices are speech synthesis algorithms powered by deep neural networks. When using a neural voice, synthesized speech is nearly indistinguishable from human recordings. With human-like natural prosody and clear articulation of words, neural voices significantly reduce listening fatigue when users interact with AI systems.

To switch to a neural voice, change the `name` to one of the [neural voice options](https://docs.microsoft.com/azure/cognitive-services/speech-service/language-support#neural-voices). Then, add an XML namespace for `mstts`, and wrap your text in the `<mstts:express-as>` tag. Use the `style` param to customize the speaking style. This example uses `cheerful`, but try setting it to `customerservice` or `chat` to see the difference in speaking style.

> [!IMPORTANT]
> Neural voices are **only** supported for Speech resources created in the *East US*, *Southeast Asia*, and *West Europe* regions.

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-AriaNeural">
    <mstts:express-as style="cheerful">
      This is awesome!
    </mstts:express-as>
  </voice>
</speak>
```
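
If you prefer not to manage a separate XML file, you can also build the SSML inline as a template string and pass it straight to `speakSsmlAsync()`. The following sketch reuses the neural voice above; remember the region restriction noted earlier:

```javascript
function synthesizeNeuralVoice() {
    const speechConfig = sdk.SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    const synthesizer = new sdk.SpeechSynthesizer(speechConfig, null);

    // Build the SSML inline instead of loading it from ssml.xml.
    const ssml = `<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
      <voice name="en-US-AriaNeural">
        <mstts:express-as style="cheerful">
          This is awesome!
        </mstts:express-as>
      </voice>
    </speak>`;

    synthesizer.speakSsmlAsync(
        ssml,
        result => {
            // Interact with the audio ArrayBuffer data
            const audioData = result.audioData;
            synthesizer.close();
        },
        error => {
            console.log(error);
            synthesizer.close();
        });
}
```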

articles/cognitive-services/Speech-Service/text-to-speech-basics.md

Lines changed: 5 additions & 1 deletion
@@ -8,7 +8,7 @@ manager: nitinme
 ms.service: cognitive-services
 ms.subservice: speech-service
 ms.topic: quickstart
-ms.date: 04/06/2020
+ms.date: 04/14/2020
 ms.author: trbye
 zone_pivot_groups: programming-languages-set-two
 ---
@@ -38,6 +38,10 @@ In this article, you learn common design patterns for doing text-to-speech synth
 [!INCLUDE [Java Basics include](includes/how-to/text-to-speech-basics/text-to-speech-basics-java.md)]
 ::: zone-end
 
+::: zone pivot="programming-language-javascript"
+[!INCLUDE [JavaScript Basics include](includes/how-to/text-to-speech-basics/text-to-speech-basics-javascript.md)]
+::: zone-end
+
 ::: zone pivot="programming-language-python"
 [!INCLUDE [Python Basics include](includes/how-to/text-to-speech-basics/text-to-speech-basics-python.md)]
 ::: zone-end
