---
author: trevorbye
ms.service: cognitive-services
ms.topic: include
ms.date: 04/14/2020
ms.author: trbye
---

## Prerequisites

This article assumes that you have an Azure account and Speech service subscription. If you don't have an account and subscription, [try the Speech service for free](../../../get-started.md).

## Install the Speech SDK

Before you can do anything, you'll need to install the <a href="https://www.npmjs.com/package/microsoft-cognitiveservices-speech-sdk" target="_blank">JavaScript Speech SDK <span class="docon docon-navigate-external x-hidden-focus"></span></a>. Depending on your platform, use the following instructions:

- <a href="https://docs.microsoft.com/azure/cognitive-services/speech-service/speech-sdk?tabs=nodejs#get-the-speech-sdk" target="_blank">Node.js <span class="docon docon-navigate-external x-hidden-focus"></span></a>
- <a href="https://docs.microsoft.com/azure/cognitive-services/speech-service/speech-sdk?tabs=browser#get-the-speech-sdk" target="_blank">Web Browser <span class="docon docon-navigate-external x-hidden-focus"></span></a>

Additionally, depending on the target environment, use one of the following:

# [import](#tab/import)

```javascript
import * as sdk from "microsoft-cognitiveservices-speech-sdk";
```

For more information on `import`, see <a href="https://javascript.info/import-export" target="_blank">export and import <span class="docon docon-navigate-external x-hidden-focus"></span></a>.

# [require](#tab/require)

```javascript
const sdk = require("microsoft-cognitiveservices-speech-sdk");
```

For more information on `require`, see <a href="https://nodejs.org/en/knowledge/getting-started/what-is-require/" target="_blank">what is require? <span class="docon docon-navigate-external x-hidden-focus"></span></a>.

# [script](#tab/script)

Download and extract the <a href="https://aka.ms/csspeech/jsbrowserpackage" target="_blank">JavaScript Speech SDK <span class="docon docon-navigate-external x-hidden-focus"></span></a> *microsoft.cognitiveservices.speech.sdk.bundle.js* file, and place it in a folder accessible to your HTML file.

```html
<script src="microsoft.cognitiveservices.speech.sdk.bundle.js"></script>
```

> [!TIP]
> If you're targeting a web browser and using the `<script>` tag, the `sdk` prefix is not needed. The `sdk` prefix is an alias we use to name our `import` or `require` module.

---

## Create a speech configuration

To call the Speech service using the Speech SDK, you need to create a [`SpeechConfig`](https://docs.microsoft.com/javascript/api/microsoft-cognitiveservices-speech-sdk/speechconfig?view=azure-node-latest). This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.

> [!NOTE]
> Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

There are a few ways that you can initialize a [`SpeechConfig`](https://docs.microsoft.com/javascript/api/microsoft-cognitiveservices-speech-sdk/speechconfig?view=azure-node-latest); each option is sketched after this list:

* With a subscription: pass in a key and the associated region.
* With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
* With a host: pass in a host address. A key or authorization token is optional.
* With an authorization token: pass in an authorization token and the associated region.
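
The following is a minimal sketch of how each of those options can be initialized. It's an illustration rather than a full sample: the endpoint URL, host address, and authorization token values are placeholders, and it assumes the SDK's `fromEndpoint`, `fromHost`, and `fromAuthorizationToken` factory functions alongside `fromSubscription`.

```javascript
// Placeholder values only -- replace with your own key, region, endpoint, host, or token.

// 1. From a subscription key and the associated region.
const configFromSubscription = sdk.SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");

// 2. From an endpoint. A key or authorization token is optional.
const configFromEndpoint = sdk.SpeechConfig.fromEndpoint(new URL("wss://YourServiceEndpoint"), "YourSubscriptionKey");

// 3. From a host address. A key or authorization token is optional.
const configFromHost = sdk.SpeechConfig.fromHost(new URL("wss://YourHostAddress"), "YourSubscriptionKey");

// 4. From an authorization token and the associated region.
const configFromToken = sdk.SpeechConfig.fromAuthorizationToken("YourAuthorizationToken", "YourServiceRegion");
```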

In this example, you create a [`SpeechConfig`](https://docs.microsoft.com/javascript/api/microsoft-cognitiveservices-speech-sdk/speechconfig?view=azure-node-latest) using a subscription key and region. See the [region support](https://docs.microsoft.com/azure/cognitive-services/speech-service/regions#speech-sdk) page to find your region identifier. You also create some basic boilerplate code to use for the rest of this article, which you modify for different customizations.

```javascript
function synthesizeSpeech() {
    const speechConfig = sdk.SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    const synthesizer = new sdk.SpeechSynthesizer(speechConfig);
}
```

## Synthesize speech to a file

Next, you create a [`SpeechSynthesizer`](https://docs.microsoft.com/javascript/api/microsoft-cognitiveservices-speech-sdk/speechsynthesizer?view=azure-node-latest) object, which executes text-to-speech conversions and outputs to speakers, files, or other output streams. The [`SpeechSynthesizer`](https://docs.microsoft.com/javascript/api/microsoft-cognitiveservices-speech-sdk/speechsynthesizer?view=azure-node-latest) accepts as parameters the [`SpeechConfig`](https://docs.microsoft.com/javascript/api/microsoft-cognitiveservices-speech-sdk/speechconfig?view=azure-node-latest) object created in the previous step, and an [`AudioConfig`](https://docs.microsoft.com/javascript/api/microsoft-cognitiveservices-speech-sdk/audioconfig?view=azure-node-latest) object that specifies how output results should be handled.

To start, create an `AudioConfig` to automatically write the output to a `.wav` file using the `fromWavFileOutput()` static function.

```javascript
function synthesizeSpeech() {
    const speechConfig = sdk.SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    const audioConfig = sdk.AudioConfig.fromWavFileOutput("path/to/write/file.wav");
}
```

Next, instantiate a `SpeechSynthesizer`, passing your `speechConfig` object and the `audioConfig` object as parameters. Then, executing speech synthesis and writing to a file is as simple as running `speakTextAsync()` with a string of text.

```javascript
function synthesizeSpeech() {
    const speechConfig = sdk.SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    const audioConfig = sdk.AudioConfig.fromWavFileOutput("path/to/write/file.wav");

    const synthesizer = new sdk.SpeechSynthesizer(speechConfig, audioConfig);
    synthesizer.speakTextAsync("A simple test to write to a file.");
}
```
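
The call above is fire-and-forget. If you want to react when synthesis finishes or fails, and release the synthesizer afterwards, you can use the same callback pattern that appears later in this article. This is a sketch of that variation, with only minimal result handling.

```javascript
function synthesizeSpeech() {
    const speechConfig = sdk.SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    const audioConfig = sdk.AudioConfig.fromWavFileOutput("path/to/write/file.wav");

    const synthesizer = new sdk.SpeechSynthesizer(speechConfig, audioConfig);
    synthesizer.speakTextAsync(
        "A simple test to write to a file.",
        result => {
            // Synthesis finished; release the synthesizer resources.
            synthesizer.close();
        },
        error => {
            console.log(error);
            synthesizer.close();
        });
}
```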

Run the program, and a synthesized `.wav` file is written to the location you specified. This covers the most basic usage; next, you look at customizing output and handling the output response as an in-memory stream for custom scenarios.

## Synthesize to speaker output

In some cases, you may want to output synthesized speech directly to a speaker. To do this, instantiate the `AudioConfig` using the `fromDefaultSpeakerOutput()` static function. This outputs to the current active output device.

```javascript
function synthesizeSpeech() {
    const speechConfig = sdk.SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    const audioConfig = sdk.AudioConfig.fromDefaultSpeakerOutput();

    const synthesizer = new sdk.SpeechSynthesizer(speechConfig, audioConfig);
    synthesizer.speakTextAsync("Synthesizing directly to speaker output.");
}
```

## Get result as an in-memory stream

For many scenarios in speech application development, you likely need the resulting audio data as an in-memory stream rather than writing directly to a file. This allows you to build custom behavior, including:

* Abstract the resulting byte array as a seekable stream for custom downstream services.
* Integrate the result with other APIs or services.
* Modify the audio data, write custom `.wav` headers, and so on.

It's simple to make this change from the previous example. First, remove the `AudioConfig` block, as you will manage the output behavior manually from this point onward for increased control. Then pass `null` for the `AudioConfig` in the `SpeechSynthesizer` constructor.

> [!NOTE]
> Passing `null` for the `AudioConfig`, rather than omitting it as in the speaker output example above, means the audio is not played by default on the current active output device.

This time, you handle the result through the [`SpeechSynthesisResult`](https://docs.microsoft.com/javascript/api/microsoft-cognitiveservices-speech-sdk/speechsynthesisresult?view=azure-node-latest) passed to the callback. The `SpeechSynthesisResult.audioData` property returns an `ArrayBuffer` of the output data. You can work with this `ArrayBuffer` manually.

```javascript
function synthesizeSpeech() {
    const speechConfig = sdk.SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    const synthesizer = new sdk.SpeechSynthesizer(speechConfig, null);

    synthesizer.speakTextAsync(
        "Getting the response as an in-memory stream.",
        result => {
            // Interact with the audio ArrayBuffer data
            const audioData = result.audioData;
        },
        error => console.log(error));
}
```

From here, you can implement any custom behavior using the resulting `ArrayBuffer` object.
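
For example, in a Node.js environment you might write the bytes to disk or wrap them in a stream for another consumer. The following is a minimal sketch under those assumptions; the `handleAudioData` helper and the output file name are hypothetical, not part of the SDK.

```javascript
const fs = require("fs");
const { PassThrough } = require("stream");

// Hypothetical helper for the `audioData` ArrayBuffer received in the callback above.
function handleAudioData(audioData) {
    // Wrap the raw bytes in a Node.js Buffer.
    const buffer = Buffer.from(audioData);

    // Option 1: persist the synthesized audio to disk.
    fs.writeFileSync("synthesized-output.wav", buffer);

    // Option 2: expose the bytes as a readable stream for another API or service.
    const stream = new PassThrough();
    stream.end(buffer);
    return stream;
}
```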

## Customize audio format

The following section shows how to customize audio output attributes including:

* Audio file type
* Sample rate
* Bit depth

To change the audio format, you use the `setSpeechSynthesisOutputFormat()` function on the `SpeechConfig` object. This function expects an `enum` of type [`SpeechSynthesisOutputFormat`](https://docs.microsoft.com/javascript/api/microsoft-cognitiveservices-speech-sdk/speechsynthesisoutputformat?view=azure-node-latest), which you use to select the output format. See the reference docs for a [list of audio formats](https://docs.microsoft.com/javascript/api/microsoft-cognitiveservices-speech-sdk/speechsynthesisoutputformat?view=azure-node-latest) that are available.

There are various options for different file types depending on your requirements. Note that by definition, raw formats like `Raw24Khz16BitMonoPcm` do not include audio headers. Use raw formats only when you know your downstream implementation can decode a raw bitstream, or if you plan on manually building headers based on bit depth, sample rate, number of channels, and so on.
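
If you do choose a raw format, you're responsible for supplying that header yourself. As a rough illustration (not part of the SDK), a minimal RIFF/WAVE header for raw PCM data could be built like this in Node.js; the 24 kHz, 16-bit, mono values in the usage comment match the `Raw24Khz16BitMonoPcm` format mentioned above.

```javascript
// Build a 44-byte RIFF/WAVE header describing a block of raw PCM data.
function buildWavHeader(dataLength, sampleRate, numChannels, bitsPerSample) {
    const blockAlign = numChannels * (bitsPerSample / 8);
    const byteRate = sampleRate * blockAlign;
    const header = Buffer.alloc(44);

    header.write("RIFF", 0);                      // ChunkID
    header.writeUInt32LE(36 + dataLength, 4);     // ChunkSize
    header.write("WAVE", 8);                      // Format
    header.write("fmt ", 12);                     // Subchunk1ID
    header.writeUInt32LE(16, 16);                 // Subchunk1Size (PCM)
    header.writeUInt16LE(1, 20);                  // AudioFormat (1 = PCM)
    header.writeUInt16LE(numChannels, 22);        // NumChannels
    header.writeUInt32LE(sampleRate, 24);         // SampleRate
    header.writeUInt32LE(byteRate, 28);           // ByteRate
    header.writeUInt16LE(blockAlign, 32);         // BlockAlign
    header.writeUInt16LE(bitsPerSample, 34);      // BitsPerSample
    header.write("data", 36);                     // Subchunk2ID
    header.writeUInt32LE(dataLength, 40);         // Subchunk2Size
    return header;
}

// Example usage for Raw24Khz16BitMonoPcm output held in `audioData` (an ArrayBuffer):
// const pcm = Buffer.from(audioData);
// const wav = Buffer.concat([buildWavHeader(pcm.length, 24000, 1, 16), pcm]);
```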

In this example, you specify a high-fidelity RIFF format `Riff24Khz16BitMonoPcm` by setting the `SpeechSynthesisOutputFormat` on the `SpeechConfig` object. Similar to the example in the previous section, get the audio `ArrayBuffer` data and interact with it.

```javascript
function synthesizeSpeech() {
    const speechConfig = sdk.SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");

    // Set the output format
    speechConfig.setSpeechSynthesisOutputFormat(sdk.SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm);

    const synthesizer = new sdk.SpeechSynthesizer(speechConfig, null);
    synthesizer.speakTextAsync(
        "Customizing audio output format.",
        result => {
            // Interact with the audio ArrayBuffer data
            const audioData = result.audioData;
        },
        error => console.log(error));
}
```

Running your program again returns the audio data in the customized `Riff24Khz16BitMonoPcm` format as an `ArrayBuffer`, which you can handle just as in the previous section.

## Use SSML to customize speech characteristics

Speech Synthesis Markup Language (SSML) allows you to fine-tune the pitch, pronunciation, speaking rate, volume, and more of the text-to-speech output by submitting your requests from an XML schema. This section shows a few practical usage examples, but for a more detailed guide, see the [SSML how-to article](../../../speech-synthesis-markup.md).

To start using SSML for customization, you make a simple change that switches the voice.
First, create a new XML file for the SSML config in your root project directory, in this example `ssml.xml`. The root element is always `<speak>`, and wrapping the text in a `<voice>` element allows you to change the voice using the `name` parameter. This example changes the voice to a male English (UK) voice. Note that this voice is a **standard** voice, which has different pricing and availability than **neural** voices. See the [full list](https://docs.microsoft.com/azure/cognitive-services/speech-service/language-support#standard-voices) of supported **standard** voices.

```xml
<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-GB-George-Apollo">
    When you're on the motorway, it's a good idea to use a sat-nav.
  </voice>
</speak>
```

Next, you need to change the speech synthesis request to reference your XML file. The request is mostly the same, but instead of using the `speakTextAsync()` function, you use `speakSsmlAsync()`. This function expects an XML string, so first you create a function to load an XML file and return it as a string.

# [import](#tab/import)

```javascript
import { readFileSync } from "fs";

function xmlToString(filePath) {
    const xml = readFileSync(filePath, "utf8");
    return xml;
}
```

# [require](#tab/require)

```javascript
const fs = require("fs");

function xmlToString(filePath) {
    const xml = fs.readFileSync(filePath, "utf8");
    return xml;
}
```

---

From here, the result object is exactly the same as in the previous examples.

```javascript
function synthesizeSpeech() {
    const speechConfig = sdk.SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    const synthesizer = new sdk.SpeechSynthesizer(speechConfig, null);

    const ssml = xmlToString("ssml.xml");
    synthesizer.speakSsmlAsync(
        ssml,
        result => {
            // Interact with the audio ArrayBuffer data
            const audioData = result.audioData;
        },
        error => console.log(error));
}
```

The output works, but there are a few simple additional changes you can make to help it sound more natural. The overall speaking speed is a little too fast, so we'll add a `<prosody>` tag and reduce the speed to **90%** of the default rate. Additionally, the pause after the comma in the sentence is a little too short and sounds unnatural. To fix this issue, add a `<break>` tag to delay the speech, and set the time parameter to **200ms**. Re-run the synthesis to see how these customizations affect the output.

```xml
<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-GB-George-Apollo">
    <prosody rate="0.9">
      When you're on the motorway,<break time="200ms"/> it's a good idea to use a sat-nav.
    </prosody>
  </voice>
</speak>
```

## Neural voices

Neural voices are speech synthesis algorithms powered by deep neural networks. When you use a neural voice, the synthesized speech is nearly indistinguishable from human recordings. With human-like natural prosody and clear articulation of words, neural voices significantly reduce listening fatigue when users interact with AI systems.

To switch to a neural voice, change the `name` to one of the [neural voice options](https://docs.microsoft.com/azure/cognitive-services/speech-service/language-support#neural-voices). Then, add an XML namespace for `mstts`, and wrap your text in the `<mstts:express-as>` tag. Use the `style` parameter to customize the speaking style. This example uses `cheerful`, but try setting it to `customerservice` or `chat` to see the difference in speaking style.

> [!IMPORTANT]
> Neural voices are **only** supported for Speech resources created in the *East US*, *Southeast Asia*, and *West Europe* regions.

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-AriaNeural">
    <mstts:express-as style="cheerful">
      This is awesome!
    </mstts:express-as>
  </voice>
</speak>
```