| 1 | +--- |
| 2 | +author: eric-urban |
| 3 | +ms.service: cognitive-services |
| 4 | +ms.topic: include |
| 5 | +ms.date: 07/02/2021 |
| 6 | +ms.author: eur |
| 7 | +--- |
| 8 | + |
| 9 | +[!INCLUDE [Header](../../common/cpp.md)] |
| 10 | + |
| 11 | +[!INCLUDE [Introduction](intro.md)] |
| 12 | + |
| 13 | +## Prerequisites |
| 14 | + |
| 15 | +[!INCLUDE [Prerequisites](../../common/azure-prerequisites.md)] |
| 16 | + |
| 17 | +### Install the Speech SDK |
| 18 | + |
| 19 | +Before you begin, install the Speech SDK. Follow the instructions for your platform: |
| 20 | + |
| 21 | +* <a href="/azure/cognitive-services/speech-service/quickstarts/setup-platform?pivots=programming-language-cpp&tabs=linux" target="_blank">Linux </a> |
| 22 | +* <a href="/azure/cognitive-services/speech-service/quickstarts/setup-platform?pivots=programming-language-cpp&tabs=macos" target="_blank">macOS </a> |
| 23 | +* <a href="/azure/cognitive-services/speech-service/quickstarts/setup-platform?pivots=programming-language-cpp&tabs=windows" target="_blank">Windows </a> |
| 24 | + |
| 25 | +## Select synthesis language and voice |
| 26 | + |
| 27 | +The text-to-speech feature in the Azure Speech service supports more than 270 voices and more than 110 languages and variants. |
| 28 | +You can get the [full list](../../../language-support.md#prebuilt-neural-voices) or try them in a [text-to-speech demo](https://azure.microsoft.com/services/cognitive-services/text-to-speech/#features). |
| 29 | + |
| 30 | +Specify the language or voice of [`SpeechConfig`](/cpp/cognitive-services/speech/speechconfig) to match your input text and use the voice that you want: |
| 31 | + |
| 32 | +```cpp |
| 33 | +void synthesizeSpeech() |
| 34 | +{ |
| 35 | + auto config = SpeechConfig::FromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>"); |
| 36 | + // Set either the `SpeechSynthesisVoiceName` or `SpeechSynthesisLanguage`. |
| 37 | + config->SetSpeechSynthesisLanguage("en-US"); |
| 38 | + config->SetSpeechSynthesisVoiceName("en-US-JennyNeural"); |
| 39 | +} |
| 40 | +``` |
| 41 | + |
| 42 | +All neural voices are multilingual and fluent in their own language and English. For example, if the input text in English is "I'm excited to try text to speech" and you set `es-ES-ElviraNeural`, the text is spoken in English with a Spanish accent. If the voice does not speak the language of the input text, the Speech service won't output synthesized audio. See the [full list](../../../language-support.md#prebuilt-neural-voices) of supported neural voices. |
| 43 | + |
| 44 | +> [!NOTE] |
| 45 | +> The default voice is the first voice returned per locale via the [Voice List API](../../../rest-text-to-speech.md#get-a-list-of-voices). |
| 46 | + |
| 47 | +The voice that speaks is determined in order of priority as follows: |
| 48 | +- If you don't set `SpeechSynthesisVoiceName` or `SpeechSynthesisLanguage`, the default voice for `en-US` will speak. |
| 49 | +- If you only set `SpeechSynthesisLanguage`, the default voice for the specified locale will speak. |
| 50 | +- If both `SpeechSynthesisVoiceName` and `SpeechSynthesisLanguage` are set, the `SpeechSynthesisLanguage` setting is ignored. The voice that you specified via `SpeechSynthesisVoiceName` will speak. |
| 51 | +- If the voice element is set via [Speech Synthesis Markup Language (SSML)](../../../speech-synthesis-markup.md), the `SpeechSynthesisVoiceName` and `SpeechSynthesisLanguage` settings are ignored. |
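
The priority order above can be sketched as a small pure function. This is only an illustration of the rules as listed, not SDK code: `VoiceSettings` and `resolveVoice` are hypothetical names, an empty string stands for "not set", and the per-locale default voice is represented by a descriptive string instead of a real lookup against the Voice List API.

```cpp
#include <string>

// Hypothetical model of the settings that influence voice selection.
// An empty string means the setting was not provided.
struct VoiceSettings
{
    std::string ssmlVoice;  // name from an SSML <voice> element
    std::string voiceName;  // value of SpeechSynthesisVoiceName
    std::string language;   // value of SpeechSynthesisLanguage
};

// Returns the voice that would speak, checking the highest-priority setting first.
std::string resolveVoice(const VoiceSettings& s)
{
    if (!s.ssmlVoice.empty()) return s.ssmlVoice;  // SSML overrides both config settings
    if (!s.voiceName.empty()) return s.voiceName;  // voice name overrides language
    if (!s.language.empty())  return "default voice for " + s.language;
    return "default voice for en-US";              // nothing set
}
```

For example, with both `voiceName` and `language` set, the function returns the voice name and ignores the language, which mirrors the third rule in the list.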
| 52 | + |
| 53 | +## Synthesize speech to a file |
| 54 | + |
| 55 | +Next, you create a [`SpeechSynthesizer`](/cpp/cognitive-services/speech/speechsynthesizer) object. This object executes text-to-speech conversions and outputs to speakers, files, or other output streams. `SpeechSynthesizer` accepts as parameters: |
| 56 | + |
| 57 | +- The [`SpeechConfig`](/cpp/cognitive-services/speech/speechconfig) object that you created in the previous step |
| 58 | +- An [`AudioConfig`](/cpp/cognitive-services/speech/audio-audioconfig) object that specifies how output results should be handled |
| 59 | + |
| 60 | +To start, create an `AudioConfig` instance to automatically write the output to a .wav file by using the `FromWavFileOutput()` function: |
| 61 | + |
| 62 | +```cpp |
| 63 | +void synthesizeSpeech() |
| 64 | +{ |
| 65 | + auto config = SpeechConfig::FromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>"); |
| 66 | + auto audioConfig = AudioConfig::FromWavFileOutput("path/to/write/file.wav"); |
| 67 | +} |
| 68 | +``` |
| 69 | + |
| 70 | +Next, create a `SpeechSynthesizer` instance. Pass your `config` object and the `audioConfig` object as parameters. Then, to run speech synthesis and write the result to a file, call `SpeakTextAsync()` with a string of text. |
| 71 | + |
| 72 | +```cpp |
| 73 | +void synthesizeSpeech() |
| 74 | +{ |
| 75 | + auto config = SpeechConfig::FromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>"); |
| 76 | + auto audioConfig = AudioConfig::FromWavFileOutput("path/to/write/file.wav"); |
| 77 | + auto synthesizer = SpeechSynthesizer::FromConfig(config, audioConfig); |
| 78 | + auto result = synthesizer->SpeakTextAsync("A simple test to write to a file.").get(); |
| 79 | +} |
| 80 | +``` |
| 81 | + |
| 82 | +Run the program. It writes a synthesized .wav file to the location that you specified. This is the most basic usage. The following sections show how to customize the output and how to handle the response as an in-memory stream for custom scenarios. |
| 83 | + |
| 84 | +## Synthesize to speaker output |
| 85 | + |
| 86 | +In some cases, you might want to output synthesized speech directly to a speaker. To do this, omit the `AudioConfig` parameter when you create the `SpeechSynthesizer` instance in the previous example. The synthesizer then plays the audio on the current active output device. |
| 87 | + |
| 88 | +```cpp |
| 89 | +void synthesizeSpeech() |
| 90 | +{ |
| 91 | + auto config = SpeechConfig::FromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>"); |
| 92 | + auto synthesizer = SpeechSynthesizer::FromConfig(config); |
| 93 | + auto result = synthesizer->SpeakTextAsync("Synthesizing directly to speaker output.").get(); |
| 94 | +} |
| 95 | +``` |
| 96 | + |
| 97 | +## Get a result as an in-memory stream |
| 98 | + |
| 99 | +For many scenarios in speech application development, you likely need the resulting audio data as an in-memory stream rather than directly writing to a file. This will allow you to build custom behavior, including: |
| 100 | + |
| 101 | +* Abstract the resulting byte array as a seekable stream for custom downstream services. |
| 102 | +* Integrate the result with other APIs or services. |
| 103 | +* Modify the audio data, write custom .wav headers, and do related tasks. |
| 104 | + |
| 105 | +It's simple to make this change from the previous example. First, remove the `AudioConfig` block, because you'll manage the output behavior manually from this point onward for increased control. Then pass `NULL` for `AudioConfig` when you call `SpeechSynthesizer::FromConfig()`. |
| 106 | + |
| 107 | +> [!NOTE] |
| 108 | +> Passing `NULL` for `AudioConfig`, rather than omitting it as you did in the previous speaker output example, will not play the audio by default on the current active output device. |
| 109 | + |
| 110 | +This time, save the result to a [`SpeechSynthesisResult`](/cpp/cognitive-services/speech/speechsynthesisresult) variable. The `GetAudioData` getter returns the output data as a byte buffer (in the C++ SDK, a `std::shared_ptr<std::vector<uint8_t>>`). You can work with this buffer manually, or you can use the [`AudioDataStream`](/cpp/cognitive-services/speech/audiodatastream) class to manage the in-memory stream. In this example, you use the `AudioDataStream::FromResult()` static function to get a stream from the result: |
| 111 | + |
| 112 | +```cpp |
| 113 | +void synthesizeSpeech() |
| 114 | +{ |
| 115 | + auto config = SpeechConfig::FromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>"); |
| 116 | + auto synthesizer = SpeechSynthesizer::FromConfig(config, NULL); |
| 117 | + |
| 118 | + auto result = synthesizer->SpeakTextAsync("Getting the response as an in-memory stream.").get(); |
| 119 | + auto stream = AudioDataStream::FromResult(result); |
| 120 | +} |
| 121 | +``` |
| 122 | + |
| 123 | +From here, you can implement any custom behavior by using the resulting `stream` object. |
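
As one illustration of such custom behavior, the following sketch wraps a byte buffer in a minimal seekable reader. It does not use the Speech SDK: `MemoryAudioReader` is a hypothetical class, and the sketch assumes that you have already copied the synthesized audio bytes into a `std::vector<uint8_t>`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical seekable, read-only view over audio bytes held in memory.
class MemoryAudioReader
{
public:
    explicit MemoryAudioReader(std::vector<uint8_t> data) : data_(std::move(data)) {}

    // Copies up to `count` bytes into `dest` and returns the number actually copied.
    std::size_t read(uint8_t* dest, std::size_t count)
    {
        const std::size_t n = std::min(count, data_.size() - pos_);
        std::copy(data_.data() + pos_, data_.data() + pos_ + n, dest);
        pos_ += n;
        return n;
    }

    // Moves the read position; positions past the end are clamped to the end.
    void seek(std::size_t pos) { pos_ = std::min(pos, data_.size()); }

    std::size_t tell() const { return pos_; }
    std::size_t size() const { return data_.size(); }

private:
    std::vector<uint8_t> data_;
    std::size_t pos_ = 0;
};
```

A downstream component can then call `seek()` to skip a header or to replay part of the audio without synthesizing it again.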
| 124 | + |
| 125 | +## Customize audio format |
| 126 | + |
| 127 | +You can customize audio output attributes, including: |
| 128 | + |
| 129 | +* Audio file type |
| 130 | +* Sample rate |
| 131 | +* Bit depth |
| 132 | + |
| 133 | +To change the audio format, you use the `SetSpeechSynthesisOutputFormat()` function on the `SpeechConfig` object. This function expects an `enum` instance of type [`SpeechSynthesisOutputFormat`](/cpp/cognitive-services/speech/microsoft-cognitiveservices-speech-namespace#speechsynthesisoutputformat), which you use to select the output format. See the [list of audio formats](/cpp/cognitive-services/speech/microsoft-cognitiveservices-speech-namespace#speechsynthesisoutputformat) that are available. |
| 134 | + |
| 135 | +There are various options for different file types, depending on your requirements. By definition, raw formats like `Raw24Khz16BitMonoPcm` don't include audio headers. Use raw formats only in one of these situations: |
| 136 | + |
| 137 | +- You know that your downstream implementation can decode a raw bitstream. |
| 138 | +- You plan to manually build headers based on factors like bit depth, sample rate, and number of channels. |
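
If you choose the second option, a header for uncompressed PCM can be assembled along the following lines. This is a minimal sketch of the canonical 44-byte RIFF/WAVE header, not part of the Speech SDK; `buildWavHeader`, `appendLE`, and `appendTag` are hypothetical helper names, and the sketch assumes that you already know the size of the raw audio data in bytes.

```cpp
#include <cstdint>
#include <vector>

// Appends the low `byteCount` bytes of `value` in little-endian order, as RIFF requires.
static void appendLE(std::vector<uint8_t>& out, uint32_t value, int byteCount)
{
    for (int i = 0; i < byteCount; ++i)
        out.push_back(static_cast<uint8_t>((value >> (8 * i)) & 0xFF));
}

// Appends a four-character chunk tag such as "RIFF" or "data".
static void appendTag(std::vector<uint8_t>& out, const char* tag)
{
    out.insert(out.end(), tag, tag + 4);
}

// Builds a 44-byte PCM WAV header for `dataSize` bytes of raw samples.
std::vector<uint8_t> buildWavHeader(uint32_t sampleRate, uint16_t bitsPerSample,
                                    uint16_t channels, uint32_t dataSize)
{
    const uint16_t blockAlign = static_cast<uint16_t>(channels * bitsPerSample / 8);
    const uint32_t byteRate = sampleRate * blockAlign;

    std::vector<uint8_t> header;
    header.reserve(44);
    appendTag(header, "RIFF");
    appendLE(header, 36 + dataSize, 4);  // size of everything after this field
    appendTag(header, "WAVE");
    appendTag(header, "fmt ");
    appendLE(header, 16, 4);             // size of the fmt chunk for PCM
    appendLE(header, 1, 2);              // audio format 1 = uncompressed PCM
    appendLE(header, channels, 2);
    appendLE(header, sampleRate, 4);
    appendLE(header, byteRate, 4);
    appendLE(header, blockAlign, 2);
    appendLE(header, bitsPerSample, 2);
    appendTag(header, "data");
    appendLE(header, dataSize, 4);
    return header;
}
```

For `Raw24Khz16BitMonoPcm` output, you would call `buildWavHeader(24000, 16, 1, dataSize)` and write the returned bytes before the raw samples.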
| 139 | + |
| 140 | +In this example, you specify the high-fidelity RIFF format `Riff24Khz16BitMonoPcm` by setting `SpeechSynthesisOutputFormat` on the `SpeechConfig` object. Similar to the example in the previous section, you use [`AudioDataStream`](/cpp/cognitive-services/speech/audiodatastream) to get an in-memory stream of the result, and then write it to a file. |
| 141 | + |
| 142 | +```cpp |
| 143 | +void synthesizeSpeech() |
| 144 | +{ |
| 145 | + auto config = SpeechConfig::FromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>"); |
| 146 | + config->SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat::Riff24Khz16BitMonoPcm); |
| 147 | + |
| 148 | + auto synthesizer = SpeechSynthesizer::FromConfig(config, NULL); |
| 149 | + auto result = synthesizer->SpeakTextAsync("A simple test to write to a file.").get(); |
| 150 | + |
| 151 | + auto stream = AudioDataStream::FromResult(result); |
| 152 | + stream->SaveToWavFileAsync("path/to/write/file.wav").get(); |
| 153 | +} |
| 154 | +``` |
| 155 | + |
| 156 | +Running your program again will write a .wav file to the specified path. |
| 157 | + |
| 158 | +## Use SSML to customize speech characteristics |
| 159 | + |
| 160 | +You can use SSML to fine-tune the pitch, pronunciation, speaking rate, volume, and more in the text-to-speech output by submitting your requests as XML that follows the SSML schema. This section shows an example of changing the voice. For a more detailed guide, see the [SSML how-to article](../../../speech-synthesis-markup.md). |
| 161 | + |
| 162 | +To start using SSML for customization, you make a simple change that switches the voice. |
| 163 | + |
| 164 | +First, create a new XML file for the SSML configuration in your root project directory. In this example, it's `ssml.xml`. The root element is always `<speak>`. Wrapping the text in a `<voice>` element allows you to change the voice by using the `name` attribute. See the [full list](../../../language-support.md#prebuilt-neural-voices) of supported neural voices. |
| 165 | + |
| 166 | +```xml |
| 167 | +<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US"> |
| 168 | + <voice name="en-US-JennyNeural"> |
| 169 | + When you're on the freeway, it's a good idea to use a GPS. |
| 170 | + </voice> |
| 171 | +</speak> |
| 172 | +``` |
| 173 | + |
| 174 | +Next, change the speech synthesis request to reference your XML file. The request is mostly the same, but instead of using the `SpeakTextAsync()` function, you use `SpeakSsmlAsync()`. This function expects an XML string, so you first load your SSML configuration as a string. From here, the result object is exactly the same as in the previous examples. |
| 175 | + |
| 176 | +```cpp |
| 177 | +void synthesizeSpeech() |
| 178 | +{ |
| 179 | + auto config = SpeechConfig::FromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>"); |
| 180 | + auto synthesizer = SpeechSynthesizer::FromConfig(config, NULL); |
| 181 | + |
| 182 | + std::ifstream file("./ssml.xml"); // requires #include <fstream> and <string> |
| 183 | + std::string ssml, line; |
| 184 | + while (std::getline(file, line)) |
| 185 | + { |
| 186 | + ssml += line; |
| 187 | + ssml.push_back('\n'); |
| 188 | + } |
| 189 | + auto result = synthesizer->SpeakSsmlAsync(ssml).get(); |
| 190 | + |
| 191 | + auto stream = AudioDataStream::FromResult(result); |
| 192 | + stream->SaveToWavFileAsync("path/to/write/file.wav").get(); |
| 193 | +} |
| 194 | +``` |
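
If the text to speak comes from user input rather than a file, you might build the SSML string in code instead. The following sketch shows one way to do that. `escapeXml` and `buildSsml` are hypothetical helpers, not Speech SDK functions, and the sketch assumes the W3C SSML namespace and an `en-US` document language. Escaping matters because a raw `&` or `<` in the text would make the XML malformed.

```cpp
#include <string>

// Escapes the characters that are special in XML text and attribute values.
std::string escapeXml(const std::string& text)
{
    std::string out;
    out.reserve(text.size());
    for (char c : text)
    {
        switch (c)
        {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            case '\'': out += "&apos;"; break;
            default:   out += c;        break;
        }
    }
    return out;
}

// Wraps `text` in <speak> and <voice> elements for the given voice name.
std::string buildSsml(const std::string& voiceName, const std::string& text)
{
    return "<speak version=\"1.0\" xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"en-US\">"
           "<voice name=\"" + escapeXml(voiceName) + "\">" + escapeXml(text) + "</voice></speak>";
}
```

You could then pass the returned string to `SpeakSsmlAsync()` in place of the file contents.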
| 195 | + |
| 196 | +> [!NOTE] |
| 197 | +> To change the voice without using SSML, you can set the property on `SpeechConfig` by calling `config->SetSpeechSynthesisVoiceName("en-US-ChristopherNeural")`. |
| 198 | + |