
Commit 3ae258a

Merge pull request #191582 from eric-urban/eur/tts-qs
New TTS quickstart
2 parents 51f6c6c + 54a3034 commit 3ae258a

File tree

32 files changed: +2553, -1565 lines

articles/cognitive-services/Speech-Service/get-started-text-to-speech.md

Lines changed: 3 additions & 17 deletions
@@ -58,22 +58,8 @@ keywords: text to speech
 [!INCLUDE [CLI include](includes/quickstarts/text-to-speech-basics/cli.md)]
 ::: zone-end
 
-## Get position information
-
-Your project might need to know when a word is spoken by text-to-speech so that it can take specific action based on that timing. For example, if you want to highlight words as they're spoken, you need to know what to highlight, when to highlight it, and for how long to highlight it.
-
-You can accomplish this by using the `WordBoundary` event within `SpeechSynthesizer`. This event is raised at the beginning of each new spoken word. It provides a time offset within the spoken stream and a text offset within the input prompt:
-
-* `AudioOffset` reports the output audio's elapsed time between the beginning of synthesis and the start of the next word. This is measured in hundred-nanosecond units (HNS), with 10,000 HNS equivalent to 1 millisecond.
-* `WordOffset` reports the character position in the input string (original text or [SSML](speech-synthesis-markup.md)) immediately before the word that's about to be spoken.
-
-> [!NOTE]
-> `WordBoundary` events are raised as the output audio data becomes available, which will be faster than playback to an output device. The caller must appropriately synchronize stream timing to "real time."
-
-You can find examples of using `WordBoundary` in the [text-to-speech samples](https://aka.ms/csspeech/samples) on GitHub.
-
 ## Next steps
 
-* [Get started with Custom Neural Voice](how-to-custom-voice.md)
-* [Improve synthesis with SSML](speech-synthesis-markup.md)
-* Learn how to use the [Long Audio API](long-audio-api.md) for large text samples like books and news articles
+> [!div class="nextstepaction"]
+> [Learn more about speech synthesis](how-to-speech-synthesis.md)
+
articles/cognitive-services/Speech-Service/how-to-recognize-speech.md

Lines changed: 4 additions & 3 deletions
@@ -1,7 +1,7 @@
 ---
 title: "How to recognize speech - Speech service"
 titleSuffix: Azure Cognitive Services
-description: Learn how to use the Speech SDK to convert speech to text, including object construction, supported audio input formats, and configuration options for speech recognition.
+description: Learn how to convert speech to text, including object construction, supported audio input formats, and configuration options for speech recognition.
 services: cognitive-services
 author: eric-urban
 manager: nitinme
@@ -59,5 +59,6 @@ keywords: speech to text, speech to text software
 
 ## Next steps
 
-> [!div class="nextstepaction"]
-> [See the quickstart samples on GitHub](https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/quickstart)
+* [Try the speech to text quickstart](get-started-speech-to-text.md)
+* [Improve recognition accuracy with custom speech](custom-speech-overview.md)
+* [Transcribe audio in batches](batch-transcription.md)

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
---
title: "How to synthesize speech from text - Speech service"
titleSuffix: Azure Cognitive Services
description: Learn how to convert text to speech. Learn about object construction and design patterns, supported audio output formats, and custom configuration options for speech synthesis.
services: cognitive-services
author: eric-urban
manager: nitinme
ms.service: cognitive-services
ms.subservice: speech-service
ms.topic: how-to
ms.date: 03/14/2022
ms.author: eur
ms.devlang: cpp, csharp, golang, java, javascript, objective-c, python
ms.custom: devx-track-python, devx-track-js, devx-track-csharp, cog-serv-seo-aug-2020, mode-other
zone_pivot_groups: programming-languages-speech-services
keywords: text to speech
---

# How to synthesize speech from text

::: zone pivot="programming-language-csharp"
[!INCLUDE [C# include](includes/how-to/speech-synthesis/csharp.md)]
::: zone-end

::: zone pivot="programming-language-cpp"
[!INCLUDE [C++ include](includes/how-to/speech-synthesis/cpp.md)]
::: zone-end

::: zone pivot="programming-language-go"
[!INCLUDE [Go include](includes/how-to/speech-synthesis/go.md)]
::: zone-end

::: zone pivot="programming-language-java"
[!INCLUDE [Java include](includes/how-to/speech-synthesis/java.md)]
::: zone-end

::: zone pivot="programming-language-javascript"
[!INCLUDE [JavaScript include](includes/how-to/speech-synthesis/javascript.md)]
::: zone-end

::: zone pivot="programming-language-objectivec"
[!INCLUDE [ObjectiveC include](includes/how-to/speech-synthesis/objectivec.md)]
::: zone-end

::: zone pivot="programming-language-swift"
[!INCLUDE [Swift include](includes/how-to/speech-synthesis/swift.md)]
::: zone-end

::: zone pivot="programming-language-python"
[!INCLUDE [Python include](./includes/how-to/speech-synthesis/python.md)]
::: zone-end

::: zone pivot="programming-language-rest"
[!INCLUDE [REST include](includes/how-to/speech-synthesis/rest.md)]
::: zone-end

::: zone pivot="programming-language-cli"
[!INCLUDE [CLI include](includes/how-to/speech-synthesis/cli.md)]
::: zone-end

## Get facial pose events

Speech can be a good way to drive the animation of facial expressions.
[Visemes](how-to-speech-synthesis-viseme.md) are often used to represent the key poses in observed speech. Key poses include the position of the lips, jaw, and tongue in producing a particular phoneme.

You can subscribe to viseme events in the Speech SDK. Then, you can apply viseme events to animate the face of a character as speech audio plays.
Learn [how to get viseme events](how-to-speech-synthesis-viseme.md#get-viseme-events-with-the-speech-sdk).

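As an illustration only (not part of the linked article), the following C++ sketch shows one way to subscribe to viseme events before starting synthesis. It assumes the C++ Speech SDK's `VisemeReceived` event and its `VisemeId` and `AudioOffset` members; names can vary by SDK version and language, so treat the viseme how-to article as the authoritative reference.

```cpp
#include <iostream>
#include <speechapi_cxx.h>

using namespace Microsoft::CognitiveServices::Speech;

void synthesizeWithVisemes()
{
    auto config = SpeechConfig::FromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>");
    auto synthesizer = SpeechSynthesizer::FromConfig(config);

    // Subscribe before starting synthesis. Each event reports the viseme ID and the
    // audio offset (in 100-nanosecond ticks) at which the corresponding pose applies.
    synthesizer->VisemeReceived += [](const SpeechSynthesisVisemeEventArgs& e)
    {
        std::cout << "Viseme " << e.VisemeId
                  << " at " << e.AudioOffset / 10000 << " ms" << std::endl;
    };

    auto result = synthesizer->SpeakTextAsync("Visemes can drive facial animation.").get();
}
```
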
## Get position information

Your project might need to know when a word is spoken by text-to-speech so that it can take specific action based on that timing. For example, if you want to highlight words as they're spoken, you need to know what to highlight, when to highlight it, and for how long to highlight it.

You can accomplish this by using the `WordBoundary` event within `SpeechSynthesizer`. This event is raised at the beginning of each new spoken word. It provides a time offset within the spoken stream and a text offset within the input prompt:

* `AudioOffset` reports the output audio's elapsed time between the beginning of synthesis and the start of the next word. This is measured in hundred-nanosecond units (HNS), with 10,000 HNS equivalent to 1 millisecond.
* `WordOffset` reports the character position in the input string (original text or [SSML](speech-synthesis-markup.md)) immediately before the word that's about to be spoken.

> [!NOTE]
> `WordBoundary` events are raised as the output audio data becomes available, which is faster than playback to an output device. The caller must appropriately synchronize streaming with real-time playback.

You can find examples of using `WordBoundary` in the [text-to-speech samples](https://aka.ms/csspeech/samples) on GitHub.

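For orientation, here is a minimal C++ sketch of subscribing to `WordBoundary` before starting synthesis. The event argument members used below (`AudioOffset` in 100-nanosecond ticks and `TextOffset` for the character position) are assumptions based on the C++ Speech SDK and may differ by SDK version and language; the samples linked above show the exact properties for each language.

```cpp
#include <iostream>
#include <speechapi_cxx.h>

using namespace Microsoft::CognitiveServices::Speech;

void synthesizeWithWordBoundaries()
{
    auto config = SpeechConfig::FromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>");
    auto synthesizer = SpeechSynthesizer::FromConfig(config);

    // Subscribe before starting synthesis. Convert the audio offset from
    // 100-nanosecond ticks to milliseconds (10,000 ticks = 1 ms).
    synthesizer->WordBoundary += [](const SpeechSynthesisWordBoundaryEventArgs& e)
    {
        std::cout << "Word at text offset " << e.TextOffset
                  << ", audio offset " << e.AudioOffset / 10000 << " ms" << std::endl;
    };

    auto result = synthesizer->SpeakTextAsync("Highlight each word as it is spoken.").get();
}
```
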
## Next steps

* [Get started with Custom Neural Voice](how-to-custom-voice.md)
* [Improve synthesis with SSML](speech-synthesis-markup.md)
* [Synthesize from long-form text](long-audio-api.md) like books and news articles

articles/cognitive-services/Speech-Service/includes/common/java.md

Lines changed: 1 addition & 1 deletion
@@ -7,4 +7,4 @@ ms.topic: include
 ms.author: eur
 ---
 
-[Reference documentation](/java/api/com.microsoft.cognitiveservices.speech) | [Package (Maven)](https://mvnrepository.com/artifact/com.microsoft.cognitiveservices.speech) | [Additional Samples on GitHub](https://aka.ms/speech/github-java)
+[Reference documentation](/java/api/com.microsoft.cognitiveservices.speech) | [Additional Samples on GitHub](https://aka.ms/speech/github-java)

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
---
author: eric-urban
ms.service: cognitive-services
ms.topic: include
ms.date: 08/11/2020
ms.author: eur
---

[!INCLUDE [Introduction](intro.md)]

## Prerequisites

[!INCLUDE [Prerequisites](../../common/azure-prerequisites.md)]

## Download and install

[!INCLUDE [SPX Setup](../../spx-setup.md)]

## Synthesize speech to a speaker

Now you're ready to run the Speech CLI to synthesize speech from text. From the command line, change to the directory that contains the Speech CLI binary file. Then run the following command:

```bash
spx synthesize --text "The speech synthesizer greets you!"
```

The Speech CLI produces natural-sounding speech in English through the computer speaker.

## Synthesize speech to a file

Run the following command to change the output from your speaker to a .wav file:

```bash
spx synthesize --text "The speech synthesizer greets you!" --audio output greetings.wav
```

The Speech CLI produces natural-sounding speech in English in the *greetings.wav* audio file. On Windows, you can play the audio file by entering `start greetings.wav`.

Lines changed: 198 additions & 0 deletions
@@ -0,0 +1,198 @@
---
author: eric-urban
ms.service: cognitive-services
ms.topic: include
ms.date: 07/02/2021
ms.author: eur
---

[!INCLUDE [Header](../../common/cpp.md)]

[!INCLUDE [Introduction](intro.md)]

## Prerequisites

[!INCLUDE [Prerequisites](../../common/azure-prerequisites.md)]

### Install the Speech SDK

Before you can do anything, you need to install the Speech SDK. Depending on your platform, use the following instructions:

* <a href="/azure/cognitive-services/speech-service/quickstarts/setup-platform?pivots=programming-language-cpp&tabs=linux" target="_blank">Linux</a>
* <a href="/azure/cognitive-services/speech-service/quickstarts/setup-platform?pivots=programming-language-cpp&tabs=macos" target="_blank">macOS</a>
* <a href="/azure/cognitive-services/speech-service/quickstarts/setup-platform?pivots=programming-language-cpp&tabs=windows" target="_blank">Windows</a>

## Select synthesis language and voice

The text-to-speech feature in the Azure Speech service supports more than 270 voices and more than 110 languages and variants.
You can get the [full list](../../../language-support.md#prebuilt-neural-voices) or try them in a [text-to-speech demo](https://azure.microsoft.com/services/cognitive-services/text-to-speech/#features).

Specify the language or voice on [`SpeechConfig`](/cpp/cognitive-services/speech/speechconfig) to match your input text and the voice that you want to use:

```cpp
void synthesizeSpeech()
{
    auto config = SpeechConfig::FromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>");
    // Set either the `SpeechSynthesisVoiceName` or `SpeechSynthesisLanguage`.
    config->SetSpeechSynthesisLanguage("en-US");
    config->SetSpeechSynthesisVoiceName("en-US-JennyNeural");
}
```

All neural voices are multilingual and fluent in their own language and English. For example, if the input text in English is "I'm excited to try text to speech" and you set `es-ES-ElviraNeural`, the text is spoken in English with a Spanish accent. If the voice does not speak the language of the input text, the Speech service won't output synthesized audio. See the [full list](../../../language-support.md#prebuilt-neural-voices) of supported neural voices.

> [!NOTE]
> The default voice is the first voice returned per locale via the [Voice List API](../../../rest-text-to-speech.md#get-a-list-of-voices).

The voice that speaks is determined in order of priority as follows:
- If you don't set `SpeechSynthesisVoiceName` or `SpeechSynthesisLanguage`, the default voice for `en-US` will speak.
- If you only set `SpeechSynthesisLanguage`, the default voice for the specified locale will speak.
- If both `SpeechSynthesisVoiceName` and `SpeechSynthesisLanguage` are set, the `SpeechSynthesisLanguage` setting is ignored. The voice that you specified via `SpeechSynthesisVoiceName` will speak.
- If the voice element is set via [Speech Synthesis Markup Language (SSML)](../../../speech-synthesis-markup.md), the `SpeechSynthesisVoiceName` and `SpeechSynthesisLanguage` settings are ignored.

## Synthesize speech to a file

Next, you create a [`SpeechSynthesizer`](/cpp/cognitive-services/speech/speechsynthesizer) object. This object executes text-to-speech conversions and outputs to speakers, files, or other output streams. `SpeechSynthesizer` accepts as parameters:

- The [`SpeechConfig`](/cpp/cognitive-services/speech/speechconfig) object that you created in the previous step
- An [`AudioConfig`](/cpp/cognitive-services/speech/audio-audioconfig) object that specifies how output results should be handled

To start, create an `AudioConfig` instance to automatically write the output to a .wav file by using the `FromWavFileOutput()` function:

```cpp
void synthesizeSpeech()
{
    auto config = SpeechConfig::FromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>");
    auto audioConfig = AudioConfig::FromWavFileOutput("path/to/write/file.wav");
}
```

Next, create a `SpeechSynthesizer` instance. Pass your `config` object and the `audioConfig` object as parameters. Then, executing speech synthesis and writing to a file is as simple as running `SpeakTextAsync()` with a string of text.

```cpp
void synthesizeSpeech()
{
    auto config = SpeechConfig::FromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>");
    auto audioConfig = AudioConfig::FromWavFileOutput("path/to/write/file.wav");
    auto synthesizer = SpeechSynthesizer::FromConfig(config, audioConfig);
    auto result = synthesizer->SpeakTextAsync("A simple test to write to a file.").get();
}
```

Run the program. A synthesized .wav file is written to the location that you specified. This is a good example of the most basic usage. Next, you look at customizing output and handling the output response as an in-memory stream for working with custom scenarios.

## Synthesize to speaker output

In some cases, you might want to output synthesized speech directly to a speaker. To do this, omit the `AudioConfig` parameter when you're creating the `SpeechSynthesizer` instance in the previous example. This change synthesizes to the current active output device.

```cpp
void synthesizeSpeech()
{
    auto config = SpeechConfig::FromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>");
    auto synthesizer = SpeechSynthesizer::FromConfig(config);
    auto result = synthesizer->SpeakTextAsync("Synthesizing directly to speaker output.").get();
}
```

## Get a result as an in-memory stream

For many scenarios in speech application development, you likely need the resulting audio data as an in-memory stream rather than writing directly to a file. This allows you to build custom behavior, including:

* Abstract the resulting byte array as a seekable stream for custom downstream services.
* Integrate the result with other APIs or services.
* Modify the audio data, write custom .wav headers, and do related tasks.

It's simple to make this change from the previous example. First, remove the `AudioConfig` block, because you'll manage the output behavior manually from this point onward for increased control. Then pass `NULL` for `AudioConfig` in the `SpeechSynthesizer` constructor.

> [!NOTE]
> Passing `NULL` for `AudioConfig`, rather than omitting it as you did in the previous speaker output example, will not play the audio by default on the current active output device.

This time, save the result to a [`SpeechSynthesisResult`](/cpp/cognitive-services/speech/speechsynthesisresult) variable. The `GetAudioData` getter returns a byte vector of the output data. You can work with this byte vector manually, or you can use the [`AudioDataStream`](/cpp/cognitive-services/speech/audiodatastream) class to manage the in-memory stream. In this example, you use the `AudioDataStream::FromResult()` static function to get a stream from the result:

```cpp
void synthesizeSpeech()
{
    auto config = SpeechConfig::FromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>");
    auto synthesizer = SpeechSynthesizer::FromConfig(config, NULL);

    auto result = synthesizer->SpeakTextAsync("Getting the response as an in-memory stream.").get();
    auto stream = AudioDataStream::FromResult(result);
}
```

From here, you can implement any custom behavior by using the resulting `stream` object.
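
For example, a minimal sketch of draining the stream into a byte buffer might look like the following. It assumes the `AudioDataStream::ReadData(buffer, size)` overload, which returns the number of bytes copied; adjust the calls to match your SDK version.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>
#include <speechapi_cxx.h>

using namespace Microsoft::CognitiveServices::Speech;

void readSynthesisStream()
{
    auto config = SpeechConfig::FromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>");
    auto synthesizer = SpeechSynthesizer::FromConfig(config, NULL);

    auto result = synthesizer->SpeakTextAsync("Getting the response as an in-memory stream.").get();
    auto stream = AudioDataStream::FromResult(result);

    // Drain the stream into a byte buffer; ReadData returns 0 when no data is left.
    std::vector<uint8_t> audioData;
    uint8_t buffer[4096];
    uint32_t bytesRead = 0;
    while ((bytesRead = stream->ReadData(buffer, sizeof(buffer))) > 0)
    {
        audioData.insert(audioData.end(), buffer, buffer + bytesRead);
    }

    std::cout << "Read " << audioData.size() << " bytes of synthesized audio." << std::endl;
}
```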

## Customize audio format

You can customize audio output attributes, including:

* Audio file type
* Sample rate
* Bit depth

To change the audio format, you use the `SetSpeechSynthesisOutputFormat()` function on the `SpeechConfig` object. This function expects an `enum` instance of type [`SpeechSynthesisOutputFormat`](/cpp/cognitive-services/speech/microsoft-cognitiveservices-speech-namespace#speechsynthesisoutputformat), which you use to select the output format. See the [list of audio formats](/cpp/cognitive-services/speech/microsoft-cognitiveservices-speech-namespace#speechsynthesisoutputformat) that are available.

There are various options for different file types, depending on your requirements. By definition, raw formats like `Raw24Khz16BitMonoPcm` don't include audio headers. Use raw formats only in one of these situations:

- You know that your downstream implementation can decode a raw bitstream.
- You plan to manually build headers based on factors like bit depth, sample rate, and number of channels.

In this example, you specify the high-fidelity RIFF format `Riff24Khz16BitMonoPcm` by setting `SpeechSynthesisOutputFormat` on the `SpeechConfig` object. Similar to the example in the previous section, you use [`AudioDataStream`](/cpp/cognitive-services/speech/audiodatastream) to get an in-memory stream of the result, and then write it to a file.

```cpp
void synthesizeSpeech()
{
    auto config = SpeechConfig::FromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>");
    config->SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat::Riff24Khz16BitMonoPcm);

    auto synthesizer = SpeechSynthesizer::FromConfig(config, NULL);
    auto result = synthesizer->SpeakTextAsync("A simple test to write to a file.").get();

    auto stream = AudioDataStream::FromResult(result);
    stream->SaveToWavFileAsync("path/to/write/file.wav").get();
}
```

Running your program again will write a .wav file to the specified path.

## Use SSML to customize speech characteristics

You can use SSML to fine-tune the pitch, pronunciation, speaking rate, volume, and more in the text-to-speech output by submitting your requests from an XML schema. This section shows an example of changing the voice. For a more detailed guide, see the [SSML how-to article](../../../speech-synthesis-markup.md).

To start using SSML for customization, you make a simple change that switches the voice.

First, create a new XML file for the SSML configuration in your root project directory. In this example, it's `ssml.xml`. The root element is always `<speak>`. Wrapping the text in a `<voice>` element allows you to change the voice by using the `name` parameter. See the [full list](../../../language-support.md#prebuilt-neural-voices) of supported neural voices.

```xml
<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    When you're on the freeway, it's a good idea to use a GPS.
  </voice>
</speak>
```

Next, you need to change the speech synthesis request to reference your XML file. The request is mostly the same, but instead of using the `SpeakTextAsync()` function, you use `SpeakSsmlAsync()`. This function expects an XML string, so you first load your SSML configuration as a string. From here, the result object is exactly the same as in previous examples.

```cpp
void synthesizeSpeech()
{
    auto config = SpeechConfig::FromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>");
    auto synthesizer = SpeechSynthesizer::FromConfig(config, NULL);

    // Read the SSML file into a string (requires <fstream> and <string>).
    std::ifstream file("./ssml.xml");
    std::string ssml, line;
    while (std::getline(file, line))
    {
        ssml += line;
        ssml.push_back('\n');
    }
    auto result = synthesizer->SpeakSsmlAsync(ssml).get();

    auto stream = AudioDataStream::FromResult(result);
    stream->SaveToWavFileAsync("path/to/write/file.wav").get();
}
```

> [!NOTE]
> To change the voice without using SSML, you can set the property on `SpeechConfig` by using `config->SetSpeechSynthesisVoiceName("en-US-ChristopherNeural")`.
