Commit a48a306

Merge pull request #223861 from sally-baolian/patch-90
Audio duration element
2 parents b6f4346 + 3ec8469 commit a48a306

3 files changed

+31
-1
lines changed

articles/cognitive-services/Speech-Service/rest-text-to-speech.md

Lines changed: 1 addition & 1 deletion
@@ -71,7 +71,7 @@ curl --location --request GET 'https://YOUR_RESOURCE_REGION.tts.speech.microsoft
7171

7272
### Sample response
7373

74-
You should receive a response with a JSON body that includes all supported locales, voices, gender, styles, and other details. This JSON example shows partial results to illustrate the structure of a response:
74+
You should receive a response with a JSON body that includes all supported locales, voices, gender, styles, and other details. The `WordsPerMinute` property for each voice can be used to estimate the length of the output speech. This JSON example shows partial results to illustrate the structure of a response:
7575

7676
```json
7777
[

articles/cognitive-services/Speech-Service/speech-synthesis-markup-structure.md

Lines changed: 2 additions & 0 deletions
@@ -37,6 +37,7 @@ Here's a subset of the basic structure and syntax of an SSML document:
3737
<lang xml:lang="string"></lang>
3838
<lexicon uri="string"/>
3939
<math xmlns="http://www.w3.org/1998/Math/MathML"></math>
40+
<mstts:audioduration value="string"/>
4041
<mstts:express-as style="string" styledegree="value" role="string"></mstts:express-as>
4142
<mstts:silence type="string" value="string"/>
4243
<mstts:viseme type="string"/>
@@ -58,6 +59,7 @@ Some examples of contents that are allowed in each element are described in the
5859
- `lang`: This element can contain all other elements except `mstts:backgroundaudio`, `voice`, and `speak`.
5960
- `lexicon`: This element can't contain text or any other elements.
6061
- `math`: This element can only contain text and MathML elements.
62+
- `mstts:audioduration`: This element can't contain text or any other elements.
6163
- `mstts:backgroundaudio`: This element can't contain text or any other elements.
6264
- `mstts:express-as`: This element can contain text and the following elements: `audio`, `break`, `emphasis`, `lang`, `phoneme`, `prosody`, `say-as`, and `sub`.
6365
- `mstts:silence`: This element can't contain text or any other elements.

articles/cognitive-services/Speech-Service/speech-synthesis-markup-voice.md

Lines changed: 28 additions & 0 deletions
@@ -406,6 +406,34 @@ This SSML snippet illustrates how the `src` attribute is used to insert audio fr
406406
</speak>
407407
```
408408

409+
## Audio duration
410+
411+
Use the `mstts:audioduration` element to set the duration of the output audio. This element helps you synchronize when the audio output finishes. The duration can be set to between 0.5 and 2 times the duration of the original audio, where the original audio is the audio produced without any other rate settings. The speaking rate is slowed down or sped up to match the value that you set.
412+
413+
The audio duration setting is applied to all input text within its enclosing `voice` element. To reset or change the audio duration setting again, you must use a new `voice` element with either the same voice or a different voice.
414+
415+
The attributes of the `mstts:audioduration` element are described in the following table.
416+
417+
| Attribute | Description | Required or optional |
418+
| ---------- | ---------- | ---------- |
419+
| `value` | The requested duration of the output audio in either seconds (such as `2s`) or milliseconds (such as `2000ms`).<br/><br/>This value should be within 0.5 to 2 times the duration of the original audio without any other rate settings. For example, if the requested duration of your audio is `30s`, then the original audio must have otherwise been between 15 and 60 seconds. If you set a value outside of these boundaries, the duration is set according to the respective minimum or maximum multiple.<br/><br/>Given your requested output audio duration, the Speech service adjusts the speaking rate accordingly. Use the [voice list](rest-text-to-speech.md#get-a-list-of-voices) API and check the `WordsPerMinute` attribute to find out the speaking rate of the neural voice that you're using. You can divide the number of words in your input text by the value of the `WordsPerMinute` attribute to get the approximate original output audio duration in minutes. The output audio sounds most natural when the audio duration that you set is close to this estimated duration. | Required |
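
The estimate and the 0.5x to 2x limits described in the table can be sketched in Python. This is a minimal sketch, not part of the Speech service or its SDK: both function names are hypothetical, words are counted with a simple whitespace split, and the `WordsPerMinute` value is assumed to have been read already from the voice list response. The result is only an approximation of what the service produces.

```python
def estimate_original_seconds(text: str, words_per_minute: float) -> float:
    """Approximate the unadjusted audio duration, in seconds."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    word_count = len(text.split())  # assumption: whitespace-delimited words
    # words / (words per minute) gives minutes; multiply by 60 for seconds.
    return word_count * 60.0 / words_per_minute


def effective_seconds(requested: float, original: float) -> float:
    """Clamp a requested duration to 0.5x-2x of the original duration."""
    return min(max(requested, 0.5 * original), 2.0 * original)


if __name__ == "__main__":
    sample_text = "word " * 150          # placeholder input of 150 words
    original = estimate_original_seconds(sample_text, 150)  # hypothetical 150 WPM voice
    print(original, effective_seconds(20.0, original))
```

A requested value that falls inside the range is returned unchanged, which mirrors the table's statement that only out-of-range values are moved to the minimum or maximum multiple.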
420+
421+
### mstts audio duration examples
422+
423+
The supported values for attributes of the `mstts:audioduration` element were [described previously](#audio-duration).
424+
425+
In this example, the original audio is around 15 seconds. The `mstts:audioduration` element is used to set the audio duration to 20 seconds (`20s`).
426+
427+
```xml
428+
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
429+
<voice name="en-US-JennyNeural">
430+
<mstts:audioduration value="20s"/>
431+
If we're home schooling, the best we can do is roll with what each day brings and try to have fun along the way.
432+
A good place to start is by trying out the slew of educational apps that are helping children stay happy and smash their schooling at the same time.
433+
</voice>
434+
</speak>
435+
```
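
If the SSML is generated programmatically, the same document can be assembled from a string template. The following Python sketch is illustrative only: `build_ssml` is a hypothetical helper, not an SDK function, and it assumes the duration is supplied as a whole number of milliseconds. It uses only the Python standard library to escape the input text and attribute values.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text: str, voice: str, duration_ms: int, lang: str = "en-US") -> str:
    """Return an SSML document that requests a fixed output audio duration."""
    if duration_ms <= 0:
        raise ValueError("duration_ms must be positive")
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="http://www.w3.org/2001/mstts" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>"
        # The element is empty; it applies to all text in this voice element.
        f'<mstts:audioduration value="{duration_ms}ms"/>'
        f"{escape(text)}"  # escape &, <, and > in the input text
        "</voice></speak>"
    )


if __name__ == "__main__":
    print(build_ssml("Reading & writing can be fun.", "en-US-JennyNeural", 20000))
```

Because the setting applies to the whole enclosing `voice` element, the helper emits one `voice` element per call; a second duration would need a second `voice` element.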
436+
409437
## Background audio
410438

411439
You can use the `mstts:backgroundaudio` element to add background audio to your SSML documents or mix an audio file with text-to-speech. With `mstts:backgroundaudio`, you can loop an audio file in the background, fade in at the beginning of text-to-speech, and fade out at the end of text-to-speech.
