Merge pull request #223861 from sally-baolian/patch-90

prmerger-automator[bot] · web-flow · commit a48a3064ae95 · 2023-02-02T05:52:31.000Z
Audio duration element
diff --git a/articles/cognitive-services/Speech-Service/rest-text-to-speech.md b/articles/cognitive-services/Speech-Service/rest-text-to-speech.md
@@ -71,7 +71,7 @@ curl --location --request GET 'https://YOUR_RESOURCE_REGION.tts.speech.microsoft
 
 ### Sample response
 
-You should receive a response with a JSON body that includes all supported locales, voices, gender, styles, and other details. This JSON example shows partial results to illustrate the structure of a response:
+You should receive a response with a JSON body that includes all supported locales, voices, gender, styles, and other details. The `WordsPerMinute` property for each voice can be used to estimate the length of the output speech. This JSON example shows partial results to illustrate the structure of a response:
 
 ```json
 [  
diff --git a/articles/cognitive-services/Speech-Service/speech-synthesis-markup-structure.md b/articles/cognitive-services/Speech-Service/speech-synthesis-markup-structure.md
@@ -37,6 +37,7 @@ Here's a subset of the basic structure and syntax of an SSML document:
         <lang xml:lang="string"></lang>
         <lexicon uri="string"/>
         <math xmlns="http://www.w3.org/1998/Math/MathML"></math>
+        <mstts:audioduration value="string"/>
         <mstts:express-as style="string" styledegree="value" role="string"></mstts:express-as>
         <mstts:silence type="string" value="string"/>
         <mstts:viseme type="string"/>
@@ -58,6 +59,7 @@ Some examples of contents that are allowed in each element are described in the
 - `lang`: This element can contain all other elements except `mstts:backgroundaudio`, `voice`, and `speak`.
 - `lexicon`: This element can't contain text or any other elements.
 - `math`: This element can only contain text and MathML elements.
+- `mstts:audioduration`: This element can't contain text or any other elements.
 - `mstts:backgroundaudio`: This element can't contain text or any other elements.
 - `mstts:express-as`: This element can contain text and the following elements: `audio`, `break`, `emphasis`, `lang`, `phoneme`, `prosody`, `say-as`, and `sub`.
 - `mstts:silence`: This element can't contain text or any other elements.
diff --git a/articles/cognitive-services/Speech-Service/speech-synthesis-markup-voice.md b/articles/cognitive-services/Speech-Service/speech-synthesis-markup-voice.md
@@ -406,6 +406,34 @@ This SSML snippet illustrates how the `src` attribute is used to insert audio fr
 </speak>
 ```
 
+## Audio duration
+
+Use the `mstts:audioduration` element to set the duration of the output audio, for example to synchronize when audio output completes. The duration can be set to between 0.5 and 2 times the duration of the original audio, where the original audio is the audio without any other rate settings. The speaking rate is slowed down or sped up to match the set value.
+
+The audio duration setting is applied to all input text within its enclosing `voice` element. To reset or change the audio duration setting again, you must use a new `voice` element with either the same voice or a different voice.
+
+The `mstts:audioduration` element's attribute is described in the following table.
+
+| Attribute | Description | Required or optional |
+| ---------- | ---------- | ---------- |
+| `value` | The requested duration of the output audio in either seconds (such as `2s`) or milliseconds (such as `2000ms`).<br/><br/>This value should be between 0.5 and 2 times the duration of the original audio, that is, the audio without any other rate settings. For example, if the requested duration of your audio is `30s`, then the original audio must have otherwise been between 15 and 60 seconds. If you set a value outside of these boundaries, the duration is set according to the respective minimum or maximum multiple.<br/><br/>Given your requested output audio duration, the Speech service adjusts the speaking rate accordingly. Use the [voice list](rest-text-to-speech.md#get-a-list-of-voices) API and check the `WordsPerMinute` property to find the speaking rate of the neural voice that you're using. Divide the number of words in your input text by the `WordsPerMinute` value to get the approximate original output audio duration in minutes. The output audio sounds most natural when the requested duration is close to this estimate. | Required |
+
+### mstts audio duration examples
+
+The supported values for attributes of the `mstts:audioduration` element were [described previously](#audio-duration).
+
+In this example, the original audio is around 15 seconds. The `mstts:audioduration` element is used to set the audio duration to 20 seconds (`20s`).
+
+```xml
+<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
+<voice name="en-US-JennyNeural">
+<mstts:audioduration value="20s"/>
+If we're home schooling, the best we can do is roll with what each day brings and try to have fun along the way.
+A good place to start is by trying out the slew of educational apps that are helping children stay happy and smash their schooling at the same time.
+</voice>
+</speak>
+```
+
 ## Background audio
 
 You can use the `mstts:backgroundaudio` element to add background audio to your SSML documents or mix an audio file with text-to-speech. With `mstts:backgroundaudio`, you can loop an audio file in the background, fade in at the beginning of text-to-speech, and fade out at the end of text-to-speech.
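
Note on the `value` attribute row added in this commit: the estimate and the 0.5x to 2x bounds it describes can be sketched in a few lines of Python. This is only an illustration and not part of the commit. The function names, the whitespace-based word count, and the sample `WordsPerMinute` figure of 120 are assumptions; the real figure comes from the voice list API, and the service's actual clamping behavior may differ in detail.

```python
def estimate_original_seconds(text: str, words_per_minute: float) -> float:
    """Approximate the unadjusted audio length: words / WordsPerMinute, in seconds."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    word_count = len(text.split())  # assumption: whitespace-separated tokens count as words
    return word_count / words_per_minute * 60.0


def effective_duration_seconds(requested: float, original: float) -> float:
    """Clamp the requested duration to between 0.5x and 2x the original duration."""
    return min(max(requested, 0.5 * original), 2.0 * original)


def audioduration_value(seconds: float) -> str:
    """Format a duration as a millisecond string for the `value` attribute."""
    return f"{round(seconds * 1000)}ms"


if __name__ == "__main__":
    sample_text = "word " * 30          # hypothetical 30-word input
    original = estimate_original_seconds(sample_text, words_per_minute=120)
    effective = effective_duration_seconds(20.0, original)
    print(original, effective, audioduration_value(effective))
```

With these assumed inputs the estimate is 15 seconds, so a requested 20 seconds falls inside the 7.5 to 30 second window and is used unchanged.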
