
Commit 19856a4

Merge pull request #212772 from sally-baolian/patch-64
Update language-support.md
2 parents 3386401 + 10ff997 commit 19856a4

4 files changed: +7 −6 lines changed

articles/cognitive-services/Speech-Service/language-support.md

Lines changed: 1 addition & 1 deletion
@@ -42,7 +42,7 @@ Each prebuilt neural voice supports a specific language and dialect, identified
 > [!IMPORTANT]
 > Pricing varies for Prebuilt Neural Voice (see *Neural* on the pricing page) and Custom Neural Voice (see *Custom Neural* on the pricing page). For more information, see the [Pricing](https://azure.microsoft.com/pricing/details/cognitive-services/speech-services/) page.

-Prebuilt neural voices are created from samples that use a 24-khz sample rate. All voices can upsample or downsample to other sample rates when synthesizing.
+Each prebuilt neural voice model is available at 24kHz and high-fidelity 48kHz. Other sample rates can be obtained through upsampling or downsampling when synthesizing.

 Please note that the following neural voices are retired.
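For context on the change above, here is a minimal sketch of requesting the 48kHz high-fidelity output through the Speech SDK for Python. The key, region, voice name, and the `Riff48Khz16BitMonoPcm` enum value are illustrative assumptions, not part of this diff:

```python
import azure.cognitiveservices.speech as speechsdk

# Placeholders: supply your own Speech resource key and region.
speech_config = speechsdk.SpeechConfig(subscription="YOUR_SPEECH_KEY", region="eastus")
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

# Asking for a 48kHz RIFF format selects the high-fidelity model; other
# sample rates are produced by upsampling or downsampling at synthesis time.
speech_config.set_speech_synthesis_output_format(
    speechsdk.SpeechSynthesisOutputFormat.Riff48Khz16BitMonoPcm
)

audio_config = speechsdk.audio.AudioOutputConfig(filename="greeting-48khz.wav")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

result = synthesizer.speak_text_async("The high-fidelity voice model is speaking.").get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Saved 48kHz audio to greeting-48khz.wav")
```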

articles/cognitive-services/Speech-Service/long-audio-api.md

Lines changed: 2 additions & 0 deletions
@@ -455,6 +455,8 @@ We support flexible audio output formats. You can generate audio outputs per par
 > [!NOTE]
 > The default audio format is riff-16khz-16bit-mono-pcm.
+>
+> The sample rate for long audio voices is 24kHz, not 48kHz. Other sample rates can be obtained through upsampling or downsampling when synthesizing.

 * riff-8khz-16bit-mono-pcm
 * riff-16khz-16bit-mono-pcm

articles/cognitive-services/Speech-Service/rest-text-to-speech.md

Lines changed: 3 additions & 4 deletions
@@ -272,7 +272,7 @@ If the HTTP status is `200 OK`, the body of the response contains an audio file
 ## Audio outputs

-The supported streaming and non-streaming audio formats are sent in each request as the `X-Microsoft-OutputFormat` header. Each format incorporates a bit rate and encoding type. The Speech service supports 48-kHz, 24-kHz, 16-kHz, and 8-kHz audio outputs. Prebuilt neural voices are created from samples that use a 24-khz sample rate. All voices can upsample or downsample to other sample rates when synthesizing.
+The supported streaming and non-streaming audio formats are sent in each request as the `X-Microsoft-OutputFormat` header. Each format incorporates a bit rate and encoding type. The Speech service supports 48-kHz, 24-kHz, 16-kHz, and 8-kHz audio outputs. Each prebuilt neural voice model is available at 24kHz and high-fidelity 48kHz.

 #### [Streaming](#tab/streaming)
@@ -322,9 +322,8 @@ riff-48khz-16bit-mono-pcm
 ***

 > [!NOTE]
-> en-US-AriaNeural, en-US-JennyNeural and zh-CN-XiaoxiaoNeural are available in public preview in 48Khz output. Other voices support 24khz upsampled to 48khz output.
-
-> [!NOTE]
+> If you select a 48kHz output format, the high-fidelity 48kHz voice model is invoked. Sample rates other than 24kHz and 48kHz can be obtained through upsampling or downsampling when synthesizing; for example, 44.1kHz is downsampled from 48kHz.
+>
 > If your selected voice and output format have different bit rates, the audio is resampled as necessary. You can decode the `ogg-24khz-16bit-mono-opus` format by using the [Opus codec](https://opus-codec.org/downloads/).

 ## Next steps
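As context for the `X-Microsoft-OutputFormat` behavior this file documents, here is a minimal sketch of a REST synthesis request that asks for the 48kHz format. The region, key, voice, and output filename are placeholders; the endpoint shape follows the single-voice endpoint described in rest-text-to-speech.md:

```python
import requests

# Placeholders: substitute your own Speech resource region and key.
region = "eastus"
subscription_key = "YOUR_SPEECH_KEY"
url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"

headers = {
    "Ocp-Apim-Subscription-Key": subscription_key,
    "Content-Type": "application/ssml+xml",
    # Requesting a 48kHz format; per the note above, this invokes the
    # high-fidelity voice model rather than upsampled 24kHz audio.
    "X-Microsoft-OutputFormat": "riff-48khz-16bit-mono-pcm",
    "User-Agent": "tts-format-sample",
}

ssml = (
    "<speak version='1.0' xml:lang='en-US'>"
    "<voice name='en-US-JennyNeural'>Hello from the high-fidelity voice model.</voice>"
    "</speak>"
)

response = requests.post(url, headers=headers, data=ssml.encode("utf-8"))
response.raise_for_status()

# The response body is the RIFF/WAV audio at 48kHz.
with open("hello-48khz.wav", "wb") as f:
    f.write(response.content)
```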

articles/cognitive-services/Speech-Service/text-to-speech.md

Lines changed: 1 addition & 1 deletion
@@ -41,7 +41,7 @@ Here's more information about neural text-to-speech features in the Speech servi
 * **Asynchronous synthesis of long audio**: Use the [Long Audio API](long-audio-api.md) to asynchronously synthesize text-to-speech files longer than 10 minutes (for example, audio books or lectures). Unlike synthesis performed via the Speech SDK or speech-to-text REST API, responses aren't returned in real time. The expectation is that requests are sent asynchronously, responses are polled for, and synthesized audio is downloaded when the service makes it available.

-* **Prebuilt neural voices**: Microsoft neural text-to-speech capability uses deep neural networks to overcome the limits of traditional speech synthesis with regard to stress and intonation in spoken language. Prosody prediction and voice synthesis happen simultaneously, which results in more fluid and natural-sounding outputs. You can use neural voices to:
+* **Prebuilt neural voices**: Microsoft neural text-to-speech capability uses deep neural networks to overcome the limits of traditional speech synthesis with regard to stress and intonation in spoken language. Prosody prediction and voice synthesis happen simultaneously, which results in more fluid and natural-sounding outputs. Each prebuilt neural voice model is available at 24kHz and high-fidelity 48kHz. You can use neural voices to:

   - Make interactions with chatbots and voice assistants more natural and engaging.
   - Convert digital texts such as e-books into audiobooks.
