articles/ai-services/speech-service/how-to-speech-synthesis-viseme.md (1 addition, 1 deletion)
@@ -94,7 +94,7 @@ The blend shapes JSON string is represented as a 2-dimensional matrix. Each row
 To get viseme with your synthesized speech, subscribe to the `VisemeReceived` event in the Speech SDK.

 > [!NOTE]
-> To request SVG or blend shapes output, you should use the `mstts:viseme` element in SSML. For details, see [how to use viseme element in SSML](speech-synthesis-markup-structure.md#viseme-element).
+> To request SVG or blend shapes output, you should use the `mstts:viseme` element in SSML. For details, see [how to use viseme element in SSML](speech-synthesis-markup-voice.md#viseme-element).

 The following snippet shows how to subscribe to the viseme event:
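For context, a minimal sketch of subscribing to this event with the Speech SDK for Python (where the event is exposed as `viseme_received`); the subscription key, region, and input text are placeholders, not values from this PR:

```python
import azure.cognitiveservices.speech as speechsdk

# Placeholder credentials: replace with your own key and region.
speech_config = speechsdk.SpeechConfig(subscription="YourSpeechKey", region="YourRegion")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

def on_viseme(evt):
    # audio_offset is in ticks (100 ns); divide by 10,000 for milliseconds.
    print(f"Viseme id: {evt.viseme_id}, audio offset: {evt.audio_offset / 10000} ms")

synthesizer.viseme_received.connect(on_viseme)
result = synthesizer.speak_text_async("Rainbow has seven colors.").get()
```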
articles/ai-services/speech-service/includes/language-support/tts.md (1 addition, 1 deletion)
@@ -168,7 +168,7 @@ ms.custom: references_regions
 <sup>2</sup> The neural voice is available in public preview in these service [regions](../../regions.md): Central India, East Asia, East US, Southeast Asia, and West US.

-<sup>3</sup> [Phonemes](../../speech-synthesis-markup-pronunciation.md#phoneme-element), [custom lexicon](../../speech-synthesis-markup-pronunciation.md#custom-lexicon), and [visemes](../../speech-synthesis-markup-structure.md#viseme-element) aren't supported. For details about supported visemes, see [viseme locales](../../language-support.md?tabs=tts#viseme).
+<sup>3</sup> [Phonemes](../../speech-synthesis-markup-pronunciation.md#phoneme-element), [custom lexicon](../../speech-synthesis-markup-pronunciation.md#custom-lexicon), and [visemes](../../speech-synthesis-markup-voice.md#viseme-element) aren't supported. For details about supported visemes, see [viseme locales](../../language-support.md?tabs=tts#viseme).

 <sup>4</sup> The neural voice is a multilingual voice in Azure AI Speech. The Turbo version of Azure OpenAI voices has a similar voice persona to Azure OpenAI voices but supports extra features: Turbo voices support the full set of SSML elements and more features, such as word boundary, just like other Azure AI Speech voices.
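As a side note on the word boundary feature mentioned in footnote 4, here's a minimal sketch of consuming word boundary events with the Speech SDK for Python (key, region, and text are placeholders):

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YourSpeechKey", region="YourRegion")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

def on_word_boundary(evt):
    # Fires as each word is reached in the synthesized audio stream.
    print(f"Word: {evt.text!r}, audio offset: {evt.audio_offset / 10000} ms")

synthesizer.synthesis_word_boundary.connect(on_word_boundary)
synthesizer.speak_text_async("Turbo voices support word boundary events.").get()
```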
articles/ai-services/speech-service/includes/release-notes/release-notes-tts.md (1 addition, 1 deletion)
@@ -856,7 +856,7 @@ For more information, see the [language and voice list](../../language-support.m
 #### Get facial position with viseme

 * Added support for blend shapes to drive the facial movements of a 3D character that you designed. Learn more at [how to get facial position with viseme](../../how-to-speech-synthesis-viseme.md).
-* SSML updated to support viseme element. See [speech synthesis markup](../../speech-synthesis-markup-structure.md#viseme-element).
+* SSML updated to support viseme element. See [speech synthesis markup](../../speech-synthesis-markup-voice.md#viseme-element).
articles/ai-services/speech-service/language-support.md (8 additions, 1 deletion)
@@ -94,7 +94,7 @@ Use the following table to determine supported styles and roles for each voice.
 ### Viseme

-This table lists all the locales supported for [Viseme](speech-synthesis-markup-structure.md#viseme-element). For more information about Viseme, see [Get facial position with viseme](how-to-speech-synthesis-viseme.md) and [Viseme element](speech-synthesis-markup-structure.md#viseme-element).
+This table lists all the locales supported for [Viseme](speech-synthesis-markup-voice.md#viseme-element). For more information about Viseme, see [Get facial position with viseme](how-to-speech-synthesis-viseme.md) and [Viseme element](speech-synthesis-markup-voice.md#viseme-element).

 [!INCLUDE [Language support include](includes/language-support/viseme.md)]
@@ -125,6 +125,13 @@ With the cross-lingual feature, you can transfer your custom voice model to spea
 [!INCLUDE [Language support include](includes/language-support/personal-voice.md)]
+
+### Voice conversion
+
+[Voice conversion](voice-conversion.md) is a feature that lets you transform the voice characteristics of a given audio input to match a target speaker's voice. The following table summarizes the locales supported for voice conversion.
+
+[!INCLUDE [Language support include](includes/language-support/voice-conversion.md)]

 The table in this section summarizes the 33 locales supported for pronunciation assessment, and each language is available in all [speech to text regions](regions.md#regions). The latest update extends support from English to 32 more languages and brings quality enhancements to existing features, including accuracy, fluency, and miscue assessment. You should specify the language that you're learning or practicing to improve your pronunciation. The default language is `en-US`. If you know your target learning language, [set the locale](how-to-pronunciation-assessment.md#get-pronunciation-assessment-results) accordingly. For example, if you're learning British English, specify the language as `en-GB`. If you're teaching a broader language, such as Spanish, and you're uncertain which locale to select, you can run various accent models (`es-ES`, `es-MX`) to determine which one achieves the highest score for your scenario. If you're interested in a language that isn't listed in the following table, fill out this [intake form](https://aka.ms/speechpa/intake) for further assistance.
@@ -69,6 +70,7 @@ Some examples of contents that are allowed in each element are described in the
 - `math`: This element can only contain text and MathML elements.
 - `mstts:audioduration`: This element can't contain text or any other elements.
 - `mstts:backgroundaudio`: This element can't contain text or any other elements.
+- `mstts:voiceconversion`: This element can't contain text or any other elements. It specifies the source audio URL for the voice conversion (see the sketch after this list).
 - `mstts:embedding`: This element can contain text and the following elements: `audio`, `break`, `emphasis`, `lang`, `phoneme`, `prosody`, `say-as`, and `sub`.
 - `mstts:express-as`: This element can contain text and the following elements: `audio`, `break`, `emphasis`, `lang`, `phoneme`, `prosody`, `say-as`, and `sub`.
 - `mstts:silence`: This element can't contain text or any other elements.
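A minimal sketch of how `mstts:voiceconversion` might be used. The `url` attribute name, voice name, and audio URL below are assumptions for illustration; they aren't confirmed by this diff, so check the voice conversion article for the exact syntax:

```python
import azure.cognitiveservices.speech as speechsdk

# Hypothetical SSML: the `url` attribute and the source audio URL are
# illustrative assumptions, not the documented syntax.
ssml = """<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis'
       xmlns:mstts='http://www.w3.org/2001/mstts' xml:lang='en-US'>
  <voice name='en-US-JennyNeural'>
    <mstts:voiceconversion url='https://example.com/source-audio.wav'/>
  </voice>
</speak>"""

speech_config = speechsdk.SpeechConfig(subscription="YourSpeechKey", region="YourRegion")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_ssml_async(ssml).get()
```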
@@ -259,36 +261,6 @@ As an example, you might want to know the time offset of each flower word in the
 </speak>

-## Viseme element
-
-A viseme is the visual description of a phoneme in spoken language. It defines the position of the face and mouth while a person is speaking. You can use the `mstts:viseme` element in SSML to request viseme output. For more information, see [Get facial position with viseme](how-to-speech-synthesis-viseme.md).
-
-The viseme setting is applied to all input text within its enclosing `voice` element. To reset or change the viseme setting again, you must use a new `voice` element with either the same voice or a different voice.
-
-Usage of the `viseme` element's attributes is described in the following table.
-
-| Attribute | Description | Required or optional |
-| ---------- | ---------- | ---------- |
-| `type` | The type of viseme output.<ul><li>`redlips_front` – lip-sync with viseme ID and audio offset output</li><li>`FacialExpression` – blend shapes output</li></ul> | Required |
-
-> [!NOTE]
-> Currently, `redlips_front` only supports neural voices in the `en-US` locale, and `FacialExpression` supports neural voices in the `en-US` and `zh-CN` locales.
-
-### Viseme examples
-
-The supported values for attributes of the `viseme` element were [described previously](#viseme-element).
-
-This SSML snippet illustrates how to request blend shapes with your synthesized speech.
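The snippet itself is truncated in this diff. Based on the attribute table above, a minimal sketch of requesting blend shapes looks like the following; the voice name and input text are placeholders, not the article's original example:

```python
import azure.cognitiveservices.speech as speechsdk

# Illustrative SSML built from the attribute table above: the mstts:viseme
# element with type='FacialExpression' requests blend shapes output.
ssml = """<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis'
       xmlns:mstts='http://www.w3.org/2001/mstts' xml:lang='en-US'>
  <voice name='en-US-JennyNeural'>
    <mstts:viseme type='FacialExpression'/>
    Rainbow has seven colors.
  </voice>
</speak>"""

speech_config = speechsdk.SpeechConfig(subscription="YourSpeechKey", region="YourRegion")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

def on_viseme(evt):
    if evt.animation:  # blend shapes arrive as JSON in the animation field
        print("Blend shapes frame batch:", evt.animation[:80], "...")

synthesizer.viseme_received.connect(on_viseme)
result = synthesizer.speak_ssml_async(ssml).get()
```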
0 commit comments