Merge pull request #258741 from eric-urban/eur/tts-stt-updates

PMEds28 · web-flow · commit f16cc80515bf · 2023-11-16T09:25:30.000Z
tts stt updates
diff --git a/articles/ai-services/responsible-use-of-ai-overview.md b/articles/ai-services/responsible-use-of-ai-overview.md
@@ -186,6 +186,10 @@ Azure AI services provides information and guidelines on how to responsibly use
 * [Code of conduct](/legal/cognitive-services/speech-service/tts-code-of-conduct?context=/azure/ai-services/speech-service/context/context)
 * [Data, privacy, and security](/legal/cognitive-services/speech-service/custom-neural-voice/data-privacy-security-custom-neural-voice?context=/azure/ai-services/speech-service/context/context)
 
+## Speech - Text to speech
+
+* [Transparency note and use cases](/legal/cognitive-services/speech-service/text-to-speech/transparency-note?context=/azure/ai-services/speech-service/context/context)
+
 ## Speech - Speech to text
 
 * [Transparency note and use cases](/legal/cognitive-services/speech-service/speech-to-text/transparency-note?context=/azure/ai-services/speech-service/context/context)
diff --git a/articles/ai-services/speech-service/batch-transcription-audio-data.md b/articles/ai-services/speech-service/batch-transcription-audio-data.md
@@ -25,17 +25,27 @@ Audio files that are stored in Azure Blob storage can be accessed via one of two
 
 You can specify one or multiple audio files when creating a transcription. We recommend that you provide multiple files per request or point to an Azure Blob storage container with the audio files to transcribe. The batch transcription service can handle a large number of submitted transcriptions. The service transcribes the files concurrently, which reduces the turnaround time. 
 
-## Supported audio formats
+## Supported audio formats and codecs
 
-The batch transcription API supports the following formats:
+The batch transcription API supports many formats and codecs, such as:
 
-| Format | Codec | Bits per sample | Sample rate             |
-|--------|-------|---------|---------------------------------|
-| WAV    | PCM   | 16-bit  | 8 kHz or 16 kHz, mono or stereo |
-| MP3    | PCM   | 16-bit  | 8 kHz or 16 kHz, mono or stereo |
-| OGG    | OPUS  | 16-bit  | 8 kHz or 16 kHz, mono or stereo |
+- WAV
+- MP3
+- OPUS/OGG
+- AAC
+- FLAC
+- WMA
+- ALAW in WAV container
+- MULAW in WAV container
+- AMR
+- WebM
+- MP4
+- M4A
+- SPEEX
 
-For stereo audio streams, the left and right channels are split during the transcription. A JSON result file is created for each input audio file. To create an ordered final transcript, use the timestamps that are generated per utterance.
+
+> [!NOTE]
+> The batch transcription service integrates GStreamer and might accept more formats and codecs without returning errors. To ensure the best transcription quality, we suggest that you use lossless formats such as WAV (PCM encoding) and FLAC.
 
 ## Azure Blob Storage upload
 
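Review note on the format list in batch-transcription-audio-data.md: a caller might want to pre-check local files before uploading them for batch transcription. The following is a minimal sketch, not part of this change. The extension sets and the `classify_audio_file` helper are hypothetical: the documentation lists formats and codecs, not file extensions, so the mapping is an assumption. An extension also doesn't reveal the codec inside a container; a `.wav` file might hold ALAW or MULAW rather than PCM.

```python
from pathlib import Path

# Assumption: typical file extensions for the formats named in the list.
# The documentation itself names formats and codecs, not extensions.
LISTED_EXTENSIONS = {
    ".wav", ".mp3", ".ogg", ".opus", ".aac", ".flac",
    ".wma", ".amr", ".webm", ".mp4", ".m4a", ".spx",
}

# Assumption: .wav is treated as PCM here. The note recommends
# lossless input such as WAV (PCM encoding) and FLAC.
LOSSLESS_EXTENSIONS = {".wav", ".flac"}


def classify_audio_file(path: str) -> str:
    """Return 'preferred', 'listed', or 'unlisted' based on the extension only."""
    suffix = Path(path).suffix.lower()
    if suffix in LOSSLESS_EXTENSIONS:
        return "preferred"
    if suffix in LISTED_EXTENSIONS:
        return "listed"
    # The service might still accept the file through GStreamer,
    # but the format is not in the documented list.
    return "unlisted"
```

A caller could upload `preferred` and `listed` files directly and flag `unlisted` files for manual review instead of rejecting them outright.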
diff --git a/articles/ai-services/speech-service/includes/language-support/personal-voice.md b/articles/ai-services/speech-service/includes/language-support/personal-voice.md
@@ -0,0 +1,103 @@
+---
+author: eric-urban
+ms.service: azure-ai-speech
+ms.date: 11/15/2023
+ms.topic: include
+ms.author: eur
+---
+
+|Locale (BCP-47)|Language|
+|-----------------|--------------------------------|
+|af-ZA|Afrikaans (South Africa)|
+|am-ET|Amharic (Ethiopia)|
+|ar-EG|Arabic (Egypt)|
+|ar-SA|Arabic (Saudi Arabia)|
+|az-AZ|Azerbaijani (Latin, Azerbaijan)|
+|bg-BG|Bulgarian (Bulgaria)|
+|bn-BD|Bangla (Bangladesh)|
+|bn-IN|Bengali (India)|
+|bs-BA|Bosnian (Bosnia and Herzegovina)|
+|ca-ES|Catalan|
+|cs-CZ|Czech (Czechia)|
+|cy-GB|Welsh (United Kingdom)|
+|da-DK|Danish (Denmark)|
+|de-AT|German (Austria)|
+|de-CH|German (Switzerland)|
+|de-DE<sup>1</sup>|German (Germany)|
+|el-GR|Greek (Greece)|
+|en-AU<sup>1</sup>|English (Australia)|
+|en-CA<sup>1</sup>|English (Canada)|
+|en-GB<sup>1</sup>|English (United Kingdom)|
+|en-HK|English (Hong Kong SAR)|
+|en-IE|English (Ireland)|
+|en-IN|English (India)|
+|en-US<sup>1</sup>|English (United States)|
+|es-ES<sup>1</sup>|Spanish (Spain)|
+|es-MX<sup>1</sup>|Spanish (Mexico)|
+|et-EE|Estonian (Estonia)|
+|eu-ES|Basque|
+|fa-IR|Persian (Iran)|
+|fi-FI|Finnish (Finland)|
+|fil-PH|Filipino (Philippines)|
+|fr-BE|French (Belgium)|
+|fr-CA|French (Canada)|
+|fr-CH|French (Switzerland)|
+|fr-FR<sup>1</sup>|French (France)|
+|ga-IE|Irish (Ireland)|
+|gl-ES|Galician|
+|he-IL|Hebrew (Israel)|
+|hi-IN|Hindi (India)|
+|hr-HR|Croatian (Croatia)|
+|hu-HU|Hungarian (Hungary)|
+|hy-AM|Armenian (Armenia)|
+|id-ID|Indonesian (Indonesia)|
+|is-IS|Icelandic (Iceland)|
+|it-IT|Italian (Italy)|
+|ja-JP|Japanese (Japan)|
+|jv-ID|Javanese (Latin, Indonesia)|
+|ka-GE|Georgian (Georgia)|
+|kk-KZ|Kazakh (Kazakhstan)|
+|km-KH|Khmer (Cambodia)|
+|kn-IN|Kannada (India)|
+|ko-KR|Korean (Korea)|
+|lo-LA|Lao (Laos)|
+|lt-LT|Lithuanian (Lithuania)|
+|lv-LV|Latvian (Latvia)|
+|mk-MK|Macedonian (North Macedonia)|
+|ml-IN|Malayalam (India)|
+|mn-MN|Mongolian (Mongolia)|
+|ms-MY|Malay (Malaysia)|
+|mt-MT|Maltese (Malta)|
+|my-MM|Burmese (Myanmar)|
+|nb-NO|Norwegian Bokmål (Norway)|
+|ne-NP|Nepali (Nepal)|
+|nl-BE|Dutch (Belgium)|
+|nl-NL|Dutch (Netherlands)|
+|pl-PL|Polish (Poland)|
+|ps-AF|Pashto (Afghanistan)|
+|pt-BR|Portuguese (Brazil)|
+|pt-PT|Portuguese (Portugal)|
+|ro-RO|Romanian (Romania)|
+|ru-RU|Russian (Russia)|
+|si-LK|Sinhala (Sri Lanka)|
+|sl-SI|Slovenian (Slovenia)|
+|so-SO|Somali (Somalia)|
+|sq-AL|Albanian (Albania)|
+|sr-RS|Serbian (Cyrillic, Serbia)|
+|su-ID|Sundanese (Indonesia)|
+|sv-SE|Swedish (Sweden)|
+|sw-KE|Swahili (Kenya)|
+|ta-IN|Tamil (India)|
+|te-IN|Telugu (India)|
+|th-TH|Thai (Thailand)|
+|tr-TR|Turkish (Turkey)|
+|uk-UA|Ukrainian (Ukraine)|
+|ur-PK|Urdu (Pakistan)|
+|uz-UZ|Uzbek (Latin, Uzbekistan)|
+|vi-VN|Vietnamese (Vietnam)|
+|zh-CN|Chinese (Mandarin, Simplified)|
+|zh-HK|Chinese (Cantonese, Traditional)|
+|zh-TW|Chinese (Taiwanese Mandarin, Traditional)|
+|zu-ZA|Zulu (South Africa)|
+
+<sup>1</sup> You can try this locale in the Speech Studio [personal voice demo](https://speech.microsoft.com). 
diff --git a/articles/ai-services/speech-service/includes/release-notes/release-notes-stt.md b/articles/ai-services/speech-service/includes/release-notes/release-notes-stt.md
@@ -8,9 +8,28 @@ ms.author: eur
 
 ### November 2023 release
 
+#### Speech to text models update
+
+We're excited to introduce a significant update to our speech models, promising enhanced accuracy, improved readability, and refined entity recognition. This upgrade comes with a robust new structure, bolstered by an expanded training dataset, ensuring a marked advancement in overall performance. It includes newly released models for en-US, zh-CN, ja-JP, it-IT, pt-BR, es-MX, es-ES, fr-FR, de-DE, ko-KR, tr-TR, sv-SE, and he-IL.
+
+Highlights:
+- Better accuracy with new model structure: The redefined model structure, coupled with a richer training dataset, elevates accuracy levels, promising more precise speech output.
+- Readability improvement: Our latest model brings a substantial boost to readability, enhancing the coherence and clarity of spoken content.
+- Advanced entity recognition: Entity recognition receives a substantial upgrade, resulting in more accurate and nuanced results.
+
+Potential impacts: Despite these advancements, be mindful of the following:
+- Custom Silence Timeout Feature: Users employing custom silence timeout, especially with low settings, might encounter over-segmentation and potential omissions of single-word phrases.
+- The new model might exhibit compatibility issues with the Keyword prefix feature, and users are advised to assess its performance in their specific applications.
+- Reduced disfluency words or phrases: Users might notice a reduction in disfluency words or phrases like "um" or "uh" in the speech output.
+- Inaccuracies in word timestamp duration: Some disfluency words might display inaccuracies in timestamp duration, requiring attention in applications dependent on precise timing.
+- Confidence score distribution variance: Users relying on confidence scores and associated thresholds should be aware of potential variations in distribution, necessitating adjustments for optimal performance.
+- The accuracy enhancement of the phrase list feature might be affected by the misrecognition of certain phrases.
+
+We encourage you to explore these improvements and consider potential issues for a seamless transition, and as always, your feedback is instrumental in refining and advancing our services.
+
 #### Pronunciation Assessment
 
-- Speech [Pronunciation Assessment](../../how-to-pronunciation-assessment.md) now supports 18 languages generally available, with 6 additional languages available in public preview. For more information, see the full [language list for Pronunciation Assessment](../../language-support.md?tabs=pronunciation-assessment).
+- Speech [Pronunciation Assessment](../../how-to-pronunciation-assessment.md) now supports 18 generally available languages, with six more languages available in public preview. For more information, see the full [language list for Pronunciation Assessment](../../language-support.md?tabs=pronunciation-assessment).
 
   | Language | Locale (BCP-47) | 
   |--|--|
@@ -41,7 +60,7 @@ ms.author: eur
 
   <sup>1</sup> The language is in public preview for pronunciation assessment.
 
-- We are excited to announce that Pronunciation Assessment is introducing new features starting November 1, 2023: Prosody, Grammar, Vocabulary, and Topic. These enhancements aim to provide an even more comprehensive language learning experience for both reading and speaking assessments. Explore further details in the [How to use pronunciation assessment](../../how-to-pronunciation-assessment.md) and [Pronunciation assessment in Speech Studio](../../pronunciation-assessment-tool.md).
+- We're excited to announce that Pronunciation Assessment is introducing new features starting November 1, 2023: Prosody, Grammar, Vocabulary, and Topic. These enhancements aim to provide an even more comprehensive language learning experience for both reading and speaking assessments. Explore further details in the [How to use pronunciation assessment](../../how-to-pronunciation-assessment.md) and [Pronunciation assessment in Speech Studio](../../pronunciation-assessment-tool.md).
 
 ### September 2023 release
 
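Review note on the "Confidence score distribution variance" item in release-notes-stt.md: the release note says thresholds might need adjusting but doesn't say how. One possible approach, sketched here under assumptions, is to score the same validation recordings with the old and new models and choose a new threshold that keeps the acceptance rate roughly unchanged. The `matched_threshold` function and its inputs are hypothetical and aren't part of the Speech SDK.

```python
from typing import Sequence


def matched_threshold(
    old_scores: Sequence[float],
    new_scores: Sequence[float],
    old_threshold: float,
) -> float:
    """Pick a threshold for the new model that accepts roughly the same
    fraction of results as old_threshold accepted with the old model.

    A result counts as accepted when its score is >= the threshold.
    """
    if not old_scores or not new_scores:
        raise ValueError("both score lists must be non-empty")

    # Fraction of old results that fell below the old threshold.
    rejected = sum(s < old_threshold for s in old_scores) / len(old_scores)

    # Reject about the same fraction of the new scores, lowest first.
    ranked = sorted(new_scores)
    cut = min(int(round(rejected * len(ranked))), len(ranked) - 1)
    return ranked[cut]
```

This only preserves the acceptance rate. Whether the accepted results are still correct at that rate needs a separate check against labeled data.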
diff --git a/articles/ai-services/speech-service/includes/release-notes/release-notes-tts.md b/articles/ai-services/speech-service/includes/release-notes/release-notes-tts.md
@@ -8,9 +8,46 @@ ms.author: eur
 
 ### November 2023 release
 
+#### Personal voice
+
+Personal voice is available in preview in the following regions: West Europe, East US, and Southeast Asia. With personal voice (preview), you can get an AI-generated replication of your voice (or the voices of your application's users) in a few seconds. You provide a one-minute speech sample as the audio prompt, and then use it to generate speech in any of the more than 90 languages supported across more than 100 locales.
+
+For more information, see [personal voice](../../personal-voice-overview.md).
+
+#### Text to speech avatar
+
+Text to speech avatar is available in preview in the following regions: West US 2, West Europe, and Southeast Asia. 
+
+Text to speech avatar converts text into a digital video of a photorealistic human (either a prebuilt avatar or a [custom text to speech avatar](../../text-to-speech-avatar/what-is-custom-text-to-speech-avatar.md)) speaking with a natural-sounding voice. The text to speech avatar video can be synthesized asynchronously or in real time. Developers can build applications integrated with text to speech avatar through an API, or use a content creation tool on Speech Studio to create video content without coding.
+
+For more information, see [text to speech avatar](../../text-to-speech-avatar/what-is-text-to-speech-avatar.md), [transparency notes](/legal/cognitive-services/speech-service/text-to-speech/transparency-note?context=/azure/ai-services/speech-service/context/context), and [disclosure for voice and avatar talent](/legal/cognitive-services/speech-service/disclosure-voice-talent?context=/azure/ai-services/speech-service/context/context).
+
 #### Custom neural voice
 
-- Added support for the 24 new locales for cross-lingual voice. See the [full language list](../../language-support.md?tabs=tts#custom-neural-voice) for more information.
+Added support for 24 new locales for cross-lingual voice. See the [full language list](../../language-support.md?tabs=tts#custom-neural-voice) for more information.
+
+#### Prebuilt neural voice
+Introducing new voices for public preview:
+
+| Locale (BCP-47) | Language | Text to speech voices |
+| ----- | ----- | ----- |
+| `de-DE` | German (Germany) | `SeraphinaNeural` (Female) |
+| `es-ES` | Spanish (Spain) | `XimenaNeural` (Female) |
+| `fr-CA` | French (Canada) | `ThierryNeural` (Male) |
+| `fr-FR` | French (France) | `VivienneNeural` (Female) |
+| `it-IT` | Italian (Italy) | `GiuseppeNeural` (Male) |
+| `ko-KR` | Korean (Korea) | `HyunsuNeural` (Male) |
+| `pt-BR` | Portuguese (Brazil) | `ThalitaNeural` (Female) |
+
+Models updated with bug fixes and quality improvements:
+
+| Locale (BCP-47) | Language | Text to speech voices |
+| ----- | ----- | ----- |
+| `es-ES` | Spanish (Spain) | `AlvaroNeural` (Male) |
+| `en-GB` | English (United Kingdom) | `RyanNeural` (Male) |
+| `ko-KR` | Korean (Korea) | `InjoonNeural` (Male) |
+
+See the [full language and voice list](../../language-support.md?tabs=tts#custom-neural-voice) for more information.
 
 ### October 2023 release
 
@@ -85,7 +122,7 @@ Introducing new features in public preview for below voices:
 #### Audio Content Creation
 
 - All prebuilt voices with speaking styles and multi-style custom voices support style degree adjustment.
-- Now you can fix the pronunciation of a word by simply speaking the word and recording it. The phonemes can be automatically recognized from your recording. The **Recognize by speaking** feature is now in public previw.
+- Now you can fix the pronunciation of a word by speaking the word and recording it. The phonemes can be automatically recognized from your recording. The **Recognize by speaking** feature is now in public preview.
 
 ### April 2023 release
 
@@ -107,7 +144,7 @@ For more information, see the [language and voice list](../../language-support.m
 
 #### New features
 
-Speech Synthesis Markup Language (SSML) has been updated to support audio effect processor elements that optimize the quality of the synthesized speech output for specific scenarios on devices. Learn more at [speech synthesis markup](../../speech-synthesis-markup-voice.md#use-voice-elements).
+Speech Synthesis Markup Language (SSML) is updated to support audio effect processor elements that optimize the quality of the synthesized speech output for specific scenarios on devices. Learn more at [speech synthesis markup](../../speech-synthesis-markup-voice.md#use-voice-elements).
 
 #### Custom neural voice
 
@@ -144,7 +181,7 @@ The following voices are now generally available. See the [full language and voi
 
 #### Batch synthesis REST API (Preview)
 
-The Batch synthesis API is currently in public preview. Once it's generally available, the Long Audio API will be deprecated. For more information, see [Migrate to batch synthesis API](../../migrate-to-batch-synthesis.md).
+The Batch synthesis API is currently in public preview. Once it's generally available, the Long Audio API will be deprecated. For more information, see [Migrate to batch synthesis API](../../migrate-to-batch-synthesis.md).
 
 ### November 2022 release
 
diff --git a/articles/ai-services/speech-service/language-support.md b/articles/ai-services/speech-service/language-support.md
@@ -46,7 +46,7 @@ To improve Speech to text recognition accuracy, customization is available for s
 
 The table in this section summarizes the locales and voices supported for Text to speech. See the table footnotes for more details.
 
-Additional remarks for Text to speech locales are included in the [Voice styles and roles](#voice-styles-and-roles), [Prebuilt neural voices](#prebuilt-neural-voices), and [Custom Neural Voice](#custom-neural-voice) sections below. 
+Additional remarks for text to speech locales are included in the [voice styles and roles](#voice-styles-and-roles), [prebuilt neural voices](#prebuilt-neural-voices), [Custom Neural Voice](#custom-neural-voice), and [personal voice](#personal-voice) sections below. 
 
 > [!TIP]
 > Check the [Voice Gallery](https://speech.microsoft.com/portal/voicegallery) and determine the right voice for your business needs. 
@@ -93,6 +93,14 @@ With the cross-lingual feature, you can transfer your custom neural voice model
 
 [!INCLUDE [Language support include](includes/language-support/tts-cnv.md)]
 
+
+### Personal voice
+
+[Personal voice](personal-voice-overview.md) is a feature that lets you create a voice that sounds like you or your users. The following table summarizes the locales supported for personal voice. 
+
+[!INCLUDE [Language support include](includes/language-support/personal-voice.md)]
+
+
 # [Pronunciation assessment](#tab/pronunciation-assessment)
 
 The table in this section summarizes the 24 locales supported for pronunciation assessment, and each language is available on all [Speech to text regions](regions.md#speech-service). Latest update extends support from English to 23 additional languages and quality enhancements to existing features, including accuracy, fluency and miscue assessment. You should specify the language that you're learning or practicing improving pronunciation. The default language is set as `en-US`. If you know your target learning language, [set the locale](how-to-pronunciation-assessment.md#get-pronunciation-assessment-results) accordingly. For example, if you're learning British English, you should specify the language as `en-GB`. If you're teaching a broader language, such as Spanish, and are uncertain about which locale to select, you can run various accent models (`es-ES`, `es-MX`) to determine the one that achieves the highest score to suit your specific scenario. 
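Review note on the pronunciation assessment paragraph in language-support.md: the advice to run several accent models and keep the one with the highest score can be sketched as follows. The `score_for` callback is hypothetical. In practice it would run pronunciation assessment on a sample recording with the given locale and return the overall score; no Speech SDK call is shown here.

```python
from typing import Callable, Iterable


def best_locale(candidates: Iterable[str], score_for: Callable[[str], float]) -> str:
    """Return the candidate locale with the highest assessment score."""
    # Score each candidate exactly once, because a real assessment call is costly.
    scores = {locale: score_for(locale) for locale in candidates}
    if not scores:
        raise ValueError("at least one candidate locale is required")
    return max(scores, key=scores.get)
```

For example, with stubbed scores of 81.5 for `es-ES` and 88.0 for `es-MX`, the function returns `es-MX`.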
diff --git a/articles/ai-services/speech-service/personal-voice-overview.md b/articles/ai-services/speech-service/personal-voice-overview.md
@@ -13,10 +13,11 @@ ms.custom: references_regions
 
 # What is personal voice (preview) for text to speech? 
 
-With personal voice (preview), you can get AI generated replication of your voice (or users of your application) in a few seconds. You provide a one-minute speech sample as the audio prompt, and then use it to generate speech in any of the more than 90 languages supported across more than locales.  
+With personal voice (preview), you can get an AI-generated replication of your voice (or the voices of your application's users) in a few seconds. You provide a one-minute speech sample as the audio prompt, and then use it to generate speech in any of the more than 90 languages supported across more than 100 locales.  
 
 > [!NOTE]
 > Personal voice is available in these regions: West Europe, East US, and South East Asia. 
+> For supported locales, see [personal voice language support](./language-support.md#personal-voice).
 
 The following table summarizes the difference between custom neural voice pro and personal voice.  
  
@@ -70,6 +71,10 @@ Here's example SSML in a request for text to speech with the voice name and the
 </speak> 
 ```
 
+### Responsible AI 
+
+We care about the people who use AI and the people who will be affected by it as much as we care about technology. For more information, see the Responsible AI [transparency notes](/legal/cognitive-services/speech-service/text-to-speech/transparency-note?context=/azure/ai-services/speech-service/context/context).
+
 ## Reference documentation
 
 The API reference documentation is made available to approved customers. You can apply for access [here](https://aka.ms/customneural).
diff --git a/articles/ai-services/speech-service/power-automate-batch-transcription.md b/articles/ai-services/speech-service/power-automate-batch-transcription.md
@@ -127,7 +127,7 @@ To trigger the test flow, upload an audio file to the Azure Blob Storage contain
 
 ## Upload files to the container
 
-Follow these steps to upload [wav, mp3, or ogg](batch-transcription-audio-data.md#supported-audio-formats) files from your local directory to the Azure Storage container that you [created previously](#create-the-azure-blob-storage-container). 
+Follow these steps to upload [wav, mp3, or ogg](batch-transcription-audio-data.md#supported-audio-formats-and-codecs) files from your local directory to the Azure Storage container that you [created previously](#create-the-azure-blob-storage-container). 
 
 1. Go to the [Azure portal](https://portal.azure.com/) and sign in to your Azure account.
 1. <a href="https://portal.azure.com/#create/Microsoft.StorageAccount-ARM"  title="Create a Storage account resource"  target="_blank">Create a Storage account resource</a> in the Azure portal. Use the same subscription and resource group as your Speech resource.
diff --git a/articles/ai-services/speech-service/sovereign-clouds.md b/articles/ai-services/speech-service/sovereign-clouds.md
@@ -34,8 +34,10 @@ Available to US government entities and their partners only. See more informatio
     - Neural voice
   - Speech translation
 - **Unsupported features:**
-  - Custom Voice
-  - Custom Commands
+  - Custom commands
+  - Custom neural voice
+  - Personal voice
+  - Text to speech avatar
 - **Supported languages:**
   - See the list of supported languages [here](language-support.md)
 
diff --git a/articles/ai-services/speech-service/speech-services-quotas-and-limits.md b/articles/ai-services/speech-service/speech-services-quotas-and-limits.md
diff --git a/articles/ai-services/speech-service/text-to-speech-avatar/what-is-text-to-speech-avatar.md b/articles/ai-services/speech-service/text-to-speech-avatar/what-is-text-to-speech-avatar.md