Commit f16cc80

Merge pull request #258741 from eric-urban/eur/tts-stt-updates
tts stt updates
2 parents 8faa32a + ca24ffc commit f16cc80

11 files changed: 211 additions & 22 deletions

articles/ai-services/responsible-use-of-ai-overview.md

Lines changed: 4 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -186,6 +186,10 @@ Azure AI services provides information and guidelines on how to responsibly use
186186
* [Code of conduct](/legal/cognitive-services/speech-service/tts-code-of-conduct?context=/azure/ai-services/speech-service/context/context)
187187
* [Data, privacy, and security](/legal/cognitive-services/speech-service/custom-neural-voice/data-privacy-security-custom-neural-voice?context=/azure/ai-services/speech-service/context/context)
188188

189+
## Speech - Text to speech
190+
191+
* [Transparency note and use cases](/legal/cognitive-services/speech-service/text-to-speech/transparency-note?context=/azure/ai-services/speech-service/context/context)
192+
189193
## Speech - Speech to text
190194

191195
* [Transparency note and use cases](/legal/cognitive-services/speech-service/speech-to-text/transparency-note?context=/azure/ai-services/speech-service/context/context)

articles/ai-services/speech-service/batch-transcription-audio-data.md

Lines changed: 18 additions & 8 deletions
@@ -25,17 +25,27 @@ Audio files that are stored in Azure Blob storage can be accessed via one of two
2525

2626
You can specify one or multiple audio files when creating a transcription. We recommend that you provide multiple files per request or point to an Azure Blob storage container with the audio files to transcribe. The batch transcription service can handle a large number of submitted transcriptions. The service transcribes the files concurrently, which reduces the turnaround time.
2727

28-
## Supported audio formats
28+
## Supported audio formats and codecs
2929

30-
The batch transcription API supports the following formats:
30+
The batch transcription API supports a number of different formats and codecs, such as:
3131

32-
| Format | Codec | Bits per sample | Sample rate |
33-
|--------|-------|---------|---------------------------------|
34-
| WAV | PCM | 16-bit | 8 kHz or 16 kHz, mono or stereo |
35-
| MP3 | PCM | 16-bit | 8 kHz or 16 kHz, mono or stereo |
36-
| OGG | OPUS | 16-bit | 8 kHz or 16 kHz, mono or stereo |
32+
- WAV
33+
- MP3
34+
- OPUS/OGG
35+
- AAC
36+
- FLAC
37+
- WMA
38+
- ALAW in WAV container
39+
- MULAW in WAV container
40+
- AMR
41+
- WebM
42+
- MP4
43+
- M4A
44+
- SPEEX
3745

38-
For stereo audio streams, the left and right channels are split during the transcription. A JSON result file is created for each input audio file. To create an ordered final transcript, use the timestamps that are generated per utterance.
46+
47+
> [!NOTE]
48+
> The batch transcription service integrates GStreamer and might accept more formats and codecs without returning errors. We suggest that you use lossless formats such as WAV (PCM encoding) and FLAC to ensure the best transcription quality.
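
As a rough illustration of the format list and the lossless recommendation above, the following Python sketch filters audio URLs by file extension and assembles a request body for a transcription job. The extension set is inferred from the listed formats (`.spx` for SPEEX is a guess), and the field names `contentUrls`, `locale`, and `displayName` are assumptions taken from the Speech to text REST API rather than from this change; check them against the API reference before relying on them.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions inferred from the listed formats. The mapping from a format name
# to a file extension is an assumption (".spx" for SPEEX in particular).
LISTED_EXTENSIONS = {
    ".wav", ".mp3", ".opus", ".ogg", ".aac", ".flac", ".wma",
    ".amr", ".webm", ".mp4", ".m4a", ".spx",
}
# ".wav" is only lossless when it holds PCM audio; ALAW and MULAW are lossy.
LOSSLESS_EXTENSIONS = {".wav", ".flac"}


def build_transcription_request(urls, locale="en-US", display_name="Batch job"):
    """Return (request_body, lossy_urls) for URLs with a listed extension."""
    accepted, lossy = [], []
    for url in urls:
        # Use only the URL path so that a SAS token in the query is ignored.
        ext = PurePosixPath(urlparse(url).path).suffix.lower()
        if ext not in LISTED_EXTENSIONS:
            continue  # extension is not in the list, so leave the file out
        accepted.append(url)
        if ext not in LOSSLESS_EXTENSIONS:
            lossy.append(url)
    # Field names are assumed from the Speech to text REST API.
    body = {"contentUrls": accepted, "locale": locale, "displayName": display_name}
    return body, lossy
```

The second return value lists the accepted files that use a lossy format, which a caller could log as a hint that transcription quality might be lower for them.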
3949
4050
## Azure Blob Storage upload
4151

Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
1+
---
2+
author: eric-urban
3+
ms.service: azure-ai-speech
4+
ms.date: 11/15/2023
5+
ms.topic: include
6+
ms.author: eur
7+
---
8+
9+
|Locale (BCP-47)|Language|
10+
|-----------------|--------------------------------|
11+
|af-ZA|Afrikaans (South Africa)|
12+
|am-ET|Amharic (Ethiopia)|
13+
|ar-EG|Arabic (Egypt)|
14+
|ar-SA|Arabic (Saudi Arabia)|
15+
|az-AZ|Azerbaijani (Latin, Azerbaijan)|
16+
|bg-BG|Bulgarian (Bulgaria)|
17+
|bn-BD|Bangla (Bangladesh)|
18+
|bn-IN|Bengali (India)|
19+
|bs-BA|Bosnian (Bosnia and Herzegovina)|
20+
|ca-ES|Catalan|
21+
|cs-CZ|Czech (Czechia)|
22+
|cy-GB|Welsh (United Kingdom)|
23+
|da-DK|Danish (Denmark)|
24+
|de-AT|German (Austria)|
25+
|de-CH|German (Switzerland)|
26+
|de-DE<sup>1</sup>|German (Germany)|
27+
|el-GR|Greek (Greece)|
28+
|en-AU<sup>1</sup>|English (Australia)|
29+
|en-CA<sup>1</sup>|English (Canada)|
30+
|en-GB<sup>1</sup>|English (United Kingdom)|
31+
|en-HK|English (Hong Kong SAR)|
32+
|en-IE|English (Ireland)|
33+
|en-IN|English (India)|
34+
|en-US<sup>1</sup>|English (United States)|
35+
|es-ES<sup>1</sup>|Spanish (Spain)|
36+
|es-MX<sup>1</sup>|Spanish (Mexico)|
37+
|et-EE|Estonian (Estonia)|
38+
|eu-ES|Basque|
39+
|fa-IR|Persian (Iran)|
40+
|fi-FI|Finnish (Finland)|
41+
|fil-PH|Filipino (Philippines)|
42+
|fr-BE|French (Belgium)|
43+
|fr-CA|French (Canada)|
44+
|fr-CH|French (Switzerland)|
45+
|fr-FR<sup>1</sup>|French (France)|
46+
|ga-IE|Irish (Ireland)|
47+
|gl-ES|Galician|
48+
|he-IL|Hebrew (Israel)|
49+
|hi-IN|Hindi (India)|
50+
|hr-HR|Croatian (Croatia)|
51+
|hu-HU|Hungarian (Hungary)|
52+
|hy-AM|Armenian (Armenia)|
53+
|id-ID|Indonesian (Indonesia)|
54+
|is-IS|Icelandic (Iceland)|
55+
|it-IT|Italian (Italy)|
56+
|ja-JP|Japanese (Japan)|
57+
|jv-ID|Javanese (Latin, Indonesia)|
58+
|ka-GE|Georgian (Georgia)|
59+
|kk-KZ|Kazakh (Kazakhstan)|
60+
|km-KH|Khmer (Cambodia)|
61+
|kn-IN|Kannada (India)|
62+
|ko-KR|Korean (Korea)|
63+
|lo-LA|Lao (Laos)|
64+
|lt-LT|Lithuanian (Lithuania)|
65+
|lv-LV|Latvian (Latvia)|
66+
|mk-MK|Macedonian (North Macedonia)|
67+
|ml-IN|Malayalam (India)|
68+
|mn-MN|Mongolian (Mongolia)|
69+
|ms-MY|Malay (Malaysia)|
70+
|mt-MT|Maltese (Malta)|
71+
|my-MM|Burmese (Myanmar)|
72+
|nb-NO|Norwegian Bokmål (Norway)|
73+
|ne-NP|Nepali (Nepal)|
74+
|nl-BE|Dutch (Belgium)|
75+
|nl-NL|Dutch (Netherlands)|
76+
|pl-PL|Polish (Poland)|
77+
|ps-AF|Pashto (Afghanistan)|
78+
|pt-BR|Portuguese (Brazil)|
79+
|pt-PT|Portuguese (Portugal)|
80+
|ro-RO|Romanian (Romania)|
81+
|ru-RU|Russian (Russia)|
82+
|si-LK|Sinhala (Sri Lanka)|
83+
|sl-SI|Slovenian (Slovenia)|
84+
|so-SO|Somali (Somalia)|
85+
|sq-AL|Albanian (Albania)|
86+
|sr-RS|Serbian (Cyrillic, Serbia)|
87+
|su-ID|Sundanese (Indonesia)|
88+
|sv-SE|Swedish (Sweden)|
89+
|sw-KE|Swahili (Kenya)|
90+
|ta-IN|Tamil (India)|
91+
|te-IN|Telugu (India)|
92+
|th-TH|Thai (Thailand)|
93+
|tr-TR|Turkish (Turkey)|
94+
|uk-UA|Ukrainian (Ukraine)|
95+
|ur-PK|Urdu (Pakistan)|
96+
|uz-UZ|Uzbek (Latin, Uzbekistan)|
97+
|vi-VN|Vietnamese (Vietnam)|
98+
|zh-CN|Chinese (Mandarin, Simplified)|
99+
|zh-HK|Chinese (Cantonese, Traditional)|
100+
|zh-TW|Chinese (Taiwanese Mandarin, Traditional)|
101+
|zu-ZA|Zulu (South Africa)|
102+
103+
<sup>1</sup> You can try this locale in the Speech Studio [personal voice demo](https://speech.microsoft.com).

articles/ai-services/speech-service/includes/release-notes/release-notes-stt.md

Lines changed: 21 additions & 2 deletions
@@ -8,9 +8,28 @@ ms.author: eur
88

99
### November 2023 release
1010

11+
#### Speech to text models update
12+
13+
We're excited to introduce a significant update to our speech models, promising enhanced accuracy, improved readability, and refined entity recognition. This upgrade comes with a robust new structure, bolstered by an expanded training dataset, ensuring a marked advancement in overall performance. It includes newly released models for en-US, zh-CN, ja-JP, it-IT, pt-BR, es-MX, es-ES, fr-FR, de-DE, ko-KR, tr-TR, sv-SE, and he-IL.
14+
15+
Highlights:
16+
- Better accuracy with new model structure: The redefined model structure, coupled with a richer training dataset, elevates accuracy levels, promising more precise speech output.
17+
- Readability improvement: Our latest model brings a substantial boost to readability, enhancing the coherence and clarity of spoken content.
18+
- Advanced entity recognition: Entity recognition receives a substantial upgrade, resulting in more accurate and nuanced results.
19+
20+
Potential impacts: Despite these advancements, it's crucial to be mindful of potential impacts:
21+
- Custom Silence Timeout Feature: Users employing custom silence timeout, especially with low settings, might encounter over-segmentation and potential omissions of single-word phrases.
22+
- The new model might exhibit compatibility issues with the Keyword prefix feature, and users are advised to assess its performance in their specific applications.
23+
- Reduced disfluency words or phrases: Users might notice a reduction in disfluency words or phrases like "um" or "uh" in the speech output.
24+
- Inaccuracies in word timestamp duration: Some disfluency words might display inaccuracies in timestamp duration, requiring attention in applications dependent on precise timing.
25+
- Confidence score distribution variance: Users relying on confidence scores and associated thresholds should be aware of potential variations in distribution, necessitating adjustments for optimal performance.
26+
- The accuracy enhancement of the phrase list feature might be affected by the misrecognition of certain phrases.
27+
28+
We encourage you to explore these improvements and consider potential issues for a seamless transition, and as always, your feedback is instrumental in refining and advancing our services.
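
The note on confidence score distribution can be made concrete with a small Python sketch. It assumes you have confidence scores for the same validation audio from the previous and the updated model; the function names and the idea of matching the old acceptance rate are illustrative choices, not guidance from the release notes.

```python
def acceptance_rate(scores, threshold):
    """Fraction of scores at or above the threshold (scores must be non-empty)."""
    return sum(s >= threshold for s in scores) / len(scores)


def matching_threshold(old_scores, new_scores, old_threshold):
    """Pick a threshold for new_scores that accepts roughly the same fraction
    of results as old_threshold accepted on old_scores."""
    target = acceptance_rate(old_scores, old_threshold)
    keep = round(target * len(new_scores))  # how many new results to accept
    if keep == 0:
        return float("inf")  # nothing was accepted before, so accept nothing now
    ranked = sorted(new_scores, reverse=True)
    # The lowest score among the top `keep` results becomes the new threshold.
    return ranked[keep - 1]
```

Tied scores can push the acceptance rate slightly above the target, so treat the result as a starting point for manual review rather than a final value.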
29+
1130
#### Pronunciation Assessment
1231

13-
- Speech [Pronunciation Assessment](../../how-to-pronunciation-assessment.md) now supports 18 languages generally available, with 6 additional languages available in public preview. For more information, see the full [language list for Pronunciation Assessment](../../language-support.md?tabs=pronunciation-assessment).
32+
- Speech [Pronunciation Assessment](../../how-to-pronunciation-assessment.md) now supports 18 languages generally available, with six more languages available in public preview. For more information, see the full [language list for Pronunciation Assessment](../../language-support.md?tabs=pronunciation-assessment).
1433

1534
| Language | Locale (BCP-47) |
1635
|--|--|
@@ -41,7 +60,7 @@ ms.author: eur
4160

4261
<sup>1</sup> The language is in public preview for pronunciation assessment.
4362

44-
- We are excited to announce that Pronunciation Assessment is introducing new features starting November 1, 2023: Prosody, Grammar, Vocabulary, and Topic. These enhancements aim to provide an even more comprehensive language learning experience for both reading and speaking assessments. Explore further details in the [How to use pronunciation assessment](../../how-to-pronunciation-assessment.md) and [Pronunciation assessment in Speech Studio](../../pronunciation-assessment-tool.md).
63+
- We're excited to announce that Pronunciation Assessment is introducing new features starting November 1, 2023: Prosody, Grammar, Vocabulary, and Topic. These enhancements aim to provide an even more comprehensive language learning experience for both reading and speaking assessments. Explore further details in the [How to use pronunciation assessment](../../how-to-pronunciation-assessment.md) and [Pronunciation assessment in Speech Studio](../../pronunciation-assessment-tool.md).
4564

4665
### September 2023 release
4766

articles/ai-services/speech-service/includes/release-notes/release-notes-tts.md

Lines changed: 41 additions & 4 deletions
@@ -8,9 +8,46 @@ ms.author: eur
88

99
### November 2023 release
1010

11+
#### Personal voice
12+
13+
Personal voice is available in preview in the following regions: West Europe, East US, and Southeast Asia. With personal voice (preview), you can get an AI-generated replication of your voice (or the voices of your application's users) in a few seconds. You provide a one-minute speech sample as the audio prompt, and then use it to generate speech in any of the more than 90 languages supported across more than 100 locales.
14+
15+
For more information, see [personal voice](../../personal-voice-overview.md).
16+
17+
#### Text to speech avatar
18+
19+
Text to speech avatar is available in preview in the following regions: West US 2, West Europe, and Southeast Asia.
20+
21+
Text to speech avatar converts text into a digital video of a photorealistic human (either a prebuilt avatar or a [custom text to speech avatar](../../text-to-speech-avatar/what-is-custom-text-to-speech-avatar.md)) speaking with a natural-sounding voice. The text to speech avatar video can be synthesized asynchronously or in real time. Developers can build applications integrated with text to speech avatar through an API, or use a content creation tool on Speech Studio to create video content without coding.
22+
23+
For more information, see [text to speech avatar](../../text-to-speech-avatar/what-is-text-to-speech-avatar.md), [transparency notes](/legal/cognitive-services/speech-service/text-to-speech/transparency-note?context=/azure/ai-services/speech-service/context/context), and [disclosure for voice and avatar talent](/legal/cognitive-services/speech-service/disclosure-voice-talent?context=/azure/ai-services/speech-service/context/context).
24+
1125
#### Custom neural voice
1226

13-
- Added support for the 24 new locales for cross-lingual voice. See the [full language list](../../language-support.md?tabs=tts#custom-neural-voice) for more information.
27+
Added support for 24 new locales for cross-lingual voice. For more information, see the [full language list](../../language-support.md?tabs=tts#custom-neural-voice).
28+
29+
#### Prebuilt neural voice
30+
Introducing new voices for public preview:
31+
32+
| Locale (BCP-47) | Language | Text to speech voices |
33+
| ----- | ----- | ----- |
34+
| `de-DE` | German (Germany) | `SeraphinaNeural` (Female) |
35+
| `es-ES` | Spanish (Spain) | `XimenaNeural` (Female) |
36+
| `fr-CA` | French (Canada) | `ThierryNeural` (Male) |
37+
| `fr-FR` | French (France) | `VivienneNeural` (Female) |
38+
| `it-IT` | Italian (Italy) | `GiuseppeNeural` (Male) |
39+
| `ko-KR` | Korean (Korea) | `HyunsuNeural` (Male) |
40+
| `pt-BR` | Portuguese (Brazil) | `ThalitaNeural` (Female) |
41+
42+
Models updated with bug fixes and quality improvements:
43+
44+
| Locale (BCP-47) | Language | Text to speech voices |
45+
| ----- | ----- | ----- |
46+
| `es-ES` | Spanish (Spain) | `AlvaroNeural` (Male) |
47+
| `en-GB` | English (United Kingdom) | `RyanNeural` (Male) |
48+
| `ko-KR` | Korean (Korea) | `InjoonNeural` (Male) |
49+
50+
See the [full language and voice list](../../language-support.md?tabs=tts#custom-neural-voice) for more information.
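
To show how a voice from the tables above might be referenced, here is a minimal Python sketch that builds an SSML document. It assumes the full voice name follows the `<locale>-<voice>` pattern (for example `de-DE-SeraphinaNeural`), which this change does not state; confirm the exact name in the voice list. The helper `build_ssml` is hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(locale, voice, text):
    """Build a minimal SSML document for one voice.

    Assumes the full voice name is "<locale>-<voice>"; verify it against
    the published voice list before use.
    """
    full_name = f"{locale}-{voice}"
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(locale)}>"
        f"<voice name={quoteattr(full_name)}>{escape(text)}</voice>"
        "</speak>"
    )


ssml = build_ssml("de-DE", "SeraphinaNeural", "Guten Tag & willkommen")
```

Escaping the text keeps characters such as `&` from breaking the XML, which matters when the input comes from users.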
1451

1552
### October 2023 release
1653

@@ -85,7 +122,7 @@ Introducing new features in public preview for below voices:
85122
#### Audio Content Creation
86123

87124
- All prebuilt voices with speaking styles and multi-style custom voices support style degree adjustment.
88-
- Now you can fix the pronunciation of a word by simply speaking the word and recording it. The phonemes can be automatically recognized from your recording. The **Recognize by speaking** feature is now in public previw.
125+
- Now you can fix the pronunciation of a word by speaking the word and recording it. The phonemes can be automatically recognized from your recording. The **Recognize by speaking** feature is now in public preview.
89126

90127
### April 2023 release
91128

@@ -107,7 +144,7 @@ For more information, see the [language and voice list](../../language-support.m
107144

108145
#### New features
109146

110-
Speech Synthesis Markup Language (SSML) has been updated to support audio effect processor elements that optimize the quality of the synthesized speech output for specific scenarios on devices. Learn more at [speech synthesis markup](../../speech-synthesis-markup-voice.md#use-voice-elements).
147+
Speech Synthesis Markup Language (SSML) is updated to support audio effect processor elements that optimize the quality of the synthesized speech output for specific scenarios on devices. Learn more at [speech synthesis markup](../../speech-synthesis-markup-voice.md#use-voice-elements).
111148

112149
#### Custom neural voice
113150

@@ -144,7 +181,7 @@ The following voices are now generally available. See the [full language and voi
144181

145182
#### Batch synthesis REST API (Preview)
146183

147-
The Batch synthesis API is currently in public preview. Once it's generally available, the Long Audio API will be deprecated. For more information, see [Migrate to batch synthesis API](../../migrate-to-batch-synthesis.md).
184+
The Batch synthesis API is currently in public preview. Once it's generally available, the Long Audio API is deprecated. For more information, see [Migrate to batch synthesis API](../../migrate-to-batch-synthesis.md).
148185

149186
### November 2022 release
150187

articles/ai-services/speech-service/language-support.md

Lines changed: 9 additions & 1 deletion
@@ -46,7 +46,7 @@ To improve Speech to text recognition accuracy, customization is available for s
4646

4747
The table in this section summarizes the locales and voices supported for Text to speech. See the table footnotes for more details.
4848

49-
Additional remarks for Text to speech locales are included in the [Voice styles and roles](#voice-styles-and-roles), [Prebuilt neural voices](#prebuilt-neural-voices), and [Custom Neural Voice](#custom-neural-voice) sections below.
49+
Additional remarks for text to speech locales are included in the [voice styles and roles](#voice-styles-and-roles), [prebuilt neural voices](#prebuilt-neural-voices), [Custom Neural Voice](#custom-neural-voice), and [personal voice](#personal-voice) sections below.
5050

5151
> [!TIP]
5252
> Check the [Voice Gallery](https://speech.microsoft.com/portal/voicegallery) and determine the right voice for your business needs.
@@ -93,6 +93,14 @@ With the cross-lingual feature, you can transfer your custom neural voice model
9393

9494
[!INCLUDE [Language support include](includes/language-support/tts-cnv.md)]
9595

96+
97+
### Personal voice
98+
99+
[Personal voice](personal-voice-overview.md) is a feature that lets you create a voice that sounds like you or your users. The following table summarizes the locales supported for personal voice.
100+
101+
[!INCLUDE [Language support include](includes/language-support/personal-voice.md)]
102+
103+
96104
# [Pronunciation assessment](#tab/pronunciation-assessment)
97105

98106
The table in this section summarizes the 24 locales supported for pronunciation assessment, and each language is available on all [Speech to text regions](regions.md#speech-service). Latest update extends support from English to 23 additional languages and quality enhancements to existing features, including accuracy, fluency and miscue assessment. You should specify the language that you're learning or practicing improving pronunciation. The default language is set as `en-US`. If you know your target learning language, [set the locale](how-to-pronunciation-assessment.md#get-pronunciation-assessment-results) accordingly. For example, if you're learning British English, you should specify the language as `en-GB`. If you're teaching a broader language, such as Spanish, and are uncertain about which locale to select, you can run various accent models (`es-ES`, `es-MX`) to determine the one that achieves the highest score to suit your specific scenario.
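
The locale comparison described above can be sketched in a few lines of Python. The scores are presumed to come from running pronunciation assessment once per candidate locale on the same recording; the helper name and the sample values are made up for illustration.

```python
def best_locale(scores_by_locale):
    """Return the (locale, score) pair with the highest score."""
    if not scores_by_locale:
        raise ValueError("at least one locale score is required")
    locale = max(scores_by_locale, key=scores_by_locale.get)
    return locale, scores_by_locale[locale]


# Hypothetical scores for one recording assessed with two accent models.
choice = best_locale({"es-ES": 78.5, "es-MX": 84.0})
```

With these sample values the helper selects `es-MX`; in practice you would probably average scores over several recordings before settling on a locale.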

articles/ai-services/speech-service/personal-voice-overview.md

Lines changed: 6 additions & 1 deletion
@@ -13,10 +13,11 @@ ms.custom: references_regions
1313

1414
# What is personal voice (preview) for text to speech?
1515

16-
With personal voice (preview), you can get AI generated replication of your voice (or users of your application) in a few seconds. You provide a one-minute speech sample as the audio prompt, and then use it to generate speech in any of the more than 90 languages supported across more than locales.
16+
With personal voice (preview), you can get AI generated replication of your voice (or users of your application) in a few seconds. You provide a one-minute speech sample as the audio prompt, and then use it to generate speech in any of the more than 90 languages supported across more than 100 locales.
1717

1818
> [!NOTE]
1919
> Personal voice is available in these regions: West Europe, East US, and South East Asia.
20+
> For supported locales, see [personal voice language support](./language-support.md#personal-voice).
2021
2122
The following table summarizes the difference between custom neural voice pro and personal voice.
2223

@@ -70,6 +71,10 @@ Here's example SSML in a request for text to speech with the voice name and the
7071
</speak>
7172
```
7273

74+
### Responsible AI
75+
76+
We care about the people who use AI and the people who will be affected by it as much as we care about technology. For more information, see the Responsible AI [transparency notes](/legal/cognitive-services/speech-service/text-to-speech/transparency-note?context=/azure/ai-services/speech-service/context/context).
77+
7378
## Reference documentation
7479

7580
The API reference documentation is made available to approved customers. You can apply for access [here](https://aka.ms/customneural).

articles/ai-services/speech-service/power-automate-batch-transcription.md

Lines changed: 1 addition & 1 deletion
@@ -127,7 +127,7 @@ To trigger the test flow, upload an audio file to the Azure Blob Storage contain
127127

128128
## Upload files to the container
129129

130-
Follow these steps to upload [wav, mp3, or ogg](batch-transcription-audio-data.md#supported-audio-formats) files from your local directory to the Azure Storage container that you [created previously](#create-the-azure-blob-storage-container).
130+
Follow these steps to upload [wav, mp3, or ogg](batch-transcription-audio-data.md#supported-audio-formats-and-codecs) files from your local directory to the Azure Storage container that you [created previously](#create-the-azure-blob-storage-container).
131131

132132
1. Go to the [Azure portal](https://portal.azure.com/) and sign in to your Azure account.
133133
1. <a href="https://portal.azure.com/#create/Microsoft.StorageAccount-ARM" title="Create a Storage account resource" target="_blank">Create a Storage account resource</a> in the Azure portal. Use the same subscription and resource group as your Speech resource.

articles/ai-services/speech-service/sovereign-clouds.md

Lines changed: 4 additions & 2 deletions
@@ -34,8 +34,10 @@ Available to US government entities and their partners only. See more informatio
3434
- Neural voice
3535
- Speech translation
3636
- **Unsupported features:**
37-
- Custom Voice
38-
- Custom Commands
37+
- Custom commands
38+
- Custom neural voice
39+
- Personal voice
40+
- Text to speech avatar
3941
- **Supported languages:**
4042
- See the list of supported languages [here](language-support.md)
4143
