Commit bcf2286

Merge pull request #4926 from eric-urban/eur/ai-speech-cnv-foundry-1
CNV in AI Foundry and HD training in Speech Studio
2 parents f5de16b + 146f512 commit bcf2286

36 files changed

+770
-62
lines changed

articles/ai-services/speech-service/high-definition-voices.md

Lines changed: 2 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -37,8 +37,8 @@ Here's a comparison of features between Azure AI Speech HD voices, Azure OpenAI
3737
| Feature | Azure AI Speech HD voices | Azure OpenAI HD voices | Azure AI Speech voices (not HD) |
3838
|---------|---------------|------------------------|------------------------|
3939
| **Region** | East US, Southeast Asia, West Europe | North Central US, Sweden Central | Available in dozens of regions. See the [region list](regions.md#regions).|
40-
| **Number of voices** | 12 | 6 | More than 500 |
41-
| **Multilingual** | No (perform on primary language only) | Yes | Yes (applicable only to multilingual voices) |
40+
| **Number of voices** | 30 | 6 | More than 500 |
41+
| **Multilingual** | Yes | Yes | Yes (applicable only to multilingual voices) |
4242
| **SSML support** | Support for [a subset of SSML elements](#supported-and-unsupported-ssml-elements-for-azure-ai-speech-hd-voices).| Support for [a subset of SSML elements](openai-voices.md#ssml-elements-supported-by-openai-text-to-speech-voices-in-azure-ai-speech). | Support for the [full set of SSML](speech-synthesis-markup-structure.md) in Azure AI Speech. |
4343
| **Development options** | Speech SDK, Speech CLI, REST API | Speech SDK, Speech CLI, REST API | Speech SDK, Speech CLI, REST API |
4444
| **Deployment options** | Cloud only | Cloud only | Cloud, embedded, hybrid, and containers. |
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
1+
---
2+
title: include file
3+
description: include file
4+
author: eric-urban
5+
ms.author: eur
6+
ms.service: azure-ai-speech
7+
ms.topic: include
8+
ms.date: 5/19/2025
9+
ms.custom: include
10+
---
11+
12+
A voice talent is an individual or target speaker whose voice is recorded and used to create neural voice models.
13+
14+
Before you can train a neural voice, you must submit a recording of the voice talent's consent statement. The voice talent statement is a recording of the voice talent reading a statement that they consent to the usage of their speech data to train a custom voice model. The consent statement is also used to verify that the voice talent is the same person as the speaker in the training data.
15+
16+
> [!TIP]
17+
> Before you get started in Azure AI Foundry portal, define your voice [persona and choose the right voice talent](../../../../record-custom-voice-samples.md#choose-your-voice-talent).
18+
19+
You can find the verbal consent statement in multiple languages on [GitHub](https://github.com/Azure-Samples/Cognitive-Speech-TTS/blob/master/CustomVoice/script/verbal-statement-all-locales.txt). The language of the verbal statement must be the same as your recording. See also the [disclosure for voice talent](/legal/cognitive-services/speech-service/disclosure-voice-talent?context=/azure/ai-services/speech-service/context/context).
20+
21+
## Add voice talent
22+
23+
> [!TIP]
24+
> For a sample consent statement and training data, see the [GitHub repository](https://github.com/Azure-Samples/Cognitive-Speech-TTS/tree/master/CustomVoice/Sample%20Data).
25+
26+
To add a voice talent profile and upload their consent statement, follow these steps:
27+
28+
1. Sign in to the [Azure AI Foundry portal](https://ai.azure.com).
29+
1. Select **Fine-tuning** from the left pane and then select **AI Service fine-tuning**.
30+
1. Select the professional voice fine-tuning task (by model name) that you [started as described in the create professional voice article](/azure/ai-services/speech-service/professional-voice-create-project).
31+
1. Select **Set up voice talent** > **+ Add voice talent**.
32+
1. In the **Add new voice talent** wizard, select the target scenarios for the voice talent. The target scenarios must be consistent with what you provided in the application form. The scenarios are used to help identify the voice talent and to ensure that the voice model is trained for the intended use cases.
33+
1. Optionally, in the **Voice characteristics** text box, enter a description of the characteristics of the voice you're going to create.
34+
1. Select **Next**.
35+
1. On the **Upload verbal statement** page, follow the instructions to upload the voice talent statement you've recorded beforehand.
36+
37+
- Enter the voice talent name and company name. The voice talent name must be the name of the person who recorded the consent statement. Enter the name in the same language used in the recorded statement. The company name must match the company name that was spoken in the recorded statement. Ensure the company name is entered in the same language as the recorded statement.
38+
- Make sure the verbal statement was [recorded](../../../../record-custom-voice-samples.md) with the same settings, environment, and speaking style as your training data.
39+
40+
:::image type="content" source="../../../../media/custom-voice/professional-voice/upload-verbal-statement.png" alt-text="Screenshot of the voice talent statement upload dialog." lightbox="../../../../media/custom-voice/professional-voice/fine-tune-azure-ai-services.png":::
41+
42+
1. Select **Next**.
43+
1. Review the voice talent and persona details, and select **Add voice talent**.
44+
45+
After the voice talent status is *Succeeded*, you can [add training data](../../../../professional-voice-create-training-set.md).
46+
47+
## Next steps
48+
49+
> [!div class="nextstepaction"]
50+
> [Add training data for professional voice fine-tuning](../../../../professional-voice-create-training-set.md)
51+

articles/ai-services/speech-service/includes/how-to/professional-voice/create-consent/rest.md

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ author: eric-urban
55
ms.author: eur
66
ms.service: azure-ai-speech
77
ms.topic: include
8-
ms.date: 12/1/2023
8+
ms.date: 5/19/2025
99
ms.custom: include
1010
---
1111

articles/ai-services/speech-service/includes/how-to/professional-voice/create-consent/speech-studio.md

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ author: eric-urban
55
ms.author: eur
66
ms.service: azure-ai-speech
77
ms.topic: include
8-
ms.date: 12/1/2023
8+
ms.date: 5/19/2025
99
ms.custom: include
1010
---
1111

Lines changed: 48 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,48 @@
1+
---
2+
author: eric-urban
3+
ms.author: eur
4+
ms.service: azure-ai-speech
5+
ms.topic: include
6+
ms.date: 5/19/2025
7+
---
8+
9+
All it takes to get started are a handful of audio files and the associated transcriptions. See if custom neural voice supports your [language](../../../../language-support.md?tabs=tts) and [region](../../../../regions.md#regions).
10+
11+
## Start fine-tuning
12+
13+
In the [Azure AI Foundry portal](https://ai.azure.com), you can fine-tune some Azure AI services models. For example, you can fine-tune a professional voice model.
14+
15+
To fine-tune a professional voice model, follow these steps:
16+
17+
1. Go to your project in the [Azure AI Foundry portal](https://ai.azure.com). If you need to create a project, see [Create an Azure AI Foundry project](/azure/ai-foundry/how-to/create-projects).
18+
1. Select **Fine-tuning** from the left pane.
19+
1. Select **AI Service fine-tuning** > **+ Fine-tune**.
20+
21+
:::image type="content" source="../../../../media/custom-voice/professional-voice/fine-tune-azure-ai-services.png" alt-text="Screenshot of the page to select fine-tuning of Azure AI Services models." lightbox="../../../../media/custom-voice/professional-voice/fine-tune-azure-ai-services.png":::
22+
23+
1. In the wizard, select **Custom voice (professional voice fine-tuning)**. Then select **Next**.
24+
1. Follow the instructions provided by the wizard to create your project.
25+
26+
## Continue fine-tuning
27+
28+
Go to the Azure AI Speech documentation to learn how to continue fine-tuning your professional voice model:
29+
* [Add voice talent consent to the professional voice project](../../../../professional-voice-create-consent.md)
30+
* [Add training datasets](../../../../professional-voice-create-training-set.md)
31+
* [Train your voice model](../../../../professional-voice-train-voice.md)
32+
* [Deploy your professional voice model as an endpoint](../../../../professional-voice-deploy-endpoint.md)
33+
34+
## View fine-tuned models
35+
36+
After fine-tuning, you can access your custom voice models and deployments from the **Fine-tuning** page.
37+
38+
1. Sign in to the [Azure AI Foundry portal](https://ai.azure.com).
39+
1. Select **Fine-tuning** from the left pane.
40+
1. Select **AI Service fine-tuning**. You can view the status of your fine-tuning tasks and the models that were created.
41+
42+
:::image type="content" source="../../../../media/custom-voice/professional-voice/fine-tune-azure-ai-services.png" alt-text="Screenshot of the page to view fine-tuned AI services models." lightbox="../../../../media/custom-voice/professional-voice/fine-tune-azure-ai-services.png":::
43+
44+
## Next steps
45+
46+
> [!div class="nextstepaction"]
47+
> [Add voice talent consent to the professional voice project.](../../../../professional-voice-create-consent.md)
48+

articles/ai-services/speech-service/includes/how-to/professional-voice/create-project/rest.md

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ author: eric-urban
55
ms.author: eur
66
ms.service: azure-ai-speech
77
ms.topic: include
8-
ms.date: 12/1/2023
8+
ms.date: 5/19/2025
99
ms.custom: include
1010
---
1111

articles/ai-services/speech-service/includes/how-to/professional-voice/create-project/speech-studio.md

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ author: eric-urban
33
ms.author: eur
44
ms.service: azure-ai-speech
55
ms.topic: include
6-
ms.date: 2/19/2025
6+
ms.date: 5/19/2025
77
---
88

99
Content for [Custom neural voice](https://aka.ms/customvoice), such as data, models, tests, and endpoints, is organized into projects in Speech Studio. Each project is specific to a country/region and language, and the gender of the voice you want to create. For example, you might create a project for a female voice for your call center's chat bots that use English in the United States.
Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
1+
---
2+
title: include file
3+
description: include file
4+
author: eric-urban
5+
ms.author: eur
6+
ms.service: azure-ai-speech
7+
ms.topic: include
8+
ms.date: 5/19/2025
9+
ms.custom: include
10+
---
11+
12+
When you're ready to create a custom text to speech voice for your application, the first step is to gather audio recordings and associated scripts to start training the voice model. For details on recording voice samples, see [the tutorial](../../../../record-custom-voice-samples.md). The Speech service uses this data to create a unique voice tuned to match the voice in the recordings. After you've trained the voice, you can start synthesizing speech in your applications.
13+
14+
All data you upload must meet the requirements for the data type that you choose. It's important to correctly format your data before it's uploaded, which ensures the data will be accurately processed by the Speech service. To confirm that your data is correctly formatted, see [Training data types](../../../../how-to-custom-voice-training-data.md).
15+
16+
> [!NOTE]
17+
> - Standard subscription (S0) users can upload five data files simultaneously. If you reach the limit, wait until at least one of your data files finishes importing. Then try again.
18+
> - Standard subscription (S0) users can import a maximum of 500 .zip data files per subscription. For more details, see [Speech service quotas and limits](../../../../speech-services-quotas-and-limits.md#custom-neural-voice---professional).
19+
20+
## Upload your data
21+
22+
> [!TIP]
23+
> For a sample consent statement and training data, see the [GitHub repository](https://github.com/Azure-Samples/Cognitive-Speech-TTS/tree/master/CustomVoice/Sample%20Data).
24+
25+
When you're ready to upload your data, go to the **Prepare training data** tab to add your first training set and upload data. A *training set* is a set of audio utterances and their mapping scripts used for training a voice model. You can use a training set to organize your training data. The service checks data readiness for each training set. You can import multiple datasets into a training set.
26+
27+
To upload training data, follow these steps:
28+
1. Sign in to the [Azure AI Foundry portal](https://ai.azure.com).
29+
1. Select **Fine-tuning** from the left pane and then select **AI Service fine-tuning**.
30+
1. Select the professional voice fine-tuning task (by model name) that you [started as described in the create professional voice article](/azure/ai-services/speech-service/professional-voice-create-project).
31+
1. Select **Prepare training data** > **Upload data**.
32+
1. In the **Upload data** wizard, choose a [data type](../../../../how-to-custom-voice-training-data.md). If you're using the sample data, select **Individual utterances + matching transcript**.
33+
34+
:::image type="content" source="../../../../media/custom-voice/professional-voice/choose-training-data-type.png" alt-text="Screenshot of the page to select the training data type." lightbox="../../../../media/custom-voice/professional-voice/choose-training-data-type.png":::
35+
36+
1. Select **Next**.
37+
1. On the **Specify the target training set** page, select **Create new**.
38+
1. Enter a training set name and then select **Create**.
39+
40+
:::image type="content" source="../../../../media/custom-voice/professional-voice/create-new-training-set.png" alt-text="Screenshot of the page to create a new training set." lightbox="../../../../media/custom-voice/professional-voice/create-new-training-set.png":::
41+
42+
1. Select **Next**.
43+
1. On the **Data upload** page, select a **Recording file** and **Script file** in the respective tiles. You can select local files from your computer or enter the Azure Blob storage URL to upload data.
44+
1. Select **Next**.
45+
1. Enter a name and description for your data and then select **Next**.
46+
1. Review the upload details, and select **Upload data**.
47+
48+
> [!NOTE]
49+
> Duplicate IDs aren't accepted. Utterances with the same ID will be removed.
50+
>
51+
> Duplicate audio names are removed from the training. Make sure the data you select doesn't contain the same audio names within the .zip file or across multiple .zip files. If utterance IDs (either in audio or script files) are duplicates, they're rejected.
52+
53+
Data files are automatically validated when you select **Upload data**. Data validation includes a series of checks on the audio files to verify their file format, size, and sampling rate. If there are any errors, fix them and submit again.
54+
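Some of the audio checks can be approximated locally before you upload. The following is a minimal sketch, not the Speech service's own validation logic. It uses Python's standard `wave` module and takes its thresholds (16 kHz minimum, 24 kHz recommended, 30 seconds maximum) from the issue tables in this article. The function name `check_wav` and the returned issue strings are hypothetical.

```python
import wave

# Thresholds taken from the issue tables in this article (assumed to apply per file).
MIN_RATE_HZ = 16_000          # below this, the file is auto-rejected
RECOMMENDED_RATE_HZ = 24_000  # below this, a manual check is suggested
MAX_DURATION_S = 30.0         # longer files are auto-rejected


def check_wav(path):
    """Return a list of issue names for one .wav file (empty if none found)."""
    try:
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            frames = wav.getnframes()
    except (wave.Error, EOFError):
        # The wave module could not parse the file as PCM WAV.
        return ["Invalid audio format"]

    if rate <= 0:
        return ["Invalid audio format"]

    issues = []
    if rate < MIN_RATE_HZ:
        issues.append("Low sampling rate")
    elif rate < RECOMMENDED_RATE_HZ:
        issues.append("Low sampling rate for neural voice")

    # Duration in seconds is the frame count divided by the frame rate.
    if frames / rate > MAX_DURATION_S:
        issues.append("Too long audio")
    return issues
```

Running a check like this over every file in a .zip archive before upload can surface problems early, but the service's validation remains the authority on whether data is accepted.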
55+
After you upload the data, you can check the details in the training set detail view. On the detail page, you can check the pronunciation issues and the noise level of each utterance. The pronunciation score at the sentence level ranges from 0 to 100. A score below 70 normally indicates a speech error or script mismatch. Utterances with an overall score lower than 70 are rejected. A heavy accent can reduce your pronunciation score and affect the generated digital voice.
56+
57+
## Resolve data issues online
58+
59+
After upload, you can check the data details of the training set. Before continuing to [train your voice model](../../../../professional-voice-train-voice.md), you should try to resolve any data issues.
60+
61+
### Typical data issues
62+
63+
The issues are divided into three types. Refer to the following tables to check the respective types of errors.
64+
65+
**Auto-rejected**
66+
67+
Data with these errors won't be used for training. Imported data with errors will be ignored, so you don't need to delete them. You can [fix these data errors online](#resolve-data-issues-online) or upload the corrected data again for training.
68+
69+
| Category | Name | Description |
70+
| --------- | ----------- | --------------------------- |
71+
| Script | Invalid separator| You must separate the utterance ID and the script content with a Tab character.|
72+
| Script | Invalid script ID| The script line ID must be numeric.|
73+
| Script | Duplicated script|Each line of the script content must be unique. The line is duplicated with {}.|
74+
| Script | Script too long| The script must be less than 1,000 characters.|
75+
| Script | No matching audio| The ID of each utterance (each line of the script file) must match the audio ID.|
76+
| Script | No valid script| No valid script is found in this dataset. Fix the script lines that appear in the detailed issue list.|
77+
| Audio | No matching script| No audio files match the script ID. The names of the .wav files must match the IDs in the script file.|
78+
| Audio | Invalid audio format| The audio format of the .wav files is invalid. Check the .wav file format by using an audio tool like [SoX](http://sox.sourceforge.net/).|
79+
| Audio | Low sampling rate| The sampling rate of the .wav files can't be lower than 16 kHz.|
80+
| Audio | Too long audio| Audio duration is longer than 30 seconds. Split the long audio into multiple files. It's a good idea to make utterances shorter than 15 seconds.|
81+
| Audio | No valid audio| No valid audio is found in this dataset. Check your audio data and upload again.|
82+
| Mismatch | Low scored utterance| Sentence-level pronunciation score is lower than 70. Review the script and the audio content to make sure they match.|
83+
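The script rules in the preceding table can be checked before upload. The following is a rough sketch under stated assumptions, not the service's implementation: it assumes one utterance per line in the form `ID<Tab>text`, treats "numeric" as ASCII digits only, and reports only the first problem found on each line. The function name `check_script` and the `audio_ids` parameter (the .wav file names without their extension) are hypothetical.

```python
def check_script(lines, audio_ids):
    """Map 1-based line numbers to the first issue name found on that line."""
    issues = {}
    seen_text = set()
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        # partition() leaves sep empty when the line has no Tab character.
        utterance_id, sep, text = line.partition("\t")

        if not sep:
            issues[number] = "Invalid separator"
            continue

        if not (utterance_id.isascii() and utterance_id.isdigit()):
            issues[number] = "Invalid script ID"
        elif text in seen_text:
            issues[number] = "Duplicated script"
        elif len(text) >= 1000:
            # The table requires fewer than 1,000 characters.
            issues[number] = "Script too long"
        elif utterance_id not in audio_ids:
            issues[number] = "No matching audio"

        # Remember the text so later lines with the same content are flagged.
        seen_text.add(text)
    return issues
```

For example, `check_script(["0001\tHello there.", "0002 Missing tab."], {"0001"})` flags only the second line, because it has no Tab separator.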
84+
**Auto-fixed**
85+
86+
The following errors are fixed automatically, but you should review and confirm the fixes are made correctly.
87+
88+
| Category | Name | Description |
89+
| --------- | ----------- | --------------------------- |
90+
| Mismatch |Silence auto fixed |The start silence is detected to be shorter than 100 ms, and has been extended to 100 ms automatically. Download the normalized dataset and review it. |
91+
| Mismatch |Silence auto fixed | The end silence is detected to be shorter than 100 ms, and has been extended to 100 ms automatically. Download the normalized dataset and review it.|
92+
| Script | Text auto normalized|Text is automatically normalized for digits, symbols, and abbreviations. Review the script and audio to make sure they match.|
93+
94+
**Manual check required**
95+
96+
Unresolved errors listed in the next table affect the quality of training, but data with these errors won't be excluded during training. For higher-quality training, it's a good idea to fix these errors manually.
97+
98+
| Category | Name | Description |
99+
| --------- | ----------- | --------------------------- |
100+
| Script | Non-normalized text |This script contains symbols. Normalize the symbols to match the audio. For example, normalize */* to *slash*.|
101+
| Script | Not enough question utterances| At least 10 percent of the total utterances should be question sentences. This helps the voice model properly express a questioning tone.|
102+
| Script | Not enough exclamation utterances| At least 10 percent of the total utterances should be exclamation sentences. This helps the voice model properly express an excited tone.|
103+
| Script | No valid end punctuation| Add one of the following at the end of the line: full stop (half-width '.' or full-width '。'), exclamation point (half-width '!' or full-width '!'), or question mark (half-width '?' or full-width '?').|
104+
| Audio| Low sampling rate for neural voice | The sampling rate of your .wav files should be 24 kHz or higher for creating neural voices. If it's lower, it's automatically raised to 24 kHz.|
105+
| Volume |Overall volume too low|Volume shouldn't be lower than -18 dB (10 percent of max volume). Control the volume average level within proper range during the sample recording or data preparation.|
106+
| Volume | Volume overflow| Overflowing volume is detected at {}s. Adjust the recording equipment to avoid the volume overflow at its peak value.|
107+
| Volume | Start silence issue | The first 100 ms of silence isn't clean. Reduce the recording noise floor level, and leave the first 100 ms at the start silent.|
108+
| Volume| End silence issue| The last 100 ms of silence isn't clean. Reduce the recording noise floor level, and leave the last 100 ms at the end silent.|
109+
| Mismatch | Low scored words|Review the script and the audio content to make sure they match, and control the noise floor level. Reduce the length of long silence, or split the audio into multiple utterances if it's too long.|
110+
| Mismatch | Start silence issue |Extra audio was heard before the first word. Review the script and the audio content to make sure they match, control the noise floor level, and make the first 100 ms silent.|
111+
| Mismatch | End silence issue| Extra audio was heard after the last word. Review the script and the audio content to make sure they match, control the noise floor level, and make the last 100 ms silent.|
112+
| Mismatch | Low signal-noise ratio | Audio SNR level is lower than 20 dB. At least 35 dB is recommended.|
113+
| Mismatch | No score available |Failed to recognize speech content in this audio. Check the audio and the script content to make sure the audio is valid, and matches the script.|
114+
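The punctuation-related script checks in the preceding table lend themselves to a quick local count. The following sketch assumes that an utterance's type is determined only by its final character, which is a simplification. It applies the 10 percent guidance for question and exclamation sentences. The names `END_MARKS` and `summarize_punctuation` are hypothetical, and the service might classify sentences differently.

```python
# Half-width and full-width end marks listed in the table above.
END_MARKS = {
    ".": "statement", "\u3002": "statement",        # '.' and full-width '。'
    "!": "exclamation", "\uff01": "exclamation",    # '!' and full-width '!'
    "?": "question", "\uff1f": "question",          # '?' and full-width '?'
}


def summarize_punctuation(texts):
    """Count utterances by end mark and list the table issues that apply."""
    counts = {"statement": 0, "exclamation": 0, "question": 0, "missing": 0}
    for text in texts:
        last_char = text.rstrip()[-1:]  # empty string if the text is blank
        counts[END_MARKS.get(last_char, "missing")] += 1

    total = len(texts)
    issues = []
    if counts["missing"]:
        issues.append("No valid end punctuation")
    # Integer comparison avoids floating-point rounding at exactly 10 percent.
    if counts["question"] * 10 < total:
        issues.append("Not enough question utterances")
    if counts["exclamation"] * 10 < total:
        issues.append("Not enough exclamation utterances")
    return counts, issues
```

A script with eight statements, one question, and one exclamation sits exactly at the 10 percent guideline, so the sketch reports no issues for it.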
115+
## Next steps
116+
117+
> [!div class="nextstepaction"]
118+
> [Train the professional voice](../../../../professional-voice-train-voice.md)
119+
