Commit 5af3a3d: Merge pull request #836 from MicrosoftDocs/main ("10/16 11:00 AM IST Publish")

2 parents: 9a3d20f + 821b420
17 files changed: +117 additions, -137 deletions

articles/ai-services/encryption/cognitive-services-encryption-keys-portal.md

Lines changed: 2 additions & 2 deletions

@@ -2,13 +2,13 @@
  title: Customer-Managed Keys for Azure AI services
  titleSuffix: Azure AI services
  description: Learn about using customer-managed keys to improve data security with Azure AI services.
- author: deeikele
+ author: PatrickFarley
  ms.service: azure-ai-services
  ms.custom:
  - ignite-2023
  ms.topic: conceptual
  ms.date: 11/15/2023
- ms.author: deeikele
+ ms.author: pafarley
  ---

  # Customer-managed keys for encryption

articles/ai-services/speech-service/how-to-custom-voice-training-data.md

Lines changed: 6 additions & 6 deletions

@@ -55,14 +55,14 @@ Follow these guidelines when preparing audio.
  | -------- | ----- |
  | File format | RIFF (.wav), grouped into a .zip file |
  | File name | File name characters supported by Windows OS, with .wav extension.<br>The characters `\ / : * ? " < > \|` aren't allowed. <br>It can't start or end with a space, and can't start with a dot. <br>No duplicate file names allowed. |
- | Sampling rate | When you create a custom neural voice, 24,000 Hz is required. |
+ | Sampling rate | 24 kHz or higher is required when you create a custom neural voice. |
  | Sample format | PCM, at least 16-bit |
  | Audio length | Shorter than 15 seconds |
  | Archive format | .zip |
  | Maximum archive size | 2048 MB |

  > [!NOTE]
- > The default sampling rate for a custom neural voice is 24,000 Hz. Audio files with a sampling rate lower than 16,000 Hz will be rejected. If a .zip file contains .wav files with different sample rates, only those equal to or higher than 16,000 Hz will be imported. Your audio files with a sampling rate higher than 16,000 Hz and lower than 24,000 Hz will be up-sampled to 24,000 Hz to train a neural voice. It's recommended that you should use a sample rate of 24,000 Hz for your training data.
+ > The default sampling rate for a custom neural voice is 24 kHz. Audio files with a sampling rate lower than 16,000 Hz will be rejected. If a .zip file contains .wav files with different sample rates, only those equal to or higher than 16,000 Hz will be imported. Your audio files with a sampling rate higher than 16,000 Hz and lower than 24 kHz will be up-sampled to 24 kHz to train a neural voice. It's recommended that you use a sample rate of 24 kHz or higher for your training data.

  ### Transcription data for Individual utterances + matching transcript

@@ -104,14 +104,14 @@ Follow these guidelines when preparing audio for segmentation.
  | -------- | ----- |
  | File format | RIFF (.wav) or .mp3, grouped into a .zip file |
  | File name | File name characters supported by Windows OS, with .wav extension. <br>The characters `\ / : * ? " < > \|` aren't allowed. <br>It can't start or end with a space, and can't start with a dot. <br>No duplicate file names allowed. |
- | Sampling rate | When you create a custom neural voice, 24,000 Hz is required. |
+ | Sampling rate | 24 kHz or higher is required when you create a custom neural voice. |
  | Sample format |RIFF(.wav): PCM, at least 16-bit.<br/><br/>mp3: At least 256 KBps bit rate.|
  | Audio length | Longer than 20 seconds |
  | Archive format | .zip |
  | Maximum archive size | 2048 MB, at most 1,000 audio files included |

  > [!NOTE]
- > The default sampling rate for a custom neural voice is 24,000 Hz. Audio files with a sampling rate lower than 16,000 Hz will be rejected. Your audio files with a sampling rate higher than 16,000 Hz and lower than 24,000 Hz will be up-sampled to 24,000 Hz to train a neural voice. It's recommended that you should use a sample rate of 24,000 Hz for your training data.
+ > The default sampling rate for a custom neural voice is 24 kHz. Audio files with a sampling rate lower than 16,000 Hz will be rejected. Your audio files with a sampling rate higher than 16,000 Hz and lower than 24 kHz will be up-sampled to 24 kHz to train a neural voice. It's recommended that you use a sample rate of 24 kHz or higher for your training data.

  All audio files should be grouped into a zip file. It's OK to put .wav files and .mp3 files into the same zip file. For example, you can upload a 45-second audio file named 'kingstory.wav' and a 200-second long audio file named 'queenstory.mp3' in the same zip file. All .mp3 files will be transformed into the .wav format after processing.

@@ -147,14 +147,14 @@ Follow these guidelines when preparing audio.
  | -------- | ----- |
  | File format | RIFF (.wav) or .mp3, grouped into a .zip file |
  | File name | File name characters supported by Windows OS, with .wav extension. <br>The characters `\ / : * ? " < > \|` aren't allowed. <br>It can't start or end with a space, and can't start with a dot. <br>No duplicate file names allowed. |
- | Sampling rate | When you create a custom neural voice, 24,000 Hz is required. |
+ | Sampling rate | 24 kHz or higher is required when you create a custom neural voice. |
  | Sample format |RIFF(.wav): PCM, at least 16-bit<br>mp3: At least 256 KBps bit rate.|
  | Audio length | No limit |
  | Archive format | .zip |
  | Maximum archive size | 2048 MB, at most 1,000 audio files included |

  > [!NOTE]
- > The default sampling rate for a custom neural voice is 24,000 Hz. Your audio files with a sampling rate higher than 16,000 Hz and lower than 24,000 Hz will be up-sampled to 24,000 Hz to train a neural voice. It's recommended that you should use a sample rate of 24,000 Hz for your training data.
+ > The default sampling rate for a custom neural voice is 24 kHz. Your audio files with a sampling rate higher than 16,000 Hz and lower than 24 kHz will be up-sampled to 24 kHz to train a neural voice. It's recommended that you use a sample rate of 24 kHz or higher for your training data.

  All audio files should be grouped into a zip file. Once your dataset is successfully uploaded, the Speech service helps you segment the audio file into utterances based on our speech batch transcription service. Unique IDs are assigned to the segmented utterances automatically. Matching transcripts are generated through speech recognition. All .mp3 files will be transformed into the .wav format after processing. You can check the segmented utterances and the matching transcripts by downloading the dataset.
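The sampling-rate rules in the diff above are easy to check programmatically before uploading a dataset. A minimal sketch (not an official tool) using Python's standard-library `wave` module; the thresholds are taken from the notes above, and the file-name and length rules from the tables are not checked here:

```python
import wave

MIN_HZ = 16_000          # files below this sampling rate are rejected
RECOMMENDED_HZ = 24_000  # files in [16 kHz, 24 kHz) are up-sampled to 24 kHz

def check_training_wav(path: str) -> str:
    """Classify one .wav file against the custom neural voice sampling-rate rules."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8
    if bits < 16:
        return "rejected: sample format must be PCM, at least 16-bit"
    if rate < MIN_HZ:
        return f"rejected: {rate} Hz is below 16,000 Hz"
    if rate < RECOMMENDED_HZ:
        return f"accepted: {rate} Hz will be up-sampled to 24,000 Hz"
    return f"accepted: {rate} Hz"
```

A real pre-upload check would also enforce the file-name restrictions and per-file audio-length limits from the tables.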

articles/ai-services/speech-service/record-custom-voice-samples.md

Lines changed: 2 additions & 2 deletions

@@ -336,15 +336,15 @@ Take regular breaks and provide a beverage to help your voice talent keep their

  ### After the session

- Modern recording studios run on computers. At the end of the session, you receive one or more audio files, not a tape. These files are probably WAV or AIFF format in CD quality (44.1 KHz 16-bit) or better. 24 KHz 16-bit is common and desirable. The default sampling rate for a custom neural voice is 24 KHz. It's recommended that you should use a sample rate of 24 KHz for your training data. Higher sampling rates, such as 96 KHz, aren't usually needed.
+ Modern recording studios run on computers. At the end of the session, you receive one or more audio files, not a tape. These files are probably WAV or AIFF format in CD quality (44.1 kHz, 16-bit) or better. 24 kHz, 16-bit is common and desirable. The default sampling rate for a custom neural voice is 24 kHz. It's recommended that you use a sample rate of 24 kHz or higher for your training data. Higher sampling rates, such as 96 kHz, aren't usually needed.

  Speech Studio requires each provided utterance to be in its own file. Each audio file delivered by the studio contains multiple utterances. So the primary post-production task is to split up the recordings and prepare them for submission. The recording engineer might have placed markers in the file (or provided a separate cue sheet) to indicate where each utterance starts.

  Use your notes to find the exact takes you want, and then use a sound editing utility, such as [Avid Pro Tools](https://www.avid.com/en/pro-tools), [Adobe Audition](https://www.adobe.com/products/audition.html), or the free [Audacity](https://www.audacityteam.org/), to copy each utterance into a new file.

  Listen to each file carefully. At this stage, you can edit out small unwanted sounds that you missed during recording, like a slight lip smack before a line, but be careful not to remove any actual speech. If you can't fix a file, remove it from your dataset and note that you've done so.

- Convert each file to 16 bits and a sample rate of 24 KHz before saving and if you recorded the studio chatter, remove the second channel. Save each file in WAV format, naming the files with the utterance number from your script.
+ Convert each file to 16 bits and a sample rate of 24 kHz or higher before saving. If you recorded the studio chatter, remove the second channel. Save each file in WAV format, naming the files with the utterance number from your script.

  Finally, create the transcript that associates each WAV file with a text version of the corresponding utterance. [Train your voice model](./professional-voice-train-voice.md) includes details of the required format. You can copy the text directly from your script. Then create a Zip file of the WAV files and the text transcript.
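The final packaging step in this diff (transcript plus WAVs in one zip) can be sketched with the standard library. This sketch assumes a tab-separated `filename<TAB>text` transcript layout; the linked "Train your voice model" article is the authoritative source for the required format:

```python
import zipfile
from pathlib import Path

def package_dataset(utterances: dict, wav_dir: str, out_zip: str) -> None:
    """Write a transcript mapping each WAV file to its script text, then
    bundle the WAVs and the transcript into one zip file for upload.

    utterances: e.g. {"0001.wav": "First scripted line.", ...}
    Assumes a tab-separated transcript format (an assumption; verify it
    against the training article before uploading).
    """
    root = Path(wav_dir)
    transcript = root / "transcript.txt"
    lines = [f"{name}\t{text}" for name, text in sorted(utterances.items())]
    transcript.write_text("\n".join(lines), encoding="utf-8")
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(transcript, transcript.name)   # transcript at archive root
        for name in utterances:
            zf.write(root / name, name)         # WAVs next to the transcript
```

Naming each WAV with the utterance number from the script, as the text above recommends, makes the `utterances` mapping trivial to build.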

articles/ai-services/speech-service/speech-container-faq.yml

Lines changed: 34 additions & 28 deletions

@@ -55,7 +55,7 @@ sections:

  > 2 (clusters) * 5 (VMs per cluster) * $0.3528/hour * 365 (days) * 24 (hours) = $31 K / year

- When mapping to physical machine, a general estimation is 1 vCPU = 1 Physical CPU Core. In reality, 1vCPU is more powerful than a single core.
+ When mapping to a physical machine, a general estimation is 1 vCPU = 1 physical CPU core. In reality, 1 vCPU is more powerful than a single core.

  For on-premises, all of these extra factors come into play:
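The yearly cost figure in the hunk above is straightforward arithmetic and can be reproduced directly (the rate and cluster sizes come from the FAQ text):

```python
# Reproduce the back-of-the-envelope yearly cost from the FAQ:
# 2 clusters * 5 VMs per cluster * $0.3528/hour * 365 days * 24 hours
clusters = 2
vms_per_cluster = 5
rate_per_hour = 0.3528  # USD per VM-hour, as quoted in the text
yearly_cost = clusters * vms_per_cluster * rate_per_hour * 365 * 24

print(f"${yearly_cost:,.2f} per year")  # prints $30,905.28; the text rounds to $31 K
```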
@@ -150,57 +150,66 @@ sections:


  - question: |
-   What are the recommended resources, CPU and RAM; for 50 concurrent requests?
+   What are the recommended CPU and RAM resources for Speech containers?
  answer: |
-   How many concurrent requests will a 4-core, 8-GB RAM handle? If we have to serve for example, 50 concurrent requests, how many Core and RAM is recommended?
-
-   At real-time, 8 with our latest `en-US`, so we recommend using more docker containers beyond six concurrent requests. Beyond 16 cores it becomes nonuniform memory access (NUMA) node sensitive. The following table describes the minimum and recommended allocation of resources for each Speech container.

- # [Speech-to-text](#tab/stt)
+ # [Speech to text](#tab/stt)

- The following table describes the minimum and recommended allocation of resources
+ The following table describes the minimum and recommended allocation of resources for Speech to text.

- | Container | Minimum | Recommended | Speech Model |
- |-----------|---------|-------------|--------------|
- | Speech to text | 4 core, 4-GB memory | 8 core, 8-GB memory | +4 to 8 GB memory |
+ | Container | Minimum | Recommended | Speech Model | Concurrency Limit |
+ |-----------|---------|-------------|--------------|-------------------|
+ | Speech to text | 4 core, 4-GB memory | 8 core, 8-GB memory | +4 to 8 GB memory | 2 sessions per core |

  > [!NOTE]
  > The minimum and recommended allocations are based on Docker limits, *not* the host machine resources.
  > For example, speech to text containers memory map portions of a large language model. We recommend that the entire file should fit in memory. You need to add an extra 4 to 8 GB to load the speech models (see above table).
  > Also, the first run of either container might take longer because models are being paged into memory.
+
+ - For files, the throttling will be in the Speech SDK, at 2x. The first 5 seconds of audio aren't throttled.
+ - For speech to text, the decoder is capable of doing about 2-3x real-time. That's why we don't recommend keeping more than two active connections per core. The extreme side would be to put about 10 decoders at 2x real-time in an eight-core machine like `DS13_V2`. For container version 1.3 and later, there's a parameter you can try setting: `-e DECODER_MAX_COUNT=20`.
+ - For microphone, speech to text happens at 1x real-time. The overall usage should be at about one core for a single recognition.
+ - Beyond 16 cores, the system becomes nonuniform memory access (NUMA) node sensitive.

- # [Custom speech-to-text](#tab/cstt)
+ # [Custom speech to text](#tab/cstt)

- The following table describes the minimum and recommended allocation of resources
+ The following table describes the minimum and recommended allocation of resources for Custom speech to text.

- | Container | Minimum | Recommended | Speech Model |
- |-----------|---------|-------------|--------------|
- | Custom speech to text | 4 core, 4-GB memory | 8 core, 8-GB memory | +4 to 8 GB memory |
+ | Container | Minimum | Recommended | Speech Model | Concurrency Limit |
+ |-----------|---------|-------------|--------------|-------------------|
+ | Custom speech to text | 4 core, 4-GB memory | 8 core, 8-GB memory | +4 to 8 GB memory | 2 sessions per core |

  > [!NOTE]
  > The minimum and recommended allocations are based on Docker limits, *not* the host machine resources.
  > For example, speech to text containers memory map portions of a large language model. We recommend that the entire file should fit in memory. You need to add an extra 4 to 8 GB to load the speech models (see above table).
  > Also, the first run of either container might take longer because models are being paged into memory.
+
+ - For files, the throttling will be in the Speech SDK, at 2x. The first 5 seconds of audio aren't throttled.
+ - For speech to text, the decoder is capable of doing about 2-3x real-time. That's why we don't recommend keeping more than two active connections per core. The extreme side would be to put about 10 decoders at 2x real-time in an eight-core machine like `DS13_V2`. For container version 1.3 and later, there's a parameter you can try setting: `-e DECODER_MAX_COUNT=20`.
+ - For microphone, speech to text happens at 1x real-time. The overall usage should be at about one core for a single recognition.
+ - Beyond 16 cores, the system becomes nonuniform memory access (NUMA) node sensitive.

  # [Neural Text-to-speech](#tab/tts)

- The following table describes the minimum and recommended allocation of resources
+ The following table describes the minimum and recommended allocation of resources for Neural text to speech.

- | Container | Minimum | Recommended |
- |-----------|---------|-------------|
- | Neural text to speech | 6 core, 12-GB memory | 8 core, 16-GB memory |
+ | Container | Minimum | Recommended | Concurrency Limit |
+ |-----------|---------|-------------|-------------------|
+ | Neural text to speech | 6 core, 12-GB memory | 8 core, 16-GB memory<br>8 core, 24-GB memory (multilingual voices) | 5 requests per 8 core CPU |

  > [!NOTE]
  > The minimum and recommended allocations are based on Docker limits, *not* the host machine resources.
  > Also, the first run of either container might take longer because models are being paged into memory.
+
+ - Real-time performance varies depending on concurrency. With a concurrency of 1, an NTTS container instance can achieve 10x real-time performance. However, when concurrency increases to 5, real-time performance drops to 3x or lower. We recommend sending fewer than 5 concurrent requests to one container. Start more containers for increased concurrency.

  # [Speech language identification](#tab/lid)

- The following table describes the minimum and recommended allocation of resources
+ The following table describes the minimum and recommended allocation of resources for Speech language identification.

- | Container | Minimum | Recommended |
- |-----------|---------|-------------|
- | Speech language identification | 1 core, 1-GB memory | 1 core, 1-GB memory |
+ | Container | Minimum | Recommended | Concurrency Limit |
+ |-----------|---------|-------------|-------------------|
+ | Speech language identification | 1 core, 1-GB memory | 1 core, 1-GB memory | 2 sessions per core |

  > [!NOTE]
  > The minimum and recommended allocations are based on Docker limits, *not* the host machine resources.
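The concurrency limits added in the tables above imply a simple sizing rule of thumb. A hypothetical helper (the function names are mine, not from the FAQ) that estimates instance counts, assuming 2 sessions per core for speech to text and about 5 concurrent requests per 8-core neural text to speech container:

```python
import math

def stt_containers_needed(concurrent_sessions: int, cores_per_container: int,
                          sessions_per_core: int = 2) -> int:
    """Speech to text: the FAQ recommends at most 2 active sessions per core."""
    capacity = cores_per_container * sessions_per_core
    return math.ceil(concurrent_sessions / capacity)

def ntts_containers_needed(concurrent_requests: int,
                           requests_per_container: int = 5) -> int:
    """Neural text to speech: about 5 concurrent requests per 8-core container."""
    return math.ceil(concurrent_requests / requests_per_container)

# 50 concurrent speech to text sessions on 8-core containers -> 4 instances
print(stt_containers_needed(50, cores_per_container=8))  # prints 4
# 12 concurrent neural TTS requests -> 3 containers
print(ntts_containers_needed(12))                        # prints 3
```

This is only a starting point; the NUMA caveat and per-container `DECODER_MAX_COUNT` tuning above still apply.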
@@ -209,13 +218,10 @@ sections:
  ***

  - Each core must be at least 2.6 GHz or faster.
- - For files, the throttling will be in the Speech SDK, at 2x. The first 5 seconds of audio aren't throttled.
- - The decoder is capable of doing about 2-3x real-time. For this, the overall CPU usage will be close to two cores for a single recognition. That's why we don't recommend keeping more than two active connections, per container instance. The extreme side would be to put about 10 decoders at 2x real-time in an eight-core machine like `DS13_V2`. For the container version 1.3 and later, there's a param you could try setting `-e DECODER_MAX_COUNT=20`.
- - For microphone, it is at 1x real-time. The overall usage should be at about one core for a single recognition.

- Consider the total number of hours of audio you have. If the number is large, to improve reliability and availability, we suggest running more instances of containers, either on a single box or on multiple boxes, behind a load balancer. Orchestration could be done using Kubernetes (K8S) and Helm, or with Docker compose.
+ Consider the total number of hours of audio you have, and your expected time to complete the task. If the number is large, to improve reliability and availability, we suggest running more instances of containers, either on a single box or on multiple boxes, behind a load balancer. Orchestration could be done using Kubernetes (K8S) and Helm, or with Docker Compose.

- As an example, to handle 1000 hours/24 hours, we have tried setting up 3-4 VMs, with 10 instances/decoders per VM.
+ As an example, we have performed speech to text on 1000 hours of audio within 24 hours with 4-5 VMs and 10 instances/decoders per VM.
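The batch example in the hunk above can be sanity-checked: at roughly 2x real-time per decoder (the FAQ's own estimate) and 10 decoders per VM, each VM clears about 20 audio-hours per wall-clock hour. A sketch assuming ideal scaling, with no queueing or retry overhead:

```python
import math

def vms_for_batch(audio_hours: float, deadline_hours: float,
                  decoders_per_vm: int = 10, realtime_factor: float = 2.0) -> int:
    """Estimate VMs needed to transcribe a batch of audio by a deadline.

    Assumes ideal scaling: each decoder processes `realtime_factor` hours
    of audio per wall-clock hour, with no contention between decoders.
    """
    audio_hours_per_vm = decoders_per_vm * realtime_factor * deadline_hours
    return math.ceil(audio_hours / audio_hours_per_vm)

# 1000 hours of audio in 24 hours needs at least 3 VMs in theory;
# the FAQ used 4-5 VMs, leaving headroom for reliability.
print(vms_for_batch(1000, 24))  # prints 3
```

The gap between the theoretical 3 and the observed 4-5 VMs is the headroom the FAQ recommends for reliability and availability.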
  - question: |
    Does the Speech container support punctuation?
