articles/ai-services/speech-service/speech-container-faq.yml
Lines changed: 34 additions & 28 deletions
@@ -55,7 +55,7 @@ sections:
55
55
56
56
> 2 (clusters) * 5 (VMs per cluster) * $0.3528/hour * 365 (days) * 24 (hours) = $31 K / year
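The yearly figure above is plain multiplication. A minimal Python sketch that reproduces it, using only the numbers quoted in that line (the function name `annual_vm_cost` is illustrative, not part of any Azure tooling):

```python
def annual_vm_cost(clusters: int, vms_per_cluster: int, hourly_rate: float,
                   days: int = 365, hours_per_day: int = 24) -> float:
    """Return the yearly cost in dollars for VMs that run around the clock."""
    return clusters * vms_per_cluster * hourly_rate * days * hours_per_day


# Figures taken from the estimate above: 2 clusters, 5 VMs each, $0.3528/hour.
cost = annual_vm_cost(clusters=2, vms_per_cluster=5, hourly_rate=0.3528)
print(f"${cost:,.2f} per year")  # about $31K
```

Changing `hourly_rate` or the VM counts gives a quick comparison between VM sizes; the result covers VM hours only and ignores the extra on-premises factors.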
57
57
58
-
When mapping to physical machine, a general estimation is 1 vCPU = 1 Physical CPU Core. In reality, 1vCPU is more powerful than a single core.
58
+
When mapping to a physical machine, a rough estimate is 1 vCPU = 1 physical CPU core. In reality, 1 vCPU is more powerful than a single core.
59
59
60
60
For on-premises, all of these extra factors come into play:
61
61
@@ -150,57 +150,66 @@ sections:
150
150
151
151
152
152
- question: |
153
-
What are the recommended resources, CPU and RAM; for 50 concurrent requests?
153
+
What are the recommended CPU and RAM resources for Speech containers?
154
154
answer: |
155
-
How many concurrent requests will a 4-core, 8-GB RAM handle? If we have to serve for example, 50 concurrent requests, how many Core and RAM is recommended?
156
-
157
-
At real-time, 8 with our latest `en-US`, so we recommend using more docker containers beyond six concurrent requests. Beyond 16 cores it becomes nonuniform memory access (NUMA) node sensitive. The following table describes the minimum and recommended allocation of resources for each Speech container.
158
155
159
-
# [Speech-to-text](#tab/stt)
156
+
# [Speech to text](#tab/stt)
160
157
161
-
The following table describes the minimum and recommended allocation of resources
158
+
The following table describes the minimum and recommended allocation of resources for Speech to text.
162
159
163
-
| Container | Minimum | Recommended | Speech Model |
| Speech to text | 4 core, 4-GB memory | 8 core, 8-GB memory | +4 to 8 GB memory | 2 sessions per core |
166
163
167
164
> [!NOTE]
168
165
> The minimum and recommended allocations are based on Docker limits, *not* the host machine resources.
169
166
> For example, speech to text containers memory-map portions of a large language model. We recommend that the entire file fit in memory. You need to add an extra 4 to 8 GB to load the speech models (see the preceding table).
170
167
> Also, the first run of either container might take longer because models are being paged into memory.
168
+
169
+
- For files, the Speech SDK throttles audio to 2x real-time. The first 5 seconds of audio aren't throttled.
170
+
- For speech to text, the decoder can process audio at about 2-3x real-time. That's why we don't recommend keeping more than two active connections per core. At the extreme, you could run about 10 decoders at 2x real-time on an eight-core machine like `DS13_V2`. For container version 1.3 and later, you can also try setting the parameter `-e DECODER_MAX_COUNT=20`.
171
+
- For microphone input, speech to text runs at 1x real-time. Overall usage should be about one core for a single recognition.
172
+
- Beyond 16 cores, the system becomes sensitive to nonuniform memory access (NUMA) node placement.
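To turn the guidance above into a sizing estimate, here is a minimal Python sketch. It assumes the "2 sessions per core" limit from the table and treats 16 cores as an upper bound per container because of the NUMA note; the function names and that cap are illustrative assumptions, not part of the container's configuration.

```python
import math

SESSIONS_PER_CORE = 2         # "2 sessions per core" from the table above
MAX_CORES_PER_CONTAINER = 16  # assumption: stay at or below the NUMA-sensitive range


def cores_needed(concurrent_sessions: int) -> int:
    """Cores required so that no core serves more than two sessions."""
    return math.ceil(concurrent_sessions / SESSIONS_PER_CORE)


def containers_needed(concurrent_sessions: int, cores_per_container: int = 8) -> int:
    """Containers required when each one is limited to `cores_per_container` cores."""
    if not 1 <= cores_per_container <= MAX_CORES_PER_CONTAINER:
        raise ValueError("cores_per_container must be between 1 and 16")
    sessions_per_container = cores_per_container * SESSIONS_PER_CORE
    return math.ceil(concurrent_sessions / sessions_per_container)


print(cores_needed(50))       # 25
print(containers_needed(50))  # 4
```

For 50 concurrent sessions, this suggests about 25 cores in total, or four containers at the recommended 8 cores each. Memory for the speech models still has to be added on top, per container.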
171
173
172
-
# [Custom speech-to-text](#tab/cstt)
174
+
# [Custom speech to text](#tab/cstt)
173
175
174
-
The following table describes the minimum and recommended allocation of resources
176
+
The following table describes the minimum and recommended allocation of resources for Custom speech to text.
175
177
176
-
| Container | Minimum | Recommended | Speech Model |
| Custom speech to text | 4 core, 4-GB memory | 8 core, 8-GB memory | +4 to 8 GB memory | 2 sessions per core |
179
181
180
182
> [!NOTE]
181
183
> The minimum and recommended allocations are based on Docker limits, *not* the host machine resources.
182
184
> For example, speech to text containers memory-map portions of a large language model. We recommend that the entire file fit in memory. You need to add an extra 4 to 8 GB to load the speech models (see the preceding table).
183
185
> Also, the first run of either container might take longer because models are being paged into memory.
186
+
187
+
- For files, the Speech SDK throttles audio to 2x real-time. The first 5 seconds of audio aren't throttled.
188
+
- For speech to text, the decoder can process audio at about 2-3x real-time. That's why we don't recommend keeping more than two active connections per core. At the extreme, you could run about 10 decoders at 2x real-time on an eight-core machine like `DS13_V2`. For container version 1.3 and later, you can also try setting the parameter `-e DECODER_MAX_COUNT=20`.
189
+
- For microphone input, speech to text runs at 1x real-time. Overall usage should be about one core for a single recognition.
190
+
- Beyond 16 cores, the system becomes sensitive to nonuniform memory access (NUMA) node placement.
184
191
185
192
# [Neural Text-to-speech](#tab/tts)
186
193
187
-
The following table describes the minimum and recommended allocation of resources
194
+
The following table describes the minimum and recommended allocation of resources for Neural text to speech.
188
195
189
-
| Container | Minimum | Recommended |
190
-
|-----------|---------|-------------|
191
-
| Neural text to speech | 6 core, 12-GB memory | 8 core, 16-GB memory |
| Neural text to speech | 6 core, 12-GB memory | 8 core, 16-GB memory \n 8 core, 24-GB memory (multilingual voices) | 5 requests per 8 core CPU |
192
199
193
200
> [!NOTE]
194
201
> The minimum and recommended allocations are based on Docker limits, *not* the host machine resources.
195
202
> Also, the first run of either container might take longer because models are being paged into memory.
203
+
204
+
- Real-time performance varies depending on concurrency. With a concurrency of 1, an NTTS container instance can achieve 10x real-time performance. However, when concurrency increases to 5, real-time performance drops to 3x or lower. We recommend sending fewer than 5 concurrent requests to one container. Start more containers for increased concurrency.
196
205
197
206
# [Speech language identification](#tab/lid)
198
207
199
-
The following table describes the minimum and recommended allocation of resources
208
+
The following table describes the minimum and recommended allocation of resources for Speech language identification.
| Speech language identification | 1 core, 1-GB memory | 1 core, 1-GB memory | 2 sessions per core |
204
213
205
214
> [!NOTE]
206
215
> The minimum and recommended allocations are based on Docker limits, *not* the host machine resources.
@@ -209,13 +218,10 @@ sections:
209
218
***
210
219
211
220
- Each core must be 2.6 GHz or faster.
212
-
- For files, the throttling will be in the Speech SDK, at 2x. The first 5 seconds of audio aren't throttled.
213
-
- The decoder is capable of doing about 2-3x real-time. For this, the overall CPU usage will be close to two cores for a single recognition. That's why we don't recommend keeping more than two active connections, per container instance. The extreme side would be to put about 10 decoders at 2x real-time in an eight-core machine like `DS13_V2`. For the container version 1.3 and later, there's a param you could try setting `-e DECODER_MAX_COUNT=20`.
214
-
- For microphone, it is at 1x real-time. The overall usage should be at about one core for a single recognition.
215
221
216
-
Consider the total number of hours of audio you have. If the number is large, to improve reliability and availability, we suggest running more instances of containers, either on a single box or on multiple boxes, behind a load balancer. Orchestration could be done using Kubernetes (K8S) and Helm, or with Docker compose.
222
+
Consider the total number of hours of audio you have and the time in which you expect to complete the task. If the volume is large, we suggest running more container instances, either on a single machine or across multiple machines, behind a load balancer to improve reliability and availability. You can orchestrate them with Kubernetes (K8s) and Helm, or with Docker Compose.
217
223
218
-
As an example, to handle 1000 hours/24 hours, we have tried setting up 3-4 VMs, with 10 instances/decoders per VM.
224
+
As an example, we have transcribed 1,000 hours of audio within 24 hours using 4-5 VMs with 10 instances/decoders per VM.
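The example above can be sanity-checked with a little arithmetic. The following Python sketch computes how fast each decoder has to run, relative to real-time, to finish a batch on schedule. The function name is illustrative, and the calculation assumes the work is spread evenly across all decoders.

```python
def required_realtime_factor(audio_hours: float, wall_clock_hours: float,
                             vms: int, decoders_per_vm: int) -> float:
    """Speed, as a multiple of real-time, that each decoder must sustain."""
    aggregate_factor = audio_hours / wall_clock_hours  # whole-fleet speed-up needed
    return aggregate_factor / (vms * decoders_per_vm)


# 1000 hours of audio in 24 hours, with 10 decoders per VM.
for vms in (4, 5):
    factor = required_realtime_factor(1000, 24, vms, 10)
    print(f"{vms} VMs: each decoder must sustain about {factor:.2f}x real-time")
# 4 VMs: each decoder must sustain about 1.04x real-time
# 5 VMs: each decoder must sustain about 0.83x real-time
```

Both results sit below the 2-3x real-time that the decoder is described as capable of, which leaves headroom for uneven load or retries.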