articles/ai-services/speech-service/speech-container-faq.yml
14 additions & 6 deletions
@@ -164,6 +164,11 @@ sections:
> The minimum and recommended allocations are based on Docker limits, *not* the host machine resources.
> For example, speech to text containers memory map portions of a large language model. We recommend that the entire file fit in memory. You need to add an extra 4 to 8 GB to load the speech models (see the preceding table).
> Also, the first run of either container might take longer because models are being paged into memory.
- For files, throttling is applied in the Speech SDK, at 2x real-time. The first 5 seconds of audio aren't throttled.
- For speech to text, the decoder can run at about 2-3x real-time. That's why we recommend keeping no more than two active connections per core. At the extreme, you could run about 10 decoders at 2x real-time on an eight-core machine such as `DS13_V2`. For container version 1.3 and later, you can tune this with the `-e DECODER_MAX_COUNT=20` parameter.
- For microphone input, speech to text happens at 1x real-time. Overall usage should be about one core for a single recognition.
- Beyond 16 cores, the system becomes nonuniform memory access (NUMA) node sensitive.
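As a rough, illustrative sketch (not official sizing guidance), the rules of thumb above — no more than two active connections per core, and about 10 decoders at 2x real-time on an eight-core machine — could be turned into a small planning helper. The linear scaling of `DECODER_MAX_COUNT` with core count is an assumption for illustration:

```python
def recommended_limits(cores: int) -> dict:
    """Hedged capacity estimates for a speech-to-text container host."""
    # Rule of thumb: no more than two active connections per core.
    max_connections = cores * 2
    # At ~2x real-time, an eight-core machine (e.g. DS13_V2) tops out
    # around 10 decoders; scale that ratio linearly as a rough guess.
    decoder_max_count = round(cores * 10 / 8)
    return {"max_connections": max_connections,
            "decoder_max_count": decoder_max_count}

print(recommended_limits(8))   # {'max_connections': 16, 'decoder_max_count': 10}
```

For an eight-core host this matches the guidance above; for larger hosts, remember the NUMA sensitivity beyond 16 cores before scaling a single container further.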
# [Custom speech-to-text](#tab/cstt)
@@ -175,16 +180,23 @@ sections:
> The minimum and recommended allocations are based on Docker limits, *not* the host machine resources.
> For example, speech to text containers memory map portions of a large language model. We recommend that the entire file fit in memory. You need to add an extra 4 to 8 GB to load the speech models (see the preceding table).
> Also, the first run of either container might take longer because models are being paged into memory.
- For files, throttling is applied in the Speech SDK, at 2x real-time. The first 5 seconds of audio aren't throttled.
- For speech to text, the decoder can run at about 2-3x real-time. That's why we recommend keeping no more than two active connections per core. At the extreme, you could run about 10 decoders at 2x real-time on an eight-core machine such as `DS13_V2`. For container version 1.3 and later, you can tune this with the `-e DECODER_MAX_COUNT=20` parameter.
- For microphone input, speech to text happens at 1x real-time. Overall usage should be about one core for a single recognition.
- Beyond 16 cores, the system becomes nonuniform memory access (NUMA) node sensitive.
| Neural text to speech | 6 core, 12-GB memory | 8 core, 16-GB memory <br> 8 core, 24-GB memory (multilingual voices) | 5 requests per 8 core CPU |
> [!NOTE]
> The minimum and recommended allocations are based on Docker limits, *not* the host machine resources.
> Also, the first run of either container might take longer because models are being paged into memory.
- Real-time performance varies with concurrency. At a concurrency of 1, an NTTS container instance can achieve 10x real-time performance; when concurrency increases to 5, real-time performance drops to 3x or lower. We recommend sending fewer than 5 concurrent requests to one container. Start more containers for increased concurrency.
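The "fewer than 5 concurrent requests per container" guidance above can be sketched as a simple planning calculation. This is an illustrative helper, not part of the product; the default of 4 requests per container is an assumption chosen to stay under the stated limit of 5:

```python
import math

def containers_needed(concurrent_requests: int, max_per_container: int = 4) -> int:
    """Estimate how many NTTS container instances to start, keeping each
    container below 5 concurrent requests (per the guidance above)."""
    return math.ceil(concurrent_requests / max_per_container)

print(containers_needed(10))  # 3 containers for 10 concurrent requests
```

Spreading requests this way keeps each instance closer to its 10x real-time best case instead of degrading toward 3x at high concurrency.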
# [Speech language identification](#tab/lid)
@@ -199,10 +211,6 @@ sections:
***
- Each core must be at least 2.6 GHz or faster.
Consider the total number of hours of audio you have and your expected time to complete the task. If the number is large, we suggest running more container instances, either on a single machine or on multiple machines behind a load balancer, to improve reliability and availability. Orchestration can be done with Kubernetes (K8s) and Helm, or with Docker Compose.
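As a back-of-the-envelope sketch (assumptions, not measured figures), the sizing question above — total audio hours versus deadline — comes down to dividing the workload by per-instance throughput. The 2x real-time factor is borrowed from the speech-to-text decoder notes earlier in this section:

```python
import math

def instances_needed(audio_hours: float, deadline_hours: float,
                     realtime_factor: float = 2.0) -> int:
    """Estimate container instances for a batch transcription job,
    assuming each instance processes audio at ~2x real-time."""
    # Each instance can chew through this many audio-hours by the deadline.
    throughput_per_instance = deadline_hours * realtime_factor
    return math.ceil(audio_hours / throughput_per_instance)

print(instances_needed(audio_hours=1000, deadline_hours=24))  # 21 instances
```

The result is only a starting point: add headroom for failures and restarts, and validate the real-time factor against your own hardware before committing to an instance count.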