
Commit 867f1f6

Update container FAQ concurrency
1 parent 13fb5dd commit 867f1f6


articles/ai-services/speech-service/speech-container-faq.yml

Lines changed: 19 additions & 27 deletions
@@ -150,57 +150,48 @@ sections:
 
 
  - question: |
-What are the recommended resources, CPU and RAM; for 50 concurrent requests?
+What are the recommended CPU and RAM resources?
  answer: |
-How many concurrent requests will a 4-core, 8-GB RAM handle? If we have to serve for example, 50 concurrent requests, how many Core and RAM is recommended?
-
-At real-time, 8 with our latest `en-US`, so we recommend using more docker containers beyond six concurrent requests. Beyond 16 cores it becomes nonuniform memory access (NUMA) node sensitive. The following table describes the minimum and recommended allocation of resources for each Speech container.
+
+The following table describes the minimum and recommended allocation of resources for each Speech container.
 
 # [Speech-to-text](#tab/stt)
-
-The following table describes the minimum and recommended allocation of resources
 
-| Container | Minimum | Recommended | Speech Model |
-|-----------|---------|-------------|--------------|
-| Speech to text | 4 core, 4-GB memory | 8 core, 8-GB memory | +4 to 8 GB memory |
+| Container | Minimum | Recommended | Speech Model | Concurrency Limit |
+|-----------|---------|-------------|--------------|-------------------|
+| Speech to text | 4 core, 4-GB memory | 8 core, 8-GB memory | +4 to 8 GB memory | 2 sessions per core |
 
 > [!NOTE]
 > The minimum and recommended allocations are based on Docker limits, *not* the host machine resources.
 > For example, speech to text containers memory map portions of a large language model. We recommend that the entire file should fit in memory. You need to add an extra 4 to 8 GB to load the speech models (see above table).
 > Also, the first run of either container might take longer because models are being paged into memory.
 
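For reference, the Docker limits in the preceding note translate directly into `docker run` flags. A minimal sketch, assuming the standard speech to text container image from the Speech container quickstart and placeholder endpoint, key, and tag values:

```bash
# Sketch only: one speech to text container at the recommended Docker limits
# (8 cores, 8 GB for the runtime plus ~8 GB headroom for the memory-mapped speech model).
docker run --rm -it \
  --cpus 8 --memory 16g \
  -p 5000:5000 \
  mcr.microsoft.com/azure-cognitive-services/speechservices/speech-to-text \
  Eula=accept \
  Billing={ENDPOINT_URI} \
  ApiKey={API_KEY}
```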
 # [Custom speech-to-text](#tab/cstt)
-
-The following table describes the minimum and recommended allocation of resources
 
-| Container | Minimum | Recommended | Speech Model |
-|-----------|---------|-------------|--------------|
-| Custom speech to text | 4 core, 4-GB memory | 8 core, 8-GB memory | +4 to 8 GB memory |
+| Container | Minimum | Recommended | Speech Model | Concurrency Limit |
+|-----------|---------|-------------|--------------|-------------------|
+| Custom speech to text | 4 core, 4-GB memory | 8 core, 8-GB memory | +4 to 8 GB memory | 2 sessions per core |
 
 > [!NOTE]
 > The minimum and recommended allocations are based on Docker limits, *not* the host machine resources.
 > For example, speech to text containers memory map portions of a large language model. We recommend that the entire file should fit in memory. You need to add an extra 4 to 8 GB to load the speech models (see above table).
 > Also, the first run of either container might take longer because models are being paged into memory.
 
 # [Neural Text-to-speech](#tab/tts)
-
-The following table describes the minimum and recommended allocation of resources
 
-| Container | Minimum | Recommended |
-|-----------|---------|-------------|
-| Neural text to speech | 6 core, 12-GB memory | 8 core, 16-GB memory |
+| Container | Minimum | Recommended | Concurrency Limit |
+|-----------|---------|-------------|-------------------|
+| Neural text to speech | 6 core, 12-GB memory | 8 core, 16-GB memory | 5 requests per 8 core CPU |
 
 > [!NOTE]
 > The minimum and recommended allocations are based on Docker limits, *not* the host machine resources.
 > Also, the first run of either container might take longer because models are being paged into memory.
 
 # [Speech language identification](#tab/lid)
-
-The following table describes the minimum and recommended allocation of resources
 
-| Container | Minimum | Recommended |
-|-----------|---------|-------------|
-| Speech language identification | 1 core, 1-GB memory | 1 core, 1-GB memory |
+| Container | Minimum | Recommended | Concurrency Limit |
+|-----------|---------|-------------|-------------------|
+| Speech language identification | 1 core, 1-GB memory | 1 core, 1-GB memory | 2 sessions per core |
 
 > [!NOTE]
 > The minimum and recommended allocations are based on Docker limits, *not* the host machine resources.
@@ -210,12 +201,13 @@ sections:
 
 - Each core must be at least 2.6 GHz or faster.
 - For files, the throttling will be in the Speech SDK, at 2x. The first 5 seconds of audio aren't throttled.
-- The decoder is capable of doing about 2-3x real-time. For this, the overall CPU usage will be close to two cores for a single recognition. That's why we don't recommend keeping more than two active connections, per container instance. The extreme side would be to put about 10 decoders at 2x real-time in an eight-core machine like `DS13_V2`. For the container version 1.3 and later, there's a param you could try setting `-e DECODER_MAX_COUNT=20`.
-- For microphone, it is at 1x real-time. The overall usage should be at about one core for a single recognition.
+- For speech to text, the decoder is capable of doing about 2-3x real-time. That's why we don't recommend keeping more than two active connections per core. The extreme side would be to put about 10 decoders at 2x real-time in an eight-core machine like `DS13_V2`. For container version 1.3 and later, there's a parameter you can try: `-e DECODER_MAX_COUNT=20` (see the sketch after this list).
+- For microphone, speech to text happens at 1x real-time. The overall usage should be at about one core for a single recognition.
+- Beyond 16 cores, the system becomes nonuniform memory access (NUMA) node sensitive.
 
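The `DECODER_MAX_COUNT` value mentioned in the list above is passed as a container environment variable. A minimal sketch, assuming a 16-core host, the standard speech to text image, and placeholder billing values:

```bash
# Sketch only: raise the concurrent decoder cap on a large host (container version 1.3 and later).
# Image name and Billing/ApiKey arguments follow the Speech container quickstart; values are placeholders.
docker run --rm -it \
  --cpus 16 --memory 32g \
  -p 5000:5000 \
  -e DECODER_MAX_COUNT=20 \
  mcr.microsoft.com/azure-cognitive-services/speechservices/speech-to-text \
  Eula=accept \
  Billing={ENDPOINT_URI} \
  ApiKey={API_KEY}
```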
-Consider the total number of hours of audio you have. If the number is large, to improve reliability and availability, we suggest running more instances of containers, either on a single box or on multiple boxes, behind a load balancer. Orchestration could be done using Kubernetes (K8S) and Helm, or with Docker compose.
+Consider the total number of hours of audio you have and your expected time to complete the task. If the number is large, to improve reliability and availability, we suggest running more instances of containers, either on a single box or on multiple boxes, behind a load balancer. Orchestration could be done with Kubernetes (K8s) and Helm, or with Docker Compose.
 
-As an example, to handle 1000 hours/24 hours, we have tried setting up 3-4 VMs, with 10 instances/decoders per VM.
+As an example, we have performed speech to text on 1000 hours of audio within 24 hours with 4-5 VMs and 10 instances/decoders per VM.
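To make that scale-out concrete, here's a hedged sketch of running several decoder instances on one box, each on its own host port; the loop count, names, ports, and billing values are illustrative placeholders, and a load balancer in front is assumed but not shown:

```bash
# Sketch only: start four speech to text instances on one VM, each at the recommended limits,
# published on ports 5001-5004 so a load balancer can distribute recognitions across them.
for i in 1 2 3 4; do
  docker run -d --name "stt-${i}" \
    --cpus 8 --memory 16g \
    -p "$((5000 + i)):5000" \
    mcr.microsoft.com/azure-cognitive-services/speechservices/speech-to-text \
    Eula=accept \
    Billing={ENDPOINT_URI} \
    ApiKey={API_KEY}
done
```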
 
  - question: |
 Does the Speech container support punctuation?
