
Commit f879b53

Updated based on Yanchang's changes
1 parent 7d2cd28 commit f879b53


articles/ai-services/speech-service/speech-container-faq.yml

Lines changed: 14 additions & 6 deletions
@@ -164,6 +164,11 @@ sections:
 > The minimum and recommended allocations are based on Docker limits, *not* the host machine resources.
 > For example, speech to text containers memory map portions of a large language model. We recommend that the entire file should fit in memory. You need to add an extra 4 to 8 GB to load the speech models (see above table).
 > Also, the first run of either container might take longer because models are being paged into memory.
+
+- For files, the throttling will be in the Speech SDK, at 2x. The first 5 seconds of audio aren't throttled.
+- For speech to text, the decoder is capable of doing about 2-3x real-time. That's why we don't recommend keeping more than two active connections per core. The extreme side would be to put about 10 decoders at 2x real-time in an eight-core machine like `DS13_V2`. For the container version 1.3 and later, there's a param you could try setting `-e DECODER_MAX_COUNT=20`.
+- For microphone, speech to text happens at 1x real-time. The overall usage should be at about one core for a single recognition.
+- Beyond 16 cores the system becomes nonuniform memory access (NUMA) node sensitive.
 
 # [Custom speech-to-text](#tab/cstt)
 
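The `-e DECODER_MAX_COUNT=20` parameter added in this hunk is an ordinary container environment variable. A minimal sketch of how it could be passed on `docker run`, assuming the standard speech-to-text container image and the usual `Eula`/`Billing`/`ApiKey` arguments; the resource limits, endpoint, and key below are placeholders, not values from this commit:

```bash
# Sketch: run the speech-to-text container with a raised decoder limit.
# Image name, memory/CPU limits, endpoint, and key are illustrative placeholders.
docker run --rm -it -p 5000:5000 --memory 16g --cpus 8 \
  -e DECODER_MAX_COUNT=20 \
  mcr.microsoft.com/azure-cognitive-services/speechservices/speech-to-text \
  Eula=accept \
  Billing=https://<your-region>.api.cognitive.microsoft.com/ \
  ApiKey=<your-speech-resource-key>
```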
@@ -175,16 +180,23 @@
 > The minimum and recommended allocations are based on Docker limits, *not* the host machine resources.
 > For example, speech to text containers memory map portions of a large language model. We recommend that the entire file should fit in memory. You need to add an extra 4 to 8 GB to load the speech models (see above table).
 > Also, the first run of either container might take longer because models are being paged into memory.
+
+- For files, the throttling will be in the Speech SDK, at 2x. The first 5 seconds of audio aren't throttled.
+- For speech to text, the decoder is capable of doing about 2-3x real-time. That's why we don't recommend keeping more than two active connections per core. The extreme side would be to put about 10 decoders at 2x real-time in an eight-core machine like `DS13_V2`. For the container version 1.3 and later, there's a param you could try setting `-e DECODER_MAX_COUNT=20`.
+- For microphone, speech to text happens at 1x real-time. The overall usage should be at about one core for a single recognition.
+- Beyond 16 cores the system becomes nonuniform memory access (NUMA) node sensitive.
 
 # [Neural Text-to-speech](#tab/tts)
 
 | Container | Minimum | Recommended | Concurrency Limit |
-|-----------|---------|-------------|-------------------| 5 requests per 8 core CPU |
-| Neural text to speech | 6 core, 12-GB memory | 8 core, 16-GB memory |
+|-----------|---------|-------------|-------------------|
+| Neural text to speech | 6 core, 12-GB memory | 8 core, 16-GB memory \n 8 core, 24-GB memory (multilingual voices) | 5 requests per 8 core CPU |
 
 > [!NOTE]
 > The minimum and recommended allocations are based on Docker limits, *not* the host machine resources.
 > Also, the first run of either container might take longer because models are being paged into memory.
+
+- Real-time performance varies depending on concurrency. With a concurrency of 1, an NTTS container instance can achieve 10x real-time performance. However, when concurrency increases to 5, real-time performance drops to 3x or lower. We recommended sending less than 5 concurrent requests in one container. Start more containers for increased concurrency.
 
 # [Speech language identification](#tab/lid)
 
@@ -199,10 +211,6 @@
 ***
 
 - Each core must be at least 2.6 GHz or faster.
-- For files, the throttling will be in the Speech SDK, at 2x. The first 5 seconds of audio aren't throttled.
-- For speech to text, the decoder is capable of doing about 2-3x real-time. That's why we don't recommend keeping more than two active connections per core. The extreme side would be to put about 10 decoders at 2x real-time in an eight-core machine like `DS13_V2`. For the container version 1.3 and later, there's a param you could try setting `-e DECODER_MAX_COUNT=20`.
-- For microphone, speech to text happens at 1x real-time. The overall usage should be at about one core for a single recognition.
-- Beyond 16 cores the system becomes nonuniform memory access (NUMA) node sensitive.
 
 Consider the total number of hours of audio you have, and your expected time to complete the task. If the number is large, to improve reliability and availability, we suggest running more instances of containers, either on a single box or on multiple boxes, behind a load balancer. Orchestration could be done using Kubernetes (K8S) and Helm, or with Docker compose.
 
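The closing paragraph of the diff suggests scaling out with more container instances behind a load balancer. A minimal sketch of the Docker CLI side of that idea, assuming two speech-to-text instances on a single host that an external load balancer (not shown) would front; container names, ports, resource limits, and billing values are illustrative, not part of this commit:

```bash
# Sketch: two speech-to-text container instances on one host, on ports 5001 and 5002.
# A load balancer (e.g., nginx, HAProxy, or a Kubernetes Service) would distribute requests across them.
for PORT in 5001 5002; do
  docker run -d --name stt-$PORT -p $PORT:5000 --memory 8g --cpus 4 \
    mcr.microsoft.com/azure-cognitive-services/speechservices/speech-to-text \
    Eula=accept \
    Billing=https://<your-region>.api.cognitive.microsoft.com/ \
    ApiKey=<your-speech-resource-key>
done
```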
