
Commit ed263c4

Learn Editor: Update provisioned-throughput.md
1 parent 99d4009 commit ed263c4

File tree

1 file changed: +10 -10 lines changed


articles/ai-services/openai/concepts/provisioned-throughput.md

Lines changed: 10 additions & 10 deletions
@@ -164,16 +164,16 @@ For provisioned deployments, we use a variation of the leaky bucket algorithm to
1. When a request is made:

   a. When the current utilization is above 100%, the service returns a 429 code with the `retry-after-ms` header set to the time until utilization is below 100%
+
+   b. Otherwise, the service estimates the incremental change to utilization required to serve the request by combining the prompt tokens, less any cached tokens, and the specified `max_tokens` in the call. A customer can receive up to a 100% discount on their prompt tokens depending on the size of their cached tokens. If the `max_tokens` parameter is not specified, the service estimates a value. This estimation can lead to lower concurrency than expected when the number of actual generated tokens is small. For highest concurrency, ensure that the `max_tokens` value is as close as possible to the true generation size.
+
+1. When a request finishes, we now know the actual compute cost for the call. To ensure an accurate accounting, we correct the utilization using the following logic:

-   b. Otherwise, the service estimates the incremental change to utilization required to serve the request by combining prompt tokens and the specified `max_tokens` in the call. For requests that include at least 1024 cached tokens, the cached tokens are subtracted from the prompt token value. A customer can receive up to a 100% discount on their prompt tokens depending on the size of their cached tokens. If the `max_tokens` parameter is not specified, the service estimates a value. This estimation can lead to lower concurrency than expected when the number of actual generated tokens is small. For highest concurrency, ensure that the `max_tokens` value is as close as possible to the true generation size.
-
-1. When a request finishes, we now know the actual compute cost for the call. To ensure an accurate accounting, we correct the utilization using the following logic:
-
-   a. If the actual > estimated, then the difference is added to the deployment's utilization.
-
-   b. If the actual < estimated, then the difference is subtracted.
-
-1. The overall utilization is decremented down at a continuous rate based on the number of PTUs deployed.
+   a. If the actual > estimated, then the difference is added to the deployment's utilization.
+
+   b. If the actual < estimated, then the difference is subtracted.
+
+1. The overall utilization is decremented down at a continuous rate based on the number of PTUs deployed.

> [!NOTE]
> Calls are accepted until utilization reaches 100%. Bursts just over 100% may be permitted in short periods, but over time, your traffic is capped at 100% utilization.
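
To make the accounting in the hunk above easier to follow, here is a minimal sketch of the described leaky-bucket variant. It is an illustration only, not the service's implementation: the class name, the `tokens_per_minute_per_ptu` figure, and the token-to-utilization conversion are assumptions made for the example.

```python
import time


class LeakyBucketUtilization:
    """Illustrative sketch of the utilization accounting described above.

    Not the actual service implementation; the capacity figure is invented.
    """

    def __init__(self, ptus: int, tokens_per_minute_per_ptu: float = 2_500):
        # Assumption: utilization drains at a continuous rate proportional
        # to the number of PTUs deployed (the last step in the list above).
        self.capacity_per_minute = ptus * tokens_per_minute_per_ptu
        self.pending_tokens = 0.0
        self.last_update = time.monotonic()

    def _drain(self) -> None:
        # Continuous decrement of utilization over elapsed time.
        now = time.monotonic()
        elapsed_minutes = (now - self.last_update) / 60.0
        self.pending_tokens = max(
            0.0, self.pending_tokens - elapsed_minutes * self.capacity_per_minute
        )
        self.last_update = now

    def utilization(self) -> float:
        self._drain()
        return 100.0 * self.pending_tokens / self.capacity_per_minute

    def try_admit(self, prompt_tokens: int, cached_tokens: int, max_tokens: int):
        """Step 1: reject above 100% utilization, otherwise add an estimate."""
        if self.utilization() > 100.0:
            return None  # maps to a 429; caller should honor `retry-after-ms`
        # Step 1b: prompt tokens, less any cached tokens, plus the specified max_tokens.
        estimate = max(prompt_tokens - cached_tokens, 0) + max_tokens
        self.pending_tokens += estimate
        return estimate

    def settle(self, estimate: float, actual_tokens: float) -> None:
        """Step 2: correct the estimate once the actual compute cost is known."""
        # Adds the difference when actual > estimated, subtracts it otherwise.
        self.pending_tokens = max(0.0, self.pending_tokens + (actual_tokens - estimate))


# Example with made-up numbers: admit a call, then settle it with its real cost.
bucket = LeakyBucketUtilization(ptus=100)
estimate = bucket.try_admit(prompt_tokens=1_800, cached_tokens=1_024, max_tokens=500)
if estimate is not None:
    # The call generated far fewer tokens than max_tokens, so utilization is corrected down.
    bucket.settle(estimate, actual_tokens=896)
print(f"utilization ≈ {bucket.utilization():.1f}%")
```

In this model a request admitted at just under 100% utilization can still push the total slightly over 100%, which matches the note about short bursts being permitted.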
@@ -183,7 +183,7 @@ For provisioned deployments, we use a variation of the leaky bucket algorithm to

#### How many concurrent calls can I have on my deployment?

-The number of concurrent calls you can achieve depends on each call's shape (prompt size, max_token parameter, etc.). The service continues to accept calls until the utilization reach 100%. To determine the approximate number of concurrent calls, you can model out the maximum requests per minute for a particular call shape in the [capacity calculator](https://oai.azure.com/portal/calculator). If the system generates less than the number of samplings tokens like max_token, it will accept more requests.
+The number of concurrent calls you can achieve depends on each call's shape (prompt size, `max_tokens` parameter, etc.). The service continues to accept calls until the utilization reaches 100%. To determine the approximate number of concurrent calls, you can model out the maximum requests per minute for a particular call shape in the [capacity calculator](https://oai.azure.com/portal/calculator). If the system generates less than the number of output tokens set for the `max_tokens` parameter, then the provisioned deployment will accept more requests.

## What models and regions are available for provisioned throughput?
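
For the concurrency question above, a rough way to turn the calculator's requests-per-minute figure for a call shape into an approximate concurrency number is a Little's-law style estimate. The function and the figures below are illustrative assumptions only; real values come from the capacity calculator and your observed call latency.

```python
def approx_concurrent_calls(max_requests_per_minute: float, avg_call_seconds: float) -> float:
    """Little's-law style estimate: concurrency ≈ arrival rate × average call duration."""
    return max_requests_per_minute * (avg_call_seconds / 60.0)


# Assumed figures: 120 RPM from the capacity calculator and ~8 seconds per call.
print(approx_concurrent_calls(120, 8))  # ≈ 16 concurrent calls
```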
