
Commit 3db3abd

update list
1 parent 0561dcd commit 3db3abd

File tree

1 file changed (+8 -8 lines changed)


articles/ai-services/openai/concepts/provisioned-throughput.md

Lines changed: 8 additions & 8 deletions
@@ -161,20 +161,20 @@ In all provisioned deployment types, each request is evaluated individually acco

For provisioned deployments, we use a variation of the leaky bucket algorithm to maintain utilization below 100% while allowing some burstiness in the traffic. The high-level logic is as follows:

-1. Each customer has a set amount of capacity they can utilize on a deployment
+1. Each customer has a set amount of capacity they can utilize on a deployment
1. When a request is made:

-a. When the current utilization is above 100%, the service returns a 429 code with the `retry-after-ms` header set to the time until utilization is below 100%
+a. When the current utilization is above 100%, the service returns a 429 code with the `retry-after-ms` header set to the time until utilization is below 100%

-b. Otherwise, the service estimates the incremental change to utilization required to serve the request by combining prompt tokens and the specified `max_tokens` in the call. For requests that include at least 1024 cached tokens, the cached tokens are subtracted from the prompt token value. A customer can receive up to a 100% discount on their prompt tokens depending on the size of their cached tokens. If the `max_tokens` parameter is not specified, the service estimates a value. This estimation can lead to lower concurrency than expected when the number of actual generated tokens is small. For highest concurrency, ensure that the `max_tokens` value is as close as possible to the true generation size.
-
-3. When a request finishes, we now know the actual compute cost for the call. To ensure an accurate accounting, we correct the utilization using the following logic:
+b. Otherwise, the service estimates the incremental change to utilization required to serve the request by combining prompt tokens and the specified `max_tokens` in the call. For requests that include at least 1024 cached tokens, the cached tokens are subtracted from the prompt token value. A customer can receive up to a 100% discount on their prompt tokens depending on the size of their cached tokens. If the `max_tokens` parameter is not specified, the service estimates a value. This estimation can lead to lower concurrency than expected when the number of actual generated tokens is small. For highest concurrency, ensure that the `max_tokens` value is as close as possible to the true generation size.

-a. If the actual > estimated, then the difference is added to the deployment's utilization
+1. When a request finishes, we now know the actual compute cost for the call. To ensure an accurate accounting, we correct the utilization using the following logic:

-b. If the actual < estimated, then the difference is subtracted.
+a. If the actual > estimated, then the difference is added to the deployment's utilization.

-4. The overall utilization is decremented down at a continuous rate based on the number of PTUs deployed.
+b. If the actual < estimated, then the difference is subtracted.
+
+1. The overall utilization is decremented down at a continuous rate based on the number of PTUs deployed.

> [!NOTE]
> Calls are accepted until utilization reaches 100%. Bursts just over 100% may be permitted in short periods, but over time, your traffic is capped at 100% utilization.
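
For illustration, here is a minimal sketch of the leaky-bucket accounting the updated list describes. It is not the service's actual implementation: the `UtilizationBucket` class, its `try_admit`/`settle` methods, the tokens-per-minute capacity units, and the 300,000 figure are assumptions made for this example only.

```python
import time


class UtilizationBucket:
    """Toy model of the leaky-bucket utilization accounting described above."""

    def __init__(self, capacity_tokens_per_minute: float):
        # Step 1: each deployment has a fixed amount of capacity.
        # (Assumption: capacity expressed as tokens/minute, scaled by PTUs deployed.)
        self.capacity = capacity_tokens_per_minute
        self.used = 0.0  # tokens currently accounted against the bucket
        self.last_drain = time.monotonic()

    def _drain(self) -> None:
        # Step 4: utilization is decremented at a continuous rate.
        now = time.monotonic()
        leaked = self.capacity * (now - self.last_drain) / 60.0
        self.used = max(0.0, self.used - leaked)
        self.last_drain = now

    def try_admit(self, prompt_tokens: int, max_tokens: int, cached_tokens: int = 0):
        """Step 2: admit the call, or reject it with a retry hint."""
        self._drain()
        if self.used >= self.capacity:
            # Step 2a: over 100% -> reject (HTTP 429) and report how long until
            # the continuous drain brings utilization back under 100%.
            retry_after_ms = (self.used - self.capacity) / self.capacity * 60_000
            return None, retry_after_ms
        # Step 2b: estimate cost from prompt tokens plus max_tokens; cached tokens
        # (when at least 1024) are subtracted from the prompt token count.
        billable = prompt_tokens - cached_tokens if cached_tokens >= 1024 else prompt_tokens
        estimate = max(0, billable) + max_tokens
        self.used += estimate
        return estimate, 0.0

    def settle(self, estimate: float, actual: float) -> None:
        # Step 3: once the real cost is known, correct the estimate:
        # 3a adds the shortfall, 3b refunds the overestimate.
        self.used = max(0.0, self.used + (actual - estimate))


# Example: a deployment with 300,000 tokens/minute of capacity (illustrative number).
bucket = UtilizationBucket(capacity_tokens_per_minute=300_000)
estimate, retry_ms = bucket.try_admit(prompt_tokens=2_000, max_tokens=500, cached_tokens=1_024)
if estimate is None:
    print(f"429, retry after {retry_ms:.0f} ms")
else:
    # ...call completes; suppose it actually generated only 120 tokens...
    bucket.settle(estimate, actual=(2_000 - 1_024) + 120)
```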
