Commit bda2220

update

1 parent 373fa06 commit bda2220

File tree

1 file changed: +3 −3 lines changed

articles/ai-services/openai/concepts/provisioned-throughput.md

Lines changed: 3 additions & 3 deletions
@@ -86,7 +86,7 @@ A few high-level considerations:
 Provisioned deployments provide you with an allocated amount of model processing capacity to run a given model.
-In Provisioned-Managed deployments, when capacity is exceeded, the API will immediately return a 429 HTTP Status Error. This enables the user to make decisions on how to manage their traffic. Users can redirect requests to a separate deployment, to a standard Pay-As-You-Go instance, or leverage a retry strategy to manage a given request. The service will continue to return the 429 HTTP status code until the utilization drops below 100%.
+In Provisioned-Managed deployments, when capacity is exceeded, the API will immediately return a 429 HTTP Status Error. This enables the user to make decisions on how to manage their traffic. Users can redirect requests to a separate deployment, to a standard pay-as-you-go instance, or leverage a retry strategy to manage a given request. The service will continue to return the 429 HTTP status code until the utilization drops below 100%.
 ### How can I monitor capacity?

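The retry strategy described in the hunk above can be sketched as a small helper that honors the `retry-after-ms` / `retry-after` headers mentioned later in this article. This is a minimal illustration only: `send` is a hypothetical stand-in for any HTTP client call (its `(status, headers, body)` shape is an assumption, not part of the documented API).

```python
import time

def call_with_retry(send, max_retries=5):
    """Call send() and retry on HTTP 429, honoring retry-after headers.

    `send` is a hypothetical callable returning (status_code, headers, body);
    in practice it would wrap your actual HTTP client request.
    """
    for attempt in range(max_retries + 1):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_retries:
            break
        # Prefer the millisecond header when present, else fall back to seconds.
        if "retry-after-ms" in headers:
            delay = int(headers["retry-after-ms"]) / 1000.0
        else:
            delay = int(headers.get("retry-after", 1))
        time.sleep(delay)
    return status, body
```

Redirecting the failed request to a separate deployment or a pay-as-you-go instance would replace the `time.sleep` branch with a call against a different endpoint.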
@@ -101,7 +101,7 @@ The `retry-after-ms` and `retry-after` headers in the response tell you the time
 #### How does the service decide when to send a 429?
-In the Provisioned-Managed offering, each request is evaluated individually according to its prompt size, expected generation size, and model to determine its expected utilization. This is in contrast to Pay-As-You-Go deployments which have a [custom rate limiting behavior](../how-to/quota.md) based on the estimated traffic load. For Pay-As-You-Go deployments this can lead to HTTP 429's being generated prior to defined quota values being exceeded if traffic is not evenly distributed.
+In the Provisioned-Managed offering, each request is evaluated individually according to its prompt size, expected generation size, and model to determine its expected utilization. This is in contrast to pay-as-you-go deployments, which have a [custom rate limiting behavior](../how-to/quota.md) based on the estimated traffic load. For pay-as-you-go deployments this can lead to HTTP 429s being generated prior to defined quota values being exceeded if traffic is not evenly distributed.
 For Provisioned-Managed, we use a variation of the leaky bucket algorithm to maintain utilization below 100% while allowing some burstiness in the traffic. The high-level logic is as follows:
 1. Each customer has a set amount of capacity they can utilize on a deployment
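The leaky-bucket behavior described in the hunk above can be illustrated with a toy admission check. This is a sketch of the general algorithm only, not the service's actual cost accounting: `capacity`, `drain_rate`, and the per-request `cost` are assumed placeholder units.

```python
import time

class LeakyBucket:
    """Toy leaky-bucket admission control (illustrative, not the real service logic).

    Fill drains at `drain_rate` units/second; a request is admitted only
    while the current fill is below `capacity` (i.e. utilization < 100%).
    Because an admitted request may push fill past capacity, short bursts
    above 100% are tolerated, matching the "some burstiness" described above.
    """
    def __init__(self, capacity, drain_rate):
        self.capacity = capacity
        self.drain_rate = drain_rate
        self.fill = 0.0
        self.last = time.monotonic()

    def _drain(self):
        # Leak fill proportionally to elapsed time since the last check.
        now = time.monotonic()
        self.fill = max(0.0, self.fill - (now - self.last) * self.drain_rate)
        self.last = now

    def try_admit(self, cost):
        """Return True if the request is admitted, False to signal a 429."""
        self._drain()
        if self.fill >= self.capacity:  # at or above 100% utilization -> 429
            return False
        self.fill += cost
        return True
```

Once `fill` is at or above capacity, every call returns `False` (the 429 case) until enough time has passed for the bucket to drain back below 100%.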
@@ -126,7 +126,7 @@ For Provisioned-Managed, we use a variation of the leaky bucket algorithm to maintain utilization below 100%
 #### How many concurrent calls can I have on my deployment?
-The number of concurrent calls you can achieve depends on each call's shape (prompt size, max_token parameter, etc). The service will continue to accept calls until the utilization reach 100%. To determine the approximate number of concurrent calls you can model out the maximum requests per minute for a particular call shape in the [capacity calculator](https://oai.azure.com/portal/calculator). If the system generates less than the number of samplings tokens like max_token, it will accept more requests.
+The number of concurrent calls you can achieve depends on each call's shape (prompt size, max_token parameter, etc.). The service will continue to accept calls until the utilization reaches 100%. To determine the approximate number of concurrent calls, you can model out the maximum requests per minute for a particular call shape in the [capacity calculator](https://oai.azure.com/portal/calculator). If the system generates fewer sampling tokens than requested (for example, fewer than max_token), it will accept more requests.
 ## Next steps
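The concurrency question in the final hunk can be approximated with Little's law (steady-state concurrency ≈ arrival rate × average latency). A back-of-the-envelope sketch, where `max_rpm` is a value you would read from the capacity calculator for your call shape and `avg_latency_seconds` is an assumed observed per-call latency, neither taken from this article:

```python
def estimate_concurrency(max_rpm, avg_latency_seconds):
    """Rough steady-state concurrency estimate via Little's law: L = lambda * W.

    max_rpm             -- assumed throughput for a call shape (requests/minute)
    avg_latency_seconds -- assumed average end-to-end latency per call
    """
    requests_per_second = max_rpm / 60.0
    return requests_per_second * avg_latency_seconds
```

For example, 120 requests per minute at 5 seconds of latency per call sustains roughly 10 in-flight requests; shorter generations (fewer sampling tokens than `max_token`) raise the achievable request rate and hence the concurrency.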
