`articles/ai-services/openai/concepts/provisioned-throughput.md`
An Azure OpenAI Deployment is a unit of management for a specific OpenAI Model.

## How much throughput per PTU you get for each model

The amount of throughput (tokens per minute, or TPM) a deployment gets per PTU is a function of the input and output tokens being processed.

Generating output tokens requires more processing than handling input tokens: the more tokens a deployment generates, the lower its overall TPM per PTU. Provisioned deployments dynamically balance the two, so users don't have to set specific input and output limits. This approach means the service is resilient to fluctuations in the workload shape.

To simplify the sizing effort, the following table outlines the TPM per PTU for the `gpt-4o` and `gpt-4o-mini` models:

| | **gpt-4o** | **gpt-4o-mini** |
|---|---|---|
| Latency target | > 25 tokens per second* | > 33 tokens per second* |

\* Calculated as the average of the per-call average generated tokens on a 1-minute basis over the month.

\** For a full list, see the [AOAI Studio calculator](https://oai.azure.com/portal/calculator).
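
As a rough illustration of the sizing arithmetic, the following sketch estimates a deployment size from a peak workload and a TPM-per-PTU rate. All numbers are placeholders rather than published rates: take the real per-model values from the table above or the calculator, and round up to the model's PTU increment.

```bash
# Hypothetical sizing sketch: every number here is a placeholder.
peak_tpm=500000        # measured peak workload, in tokens per minute
tpm_per_ptu=2500       # throughput per PTU for this workload shape (from the calculator)
increment=50           # deployment sizes must land on the model's PTU increment

# Round up to a whole PTU, then up to the next allowed increment.
ptus=$(( (peak_tpm + tpm_per_ptu - 1) / tpm_per_ptu ))
ptus=$(( (ptus + increment - 1) / increment * increment ))
echo "Estimated deployment size: $ptus PTUs"   # prints 200 for these inputs
```
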
## Key concepts
#### Provisioned throughput units

Provisioned throughput units (PTUs) are generic units of model processing capacity that you can use to size provisioned deployments to achieve the required throughput for processing prompts and generating completions. Provisioned throughput units are granted to a subscription as quota. Each quota is specific to a region and defines the maximum number of PTUs that can be assigned to deployments in that subscription and region.
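
A provisioned deployment can be created with the Azure CLI `az cognitiveservices account deployment create` command referenced under Key concepts. The following is a minimal sketch with placeholder resource group, account, deployment name, model version, and PTU count; the provisioned SKU is selected with `--sku-name` and the deployment size in PTUs with `--sku-capacity`.

```bash
# Sketch: placeholder names and values; adjust to your subscription and model.
az cognitiveservices account deployment create \
  --resource-group my-resource-group \
  --name my-aoai-resource \
  --deployment-name my-gpt-4o-ptu \
  --model-format OpenAI \
  --model-name gpt-4o \
  --model-version "2024-08-06" \
  --sku-name ProvisionedManaged \
  --sku-capacity 50
```

For global provisioned deployments, the corresponding SKU name is `GlobalProvisionedManaged`.
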
#### Model independent quota

Unlike the Tokens Per Minute (TPM) quota used by other Azure OpenAI offerings, PTUs are model-independent.

:::image type="content" source="../media/provisioned/model-independent-quota.png" alt-text="Diagram of model independent quota with one pool of PTUs available to multiple Azure OpenAI models." lightbox="../media/provisioned/model-independent-quota.png":::

For provisioned deployments, the new quota shows up in Azure OpenAI Studio as a quota item named **Provisioned Managed Throughput Unit**. For global provisioned managed deployments, the new quota shows up in Azure OpenAI Studio as a quota item named **Global Provisioned Managed Throughput Unit**. In the Studio Quota pane, expanding the quota item shows the deployments contributing to usage of each quota.

:::image type="content" source="../media/provisioned/ptu-quota-page.png" alt-text="Screenshot of quota UI for Azure OpenAI provisioned." lightbox="../media/provisioned/ptu-quota-page.png":::
#### Obtaining PTU Quota

PTU quota is available by default in many regions. If more quota is required, customers can request it via the **Request Quota** link to the right of the **Provisioned Managed Throughput Unit** or **Global Provisioned Managed Throughput Unit** quota tabs in Azure OpenAI Studio. The form allows the customer to request an increase in the specified PTU quota for a given region. The customer receives an email at the included address once the request is approved, typically within two business days.
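
As an optional sketch, assuming the `az cognitiveservices usage list` command is available in your Azure CLI version, regional usage and limits (including the provisioned quota items) can also be inspected from the command line:

```bash
# Sketch: lists current usage and limits for a region; filter the output
# for the provisioned (PTU) quota entries. The region is a placeholder.
az cognitiveservices usage list --location eastus --output table
```
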
#### Per-Model PTU Minimums

The minimum PTU deployment, increments, and processing capacity associated with each unit vary by model type and version.

## Capacity transparency

Azure OpenAI is a highly sought-after service where customer demand might exceed service GPU capacity. Microsoft strives to provide capacity for all in-demand regions and models, but selling out a region is always a possibility. This constraint can limit some customers' ability to create a deployment of their desired model, version, or number of PTUs in a desired region, even if they have quota available in that region. Generally speaking:

- Quota places a limit on the maximum number of PTUs that can be deployed in a subscription and region, and doesn't guarantee capacity availability.
- Capacity is allocated at deployment time and is held for as long as the deployment exists. If service capacity isn't available, the deployment fails.
- Customers can use real-time information on quota and capacity availability to choose an appropriate region for their scenario with the necessary model capacity.
- Scaling down or deleting a deployment releases capacity back to the region. There is no guarantee that the capacity will be available should the deployment be scaled up or re-created later.
#### Regional capacity guidance

To find the capacity needed for their deployments, customers can use the capacity API or the Studio deployment experience, both of which provide real-time information on capacity availability.

In Azure OpenAI Studio, the deployment experience identifies when a region lacks the capacity needed to deploy the desired model, version, and number of PTUs. If capacity is unavailable, the experience directs users to select an alternative region.

Details on the new deployment experience can be found in the Azure OpenAI [Provisioned get started guide](../how-to/provisioned-get-started.md).

The new [model capacities API](/rest/api/aiservices/accountmanagement/model-capacities/list?view=rest-aiservices-accountmanagement-2024-04-01-preview&tabs=HTTP&preserve-view=true) can be used to programmatically identify the maximum-sized deployment of a specified model. The API considers both your quota and the service capacity in the region.
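
For example, the API can be called through `az rest`. This is a sketch only: the path and query parameters are as described in the linked 2024-04-01-preview reference, and the subscription ID, model name, and model version are placeholders.

```bash
# Sketch: returns, per region, the largest deployment of the specified model and
# version that the subscription's quota and current service capacity would allow.
az rest --method get --url "https://management.azure.com/subscriptions/<subscription-id>/providers/Microsoft.CognitiveServices/modelCapacities?api-version=2024-04-01-preview&modelFormat=OpenAI&modelName=gpt-4o&modelVersion=2024-08-06"
```
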
If an acceptable region isn't available to support the desired model, version, or number of PTUs, customers can also try the following steps:

PTUs represent an amount of model processing capacity.

A few high-level considerations:
- Generations require more capacity than prompts
- For GPT-4o and later models, the TPM per PTU is set for input and output tokens separately. For older models, larger calls are progressively more expensive to compute. For example, 100 calls with a 1,000-token prompt size require less capacity than one call with 100,000 tokens in the prompt. This tiering means that the distribution of these call shapes is important in overall throughput. Traffic patterns with a wide distribution that includes some large calls might experience lower throughput per PTU than a narrower distribution with the same average prompt and completion token sizes.
### How utilization performance works
Provisioned and global provisioned deployments provide you with an allocated amount of model processing capacity to run a given model.

In Provisioned-Managed and Global Provisioned-Managed deployments, when capacity is exceeded, the API immediately returns a 429 HTTP status code. This fast response enables the user to make decisions on how to manage their traffic. Users can redirect requests to a separate deployment, to a standard pay-as-you-go instance, or use a retry strategy to manage a given request. The service continues to return the 429 HTTP status code until the utilization drops below 100%.
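
As an illustrative sketch (the endpoint, deployment name, and API version are placeholders to adjust for your resource), a client can wait for the interval suggested by the `retry-after-ms` header whenever a 429 is returned:

```bash
# Sketch: retries a chat completion on HTTP 429, sleeping for the duration the
# service suggests in retry-after-ms (falls back to 1 second if absent).
endpoint="https://<your-resource>.openai.azure.com"
deployment="<your-provisioned-deployment>"
body='{"messages":[{"role":"user","content":"Hello"}],"max_tokens":256}'

for attempt in 1 2 3 4 5; do
  status=$(curl -s -o response.json -D headers.txt -w '%{http_code}' \
    -H "api-key: $AZURE_OPENAI_API_KEY" -H "Content-Type: application/json" \
    -d "$body" \
    "$endpoint/openai/deployments/$deployment/chat/completions?api-version=2024-02-01")
  if [ "$status" != "429" ]; then
    cat response.json    # success or a non-throttling error: stop retrying
    break
  fi
  wait_ms=$(grep -i '^retry-after-ms:' headers.txt | tr -d '\r' | awk '{print $2}')
  sleep "$(awk -v ms="${wait_ms:-1000}" 'BEGIN { printf "%.3f", ms/1000 }')"
done
```
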
### How can I monitor capacity?

For Provisioned-Managed and Global Provisioned-Managed, we use a variation of the leaky bucket algorithm to maintain utilization below 100% while allowing some burstiness in the traffic:

a. When the current utilization is above 100%, the service returns a 429 code with the `retry-after-ms` header set to the time until utilization is below 100%.

b. Otherwise, the service estimates the incremental change to utilization required to serve the request by combining prompt tokens and the specified `max_tokens` in the call. If the `max_tokens` parameter isn't specified, the service estimates a value. This estimation can lead to lower concurrency than expected when the number of actual generated tokens is small. For highest concurrency, ensure that the `max_tokens` value is as close as possible to the true generation size.

3. When a request finishes, we now know the actual compute cost for the call. To ensure an accurate accounting, we correct the utilization using the following logic:
#### How many concurrent calls can I have on my deployment?

The number of concurrent calls you can achieve depends on each call's shape (prompt size, `max_tokens` parameter, and so on). The service continues to accept calls until the utilization reaches 100%. To determine the approximate number of concurrent calls, you can model out the maximum requests per minute for a particular call shape in the [capacity calculator](https://oai.azure.com/portal/calculator). If the system generates fewer tokens than the number of sampling tokens set by `max_tokens`, it accepts more requests.

## What models and regions are available for provisioned throughput?