## How much throughput per PTU you get for each model
The amount of throughput (tokens per minute, or TPM) a deployment gets per PTU is a function of the input and output tokens in a given minute. Generating output tokens requires more processing than input tokens, so the more output tokens generated, the lower your overall TPM. The service dynamically balances the input and output costs, so users don't have to set specific input and output limits. This approach means your deployment is resilient to fluctuations in the workload shape.
To simplify the sizing effort, the following table outlines the TPM per PTU for the `gpt-4o` and `gpt-4o-mini` models.
\** For a full list, see the [AOAI Studio calculator](https://oai.azure.com/portal/calculator).
> [!NOTE]
> The provisioned version of `gpt-4` **Version:** `turbo-2024-04-09` is currently limited to text only.
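As a rough sketch of the sizing math only, the per-PTU rates below are hypothetical placeholders (real per-model values come from the calculator above); the idea is to compare your peak input and output token rates against the rates one PTU provides:

```bash
# Hypothetical per-PTU rates for illustration only -- get real values from the AOAI calculator.
input_tpm_per_ptu=2500      # assumed input tokens per minute provided by one PTU
output_tpm_per_ptu=800      # assumed output tokens per minute provided by one PTU

# Peak workload, measured from your own traffic.
peak_input_tpm=500000
peak_output_tpm=60000

# PTUs required to cover each side of the workload, rounded up.
input_ptus=$(( (peak_input_tpm + input_tpm_per_ptu - 1) / input_tpm_per_ptu ))
output_ptus=$(( (peak_output_tpm + output_tpm_per_ptu - 1) / output_tpm_per_ptu ))

echo "Input-driven estimate:  ${input_ptus} PTUs"
echo "Output-driven estimate: ${output_ptus} PTUs"
```

The calculator also accounts for call shape and per-model minimum deployment sizes, so treat a back-of-the-envelope estimate like this only as a starting point.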
## Key concepts
#### Provisioned throughput units
Provisioned throughput units (PTU) are generic units of model processing capacity that you can use to size provisioned deployments to achieve the required throughput for processing prompts and generating completions. Provisioned throughput units are granted to a subscription as quota. Each quota is specific to a region and defines the maximum number of PTUs that can be assigned to deployments in that subscription and region.
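For example, once PTU quota is available in the target region, a provisioned deployment can be created with the Azure CLI. This is a sketch: the resource names, model version, and PTU count are placeholders, and `GlobalProvisionedManaged` is used instead of `ProvisionedManaged` for global provisioned deployments.

```azurecli
az cognitiveservices account deployment create \
  --resource-group <resource-group> \
  --name <azure-openai-resource> \
  --deployment-name gpt-4o-ptu \
  --model-name gpt-4o \
  --model-version "2024-08-06" \
  --model-format OpenAI \
  --sku-name ProvisionedManaged \
  --sku-capacity 100
```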
#### Model independent quota
Unlike the Tokens Per Minute (TPM) quota used by other Azure OpenAI offerings, PTUs are model-independent. A single pool of PTU quota can be used to deploy any supported model, as shown in the following diagram.
:::image type="content" source="../media/provisioned/model-independent-quota.png" alt-text="Diagram of model independent quota with one pool of PTUs available to multiple Azure OpenAI models." lightbox="../media/provisioned/model-independent-quota.png":::
For provisioned deployments, the new quota shows up in Azure OpenAI Studio as a quota item named **Provisioned Managed Throughput Unit**. For global provisioned managed deployments, the new quota shows up in the Azure OpenAI Studio as a quota item named **Global Provisioned Managed Throughput Unit**. In the Studio Quota pane, expanding the quota item shows the deployments contributing to usage of each quota.
:::image type="content" source="../media/provisioned/ptu-quota-page.png" alt-text="Screenshot of quota UI for Azure OpenAI provisioned." lightbox="../media/provisioned/ptu-quota-page.png":::
#### Obtaining PTU Quota
PTU quota is available by default in many regions. If more quota is required, customers can request quota via the Request Quota link. This link can be found to the right of the Provisioned Managed Throughput Unit or Global Provisioned Managed Throughput Unit quota tabs in the Azure OpenAI Studio. The form allows the customer to request an increase in the specified PTU quota for a given region. The customer receives an email at the included address once the request is approved, typically within two business days.
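Current quota and usage values for a region can also be listed from the command line; this is a sketch, and the exact quota names shown in the output may differ by offering and region:

```azurecli
# List quota usage (including provisioned throughput units) for a region.
az cognitiveservices usage list --location eastus --output table
```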
#### Per-Model PTU Minimums
The minimum PTU deployment, increments, and processing capacity associated with each unit vary by model type and version.
## Capacity transparency
Azure OpenAI is a highly sought-after service where customer demand might exceed service GPU capacity. Microsoft strives to provide capacity for all in-demand regions and models, but selling out a region is always a possibility. This constraint can limit some customers’ ability to create a deployment of their desired model, version, or number of PTUs in a desired region - even if they have quota available in that region. Generally speaking:
- Quota places a limit on the maximum number of PTUs that can be deployed in a subscription and region, and is not a guarantee of capacity availability.
- Capacity is allocated at deployment time and is held for as long as the deployment exists. If service capacity is not available, the deployment will fail.
- Customers use real-time information on quota and capacity availability to choose an appropriate region for their scenario with the necessary model capacity.
- Scaling down or deleting a deployment releases capacity back to the region. There is no guarantee that the capacity will be available should the deployment be scaled up or re-created later.
#### Regional capacity guidance
To find the capacity needed for their deployments, customers can use the capacity API or the Studio deployment experience, which provide real-time information on capacity availability.
In Azure OpenAI Studio, the deployment experience identifies when a region lacks the capacity needed to deploy the desired model, version, and number of PTUs. If capacity is unavailable, the experience directs users to select an alternative region.
Details on the new deployment experience can be found in the Azure OpenAI [Provisioned get started guide](../how-to/provisioned-get-started.md).
The new [model capacities API](/rest/api/aiservices/accountmanagement/model-capacities/list?view=rest-aiservices-accountmanagement-2024-04-01-preview&tabs=HTTP&preserve-view=true) can be used to programmatically identify the maximum-sized deployment of a specified model that can be created in each region. The API considers both your quota in the subscription and service capacity in the region.
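For example, the API can be called with `az rest`; the subscription ID, model name, and model version below are placeholders, and the query parameters follow the REST reference linked above:

```azurecli
az rest --method get \
  --url "https://management.azure.com/subscriptions/<subscription-id>/providers/Microsoft.CognitiveServices/modelCapacities?api-version=2024-04-01-preview&modelFormat=OpenAI&modelName=gpt-4o&modelVersion=2024-08-06"
```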
If an acceptable region isn't available to support the desired model, version, and/or PTUs, customers can also try the following steps:
PTUs represent an amount of model processing capacity. Similar to your computer or databases, different workloads or requests to the model consume different amounts of underlying processing capacity.
A few high-level considerations:
- Generations require more capacity than prompts
- For GPT-4o and later models, the TPM per PTU is set for input and output tokens separately. For older models, larger calls are progressively more expensive to compute. For example, 100 calls with a 1,000 token prompt size require less capacity than one call with 100,000 tokens in the prompt. This tiering means that the distribution of these call shapes is important in overall throughput. Traffic patterns with a wide distribution that includes some large calls might experience lower throughput per PTU than a narrower distribution with the same average prompt and completion token sizes.
### How utilization performance works
Provisioned and global provisioned deployments provide you with an allocated amount of model processing capacity to run a given model.
In Provisioned-Managed and Global Provisioned-Managed deployments, when capacity is exceeded, the API returns a 429 HTTP status code. This fast response enables the user to make decisions on how to manage their traffic. Users can redirect requests to a separate deployment, to a standard pay-as-you-go instance, or use a retry strategy to manage a given request. The service continues to return the 429 HTTP status code until the utilization drops below 100%.
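A minimal sketch of one retry approach is shown below; the endpoint, deployment name, and API version are placeholders, and the client simply honors the `retry-after-ms` header before resending the request:

```bash
# Placeholder endpoint and deployment; retries on 429 using the retry-after-ms header.
url="https://<resource>.openai.azure.com/openai/deployments/<deployment>/chat/completions?api-version=2024-06-01"
body='{"messages":[{"role":"user","content":"Hello"}],"max_tokens":128}'

while true; do
  status=$(curl -s -o response.json -D headers.txt -w "%{http_code}" \
    -H "api-key: $AZURE_OPENAI_API_KEY" -H "Content-Type: application/json" \
    -d "$body" "$url")
  [ "$status" != "429" ] && break
  # Wait the suggested time (default 1000 ms if the header is missing), then retry.
  wait_ms=$(grep -i '^retry-after-ms:' headers.txt | tr -d '\r' | awk '{print $2}')
  sleep "$(awk -v ms="${wait_ms:-1000}" 'BEGIN { printf "%.3f", ms/1000 }')"
done
cat response.json
```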
### How can I monitor capacity?
For Provisioned-Managed and Global Provisioned-Managed, we use a variation of the leaky bucket algorithm to maintain utilization below 100% while allowing some burstiness in the traffic.
a. When the current utilization is above 100%, the service returns a 429 code with the `retry-after-ms` header set to the time until utilization is below 100%
b. Otherwise, the service estimates the incremental change to utilization required to serve the request by combining prompt tokens and the specified `max_tokens` in the call. If the `max_tokens` parameter is not specified, the service estimates a value. This estimation can lead to lower concurrency than expected when the number of actual generated tokens is small. For highest concurrency, ensure that the `max_tokens` value is as close as possible to the true generation size.
3. When a request finishes, we now know the actual compute cost for the call. To ensure an accurate accounting, we correct the utilization using the following logic:
#### How many concurrent calls can I have on my deployment?
The number of concurrent calls you can achieve depends on each call's shape (prompt size, `max_tokens` parameter, etc.). The service continues to accept calls until the utilization reaches 100%. To determine the approximate number of concurrent calls, you can model out the maximum requests per minute for a particular call shape in the [capacity calculator](https://oai.azure.com/portal/calculator). If the system generates fewer tokens than the specified `max_tokens` value, it accepts more requests.
## What models and regions are available for provisioned throughput?