
Commit 4a6d02d

Removed latency row. Fixed Acrolinx items to hit min.
1 parent a368b48 commit 4a6d02d

File tree

1 file changed (+15, -18 lines)

articles/ai-services/openai/concepts/provisioned-throughput.md

Lines changed: 15 additions & 18 deletions
@@ -40,19 +40,16 @@ An Azure OpenAI Deployment is a unit of management for a specific OpenAI Model.
## How much throughput per PTU you get for each model

The amount of throughput (tokens per minute or TPM) a deployment gets per PTU is a function of the input and output tokens being generated.

- Generating output tokens requires more processing and the more tokens generated, the lower the overall TPM per PTU. Provisioned deployments dynamically balance the two, so users do not have to set specific input and output limits. This means the service is resilient to fluctuations in the workload shape.
+ Generating output tokens requires more processing, and the more tokens generated, the lower the overall TPM per PTU. Provisioned deployments dynamically balance the two, so users don't have to set specific input and output limits. This approach means the service is resilient to fluctuations in the workload shape.

- To help with simplifying the sizing effort, the table below outlines the TPM per PTU for the `gpt-4o` and `gpt-4o-mini` models
+ To simplify the sizing effort, the following table outlines the TPM per PTU for the `gpt-4o` and `gpt-4o-mini` models:

| | **gpt-4o**, **2024-05-13** & **gpt-4o**, **2024-08-06** | **gpt-4o-mini**, **2024-07-18** |
| -- | -- | -- |
| Deployable Increments | 50 | 25 |
| Input TPM per PTU | 2,500 | 37,000 |
| Output TPM per PTU | 833 | 12,333 |
- | Latency target | > 25 tokens per second* | > 33 tokens per second* |
-
- \* Calculated as the average of the per-call average generated tokens on a 1-minute bassis over the month
- \** For a full list please see the [AOAI Studio calcualator](https://oai.azure.com/portal/calculator)
+ \** For a full list, see the [AOAI Studio calculator](https://oai.azure.com/portal/calculator).
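As a rough illustration of how these figures can be used for sizing, here is a hedged Python sketch that estimates the PTUs a workload needs from its expected input and output TPM. The table values and the 50-PTU increment come from the table above; the sizing rule (size to the tighter dimension, then round up to the next increment) and the example workload numbers are illustrative assumptions only, and the [AOAI Studio calculator](https://oai.azure.com/portal/calculator) remains the authoritative sizing tool.

```python
import math

# TPM-per-PTU figures from the table above (gpt-4o, 2024-05-13 / 2024-08-06).
INPUT_TPM_PER_PTU = 2_500
OUTPUT_TPM_PER_PTU = 833
DEPLOYABLE_INCREMENT = 50  # gpt-4o provisioned deployments scale in 50-PTU steps

def estimate_ptus(input_tpm: float, output_tpm: float) -> int:
    """Illustrative heuristic, not the service's actual algorithm:
    size to whichever dimension needs more PTUs, then round up to the
    next deployable increment."""
    raw = max(input_tpm / INPUT_TPM_PER_PTU, output_tpm / OUTPUT_TPM_PER_PTU)
    return math.ceil(raw / DEPLOYABLE_INCREMENT) * DEPLOYABLE_INCREMENT

# Example workload: 300K input tokens and 50K output tokens per minute.
print(estimate_ptus(input_tpm=300_000, output_tpm=50_000))  # -> 150
```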
## Key concepts
@@ -79,7 +76,7 @@ az cognitiveservices account deployment create \

#### Provisioned throughput units

- Provisioned throughput units (PTU) are generic units of model processing capacity that you can use to size provisioned deployments to achieve the required throughput for processing prompts and generating completions. Provisioned throughput units are granted to a subscription as quota on a regional basis, which defines the maximum number of PTUs that can be assigned to deployments in that subscription and region.
+ Provisioned throughput units (PTU) are generic units of model processing capacity that you can use to size provisioned deployments to achieve the required throughput for processing prompts and generating completions. Provisioned throughput units are granted to a subscription as quota. Each quota is specific to a region and defines the maximum number of PTUs that can be assigned to deployments in that subscription and region.

#### Model independent quota
@@ -88,36 +85,36 @@ Unlike the Tokens Per Minute (TPM) quota used by other Azure OpenAI offerings, P

:::image type="content" source="../media/provisioned/model-independent-quota.png" alt-text="Diagram of model independent quota with one pool of PTUs available to multiple Azure OpenAI models." lightbox="../media/provisioned/model-independent-quota.png":::

- For provisioned deployments, the new quota shows up in Azure OpenAI Studio as a quota item named **Provisioned Managed Throughput Unit**. For global provisioned managed deployments, the new quota shows up in the Azure OpenAI Studio as a quota item named **Global Provisioned Managed Throughput Unit**. In the Studio Quota pane, expanding the quota item will show the deployments contributing to usage of each quota.
+ For provisioned deployments, the new quota shows up in Azure OpenAI Studio as a quota item named **Provisioned Managed Throughput Unit**. For global provisioned managed deployments, the new quota shows up in Azure OpenAI Studio as a quota item named **Global Provisioned Managed Throughput Unit**. In the Studio Quota pane, expanding the quota item shows the deployments contributing to usage of each quota.

:::image type="content" source="../media/provisioned/ptu-quota-page.png" alt-text="Screenshot of quota UI for Azure OpenAI provisioned." lightbox="../media/provisioned/ptu-quota-page.png":::

#### Obtaining PTU Quota

- PTU quota is available by default in many regions. If additional quota is required, customers can request additional quota via the Request Quota link to the right of the Provisioned Managed Throughput Unit or Global Provisioned Managed Throughput Unit quota items in Azure OpenAI Studio. The form allows the customer to request an increase in the specified PTU quota for a given region. The customer will receive an email at the included address once the request is approved, typically within two business days.
+ PTU quota is available by default in many regions. If more quota is required, customers can request quota via the Request Quota link. This link can be found to the right of the Provisioned Managed Throughput Unit or Global Provisioned Managed Throughput Unit quota tabs in Azure OpenAI Studio. The form allows the customer to request an increase in the specified PTU quota for a given region. The customer receives an email at the included address once the request is approved, typically within two business days.

#### Per-Model PTU Minimums

The minimum PTU deployment, increments, and processing capacity associated with each unit vary by model type & version.

## Capacity transparency

- Azure OpenAI is a highly sought-after service where customer demand might exceed service GPU capacity. Microsoft strives to provide capacity for all in-demand regions and models, but selling out a region is always a possibility. This can limit some customers’ ability to create a deployment of their desired model, version, or number of PTUs in a desired region - even if they have quota available in that region. Generally speaking:
+ Azure OpenAI is a highly sought-after service where customer demand might exceed service GPU capacity. Microsoft strives to provide capacity for all in-demand regions and models, but selling out a region is always a possibility. This constraint can limit some customers’ ability to create a deployment of their desired model, version, or number of PTUs in a desired region - even if they have quota available in that region. Generally speaking:

- - Quota places a limit on the maximum number of PTUs that can be deployed in a subscription and region, and is not a guarantee of capacity availability.
+ - Quota places a limit on the maximum number of PTUs that can be deployed in a subscription and region, and doesn't guarantee capacity availability.
- Capacity is allocated at deployment time and is held for as long as the deployment exists. If service capacity is not available, the deployment will fail.
- Customers use real-time information on quota/capacity availability to choose an appropriate region for their scenario with the necessary model capacity.
- Scaling down or deleting a deployment releases capacity back to the region. There is no guarantee that the capacity will be available should the deployment be scaled up or re-created later.

#### Regional capacity guidance

- To help users find the capacity needed for their deployments, customers will use a new API and Studio experience to provide real-time information on.
+ To find the capacity needed for a deployment, use the capacity API or the Studio deployment experience, which provide real-time information on capacity availability.

- In Azure OpenAI Studio, the deployment experience will identify when a region lacks the capacity to support the desired model, version and number of PTUs, and will direct the user to a select an alternative region when needed.
+ In Azure OpenAI Studio, the deployment experience identifies when a region lacks the capacity needed to deploy the model. This check considers the desired model, version, and number of PTUs. If capacity is unavailable, the experience directs users to select an alternative region.

Details on the new deployment experience can be found in the Azure OpenAI [Provisioned get started guide](../how-to/provisioned-get-started.md).

- The new [model capacities API](/rest/api/aiservices/accountmanagement/model-capacities/list?view=rest-aiservices-accountmanagement-2024-04-01-preview&tabs=HTTP&preserve-view=true) can also be used to programmatically identify the maximum sized deployment of a specified model that can be created in each region based on the availability of both quota in the subscription and service capacity in the region.
+ The new [model capacities API](/rest/api/aiservices/accountmanagement/model-capacities/list?view=rest-aiservices-accountmanagement-2024-04-01-preview&tabs=HTTP&preserve-view=true) can be used to programmatically identify the maximum-sized deployment of a specified model. The API considers both your quota in the subscription and service capacity in the region.
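For programmatic checks, a minimal sketch is shown below, assuming the `azure-identity` and `requests` packages. The subscription ID, model name, and version are placeholders, and the response field names used in the loop (`location`, `skuName`, `availableCapacity`) are best-effort assumptions that should be verified against the linked REST reference.

```python
import requests
from azure.identity import DefaultAzureCredential  # pip install azure-identity requests

# Placeholder values - replace with your own subscription and target model.
SUBSCRIPTION_ID = "<subscription-id>"
MODEL_NAME = "gpt-4o"
MODEL_VERSION = "2024-08-06"

# Acquire an ARM token; any supported credential (CLI login, managed identity, etc.) works.
token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token

url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION_ID}"
    "/providers/Microsoft.CognitiveServices/modelCapacities"
)
params = {
    "api-version": "2024-04-01-preview",
    "modelFormat": "OpenAI",
    "modelName": MODEL_NAME,
    "modelVersion": MODEL_VERSION,
}

resp = requests.get(url, params=params, headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()

# Each entry reports capacity for the model in a given region and SKU (field names assumed).
for item in resp.json().get("value", []):
    props = item.get("properties", {})
    print(item.get("location"), props.get("skuName"), props.get("availableCapacity"))
```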

If an acceptable region isn't available to support the desired model, version and/or PTUs, customers can also try the following steps:

@@ -131,13 +128,13 @@ PTUs represent an amount of model processing capacity. Similar to your computer

A few high-level considerations:
- Generations require more capacity than prompts
- - Larger calls are progressively more expensive to compute. For example, 100 calls of with a 1000 token prompt size requires less capacity than one call with 100,000 tokens in the prompt. This also means that the distribution of these call shapes is important in overall throughput. Traffic patterns with a wide distribution that includes some very large calls might experience lower throughput per PTU than a narrower distribution with the same average prompt & completion token sizes.
+ - For GPT-4o and later models, the TPM per PTU is set for input and output tokens separately. For older models, larger calls are progressively more expensive to compute. For example, 100 calls with a 1,000-token prompt size require less capacity than one call with 100,000 tokens in the prompt. This tiering means that the distribution of call shapes is important for overall throughput. Traffic patterns with a wide distribution that includes some large calls might experience lower throughput per PTU than a narrower distribution with the same average prompt & completion token sizes.

### How utilization performance works

Provisioned and global provisioned deployments provide you with an allocated amount of model processing capacity to run a given model.

- In Provisioned-Managed and Global Provisioned-Managed deployments, when capacity is exceeded, the API will immediately return a 429 HTTP Status Error. This enables the user to make decisions on how to manage their traffic. Users can redirect requests to a separate deployment, to a standard pay-as-you-go instance, or leverage a retry strategy to manage a given request. The service will continue to return the 429 HTTP status code until the utilization drops below 100%.
+ In Provisioned-Managed and Global Provisioned-Managed deployments, when capacity is exceeded, the API returns a 429 HTTP status code. This fast response enables the user to make decisions on how to manage their traffic. Users can redirect requests to a separate deployment, to a standard pay-as-you-go instance, or use a retry strategy to manage a given request. The service continues to return the 429 HTTP status code until the utilization drops below 100%.
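A minimal retry sketch for this pattern, using the `openai` Python package (v1.x), is shown below. The endpoint, key, API version, and deployment name are placeholders, and the one-second fallback delay is illustrative rather than a recommendation.

```python
import time
from openai import AzureOpenAI, RateLimitError  # pip install openai

# Placeholder connection details for a provisioned deployment.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<api-key>",
    api_version="2024-06-01",
)

def call_with_retry(messages, deployment="<provisioned-deployment-name>", max_attempts=5):
    """Retry on 429 responses, honoring the retry-after-ms header when present."""
    for attempt in range(max_attempts):
        try:
            return client.chat.completions.create(model=deployment, messages=messages)
        except RateLimitError as err:
            # Utilization is above 100%; wait the suggested time (or a small default) and retry.
            retry_ms = err.response.headers.get("retry-after-ms")
            delay = float(retry_ms) / 1000 if retry_ms else 1.0
            time.sleep(delay)
    # Fallback: redirect to a secondary deployment or a pay-as-you-go endpoint instead.
    raise RuntimeError("Provisioned deployment still throttled after retries")

print(call_with_retry([{"role": "user", "content": "Hello"}]).choices[0].message.content)
```

In production, redirecting overflow traffic to a separate deployment or a standard pay-as-you-go instance, as described above, is often preferable to long retry loops.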

### How can I monitor capacity?

@@ -161,7 +158,7 @@ For Provisioned-Managed and Global Provisioned-Managed, we use a variation of th

a. When the current utilization is above 100%, the service returns a 429 code with the `retry-after-ms` header set to the time until utilization is below 100%

- b. Otherwise, the service estimates the incremental change to utilization required to serve the request by combining prompt tokens and the specified `max_tokens` in the call. If the `max_tokens` parameter is not specified, the service will estimate a value. This estimation can lead to lower concurrency than expected when the number of actual generated tokens is small. For highest concurrency, ensure that the `max_tokens` value is as close as possible to the true generation size.
+ b. Otherwise, the service estimates the incremental change to utilization required to serve the request by combining prompt tokens and the specified `max_tokens` in the call. If the `max_tokens` parameter is not specified, the service estimates a value. This estimation can lead to lower concurrency than expected when the number of actual generated tokens is small. For highest concurrency, ensure that the `max_tokens` value is as close as possible to the true generation size.
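As a hedged illustration of that guidance, the following snippet (again using the `openai` Python package, v1.x) caps `max_tokens` near the expected completion length; the deployment name and the 150-token value are placeholders chosen for this example prompt.

```python
from openai import AzureOpenAI  # pip install openai

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<api-key>",
    api_version="2024-06-01",
)

# Setting max_tokens close to the expected generation size keeps the service's
# utilization estimate accurate, which preserves concurrency on the deployment.
response = client.chat.completions.create(
    model="<provisioned-deployment-name>",
    messages=[{"role": "user", "content": "Summarize this ticket in two sentences: ..."}],
    max_tokens=150,  # placeholder: expected completions are well under 150 tokens
)
print(response.choices[0].message.content)
```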

3. When a request finishes, we now know the actual compute cost for the call. To ensure an accurate accounting, we correct the utilization using the following logic:

@@ -178,7 +175,7 @@ For Provisioned-Managed and Global Provisioned-Managed, we use a variation of th

#### How many concurrent calls can I have on my deployment?

- The number of concurrent calls you can achieve depends on each call's shape (prompt size, max_token parameter, etc.). The service will continue to accept calls until the utilization reach 100%. To determine the approximate number of concurrent calls you can model out the maximum requests per minute for a particular call shape in the [capacity calculator](https://oai.azure.com/portal/calculator). If the system generates less than the number of samplings tokens like max_token, it will accept more requests.
+ The number of concurrent calls you can achieve depends on each call's shape (prompt size, `max_tokens` parameter, and so on). The service continues to accept calls until the utilization reaches 100%. To determine the approximate number of concurrent calls, you can model out the maximum requests per minute for a particular call shape in the [capacity calculator](https://oai.azure.com/portal/calculator). If the system generates fewer tokens than the `max_tokens` value specifies, it accepts more requests.
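The sketch below shows the kind of back-of-the-envelope estimate the calculator automates; it is not the service's actual algorithm. The TPM-per-PTU figures come from the table earlier in this article, while the PTU count, call shape, and per-call latency are illustrative placeholders.

```python
# Rough concurrency estimate for a gpt-4o provisioned deployment (illustrative only).
PTUS = 100
INPUT_TPM_PER_PTU = 2_500     # from the TPM-per-PTU table above
OUTPUT_TPM_PER_PTU = 833

# Placeholder call shape: 1,000 prompt tokens in, 200 generated tokens out, ~4 s per call.
prompt_tokens, output_tokens, call_seconds = 1_000, 200, 4

# Requests per minute supported by each dimension; the tighter one is the bottleneck.
rpm_by_input = PTUS * INPUT_TPM_PER_PTU / prompt_tokens
rpm_by_output = PTUS * OUTPUT_TPM_PER_PTU / output_tokens
max_rpm = min(rpm_by_input, rpm_by_output)

# Little's law: concurrency is roughly arrival rate (per second) x time each call is in flight.
concurrency = (max_rpm / 60) * call_seconds
print(f"~{max_rpm:.0f} requests/min, ~{concurrency:.0f} concurrent calls")
```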

## What models and regions are available for provisioned throughput?
