Commit 9d0d687

Merge pull request #326 from MicrosoftDocs/release-2024-openai-global-ptu
[Azure OpenAI] [Release branch to main tracking PR] PTUM Global
2 parents 7ea5367 + 3b00b8e commit 9d0d687

File tree

6 files changed (+70, -53 lines changed)


articles/ai-services/openai/concepts/provisioned-migration.md

Lines changed: 1 addition & 1 deletion

@@ -79,7 +79,7 @@ We also recommend that customers using commitments now create their deployments
 See the following links for more information. The guidance for reservations and commitments is the same:
 
 * [Capacity Transparency](#self-service-migration)
-* [Sizing reservations](../how-to/provisioned-throughput-onboarding.md#important-sizing-azure-openai-provisioned-reservations)
+* [Sizing reservations](../how-to/provisioned-throughput-onboarding.md#important-sizing-azure-openai-provisioned--global-provisioned-reservations)
 
 ## New hourly reservation payment model
 

articles/ai-services/openai/concepts/provisioned-throughput.md

Lines changed: 14 additions & 13 deletions
@@ -17,23 +17,23 @@ recommendations: false
 
 The provisioned throughput capability allows you to specify the amount of throughput you require in a deployment. The service then allocates the necessary model processing capacity and ensures it's ready for you. Throughput is defined in terms of provisioned throughput units (PTU), which is a normalized way of representing the throughput for your deployment. Each model-version pair requires a different amount of PTU to deploy and provides a different amount of throughput per PTU.
 
-## What does the provisioned deployment type provide?
+## What do the provisioned and global provisioned deployment types provide?
 
 - **Predictable performance:** stable max latency and throughput for uniform workloads.
 - **Reserved processing capacity:** A deployment configures the amount of throughput. Once deployed, the throughput is available whether used or not.
 - **Cost savings:** High throughput workloads might provide cost savings vs token-based consumption.
 
-An Azure OpenAI Deployment is a unit of management for a specific OpenAI Model. A deployment provides customer access to a model for inference and integrates more features like Content Moderation ([See content moderation documentation](content-filter.md)).
+An Azure OpenAI Deployment is a unit of management for a specific OpenAI Model. A deployment provides customer access to a model for inference and integrates more features like Content Moderation ([See content moderation documentation](content-filter.md)). Global deployments are available in the same Azure OpenAI resources as non-global deployment types, but allow you to leverage Azure's global infrastructure to dynamically route traffic to the data center with the best availability for each request.
 
 ## What do you get?
 
 | Topic | Provisioned |
 |---|---|
 | What is it? | Provides guaranteed throughput at smaller increments than the existing provisioned offer. Deployments have a consistent max latency for a given model-version. |
 | Who is it for? | Customers who want guaranteed throughput with minimal latency variance. |
-| Quota | Provisioned-managed throughput Units for a given model. |
+| Quota | Provisioned Managed Throughput Unit or Global Provisioned Managed Throughput Unit assigned per region. Quota can be used across any available Azure OpenAI model. |
 | Latency | Max latency constrained by the model. Overall latency is a factor of call shape. |
-| Utilization | Provisioned-managed Utilization measure provided in Azure Monitor. |
+| Utilization | Provisioned-managed Utilization V2 measure provided in Azure Monitor. |
 | Estimating size | Provided calculator in the studio & benchmarking script. |
 
 ## What models and regions are available for provisioned throughput?
## What models and regions are available for provisioned throughput?
@@ -47,9 +47,9 @@ An Azure OpenAI Deployment is a unit of management for a specific OpenAI Model.
 
 ### Deployment types
 
-When creating a provisioned deployment in Azure OpenAI Studio, the deployment type on the Create Deployment dialog is Provisioned-Managed.
+When creating a provisioned deployment in Azure OpenAI Studio, the deployment type on the Create Deployment dialog is Provisioned-Managed. When creating a global provisioned managed deployment in Azure OpenAI Studio, the deployment type on the Create Deployment dialog is Global Provisioned-Managed.
 
-When creating a provisioned deployment in Azure OpenAI via CLI or API, you need to set the `sku-name` to be Provisioned-Managed. The `sku-capacity` specifies the number of PTUs assigned to the deployment.
+When creating a provisioned deployment in Azure OpenAI via CLI or API, you need to set the `sku-name` to `ProvisionedManaged`. When creating a global provisioned deployment in Azure OpenAI via CLI or API, you need to set the `sku-name` to `GlobalProvisionedManaged`. The `sku-capacity` specifies the number of PTUs assigned to the deployment.
 
 ```azurecli
 az cognitiveservices account deployment create \
 …
 ```
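
The hunk truncates the command above. For reference, the full command shape might look like the following hedged sketch; the resource, group, and deployment names are placeholders, and the model, version, and PTU count are illustrative values, not values taken from this PR:

```azurecli
# Hypothetical example: create a global provisioned deployment with 100 PTUs.
# Replace the <...> placeholders with your own resource names.
az cognitiveservices account deployment create \
  --name <your-azure-openai-resource> \
  --resource-group <your-resource-group> \
  --deployment-name <your-deployment-name> \
  --model-name gpt-4o \
  --model-version "2024-05-13" \
  --model-format OpenAI \
  --sku-capacity 100 \
  --sku-name GlobalProvisionedManaged
```

Setting `--sku-name ProvisionedManaged` instead would create a regional provisioned deployment with the same capacity semantics.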
@@ -76,13 +76,13 @@ Unlike the Tokens Per Minute (TPM) quota used by other Azure OpenAI offerings, P
 
 :::image type="content" source="../media/provisioned/model-independent-quota.png" alt-text="Diagram of model independent quota with one pool of PTUs available to multiple Azure OpenAI models." lightbox="../media/provisioned/model-independent-quota.png":::
 
-The new quota shows up in Azure OpenAI Studio as a quota item named **Provisioned Managed Throughput Unit**. In the Studio Quota pane, expanding the quota item will show the deployments contributing to usage of the quota.
+For provisioned deployments, the new quota shows up in Azure OpenAI Studio as a quota item named **Provisioned Managed Throughput Unit**. For global provisioned managed deployments, the new quota shows up in the Azure OpenAI Studio as a quota item named **Global Provisioned Managed Throughput Unit**. In the Studio Quota pane, expanding the quota item will show the deployments contributing to usage of each quota.
 
 :::image type="content" source="../media/provisioned/ptu-quota-page.png" alt-text="Screenshot of quota UI for Azure OpenAI provisioned." lightbox="../media/provisioned/ptu-quota-page.png":::
 
 #### Obtaining PTU Quota
 
-PTU quota is available by default in many regions. If additional quota is required, customers can request additional quota via the Request Quota link to the right of the Provisioned Managed Throughput Unit quota item in Azure OpenAI Studio. The form allows the customer to request an increase in PTU quota for a specified region. The customer will receive an email at the included address once the request is approved, typically within two business days.
+PTU quota is available by default in many regions. If additional quota is required, customers can request additional quota via the Request Quota link to the right of the Provisioned Managed Throughput Unit or Global Provisioned Managed Throughput Unit quota items in Azure OpenAI Studio. The form allows the customer to request an increase in the specified PTU quota for a given region. The customer will receive an email at the included address once the request is approved, typically within two business days.
 
 #### Per-Model PTU Minimums
 

@@ -123,13 +123,13 @@ A few high-level considerations:
 
 ### How utilization performance works
 
-Provisioned deployments provide you with an allocated amount of model processing capacity to run a given model.
+Provisioned and global provisioned deployments provide you with an allocated amount of model processing capacity to run a given model.
 
-In Provisioned-Managed deployments, when capacity is exceeded, the API will immediately return a 429 HTTP Status Error. This enables the user to make decisions on how to manage their traffic. Users can redirect requests to a separate deployment, to a standard pay-as-you-go instance, or leverage a retry strategy to manage a given request. The service will continue to return the 429 HTTP status code until the utilization drops below 100%.
+In Provisioned-Managed and Global Provisioned-Managed deployments, when capacity is exceeded, the API will immediately return a 429 HTTP Status Error. This enables the user to make decisions on how to manage their traffic. Users can redirect requests to a separate deployment, to a standard pay-as-you-go instance, or leverage a retry strategy to manage a given request. The service will continue to return the 429 HTTP status code until the utilization drops below 100%.
 
 ### How can I monitor capacity?
 
-The [Provisioned-Managed Utilization V2 metric](../how-to/monitoring.md#azure-openai-metrics) in Azure Monitor measures a given deployment's utilization in 1-minute increments. Provisioned-Managed deployments are optimized to ensure that accepted calls are processed with a consistent model processing time (actual end-to-end latency is dependent on a call's characteristics).
+The [Provisioned-Managed Utilization V2 metric](../how-to/monitoring.md#azure-openai-metrics) in Azure Monitor measures a given deployment's utilization in 1-minute increments. Provisioned-Managed and Global Provisioned-Managed deployments are optimized to ensure that accepted calls are processed with a consistent model processing time (actual end-to-end latency is dependent on a call's characteristics).
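
Purely as an illustration (not part of this diff), the same utilization signal can also be pulled programmatically. A hedged Azure CLI sketch follows; the metric's internal name is assumed to be `AzureOpenAIProvisionedManagedUtilizationV2`, so verify the exact name before relying on it:

```azurecli
# Assumed metric name; confirm with:
#   az monitor metrics list-definitions --resource <resource-id>
az monitor metrics list \
  --resource /subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.CognitiveServices/accounts/<account> \
  --metric "AzureOpenAIProvisionedManagedUtilizationV2" \
  --interval PT1M \
  --aggregation Average
```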
 
 #### What should I do when I receive a 429 response?
 The 429 response isn't an error, but instead part of the design for telling users that a given deployment is fully utilized at a point in time. By providing a fast-fail response, you have control over how to handle these situations in a way that best fits your application requirements.
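
Since this fast-fail design leaves the handling to the caller, a minimal Python retry sketch that honors the `retry-after-ms`/`retry-after` headers might look like the following; the endpoint URL, headers, and payload are placeholders, and this is an illustration rather than a documented client:

```python
import time

import requests  # any HTTP client works; requests is used here for brevity


def call_with_retry(url: str, headers: dict, payload: dict, max_retries: int = 5) -> dict:
    """POST to a provisioned deployment, backing off when it returns 429."""
    for _ in range(max_retries):
        resp = requests.post(url, headers=headers, json=payload, timeout=60)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # Prefer the millisecond hint, fall back to seconds, then to 1 second.
        if "retry-after-ms" in resp.headers:
            delay = int(resp.headers["retry-after-ms"]) / 1000.0
        else:
            delay = float(resp.headers.get("retry-after", "1"))
        time.sleep(delay)
    raise RuntimeError("deployment still at full utilization after retries")
```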
@@ -140,9 +140,10 @@ The `retry-after-ms` and `retry-after` headers in the response tell you the time
 
 #### How does the service decide when to send a 429?
 
-In the Provisioned-Managed offering, each request is evaluated individually according to its prompt size, expected generation size, and model to determine its expected utilization. This is in contrast to pay-as-you-go deployments, which have a [custom rate limiting behavior](../how-to/quota.md) based on the estimated traffic load. For pay-as-you-go deployments this can lead to HTTP 429 errors being generated prior to defined quota values being exceeded if traffic is not evenly distributed.
+In the Provisioned-Managed and Global Provisioned-Managed offerings, each request is evaluated individually according to its prompt size, expected generation size, and model to determine its expected utilization. This is in contrast to pay-as-you-go deployments, which have a [custom rate limiting behavior](../how-to/quota.md) based on the estimated traffic load. For pay-as-you-go deployments this can lead to HTTP 429 errors being generated prior to defined quota values being exceeded if traffic is not evenly distributed.
+
+For Provisioned-Managed and Global Provisioned-Managed, we use a variation of the leaky bucket algorithm to maintain utilization below 100% while allowing some burstiness in the traffic. The high-level logic is as follows:
 
-For Provisioned-Managed, we use a variation of the leaky bucket algorithm to maintain utilization below 100% while allowing some burstiness in the traffic. The high-level logic is as follows:
 1. Each customer has a set amount of capacity they can utilize on a deployment
 2. When a request is made:
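
The numbered list is truncated by this hunk. Purely as a generic illustration of the leaky bucket idea the paragraph names, and not the service's actual accounting of prompt and generation tokens, a small Python sketch:

```python
import time


class LeakyBucket:
    """Toy leaky bucket: load drains at a fixed rate; admitted requests add load.

    Illustrative only -- the service's real per-request utilization estimate
    (prompt size, expected generation size, model) is not reproduced here.
    """

    def __init__(self, capacity: float, drain_per_second: float) -> None:
        self.capacity = capacity              # utilization budget (100.0 == 100%)
        self.drain_per_second = drain_per_second
        self.level = 0.0                      # current utilization
        self._last = time.monotonic()

    def _drain(self) -> None:
        now = time.monotonic()
        self.level = max(0.0, self.level - (now - self._last) * self.drain_per_second)
        self._last = now

    def try_acquire(self, estimated_cost: float) -> bool:
        """Admit a request while utilization is under 100%; otherwise signal 429."""
        self._drain()
        if self.level >= self.capacity:
            return False                      # caller would surface an HTTP 429
        self.level += estimated_cost          # a single burst may push past 100%
        return True
```

This mirrors the two behaviors described above: the continuous drain holds utilization near or below 100%, while a request admitted just under the limit can briefly push it over, which is the allowed burstiness.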
