
Commit 3a300c9

Merge pull request #1815 from sydneemayers/docs-editor/deployment-types-1733274504
Data Zone PTU Launch Doc Updates
2 parents e4c8e87 + 3ad8fe3 commit 3a300c9

File tree

4 files changed: +56 -38 lines changed

articles/ai-services/openai/concepts/provisioned-throughput.md

Lines changed: 23 additions & 22 deletions
@@ -14,39 +14,40 @@ recommendations: false
 
 > [!NOTE]
 > The Azure OpenAI Provisioned offering received significant updates on August 12, 2024, including aligning the purchase model with Azure standards and moving to model-independent quota. It is highly recommended that customers onboarded before this date read the Azure [OpenAI provisioned August update](./provisioned-migration.md) to learn more about these changes.
-
+
+
 The provisioned throughput capability allows you to specify the amount of throughput you require in a deployment. The service then allocates the necessary model processing capacity and ensures it's ready for you. Throughput is defined in terms of provisioned throughput units (PTU), a normalized way of representing the throughput for your deployment. Each model-version pair requires a different number of PTUs to deploy and provides a different amount of throughput per PTU.
 
-## What do the provisioned and global provisioned deployment types provide?
+## What do the provisioned deployment types provide?
 
 - **Predictable performance:** Stable max latency and throughput for uniform workloads.
 - **Reserved processing capacity:** A deployment configures the amount of throughput. Once deployed, the throughput is available whether used or not.
 - **Cost savings:** High-throughput workloads might provide cost savings vs. token-based consumption.
 
-An Azure OpenAI Deployment is a unit of management for a specific OpenAI Model. A deployment provides customer access to a model for inference and integrates more features like Content Moderation ([See content moderation documentation](content-filter.md)). Global deployments are available in the same Azure OpenAI resources as non-global deployment types but allow you to leverage Azure's global infrastructure to dynamically route traffic to the data center with best availability for each request.
+An Azure OpenAI Deployment is a unit of management for a specific OpenAI Model. A deployment provides customer access to a model for inference and integrates more features like Content Moderation ([see the content moderation documentation](content-filter.md)). Global provisioned deployments are available in the same Azure OpenAI resources as all other deployment types but allow you to leverage Azure's global infrastructure to dynamically route traffic to the data center with the best availability for each request. Similarly, data zone provisioned deployments are also available in the same resources as all other deployment types but allow you to leverage Azure's global infrastructure to dynamically route traffic to the data center within the Microsoft-specified data zone with the best availability for each request.
 
 ## What do you get?
 
 | Topic | Provisioned |
 |---|---|
 | What is it? | Provides guaranteed throughput at smaller increments than the existing provisioned offer. Deployments have a consistent max latency for a given model-version. |
 | Who is it for? | Customers who want guaranteed throughput with minimal latency variance. |
-| Quota | Provisioned Managed Throughput Unit or Global Provisioned Managed Throughput Unit assigned per region. Quota can be used across any available Azure OpenAI model.|
+| Quota | Provisioned Managed Throughput Unit, Global Provisioned Managed Throughput Unit, or Data Zone Provisioned Managed Throughput Unit assigned per region. Quota can be used across any available Azure OpenAI model. |
 | Latency | Max latency constrained from the model. Overall latency is a factor of call shape. |
 | Utilization | Provisioned-managed Utilization V2 measure provided in Azure Monitor. |
-| Estimating size | Provided calculator in the studio & benchmarking script. |
-| Prompt caching | For supported models, we discount up to 100% of cached input tokens. |
+| Estimating size | Provided calculator in Azure AI Foundry & benchmarking script. |
+| Prompt caching | For supported models, we discount up to 100% of cached input tokens. |
 
 
 ## How much throughput per PTU you get for each model
 The amount of throughput (tokens per minute or TPM) a deployment gets per PTU is a function of the input and output tokens in the minute. Generating output tokens requires more processing than input tokens, so the more output tokens generated, the lower your overall TPM. The service dynamically balances the input and output costs, so users don't have to set specific input and output limits. This approach means your deployment is resilient to fluctuations in the workload shape.
 
 To simplify the sizing effort, the following table outlines the TPM per PTU for the `gpt-4o` and `gpt-4o-mini` models, which represents the max TPM assuming all traffic is either input or output. To understand how different ratios of input and output tokens impact your max TPM per PTU, see the [Azure OpenAI capacity calculator](https://oai.azure.com/portal/calculator). The table also shows Service Level Agreement (SLA) Latency Target Values per model. For more information about the SLA for Azure OpenAI Service, see the [Service Level Agreements (SLA) for Online Services page](https://www.microsoft.com/licensing/docs/view/Service-Level-Agreements-SLA-for-Online-Services?lang=1).
 
-|| **gpt-4o**, **2024-05-13** & **gpt-4o**, **2024-08-06** | **gpt-4o-mini**, **2024-07-18** |
+| Topic | **gpt-4o**, **2024-05-13** & **gpt-4o**, **2024-08-06** | **gpt-4o-mini**, **2024-07-18** |
 | --- | --- | --- |
-|Global provisioned minimum deployment|15|15|
-|Global provisioned scale increment|5|5|
+|Global & data zone provisioned minimum deployment|15|15|
+|Global & data zone provisioned scale increment|5|5|
 |Regional provisioned minimum deployment | 50 | 25|
 |Regional provisioned scale increment|50|25|
 |Max Input TPM per PTU | 2,500 | 37,000 |
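
As a rough worked example using the figures above (an illustration only; the actual throughput depends on the input/output token mix, which the capacity calculator models): a minimum-size global or data zone provisioned `gpt-4o` deployment of 15 PTUs supports at most

$$15 \,\text{PTU} \times 2{,}500 \,\tfrac{\text{input TPM}}{\text{PTU}} = 37{,}500 \,\text{input TPM}$$

if all traffic were input; the same 15 PTUs on `gpt-4o-mini` would support up to $15 \times 37{,}000 = 555{,}000$ input TPM.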
@@ -57,15 +58,15 @@ For a full list see the [AOAI Foundry calculator](https://oai.azure.com/portal/c
 
 
 > [!NOTE]
-> Global provisioned deployments are only supported for gpt-4o, 2024-08-06 and gpt-4o-mini, 2024-07-18 models at this time. For more information on model availability, review the [models documentation](./models.md).
+> Global provisioned deployments are only supported for gpt-4o, 2024-08-06 and gpt-4o-mini, 2024-07-18 models at this time. Data zone provisioned deployments are only supported for gpt-4o, 2024-08-06, gpt-4o, 2024-05-13, and gpt-4o-mini, 2024-07-18 models at this time. For more information on model availability, review the [models documentation](./models.md).
 
 ## Key concepts
 
 ### Deployment types
 
-When creating a provisioned deployment in Azure OpenAI Studio, the deployment type on the Create Deployment dialog is Provisioned-Managed. When creating a global provisioned managed deployment in Azure Open Studio, the deployment type on the Create Deployment dialog is Global Provisioned-Managed.
+When creating a provisioned deployment in Azure AI Foundry, the deployment type on the Create Deployment dialog can be set to Global Provisioned-Managed, DataZone Provisioned-Managed, or regional Provisioned-Managed, depending on the data processing needs of the given workload.
 
-When creating a provisioned deployment in Azure OpenAI via CLI or API, you need to set the `sku-name` to be `ProvisionedManaged`. When creating a global provisioned deployment in Azure OpenAI via CLI or API, you need to set the `sku-name` to be `GlobalProvisionedManaged`. The `sku-capacity` specifies the number of PTUs assigned to the deployment.
+When creating a provisioned deployment in Azure OpenAI via CLI or API, the `sku-name` can be set to `GlobalProvisionedManaged`, `DataZoneProvisionedManaged`, or `ProvisionedManaged`, depending on the data processing needs of the given workload. To adapt the Azure CLI example command below to a different deployment type, update the `sku-name` parameter to match the deployment type you wish to deploy. The `sku-capacity` parameter specifies the number of PTUs assigned to the deployment.
 
 ```azurecli
 az cognitiveservices account deployment create \
 ```
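
The hunk above shows only the first line of the command. For reference, a complete invocation might look like the following sketch; the resource, group, and deployment names are placeholders, and the model values come from the table earlier in this article.

```azurecli
# Placeholder names: MyOpenAIResource, MyResourceGroup, gpt-4o-ptu.
# sku-capacity is the number of PTUs (15 is the global/data zone minimum for gpt-4o);
# swap sku-name for DataZoneProvisionedManaged or ProvisionedManaged as needed.
az cognitiveservices account deployment create \
  --name MyOpenAIResource \
  --resource-group MyResourceGroup \
  --deployment-name gpt-4o-ptu \
  --model-name gpt-4o \
  --model-version 2024-08-06 \
  --model-format OpenAI \
  --sku-capacity 15 \
  --sku-name GlobalProvisionedManaged
```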
@@ -92,13 +93,13 @@ Unlike the Tokens Per Minute (TPM) quota used by other Azure OpenAI offerings, P
 
 :::image type="content" source="../media/provisioned/model-independent-quota.png" alt-text="Diagram of model independent quota with one pool of PTUs available to multiple Azure OpenAI models." lightbox="../media/provisioned/model-independent-quota.png":::
 
-For provisioned deployments, the new quota shows up in Azure OpenAI Studio as a quota item named **Provisioned Managed Throughput Unit**. For global provisioned managed deployments, the new quota shows up in the Azure OpenAI Studio as a quota item named **Global Provisioned Managed Throughput Unit**. In the Studio Quota pane, expanding the quota item shows the deployments contributing to usage of each quota.
+For provisioned deployments, the new quota shows up in Azure AI Foundry as a quota item named **Provisioned Managed Throughput Unit**. For global provisioned deployments, the new quota shows up in Azure AI Foundry as a quota item named **Global Provisioned Managed Throughput Unit**. For data zone provisioned deployments, the new quota shows up in Azure AI Foundry as a quota item named **Data Zone Provisioned Managed Throughput Unit**. In the Foundry Quota pane, expanding the quota item shows the deployments contributing to usage of each quota.
 
 :::image type="content" source="../media/provisioned/ptu-quota-page.png" alt-text="Screenshot of quota UI for Azure OpenAI provisioned." lightbox="../media/provisioned/ptu-quota-page.png":::
 
 #### Obtaining PTU Quota
 
-PTU quota is available by default in many regions. If more quota is required, customers can request quota via the Request Quota link. This link can be found to the right of the Provisioned Managed Throughput Unit or Global Provisioned Managed Throughput Unit quota tabs in the Azure OpenAI Studio. The form allows the customer to request an increase in the specified PTU quota for a given region. The customer receives an email at the included address once the request is approved, typically within two business days.
+PTU quota is available by default in many regions. If more quota is required, customers can request quota via the Request Quota link. This link can be found to the right of the designated provisioned deployment type quota tabs in Azure AI Foundry. The form allows the customer to request an increase in the specified PTU quota for a given region. The customer receives an email at the included address once the request is approved, typically within two business days.
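
Before filing a request, it can help to check current PTU quota consumption for a region from the CLI. A minimal sketch, assuming the `eastus` region; the `ProvisionedManaged` filter string is an assumption about how the quota items are named, so inspect the unfiltered output for the exact names in your subscription:

```azurecli
# List quota usage for a region and keep only provisioned-throughput items.
az cognitiveservices usage list -l eastus \
  --query "[?contains(name.value, 'ProvisionedManaged')].{name:name.value, used:currentValue, limit:limit}" \
  -o table
```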
 
 #### Per-Model PTU Minimums
 
@@ -115,9 +116,9 @@ Azure OpenAI is a highly sought-after service where customer demand might exceed
 
 #### Regional capacity guidance
 
-To find the capacity needed for their deployments, use the capacity API or the Studio deployment experience to provide real-time information on capacity availability.
+To find the capacity needed for your deployments, use the capacity API or the AI Foundry deployment experience, which provide real-time information on capacity availability.
 
-In Azure OpenAI Studio, the deployment experience identifies when a region lacks the capacity needed to deploy the model. This looks at the desired model, version and number of PTUs. If capacity is unavailable, the experience direct users to a select an alternative region.
+In Azure AI Foundry, the deployment experience identifies when a region lacks the capacity needed to deploy the model. This check looks at the desired model, version, and number of PTUs. If capacity is unavailable, the experience directs users to select an alternative region.
 
 Details on the new deployment experience can be found in the Azure OpenAI [Provisioned get started guide](../how-to/provisioned-get-started.md).
 
@@ -127,7 +128,7 @@ If an acceptable region isn't available to support the desired model, version and
 
 - Attempt the deployment with a smaller number of PTUs.
 - Attempt the deployment at a different time. Capacity availability changes dynamically based on customer demand and more capacity might become available later.
-- Ensure that quota is available in all acceptable regions. The [model capacities API](/rest/api/aiservices/accountmanagement/model-capacities/list?view=rest-aiservices-accountmanagement-2024-04-01-preview&tabs=HTTP&preserve-view=true) and Studio experience consider quota availability in returning alternative regions for creating a deployment.
+- Ensure that quota is available in all acceptable regions. The [model capacities API](/rest/api/aiservices/accountmanagement/model-capacities/list?view=rest-aiservices-accountmanagement-2024-04-01-preview&tabs=HTTP&preserve-view=true) and AI Foundry experience consider quota availability in returning alternative regions for creating a deployment. A sample call is sketched after this list.
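
As a sketch of such a capacity check, the model capacities API linked above can be called with `az rest`, which substitutes `{subscriptionId}` with the current subscription automatically. The URL shape follows the linked REST reference for the 2024-04-01-preview API version:

```azurecli
# Ask which regions currently have capacity for a given model and version.
az rest --method get \
  --url "https://management.azure.com/subscriptions/{subscriptionId}/providers/Microsoft.CognitiveServices/modelCapacities?api-version=2024-04-01-preview&modelFormat=OpenAI&modelName=gpt-4o&modelVersion=2024-08-06"
```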
 
 ### Determining the number of PTUs needed for a workload
 
@@ -139,13 +140,13 @@ A few high-level considerations:
 
 ### How utilization performance works
 
-Provisioned and global provisioned deployments provide you with an allocated amount of model processing capacity to run a given model.
+Provisioned deployments provide you with an allocated amount of model processing capacity to run a given model.
 
-In Provisioned-Managed and Global Provisioned-Managed deployments, when capacity is exceeded, the API will return a 429 HTTP Status Error. This fast response enables the user to make decisions on how to manage their traffic. Users can redirect requests to a separate deployment, to a standard pay-as-you-go instance, or use a retry strategy to manage a given request. The service continues to return the 429 HTTP status code until the utilization drops below 100%.
+In all provisioned deployment types, when capacity is exceeded, the API returns a 429 HTTP status code. This fast response enables the user to make decisions on how to manage their traffic. Users can redirect requests to a separate deployment, to a standard pay-as-you-go instance, or use a retry strategy to manage a given request. The service continues to return the 429 HTTP status code until the utilization drops below 100%.
 
 ### How can I monitor capacity?
 
-The [Provisioned-Managed Utilization V2 metric](../how-to/monitoring.md#azure-openai-metrics) in Azure Monitor measures a given deployments utilization on 1-minute increments. Provisioned-Managed and Global Provisioned-Managed deployments are optimized to ensure that accepted calls are processed with a consistent model processing time (actual end-to-end latency is dependent on a call's characteristics).
+The [Provisioned-Managed Utilization V2 metric](../how-to/monitoring.md#azure-openai-metrics) in Azure Monitor measures a given deployment's utilization on 1-minute increments. All provisioned deployment types are optimized to ensure that accepted calls are processed with a consistent model processing time (actual end-to-end latency is dependent on a call's characteristics).
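
For example, the metric can be pulled from the CLI as sketched below. The resource names are placeholders, and the metric name string is an assumption based on the metric's display name, so confirm it in the Azure Monitor metrics picker for your resource:

```azurecli
# Resolve the resource ID, then read the utilization metric at the 1-minute grain.
resourceId=$(az cognitiveservices account show \
  -n MyOpenAIResource -g MyResourceGroup --query id -o tsv)
az monitor metrics list \
  --resource "$resourceId" \
  --metric "AzureOpenAIProvisionedManagedUtilizationV2" \
  --interval PT1M \
  --aggregation Average
```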
 
 #### What should I do when I receive a 429 response?
 The 429 response isn't an error, but instead part of the design for telling users that a given deployment is fully utilized at a point in time. By providing a fast-fail response, you have control over how to handle these situations in a way that best fits your application requirements.
@@ -156,9 +157,9 @@ The `retry-after-ms` and `retry-after` headers in the response tell you the tim
 
 #### How does the service decide when to send a 429?
 
-In the Provisioned-Managed and Global Provisioned-Managed offerings, each request is evaluated individually according to its prompt size, expected generation size, and model to determine its expected utilization. This is in contrast to pay-as-you-go deployments, which have a [custom rate limiting behavior](../how-to/quota.md) based on the estimated traffic load. For pay-as-you-go deployments this can lead to HTTP 429 errors being generated prior to defined quota values being exceeded if traffic is not evenly distributed.
+In all provisioned deployment types, each request is evaluated individually according to its prompt size, expected generation size, and model to determine its expected utilization. This is in contrast to pay-as-you-go deployments, which have a [custom rate limiting behavior](../how-to/quota.md) based on the estimated traffic load. For pay-as-you-go deployments, this can lead to HTTP 429 errors being generated before defined quota values are exceeded if traffic isn't evenly distributed.
 
-For Provisioned-Managed and Global Provisioned-Managed, we use a variation of the leaky bucket algorithm to maintain utilization below 100% while allowing some burstiness in the traffic. The high-level logic is as follows:
+For provisioned deployments, we use a variation of the leaky bucket algorithm to maintain utilization below 100% while allowing some burstiness in the traffic. The high-level logic is as follows:
 
 1. Each customer has a set amount of capacity they can utilize on a deployment
 1. When a request is made: