Commit 06e5b9a ("reorganizing"), parent 19faaef

1 file changed: 47 additions, 35 deletions

articles/ai-services/openai/concepts/provisioned-throughput.md
> The Azure OpenAI Provisioned offering received significant updates on August 12, 2024, including aligning the purchase model with Azure standards and moving to model-independent quota. It is highly recommended that customers onboarded before this date read the Azure [OpenAI provisioned August update](./provisioned-migration.md) to learn more about these changes.
The provisioned throughput offering is a model deployment type that allows you to specify the amount of throughput you require in a model deployment. The Azure OpenAI service then allocates the necessary model processing capacity and ensures it's ready for you. Provisioned throughput provides:
- **Predictable performance:** Stable maximum latency and throughput for uniform workloads.
- **Allocated processing capacity:** A deployment configures the amount of throughput. Once deployed, the throughput is available whether used or not.
- **Cost savings:** High-throughput workloads might provide cost savings versus token-based consumption.

> [!TIP]
> * You can take advantage of additional cost savings when you buy [Microsoft Azure OpenAI Service reservations](/azure/cost-management-billing/reservations/azure-openai#buy-a-microsoft-azure-openai-service-reservation).
> * Provisioned throughput is available as the following deployment types: [global provisioned](../how-to/deployment-types.md#global-provisioned), [data zone provisioned](../how-to/deployment-types.md#data-zone-provisioned), and [standard provisioned](../how-to/deployment-types.md#provisioned).
<!--
Throughput is defined in terms of provisioned throughput units (PTU), which is a normalized way of representing the throughput for your deployment. Each model-version pair requires a different number of PTUs to deploy, and provides a different amount of throughput per PTU.

An Azure OpenAI deployment is a unit of management for a specific OpenAI model. A deployment provides customer access to a model for inference and for using features, such as [content moderation](content-filter.md).
-->

## When to use provisioned throughput

You should consider switching from standard deployments to provisioned managed deployments when you have well-defined, predictable throughput and latency requirements. Typically, this occurs when the application is ready for production or has already been deployed in production and there's an understanding of the expected traffic. This allows users to accurately forecast the required capacity and avoid unexpected billing. Provisioned managed deployments are also useful for applications that have real-time, latency-sensitive requirements.

<!--
## What do you get?

| Topic | Description |
|---|---|
| What is it? | Provides guaranteed throughput at smaller increments than the existing provisioned offer. Deployments have a consistent max latency for a given model-version. |
| Who is it for? | Customers who want guaranteed throughput with minimal latency variance. |
| Utilization | Provisioned-managed Utilization V2 measure provided in Azure Monitor. |
| Estimating size | Provided sizing calculator in Azure AI Foundry. |
| Prompt caching | For supported models, we discount up to 100% of cached input tokens. |
-->
## Key concepts
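
Provisioned deployments are created with the same `az cognitiveservices account deployment create` command as other deployment types; the difference is the SKU. The following is a minimal sketch, not the article's own example: the resource and group names are placeholders, and the SKU name (`GlobalProvisionedManaged`) and capacity (the 15-PTU global minimum) are illustrative values.

```shell
az cognitiveservices account deployment create \
  --name <azure-openai-resource-name> \
  --resource-group <resource-group-name> \
  --deployment-name gpt-4o-provisioned \
  --model-name gpt-4o \
  --model-version "2024-08-06" \
  --model-format OpenAI \
  --sku-name GlobalProvisionedManaged \
  --sku-capacity 15
```

The `--sku-capacity` value is the number of PTUs allocated to the deployment.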
Provisioned throughput units (PTU) are generic units of model processing capacity that you can use to size provisioned deployments to achieve the required throughput for processing prompts and generating completions. Provisioned throughput units are granted to a subscription as quota. Each quota is specific to a region and defines the maximum number of PTUs that can be assigned to deployments in that subscription and region.
#### Model independent quota
Unlike the Tokens Per Minute (TPM) quota used by other Azure OpenAI offerings, PTUs are model-independent. They can be used to deploy any supported model/version in the region.
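
Because quota is pooled this way, a subscription's remaining PTU balance in a region is simply the quota limit minus the sum of PTUs across all provisioned deployments, regardless of which models they serve. A toy sketch (the function name and numbers are illustrative, not an Azure API):

```python
def remaining_quota(quota_limit: int, deployments: dict[str, int]) -> int:
    """PTU quota is pooled per subscription and region: every provisioned
    deployment draws from the same pool, whatever model it runs."""
    used = sum(deployments.values())
    if used > quota_limit:
        raise ValueError("deployments exceed the regional PTU quota")
    return quota_limit - used

# 500 PTUs of regional quota shared by deployments of two different models
print(remaining_quota(500, {"gpt-4o": 300, "gpt-4o-mini": 100}))  # 100
```
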

For provisioned deployments, the new quota shows up in Azure AI Foundry as a quota item.

:::image type="content" source="../media/provisioned/ptu-quota-page.png" alt-text="Screenshot of quota UI for Azure OpenAI provisioned." lightbox="../media/provisioned/ptu-quota-page.png":::
## How much throughput per PTU you get for each model
The amount of throughput (measured in tokens per minute, or TPM) a deployment gets per PTU is a function of the input and output tokens in a given minute.

Generating output tokens requires more processing than input tokens. For the models specified in the following table, 1 output token counts as 3 input tokens toward your TPM-per-PTU limit. The service dynamically balances the input and output costs, so users don't have to set specific input and output limits. This approach means your deployment is resilient to fluctuations in the workload.

To simplify the sizing effort, the following table outlines the TPM per PTU for the specified models. To understand the impact of output tokens on the TPM-per-PTU limit, use the 3 input token to 1 output token ratio.

For a detailed understanding of how different ratios of input and output tokens impact the throughput your workload needs, see the [Azure OpenAI capacity calculator](https://oai.azure.com/portal/calculator). The table also shows the Service Level Agreement (SLA) latency target values per model. For more information about the SLA for Azure OpenAI Service, see the [Service Level Agreements (SLA) for Online Services page](https://www.microsoft.com/licensing/docs/view/Service-Level-Agreements-SLA-for-Online-Services?lang=1).
| Topic | **gpt-4o** | **gpt-4o-mini** | **o1** |
| --- | --- | --- | --- |
| Global & data zone provisioned minimum deployment | 15 | 15 | 15 |
| Global & data zone provisioned scale increment | 5 | 5 | 5 |
| Regional provisioned minimum deployment | 50 | 25 | 50 |
| Regional provisioned scale increment | 50 | 25 | 50 |
| Input TPM per PTU | 2,500 | 37,000 | 230 |
| Latency target value | 25 tokens per second | 33 tokens per second | 25 tokens per second |

For a full list, see the [Azure OpenAI Service in Azure AI Foundry portal calculator](https://oai.azure.com/portal/calculator).

> [!NOTE]
> Global provisioned and data zone provisioned deployments are only supported for gpt-4o and gpt-4o-mini models at this time. For more information on model availability, review the [models documentation](./models.md).
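
To make the sizing arithmetic concrete, the following sketch applies the 1-output-counts-as-3-inputs weighting and the gpt-4o values from the table (2,500 input TPM per PTU, a 15-PTU global minimum, 5-PTU scale increments). The function name and rounding approach are illustrative, not an official sizing tool; for real sizing, use the Azure OpenAI capacity calculator.

```python
import math

def estimate_ptus(input_tpm: int, output_tpm: int,
                  tpm_per_ptu: int = 2_500,    # gpt-4o input TPM per PTU (from the table)
                  minimum: int = 15,           # global provisioned minimum deployment
                  increment: int = 5) -> int:  # global provisioned scale increment
    """Estimate PTUs for a workload: 1 output token counts as 3 input tokens."""
    effective_tpm = input_tpm + 3 * output_tpm         # weight output tokens 3x
    raw_ptus = math.ceil(effective_tpm / tpm_per_ptu)  # PTUs before rounding
    # Round up to the scale increment, then enforce the minimum deployment size.
    rounded = math.ceil(raw_ptus / increment) * increment
    return max(rounded, minimum)

# Example: 60,000 input TPM and 20,000 output TPM
# effective = 60,000 + 3 * 20,000 = 120,000 -> 48 PTUs -> rounded up to 50
print(estimate_ptus(60_000, 20_000))  # 50
```
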
#### Obtaining PTU quota
PTU quota is available by default in many regions. If more quota is required, customers can request it via the Request Quota link, found to the right of the designated provisioned deployment type quota tabs in Azure AI Foundry. The form allows the customer to request an increase in the specified PTU quota for a given region. The customer receives an email at the included address once the request is approved, typically within two business days.
#### Per-model PTU minimums
The minimum PTU deployment, increments, and processing capacity associated with each unit vary by model type and version.
### Capacity transparency
Azure OpenAI is a highly sought-after service where customer demand might exceed service GPU capacity. Microsoft strives to provide capacity for all in-demand regions and models, but selling out a region is always a possibility. This constraint can limit some customers' ability to create a deployment of their desired model, version, or number of PTUs in a desired region, even if they have quota available in that region. Generally speaking:
To find the capacity needed for your deployments, use the capacity API or the Azure AI Foundry deployment experience to get real-time information on capacity availability.
In Azure AI Foundry, the deployment experience identifies when a region lacks the capacity needed to deploy the model, based on the desired model, version, and number of PTUs. If capacity is unavailable, the experience directs users to select an alternative region.
Details on the deployment experience can be found in the Azure OpenAI [Provisioned get started guide](../how-to/provisioned-get-started.md).
The [model capacities API](/rest/api/aiservices/accountmanagement/model-capacities/list?view=rest-aiservices-accountmanagement-2024-04-01-preview&tabs=HTTP&preserve-view=true) can be used to programmatically identify the maximum sized deployment of a specified model. The API considers both your quota and service capacity in the region.
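
Following the linked REST reference (Model Capacities - List), a minimal Python sketch of how the request URL is composed. The endpoint path, API version, and query parameter names are taken from that reference but should be treated as assumptions; the subscription ID is a placeholder, and the actual GET request additionally needs an Azure AD bearer token.

```python
from urllib.parse import urlencode

def model_capacities_url(subscription_id: str, model_name: str,
                         model_version: str, model_format: str = "OpenAI") -> str:
    """Build the ARM URL for listing model capacities in a subscription."""
    base = (f"https://management.azure.com/subscriptions/{subscription_id}"
            "/providers/Microsoft.CognitiveServices/modelCapacities")
    query = urlencode({
        "api-version": "2024-04-01-preview",
        "modelFormat": model_format,
        "modelName": model_name,
        "modelVersion": model_version,
    })
    return f"{base}?{query}"

# Only URL construction is shown; send the GET with your preferred HTTP client
# and an Authorization: Bearer <token> header.
print(model_capacities_url("00000000-0000-0000-0000-000000000000",
                           "gpt-4o", "2024-08-06"))
```
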
If an acceptable region isn't available to support the desired model, version, and/or PTUs, customers can also try the following steps: