
Commit f324ad8

Merge pull request #2224 from mrbullwinkle/mrb_01_09_2025_quota_updates
[Azure OpenAI] Quota updates
2 parents ddb0435 + 664411b commit f324ad8

File tree

2 files changed (+29, −25 lines)


articles/ai-services/openai/how-to/quota.md

Lines changed: 18 additions & 14 deletions
@@ -7,7 +7,7 @@ author: mrbullwinkle
 manager: nitinme
 ms.service: azure-ai-openai
 ms.topic: how-to
-ms.date: 11/04/2024
+ms.date: 01/09/2025
 ms.author: mbullwin
 ---

@@ -18,18 +18,18 @@ Quota provides the flexibility to actively manage the allocation of rate limits
 ## Prerequisites
 
 > [!IMPORTANT]
-> For any task that requires viewing available quota we recommend using the **Cognitive Services Usages Reader** role. This role provides the minimal access necessary to view quota usage across an Azure subscription. To learn more about this role and the other roles you will need to access Azure OpenAI, consult our [Azure role-based access (Azure RBAC) guide](./role-based-access-control.md).
+> For any task that requires viewing available quota, we recommend using the **Cognitive Services Usages Reader** role. This role provides the minimal access necessary to view quota usage across an Azure subscription. To learn more about this role and the other roles you will need to access Azure OpenAI, consult our [Azure role-based access control guide](./role-based-access-control.md).
 >
-> This role can be found in the Azure portal under **Subscriptions** > **Access control (IAM)** > **Add role assignment** > search for **Cognitive Services Usages Reader**.This role **must be applied at the subscription level**, it does not exist at the resource level.
+> This role can be found in the Azure portal under **Subscriptions** > **Access control (IAM)** > **Add role assignment** > search for **Cognitive Services Usages Reader**. This role **must be applied at the subscription level**; it does not exist at the resource level.
 >
 > If you do not wish to use this role, the subscription **Reader** role will provide equivalent access, but it will also grant read access beyond the scope of what is needed for viewing quota and model deployment.
 
 ## Introduction to quota
 
-Azure OpenAI's quota feature enables assignment of rate limits to your deployments, up-to a global limit called your quota.” Quota is assigned to your subscription on a per-region, per-model basis in units of **Tokens-per-Minute (TPM)**. When you onboard a subscription to Azure OpenAI, you'll receive default quota for most available models. Then, you'll assign TPM to each deployment as it is created, and the available quota for that model will be reduced by that amount. You can continue to create deployments and assign them TPM until you reach your quota limit. Once that happens, you can only create new deployments of that model by reducing the TPM assigned to other deployments of the same model (thus freeing TPM for use), or by requesting and being approved for a model quota increase in the desired region.
+Azure OpenAI's quota feature enables assignment of rate limits to your deployments, up to a global limit called your *quota*. Quota is assigned to your subscription on a per-region, per-model basis in units of **Tokens-per-Minute (TPM)**. When you onboard a subscription to Azure OpenAI, you'll receive default quota for most available models. Then, you'll assign TPM to each deployment as it is created, and the available quota for that model will be reduced by that amount. You can continue to create deployments and assign them TPM until you reach your quota limit. Once that happens, you can only create new deployments of that model by reducing the TPM assigned to other deployments of the same model (thus freeing TPM for use), or by requesting and being approved for a model quota increase in the desired region.
 
 > [!NOTE]
-> With a quota of 240,000 TPM for GPT-35-Turbo in East US, a customer can create a single deployment of 240K TPM, 2 deployments of 120K TPM each, or any number of deployments in one or multiple Azure OpenAI resources as long as their TPM adds up to less than 240K total in that region.
+> With a quota of 240,000 TPM for GPT-35-Turbo in East US, a customer can create a single deployment of 240K TPM, two deployments of 120K TPM each, or any number of deployments in one or multiple Azure OpenAI resources as long as their TPM adds up to less than 240K total in that region.
 
 When a deployment is created, the assigned TPM will directly map to the tokens-per-minute rate limit enforced on its inferencing requests. A **Requests-Per-Minute (RPM)** rate limit will also be enforced whose value is set proportionally to the TPM assignment using the following ratio:
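The quota arithmetic in the note above, and the TPM-to-RPM proportionality just described, can be sketched as follows. This is an illustrative model only: the function names are invented for this sketch, and the 6-RPM-per-1,000-TPM ratio for older chat-completions models is taken from the quotas-limits changes later in this commit.

```python
# Hypothetical helpers illustrating per-region, per-model quota accounting.
# Values (240,000 TPM quota; 6 RPM per 1,000 TPM for older chat models)
# come from this article; the names here are illustrative, not an Azure API.

QUOTA_TPM = 240_000  # regional GPT-35-Turbo quota from the note above

def rpm_for(tpm: int, rpm_per_1k_tpm: int = 6) -> int:
    """RPM limit set proportionally to the TPM assignment."""
    return tpm * rpm_per_1k_tpm // 1_000

def can_deploy(existing_tpm: list[int], new_tpm: int, quota: int = QUOTA_TPM) -> bool:
    """A new deployment only fits if total assigned TPM stays within the regional quota."""
    return sum(existing_tpm) + new_tpm <= quota

print(rpm_for(120_000))                   # 720
print(can_deploy([120_000], 120_000))     # True: 240K total fits the quota
print(can_deploy([120_000, 120_000], 1))  # False: quota already exhausted
```

Note that o1-series models use different capacity ratios, covered in the quotas-limits file below, so the default ratio here applies only to older chat-completions models.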

@@ -53,6 +53,10 @@ Post deployment you can adjust your TPM allocation by selecting and editing your
 > [!IMPORTANT]
 > Quotas and limits are subject to change; for the most up-to-date information, consult our [quotas and limits article](../quotas-limits.md).
 
+## Request more quota
+
+Quota increase requests can be submitted via the [quota increase request form](https://aka.ms/oai/stuquotarequest). Due to high demand, quota increase requests are being accepted and will be filled in the order they're received. Priority is given to customers who generate traffic that consumes the existing quota allocation, and your request might be denied if this condition isn't met.
+
 ## Model specific settings
 
 Different model deployments, also called model classes, have unique max TPM values that you're now able to control. **This represents the maximum amount of TPM that can be allocated to that type of model deployment in a given region.**
@@ -71,7 +75,7 @@ For an all up view of your quota allocations across deployments in a given regio
 - **Deployment**: Model deployments divided by model class.
 - **Quota type**: There's one quota value per region for each model type. The quota covers all versions of that model.
 - **Quota allocation**: For the quota name, this shows how much quota is used by deployments and the total quota approved for this subscription and region. This amount of quota used is also represented in the bar graph.
-- **Request Quota**: The icon navigates to a form where requests to increase quota can be submitted.
+- **Request Quota**: The icon navigates to [this form](https://aka.ms/oai/stuquotarequest) where requests to increase quota can be submitted.
 
 ## Migrating existing deployments

@@ -92,7 +96,7 @@ As requests come into the deployment endpoint, the estimated max-processed-token
 > [!IMPORTANT]
 > The token count used in the rate limit calculation is an estimate based in part on the character count of the API request. The rate limit token estimate is not the same as the token calculation that is used for billing/determining that a request is below a model's input token limit. Due to the approximate nature of the rate limit token calculation, it is expected behavior that a rate limit can be triggered prior to what might be expected in comparison to an exact token count measurement for each request.
 
-RPM rate limits are based on the number of requests received over time. The rate limit expects that requests be evenly distributed over a one-minute period. If this average flow isn't maintained, then requests may receive a 429 response even though the limit isn't met when measured over the course of a minute. To implement this behavior, Azure OpenAI Service evaluates the rate of incoming requests over a small period of time, typically 1 or 10 seconds. If the number of requests received during that time exceeds what would be expected at the set RPM limit, then new requests will receive a 429 response code until the next evaluation period. For example, if Azure OpenAI is monitoring request rate on 1-second intervals, then rate limiting will occur for a 600-RPM deployment if more than 10 requests are received during each 1-second period (600 requests per minute = 10 requests per second).
+RPM rate limits are based on the number of requests received over time. The rate limit expects that requests be evenly distributed over a one-minute period. If this average flow isn't maintained, then requests might receive a 429 response even though the limit isn't met when measured over the course of a minute. To implement this behavior, Azure OpenAI Service evaluates the rate of incoming requests over a small period of time, typically 1 or 10 seconds. If the number of requests received during that time exceeds what would be expected at the set RPM limit, then new requests will receive a 429 response code until the next evaluation period. For example, if Azure OpenAI is monitoring request rate on 1-second intervals, then rate limiting will occur for a 600-RPM deployment if more than 10 requests are received during each 1-second period (600 requests per minute = 10 requests per second).
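The evaluation-window behavior described above can be modeled as a toy fixed-window counter. This is an illustration only, not Azure's actual enforcement logic; the 600-RPM and 10-requests-per-second figures come from the paragraph above.

```python
from collections import defaultdict

class FixedWindowLimiter:
    """Toy model of short-interval RPM enforcement: a 600-RPM deployment
    allows at most 600 / 60 = 10 requests in any 1-second window."""
    def __init__(self, rpm: int, window_s: int = 1):
        self.per_window = rpm * window_s // 60  # requests allowed per window
        self.window_s = window_s
        self.counts = defaultdict(int)

    def allow(self, t: float) -> bool:
        """Record a request arriving at time t (seconds); False models a 429."""
        window = int(t // self.window_s)
        self.counts[window] += 1
        return self.counts[window] <= self.per_window

limiter = FixedWindowLimiter(rpm=600)
# 12 requests bunched into the same second: the last 2 are rejected even
# though 12 requests over a full minute would be far below 600 RPM.
results = [limiter.allow(0.05 * i) for i in range(12)]
print(results.count(False))  # 2
```

A real client shouldn't reimplement this; the practical takeaway is to smooth request bursts and handle 429 responses with backoff, as the best practices below recommend.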
 
 ### Rate limit best practices

@@ -106,7 +110,7 @@ To minimize issues related to rate limits, it's a good idea to use the following
 
 ## Automate deployment
 
-This section contains brief example templates to help get you started programmatically creating deployments that use quota to set TPM rate limits. With the introduction of quota you must use API version `2023-05-01` for resource management related activities. This API version is only for managing your resources, and does not impact the API version used for inferencing calls like completions, chat completions, embedding, image generation etc.
+This section contains brief example templates to help get you started programmatically creating deployments that use quota to set TPM rate limits. With the introduction of quota, you must use API version `2023-05-01` for resource management related activities. This API version is only for managing your resources, and does not impact the API version used for inferencing calls like completions, chat completions, embedding, image generation, etc.
 
 # [REST](#tab/rest)

@@ -151,7 +155,7 @@ curl -X PUT https://management.azure.com/subscriptions/00000000-0000-0000-0000-0
 > [!NOTE]
 > There are multiple ways to generate an authorization token. The easiest method for initial testing is to launch the Cloud Shell from the [Azure portal](https://portal.azure.com). Then run [`az account get-access-token`](/cli/azure/account?view=azure-cli-latest#az-account-get-access-token&preserve-view=true). You can use this token as your temporary authorization token for API testing.
 
-For more information, refer to the REST API reference documentation for [usages](/rest/api/aiservices/accountmanagement/usages/list?branch=main&tabs=HTTP) and [deployment](/rest/api/aiservices/accountmanagement/deployments/create-or-update).
+For more information, see the REST API reference documentation for [usages](/rest/api/aiservices/accountmanagement/usages/list?branch=main&tabs=HTTP) and [deployment](/rest/api/aiservices/accountmanagement/deployments/create-or-update).
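As a sketch of the pattern the note describes: fetch a short-lived token, then query regional quota usage. The subscription ID is a placeholder, and the endpoint shape is assumed from the linked usages reference rather than confirmed here.

```shell
# Grab a bearer token (Cloud Shell or a local `az login` session), then
# list regional usage. Substitute your own subscription ID and region.
TOKEN=$(az account get-access-token --query accessToken -o tsv)
curl -s "https://management.azure.com/subscriptions/00000000-0000-0000-0000-000000000000/providers/Microsoft.CognitiveServices/locations/eastus/usages?api-version=2023-05-01" \
  -H "Authorization: Bearer $TOKEN"
```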

 ### Usage

@@ -201,7 +205,7 @@ az cognitiveservices account deployment create --model-format
     [--sku]
 ```
 
-To sign into your local installation of the CLI, run the [az login](/cli/azure/reference-index#az-login) command:
+To sign into your local installation of the CLI, run the [`az login`](/cli/azure/reference-index#az-login) command:
 
 ```azurecli
 az login
@@ -231,7 +235,7 @@ az cognitiveservices usage list -l eastus
 
 This command runs in the context of the currently active subscription for Azure CLI. Use `az account set --subscription` to [modify the active subscription](/cli/azure/manage-azure-subscriptions-azure-cli#change-the-active-subscription).
 
-For more details on `az cognitiveservices account` and `az cognitivesservices usage` consult the [Azure CLI reference documentation](/cli/azure/cognitiveservices/account/deployment?view=azure-cli-latest&preserve-view=true)
+For more information, see the [Azure CLI reference documentation](/cli/azure/cognitiveservices/account/deployment?view=azure-cli-latest&preserve-view=true).
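A minimal sequence tying the two commands together (the subscription name below is a placeholder; `--output table` is the standard Azure CLI output flag):

```shell
# Point the CLI at the subscription that holds your quota, then list
# region-level usage for that subscription.
az account set --subscription "My-OpenAI-Subscription"
az cognitiveservices usage list -l eastus --output table
```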

 # [Azure PowerShell](#tab/powershell)

@@ -328,7 +332,7 @@ For more details on `New-AzCognitiveServicesAccountDeployment` and `Get-AzCognit
 }
 ```
 
-For more details, consult the [full Azure Resource Manager reference documentation](/azure/templates/microsoft.cognitiveservices/accounts/deployments?pivots=deployment-language-arm-template).
+For more information, see the [full Azure Resource Manager reference documentation](/azure/templates/microsoft.cognitiveservices/accounts/deployments?pivots=deployment-language-arm-template).
 
 # [Bicep](#tab/bicep)

@@ -354,7 +358,7 @@ resource arm_je_std_deployment 'Microsoft.CognitiveServices/accounts/deployments
 }
 ```
 
-For more details consult the [full Bicep reference documentation](/azure/templates/microsoft.cognitiveservices/accounts/deployments?pivots=deployment-language-bicep).
+For more information, see the [full Bicep reference documentation](/azure/templates/microsoft.cognitiveservices/accounts/deployments?pivots=deployment-language-bicep).
 
 # [Terraform](#tab/terraform)

@@ -430,7 +434,7 @@ resource "azapi_resource" "TERRAFORM-AOAI-STD-DEPLOYMENT" {
 }
 ```
 
-For more details consult the [full Terraform reference documentation](/azure/templates/microsoft.cognitiveservices/accounts/deployments?pivots=deployment-language-terraform).
+For more information, see the [full Terraform reference documentation](/azure/templates/microsoft.cognitiveservices/accounts/deployments?pivots=deployment-language-terraform).
 
 ---

articles/ai-services/openai/quotas-limits.md

Lines changed: 11 additions & 11 deletions
@@ -10,7 +10,7 @@ ms.custom:
   - ignite-2023
   - references_regions
 ms.topic: conceptual
-ms.date: 11/11/2024
+ms.date: 01/09/2025
 ms.author: mbullwin
 ---

@@ -61,26 +61,26 @@ The following sections provide you with a quick guide to the default quotas and
 
 [!INCLUDE [Quota](./includes/global-batch-limits.md)]
 
-## o1-preview & o1-mini rate limits
+## o1 & o1-mini rate limits
 
 > [!IMPORTANT]
 > The ratio of RPM/TPM for quota with o1-series models works differently than older chat completions models:
 >
 > - **Older chat models:** 1 unit of capacity = 6 RPM and 1,000 TPM.
-> - **o1-preview:** 1 unit of capacity = 1 RPM and 6,000 TPM.
+> - **o1 & o1-preview:** 1 unit of capacity = 1 RPM and 6,000 TPM.
 > - **o1-mini:** 1 unit of capacity = 1 RPM and 10,000 TPM.
 >
 > This is particularly important for programmatic model deployment as this change in RPM/TPM ratio can result in accidental under-allocation of quota if one is still assuming the 1:1,000 ratio followed by older chat completion models.
 >
-> There is a known issue with the [quota/usages API](/rest/api/aiservices/accountmanagement/usages/list?view=rest-aiservices-accountmanagement-2024-06-01-preview&tabs=HTTP&preserve-view=true) where it assumes the old ratio applies to the new o1-series models. The API returns the correct base capacity number, but does not apply the correct ratio for the accurate calculation of TPM.
+> There is a known issue with the [quota/usages API](/rest/api/aiservices/accountmanagement/usages/list?view=rest-aiservices-accountmanagement-2024-06-01-preview&tabs=HTTP&preserve-view=true) where it assumes the old ratio applies to the new o1-series models. The API returns the correct base capacity number, but doesn't apply the correct ratio for the accurate calculation of TPM.
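Because the capacity ratios in the note differ by model family, programmatic quota math has to branch per family. A minimal sketch, assuming only the ratios quoted above (the dictionary and function names are illustrative, not an Azure API):

```python
# Capacity-unit ratios copied from the note above; the o1-series ratios
# differ from the older 6-RPM/1,000-TPM chat-completions ratio.
CAPACITY_RATIOS = {
    "older-chat": {"rpm": 6, "tpm": 1_000},   # 1 unit = 6 RPM and 1,000 TPM
    "o1":         {"rpm": 1, "tpm": 6_000},   # 1 unit = 1 RPM and 6,000 TPM
    "o1-mini":    {"rpm": 1, "tpm": 10_000},  # 1 unit = 1 RPM and 10,000 TPM
}

def limits_for(model_family: str, capacity_units: int) -> tuple[int, int]:
    """Return the (TPM, RPM) implied by a capacity-unit count."""
    r = CAPACITY_RATIOS[model_family]
    return capacity_units * r["tpm"], capacity_units * r["rpm"]

# 500 units of o1 capacity: matches the Default row in the table below.
print(limits_for("o1", 500))         # (3000000, 500)
# Assuming the old chat-completions ratio instead badly misstates TPM:
print(limits_for("older-chat", 500)) # (500000, 3000)
```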
 
-### o1-preview & o1-mini global standard
+### o1 & o1-mini global standard
 
 | Model|Tier| Quota Limit in tokens per minute (TPM) | Requests per minute |
 |---|---|:---:|:---:|
-| `o1-preview` | Enterprise agreement | 30 M | 5 K |
+| `o1` & `o1-preview` | Enterprise agreement | 30 M | 5 K |
 | `o1-mini`| Enterprise agreement | 50 M | 5 K |
-| `o1-preview` | Default | 3 M | 500 |
+| `o1` & `o1-preview` | Default | 3 M | 500 |
 | `o1-mini`| Default | 5 M | 500 |
 
 ### o1-preview & o1-mini standard
@@ -134,12 +134,12 @@ M = million | K = thousand
 
 #### Usage tiers
 
-Global standard deployments use Azure's global infrastructure, dynamically routing customer traffic to the data center with best availability for the customer’s inference requests. Similarly, Data zone standard deployments allow you to leverage Azure global infrastructure to dynamically route traffic to the data center within the Microsoft defined data zone with the best availability for each request. This enables more consistent latency for customers with low to medium levels of traffic. Customers with high sustained levels of usage might see more variability in response latency.
+Global standard deployments use Azure's global infrastructure, dynamically routing customer traffic to the data center with the best availability for the customer’s inference requests. Similarly, Data zone standard deployments allow you to leverage Azure global infrastructure to dynamically route traffic to the data center within the Microsoft-defined data zone with the best availability for each request. This enables more consistent latency for customers with low to medium levels of traffic. Customers with high sustained levels of usage might see greater variability in response latency.
 
 The Usage Limit determines the level of usage above which customers might see larger variability in response latency. A customer’s usage is defined per model and is the total tokens consumed across all deployments in all subscriptions in all regions for a given tenant.
 
 > [!NOTE]
-> Usage tiers only apply to standard, data zone standard, and global standard deployment types. Usage tiers do not apply to global batch and provisioned throughput deployments.
+> Usage tiers only apply to standard, data zone standard, and global standard deployment types. Usage tiers don't apply to global batch and provisioned throughput deployments.
 
 #### GPT-4o global standard, data zone standard, & standard

@@ -179,9 +179,9 @@ To minimize issues related to rate limits, it's a good idea to use the following
 - Test different load increase patterns.
 - Increase the quota assigned to your deployment. Move quota from another deployment, if necessary.
 
-### How to request increases to the default quotas and limits
+## How to request quota increases
 
-Quota increase requests can be submitted from the [Quotas](./how-to/quota.md) page in the Azure AI Foundry portal. Due to high demand, quota increase requests are being accepted and will be filled in the order they're received. Priority is given to customers who generate traffic that consumes the existing quota allocation, and your request might be denied if this condition isn't met.
+Quota increase requests can be submitted via the [quota increase request form](https://aka.ms/oai/stuquotarequest). Due to high demand, quota increase requests are being accepted and will be filled in the order they're received. Priority is given to customers who generate traffic that consumes the existing quota allocation, and your request might be denied if this condition isn't met.
 
 For other rate limits, [submit a service request](../cognitive-services-support-options.md?context=/azure/ai-services/openai/context/context).
