Commit dcc5c79

Merge pull request #5669 from mrbullwinkle/mrb_06_23_2025_quota
[Azure OpenAI] quota updates
2 parents a0fe262 + 98f2d3b commit dcc5c79

2 files changed (+34 −22 lines)

articles/ai-services/openai/how-to/quota.md

Lines changed: 23 additions & 10 deletions
@@ -16,11 +16,11 @@ Quota provides the flexibility to actively manage the allocation of rate limits
 
 ## Prerequisites
 
 > [!IMPORTANT]
-> For any task that requires viewing available quota we recommend using the **Cognitive Services Usages Reader** role. This role provides the minimal access necessary to view quota usage across an Azure subscription. To learn more about this role and the other roles you will need to access Azure OpenAI, consult our [Azure role-based access control guide](./role-based-access-control.md).
+> For any task that requires viewing available quota, we recommend using the **Cognitive Services Usages Reader** role. This role provides the minimal access necessary to view quota usage across an Azure subscription. To learn more about this role and the other roles you'll need to access Azure OpenAI, consult our [Azure role-based access control guide](./role-based-access-control.md).
 >
-> This role can be found in the Azure portal under **Subscriptions** > **Access control (IAM)** > **Add role assignment** > search for **Cognitive Services Usages Reader**. This role **must be applied at the subscription level**, it does not exist at the resource level.
+> This role can be found in the Azure portal under **Subscriptions** > **Access control (IAM)** > **Add role assignment** > search for **Cognitive Services Usages Reader**. This role **must be applied at the subscription level**; it doesn't exist at the resource level.
 >
-> If you do not wish to use this role, the subscription **Reader** role will provide equivalent access, but it will also grant read access beyond the scope of what is needed for viewing quota and model deployment.
+> If you don't wish to use this role, the subscription **Reader** role will provide equivalent access, but it will also grant read access beyond the scope of what is needed for viewing quota and model deployment.
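For readers following along, a minimal Azure CLI sketch of the subscription-level assignment the note describes; the assignee and subscription ID are placeholders:

```bash
# Assign Cognitive Services Usages Reader at subscription scope (placeholder values shown).
az role assignment create \
  --assignee "user@example.com" \
  --role "Cognitive Services Usages Reader" \
  --scope "/subscriptions/<subscription-id>"
```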
## Introduction to quota

@@ -31,7 +31,20 @@ Azure OpenAI's quota feature enables assignment of rate limits to your deploymen
 
 When a deployment is created, the assigned TPM will directly map to the tokens-per-minute rate limit enforced on its inferencing requests. A **Requests-Per-Minute (RPM)** rate limit will also be enforced whose value is set proportionally to the TPM assignment using the following ratio:
 
-6 RPM per 1000 TPM. (This ratio can vary by model for more information, see [quota, and limits](../quotas-limits.md#o-series-rate-limits).)
+> [!IMPORTANT]
+> The ratio of Requests Per Minute (RPM) to Tokens Per Minute (TPM) for quota can vary by model. When you deploy a model programmatically or [request a quota increase](https://aka.ms/oai/stuquotarequest) you don't have granular control over TPM and RPM as independent values. Quota is allocated in terms of units of capacity, which have corresponding amounts of RPM and TPM:
+>
+> | Model | Capacity | Requests Per Minute (RPM) | Tokens Per Minute (TPM) |
+> |---|:---:|:---:|:---:|
+> | **Older chat models** | 1 unit | 6 RPM | 1,000 TPM |
+> | **o1 & o1-preview** | 1 unit | 1 RPM | 6,000 TPM |
+> | **o3** | 1 unit | 1 RPM | 1,000 TPM |
+> | **o4-mini** | 1 unit | 1 RPM | 1,000 TPM |
+> | **o3-mini** | 1 unit | 1 RPM | 10,000 TPM |
+> | **o1-mini** | 1 unit | 1 RPM | 10,000 TPM |
+> | **o3-pro** | 1 unit | 1 RPM | 10,000 TPM |
+>
+> This is particularly important for programmatic model deployment as changes in RPM/TPM ratio can result in accidental misallocation of quota. For more information, see [quotas and limits](../quotas-limits.md#o-series-rate-limits).
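Concretely, the table above means capacity units multiply out differently per model. A small illustrative calculation (the per-unit ratios come from the table; the capacity value is an arbitrary example):

```bash
# 50 units of o3-mini => 50 RPM and 500,000 TPM, per the 1 RPM / 10,000 TPM ratio above.
capacity=50
rpm_per_unit=1
tpm_per_unit=10000
echo "RPM: $((capacity * rpm_per_unit))"   # 50
echo "TPM: $((capacity * tpm_per_unit))"   # 500000
```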
The flexibility to distribute TPM globally within a subscription and region has allowed Azure OpenAI to loosen other restrictions:

@@ -62,7 +75,7 @@ Different model deployments, also called model classes have unique max TPM value
 All other model classes have a common max TPM value.
 
 > [!NOTE]
-> Quota Tokens-Per-Minute (TPM) allocation is not related to the max input token limit of a model. Model input token limits are defined in the [models table](../concepts/models.md) and are not impacted by changes made to TPM.
+> Quota Tokens-Per-Minute (TPM) allocation isn't related to the max input token limit of a model. Model input token limits are defined in the [models table](../concepts/models.md) and aren't impacted by changes made to TPM.
 
 ## View and request quota

@@ -92,7 +105,7 @@ As each request is received, Azure OpenAI computes an estimated max processed-to
 As requests come into the deployment endpoint, the estimated max-processed-token count is added to a running token count of all requests that is reset each minute. If at any time during that minute, the TPM rate limit value is reached, then further requests will receive a 429 response code until the counter resets.
 
 > [!IMPORTANT]
-> The token count used in the rate limit calculation is an estimate based in part on the character count of the API request. The rate limit token estimate is not the same as the token calculation that is used for billing/determining that a request is below a model's input token limit. Due to the approximate nature of the rate limit token calculation, it is expected behavior that a rate limit can be triggered prior to what might be expected in comparison to an exact token count measurement for each request.
+> The token count used in the rate limit calculation is an estimate based in part on the character count of the API request. The rate limit token estimate isn't the same as the token calculation that is used for billing/determining that a request is below a model's input token limit. Due to the approximate nature of the rate limit token calculation, it's expected behavior that a rate limit can be triggered prior to what might be expected in comparison to an exact token count measurement for each request.
RPM rate limits are based on the number of requests received over time. The rate limit expects that requests be evenly distributed over a one-minute period. If this average flow isn't maintained, then requests might receive a 429 response even though the limit isn't met when measured over the course of a minute. To implement this behavior, Azure OpenAI evaluates the rate of incoming requests over a small period of time, typically 1 or 10 seconds. If the number of requests received during that time exceeds what would be expected at the set RPM limit, then new requests will receive a 429 response code until the next evaluation period. For example, if Azure OpenAI is monitoring request rate on 1-second intervals, then rate limiting will occur for a 600-RPM deployment if more than 10 requests are received during each 1-second period (600 requests per minute = 10 requests per second).
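The paragraphs above suggest the obvious client-side mitigation: back off and retry when a 429 is returned. A minimal sketch, assuming a chat completions deployment; the resource name, deployment name, and key variable are placeholders:

```bash
# Retry with exponential backoff when the deployment returns HTTP 429.
for attempt in 1 2 3 4 5; do
  status=$(curl -s -o response.json -w "%{http_code}" \
    "https://<resource-name>.openai.azure.com/openai/deployments/<deployment-name>/chat/completions?api-version=2024-02-01" \
    -H "Content-Type: application/json" \
    -H "api-key: $AZURE_OPENAI_API_KEY" \
    -d '{"messages":[{"role":"user","content":"Hello"}]}')
  [ "$status" != "429" ] && break      # success or a non-rate-limit error: stop retrying
  sleep $((2 ** attempt))              # wait 2, 4, 8, ... seconds before the next attempt
done
```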

@@ -108,7 +121,7 @@ To minimize issues related to rate limits, it's a good idea to use the following
 
 ## Automate deployment
 
-This section contains brief example templates to help get you started programmatically creating deployments that use quota to set TPM rate limits. With the introduction of quota you must use API version `2023-05-01` for resource management related activities. This API version is only for managing your resources, and does not impact the API version used for inferencing calls like completions, chat completions, embedding, image generation, etc.
+This section contains brief example templates to help get you started programmatically creating deployments that use quota to set TPM rate limits. With the introduction of quota, you must use API version `2023-05-01` for resource management related activities. This API version is only for managing your resources, and doesn't impact the API version used for inferencing calls like completions, chat completions, embedding, image generation, etc.
 
# [REST](#tab/rest)

@@ -139,7 +152,7 @@ This is only a subset of the available request body parameters. For the full lis
 |Parameter|Type| Description |
 |--|--|--|
 |sku | Sku | The resource model definition representing SKU.|
-|capacity|integer|This represents the amount of [quota](../how-to/quota.md) you are assigning to this deployment. A value of 1 equals 1,000 Tokens per Minute (TPM). A value of 10 equals 10k Tokens per Minute (TPM).|
+|capacity|integer|This represents the amount of [quota](../how-to/quota.md) you're assigning to this deployment. A value of 1 equals 1,000 Tokens per Minute (TPM). A value of 10 equals 10,000 Tokens per Minute (TPM).|
 
#### Example request
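The example body itself is unchanged and elided from this diff. For orientation, a sketch of the documented PUT request shape; the subscription, resource group, account, deployment names, and model values are placeholders:

```bash
# Create a deployment with 10 units of capacity (10,000 TPM for older chat models).
curl -X PUT "https://management.azure.com/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.CognitiveServices/accounts/<account-name>/deployments/<deployment-name>?api-version=2023-05-01" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AZURE_AUTH_TOKEN" \
  -d '{"sku": {"name": "Standard", "capacity": 10}, "properties": {"model": {"format": "OpenAI", "name": "gpt-35-turbo", "version": "0613"}}}'
```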

@@ -186,7 +199,7 @@ curl -X GET https://management.azure.com/subscriptions/00000000-0000-0000-0000-0
 
 Install the [Azure CLI](/cli/azure/install-azure-cli). Quota requires `Azure CLI version 2.51.0`. If you already have Azure CLI installed locally run `az upgrade` to update to the latest version.
 
-To check which version of Azure CLI you are running use `az version`. Azure Cloud Shell is currently still running 2.50.0 so in the interim local installation of Azure CLI is required to take advantage of the latest Azure OpenAI features.
+To check which version of Azure CLI you're running, use `az version`. Azure Cloud Shell is currently still running 2.50.0, so in the interim a local installation of Azure CLI is required to take advantage of the latest Azure OpenAI features.
 
 ### Deployment
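The CLI example under this heading is elided from the diff. A sketch of the documented pattern, with placeholder resource names and an assumed model version:

```bash
# Deploy gpt-35-turbo with 10 capacity units via the Cognitive Services CLI commands.
az cognitiveservices account deployment create \
  --name "<azure-openai-resource-name>" \
  --resource-group "<resource-group>" \
  --deployment-name "gpt-35-turbo-deployment" \
  --model-name "gpt-35-turbo" \
  --model-version "0613" \
  --model-format OpenAI \
  --sku-capacity 10 \
  --sku-name Standard
```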

@@ -239,7 +252,7 @@ For more information, see the [Azure CLI reference documentation](/cli/azure/cog
 
 Install the latest version of the [Az PowerShell module](/powershell/azure/install-azure-powershell). If you already have the Az PowerShell module installed locally, run `Update-Module -Name Az` to update to the latest version.
 
-To check which version of the Az PowerShell module you are running, use `Get-InstalledModule -Name Az`. Azure Cloud Shell is currently running a version of Azure PowerShell that can take advantage of the latest Azure OpenAI features.
+To check which version of the Az PowerShell module you're running, use `Get-InstalledModule -Name Az`. Azure Cloud Shell is currently running a version of Azure PowerShell that can take advantage of the latest Azure OpenAI features.
 
 ### Deployment

articles/ai-services/openai/quotas-limits.md

Lines changed: 11 additions & 12 deletions
@@ -99,30 +99,29 @@ The following sections provide you with a quick guide to the default quotas and
 | `model-router` (2025-05-19) | Enterprise Tier | 10 M | 10 K |
 | `model-router` (2025-05-19) | Default | 1 M | 1 K |
-
 ## computer-use-preview global standard rate limits
 
 | Model|Tier| Quota Limit in tokens per minute (TPM) | Requests per minute |
 |---|---|:---:|:---:|
 | `computer-use-preview`| Enterprise Tier | 30 M | 300 K |
 | `computer-use-preview`| Default | 450 K | 4.5 K |
-
 ## o-series rate limits
 
 > [!IMPORTANT]
-> The ratio of RPM/TPM for quota with o1-series models works differently than older chat completions models:
->
-> - **Older chat models:** 1 unit of capacity = 6 RPM and 1,000 TPM.
-> - **o1 & o1-preview:** 1 unit of capacity = 1 RPM and 6,000 TPM.
-> - **o3** 1 unit of capacity = 1 RPM per 1,000 TPM
-> - **o4-mini** 1 unit of capacity = 1 RPM per 1,000 TPM
-> - **o3-mini:** 1 unit of capacity = 1 RPM per 10,000 TPM.
-> - **o1-mini:** 1 unit of capacity = 1 RPM per 10,000 TPM.
-> This is particularly important for programmatic model deployment as this change in RPM/TPM ratio can result in accidental under allocation of quota if one is still assuming the 1:1000 ratio followed by older chat completion models.
-> There's a known issue with the [quota/usages API](/rest/api/aiservices/accountmanagement/usages/list?view=rest-aiservices-accountmanagement-2024-06-01-preview&tabs=HTTP&preserve-view=true) where it assumes the old ratio applies to the new o1-series models. The API returns the correct base capacity number, but doesn't apply the correct ratio for the accurate calculation of TPM.
+> The ratio of Requests Per Minute (RPM) to Tokens Per Minute (TPM) for quota can vary by model. When you deploy a model programmatically or [request a quota increase](https://aka.ms/oai/stuquotarequest) you don't have granular control over TPM and RPM as independent values. Quota is allocated in terms of units of capacity, which have corresponding amounts of RPM and TPM:
+>
+> | Model | Capacity | Requests Per Minute (RPM) | Tokens Per Minute (TPM) |
+> |---|:---:|:---:|:---:|
+> | **Older chat models** | 1 unit | 6 RPM | 1,000 TPM |
+> | **o1 & o1-preview** | 1 unit | 1 RPM | 6,000 TPM |
+> | **o3** | 1 unit | 1 RPM | 1,000 TPM |
+> | **o4-mini** | 1 unit | 1 RPM | 1,000 TPM |
+> | **o3-mini** | 1 unit | 1 RPM | 10,000 TPM |
+> | **o1-mini** | 1 unit | 1 RPM | 10,000 TPM |
+> | **o3-pro** | 1 unit | 1 RPM | 10,000 TPM |
+>
+> This is particularly important for programmatic model deployment as changes in RPM/TPM ratio can result in accidental misallocation of quota.
 
 ### o-series global standard