articles/ai-services/openai/how-to/quota.md (+14 −1)
@@ -31,7 +31,20 @@ Azure OpenAI's quota feature enables assignment of rate limits to your deploymen
 When a deployment is created, the assigned TPM will directly map to the tokens-per-minute rate limit enforced on its inferencing requests. A **Requests-Per-Minute (RPM)** rate limit will also be enforced whose value is set proportionally to the TPM assignment using the following ratio:
 
-6 RPM per 1000 TPM. (This ratio can vary by model. For more information, see [quota, and limits](../quotas-limits.md#o-series-rate-limits).)
+> [!IMPORTANT]
+> The ratio of Requests Per Minute (RPM) to Tokens Per Minute (TPM) for quota can vary by model. When you deploy a model programmatically or [request a quota increase](https://aka.ms/oai/stuquotarequest), you don't have granular control over TPM and RPM as independent values. Quota is allocated in units of capacity, which have corresponding amounts of RPM and TPM:
+>
+> | Model | Capacity | Requests Per Minute (RPM) | Tokens Per Minute (TPM) |
+> |---|---|:---:|:---:|
+> | Older chat models | 1 unit | 6 RPM | 1,000 TPM |
+> | `o1` & `o1-preview` | 1 unit | 1 RPM | 6,000 TPM |
+> | `o3` | 1 unit | 1 RPM | 1,000 TPM |
+> | `o4-mini` | 1 unit | 1 RPM | 1,000 TPM |
+> | `o3-mini` | 1 unit | 1 RPM | 10,000 TPM |
+> | `o1-mini` | 1 unit | 1 RPM | 10,000 TPM |
+>
+> This is particularly important for programmatic model deployment, as changes in the RPM/TPM ratio can result in accidental misallocation of quota. For more information, see [quota, and limits](../quotas-limits.md#o-series-rate-limits).
 
 The flexibility to distribute TPM globally within a subscription and region has allowed Azure OpenAI to loosen other restrictions:
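The capacity-to-rate-limit mapping described above is simple multiplication: each unit of capacity grants a fixed per-model amount of RPM and TPM. A minimal sketch in Python, using the ratios stated in this diff (the model keys and the `limits_for` helper are illustrative, not part of any Azure SDK):

```python
# RPM and TPM granted per unit of capacity, per the ratios in the note above.
# Keys are illustrative labels, not official model identifiers.
CAPACITY_RATIOS = {
    "older-chat": (6, 1_000),   # 1 unit = 6 RPM and 1,000 TPM
    "o1": (1, 6_000),           # 1 unit = 1 RPM and 6,000 TPM
    "o3": (1, 1_000),
    "o4-mini": (1, 1_000),
    "o3-mini": (1, 10_000),
    "o1-mini": (1, 10_000),
}

def limits_for(model: str, units: int) -> tuple[int, int]:
    """Return the (RPM, TPM) rate limits granted by `units` of capacity."""
    rpm_per_unit, tpm_per_unit = CAPACITY_RATIOS[model]
    return units * rpm_per_unit, units * tpm_per_unit

# The same 10 units of capacity yield very different limits per model:
print(limits_for("older-chat", 10))  # (60, 10000)
print(limits_for("o1", 10))          # (10, 60000)
```

This is why treating TPM and RPM as independently tunable values fails: both are derived from the single capacity figure.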
articles/ai-services/openai/quotas-limits.md (+11 −12)
@@ -99,30 +99,29 @@ The following sections provide you with a quick guide to the default quotas and
 |`model-router` (2025-05-19) | Enterprise Tier | 10 M | 10 K |
 |`model-router` (2025-05-19) | Default | 1 M | 1 K |
 
-
 ## computer-use-preview global standard rate limits
 
 | Model|Tier| Quota Limit in tokens per minute (TPM) | Requests per minute |
 |---|---|:---:|:---:|
 |`computer-use-preview`| Enterprise Tier | 30 M | 300 K |
 |`computer-use-preview`| Default | 450 K | 4.5 K |
 
-
 ## o-series rate limits
 
 > [!IMPORTANT]
-> The ratio of RPM/TPM for quota with o1-series models works differently than older chat completions models:
->
-> - **Older chat models:** 1 unit of capacity = 6 RPM and 1,000 TPM.
-> - **o1 & o1-preview:** 1 unit of capacity = 1 RPM and 6,000 TPM.
-> - **o3:** 1 unit of capacity = 1 RPM and 1,000 TPM.
-> - **o4-mini:** 1 unit of capacity = 1 RPM and 1,000 TPM.
-> - **o3-mini:** 1 unit of capacity = 1 RPM and 10,000 TPM.
-> - **o1-mini:** 1 unit of capacity = 1 RPM and 10,000 TPM.
+> The ratio of Requests Per Minute (RPM) to Tokens Per Minute (TPM) for quota can vary by model. When you deploy a model programmatically or [request a quota increase](https://aka.ms/oai/stuquotarequest), you don't have granular control over TPM and RPM as independent values. Quota is allocated in units of capacity, which have corresponding amounts of RPM and TPM:
 >
-> This is particularly important for programmatic model deployment as this change in RPM/TPM ratio can result in accidental under allocation of quota if one is still assuming the 1:1000 ratio followed by older chat completion models.
+> | Model | Capacity | Requests Per Minute (RPM) | Tokens Per Minute (TPM) |
+> |---|---|:---:|:---:|
+> | Older chat models | 1 unit | 6 RPM | 1,000 TPM |
+> | `o1` & `o1-preview` | 1 unit | 1 RPM | 6,000 TPM |
+> | `o3` | 1 unit | 1 RPM | 1,000 TPM |
+> | `o4-mini` | 1 unit | 1 RPM | 1,000 TPM |
+> | `o3-mini` | 1 unit | 1 RPM | 10,000 TPM |
+> | `o1-mini` | 1 unit | 1 RPM | 10,000 TPM |
 >
 > There's a known issue with the [quota/usages API](/rest/api/aiservices/accountmanagement/usages/list?view=rest-aiservices-accountmanagement-2024-06-01-preview&tabs=HTTP&preserve-view=true) where it assumes the old ratio applies to the new o1-series models. The API returns the correct base capacity number, but doesn't apply the correct ratio for the accurate calculation of TPM.
+>
+> This is particularly important for programmatic model deployment, as changes in the RPM/TPM ratio can result in accidental misallocation of quota.
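The misallocation risk the note warns about comes from converting a TPM target into capacity units under the wrong ratio. A small sketch of the arithmetic, using the per-unit TPM values from this diff (model keys and the `units_needed` helper are illustrative):

```python
import math

# TPM granted per unit of capacity, per the ratios above (illustrative keys).
TPM_PER_UNIT = {"older-chat": 1_000, "o1": 6_000, "o3-mini": 10_000}

def units_needed(model: str, target_tpm: int) -> int:
    """Units of capacity required to reach `target_tpm` for `model`."""
    return math.ceil(target_tpm / TPM_PER_UNIT[model])

# Assuming the older 1:1000 ratio for every model would request 10x the
# capacity actually needed to reach 100,000 TPM on o3-mini:
print(units_needed("older-chat", 100_000))  # 100
print(units_needed("o3-mini", 100_000))     # 10
```

Computing units per model from the published ratios, rather than hard-coding the legacy 1:1000 assumption, avoids both over- and under-allocation.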