
Commit cbf810d ("update"), 1 parent: 887b379

File tree: 2 files changed (+25, -13 lines)


articles/ai-services/openai/how-to/quota.md

Lines changed: 14 additions & 1 deletion
@@ -31,7 +31,20 @@ Azure OpenAI's quota feature enables assignment of rate limits to your deploymen
 When a deployment is created, the assigned TPM will directly map to the tokens-per-minute rate limit enforced on its inferencing requests. A **Requests-Per-Minute (RPM)** rate limit will also be enforced whose value is set proportionally to the TPM assignment using the following ratio:
 
-6 RPM per 1000 TPM. (This ratio can vary by model for more information, see [quota, and limits](../quotas-limits.md#o-series-rate-limits).)
+> [!IMPORTANT]
+> The ratio of Requests Per Minute (RPM) to Tokens Per Minute (TPM) for quota can vary by model. When you deploy a model programmatically or [request a quota increase](https://aka.ms/oai/stuquotarequest), you don't have granular control over TPM and RPM as independent values. Quota is allocated in units of capacity, each of which corresponds to a fixed amount of RPM and TPM:
+>
+> | Model | Capacity | Requests Per Minute (RPM) | Tokens Per Minute (TPM) |
+> |------------------------|:----------:|:--------------------------:|:-----------------------:|
+> | **Older chat models** | 1 unit | 6 RPM | 1,000 TPM |
+> | **o1 & o1-preview** | 1 unit | 1 RPM | 6,000 TPM |
+> | **o3** | 1 unit | 1 RPM | 1,000 TPM |
+> | **o4-mini** | 1 unit | 1 RPM | 1,000 TPM |
+> | **o3-mini** | 1 unit | 1 RPM | 10,000 TPM |
+> | **o1-mini** | 1 unit | 1 RPM | 10,000 TPM |
+> | **o3-pro** | 1 unit | 1 RPM | 10,000 TPM |
+>
+> This is particularly important for programmatic model deployment, because changes in the RPM/TPM ratio can result in accidental misallocation of quota. For more information, see [quotas and limits](../quotas-limits.md#o-series-rate-limits).
 
 The flexibility to distribute TPM globally within a subscription and region has allowed Azure OpenAI to loosen other restrictions:
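The capacity table added above can be read as a lookup from model family to per-unit (RPM, TPM). A minimal sketch of that mapping, assuming illustrative model keys and a hypothetical `limits_for` helper (this is not an Azure SDK call; the per-unit values come from the table):

```python
# Per-unit ratios from the quota table above. Keys are illustrative
# labels, not official deployment model names.
CAPACITY_RATIOS = {
    "older-chat": (6, 1_000),   # 6 RPM and 1,000 TPM per capacity unit
    "o1": (1, 6_000),
    "o3": (1, 1_000),
    "o4-mini": (1, 1_000),
    "o3-mini": (1, 10_000),
    "o1-mini": (1, 10_000),
    "o3-pro": (1, 10_000),
}

def limits_for(model: str, units: int) -> tuple[int, int]:
    """Return the (RPM, TPM) that a given number of capacity units buys."""
    rpm_per_unit, tpm_per_unit = CAPACITY_RATIOS[model]
    return units * rpm_per_unit, units * tpm_per_unit

print(limits_for("older-chat", 10))  # (60, 10000)
print(limits_for("o1", 10))          # (10, 60000)
```

Note how the same 10 units yield very different limits: 60 RPM / 10,000 TPM for an older chat model versus 10 RPM / 60,000 TPM for o1, which is why the ratio cannot be assumed constant across models.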

articles/ai-services/openai/quotas-limits.md

Lines changed: 11 additions & 12 deletions
@@ -99,30 +99,29 @@ The following sections provide you with a quick guide to the default quotas and
 | `model-router` (2025-05-19) | Enterprise Tier | 10 M | 10 K |
 | `model-router` (2025-05-19) | Default | 1 M | 1 K |
 
-
 ## computer-use-preview global standard rate limits
 
 | Model | Tier | Quota limit in tokens per minute (TPM) | Requests per minute |
 |---|---|:---:|:---:|
 | `computer-use-preview` | Enterprise Tier | 30 M | 300 K |
 | `computer-use-preview` | Default | 450 K | 4.5 K |
 
-
 ## o-series rate limits
 
 > [!IMPORTANT]
-> The ratio of RPM/TPM for quota with o1-series models works differently than older chat completions models:
->
-> - **Older chat models:** 1 unit of capacity = 6 RPM and 1,000 TPM.
-> - **o1 & o1-preview:** 1 unit of capacity = 1 RPM and 6,000 TPM.
-> - **o3** 1 unit of capacity = 1 RPM per 1,000 TPM
-> - **o4-mini** 1 unit of capacity = 1 RPM per 1,000 TPM
-> - **o3-mini:** 1 unit of capacity = 1 RPM per 10,000 TPM.
-> - **o1-mini:** 1 unit of capacity = 1 RPM per 10,000 TPM.
+> The ratio of Requests Per Minute (RPM) to Tokens Per Minute (TPM) for quota can vary by model. When you deploy a model programmatically or [request a quota increase](https://aka.ms/oai/stuquotarequest), you don't have granular control over TPM and RPM as independent values. Quota is allocated in units of capacity, each of which corresponds to a fixed amount of RPM and TPM:
 >
-> This is particularly important for programmatic model deployment as this change in RPM/TPM ratio can result in accidental under allocation of quota if one is still assuming the 1:1000 ratio followed by older chat completion models.
+> | Model | Capacity | Requests Per Minute (RPM) | Tokens Per Minute (TPM) |
+> |------------------------|:----------:|:--------------------------:|:-----------------------:|
+> | **Older chat models** | 1 unit | 6 RPM | 1,000 TPM |
+> | **o1 & o1-preview** | 1 unit | 1 RPM | 6,000 TPM |
+> | **o3** | 1 unit | 1 RPM | 1,000 TPM |
+> | **o4-mini** | 1 unit | 1 RPM | 1,000 TPM |
+> | **o3-mini** | 1 unit | 1 RPM | 10,000 TPM |
+> | **o1-mini** | 1 unit | 1 RPM | 10,000 TPM |
+> | **o3-pro** | 1 unit | 1 RPM | 10,000 TPM |
 >
-> There's a known issue with the [quota/usages API](/rest/api/aiservices/accountmanagement/usages/list?view=rest-aiservices-accountmanagement-2024-06-01-preview&tabs=HTTP&preserve-view=true) where it assumes the old ratio applies to the new o1-series models. The API returns the correct base capacity number, but doesn't apply the correct ratio for the accurate calculation of TPM.
+> This is particularly important for programmatic model deployment, because changes in the RPM/TPM ratio can result in accidental misallocation of quota.
 
 ### o-series global standard
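The misallocation risk the warning describes can be made concrete: converting a TPM target into capacity units under the legacy 1,000-TPM-per-unit assumption requests far more units than a 10,000-TPM-per-unit model actually needs. A hedged sketch under that assumption (the model keys and the `units_needed` helper are illustrative, not part of any Azure API):

```python
import math

# TPM granted per capacity unit, per the quota table above.
# Keys are illustrative labels, not official deployment model names.
TPM_PER_UNIT = {
    "older-chat": 1_000,
    "o1": 6_000,
    "o3": 1_000,
    "o4-mini": 1_000,
    "o3-mini": 10_000,
    "o1-mini": 10_000,
    "o3-pro": 10_000,
}

def units_needed(model: str, target_tpm: int) -> int:
    """Capacity units required to reach at least target_tpm."""
    return math.ceil(target_tpm / TPM_PER_UNIT[model])

# Reaching 50,000 TPM on o3-mini takes 5 units; assuming the legacy
# older-chat ratio would request 50 units, ten times too many.
print(units_needed("o3-mini", 50_000))     # 5
print(units_needed("older-chat", 50_000))  # 50
```

The same arithmetic explains the opposite failure mode for o1-style models: an allocation sized with the legacy ratio but applied to a 6,000-TPM-per-unit model would under-request by a factor of six.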
