Commit 3ffda5f

Merge pull request #789 from mrbullwinkle/mrb_10_11_2024_quota
[Azure OpenAI] o1-series quota clarification
2 parents: d17bca8 + 71eab64

File tree

1 file changed: 12 additions, 1 deletion

articles/ai-services/openai/quotas-limits.md

Lines changed: 12 additions & 1 deletion
@@ -10,7 +10,7 @@ ms.custom:
 - ignite-2023
 - references_regions
 ms.topic: conceptual
-ms.date: 10/10/2024
+ms.date: 10/11/2024
 ms.author: mbullwin
 ---

@@ -62,6 +62,17 @@ The following sections provide you with a quick guide to the default quotas and
 
 ## o1-preview & o1-mini rate limits
 
+> [!IMPORTANT]
+> The ratio of RPM/TPM for quota with the o1-series models works differently than it does for older chat completions models:
+>
+> - **Older chat models:** 1 unit of capacity = 6 RPM and 1,000 TPM.
+> - **o1-preview:** 1 unit of capacity = 1 RPM and 6,000 TPM.
+> - **o1-mini:** 1 unit of capacity = 1 RPM and 10,000 TPM.
+>
+> This is particularly important for programmatic model deployment, because the change in RPM/TPM ratio can result in accidental under-allocation of quota if you still assume the 1:1,000 ratio used by older chat completions models.
+>
+> There is a known issue with the [quota/usages API](/rest/api/aiservices/accountmanagement/usages/list?view=rest-aiservices-accountmanagement-2024-06-01-preview&tabs=HTTP&preserve-view=true) where it assumes the old ratio applies to the new o1-series models. The API returns the correct base capacity number, but does not apply the correct ratio for the accurate calculation of TPM.
+
 ### o1-preview & o1-mini global standard
 
 | Model|Tier| Quota Limit in tokens per minute (TPM) | Requests per minute |
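
To make the capacity arithmetic in the new note concrete, here is a minimal Python sketch. The ratio table and helper function are illustrative assumptions (not part of any Azure SDK or REST API); it simply converts granted capacity units into the RPM/TPM limits described above and shows how a script that still multiplies capacity by 1,000 understates the TPM available to o1-series deployments.

```python
# Minimal sketch of the capacity-unit arithmetic described in the note above.
# The ratio table and function below are illustrative assumptions, not Azure APIs.

CAPACITY_RATIOS = {                # 1 unit of capacity -> (RPM, TPM)
    "older-chat-models": (6, 1_000),
    "o1-preview": (1, 6_000),
    "o1-mini": (1, 10_000),
}

def limits_from_capacity(model_family: str, capacity_units: int) -> tuple[int, int]:
    """Return the (RPM, TPM) limits implied by a number of capacity units."""
    rpm_per_unit, tpm_per_unit = CAPACITY_RATIOS[model_family]
    return capacity_units * rpm_per_unit, capacity_units * tpm_per_unit

# A script that still assumes 1 unit = 1,000 TPM understates the TPM that
# o1-series capacity actually grants, which is how quota ends up under-allocated.
for family in ("older-chat-models", "o1-preview", "o1-mini"):
    rpm, tpm = limits_from_capacity(family, 10)
    print(f"{family}: 10 units -> {rpm} RPM, {tpm:,} TPM "
          f"(old 1:1,000 assumption would report {10 * 1_000:,} TPM)")
```

While the known issue with the quota/usages API is in effect, the same correction applies when reading values back from that API: treat the returned number as base capacity and apply the o1-series ratio yourself rather than the old 1:1,000 multiplier.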
