Commit cc37f78

Merge pull request #6580 from swingfu/quota&region/update
Update quota description for ADMs
2 parents 9ffad51 + 886437c commit cc37f78

1 file changed

+35
-27
lines changed

articles/ai-foundry/foundry-models/quotas-limits.md

Lines changed: 35 additions & 27 deletions
Original file line number | Diff line number | Diff line change
@@ -6,19 +6,19 @@ author: msakande
66
ms.service: azure-ai-model-inference
77
ms.custom: ignite-2024, github-universe-2024
88
ms.topic: concept-article
9-
ms.date: 05/19/2025
9+
ms.date: 08/14/2025
1010
ms.author: mopeakande
11-
ms.reviewer: fasantia
12-
reviewer: santiagxf
11+
ms.reviewer: shiyingfu
12+
reviewer: swingfu
1313
---
1414

1515
# Azure AI Foundry Models quotas and limits
1616

17-
This article contains a quick reference and a detailed description of the quotas and limits for Azure AI Foundry Models. For quotas and limits specific to the Azure OpenAI in Foundry Models, see [Quota and limits in Azure OpenAI](../openai/quotas-limits.md).
17+
This article provides a quick reference and detailed description of the quotas and limits for Azure AI Foundry Models. For quotas and limits specific to Azure OpenAI in Foundry Models, see [Quota and limits in Azure OpenAI](../openai/quotas-limits.md).
1818

1919
## Quotas and limits reference
2020

21-
Azure uses quotas and limits to prevent budget overruns due to fraud, and to honor Azure capacity constraints. Consider these limits as you scale for production workloads. The following sections provide you with a quick guide to the default quotas and limits that apply to Azure AI model's inference service in Azure AI Foundry:
21+
Azure uses quotas and limits to prevent budget overruns due to fraud and to honor Azure capacity constraints. Consider these limits as you scale for production workloads. The following sections provide a quick guide to the default quotas and limits that apply to the Azure AI model inference service in Azure AI Foundry:
2222

2323
### Resource limits
2424

@@ -30,58 +30,66 @@ Azure uses quotas and limits to prevent budget overruns due to fraud, and to hon
3030

3131
### Rate limits
3232

33-
| Limit name | Applies to | Limit value |
34-
| -------------------- | ------------------- | ----------- |
35-
| Tokens per minute | Azure OpenAI models | Varies per model and SKU. See [limits for Azure OpenAI](../openai/quotas-limits.md). |
36-
| Requests per minute | Azure OpenAI models | Varies per model and SKU. See [limits for Azure OpenAI](../openai/quotas-limits.md). |
37-
| Tokens per minute | DeepSeek-R1<br />DeepSeek-V3-0324 | 5,000,000 |
38-
| Requests per minute | DeepSeek-R1<br />DeepSeek-V3-0324 | 5,000 |
39-
| Concurrent requests | DeepSeek-R1<br />DeepSeek-V3-0324 | 300 |
40-
| Tokens per minute | Rest of models | 400,000 |
41-
| Requests per minute | Rest of models | 1,000 |
42-
| Concurrent requests | Rest of models | 300 |
33+
The following table lists the Foundry Models limits for these rates:
4334

44-
You can [request increases to the default limits](#request-increases-to-the-default-limits). Due to high demand, limit increase requests can be submitted and evaluated per request.
35+
- Tokens per minute
36+
- Requests per minute
37+
- Concurrent requests
38+
39+
| Models | Tokens per minute | Requests per minute | Concurrent requests |
40+
| ---------------------------------------------------------------------- | --------------------------------------------------- | ----------------------------------------------------- | -------------------- |
41+
| Azure OpenAI models | Varies per model and SKU. See [limits for Azure OpenAI](../openai/quotas-limits.md). | Varies per model and SKU. See [limits for Azure OpenAI](../openai/quotas-limits.md). | not applicable |
42+
| - DeepSeek-R1<br />- DeepSeek-V3-0324 | 5,000,000 | 5,000 | 300 |
43+
| - Llama 3.3 70B Instruct<br />- Llama-4-Maverick-17B-128E-Instruct-FP8<br />- Grok 3<br />- Grok 3 mini | 400,000 | 1,000 | 300 |
44+
| - Flux-Pro 1.1<br />- Flux.1-Kontext Pro | not applicable | 2 capacity units (6 requests per minute) | not applicable |
45+
| Rest of models | 400,000 | 1,000 | 300 |
46+
47+
To increase your quota:
48+
49+
- For Azure OpenAI, use [Azure AI Foundry Service: Request for Quota Increase](https://customervoice.microsoft.com/Pages/ResponsePage.aspx?id=v4j5cvGGr0GRqy180BHbR4xPXO648sJKt4GoXAed-0pUMFE1Rk9CU084RjA0TUlVSUlMWEQzVkJDNCQlQCN0PWcu) to submit your request.
50+
- For other models, see [request increases to the default limits](#request-increases-to-the-default-limits).
51+
52+
Due to high demand, we evaluate each limit increase request individually.
4553

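A client can throttle itself to stay under a requests-per-minute limit from the table. The following is a minimal sketch, not part of the service or any SDK: `SlidingWindowLimiter` and its parameters are hypothetical names, and the injected `clock` exists only to make the behavior easy to check.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` calls in any `window_seconds` span.

    Hypothetical helper for illustration; not an Azure API.
    """

    def __init__(self, max_requests, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self._sent = deque()  # timestamps of requests still inside the window

    def try_acquire(self):
        """Return True and record the request if it fits in the window."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) < self.max_requests:
            self._sent.append(now)
            return True
        return False
```

For the "Rest of models" row, this could be created as `SlidingWindowLimiter(1000)`. The concurrent-request limit is a separate constraint; one option is to guard in-flight calls with `threading.BoundedSemaphore(300)`.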
4654
### Other limits
4755

4856
| Limit name | Limit value |
4957
|--|--|
5058
| Max number of custom headers in API requests<sup>1</sup> | 10 |
5159

52-
<sup>1</sup> Our current APIs allow up to 10 custom headers, which are passed through the pipeline, and returned. We have noticed some customers now exceed this header count resulting in HTTP 431 errors. There is no solution for this error, other than to reduce header volume. **In future API versions we will no longer pass through custom headers**. We recommend customers not depend on custom headers in future system architectures.
60+
<sup>1</sup> Our current APIs allow up to 10 custom headers, which the pipeline passes through and returns. If you exceed this header count, your request results in an HTTP 431 error. To resolve this error, reduce the header volume. **Future API versions won't pass through custom headers**. We recommend that you don't depend on custom headers in future system architectures.
5361

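A pre-flight check can catch requests that would exceed the custom header limit before they reach the service. This is a sketch under a stated assumption: the article doesn't define which headers count as custom, so the `STANDARD_HEADERS` set below is a guess that you should adjust for your client. The function names are hypothetical.

```python
MAX_CUSTOM_HEADERS = 10  # limit from the "Other limits" table

# Assumption: any header outside this set counts as "custom".
# The article doesn't define the term, so adjust this set as needed.
STANDARD_HEADERS = {"authorization", "api-key", "content-type", "accept", "user-agent"}


def count_custom_headers(headers):
    """Count header names that aren't in the assumed standard set."""
    return sum(1 for name in headers if name.lower() not in STANDARD_HEADERS)


def check_headers(headers):
    """Raise ValueError if the request would exceed the custom header limit."""
    count = count_custom_headers(headers)
    if count > MAX_CUSTOM_HEADERS:
        raise ValueError(
            f"{count} custom headers exceed the limit of {MAX_CUSTOM_HEADERS}"
        )
    return headers
```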
5462
## Usage tiers
5563

56-
Global Standard deployments use Azure's global infrastructure, dynamically routing customer traffic to the data center with best availability for the customer's inference requests. This enables more consistent latency for customers with low to medium levels of traffic. Customers with high sustained levels of usage might see more variabilities in response latency.
64+
Global Standard deployments use Azure's global infrastructure to dynamically route customer traffic to the data center with the best availability for the customer's inference requests. This infrastructure enables more consistent latency for customers with low to medium levels of traffic. Customers with high sustained levels of usage might see more variability in response latency.
5765

5866
The Usage Limit determines the level of usage above which customers might see larger variability in response latency. A customer's usage is defined per model and is the total tokens consumed across all deployments in all subscriptions in all regions for a given tenant.
5967

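The usage definition above amounts to grouping token counts by tenant and model, regardless of deployment, subscription, or region. The following sketch illustrates that aggregation; the record shape and the `usage_per_model` name are hypothetical and don't correspond to any Azure API or export format.

```python
from collections import defaultdict


def usage_per_model(records):
    """Sum tokens per (tenant, model) across deployments, subscriptions, and regions.

    Each record is a dict with hypothetical keys: tenant, model, tokens.
    Other keys, such as subscription, region, or deployment, are ignored.
    """
    totals = defaultdict(int)
    for record in records:
        totals[(record["tenant"], record["model"])] += record["tokens"]
    return dict(totals)
```

For example, two deployments of the same model in different regions and subscriptions contribute to one total for the tenant.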
6068
## Request increases to the default limits
6169

62-
Limit increase requests can be submitted and evaluated per request. [Open an online customer support request](https://portal.azure.com/#blade/Microsoft_Azure_Support/HelpAndSupportBlade/newsupportrequest/). When requesting for endpoint limit increase, provide the following information:
70+
You can submit limit increase requests, which we evaluate one at a time. [Open an online customer support request](https://portal.azure.com/#blade/Microsoft_Azure_Support/HelpAndSupportBlade/newsupportrequest/). When you request an endpoint limit increase, provide the following information:
6371

64-
1. When opening the support request, select **Service and subscription limits (quotas)** as the **Issue type**.
72+
1. Select **Service and subscription limits (quotas)** as the **Issue type** when you open the support request.
6573

66-
1. Select the subscription of your choice.
74+
1. Select the subscription you want to use.
6775

6876
1. Select **Cognitive Services** as **Quota type**.
6977

7078
1. Select **Next**.
7179

72-
1. On the **Additional details** tab, you need to provide detailed reasons for the limit increase in order for your request to be processed. Be sure to add the following information into the reason for limit increase:
80+
1. On the **Additional details** tab, provide detailed reasons for the limit increase so that your request can be processed. Be sure to add the following information to the reason for limit increase:
7381

7482
* Model name, model version (if applicable), and deployment type (SKU).
7583
* Description of your scenario and workload.
7684
* Rationale for the requested increase.
77-
* Provide the target throughput: Tokens per minute, requests per minute, etc.
78-
* Provide planned time plan (by when you need increased limits).
85+
* Target throughput: Tokens per minute, requests per minute, and other relevant metrics.
86+
* Planned timeline (the date by which you need the increased limits).
7987

80-
1. Finally, select **Save and continue** to continue.
88+
1. Select **Save and continue**.
8189

82-
## General best practices to remain within rate limits
90+
## General best practices to stay within rate limits
8391

84-
To minimize issues related to rate limits, it's a good idea to use the following techniques:
92+
To minimize issues related to rate limits, use the following techniques:
8593

8694
- Implement retry logic in your application.
8795
- Avoid sharp changes in the workload. Increase the workload gradually.
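Both techniques can be sketched in a few lines. The example below is illustrative only: `RateLimitError` is a hypothetical stand-in for however your client reports a rate-limit response (commonly HTTP 429, which the article doesn't specify), and the backoff parameters are arbitrary defaults, not recommended values.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a rate-limit response from the service."""


def call_with_retry(send, max_attempts=5, base_delay=1.0, max_delay=30.0,
                    sleep=time.sleep):
    """Call `send()` and retry on RateLimitError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            # Exponential backoff with full jitter, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))


def ramp_schedule(start_rpm, target_rpm, steps):
    """Return evenly spaced request rates from start_rpm to target_rpm."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    increment = (target_rpm - start_rpm) / (steps - 1)
    return [round(start_rpm + i * increment) for i in range(steps)]
```

A caller might hold each rate from `ramp_schedule(100, 1000, 4)` for a fixed interval before moving to the next, rather than jumping straight to the target rate.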
