Commit 282bafa

Merge pull request #278584 from mrbullwinkle/mrb_06_18_2024_token_rate_limit
[Azure OpenAI] Rate limit note
2 parents 9dbe672 + 78586fa commit 282bafa

1 file changed: +4 -1 lines changed

articles/ai-services/openai/how-to/quota.md

Lines changed: 4 additions & 1 deletion
@@ -7,7 +7,7 @@ author: mrbullwinkle
 manager: nitinme
 ms.service: azure-ai-openai
 ms.topic: how-to
-ms.date: 05/31/2024
+ms.date: 06/18/2024
 ms.author: mbullwin
 ---

@@ -91,6 +91,9 @@ As each request is received, Azure OpenAI computes an estimated max processed-to

 As requests come into the deployment endpoint, the estimated max-processed-token count is added to a running token count of all requests that is reset each minute. If at any time during that minute, the TPM rate limit value is reached, then further requests will receive a 429 response code until the counter resets.

+> [!IMPORTANT]
+> The token count used in the rate limit calculation is an estimate based in part on the character count of the API request. The rate limit token estimate is not the same as the token calculation that is used for billing/determining that a request is below a model's input token limit. Due to the approximate nature of the rate limit token calculation, it is expected behavior that a rate limit can be triggered prior to what might be expected in comparison to an exact token count measurement for each request.
+
 RPM rate limits are based on the number of requests received over time. The rate limit expects that requests be evenly distributed over a one-minute period. If this average flow isn't maintained, then requests may receive a 429 response even though the limit isn't met when measured over the course of a minute. To implement this behavior, Azure OpenAI Service evaluates the rate of incoming requests over a small period of time, typically 1 or 10 seconds. If the number of requests received during that time exceeds what would be expected at the set RPM limit, then new requests will receive a 429 response code until the next evaluation period. For example, if Azure OpenAI is monitoring request rate on 1-second intervals, then rate limiting will occur for a 600-RPM deployment if more than 10 requests are received during each 1-second period (600 requests per minute = 10 requests per second).

 ### Rate limit best practices
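
The two checks described in the changed section of quota.md can be illustrated with a short Python sketch. It is a simplified model, not Azure OpenAI's implementation. The class names `TokenMinuteCounter` and `IntervalRequestLimiter` are hypothetical, and the sketch assumes that windows align to fixed clock boundaries and that only admitted requests are counted. The per-request token estimate is passed in as a number because the article does not give the formula behind it.

```python
class TokenMinuteCounter:
    """Hypothetical model of the TPM check: a running token count reset each minute."""

    def __init__(self, tpm_limit: int) -> None:
        self.tpm_limit = tpm_limit
        self.window = None  # index of the current one-minute window
        self.used = 0       # estimated tokens counted in that window

    def admit(self, estimated_tokens: int, now_seconds: float) -> int:
        window = int(now_seconds // 60)
        if window != self.window:  # a new minute started: reset the counter
            self.window, self.used = window, 0
        if self.used >= self.tpm_limit:  # limit already reached this minute
            return 429
        self.used += estimated_tokens  # assumption: only admitted requests count
        return 200


class IntervalRequestLimiter:
    """Hypothetical model of the RPM check over short evaluation intervals."""

    def __init__(self, rpm_limit: int, interval_seconds: int = 1) -> None:
        self.interval_seconds = interval_seconds
        # Requests expected per interval at the set RPM, e.g. 600 RPM -> 10 per second.
        self.allowed = rpm_limit * interval_seconds / 60
        self.bucket = None  # index of the current evaluation interval
        self.count = 0      # requests admitted in that interval

    def admit(self, now_seconds: float) -> int:
        bucket = int(now_seconds // self.interval_seconds)
        if bucket != self.bucket:  # next evaluation period: start counting again
            self.bucket, self.count = bucket, 0
        if self.count >= self.allowed:
            return 429
        self.count += 1
        return 200


if __name__ == "__main__":
    tpm = TokenMinuteCounter(tpm_limit=1000)
    print([tpm.admit(600, t) for t in (0, 10, 20, 61)])  # [200, 200, 429, 200]

    rpm = IntervalRequestLimiter(rpm_limit=600, interval_seconds=1)
    # 12 requests inside the same second: the 11th and 12th are rejected.
    print([rpm.admit(0.05 * i) for i in range(12)].count(429))  # 2
```

In the RPM example, all 12 requests stay far below 600 per minute, yet two of them receive 429 because they arrive within a single 1-second evaluation interval. This is the bursty-traffic case the article describes.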
