Commit 58b4d23

gitName committed
[APIM] Token limit updates, gpt-4o support

1 parent 43bb9d4

3 files changed: +76 -18 lines changed


articles/api-management/azure-openai-token-limit-policy.md

Lines changed: 38 additions & 9 deletions
@@ -9,15 +9,15 @@ ms.collection: ce-skilling-ai-copilot
 ms.custom:
   - build-2024
 ms.topic: article
-ms.date: 06/25/2024
+ms.date: 10/09/2024
 ms.author: danlep
 ---
 
 # Limit Azure OpenAI API token usage
 
 [!INCLUDE [api-management-availability-premium-dev-standard-basic-standardv2-basicv2](../../includes/api-management-availability-premium-dev-standard-basic-standardv2-basicv2.md)]
 
-The `azure-openai-token-limit` policy prevents Azure OpenAI Service API usage spikes on a per key basis by limiting consumption of language model tokens to a specified number per minute. When the token usage is exceeded, the caller receives a `429 Too Many Requests` response status code.
+The `azure-openai-token-limit` policy prevents Azure OpenAI Service API usage spikes on a per-key basis by limiting consumption of language model tokens to either a specified rate (number per minute) or a quota over a specified period. When a specified token rate limit is exceeded, the caller receives a `429 Too Many Requests` response status code. When a specified quota is exceeded, the caller receives a `403 Forbidden` response status code.
 
 By relying on token usage metrics returned from the OpenAI endpoint, the policy can accurately monitor and enforce limits in real time. The policy also enables precalculation of prompt tokens by API Management, minimizing unnecessary requests to the OpenAI backend if the limit is already exceeded.
 
@@ -30,9 +30,13 @@ By relying on token usage metrics returned from the OpenAI endpoint, the policy
 ```xml
 <azure-openai-token-limit counter-key="key value"
     tokens-per-minute="number"
+    token-quota="number"
+    token-quota-period="Hourly | Daily | Weekly | Monthly | Yearly"
     estimate-prompt-tokens="true | false"
     retry-after-header-name="custom header name, replaces default 'Retry-After'"
     retry-after-variable-name="policy expression variable name"
+    remaining-quota-tokens-header-name="header name"
+    remaining-quota-tokens-variable-name="policy expression variable name"
     remaining-tokens-header-name="header name"
     remaining-tokens-variable-name="policy expression variable name"
     tokens-consumed-header-name="header name"
@@ -43,12 +47,16 @@ By relying on token usage metrics returned from the OpenAI endpoint, the policy
 | Attribute | Description | Required | Default |
 | -------------- | ----------------------------------------------------------------------------------------------------- | -------- | ------- |
 | counter-key | The key to use for the token limit policy. For each key value, a single counter is used for all scopes at which the policy is configured. Policy expressions are allowed.| Yes | N/A |
-| tokens-per-minute | The maximum number of tokens consumed by prompt and completion per minute. | Yes | N/A |
+| tokens-per-minute | The maximum number of tokens consumed by prompt and completion per minute. | Either `tokens-per-minute` or both `token-quota` and `token-quota-period` must be specified. | N/A |
+| token-quota | The maximum number of tokens allowed during the time interval specified in the `token-quota-period`. Policy expressions aren't allowed. | Either `tokens-per-minute` or both `token-quota` and `token-quota-period` must be specified. | N/A |
+| token-quota-period | The length of the fixed window after which the `token-quota` resets. The value must be one of the following: `Hourly`, `Daily`, `Weekly`, `Monthly`, `Yearly`. | Either `tokens-per-minute` or both `token-quota` and `token-quota-period` must be specified. | N/A |
 | estimate-prompt-tokens | Boolean value that determines whether to estimate the number of tokens required for a prompt: <br> - `true`: estimate the number of tokens based on prompt schema in API; may reduce performance. <br> - `false`: don't estimate prompt tokens. <br><br>When set to `false`, the remaining tokens per `counter-key` are calculated using the actual token usage from the response of the model. This could result in prompts being sent to the model that exceed the token limit. In such a case, this is detected in the response, and all succeeding requests are blocked by the policy until the token limit frees up again. | Yes | N/A |
-| retry-after-header-name | The name of a custom response header whose value is the recommended retry interval in seconds after the specified `tokens-per-minute` is exceeded. Policy expressions aren't allowed. | No | `Retry-After` |
-| retry-after-variable-name | The name of a variable that stores the recommended retry interval in seconds after the specified `tokens-per-minute` is exceeded. Policy expressions aren't allowed. | No | N/A |
-| remaining-tokens-header-name | The name of a response header whose value after each policy execution is the number of remaining tokens allowed for the time interval. Policy expressions aren't allowed.| No | N/A |
-| remaining-tokens-variable-name | The name of a variable that after each policy execution stores the number of remaining tokens allowed for the time interval. Policy expressions aren't allowed.| No | N/A |
+| retry-after-header-name | The name of a custom response header whose value is the recommended retry interval in seconds after the specified `tokens-per-minute` or `token-quota` is exceeded. Policy expressions aren't allowed. | No | `Retry-After` |
+| retry-after-variable-name | The name of a variable that stores the recommended retry interval in seconds after the specified `tokens-per-minute` or `token-quota` is exceeded. Policy expressions aren't allowed. | No | N/A |
+| remaining-quota-tokens-header-name | The name of a response header whose value after each policy execution is the number of remaining tokens corresponding to the `token-quota` allowed for the `token-quota-period`. Policy expressions aren't allowed. | No | N/A |
+| remaining-quota-tokens-variable-name | The name of a variable that after each policy execution stores the number of remaining tokens corresponding to the `token-quota` allowed for the `token-quota-period`. Policy expressions aren't allowed. | No | N/A |
+| remaining-tokens-header-name | The name of a response header whose value after each policy execution is the number of remaining tokens corresponding to `tokens-per-minute` allowed for the time interval. Policy expressions aren't allowed.| No | N/A |
+| remaining-tokens-variable-name | The name of a variable that after each policy execution stores the number of remaining tokens corresponding to `tokens-per-minute` allowed for the time interval. Policy expressions aren't allowed.| No | N/A |
 | tokens-consumed-header-name | The name of a response header whose value is the number of tokens consumed by both prompt and completion. The header is added to response only after the response is received from backend. Policy expressions aren't allowed.| No | N/A |
 | tokens-consumed-variable-name | The name of a variable initialized to the estimated number of tokens in the prompt in `backend` section of pipeline if `estimate-prompt-tokens` is `true` and zero otherwise. The variable is updated with the reported count upon receiving the response in `outbound` section.| No | N/A |
 
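Because the header attributes above expose the counters to callers without any extra policy logic, a quota policy can surface both the remaining window budget and a retry hint directly. A minimal sketch using only the documented attributes; the header names and values are illustrative:

```xml
<!-- Illustrative header names and values; all attributes are documented above -->
<azure-openai-token-limit counter-key="@(context.Subscription.Id)"
    token-quota="10000" token-quota-period="Daily"
    estimate-prompt-tokens="false"
    remaining-quota-tokens-header-name="x-quota-tokens-remaining"
    retry-after-header-name="x-retry-after" />
```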
@@ -66,9 +74,11 @@ By relying on token usage metrics returned from the OpenAI endpoint, the policy
 * Certain Azure OpenAI endpoints support streaming of responses. When `stream` is set to `true` in the API request to enable streaming, prompt tokens are always estimated, regardless of the value of the `estimate-prompt-tokens` attribute. Completion tokens are also estimated when responses are streamed.
 * [!INCLUDE [api-management-rate-limit-key-scope](../../includes/api-management-rate-limit-key-scope.md)]
 
-## Example
+## Examples
 
-In the following example, the token limit of 5000 per minute is keyed by the caller IP address. The policy doesn't estimate the number of tokens required for a prompt. After each policy execution, the remaining tokens allowed for that caller IP address in the time period are stored in the variable `remainingTokens`.
+### Token rate limit
+
+In the following example, the token rate limit of 5000 per minute is keyed by the caller IP address. The policy doesn't estimate the number of tokens required for a prompt. After each policy execution, the remaining tokens allowed for that caller IP address in the time period are stored in the variable `remainingTokens`.
 
 ```xml
 <policies>
@@ -84,6 +94,25 @@ In the following example, the token limit of 5000 per minute is keyed by the cal
 </policies>
 ```
 
+### Token quota
+
+In the following example, the token quota of 10000 is keyed by the subscription ID and resets daily. After each policy execution, the number of remaining tokens allowed for that subscription ID in the time period is stored in the variable `remainingQuotaTokens`.
+
+```xml
+<policies>
+    <inbound>
+        <base />
+        <azure-openai-token-limit
+            counter-key="@(context.Subscription.Id)"
+            token-quota="10000" token-quota-period="Daily" remaining-quota-tokens-variable-name="remainingQuotaTokens" />
+    </inbound>
+    <outbound>
+        <base />
+    </outbound>
+</policies>
+```
+
 ## Related policies
 
 * [Rate limiting and quotas](api-management-policies.md#rate-limiting-and-quotas)
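If a deployment prefers a custom header over `remaining-quota-tokens-header-name`, the variable from the quota example can be read back in the outbound section with a policy expression. A hedged sketch that assumes the variable value converts cleanly to a string; the header name is illustrative:

```xml
<outbound>
    <base />
    <!-- Illustrative: copy the counter captured by remaining-quota-tokens-variable-name
         into a response header; assumes the value is string-convertible -->
    <set-header name="x-quota-tokens-remaining" exists-action="override">
        <value>@(Convert.ToString(context.Variables.GetValueOrDefault<object>("remainingQuotaTokens", "n/a")))</value>
    </set-header>
</outbound>
```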

articles/api-management/llm-token-limit-policy.md

Lines changed: 37 additions & 8 deletions
@@ -16,7 +16,7 @@ ms.author: danlep
 
 [!INCLUDE [api-management-availability-premium-dev-standard-basic-standardv2-basicv2](../../includes/api-management-availability-premium-dev-standard-basic-standardv2-basicv2.md)]
 
-The `llm-token-limit` policy prevents large language model (LLM) API usage spikes on a per key basis by limiting consumption of LLM tokens to a specified number per minute. When the token usage is exceeded, the caller receives a `429 Too Many Requests` response status code.
+The `llm-token-limit` policy prevents large language model (LLM) API usage spikes on a per-key basis by limiting consumption of language model tokens to either a specified rate (number per minute) or a quota over a specified period. When a specified token rate limit is exceeded, the caller receives a `429 Too Many Requests` response status code. When a specified quota is exceeded, the caller receives a `403 Forbidden` response status code.
 
 By relying on token usage metrics returned from the LLM endpoint, the policy can accurately monitor and enforce limits in real time. The policy also enables precalculation of prompt tokens by API Management, minimizing unnecessary requests to the LLM backend if the limit is already exceeded.
 
@@ -32,9 +32,13 @@ By relying on token usage metrics returned from the LLM endpoint, the policy can
 ```xml
 <llm-token-limit counter-key="key value"
     tokens-per-minute="number"
+    token-quota="number"
+    token-quota-period="Hourly | Daily | Weekly | Monthly | Yearly"
     estimate-prompt-tokens="true | false"
     retry-after-header-name="custom header name, replaces default 'Retry-After'"
     retry-after-variable-name="policy expression variable name"
+    remaining-quota-tokens-header-name="header name"
+    remaining-quota-tokens-variable-name="policy expression variable name"
     remaining-tokens-header-name="header name"
     remaining-tokens-variable-name="policy expression variable name"
     tokens-consumed-header-name="header name"
@@ -45,12 +49,16 @@ By relying on token usage metrics returned from the LLM endpoint, the policy can
 | Attribute | Description | Required | Default |
 | -------------- | ----------------------------------------------------------------------------------------------------- | -------- | ------- |
 | counter-key | The key to use for the token limit policy. For each key value, a single counter is used for all scopes at which the policy is configured. Policy expressions are allowed.| Yes | N/A |
-| tokens-per-minute | The maximum number of tokens consumed by prompt and completion per minute. | Yes | N/A |
+| tokens-per-minute | The maximum number of tokens consumed by prompt and completion per minute. | Either `tokens-per-minute` or both `token-quota` and `token-quota-period` must be specified. | N/A |
+| token-quota | The maximum number of tokens allowed during the time interval specified in the `token-quota-period`. Policy expressions aren't allowed. | Either `tokens-per-minute` or both `token-quota` and `token-quota-period` must be specified. | N/A |
+| token-quota-period | The length of the fixed window after which the `token-quota` resets. The value must be one of the following: `Hourly`, `Daily`, `Weekly`, `Monthly`, `Yearly`. | Either `tokens-per-minute` or both `token-quota` and `token-quota-period` must be specified. | N/A |
 | estimate-prompt-tokens | Boolean value that determines whether to estimate the number of tokens required for a prompt: <br> - `true`: estimate the number of tokens based on prompt schema in API; may reduce performance. <br> - `false`: don't estimate prompt tokens. <br><br>When set to `false`, the remaining tokens per `counter-key` are calculated using the actual token usage from the response of the model. This could result in prompts being sent to the model that exceed the token limit. In such a case, this is detected in the response, and all succeeding requests are blocked by the policy until the token limit frees up again. | Yes | N/A |
-| retry-after-header-name | The name of a custom response header whose value is the recommended retry interval in seconds after the specified `tokens-per-minute` is exceeded. Policy expressions aren't allowed. | No | `Retry-After` |
-| retry-after-variable-name | The name of a variable that stores the recommended retry interval in seconds after the specified `tokens-per-minute` is exceeded. Policy expressions aren't allowed. | No | N/A |
-| remaining-tokens-header-name | The name of a response header whose value after each policy execution is the number of remaining tokens allowed for the time interval. Policy expressions aren't allowed.| No | N/A |
-| remaining-tokens-variable-name | The name of a variable that after each policy execution stores the number of remaining tokens allowed for the time interval. Policy expressions aren't allowed.| No | N/A |
+| retry-after-header-name | The name of a custom response header whose value is the recommended retry interval in seconds after the specified `tokens-per-minute` or `token-quota` is exceeded. Policy expressions aren't allowed. | No | `Retry-After` |
+| retry-after-variable-name | The name of a variable that stores the recommended retry interval in seconds after the specified `tokens-per-minute` or `token-quota` is exceeded. Policy expressions aren't allowed. | No | N/A |
+| remaining-quota-tokens-header-name | The name of a response header whose value after each policy execution is the number of remaining tokens corresponding to the `token-quota` allowed for the `token-quota-period`. Policy expressions aren't allowed. | No | N/A |
+| remaining-quota-tokens-variable-name | The name of a variable that after each policy execution stores the number of remaining tokens corresponding to the `token-quota` allowed for the `token-quota-period`. Policy expressions aren't allowed. | No | N/A |
+| remaining-tokens-header-name | The name of a response header whose value after each policy execution is the number of remaining tokens corresponding to `tokens-per-minute` allowed for the time interval. Policy expressions aren't allowed.| No | N/A |
+| remaining-tokens-variable-name | The name of a variable that after each policy execution stores the number of remaining tokens corresponding to `tokens-per-minute` allowed for the time interval. Policy expressions aren't allowed.| No | N/A |
 | tokens-consumed-header-name | The name of a response header whose value is the number of tokens consumed by both prompt and completion. The header is added to response only after the response is received from backend. Policy expressions aren't allowed.| No | N/A |
 | tokens-consumed-variable-name | The name of a variable initialized to the estimated number of tokens in the prompt in `backend` section of pipeline if `estimate-prompt-tokens` is `true` and zero otherwise. The variable is updated with the reported count upon receiving the response in `outbound` section.| No | N/A |
 
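As with the Azure OpenAI variant, the header attributes can expose the rate-limit counters without extra policy steps. A minimal sketch using only the documented attributes; the header names and values are illustrative:

```xml
<!-- Illustrative header names and values; all attributes are documented above -->
<llm-token-limit counter-key="@(context.Request.IpAddress)"
    tokens-per-minute="5000" estimate-prompt-tokens="false"
    remaining-tokens-header-name="x-tokens-remaining"
    retry-after-header-name="x-retry-after" />
```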
@@ -67,9 +75,11 @@ By relying on token usage metrics returned from the LLM endpoint, the policy can
 * Certain LLM endpoints support streaming of responses. When `stream` is set to `true` in the API request to enable streaming, prompt tokens are always estimated, regardless of the value of the `estimate-prompt-tokens` attribute.
 * [!INCLUDE [api-management-rate-limit-key-scope](../../includes/api-management-rate-limit-key-scope.md)]
 
-## Example
+## Examples
 
-In the following example, the token limit of 5000 per minute is keyed by the caller IP address. The policy doesn't estimate the number of tokens required for a prompt. After each policy execution, the remaining tokens allowed for that caller IP address in the time period are stored in the variable `remainingTokens`.
+### Token rate limit
+
+In the following example, the token rate limit of 5000 per minute is keyed by the caller IP address. The policy doesn't estimate the number of tokens required for a prompt. After each policy execution, the remaining tokens allowed for that caller IP address in the time period are stored in the variable `remainingTokens`.
 
 ```xml
 <policies>
@@ -85,6 +95,25 @@ In the following example, the token limit of 5000 per minute is keyed by the cal
 </policies>
 ```
 
+### Token quota
+
+In the following example, the token quota of 10000 is keyed by the subscription ID and resets daily. After each policy execution, the number of remaining tokens allowed for that subscription ID in the time period is stored in the variable `remainingQuotaTokens`.
+
+```xml
+<policies>
+    <inbound>
+        <base />
+        <llm-token-limit
+            counter-key="@(context.Subscription.Id)"
+            token-quota="10000" token-quota-period="Daily" remaining-quota-tokens-variable-name="remainingQuotaTokens" />
+    </inbound>
+    <outbound>
+        <base />
+    </outbound>
+</policies>
+```
+
 ## Related policies
 
 * [Rate limiting and quotas](api-management-policies.md#rate-limiting-and-quotas)
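To report consumption as well as the remaining budget, the `tokens-consumed-*` attributes can be layered onto either mode. A hedged sketch; the header and variable names are illustrative:

```xml
<!-- Illustrative names: report prompt+completion token consumption to callers
     and capture it in a variable for later policy steps -->
<llm-token-limit counter-key="@(context.Subscription.Id)"
    tokens-per-minute="5000" estimate-prompt-tokens="true"
    tokens-consumed-header-name="x-tokens-consumed"
    tokens-consumed-variable-name="tokensConsumed" />
```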

includes/api-management-azure-openai-models.md

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ The policy is used with APIs [added to API Management from the Azure OpenAI Serv
 
 | API type | Supported models |
 |-------|-------------|
-| Chat completion | gpt-3.5<br/><br/>gpt-4 |
+| Chat completion | gpt-3.5<br/><br/>gpt-4<br/><br/>gpt-4o |
 | Completion | gpt-3.5-turbo-instruct |
 | Embeddings | text-embedding-3-large<br/><br/>text-embedding-3-small<br/><br/>text-embedding-ada-002 |
 