Commit 6ec7591

docs
1 parent 34de53c commit 6ec7591

File tree

2 files changed: +119 −1 lines changed

docs/en/latest/config.json

Lines changed: 2 additions & 1 deletion

```diff
@@ -158,7 +158,8 @@
       "plugins/request-id",
       "plugins/proxy-control",
       "plugins/client-control",
-      "plugins/workflow"
+      "plugins/workflow",
+      "plugins/ai-rate-limiting"
     ]
   },
   {
```
Lines changed: 117 additions & 0 deletions
---
title: AI Rate Limiting
keywords:
  - Apache APISIX
  - API Gateway
  - Plugin
  - ai-rate-limiting
description: The ai-rate-limiting plugin enforces token-based rate limiting for LLM service requests, preventing overuse, optimizing API consumption, and ensuring efficient resource allocation.
---

<!--
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
-->

## Description

The `ai-rate-limiting` plugin enforces token-based rate limiting for requests sent to LLM services. It helps manage API usage by controlling the number of tokens consumed within a specified time frame, ensuring fair resource allocation and preventing excessive load on the service. It is often used with the `ai-proxy` or `ai-proxy-multi` plugin.

## Plugin Attributes

| Name                      | Type          | Required | Description |
| ------------------------- | ------------- | -------- | ----------- |
| `limit`                   | integer       | false    | The maximum number of tokens allowed to be consumed within a given time interval. At least one of `limit` and `instances.limit` should be configured. |
| `time_window`             | integer       | false    | The time interval corresponding to the rate limiting `limit`, in seconds. At least one of `time_window` and `instances.time_window` should be configured. |
| `show_limit_quota_header` | boolean       | false    | If true, includes `X-AI-RateLimit-Limit-*` to show the total quota, `X-AI-RateLimit-Remaining-*` to show the remaining quota, and `X-AI-RateLimit-Reset-*` to show the number of seconds until the counter resets in the response headers, where `*` is the instance name. Default: `true`. |
| `limit_strategy`          | string        | false    | The type of token to apply rate limiting to. `total_tokens`, `prompt_tokens`, and `completion_tokens` values are returned in each model response, where `total_tokens` is the sum of `prompt_tokens` and `completion_tokens`. Default: `total_tokens`. |
| `instances`               | array[object] | false    | LLM instance rate limiting configurations. |
| `instances.name`          | string        | true     | Name of the LLM service instance. |
| `instances.limit`         | integer       | true     | The maximum number of tokens allowed to be consumed within a given time interval. |
| `instances.time_window`   | integer       | true     | The time interval corresponding to the rate limiting `limit`, in seconds. |
| `rejected_code`           | integer       | false    | The HTTP status code returned when a request exceeding the quota is rejected. Default: `503`. |
| `rejected_msg`            | string        | false    | The response body returned when a request exceeding the quota is rejected. |

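When the plugin is paired with `ai-proxy-multi`, the `instances` attributes allow a separate quota per LLM instance. A minimal sketch, assuming two hypothetical instance names (`openai-instance` and `deepseek-instance`) that would match instances defined in the accompanying `ai-proxy-multi` configuration:

```json
"ai-rate-limiting": {
  "instances": [
    { "name": "openai-instance", "limit": 500, "time_window": 60 },
    { "name": "deepseek-instance", "limit": 1000, "time_window": 60 }
  ],
  "limit_strategy": "total_tokens"
}
```

Each instance then consumes from its own counter, so a burst against one provider does not exhaust the quota of the other.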
## Example

Create a route as follows, updating the LLM provider, model, API key, and endpoint for your setup:

```shell
curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
  -H "X-API-KEY: ${ADMIN_API_KEY}" \
  -d '{
    "id": "ai-rate-limiting-route",
    "uri": "/anything",
    "methods": ["POST"],
    "plugins": {
      "ai-proxy": {
        "provider": "openai",
        "auth": {
          "header": {
            "Authorization": "Bearer '"$API_KEY"'"
          }
        },
        "options": {
          "model": "gpt-35-turbo-instruct",
          "max_tokens": 512,
          "temperature": 1.0
        }
      },
      "ai-rate-limiting": {
        "limit": 300,
        "time_window": 30,
        "limit_strategy": "prompt_tokens"
      }
    }
  }'
```

Send a POST request to the route with a system prompt and a sample user question in the request body:

```shell
curl "http://127.0.0.1:9080/anything" -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "system", "content": "You are a mathematician" },
      { "role": "user", "content": "What is 1+1?" }
    ]
  }'
```

You should receive a response similar to the following:

```json
{
  ...
  "model": "gpt-35-turbo-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "1 + 1 equals 2. This is a fundamental arithmetic operation where adding one unit to another results in a total of two units."
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  ...
}
```
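Since `show_limit_quota_header` defaults to `true`, the remaining quota can be inspected from the response headers. A sketch using `curl -i` against the same route; the header suffixes and values shown in the comments are illustrative, as the `*` placeholder depends on the instance name:

```shell
curl -i "http://127.0.0.1:9080/anything" -X POST \
  -H "Content-Type: application/json" \
  -d '{"messages": [{ "role": "user", "content": "What is 1+1?" }]}'

# Illustrative headers in the response (values depend on prior consumption):
#   X-AI-RateLimit-Limit-*: 300
#   X-AI-RateLimit-Remaining-*: 287
#   X-AI-RateLimit-Reset-*: 12
```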

If the rate limiting quota of 300 tokens has been consumed within the 30-second window, additional requests are rejected with the configured `rejected_code` (`503` by default) until the window resets.
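The rejection status and body are configurable through `rejected_code` and `rejected_msg`. A sketch, where the `429` status and the message text are illustrative choices rather than defaults:

```json
"ai-rate-limiting": {
  "limit": 300,
  "time_window": 30,
  "rejected_code": 429,
  "rejected_msg": "AI token quota exceeded, please retry later"
}
```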
