
Commit ea24d88

docs fix

1 parent 8b3e3c1 commit ea24d88

File tree

3 files changed: +185 -186 lines
Lines changed: 184 additions & 0 deletions

@@ -0,0 +1,184 @@
# Dynamic TPM/RPM Allocation

Prevent projects from consuming too much TPM/RPM. Use this feature when you want to reserve TPM/RPM capacity for specific projects, e.g. a realtime use case that should get higher priority than other workloads.

Dynamically allocate TPM/RPM quota to API keys, based on the keys active in that minute. [**See Code**](https://github.com/BerriAI/litellm/blob/9bffa9a48e610cc6886fc2dce5c1815aeae2ad46/litellm/proxy/hooks/dynamic_rate_limiter.py#L125)
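The rough idea, sketched below, is that a model's TPM/RPM budget gets divided across whichever keys are active in the current minute. This is an illustrative sketch only, not the actual hook implementation; the function and variable names are made up for the example:

```python
# Illustrative sketch: split a model's TPM evenly across keys active this minute.
# `model_tpm` and `active_keys` are example inputs, not LiteLLM internals.
def available_tpm_per_key(model_tpm: int, active_keys: int) -> int:
    if active_keys == 0:
        return model_tpm  # no other consumers, the full budget is available
    return model_tpm // active_keys

# With the config below (tpm: 60) and 2 active keys, each key gets ~30 TPM this minute.
print(available_tpm_per_key(60, 2))  # 30
```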
1. Setup config.yaml

```yaml
model_list:
  - model_name: my-fake-model
    litellm_params:
      model: gpt-3.5-turbo
      api_key: my-fake-key
      mock_response: hello-world
      tpm: 60

litellm_settings:
  callbacks: ["dynamic_rate_limiter_v3"]

general_settings:
  master_key: sk-1234 # OR set `LITELLM_MASTER_KEY=".."` in your .env
  database_url: postgres://.. # OR set `DATABASE_URL=".."` in your .env
```
2. Start proxy

```bash
litellm --config /path/to/config.yaml
```
32+
33+
3. Test it!
34+
35+
```python
36+
"""
37+
- Run 2 concurrent teams calling same model
38+
- model has 60 TPM
39+
- Mock response returns 30 total tokens / request
40+
- Each team will only be able to make 1 request per minute
41+
"""
42+
43+
import requests
44+
from openai import OpenAI, RateLimitError
45+
46+
def create_key(api_key: str, base_url: str):
47+
response = requests.post(
48+
url="{}/key/generate".format(base_url),
49+
json={},
50+
headers={
51+
"Authorization": "Bearer {}".format(api_key)
52+
}
53+
)
54+
55+
_response = response.json()
56+
57+
return _response["key"]
58+
59+
key_1 = create_key(api_key="sk-1234", base_url="http://0.0.0.0:4000")
60+
key_2 = create_key(api_key="sk-1234", base_url="http://0.0.0.0:4000")
61+
62+
# call proxy with key 1 - works
63+
openai_client_1 = OpenAI(api_key=key_1, base_url="http://0.0.0.0:4000")
64+
65+
response = openai_client_1.chat.completions.with_raw_response.create(
66+
model="my-fake-model", messages=[{"role": "user", "content": "Hello world!"}],
67+
)
68+
69+
print("Headers for call 1 - {}".format(response.headers))
70+
_response = response.parse()
71+
print("Total tokens for call - {}".format(_response.usage.total_tokens))
72+
73+
74+
# call proxy with key 2 - works
75+
openai_client_2 = OpenAI(api_key=key_2, base_url="http://0.0.0.0:4000")
76+
77+
response = openai_client_2.chat.completions.with_raw_response.create(
78+
model="my-fake-model", messages=[{"role": "user", "content": "Hello world!"}],
79+
)
80+
81+
print("Headers for call 2 - {}".format(response.headers))
82+
_response = response.parse()
83+
print("Total tokens for call - {}".format(_response.usage.total_tokens))
84+
# call proxy with key 2 - fails
85+
try:
86+
openai_client_2.chat.completions.with_raw_response.create(model="my-fake-model", messages=[{"role": "user", "content": "Hey, how's it going?"}])
87+
raise Exception("This should have failed!")
88+
except RateLimitError as e:
89+
print("This was rate limited b/c - {}".format(str(e)))
90+
91+
```
92+
93+
**Expected Response**
94+
95+
```
96+
This was rate limited b/c - Error code: 429 - {'error': {'message': {'error': 'Key=<hashed_token> over available TPM=0. Model TPM=0, Active keys=2'}, 'type': 'None', 'param': 'None', 'code': 429}}
97+
```
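To sanity-check the numbers in the docstring above (assuming the 60 TPM budget is split evenly across the two active keys):

```python
model_tpm = 60           # `tpm` from config.yaml
active_keys = 2          # key_1 and key_2 both called the model this minute
tokens_per_request = 30  # the mock response costs ~30 total tokens

tpm_per_key = model_tpm // active_keys                 # 30 TPM per key
requests_per_key = tpm_per_key // tokens_per_request   # 1 request per key per minute
print(tpm_per_key, requests_per_key)                   # 30 1
```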
#### [BETA] Set Priority / Reserve Quota

Reserve TPM/RPM capacity for projects in production.

:::tip

Reserving TPM/RPM on keys based on priority is a premium feature. Please [get an enterprise license](./enterprise.md) for it.
:::
1. Setup config.yaml

```yaml
model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: "gpt-3.5-turbo"
      api_key: os.environ/OPENAI_API_KEY
      rpm: 100

litellm_settings:
  callbacks: ["dynamic_rate_limiter"]
  priority_reservation: {"dev": 0, "prod": 1}

general_settings:
  master_key: sk-1234 # OR set `LITELLM_MASTER_KEY=".."` in your .env
  database_url: postgres://.. # OR set `DATABASE_URL=".."` in your .env
```
priority_reservation:
- Dict[str, float]
- str: can be any string
- float: from 0 to 1. The fraction of the model's TPM/RPM to reserve for keys of this priority (see the sketch below).
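A rough sketch of how a reservation could translate into per-key capacity under these settings. This is illustrative only, not LiteLLM's actual hook logic; the function and variable names are hypothetical:

```python
# Hypothetical illustration: reserved capacity = model RPM * priority_reservation[priority],
# shared across the active keys of that priority.
priority_reservation = {"dev": 0, "prod": 1}

def reserved_rpm_per_key(model_rpm: int, priority: str, active_keys_same_priority: int) -> int:
    reserved = model_rpm * priority_reservation.get(priority, 0)
    return int(reserved / max(active_keys_same_priority, 1))

print(reserved_rpm_per_key(100, "prod", 2))  # 50 -> two active prod keys share the full 100 RPM
print(reserved_rpm_per_key(100, "dev", 1))   # 0  -> nothing is reserved for dev keys
```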
**Start Proxy**

```bash
litellm --config /path/to/config.yaml
```
2. Create a key with that priority

```bash
curl -X POST 'http://0.0.0.0:4000/key/generate' \
-H 'Authorization: Bearer <your-master-key>' \
-H 'Content-Type: application/json' \
-d '{
	"metadata": {"priority": "dev"} # 👈 KEY CHANGE
}'
```
**Expected Response**

```
{
	...
	"key": "sk-.."
}
```
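The same key can also be created from Python by reusing the `create_key` pattern from the first example and passing the priority in `metadata` (a sketch; the endpoint and payload mirror the curl call above):

```python
import requests

# Same /key/generate endpoint as the curl above, called from Python.
# "dev" must match one of the priorities defined in priority_reservation.
response = requests.post(
    url="http://0.0.0.0:4000/key/generate",
    json={"metadata": {"priority": "dev"}},
    headers={"Authorization": "Bearer sk-1234"},  # your master key
)
print(response.json()["key"])
```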
3. Test it!

```bash
curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-...' \ # 👈 key from step 2.
-d '{
	"model": "gpt-3.5-turbo",
	"messages": [
		{
			"role": "user",
			"content": "what llm are you"
		}
	]
}'
```

**Expected Response**

The key from step 2 has priority `dev`, and `priority_reservation` reserves no RPM for `dev` keys, so the request is rejected:

```
Key=... over available RPM=0. Model RPM=100, Active keys=None
```

docs/my-website/docs/proxy/team_budgets.md

Lines changed: 0 additions & 185 deletions

@@ -178,188 +178,3 @@ Expect to see this metric on prometheus to track the Remaining Budget for the team
```shell
litellm_remaining_team_budget_metric{team_alias="QA Prod Bot",team_id="de35b29e-6ca8-4f47-b804-2b79d07aa99a"} 9.699999999999992e-06
```
### Dynamic TPM/RPM Allocation

Prevent projects from gobbling too much tpm/rpm.

Dynamically allocate TPM/RPM quota to api keys, based on active keys in that minute. [**See Code**](https://github.com/BerriAI/litellm/blob/9bffa9a48e610cc6886fc2dce5c1815aeae2ad46/litellm/proxy/hooks/dynamic_rate_limiter.py#L125)

1. Setup config.yaml

```yaml
model_list:
  - model_name: my-fake-model
    litellm_params:
      model: gpt-3.5-turbo
      api_key: my-fake-key
      mock_response: hello-world
      tpm: 60

litellm_settings:
  callbacks: ["dynamic_rate_limiter"]

general_settings:
  master_key: sk-1234 # OR set `LITELLM_MASTER_KEY=".."` in your .env
  database_url: postgres://.. # OR set `DATABASE_URL=".."` in your .env
```

2. Start proxy

```bash
litellm --config /path/to/config.yaml
```

3. Test it!

```python
"""
- Run 2 concurrent teams calling same model
- model has 60 TPM
- Mock response returns 30 total tokens / request
- Each team will only be able to make 1 request per minute
"""

import requests
from openai import OpenAI, RateLimitError

def create_key(api_key: str, base_url: str):
    response = requests.post(
        url="{}/key/generate".format(base_url),
        json={},
        headers={
            "Authorization": "Bearer {}".format(api_key)
        }
    )

    _response = response.json()

    return _response["key"]

key_1 = create_key(api_key="sk-1234", base_url="http://0.0.0.0:4000")
key_2 = create_key(api_key="sk-1234", base_url="http://0.0.0.0:4000")

# call proxy with key 1 - works
openai_client_1 = OpenAI(api_key=key_1, base_url="http://0.0.0.0:4000")

response = openai_client_1.chat.completions.with_raw_response.create(
    model="my-fake-model", messages=[{"role": "user", "content": "Hello world!"}],
)

print("Headers for call 1 - {}".format(response.headers))
_response = response.parse()
print("Total tokens for call - {}".format(_response.usage.total_tokens))


# call proxy with key 2 - works
openai_client_2 = OpenAI(api_key=key_2, base_url="http://0.0.0.0:4000")

response = openai_client_2.chat.completions.with_raw_response.create(
    model="my-fake-model", messages=[{"role": "user", "content": "Hello world!"}],
)

print("Headers for call 2 - {}".format(response.headers))
_response = response.parse()
print("Total tokens for call - {}".format(_response.usage.total_tokens))

# call proxy with key 2 - fails
try:
    openai_client_2.chat.completions.with_raw_response.create(model="my-fake-model", messages=[{"role": "user", "content": "Hey, how's it going?"}])
    raise Exception("This should have failed!")
except RateLimitError as e:
    print("This was rate limited b/c - {}".format(str(e)))
```

**Expected Response**

```
This was rate limited b/c - Error code: 429 - {'error': {'message': {'error': 'Key=<hashed_token> over available TPM=0. Model TPM=0, Active keys=2'}, 'type': 'None', 'param': 'None', 'code': 429}}
```

#### [BETA] Set Priority / Reserve Quota

Reserve tpm/rpm capacity for projects in prod.

:::tip

Reserving tpm/rpm on keys based on priority is a premium feature. Please [get an enterprise license](./enterprise.md) for it.
:::

1. Setup config.yaml

```yaml
model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: "gpt-3.5-turbo"
      api_key: os.environ/OPENAI_API_KEY
      rpm: 100

litellm_settings:
  callbacks: ["dynamic_rate_limiter"]
  priority_reservation: {"dev": 0, "prod": 1}

general_settings:
  master_key: sk-1234 # OR set `LITELLM_MASTER_KEY=".."` in your .env
  database_url: postgres://.. # OR set `DATABASE_URL=".."` in your .env
```

priority_reservation:
- Dict[str, float]
- str: can be any string
- float: from 0 to 1. Specify the % of tpm/rpm to reserve for keys of this priority.

**Start Proxy**

```
litellm --config /path/to/config.yaml
```

2. Create a key with that priority

```bash
curl -X POST 'http://0.0.0.0:4000/key/generate' \
-H 'Authorization: Bearer <your-master-key>' \
-H 'Content-Type: application/json' \
-D '{
	"metadata": {"priority": "dev"} # 👈 KEY CHANGE
}'
```

**Expected Response**

```
{
	...
	"key": "sk-.."
}
```

3. Test it!

```bash
curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: sk-...' \ # 👈 key from step 2.
-d '{
	"model": "gpt-3.5-turbo",
	"messages": [
		{
			"role": "user",
			"content": "what llm are you"
		}
	],
}'
```

**Expected Response**

```
Key=... over available RPM=0. Model RPM=100, Active keys=None
```

docs/my-website/sidebars.js

Lines changed: 1 addition & 1 deletion

@@ -201,7 +201,7 @@ const sidebars = {
      {
        type: "category",
        label: "Budgets + Rate Limits",
-       items: ["proxy/users", "proxy/temporary_budget_increase", "proxy/rate_limit_tiers", "proxy/team_budgets", "proxy/customers"],
+       items: ["proxy/users", "proxy/temporary_budget_increase", "proxy/rate_limit_tiers", "proxy/team_budgets", "proxy/dynamic_rate_limit", "proxy/customers"],
      },
      {
        type: "link",
