What happened?
I want to use model names by feature in my application, with tiered users. Free users should be able to use a premium model (gpt-5) up to a certain budget; once the budget is exceeded, their requests should be routed to a cheaper model (gpt-5-mini). Pro users have unlimited usage.
To implement this, I am using virtual keys with a model-specific budget. My config is as follows:
```yaml
model_list:
  - model_name: feedback
    litellm_params:
      # free users can use the premium model with a certain budget
      model: openai/gpt-5
      api_key: os.environ/OPENAI_API_KEY
      tags: ["free"]
  - model_name: feedback
    litellm_params:
      # pro users have unlimited usage of the premium model
      model: openai/gpt-5
      api_key: os.environ/OPENAI_API_KEY
      tags: ["pro"]
  - model_name: feedback-fallback
    litellm_params:
      # fallback model using the cheap model; if tags=["free"] users exhaust
      # their budget, they get routed here via the fallback mechanism
      model: openai/gpt-5-mini
      api_key: os.environ/OPENAI_API_KEY
      tags: ["free", "pro"]

router_settings:
  num_retries: 2
  enable_tag_filtering: True
  fallbacks:
    - {"feedback": ["feedback-fallback"]}
  max_fallbacks: 1

litellm_settings:
  callbacks: ["prometheus"]
```
I'm not quite sure whether the tags are necessary, since the virtual key has the budget attached and requests should be routed to feedback-fallback by the fallbacks configuration once that budget is exhausted.
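To be explicit about what I expect: once the key's tracked spend for openai/gpt-5 reaches the per-model budget_limit, a request for "feedback" should land on the fallback group. A minimal sketch of that decision in plain Python (hypothetical names, not LiteLLM internals):

```python
# Hypothetical sketch of the routing behavior I expect; none of these
# names come from LiteLLM itself.
def pick_deployment(requested_model, key_spend, key_budgets, fallbacks):
    """Return the model group a request should land on for this key."""
    # If any per-model budget attached to the key is exhausted, the
    # primary deployment is unavailable for this key.
    primary_exhausted = any(
        key_spend.get(model, 0.0) >= limit
        for model, limit in key_budgets.items()
    )
    if primary_exhausted:
        # Route to the first configured fallback group instead.
        return fallbacks.get(requested_model, [requested_model])[0]
    return requested_model

# A free-tier key with an exhausted gpt-5 budget should fall back.
assert pick_deployment(
    "feedback",
    key_spend={"openai/gpt-5": 0.01},
    key_budgets={"openai/gpt-5": 0.000000001},
    fallbacks={"feedback": ["feedback-fallback"]},
) == "feedback-fallback"
```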
I wrote a quick test script to test this:
```python
import asyncio
import re
import time
import uuid

import openai
from dotenv import load_dotenv

from app.ai.litellm.admin import litellm_admin_api
from app.core.config import settings

load_dotenv()


async def main():
    random_user_id = str(uuid.uuid4())
    user = await litellm_admin_api.create_user(random_user_id, user_role="customer")
    key = await litellm_admin_api.create_user_virtual_key(
        user_id=user.user_id,
        model_max_budget={
            "openai/gpt-5": {
                "budget_limit": 0.000000001,
                "time_period": "30d",
            }
        },
        tags=["free"],
    )
    print("key:", key)
    key = key.key
    print(f"User ID: {user.user_id}")
    print(f"User Key: {key}")

    client = openai.OpenAI(api_key=key, base_url=settings.LITELLM_BASE_URL)

    # this should return model=gpt-5-yyyy-mm-dd
    response = client.chat.completions.create(
        model="feedback",
        messages=[{"role": "user", "content": "Hello, world!"}],
        user=user.user_id,
        extra_body={"metadata": {"tags": ["free"]}},
    )
    print(f"First request used model {response.model}")
    assert re.match(r"gpt-5-\d{4}-\d{2}-\d{2}", response.model)

    time.sleep(5)

    # should get routed to mini now
    response = client.chat.completions.create(
        model="feedback",
        messages=[{"role": "user", "content": "Hello, world!"}],
        user=user.user_id,
        extra_body={"metadata": {"tags": ["free"]}},
    )
    print(f"Second request used model {response.model}")
    assert re.match(r"gpt-5-mini-\d{4}-\d{2}-\d{2}", response.model)


if __name__ == "__main__":
    asyncio.run(main())
```
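For scale: the budget is chosen so small that the very first request must exhaust it. The per-token price below is my assumption (roughly OpenAI's listed gpt-5 input rate), but any realistic price is orders of magnitude above the budget:

```python
# Sanity check on the numbers: even a single input token on the first
# request should exhaust the budget. The price is an assumption, not a
# value taken from LiteLLM.
PRICE_PER_INPUT_TOKEN = 1.25 / 1_000_000  # USD per token, assumed
BUDGET_LIMIT = 0.000000001  # budget_limit from the key above

# One token alone already exceeds the budget by several orders of magnitude.
print(PRICE_PER_INPUT_TOKEN > BUDGET_LIMIT)  # True
```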
When I run this on a fresh LiteLLM installation, I get routed to gpt-5 twice, even though the second request should be routed to gpt-5-mini:

```
First request used model gpt-5-2025-08-07
Second request used model gpt-5-2025-08-07
```
I suspect the per-model budget is not being tracked properly for routing decisions; this line looks relevant:

```python
budget_config = BudgetConfig(time_period="1d", budget_limit=0.1)
```
Relevant log output
Are you a ML Ops Team?
No
What LiteLLM version are you on ?
v1.77.3-stable
Twitter / LinkedIn details
No response