We already support this via load balancing: https://docs.litellm.ai/docs/routing
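For reference, here is a minimal sketch of what that routing setup might look like with the models from this thread. The model identifiers and Router parameters below are illustrative assumptions, so check the linked docs for the current API.

```python
# Illustrative sketch only: two Groq deployments grouped under one alias so
# LiteLLM's Router spreads requests across them and retries on failures.
# Model ids and parameters are assumptions; verify against the routing docs.
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "groq-llama",  # alias that callers use
            "litellm_params": {"model": "groq/llama3-70b-8192"},
        },
        {
            "model_name": "groq-llama",  # same alias, different underlying model
            "litellm_params": {"model": "groq/llama-3.1-70b-versatile"},
        },
    ],
    num_retries=2,
)

response = router.completion(
    model="groq-llama",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```

The same docs page also describes retry, cooldown, and fallback settings, which map closely to the per-model switching described in the question below.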
Hi,
Our system recently ground to a halt when we reached the rate limit for Llama 3 70B on Groq. It was a bit of a mess: the program kept retrying, which only made things worse by burning more tokens and pushing the rate-limit reset further out.
As a quick fix, I bundled Llama 3, Llama 3.1, and Gemma 2 into a wrapper. The idea was that if we hit a rate limit error with one model, the system would temporarily switch to another. This works because, at least with Groq, the rate limits are per model.
Now, I made sure beforehand that switching between these models wouldn’t significantly affect our results. But I’m wondering if we could discuss a more formal way to implement this trick.
Think of it like a connection pool for databases, except it's a "rate limit pool" for LLMs. How could we design this to be more robust and reusable?
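To make the idea concrete, here is roughly the shape of the wrapper I have in mind. The model names, the `RateLimitError` class, and the `call` function are placeholders for whatever client you actually use, not a specific API.

```python
# Rough sketch of the "rate limit pool" idea: interchangeable models behind
# one interface, with a per-model cooldown when a rate limit is hit.
import time
from typing import Callable, Sequence


class RateLimitError(Exception):
    """Stand-in for your provider's rate-limit exception (e.g. an HTTP 429)."""


class ModelPool:
    """Tries interchangeable models in order, skipping any that are cooling down."""

    def __init__(self, models: Sequence[str], call: Callable[[str, str], str],
                 cooldown: float = 60.0):
        self.models = list(models)
        self.call = call            # your actual client call: (model, prompt) -> text
        self.cooldown = cooldown    # seconds a rate-limited model sits out
        self.blocked_until = {m: 0.0 for m in self.models}

    def complete(self, prompt: str) -> str:
        now = time.monotonic()
        for model in self.models:
            if now < self.blocked_until[model]:
                continue  # still cooling down after a rate-limit error
            try:
                return self.call(model, prompt)
            except RateLimitError:
                # Take this model out of rotation instead of hammering it.
                self.blocked_until[model] = time.monotonic() + self.cooldown
        raise RuntimeError("Every model in the pool is currently rate limited")


# Example wiring with the models from this thread (names are illustrative):
# pool = ModelPool(["llama3-70b-8192", "llama-3.1-70b-versatile", "gemma2-9b-it"],
#                  call=my_groq_call)
# answer = pool.complete("Summarize this ticket...")
```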