-
Let's say I want a chat flow to load balance all workloads across 3 equivalent chat endpoints and 3 equivalent embeddings endpoints -- the only real difference is that each of the 3 lives in a different availability zone. Is that possible to do in Flowise currently? The main issue is that using a single chat endpoint (e.g., gpt4) may yield rate limiting errors (429s), but if we split the requests across 3 equivalent endpoints in 3 different availability zones, then the rate limits are no longer an issue. I didn't see explicit support for this capability yet, but I didn't know if anyone else was already working on such a feature. Thanks in advance!
-
Great question! I think the support for this might not fall within the Flowise codebase, but rather within your deployment setup process.
-
Hey @HenryHengZJ , the issue with that approach is that you could never load balance within a given session. For example, suppose a single user is generating questions/queries/chat messages that exceed the API rate limit of a given embedding or chat completion endpoint -- within that session, the user would only ever be routed to a single Flowise instance. If we had external logic to re-route the user's session to a different Flowise endpoint, their previous chat history would likely be lost entirely. The only workaround would be to build some sort of chat history cache shared across all of the Flowise instances, which makes things even more complicated.

Ideally, a single chatflow session within Flowise would have enough logic to perform intra-session load balancing across one or more embedding or chat completion endpoints, so that the user never experiences API rate limit errors in the first place. To put it another way, splitting a single chat session across 3 different Flowise deployments seems very complicated, because most load balancers only distribute traffic per session. Instead, we'd need a load balancer that distributes traffic per message, and I'm not sure that's doable.
-
Hey @HenryHengZJ , it turns out there's a way to accomplish this outside of Flowise: the LiteLLM OpenAI Proxy Server: https://docs.litellm.ai/docs/proxy/quick_start. Essentially, if you point Flowise at the LiteLLM proxy server, that system can handle all of the load balancing transparently. The only issue is that there's no LiteLLM-compatible version of the Embedding, Chat, or LLM node types (which would need to be created, I think).
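For reference, a minimal, hypothetical sketch of what that proxy setup might look like -- the model alias, deployment name, endpoint URL, and environment variable below are placeholders, not values from this thread. The proxy reads a `config.yaml` and exposes a single OpenAI-compatible endpoint that Flowise's existing OpenAI nodes could then be pointed at:

```yaml
# Hypothetical minimal LiteLLM proxy config.yaml -- one Azure deployment,
# exposed to clients (e.g., Flowise) under the alias "gpt-4".
model_list:
  - model_name: gpt-4                     # alias that the client requests
    litellm_params:
      model: azure/my-gpt4-deployment     # azure/<deployment name>
      api_base: https://my-resource-eastus.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY_EASTUS
      api_version: "2023-07-01-preview"
```

Flowise would then talk to the proxy's OpenAI-compatible endpoint instead of talking to Azure directly.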
-
Hey @HenryHengZJ , it turns out that I was able to get Flowise to successfully use the LiteLLM OpenAI Proxy server. The only caveat is that, for some reason, ... If that's successful, I'll submit a PR to expose this setting in the node.
-
On the Embeddings side, I think the same approach should work. @HenryHengZJ and @ishaan-jaff , if you wanted to make this support clearer to other Flowise users, you might want to consider cloning the existing OpenAI Embedding/Chat/LLM nodes into LiteLLM-specific variants.
-
So at this point, I've learned a couple of things.

Rather than try and update the existing nodes, it turns out that 3rd party proxy support is already present in the langchain library, according to its documentation. Specifically, it's exposed through the OpenAI client configuration options. So my new plan is to lean on that built-in support and point it at the LiteLLM proxy, as sketched below, rather than creating new LiteLLM-specific nodes.
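As a rough sketch of that idea (not Flowise's internal code): the import path, proxy URL, port, model alias, and key below are assumptions, and older langchain releases take `basePath` where newer ones take `baseURL`.

```typescript
import { ChatOpenAI } from "@langchain/openai";

// Sketch: a plain LangchainJS ChatOpenAI client aimed at an OpenAI-compatible
// proxy (e.g., a local LiteLLM instance).
const chat = new ChatOpenAI({
  modelName: "gpt-4",            // must match a model_name alias in the proxy config
  openAIApiKey: "sk-anything",   // the proxy holds the real Azure credentials
  configuration: {
    baseURL: "http://localhost:4000", // older versions: basePath instead of baseURL
  },
});

const reply = await chat.invoke("Hello through the proxy");
console.log(reply.content);
```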
-
So apparently the path of using the library's built-in proxy support is viable. Everything else will work correctly -- including built-in streaming support. I'm going to test embedding support next using the standard embeddings node (a sketch of what that might look like is below).
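Assuming the same proxy setup as above -- the model alias, proxy URL, and key are placeholders, not values from this thread:

```typescript
import { OpenAIEmbeddings } from "@langchain/openai";

// Sketch only: the standard OpenAIEmbeddings class pointed at the same proxy.
const embeddings = new OpenAIEmbeddings({
  modelName: "text-embedding-ada-002", // alias defined in the proxy's config.yaml
  openAIApiKey: "sk-anything",
  configuration: {
    baseURL: "http://localhost:4000",
  },
});

const vector = await embeddings.embedQuery("load-balanced embeddings test");
console.log(vector.length);
```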
-
Update: there's a catch if you're routing Azure OpenAI endpoints through the proxy while also using function calling and streaming mode. If all those things apply, then you'll get bitten with an error when attempting to use the bot.

The core issue has to do with a bug in LangchainJS -- specifically the combination of Azure OpenAI + Function Calling + Streaming Mode. The Langchain Python maintainers already fixed the issue in the Python codebase, BUT it's still an open issue with LangchainJS, and I've opened a ticket to track that. In the meantime, I think the workaround is to temporarily disable streaming, as sketched below.
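A rough sketch of that workaround at the langchain level -- the constructor field names and proxy URL are assumptions; in a chatflow the equivalent is whatever setting controls streaming on the chat model node:

```typescript
import { ChatOpenAI } from "@langchain/openai";

// Turn streaming off on the chat model so the Azure + function-calling +
// streaming code path in LangchainJS is never hit.
const chat = new ChatOpenAI({
  modelName: "gpt-4",
  streaming: false,                         // disable token streaming for now
  configuration: { baseURL: "http://localhost:4000" },
});
```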
-
Update: the fix is tracked in BerriAI/litellm#2138 and is fully resolved in their main branch.

To use the proxy to load balance Azure OpenAI endpoints, the whole process is driven by the proxy's config.yaml file -- chat models, embeddings, llms -- everything. So essentially, the high availability / load-balanced configuration sort of looks like the sketch below.
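As a rough, hypothetical example of that config.yaml for the 3-availability-zone setup described at the top of the thread -- all resource names, deployment names, API versions, keys, and the routing strategy are placeholders, not the author's actual values:

```yaml
# Hypothetical LiteLLM proxy config.yaml: three equivalent Azure OpenAI chat
# deployments and matching embeddings deployments, one per availability zone,
# all exposed under a single alias so the proxy spreads requests across them.
model_list:
  # --- chat completions, one entry per availability zone ---
  - model_name: gpt-4
    litellm_params:
      model: azure/gpt4-eastus              # azure/<deployment name>
      api_base: https://my-openai-eastus.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY_EASTUS
      api_version: "2023-07-01-preview"
  - model_name: gpt-4
    litellm_params:
      model: azure/gpt4-westus
      api_base: https://my-openai-westus.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY_WESTUS
      api_version: "2023-07-01-preview"
  - model_name: gpt-4
    litellm_params:
      model: azure/gpt4-centralus
      api_base: https://my-openai-centralus.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY_CENTRALUS
      api_version: "2023-07-01-preview"

  # --- embeddings, same pattern (remaining zones omitted for brevity) ---
  - model_name: text-embedding-ada-002
    litellm_params:
      model: azure/ada-002-eastus
      api_base: https://my-openai-eastus.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY_EASTUS
      api_version: "2023-07-01-preview"

router_settings:
  routing_strategy: simple-shuffle          # spread requests across the entries above
```

Flowise's OpenAI chat and embeddings nodes would then point their base path at the proxy (e.g., http://localhost:4000) and request the gpt-4 / text-embedding-ada-002 aliases; the proxy decides which availability zone serves each request, so a 429 in one zone no longer breaks the chat session.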