-
Let's say I want a chat flow to load balance all workloads across 3 equivalent chat endpoints and 3 equivalent embeddings endpoints -- the only real difference is that each of the 3 lives in a different availability zone. Is that possible to do in Flowise currently? The main issue is that using a single chat endpoint (e.g., gpt4) may yield rate limiting errors (429s), but if we split the requests across 3 equivalent endpoints in 3 different availability zones, then the rate limits are no longer an issue. I didn't see explicit support for this capability yet, but I didn't know if anyone else was already working on such a feature. Thanks in advance!
-
Great question! I think the support for this might not fall within the Flowise codebase, but rather within your deployment setup process.
-
Hey @HenryHengZJ , the issue with that approach is that you could never load balance within a given session. For example, suppose a single user is generating questions/queries/chat messages that exceed the API rate limit of a given embedding or chat completion endpoint -- within that session, the user would only ever be routed to a single Flowise instance. If we had external logic to re-route the user's session to a different Flowise endpoint, their previous chat history would likely be lost entirely. The only workaround would be to build some sort of chat history cache shared across all of the Flowise instances, which makes things even more complicated.

Ideally, a single chatflow session within Flowise would have enough logic to perform intra-session load balancing across one or more embedding or chat completion endpoints, so that the user never experiences API rate limit errors in the first place. To put it another way, splitting a single chat session across 3 different Flowise deployments seems very complicated, because most load balancers only distribute traffic per session. Instead, we'd need a load balancer that distributes traffic per message, and I'm not sure that's doable.
-
Hey @HenryHengZJ , it turns out there's a way to accomplish this outside of Flowise: the LiteLLM OpenAI Proxy Server: https://docs.litellm.ai/docs/proxy/quick_start. Essentially, if you point Flowise at the LiteLLM proxy server, that system can handle all of the load balancing transparently. The only issue is that there's no LiteLLM-compatible version of the Embedding, Chat, or LLM node types (which would need to be created, I think).
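For reference, a minimal, hypothetical sketch of what that proxy setup might look like -- the model alias, deployment name, endpoint URL, and environment variable below are placeholders, not values from this thread. The proxy reads a `config.yaml` and exposes a single OpenAI-compatible endpoint that Flowise's existing OpenAI nodes could then be pointed at:

```yaml
# Hypothetical minimal LiteLLM proxy config.yaml -- one Azure deployment,
# exposed to clients (e.g., Flowise) under the alias "gpt-4".
model_list:
  - model_name: gpt-4                     # alias that the client requests
    litellm_params:
      model: azure/my-gpt4-deployment     # azure/<deployment name>
      api_base: https://my-resource-eastus.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY_EASTUS
      api_version: "2023-07-01-preview"
```

Flowise would then talk to the proxy's OpenAI-compatible endpoint instead of talking to Azure directly.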
-
Hey @HenryHengZJ , it turns out that I was able to get Flowise to successfully use the LiteLLM OpenAI Proxy server. The only caveat is that, for some reason, ... If that's successful, I'll submit a PR to expose this setting in the node.
-
On the Embeddings side, I think the same approach should work. @HenryHengZJ and @ishaan-jaff , if you wanted to make this support clearer to other Flowise users, you might want to consider cloning the existing OpenAI Embedding/Chat/LLM nodes into LiteLLM-specific variants.
-
So at this point, I've learned a couple of things.

Rather than try and update the existing nodes, it turns out that 3rd party proxy support is already present in the langchain library, according to its documentation. Specifically, it's exposed through the OpenAI client configuration options. So my new plan is to lean on that built-in support and point it at the LiteLLM proxy, as sketched below, rather than creating new LiteLLM-specific nodes.
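As a rough sketch of that idea (not Flowise's internal code): the import path, proxy URL, port, model alias, and key below are assumptions, and older langchain releases take `basePath` where newer ones take `baseURL`.

```typescript
import { ChatOpenAI } from "@langchain/openai";

// Sketch: a plain LangchainJS ChatOpenAI client aimed at an OpenAI-compatible
// proxy (e.g., a local LiteLLM instance).
const chat = new ChatOpenAI({
  modelName: "gpt-4",            // must match a model_name alias in the proxy config
  openAIApiKey: "sk-anything",   // the proxy holds the real Azure credentials
  configuration: {
    baseURL: "http://localhost:4000", // older versions: basePath instead of baseURL
  },
});

const reply = await chat.invoke("Hello through the proxy");
console.log(reply.content);
```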
-
So apparently the path of using the library's built-in proxy support is viable. Everything else will work correctly -- including built-in streaming support. I'm going to test embedding support next using the standard embeddings node (a sketch of what that might look like is below).
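Assuming the same proxy setup as above -- the model alias, proxy URL, and key are placeholders, not values from this thread:

```typescript
import { OpenAIEmbeddings } from "@langchain/openai";

// Sketch only: the standard OpenAIEmbeddings class pointed at the same proxy.
const embeddings = new OpenAIEmbeddings({
  modelName: "text-embedding-ada-002", // alias defined in the proxy's config.yaml
  openAIApiKey: "sk-anything",
  configuration: {
    baseURL: "http://localhost:4000",
  },
});

const vector = await embeddings.embedQuery("load-balanced embeddings test");
console.log(vector.length);
```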
-
Update: there's a catch if you're routing Azure OpenAI endpoints through the proxy while also using function calling and streaming mode. If all those things apply, then you'll get bitten with an error when attempting to use the bot.

The core issue has to do with a bug in LangchainJS -- specifically the combination of Azure OpenAI + Function Calling + Streaming Mode. The Langchain Python maintainers already fixed the issue in the Python codebase, BUT it's still an open issue with LangchainJS, and I've opened a ticket to track that. In the meantime, I think the workaround is to temporarily disable streaming, as sketched below.
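A rough sketch of that workaround at the langchain level -- the constructor field names and proxy URL are assumptions; in a chatflow the equivalent is whatever setting controls streaming on the chat model node:

```typescript
import { ChatOpenAI } from "@langchain/openai";

// Turn streaming off on the chat model so the Azure + function-calling +
// streaming code path in LangchainJS is never hit.
const chat = new ChatOpenAI({
  modelName: "gpt-4",
  streaming: false,                         // disable token streaming for now
  configuration: { baseURL: "http://localhost:4000" },
});
```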
-
Update: the fix is tracked in BerriAI/litellm#2138 and is fully resolved in their main branch.

To use the proxy to load balance Azure OpenAI endpoints, the whole process is driven by the proxy's config.yaml file -- chat models, embeddings, llms -- everything. So essentially, the high availability / load-balanced configuration sort of looks like the sketch below.
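As a rough, hypothetical example of that config.yaml for the 3-availability-zone setup described at the top of the thread -- all resource names, deployment names, API versions, keys, and the routing strategy are placeholders, not the author's actual values:

```yaml
# Hypothetical LiteLLM proxy config.yaml: three equivalent Azure OpenAI chat
# deployments and matching embeddings deployments, one per availability zone,
# all exposed under a single alias so the proxy spreads requests across them.
model_list:
  # --- chat completions, one entry per availability zone ---
  - model_name: gpt-4
    litellm_params:
      model: azure/gpt4-eastus              # azure/<deployment name>
      api_base: https://my-openai-eastus.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY_EASTUS
      api_version: "2023-07-01-preview"
  - model_name: gpt-4
    litellm_params:
      model: azure/gpt4-westus
      api_base: https://my-openai-westus.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY_WESTUS
      api_version: "2023-07-01-preview"
  - model_name: gpt-4
    litellm_params:
      model: azure/gpt4-centralus
      api_base: https://my-openai-centralus.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY_CENTRALUS
      api_version: "2023-07-01-preview"

  # --- embeddings, same pattern (remaining zones omitted for brevity) ---
  - model_name: text-embedding-ada-002
    litellm_params:
      model: azure/ada-002-eastus
      api_base: https://my-openai-eastus.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY_EASTUS
      api_version: "2023-07-01-preview"

router_settings:
  routing_strategy: simple-shuffle          # spread requests across the entries above
```

Flowise's OpenAI chat and embeddings nodes would then point their base path at the proxy (e.g., http://localhost:4000) and request the gpt-4 / text-embedding-ada-002 aliases; the proxy decides which availability zone serves each request, so a 429 in one zone no longer breaks the chat session.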