15,000 user architecture #8470
Replies: 3 comments 7 replies
-
A large org has just successfully deployed LibreChat. I've asked if they would share some of their findings to build more confidence about scalability, as I would like to write a blog post about some of these details. I'm optimistic they will share soon!
-
We're looking to deploy LC into a cluster of backend servers right now, in anticipation of future features that might overwhelm our current single-node setup. A horizontally scaled setup like that is also better for availability. We have thousands of users daily and don't want to impact the user experience after the multi-server deployment, so we're proceeding with caution and studying the LC codebase for any potential issues in this area. Maybe this is a good place for discussion :-)

MongoDB and Redis can be scaled independently and aren't really LC's concern. The caching mechanism within the codebase is our focus for making LC ready for horizontal scaling and zero-downtime deployment.

✅ Preventing cross-deployment contamination of the Redis cache

In our zero-downtime deployment scheme, two different deployments can coexist for a short period of time: once the new deployment is ready, traffic is drained from the old deployment and routed to the new one. The two deployments are hooked to the same MongoDB and, in the near future, to the same Redis cluster. Having the new deployment use the old cache from the previous deployment is dangerous, but we can't just wipe everything in the shared Redis cluster either, due to the zero-downtime constraint. We solved this by using the deployment ID as the global prefix for Redis via REDIS_KEY_PREFIX_VAR (see the sketch below).

❌ Ensuring consistent behavior through a shared cache

❌ Static but frequently accessed cache should be in memory
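To picture the prefixing idea from the ✅ item, here is a minimal sketch of the effect such a per-deployment prefix has, assuming ioredis. This is not LibreChat's implementation; `DEPLOYMENT_ID`, the Redis URI, and the cache key name are made up for illustration.

```ts
// Minimal sketch (assuming ioredis) of a per-deployment key prefix: two
// deployments sharing one Redis cluster never see each other's cache entries.
// DEPLOYMENT_ID, REDIS_URI, and the key name are assumptions, not LibreChat names.
import Redis from 'ioredis';

const deploymentId = process.env.DEPLOYMENT_ID ?? 'local-dev';

// ioredis can transparently prepend a prefix to every key it touches.
const redis = new Redis(process.env.REDIS_URI ?? 'redis://localhost:6379', {
  keyPrefix: `${deploymentId}::`,
});

async function main() {
  // Deployment A writes "A::modelsCache", deployment B writes "B::modelsCache",
  // so a rolling deployment can never read a stale cache from the other side.
  await redis.set('modelsCache', JSON.stringify({ builtAt: Date.now() }), 'EX', 300);
  console.log(await redis.get('modelsCache'));
  redis.disconnect();
}

main();
```

In LibreChat itself the prefix comes from REDIS_KEY_PREFIX_VAR as mentioned above; the sketch only demonstrates why two deployments can then safely share one cluster.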
-
Hey Theo,

Congrats on deploying LibreChat to such a large user base, that's really cool! 👍

I work with enterprises on user adoption strategies, as AI can be quite technical for some non-technical users. I would love to hear about how you're approaching user training for all those users. We've gathered some interesting patterns from other deployments that might be helpful, and I'd be happy to share these insights.

Happy to take this to DMs if you'd prefer a more detailed discussion.

Cheers,
Lucas
On Tue, Jul 29, 2025 at 1:07 AM Theo N. Truong wrote:

To give you a bit of context on the scale at which we're operating with AI, including LibreChat: we have one or two MCP servers being added to librechat.yaml every week. The tools lists on these servers are also constantly changing. We also have an MCP discovery service that we hook LibreChat to (with our own custom code).

So, having initializeMCPs called only during startup is a bit of an issue for us. We have to redeploy to refresh the tools list when asked (thankfully it's a zero-downtime deployment). I'm taking that into account when solving the inconsistent MCP connection pool problem (2nd bullet point in the long comment above), too. Here's how I imagine the MCP connections will be managed in the context of a cluster of backend servers (a rough sketch follows the list):

1. When the servers boot up, one is selected as the leader.
2. This leader is responsible for periodically updating the GLOBAL tools list and the MCP servers list, both of which are stored on Redis.
3. All servers, including the leader, establish connections to the MCP servers on demand, in a lazy-loading fashion: the connection doesn't exist until the first request for that MCP server arrives on that LC backend server. Each server keeps its own list of live connections in memory (this cannot be shared). These connections can expire when not used by anyone for a while.
4. We will turn on sticky sessions on our cloud infra for the LC deployment to make sure users are not bounced from one server to another, which reduces the number of duplicate connections across servers, especially user-specific connections.
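To make the scheme a bit more concrete, here is a rough TypeScript sketch of steps 1-3, assuming ioredis; `MCPConnection`, `connectToMCPServer`, the key names, and the TTLs are hypothetical placeholders rather than LibreChat APIs.

```ts
// Rough sketch of the scheme above: a Redis-based leader lease plus a lazy,
// per-instance MCP connection pool with idle expiry. Not LibreChat code --
// the helper names, key names, and TTLs are hypothetical.
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URI ?? 'redis://localhost:6379');
const INSTANCE_ID = `${process.env.HOSTNAME ?? 'local'}-${process.pid}`;
const LEADER_KEY = 'mcp:leader';          // assumed key name
const LEADER_TTL_MS = 30_000;             // leader lease, renewed while alive
const CONNECTION_IDLE_MS = 10 * 60_000;   // drop MCP connections idle for 10 min

// Steps 1-2: whichever instance holds the lease refreshes the global tools /
// MCP server lists in Redis; the others only read them.
async function tryBecomeOrStayLeader(): Promise<boolean> {
  const acquired = await redis.set(LEADER_KEY, INSTANCE_ID, 'PX', LEADER_TTL_MS, 'NX');
  if (acquired === 'OK') return true;
  if ((await redis.get(LEADER_KEY)) === INSTANCE_ID) {
    await redis.pexpire(LEADER_KEY, LEADER_TTL_MS); // renew our own lease
    return true;
  }
  return false;
}

setInterval(async () => {
  if (await tryBecomeOrStayLeader()) {
    // e.g. re-read librechat.yaml / the discovery service and write the
    // resulting tools list to Redis (hypothetical helper, omitted here).
  }
}, LEADER_TTL_MS / 2);

// Step 3: live connections cannot be shared across servers, so each instance
// keeps its own lazy pool in memory.
interface MCPConnection { close(): Promise<void>; }

// Hypothetical helper -- a real one would open the MCP transport.
async function connectToMCPServer(name: string): Promise<MCPConnection> {
  return { close: async () => console.log(`closed ${name}`) };
}

const connections = new Map<string, { conn: MCPConnection; lastUsed: number }>();

async function getMCPConnection(serverName: string): Promise<MCPConnection> {
  const existing = connections.get(serverName);
  if (existing) {
    existing.lastUsed = Date.now();
    return existing.conn;
  }
  const conn = await connectToMCPServer(serverName);
  connections.set(serverName, { conn, lastUsed: Date.now() });
  return conn;
}

// Evict connections nobody has used for a while.
setInterval(async () => {
  const now = Date.now();
  for (const [name, entry] of connections) {
    if (now - entry.lastUsed > CONNECTION_IDLE_MS) {
      connections.delete(name);
      await entry.conn.close();
    }
  }
}, 60_000);
```

Step 4 (sticky sessions) would live in the load balancer / ingress configuration rather than in application code.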
-
This is a loaded question, but what high-level deployment architecture with high availability would you consider for a solution serving 15,000 users? Of course, not all users will be active simultaneously, but let's assume 15% peak concurrency. Has anyone tried this kind of load yet? We would need MongoDB and VectorDB replicas. The API service would also need to be redundant and fronted by load balancers, the RAG service should be deployed in clusters, and the GPUs should also sit behind load balancers. Any directions on the required architecture would be helpful.
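For a rough sense of what 15% peak concurrency means in replica counts, here is a small back-of-envelope sketch; every per-replica capacity figure is an assumption for illustration, not a measured LibreChat number.

```ts
// Back-of-envelope sizing for the question above. All capacity figures are
// assumptions -- substitute measured numbers from your own load tests.
const totalUsers = 15_000;
const peakConcurrency = 0.15;                          // 15% of users active at peak
const concurrentUsers = totalUsers * peakConcurrency;  // 2,250

// Assumed per-replica capacities (hypothetical).
const usersPerApiReplica = 300;   // LibreChat API node behind the load balancer
const usersPerRagReplica = 500;   // RAG API worker

const replicas = (capacity: number) => Math.ceil(concurrentUsers / capacity);

console.log(`Concurrent users at peak: ${concurrentUsers}`);           // 2250
console.log(`API replicas needed:      ${replicas(usersPerApiReplica)}`); // 8
console.log(`RAG replicas needed:      ${replicas(usersPerRagReplica)}`); // 5
```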