Guidance on Multi-Service LangGraph Architecture, Checkpointing, and Cross-Agent Orchestration #5036
shriyanskapoor asked this question in Q&A
Hi team,
I'm exploring a multi-agent architecture using LangGraph and am looking for guidance on how best to structure inter-agent communication and state management across service boundaries.
What I'm trying to build
I have a Supervisor Agent running in Service A that receives user requests. Depending on the request, it may:
Handle it directly, or
Decompose it into a series of steps that require coordination with other Domain Agents, each deployed independently (e.g., Service B, Service C, etc.).
Each agent is exposed as an HTTP service and runs its own LangGraph graph internally. I'm not using LangGraph's RemoteGraph infrastructure and don't intend to. Instead, each agent is a self-contained service that accepts inputs, processes them through its own graph, and returns outputs.
Questions and challenges
Is using agents-as-tools the only recommended way to orchestrate cross-service graphs like this?
The Supervisor Agent could wrap each Domain Agent as a LangGraph tool, but I'd like to confirm whether this is the recommended or most ergonomic approach.
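To make the agents-as-tools idea concrete, here is a minimal sketch of the wrapper I have in mind. The URL, the `input`/`thread_id`/`output` payload fields, and the helper names are all assumptions for illustration, not an existing API. The returned function is plain Python; in the Supervisor it would presumably be registered as a tool (for example with the `tool` decorator from `langchain_core.tools`).

```python
import json
from typing import Callable
from urllib import request

# Hypothetical endpoint; the real URL and payload shape depend on Service B.
DOMAIN_AGENT_B_URL = "http://service-b.internal/invoke"


def http_post_json(url: str, payload: dict, timeout: float = 30.0) -> dict:
    """POST a JSON payload and decode the JSON response (stdlib only)."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def make_domain_agent_tool(
    url: str,
    post: Callable[[str, dict], dict] = http_post_json,
) -> Callable[[str, str], str]:
    """Build a callable that delegates a task to a remote Domain Agent.

    `post` is injectable so the wrapper can be exercised without a network.
    """

    def call_domain_agent(task: str, thread_id: str) -> str:
        """Send the task to the Domain Agent and return its answer text."""
        # Assumed request contract: {"input": ..., "thread_id": ...}
        result = post(url, {"input": task, "thread_id": thread_id})
        # Assumed response contract: {"output": ...}
        return result["output"]

    return call_domain_agent
```

Passing `thread_id` through the tool call is deliberate: it is the only handle the Supervisor would have on the Domain Agent's own execution history.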
Checkpointing and recovery:
Suppose the Supervisor Agent invokes Domain Agent B in step 4 of its execution. Domain Agent B starts its own LangGraph and fails at, say, step 3.
If I only have access to the Supervisor Agent (i.e., I don’t persist or share execution state from Domain Agent B), how can I resume execution from within Domain Agent B's graph (step 3) in a future request?
Should I be propagating checkpoints across services manually?
Is LangGraph's built-in checkpointer designed to handle this kind of distributed recovery?
Is there a recommended pattern for managing short-term memory across service boundaries?
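One option I am considering, instead of shipping checkpoints between services, is to propagate only a stable thread ID and let each Domain Agent keep its own checkpoints. The sketch below is an assumption-heavy illustration: `child_thread_id`, `build_delegation_request`, and the request fields are hypothetical names. It relies on my understanding that invoking a LangGraph graph with `None` as input and the same `thread_id` continues from the last saved checkpoint for that thread.

```python
import hashlib
from typing import Optional


def child_thread_id(supervisor_thread_id: str, step: int, agent: str) -> str:
    """Derive a stable thread ID for one delegated call.

    The same supervisor thread, step, and agent always map to the same ID,
    so a retry reaches the same checkpoint history in the Domain Agent.
    """
    raw = f"{supervisor_thread_id}:{step}:{agent}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:32]


def build_delegation_request(
    supervisor_thread_id: str,
    step: int,
    agent: str,
    task: Optional[str],
    resume: bool = False,
) -> dict:
    """Build the (hypothetical) HTTP payload sent to a Domain Agent.

    On resume, no new input is sent; the Domain Agent is expected to
    continue its graph from its own last checkpoint for this thread.
    """
    return {
        "thread_id": child_thread_id(supervisor_thread_id, step, agent),
        "input": None if resume else task,
        "resume": resume,
    }
```

With this shape, the Supervisor never needs to see Domain Agent B's internal state. It only needs to remember (or re-derive) which thread ID it used at step 4.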
Fallback strategy:
If seamless resumption of execution across network boundaries isn't currently feasible, what is the best way to get at least isolated, per-agent checkpointing, so that each agent can be resumed on its own if needed?
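To illustrate what I mean by isolated checkpointing, here is a toy, stdlib-only stand-in for the behavior I would want inside each Domain Agent. `StepCheckpointer` and `run_agent` are hypothetical names and are not LangGraph APIs; in a real service I would presumably compile the graph with one of LangGraph's checkpointers instead.

```python
import json
import sqlite3
from typing import Callable, List, Optional, Tuple

Step = Callable[[dict], dict]


class StepCheckpointer:
    """Toy per-agent checkpoint store: one row per thread_id."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints "
            "(thread_id TEXT PRIMARY KEY, next_step INTEGER, state TEXT)"
        )

    def load(self, thread_id: str) -> Optional[Tuple[int, dict]]:
        """Return (next_step, state) for the thread, or None if unseen."""
        row = self.conn.execute(
            "SELECT next_step, state FROM checkpoints WHERE thread_id = ?",
            (thread_id,),
        ).fetchone()
        return (row[0], json.loads(row[1])) if row else None

    def save(self, thread_id: str, next_step: int, state: dict) -> None:
        """Record the state reached and the index of the step to run next."""
        self.conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
            (thread_id, next_step, json.dumps(state)),
        )
        self.conn.commit()


def run_agent(
    steps: List[Step],
    saver: StepCheckpointer,
    thread_id: str,
    initial_state: dict,
) -> dict:
    """Run steps in order, resuming after the last completed step."""
    saved = saver.load(thread_id)
    start, state = saved if saved else (0, dict(initial_state))
    for index in range(start, len(steps)):
        # A step may raise; everything completed before it is already saved.
        state = steps[index](state)
        saver.save(thread_id, index + 1, state)
    return state
```

If step 3 fails, a later request carrying the same `thread_id` would re-enter `run_agent` and start at step 3 rather than step 1, without the Supervisor holding any of the Domain Agent's state.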
Closing thoughts
Things work well when I keep the entire flow within a single service/graph. But when scaling out across agents in multiple services, I’m unsure how to maintain seamless orchestration, checkpointing, and state continuity.
If LangGraph doesn’t yet fully support this pattern, is there a recommended way to extend it? Or are there patterns/tooling outside LangGraph that complement this model?
Appreciate any insights, redirections, or best practices you can share. Thank you for your time and for the excellent work on this framework.