A production-ready, multi-agent orchestration system built with LangGraph. This banking multi-agent system intelligently routes queries to specialized expert agents while maximizing LLM serving efficiency through prefix caching.
- Optimized for Prefix Caching - Architecture designed to maximize KV cache hit rates
- Multi-Agent System - Router + 3 specialized expert agents (Technical, Compliance, Support)
- Full Observability - Integrated Langfuse tracing for production monitoring and auto evaluation
- Session Management - Persistent conversation state with checkpointing, enabling stateful, multi-turn conversations
- Production Ready - Error handling, logging, health checks, containerization
- FastAPI + LangGraph Banking Multi-Agent System
The system implements a supervisor pattern with specialized expert agents:
Flow:
- Router Node - Analyzes incoming query with recent chat history and routes to appropriate expert(s)
- Specialized Agents - Process domain-specific queries in parallel:
- Technical Specialist - Responsible for extracting system specifications, API limits, and troubleshooting steps from the manual.
- Compliance Auditor - Interprets regulatory rules, "Can/Cannot" constraints, and policy boundaries.
- Support Concierge - Summarizes complex procedures into step-by-step guides for non-technical staff.
- Supervisor Response - Synthesizes expert outputs into final answer
- State Management - Persistent conversation state with checkpointing
This system processes a 50-page Internal Operations & Compliance Manual (~25,000 tokens) for every query. Without optimization, this would result in:
- High computational cost per request, since attention and KV states for the same prefix are recomputed on every call (e.g., with Claude Sonnet, cached input tokens cost $0.30/M while uncached input tokens cost $3/M, a 10x difference)
- High Time To First Token (TTFT)
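As a back-of-envelope check, using the illustrative Claude Sonnet prices above, the prefix cost per request works out as follows:

```python
# Back-of-envelope cost of re-reading the ~25,000-token manual on every
# request, at the illustrative prices above ($3/M uncached vs $0.30/M cached).
MANUAL_TOKENS = 25_000

uncached_cost = MANUAL_TOKENS / 1_000_000 * 3.00  # dollars per request
cached_cost = MANUAL_TOKENS / 1_000_000 * 0.30    # dollars per request

print(f"uncached: ${uncached_cost:.4f}, cached: ${cached_cost:.4f}")
```

At any realistic traffic volume, the 10x gap between the two rates dominates the serving bill, which is why the architecture prioritizes cache hits on this prefix.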
Following the design principles described in the Manus Context Engineering Blog, this system is explicitly engineered around KV-cache behavior in autoregressive language models.
Because LLMs are autoregressive, even a single-token difference in the prompt prefix will invalidate the cached key–value (KV) states and force the model to recompute the full attention matrix. To avoid this, the prompt prefix must remain strictly stable across invocations.
The system therefore adopts the following prompt composition model:
Prompt = [Shared Fixed Prefix] + [Agent-Specific Dynamic Suffix]
When using vLLM, which implements PagedAttention, KV-cache memory is managed in fixed-size blocks (default: 16 tokens per block), and cached blocks are reused when the prefix matches. If agent role instructions were embedded before the manual (or interleaved differently per agent):
- vLLM would allocate separate KV-cache blocks per agent
- KV-cache memory usage would roughly triple in a 3-agent system, reducing throughput
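To make the block accounting concrete, a rough sketch assuming the default 16-token block size and a ~25,000-token shared manual:

```python
# Rough KV-cache block accounting under PagedAttention-style block
# management (default block size: 16 tokens), for a ~25,000-token manual.
BLOCK_SIZE = 16
MANUAL_TOKENS = 25_000
NUM_AGENTS = 3

blocks_per_copy = -(-MANUAL_TOKENS // BLOCK_SIZE)  # ceiling division

# Shared identical prefix: one set of blocks reused by every agent.
shared_blocks = blocks_per_copy

# Divergent per-agent prefixes: each agent needs its own copy.
divergent_blocks = blocks_per_copy * NUM_AGENTS

print(shared_blocks, divergent_blocks)
```

The divergent layout holds three times as many blocks resident for the same logical content, which is exactly the memory overhead the shared-prefix design avoids.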
To prevent this, the system enforces a fully static, identical prefix reused across all agent calls:
- The large shared manual (the ~25,000-token Operations Manual) is placed at the beginning of every prompt as a fixed prefix
- Agent-specific instructions are appended after the manual:
  - Contain role instructions and formatting rules
  - Small and recomputed per agent
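This composition can be sketched as follows (a minimal illustration; `OPERATIONS_MANUAL` and the role instructions below are placeholders, not the project's actual prompt templates):

```python
# Minimal sketch of the prompt composition model:
# Prompt = [Shared Fixed Prefix] + [Agent-Specific Dynamic Suffix]
OPERATIONS_MANUAL = "<~25,000-token Internal Operations & Compliance Manual>"

# The fixed prefix must be byte-for-byte identical across all agent calls,
# or the shared KV-cache blocks cannot be reused.
SHARED_PREFIX = f"You are an assistant for ABC Bank.\n\n{OPERATIONS_MANUAL}\n\n"

AGENT_SUFFIXES = {
    "technical": "Role: Technical Specialist. Extract system specs and API limits.",
    "compliance": "Role: Compliance Auditor. Interpret regulatory constraints.",
    "support": "Role: Support Concierge. Summarize procedures step by step.",
}

def build_prompt(agent: str) -> str:
    # Shared prefix first, small per-agent suffix last.
    return SHARED_PREFIX + AGENT_SUFFIXES[agent]

# Every prompt shares an identical prefix, so cached KV states are reusable.
assert all(build_prompt(a).startswith(SHARED_PREFIX) for a in AGENT_SUFFIXES)
```

Note that even whitespace or ordering changes in the prefix break the byte-for-byte match, so the prefix should be assembled once and reused verbatim.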
```
User Query
    ↓
  Router
    ↓
(Technical | Compliance | Support)   ← parallel when needed
    ↓
Synthesis
    ↓
Final Response
```
- The Router node determines which expert agent(s) should handle the query.
- One or more specialized agents may be executed in parallel.
- The Synthesis node combines expert outputs into a single response.
Because all agents share the same fixed prompt prefix, only the first agent invocation incurs the full prefix cost. Subsequent agents reuse cached KV states, reducing Time To First Token (TTFT).
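The effect can be illustrated with a toy cache keyed by the prompt prefix (a simulation of the idea only, not vLLM's actual implementation):

```python
# Toy simulation of prefix caching: only the first call with a given
# prefix pays the expensive "prefill" cost; later calls reuse the entry.
prefix_cache: dict[str, str] = {}
prefill_count = 0

def run_agent(shared_prefix: str, suffix: str) -> None:
    global prefill_count
    if shared_prefix not in prefix_cache:
        prefix_cache[shared_prefix] = "<KV states>"  # expensive prefill
        prefill_count += 1
    # Suffix tokens are always computed fresh, but they are small and cheap.

MANUAL = "<shared 25k-token manual>"
for role in ("technical", "compliance", "support"):
    run_agent(MANUAL, f"role: {role}")

print(prefill_count)  # the manual is prefilled only once
```

In the real system the same property holds across requests too: as long as the manual prefix stays identical, later sessions also hit the warm cache.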
State schema is defined in `/agent-worker/app/schemas/base.py`:

```python
import operator
from typing import Annotated, List, Literal, Optional, TypedDict

from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages
from pydantic import BaseModel, Field


class SubAgentOutput(TypedDict):
    """Represents the output produced by an individual expert agent."""
    source: str
    result: Optional[str]


class Router(TypedDict):
    """
    Represents a single routing decision produced by the Router, including:
    - The target agent
    - A decomposed query rewritten or scoped for that agent's expertise,
      based on the user's query and conversation history
    """
    source: Literal["technical", "compliance", "support"]
    query: str


class RouterResult(BaseModel):
    """Structured output from the Router LLM."""
    results: List[Router] = Field(default_factory=list, description="Routing decisions")


class BankingAgentState(TypedDict):
    messages: Annotated[List[BaseMessage], add_messages]
    session_id: Optional[str]
    user_id: Optional[str]
    router_result: Annotated[Optional[RouterResult], lambda x, y: x or y]
    results: Annotated[list[SubAgentOutput], operator.add]
```

```
agent-worker/
├── app/
│   ├── api/
│   │   ├── middleware/              # Authorization, logging, request context
│   │   └── v1/
│   │       ├── endpoints/
│   │       │   ├── chat.py          # Chat & streaming endpoints
│   │       │   └── health.py        # Health check endpoint
│   │       └── router.py            # API router
│   │
│   ├── core/
│   │   ├── agents/                  # Agent abstractions and entry points
│   │   │   ├── base.py              # BaseAgent abstraction
│   │   │   ├── technical.py         # Technical Agent graph wrapper
│   │   │   ├── compliance.py        # Compliance Agent graph wrapper
│   │   │   ├── support.py           # Support Agent graph wrapper
│   │   │   └── supervisor.py        # Supervisor Agent (router + synthesis)
│   │   │
│   │   ├── graphs/                  # LangGraph state graphs (node-level)
│   │   │   ├── technical/           # Technical agent nodes
│   │   │   ├── compliance/          # Compliance agent nodes
│   │   │   ├── support/             # Support agent nodes
│   │   │   └── supervisor/          # Router & synthesis nodes
│   │   │
│   │   ├── llm/                     # LLM manager (vLLM / OpenAI-compatible API)
│   │   ├── memory/                  # Persistent state & checkpointing
│   │   ├── prompt/                  # Prompt management & versioning
│   │   │   ├── default/             # Default prompt templates
│   │   │   ├── base.py              # Base Prompt Manager
│   │   │   ├── langfuse_manager.py  # Langfuse-backed prompt manager
│   │   │   └── __init__.py          # prompt_manager factory
│   │   │
│   │   └── tracing/                 # Langfuse tracing & observability
│   │
│   ├── data/                        # Static knowledge sources
│   │   └── operations_manual_full_merged.md
│   │
│   ├── schemas/                     # LangGraph state schema & LLM config models
│   ├── utils/                       # Logging, helpers, error handling
│   ├── config.py                    # Application settings
│   └── main.py                      # FastAPI application entry point
│
├── Dockerfile                       # Production container image
├── docker-compose.yaml              # Local / multi-service orchestration
├── .dockerignore                    # Docker build optimization
├── pyproject.toml                   # Dependencies & project metadata
├── uv.lock                          # Dependency lock file
├── .env.example                     # Environment variable template
└── README.md                        # Project documentation
```
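The `Annotated` reducers in `BankingAgentState` (the state schema above) control how LangGraph merges updates from parallel nodes; their semantics can be checked with plain Python, no graph required:

```python
import operator

# `operator.add` concatenates lists, so parallel agents can each append
# their SubAgentOutput without overwriting one another.
technical = [{"source": "technical", "result": "API limit is 100 req/s"}]
support = [{"source": "support", "result": "Step-by-step guide..."}]
merged = operator.add(technical, support)
assert len(merged) == 2

# `lambda x, y: x or y` keeps the first non-None router_result,
# so a later empty update cannot clobber the routing decision.
keep_first = lambda x, y: x or y
assert keep_first({"results": ["..."]}, None) == {"results": ["..."]}
assert keep_first(None, {"results": ["..."]}) == {"results": ["..."]}
```

This is why `results` accumulates one entry per expert while `router_result` stays a single value even when several nodes write to state concurrently.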
- Python 3.13+
- Docker & Docker Compose (for containerized deployment)
- UV (Python package manager):

```shell
curl -LsSf https://astral.sh/uv/install.sh | sh
```

- LLM serving endpoint (vLLM/LMCache with prefix caching support)
- Langfuse (for observability) - Optional but recommended
Langfuse is an open-source LLM engineering platform that provides full observability for your agent system, including trace visualization, session tracking, and LLM-as-judge evaluations.
In this assignment, Langfuse is used to:
- Trace multi-agent execution paths
- Inspect routing decisions and agent outputs
- Measure latency and Time To First Token (TTFT)
- Monitoring, Debugging and Evaluation
- Go to the Langfuse folder:

```shell
cd langfuse
```

- Start Langfuse services:

```shell
# Create the network if it does not exist
docker network create agent-net

# Start Langfuse (change credential values if needed)
docker compose up -d --build
```

- Access the Langfuse UI:
- URL: http://localhost:3030
- Create account and project
- Copy API keys from Settings
- Sign up at cloud.langfuse.com
- Create a project
- Copy your API keys
Update .env with your Langfuse credentials:
```shell
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_BASE_URL=http://localhost:3030  # or https://cloud.langfuse.com
```

Create a local environment file from the provided template:

```shell
cd agent-worker
cp .env.example .env
```

Edit the `.env` file and configure the following variables as needed.
```shell
# Application Settings
APP_NAME=banking-agent
APP_VERSION=0.1.0
APP_DESCRIPTION=Banking Agent Application
ENVIRONMENT=development
API_CORS_ORIGINS="*"
DEBUG=true
LOG_LEVEL=INFO
LOCAL_TIMEZONE=Asia/Ho_Chi_Minh

# API Configuration
API_HOST=0.0.0.0
API_PORT=<API_PORT>

# LLM Server (OpenAI API Compatible) Configuration
LLM_TYPE=openai-like
LLM_MODEL_NAME=Qwen/Qwen3-30B-A3B-Instruct-2507  # or another LLM
LLM_BASE_URL=http://<LLM_HOST>:<LLM_PORT>/v1
LLM_API_KEY=vllm_sk_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
LLM_EXTRA_BODY="{'top_k':20, 'min_p': 0}"

# Langfuse Configuration
TRACING_TYPE=langfuse
LANGFUSE_SECRET_KEY=sk-lf-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
LANGFUSE_PUBLIC_KEY=pk-lf-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
LANGFUSE_BASE_URL=http://localhost:3030
LANGFUSE_ENABLED=true
LANGFUSE_CACHE_TTL=300

# Prompt configuration
PROMPT_MANAGER_TYPE=langfuse
PROMPT_NAME_ROUTER=agent-router
PROMPT_NAME_TECHNICAL_SPECIALIST=agent_technical_specialist
PROMPT_NAME_COMPLIANCE_AUDITOR=agent_compliance_auditor
PROMPT_NAME_SUPPORT_CONCIERGE=agent_support_concierge
PROMPT_NAME_RESPONSE=synthesize-response
PROMPT_FALLBACK_TO_DEFAULT=true

# Memory to store the agent's state
MEMORY_TYPE=inmemory  # Valid values: 'inmemory' or 'postgres'
```

| Section | Purpose | Notes |
|---|---|---|
| `LLM_BASE_URL` | vLLM serving endpoint | Must support prefix caching (APC) |
| `MEMORY_TYPE` | Agent state persistence | `inmemory` for local testing, `postgres` for production |
If you want to self-host the LLM with vLLM/LMCache, follow vLLM Config.
```shell
cd agent-worker

# Create shared Docker network (used by Langfuse if enabled)
docker network create agent-net

# Build and start services
docker compose up -d --build

# View application logs
docker logs -f banking-agent-worker-service
```

```shell
# Install uv (Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment and install dependencies
uv sync

# Run the application with auto-reload
uv run uvicorn app.main:app --host 0.0.0.0 --port 8238 --reload
```

```shell
# Start all services
docker compose up -d

# View logs
docker compose logs -f

# Stop services
docker compose down

# Rebuild after code changes
docker compose up -d --build
```

```shell
# Activate virtual environment
source .venv/bin/activate  # Linux/Mac

# Run with auto-reload
uv run uvicorn app.main:app --host 0.0.0.0 --port 8238 --reload
```

```shell
# Check health endpoint
curl -X 'GET' \
  'http://localhost:8238/api/v1/health' \
  -H 'accept: application/json'
```

Expected response:

```json
{
  "status": "ok",
  "version": "0.1.0",
  "timestamp": 1768293084
}
```

- Swagger UI: http://localhost:8238/docs
- ReDoc: http://localhost:8238/redoc
All API endpoints are prefixed with:
http://localhost:8238/api/v1
Endpoint: POST /api/v1/chat
Description: Generate a complete chat response
Request:
```shell
curl --location 'http://localhost:8238/api/v1/chat' \
--header 'Content-Type: application/json' \
--header 'X-Request-ID: 43246029-b8cc-4a2d-1743-1100797bbd645' \
--data '{
    "requestParameters": {
        "message": "How can I get my bank account balance?",
        "sessionID": "e6d5a8dc-3fcf-43a3-9d2a-3b0135q63c6e",
        "userID": "huyhoangcloud"
    }
}'
```

Request Body:

```json
{
  "requestParameters": {
    "message": "string",    // User query
    "sessionID": "string",  // Session ID for conversation persistence (uuid4)
    "userID": "string"      // User identifier for tracking (uuid4 or string)
  }
}
```

Response:

```json
{
  "took": 25453,
  "responseDateTime": "2026-01-13T15:50:12.246479+07:00",
  "responseStatus": {
    "responseCode": "200 Successfully"
  },
  "responseData": {
    "message": "You can check your ABC Bank account balance using **three secure and convenient methods**: the **customer portal (website)**, the **mobile app**, or by visiting a **local branch**. Below is a clear, step-by-step guide for each option..."
  }
}
```

Response Fields:

| Field | Type | Description |
|---|---|---|
| `took` | number | Total processing time in milliseconds (E2E latency) |
| `responseDateTime` | string | Response timestamp |
| `responseStatus.responseCode` | string | Execution status message |
| `responseData.message` | string | Final AI-generated response |
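For reference, the same call can be made from Python. This is a sketch only: it assumes the service is running on localhost:8238 and uses the third-party `requests` library as the HTTP client.

```python
import uuid

def build_chat_payload(message: str, session_id: str, user_id: str) -> dict:
    """Build the request body expected by POST /api/v1/chat."""
    return {
        "requestParameters": {
            "message": message,
            "sessionID": session_id,
            "userID": user_id,
        }
    }

def send_chat(payload: dict) -> str:
    """Send the request and return the final synthesized message.

    Requires the service to be running; `requests` is a third-party package.
    """
    import requests  # pip install requests
    resp = requests.post(
        "http://localhost:8238/api/v1/chat",
        json=payload,
        headers={"X-Request-ID": str(uuid.uuid4())},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["responseData"]["message"]

payload = build_chat_payload(
    "How can I get my bank account balance?",
    session_id=str(uuid.uuid4()),
    user_id="huyhoangcloud",
)
```

Reusing the same `sessionID` across calls is what makes the checkpointer resume the conversation state.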
Endpoint: POST /api/v1/chat/stream
Description: Stream response chunks in real-time using Server-Sent Events (SSE).
Request:
```shell
curl --location 'http://localhost:8238/api/v1/chat/stream' \
--header 'Content-Type: application/json' \
--header 'X-Request-ID: 43246029-b8cc-2a2d-1743-1100797bbd645' \
--data '{
    "requestParameters": {
        "message": "How can I get my bank account balance?",
        "sessionID": "e6d5a8dc-3fcf-43a3-9d2a-3b0135q63c1e",
        "userID": "huyhoangcloud"
    }
}'
```

Response Format (SSE):

Each chunk is delivered as a discrete SSE event:

```
data: {"chunk": "To check"}
data: {"chunk": " your account"}
data: {"chunk": " balance at ABC Bank"}
...
data: {"chunk": " contact support."}
data: {"done": true}
```
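A minimal consumer sketch for this SSE format (parsing only; pairing it with a streaming HTTP client such as `httpx` is left out):

```python
import json

def parse_sse_lines(lines):
    """Accumulate `data: {"chunk": ...}` events until `{"done": true}`."""
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # ignore blank lines and keep-alive comments
        event = json.loads(line[len("data: "):])
        if event.get("done"):
            break
        parts.append(event["chunk"])
    return "".join(parts)

sample = [
    'data: {"chunk": "To check"}',
    'data: {"chunk": " your account"}',
    'data: {"chunk": " balance, contact support."}',
    'data: {"done": true}',
]
print(parse_sse_lines(sample))  # -> "To check your account balance, contact support."
```

Concatenating `chunk` values in arrival order reconstructs the full response, and the `{"done": true}` sentinel tells the client when to stop reading.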
| Header | Required | Description | Example |
|---|---|---|---|
| `Content-Type` | Yes | Must be `application/json` | `application/json` |
| `X-Request-ID` | No | Custom request identifier for tracing | uuid-v4 |
- Open Langfuse UI: http://localhost:3030 or https://cloud.langfuse.com
- Navigate to Tracing, Sessions, Users
- Find your test requests, session and user
- Trace Request, Latency, Tokens, TTFT
- Trace Session
- Trace User
Following the official vLLM guide on Prometheus and Grafana monitoring, my assignment also includes a ready-to-use monitoring stack for observing LLM serving performance.
- Go to the monitoring folder:

```shell
cd monitoring
```

- Change the target server IP in `prometheus.yaml`
- Start the monitoring services:

```shell
# Start Prometheus and Grafana
docker compose up -d --build
```

- Access the Prometheus UI:
  - URL: http://localhost:9090/targets
  - Verify that the vLLM metrics endpoint is listed and marked as UP
- Access the Grafana UI:
  - URL: http://localhost:3000
  - Default username and password are both `admin`
- After logging in:
  - Change the password
  - Navigate to Dashboards and select the vLLM Monitoring dashboard
In this assignment, I use LLM-as-a-Judge evaluations with Langfuse to assess the system along two key dimensions:
- Helpfulness – Whether the final response is accurate, clear, and useful to the user.
- Routing Correctness – Whether the Router selects the appropriate expert agent(s) for the query.
- Open the Langfuse UI:
- Self-hosted: http://localhost:3030
- Cloud: https://cloud.langfuse.com
- Navigate to Evaluations and select LLM-as-a-Judge
- Create a new LLM Connection based on your LLM provider
- Select the predefined Helpfulness evaluator provided by Langfuse or integrate with Ragas
- Configure `JsonPath` mappings for the evaluation inputs:
  - `{{query}}`: `$.messages[?(@.type=="human")].content`
  - `{{generation}}`: `$.messages[?(@.type=="ai")].content`
- Select Create Custom Evaluator
- Define:
- Evaluation prompt
- Score reasoning prompt
- Score range
- Configure `JsonPath` mappings:
  - `{{query}}`: `$.messages[?(@.type=="human")].content`
  - `{{router_result}}`: `$.router_result.results`
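These JsonPath expressions filter the trace payload by message type; their intent can be mirrored in plain Python against a simplified trace shape (an illustration only, not Langfuse's actual internals):

```python
# Plain-Python equivalent of the JsonPath mappings above, run against a
# simplified trace payload (the real shape is defined by the Langfuse trace).
trace = {
    "messages": [
        {"type": "human", "content": "How do I reset my PIN?"},
        {"type": "ai", "content": "Follow these steps..."},
    ],
    "router_result": {"results": [{"source": "support", "query": "reset PIN steps"}]},
}

# {{query}}: $.messages[?(@.type=="human")].content
query = [m["content"] for m in trace["messages"] if m["type"] == "human"]

# {{generation}}: $.messages[?(@.type=="ai")].content
generation = [m["content"] for m in trace["messages"] if m["type"] == "ai"]

# {{router_result}}: $.router_result.results
router_result = trace["router_result"]["results"]

print(query, generation, router_result)
```

If an expression matches nothing in the Langfuse UI, comparing it against the raw trace JSON this way is a quick sanity check.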
Each tracing request now includes both a Helpfulness and a Routing Correctness score.
This project is licensed under the MIT License.
See the LICENSE file for full license details.
For questions or issues, please:
- Open an issue on the GitHub repository, or
- Contact me directly at: huyhoang18bkhn@gmail.com
This project builds on the following tools and frameworks:
- LangChain & LangGraph — Multi-agent orchestration and stateful workflows
- Langfuse — LLM observability, tracing, and evaluation
- vLLM — High-performance LLM serving with prefix caching (APC)
- FastAPI — Modern, high-performance web framework for APIs