A production-ready, multi-agent orchestration system built with LangGraph. This banking multi-agent system intelligently routes queries to specialized expert agents while maximizing LLM serving efficiency through prefix caching.
- Optimized for Prefix Caching - Architecture designed to maximize KV cache hit rates
- Multi-Agent System - Router + 3 specialized expert agents (Technical, Compliance, Support)
- Full Observability - Integrated Langfuse tracing for production monitoring and auto evaluation
- Session Management - Persistent conversation state with checkpointing, enabling stateful, multi-turn conversations
- Production Ready - Error handling, logging, health checks, containerization
- FastAPI + LangGraph Banking Multi-Agent System
The system implements a supervisor pattern with specialized expert agents:
Flow:
- Router Node - Analyzes incoming query with recent chat history and routes to appropriate expert(s)
- Specialized Agents - Process domain-specific queries in parallel:
- Technical Specialist - Responsible for extracting system specifications, API limits, and troubleshooting steps from the manual.
- Compliance Auditor - Interprets regulatory rules, "Can/Cannot" constraints, and policy boundaries.
- Support Concierge - Summarizes complex procedures into step-by-step guides for non-technical staff.
- Supervisor Response - Synthesizes expert outputs into final answer
- State Management - Persistent conversation state with checkpointing
This system processes a 50-page Internal Operations & Compliance Manual (~25,000 tokens) for every query. Without optimization, this would result in:
- High computational cost per request, since attention and KV states for the same prefix are recomputed on every call (e.g., with Claude Sonnet, cached input tokens cost $0.30/M while uncached input tokens cost $3/M, a 10x difference)
- High Time To First Token (TTFT)
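As a back-of-envelope check, using the illustrative Claude Sonnet prices above, the prefix cost per request works out as follows:

```python
# Back-of-envelope cost of re-reading the ~25,000-token manual on every
# request, at the illustrative prices above ($3/M uncached vs $0.30/M cached).
MANUAL_TOKENS = 25_000

uncached_cost = MANUAL_TOKENS / 1_000_000 * 3.00  # dollars per request
cached_cost = MANUAL_TOKENS / 1_000_000 * 0.30    # dollars per request

print(f"uncached: ${uncached_cost:.4f}, cached: ${cached_cost:.4f}")
```

At any realistic traffic volume, the 10x gap between the two rates dominates the serving bill, which is why the architecture prioritizes cache hits on this prefix.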
Following the design principles described in the Manus Context Engineering Blog, this system is explicitly engineered around KV-cache behavior in autoregressive language models.
Because LLMs are autoregressive, even a single-token difference in the prompt prefix will invalidate the cached key–value (KV) states and force the model to recompute the full attention matrix. To avoid this, the prompt prefix must remain strictly stable across invocations.
The system therefore adopts the following prompt composition model:
Prompt = [Shared Fixed Prefix] + [Agent-Specific Dynamic Suffix]
When using vLLM, which implements PagedAttention, KV-cache memory is managed in fixed-size blocks (default: 16 tokens per block), and cached blocks are reused when the prefix matches. If agent role instructions were embedded before the manual (or interleaved differently per agent):
- vLLM would allocate separate KV-cache blocks per agent
- KV-cache memory usage would roughly triple in a 3-agent system, reducing throughput
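To make the block accounting concrete, a rough sketch assuming the default 16-token block size and a ~25,000-token shared manual:

```python
# Rough KV-cache block accounting under PagedAttention-style block
# management (default block size: 16 tokens), for a ~25,000-token manual.
BLOCK_SIZE = 16
MANUAL_TOKENS = 25_000
NUM_AGENTS = 3

blocks_per_copy = -(-MANUAL_TOKENS // BLOCK_SIZE)  # ceiling division

# Shared identical prefix: one set of blocks reused by every agent.
shared_blocks = blocks_per_copy

# Divergent per-agent prefixes: each agent needs its own copy.
divergent_blocks = blocks_per_copy * NUM_AGENTS

print(shared_blocks, divergent_blocks)
```

The divergent layout holds three times as many blocks resident for the same logical content, which is exactly the memory overhead the shared-prefix design avoids.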
To prevent this, the system enforces a fully static, identical prefix reused across all agent calls:
- The large shared manual (the ~25,000-token Operations Manual) is placed at the beginning of every prompt as a fixed prefix
- Agent-specific instructions are appended after the manual:
  - Contain role instructions and formatting rules
  - Small and recomputed per agent
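This composition can be sketched as follows (a minimal illustration; `OPERATIONS_MANUAL` and the role instructions below are placeholders, not the project's actual prompt templates):

```python
# Minimal sketch of the prompt composition model:
# Prompt = [Shared Fixed Prefix] + [Agent-Specific Dynamic Suffix]
OPERATIONS_MANUAL = "<~25,000-token Internal Operations & Compliance Manual>"

# The fixed prefix must be byte-for-byte identical across all agent calls,
# or the shared KV-cache blocks cannot be reused.
SHARED_PREFIX = f"You are an assistant for ABC Bank.\n\n{OPERATIONS_MANUAL}\n\n"

AGENT_SUFFIXES = {
    "technical": "Role: Technical Specialist. Extract system specs and API limits.",
    "compliance": "Role: Compliance Auditor. Interpret regulatory constraints.",
    "support": "Role: Support Concierge. Summarize procedures step by step.",
}

def build_prompt(agent: str) -> str:
    # Shared prefix first, small per-agent suffix last.
    return SHARED_PREFIX + AGENT_SUFFIXES[agent]

# Every prompt shares an identical prefix, so cached KV states are reusable.
assert all(build_prompt(a).startswith(SHARED_PREFIX) for a in AGENT_SUFFIXES)
```

Note that even whitespace or ordering changes in the prefix break the byte-for-byte match, so the prefix should be assembled once and reused verbatim.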
```
User Query
    ↓
  Router
    ↓
(Technical | Compliance | Support)   ← parallel when needed
    ↓
Synthesis
    ↓
Final Response
```
- The Router node determines which expert agent(s) should handle the query.
- One or more specialized agents may be executed in parallel.
- The Synthesis node combines expert outputs into a single response.
Because all agents share the same fixed prompt prefix, only the first agent invocation incurs the full prefix cost. Subsequent agents reuse cached KV states, reducing Time To First Token (TTFT).
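The effect can be illustrated with a toy cache keyed by the prompt prefix (a simulation of the idea only, not vLLM's actual implementation):

```python
# Toy simulation of prefix caching: only the first call with a given
# prefix pays the expensive "prefill" cost; later calls reuse the entry.
prefix_cache: dict[str, str] = {}
prefill_count = 0

def run_agent(shared_prefix: str, suffix: str) -> None:
    global prefill_count
    if shared_prefix not in prefix_cache:
        prefix_cache[shared_prefix] = "<KV states>"  # expensive prefill
        prefill_count += 1
    # Suffix tokens are always computed fresh, but they are small and cheap.

MANUAL = "<shared 25k-token manual>"
for role in ("technical", "compliance", "support"):
    run_agent(MANUAL, f"role: {role}")

print(prefill_count)  # the manual is prefilled only once
```

In the real system the same property holds across requests too: as long as the manual prefix stays identical, later sessions also hit the warm cache.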
State schema is defined in `/agent-worker/app/schemas/base.py`:

```python
import operator
from typing import Annotated, List, Literal, Optional, TypedDict

from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages
from pydantic import BaseModel, Field


class SubAgentOutput(TypedDict):
    """Represents the output produced by an individual expert agent."""
    source: str
    result: Optional[str]


class Router(TypedDict):
    """
    Represents a single routing decision produced by the Router, including:
    - The target agent
    - A decomposed query rewritten or scoped for that agent's expertise,
      based on the user's query and conversation history
    """
    source: Literal["technical", "compliance", "support"]
    query: str


class RouterResult(BaseModel):
    """Structured output from the Router LLM."""
    results: List[Router] = Field(default_factory=list, description="Routing decisions")


class BankingAgentState(TypedDict):
    messages: Annotated[List[BaseMessage], add_messages]
    session_id: Optional[str]
    user_id: Optional[str]
    router_result: Annotated[Optional[RouterResult], lambda x, y: x or y]
    results: Annotated[list[SubAgentOutput], operator.add]
```

```
agent-worker/
├── app/
│   ├── api/
│   │   ├── middleware/              # Authorization, logging, request context
│   │   └── v1/
│   │       ├── endpoints/
│   │       │   ├── chat.py          # Chat & streaming endpoints
│   │       │   └── health.py        # Health check endpoint
│   │       └── router.py            # API router
│   │
│   ├── core/
│   │   ├── agents/                  # Agent abstractions and entry points
│   │   │   ├── base.py              # BaseAgent abstraction
│   │   │   ├── technical.py         # Technical Agent graph wrapper
│   │   │   ├── compliance.py        # Compliance Agent graph wrapper
│   │   │   ├── support.py           # Support Agent graph wrapper
│   │   │   └── supervisor.py        # Supervisor Agent (router + synthesis)
│   │   │
│   │   ├── graphs/                  # LangGraph state graphs (node-level)
│   │   │   ├── technical/           # Technical agent nodes
│   │   │   ├── compliance/          # Compliance agent nodes
│   │   │   ├── support/             # Support agent nodes
│   │   │   └── supervisor/          # Router & synthesis nodes
│   │   │
│   │   ├── llm/                     # LLM manager (vLLM / OpenAI-compatible API)
│   │   ├── memory/                  # Persistent state & checkpointing
│   │   ├── prompt/                  # Prompt management & versioning
│   │   │   ├── default/             # Default prompt templates
│   │   │   ├── base.py              # Base Prompt Manager
│   │   │   ├── langfuse_manager.py  # Langfuse-backed prompt manager
│   │   │   └── __init__.py          # prompt_manager factory
│   │   │
│   │   └── tracing/                 # Langfuse tracing & observability
│   │
│   ├── data/                        # Static knowledge sources
│   │   └── operations_manual_full_merged.md
│   │
│   ├── schemas/                     # LangGraph state schema & LLM config models
│   ├── utils/                       # Logging, helpers, error handling
│   ├── config.py                    # Application settings
│   └── main.py                      # FastAPI application entry point
│
├── Dockerfile                       # Production container image
├── docker-compose.yaml              # Local / multi-service orchestration
├── .dockerignore                    # Docker build optimization
├── pyproject.toml                   # Dependencies & project metadata
├── uv.lock                          # Dependency lock file
├── .env.example                     # Environment variable template
└── README.md                        # Project documentation
```
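The `Annotated` reducers in `BankingAgentState` (the state schema above) control how LangGraph merges updates from parallel nodes; their semantics can be checked with plain Python, no graph required:

```python
import operator

# `operator.add` concatenates lists, so parallel agents can each append
# their SubAgentOutput without overwriting one another.
technical = [{"source": "technical", "result": "API limit is 100 req/s"}]
support = [{"source": "support", "result": "Step-by-step guide..."}]
merged = operator.add(technical, support)
assert len(merged) == 2

# `lambda x, y: x or y` keeps the first non-None router_result,
# so a later empty update cannot clobber the routing decision.
keep_first = lambda x, y: x or y
assert keep_first({"results": ["..."]}, None) == {"results": ["..."]}
assert keep_first(None, {"results": ["..."]}) == {"results": ["..."]}
```

This is why `results` accumulates one entry per expert while `router_result` stays a single value even when several nodes write to state concurrently.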
- Python 3.13+
- Docker & Docker Compose (for containerized deployment)
- UV (Python package manager):

```shell
curl -LsSf https://astral.sh/uv/install.sh | sh
```

- LLM serving endpoint (vLLM/LMCache with prefix caching support)
- Langfuse (for observability) - Optional but recommended
Langfuse is an open-source LLM engineering platform that provides full observability for your agent system, including trace visualization, session tracking, and LLM-as-judge evaluations.
In this assignment, Langfuse is used to:
- Trace multi-agent execution paths
- Inspect routing decisions and agent outputs
- Measure latency and Time To First Token (TTFT)
- Monitoring, Debugging and Evaluation
- Go to the Langfuse folder:

```shell
cd langfuse
```

- Start Langfuse services:

```shell
# Create the network if it does not exist
docker network create agent-net

# Start Langfuse (change credential values if needed)
docker compose up -d --build
```

- Access the Langfuse UI:
- URL: http://localhost:3030
- Create account and project
- Copy API keys from Settings
- Sign up at cloud.langfuse.com
- Create a project
- Copy your API keys
Update .env with your Langfuse credentials:
```shell
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_BASE_URL=http://localhost:3030  # or https://cloud.langfuse.com
```

Create a local environment file from the provided template:

```shell
cd agent-worker
cp .env.example .env
```

Edit the `.env` file and configure the following variables as needed.
```shell
# Application Settings
APP_NAME=banking-agent
APP_VERSION=0.1.0
APP_DESCRIPTION=Banking Agent Application
ENVIRONMENT=development
API_CORS_ORIGINS="*"
DEBUG=true
LOG_LEVEL=INFO
LOCAL_TIMEZONE=Asia/Ho_Chi_Minh

# API Configuration
API_HOST=0.0.0.0
API_PORT=<API_PORT>

# LLM Server (OpenAI API Compatible) Configuration
LLM_TYPE=openai-like
LLM_MODEL_NAME=Qwen/Qwen3-30B-A3B-Instruct-2507  # or another LLM
LLM_BASE_URL=http://<LLM_HOST>:<LLM_PORT>/v1
LLM_API_KEY=vllm_sk_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
LLM_EXTRA_BODY="{'top_k':20, 'min_p': 0}"

# Langfuse Configuration
TRACING_TYPE=langfuse
LANGFUSE_SECRET_KEY=sk-lf-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
LANGFUSE_PUBLIC_KEY=pk-lf-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
LANGFUSE_BASE_URL=http://localhost:3030
LANGFUSE_ENABLED=true
LANGFUSE_CACHE_TTL=300

# Prompt configuration
PROMPT_MANAGER_TYPE=langfuse
PROMPT_NAME_ROUTER=agent-router
PROMPT_NAME_TECHNICAL_SPECIALIST=agent_technical_specialist
PROMPT_NAME_COMPLIANCE_AUDITOR=agent_compliance_auditor
PROMPT_NAME_SUPPORT_CONCIERGE=agent_support_concierge
PROMPT_NAME_RESPONSE=synthesize-response
PROMPT_FALLBACK_TO_DEFAULT=true

# Memory to store the agent's state
MEMORY_TYPE=inmemory  # Valid values: 'inmemory' or 'postgres'
```

| Section | Purpose | Notes |
|---|---|---|
| `LLM_BASE_URL` | vLLM serving endpoint | Must support prefix caching (APC) |
| `MEMORY_TYPE` | Agent state persistence | `inmemory` for local testing, `postgres` for production |
If you want to self-host the LLM with vLLM/LMCache, follow vLLM Config.
```shell
cd agent-worker

# Create shared Docker network (used by Langfuse if enabled)
docker network create agent-net

# Build and start services
docker compose up -d --build

# View application logs
docker logs -f banking-agent-worker-service
```

```shell
# Install uv (Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment and install dependencies
uv sync

# Run the application with auto-reload
uv run uvicorn app.main:app --host 0.0.0.0 --port 8238 --reload
```

```shell
# Start all services
docker compose up -d

# View logs
docker compose logs -f

# Stop services
docker compose down

# Rebuild after code changes
docker compose up -d --build
```

```shell
# Activate virtual environment
source .venv/bin/activate  # Linux/Mac

# Run with auto-reload
uv run uvicorn app.main:app --host 0.0.0.0 --port 8238 --reload
```

```shell
# Check health endpoint
curl -X 'GET' \
  'http://localhost:8238/api/v1/health' \
  -H 'accept: application/json'
```

Expected response:

```json
{
  "status": "ok",
  "version": "0.1.0",
  "timestamp": 1768293084
}
```

- Swagger UI: http://localhost:8238/docs
- ReDoc: http://localhost:8238/redoc
All API endpoints are prefixed with:
http://localhost:8238/api/v1
Endpoint: POST /api/v1/chat
Description: Generate a complete chat response
Request:
```shell
curl --location 'http://localhost:8238/api/v1/chat' \
--header 'Content-Type: application/json' \
--header 'X-Request-ID: 43246029-b8cc-4a2d-1743-1100797bbd645' \
--data '{
    "requestParameters": {
        "message": "How can I get my bank account balance?",
        "sessionID": "e6d5a8dc-3fcf-43a3-9d2a-3b0135q63c6e",
        "userID": "huyhoangcloud"
    }
}'
```

Request Body:

```json
{
  "requestParameters": {
    "message": "string",    // User query
    "sessionID": "string",  // Session ID for conversation persistence (uuid4)
    "userID": "string"      // User identifier for tracking (uuid4 or string)
  }
}
```

Response:

```json
{
  "took": 25453,
  "responseDateTime": "2026-01-13T15:50:12.246479+07:00",
  "responseStatus": {
    "responseCode": "200 Successfully"
  },
  "responseData": {
    "message": "You can check your ABC Bank account balance using **three secure and convenient methods**: the **customer portal (website)**, the **mobile app**, or by visiting a **local branch**. Below is a clear, step-by-step guide for each option..."
  }
}
```

Response Fields:

| Field | Type | Description |
|---|---|---|
| `took` | number | Total processing time in milliseconds (E2E latency) |
| `responseDateTime` | string | Response timestamp |
| `responseStatus.responseCode` | string | Execution status message |
| `responseData.message` | string | Final AI-generated response |
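For reference, the same call can be made from Python. This is a sketch only: it assumes the service is running on localhost:8238 and uses the third-party `requests` library as the HTTP client.

```python
import uuid

def build_chat_payload(message: str, session_id: str, user_id: str) -> dict:
    """Build the request body expected by POST /api/v1/chat."""
    return {
        "requestParameters": {
            "message": message,
            "sessionID": session_id,
            "userID": user_id,
        }
    }

def send_chat(payload: dict) -> str:
    """Send the request and return the final synthesized message.

    Requires the service to be running; `requests` is a third-party package.
    """
    import requests  # pip install requests
    resp = requests.post(
        "http://localhost:8238/api/v1/chat",
        json=payload,
        headers={"X-Request-ID": str(uuid.uuid4())},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["responseData"]["message"]

payload = build_chat_payload(
    "How can I get my bank account balance?",
    session_id=str(uuid.uuid4()),
    user_id="huyhoangcloud",
)
```

Reusing the same `sessionID` across calls is what makes the checkpointer resume the conversation state.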
Endpoint: POST /api/v1/chat/stream
Description: Stream response chunks in real-time using Server-Sent Events (SSE).
Request:
```shell
curl --location 'http://localhost:8238/api/v1/chat/stream' \
--header 'Content-Type: application/json' \
--header 'X-Request-ID: 43246029-b8cc-2a2d-1743-1100797bbd645' \
--data '{
    "requestParameters": {
        "message": "How can I get my bank account balance?",
        "sessionID": "e6d5a8dc-3fcf-43a3-9d2a-3b0135q63c1e",
        "userID": "huyhoangcloud"
    }
}'
```

Response Format (SSE):

Each chunk is delivered as a discrete SSE event:

```
data: {"chunk": "To check"}
data: {"chunk": " your account"}
data: {"chunk": " balance at ABC Bank"}
...
data: {"chunk": " contact support."}
data: {"done": true}
```
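A minimal consumer sketch for this SSE format (parsing only; pairing it with a streaming HTTP client such as `httpx` is left out):

```python
import json

def parse_sse_lines(lines):
    """Accumulate `data: {"chunk": ...}` events until `{"done": true}`."""
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # ignore blank lines and keep-alive comments
        event = json.loads(line[len("data: "):])
        if event.get("done"):
            break
        parts.append(event["chunk"])
    return "".join(parts)

sample = [
    'data: {"chunk": "To check"}',
    'data: {"chunk": " your account"}',
    'data: {"chunk": " balance, contact support."}',
    'data: {"done": true}',
]
print(parse_sse_lines(sample))  # -> "To check your account balance, contact support."
```

Concatenating `chunk` values in arrival order reconstructs the full response, and the `{"done": true}` sentinel tells the client when to stop reading.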
| Header | Required | Description | Example |
|---|---|---|---|
| `Content-Type` | Yes | Must be `application/json` | `application/json` |
| `X-Request-ID` | No | Custom request identifier for tracing | uuid-v4 |
- Open Langfuse UI: http://localhost:3030 or https://cloud.langfuse.com
- Navigate to Tracing, Sessions, Users
- Find your test requests, session and user
- Trace Request, Latency, Tokens, TTFT
- Trace Session
- Trace User
Following the official vLLM guide on Prometheus and Grafana monitoring, my assignment also includes a ready-to-use monitoring stack for observing LLM serving performance.
- Go to the monitoring folder:

```shell
cd monitoring
```

- Change the target server IP in `prometheus.yaml`
- Start the monitoring services:

```shell
# Start Prometheus and Grafana
docker compose up -d --build
```

- Access the Prometheus UI:
  - URL: http://localhost:9090/targets
  - Verify that the vLLM metrics endpoint is listed and marked as UP
- Access the Grafana UI:
  - URL: http://localhost:3000
  - Default username and password are both `admin`
- After logging in:
  - Change the password
  - Navigate to Dashboards and select the vLLM Monitoring dashboard
In this assignment, I use LLM-as-a-Judge evaluations with Langfuse to assess the system along two key dimensions:
- Helpfulness – Whether the final response is accurate, clear, and useful to the user.
- Routing Correctness – Whether the Router selects the appropriate expert agent(s) for the query.
- Open the Langfuse UI:
- Self-hosted: http://localhost:3030
- Cloud: https://cloud.langfuse.com
- Navigate to Evaluations and select LLM-as-a-Judge
- Create a new LLM Connection based on your LLM provider
- Select the predefined Helpfulness evaluator provided by Langfuse or integrate with Ragas
- Configure `JsonPath` mappings for the evaluation inputs:
  - `{{query}}`: `$.messages[?(@.type=="human")].content`
  - `{{generation}}`: `$.messages[?(@.type=="ai")].content`
- Select Create Custom Evaluator
- Define:
- Evaluation prompt
- Score reasoning prompt
- Score range
- Configure `JsonPath` mappings:
  - `{{query}}`: `$.messages[?(@.type=="human")].content`
  - `{{router_result}}`: `$.router_result.results`
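These JsonPath expressions filter the trace payload by message type; their intent can be mirrored in plain Python against a simplified trace shape (an illustration only, not Langfuse's actual internals):

```python
# Plain-Python equivalent of the JsonPath mappings above, run against a
# simplified trace payload (the real shape is defined by the Langfuse trace).
trace = {
    "messages": [
        {"type": "human", "content": "How do I reset my PIN?"},
        {"type": "ai", "content": "Follow these steps..."},
    ],
    "router_result": {"results": [{"source": "support", "query": "reset PIN steps"}]},
}

# {{query}}: $.messages[?(@.type=="human")].content
query = [m["content"] for m in trace["messages"] if m["type"] == "human"]

# {{generation}}: $.messages[?(@.type=="ai")].content
generation = [m["content"] for m in trace["messages"] if m["type"] == "ai"]

# {{router_result}}: $.router_result.results
router_result = trace["router_result"]["results"]

print(query, generation, router_result)
```

If an expression matches nothing in the Langfuse UI, comparing it against the raw trace JSON this way is a quick sanity check.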
Each tracing request now includes both a Helpfulness and a Routing Correctness score.
This project is licensed under the MIT License.
See the LICENSE file for full license details.
For questions or issues, please:
- Open an issue on the GitHub repository, or
- Contact me directly at: huyhoang18bkhn@gmail.com
This project builds on the following tools and frameworks:
- LangChain & LangGraph — Multi-agent orchestration and stateful workflows
- Langfuse — LLM observability, tracing, and evaluation
- vLLM — High-performance LLM serving with prefix caching (APC)
- FastAPI — Modern, high-performance web framework for APIs