FastAPI + LangGraph Banking Multi-Agent System


A production-ready, multi-agent orchestration system built with LangGraph. This banking multi-agent system intelligently routes queries to specialized expert agents while maximizing LLM serving efficiency through prefix caching.

Key Features

  • Optimized for Prefix Caching - Architecture designed to maximize KV cache hit rates
  • Multi-Agent System - Router + 3 specialized expert agents (Technical, Compliance, Support)
  • Full Observability - Integrated Langfuse tracing for production monitoring and automated evaluation
  • Session Management - Persistent conversation state with checkpointing, enabling stateful, multi-turn conversations
  • Production Ready - Error handling, logging, health checks, containerization



Architecture Overview

Multi-Agent System Design

The system implements a supervisor pattern with specialized expert agents:

Agent Architecture

Flow:

  1. Router Node - Analyzes incoming query with recent chat history and routes to appropriate expert(s)
  2. Specialized Agents - Process domain-specific queries in parallel:
    • Technical Specialist - Responsible for extracting system specifications, API limits, and troubleshooting steps from the manual.
    • Compliance Auditor - Interprets regulatory rules, "Can/Cannot" constraints, and policy boundaries.
    • Support Concierge - Summarizes complex procedures into step-by-step guides for non-technical staff.
  3. Supervisor Response - Synthesizes expert outputs into final answer
  4. State Management - Persistent conversation state with checkpointing

The Efficiency Challenge

This system processes a 50-page Internal Operations & Compliance Manual (~25,000 tokens) for every query. Without optimization, this would result in:

  • High computational cost per request, since attention and KV states for the manual are recomputed repeatedly (e.g., with Claude Sonnet, cached input costs $0.30/M tokens while uncached input costs $3.00/M tokens, a 10× difference)
  • High Time To First Token (TTFT)
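To make the cost gap concrete, a back-of-envelope calculation using the example prices above (illustrative figures only, not a price sheet):

```python
# Input-token cost per request for the ~25,000-token manual,
# using the example Claude Sonnet prices quoted above.
MANUAL_TOKENS = 25_000
UNCACHED_USD_PER_M = 3.00  # uncached input tokens, USD per million
CACHED_USD_PER_M = 0.30    # cache-hit input tokens, USD per million

def input_cost(tokens: int, usd_per_million: float) -> float:
    """Input-token cost in USD for one request."""
    return tokens / 1_000_000 * usd_per_million

uncached = input_cost(MANUAL_TOKENS, UNCACHED_USD_PER_M)  # ~$0.075 per request
cached = input_cost(MANUAL_TOKENS, CACHED_USD_PER_M)      # ~$0.0075 per request
print(f"uncached ${uncached:.4f} vs cached ${cached:.4f} ({uncached / cached:.0f}x)")
```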

My Solution: Prefix-Aligned Prompt Architecture

Following the design principles described in the Manus Context Engineering Blog, this system is explicitly engineered around KV-cache behavior in autoregressive language models.

Because LLMs are autoregressive, even a single-token difference in the prompt prefix will invalidate the cached key–value (KV) states and force the model to recompute the full attention matrix. To avoid this, the prompt prefix must remain strictly stable across invocations.

The system therefore adopts the following prompt composition model:

Prompt = [Shared Fixed Prefix] + [Agent-Specific Dynamic Suffix]

Shared fixed prefix

When using vLLM, which implements PagedAttention, KV-cache memory is managed in fixed-size blocks (default: 16 tokens per block), and cached blocks are reused when the prefix matches. If agent role instructions were embedded before the manual (or interleaved differently per agent):

  • vLLM would allocate separate KV-cache blocks per agent
  • KV-cache memory usage would increase significantly (≈3× in a 3-agent system), reducing throughput
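The ≈3× figure follows directly from block arithmetic; a quick sketch with the default 16-token block size:

```python
# Rough KV-cache block arithmetic for a 3-agent system (illustration only).
MANUAL_TOKENS = 25_000   # shared operations manual
BLOCK_SIZE = 16          # vLLM PagedAttention default tokens per block
AGENTS = 3

blocks_per_copy = -(-MANUAL_TOKENS // BLOCK_SIZE)    # ceiling division
shared = blocks_per_copy             # identical prefix: one cached copy for all agents
unshared = blocks_per_copy * AGENTS  # divergent prefixes: one copy per agent

print(blocks_per_copy)    # 1563 blocks for the manual alone
print(unshared / shared)  # 3.0x memory for the same content
```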

To prevent this, the system enforces a fully static, identical prefix reused across all agent calls:

  • The large shared document (the ~25,000-token Operations Manual) is placed at the beginning of every prompt as a fixed prefix
  • Agent-specific instructions are appended after the manual

Agent-Specific Dynamic Suffix

  • Contains role instructions and formatting rules
  • Small and recomputed per agent
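The composition model above can be sketched in a few lines. The manual text and role strings here are hypothetical placeholders; the real system loads the full manual and prompt templates via the prompt manager.

```python
# Sketch of prefix-aligned composition: every agent call shares one
# byte-identical prefix; only the short role suffix differs.
MANUAL = "<full ~25,000-token Internal Operations & Compliance Manual>"

SHARED_PREFIX = (
    "You are an assistant for ABC Bank.\n"
    "=== OPERATIONS MANUAL ===\n"
    + MANUAL +
    "\n=== END OF MANUAL ===\n"
)

ROLE_SUFFIXES = {  # small, agent-specific, recomputed per call
    "technical": "Role: Technical Specialist. Extract specs, API limits, fixes.",
    "compliance": "Role: Compliance Auditor. Interpret Can/Cannot constraints.",
    "support": "Role: Support Concierge. Write step-by-step guides.",
}

def build_prompt(agent: str, user_query: str) -> str:
    # Prefix first, role text last: role instructions never precede the manual.
    return SHARED_PREFIX + ROLE_SUFFIXES[agent] + "\nQuery: " + user_query

prompts = [build_prompt(a, "How do I reset a customer PIN?") for a in ROLE_SUFFIXES]
assert all(p.startswith(SHARED_PREFIX) for p in prompts)
```

Because the prefix is byte-identical, a prefix-caching server (vLLM APC) can serve every agent after the first from cached KV blocks.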

Prompt Design


How It Works

1. LangGraph Workflow

User Query
   ↓
Router
   ↓
(Technical | Compliance | Support)   ← parallel when needed
   ↓
Synthesis
   ↓
Final Response
  • The Router node determines which expert agent(s) should handle the query.
  • One or more specialized agents may be executed in parallel.
  • The Synthesis node merges expert outputs into a single response.

Because all agents share the same fixed prompt prefix, only the first agent invocation incurs the full prefix cost. Subsequent agents reuse cached KV states, reducing Time To First Token (TTFT).
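The flow can be illustrated in plain Python (the real system builds this as a LangGraph state graph; the keyword-based routing below is a toy stand-in for the LLM routing decision):

```python
# Toy router -> parallel experts -> synthesis pipeline (illustration only).
from concurrent.futures import ThreadPoolExecutor

def router(query: str) -> list[str]:
    # Keyword match stands in for the Router LLM's decision.
    routes = []
    if "api" in query.lower():
        routes.append("technical")
    if "policy" in query.lower() or "allowed" in query.lower():
        routes.append("compliance")
    return routes or ["support"]

def run_agent(name: str, query: str) -> str:
    return f"[{name}] answer for: {query}"

def synthesize(outputs: list[str]) -> str:
    return " | ".join(outputs)

def handle(query: str) -> str:
    agents = router(query)
    # Experts run in parallel when more than one is selected.
    with ThreadPoolExecutor() as pool:
        outputs = list(pool.map(lambda a: run_agent(a, query), agents))
    return synthesize(outputs)

print(handle("Is this transfer allowed under the API rate policy?"))
```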

2. LangGraph State Management

State schema is defined in /agent-worker/app/schemas/base.py

import operator
from typing import Annotated, List, Literal, Optional, TypedDict

from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages
from pydantic import BaseModel, Field


class SubAgentOutput(TypedDict):
    """Represents the output produced by an individual expert agent."""
    source: str
    result: Optional[str]

class Router(TypedDict):
    """
    Represents a single routing decision produced by the Router, including:
    - The target agent
    - A decomposed query, rewritten or scoped for that agent's expertise, based on the user's query and conversation history
    """
    source: Literal["technical", "compliance", "support"]
    query: str

class RouterResult(BaseModel):
    """Structured output from the Router LLM."""
    results: List[Router] = Field(default_factory=list, description="Routing decisions")

class BankingAgentState(TypedDict):
    messages: Annotated[List[BaseMessage], add_messages]
    session_id: Optional[str]
    user_id: Optional[str]
    router_result: Annotated[Optional[RouterResult], lambda x, y: x or y]
    results: Annotated[list[SubAgentOutput], operator.add]
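A pure-Python illustration (independent of LangGraph) of how the two Annotated reducers above merge concurrent node updates; LangGraph invokes each key's reducer with the (existing, incoming) values:

```python
import operator

def keep_first(x, y):
    # Reducer for router_result: keep the first non-empty value.
    return x or y

append = operator.add  # Reducer for results: concatenate lists from parallel agents

state_results = []
state_results = append(state_results, [{"source": "technical", "result": "spec..."}])
state_results = append(state_results, [{"source": "compliance", "result": "rule..."}])

router_result = None
router_result = keep_first(router_result, {"results": [{"source": "technical"}]})
router_result = keep_first(router_result, None)  # a later empty update does not clobber it

assert [o["source"] for o in state_results] == ["technical", "compliance"]
assert router_result == {"results": [{"source": "technical"}]}
```

This is why parallel expert agents can safely write to `results` at the same time: their list updates are concatenated rather than overwritten.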

Agent-Worker Structure

agent-worker/
├── app/
│   ├── api/
│   │   ├── middleware/                 # Authorization, logging, request context
│   │   └── v1/
│   │       ├── endpoints/
│   │       │   ├── chat.py              # Chat & streaming endpoints
│   │       │   └── health.py            # Health check endpoint
│   │       └── router.py                # API router
│   │
│   ├── core/
│   │   ├── agents/                     # Agent abstractions and entry points
│   │   │   ├── base.py                 # BaseAgent abstraction
│   │   │   ├── technical.py            # Technical Agent graph wrapper
│   │   │   ├── compliance.py           # Compliance Agent graph wrapper
│   │   │   ├── support.py              # Support Agent graph wrapper
│   │   │   └── supervisor.py           # Supervisor Agent (router + synthesis)
│   │   │
│   │   ├── graphs/                     # LangGraph state graphs (node-level)
│   │   │   ├── technical/              # Technical agent nodes
│   │   │   ├── compliance/             # Compliance agent nodes
│   │   │   ├── support/                # Support agent nodes
│   │   │   └── supervisor/             # Router & synthesis nodes
│   │   │
│   │   ├── llm/                        # LLM manager (vLLM / OpenAI-compatible API)
│   │   ├── memory/                     # Persistent state & checkpointing
│   │   ├── prompt/                     # Prompt management & versioning
│   │   │   ├── default/                # Default prompt templates
│   │   │   ├── base.py                 # Base Prompt Manager
│   │   │   ├── langfuse_manager.py     # Langfuse-backed prompt manager
│   │   │   └── __init__.py              # prompt_manager factory
│   │   │
│   │   └── tracing/                    # Langfuse tracing & observability
│   │
│   ├── data/                           # Static knowledge sources
│   │   └── operations_manual_full_merged.md
│   │
│   ├── schemas/                        # LangGraph state schema & LLM config models
│   ├── utils/                          # Logging, helpers, error handling
│   ├── config.py                       # Application settings
│   └── main.py                         # FastAPI application entry point
│
├── Dockerfile                          # Production container image
├── docker-compose.yaml                 # Local / multi-service orchestration
├── .dockerignore                       # Docker build optimization
├── pyproject.toml                      # Dependencies & project metadata
├── uv.lock                             # Dependency lock file
├── .env.example                        # Environment variable template
└── README.md                           # Project documentation

Prerequisites

  • Python 3.13+
  • Docker & Docker Compose (for containerized deployment)
  • UV (Python package manager) - curl -LsSf https://astral.sh/uv/install.sh | sh
  • LLM Serving Endpoint (vLLM/LMCache with prefix caching support)
  • Langfuse (for observability) - Optional but recommended

Langfuse Setup

Langfuse is an open-source LLM engineering platform that provides full observability for your agent system, including trace visualization, session tracking, and LLM-as-judge evaluations.

In this assignment, Langfuse is used to:

  • Trace multi-agent execution paths
  • Inspect routing decisions and agent outputs
  • Measure latency and Time To First Token (TTFT)
  • Support monitoring, debugging, and evaluation workflows

Option 1: Self-Hosted Langfuse

  1. Go to the Langfuse folder:
cd langfuse
  2. Start Langfuse services:
# Create the network if it does not exist yet
docker network create agent-net

# Start Langfuse (change credential values if needed)
docker compose up -d --build
  3. Access the Langfuse UI:

Option 2: Langfuse Cloud

  1. Sign up at cloud.langfuse.com
  2. Create a project
  3. Copy your API keys

Configure Agent Worker

Update .env with your Langfuse credentials:

LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_BASE_URL=http://localhost:3030  # or https://cloud.langfuse.com



Configuration

1. Copy Environment Template

Create a local environment file from the provided template:

cd agent-worker
cp .env.example .env

2. Configure Environment Variables

Edit the .env file and configure the following variables as needed.

# Application Settings
APP_NAME=banking-agent
APP_VERSION=0.1.0
APP_DESCRIPTION=Banking Agent Application
ENVIRONMENT=development

API_CORS_ORIGINS="*"
DEBUG=true
LOG_LEVEL=INFO

LOCAL_TIMEZONE=Asia/Ho_Chi_Minh

# API Configuration
API_HOST=0.0.0.0
API_PORT=<API_PORT>

# LLM Server (OpenAI-compatible API) Configuration
LLM_TYPE=openai-like
LLM_MODEL_NAME=Qwen/Qwen3-30B-A3B-Instruct-2507 # or another LLM
LLM_BASE_URL=http://<LLM_HOST>:<LLM_PORT>/v1
LLM_API_KEY=vllm_sk_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
LLM_EXTRA_BODY='{"top_k": 20, "min_p": 0}'


# Langfuse Configuration
TRACING_TYPE=langfuse
LANGFUSE_SECRET_KEY=sk-lf-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
LANGFUSE_PUBLIC_KEY=pk-lf-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
LANGFUSE_BASE_URL=http://localhost:3030
LANGFUSE_ENABLED=true
LANGFUSE_CACHE_TTL=300

# Prompt configuration
PROMPT_MANAGER_TYPE=langfuse
PROMPT_NAME_ROUTER=agent-router
PROMPT_NAME_TECHNICAL_SPECIALIST=agent_technical_specialist
PROMPT_NAME_COMPLIANCE_AUDITOR=agent_compliance_auditor
PROMPT_NAME_SUPPORT_CONCIERGE=agent_support_concierge
PROMPT_NAME_RESPONSE=synthesize-response
PROMPT_FALLBACK_TO_DEFAULT=true


# Memory to store Agent's State
MEMORY_TYPE=inmemory # Valid values: 'inmemory' or 'postgres'

Note

Setting | Purpose | Notes
LLM_BASE_URL | vLLM serving endpoint | Must support prefix caching (APC)
MEMORY_TYPE | Agent state persistence | inmemory for local testing, postgres for production

If you want to self-host an LLM with vLLM/LMCache, follow vLLM Config.


Installation

Method 1: Docker Compose

cd agent-worker

# Create shared Docker network (used by Langfuse if enabled)
docker network create agent-net

# Build and start services
docker compose up -d --build

# View application logs
docker logs -f banking-agent-worker-service

Method 2: Local Development with UV

# Install uv (Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment and install dependencies
uv sync

# Run the application with auto-reload
uv run uvicorn app.main:app --host 0.0.0.0 --port 8238 --reload

Running the Application

Using Docker Compose

# Start all services
docker compose up -d

# View logs
docker compose logs -f

# Stop services
docker compose down

# Rebuild after code changes
docker compose up -d --build

Local Development

# Activate virtual environment
source .venv/bin/activate  # Linux/Mac

# Run with auto-reload
uv run uvicorn app.main:app --host 0.0.0.0 --port 8238 --reload

Verify Application

# Check health endpoint
curl -X 'GET' \
  'http://localhost:8238/api/v1/health' \
  -H 'accept: application/json'

Expected response

{
  "status": "ok",
  "version": "0.1.0",
  "timestamp": 1768293084
}

Access API Documentation


API Reference

Base URL

All API endpoints are prefixed with:

http://localhost:8238/api/v1

Endpoints

1. Chat Completion (Non-Streaming)

Endpoint: POST /api/v1/chat

Description: Generate a complete chat response

Request:

curl --location 'http://localhost:8238/api/v1/chat' \
  --header 'Content-Type: application/json' \
  --header 'X-Request-ID: 43246029-b8cc-4a2d-1743-1100797bbd645' \
  --data '{
    "requestParameters": {
      "message": "How can I get my bank account balance?",
      "sessionID": "e6d5a8dc-3fcf-43a3-9d2a-3b0135q63c6e",
      "userID": "huyhoangcloud"
    }
  }'

Request Body:

{
  "requestParameters": {
    "message": "string",      // User query
    "sessionID": "string",    // Session ID for conversation persistence <uuid4>
    "userID": "string"        // User identifier for tracking (uuid4 or string)
  }
}

Response:

{
  "took": 25453,
  "responseDateTime": "2026-01-13T15:50:12.246479+07:00",
  "responseStatus": {
    "responseCode": "200 Successfully"
  },
  "responseData": {
    "message": "You can check your ABC Bank account balance using **three secure and convenient methods**: the **customer portal (website)**, the **mobile app**, or by visiting a **local branch**. Below is a clear, step-by-step guide for each option..."
  }
}

Response Fields:

Field | Type | Description
took | number | Total processing time in milliseconds (E2E latency)
responseDateTime | string | Response timestamp
responseStatus.responseCode | string | Execution status message
responseData.message | string | Final AI-generated response

2. Chat Completion (Streaming)

Endpoint: POST /api/v1/chat/stream

Description: Stream response chunks in real-time using Server-Sent Events (SSE).

Request:

curl --location 'http://localhost:8238/api/v1/chat/stream' \
  --header 'Content-Type: application/json' \
  --header 'X-Request-ID: 43246029-b8cc-2a2d-1743-1100797bbd645' \
  --data '{
    "requestParameters": {
      "message": "How can I get my bank account balance?",
      "sessionID": "e6d5a8dc-3fcf-43a3-9d2a-3b0135q63c1e",
      "userID": "huyhoangcloud"
    }
  }'

Response Format (SSE):

Each chunk is delivered as a discrete SSE event:

data: {"chunk": "To check"}
data: {"chunk": " your account"}
data: {"chunk": " balance at ABC Bank"}
...
data: {"chunk": " contact support."}
data: {"done": true}
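A sketch of a client-side parser for this stream, assuming each SSE `data:` line carries a JSON object with either a `chunk` string or a terminal `done` flag, as shown above:

```python
import json
from typing import Iterable

def collect_sse_chunks(lines: Iterable[str]) -> str:
    """Accumulate streamed SSE chunks into the full response text."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines and SSE comments / keep-alives
        event = json.loads(line[len("data:"):].strip())
        if event.get("done"):
            break
        parts.append(event.get("chunk", ""))
    return "".join(parts)

stream = [
    'data: {"chunk": "To check"}',
    'data: {"chunk": " your account"}',
    'data: {"done": true}',
]
assert collect_sse_chunks(stream) == "To check your account"
```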

HTTP Headers

Request Headers

Header | Required | Description | Example
Content-Type | Yes | Must be application/json | application/json
X-Request-ID | No | Custom request identifier for tracing | uuid-v4

Monitoring & Observability

Verify Langfuse Tracing

  1. Open the Langfuse UI: http://localhost:3030 or https://cloud.langfuse.com
  2. Navigate to Tracing, Sessions, and Users
  3. Find your test requests, sessions, and users, then inspect:
    • Trace Request: latency, tokens, TTFT
    • Trace Session
    • Trace User


Monitoring with Prometheus & Grafana

Following the official vLLM guide on Prometheus and Grafana monitoring, my assignment also includes a ready-to-use monitoring stack for observing LLM serving performance.

  1. Go to the monitoring folder:
cd monitoring
  2. Change the target server IP in prometheus.yaml

  3. Start the monitoring services:

# Start Prometheus and Grafana
docker compose up -d --build
  4. Access the Prometheus UI:
    • URL: http://localhost:9090/targets
    • Verify that the vLLM metrics endpoint is listed and marked as UP

Prometheus UI

  5. Access the Grafana UI:
    • URL: http://localhost:3000
    • Default username and password are both admin
    • After logging in, change the password, then go to Dashboards and select the vLLM Monitoring dashboard

Grafana Dashboard


Evaluation

In this assignment, I use LLM-as-a-Judge evaluations with Langfuse to assess the system along two key dimensions:

  • Helpfulness – Whether the final response is accurate, clear, and useful to the user.
  • Routing Correctness – Whether the Router selects the appropriate expert agent(s) for the query.

Evaluation Setup

LLM Connection


Helpfulness Evaluation

  • Select the predefined Helpfulness evaluator provided by Langfuse or integrate with Ragas
  • Configure JsonPath mappings for the evaluation inputs:
    • {{query}}: $.messages[?(@.type=="human")].content
    • {{generation}}: $.messages[?(@.type=="ai")].content
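In plain Python, those JsonPath expressions select the following (the trace object below is a simplified, hypothetical shape of the traced state):

```python
# What the {{query}} and {{generation}} mappings extract from a trace.
trace = {
    "messages": [
        {"type": "human", "content": "How do I check my balance?"},
        {"type": "ai", "content": "You can check it via the customer portal..."},
    ]
}

# $.messages[?(@.type=="human")].content
query = [m["content"] for m in trace["messages"] if m["type"] == "human"]
# $.messages[?(@.type=="ai")].content
generation = [m["content"] for m in trace["messages"] if m["type"] == "ai"]

assert query == ["How do I check my balance?"]
assert generation == ["You can check it via the customer portal..."]
```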



Routing Correctness Evaluation

  • Select Create Custom Evaluator
  • Define:
    • Evaluation prompt
    • Score reasoning prompt
    • Score range

Routing Evaluator Definition

  • Configure JsonPath mappings:
    • {{query}}: $.messages[?(@.type=="human")].content
    • {{router_result}}: $.router_result.results

Routing Evaluation


Each tracing request now includes both Helpfulness and Routing Correctness scores, as shown below:

Evaluation Tracing


License

This project is licensed under the MIT License.
See the LICENSE file for full license details.


Contact

For questions or issues, please open an issue on the repository.


Acknowledgments

This project builds on the following tools and frameworks:

  • LangChain & LangGraph — Multi-agent orchestration and stateful workflows
  • Langfuse — LLM observability, tracing, and evaluation
  • vLLM — High-performance LLM serving with prefix caching (APC)
  • FastAPI — Modern, high-performance web framework for APIs
