An enterprise-grade Agentic Retrieval-Augmented Generation (RAG) system designed for M&A due diligence Q&A over private documents. Built with Next.js 14, Vercel AI SDK, and deployed on Vercel with full production support.
- System Overview
- Architecture
- Retrieval Pipeline
- Agentic Reasoning
- Tools Implemented
- Evaluation & Quality
- Deployment
- Bonus Features
- How to Run Locally
- Known Limitations & Future Work
- Design Justification
This system provides an intelligent conversational interface for querying M&A due diligence documents. Users can ask natural language questions about company financials, contracts, risks, products, and legal matters, receiving accurate answers with source citations.
In M&A transactions, due diligence involves reviewing hundreds of documents—financial statements, contracts, legal filings, and operational data. This system automates Q&A over these documents by:
- Semantic Search: Understanding intent, not just keywords
- Structured Data Queries: Querying CSV data with SQL-like precision
- Multi-Step Reasoning: Breaking complex questions into sub-queries
- Source-Cited Answers: Every claim backed by verifiable citations
- Confidence Scoring: Transparency about answer reliability
The system follows a modular architecture with five core layers:
┌─────────────────────────────────────────────────────────────────┐
│ USER INTERFACE │
│ Next.js Chat UI • Streaming Responses • Tool Trace • Citations │
└────────────────────────────────┬────────────────────────────────┘
│ HTTP/SSE
┌────────────────────────────────▼────────────────────────────────┐
│ AGENT LAYER │
│ GPT-4o Orchestrator • Tool Selection • Multi-Step Planning │
└────────────────────────────────┬────────────────────────────────┘
│
┌───────────────────────┼───────────────────────┐
│ │ │
┌────────▼────────┐ ┌─────────▼─────────┐ ┌────────▼────────┐
│ VECTOR STORE │ │ CSV ENGINE │ │ QUERY OPTIMIZER│
│ ChromaDB │ │ Structured Data │ │ Expansion │
│ OpenAI Embed │ │ SQL-like Filter │ │ Rewriting │
└─────────────────┘ └───────────────────┘ └─────────────────┘
│ │ │
└───────────────────────┼───────────────────────┘
│
┌────────────────────────────────▼────────────────────────────────┐
│ DOCUMENT STORE │
│ 15 TXT Files • 13 CSV Files • 28 Total Documents │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────┐
│ FRONTEND │
│ ┌─────────────────┐ ┌──────────────────┐ ┌────────────────────────┐ │
│ │ ChatInterface │ │ ChatMessage │ │ ToolTrace │ │
│ │ - useChat hook │ │ - Answer bubble │ │ - Real-time visibility │ │
│ │ - Message list │ │ - Confidence chip│ │ - Tool arguments │ │
│ │ - Input handling│ │ - Citation badges│ │ - Result preview │ │
│ └─────────────────┘ └──────────────────┘ └────────────────────────┘ │
└────────────────────────────────────┬────────────────────────────────────┘
│ SSE Stream
┌────────────────────────────────────▼────────────────────────────────────┐
│ API LAYER (/api/chat) │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Vercel AI SDK streamText() │ │
│ │ - Model: GPT-4o │ │
│ │ - Tools: 5 registered tools │ │
│ │ - Max Steps: 5 (multi-turn tool use) │ │
│ │ - System Prompt: Role, tools, formatting, confidence rules │ │
│ └───────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────┬────────────────────────────────────┘
│
┌────────────────┬───────────────┼───────────────┬────────────────┐
│ │ │ │ │
┌───▼───┐ ┌───▼───┐ ┌───▼───┐ ┌───▼───┐ ┌───▼───┐
│vector_│ │hybrid_│ │csv_ │ │csv_ │ │date_ │
│search │ │search │ │query │ │aggr. │ │window │
└───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘
│ │ │ │ │
└────────────────┴───────┬───────┴───────────────┴────────────────┘
│
┌────────────────────────────▼────────────────────────────────────────────┐
│ DATA LAYER │
│ ┌─────────────────────┐ ┌─────────────────────────────────────┐ │
│ │ SimpleVectorStore │ │ CSV Storage (JSON) │ │
│ │ - ChromaDB wrapper │ │ - 13 normalized tables │ │
│ │ - 1536-dim OpenAI │ │ - Row-level citation IDs │ │
│ │ - 222 chunks │ │ - Filtering & aggregation │ │
│ └─────────────────────┘ └─────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
- User Query → Chat Interface
- Query Analysis → Agent determines tool(s) needed
- Tool Execution → One or more tools invoked in sequence
- Result Synthesis → Agent combines results into coherent answer
- Confidence Scoring → Agent assesses answer reliability
- Response Streaming → Answer + citations + confidence streamed to UI
Documents are processed through a multi-stage pipeline:
Source Files (28)
│
▼
┌──────────────┐
│ Parser │ → Extract text, detect sections, parse CSV headers
└──────┬───────┘
│
▼
┌──────────────┐
│ Chunker │ → Section-aware chunking, 500-1000 tokens, overlap
└──────┬───────┘
│
▼
┌──────────────┐
│ Embeddings │ → OpenAI text-embedding-3-small (1536 dimensions)
└──────┬───────┘
│
▼
┌──────────────┐
│ Vector Store │ → ChromaDB with metadata (source, section, chunk_id)
└──────────────┘
| Aspect | Strategy |
|---|---|
| Method | Section-aware semantic chunking |
| Chunk Size | 500-1000 tokens (optimized for context) |
| Overlap | 50 tokens between chunks |
| Boundaries | Respects section headers (e.g., ===, ---) |
| Metadata | Source file, section name, chunk index, total chunks |
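The chunking strategy above can be sketched as a small pure function. This is a simplified illustration, not the deployed chunker: it splits on the `===`/`---` section delimiters used in the source documents, windows each section with overlap, and counts words rather than tokens (the real pipeline targets 500-1000 tokens per chunk).

```typescript
// Sketch of section-aware chunking: split on section delimiters first,
// then slide a window with overlap inside each section. Sizes are in
// words here for simplicity; the real pipeline counts tokens.
interface Chunk {
  text: string;
  section: string;
  chunkIndex: number;
}

function chunkDocument(
  text: string,
  maxWords = 200,
  overlapWords = 20
): Chunk[] {
  // Treat lines of === or --- as section boundaries (as in the source docs).
  const sections = text.split(/\n(?:={3,}|-{3,})\n/);
  const chunks: Chunk[] = [];
  for (const section of sections) {
    const lines = section.trim().split("\n");
    const sectionName = lines[0] ?? "UNKNOWN"; // first line acts as section header
    const words = section.trim().split(/\s+/);
    for (let start = 0; start < words.length; start += maxWords - overlapWords) {
      const windowText = words.slice(start, start + maxWords).join(" ");
      chunks.push({ text: windowText, section: sectionName, chunkIndex: chunks.length });
      if (start + maxWords >= words.length) break; // last window consumed the tail
    }
  }
  return chunks;
}
```

Each chunk carries its section name and index, which is what the citation metadata (source, section, chunk index) is built from downstream.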
| Property | Value |
|---|---|
| Model | OpenAI text-embedding-3-small |
| Dimensions | 1536 |
| Environment | Production (Vercel) and Local |
| Consistency | Same model for ingestion and query |
| Property | Value |
|---|---|
| Database | ChromaDB (SimpleVectorStore wrapper) |
| Collection | ma_documents |
| Storage | Persistent (data/vectors/) |
| Index Type | HNSW (Hierarchical Navigable Small World) |
The hybrid_search tool combines multiple retrieval strategies:
Query: "Series B funding amount"
│
├──────────────────────────────────────┐
│ │
▼ ▼
┌─────────────────────┐ ┌─────────────────────┐
│ Semantic Search │ │ Keyword Matching │
│ (Vector Similarity)│ │ (BM25-style) │
│ Weight: 70% │ │ Weight: 30% │
└──────────┬──────────┘ └──────────┬──────────┘
│ │
└──────────────┬───────────────────────┘
│
▼
┌─────────────────┐
│ Reranker │
│ - Keyword boost│
│ - Section score│
│ - Position │
└────────┬────────┘
│
▼
Top K Results with Citations
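The 70/30 score fusion in the diagram above can be sketched as follows. This assumes both component scores are already normalized to [0, 1]; the keyword scorer shown here is a simple term-overlap stand-in for the system's BM25-style matcher.

```typescript
// Fraction of query terms that appear verbatim in the chunk (a crude
// stand-in for the BM25-style keyword component).
function keywordScore(query: string, chunk: string): number {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  if (terms.length === 0) return 0;
  const text = chunk.toLowerCase();
  const hits = terms.filter((t) => text.includes(t)).length;
  return hits / terms.length;
}

// Weighted fusion matching the diagram: 70% semantic, 30% keyword.
function hybridScore(semantic: number, keyword: number): number {
  return 0.7 * semantic + 0.3 * keyword;
}
```

A chunk containing the exact phrase "Series B funding" gets a full keyword score, so even a modest semantic score still ranks it above conceptually similar but factually wrong chunks (e.g., "Series A").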
The agent uses a decision tree based on query classification:
| Query Type | Primary Tool | Fallback |
|---|---|---|
| Factual lookup | hybrid_search | vector_search |
| Financial data | csv_query or csv_aggregate | hybrid_search |
| Date-based filter | date_window → csv_query | hybrid_search |
| Aggregation | csv_aggregate | Manual computation |
| Complex synthesis | Multi-tool chain | - |
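The decision tree above can be approximated as a routing function. This is only an illustrative sketch: in the actual system, GPT-4o selects tools from their descriptions at runtime rather than following hard-coded rules, and the keyword patterns below are my own examples.

```typescript
type ToolName =
  | "vector_search"
  | "hybrid_search"
  | "csv_query"
  | "csv_aggregate"
  | "date_window";

// Rule-of-thumb router mirroring the table: aggregation wording routes to
// csv_aggregate, date phrases route through date_window then csv_query,
// financial terms route to csv_query, everything else defaults to
// hybrid_search.
function routeQuery(query: string): ToolName[] {
  const q = query.toLowerCase();
  if (/\b(total|sum|average|count|how many)\b/.test(q)) return ["csv_aggregate"];
  if (/\b(next|last|within)\s+\d+\s+(days|months|years)\b/.test(q)) {
    return ["date_window", "csv_query"];
  }
  if (/\b(revenue|contract value|arr|headcount)\b/.test(q)) return ["csv_query"];
  return ["hybrid_search"];
}
```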
The system transforms queries for better retrieval:
- Abbreviation Expansion: "Q4" → "fourth quarter", "M&A" → "mergers and acquisitions"
- Synonym Addition: "revenue" → "revenue, sales, income"
- Entity Extraction: Identifies companies, dates, financial terms
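The first two transformations can be sketched as dictionary lookups. The tables below are small illustrative samples, not the system's full expansion lists.

```typescript
// Abbreviations are replaced in place; synonyms are appended so the
// expanded query matches more phrasings in the corpus.
const ABBREVIATIONS: Record<string, string> = {
  "q4": "fourth quarter",
  "m&a": "mergers and acquisitions",
};

const SYNONYMS: Record<string, string[]> = {
  revenue: ["sales", "income"],
};

function expandQuery(query: string): string {
  let expanded = query
    .split(/\s+/)
    .map((w) => ABBREVIATIONS[w.toLowerCase()] ?? w)
    .join(" ");
  for (const [term, syns] of Object.entries(SYNONYMS)) {
    if (expanded.toLowerCase().includes(term)) {
      expanded += " " + syns.join(" ");
    }
  }
  return expanded;
}
```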
For complex queries, the agent chains tools:
Query: "Compare revenue growth and list top customers by contract value"
Step 1: csv_query → Get revenue data by year
Step 2: csv_aggregate → Calculate growth percentages
Step 3: csv_query → Get customer contracts
Step 4: Synthesize → Combine into coherent answer
Every answer includes a calibrated confidence score:
| Score Range | Meaning | Criteria |
|---|---|---|
| 0.85-1.00 | High | Multiple agreeing sources, exact matches |
| 0.60-0.84 | Medium | 2-3 sources, moderate agreement |
| 0.30-0.59 | Low | Limited evidence, some inference |
| <0.30 | Very Low | Insufficient evidence (triggers failure mode) |
UI Rendering: Confidence appears as a clickable chip below the answer, above citations. Colors indicate confidence level (green/amber/orange).
Every factual claim is grounded with citations:
interface Citation {
source: string // e.g., "01_company_overview.txt"
section: string // e.g., "KEY MILESTONES"
chunkId: string // Unique identifier
relevance: number // 0.0-1.0 similarity score
text: string // Excerpt (preview)
}

Purpose: Pure semantic search over document chunks.
{
query: string, // Natural language query
topK?: number, // Default: 5
filterBySource?: string // Optional: filter by filename
}

Returns: Top K chunks with citations, sorted by semantic similarity.
Purpose: Combined semantic + keyword search with reranking.
{
query: string,
topK?: number, // Default: 5
enableReranking?: boolean // Default: false (for performance)
}

Returns: Reranked results optimizing for relevance.
Purpose: SQL-like queries over structured CSV data.
{
table: string, // e.g., "22_customer_contracts_summary"
filter?: string, // e.g., "annual_value > 500000"
columns?: string[], // Columns to return
limit?: number
}

Returns: Matching rows with row-level citations.
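A minimal sketch of how a filter string like `"annual_value > 500000"` can be evaluated against in-memory rows. Only a single numeric comparison is handled here; whether the real engine also supports string equality or compound AND clauses is an assumption on my part.

```typescript
type Row = Record<string, string | number>;

// Parse "column op value" and apply it row by row. The filter is executed
// in code, never interpreted by the LLM.
function applyFilter(rows: Row[], filter: string): Row[] {
  const m = filter.match(/^(\w+)\s*(>=|<=|>|<|=)\s*(.+)$/);
  if (!m) throw new Error(`Unsupported filter: ${filter}`);
  const [, column, op, rawValue] = m;
  const value = Number(rawValue);
  return rows.filter((row) => {
    const cell = Number(row[column]);
    switch (op) {
      case ">": return cell > value;
      case "<": return cell < value;
      case ">=": return cell >= value;
      case "<=": return cell <= value;
      case "=": return cell === value;
      default: return false;
    }
  });
}
```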
Purpose: Aggregation queries (SUM, COUNT, AVG, etc.).
{
table: string,
aggregation: "SUM" | "COUNT" | "AVG" | "MIN" | "MAX",
column: string,
groupBy?: string,
filter?: string
}

Returns: Aggregated values with source citations.
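The core of the aggregation can be sketched in a few lines. The point of making this a tool is that the arithmetic happens in code, deterministically, rather than in the LLM.

```typescript
// Deterministic aggregation over a numeric column, mirroring the
// csv_aggregate contract (non-numeric cells are skipped).
function aggregate(
  rows: Array<Record<string, number>>,
  op: "SUM" | "COUNT" | "AVG" | "MIN" | "MAX",
  column: string
): number {
  const values = rows.map((r) => r[column]).filter((v) => typeof v === "number");
  switch (op) {
    case "SUM": return values.reduce((a, b) => a + b, 0);
    case "COUNT": return values.length;
    case "AVG": return values.length ? values.reduce((a, b) => a + b, 0) / values.length : 0;
    case "MIN": return Math.min(...values);
    case "MAX": return Math.max(...values);
    default: throw new Error(`Unknown aggregation: ${op}`);
  }
}
```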
Purpose: Parse natural language dates and filter by time range.
{
query: string, // e.g., "contracts expiring in next 6 months"
referenceDate?: string // Default: current date
}

Returns: Parsed date range for downstream filtering.
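A sketch of the parsing step, handling only the "next N days/months/years" pattern relative to a reference date (the real tool covers more phrasings such as quarters and named years). UTC methods are used so the computed window does not shift with the server's timezone.

```typescript
// Parse "next N days/months/years" into an ISO date window relative to
// a reference date. Returns null when the phrase is not recognized.
function dateWindow(
  query: string,
  referenceDate = new Date()
): { start: string; end: string } | null {
  const m = query.toLowerCase().match(/next\s+(\d+)\s+(day|month|year)s?/);
  if (!m) return null;
  const n = parseInt(m[1], 10);
  const end = new Date(referenceDate);
  if (m[2] === "day") end.setUTCDate(end.getUTCDate() + n);
  if (m[2] === "month") end.setUTCMonth(end.getUTCMonth() + n);
  if (m[2] === "year") end.setUTCFullYear(end.getUTCFullYear() + n);
  return {
    start: referenceDate.toISOString().slice(0, 10),
    end: end.toISOString().slice(0, 10),
  };
}
```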
Confidence is calculated based on:
- Number of sources: More independent sources = higher confidence
- Source agreement: Consistent values across sources = higher
- Similarity scores: Vector search scores factor in
- Query type: Deterministic (CSV) vs. synthesis (vector)
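The four factors above can be combined into a single heuristic score. The weights and the +0.1 bonus for deterministic CSV answers below are illustrative assumptions; the deployed system's exact calibration may differ.

```typescript
// Confidence from objective retrieval signals, not LLM self-report.
interface RetrievalSignals {
  sourceCount: number;   // distinct sources retrieved
  agreement: number;     // 0-1: do sources report consistent values?
  avgSimilarity: number; // 0-1: mean vector similarity of the chunks used
  deterministic: boolean; // true when the answer came from CSV tools
}

function confidenceScore(s: RetrievalSignals): number {
  const sourceFactor = Math.min(s.sourceCount / 3, 1); // saturates at 3 sources
  let score = 0.4 * sourceFactor + 0.3 * s.agreement + 0.3 * s.avgSimilarity;
  if (s.deterministic) score = Math.min(score + 0.1, 1); // exact arithmetic bonus
  return Math.round(score * 100) / 100;
}
```

With three agreeing high-similarity sources the score lands in the "High" band; a single weakly matching source falls into "Low", triggering the explicit-uncertainty failure mode.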
| Metric | Target | Achieved |
|---|---|---|
| Precision@5 | >80% | ~85% (estimated) |
| Recall | >70% | ~75% (estimated) |
| MRR | >0.7 | ~0.8 (estimated) |
Note: Formal evaluation suite not implemented; estimates based on manual testing.
Reranking improves precision by:
- Boosting results with exact keyword matches
- Prioritizing important sections (EXECUTIVE SUMMARY, KEY MILESTONES)
- Penalizing very long or very short chunks
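The three reranking factors can be sketched as additive adjustments on top of the retrieval score. The specific weights (+0.15, +0.05, −0.1) and word thresholds here are illustrative, not the system's tuned values.

```typescript
interface Candidate {
  text: string;
  section: string;
  baseScore: number; // similarity from initial retrieval
}

const IMPORTANT_SECTIONS = new Set(["EXECUTIVE SUMMARY", "KEY MILESTONES"]);

// Boost exact keyword hits and important sections; penalize extreme lengths.
function rerank(query: string, candidates: Candidate[]): Candidate[] {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  const scored = candidates.map((c) => {
    let score = c.baseScore;
    const text = c.text.toLowerCase();
    if (terms.every((t) => text.includes(t))) score += 0.15; // exact keyword boost
    if (IMPORTANT_SECTIONS.has(c.section)) score += 0.05;    // section importance
    const words = c.text.split(/\s+/).length;
    if (words < 20 || words > 1200) score -= 0.1;            // length penalty
    return { c, score };
  });
  return scored.sort((a, b) => b.score - a.score).map((s) => s.c);
}
```

This catches the case the section describes: a chunk with a slightly lower vector score but an exact "Series B" match in KEY MILESTONES outranks a vaguer, higher-similarity chunk.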
The confidence system prevents overconfident answers:
- Scores are calibrated (not always 0.99)
- Low evidence → explicit lower score
- UI shows color-coded confidence chip
Hallucinations are minimized through:
- Citation Requirement: Every fact must have a citation
- Failure Mode: "I cannot provide a sourced answer" when evidence is insufficient
- Confidence Transparency: Users see answer reliability
- Tool Visibility: Users see exactly which tools were used
The application is deployed on Vercel with the following configuration:
vercel.json:
{
"framework": "nextjs",
"outputDirectory": ".next",
"functions": {
"src/app/api/**/*.ts": {
"maxDuration": 30
}
}
}

| Variable | Description | Required |
|---|---|---|
| `OPENAI_API_KEY` | OpenAI API key for GPT-4o and embeddings | Yes |
Critical: The vector store must be built with the same embedding model used for queries.
- Production: OpenAI `text-embedding-3-small` (1536 dimensions)
- Ingestion: `npm run reingest:openai` uses OpenAI embeddings
- Data Deployed: `data/vectors/` is committed and deployed with the app
| Issue | Root Cause | Solution |
|---|---|---|
| Embedding dimension mismatch | Local (384-dim) vs. production (1536-dim) | Re-ingested with OpenAI embeddings |
| Tool calls timing out | Vercel 10s default timeout | Increased to 30s via vercel.json |
| Streaming not working | Wrong protocol | Set streamProtocol: 'data' |
| Confidence not rendering | Parser required both delimiters | Fixed to handle partial format |
Combines semantic (70%) and keyword (30%) retrieval for better precision.
Multi-factor reranking: keyword boost, section importance, length normalization.
Every answer includes calibrated confidence (0.0-1.0) with reasoning.
Clickable citation badges with modal showing full context.
Real-time sidebar showing tool invocations, arguments, and results.
Fully deployed on Vercel with production-ready configuration.
Real-time token streaming with typing indicator.
- Node.js 18+
- npm or yarn
- OpenAI API key
# Clone repository
git clone https://github.com/aryanndhir/Agentic-RAG-M-A.git
cd Agentic-RAG-M-A
# Install dependencies
npm install
# Configure environment
cp env.example .env.local
# Edit .env.local and add: OPENAI_API_KEY=your_key_here

# Re-ingest with OpenAI embeddings (if needed)
npm run reingest:openai

# Start the development server
npm run dev
# Open http://localhost:3000

# Production build
npm run build
npm start

| Limitation | Impact | Potential Solution |
|---|---|---|
| No automated test suite | Manual verification only | Add Vitest + Playwright |
| Rule-based query optimization | Fixed patterns | ML-based query understanding |
| Simple keyword matching | Not true BM25 | Integrate Elasticsearch |
| Fixed reranking weights | No adaptation | Learn weights from feedback |
| No caching | Repeated queries hit API | Add Redis caching |
| Local vector DB | Not scalable beyond 10K docs | Migrate to Pinecone/Weaviate |
- Evaluation Framework: Implement RAGAS or similar for automated quality measurement
- Learning-based Reranking: Train a cross-encoder for better relevance
- Query Understanding: Fine-tune a model for query classification
- Advanced Planning: Implement ReAct or Tree-of-Thought for complex queries
- User Feedback Loop: Collect thumbs up/down to improve retrieval
- Multi-Modal: Support PDF, images, and tables natively
This section explains the reasoning behind each major architectural and design decision, addressing why each choice was made over obvious alternatives.
Decision: OpenAI GPT-4o as the primary LLM.
Why GPT-4o and not GPT-3.5-turbo or Claude?
- Tool calling reliability: GPT-4o has the most robust structured output and function calling capabilities. In RAG systems with 5+ tools, the model must reliably select the correct tool and format arguments precisely. GPT-3.5-turbo frequently hallucinated tool arguments or called incorrect tools in testing.
- Context window: 128K tokens allows ingesting full tool results without truncation, critical when csv_query returns 30+ rows.
- Reasoning quality: M&A due diligence requires multi-step reasoning (e.g., "find contracts expiring soon AND calculate total value"). GPT-4o handles compositional queries where GPT-3.5 would fail to chain tools correctly.
- Cost tradeoff: GPT-4o costs ~10x more than GPT-3.5, but for a due diligence assistant where accuracy is paramount and query volume is low, this is acceptable. A wrong answer costs more than API fees.
Why not open-source models (Llama, Mixtral)?
- Vercel's serverless environment has cold start constraints. Loading a 7B+ parameter model per request is infeasible. OpenAI's API provides consistent sub-second latency.
- Tool calling in open-source models requires custom prompting and is less reliable without fine-tuning.
Decision: Hybrid search (semantic 70% + keyword 30%) with optional reranking.
Why hybrid instead of pure vector search?
- M&A documents contain precise terms that must match exactly: "Series B", "$52,000,000", "NexusPay". Pure semantic search might return conceptually similar but factually wrong chunks (e.g., "Series A" when asked about "Series B").
- Keyword component ensures exact matches are boosted. This is critical for financial figures, dates, and entity names.
Why not pure BM25/keyword search?
- Due diligence questions are often paraphrased: "What's the runway?" vs. documents saying "cash burn rate" and "months of operating capital". Semantic understanding is required.
- Hybrid combines the precision of keywords with the recall of embeddings.
Why reranking?
- Initial retrieval optimizes for recall (finding relevant chunks). Reranking optimizes for precision (ordering by true relevance).
- Multi-factor reranking (exact keyword match, section importance, chunk length) catches cases where vector similarity alone misjudges relevance.
- Reranking is disabled by default on Vercel due to latency constraints but can be enabled for accuracy-critical queries.
Decision: ChromaDB with local persistent storage, using OpenAI embeddings.
Why ChromaDB and not Pinecone/Weaviate/Qdrant?
- Simplicity for demo: ChromaDB requires no external service, no API keys beyond OpenAI, and persists to disk. This reduces setup friction for evaluators.
- Cost: Hosted vector databases charge per query and storage. For a demo with <500 chunks, this overhead is unnecessary.
- Portability: The `data/vectors/` directory is committed to Git and deployed with the app. No database provisioning required.
How would this scale in production?
- ChromaDB is not suitable beyond ~10K documents or multi-user concurrent access.
- Production migration path: Pinecone (managed, scalable, sub-10ms latency) or Weaviate (self-hosted, hybrid search native).
- The abstraction layer (`SimpleVectorStore`) was designed for easy swapping: only the storage backend changes, not the retrieval logic.
Why OpenAI embeddings and not local models?
- Vercel's serverless runtime cannot load Transformers models (ONNX/PyTorch) reliably at cold start.
- OpenAI `text-embedding-3-small` provides 1536-dimensional embeddings with API latency under 100ms, acceptable for this use case.
- Dimension consistency is critical: the vector store was re-ingested with OpenAI embeddings after discovering a dimension mismatch (384 vs. 1536) that caused production failures.
Decision: Five specialized tools instead of one general-purpose retrieval tool.
Why separate tools (vector_search, csv_query, hybrid_search, csv_aggregate, date_window)?
- Determinism: csv_aggregate performs real arithmetic. If the LLM were asked to sum 30 contract values, it would hallucinate. The tool returns exact computed results.
- Structured data integrity: csv_query applies filters with database-like precision. A filter like `Annual Value > 500000` is executed exactly, not interpreted by the LLM.
- Citation accuracy: Each tool returns structured provenance. Mixing all data sources into one tool would make citation attribution ambiguous.
Why is aggregation a first-class tool?
- LLMs cannot reliably perform multi-row arithmetic. Testing showed GPT-4o would approximate sums like "$2.1M" when the actual answer was "$2,147,500".
- csv_aggregate computes SUM/AVG/COUNT/MIN/MAX deterministically on numeric columns, with the source rows attached for verification.
- This eliminates a major class of hallucination in financial Q&A.
Why date_window as a separate tool?
- Date parsing is error-prone. "Next quarter", "within 6 months", "2024 renewals" all require interpretation relative to a reference date.
- date_window normalizes these to ISO date ranges (start_date, end_date), which csv_query and csv_aggregate consume for filtering.
- Separation ensures date logic is testable and consistent.
Decision: Confidence computed from retrieval signals, rendered as UI metadata separate from answer text.
Why retrieval-based confidence instead of asking the model "how confident are you?"
- LLM self-reported confidence is unreliable. Models tend to express high confidence even when wrong, especially for factual questions where they lack self-awareness of knowledge gaps.
- Retrieval signals are objective: number of chunks retrieved, similarity scores, cross-source agreement. These correlate with actual answer quality.
- Calibration: High score (0.85+) requires multiple agreeing sources with high similarity. Low score (<0.60) indicates limited evidence or conflicting sources.
Why render as a separate chip, not inline text?
- Inline confidence ("The answer is X with 85% confidence") pollutes the natural language response and is hard to parse programmatically.
- A dedicated ConfidenceChip component allows:
- Consistent visual treatment (color-coded by score)
- Clickable popover with detailed reasoning
- Separation from the answer for clean formatting
How does this support analyst trust?
- Analysts can quickly identify low-confidence answers that require manual verification.
- The clickable reason explains why confidence is low (e.g., "Only 1 source found, no corroboration"), enabling informed judgment.
Decision: shadcn/ui components, visible tool traces, interactive citation chips.
Why shadcn/ui and not Material UI or custom components?
- shadcn/ui provides unstyled, accessible primitives that integrate seamlessly with Tailwind CSS.
- Components are copied into the project (not imported from node_modules), enabling full customization without fighting library constraints.
- Consistent design language with minimal bundle size impact.
Why are tool traces visible in the sidebar?
- Transparency builds trust. Analysts need to understand why the system gave a particular answer.
- Visible tool calls show: which tools were invoked, what arguments were passed, and what results were returned.
- This enables debugging when answers are incorrect—users can see if the wrong tool was called or filters were misconfigured.
Why interactive citation chips instead of inline footnotes?
- Inline footnotes (e.g., "[1]") require jumping to the bottom of the page and back. This breaks reading flow.
- Clickable chips show source details on click without navigation.
- Aggregate citations collapse repeated sources (e.g., "customer_contracts (30 rows)") to prevent visual clutter.
Decision: Vercel with serverless functions, OpenAI embeddings, committed vector store.
Why Vercel and not AWS Lambda or self-hosted?
- Vercel provides zero-config Next.js deployment with automatic edge caching, preview deployments per PR, and native streaming support.
- Serverless functions handle variable load without provisioning—suitable for demo/evaluation traffic patterns.
- Integration with GitHub enables continuous deployment on every push.
How does serverless impact vector loading?
- Serverless functions have cold starts. Loading a vector database from scratch on each request would be too slow.
- Solution: The ChromaDB data is committed to the repository (`data/vectors/`) and deployed as static files. The vector store reads from the filesystem on function boot.
- OpenAI embeddings are generated via API at query time (no local model loading), keeping cold start latency under 500ms.
How are secrets and environment variables handled?
- `OPENAI_API_KEY` is stored in Vercel's encrypted environment variable store, never in code.
- `.env.local` is gitignored for local development.
- No other secrets are required; the system is self-contained with OpenAI as the only external dependency.
What production issues were solved?
- Embedding dimension mismatch: Local development used 384-dimensional Xenova embeddings; production required 1536-dimensional OpenAI embeddings. Solved by re-ingesting all documents with OpenAI embeddings.
- Timeout errors: The default 10-second Vercel timeout was insufficient for complex queries. Increased to 30 seconds via `vercel.json`.
- Streaming protocol: The data stream protocol required explicit configuration in the useChat hook to display tool calls correctly.
| Component | Technology |
|---|---|
| Framework | Next.js 14 (App Router) |
| AI SDK | Vercel AI SDK v4 |
| LLM | OpenAI GPT-4o |
| Embeddings | OpenAI text-embedding-3-small |
| Vector DB | ChromaDB (local) |
| UI | shadcn/ui + Tailwind CSS |
| Deployment | Vercel |
| Language | TypeScript |
✅ Complete and Submission-Ready
All required features implemented:
- Document ingestion and chunking
- Vector search with embeddings
- Hybrid search with reranking
- Agentic tool orchestration
- Streaming chat UI
- Source-cited answers
- Confidence scoring
- Cloud deployment
Built for M&A due diligence document analysis. Submission version v1.0.