An enterprise-grade Agentic Retrieval-Augmented Generation (RAG) system designed for M&A due diligence Q&A over private documents. Built with Next.js 14, Vercel AI SDK, and deployed on Vercel with full production support.
- System Overview
- Architecture
- Retrieval Pipeline
- Agentic Reasoning
- Tools Implemented
- Evaluation & Quality
- Deployment
- Bonus Features
- How to Run Locally
- Known Limitations & Future Work
- Design Justification
This system provides an intelligent conversational interface for querying M&A due diligence documents. Users can ask natural language questions about company financials, contracts, risks, products, and legal matters, receiving accurate answers with source citations.
In M&A transactions, due diligence involves reviewing hundreds of documents—financial statements, contracts, legal filings, and operational data. This system automates Q&A over these documents by:
- Semantic Search: Understanding intent, not just keywords
- Structured Data Queries: Querying CSV data with SQL-like precision
- Multi-Step Reasoning: Breaking complex questions into sub-queries
- Source-Cited Answers: Every claim backed by verifiable citations
- Confidence Scoring: Transparency about answer reliability
The system follows a modular architecture with five core layers:
┌─────────────────────────────────────────────────────────────────┐
│ USER INTERFACE │
│ Next.js Chat UI • Streaming Responses • Tool Trace • Citations │
└────────────────────────────────┬────────────────────────────────┘
│ HTTP/SSE
┌────────────────────────────────▼────────────────────────────────┐
│ AGENT LAYER │
│ GPT-4o Orchestrator • Tool Selection • Multi-Step Planning │
└────────────────────────────────┬────────────────────────────────┘
│
┌───────────────────────┼───────────────────────┐
│ │ │
┌────────▼────────┐ ┌─────────▼─────────┐ ┌────────▼────────┐
│ VECTOR STORE │ │ CSV ENGINE │ │ QUERY OPTIMIZER│
│ ChromaDB │ │ Structured Data │ │ Expansion │
│ OpenAI Embed │ │ SQL-like Filter │ │ Rewriting │
└─────────────────┘ └───────────────────┘ └─────────────────┘
│ │ │
└───────────────────────┼───────────────────────┘
│
┌────────────────────────────────▼────────────────────────────────┐
│ DOCUMENT STORE │
│ 15 TXT Files • 13 CSV Files • 28 Total Documents │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────┐
│ FRONTEND │
│ ┌─────────────────┐ ┌──────────────────┐ ┌────────────────────────┐ │
│ │ ChatInterface │ │ ChatMessage │ │ ToolTrace │ │
│ │ - useChat hook │ │ - Answer bubble │ │ - Real-time visibility │ │
│ │ - Message list │ │ - Confidence chip│ │ - Tool arguments │ │
│ │ - Input handling│ │ - Citation badges│ │ - Result preview │ │
│ └─────────────────┘ └──────────────────┘ └────────────────────────┘ │
└────────────────────────────────────┬────────────────────────────────────┘
│ SSE Stream
┌────────────────────────────────────▼────────────────────────────────────┐
│ API LAYER (/api/chat) │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Vercel AI SDK streamText() │ │
│ │ - Model: GPT-4o │ │
│ │ - Tools: 5 registered tools │ │
│ │ - Max Steps: 5 (multi-turn tool use) │ │
│ │ - System Prompt: Role, tools, formatting, confidence rules │ │
│ └───────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────┬────────────────────────────────────┘
│
┌────────────────┬───────────────┼───────────────┬────────────────┐
│ │ │ │ │
┌───▼───┐ ┌───▼───┐ ┌───▼───┐ ┌───▼───┐ ┌───▼───┐
│vector_│ │hybrid_│ │csv_ │ │csv_ │ │date_ │
│search │ │search │ │query │ │aggr. │ │window │
└───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘
│ │ │ │ │
└────────────────┴───────┬───────┴───────────────┴────────────────┘
│
┌────────────────────────────▼────────────────────────────────────────────┐
│ DATA LAYER │
│ ┌─────────────────────┐ ┌─────────────────────────────────────┐ │
│ │ SimpleVectorStore │ │ CSV Storage (JSON) │ │
│ │ - ChromaDB wrapper │ │ - 13 normalized tables │ │
│ │ - 1536-dim OpenAI │ │ - Row-level citation IDs │ │
│ │ - 222 chunks │ │ - Filtering & aggregation │ │
│ └─────────────────────┘ └─────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
- User Query → Chat Interface
- Query Analysis → Agent determines tool(s) needed
- Tool Execution → One or more tools invoked in sequence
- Result Synthesis → Agent combines results into coherent answer
- Confidence Scoring → Agent assesses answer reliability
- Response Streaming → Answer + citations + confidence streamed to UI
Documents are processed through a multi-stage pipeline:
Source Files (28)
│
▼
┌──────────────┐
│ Parser │ → Extract text, detect sections, parse CSV headers
└──────┬───────┘
│
▼
┌──────────────┐
│ Chunker │ → Section-aware chunking, 500-1000 tokens, overlap
└──────┬───────┘
│
▼
┌──────────────┐
│ Embeddings │ → OpenAI text-embedding-3-small (1536 dimensions)
└──────┬───────┘
│
▼
┌──────────────┐
│ Vector Store │ → ChromaDB with metadata (source, section, chunk_id)
└──────────────┘
| Aspect | Strategy |
|---|---|
| Method | Section-aware semantic chunking |
| Chunk Size | 500-1000 tokens (optimized for context) |
| Overlap | 50 tokens between chunks |
| Boundaries | Respects section headers (e.g., ===, ---) |
| Metadata | Source file, section name, chunk index, total chunks |
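The chunking strategy above can be sketched as a small pure function. This is a simplified illustration, not the deployed chunker: it splits on the `===`/`---` section delimiters used in the source documents, windows each section with overlap, and counts words rather than tokens (the real pipeline targets 500-1000 tokens per chunk).

```typescript
// Sketch of section-aware chunking: split on section delimiters first,
// then slide a window with overlap inside each section. Sizes are in
// words here for simplicity; the real pipeline counts tokens.
interface Chunk {
  text: string;
  section: string;
  chunkIndex: number;
}

function chunkDocument(
  text: string,
  maxWords = 200,
  overlapWords = 20
): Chunk[] {
  // Treat lines of === or --- as section boundaries (as in the source docs).
  const sections = text.split(/\n(?:={3,}|-{3,})\n/);
  const chunks: Chunk[] = [];
  for (const section of sections) {
    const lines = section.trim().split("\n");
    const sectionName = lines[0] ?? "UNKNOWN"; // first line acts as section header
    const words = section.trim().split(/\s+/);
    for (let start = 0; start < words.length; start += maxWords - overlapWords) {
      const windowText = words.slice(start, start + maxWords).join(" ");
      chunks.push({ text: windowText, section: sectionName, chunkIndex: chunks.length });
      if (start + maxWords >= words.length) break; // last window consumed the tail
    }
  }
  return chunks;
}
```

Each chunk carries its section name and index, which is what the citation metadata (source, section, chunk index) is built from downstream.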
| Property | Value |
|---|---|
| Model | OpenAI text-embedding-3-small |
| Dimensions | 1536 |
| Environment | Production (Vercel) and Local |
| Consistency | Same model for ingestion and query |
| Property | Value |
|---|---|
| Database | ChromaDB (SimpleVectorStore wrapper) |
| Collection | ma_documents |
| Storage | Persistent (data/vectors/) |
| Index Type | HNSW (Hierarchical Navigable Small World) |
The hybrid_search tool combines multiple retrieval strategies:
Query: "Series B funding amount"
│
├──────────────────────────────────────┐
│ │
▼ ▼
┌─────────────────────┐ ┌─────────────────────┐
│ Semantic Search │ │ Keyword Matching │
│ (Vector Similarity)│ │ (BM25-style) │
│ Weight: 70% │ │ Weight: 30% │
└──────────┬──────────┘ └──────────┬──────────┘
│ │
└──────────────┬───────────────────────┘
│
▼
┌─────────────────┐
│ Reranker │
│ - Keyword boost│
│ - Section score│
│ - Position │
└────────┬────────┘
│
▼
Top K Results with Citations
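The 70/30 score fusion in the diagram above can be sketched as follows. This assumes both component scores are already normalized to [0, 1]; the keyword scorer shown here is a simple term-overlap stand-in for the system's BM25-style matcher.

```typescript
// Fraction of query terms that appear verbatim in the chunk (a crude
// stand-in for the BM25-style keyword component).
function keywordScore(query: string, chunk: string): number {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  if (terms.length === 0) return 0;
  const text = chunk.toLowerCase();
  const hits = terms.filter((t) => text.includes(t)).length;
  return hits / terms.length;
}

// Weighted fusion matching the diagram: 70% semantic, 30% keyword.
function hybridScore(semantic: number, keyword: number): number {
  return 0.7 * semantic + 0.3 * keyword;
}
```

A chunk containing the exact phrase "Series B funding" gets a full keyword score, so even a modest semantic score still ranks it above conceptually similar but factually wrong chunks (e.g., "Series A").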
The agent uses a decision tree based on query classification:
| Query Type | Primary Tool | Fallback |
|---|---|---|
| Factual lookup | hybrid_search | vector_search |
| Financial data | csv_query or csv_aggregate | hybrid_search |
| Date-based filter | date_window → csv_query | hybrid_search |
| Aggregation | csv_aggregate | Manual computation |
| Complex synthesis | Multi-tool chain | - |
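The decision tree above can be approximated as a routing function. This is only an illustrative sketch: in the actual system, GPT-4o selects tools from their descriptions at runtime rather than following hard-coded rules, and the keyword patterns below are my own examples.

```typescript
type ToolName =
  | "vector_search"
  | "hybrid_search"
  | "csv_query"
  | "csv_aggregate"
  | "date_window";

// Rule-of-thumb router mirroring the table: aggregation wording routes to
// csv_aggregate, date phrases route through date_window then csv_query,
// financial terms route to csv_query, everything else defaults to
// hybrid_search.
function routeQuery(query: string): ToolName[] {
  const q = query.toLowerCase();
  if (/\b(total|sum|average|count|how many)\b/.test(q)) return ["csv_aggregate"];
  if (/\b(next|last|within)\s+\d+\s+(days|months|years)\b/.test(q)) {
    return ["date_window", "csv_query"];
  }
  if (/\b(revenue|contract value|arr|headcount)\b/.test(q)) return ["csv_query"];
  return ["hybrid_search"];
}
```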
The system transforms queries for better retrieval:
- Abbreviation Expansion: "Q4" → "fourth quarter", "M&A" → "mergers and acquisitions"
- Synonym Addition: "revenue" → "revenue, sales, income"
- Entity Extraction: Identifies companies, dates, financial terms
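The first two transformations can be sketched as dictionary lookups. The tables below are small illustrative samples, not the system's full expansion lists.

```typescript
// Abbreviations are replaced in place; synonyms are appended so the
// expanded query matches more phrasings in the corpus.
const ABBREVIATIONS: Record<string, string> = {
  "q4": "fourth quarter",
  "m&a": "mergers and acquisitions",
};

const SYNONYMS: Record<string, string[]> = {
  revenue: ["sales", "income"],
};

function expandQuery(query: string): string {
  let expanded = query
    .split(/\s+/)
    .map((w) => ABBREVIATIONS[w.toLowerCase()] ?? w)
    .join(" ");
  for (const [term, syns] of Object.entries(SYNONYMS)) {
    if (expanded.toLowerCase().includes(term)) {
      expanded += " " + syns.join(" ");
    }
  }
  return expanded;
}
```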
For complex queries, the agent chains tools:
Query: "Compare revenue growth and list top customers by contract value"
Step 1: csv_query → Get revenue data by year
Step 2: csv_aggregate → Calculate growth percentages
Step 3: csv_query → Get customer contracts
Step 4: Synthesize → Combine into coherent answer
Every answer includes a calibrated confidence score:
| Score Range | Meaning | Criteria |
|---|---|---|
| 0.85-1.00 | High | Multiple agreeing sources, exact matches |
| 0.60-0.84 | Medium | 2-3 sources, moderate agreement |
| 0.30-0.59 | Low | Limited evidence, some inference |
| <0.30 | Very Low | Insufficient evidence (triggers failure mode) |
UI Rendering: Confidence appears as a clickable chip below the answer, above citations. Colors indicate confidence level (green/amber/orange).
Every factual claim is grounded with citations:
interface Citation {
source: string // e.g., "01_company_overview.txt"
section: string // e.g., "KEY MILESTONES"
chunkId: string // Unique identifier
relevance: number // 0.0-1.0 similarity score
text: string // Excerpt (preview)
}

Purpose: Pure semantic search over document chunks.
{
query: string, // Natural language query
topK?: number, // Default: 5
filterBySource?: string // Optional: filter by filename
}

Returns: Top K chunks with citations, sorted by semantic similarity.
Purpose: Combined semantic + keyword search with reranking.
{
query: string,
topK?: number, // Default: 5
enableReranking?: boolean // Default: false (for performance)
}

Returns: Reranked results optimizing for relevance.
Purpose: SQL-like queries over structured CSV data.
{
table: string, // e.g., "22_customer_contracts_summary"
filter?: string, // e.g., "annual_value > 500000"
columns?: string[], // Columns to return
limit?: number
}

Returns: Matching rows with row-level citations.
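A minimal sketch of how a filter string like `"annual_value > 500000"` can be evaluated against in-memory rows. Only a single numeric comparison is handled here; whether the real engine also supports string equality or compound AND clauses is an assumption on my part.

```typescript
type Row = Record<string, string | number>;

// Parse "column op value" and apply it row by row. The filter is executed
// in code, never interpreted by the LLM.
function applyFilter(rows: Row[], filter: string): Row[] {
  const m = filter.match(/^(\w+)\s*(>=|<=|>|<|=)\s*(.+)$/);
  if (!m) throw new Error(`Unsupported filter: ${filter}`);
  const [, column, op, rawValue] = m;
  const value = Number(rawValue);
  return rows.filter((row) => {
    const cell = Number(row[column]);
    switch (op) {
      case ">": return cell > value;
      case "<": return cell < value;
      case ">=": return cell >= value;
      case "<=": return cell <= value;
      case "=": return cell === value;
      default: return false;
    }
  });
}
```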
Purpose: Aggregation queries (SUM, COUNT, AVG, etc.).
{
table: string,
aggregation: "SUM" | "COUNT" | "AVG" | "MIN" | "MAX",
column: string,
groupBy?: string,
filter?: string
}

Returns: Aggregated values with source citations.
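The core of the aggregation can be sketched in a few lines. The point of making this a tool is that the arithmetic happens in code, deterministically, rather than in the LLM.

```typescript
// Deterministic aggregation over a numeric column, mirroring the
// csv_aggregate contract (non-numeric cells are skipped).
function aggregate(
  rows: Array<Record<string, number>>,
  op: "SUM" | "COUNT" | "AVG" | "MIN" | "MAX",
  column: string
): number {
  const values = rows.map((r) => r[column]).filter((v) => typeof v === "number");
  switch (op) {
    case "SUM": return values.reduce((a, b) => a + b, 0);
    case "COUNT": return values.length;
    case "AVG": return values.length ? values.reduce((a, b) => a + b, 0) / values.length : 0;
    case "MIN": return Math.min(...values);
    case "MAX": return Math.max(...values);
    default: throw new Error(`Unknown aggregation: ${op}`);
  }
}
```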
Purpose: Parse natural language dates and filter by time range.
{
query: string, // e.g., "contracts expiring in next 6 months"
referenceDate?: string // Default: current date
}

Returns: Parsed date range for downstream filtering.
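A sketch of the parsing step, handling only the "next N days/months/years" pattern relative to a reference date (the real tool covers more phrasings such as quarters and named years). UTC methods are used so the computed window does not shift with the server's timezone.

```typescript
// Parse "next N days/months/years" into an ISO date window relative to
// a reference date. Returns null when the phrase is not recognized.
function dateWindow(
  query: string,
  referenceDate = new Date()
): { start: string; end: string } | null {
  const m = query.toLowerCase().match(/next\s+(\d+)\s+(day|month|year)s?/);
  if (!m) return null;
  const n = parseInt(m[1], 10);
  const end = new Date(referenceDate);
  if (m[2] === "day") end.setUTCDate(end.getUTCDate() + n);
  if (m[2] === "month") end.setUTCMonth(end.getUTCMonth() + n);
  if (m[2] === "year") end.setUTCFullYear(end.getUTCFullYear() + n);
  return {
    start: referenceDate.toISOString().slice(0, 10),
    end: end.toISOString().slice(0, 10),
  };
}
```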
Confidence is calculated based on:
- Number of sources: More independent sources = higher confidence
- Source agreement: Consistent values across sources = higher
- Similarity scores: Vector search scores factor in
- Query type: Deterministic (CSV) vs. synthesis (vector)
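The four factors above can be combined into a single heuristic score. The weights and the +0.1 bonus for deterministic CSV answers below are illustrative assumptions; the deployed system's exact calibration may differ.

```typescript
// Confidence from objective retrieval signals, not LLM self-report.
interface RetrievalSignals {
  sourceCount: number;   // distinct sources retrieved
  agreement: number;     // 0-1: do sources report consistent values?
  avgSimilarity: number; // 0-1: mean vector similarity of the chunks used
  deterministic: boolean; // true when the answer came from CSV tools
}

function confidenceScore(s: RetrievalSignals): number {
  const sourceFactor = Math.min(s.sourceCount / 3, 1); // saturates at 3 sources
  let score = 0.4 * sourceFactor + 0.3 * s.agreement + 0.3 * s.avgSimilarity;
  if (s.deterministic) score = Math.min(score + 0.1, 1); // exact arithmetic bonus
  return Math.round(score * 100) / 100;
}
```

With three agreeing high-similarity sources the score lands in the "High" band; a single weakly matching source falls into "Low", triggering the explicit-uncertainty failure mode.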
| Metric | Target | Achieved |
|---|---|---|
| Precision@5 | >80% | ~85% (estimated) |
| Recall | >70% | ~75% (estimated) |
| MRR | >0.7 | ~0.8 (estimated) |
Note: Formal evaluation suite not implemented; estimates based on manual testing.
Reranking improves precision by:
- Boosting results with exact keyword matches
- Prioritizing important sections (EXECUTIVE SUMMARY, KEY MILESTONES)
- Penalizing very long or very short chunks
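The three reranking factors can be sketched as additive adjustments on top of the retrieval score. The specific weights (+0.15, +0.05, −0.1) and word thresholds here are illustrative, not the system's tuned values.

```typescript
interface Candidate {
  text: string;
  section: string;
  baseScore: number; // similarity from initial retrieval
}

const IMPORTANT_SECTIONS = new Set(["EXECUTIVE SUMMARY", "KEY MILESTONES"]);

// Boost exact keyword hits and important sections; penalize extreme lengths.
function rerank(query: string, candidates: Candidate[]): Candidate[] {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  const scored = candidates.map((c) => {
    let score = c.baseScore;
    const text = c.text.toLowerCase();
    if (terms.every((t) => text.includes(t))) score += 0.15; // exact keyword boost
    if (IMPORTANT_SECTIONS.has(c.section)) score += 0.05;    // section importance
    const words = c.text.split(/\s+/).length;
    if (words < 20 || words > 1200) score -= 0.1;            // length penalty
    return { c, score };
  });
  return scored.sort((a, b) => b.score - a.score).map((s) => s.c);
}
```

This catches the case the section describes: a chunk with a slightly lower vector score but an exact "Series B" match in KEY MILESTONES outranks a vaguer, higher-similarity chunk.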
The confidence system prevents overconfident answers:
- Scores are calibrated (not always 0.99)
- Low evidence → explicit lower score
- UI shows color-coded confidence chip
Hallucinations are minimized through:
- Citation Requirement: Every fact must have a citation
- Failure Mode: "I cannot provide a sourced answer" when evidence is insufficient
- Confidence Transparency: Users see answer reliability
- Tool Visibility: Users see exactly which tools were used
The application is deployed on Vercel with the following configuration:
vercel.json:
{
"framework": "nextjs",
"outputDirectory": ".next",
"functions": {
"src/app/api/**/*.ts": {
"maxDuration": 30
}
}
}

| Variable | Description | Required |
|---|---|---|
| `OPENAI_API_KEY` | OpenAI API key for GPT-4o and embeddings | Yes |
Critical: The vector store must be built with the same embedding model used for queries.
- Production: OpenAI `text-embedding-3-small` (1536 dimensions)
- Ingestion: `npm run reingest:openai` uses OpenAI embeddings
- Data Deployed: `data/vectors/` is committed and deployed with the app
| Issue | Root Cause | Solution |
|---|---|---|
| Embedding dimension mismatch | Local (384-dim) vs. production (1536-dim) | Re-ingested with OpenAI embeddings |
| Tool calls timing out | Vercel 10s default timeout | Increased to 30s via vercel.json |
| Streaming not working | Wrong protocol | Set streamProtocol: 'data' |
| Confidence not rendering | Parser required both delimiters | Fixed to handle partial format |
Combines semantic (70%) and keyword (30%) retrieval for better precision.
Multi-factor reranking: keyword boost, section importance, length normalization.
Every answer includes calibrated confidence (0.0-1.0) with reasoning.
Clickable citation badges with modal showing full context.
Real-time sidebar showing tool invocations, arguments, and results.
Fully deployed on Vercel with production-ready configuration.
Real-time token streaming with typing indicator.
- Node.js 18+
- npm or yarn
- OpenAI API key
# Clone repository
git clone https://github.com/aryanndhir/Agentic-RAG-M-A.git
cd Agentic-RAG-M-A
# Install dependencies
npm install
# Configure environment
cp env.example .env.local
# Edit .env.local and add: OPENAI_API_KEY=your_key_here

# Re-ingest with OpenAI embeddings (if needed)
npm run reingest:openai

# Start the development server
npm run dev
# Open http://localhost:3000

# Production build
npm run build
npm start

| Limitation | Impact | Potential Solution |
|---|---|---|
| No automated test suite | Manual verification only | Add Vitest + Playwright |
| Rule-based query optimization | Fixed patterns | ML-based query understanding |
| Simple keyword matching | Not true BM25 | Integrate Elasticsearch |
| Fixed reranking weights | No adaptation | Learn weights from feedback |
| No caching | Repeated queries hit API | Add Redis caching |
| Local vector DB | Not scalable beyond 10K docs | Migrate to Pinecone/Weaviate |
- Evaluation Framework: Implement RAGAS or similar for automated quality measurement
- Learning-based Reranking: Train a cross-encoder for better relevance
- Query Understanding: Fine-tune a model for query classification
- Advanced Planning: Implement ReAct or Tree-of-Thought for complex queries
- User Feedback Loop: Collect thumbs up/down to improve retrieval
- Multi-Modal: Support PDF, images, and tables natively
This section explains the reasoning behind each major architectural and design decision, addressing why each choice was made over obvious alternatives.
Decision: OpenAI GPT-4o as the primary LLM.
Why GPT-4o and not GPT-3.5-turbo or Claude?
- Tool calling reliability: GPT-4o has the most robust structured output and function calling capabilities. In RAG systems with 5+ tools, the model must reliably select the correct tool and format arguments precisely. GPT-3.5-turbo frequently hallucinated tool arguments or called incorrect tools in testing.
- Context window: 128K tokens allows ingesting full tool results without truncation, critical when csv_query returns 30+ rows.
- Reasoning quality: M&A due diligence requires multi-step reasoning (e.g., "find contracts expiring soon AND calculate total value"). GPT-4o handles compositional queries where GPT-3.5 would fail to chain tools correctly.
- Cost tradeoff: GPT-4o costs ~10x more than GPT-3.5, but for a due diligence assistant where accuracy is paramount and query volume is low, this is acceptable. A wrong answer costs more than API fees.
Why not open-source models (Llama, Mixtral)?
- Vercel's serverless environment has cold start constraints. Loading a 7B+ parameter model per request is infeasible. OpenAI's API provides consistent sub-second latency.
- Tool calling in open-source models requires custom prompting and is less reliable without fine-tuning.
Decision: Hybrid search (semantic 70% + keyword 30%) with optional reranking.
Why hybrid instead of pure vector search?
- M&A documents contain precise terms that must match exactly: "Series B", "$52,000,000", "NexusPay". Pure semantic search might return conceptually similar but factually wrong chunks (e.g., "Series A" when asked about "Series B").
- Keyword component ensures exact matches are boosted. This is critical for financial figures, dates, and entity names.
Why not pure BM25/keyword search?
- Due diligence questions are often paraphrased: "What's the runway?" vs. documents saying "cash burn rate" and "months of operating capital". Semantic understanding is required.
- Hybrid combines the precision of keywords with the recall of embeddings.
Why reranking?
- Initial retrieval optimizes for recall (finding relevant chunks). Reranking optimizes for precision (ordering by true relevance).
- Multi-factor reranking (exact keyword match, section importance, chunk length) catches cases where vector similarity alone misjudges relevance.
- Reranking is disabled by default on Vercel due to latency constraints but can be enabled for accuracy-critical queries.
Decision: ChromaDB with local persistent storage, using OpenAI embeddings.
Why ChromaDB and not Pinecone/Weaviate/Qdrant?
- Simplicity for demo: ChromaDB requires no external service, no API keys beyond OpenAI, and persists to disk. This reduces setup friction for evaluators.
- Cost: Hosted vector databases charge per query and storage. For a demo with <500 chunks, this overhead is unnecessary.
- Portability: The `data/vectors/` directory is committed to Git and deployed with the app. No database provisioning required.
How would this scale in production?
- ChromaDB is not suitable beyond ~10K documents or multi-user concurrent access.
- Production migration path: Pinecone (managed, scalable, sub-10ms latency) or Weaviate (self-hosted, hybrid search native).
- The abstraction layer (`SimpleVectorStore`) was designed for easy swapping: only the storage backend changes, not the retrieval logic.
Why OpenAI embeddings and not local models?
- Vercel's serverless runtime cannot load Transformers models (ONNX/PyTorch) reliably at cold start.
- OpenAI `text-embedding-3-small` provides 1536-dimensional embeddings with API latency under 100ms, acceptable for this use case.
- Dimension consistency is critical: the vector store was re-ingested with OpenAI embeddings after discovering a dimension mismatch (384 vs. 1536) that caused production failures.
Decision: Five specialized tools instead of one general-purpose retrieval tool.
Why separate tools (vector_search, csv_query, hybrid_search, csv_aggregate, date_window)?
- Determinism: csv_aggregate performs real arithmetic. If the LLM were asked to sum 30 contract values, it would hallucinate. The tool returns exact computed results.
- Structured data integrity: csv_query applies filters with database-like precision. A filter like `Annual Value > 500000` is executed exactly, not interpreted by the LLM.
- Citation accuracy: Each tool returns structured provenance. Mixing all data sources into one tool would make citation attribution ambiguous.
Why is aggregation a first-class tool?
- LLMs cannot reliably perform multi-row arithmetic. Testing showed GPT-4o would approximate sums like "$2.1M" when the actual answer was "$2,147,500".
- csv_aggregate computes SUM/AVG/COUNT/MIN/MAX deterministically on numeric columns, with the source rows attached for verification.
- This eliminates a major class of hallucination in financial Q&A.
Why date_window as a separate tool?
- Date parsing is error-prone. "Next quarter", "within 6 months", "2024 renewals" all require interpretation relative to a reference date.
- date_window normalizes these to ISO date ranges (start_date, end_date), which csv_query and csv_aggregate consume for filtering.
- Separation ensures date logic is testable and consistent.
Decision: Confidence computed from retrieval signals, rendered as UI metadata separate from answer text.
Why retrieval-based confidence instead of asking the model "how confident are you?"
- LLM self-reported confidence is unreliable. Models tend to express high confidence even when wrong, especially for factual questions where they lack self-awareness of knowledge gaps.
- Retrieval signals are objective: number of chunks retrieved, similarity scores, cross-source agreement. These correlate with actual answer quality.
- Calibration: High score (0.85+) requires multiple agreeing sources with high similarity. Low score (<0.60) indicates limited evidence or conflicting sources.
Why render as a separate chip, not inline text?
- Inline confidence ("The answer is X with 85% confidence") pollutes the natural language response and is hard to parse programmatically.
- A dedicated ConfidenceChip component allows:
- Consistent visual treatment (color-coded by score)
- Clickable popover with detailed reasoning
- Separation from the answer for clean formatting
How does this support analyst trust?
- Analysts can quickly identify low-confidence answers that require manual verification.
- The clickable reason explains why confidence is low (e.g., "Only 1 source found, no corroboration"), enabling informed judgment.
Decision: shadcn/ui components, visible tool traces, interactive citation chips.
Why shadcn/ui and not Material UI or custom components?
- shadcn/ui provides unstyled, accessible primitives that integrate seamlessly with Tailwind CSS.
- Components are copied into the project (not imported from node_modules), enabling full customization without fighting library constraints.
- Consistent design language with minimal bundle size impact.
Why are tool traces visible in the sidebar?
- Transparency builds trust. Analysts need to understand why the system gave a particular answer.
- Visible tool calls show: which tools were invoked, what arguments were passed, and what results were returned.
- This enables debugging when answers are incorrect—users can see if the wrong tool was called or filters were misconfigured.
Why interactive citation chips instead of inline footnotes?
- Inline footnotes (e.g., "[1]") require jumping to the bottom of the page and back. This breaks reading flow.
- Clickable chips show source details on click without navigation.
- Aggregate citations collapse repeated sources (e.g., "customer_contracts (30 rows)") to prevent visual clutter.
Decision: Vercel with serverless functions, OpenAI embeddings, committed vector store.
Why Vercel and not AWS Lambda or self-hosted?
- Vercel provides zero-config Next.js deployment with automatic edge caching, preview deployments per PR, and native streaming support.
- Serverless functions handle variable load without provisioning—suitable for demo/evaluation traffic patterns.
- Integration with GitHub enables continuous deployment on every push.
How does serverless impact vector loading?
- Serverless functions have cold starts. Loading a vector database from scratch on each request would be too slow.
- Solution: The ChromaDB data is committed to the repository (`data/vectors/`) and deployed as static files. The vector store reads from the filesystem on function boot.
- OpenAI embeddings are generated via API at query time (no local model loading), keeping cold start latency under 500ms.
How are secrets and environment variables handled?
- `OPENAI_API_KEY` is stored in Vercel's encrypted environment variable store, never in code.
- `.env.local` is gitignored for local development.
- No other secrets are required; the system is self-contained with OpenAI as the only external dependency.
What production issues were solved?
- Embedding dimension mismatch: Local development used 384-dimensional Xenova embeddings; production required 1536-dimensional OpenAI embeddings. Solved by re-ingesting all documents with OpenAI embeddings.
- Timeout errors: The default 10-second Vercel timeout was insufficient for complex queries. Increased to 30 seconds via `vercel.json`.
- Streaming protocol: The data stream protocol required explicit configuration in the useChat hook to display tool calls correctly.
| Component | Technology |
|---|---|
| Framework | Next.js 14 (App Router) |
| AI SDK | Vercel AI SDK v4 |
| LLM | OpenAI GPT-4o |
| Embeddings | OpenAI text-embedding-3-small |
| Vector DB | ChromaDB (local) |
| UI | shadcn/ui + Tailwind CSS |
| Deployment | Vercel |
| Language | TypeScript |
✅ Complete and Submission-Ready
All required features implemented:
- Document ingestion and chunking
- Vector search with embeddings
- Hybrid search with reranking
- Agentic tool orchestration
- Streaming chat UI
- Source-cited answers
- Confidence scoring
- Cloud deployment
Built for M&A due diligence document analysis. Submission version v1.0.