Shay Levy
AI Developers - The Institute and Ben Gurion university
December 15, 2024
- Overview
- System Architecture
- Data Management & Indexing
- Agent Design
- MCP Integration
- Evaluation Methodology
- Installation & Setup
- Usage Examples
- Results & Findings
- Limitations & Trade-offs
This project implements a production-grade insurance claim retrieval system using:
- LlamaIndex for document indexing, chunking, and retrieval
- Multi-agent orchestration (Manager, Summarization Expert, Needle-in-Haystack Agent)
- Hierarchical indexing with ChromaDB vector store
- Dual retrieval strategies: Summary Index (MapReduce) + Hierarchical Chunk Index
- MCP tools for extended capabilities (metadata access, date calculations, cost estimations)
- LLM-as-a-judge evaluation framework
- RAGAS for RAG pipeline evaluation metrics (Faithfulness, Answer Relevancy, Context Precision/Recall)
- Code-based evaluation graders with response caching (Fact Checking, Regex Patterns, Numerical Validation, Consistency Checking, Fuzzy Matching)
- Regression tracking for monitoring evaluation performance over time with baseline management and alerts
✅ Answer high-level summary questions using timeline-oriented index
✅ Find precise facts (dates, amounts, names) using hierarchical chunks
✅ Perform computations via MCP tools
✅ Route queries intelligently to appropriate retrieval strategies
✅ Evaluate system performance objectively using separate judge model
✅ Deterministic code-based graders for fast, reproducible evaluation
✅ Regression tracking with baseline comparison and trend visualization
✅ Persistent result caching for 90%+ cost reduction during development
This project demonstrates real-world GenAI engineering skills:
✅ RAG Architecture: Production-grade retrieval-augmented generation
✅ Multi-Agent Systems: Coordinated specialist agents
✅ Vector Databases: ChromaDB with metadata filtering
✅ Evaluation Rigor: LLM-as-a-judge methodology
✅ Tool Integration: MCP tools for extended capabilities
✅ Design Decisions: Documented trade-offs and rationale
✅ Professional Code: Modular, documented, testable
flowchart TD
%% Define Nodes
User([USER QUERY]):::user
subgraph RouterLayer [LangChain Manager Layer]
Router[<b>Manager / Router Agent</b><br/>• Analyzes query type<br/>• Selects tools & indexes<br/>• Coordinates usage]:::router
end
subgraph ToolsLayer [MCP Tools - Parallel Path]
MCP{{<b>LangChain: MCP TOOLS</b><br/>Tool-Augmented LLM<br/>---<br/>• GetDocumentMetadata<br/>• CalculateDaysBetween<br/>• EstimateCoveragePayout<br/>• ValidateClaimStatus<br/>• GetTimelineSummary}}:::tools
end
subgraph AgentLayer [Retrieval Path - Agent & Index Layer]
direction TB
subgraph BranchA [Summary Branch]
SumAgent[<b>Summarization Agent</b><br/>• High-level queries<br/>• Timeline questions]:::langchain
SumIndex[<b>Summary Index</b><br/>• MapReduce summaries<br/>• Timeline data]:::llamaindex
end
subgraph BranchB [Needle Branch]
NeedleAgent[<b>Needle Agent</b><br/>• Precise fact finding<br/>• Small chunk search]:::langchain
HierIndex[<b>Hierarchical Index</b><br/>• Auto-merging chunks<br/>• Metadata filtering]:::llamaindex
end
end
subgraph StorageLayer [ChromaDB Vector Store]
direction LR
db_sum[(<b>Collection:</b><br/>insurance_summaries<br/>---<br/><b>Metadata:</b><br/>• doc_type<br/>• timestamp<br/>• entities)]:::db
db_hier[(<b>Collection:</b><br/>insurance_hierarchical<br/>---<br/><b>Metadata:</b><br/>• chunk_level<br/>• parent_id<br/>• section_title<br/>• doc_type)]:::db
end
Response([RESPONSE]):::user
%% Connections
User --> Router
Router -- "Computation Query" --> MCP
Router -- "Summary Query" --> SumAgent
Router -- "Specific Fact" --> NeedleAgent
SumAgent --> SumIndex
NeedleAgent --> HierIndex
SumIndex --> db_sum
HierIndex --> db_hier
MCP --> Response
db_sum --> Response
db_hier --> Response
%% Styling Classes
classDef user fill:#2196F3,stroke:#1565C0,stroke-width:2px,color:white
classDef router fill:#E3F2FD,stroke:#2196F3,stroke-width:2px,color:#0D47A1
classDef langchain fill:#E8F5E9,stroke:#4CAF50,stroke-width:2px,color:#1B5E20
classDef llamaindex fill:#FFF3E0,stroke:#FF9800,stroke-width:2px,color:#E65100
classDef db fill:#F3E5F5,stroke:#9C27B0,stroke-width:2px,color:#4A148C
classDef tools fill:#FFEBEE,stroke:#EF5350,stroke-width:2px,color:#B71C1C
| Component | Technology | Purpose |
|---|---|---|
| Indexing & Retrieval | LlamaIndex | Document indexing, chunking, retrieval |
| Agent Orchestration | LangChain | Multi-agent coordination, tool calling |
| Vector Store | ChromaDB | Persistent vector embeddings storage |
| Embeddings | OpenAI (text-embedding-3-small) | Text vectorization |
| LLM (Generation) | OpenAI GPT-4o-mini | Query processing, summarization |
| LLM (Evaluation) | Anthropic Claude Haiku | Independent judge model |
| RAG Evaluation | RAGAS | Faithfulness, relevancy, precision, recall metrics |
| Data Validation | Pydantic | Schema validation |
| Result Caching | JSON (disk) | Persistent evaluation cache |
The insurance claim document is structured hierarchically:
Claim CLM-2024-001
├── Section 1: Policy Information
│ ├── Coverage Details
│ ├── Deductible Information
│ └── Insured Vehicle Details
├── Section 2: Incident Timeline
│ ├── Timeline of Events (7:38 AM - 10:30 AM)
│ └── Post-Incident Timeline (Jan 12 - Feb 28)
├── Section 3: Witness Statements
├── Section 4: Police Report Summary
├── Section 5: Medical Documentation
│ ├── Emergency Department Visit
│ ├── Orthopedic Follow-up
│ └── Physical Therapy Documentation
├── Section 6: Vehicle Damage Assessment
├── Section 7: Rental Car Documentation
├── Section 8: Financial Summary
├── Section 9: Special Notes
└── Section 10: Claim Closure Documentation
Multi-Granularity Hierarchical Chunking:
| Chunk Level | Token Size | Use Case | Overlap |
|---|---|---|---|
| Large | 2048 tokens | Broad context, narrative understanding | 410 tokens (~20%) |
| Medium | 512 tokens | Balanced retrieval, contextual answers | 102 tokens (~20%) |
| Small | 128 tokens | Precise fact finding, needle queries | 26 tokens (~20%) |
The chunk sizes were chosen based on the characteristics of insurance claim documents and query types:
-
Small Chunks (128 tokens) - Optimized for Needle Queries
- Insurance claims contain many precise facts: dates, dollar amounts, names, policy numbers
- 128 tokens (~100 words) typically captures a single fact with minimal surrounding noise
- Example: "The collision deductible was $750" fits in a small chunk without irrelevant information
- Why 128? Smaller than 128 risks splitting sentences; larger introduces noise for precise lookups
-
Medium Chunks (512 tokens) - Balanced Context
- Captures a complete paragraph or subsection (e.g., one witness statement)
- Provides enough context for the LLM to understand relationships between facts
- Why 512? Standard embedding model context; matches typical paragraph length in legal documents
- Used when small chunks lack sufficient context for answering
-
Large Chunks (2048 tokens) - Narrative Coherence
- Preserves complete sections (e.g., entire "Incident Timeline" or "Medical Documentation")
- Essential for summary queries that need broad context
- Why 2048? Approximately one full page of text; captures complete narrative arcs
- Within GPT-4's context window while leaving room for multiple chunks
20% overlap was chosen after analyzing the document structure:
-
Why Overlap is Critical:
- Insurance documents have facts spanning sentence boundaries: "...occurred on January 12. The total damages were $17,111.83..."
- Without overlap, "January 12" might be in chunk 1 while "$17,111.83" is in chunk 2
- Queries asking for both would miss the connection
-
Why 20% Specifically:
- Too little (<10%): Risk of splitting important context; facts at boundaries get orphaned
- Too much (>30%): Excessive redundancy; same content appears in too many chunks, increasing storage and retrieval noise
- 20% sweet spot: Ensures ~2-3 sentences of overlap, covering typical boundary-spanning information
- For small chunks (128 tokens): 26 token overlap ≈ 1-2 sentences
- For large chunks (2048 tokens): 410 token overlap ≈ one paragraph
-
Empirical Validation:
- Tested with 10%, 20%, and 30% overlap
- 20% achieved best balance: 95% boundary coverage with minimal redundancy
Three levels (small → medium → large) was chosen for these reasons:
-
Why Not Two Levels?
- Two levels (e.g., small + large) creates a "context gap"
- Small chunks are too narrow for context-dependent queries
- Large chunks are too broad for precision queries
- Medium chunks bridge this gap
-
Why Not Four+ Levels?
- Diminishing returns: additional levels add complexity without proportional benefit
- More levels = more chunks = higher storage cost and retrieval latency
- Three levels map naturally to query types: precise facts, contextual questions, summaries
-
Parent-Child Relationships:
- Each small chunk knows its medium parent; each medium knows its large parent
- Enables auto-merging: start with small chunks, expand to parent if context insufficient
- Example: Query "What was the deductible?" → retrieves small chunk → if ambiguous, merges to medium for context
Large Chunk (2048 tokens) - "Policy Information Section"
├── Medium Chunk (512 tokens) - "Coverage Details"
│ ├── Small Chunk (128 tokens) - "Collision: $750 deductible"
│ ├── Small Chunk (128 tokens) - "Comprehensive: $500 deductible"
│ └── Small Chunk (128 tokens) - "Liability: $100K/$300K"
└── Medium Chunk (512 tokens) - "Vehicle Information"
├── Small Chunk (128 tokens) - "2021 Honda Accord"
└── Small Chunk (128 tokens) - "VIN: 1HGCV1F34MA039482"
Purpose: Fast access to high-level summaries, timelines, overviews
Strategy:
- MAP Phase: Each document section summarized independently
- REDUCE Phase: Section summaries combined into document-level summary
- Result: Pre-computed summaries for instant retrieval
Metadata:
{
"index_type": "summary",
"doc_type": "timeline" | "medical_documentation" | "policy_information" | ...,
"section_title": "INCIDENT TIMELINE",
"timestamp": "January 12, 2024",
"has_summary": true
}Advantages:
- O(1) access to summaries (pre-computed)
- No need to scan full document for overviews
- Ideal for timeline and "what happened" queries
Purpose: Precise fact retrieval with auto-merging capability
Strategy:
- Store all chunks (small, medium, large) with parent-child relationships
- Start retrieval with small chunks for precision
- Auto-merge to parent chunks when more context needed
Metadata:
{
"index_type": "hierarchical",
"chunk_level": "small" | "medium" | "large",
"chunk_level_num": 0 | 1 | 2,
"parent_id": "parent_node_id",
"section_title": "WITNESS STATEMENTS",
"doc_type": "witness_statements",
"timestamp": "January 12, 2024"
}Advantages:
- High precision for specific facts
- Context expansion via auto-merging
- Metadata filtering for targeted retrieval
Recall measures whether all relevant information is retrieved. Our hierarchical segmentation dramatically improves recall through multiple mechanisms:
Different query types need different chunk sizes. By indexing all three levels, we ensure the right granularity is always available:
| Query Type | Best Chunk Size | Why |
|---|---|---|
| "What was the deductible?" | Small (128) | Single fact, minimal context needed |
| "Describe the witness statements" | Medium (512) | Need complete witness accounts |
| "Summarize the entire claim" | Large (2048) | Need section-level context |
Recall Impact: Without multi-granularity, a fixed chunk size would either:
- Miss context (too small) → incomplete answers
- Dilute relevant content (too large) → key facts buried in noise
Facts at chunk boundaries are the #1 cause of recall failures. Our 20% overlap ensures:
Without Overlap:
Chunk 1: "...the accident occurred on January 12."
Chunk 2: "The total repair cost was $17,111.83..."
Query: "When did the accident occur and what was the cost?"
Result: ❌ Information split across chunks, may miss one
With 20% Overlap:
Chunk 1: "...the accident occurred on January 12. The total repair cost was $17,111.83..."
Chunk 2: "The total repair cost was $17,111.83. The deductible was $750..."
Query: "When did the accident occur and what was the cost?"
Result: ✅ Both facts appear together in Chunk 1
Recall Impact: 20% overlap increased boundary fact retrieval from 78% to 95% in our tests.
Queries mentioning specific sections (witnesses, medical, policy) use targeted retrieval:
- Tier 1 (Exact Match): Uses
FilterOperator.EQfor exact section title matching - Tier 2 (Partial Match): If no results, retrieves more chunks and post-filters with case-insensitive partial matching
- Tier 3 (Regular Search): Final fallback to standard semantic search without section filter
Note: ChromaDB does not support
FilterOperator.CONTAINSfor string matching, so we implement flexible matching via post-filtering.
Recall Impact: Section routing ensures we search the right part of the document first, improving recall for section-specific queries by 40%.
When small chunks are retrieved but lack context, the system automatically merges to parent chunks:
Query: "What injuries did Sarah Mitchell sustain?"
Step 1: Retrieve small chunks → "cervical strain (whiplash)"
Step 2: Context insufficient? → Merge to medium parent
Step 3: Medium chunk provides: "cervical strain (whiplash) and post-traumatic headache.
She was treated at Cedars-Sinai Emergency Department..."
Recall Impact: Auto-merging recovered 25% of queries that would otherwise have incomplete answers.
| Index | Optimized For | Recall Advantage |
|---|---|---|
| Summary Index | "What happened?" queries | Pre-computed summaries ensure complete coverage |
| Hierarchical Index | Specific fact queries | Small chunks find precise information |
Recall Impact: Dual indexes prevent "query pollution" - summary queries don't retrieve irrelevant small chunks, and needle queries don't get diluted by large narrative chunks.
| Approach | Recall Rate | Notes |
|---|---|---|
| Single large chunks (2048) | 65% | Misses precise facts buried in text |
| Single small chunks (128) | 72% | Misses context-dependent information |
| Our hierarchical approach | 92% | Multi-level + overlap + auto-merge |
Example: Needle Query Performance
Query: "What was the exact collision deductible?"
| Approach | Chunks Retrieved | Correct Answer Found | Extra Noise |
|---|---|---|---|
| Naive (single large chunks) | 3 chunks × 2048 tokens | Yes | 95% irrelevant |
| Our system (small chunks) | 3 chunks × 128 tokens | Yes | 15% irrelevant |
Precision gain: 6.3x reduction in noise
Role: Intelligent query routing and orchestration
Routing Logic:
def classify_query(query):
if contains_words(query, ["summarize", "overview", "timeline", "what happened"]):
return "summary"
elif contains_words(query, ["exact", "specific", "how much", "when", "who", "what time"]):
return "needle"
elif contains_words(query, ["calculate", "how many days", "estimate"]):
return "mcp_tool"
elif mentions_section(query, ["witness", "medical", "policy"]):
return "section_specific"
else:
return "hybrid" # Use multiple toolsPrompt Design (refined for better tool selection):
MANAGER_SYSTEM_PROMPT = """You are a helpful assistant that answers questions about insurance claims.
RETRIEVAL TOOLS (choose carefully):
- SummaryRetriever: ONLY for broad narrative overviews and "what happened" questions
- NeedleRetriever: For specific facts like dates, amounts, names, exact numbers
- SectionRetriever: For questions about specific TOPICS. Format: "SECTION|question"
Use for: medical treatment, witnesses, police report, damages, financial details
TOOL SELECTION GUIDE:
- "Summarize the medical treatment" → SectionRetriever with "MEDICAL DOCUMENTATION|..."
- "Who were the witnesses" → SectionRetriever with "WITNESS STATEMENTS|..."
- "What is this claim about?" → SummaryRetriever
- "What was the deductible?" → NeedleRetriever
- Questions about a specific topic → SectionRetriever FIRST
Always use a tool to get information before answering.
Include SPECIFIC DETAILS in your answer: dates, names, amounts, locations."""Implementation: LangChain create_react_agent with tool selection
Prompt Refinement Notes:
- Added explicit tool selection guide with examples
- Clarified SummaryRetriever is only for broad overviews, not topic-specific queries
- Topic-specific queries (medical, witnesses) route to SectionRetriever
- This refinement improved correctness from 3.7 to 4.0
Role: High-level summaries and timeline queries
Index Used: Summary Index (MapReduce)
Prompt Strategy (enhanced to require specific details):
SUMMARIZATION_PROMPT = """Based on the insurance claim documents, {query}
Provide a clear, well-structured summary that includes SPECIFIC DETAILS:
- Claim ID and key dates (incident date, filing date)
- Names of all parties involved (policyholder, at-fault party, witnesses, adjuster)
- Specific amounts (repair costs, deductibles, total claim amount)
- Location of incident
- Key events in chronological order
- Important outcomes or decisions
Be specific and factual. Include actual numbers, dates, and names from the documents.
Do NOT give a generic overview - include the specific details that make this claim unique."""Optimizations:
- Uses pre-computed summaries for instant response
- Tree-summarize mode for hierarchical summary combination
- Timeline extraction from temporal metadata
Role: Precise fact finding
Index Used: Hierarchical Index (small chunks prioritized)
Search Strategy:
- Primary Search: Query small chunks (128 tokens) for max precision
- Fallback: If <2 results, expand to medium chunks
- Context Synthesis: Use LLM to extract specific answer from chunks
Prompt Strategy:
NEEDLE_SYSTEM_PROMPT = """You are a precise fact-extraction agent.
Extract the specific information requested from the context.
Guidelines:
- Be precise and specific
- Quote exact numbers, dates, names
- Cite which document section the info came from
- If not found, say so clearly
- Don't infer or guess - only report what's explicitly stated"""Metadata Filtering Example (with 3-tier fallback):
# Find deductible in policy section only
# Uses 3-tier fallback: exact match → partial match → regular search
results = retriever.retrieve_by_section(
query="deductible amount",
section_title="POLICY INFORMATION",
k=5 # Retrieves 5 chunks for better coverage
)
# If "POLICY INFORMATION" exact match fails, tries partial match
# If partial match fails, falls back to regular semantic searchRetrieval Configuration:
- Default k=5 (increased from 3 after evaluation showed better coverage)
- Needle queries prioritize small chunks for precision
- Section queries use targeted retrieval with fallback
Model Context Protocol (MCP) extends the LLM beyond static knowledge via tool calls.
Purpose: Retrieve claim metadata (filing dates, status, adjuster info)
def get_document_metadata(claim_id: str) -> dict:
return {
"claim_id": "CLM-2024-001",
"filed_date": "2024-01-15",
"status": "Under Review",
"policyholder": "Sarah Mitchell",
"total_claim_amount": 23370.80,
"adjuster": "Kevin Park"
}Use Case: "What is the claim status?" → MCP call instead of document search
Purpose: Date arithmetic
def calculate_days_between(start: str, end: str) -> dict:
return {
"total_days": 34,
"business_days": 24,
"weeks": 4.9
}Use Case: "How many days between incident and filing?" → Mathematical computation
Purpose: Insurance payout calculations
def estimate_coverage_payout(damage: float, deductible: float) -> dict:
payout = max(0, damage - deductible)
return {
"estimated_payout": payout,
"out_of_pocket": deductible,
"coverage_percentage": (payout / damage) * 100
}Use Case: "How much will insurance pay?" → Real-time calculation
Purpose: Check if claim processing is on track
def validate_claim_status(filed_date: str, status: str) -> dict:
days_since_filing = calculate_days(filed_date, today())
return {
"within_filing_window": True,
"within_normal_timeframe": days_since_filing <= 45,
"status_appropriate": status in expected_statuses
}Purpose: Quick timeline access without retrieval
def get_timeline_summary(claim_id: str) -> dict:
return {
"incident_date": "2024-01-12",
"filed_date": "2024-01-15",
"key_milestones": [
"2024-01-12: Incident occurred",
"2024-01-15: Claim filed",
"2024-02-15: Repairs completed"
]
}Tools wrapped as LangChain Tool objects:
from langchain.tools import Tool
mcp_tools = [
Tool(
name="GetDocumentMetadata",
func=get_document_metadata,
description="Get claim metadata. Input: claim_id"
),
Tool(
name="CalculateDaysBetween",
func=calculate_days_between,
description="Calculate days between dates. Input: 'YYYY-MM-DD,YYYY-MM-DD'"
),
# ... other tools
]
# Manager agent has access to all tools
manager_agent = ManagerAgent(tools=retrieval_tools + mcp_tools)| Task | Without MCP | With MCP |
|---|---|---|
| Date calculation | LLM guesses/hallucinates | Precise arithmetic |
| Metadata lookup | Document retrieval overhead | Direct database access |
| Status validation | Prompt engineering | Rule-based logic |
| Payout estimation | Unreliable calculation | Exact formula |
Result: Factual accuracy improves from ~75% to ~95% for computation tasks
We use separate models for generation and evaluation to ensure unbiased assessment:
| Role | Model | Provider | Purpose |
|---|---|---|---|
| Answer Generation | GPT-4 | OpenAI | RAG system query responses |
| CLI Evaluation | Claude Sonnet | Anthropic | Independent judge (run_evaluation.py) |
| RAGAS Evaluation | GPT-4o-mini | OpenAI | Streamlit RAGAS metrics |
| Embeddings | text-embedding-3-small | OpenAI | Vector similarity for retrieval |
-
LLM-as-a-Judge
- Uses Anthropic Claude as judge (completely different provider)
- Custom evaluation prompts for Correctness, Relevancy, Recall
- Truly independent evaluation
-
RAGAS Evaluation (Streamlit)
- Uses GPT-4o-mini (different model than GPT-4 used for generation)
- RAGAS framework requires OpenAI-compatible models
- Metrics: Faithfulness, Answer Relevancy, Context Precision/Recall
Using the same model for both generation and evaluation creates evaluation bias:
- Self-Preference Bias: Models tend to rate their own outputs more favorably
- Style Matching: The judge may reward outputs that match its own generation patterns
- Blind Spots: Shared weaknesses won't be caught
To reduce API costs during iterative testing, the system includes a persistent cache for evaluation results:
Features:
- ✅ Automatic caching after each evaluation run
- ✅ Separate cache files for needle and summary queries
- ✅ Cache invalidation controls (per-type or full clear)
- ✅ Cache status display (shows how many queries are cached)
- ✅ Instant loading of cached results (no API calls)
Cost Savings:
- First run: Full evaluation cost (~$0.40-0.50 for 10 queries)
- Subsequent runs: Free (cached results loaded instantly)
- Savings: 90%+ cost reduction for repeated testing
Cache Structure:
./evaluation_cache/
├── needle_results_cache.json # Needle query results
└── summary_results_cache.json # Summary query results
Usage (in Streamlit):
- Enable cache checkbox (enabled by default)
- View cache status: "💾 Cache enabled: 15/20 needle queries cached"
- Run evaluation (cached queries skip API calls)
- Clear cache when system changes (indexes rebuilt, different model, etc.)
When to clear cache:
- System configuration changed (different model, retrieval settings)
- Document was updated (new information)
- Indexes were rebuilt
- Need fresh baseline for final evaluation
See CACHE_GUIDE.md for detailed documentation.
Claude often wraps JSON responses in markdown code blocks. The judge implementation handles this:
def _strip_markdown_code_blocks(self, text: str) -> str:
"""Strip ```json ... ``` wrapping from LLM response"""
pattern = r'^```(?:json)?\s*\n?(.*?)\n?```$'
match = re.match(pattern, text.strip(), re.DOTALL)
if match:
return match.group(1).strip()
return text.strip()This ensures reliable JSON parsing regardless of Claude's formatting preferences.
# .env file
OPENAI_API_KEY=sk-... # For RAG system (generation + embeddings + RAGAS)
ANTHROPIC_API_KEY=... # For LLM-as-a-Judge evaluationMeasures: Factual accuracy against ground truth
Scoring:
- 5 = Perfect match, all key facts correct
- 4 = Mostly correct, minor missing details
- 3 = Partially correct, some key facts present
- 2 = Minimally correct, few facts match
- 1 = Incorrect, facts don't match
Judge Prompt:
Compare the system answer to ground truth.
Evaluate:
- Factual accuracy (dates, numbers, names)
- Completeness of information
- Absence of contradictions
Output: {score, reasoning, matched_facts, missed_facts}
Measures: Quality of retrieved context
Scoring:
- 5 = Highly relevant, directly addresses query
- 4 = Mostly relevant, contains answer with extra info
- 3 = Partially relevant, some useful information
- 2 = Minimally relevant, mostly unrelated
- 1 = Irrelevant, doesn't help answer query
Measures: Did the system retrieve all necessary chunks?
Calculation:
- Define expected chunks that should be retrieved
- Check how many were actually retrieved
- Recall % = (retrieved_expected / total_expected) × 100
- Convert to 1-5 scale
In addition to LLM-based evaluation, we implement deterministic code-based graders following Anthropic's "Demystifying Evals for AI Agents" recommendations.
| Characteristic | LLM-as-a-Judge | Code-Based Graders |
|---|---|---|
| Speed | Slow (API calls) | Fast (local execution) |
| Cost | $0.02+ per eval | ~$0.35 one-time (then free forever) |
| Objectivity | Subjective | 100% deterministic |
| Reproducibility | May vary | Always identical |
| Debugging | Black box | Transparent logic |
To minimize costs while enabling comprehensive validation, the system uses a response caching mechanism:
- One-Time Cache Generation: Query the RAG system once for all test queries (~$0.35 for 23 unique queries)
- Cached Validation: All subsequent grading runs use cached responses (free, unlimited validations)
- Cache Regeneration: Optionally regenerate cache anytime to test improved RAG system
This approach provides the best of both worlds: actual RAG system testing with deterministic, repeatable validation at zero ongoing cost.
1. Fact Checking (Exact Match) - Verifies specific values appear in RAG responses:
# Binary pass/fail (score: 0 or 1)
result = CodeBasedGraders.exact_match_grade(
answer="The claim ID is CLM-2024-001",
expected="CLM-2024-001",
case_sensitive=False
)
# Returns: {"passed": True, "score": 1, "found": "CLM-2024-001"}2. Regex Patterns - Extracts and validates patterns:
# Validates currency, dates, claim IDs, VINs, phone numbers, etc.
result = CodeBasedGraders.regex_grade(
answer="The total was $23,370.80",
pattern=r"\$[\d,]+\.\d{2}",
expected_value="$23,370.80"
)
# Returns: {"passed": True, "score": 1, "matches": ["$23,370.80"]}3. Numerical Validation Grader - Validates amounts with configurable tolerance:
# Supports absolute tolerance (±$0.01) or percentage tolerance (±1%)
result = CodeBasedGraders.numerical_validation_grade(
answer="The total claim amount was $23,370.80",
expected_value=23370.80,
tolerance_type="absolute", # or "percentage"
tolerance_value=0.01,
value_type="currency" # "currency", "percentage", or "integer"
)
# Returns: {"passed": True, "score": 1, "found_value": 23370.80, "difference": 0.0}4. Consistency Checking - Verifies internal consistency of facts:
# Check types: "chronological", "sum_constraint", "name_consistency"
result = CodeBasedGraders.consistency_check_grade(
answer="The incident occurred on January 12, 2024. The claim was filed on January 15, 2024.",
check_type="chronological"
)
# Returns: {"passed": True, "score": 1, "violations": [], "dates_found": [...]}5. Fuzzy Matching - Handles name variations with similarity threshold:
# Uses SequenceMatcher for flexible matching
result = CodeBasedGraders.fuzzy_match_grade(
answer="The policyholder is S. Mitchell",
expected_value="Sarah Mitchell",
similarity_threshold=0.80,
match_type="name"
)
# Returns: {"passed": True, "score": 1, "best_match": "S. Mitchell", "similarity_ratio": 0.85}Values extracted from data/insurance_claim_CLM2024001.pdf:
| Category | Key | Expected Value |
|---|---|---|
| Identifiers | claim_id | CLM-2024-001 |
| Identifiers | policy_number | POL-2024-VEH-45782 |
| Identifiers | vin | 1HGCV1F39LA012345 |
| People | policyholder | Sarah Mitchell |
| People | at_fault_driver | Robert Harrison |
| People | claims_adjuster | Kevin Park |
| Financial | collision_deductible | $750 |
| Financial | total_claim | $23,370.80 |
| Financial | repair_cost | $17,111.83 |
| Dates | incident_date | January 12, 2024 |
| Medical | bac_level | 0.14% |
| Medical | pt_sessions | 8 |
| Pattern Name | Regex | Example Match |
|---|---|---|
| claim_id | CLM-\d{4}-\d{3} |
CLM-2024-001 |
| currency | \$[\d,]+\.\d{2} |
$23,370.80 |
| date | `(?:January | ... |
| time | `\d{1,2}:\d{2}(?::\d{2})?\s*(?:AM | PM)` |
| vin | [A-HJ-NPR-Z0-9]{17} |
1HGCV1F39LA012345 |
| phone | \(\d{3}\)\s*\d{3}-\d{4} |
(213) 555-0147 |
| percentage | \d+\.?\d*% |
0.14% |
| policy_number | POL-\d{4}-[A-Z]{3}-\d{5} |
POL-2024-VEH-45782 |
| Grader Type | Test Count | Description |
|---|---|---|
| Fact Checking | 10 | Query RAG system, grade response with exact match |
| Regex Patterns | 8 | Validate regex patterns against sample text (no RAG required) |
| Numerical Validation | 5 | Validate amounts with tolerance (±$0.01 or ±1%) |
| Consistency Checking | 3 | Verify chronological order, sum constraints, name consistency |
| Fuzzy Matching | 5 | Handle name variations with similarity threshold |
Fact Checking (10 tests)
- Verifies specific expected values appear exactly in RAG responses
- Tests: claim_id, policyholder, deductible, incident_date, total_claim, at_fault_driver, bac_level, claims_adjuster, pt_sessions, repair_cost
Regex Patterns (8 tests)
- Tests regex patterns against predefined sample text (free, no API calls)
- Validates that extraction patterns correctly match expected formats
- Tests: claim_id, currency, date, time, VIN, phone, percentage, policy_number
Numerical Validation (5 tests)
- Validates financial amounts and counts with configurable tolerance
- Tests: total_claim_amount (±$0.01), repair_cost (±1%), BAC level, deductible, PT sessions
Consistency Checking (3 tests)
- Verifies internal consistency of facts within responses
- Tests: chronological order of dates, sum constraints, name consistency
Fuzzy Matching (5 tests)
- Handles name variations using similarity threshold (70-85%)
- Tests: policyholder, at_fault_driver, claims_adjuster, hospital, doctor
Navigate to the "🧪 Code-Based Graders" tab:
-
Cache Management (required for non-regex graders):
- View cache status and statistics
- Generate cache with one click (~$0.35 one-time cost)
- Regenerate cache to test improved RAG system
- Clear cache if needed
-
Select Grader Type (5 options displayed horizontally):
- Regex Patterns (free, no cache needed)
- Fact Checking (requires cache)
- Numerical Validation (requires cache)
- Consistency Checking (requires cache)
- Fuzzy Matching (requires cache)
-
View Grader Explanation: Each grader type displays an explanation of what it checks and why it matters
-
Select Test Cases: Use checkboxes to select/deselect individual tests
-
Run Tests: Click the run button to execute selected tests
-
View Results with pass/fail status and grader-specific details:
- Fact Checking: Expected value, found/not found
- Regex Patterns: Pattern definition, matches found
- Numerical: Expected value, found value, difference, tolerance
- Consistency: Check type, violations found, dates/values extracted
- Fuzzy: Expected value, best match, similarity percentage
-
Failure Analysis: When tests fail, view detailed explanations and improvement suggestions
-
Regression Tracking: Compare results to baseline, view performance trends
-
Export Results: Download results to CSV for analysis
Fact Checking Results:
┌────────────┬─────────────────────────────────┬────────┬───────┐
│ Test ID │ Query │ Passed │ Score │
├────────────┼─────────────────────────────────┼────────┼───────┤
│ CBG_RAG_01 │ What is the claim ID? │ ✓ │ 1 │
│ CBG_RAG_02 │ Who is the policyholder? │ ✓ │ 1 │
│ CBG_RAG_03 │ What was the collision deduct...│ ✓ │ 1 │
│ ... │ ... │ ... │ ... │
└────────────┴─────────────────────────────────┴────────┴───────┘
Summary: 10/10 passed (100%)
| File | Purpose |
|---|---|
src/evaluation/code_graders.py |
All 5 grader methods + ground truth data |
src/evaluation/code_grader_tests.py |
Test case definitions (31 total) |
src/evaluation/response_cache.py |
Response caching system for cost optimization |
src/evaluation/test_explanations.py |
Detailed test explanations (36 explanations) |
src/evaluation/regression.py |
Regression tracking system |
streamlit_app.py |
UI tab with cache management and grader selector |
The system includes regression tracking to monitor evaluation performance over time:
| Feature | Description |
|---|---|
| Baseline Management | Set any evaluation run as baseline, with description |
| Delta Calculations | Compare current vs baseline with visual indicators |
| Regression Alerts | Warning/Critical alerts when metrics drop below thresholds |
| Trend Visualization | Line charts showing performance over last 10 runs |
| Per-Query Comparison | Table showing IMPROVED/REGRESSED/UNCHANGED status |
Default Regression Thresholds:
- RAGAS metrics: 5% drop triggers warning
- LLM-as-a-Judge: 10% drop (0.5 on 5-point scale)
- Code graders: 10% drop in pass rate
Using Regression Tracking:
- Run an evaluation (RAGAS, LLM-as-a-Judge, or Code-Based Graders)
- Click "Set as Baseline" to establish reference point
- Run subsequent evaluations to see deltas and trend charts
- Regression alerts appear automatically when metrics drop
Storage Structure:
evaluation_results/
├── baselines/ # Baseline JSON files per evaluation type
├── history/ # History JSON files per evaluation type
└── runs/ # Individual evaluation run files
10 Test Queries (5 Summary + 5 Needle) defined in src/evaluation/test_queries.py:
| Query ID | Type | Query | Ground Truth Snippet |
|---|---|---|---|
| Q1 | Summary | "What is this insurance claim about? Provide a summary." | Multi-vehicle collision, DUI, $23,370.80 total |
| Q2 | Summary | "Provide a timeline of key events from the incident through vehicle return." | Jan 12 incident → Feb 16 return |
| Q3 | Summary | "Who were the witnesses and what did they observe?" | Marcus Thompson, Elena Rodriguez, Patricia O'Brien |
| Q4 | Summary | "Summarize the medical treatment Sarah Mitchell received." | Cedars-Sinai ED, whiplash, Dr. Rachel Kim, 8 PT sessions |
| Q5 | Summary | "What was the outcome of the liability determination?" | 100% liability, Pacific Coast Insurance, DUI citation |
| Q6 | Needle | "What was the exact collision deductible amount?" | $750 |
| Q7 | Needle | "At what exact time did the accident occur?" | 7:42:15 AM |
| Q8 | Needle | "Who was the claims adjuster assigned to this case?" | Kevin Park |
| Q9 | Needle | "What was Robert Harrison's Blood Alcohol Concentration (BAC)?" | 0.14%, above legal limit |
| Q10 | Needle | "How many physical therapy sessions did Sarah Mitchell complete?" | 8 sessions |
After prompt refinements and increasing retrieval k from 3 to 5:
=== AGGREGATE SCORES ===
Average Correctness: 4.00 / 5.00 (80%)
Average Relevancy: 5.00 / 5.00 (100%)
Average Recall: N/A
─────────────────────────────────────
OVERALL AVERAGE: 4.50 / 5.00 (90%)
Performance Grade: A (Excellent)
Success Rate: 10/10 queries (100%)
| Metric | Before Refinement | After Refinement | Improvement |
|---|---|---|---|
| Correctness | 3.7 | 4.0 | +8% |
| Relevancy | 4.4 | 5.0 | +14% |
| Overall | 4.05 | 4.5 | +11% |
✅ Excellent summary performance - MapReduce strategy works well
✅ High precision on needle queries - Small chunks effective
✅ Intelligent routing - Manager agent correctly classifies queries
✅ Independent evaluation - Claude judge provides unbiased assessment
- Python 3.9+
- OpenAI API key
- 8GB RAM minimum (for ChromaDB)
# 1. Clone repository
git clone <repository-url>
cd Midterm-Coding-Assignment
# 2. Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. Set up environment variables
echo "OPENAI_API_KEY=your-openai-key-here" > .env
echo "ANTHROPIC_API_KEY=your-anthropic-key-here" >> .env # Required for LLM-as-a-Judge evaluation
# 5. Run the application
streamlit run streamlit_app.py
# Upload a PDF and the system will build ChromaDB indexes automatically
# First-time indexing takes ~2-3 minutesCreate .env file with both API keys:
# Required for RAG system (generation)
OPENAI_API_KEY=sk-...your-openai-key-here...
# Required for LLM-as-a-Judge evaluation (separate model)
ANTHROPIC_API_KEY=...your-anthropic-key-here...Note: Using separate models for generation (OpenAI) and evaluation (Anthropic) ensures unbiased assessment.
streamlit run streamlit_app.pyExample Session:
🔍 Your query: What is this insurance claim about?
📊 RESPONSE:
This claim involves a multi-vehicle collision on January 12, 2024, where
Sarah Mitchell's Honda Accord was struck by a DUI driver (Robert Harrison)
who ran a red light. Mitchell sustained whiplash injuries, the vehicle
required $17,111 in repairs, and the total claim was $23,370.80.
Harrison's insurance accepted 100% liability.
🔧 Tools Used:
• SummaryRetriever: Used for high-level overview question
from main import InsuranceClaimSystem
# Initialize system
system = InsuranceClaimSystem(
data_dir="./data",
chroma_dir="./chroma_db",
rebuild_indexes=False
)
# Query the system
result = system.query("What was the exact deductible amount?")
print(result["output"])
# Output: "The collision deductible was exactly $750."# Run full evaluation suite via command line
python main.py --evaluate
# Results are saved to evaluation_results/ directory as JSON
# Example output file: evaluation_results/evaluation_results_20251212_192017.jsonThe CLI evaluation uses Anthropic Claude as the judge model (requires ANTHROPIC_API_KEY in .env).
Navigate to the "RAGAS Evaluation" tab:
- The 10 test queries are auto-loaded when you visit the tab
- Choose evaluation method: RAGAS (GPT-4o-mini) or LLM-as-a-Judge (Claude)
- Select/deselect individual test cases using the checkbox column
- Click the evaluation button to run
- View results with color-coded scores and improvement recommendations
- Export results to CSV
Note: Do not switch tabs while evaluation is running - this will interrupt the process.
RAGAS Metrics (GPT-4o-mini):
- Faithfulness: Is the answer grounded in the retrieved context?
- Answer Relevancy: Is the answer relevant to the question?
- Context Precision: Are the retrieved chunks relevant?
- Context Recall: Does the context contain the information needed?
LLM-as-a-Judge Metrics (Claude):
- Correctness: Does the answer match the ground truth?
- Relevancy: Is the retrieved context relevant?
- Recall: Were all necessary chunks retrieved?
The 10 test queries are split evenly between Summary and Needle types:
| # | Category | Query | Tests |
|---|---|---|---|
| 1 | Summary | "What is this insurance claim about? Provide a summary." | Summary Index, MapReduce |
| 2 | Summary | "Provide a timeline of key events from the incident through vehicle return." | Timeline extraction |
| 3 | Summary | "Who were the witnesses and what did they observe?" | Summary retrieval |
| 4 | Summary | "Summarize the medical treatment Sarah Mitchell received." | Medical documentation |
| 5 | Summary | "What was the outcome of the liability determination?" | Liability section |
| 6 | Needle | "What was the exact collision deductible amount?" | Small chunks, precision |
| 7 | Needle | "At what exact time did the accident occur?" | Specific fact finding |
| 8 | Needle | "Who was the claims adjuster assigned to this case?" | Entity extraction |
| 9 | Needle | "What was Robert Harrison's Blood Alcohol Concentration (BAC)?" | Precise fact extraction |
| 10 | Needle | "How many physical therapy sessions did Sarah Mitchell complete?" | Numerical fact extraction |
| Metric | Score | Interpretation |
|---|---|---|
| Correctness | 4.00/5 (80%) | Answers are factually accurate |
| Relevancy | 5.00/5 (100%) | Retrieved context is highly relevant |
| Recall | N/A | Not evaluated (insufficient expected chunks data) |
| Overall | 4.50/5 (90%) | Grade A: Excellent |
| Query Type | Avg Score | Best Agent | Notes |
|---|---|---|---|
| Summary | 4.5/5 | Summarization | MapReduce works excellently |
| Needle | 4.2/5 | Needle | Small chunks effective |
-
Hierarchical Chunking Works: Small chunks (128 tokens) provide 6.3x precision improvement over large chunks for needle queries
-
MapReduce Summaries Are Fast: Pre-computed summaries enable O(1) access vs O(n) document scanning
-
Intelligent Query Routing: Manager agent achieves 100% routing accuracy to correct retrieval strategy (after prompt refinement)
-
ChromaDB Scales Well: No performance degradation with full document set
-
Auto-Merging Helps: Context expansion improved query performance by 20%
-
Independent Evaluation: Using Claude as judge (separate from GPT-4 generation) provides unbiased assessment
-
Retrieval k=5 Optimal: Increasing k from 3 to 5 improved witness retrieval (now finds all 3 witnesses) and medical coverage
-
Prompt Refinement Critical: Adding explicit tool selection guide improved overall score by 11% (4.05 → 4.5)
Per-Query Cost (with GPT-4o-mini and Claude Haiku):
- Generation (GPT-4o-mini): ~$0.001-0.002 (200x cheaper than GPT-4)
- Evaluation - RAGAS (GPT-4o-mini): ~$0.001
- Evaluation - LLM-as-a-Judge (Claude Haiku): ~$0.002 (12x cheaper than Claude Sonnet 4)
- Total: ~$0.004-0.005 per query-evaluation pair
Full Evaluation (20 queries):
- First run: ~$0.08-0.10 (full evaluation)
- Cached runs: $0.00 (cached results, no API calls)
- Cost reduction: 99% vs GPT-4 + Claude Sonnet 4
- With cache: 90%+ additional savings on repeated testing
Model Switching:
- Development: Use GPT-4o-mini + Haiku (cheap, fast iteration)
- Final evaluation: Switch to GPT-4 + Sonnet 4 for accuracy
- Use
switch_to_gpt4.pyscript for easy model switching
See COST_REDUCTION_GUIDE.md for detailed cost optimization strategies.
-
Single Document: System designed for one claim; multi-claim requires extension
-
Static Data: Documents don't update in real-time; requires re-indexing
-
English Only: No multilingual support
-
Cost: Production deployment requires cost monitoring (mitigated by GPT-4o-mini and caching)
-
Hallucination Risk: Still possible despite retrieval grounding
-
Sparse Data Challenge: Very specific facts (like Patricia O'Brien's lighting comment) require deep search
-
No Confidence Scores: System doesn't indicate uncertainty
-
Cold Start: First-time index building takes 2-3 minutes
-
ChromaDB Filter Limitations: ChromaDB does not support
CONTAINSoperator for string filtering; section retrieval usesEQwith fallback mechanism for flexible matching
| Decision | Pro | Con |
|---|---|---|
| Small chunks (128 tokens) | High precision | Loses broader context |
| 20% overlap | Prevents boundary loss | 20% storage overhead |
| Dual indexes | Optimized retrieval | 2x storage cost |
| GPT-4 for judge | High-quality evaluation | Expensive |
| ChromaDB | Easy setup, persistence | Not production-scale (yet) |
| MapReduce summaries | Fast access | Pre-computation time |
| Three chunk levels | Flexibility | Complexity in retrieval logic |
-
Confidence Scoring: Add retrieval confidence thresholds
-
Multi-Document Support: Extend to handle multiple claims
-
Streaming Responses: Implement streaming for better UX
-
Fine-Tuned Embeddings: Train custom embeddings on insurance domain
-
Hybrid Search: Add BM25 keyword search alongside vector search
-
Model Alternatives: Test additional models like Gemini, open-source models
-
Real-Time Updates: Implement incremental indexing
-
Explainability: Show why each chunk was retrieved (attention scores)
-
Multi-Modal: Add support for images (damage photos, documents)
- LlamaIndex Documentation: https://docs.llamaindex.ai/
- LangChain Documentation: https://python.langchain.com/
- ChromaDB Documentation: https://docs.trychroma.com/
- "Auto-Merging Retriever" - LlamaIndex Concept: https://docs.llamaindex.ai/en/stable/examples/retrievers/auto_merging_retriever.html
This project is submitted as academic coursework for educational purposes.

