Version: 2.0
Last Updated: January 2026
Author: Vikas Sahani (Product Lead)
Engineering Team: Kiro (AI Co-Engineering Assistant), Antigravity (AI Co-Assistant)
- Architecture Overview
- System Components
- Data Flow
- Component Details
- Integration Points
- Performance & Scalability
- Security & Privacy
SaralPolicy is a privacy-first, locally-run AI system for analyzing Indian insurance policy documents. The architecture is designed around the principle of zero cloud dependencies and complete user data privacy.
✅ Privacy-First: All processing happens locally on the user's machine
✅ Modular Architecture: Loosely coupled services with dependency injection
✅ Offline Capable: No internet required for core functionality
✅ POC/Demo Ready: Built-in guardrails, evaluation, and HITL workflows
✅ Regulatory Compliance: IRDAI knowledge base integration
✅ OSS-First: Local-first open source frameworks (RAGAS, Huey, OpenTelemetry)
```mermaid
graph TB
    subgraph FE["FRONTEND LAYER"]
        UI["Material 3 Web UI"]
    end
    subgraph API_LAYER["API LAYER"]
        API["FastAPI Gateway"]
    end
    subgraph CORE["CORE SERVICES"]
        DOC["Document Service"]
        POLICY["Policy Service"]
        RAG["RAG Service"]
        LLM["Ollama LLM"]
    end
    subgraph STORAGE["STORAGE"]
        CHROMA["ChromaDB"]
        IRDAI["IRDAI Knowledge"]
    end
    subgraph SAFETY["SAFETY & QUALITY"]
        GUARD["Guardrails"]
        EVAL["Evaluation"]
        HITL["HITL Services"]
    end
    subgraph AUX["AUXILIARY SERVICES"]
        TTS["Text-to-Speech"]
        TRANS["Translation"]
    end
    UI -->|Upload| API
    API -->|Parse| DOC
    DOC -->|Text| API
    UI -->|Analyze| API
    API -->|Orchestrate| POLICY
    POLICY -->|Retrieve| RAG
    RAG -->|Query| CHROMA
    IRDAI -->|Knowledge| CHROMA
    POLICY -->|Generate| LLM
    LLM -->|Response| POLICY
    POLICY -->|Result| API
    API -->|Display| UI
    POLICY -->|Validate| GUARD
    POLICY -->|Evaluate| EVAL
    EVAL -->|Review| HITL
    HITL -->|Verified| API
    API -->|Audio| TTS
    API -->|Translate| TRANS
    TTS -->|Audio| UI
    TRANS -->|Hindi| UI
```
```
User → Frontend UI → FastAPI → Document Processor → Text Extraction
                                        ↓
                               Embedding Generation
                                        ↓
                            Vector Storage (ChromaDB)
```
Processing Steps:
- Upload: User drags PDF/DOCX to web interface
- Validation: File type, size, and content checks via Guardrails
- Extraction: Parallel text extraction using PyPDF2 (multi-threaded)
- Chunking: Intelligent text chunking for optimal RAG performance
- Embedding: Generate embeddings via Ollama's nomic-embed-text
- Storage: Store vectors + metadata in ChromaDB persistent storage
Performance Optimization:
- ✅ MD5-based document caching (avoid reprocessing)
- ✅ Parallel PDF page extraction (4-worker ThreadPoolExecutor)
- ✅ Batch embedding generation
- ✅ Optimized chunking with list comprehensions
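The chunking step above can be sketched as a sliding window with overlap; the default sizes here are illustrative assumptions, not the values SaralPolicy actually ships with:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks for RAG indexing.

    Overlap preserves context that would otherwise be cut at chunk
    boundaries (e.g. a policy clause split mid-sentence). Built with a
    list comprehension, matching the optimization noted above.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, len(text), step)
            if text[i:i + chunk_size].strip()]
```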
```
Uploaded Document → RAG Service → Hybrid Search (BM25 + Vector)
                         ↓
          Context Retrieval from IRDAI KB
                         ↓
          Prompt Engineering with Context
                         ↓
           LLM Generation (gemma2:2b)
                         ↓
           Evaluation & Quality Check
                         ↓
High Confidence → User | Low Confidence → HITL Review
```
Key Features:
- Hybrid Search: Combines keyword (BM25) + semantic (vector) search
- Context Augmentation: IRDAI knowledge base pre-indexed (39 regulatory chunks)
- Quality Control: TruLens, Giskard, DeepEval metrics
- Human Oversight: Automatic flagging for expert review
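One common way to fuse keyword and semantic scores is min-max normalization followed by a weighted sum. This is a hedged sketch of that idea; the weighting and normalization are illustrative assumptions, not SaralPolicy's actual fusion logic:

```python
def fuse_scores(bm25_scores: dict[str, float],
                vector_scores: dict[str, float],
                alpha: float = 0.5) -> list[tuple[str, float]]:
    """Min-max normalise each score set, then take a weighted sum.

    `alpha` weights the semantic (vector) side; `1 - alpha` the keyword
    (BM25) side. Returns documents ranked best-first.
    """
    def normalise(scores: dict[str, float]) -> dict[str, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero on flat scores
        return {doc: (s - lo) / span for doc, s in scores.items()}

    bm25_n, vec_n = normalise(bm25_scores), normalise(vector_scores)
    fused = {doc: alpha * vec_n.get(doc, 0.0) + (1 - alpha) * bm25_n.get(doc, 0.0)
             for doc in set(bm25_n) | set(vec_n)}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```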
```
User Question → API → Guardrails (Input Validation)
                 ↓
        RAG Query (Hybrid Search)
                 ↓
   Context from Document + IRDAI KB
                 ↓
  LLM Generation (Contextual Answer)
                 ↓
     PII Redaction & Safety Check
                 ↓
    Response + Sources → User
```
Optimizations:
- ✅ Query caching (MD5-based keys)
- ✅ Connection pooling for Ollama API
- ✅ Persistent ChromaDB sessions
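The MD5-based query cache can be sketched as below. This is an in-memory illustration under assumed normalization rules (strip + lowercase), not the service's actual cache implementation:

```python
import hashlib


class QueryCache:
    """In-memory cache keyed by the MD5 hex digest of the normalised query."""

    def __init__(self) -> None:
        self._store: dict[str, object] = {}

    @staticmethod
    def key(query: str) -> str:
        # Normalise so trivially different phrasings hit the same entry.
        return hashlib.md5(query.strip().lower().encode("utf-8")).hexdigest()

    def get(self, query: str):
        return self._store.get(self.key(query))

    def put(self, query: str, result) -> None:
        self._store[self.key(query)] = result
```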
Technology: Material 3 Design, HTML5, CSS3, JavaScript
Features:
- Drag-and-drop file upload
- Real-time analysis progress indicators
- Interactive Q&A chat interface
- Audio playback for TTS summaries
- Dark mode support
- Print-friendly policy views
File: backend/templates/index.html, backend/static/
Technology: FastAPI (Python 3.10+)
Key Endpoints:
| Endpoint | Method | Purpose |
|---|---|---|
| `/` | GET | Serve frontend UI |
| `/upload` | POST | Upload policy document |
| `/analyze` | POST | Analyze uploaded policy |
| `/rag/ask` | POST | Ask question via RAG |
| `/rag/stats` | GET | RAG service statistics |
| `/tts` | POST | Generate audio summary |
Features:
- CORS middleware for cross-origin requests
- Session management for multi-user support
- Structured logging (structlog)
- Performance metrics tracking
File: backend/main.py
Purpose: Extract text from PDF, DOCX, TXT files
Optimizations:
- Parallel PDF page processing (ThreadPoolExecutor)
- MD5-based file caching
- Memory-efficient streaming for large files
File: backend/app/services/document_service.py (DocumentService class)
Supported Formats:
- ✅ PDF (via PyPDF2)
- ✅ DOCX (via python-docx)
- ✅ TXT (native Python)
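The parallel page extraction can be sketched as follows. The `extract_fn` parameter is a stand-in for a per-page extractor (e.g. PyPDF2's `page.extract_text()`); the wrapper itself is an assumption, not the actual `DocumentService` code:

```python
from concurrent.futures import ThreadPoolExecutor


def extract_pages_parallel(pages, extract_fn, max_workers: int = 4) -> str:
    """Extract text from pages concurrently, preserving page order.

    Threads help here because per-page extraction is largely I/O- and
    C-level work; `map` returns results in input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        texts = list(pool.map(extract_fn, pages))
    return "\n".join(texts)
```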
Purpose: Orchestrates analysis, RAG, and response generation
File: backend/app/services/policy_service.py
Key Features:
- Centralized business logic
- Integrates RAG, LLM, and Guardrails with confidence overrides
- Generates rich citation metadata
Purpose: Retrieval-Augmented Generation with hybrid search
Technology: ChromaDB + BM25 + Ollama embeddings
Key Features:
- Hybrid Search: Combines BM25 (keyword) + Vector (semantic) search
- Batch Processing: Parallel embedding generation with caching
- Connection Pooling: Persistent HTTP sessions for Ollama
- Query Caching: MD5-based cache for repeated queries
File: backend/app/services/rag_service.py
Methods:

```python
index_document(text, metadata)                # Index document chunks
hybrid_search(query, collection_name, top_k)  # Search both BM25 + Vector
get_embeddings(texts)                         # Batch embedding with cache
get_stats()                                   # Service statistics
```

Purpose: Local LLM inference using gemma2:2b
Model: gemma2:2b (2 billion parameters)
Configuration:
- Temperature: 0.3 (low, for near-deterministic output)
- Context Window: 4096 tokens
- Max Tokens: 1500 output
- Streaming: Disabled (batch processing)
Privacy Guarantee:
- ✅ 100% local inference
- ✅ No API keys required
- ✅ No data sent to cloud services
File: backend/app/services/ollama_llm_service.py
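A request to Ollama's local HTTP API with the settings above might look like this sketch. The endpoint and option names follow Ollama's public API; the wrapper functions themselves are illustrative, not SaralPolicy's actual client:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str) -> dict:
    """Assemble a generate request matching the configuration above."""
    return {
        "model": "gemma2:2b",
        "prompt": prompt,
        "stream": False,          # batch response, no token streaming
        "options": {
            "temperature": 0.3,   # low temperature for stable answers
            "num_ctx": 4096,      # context window
            "num_predict": 1500,  # max output tokens
        },
    }


def generate(prompt: str) -> str:
    """POST to the local Ollama server (requires `ollama serve` running)."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```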
Purpose: Persistent vector storage for embeddings
Location: backend/data/chroma/
Collections:
- `policy_documents` - Uploaded policy chunks
- `irdai_knowledge_base` - Pre-indexed regulatory content
Metadata Schema:
```json
{
  "chunk_id": "string",
  "source": "filename.pdf",
  "chunk_index": 0,
  "type": "policy_section",
  "timestamp": "2025-10-07T12:00:00"
}
```

Purpose: Regulatory compliance context
Location: backend/data/irdai_knowledge/
Content:
- `IRDAI_Master_Circular_Health_2024.txt` (Health insurance regulations)
- `IRDAI_Protection_of_Policyholders_Interests.txt` (Consumer rights)
- `Insurance_Guidelines_Terms_Definitions.txt` (Standard terminology)
Statistics:
- 39 indexed chunks
- Pre-embedded and ready for queries
- Automatically loaded on service startup
Purpose: Lexical matching for exact term searches
Library: rank-bm25
Use Cases:
- Policy number lookups
- Specific clause references
- Exact terminology searches
Purpose: Semantic similarity matching
Embedding Model: nomic-embed-text (274MB via Ollama)
Use Cases:
- Conceptual queries ("What is covered for accidents?")
- Cross-language understanding
- Paraphrase detection
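Under the hood, semantic matching reduces to comparing embedding vectors, typically with cosine similarity. This stdlib sketch shows the score itself (the embedding model supplies the vectors):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors.

    1.0 means identical direction (near-paraphrases); values near 0.0
    indicate unrelated content.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```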
Purpose: Input validation, PII protection, hallucination prevention
File: backend/app/services/guardrails_service.py
Checks:
- ✅ PII redaction (names, phone, Aadhaar, PAN)
- ✅ Input sanitization (SQL injection, XSS)
- ✅ File size and type validation
- ✅ Prompt injection detection
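PII redaction for Indian identifiers can be sketched with regular expressions. These patterns are illustrative assumptions only; production redaction needs more (e.g. Aadhaar's Verhoeff check digit, name detection):

```python
import re

# Illustrative patterns for Indian PII formats.
PII_PATTERNS = [
    (re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b"), "[AADHAAR]"),  # 12-digit Aadhaar
    (re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"), "[PAN]"),         # PAN: AAAAA9999A
    (re.compile(r"\b[6-9]\d{9}\b"), "[PHONE]"),               # Indian mobile number
]


def redact_pii(text: str) -> str:
    """Replace recognised PII spans with placeholder tokens."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text
```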
Purpose: Quality metrics for LLM outputs
File: backend/app/services/evaluation.py, backend/app/services/rag_evaluation_service.py
Primary Framework: RAGAS (2026-01-03)
- License: Apache 2.0
- GitHub: https://github.com/explodinggradients/ragas (7k+ stars)
- Metrics:
- Faithfulness (hallucination detection)
- Answer Relevancy
- Context Precision
- Context Recall (with ground truth)
Fallback: Heuristic-based evaluation when RAGAS not installed
Thresholds:
- High Confidence: Faithfulness ≥ 0.7
- Hallucination Risk: Faithfulness < 0.7
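The threshold routing above amounts to a one-line decision; the function and label names here are assumed for illustration:

```python
FAITHFULNESS_THRESHOLD = 0.7


def route_result(faithfulness: float) -> str:
    """Route an analysis by its faithfulness score.

    Scores at or above the threshold go straight to the user;
    anything below is flagged for human (HITL) review.
    """
    return "user" if faithfulness >= FAITHFULNESS_THRESHOLD else "hitl_review"
```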
Installation (Optional):
```bash
pip install ragas datasets langchain-community
```

Purpose: Expert review for low-confidence analyses
File: backend/app/services/hitl_service.py
Workflow:
- System flags low-confidence result
- Expert reviews analysis in UI
- Expert approves/corrects/rejects
- Feedback stored for model improvement
- User receives verified analysis
Purpose: Background task processing for HITL and async operations
File: backend/app/services/task_queue_service.py
Framework: Huey
- License: MIT
- GitHub: https://github.com/coleifer/huey (5k+ stars)
- Backend: SQLite (no Redis required)
Features:
- Priority-based task scheduling (HIGH, MEDIUM, LOW)
- Automatic retries with exponential backoff
- Task status tracking
- Graceful fallback to synchronous execution
Task Types:
- Review notifications
- Expert assignment
- Review reminders
- Feedback processing
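Huey handles scheduling in the actual service; as a stdlib-only illustration of priority-based task ordering (not Huey's API), a min-heap keyed by priority works like this:

```python
import heapq
import itertools

HIGH, MEDIUM, LOW = 0, 1, 2  # lower number = served first


class PriorityTaskQueue:
    """Min-heap of (priority, seq, task); seq keeps FIFO order within a priority."""

    def __init__(self) -> None:
        self._heap: list = []
        self._seq = itertools.count()

    def push(self, task, priority: int = MEDIUM) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), task))

    def pop(self):
        return heapq.heappop(self._heap)[2]
```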
Installation (Optional):
```bash
pip install huey
```

Purpose: Metrics, tracing, and health monitoring
File: backend/app/services/observability_service.py
Framework: OpenTelemetry
- License: Apache 2.0
- GitHub: https://github.com/open-telemetry/opentelemetry-python (1.5k+ stars)
- Export: Console (local) - no cloud required
Metrics:
- Request counts and latencies
- LLM call duration and token counts
- RAG query performance
- Error rates
Tracing:
- Distributed tracing with spans
- Automatic error tracking
- Duration measurement
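The duration measurement above boils down to something like this stdlib context manager (a sketch, not the OpenTelemetry API itself):

```python
import time
from contextlib import contextmanager


@contextmanager
def span(name: str, metrics: dict):
    """Record a named operation's wall-clock duration in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        metrics[name] = time.perf_counter() - start
```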
Installation (Optional):
```bash
pip install opentelemetry-api opentelemetry-sdk
```

Purpose: Generate audio summaries
Libraries: pyttsx3 (offline), gTTS (online fallback), Indic Parler-TTS (high-quality Hindi)
Features:
- Hindi + English voice support
- Adjustable speech rate
- MP3 output format
- High-quality neural TTS for Hindi (optional)
File: backend/app/services/tts_service.py, backend/app/services/indic_parler_engine.py
Indic Parler-TTS (Optional - High-Quality Hindi TTS)
- Model: ai4bharat/indic-parler-tts
- License: Apache 2.0
- Size: 0.9B parameters
- Speakers: Rohit, Divya (Hindi), Thoma, Mary (English)
- Features: Natural voice descriptions, clear audio quality
Citations:
```bibtex
@inproceedings{sankar25_interspeech,
  title     = {{Rasmalai : Resources for Adaptive Speech Modeling in IndiAn Languages with Accents and Intonations}},
  author    = {Ashwin Sankar and Yoach Lacombe and Sherry Thomas and Praveen {Srinivasa Varadhan} and Sanchit Gandhi and Mitesh M. Khapra},
  year      = {2025},
  booktitle = {{Interspeech 2025}},
  pages     = {4128--4132},
  doi       = {10.21437/Interspeech.2025-2758},
}

@misc{lacombe-etal-2024-parler-tts,
  author       = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
  title        = {Parler-TTS},
  year         = {2024},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/huggingface/parler-tts}},
}

@misc{lyth2024natural,
  title         = {Natural language guidance of high-fidelity text-to-speech with synthetic annotations},
  author        = {Dan Lyth and Simon King},
  year          = {2024},
  eprint        = {2402.01912},
  archivePrefix = {arXiv},
}
```

Fallback Chain: Indic Parler-TTS → gTTS → pyttsx3
Purpose: Hindi ↔ English translation
Library: Argos Translate (offline, unofficial API)
Use Cases:
- Bilingual policy summaries
- Term explanations in Hindi
- User interface localization
File: backend/app/services/translation_service.py
- Ollama (Required)
  - Installation: `curl https://ollama.ai/install.sh | sh`
  - Models: `gemma2:2b`, `nomic-embed-text`
  - Port: 11434 (default)
- ChromaDB (Bundled)
  - Version: 0.5.15
  - Storage: `backend/data/chroma/`
- Python Packages (see `requirements.txt`)
  - FastAPI, Uvicorn
  - PyPDF2, python-docx
  - rank-bm25, chromadb
  - pyttsx3, Argos Translate (offline)
| Operation | Time (Avg) | Optimization |
|---|---|---|
| PDF Parsing (10 pages) | 2.3s | Parallel processing |
| Embedding Generation (50 chunks) | 1.8s | Batch API calls |
| Hybrid Search Query | 0.4s | Query caching |
| LLM Generation (500 tokens) | 3.5s | Optimized prompt |
| Full Analysis | 8-12s | End-to-end pipeline |
Current POC Limitations:
- Single-user session management
- In-memory caching (lost on restart)
- No distributed processing
Production Roadmap:
- Multi-user support with session persistence
- Distributed vector store (Weaviate, Milvus)
- GPU acceleration for embeddings
- Load balancing for API layer
✅ Zero Cloud Calls: All AI processing happens locally
✅ No API Keys: No third-party AI services
✅ Data Sovereignty: User data never leaves their machine
✅ PII Protection: Automatic redaction of sensitive info
✅ Audit Logs: All operations logged locally
- Input Validation: All uploads sanitized via Guardrails
- File Type Restrictions: Only PDF/DOCX/TXT allowed
- Size Limits: Max 10MB upload size (configurable)
- SQL Injection Prevention: Parameterized queries only
- XSS Protection: Output sanitization in frontend
- Framework: FastAPI 0.115.12
- Language: Python 3.10+
- AI/ML: Ollama (gemma2:2b, nomic-embed-text)
- Vector DB: ChromaDB 0.5.15
- Search: rank-bm25 0.2.2
- UI Framework: Material Design 3
- Styling: Custom CSS with dark mode
- Interactivity: Vanilla JavaScript
- Server: Uvicorn ASGI
- Logging: structlog
- Testing: pytest, unittest
```bash
# Health check
curl http://localhost:8000/

# Upload document
curl -X POST http://localhost:8000/upload \
  -F "file=@policy.pdf"

# Analyze policy
curl -X POST http://localhost:8000/analyze \
  -F "file=@policy.pdf"

# Ask question via RAG
curl -X POST http://localhost:8000/rag/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the sum insured?", "use_knowledge_base": true}'

# Get RAG statistics
curl http://localhost:8000/rag/stats
```

| Component | Path |
|---|---|
| Main App | backend/main.py |
| RAG Service | backend/app/services/rag_service.py |
| Ollama LLM | backend/app/services/ollama_llm_service.py |
| ChromaDB Data | backend/data/chroma/ |
| IRDAI Docs | backend/data/irdai_knowledge/ |
| Frontend | backend/templates/index.html |
| Tests | tests/ |
- Add Automatic Speech Recognition (ASR) for voice queries
- Implement Redis for distributed caching
- Add PostgreSQL for persistent session management
- Integrate more IRDAI documents (target: 100+ chunks)
- Multi-language support (10+ Indian languages)
- Mobile app (React Native)
- Browser extension for policy scanning
- API marketplace for insurtech partners
For questions or contributions, see: CONTRIBUTING.md