Insurance Helper Bot

A complete end-to-end pipeline for scraping, analyzing, and querying insurance product data from the Insurance Regulatory and Development Authority of India (IRDAI). The system collects insurance product metadata and policy documents, transforms them into structured formats for analysis, and provides LLM-powered insights through a RAG (Retrieval-Augmented Generation) system.

Problem Statement

Navigating India's insurance landscape is challenging:

Information Overload: IRDAI lists 8,500+ insurance products across life, health, and non-life categories
Scattered Data: Policy documents are spread across multiple pages with inconsistent formats
Complex Comparison: Comparing policies requires manually downloading and reading lengthy PDF documents
No Unified Search: There's no way to query across all products to find the best fit for specific needs

This project aims to solve these problems by creating an intelligent assistant that can help users understand, compare, and choose insurance products.

Pipeline Architecture

graph LR
    A[IRDAI Website] --> B[Scraper]
    B --> C[Raw Data]
    C --> D[Analyzer]
    D --> E[Structured Data]
    E --> F[Vector Store]
    F --> G[LLM/RAG Engine]
    G --> H[Insights & Recommendations]

    subgraph "Stage 1: Data Collection"
        A
        B
        C
    end

    subgraph "Stage 2: Data Analysis"
        D
        E
    end

    subgraph "Stage 3: LLM Integration"
        F
        G
        H
    end

Pipeline Stages

Stage 1: Data Collection (Implemented)

The scraper collects comprehensive metadata and policy documents from IRDAI's official website.

What It Does:

Crawls 4 different IRDAI product listing pages
Extracts metadata for ~8,500 insurance products
Downloads associated PDF/XLSX policy documents
Maintains state for resumable scraping sessions

Features:

Parallel async downloads for high throughput
CSV export with complete product metadata
Resume capability with JSON state tracking
Progress bars and real-time status reporting
Configurable concurrency and page ranges

Data Collected:

Page	URL	Records	File Type	Data Points
Life Insurance Products	`/life-insurance-products`	~1,500	PDF	UIN, insurer, product type, launch date, par/non-par status, individual/group
List of Life Products	`/list-of-life-products`	~27	XLSX	Product descriptions, last updated dates
Non-Life Insurance Products	`/non-life-insurance-products`	~5,200	PDF	UIN, insurer, product type, approval date
Health Insurance Products	`/health-insurance-products`	~1,800	PDF	UIN, insurer, product name, approval date

Stage 2: Data Analysis (Coming Soon)

Transforms raw scraped data into structured, queryable formats suitable for LLM consumption.

Planned Components:

PDF Parser
- Extract text content from policy documents
- Handle multi-column layouts and tables
- OCR support for scanned documents
- Preserve document structure and hierarchy
Data Normalizer
- Standardize insurer names and product categories
- Normalize dates, amounts, and coverage terms
- Map product types to unified taxonomy
- Handle missing or inconsistent data
Feature Extractor
- Identify key policy terms (premium, coverage, exclusions)
- Extract coverage limits and claim procedures
- Parse waiting periods and policy conditions
- Identify riders and add-on options
Schema Builder
- Create structured JSON/database records
- Build relationships between products and insurers
- Generate embeddings for semantic search
- Index documents for fast retrieval

Stage 3: LLM-Powered Insights (Coming Soon)

RAG-based system for intelligent querying and analysis of insurance products.

Architecture:

Vector database for semantic document storage
Embedding model for document and query vectorization
LLM for natural language understanding and generation
Retrieval pipeline for context-aware responses

Planned Use Cases:

Policy Comparison
- Compare multiple insurance policies side-by-side
- Highlight key differences in coverage, premiums, and terms
- Generate comparison tables and summaries
- Identify best value options for specific criteria
Product Recommendations
- Understand user requirements through conversational interface
- Match needs to suitable insurance products
- Explain why specific products are recommended
- Consider factors like age, health, budget, and coverage needs
Policy Q&A
- Answer specific questions about any insurance product
- Explain complex terms and conditions in simple language
- Clarify exclusions and waiting periods
- Provide claim procedure guidance
Regulatory Insights
- Track policy approvals and withdrawals over time
- Identify trends in insurance product offerings
- Monitor insurer product portfolios
- Analyze market coverage gaps

Installation

Requires Python 3.11+ and uv.

# Clone the repository
git clone https://github.com/EXTREMOPHILARUM/insurance-helper.git
cd insurance-helper

# Install dependencies
uv sync

Usage

Scrape All Products

# Scrape all pages (metadata + files)
uv run irdai-scraper scrape --type all

# Scrape specific product type
uv run irdai-scraper scrape --type life
uv run irdai-scraper scrape --type life_list
uv run irdai-scraper scrape --type nonlife
uv run irdai-scraper scrape --type health

Options

# Metadata only (no file downloads)
uv run irdai-scraper scrape --type all --metadata-only

# Custom concurrency (default: 10)
uv run irdai-scraper scrape --type all --concurrent 20

# Scrape specific page range (for testing)
uv run irdai-scraper scrape --type life --start-page 1 --end-page 5

# Start fresh (ignore previous progress)
uv run irdai-scraper scrape --type all --no-resume

# Custom output directory
uv run irdai-scraper scrape --type all --output ./my-data

Other Commands

# Check scraping status
uv run irdai-scraper status

# Retry failed downloads
uv run irdai-scraper retry-failed

# Reset progress for a specific type
uv run irdai-scraper reset --type life

# Reset all progress
uv run irdai-scraper reset --yes

Output Structure

data/
├── metadata/
│   ├── life_insurance_products.csv      # Life insurance product metadata
│   ├── life_products_list.csv           # Life products list metadata
│   ├── nonlife_insurance_products.csv   # Non-life product metadata
│   └── health_insurance_products.csv    # Health product metadata
├── downloads/
│   ├── life/                            # Life insurance PDFs
│   │   └── {FY}/{Insurer}/{UIN}_{ProductName}.pdf
│   ├── life_list/                       # Life products XLSX files
│   │   └── {filename}.xlsx
│   ├── nonlife/                         # Non-life insurance PDFs
│   │   └── {FY}/{Insurer}/{UIN}_{ProductName}.pdf
│   └── health/                          # Health insurance PDFs
│       └── {FY}/{Insurer}/{UIN}_{ProductName}.pdf
└── state.json                           # Scraping progress state

CSV Schema

Life Insurance Products:

Column	Description
archive_status	Whether product is archived
financial_year	FY of product launch/modification
insurer	Insurance company name
product_name	Name of the insurance product
uin	Unique Identification Number
type_of_product	Product category
launch_modification_date	Date of launch or last modification
closing_withdrawal_date	Product withdrawal date (if applicable)
protection_savings_retirement	Product classification
par_nonpar	Participating or Non-participating
individual_group	Individual or Group product
remarks	Additional notes
document_url	URL to policy document
document_filename	Downloaded filename
local_file_path	Path to downloaded file

Non-Life Insurance Products:

Column	Description
s_no	Serial number
financial_year	FY of approval
insurer	Insurance company name
product_name	Name of the insurance product
type_of_product	Product category
uin	Unique Identification Number
date_of_approval	IRDAI approval date
document_url	URL to policy document
document_filename	Downloaded filename
local_file_path	Path to downloaded file
archive_status	Whether product is archived

Health Insurance Products:

Column	Description
financial_year	FY of approval
insurer	Insurance company name
uin	Unique Identification Number
product_name	Name of the insurance product
date_of_approval	IRDAI approval date
document_url	URL to policy document
document_filename	Downloaded filename
local_file_path	Path to downloaded file
type_of_product	Product category
archive_status	Whether product is archived

List of Life Products:

Column	Description
archive_status	Whether entry is archived
short_description	Brief product description
last_updated	Last update timestamp
sub_title	Additional title/category
document_url	URL to document
document_filename	Downloaded filename
local_file_path	Path to downloaded file

Resume Capability

The scraper maintains progress state in data/state.json:

Tracks completed pages for each product type
Records successfully downloaded files
Logs failed downloads for retry
Enables interruption and resumption without data loss

Roadmap

Technical Notes

SSL verification is disabled due to IRDAI certificate issues
Estimated runtime: ~2.5 hours for all files (10 concurrent downloads)
With 20 concurrent downloads: ~1.2 hours
Total data size: ~15-20 GB (all PDFs and metadata)

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
.github/workflows		.github/workflows
data/metadata		data/metadata
scripts		scripts
src/irdai_scraper		src/irdai_scraper
.gitignore		.gitignore
.python-version		.python-version
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Insurance Helper Bot

Problem Statement

Pipeline Architecture

Pipeline Stages

Stage 1: Data Collection (Implemented)

Stage 2: Data Analysis (Coming Soon)

Stage 3: LLM-Powered Insights (Coming Soon)

Installation

Usage

Scrape All Products

Options

Other Commands

Output Structure

CSV Schema

Resume Capability

Roadmap

Technical Notes

License

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Insurance Helper Bot

Problem Statement

Pipeline Architecture

Pipeline Stages

Stage 1: Data Collection (Implemented)

Stage 2: Data Analysis (Coming Soon)

Stage 3: LLM-Powered Insights (Coming Soon)

Installation

Usage

Scrape All Products

Options

Other Commands

Output Structure

CSV Schema

Resume Capability

Roadmap

Technical Notes

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages