Insurance Claims Analytics - MinIO + Microsoft Fabric

Tech stack: Python · PySpark · Microsoft Fabric · Power BI · Delta Lake · MinIO · AWS S3 · Docker

End-to-end data engineering project implementing Medallion Architecture for insurance claims analysis, using MinIO for local object storage and Microsoft Fabric for processing and visualization.

Approach: MVP (Minimum Viable Product) built incrementally in functional stages with small, manageable datasets (~50 records) to demonstrate data engineering skills.

🏗️ Architecture

┌─────────────────────────────────────────────────────────────┐
│                     STORAGE LAYER                            │
├─────────────────────────────────────────────────────────────┤
│  MinIO (Local - Docker)                                      │
│  ├── insurance-data/bronze/                                  │
│  │   ├── customers.csv                                       │
│  │   ├── policies.csv                                        │
│  │   └── claims.csv                                          │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│              MICROSOFT FABRIC (Lakehouse)                    │
├─────────────────────────────────────────────────────────────┤
│  📦 BRONZE Layer (Raw Data)                                  │
│  ├── Data Pipeline: MinIO → Lakehouse                        │
│  ├── Format: Delta Tables                                    │
│  └── Data as-is from source                                  │
│                                                               │
│  🔄 SILVER Layer (Clean & Merged)                            │
│  ├── PySpark Notebooks                                       │
│  ├── Data cleaning and validation                            │
│  ├── Table joins                                             │
│  └── Format: Delta Tables                                    │
│                                                               │
│  ✨ GOLD Layer (Business Ready)                              │
│  ├── KPI aggregations                                        │
│  ├── Analytics-optimized data                                │
│  └── Format: Delta Tables                                    │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│              VISUALIZATION LAYER                             │
├─────────────────────────────────────────────────────────────┤
│  Power BI (Fabric)                                           │
│  ├── KPI Dashboard                                           │
│  ├── Top claimers analysis                                   │
│  ├── Approval rate by gender/policy                          │
│  └── Temporal trends                                        │
└─────────────────────────────────────────────────────────────┘

📋 Prerequisites

  • Docker and Docker Compose installed
  • Python 3.8+
  • Microsoft Fabric trial account
  • Git

1. Environment Setup

# Clone the repository
git clone https://github.com/Daniel-jcVv/insurance-analytics-medallion.git
cd insurance-analytics-medallion

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Configure environment variables
cp .env.example .env

2. Start MinIO

# Start MinIO with Docker Compose
docker compose up -d

# Verify it's running
docker ps

MinIO Console: http://localhost:9101

  • User: admin
  • Password: adminpassword123

3. Upload Sample Data

Option A: Local Development (MinIO)

# Install dependencies
pip install -r requirements.txt

# Upload data to MinIO
python scripts/upload_to_minio.py

# Run tests
pytest tests/ -v
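For reference, a minimal sketch of what an upload script like `upload_to_minio.py` might look like (assuming the `minio` Python client; the API endpoint `localhost:9100`, bucket name, and credentials below are illustrative — check `docker-compose.yml` and `.env` for the actual values):

```python
# Illustrative sketch, not the repository's actual script.
from pathlib import Path

BUCKET = "insurance-data"

def bronze_key(csv_path):
    """Map a local CSV path to its bronze-layer object key."""
    return f"bronze/{Path(csv_path).name}"

def upload_all(data_dir="data/raw"):
    from minio import Minio  # deferred import: only needed when uploading
    client = Minio("localhost:9100", access_key="admin",
                   secret_key="adminpassword123", secure=False)
    if not client.bucket_exists(BUCKET):
        client.make_bucket(BUCKET)
    for csv in sorted(Path(data_dir).glob("*.csv")):
        client.fput_object(BUCKET, bronze_key(csv), str(csv))
        print(f"Uploaded {csv} -> {bronze_key(csv)}")

if __name__ == "__main__":
    upload_all()
```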

Option B: Cloud Storage (AWS S3) - Recommended for Fabric

# Configure AWS credentials in .env file
cp .env.example .env
# Edit .env and add your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY

# Upload data to S3
python scripts/upload_to_s3.py

Uploaded files:

  • data/raw/customers.csv → bronze/customers.csv
  • data/raw/policies.csv → bronze/policies.csv
  • data/raw/claims.csv → bronze/claims.csv

Note: We use small datasets (50 records) for MVP - sufficient to validate logic and demonstrate understanding.

Cloud Integration: Using AWS S3 allows Microsoft Fabric to connect directly via Amazon S3 connector, enabling automated data pipelines instead of manual uploads.
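A hedged sketch of the S3 variant (assumes `boto3`; the bucket and region fall back to the example values above via environment variables — the actual `upload_to_s3.py` may differ):

```python
# Illustrative sketch, not the repository's actual upload_to_s3.py.
import os
from pathlib import Path

def upload_plan(data_dir="data/raw"):
    """Pair each local CSV with its bronze/ object key."""
    return [(str(p), f"bronze/{p.name}")
            for p in sorted(Path(data_dir).glob("*.csv"))]

def upload_to_s3():
    import boto3  # deferred import: only needed for the real upload
    s3 = boto3.client("s3", region_name=os.getenv("AWS_REGION", "us-east-1"))
    bucket = os.getenv("S3_BUCKET", "insurance-data-fabric")
    for local_path, key in upload_plan():
        s3.upload_file(local_path, bucket, key)
        print(f"Uploaded {local_path} -> s3://{bucket}/{key}")
```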

4. Setup Microsoft Fabric (Cloud)

4.1 Create Workspace and Lakehouse

  1. Go to Microsoft Fabric
  2. Create a new Workspace named Insurance Analytics
  3. Inside the workspace, create a Lakehouse named insurance_lakehouse

4.2 Connect Fabric to AWS S3

  1. In your workspace, create a new Data Pipeline
  2. Add Copy Data activity
  3. Configure Source with Amazon S3 connector:
    • Endpoint: https://s3.amazonaws.com (or region-specific endpoint)
    • Access Key ID: Your AWS_ACCESS_KEY_ID
    • Secret Access Key: Your AWS_SECRET_ACCESS_KEY
    • Bucket: insurance-data-fabric (or your S3_BUCKET name)
    • Region: us-east-1 (or your AWS_REGION)
  4. Select files from bronze/ folder
  5. Configure Destination as your Lakehouse
  6. Run the pipeline to ingest data

4.3 Create PySpark Notebooks

Bronze → Silver:

  • Clean null values
  • Standardize formats
  • Join tables

Silver → Gold:

  • KPI aggregations
  • Business metrics calculation
  • Report optimization

📊 Business Questions Answered

This project provides insights into:

  1. Top Claimers: Which customers generate the most claims?
  2. Approval Rates: Claim approval by gender and policy type
  3. Temporal Trends: Claims patterns over time
  4. Policy Performance: Claim activity by policy type
  5. Demographics: Claims by customer segments
  6. Risk Analysis: High-risk customers with claim-to-coverage ratio > 50%

Power BI Dashboard Features

Dashboard: "Customer Claims Performance Report"

KPI Cards:

  • Total Customers: 24
  • Total Claim Amount: $1.354M
  • Average Approval Rate: 22.65%
  • High Risk Customers: 7

Visualizations:

  1. Customer Distribution by Gender - Donut chart showing gender breakdown
  2. Policy Types Distribution - Bar chart analyzing Life, Auto, and Health policies
  3. Claims vs Coverage Analysis - Scatter plot showing relationship by gender
  4. Top 10 Customers by Total Claims - Horizontal bar chart of highest claimers
  5. Claims Volume Over Time - Line chart showing temporal trends (2019-2023)

📸 Screenshot: power-bi/Customer-Claims-Performance-Report.png

📁 Project Structure

insurance-claims/
├── docker-compose.yml               # MinIO configuration
├── requirements.txt                 # Python dependencies
├── .env.example                    # Environment variables template
├── .gitignore                      # Files to ignore
├── README.md                       # This file
│
├── data/
│   └── raw/                        # CSV files (50 records each)
│       ├── customers.csv
│       ├── policies.csv
│       └── claims.csv
│
├── scripts/
│   ├── upload_to_minio.py          # Upload to MinIO (local dev)
│   ├── upload_to_minio_production.py  # Production MinIO (retry + logging)
│   └── upload_to_s3.py             # Upload to AWS S3 (cloud/Fabric)
│
├── tests/
│   ├── test_upload.py              # Integration tests
│   └── README.md                   # Test documentation
│
├── fabric/
│   └── notebooks/
│       ├── 01_Bronze_to_Silver.ipynb   # Data cleaning & validation
│       └── 02_Silver_to_Gold.ipynb     # Business metrics & analytics
│
├── power-bi/
│   └── Customer-Claims-Performance-Report.png  # Dashboard screenshot
│
├── docs/
│   └── screenshots/
│       └── fabric/                 # Fabric workspace screenshots
│           ├── 03-fabric-bronze-files.png
│           ├── 05-fabric-silver-tables.png
│           └── 06-fabric-gold-tables.png
│
└── minio/
    └── data/                       # MinIO persistent storage (gitignored)

🎯 Implementation Stages (MVP)

Stage 1: FOUNDATION - "It Works"

  • ✅ MinIO running in Docker
  • ✅ 50 records in each CSV file
  • ✅ AWS S3 bucket created and data uploaded
  • ✅ Fabric workspace created

Stage 2: TRANSFORMATION - "It's Clean"

  • ✅ Silver layer notebook created (01_Bronze_to_Silver.ipynb)
  • ✅ Data cleaning implemented (dates, nulls, formatting)
  • ✅ 3 Silver Delta tables written

Stage 3: ANALYTICS - "It Answers Questions"

  • ✅ Tables joined (claims → policies → customers)
  • ✅ Gold aggregation table created (gold_customer_claims_analytics)
  • ✅ Business metrics calculated (approval rates, coverage ratios, temporal analysis)
  • ✅ Data quality checks implemented
  • ✅ Business insights generated

Stage 4: VISUALIZATION - "It's Presentable"

  • ✅ Power BI report created (Customer Claims Performance Report)
  • ✅ Dashboard with KPIs and visualizations
  • ✅ Portfolio screenshots captured

Stage 5 (Optional): SCALE - "It Handles More"

  • 🔲 Script to generate 200-500 records
  • 🔲 Pipeline re-run with larger dataset
  • 🔲 Performance documented

📈 Scalability & Production Readiness

This project demonstrates pragmatic engineering through two upload scripts:

Simple Version (upload_to_minio.py)

Use for: MVP, demos, small datasets

  • Basic error handling
  • File validation

Production Version (upload_to_minio_production.py)

Use for: Larger datasets, production environments

  • Retry logic with exponential backoff
  • Parallel uploads (4 workers)
  • Logging to file
  • Progress bars
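The retry-with-exponential-backoff and parallel-upload patterns could be sketched as follows (illustrative, standard library only; the actual `upload_to_minio_production.py` may differ):

```python
# Illustrative retry/parallelism pattern, not the repository's actual script.
import logging
import random
import time
from concurrent.futures import ThreadPoolExecutor

def with_retries(fn, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on failure, retry with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            logging.warning("Upload failed, retrying in %.2fs", delay)
            sleep(delay)

def upload_many(upload_fns, workers=4):
    """Run uploads in parallel, each wrapped in retry logic."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda fn: with_retries(fn), upload_fns))
```

The injectable `sleep` parameter keeps the backoff logic unit-testable without real delays.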

Philosophy: Start simple, add complexity when justified. The value is in the Medallion Architecture + Fabric transformations.

🔧 Useful Commands

# View MinIO logs
docker compose logs -f minio

# Stop MinIO
docker compose down

# Clean MinIO data (destructive)
docker compose down -v

# Upload data to MinIO (simple)
python scripts/upload_to_minio.py

# Upload data to MinIO (production)
python scripts/upload_to_minio_production.py

# Verify data in MinIO
# Access http://localhost:9101 and explore the bucket

📝 Portfolio Highlights

This project demonstrates:

  • Medallion Architecture (Bronze/Silver/Gold) - Complete implementation
  • End-to-End Data Engineering - From raw data to business insights
  • Microsoft Fabric (Lakehouse, PySpark Notebooks, Power BI)
  • PySpark Processing - Data cleaning, transformations, and aggregations
  • Object Storage Patterns (S3-compatible MinIO + AWS S3)
  • Power BI Dashboards - Interactive analytics with 5 visualizations + 4 KPIs
  • Docker for local infrastructure
  • Python Automation (upload scripts + integration tests)
  • Delta Lake format (ACID transactions, time travel)
  • Data Quality Checks - Validation and business logic
  • Cloud Integration - AWS S3 → Fabric pipeline
  • Pragmatic Engineering (simple first, complexity when justified)

🎯 Architecture Strategy

Why MinIO + Fabric (instead of just Fabric)?

MinIO (Local Development):

  • Practice S3 API integration patterns
  • Test data pipelines without cloud costs
  • Demonstrate infrastructure-as-code skills
  • Learn object storage concepts

Fabric (Cloud Demo):

  • Showcase Medallion Architecture transformations
  • Demonstrate PySpark and Delta Lake expertise
  • Build Power BI dashboards
  • Prove cloud data engineering skills

Key Insight: Fabric runs in the cloud and cannot reach a MinIO instance on localhost. For the MVP, data is uploaded to the Fabric Lakehouse manually; in production, this would be automated via Azure Data Lake Storage → Fabric Data Pipeline.


🙏 Acknowledgments

Ora et labora; Soli Deo gloria. My gratitude to the open-source community for generously sharing their knowledge.

