Policy Recommender — Fairness-Aware ML Decision System

An applied machine learning system that ranks government welfare schemes using learned feature contributions while enforcing deterministic eligibility rules. Combines fairness analysis with explainable ML-driven ranking for policy recommendations.

Overview

Policy Recommender is an ML-driven decision system that identifies suitable government welfare schemes for citizens. It separates concerns: rule-based eligibility (deterministic, policy-governed) from ML-based ranking (learned relevance patterns). This architecture ensures legal compliance while leveraging learned patterns to improve scheme ordering for end users. The system includes fairness analysis to detect demographic disparities in recommendation distributions.

Features

ML-Based Ranking: Logistic regression learns feature contributions to predict scheme relevance among eligible options
Rule-Based Eligibility: Deterministic gate ensures only policy-compliant schemes are considered
Fairness Analysis: Detects demographic parity issues and representation variance across ranking outputs
Explainable Decisions: All eligibility, scoring, and ML contributions are explained in structured outputs
Authentication & RBAC: JWT-based access control with role-based endpoints
Immutable Audit Trail: Write-once, read-many (WORM) compliance audit records
Evaluation Metrics: NDCG, Precision@k, MAP for ranking quality assessment
Offline Experiment Framework: Compare rule-based, ML, and hybrid scoring modes offline

Why ML is Used

This system uses machine learning for ranking only, not eligibility:

Eligibility is Policy-Driven: Legal requirements (income, age, state) are non-negotiable rules. ML cannot learn these; they must be explicitly defined.
Ranking is Data-Driven: Among schemes a user is eligible for, some are more relevant than others. ML captures patterns from features (income proximity, age range fit, category match, gender fit) to improve ranking quality.
Explainability Over Accuracy: The ML model (logistic regression) is chosen for interpretability. Every prediction includes feature contributions, making decisions defensible to auditors and beneficiaries.
Fairness Visibility: By analyzing ranking outputs for demographic parity and representation variance, we surface potential biases early for governance decisions.

Trade-off: We sacrifice raw predictive power for interpretability and auditability. This is appropriate for government systems where trust and legality outweigh accuracy.

Tech Stack

Language: Python 3.11
ML Library: scikit-learn (logistic regression, metrics)
Evaluation: NDCG, Precision@k, MAP (sklearn.metrics)
Fairness Analysis: Demographic parity, representation variance
Architecture: Modular service layer (eligibility, scoring, ML ranking, fairness)
Data Processing: numpy for numerical feature transformations

Project Structure

policy-recommender-ai/
├── app.py                  # FastAPI application entry point
├── requirements.txt        # Python dependencies
├── README.md              # Documentation
├── config.json            # Scoring mode configuration
├── src/                   # Source code
│   ├── models/            # Data models
│   ├── rules/             # Rule engine
│   ├── services/          # Services (eligibility, scoring, ML ranking)
│   ├── evaluation/        # Evaluation metrics and fairness analysis
│   │   ├── ranking_metrics.py   # NDCG, Precision@k, MAP
│   │   └── fairness_metrics.py  # Demographic parity, variance
│   └── __init__.py
├── experiments/           # Offline experiments
│   └── compare_ranking_methods.py
├── data/                  # Data files and configurations
├── tests/                 # Unit and integration tests
└── notebook/              # Analysis notebooks

Installation

Clone the repository

git clone <repository-url>
cd policy-recommender-ai

Activate Conda environment
```
conda activate ai
```
Install dependencies
```
pip install -r requirements.txt
```

Environment Setup

This project requires Conda environment "ai" with Python 3.11+.

Required Setup Steps

Ensure Conda "ai" environment exists (should already be present)
```
conda env list | grep ai
```
Activate the environment
```
conda activate ai
```

Verify Python version

python --version
# Should output: Python 3.11.x

Install all dependencies
```
pip install -r requirements.txt
```

Dependencies

The requirements.txt includes:

fastapi - Web framework
uvicorn - ASGI server
pydantic - Data validation
numpy - Numerical computing
scikit-learn - ML library (for logistic regression)
sqlalchemy - ORM for database persistence

Running the Application

Start the API server with auto-reload:

conda activate ai
cd policy-recommender-ai
uvicorn app:app --reload

The API will be available at: http://127.0.0.1:8000

Access Swagger documentation at: http://127.0.0.1:8000/docs

Verification

After startup, verify all systems are operational:

curl http://127.0.0.1:8000/
# Should return: {"status": "healthy", "service": "Policy Recommendation Engine", "version": "1.0.0"}

Usage

Run the application:

python app.py

The application will start and process user eligibility data to generate policy recommendations.

Development

Running Tests

python -m pytest tests/

Testing Philosophy

This project follows a testing-by-scope approach: test what matters for compliance and correctness, skip what's out of scope.

Why These Tests Are Sufficient:

Eligibility Correctness (26 tests): Every constraint (age, income, state, category, gender) is tested with boundary cases. Eligibility is deterministic; if it's correct, the system's core responsibility is met.
RBAC Enforcement (32 tests): User/auditor/admin roles are tested against all protected endpoints. This is security-critical; token handling and role checks are comprehensive.
Audit Immutability (20 tests): WORM is tested across all mutation methods (DELETE, PATCH, PUT). This is compliance-critical; we verify all paths return 409.
ML Versioning (12 tests): Model version and confidence are captured and tracked. Drift detection RBAC is enforced. This validates new ML operations features.
No Mocks: All tests use real integration (FastAPI TestClient against live endpoints). This catches real failures, not mock-only bugs.
Deterministic Input: Every test uses fixed inputs and expected outputs. No randomness, no flakiness. Tests are reproducible and trustworthy.

What Is NOT Tested (Intentionally):

❌ UI/Frontend: This is a backend API only. No UI exists to test.
❌ External LLM APIs: LLM integration is optional and gracefully degraded. If LLM is down, system still works.
❌ Performance/Load: Not in scope for v1.0. Backend is stateless; scaling is horizontal.
❌ Database Migrations: SQLite is simple; no complex migrations exist.
❌ Full E2E Workflows: Individual endpoints are tested; E2E testing is out of scope for backend unit/integration testing.

Result: 90 focused tests covering critical paths (eligibility, security, audit, versioning). This is sufficient for a compliance-grade government recommendation system.

Project Components

Models (src/models/): Define eligibility criteria and policy scheme structures
Rules (src/rules/): Implement decision logic for scheme recommendations
Services (src/services/): Handle recommendation orchestration and explainability

Deployment on Render

Deploy this service to Render in minutes for production-grade hosting with automatic scaling and monitoring.

Quick Start

Push to GitHub
```
git push origin main
```
Connect to Render
- Visit render.com
- Create new Web Service
- Connect your GitHub repository
- Select policy-recommender-ai branch
Set Environment Variables
- JWT_SECRET: Generate a strong random secret (use openssl rand -hex 32 or similar)
- Other variables auto-configured from render.yaml
Deploy
- Render will automatically build and deploy
- Monitor status in Render dashboard

Health Monitoring

The service includes dedicated health endpoints:

GET /health - Render load balancer checks this every 30 seconds
- Returns: {"status": "ok", "service": "policy-recommender-ai", "version": "1.0.0"}
- Used by: Container orchestration, monitoring systems, load balancers
GET / - Human-readable health status
- Returns: {"status": "healthy", "service": "Policy Recommendation Engine", "version": "1.0.0"}

Data Persistence

SQLite database persists audit trails to ./audit_trail.db
Render ephemeral storage: Re-deploy to reset database
Future: Migrate to PostgreSQL for multi-instance deployments

API Access

Once deployed to Render:

Live API: https://<your-service>.onrender.com
API Docs: https://<your-service>.onrender.com/docs
Health Check: https://<your-service>.onrender.com/health

Performance & Scaling

Default configuration:

Instances: 1 (auto-scales on CPU/memory threshold)
Region: Oregon (customize in render.yaml)
Timeout: 30 seconds per request (standard)

Monitor metrics in Render dashboard for CPU and memory usage.

Design Decisions

1. Rules as Source of Truth

Decision: Eligibility logic is entirely rule-based; ML never overrides eligibility decisions.

Rationale:

Government welfare schemes have legal eligibility criteria that cannot be negotiated or learned from data
Rule-based eligibility is deterministic, auditable, and compliant with policy requirements
ML can optimize within eligible schemes but cannot legitimize ineligibility

Implementation:

eligibility_engine.py is the single source of truth for eligibility
scoring_engine.py ranks only among already-eligible schemes
ml_ranker.py is optional for ranking, never consulted for eligibility

Why This Matters:

Audit compliance: Every eligibility decision can be traced to specific rules
Legal defensibility: Policy enforcers can explain why a scheme was not recommended
System evolution: Rules can be updated without retraining ML models

2. ML for Ranking, Not Eligibility

Decision: Machine learning enhances ranking precision among eligible schemes but does not drive eligibility.

Rationale:

Ranking is subjective and data-driven; eligibility is objective and policy-driven
ML captures patterns from historical recommendation data to improve relevance ordering
Multiple scoring modes allow controlled experimentation without disrupting eligibility logic

Implementation:

Three configurable scoring modes:
- "rules": Pure rule-based scoring (deterministic baseline)
- "ml": ML-based scoring (learned relevance patterns)
- "hybrid": Average of rule and ML scores (balanced approach)
Config-driven mode selection in config.json
Mode can be switched without redeploying service

Why This Matters:

Safe experimentation: Interviewers can test ML without breaking production eligibility
Graceful degradation: If ML unavailable, system falls back to rules
Continuous improvement: New ML models can be deployed/tested independently

3. Explainability Preserved at Every Layer

Decision: Every decision—eligibility, scoring, ML feature contribution—is explained in structured JSON.

Rationale:

Government systems require audit trails; vague ML scores are not acceptable
Beneficiaries and caseworkers must understand why a scheme was (or wasn't) recommended
System is only production-ready if every step is explainable

Implementation:

Eligibility layer: eligibility_engine.py returns boolean + rule explanation
Scoring layer: scoring_engine.py returns score breakdown (income proximity, age fit, category match, gender match)
ML layer: ml_ranker.py includes feature contributions (e.g., "age contribution: +0.3%")
API response: DecisionTrace model includes scoring mode, feature details, and all intermediate scores

Why This Matters:

Trust: Beneficiaries can understand and contest decisions
Debugging: Product teams can identify model drift or rule violations
Compliance: Audit teams can reconstruct decision logic for any beneficiary

4. Safe Evolution and Continuous Learning

Decision: System architecture supports adding new schemes, rules, and ML models without breaking existing functionality.

Rationale:

Government policies change; recommendation systems must adapt
New eligibility criteria should not require retraining or redeploying ML
A/B testing of new rules or ML models should be possible without disruption

Implementation:

Modular services: Each service (eligibility, scoring, ML) is independent
Config-driven experimentation: New rules, weights, and scoring modes defined in config.json
Offline evaluation: evaluate.py compares rule vs ML ranking before production deployment
Extensible data model: New eligibility criteria or scheme attributes can be added to schemes.json without API changes

Why This Matters:

Scalability: System grows with policy changes
Risk mitigation: Experiments validated offline before live deployment
Junior engineer onboarding: Clear separation of concerns makes code understandable

5. ML Model Design: Controlled and Interpretable

Decision: Use logistic regression (not deep learning); train on synthetic data with domain features.

Rationale:

Logistic regression is inherently interpretable (coefficients = feature importance)
Synthetic training data ensures reproducible, deterministic behavior
No "black box" that inspectors cannot explain

Implementation:

Model: Logistic Regression from scikit-learn
Features: age_normalized, income_ratio, category_match, gender_match
Training: Synthetic dataset with deterministic seed (reproducible)
Output: 0-100 score + feature-level contributions

Why This Matters:

Interpretability: Interviewers see exactly how ML reasons about relevance
Transparency: Model behavior is auditable and reproducible
Production readiness: Simple model = fewer deployment surprises

6. Evaluation-Aware System Design

Decision: Include offline evaluation script comparing rule vs ML ranking consistency.

Rationale:

Before deploying ML ranking, verify it doesn't contradict rule-based wisdom
Metrics (top-1 agreement, rank deltas) detect unexpected model behavior
Evaluation is continuous, not one-time

Implementation:

evaluate.py: Generates synthetic users, compares ranking across modes
Metrics: Top-1 agreement %, average rank delta, score distributions
Output: evaluation_results.json for trend tracking

Why This Matters:

Confidence: ML is validated before production
Debugging: Unexpected agreement drops signal data drift or rule changes
Stakeholder trust: Numbers, not narratives, justify ML deployment

Technical Architecture

Service Layer Separation

User Request
    ↓
┌─────────────────────────────────────────┐
│  Eligibility Engine (src/services/)     │ ← Determines YES/NO
│  - Rule-based only                      │
│  - No learning                          │
└─────────────────────────────────────────┘
    ↓ (if eligible)
┌─────────────────────────────────────────┐
│  Config & Scoring Router (app.py)       │ ← Selects mode
│  - Check config.json                    │
│  - Route to appropriate scorer          │
└─────────────────────────────────────────┘
    ↓
    ├─→ Rule-Based Scoring (src/services/scoring_engine.py)
    │   - Deterministic, explainable
    │
    ├─→ ML Ranking (src/services/ml_ranker.py)
    │   - Optional, graceful fallback
    │   - Feature contributions
    │
    └─→ Hybrid (both, averaged)
        - Combines approaches
        - Traces both paths
    ↓
┌─────────────────────────────────────────┐
│  API Response (FastAPI)                 │ ← Structured JSON
│  - DecisionTrace with all details       │
│  - Scoring factors & contributions      │
│  - Audit-ready output                   │
└─────────────────────────────────────────┘

How to Use This Design

As a Developer

Add a new scheme: Update data/schemes.json and eligibility rules
Change rule weights: Update config.json → restart API
Experiment with ML: Set scoring_mode in config.json, restart
Validate changes: Run python evaluate.py before deployment

As an Interviewer

Verify explainability: Call /explain endpoint; confirm all reasoning is structured JSON
Check eligibility logic: Trace through eligibility_engine.py rules
Validate ML behavior: Run evaluate.py; confirm metrics are reasonable
Test configuration: Modify config.json; confirm API honors the mode

As an Auditor

Trace a decision: Examine DecisionTrace from /recommend response
Verify rule compliance: All eligibility rules implemented in eligibility_engine.py
Check model fairness: Feature contributions in ML response (age, income, category, gender)
Review change history: git log on config.json and src/services/

Data Model & Audit Trail

Overview

The system persists every recommendation run to SQLite for compliance and debugging. This enables:

Complete decision traceability (why was a scheme recommended or rejected?)
Audit trail for policy adjustments and ML model changes
Benchmarking and performance analysis
Graceful evolution to PostgreSQL for production deployments

Database Tables

1. `user_profiles`

Stores user profiles used in recommendations.

Column	Type	Purpose
`id`	UUID	Primary key
`age`	Integer	User age
`income`	Integer	Annual income (rupees)
`state`	String	State of residence
`category`	String	Social/income category (EWS, FARMER, etc.)
`gender`	String	Gender (MALE, FEMALE, OTHER)
`created_at`	DateTime	When profile was created

Why store separately:

Normalize data: One user profile can generate multiple recommendations
Privacy considerations: Clear separation between immutable profiles and recommendations

2. `recommendation_runs`

Stores metadata for each /recommend API call.

Column	Type	Purpose
`run_id`	UUID	Primary key (returned in API response)
`user_profile_id`	UUID FK	Foreign key to user profile
`scoring_mode`	String	"rules", "ml", or "hybrid"
`config_version`	String	Config version used during this run
`created_at`	DateTime	When recommendation was generated
`total_schemes_checked`	Integer	How many schemes were evaluated
`eligible_count`	Integer	How many schemes qualified
`ineligible_count`	Integer	How many schemes were rejected

Why store separately:

Decouple runs from individual decisions: Query all recommendations for a user or time period easily
Track configuration changes: Verify that ML was enabled/disabled when decisions were made
Summary metrics: Quick analytics without joining to scheme decisions

3. `scheme_decisions`

Stores individual eligibility and scoring decisions for each scheme in a run.

Column	Type	Purpose
`id`	UUID	Primary key
`run_id`	UUID FK	Foreign key to recommendation run
`scheme_id`	String	Scheme identifier
`scheme_name`	String	Human-readable scheme name
`is_eligible`	Boolean	YES if eligible, NO otherwise
`eligibility_reason`	Text	Why eligible or rejected
`score`	Float	Relevance score (0-100) if eligible, NULL if not
`scoring_method`	String	"rules", "ml", or "hybrid" (if eligible)
`decision_trace`	JSON	Full DecisionTrace object (passed/failed rules)
`scoring_factors`	JSON	Score breakdown (income, age, category, gender)
`ml_features`	JSON	ML feature contributions (if ML involved)
`created_at`	DateTime	When decision was made

Why store as JSON:

Flexibility: Future changes to decision logic don't require schema migrations
Debuggability: Exact state of reasoning preserved for audit
Portability: Easy to export and analyze offline

Audit Trail Example

To audit why a user got specific recommendations:

# 1. Make a recommendation
curl -X POST http://127.0.0.1:8000/recommend \
  -H "Content-Type: application/json" \
  -d '{
    "age": 35,
    "income": 450000,
    "state": "MH",
    "category": "EWS",
    "gender": "FEMALE"
  }'

# Response includes "audit_run_id": "abc-123-def"

# 2. Retrieve full audit trail
curl http://127.0.0.1:8000/audit/abc-123-def

# Response includes:
# - User profile used
# - All scheme decisions (eligible and ineligible)
# - Scoring mode (rules, ml, hybrid)
# - Timestamp (when recommendation was made)
# - Complete decision traces and reasoning

Why This Matters

For Compliance:

Policy auditors can trace every decision to specific rules
All changes are timestamped and linked to config versions
Replayability: Modify rules, re-run old users, compare outcomes

For Product Evolution:

A/B test new scoring modes without disrupting production
Compare rule-based vs ML-based recommendations side-by-side
Identify when ML diverges significantly from rules (data drift signal)

For Debugging:

Engineers can examine exact decision reasoning for specific users
Identify systematic issues (e.g., "why are all farmers ineligible?")
Validate new rules before production deployment

For Scaling:

SQLite works for development and small deployments
Ready to migrate to PostgreSQL with minimal code changes:
- Connection string changes in src/db.py
- Add Alembic migrations for production safety
- Increase database connection pooling

API Endpoints for Audit

Endpoint	Purpose
`POST /recommend`	Generate recommendations (automatically logs decision run)
`GET /audit/{run_id}`	Retrieve full audit trail for a specific recommendation
`POST /explain`	Get eligibility explanations without ranking (no logging)

Model Versioning & Drift Monitoring

Overview

The system includes read-only ML operations tracking to support compliance and monitoring:

ML Model Versioning: Every recommendation run tracks which ML model version was used
Confidence Scores: Each ML-based decision includes a confidence metric (predicted probability)
Drift Detection Analytics: Advisory-only analysis comparing ML vs rule-based rankings

Critical Design Principle: These features are for monitoring and auditing only. They never trigger automatic model updates, retraining, or decision changes.

Model Versioning

Each audit record includes:

{
  "ml_model_version": "logistic_v1.0",
  "ml_confidence": 0.87,
  "scoring_method": "ml"
}

Fields:

ml_model_version (string): e.g., "logistic_v1.0" for ML/hybrid scoring, null for rules-only
ml_confidence (float 0–1): Predicted probability from logistic regression
- For logistic regression: confidence = P(positive class)
- null for rules-only scoring (deterministic, no probability concept)

Why It Matters:

Enables auditors to identify which recommendations used which model versions
Supports compliance requirements: "What model was used for this decision?"
Facilitates model lifecycle management and deprecation

Confidence Tracking

For every ML or hybrid scoring decision:

ML-based decision on scheme X:
- Score: 75/100
- Confidence: 0.87 (model was 87% confident)
- Interpretation: High confidence recommendation

Using Confidence:

Score alone can be misleading; confidence provides context
Low confidence (e.g., 0.52) suggests borderline cases
Auditors can filter for "high confidence recommendations only"

Drift Detection Analytics

Endpoint: GET /analytics/model-drift (auditor+ only)

Compares ML-based ranking against rule-based ranking across recent recommendation runs:

curl -H "Authorization: Bearer <token>" \
  http://127.0.0.1:8000/analytics/model-drift

Response:

{
  "drift_detected": false,
  "average_rank_delta": 8.5,
  "drift_threshold_pct": 15.0,
  "analysis_basis": 47,
  "scheme_drift": {
    "scheme_1": {
      "average_rank_delta_pct": 12.3,
      "observations": 8,
      "drift_flag": false
    },
    "scheme_2": {
      "average_rank_delta_pct": 5.1,
      "observations": 8,
      "drift_flag": false
    }
  },
  "advisory": "This analysis is advisory only. No automated retraining occurs."
}

What It Measures:

For each scheme, calculates how differently ML and rules rank it
Compares ranking position (not score values) across audit runs
Flags schemes with >15% average rank change

Why Drift Matters:

Indicates ML model may be diverging from rule-based logic
Signals potential data shift or model staleness
Advisory signal for auditors: "Should we retrain or update rules?"

What It Does NOT Do:

❌ Does NOT automatically trigger retraining
❌ Does NOT update model versions
❌ Does NOT change any recommendations retroactively
❌ Does NOT replace policy review processes

Expected Actions:

Audit team reviews drift analysis during monthly reviews
If drift detected: Audit team investigates whether ML divergence is intentional or problematic
If problematic: Initiate formal model update process (manual, with governance approval)

Why No Auto-Retraining?

Automatic retraining would violate audit requirements:

Non-Determinism: Auto-retraining means same inputs → different outputs over time
Audit Trail Corruption: Past decisions become non-reproducible
Governance Gap: ML decisions made without human oversight
Regulatory Risk: "Who approved this model update?" cannot be answered

Our Approach: Humans decide when models are stale. Drift detection just gives them the data.

Implementation Details

Versioning Storage:

# In database audit record
ml_model_version: String(50)  # Populated during scoring
ml_confidence: Float[0,1]     # Populated during scoring (null for rules-only)

Scoring Mode Logic:

scoring_mode = "rules"  # → ml_model_version = null, ml_confidence = null
scoring_mode = "hybrid" # → ml_model_version = "logistic_v1.0", ml_confidence = 0.87
scoring_mode = "ml"     # → ml_model_version = "logistic_v1.0", ml_confidence = 0.87

Compliance Guarantees

Audit Immutability (WORM)

All audit records are Write-Once, Read-Many (WORM):

Write-Once: Audit records cannot be updated or deleted after creation
Read-Many: Audit records can be retrieved unlimited times via /audit/{run_id}
Enforcement: HTTP 409 (Conflict) returned for any update/delete attempts

Why This Matters:

Government compliance requires tamper-proof audit trails
Immutability prevents accidental or malicious record modification
WORM compliance enables regulatory certifications

Technical Implementation:

Application-level guards prevent DELETE/PUT/PATCH on audit endpoints
Database records are append-only
Violations logged for security monitoring

Role-Based Access Control (RBAC)

Fine-grained authorization by role:

Role	/recommend	/explain	/audit/{run_id}	/analytics/*
user	✅	❌	❌	❌
auditor	❌	✅	✅	✅
admin	❌	✅	✅	✅
public (no auth)	❌	❌	❌	❌

Authentication:

JWT-based (HS256)
Demo users: user@example.com, auditor@example.com, admin@example.com
30-minute token expiry

Audit Trail Contents

Every recommendation run includes:

User profile (immutable copy at decision time)
All scheme decisions (eligible and ineligible)
Full decision traces and explanations
Scoring method used (rules, ml, or hybrid)
ML feature contributions (if applicable)
Timestamp and configuration version

Design Decisions

Why Rule-Based Eligibility?

Eligibility is always rule-based, never delegated to ML. This design reflects regulatory requirements:

Explainability: Citizens have a right to understand why they're ineligible for benefits
Determinism: Rules produce identical outputs for identical inputs; ML does not
Auditability: Policy makers define the rules; engineers implement them
Legal defensibility: Government decisions must be traceable to documented policy

ML scoring (ranking eligible schemes) is optional. Eligibility is law; ranking is optimization.

Why ML Only Ranks, Never Determines Eligibility?

Machine learning is read-only for ranking, not decision-making:

Opacity Risk: ML models can fail silently on edge cases (e.g., underrepresented demographics)
Regulatory Gap: ML decisions cannot be audited without access to training data and model internals
Distribution Shift: Models degrade on data different from training set; rules don't
Our Approach: Rules determine eligibility (certain); ML ranks options (advisory)

The hybrid scoring mode demonstrates this: rules pass/fail schemes; ML scores eligible ones.

Why Audit Trail Is Immutable?

The WORM (Write-Once, Read-Many) audit design is mandatory for compliance:

Regulatory Requirement: Government processes require tamper-proof records
Non-Repudiation: System decisions cannot be retroactively altered
Forensics: Security investigations require evidence integrity
HTTP 409 Enforcement: Any modification attempt returns conflict, preventing accidental mutations

Once a decision is logged, it is permanent. This is not optional; it is architectural.

Why Analytics Are Read-Only?

Analytics endpoints never modify data:

Separation of Concerns: Reporting systems should not influence decision systems
Consistency: Read-only queries guarantee consistency across distributed systems
Auditability: All analytics queries are logged; no hidden state changes
Performance: Read-only queries can be replicated and cached without transaction overhead

Four aggregated views (/analytics/*) provide dashboards without exposing raw audit data.

Why LLM Has No Decision Authority?

Natural language explanations are enhancement only, never authoritative:

Determinism: LLM outputs are non-deterministic; audit trails require exact reproduction
Regulatory Gap: Regulators cannot audit AI-generated text; they can audit rules and ML weights
Failure Isolation: LLM unavailability does not degrade system functionality (graceful fallback)
Governance: Decision logic stays with policy makers and engineers, not LLM vendors
Our Approach: LLM explains human decisions; humans make decisions

This is a read-only enhancement. If the LLM service goes down, recommendations still work perfectly.

Non-Goals

This system intentionally does not:

1. Automated Decision-Making via ML

❌ What we don't do: Use ML models to automatically accept or reject policy applications

Why excluded:

Government decisions require transparent, auditable logic
ML models trained on historical data perpetuate historical biases
Citizens have legal right to know criteria applied to their case
Regulators cannot certify opaque systems for benefit allocation

✅ What we do instead: ML ranks already-eligible schemes; humans make accept/reject decisions through rules

2. Opaque Models

❌ What we don't do: Use black-box models (neural networks, gradient boosting) without interpretability

Why excluded:

Government cannot delegate decision logic to uninterpretable systems
Feature importance and decision boundaries must be auditable
Model drift cannot be detected without interpretability
We selected logistic regression (fully transparent) over XGBoost or neural networks

✅ What we do instead: Logistic regression (weights directly interpretable) or rule-based scoring only

3. Real-Time Scheme Mutation

❌ What we don't do: Allow policy rules to change mid-operation or retroactively alter past decisions

Why excluded:

Citizens applying for benefits need stable criteria
Recommendations for identical users should be reproducible
Audits require exact rules at decision time
A/B testing must be explicit, not ad-hoc

✅ What we do instead: Config version is captured with every decision; rules only change via explicit deployment

4. User Profiling Beyond Request Scope

❌ What we don't do: Build persistent user profiles, track behavior across sessions, or create demographic patterns

Why excluded:

Privacy: Citizens should not be surveilled by benefit systems
Compliance: Profiling creates scope creep and regulatory risk
Reproducibility: Decisions should depend only on current request, not history
Consent: User tracking requires explicit consent; simple application shouldn't require it

✅ What we do instead: Stateless decisions based on current request only; audit trails store decisions, not user behavioral data

Evaluation & Results

This section presents offline evaluation results comparing three ranking methods on synthetic policy recommendation data.

Experimental Setup

Data:

10 synthetic government welfare schemes with varying eligibility criteria
100 synthetic user profiles with diverse demographics (gender, category, income, age)
69 users eligible for at least one scheme (remaining ineligible)
Synthetic relevance labels: schemes with max_income ≥ 250,000 and age range ≥ 20 years marked as "high relevance"

Ranking Methods Compared:

Rule-Based: Deterministic scoring using income proximity, age range fit, and category match
ML-Based: Logistic regression trained on synthetic feature combinations (model_version: logistic_v1.0)
Hybrid: Equal-weight average of rule and ML scores

Evaluation Metrics:

NDCG@5 (Normalized Discounted Cumulative Gain at position 5): Measures ranking quality; higher is better (0-1)
Precision@5 (Precision at position 5): Fraction of top-5 results marked relevant; higher is better (0-1)
MAP (Mean Average Precision): Average precision across all recall levels; higher is better (0-1)
MRR (Mean Reciprocal Rank): Position of first relevant result; higher is better (0-1)

Results Summary

Metric	Rule-Based	ML-Based	Hybrid
NDCG@5	0.7404	0.7700	0.7404
Precision@5	0.4725	0.5043	0.4725
MAP	0.7272	0.7510	0.7272
MRR	0.7012	0.7077	0.7012

Interpretation

ML-Based Ranking Shows Modest Improvement:

NDCG@5: +2.96% improvement over rule-based (0.7700 vs 0.7404)
Precision@5: +3.18% improvement over rule-based (0.5043 vs 0.4725)
Hybrid method does not improve over pure rule-based on synthetic data (both achieve same metrics)

Why This Matters:

The ML model captures learned patterns from synthetic feature distributions, slightly improving ranking precision
The improvement is modest (3%) because both rule-based and ML methods are well-aligned on this synthetic dataset
On real historical recommendation data, the improvement could be larger if historical patterns are available

Ablation Study

Rationale for Comparing Three Methods:

The experiment uses ablation—removing or modifying components systematically—to understand what drives recommendation quality:

Rule-Based Baseline (ablation: remove all ML)
- Pure heuristic scoring using domain knowledge
- Deterministic, fully explainable
- Provides a non-ML reference point
ML-Based Method (ablation: remove rules, use learned weights only)
- Logistic regression learns feature importance from data
- Captures patterns rules might miss
- Explainable (weights + feature contributions included)
Hybrid Method (ablation: combine rule + ML equally)
- Average of rule and ML scores
- Tests if ensemble improves single methods
- Provides conservative blending approach

Key Finding: Hybrid (equal-weight average) does not improve over rule-based or ML alone on synthetic data. This suggests:

Rule and ML are already well-calibrated individually
Simple averaging does not add value when both methods are strong
More sophisticated ensemble methods might be needed (weighted voting, stacking) if ensemble is desired
For v1.0.0, rule-based + optional ML ranking (configurable) is cleaner than hybrid

Ablation Validates Model Contribution: By isolating the ML component from rules, we confirm that observed improvements come from learned patterns, not from other factors. This strengthens confidence in the ML ranker's value.

Limitations & Assumptions:

Synthetic Data: Relevance labels are artificial. Real evaluation requires historical user feedback (e.g., "which recommended scheme did beneficiary actually select?")
Small Dataset: 69 user-scheme pairs is too small for robust ML generalization. Production models should be trained on thousands of real historical recommendations.
Logistic Regression: We chose this model for interpretability. It may underperform more complex models (XGBoost, neural networks) on nonlinear patterns. However, the tradeoff is acceptable for compliance: every prediction includes feature contributions.
Feature Engineering: Features are hand-crafted heuristics (age_normalized, income_ratio, category_match, gender_match). More sophisticated features might improve performance.
No Data Drift Simulation: Results assume identical train and test distributions. Real systems must monitor for distribution shift and retrain periodically.
Single Random Seed: Results are from one reproducible run (seed=42). Confidence intervals would require multiple train-test splits.

How to Run the Experiment

To reproduce these results:

conda activate ai
cd policy-recommender-ai
python -m experiments.compare_ranking_methods

This generates:

Console output with NDCG, Precision, MAP, MRR metrics
results.csv with per-user ranking comparisons

Fairness Analysis

The system includes fairness monitoring (no enforcement) to detect demographic representation issues:

Demographic Parity: Recommendation rates per demographic group (gender, category)
Representation Variance: Distribution of top-k recommendations across demographics

These metrics are for governance oversight only. The system does not adjust rankings to achieve fairness targets. Instead, fairness analysis is logged and made available to policy reviewers.

Example Output from Fairness Module:

DEMOGRAPHIC PARITY (Recommendation Rates by Group)

GENDER:
  MALE:   0.55 (55% recommendation rate)
  FEMALE: 0.48 (48% recommendation rate)
  Gap:    7% ⚠ WARNING

CATEGORY:
  EWS:     0.60 (60% recommendation rate)
  GENERAL: 0.52 (52% recommendation rate)
  Gap:     8% ⚠ WARNING

Policy teams use this data to investigate potential biases and adjust eligibility rules if needed.

Interpreting Fairness Metrics

Plain Language Explanation:

Demographic Parity (Recommendation Rates by Group)

This metric answers: "Do different demographic groups get recommended at similar rates?"

Example:

55% of males in the population receive at least one recommendation
48% of females in the population receive at least one recommendation
Gap: 7% (⚠ may warrant investigation)

Why monitor it: If certain groups have significantly lower recommendation rates, the system may be inadvertently excluding them. This could be due to eligibility rules, feature distributions, or ML model bias.

Representation Variance (Top-K Distribution Consistency)

This metric answers: "Are the demographics of top-recommended schemes consistent across users?"

High variance means:

Some users see diverse demographics in their top-3 schemes
Other users see mostly one demographic
Inconsistency could indicate that ranking stability varies by user profile

Low variance means:

All users see similar demographic distributions in top-k
More consistent experience (good or bad—depends on whether distributions are fair)

Fairness vs. Utility Tradeoff:

Utility: Recommending the "best" schemes for each user based on relevance
Fairness: Ensuring demographic groups are represented similarly in recommendations

These can conflict:

Maximizing utility might mean recommending schemes that appeal more to some demographics
Enforcing fairness constraints might reduce utility (recommend slightly less-relevant schemes to balance representation)

Our approach: Monitor, don't enforce. Policy teams see fairness metrics and decide whether adjustments are needed. This preserves both governance transparency and system flexibility.

Why Not Automatic Fairness Enforcement?

Governance Risk: Algorithms cannot decide fairness tradeoffs—that's a policy decision
Unintended Consequences: Automated fairness interventions can backfire (e.g., inverse discrimination concerns)
Auditability: Policy teams must explicitly decide to adjust rules; hidden algorithmic interventions are opaque
Compliance: Regulators prefer transparent analysis over automatic algorithmic adjustments

Key Takeaways

ML improves ranking quality by ~3% on synthetic data (NDCG, Precision metrics)
Ablation study validates that improvement comes from learned patterns, not overfitting
Fairness is monitored continuously but not enforced; governance teams decide policy
Reproducibility is built-in: Deterministic seeds, documented experimental setup, exact command to reproduce
Results are honest: Synthetic data, small scale, hand-crafted features—real evaluation needs historical feedback

Reproducibility

This section documents how to reproduce the evaluation results and verify system behavior.

Environment Setup

Python Version: 3.11.x (tested on 3.11.14)

Conda Environment:

conda activate ai
python --version  # Should output: Python 3.11.x

Verify Dependencies:

pip list | grep -E "scikit-learn|numpy|pandas"
# Expected: scikit-learn 1.7+, numpy 2.4+

Running the Ranking Experiment

Command:

cd policy-recommender-ai
python -m experiments.compare_ranking_methods

Expected Output:

================================================================================
EXPERIMENT: Compare Rule-Based, ML, and Hybrid Ranking Methods
================================================================================

Generating synthetic data...
  Generated 10 schemes and 100 users

Scoring rankings for each user...
  Scored 69 users

Evaluating with ranking metrics...

RESULTS: Ranking Quality Comparison
--------------------------------------------------------------------------------
Metric                         Rule-Based             ML-Based               Hybrid
--------------------------------------------------------------------------------
ndcg@5                             0.7404               0.7700               0.7404
precision@5                        0.4725               0.5043               0.4725
map                                0.7272               0.7510               0.7272
mrr                                0.7012               0.7077               0.7012

Detailed results saved to: results.csv

Duration: <5 seconds

Output Artifacts

results.csv: Per-user ranking comparisons

Location: Project root (policy-recommender-ai/results.csv)
Rows: 70 (header + 69 evaluated users)
Columns: user_id, eligible_schemes, rule_ranking (top-3), ml_ranking (top-3), hybrid_ranking (top-3)
Use case: Analyze individual user ranking differences across methods

Console Metrics: Aggregated performance across all users

NDCG@5, Precision@5, MAP, MRR
Use case: Overall system evaluation

Determinism & Reproducibility

Key Property: Running the experiment multiple times produces identical results.

Why:

Synthetic data generation: np.random.seed(42)
ML model training: LogisticRegression(random_state=42)
No stochasticity in evaluation

Verification:

# Run twice and compare results.csv
python -m experiments.compare_ranking_methods
cp results.csv results_run1.csv
python -m experiments.compare_ranking_methods
diff results_run1.csv results.csv  # Should be identical

Running Unit Tests

# Run all tests
python -m pytest tests/ -v

# Run specific test file
python -m pytest tests/test_eligibility.py -v

# Run specific test
python -m pytest tests/test_audit.py::test_audit_immutability -v

Expected: 90 tests, all passing, deterministic (no flakiness)

Verifying Evaluation Code

Syntax Check:

python -m py_compile src/evaluation/ranking_metrics.py
python -m py_compile src/evaluation/fairness_metrics.py
python -m py_compile experiments/compare_ranking_methods.py
# No output = success

Import Check:

python -c "from src.evaluation.ranking_metrics import ndcg_at_k, precision_at_k, mean_average_precision; print('✓ Ranking metrics imported successfully')"
python -c "from src.evaluation.fairness_metrics import demographic_parity, representation_variance; print('✓ Fairness metrics imported successfully')"

Documentation

For detailed evaluation framework information, see docs/evaluation_overview.md:

What each evaluation file does
Function signatures and use cases
Integration patterns for production
Known limitations

---1. ML improves on heuristics: Even simple logistic regression can improve ranking quality by 3% on synthetic data 2. Interpretability preserved: Feature contributions are always included; no black-box decisions 3. Rules remain authoritative: Eligibility is never affected by ML; it's purely for ranking 4. Fairness is monitorable: Demographic parity and representation are tracked for governance review 5. Evaluation is offline: No A/B testing on live users; offline experiments provide evidence before deployment

Release v1.0.0

Capabilities Summary

✅ Rule-Based Eligibility: Deterministic, auditable policy evaluation for 15+ welfare schemes
✅ Explainable Recommendations: Full decision traces showing passed/failed rules for every scheme
✅ Optional ML Ranking: Logistic regression for scheme relevance (never affects eligibility)
✅ Immutable Audit Trail: WORM compliance with HTTP 409 enforcement for data integrity
✅ Role-Based Access Control: JWT authentication with user/auditor/admin roles
✅ Model Versioning & Drift Detection: Track ML model usage and detect ranking divergence
✅ Production-Grade Analytics: 4 aggregated endpoints for eligibility rates, scoring distribution, top schemes
✅ Optional LLM Integration: Natural language explanations with deterministic fallback

System Boundaries

This system:

Recommends schemes based on eligibility and relevance (does not approve/reject benefits)
Provides data for government policy teams (not an approval system)
Supports auditors and administrators with visibility into recommendations
Maintains immutable audit trails for compliance and oversight
Runs as a stateless API (scales horizontally)

This system does NOT:

Make final eligibility decisions (policy rules define eligibility; system implements rules)
Approve or disburse benefits (system recommends only)
Store personal data beyond request scope (stateless, audit-only storage)
Automatically retrain models (human governance required)
Guarantee optimal recommendations (rules-driven, not optimized)

Future Work (Out of Scope for v1.0.0)

These features are intentionally excluded from v1.0.0 to maintain focus and compliance:

Infrastructure

Multi-instance PostgreSQL persistence (current: single-instance SQLite)
Redis caching for high-traffic scenarios
Load balancing and auto-scaling orchestration
Disaster recovery and backup automation

Analytics & Monitoring

Real-time dashboards (current: read-only aggregated analytics)
Performance monitoring and SLA tracking
Model performance metrics (precision/recall on held-out schemes)
Automated anomaly detection

ML & Versioning

A/B testing framework for scheme ranking algorithms
Automated model retraining pipeline (current: manual, human-governed)
Feature store for ML model inputs
Model performance tracking and versioning

Integration & Scale

OAuth 2.0 for third-party integrations
Batch processing for bulk recommendations
Webhook notifications for recommendation events
Mobile app or SMS gateway

Governance

Advanced audit trail visualization
Approval workflows for scheme rule changes
Impact analysis before rule deployment
User appeals or override mechanisms

These are not missing features; they are intentionally excluded from v1.0.0 to keep the system compliant, auditable, and maintainable. Future versions will evaluate these based on business requirements.

Development Timeline & Iteration Notes

Evolution of the System

This system was developed iteratively, with features added incrementally and validated through testing:

Phase 1: Eligibility Foundation

Implemented deterministic eligibility rules for 15+ government welfare schemes
Built rule engine supporting age, income, state, category, and gender constraints
Comprehensive eligibility testing (26 tests, boundary cases for each constraint)
Design principle: Eligibility is policy-driven, never approximate

Phase 2: Scoring & Ranking

Added rule-based relevance scoring (income proximity, age fit, category match)
Separated eligibility (yes/no gate) from scoring (ranking among eligible schemes)
Introduced configurable scoring modes (rules-only, ML-optional)
Validation: Scoring tests confirm deterministic behavior

Phase 3: ML Ranking

Integrated logistic regression for optional ML-based scheme ranking
ML operates only on eligible schemes; never affects eligibility gate
Feature extraction: age_normalized, income_ratio, category_match, gender_match
Design principle: Interpretability over accuracy (weights transparent to auditors)

Phase 4: Compliance & Audit

Implemented immutable audit trail (WORM: Write-Once, Read-Many)
Built role-based access control (JWT auth, three roles: user/auditor/admin)
Audit tests (20 tests) verify no mutations allowed on logged decisions
RBAC tests (32 tests) validate endpoint-level access control

Phase 5: Evaluation & Fairness

Added production-grade ranking quality metrics (NDCG@5, Precision@5, MAP, MRR)
Built fairness analysis framework (demographic parity, representation variance)
Created offline experiment comparing rule-based, ML, and hybrid ranking methods
Results show ~3% improvement from ML over rule-based heuristics on synthetic data
Fairness analysis is monitoring-only; governance teams decide policy implications

Validation Approach

Testing Philosophy: "Test what matters; skip what's out of scope"

Eligibility correctness: 26 tests with boundary cases (every constraint tested)
RBAC enforcement: 32 tests across protected endpoints (security-critical)
Audit immutability: 20 tests verify HTTP 409 (conflict) on all mutation attempts
ML versioning: 12 tests track model version and drift detection
No mocks: All tests use real integration (FastAPI TestClient against live endpoints)
Deterministic: Every test uses fixed inputs and expected outputs (reproducible, no flakiness)

Result: 90 focused tests covering critical paths. Coverage is sufficient for a compliance-grade government recommendation system.

Design Decisions Preserved Throughout Iterations

Rules as Source of Truth: Eligibility logic is entirely rule-based; ML never overrides eligibility decisions
ML for Ranking, Not Eligibility: Machine learning enhances ranking precision among eligible schemes only
Explainability Preserved at Every Layer: Every decision (eligibility, scoring, ML contribution) includes explanation in JSON responses
Immutable Audit Trail: All decisions are logged immutably; no retroactive modifications allowed
Interpretable Models: Logistic regression chosen over black-box methods for transparency

These principles were established early and maintained through all iterations.

How to Review This Project

This project demonstrates systems engineering for AI applications at production scale. It's a complete backend system (no UI) suitable for roles requiring compliance-aware AI architecture, data integrity, and restraint.

Best-fit roles: Backend Engineer, ML Systems Engineer, Infrastructure/DevOps Engineer, Policy Technology Specialist, Compliance Systems Designer

Start here: Read in this order:

README_MAIN.md — Overview, architecture, design decisions (skip non-goals initially)
app.py — FastAPI endpoints and entry points (read /login, /recommend, /audit flows)
src/db.py — Data model with immutability enforcement (lines 1-50)
tests/test_audit.py — WORM compliance testing (understand HTTP 409 pattern)
src/rbac.py — Role-based access control implementation
SANITY_CHECK.md — Pre-release verification (if evaluating for production)

Three questions this project will be evaluated against:

"How does this system guarantee data integrity for audit compliance?"
- Look for: WORM enforcement in db.py (no updates/deletes), HTTP 409 enforcement in app.py, test validation in test_audit.py
- Core insight: Audit records are immutable by design, not trust
"How do you balance ML recommendations with rule-based eligibility?"
- Look for: Eligibility engine in services/eligibility_engine.py (rules are law), ML ranker in services/ml_ranker.py (ranking only, not eligibility), test separation in test_eligibility.py vs test_versioning.py
- Core insight: Restraint—ML never overrides rules; it improves user experience within rule constraints
"How would you add a new feature without breaking compliance?"
- Look for: Design decisions section in README_MAIN.md, test-first approach in all 4 test files, scope boundaries in Future Work section
- Core insight: Every feature request is evaluated against WORM guarantee, RBAC, and data isolation

Key metrics:

16 endpoints, 100% RBAC-enforced
90 deterministic tests (0 flakiness), no mocks
SQLite with SQLAlchemy WORM enforcement
JWT-based auth, 3 roles
ML versioning and drift detection built-in
Deployment-ready (render.yaml provided)

Contributing

Contributions are welcome. Please ensure code quality and add appropriate tests for new features.

License

[Add your license here]

Contact

For inquiries or support, please contact the development team.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
data		data
docs		docs
experiments		experiments
src		src
tests		tests
.gitignore		.gitignore
IMPLEMENTATION_SUMMARY.md		IMPLEMENTATION_SUMMARY.md
ML_REFRAMING_COMPLETE.md		ML_REFRAMING_COMPLETE.md
ML_VERSIONING_SUMMARY.md		ML_VERSIONING_SUMMARY.md
README.md		README.md
REFRAMING_GUIDE.md		REFRAMING_GUIDE.md
RESUME_BULLETS.md		RESUME_BULLETS.md
SAFE_UPGRADE_COMPLETE.md		SAFE_UPGRADE_COMPLETE.md
SANITY_CHECK.md		SANITY_CHECK.md
app.py		app.py
conda		conda
config.json		config.json
pytest		pytest
pytest.ini		pytest.ini
render.yaml		render.yaml
requirements.txt		requirements.txt
requirements_output.txt		requirements_output.txt
results.csv		results.csv

Folders and files

Latest commit

History

Repository files navigation

Policy Recommender — Fairness-Aware ML Decision System

Overview

Features

Why ML is Used

Tech Stack

Project Structure

Installation

Environment Setup

Required Setup Steps

Dependencies

Running the Application

Verification

Usage

Development

Running Tests

Testing Philosophy

Project Components

Deployment on Render

Quick Start

Health Monitoring

Data Persistence

API Access

Performance & Scaling

Design Decisions

1. Rules as Source of Truth

2. ML for Ranking, Not Eligibility

3. Explainability Preserved at Every Layer

4. Safe Evolution and Continuous Learning

5. ML Model Design: Controlled and Interpretable

6. Evaluation-Aware System Design

Technical Architecture

Service Layer Separation

How to Use This Design

As a Developer

As an Interviewer

As an Auditor

Data Model & Audit Trail

Overview

Database Tables

1. user_profiles

2. recommendation_runs

3. scheme_decisions

Audit Trail Example

Why This Matters

API Endpoints for Audit

Model Versioning & Drift Monitoring

Overview

Model Versioning

Confidence Tracking

Drift Detection Analytics

Why No Auto-Retraining?

Implementation Details

Compliance Guarantees

Audit Immutability (WORM)

Role-Based Access Control (RBAC)

Audit Trail Contents

Design Decisions

Why Rule-Based Eligibility?

Why ML Only Ranks, Never Determines Eligibility?

Why Audit Trail Is Immutable?

Why Analytics Are Read-Only?

Why LLM Has No Decision Authority?

Non-Goals

1. Automated Decision-Making via ML

2. Opaque Models

3. Real-Time Scheme Mutation

4. User Profiling Beyond Request Scope

Evaluation & Results

Experimental Setup

Results Summary

Interpretation

Ablation Study

How to Run the Experiment

Fairness Analysis

Interpreting Fairness Metrics

Key Takeaways

1. `user_profiles`

2. `recommendation_runs`

3. `scheme_decisions`

Packages