An applied machine learning system that ranks government welfare schemes using learned feature contributions while enforcing deterministic eligibility rules. Combines fairness analysis with explainable ML-driven ranking for policy recommendations.
Policy Recommender is an ML-driven decision system that identifies suitable government welfare schemes for citizens. It separates concerns: rule-based eligibility (deterministic, policy-governed) from ML-based ranking (learned relevance patterns). This architecture ensures legal compliance while leveraging learned patterns to improve scheme ordering for end users. The system includes fairness analysis to detect demographic disparities in recommendation distributions.
- ML-Based Ranking: Logistic regression learns feature contributions to predict scheme relevance among eligible options
- Rule-Based Eligibility: Deterministic gate ensures only policy-compliant schemes are considered
- Fairness Analysis: Detects demographic parity issues and representation variance across ranking outputs
- Explainable Decisions: All eligibility, scoring, and ML contributions are explained in structured outputs
- Authentication & RBAC: JWT-based access control with role-based endpoints
- Immutable Audit Trail: Write-once, read-many (WORM) compliance audit records
- Evaluation Metrics: NDCG, Precision@k, MAP for ranking quality assessment
- Offline Experiment Framework: Compare rule-based, ML, and hybrid scoring modes offline
This system uses machine learning for ranking only, not eligibility:
- Eligibility is Policy-Driven: Legal requirements (income, age, state) are non-negotiable rules. ML cannot learn these; they must be explicitly defined.
- Ranking is Data-Driven: Among schemes a user is eligible for, some are more relevant than others. ML captures patterns from features (income proximity, age range fit, category match, gender fit) to improve ranking quality.
- Explainability Over Accuracy: The ML model (logistic regression) is chosen for interpretability. Every prediction includes feature contributions, making decisions defensible to auditors and beneficiaries.
- Fairness Visibility: By analyzing ranking outputs for demographic parity and representation variance, we surface potential biases early for governance decisions.
Trade-off: We sacrifice raw predictive power for interpretability and auditability. This is appropriate for government systems where trust and legality outweigh accuracy.
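The fairness checks described above can be sketched in a few lines of plain Python (the function name is illustrative; the actual implementation lives in `src/evaluation/fairness_metrics.py`):

```python
from collections import defaultdict

def demographic_parity_gap(outcomes):
    """outcomes: list of (group, was_recommended) pairs.
    Returns per-group selection rates and the max pairwise rate gap."""
    counts = defaultdict(lambda: [0, 0])  # group -> [recommended, total]
    for group, recommended in outcomes:
        counts[group][0] += int(recommended)
        counts[group][1] += 1
    rates = {g: rec / total for g, (rec, total) in counts.items()}
    gap = max(rates.values()) - min(rates.values())
    return rates, gap

rates, gap = demographic_parity_gap([
    ("FEMALE", True), ("FEMALE", True), ("FEMALE", False),
    ("MALE", True), ("MALE", False), ("MALE", False),
])
# FEMALE rate = 2/3, MALE rate = 1/3, gap = 1/3
```

A large gap between group selection rates is exactly the kind of disparity the fairness analysis is meant to surface for governance review.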
- Language: Python 3.11
- ML Library: scikit-learn (logistic regression, metrics)
- Evaluation: NDCG, Precision@k, MAP (sklearn.metrics)
- Fairness Analysis: Demographic parity, representation variance
- Architecture: Modular service layer (eligibility, scoring, ML ranking, fairness)
- Data Processing: numpy for numerical feature transformations
policy-recommender-ai/
├── app.py # FastAPI application entry point
├── requirements.txt # Python dependencies
├── README.md # Documentation
├── config.json # Scoring mode configuration
├── src/ # Source code
│ ├── models/ # Data models
│ ├── rules/ # Rule engine
│ ├── services/ # Services (eligibility, scoring, ML ranking)
│ ├── evaluation/ # Evaluation metrics and fairness analysis
│ │ ├── ranking_metrics.py # NDCG, Precision@k, MAP
│ │ └── fairness_metrics.py # Demographic parity, variance
│ └── __init__.py
├── experiments/ # Offline experiments
│ └── compare_ranking_methods.py
├── data/ # Data files and configurations
├── tests/ # Unit and integration tests
└── notebook/ # Analysis notebooks
1. Clone the repository

   ```bash
   git clone <repository-url>
   cd policy-recommender-ai
   ```

2. Activate the Conda environment

   ```bash
   conda activate ai
   ```

3. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

This project requires the Conda environment "ai" with Python 3.11+.
1. Ensure the Conda "ai" environment exists (it should already be present)

   ```bash
   conda env list | grep ai
   ```

2. Activate the environment

   ```bash
   conda activate ai
   ```

3. Verify the Python version

   ```bash
   python --version  # Should output: Python 3.11.x
   ```

4. Install all dependencies

   ```bash
   pip install -r requirements.txt
   ```
The requirements.txt includes:

- `fastapi` - Web framework
- `uvicorn` - ASGI server
- `pydantic` - Data validation
- `numpy` - Numerical computing
- `scikit-learn` - ML library (for logistic regression)
- `sqlalchemy` - ORM for database persistence
Start the API server with auto-reload:

```bash
conda activate ai
cd policy-recommender-ai
uvicorn app:app --reload
```

The API will be available at: http://127.0.0.1:8000
Access Swagger documentation at: http://127.0.0.1:8000/docs
After startup, verify all systems are operational:

```bash
curl http://127.0.0.1:8000/
# Should return: {"status": "healthy", "service": "Policy Recommendation Engine", "version": "1.0.0"}
```

Run the application:

```bash
python app.py
```

The application will start and process user eligibility data to generate policy recommendations.
Run the test suite:

```bash
python -m pytest tests/
```

This project follows a testing-by-scope approach: test what matters for compliance and correctness, skip what's out of scope.
Why These Tests Are Sufficient:
- Eligibility Correctness (26 tests): Every constraint (age, income, state, category, gender) is tested with boundary cases. Eligibility is deterministic; if it's correct, the system's core responsibility is met.
- RBAC Enforcement (32 tests): User/auditor/admin roles are tested against all protected endpoints. This is security-critical; token handling and role checks are comprehensive.
- Audit Immutability (20 tests): WORM is tested across all mutation methods (DELETE, PATCH, PUT). This is compliance-critical; we verify all paths return 409.
- ML Versioning (12 tests): Model version and confidence are captured and tracked. Drift detection RBAC is enforced. This validates the new ML operations features.
- No Mocks: All tests use real integration (FastAPI TestClient against live endpoints). This catches real failures, not mock-only bugs.
- Deterministic Input: Every test uses fixed inputs and expected outputs. No randomness, no flakiness. Tests are reproducible and trustworthy.
What Is NOT Tested (Intentionally):
- ❌ UI/Frontend: This is a backend API only. No UI exists to test.
- ❌ External LLM APIs: LLM integration is optional and gracefully degraded. If LLM is down, system still works.
- ❌ Performance/Load: Not in scope for v1.0. Backend is stateless; scaling is horizontal.
- ❌ Database Migrations: SQLite is simple; no complex migrations exist.
- ❌ Full E2E Workflows: Individual endpoints are tested; E2E testing is out of scope for backend unit/integration testing.
Result: 90 focused tests covering critical paths (eligibility, security, audit, versioning). This is sufficient for a compliance-grade government recommendation system.
- Models (`src/models/`): Define eligibility criteria and policy scheme structures
- Rules (`src/rules/`): Implement decision logic for scheme recommendations
- Services (`src/services/`): Handle recommendation orchestration and explainability
Deploy this service to Render in minutes for production-grade hosting with automatic scaling and monitoring.
1. Push to GitHub

   ```bash
   git push origin main
   ```

2. Connect to Render

   - Visit render.com
   - Create a new Web Service
   - Connect your GitHub repository
   - Select the `policy-recommender-ai` branch

3. Set Environment Variables

   - `JWT_SECRET`: Generate a strong random secret (use `openssl rand -hex 32` or similar)
   - Other variables are auto-configured from `render.yaml`

4. Deploy

   - Render will automatically build and deploy
   - Monitor status in the Render dashboard
The service includes dedicated health endpoints:

- `GET /health` - Render's load balancer checks this every 30 seconds
  - Returns: `{"status": "ok", "service": "policy-recommender-ai", "version": "1.0.0"}`
  - Used by: Container orchestration, monitoring systems, load balancers
- `GET /` - Human-readable health status
  - Returns: `{"status": "healthy", "service": "Policy Recommendation Engine", "version": "1.0.0"}`
- SQLite database persists audit trails to `./audit_trail.db`
- Render ephemeral storage: re-deploying resets the database
- Future: migrate to PostgreSQL for multi-instance deployments
Once deployed to Render:

- Live API: `https://<your-service>.onrender.com`
- API Docs: `https://<your-service>.onrender.com/docs`
- Health Check: `https://<your-service>.onrender.com/health`
Default configuration:

- Instances: 1 (auto-scales on CPU/memory threshold)
- Region: Oregon (customize in `render.yaml`)
- Timeout: 30 seconds per request (standard)

Monitor CPU and memory usage in the Render dashboard.
Decision: Eligibility logic is entirely rule-based; ML never overrides eligibility decisions.
Rationale:
- Government welfare schemes have legal eligibility criteria that cannot be negotiated or learned from data
- Rule-based eligibility is deterministic, auditable, and compliant with policy requirements
- ML can optimize within eligible schemes but cannot legitimize ineligibility
Implementation:
- `eligibility_engine.py` is the single source of truth for eligibility
- `scoring_engine.py` ranks only among already-eligible schemes
- `ml_ranker.py` is optional for ranking, never consulted for eligibility
Why This Matters:
- Audit compliance: Every eligibility decision can be traced to specific rules
- Legal defensibility: Policy enforcers can explain why a scheme was not recommended
- System evolution: Rules can be updated without retraining ML models
Decision: Machine learning enhances ranking precision among eligible schemes but does not drive eligibility.
Rationale:
- Ranking is subjective and data-driven; eligibility is objective and policy-driven
- ML captures patterns from historical recommendation data to improve relevance ordering
- Multiple scoring modes allow controlled experimentation without disrupting eligibility logic
Implementation:
- Three configurable scoring modes:
  - `"rules"`: Pure rule-based scoring (deterministic baseline)
  - `"ml"`: ML-based scoring (learned relevance patterns)
  - `"hybrid"`: Average of rule and ML scores (balanced approach)
- Config-driven mode selection in `config.json`
- Mode can be switched without redeploying the service
Why This Matters:
- Safe experimentation: Interviewers can test ML without breaking production eligibility
- Graceful degradation: If ML unavailable, system falls back to rules
- Continuous improvement: New ML models can be deployed/tested independently
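The mode routing and graceful fallback can be sketched as follows (a simplified illustration; the actual routing lives in `app.py` and reads `config.json`):

```python
import json

def combined_score(rule_score, ml_score, mode):
    """Route a scheme's score through the configured scoring mode.
    Falls back to rules if the ML score is unavailable."""
    if mode == "rules" or ml_score is None:
        return rule_score                   # deterministic baseline / graceful fallback
    if mode == "ml":
        return ml_score                     # learned relevance only
    if mode == "hybrid":
        return (rule_score + ml_score) / 2  # equal-weight average
    raise ValueError(f"Unknown scoring_mode: {mode}")

# The mode would normally come from config.json, e.g. {"scoring_mode": "hybrid"}
mode = json.loads('{"scoring_mode": "hybrid"}')["scoring_mode"]
print(combined_score(60.0, 80.0, mode))  # 70.0
```

Because the fallback path reduces to pure rule-based scoring, an ML outage degrades ranking quality but never availability.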
Decision: Every decision—eligibility, scoring, ML feature contribution—is explained in structured JSON.
Rationale:
- Government systems require audit trails; vague ML scores are not acceptable
- Beneficiaries and caseworkers must understand why a scheme was (or wasn't) recommended
- System is only production-ready if every step is explainable
Implementation:
- Eligibility layer: `eligibility_engine.py` returns a boolean plus a rule explanation
- Scoring layer: `scoring_engine.py` returns a score breakdown (income proximity, age fit, category match, gender match)
- ML layer: `ml_ranker.py` includes feature contributions (e.g., "age contribution: +0.3%")
- API response: the `DecisionTrace` model includes scoring mode, feature details, and all intermediate scores
Why This Matters:
- Trust: Beneficiaries can understand and contest decisions
- Debugging: Product teams can identify model drift or rule violations
- Compliance: Audit teams can reconstruct decision logic for any beneficiary
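For a logistic regression ranker, per-feature contributions fall straight out of the model: each feature's contribution to the logit is its coefficient times its value. A minimal sketch (coefficient and feature values are illustrative, not the trained model's):

```python
def feature_contributions(coefs, features):
    """coefs / features: dicts keyed by feature name.
    Each feature's contribution to the logit = coefficient * value."""
    return {name: coefs[name] * value for name, value in features.items()}

contrib = feature_contributions(
    coefs={"age_normalized": 0.8, "income_ratio": -1.2,
           "category_match": 2.0, "gender_match": 0.5},
    features={"age_normalized": 0.4, "income_ratio": 0.9,
              "category_match": 1.0, "gender_match": 1.0},
)
# category_match contributes +2.0 to the logit; income_ratio contributes -1.08
```

Summing the contributions (plus the intercept) and applying the sigmoid reproduces the model's predicted probability, which is why every score in the response can be decomposed exactly.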
Decision: System architecture supports adding new schemes, rules, and ML models without breaking existing functionality.
Rationale:
- Government policies change; recommendation systems must adapt
- New eligibility criteria should not require retraining or redeploying ML
- A/B testing of new rules or ML models should be possible without disruption
Implementation:
- Modular services: Each service (eligibility, scoring, ML) is independent
- Config-driven experimentation: New rules, weights, and scoring modes are defined in `config.json`
- Offline evaluation: `evaluate.py` compares rule vs ML ranking before production deployment
- Extensible data model: New eligibility criteria or scheme attributes can be added to `schemes.json` without API changes
Why This Matters:
- Scalability: System grows with policy changes
- Risk mitigation: Experiments validated offline before live deployment
- Junior engineer onboarding: Clear separation of concerns makes code understandable
Decision: Use logistic regression (not deep learning); train on synthetic data with domain features.
Rationale:
- Logistic regression is inherently interpretable (coefficients = feature importance)
- Synthetic training data ensures reproducible, deterministic behavior
- No "black box" that inspectors cannot explain
Implementation:
- Model: Logistic Regression from scikit-learn
- Features: age_normalized, income_ratio, category_match, gender_match
- Training: Synthetic dataset with deterministic seed (reproducible)
- Output: 0-100 score + feature-level contributions
Why This Matters:
- Interpretability: Interviewers see exactly how ML reasons about relevance
- Transparency: Model behavior is auditable and reproducible
- Production readiness: Simple model = fewer deployment surprises
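A minimal sketch of this training setup (the synthetic label rule below is illustrative, not the project's actual generator):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)  # deterministic seed -> reproducible model

# Synthetic features: age_normalized, income_ratio, category_match, gender_match
X = rng.random((200, 4))
# Illustrative relevance label: category match dominates, income proximity helps
y = ((2.0 * X[:, 2] + 1.0 * X[:, 1] + rng.normal(0, 0.1, 200)) > 1.5).astype(int)

model = LogisticRegression().fit(X, y)
proba = model.predict_proba(X[:1])[0, 1]  # confidence = P(relevant)
score = round(100 * proba)                # mapped to a 0-100 relevance score
```

The coefficients in `model.coef_` double as the feature importances exposed in the explanation payload, which is the core of the "no black box" argument.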
Decision: Include offline evaluation script comparing rule vs ML ranking consistency.
Rationale:
- Before deploying ML ranking, verify it doesn't contradict rule-based wisdom
- Metrics (top-1 agreement, rank deltas) detect unexpected model behavior
- Evaluation is continuous, not one-time
Implementation:
- `evaluate.py`: Generates synthetic users, compares ranking across modes
- Metrics: Top-1 agreement %, average rank delta, score distributions
- Output: `evaluation_results.json` for trend tracking
Why This Matters:
- Confidence: ML is validated before production
- Debugging: Unexpected agreement drops signal data drift or rule changes
- Stakeholder trust: Numbers, not narratives, justify ML deployment
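The two consistency metrics above can be sketched in a few lines (a simplified illustration; `evaluate.py` presumably aggregates these over many synthetic users):

```python
def compare_rankings(rule_ranked, ml_ranked):
    """Both args: lists of scheme IDs, best first, same membership.
    Returns top-1 agreement and the average absolute rank shift per scheme."""
    top1_agree = rule_ranked[0] == ml_ranked[0]
    deltas = [abs(rule_ranked.index(s) - ml_ranked.index(s)) for s in rule_ranked]
    return top1_agree, sum(deltas) / len(deltas)

agree, avg_delta = compare_rankings(
    ["scheme_a", "scheme_b", "scheme_c"],
    ["scheme_a", "scheme_c", "scheme_b"],
)
# agree = True; avg_delta = (0 + 1 + 1) / 3
```

A sudden drop in top-1 agreement or a jump in average rank delta is the "unexpected model behavior" signal the evaluation is designed to catch.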
User Request
↓
┌─────────────────────────────────────────┐
│ Eligibility Engine (src/services/) │ ← Determines YES/NO
│ - Rule-based only │
│ - No learning │
└─────────────────────────────────────────┘
↓ (if eligible)
┌─────────────────────────────────────────┐
│ Config & Scoring Router (app.py) │ ← Selects mode
│ - Check config.json │
│ - Route to appropriate scorer │
└─────────────────────────────────────────┘
↓
├─→ Rule-Based Scoring (src/services/scoring_engine.py)
│ - Deterministic, explainable
│
├─→ ML Ranking (src/services/ml_ranker.py)
│ - Optional, graceful fallback
│ - Feature contributions
│
└─→ Hybrid (both, averaged)
- Combines approaches
- Traces both paths
↓
┌─────────────────────────────────────────┐
│ API Response (FastAPI) │ ← Structured JSON
│ - DecisionTrace with all details │
│ - Scoring factors & contributions │
│ - Audit-ready output │
└─────────────────────────────────────────┘
- Add a new scheme: Update `data/schemes.json` and the eligibility rules
- Change rule weights: Update `config.json` → restart the API
- Experiment with ML: Set `scoring_mode` in `config.json`, restart
- Validate changes: Run `python evaluate.py` before deployment
- Verify explainability: Call the `/explain` endpoint; confirm all reasoning is structured JSON
- Check eligibility logic: Trace through the `eligibility_engine.py` rules
- Validate ML behavior: Run `evaluate.py`; confirm metrics are reasonable
- Test configuration: Modify `config.json`; confirm the API honors the mode
- Trace a decision: Examine the `DecisionTrace` from the `/recommend` response
- Verify rule compliance: All eligibility rules are implemented in `eligibility_engine.py`
- Check model fairness: Feature contributions in the ML response (age, income, category, gender)
- Review change history: `git log` on `config.json` and `src/services/`
The system persists every recommendation run to SQLite for compliance and debugging. This enables:
- Complete decision traceability (why was a scheme recommended or rejected?)
- Audit trail for policy adjustments and ML model changes
- Benchmarking and performance analysis
- Graceful evolution to PostgreSQL for production deployments
Stores user profiles used in recommendations.
| Column | Type | Purpose |
|---|---|---|
| `id` | UUID | Primary key |
| `age` | Integer | User age |
| `income` | Integer | Annual income (rupees) |
| `state` | String | State of residence |
| `category` | String | Social/income category (EWS, FARMER, etc.) |
| `gender` | String | Gender (MALE, FEMALE, OTHER) |
| `created_at` | DateTime | When the profile was created |
Why store separately:
- Normalize data: One user profile can generate multiple recommendations
- Privacy considerations: Clear separation between immutable profiles and recommendations
Stores metadata for each /recommend API call.
| Column | Type | Purpose |
|---|---|---|
| `run_id` | UUID | Primary key (returned in the API response) |
| `user_profile_id` | UUID FK | Foreign key to the user profile |
| `scoring_mode` | String | "rules", "ml", or "hybrid" |
| `config_version` | String | Config version used during this run |
| `created_at` | DateTime | When the recommendation was generated |
| `total_schemes_checked` | Integer | How many schemes were evaluated |
| `eligible_count` | Integer | How many schemes qualified |
| `ineligible_count` | Integer | How many schemes were rejected |
Why store separately:
- Decouple runs from individual decisions: Query all recommendations for a user or time period easily
- Track configuration changes: Verify that ML was enabled/disabled when decisions were made
- Summary metrics: Quick analytics without joining to scheme decisions
Stores individual eligibility and scoring decisions for each scheme in a run.
| Column | Type | Purpose |
|---|---|---|
| `id` | UUID | Primary key |
| `run_id` | UUID FK | Foreign key to the recommendation run |
| `scheme_id` | String | Scheme identifier |
| `scheme_name` | String | Human-readable scheme name |
| `is_eligible` | Boolean | YES if eligible, NO otherwise |
| `eligibility_reason` | Text | Why the scheme was eligible or rejected |
| `score` | Float | Relevance score (0-100) if eligible, NULL if not |
| `scoring_method` | String | "rules", "ml", or "hybrid" (if eligible) |
| `decision_trace` | JSON | Full DecisionTrace object (passed/failed rules) |
| `scoring_factors` | JSON | Score breakdown (income, age, category, gender) |
| `ml_features` | JSON | ML feature contributions (if ML involved) |
| `created_at` | DateTime | When the decision was made |
Why store as JSON:
- Flexibility: Future changes to decision logic don't require schema migrations
- Debuggability: Exact state of reasoning preserved for audit
- Portability: Easy to export and analyze offline
To audit why a user got specific recommendations:
```bash
# 1. Make a recommendation
curl -X POST http://127.0.0.1:8000/recommend \
  -H "Content-Type: application/json" \
  -d '{
    "age": 35,
    "income": 450000,
    "state": "MH",
    "category": "EWS",
    "gender": "FEMALE"
  }'
# Response includes "audit_run_id": "abc-123-def"

# 2. Retrieve the full audit trail
curl http://127.0.0.1:8000/audit/abc-123-def
# Response includes:
# - User profile used
# - All scheme decisions (eligible and ineligible)
# - Scoring mode (rules, ml, hybrid)
# - Timestamp (when the recommendation was made)
# - Complete decision traces and reasoning
```

For Compliance:
- Policy auditors can trace every decision to specific rules
- All changes are timestamped and linked to config versions
- Replayability: Modify rules, re-run old users, compare outcomes
For Product Evolution:
- A/B test new scoring modes without disrupting production
- Compare rule-based vs ML-based recommendations side-by-side
- Identify when ML diverges significantly from rules (data drift signal)
For Debugging:
- Engineers can examine exact decision reasoning for specific users
- Identify systematic issues (e.g., "why are all farmers ineligible?")
- Validate new rules before production deployment
For Scaling:
- SQLite works for development and small deployments
- Ready to migrate to PostgreSQL with minimal code changes:
  - Connection string changes in `src/db.py`
  - Add Alembic migrations for production safety
  - Increase database connection pooling
| Endpoint | Purpose |
|---|---|
| `POST /recommend` | Generate recommendations (automatically logs the decision run) |
| `GET /audit/{run_id}` | Retrieve the full audit trail for a specific recommendation |
| `POST /explain` | Get eligibility explanations without ranking (no logging) |
The system includes read-only ML operations tracking to support compliance and monitoring:
- ML Model Versioning: Every recommendation run tracks which ML model version was used
- Confidence Scores: Each ML-based decision includes a confidence metric (predicted probability)
- Drift Detection Analytics: Advisory-only analysis comparing ML vs rule-based rankings
Critical Design Principle: These features are for monitoring and auditing only. They never trigger automatic model updates, retraining, or decision changes.
Each audit record includes:

```json
{
  "ml_model_version": "logistic_v1.0",
  "ml_confidence": 0.87,
  "scoring_method": "ml"
}
```

Fields:

- `ml_model_version` (string): e.g., "logistic_v1.0" for ML/hybrid scoring, `null` for rules-only
- `ml_confidence` (float 0–1): Predicted probability from logistic regression
  - For logistic regression: confidence = P(positive class)
  - `null` for rules-only scoring (deterministic, no probability concept)
Why It Matters:
- Enables auditors to identify which recommendations used which model versions
- Supports compliance requirements: "What model was used for this decision?"
- Facilitates model lifecycle management and deprecation
For every ML or hybrid scoring decision:
ML-based decision on scheme X:
- Score: 75/100
- Confidence: 0.87 (model was 87% confident)
- Interpretation: High confidence recommendation
Using Confidence:
- Score alone can be misleading; confidence provides context
- Low confidence (e.g., 0.52) suggests borderline cases
- Auditors can filter for "high confidence recommendations only"
Endpoint: `GET /analytics/model-drift` (auditor+ only)

Compares ML-based ranking against rule-based ranking across recent recommendation runs:

```bash
curl -H "Authorization: Bearer <token>" \
  http://127.0.0.1:8000/analytics/model-drift
```

Response:
```json
{
  "drift_detected": false,
  "average_rank_delta": 8.5,
  "drift_threshold_pct": 15.0,
  "analysis_basis": 47,
  "scheme_drift": {
    "scheme_1": {
      "average_rank_delta_pct": 12.3,
      "observations": 8,
      "drift_flag": false
    },
    "scheme_2": {
      "average_rank_delta_pct": 5.1,
      "observations": 8,
      "drift_flag": false
    }
  },
  "advisory": "This analysis is advisory only. No automated retraining occurs."
}
```

What It Measures:
- For each scheme, calculates how differently ML and rules rank it
- Compares ranking position (not score values) across audit runs
- Flags schemes with >15% average rank change
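The per-scheme calculation might look like this (a hypothetical helper; the 15% threshold mirrors the default above):

```python
def scheme_drift(rule_ranks, ml_ranks, total_schemes, threshold_pct=15.0):
    """rule_ranks / ml_ranks: rank positions (1 = top) for one scheme across
    audit runs. Reports the average rank delta as a % of the list length."""
    deltas = [abs(r - m) for r, m in zip(rule_ranks, ml_ranks)]
    avg_delta_pct = 100.0 * (sum(deltas) / len(deltas)) / total_schemes
    return {
        "average_rank_delta_pct": round(avg_delta_pct, 1),
        "observations": len(deltas),
        "drift_flag": avg_delta_pct > threshold_pct,
    }

print(scheme_drift(rule_ranks=[1, 2, 1, 3], ml_ranks=[1, 3, 2, 3], total_schemes=10))
# average delta = (0+1+1+0)/4 = 0.5 positions -> 5.0% of 10 schemes, no drift flag
```

Note the comparison is over rank positions, not raw scores, so it is insensitive to score-scale differences between the two methods.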
Why Drift Matters:
- Indicates ML model may be diverging from rule-based logic
- Signals potential data shift or model staleness
- Advisory signal for auditors: "Should we retrain or update rules?"
What It Does NOT Do:
- ❌ Does NOT automatically trigger retraining
- ❌ Does NOT update model versions
- ❌ Does NOT change any recommendations retroactively
- ❌ Does NOT replace policy review processes
Expected Actions:
- Audit team reviews drift analysis during monthly reviews
- If drift detected: Audit team investigates whether ML divergence is intentional or problematic
- If problematic: Initiate formal model update process (manual, with governance approval)
Automatic retraining would violate audit requirements:
- Non-Determinism: Auto-retraining means same inputs → different outputs over time
- Audit Trail Corruption: Past decisions become non-reproducible
- Governance Gap: ML decisions made without human oversight
- Regulatory Risk: "Who approved this model update?" cannot be answered
Our Approach: Humans decide when models are stale. Drift detection just gives them the data.
Versioning Storage:

```python
# In the database audit record
ml_model_version: String(50)  # Populated during scoring
ml_confidence: Float          # In [0, 1]; populated during scoring (null for rules-only)
```

Scoring Mode Logic:

```python
scoring_mode = "rules"   # → ml_model_version = null, ml_confidence = null
scoring_mode = "hybrid"  # → ml_model_version = "logistic_v1.0", ml_confidence = 0.87
scoring_mode = "ml"      # → ml_model_version = "logistic_v1.0", ml_confidence = 0.87
```

All audit records are Write-Once, Read-Many (WORM):
- Write-Once: Audit records cannot be updated or deleted after creation
- Read-Many: Audit records can be retrieved unlimited times via `/audit/{run_id}`
- Enforcement: HTTP 409 (Conflict) is returned for any update/delete attempt
Why This Matters:
- Government compliance requires tamper-proof audit trails
- Immutability prevents accidental or malicious record modification
- WORM compliance enables regulatory certifications
Technical Implementation:
- Application-level guards prevent DELETE/PUT/PATCH on audit endpoints
- Database records are append-only
- Violations logged for security monitoring
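The application-level guard can be sketched as an append-only store (a simplified, in-memory illustration; the real enforcement returns HTTP 409 from the API layer and persists to SQLite):

```python
class WormAuditStore:
    """Append-only audit store: each record is written once, then read-only."""

    def __init__(self):
        self._records = {}

    def write(self, run_id, record):
        if run_id in self._records:
            # Mirrors the API's HTTP 409 Conflict on mutation attempts
            raise PermissionError("409 Conflict: audit records are write-once")
        self._records[run_id] = record

    def read(self, run_id):
        return self._records[run_id]  # read-many: no restrictions

    def delete(self, run_id):
        raise PermissionError("409 Conflict: audit records cannot be deleted")

store = WormAuditStore()
store.write("abc-123", {"scoring_mode": "ml"})
# store.write("abc-123", {...})  -> PermissionError (write-once)
# store.delete("abc-123")        -> PermissionError (immutable)
```

Keeping the guard in the application layer (rather than relying on database permissions alone) lets violations be logged for security monitoring, as noted above.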
Fine-grained authorization by role:
| Role | /recommend | /explain | /audit/{run_id} | /analytics/* |
|---|---|---|---|---|
| user | ✅ | ❌ | ❌ | ❌ |
| auditor | ❌ | ✅ | ✅ | ✅ |
| admin | ❌ | ✅ | ✅ | ✅ |
| public (no auth) | ❌ | ❌ | ❌ | ❌ |
Authentication:
- JWT-based (HS256)
- Demo users: user@example.com, auditor@example.com, admin@example.com
- 30-minute token expiry
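An HS256 token sign/verify round trip can be sketched with the standard library alone (illustrative only; a production service would use a maintained JWT library, and the secret below is a placeholder):

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_token(payload: dict, secret: str) -> str:
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(payload).encode())
    sig = hmac.new(secret.encode(), f"{header}.{body}".encode(), hashlib.sha256).digest()
    return f"{header}.{body}.{b64url(sig)}"

def verify_token(token: str, secret: str) -> dict:
    header, body, sig = token.split(".")
    expected = hmac.new(secret.encode(), f"{header}.{body}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(b64url(expected), sig):
        raise PermissionError("401: invalid signature")
    padded = body + "=" * (-len(body) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))

token = sign_token({"sub": "auditor@example.com", "role": "auditor"}, secret="demo-secret")
claims = verify_token(token, "demo-secret")
# claims["role"] is then checked against the RBAC matrix for the requested endpoint
```

The `role` claim is what the endpoint guards compare against the matrix above; expiry would be an additional `exp` claim check.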
Every recommendation run includes:
- User profile (immutable copy at decision time)
- All scheme decisions (eligible and ineligible)
- Full decision traces and explanations
- Scoring method used (rules, ml, or hybrid)
- ML feature contributions (if applicable)
- Timestamp and configuration version
Eligibility is always rule-based, never delegated to ML. This design reflects regulatory requirements:
- Explainability: Citizens have a right to understand why they're ineligible for benefits
- Determinism: Rules produce identical outputs for identical inputs; ML does not
- Auditability: Policy makers define the rules; engineers implement them
- Legal defensibility: Government decisions must be traceable to documented policy
ML scoring (ranking eligible schemes) is optional. Eligibility is law; ranking is optimization.
Machine learning is read-only for ranking, not decision-making:
- Opacity Risk: ML models can fail silently on edge cases (e.g., underrepresented demographics)
- Regulatory Gap: ML decisions cannot be audited without access to training data and model internals
- Distribution Shift: Models degrade on data different from training set; rules don't
- Our Approach: Rules determine eligibility (certain); ML ranks options (advisory)
The hybrid scoring mode demonstrates this: rules pass/fail schemes; ML scores eligible ones.
The WORM (Write-Once, Read-Many) audit design is mandatory for compliance:
- Regulatory Requirement: Government processes require tamper-proof records
- Non-Repudiation: System decisions cannot be retroactively altered
- Forensics: Security investigations require evidence integrity
- HTTP 409 Enforcement: Any modification attempt returns conflict, preventing accidental mutations
Once a decision is logged, it is permanent. This is not optional; it is architectural.
Analytics endpoints never modify data:
- Separation of Concerns: Reporting systems should not influence decision systems
- Consistency: Read-only queries guarantee consistency across distributed systems
- Auditability: All analytics queries are logged; no hidden state changes
- Performance: Read-only queries can be replicated and cached without transaction overhead
Four aggregated views (/analytics/*) provide dashboards without exposing raw audit data.
Natural language explanations are enhancement only, never authoritative:
- Determinism: LLM outputs are non-deterministic; audit trails require exact reproduction
- Regulatory Gap: Regulators cannot audit AI-generated text; they can audit rules and ML weights
- Failure Isolation: LLM unavailability does not degrade system functionality (graceful fallback)
- Governance: Decision logic stays with policy makers and engineers, not LLM vendors
- Our Approach: LLM explains human decisions; humans make decisions
This is a read-only enhancement. If the LLM service goes down, recommendations still work perfectly.
This system intentionally does not:
❌ What we don't do: Use ML models to automatically accept or reject policy applications
Why excluded:
- Government decisions require transparent, auditable logic
- ML models trained on historical data perpetuate historical biases
- Citizens have legal right to know criteria applied to their case
- Regulators cannot certify opaque systems for benefit allocation
✅ What we do instead: ML ranks already-eligible schemes; humans make accept/reject decisions through rules
❌ What we don't do: Use black-box models (neural networks, gradient boosting) without interpretability
Why excluded:
- Government cannot delegate decision logic to uninterpretable systems
- Feature importance and decision boundaries must be auditable
- Model drift cannot be detected without interpretability
- We selected logistic regression (fully transparent) over XGBoost or neural networks
✅ What we do instead: Logistic regression (weights directly interpretable) or rule-based scoring only
❌ What we don't do: Allow policy rules to change mid-operation or retroactively alter past decisions
Why excluded:
- Citizens applying for benefits need stable criteria
- Recommendations for identical users should be reproducible
- Audits require exact rules at decision time
- A/B testing must be explicit, not ad-hoc
✅ What we do instead: Config version is captured with every decision; rules only change via explicit deployment
❌ What we don't do: Build persistent user profiles, track behavior across sessions, or create demographic patterns
Why excluded:
- Privacy: Citizens should not be surveilled by benefit systems
- Compliance: Profiling creates scope creep and regulatory risk
- Reproducibility: Decisions should depend only on current request, not history
- Consent: User tracking requires explicit consent; simple application shouldn't require it
✅ What we do instead: Stateless decisions based on current request only; audit trails store decisions, not user behavioral data
This section presents offline evaluation results comparing three ranking methods on synthetic policy recommendation data.
Data:
- 10 synthetic government welfare schemes with varying eligibility criteria
- 100 synthetic user profiles with diverse demographics (gender, category, income, age)
- 69 users eligible for at least one scheme (remaining ineligible)
- Synthetic relevance labels: schemes with max_income ≥ 250,000 and age range ≥ 20 years marked as "high relevance"
Ranking Methods Compared:
- Rule-Based: Deterministic scoring using income proximity, age range fit, and category match
- ML-Based: Logistic regression trained on synthetic feature combinations (model_version: logistic_v1.0)
- Hybrid: Equal-weight average of rule and ML scores
Evaluation Metrics:
- NDCG@5 (Normalized Discounted Cumulative Gain at position 5): Measures ranking quality; higher is better (0-1)
- Precision@5 (Precision at position 5): Fraction of top-5 results marked relevant; higher is better (0-1)
- MAP (Mean Average Precision): Average precision across all recall levels; higher is better (0-1)
- MRR (Mean Reciprocal Rank): Position of first relevant result; higher is better (0-1)
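These metrics can be reproduced from first principles; a pure-Python sketch for two of them (the project itself uses `sklearn.metrics`):

```python
import math

def precision_at_k(ranked_rels, k=5):
    """ranked_rels: binary relevance labels in ranked order, best first."""
    top = ranked_rels[:k]
    return sum(top) / len(top)

def ndcg_at_k(ranked_rels, k=5):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_rels[:k]))
    ideal = sorted(ranked_rels, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

rels = [1, 0, 1, 0, 0]            # relevance of the top-5 ranked schemes
print(precision_at_k(rels))        # 0.4
print(round(ndcg_at_k(rels), 4))   # 0.9197
```

NDCG rewards placing relevant schemes near the top (the log discount), while Precision@5 only counts how many of the top 5 are relevant, which is why the two metrics can move differently across ranking methods.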
| Metric | Rule-Based | ML-Based | Hybrid |
|---|---|---|---|
| NDCG@5 | 0.7404 | 0.7700 | 0.7404 |
| Precision@5 | 0.4725 | 0.5043 | 0.4725 |
| MAP | 0.7272 | 0.7510 | 0.7272 |
| MRR | 0.7012 | 0.7077 | 0.7012 |
ML-Based Ranking Shows Modest Improvement:
- NDCG@5: +2.96% improvement over rule-based (0.7700 vs 0.7404)
- Precision@5: +3.18% improvement over rule-based (0.5043 vs 0.4725)
- Hybrid method does not improve over pure rule-based on synthetic data (both achieve same metrics)
Why This Matters:
- The ML model captures learned patterns from synthetic feature distributions, slightly improving ranking precision
- The improvement is modest (3%) because both rule-based and ML methods are well-aligned on this synthetic dataset
- On real historical recommendation data, the improvement could be larger if historical patterns are available
Rationale for Comparing Three Methods:
The experiment uses ablation—removing or modifying components systematically—to understand what drives recommendation quality:
1. Rule-Based Baseline (ablation: remove all ML)
   - Pure heuristic scoring using domain knowledge
   - Deterministic, fully explainable
   - Provides a non-ML reference point

2. ML-Based Method (ablation: remove rules, use learned weights only)
   - Logistic regression learns feature importance from data
   - Captures patterns rules might miss
   - Explainable (weights and feature contributions included)

3. Hybrid Method (ablation: combine rule and ML scores equally)
   - Average of rule and ML scores
   - Tests whether the ensemble improves on single methods
   - Provides a conservative blending approach
Key Finding: Hybrid (equal-weight average) does not improve over rule-based or ML alone on synthetic data. This suggests:
- Rule and ML are already well-calibrated individually
- Simple averaging does not add value when both methods are strong
- More sophisticated ensemble methods might be needed (weighted voting, stacking) if ensemble is desired
- For v1.0.0, rule-based + optional ML ranking (configurable) is cleaner than hybrid
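The hybrid method itself is just an equal-weight average; a sketch (the function name and sample scores below are illustrative, not taken from the experiment):

```python
def hybrid_score(rule_score: float, ml_score: float, weight: float = 0.5) -> float:
    """Blend rule-based and ML relevance scores; weight=0.5 gives the
    equal-weight average used by the hybrid method."""
    return (1 - weight) * rule_score + weight * ml_score

# Rank eligible schemes by blended score. Scores are made-up examples,
# given as (rule_score, ml_score) pairs normalized to [0, 1].
scheme_scores = {"scheme_a": (0.8, 0.6), "scheme_b": (0.4, 0.9)}
ranked = sorted(scheme_scores, key=lambda s: hybrid_score(*scheme_scores[s]), reverse=True)
```

A weighted variant only requires changing `weight`; stacking or weighted voting would replace this combination step entirely.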
Ablation Validates Model Contribution: By isolating the ML component from rules, we confirm that observed improvements come from learned patterns, not from other factors. This strengthens confidence in the ML ranker's value.
Limitations & Assumptions:
- Synthetic Data: Relevance labels are artificial. Real evaluation requires historical user feedback (e.g., "which recommended scheme did the beneficiary actually select?")
- Small Dataset: 69 user-scheme pairs is too small for robust ML generalization. Production models should be trained on thousands of real historical recommendations.
- Logistic Regression: We chose this model for interpretability. It may underperform more complex models (XGBoost, neural networks) on nonlinear patterns. However, the tradeoff is acceptable for compliance: every prediction includes feature contributions.
- Feature Engineering: Features are hand-crafted heuristics (age_normalized, income_ratio, category_match, gender_match). More sophisticated features might improve performance.
- No Data Drift Simulation: Results assume identical train and test distributions. Real systems must monitor for distribution shift and retrain periodically.
- Single Random Seed: Results are from one reproducible run (seed=42). Confidence intervals would require multiple train-test splits.
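To make the interpretability claim concrete, here is a sketch of how per-feature contributions can be read off a scikit-learn logistic regression. The feature names match the ones listed above, but the training data is random stand-in data, not the experiment's synthetic dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

FEATURES = ["age_normalized", "income_ratio", "category_match", "gender_match"]

# Stand-in data: 69 random feature vectors with synthetic relevance labels.
rng = np.random.default_rng(42)
X = rng.random((69, len(FEATURES)))
y = (X[:, 1] + X[:, 2] > 1.0).astype(int)

model = LogisticRegression(random_state=42).fit(X, y)

def explain(x):
    """Per-feature contribution to the logit: learned weight * feature value."""
    return {name: float(w * v) for name, w, v in zip(FEATURES, model.coef_[0], x)}

# Contributions plus the intercept reconstruct the model's raw score,
# which is what makes each prediction auditable end-to-end.
contribs = explain(X[0])
logit = sum(contribs.values()) + float(model.intercept_[0])
```

Because the decision decomposes into additive per-feature terms, an auditor can see exactly which feature pushed a scheme up or down the ranking.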
To reproduce these results:

```bash
conda activate ai
cd policy-recommender-ai
python -m experiments.compare_ranking_methods
```

This generates:
- Console output with NDCG, Precision, MAP, MRR metrics
- `results.csv` with per-user ranking comparisons
The system includes fairness monitoring (no enforcement) to detect demographic representation issues:
- Demographic Parity: Recommendation rates per demographic group (gender, category)
- Representation Variance: Distribution of top-k recommendations across demographics
These metrics are for governance oversight only. The system does not adjust rankings to achieve fairness targets. Instead, fairness analysis is logged and made available to policy reviewers.
Example Output from Fairness Module:
```
DEMOGRAPHIC PARITY (Recommendation Rates by Group)

GENDER:
  MALE:   0.55 (55% recommendation rate)
  FEMALE: 0.48 (48% recommendation rate)
  Gap: 7% ⚠ WARNING

CATEGORY:
  EWS:     0.60 (60% recommendation rate)
  GENERAL: 0.52 (52% recommendation rate)
  Gap: 8% ⚠ WARNING
```
Policy teams use this data to investigate potential biases and adjust eligibility rules if needed.
Plain Language Explanation:
Demographic Parity (Recommendation Rates by Group)
This metric answers: "Do different demographic groups get recommended at similar rates?"
Example:
- 55% of males in the population receive at least one recommendation
- 48% of females in the population receive at least one recommendation
- Gap: 7% (⚠ may warrant investigation)
Why monitor it: If certain groups have significantly lower recommendation rates, the system may be inadvertently excluding them. This could be due to eligibility rules, feature distributions, or ML model bias.
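A minimal sketch of this computation (the function name and signature here are illustrative; the project's version is `demographic_parity` in src/evaluation/fairness_metrics.py):

```python
def demographic_parity_gap(profiles, recommended_ids):
    """Recommendation rate per group and the max-min gap between groups.

    profiles maps user_id -> group label (e.g. gender or category);
    recommended_ids is the set of users who received at least one
    recommendation.
    """
    counts = {}
    for user_id, group in profiles.items():
        total, hits = counts.get(group, (0, 0))
        counts[group] = (total + 1, hits + (user_id in recommended_ids))
    rates = {g: hits / total for g, (total, hits) in counts.items()}
    return rates, max(rates.values()) - min(rates.values())
```

With rates of 0.55 vs 0.48 as in the example above, the gap is 0.07, i.e. the 7% warning shown in the fairness output.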
Representation Variance (Top-K Distribution Consistency)
This metric answers: "Are the demographics of top-recommended schemes consistent across users?"
High variance means:
- Some users see diverse demographics in their top-3 schemes
- Other users see mostly one demographic
- Inconsistency could indicate that ranking stability varies by user profile
Low variance means:
- All users see similar demographic distributions in top-k
- More consistent experience (good or bad—depends on whether distributions are fair)
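One way to operationalize this metric (an illustrative formulation; the project's `representation_variance` in src/evaluation/fairness_metrics.py may define it differently) is to compute each user's share of a given demographic in their top-k list, then take the variance of those shares across users:

```python
import statistics

def representation_variance(topk_demographics, group):
    """Variance across users of one group's share of their top-k schemes.

    topk_demographics holds one list per user: the demographic labels of
    the schemes in that user's top-k. 0.0 means every user sees the same
    share of `group`; larger values mean less consistency across users.
    """
    shares = [labels.count(group) / len(labels) for labels in topk_demographics]
    return statistics.pvariance(shares)
```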
Fairness vs. Utility Tradeoff:
- Utility: Recommending the "best" schemes for each user based on relevance
- Fairness: Ensuring demographic groups are represented similarly in recommendations
These can conflict:
- Maximizing utility might mean recommending schemes that appeal more to some demographics
- Enforcing fairness constraints might reduce utility (recommend slightly less-relevant schemes to balance representation)
Our approach: Monitor, don't enforce. Policy teams see fairness metrics and decide whether adjustments are needed. This preserves both governance transparency and system flexibility.
Why Not Automatic Fairness Enforcement?
- Governance Risk: Algorithms cannot decide fairness tradeoffs—that's a policy decision
- Unintended Consequences: Automated fairness interventions can backfire (e.g., inverse discrimination concerns)
- Auditability: Policy teams must explicitly decide to adjust rules; hidden algorithmic interventions are opaque
- Compliance: Regulators prefer transparent analysis over automatic algorithmic adjustments
- ML improves ranking quality by about 3 percentage points on synthetic data (NDCG, Precision metrics)
- Ablation study validates that improvement comes from learned patterns, not overfitting
- Fairness is monitored continuously but not enforced; governance teams decide policy
- Reproducibility is built-in: Deterministic seeds, documented experimental setup, exact command to reproduce
- Results are honest: Synthetic data, small scale, hand-crafted features—real evaluation needs historical feedback
This section documents how to reproduce the evaluation results and verify system behavior.
Python Version: 3.11.x (tested on 3.11.14)
Conda Environment:

```bash
conda activate ai
python --version  # Should output: Python 3.11.x
```

Verify Dependencies:

```bash
pip list | grep -E "scikit-learn|numpy|pandas"
# Expected: scikit-learn 1.7+, numpy 2.4+
```

Command:

```bash
cd policy-recommender-ai
python -m experiments.compare_ranking_methods
```

Expected Output:
```
================================================================================
EXPERIMENT: Compare Rule-Based, ML, and Hybrid Ranking Methods
================================================================================
Generating synthetic data...
Generated 10 schemes and 100 users
Scoring rankings for each user...
Scored 69 users
Evaluating with ranking metrics...

RESULTS: Ranking Quality Comparison
--------------------------------------------------------------------------------
Metric            Rule-Based      ML-Based        Hybrid
--------------------------------------------------------------------------------
ndcg@5            0.7404          0.7700          0.7404
precision@5       0.4725          0.5043          0.4725
map               0.7272          0.7510          0.7272
mrr               0.7012          0.7077          0.7012

Detailed results saved to: results.csv
```
Duration: <5 seconds
`results.csv`: Per-user ranking comparisons
- Location: Project root (`policy-recommender-ai/results.csv`)
- Rows: 70 (header + 69 evaluated users)
- Columns: user_id, eligible_schemes, rule_ranking (top-3), ml_ranking (top-3), hybrid_ranking (top-3)
- Use case: Analyze individual user ranking differences across methods
Console Metrics: Aggregated performance across all users
- NDCG@5, Precision@5, MAP, MRR
- Use case: Overall system evaluation
Key Property: Running the experiment multiple times produces identical results.
Why:
- Synthetic data generation: `np.random.seed(42)`
- ML model training: `LogisticRegression(random_state=42)`
- No stochasticity in evaluation
Verification:

```bash
# Run twice and compare results.csv
python -m experiments.compare_ranking_methods
cp results.csv results_run1.csv
python -m experiments.compare_ranking_methods
diff results_run1.csv results.csv  # Should be identical
```

```bash
# Run all tests
python -m pytest tests/ -v

# Run specific test file
python -m pytest tests/test_eligibility.py -v

# Run specific test
python -m pytest tests/test_audit.py::test_audit_immutability -v
```

Expected: 90 tests, all passing, deterministic (no flakiness)
Syntax Check:

```bash
python -m py_compile src/evaluation/ranking_metrics.py
python -m py_compile src/evaluation/fairness_metrics.py
python -m py_compile experiments/compare_ranking_methods.py
# No output = success
```

Import Check:

```bash
python -c "from src.evaluation.ranking_metrics import ndcg_at_k, precision_at_k, mean_average_precision; print('✓ Ranking metrics imported successfully')"
python -c "from src.evaluation.fairness_metrics import demographic_parity, representation_variance; print('✓ Fairness metrics imported successfully')"
```

For detailed evaluation framework information, see docs/evaluation_overview.md:
- What each evaluation file does
- Function signatures and use cases
- Integration patterns for production
- Known limitations
---

1. ML improves on heuristics: Even simple logistic regression can improve ranking quality by about 3 points on synthetic data
2. Interpretability preserved: Feature contributions are always included; no black-box decisions
3. Rules remain authoritative: Eligibility is never affected by ML; ML is used purely for ranking
4. Fairness is monitorable: Demographic parity and representation are tracked for governance review
5. Evaluation is offline: No A/B testing on live users; offline experiments provide evidence before deployment
Capabilities Summary
- ✅ Rule-Based Eligibility: Deterministic, auditable policy evaluation for 15+ welfare schemes
- ✅ Explainable Recommendations: Full decision traces showing passed/failed rules for every scheme
- ✅ Optional ML Ranking: Logistic regression for scheme relevance (never affects eligibility)
- ✅ Immutable Audit Trail: WORM compliance with HTTP 409 enforcement for data integrity
- ✅ Role-Based Access Control: JWT authentication with user/auditor/admin roles
- ✅ Model Versioning & Drift Detection: Track ML model usage and detect ranking divergence
- ✅ Production-Grade Analytics: 4 aggregated endpoints for eligibility rates, scoring distribution, top schemes
- ✅ Optional LLM Integration: Natural language explanations with deterministic fallback
System Boundaries
This system:
- Recommends schemes based on eligibility and relevance (does not approve/reject benefits)
- Provides data for government policy teams (not an approval system)
- Supports auditors and administrators with visibility into recommendations
- Maintains immutable audit trails for compliance and oversight
- Runs as a stateless API (scales horizontally)
This system does NOT:
- Make final eligibility decisions (policy rules define eligibility; system implements rules)
- Approve or disburse benefits (system recommends only)
- Store personal data beyond request scope (stateless, audit-only storage)
- Automatically retrain models (human governance required)
- Guarantee optimal recommendations (rules-driven, not optimized)
These features are intentionally excluded from v1.0.0 to maintain focus and compliance:
- Multi-instance PostgreSQL persistence (current: single-instance SQLite)
- Redis caching for high-traffic scenarios
- Load balancing and auto-scaling orchestration
- Disaster recovery and backup automation
- Real-time dashboards (current: read-only aggregated analytics)
- Performance monitoring and SLA tracking
- Model performance metrics (precision/recall on held-out schemes)
- Automated anomaly detection
- A/B testing framework for scheme ranking algorithms
- Automated model retraining pipeline (current: manual, human-governed)
- Feature store for ML model inputs
- Model performance tracking and versioning
- OAuth 2.0 for third-party integrations
- Batch processing for bulk recommendations
- Webhook notifications for recommendation events
- Mobile app or SMS gateway
- Advanced audit trail visualization
- Approval workflows for scheme rule changes
- Impact analysis before rule deployment
- User appeals or override mechanisms
These are not missing features; they are intentionally excluded from v1.0.0 to keep the system compliant, auditable, and maintainable. Future versions will evaluate these based on business requirements.
This system was developed iteratively, with features added incrementally and validated through testing:
Phase 1: Eligibility Foundation
- Implemented deterministic eligibility rules for 15+ government welfare schemes
- Built rule engine supporting age, income, state, category, and gender constraints
- Comprehensive eligibility testing (26 tests, boundary cases for each constraint)
- Design principle: Eligibility is policy-driven, never approximate
Phase 2: Scoring & Ranking
- Added rule-based relevance scoring (income proximity, age fit, category match)
- Separated eligibility (yes/no gate) from scoring (ranking among eligible schemes)
- Introduced configurable scoring modes (rules-only, ML-optional)
- Validation: Scoring tests confirm deterministic behavior
Phase 3: ML Ranking
- Integrated logistic regression for optional ML-based scheme ranking
- ML operates only on eligible schemes; never affects eligibility gate
- Feature extraction: age_normalized, income_ratio, category_match, gender_match
- Design principle: Interpretability over accuracy (weights transparent to auditors)
Phase 4: Compliance & Audit
- Implemented immutable audit trail (WORM: Write-Once, Read-Many)
- Built role-based access control (JWT auth, three roles: user/auditor/admin)
- Audit tests (20 tests) verify no mutations allowed on logged decisions
- RBAC tests (32 tests) validate endpoint-level access control
Phase 5: Evaluation & Fairness
- Added production-grade ranking quality metrics (NDCG@5, Precision@5, MAP, MRR)
- Built fairness analysis framework (demographic parity, representation variance)
- Created offline experiment comparing rule-based, ML, and hybrid ranking methods
- Results show an improvement of about 3 percentage points from ML over rule-based heuristics on synthetic data
- Fairness analysis is monitoring-only; governance teams decide policy implications
Testing Philosophy: "Test what matters; skip what's out of scope"
- Eligibility correctness: 26 tests with boundary cases (every constraint tested)
- RBAC enforcement: 32 tests across protected endpoints (security-critical)
- Audit immutability: 20 tests verify HTTP 409 (conflict) on all mutation attempts
- ML versioning: 12 tests track model version and drift detection
- No mocks: All tests use real integration (FastAPI TestClient against live endpoints)
- Deterministic: Every test uses fixed inputs and expected outputs (reproducible, no flakiness)
Result: 90 focused tests covering critical paths. Coverage is sufficient for a compliance-grade government recommendation system.
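As an illustration of the boundary-case style, a self-contained sketch (the rule and function here are hypothetical; the real suite exercises the FastAPI TestClient against live endpoints):

```python
# Hypothetical rule: eligible if annual income <= 250_000 and 18 <= age <= 60.
def is_eligible(income: int, age: int) -> bool:
    return income <= 250_000 and 18 <= age <= 60

def test_income_boundary():
    assert is_eligible(250_000, 30)        # exactly at the limit: eligible
    assert not is_eligible(250_001, 30)    # one unit over: ineligible

def test_age_boundary():
    assert is_eligible(100_000, 18) and is_eligible(100_000, 60)
    assert not is_eligible(100_000, 17) and not is_eligible(100_000, 61)
```

Each threshold is probed exactly at, just below, and just above the limit, with fixed inputs so the tests are fully deterministic.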
- Rules as Source of Truth: Eligibility logic is entirely rule-based; ML never overrides eligibility decisions
- ML for Ranking, Not Eligibility: Machine learning enhances ranking precision among eligible schemes only
- Explainability Preserved at Every Layer: Every decision (eligibility, scoring, ML contribution) includes explanation in JSON responses
- Immutable Audit Trail: All decisions are logged immutably; no retroactive modifications allowed
- Interpretable Models: Logistic regression chosen over black-box methods for transparency
These principles were established early and maintained through all iterations.
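As an illustration of the "immutable by design" principle, here is a minimal sketch of ORM-level WORM enforcement, assuming SQLAlchemy (the model and exception names are hypothetical; the project's actual enforcement lives in src/db.py, with app.py translating violations into HTTP 409):

```python
from sqlalchemy import Column, Integer, String, create_engine, event
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class AuditRecord(Base):
    """Stand-in for the project's audit model."""
    __tablename__ = "audit"
    id = Column(Integer, primary_key=True)
    decision = Column(String)

class ImmutableRecordError(Exception):
    """The API layer would map this to an HTTP 409 Conflict response."""

@event.listens_for(Session, "before_flush")
def block_audit_mutations(session, flush_context, instances):
    # Inserts (session.new) are allowed: write-once. Updates and deletes
    # of existing audit records are rejected before reaching the database.
    for obj in list(session.dirty) + list(session.deleted):
        if isinstance(obj, AuditRecord):
            raise ImmutableRecordError("audit records are write-once")

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
```

Because the check runs at flush time, no code path can mutate a logged decision, regardless of which endpoint or service issues the write.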
This project demonstrates systems engineering for AI applications at production scale. It's a complete backend system (no UI) suitable for roles requiring compliance-aware AI architecture, data integrity, and restraint.
Best-fit roles: Backend Engineer, ML Systems Engineer, Infrastructure/DevOps Engineer, Policy Technology Specialist, Compliance Systems Designer
Start here: Read in this order:
- README_MAIN.md — Overview, architecture, design decisions (skip non-goals initially)
- app.py — FastAPI endpoints and entry points (read the `/login`, `/recommend`, and `/audit` flows)
- src/db.py — Data model with immutability enforcement (lines 1-50)
- tests/test_audit.py — WORM compliance testing (understand the HTTP 409 pattern)
- src/rbac.py — Role-based access control implementation
- SANITY_CHECK.md — Pre-release verification (if evaluating for production)
Three questions this project will be evaluated against:
- "How does this system guarantee data integrity for audit compliance?"
  - Look for: WORM enforcement in db.py (no updates/deletes), HTTP 409 enforcement in app.py, test validation in test_audit.py
  - Core insight: Audit records are immutable by design, not by trust
- "How do you balance ML recommendations with rule-based eligibility?"
  - Look for: Eligibility engine in services/eligibility_engine.py (rules are law), ML ranker in services/ml_ranker.py (ranking only, not eligibility), test separation in test_eligibility.py vs test_versioning.py
  - Core insight: Restraint. ML never overrides rules; it improves the user experience within rule constraints
- "How would you add a new feature without breaking compliance?"
  - Look for: Design decisions section in README_MAIN.md, test-first approach in all 4 test files, scope boundaries in the Future Work section
  - Core insight: Every feature request is evaluated against the WORM guarantee, RBAC, and data isolation
Key metrics:
- 16 endpoints, 100% RBAC-enforced
- 90 deterministic tests (0 flakiness), no mocks
- SQLite with SQLAlchemy WORM enforcement
- JWT-based auth, 3 roles
- ML versioning and drift detection built-in
- Deployment-ready (render.yaml provided)
Contributions are welcome. Please ensure code quality and add appropriate tests for new features.
[Add your license here]
For inquiries or support, please contact the development team.