
Evaluation Framework Overview

This document provides a guide to the evaluation infrastructure for the Policy Recommender system.

Evaluation Files

1. src/evaluation/ranking_metrics.py

Purpose: Generic ranking quality metrics for information retrieval evaluation.

Functions:

  • ndcg_at_k(relevances, k) - Normalized Discounted Cumulative Gain at position k
  • precision_at_k(relevances, k, threshold) - Precision of top-k recommendations
  • mean_average_precision(relevances_list, k) - Mean Average Precision across multiple queries
  • reciprocal_rank(relevances, threshold) - Position of first relevant result
  • mean_reciprocal_rank(relevances_list, threshold) - MRR across multiple queries
  • evaluate_ranking(relevances, k, threshold) - Single ranking evaluation
  • evaluate_ranking_batch(relevances_list, k, threshold) - Batch evaluation

When to Use: Any time you need to measure ranking quality. These functions are reusable for new datasets or ranking methods.

Inputs: List of relevance scores (0-1) for items in ranked order

Outputs: Dictionary with NDCG, Precision, MAP, and MRR values
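To illustrate the expected behavior, here is a minimal self-contained sketch of `ndcg_at_k` (a standard NDCG formulation, not the actual `src/evaluation/ranking_metrics.py` implementation; the relevance values are invented):

```python
import math

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the given ranking divided by DCG of the ideal ranking."""
    def dcg(rels):
        # Log-discounted gain: position i contributes rel / log2(i + 2).
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# A ranking that places the most relevant item second instead of first:
print(round(ndcg_at_k([0.0, 1.0, 0.5], k=3), 4))  # 0.6697
```

A perfectly ordered list (`[1.0, 0.5, 0.0]`) scores 1.0; misplacing relevant items lowers the score toward 0.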


2. src/evaluation/fairness_metrics.py

Purpose: Analyze demographic representation and fairness in recommendation rankings.

Functions:

  • demographic_parity(recommendations, demographic_attr, outcome_attr) - Recommendation rates per demographic group
  • parity_gap(rates) - Maximum difference in recommendation rates across groups
  • representation_variance(rankings, demographic_attr, top_k) - Distribution consistency of demographics in top-k
  • fairness_report(recommendations, rankings, demographic_attrs, top_k) - Comprehensive fairness analysis
  • fairness_summary(report) - Human-readable fairness report with warnings

When to Use: Periodically (monthly/quarterly) to monitor demographic representation in ranking outputs.

Inputs:

  • recommendations: List of recommendation records with demographic attributes
  • rankings: List of ranked recommendation lists
  • demographic_attrs: List of demographic attributes to analyze (e.g., ["gender", "category"])

Outputs: Dictionary with parity rates, gaps, and representation variances; human-readable summary

Important: These metrics are analysis-only. They do not enforce constraints or adjust rankings automatically. Governance teams use the output to make policy decisions.
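As a sketch of what the parity analysis computes (function bodies and data are illustrative; only the names and documented signatures come from `src/evaluation/fairness_metrics.py`):

```python
def demographic_parity(recommendations, demographic_attr, outcome_attr):
    """Recommendation rate (mean outcome) per demographic group."""
    groups = {}
    for rec in recommendations:
        groups.setdefault(rec[demographic_attr], []).append(rec[outcome_attr])
    return {group: sum(vals) / len(vals) for group, vals in groups.items()}

def parity_gap(rates):
    """Largest difference in recommendation rates across groups."""
    return max(rates.values()) - min(rates.values())

# Invented records: two groups with different recommendation rates.
recs = [
    {"gender": "F", "recommended": 1},
    {"gender": "F", "recommended": 1},
    {"gender": "M", "recommended": 1},
    {"gender": "M", "recommended": 0},
]
rates = demographic_parity(recs, "gender", "recommended")
print(rates)              # {'F': 1.0, 'M': 0.5}
print(parity_gap(rates))  # 0.5
```

A gap of 0.5 here would be flagged in the human-readable summary; what to do about it remains a governance decision, consistent with the analysis-only design above.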


3. experiments/compare_ranking_methods.py

Purpose: Offline experiment comparing three ranking methods on synthetic data.

Methods Compared:

  1. Rule-Based: Deterministic feature heuristics (income proximity, age fit, category match)
  2. ML-Based: Logistic regression trained on synthetic features
  3. Hybrid: Equal-weight average of rule and ML scores

Workflow:

  1. Generate synthetic schemes (10) and users (100)
  2. Filter to eligible users (69)
  3. Score each user's eligible schemes with three methods
  4. Evaluate with NDCG@5, Precision@5, MAP, MRR
  5. Save per-user results to CSV
  6. Print comparison table
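Step 3's hybrid method is an equal-weight average of the rule and ML scores. A minimal sketch (scheme names and scores are invented for illustration):

```python
def hybrid_score(rule_score, ml_score):
    """Equal-weight blend of the rule-based and ML scores."""
    return 0.5 * rule_score + 0.5 * ml_score

# One user's eligible schemes, as (rule_score, ml_score) pairs:
schemes = {"scheme_a": (0.9, 0.6), "scheme_b": (0.4, 0.8)}
hybrid = {name: hybrid_score(r, m) for name, (r, m) in schemes.items()}
ranking = sorted(hybrid, key=hybrid.get, reverse=True)
print(ranking)  # ['scheme_a', 'scheme_b']
```

Because the weights are equal, the hybrid ranking matches the rule-based ranking whenever the two methods agree on relative order, which is why the hybrid column in the results below tracks the rule-based column so closely.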

Running the Evaluation

Prerequisites

# Ensure you're in the correct environment
conda activate ai

# Verify Python version (3.11+)
python --version

# Install dependencies
pip install -r requirements.txt

Required Packages:

  • numpy (numerical operations)
  • scikit-learn (logistic regression, ML utilities)
  • fastapi, uvicorn (for API server, if needed)
  • pytest (for unit tests)

Command: Run the Ranking Experiment

cd policy-recommender-ai
python -m experiments.compare_ranking_methods

Expected Duration: <5 seconds

Console Output:

================================================================================
EXPERIMENT: Compare Rule-Based, ML, and Hybrid Ranking Methods
================================================================================

Generating synthetic data...
  Generated 10 schemes and 100 users

Scoring rankings for each user...
  Scored 69 users

Evaluating with ranking metrics...

RESULTS: Ranking Quality Comparison
--------------------------------------------------------------------------------
Metric                         Rule-Based             ML-Based               Hybrid
--------------------------------------------------------------------------------
ndcg@5                             0.7404               0.7700               0.7404
precision@5                        0.4725               0.5043               0.4725
map                                0.7272               0.7510               0.7272
mrr                                0.7012               0.7077               0.7012

Detailed results saved to: results.csv

Output Artifacts

1. results.csv

Located in project root. Contains per-user ranking comparisons:

Columns:

  • user_id: Synthetic user identifier
  • eligible_schemes: Number of schemes user is eligible for
  • rule_ranking_1,2,3: Top-3 schemes under rule-based ranking
  • ml_ranking_1,2,3: Top-3 schemes under ML-based ranking
  • hybrid_ranking_1,2,3: Top-3 schemes under hybrid ranking

Use Case: Analyze how rankings differ across methods for individual users. Useful for debugging model decisions or understanding typical recommendation variance.
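One way to analyze method disagreement is to compare the top-1 columns row by row. A sketch using the standard-library `csv` module (the inline sample rows and the column subset are illustrative, standing in for the real `results.csv`):

```python
import csv
import io

# Two made-up rows in the results.csv column layout described above:
sample = io.StringIO(
    "user_id,rule_ranking_1,ml_ranking_1\n"
    "u1,scheme_a,scheme_a\n"
    "u2,scheme_b,scheme_c\n"
)
rows = list(csv.DictReader(sample))
disagree = sum(r["rule_ranking_1"] != r["ml_ranking_1"] for r in rows)
print(f"top-1 disagreement: {disagree}/{len(rows)}")  # top-1 disagreement: 1/2
```

Against the real file, replace the `StringIO` sample with `open("results.csv")`.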


Reproducibility Notes

Randomness & Seeds

Determinism:

  • Synthetic data generation: np.random.seed(42) ensures identical data across runs
  • ML model: LogisticRegression(random_state=42) ensures reproducible predictions
  • Expected: Running the experiment twice produces identical results
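The reproducibility claim can be checked directly: re-seeding before generation makes two runs bitwise identical. A small demonstration (the function name and data are invented; only the `np.random.seed(42)` pattern comes from the experiment):

```python
import numpy as np

def generate_incomes(n):
    # Re-seeding before every generation pass makes each run identical.
    np.random.seed(42)
    return np.random.rand(n)

run_a = generate_incomes(5)
run_b = generate_incomes(5)
print(np.array_equal(run_a, run_b))  # True
```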

Running Again:

python -m experiments.compare_ranking_methods

Will overwrite results.csv with identical content (same seed).


Python Version & Environment

Tested On:

  • Python 3.11.x
  • Conda environment ai
  • scikit-learn 1.7+
  • numpy 2.4+

Environment Activation:

conda activate ai

Integration with API

The evaluation metrics can be integrated into the running API for production monitoring:

from src.evaluation.ranking_metrics import evaluate_ranking_batch
from src.evaluation.fairness_metrics import fairness_report

# After collecting real ranking feedback:
relevances_list = [...]  # one list of 0/1 relevance labels per query, in ranked order
metrics = evaluate_ranking_batch(relevances_list, k=5)

# Check for fairness issues:
recommendations = [...]  # records with demographic attributes
rankings = [...]         # ranked recommendation lists
fairness_data = fairness_report(recommendations, rankings, ["gender", "category"])

However, this integration is not implemented in v1.0.0; it remains future work.


Testing Evaluation Code

Unit tests belong in tests/. They are not yet part of the evaluation infrastructure, but can be added:

# If tests are added:
python -m pytest tests/test_evaluation.py -v

Limitations & Transparency

  1. Synthetic Data: Relevance labels are artificial. Real evaluation requires historical user feedback.
  2. Small Scale: 69 eligible users is toy-scale data for a proof of concept. Production evaluation requires thousands of records.
  3. Fixed Features: Features are deterministic. Real ML would benefit from online learning or periodic retraining.
  4. No Distribution Shift Monitoring: Results assume stationary data. Production needs drift detection.

Next Steps

  • Collect real historical data (which scheme did the beneficiary actually select?)
  • Retrain ML models monthly with new feedback
  • Monitor NDCG, Precision, Demographic Parity continuously
  • Set SLOs: e.g., "NDCG@5 ≥ 0.75", "Parity Gap < 5%"
  • Implement A/B testing for ranking algorithm changes
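The SLO idea above can be sketched as a simple threshold check (thresholds and metric names are the hypothetical examples from the bullet list, not existing configuration):

```python
# Hypothetical SLOs: ("min", t) means the metric must stay at or above t,
# ("max", t) means it must stay at or below t.
SLOS = {"ndcg@5": ("min", 0.75), "parity_gap": ("max", 0.05)}

def check_slos(metrics):
    """Return the names of any metrics that violate their SLO."""
    violations = []
    for name, (kind, threshold) in SLOS.items():
        value = metrics[name]
        if (kind == "min" and value < threshold) or (kind == "max" and value > threshold):
            violations.append(name)
    return violations

print(check_slos({"ndcg@5": 0.77, "parity_gap": 0.08}))  # ['parity_gap']
```

Such a check could run alongside the periodic fairness report, alerting when a metric drifts out of bounds.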

For questions about specific metrics or implementation details, see: