This document provides a guide to the evaluation infrastructure for the Policy Recommender system.
Purpose: Generic ranking quality metrics for information retrieval evaluation.
Functions:
- `ndcg_at_k(relevances, k)` - Normalized Discounted Cumulative Gain at position k
- `precision_at_k(relevances, k, threshold)` - Precision of top-k recommendations
- `mean_average_precision(relevances_list, k)` - Mean Average Precision across multiple queries
- `reciprocal_rank(relevances, threshold)` - Position of first relevant result
- `mean_reciprocal_rank(relevances_list, threshold)` - MRR across multiple queries
- `evaluate_ranking(relevances, k, threshold)` - Single ranking evaluation
- `evaluate_ranking_batch(relevances_list, k, threshold)` - Batch evaluation
When to Use: Any time you need to measure ranking quality. These functions are reusable for new datasets or ranking methods.
Inputs: List of relevance scores (0-1) for items in ranked order
Outputs: Dictionary with NDCG, Precision, MAP, MRR values
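To illustrate what these metrics compute, here is a minimal standalone NDCG@k sketch. This is not the project's implementation (the real `ndcg_at_k` lives in src/evaluation/ranking_metrics.py); it only shows the underlying formula: DCG of the observed ordering divided by DCG of the ideal ordering.

```python
import math

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the given ranking divided by DCG of the ideal ordering."""
    def dcg(rels):
        # Log-discounted gain: position i contributes rel / log2(i + 2)
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# A perfectly ordered ranking scores exactly 1.0; misordering lowers the score.
print(ndcg_at_k([1.0, 0.8, 0.2, 0.0], k=4))  # 1.0 (already in ideal order)
print(ndcg_at_k([0.0, 0.2, 0.8, 1.0], k=4) < 1.0)  # True
```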
Purpose: Analyze demographic representation and fairness in recommendation rankings.
Functions:
- `demographic_parity(recommendations, demographic_attr, outcome_attr)` - Recommendation rates per demographic group
- `parity_gap(rates)` - Maximum difference in recommendation rates across groups
- `representation_variance(rankings, demographic_attr, top_k)` - Distribution consistency of demographics in top-k
- `fairness_report(recommendations, rankings, demographic_attrs, top_k)` - Comprehensive fairness analysis
- `fairness_summary(report)` - Human-readable fairness report with warnings
When to Use: Periodically (monthly/quarterly) to monitor demographic representation in ranking outputs.
Inputs:
- `recommendations`: List of recommendation records with demographic attributes
- `rankings`: List of ranked recommendation lists
- `demographic_attrs`: List of demographic attributes to analyze (e.g., ["gender", "category"])
Outputs: Dictionary with parity rates, gaps, and representation variances; human-readable summary
Important: These metrics are analysis-only. They do not enforce constraints or adjust rankings automatically. Governance teams use the output to make policy decisions.
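To make the analysis-only nature concrete, here is a minimal sketch of what `demographic_parity` and `parity_gap` might compute. It assumes recommendation records are dicts with a boolean outcome field; the field names `gender` and `recommended` below are illustrative, not the project's actual schema.

```python
from collections import defaultdict

def demographic_parity(recommendations, demographic_attr, outcome_attr):
    """Fraction of records with a positive outcome, per demographic group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for rec in recommendations:
        group = rec[demographic_attr]
        totals[group] += 1
        positives[group] += 1 if rec[outcome_attr] else 0
    return {g: positives[g] / totals[g] for g in totals}

def parity_gap(rates):
    """Largest difference in recommendation rates between any two groups."""
    return max(rates.values()) - min(rates.values())

recs = [
    {"gender": "F", "recommended": True},
    {"gender": "F", "recommended": False},
    {"gender": "M", "recommended": True},
    {"gender": "M", "recommended": True},
]
rates = demographic_parity(recs, "gender", "recommended")
print(rates, parity_gap(rates))  # {'F': 0.5, 'M': 1.0} 0.5
```

Note that nothing here mutates the rankings; the output is purely descriptive, which is what makes these metrics safe to run against production logs.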
Purpose: Offline experiment comparing three ranking methods on synthetic data.
Methods Compared:
- Rule-Based: Deterministic feature heuristics (income proximity, age fit, category match)
- ML-Based: Logistic regression trained on synthetic features
- Hybrid: Equal-weight average of rule and ML scores
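The hybrid method is a simple blend of the other two. A one-line sketch, assuming both scores are already normalized to [0, 1] (the weight parameter `w` is illustrative):

```python
def hybrid_score(rule_score, ml_score, w=0.5):
    """Weighted blend of rule and ML scores; w=0.5 gives the equal-weight hybrid."""
    return w * rule_score + (1 - w) * ml_score

print(hybrid_score(0.5, 1.0))  # 0.75
```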
Workflow:
- Generate synthetic schemes (10) and users (100)
- Filter to eligible users (69)
- Score each user's eligible schemes with three methods
- Evaluate with NDCG@5, Precision@5, MAP, MRR
- Save per-user results to CSV
- Print comparison table
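The workflow above can be sketched end to end. Everything below is an illustrative stand-in, not the actual module internals: the scheme/user fields, the eligibility rule, and the income-proximity heuristic are all assumptions.

```python
import numpy as np

np.random.seed(42)  # fixed seed, as in the real experiment

# Toy stand-ins for the synthetic generators (fields are illustrative).
schemes = [{"id": i, "min_income": int(np.random.randint(0, 50_000))} for i in range(10)]
users = [{"id": i, "income": int(np.random.randint(0, 80_000))} for i in range(100)]

def rule_score(user, scheme):
    # Income-proximity heuristic: closer to the scheme threshold scores higher.
    return 1.0 / (1.0 + abs(user["income"] - scheme["min_income"]) / 10_000)

eligible_users = 0
for user in users:
    eligible = [s for s in schemes if user["income"] >= s["min_income"]]
    if not eligible:
        continue  # users with no eligible schemes are dropped
    eligible_users += 1
    top5 = sorted(eligible, key=lambda s: rule_score(user, s), reverse=True)[:5]

print(f"Scored {eligible_users} eligible users")
```

The real experiment repeats the scoring step for all three methods and feeds the resulting rankings into the metrics above.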
```bash
# Ensure you're in the correct environment
conda activate ai

# Verify Python version (3.11+)
python --version

# Install dependencies
pip install -r requirements.txt
```
Required Packages:
- numpy (numerical operations)
- scikit-learn (logistic regression, ML utilities)
- fastapi, uvicorn (for API server, if needed)
- pytest (for unit tests)
```bash
cd policy-recommender-ai
python -m experiments.compare_ranking_methods
```
Expected Duration: <5 seconds
Console Output:
```
================================================================================
EXPERIMENT: Compare Rule-Based, ML, and Hybrid Ranking Methods
================================================================================
Generating synthetic data...
Generated 10 schemes and 100 users
Scoring rankings for each user...
Scored 69 users
Evaluating with ranking metrics...

RESULTS: Ranking Quality Comparison
--------------------------------------------------------------------------------
Metric          Rule-Based      ML-Based        Hybrid
--------------------------------------------------------------------------------
ndcg@5          0.7404          0.7700          0.7404
precision@5     0.4725          0.5043          0.4725
map             0.7272          0.7510          0.7272
mrr             0.7012          0.7077          0.7012

Detailed results saved to: results.csv
```
results.csv is located in the project root and contains per-user ranking comparisons.
Columns:
- `user_id`: Synthetic user identifier
- `eligible_schemes`: Number of schemes user is eligible for
- `rule_ranking_1,2,3`: Top-3 schemes under rule-based ranking
- `ml_ranking_1,2,3`: Top-3 schemes under ML-based ranking
- `hybrid_ranking_1,2,3`: Top-3 schemes under hybrid ranking
Use Case: Analyze how rankings differ across methods for individual users. Useful for debugging model decisions or understanding typical recommendation variance.
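For example, top-1 agreement between methods can be computed directly from those columns. The rows below are made-up stand-ins for real results.csv content, kept in-memory so the sketch runs anywhere:

```python
import csv
import io

# Hypothetical rows mirroring the results.csv schema described above.
sample = """user_id,eligible_schemes,rule_ranking_1,ml_ranking_1,hybrid_ranking_1
u1,4,scheme_a,scheme_a,scheme_a
u2,3,scheme_b,scheme_c,scheme_b
"""

rows = list(csv.DictReader(io.StringIO(sample)))
# Count users whose top-1 pick is the same under rule-based and ML ranking.
agree = sum(r["rule_ranking_1"] == r["ml_ranking_1"] for r in rows)
print(f"top-1 agreement: {agree}/{len(rows)}")  # top-1 agreement: 1/2
```

Against the real file, replace `io.StringIO(sample)` with `open("results.csv")`.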
Determinism:
- Synthetic data generation: `np.random.seed(42)` ensures identical data across runs
- ML model: `LogisticRegression(random_state=42)` ensures reproducible predictions
- Expected: Running the experiment twice produces identical results
Running Again:
```bash
python -m experiments.compare_ranking_methods
```
Will overwrite results.csv with identical content (same seed).
Tested On:
- Python 3.11.x
- Conda environment `ai`
- scikit-learn 1.7+
- numpy 2.4+
Environment Activation:
```bash
conda activate ai
```
The evaluation metrics can be integrated into the running API for production monitoring:
```python
from src.evaluation.ranking_metrics import evaluate_ranking_batch
from src.evaluation.fairness_metrics import fairness_report

# After collecting real ranking feedback:
relevances = [...]  # [0, 1] labels for each recommendation
metrics = evaluate_ranking_batch(relevances, k=5)

# Check for fairness issues:
recommendations = [...]  # Records with demographic data
rankings = [...]         # Ranked recommendation lists
fairness_data = fairness_report(recommendations, rankings, ["gender", "category"])
```
However, this integration is not implemented in v1.0.0; it is left as future work.
Unit tests live under tests/ (evaluation tests are not included yet, but can be added):
```bash
# If tests are added:
python -m pytest tests/test_evaluation.py -v
```
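A test file might look like the sketch below. It is shown against a local reference implementation of `reciprocal_rank` so it runs standalone; real tests would instead import from src/evaluation/ranking_metrics.py.

```python
# Hypothetical tests/test_evaluation.py sketch.
def reciprocal_rank(relevances, threshold=0.5):
    """Local reference implementation: 1/rank of the first relevant item."""
    for i, r in enumerate(relevances):
        if r >= threshold:
            return 1.0 / (i + 1)
    return 0.0

def test_reciprocal_rank_first_hit():
    # First relevant item at position 3 -> RR = 1/3
    assert reciprocal_rank([0, 0, 1]) == 1.0 / 3

def test_reciprocal_rank_no_hit():
    # No relevant items -> RR = 0
    assert reciprocal_rank([0, 0, 0]) == 0.0

# pytest discovers the test_* functions; called directly here for illustration.
test_reciprocal_rank_first_hit()
test_reciprocal_rank_no_hit()
print("ok")
```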
- Synthetic Data: Relevance labels are artificial. Real evaluation requires historical user feedback.
- Small Scale: 69 users is toy data for proof-of-concept. Production requires thousands of records.
- Fixed Features: Features are deterministic. Real ML would benefit from online learning or periodic retraining.
- No Distribution Shift Monitoring: Results assume stationary data. Production needs drift detection.
- Collect real historical data (which scheme did beneficiary actually select?)
- Retrain ML models monthly with new feedback
- Monitor NDCG, Precision, Demographic Parity continuously
- Set SLOs: e.g., "NDCG@5 ≥ 0.75", "Parity Gap < 5%"
- Implement A/B testing for ranking algorithm changes
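The SLO idea above can be sketched as a simple threshold gate. The metric names and bounds mirror the example SLOs; the `check_slos` function itself is hypothetical, not part of the codebase.

```python
# Hypothetical SLO gate: each metric has a direction ("min" or "max") and a bound.
SLOS = {"ndcg@5": ("min", 0.75), "parity_gap": ("max", 0.05)}

def check_slos(metrics):
    """Return human-readable descriptions of any SLO violations."""
    violations = []
    for name, (kind, bound) in SLOS.items():
        value = metrics.get(name)
        if value is None:
            continue  # metric not measured this period
        if (kind == "min" and value < bound) or (kind == "max" and value > bound):
            violations.append(f"{name}={value:.4f} violates {kind} bound {bound}")
    return violations

print(check_slos({"ndcg@5": 0.77, "parity_gap": 0.08}))
# ['parity_gap=0.0800 violates max bound 0.05']
```

A monitoring job could run this after each batch evaluation and alert governance teams when the list is non-empty.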
For questions about specific metrics or implementation details, see:
- README.md - "Evaluation & Results" section
- src/evaluation/ranking_metrics.py - Metric docstrings
- src/evaluation/fairness_metrics.py - Fairness implementation