This document provides a guide to the evaluation infrastructure for the Policy Recommender system.
Purpose: Generic ranking quality metrics for information retrieval evaluation.
Functions:
- `ndcg_at_k(relevances, k)` - Normalized Discounted Cumulative Gain at position k
- `precision_at_k(relevances, k, threshold)` - Precision of top-k recommendations
- `mean_average_precision(relevances_list, k)` - Mean Average Precision across multiple queries
- `reciprocal_rank(relevances, threshold)` - Position of first relevant result
- `mean_reciprocal_rank(relevances_list, threshold)` - MRR across multiple queries
- `evaluate_ranking(relevances, k, threshold)` - Single ranking evaluation
- `evaluate_ranking_batch(relevances_list, k, threshold)` - Batch evaluation
When to Use: Any time you need to measure ranking quality. These functions are reusable for new datasets or ranking methods.
Inputs: List of relevance scores (0-1) for items in ranked order
Outputs: Dictionary with NDCG, Precision, MAP, MRR values
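To illustrate what these metrics compute, here is a minimal standalone NDCG@k sketch. This is not the project's implementation (the real `ndcg_at_k` lives in src/evaluation/ranking_metrics.py); it only shows the underlying formula: DCG of the observed ordering divided by DCG of the ideal ordering.

```python
import math

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the given ranking divided by DCG of the ideal ordering."""
    def dcg(rels):
        # Log-discounted gain: position i contributes rel / log2(i + 2)
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# A perfectly ordered ranking scores exactly 1.0; misordering lowers the score.
print(ndcg_at_k([1.0, 0.8, 0.2, 0.0], k=4))  # 1.0 (already in ideal order)
print(ndcg_at_k([0.0, 0.2, 0.8, 1.0], k=4) < 1.0)  # True
```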
Purpose: Analyze demographic representation and fairness in recommendation rankings.
Functions:
- `demographic_parity(recommendations, demographic_attr, outcome_attr)` - Recommendation rates per demographic group
- `parity_gap(rates)` - Maximum difference in recommendation rates across groups
- `representation_variance(rankings, demographic_attr, top_k)` - Distribution consistency of demographics in top-k
- `fairness_report(recommendations, rankings, demographic_attrs, top_k)` - Comprehensive fairness analysis
- `fairness_summary(report)` - Human-readable fairness report with warnings
When to Use: Periodically (monthly/quarterly) to monitor demographic representation in ranking outputs.
Inputs:
- `recommendations`: List of recommendation records with demographic attributes
- `rankings`: List of ranked recommendation lists
- `demographic_attrs`: List of demographic attributes to analyze (e.g., ["gender", "category"])
Outputs: Dictionary with parity rates, gaps, and representation variances; human-readable summary
Important: These metrics are analysis-only. They do not enforce constraints or adjust rankings automatically. Governance teams use the output to make policy decisions.
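To make the analysis-only nature concrete, here is a minimal sketch of what `demographic_parity` and `parity_gap` might compute. It assumes recommendation records are dicts with a boolean outcome field; the field names `gender` and `recommended` below are illustrative, not the project's actual schema.

```python
from collections import defaultdict

def demographic_parity(recommendations, demographic_attr, outcome_attr):
    """Fraction of records with a positive outcome, per demographic group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for rec in recommendations:
        group = rec[demographic_attr]
        totals[group] += 1
        positives[group] += 1 if rec[outcome_attr] else 0
    return {g: positives[g] / totals[g] for g in totals}

def parity_gap(rates):
    """Largest difference in recommendation rates between any two groups."""
    return max(rates.values()) - min(rates.values())

recs = [
    {"gender": "F", "recommended": True},
    {"gender": "F", "recommended": False},
    {"gender": "M", "recommended": True},
    {"gender": "M", "recommended": True},
]
rates = demographic_parity(recs, "gender", "recommended")
print(rates, parity_gap(rates))  # {'F': 0.5, 'M': 1.0} 0.5
```

Note that nothing here mutates the rankings; the output is purely descriptive, which is what makes these metrics safe to run against production logs.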
Purpose: Offline experiment comparing three ranking methods on synthetic data.
Methods Compared:
- Rule-Based: Deterministic feature heuristics (income proximity, age fit, category match)
- ML-Based: Logistic regression trained on synthetic features
- Hybrid: Equal-weight average of rule and ML scores
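The hybrid method is a simple blend of the other two. A one-line sketch, assuming both scores are already normalized to [0, 1] (the weight parameter `w` is illustrative):

```python
def hybrid_score(rule_score, ml_score, w=0.5):
    """Weighted blend of rule and ML scores; w=0.5 gives the equal-weight hybrid."""
    return w * rule_score + (1 - w) * ml_score

print(hybrid_score(0.5, 1.0))  # 0.75
```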
Workflow:
- Generate synthetic schemes (10) and users (100)
- Filter to eligible users (69)
- Score each user's eligible schemes with three methods
- Evaluate with NDCG@5, Precision@5, MAP, MRR
- Save per-user results to CSV
- Print comparison table
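The workflow above can be sketched end to end. Everything below is an illustrative stand-in, not the actual module internals: the scheme/user fields, the eligibility rule, and the income-proximity heuristic are all assumptions.

```python
import numpy as np

np.random.seed(42)  # fixed seed, as in the real experiment

# Toy stand-ins for the synthetic generators (fields are illustrative).
schemes = [{"id": i, "min_income": int(np.random.randint(0, 50_000))} for i in range(10)]
users = [{"id": i, "income": int(np.random.randint(0, 80_000))} for i in range(100)]

def rule_score(user, scheme):
    # Income-proximity heuristic: closer to the scheme threshold scores higher.
    return 1.0 / (1.0 + abs(user["income"] - scheme["min_income"]) / 10_000)

eligible_users = 0
for user in users:
    eligible = [s for s in schemes if user["income"] >= s["min_income"]]
    if not eligible:
        continue  # users with no eligible schemes are dropped
    eligible_users += 1
    top5 = sorted(eligible, key=lambda s: rule_score(user, s), reverse=True)[:5]

print(f"Scored {eligible_users} eligible users")
```

The real experiment repeats the scoring step for all three methods and feeds the resulting rankings into the metrics above.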
```bash
# Ensure you're in the correct environment
conda activate ai

# Verify Python version (3.11+)
python --version

# Install dependencies
pip install -r requirements.txt
```
Required Packages:
- numpy (numerical operations)
- scikit-learn (logistic regression, ML utilities)
- fastapi, uvicorn (for API server, if needed)
- pytest (for unit tests)
```bash
cd policy-recommender-ai
python -m experiments.compare_ranking_methods
```
Expected Duration: <5 seconds
Console Output:
```
================================================================================
EXPERIMENT: Compare Rule-Based, ML, and Hybrid Ranking Methods
================================================================================
Generating synthetic data...
Generated 10 schemes and 100 users
Scoring rankings for each user...
Scored 69 users
Evaluating with ranking metrics...

RESULTS: Ranking Quality Comparison
--------------------------------------------------------------------------------
Metric          Rule-Based      ML-Based        Hybrid
--------------------------------------------------------------------------------
ndcg@5          0.7404          0.7700          0.7404
precision@5     0.4725          0.5043          0.4725
map             0.7272          0.7510          0.7272
mrr             0.7012          0.7077          0.7012

Detailed results saved to: results.csv
```
results.csv is located in the project root and contains per-user ranking comparisons.
Columns:
- `user_id`: Synthetic user identifier
- `eligible_schemes`: Number of schemes user is eligible for
- `rule_ranking_1,2,3`: Top-3 schemes under rule-based ranking
- `ml_ranking_1,2,3`: Top-3 schemes under ML-based ranking
- `hybrid_ranking_1,2,3`: Top-3 schemes under hybrid ranking
Use Case: Analyze how rankings differ across methods for individual users. Useful for debugging model decisions or understanding typical recommendation variance.
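For example, top-1 agreement between methods can be computed directly from those columns. The rows below are made-up stand-ins for real results.csv content, kept in-memory so the sketch runs anywhere:

```python
import csv
import io

# Hypothetical rows mirroring the results.csv schema described above.
sample = """user_id,eligible_schemes,rule_ranking_1,ml_ranking_1,hybrid_ranking_1
u1,4,scheme_a,scheme_a,scheme_a
u2,3,scheme_b,scheme_c,scheme_b
"""

rows = list(csv.DictReader(io.StringIO(sample)))
# Count users whose top-1 pick is the same under rule-based and ML ranking.
agree = sum(r["rule_ranking_1"] == r["ml_ranking_1"] for r in rows)
print(f"top-1 agreement: {agree}/{len(rows)}")  # top-1 agreement: 1/2
```

Against the real file, replace `io.StringIO(sample)` with `open("results.csv")`.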
Determinism:
- Synthetic data generation: `np.random.seed(42)` ensures identical data across runs
- ML model: `LogisticRegression(random_state=42)` ensures reproducible predictions
- Expected: Running the experiment twice produces identical results
Running Again:
```bash
python -m experiments.compare_ranking_methods
```
Will overwrite results.csv with identical content (same seed).
Tested On:
- Python 3.11.x
- Conda environment `ai`
- scikit-learn 1.7+
- numpy 2.4+
Environment Activation:
```bash
conda activate ai
```
The evaluation metrics can be integrated into the running API for production monitoring:
```python
from src.evaluation.ranking_metrics import evaluate_ranking_batch
from src.evaluation.fairness_metrics import fairness_report

# After collecting real ranking feedback:
relevances = [...]  # [0, 1] labels for each recommendation
metrics = evaluate_ranking_batch(relevances, k=5)

# Check for fairness issues:
recommendations = [...]  # Records with demographic data
rankings = [...]         # Ranked recommendation lists
fairness_data = fairness_report(recommendations, rankings, ["gender", "category"])
```
However, this integration is not implemented in v1.0.0; it is left as future work.
Unit tests live under tests/ (evaluation tests are not included yet, but can be added):
```bash
# If tests are added:
python -m pytest tests/test_evaluation.py -v
```
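A test file might look like the sketch below. It is shown against a local reference implementation of `reciprocal_rank` so it runs standalone; real tests would instead import from src/evaluation/ranking_metrics.py.

```python
# Hypothetical tests/test_evaluation.py sketch.
def reciprocal_rank(relevances, threshold=0.5):
    """Local reference implementation: 1/rank of the first relevant item."""
    for i, r in enumerate(relevances):
        if r >= threshold:
            return 1.0 / (i + 1)
    return 0.0

def test_reciprocal_rank_first_hit():
    # First relevant item at position 3 -> RR = 1/3
    assert reciprocal_rank([0, 0, 1]) == 1.0 / 3

def test_reciprocal_rank_no_hit():
    # No relevant items -> RR = 0
    assert reciprocal_rank([0, 0, 0]) == 0.0

# pytest discovers the test_* functions; called directly here for illustration.
test_reciprocal_rank_first_hit()
test_reciprocal_rank_no_hit()
print("ok")
```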
- Synthetic Data: Relevance labels are artificial. Real evaluation requires historical user feedback.
- Small Scale: 69 users is toy data for proof-of-concept. Production requires thousands of records.
- Fixed Features: Features are deterministic. Real ML would benefit from online learning or periodic retraining.
- No Distribution Shift Monitoring: Results assume stationary data. Production needs drift detection.
- Collect real historical data (which scheme did beneficiary actually select?)
- Retrain ML models monthly with new feedback
- Monitor NDCG, Precision, Demographic Parity continuously
- Set SLOs: e.g., "NDCG@5 ≥ 0.75", "Parity Gap < 5%"
- Implement A/B testing for ranking algorithm changes
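The SLO idea above can be sketched as a simple threshold gate. The metric names and bounds mirror the example SLOs; the `check_slos` function itself is hypothetical, not part of the codebase.

```python
# Hypothetical SLO gate: each metric has a direction ("min" or "max") and a bound.
SLOS = {"ndcg@5": ("min", 0.75), "parity_gap": ("max", 0.05)}

def check_slos(metrics):
    """Return human-readable descriptions of any SLO violations."""
    violations = []
    for name, (kind, bound) in SLOS.items():
        value = metrics.get(name)
        if value is None:
            continue  # metric not measured this period
        if (kind == "min" and value < bound) or (kind == "max" and value > bound):
            violations.append(f"{name}={value:.4f} violates {kind} bound {bound}")
    return violations

print(check_slos({"ndcg@5": 0.77, "parity_gap": 0.08}))
# ['parity_gap=0.0800 violates max bound 0.05']
```

A monitoring job could run this after each batch evaluation and alert governance teams when the list is non-empty.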
For questions about specific metrics or implementation details, see:
- README.md - "Evaluation & Results" section
- src/evaluation/ranking_metrics.py - Metric docstrings
- src/evaluation/fairness_metrics.py - Fairness implementation