diff --git a/.beads/issues.jsonl b/.beads/issues.jsonl index a5bf54590..faa33b5ed 100644 --- a/.beads/issues.jsonl +++ b/.beads/issues.jsonl @@ -1,6 +1,6 @@ {"id":"policyengine-uk-5qy","title":"Update student loan validation notebook with deeper analysis","description":"","status":"closed","priority":2,"issue_type":"task","created_at":"2025-11-29T21:01:35.49966-05:00","updated_at":"2025-11-29T21:02:51.469593-05:00","closed_at":"2025-11-29T21:02:51.469593-05:00"} {"id":"policyengine-uk-75j","title":"Add student loan calibration targets to policyengine-uk-data","description":"Add student loan repayment and balance calibration targets to policyengine-uk-data loss function.\n\n## Proposed Calibration Targets (from SLC 2024-25 statistics)\n\n### Total Repayments by Country\n| Country | Repayments | Source |\n|---------|------------|--------|\n| England (HE) | £5.0bn | SLC 2024-25 |\n| Scotland | £203m | SLC 2024-25 |\n| Wales | £229m | SLC 2024-25 |\n| Northern Ireland | £182m | SLC 2024-25 |\n| **UK Total** | **~£5.6bn** | |\n\n### Repayments by Plan Type (England)\n| Plan | Amount | Share |\n|------|--------|-------|\n| Plan 1 | £1.9bn | 37% |\n| Plan 2 | £2.8bn | 55% |\n| Postgraduate | £0.3bn | 7% |\n| Plan 5 | £41m | 0.8% |\n\n### Number of Borrowers Repaying (England)\n- 3.0m via HMRC\n- 187k scheduled direct\n- 147k voluntary direct\n\n### Outstanding Balances\n- UK Total: £294bn (March 2025)\n\n## Implementation Notes\n1. Add targets to `loss.py` in policyengine-uk-data\n2. May need to adjust for timing (FRS year vs SLC reporting year)\n3. Consider whether to calibrate on modelled (`student_loan_repayment`) or reported (`student_loan_repayments`)\n\n## Sources\n- https://www.gov.uk/government/statistics/student-loans-in-england-2024-to-2025\n- https://www.gov.uk/government/statistics/student-loans-in-scotland-2024-to-2025\n- https://www.gov.uk/government/statistics/student-loans-in-northern-ireland-2024-to-2025\n- https://www.gov.uk/government/statistics/student-loans-in-wales-2024-to-2025","status":"closed","priority":2,"issue_type":"feature","created_at":"2025-11-29T21:01:37.038332-05:00","updated_at":"2025-11-30T12:42:25.851958-05:00","closed_at":"2025-11-30T12:42:25.851958-05:00"} -{"id":"policyengine-uk-e55","title":"Impute student loan balance from WAS to FRS","description":"Impute student loan balance from WAS to FRS.\n\n**GitHub Issue:** https://github.com/PolicyEngine/policyengine-uk-data/issues/238\n\n## WAS Data (Round 7, 2018-2020)\n- Derived from: Tot_LosR7_aggr - Tot_los_exc_SLCR7_aggr\n- 1.66m weighted households with SLC debt\n- Mean balance: £20k, Total: £33.4bn\n- Undercounts vs admin (~24% of SLC total) but captures distribution shape\n\n## Implementation in policyengine-uk-data\n1. Add variables to wealth.py\n2. Impute to FRS\n3. Consider scaling to admin totals\n\n## Then in policyengine-uk\n1. Create student_loan_balance variable\n2. Use for capping repayments (policyengine-uk-exv)\n3. Use for interest accrual (policyengine-uk-lo8)","status":"open","priority":2,"issue_type":"feature","created_at":"2025-11-30T13:00:55.693284-05:00","updated_at":"2025-11-30T21:22:35.720239-05:00"} +{"id":"policyengine-uk-e55","title":"Impute student loan balance from WAS to FRS","description":"Impute student loan balance from WAS to FRS.\n\n**GitHub Issue:** https://github.com/PolicyEngine/policyengine-uk-data/issues/238\n\n## WAS Data (Round 7, 2018-2020)\n- Derived from: Tot_LosR7_aggr - Tot_los_exc_SLCR7_aggr\n- 1.66m weighted households with SLC debt\n- Mean balance: £20k, Total: £33.4bn\n- Undercounts vs admin (~24% of SLC total) but captures distribution shape\n\n## Implementation in policyengine-uk-data\n1. Add variables to wealth.py\n2. Impute to FRS\n3. Consider scaling to admin totals\n\n## Then in policyengine-uk\n1. Create student_loan_balance variable\n2. Use for capping repayments (policyengine-uk-exv)\n3. Use for interest accrual (policyengine-uk-lo8)","status":"in_progress","priority":2,"issue_type":"feature","created_at":"2025-11-30T13:00:55.693284-05:00","updated_at":"2025-11-30T23:51:42.754835-05:00"} {"id":"policyengine-uk-exv","title":"Cap student loan repayments at outstanding balance","description":"## Summary\nCurrently `student_loan_repayment` is calculated as:\n```python\nrepayment = rate * max_(0, income - threshold)\n```\n\nThis has no cap, so high earners can have modelled repayments exceeding their actual loan balance.\n\nExample from validation:\n- Person with £420k income\n- Modelled repayment: £35,470\n- Reported repayment: £1,903\n- Likely explanation: they paid off their loan during the year\n\n## Implementation\n1. Depends on: policyengine-uk-e55 (impute student loan balance)\n2. Add cap: `repayment = min_(repayment, student_loan_balance)`\n3. Consider interest accrual dynamics\n\n## References\n- Real repayments stop when balance reaches zero\n- SLC sends notification when approaching final payment","status":"open","priority":2,"issue_type":"feature","created_at":"2025-11-30T13:01:04.290596-05:00","updated_at":"2025-11-30T13:01:04.290596-05:00","dependencies":[{"issue_id":"policyengine-uk-exv","depends_on_id":"policyengine-uk-e55","type":"blocks","created_at":"2025-11-30T13:01:32.464187-05:00","created_by":"daemon"}]} {"id":"policyengine-uk-lo8","title":"Add student loan interest accrual calculation","description":"## Summary\nWe have `student_loan_interest_rate` but no calculation of actual interest accrued. This would require:\n1. Outstanding balance (see policyengine-uk-e55)\n2. Interest rate (already implemented)\n3. Decision on timing: interest on opening or closing balance?\n\n## UK Rules\n- Interest is calculated daily on the outstanding balance\n- For Plan 2, rate varies by income (RPI to RPI+3%)\n- Interest is added monthly\n\n## Implementation\n```python\ninterest_accrued = student_loan_balance * student_loan_interest_rate\n```\n\n## Use cases\n- Calculating lifetime loan costs\n- Analysing distributional impact of interest rate changes\n- Understanding real cost of higher education","status":"open","priority":2,"issue_type":"feature","created_at":"2025-11-30T13:01:12.269846-05:00","updated_at":"2025-11-30T13:01:12.269846-05:00","dependencies":[{"issue_id":"policyengine-uk-lo8","depends_on_id":"policyengine-uk-e55","type":"blocks","created_at":"2025-11-30T13:01:32.503041-05:00","created_by":"daemon"}]} {"id":"policyengine-uk-occ","title":"Research official student loan repayment aggregates for calibration","description":"## Research Findings\n\n### UK Student Loan Repayments 2024-25 (SLC Official Statistics)\n\n**England (HE):** £5.0bn total repayments\n- Plan 1: £1.9bn (37%)\n- Plan 2: £2.8bn (55%)\n- Plan 3/Postgraduate: £0.3bn (7%)\n- Plan 5: £41m (0.8%, voluntary only)\n\n**Scotland:** £203m total repayments (primarily Plan 4)\n\n**Wales:** ~£229m total repayments (6.9% increase from prior year)\n\n**Northern Ireland:** £182m total repayments\n\n**UK Total (estimated):** ~£5.6bn HE repayments\n\n### Borrowers Making Repayments (England)\n- 3.0m via HMRC (39.5% of those liable)\n- 187k scheduled direct to SLC\n- 147k voluntary direct to SLC\n\n### Outstanding Balances\n- England: £236.4bn (end March 2025)\n- Scotland: £9.4bn\n- Northern Ireland: £5.6bn\n- Wales: ~£9-10bn (estimated)\n- **UK Total: ~£260-295bn**\n\n### Sources\n- https://www.gov.uk/government/statistics/student-loans-in-england-2024-to-2025\n- https://www.gov.uk/government/statistics/student-loans-in-scotland-2024-to-2025\n- https://www.gov.uk/government/statistics/student-loans-in-northern-ireland-2024-to-2025\n- https://www.gov.uk/government/statistics/student-loans-in-wales-2024-to-2025","status":"closed","priority":2,"issue_type":"task","created_at":"2025-11-29T21:01:36.199753-05:00","updated_at":"2025-11-30T12:41:52.88068-05:00","closed_at":"2025-11-30T12:41:52.88068-05:00","dependencies":[{"issue_id":"policyengine-uk-occ","depends_on_id":"policyengine-uk-75j","type":"blocks","created_at":"2025-11-29T21:01:47.791464-05:00","created_by":"daemon"}]} diff --git a/docs/book/validation/student-loan-repayments.ipynb b/docs/book/validation/student-loan-repayments.ipynb index 26b16268e..b9324223b 100644 --- a/docs/book/validation/student-loan-repayments.ipynb +++ b/docs/book/validation/student-loan-repayments.ipynb @@ -17,13 +17,15 @@ "\n", "Student loan repayments in the UK are calculated as a percentage of income above a threshold, varying by loan plan:\n", "\n", - "- **Plan 1** (pre-2012 England/Wales, Scotland, NI): 9% of income above £24,990 (2024-25)\n", - "- **Plan 2** (post-2012 England/Wales): 9% of income above £27,295 (2024-25)\n", - "- **Plan 4** (Scotland post-2017): 9% of income above £27,660 (2024-25)\n", - "- **Plan 5** (England post-2023): 9% of income above £25,000 (2024-25)\n", - "- **Postgraduate**: 6% of income above £21,000 (2024-25)\n", + "- **Plan 1** (pre-2012 England/Wales, Scotland, NI): 9% of income above threshold\n", + "- **Plan 2** (post-2012 England/Wales): 9% of income above threshold\n", + "- **Plan 4** (Scotland post-2017): 9% of income above threshold\n", + "- **Plan 5** (England post-2023): 9% of income above threshold\n", + "- **Postgraduate**: 6% of income above threshold\n", "\n", - "The FRS captures reported student loan repayments, while PolicyEngine calculates repayments based on income and loan plan type." + "The FRS captures reported student loan repayments (`student_loan_repayments`), while PolicyEngine calculates repayments based on income and loan plan type (`student_loan_repayment`).\n", + "\n", + "**Note:** The FRS `student_loans` variable (from `tuborr`) represents the amount borrowed THIS YEAR by current students, not total outstanding balance. Outstanding balance data would need to be imputed from the Wealth and Assets Survey (WAS)." ] }, { @@ -69,11 +71,13 @@ "metadata": {}, "outputs": [], "source": [ - "# Plan distribution (weighted)\n", - "plan_names = {0: \"None\", 1: \"Plan 1\", 2: \"Plan 2\", 3: \"Postgraduate\", 4: \"Plan 4\", 5: \"Plan 5\"}\n", - "for plan_id, name in plan_names.items():\n", - " count = weight[plan == plan_id].sum() / 1e6\n", - " print(f\"{name}: {count:.2f}m people\")" + "# Plan distribution (weighted) - plan values are strings\n", + "from policyengine_uk.variables.gov.hmrc.student_loans.student_loan_plan import StudentLoanPlan\n", + "\n", + "for p in StudentLoanPlan:\n", + " mask = plan == p.value\n", + " count = weight[mask].sum() / 1e6\n", + " print(f\"{p.name}: {count:.2f}m people\")" ] }, { @@ -94,8 +98,8 @@ "total_reported = (reported * weight).sum() / 1e9\n", "total_modelled = (modelled * weight).sum() / 1e9\n", "\n", - "print(f\"Total reported repayments: £{total_reported:.2f}bn\")\n", - "print(f\"Total modelled repayments: £{total_modelled:.2f}bn\")\n", + "print(f\"Total reported repayments: \\u00a3{total_reported:.2f}bn\")\n", + "print(f\"Total modelled repayments: \\u00a3{total_modelled:.2f}bn\")\n", "print(f\"Ratio (modelled/reported): {total_modelled/total_reported:.2f}\")" ] }, @@ -128,9 +132,119 @@ " print(f\"People with both reported & modelled > 0: {match_rate:.1f}% of reporters\")\n", " \n", " # Mean values\n", - " print(f\"\\nMean reported (reporters): £{reported[has_reported].mean():,.0f}\")\n", - " print(f\"Mean modelled (reporters): £{modelled[has_reported].mean():,.0f}\")\n", - " print(f\"Mean income (reporters): £{income[has_reported].mean():,.0f}\")" + " print(f\"\\nMean reported (reporters): \\u00a3{reported[has_reported].mean():,.0f}\")\n", + " print(f\"Mean modelled (reporters): \\u00a3{modelled[has_reported].mean():,.0f}\")\n", + " print(f\"Mean income (reporters): \\u00a3{income[has_reported].mean():,.0f}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Correlation among those required to pay\n", + "\n", + "The overall correlation is low because many reporters have incomes below the repayment threshold. Let's look at correlation only among those required to make payments:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Get thresholds\n", + "params = sim.tax_benefit_system.parameters\n", + "thresholds = params.gov.hmrc.student_loans.thresholds\n", + "plan_1_threshold = thresholds.plan_1(f'{year}-01-01')\n", + "plan_2_threshold = thresholds.plan_2(f'{year}-01-01')\n", + "\n", + "# People above threshold for their plan\n", + "above_threshold = (\n", + " ((plan == 'PLAN_1') & (income > plan_1_threshold)) |\n", + " ((plan == 'PLAN_2') & (income > plan_2_threshold))\n", + ")\n", + "\n", + "print(f'People above repayment threshold: {above_threshold.sum():,}')\n", + "print(f'Weighted: {weight[above_threshold].sum()/1e6:.2f}m')\n", + "\n", + "if above_threshold.sum() > 0:\n", + " # Unweighted correlation\n", + " corr_unweighted = np.corrcoef(reported[above_threshold], modelled[above_threshold])[0, 1]\n", + " \n", + " # Weighted correlation\n", + " w = weight[above_threshold]\n", + " r = reported[above_threshold]\n", + " m = modelled[above_threshold]\n", + " \n", + " mean_r = np.average(r, weights=w)\n", + " mean_m = np.average(m, weights=w)\n", + " \n", + " cov = np.average((r - mean_r) * (m - mean_m), weights=w)\n", + " std_r = np.sqrt(np.average((r - mean_r)**2, weights=w))\n", + " std_m = np.sqrt(np.average((m - mean_m)**2, weights=w))\n", + " corr_weighted = cov / (std_r * std_m)\n", + " \n", + " print(f'\\nCorrelation (unweighted): {corr_unweighted:.3f}')\n", + " print(f'Correlation (weighted): {corr_weighted:.3f}')\n", + " print(f'\\nMean reported: \\u00a3{np.average(r, weights=w):,.0f}')\n", + " print(f'Mean modelled: \\u00a3{np.average(m, weights=w):,.0f}')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Deep dive: Why do some reporters have zero modelled repayments?\n", + "\n", + "A significant fraction of people who report making repayments have zero modelled repayments. Let's investigate why." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# People who report repayments but we model zero\n", + "has_reported = reported > 0\n", + "modelled_zero = modelled == 0\n", + "problem = has_reported & modelled_zero\n", + "\n", + "print(f\"People reporting repayments: {has_reported.sum():,}\")\n", + "print(f\"Of those, modelled = 0: {problem.sum():,} ({problem.sum()/has_reported.sum()*100:.1f}%)\")\n", + "print()\n", + "\n", + "# Why modelled = 0? Check plan distribution\n", + "print(\"Plan distribution for problem cases:\")\n", + "for p in ['NONE', 'PLAN_1', 'PLAN_2', 'PLAN_4', 'PLAN_5']:\n", + " count = (problem & (plan == p)).sum()\n", + " print(f\" {p}: {count:,}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Check income levels for problem cases\n", + "print(\"Income stats for problem cases (reported > 0, modelled = 0):\")\n", + "print(f\" Mean income: \\u00a3{income[problem].mean():,.0f}\")\n", + "print(f\" Median income: \\u00a3{np.median(income[problem]):,.0f}\")\n", + "print(f\" Min/Max: \\u00a3{income[problem].min():,.0f} / \\u00a3{income[problem].max():,.0f}\")\n", + "print()\n", + "\n", + "print(f\"Repayment thresholds for {year}:\")\n", + "print(f\" Plan 1: \\u00a3{plan_1_threshold:,.0f}\")\n", + "print(f\" Plan 2: \\u00a3{plan_2_threshold:,.0f}\")\n", + "print()\n", + "\n", + "# How many are below threshold?\n", + "problem_plan1 = problem & (plan == 'PLAN_1')\n", + "problem_plan2 = problem & (plan == 'PLAN_2')\n", + "\n", + "print(f\"Plan 1 problem cases below threshold: {(income[problem_plan1] < plan_1_threshold).sum():,} of {problem_plan1.sum():,}\")\n", + "print(f\"Plan 2 problem cases below threshold: {(income[problem_plan2] < plan_2_threshold).sum():,} of {problem_plan2.sum():,}\")" ] }, { @@ -139,19 +253,69 @@ "source": [ "## Analysis of discrepancies\n", "\n", - "The relatively low individual-level correlation suggests several factors may explain differences:\n", + "The analysis reveals that **all** problem cases (people reporting repayments but with zero modelled) have incomes below the repayment threshold. The model is correctly applying the threshold logic - these people should not owe mandatory repayments based on their annual income.\n", + "\n", + "So why are they reporting repayments? Several factors explain this:\n", + "\n", + "1. **Voluntary repayments**: People can choose to pay more than the minimum required, or make direct payments to reduce their loan balance faster.\n", + "\n", + "2. **Intra-year income variation**: The FRS captures annual income, but someone may have had higher-paying employment for part of the year (triggering PAYE deductions) before their income dropped.\n", + "\n", + "3. **Self-employed direct repayments**: Self-employed individuals make direct repayments based on prior year income, which may differ from current year income.\n", + "\n", + "4. **Employment timing**: Someone starting or ending employment mid-year may have PAYE deductions based on annualised pay that differs from their actual annual income.\n", + "\n", + "These factors represent a fundamental limitation of annual microsimulation models - we cannot capture voluntary overpayments or intra-year income dynamics.\n", + "\n", + "Among those **required to pay** (income above threshold), the weighted correlation is much higher (~0.68), indicating the model works well for mandatory repayments." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Known limitations\n", + "\n", + "### No cap at outstanding balance\n", + "\n", + "The model currently calculates repayments as 9% of income above threshold with no cap. For high earners, this can produce unrealistically high repayments that exceed their actual loan balance.\n", + "\n", + "Example from the data:\n", + "- Person with \\u00a3420k income\n", + "- Modelled repayment: \\u00a335,470\n", + "- Reported repayment: \\u00a31,903\n", + "- Explanation: They likely paid off their loan during the year\n", + "\n", + "### FRS `student_loans` is not outstanding balance\n", "\n", - "1. **Timing differences**: Reported repayments reflect actual payments made during the tax year, which may include voluntary overpayments or vary based on pay frequency and employment changes.\n", + "The FRS variable `student_loans` (from `tuborr`) represents the amount borrowed THIS YEAR by current students, not total outstanding balance. This is why 98.7% of people reporting repayments have `student_loans = 0`.\n", "\n", - "2. **Employment variation**: Someone may have had periods below or above the repayment threshold during the year, while our model assumes constant annual income.\n", + "### Potential improvement: WAS imputation\n", "\n", - "3. **Multiple loan plans**: Some individuals may have both Plan 1 and Plan 2 loans, complicating the calculation.\n", + "The Wealth and Assets Survey (WAS) contains student loan balance data:\n", + "- `Tot_LosR7_aggr` - total loans\n", + "- `Tot_los_exc_SLCR7_aggr` - total loans excluding SLC\n", + "- Difference = SLC debt\n", "\n", - "4. **Study status**: Current students may have different repayment patterns not fully captured in the model.\n", + "This could be imputed to the FRS (similar to how other wealth variables are imputed) to enable capping repayments at outstanding balance." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Calibration status\n", + "\n", + "Student loan repayments are **not currently calibrated** to external aggregate targets in policyengine-uk-data. The reported values come directly from the FRS without reweighting to match official statistics.\n", "\n", - "5. **Plan misclassification**: The loan plan imputation in the microdata may not perfectly match individuals' actual loan types.\n", + "Potential calibration targets from SLC 2024-25 statistics:\n", + "- England total repayments: \\u00a35.0bn\n", + "- Scotland: \\u00a3203m\n", + "- Wales: \\u00a3229m \n", + "- Northern Ireland: \\u00a3182m\n", + "- **UK Total: ~\\u00a35.6bn**\n", "\n", - "Despite individual-level variation, the aggregate totals are reasonably aligned, suggesting the model captures the overall scale of student loan repayments in the UK economy." + "See [GitHub issue](https://github.com/PolicyEngine/policyengine-uk-data/issues/237) for tracking." ] }, { @@ -160,7 +324,24 @@ "source": [ "## Conclusion\n", "\n", - "PolicyEngine UK's student loan repayment model produces aggregate totals within a reasonable range of reported values. The individual-level correlation is lower than for income tax calculations, reflecting the complexity of student loan timing and the limitations of annual income-based calculations. For microsimulation purposes, the model provides a reasonable approximation of student loan repayment flows, while users should be aware of these limitations when analysing individual-level impacts." + "PolicyEngine UK's student loan repayment model produces aggregate totals within ~5% of reported FRS values. Key findings:\n", + "\n", + "| Metric | Value |\n", + "|--------|-------|\n", + "| Aggregate ratio (modelled/reported) | ~0.95 |\n", + "| Weighted correlation (above threshold) | ~0.68 |\n", + "| Weighted correlation (all reporters) | ~0.16 |\n", + "\n", + "The lower overall correlation reflects:\n", + "1. Many FRS respondents report repayments despite having incomes below the threshold\n", + "2. This includes voluntary repayments, intra-year income changes, and employment timing effects\n", + "3. These factors cannot be captured in an annual microsimulation model\n", + "\n", + "The model correctly applies the statutory repayment formula (9% of income above threshold), but users should be aware that:\n", + "- Individual-level repayment predictions have significant uncertainty\n", + "- Aggregate totals are more reliable than individual predictions\n", + "- The model does not capture voluntary overpayments\n", + "- High earners may have overstated repayments (no cap at balance)" ] } ],