Skip to content
Open
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions changelog.d/oa-calibration-pipeline.added.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Add Output Area crosswalk and geographic assignment for OA-level calibration pipeline.
109 changes: 109 additions & 0 deletions docs/oa_calibration_pipeline.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,109 @@
# Output Area Calibration Pipeline

This document describes the plan to port the US-side clone-and-prune calibration methodology to the UK data, going down to Output Area (OA) level — the UK equivalent of the US Census Block.

## Background

The US pipeline (policyengine-us-data) uses an L0-regularized clone-and-prune approach:
1. Clone each CPS household N times
2. Assign each clone a different Census Block (population-weighted)
3. Build a sparse calibration matrix (targets x records)
4. Run L0-regularized optimization to drop most clones, keeping only the best-fitting records per area
5. Publish per-area H5 files from the sparse weight vector

The UK pipeline currently uses standard PyTorch Adam gradient descent on a dense weight matrix (n_areas x n_households) at constituency (650) and local authority (360) level. We want to bring the US approach to the UK at Output Area (~180K OAs) granularity.

## Implementation Phases

### Phase 1: Output Area Crosswalk & Geographic Assignment
**Status: In Progress**

Build the OA crosswalk and population-weighted assignment function.

**Deliverables:**
- `policyengine_uk_data/calibration/oa_crosswalk.py` — downloads/builds the OA → LSOA → MSOA → LA → constituency → region → country crosswalk
- `policyengine_uk_data/storage/oa_crosswalk.csv.gz` — compressed crosswalk file
- `policyengine_uk_data/calibration/oa_assignment.py` — assigns cloned records to OAs (population-weighted, country-constrained)
- Tests validating crosswalk completeness and assignment correctness

**Data sources:**
- ONS Open Geography Portal: OA → LSOA → MSOA → LA lookup
- ONS mid-year population estimates at OA level
- ONS OA → constituency lookup (2024 boundaries)

**US reference:** PR #484 (census-block-assignment)

---

### Phase 2: Clone-and-Assign
**Status: Not Started**

Clone each FRS household N times and assign each clone a different OA.

**Deliverables:**
- `policyengine_uk_data/calibration/clone_and_assign.py`
- Modify `datasets/create_datasets.py` to insert clone step after imputations, before calibration

**Key design:**
- N=10 clones initially (tune later)
- Constituency collision avoidance: each clone gets a different constituency where possible
- Country constraint preserved: English households → English OAs only

**US reference:** PR #457 (district-h5) + PR #531 (census-block-calibration)

---

### Phase 3: L0 Calibration Engine
**Status: Not Started**

Port L0-regularized optimization from US side.

**Deliverables:**
- `policyengine_uk_data/utils/calibrate_l0.py`
- Add `l0-python` dependency to `pyproject.toml`

**Key design:**
- HardConcrete gates for continuous L0 relaxation
- Relative squared error loss
- L0 + L2 regularization with presets (local vs national)
- Keep existing `calibrate.py` as fallback during validation

**US reference:** PR #364 (bogorek-l0) + PR #365

---

### Phase 4: Sparse Matrix Builder
**Status: Not Started**

Build sparse calibration matrix from cloned dataset.

**Deliverables:**
- `policyengine_uk_data/calibration/matrix_builder.py`
- Wire existing `targets/sources/` definitions into sparse matrix rows

**US reference:** PR #456 + PR #489

---

### Phase 5: SQLite Target Database
**Status: Not Started**

Hierarchical target storage: UK → country → region → LA → constituency → MSOA → LSOA → OA.

**Deliverables:**
- `policyengine_uk_data/db/` directory with ETL scripts
- Migrate existing CSV/Excel targets into SQLite

**US reference:** PR #398 (treasury) + PR #488 (db-work)

---

### Phase 6: Local Area Publishing
**Status: Not Started**

Generate per-area H5 files from sparse weights. Modal integration for scale.

**Deliverables:**
- `policyengine_uk_data/calibration/publish_local_h5s.py`

**US reference:** PR #465 (modal)
Empty file.
Loading
Loading