Add Output Area crosswalk and geographic assignment (Phase 1) #291
vahid-ahmadi wants to merge 4 commits into main from
Conversation
Port the US-side clone-and-prune calibration methodology to the UK, starting with Output Area (OA) level geographic infrastructure:

- Build unified UK OA crosswalk from ONS, NRS, and NISRA data (235K areas: 189K E+W OAs + 46K Scotland OAs)
- Population-weighted OA assignment with country constraints
- Constituency collision avoidance for cloned records
- Tests validating crosswalk completeness and assignment correctness

This is Phase 1 of a 6-phase pipeline to enable OA-level calibration, analogous to the US Census Block approach.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hi Vahid,
Most of this is from our boy Claude, as usual. This looks like a great setup! Can't wait to see HHs getting donated to the OAs! I'll approve, but please see the issues Claude found below.
Here's the code I used to poke around:
```python
from policyengine_uk_data.calibration.oa_crosswalk import load_oa_crosswalk

xw = load_oa_crosswalk()
xw

# Population-weighted sampling demo
import numpy as np

xw["population"] = xw["population"].astype(float)
eng = xw[xw["country"] == "England"].copy()
eng["prob"] = eng["population"] / eng["population"].sum()
rng = np.random.default_rng(42)
idx = rng.choice(len(eng), size=10_000, p=eng["prob"].values)
sampled = eng.iloc[idx]
sampled.groupby("oa_code")["population"].agg(["count", "first"]).rename(
    columns={"count": "times_sampled", "first": "population"}
).sort_values("times_sampled", ascending=False).head(20)
```
leads to:
```
Out[1]:
           times_sampled  population
oa_code
E00179944              5      3354.0
E00035641              3       279.0
E00039569              3       263.0
E00066618              3       331.0
E00115325              2       319.0
E00136307              2       301.0
E00089585              2       333.0
E00167257              2       472.0
E00130843              2       406.0
E00021422              2       190.0
E00004742              2       313.0
E00044937              2       294.0
E00089725              2       240.0
E00044974              2       400.0
E00160095              2       401.0
E00016512              2       305.0
E00016490              2       380.0
E00089915              2       514.0
E00021502              2       396.0
E00105618              2       305.0
```
Interesting: "E00179944 with population 3,354 is a massive outlier (most OAs are 100–300 people)"
Bugs
1. load_oa_crosswalk loads population as string
load_oa_crosswalk() passes dtype=str for all columns (line 753 of oa_crosswalk.py), so population comes back as a string. This means any downstream arithmetic (e.g. computing probabilities) fails with TypeError: unsupported operand type(s) for /: 'str' and 'str'. Should either drop dtype=str or explicitly cast population to int on load.
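A minimal sketch of the suggested fix, assuming a hypothetical loader wrapping the same CSV (the path and function name here are illustrative, not the PR's actual API): keep `dtype=str` for the geographic codes, which may carry leading zeros, but cast `population` to a numeric type immediately on load.

```python
import pandas as pd

def load_oa_crosswalk_numeric(path="storage/oa_crosswalk.csv.gz"):
    # Codes stay strings (leading zeros, mixed prefixes), but population
    # is cast up front so downstream arithmetic never sees 'str' / 'str'.
    xw = pd.read_csv(path, dtype=str)
    xw["population"] = pd.to_numeric(xw["population"]).astype(int)
    return xw
```

With this, `xw["population"] / xw["population"].sum()` works without the explicit `astype(float)` cast in the demo above.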
2. NI households silently get no assignment
The crosswalk has 0 NI rows (NISRA 404), which is acknowledged, but assign_random_geography will silently produce None entries for NI households (country code 4). Worth either raising an error or logging a warning when a household's country has no distribution.
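One way to fail loudly, sketched against an assumed distribution structure (`distributions[country]` holding parallel `"oa_codes"`/`"probs"` arrays; the names and shape are illustrative, not the PR's actual internals):

```python
import logging
import numpy as np

logger = logging.getLogger(__name__)

def sample_oas_for_country(distributions, country, n, rng=None):
    # Sample n population-weighted OA codes for one country, warning
    # instead of silently emitting None when the crosswalk has no rows
    # for it (e.g. Northern Ireland after the NISRA 404).
    dist = distributions.get(country)
    if dist is None or len(dist["oa_codes"]) == 0:
        logger.warning("No OA distribution for %s; households get no assignment", country)
        return np.full(n, None, dtype=object)
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(dist["oa_codes"]), size=n, p=dist["probs"])
    return np.asarray(dist["oa_codes"], dtype=object)[idx]
```

Raising a `ValueError` instead of warning is the stricter option if a missing country should be treated as a hard data error.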
Code quality
3. Dead code in _assign_regions
Lines 602–606 of oa_crosswalk.py:
```python
for k, v in la_to_region.items():
    if k[:3] == la_code[:3]:
        # Same LA type prefix
        pass
```
This loop does nothing — should be removed or finished.
4. Assignment inner loop should be vectorised
In oa_assignment.py lines 236–245, the for i, pos in enumerate(positions) loop storing results can be replaced with vectorised numpy indexing:
```python
oa_codes[start + positions] = dist["oa_codes"][indices]
```
Same for all the other arrays. This will matter when n_clones * n_records gets large.
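The pattern generalizes to a single fancy-indexed scatter per output array; a small standalone sketch (names illustrative, not the PR's actual signatures):

```python
import numpy as np

def scatter(target, source, indices, start, positions):
    # Replaces the per-record Python loop
    #   for i, pos in enumerate(positions):
    #       target[start + pos] = source[indices[i]]
    # with one vectorised numpy assignment.
    target[start + np.asarray(positions)] = np.asarray(source)[indices]
```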
Worth noting
5. Scotland population weighting is effectively uniform
The fallback of ~117 per OA for all 46k Scottish OAs means population-weighted sampling is actually uniform for Scotland. This undermines the premise for ~20% of UK OAs. Might be worth a louder warning or a TODO to revisit once NRS fixes the 403.
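A small hypothetical check that would make the limitation self-documenting: if every OA in a country carries the same fallback population, the normalized weights are exactly uniform and carry no information.

```python
import numpy as np

def weighting_is_effectively_uniform(populations, tol=1e-9):
    # True when population-weighted sampling degenerates to uniform
    # sampling, as with the ~117-per-OA Scotland fallback.
    p = np.asarray(populations, dtype=float)
    p = p / p.sum()
    return bool(np.all(np.abs(p - 1.0 / len(p)) < tol))
```

This could gate a louder `logger.warning` (or a TODO assertion) until the NRS 403 is resolved.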
baogorek
left a comment
Approving Phase 1 — the crosswalk and assignment engine look good. Please see my comment above for a few things to address before merge.
nwoodruff-co
left a comment
Putting a request-changes here: given the importance of the data, I'm going to say don't approve unless the PR is ready to merge at the time of approval.

Aiming to block the least, but these are the minimum:

- The constituency impacts (all 650) currently take less than 5 seconds to run after a completed national simulation. This probably increases that by several orders of magnitude, to 10 minutes plus. Can you confirm or reject this, and make your case here? I agree yours is a theoretically better solution, but we do need to consider this.
- This would be a major data change: we need to run microsimulation regression tests to understand whether outputs significantly change. At a bare minimum this should include these examples:
  a) the living standards outlook (relative change in real HBAI household net income BHC from 2024 to 2029, broken down by age group)
  b) raising the higher rate to 41p (broken down by equivalised HBAI household net income BHC decile)

If you can say these don't change by 0.1p/0.1bn respectively, we can skip digging further.
Ran the requested microsimulation regression checks locally on March 18, 2026. Method:
This is important because the latest PyPI Result: for the two examples below,
Relative change in household net income by
So for these examples, the PR changes are This also matches the scope of the diff: Phase 1 adds OA crosswalk / assignment code and
@nwoodruff-co Re your performance concern about constituency impacts going from <5s to 10+ minutes: Phase 1 has zero performance impact. This PR adds only new standalone files — zero existing files are modified. The new
The current <5s constituency impact calculation ( The performance question is valid but applies to future phases (Phase 2: clone-and-assign, Phase 3: L0 calibration), where the weight matrix would grow from 650 × 100K to potentially 650 × 1M+. That's worth addressing when those PRs come, not here. Between this and Max's regression results (zero change on both requested examples), both concerns from your changes-requested review should be resolved for Phase 1.
Background
This PR implements Phase 1 of a 6-phase pipeline to enable Output Area (OA) level calibration — the UK equivalent of the US Census Block approach.
Why are we doing this?
The US pipeline (policyengine-us-data) uses a clone-and-prune approach that produces much finer geographic granularity than our current UK methodology. This PR goes down to Output Area level (~235K OAs across the UK), the UK equivalent of the US Census Block, and is the first step.
What this PR does (Phase 1: OA Crosswalk & Geographic Assignment)
1. Unified UK Output Area Crosswalk
Downloads and combines geographic lookups from three national statistics agencies into a single crosswalk:
```
OA → LSOA/DataZone → MSOA/IntermediateZone → LA → Constituency → Region → Country
```
Data sources:
Output: `storage/oa_crosswalk.csv.gz` (1.4MB compressed) — 235,243 areas, 65M population, 632 constituencies, 363 LAs, 11 regions
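The combination step can be sketched as a concat over a shared schema, assuming each agency's lookup has already been renamed to common column names (the schema and helper below are illustrative, not the PR's actual code):

```python
import pandas as pd

# Illustrative shared schema; the real ONS/NRS/NISRA lookups use
# different column names and must be renamed before concatenation.
SCHEMA = ["oa_code", "lsoa_code", "msoa_code", "la_code",
          "constituency_code", "region", "country", "population"]

def build_unified_crosswalk(ew, scotland, ni=None):
    frames = [ew[SCHEMA], scotland[SCHEMA]]
    if ni is not None:  # NISRA source currently 404s, so NI may be absent
        frames.append(ni[SCHEMA])
    xw = pd.concat(frames, ignore_index=True)
    # OA codes must be unique across agencies for the hierarchy to nest.
    assert xw["oa_code"].is_unique, "duplicate OA codes across agencies"
    return xw
```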
2. Geographic Assignment Engine
Assigns population-weighted random Output Areas to cloned FRS household records, with two key constraints: clones stay within the household's original country, and clones of the same record avoid constituency collisions.
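The two constraints named in the PR description (same-country sampling, constituency collision avoidance) can be sketched as follows; the data layout and function are illustrative assumptions, not the engine's actual API:

```python
import numpy as np

def assign_clone_oas(country_dists, household_country, n_clones, rng=None):
    # country_dists[country] holds parallel "oa_codes", "constituency",
    # and "probs" arrays (illustrative names). Sampling only from the
    # household's own country enforces the country constraint; redrawing
    # on a repeated constituency enforces collision avoidance.
    # Note: assumes n_clones <= number of distinct constituencies.
    rng = rng or np.random.default_rng()
    dist = country_dists[household_country]
    chosen, seen = [], set()
    while len(chosen) < n_clones:
        i = rng.choice(len(dist["oa_codes"]), p=dist["probs"])
        if dist["constituency"][i] in seen:
            continue  # collision with an earlier clone: redraw
        seen.add(dist["constituency"][i])
        chosen.append(dist["oa_codes"][i])
    return chosen
```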
3. Tests — 19 passing, 1 skipped (NI)
Validates crosswalk completeness (OA counts, population totals, hierarchy nesting, country prefixes) and assignment correctness (country constraints, collision avoidance, population-weighted sampling, save/load roundtrip).
Known limitations
What comes next (Phases 2-6)
Phase 2: Clone-and-Assign
Clone each FRS household N times (start with N=10), assign each clone a different OA. Insert into `create_datasets.py` after imputations, before calibration.
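A minimal sketch of the cloning step, under the assumption that weights are divided by N so national aggregates are preserved before calibration reweights (column names are illustrative):

```python
import numpy as np
import pandas as pd

def clone_households(df, n_clones=10):
    # Repeat each FRS household record n_clones times, splitting its
    # weight evenly so totals are unchanged pre-calibration; each clone
    # gets an id so it can later receive a distinct OA.
    cloned = df.loc[df.index.repeat(n_clones)].reset_index(drop=True)
    cloned["household_weight"] = cloned["household_weight"] / n_clones
    cloned["clone_id"] = np.tile(np.arange(n_clones), len(df))
    return cloned
```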
US ref: PRs #457, #531
Phase 3: L0 Calibration Engine
Port L0-regularized optimization from US side. HardConcrete gates to actively drop records, producing sparse datasets. Add `l0-python` dependency.
US ref: PRs #364, #365
Phase 4: Sparse Matrix Builder
Build sparse `(n_targets × n_records*n_clones)` calibration matrix. Simulate PolicyEngine-UK per clone, wire existing `targets/sources/` into sparse matrix rows.
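The matrix shape can be illustrated with a small scipy sketch: each row is one target, nonzero only for records inside that target's geography (inputs and helper are assumptions for illustration, not the planned implementation):

```python
import numpy as np
from scipy import sparse

def build_calibration_matrix(target_masks, values):
    # Row t holds each record's contribution to target t, masked to the
    # target's geography; rows are stacked into one CSR matrix so the
    # (n_targets x n_records*n_clones) product with weights stays cheap.
    rows = [sparse.csr_matrix(values * mask) for mask in target_masks]
    return sparse.vstack(rows, format="csr")
```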
US ref: PRs #456, #489
Phase 5: SQLite Target Database
Hierarchical target storage: UK → Country → Region → LA → Constituency → MSOA → LSOA → OA. Migrate existing CSV/Excel targets into SQLite.
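One plausible shape for such a store, sketched with the stdlib `sqlite3` module; the table layout below is an assumption for illustration, not the Phase 5 design:

```python
import sqlite3

# Hypothetical schema: one row per (geography level, code, variable, period),
# covering the UK -> Country -> ... -> OA hierarchy in a single table.
DDL = """
CREATE TABLE IF NOT EXISTS targets (
    geography_level TEXT NOT NULL,   -- 'UK', 'Country', ..., 'LSOA', 'OA'
    geography_code  TEXT NOT NULL,
    variable        TEXT NOT NULL,
    period          INTEGER NOT NULL,
    value           REAL NOT NULL,
    PRIMARY KEY (geography_level, geography_code, variable, period)
);
"""

def open_target_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(DDL)
    return conn
```

Migrating the existing CSV/Excel targets would then be a matter of inserting one row per target into this table.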
US ref: PRs #398, #488
Phase 6: Local Area Publishing
Generate per-area H5 files from sparse weights. Modal integration for scale.
US ref: PR #465
File summary