Transform AddressBase Premium data into a clean flatfile format suitable for use with uk_address_matcher.
The key feature is that we output multiple variants of the full address string to increase the likelihood of matching.
This package downloads, extracts, and transforms AddressBase Premium data from the OS Data Hub into a single parquet file optimized for address matching with uk_address_matcher.
AddressBase Premium data is available to many government users under the PSGA.
The whole pipeline is automated:
- Set up your datapackage in the OS Data Hub, and update the config with the `package_id` and `version_id`
- Provide your OS API key in the `.env` file (from https://osdatahub.os.uk/data/apis/projects -> your project)
- Run `script.py`
- The resultant parquet file(s) (default path `./data/output`) are now in the format required by `uk_address_matcher`
If you prefer to use the NGD files, an equivalent repo is available here
- Create a datapackage on the OS Data Hub containing the AddressBase data you need - either the full supply or a subset covering the area you're interested in
- Python 3.12+
- uv package manager
- OS Data Hub API key (get one at https://osdatahub.os.uk/) — only required for the download step
# Clone the repository
git clone <repo-url>
cd abp-pipeline
# Install dependencies
uv sync
# Create environment file with your API credentials
cp .env.example .env
# Edit .env and add your OS_PROJECT_API_KEY

Edit config.yaml to customize paths if needed (defaults work out of the box):
paths:
work_dir: ./data
downloads_dir: ./data/downloads
extracted_dir: ./data/extracted
parquet_dir: ./data/parquet
output_dir: ./data/output
os_downloads:
package_id: "0040204651"
version_id: "6758807" # Update when new data is released
processing:
# Number of chunks to split flatfile processing into
# Use higher values (e.g., 10) for lower memory usage on laptops
  num_chunks: 1

The pipeline is run via script.py, which is configured by editing the variables at the top of the file:
# In script.py, set:
STEP = ["download", "extract", "split", "flatfile"] # Or "all" for all steps
FORCE = True  # Re-run even if outputs exist

Then run:

uv run python script.py

- Download - Downloads ABP data from OS Data Hub
- Extract - Extracts zip files to CSV
- Split - Splits mixed CSV into separate parquet files by record type
- Flatfile - Transforms into final address matching format
Each stage is idempotent - safe to re-run. Set FORCE = True to overwrite existing outputs.
The final output is written to data/output/ as one or more parquet files:
- Single chunk mode (`num_chunks: 1`): `abp_for_uk_address_matcher.chunk_001_of_001.parquet`
- Multi-chunk mode (`num_chunks: N`): `abp_for_uk_address_matcher.chunk_001_of_00N.parquet`, `chunk_002_of_00N.parquet`, etc.
Chunking reduces memory usage by processing UPRNs in batches. The union of all chunk files equals the single-chunk output. Use a higher num_chunks (e.g., 10) for laptops with limited RAM.
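The "union of chunks equals the whole" property can be pictured as a simple stable partition of UPRNs. The pipeline's actual partitioning scheme isn't documented here; this sketch assumes a plain `uprn % num_chunks` assignment for illustration:

```python
def chunk_of(uprn: int, num_chunks: int) -> int:
    # Assign each UPRN to a stable chunk. The real pipeline's scheme may
    # differ -- modulo is just the simplest stable choice for a sketch.
    return uprn % num_chunks

uprns = [100090000001, 100090000002, 100090000005, 100090000010]

# Process one chunk at a time to bound peak memory.
chunks = [[u for u in uprns if chunk_of(u, 3) == i] for i in range(3)]

# Every UPRN lands in exactly one chunk, so the union of all chunk
# outputs reproduces the single-chunk output.
assert sorted(u for c in chunks for u in c) == sorted(uprns)
```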
Each file contains:
| Column | Description |
|---|---|
| `uprn` | Unique Property Reference Number |
| `postcode` | Postal code |
| `address_concat` | Concatenated address string (without postcode) |
| `classification_code` | Property classification |
| `logical_status` | Address status (1=Approved, 3=Alternative, etc.) |
| `blpu_state` | Building state |
| `postal_address_code` | Postal address indicator |
| `udprn` | Royal Mail delivery point reference |
| `parent_uprn` | Parent UPRN for hierarchical addresses |
| `hierarchy_level` | C=Child, P=Parent, S=Singleton |
| `source` | Data source (LPI, ORGANISATION, DELIVERY_POINT, CUSTOM_LEVEL) |
| `variant_label` | Address variant type |
| `is_primary` | Whether this is the primary address for the UPRN |
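Since the output carries multiple variants per UPRN, a downstream consumer wanting one row per property can filter on `is_primary`. A minimal sketch using the column names from the table above (the rows are invented sample data, not real output):

```python
# Invented sample rows using the output schema's column names.
rows = [
    {"uprn": 1, "variant_label": "lpi", "is_primary": True},
    {"uprn": 1, "variant_label": "delivery_point", "is_primary": False},
    {"uprn": 2, "variant_label": "lpi", "is_primary": True},
]

# Keep only the primary address variant for each UPRN -- useful when a
# consumer wants exactly one row per property rather than all variants.
primary = [r for r in rows if r["is_primary"]]
```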
If you prefer to download manually:
- Log into https://osdatahub.os.uk/
- Create a datapackage
- Download the CSV zip file (e.g., AB76GB_CSV.zip)
To run the pipeline from a manual download:
- Place the zip in the downloads directory configured in config.yaml.
- By default this is data/downloads/
- The extract step looks for *.zip files in this folder
- Run the pipeline starting from extract:
- In script.py set STEP = ["extract", "split", "flatfile"]
- Then run: uv run python script.py
Notes:
- Extraction will create a subfolder under data/extracted/ named after the zip stem (e.g., data/extracted/AB76GB_CSV/) and the split step will read **/*.csv from there.
If you prefer to use the OS Downloads API instead:
- Set up an API key and download using a script
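For such a script, the OS Downloads API exposes data-package versions at an endpoint along the lines shown below. The endpoint shape is taken from the OS Downloads API documentation, but verify it against the current docs before relying on it; the helper name is illustrative:

```python
from urllib.parse import urlencode

BASE = "https://api.os.uk/downloads/v1"

def package_version_url(package_id: str, version_id: str, api_key: str) -> str:
    """Build the data-package version URL for the OS Downloads API.

    Endpoint shape per the OS Downloads API docs -- check the current
    documentation before depending on it.
    """
    query = urlencode({"key": api_key})
    return f"{BASE}/dataPackages/{package_id}/versions/{version_id}?{query}"

# Example (requires a valid key and network access):
#   import os, json, urllib.request
#   url = package_version_url("0040204651", "6758807",
#                             os.environ["OS_PROJECT_API_KEY"])
#   with urllib.request.urlopen(url) as resp:
#       files = json.load(resp)  # lists downloadable files and their URLs
```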
Excluding the time to download the full 9.6 GB zip file, on an M4 MacBook Pro the pipeline takes about:
- 1 minute to extract the zip file
- 7 minutes to run the remainder of the build script