Insurtech Open Data

High-quality geospatial data for insurtech is often locked behind expensive APIs, proprietary licenses, or scattered in inconsistent formats.

The Geoparse insurtech-open-data repository solves this by providing a centralised and standardised collection of open data. We process raw sources into efficient, ready-to-use Parquet formats, creating a foundational data layer that is free to use, modify, and distribute. Our goal is to lower the barrier to entry for building location-powered insurtech applications.

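The project integrates data from the following providers: the Office for National Statistics (ONS), Ordnance Survey (OS), OpenStreetMap (via Geofabrik), the Department for Transport (DfT), and UK Police open data (data.police.uk).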


Prerequisites

DuckDB

This repository uses DuckDB, a lightweight, in-process analytical database designed for fast querying of large datasets. Unlike traditional database servers, DuckDB runs directly inside your scripts or applications and can query files such as CSV and Parquet without requiring data to be imported first. It is often described as “SQLite for analytics” due to its simplicity and efficiency for analytical workloads. We use DuckDB to export files to the Parquet format.

On macOS:

brew update
brew install duckdb
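
As a rough illustration of the CSV-to-Parquet export described above, the sketch below builds a DuckDB `COPY` statement and the matching `duckdb` CLI arguments from Python. The file names are hypothetical and the ZSTD compression option is an assumption; the repository's scripts may use different settings.

```python
import shlex
from typing import List


def sql_quote(path: str) -> str:
    """Quote a path as a SQL string literal (embedded single quotes are doubled)."""
    return "'" + path.replace("'", "''") + "'"


def csv_to_parquet_sql(csv_path: str, parquet_path: str) -> str:
    """Build a DuckDB COPY statement that reads a CSV file and writes Parquet."""
    return (
        f"COPY (SELECT * FROM read_csv_auto({sql_quote(csv_path)})) "
        f"TO {sql_quote(parquet_path)} (FORMAT PARQUET, COMPRESSION ZSTD);"
    )


def duckdb_cli_args(sql: str) -> List[str]:
    """Arguments for the duckdb CLI; -c runs one statement and exits."""
    return ["duckdb", "-c", sql]


if __name__ == "__main__":
    # Hypothetical file names, printed as a shell command rather than executed.
    statement = csv_to_parquet_sql("input.csv", "output.parquet")
    print(shlex.join(duckdb_cli_args(statement)))
```

The command is only printed here, so the sketch can be inspected without DuckDB installed; passing the argument list to `subprocess.run` would execute it.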

GDAL

Before running the scripts in this repository, ensure that GDAL is installed on your system. GDAL (Geospatial Data Abstraction Library) and OGR (OGR Simple Features Library) are essential tools for working with geospatial data. GDAL is designed for reading, writing, and processing raster geospatial data, such as satellite images and digital elevation models. It supports a variety of raster formats, including GeoTIFF, JPEG, PNG, and HDF5. On the other hand, OGR is specialized in handling vector geospatial data, including points, lines, and polygons, and supports formats like Shapefiles, GeoJSON, KML, PostGIS, and OSM PBF.

A powerful feature within GDAL/OGR is the ogr2ogr command-line utility, which is dedicated to vector data manipulation and conversion. ogr2ogr allows users to convert vector data between formats (e.g., Shapefile to GeoJSON), filter and subset data using SQL-like queries, and reproject data to different coordinate reference systems (e.g., transforming WGS84 to a local EPSG code).

In summary, GDAL is tailored for raster data, OGR for vector data, and ogr2ogr provides versatile tools for converting, filtering, and reprojecting vector datasets.

On Debian-based systems:

sudo apt update
sudo apt install gdal-bin

On macOS:

brew update
brew install gdal

If GDAL is already installed, you can upgrade it to the latest version:

brew upgrade gdal    # macOS
sudo apt install --only-upgrade gdal-bin    # Debian

After completing the installation, verify it by running the following commands:

gdalinfo --version
ogrinfo --version

Both commands should return output similar to:

GDAL 3.11.3 "Eganville", released 2025/07/12

Pigz (optional)

To save storage space, we automatically compress CSV files after processing. The scripts use pigz (parallel gzip) for faster compression if available, otherwise they fall back to standard gzip. macOS and all major Linux distributions come pre-installed with gzip as part of the standard Unix utilities, so it's always available as a reliable fallback. Both tools produce compatible .gz files, with pigz being significantly faster on multi-core systems while remaining optional.

On Debian-based systems:

sudo apt update
sudo apt install pigz

On macOS:

brew update
brew install pigz
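
The pigz-or-gzip fallback described above can be sketched in Python as follows. This is a minimal illustration, not the repository's shell logic; the function names are hypothetical, and the `which` parameter exists only so the lookup can be replaced in tests.

```python
import shutil
from typing import Callable, List, Optional


def pick_compressor(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Return 'pigz' when it is on PATH, otherwise fall back to 'gzip'."""
    for tool in ("pigz", "gzip"):
        if which(tool):
            return tool
    raise RuntimeError("neither pigz nor gzip was found on PATH")


def compress_command(
    csv_path: str, which: Callable[[str], Optional[str]] = shutil.which
) -> List[str]:
    """Build the command that compresses csv_path in place to csv_path + '.gz'."""
    return [pick_compressor(which), csv_path]
```

Because both tools accept the same basic invocation and write compatible `.gz` output, the caller does not need to know which one was selected.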

Open Datasets

ONS Postcode Directory

Source: ONS Postcode Directory

The ONS Postcode Directory is a comprehensive dataset from the Office for National Statistics that provides geographic coordinates for every postcode unit across the UK. The dataset covers over 2.7 million postcodes, with approximately 1.8 million currently active. Each record includes the postcode, its precise location and associated administrative boundary codes such as country, region, county and output area. Released under the Open Government Licence, it can be freely used for both commercial and non-commercial purposes with proper attribution.

The following script provides an automated pipeline for downloading, cleansing and converting postcode data into Parquet files.

./ons-postcode-directory.sh

The script automatically:

  • Downloads the latest ONS Postcode Directory from ArcGIS Hub
  • Cleanses and validates the data
  • Converts coordinates to WGS84 (EPSG:4326)
  • Outputs to compressed Parquet format

The generated Parquet file contains postcode-level geographic and administrative information with the following structure:

| Column | Description |
| --- | --- |
| postcode | The standard spaced version of the postcode (e.g., “GL4 5EB”). |
| intr_date | Date (YYYYMM) when the postcode was introduced. |
| term_date | Date (YYYYMM) when the postcode was terminated (NaN if active). |
| user_type | User type indicator (0 = small users, 1 = large users). |
| country | ONS country code. |
| region | ONS region code. |
| county | County code (if applicable). |
| police_force | Police force area code. |
| msoa | Middle Layer Super Output Area 2021 code. |
| lsoa | Lower Layer Super Output Area 2021 code. |
| oa | Output Area 2021 code. |
| rural_urban | Rural–urban classification code. |
| national_park | National park area code (if applicable). |
| lat | Latitude coordinate (WGS84). |
| lon | Longitude coordinate (WGS84). |

The following sample shows the data structure stored in the Parquet file:

| postcode | intr_date | term_date | user_type | country | region | county | police_force | msoa | lsoa | oa | rural_urban | national_park | lat | lon |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GL4 5EB | 199512 | NaN | 0 | E92000001 | E12000009 | E10000013 | E23000037 | E02004645 | E01022281 | E00113243 | UN1 | E65000001 | 51.84167 | -2.198833 |
| PL6 5FN | 201509 | NaN | 0 | E92000001 | E12000009 | E99999999 | E23000035 | E02003126 | E01015092 | E00181102 | UN1 | E65000001 | 50.41151 | -4.113341 |
| DT2 8DS | 198001 | NaN | 0 | E92000001 | E12000009 | E99999999 | E23000039 | E02004266 | E01020490 | E00103879 | RSF1 | E65000001 | 50.67997 | -2.297255 |
| SA3 5EG | 202303 | NaN | 0 | W92000004 | W99999999 | W99999999 | W15000003 | W02000196 | W01000882 | W00004684 | UN1 | W31000001 | 51.58912 | -4.008486 |
| GU11 3UW | 199901 | 200009.0 | 1 | E92000001 | E12000008 | E10000014 | E23000030 | E02004812 | E01023117 | E00117455 | UN1 | E65000001 | 51.23632 | -0.760916 |
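
The `intr_date` and `term_date` values in the sample above appear as YYYYMM numbers, with `NaN` marking an active postcode and terminated dates showing as floats such as `200009.0`. A minimal sketch of how these values might be normalised after loading is shown below; the helper names are hypothetical and the handling of missing values is an assumption based on that sample only.

```python
import math
from typing import Optional, Tuple


def parse_yyyymm(value) -> Optional[Tuple[int, int]]:
    """Convert 199512, '199512' or 200009.0 to (year, month); None when missing."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    text = str(value).strip()
    if text == "" or text.lower() == "nan":
        return None
    number = int(float(text))  # tolerates the trailing '.0' seen in the sample
    return number // 100, number % 100


def is_active(term_date) -> bool:
    """Treat a postcode as active when it has no termination date."""
    return parse_yyyymm(term_date) is None
```

For example, `parse_yyyymm(200009.0)` yields `(2000, 9)`, and `is_active("NaN")` is true for the first four sample rows.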

For more information on additional features included in the original CSV dataset, please refer to the User Guide available with the latest ONS Postcode Directory on the UK Government Open Data Portal. Download the latest data and unzip it to find the User Guide.

ONS UPRN Directory

Source: ONS UPRN Directory

The Unique Property Reference Number (UPRN) is a unique identifier assigned to every addressable location in the United Kingdom, including residential and commercial properties, land parcels, and other structures such as bus shelters or community assets. Managed by Ordnance Survey, the UPRN acts as a consistent reference point across different datasets and systems, so that information from local authorities, government bodies, and private organisations can be accurately linked to the same physical location. Because it is stable over the lifetime of the property or land parcel, the UPRN plays a vital role in data integration, geocoding, property analytics, and service delivery, helping organisations reduce duplication, improve accuracy, and make better evidence-based decisions.

You can download the latest UPRN dataset from Ordnance Survey Data Hub. Choose the CSV format, as it is smaller and faster to process than the GeoPackage version.

Alternatively, you can run the script directly:

./ons-uprn-directory.sh

This will download, process, and save the latest OS Open UPRN dataset as a Parquet file in the data/os-open-uprn/ directory. We convert the dataset to a Parquet file (using DuckDB) instead of a GeoParquet file (using ogr2ogr) because reading standard Parquet files with pandas is significantly faster than loading GeoParquet files with geopandas in Python.

ONS Area Codes

Source: ONS Postcode Directory and ONS UPRN Directory

The following script builds a comprehensive ONS area codes dictionary. It downloads both the Postcode and UPRN directories from ArcGIS Hub, extracts geographic area codes and names from the administrative boundary files (including countries, regions, counties, local authorities, and statistical areas), and writes them to standardised CSV files with consistent quoting and deduplication. Finally, it merges both datasets into a single area codes reference file for data analysis and mapping.

./one-area-codes.sh

Here’s a sample of the resulting dataset:

"N21000640","Carntogher_D"
"E01034396","Liverpool 010G"
"E02001206","Stockport 020"
"W01000581","Pembrokeshire 003B"
"S01016956","Hillington - 04"

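Since the area codes file is a two-column quoted CSV, as in the sample above, it can be loaded into a simple code-to-name lookup. The sketch below uses only the five sample rows; the function names are hypothetical, and the first-letter country mapping reflects the usual ONS code prefixes (E, W, S, N) rather than anything the script itself produces.

```python
import csv
import io
from typing import Dict, TextIO

# The five sample rows shown above, embedded so the example is self-contained.
SAMPLE = '''"N21000640","Carntogher_D"
"E01034396","Liverpool 010G"
"E02001206","Stockport 020"
"W01000581","Pembrokeshire 003B"
"S01016956","Hillington - 04"
'''

COUNTRY_BY_PREFIX = {
    "E": "England",
    "W": "Wales",
    "S": "Scotland",
    "N": "Northern Ireland",
}


def load_area_codes(handle: TextIO) -> Dict[str, str]:
    """Read code,name rows into a dictionary keyed by area code."""
    return {row[0]: row[1] for row in csv.reader(handle) if len(row) >= 2}


def country_of(code: str) -> str:
    """Map an area code to a country by its first letter; 'Unknown' otherwise."""
    return COUNTRY_BY_PREFIX.get(code[:1].upper(), "Unknown")


if __name__ == "__main__":
    codes = load_area_codes(io.StringIO(SAMPLE))
    for code, name in codes.items():
        print(code, name, country_of(code))
```

With a real file, the `io.StringIO(SAMPLE)` argument would be replaced by an open file handle.
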
ONS Administrative Boundaries

The Office for National Statistics (ONS) provides administrative boundary data for various geographic levels across the UK, including countries, English regions, counties, local authority districts, parishes, and wards. Each boundary dataset is available in multiple spatial resolutions and coastline generalisations to balance spatial accuracy with processing performance. Each boundary file includes a suffix such as BFC, BFE, BGC, BSC, or BUC that indicates both the detail level and whether the boundary is clipped to the coastline or includes the extent of the realm (i.e., offshore areas). These options let you balance geometric accuracy with file size and performance, depending on your analysis or mapping needs.

Use full resolution versions (BFC/BFE) for analysis or precise overlays, and generalised versions (BGC/BSC/BUC) for visualisation, web mapping, or when handling large datasets. Choose “clipped” versions when you only need land boundaries, or “extent of realm” when including sea/offshore territories is important.

| Code | Meaning | Detail |
| --- | --- | --- |
| BFE | Boundary – Full resolution, Extent of the Realm | Highest-detail geometry including offshore areas and islands. |
| BFC | Boundary – Full resolution, Clipped to coastline | Same high-detail boundary, but trimmed at the mean high-water coastline. |
| BGC | Boundary – Generalised (~20 m), Clipped to coastline | Simplified geometry suitable for most mapping and display purposes. |
| BSC | Boundary – Super-generalised (~200 m), Clipped to coastline | Coarser generalisation for lightweight mapping of large areas. |
| BUC | Boundary – Ultra-generalised (~500 m), Clipped to coastline | Smallest file size and simplest geometry, with the least detail. |
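
The suffix codes in the table above can be captured in a small lookup, which is handy when choosing a file variant in code. This is a minimal sketch covering only the five codes listed; the class and function names are hypothetical.

```python
from typing import NamedTuple, Optional


class BoundaryVariant(NamedTuple):
    code: str
    resolution: str
    tolerance_m: Optional[int]  # approximate generalisation in metres; None for full resolution
    extent: str


VARIANTS = {
    "BFE": BoundaryVariant("BFE", "full", None, "extent of the realm"),
    "BFC": BoundaryVariant("BFC", "full", None, "clipped to coastline"),
    "BGC": BoundaryVariant("BGC", "generalised", 20, "clipped to coastline"),
    "BSC": BoundaryVariant("BSC", "super-generalised", 200, "clipped to coastline"),
    "BUC": BoundaryVariant("BUC", "ultra-generalised", 500, "clipped to coastline"),
}


def decode_suffix(code: str) -> BoundaryVariant:
    """Look up a boundary suffix, ignoring case and surrounding whitespace."""
    key = code.strip().upper()
    if key not in VARIANTS:
        raise ValueError(f"unknown boundary suffix: {code!r}")
    return VARIANTS[key]


def includes_offshore(code: str) -> bool:
    """True only for variants that cover the extent of the realm."""
    return decode_suffix(code).extent == "extent of the realm"
```

For instance, `decode_suffix("bgc").tolerance_m` gives `20`, and `includes_offshore("BFE")` is the only case in this table that returns true.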

Countries

Source: ONS Countries Boundaries

First, download the five GeoPackage files for all spatial resolutions (BFC, BFE, BGC, BSC, and BUC) from the source page above. Then, run the following script to process the data and convert it to Parquet format.

./ons-admin-country.sh

Regions

Source: ONS Regions Boundaries

Download the five GeoPackage files for all spatial resolutions (BFC, BFE, BGC, BSC, and BUC) from the source page above. Then, run the following script to process the data and convert it to Parquet format.

./ons-admin-region.sh

Counties and Unitary Authorities

Source: ONS Counties and Unitary Authorities Boundaries

Download the five GeoPackage files for all spatial resolutions (BFC, BFE, BGC, BSC, and BUC) from the source page above. Then, run the following script to process the data and convert it to Parquet format.

./ons-admin-county-ua.sh

ONS Census Boundaries

Source: https://www.data.gov.uk/dataset/4a880a9b-b509-4a82-baf1-07e3ce104f4b/output-areas1

ons-output-area.sh processes socio-economic data for different geographic layers in England and Wales, following the Office for National Statistics (ONS) spatial hierarchy. The smallest statistical building block is the Census Output Area (OA), representing a compact group of households designed for detailed local analysis. Lower-layer Super Output Areas (LSOA) combine multiple OAs to ensure population stability over time, while Middle-layer Super Output Areas (MSOA) group several LSOAs to create larger, consistent geographic zones suitable for public reporting and policy analysis.

./ons-output-area.sh

ONS Income Data

Source: Income estimates for small areas, England and Wales - Office for National Statistics (ONS)

The Excel file on the above page contains separate sheets for:

  • Total annual household income
  • Net annual income
  • Net income before housing costs
  • Net income after housing costs

Data are provided at the Middle Layer Super Output Area (MSOA) level for England and Wales. Each MSOA is represented by three values: the lower confidence limit, the mean estimate, and the upper confidence limit, which together form a 95% confidence interval. A 95% confidence interval means that we can be 95% confident the true mean household income for each area lies between the lower and upper confidence limits. For further details, see page 30 of the Technical Report from the Office for National Statistics.
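
To make the three-value structure concrete, the sketch below models one MSOA estimate and derives the interval width and a rough standard error. It assumes a symmetric normal-approximation interval (mean ± 1.96 standard errors), which may not match how ONS constructs its limits, and the class name and any figures passed in are purely illustrative.

```python
from dataclasses import dataclass

Z_95 = 1.96  # normal quantile for a two-sided 95% interval (assumption)


@dataclass(frozen=True)
class IncomeEstimate:
    lower: float
    mean: float
    upper: float

    def __post_init__(self) -> None:
        # The mean estimate should sit inside its own confidence limits.
        if not (self.lower <= self.mean <= self.upper):
            raise ValueError("expected lower <= mean <= upper")

    @property
    def width(self) -> float:
        """Full width of the confidence interval."""
        return self.upper - self.lower

    @property
    def approx_standard_error(self) -> float:
        """Rough standard error implied by a symmetric 95% interval."""
        return self.width / (2 * Z_95)


def intervals_overlap(a: IncomeEstimate, b: IncomeEstimate) -> bool:
    """True when the two confidence intervals share at least one value."""
    return a.lower <= b.upper and b.lower <= a.upper
```

Overlapping intervals are a quick signal that the difference between two areas may not be meaningful, although this check is cruder than a formal test.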

The following script automates the process of downloading and converting income data into Parquet files, processing each sheet individually.

./ons-income.sh

OS Open USRN

Source: https://osdatahub.os.uk/downloads/open/OpenUSRN

The Unique Street Reference Number (USRN) is a nationally recognised identifier used in Great Britain to uniquely reference every street, including roads, footpaths, cycleways and alleys. It forms part of the national addressing system and is maintained through the National Street Gazetteer, which is compiled and updated by local authorities. Much like the Unique Property Reference Number (UPRN) identifies individual properties, the USRN ensures that each street has a consistent reference across different datasets and organisations. This makes it essential for activities such as managing streetworks permits, supporting navigation and transport planning, enabling emergency services, and integrating data across government and utility providers.

You can download the latest USRN dataset from Ordnance Survey Data Hub as a GeoPackage file. The following command displays detailed information about the GeoPackage file's structure and contents.

ogrinfo -al -so osopenusrn_202510.gpkg

Command Breakdown:

  • ogrinfo: GDAL/OGR utility for getting information about geospatial datasets
  • -al: All layers - shows information about all layers in the dataset
  • -so: Summary only - shows only the summary (no feature data)
  • osopenusrn_202510.gpkg: The input GeoPackage file
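
The same inspection can be driven from Python. The sketch below assembles the flags broken down above and, when asked, runs `ogrinfo` through `subprocess`; the function names are hypothetical, and it assumes GDAL is installed and the GeoPackage has already been downloaded.

```python
import shlex
import shutil
import subprocess
from typing import List


def ogrinfo_summary_args(path: str) -> List[str]:
    """Arguments for a summary-only listing of every layer in the dataset."""
    return ["ogrinfo", "-al", "-so", path]


def run_ogrinfo_summary(path: str) -> str:
    """Run ogrinfo and return its text output; raises if GDAL is not installed."""
    if shutil.which("ogrinfo") is None:
        raise RuntimeError("ogrinfo not found on PATH; install GDAL first")
    result = subprocess.run(
        ogrinfo_summary_args(path), check=True, capture_output=True, text=True
    )
    return result.stdout


if __name__ == "__main__":
    # Only prints the command, so the example is safe to run without GDAL.
    print(shlex.join(ogrinfo_summary_args("osopenusrn_202510.gpkg")))
```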

The following script downloads the GeoPackage file, processes it, and exports it to a Parquet file using ogr2ogr.

./os-open-usrn.sh

OS Open Roads

Source: https://osdatahub.os.uk/downloads/open/OpenRoads

./os-open-roads.sh

OS Open Greenspace

Source: https://osdatahub.os.uk/data/downloads/open/OpenGreenspace

OS Open Greenspace is a definitive geospatial dataset from Ordnance Survey that provides the location and classification of public parks, sports facilities, and other accessible greenspaces across Great Britain. For the insurance industry, this data is critical for enhancing risk models for property and liability underwriting by precisely quantifying exposure to greenspace-related perils—such as public injury liability in parks, vandalism or theft risk for properties adjacent to open spaces, and subsidence potential influenced by tree root systems from nearby allotments or gardens.

The provided GeoPackage file contains two spatial layers: an access_point layer with point locations for greenspace entry points and a greenspace_site layer with MultiPolygon geometries representing the physical boundaries of those greenspaces. The dataset is available for free under the Open Government Licence from the OS Data Hub.

The following script processes this data, generating two corresponding Parquet files named access_point.parquet and greenspace_site.parquet.

./os-open-greenspace.sh

OS Open Names

Source: https://osdatahub.os.uk/data/downloads/open/OpenNames

OS Open Names is a dataset from Ordnance Survey that provides a comprehensive index of place names, road names, and postcodes across Great Britain. This section includes tools and examples for accessing, processing, and analysing OS Open Names data, helping you link locations to coordinates, perform spatial lookups, and integrate authoritative geographic names into your applications or analyses.

./os-open-names.sh

OpenStreetMap (OSM)

Source: https://download.geofabrik.de/

OpenStreetMap (OSM) is a collaborative, community-driven project that provides freely available geographic data covering the entire world. It includes detailed information about roads, buildings, land use, waterways, and many other physical and human-made features. Geofabrik offers regularly updated regional extracts of OSM data, which are particularly useful for analytical workflows that focus on specific countries or administrative areas.

The following script automatically extracts structured OSM data for the United Kingdom from Geofabrik and converts each layer—such as points, lines, multipolygons, and other relations—into separate Parquet files (e.g., points.parquet, lines.parquet) for efficient geospatial analysis. A list of other available regions and countries can be found on the Geofabrik download page.

./geofabrik-osm.sh europe united-kingdom

This pipeline leverages those extracts to produce lightweight, analysis-ready datasets that can be easily queried, filtered, and joined with other spatial layers, making them ideal for applications in exposure management, urban planning, mobility analytics, and environmental modelling.
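
As a sketch of how the two script arguments might map onto a download, the snippet below builds a Geofabrik extract URL and the expected per-layer output names. The `-latest.osm.pbf` naming follows Geofabrik's usual download layout but is an assumption about what the script requests, the layer list is the set exposed by GDAL's OSM driver, and the function names and output directory are hypothetical.

```python
import posixpath
from typing import List
from urllib.parse import quote

BASE_URL = "https://download.geofabrik.de"

# Layers exposed by GDAL's OSM driver when reading a .osm.pbf file.
OSM_LAYERS = ("points", "lines", "multilinestrings", "multipolygons", "other_relations")


def geofabrik_pbf_url(region: str, country: str) -> str:
    """Build the URL of the latest extract for a region/country pair."""
    return f"{BASE_URL}/{quote(region)}/{quote(country)}-latest.osm.pbf"


def parquet_outputs(out_dir: str) -> List[str]:
    """One Parquet path per OSM layer, e.g. <out_dir>/points.parquet."""
    return [posixpath.join(out_dir, f"{layer}.parquet") for layer in OSM_LAYERS]


if __name__ == "__main__":
    print(geofabrik_pbf_url("europe", "united-kingdom"))
    for path in parquet_outputs("data/osm"):  # hypothetical output directory
        print(path)
```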

DfT Road Traffic

Source: https://roadtraffic.dft.gov.uk/downloads

This section provides a curated dataset and processing scripts for road traffic statistics in Great Britain. The data is sourced from the UK Department for Transport's (DfT) public archive, which offers detailed estimates of vehicle traffic volume, classified by vehicle type and road category. The primary functions of this section are to automate the download of these official statistics, clean and standardize the data, and make it readily accessible for analysis—enabling trends in traffic flow, the impact of policy changes, and regional transportation patterns to be explored efficiently.

The following script downloads the CSV files, processes them, and exports them to Parquet files using DuckDB.

./dft-road-traffic.sh

DfT Road Safety - STATS19

Source: STATS19

This section provides an automated pipeline for processing UK Department for Transport (DfT) road safety statistics. The script downloads official road safety data from GOV.UK and converts it from CSV to Parquet format for efficient storage and analysis. The data covers road collisions, casualties, and vehicle information from 1979 to the latest published year.

The pipeline handles three key datasets: collision data (incident circumstances and locations), casualty data (individual injury records and demographics), and vehicle data (vehicle types and involvement details). The conversion to Parquet format significantly reduces file sizes and improves query performance for data analysis.

To use this pipeline, ensure you have bash, wget, and DuckDB installed. Simply run the provided shell script to automatically download the latest data, convert it to Parquet format, and organize the files for analysis. The processed data is ideal for road safety research, traffic analysis, and statistical reporting.

./dft-road-safety.sh

UK Police Open Data

Source: https://data.police.uk/data/archive/

This section contains an automated pipeline for downloading, processing, and converting the last 36 months of data from the UK police public archive. The system programmatically retrieves bulk CSV files for crime, outcomes, and stop-and-search data from the structured monthly archives.

The following script automates the downloading of the last three years of data and its subsequent conversion into a partitioned Parquet format. This process ensures efficient storage and prepares the dataset for high-performance analytics.

./uk-police-data.sh
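
The 36-month window can be computed without any date libraries beyond the standard one. The sketch below lists month labels in YYYY-MM form, oldest first; it assumes the window ends at the month of the reference date, whereas the published archive typically lags behind the current month, and the function name is hypothetical.

```python
from datetime import date
from typing import List, Optional


def last_n_months(n: int, today: Optional[date] = None) -> List[str]:
    """Return the last n months as 'YYYY-MM' labels, oldest first, ending at today's month."""
    today = today or date.today()
    # Count months from year 0 so that month arithmetic becomes plain integer arithmetic.
    index = today.year * 12 + (today.month - 1)
    months = []
    for offset in range(n - 1, -1, -1):
        year, month_zero_based = divmod(index - offset, 12)
        months.append(f"{year:04d}-{month_zero_based + 1:02d}")
    return months


if __name__ == "__main__":
    window = last_n_months(36)
    print(window[0], "to", window[-1])
```

Passing an explicit `today` keeps the result reproducible, which is useful when re-running a pipeline against a fixed snapshot.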

License

For each dataset, please refer to the licence file located in the corresponding directory.


Support

For issues or questions, feel free to create an issue in the repository or contact the maintainer.


Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.
