Hydrology Data Engineering Task
1. Objective
The goal of this task was to build a small, deterministic data engineering pipeline that:
- Connects to the Hydrology Data Explorer API
- Retrieves the 10 most recent readings for two parameters
- Loads the data into a file-based SQLite database
- Implements a simple star schema design
- Executes via a single command

Station used: HIPPER_PARK ROAD BRIDGE_E_202312
2. Solution Overview
I implemented a modular ETL pipeline using Python and SQLite.
Pipeline structure:
Extract → Transform → Load
Architectural Flow
Hydrology API
  ↓ Extract (API validation)
  ↓ Transform (data quality enforcement)
  ↓ Load (star schema in SQLite)
API reference: https://environment.data.gov.uk/hydrology/doc/reference
Project Structure
- extract.py – Handles API calls and response validation
- transform.py – Cleans, validates, and standardises data
- load.py – Creates schema and inserts data
- main.py – Orchestrates pipeline execution
- test_pipeline.py – Validates structural correctness
Execution:
python main.py

3. Database Design
The database follows a simple star schema.
Dimension Tables
- stations – Descriptive station metadata (location, name, river).
- parameters – Parameter metadata, including unit. Uses a surrogate key (parameter_id) to decouple the fact table from descriptive text.
Fact Table
- measurements – Stores measurement events:
  - station_id (FK)
  - parameter_id (FK)
  - value
  - timestamp
Design Decisions
- Surrogate key used for parameters to improve join performance and reduce redundancy.
- Units stored in the parameter dimension to avoid duplication.
- Foreign keys enabled to enforce referential integrity.
- Index created on timestamp to support time series queries.
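The design decisions above can be sketched as SQLite DDL. This is an illustrative reconstruction, not the exact schema from load.py: the column names beyond those listed in the fact table (label, river_name, name, unit) are assumptions.

```python
import sqlite3

# Hypothetical star-schema DDL mirroring the design decisions:
# surrogate key on parameters, units in the dimension, FK references,
# and an index on timestamp for time series queries.
SCHEMA = """
CREATE TABLE IF NOT EXISTS stations (
    station_id TEXT PRIMARY KEY,
    label      TEXT NOT NULL,
    river_name TEXT
);
CREATE TABLE IF NOT EXISTS parameters (
    parameter_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    name         TEXT NOT NULL UNIQUE,
    unit         TEXT NOT NULL                        -- unit lives in the dimension
);
CREATE TABLE IF NOT EXISTS measurements (
    station_id   TEXT    NOT NULL REFERENCES stations(station_id),
    parameter_id INTEGER NOT NULL REFERENCES parameters(parameter_id),
    value        REAL    NOT NULL,
    timestamp    TEXT    NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_measurements_timestamp
    ON measurements(timestamp);
"""

def create_schema(conn: sqlite3.Connection) -> None:
    """Create the star schema with referential integrity enabled."""
    conn.execute("PRAGMA foreign_keys = ON")  # per-connection in SQLite
    conn.executescript(SCHEMA)
```

Note that `PRAGMA foreign_keys` is per-connection in SQLite, so it must be issued on every connection that writes to the database.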
4. Data Quality Controls
Validation is applied at each pipeline stage.
Extraction
- HTTP status validation (fail fast on non-200 responses)
- Explicit _limit and _sort parameters
- Empty dataset validation
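These extraction checks can be sketched roughly as follows. The endpoint path and response shape are assumptions based on the Hydrology API reference linked above; the helper names are hypothetical, not the actual contents of extract.py.

```python
import json
import urllib.parse
import urllib.request

# Assumed base path for readings, per the Hydrology API reference.
BASE = "https://environment.data.gov.uk/hydrology/id/measures"

def build_readings_url(measure_id: str, limit: int = 10) -> str:
    """Build a readings URL with explicit _limit and _sort parameters."""
    query = urllib.parse.urlencode({"_sort": "-dateTime", "_limit": limit})
    return f"{BASE}/{measure_id}/readings?{query}"

def fetch_readings(measure_id: str, limit: int = 10) -> list:
    """Fetch readings, failing fast on non-200 responses or empty payloads."""
    with urllib.request.urlopen(build_readings_url(measure_id, limit)) as resp:
        if resp.status != 200:
            raise RuntimeError(f"API returned HTTP {resp.status}")
        payload = json.loads(resp.read())
    items = payload.get("items", [])
    if not items:
        raise ValueError("API returned an empty dataset")
    return items
```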
Transformation
- Required field validation (timestamp and value)
- Numeric type enforcement
- ISO 8601 timestamp standardisation
- Deterministic sorting
- Exactly 10 most recent valid records retained per parameter
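A minimal sketch of the transformation rules above, assuming readings carry `dateTime` and `value` fields as in the Hydrology API; this is illustrative, not the actual transform.py.

```python
from datetime import datetime, timezone

def transform(readings: list, n: int = 10) -> list:
    """Drop records missing required fields or with non-numeric values,
    standardise timestamps to ISO 8601 UTC, sort deterministically,
    and keep the n most recent."""
    valid = []
    for r in readings:
        ts, value = r.get("dateTime"), r.get("value")
        if ts is None or value is None:
            continue  # required field validation
        try:
            value = float(value)  # numeric type enforcement
            parsed = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        except (TypeError, ValueError):
            continue  # reject unparseable values/timestamps
        valid.append({
            "timestamp": parsed.astimezone(timezone.utc).isoformat(),
            "value": value,
        })
    valid.sort(key=lambda rec: rec["timestamp"], reverse=True)  # deterministic
    return valid[:n]
```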
Load
- Foreign key enforcement (PRAGMA foreign_keys = ON)
- INSERT OR IGNORE for dimension idempotency
- NOT NULL constraints for required fields
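The load controls can be sketched as below, assuming the star-schema tables already exist; column names and the function signature are illustrative, not the actual load.py.

```python
import sqlite3

def load_measurements(conn: sqlite3.Connection, station_id: str,
                      station_label: str, parameter: str, unit: str,
                      records: list) -> None:
    """Insert dimension rows idempotently, then append fact rows."""
    conn.execute("PRAGMA foreign_keys = ON")  # enforce referential integrity
    # INSERT OR IGNORE keeps dimension loads idempotent on re-runs.
    conn.execute(
        "INSERT OR IGNORE INTO stations (station_id, label) VALUES (?, ?)",
        (station_id, station_label),
    )
    conn.execute(
        "INSERT OR IGNORE INTO parameters (name, unit) VALUES (?, ?)",
        (parameter, unit),
    )
    # Resolve the surrogate key before writing fact rows.
    parameter_id = conn.execute(
        "SELECT parameter_id FROM parameters WHERE name = ?", (parameter,)
    ).fetchone()[0]
    conn.executemany(
        "INSERT INTO measurements (station_id, parameter_id, value, timestamp) "
        "VALUES (?, ?, ?, ?)",
        [(station_id, parameter_id, r["value"], r["timestamp"]) for r in records],
    )
    conn.commit()
```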
5. Testing
The test suite validates:
- Database creation
- Exactly 20 measurements inserted (10 per parameter)
- Presence of both parameters in the dimension table
- Schema structure and constraints
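The count checks above can be expressed roughly as follows; this is a sketch, not the actual test_pipeline.py, and the helper name is hypothetical (in a real pytest suite the connection would come from a fixture).

```python
import sqlite3

def assert_structural_invariants(conn: sqlite3.Connection) -> None:
    """Check 20 fact rows in total, exactly 10 per parameter."""
    rows = conn.execute(
        "SELECT parameter_id, COUNT(*) FROM measurements GROUP BY parameter_id"
    ).fetchall()
    assert len(rows) == 2, "expected two parameters in the dimension"
    assert all(count == 10 for _, count in rows)
    total = conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]
    assert total == 20
```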
Run tests:
pytest

6. How to Run
Install dependencies:
pip install -r requirements.txt
Run pipeline:
python main.py
Run tests:
pytest

7. Example Output
Pipeline execution:
Database schema:
Test results:
8. Scope and Extensibility
The solution was designed within the 4-hour preparation guidance and prioritises clarity, determinism, and data integrity.
The modular structure allows:
- Easy addition of new parameters
- Extension to additional stations
- Migration from SQLite to a managed database (e.g., Azure SQL or Snowflake)
- Orchestration via Airflow or Azure Data Factory