# Hydrology Data Engineering Task

## 1. Objective

The goal of this task was to build a small, deterministic data engineering pipeline that:

- Connects to the Hydrology Data Explorer API
- Retrieves the 10 most recent readings for two parameters
- Loads the data into a file-based SQLite database
- Implements a simple star schema design
- Executes via a single command

Station used: `HIPPER_PARK ROAD BRIDGE_E_202312`

## 2. Solution Overview

I implemented a modular ETL pipeline using Python and SQLite.

Pipeline structure: Extract → Transform → Load

### Architectural Flow

```
Hydrology API
    ↓
Extract (API validation)
    ↓
Transform (data quality enforcement)
    ↓
Load (star schema in SQLite)
```

API reference: https://environment.data.gov.uk/hydrology/doc/reference

### Project Structure

- `extract.py` – Handles API calls and response validation
- `transform.py` – Cleans, validates, and standardises data
- `load.py` – Creates the schema and inserts data
- `main.py` – Orchestrates pipeline execution (see the sketch below)
- `test_pipeline.py` – Validates structural correctness
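A minimal sketch of how `main.py` could wire the stages together. The function names, parameter identifiers, and database path here are illustrative assumptions, not the repository's exact signatures:

```python
# main.py – illustrative orchestration sketch; names are assumptions
from extract import extract_readings
from transform import transform_readings
from load import load_readings

PARAMETERS = ["level", "flow"]  # hypothetical parameter identifiers


def main() -> None:
    for parameter in PARAMETERS:
        raw = extract_readings(parameter)       # call the API, fail fast on errors
        clean = transform_readings(raw)         # enforce data quality rules
        load_readings(clean, db_path="hydrology.db")  # insert into the star schema


if __name__ == "__main__":
    main()
```

Running `python main.py` then performs extract, transform, and load for each parameter in a single deterministic pass.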

Execution:

```
python main.py
```

## 3. Database Design

The database follows a simple star schema.

### Dimension Tables

- `stations` – Descriptive station metadata (location, name, river).
- `parameters` – Parameter metadata, including unit. Uses a surrogate key (`parameter_id`) to decouple the fact table from descriptive text.

### Fact Table

- `measurements` – Stores one row per reading:
  - `station_id` (FK)
  - `parameter_id` (FK)
  - `value`
  - `timestamp`

### Design Decisions

- A surrogate key is used for `parameters` to improve join performance and reduce redundancy.
- Units are stored in the parameter dimension to avoid duplication.
- Foreign keys are enabled to enforce referential integrity.
- An index is created on `timestamp` to support time-series queries.

These decisions are reflected in the DDL sketch below.
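A sketch of the DDL these decisions imply, executed through Python's built-in `sqlite3` module. Column names not listed above (e.g. `river_name`, `unit`) are assumptions:

```python
import sqlite3

# Illustrative star-schema DDL; non-listed column names are assumptions.
DDL = """
PRAGMA foreign_keys = ON;

CREATE TABLE IF NOT EXISTS stations (
    station_id   TEXT PRIMARY KEY,
    name         TEXT NOT NULL,
    river_name   TEXT,
    location     TEXT
);

CREATE TABLE IF NOT EXISTS parameters (
    parameter_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    name         TEXT NOT NULL UNIQUE,
    unit         TEXT NOT NULL
);

CREATE TABLE IF NOT EXISTS measurements (
    station_id   TEXT NOT NULL REFERENCES stations(station_id),
    parameter_id INTEGER NOT NULL REFERENCES parameters(parameter_id),
    value        REAL NOT NULL,
    timestamp    TEXT NOT NULL
);

-- Index on timestamp to support time-series queries.
CREATE INDEX IF NOT EXISTS idx_measurements_timestamp
    ON measurements (timestamp);
"""

with sqlite3.connect("hydrology.db") as conn:
    conn.executescript(DDL)
```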

## 4. Data Quality Controls

Validation is applied at each pipeline stage; an illustrative sketch follows each stage's checklist below.

### Extraction

- HTTP status validation (fail fast on non-200 responses)
- Explicit `_limit` and `_sort` parameters
- Empty-dataset validation
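A minimal extraction sketch. The readings URL shape, the `-dateTime` sort key, and the `items` response key follow my reading of the Hydrology API reference, but treat them as assumptions to verify against the live API:

```python
import requests

BASE_URL = "https://environment.data.gov.uk/hydrology"


def extract_readings(measure_id: str, limit: int = 10) -> list[dict]:
    """Fetch the most recent readings for one measure, failing fast on errors."""
    url = f"{BASE_URL}/id/measures/{measure_id}/readings"
    response = requests.get(
        url,
        params={"_sort": "-dateTime", "_limit": limit},  # newest first, capped
        timeout=30,
    )
    if response.status_code != 200:  # fail fast on non-200 responses
        raise RuntimeError(f"API returned HTTP {response.status_code} for {url}")
    items = response.json().get("items", [])
    if not items:  # empty-dataset validation
        raise ValueError(f"No readings returned for measure {measure_id}")
    return items
```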

### Transformation

- Required-field validation (timestamp and value)
- Numeric type enforcement
- ISO timestamp standardisation
- Deterministic sorting
- Exactly the 10 most recent valid records per parameter
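A sketch of the transformation rules above; the `dateTime` and `value` field names are assumptions about the API payload:

```python
from datetime import datetime


def transform_readings(raw: list[dict], keep: int = 10) -> list[dict]:
    """Validate, standardise, and deterministically select the newest readings."""
    clean = []
    for row in raw:
        ts, value = row.get("dateTime"), row.get("value")  # field names assumed
        if ts is None or value is None:  # required-field validation
            continue
        try:
            value = float(value)  # numeric type enforcement
            # ISO 8601 standardisation; normalise a trailing 'Z' for fromisoformat.
            ts = datetime.fromisoformat(ts.replace("Z", "+00:00")).isoformat()
        except (TypeError, ValueError):
            continue  # drop malformed rows rather than fail the batch
    # Deterministic ordering: newest first, value as a tie-breaker.
        clean.append({"timestamp": ts, "value": value})
    clean.sort(key=lambda r: (r["timestamp"], r["value"]), reverse=True)
    return clean[:keep]  # exactly the N most recent valid records
```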

### Load

- Foreign-key enforcement (`PRAGMA foreign_keys = ON`)
- `INSERT OR IGNORE` for dimension idempotency
- `NOT NULL` constraints on required fields
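A load-step sketch under the same table and column assumptions as the DDL above:

```python
import sqlite3


def load_readings(conn: sqlite3.Connection, station: dict, parameter: dict,
                  readings: list[dict]) -> None:
    """Insert one batch of readings, keeping dimension inserts idempotent."""
    conn.execute("PRAGMA foreign_keys = ON")  # enforce referential integrity
    # INSERT OR IGNORE makes re-runs idempotent for dimension rows.
    conn.execute(
        "INSERT OR IGNORE INTO stations (station_id, name) VALUES (?, ?)",
        (station["id"], station["name"]),
    )
    conn.execute(
        "INSERT OR IGNORE INTO parameters (name, unit) VALUES (?, ?)",
        (parameter["name"], parameter["unit"]),
    )
    # Resolve the surrogate key before writing facts.
    parameter_id = conn.execute(
        "SELECT parameter_id FROM parameters WHERE name = ?",
        (parameter["name"],),
    ).fetchone()[0]
    conn.executemany(
        "INSERT INTO measurements (station_id, parameter_id, value, timestamp) "
        "VALUES (?, ?, ?, ?)",
        [(station["id"], parameter_id, r["value"], r["timestamp"])
         for r in readings],
    )
    conn.commit()
```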

## 5. Testing

The test suite validates:

- Database creation
- Exactly 20 measurements inserted (10 per parameter)
- Presence of both parameters in the dimension table
- Schema structure and constraints
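For illustration, the core count check could look like this (assuming the database file is named `hydrology.db`):

```python
import sqlite3


def test_measurement_counts():
    """Exactly 20 measurements in total, 10 per parameter."""
    with sqlite3.connect("hydrology.db") as conn:
        total = conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]
        assert total == 20
        per_parameter = conn.execute(
            "SELECT parameter_id, COUNT(*) FROM measurements "
            "GROUP BY parameter_id"
        ).fetchall()
    assert len(per_parameter) == 2  # both parameters present
    assert all(count == 10 for _, count in per_parameter)
```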

Run tests:

```
pytest
```

## 6. How to Run

Install dependencies:

```
pip install -r requirements.txt
```

Run the pipeline:

```
python main.py
```

Run tests:

```
pytest
```

## 7. Example Output

Screenshots of the pipeline execution, the resulting database schema, and the test results accompany the repository.

## 8. Scope and Extensibility

The solution was designed within the 4-hour preparation guidance and prioritises clarity, determinism, and data integrity.

The modular structure allows:

- Easy addition of new parameters
- Extension to additional stations
- Migration from SQLite to a managed database (e.g., Azure SQL or Snowflake)
- Orchestration via Airflow or Azure Data Factory
