Hydrology Data Engineering Task
1. Objective
The goal of this task was to build a small, deterministic data engineering pipeline that:
- Connects to the Hydrology Data Explorer API
- Retrieves the 10 most recent readings for two parameters
- Loads the data into a file-based SQLite database
- Implements a simple star schema design
- Executes via a single command

Station used: HIPPER_PARK ROAD BRIDGE_E_202312
2. Solution Overview
I implemented a modular ETL pipeline using Python and SQLite.
Pipeline structure:
Extract → Transform → Load
Architectural Flow
Hydrology API
  ↓ Extract (API validation)
  ↓ Transform (data quality enforcement)
  ↓ Load (star schema in SQLite)
API reference: https://environment.data.gov.uk/hydrology/doc/reference
Project Structure
- extract.py – Handles API calls and response validation
- transform.py – Cleans, validates, and standardises data
- load.py – Creates schema and inserts data
- main.py – Orchestrates pipeline execution
- test_pipeline.py – Validates structural correctness
Execution:
python main.py

3. Database Design
The database follows a simple star schema.
Dimension Tables
- stations – Descriptive station metadata (location, name, river).
- parameters – Parameter metadata, including unit. Uses a surrogate key (parameter_id) to decouple the fact table from descriptive text.
Fact Table
- measurements – Stores measurement events:
  - station_id (FK)
  - parameter_id (FK)
  - value
  - timestamp
Design Decisions
- Surrogate key used for parameters to improve join performance and reduce redundancy.
- Units stored in the parameter dimension to avoid duplication.
- Foreign keys enabled to enforce referential integrity.
- Index created on timestamp to support time series queries.
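The design decisions above can be sketched as SQLite DDL. This is an illustrative reconstruction, not the exact schema from load.py: the column names beyond those listed in the fact table (label, river_name, name, unit) are assumptions.

```python
import sqlite3

# Hypothetical star-schema DDL mirroring the design decisions:
# surrogate key on parameters, units in the dimension, FK references,
# and an index on timestamp for time series queries.
SCHEMA = """
CREATE TABLE IF NOT EXISTS stations (
    station_id TEXT PRIMARY KEY,
    label      TEXT NOT NULL,
    river_name TEXT
);
CREATE TABLE IF NOT EXISTS parameters (
    parameter_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    name         TEXT NOT NULL UNIQUE,
    unit         TEXT NOT NULL                        -- unit lives in the dimension
);
CREATE TABLE IF NOT EXISTS measurements (
    station_id   TEXT    NOT NULL REFERENCES stations(station_id),
    parameter_id INTEGER NOT NULL REFERENCES parameters(parameter_id),
    value        REAL    NOT NULL,
    timestamp    TEXT    NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_measurements_timestamp
    ON measurements(timestamp);
"""

def create_schema(conn: sqlite3.Connection) -> None:
    """Create the star schema with referential integrity enabled."""
    conn.execute("PRAGMA foreign_keys = ON")  # per-connection in SQLite
    conn.executescript(SCHEMA)
```

Note that `PRAGMA foreign_keys` is per-connection in SQLite, so it must be issued on every connection that writes to the database.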
4. Data Quality Controls
Validation is applied at each pipeline stage.
Extraction
- HTTP status validation (fail fast on non-200 responses)
- Explicit _limit and _sort parameters
- Empty dataset validation
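These extraction checks can be sketched roughly as follows. The endpoint path and response shape are assumptions based on the Hydrology API reference linked above; the helper names are hypothetical, not the actual contents of extract.py.

```python
import json
import urllib.parse
import urllib.request

# Assumed base path for readings, per the Hydrology API reference.
BASE = "https://environment.data.gov.uk/hydrology/id/measures"

def build_readings_url(measure_id: str, limit: int = 10) -> str:
    """Build a readings URL with explicit _limit and _sort parameters."""
    query = urllib.parse.urlencode({"_sort": "-dateTime", "_limit": limit})
    return f"{BASE}/{measure_id}/readings?{query}"

def fetch_readings(measure_id: str, limit: int = 10) -> list:
    """Fetch readings, failing fast on non-200 responses or empty payloads."""
    with urllib.request.urlopen(build_readings_url(measure_id, limit)) as resp:
        if resp.status != 200:
            raise RuntimeError(f"API returned HTTP {resp.status}")
        payload = json.loads(resp.read())
    items = payload.get("items", [])
    if not items:
        raise ValueError("API returned an empty dataset")
    return items
```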
Transformation
- Required field validation (timestamp and value)
- Numeric type enforcement
- ISO 8601 timestamp standardisation
- Deterministic sorting
- Exactly 10 most recent valid records retained per parameter
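A minimal sketch of the transformation rules above, assuming readings carry `dateTime` and `value` fields as in the Hydrology API; this is illustrative, not the actual transform.py.

```python
from datetime import datetime, timezone

def transform(readings: list, n: int = 10) -> list:
    """Drop records missing required fields or with non-numeric values,
    standardise timestamps to ISO 8601 UTC, sort deterministically,
    and keep the n most recent."""
    valid = []
    for r in readings:
        ts, value = r.get("dateTime"), r.get("value")
        if ts is None or value is None:
            continue  # required field validation
        try:
            value = float(value)  # numeric type enforcement
            parsed = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        except (TypeError, ValueError):
            continue  # reject unparseable values/timestamps
        valid.append({
            "timestamp": parsed.astimezone(timezone.utc).isoformat(),
            "value": value,
        })
    valid.sort(key=lambda rec: rec["timestamp"], reverse=True)  # deterministic
    return valid[:n]
```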
Load
- Foreign key enforcement (PRAGMA foreign_keys = ON)
- INSERT OR IGNORE for dimension idempotency
- NOT NULL constraints for required fields
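The load controls can be sketched as below, assuming the star-schema tables already exist; column names and the function signature are illustrative, not the actual load.py.

```python
import sqlite3

def load_measurements(conn: sqlite3.Connection, station_id: str,
                      station_label: str, parameter: str, unit: str,
                      records: list) -> None:
    """Insert dimension rows idempotently, then append fact rows."""
    conn.execute("PRAGMA foreign_keys = ON")  # enforce referential integrity
    # INSERT OR IGNORE keeps dimension loads idempotent on re-runs.
    conn.execute(
        "INSERT OR IGNORE INTO stations (station_id, label) VALUES (?, ?)",
        (station_id, station_label),
    )
    conn.execute(
        "INSERT OR IGNORE INTO parameters (name, unit) VALUES (?, ?)",
        (parameter, unit),
    )
    # Resolve the surrogate key before writing fact rows.
    parameter_id = conn.execute(
        "SELECT parameter_id FROM parameters WHERE name = ?", (parameter,)
    ).fetchone()[0]
    conn.executemany(
        "INSERT INTO measurements (station_id, parameter_id, value, timestamp) "
        "VALUES (?, ?, ?, ?)",
        [(station_id, parameter_id, r["value"], r["timestamp"]) for r in records],
    )
    conn.commit()
```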
5. Testing
The test suite validates:
- Database creation
- Exactly 20 measurements inserted (10 per parameter)
- Presence of both parameters in the dimension table
- Schema structure and constraints
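The count checks above can be expressed roughly as follows; this is a sketch, not the actual test_pipeline.py, and the helper name is hypothetical (in a real pytest suite the connection would come from a fixture).

```python
import sqlite3

def assert_structural_invariants(conn: sqlite3.Connection) -> None:
    """Check 20 fact rows in total, exactly 10 per parameter."""
    rows = conn.execute(
        "SELECT parameter_id, COUNT(*) FROM measurements GROUP BY parameter_id"
    ).fetchall()
    assert len(rows) == 2, "expected two parameters in the dimension"
    assert all(count == 10 for _, count in rows)
    total = conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]
    assert total == 20
```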
Run tests:
pytest

6. How to Run
Install dependencies:
pip install -r requirements.txt
Run pipeline:
python main.py
Run tests:
pytest

7. Example Output
Pipeline execution:
Database schema:
Test results:
8. Scope and Extensibility
The solution was designed within the 4-hour preparation guidance and prioritises clarity, determinism, and data integrity.
The modular structure allows:
- Easy addition of new parameters
- Extension to additional stations
- Migration from SQLite to a managed database (e.g., Azure SQL or Snowflake)
- Orchestration via Airflow or Azure Data Factory