
ETL_pipeline_for_Charity

A. Project Overview

This repository demonstrates a complete Extract–Transform–Load (ETL) pipeline built in Python that ingests raw charity data from web sources, cleans and validates it, and outputs structured datasets ready for analytics. The goal is to turn messy, inconsistent data into reliable, analysis-ready tables that power dashboards, segmentation, or modelling.

Rather than ad hoc scripting, this project treats data preparation as a disciplined, systemised workflow. Each stage is designed to be repeatable, transparent, and easy to maintain — essential traits for any analytics environment that aims for trust and continuity.

Although implementations vary across organisations, these principles apply broadly to most data analytics environments.

B. System Architecture Diagram

This ETL pipeline is organised into clear functional blocks:

Raw Web Source Data
        ↓
Python Extraction Scripts (requests + BeautifulSoup / API)
        ↓
Raw Staging (CSV / Intermediate)
        ↓
Cleaning & Validation (pandas + custom logic)
        ↓
Analytics-Ready Output (CSV / Data Warehouse)
        ↓
Analysis / Dashboards / Reports

This architectural separation supports:

pipeline ownership

stage-wise validation

easier debugging

reuse across domains
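
The stage separation above can be sketched end to end as a set of small, single-purpose functions. This is a minimal illustration only: the function names, columns, and sample values are invented for the example and are not the repository's actual API.

```python
import pandas as pd

# Hypothetical end-to-end sketch of the stage separation above; the
# function names, columns, and sample values are illustrative only.
def extract() -> pd.DataFrame:
    # Stand-in for the real web extraction: a tiny "raw" frame.
    return pd.DataFrame({"charity_name": ["  Oxfam", None],
                         "founded_year": ["1942", "n/a"]})

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    clean = raw.copy()
    clean["charity_name"] = clean["charity_name"].str.strip().fillna("Unknown")
    clean["founded_year"] = pd.to_numeric(clean["founded_year"], errors="coerce")
    return clean

def load(clean: pd.DataFrame, path: str) -> None:
    clean.to_csv(path, index=False)

cleaned = transform(extract())
load(cleaned, "charity_data_cleaned.csv")
```

Because each stage takes a DataFrame in and returns a DataFrame out, stages can be tested, swapped, or reused independently, which is exactly what the ownership and debugging points above rely on.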

C. Step-by-Step Workflow Explanation

Step 1: Extract — Collecting Raw Data

The first phase uses Python to pull raw data from publicly accessible web pages or APIs. It handles:

HTML parsing via BeautifulSoup

HTTP requests with retries and error handling

capturing relevant fields while preserving context

Example snippet:

```python
import requests
from bs4 import BeautifulSoup

response = requests.get(source_url)
soup = BeautifulSoup(response.text, "html.parser")

entries = soup.find_all("div", class_="entry")
```

The goal of this step is comprehensive capture, not cleaning.
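
The retries and error handling mentioned above can be sketched with a configured `requests` session. The retry counts, status codes, and timeout are placeholder values, not the repository's actual settings.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Sketch of retrying transient failures; the retry budget, backoff,
# status codes, and timeout below are illustrative placeholders.
session = requests.Session()
retries = Retry(total=3, backoff_factor=0.5,
                status_forcelist=[429, 500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))

def fetch(url: str) -> str:
    response = session.get(url, timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx instead of failing silently
    return response.text
```

Mounting the adapter on the session means every `session.get` call inherits the retry policy, so individual extraction functions stay free of retry boilerplate.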

Step 2: Transform — Cleaning & Standardisation

Once data is captured, the pipeline moves into transformation. This stage is essential because raw data is rarely consistent or analysis-ready.

Key activities include:

Missing data handling

Type standardisation

Inconsistent value harmonisation

Filtering out irrelevant records

Example cleaning rule:

```python
df["charity_name"] = df["charity_name"].str.strip().fillna("Unknown")
df["founded_year"] = pd.to_numeric(df["founded_year"], errors="coerce")
```

Rather than silently fixing issues, this transformation makes assumptions explicit and auditable.
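
The "inconsistent value harmonisation" activity listed above can be made explicit in the same auditable style. The category labels and the canonical map below are invented for the example, not taken from the repository's data.

```python
import pandas as pd

# Illustrative harmonisation rule; the labels and the canonical map
# are invented for this example, not taken from the repository's data.
df = pd.DataFrame({"category": ["Health", "health ", "HEALTH", "Educ."]})

canonical = {"health": "Health", "educ.": "Education"}
normalised = df["category"].str.strip().str.lower().map(canonical)
df["category"] = normalised.fillna(df["category"])  # keep unmapped values as-is
```

Keeping the mapping in one named dictionary is what makes the assumption auditable: a reviewer can read exactly which variants collapse to which canonical label.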

Step 3: Load — Writing Structured Outputs

After cleaning, the pipeline writes structured, consistent tables to CSV or a destination database. This output serves as a dependable input for:

dashboards

SQL analytics

machine learning models

Example:

```python
df.to_csv("charity_data_cleaned.csv", index=False)
```

This structured dataset is the product of the pipeline — not an intermediate convenience.
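
For the database destination mentioned above, the same DataFrame can be written with `DataFrame.to_sql`. This sketch uses an in-memory SQLite database; the table name and sample row are placeholders.

```python
import sqlite3
import pandas as pd

# Sketch of loading into a destination database; the table name,
# in-memory database, and sample row are placeholders.
df = pd.DataFrame({"charity_name": ["Oxfam"], "founded_year": [1942]})

conn = sqlite3.connect(":memory:")
df.to_sql("charities", conn, if_exists="replace", index=False)
row_count = conn.execute("SELECT COUNT(*) FROM charities").fetchone()[0]
conn.close()
```

Swapping the connection for a production database (via SQLAlchemy) changes nothing else in the load step, which keeps CSV and warehouse outputs interchangeable.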

Step 4: Prepare for Analytics

The final output is designed so analysts can use it immediately without further ad hoc cleaning. It supports:

segmentation

trend analysis

text analytics

KPI evaluation

This makes the pipeline more than an ETL job: it is a data-to-insight enabler.
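
As one illustration of the segmentation use case above, the cleaned table can be bucketed directly. The column names follow the earlier cleaning examples; the rows are invented sample data.

```python
import pandas as pd

# Hypothetical segmentation on the cleaned output; column names follow
# the earlier cleaning examples, the rows are invented sample data.
df = pd.DataFrame({
    "charity_name": ["A", "B", "C", "D"],
    "founded_year": [1942, 1985, 1991, 2003],
})

# Bucket charities by founding decade as a simple segmentation axis.
df["decade"] = (df["founded_year"] // 10) * 10
segment_counts = df.groupby("decade")["charity_name"].count()
```

Because the output is already typed and consistent, this kind of analysis needs no preliminary cleanup, which is the point of Step 4.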

D. Why This Matters

Reducing Manual Work

Before pipelines like this, analysts often handled extraction, cleaning, and transformation manually in spreadsheets or one-off scripts. This pipeline automates those repetitive tasks, freeing time for interpretation and insight.

Enabling Better Decisions

Clean, structured data improves the reliability of analytical products — dashboards, models, and reports — which in turn supports better decisions about:

where to allocate resources

how to prioritise strategic initiatives

what patterns signal emerging issues

For nonprofit and mission-driven teams, this translates into more effective operations without increasing costs.

Innovation Beyond Routine

This pipeline demonstrates how deliberate engineering discipline transforms messy data into a dependable foundation for analytics. It shows that repeatable processes — not isolated scripts — make analytics sustainable.

E. Reflection & Learnings

Building this ETL pipeline underscored a fundamental truth: data doesn’t become valuable until it’s structured and dependable.

Some key learnings include:

Pipeline architecture matters. Separating extraction, transformation, and loading clarifies ownership and simplifies troubleshooting.

Making assumptions explicit is critical. When logic is hidden in scripts, it’s fragile; when it’s expressed as code, it becomes auditable and reusable.

Analytics-ready outputs reduce downstream friction. Analysts spend more time interpreting patterns and less time fixing quirks.

From a leadership perspective, this project shows a shift from ad hoc analysis toward engineering analytics capability. It reflects the mindset that analytics outputs are not episodic artefacts, but part of an ongoing system that others depend on.

For analysts, the key takeaway is to design ETL pipelines as living assets — reusable, transparent, and aligned with decision needs.

How to Use This Repository

Clone the repository:

```shell
git clone https://github.com/Kaviya-Mahendran/ETL_pipeline_for_Charity
```

Install dependencies:

```shell
pip install -r requirements.txt
```

Run the ETL script:

```shell
python run_etl_pipeline.py
```

Find your cleaned, structured outputs in output/.

Use those files for analysis, BI dashboards, or further modelling.

Final Note

This repository is intentionally designed not as a one-off solution, but as a foundation for responsible, scalable data workflows. It reflects a discipline where data preparation is not an afterthought, but the bedrock of reliable analytics.

About

Complete ETL pipeline that merges donations, supporter records, and event data into a unified analytics-ready schema with automated validation.
