Sales Analytics Pipeline

Project Overview

This project implements a production-style sales analytics pipeline using a single Excel file as source.
It ingests, cleans, transforms, and loads data into PostgreSQL using Python ETL, applies analytics-ready modeling with dbt, and orchestrates the workflow with Apache Airflow.

Key Engineering Principles Applied:

Modular code and package structure
Clear separation of concerns: ETL vs transformation vs orchestration
Explicit data quality handling (flagging missing or inconsistent data)
Environment-based configuration (.env)
Analytics-ready fact and dimension models

Architecture Overview

graph TD
    A[Excel Source File] --> B[Python ETL: Extract → Transform → Load]
    B --> C[PostgreSQL: Raw Schema]
    C --> D[dbt: Staging Models]
    D --> E[dbt: Mart Models]
    E --> F[Apache Airflow]
    B -. Data Quality Flags .-> C
    D -. Tests / Validation .-> E

Layer Responsibilities:

Layer	Responsibility
Python ETL	Reads Excel sheets, applies transformations, flags missing/returned data, loads raw tables
dbt Staging	Cleans and standardizes raw tables, applies derived flags and types
dbt Mart	Aggregates data for analytics, computes business metrics
Airflow	Orchestrates ETL → dbt workflow, retries, alerts

Source Data & Schema Design

All source data is stored in a single Excel file: data/sales_data.xlsx.

Sheet Name	Type	Description
Transactions	FACT	Sales transactions (~3,500 rows)
Products	DIM	Product SKUs and categories
Distributors	DIM	Distributor accounts and regions
Salespersons	DIM	Field reps and targets
Monthly_Targets	FACT	Target vs actual sales
Date_Table	DIM	Calendar dimension

Known Data Quality Issues:

~2% of distributor_id in Transactions are NULL → flagged, not dropped
transaction_status may contain Returned → excluded from revenue
monthly_targets.achievement_pct is computed during transform
Mixed NULLs and strings in notes column
Column names inconsistent

PostgreSQL Schemas:

Schema	Owner	Purpose
raw	Python ETL	Raw ingestion
staging	dbt	Cleaned & standardized data
mart	dbt	Business-ready aggregations

Python ETL writes only to raw, dbt writes to staging and mart schemas.

ETL Pipeline Logic

Python ETL

Modules in src/:

extract.py → Reads all sheets, normalizes columns
transform.py → Applies business rules, flags missing/returned data, computes metrics
load.py → Writes data to PostgreSQL raw schema
db.py → Database connection using SQLAlchemy
utils.py → Helper functions
main.py → Orchestrates ETL sequence

Run locally:

python src/main.py

dbt Transformations

Staging Layer → Cleans raw data, applies derived flags, enforces types
Mart Layer → Aggregates revenue, computes performance metrics, prepares analytics tables

cd sales_dbt
dbt run
dbt test

Example SQL (Mart Layer):

{{ config(materialized='table') }}

SELECT
    COALESCE(d.region, 'Unknown_Area') AS region,

    COUNT(*) FILTER (
        WHERE t.is_missing_distributor = TRUE
    ) AS missing_distributor_count,

    ROUND(
        SUM(t.revenue_ngn) FILTER (
            WHERE t.is_missing_distributor = FALSE
        )
    ) AS distributor_revenue_ngn,

    ROUND(
        SUM(t.revenue_ngn) FILTER (
            WHERE t.is_missing_distributor = TRUE
        )
    ) AS unattributed_revenue_ngn,

    ROUND(SUM(t.revenue_ngn)) AS total_revenue_ngn

FROM {{ ref('stg_transactions') }} t
LEFT JOIN raw.raw_distributors d
    ON t.distributor_id = d.distributor_id
GROUP BY 1
ORDER BY total_revenue_ngn DESC

Apache Airflow DAG Orchestration

Location: airflow/dags/sales_pipeline_dag.py Purpose: Orchestrates ETL → dbt pipeline, retries tasks, alerts on failure Features:

PythonOperator → Runs ETL (main.py)
BashOperator → Runs dbt transformations & tests
Retry logic: 3 retries, 5 min delay
Failure alerts via email
DAG scheduling: Daily, catchup=False

Airflow Setup (WSL/Linux Recommended)

Key Steps I followed:

Removed broken/partial installations:

rm -rf ~/airflow_runtime

Created a clean WSL directory for Airflow:

mkdir ~/airflow_runtime
cd ~/airflow_runtime
python3.10 -m venv venv_airflow
source venv_airflow/bin/activate

Installed Airflow with official constraints:

pip install "apache-airflow==2.9.3" \
  --constraint https://raw.githubusercontent.com/apache/airflow/constraints-2.9.3/constraints-3.10.txt

Set AIRFLOW_HOME and initialized the metadata DB:

export AIRFLOW_HOME=~/airflow_runtime/airflow
airflow db init

Encountered issues like no such table: dag and slot_pool due to partial DB creation, resolved by resetting the DB:

airflow db reset

Copied DAGs from local project (avoiding OneDrive paths):

cp "/mnt/c/Users/TGOPS/OneDrive - Tolaram Pte Ltd/Project compilation/sales_analytics_pipeline/airflow/dags/sales_pipeline.py" ~/airflow_runtime/airflow/dags/

Started Airflow standalone:

airflow standalone

Verified DAGs:

airflow dags list
airflow dags list-import-errors

Ensured all Python dependencies (pandas, psycopg2-binary, dbt-core) were installed in the venv to avoid import errors.

Note: All Airflow runtime files (venv_airflow/, airflow.db, logs/) are environment-specific and not pushed to GitHub; only DAGs and helper modules are tracked.

Note: Airflow uses Python 3.10 to avoid conflicts; ETL/dbt run on Python 3.11+.

Business Questions & Queries

Top 5 products by revenue in 2024 → stg_transactions + stg_products
Region with highest MoM revenue growth in Q3 2024 → stg_transactions + stg_distributors
Average achievement % per salesperson → stg_monthly_targets + stg_salespersons
Distributor with highest return rate → stg_transactions + stg_distributors
Rolling 3-month revenue trend by product category → stg_transactions + stg_products

Running the Project Locally

1. Main ETL + dbt

# Create virtual environment
python -m venv venv
# Activate venv
# Windows
venv\Scripts\activate
# Linux/Mac
source venv/bin/activate

pip install --upgrade pip
pip install -r requirements.txt

# Run ETL
python src/main.py

# Run dbt models
cd sales_dbt
dbt run
dbt test

2. Airflow Environment

# Python 3.10 virtual environment for Airflow
# Activate Airflow venv
source ~/airflow_runtime/venv_airflow/bin/activate
# Windows
airflow\venv_airflow\Scripts\activate
# Linux/Mac
source airflow/venv_airflow/bin/activate

pip install --upgrade pip
pip install -r requirement.txt

# Start Airflow
cd airflow
airflow standalone

#Check DAGs
airflow dags list

Key Engineering Highlights

Column normalization at ingestion → Consistent data for transformations
Data quality via flags → No rows dropped unnecessarily
Separation of concerns → ETL → dbt → Airflow
Environment-based configuration → .env for DB credentials
Production-ready repo structure → Clean, modular, reproducible

Folder Structure

sales-analytics-pipeline/
├── .env                   # DB credentials
├── data/                  # Excel source file
├── src/                   # Python ETL
│   ├── extract.py
│   ├── transform.py
│   ├── load.py
│   ├── db.py
│   └── utils.py
├── sales_dbt/             # dbt models
│   ├── models/
│   │   ├── staging/
│   │   └── marts/
│   └── dbt_project.yml
├── airflow/
│   ├── dags/
│   │   └── sales_pipeline_dag.py
│   └── venv_airflow/
├── requirements.txt       # ETL + dbt dependencies
├── main.py
└── README.md

Business Value

Preserves data integrity by flagging missing distributors
Provides analytics-ready models for BI dashboards
Enables transparent revenue reporting, MoM growth analysis, and KPI tracking
Demonstrates best practices in data engineering suitable for production and interviews

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Sales Analytics Pipeline

Project Overview

Architecture Overview

Source Data & Schema Design

ETL Pipeline Logic

Python ETL

dbt Transformations

Apache Airflow DAG Orchestration

Airflow Setup (WSL/Linux Recommended)

Business Questions & Queries

Running the Project Locally

1. Main ETL + dbt

2. Airflow Environment

Key Engineering Highlights

Folder Structure

Business Value

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
airflow/dags		airflow/dags
sales_dbt		sales_dbt
src		src
.gitignore		.gitignore
Business_related_Question.sql		Business_related_Question.sql
README.md		README.md
main.py		main.py
requirement.txt		requirement.txt

Folders and files

Latest commit

History

Repository files navigation

Sales Analytics Pipeline

Project Overview

Architecture Overview

Source Data & Schema Design

ETL Pipeline Logic

Python ETL

dbt Transformations

Apache Airflow DAG Orchestration

Airflow Setup (WSL/Linux Recommended)

Business Questions & Queries

Running the Project Locally

1. Main ETL + dbt

2. Airflow Environment

Key Engineering Highlights

Folder Structure

Business Value

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages