Dashly : Live Sales Dashboard

Table of Contents

  • Problem Statement
  • Overview
  • Workflow
  • ER Diagram
  • Database Schema
  • SQL Views
  • Setup
  • ETL Pipeline
  • Results & Insights
  • Impact
  • How To
  • GitHub Actions
  • Power BI Dashboard
  • Folder Structure
  • License

Problem Statement

Quick Buy is a leading superstore operating across the United States, selling thousands of products across regions.

Currently, the store relies heavily on multiple spreadsheets and manual SQL queries to track business performance.

  • This manual process makes it :

    • Hard to keep data structured and consistent.
    • Time-consuming to prepare reports.
    • Difficult to manually update daily transaction data.
    • Challenging for non-technical users to extract and understand key business insights.
  • To solve these challenges, Quick Buy aims to build an automated system that :

    • Collects and stores all data in a structured database.
    • Cleans and updates data automatically using an ETL pipeline.
    • Uses SQL to analyze data and extract meaningful business insights.
    • Displays insights through an interactive Power BI dashboard.
    • Helps managers and executives quickly understand business performance and make better decisions.

Overview

  • Designed a fully automated ETL pipeline using Python, SQLAlchemy and GitHub Actions for daily data updates.
  • Built Python ETL scripts to extract, transform and load 50,000+ sales records into a Neon PostgreSQL database.
  • Automated daily data generation (~100 new transactions daily) to simulate real-time sales activity.
  • Integrated Power BI with the database, enabling a real-time, auto-refreshing dashboard without manual updates.

Workflow


ER Diagram

The ER (Entity-Relationship) diagram visually represents how different tables in the database are related.

Relationships

  • One Customer ➜ Many Orders
    • Each customer can place multiple orders.
  • One Product ➜ Many Orders
    • Each product can appear in multiple orders.
  • Orders Table ➜ Central Table
    • Serves as the main transactional table, linking customers and products.

Database Schema

The database is designed to store and organize Quick Buy's orders, customers, and product data.

It ensures that all business data is centralized, consistent, and easy to query for analysis and dashboarding.

Click Here to view Schema Definition
/* Customers Table */
CREATE TABLE IF NOT EXISTS customers (
  customer_id TEXT PRIMARY KEY,
  customer_name TEXT,
  segment TEXT,
  city TEXT,
  state TEXT,
  country TEXT,
  postal_code NUMERIC,
  region TEXT
);
/* Products Table */
CREATE TABLE IF NOT EXISTS products (
  product_id TEXT PRIMARY KEY,
  product_name TEXT,
  category TEXT,
  sub_category TEXT
);
/* Orders Table */
CREATE TABLE IF NOT EXISTS orders (
  order_id TEXT PRIMARY KEY,
  order_date DATE,
  customer_id TEXT,
  product_id TEXT,
  ship_mode TEXT,
  ship_date DATE,
  sales NUMERIC,
  quantity INTEGER,
  discount NUMERIC,
  profit NUMERIC,
  shipping_duration INTEGER,
  profit_margin NUMERIC,
  FOREIGN KEY (customer_id) REFERENCES customers(customer_id),
  FOREIGN KEY (product_id) REFERENCES products(product_id)
);

SQL Views

SQL views are used to make data analysis easier and keep business metrics consistent for dashboards.

Instead of running complex queries every time, Power BI connects directly to these views to fetch clean data.

Click Here to view SQL Views
/* segment_wise_sales_and_profit */
/* Calculates total sales and profit for each customer segment. */
CREATE OR REPLACE VIEW segment_wise_sales_and_profit AS
SELECT 
    c.segment,
    SUM(o.sales) AS total_sales,
    SUM(o.profit) AS total_profit
FROM orders AS o
JOIN customers AS c
ON o.customer_id = c.customer_id
GROUP BY c.segment;
/* region_wise_sales_and_profit */
/* Summarizes total sales and profit across all regions. */
CREATE OR REPLACE VIEW region_wise_sales_and_profit AS
SELECT 
    c.region,
    SUM(o.sales) AS total_sales,
    SUM(o.profit) AS total_profit
FROM orders AS o
JOIN customers AS c
ON o.customer_id = c.customer_id
GROUP BY c.region;
/* month_wise_sales_and_profit */
/* Shows monthly trends of total sales and profit. */
CREATE OR REPLACE VIEW month_wise_sales_and_profit AS 
SELECT
    TO_CHAR(order_date, 'Mon') AS month, 
    SUM(sales) AS total_sales,
    SUM(profit) AS total_profit
FROM orders
GROUP BY month;
/* top_customers_by_sales */
/* Lists customers with their total sales and profit to identify top performers. */
CREATE OR REPLACE VIEW top_customers_by_sales AS
SELECT
    c.customer_id, 
    c.customer_name,
    SUM(o.sales) AS total_sales,
    SUM(o.profit) AS total_profit
FROM orders AS o
JOIN customers AS c
ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.customer_name;
/* shipping_performance */
/* Analyzes sales and profit performance by shipping mode. */
CREATE OR REPLACE VIEW shipping_performance AS 
SELECT 
    ship_mode,
    SUM(sales) AS total_sales,
    SUM(profit) AS total_profit
FROM orders
GROUP BY ship_mode;
/* overall_sales_performance */
/* Provides overall business KPIs like total sales, profit, orders and customers. */
CREATE OR REPLACE VIEW overall_sales_performance AS
SELECT
    SUM(sales) AS total_sales,
    SUM(profit) AS total_profit,
    COUNT(DISTINCT order_id) AS total_orders,
    COUNT(DISTINCT customer_id) AS total_customers,
    COUNT(DISTINCT product_id) AS total_products,
    SUM(quantity) AS total_quantity_sold
FROM orders;
/* state_wise_sales_and_customer_base */
/* Displays total sales and customer count by U.S. states. */
CREATE OR REPLACE VIEW state_wise_sales_and_customer_base AS 
SELECT
    c.state,
    SUM(o.sales) AS total_sales,
    COUNT(DISTINCT c.customer_id) AS total_customers
FROM orders AS o
JOIN customers AS c
ON o.customer_id = c.customer_id
GROUP BY c.state;
/* segment_wise_monthly_sales_and_profit */
/* Tracks monthly sales and profit performance for each customer segment. */
CREATE OR REPLACE VIEW segment_wise_monthly_sales_and_profit AS
SELECT
    c.segment,
    TO_CHAR(o.order_date, 'Mon') AS month_name,
    SUM(o.sales) AS total_sales,
    SUM(o.profit) AS total_profit
FROM orders AS o
JOIN customers AS c
ON o.customer_id = c.customer_id
GROUP BY c.segment, month_name;
/* region_wise_monthly_sales */
/* Shows monthly sales trends for each region. */
CREATE OR REPLACE VIEW region_wise_monthly_sales AS
SELECT
    c.region,
    TO_CHAR(o.order_date, 'Mon') AS month_name,
    SUM(o.sales) AS total_sales
FROM orders AS o
JOIN customers AS c
ON o.customer_id = c.customer_id
GROUP BY c.region, month_name;
/* overall_customers_performance */
/* Calculates average sales, profit, orders and quantity per customer. */
CREATE OR REPLACE VIEW overall_customers_performance AS
SELECT 
    ROUND(SUM(o.sales)/COUNT(DISTINCT o.customer_id)) AS avg_sales_per_customer,
    ROUND(SUM(o.profit)/COUNT(DISTINCT o.customer_id)) AS avg_profit_per_customer,
    ROUND(COUNT(DISTINCT order_id)/COUNT(DISTINCT customer_id)) AS avg_orders_per_customer,
    ROUND(SUM(o.quantity)/COUNT(DISTINCT o.customer_id)) AS avg_quantity_per_customer
FROM orders AS o;
/* avg_discount_per_order_per_customer */
/* Computes the average discount per customer across all orders. */
CREATE OR REPLACE VIEW avg_discount_per_order_per_customer AS
SELECT
    ROUND(AVG(customer_avg), 2) AS avg_discount_per_customer
FROM (
    SELECT customer_id, AVG(discount) AS customer_avg
    FROM orders
    GROUP BY customer_id
) AS sub;
/* category_wise_monthly_sales_and_profit */
/* Tracks monthly sales and profit for each product category. */
CREATE OR REPLACE VIEW category_wise_monthly_sales_and_profit AS
SELECT
    p.category,
    TO_CHAR(o.order_date, 'Mon') AS month,
    SUM(o.sales) AS total_sales,
    SUM(o.profit) AS total_profit
FROM orders AS o
JOIN products AS p
ON o.product_id = p.product_id
GROUP BY p.category, month;
/* sub_category_wise_sales_and_profit */
/* Summarizes total sales and profit by product sub-category. */
CREATE OR REPLACE VIEW sub_category_wise_sales_and_profit AS
SELECT
    p.sub_category,
    SUM(o.sales) AS total_sales,
    SUM(o.profit) AS total_profit
FROM orders AS o
JOIN products AS p
ON o.product_id = p.product_id
GROUP BY p.sub_category;
/* category_wise_sales_profit_and_orders */
/* Shows total sales, profit and order count by product category. */
CREATE OR REPLACE VIEW category_wise_sales_profit_and_orders AS
SELECT
    p.category,
    SUM(o.sales) AS total_sales,
    SUM(o.profit) AS total_profit,
    COUNT(DISTINCT o.order_id) AS total_orders
FROM orders AS o
JOIN products AS p
ON o.product_id = p.product_id
GROUP BY p.category;
/* state_wise_most_purchased_sub_category */
/* Identifies the most purchased sub-category in each U.S. state. */
CREATE OR REPLACE VIEW state_wise_most_purchased_sub_category AS
SELECT
    c.state,
    p.sub_category, 
    SUM(o.quantity) AS quantity_sold,
    RANK() OVER (PARTITION BY c.state ORDER BY SUM(o.quantity) DESC) AS sub_category_rank
FROM orders AS o
JOIN products AS p
ON o.product_id = p.product_id
JOIN customers AS c
ON o.customer_id = c.customer_id
GROUP BY c.state, p.sub_category;
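
These views can also be queried directly from Python, which is a quick way to sanity-check the numbers before they reach Power BI. A minimal sketch, assuming the same .env credentials used throughout this project and the top_customers_by_sales view defined above:

# A short sanity check of one of the views (assumes the .env file from the Setup section)
import os
import pandas as pd
from dotenv import load_dotenv
from sqlalchemy import create_engine

# Loading Database Credentials from Environment File
load_dotenv()
DB_HOST = os.getenv("DB_HOST")
DB_NAME = os.getenv("DB_NAME")
DB_USER = os.getenv("DB_USER")
DB_PASS = os.getenv("DB_PASS")

# Creating SQLAlchemy Engine
engine = create_engine(f"postgresql://{DB_USER}:{DB_PASS}@{DB_HOST}/{DB_NAME}?sslmode=require&channel_binding=require")

# Fetching the top 10 customers by total sales from the view
top_customers = pd.read_sql("SELECT * FROM top_customers_by_sales ORDER BY total_sales DESC LIMIT 10;", engine)
print(top_customers)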

Setup

1. Clone the Repository

First, clone the project from GitHub to your local system.

git clone https://github.com/themrityunjaypathak/Dashly.git

2. Set Up a Virtual Environment

To avoid version conflicts and keep your project isolated, create a virtual environment.

On Windows :

python -m venv .venv

On macOS/Linux :

python3 -m venv .venv

3. Activate the Virtual Environment

After setting up the virtual environment, activate it to begin installing dependencies.

On Windows :

.\.venv\Scripts\activate

On macOS/Linux :

source .venv/bin/activate

4. Install the Project Dependencies

Now, install all the required libraries inside your virtual environment using the requirements.txt file.

pip install -r requirements.txt

Tip

It's a good idea to upgrade pip before installing dependencies to avoid compatibility issues.

pip install --upgrade pip

Note

Use the same Python version as in .github/workflows/etl_pipeline.yaml to avoid compatibility issues.

5. Setup Environment Variables

This project uses a .env file to store database credentials like DB_USER, DB_PASS, DB_NAME, etc.

Environment variables are stored in plain text, so the .env file must be handled carefully.

# .env
DB_HOST=host_name
DB_NAME=database_name
DB_USER=user_name
DB_PASS=password

Important

Make sure not to commit your .env file to GitHub or any public repositories.

You can add it to .gitignore to ensure it's excluded from version control.

Note

If you want to create a free database on Neon and connect it with Python, see the How To section.

6. Database Connectivity Check

Confirm that the PostgreSQL connection works before running ETL scripts.

This avoids script crashes due to invalid credentials or blocked ports.

Click Here to view Code Snippet
# Importing Libraries
import os
from dotenv import load_dotenv
from sqlalchemy import create_engine

# Loading Environment File
load_dotenv()

# Loading Database Credentials from Environment File
DB_HOST = os.getenv("DB_HOST")
DB_NAME = os.getenv("DB_NAME")
DB_USER = os.getenv("DB_USER")
DB_PASS = os.getenv("DB_PASS")

# Creating SQLAlchemy Engine
engine = create_engine(f"postgresql://{DB_USER}:{DB_PASS}@{DB_HOST}/{DB_NAME}?sslmode=require&channel_binding=require", pool_pre_ping=True)
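
The snippet above only builds the engine; it does not open a connection on its own. To actually verify connectivity before running the ETL scripts, you can execute a trivial query against the engine created above (a minimal sketch):

# Verifying the connection with a trivial query (reuses the engine created above)
from sqlalchemy import text

try:
    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))
    print("Database connection successful")
except Exception as error:
    print(f"Database connection failed : {error}")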

7. Run ETL Script

This initializes the database and :

  • Cleans raw CSV data
  • Creates tables (customers, orders, products)
  • Loads data into the Neon PostgreSQL Database
python scripts/etl.py

Note

Run this only once initially or when you want a full database refresh.

8. Create SQL Views

This script builds reusable SQL views that summarize business metrics for the Power BI dashboard.

It simplifies queries, ensures consistent logic, and improves performance.

python scripts/create_views.py

9. Generate New Data (Optional)

Simulates daily transactions by generating new random data for testing pipeline automation.

Helps verify how dashboards respond to new data over time.

python scripts/generate_data.py

10. Export Views as CSVs (Optional)

Exports SQL view results to CSV files inside the views/ folder.

This is useful for sharing datasets or validating dashboard data without connecting to the database.

python scripts/export_views.py
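
The export script itself is not reproduced in this README. As a rough sketch of the idea only (the actual scripts/export_views.py may differ), each view can be read with pandas and written to the views/ folder as a CSV file:

# Illustrative sketch only : the actual scripts/export_views.py may differ
import os
import pandas as pd
from dotenv import load_dotenv
from sqlalchemy import create_engine

# Loading Database Credentials from Environment File
load_dotenv()
engine = create_engine(
    f"postgresql://{os.getenv('DB_USER')}:{os.getenv('DB_PASS')}@{os.getenv('DB_HOST')}/{os.getenv('DB_NAME')}"
    "?sslmode=require&channel_binding=require"
)

# Views to export (a subset of the views defined in create_views.py)
views = ["shipping_performance", "segment_wise_sales_and_profit", "overall_sales_performance"]

# Writing each view to a CSV file inside the views/ folder
os.makedirs("views", exist_ok=True)
for view in views:
    df = pd.read_sql(f"SELECT * FROM {view};", engine)
    df.to_csv(os.path.join("views", f"{view}.csv"), index=False)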

11. Simulate Full Workflow Locally

Running all scripts manually helps confirm everything works locally before enabling the scheduled GitHub run.

python scripts/etl.py
python scripts/create_views.py
python scripts/generate_data.py
python scripts/export_views.py

12. Check Logs

  • Check log files inside the logs/ folder :
    • etl.log : Initial data loading
    • create_views.log : SQL views creation
    • generate_data.log : Daily data generation
  • Logs help you monitor pipeline performance and troubleshoot errors quickly.

ETL Pipeline

  • The ETL (Extract, Transform, Load) pipeline is the core part of this project.
  • It automatically cleans and loads validated sales data into a PostgreSQL database for the Power BI dashboard.
  • It is built with Python using SQLAlchemy and is securely configured via environment variables.

ETL Pipeline Structure

| Script Name | Purpose |
| --- | --- |
| etl.py | Sets up the database schema, cleans the dataset, and loads initial data into the database. |
| create_views.py | Creates multiple SQL views that summarize and aggregate data for the Power BI dashboard. |
| generate_data.py | Generates random synthetic transaction data to simulate daily updates in the database. |

How does the ETL pipeline work?

1. etl.py

  • This script handles the first step of the process by preparing the database.
Click Here to view Sample Script
# ---------------- Loading Python Packages ---------------- 

# Importing Libraries
import os
import sys
import logging
from dotenv import load_dotenv
from sqlalchemy import create_engine, text, inspect

# ---------------- Configuration ---------------- 

# Setting up Logging
LOG_DIR = "logs"
os.makedirs(LOG_DIR, exist_ok=True)
log_file = os.path.join(LOG_DIR, "etl.log")
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s", datefmt="%Y-%m-%d %H:%M:%S",
                    handlers=[
                        logging.FileHandler(log_file, mode="a"),
                        logging.StreamHandler()
                    ])

# Adding utils to PYTHONPATH
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))

# Loading Environment File
load_dotenv()

# Loading Database Credentials from Environment File
DB_HOST = os.getenv("DB_HOST")
DB_NAME = os.getenv("DB_NAME")
DB_USER = os.getenv("DB_USER")
DB_PASS = os.getenv("DB_PASS")

# Creating SQLAlchemy Engine to upload the Data to Neon PostgreSQL Database
engine = create_engine(f"postgresql://{DB_USER}:{DB_PASS}@{DB_HOST}/{DB_NAME}?sslmode=require&channel_binding=require", pool_pre_ping=True)
logging.info("SQLAlchemy Engine created successfully")

# ---------------- Check if Table Exists ----------------

# Check if tables already exist in the Neon PostgreSQL Database
inspector = inspect(engine)
existing_tables = inspector.get_table_names()

# If all required tables exist, skip ETL Setup
if set(["customers", "products", "orders"]).issubset(existing_tables):
    logging.info("Tables already exist. Skipping ETL setup.")
else:
    logging.info("Tables do not exist. Running ETL setup...")

    # Reading CSV File
    from utils.read_data import load_csv
    df = load_csv("data", "sales_data.csv")
    logging.info("CSV file loaded successfully")

    # ---------------- Data Cleaning ---------------- 

    # Handling Duplicate Data
    from utils.drop_duplicates import remove_duplicates
    df = remove_duplicates(df)

    # Standardizing Column Names
    from utils.standardize_columns import standardize_column_names
    df = standardize_column_names(df)

    # Optimizing DataFrame
    from utils.data_preprocessor import optimize_dataframe
    df = optimize_dataframe(df)

    logging.info("Data cleaned and optimized successfully")

    # ---------------- Schema Creation ----------------

    # SQL Query to Create Schema
    schema_sql = [
    """
    CREATE TABLE IF NOT EXISTS customers (
        customer_id TEXT PRIMARY KEY,
        customer_name TEXT,
        segment TEXT,
        city TEXT,
        state TEXT,
        country TEXT,
        postal_code NUMERIC,
        region TEXT
    );
    """
    ,
    """
    CREATE TABLE IF NOT EXISTS products (
        product_id TEXT PRIMARY KEY,
        product_name TEXT,
        category TEXT,
        sub_category TEXT
    );
    """
    ,
    """
    CREATE TABLE IF NOT EXISTS orders (
        order_id TEXT PRIMARY KEY,
        order_date DATE,
        customer_id TEXT,
        product_id TEXT,
        ship_mode TEXT,
        ship_date DATE,
        sales NUMERIC,
        quantity INTEGER,
        discount NUMERIC,
        profit NUMERIC,
        FOREIGN KEY (customer_id) REFERENCES customers(customer_id),
        FOREIGN KEY (product_id) REFERENCES products(product_id)
    );
    """]

    # Creating Schema in Neon PostgreSQL Database
    with engine.connect() as conn:
        for stm in schema_sql:
            conn.execute(text(stm))
        conn.commit()
    logging.info("Database schema created successfully")

    # Customers Table
    customers_cols = ["customer_id", "customer_name", "segment", "city", "state", "country", "postal_code", "region"]
    customers = df[customers_cols].drop_duplicates(subset="customer_id")

    # Products Table
    products_cols = ["product_id", "product_name", "category", "sub_category"]
    products = df[products_cols].drop_duplicates(subset="product_id")

    # Orders Table
    orders_cols = ["order_id", "order_date", "customer_id", "product_id", "ship_mode", "ship_date", "sales", "quantity", "discount", "profit"]
    orders = df[orders_cols].drop_duplicates(subset="order_id")

    # Truncate Tables to avoid uploading Duplicate Data
    with engine.begin() as conn:
        conn.execute(text("TRUNCATE TABLE customers CASCADE;"))
        conn.execute(text("TRUNCATE TABLE products CASCADE;"))
        conn.execute(text("TRUNCATE TABLE orders CASCADE;"))
    logging.info("Tables truncated successfully")

    # Appending data to the tables in Neon PostgreSQL Database
    customers.to_sql("customers", engine, if_exists="append", index=False)
    products.to_sql("products", engine, if_exists="append", index=False)
    orders.to_sql("orders", engine, if_exists="append", index=False)
    logging.info("Initial data uploaded to Neon PostgreSQL Database successfully")

What does it do?

  • Load Configuration
    • Reads environment variables (like DB_HOST, DB_NAME) from a .env file for secure database access.
  • Logging Setup
    • Creates a logs/etl.log file to track all ETL activity and errors.
  • Extract Data
    • Loads raw data from a CSV file using a custom load_csv() utility function.
  • Transform Data
    • Removes duplicates, standardizes column names, and optimizes data types (a hypothetical sketch of one such helper follows this list).
  • Load Data
    • Creates tables in the Neon PostgreSQL database and loads the cleaned data using the to_sql() function.
  • Schema Management
    • Ensures relationships between tables using foreign keys and maintains data integrity.
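
The cleaning helpers live in the utils/ package and are not shown in this README. Purely as a hypothetical illustration (the real utils/standardize_columns.py may be implemented differently), a column-standardizing helper could look like this:

# Hypothetical sketch : the actual utils/standardize_columns.py may differ
import pandas as pd

def standardize_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Lower-cases column names and replaces spaces and dashes with underscores."""
    df = df.copy()
    df.columns = (
        df.columns.str.strip()
                  .str.lower()
                  .str.replace(" ", "_")
                  .str.replace("-", "_")
    )
    return df

With a helper like this, a column such as Sub-Category would become sub_category, matching the schema shown earlier.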

2. create_views.py

  • This script builds SQL views in the PostgreSQL database to simplify analysis and reporting in Power BI.
Click Here to view Sample Script
# ---------------- Loading Python Packages ---------------- 

# Importing Libraries
import os
import logging
from dotenv import load_dotenv
from sqlalchemy import create_engine, text

# ---------------- Configuration ---------------- 

# Setting up Logging
LOG_DIR = "logs"
os.makedirs(LOG_DIR, exist_ok=True)
log_file = os.path.join(LOG_DIR, "create_views.log")
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s", datefmt="%Y-%m-%d %H:%M:%S",
                    handlers=[
                        logging.FileHandler(log_file, mode="a"),
                        logging.StreamHandler()
                    ])

# Loading Environment File
load_dotenv()

# Loading Database Credentials from Environment File
DB_HOST = os.getenv("DB_HOST")
DB_NAME = os.getenv("DB_NAME")
DB_USER = os.getenv("DB_USER")
DB_PASS = os.getenv("DB_PASS")

# Creating SQLAlchemy Engine to upload the Data to Neon PostgreSQL Database
engine = create_engine(f"postgresql://{DB_USER}:{DB_PASS}@{DB_HOST}/{DB_NAME}?sslmode=require&channel_binding=require", pool_pre_ping=True)
logging.info("SQLAlchemy Engine created successfully")

# ---------------- Creating Views in Neon PostgreSQL Database ----------------

# Dictionary of SQL Views
sql_views = {
    "segment_wise_sales_and_profit":"""
        DROP VIEW IF EXISTS segment_wise_sales_and_profit;
        CREATE OR REPLACE VIEW segment_wise_sales_and_profit AS
        SELECT 
            c.segment,
            SUM(o.sales) AS total_sales,
            SUM(o.profit) AS total_profit
        FROM orders AS o
        JOIN customers AS c
        ON o.customer_id = c.customer_id
        GROUP BY c.segment;
    """
    ,
    "region_wise_sales_and_profit":"""
        DROP VIEW IF EXISTS region_wise_sales_and_profit;
        CREATE OR REPLACE VIEW region_wise_sales_and_profit AS
        SELECT 
            c.region,
            SUM(o.sales) AS total_sales,
            SUM(o.profit) AS total_profit
        FROM orders AS o
        JOIN customers AS c
        ON o.customer_id = c.customer_id
        GROUP BY c.region;
    """
    ,
    "month_wise_sales_and_profit":"""
        DROP VIEW IF EXISTS month_wise_sales_and_profit;
        CREATE OR REPLACE VIEW month_wise_sales_and_profit AS 
        SELECT
            TO_CHAR(order_date, 'Mon') AS month, 
            SUM(sales) AS total_sales,
            SUM(profit) AS total_profit
        FROM orders
        GROUP BY month;
    """
    ,
    "top_customers_by_sales":"""
        DROP VIEW IF EXISTS top_customers_by_sales;
        CREATE OR REPLACE VIEW top_customers_by_sales AS
        SELECT
            c.customer_id, 
            c.customer_name,
            SUM(o.sales) AS total_sales,
            SUM(o.profit) AS total_profit
        FROM orders AS o
        JOIN customers AS c
        ON o.customer_id = c.customer_id
        GROUP BY c.customer_id, c.customer_name;
    """
    ,
    "shipping_performance":"""
        DROP VIEW IF EXISTS shipping_performance;
        CREATE OR REPLACE VIEW shipping_performance AS 
        SELECT 
            ship_mode,
            SUM(sales) AS total_sales,
            SUM(profit) AS total_profit
        FROM orders
        GROUP BY ship_mode;
    """
    ,
    "overall_sales_performance":"""
        DROP VIEW IF EXISTS overall_sales_performance;
        CREATE OR REPLACE VIEW overall_sales_performance AS
        SELECT
            SUM(sales) AS total_sales,
            SUM(profit) AS total_profit,
            COUNT(DISTINCT order_id) AS total_orders,
            COUNT(DISTINCT customer_id) AS total_customers,
            COUNT(DISTINCT product_id) AS total_products,
            SUM(quantity) AS total_quantity_sold
        FROM orders;
    """
    ,
    "state_wise_sales_and_customer_base":"""
        DROP VIEW IF EXISTS state_wise_sales_and_customer_base;
        CREATE OR REPLACE VIEW state_wise_sales_and_customer_base AS 
        SELECT
            c.state,
            SUM(o.sales) AS total_sales,
            COUNT(DISTINCT c.customer_id) AS total_customers
        FROM orders AS o
        JOIN customers AS c
        ON o.customer_id = c.customer_id
        GROUP BY c.state;
    """
    ,
    "segment_wise_monthly_sales_and_profit":"""
        DROP VIEW IF EXISTS segment_wise_monthly_sales_and_profit;
        CREATE OR REPLACE VIEW segment_wise_monthly_sales_and_profit AS
        SELECT
            c.segment,
            TO_CHAR(o.order_date, 'Mon') AS month_name,
            SUM(o.sales) AS total_sales,
            SUM(o.profit) AS total_profit
        FROM orders AS o
        JOIN customers AS c
        ON o.customer_id = c.customer_id
        GROUP BY c.segment, month_name;
    """
    ,
    "region_wise_monthly_sales":"""
        DROP VIEW IF EXISTS region_wise_monthly_sales;
        CREATE OR REPLACE VIEW region_wise_monthly_sales AS
        SELECT
            c.region,
            TO_CHAR(o.order_date, 'Mon') AS month_name,
            SUM(o.sales) AS total_sales
        FROM orders AS o
        JOIN customers AS c
        ON o.customer_id = c.customer_id
        GROUP BY c.region, month_name;
    """
    ,
    "overall_customers_performance":"""
        DROP VIEW IF EXISTS overall_customers_performance;
        CREATE OR REPLACE VIEW overall_customers_performance AS
        SELECT 
            ROUND(SUM(o.sales)/COUNT(DISTINCT o.customer_id)) AS avg_sales_per_customer,
            ROUND(SUM(o.profit)/COUNT(DISTINCT o.customer_id)) AS avg_profit_per_customer,
            ROUND(COUNT(DISTINCT order_id)/COUNT(DISTINCT customer_id)) AS avg_orders_per_customer,
            ROUND(SUM(o.quantity)/COUNT(DISTINCT o.customer_id)) AS avg_quantity_per_customer
        FROM orders AS o;
    """
    ,
    "avg_discount_per_order_per_customer":"""
        DROP VIEW IF EXISTS avg_discount_per_order_per_customer;
        CREATE OR REPLACE VIEW avg_discount_per_order_per_customer AS
        SELECT
            ROUND(AVG(customer_avg), 2) AS avg_discount_per_customer
        FROM (
            SELECT customer_id, AVG(discount) AS customer_avg
            FROM orders
            GROUP BY customer_id
        ) AS sub;
    """
    ,
    "category_wise_monthly_sales_and_profit":"""
        DROP VIEW IF EXISTS category_wise_monthly_sales_and_profit;
        CREATE OR REPLACE VIEW category_wise_monthly_sales_and_profit AS
        SELECT
            p.category,
            TO_CHAR(o.order_date, 'Mon') AS month,
            SUM(o.sales) AS total_sales,
            SUM(o.profit) AS total_profit
        FROM orders AS o
        JOIN products AS p
        ON o.product_id = p.product_id
        GROUP BY p.category, month;
    """
    ,
    "sub_category_wise_sales_and_profit":"""
        DROP VIEW IF EXISTS sub_category_wise_sales_and_profit;
        CREATE OR REPLACE VIEW sub_category_wise_sales_and_profit AS
        SELECT
            p.sub_category,
            SUM(o.sales) AS total_sales,
            SUM(o.profit) AS total_profit
        FROM orders AS o
        JOIN products AS p
        ON o.product_id = p.product_id
        GROUP BY p.sub_category;
    """
    ,
    "category_wise_sales_profit_and_orders":"""
        DROP VIEW IF EXISTS category_wise_sales_profit_and_orders;
        CREATE OR REPLACE VIEW category_wise_sales_profit_and_orders AS
        SELECT
            p.category,
            SUM(o.sales) AS total_sales,
            SUM(o.profit) AS total_profit,
            COUNT(DISTINCT o.order_id) AS total_orders
        FROM orders AS o
        JOIN products AS p
        ON o.product_id = p.product_id
        GROUP BY p.category;
    """
    ,
    "state_wise_most_purchased_sub_category":"""
        DROP VIEW IF EXISTS state_wise_most_purchased_sub_category;
        CREATE OR REPLACE VIEW state_wise_most_purchased_sub_category AS
        SELECT
            c.state,
            p.sub_category, 
            SUM(o.quantity) AS quantity_sold,
            RANK() OVER (PARTITION BY c.state ORDER BY SUM(o.quantity) DESC) AS sub_category_rank
        FROM orders AS o
        JOIN products AS p
        ON o.product_id = p.product_id
        JOIN customers AS c
        ON o.customer_id = c.customer_id
        GROUP BY c.state, p.sub_category;
    """
}

# Creating Views in Neon PostgreSQL Database
with engine.connect() as conn:
    for view_name, sql_query in sql_views.items():
        conn.execute(text(sql_query))
    conn.commit()
logging.info("Views created successfully in Neon PostgreSQL Database")

What does it do?

  • Database Connection
    • Connects to the database securely using environment variables.
  • Define SQL Views
    • Creates multiple SQL views to extract key business insights.
  • Execute & Commit
    • Executes each CREATE OR REPLACE VIEW statement and commits the changes (a quick verification sketch follows this list).
  • Logging Setup
    • Stores execution logs in logs/create_views.log.
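
To confirm that the views were actually created, you can list them from information_schema (a minimal check, reusing the engine created in the script above):

# Listing the views present in the public schema (reuses the engine from the script above)
from sqlalchemy import text

with engine.connect() as conn:
    result = conn.execute(text("SELECT table_name FROM information_schema.views WHERE table_schema = 'public';"))
    for row in result:
        print(row.table_name)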

3. generate_data.py

  • This script keeps the database updated with new transaction data for scheduled data refresh in Power BI.
Click Here to view Sample Script
# ---------------- Loading Python Packages ---------------- 

# Importing Libraries
import os
import sys
import logging
from dotenv import load_dotenv
from sqlalchemy import create_engine

# ---------------- Configuration ---------------- 

# Setting up Logging
LOG_DIR = "logs"
os.makedirs(LOG_DIR, exist_ok=True)
log_file = os.path.join(LOG_DIR, "generate_data.log")
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s", datefmt="%Y-%m-%d %H:%M:%S",
                    handlers=[
                        logging.FileHandler(log_file, mode="a"),
                        logging.StreamHandler()
                    ])

# Adding utils to PYTHONPATH
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))

# Loading Environment File
load_dotenv()

# Loading Database Credentials from Environment File
DB_HOST = os.getenv("DB_HOST")
DB_NAME = os.getenv("DB_NAME")
DB_USER = os.getenv("DB_USER")
DB_PASS = os.getenv("DB_PASS")

# Creating SQLAlchemy Engine to upload the Data to Neon PostgreSQL Database
engine = create_engine(f"postgresql://{DB_USER}:{DB_PASS}@{DB_HOST}/{DB_NAME}?sslmode=require&channel_binding=require", pool_pre_ping=True)
logging.info("SQLAlchemy Engine created successfully")

# ---------------- Generating Random Customers Data ---------------- 

# Generating New Customers Data Randomly
from utils.generate_random_data import random_customers_data
customers = random_customers_data(20, 40)
logging.info("Random customers data generated successfully")

# ---------------- Generating Random Orders Data ---------------- 

# Generating New Orders Data Randomly
from utils.generate_random_data import random_orders_data
orders = random_orders_data(60, 80)
logging.info("Random orders data generated successfully")

# ---------------- Data Cleaning ---------------- 

# Handling Duplicate Data
from utils.drop_duplicates import remove_duplicates
customers = remove_duplicates(customers)
orders = remove_duplicates(orders)

# Standardizing Column Names
from utils.standardize_columns import standardize_column_names
customers = standardize_column_names(customers)
orders = standardize_column_names(orders)

# Optimizing DataFrame
from utils.data_preprocessor import optimize_customers, optimize_orders
customers = optimize_customers(customers)
orders = optimize_orders(orders)

logging.info("Data cleaned and optimized successfully")

# ---------------- Appending New Data to Neon PostgreSQL Database ---------------- 

# Appending new and unique data to Neon PostgreSQL Database
from utils.append_data import append_unique_data_to_db
append_unique_data_to_db("customers", customers, "customer_id", engine)
append_unique_data_to_db("orders", orders, "order_id", engine)
logging.info("New data uploaded to database successfully")

What does it do?

  • Generate Random Data
    • Uses custom utility functions to create synthetic customer and order data.
  • Data Cleaning
    • Removes duplicates and optimizes data types before uploading to the Neon database.
  • Append Unique Data
    • Inserts only new records into the database, avoiding duplicates.
  • Logging Setup
    • Saves process logs in logs/generate_data.log.

Once this cycle is complete, the process repeats automatically :

generate_data.py ➜ create_views.py ➜ Power BI Refresh ➜ New Insights

This ensures that the Power BI Dashboard always displays the latest insights automatically.


Results & Insights

This section highlights the key outcomes and insights generated from the ETL pipeline and Power BI dashboards.

Pipeline Performance

This section summarizes pipeline performance metrics such as runtime, automation frequency, and reliability.

1. Data Loading Overview

| Parameter | Value |
| --- | --- |
| Dataset Size | ~50,000 sales records |
| Tables Used | customers, orders, products |
| Avg. Daily Inserts | ~100 new records |
| Database | Neon PostgreSQL (cloud-hosted) |

Note

This setup simulates real-time superstore sales activity with daily updates to the orders and customers tables.


2. Automation & Scheduling

| Attribute | Details |
| --- | --- |
| Automation Tool | GitHub Actions |
| Execution Frequency | Daily |
| Scheduled Time | 10:00 AM IST |
| Trigger Type | cron (automated) and workflow_dispatch (manual) |
| Runner Environment | ubuntu-latest (GitHub-hosted Ubuntu runner) |

Note

This setup ensures that the latest data is always available for Power BI dashboards, with no manual effort.


3. Runtime Performance

| Workflow Step | Description | Runtime |
| --- | --- | --- |
| Set up job | Initializes GitHub Actions environment | 1s |
| Checkout repository | Pulls repository code into the runner | 1s |
| Set up Python | Installs Python environment (v3.13.0) | 0s |
| Install dependencies | Installs libraries from requirements.txt | 22s |
| Run ETL Script | Extracts, transforms, and loads data into PostgreSQL | 3s |
| Run Generate Data Script | Generates new synthetic customer and order data | 4s |
| Run Views Script | Creates / Refreshes analytical SQL views | 3s |
| Run Export Views Script | Exports SQL views as CSV files | 2s |
| Upload Exported CSV as Artifacts | Uploads exported CSVs to GitHub Actions artifacts | 2s |
| Commit and Push CSVs | Commits CSV files to the repository | 0s |
| Upload Logs as Artifacts | Uploads log files for debugging and tracking | 1s |
| Post Setup / Cleanup Steps | Cleans the environment post-run | 2s |

Note

Total runtime : ~45–50 seconds per pipeline run

Scheduling : The workflow runs daily at 10:00 AM IST using a cron schedule (30 4 * * *).

The cron expression is evaluated in UTC (04:30 UTC = 10:00 AM IST).

The ETL pipeline runs within a minute, automatically refreshing dashboard data daily with no manual effort.


4. Error Handling and Logging

| Aspect | Implementation Details |
| --- | --- |
| Error Tracking | Structured try–except error handling in each script (see the sketch below) |
| Log Files | etl.log, generate_data.log, create_views.log |
| Log Storage | Uploaded as GitHub Actions run artifacts |
| Security | All credentials securely stored in GitHub Secrets (DB_USER, DB_PASS, etc.) |

Tip

Automated logging and secret handling remove the need for manual checks and ensure smooth workflow runs.
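
The sample scripts shown in this README omit their error handling for brevity. As an illustrative sketch only (not the exact project code), the try–except pattern described above could wrap each pipeline step, log the failure, and exit with a non-zero status so the GitHub Actions run is marked as failed:

# Illustrative sketch of the try/except + logging pattern (not the exact project code)
import sys
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")

def run_step(step_name, func):
    """Runs one pipeline step, logging any failure and exiting with a non-zero status."""
    try:
        func()
        logging.info("%s completed successfully", step_name)
    except Exception:
        logging.exception("%s failed", step_name)
        sys.exit(1)   # A non-zero exit code marks the GitHub Actions step as failed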


5. Reliability and Stability

| Metric | Value | Remarks |
| --- | --- | --- |
| Total Runtime | ~48 seconds | Fast for a daily automated ETL pipeline |
| Success Rate | 100% (till last run) | Verified via GitHub Actions workflow panel |
| Avg. Records Inserted | ~100 rows/day | Lightweight daily incremental updates |
| Resource Utilization | Low CPU and memory usage | Efficient for cloud runners |

Important

The pipeline runs fully unattended, ensuring consistent daily data updates and automatic Power BI refreshes.


Dashboard Metrics

This section highlights key business insights and trends derived from the Power BI dashboard visualizations.

1. Shipping Mode Performance

| Shipping Mode | Total Sales (₹) | % of Total Sales | Total Profit (₹) | % of Total Profit | Profit Margin |
| --- | --- | --- | --- | --- | --- |
| Standard Class | 5,099,197 | 59.7% | 897,360 | 59.8% | 17.6% |
| Second Class | 1,650,059 | 19.3% | 292,629 | 19.5% | 17.7% |
| First Class | 1,343,959 | 15.7% | 230,784 | 15.4% | 17.2% |
| Same Day | 440,836 | 5.2% | 78,981 | 5.3% | 17.9% |
Click Here to view Key Insights  
  • Standard Class drives ~60% of total sales (~₹5.1M) and profit (~₹897K), making it the dominant shipping mode.
  • Second Class contributes ~19% of sales (~₹1.65M) and profit (~₹293K), showing steady usage.
  • First Class accounts for ~16% of sales (~₹1.34M) and ~15% of profit (~₹231K), preferred for faster delivery.
  • Same Day contributes ~5% of sales (~₹0.44M) and profit (~₹79K), lowest usage but fastest option.
  • Profit margins remain consistent across modes (~17–18%), indicating stable logistics and pricing control.

2. Customer Segment Performance

| Segment | Total Sales (₹) | % of Total Sales | Total Profit (₹) | % of Total Profit | Profit Margin |
| --- | --- | --- | --- | --- | --- |
| Consumer | 4,263,570 | 49.9% | 757,416 | 50.5% | 17.8% |
| Corporate | 2,742,160 | 32.1% | 475,855 | 31.7% | 17.4% |
| Home Office | 1,528,321 | 17.9% | 266,483 | 17.8% | 17.4% |
Click Here to view Key Insights  
  • Consumer Segment is the main revenue driver, generating ~50% of total sales (~₹4.26M) and profit (~₹757K).
  • Corporate Segment contributes ~32% of sales (~₹2.74M) and profit (~₹476K), indicating steady performance.
  • Home Office delivers ~18% of revenue (~₹1.53M) and profit (~₹266K), smaller but reliable.
  • Potential growth opportunity lies in expanding Home Office sales through targeted marketing.

3. Monthly Sales & Profit Performance

| Month | Total Sales (₹) | % of Total Sales | Total Profit (₹) | % of Total Profit | Profit Margin |
| --- | --- | --- | --- | --- | --- |
| Jan | 731,193 | 8.6% | 130,941 | 8.7% | 17.9% |
| Feb | 666,688 | 7.8% | 119,076 | 7.9% | 17.9% |
| Mar | 765,028 | 9.0% | 131,184 | 8.7% | 17.1% |
| Apr | 705,858 | 8.3% | 123,398 | 8.2% | 17.5% |
| May | 734,328 | 8.6% | 131,017 | 8.7% | 17.8% |
| Jun | 639,560 | 7.5% | 115,182 | 7.7% | 18.0% |
| Jul | 653,572 | 7.7% | 118,166 | 7.9% | 18.1% |
| Aug | 695,246 | 8.2% | 124,125 | 8.3% | 17.8% |
| Sep | 646,965 | 7.6% | 110,729 | 7.4% | 17.1% |
| Oct | 789,443 | 9.2% | 137,346 | 9.2% | 17.4% |
| Nov | 712,799 | 8.3% | 120,187 | 8.0% | 16.9% |
| Dec | 793,371 | 9.3% | 138,403 | 9.2% | 17.4% |
Click Here to view Key Insights  
  • Q4 (Oct–Dec) drives ~27% of annual sales and profit, making it the strongest quarter for promotions.
  • Profit margins stay steady at ~17–18% throughout the year, indicating consistent pricing and cost control.
  • December and March are peak months by sales contribution, ideal for marketing campaigns and seasonal offers.
  • Sales and profit are evenly distributed across months, reflecting stable and predictable performance.

4. Regional Performance Insights

| Region | Total Sales (₹) | % of Total Sales | Total Profit (₹) | % of Total Profit |
| --- | --- | --- | --- | --- |
| West | 2,484,870 | ~29% | 440,814 | ~29% |
| East | 2,456,014 | ~29% | 427,689 | ~28% |
| Central | 1,986,280 | ~23% | 352,400 | ~23% |
| South | 1,606,887 | ~19% | 278,851 | ~19% |
Click Here to view Key Insights  
  • West leads with ~29% of revenue (~₹2.48M) and profit (~₹441K), making it the top-performing region.
  • East contributes ~29% of sales (~₹2.46M) and ~28% of profit (~₹428K), showing strong and balanced growth.
  • Central adds ~23% of revenue (~₹1.99M) and profit (~₹352K), reflecting steady mid-level performance.
  • South accounts for ~19% of revenue (~₹1.61M) and profit (~₹279K), highlighting a key growth opportunity.

5. Sub-Category Performance

| Sub-Category | Total Sales (₹) | % of Total Sales | Total Profit (₹) | % of Total Profit |
| --- | --- | --- | --- | --- |
| Paper | 1,260,080 | ~15% | 223,147 | ~15% |
| Binders | 943,376 | ~11% | 162,084 | ~11% |
| Phones | 879,975 | ~10% | 153,018 | ~10% |
| Furnishings | 836,955 | ~10% | 151,623 | ~10% |
| Art | 697,965 | ~8% | 120,843 | ~8% |
| Storage | 651,268 | ~8% | 112,673 | ~8% |
| Accessories | 643,638 | ~7% | 112,007 | ~7% |
| Appliances | 495,030 | ~6% | 90,786 | ~6% |
| Others (8 sub-categories) | | | | |
Click Here to view Key Insights  
  • Paper is the top-performing sub-category, contributing ~15% of total sales (~₹1.26M) and profit (~₹223K).
  • Binders, Phones, and Furnishings contribute ~31% of total profit (~₹467K), forming the high-performing cluster.
  • Art, Storage, and Accessories are solid mid-tier performers, contributing ~23% of total profit combined.
  • Sub-categories like Copiers and Supplies contribute <5% of total profit, showing limited business impact.
  • Scaling high-value categories (Paper, Phones, Furnishings) can boost revenue while maintaining stable margins.

6. State-wise Sales Performance

| State | Total Sales (₹) | Total Customers | % of Total Sales |
| --- | --- | --- | --- |
| California | 1,792,545 | 188 | 21.4% |
| New York | 924,728 | 104 | 11.0% |
| Texas | 903,046 | 92 | 10.8% |
| Pennsylvania | 472,741 | 45 | 5.6% |
| Ohio | 451,732 | 51 | 5.4% |
Click Here to view Key Insights  
  • California leads with ~21% sales (~₹1.8M) and the highest customer base (188), making it the strongest market.
  • New York + Texas contribute ~22% of total sales (~₹1.83M), indicating strong demand in dense regions.
  • Top 5 states generate ~54% of total revenue (~₹4.54M), highlighting sales concentration in key markets.

Impact

  • Built an end-to-end Power BI dashboard connected to a PostgreSQL database containing 50,000+ sales records.
  • Improved daily data update time from hours to under a minute (average ~45 sec) using GitHub Actions.
  • Reduced reporting time by 80% through automation, delivering updated insights in under a minute.
  • Delivered a reliable, low-latency, fully automated data pipeline with zero manual intervention.
  • Achieved 100% workflow reliability, with zero pipeline failures since deployment (as recorded in GitHub Actions).

How To

1. How to create a free PostgreSQL database on Neon?

  • Open Neon in your browser.
 
  • Log in using GitHub / Google / Microsoft.
 
  • Create a New Project.
 
  • Fill in the Project Details.
 
  • New PostgreSQL project is created in the Neon Console.

2. How to connect Neon Database with Python via SQLAlchemy?

  • Open the newly created Project and Click Connect.
 
  • Get the Connection String of your Database.

Image taken from Neon Documentation

  • Understand the Connection String.
postgresql://alex:AbC123dEf@ep-cool-darkness-a1b2c3d4-pooler.us-east-2.aws.neon.tech/dbname?sslmode=require&channel_binding=require
             ^    ^         ^                         ^                              ^
       role -|    |         |- hostname               |- pooler option               |- database
                  |
                  |- password

You can use this to configure your database connection.

You can place the connection details in an .env file for secure access.

Click Here to view Sample ENV File  
# .env file
DB_HOST=ep-cool-darkness-a1b2c3d4-pooler.us-east-2.aws.neon.tech
DB_NAME=dbname
DB_USER=alex
DB_PASS=AbC123dEf

After placing connection details in an .env file, you can read it via Python using SQLAlchemy.

Click Here to view Code Snippet  
# Importing Libraries
import os
from dotenv import load_dotenv
from sqlalchemy import create_engine

# Loading Environment File
load_dotenv()

# Loading Database Credentials from Environment File
DB_HOST = os.getenv("DB_HOST")
DB_NAME = os.getenv("DB_NAME")
DB_USER = os.getenv("DB_USER")
DB_PASS = os.getenv("DB_PASS")

# Creating SQLAlchemy Engine to upload the Data to Neon PostgreSQL Database
engine = create_engine(f"postgresql://{DB_USER}:{DB_PASS}@{DB_HOST}/{DB_NAME}?sslmode=require&channel_binding=require", pool_pre_ping=True)

3. How to connect Power BI with Neon Database for live data refresh?

  • Launch Power BI Desktop on your computer.
  • Start with a new, empty report.
  • In the Home tab, click on Get Data then More...
  • This opens a list of all available data connectors.
 
  • In the list of connectors, scroll down and find PostgreSQL database.
  • Select it and click Connect.
  • This tells Power BI that you want to connect to a PostgreSQL database (like Neon).
 
  • Power BI will prompt you for connection information.
  • You'll need the Host and Database name from your Neon connection string.
DB_HOST=ep-cool-darkness-a1b2c3d4-pooler.us-east-2.aws.neon.tech
DB_NAME=dbname
  • Fill in the Host and Database name and click OK.
  • This step connects Power BI directly to your Neon PostgreSQL Database.
 
Click Here to know more about Connection Modes in Power BI  
  • Import Mode
    • Power BI copies the data from your source system into its internal, highly optimized storage engine.
    • That means all the data is loaded and stored within the .pbix file (or in Power BI Service once published).
    • All reports, visuals and calculations run on this cached data, not on the live data source.
  • DirectQuery Mode
    • In DirectQuery Mode, Power BI does not import or store data.
    • Instead, it sends real-time queries to the data source every time a user interacts with a report.
    • For live data refresh, choose DirectQuery Mode.
  • Once connected, you'll see a list of all available tables and views from your Neon database.
  • Select the ones you want to use in your report and Click Load.
 
  • Power BI is now connected to your Neon PostgreSQL database.
  • It can refresh live data automatically using DirectQuery Mode.

4. How to configure GitHub Actions?

  • Open GitHub in your browser.
  • Navigate to the repository that contains your workflow file (.github/workflows/etl_pipeline.yaml).
  • This is where your ETL (Extract-Transform-Load) automation is defined.
  • On your repository's main page click Settings from the top navigation bar.
 
  • In the left-hand sidebar of the Settings page, scroll down to find Secrets and Variables.
  • Click Actions to open the section where you can manage secrets used in GitHub Actions workflows.
 
  • In the Actions section, click the New repository secret button.
  • This opens a window to add a new secret key-value pair.
 
  • Open your local .env file (this file contains environment variables like API keys, tokens, etc).
  • For each variable, copy :
    • Name : the key (DATABASE_URL)
    • Value : the corresponding value (postgres://user:pass@host/db)
  • Paste these into the GitHub form fields :
    • Name ➜ enter the variable name
    • Secret ➜ enter the variable value
 
  • Your secrets are now securely stored in the repository.
  • You can use them inside your workflow .github/workflows/etl_pipeline.yaml like this :
DB_USER: ${{ secrets.DB_USER }}
DB_PASS: ${{ secrets.DB_PASS }}

GitHub Actions

  • The GitHub Actions workflow automates the entire ETL pipeline without any manual effort.
  • It runs every day at 10:00 AM IST and ensures that the latest data is always up to date.
Click Here to view GitHub Actions YAML File
# ---------------- Name of the Workflow ----------------
name: ETL Pipeline Automation

# ---------------- When should it run? ----------------
on:
  schedule:
    - cron: "30 4 * * *"   # Run the workflow daily at 10:00 AM
  workflow_dispatch:       # Allow manual run from GitHub UI

# ---------------- Authority to update Repository ----------------
permissions:
  contents: write

# ---------------- Set of steps to run ----------------
jobs:
  data-pipeline:
    runs-on: ubuntu-latest   # Use a Linux VM for the Job

    env: # Shared env variables available to all scripts
      DB_USER: ${{ secrets.DB_USER }}
      DB_PASS: ${{ secrets.DB_PASS }}
      DB_HOST: ${{ secrets.DB_HOST }}
      DB_NAME: ${{ secrets.DB_NAME }}

    steps:
      # ---------------- Step 1 : Checkout Code ----------------
      # This pulls your repository into the GitHub Runner VM
      - name: Checkout repository
        uses: actions/checkout@v3   
        with:
          token: ${{ secrets.GITHUB_TOKEN }}

      # ---------------- Step 2 : Set up Python ----------------
      # Installs Python 3.13.0 so GitHub can run your Scripts
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.13.0"

      # ---------------- Step 3 : Install Dependencies ----------------
      # Installs all Python libraries listed in requirements.txt
      - name: Install dependencies
        run: |
          pip install -r requirements.txt

      # ---------------- Step 4 : Run ETL Script ----------------
      # Runs your etl.py script to setup database
      - name: Run ETL Script
        run: |
          python scripts/etl.py

      # ---------------- Step 5 : Generate New Data and Append ----------------
      # Creates new data of customers and orders to ingest in the Neon Database
      - name: Run Generate Data Script
        run: |
          python scripts/generate_data.py

      # ---------------- Step 6 : Run Views Script ----------------
      # Creates or refreshes SQL views in Neon Database
      - name: Run Views Script
        run: |
          python scripts/create_views.py

      # ---------------- Step 7 : Export Views to CSV ----------------
      # Exports SQL views as CSVs inside the `views` folder
      - name: Run Export Views Script
        run: |
          python scripts/export_views.py

      # ---------------- Step 8 : Save CSVs as Artifacts ----------------
      # Stores CSVs in the workflow run, downloadable from GitHub
      - name: Upload Exported CSV as Artifacts
        uses: actions/upload-artifact@v4
        with:
          name: exported-views
          path: views/*.csv

      # ---------------- Step 9 : Commit CSVs to Repository ----------------
      # Updates repository with the latest CSVs so they are visible directly
      - name: Commit and Push CSVs
        run: |
          git config --global user.name "github-actions[bot]"
          git config --global user.email "github-actions[bot]@users.noreply.github.com"
          git add -A views/
          git commit -m "chore: update csv export from views on $(date -u +'%Y-%m-%d')" -m "[skip ci]" || echo "No changes to commit"
          git push

      # ---------------- Step 10 : Save Logs as Artifacts ----------------
      # Stores Logs in the workflow run, downloadable from GitHub
      - name: Upload Logs as Artifacts
        uses: actions/upload-artifact@v4
        with:
          name: etl-logs
          path: logs/*.log
Click Here to view more Details

Workflow Overview

1. Name of the Workflow

name: ETL Pipeline Automation
  • This gives your workflow a name.
  • You'll see ETL Pipeline Automation in the GitHub Actions tab whenever it runs.

2. When should it run?

on:
  schedule:
    - cron: "30 4 * * *"
  workflow_dispatch:
  • schedule : Runs every day at 10:00 AM IST automatically (the cron expression is in UTC, so 30 4 * * * = 04:30 UTC = 10:00 AM IST).
  • workflow_dispatch : Allows you to manually trigger the workflow from the GitHub Actions tab.

3. Permissions

permissions:
  contents: write
  • This gives your workflow permission to update files in your repository.

4. Jobs

jobs:
  data-pipeline:
    runs-on: ubuntu-latest

    env: # Shared env variables available to all scripts
        DB_USER: ${{ secrets.DB_USER }}
        DB_PASS: ${{ secrets.DB_PASS }}
        DB_HOST: ${{ secrets.DB_HOST }}
        DB_NAME: ${{ secrets.DB_NAME }}
  • jobs : Defines what tasks (jobs) the workflow will perform.
  • data-pipeline : The name of your main job.
  • runs-on : Tells GitHub to use a Linux virtual machine (Ubuntu) to run your scripts.

5. Main Workflow Tasks

  • Each step defines a task GitHub Actions performs in sequence :

Step 1 : Checkout Repository

- name: Checkout repository
  uses: actions/checkout@v3
  • This pulls your GitHub Repository files into the VM so the workflow can access your scripts, data and folders.

Step 2 : Set Up Python

- name: Set up Python
  uses: actions/setup-python@v4
  with:
    python-version: "3.13.0"
  • Installs Python 3.13.0 on the VM, the version your ETL scripts use.

Step 3 : Install Dependencies

- name: Install dependencies
  run: |
    pip install -r requirements.txt
  • Installs all the Python libraries listed in requirements.txt file.

Step 4 : Run ETL Script

- name: Run ETL Script
  run: |
    python scripts/etl.py
  • Runs your etl.py script which :
    • Cleans and transforms raw data.
    • Loads it into the Neon PostgreSQL Database.
    • Uses GitHub Secrets to securely access database credentials (so they're never exposed in code).

Step 5 : Generate New Data

- name: Run Generate Data Script
  run: |
    python scripts/generate_data.py
  • Runs generate_data.py, which creates and appends new random data to simulate a daily data refresh.

Step 6 : Create SQL Views

- name: Run Views Script
  run: |
    python scripts/create_views.py
  • Executes create_views.py which builds or refreshes SQL Views in the database.
  • These views are directly used by Power BI dashboards.

Step 7 : Export Views to CSV

- name: Run Export Views Script
  run: |
    python scripts/export_views.py
  • Exports your latest SQL Views as CSV files into the views/ folder.
  • These can be used for testing, sharing or backups.

Step 8 : Upload CSVs as Artifacts

- name: Upload Exported CSV as Artifacts
  uses: actions/upload-artifact@v4
  with:
    name: exported-views
    path: views/*.csv
  • Stores your exported CSV files as downloadable artifacts in GitHub Actions.
  • This means you can view and download them directly from the workflow run.

Step 9 : Commit CSVs to Repository

- name: Commit and Push CSVs
  run: |
    git config --global user.name "github-actions[bot]"
    git config --global user.email "github-actions[bot]@users.noreply.github.com"
    git add -A views/
    git commit -m "chore: update csv export from views on $(date -u +'%Y-%m-%d')" -m "[skip ci]" || echo "No changes to commit"
    git push
  • Commits updated CSVs to the repository, so anyone visiting the GitHub project can see the latest analysis.

Step 10 : Upload Logs

- name: Upload Logs as Artifacts
  uses: actions/upload-artifact@v4
  with:
    name: etl-logs
    path: logs/*.log
  • Uploads all log files as artifacts, useful for debugging or checking workflow results.

Power BI Dashboard

  • The Power BI dashboard is designed to turn data into insights with a clean and interactive interface.
  • It connects directly to the database, ensuring the dashboard always reflects the most recent data.
  • The dashboard consists of four main pages : Home, Overview, Customers, and Products.
  • All pages are connected through page navigation and drill-through features.

1. Home Page

  • This page serves as an entry point and navigation hub for the dashboard.
Click Here to view more Details

Key Components

  • Navigation Buttons
    • Interactive page navigator that links to the Overview, Customers, and Products pages.
  • Dashboard Branding
    • Displays a themed background image for a modern look.
  • User Experience
    • Designed for clarity and smooth navigation so users can access key insights in one click.

2. Overview

  • This page focuses on a high-level summary of overall business performance like orders, customers, and products.
Click Here to view more Details

Key Components

  • Multi-Row Cards : Key Business Metrics
    • Data View : overall_sales_performance
    • Purpose : Gives an instant snapshot of business performance at a glance.
  • Filled Map : Sales by State
    • Data View : state_wise_sales_and_customer_base
    • Purpose : Visualizes geographic sales distribution, showing which states drive the most revenue.
  • Donut Chart : Shipping Performance
    • Data View : shipping_performance
    • Purpose : Compares sales and profit by shipping mode, helping identify cost-effective delivery methods.
  • Stacked Bar Chart : Segment-wise Sales & Profit
    • Data View : segment_wise_sales_and_profit
    • Purpose : Shows sales and profit across customer segments.
  • Bar Chart : Top Customers by Sales
    • Data View : top_customers_by_sales
    • Purpose : Highlights top-performing customers, helping identify key contributors to revenue.
  • Filled Area Chart : Monthly Sales & Profit Trend
    • Data View : month_wise_sales_and_profit
    • Purpose : Displays month-wise trends of sales and profit to track seasonal performance and growth.

3. Customers

  • This page focuses on understanding customer behavior, performance, and geographic distribution.
Click Here to view more Details

Key Components

  • Multi-Row Card : Customer Performance Summary
    • Data View : overall_customers_performance
    • Purpose : Provides an overview of how much each customer contributes on average to sales and profit.
  • Filled Map : State-wise Customer Base
    • Data View : state_wise_sales_and_customer_base
    • Purpose : Visualizes customer distribution across states, helping identify regions with the largest customer base.
  • Stacked Column Chart : Region-wise Monthly Sales
    • Data View : region_wise_monthly_sales
    • Purpose : Tracks how sales vary across regions and months, helping spot seasonal and regional trends.
  • Stacked Column Chart : Region-wise Sales & Profit
    • Data View : region_wise_sales_and_profit
    • Purpose : Compares overall sales and profit across regions to identify high- and low-performing areas.
  • Dual Area Charts : Segment-wise Monthly Sales & Profit
    • Data View : segment_wise_monthly_sales_and_profit
    • Purpose : Visualizes monthly trends of sales and profit across customer segments.

4. Products

  • This page focuses on analyzing product performance, category trends, and geographic purchasing behavior.
Click Here to view more Details

Key Components

  • Filled Map : State-wise Most Purchased Sub-category
    • Data View : state_wise_most_purchased_sub_category
    • Purpose : Identifies regional preferences and popular product types across states.
  • Treemap : Category-wise Orders
    • Data View : category_wise_sales_profit_and_orders
    • Purpose : Visualizes which categories dominate in orders and their contribution to the business.
  • Stacked Column Chart : Category-wise Sales and Profit
    • Data View : category_wise_sales_profit_and_orders
    • Purpose : Shows which categories dominate in sales and profit and their contribution to the business.
  • Stacked Bar Chart : Sub-category-wise Sales & Profit
    • Data View : sub_category_wise_sales_and_profit
    • Purpose : Compares business performance across all product sub-categories.
  • Dual Ribbon Charts : Category-wise Monthly Sales & Profit Trends
    • Data View : category_wise_monthly_sales_and_profit
    • Purpose : Tracks how each product category performs across months to identify growth patterns.
  • Gauge Chart : Profit Margin Tracker
    • Data Source :
      • Calculated using total sales and profit.
    • Configuration :
      • Minimum : 0%
      • Maximum : 40%
      • Target : 30%
    • Purpose :
      • Tracks the current profit margin against the target (30%).
    • Color Logic :
      • Red (0–10%) ➜ Low profit margin, needs improvement.
      • Yellow (10–20%) ➜ Moderate margin, progressing toward target.
      • Green (20–40%) ➜ Healthy profit margin, close to or exceeding target.

Folder Structure

Dashly/
│
├── .github/
│   └── workflows/
│       └── etl_pipeline.yaml        # GitHub Actions Workflow File
│
├── scripts/
│   ├── etl.py                       # Script to extract, transform, and load Initial Data
│   ├── generate_data.py             # Script to generate and append New Random Data
│   ├── create_views.py              # Script to create or refresh SQL Views in PostgreSQL Database
│   └── export_views.py              # Script to export SQL Views results as CSV File
│
├── data/
│   └── sales_data.csv               # Original Raw Dataset
│
├── database/
│   ├── schema.sql                   # Database Schema
│   └── views.sql                    # SQL Views
│
├── views/
│   ├── shipping_performance.csv     # Exported SQL Views as CSV Files
│   └── ...
│
├── images/
│   ├── banner.png                   # Banner for Home Page
│   ├── shopping-cart.png            # Multi-Row Card Image
│   └── ...
│
├── utils/                           # Reusable Python Functions (utils Package)
│   ├── __init__.py
│   ├── data_preprocessor.py
│   ├── generate_random_data.py
│   └── ...
│
├── Dashboard.pbix                   # Power BI Dashboard File
├── .gitignore                       # All files and folders ignored by Git
├── requirements.txt                 # List of required libraries for the Project
├── LICENSE                          # License specifying permissions and usage rights
└── README.md                        # Detailed documentation of the Project

License

This project is licensed under the MIT License. You are free to use and modify the code as needed.