SQL Analytics Warehouse for Online Course Engagement

Project Overview

This project is a SQL-based analytics case study analysing user engagement, retention, and course completion for an online learning platform.

The dataset represents learner activity across multiple online courses, including enrolment dates, engagement metrics, and certification outcomes. The project is framed as simulated internal analytics work for an online education platform similar to Coursera or edX.

The objective of the analysis is to understand how learners engage with courses over time, how long they remain active, and which patterns of behaviour are associated with successful course completion. The project is intended to support learning experience teams in evaluating course performance and learner behaviour to improve course content and structure.

Business Question

How can we increase the number of learners who complete our course?

Key Findings

A high proportion of learners activate initially (~85%), but relatively few go on to complete courses (~3% certification rate).
Engagement behaviour is highly skewed: median active days is 2, and retention drops from ~97% in Week 0 to ~54% by Week 2.
Certification is strongly associated with sustained engagement: near 0% for 8-14 active days vs ~24-74% for 15+ active days.
Retention patterns vary meaningfully across courses, indicating course-specific design or learner-intent effects.

Actionable recommendations (data-driven)

The largest opportunity is early sustained engagement: activation is ~85%, median active days is 2, and certification is ~3%. Given that students active 15+ days have a ~24%-74% completion rate, and ~43% users churn within the first 2 weeks, the initial 2-3 weeks drive the biggest gains.

Re-sequence Week 1 so the most rewarding or practical content appears in the first 20% of the course to address early churn.
Insert a first 48 hours checklist (3-5 low-friction tasks) to convert high activation (~85%) into early sustained engagement rather than one-and-done activity.
At the end of Days 1 and 2, highlight the value and time-to-complete of the next lesson to drive a second and third session.
Launch a 14-day momentum streak and communicate that students active 15+ days show ~24%-74% completion vs near 0% below that threshold.
Create a momentum streak leaderboard with a small reward (for example, a two-week milestone badge) for learners who keep a full 14-day streak.
Add a first week success guide that sets expectations for required time and milestones, reducing early disengagement.
Implement a return within 48 hours reminder during Week 1 to sustain the momentum streak in the most fragile period.

How the analysis breaks down this question

To answer the core business question, the analysis investigates three specific areas:

How many learners meaningfully engage with a course after enrolling?
Impact: Identifies the activation bottleneck and whether the issue is enrolment drop-off or design-related disengagement.
How does engagement depth relate to course completion and certification?
Impact: Reveals whether engagement is predictive of outcomes, helping prioritise interventions on high-intent learners.
How long do learners typically remain active after starting a course?
Impact: Pinpoints critical drop-off windows where learners churn, informing intervention timing.

Dataset

Source: Public online course user engagement dataset Link to Kaggle Dataset
Scope: User-level engagement across multiple courses and course runs
Format: Single CSV file ingested into PostgreSQL

The dataset required cleaning and validation before analysis, including type standardisation, date parsing, and normalisation of engagement flags.

Getting Started: How to Run This Project

This section explains how to set up and run this analytics warehouse on your own system.

Prerequisites

PostgreSQL installed locally (download here)
psql (PostgreSQL command-line client, included with PostgreSQL)
This repository cloned to your machine
The data/raw/Courses.csv file present in the project

You can execute SQL using psql directly, or use any SQL IDE (VS Code, DBeaver, pgAdmin, etc.) that connects to PostgreSQL. Instructions below use psql for reproducibility across all platforms.

Step-by-Step Setup

1. Install PostgreSQL

Download and install from postgresql.org
During installation, you'll create a password for the postgres superuser. Remember this
PostgreSQL will run as a local database server on your machine

2. Connect to PostgreSQL via psql

Open a terminal/command prompt and run:
```
psql -U postgres
```
OR: launch SQL Shell (psql)
Run:

CREATE DATABASE course_warehouse;

Connect to the course_warehouse database
Enter the superuser password when prompted
You should see the PostgreSQL prompt: course_warehouse=#

3. Create schemas (must be done first)

In the psql prompt, run:

\i 'C:/path/to/project/sql/00_admin/00_create_schemas.sql'

Replace C:/path/to/project/ with your actual project path
This creates the raw, stg, analytics, and marts schemas

4. Create raw tables (must be done before data load)

Still in psql, run:

\i 'C:/path/to/project/sql/01_raw/01_create_raw_tables.sql'

This creates the empty raw.course_user_engagement table

5. Load the CSV data (only step using psql's \copy command)

Still in psql, run:

\copy raw.course_user_engagement
FROM 'C:/path/to/project/data/raw/Courses.csv'
DELIMITER ','
CSV HEADER;

Replace the file path with your actual project path
Note: \copy is a psql metacommand and must be run in psql (not a standard SQL statement)

6. Build the remaining warehouse layers

Execute all remaining SQL files in this order:
- sql/02_staging/stg_course_user_engagement.sql
- sql/03_analytics/ (all dimension and fact files)
- sql/04_marts/ (all mart files)
- sql/99_checks/ (all data quality check files)
You can use psql or any SQL IDE. In psql, execute each file with: \i 'path/to/file.sql'

Verifying Success

After all SQL files have been executed, verify the warehouse was built correctly:

\i 'path/to/project/sql/99_checks/00_smoke_test.sql'

If all checks pass with no errors, your warehouse is successfully built and ready for querying.

Tools Used

PostgreSQL (database engine)
psql (command-line client)
SQL (CTEs, window functions, cohort logic)
Optional: Any SQL IDE (VS Code, DBeaver, pgAdmin, etc.)
GitHub for version control

Approach & Rigor

Built with 6 automated data quality checks to ensure analytical reliability:

Row count validation across all layers
Key uniqueness verification
Null and domain constraint checks
Date logic validation
Metric sanity checks
Retention-specific validation

Data Cleaning and Preparation

The data was modelled using a layered analytics warehouse structure to reflect real-world analytics workflows.

Key steps

Preserved the raw dataset exactly as received to ensure reproducibility
Cleaned and standardised data in a staging layer:
- Parsed dates from inconsistent formats
- Converted engagement metrics stored as text into numeric fields
- Normalised boolean flags
Built analytics models on top of cleaned data to define:
- User–course lifecycles
- Engagement and activation logic
- Churn and completion outcomes
Added data quality checks to validate row counts, key uniqueness, date logic, and metric consistency

This approach ensures that analytical outputs are reliable and auditable.

Analytical Approach

Each row in the analytics layer represents a single learner’s lifecycle within a single course.

Courses are treated as subscription-like experiences with:

a start date
an observed end of activity
an outcome (certified or not)

Retention is measured using a survival-style approach, based on how long learners remain active after starting a course. Because the dataset only includes start and last activity dates, retention reflects engagement longevity rather than week-by-week activity.

Key Outputs and Visual Evidence

Below are selected outputs generated directly from SQL models.

Overall platform KPIs

Key metrics:

Activation rate: ~85%
Certification rate: ~3%
Churn rate (non-certification by course end): ~97% (1 - Certification Rate)
Median active days among activated learners: 2

Weekly retention (example course)

Key metrics:

Retention rate drops from ~97% in Week 0 to ~54% by Week 2, a drop of ~43%

This table shows survival-style retention for a single course cohort, illustrating sharp early drop-off followed by a smaller group of long-running learners.

Engagement depth vs outcomes

Key metrics:

Certification rate is near 0% for 8-14 active days.
Certification rate rises to ~24%-74% for 15+ active days.

Learners with higher sustained engagement show markedly higher certification rates, suggesting a strong relationship between early and ongoing activity and successful outcomes.

Project Value

This project demonstrates the ability to:

Design and implement a SQL-based analytics warehouse
Model user lifecycles, retention, and churn using imperfect real-world data
Apply cohort and survival analysis concepts appropriately
Build reproducible, well-structured analytical workflows
Communicate analytical results clearly and honestly

Repository Contents

data/ – Raw source dataset
sql/
- 00_admin/ – Schema and setup
- 01_raw/ – Raw data ingestion
- 02_staging/ – Cleaning and standardisation
- 03_intermediate/ – Facts and dimensions
- 04_marts/ – Final analytical outputs
- 99_checks/ – Data quality and sanity checks
screenshots/ – Visual evidence of key outputs
README.md – Project overview and findings

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
.vscode		.vscode
data/raw		data/raw
screenshots		screenshots
sql		sql
.gitattributes		.gitattributes
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SQL Analytics Warehouse for Online Course Engagement

Project Overview

Business Question

Key Findings

Actionable recommendations (data-driven)

How the analysis breaks down this question

Dataset

Getting Started: How to Run This Project

Prerequisites

Step-by-Step Setup

Verifying Success

Tools Used

Approach & Rigor

Data Cleaning and Preparation

Key steps

Analytical Approach

Key Outputs and Visual Evidence

Overall platform KPIs

Weekly retention (example course)

Engagement depth vs outcomes

Project Value

Repository Contents

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

SQL Analytics Warehouse for Online Course Engagement

Project Overview

Business Question

Key Findings

Actionable recommendations (data-driven)

How the analysis breaks down this question

Dataset

Getting Started: How to Run This Project

Prerequisites

Step-by-Step Setup

Verifying Success

Tools Used

Approach & Rigor

Data Cleaning and Preparation

Key steps

Analytical Approach

Key Outputs and Visual Evidence

Overall platform KPIs

Weekly retention (example course)

Engagement depth vs outcomes

Project Value

Repository Contents

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages