This project is a SQL-based analytics case study analysing user engagement, retention, and course completion for an online learning platform.
The dataset represents learner activity across multiple online courses, including enrolment dates, engagement metrics, and certification outcomes. The project is framed as simulated internal analytics work for an online education platform similar to Coursera or edX.
The objective of the analysis is to understand how learners engage with courses over time, how long they remain active, and which patterns of behaviour are associated with successful course completion. The project is intended to support learning experience teams in evaluating course performance and learner behaviour to improve course content and structure.
How can we increase the number of learners who complete our course?
- A high proportion of learners activate initially (~85%), but relatively few go on to complete courses (~3% certification rate).
- Engagement behaviour is highly skewed: median active days is 2, and retention drops from ~97% in Week 0 to ~54% by Week 2.
- Certification is strongly associated with sustained engagement: near 0% for 8-14 active days vs ~24-74% for 15+ active days.
- Retention patterns vary meaningfully across courses, indicating course-specific design or learner-intent effects.
The largest opportunity is early sustained engagement: activation is ~85%, median active days is 2, and certification is ~3%. Given that students active 15+ days have a ~24%-74% completion rate, and ~43% users churn within the first 2 weeks, the initial 2-3 weeks drive the biggest gains.
- Re-sequence Week 1 so the most rewarding or practical content appears in the first 20% of the course to address early churn.
- Insert a first 48 hours checklist (3-5 low-friction tasks) to convert high activation (~85%) into early sustained engagement rather than one-and-done activity.
- At the end of Days 1 and 2, highlight the value and time-to-complete of the next lesson to drive a second and third session.
- Launch a 14-day momentum streak and communicate that students active 15+ days show ~24%-74% completion vs near 0% below that threshold.
- Create a momentum streak leaderboard with a small reward (for example, a two-week milestone badge) for learners who keep a full 14-day streak.
- Add a first week success guide that sets expectations for required time and milestones, reducing early disengagement.
- Implement a return within 48 hours reminder during Week 1 to sustain the momentum streak in the most fragile period.
To answer the core business question, the analysis investigates three specific areas:
-
How many learners meaningfully engage with a course after enrolling?
Impact: Identifies the activation bottleneck and whether the issue is enrolment drop-off or design-related disengagement. -
How does engagement depth relate to course completion and certification?
Impact: Reveals whether engagement is predictive of outcomes, helping prioritise interventions on high-intent learners. -
How long do learners typically remain active after starting a course?
Impact: Pinpoints critical drop-off windows where learners churn, informing intervention timing.
- Source: Public online course user engagement dataset Link to Kaggle Dataset
- Scope: User-level engagement across multiple courses and course runs
- Format: Single CSV file ingested into PostgreSQL
The dataset required cleaning and validation before analysis, including type standardisation, date parsing, and normalisation of engagement flags.
This section explains how to set up and run this analytics warehouse on your own system.
- PostgreSQL installed locally (download here)
psql(PostgreSQL command-line client, included with PostgreSQL)- This repository cloned to your machine
- The
data/raw/Courses.csvfile present in the project
You can execute SQL using psql directly, or use any SQL IDE (VS Code, DBeaver, pgAdmin, etc.) that connects to PostgreSQL. Instructions below use psql for reproducibility across all platforms.
1. Install PostgreSQL
- Download and install from postgresql.org
- During installation, you'll create a password for the
postgressuperuser. Remember this - PostgreSQL will run as a local database server on your machine
2. Connect to PostgreSQL via psql
- Open a terminal/command prompt and run:
psql -U postgres - OR: launch SQL Shell (psql)
- Run:
CREATE DATABASE course_warehouse;
- Connect to the course_warehouse database
- Enter the superuser password when prompted
- You should see the PostgreSQL prompt:
course_warehouse=#
3. Create schemas (must be done first)
- In the psql prompt, run:
\i 'C:/path/to/project/sql/00_admin/00_create_schemas.sql' - Replace
C:/path/to/project/with your actual project path - This creates the
raw,stg,analytics, andmartsschemas
4. Create raw tables (must be done before data load)
- Still in psql, run:
\i 'C:/path/to/project/sql/01_raw/01_create_raw_tables.sql' - This creates the empty
raw.course_user_engagementtable
5. Load the CSV data (only step using psql's \copy command)
- Still in psql, run:
\copy raw.course_user_engagement FROM 'C:/path/to/project/data/raw/Courses.csv' DELIMITER ',' CSV HEADER; - Replace the file path with your actual project path
- Note:
\copyis a psql metacommand and must be run in psql (not a standard SQL statement)
6. Build the remaining warehouse layers
- Execute all remaining SQL files in this order:
sql/02_staging/stg_course_user_engagement.sqlsql/03_analytics/(all dimension and fact files)sql/04_marts/(all mart files)sql/99_checks/(all data quality check files)
- You can use psql or any SQL IDE. In psql, execute each file with:
\i 'path/to/file.sql'
After all SQL files have been executed, verify the warehouse was built correctly:
\i 'path/to/project/sql/99_checks/00_smoke_test.sql'
If all checks pass with no errors, your warehouse is successfully built and ready for querying.
- PostgreSQL (database engine)
- psql (command-line client)
- SQL (CTEs, window functions, cohort logic)
- Optional: Any SQL IDE (VS Code, DBeaver, pgAdmin, etc.)
- GitHub for version control
Built with 6 automated data quality checks to ensure analytical reliability:
- Row count validation across all layers
- Key uniqueness verification
- Null and domain constraint checks
- Date logic validation
- Metric sanity checks
- Retention-specific validation
The data was modelled using a layered analytics warehouse structure to reflect real-world analytics workflows.
- Preserved the raw dataset exactly as received to ensure reproducibility
- Cleaned and standardised data in a staging layer:
- Parsed dates from inconsistent formats
- Converted engagement metrics stored as text into numeric fields
- Normalised boolean flags
- Built analytics models on top of cleaned data to define:
- User–course lifecycles
- Engagement and activation logic
- Churn and completion outcomes
- Added data quality checks to validate row counts, key uniqueness, date logic, and metric consistency
This approach ensures that analytical outputs are reliable and auditable.
Each row in the analytics layer represents a single learner’s lifecycle within a single course.
Courses are treated as subscription-like experiences with:
- a start date
- an observed end of activity
- an outcome (certified or not)
Retention is measured using a survival-style approach, based on how long learners remain active after starting a course. Because the dataset only includes start and last activity dates, retention reflects engagement longevity rather than week-by-week activity.
Below are selected outputs generated directly from SQL models.
Key metrics:
- Activation rate: ~85%
- Certification rate: ~3%
- Churn rate (non-certification by course end): ~97% (1 - Certification Rate)
- Median active days among activated learners: 2
Key metrics:
- Retention rate drops from ~97% in Week 0 to ~54% by Week 2, a drop of ~43%
This table shows survival-style retention for a single course cohort, illustrating sharp early drop-off followed by a smaller group of long-running learners.
Key metrics:
- Certification rate is near 0% for 8-14 active days.
- Certification rate rises to ~24%-74% for 15+ active days.
Learners with higher sustained engagement show markedly higher certification rates, suggesting a strong relationship between early and ongoing activity and successful outcomes.
This project demonstrates the ability to:
- Design and implement a SQL-based analytics warehouse
- Model user lifecycles, retention, and churn using imperfect real-world data
- Apply cohort and survival analysis concepts appropriately
- Build reproducible, well-structured analytical workflows
- Communicate analytical results clearly and honestly
data/– Raw source datasetsql/00_admin/– Schema and setup01_raw/– Raw data ingestion02_staging/– Cleaning and standardisation03_intermediate/– Facts and dimensions04_marts/– Final analytical outputs99_checks/– Data quality and sanity checks
screenshots/– Visual evidence of key outputsREADME.md– Project overview and findings


