
Layoffs Data Cleaning (SQL + Pandas Project)

This project demonstrates a complete data-cleaning workflow on a real-world layoffs dataset using both MySQL and Pandas (Python). The dataset contains information on company layoffs across industries, including company names, locations, industries, funding, and layoff counts.

The goal is to transform the raw dataset (with duplicates, inconsistent formatting, and missing values) into a clean, standardized version ready for analysis or visualization.

Tools Used:

MySQL → Data cleaning & transformation (SQL queries)

MySQL Workbench → Query execution & testing

Pandas (Python) → Alternative data-cleaning pipeline

GitHub → Project versioning & sharing

Data Cleaning Steps:

1. Remove Duplicates

SQL: Used ROW_NUMBER() OVER (PARTITION BY …) to flag duplicate rows, then deleted them.

Pandas: Used df.drop_duplicates() to remove exact duplicate rows.
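SQL Example (a minimal sketch, not necessarily the repository's exact query: it assumes a second staging table layoffs_staging2 with the same columns plus an INT row_num, MySQL 8+ for window functions, and the column list shown, so adjust the PARTITION BY to your schema):

-- Copy rows into the working table, numbering repeats within each duplicate group
-- (layoffs_staging2 and the PARTITION BY column list are assumptions, not confirmed by the repo)
INSERT INTO layoffs_staging2
SELECT *,
       ROW_NUMBER() OVER (
           PARTITION BY company, location, industry, total_laid_off,
                        percentage_laid_off, `date`, stage, country, funds_raised_millions
       ) AS row_num
FROM layoffs_staging;

-- Keep the first row of each group and delete the rest
DELETE FROM layoffs_staging2 WHERE row_num > 1;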

2. Standardize Data

Trimmed extra spaces from company names.

Standardized industry names (e.g., any value matching "Crypto%" → "Crypto").

Corrected country names (e.g., values matching "United States%" → "United States").

Converted date column from text → proper DATE / datetime format.

SQL Example:

UPDATE layoffs_staging SET company = TRIM(company);

UPDATE layoffs_staging SET industry = 'Crypto' WHERE industry LIKE 'Crypto%';
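The country and date fixes are not shown above; here is a minimal MySQL sketch (it assumes the offending country values carry a trailing period, e.g. "United States.", and that the raw dates are MM/DD/YYYY text, matching the Pandas format string below):

-- Strip an assumed trailing period from country values
UPDATE layoffs_staging SET country = TRIM(TRAILING '.' FROM country) WHERE country LIKE 'United States%';

-- Parse the text dates; the MM/DD/YYYY input format is an assumption
UPDATE layoffs_staging SET `date` = STR_TO_DATE(`date`, '%m/%d/%Y');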

Pandas Example:

df["company"] = df["company"].str.strip() df["industry"] = df["industry"].replace(r"^Crypto.", "Crypto", regex=True) df["country"] = df["country"].replace(r"United States.", "United States", regex=True) df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y", errors="coerce")

3. Handle Null or Missing Values

Replaced missing industry values by borrowing from other rows of the same company: a self-join in SQL, or a per-company groupby().transform() with ffill()/bfill() in Pandas.

Deleted rows where both total_laid_off and percentage_laid_off were null.

SQL Example:

UPDATE layoffs_staging t1
JOIN layoffs_staging t2
  ON t1.company = t2.company
SET t1.industry = t2.industry
WHERE (t1.industry IS NULL OR t1.industry = '')
  AND t2.industry IS NOT NULL;
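And the deletion described above, mirroring the dropna call below (column names as given in the dataset):

DELETE FROM layoffs_staging WHERE total_laid_off IS NULL AND percentage_laid_off IS NULL;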

Pandas Example:

df["industry"] = df.groupby("company")["industry"].transform(lambda x: x.ffill().bfill()) df = df.dropna(subset=["total_laid_off", "percentage_laid_off"], how="all")

4. Final Cleanup

Dropped helper columns (e.g., row_num in SQL).

Ensured consistent column types (DATE, INT, VARCHAR in SQL; matching dtypes in Pandas).

Exported cleaned dataset to CSV.

Pandas Example:

df.to_csv("cleaned_layoffs_by_pandas.csv", index=False)
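SQL Example (a sketch of the corresponding MySQL cleanup; layoffs_staging2 is the assumed working table from step 1):

-- Drop the duplicate-detection helper column mentioned above
ALTER TABLE layoffs_staging2 DROP COLUMN row_num;

-- Enforce a proper DATE type once the text dates are parsed (see step 2)
ALTER TABLE layoffs_staging2 MODIFY COLUMN `date` DATE;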

Final Output

Raw data: Contained duplicates, inconsistent text, and null values.

Cleaned data: Standardized company/industry names, formatted dates, removed duplicates, minimized nulls.

Final datasets:

Cleaned_layoffs.csv (from SQL pipeline)

cleaned_layoffs_by_pandas.csv (from Pandas pipeline)

Next Steps

Perform Exploratory Data Analysis (EDA) in Python or SQL.

Build dashboards in Tableau / Power BI / Looker Studio.

Analyze layoff trends by year, industry, country, and funding stage.

Key Learnings

How to structure a SQL-based data-cleaning pipeline.

How to replicate the same cleaning process in Pandas.

Importance of standardization in categorical data.

Handling duplicates and null values in real-world datasets.

Author: Muhammad Akbar Khan

Master’s student in Applied Mathematics | Data Enthusiast
